What are N-grams?
N-grams are contiguous sequences of n items from a given sample of text or speech. In natural language processing, the items are usually words or characters, so an n-gram is a sequence of n consecutive words or characters. N-grams capture local structure in a text, such as which words or characters tend to follow one another, and are used in NLP tasks such as language modeling, text classification, and information retrieval.
Examples of N-grams:
- Unigrams (n = 1): Single words or characters, e.g., “the”, “cat”, “sat”.
- Bigrams (n = 2): Sequences of two words or characters, e.g., “the cat”, “cat sat”, “sat on”.
- Trigrams (n = 3): Sequences of three words or characters, e.g., “the cat sat”, “cat sat on”, “sat on the”.
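The examples above can be reproduced with a sliding window over a token list. The following is a minimal sketch, not a reference implementation: the helper name `ngrams`, the example sentence "the cat sat on the mat" (the final word is assumed), and the whitespace tokenizer are all illustrative choices.

```python
def ngrams(items, n):
    """Return every contiguous run of n items from a sequence, in order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence of length L has L - n + 1 windows of size n (none if L < n).
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


# Assumed example sentence, split on whitespace (a deliberately naive tokenizer).
tokens = "the cat sat on the mat".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

# The same helper works on characters, because a string is a sequence too.
char_bigrams = ngrams("cat", 2)

print(bigrams[:3])    # [('the', 'cat'), ('cat', 'sat'), ('sat', 'on')]
print(trigrams[:3])   # [('the', 'cat', 'sat'), ('cat', 'sat', 'on'), ('sat', 'on', 'the')]
print(char_bigrams)   # [('c', 'a'), ('a', 't')]
```

Returning tuples keeps each n-gram hashable, so the results can be counted directly, for instance with `collections.Counter`. Real pipelines typically replace the whitespace split with a proper tokenizer that handles punctuation and casing.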
Resources to learn more about N-grams:
- N-grams and how to implement it in python, a tutorial on N-grams and their implementation in Python.
- What is N-grams by Kavita Ganesan, an article explaining the concept of N-grams.
- Understanding word N-grams and N-grams probability, an article discussing word N-grams and their probabilities in natural language processing.