N-grams are sequences of N adjacent words, used in search engines to predict the next word in an incomplete sentence. If n=1 it is a unigram, if n=2 it is a bigram, and so on…
An n-gram model groups N adjacent words in a sentence together, based on N.
If the input is “wireless speakers for tv”, the output will be the following:
N=1 (Unigram) – Output: “wireless”, “speakers”, “for”, “tv”
N=2 (Bigram) – Output: “wireless speakers”, “speakers for”, “for tv”
N=3 (Trigram) – Output: “wireless speakers for”, “speakers for tv”
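The splitting above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer — it simply slides a window of size N over the whitespace-split words:

```python
def ngrams(text, n):
    """Return the list of n-grams (as strings) from a sentence."""
    words = text.split()
    # Slide a window of size n over the word list.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "wireless speakers for tv"
print(ngrams(sentence, 1))  # ['wireless', 'speakers', 'for', 'tv']
print(ngrams(sentence, 2))  # ['wireless speakers', 'speakers for', 'for tv']
print(ngrams(sentence, 3))  # ['wireless speakers for', 'speakers for tv']
```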
Why is this important?
Imagine we have to create a search engine by feeding it all the Game of Thrones dialogue.
If the computer were given the task of finding the missing word after “valar ……”, the answer could be “valar morghulis” or “valar dohaeris”. You can see this in action in the Google search engine. How can we program a computer to figure it out?
By counting how often various terms occur in the source document, we can use probability to find the most likely term after “valar”.
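As a sketch of this idea, the snippet below counts bigrams in a tiny hypothetical corpus (a stand-in for the real dialogue transcript, which is not included here) and picks the most frequent word following “valar”:

```python
from collections import Counter

# Hypothetical mini-corpus standing in for the Game of Thrones dialogues.
corpus = "valar morghulis valar dohaeris valar morghulis".split()

# Count every word that immediately follows "valar".
next_counts = Counter(b for a, b in zip(corpus, corpus[1:]) if a == "valar")

# The prediction is the most frequent follower.
prediction = next_counts.most_common(1)[0][0]
print(prediction)  # morghulis
```

With real data the same counting approach scales up: the candidate with the highest count is the most probable completion.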
The image below illustrates this: the word frequencies show that “like a baby” is more probable than “like a bad”.
Let’s understand the mathematics behind this.
This table shows the bigram counts of a document. Individual counts are given here.
It simply means:
- “i want” occurred 827 times in the document.
- “want want” occurred 0 times.
Now let’s calculate the probability of the occurrence of “i want chinese food”.
We can use the formula P(wn | wn−1) = C(wn−1wn) / C(wn−1)
This means the probability of chinese given want = P(chinese | want) = count(want chinese) / count(want)
p(i want chinese food)
= p(want | i) * p(chinese | want) * p(food | chinese)
= [count(i want)/count(i)] * [count(want chinese)/count(want)] * [count(chinese food)/count(chinese)]
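The chain of bigram probabilities can be computed directly from count tables. In the sketch below, only the “i want” count of 827 comes from the text above; every other count is a placeholder value assumed purely for illustration:

```python
# Illustrative count tables. Only bigram[("i", "want")] = 827 is taken
# from the document; the remaining counts are made-up placeholders.
unigram = {"i": 2533, "want": 927, "chinese": 158}
bigram = {("i", "want"): 827, ("want", "chinese"): 6, ("chinese", "food"): 82}

def p(w, prev):
    # Maximum-likelihood estimate: P(w | prev) = C(prev w) / C(prev)
    return bigram.get((prev, w), 0) / unigram[prev]

# p(i want chinese food) ≈ p(want|i) * p(chinese|want) * p(food|chinese)
prob = p("want", "i") * p("chinese", "want") * p("food", "chinese")
print(prob)
```

Note that any bigram with a zero count (such as “want want”) makes the whole product zero, which is why real language models apply smoothing.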
You can create your own n-gram search engine using expertrec.