What are stop words? They are words that occur very frequently in documents but add little to understanding what a document means, such as “is”, “the”, and “of”.
Why are stop words important? Removing stop words from a document makes its meaning, and the context behind it, easier to identify.
For example, when we remove stop words from the sentence “Narendra Modi is the prime minister of India” we get Narendra, Modi, prime minister, and India as the output, which tells us what the sentence is about.
Removing stop words is also an important step in building search engines.
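As a minimal sketch of this example, the snippet below filters the sentence against a tiny hand-written stop list. The `STOP_WORDS` set and the `remove_stop_words` helper are assumptions made for illustration, not a standard list or a library API. Because it splits on whitespace, “prime” and “minister” come out as two separate words.

```python
# Tiny illustrative stop list (an assumption; real lists are much longer).
STOP_WORDS = {"is", "the", "of", "a", "an", "and", "in", "to"}

def remove_stop_words(sentence):
    """Return the words of `sentence` that are not in STOP_WORDS."""
    # Compare in lowercase so "The" and "the" are treated alike.
    return [w for w in sentence.split() if w.lower() not in STOP_WORDS]

keywords = remove_stop_words("Narendra Modi is the prime minister of India")
print(keywords)  # ['Narendra', 'Modi', 'prime', 'minister', 'India']
```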
How are stop words computed?
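Stop words are most commonly identified using TF-IDF (term frequency, inverse document frequency). A word that appears in almost every document has a low inverse document frequency, which marks it as a likely stop word. To learn more about TF-IDF, read this article.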
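The idea can be sketched with a toy corpus. This is a simplified illustration, not a production method: it uses only the inverse document frequency part, computed as log(N / df), and both the corpus and the `IDF_THRESHOLD` cut-off are made up for the example.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration.
documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird sang in the tree",
    "the sun is hot",
]

tokenized = [doc.split() for doc in documents]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word occur?
doc_freq = Counter()
for tokens in tokenized:
    doc_freq.update(set(tokens))

# Inverse document frequency: log(N / df).
# A word that occurs in every document scores 0.
idf = {word: math.log(n_docs / df) for word, df in doc_freq.items()}

# Treat words whose IDF falls below a threshold as stop word candidates.
IDF_THRESHOLD = 0.1  # arbitrary cut-off chosen for this toy corpus
candidates = sorted(w for w, score in idf.items() if score < IDF_THRESHOLD)
print(candidates)  # ['the']
```

On a real corpus the threshold would need tuning, and the candidates are usually reviewed by hand before being used as a stop list.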
Advantages of removing stop words:
- Improved performance: fewer words need to be processed. In search engines, this also reduces the index size.
- Searches become faster.
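To make the index-size point concrete, here is a rough sketch that builds a small inverted index with and without stop words and compares the sizes. The documents, the `STOP_WORDS` set, and the helpers `build_index` and `posting_count` are all hypothetical names chosen for this example.

```python
from collections import defaultdict

STOP_WORDS = {"the", "is", "a", "of", "on", "in"}  # illustrative, not a standard list

documents = {
    1: "the cat is on the mat",
    2: "the capital of india is new delhi",
    3: "a cat in the garden",
}

def build_index(docs, stop_words=frozenset()):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            if word not in stop_words:
                index[word].add(doc_id)
    return index

def posting_count(index):
    """Total number of (word, document) entries stored in the index."""
    return sum(len(ids) for ids in index.values())

full_index = build_index(documents)
small_index = build_index(documents, STOP_WORDS)

print(len(full_index), posting_count(full_index))    # 13 17
print(len(small_index), posting_count(small_index))  # 7 8
```

Even on three short sentences, dropping six stop words roughly halves the number of index entries.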
Disadvantages of removing stop words:
- Sometimes the meaning of a sentence is lost. For example, if we remove “not” from the sentence “I am not angry”, the remaining words suggest the opposite of what was said.
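One possible way around this, sketched below, is to take negation words out of the stop list before filtering. The `STOP_WORDS` and `NEGATIONS` sets and the `filter_words` helper are assumptions for illustration only.

```python
# Illustrative stop list that, like many general-purpose lists, includes "not".
STOP_WORDS = {"i", "am", "is", "the", "not", "a"}

# Words that flip the meaning of a sentence; kept even though they are common.
NEGATIONS = {"not", "no", "never"}

def filter_words(sentence, stop_words):
    """Return the words of `sentence` that are not in `stop_words`."""
    return [w for w in sentence.split() if w.lower() not in stop_words]

sentence = "I am not angry"
naive = filter_words(sentence, STOP_WORDS)
careful = filter_words(sentence, STOP_WORDS - NEGATIONS)

print(naive)    # ['angry']
print(careful)  # ['not', 'angry']
```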
Can stop words be different for different domains?
- Yes, stop words can vary from domain to domain, and even from one document collection to another.
- Different languages also have separate stop words.
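A simple way to handle this, shown here as a sketch, is to combine a general stop list with a domain-specific one. The word lists, the domain names, and the helpers `stop_words_for` and `filter_words` are invented for this example.

```python
# General-purpose stop words shared by every domain (illustrative only).
GENERAL_STOP_WORDS = {"the", "is", "a", "of", "was", "to", "and", "with"}

# Extra words that carry little information inside a given domain (assumed).
DOMAIN_STOP_WORDS = {
    "medical": {"patient", "doctor", "hospital"},
    "legal": {"court", "plaintiff", "defendant"},
}

def stop_words_for(domain):
    """Union of the general list and the list for `domain`, if any."""
    return GENERAL_STOP_WORDS | DOMAIN_STOP_WORDS.get(domain, set())

def filter_words(text, domain):
    """Lowercase `text`, split on whitespace, and drop the domain's stop words."""
    stop_words = stop_words_for(domain)
    return [w for w in text.lower().split() if w not in stop_words]

text = "The patient was admitted to the hospital with pneumonia"
print(filter_words(text, "medical"))  # ['admitted', 'pneumonia']
print(filter_words(text, "legal"))    # ['patient', 'admitted', 'hospital', 'pneumonia']
```

For other languages, NLTK ships separate lists, for example `stopwords.words('german')`.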
How to implement stop word removal:
- You can implement it using NLTK in Python. The following code reads a text file, removes the stop words, and writes the remaining words to a second file:

    from nltk.corpus import stopwords  # needs a one-time nltk.download('stopwords')

    # Build a set for fast membership tests.
    stop_words = set(stopwords.words('english'))

    # Read the whole input file as a single string.
    with open("text.txt") as infile:
        text = infile.read()

    # Split on whitespace. Compare in lowercase because the NLTK list is
    # lowercase; otherwise "The" would not be removed.
    filtered = [w for w in text.split() if w.lower() not in stop_words]

    # Append the remaining words to the output file, opening it only once.
    if filtered:
        with open("filteredtext.txt", "a") as outfile:
            outfile.write(" " + " ".join(filtered))
- Alternatively, you can use a ready-made search engine that implements best practices for handling stop words.