In this article, we will see how to create a semantic search engine using SOLR. SOLR is an open-source search platform that offers full-text search capabilities.
Consider three documents:
- Watermelons are red.
- I like watermelons for breakfast.
- The fruit is large, green outside and red inside with seeds.
Now, what happens if you search for the term “watermelon”? Under normal circumstances, you get back only the first and second documents. A semantic search engine should be able to return the third document as well, because it describes a watermelon without naming it.
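A literal keyword match over the three documents can be sketched in a few lines of Python. The `keyword_search` helper below is hypothetical and only illustrates plain term matching (with a prefix match standing in for stemming); it is not how SOLR works internally.

```python
import re

documents = [
    "Watermelons are red.",
    "I like watermelons for breakfast.",
    "The fruit is large, green outside and red inside with seeds.",
]

def keyword_search(query, docs):
    """Return the indices of docs that contain a word starting with the query term."""
    query = query.lower()
    hits = []
    for i, doc in enumerate(docs):
        words = re.findall(r"[a-z]+", doc.lower())
        # Prefix match is a crude stand-in for stemming ("watermelons" -> "watermelon").
        if any(word.startswith(query) for word in words):
            hits.append(i)
    return hits

print(keyword_search("watermelon", documents))  # prints [0, 1]
```

The third document never comes back, because none of its words match the query term.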
Building such an engine comes with two main challenges:
- You need a lot of data. The success of your semantic search depends on the size of your data set and on the availability of user behavior data.
- Developing a good semantic search engine takes a lot of computational resources and brainpower.
Semantic Search using Collaborative Filtering
Collaborative filtering is mostly used in recommendation engines. Content-based filtering methods use only the content of two documents to measure their similarity. Collaborative filtering, on the other hand, uses user behavior to find similar documents without relying on their contents.
Semantic search engines can use the same approach to find similar documents.
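As a rough illustration of behavior-based similarity, the sketch below compares two documents by the users who opened them and never looks at their text. The click log, the document names, and the `behavior_similarity` helper are all made up for this example.

```python
import math

# Hypothetical click log: which users opened which documents.
clicks = {
    "doc_a": {"u1", "u2", "u3"},
    "doc_b": {"u2", "u3", "u4"},
    "doc_c": {"u5"},
}

def behavior_similarity(doc_x, doc_y, log):
    """Cosine similarity between the binary user vectors of two documents."""
    users_x, users_y = log[doc_x], log[doc_y]
    if not users_x or not users_y:
        return 0.0
    shared = len(users_x & users_y)
    return shared / math.sqrt(len(users_x) * len(users_y))

print(behavior_similarity("doc_a", "doc_b", clicks))  # two shared users out of three each
print(behavior_similarity("doc_a", "doc_c", clicks))  # no shared users
```

Documents opened by the same users score high even if they share no words, which is exactly what content-based methods cannot capture.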
Here’s how the semantic search engine works:
- A text field is identified in the documents, and a term-document matrix is built in which each cell indicates the importance or relevance of a word to a document. The weighting is similar to TF-IDF, which gives stop words little or no weight.
- Collaborative filtering is then applied to this matrix to generate a second matrix that indicates the appropriate weight for each term once stop words are removed.
- Finally, the top terms that are most relevant according to this matrix are sent to SOLR, which powers the semantic search.
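The steps above can be sketched on the three example documents. This is a minimal sketch under stated assumptions, not the method used by any particular library: it builds a smoothed TF-IDF term-document matrix and then applies item-based collaborative filtering, treating terms as items and documents as users, so that terms appearing in the same documents count as related. The stop-word list and all function names are chosen for this example.

```python
import math
import re
from collections import Counter

# Assumed, deliberately tiny stop-word list for the example.
STOP_WORDS = {"a", "and", "are", "for", "i", "is", "the", "with"}

documents = [
    "Watermelons are red.",
    "I like watermelons for breakfast.",
    "The fruit is large, green outside and red inside with seeds.",
]

def tokenize(text):
    """Lowercase, split into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def term_document_matrix(docs):
    """Build {term: {doc_index: tf-idf weight}} with a smoothed idf."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    doc_freq = Counter(term for c in counts for term in c)
    n_docs = len(docs)
    matrix = {}
    for i, c in enumerate(counts):
        for term, tf in c.items():
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
            matrix.setdefault(term, {})[i] = tf * idf
    return matrix

def cosine(row_a, row_b):
    """Cosine similarity between two sparse term rows."""
    dot = sum(w * row_b[i] for i, w in row_a.items() if i in row_b)
    norm_a = math.sqrt(sum(w * w for w in row_a.values()))
    norm_b = math.sqrt(sum(w * w for w in row_b.values()))
    return dot / (norm_a * norm_b)

def related_terms(term, matrix, top_n=3):
    """Rank the other terms by how similar their document rows are to the given term."""
    scores = [(other, cosine(matrix[term], row))
              for other, row in matrix.items() if other != term]
    scores.sort(key=lambda pair: (-pair[1], pair[0]))  # best score first, ties by name
    return scores[:top_n]

matrix = term_document_matrix(documents)
for term, score in related_terms("watermelons", matrix):
    print(term, round(score, 3))
```

On a corpus this small the related terms are only co-occurring words. With enough documents, the same ranking starts to surface terms that are related in meaning, and those are the ones worth sending to SOLR.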
Semantic Search Demo!
Let’s consider an example. In this data set, the field of interest is Body, which contains the contents of all questions and answers.
Now we run the code below in Python:
>>> from SemanticAnalyzer import *
>>> stvc = SolrTermVectorCollector(field='Body',feature='tf',batchSize=1000)
>>> tdc = TermDocCollection(source=stvc,numTopics=150)
Once the code finishes running, the results are saved, so future searches will be quick. As an easy test, let’s look at the 20 terms most highly correlated with the word melon.
Indeed, most of these terms are like a hall of fame of dark things from Star Wars and Harry Potter.
Now let’s push these results to SOLR.
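One possible way to do this, sketched under assumptions, is to store the related terms on each document through SOLR's JSON update handler using atomic updates. The field name `related_terms`, the document ID, the core name `mycore`, and the localhost URL are all hypothetical, and the schema would need a matching multi-valued field.

```python
import json
import urllib.request

def build_atomic_updates(related, field="related_terms"):
    """Turn {doc_id: [terms]} into SOLR atomic-update documents."""
    return [{"id": doc_id, field: {"set": terms}} for doc_id, terms in related.items()]

def push_to_solr(updates, update_url):
    """POST the updates as JSON to a SOLR update handler and return the HTTP status."""
    request = urllib.request.Request(
        update_url,
        data=json.dumps(updates).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status

# Hypothetical document ID and terms, for illustration only.
updates = build_atomic_updates({"doc-3": ["watermelon", "melon"]})
print(json.dumps(updates, indent=2))

# Uncomment to send to a running SOLR instance (URL and core name are assumptions):
# push_to_solr(updates, "http://localhost:8983/solr/mycore/update?commit=true")
```

Once the related terms are indexed, a query for one of them can match documents that never contained the word in their original text.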
Semantic Search using Expertrec
You can also skip all the steps above and simply create a semantic search engine using Expertrec.
- Go to semantic search creator.
- Sign in with your Gmail ID.
- Add your website URL.
- Wait for the crawl to complete.
- By now the search results UI will be ready (you can check out the demo).
- Go to Search settings -> Semantic search and enable it.