To build a complete search engine, you would need the following components:
- Crawler: Goes from one website to another, grabs the content of these websites, and stores it in a database.
- Parser: Processes the data from the crawler and saves the metadata.
- Indexer: Reads the parsed data and creates an inverted index (similar to the index you would find at the end of a book). This lets the search engine retrieve results faster. Without it, the search engine would have to go through all the documents one by one for every query.
- Search results ranker: For every search query, the search engine retrieves many documents/results. The ranker orders these results based on some score. Google, for example, is known for the PageRank algorithm. You can come up with your own scoring algorithm as well.
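To make the indexer and ranker steps concrete, here is a minimal sketch in Python. It assumes the parsed documents are already available as plain strings. The names `tokenize`, `build_inverted_index`, and `search` and the sample documents are made up for illustration, and the score used here (the total count of query terms in a document) is a deliberately simple stand-in, not PageRank.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()

def build_inverted_index(docs):
    """Map each term to {doc_id: number of occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def search(index, query):
    """Return ids of documents containing every query term, best score first."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(term, {}) for term in terms]
    # Only documents that appear in every posting list match the query.
    matches = set(postings[0])
    for posting in postings[1:]:
        matches &= set(posting)

    def score(doc_id):
        # Illustrative score: total occurrences of the query terms.
        return sum(posting[doc_id] for posting in postings)

    # Sort by descending score, then by id so ties are deterministic.
    return sorted(matches, key=lambda d: (-score(d), d))

# Hypothetical sample documents standing in for parsed crawler output.
docs = {
    "doc1": "A crawler downloads web pages",
    "doc2": "The indexer builds an inverted index from parsed pages",
    "doc3": "Pages, pages and more pages about the crawler",
}
index = build_inverted_index(docs)
results = search(index, "crawler pages")
```

A lookup only touches the posting lists of the query terms, which is why the index avoids scanning every document on each query.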
What is a web crawler?
A web crawler goes from one website to another and downloads the content of those websites.
Here is a typical crawler architecture.
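The core loop of a crawler can be sketched in a few lines of Python. This is only an illustration under some assumptions: `crawl` and `LinkExtractor` are made-up names, and the `fetch` argument is a stand-in for a real HTTP client. In the example, a small dictionary (`FAKE_WEB`) plays the role of the web so that the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href values of all <a> tags in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from `seed`.

    `fetch(url)` must return the page HTML as a string, or None on failure.
    Returns a dict of {url: html} for every page that was downloaded.
    """
    store = {}             # stands in for the crawler's database
    queue = deque([seed])  # URLs waiting to be downloaded
    seen = {seed}          # URLs already queued, to avoid revisiting
    while queue and len(store) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        store[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return store

# Hypothetical three-page site used instead of real HTTP requests.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/">home</a>',
    "http://example.com/b": '<a href="/c">C</a>',
}
pages = crawl("http://example.com/", FAKE_WEB.get)
```

In a real crawler, `fetch` would issue an HTTP request, and the pointers listed in this article (redirects, robots.txt, politeness, and so on) would be layered on top of this loop.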
Some pointers to keep in mind while designing a good web crawler for searching the web:
- Ability to download very large web pages
- Short download time per page
- Efficient use of bandwidth
- Handle HTTP 301 and HTTP 302 redirects: The crawler should follow permanent (301) and temporary (302) redirects to the page they point to.
- DNS caching: Instead of doing a DNS lookup for every request, the crawler should cache DNS results. This helps reduce the crawl time and the bandwidth used.
- Multithreading: Most crawlers launch several threads in order to download web pages in parallel. Instead of a single thread downloading pages one after another, multiple pages are fetched at the same time.
- Asynchronous crawl: With asynchronous crawling, a single thread sends and receives all the web requests in parallel. This saves RAM and CPU usage. Using this approach, we can crawl more than 3,000,000 web pages while using less than 200 MB of RAM, and achieve a crawl speed of more than 250 pages per second.
- Duplicate detection: The crawler should be able to find duplicate URLs and remove them.
- Handling robots.txt: The crawler should read the settings in robots.txt before crawling a site. Some pages (or page patterns) will be marked as Disallow, and these pages should not be crawled. The file is found at website.com/robots.txt.
- Sitemap.xml: A sitemap is a link map of the website. It lists the URLs that need to be crawled, which makes the crawling process simpler.
- Crawler policies:
  - A selection policy, which states which pages have to be downloaded.
  - A re-visit policy, which states how often to check the website for changes.
  - A politeness policy, which states how fast the website can be crawled (so that the load on the website does not increase).
  - A parallelization policy, which gives instructions for coordinating distributed crawlers.
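Two of the pointers above, duplicate detection and robots.txt handling, can be sketched with Python's standard library. This is a simplified illustration: `normalize_url`, `Frontier`, and the `ExampleBot` user agent are made-up names, the normalization rules are one reasonable choice rather than a complete standard, and the robots.txt rules are passed in as lines so that the example runs offline (a real crawler would download them from the site).

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

DEFAULT_PORTS = {"http": 80, "https": 443}

def normalize_url(url):
    """Reduce trivially different spellings of a URL to one form.

    Lowercases the scheme and host, drops the default port and the
    fragment, and uses "/" for an empty path. (Simplified: it does not
    handle IPv6 hosts or query-parameter ordering.)
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    return urlunsplit((scheme, netloc, parts.path or "/", parts.query, ""))

class Frontier:
    """Queue of URLs to crawl that skips duplicates and disallowed pages."""

    def __init__(self, robots_lines, user_agent="ExampleBot"):
        self.user_agent = user_agent
        self.robots = RobotFileParser()
        self.robots.parse(robots_lines)  # rules for a single site
        self.seen = set()
        self.queue = []

    def add(self, url):
        """Queue the URL; return False if it is a duplicate or disallowed."""
        url = normalize_url(url)
        if url in self.seen:
            return False
        self.seen.add(url)
        if not self.robots.can_fetch(self.user_agent, url):
            return False
        self.queue.append(url)
        return True

# Hypothetical robots.txt content for example.com.
frontier = Frontier(["User-agent: *", "Disallow: /private/"])
frontier.add("http://Example.com/a#top")       # queued
frontier.add("http://example.com:80/a")        # duplicate after normalization
frontier.add("http://example.com/private/x")   # disallowed by robots.txt
```

Keeping the `seen` set separate from the queue means a URL is checked only once, even if many pages link to it.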
What are some open source web crawlers you can use?
Expertrec is a search solution that provides a ready-made search engine (crawler + parser + indexer + search UI). You can create your own at https://cse.expertrec.com/?platform=cse