To build a complete search engine, you would need the following components:

  1. Crawler – Goes from one website to another, grabs the content of those websites, and stores it in a database.
  2. Parser – Processes the data from the crawler and saves the extracted text and metadata.
  3. Indexer – Reads the parsed data and builds an inverted index (similar to the index at the end of a book). This lets the search engine retrieve results quickly; without it, the engine would have to scan every document one by one. A minimal sketch follows this list.
  4. Search results ranker – For every search query, the search engine retrieves many documents/results. The ranker orders these results by some score. Google uses the PageRank algorithm; you can come up with your own scoring algorithm as well.
  5. Search user interface – Users mostly search from browsers or mobile apps through the search engine's interface, which is usually built with JavaScript.
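
To make the indexer step concrete, here is a minimal inverted-index sketch in Python. The tiny document set, the whitespace tokenizer, and the AND-style lookup are illustrative assumptions; a real indexer would add stemming, term positions, and a ranking score on top of this.

```python
from collections import defaultdict

# Inverted index: each term maps to the set of document ids that contain it.
inverted_index = defaultdict(set)

documents = {
    1: "web crawlers download pages from the web",
    2: "an inverted index maps terms to documents",
    3: "search engines rank documents for every query",
}

# Build the index: tokenize each document and record which documents hold each term.
for doc_id, text in documents.items():
    for term in text.lower().split():
        inverted_index[term].add(doc_id)

def search(query):
    """Return the ids of documents containing every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(inverted_index[terms[0]])
    for term in terms[1:]:
        result &= inverted_index[term]
    return result

print(search("documents"))   # {2, 3}
print(search("web pages"))   # {1}
```

With the index in place, answering a query only touches the postings for the query terms instead of scanning every document, which is why lookups stay fast as the collection grows.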

 

What is a web crawler?

A web crawler goes from one website to another and downloads the content of the pages it visits.
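
The idea is easier to see in code. Below is a minimal single-threaded crawler sketch in Python using only the standard library; the seed URL, the page limit, and the "http" filter are illustrative assumptions, and a real crawler would add the politeness, robots.txt, and error-handling concerns discussed below.

```python
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin

class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags found on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=10):
    frontier = deque([seed_url])   # URLs waiting to be fetched
    seen = {seed_url}              # simple duplicate-URL detection
    pages = {}                     # url -> downloaded HTML

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            # urlopen follows HTTP 301/302 redirects automatically.
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception as exc:
            print(f"skipping {url}: {exc}")
            continue

        pages[url] = html

        # Extract links, resolve them against the current URL, and queue new ones.
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            absolute, _ = urldefrag(urljoin(url, href))  # drop #fragments
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)

    return pages

# pages = crawl("https://example.com/")  # hypothetical seed URL
```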

Here is a typical crawler architecture.

[Figure: crawler architecture]

Here are some pointers to keep in mind while designing a good web crawler (short code sketches for several of these follow the list):

  1. Ability to download a huge number of web pages.
  2. Download web pages quickly.
  3. Consume bandwidth efficiently.
  4. Handle HTTP 301 and HTTP 302 redirects – the crawler should follow redirected pages to their final URL.
  5. DNS caching – instead of doing a DNS lookup for every request, the crawler should cache DNS results. This reduces crawl time and the bandwidth used.

  6. Multithreading – most crawlers launch several threads to download web pages in parallel. Instead of a single thread downloading files one by one, this approach fetches multiple pages at the same time.

  7. Asynchronous crawling – a single thread sends and receives all the web requests asynchronously, which saves RAM and CPU compared with one thread per request. Using this approach you can crawl more than 3,000,000 web pages while using less than 200 MB of RAM, at a crawl speed of more than 250 pages per second.
  8. Duplicate detection – the crawler should detect duplicate URLs (and near-duplicate content, for example with a SimHash fingerprint) and skip them.

  9. Handling robots.txt – the crawler should respect the rules in robots.txt. Pages (or URL patterns) marked as Disallow must not be crawled. The file is found at website.com/robots.txt.

  10. Sitemap.xml – a sitemap is a link map of the website. It lists the URLs that need to be crawled, which makes the crawling process simpler.

  11. Crawler policies:
    1. A selection policy that states which pages should be downloaded.
    2. A re-visit policy that states how often to check a website for changes.
    3. A politeness policy that limits how fast a website can be crawled (so that the load on the website does not increase).
    4. A parallelization policy that coordinates distributed crawlers.
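
A minimal sketch of DNS caching (pointer 5), assuming a single-process crawler built on the standard library. It memoizes socket.getaddrinfo, which urllib uses under the hood; note that it ignores DNS TTLs, which a production crawler should honor.

```python
import functools
import socket

_original_getaddrinfo = socket.getaddrinfo

@functools.lru_cache(maxsize=1024)
def _cached_getaddrinfo(host, port, family=0, type=0, proto=0, flags=0):
    """Resolve a host once and reuse the answer for later requests."""
    return _original_getaddrinfo(host, port, family, type, proto, flags)

# Any HTTP client in this process that resolves names through the socket
# module (e.g. urllib.request) now benefits from the cache.
socket.getaddrinfo = _cached_getaddrinfo
```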
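
A sketch of multithreaded fetching (pointer 6) using a thread pool from the standard library; the URL list is a placeholder.

```python
import concurrent.futures
import urllib.request

def fetch(url):
    """Download one page and return its URL and size in bytes."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return url, len(resp.read())

urls = ["https://example.com/", "https://example.org/", "https://example.net/"]

# Eight worker threads download pages in parallel instead of one at a time.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(fetch, url) for url in urls]
    for future in concurrent.futures.as_completed(futures):
        try:
            url, size = future.result()
            print(f"{url}: {size} bytes")
        except Exception as exc:
            print(f"fetch failed: {exc}")
```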
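
A sketch of asynchronous crawling (pointer 7), assuming the third-party aiohttp library is installed; a single thread multiplexes all the requests through an event loop.

```python
import asyncio
import aiohttp  # assumed dependency: pip install aiohttp

async def fetch(session, url):
    """Fetch one page and return its URL and body text."""
    async with session.get(url) as resp:
        return url, await resp.text()

async def main(urls):
    # One thread, one connection pool, many requests in flight at once.
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        for result in await asyncio.gather(*tasks, return_exceptions=True):
            if isinstance(result, Exception):
                print(f"fetch failed: {result}")
            else:
                url, body = result
                print(f"{url}: {len(body)} characters")

asyncio.run(main(["https://example.com/", "https://example.org/"]))  # placeholder URLs
```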
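
For near-duplicate pages (pointer 8), a common fingerprint is SimHash: each token's hash votes on the bits of a 64-bit fingerprint, and two documents whose fingerprints differ in only a few bits are treated as near-duplicates. A minimal sketch, using MD5 as the token hash; the distance threshold you compare against is an assumption you would tune.

```python
import hashlib

def _hash64(token):
    """64-bit hash of a token (first 8 bytes of its MD5 digest)."""
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")

def simhash(text):
    """64-bit SimHash fingerprint of a piece of text."""
    counts = [0] * 64
    for token in text.lower().split():
        h = _hash64(token)
        for bit in range(64):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(64):
        if counts[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

a = simhash("web crawlers download pages from the web")
b = simhash("web crawlers download many pages from the web")
c = simhash("an entirely different article about cooking pasta")

# Similar texts end up with a much smaller Hamming distance than unrelated ones.
print(hamming_distance(a, b), hamming_distance(a, c))
```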
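
A sketch of robots.txt handling (pointer 9) with the standard library's urllib.robotparser; the site and the "MyCrawler" user-agent name are placeholders.

```python
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")   # placeholder site
robots.read()                                      # downloads and parses the file

for url in ["https://example.com/", "https://example.com/private/page"]:
    if robots.can_fetch("MyCrawler", url):         # "MyCrawler" is our user agent
        print("allowed:", url)
    else:
        print("disallowed by robots.txt:", url)

# Honor the site's Crawl-delay directive, if it declares one.
delay = robots.crawl_delay("MyCrawler")
```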
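
A sketch of reading Sitemap.xml (pointer 10) with the standard library; the sitemap URL is a placeholder, and a sitemap index file (which nests further sitemaps) would need to be fetched recursively.

```python
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(sitemap_url):
    """Download a sitemap.xml file and return every <loc> URL it lists."""
    with urllib.request.urlopen(sitemap_url, timeout=10) as resp:
        root = ET.fromstring(resp.read())
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]

# urls = sitemap_urls("https://example.com/sitemap.xml")  # placeholder URL
# These URLs can seed the crawler's frontier directly.
```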

What are some open-source web crawlers you can use?

  1. Apache Nutch
  2. Scrapy
  3. Heritrix – https://github.com/internetarchive/heritrix3
  4. wget
  5. StormCrawler – http://stormcrawler.net/

Expertrec is a search solution that provides a ready-made search engine (crawler + parser + indexer + search UI). You can create your own at https://cse.expertrec.com/?platform=cse

 


 

Categories: crawler

Muthali Ganesh

Muthali loves writing about emerging technologies and easy solutions for complex tech issues. You can reach him through chat or by raising a support ticket on the left-hand side of the page.