A simple crawler-based search engine that demonstrates the main features of a search engine: web crawling, indexing, and ranking.
The web crawler is a Spring Boot bean service that collects documents from across the web. The crawler starts with a list of URLs (the seed set), downloads the documents they identify, and extracts hyperlinks from them. The extracted URLs are added to the list of URLs to be downloaded, making the crawler a recursive service. The crawler has the following features:
- The crawler maintains its state. That is, if it is interrupted and rerun, it continues crawling the documents on the list without revisiting documents that have already been downloaded.
- It respects the robots exclusion protocol (REP).
- It is a multi-threaded crawler implementation; the user can set the number of threads before starting the crawler.
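The crawl loop described above can be sketched as follows. This is a minimal, single-threaded illustration, not the project's actual implementation: an in-memory map of URL to out-links stands in for real HTTP fetching, HTML link extraction, and robots.txt checks, and all class and method names here are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

public class CrawlerSketch {
    /**
     * Breadth-first crawl starting from a seed set. The 'visited' set is the
     * crawler's state: pages that were already downloaded are never revisited.
     * A real crawler would consult robots.txt and download each URL over HTTP
     * where the comment below indicates.
     */
    public static Set<String> crawl(Map<String, List<String>> web, List<String> seeds) {
        Set<String> visited = new LinkedHashSet<>();
        Queue<String> frontier = new ArrayDeque<>(seeds);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            if (!visited.add(url)) continue;          // already downloaded: skip
            // (robots.txt check and HTTP download would happen here)
            for (String link : web.getOrDefault(url, List.of())) {
                if (!visited.contains(link)) frontier.add(link); // newly extracted URL
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = Map.of(
                "a.com", List.of("b.com", "c.com"),
                "b.com", List.of("a.com"),
                "c.com", List.of("d.com"));
        System.out.println(crawl(web, List.of("a.com"))); // [a.com, b.com, c.com, d.com]
    }
}
```

Persisting the visited set (e.g. in the database) is what lets an interrupted crawl resume without re-downloading documents.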
The output of the crawling process is a set of downloaded HTML documents. To respond to user queries quickly, the contents of these documents are indexed by a multi-threaded indexer service into a database table that stores the words contained in each document and their importance (e.g. whether they appear in a <title></title> tag, in a <header></header> tag, or as plain text). Each word is stored together with the documents that contain it and the indices at which it occurs in each document.
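The index structure described above can be sketched as an in-memory inverted index. The posting shape, tag names, and weight values here are illustrative assumptions, not the project's actual schema:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IndexerSketch {
    /** One index entry: which document a word occurred in, at which position,
     *  and how important the enclosing tag makes it. */
    public record Posting(String docId, int position, double tagWeight) {}

    // Illustrative weights: a word in <title> counts more than plain body text.
    static final Map<String, Double> TAG_WEIGHT =
            Map.of("title", 3.0, "header", 2.0, "body", 1.0);

    /**
     * Adds one document's words to the inverted index: for every word we store
     * the document it came from, the index at which it occurred, and a weight
     * derived from the tag it appeared in.
     */
    public static void indexDocument(Map<String, List<Posting>> index,
                                     String docId, String tag, String text) {
        String[] words = text.toLowerCase().split("\\s+");
        for (int pos = 0; pos < words.length; pos++) {
            index.computeIfAbsent(words[pos], w -> new ArrayList<>())
                 .add(new Posting(docId, pos, TAG_WEIGHT.getOrDefault(tag, 1.0)));
        }
    }

    public static void main(String[] args) {
        Map<String, List<Posting>> index = new HashMap<>();
        indexDocument(index, "doc1", "title", "search engine");
        indexDocument(index, "doc2", "body", "a simple search demo");
        System.out.println(index.get("search").size()); // 2
    }
}
```

In the real service each posting row would live in the database table rather than in memory, so the ranker can query it directly.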
The ranker module sorts documents based on their popularity and relevance to the search query.
- Word Relevance: Relevance relates the query words to the result page. It is computed in several ways, such as the tf-idf of the query word in the result page and whether the query word appears in the title, a heading, or the body; the scores of all query words are then aggregated to produce the page's final relevance score.
- Popularity: Popularity measures the importance of a web page regardless of the query. The PageRank algorithm is used to calculate page popularity.
- Users' Frequent Domains: Rankings are biased toward each user's frequently visited domains, which are recorded every time the user clicks on a query result.
- Geographic Location of the User: A page's score increases if it is related to the user's location.
- Page Recency: A web page’s score increases if it was published recently.
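The two core signals above, tf-idf relevance and PageRank popularity, can be sketched as follows. This is a simplified illustration, not the project's ranker: the tf-idf variant, the damping factor of 0.85, and the fixed iteration count are common textbook choices assumed here.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RankerSketch {
    /**
     * tf-idf of a term in one document: term frequency in the document times
     * the log inverse document frequency over the corpus. Per-query-word
     * scores are summed to give the page's relevance score.
     */
    public static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        long tf = doc.stream().filter(term::equals).count();
        long df = corpus.stream().filter(d -> d.contains(term)).count();
        if (tf == 0 || df == 0) return 0.0;
        return tf * Math.log((double) corpus.size() / df);
    }

    /**
     * Power iteration for PageRank with damping factor 0.85 over a link graph
     * given as page -> out-links. Returns page -> popularity score.
     */
    public static Map<String, Double> pageRank(Map<String, List<String>> links, int iters) {
        double d = 0.85;
        int n = links.size();
        Map<String, Double> rank = new HashMap<>();
        links.keySet().forEach(p -> rank.put(p, 1.0 / n));
        for (int i = 0; i < iters; i++) {
            Map<String, Double> next = new HashMap<>();
            links.keySet().forEach(p -> next.put(p, (1 - d) / n));
            for (var e : links.entrySet()) {
                // Each page splits its current rank evenly among its out-links.
                double share = rank.get(e.getKey()) / Math.max(1, e.getValue().size());
                for (String target : e.getValue()) {
                    next.merge(target, d * share, Double::sum);
                }
            }
            rank.clear();
            rank.putAll(next);
        }
        return rank;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
                List.of("search", "engine"), List.of("web", "crawler"));
        System.out.printf("%.3f%n", tfIdf("search", corpus.get(0), corpus));
        Map<String, List<String>> links = Map.of(
                "a", List.of("b"), "b", List.of("a", "c"), "c", List.of("a"));
        System.out.println(pageRank(links, 20));
    }
}
```

In the full ranker these two scores would be combined with the domain-bias, location, and recency signals listed above to produce the final ordering.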
- [Optional] Set up the database by:
  - 0.1 Executing database_schema.sql
  - 0.2 Importing the database data from database_dump_csv.rar
- Run using your favorite Java IDE. In our case, we used IntelliJ IDEA.
1.1 To run the Crawler Service, uncomment the following lines in HippoHippoGoApplication.java:

```java
// CrawlerService crawlerService = applicationContext.getBean(CrawlerService.class);
// crawlerService.Crawl();
```
1.2 To run the Indexer Service, uncomment the following lines in HippoHippoGoApplication.java:

```java
// IndexerService indexer = applicationContext.getBean(IndexerService.class);
// indexer.main();
```
1.3 To run the Ranker Service, uncomment the following lines in HippoHippoGoApplication.java:

```java
// RankerService rankerService = applicationContext.getBean(RankerService.class);
// rankerService.rankPages();
```