A simple crawler-based search engine that demonstrates the main features of a search engine: web crawling, indexing, and ranking.
The web crawler is a Spring Boot bean service that collects documents from across the web. The crawler starts with a list of URLs (the seed set), downloads the documents they identify, and extracts hyperlinks from them. The extracted URLs are added to the list of URLs to be downloaded, making the crawler a recursive service. The crawler has the following features:
- The crawler maintains its state. That is, if it is interrupted and rerun, it continues crawling the documents on the list without revisiting documents that have already been downloaded.
- It respects the robots exclusion protocol (REP).
- It is a multi-threaded crawler implementation; the user can set the number of threads before starting the crawler.
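The crawl loop described above can be sketched as follows. This is a minimal, single-threaded illustration, not the project's actual implementation: an in-memory map of URL to out-links stands in for real HTTP fetching, HTML link extraction, and robots.txt checks, and all class and method names here are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

public class CrawlerSketch {
    /**
     * Breadth-first crawl starting from a seed set. The 'visited' set is the
     * crawler's state: pages that were already downloaded are never revisited.
     * A real crawler would consult robots.txt and download each URL over HTTP
     * where the comment below indicates.
     */
    public static Set<String> crawl(Map<String, List<String>> web, List<String> seeds) {
        Set<String> visited = new LinkedHashSet<>();
        Queue<String> frontier = new ArrayDeque<>(seeds);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            if (!visited.add(url)) continue;          // already downloaded: skip
            // (robots.txt check and HTTP download would happen here)
            for (String link : web.getOrDefault(url, List.of())) {
                if (!visited.contains(link)) frontier.add(link); // newly extracted URL
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = Map.of(
                "a.com", List.of("b.com", "c.com"),
                "b.com", List.of("a.com"),
                "c.com", List.of("d.com"));
        System.out.println(crawl(web, List.of("a.com"))); // [a.com, b.com, c.com, d.com]
    }
}
```

Persisting the visited set (e.g. in the database) is what lets an interrupted crawl resume without re-downloading documents.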
The output of the crawling process is a set of downloaded HTML documents. To respond to user queries quickly, the contents of these documents are indexed by a multi-threaded indexer service into a database table that stores the words contained in each document and their importance (e.g. whether they appear in a <title></title> tag, in a <header></header> tag, or as plain text). Each word is stored together with the documents that contain it and the indices at which it occurs in each document.
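The index structure described above can be sketched as an in-memory inverted index. The posting shape, tag names, and weight values here are illustrative assumptions, not the project's actual schema:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IndexerSketch {
    /** One index entry: which document a word occurred in, at which position,
     *  and how important the enclosing tag makes it. */
    public record Posting(String docId, int position, double tagWeight) {}

    // Illustrative weights: a word in <title> counts more than plain body text.
    static final Map<String, Double> TAG_WEIGHT =
            Map.of("title", 3.0, "header", 2.0, "body", 1.0);

    /**
     * Adds one document's words to the inverted index: for every word we store
     * the document it came from, the index at which it occurred, and a weight
     * derived from the tag it appeared in.
     */
    public static void indexDocument(Map<String, List<Posting>> index,
                                     String docId, String tag, String text) {
        String[] words = text.toLowerCase().split("\\s+");
        for (int pos = 0; pos < words.length; pos++) {
            index.computeIfAbsent(words[pos], w -> new ArrayList<>())
                 .add(new Posting(docId, pos, TAG_WEIGHT.getOrDefault(tag, 1.0)));
        }
    }

    public static void main(String[] args) {
        Map<String, List<Posting>> index = new HashMap<>();
        indexDocument(index, "doc1", "title", "search engine");
        indexDocument(index, "doc2", "body", "a simple search demo");
        System.out.println(index.get("search").size()); // 2
    }
}
```

In the real service each posting row would live in the database table rather than in memory, so the ranker can query it directly.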
The ranker module sorts documents based on their popularity and relevance to the search query.
- Word Relevance: Relevance relates the query words to the result page. It is computed in several ways, such as the tf-idf of the query word in the result page and whether the query word appears in the title, a heading, or the body; the scores of all query words are then aggregated to produce the page's final relevance score.
- Popularity: Popularity measures the importance of a web page regardless of the query. The PageRank algorithm is used to calculate page popularity.
- Users' Frequent Domains: Rankings are biased toward each user's frequently visited domains, which are recorded every time the user clicks on a query result.
- Geographic Location of the User: A page's score increases if it is related to the user's location.
- Page Recency: A web page’s score increases if it was published recently.
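The two core signals above, tf-idf relevance and PageRank popularity, can be sketched as follows. This is a simplified illustration, not the project's ranker: the tf-idf variant, the damping factor of 0.85, and the fixed iteration count are common textbook choices assumed here.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RankerSketch {
    /**
     * tf-idf of a term in one document: term frequency in the document times
     * the log inverse document frequency over the corpus. Per-query-word
     * scores are summed to give the page's relevance score.
     */
    public static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        long tf = doc.stream().filter(term::equals).count();
        long df = corpus.stream().filter(d -> d.contains(term)).count();
        if (tf == 0 || df == 0) return 0.0;
        return tf * Math.log((double) corpus.size() / df);
    }

    /**
     * Power iteration for PageRank with damping factor 0.85 over a link graph
     * given as page -> out-links. Returns page -> popularity score.
     */
    public static Map<String, Double> pageRank(Map<String, List<String>> links, int iters) {
        double d = 0.85;
        int n = links.size();
        Map<String, Double> rank = new HashMap<>();
        links.keySet().forEach(p -> rank.put(p, 1.0 / n));
        for (int i = 0; i < iters; i++) {
            Map<String, Double> next = new HashMap<>();
            links.keySet().forEach(p -> next.put(p, (1 - d) / n));
            for (var e : links.entrySet()) {
                // Each page splits its current rank evenly among its out-links.
                double share = rank.get(e.getKey()) / Math.max(1, e.getValue().size());
                for (String target : e.getValue()) {
                    next.merge(target, d * share, Double::sum);
                }
            }
            rank.clear();
            rank.putAll(next);
        }
        return rank;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
                List.of("search", "engine"), List.of("web", "crawler"));
        System.out.printf("%.3f%n", tfIdf("search", corpus.get(0), corpus));
        Map<String, List<String>> links = Map.of(
                "a", List.of("b"), "b", List.of("a", "c"), "c", List.of("a"));
        System.out.println(pageRank(links, 20));
    }
}
```

In the full ranker these two scores would be combined with the domain-bias, location, and recency signals listed above to produce the final ordering.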
- [Optional] Set up the database by:
  - 0.1 Executing database_schema.sql
  - 0.2 Importing the database data from database_dump_csv.rar
- Run using your favorite Java IDE. In our case, we used IntelliJ IDEA.
1.1 To run the Crawler Service, uncomment the following lines in HippoHippoGoApplication.java:

```java
// CrawlerService crawlerService = applicationContext.getBean(CrawlerService.class);
// crawlerService.Crawl();
```
1.2 To run the Indexer Service, uncomment the following lines in HippoHippoGoApplication.java:

```java
// IndexerService indexer = applicationContext.getBean(IndexerService.class);
// indexer.main();
```
1.3 To run the Ranker Service, uncomment the following lines in HippoHippoGoApplication.java:

```java
// RankerService rankerService = applicationContext.getBean(RankerService.class);
// rankerService.rankPages();
```