hippohippogo-search-engine's Introduction


HippoHippoGo

A simple crawler-based search engine that demonstrates the main features of a search engine: web crawling, indexing and ranking.

Table of Contents

About The Project

Landing Page

Autocomplete Suggestions

Voice Recognition Search

Web Search

Image Search

Trends

Crawler

The web crawler is a Spring Boot bean service that collects documents from across the web. It starts with a list of URLs (the seed set), downloads the documents they identify, and extracts hyperlinks from them. The extracted URLs are added to the list of URLs to be downloaded, so the crawl proceeds recursively. The crawler has the following features:

  • The crawler maintains its state: if it is interrupted and then rerun, it resumes crawling the documents on its list without revisiting documents that have already been downloaded.
  • It respects the robots exclusion protocol (REP).
  • It is multi-threaded, and the user can set the number of threads before starting the crawler.
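
The crawl loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual CrawlerService: the class name `CrawlerSketch`, the pluggable `fetcher` function and the regex-based link extraction are all assumptions made for the example. Robots.txt checks, persistence to disk and a page limit are left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: names and structure are illustrative only.
class CrawlerSketch {
    // Naive link extraction: absolute http(s) URLs in double-quoted href attributes.
    private static final Pattern HREF =
            Pattern.compile("href=\"(https?://[^\"#]+)\"", Pattern.CASE_INSENSITIVE);

    // Thread-safe set of URLs that have been scheduled or downloaded.
    private final Set<String> visited = ConcurrentHashMap.newKeySet();
    // Maps a URL to its HTML, or null if the download failed (assumed interface).
    private final Function<String, String> fetcher;

    // previouslyVisited models the saved state from an interrupted run.
    CrawlerSketch(Function<String, String> fetcher, Set<String> previouslyVisited) {
        this.fetcher = fetcher;
        this.visited.addAll(previouslyVisited);
    }

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Crawls from the seed set with the given number of worker threads
    // and returns every URL seen, including the previously visited ones.
    Set<String> crawl(List<String> seeds, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Phaser inFlight = new Phaser(1); // party 1 is the calling thread
        for (String seed : seeds) {
            schedule(pool, inFlight, seed);
        }
        inFlight.arriveAndAwaitAdvance(); // blocks until all scheduled downloads finish
        pool.shutdown();
        return visited;
    }

    private void schedule(ExecutorService pool, Phaser inFlight, String url) {
        if (!visited.add(url)) {
            return; // already downloaded or already queued
        }
        inFlight.register();
        pool.execute(() -> {
            try {
                String html = fetcher.apply(url);
                if (html != null) {
                    for (String link : extractLinks(html)) {
                        schedule(pool, inFlight, link);
                    }
                }
            } finally {
                inFlight.arriveAndDeregister();
            }
        });
    }
}
```

Passing the set of already-downloaded URLs to the constructor is one simple way to model the resume behaviour; a real crawler would load that set, and the pending list, from persistent storage.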

Indexer

The output of the crawling process is a set of downloaded HTML documents. To answer user queries quickly, a multi-threaded indexer service indexes the contents of these documents in a database table that stores the words contained in each document and their importance (e.g. whether they appear in a <title></title> tag, in a <header></header> tag or as plain text). Each word is stored with the document that contains it and the positions at which it occurs in that document.
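
To make the stored structure concrete, here is a small in-memory sketch of such an index. It is an assumption-laden illustration: the project stores this data in a database table and indexes with multiple threads, whereas `InvertedIndexSketch`, `Posting` and `Importance` are hypothetical names, the sketch is single-threaded, and it takes the title, header and body text as already-extracted strings.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of a positional inverted index kept in memory.
class InvertedIndexSketch {
    enum Importance { TITLE, HEADER, PLAIN }

    // One occurrence of a word: which document, where, and how prominent.
    static final class Posting {
        final int docId;
        final int position;
        final Importance importance;

        Posting(int docId, int position, Importance importance) {
            this.docId = docId;
            this.position = position;
            this.importance = importance;
        }
    }

    private final Map<String, List<Posting>> postings = new HashMap<>();

    // Indexes one document; positions run continuously across title, header and body.
    void addDocument(int docId, String title, String header, String body) {
        int position = 0;
        position = addField(docId, title, Importance.TITLE, position);
        position = addField(docId, header, Importance.HEADER, position);
        addField(docId, body, Importance.PLAIN, position);
    }

    // Tokenizes on non-alphanumeric characters and records each word occurrence.
    private int addField(int docId, String text, Importance importance, int start) {
        int position = start;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(token, k -> new ArrayList<>())
                    .add(new Posting(docId, position, importance));
            position++;
        }
        return position;
    }

    // Returns every recorded occurrence of the word, or an empty list.
    List<Posting> lookup(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptyList());
    }
}
```

Each `Posting` corresponds to what one row of the index table might hold: the word, its document, its position and its importance.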

Ranker

The ranker module sorts documents based on their popularity and relevance to the search query.

  1. Word Relevance: Relevance is a relation between the query words and the result page. It is calculated from several signals, such as the tf-idf of each query word in the result page and whether the query word appears in the title, a heading or the body. The scores for all query words are then aggregated to produce the final page relevance score.
  2. Popularity: Popularity is a measure of the importance of a web page regardless of the requested query. The PageRank algorithm is used to calculate page popularity.
  3. Users' Frequent Domains: Ranking is biased towards the domains each user visits frequently, which are recorded each time the user clicks on a query result.
  4. Geographic Location of the User: A page's score increases if the page is related to the user's location.
  5. Page Recency: A web page's score increases if it was published recently.
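
The first two factors can be illustrated with a short sketch. It uses the textbook forms of tf-idf (term frequency times the natural log of the inverse document frequency) and PageRank (power iteration with a damping factor); the project's exact formulas, its damping value and the way it weights and combines the five factors are not stated here, so `RankerSketch` and its method signatures are assumptions for illustration only.

```java
import java.util.Arrays;

// Hypothetical sketch of two ranking signals; not the project's RankerService.
class RankerSketch {
    // tf-idf of one term in one document.
    // tf = occurrences / document length, idf = ln(total docs / docs containing the term).
    static double tfIdf(int termCountInDoc, int docLength, int totalDocs, int docsWithTerm) {
        if (docLength == 0 || docsWithTerm == 0) {
            return 0.0; // term or document carries no information
        }
        double tf = (double) termCountInDoc / docLength;
        double idf = Math.log((double) totalDocs / docsWithTerm);
        return tf * idf;
    }

    // PageRank by power iteration.
    // outLinks[p] lists the pages that page p links to (pages are numbered 0..n-1).
    static double[] pageRank(int[][] outLinks, double damping, int iterations) {
        int n = outLinks.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n); // start from a uniform distribution
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            double dangling = 0.0; // rank held by pages with no outgoing links
            for (int p = 0; p < n; p++) {
                if (outLinks[p].length == 0) {
                    dangling += rank[p];
                    continue;
                }
                double share = rank[p] / outLinks[p].length;
                for (int q : outLinks[p]) {
                    next[q] += share;
                }
            }
            for (int q = 0; q < n; q++) {
                // Dangling rank is spread evenly so the ranks keep summing to 1.
                next[q] = (1.0 - damping) / n + damping * (next[q] + dangling / n);
            }
            rank = next;
        }
        return rank;
    }
}
```

For example, `RankerSketch.pageRank(new int[][] {{1}, {0}, {0}}, 0.85, 50)` ranks page 0 highest, since both other pages link to it, and page 2 lowest, since nothing links to it.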

Built With

Getting Started

Prerequisites

  • Download and install Maven
  • Download and install MySQL

Running

  1. [Optional] Set up the database by:
    1.1 Executing database_schema.sql
    1.2 Importing database data from database_dump_csv.rar

  2. Run using your favorite Java IDE. In our case, we used IntelliJ IDEA.
    2.1 To run the Crawler Service, uncomment the following lines in HippoHippoGoApplication.java

    // CrawlerService crawlerService = applicationContext.getBean(CrawlerService.class);
    // crawlerService.Crawl();

    2.2 To run the Indexer Service, uncomment the following lines in HippoHippoGoApplication.java

    // IndexerService indexer = applicationContext.getBean(IndexerService.class);
    // indexer.main();

    2.3 To run the Ranker Service, uncomment the following lines in HippoHippoGoApplication.java

    // RankerService rankerService = applicationContext.getBean(RankerService.class);
    // rankerService.rankPages();

Acknowledgements

Contributors
