ShakeSearch

Welcome to the Pulley Shakesearch Take-home Challenge! In this repository, you'll find a simple web app that allows a user to search for a text string in the complete works of Shakespeare.

You can see a live version of the app at https://pulley-shakesearch.herokuapp.com/. Try searching for "Hamlet" to display a list of results.

In its current state, however, the app is just a rough prototype. The search is case sensitive, the results are difficult to read, and the search is limited to exact matches.

Improvements

You can see a live version of the improved app at https://ts-shakesearch.herokuapp.com/.

Try searching for "Nile gods the" to display a list of results.

You will see the following improvements:

  • Ranked result list containing references to the search terms in Shakespeare's works
  • Stopwords like 'the' are ignored
  • Multiple forms of the word 'god' are found (God's, God, gods)

Document Parser

documents.go does the following:

  • Read completeworks.txt
  • Mark document start and end
  • Create a document for each of Shakespeare's works
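
The splitting step can be sketched as follows. This is only an illustration, not the code in documents.go: it assumes each work begins at a line consisting solely of its title, and the names Document and splitWorks, the title list, and the sample text are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Document is a hypothetical record for one work.
type Document struct {
	Title      string
	Start, End int // byte offsets into the full text (assumed layout)
	Text       string
}

// splitWorks marks where each title line occurs and cuts the full text
// into one Document per work. Assumption: a work starts at a line that
// contains only its title and runs until the next work starts.
func splitWorks(full string, titles []string) []Document {
	var docs []Document
	for _, title := range titles {
		i := strings.Index(full, "\n"+title+"\n")
		if i < 0 {
			continue // title line not found; skip it
		}
		docs = append(docs, Document{Title: title, Start: i + 1})
	}
	// Order the works by position so each one ends where the next begins.
	sort.Slice(docs, func(a, b int) bool { return docs[a].Start < docs[b].Start })
	for i := range docs {
		end := len(full)
		if i+1 < len(docs) {
			end = docs[i+1].Start
		}
		docs[i].End = end
		docs[i].Text = full[docs[i].Start:end]
	}
	return docs
}

func main() {
	// Stand-in for the contents of completeworks.txt.
	full := "\nHAMLET\nTo be, or not to be\nMACBETH\nDouble, double toil and trouble\n"
	for _, d := range splitWorks(full, []string{"HAMLET", "MACBETH"}) {
		fmt.Printf("%s: bytes %d-%d\n", d.Title, d.Start, d.End)
	}
}
```

Keeping the start and end offsets on each document makes it possible to map a hit back to its position in the original file.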

Indexer

Index implements an inverted-index algorithm for fast full-text search. For each token, the index holds a list of references to that word's occurrences within the documents.

Documents are indexed on startup of the application. The Index then provides two different query methods:

  • Query(searchTerm string) []QueryDocument
  • QueryConcurrent(searchTerm string) []QueryDocument

The concurrent query method splits the search term into multiple tokens and looks up each token concurrently in its own goroutine.
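
A minimal sketch of the per-token concurrent lookup, assuming the index is read-only once it has been built. The function lookupConcurrent and the simplified types are hypothetical and stand in for whatever QueryConcurrent does internally; tokenization is reduced to lowercasing and splitting on whitespace.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// Document and DocRef are simplified, hypothetical stand-ins.
type Document struct{ Title string }

type DocRef struct {
	Doc        *Document
	Start, End int
}

// lookupConcurrent looks up every token of searchTerm in its own goroutine.
// Each goroutine writes only to its own slot of results, and the index map
// is only read, so no mutex is needed.
func lookupConcurrent(index map[string][]DocRef, searchTerm string) [][]DocRef {
	tokens := strings.Fields(strings.ToLower(searchTerm))
	results := make([][]DocRef, len(tokens))
	var wg sync.WaitGroup
	for i, token := range tokens {
		wg.Add(1)
		go func(i int, token string) {
			defer wg.Done()
			results[i] = index[token]
		}(i, token)
	}
	wg.Wait() // block until every lookup has finished
	return results
}

func main() {
	hamlet := &Document{Title: "Hamlet"}
	index := map[string][]DocRef{
		"nile": {{Doc: hamlet, Start: 0, End: 4}},
		"god":  {{Doc: hamlet, Start: 10, End: 13}, {Doc: hamlet, Start: 20, End: 23}},
	}
	for i, refs := range lookupConcurrent(index, "Nile god") {
		fmt.Println(i, len(refs))
	}
}
```

Because a single map lookup is cheap, the concurrent variant is likely to pay off only when each token's lookup involves more work, such as merging or scoring long reference lists.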

The Index analyses documents as follows:

  1. Tokenize documents (extract words)
  2. Apply filters to each token
    • Lower case filter
    • Stopword filter (remove common words, see stopwords_en.txt)
    • Apply stemmer (normalize forms of the same word, e.g. fish, fishes -> fish)
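
The analysis steps above can be sketched as a small pipeline. The stopword set here is a tiny illustrative sample (the project reads its list from stopwords_en.txt), and stem is a deliberately naive placeholder that only strips a plural suffix; the function names are assumptions, not the project's API.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stopwords is a small illustrative sample, not the real list.
var stopwords = map[string]bool{"the": true, "a": true, "and": true, "of": true}

// tokenize splits text on every rune that is neither a letter nor a digit.
func tokenize(text string) []string {
	return strings.FieldsFunc(text, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsNumber(r)
	})
}

// stem is a naive placeholder: it strips a trailing "es" or "s" from
// longer words. A real stemmer handles many more cases.
func stem(token string) string {
	for _, suffix := range []string{"es", "s"} {
		if len(token) > len(suffix)+2 && strings.HasSuffix(token, suffix) {
			return strings.TrimSuffix(token, suffix)
		}
	}
	return token
}

// analyze applies the filters in order: lower case, stopwords, stemming.
func analyze(text string) []string {
	var out []string
	for _, tok := range tokenize(text) {
		tok = strings.ToLower(tok)
		if stopwords[tok] {
			continue
		}
		out = append(out, stem(tok))
	}
	return out
}

func main() {
	fmt.Println(analyze("The fishes and the Gods")) // prints [fish god]
}
```

Running the same analyze function over both documents and query terms is what lets a search for "gods" match "God" in the text.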

The index stores the output of the analyzer. The data structure is defined as follows:

  • Inverted map: map[token][]DocRef, which maps each token to the list of its occurrences
  • Reference to a word within a document: DocRef(*document, start, end)
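
In Go, this structure could be declared roughly as below. The field names, the Add helper, and the reading of start and end as byte offsets into the document text are assumptions made for illustration.

```go
package main

import "fmt"

// Document is a hypothetical stand-in for one of Shakespeare's works.
type Document struct {
	Title string
	Text  string
}

// DocRef points at one occurrence of a word: the document plus the
// start and end offsets of the word (assumed to be byte offsets).
type DocRef struct {
	Doc   *Document
	Start int
	End   int
}

// Index is the inverted map from analyzed token to all its occurrences.
type Index map[string][]DocRef

// Add records one occurrence of token.
func (idx Index) Add(token string, ref DocRef) {
	idx[token] = append(idx[token], ref)
}

func main() {
	doc := &Document{Title: "Hamlet", Text: "To be, or not to be"}
	idx := Index{}
	idx.Add("be", DocRef{Doc: doc, Start: 3, End: 5})
	idx.Add("be", DocRef{Doc: doc, Start: 17, End: 19})
	for _, ref := range idx["be"] {
		// Slice the original text to recover the matched word.
		fmt.Println(ref.Doc.Title, ref.Doc.Text[ref.Start:ref.End])
	}
}
```

Storing a pointer to the document rather than a copy keeps each reference small, and the offsets allow the surrounding text to be shown as a snippet.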

A query against the index works as follows:

  1. Analyze the query term to get the search tokens
  2. Look up every analyzed token in the index
  3. Sort (rank) the search results
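
The three query steps could look like the sketch below. The ranking criterion is an assumption: documents are ordered by their total number of matching references, which may differ from the actual implementation. Analysis is reduced to lowercasing and splitting on whitespace, and the query function and the QueryDocument fields are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

type Document struct{ Title string }

type DocRef struct {
	Doc        *Document
	Start, End int
}

// QueryDocument groups all hits for one document (assumed shape).
type QueryDocument struct {
	Doc  *Document
	Refs []DocRef
}

// query analyzes the term, looks up each token, and ranks documents
// by number of hits, highest first.
func query(index map[string][]DocRef, searchTerm string) []QueryDocument {
	byDoc := map[*Document]*QueryDocument{}
	var found []*QueryDocument
	for _, token := range strings.Fields(strings.ToLower(searchTerm)) {
		for _, ref := range index[token] {
			qd, ok := byDoc[ref.Doc]
			if !ok {
				qd = &QueryDocument{Doc: ref.Doc}
				byDoc[ref.Doc] = qd
				found = append(found, qd)
			}
			qd.Refs = append(qd.Refs, ref)
		}
	}
	// A stable sort keeps first-seen order among documents with equal hits.
	sort.SliceStable(found, func(i, j int) bool {
		return len(found[i].Refs) > len(found[j].Refs)
	})
	out := make([]QueryDocument, len(found))
	for i, qd := range found {
		out[i] = *qd
	}
	return out
}

func main() {
	hamlet := &Document{Title: "Hamlet"}
	antony := &Document{Title: "Antony and Cleopatra"}
	index := map[string][]DocRef{
		"nile": {{Doc: antony, Start: 10, End: 14}, {Doc: antony, Start: 90, End: 94}},
		"god":  {{Doc: hamlet, Start: 5, End: 8}, {Doc: antony, Start: 40, End: 43}},
	}
	for _, qd := range query(index, "Nile god") {
		fmt.Println(qd.Doc.Title, len(qd.Refs))
	}
}
```

Counting hits is the simplest possible ranking; weighting rarer tokens more heavily would be a natural refinement.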

Further Improvements

  • Support fuzzy search using a Radix Tree
  • Fix document parser encoding

Your Mission

Improve the search backend. Think about the problem from the user's perspective and prioritize your changes according to what you think is most useful.

Submission

  1. Fork this repository and send us a link to your fork after pushing your changes.
  2. Heroku hosting - The project includes a Heroku Procfile and, in its current state, can be deployed easily on Heroku's free tier.
  3. In your submission, share with us what changes you made and how you would prioritize changes if you had more time.

Contributors

tspiegl, yyw
