docsearch's Introduction

This project is intended to work with both Python 2.7 and Python 3.4.

P.S: For a more detailed explanation of the algorithm, see http://bharathramh92.github.io/DocSearch/

Sub projects (each sub project's README is inside its own directory)

  • InitialDataExtraction (data retrieval from the Google Books API)
  • KeyWord (keyword generation using Stanford NLP)

Indexing/Querying

  • Query.py and InvertedIndex.py are the main files for indexing and querying.

Resource requirements for indexing/querying

  • For indexing, create the Resource/id_doc_rdd_raw directory in the directory where README.md resides.

  • The raw data, id_doc_rdd_raw, is a data structure that maps each doc_id to its document data.

  • id_doc_rdd_raw eg: {"uKQ0CgAAQBAJ": {"imageLinks": {"smallThumbnail": "http://books.google.com/books/content?id=uKQ0CgAAQBAJ&printsec=frontcover&img=1&zoom=5&edge=curl&source=gbs_api", "thumbnail": "http://books.google.com/books/content?id=uKQ0CgAAQBAJ&printsec=frontcover&img=1&zoom=1&edge=curl&source=gbs_api"}, "categories": ["Fiction"], "description": "Meredith has never considered herself submissive even though her greatest fantasy is being pleasured against her will. When Mark orders her to her knees the first time, she can\u2019t get there fast enough\u2014and then hates herself afterward for losing control. A", "publisher": "Ellora's Cave Publishing Inc", "ISBN_13": "9781419994289", "keyWords": ["Meredith", "Mark"], "infoLink": "http://books.google.com/books?id=uKQ0CgAAQBAJ&dq=go&as_pt=BOOKS&hl=&source=gbs_api", "authors": ["L.E. Chamberlin"], "ISBN_10": "141999428X", "maturityRating": "NOT_MATURE", "title": "The Rewards of Letting Go"}}

  • First run InvertedIndex.py to create index_rdd.

  • index_rdd maps each indexed term to the ids of the documents that contain it, together with the zone/entity name in which the term occurs.

  • index_rdd eg: ('aceline', (('K3NLoAEACAAJ', 'keyWords'), ('7DiNBwAAQBAJ', 'title')))

  • Here 'aceline' is the indexed term, K3NLoAEACAAJ is a doc_id and keyWords is the zone in which the term occurs; the second pair reads the same way.

  • view_rdd can be used when the ranking should also take view counts into account.

  • view_rdd eg: ('NLngYyWFl_YC', 1292215), where the data format is (doc_id, view_count)
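
The shape of index_rdd can be illustrated with a small plain-Python sketch. This is not the code in InvertedIndex.py: the zone list, the tokenizer and the build_index helper are all assumptions made for illustration, and the sample records are invented stand-ins for id_doc_rdd_raw.

```python
import re
from collections import defaultdict

# Zones to index. The names mirror keys in the sample record, but the
# exact set indexed by InvertedIndex.py is an assumption.
ZONES = ("title", "keyWords", "categories", "publisher", "authors")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(id_doc):
    """Map each term to a sorted tuple of (doc_id, zone) postings."""
    postings = defaultdict(set)
    for doc_id, doc in id_doc.items():
        for zone in ZONES:
            value = doc.get(zone, [])
            # A zone value is either a single string or a list of strings.
            texts = [value] if isinstance(value, str) else value
            for text in texts:
                for term in tokenize(text):
                    postings[term].add((doc_id, zone))
    return {term: tuple(sorted(p)) for term, p in postings.items()}


# Invented records in the doc_id -> document data shape.
sample = {
    "doc1": {"title": "Pop Culture", "keyWords": ["Pop"], "categories": ["Art"]},
    "doc2": {"title": "Art History", "publisher": "Macmillan"},
}
index = build_index(sample)
print(index["pop"])  # (('doc1', 'keyWords'), ('doc1', 'title'))
```

Each entry has the same (term, ((doc_id, zone), ...)) layout as the index_rdd example, so a lookup returns both the matching documents and the zones in which the term was found.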

Query.py --> Query for documents from the indexed data

  • Before this step, run the indexing described above.
  • Call the get_ranking(query_term, zone_restriction, VIEW_RANKED_RETRIEVAL) method as shown in the main() method.
  • Only one of query_term and zone_restriction has to be defined.
  • query_term: the search term, used when the search should span all zones.
  • zone_restriction: a dictionary whose keys are zones and whose values are the search terms for those zones; use it to restrict particular terms to a zone.
  • eg: zone_restriction = {KEYWORDS: 'pop', CATEGORIES: 'art', TITLE: 'culture', PUBLISHER: 'macmillan'}
  • eg: q_term = 'cormen clrs'
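
The two query modes can be sketched as follows. This is a toy illustration, not the get_ranking implementation in Query.py: the rank_sketch function, its use_views flag, the scoring rule (one point per matching posting) and the sample data are all assumptions.

```python
from collections import Counter

# Toy index in the term -> ((doc_id, zone), ...) shape of index_rdd.
INDEX = {
    "pop": (("d1", "keyWords"), ("d2", "title")),
    "art": (("d1", "categories"), ("d3", "title")),
    "culture": (("d1", "title"), ("d2", "title")),
}
# Toy view counts in the doc_id -> view_count shape of view_rdd.
VIEWS = {"d1": 10, "d2": 500, "d3": 40}


def rank_sketch(query_term=None, zone_restriction=None, use_views=False):
    """Rank doc_ids by the number of matching postings (an assumed scheme)."""
    scores = Counter()
    if query_term is not None:
        # Unrestricted search: a match in any zone counts.
        for term in query_term.lower().split():
            for doc_id, _zone in INDEX.get(term, ()):
                scores[doc_id] += 1
    elif zone_restriction is not None:
        # Restricted search: a posting counts only if its zone matches.
        for zone, terms in zone_restriction.items():
            for term in terms.lower().split():
                for doc_id, posting_zone in INDEX.get(term, ()):
                    if posting_zone == zone:
                        scores[doc_id] += 1

    def sort_key(doc_id):
        views = VIEWS.get(doc_id, 0) if use_views else 0
        # Higher score first, then higher view count, then doc_id.
        return (-scores[doc_id], -views, doc_id)

    return sorted(scores, key=sort_key)


print(rank_sketch(query_term="pop culture"))                  # ['d1', 'd2']
print(rank_sketch(query_term="pop culture", use_views=True))  # ['d2', 'd1']
print(rank_sketch(zone_restriction={"title": "culture", "categories": "art"}))  # ['d1', 'd2']
```

In this sketch the view count only breaks ties between documents with equal match scores; how the project actually combines the two signals is not stated here.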

Dependencies for this project

  • nltk library
  • Download the "wordnet" corpus, which the nltk lemmatizer requires.
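
A minimal sketch of the WordNet setup, assuming nltk is already installed (for example with pip). The lemmatize_terms helper is hypothetical and only shows how the corpus and the lemmatizer fit together; depending on the nltk version, additional resources may be requested at load time.

```python
def lemmatize_terms(terms):
    """Lemmatize terms with NLTK's WordNet lemmatizer."""
    # Imported lazily so this sketch can be loaded without nltk installed.
    import nltk
    from nltk.stem import WordNetLemmatizer

    # Fetch the WordNet corpus; nltk skips the download when the local
    # copy is already up to date.
    nltk.download("wordnet", quiet=True)
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(term.lower()) for term in terms]


# Example call (needs nltk and network access on the first run):
# lemmatize_terms(["Books", "cultures"])
```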

Instructions for testing the code with sample data.

docsearch's People

Contributors

bharathramh92, midhunmathewsunny
