CS 410 Search Buddy

Go to Installation
Go to Libraries/Frameworks Used
Go to Application Structure

Team Members: Joanna Huang (joannah2), Drshika Asher (drshika2), Brooke Novosad (novosad3), Rainy Yan (yuzheng9)
Team Captain: Brooke Novosad (novosad3)
Team Name: Autumn Lovers
Presentation: Presentation Video

Project Description: Our application lets you enter any CS410-related search query and returns the 10 most relevant course materials (CS410 lecture transcripts and slides), each with a blurb that matches your query.

Installation

  1. Have python3 and pip installed: tutorial

  2. Clone Repository Locally

HTTPS

$ git clone https://github.com/drshika/CS410-Final-Project.git

SSH

$ git clone git@github.com:drshika/CS410-Final-Project.git

  3. Install requirements with pip:

$ pip install -r requirements.txt

  4. Run the app on localhost:

$ python app.py

or

$ flask run

  5. Open in browser

 * Serving Flask app 'app'
 * Debug mode: on
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on http://127.0.0.1:5000

Paste http://127.0.0.1:5000 (or whichever address and port Flask prints) into your browser to view the webpage.

Libraries

  • Rank_bm25: this library provides the Okapi BM25 ranking function, which we use to get the names of the 10 most relevant documents.
  • Flask: this framework handles the GET and POST requests that carry the user query to our backend and return the search results.
  • Numpy: this library was used to sort the list of results from most to least relevant.
  • BeautifulSoup: this library was used to parse the HTML pages from the CS410 course website to get the documents that users can search.
  • PyPDF2: this library was used to extract text from the PDFs.
  • Pytest: this library was used to test our program.
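
To give a feel for the ranking step, here is a minimal pure-Python sketch of Okapi BM25 scoring followed by a top-10 cut. It is an illustration only: the app itself relies on Rank_bm25, and the function names (`bm25_scores`, `top_n`), the whitespace tokenizer, the parameter defaults (`k1=1.5`, `b=0.75`), and the smoothed IDF variant are all assumptions made for this example, so the exact scores may differ from what the library returns.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Return one BM25 score per tokenized document (hypothetical helper)."""
    if not corpus_tokens:
        return []
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: how many documents contain each term.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            df = doc_freq[term]
            # Smoothed IDF; the "+ 1" inside the log keeps it positive.
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            tf = term_freq[term]
            # Length normalization: longer-than-average documents are penalized.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_n(query, documents, n=10):
    """Rank a {name: text} mapping and return up to n matching document names."""
    names = list(documents)
    corpus_tokens = [documents[name].lower().split() for name in names]
    scores = bm25_scores(query.lower().split(), corpus_tokens)
    ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
    # Documents that share no term with the query score 0 and are dropped.
    return [name for name, score in ranked[:n] if score > 0]
```

Calling `top_n("bm25 ranking", documents)` with a dictionary that maps document names to their text returns the best-matching names first.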

Structure

├────── extraction_scripts/             # Scripts used to process raw HTML lecture transcripts and slides
│      ├── get_pdf.py                   # Scrapes lecture slide PDFs
│      └── get_transcripts.py           # Scrapes lecture Transcripts
├───── html_files/                      # Raw HTML files from CS410 Coursera Website
├───── lecture_slides/                  # CS410 PDF lecture slides
├───── lecture_slides_extractions/      # Text from CS410 lecture slides
├───── lecture_transcripts/             # Transcripts from CS410 lecture videos
├───── static/images                    # Images used in this repository
├───── project_docs/                    # Misc Project Docs
│     ├── Project Progress Report.md
│     ├── Project Proposal.md
│     └── sources.txt                   # Reference Code Documentation
├───── templates/                       # HTML template pages       
│     ├── answer.html                   # Query response page
│     └── home.html                     # Home page
├───── .gitignore
├───── app.py                           # Main Flask driver code
├───── ranking_function.py              # Utility function to return ranked documents
├───── README.md
├───── test_pytest.py                   # Tests different queries against hand-calculated rankings
└───── requirements.txt                 # Used to install required Python libraries
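
The idea behind test_pytest.py (comparing a query's ranking with one worked out by hand) can be sketched as below. This is not the project's test file: `rank_by_overlap`, the `DOCUMENTS` dictionary, and the lecture names are hypothetical stand-ins so that the example runs on its own, and the toy ranker counts shared terms instead of using BM25.

```python
def rank_by_overlap(query, documents):
    """Hypothetical stand-in ranker: order names by number of shared query terms."""
    query_terms = set(query.lower().split())
    overlap = {
        name: len(query_terms & set(text.lower().split()))
        for name, text in documents.items()
    }
    # Highest overlap first; ties are broken alphabetically by name.
    return sorted(overlap, key=lambda name: (-overlap[name], name))


# Toy corpus invented for this example.
DOCUMENTS = {
    "lecture_a": "vector space model for text retrieval",
    "lecture_b": "probabilistic retrieval model",
    "lecture_c": "topic mining with plsa",
}


def test_query_matches_hand_ranking():
    # Hand count of shared terms for "text retrieval model": a=3, b=2, c=0.
    expected = ["lecture_a", "lecture_b", "lecture_c"]
    assert rank_by_overlap("text retrieval model", DOCUMENTS) == expected
```

Pytest collects any function whose name starts with `test_`, so saving this in a file and running `pytest` executes the check.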

Team Contributions

  1. Proposal preparation and revision (5 hrs) - all
  2. Project scoping and approach analysis (5 hrs) - all
  3. Collect all documents in CS 410 Coursera (5 hrs) - Brooke, Drshika, Joanna
  4. Data clean up (tokenization, remove stop words, etc.) (20 hrs - verification by hand included) - Brooke, Rainy, Drshika
  5. Data processing, analysis, & storage (5 hrs) - Brooke, Rainy
  6. Website
  7. Frontend (5 hrs) - Drshika
  8. Backend (5 hrs) - Brooke, Rainy, Drshika
  9. Data routing (10 hrs) - Brooke, Rainy, Drshika
  10. Search algorithms (5 hrs) - Brooke, Rainy
  11. Summarization blurb for each query (5 hrs) - Brooke
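
The data clean-up step listed above (tokenization and stop-word removal) could look roughly like the sketch below. It is a guess at the general shape, not the project's code: the `clean_tokens` name, the regular expression, and the very short `STOP_WORDS` set are assumptions chosen to keep the example small.

```python
import re

# Tiny illustrative stop-word list; an assumption, not the project's list.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}


def clean_tokens(text):
    """Lowercase the text, keep alphanumeric runs, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]
```

Applying the same function to both the documents and the incoming query keeps the two vocabularies consistent before ranking.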
