Code Monkey home page Code Monkey logo

arxiv-recommendations's Introduction

Zotero and arXiv Recomendation System

-- Project Status: [ Active ]

Project Intro/Objective

This project contains some code meant to analyse my Zotero library containg journal articles I have been amassing for many years. The objective is to build some scraping and recommendation system on the arXiv in order to find and classify new papers and help find relevant research tuned to my interests. This project can be viewed as an improvement over the basic scraper and keyword highlighter I previously developped: arxiv_scanner_flask.

Methods Used

  • Data Analysis
  • Machine Learning
  • Data Visualization
  • Predictive Modeling
  • Content-based recommendation system
  • Natural Language Processing
  • Web scraping

Technologies

  • Python
  • Pandas, Scikit-learn, numpy

Project Description

This project involved various critical aspects of data science and is meant as a training project to bring to production some useful product. The various interesting steps are:

  • ETL: Merge my Zotero library and a random arXiv sample of papers (different schema).
  • Analysis: Analyze the dataset of arXiv articles (included in my library or not) in order to identify trends in topics, authors.
  • Encoding: A novel technical aspect I used this project to train myself on is encoding text-based features.
  • Recommender system: Using the sparsely encoded title, author list and category list of the articles, I built and compared various cosine similarity matrices which then served as recommendation matrices. This is a simple yet very effective system.
  • Classifier: I built an unsupervised clustering system for topics using the summary column and non-negative matrix factorization. Each author is then attributed a list of most frequent topics which will be used as encoding for authors. Therefore, similarity between authors will now mean similarity of interests instead of textual similarity.
  • Packaging: This project can be ran as a standalone script with any arXiv identifier. The program will first pull the article from arXiv and then run the similarity pipeline before returning recommendations.

Getting Started

The notebook dealing with the data merging and arXiv random sampling can be found in this notebook. The proof of concept for the recommender system and the classifier can be found in this notebook. The improvement with target encoding for authors can be found in this notebook.

To see how to use this code, just run python3 main.py --help or any of the command below:

# Get recommendations for an article
python3 main.py 2303.17685

# Change the number of recommendations
python3 main.py 2303.17685 -n 30

# Use the target encoding instead of basic encoding for authors (slower)
python3 main.py 2303.17685 --encode_topics

arxiv-recommendations's People

Contributors

nicolaschagnet avatar

Stargazers

 avatar

Watchers

Kostas Georgiou avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.