arxiv-doc2vec-recommender's Introduction

SciExplorer

A document similarity engine for discovering related scientific articles using distributed representations (Doc2Vec), trained on AWS on 600,000+ articles from arXiv.org. The vectors are used to return articles similar to a search query (which can be entire paragraphs), or to any article found by browsing through the interface.

A serendipitous find was that the documents and the similarities between them form a weighted graph. I used a community detection algorithm on this weighted graph to visualize the relationships between topics (whose vectors were calculated by simply averaging their document vectors).
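The averaging step could look roughly like the sketch below. The function name, the toy three-dimensional vectors, and the topic labels are all made up for illustration; real vectors would come from the trained model.

```python
from collections import defaultdict


def average_topic_vectors(doc_vectors, doc_topics):
    """Return {topic: mean vector} from per-document vectors and topic labels."""
    sums = {}
    counts = defaultdict(int)
    for doc_id, vec in doc_vectors.items():
        topic = doc_topics[doc_id]
        if topic not in sums:
            sums[topic] = [0.0] * len(vec)
        # Accumulate the component-wise sum for this topic.
        sums[topic] = [s + v for s, v in zip(sums[topic], vec)]
        counts[topic] += 1
    return {t: [s / counts[t] for s in total] for t, total in sums.items()}


# Hypothetical document vectors and topic assignments.
doc_vectors = {
    "doc-a": [1.0, 0.0, 2.0],
    "doc-b": [3.0, 2.0, 0.0],
    "doc-c": [0.0, 4.0, 4.0],
}
doc_topics = {"doc-a": "hep-th", "doc-b": "hep-th", "doc-c": "cs.CL"}

topic_vectors = average_topic_vectors(doc_vectors, doc_topics)
```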

Motivation

Knowledge required to tackle complex problems is often siloed in disparate disciplines, each with their own unique lexica and citation communities. There's an opportunity for text-mining to find overlaps across disciplines that could lead to innovative research.

Recent advances in Natural Language Processing have given us new tools for overcoming these barriers to discovery. Specifically, a set of machine learning algorithms developed at Google (Word2Vec) learns to represent text using a fixed number of dimensions, which together encode aspects of a word's or document's meaning. For example, we can locate the "gender" direction that the algorithm has encoded automatically by subtracting the vector for "woman" from that of "man". Adding this difference to "aunt" yields a vector close to "uncle", and so on.

Here are some examples from pre-trained word vectors:

  • Iraq - Violence = Jordan
  • Human - Animal = Ethics
  • Library - Books = Hall

Semantically similar words have very similar vectors, so the cosine similarity between, for example, "strong" and "powerful" will be approximately 1.
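To make the vector arithmetic and cosine similarity concrete, here is a minimal sketch on hand-made two-dimensional vectors. The numbers are invented so that one axis behaves like "gender"; real word vectors have many more dimensions and are learned from text, so results there are only approximate.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Invented toy vectors: axis 0 plays the role of "gender", axis 1 of "relative".
vectors = {
    "man": [1.0, 0.0],
    "woman": [-1.0, 0.0],
    "uncle": [1.0, 1.0],
    "aunt": [-1.0, 1.0],
}

# "man" minus "woman" isolates the gender direction.
gender = [m - w for m, w in zip(vectors["man"], vectors["woman"])]

# Adding that direction to "aunt" should land near "uncle".
candidate = [a + g for a, g in zip(vectors["aunt"], gender)]

# Pick the known word (other than the starting word) closest to the result.
nearest = max(
    (word for word in vectors if word != "aunt"),
    key=lambda word: cosine_similarity(candidate, vectors[word]),
)
```

In this toy setup `nearest` is "uncle". Note that cosine similarity ignores vector length, so two vectors pointing the same way score 1 even if one is longer.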

How SciExplorer works

I used the Open Archives Initiative's API to gather metadata and abstracts for over 600,000 articles published on arXiv in the past 10 years. This data is parsed, cleaned, and placed into a PostgreSQL database.
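As an illustration of the parsing and cleaning step, the sketch below extracts the identifier, title, and abstract from an OAI-PMH ListRecords response. It assumes the Dublin Core (oai_dc) metadata format, which may not be the format the project actually harvested. The sample record is made up, and no network request or database insert is shown.

```python
import xml.etree.ElementTree as ET

# Standard OAI-PMH and Dublin Core namespace URIs.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Made-up response body standing in for what the harvester would receive.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:0001</identifier>
        <datestamp>2015-01-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>  A made-up
            title </dc:title>
          <dc:description>A made-up abstract.</dc:description>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_records(xml_text):
    """Return one dict per record with whitespace-normalized text fields."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS
        )
        title = record.findtext(".//dc:title", default="", namespaces=NS)
        abstract = record.findtext(".//dc:description", default="", namespaces=NS)
        rows.append(
            {
                "id": identifier.strip(),
                # Collapse line breaks and runs of spaces left over from the XML.
                "title": " ".join(title.split()),
                "abstract": " ".join(abstract.split()),
            }
        )
    return rows


rows = parse_records(SAMPLE_RESPONSE)
```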

SciExplorer's engine uses Word2Vec and Doc2Vec, which produce word and document vectors via a simple, two-layer neural network that trains on raw text, analyzing words in their local context. Training was done on a multi-processor EC2 instance in an online manner, streaming documents so that the entire corpus never has to be loaded into memory.
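One way to feed a corpus to a trainer without holding it in memory is a re-iterable object that yields one tokenized document at a time. The sketch below is an assumption about how that could be done, not the project's code: it reads from a two-column CSV source rather than the database, and `TaggedDoc` is a local stand-in that mirrors the (words, tags) shape of gensim's `TaggedDocument`.

```python
import csv
import io
import re
from collections import namedtuple

# Local stand-in with the same fields as gensim's TaggedDocument.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])


class StreamingCorpus:
    """Re-iterable corpus that yields one tokenized document at a time."""

    def __init__(self, open_source):
        # open_source: zero-argument callable returning a fresh file-like
        # object of "article_id,abstract" rows (hypothetical layout).
        self.open_source = open_source

    def __iter__(self):
        # Reopen the source on every pass, since training typically makes
        # several passes over the data.
        with self.open_source() as handle:
            for article_id, abstract in csv.reader(handle):
                words = re.findall(r"[a-z0-9]+", abstract.lower())
                yield TaggedDoc(words=words, tags=[article_id])


# Made-up rows standing in for a large file on disk.
SAMPLE = "0001,Quantum fields on a lattice\n0002,Neural word embeddings\n"
corpus = StreamingCorpus(lambda: io.StringIO(SAMPLE))

first_pass = [doc.tags[0] for doc in corpus]
second_pass = [doc.tags[0] for doc in corpus]  # works again: not a one-shot generator
```

A plain generator would be exhausted after the first pass, which is why the iteration logic lives in `__iter__` of a class.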

The number of hidden layer neurons is a tuning parameter that can be tweaked to yield either more exact matches or fuzzier searches. The current model employs only 100 neurons, which is actually rather few. As we scale up the data to include full-text PDFs, we can increase that number because we will have more data to train those extra features.

When a query is sent in, the model infers a vector for the query text based on the text it has already seen during training. We then rank the documents in the database by cosine similarity to that vector and return the closest matches.
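The ranking half of that lookup can be sketched as follows, assuming a query vector has already been inferred by the model. The vectors and document ids are invented, and normalizing first so that cosine similarity reduces to a dot product is a design choice of this sketch, not something the source describes.

```python
import heapq
import math


def unit(vec):
    """Scale a vector to length 1 (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]


def most_similar(query_vec, doc_vectors, top_n=2):
    """Return the top_n (doc_id, cosine similarity) pairs, best first."""
    q = unit(query_vec)
    scored = (
        # For unit vectors the dot product equals the cosine similarity.
        (sum(a * b for a, b in zip(q, unit(vec))), doc_id)
        for doc_id, vec in doc_vectors.items()
    )
    return [(doc_id, score) for score, doc_id in heapq.nlargest(top_n, scored)]


# Hypothetical stored document vectors and an already-inferred query vector.
doc_vectors = {"a": [1.0, 0.0], "b": [0.8, 0.6], "c": [0.0, 1.0]}
query_vec = [1.0, 0.1]

results = most_similar(query_vec, doc_vectors, top_n=2)
```

With hundreds of thousands of documents, the stored vectors would normally be normalized once up front rather than on every query.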

The topic visualizer employs D3 and a community detection algorithm (Louvain) to find clusters of meaning, revealing several neighborhoods whose topics come from separate parent categories. This visualization can be used to discover neighboring topics that may overlap semantically with the user's primary interest.
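Before community detection can run, the topic vectors have to be turned into a weighted graph. The sketch below shows one plausible way to build the edge list, using cosine similarity as the edge weight and dropping weak links. The threshold, function names, and vectors are assumptions, and the Louvain step itself is left to a graph library and is not shown.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_edges(topic_vectors, min_weight=0.5):
    """Return (topic_a, topic_b, weight) for every pair at or above min_weight."""
    edges = []
    for a, b in combinations(sorted(topic_vectors), 2):
        weight = cosine_similarity(topic_vectors[a], topic_vectors[b])
        if weight >= min_weight:
            edges.append((a, b, weight))
    return edges


# Invented topic vectors; real ones would be averages of document vectors.
topic_vectors = {
    "hep-th": [1.0, 0.0],
    "hep-ph": [0.9, 0.1],
    "cs.CL": [0.0, 1.0],
}

edges = similarity_edges(topic_vectors, min_weight=0.5)
```

The resulting edge list is the kind of input a Louvain implementation would take, and the same list could be serialized as links for a D3 force layout.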

arxiv-doc2vec-recommender's People

Contributors

sallamander, sepehr125

arxiv-doc2vec-recommender's Issues

Unable to install some libraries

PackagesNotFoundError: The following packages are not available from current channels:

  - psycopg2==2.6.1=py34_0
  - qt==4.8.7=1
  - scipy==0.16.1=np110py34_0
  - ptyprocess==0.5=py34_0
  - tk==8.5.18=0
  - readline==6.2=2
  - freetype==2.5.5=0
  - xz==5.0.5=0
  - terminado==0.5=py34_1
  - sqlite==3.9.2=0
  - numpy==1.10.2=py34_0
  - pexpect==3.3=py34_0
