Microsoft Academic Graph

Big Data project for coursework Z604 Big Data Analytics

Team Members

Task 1: Citation Recommendation Problem

In Task 1, given a paper ID, we predict the paper's references (citations) from its keywords, authors, and other paper metadata. This problem matters because a solution can recommend further reading to researchers: papers similar to the one they are already interested in.

>SAMPLING:

  • Sampling.java: This class primarily does the work of sampling. We started with a random sample of 100,000 papers and fetched corresponding data for these papers from various other tables.
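The sampling step can be illustrated with a minimal Python sketch. It is not the code in Sampling.java: the function names, the `paper_id` key, and the in-memory row format are assumptions made for illustration only.

```python
import random


def sample_paper_ids(paper_ids, sample_size, seed=None):
    """Return a uniform random sample of paper ids, without replacement."""
    rng = random.Random(seed)
    ids = list(paper_ids)
    # Never ask for more ids than are available.
    return rng.sample(ids, min(sample_size, len(ids)))


def fetch_related_rows(rows, sampled_ids, key="paper_id"):
    """Keep only the rows of another table that belong to a sampled paper."""
    wanted = set(sampled_ids)
    return [row for row in rows if row[key] in wanted]
```

Under these assumptions, `sample_paper_ids(all_ids, 100_000)` would pick the papers, and `fetch_related_rows` would then be applied once per related table (authors, keywords, references).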

>DATA PREPARATION:

  • JsonToNeo4J.java: This class converts the JSON input file into separate CSV files, one per table, so that they can be imported into a Neo4j database as nodes and relationships.

  • Paper.java: This is a data class for holding Paper objects in the various tasks and other files.
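The JSON-to-CSV conversion done by JsonToNeo4J.java can be sketched roughly as follows, in Python. The record fields (`id`, `title`, `year`, `references`) and the CSV column names are hypothetical; the real schema of the input file is not described here.

```python
import csv
import io
import json


def json_to_csv_tables(json_text):
    """Split paper records into a node table and a relationship table (CSV text)."""
    papers = json.loads(json_text)
    nodes = io.StringIO()
    rels = io.StringIO()
    node_writer = csv.writer(nodes, lineterminator="\n")
    rel_writer = csv.writer(rels, lineterminator="\n")
    # Header rows; the column names are assumptions, not the project's schema.
    node_writer.writerow(["paper_id", "title", "year"])
    rel_writer.writerow(["citing_id", "cited_id"])
    for paper in papers:
        # One node row per paper.
        node_writer.writerow([paper["id"], paper["title"], paper["year"]])
        # One relationship row per reference (citation edge).
        for ref in paper.get("references", []):
            rel_writer.writerow([paper["id"], ref])
    return nodes.getvalue(), rels.getvalue()
```

Keeping nodes and relationships in separate files mirrors the description above: each file can then be imported into the graph database as its own table.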

>PREDICTION AND RANKING:

  • PaperScore.java: This is a data class that holds the individual scores of a paper and its calculated final score.

  • PaperScoreNormVars.java: This class contains static variables that hold the normalization constants. These are used to normalize the Author Popularity, Paper Rank, Keyword Match, and Paper Year scores.

  • PaperScoreWeights.java: This enum holds the weight given to each score in the paper score calculation.

  • Prediction.java: This class predicts citation recommendations for papers using the Recommendation class. It produces output for 20 to 100 predictions, in steps of 20.

  • Recommendation.java: This class ranks the citation recommendations. It calculates the individual scores, combines them into a final score, and sorts the recommendations in descending order of final score.
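The scoring and ranking described above can be sketched in Python as a weighted sum of normalized scores. This is only an illustration: the weight values are hypothetical (the real ones live in PaperScoreWeights.java and are not listed here), and normalizing each score by its maximum over the candidates is an assumption, not necessarily what PaperScoreNormVars.java does.

```python
from dataclasses import dataclass

# Hypothetical weights for the four scores named in this README.
WEIGHTS = {
    "author_popularity": 0.25,
    "paper_rank": 0.25,
    "keyword_match": 0.4,
    "paper_year": 0.1,
}


@dataclass
class PaperScore:
    paper_id: str
    scores: dict  # raw score per component, keyed like WEIGHTS
    final_score: float = 0.0


def normalize(value, max_value):
    """Scale a raw score into [0, 1]; assumed max-based normalization."""
    return value / max_value if max_value > 0 else 0.0


def rank_candidates(candidates, top_n):
    """Compute final scores and return the top_n candidates, best first."""
    maxima = {
        name: max((c.scores[name] for c in candidates), default=0.0)
        for name in WEIGHTS
    }
    for c in candidates:
        c.final_score = sum(
            weight * normalize(c.scores[name], maxima[name])
            for name, weight in WEIGHTS.items()
        )
    return sorted(candidates, key=lambda c: c.final_score, reverse=True)[:top_n]
```

Calling `rank_candidates(candidates, n)` for `n` in `range(20, 101, 20)` would reproduce the 20-to-100 prediction steps mentioned for Prediction.java.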

Task 2: Predict the Most Relevant Keywords for a Paper

In Task 2, given a paper ID, we try to predict keywords that the paper can be tagged with. Running the Python files listed below requires PaperCollection.json, because PageRank is computed over the 100,000 paper records it contains.

Important Files:

  • PageRank_Rec.py: Implements PageRank and computes the most relevant keywords for a target paper.

  • keyWordCloud.ipynb: Visualizes the output of the keyword prediction.

  • PaperCollection.json: Stores the subset of the paper data in JSON format.

  • buildGraphFromPy.json: Used to build a graph based on the schema designed in the .py environment.
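For readers unfamiliar with PageRank, here is a minimal power-iteration sketch in Python. It is not the code in PageRank_Rec.py: the graph format (a dict mapping each paper to the papers it cites), the damping factor of 0.85, and the fixed iteration count are assumptions. How the ranks are turned into keyword predictions is not described in this README, so that step is not shown.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank; graph maps a node to the nodes it links to."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    if not nodes:
        return {}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread over all nodes.
        dangling = sum(rank[node] for node in nodes if not graph.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for node, targets in graph.items():
            if targets:
                # Each node passes an equal share of its rank to every target.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

In a citation graph built this way, a paper cited by many well-ranked papers ends up with a high rank itself.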

Evaluation

Task 1

When testing datasets of different sizes, we got the best results from the largest one, which holds details of 100,000 papers and all their references. Increasing the dataset size from 1,000 to 100,000 papers increased overall precision and recall, so we decided to work primarily with the largest sampled dataset. With this dataset, we varied the number of predictions from 20 to 100 in steps of 20. The best precision, 18%, was obtained with 60 predictions.
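The precision and recall figures above depend on the number of predictions k. A minimal Python sketch of the computation follows, assuming the standard definitions (hits divided by k for precision, hits divided by the number of true references for recall); the exact evaluation code of the project is not shown in this README.

```python
def precision_recall_at_k(predicted, actual, k):
    """Precision and recall of the first k predictions against the true references."""
    top_k = predicted[:k]
    hits = len(set(top_k) & set(actual))
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall
```

Evaluating this for k in `range(20, 101, 20)` over every test paper and averaging the results would give one precision and one recall value per prediction size.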

Task 2

We computed precision and recall for this task and found the results unstable. Because the data under consideration is a subset, the ground truth built on it contained only 1 or 2 PRefIds in most cases. Each of these is either present in the predicted list or not, so most cases gave an exact match of 100%, a 50% match, or no match at all.
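The instability follows directly from the arithmetic: with so few ground-truth items, only a handful of match ratios are possible. A small Python illustration (the function name is hypothetical):

```python
def possible_match_ratios(ground_truth_size):
    """All match ratios reachable when the ground truth has this many items."""
    return [hits / ground_truth_size for hits in range(ground_truth_size + 1)]
```

With one ground-truth item the only possible ratios are 0 and 1; with two they are 0, 0.5, and 1, which matches the 100%, 50%, or no-match outcomes observed.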
