
Scientific Impact Prediction

This package computes citation predictions for papers and h-index predictions for authors using a dataset underlying the Semantic Scholar search service created by AI2. Here we describe how to acquire the data to train a collection of regression models and produce a collection of plots showing their relative performance.

If you would like to simply access the data and are uninterested in running the code then skip to the "Getting the data" section. Otherwise you will want to first clone this repository to a local directory:

git clone git@github.com:Lucaweihs/impact-prediction.git

Table of Contents

  1. Getting the data
  2. Code dependencies
  3. Training and comparing models
  4. Features used in prediction

Getting the data

Data can be downloaded manually as individual files. Alternatively, if you are only interested in producing predictions, the download_data.py script automatically downloads and extracts just the files needed to train models and produce predictions. To use the script, run the commands:

# Enter the impact prediction directory
cd path/to/impact-prediction
# Run the script to download the data
python download_data.py

We now describe the individual files and provide URLs to download them manually.

Data file descriptions

These data span the years 1975 to 2015. The features are generated using only information available in 2005, and we train models to predict outcomes in the years 2006-2015. The data come in two formats, tab-separated values files (.tsv) and JSON files (.json). All files are compressed with gzip, so be sure to decompress them (gzip -d filename). Authors are identified by their name and papers by a unique identifier. The paper ids correspond to those used by Semantic Scholar, and more information about a paper with a given id can be found there. For example, one can find more information about the paper with id

214899d16f39a494c3e69118c53a7b5877c0bbfc

by going to the URL:

www.semanticscholar.org/paper/214899d16f39a494c3e69118c53a7b5877c0bbfc.
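As a minimal sketch of the two steps above, the following reads a gzip-compressed file line by line without unpacking it on disk and builds the Semantic Scholar URL for a paper id. The helper names read_gz_lines and paper_url are illustrative and not part of the package; UTF-8 encoding and the https:// scheme are assumptions.

```python
import gzip


def read_gz_lines(path):
    """Yield the lines of a gzip-compressed text file without line endings."""
    with gzip.open(path, "rb") as raw:
        for line in raw:
            # Assumption: the files are UTF-8 encoded.
            yield line.decode("utf-8").rstrip("\r\n")


def paper_url(paper_id):
    """Build the Semantic Scholar page URL for a paper id."""
    return "https://www.semanticscholar.org/paper/" + paper_id


print(paper_url("214899d16f39a494c3e69118c53a7b5877c0bbfc"))
```

For example, list(read_gz_lines("paperIds-1975-2005-2015-2.tsv.gz")) would return the paper ids as a list, assuming the downloaded file keeps a .gz suffix.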

Author names

File name: authors-1975-2005-2015-2.tsv

Format: Every line is the name of an author taken from a paper.

Download link

Author features

File name: authorFeatures-1975-2005-2015-2.tsv

Format: The first line specifies the feature names and every other line represents the feature values for a particular author. These features are ordered to correspond to the authors from the "author names" file.

Download link
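Because the feature rows are aligned with the author names purely by line order, the two files have to be read in step. The sketch below pairs them up using small in-memory stand-ins for the real files. The function name pair_names_with_features is hypothetical, the sample values are made up, and it assumes every feature value is numeric and tab-separated.

```python
import csv
import io


def pair_names_with_features(names_file, features_file):
    """Return a list of (author_name, {feature_name: value}) in file order."""
    names = [line.rstrip("\r\n") for line in names_file]
    reader = csv.reader(features_file, delimiter="\t")
    header = next(reader)  # the first line holds the feature names
    rows = [[float(value) for value in row] for row in reader]
    if len(rows) != len(names):
        raise ValueError("author names and feature rows differ in length")
    return [(name, dict(zip(header, row))) for name, row in zip(names, rows)]


# In-memory stand-ins for the decompressed files (made-up values).
names = io.StringIO(u"A. Author\nB. Author\n")
features = io.StringIO(u"author_hindex\tauthor_age\n4\t7\n9\t12\n")
paired = pair_names_with_features(names, features)
print(paired[1][0], paired[1][1]["author_hindex"])  # B. Author 9.0
```

The same pattern applies to the responses and histories files, which follow the same ordering.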

Author responses

File name: authorResponses-1975-2005-2015-2.tsv

Format: The first line specifies the column names, which are of the form total_citations_in_NUMBER or hindex_in_NUMBER. Here NUMBER is the number of years since 2006, so a value of 5 in the hindex_in_5 column means that the author corresponding to that row had an h-index of 5 in the year 2006 + 5 = 2011. These responses are ordered to correspond to the authors from the "author names" file.

Download link
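The column naming scheme above can be decoded mechanically. This is a small sketch that maps a response column name to the measure and the calendar year it refers to, using the 2006 base year stated in the format description; response_column_year is an illustrative name, not a function from the package.

```python
import re

BASE_YEAR = 2006  # NUMBER counts years since 2006, per the format description

_COLUMN = re.compile(r"^(total_citations|hindex)_in_(\d+)$")


def response_column_year(column):
    """Split a response column name into (measure, calendar_year)."""
    match = _COLUMN.match(column)
    if match is None:
        raise ValueError("unrecognized response column: %r" % column)
    return match.group(1), BASE_YEAR + int(match.group(2))


print(response_column_year("hindex_in_5"))  # ('hindex', 2011)
```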

Author histories

File name: authorHistories-hindex-1975-2005-2015-2.tsv

Format: Every line corresponds to the h-index of an author since the beginning of their career until 2005. These histories are ordered to correspond to the authors from the "author names" file.

Example: If an author's career is 5 years old by 2005 and their per-year h-index is 1, 1, 2, 3, 4, then the line corresponding to that author would be 1 1 2 3 4

Download link
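A history line like the one in the example can be turned into a list of integers, from which the career length and the most recent h-index follow directly. This is a sketch only: parse_history is a hypothetical helper, and it splits on any whitespace because the .tsv extension suggests tabs while the example shows spaces. It also assumes every entry is an integer.

```python
def parse_history(line):
    """Parse one history line into a list of per-year integer values."""
    return [int(value) for value in line.split()]


history = parse_history("1 1 2 3 4")
print(len(history))  # career length in years by 2005: 5
print(history[-1])   # h-index in the final recorded year: 4
```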

Paper ids

File name: paperIds-1975-2005-2015-2.tsv

Format: Each line corresponds to a single paper id. These ids are unique identifiers of papers.

Download link

Paper features

File name: paperFeatures-1975-2005-2015-2.tsv

Format: The first line specifies the feature names and every other line represents the feature values for a particular paper. These features are ordered to correspond to the papers from the "paper ids" file.

Download link

Paper responses

File name: paperResponses-1975-2005-2015-2.tsv

Format: Each line corresponds to the observed cumulative citation counts of a paper in the years between 2006 and 2015. These responses are ordered to correspond to the papers from the "paper ids" file.

Download link

Paper histories

File name: paperHistories-1975-2005-2015-2.tsv

Format: Same as for "author histories" but replacing authors with papers and the h-index with cumulative citation counts.

Download link

Citation graph

File name: citationGraph-1975-2015.json

Format: Each line corresponds to a json dictionary with the following fields:

  • id - a paper id
  • cites - a list of the paper ids cited by id

Notes: The citation graph includes all papers published between 1975 and 2015.

Download link
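Since each line of the citation graph is a standalone JSON dictionary, the file can be processed as a stream. The sketch below counts how many times each paper is cited, using three made-up records in place of the real file; count_incoming_citations is an illustrative name, not part of the package.

```python
import json
from collections import Counter


def count_incoming_citations(lines):
    """Count, for every paper id, how many listed papers cite it."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        counts.update(record["cites"])
    return counts


# Made-up records in the documented format.
sample = [
    '{"id": "p1", "cites": ["p2", "p3"]}',
    '{"id": "p2", "cites": ["p3"]}',
    '{"id": "p3", "cites": []}',
]
incoming = count_incoming_citations(sample)
print(incoming["p3"])  # 2
```

The same function should work on the "key citation graph" file, since it is described as having the same format.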

Key citation graph

File name: keyCitationGraph-1975-2015.json

Format: Exactly as for the "citation graph" file but only includes key citations between papers.

Download link

Paper meta data

File name: paperIdToPaperData-1975-2015.json

Format: Every line is a json dictionary corresponding to a single paper with the following fields:

  • citations-in-year - a dictionary where the keys are years and the values are the number of citations the paper received in that year.
  • is-survey - true/false depending on whether or not the paper is a survey.
  • year - the publication year.
  • id - the paper's id.
  • authors - a list of the paper's authors.
  • venue - the venue where the paper was published.

Download link
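As an illustration of how the citations-in-year field can be used, the sketch below sums a paper's citations up to a chosen year. The record is made up, and cumulative_citations is a hypothetical helper. Note that JSON object keys are always strings, so the years are converted to integers before comparison.

```python
import json


def cumulative_citations(paper, through_year):
    """Total citations a paper received up to and including through_year."""
    per_year = paper["citations-in-year"]
    return sum(count for year, count in per_year.items()
               if int(year) <= through_year)


# A made-up record in the documented format.
record = json.loads(
    '{"id": "p1", "year": 2001, "is-survey": false, '
    '"authors": ["A. Author"], "venue": "V", '
    '"citations-in-year": {"2002": 3, "2004": 2, "2008": 7}}'
)
print(cumulative_citations(record, 2005))  # 5
print(cumulative_citations(record, 2015))  # 12
```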

Code dependencies

This project is written in Python 2.7.11 using the following packages, divided into several categories.

Data reading, representation, serialization:

  • cPickle
  • configparser
  • gzip
  • pandas
  • subprocess

Parallel processing:

  • joblib
  • multiprocessing

Modeling/mathematics:

  • numpy
  • scipy
  • sklearn
  • tensorflow

Plotting:

  • matplotlib
  • seaborn

The majority of these come preinstalled with, or can be easily installed through, any scientific Python package manager, e.g. Anaconda. If you have Anaconda installed you can easily set up an "impact-prediction" environment containing the necessary packages by running the command

conda env create -f environment.yml

in the terminal from within the project.

Training and comparing models

Note that the training process can take many hours even on a powerful machine, so be prepared. After a model has finished training we save the results so that you do not have to train it again. We assume you have downloaded the data using the download_data.py script described in the "Getting the data" section. To train a collection of models for h-index prediction and produce a collection of associated plots (placed in the ./plots directory) you can run the following command from within the impact-prediction directory:

python author_predictions.py hindex "author_hindex:4;author_age:5,12"

The created plots show the MAPE, R^2, and PA-R^2 metrics of the various trained algorithms on training, validation, and testing datasets. These plots are named to be self-descriptive. The above code trains and tests only on those authors with an h-index >= 4 by 2005 and whose career length was between 5 and 12 years in 2005.
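For readers unfamiliar with the metrics, here is a sketch of MAPE and R^2 under their standard textbook definitions. Whether the package computes them in exactly this form is an assumption, and PA-R^2 is left out because its definition is not given here. The numbers are made up.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error; assumes no true value is zero."""
    errors = [abs((t - p) / float(t)) for t, p in zip(y_true, y_pred)]
    return sum(errors) / len(errors)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / float(len(y_true))
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [4, 8, 10]  # made-up observed h-indices
y_pred = [5, 8, 8]   # made-up predictions
print(round(mape(y_true, y_pred), 3))       # 0.15
print(round(r_squared(y_true, y_pred), 3))  # 0.732
```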

To train models for paper citation prediction you can run the command:

python paper_predictions.py citation "paper_citations:5"

As above, this will create a number of plots in the "plots" directory. The command includes only those papers with >= 5 citations by 2005.
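The filter strings passed to both commands follow a visible pattern: feature names with either a lower bound or a lower and upper bound, separated by semicolons. The sketch below shows one way such a string could be interpreted. It is not the package's own parser; parse_filters and passes are hypothetical names, and treating both bounds as inclusive is an assumption.

```python
def parse_filters(spec):
    """Parse 'name:min' or 'name:min,max' entries separated by ';'."""
    filters = {}
    for part in spec.split(";"):
        name, _, bounds = part.partition(":")
        values = [float(value) for value in bounds.split(",")]
        lower = values[0]
        upper = values[1] if len(values) > 1 else float("inf")
        filters[name] = (lower, upper)
    return filters


def passes(features, filters):
    """True if every filtered feature lies within its (inclusive) bounds."""
    return all(lower <= features[name] <= upper
               for name, (lower, upper) in filters.items())


filters = parse_filters("author_hindex:4;author_age:5,12")
print(passes({"author_hindex": 6, "author_age": 9}, filters))   # True
print(passes({"author_hindex": 6, "author_age": 20}, filters))  # False
```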

Features used in prediction

We have two sets of features used in our predictions, one for authors and one for papers. The features are listed below.

Author Features

| Feature Name | Description |
| --- | --- |
| author_hindex | H-index |
| author_hindex_delta | Change in h-index over the last 2 years |
| author_citation_count | Cumulative citation count |
| author_key_citation_count | Cumulative key citation count (see Zhu et al. 2015) |
| author_citations_delta_{0,1} | Citations this year and 1 year ago |
| author_key_citations_delta_{0,1} | Key citations this year and 1 year ago |
| author_mean_citations_per_paper | Mean number of citations per paper |
| author_mean_citation_per_paper_delta | Change in mean number of citations per paper over last 2 years |
| author_mean_citations_per_year | Mean number of citations per year |
| author_papers | Number of papers published |
| author_papers_delta | Number of papers published in last 2 years |
| author_mean_citation_rank | Rank of author (between 0 and 1) among all other authors in terms of mean citations per year |
| author_unweighted_pagerank | PageRank of author in unweighted coauthorship network |
| author_weighted_pagerank | PageRank of author in weighted coauthorship network |
| author_age | Career length (years since first paper published) |
| author_recent_num_coauthors | Total number of coauthors in last 2 years |
| author_max_single_paper_citations | Max number of citations for any of author's papers |
| venue_hindex_{mean, min, max} | H-indices of venues author has published in |
| venue_hindex_delta_{mean, min, max} | Change in h-index over last 2 years for venues author has published in |
| venue_citations_{mean, min, max} | Mean citations per paper of venues author has published in |
| venue_citations_delta_{mean, min, max} | Change in mean citations per paper over last 2 years for venues author has published in |
| venue_papers_{mean, min, max} | Number of papers in venues in which the author has published |
| venue_papers_delta_{mean, min, max} | Change in number of papers in venues in which the author has published over the last 2 years |
| venue_rank_{mean, min, max} | Ranks of venues (between 0 and 1) in which the author has published, determined by mean number of citations per paper |
| venue_max_single_paper_citations_{mean, min, max} | Maximum number of citations any paper published in a venue has received, for each venue the author has published in |
| total_num_venues | Total number of venues published in |
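Several of the features above build on the h-index: the largest h such that h papers have at least h citations each. The sketch below computes it from a list of per-paper citation counts using this standard definition. It is not taken from the package's feature code, and the citation counts are made up.

```python
def h_index(citation_counts):
    """Largest h such that h papers each have at least h citations."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, citations in enumerate(ranked, start=1):
        if citations >= rank:
            h = rank
        else:
            break
    return h


print(h_index([10, 8, 5, 4, 3]))  # 4
print(h_index([25, 8, 5, 3, 3]))  # 3
```

Applied to the papers of a venue instead of an author, the same function gives a venue h-index.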

Paper Features

| Feature Name | Description |
| --- | --- |
| author_hindex_{mean, min, max} | H-indices of authors |
| author_hindex_delta_{mean, min, max} | Change in h-indices of authors in last 2 years |
| author_citations_{mean, min, max} | Cumulative citations for each author |
| author_citations_delta_{mean, min, max} | Change in cumulative citations for each author in last 2 years |
| author_key_citations_{mean, min, max} | Cumulative key citations for each author |
| author_key_citations_delta_{mean, min, max} | Change in cumulative key citations for each author in last 2 years |
| author_mean_citations_{mean, min, max} | Mean citations per paper for each author |
| author_mean_citations_delta_{mean, min, max} | Change in mean citations per paper for each author in last 2 years |
| author_papers_{mean, min, max} | Number of papers published for each author |
| author_papers_delta_{mean, min, max} | Number of papers published for each author in last 2 years |
| author_unweighted_pagerank_{mean, min, max} | PageRank of each author in the unweighted coauthorship network |
| author_weighted_pagerank_{mean, min, max} | PageRank of each author in the weighted coauthorship network |
| author_mean_citation_rank_{mean, min, max} | Rank of each author among all authors in terms of mean citations per paper |
| author_recent_num_coauthor_{mean, min, max} | Number of coauthors each author had in last 2 years |
| author_max_single_paper_citations_{mean, min, max} | Maximum citations a single paper of each author has received |
| total_num_authors | Total number of authors for the paper |
| venue_hindex | H-index of the venue |
| venue_hindex_delta | Change in h-index of the venue in last 2 years |
| venue_mean_citations | Mean citations per paper published in the venue |
| venue_mean_citations_delta | Change in mean citations per paper published in the venue in last 2 years |
| venue_papers | Number of papers published in the venue |
| venue_papers_delta | Number of papers published in the venue in last 2 years |
| venue_rank | Rank of the venue among all venues in terms of mean citations per paper |
| venue_max_single_paper_citations | Maximum number of citations any paper published in the venue has received |
| paper_age | Age of the paper in years (rounded up) |
| paper_citations | Cumulative citation count |
| paper_key_citations | Cumulative key citation count |
| paper_mean_citations_per_year | Average number of citations received per year |
| is_survey | Whether or not the paper is a survey |
| paper_citations_delta_{0,1} | Number of citations the paper received in the last year and the year before that |
| paper_key_citations_delta_{0,1} | Number of key citations the paper received in the last year and the year before that |
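The {mean, min, max} suffixes indicate that a per-author quantity is summarized across all authors of a paper. As a rough sketch of that aggregation, the following expands one author-level feature into its three paper-level variants. The helper name aggregate_author_feature is hypothetical and the values are made up.

```python
def aggregate_author_feature(name, values):
    """Summarize one per-author feature as <name>_mean, _min and _max."""
    values = [float(value) for value in values]
    return {
        name + "_mean": sum(values) / len(values),
        name + "_min": min(values),
        name + "_max": max(values),
    }


# Made-up h-indices for the three authors of one paper.
summary = aggregate_author_feature("author_hindex", [3, 7, 11])
print(summary["author_hindex_mean"])  # 7.0
```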



impact-prediction's Issues

ValueError: The truth value of an array with more than one element is ambiguous

Hi, Luca Weihs! I downloaded your source code from GitHub and ran it as described in README.md, but I got the error message below:

Traceback (most recent call last):
  File "paper_predictions.py", line 22, in <module>
    run_tests(config)
  File "/home/bdsirs/tianch/app/impact-prediction-master/test_runners.py", line 152, in run_tests
    mape_test_name, colors=ml_colors, markers=ml_markers)
  File "/home/bdsirs/tianch/app/impact-prediction-master/plotting.py", line 35, in plot_mape
    zorder=1, lw=3)
  File "/home/bdsirs/tianch/py2env/local/lib/python2.7/site-packages/matplotlib/pyplot.py", line 2929, in errorbar
    **kwargs)
  File "/home/bdsirs/tianch/py2env/local/lib/python2.7/site-packages/matplotlib/__init__.py", line 1898, in inner
    return func(ax, *args, **kwargs)
  File "/home/bdsirs/tianch/py2env/local/lib/python2.7/site-packages/matplotlib/axes/_axes.py", line 2906, in errorbar
    data_line = mlines.Line2D(x, y, **plot_line_style)
  File "/home/bdsirs/tianch/py2env/local/lib/python2.7/site-packages/matplotlib/lines.py", line 420, in __init__
    self.set_markerfacecolor(markerfacecolor)
  File "/home/bdsirs/tianch/py2env/local/lib/python2.7/site-packages/matplotlib/lines.py", line 1206, in set_markerfacecolor
    if self._markerfacecolor != fc:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
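The failing comparison (self._markerfacecolor != fc) together with the error text suggests that a color reached matplotlib as a NumPy array, for which != is evaluated element-wise. One possible workaround, offered as an untested sketch rather than a confirmed fix, is to convert each color to a plain tuple before plotting. The helper to_plain_colors is hypothetical and not part of the package, and whether ml_colors is actually an array in this project is an assumption.

```python
def to_plain_colors(colors):
    """Convert array-like RGB(A) colors to tuples of Python floats.

    Named colors and hex strings are passed through unchanged.
    """
    plain = []
    for color in colors:
        if isinstance(color, str):
            plain.append(color)
        else:
            plain.append(tuple(float(channel) for channel in color))
    return plain


# Nested lists stand in for a NumPy array of RGB rows here.
palette = [[0.1, 0.2, 0.3], [0.9, 0.8, 0.7]]
print(to_plain_colors(palette))  # [(0.1, 0.2, 0.3), (0.9, 0.8, 0.7)]
```

Under that assumption, the plotting call would receive colors=to_plain_colors(ml_colors) in place of the original colors=ml_colors.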
