Scientific Impact Prediction

This package computes citation predictions for papers and h-index predictions for authors using a dataset underlying the Semantic Scholar search service created by AI2. Here we describe how to acquire the data to train a collection of regression models and produce a collection of plots showing their relative performance.

If you would like to simply access the data and are uninterested in running the code then skip to the "Getting the data" section. Otherwise you will want to first clone this repository to a local directory:

git clone git@github.com:Lucaweihs/impact-prediction.git

Getting the data
Code dependencies
Features used in prediction

Getting the data

Data can be downloaded manually as individual files. Alternatively, if you are only interested in producing predictions, the download_data.py script automatically downloads and extracts just the files needed to train models and produce predictions. To use the script, run the commands:

# Enter the impact prediction directory
cd path/to/impact-prediction
# Run the script to download the data
python download_data.py

We now describe the individual files and provide URLs to download them manually.

Data file descriptions

These data span the years 1975 to 2015. The features are generated using only information available in 2005, and we train models to predict values for the years 2006-2015. The data come in two formats, tab-separated values files (.tsv) and JSON files (.json); all files are compressed with gzip, so be sure to decompress them (gzip -d filename) before use. Authors are identified by name and papers by a unique identifier. The paper ids correspond to those used by Semantic Scholar, and more information about a paper with a given id can be found on Semantic Scholar. For example, one can find more information about the paper with id

214899d16f39a494c3e69118c53a7b5877c0bbfc

by going to the URL:

www.semanticscholar.org/paper/214899d16f39a494c3e69118c53a7b5877c0bbfc.
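The lookup above can be scripted. The following is a minimal sketch that assumes the URL pattern shown in the example holds for every paper id; the function name and the https:// scheme are assumptions, not part of this package.

```python
def semantic_scholar_url(paper_id):
    """Build a Semantic Scholar page URL from a paper id.

    Assumption: the pattern from the example above applies to all ids,
    and the site is reachable over https.
    """
    return "https://www.semanticscholar.org/paper/" + paper_id


print(semantic_scholar_url("214899d16f39a494c3e69118c53a7b5877c0bbfc"))
```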

Author names

File name: authors-1975-2005-2015-2.tsv

Format: Every line is the name of an author taken from a paper.

Download link

Author features

File name: authorFeatures-1975-2005-2015-2.tsv

Format: The first line specifies the feature names and every other line represents the feature values for a particular author. These features are ordered to correspond to the authors from the "author names" file.

Download link
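Because the author features file is aligned row by row with the author names file, the two can be joined by position. The sketch below is one way to do that with the standard library, written for Python 3 (the project itself targets Python 2.7). The function names are hypothetical, and it assumes every feature value is numeric.

```python
import csv
import io


def read_tsv_with_header(handle):
    """Read a tab-separated file whose first line holds the column names.

    Assumption: every value after the header parses as a number.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    rows = [[float(value) for value in row] for row in reader if row]
    return header, rows


def load_author_features(names_handle, features_handle):
    """Pair each author name with its feature values by row position."""
    names = [line.rstrip("\r\n") for line in names_handle if line.strip()]
    columns, rows = read_tsv_with_header(features_handle)
    if len(names) != len(rows):
        raise ValueError("names and features files are not row-aligned")
    return [(name, dict(zip(columns, row))) for name, row in zip(names, rows)]


# In-memory stand-ins for the real files. With the compressed downloads one
# could pass handles from gzip.open(path, "rt") instead.
names = io.StringIO("A. Author\nB. Author\n")
features = io.StringIO("author_hindex\tauthor_age\n4\t7\n6\t10\n")
for name, values in load_author_features(names, features):
    print(name, values)
```

The same positional join applies to the responses and histories files, since they follow the same row order.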

Author responses

File name: authorResponses-1975-2005-2015-2.tsv

Format: The first line specifies the column names, which are of the form total_citations_in_NUMBER or hindex_in_NUMBER. Here NUMBER is the number of years since 2006, so a value of 5 in the hindex_in_5 column of a particular row means that the author corresponding to that row had an h-index of 5 in the year 2006 + 5 = 2011. These responses are ordered to correspond to the authors from the "author names" file.

Download link
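The response column naming scheme can be decoded mechanically. This sketch maps a column name to its metric and calendar year using the 2006 + NUMBER rule stated in the format description; the function name is hypothetical.

```python
import re

BASE_YEAR = 2006  # from the format description: year = 2006 + NUMBER

_COLUMN_PATTERN = re.compile(r"^(total_citations|hindex)_in_(\d+)$")


def parse_response_column(column):
    """Split a response column name into (metric, calendar year)."""
    match = _COLUMN_PATTERN.match(column)
    if match is None:
        raise ValueError("unexpected response column: %r" % column)
    metric = match.group(1)
    offset = int(match.group(2))
    return metric, BASE_YEAR + offset


print(parse_response_column("hindex_in_5"))
```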

Author histories

File name: authorHistories-hindex-1975-2005-2015-2.tsv

Format: Every line corresponds to the h-index of an author since the beginning of their career until 2005. These histories are ordered to correspond to the authors from the "author names" file.

Example: If an author's career is 5 years old by 2005 and their per-year h-index is 1, 1, 2, 3, 4, then the line corresponding to that author is 1 1 2 3 4

Download link
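History lines vary in length because careers vary in length. The sketch below parses one line and derives year-over-year changes from it. It assumes values are integers separated by whitespace (spaces or tabs), and the function names are hypothetical.

```python
def parse_history_line(line):
    """Turn one history line into a list of per-year values.

    Assumption: values are whitespace-separated integers.
    """
    return [int(value) for value in line.split()]


def yearly_changes(history):
    """Year-over-year increments, counting the first year from zero."""
    previous = 0
    changes = []
    for value in history:
        changes.append(value - previous)
        previous = value
    return changes


history = parse_history_line("1 1 2 3 4")
print(len(history), history, yearly_changes(history))
```

The length of the parsed list gives the career length in years as of 2005.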

Paper ids

File name: paperIds-1975-2005-2015-2.tsv

Format: Each line corresponds to a single paper id. These ids are unique identifiers of papers.

Download link

Paper features

File name: paperFeatures-1975-2005-2015-2.tsv

Format: The first line specifies the feature names and every other line represents the feature values for a particular paper. These features are ordered to correspond to the papers from the "paper ids" file.

Download link

Paper responses

File name: paperResponses-1975-2005-2015-2.tsv

Format: Each line corresponds to the observed cumulative citation counts of a paper in the years between 2006 and 2015. These responses are ordered to correspond to the papers from the "paper ids" file.

Download link

Paper histories

File name: paperHistories-1975-2005-2015-2.tsv

Format: Same as for "author histories" but replacing authors with papers and the h-index with cumulative citation counts.

Download link

Citation graph

File name: citationGraph-1975-2015.json

Format: Each line corresponds to a json dictionary with the following fields:

id - a paper id
cites - a list of the paper ids cited by id

Notes: The citation graph includes all papers published between 1975 and 2015.

Download link
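Since each line of the citation graph file is a standalone JSON dictionary, the file can be processed one line at a time without loading it whole. As a rough sketch, the following counts how often each paper is cited, using only the id and cites fields described above; the function name and the sample ids are made up for illustration.

```python
import json
from collections import Counter


def count_incoming_citations(lines):
    """Count, for every paper id, how many records list it under 'cites'."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        counts.update(record["cites"])
    return counts


# Made-up records in the documented format. With the real file, any iterable
# of lines works, e.g. a handle from gzip.open(path, "rt").
sample_lines = [
    '{"id": "p1", "cites": ["p2", "p3"]}',
    '{"id": "p2", "cites": ["p3"]}',
    '{"id": "p3", "cites": []}',
]
print(count_incoming_citations(sample_lines))
```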

Key citation graph

File name: keyCitationGraph-1975-2015.json

Format: Exactly as for the "citation graph" file but only includes key citations between papers.

Download link

Paper meta data

File name: paperIdToPaperData-1975-2015.json

Format: Every line is a json dictionary corresponding to a single paper with the following fields:

citations-in-year - a dictionary where the keys are years and the values are the number of citations the paper received in that year.
is-survey - true/false depending on whether or not the paper is a survey.
year - the publication year.
id - the paper's id.
authors - a list of the paper's authors.
venue - the venue where the paper was published.

Download link
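The citations-in-year field makes it possible to recompute cumulative counts for any cutoff year. Here is a minimal sketch, assuming the year keys arrive as strings (as JSON object keys always do); the function name and the sample record are hypothetical.

```python
import json


def cumulative_citations(record, through_year):
    """Sum a paper's citations received up to and including through_year."""
    per_year = record["citations-in-year"]
    return sum(
        count for year, count in per_year.items() if int(year) <= through_year
    )


# A made-up record in the documented format.
sample = json.loads(
    '{"id": "p1", "year": 2000, "is-survey": false, "authors": ["A. Author"],'
    ' "venue": "Some Venue", "citations-in-year": {"2001": 2, "2005": 3, "2010": 4}}'
)
print(cumulative_citations(sample, 2005))
```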

Code dependencies

This project is written in Python 2.7.11 using the following packages, divided into several categories.

Data reading, representation, serialization:

cPickle
configparser
gzip
pandas
subprocess

Parallel processing:

joblib
multiprocessing

Modeling/mathematics:

numpy
scipy
sklearn
tensorflow

Plotting:

matplotlib
seaborn

The majority of these come preinstalled with, or can be easily installed through, any scientific Python package manager, e.g. Anaconda. If you have the Anaconda package manager installed, you can set up an "impact-prediction" environment containing the necessary packages by running the command

conda env create -f environment.yml

in the terminal from within the project.

Training and comparing models

Note that training can take many hours, even on a powerful machine. Once a model has finished training, its results are saved so that you do not have to train it again. We assume you have downloaded the data using the download_data.py script described in the "Getting the data" section. To train a collection of models for h-index prediction and produce a collection of associated plots (placed in the ./plots directory), run the following command from within the impact-prediction directory:

python author_predictions.py hindex "author_hindex:4;author_age:5,12"

The created plots show the MAPE, R^2, and PA-R^2 metrics of the various trained algorithms on the training, validation, and testing datasets. These plots are named to be self-descriptive. The above command trains and tests only on authors with an h-index >= 4 by 2005 and a career length between 5 and 12 years in 2005.
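For readers unfamiliar with the first two metrics, the sketch below uses the textbook definitions of MAPE (mean absolute percentage error) and R^2 (coefficient of determination). It is not taken from this package's code, which may compute them differently, and PA-R^2 is left out because its definition is not given here.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction.

    Undefined when any true value is zero (division by zero).
    """
    errors = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)]
    return sum(errors) / len(errors)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Undefined when all true values are identical (SS_tot is zero).
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


print(mape([2, 4], [1, 5]))
print(r_squared([1, 2, 3], [1, 2, 3]))
```

A lower MAPE and a higher R^2 both indicate better predictions.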

To train models for paper citation prediction you can run the command:

python paper_predictions.py citation "paper_citations:5"

As above, this will create a number of plots in the "plots" directory. The above command includes only those papers with >= 5 citations by 2005.
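The filter arguments in the two commands appear to follow a name:lower or name:lower,upper pattern, with entries separated by semicolons. The sketch below illustrates that reading only. It is not the scripts' actual parser, the function names are hypothetical, and treating both bounds as inclusive is an assumption.

```python
def parse_filters(spec):
    """Parse 'name:lower' or 'name:lower,upper' entries separated by ';'."""
    filters = {}
    for part in spec.split(";"):
        if not part:
            continue
        name, _, bounds = part.partition(":")
        values = [float(value) for value in bounds.split(",")]
        lower = values[0]
        # A single value is read as a lower bound with no upper limit.
        upper = values[1] if len(values) > 1 else float("inf")
        filters[name] = (lower, upper)
    return filters


def passes(features, filters):
    """True if every filtered feature lies within its (inclusive) bounds."""
    return all(
        lower <= features[name] <= upper
        for name, (lower, upper) in filters.items()
    )


author_filters = parse_filters("author_hindex:4;author_age:5,12")
print(author_filters)
print(passes({"author_hindex": 6, "author_age": 9}, author_filters))
```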

Features used in prediction

We have two sets of features used in our predictions, one for authors and one for papers. The features are listed below.

Author Features

Feature Name	Description
author_hindex	H-index
author_hindex_delta	Change in h-index over the last 2 years
author_citation_count	Cumulative citation count
author_key_citation_count	Cumulative key citation count (see Zhu et al. 2015)
author_citations_delta_{0,1}	Citations this year and 1 year ago
author_key_citations_delta_{0,1}	Key citations this year and 1 year ago
author_mean_citations_per_paper	Mean number of citations per paper
author_mean_citation_per_paper_delta	Change in mean number of citations per paper over last 2 years
author_mean_citations_per_year	Mean number of citations per year
author_papers	Number of papers published
author_papers_delta	Number of papers published in last 2 years
author_mean_citation_rank	Rank of author (between 0 and 1) among all other authors in terms of mean citations per year
author_unweighted_pagerank	PageRank of author in unweighted coauthorship network
author_weighted_pagerank	PageRank of author in weighted coauthorship network
author_age	Career length (years since first paper published)
author_recent_num_coauthors	Total number of coauthors in last 2 years
author_max_single_paper_citations	Max number of citations for any of author's papers
venue_hindex_{mean, min, max}	H-indices of venues author has published in
venue_hindex_delta_{mean, min, max}	Change in h-index over last two years for venues author has published in
venue_citations_{mean, min, max}	Mean citations per paper of venues author has published in
venue_citations_delta_{mean, min, max}	Change in mean citations per paper over last two years for venues author has published in
venue_papers_{mean, min, max}	Number of papers in venues in which the author has published
venue_papers_delta_{mean, min, max}	Change in number of papers in venues in which the author has published over the last 2 years
venue_rank_{mean, min, max}	Ranks of venues (between 0-1) in which the author has published determined by mean number of citations per paper
venue_max_single_paper_citations_{mean, min, max}	Maximum number of citations any paper published in a venue has received for each venue the author has published in
total_num_venues	Total number of venues published in
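Several of the features above build on the h-index. For reference, this is a short sketch of the standard definition: the largest h such that h papers each have at least h citations. It is not taken from this package's feature-generation code.

```python
def h_index(citation_counts):
    """Largest h such that at least h papers have h or more citations."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


print(h_index([10, 8, 5, 4, 3]))
```

The same definition applies to a venue by using the citation counts of the papers published in it.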

Paper Features

Feature Name	Description
author_hindex_{mean, min, max}	H-indices of authors
author_hindex_delta_{mean, min, max}	Change in h-indices of authors in last 2 years
author_citations_{mean, min, max}	Cumulative citations for each author
author_citations_delta_{mean, min, max}	Change in cumulative citations for each author in last 2 years
author_key_citations_{mean, min, max}	Cumulative key citations for each author
author_key_citations_delta_{mean, min, max}	Change in cumulative key citations for each author in last 2 years
author_mean_citations_{mean, min, max}	Mean citations per paper for each author
author_mean_citations_delta_{mean, min, max}	Change in mean citations per paper for each author in last 2 years
author_papers_{mean, min, max}	Number of papers published for each author
author_papers_delta_{mean, min, max}	Number of papers published for each author in last 2 years
author_unweighted_pagerank_{mean, min, max}	PageRank of each author in the unweighted coauthorship network
author_weighted_pagerank_{mean, min, max}	PageRank of each author in the weighted coauthorship network
author_mean_citation_rank_{mean, min, max}	Rank of each author among all authors in terms of mean citations per paper
author_recent_num_coauthor_{mean, min, max}	Number of coauthors each author had in last 2 years
author_max_single_paper_citations_{mean, min, max}	Maximum citations a single paper of each author has received
total_num_authors	Total number of authors for the paper
venue_hindex	H-index of the venue
venue_hindex_delta	Change in h-index of the venue in last 2 years
venue_mean_citations	Mean citations per paper published in the venue
venue_mean_citations_delta	Change in mean citations per paper published in the venue in last 2 years
venue_papers	Number of papers published in the venue
venue_papers_delta	Number of papers published in the venue in last 2 years
venue_rank	Rank of the venue among all venues in terms of mean citations per paper
venue_max_single_paper_citations	Maximum number of citations any paper published in the venue has received
paper_age	Age of the paper in years (rounded up)
paper_citations	Cumulative citation count
paper_key_citations	Cumulative key citation count
paper_mean_citations_per_year	Average number of citations received per year
is_survey	Whether or not the paper is a survey
paper_citations_delta_{0,1}	Number of citations the paper received in the last year and the year before that
paper_key_citations_delta_{0,1}	Number of key citations the paper received in the last year and the year before that
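The {mean, min, max} suffixes in the tables above indicate that a per-author (or per-venue) quantity is summarized over all authors of a paper (or all venues of an author). A minimal sketch of that aggregation follows; the function name is hypothetical and the package's own code may differ.

```python
def aggregate_feature(name, values):
    """Summarize a list of values as name_mean, name_min, and name_max."""
    if not values:
        raise ValueError("cannot aggregate an empty list")
    return {
        name + "_mean": sum(values) / len(values),
        name + "_min": min(values),
        name + "_max": max(values),
    }


# H-indices of a paper's three authors (made-up numbers).
print(aggregate_feature("author_hindex", [2, 4, 9]))
```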
