The goal of this project is to apply machine learning techniques to a link prediction problem: predicting whether one research paper cites another. The citation network consists of several thousand research papers, along with their abstracts and author lists. The dataset was collected from machine learning, artificial intelligence, data mining, and natural language processing conferences and journals. The project uses edge information to learn the parameters of a classifier, then uses that classifier to predict whether two nodes are linked by an edge. A detailed presentation of the project is available in `Project presentation - citation prediction.pdf`.
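To make the framing concrete, here is a minimal sketch of how a citation graph can be turned into a supervised learning problem. It is an illustration under stated assumptions, not the code from the notebook: the helper names (`build_adjacency`, `pair_features`, `sample_negative_pairs`) and the two features (common neighbors and Jaccard coefficient) are hypothetical choices made for this example, and the toy edge list stands in for the real data.

```python
import random
from collections import defaultdict


def build_adjacency(edges):
    """Return an undirected adjacency map {node: set(neighbors)}."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def pair_features(adj, u, v):
    """Two illustrative topological features for the pair (u, v).

    The edge (u, v) itself is ignored so that the features do not
    leak the label of a positive pair.
    """
    nu = adj.get(u, set()) - {v}
    nv = adj.get(v, set()) - {u}
    common = len(nu & nv)
    union = len(nu | nv)
    jaccard = common / union if union else 0.0
    return [common, jaccard]


def sample_negative_pairs(adj, nodes, n_samples, seed=0):
    """Sample distinct node pairs that are NOT connected by an edge."""
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < n_samples:
        u, v = rng.sample(nodes, 2)
        if v not in adj.get(u, set()):
            negatives.add((min(u, v), max(u, v)))
    return sorted(negatives)


# Toy graph standing in for the real citation network (assumption).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
adj = build_adjacency(edges)
nodes = sorted(adj)

positives = edges                                  # label 1: existing edges
negatives = sample_negative_pairs(adj, nodes, 3)   # label 0: sampled non-edges

X = [pair_features(adj, u, v) for u, v in positives + negatives]
y = [1] * len(positives) + [0] * len(negatives)
```

The resulting `X` and `y` can be passed to any classifier. Removing the candidate edge before computing the features is a deliberate choice here, since otherwise positive pairs would be described by a graph that already contains the answer.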
The project has the following structure:
.
├── README.md
├── Project presentation - citation prediction.pdf
├── code_clean.ipynb
├── data
│   ├── processed
│   │   ├── X_test.csv
│   │   ├── X_train.csv
│   │   ├── X_valid.csv
│   │   ├── y_test.csv
│   │   ├── y_train.csv
│   │   └── y_valid.csv
│   └── raw
│       ├── abstracts.txt
│       ├── authors.txt
│       ├── edgelist.txt
│       └── test.txt
├── doc2vec
│   ├── doc2vec_model_abstracts
│   └── doc2vec_model_authors
├── embed
│   └── abstracts_emb.json
└── viz
    └── tableau viz.twb
The dataset used in this project is available in the `data` folder, which contains two sub-folders:
- `raw`: the original data files:
  - `edgelist.txt`: a citation network created from papers published at machine learning, artificial intelligence, data mining, and natural language processing venues. Nodes correspond to papers, while edges represent citation relationships. The graph is undirected.
  - `abstracts.txt`: the abstracts of the papers.
  - `authors.txt`: the authors of the papers.
  - `test.txt`: 106,692 unordered node pairs. The goal is to predict whether there is an edge between the two elements of each pair.
- `processed`: the processed data files:
  - `X_train.csv`: training set features
  - `y_train.csv`: training set labels
  - `X_valid.csv`: validation set features
  - `y_valid.csv`: validation set labels
  - `X_test.csv`: test set features
  - `y_test.csv`: test set labels
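As a rough illustration of how the pair files in `data/raw` could be read, the sketch below parses one node pair per line. The on-disk format is an assumption: this README does not state the delimiter, so the hypothetical `read_pairs` helper takes it as a parameter (comma by default, `sep=None` for whitespace).

```python
from pathlib import Path


def read_pairs(path, sep=","):
    """Read one node pair per line (e.g. "12,345") as a list of (int, int).

    `sep` is an assumption about the file format; pass sep=None for
    whitespace-separated pairs.
    """
    pairs = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        u, v = line.split(sep)
        pairs.append((int(u), int(v)))
    return pairs


# Example usage with the paths from the project tree:
# train_edges = read_pairs("data/raw/edgelist.txt")
# test_pairs = read_pairs("data/raw/test.txt")
```

The same helper covers both `edgelist.txt` and `test.txt`, since each is described as a list of node pairs.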
In addition, the `doc2vec` folder contains two trained Doc2Vec models for the abstracts and authors data, and the `embed` folder contains the abstracts data in embedded format.
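One way such embeddings can feed a link predictor is as a similarity feature between the two papers of a pair. The sketch below assumes that `abstracts_emb.json` is a JSON object mapping each paper ID to a list of floats; that layout is a guess, not something this README specifies, and both helper names are hypothetical.

```python
import json
import math


def load_embeddings(path):
    """Load {paper_id: vector} from a JSON file (assumed layout)."""
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    # JSON object keys are always strings, so convert them back to ints.
    return {int(k): v for k, v in raw.items()}


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Example usage with the path from the project tree:
# emb = load_embeddings("embed/abstracts_emb.json")
# feature = cosine_similarity(emb[u], emb[v])
```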
The `code_clean.ipynb` notebook contains cleaned and commented code for the machine learning models used in the project, including Logistic Regression, XGBoost, and MLP.
The `viz` folder contains a Tableau visualization with interesting insights about the citation network.
The following machine learning methods are used for this project:
- Logistic Regression
- XGBoost
- MLP (Multi-Layer Perceptron)
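The sketch below shows how two of these model families might be trained and compared on a validation split with scikit-learn. It is not the notebook's code: the synthetic features stand in for `X_train.csv` and `X_valid.csv`, and the hyperparameters are arbitrary assumptions chosen only to make the example run.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the pair features (assumption): two well-separated
# Gaussian clusters, one per class.
rng = np.random.default_rng(0)
n = 200
X_pos = rng.normal(loc=2.0, scale=1.0, size=(n, 4))   # "edge" pairs
X_neg = rng.normal(loc=-2.0, scale=1.0, size=(n, 4))  # "no edge" pairs
X = np.vstack([X_pos, X_neg])
y = np.array([1] * n + [0] * n)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Hyperparameters are illustrative, not the ones used in the project.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "mlp": MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_valid, model.predict(X_valid))

for name, score in scores.items():
    print(f"{name}: validation accuracy = {score:.3f}")
```

If the `xgboost` package is installed, `xgboost.XGBClassifier` follows the same `fit`/`predict` interface and could be added to the `models` dictionary in the same way.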
To further improve the predictions, additional techniques such as node2vec, TF-IDF, and SciBERT could be explored for feature extraction and representation.
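As an example of one of these directions, the following sketch computes a TF-IDF cosine similarity between abstracts with scikit-learn, which could serve as an extra feature for a node pair. The three abstracts are made-up placeholders, and this is only one possible way to use TF-IDF here, not something the project currently does.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder abstracts (assumption); the real ones live in data/raw/abstracts.txt.
abstracts = [
    "link prediction in citation networks",
    "citation networks and link prediction models",
    "speech recognition with recurrent models",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(abstracts)   # sparse matrix, one row per abstract

# Pairwise cosine similarities; entry [i, j] compares abstract i with abstract j.
similarity = cosine_similarity(tfidf)

print(similarity.round(3))
```

Abstracts that share vocabulary (the first two) get a positive score, while abstracts with no terms in common (the first and third) get a score of zero.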