Citation Prediction between Research Papers

Table of Contents

  1. Introduction
  2. Repo Structure
  3. Dataset
  4. Code
  5. Visualization
  6. Methods Used
  7. Future Work

Introduction

The goal of this project is to apply machine learning techniques to a link prediction problem: deciding whether one research paper cites another. The citation network consists of several thousand research papers, along with their abstracts and author lists, drawn from machine learning, artificial intelligence, data mining, and natural language processing conferences and journals. The known edges are used to learn the parameters of a classifier, which is then used to predict whether two nodes are linked by an edge. A detailed presentation of the project is available in Project presentation - citation prediction.pdf.
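
To make the classification framing concrete, here is a minimal sketch of how labeled node pairs could be built from an edge list. The function name build_labeled_pairs, the 1:1 ratio of positives to negatives, and the uniform sampling of non-edges are all assumptions made for illustration; the README does not state how negative examples were actually generated.

```python
import random


def build_labeled_pairs(edges, seed=0):
    """Return (pairs, labels): each edge as a positive, plus sampled non-edges.

    Assumption: one negative pair is sampled per positive pair, uniformly at
    random among node pairs that are not connected.
    """
    rng = random.Random(seed)
    # The graph is undirected, so store each pair sorted: (u, v) == (v, u).
    positives = {tuple(sorted(e)) for e in edges}
    nodes = sorted({n for e in positives for n in e})

    # Never ask for more negatives than there are non-edges in the graph.
    max_possible = len(nodes) * (len(nodes) - 1) // 2 - len(positives)
    target = min(len(positives), max_possible)

    negatives = set()
    while len(negatives) < target:
        u, v = rng.sample(nodes, 2)  # two distinct nodes
        pair = tuple(sorted((u, v)))
        if pair not in positives:
            negatives.add(pair)

    pairs = sorted(positives) + sorted(negatives)
    labels = [1] * len(positives) + [0] * len(negatives)
    return pairs, labels
```

Each resulting pair would then be turned into a feature vector (for example from graph structure or text embeddings) before being passed to a classifier.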

Repo Structure

The project has the following structure:

.
├── README.md
├── Project presentation - citation prediction.pdf
├── code_clean.ipynb
├── data
│   ├── processed
│   │   ├── X_test.csv
│   │   ├── X_train.csv
│   │   ├── X_valid.csv
│   │   ├── y_test.csv
│   │   ├── y_train.csv
│   │   └── y_valid.csv
│   └── raw
│       ├── abstracts.txt
│       ├── authors.txt
│       ├── edgelist.txt
│       └── test.txt
├── doc2vec
│   ├── doc2vec_model_abstracts
│   └── doc2vec_model_authors
├── embed
│   └── abstracts_emb.json
└── viz
    └── tableau viz.twb

Dataset

The dataset used in this project is available in the data folder, which contains two sub-folders:

  • raw: containing the original data files:
    • edgelist.txt: a citation network created from papers published at machine learning, artificial intelligence, data mining, and natural language processing venues. Nodes correspond to papers, while edges represent citation relationships. The graph is undirected.
    • abstracts.txt: the abstracts of the papers.
    • authors.txt: the authors of the papers.
    • test.txt: 106,692 unordered node pairs. The goal is to predict, for each pair, whether there is an edge between its two elements.
  • processed: containing the processed data files:
    • X_train.csv: training set features
    • y_train.csv: training set labels
    • X_valid.csv: validation set features
    • y_valid.csv: validation set labels
    • X_test.csv: test set features
    • y_test.csv: test set labels
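
As an illustration of how the raw edge list could be turned into per-pair features, the sketch below builds an undirected adjacency map and computes a few common neighborhood features. The comma-separated line format, the helper names load_adjacency and pair_features, and the choice of features are assumptions; the README does not describe the file layout or which features the processed CSV files contain.

```python
from collections import defaultdict


def load_adjacency(lines, sep=","):
    """Build an undirected adjacency map from lines such as 'u,v'.

    Assumption: one edge per line, two node ids separated by `sep`.
    """
    adj = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        u, v = (part.strip() for part in line.split(sep)[:2])
        adj[u].add(v)
        adj[v].add(u)
    return adj


def pair_features(adj, u, v):
    """Neighborhood features for the node pair (u, v)."""
    # Exclude u and v themselves so an existing u-v edge does not leak
    # into the overlap statistics.
    nu = adj.get(u, set()) - {v}
    nv = adj.get(v, set()) - {u}
    common = nu & nv
    union = nu | nv
    return {
        "common_neighbors": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
        "degree_u": len(adj.get(u, set())),
        "degree_v": len(adj.get(v, set())),
    }
```

Under these assumptions, the adjacency map could be built by passing an open file handle for data/raw/edgelist.txt to load_adjacency, since the function only iterates over lines.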

In addition, the doc2vec folder contains two trained Doc2Vec models for the abstracts and authors data, and the embed folder contains the abstracts data in embedded format.
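
One plausible way to use such embeddings for link prediction is to compare the vectors of the two papers in a pair. The sketch below computes cosine similarity with the standard library only. It assumes that abstracts_emb.json maps a paper id to a list of floats, which the README does not confirm, and uses a small inline JSON string as a stand-in for the real file.

```python
import json
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def abstract_similarity(embeddings, paper_u, paper_v):
    """Similarity feature for a pair of papers, given an id -> vector mapping."""
    return cosine_similarity(embeddings[paper_u], embeddings[paper_v])


# Toy stand-in for the contents of embed/abstracts_emb.json (assumed layout).
embeddings = json.loads('{"0": [1.0, 0.0], "1": [0.0, 1.0], "2": [2.0, 0.0]}')
```

The resulting similarity score could be appended to the other features of a node pair.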

Code

The code_clean.ipynb notebook contains cleaned and commented code for the machine learning models used in the project, including Logistic Regression, XGBoost, and MLP.

Visualization

The viz folder contains a Tableau visualization with interesting insights about the citation network.

Methods Used

The following machine learning methods are used for this project:

  • Logistic Regression
  • XGBoost
  • MLP (Multi-Layer Perceptron)
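
For readers unfamiliar with the simplest of these models, the following is a minimal, dependency-free sketch of logistic regression trained with batch gradient descent on pair features. It is not the notebook's implementation, which most likely relies on a library. The function names, learning rate, and epoch count are illustrative assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic_regression(X, y, lr=0.5, epochs=1000):
    """Fit weights w and bias b by batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this example: p(y=1 | x) - y.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        # Step against the mean gradient.
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    """Estimated probability that the pair with features x is linked."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
```

A pair would be predicted as a citation when predict_proba returns a value above a chosen threshold such as 0.5.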

Future Work

To further improve the prediction model, additional techniques such as node2vec, TF-IDF, and SciBERT could be explored for feature extraction and representation.
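
As a rough idea of what a TF-IDF text feature could look like, the sketch below builds sparse TF-IDF vectors for a few abstracts and compares them with cosine similarity. The whitespace tokenization, the smoothed IDF variant, and the function names are assumptions chosen for brevity; a real implementation would more likely use an established library.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def sparse_cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

The similarity between the TF-IDF vectors of two abstracts could then serve as one more feature for a node pair.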
