Our model is based on a deep learning pipeline with three major sources of input: 1) extracted features describing the engaging user, the tweet creator, and the tweet content; 2) language model embeddings of the tweet content; 3) engagement/creation history embeddings for the user and the tweet creator. These inputs are combined with feed-forward neural networks to generate 4-way predictions for each engagement type.
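A minimal NumPy sketch of this three-input fusion is below. The layer sizes, input dimensions, and the engagement-type names are illustrative assumptions, not the actual model configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Three input sources, per the description above (dimensions are assumed):
features = rng.normal(size=32)           # extracted user/creator/content features
lm_embedding = rng.normal(size=64)       # language model embedding of the tweet
history_embedding = rng.normal(size=64)  # engagement/creation history embedding

# Concatenate the inputs and pass them through a small feed-forward network.
x = np.concatenate([features, lm_embedding, history_embedding])  # shape (160,)
W1 = rng.normal(scale=0.1, size=(128, x.size))
b1 = np.zeros(128)
h = relu(W1 @ x + b1)

# One probability per engagement type (hypothetical head; in the challenge
# the four types are reply, retweet, quote, and like).
ENGAGEMENT_TYPES = ["reply", "retweet", "quote", "like"]
W2 = rng.normal(scale=0.1, size=(len(ENGAGEMENT_TYPES), 128))
b2 = np.zeros(len(ENGAGEMENT_TYPES))
probs = sigmoid(W2 @ h + b2)

print(dict(zip(ENGAGEMENT_TYPES, probs.round(3))))
```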
For the language model component, we use the pretrained multilingual Transformer-based models BERT-Base and XLM-RoBERTa-Large. These models are first fine-tuned in an unsupervised way on the Twitter data with the language modelling loss. We then fine-tune them further for the target engagement prediction task by backpropagating gradients end-to-end from the classification loss.
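The two-stage schedule can be illustrated with a toy stand-in, where a linear "encoder" plays the role of the pretrained Transformer and input reconstruction stands in for the language modelling loss. This sketches the training recipe only; it is not the actual BERT/XLM-RoBERTa fine-tuning code:

```python
import numpy as np

rng = np.random.default_rng(1)

X = rng.normal(size=(200, 8))                     # toy "tweets"
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)   # toy "engagement" labels

W_enc = rng.normal(scale=0.1, size=(4, 8))        # encoder: 8 -> 4 dims

# Stage 1: unsupervised adaptation -- train the encoder alone to reconstruct
# its input (a crude proxy for the unsupervised language modelling stage).
W_dec = rng.normal(scale=0.1, size=(8, 4))
for _ in range(500):
    Z = X @ W_enc.T                    # encode
    X_hat = Z @ W_dec.T                # decode
    err = X_hat - X                    # reconstruction error
    W_dec -= 0.1 * (err.T @ Z) / len(X)
    W_enc -= 0.1 * ((err @ W_dec).T @ X) / len(X)

# Stage 2: supervised fine-tuning -- attach a classification head and
# backpropagate the classification loss through the encoder end-to-end.
w_clf = np.zeros(4)
for _ in range(500):
    Z = X @ W_enc.T
    p = 1.0 / (1.0 + np.exp(-(Z @ w_clf)))
    err = p - y                        # gradient of logistic loss wrt logits
    w_clf -= 0.5 * (Z.T @ err) / len(X)
    W_enc -= 0.5 * np.outer(w_clf, err @ X) / len(X)   # end-to-end update

acc = np.mean((p > 0.5) == (y > 0.5))
print(f"train accuracy after fine-tuning: {acc:.2f}")
```

The key point the sketch mirrors is that stage 2 updates the encoder weights as well as the task head, rather than treating the encoder as a frozen feature extractor.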
Engagement history embeddings are generated by sampling tweets that the user engaged with and passing them through the language model. The resulting embeddings are combined with self-attention to create an engagement representation that serves as one of the inputs to the model. A similar procedure yields the tweet creation history representation.
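A hedged sketch of the pooling step: N sampled tweet embeddings pass through a single self-attention layer and are then averaged into one history vector. The dimensions, single-head formulation, and the mean-pooling step are assumptions for illustration, not the model's exact specification:

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

N, d = 10, 64
tweet_embs = rng.normal(size=(N, d))   # LM embeddings of N sampled tweets

# Single-head self-attention over the sampled tweets.
Wq = rng.normal(scale=d ** -0.5, size=(d, d))
Wk = rng.normal(scale=d ** -0.5, size=(d, d))
Wv = rng.normal(scale=d ** -0.5, size=(d, d))

Q, K, V = tweet_embs @ Wq, tweet_embs @ Wk, tweet_embs @ Wv
attn = softmax(Q @ K.T / np.sqrt(d), axis=-1)   # (N, N) attention weights
contextualized = attn @ V                       # (N, d) attended embeddings

# Collapse to a single engagement-history representation (assumed mean pool).
history_repr = contextualized.mean(axis=0)      # shape (d,)
print(history_repr.shape)
```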
We use a hybrid Java-Python pipeline: data parsing and feature extraction are done in Java, and deep learning model training is done in Python. To run the code, first execute `run.sh`. This script requires the path to the main `PROJECT_PATH` directory, which must contain a subdirectory `PROJECT_PATH/Data/` with the `training.tsv`, `val.tsv`, and `competition_test.tsv` challenge datasets. The script parses the data, extracts features, trains a baseline XGBoost model using the features only, and runs inference on the leaderboard (`val.tsv`) and test (`competition_test.tsv`) sets. Trained XGBoost models and predictions are written to `PROJECT_PATH/Models/XGB/`.
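Putting the paths above together, the expected directory layout looks like this (the tree itself is an illustration; only the `Data/` contents must be provided up front, while `Models/XGB/` holds the script's outputs):

```
PROJECT_PATH/
├── Data/
│   ├── training.tsv
│   ├── val.tsv
│   └── competition_test.tsv
└── Models/
    └── XGB/        <- trained XGBoost models and predictions are written here
```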
Then execute the Python inference run script; see the corresponding README for details.