This is a binary classification task that uses machine learning models to predict judges' decisions in asylum court cases. We also generated and quantified textual features extracted from New York Times and WikiLeaks corpora using NLP tools, including fastText language models, n-grams, and tf-idf.
In the feature_engineering folder, we present a way to reduce the dimensionality and sparsity of the textual features using a language model and k-means clustering. In the train folder, we are developing a machine learning task abstraction intended for reuse in subsequent tasks. The task abstraction is in ./train/model.py, and the driver script is ./train/train.py.
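The dimensionality reduction described above can be sketched roughly as follows. This is a minimal illustration, not the code in feature_engineering: it assumes each term already has an embedding vector (for example from a fastText model), groups terms with scikit-learn's KMeans, and sums per-term counts into per-cluster counts. The function names `cluster_terms` and `reduce_counts` and the toy vectors are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_terms(term_vectors, n_clusters, seed=0):
    """Map each term to the k-means cluster of its embedding vector."""
    terms = sorted(term_vectors)
    matrix = np.vstack([term_vectors[t] for t in terms])
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = km.fit_predict(matrix)
    return dict(zip(terms, labels.tolist()))


def reduce_counts(doc_term_counts, term_to_cluster, n_clusters):
    """Collapse sparse per-term counts into dense per-cluster counts."""
    reduced = np.zeros((len(doc_term_counts), n_clusters))
    for row, counts in enumerate(doc_term_counts):
        for term, count in counts.items():
            if term in term_to_cluster:
                reduced[row, term_to_cluster[term]] += count
    return reduced


# Toy embeddings standing in for language-model vectors (assumption).
term_vectors = {
    "refugee": np.array([1.0, 0.1]),
    "asylum": np.array([0.9, 0.0]),
    "border": np.array([1.0, 0.0]),
    "election": np.array([0.0, 1.0]),
    "vote": np.array([0.1, 0.9]),
    "ballot": np.array([0.0, 1.1]),
}
# Toy per-document term counts (assumption).
docs = [
    {"refugee": 2, "asylum": 1},
    {"vote": 3, "ballot": 1, "border": 1},
]

term_to_cluster = cluster_terms(term_vectors, n_clusters=2)
features = reduce_counts(docs, term_to_cluster, n_clusters=2)
```

Here a six-term vocabulary becomes two cluster features per document, so the feature matrix is both narrower and denser than the original term counts.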
- Python 3.6
- fasttext
- Scikit-learn 0.19.1
Run model training (using the final dataset):
$ sh runtrain.sh
The model results are written to train/model_result/[DecisionTree | RandomForest]/. The feature importances and the ROC AUC score for each model setting are saved under this folder on each run.
We used a random forest to train on data with two feature sets: asylum court case features and news group features in time-series form.
We reached a ROC AUC score of 0.89 (rounded).
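As a rough sketch of this kind of evaluation, the snippet below trains a scikit-learn random forest and reports the ROC AUC score and feature importances. It is not the project's train.py: the data is synthetic, the hyperparameters are placeholders, and the ordered split (earlier rows for training, later rows for testing) is only one assumed way of handling time-ordered data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the case and news features (assumption).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=0
)

# Ordered split: rows are treated as if sorted by date, so the model
# is trained on the earlier 80% and evaluated on the later 20%.
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# ROC AUC is computed from the predicted probability of the positive class.
scores = clf.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, scores)

# One importance value per input feature; the values sum to 1.
importances = clf.feature_importances_
print("ROC AUC:", round(auc, 2))
```

The score printed here reflects the synthetic data only and says nothing about the result reported above.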
- Scrape nytimes
- Scrape wikileaks
- Join case data
- Feature Engineering
- Training pipeline starts here; the scripts currently support only DecisionTree and RandomForest
- Please make sure the directories model_results/[DecisionTree | RandomForest] exist under the train directory
- Model analysis and findings in the data
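The directory setup required by the training step can be sketched as below. The helper `ensure_result_dirs` is hypothetical and not part of the repository; the default folder name is taken from the setup step above and should match whatever path the training scripts actually write to.

```python
import os

MODEL_NAMES = ("DecisionTree", "RandomForest")


def ensure_result_dirs(train_dir, results_dirname="model_results"):
    """Create one result directory per supported model, if missing."""
    created = []
    for name in MODEL_NAMES:
        path = os.path.join(train_dir, results_dirname, name)
        # exist_ok=True makes the call safe to repeat before every run.
        os.makedirs(path, exist_ok=True)
        created.append(path)
    return created
```

Calling `ensure_result_dirs("train")` from the repository root before running the training script would create both directories in one step.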