Code to accompany: https://arxiv.org/abs/2010.12002
Sections follow the workflow shown in Figure 1.
First, fill out the lists in `data/sp500_list_25-02-2020.txt`: for each company, set the `keywords` attribute to the keywords corresponding to that company.
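The schema is whatever the shipped `data/sp500_list_25-02-2020.txt` already uses. Purely as a hypothetical illustration (one JSON object per line and the field names `name` and `keywords` are assumptions, not the repository's format), the list could be sanity-checked like this:

```python
# Hypothetical check: warn about companies whose "keywords" list is empty.
# The schema and field names below are assumptions for illustration only.
import json

with open("data/sp500_list_25-02-2020.txt") as f:
    for line in f:
        company = json.loads(line)
        if not company.get("keywords"):
            print(f"no keywords yet for {company.get('name', '?')}")
```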
Our pre-screening code is a Spark application that processes raw WARC files from Common Crawl, performs initial filtering, and outputs news articles as JSONL files. Because Common Crawl is hosted on Amazon S3, our code was run on AWS EMR and reads from and writes to S3. It can also be adapted to run on other infrastructure. Our code was cloned and adapted from https://github.com/commoncrawl/cc-pyspark.
Set the paths in lines 384-386 of `src/data/ccspark.py`. You need to point the script to an S3 path for a `.txt` file containing the S3 paths of the Common Crawl WARC files to process. See https://commoncrawl.org/the-data/get-started/ to retrieve WARC paths. You also need to set an output directory (this logs the S3 requests made by the script) and an S3 bucket where the script will write the processed data. By default it will write to `your_s3_bucket/CCNewsProcessed/run_timestamp`. Finally, you need to point the script to the `sp500_list_25-02-2020.txt` file from step 1 and indicate the number of input WARC files to process.
The Spark application can be run with:

```
python ccspark.py
```
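To produce the WARC-path input file, one option is to download a monthly `warc.paths.gz` listing for CC-NEWS and prefix each entry with the bucket name; a minimal sketch (the month and output file name are illustrative):

```python
# Build the input .txt of WARC paths from a CC-NEWS monthly path listing.
import gzip
import urllib.request

PATHS_URL = "https://data.commoncrawl.org/crawl-data/CC-NEWS/2020/02/warc.paths.gz"

with urllib.request.urlopen(PATHS_URL) as resp:
    paths = gzip.decompress(resp.read()).decode().splitlines()

with open("warc_paths_2020-02.txt", "w") as f:
    for path in paths:
        f.write(f"s3://commoncrawl/{path}\n")
```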
Set the `input_dir`, `output_statistics_path` and `output_file_path` variables at lines 76-79 in the `src/data/filtering_stage.py` file. Then execute

```
python filtering_stage.py
```

from the `src/data` directory. The `input_dir` path should be the output of the Pre-Screening step, i.e. `your_s3_bucket/CCNewsProcessed`.
Remark: this script is implemented to run on several CPU cores. If you have sufficient local computational power, it can be run locally on the data downloaded from S3. At least 8 cores are recommended.
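As a rough illustration of that multi-core pattern (a sketch, not the script's actual code; the file layout and the `matched_companies` field are assumptions):

```python
# Sketch: fan the pre-screened JSONL files out over a pool of worker processes.
import glob
import json
from multiprocessing import Pool

def filter_file(path):
    """Keep the articles in one JSONL file that pass a filtering predicate."""
    kept = []
    with open(path) as f:
        for line in f:
            article = json.loads(line)
            if article.get("matched_companies"):  # hypothetical predicate
                kept.append(article)
    return kept

if __name__ == "__main__":
    files = glob.glob("CCNewsProcessed/**/*.jsonl", recursive=True)
    with Pool(processes=8) as pool:  # the recommended minimum of 8 cores
        per_file = pool.map(filter_file, files)
```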
Set the variables `data_path` and `output_path` at lines 232-233 in `src/data/news_alignment.py`. Then execute

```
python news_alignment.py
```

from the `src/data` directory. The `data_path` path should be the `output_file_path` from the Filtering step.
Set the variables `data_path` and `output_path` at lines 79-80 in `src/models/benchmark/utils/text_preprocessing.py`. Then execute

```
python text_preprocessing.py
```

from the `src/models/benchmark/utils` directory. The `data_path` should be the `output_path` from the Alignment step.
The output of this script is a Python dictionary stored as a pickle file with the following attributes:

- `count_matrix`: numpy matrix with word counts
- `columns`: list of words corresponding to the columns in `count_matrix`
- `index`: pandas DataFrame with the values `company`, `publish_date`, `publish_date_utc`, `url`; each row of this DataFrame corresponds to a row in `count_matrix`, in the same order.
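For downstream use, the pickle can be loaded like this (the file name is illustrative; the attribute names are as documented above):

```python
import pickle

with open("index_matrix.p", "rb") as f:  # illustrative file name
    data = pickle.load(f)

count_matrix = data["count_matrix"]  # word counts, one row per article
columns = data["columns"]            # word for each column of count_matrix
index = data["index"]                # company, publish_date, publish_date_utc, url

# rows of `index` and `count_matrix` are aligned, in the same order
assert count_matrix.shape[0] == len(index)
```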
To download financial data, run

```
python finance_utils_yf.py
```

from `src/data/finance data`.
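As the file name suggests, the download is based on the `yfinance` package; a stand-alone sketch of the same kind of call (tickers and date range are illustrative):

```python
import yfinance as yf

# Daily bars for a couple of illustrative tickers; auto_adjust=False keeps
# the unadjusted OHLC columns alongside "Adj Close".
prices = yf.download(["AAPL", "MSFT"], start="2018-01-01", end="2020-06-30",
                     auto_adjust=False)
prices["Adj Close"].to_csv("daily_prices.csv")
```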
To add financial labels to the index matrix (see the Pre-Processing step above), run

```
python finance_preprocess.py
```

from `src/models/utils/`. Set the paths for the input datasets in lines 154-158.
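Conceptually, the labeling is a join of returns onto the article index by company and date; a hypothetical sketch (the file names, column names, and the daily-return convention are all assumptions):

```python
import pickle
import pandas as pd

with open("index_matrix.p", "rb") as f:  # illustrative file name
    index = pickle.load(f)["index"].copy()

# Hypothetical returns table with columns: ticker, date, ret
returns = pd.read_csv("daily_returns.csv", parse_dates=["date"])

index["publish_day"] = pd.to_datetime(index["publish_date"]).dt.normalize()
labeled = index.merge(returns, how="left",
                      left_on=["company", "publish_day"],
                      right_on=["ticker", "date"])
```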
Re-create the conda environment:

```
conda env create --name myenv --file /src/models/transfer_entropy/conda_transfer_entropy_analysis_environment.yml
```

(`conda env create` is the subcommand for `.yml` environment files; `conda create --file` expects an explicit package-spec list.)
Set the paths

```
PATH_NBBO_RETURNS = "data/finance_data/minute_OHLC/NBBO/NBBO_parsed_v2.csv"
PATH_SENTIMENT_INDEX = "data/index_matrices_2020/EW_LS_2020-06-17_18-09-14_index_matrix.p"
```

in the notebook `/src/models/transfer_entropy/transfer_entropy_intraday-2018-2020-polished.ipynb`. The input files should be the intra-day extracted sentiment data and the intra-day financial returns.
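For reference, a minimal discrete estimator of lag-1 transfer entropy from a sentiment series to a return series (quantile binning; a sketch, not the notebook's implementation):

```python
import numpy as np
from collections import Counter

def transfer_entropy(x, y, bins=3):
    """TE(X -> Y) in bits at lag 1, after quantile-binning both series."""
    xd = np.digitize(x, np.quantile(x, np.linspace(0, 1, bins + 1)[1:-1]))
    yd = np.digitize(y, np.quantile(y, np.linspace(0, 1, bins + 1)[1:-1]))
    n = len(yd) - 1
    triples = Counter(zip(yd[1:], yd[:-1], xd[:-1]))  # (y_t+1, y_t, x_t)
    pairs_yx = Counter(zip(yd[:-1], xd[:-1]))         # (y_t, x_t)
    pairs_yy = Counter(zip(yd[1:], yd[:-1]))          # (y_t+1, y_t)
    singles = Counter(yd[:-1])                        # y_t
    te = 0.0
    for (y1, y0, x0), c in triples.items():
        p_joint = c / n                               # p(y_t+1, y_t, x_t)
        p_full = c / pairs_yx[(y0, x0)]               # p(y_t+1 | y_t, x_t)
        p_reduced = pairs_yy[(y1, y0)] / singles[y0]  # p(y_t+1 | y_t)
        te += p_joint * np.log2(p_full / p_reduced)
    return te
```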
To run a backtest, execute

```
python backtesting.py
```

from `src/models/`. Set the paths for the input datasets in lines 736-738. The backtest configuration can be set with the parameters in lines 759-779; for a description of the parameters, see lines 342-375.
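For orientation only, the general shape of an equal-weight long-short backtest of the kind the script configures (a toy sketch; the real configuration lives at the line numbers above, and the frame layout is an assumption):

```python
import pandas as pd

def ew_long_short(signal: pd.DataFrame, returns: pd.DataFrame, k: int = 10):
    """Each day, go long the k highest-signal names and short the k lowest,
    earning next-day returns; both frames are indexed by date with one
    column per ticker (an assumed layout)."""
    pnl = {}
    days = signal.index.intersection(returns.index)
    for today, nxt in zip(days[:-1], days[1:]):
        ranked = signal.loc[today].dropna().sort_values()
        longs, shorts = ranked.index[-k:], ranked.index[:k]
        pnl[nxt] = returns.loc[nxt, longs].mean() - returns.loc[nxt, shorts].mean()
    return pd.Series(pnl)
```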