Lin Burg
Inbal Geva Oren
System prerequisites:
OS: Windows 10
IDE: PyCharm 2022.3.1 (Professional Edition)
MySQL Server: version 8.0
-
After cloning the project and opening it in PyCharm, create a virtual environment directory for the project:
- Press Ctrl+Alt+S to open the settings
- "Project: XXX"
- "Python Interpreter"
- "Add Interpreter" => "Add Local Interpreter"
- Select "Virtual Environment" => "New"
- In "Location", name the directory "venv" and save
-
Then, run the following in PyCharm's terminal: "pip install -r requirements.txt"
-
Define two environment variables for each Python file in the project:
-
Set up an environment variable named "DB_CONNECTION_STRING", which enables the connection to the MySQL database via SQLAlchemy, as follows:
- Open the "Edit Run/Debug configuration" dialog
- Press "Edit configuration"
- Under the "Configuration" tab, find "Environment Variables" and press the edit button to its right
- In the window that opens, manually add an environment variable named "DB_CONNECTION_STRING". In its "value" field, paste the following string, replacing "username" and "password" with your own MySQL credentials (keep them in this order): "mysql+pymysql://username:password@localhost/stock_database"
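The value above can also be assembled programmatically, which helps when the password contains characters such as "@" or "/" that would otherwise break the URL. The following is a minimal sketch, not part of the project: the helper name "build_connection_string" and its default arguments are assumptions for illustration, and it only relies on the standard library.

```python
from urllib.parse import quote_plus


def build_connection_string(username, password,
                            host="localhost", database="stock_database"):
    """Return a SQLAlchemy-style MySQL URL (hypothetical helper).

    The username and password are percent-encoded so that special
    characters cannot be confused with the URL's own separators.
    """
    user = quote_plus(username)
    secret = quote_plus(password)
    return f"mysql+pymysql://{user}:{secret}@{host}/{database}"


if __name__ == "__main__":
    # Paste the printed value into the "DB_CONNECTION_STRING" field.
    print(build_connection_string("username", "password"))
```

If your password contains only letters and digits, the result is identical to the string shown above.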
-
Set up the connection with the Tiingo API: create an environment variable named "TIINGO_API_TOKEN".
To retrieve the token, go to https://api.tiingo.com/documentation/general/connecting, click the "click here to see your API Token" button, and copy the token. Then, as before:
- Open the "Edit Run/Debug configuration" dialog
- Press "Edit configuration"
- Under the "Configuration" tab, find "Environment Variables" and press the edit button to its right
- In the window that opens, manually add an environment variable named "TIINGO_API_TOKEN". In its "value" field, insert the API token retrieved from Tiingo.
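To check that the run configuration really exposes the token, a script can read it at startup and fail with a clear message if it is missing. The sketch below is illustrative only: "require_env" and "build_daily_prices_request" are hypothetical names, not functions from this project, and the request is built but never sent. The endpoint and the "Token" authorization header follow Tiingo's public documentation.

```python
import os
from urllib.request import Request

TIINGO_DAILY_URL = "https://api.tiingo.com/tiingo/daily"


def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; "
            "add it to the Run/Debug configuration."
        )
    return value


def build_daily_prices_request(ticker, token):
    """Build (but do not send) a request for a ticker's daily prices."""
    url = f"{TIINGO_DAILY_URL}/{ticker.lower()}/prices"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Token {token}",
    }
    return Request(url, headers=headers)


if __name__ == "__main__":
    request = build_daily_prices_request("AAPL", require_env("TIINGO_API_TOKEN"))
    print(request.full_url)
```

The same "require_env" check works for "DB_CONNECTION_STRING".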
-
-
Jupyter notebook environment setup: create a ".env" file so that the notebooks can connect to the database and access other files. Whenever you work with the notebooks, place the file in the same subdirectory as them ("src" / "random forest model"). The ".env" file should include the same environment variables as the general project configuration:
- "DB_CONNECTION_STRING"
- "TIINGO_API_TOKEN"
It's very important to keep the ".env" file local to your machine (do not upload it to GitHub), as it holds your database password.
Each of the following steps needs to be run once. Note that this stage has a long runtime.
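A ".env" file is a plain text file with one NAME=value pair per line. How the notebooks actually load it is not shown here; the sketch below is a standard-library stand-in that illustrates the expected format, and "load_env_file" is a hypothetical helper name. A dedicated library such as python-dotenv is a common alternative.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Read NAME=value pairs from a file into os.environ (minimal sketch).

    Blank lines and lines starting with '#' are skipped. Variables that
    are already set in the environment are left unchanged.
    """
    loaded = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may contain '='.
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        os.environ.setdefault(key, value)
        loaded[key] = value
    return loaded
```

With this format, the file holds two lines: one starting with DB_CONNECTION_STRING= and one starting with TIINGO_API_TOKEN=, each followed by your own value.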
-
First, create the empty database schemas - run the "src/scripts/create_database_tables.py" script.
-
Populate the database schemas with S&P 500 historical data* by running the "src/scripts/populate_db.py" script.
-
Set up a scheduled task that performs the daily database update (important notice: this method only works on the Windows operating system). To do so, edit the following PowerShell script according to the instructions found within the script itself: "powershell_scripts/schedule_daily_db_update.sps1"
Once editing is complete, run the script.
Once the task is scheduled, it runs the "update_db.py" script on its own - make sure you don't delete that script or move it to a different path!
For further information: https://www.makeuseof.com/windows-powershell-scheduled-task/
*Tiingo offers 3 years of "DOW30" fundamental historical data free of charge. Access to a longer period, covering all S&P 500 stocks, requires payment: https://www.tiingo.com/account/billing/pricing
- To run the Random Forest model, run whichever of the following Jupyter notebook templates you need ("src" directory):
- Initial model with standard uniform parameters template: "src/rf_model_standard_params.ipynb"
- Model optimization (train, validation, test) template: "src/rf_model_optimization.ipynb"
- To test the reliability of the data preprocessing, we created a Python script that creates similar empty database schemas. These tables also need to be populated as described above, but they are meant to be filled manually and sporadically with NaN values by the user, in order to simulate data with missing values. Then, by running the Random Forest model Jupyter notebook, it is possible to compare the results for a stock that has undergone NaN handling with the results for the same stock without missing values.
- Understanding Tiingo's data structures: during development we sent HTTP requests using the "miscellaneous/http_requests/http_request_tiingo.http" file in order to better understand the data structures returned by Tiingo's API.
- Our model optimization results are all found in the excluded directory: ""
**Important notice - if you wish to run any Python script or Jupyter notebook that is located outside the "src" directory, move the file into the "src" directory first - otherwise it won't run properly.
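For the NaN-handling test described above, the missing values are entered by hand in the test tables. As an illustration of the same idea on in-memory data, the sketch below blanks out a random share of cells. It is not part of the project: "inject_nans" and the column names "close" and "volume" are hypothetical, and it works on plain dictionaries rather than on the database.

```python
import math
import random


def inject_nans(rows, columns, fraction=0.1, seed=0):
    """Return a copy of `rows` with a share of cells replaced by NaN.

    rows     -- list of dicts, one per record
    columns  -- keys that are allowed to receive NaN values
    fraction -- share of the eligible cells to blank out
    seed     -- fixed seed so the same cells are chosen on every run
    """
    rng = random.Random(seed)
    corrupted = [dict(row) for row in rows]  # leave the input untouched
    cells = [(i, col) for i in range(len(corrupted)) for col in columns]
    n_missing = round(len(cells) * fraction)
    for i, col in rng.sample(cells, n_missing):
        corrupted[i][col] = math.nan
    return corrupted


if __name__ == "__main__":
    prices = [{"close": 100.0 + day, "volume": 1000.0 * day} for day in range(10)]
    with_gaps = inject_nans(prices, ["close", "volume"], fraction=0.25)
    print(sum(math.isnan(v) for row in with_gaps for v in row.values()))
```

Because the seed is fixed, the model can be run on the original and the corrupted copy of the same stock repeatedly with identical gaps.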