
bitcoin-rate-etl's Introduction

Hi there 👋

🌱 I am a data integration specialist, learning data engineering
🔭 I am building a data engineering portfolio
👯 I am looking to collaborate on data engineering-related projects
💬 Ask me about Data Quality Management, Business Intelligence & Data Analytics

bitcoin-rate-etl's People

Contributors

richardogoma, richardogoma-nlng


Forkers

fitzjoe

bitcoin-rate-etl's Issues

Implement logging feature

Here are some considerations:

  1. Create a logs/ directory
  2. Create a log submodule, most likely in the load submodule of the etl module. This module has the following features:
    i. Contains functions to write INFO and ERROR messages to a program.log file in the logs/ dir
    ii. Contains a default script that runs when the module is invoked. This script checks whether the size of the log file exceeds 1024 KiB (or some other threshold) and, if so, rotates it: the old log file is added to an archive (rar or zip format) in the logs/ dir

The log submodule would have two functions:
i. log.Info(info_msg) which would handle the info messages
ii. log.Error(error_msg) which would handle the error messages

from etl.load import log

log.Info(
    f"{datetime.now()}: INFO: Inserted {info} into the database. ETL process duration: {process_duration} secs"
)

log.Error(f"{datetime.now()}: ERROR: {str(error_msg)}")
  3. In the __init__.py file of the etl module, the logging functions should be invoked in place of the print function currently in use
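A minimal sketch of such a submodule, assuming the function names above (Info, Error), a zip archive for rotated files, and an illustrative path and threshold; this is not the repo's implementation:

```python
import os
import zipfile
from datetime import datetime, timezone

LOG_DIR = "logs"
LOG_FILE = os.path.join(LOG_DIR, "program.log")
MAX_BYTES = 1024 * 1024  # rotate once program.log exceeds 1024 KiB

def _write(level, msg):
    # Append a timestamped line to logs/program.log, creating the dir on demand.
    os.makedirs(LOG_DIR, exist_ok=True)
    with open(LOG_FILE, "a", encoding="utf-8") as fh:
        fh.write(f"{datetime.now(timezone.utc).isoformat()}: {level}: {msg}\n")

def Info(info_msg):
    _write("INFO", info_msg)

def Error(error_msg):
    _write("ERROR", error_msg)

def rotate_if_needed():
    # Archive the current log into a timestamped zip and start a fresh file.
    if os.path.exists(LOG_FILE) and os.path.getsize(LOG_FILE) > MAX_BYTES:
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
        archive = os.path.join(LOG_DIR, f"program-{stamp}.zip")
        with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
            zf.write(LOG_FILE, arcname="program.log")
        os.remove(LOG_FILE)
```

rotate_if_needed() would run in the module's default script on import or invocation, before the first write of a run.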

Functional test for 48hrs data range retention --failed

With regards to #3 ,
48 hours' worth of data is equivalent to 2880 minutes; since the pipeline inserts new data into the db every minute, that is 2880 records, which makes this a good functional test.


sqlite> select * from bitcoin_rates limit 1;
87|2023-06-13T20:31:00+00:00|Bitcoin|25871.5434|21618.0547|25202.6605
sqlite> select * from bitcoin_rates order by unique_number desc limit 1;
2948|2023-06-15T20:30:00+00:00|Bitcoin|25239.2009|21089.6743|24586.6666

The record with timestamp 2023-06-13T20:30:00+00:00 is deleted when a record with timestamp 2023-06-15T20:30:00+00:00 is inserted because of this line in the patch: datetime('now', '-48 hours'). Actually, the reference time to use isn't now, but 48 hours before the timestamp of the record being inserted. That timestamp can be obtained programmatically from this list

formatted_data = [
    str(item)
    if isinstance(item, Decimal)
    else item.isoformat()
    if isinstance(item, datetime)
    else str(item)
    for item in data
]
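A sketch of the corrected retention delete, computing the 48-hour cutoff from the record's own timestamp instead of 'now'. The function name and the "timestamp" column name are assumptions; the table name comes from the sqlite session above:

```python
import sqlite3
from datetime import datetime, timedelta

def prune_old_rows(conn, record_timestamp_iso):
    # record_timestamp_iso is the isoformat() string produced in formatted_data.
    cutoff = (datetime.fromisoformat(record_timestamp_iso)
              - timedelta(hours=48)).isoformat()
    # ISO-8601 strings with the same UTC offset compare lexicographically
    # in chronological order, so a plain < works on the TEXT column.
    conn.execute("DELETE FROM bitcoin_rates WHERE timestamp < ?", (cutoff,))
```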

Consolidate etl setup scripts

There should be only one setup script. The other commands for virtual environments should be handled externally, but ensure that technical guidance is written in the README.md.

Also, research the difference between sourcing the setup script and simply running it in another shell instance; that is, what is the difference between these:

source setup_etl.sh
./setup_etl.sh
bash setup_etl.sh
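For reference: sourcing runs the script in the current shell, so exported variables and an activated virtualenv persist afterwards; ./setup_etl.sh runs it in a child process via its shebang and requires the executable bit; bash setup_etl.sh also runs a child process but forces bash and ignores shebang and permissions. A quick illustration using a throwaway demo_setup.sh (not the real setup script):

```shell
# Create a tiny demo script that sets a variable, to show the difference.
printf 'DEMO_VAR=from_script\n' > demo_setup.sh
chmod +x demo_setup.sh

# Child process: the variable does NOT survive in the calling shell.
bash demo_setup.sh
echo "after bash: ${DEMO_VAR:-unset}"     # prints "after bash: unset"

# Sourced: the script runs in the current shell, so the variable DOES survive.
source demo_setup.sh
echo "after source: $DEMO_VAR"            # prints "after source: from_script"
```

This is why activating a virtualenv from a setup script only works if the script is sourced.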

Discontinuous timestamp in loaded data

DISCONTINUOUS TIMESTAMP
This could either be due to a delay by the client in fetching the JSON payload before it is refreshed, or due to the data at the missing timestamp not being available on the API's server. The latter is highly unlikely.

So we can try to architect the ETL pipeline to synchronize with the refresh schedule of the server.

Duplicates in loaded data

DUPLICATES
This is due to a delay by the client in fetching the JSON payload before it is refreshed; we therefore have to use an UPSERT clause and a UNIQUE constraint on the timestamp field to prevent duplicate records from being inserted.
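A sketch of that fix with a simplified two-column schema (the real table has more columns); SQLite's UPSERT is the ON CONFLICT ... DO UPDATE clause, available since SQLite 3.24:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# UNIQUE constraint on the timestamp field, as proposed above.
conn.execute("""
    CREATE TABLE bitcoin_rates (
        timestamp TEXT UNIQUE,
        rate_usd  TEXT
    )
""")

def upsert_rate(ts, rate):
    # A late duplicate fetch updates the existing row instead of inserting
    # a second record for the same timestamp.
    conn.execute(
        "INSERT INTO bitcoin_rates (timestamp, rate_usd) VALUES (?, ?) "
        "ON CONFLICT(timestamp) DO UPDATE SET rate_usd = excluded.rate_usd",
        (ts, rate),
    )

upsert_rate("2023-06-15T20:30:00+00:00", "25239.2009")
upsert_rate("2023-06-15T20:30:00+00:00", "25239.2009")  # delayed duplicate fetch
```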

Handle empty response from API

response = retrieve_rates(uri=api_url)
parsed_data = parse_dict(response)

In the __init__.py file, raise HTTPError or another exception from the requests.exceptions module if the response from the extract phase is an empty dict.
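A sketch of such a guard (the function name is illustrative, not from the repo), raising HTTPError from requests.exceptions on an empty payload:

```python
import requests

def check_extract_response(response):
    # An empty dict from the extract phase means the payload was not
    # retrieved; fail fast before the transform phase runs.
    if not response:
        raise requests.exceptions.HTTPError("Empty response from the API")
    return response
```

The __init__.py call site would become check_extract_response(retrieve_rates(uri=api_url)) before parse_dict is invoked.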

Server Data Refresh Schedule

The data at the CoinDesk server is refreshed every minute continuously: so the client has to be architected to continuously fetch data from the server every minute.

If the client hits the server now, the next hit has to be a minute or 60 seconds from now, regardless of the compute time spent by the client processing the response payload.

Pytest results show that the ETL client typically takes less than 1 second for a complete run.

For example, if the client fires at 18:30, the next fire should be at 18:31 (one minute from 18:30). Say the client spends 0.23 seconds processing, so it completes its run at T(18:30 + 0.23s).
For the next fire, this means the program has to wait T(18:31 - last_run_time) secs, provided that value is greater than zero.

TypeError Handling

except (requests.exceptions.RequestException, sqlite3.Error) as error_msg:
    pass

Add TypeError to the classes of errors handled in the except block of the etl module's __init__.py file. This is needed because of the datetime and Decimal format transformations invoked from the parser submodule of the transform module.
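A sketch of the widened handler; run_step is a hypothetical wrapper name, not from the repo:

```python
import sqlite3
import requests

def run_step(step):
    try:
        step()
    except (requests.exceptions.RequestException,
            sqlite3.Error,
            TypeError) as error_msg:
        # TypeError covers failures in the datetime/Decimal transformations
        # performed by the parser submodule of the transform module.
        print(f"ERROR: {error_msg}")
```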

Discontinuous timestamp in loaded data --persisting

In addition to the resolution in #11, which involved the setup_etl.sh script and syncing the initial program trigger with the start of the next minute: it is true that when the nohup command is fired,

# Start the ETL pipeline in the background and append stdout to output.log
nohup nice -n 10 python3 -u etl_pipeline.py >> output.log 2>&1 &

there is a delay, maybe up to a minute, for Python to initialize and execute the first run. Each subsequent run then starts 60 secs after the previous one completes, processing delay included. So the next run doesn't really start at the next minute, but 60 seconds after the last run.

# Wait for one minute before fetching the next update
time.sleep(60)

We have to programmatically calculate the appropriate delay in seconds, rather than the blanket 60 seconds, to solve this discontinuous timestamp issue.
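A minimal sketch of that calculation (the function name is illustrative): compute the seconds remaining to the next minute boundary, so drift from startup and processing time never accumulates:

```python
import time

def seconds_until_next_minute(now=None):
    # Remaining seconds from `now` (epoch seconds) to the next minute
    # boundary; at an exact boundary it returns a full 60 seconds.
    if now is None:
        now = time.time()
    return 60.0 - (now % 60.0)

# In the pipeline loop, replace time.sleep(60) with:
#     time.sleep(seconds_until_next_minute())
```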
