
bitcoin-rate-etl's Introduction

Hi there 👋

🌱 I am a data integration specialist, learning data engineering
🔭 I am building a data engineering portfolio
👯 I am looking to collaborate on data engineering-related projects
💬 Ask me about Data Quality Management, Business Intelligence & Data Analytics

bitcoin-rate-etl's People

Contributors

richardogoma, richardogoma-nlng


Forkers

fitzjoe

bitcoin-rate-etl's Issues

Implement logging feature

Here are some considerations:

  1. Create a logs/ directory
  2. Create a log submodule, most likely in the load submodule of the etl module. This module has the following features:
    i. Contains functions to write INFO and ERROR messages to a program.log file in the logs/ dir
    ii. Contains a default script that runs when the module is invoked. This script checks whether the size of the log file exceeds 1024 KiB (or some other threshold) and, if so, rotates it: the old log file is added to an archive (rar or zip format) in the logs/ dir

The log submodule would have two functions:
i. log.Info(info_msg) which would handle the info messages
ii. log.Error(error_msg) which would handle the error messages

from etl.load import log

log.Info(
    f"{datetime.now()}: INFO: Inserted {info} into the database. ETL process duration: {process_duration} secs"
)

log.Error(f"{datetime.now()}: ERROR: {str(error_msg)}")
  3. In the __init__.py file of the etl module, the logging functions should be invoked in place of the print function currently in use
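A minimal sketch of such a submodule, assuming the function names above (Info, Error), a zip archive for rotated files, and an illustrative path and threshold; this is not the repo's implementation:

```python
import os
import zipfile
from datetime import datetime, timezone

LOG_DIR = "logs"
LOG_FILE = os.path.join(LOG_DIR, "program.log")
MAX_BYTES = 1024 * 1024  # rotate once program.log exceeds 1024 KiB

def _write(level, msg):
    # Append a timestamped line to logs/program.log, creating the dir on demand.
    os.makedirs(LOG_DIR, exist_ok=True)
    with open(LOG_FILE, "a", encoding="utf-8") as fh:
        fh.write(f"{datetime.now(timezone.utc).isoformat()}: {level}: {msg}\n")

def Info(info_msg):
    _write("INFO", info_msg)

def Error(error_msg):
    _write("ERROR", error_msg)

def rotate_if_needed():
    # Archive the current log into a timestamped zip and start a fresh file.
    if os.path.exists(LOG_FILE) and os.path.getsize(LOG_FILE) > MAX_BYTES:
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
        archive = os.path.join(LOG_DIR, f"program-{stamp}.zip")
        with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
            zf.write(LOG_FILE, arcname="program.log")
        os.remove(LOG_FILE)
```

rotate_if_needed() would run in the module's default script on import or invocation, before the first write of a run.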

Functional test for 48hrs data range retention --failed

With regards to #3 ,
48 hours' worth of data is equivalent to 2880 minutes; since the pipeline inserts new data into the db every minute, that is 2880 records, which makes this a good functional test.


sqlite> select * from bitcoin_rates limit 1;
87|2023-06-13T20:31:00+00:00|Bitcoin|25871.5434|21618.0547|25202.6605
sqlite> select * from bitcoin_rates order by unique_number desc limit 1;
2948|2023-06-15T20:30:00+00:00|Bitcoin|25239.2009|21089.6743|24586.6666

The record with timestamp 2023-06-13T20:30:00+00:00 is deleted when a record with timestamp 2023-06-15T20:30:00+00:00 is inserted because of this line in the patch: datetime('now', '-48 hours'). Actually, the reference time to use isn't now, but 48 hours before the timestamp of the record being inserted. That timestamp can be obtained programmatically from this list

formatted_data = [
    str(item)
    if isinstance(item, Decimal)
    else item.isoformat()
    if isinstance(item, datetime)
    else str(item)
    for item in data
]
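A sketch of the corrected retention delete, computing the 48-hour cutoff from the record's own timestamp instead of 'now'. The function name and the "timestamp" column name are assumptions; the table name comes from the sqlite session above:

```python
import sqlite3
from datetime import datetime, timedelta

def prune_old_rows(conn, record_timestamp_iso):
    # record_timestamp_iso is the isoformat() string produced in formatted_data.
    cutoff = (datetime.fromisoformat(record_timestamp_iso)
              - timedelta(hours=48)).isoformat()
    # ISO-8601 strings with the same UTC offset compare lexicographically
    # in chronological order, so a plain < works on the TEXT column.
    conn.execute("DELETE FROM bitcoin_rates WHERE timestamp < ?", (cutoff,))
```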

Consolidate etl setup scripts

There should be only one setup script. The other commands for virtual environments should be handled externally, but ensure that technical guidance is written in the README.md.

Also, research the difference between sourcing the setup script and simply running it in another shell instance; that is, what is the difference between these:

source setup_etl.sh
./setup_etl.sh
bash setup_etl.sh
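For reference: sourcing runs the script in the current shell, so exported variables and an activated virtualenv persist afterwards; ./setup_etl.sh runs it in a child process via its shebang and requires the executable bit; bash setup_etl.sh also runs a child process but forces bash and ignores shebang and permissions. A quick illustration using a throwaway demo_setup.sh (not the real setup script):

```shell
# Create a tiny demo script that sets a variable, to show the difference.
printf 'DEMO_VAR=from_script\n' > demo_setup.sh
chmod +x demo_setup.sh

# Child process: the variable does NOT survive in the calling shell.
bash demo_setup.sh
echo "after bash: ${DEMO_VAR:-unset}"     # prints "after bash: unset"

# Sourced: the script runs in the current shell, so the variable DOES survive.
source demo_setup.sh
echo "after source: $DEMO_VAR"            # prints "after source: from_script"
```

This is why activating a virtualenv from a setup script only works if the script is sourced.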

Discontinuous timestamp in loaded data

DISCONTINUOUS TIMESTAMP
This could either be due to a delay by the client in fetching the JSON payload before it is refreshed, or due to the data at the missing timestamp not being available on the API's server. The latter is highly unlikely.

So we can try to architect the ETL pipeline to synchronize with the refresh schedule of the server.

Duplicates in loaded data

DUPLICATES
This is due to a delay by the client in fetching the JSON payload before it is refreshed; we therefore have to use an UPSERT clause and a UNIQUE constraint on the timestamp field to prevent duplicate records from being inserted.
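A sketch of that fix with a simplified two-column schema (the real table has more columns); SQLite's UPSERT is the ON CONFLICT ... DO UPDATE clause, available since SQLite 3.24:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# UNIQUE constraint on the timestamp field, as proposed above.
conn.execute("""
    CREATE TABLE bitcoin_rates (
        timestamp TEXT UNIQUE,
        rate_usd  TEXT
    )
""")

def upsert_rate(ts, rate):
    # A late duplicate fetch updates the existing row instead of inserting
    # a second record for the same timestamp.
    conn.execute(
        "INSERT INTO bitcoin_rates (timestamp, rate_usd) VALUES (?, ?) "
        "ON CONFLICT(timestamp) DO UPDATE SET rate_usd = excluded.rate_usd",
        (ts, rate),
    )

upsert_rate("2023-06-15T20:30:00+00:00", "25239.2009")
upsert_rate("2023-06-15T20:30:00+00:00", "25239.2009")  # delayed duplicate fetch
```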

Handle empty response from API

response = retrieve_rates(uri=api_url)
parsed_data = parse_dict(response)

In the __init__.py file, raise HTTPError or another exception from the requests.exceptions module if the response from the extract phase is an empty dict.
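A sketch of such a guard (the function name is illustrative, not from the repo), raising HTTPError from requests.exceptions on an empty payload:

```python
import requests

def check_extract_response(response):
    # An empty dict from the extract phase means the payload was not
    # retrieved; fail fast before the transform phase runs.
    if not response:
        raise requests.exceptions.HTTPError("Empty response from the API")
    return response
```

The __init__.py call site would become check_extract_response(retrieve_rates(uri=api_url)) before parse_dict is invoked.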

Server Data Refresh Schedule

The data at the CoinDesk server is refreshed every minute continuously: so the client has to be architected to continuously fetch data from the server every minute.

If the client hits the server now, the next hit has to be a minute or 60 seconds from now, regardless of the compute time spent by the client processing the response payload.

Pytest results show that the ETL client typically takes less than 1 second for a complete run.

For example, if the client fires at 18:30, the next fire should be at 18:31 (one minute from 18:30). Say the client spends 0.23 seconds processing, so it completes its run at T(18:30 + 0.23s).
For the next fire, this means the program has to wait T(18:31 - last_run_time) secs, provided that value is greater than zero.

TypeError Handling

except (requests.exceptions.RequestException, sqlite3.Error) as error_msg:
    pass

Add TypeError to the classes of errors handled in the except block of the etl module's __init__.py file. This is needed because of the datetime and Decimal format transformations invoked from the parser submodule of the transform module.
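A sketch of the widened handler; run_step is a hypothetical wrapper name, not from the repo:

```python
import sqlite3
import requests

def run_step(step):
    try:
        step()
    except (requests.exceptions.RequestException,
            sqlite3.Error,
            TypeError) as error_msg:
        # TypeError covers failures in the datetime/Decimal transformations
        # performed by the parser submodule of the transform module.
        print(f"ERROR: {error_msg}")
```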

Discontinuous timestamp in loaded data --persisting

In addition to the resolution in #11, which involved the setup_etl.sh script and syncing the initial program trigger with the start of the next minute: it is true that when the nohup command is fired,

# Start the ETL pipeline in the background and append stdout to output.log
nohup nice -n 10 python3 -u etl_pipeline.py >> output.log 2>&1 &

there is a delay, maybe up to a minute, for Python to initialize and execute the first run. Each subsequent run then starts 60 secs after the previous one completes, processing delay included. So the next run doesn't really start at the next minute, but 60 seconds after the last run.

# Wait for one minute before fetching the next update
time.sleep(60)

We have to programmatically calculate the appropriate delay in seconds, rather than the blanket 60 seconds, to solve this discontinuous timestamp issue.
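A minimal sketch of that calculation (the function name is illustrative): compute the seconds remaining to the next minute boundary, so drift from startup and processing time never accumulates:

```python
import time

def seconds_until_next_minute(now=None):
    # Remaining seconds from `now` (epoch seconds) to the next minute
    # boundary; at an exact boundary it returns a full 60 seconds.
    if now is None:
        now = time.time()
    return 60.0 - (now % 60.0)

# In the pipeline loop, replace time.sleep(60) with:
#     time.sleep(seconds_until_next_minute())
```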
