Comprehensive ETL, Data Warehousing, Processing & Visualization Project
A Low-Level Approach with SQL, Shell Scripts, SQLLoader and Apache Superset
#batch #dwh #datamart #nfz #data #etl #bi #visualisation #healthcare #dataengineering
#analyst #pipeline #dataops #batch #shell #oracle #apachesuperset #sqlcl #docker #pluggabledb
Projects status: in progress
Let's start with a quick preview to grab your attention. I've prepared comprehensive cross-sectional descriptions and installers (shell & SQL scripts) for various versions of my reports. Of course, you don't have to go through the installation process. For more details, simply scroll down.
- Here are two dashboard options that showcase the results of my process:
-
BI Dashboard - Modern Approach (show more screenshots)
Example 1: Apache Superset: BI Visualisation with application side processing (in progress!)
SQL> select total_hospitalization_in_dataset
2* from rpt_totalhosp;
TOTAL_HOSPITALIZATION_IN_DATASET
___________________________________
21,194,349
Example 2: SQlCl: Show Report View as Text Output
- Sample: Distribution based on the count of specific modes of patient discharge per admission, according to the NFZ (National Health Fund) modes of discharge and admission
REASON_FOR_ADMISSION | DIS_1 | DIS_2 | DIS_3 | DIS_4 | DIS_6 | DIS_7 | DIS_8 | DIS_9 | DIS_10 | DIS_11 |
---|---|---|---|---|---|---|---|---|---|---|
Emergency admission - other cases (3) | 6919800 | 15455742 | 1055898 | 188649 | 709275 | 14814 | 2970 | 1122597 | 819 | 2382 |
Planned admission based on a referral (6) | 8711580 | 18046503 | 407859 | 62466 | 238764 | 6309 | 2295 | 213738 | 639 | 2988 |
Emergency admission due to transfer by the medical rescue team (2) | 1603881 | 4697712 | 493851 | 154932 | 203370 | 11586 | 978 | 1175358 | 480 | 903 |
Admission of a newborn as a result of childbirth in this hospital (5) | 713970 | 191286 | 74289 | 1260 | 20469 | 72 | 219 | 11430 | 33 | 93 |
Admission based on an oncology diagnostics and treatment card (11) | 271371 | 340449 | 5535 | 897 | 2766 | 75 | 30 | 9750 | 15 | |
(...) |
and more...
Develop a ETL pipeline and utilize BI tools to process and visualize over 21 million NFZ hospitalization records (2017-2022), with a emphasis on data processing framework.
- Environment Setup: Script to automate database and environment configuration
- ETL & DWH: Automated extraction, transformation, and loading of data into a star schema data mart
- Reports & Dashboards:
- Data Profiling Dashboard (visualisation + text-based report)
- Main Analytical Dashboard (visualisation + text-based report)
This project is based on real NFZ data sourced from the open-data portal, dane.gov.pl. The dataset comprises all hospitalizations covered by the NFZ (National Health Fund of Poland) in 2017-2022 *.
- Dane dotyczące hospitalizacji rozliczonych JGP w latach 2019-2021 (contains hospitalizations 2017, 2018)
- Dane dotyczące hospitalizacji rozliczonych JGP w latach 2022
The dataset comprises over 20 million records, with each record representing an individual patient's hospitalization data.
It's based on the public goverment repositories hub named data.gov.pl.
Used repository is not bad quality, but I saw small differences between files. I want to make some improvements, create DWH star schema, materialized view as fact table, dimensions etc... I will clean and process the data to visualize it with Apache Superset.
This repository showcases my Data Engineering project, highlighting my diverse data-related skills. It includes database administration, data warehousing, and ETL development tasks. My objective is to prepare and process this data for visualization purposes, including the creation of charts and dashboards. I aim to learn modern data visualization BI tool.
Tool/Software | Description |
---|---|
Visualization | |
Apache Superset | Visualization tool |
Data Cleaning | |
Oracle SQL | Data cleaning and manipulation |
Virtual Machines | |
VM 1 | Oracle Linux with DB 21 |
VM 2 | Centos, Docker, Apache Superset container with cx_oracle connector |
Shell Scripting | |
MINGW Bash | Shell scripting |
Data Loading | |
SqlLoader | Data loading |
IDE | |
SQLDeveloper 23 | IDE for DBA tasks and queries |
Visual Studio Code | IDE with new Oracle plugin |
Database Tools | |
SQLPlus & SQLCli | Database command-line interface |
Others | |
PlantUml | Tool for diagrams |
GIT | Version control system |
ROK;MIESIAC;OW_NFZ;NIP_PODMIOTU;KOD_PRODUKTU_KONTRAKTOWEGO;KOD_PRODUKTU_JEDNOSTKOWEGO;KOD_TRYBU_PRZYJECIA;KOD_TRYBU_WYPISU;PLEC_PACJENTA;GRUPA_WIEKOWA_PACJENTA;PRZEDZIAL_DLUGOSCI_TRWANIA_HOSPITALIZACJI;LICZBA_HOSPITALIZACJI 2022;4;"07";"1132866688";"03.4580.991.02";"5.51.01.0008013";6;2;"K";"65 i wiecej";"6 i wie™cej dni";"<5" 2022;8;"02";"5562239217";"03.4220.030.02";"5.51.01.0001087";3;2;"K";"45-64";"6 i wiecej dni";"<5" 2022;11;"03";"9462146139";"03.4580.991.02";"5.51.01.0008015";6;2;"K";"65 i wiecej";"6 i wiecej dni";"<5" 2022;9;"15";"7842008454";"03.4450.040.02";"5.51.01.0012014";3;2;"K";"65 i wiecej";"0 dni";"<5"
The data contains some foreign keys pointing to static dictionaries:
The dictionary source for Polish HL7 implementations includes
- discharge modes https://www.cez.gov.pl/HL7POL-1.3.2/plcda-html-1.3.2/plcda-html/voc-2.16.840.1.113883.3.4424.13.11.36-2015-10-26T000000.html
- admision modes
I had to transform the source file into a star schema model for that data mart...
As a DBA, I handle SQL scripts, database creation, structure definition, user management, and permissions.
I've prepared some bash and SQL scripts to create database, structures and objects and automate the installation process. You can use either install.sh or rebuild.sh to get started.
Figure 1: Demonstration of the script execution process (gif animations)
Of course you don't have to go through the installation process. Simply open the text-based dashboard or view the visualization screenshots. However, if you're a professional user, you can review my analysis queries (along with all other scripts) by navigating to the sql/ GitHub directory.
Data profiling
Analysis
- View the text-based dashboard
- View the BI Apache Superset dashboard
- Review my queries sql/6-create-reports-as-analyst.sql