Comments (13)
Hi all,
I had a go at scraping the NICD datasets for data from 27 October 2020 onwards (from when the daily reports table has 10 columns). There is still work to do to get some of the columns described in covid19za/data/nicd_hospital_surveillance_data.csv (general, high care, isolation, total health care workers admitted) and to handle the many ways in which dates appear in the raw PDF files. I hope this can be useful as a basis for scraping if there is no luck with NICD:
https://gist.github.com/sjbeckett/f1d3822db7d41d33dfddd01814a64481
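On the point about dates appearing in many forms in the raw PDFs, here is a minimal Python sketch of one way to normalise them. It is not taken from the gist. The list of formats and the name `parse_report_date` are assumptions for illustration, and numeric dates are assumed to be day-first.

```python
from datetime import date, datetime
from typing import Optional

# Hypothetical list of formats; extend it as new variants turn up in the PDFs.
# Numeric dates are assumed to be day-first (e.g. 02-09-2020 is 2 September 2020).
CANDIDATE_FORMATS = ("%d %B %Y", "%d %b %Y", "%d-%m-%Y", "%d/%m/%Y", "%Y-%m-%d")


def parse_report_date(raw: str) -> Optional[date]:
    """Return the first successful parse of `raw`, or None if no format matches."""
    cleaned = " ".join(raw.split())  # collapse stray whitespace from PDF extraction
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue  # try the next candidate format
    return None
```

Returning None rather than raising lets the caller log the unparsed strings and add formats over time.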
from covid19za.
All,
I did some work on the hospitalization data too. Actually, my son did :-).
Not in the data/nicd_hospital_surveillance_data.csv file yet, but in another file with a different format.
See data/covid19za_provincial_raw_hospitalization.csv
The scraper lives in scripts/daily_nicd_datcov.R.
The GitHub workflow runs this script every night. The publication of these files appears to be manual, so I'm not 100% sure what the best time would be.
Remaining todo:
- make sure the scraper runs stably
- extract the relevant parts and update the old summary hospitalization file, which is partially done already.
g
A friend has made the request. We will see if she gets a response @anelda
Thank you. I think this is a good start, and we can then try to start filling in where the scraper could not.
The challenge has been keeping up with the updates of the NICD Hospital Admissions reports. They are available here https://www.nicd.ac.za/diseases-a-z-index/disease-index-covid-19/surveillance-reports/daily-hospital-surveillance-datcov-report/
The second page of the daily reports has this table
Now if we can get someone to start doing backfill (start with 1 December 2021, for example, and work backwards), it would be great @maximeLpt
Do you have a scraper for that?
@maximeLpt no. It was initially filled in by a volunteer, day by day.
Is there no chance NICD can/will just share the table in Excel or CSV format?
Hey @anelda @vukosim, let me know what the NICD says. If no luck comes, then we can surely build a wrapper/scraper for the PDFs or find a way to parse the data from a different format?
I managed to write a fairly decent scraper for the PDFs. Perhaps the script can be enhanced by adding a filter for a particular day, e.g. Report_Date >= Sys.Date(), to avoid re-downloading each file. The script can run as a cron job.
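The date filter idea can be sketched roughly as follows, in Python rather than the R of the scraper. This is only an illustration under assumptions: `reports_to_download`, the `available` mapping of report dates to URLs, and the local directory layout are all hypothetical and not part of the actual script.

```python
from datetime import date
from pathlib import Path
from typing import Dict, List


def reports_to_download(available: Dict[date, str], download_dir: Path, since: date) -> List[str]:
    """Return URLs of reports dated on or after `since` that are not on disk yet.

    `available` maps a report date to its PDF URL (hypothetical structure).
    """
    pending = []
    for report_date, url in sorted(available.items()):
        if report_date < since:
            continue  # older than the cut-off, skip without touching the network
        filename = url.rsplit("/", 1)[-1]  # last path segment of the URL
        if not (download_dir / filename).exists():
            pending.append(url)
    return pending
```

A nightly job could call this with `since` set to the current date, or to the date of the last successful run, so that only new files are fetched.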
Another script is the table scraper. The results were mixed. This could be due to changes in how the tables are formatted, how the files are generated, etc. Extracting each table from all files can be time-consuming. There is probably a way to convert the tables into a uniform, sound format using another language or technique.
directory containing the scripts
Just a note: I updated the above gist to incorporate general and high care patient numbers.
Of note, isolation was no longer reported after 02-09-2020; and admitted healthcare workers were no longer reported after 08-10-2020.
Greetings everyone. I took a different approach to this issue and created an API which is free and publicly available. At the moment, the database has data from 01 July 2021 to 28 Dec 2021.
The API has 2 main endpoints:
- /all:
- /dates: Allows for filtering data between 2 specific dates. Province is not required, but you can specify it if you are looking for data for a specific province.
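For anyone wanting to try the /dates endpoint from a script, a query URL could be built along these lines. This is a hedged Python sketch: the parameter names `start_date`, `end_date` and `province`, the ISO date format, and the helper `build_dates_url` are assumptions, not the documented interface, so check the API docs page for the real names.

```python
from datetime import date
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "https://covidza-data.deta.dev"  # host taken from the docs link in this comment


def build_dates_url(start: date, end: date, province: Optional[str] = None) -> str:
    """Build a /dates query URL. Parameter names are guesses, not the documented API."""
    if start > end:
        raise ValueError("start must not be after end")
    params = {"start_date": start.isoformat(), "end_date": end.isoformat()}
    if province is not None:
        params["province"] = province  # optional, as described above
    return f"{BASE_URL}/dates?{urlencode(params)}"
```

The function only builds the URL and makes no request, so it can be adapted once the real parameter names are confirmed.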
Getting the PDFs from the NICD website and uploading the data to a cloud database is automated. Annotating the table in each PDF is still manual, because the PDF formatting varies between documents; that is the only manual step so far.
Try the API and I would appreciate some pointers and thoughts:
https://covidza-data.deta.dev/docs
Happy new year everyone. There will be a bit more action to finalise this in the next 2 weeks. Thank you so much for the work and ideas.