datopian / bad-data

Examples of bad data, especially from government.

Home Page: https://datahub.io/@rufuspollock/bad-data

bad-data's People

raphadasilva todrobbins rossmounce nmashton infoculture mtayseer luisdaniel adamamyl codeforcroatia slavinski

bad-data's Issues

Russian foreign trade statistics as DBF and after captcha

title: Russian foreign trade statistics
dataformat: DBF (DBase)
datapublisher: Federal Customs Service
dataurl: http://stat.customs.ru/apex/f?p=201:3:822234424961570::NO:::

author: Ivan Begtin
authorurl: http://infoculture.ru

What's bad?

Impossible to download the whole dataset at once: there is no bulk download.
For each data slice you have to enter a captcha.
The data format, DBF (DBase), is proprietary.

Screenshot

Fix HTML on OKFN Labs Bad Data page

Screenshot with some comments:

http://imgur.com/Qvs2ZKo

Bond Yields from Eurostat

url: https://stats.ecb.europa.eu/stats/download/irs/irs/irs.zip
- home page is http://www.ecb.europa.eu/stats/money/long/html/index.en.html
cached: https://github.com/okfn/bad-data/blob/master/data/bond-yields-eurostat.csv
format: csv
- (but inside a zip)
wrongness
- metadata inlined into csv
- blank rows (first line and end of file)
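
The problems listed above (a CSV wrapped in a zip, metadata lines mixed in with the data, blank rows) can be worked around with the standard library. This is only a minimal sketch: the member name `irs.csv`, the sample content, and the rule "a row with fewer than two non-empty cells is metadata" are assumptions for illustration, not facts about the actual ECB file.

```python
import csv
import io
import zipfile


def read_csv_from_zip(zip_bytes, member, min_fields=2):
    """Return the data rows of a CSV stored inside a zip archive.

    Blank rows, and rows with fewer than `min_fields` non-empty cells
    (assumed here to be inlined metadata), are dropped.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        text = archive.read(member).decode("utf-8")
    rows = []
    for row in csv.reader(io.StringIO(text)):
        filled = [cell for cell in row if cell.strip()]
        if len(filled) >= min_fields:
            rows.append(row)
    return rows


# Build a tiny in-memory zip with made-up content to try the function.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("irs.csv", "\nSome title line\nperiod,value\n2010-01,3.5\n\n")

rows = read_csv_from_zip(buffer.getvalue(), "irs.csv")
```

A real file would need its own rule for telling metadata from data, since a metadata line may itself contain commas.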

UK Government Statistical Service bad instructions

https://bosker.wordpress.com/2014/12/05/the-government-statistical-services-terrible-spreadsheet-advice/

TFL Passenger numbers

What's bad?

No date heading in first column
Dates are of form: "2006/2007 - 1" rather than a month or similar (though not clear if it is monthly since 13 items in a year!)
Percentage sign written into percentage column
Huge number of blank rows
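
A rough sketch of how the issues above could be cleaned up in Python. The helper names are hypothetical, and the period number is kept as a plain integer because, as noted, it is not clear what the 13 periods per year represent.

```python
import re

# Matches labels of the form "2006/2007 - 1".
PERIOD_RE = re.compile(r"^\s*(\d{4})/(\d{4})\s*-\s*(\d{1,2})\s*$")


def parse_period(label):
    """Split a label like '2006/2007 - 1' into (start_year, end_year, period)."""
    match = PERIOD_RE.match(label)
    if match is None:
        raise ValueError(f"unrecognised period label: {label!r}")
    start, end, period = (int(part) for part in match.groups())
    return start, end, period


def parse_percent(cell):
    """Turn '12.5%' into the float 12.5 (the value stays in percent units)."""
    return float(cell.strip().rstrip("%").strip())


def drop_blank_rows(rows):
    """Keep only rows that contain at least one non-empty cell."""
    return [row for row in rows if any(cell.strip() for cell in row)]
```

For example, `parse_period("2006/2007 - 1")` gives `(2006, 2007, 1)`, which can then be given a proper column heading.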

Spreadsheet as image

http://webarchive.nationalarchives.gov.uk/+/http://www.alderhey.com/Library/Images/Finance/09201.jpg

This hospital provided its spend data only as screenshots of spreadsheets monthly from April to September 2010. Optical Character Recognition or a LOT of typing would be required to extract the information. The dates are not even visible.

Efforts to parse and aggregate the data, such as https://openspending.org/ukgov-25k-spending, are impossible. Not at all transparent.

The hospital made a further step to make this inaccessible - they deleted it from their website. We can only access it now because the National Archive provides a cached copy.

The UK Government has made huge efforts to be transparent, and the publication of spend transaction data across central government and NHS trusts was required by David Cameron. The Treasury published guidelines and examples of how the data was to be presented in CSV, which column names to use, etc. There's been reasonable traction, although there are always some that slip through the gaps, like this one.

Issues with Mexican town data

@csarven has raised some issues with this data.

For sure the URL slug for this page is ~~misleading~~ incorrect, since it says “mex-list-towns-pop-over-5000” when the data is actually for populations under 5000.

Apart from that, could you clarify what else is wrong, @csarven? I’ve reviewed your points but wasn’t clear what else you felt was wrong.

Nature Magazine

Nature tends to publish fabulous cutting-edge scientific research data of different types bundled all-together in a PDF called "supplementary information"

e.g. http://www.nature.com/nature/journal/vaop/ncurrent/extref/nature12764-s1.pdf

In this PDF they have bundled/bungled together:

words
image data
scatterplot data
a bar chart
some awful sideways printed tables of numbers

...some say this is one of the world's 'best' research journals.

SCREENCROP of the table

Illegible spectrum

I published this on my blog about 6 years ago. I think this was from a ***** ******** ** Chemistry journal but hold fire till I check.

This was, of course, digital data in the spectrometer (perhaps 2^16 points)

Eight Centuries of Global Real Interest Rates: Excel as an app with metadata in the sheets, links back to main sheet and more

Great dataset. But why do we need to move metadata into a spreadsheet? The first two sheets are pure metadata.

Version metadata is in a random human readable location on bottom right.

Lots of nice spacing (plus navigation back to main sheet! this is a full excel app ...)

Source data

eight-centuries-of-global-real-interest-rates-r-g-and-the-suprasecular-decline-1311-2018-data.xlsx

Russian weather data from radiometric detectors published as screenshot

title: Russian weather by radiometric analysis
dataformat: IMG
datapublisher: State enterprise "Central Aerological Observatory"
dataurl: http://www.nowcast.ru/ , http://www.nowcast.ru/data/uvk.html

author: Ivan Begtin
authorurl: http://infoculture.ru

What's bad?

Data or documents are unavailable. Instead, a screenshot of a desktop real-time application is published hourly.
It's impossible to reuse this information in any way.

Screenshot

Science Magazine

Science also publishes fabulous cutting-edge scientific research data of different types bundled all-together in a PDF called "supplementary materials"

e.g. http://www.sciencemag.org/content/suppl/2013/10/30/342.6158.592.DC1/1243283.McLellan.SM.pdf

In this PDF (!) they have bundled/bungled together:

words
image data
tables

One of the tables (S1) is split over THREE pages (with page breaks in between) and if you try and copy and paste out the whole table in one go, it'll be contaminated by the page numbers at each page break AND the footnotes on each section of the (same) table.

It is fairly typical of the many supp. materials files they publish each and every week.

SCREENSHOT of page break between table

BLS Unemployment Stats as ASCII Spreadsheet

url: ftp://ftp.bls.gov/pub/special.requests/lf/aa2010/aat1.txt
cleaned: https://github.com/datasets/employment-us
- clean up script at https://github.com/datasets/employment-us/blob/master/scripts/process.py
format: txt
wrongness
- human readable not machine readable data

Wonderful example of an "ASCII Spreadsheet" including "merge cells" (Employed heading ...)
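
To show what turning such an "ASCII spreadsheet" into machine-readable rows involves, here is a minimal sketch. It is not the clean-up script linked above, and it rests on assumptions about the layout: columns separated by two or more spaces, commas as thousands separators, and dot leaders padding the first column. The sample line is invented for illustration and is not copied from the BLS file.

```python
import re


def parse_ascii_row(line):
    """Split one line of a space-aligned text table into typed cells.

    Assumes columns are separated by two or more spaces, numbers may use
    commas as thousands separators, and dot leaders pad the first column.
    """
    cells = re.split(r"\s{2,}", line.strip())
    parsed = []
    for cell in cells:
        cleaned = cell.rstrip(".").replace(",", "")
        try:
            parsed.append(float(cleaned) if "." in cleaned else int(cleaned))
        except ValueError:
            # Not a number: keep the text, minus any trailing dot leaders.
            parsed.append(cell.rstrip(". "))
    return parsed


# Made-up sample line in the general style of such tables.
sample = "1940........    99,840   55,640    4.7"
row = parse_ascii_row(sample)
```

Merged header cells such as the "Employed" heading still need manual mapping to columns; splitting on whitespace alone cannot recover them.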

Academic papers

@rgrp et al.

Take an academic publishing organisation at random, which publishes the papers in PDF. Start here: http://en.wikipedia.org/wiki/List_of_academic_journals (get one of them e.g., Springer to request academic papers in XHTML+RDFa from here on end)

If OKFN is 100% behind that, I'll support OKFN's Bad Data initiative by 120%.

See also: https://github.com/csarven/linked-research e.g., Print view http://csarven.ca/linked-statisical-data-analysis in Firefox, and dereference the URI for RDF. Some write-up: http://csarven.ca/linked-research

Interested? Got resources?

Get in touch with http://www.opendatafail.fr/ and see if we can translate some and include here

Some good pdf examples from csv,conf

@pauldeschacht had some great examples from his talk at csv,conf.

Would you be up for adding some, @pauldeschacht?

Compressing axis to distort time trend

Here is an example of lying with data my students uncovered. The source of the data is the American Chemistry Council, as repeated by the EPA https://epa.gov/facts-and-figures-about-materials-waste-and-recycling/plastics-material-specific-data.

Is plastic use rising or plateauing? And is the waste being recycled or given a second useful life? Almost no plastic is composted, but this graph from the EPA and ACC seems to indicate plastic use is plateauing. A point the chemical industry likes to make. But look closely at the time scale:

See, they switch from decades to years! Why? Because it stretches the graph in time, giving the impression of a slowdown. But if you graph linearly, ....
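
The distortion described above can be checked numerically by comparing the change per axis label with the change per calendar year. The sketch below uses hypothetical helper names and made-up numbers (they are not the EPA/ACC figures): a series that grows by exactly one unit per year looks ten times flatter at the end if the axis switches from decades to single years while keeping the labels evenly spaced.

```python
def change_per_label(values):
    """Change between consecutive points, as an evenly spaced axis shows it."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


def change_per_year(years, values):
    """Change per calendar year between consecutive observations."""
    points = list(zip(years, values))
    return [
        (v1 - v0) / (y1 - y0)
        for (y0, v0), (y1, v1) in zip(points, points[1:])
    ]


# Illustrative numbers only: steady growth of one unit per year.
years = [1960, 1970, 1980, 1981]
values = [0, 10, 20, 21]

apparent = change_per_label(values)   # what the compressed axis suggests
actual = change_per_year(years, values)  # what a linear time axis shows
```

Here `apparent` drops from 10 to 1 on the last step while `actual` stays constant, which is the "plateau" illusion.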

Russian Federal tax service one-lined huge XML file of tax benefits

title: Benefits for Russian taxpayers on federal, regional and municipal levels
dataformat: XML (Microsoft Word XML)
datapublisher: Federal Tax Service
dataurl: http://nalog.ru/ru/opendata/p9/

author: Ivan Begtin
authorurl: http://infoculture.ru

What's bad?

The data is provided as a single XML file of 746 MB.
The XML file has no line breaks at all, so DOM parsers and some SAX parsers are unable to process it.
The data schema is provided as a CSV list of fields instead of an XSD file.
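
One possible way around the size problem is incremental parsing. A minimal sketch with Python's standard `xml.etree.ElementTree.iterparse`, which consumes the file in chunks rather than line by line, so the missing line breaks should not matter. The element names `item` and `code` are placeholders; the real tag names in the tax service file are not known here, and if the file uses XML namespaces the tags would need the `{namespace}` prefix.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, tag):
    """Yield a dict of child-tag -> text for each <tag> element in `source`.

    `source` is a filename or a binary file object. Each matching element
    is cleared after use so its children do not pile up in memory.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield {child.tag: (child.text or "") for child in elem}
            elem.clear()


# Tiny single-line sample standing in for the real file.
sample = b"<root><item><code>1</code></item><item><code>2</code></item></root>"
records = list(iter_records(io.BytesIO(sample), "item"))
```

The cleared (empty) elements remain attached to the root, so memory still grows slowly on a very large file; this is a sketch of the approach, not a tuned solution.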

Screenshot

Cairo Transport Data

title: Cairo Transport Data
dataformat: PDF
datapublisher: Governorate of Cairo
dataurl: http://www.cairo.gov.eg/HaykalTanzemy/body/Shared%20Documents/%D9%85%D8%B3%D8%A7%D8%B1%D8%A7%D8%AA%20%D8%AE%D8%B7%D9%88%D8%AA%20%D8%A7%D9%84%D8%A7%D8%AA%D9%88%D8%A8%D9%8A%D8%B3%20%D8%AF%D8%A7%D8%AE%D9%84%20%D9%85%D8%AD%D8%A7%D9%81%D8%B8%D8%A9%20%D8%A7%D9%84%D9%82%D8%A7%D9%87%D8%B1%D8%A9%20.pdf

author: Mohammad Tayseer
authorurl: http://mtayseer.net

What's bad?

The file is in PDF format, which is very hard to parse.
There is another copy in Excel format, but it asks for credentials!
Stations are jammed into a single cell, sometimes separated by dashes, sometimes by underscores.
Stations are written in many different ways. Sometimes with the complete name. Sometimes abbreviated. Sometimes there are spelling mistakes.
Sometimes the names of stations are hidden.
No consistent numbering of lines.
A lot of lines are defined by start & end stations only, not mentioning the intermediate stations, making it impossible for anyone to know the route of the bus.
The data has not been updated for more than 3 years.
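
Once the text is out of the PDF, the mixed separators at least can be handled mechanically. A minimal sketch, assuming the station cell has already been extracted as a string; the function name and the sample station names are illustrative, not taken from the file.

```python
import re


def split_stations(cell):
    """Split a cell of station names on dashes or underscores."""
    parts = re.split(r"\s*[-_]\s*", cell)
    return [part.strip() for part in parts if part.strip()]


# Sample cell mixing both separators.
stations = split_stations("Ramses - Tahrir_Giza")
```

This does nothing for the harder problems in the list: inconsistent spellings, abbreviations and missing intermediate stations still need a manually curated lookup, and a station name that itself contains a dash would be split incorrectly.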

Russian Ministry of Interior data as Word XML

title: List of regional divisions of ministry of interior
dataformat: XML (Microsoft Word XML)
datapublisher: Ministry of Interior of Russian Federation
dataurl: http://mvd.ru/opendata/od1

author: Ivan Begtin
authorurl: http://infoculture.ru, http://ru.okfn.org

What's bad?

The file is XML, but not a data XML format: the XML is Microsoft Word XML markup.
It's "camouflaged" as 3-star data but it's only 1-star data.
To parse the data we have to open it in a Word XML editor such as MS Office 2007-2013 or LibreOffice and extract the table from its contents.
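
As an alternative to opening the file in an office suite, the table can in principle be pulled out programmatically. The sketch below assumes the file follows the usual Word XML convention of `tbl`, `tr`, `tc` and `t` elements for tables, rows, cells and text runs, and matches on local names so it does not depend on which Word namespace the file declares. That assumption has not been checked against the ministry's file.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def extract_tables(xml_text):
    """Return every table in a Word XML document as rows of cell text."""
    root = ET.fromstring(xml_text)
    tables = []
    for tbl in root.iter():
        if local_name(tbl.tag) != "tbl":
            continue
        rows = []
        for tr in tbl.iter():
            if local_name(tr.tag) != "tr":
                continue
            row = []
            for tc in tr.iter():
                if local_name(tc.tag) != "tc":
                    continue
                # A cell's text is spread over one or more <t> runs.
                text = "".join(
                    t.text or "" for t in tc.iter() if local_name(t.tag) == "t"
                )
                row.append(text)
            rows.append(row)
        tables.append(rows)
    return tables


# Minimal stand-in document; the namespace URI here is a placeholder.
sample = (
    '<w:doc xmlns:w="urn:example"><w:body><w:tbl><w:tr>'
    "<w:tc><w:p><w:r><w:t>Region</w:t></w:r></w:p></w:tc>"
    "<w:tc><w:p><w:r><w:t>Division</w:t></w:r></w:p></w:tc>"
    "</w:tr></w:tbl></w:body></w:doc>"
)
tables = extract_tables(sample)
```

Nested tables would be reported more than once by this simple traversal, so it suits flat tables only.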

Screenshot
