
bad-data's People

Contributors

adamamyl, andylolz, luisdaniel, mtayseer, nmashton, rufuspollock


bad-data's Issues

Spreadsheet as image

http://webarchive.nationalarchives.gov.uk/+/http://www.alderhey.com/Library/Images/Finance/09201.jpg

This hospital provided its spend data only as screenshots of spreadsheets monthly from April to September 2010. Optical Character Recognition or a LOT of typing would be required to extract the information. The dates are not even visible.

Efforts to parse and aggregate the data, such as https://openspending.org/ukgov-25k-spending, are impossible. Not at all transparent.

The hospital took a further step to make this inaccessible: they deleted it from their website. We can only access it now because the National Archives provides a cached copy.

The UK Government has made huge efforts to be transparent, and the publication of spend transaction data across central government and NHS trusts was required by David Cameron. The Treasury published guidelines and examples of how the data was to be presented in CSV, which column names to use, etc. There's been reasonable traction, although there are always some that slip through the gaps, like this one.

Nature Magazine

Nature tends to publish fabulous cutting-edge scientific research data of different types bundled all together in a PDF called "supplementary information".

e.g. http://www.nature.com/nature/journal/vaop/ncurrent/extref/nature12764-s1.pdf

In this PDF they have bundled/bungled together:

  • words
  • image data
  • scatterplot data
  • a bar chart
  • some awful sideways printed tables of numbers

...some say this is one of the world's 'best' research journals.

SCREENCROP of the table

2013-11-22-160523_1091x755_scrot

Illegible spectrum

I published this on my blog about 6 years ago. I think this was from a ***** ******** ** Chemistry journal but hold fire till I check.
suppdata
This was, of course, digital data in the spectrometer (perhaps 2^16 points).

Eight Centuries of Global Real Interest Rates: Excel as an app with metadata in the sheets, links back to main sheet and more

Great dataset. But why do we need to move metadata into a spreadsheet? The first two sheets are pure metadata.

Version metadata is in a random human-readable location at the bottom right.

image

image

Lots of nice spacing (plus navigation back to the main sheet! This is a full Excel app ...)

image

Source data

eight-centuries-of-global-real-interest-rates-r-g-and-the-suprasecular-decline-1311-2018-data.xlsx

Science Magazine

Science also publishes fabulous cutting-edge scientific research data of different types bundled all together in a PDF called "supplementary materials".

e.g. http://www.sciencemag.org/content/suppl/2013/10/30/342.6158.592.DC1/1243283.McLellan.SM.pdf

In this PDF (!) they have bundled/bungled together:

  • words
  • image data
  • tables

One of the tables (S1) is split over THREE pages (with page breaks in between) and if you try and copy and paste out the whole table in one go, it'll be contaminated by the page numbers at each page break AND the footnotes on each section of the (same) table.

It is fairly typical of the many supp. materials files they publish each and every week.

SCREENSHOT of page break between table
2013-11-22-160020_1091x755_scrot

Academic papers

@rgrp et al.

Take an academic publishing organisation at random that publishes its papers in PDF. Start here: http://en.wikipedia.org/wiki/List_of_academic_journals (get one of them, e.g. Springer, to request academic papers in XHTML+RDFa from here on out).

If OKFN is 100% behind that, I'll support OKFN's Bad Data initiative by 120%.

See also: https://github.com/csarven/linked-research e.g., Print view http://csarven.ca/linked-statisical-data-analysis in Firefox, and dereference the URI for RDF. Some write-up: http://csarven.ca/linked-research

Interested? Got resources?

Compressing axis to distort time trend

Here is an example of lying with data my students uncovered. The source of the data is the American Chemistry Council, as repeated by the EPA https://epa.gov/facts-and-figures-about-materials-waste-and-recycling/plastics-material-specific-data.

Is plastic use rising or plateauing? And is the waste being recycled or given a second useful life? Almost no plastic is composted, but this graph from the EPA and ACC seems to indicate plastic use is plateauing, a point the chemical industry likes to make. But look closely at the time scale:

image

See, they switch from decades to years! Why? Because it stretches the graph in time, giving the impression of a slowdown. But if you graph linearly, ....

image
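
The effect of switching tick spacing can be sketched with a few numbers. The values below are hypothetical and chosen only to illustrate the mechanism; they are not the EPA/ACC figures. The series grows by the same amount every year, yet the step between neighbouring points shrinks as soon as the spacing changes from decades to single years, which is what an evenly spaced axis shows to the eye.

```python
# Hypothetical values for illustration only; these are NOT the EPA/ACC figures.
years = [1960, 1970, 1980, 1990, 2000, 2010, 2011, 2012, 2013]
values = [10, 20, 30, 40, 50, 60, 61, 62, 63]


def growth_per_tick(values):
    """Change between neighbouring points, as read off an evenly spaced axis."""
    return [b - a for a, b in zip(values, values[1:])]


def growth_per_year(years, values):
    """Change divided by the real time elapsed between neighbouring points."""
    points = list(zip(years, values))
    return [(v1 - v0) / (y1 - y0) for (y0, v0), (y1, v1) in zip(points, points[1:])]


print(growth_per_tick(values))         # [10, 10, 10, 10, 10, 1, 1, 1]
print(growth_per_year(years, values))  # [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```

Plotting against the real year values, rather than against the position of each tick, removes the apparent plateau in this made-up series.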

Russian Federal tax service one-lined huge XML file of tax benefits

title: Benefits for Russian taxpayers on federal, regional and municipal levels
dataformat: XML (Microsoft Word XML)
datapublisher: Federal Tax Service
dataurl: http://nalog.ru/ru/opendata/p9/

author: Ivan Begtin
authorurl: http://infoculture.ru

What's bad?

  1. The data is provided as a single XML file of 746 MB.
  2. The XML file has no line breaks at all, so DOM parsers and some SAX parsers are unable to process it.
  3. The data schema is provided as a CSV list of fields instead of an XSD file.
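
One possible workaround for a file like this is incremental parsing, which reads the input in chunks and does not depend on line breaks. The following is a minimal sketch using Python's standard `xml.etree.ElementTree.iterparse`. The tag names (`benefits`, `benefit`, `code`, `name`) and the sample content are assumptions made up for the example, not the real structure of the tax service file.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, record_tag):
    """Yield one dict per <record_tag> element without loading the whole file."""
    # iterparse consumes the input in chunks, so missing line breaks do not matter.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            yield {child.tag: (child.text or "") for child in elem}
            root.clear()  # drop processed records so memory use stays flat


# Hypothetical one-line sample; the real tag names are not known here.
sample = (
    b"<benefits><benefit><code>1</code><name>A</name></benefit>"
    b"<benefit><code>2</code><name>B</name></benefit></benefits>"
)
records = list(iter_records(io.BytesIO(sample), "benefit"))
print(records)
```

For a real file, `source` would be a path or an open binary file instead of the in-memory sample, and `record_tag` would have to be taken from the published field list.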

Screenshot
2013-12-07 16 35 11

Cairo Transport Data

title: Cairo Transport Data
dataformat: PDF
datapublisher: Governorate of Cairo
dataurl: http://www.cairo.gov.eg/HaykalTanzemy/body/Shared%20Documents/%D9%85%D8%B3%D8%A7%D8%B1%D8%A7%D8%AA%20%D8%AE%D8%B7%D9%88%D8%AA%20%D8%A7%D9%84%D8%A7%D8%AA%D9%88%D8%A8%D9%8A%D8%B3%20%D8%AF%D8%A7%D8%AE%D9%84%20%D9%85%D8%AD%D8%A7%D9%81%D8%B8%D8%A9%20%D8%A7%D9%84%D9%82%D8%A7%D9%87%D8%B1%D8%A9%20.pdf

author: Mohammad Tayseer
authorurl: http://mtayseer.net

What's bad?

  1. The file is in PDF format, which is very hard to parse.
  2. There is another copy in Excel format, but it asks for credentials!
  3. Stations are jammed into a single cell, sometimes separated by dashes, sometimes by underscores.
  4. Stations are written in many different ways. Sometimes with the complete name. Sometimes written as shortcuts. Sometimes there are spelling mistakes.
  5. Sometimes the names of stations are hidden
  6. No consistent numbering of lines
  7. A lot of lines are defined by their start and end stations only, without the intermediate stations, making it impossible for anyone to know the route of the bus.
  8. The data has not been updated for more than 3 years.
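
As a rough illustration of the cleanup that point 3 forces on anyone using the data, here is a minimal sketch that splits one cell on either separator. The function name and the sample cell are made up for the example, and the approach assumes no station name itself contains a dash or an underscore, which may not hold for the real data. It does nothing about the spelling variants in point 4.

```python
import re

# Dashes (ASCII, en, em) and underscores, with optional surrounding spaces.
SEPARATORS = re.compile(r"\s*[-_\u2013\u2014]+\s*")


def split_stations(cell):
    """Split one spreadsheet cell into a list of station names."""
    parts = SEPARATORS.split(cell.strip())
    # Collapse repeated whitespace and drop empty fragments.
    return [" ".join(p.split()) for p in parts if p.strip()]


# Made-up input mixing both separators.
print(split_stations("Ramses - Tahrir_Giza"))  # ['Ramses', 'Tahrir', 'Giza']
```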

Russian Ministry of Interior data as Word XML

title: List of regional divisions of ministry of interior
dataformat: XML (Microsoft Word XML)
datapublisher: Ministry of Interior of Russian Federation
dataurl: http://mvd.ru/opendata/od1

author: Ivan Begtin
authorurl: http://infoculture.ru, http://ru.okfn.org

What's bad?

  1. The file is XML, but it is not a data XML format. The XML is Microsoft Word XML markup.
  2. It's "camouflaged" as 3-star data, but it's only 1-star data.
  3. To parse the data we have to open it in a Word XML editor such as MS Office 2007-2013 or LibreOffice and extract the table from its contents.
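
If opening the file in an office suite is not an option, the table can in principle be pulled out of the markup directly. The sketch below assumes the file uses the usual Word element names for tables (`tbl`, `tr`, `tc`) and text runs (`t`), and it matches on local names only so that it does not depend on which Word XML namespace is in use. The namespace URI and the cell contents in the sample are placeholders, not taken from the ministry's file, and nested tables are not handled.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def descendants(elem, name):
    """All elements under elem (including elem) whose local name matches."""
    return [e for e in elem.iter() if local_name(e.tag) == name]


def extract_tables(xml_text):
    """Return every table as a list of rows, each row a list of cell strings."""
    root = ET.fromstring(xml_text)
    tables = []
    for tbl in descendants(root, "tbl"):
        rows = []
        for tr in descendants(tbl, "tr"):
            # Join all text runs inside a cell into one string.
            rows.append(
                ["".join(t.text or "" for t in descendants(tc, "t"))
                 for tc in descendants(tr, "tc")]
            )
        tables.append(rows)
    return tables


# Placeholder namespace and contents, for illustration only.
sample = (
    '<w:doc xmlns:w="urn:example:word"><w:tbl>'
    "<w:tr><w:tc><w:p><w:r><w:t>Region</w:t></w:r></w:p></w:tc>"
    "<w:tc><w:p><w:r><w:t>Division</w:t></w:r></w:p></w:tc></w:tr>"
    "</w:tbl></w:doc>"
)
tables = extract_tables(sample)
print(tables)  # [[['Region', 'Division']]]
```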

Screenshot
2013-12-07 16 24 24
