
data-quality-uk-25k-spend's Issues

[meta] Overall Process

Scrape the list of files - this is the "data" for this repo

  1. Get list of publishers
  2. Get list of all datafiles

Analyse data

  1. cache datafiles
  2. run validation / check
  3. generate summary results

Asides

  • Should we write to SQLite and then dump to CSV as needed? That could be quite a nice process, and may actually be what you want. A minimal sketch is below.
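
A minimal sketch of the SQLite-then-dump idea; the table and column names (datafiles, publisher, url, score) are hypothetical:

import csv
import sqlite3

conn = sqlite3.connect('data.sqlite')
conn.execute('CREATE TABLE IF NOT EXISTS datafiles (publisher TEXT, url TEXT, score INTEGER)')
conn.execute('INSERT INTO datafiles VALUES (?, ?, ?)', ('cabinet-office', 'http://example.org/25k.csv', 8))
conn.commit()

# Dump the table to CSV whenever needed.
with open('datafiles.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    cursor = conn.execute('SELECT publisher, url, score FROM datafiles')
    writer.writerow([col[0] for col in cursor.description])
    writer.writerows(cursor)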

Data file title format

The title column in the datafiles.tsv/csv output currently has a date string. Could the format please be:

dataset title + '/' + resource title
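
For illustration, a sketch assuming the resource title comes from the CKAN description field (hypothetical values):

# Hypothetical values; the resource title is taken from the CKAN 'description' field here.
dataset = {'title': 'Spend over 25k'}
resource = {'description': 'April 2015'}
title = dataset['title'] + '/' + resource['description']
print(title)   # Spend over 25k/April 2015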

Publisher ID, and reference to publisher from datafiles

Currently, publisher ID is a UUID. We also have a name field for publisher, which is a slugified version of the title.

We could use this "name" field as the ID for the publisher and, in datafiles, use it as the reference to the publisher.
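
Illustrative rows (hypothetical values) showing the name field used as the ID and as the reference from datafiles:

publishers.csv (name used as the ID):

name,title
ministry-of-justice,Ministry of Justice

datafiles.csv (publisher referenced by name rather than UUID):

publisher,title,url
ministry-of-justice,Spend over 25k/April 2015,http://example.org/moj-25k-april-2015.csv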

Store results data in this repo

Merge spd-data-uk-all with this repo: https://github.com/okfn/spd-data-uk-all/tree/ministerial-dpts/data

  • Merge across scripts (put everything in /scripts/)
    • How does this connect with spd-admin (do we want to merge spd-admin stuff in here too - probably (?))
  • Merge across README
    • What data is in this repo (Data Package like)
    • Extensive section on scripts

  • Merge results data files in there
    • Think about removing runs ... (maybe just assume the current data is from the latest run in runs.csv) - maybe add the commit rev number to runs? - DON'T DO NOW
  • Merge source data files (publishers, sources etc)
    • publishers (use the one here - discard spd)
    • sources => datafiles.csv

Problem in publishers homepage field

The homepage field of publishers seems to contain either NULL or a link to a page about the body on the What Do They Know website. This is not correct. Is there another field available that may contain the correct URL to the home page?
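
A quick way to flag the bad values, as a sketch; the whatdotheyknow.com check and the sample URLs are illustrative:

def homepage_looks_wrong(homepage):
    # NULL or a What Do They Know link both indicate a bad value.
    return not homepage or 'whatdotheyknow.com' in homepage

print(homepage_looks_wrong(None))                                                  # True
print(homepage_looks_wrong('https://www.whatdotheyknow.com/body/cabinet_office'))  # True
print(homepage_looks_wrong('https://www.gov.uk/government/organisations/cabinet-office'))  # False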

Refactor data collection and generation process

Description

We have to clean up this junkyard. The README on master describes the current flow. The README on the feature/refactor branch describes what we want to get to.

The basic thing is that everything to do with building the actual data quality assessment database should be controlled via the Data Quality CLI, so all the hacks added here for that need to be streamlined into post-pipeline and post-batch hooks there (abstracted into Tasks in that codebase).
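
For orientation, this is roughly the shape such a hook could take; the Data Quality CLI does not necessarily expose this API, so the class name and result keys here are hypothetical:

class FetchCachedFileTask:
    """Post-pipeline task: persist the fetched file for later inspection."""
    def run(self, pipeline_result):
        # Assumed result keys; the real pipeline result structure may differ.
        print('caching %s -> %s' % (pipeline_result['source_url'], pipeline_result['cache_path']))

def run_post_pipeline_hooks(pipeline_result, tasks=(FetchCachedFileTask(),)):
    for task in tasks:
        task.run(pipeline_result)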

Tasks

  • Refer to the OKI coding standards and the example codebase for Python style
  • Update the new README to reflect this flow
    • An ID data script, which builds publishers and sources lists
      • Exclude HTML pages from sources.
      • Add functionality for configurable sleep time between pipelines in GoodTables (see the sketch after this list)
      • Handle HTTP errors and compressed formats in the GoodTables
      • Remove preprocess_sources script.
      • Fetch the cached files as a post-processing task in data-quality-cli
      • Remove the fetch_sources script.
      • Once GoodTables covers scoring bad sources with 0, remove the make_results script.
      • Transform the make_performance script into a dq run task.
      • Try again to make the id_data script compatible with Python 2.7, or open another issue
  • Fix the scripts that identify assessable data, and further manually curate the identified data (if it helps streamline the scripts) so that we only assess the relevant publishers, and only on their 25k data (not other documents of spend data)
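
A sketch of the configurable sleep mentioned above; illustrative only, since GoodTables does not necessarily expose this option:

import time

def run_pipelines(sources, run_one, sleep_seconds=1.0):
    # run_one is a callable that validates a single source.
    results = []
    for source in sources:
        results.append(run_one(source))
        time.sleep(sleep_seconds)   # configurable pause between pipelines
    return results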

Specific notes on data identification

  • See how this code grabs all sorts of fiscal files, not just 25k. We only want 25k data.
  • See the attached .ods file, which was provided by @jacattell as a list of relevant ministerial departments to assess for data quality on 25k publication. Ensure that we do not have publishers (+ sources of) that are not on this list - the ones marked in yellow are the ones we have that should not be there according to @jacattell (see the filtering sketch below)
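
A sketch of pruning publishers against that curated list; the file names and the name column are illustrative, and it assumes the .ods has been exported to CSV:

import csv

with open('ministerial-departments.csv') as f:
    allowed = {row['name'] for row in csv.DictReader(f)}

with open('publishers.csv') as f:
    kept = [row for row in csv.DictReader(f) if row['name'] in allowed]

print('%d publishers kept' % len(kept))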

A point about discoverability of data

ministerial-departments.ods.zip

Add sources from gov.uk

Copied out of #15 as it looks like this needs special treatment.

@jacattell @davidread if you can tell me who can actually provide any clarity on this for those outside of government, that would be great!

Useful reference, perhaps:

It looks like gov.uk is listing datasets that have not been pushed to data.gov.uk.

Example query (https://www.gov.uk/government/publications?keywords=spend&publication_filter_option=transparency-data&topics%5B%5D=all&departments%5B%5D=all&official_document_status=all&world_locations%5B%5D=all&from_date=&to_date=).

As far as I know, we are supposed to be assessing data.gov.uk, as the data portal of the UK government, so I'm not 100% sure we should look for data there too.

Looks like we really should query against gov.uk too.
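
A sketch of querying that listing programmatically, using the filters from the example URL above (requests is assumed, and gov.uk may change these parameters):

import requests

params = {
    'keywords': 'spend',
    'publication_filter_option': 'transparency-data',
    'departments[]': 'all',
}
response = requests.get('https://www.gov.uk/government/publications', params=params)
response.raise_for_status()
print(response.url)   # the listing page; extracting datasets would still need HTML parsing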

Find a way to retrieve `period_id` or a substitute for timeliness calculation reliably

Currently period_id is extracted from the title/url by the period.py script. This can go wrong if the title has typos or if it doesn't contain all the necessary information (https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/519924/_25K_March_Data.csv), leading to an empty or wrong period_id and errors in the data-quality-cli performance calculation.
The need for the period.py script is explained by the lack of created or last_modified fields for some resources. Example from the CKAN API response:

[{'description': 'GTCE Spend April 2011 to January 2012',
  'format': 'CSV',
  'id': '0cdc2222-867c-4586-a5b4-6022a71b3cb6',
  'position': 0,
  'revision_id': '6773cab6-21b8-47c5-a02c-d5c71f23a151',
  'tracking_summary': {'recent': 0, 'total': 0},
  'url': 'http://media.education.gov.uk/assets/files/xls/gtce%20spend%20april%202011%20to%20january%202012.csv'},
 {'description': 'QCDA April 2010',
  'format': 'CSV',
  'id': 'f748cb1d-6e1b-4620-b996-3451ebfc9702',
  'position': 0,
  'revision_id': '6773cab6-21b8-47c5-a02c-d5c71f23a151',
  'tracking_summary': {'recent': 0, 'total': 0},
  'url': 'http://media.education.gov.uk/assets/files/xsl/qcda%20spend%20%20april%202010.csv'}]
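
A minimal sketch of the kind of title/URL parsing period.py has to do (a hypothetical pattern, not the actual script); it fails on typos or missing information, which is exactly the problem described above:

import re

MONTHS = {'january': 1, 'february': 2, 'march': 3, 'april': 4, 'may': 5,
          'june': 6, 'july': 7, 'august': 8, 'september': 9, 'october': 10,
          'november': 11, 'december': 12}

def extract_period(text):
    match = re.search(r'(january|february|march|april|may|june|july|august|'
                      r'september|october|november|december)\D*(\d{4})', text.lower())
    if not match:
        return None   # e.g. "_25K_March_Data.csv" carries no year at all
    return '%s-%02d-01' % (match.group(2), MONTHS[match.group(1)])

print(extract_period('GTCE Spend April 2011 to January 2012'))  # 2011-04-01
print(extract_period('_25K_March_Data.csv'))                    # None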

Proposed solutions:

  1. Ensure that data-quality-cli and (possibly) data-quality-dashboard are able to deal with empty periods.
  2. Use both the period_id and created fields and improve the scoring algorithm to not score timeliness when any one of these fields is unknown. This also makes sense since, by using the period extracted from the title, period_id no longer refers to the time of upload but to the time covered by the data. As I understand timeliness, being on time would mean that the month of upload is ~= the month covered by the data + 1. So both the period of creation and the period covered by the data should be known. This is closely related to frictionlessdata/data-quality-cli#13
     For example, the following would be cataloged as not timely:
data,format,period_id,created_at
http://data.defra.gov.uk/ops/procurement/1204/EA-OVER-25K-1204.csv,csv,2012-04-01,2013-02-07T10:18:08.619710
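
A sketch of that rule, assuming ISO dates as in the row above; when either field is unknown, no score is given:

from datetime import datetime

def is_timely(period_id, created_at):
    if not period_id or not created_at:
        return None   # unknown: do not score timeliness at all
    covered = datetime.strptime(period_id, '%Y-%m-%d')
    created = datetime.strptime(created_at[:10], '%Y-%m-%d')
    months_late = (created.year - covered.year) * 12 + (created.month - covered.month)
    return months_late <= 1

print(is_timely('2012-04-01', '2013-02-07T10:18:08.619710'))  # False: ten months late
print(is_timely('2012-04-01', None))                          # None: cannot score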

I'll drop some numbers:
Currently we have 1306 sources, which is only a subset of the UK CKAN.

By using this retrieval method:

# 'archiver' is a stringified dict here, hence ast.literal_eval (requires `import ast`)
datafile['created_at'] = resource.get('created') or ast.literal_eval(resource.get('archiver', "{}")).get('created', '')

Of the 1306 resources, 374 have a null created field.
The revision_id field is present in all of them, and could be used to get the time of the last update on CKAN. However, this doesn't solve the problem if timeliness is understood as explained above.

There is only 1 source where period_id is null, which is insignificant in the context of this app, but I think the issues mentioned above provide some important insights into how to make data-quality-cli/dashboard better.

.csv files are not CSV format

Hello @gvidon (I'm following up here on behalf of Open Knowledge)

One thing here is that the .csv files in data/* are not actually CSV; they are TSV.
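
A quick way to confirm the delimiter with the standard library (the file path is illustrative):

import csv

with open('data/publishers.csv', newline='') as f:
    dialect = csv.Sniffer().sniff(f.read(2048))

print(repr(dialect.delimiter))   # '\t' confirms the files are actually TSV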
