
data-quality-uk-25k-spend's Issues

[meta] Overall Process

Scrape the list of files - this is the "data" for this repo

  1. Get list of publishers
  2. Get list of all datafiles

Analyse data

  1. cache datafiles
  2. run validation / check
  3. generate summary results

Asides

  • Should we write to SQLite and then dump to CSV as needed? That could be quite a nice process, and may actually be what you want. A minimal sketch is below.
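
A minimal sketch of the SQLite-then-dump idea; the table and column names (datafiles, publisher, url, score) are hypothetical:

import csv
import sqlite3

conn = sqlite3.connect('data.sqlite')
conn.execute('CREATE TABLE IF NOT EXISTS datafiles (publisher TEXT, url TEXT, score INTEGER)')
conn.execute('INSERT INTO datafiles VALUES (?, ?, ?)', ('cabinet-office', 'http://example.org/25k.csv', 8))
conn.commit()

# Dump the table to CSV whenever needed.
with open('datafiles.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    cursor = conn.execute('SELECT publisher, url, score FROM datafiles')
    writer.writerow([col[0] for col in cursor.description])
    writer.writerows(cursor)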

Data file title format

The title column in the datafiles.tsv/csv output currently has a date string. Could the format please be:

dataset title + '/' + resource title
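
For illustration, a sketch assuming the resource title comes from the CKAN description field (hypothetical values):

# Hypothetical values; the resource title is taken from the CKAN 'description' field here.
dataset = {'title': 'Spend over 25k'}
resource = {'description': 'April 2015'}
title = dataset['title'] + '/' + resource['description']
print(title)   # Spend over 25k/April 2015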

Publisher ID, and reference to publisher from datafiles

Currently, publisher ID is a UUID. We also have a name field for publisher, which is a slugified version of the title.

We could use this "name" field as the ID for the publisher and, in datafiles, use it as the reference to the publisher.
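
Illustrative rows (hypothetical values) showing the name field used as the ID and as the reference from datafiles:

publishers.csv (name used as the ID):

name,title
ministry-of-justice,Ministry of Justice

datafiles.csv (publisher referenced by name rather than UUID):

publisher,title,url
ministry-of-justice,Spend over 25k/April 2015,http://example.org/moj-25k-april-2015.csv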

Store results data in this repo

Merge spd-data-uk-all with this repo: https://github.com/okfn/spd-data-uk-all/tree/ministerial-dpts/data

  • Merge across scripts (put everything in /scripts/)
    • How does this connect with spd-admin (do we want to merge spd-admin stuff in here too - probably (?))
  • Merge across README
    • What data is in this repo (Data Package like)
    • Extensive section on scripts

  • Merge results data files in there
    • Think about removing runs ... (maybe just assume the current data is from the latest run in runs.csv) - maybe add the commit rev number to runs? - DON'T DO NOW
  • Merge source data files (publishers, sources etc)
    • publishers (use the one here - discard spd)
    • sources => datafiles.csv

Problem in publishers homepage field

The homepage field of publishers seems to contain either NULL or a link to a page about the body on the What Do They Know website. This is not correct. Is there another field available that may contain the correct URL to the home page?
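
A quick way to flag the bad values, as a sketch; the whatdotheyknow.com check and the sample URLs are illustrative:

def homepage_looks_wrong(homepage):
    # NULL or a What Do They Know link both indicate a bad value.
    return not homepage or 'whatdotheyknow.com' in homepage

print(homepage_looks_wrong(None))                                                  # True
print(homepage_looks_wrong('https://www.whatdotheyknow.com/body/cabinet_office'))  # True
print(homepage_looks_wrong('https://www.gov.uk/government/organisations/cabinet-office'))  # False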

Refactor data collection and generation process

Description

We have to clean up this junkyard. The README on master describes the current flow. The README on the feature/refactor branch describes what we want to get to.

The basic thing is that everything to do with building the actual data quality assessment database should be controlled via the Data Quality CLI, so all the hacks added here for that need to be streamlined into post-pipeline and post-batch hooks there (abstracted into Tasks in that codebase).
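
For orientation, this is roughly the shape such a hook could take; the Data Quality CLI does not necessarily expose this API, so the class name and result keys here are hypothetical:

class FetchCachedFileTask:
    """Post-pipeline task: persist the fetched file for later inspection."""
    def run(self, pipeline_result):
        # Assumed result keys; the real pipeline result structure may differ.
        print('caching %s -> %s' % (pipeline_result['source_url'], pipeline_result['cache_path']))

def run_post_pipeline_hooks(pipeline_result, tasks=(FetchCachedFileTask(),)):
    for task in tasks:
        task.run(pipeline_result)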

Tasks

  • Refer to the OKI coding standards and the example codebase for Python style
  • Update the new README to reflect this flow
    • An ID data script, which builds publishers and sources lists
      • Exclude HTML pages from sources.
      • Add functionality for configurable sleep time between pipelines in GoodTables (see the sketch after this list)
      • Handle HTTP errors and compressed formats in the GoodTables
      • Remove preprocess_sources script.
      • Fetch the cached files as a post-processing task in data-quality-cli
      • Remove the fetch_sources script.
      • Once GoodTables covers scoring bad sources with 0, remove the make_results script.
      • Transform the make_performance script into a dq run task.
      • Try again to make the id_data script compatible with Python 2.7, or open another issue
  • Fix the scripts that identify assessable data, and further manually curate the identified data (if it helps streamline the scripts) so that we only assess the relevant publishers, and only on their 25k data (not other documents of spend data)
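
A sketch of the configurable sleep mentioned above; illustrative only, since GoodTables does not necessarily expose this option:

import time

def run_pipelines(sources, run_one, sleep_seconds=1.0):
    # run_one is a callable that validates a single source.
    results = []
    for source in sources:
        results.append(run_one(source))
        time.sleep(sleep_seconds)   # configurable pause between pipelines
    return results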

Specific notes on data identification

  • See how this code grabs all sorts of fiscal files, not just 25k. We only want 25k data.
  • See the attached .ods file, which was provided by @jacattell as a list of relevant ministerial departments to assess for data quality on 25k publication. Ensure that we do not have publishers (+ sources of) that are not on this list - the ones marked in yellow are the ones we have that should not be there according to @jacattell (see the filtering sketch below)
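
A sketch of pruning publishers against that curated list; the file names and the name column are illustrative, and it assumes the .ods has been exported to CSV:

import csv

with open('ministerial-departments.csv') as f:
    allowed = {row['name'] for row in csv.DictReader(f)}

with open('publishers.csv') as f:
    kept = [row for row in csv.DictReader(f) if row['name'] in allowed]

print('%d publishers kept' % len(kept))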

A point about discoverability of data

ministerial-departments.ods.zip

Add sources from gov.uk

Copied out of #15 as it looks like this needs special treatment.

@jacattell @davidread if you can tell me who can actually provide any clarity on this for those outside of government, that would be great!

Useful reference, perhaps:

It looks like gov.uk is listing datasets that have not been pushed to data.gov.uk.

Example query (https://www.gov.uk/government/publications?keywords=spend&publication_filter_option=transparency-data&topics%5B%5D=all&departments%5B%5D=all&official_document_status=all&world_locations%5B%5D=all&from_date=&to_date=).

As far as I know, we are supposed to be assessing data.gov.uk, as the data portal of the UK government, so I'm not 100% sure we should look for data there too.

Looks like we really should query against gov.uk too.
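
A sketch of querying that listing programmatically, using the filters from the example URL above (requests is assumed, and gov.uk may change these parameters):

import requests

params = {
    'keywords': 'spend',
    'publication_filter_option': 'transparency-data',
    'departments[]': 'all',
}
response = requests.get('https://www.gov.uk/government/publications', params=params)
response.raise_for_status()
print(response.url)   # the listing page; extracting datasets would still need HTML parsing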

Find a way to retrieve `period_id` or a substitute for timeliness calculation reliably

Currently period_id is extracted from the title/url by the period.py script. This can go wrong if the title has typos or if it doesn't contain all the necessary information (https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/519924/_25K_March_Data.csv), leading to an empty or wrong period_id and errors in the data-quality-cli performance calculation.
The need for the period.py script is explained by the lack of created or last_modified fields for some resources. Example from the CKAN API response:

[{'description': 'GTCE Spend April 2011 to January 2012',
  'format': 'CSV',
  'id': '0cdc2222-867c-4586-a5b4-6022a71b3cb6',
  'position': 0,
  'revision_id': '6773cab6-21b8-47c5-a02c-d5c71f23a151',
  'tracking_summary': {'recent': 0, 'total': 0},
  'url': 'http://media.education.gov.uk/assets/files/xls/gtce%20spend%20april%202011%20to%20january%202012.csv'},
 {'description': 'QCDA April 2010',
  'format': 'CSV',
  'id': 'f748cb1d-6e1b-4620-b996-3451ebfc9702',
  'position': 0,
  'revision_id': '6773cab6-21b8-47c5-a02c-d5c71f23a151',
  'tracking_summary': {'recent': 0, 'total': 0},
  'url': 'http://media.education.gov.uk/assets/files/xsl/qcda%20spend%20%20april%202010.csv'}]
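
A minimal sketch of the kind of title/URL parsing period.py has to do (a hypothetical pattern, not the actual script); it fails on typos or missing information, which is exactly the problem described above:

import re

MONTHS = {'january': 1, 'february': 2, 'march': 3, 'april': 4, 'may': 5,
          'june': 6, 'july': 7, 'august': 8, 'september': 9, 'october': 10,
          'november': 11, 'december': 12}

def extract_period(text):
    match = re.search(r'(january|february|march|april|may|june|july|august|'
                      r'september|october|november|december)\D*(\d{4})', text.lower())
    if not match:
        return None   # e.g. "_25K_March_Data.csv" carries no year at all
    return '%s-%02d-01' % (match.group(2), MONTHS[match.group(1)])

print(extract_period('GTCE Spend April 2011 to January 2012'))  # 2011-04-01
print(extract_period('_25K_March_Data.csv'))                    # None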

Proposed solutions:

  1. Ensure that data-quality-cli and (possibly) data-quality-dashboard are able to deal with empty periods.
  2. Use both the period_id and created fields and improve the scoring algorithm to not score timeliness when any one of these fields is unknown. This also makes sense since, by using the period extracted from the title, period_id no longer refers to the time of upload but to the time covered by the data. As I understand timeliness, being on time would mean that the month of upload is ~= the month covered by the data + 1. So both the period of creation and the period covered by the data should be known. This is closely related to frictionlessdata/data-quality-cli#13
     For example, the following would be cataloged as not timely:
data,format,period_id,created_at
http://data.defra.gov.uk/ops/procurement/1204/EA-OVER-25K-1204.csv,csv,2012-04-01,2013-02-07T10:18:08.619710
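
A sketch of that rule, assuming ISO dates as in the row above; when either field is unknown, no score is given:

from datetime import datetime

def is_timely(period_id, created_at):
    if not period_id or not created_at:
        return None   # unknown: do not score timeliness at all
    covered = datetime.strptime(period_id, '%Y-%m-%d')
    created = datetime.strptime(created_at[:10], '%Y-%m-%d')
    months_late = (created.year - covered.year) * 12 + (created.month - covered.month)
    return months_late <= 1

print(is_timely('2012-04-01', '2013-02-07T10:18:08.619710'))  # False: ten months late
print(is_timely('2012-04-01', None))                          # None: cannot score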

I'll drop some numbers:
Currently we have 1306 sources, which is only a subset of the UK CKAN.

By using this retrieval method:

# 'archiver' is a stringified dict here, hence ast.literal_eval (requires `import ast`)
datafile['created_at'] = resource.get('created') or ast.literal_eval(resource.get('archiver', "{}")).get('created', '')

Of the 1306 resources, 374 have a null created field.
The revision_id field is present in all of them, and could be used to get the time of the last update on CKAN. However, this doesn't solve the problem if timeliness is understood as explained above.

There is only 1 source where period_id is null, which is insignificant in the context of this app, but I think the issues mentioned above provide some important insights into how to make data-quality-cli/dashboard better.

.csv files are not CSV format

Hello @gvidon (I'm following up here on behalf of Open Knowledge)

One thing here is that the .csv files in data/* are not actually CSV; they are TSV.
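
A quick way to confirm the delimiter with the standard library (the file path is illustrative):

import csv

with open('data/publishers.csv', newline='') as f:
    dialect = csv.Sniffer().sniff(f.read(2048))

print(repr(dialect.delimiter))   # '\t' confirms the files are actually TSV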
