rufuspollock-okfn / data-quality-uk-25k-spend

Database of all UK Government spending data files (25k and Local Gov)
Scrape the list of files - this is the "data" for this repo
Analyse data
Could each entry in datafiles also have the following properties:
If these are not found for a given entry, they can be set to NULL
The title column in the datafiles.tsv/csv output currently has a date string. Could the format please be:

dataset title + "/" + resource title
Currently, the publisher ID is a UUID. We also have a "name" field for each publisher, which is a slugified version of the title.
Could we use this "name" field as the ID for the publisher and, in datafiles, have this as the reference to the publisher?
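For illustration, the slug derivation might look like the sketch below. This is an assumption about how the CKAN-style "name" field is produced, not the actual code; the real value comes straight from the CKAN API.

```python
import re

def slugify(title):
    """Lower-case a title and replace runs of non-alphanumeric characters
    with hyphens, mirroring how CKAN-style "name" slugs are typically
    derived (an assumption; CKAN supplies the real field)."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    return slug.strip("-")

# A publisher row keyed by slug instead of UUID (illustrative only)
publisher = {
    "id": slugify("Department for Education"),  # "department-for-education"
    "title": "Department for Education",
}
```

A slug ID like this is stable and human-readable in datafiles, at the cost of breaking if a publisher is ever renamed.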
Meaning, the page on the official website of the government body that hosts a link to the datafile.
Merge spd-data-uk-all with this repo: https://github.com/okfn/spd-data-uk-all/tree/ministerial-dpts/data (and its /scripts/)
Make this repository a valid Data Package (datapackage.json), with a resources array for the stuff in data, and use the sources array for the stuff in fetched.
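For illustration, a minimal datapackage.json along those lines might look like the sketch below. The file names under data/ are assumptions, not the repo's actual layout.

```json
{
  "name": "data-quality-uk-25k-spend",
  "title": "Database of all UK Government spending data files (25k and Local Gov)",
  "resources": [
    {"name": "publishers", "path": "data/publishers.csv"},
    {"name": "datafiles", "path": "data/datafiles.csv"}
  ],
  "sources": [
    {"title": "Fetched CKAN metadata", "path": "fetched/"}
  ]
}
```

The resources array describes the data this package publishes; sources records where the raw inputs came from.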
The homepage field of publishers seems to have either NULL or a link to a page related to the body on the What Do They Know website. This is not correct. Is there another field available that may contain the correct URL to the home page?
We have to clean up this junkyard. The README on master describes the current flow; the README on the feature/refactor branch describes what we want to get to.
The basic idea is that everything to do with building the actual data quality assessment database should be controlled via the Data Quality CLI, so all the hacks added here for that need to be streamlined into post-pipeline and post-batch hooks there (abstracted into Tasks in that codebase):
Turn each of these scripts into a dq run task:

- preprocess_sources
- fetch_sources
- make_results
- make_performance

Make the id_data script compatible with Python 2.7, or open another issue. Copied out of #15 as it looks like this needs special treatment.
@jacattell @davidread if you can tell me who can actually provide any clarity on this for those outside of government that would be great!
Useful reference, perhaps:
It looks like gov.uk is listing datasets that have not been pushed to data.gov.uk.
As far as I know, we are supposed to be assessing data.gov.uk, being the data portal of the UK govt., so I'm not 100% sure we should look for data in this location too.
Looks like we really should query against gov.uk too.
Currently period_id is extracted from the title/url by the period.py script. This can go wrong if the title has typos or if it doesn't contain all the necessary information (e.g. https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/519924/_25K_March_Data.csv), leading to an empty or wrong period_id and errors in the data-quality-cli performance calculation.
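To make the failure mode concrete, here is a rough sketch of the kind of month-plus-year parsing such a script might do. This is a hypothetical reconstruction, not the actual period.py; it shows why a file name like _25K_March_Data.csv (month but no year) yields no period.

```python
import re

# Month name -> number (helper for the hypothetical parser below)
MONTHS = {m.lower(): i for i, m in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

def extract_period(text):
    """Look for a 'March 2016'-style month+year pair in a title or URL.
    Returns 'YYYY-MM-01', or None when the text lacks the information."""
    match = re.search(
        r"(jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|jun(?:e)?|"
        r"jul(?:y)?|aug(?:ust)?|sep(?:tember)?|oct(?:ober)?|nov(?:ember)?|"
        r"dec(?:ember)?)[ _-]*(\d{4})", text, re.IGNORECASE)
    if not match:
        return None
    month_name, year = match.groups()
    # Resolve an abbreviation ("Mar") to the full month via its 3-letter prefix
    full = next(m for m in MONTHS if m.startswith(month_name.lower()[:3]))
    return "%s-%02d-01" % (year, MONTHS[full])
```

Any title that omits the year, misspells the month, or uses an unexpected layout defeats this kind of heuristic, which is exactly the failure described above.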
The need for the period.py script is explained by the lack of created or last_modified fields for some resources. Example from the CKAN API response:
[{'description': 'GTCE Spend April 2011 to January 2012',
'format': 'CSV',
'id': '0cdc2222-867c-4586-a5b4-6022a71b3cb6',
'position': 0,
'revision_id': '6773cab6-21b8-47c5-a02c-d5c71f23a151',
'tracking_summary': {'recent': 0, 'total': 0},
'url': 'http://media.education.gov.uk/assets/files/xls/gtce%20spend%20april%202011%20to%20january%202012.csv'},
{'description': 'QCDA April 2010',
'format': 'CSV',
'id': 'f748cb1d-6e1b-4620-b996-3451ebfc9702',
'position': 0,
'revision_id': '6773cab6-21b8-47c5-a02c-d5c71f23a151',
'tracking_summary': {'recent': 0, 'total': 0},
'url': 'http://media.education.gov.uk/assets/files/xsl/qcda%20spend%20%20april%202010.csv'}]
Proposed solutions:

1. Make sure data-quality-cli and (possibly) data-quality-dashboard are able to deal with empty periods.
2. Use both the period_id and created fields and improve the scoring algorithm to not score timeliness when any one of these fields is unknown.

The second option also makes sense since, by using the period extracted from the title, period_id no longer refers to the time of upload but to the time covered by the data. As I understand timeliness, being on time would mean that the month of upload is ~= month covered by the data + 1, so both the period of creation and the period covered by the data should be known. This is closely related to frictionlessdata/data-quality-cli#13.

data,format,period_id,created_at
http://data.defra.gov.uk/ops/procurement/1204/EA-OVER-25K-1204.csv,csv,2012-04-01,2013-02-07T10:18:08.619710
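The scoring rule proposed above could be sketched as follows. timeliness_score is a hypothetical function, not data-quality-cli's actual API: it refuses to score when either field is unknown, and otherwise compares the month of upload against the month covered.

```python
from datetime import datetime

def timeliness_score(period_id, created_at):
    """Hypothetical timeliness rule: only score when both the period
    covered (period_id) and the creation date (created_at) are known.
    Returns None ("don't score") otherwise."""
    if not period_id or not created_at:
        return None  # unknown period or creation date: skip timeliness
    period = datetime.strptime(period_id, "%Y-%m-%d")
    created = datetime.strptime(created_at[:10], "%Y-%m-%d")
    # Whole months elapsed between the period covered and the upload
    lag = (created.year - period.year) * 12 + (created.month - period.month)
    return lag <= 1  # "on time" ~= uploaded the month after the covered month
```

Applied to the row above (period 2012-04, created 2013-02), the upload is ten months behind the period covered, so it would not count as timely.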
I'll drop some numbers. Currently we have 1306 sources, which is only a subset of UK CKAN. Using this retrieval method:

datafile['created_at'] = resource.get('created') or ast.literal_eval(resource.get('archiver', "{}")).get('created', '')

374 of the 1306 resources have a null created field.
The revision_id field is present in all of them, which could be used to get the time of the last update on CKAN. This however doesn't solve the problem if the concept of timeliness is what I explained above.
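That one-liner assumes the archiver field arrives as a stringified Python dict, hence the ast.literal_eval. A self-contained version of the same fallback, with made-up resource dicts for illustration:

```python
import ast

def get_created_at(resource):
    """Prefer the resource's own 'created' field; otherwise fall back to
    the 'created' key inside the 'archiver' field, which arrives as a
    stringified dict in this API response."""
    return resource.get('created') or \
        ast.literal_eval(resource.get('archiver', "{}")).get('created', '')

# Illustrative resource dicts (not real API output)
with_created = {'created': '2013-02-07T10:18:08'}
archiver_only = {'archiver': "{'created': '2014-01-01T00:00:00'}"}
neither = {}
```

ast.literal_eval is the safe choice here: unlike eval, it only accepts Python literals, so a malformed or hostile archiver string cannot execute code.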
There is only one row where period_id is null, which is insignificant in the context of this app, but I think the issues mentioned above provide some important insights into how to make data-quality-cli/dashboard better.
Hello @gvidon (I'm following up here on behalf of Open Knowledge)
One thing here is that the .csv files in data/* are not actually CSV; they are TSV.
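A quick way to confirm the actual delimiter is the standard library's csv.Sniffer; this is just a verification snippet (the sample row is made up), not part of the repo.

```python
import csv

def detect_delimiter(sample):
    """Guess whether a text sample is comma- or tab-delimited."""
    return csv.Sniffer().sniff(sample, delimiters=",\t").delimiter

# A made-up tab-separated sample like the files in data/*
tsv_sample = "id\ttitle\turl\n1\tSpend March 2016\thttp://example.org/a.csv\n"
```

If the files are to keep the .csv extension, the simplest fix is either renaming them to .tsv or rewriting them with a comma delimiter so the name matches the content.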