pwyf / data-quality-tester
Test IATI activity files against PWYF index methodology
Home Page: http://dataqualitytester.publishwhatyoufund.org
License: MIT License
e.g. unitedstates includes a number of different US agencies.
The BDD IATI tester is ready to use now, so we should switch out the YAML tests here, and switch in the BDD tester.
E.g. this sort of approach: https://stackoverflow.com/questions/15644964/python-progress-bar-and-downloads
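A minimal sketch of that kind of approach, assuming the download is read in chunks and the total size is known up front (for instance from a Content-Length header). The function name `copy_with_progress` and the callback signature are hypothetical and not part of the DQT:

```python
import io


def copy_with_progress(source, destination, total_bytes, on_progress, chunk_size=8192):
    """Copy source to destination in chunks, reporting progress after each chunk."""
    copied = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        destination.write(chunk)
        copied += len(chunk)
        # The caller decides how to display progress (progress bar, log line, task state).
        on_progress(copied, total_bytes)
    return copied


# In-memory stand-ins for the HTTP response body and the file on disk.
source = io.BytesIO(b"x" * 20000)
destination = io.BytesIO()
percentages = []
copied = copy_with_progress(
    source,
    destination,
    total_bytes=20000,
    on_progress=lambda done, total: percentages.append(round(100 * done / total)),
)
```

In a background task, the callback could update the task state so that the front-end can poll for a real percentage rather than showing a spinner.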
Specify whether MOUs pull from either Org or Activity file so that the test can show up on DQT.
Via Catherine Marschner.
Even different colors would be fine.
Via Catherine Marschner.
We might want to get away from pass/fail and instead provide concrete use cases for what the data could allow someone to do.
The BDD tester works for org files, too, so we might as well allow org file testing.
NB Currently if you try to test an org file, everything breaks in a weird way (see #15.) UPDATE this was fixed.
When tests have run the progress bars which are used to represent the results e.g. "Project attributes: 89% pass" just look like the tests themselves are still loading. This has caused confusion for some users.
Related issue #18
For the purposes of improving data quality, bad quality data is more useful & interesting than good quality data. So we should frontload that.
Wow, that’s really broken.
In fact, this happens whenever you test a file that contains no activities.
It’s not clear what the difference is between the test and current_test columns of the output csv. I initially thought this was just for convenience, but checking foxpath-tools, the same distinction exists in the test_doc_json_out function there.
If this were enabled, results on the relevant indicators could then be processed.
This would be very valuable @publishwhatyoufund , as otherwise users have to check these indicators offline/manually.
Not sure we want to change them here due to risk of not being consistent with the tracker.
Via Catherine Marschner.
I would like to review #36
I have tried to test the proposed fix locally, but I am having difficulties with running the DQT.
I followed the instructions at https://github.com/pwyf/data-quality-tester, which required some updates to be run afterwards.
Once the DQT loaded locally, attempting to test a file resulted in the following error:
FileNotFoundError: [Errno 2] No such file or directory: '/home/simon/code/data-quality-tester/2018-index-indicator-definitions/test_definitions/performance/../step_definitions.py'
I cloned the 2018-index-indicator-definitions repo into the data-quality-tester folder and now when I try and load an IATI activity file I am getting the following error message instead:
  File "/home/simon/code/data-quality-tester/DataQualityTester/tasks.py", line 119, in test_file_task
    result[lookup.get(out[0])] += 1
TypeError: 'bool' object is not subscriptable
The front-end hangs with this error.
It seems to be necessary to routinely clear out old tasks from the celery queue, otherwise the task runner grinds to a halt (and when new data files are uploaded, the spinny wheel just keeps spinning).
I’m not sure why this happens… I note that CoVE doesn’t process files in the background, and I wonder if it’s because of this sort of added complexity.
According to the IATI schema, element ordering matters. So it should be trivial to point to the right bit of the file and say “something’s missing here.”
NB this was part of the original plan for the DQT, but was never added. It would be super useful to include this feature!
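A rough sketch of the idea, assuming the expected element order is available as a list. `EXPECTED_ORDER` below is only an illustrative subset and not the full schema sequence, and `missing_elements` is a hypothetical helper:

```python
import xml.etree.ElementTree as ET

# Assumption: an illustrative subset of the ordered child elements of iati-activity.
EXPECTED_ORDER = ["iati-identifier", "reporting-org", "title", "description"]


def missing_elements(activity, expected=EXPECTED_ORDER):
    """Return (missing_tag, preceding_tag) pairs for expected elements that are absent.

    preceding_tag is the element the missing one should follow in the expected
    order, or None if the missing element should come first.
    """
    present = {child.tag for child in activity}
    gaps = []
    for index, tag in enumerate(expected):
        if tag not in present:
            preceding = expected[index - 1] if index > 0 else None
            gaps.append((tag, preceding))
    return gaps


activity = ET.fromstring(
    "<iati-activity>"
    "<iati-identifier>XX-EXAMPLE-001</iati-identifier>"
    "<reporting-org/>"
    "<description/>"
    "</iati-activity>"
)
gaps = missing_elements(activity)
```

Here `gaps` says that `title` is missing and belongs directly after `reporting-org`, which is enough to point the user at the right place in the file.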
Should be deployed with https://github.com/pwyf/ansible
The celery tasks do not appear to be correctly releasing the memory they use, which eventually causes the server to run low on memory.
Related info #28
Sounds like something for an “iati-tester” rather than an “iati-simple-tester”, but…
It would be useful to be able to specify organisation conditions, so that organisations that only publish data at certain hierarchies are not penalised.
Currently, files on the IATI registry can be tested. It would be useful to add a direct file upload as well (this was in an earlier version, but disappeared!)
Many of the tests inspect a different corpus from the one submitted (due to various filters within each test)
The user only gets to see the identifiers of the activities that fail any test, but not the full list of those that were assessed.
Changing the screen output to a list of all the identifiers, with a Pass / Fail message, would help. If this can be presented as a table or csv list, it would be easier for users to take the data into other analysis tools.
An extra nicety would be to default the list to show the Failed identifiers first.
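One possible shape for such a listing, sketched with made-up identifiers. The `results` data, the column names and both helper functions are assumptions for illustration, not what the DQT produces today:

```python
import csv
import io

# Hypothetical per-activity outcomes for a single test.
results = [
    ("XX-EXAMPLE-001", "Pass"),
    ("XX-EXAMPLE-002", "Fail"),
    ("XX-EXAMPLE-003", "Pass"),
    ("XX-EXAMPLE-004", "Fail"),
]

# Lower rank sorts first, so failures lead the list.
RANK = {"Fail": 0, "Pass": 1}


def failures_first(results):
    """Order results with failures first, keeping the original order within each group."""
    return sorted(results, key=lambda item: RANK[item[1]])


def as_csv(results):
    """Render the results as csv text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["iati_identifier", "outcome"])
    writer.writerows(results)
    return buffer.getvalue()


listing = as_csv(failures_first(results))
```

Because `sorted` is stable, identifiers keep their file order within the Fail and Pass groups.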
It's important to print the org identifier from iati-organisations/iati-organisation/organisation-identifier, as many publishers might have multiple iati-organisation elements in any file (under the root iati-organisations).
Without this, it is impossible to understand which iati-organisation has failed the test.
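A small sketch of pulling those identifiers out with the standard library, using the path given above. The sample XML and the function name `organisation_identifiers` are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Made-up organisation file containing two iati-organisation elements.
SAMPLE = """<iati-organisations>
  <iati-organisation>
    <organisation-identifier>XX-EXAMPLE-1</organisation-identifier>
  </iati-organisation>
  <iati-organisation>
    <organisation-identifier>XX-EXAMPLE-2</organisation-identifier>
  </iati-organisation>
</iati-organisations>"""


def organisation_identifiers(xml_text):
    """Return the organisation-identifier of every iati-organisation, in file order."""
    root = ET.fromstring(xml_text)
    return [
        (organisation.findtext("organisation-identifier") or "").strip()
        for organisation in root.findall("iati-organisation")
    ]


identifiers = organisation_identifiers(SAMPLE)
```

Each test result could then be labelled with the matching identifier, so a failure can be traced back to one iati-organisation.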
We currently only show activities that fail tests. It would be useful to also show passes and not relevants.
Via Catherine Marschner.
The Budget alignment test has two parts
Summary:
The DQT only currently tests the Capital Spend. Adding the second test (alongside #52) would make it clearer to users where data improvements need to be made
Mostly for cache busting
Neither Test 7 nor Test 10 is performed by the DQT, yet the interface implies that they could be, if data is presented.
This is misleading. Either rename these to make it clear they are not performed, or remove them completely. I'd suggest the second option @publishwhatyoufund
The technical methodology doc calls the activity-specific budget indicators
The DQT doesn't include the word Project, which can be misleading, given there are "Organisation" budget tests on the same page.
Suggestion: add Project to the names of tests 11 & 12 in the DQT, to avoid any ambiguity
Eek
in middleware.py we have
def csrf_protect():
    if request.method == 'POST':
        token = session.pop('_csrf_token', None)
        if not token or token != request.form.get('_csrf_token'):
            abort(403)
This fails to validate the CSRF token correctly if:
User opens upload page (1)
User opens upload page (2)
Both 1 and 2 will have the same CSRF token rendered in the template.
When one of the upload pages is submitted (or any POST request is made), the CSRF token is popped from the session. This means that when the user goes to submit the other page, the CSRF token is deemed invalid because it is compared against None. A simple patch to change this from pop to get should fix this.
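A framework-free sketch of the proposed change, with a plain dict standing in for the Flask session so the two-tab case can be replayed. `issue_token` and `is_valid_post` are hypothetical names, and the constant-time comparison is an extra precaution that is not in the original middleware:

```python
import hmac
import secrets


def issue_token(session):
    """Return the session's CSRF token, creating one on first use."""
    if "_csrf_token" not in session:
        session["_csrf_token"] = secrets.token_hex(16)
    return session["_csrf_token"]


def is_valid_post(session, form):
    """Check the submitted token without consuming the stored one."""
    expected = session.get("_csrf_token")  # get, not pop: the token survives the check
    submitted = form.get("_csrf_token")
    if not expected or not submitted:
        return False
    return hmac.compare_digest(expected, submitted)


# Two upload pages rendered in the same session share one token.
session = {}
page_one_form = {"_csrf_token": issue_token(session)}
page_two_form = {"_csrf_token": issue_token(session)}

first_ok = is_valid_post(session, page_one_form)
second_ok = is_valid_post(session, page_two_form)
```

With `pop`, the second submission would find no token in the session and be rejected. With `get`, both pass while a missing or mismatched token is still refused.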
CLI commands are built into flask now.
This is probably a feature we just want For Internal Use Only… But:
E.g. https://getsatisfaction.com type thing
Might be interesting to pull out a random sample activity, and render it à la d-portal? (since this is roughly what we do in the sampling phase of the aid transparency index).
Rather than having to manually go back to the homepage
Related to #65
Having spent more time with the csv download, and having requested that it be ordered according to the methodology in #65, I have two more requests:
type | indicator_num | name | score | total_tested | failed | passed | not-relevant | relevant |
---|---|---|---|---|---|---|---|---|
test | 3 | Organisation strategy is present | 100 | 1 | 0 | 1 | 0 | 1 |
test | 4 | Annual report is present | 100 | 1 | 0 | 1 | 0 | 1 |
test | 5 | Allocation policy is present | 100 | 1 | 0 | 1 | 0 | 1 |
test | 6 | Procurement policy is present | 100 | 1 | 0 | 1 | 0 | 1 |
test | 7 | Strategy (country/sector) or Memorandum of Understanding | | 1 | 0 | 0 | 1 | 0 |
test | 8 | Audit is present | 100 | 1 | 0 | 1 | 0 | 1 |
test | 9 | Organisation budget available one year forward | 100 | 1 | 0 | 1 | 0 | 1 |
test | 9 | Organisation budget available two years forward | 100 | 1 | 0 | 1 | 0 | 1 |
test | 9 | Organisation budget available three years forward | 100 | 1 | 0 | 1 | 0 | 1 |
test | 10 | Disaggregated budget | | 1 | 0 | 0 | 1 | 0 |
test | 11 | Budget available forward annually | | 1 | 0 | 0 | 1 | 0 |
test | 11 | Budget available forward quarterly | | 1 | 0 | 0 | 1 | 0 |
test | 12 | Budget document is present | | 1 | 0 | 0 | 1 | 0 |
test | 13 | Commitment is present | | 1 | 0 | 0 | 1 | 0 |
test | 14 | Disbursements or expenditures are present | | 1 | 0 | 0 | 1 | 0 |
test | 15 | Capital spend is present | | 1 | 0 | 0 | 1 | 0 |
test | 15 | Publish detailed CRS purpose codes in the sector field | | 1 | 0 | 0 | 1 | 0 |
test | 16 | Title is present | | 1 | 0 | 0 | 1 | 0 |
test | 16 | Title has at least 10 characters | | 1 | 0 | 0 | 1 | 0 |
test | 17 | Description is present | | 1 | 0 | 0 | 1 | 0 |
test | 17 | Description has at least 80 characters | | 1 | 0 | 0 | 1 | 0 |
test | 18 | Planned start date is present | | 1 | 0 | 0 | 1 | 0 |
test | 18 | Planned end date is present | | 1 | 0 | 0 | 1 | 0 |
test | 19 | Actual start date is present | | 1 | 0 | 0 | 1 | 0 |
test | 19 | Actual end date is present | | 1 | 0 | 0 | 1 | 0 |
test | 20 | Current status is present | | 1 | 0 | 0 | 1 | 0 |
test | 20 | Current status is valid | | 1 | 0 | 0 | 1 | 0 |
test | 21 | Contact info is present | | 1 | 0 | 0 | 1 | 0 |
test | 22 | Sector is present | | 1 | 0 | 0 | 1 | 0 |
test | 22 | Sector uses DAC CRS 5 digit purpose codes | | 1 | 0 | 0 | 1 | 0 |
test | 23 | Location (sub-national) | | 1 | 0 | 0 | 1 | 0 |
test | 23 | Location (sub-national) coordinates or point | | 1 | 0 | 0 | 1 | 0 |
test | 24 | Conditions data | | 1 | 0 | 0 | 1 | 0 |
test | 24 | Conditions document | | 1 | 0 | 0 | 1 | 0 |
test | 25 | IATI Identifier is present | | 1 | 0 | 0 | 1 | 0 |
test | 25 | IATI Identifier starts with reporting org ref | | 1 | 0 | 0 | 1 | 0 |
test | 26 | Flow type | | 1 | 0 | 0 | 1 | 0 |
test | 26 | Flow type uses standard codelist | | 1 | 0 | 0 | 1 | 0 |
test | 27 | Aid type is present | | 1 | 0 | 0 | 1 | 0 |
test | 27 | Aid type is valid | | 1 | 0 | 0 | 1 | 0 |
test | 28 | Default finance type | | 1 | 0 | 0 | 1 | 0 |
test | 28 | Finance type uses standard codelist | | 1 | 0 | 0 | 1 | 0 |
test | 29 | Tied aid status | | 1 | 0 | 0 | 1 | 0 |
test | 29 | Tied aid status uses standard codelist | | 1 | 0 | 0 | 1 | 0 |
test | 30 | Implementing organisation | | 1 | 0 | 0 | 1 | 0 |
test | 30 | Participating Orgs | | 1 | 0 | 0 | 1 | 0 |
test | 31 | Tender is present | | 1 | 0 | 0 | 1 | 0 |
test | 31 | Contract is present | | 1 | 0 | 0 | 1 | 0 |
test | 33 | Objectives of activity document | | 1 | 0 | 0 | 1 | 0 |
test | 34 | Pre- and/or post-project impact appraisal documents | | 1 | 0 | 0 | 1 | 0 |
test | 35 | Project performance and evaluation document | | 1 | 0 | 0 | 1 | 0 |
test | 36 | Results data | | 1 | 0 | 0 | 1 | 0 |
test | 36 | Results document | | 1 | 0 | 0 | 1 | 0 |
In the DQT there is a test for Implementer, which sits in the Project Attributes section.
This may no longer be relevant to the planned Network Data indicator, so should be removed
If it is relevant, it should be moved to the Joining-up development data section of tests
Wow, the README is woefully out of date.
The spinner is all over the place.
Currently to test a file on the registry, you have to go there and find a URL. It would be better to use the registry CKAN API to pull in org and package lists.
NB this feature existed in the very first iteration and I removed it (because it wasn’t the intended use of the DQT.)
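A sketch of what the lookup could look like against the CKAN action API. `organization_list` and `package_search` are standard CKAN actions, but the base URL, the `rows` limit and all function names here are assumptions that would need checking against the live registry:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the registry's CKAN instance is served from this base URL.
REGISTRY_BASE = "https://iatiregistry.org"


def action_url(action, **params):
    """Build a CKAN action API URL, e.g. .../api/3/action/organization_list."""
    url = f"{REGISTRY_BASE}/api/3/action/{action}"
    if params:
        url += "?" + urlencode(params)
    return url


def call_action(action, **params):
    """Call a CKAN action and return its "result" payload (makes a network request)."""
    with urlopen(action_url(action, **params)) as response:
        payload = json.load(response)
    if not payload.get("success"):
        raise RuntimeError(f"CKAN action {action!r} failed")
    return payload["result"]


def list_organisations():
    """Return the names of all publishers on the registry."""
    return call_action("organization_list")


def list_packages(organisation_name):
    """Return the package names published by one organisation."""
    result = call_action(
        "package_search", fq=f"organization:{organisation_name}", rows=1000
    )
    return [package["name"] for package in result["results"]]
```

The front-end could then offer two dropdowns (publisher, then package) instead of asking the user for a URL.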
The IATI ruleset tests are unrelated to the Aid Transparency Index, but their presence in the tester has been a source of some confusion. As such, the feature should be clearly marked as experimental.
The DQT presents its results as such (example: org planning and commitments):
The specific tests made here have no reference back to the Methodology document - eg:
Furthermore, the indicators in the DQT are not even ordered in the same way as the Methodology.
This makes it more work to equate each metric with the methodology, in order to get an understanding of data issues
What would fix this?
$ flask db upgrade
Usage: flask db upgrade [OPTIONS] [REVISION]
Error: While importing "DataQualityTester", an ImportError was raised:
Traceback (most recent call last):
  File "/home/bjwebb/opendataservices/pwyf/data-quality-tester/.ve/lib/python3.6/site-packages/flask/cli.py", line 235, in locate_app
    __import__(module_name)
  File "/home/bjwebb/opendataservices/pwyf/data-quality-tester/DataQualityTester/__init__.py", line 41, in <module>
    from DataQualityTester import commands, routes, models, views, lib
  File "/home/bjwebb/opendataservices/pwyf/data-quality-tester/DataQualityTester/commands.py", line 9, in <module>
    from DataQualityTester.models import SuppliedData
  File "/home/bjwebb/opendataservices/pwyf/data-quality-tester/DataQualityTester/models.py", line 13, in <module>
    from DataQualityTester.tasks import download_task
  File "/home/bjwebb/opendataservices/pwyf/data-quality-tester/DataQualityTester/tasks.py", line 7, in <module>
    from bdd_tester import bdd_tester
ImportError: cannot import name 'bdd_tester'
I'm assuming this is due to changes in the latest versions of the bdd_tester.
@michaelwood @Lathrisk huge thanks for the labelling and ordering of the test results in the DQT #52, and then the csv export of the results
Certainly, the DQT on-screen results are in order - with Indicator 24 ( #55 ) and 30 ( #56 ) in need of moving into the relevant section
In terms of the csv export, the tests are not presented sequentially - eg:
indicator_num | name |
---|---|
Organisational planning and commitments | |
4 | Annual report is present |
5 | Allocation policy is present |
6 | Procurement policy is present |
8 | Audit is present |
3 | Organisation strategy is present |
7 | Strategy (country/sector) or Memorandum of Understanding |
Finance and budgets | |
10 | Disaggregated budget |
14 | Disbursements or expenditures are present |
11 | Budget available forward annually |
11 | Budget available forward quarterly |
12 | Budget document is present |
9 | Organisation budget available one year forward |
9 | Organisation budget available two years forward |
9 | Organisation budget available three years forward |
15 | Capital spend is present |
15 | Publish detailed CRS purpose codes in the sector field |
13 | Commitment is present |
Project attributes | |
30 | Implementing organisation |
23 | Location (sub-national) |
23 | Location (sub-national) coordinates or point |
22 | Sector is present |
22 | Sector uses DAC CRS 5 digit purpose codes |
30 | Participating Orgs |
25 | IATI Identifier is present |
25 | IATI Identifier starts with reporting org ref |
21 | Contact info is present |
19 | Actual start date is present |
19 | Actual end date is present |
17 | Description is present |
17 | Description has at least 80 characters |
20 | Current status is present |
20 | Current status is valid |
18 | Planned start date is present |
18 | Planned end date is present |
16 | Title is present |
16 | Title has at least 10 characters |
Joining-up development data | |
26 | Flow type |
26 | Flow type uses standard codelist |
24 | Conditions data |
24 | Conditions document |
29 | Tied aid status |
29 | Tied aid status uses standard codelist |
28 | Default finance type |
28 | Finance type uses standard codelist |
27 | Aid type is present |
27 | Aid type is valid |
31 | Tender is present |
31 | Contract is present |
Performance | |
34 | Pre- and/or post-project impact appraisal documents |
35 | Project performance and evaluation document |
33 | Objectives of activity document |
36 | Results data |
36 | Results document |
Maybe this was because this functionality was implemented before #52 was completed? Would you be able to refactor the csv export so that it is in sync accordingly?
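For what it's worth, the reordering itself could be as small as the sketch below. It assumes the export has an `indicator_num` column holding integers on every row and no section heading rows, and `reorder_csv` is a hypothetical helper rather than existing DQT code:

```python
import csv
import io


def reorder_csv(csv_text):
    """Return the csv text with rows sorted by indicator number.

    sorted() is stable, so several tests under one indicator keep their
    original relative order.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = sorted(reader, key=lambda row: int(row["indicator_num"]))
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return output.getvalue()


unordered = (
    "indicator_num,name\n"
    "4,Annual report is present\n"
    "8,Audit is present\n"
    "3,Organisation strategy is present\n"
    "11,Budget available forward annually\n"
    "11,Budget available forward quarterly\n"
    "9,Organisation budget available one year forward\n"
)
ordered = reorder_csv(unordered)
```

If the export does contain section heading rows, the same sort could be applied within each section instead.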
Publish What You Fund have received the following email:
Hi,
We are testing our activity file against the PWYF data quality tester and we are getting an error on sector code (see screenshot below). The case seems to be unique for sector code 43060, which is present in the IATI sector code list. Hope you can help us with this issue.
Thank you.
Regards,
-Lulu
Consultant
SPOP- Asian Development Bank
Lulu references http://reference.iatistandard.org/203/codelists/Sector/ which is the replicated DAC 5 digit sector codelist and includes the code '43060': "Disaster Risk Reduction".
Via Catherine Marschner.
It all looks very default bootstrap at the moment. It would be good to make it look less like a hackday project and more like an actual project.
The DQT provides both the tests for Conditions:
1 - The revised methodology has changes to these, so these should be updated
2 - The revised methodology places these in the Project Attributes section, so these should be moved in the DQT