The data-quality-tester from pwyf

Add a means of disaggregating agencies within IATI publishers

e.g. unitedstates includes a number of different US agencies.

Use the BDD IATI tester

The BDD IATI tester is ready to use now, so we should switch out the YAML tests here, and switch in the BDD tester.

Perform download tasks in the background; add a progress bar (ooooh)

E.g. this sort of approach: https://stackoverflow.com/questions/15644964/python-progress-bar-and-downloads
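
A minimal sketch of that approach, assuming the download is read in fixed-size chunks from any file-like response body. The function name `copy_with_progress` and the callback signature are illustrative and not part of the DQT; in a background task the callback might update task state that the front-end polls.

```python
import io


def copy_with_progress(source, dest, total_size, on_progress, chunk_size=8192):
    """Copy source to dest in chunks, reporting percent complete after each chunk."""
    copied = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        dest.write(chunk)
        copied += len(chunk)
        if total_size:
            # Integer percentage, capped in case total_size was underestimated.
            on_progress(min(100, copied * 100 // total_size))
    return copied


# Usage with an in-memory stand-in for an HTTP response body.
payload = b"x" * 20000
seen = []
out = io.BytesIO()
copied = copy_with_progress(io.BytesIO(payload), out, len(payload), seen.append)
```

If the server does not send a content length, `total_size` can be passed as 0 and the sketch simply skips the progress callback.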

All quantitative tests should be present on the DQT

Specify whether MOUs are pulled from the Org file or the Activity file, so that the test can show up on the DQT.

Via Catherine Marschner.

For components where tests are mixed between activity and org file, add some differentiator so I can see which tests are from which file

Even different colors would be fine.

Via Catherine Marschner.

Add lots more explanatory copy

Think about language used

We might want to get away from pass/fail and instead provide concrete use cases for what the data could allow someone to do.

Allow for testing organisational files too

The BDD tester works for org files, too, so we might as well allow org file testing.

~~NB Currently if you try to test an org file, everything breaks in a weird way (see #15.)~~ UPDATE this was fixed.

Don't use progress bars as charts

Once tests have run, the progress bars used to represent the results (e.g. "Project attributes: 89% pass") look as though the tests themselves are still loading. This has caused confusion for some users.

Related issue #18

Default to ‘failing’ activities on test page

For the purposes of improving data quality, bad quality data is more useful & interesting than good quality data. So we should frontload that.

Testing an org file results in a redirect loop!

Wow, that’s really broken.

In fact, this happens whenever you test a file that contains no activities.

It’s unclear what the difference is between test and current_test columns

It’s not clear what the difference is between the test and current_test columns of the output csv. I initially thought this was just for convenience, but checking foxpath-tools, the same distinction exists in the test_doc_json_out function there.

Add function to upload both an activity and org file

If this were enabled, results for the relevant indicators could then be processed.

This would be very valuable, @publishwhatyoufund, as otherwise users have to check these indicators offline/manually.

Add some ability to at least see hierarchy assignments

Not sure we want to change them here due to risk of not being consistent with the tracker.

Via Catherine Marschner.

Download results by indicator as CSV / XLSX

Unable to test IATI file: TypeError: 'bool' object is not subscriptable in tasks.py

I would like to review #36

I have tried to test the proposed fix locally, but I am having difficulties with running the DQT.

I followed the instructions at https://github.com/pwyf/data-quality-tester, which required some updates to be run afterwards.

Once the DQT loaded locally, attempting to test a file resulted in the following error:

FileNotFoundError: [Errno 2] No such file or directory: '/home/simon/code/data-quality-tester/2018-index-indicator-definitions/test_definitions/performance/../step_definitions.py'

I cloned the 2018-index-indicator-definitions repo into the data-quality-tester folder and now when I try and load an IATI activity file I am getting the following error message instead:

>   File "/home/simon/code/data-quality-tester/DataQualityTester/tasks.py", line 119, in test_file_task
    result[lookup.get(out[0])] += 1
TypeError: 'bool' object is not subscriptable

The front-end hangs with this error.

Drop celery completely

It seems to be necessary to routinely clear out old tasks from the celery queue, otherwise the task runner grinds to a halt (and when new data files are uploaded, the spinny wheel just keeps spinning).

I’m not sure why this happens… I note that CoVE doesn’t process files in the background, and I wonder if it’s because of this sort of added complexity.

Flag errors inline in the XML

According to the IATI schema, element ordering matters. So it should be trivial to point to the right bit of the file and say “something’s missing here.”

NB this was part of the original plan for the DQT, but was never added. It would be super useful to include this feature!

Sort out deployment

Should be deployed with https://github.com/pwyf/ansible

Memory usage issues after time

The celery tasks do not appear to release the memory they use correctly, which eventually causes the server to run low on memory.

Related info #28

Add a way to specify organisation conditions i.e. hierarchies

~~Sounds like something for an “iati-tester” rather than an “iati-simple-tester”, but…~~

It would be useful to be able to specify organisation conditions, so that organisations that only publish data at certain hierarchies are not penalised.
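
A rough sketch of what such a condition could look like, assuming the condition is simply "only assess activities at hierarchy N". It reads the `hierarchy` attribute of each `iati-activity` and treats a missing attribute as 1, which is my understanding of the IATI default. The function name and sample identifiers are hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<iati-activities version="2.03">
  <iati-activity hierarchy="1"><iati-identifier>XM-EX-PARENT</iati-identifier></iati-activity>
  <iati-activity hierarchy="2"><iati-identifier>XM-EX-CHILD</iati-identifier></iati-activity>
  <iati-activity><iati-identifier>XM-EX-DEFAULT</iati-identifier></iati-activity>
</iati-activities>"""


def activities_at_hierarchy(xml_text, hierarchy):
    """Return identifiers of activities published at the given hierarchy level."""
    root = ET.fromstring(xml_text)
    return [
        activity.findtext("iati-identifier")
        for activity in root.findall("iati-activity")
        # Assumption: an absent hierarchy attribute means level 1.
        if activity.get("hierarchy", "1") == str(hierarchy)
    ]


children = activities_at_hierarchy(SAMPLE, 2)
parents = activities_at_hierarchy(SAMPLE, 1)
```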

Add file upload to flask app

Currently, files on the IATI registry can be tested. It would be useful to add a direct file upload as well (this was in an earlier version, but disappeared!)

For each test, list the identifiers that have been assessed, and their pass/fail status

Many of the tests inspect a different corpus from the one submitted (due to various filters within each test)

The user only gets to see the identifiers of the activities that fail any test, but not the full list of those that were assessed.

Changing the screen output to a list of all the identifiers, each with a Pass / Fail message, would help. If this can be presented as a table or a csv list, it would be easier for users to take the data into other analysis tools.

An extra nicety would be to default the list to show the Failed identifiers first.
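
As an illustration only, the requested output could be produced along these lines, assuming each test yields (identifier, passed) pairs for every activity it assessed. The function name, column names and sample identifiers are made up for the sketch.

```python
import csv
import io


def results_as_csv(results):
    """Render (identifier, passed) pairs as CSV text, failed identifiers first."""
    # False sorts before True, and sorted() is stable, so fails come first
    # while the original order is kept within each group.
    ordered = sorted(results, key=lambda item: item[1])
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["iati_identifier", "status"])
    for identifier, passed in ordered:
        writer.writerow([identifier, "Pass" if passed else "Fail"])
    return buffer.getvalue()


report = results_as_csv([("XM-EX-1", True), ("XM-EX-2", False), ("XM-EX-3", True)])
```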

BUG: Org file tests do not report the organisation-identifier in failed tests

eg:

https://dataqualitytester.publishwhatyoufund.org/package/b80f8d0a-6e33-4722-a211-5c59a350a6de/finance/Organisation%2Bbudget%2Bavailable%2Btwo%2Byears%2Bforward

It's important to print the org identifier from iati-organisations/iati-organisation/organisation-identifier, as many publishers might have multiple iati-organisation elements in a single file (under the root iati-organisations element).

Without this, it is impossible to tell which iati-organisation has failed the test.
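
A small sketch of pulling out that identifier, using the element path quoted above. The sample file and the function name are invented for illustration; how the DQT would attach the identifier to each test result is not shown.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<iati-organisations version="2.03">
  <iati-organisation>
    <organisation-identifier>XM-EXAMPLE-1</organisation-identifier>
  </iati-organisation>
  <iati-organisation>
    <organisation-identifier>XM-EXAMPLE-2</organisation-identifier>
  </iati-organisation>
</iati-organisations>"""


def organisation_identifiers(xml_text):
    """Return the organisation-identifier of every iati-organisation in the file."""
    root = ET.fromstring(xml_text)
    return [
        # Fall back to an empty string if the element is missing.
        (org.findtext("organisation-identifier") or "").strip()
        for org in root.findall("iati-organisation")
    ]


org_ids = organisation_identifiers(SAMPLE)
```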

@michaelwood @Lathrisk @publishwhatyoufund

Show passes and “not relevant”s as well as fails

We currently only show activities that fail tests. It would be useful to also show passes and not relevants.

Via Catherine Marschner.

Budget alignment test is not fully tested

The Budget alignment test has two parts

Summary:

Capital Spend being present
Detailed CRS codes in the sector field

The DQT currently tests only the Capital Spend part. Adding the second test (alongside #52) would make it clearer to users where data improvements need to be made.

Add webassets

Mostly for cache busting

Remove tests that are not performed

Neither Test 7 nor Test 10 is performed by the DQT, yet the interface implies that they could be if data were presented.

This is misleading. Either rename these to make it clear they are not performed, or remove them completely. I'd suggest the second option @publishwhatyoufund

@michaelwood @Lathrisk

Add "Project" to the names of tests #11 & #12

The technical methodology doc calls the activity-specific budget indicators

11 Project budget
12 Project budget documents

The DQT doesn't include the word "Project", which can be misleading, given there are "Organisation" budget tests on the same page

Suggestion: add Project to the names of tests 11 & 12 in the DQT, to avoid any ambiguity

@publishwhatyoufund @Lathrisk @michaelwood

Add pagination on test page

Eek

CSRF tokens get popped from the session, invalidating the next submission

in middleware.py we have

def csrf_protect():
    if request.method == 'POST':
        token = session.pop('_csrf_token', None)
        if not token or token != request.form.get('_csrf_token'):
            abort(403)

This fails to validate the CSRF token correctly in the following case:
User opens upload page (1)
User opens upload page (2)
Both 1 and 2 will have the same CSRF token rendered in the template

When one of the upload pages is submitted (or any other POST request is made), the CSRF token is popped from the session. When the user then submits the other page, its token is deemed invalid because it is compared against None. A simple patch to change this from pop to get should fix this.
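
The effect of the proposed change can be sketched without Flask, using a plain dict as a stand-in for the session. The helper names `issue_csrf_token` and `csrf_is_valid` are hypothetical; in the real middleware the same `get` would replace the `pop` inside `csrf_protect`.

```python
import hmac
import secrets


def issue_csrf_token(session):
    """Return the session's token, creating one on first use."""
    if "_csrf_token" not in session:
        session["_csrf_token"] = secrets.token_hex(16)
    return session["_csrf_token"]


def csrf_is_valid(session, form):
    """Check the submitted token without removing it from the session."""
    token = session.get("_csrf_token")  # get, not pop: other open pages stay valid
    submitted = form.get("_csrf_token")
    if not token or not submitted:
        return False
    # Constant-time comparison of the two token strings.
    return hmac.compare_digest(token, submitted)


session = {}
page_one = {"_csrf_token": issue_csrf_token(session)}
page_two = {"_csrf_token": issue_csrf_token(session)}
first_ok = csrf_is_valid(session, page_one)
second_ok = csrf_is_valid(session, page_two)
```

The trade-off is that a token now lives for the whole session rather than for a single request, which is the usual behaviour of per-session CSRF tokens.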

Don’t use Flask-Script

CLI commands are built into flask now.

Experimental: Test an entire organisation (with conditions) on the IATI registry

This is probably a feature we just want For Internal Use Only… But:

this could be really useful in thinking about modularising the tracker
this probably wouldn’t be tooooo difficult to do

Add user survey / contact details / analytics

E.g. https://getsatisfaction.com type thing

Add a “show me [a passing activity] at random” button

Might be interesting to pull out a random sample activity, and render it à la d-portal? (since this is roughly what we do in the sampling phase of the aid transparency index).

Add a ‘test another file’ button (30 mins)

Rather than having to manually go back to the homepage

Include "relevant" in the CSV file - and drop the category scores

Related to #65

Having spent more time with the csv download, and having requested that it be ordered according to the methodology in #65, I've two more requests:

Add a new column for "relevant" to state the number of records that were relevant to the test (we state total and not relevant, but miss that last step)
Remove the categories, so we just have an ordered list of the tests - eg:

type	indicator_num	name	score	total_tested	passed	not-relevant	relevant
test	3	Organisation strategy is present	100	1	1	0	1
test	4	Annual report is present	100	1	1	0	1
test	5	Allocation policy is present	100	1	1	0	1
test	6	Procurement policy is present	100	1	1	0	1
test	7	Strategy (country/sector) or Memorandum of Understanding		1	0	1	0
test	8	Audit is present	100	1	1	0	1
test	9	Organisation budget available one year forward	100	1	1	0	1
test	9	Organisation budget available two years forward	100	1	1	0	1
test	9	Organisation budget available three years forward	100	1	1	0	1
test	10	Disaggregated budget		1	0	1	0
test	11	Budget available forward annually		1	0	1	0
test	11	Budget available forward quarterly		1	0	1	0
test	12	Budget document is present		1	0	1	0
test	13	Commitment is present		1	0	1	0
test	14	Disbursements or expenditures are present		1	0	1	0
test	15	Capital spend is present		1	0	1	0
test	15	Publish detailed CRS purpose codes in the sector field		1	0	1	0
test	16	Title is present		1	0	1	0
test	16	Title has at least 10 characters		1	0	1	0
test	17	Description is present		1	0	1	0
test	17	Description has at least 80 characters		1	0	1	0
test	18	Planned start date is present		1	0	1	0
test	18	Planned end date is present		1	0	1	0
test	19	Actual start date is present		1	0	1	0
test	19	Actual end date is present		1	0	1	0
test	20	Current status is present		1	0	1	0
test	20	Current status is valid		1	0	1	0
test	21	Contact info is present		1	0	1	0
test	22	Sector is present		1	0	1	0
test	22	Sector uses DAC CRS 5 digit purpose codes		1	0	1	0
test	23	Location (sub-national)		1	0	1	0
test	23	Location (sub-national) coordinates or point		1	0	1	0
test	24	Conditions data		1	0	1	0
test	24	Conditions document		1	0	1	0
test	25	IATI Identifier is present		1	0	1	0
test	25	IATI Identifier starts with reporting org ref		1	0	1	0
test	26	Flow type		1	0	1	0
test	26	Flow type uses standard codelist		1	0	1	0
test	27	Aid type is present		1	0	1	0
test	27	Aid type is valid		1	0	1	0
test	28	Default finance type		1	0	1	0
test	28	Finance type uses standard codelist		1	0	1	0
test	29	Tied aid status		1	0	1	0
test	29	Tied aid status uses standard codelist		1	0	1	0
test	30	Implementing organisation		1	0	1	0
test	30	Participating Orgs		1	0	1	0
test	31	Tender is present		1	0	1	0
test	31	Contract is present		1	0	1	0
test	33	Objectives of activity document		1	0	1	0
test	34	Pre- and/or post-project impact appraisal documents		1	0	1	0
test	35	Project performance and evaluation document		1	0	1	0
test	36	Results data		1	0	1	0
test	36	Results document		1	0	1	0
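
For what it's worth, the extra column is just total_tested minus not-relevant. A sketch of deriving it, assuming a tab-separated export with the column names used in the example above (the real export may name them differently):

```python
import csv
import io

SAMPLE = (
    "type\tindicator_num\tname\tscore\ttotal_tested\tpassed\tnot-relevant\n"
    "test\t3\tOrganisation strategy is present\t100\t1\t1\t0\n"
    "test\t10\tDisaggregated budget\t\t1\t0\t1\n"
)


def add_relevant_column(tsv_text):
    """Return the rows as dicts, each with a computed 'relevant' count."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for row in reader:
        # relevant = records tested minus those the test did not apply to
        row["relevant"] = int(row["total_tested"]) - int(row["not-relevant"])
        rows.append(row)
    return rows


rows = add_relevant_column(SAMPLE)
```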

Indicator 30: Network Data (if implemented) - Implementer test is no longer relevant / needs moving

In the DQT there is a test for Implementer, which sits in the Project Attributes section

This may no longer be relevant to the planned Network Data indicator, so should be removed

If it is relevant, it should be moved to the Joining-up development data section of tests

Update the README

Wow, the README is woefully out of date.

Fix the horrible loading screen (30 mins)

The spinner is all over the place.

Test all files on the IATI registry

Currently to test a file on the registry, you have to go there and find a URL. It would be better to use the registry CKAN API to pull in org and package lists.

NB this feature existed in the very first iteration and I removed it (because it wasn’t the intended use of the DQT.)
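
A hedged sketch of what the registry lookup might involve. It assumes the registry exposes the standard CKAN action API at the base URL below, and uses the CKAN `organization_list` and `package_search` actions; `example_org` is a placeholder, and the helper names are invented. Only the URL building is exercised here; `call_action` makes a real network request.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the IATI registry serves the CKAN action API at this base URL.
REGISTRY_API = "https://iatiregistry.org/api/3/action/"


def action_url(action, **params):
    """Build the URL for a CKAN action call with optional query parameters."""
    query = "?" + urlencode(params) if params else ""
    return REGISTRY_API + action + query


def call_action(action, **params):
    """Call a CKAN action and return its 'result' payload (network access needed)."""
    with urlopen(action_url(action, **params)) as response:
        return json.load(response)["result"]


org_list_url = action_url("organization_list")
packages_url = action_url("package_search", fq="organization:example_org", rows=100)
```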

Mark IATI ruleset tests as experimental

The IATI ruleset tests are unrelated to the Aid Transparency Index, but their presence in the tester has been a source of some confusion. As such, the feature should be clearly marked as experimental.

Present each indicator with its ID from the methodology, and in order

The DQT presents its results as such (example: org planning and commitments):

The specific tests made here have no reference back to the Methodology document - eg:

Furthermore, the indicators in the DQT are not even ordered in the same way as the Methodology.

This makes it more work to equate each metric with the methodology, in order to get an understanding of data issues

What would fix this?

For each indicator in the DQT, number it according to its corresponding indicator in the methodology
In the DQT results page, order the indicators by this numbering

ImportError: cannot import name 'bdd_tester'

$ flask db upgrade
Usage: flask db upgrade [OPTIONS] [REVISION]

Error: While importing "DataQualityTester", an ImportError was raised:

Traceback (most recent call last):
  File "/home/bjwebb/opendataservices/pwyf/data-quality-tester/.ve/lib/python3.6/site-packages/flask/cli.py", line 235, in locate_app
    __import__(module_name)
  File "/home/bjwebb/opendataservices/pwyf/data-quality-tester/DataQualityTester/__init__.py", line 41, in <module>
    from DataQualityTester import commands, routes, models, views, lib
  File "/home/bjwebb/opendataservices/pwyf/data-quality-tester/DataQualityTester/commands.py", line 9, in <module>
    from DataQualityTester.models import SuppliedData
  File "/home/bjwebb/opendataservices/pwyf/data-quality-tester/DataQualityTester/models.py", line 13, in <module>
    from DataQualityTester.tasks import download_task
  File "/home/bjwebb/opendataservices/pwyf/data-quality-tester/DataQualityTester/tasks.py", line 7, in <module>
    from bdd_tester import bdd_tester
ImportError: cannot import name 'bdd_tester'

I'm assuming this is due to changes in the latest versions of the bdd_tester.

csv download not in order of methodology

@michaelwood @Lathrisk huge thanks for the labelling and ordering of the test results in the DQT #52, and then the csv export of the results

Certainly, the DQT on-screen results are in order - with Indicator 24 ( #55 ) and 30 ( #56 ) in need of moving into the relevant section

In terms of the csv export, the tests are not presented sequentially - eg:

indicator_num	name
	Organisational planning and commitments
4	Annual report is present
5	Allocation policy is present
6	Procurement policy is present
8	Audit is present
3	Organisation strategy is present
7	Strategy (country/sector) or Memorandum of Understanding

	Finance and budgets
10	Disaggregated budget
14	Disbursements or expenditures are present
11	Budget available forward annually
11	Budget available forward quarterly
12	Budget document is present
9	Organisation budget available one year forward
9	Organisation budget available two years forward
9	Organisation budget available three years forward
15	Capital spend is present
15	Publish detailed CRS purpose codes in the sector field
13	Commitment is present

	Project attributes
30	Implementing organisation
23	Location (sub-national)
23	Location (sub-national) coordinates or point
22	Sector is present
22	Sector uses DAC CRS 5 digit purpose codes
30	Participating Orgs
25	IATI Identifier is present
25	IATI Identifier starts with reporting org ref
21	Contact info is present
19	Actual start date is present
19	Actual end date is present
17	Description is present
17	Description has at least 80 characters
20	Current status is present
20	Current status is valid
18	Planned start date is present
18	Planned end date is present
16	Title is present
16	Title has at least 10 characters

	Joining-up development data
26	Flow type
26	Flow type uses standard codelist
24	Conditions data
24	Conditions document
29	Tied aid status
29	Tied aid status uses standard codelist
28	Default finance type
28	Finance type uses standard codelist
27	Aid type is present
27	Aid type is valid
31	Tender is present
31	Contract is present

	Performance
34	Pre- and/or post-project impact appraisal documents
35	Project performance and evaluation document
33	Objectives of activity document
36	Results data
36	Results document

Maybe this was because this functionality was implemented before #52 was completed? Would you be able to refactor the csv export so that it is in sync with the methodology order?
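
In case it helps, the ordering itself is a stable sort on the numeric indicator_num. The sketch below assumes the category header rows (which have a blank indicator_num) have already been dropped, and the function name is made up:

```python
def order_by_methodology(rows):
    """Sort test rows by indicator number, keeping the original order within an indicator."""
    # int() avoids the string ordering problem where "10" sorts before "9".
    return sorted(rows, key=lambda row: int(row["indicator_num"]))


rows = [
    {"indicator_num": "10", "name": "Disaggregated budget"},
    {"indicator_num": "9", "name": "Organisation budget available one year forward"},
    {"indicator_num": "3", "name": "Organisation strategy is present"},
    {"indicator_num": "9", "name": "Organisation budget available two years forward"},
]
ordered = order_by_methodology(rows)
```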

Sector codelist error

Publish What You Fund have received the following email:

Hi,

We are testing our activity file against the PWYF data quality tester and we are getting an error on sector code (see screenshot below). The case seems to be unique for sector code 43060, which is present in the IATI sector code list. Hope you can help us with this issue.

Thank you.

Regards,

-Lulu

Consultant

SPOP- Asian Development Bank

Lulu references http://reference.iatistandard.org/203/codelists/Sector/ which is the replicated DAC 5 digit sector codelist and includes the code '43060': "Disaster Risk Reduction".

Conditions should be present
Conditions documents should be present

1 - The revised methodology has changes to these, so these should be updated
2 - The revised methodology places these in the Project Attributes section, so these should be moved in the DQT

@publishwhatyoufund
