
data-quality-tester's People

Contributors

andylolz, bjwebb, dependabot[bot], kindly, markbrough, michaelwood, mk270, simon-20, siwhitehouse

data-quality-tester's Issues

Think about language used

We might want to get away from pass/fail and instead provide concrete use cases for what the data could allow someone to do.

Don't use progress bars as charts

Once the tests have run, the progress bars used to represent the results (e.g. "Project attributes: 89% pass") look as though the tests themselves are still loading. This has confused some users.

Related issue #18

Unable to test IATI file: TypeError: 'bool' object is not subscriptable in tasks.py

I would like to review #36

I have tried to test the proposed fix locally, but I am having difficulties with running the DQT.

I followed the instructions at https://github.com/pwyf/data-quality-tester, which required some updates to be run afterwards.

Once the DQT loaded locally, attempting to test a file resulted in the following error:

FileNotFoundError: [Errno 2] No such file or directory: '/home/simon/code/data-quality-tester/2018-index-indicator-definitions/test_definitions/performance/../step_definitions.py'

I cloned the 2018-index-indicator-definitions repo into the data-quality-tester folder, and now when I try to load an IATI activity file I get the following error message instead:

>   File "/home/simon/code/data-quality-tester/DataQualityTester/tasks.py", line 119, in test_file_task
    result[lookup.get(out[0])] += 1
TypeError: 'bool' object is not subscriptable

The front-end hangs with this error.
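
The traceback suggests that `out` is a plain bool at line 119 rather than a sequence whose first item is the outcome. What follows is only a minimal defensive sketch, assuming the test runner may return either shape. The `tally_outcome` helper and the `LOOKUP` mapping are hypothetical and are not the DQT's actual code:

```python
from collections import Counter

# Hypothetical mapping from a test outcome to a result bucket.
LOOKUP = {True: "passed", False: "failed", None: "not_relevant"}


def tally_outcome(result, out):
    """Count one outcome, accepting either a bare value or a sequence."""
    # A bare bool (or None) is not subscriptable, so only index sequences.
    outcome = out[0] if isinstance(out, (tuple, list)) else out
    result[LOOKUP.get(outcome, "not_relevant")] += 1
    return result


result = Counter()
tally_outcome(result, (True, "explanation"))  # sequence form
tally_outcome(result, False)                  # bare bool, as in the traceback
tally_outcome(result, None)                   # treated as not relevant
```

Whether normalising the value is the right fix depends on what the newer bdd_tester actually returns, which this page does not show.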

Drop celery completely

It seems to be necessary to routinely clear out old tasks from the celery queue, otherwise the task runner grinds to a halt (and when new data files are uploaded, the spinny wheel just keeps spinning).

I’m not sure why this happens… I note that CoVE doesn’t process files in the background, and I wonder if it’s because of this sort of added complexity.

Flag errors inline in the XML

According to the IATI schema, element ordering matters. So it should be trivial to point to the right bit of the file and say “something’s missing here.”

NB this was part of the original plan for the DQT, but was never added. It would be super useful to include this feature!
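
If the schema fixes the order of child elements, the position of a missing element can be derived from its neighbours. The sketch below illustrates the idea with the standard library only. `EXPECTED_ORDER` is a hypothetical, heavily shortened list (the real sequence would have to come from the IATI schema), and `insertion_index` is an illustrative helper, not part of the DQT:

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened ordering; the real one comes from the IATI schema.
EXPECTED_ORDER = ["iati-identifier", "reporting-org", "title", "description"]


def insertion_index(parent, missing_tag, order=EXPECTED_ORDER):
    """Return the child index at which `missing_tag` should appear."""
    rank = order.index(missing_tag)
    for index, child in enumerate(parent):
        # The first child that belongs later in the sequence marks the spot.
        if child.tag in order and order.index(child.tag) > rank:
            return index
    # Nothing later was found, so the element belongs at the end.
    return len(parent)


activity = ET.fromstring(
    "<iati-activity>"
    "<iati-identifier>XX-1</iati-identifier>"
    "<reporting-org/>"
    "<description/>"
    "</iati-activity>"
)
position = insertion_index(activity, "title")
```

Here `position` points at the `description` element, so a message could say that `title` is missing just before it. Reporting an actual line number would need a parser that records source positions, which `xml.etree.ElementTree` does not do.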

Memory usage issues after time

The celery tasks do not appear to release the memory they use correctly, which eventually causes the server to run low on memory.

Related info #28

Add a way to specify organisation conditions i.e. hierarchies

Sounds like something for an “iati-tester” rather than an “iati-simple-tester”, but…

It would be useful to be able to specify organisation conditions, so that organisations that only publish data at certain hierarchies are not penalised.

Add file upload to flask app

Currently, files on the IATI registry can be tested. It would be useful to add a direct file upload as well (this was in an earlier version, but disappeared!)

For each test, list the identifiers that have been assessed, and their pass/fail status

Many of the tests inspect a different corpus from the one submitted (due to various filters within each test).

The user only gets to see the identifiers of the activities that fail any test, but not the full list of those that were assessed.

Changing the screen output to a list of all the identifiers, each with a Pass / Fail message, would help. If this can be presented as a table or a csv list, it would be easier for users to take the data into other analysis tools.

An extra nicety would be to sort the list so that the failed identifiers come first.
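
A rough sketch of what such an export could look like, assuming each test yields (identifier, passed) pairs. The function name, the column names and the sample identifiers are all made up for illustration:

```python
import csv
import io


def results_to_csv(results):
    """Write (identifier, passed) pairs as CSV with failures listed first."""
    # False sorts before True, so failed identifiers come first;
    # ties are broken by identifier for a predictable order.
    ordered = sorted(results, key=lambda row: (row[1], row[0]))
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["iati_identifier", "status"])
    for identifier, passed in ordered:
        writer.writerow([identifier, "Pass" if passed else "Fail"])
    return buffer.getvalue()


# Made-up identifiers for illustration only.
sample = [("XX-1", True), ("XX-3", False), ("XX-2", False)]
report = results_to_csv(sample)
```

Activities filtered out by a test would need a third status (for example "Not relevant") if the full assessed corpus is to be visible.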

BUG: Org file tests do not report the organisation-identifier in failed tests

eg:

https://dataqualitytester.publishwhatyoufund.org/package/b80f8d0a-6e33-4722-a211-5c59a350a6de/finance/Organisation%2Bbudget%2Bavailable%2Btwo%2Byears%2Bforward

[Screenshot: Data-Quality-Tester (1)]

It's important to print the org identifier from iati-organisations/iati-organisation/organisation-identifier, as many publishers might have multiple iati-organisation elements in any file (under the root iati-organisations).

Without this, it is impossible to tell which iati-organisation has failed the test.
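
As an illustration of the lookup described above, the following sketch reads the identifier of every iati-organisation in a file using the standard library. The sample XML and the `organisation_identifiers` helper are invented for this example and are not taken from the DQT:

```python
import xml.etree.ElementTree as ET

# Invented sample with two organisations under one root element.
ORG_FILE = """
<iati-organisations version="2.03">
  <iati-organisation>
    <organisation-identifier>XX-ORG-1</organisation-identifier>
  </iati-organisation>
  <iati-organisation>
    <organisation-identifier>XX-ORG-2</organisation-identifier>
  </iati-organisation>
</iati-organisations>
"""


def organisation_identifiers(xml_text):
    """Return the organisation-identifier of each iati-organisation."""
    root = ET.fromstring(xml_text)
    identifiers = []
    for org in root.findall("iati-organisation"):
        # Fall back to a marker so a missing identifier is still reported.
        text = org.findtext("organisation-identifier", default="(missing)")
        identifiers.append(text.strip())
    return identifiers


identifiers = organisation_identifiers(ORG_FILE)
```

Attaching one of these identifiers to each failed result would show which organisation the failure belongs to.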

@michaelwood @Lathrisk @publishwhatyoufund

Budget alignment test is not fully tested

The Budget alignment test has two parts

Summary:

  • Capital Spend being present
  • Detailed CRS codes in the sector field

The DQT currently tests only the Capital Spend. Adding the second test (alongside #52) would make it clearer to users where data improvements need to be made.

Add "Project" to the names of tests #11 & #12

The technical methodology doc calls the activity-specific budget indicators

  • 11 Project budget
  • 12 Project budget documents

The DQT doesn't include the word "Project", which can be misleading, given there are "Organisation" budget tests on the same page.

[Screenshot: Data-Quality-Tester (5)]

Suggestion: add Project to the names of tests 11 & 12 in the DQT, to avoid any ambiguity

@publishwhatyoufund @Lathrisk @michaelwood

CSRF tokens get popped from the session, invalidating the next submission

In middleware.py we have:

def csrf_protect():
    if request.method == 'POST':
        token = session.pop('_csrf_token', None)
        if not token or token != request.form.get('_csrf_token'):
            abort(403)

This fails to validate the CSRF token correctly in the following case:

  • User opens upload page (1)
  • User opens upload page (2)

Both 1 and 2 will have the same CSRF token rendered in the template.

When one of the upload pages is submitted (or any POST request is made), the CSRF token is popped from the session. When the user then submits the other page, its CSRF token is deemed invalid because it is compared against None. A simple patch changing pop to get should fix this.
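
The difference between the two behaviours can be reproduced without Flask. In this sketch plain dicts stand in for the session and for `request.form`, and the two helper functions are illustrative only, not the contents of middleware.py:

```python
def csrf_ok_pop(session, form):
    """Current behaviour: the token is removed from the session on each POST."""
    token = session.pop("_csrf_token", None)
    return bool(token) and token == form.get("_csrf_token")


def csrf_ok_get(session, form):
    """Proposed behaviour: the token stays in the session after a POST."""
    token = session.get("_csrf_token")
    return bool(token) and token == form.get("_csrf_token")


# Two open upload pages carry the same token in their forms.
form = {"_csrf_token": "abc123"}

session = {"_csrf_token": "abc123"}
pop_results = [csrf_ok_pop(session, form), csrf_ok_pop(session, form)]

session = {"_csrf_token": "abc123"}
get_results = [csrf_ok_get(session, form), csrf_ok_get(session, form)]
```

With pop, the second submission is rejected because the session no longer holds a token. With get, both succeed, at the cost of the token remaining valid for as long as it stays in the session.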

Include "relevant" in the CSV file - and drop the category scores

Related to #65

Having spent more time with the csv download, and having requested in #65 that it be ordered according to the methodology, I have two more requests:

  • Add a new column for "relevant" to state the number of records that were relevant to the test (we state total and not relevant, but miss that last step)
  • Remove the categories, so we just have an ordered list of the tests - eg:
| type | indicator_num | name | score | total_tested | failed | passed | not-relevant | relevant |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| test | 3 | Organisation strategy is present | 100 | 1 | 0 | 1 | 0 | 1 |
| test | 4 | Annual report is present | 100 | 1 | 0 | 1 | 0 | 1 |
| test | 5 | Allocation policy is present | 100 | 1 | 0 | 1 | 0 | 1 |
| test | 6 | Procurement policy is present | 100 | 1 | 0 | 1 | 0 | 1 |
| test | 7 | Strategy (country/sector) or Memorandum of Understanding |  | 1 | 0 | 0 | 1 | 0 |
| test | 8 | Audit is present | 100 | 1 | 0 | 1 | 0 | 1 |
| test | 9 | Organisation budget available one year forward | 100 | 1 | 0 | 1 | 0 | 1 |
| test | 9 | Organisation budget available two years forward | 100 | 1 | 0 | 1 | 0 | 1 |
| test | 9 | Organisation budget available three years forward | 100 | 1 | 0 | 1 | 0 | 1 |
| test | 10 | Disaggregated budget |  | 1 | 0 | 0 | 1 | 0 |
| test | 11 | Budget available forward annually |  | 1 | 0 | 0 | 1 | 0 |
| test | 11 | Budget available forward quarterly |  | 1 | 0 | 0 | 1 | 0 |
| test | 12 | Budget document is present |  | 1 | 0 | 0 | 1 | 0 |
| test | 13 | Commitment is present |  | 1 | 0 | 0 | 1 | 0 |
| test | 14 | Disbursements or expenditures are present |  | 1 | 0 | 0 | 1 | 0 |
| test | 15 | Capital spend is present |  | 1 | 0 | 0 | 1 | 0 |
| test | 15 | Publish detailed CRS purpose codes in the sector field |  | 1 | 0 | 0 | 1 | 0 |
| test | 16 | Title is present |  | 1 | 0 | 0 | 1 | 0 |
| test | 16 | Title has at least 10 characters |  | 1 | 0 | 0 | 1 | 0 |
| test | 17 | Description is present |  | 1 | 0 | 0 | 1 | 0 |
| test | 17 | Description has at least 80 characters |  | 1 | 0 | 0 | 1 | 0 |
| test | 18 | Planned start date is present |  | 1 | 0 | 0 | 1 | 0 |
| test | 18 | Planned end date is present |  | 1 | 0 | 0 | 1 | 0 |
| test | 19 | Actual start date is present |  | 1 | 0 | 0 | 1 | 0 |
| test | 19 | Actual end date is present |  | 1 | 0 | 0 | 1 | 0 |
| test | 20 | Current status is present |  | 1 | 0 | 0 | 1 | 0 |
| test | 20 | Current status is valid |  | 1 | 0 | 0 | 1 | 0 |
| test | 21 | Contact info is present |  | 1 | 0 | 0 | 1 | 0 |
| test | 22 | Sector is present |  | 1 | 0 | 0 | 1 | 0 |
| test | 22 | Sector uses DAC CRS 5 digit purpose codes |  | 1 | 0 | 0 | 1 | 0 |
| test | 23 | Location (sub-national) |  | 1 | 0 | 0 | 1 | 0 |
| test | 23 | Location (sub-national) coordinates or point |  | 1 | 0 | 0 | 1 | 0 |
| test | 24 | Conditions data |  | 1 | 0 | 0 | 1 | 0 |
| test | 24 | Conditions document |  | 1 | 0 | 0 | 1 | 0 |
| test | 25 | IATI Identifier is present |  | 1 | 0 | 0 | 1 | 0 |
| test | 25 | IATI Identifier starts with reporting org ref |  | 1 | 0 | 0 | 1 | 0 |
| test | 26 | Flow type |  | 1 | 0 | 0 | 1 | 0 |
| test | 26 | Flow type uses standard codelist |  | 1 | 0 | 0 | 1 | 0 |
| test | 27 | Aid type is present |  | 1 | 0 | 0 | 1 | 0 |
| test | 27 | Aid type is valid |  | 1 | 0 | 0 | 1 | 0 |
| test | 28 | Default finance type |  | 1 | 0 | 0 | 1 | 0 |
| test | 28 | Finance type uses standard codelist |  | 1 | 0 | 0 | 1 | 0 |
| test | 29 | Tied aid status |  | 1 | 0 | 0 | 1 | 0 |
| test | 29 | Tied aid status uses standard codelist |  | 1 | 0 | 0 | 1 | 0 |
| test | 30 | Implementing organisation |  | 1 | 0 | 0 | 1 | 0 |
| test | 30 | Participating Orgs |  | 1 | 0 | 0 | 1 | 0 |
| test | 31 | Tender is present |  | 1 | 0 | 0 | 1 | 0 |
| test | 31 | Contract is present |  | 1 | 0 | 0 | 1 | 0 |
| test | 33 | Objectives of activity document |  | 1 | 0 | 0 | 1 | 0 |
| test | 34 | Pre- and/or post-project impact appraisal documents |  | 1 | 0 | 0 | 1 | 0 |
| test | 35 | Project performance and evaluation document |  | 1 | 0 | 0 | 1 | 0 |
| test | 36 | Results data |  | 1 | 0 | 0 | 1 | 0 |
| test | 36 | Results document |  | 1 | 0 | 0 | 1 | 0 |
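
In the example rows, the requested "relevant" value equals total_tested minus not-relevant. Assuming that relationship holds in general, the column could be derived as in this sketch. The `add_relevant` helper is hypothetical, and the rows are read as strings, as `csv.DictReader` would return them:

```python
def add_relevant(row):
    """Return a copy of a CSV row with a derived 'relevant' count."""
    updated = dict(row)
    # Assumption: relevant = total_tested - not-relevant, as in the example.
    updated["relevant"] = int(row["total_tested"]) - int(row["not-relevant"])
    return updated


rows = [
    {"indicator_num": "3", "name": "Organisation strategy is present",
     "total_tested": "1", "not-relevant": "0"},
    {"indicator_num": "13", "name": "Commitment is present",
     "total_tested": "1", "not-relevant": "1"},
]
with_relevant = [add_relevant(row) for row in rows]
```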

Test all files on the IATI registry

Currently to test a file on the registry, you have to go there and find a URL. It would be better to use the registry CKAN API to pull in org and package lists.

NB this feature existed in the very first iteration and I removed it (because it wasn’t the intended use of the DQT.)
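
A sketch of how package lists might be pulled from a CKAN action API. `package_search` with `fq`, `rows` and `start` is a standard CKAN action, but the base URL below is an assumption about where the registry's API lives, and the helper names and canned response are invented so that the example runs without network access:

```python
import json
from urllib.parse import urlencode

# Assumed base URL for the IATI registry's CKAN action API.
REGISTRY_API = "https://iatiregistry.org/api/3/action"


def package_search_url(publisher, rows=100, start=0):
    """Build a CKAN package_search URL for one publisher's datasets."""
    query = urlencode(
        {"fq": f"organization:{publisher}", "rows": rows, "start": start}
    )
    return f"{REGISTRY_API}/package_search?{query}"


def package_names(response_text):
    """Extract dataset names from a CKAN package_search JSON response."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("CKAN request was not successful")
    return [package["name"] for package in payload["result"]["results"]]


# Canned response in the CKAN shape, standing in for a real HTTP call.
canned = json.dumps(
    {"success": True, "result": {"count": 1, "results": [{"name": "xx-activities"}]}}
)
names = package_names(canned)
url = package_search_url("xx")
```

In real use the URL would be fetched with an HTTP client and paged through by increasing `start` until all results have been read.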

Mark IATI ruleset tests as experimental

The IATI ruleset tests are unrelated to the Aid Transparency Index, but their presence in the tester has been a source of some confusion. As such, the feature should be clearly marked as experimental.

Present each indicator with its ID from the methodology, and in order

The DQT presents its results as such (example: org planning and commitments):

[Screenshot: Data-Quality-Tester]

The specific tests made here have no reference back to the Methodology document - eg:

[Screenshot from 2021-07-20 16-38-38]

Furthermore, the indicators in the DQT are not even ordered in the same way as the Methodology.

This makes it more work to match each metric to the methodology in order to understand data issues.

What would fix this?

  • For each indicator in the DQT, number it according to its corresponding indicator in the methodology
  • In the DQT results page, order the indicators by this numbering

ImportError: cannot import name 'bdd_tester'

$ flask db upgrade
Usage: flask db upgrade [OPTIONS] [REVISION]

Error: While importing "DataQualityTester", an ImportError was raised:

Traceback (most recent call last):
  File "/home/bjwebb/opendataservices/pwyf/data-quality-tester/.ve/lib/python3.6/site-packages/flask/cli.py", line 235, in locate_app
    __import__(module_name)
  File "/home/bjwebb/opendataservices/pwyf/data-quality-tester/DataQualityTester/__init__.py", line 41, in <module>
    from DataQualityTester import commands, routes, models, views, lib
  File "/home/bjwebb/opendataservices/pwyf/data-quality-tester/DataQualityTester/commands.py", line 9, in <module>
    from DataQualityTester.models import SuppliedData
  File "/home/bjwebb/opendataservices/pwyf/data-quality-tester/DataQualityTester/models.py", line 13, in <module>
    from DataQualityTester.tasks import download_task
  File "/home/bjwebb/opendataservices/pwyf/data-quality-tester/DataQualityTester/tasks.py", line 7, in <module>
    from bdd_tester import bdd_tester
ImportError: cannot import name 'bdd_tester'

I'm assuming this is due to changes in the latest versions of the bdd_tester.

csv download not in order of methodology

@michaelwood @Lathrisk huge thanks for the labelling and ordering of the test results in the DQT #52, and then the csv export of the results.

Certainly, the DQT on-screen results are in order - with Indicator 24 ( #55 ) and 30 ( #56 ) in need of moving into the relevant section

In terms of the csv export, the tests are not presented sequentially - eg:

| indicator_num | name |
| --- | --- |
|  | Organisational planning and commitments |
| 4 | Annual report is present |
| 5 | Allocation policy is present |
| 6 | Procurement policy is present |
| 8 | Audit is present |
| 3 | Organisation strategy is present |
| 7 | Strategy (country/sector) or Memorandum of Understanding |
|  |  |
|  | Finance and budgets |
| 10 | Disaggregated budget |
| 14 | Disbursements or expenditures are present |
| 11 | Budget available forward annually |
| 11 | Budget available forward quarterly |
| 12 | Budget document is present |
| 9 | Organisation budget available one year forward |
| 9 | Organisation budget available two years forward |
| 9 | Organisation budget available three years forward |
| 15 | Capital spend is present |
| 15 | Publish detailed CRS purpose codes in the sector field |
| 13 | Commitment is present |
|  |  |
|  | Project attributes |
| 30 | Implementing organisation |
| 23 | Location (sub-national) |
| 23 | Location (sub-national) coordinates or point |
| 22 | Sector is present |
| 22 | Sector uses DAC CRS 5 digit purpose codes |
| 30 | Participating Orgs |
| 25 | IATI Identifier is present |
| 25 | IATI Identifier starts with reporting org ref |
| 21 | Contact info is present |
| 19 | Actual start date is present |
| 19 | Actual end date is present |
| 17 | Description is present |
| 17 | Description has at least 80 characters |
| 20 | Current status is present |
| 20 | Current status is valid |
| 18 | Planned start date is present |
| 18 | Planned end date is present |
| 16 | Title is present |
| 16 | Title has at least 10 characters |
|  |  |
|  | Joining-up development data |
| 26 | Flow type |
| 26 | Flow type uses standard codelist |
| 24 | Conditions data |
| 24 | Conditions document |
| 29 | Tied aid status |
| 29 | Tied aid status uses standard codelist |
| 28 | Default finance type |
| 28 | Finance type uses standard codelist |
| 27 | Aid type is present |
| 27 | Aid type is valid |
| 31 | Tender is present |
| 31 | Contract is present |
|  |  |
|  | Performance |
| 34 | Pre- and/or post-project impact appraisal documents |
| 35 | Project performance and evaluation document |
| 33 | Objectives of activity document |
| 36 | Results data |
| 36 | Results document |

Maybe this is because the csv export was implemented before #52 was completed? Would you be able to refactor the csv export so that it is in sync with the on-screen ordering?
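
For illustration, ordering the export by indicator number can be a single stable sort, assuming each row carries an integer `indicator_num`. The rows below are a subset of the listing above and the variable names are invented:

```python
# (indicator_num, name) pairs in the order the csv currently emits them.
rows = [
    (4, "Annual report is present"),
    (8, "Audit is present"),
    (3, "Organisation strategy is present"),
    (9, "Organisation budget available one year forward"),
    (9, "Organisation budget available two years forward"),
]

# sorted() is stable, so tests sharing an indicator keep their relative order.
ordered = sorted(rows, key=lambda row: row[0])
```

Because the sort is stable, the "one year forward" test still precedes the "two years forward" test under indicator 9.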

Sector codelist error

Publish What You Fund have received the following email:

Hi,

We are testing our activity file against the PWYF data quality tester and we are getting an error on sector code (see screenshot below). The case seems to be unique for sector code 43060, which is present in the IATI sector code list. Hope you can help us with this issue.

Thank you.

[Screenshot: errorMsg]

Regards,

-Lulu

Consultant

SPOP- Asian Development Bank

Lulu references http://reference.iatistandard.org/203/codelists/Sector/ which is the replicated DAC 5 digit sector codelist and includes the code '43060': "Disaster Risk Reduction".

Restyle away the bootstrap

It all looks very default bootstrap at the moment. It would be good to make it look less like a hackday project and more like an actual project.
