dataproofer's Introduction

Dataproofer

A proofreader for your data

Every day, more and more data is created. Journalists, analysts, and data visualizers turn that data into stories and insights.

But before you can make use of any data, you need to know if it’s reliable. Is it weird? Is it clean? Can I use it to write or make a viz?

This used to be a long manual process, using valuable time and introducing the possibility of human error. People can’t always spot every mistake every time, no matter how hard they try.

Dataproofer is built to automate this process of checking a dataset for errors or potential mistakes.

Getting Started (Desktop)

Download a .zip of the latest release from the Dataproofer releases page.

Drag the app into your applications folder.

Select your dataset, which can be either a CSV on your computer, or a Google Sheet that you’ve published to the web.

Once you select your dataset, you can choose which suites and tests run by turning them on or off.

Proof your data, get your results, and feel confident about your dataset.

Getting Started (Command Line)

npm install -g dataproofer

Read the documentation

dataproofer --help
>  Usage: dataproofer <file>

  A proofreader for your data

  Options:

    -h, --help          output usage information
    -V, --version       output the version number
    -o, --out <file>    file to output results. default stdout
    -c, --core          run tests from the core suite
    -i, --info          run tests from the info suite
    -a, --stats         run tests from the statistical suite
    -g, --geo           run tests from the geographic suite
    -t, --tests <list>  comma-separated list to use
    -j, --json          output JSON of test results
    -J, --json-pretty   output an indented JSON of test results
    -S, --summary       output overall test results, excluding pass/fail results
    -v, --verbose       include descriptions about each column
    -x, --exclude       exclude tests that passed

  Examples:

    $ dataproofer my_data.csv

Run a test

dataproofer data.csv

Save the results

dataproofer --json data.csv --out data.json

To learn how to run specific test suites or tests and output longer or shorter summaries, use the --help flag.

Found a bug? Let us know.

Test Suites

A set of tests that infer descriptive information based on the contents of a table's cells.

  • Check for numeric values in columns
  • Check for strings in columns

A set of tests related to common problems and data checks — namely, making sure data has not been truncated by looking for specific cut-off indicators.

  • Check for duplicate rows
  • Check for empty columns (no values)
  • Check for special, non-typical Latin characters/letters in strings
  • Check for big integer cut-offs as defined by MySQL and PostgreSQL, common database programs
  • Check for integer cut-offs as defined by MySQL and PostgreSQL, common database programs
  • Check for small integer cut-offs as defined by MySQL and PostgreSQL, common database programs
  • Check for whether there are exactly 65k rows — an indication there may be missing rows lost when the data was exported from a database
  • Check for strings that are exactly 255 characters — an indication there may be missing data lost when the data was exported from MySQL
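The cut-off checks in the list above can be sketched as follows. This is a minimal illustration and not the core suite's own code: the function names are hypothetical, the limits are the standard maximum values of the MySQL and PostgreSQL integer types, and reading "65k rows" as 65,535 or 65,536 is an assumption.

```javascript
// Maximum values of common MySQL/PostgreSQL integer column types.
// Stored as strings because the BIGINT limits exceed Number.MAX_SAFE_INTEGER.
const INTEGER_CUTOFFS = new Map([
  ["32767", "signed SMALLINT"],
  ["65535", "unsigned SMALLINT (MySQL)"],
  ["2147483647", "signed INTEGER"],
  ["4294967295", "unsigned INTEGER (MySQL)"],
  ["9223372036854775807", "signed BIGINT"],
  ["18446744073709551615", "unsigned BIGINT (MySQL)"],
]);

// Hypothetical helper: returns a reason if the cell looks truncated, else null.
function truncationWarning(cell) {
  const text = String(cell).trim();
  if (INTEGER_CUTOFFS.has(text)) {
    return "matches the maximum " + INTEGER_CUTOFFS.get(text) + " value";
  }
  if (text.length === 255) {
    return "is exactly 255 characters long";
  }
  return null;
}

// Hypothetical helper: the default counts are an assumed reading of "65k rows".
function hasSuspiciousRowCount(rows, suspiciousCounts = [65535, 65536]) {
  return suspiciousCounts.includes(rows.length);
}
```

Pass the very large values in as strings, as they arrive from a CSV, so that no precision is lost before the comparison.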

A set of tests related to common geographic data problems.

  • Check for invalid latitude and longitude values (values outside the range of -180° to 180°)
  • Check for void latitude and longitude values (values at 0°, 0°)
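As a rough sketch of the two geographic checks, assuming coordinates arrive as strings or numbers: the function names are hypothetical, and the default limit follows the range quoted above. The optional stricter latitude limit of 90 is an assumption and not something the suite is documented to do.

```javascript
// Converts a cell to a number, treating empty or missing cells as NaN
// (Number("") would otherwise be 0 and look like a valid coordinate).
function coordinateToNumber(value) {
  if (value === null || value === undefined) return NaN;
  const text = String(value).trim();
  return text === "" ? NaN : Number(text);
}

// Invalid: not a number, or outside -limit..limit.
// The default of 180 follows the range in the list above; pass 90 for a
// stricter latitude check (an assumption, not documented behavior).
function isInvalidCoordinate(value, limit = 180) {
  const n = coordinateToNumber(value);
  return !Number.isFinite(n) || Math.abs(n) > limit;
}

// Void: both values are exactly zero.
function isVoidCoordinate(lat, lon) {
  return coordinateToNumber(lat) === 0 && coordinateToNumber(lon) === 0;
}
```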

A set of tests related to common statistical methods used to detect outlying data.

  • Check for outliers within a column relative to the column's median
  • Check for outliers within a column relative to the column's mean
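To illustrate the difference between the two outlier checks, here is a minimal sketch. It is not the statistical suite's implementation: the function names, the threshold of 3, and the 1.4826 scaling constant (which makes the median absolute deviation comparable to a standard deviation for normally distributed data) are assumptions.

```javascript
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

// Indexes of values more than `threshold` scaled MADs away from the median.
function medianOutliers(values, threshold = 3) {
  const center = median(values);
  const mad = median(values.map((v) => Math.abs(v - center)));
  if (mad === 0) return []; // no spread to measure against
  const scaledMad = 1.4826 * mad;
  return values.flatMap((v, i) => (Math.abs(v - center) / scaledMad > threshold ? [i] : []));
}

// Indexes of values more than `threshold` standard deviations from the mean.
function meanOutliers(values, threshold = 3) {
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const variance = values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
  const sd = Math.sqrt(variance);
  if (sd === 0) return [];
  return values.flatMap((v, i) => (Math.abs(v - mean) / sd > threshold ? [i] : []));
}

const salaries = [10, 12, 11, 13, 12, 11, 10, 500];
console.log(medianOutliers(salaries)); // [ 7 ]
console.log(meanOutliers(salaries)); // []
```

In this small sample the mean-based check misses the value 500 because the outlier itself inflates the standard deviation. That is one reason to run both checks.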

Development

git clone https://github.com/dataproofer/Dataproofer.git
cd Dataproofer
yarn

How You Can Help

Write a test

See our test to-do list and leave a comment

Add a feature

See our features list and leave a comment

Short on time?

See our smaller issues and leave a comment

Got more time?

See our medium-sized issues and leave a comment

Plenty of time?

See our larger issues and leave a comment

Creating a new test

  • Make a copy of the basic test template
  • Read the comments and follow along with links
  • Let us know if you're running into trouble: dataproofer [at] dataproofer.org
  • Require that test in a suite's index.js
  • Add that test to the exports in index.js

Tests are made up of a few parts. Here's a brief overview. For a more in-depth look, dive into the documentation.

.name()

This is the name of your test. It shows up in the test-selection screen as well as on the results page.

.description()

This is a text-only description of what the test does, and what it is meant to check. Imagine you are explaining it to a remarkably intelligent 5-year-old.

.methodology()

This is where the code your test executes lives. Pass it a function that takes in rows and columnHeads.

rows is an array of objects from the data. The object uses column headers as the key, and the row’s value as the value.

So if your data looks like this:

President            Year
George Washington    1789
John Adams           1797
Thomas Jefferson     1801

Then the first object in your array of rows will look like this:

{ president: 'George Washington', year: '1789' }

and so on.

Generally, to run a test, you are going to want to loop over each row and do some operations on it — counting cells and using conditionals to detect unwanted values.
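As a sketch of that pattern, the function below takes rows and columnHeads, loops over every row, and counts empty cells per column. The function name and the shape of the returned object are hypothetical; see the test template for what a real methodology function is expected to return.

```javascript
// rows: array of objects keyed by column header
// columnHeads: array of column header names
// Returns a hypothetical summary object: { header: numberOfEmptyCells }
function countEmptyCells(rows, columnHeads) {
  const emptyCounts = {};
  columnHeads.forEach((header) => {
    emptyCounts[header] = 0;
  });
  rows.forEach((row) => {
    columnHeads.forEach((header) => {
      const cell = row[header];
      if (cell === undefined || cell === null || String(cell).trim() === "") {
        emptyCounts[header] += 1;
      }
    });
  });
  return emptyCounts;
}

const presidentRows = [
  { president: "George Washington", year: "1789" },
  { president: "John Adams", year: "" },
  { president: "", year: "1801" },
];
console.log(countEmptyCells(presidentRows, ["president", "year"]));
// { president: 1, year: 1 }
```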

Helper Scripts

Helper scripts help you test and display the results of Dataproofer tests. These are a small set of functions we've found ourselves reusing.

  • isEmpty: detect if a cell is empty
  • isNumeric: detect if a cell contains a number
  • stripNumeric: remove number formatting like "$" or "%"
  • percent: return a number with a "%" sign

For more information, please see the full util documentation
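To give a feel for what these helpers do, here are stand-in versions written only from the one-line descriptions above. They are not the library's code, and the real signatures and edge-case behavior may differ, so treat every detail below as an assumption and check the util documentation before relying on it.

```javascript
// Stand-in: a cell is empty if it is missing or contains only whitespace.
function isEmpty(cell) {
  return cell === null || cell === undefined || String(cell).trim() === "";
}

// Stand-in: remove "$", "%", thousands separators, and whitespace.
function stripNumeric(cell) {
  return String(cell).replace(/[$%,\s]/g, "");
}

// Stand-in: a cell is numeric if what remains after stripping parses as a finite number.
function isNumeric(cell) {
  const text = stripNumeric(cell);
  return text !== "" && Number.isFinite(Number(text));
}

// Stand-in: format a fraction such as 0.5 as "50.0%".
function percent(fraction) {
  return (fraction * 100).toFixed(1) + "%";
}
```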

Troubleshooting a test that won't run

Tests are run inside a try/catch block in src/processing.js. You may wish to temporarily remove the try/catch while iterating on a test. Otherwise, for now we recommend heavy doses of console.log and the Chrome debugger.
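The sketch below shows why a broken test can appear to do nothing: when each test runs inside try/catch, an exception is caught and logged instead of stopping the run. This is an illustration of the pattern only, not the code in src/processing.js; the runner name and the { name, run } test shape are hypothetical.

```javascript
// Hypothetical runner: each test is an object with a name and a run function.
function runTestsSafely(tests, rows, columnHeads) {
  return tests.map((test) => {
    try {
      return { name: test.name, result: test.run(rows, columnHeads) };
    } catch (error) {
      // The error is swallowed here, so it only shows up in the console.
      console.error("Test failed to run: " + test.name, error);
      return { name: test.name, error: error.message };
    }
  });
}

const sampleResults = runTestsSafely(
  [
    { name: "row count", run: (rows) => rows.length },
    { name: "broken test", run: () => { throw new Error("x is not defined"); } },
  ],
  [{ year: "1789" }],
  ["year"]
);
console.log(sampleResults.map((r) => r.name + ": " + (r.error ? "error" : "ok")));
```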

Iterating on tests

Dataproofer saves a copy of the most recently loaded file in the Application Data directory provided to it by the OS. You can quickly load the file and run the tests by typing loadLastFile() in the console. This saves you several clicks for loading the file and clicking the run button while you are iterating on a test. If you want to temporarily avoid any clicks, you can add the function call to the ipc.on("last-file-selected", ...) event handler in electron/js/controller.js.

Release a new version

We can push releases to GitHub manually for now:

git tag -a 'v0.1.1' -m "first release"
git push && git push --tags

The binary (Dataproofer.app) can be uploaded to the releases page for the tag you pushed, and should be zipped up first (right-click and choose "Compress Dataproofer").

Sources

Thank You

A huge thank you to Vocativ and the Knight Foundation. This project was funded in part by the Knight Foundation's Prototype Fund.

Special Thanks

  • Alex Koppelman (interviewee), Editorial Director @ Vocativ
  • Allee Manning (interviewee), Data Reporter @ Vocativ
  • Allegra Denton (design consulting), Designer @ Vocativ
  • Brian Byrne (interviewee), Data Reporter @ Vocativ
  • Daniel Littlewood (video producer), Special Projects Producer @ Vocativ
  • EJ Fox (project lead), Dataviz Editor @ Vocativ
  • Gerald Rich (lead developer), Interactive Producer @ Vocativ
  • Ian Johnson (lead developer), Dataproofer
  • Jason Das (UX and design), Dataproofer
  • Joe Presser (video producer), Dataproofer
  • Julia Kastner (concept & name consulting), Project Manager @ Vocativ
  • Kelli Vanover (design consulting), Product Manager @ Vocativ
  • Nicu Calcea (developer), Data Projects Editor @ GlobalData Media / New Statesman
  • Markham Nolan (interviewee), Visuals Editor @ Vocativ
  • Rob Di Ieso (design consulting), Art Director @ Vocativ

... and the countless journalists who've encouraged us along the way. Thank you!

dataproofer's Issues

Test: check if a sequential "month" column skips a month

Please read how to create a new test if you're interested in writing this test.

If a column is named month, or some variation, and contains an ordered series of month, check to see if it skips a month or more.

In the future we want to let users specify the length of gap in time they want to check against (i.e. show me if there's a gap longer than a trimester). For now we can test for gaps in immediately sequential months (i.e. January, February, March), and highlight the cells preceding and following the gap.

Methodology considerations:

  • Should sort the column first in case the column is unordered
  • Handle the following formats:
    • Full name, e.g. January
    • Abbreviation, e.g. Jan.
    • Number, e.g. 01 or 1
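One possible way to approach these considerations is sketched below. The function names are hypothetical, and the sketch assumes the column covers a single calendar year, so a December to January wrap-around is not handled.

```javascript
const MONTH_NAMES = [
  "january", "february", "march", "april", "may", "june",
  "july", "august", "september", "october", "november", "december",
];

// Returns a month number from 1 to 12, or null if the cell is not recognized.
// Accepts full names, abbreviations of three or more letters (with or
// without a trailing period), and numbers such as "01" or "1".
function parseMonth(cell) {
  const text = String(cell).trim().toLowerCase().replace(/\.$/, "");
  if (/^\d{1,2}$/.test(text)) {
    const n = Number(text);
    return n >= 1 && n <= 12 ? n : null;
  }
  if (text.length < 3) return null;
  const index = MONTH_NAMES.findIndex((name) => name.startsWith(text));
  return index === -1 ? null : index + 1;
}

// Sorts the unique months first, then reports the months on either side of each gap.
function findMonthGaps(cells) {
  const months = [...new Set(cells.map(parseMonth).filter((m) => m !== null))]
    .sort((a, b) => a - b);
  const gaps = [];
  for (let i = 1; i < months.length; i += 1) {
    if (months[i] - months[i - 1] > 1) {
      gaps.push({ before: months[i - 1], after: months[i] });
    }
  }
  return gaps;
}

console.log(findMonthGaps(["January", "Feb.", "03", "5", "Jun"]));
// [ { before: 3, after: 5 } ]
```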

Check to restrict filetypes permitted

Summary

I managed to import an HTML document into Dataproofer, which it then interpreted as a spreadsheet and spat out irrelevant stats. The dataset was the HTML of the front page of Hacker News (attached). I did this on OS X 10.11.4, but the OS/version shouldn't really matter.

Steps to reproduce

Import the attached HTML document into Dataproofer.

Expected behavior

Maybe tell me that the data type is unsupported or at least don't run stats on what is clearly not a valid dataset.

Relevant logs and/or screenshots

[Screenshot: 2016-03-28, 6:16:53 pm]

Possible fixes?

Be especially suspicious of a dataset with a single column. Maybe pop up a warning asking the user if they are sure that this is what they want to do.

Related issues/pull requests?

None that I am aware of. I doubt that this was introduced from a pull request -- just seems to be an edge case that may not have been handled.

UPDATE: This seems to work on binary data as well. I imported images and zip files successfully into Dataproofer, which should not happen.

Thanks!

Correctly sort columns with dollars or percents

Summary

Sorting on dollars or percents returns an incorrectly sorted column. Mac OS X v10.11.3

Steps to reproduce

Loaded sample dataset sf-police-salaries.csv, ran stats-suite, then tried to sort the salaries by clicking on the column header's name.

Expected behavior

Column should sort from $0 to $999,999 regardless of formatting

Screenshots/Logs

[Screenshot: 2016-03-22, 6:19:19 pm]

Possible fixes?

There are a couple of different format options we can pass to the grid so it knows how to properly sort a column. This will require additional formatting checks in dataproofertest-js/util.js to help automatically specify each column's type.

  • add util function to check for USD. should return true/false
  • add util function to check for percents. should return true/false
  • pass an array of objects detailing format info for each column to the columns option. See new Handsontable in renderer.js
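The two proposed util checks, plus a comparator that sorts by numeric value, might look roughly like this. The function names and regular expressions are assumptions made for illustration, and wiring the result into the grid's columns option is left out.

```javascript
// True for values such as "$0", "$1,000", or "-$12.50".
function isUSD(cell) {
  return /^-?\$\s?\d[\d,]*(\.\d+)?$/.test(String(cell).trim());
}

// True for values such as "5%", "12.5%", or "-3 %".
function isPercent(cell) {
  return /^-?\d+(\.\d+)?\s?%$/.test(String(cell).trim());
}

// Strips "$", "%", commas, and whitespace before converting to a number.
function numericValue(cell) {
  return Number(String(cell).replace(/[$%,\s]/g, ""));
}

// Comparator for Array.prototype.sort that ignores the formatting.
function compareFormatted(a, b) {
  return numericValue(a) - numericValue(b);
}

console.log(["$9", "$10", "$1,000", "$200"].sort(compareFormatted));
// [ '$9', '$10', '$200', '$1,000' ]
```

A plain string sort would put "$1,000" before "$10" and "$9" last, which is the bug described in this issue.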

Create a loading animation/screen/message

Something to inform users a little bit more, and maybe a suggestion to try breaking up their data into smaller samples if it's taking forever. Can't just let it silently fail.

Test: check if a sequential "year" column skips years

Please read how to create a new test if you're interested in writing this test.

If a column is named year, or some variation, and contains an ordered series of years, check to see if it skips a year or more.

In the future we want to let users specify the length of gap in time they want to check against (i.e. show me if there's a gap longer than five years). For now we can test for gaps in immediately sequential years (i.e. 1999, 2000, 2001), and highlight the cells preceding and following the gap.

Methodology considerations:

  • Should sort the column first in case the column is unordered
  • Should parse the string as a numeric and check to make sure the column is more than 90% numeric. See stats-suite/medianAbsoluteDeviationOutliers.js for a working example
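A minimal sketch of these two considerations follows. It is not taken from stats-suite/medianAbsoluteDeviationOutliers.js: the function name is hypothetical, and treating only four-digit values as years is an assumption.

```javascript
// Returns null if the column is not more than 90% four-digit years,
// otherwise the years on either side of each gap.
function findYearGaps(cells, minNumericShare = 0.9) {
  const years = cells
    .map((cell) => String(cell).trim())
    .filter((text) => /^\d{4}$/.test(text))
    .map(Number);
  if (cells.length === 0 || years.length / cells.length <= minNumericShare) {
    return null; // not enough numeric values to treat this as a year column
  }
  // Sort first in case the column is unordered; drop repeated years.
  const sorted = [...new Set(years)].sort((a, b) => a - b);
  const gaps = [];
  for (let i = 1; i < sorted.length; i += 1) {
    if (sorted[i] - sorted[i - 1] > 1) {
      gaps.push({ before: sorted[i - 1], after: sorted[i] });
    }
  }
  return gaps;
}

console.log(findYearGaps(["2004", "1999", "2000", "2001"]));
// [ { before: 2001, after: 2004 } ]
```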

Test: check for dates in 1900 or 1904

Please read how to create a new test if you're interested in writing this test.

For reasons beyond obscure, Excel's default date from which it counts all other dates is January 1st, 1900, unless you're using Excel on a Mac, in which case it's January 1st, 1904. There are a variety of ways in which data in Excel can be entered or calculated incorrectly and end up as one of these two dates. If you spot them in your data, it's probably an issue.

-Quartz Bad Data Guide
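A simple version of this check could look at the year of each date cell. The sketch below is hypothetical and assumes dates are written as YYYY-MM-DD or M/D/YYYY; other formats would need their own patterns.

```javascript
// True if the cell is a date whose year is 1900 or 1904.
// Assumes "YYYY-MM-DD" or "M/D/YYYY" text; anything else returns false.
function hasExcelEpochYear(cell) {
  const text = String(cell).trim();
  const match =
    text.match(/^(\d{4})-\d{2}-\d{2}/) || text.match(/^\d{1,2}\/\d{1,2}\/(\d{4})$/);
  return match !== null && (match[1] === "1900" || match[1] === "1904");
}

// Indexes of the cells worth a second look.
function findExcelEpochDates(cells) {
  return cells.flatMap((cell, i) => (hasExcelEpochYear(cell) ? [i] : []));
}
```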

Automate initialization of repo with sub-modules

This is the documented start-up process in the README

git clone git@github.com:dataproofer/stats-suite.git
git clone git@github.com:dataproofer/geo-suite.git

# register each suite as a globally linked package
cd core-suite
npm link
cd ../stats-suite
npm link
cd ../geo-suite
npm link

# link the suites into the main library
cd ../Dataproofer/src
npm link dataproofer-core-suite
npm link dataproofer-stats-suite
npm link dataproofer-geo-suite

# link the main library into the Electron app
npm link
cd ../electron
npm link dataproofer

Can we automate with a shell script or similar? 

Test: check for invalid lat longs

Check that each latitude and longitude falls between -180° and 180°, and that the pair is not exactly zero, aka Null Island: https://www.google.com/maps/place/0°00'00.0"N+0°00'00.0"E/

Discussion: require a data source

Asking for a source will help us assign overall scores to data sources and discern overall patterns with a particular data publisher. On the flip side, this could get messy. How do we prevent one user from specifying the source as the FBI vs the Federal Bureau of Investigations?

Check if geographic coordinates in rows are in the same format

Sometimes a dataset has some rows in decimal format (lat = 40.689213, lon = -74.044493) and others in sexagesimal format (lat = 40° 41' 21" N, lon = 74° 2' 40" W).

A simple test would be to match each value against a regex for decimal format, or to search for the symbols used in sexagesimal format (' " ° `), and to track whether the format is consistent across the dataset.
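The idea could be sketched like this. The patterns and function names are assumptions: each cell is classified by format, the most common format is taken as the expected one, and every cell that differs is reported.

```javascript
const DECIMAL_PATTERN = /^[+-]?\d+(\.\d+)?$/;
const SEXAGESIMAL_SYMBOLS = /[°'"`]/;

// Classifies one cell as "decimal", "sexagesimal", or "unknown".
function coordinateFormat(cell) {
  const text = String(cell).trim();
  if (DECIMAL_PATTERN.test(text)) return "decimal";
  if (SEXAGESIMAL_SYMBOLS.test(text)) return "sexagesimal";
  return "unknown";
}

// Returns the indexes of cells whose format differs from the most common one.
function inconsistentCoordinateRows(cells) {
  const formats = cells.map(coordinateFormat);
  const counts = {};
  formats.forEach((format) => {
    counts[format] = (counts[format] || 0) + 1;
  });
  const dominant = Object.keys(counts).sort((a, b) => counts[b] - counts[a])[0];
  return formats.flatMap((format, i) => (format !== dominant ? [i] : []));
}

console.log(inconsistentCoordinateRows(["40.689213", "-74.044493", "40° 41' 21\" N", "40.7"]));
// [ 2 ]
```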

Test: Margin of error is too large

Please read how to create a new test if you're interested in writing this test.

Given a user-designated margin of error column and the column it's describing, if the error is larger than 10% of the total amount flag that cell.

For example, according to the 2014 5-year ACS estimates, the number of Asians living in New York is 1,106,989 +/- 3,526 (0.3%). The number of Filipinos is 71,969 +/- 3,088 (4.3%). The number of Samoans is 203 +/- 144 (71%).

The first two numbers are safe to report. The third number should never be used in published reporting.

Quartz Guide to Bad Data
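The 10% rule can be checked with a few lines. The function name is hypothetical, and flagging an estimate of zero is an assumption, since the ratio is undefined in that case.

```javascript
// True if the margin of error is more than maxShare (10% by default) of the estimate.
function marginTooLarge(estimate, margin, maxShare = 0.1) {
  if (estimate === 0) return true; // ratio is undefined; flag for review (assumption)
  return Math.abs(margin) / Math.abs(estimate) > maxShare;
}

console.log(marginTooLarge(1106989, 3526)); // false (about 0.3%)
console.log(marginTooLarge(71969, 3088)); // false (about 4.3%)
console.log(marginTooLarge(203, 144)); // true (about 71%)
```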

Test: Name consistency

Please read how to create a new test if you're interested in writing this test.

Does your data have Middle Eastern or East Asian names in it? Are you sure the surnames are always in the same place? Is it possible anyone in your dataset uses a mononym? These are the sorts of things that data creators habitually get wrong. If you're working with a list of ethnically diverse names—which is any list of names—then you should do at least a cursory review before assuming that joining the first_name and last_name columns will give you something that is appropriate to publish. -Quartz Bad Data Guide

If a column is designated automatically or by the user as a name column, provide a brief description of why missing cells in a name column are potentially bad.

Test: check for missing values

If the column is more than 3/4s complete, are there null values?

Assumptions: Is 3/4s a good cutoff? Should it be higher, or lower?
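A sketch of the rule, with the cutoff exposed as a parameter so that it is easy to experiment with. The function name is hypothetical, and the definition of a missing cell (null, undefined, or whitespace only) is an assumption.

```javascript
// Returns the indexes of missing cells, but only when the column is
// more than minComplete (3/4 by default) filled in.
function missingInMostlyCompleteColumn(cells, minComplete = 0.75) {
  const missing = cells.flatMap((cell, i) =>
    cell === null || cell === undefined || String(cell).trim() === "" ? [i] : []
  );
  const completeShare = (cells.length - missing.length) / cells.length;
  return completeShare > minComplete ? missing : [];
}

console.log(missingInMostlyCompleteColumn(["a", "b", "", "d", "e"])); // [ 2 ]
console.log(missingInMostlyCompleteColumn(["a", "", "", "d"])); // []
```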

Error highlighting for locally saved tests

Summary

We want to show the user where a syntax or runtime error happened if they edited a locally saved test.

Steps to reproduce

Introduce a syntax error into any local test

Expected behavior

We want to highlight the line in the test's code which caused the error. Right now we only see console.error output in the dev console.

Relevant logs and/or screenshots

Possible fixes?

We can examine the stack trace of the error and look for the line number after the eval statement:
ReferenceError: x is not defined at eval (eval at loadTest (file:///Users/enjalot/code/dataproofer/dataproofer/electron/js/controller.js:64:70), <anonymous>:24:15)
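Pulling the position out of a stack trace like the one above can be done with a regular expression. In this sketch the function name is hypothetical, and it assumes the evaluated code is always reported as <anonymous>:line:column, as it is in this trace.

```javascript
// Returns { line, column } for the first "<anonymous>:line:column" entry, or null.
function evalErrorLocation(stack) {
  const match = /<anonymous>:(\d+):(\d+)/.exec(String(stack));
  return match ? { line: Number(match[1]), column: Number(match[2]) } : null;
}

const sampleStack =
  "ReferenceError: x is not defined at eval (eval at loadTest " +
  "(file:///Users/enjalot/code/dataproofer/dataproofer/electron/js/controller.js:64:70), " +
  "<anonymous>:24:15)";
console.log(evalErrorLocation(sampleStack)); // { line: 24, column: 15 }
```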

Create template test

Create a template test that users can clone to start working on their own tests

Test: check if listed US state name actually exists

Please read how to create a new test if you're interested in writing this test.

If column header is state, check to see if the names match the following formats:

  • Full Name, e.g. California
  • AP style, e.g. Calif.
  • USPS abbreviation, e.g. CA

Flag any names that do not match US states, and be aware that some columns may or may not punctuate their abbreviations.

Bonus: create a second test for US states & territories if someone is interested.
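A sketch of the matching logic, using only three states to keep it short. The function names are hypothetical; a real test would need the complete list in all three formats. Periods and letter case are ignored so that "Calif" and "N.Y." both match.

```javascript
// Illustrative subset only: [full name, AP style, USPS abbreviation].
const STATE_FORMS = [
  ["California", "Calif.", "CA"],
  ["New York", "N.Y.", "NY"],
  ["Florida", "Fla.", "FL"],
];

// Lowercases and removes periods so punctuation differences do not matter.
function normalizeStateName(name) {
  return String(name).trim().toLowerCase().replace(/\./g, "");
}

const KNOWN_STATE_FORMS = new Set(STATE_FORMS.flat().map(normalizeStateName));

// Returns the cells that do not match any known form.
function unknownStates(cells) {
  return cells.filter((cell) => !KNOWN_STATE_FORMS.has(normalizeStateName(cell)));
}

console.log(unknownStates(["Calif", "NY", "N.Y.", "Californa", "florida"]));
// [ 'Californa' ]
```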

Zip folder extraction takes too long on OS X

Summary

Title says it all: it took me 50 seconds with the OS X standard Archive Utility and nearly 2 minutes and 30 seconds with The Unarchiver to unzip the package on OS X. Granted, my MacBook is a little old (MacBook Pro (Retina, 13-inch, Late 2013)), but still, I've never had an app's extraction take such a long time.

Steps to reproduce

Tried to unzip the downloaded zip file on OS X. No clear instructions on where to go after unzipping the file either.

Expected behavior

I was expecting one of those drag-and-drop installers that are common with OS X but if not that then at least something that allows me to double-click on the app from the Chrome Downloads bar in order to open it up like a lot of other OS X applications. To be honest, I forgot that I had downloaded the file for a bit before I came back and finally opened it 15 minutes later.

Relevant logs and/or screenshots

N/A

Possible fixes?

Possibly move the LICENSE, LICENSES.chromium.html and version files inside the app package so the end user doesn't see this. Atom and Slack are both great examples of how this can be done.

Related issues/pull requests?

Not that I'm aware of. I'm not trying to be harsh on purpose, but as a journalist this is something I would like such an app to have.

Affix comments next to cells regardless of scroll position of results pane

Summary

Comments scroll up with the screen after scrolling down to view more results.

Steps to reproduce

Loaded sf-police-salaries.csv, scrolled down the results pane on the left, then hovered over grid cells marked with a red tag.

Expected behavior

Comments appeared higher than expected. Comments should stay affixed next to their respective cell.

Relevant logs and/or screenshots

[Screenshot: scrolling-comments-bug]

Redraw CSV minimap canvas element after column sort

Summary

Sorting columns breaks CSV minimap/fingerprint visualization.

Steps to reproduce

Loaded sf-police-salaries.csv, clicked on column header Base.

Expected behavior

Minimap still highlighted cells that had since been re-sorted. Minimaps will need to update when cells are sorted.

Relevant logs and/or screenshots

[Screenshot: sorting-fingerprint-bug]

Possible fixes?

Clear and redraw the canvas after Handsontable fires the afterColumnSort event in renderer.js.
