coronadatascraper's Issues

Dates in time series JSON are not zero-padded

I’m not aware that any date parser really needs dates to be zero-padded to detect the format correctly, but if going with ISO 8601, might as well go all the way. (They’re already zero-padded in the non-JHU CSVs.)
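For reference, a minimal sketch of producing zero-padded ISO 8601 day strings (the helper name is hypothetical):

// Format a Date as a zero-padded ISO 8601 day string, e.g. '2020-03-09'
function isoDate(date) {
  const pad = n => String(n).padStart(2, '0');
  return `${date.getUTCFullYear()}-${pad(date.getUTCMonth() + 1)}-${pad(date.getUTCDate())}`;
}

isoDate(new Date(Date.UTC(2020, 2, 9))); // '2020-03-09' instead of '2020-3-9'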

Fails if output directories are not present

Running macOS Mojave
node version: v13.11.0
yarn version: 1.17.3

When running yarn start, the scraping process halts prematurely and provides the following output:

$ NODE_OPTIONS='--insecure-http-parser' node index.js
(node:74621) ExperimentalWarning: The ESM module loader is experimental.
⏳ Scraping data...
  🚦  Loading data for https://opendata.arcgis.com/datasets/d14de7e28b0448ab82eb36d6f25b1ea1_0.csv from server
(node:74621) Warning: Setting the NODE_TLS_REJECT_UNAUTHORIZED environment variable to '0' makes TLS connections and HTTPS requests insecure by disabling certificate verification.
(node:74621) Warning: Using insecure HTTP parsing
(node:74621) UnhandledPromiseRejectionWarning: Error: ENOENT: no such file or directory, open 'cache/ea58672b23e340fbf8f209b7af40173c.csv'
(node:74621) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 1)
(node:74621) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
✨  Done in 2.30s.

Node appears to throw inside a promise while opening a file in the cache directory, as indicated by this line:

(node:74621) UnhandledPromiseRejectionWarning: Error: ENOENT: no such file or directory, open 'cache/ea58672b23e340fbf8f209b7af40173c.csv'
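A minimal guard should fix this; a sketch follows, assuming the cache directory is named cache as in the error message (output directories would get the same treatment, and 'dist' is an assumed output directory name):

// Hypothetical fix: ensure directories exist before writing to them
import fs from 'fs';

for (const dir of ['cache', 'dist']) {
  fs.mkdirSync(dir, { recursive: true }); // no-op if the directory already exists
}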

Remove duplicates from data

If data for a location is added twice, it should be de-duped.

The priority of dedupe should be determined by a _priority field on the scraper.
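A sketch of what the dedupe might look like, assuming location records carry county/state/country fields plus the optional _priority (all field names other than _priority are assumptions):

// Keep the highest-priority entry for each location key
function dedupeLocations(locations) {
  const byKey = {};
  for (const location of locations) {
    const key = [location.county, location.state, location.country].join(', ');
    const existing = byKey[key];
    if (!existing || (location._priority || 0) > (existing._priority || 0)) {
      byKey[key] = location;
    }
  }
  return Object.values(byKey);
}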

MD scraper broken

Server returns:

{"status":"Failed","error":{"message":"Failed while trying to get item information","code":400,"request":"https://www.arcgis.com/sharing/rest/content/items/ca77764e722c442986ef6514da88411c?f=json","response":"Item does not exist or is inaccessible.","timestamp":"2020-03-16T23:11:04.078Z"},"generating":{}}

@SteveWallace can you look at this?

A few more ISO 3166 names

It seems that Congo (Brazzaville) and Hong Kong have ISO 3166 codes (COG and HKG), at least according to Wikipedia. I think that would just leave Kosovo without a code.

Data change summary

Before writing data, a summary should be printed that details what changed. This will help debug issues when scrapers break, and will help scraper authors verify that their scrapers work.
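A rough sketch of such a summary, assuming each location record has a name and a cases count (both field names are assumptions):

// Print added locations and changed case counts between two runs
function printChangeSummary(oldData, newData) {
  const previous = {};
  for (const location of oldData) previous[location.name] = location;
  for (const location of newData) {
    const old = previous[location.name];
    if (!old) {
      console.log(`+ ${location.name}: new location (${location.cases} cases)`);
    } else if (old.cases !== location.cases) {
      console.log(`~ ${location.name}: cases ${old.cases} -> ${location.cases}`);
    }
  }
}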

Implementing scraper for Spain (Official government data)

Hello there!
I have the source data for Spain and its states, but the data is in a PDF. I've found a way to transform the data from PDF to HTML so that I can then parse it properly.

As far as I've seen, the parser is in JS. Should I create an intermediate project where I do this step and then parse my own data?

WI scraper broken

Looks like the HTML moved around.

Remember: preserve the old scraper by hiding it behind datetime.scrapeDateIsBefore('2020-3-16') and adjust URLs accordingly.
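The pattern would look something like this sketch (URLs and selectors are placeholders; datetime.scrapeDateIsBefore, fetch.page, and parse.number are the project helpers referenced in these issues):

scraper: async function() {
  if (datetime.scrapeDateIsBefore('2020-3-16')) {
    // Old scraper, preserved so historical scrapes keep working
    const $ = await fetch.page('https://www.dhs.wisconsin.gov/old-page.htm'); // placeholder URL
    return { cases: parse.number($('#old-selector').text()) };
  }
  // New scraper against the relocated HTML
  const $ = await fetch.page(this.url);
  return { cases: parse.number($('#new-selector').text()) };
}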

State totals should be calculated if state's website shows all counties

When available, County-level data for the US is provided in addition to State-level data, causing duplication. For example, New Jersey State-level data shows 29 cases (County == NULL), while the County-level data aggregates to 47 cases. I suggest excluding the State-level data when County-level data is available (see the sketch below).

Also, the State descriptors for State-level and County-level data are inconsistent: for example, "New Jersey" vs. "NJ" (see https://github.com/lazd/coronadatascraper/issues/16).

Finally, I suggest using the literal value "(unknown)" for County when data is available only at the State (or Country) level.
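A minimal sketch of the suggested exclusion, assuming each record has country, state, and county fields:

// Drop a state-level total whenever county-level rows exist for the same state
function excludeDuplicatedStateTotals(locations) {
  const statesWithCounties = new Set(
    locations.filter(l => l.county).map(l => `${l.country}/${l.state}`)
  );
  return locations.filter(
    l => l.county || !statesWithCounties.has(`${l.country}/${l.state}`)
  );
}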

San Diego data is wrong

Website data is below. Note this is a matrix: we need to sum across all three columns and, to be biased towards positive, also count the presumptive positives.

                             San Diego County¹   Federal Quarantine²   Non-San Diego County Residents³
Positive (confirmed cases)   0                   2                     0
Presumptive Positive         8                   1                     0
Pending Results              38                  6                     4
Negative                     99                  11                    8
Total Tested                 145                 20                    12

URL
https://www.sandiegocounty.gov/content/sdc/hhsa/programs/phs/community_epidemiology/dc/2019-nCoV/status.html

Scraper code:

{
  county: 'San Diego County',
  state: 'CA',
  country: 'USA',
  url: 'https://www.sandiegocounty.gov/content/sdc/hhsa/programs/phs/community_epidemiology/dc/2019-nCoV/status.html',
  scraper: async function() {
    const $ = await fetch.page(this.url);

    // Note: .next('td') only reads the first of the three columns, which is the bug
    const cases =
      parse.number($('td:contains("Positive (confirmed cases)")').next('td').text()) +
      parse.number($('td:contains("Presumptive Positive")').next('td').text());

    return {
      cases: cases,
      tested: parse.number($('td:contains("Total Tested")').next('td').text())
    };
  }
}
I would fix this, but I don't know JavaScript.
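A possible fix, sketched with the same fetch.page and parse.number helpers: sum every numeric cell in the relevant rows instead of reading only the first column (the cheerio nextAll traversal assumes the label cell starts each row):

scraper: async function() {
  const $ = await fetch.page(this.url);

  // Sum all three columns in the row whose first cell contains the label
  const sumRow = label =>
    $(`td:contains("${label}")`)
      .nextAll('td')
      .toArray()
      .reduce((sum, cell) => sum + parse.number($(cell).text()), 0);

  return {
    cases: sumRow('Positive (confirmed cases)') + sumRow('Presumptive Positive'),
    tested: sumRow('Total Tested')
  };
}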

Manual deployment to npm

Not sure if this is the right way to deliver the data, but releasing to npm should be considered.

Last acquired timestamp as text file

Create a simple text file that provides a timestamp (in UTC) indicating when the last acquisition run happened. It would be part of the metadata record, but it's convenient to have it as a separate file for an easy programmatic check.
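This could be as small as the following sketch (the dist/timestamp.txt path is an assumption):

import fs from 'fs';

// Write the last-acquired timestamp as ISO 8601 UTC, e.g. 2020-03-16T23:11:04.078Z
fs.writeFileSync('dist/timestamp.txt', new Date().toISOString());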

Missing location and population data for GBR counties

Tons of GBR counties can't be matched to their respective GeoJSON:

❌ Could not find location Channel Islands, GBR
❌ Could not find location Bournemouth, Christchurch and Poole, GBR
❌ Could not find location Bristol, City of, GBR
❌ Could not find location Cornwall and Isles of Scilly, GBR
❌ Could not find location Hackney and City of London, GBR
❌ Could not find location Herefordshire, County of, GBR
❌ Could not find location Kingston upon Hull, City of, GBR
❌ Could not find location St. Helens, GBR
❌ Could not find location Wirral, GBR

And we're missing population data for tons too:

❌ Barking and Dagenham, GBR: ?
❌ Barnet, GBR: ?
❌ Barnsley, GBR: ?
❌ Bath and North East Somerset, GBR: ?
❌ Bedford, GBR: ?
❌ Bexley, GBR: ?
❌ Birmingham, GBR: ?
❌ Blackburn with Darwen, GBR: ?
❌ Blackpool, GBR: ?
❌ Bolton, GBR: ?
❌ Bournemouth, Christchurch and Poole, GBR: ?
❌ Bracknell Forest, GBR: ?
❌ Bradford, GBR: ?
❌ Brent, GBR: ?
❌ Brighton and Hove, GBR: ?
❌ Bristol, City of, GBR: ?
❌ Bromley, GBR: ?
❌ Bury, GBR: ?
❌ Calderdale, GBR: ?
❌ Camden, GBR: ?
❌ Central Bedfordshire, GBR: ?
❌ Cheshire East, GBR: ?
❌ Cheshire West and Chester, GBR: ?
❌ Cornwall and Isles of Scilly, GBR: ?
❌ County Durham, GBR: ?
❌ Coventry, GBR: ?
❌ Croydon, GBR: ?
❌ Darlington, GBR: ?
❌ Derby, GBR: ?
❌ Doncaster, GBR: ?
❌ Dudley, GBR: ?
❌ Ealing, GBR: ?
❌ Enfield, GBR: ?
❌ Gateshead, GBR: ?
❌ Greenwich, GBR: ?
❌ Hackney and City of London, GBR: ?
❌ Halton, GBR: ?
❌ Hammersmith and Fulham, GBR: ?
❌ Haringey, GBR: ?
❌ Harrow, GBR: ?
❌ Hartlepool, GBR: ?
❌ Havering, GBR: ?
❌ Herefordshire, County of, GBR: ?
❌ Hillingdon, GBR: ?
❌ Hounslow, GBR: ?
❌ Islington, GBR: ?
❌ Kensington and Chelsea, GBR: ?
❌ Kingston upon Hull, City of, GBR: ?
❌ Kingston upon Thames, GBR: ?
❌ Kirklees, GBR: ?
❌ Knowsley, GBR: ?
❌ Lambeth, GBR: ?
❌ Leeds, GBR: ?
❌ Leicester, GBR: ?
❌ Lewisham, GBR: ?
❌ Liverpool, GBR: ?
❌ Luton, GBR: ?
❌ Manchester, GBR: ?
❌ Medway, GBR: ?
❌ Merton, GBR: ?
❌ Middlesbrough, GBR: ?
❌ Milton Keynes, GBR: ?
❌ Newcastle upon Tyne, GBR: ?
❌ Newham, GBR: ?
❌ North East Lincolnshire, GBR: ?
❌ North Lincolnshire, GBR: ?
❌ North Somerset, GBR: ?
❌ North Tyneside, GBR: ?
❌ Nottingham, GBR: ?
❌ Oldham, GBR: ?
❌ Peterborough, GBR: ?
❌ Plymouth, GBR: ?
❌ Portsmouth, GBR: ?
❌ Reading, GBR: ?
❌ Redbridge, GBR: ?
❌ Redcar and Cleveland, GBR: ?
❌ Richmond upon Thames, GBR: ?
❌ Rochdale, GBR: ?
❌ Rotherham, GBR: ?
❌ Salford, GBR: ?
❌ Sandwell, GBR: ?
❌ Sefton, GBR: ?
❌ Sheffield, GBR: ?
❌ Slough, GBR: ?
❌ Solihull, GBR: ?
❌ South Gloucestershire, GBR: ?
❌ South Tyneside, GBR: ?
❌ Southampton, GBR: ?
❌ Southend-on-Sea, GBR: ?
❌ Southwark, GBR: ?
❌ St. Helens, GBR: ?
❌ Stockport, GBR: ?
❌ Stockton-on-Tees, GBR: ?
❌ Stoke-on-Trent, GBR: ?
❌ Sunderland, GBR: ?
❌ Sutton, GBR: ?
❌ Swindon, GBR: ?
❌ Tameside, GBR: ?
❌ Telford and Wrekin, GBR: ?
❌ Thurrock, GBR: ?
❌ Torbay, GBR: ?
❌ Tower Hamlets, GBR: ?
❌ Trafford, GBR: ?
❌ Wakefield, GBR: ?
❌ Walsall, GBR: ?
❌ Waltham Forest, GBR: ?
❌ Wandsworth, GBR: ?
❌ Warrington, GBR: ?
❌ West Berkshire, GBR: ?
❌ Westminster, GBR: ?
❌ Wigan, GBR: ?
❌ Windsor and Maidenhead, GBR: ?
❌ Wirral, GBR: ?
❌ Wokingham, GBR: ?
❌ Wolverhampton, GBR: ?
❌ York, GBR: ?

I tried figuring it out, but I'm unfamiliar with these counties and the data sources available for them, so I gave up. Can someone please look into this?

Ratings for data sources

Scrapers should get an objective rating based on:

  • ease of machine readability
  • completeness of available data
  • timeliness of updates
  • consistency

We should advertise these ratings and actively point to them when reaching out to government departments to demand better data.
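One way to make the rating objective is a weighted checklist; the sketch below is illustrative only (criteria names and weights are assumptions):

// Score a data source from 0 to 1 against the criteria above
function rateSource(criteria) {
  const weights = {
    machineReadable: 0.4, // CSV/JSON beats HTML, which beats PDF
    complete: 0.3,        // cases, deaths, tested, etc. all present
    timely: 0.2,          // updated at least daily
    consistent: 0.1       // stable URLs and field names
  };
  return Object.keys(weights).reduce(
    (score, key) => score + (criteria[key] ? weights[key] : 0),
    0
  );
}

rateSource({ machineReadable: true, complete: true, timely: false, consistent: true }); // 0.8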

Error in "yarn start" step

Ok,
I'm starting with the scraper and I'm finding myself with problems setting up the environment.
I'm running Ubuntu 18.04
node version: v12.16.1
yarn version: 1.22.4

So far I've run:

yarn install
yarn start

I got an error related to some Node arguments, but after updating Node to the version listed above the error went away.
When I run yarn start now, I get the following error:

yarn run v1.22.4
$ NODE_OPTIONS='--insecure-http-parser' node index.js
internal/modules/cjs/loader.js:1174
      throw new ERR_REQUIRE_ESM(filename, parentPath, packageJsonPath);
      ^

Error [ERR_REQUIRE_ESM]: Must use import to load ES Module: /home/esteve/projects/coronadatascraper/index.js
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:1174:13)
    at Module.load (internal/modules/cjs/loader.js:1002:32)
    at Function.Module._load (internal/modules/cjs/loader.js:901:14)
    at Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js:74:12)
    at internal/main/run_main_module.js:18:47 {
  code: 'ERR_REQUIRE_ESM'
}
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.

I don't have that much experience in JS development; I'm more of a Python guy myself 😞

Originally posted by @sorny92 in https://github.com/lazd/coronadatascraper/issues/14#issuecomment-599113167

Palestine and Kosovo codes

Palestine in ISO 3166 is defined as "PSE" (currently the code uses "PSR").

Also, at least the World Bank seems to define Kosovo as "XKX", so maybe that would be a good code to use there.

Separate scrapers into individual files

It's a bit of a pain for folks to contribute to the scraper, since git gets super confused when trying to merge/rebase.

It'd be good to separate all the scraper definitions into their own files (maybe named by the country code or something) and then bring them all in at runtime, instead of just having one giant scraper file.
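Loading them back in at runtime could look like this sketch (the directory layout and default exports are assumptions):

import fs from 'fs';
import path from 'path';
import { pathToFileURL } from 'url';

// Import every scraper module from a scrapers/ directory
async function loadScrapers(dir = 'scrapers') {
  const scrapers = [];
  for (const file of fs.readdirSync(dir)) {
    if (!file.endsWith('.js')) continue;
    const module = await import(pathToFileURL(path.resolve(dir, file)).href);
    scrapers.push(module.default); // e.g. scrapers/USA-CA.js default-exports one definition
  }
  return scrapers;
}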
