uk_planning_scraper's People

Contributors

adrianshort, jnicho02, keithp, notquiteminerva, pezholio, rgarner, richardjpope, rossjones

uk_planning_scraper's Issues

Northgate date_received is actually date validated

The Northgate search results page column Date Registered is the date on which the application was validated, not the date the authority received the application.

The received date is only available on the Application Dates detail page (Received). That page also shows Validated and Registered, which appear to be the same thing.

Fix this so that the date is returned in the date_validated attribute and date_registered is nil.

Hackney `info_url`s give site error when visited

This is something to do with Hackney/Northgate session cookies. If you've got the cookie (eg from a previous visit) the URLs work. If you don't have the cookie they give an error.

This is causing a problem for adrianshort/kiosks.

Add logger to option hash

Users should be able to pass a Logger object in the options hash, configured however they like. Otherwise, it should default to logging to $stdout at Logger::INFO level. Do it something like this.
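A minimal sketch of that default, assuming the options arrive as a plain hash (the `build_logger` helper name is illustrative, not the gem's API):

```ruby
require 'logger'

# Use a caller-supplied Logger if one is present in the options hash;
# otherwise default to logging to $stdout at INFO level.
# (build_logger is an illustrative helper name, not the gem's API.)
def build_logger(options)
  return options[:logger] if options[:logger].is_a?(Logger)

  logger = Logger.new($stdout)
  logger.level = Logger::INFO
  logger
end
```

A caller could then pass, eg, `Logger.new('scrape.log')` in the options hash to redirect output to a file.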

Fixnum (NameError)

Hello, I am a complete rookie when it comes to Ruby so apologies if this is an obvious fix. After installing all the dependencies I run the Ruby script but keep getting this message:

C:\Users\Griogair\Documents\Planning Portal Database>ruby my_scraper.rb
C:/Users/Griogair/.local/share/gem/ruby/3.2.0/gems/uk_planning_scraper-0.5.0/lib/uk_planning_scraper/authority_scrape_params.rb:15:in `validated_days': uninitialized constant UKPlanningScraper::Authority::Fixnum (NameError)

  check_class(n, Fixnum)
                 ^^^^^^
    from my_scraper.rb:7:in `block in <main>'
    from my_scraper.rb:6:in `each'
    from my_scraper.rb:6:in `<main>'

Would be very grateful if you could please explain why this is happening and make any suggestions on how to resolve the error.

Thank you,

Griogair
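For context: Fixnum and Bignum were unified into Integer in Ruby 2.4, and the old constants were removed entirely in Ruby 3.2, which is why the gem's `check_class(n, Fixnum)` call raises NameError under Ruby 3.2.0. A hedged sketch of the fix (this `check_class` is a guess at the helper's shape, for illustration only):

```ruby
# Checking against Integer works on every supported Ruby; the Fixnum
# constant no longer exists from Ruby 3.2 onwards.
# (This check_class is illustrative, not the gem's exact helper.)
def check_class(param, klass)
  raise TypeError, "Expected #{klass}, got #{param.class}" unless param.is_a?(klass)
end

check_class(30, Integer) # passes where check_class(30, Fixnum) raised NameError
```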

Catch SSL_connect error

Getting: https://www.planningpa.bolton.gov.uk/online-applications-17/search.do?action=advanced
 /app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/http.rb:933:in `connect_nonblock': SSL_connect returned=1 errno=0 state=unknown state: unknown protocol (OpenSSL::SSL::SSLError)
 	from /app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/http.rb:933:in `connect'
 	from /app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/http.rb:863:in `do_start'
 	from /app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/http.rb:858:in `start'
 	from /app/vendor/bundle/ruby/2.3.0/gems/net-http-persistent-3.0.0/lib/net/http/persistent.rb:692:in `start'
 	from /app/vendor/bundle/ruby/2.3.0/gems/net-http-persistent-3.0.0/lib/net/http/persistent.rb:622:in `connection_for'
 	from /app/vendor/bundle/ruby/2.3.0/gems/net-http-persistent-3.0.0/lib/net/http/persistent.rb:927:in `request'
 	from /app/vendor/bundle/ruby/2.3.0/gems/mechanize-2.7.6/lib/mechanize/http/agent.rb:280:in `fetch'
 	from /app/vendor/bundle/ruby/2.3.0/gems/mechanize-2.7.6/lib/mechanize.rb:464:in `get'
 	from /app/vendor/bundle/ruby/2.3.0/bundler/gems/uk_planning_scraper-8d15678700bb/lib/uk_planning_scraper/idox.rb:13:in `scrape_idox'
 	from /app/vendor/bundle/ruby/2.3.0/bundler/gems/uk_planning_scraper-8d15678700bb/lib/uk_planning_scraper/authority.rb:43:in `scrape'
 	from scraper.rb:9:in `block in <main>'
 	from scraper.rb:6:in `each'
 	from scraper.rb:6:in `each_with_index'
 	from scraper.rb:6:in `<main>'
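A minimal rescue sketch so one authority's broken TLS configuration doesn't abort a whole multi-authority run (the helper name is illustrative; the real fix would wrap the scraper's Mechanize calls):

```ruby
require 'openssl'

# Convert a TLS handshake failure into a logged nil so a batch scrape
# loop can move on to the next authority.
# (rescue_ssl_errors is an illustrative name, not part of the gem's API.)
def rescue_ssl_errors(label)
  yield
rescue OpenSSL::SSL::SSLError => e
  warn "#{label}: skipped after SSL error: #{e.message}"
  nil
end
```

Usage would look something like `apps = rescue_ssl_errors('Bolton') { authority.scrape }`.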

Filter output field list

Allow users to specify which fields they do or don't want included in the output.

Add only and except to the params hash in Authority#scrape.

Both of these should be a comma-separated list of field names.

Using the only and except params at the same time should throw an error.

We might need to consider how this would interact with potential options for a deep or shallow scrape, eg an option like documents: true which scrapes the contents of documents pages.

One specific use case is including or excluding personal data eg applicants' and agents' names, email addresses and phone numbers. But it'd be nicer to do that with an option like personal_data: false.
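An illustrative sketch of the filtering applied to a single application hash; the comma-separated parsing follows the description above, while the helper name and field names are assumptions:

```ruby
# Keep or drop fields on one scraped application hash. only/except are
# comma-separated field-name strings per the proposal; passing both
# raises. (filter_fields and the field names are illustrative.)
def filter_fields(app, only: nil, except: nil)
  raise ArgumentError, 'Use only OR except, not both' if only && except

  if only
    keep = only.split(',').map { |f| f.strip.to_sym }
    app.select { |field, _| keep.include?(field) }
  elsif except
    drop = except.split(',').map { |f| f.strip.to_sym }
    app.reject { |field, _| drop.include?(field) }
  else
    app
  end
end
```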

Scrape fails with Bristol Idox variant

/uk_planning_scraper/lib/uk_planning_scraper.rb:54:in `block (2 levels) in search': undefined method `[]' for nil:NilClass (NoMethodError)

It appears not to be finding any elements for li.searchresult.

Northgate: follow 301/302 redirects

Currently the Northgate scraper throws a fatal error when getting a 301/302 redirect on the first scrape. We should follow the redirect rather than fail.

This has been an issue with Camden, Islington and Merton, all of which have moved from HTTP to HTTPS.
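A standard-library sketch of following redirects with a bounded depth; the `fetcher` argument is an illustrative indirection standing in for whatever Mechanize/Net::HTTP call the scraper actually makes:

```ruby
require 'net/http'

# Follow up to `limit` 301/302 responses instead of treating the first
# one as fatal. `fetcher` is an illustrative stand-in for the scraper's
# real HTTP call, taking a URL and returning a Net::HTTPResponse.
def follow_redirects(url, fetcher, limit = 5)
  raise "Too many redirects for #{url}" if limit.zero?

  response = fetcher.call(url)
  if response.is_a?(Net::HTTPRedirection)
    follow_redirects(response['location'], fetcher, limit - 1)
  else
    response
  end
end
```

This would transparently handle the HTTP-to-HTTPS moves seen at Camden, Islington and Merton.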

Catch timeout exception

/app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/protocol.rb:158:in `rbuf_fill': too many connection resets (due to Net::ReadTimeout - Net::ReadTimeout) after 0 requests on 47021302552820, last used 132.706131095 seconds ago (Net::HTTP::Persistent::Error)
 	from /app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/protocol.rb:136:in `readuntil'
 	from /app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/protocol.rb:146:in `readline'
 	from /app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/http/response.rb:40:in `read_status_line'
 	from /app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/http/response.rb:29:in `read_new'
 	from /app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/http.rb:1437:in `block in transport_request'
 	from /app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/http.rb:1434:in `catch'
 	from /app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/http.rb:1434:in `transport_request'
 	from /app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/http.rb:1407:in `request'
 	from /app/vendor/bundle/ruby/2.3.0/gems/net-http-persistent-3.0.0/lib/net/http/persistent.rb:933:in `block in request'
 	from /app/vendor/bundle/ruby/2.3.0/gems/net-http-persistent-3.0.0/lib/net/http/persistent.rb:630:in `connection_for'
 	from /app/vendor/bundle/ruby/2.3.0/gems/net-http-persistent-3.0.0/lib/net/http/persistent.rb:927:in `request'
 	from /app/vendor/bundle/ruby/2.3.0/gems/mechanize-2.7.6/lib/mechanize/http/agent.rb:280:in `fetch'
 	from /app/vendor/bundle/ruby/2.3.0/gems/mechanize-2.7.6/lib/mechanize.rb:464:in `get'
 	from /app/vendor/bundle/ruby/2.3.0/bundler/gems/uk_planning_scraper-23d5825e7331/lib/uk_planning_scraper.rb:98:in `block in search'
 	from /app/vendor/bundle/ruby/2.3.0/bundler/gems/uk_planning_scraper-23d5825e7331/lib/uk_planning_scraper.rb:95:in `each'
 	from /app/vendor/bundle/ruby/2.3.0/bundler/gems/uk_planning_scraper-23d5825e7331/lib/uk_planning_scraper.rb:95:in `each_with_index'
 	from /app/vendor/bundle/ruby/2.3.0/bundler/gems/uk_planning_scraper-23d5825e7331/lib/uk_planning_scraper.rb:95:in `search'
 	from scraper.rb:20:in `block in <main>'
 	from scraper.rb:19:in `each'
 	from scraper.rb:19:in `<main>'
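A hedged sketch of catching and retrying transient timeouts before giving up (names and retry policy are illustrative):

```ruby
require 'net/http'

# Exception classes worth retrying; with the net-http-persistent gem
# loaded, Net::HTTP::Persistent::Error would join this list.
RETRYABLE_ERRORS = [Net::ReadTimeout, Net::OpenTimeout].freeze

# Retry the block a few times on transient timeouts, then re-raise so
# the caller can decide whether to skip the authority.
def with_retries(max_attempts: 3, delay: 0)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue *RETRYABLE_ERRORS => e
    raise if attempts >= max_attempts
    warn "#{e.class} on attempt #{attempts}/#{max_attempts}, retrying"
    sleep delay
    retry
  end
end
```

Usage would look something like `page = with_retries(delay: 10) { agent.get(url) }`.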

Raise exception when Idox search fails due to too many results

Currently this fails silently with an empty array returned from Authority#scrape, eg:

UKPlanningScraper::Authority.named('Newham').scrape({ validated_days: 90 })
 # => []

UKPlanningScraper::Authority.named('Newham').scrape({ validated_days: 9 })
Using Idox scraper.
Getting: https://pa.newham.gov.uk/online-applications/search.do?action=advanced
Found 10 apps on this page.
...

The search form returns this error in HTML:

<div class="messagebox errors">
  <h2>Please check the search criteria:</h2>
  <ul>
    <li>Too many results found. Please enter some more parameters.</li>
  </ul>
</div>
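A sketch of detecting that error box and raising instead of returning []. It matches by substring for brevity; with Mechanize this would more robustly use `page.search('.messagebox.errors li')`. `TooManySearchResults` is a hypothetical error class name:

```ruby
# Raised when Idox rejects a search for returning too many results.
# (TooManySearchResults is a hypothetical class name.)
class TooManySearchResults < StandardError; end

# Inspect the returned search page for the Idox error box and raise
# instead of silently yielding an empty result set.
def check_search_errors!(html)
  return unless html.include?('messagebox errors')

  if html.include?('Too many results')
    raise TooManySearchResults,
          'Too many results found. Please enter some more parameters.'
  end
end
```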

Standardise decision text

Rename decision to decision_raw.

Create a new decision field and parse decision_raw into the LGA standard codes:

  • Approve
  • Refuse
  • Split
  • Withdrawn
  • Prior not required (prior approval not required)
  • Prior granted (prior approval required and granted)
  • Prior refused (prior approval required and refused)
  • Prior refused permission required (prior approval refused because planning permission required)
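A sketch of the parsing step; the regex patterns are guesses at common council wording and deliberately cover only some of the codes above:

```ruby
# Map raw decision text to LGA standard codes. Patterns are checked in
# order; the list is illustrative, not a survey of real council data.
DECISION_CODES = {
  /prior approval not required/i => 'Prior not required',
  /\bwithdrawn\b/i               => 'Withdrawn',
  /\bsplit\b/i                   => 'Split',
  /\b(refus|reject)/i            => 'Refuse',
  /\b(grant|approv|permit)/i     => 'Approve'
}.freeze

def standardise_decision(decision_raw)
  DECISION_CODES.each { |pattern, code| return code if decision_raw =~ pattern }
  nil # unrecognised wording: keep decision_raw, leave decision unset
end
```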

Create Application class to store scraped data

This will allow us to ensure that fields are consistent across the different scrapers, to set up defaults consistently, and to validate Application objects before the data is returned.

We'll need a to_hash instance method to convert the data for returning to the user.
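A minimal sketch of the proposed class; the field list is illustrative, drawn from attributes mentioned elsewhere in these issues:

```ruby
# Holds one scraped planning application with a consistent field set
# across scrapers. (The FIELDS list and valid? rule are illustrative.)
class Application
  FIELDS = %i[authority_name council_reference scraped_at info_url
              date_received date_validated status decision].freeze

  attr_accessor(*FIELDS)

  # Convert to a plain hash for returning to the user.
  def to_hash
    FIELDS.each_with_object({}) { |field, h| h[field] = public_send(field) }
  end

  # Every scraper should at least produce these two attributes.
  def valid?
    !council_reference.nil? && !authority_name.nil?
  end
end
```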

case_officer_code modifies Authority url param

How to reproduce:

auth = UKPlanningScraper::Authority.named("X") # Northgate authorities only
apps1 = auth.case_officer_code("123").decided_days(7).scrape
apps2 = auth.decided_days(7).scrape

This fails because the first call modified Authority's url param, so the apps2 call to scrape requests the case officer page rather than the general search page.

The url param needs to stay constant throughout the lifetime of the object so scrape will always be requesting the right URL.
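One way to sketch the fix: freeze the stored URL and derive each request URL from a copy (the class shape and query parameter name here are illustrative, not the gem's actual code):

```ruby
# Treat @url as immutable and build per-scrape URLs from it, so
# repeated scrape calls always start from the general search page.
# (This class shape and the query parameter name are illustrative.)
class Authority
  attr_reader :url

  def initialize(url)
    @url = url.dup.freeze # freezing surfaces any accidental in-place mutation
  end

  def search_url(case_officer_code: nil)
    return url.dup unless case_officer_code

    url + "?caseofficer=#{case_officer_code}" # new string; @url untouched
  end
end
```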

How should we use version numbers for authorities data changes?

Semantic versioning 2.0.0 says:

Given a version number MAJOR.MINOR.PATCH, increment the:

MAJOR version when you make incompatible API changes,
MINOR version when you add functionality in a backwards-compatible manner, and
PATCH version when you make backwards-compatible bug fixes.

Additional labels for pre-release and build metadata are available as extensions to the MAJOR.MINOR.PATCH format.

This gem includes data in the authorities.csv file. Should changes to this file be regarded as adding functionality, requiring updating the MINOR version number, or just backwards compatible bug fixes, requiring an update to the PATCH version?

Perhaps two scenarios:

  1. A new authority is added. This is regarded as new functionality (you can now do something you couldn't do before: scrape that authority's data) so we update the MINOR version.
  2. An existing authority is modified: either the URL has changed or we've added/removed tags. This is a "bugfix" so we update the PATCH version.

Thoughts? How would this proposal tie in with keeping track of versions usefully in your Gemfile?
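Either way, a pessimistic version constraint in a consuming scraper's Gemfile would pick up both MINOR and PATCH releases automatically while blocking breaking MAJOR changes:

```ruby
# Gemfile: '~> 0.5' allows any release >= 0.5.0 and < 1.0.0, so under
# either proposal new or updated authorities.csv entries arrive
# automatically. (The version number here is illustrative.)
gem 'uk_planning_scraper', '~> 0.5'
```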

Scrape Idox tabs

  • details tab (Further Information)
  • contacts tab
  • dates tab
  • bump minor version number

See also #20.

Standardise status text

Rename status to status_raw.

Parse status_raw into a new status field according to the LGA standard:

  • Live (in the process of being decided)
  • Withdrawn
  • Decided
  • Appeal (in the process of being decided via a non-determination appeal)
  • Called in (in the process of being considered by the Secretary of State)
  • Referred to SoS (in the process of being considered by the Secretary of
    State) # FIXME - appears to be a duplicate of Called in
  • Invalid (requires something to happen to it before it can be decided)
  • Not ours (belongs to another planning authority)
  • Registered (received but not yet been processed and validated)

This is the status at the extracted date.

Enforce scrape parameters

Currently the scrape params are passed in a hash that doesn't get checked, so the user has no way of knowing which of them have actually affected their scrape results.

We should check the params and raise an exception if an invalid parameter is present.
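A sketch of that validation: whitelist the known params and fail fast on anything unrecognised instead of silently ignoring it (the param list and helper name are illustrative):

```ruby
# Known scrape parameters. This list is illustrative, not the gem's
# definitive set.
VALID_SCRAPE_PARAMS = %i[validated_days decided_days received_days
                         keywords applicant_name].freeze

# Raise on any unrecognised key, eg a misspelled parameter name.
def validate_scrape_params!(params)
  unknown = params.keys - VALID_SCRAPE_PARAMS
  unless unknown.empty?
    raise ArgumentError, "Unknown scrape parameter(s): #{unknown.join(', ')}"
  end
  params
end
```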
