uk_planning_scraper's People

Contributors

adrianshort, jnicho02, keithp, notquiteminerva, pezholio, rgarner, richardjpope, rossjones

uk_planning_scraper's Issues

Northgate date_received is actually date validated

The Northgate search results page column Date Registered is the date on which the application was validated, not the date the authority received the application.

The received date is only available on the Application Dates detail page (Received). That page also shows Validated and Registered, which appear to be the same thing.

Fix this so that the date is returned in the date_validated attribute and date_registered is nil.

Hackney `info_url`s give site error when visited

This is something to do with Hackney/Northgate session cookies. If you've got the cookie (eg from a previous visit) the URLs work. If you don't have the cookie they give an error.

This is causing a problem for adrianshort/kiosks.

Add logger to option hash

Users should be able to pass a Logger object in the options hash, configured however they like. Otherwise, it should default to logging to $stdout at Logger::INFO level. Do it something like this.
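A minimal sketch of that default, assuming the options arrive as a plain hash (the `build_logger` helper name is illustrative, not the gem's API):

```ruby
require 'logger'

# Use a caller-supplied Logger if one is present in the options hash;
# otherwise default to logging to $stdout at INFO level.
# (build_logger is an illustrative helper name, not the gem's API.)
def build_logger(options)
  return options[:logger] if options[:logger].is_a?(Logger)

  logger = Logger.new($stdout)
  logger.level = Logger::INFO
  logger
end
```

A caller could then pass, eg, `Logger.new('scrape.log')` in the options hash to redirect output to a file.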

Fixnum (NameError)

Hello, I am a complete rookie when it comes to Ruby so apologies if this is an obvious fix. After installing all the dependencies I run the Ruby script but keep getting this message:

C:\Users\Griogair\Documents\Planning Portal Database>ruby my_scraper.rb
C:/Users/Griogair/.local/share/gem/ruby/3.2.0/gems/uk_planning_scraper-0.5.0/lib/uk_planning_scraper/authority_scrape_params.rb:15:in `validated_days': uninitialized constant UKPlanningScraper::Authority::Fixnum (NameError)

  check_class(n, Fixnum)
                 ^^^^^^
    from my_scraper.rb:7:in `block in <main>'
    from my_scraper.rb:6:in `each'
    from my_scraper.rb:6:in `<main>'

Would be very grateful if you could please explain why this is happening and make any suggestions on how to resolve the error.

Thank you,

Griogair
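For context: Fixnum and Bignum were unified into Integer in Ruby 2.4, and the old constants were removed entirely in Ruby 3.2, which is why the gem's `check_class(n, Fixnum)` call raises NameError under Ruby 3.2.0. A hedged sketch of the fix (this `check_class` is a guess at the helper's shape, for illustration only):

```ruby
# Checking against Integer works on every supported Ruby; the Fixnum
# constant no longer exists from Ruby 3.2 onwards.
# (This check_class is illustrative, not the gem's exact helper.)
def check_class(param, klass)
  raise TypeError, "Expected #{klass}, got #{param.class}" unless param.is_a?(klass)
end

check_class(30, Integer) # passes where check_class(30, Fixnum) raised NameError
```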

Catch SSL_connect error

Getting: https://www.planningpa.bolton.gov.uk/online-applications-17/search.do?action=advanced
 /app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/http.rb:933:in `connect_nonblock': SSL_connect returned=1 errno=0 state=unknown state: unknown protocol (OpenSSL::SSL::SSLError)
 	from /app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/http.rb:933:in `connect'
 	from /app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/http.rb:863:in `do_start'
 	from /app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/http.rb:858:in `start'
 	from /app/vendor/bundle/ruby/2.3.0/gems/net-http-persistent-3.0.0/lib/net/http/persistent.rb:692:in `start'
 	from /app/vendor/bundle/ruby/2.3.0/gems/net-http-persistent-3.0.0/lib/net/http/persistent.rb:622:in `connection_for'
 	from /app/vendor/bundle/ruby/2.3.0/gems/net-http-persistent-3.0.0/lib/net/http/persistent.rb:927:in `request'
 	from /app/vendor/bundle/ruby/2.3.0/gems/mechanize-2.7.6/lib/mechanize/http/agent.rb:280:in `fetch'
 	from /app/vendor/bundle/ruby/2.3.0/gems/mechanize-2.7.6/lib/mechanize.rb:464:in `get'
 	from /app/vendor/bundle/ruby/2.3.0/bundler/gems/uk_planning_scraper-8d15678700bb/lib/uk_planning_scraper/idox.rb:13:in `scrape_idox'
 	from /app/vendor/bundle/ruby/2.3.0/bundler/gems/uk_planning_scraper-8d15678700bb/lib/uk_planning_scraper/authority.rb:43:in `scrape'
 	from scraper.rb:9:in `block in <main>'
 	from scraper.rb:6:in `each'
 	from scraper.rb:6:in `each_with_index'
 	from scraper.rb:6:in `<main>'
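A minimal rescue sketch so one authority's broken TLS configuration doesn't abort a whole multi-authority run (the helper name is illustrative; the real fix would wrap the scraper's Mechanize calls):

```ruby
require 'openssl'

# Convert a TLS handshake failure into a logged nil so a batch scrape
# loop can move on to the next authority.
# (rescue_ssl_errors is an illustrative name, not part of the gem's API.)
def rescue_ssl_errors(label)
  yield
rescue OpenSSL::SSL::SSLError => e
  warn "#{label}: skipped after SSL error: #{e.message}"
  nil
end
```

Usage would look something like `apps = rescue_ssl_errors('Bolton') { authority.scrape }`.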

Filter output field list

Allow users to specify which fields they do or don't want included in the output.

Add only and except to the params hash in Authority#scrape.

Both of these should be a comma-separated list of field names.

Using the only and except params at the same time should throw an error.

We might need to consider how this would interact with potential options for a deep or shallow scrape, eg an option like documents: true which scrapes the contents of documents pages.

One specific use case is including or excluding personal data eg applicants' and agents' names, email addresses and phone numbers. But it'd be nicer to do that with an option like personal_data: false.
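An illustrative sketch of the filtering applied to a single application hash; the comma-separated parsing follows the description above, while the helper name and field names are assumptions:

```ruby
# Keep or drop fields on one scraped application hash. only/except are
# comma-separated field-name strings per the proposal; passing both
# raises. (filter_fields and the field names are illustrative.)
def filter_fields(app, only: nil, except: nil)
  raise ArgumentError, 'Use only OR except, not both' if only && except

  if only
    keep = only.split(',').map { |f| f.strip.to_sym }
    app.select { |field, _| keep.include?(field) }
  elsif except
    drop = except.split(',').map { |f| f.strip.to_sym }
    app.reject { |field, _| drop.include?(field) }
  else
    app
  end
end
```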

Scrape fails with Bristol Idox variant

/uk_planning_scraper/lib/uk_planning_scraper.rb:54:in `block (2 levels) in search': undefined method `[]' for nil:NilClass (NoMethodError)

It appears not to be finding any elements for li.searchresult.

Northgate: follow 301/302 redirects

Currently the Northgate scraper throws a fatal error when getting a 301/302 redirect on the first scrape. We should follow the redirect rather than fail.

This has been an issue with Camden, Islington and Merton, all of which have moved from HTTP to HTTPS.
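A standard-library sketch of following redirects with a bounded depth; the `fetcher` argument is an illustrative indirection standing in for whatever Mechanize/Net::HTTP call the scraper actually makes:

```ruby
require 'net/http'

# Follow up to `limit` 301/302 responses instead of treating the first
# one as fatal. `fetcher` is an illustrative stand-in for the scraper's
# real HTTP call, taking a URL and returning a Net::HTTPResponse.
def follow_redirects(url, fetcher, limit = 5)
  raise "Too many redirects for #{url}" if limit.zero?

  response = fetcher.call(url)
  if response.is_a?(Net::HTTPRedirection)
    follow_redirects(response['location'], fetcher, limit - 1)
  else
    response
  end
end
```

This would transparently handle the HTTP-to-HTTPS moves seen at Camden, Islington and Merton.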

Catch timeout exception

/app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/protocol.rb:158:in `rbuf_fill': too many connection resets (due to Net::ReadTimeout - Net::ReadTimeout) after 0 requests on 47021302552820, last used 132.706131095 seconds ago (Net::HTTP::Persistent::Error)
 	from /app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/protocol.rb:136:in `readuntil'
 	from /app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/protocol.rb:146:in `readline'
 	from /app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/http/response.rb:40:in `read_status_line'
 	from /app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/http/response.rb:29:in `read_new'
 	from /app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/http.rb:1437:in `block in transport_request'
 	from /app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/http.rb:1434:in `catch'
 	from /app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/http.rb:1434:in `transport_request'
 	from /app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/http.rb:1407:in `request'
 	from /app/vendor/bundle/ruby/2.3.0/gems/net-http-persistent-3.0.0/lib/net/http/persistent.rb:933:in `block in request'
 	from /app/vendor/bundle/ruby/2.3.0/gems/net-http-persistent-3.0.0/lib/net/http/persistent.rb:630:in `connection_for'
 	from /app/vendor/bundle/ruby/2.3.0/gems/net-http-persistent-3.0.0/lib/net/http/persistent.rb:927:in `request'
 	from /app/vendor/bundle/ruby/2.3.0/gems/mechanize-2.7.6/lib/mechanize/http/agent.rb:280:in `fetch'
 	from /app/vendor/bundle/ruby/2.3.0/gems/mechanize-2.7.6/lib/mechanize.rb:464:in `get'
 	from /app/vendor/bundle/ruby/2.3.0/bundler/gems/uk_planning_scraper-23d5825e7331/lib/uk_planning_scraper.rb:98:in `block in search'
 	from /app/vendor/bundle/ruby/2.3.0/bundler/gems/uk_planning_scraper-23d5825e7331/lib/uk_planning_scraper.rb:95:in `each'
 	from /app/vendor/bundle/ruby/2.3.0/bundler/gems/uk_planning_scraper-23d5825e7331/lib/uk_planning_scraper.rb:95:in `each_with_index'
 	from /app/vendor/bundle/ruby/2.3.0/bundler/gems/uk_planning_scraper-23d5825e7331/lib/uk_planning_scraper.rb:95:in `search'
 	from scraper.rb:20:in `block in <main>'
 	from scraper.rb:19:in `each'
 	from scraper.rb:19:in `<main>'
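A hedged sketch of catching and retrying transient timeouts before giving up (names and retry policy are illustrative):

```ruby
require 'net/http'

# Exception classes worth retrying; with the net-http-persistent gem
# loaded, Net::HTTP::Persistent::Error would join this list.
RETRYABLE_ERRORS = [Net::ReadTimeout, Net::OpenTimeout].freeze

# Retry the block a few times on transient timeouts, then re-raise so
# the caller can decide whether to skip the authority.
def with_retries(max_attempts: 3, delay: 0)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue *RETRYABLE_ERRORS => e
    raise if attempts >= max_attempts
    warn "#{e.class} on attempt #{attempts}/#{max_attempts}, retrying"
    sleep delay
    retry
  end
end
```

Usage would look something like `page = with_retries(delay: 10) { agent.get(url) }`.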

Raise exception when Idox search fails due to too many results

Currently this fails silently with an empty array returned from Authority#scrape, eg:

UKPlanningScraper::Authority.named('Newham').scrape({ validated_days: 90 })
 # => []

UKPlanningScraper::Authority.named('Newham').scrape({ validated_days: 9 })
Using Idox scraper.
Getting: https://pa.newham.gov.uk/online-applications/search.do?action=advanced
Found 10 apps on this page.
...

The search form returns this error in HTML:

<div class="messagebox errors">
  <h2>Please check the search criteria:</h2>
  <ul>
    <li>Too many results found. Please enter some more parameters.</li>
  </ul>
</div>
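A sketch of detecting that error box and raising instead of returning []. It matches by substring for brevity; with Mechanize this would more robustly use `page.search('.messagebox.errors li')`. `TooManySearchResults` is a hypothetical error class name:

```ruby
# Raised when Idox rejects a search for returning too many results.
# (TooManySearchResults is a hypothetical class name.)
class TooManySearchResults < StandardError; end

# Inspect the returned search page for the Idox error box and raise
# instead of silently yielding an empty result set.
def check_search_errors!(html)
  return unless html.include?('messagebox errors')

  if html.include?('Too many results')
    raise TooManySearchResults,
          'Too many results found. Please enter some more parameters.'
  end
end
```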

Standardise decision text

Rename decision to decision_raw.

Create a new decision field and parse decision_raw into the LGA standard codes:

  • Approve
  • Refuse
  • Split
  • Withdrawn
  • Prior not required (prior approval not required)
  • Prior granted (prior approval required and granted)
  • Prior refused (prior approval required and refused)
  • Prior refused permission required (prior approval refused because planning permission required)
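A sketch of the parsing step; the regex patterns are guesses at common council wording and deliberately cover only some of the codes above:

```ruby
# Map raw decision text to LGA standard codes. Patterns are checked in
# order; the list is illustrative, not a survey of real council data.
DECISION_CODES = {
  /prior approval not required/i => 'Prior not required',
  /\bwithdrawn\b/i               => 'Withdrawn',
  /\bsplit\b/i                   => 'Split',
  /\b(refus|reject)/i            => 'Refuse',
  /\b(grant|approv|permit)/i     => 'Approve'
}.freeze

def standardise_decision(decision_raw)
  DECISION_CODES.each { |pattern, code| return code if decision_raw =~ pattern }
  nil # unrecognised wording: keep decision_raw, leave decision unset
end
```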

Create Application class to store scraped data

This will allow us to ensure that fields are consistent across the different scrapers, to set up defaults consistently, and to validate Application objects before the data is returned.

We'll need a to_hash instance method to convert the data for returning to the user.
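A minimal sketch of the proposed class; the field list is illustrative, drawn from attributes mentioned elsewhere in these issues:

```ruby
# Holds one scraped planning application with a consistent field set
# across scrapers. (The FIELDS list and valid? rule are illustrative.)
class Application
  FIELDS = %i[authority_name council_reference scraped_at info_url
              date_received date_validated status decision].freeze

  attr_accessor(*FIELDS)

  # Convert to a plain hash for returning to the user.
  def to_hash
    FIELDS.each_with_object({}) { |field, h| h[field] = public_send(field) }
  end

  # Every scraper should at least produce these two attributes.
  def valid?
    !council_reference.nil? && !authority_name.nil?
  end
end
```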

case_officer_code modifies Authority url param

How to reproduce:

auth = UKPlanningScraper::Authority.named("X") # Northgate authorities only
apps1 = auth.case_officer_code("123").decided_days(7).scrape
apps2 = auth.decided_days(7).scrape

This fails because the first call modified Authority's url param, so the apps2 call to scrape requests the case officer page rather than the general search page.

The url param needs to stay constant throughout the lifetime of the object so scrape will always be requesting the right URL.
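One way to sketch the fix: freeze the stored URL and derive each request URL from a copy (the class shape and query parameter name here are illustrative, not the gem's actual code):

```ruby
# Treat @url as immutable and build per-scrape URLs from it, so
# repeated scrape calls always start from the general search page.
# (This class shape and the query parameter name are illustrative.)
class Authority
  attr_reader :url

  def initialize(url)
    @url = url.dup.freeze # freezing surfaces any accidental in-place mutation
  end

  def search_url(case_officer_code: nil)
    return url.dup unless case_officer_code

    url + "?caseofficer=#{case_officer_code}" # new string; @url untouched
  end
end
```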

How should we use version numbers for authorities data changes?

Semantic versioning 2.0.0 says:

Given a version number MAJOR.MINOR.PATCH, increment the:

MAJOR version when you make incompatible API changes,
MINOR version when you add functionality in a backwards-compatible manner, and
PATCH version when you make backwards-compatible bug fixes.

Additional labels for pre-release and build metadata are available as extensions to the MAJOR.MINOR.PATCH format.

This gem includes data in the authorities.csv file. Should changes to this file be regarded as adding functionality, requiring updating the MINOR version number, or just backwards compatible bug fixes, requiring an update to the PATCH version?

Perhaps two scenarios:

  1. A new authority is added. This is regarded as new functionality (you can now do something you couldn't do before: scrape that authority's data) so we update the MINOR version.
  2. An existing authority is modified: either the URL has changed or we've added/removed tags. This is a "bugfix" so we update the PATCH version.

Thoughts? How would this proposal tie in with keeping track of versions usefully in your Gemfile?
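Either way, a pessimistic version constraint in a consuming scraper's Gemfile would pick up both MINOR and PATCH releases automatically while blocking breaking MAJOR changes:

```ruby
# Gemfile: '~> 0.5' allows any release >= 0.5.0 and < 1.0.0, so under
# either proposal new or updated authorities.csv entries arrive
# automatically. (The version number here is illustrative.)
gem 'uk_planning_scraper', '~> 0.5'
```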

Scrape Idox tabs

  • details tab (Further Information)
  • contacts tab
  • dates tab
  • bump minor version number

See also #20.

Standardise status text

Rename status to status_raw.

Parse status_raw into a new status field according to the LGA standard:

  • Live (in the process of being decided)
  • Withdrawn
  • Decided
  • Appeal (in the process of being decided via a non-determination appeal)
  • Called in (in the process of being considered by the Secretary of State)
  • Referred to SoS (in the process of being considered by the Secretary of
    State) # FIXME - appears to be a duplicate of Called in
  • Invalid (requires something to happen to it before it can be decided)
  • Not ours (belongs to another planning authority)
  • Registered (received but not yet been processed and validated)

This is the status at the extracted date.

Enforce scrape parameters

Currently the scrape params are passed in a hash that doesn't get checked, so the user has no way of knowing which of them have actually affected their scrape results.

We should check the params and raise an exception if an invalid parameter is present.
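A sketch of that validation: whitelist the known params and fail fast on anything unrecognised instead of silently ignoring it (the param list and helper name are illustrative):

```ruby
# Known scrape parameters. This list is illustrative, not the gem's
# definitive set.
VALID_SCRAPE_PARAMS = %i[validated_days decided_days received_days
                         keywords applicant_name].freeze

# Raise on any unrecognised key, eg a misspelled parameter name.
def validate_scrape_params!(params)
  unknown = params.keys - VALID_SCRAPE_PARAMS
  unless unknown.empty?
    raise ArgumentError, "Unknown scrape parameter(s): #{unknown.join(', ')}"
  end
  params
end
```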
