adrianshort / uk_planning_scraper
A Ruby gem to get planning applications data from UK council websites.
License: GNU Lesser General Public License v3.0
Can we just check if the fields are present and add them if they're not?
`UKPlanningScraper::Authority.named('Newham').scrape({ received_days: 1 })`
raises:
NoMethodError: undefined method `date(applicationReceivedStart)'
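One way to implement the suggestion above, sketched in plain Ruby (the hash stands in for the Mechanize form fields; `set_field_if_present` is a hypothetical helper, not part of the gem):

```ruby
# Hypothetical helper: only set a search field if the form actually has it,
# instead of raising NoMethodError on authorities whose form lacks the field.
def set_field_if_present(fields, name, value)
  return false unless fields.key?(name)
  fields[name] = value
  true
end

# Newham's form here lacks applicationReceivedStart, so that call is a no-op.
fields = { 'applicationValidatedStart' => nil }
set_field_if_present(fields, 'applicationValidatedStart', '2018-01-01') # => true
set_field_if_present(fields, 'applicationReceivedStart', '2018-01-01')  # => false
```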
The tags are all one word each so they can be separated by spaces.
This will make the CSV file display nicely in GitHub and allow for easier editing.
I've started adding some of the authorities that I found on https://www.local.gov.uk/our-support/guidance-and-resources/communications-support/digital-councils/social-media/go-further/a-z-councils-online and found a special one referred to by the Adur web site...South Downs National Park. This makes me wonder how/where to get a full de facto list of planning authorities. It may exist somewhere on data.gov
Should be 32 but only gets 31:
https://kiosks.adrianshort.org/authorities/aberdeen/
(Reported by @NotQuiteMinerva)
The Northgate search results page column `Date Registered` is the date on which the application was validated, not the date the authority received the application. The received date is only available on the Application Dates detail page (`Received`). On this page we also have `Validated` and `Registered`, which appear to be the same thing.
Fix so that this data correctly gets returned in the `date_validated` attribute and `date_registered` is `nil`.
This is something to do with Hackney/Northgate session cookies. If you've got the cookie (eg from a previous visit) the URLs work. If you don't have the cookie they give an error.
This is causing a problem for adrianshort/kiosks.
The Kensington and Chelsea site has a different configuration to other planning searches - there are 72 applications for InLinks on the site but they don't show up on the InLink Kiosks page.
The field name to search proposals is `Proposal Keyword`.
The link:
will find the applications if you can do something with that.
Users should be able to pass in a Logger object with the options hash, configured any way they like. Otherwise, it should default to logging to $stdout at Logger::INFO level. Do it something like this.
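A sketch of that default (the `:logger` option name is an assumption):

```ruby
require 'logger'

# Use a caller-supplied Logger from the options hash if given, otherwise
# fall back to logging to $stdout at INFO level.
def build_logger(options = {})
  options[:logger] || Logger.new($stdout, level: Logger::INFO)
end

log = build_logger
log.info('Scraping 10 pages...')
```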
Hello, I am a complete rookie when it comes to Ruby so apologies if this is an obvious fix. After installing all the dependencies I run the Ruby script but keep getting this message:
C:\Users\Griogair\Documents\Planning Portal Database>ruby my_scraper.rb
C:/Users/Griogair/.local/share/gem/ruby/3.2.0/gems/uk_planning_scraper-0.5.0/lib/uk_planning_scraper/authority_scrape_params.rb:15:in `validated_days': uninitialized constant UKPlanningScraper::Authority::Fixnum (NameError)
check_class(n, Fixnum)
^^^^^^
from my_scraper.rb:7:in `block in <main>'
from my_scraper.rb:6:in `each'
from my_scraper.rb:6:in `<main>'
Would be very grateful if you could please explain why this is happening and make any suggestions on how to resolve the error.
Thank you,
Griogair
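For what it's worth, the error happens because the `Fixnum` constant was removed in Ruby 3.2 (integers were unified into `Integer` back in Ruby 2.4). A likely fix, sketched here rather than taken from the gem's actual code, is to check against `Integer` instead:

```ruby
# Sketch of the fix: validate against Integer, which exists on all current
# Rubies, instead of the removed Fixnum constant.
def check_class(value, expected_class)
  unless value.is_a?(expected_class)
    raise ArgumentError, "#{value.inspect} must be a #{expected_class}"
  end
end

check_class(30, Integer)      # fine on Ruby 3.2
# check_class('30', Integer)  # raises ArgumentError
```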
Getting: https://www.planningpa.bolton.gov.uk/online-applications-17/search.do?action=advanced
/app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/http.rb:933:in `connect_nonblock': SSL_connect returned=1 errno=0 state=unknown state: unknown protocol (OpenSSL::SSL::SSLError)
from /app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/http.rb:933:in `connect'
from /app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/http.rb:863:in `do_start'
from /app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/http.rb:858:in `start'
from /app/vendor/bundle/ruby/2.3.0/gems/net-http-persistent-3.0.0/lib/net/http/persistent.rb:692:in `start'
from /app/vendor/bundle/ruby/2.3.0/gems/net-http-persistent-3.0.0/lib/net/http/persistent.rb:622:in `connection_for'
from /app/vendor/bundle/ruby/2.3.0/gems/net-http-persistent-3.0.0/lib/net/http/persistent.rb:927:in `request'
from /app/vendor/bundle/ruby/2.3.0/gems/mechanize-2.7.6/lib/mechanize/http/agent.rb:280:in `fetch'
from /app/vendor/bundle/ruby/2.3.0/gems/mechanize-2.7.6/lib/mechanize.rb:464:in `get'
from /app/vendor/bundle/ruby/2.3.0/bundler/gems/uk_planning_scraper-8d15678700bb/lib/uk_planning_scraper/idox.rb:13:in `scrape_idox'
from /app/vendor/bundle/ruby/2.3.0/bundler/gems/uk_planning_scraper-8d15678700bb/lib/uk_planning_scraper/authority.rb:43:in `scrape'
from scraper.rb:9:in `block in <main>'
from scraper.rb:6:in `each'
from scraper.rb:6:in `each_with_index'
from scraper.rb:6:in `<main>'
So that users can specify the scraper user agent to their own taste. Default to blank.
Allow users to specify which fields they do or don't want included in the output.
Add `only` and `except` to the `params` hash in `Authority#scrape`. Both of these should be a comma-separated list of field names. Using the `only` and `except` params at the same time throws an error.
We might need to consider how this would interact with potential options for a deep or shallow scrape, eg an option like `documents: true` which scrapes the contents of documents pages.
One specific use case is including or excluding personal data, eg applicants' and agents' names, email addresses and phone numbers. But it'd be nicer to do that with an option like `personal_data: false`.
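A rough sketch of how `only`/`except` could filter a result hash (the comma-separated lists and mutual-exclusion error come from the issue; the rest is an assumption):

```ruby
# Filter one application hash according to only/except options.
def filter_fields(record, options = {})
  raise ArgumentError, "Can't use only and except together" if options[:only] && options[:except]

  if options[:only]
    keep = options[:only].split(',').map { |f| f.strip.to_sym }
    record.select { |key, _| keep.include?(key) }
  elsif options[:except]
    drop = options[:except].split(',').map { |f| f.strip.to_sym }
    record.reject { |key, _| drop.include?(key) }
  else
    record
  end
end

app = { reference: '18/123', applicant_name: 'A. Person', status: 'Decided' }
filter_fields(app, only: 'reference, status')
# => { reference: "18/123", status: "Decided" }
```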
Tests should fail if there is a duplicate authority name.
Not sure about duplicate authority URLs. Do any authorities share the same planning website?
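A minimal sketch of such a test (the names array stands in for the name column of `authorities.csv`; `Enumerable#tally` needs Ruby 2.7+):

```ruby
# Return any authority names that appear more than once.
def duplicate_names(names)
  names.tally.select { |_, count| count > 1 }.keys
end

duplicate_names(%w[Merton Camden Merton]) # => ["Merton"]
duplicate_names(%w[Merton Camden])        # => []
```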
/uk_planning_scraper/lib/uk_planning_scraper.rb:54:in `block (2 levels) in search': undefined method `[]' for nil:NilClass (NoMethodError)
It appears not to be finding any elements for `li.searchresult`.
Currently the Northgate scraper throws a fatal error when getting a 301/302 redirect on the first scrape. We should follow the redirect rather than fail.
This has been an issue with Camden, Islington and Merton, all of which have moved from HTTP to HTTPS.
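One way to follow the redirect, sketched with plain `Net::HTTP` (the gem actually goes through Mechanize, so the real fix will look different):

```ruby
require 'net/http'
require 'uri'

# Follow up to `limit` 301/302 redirects instead of failing on the first one.
def get_following_redirects(url, limit = 5)
  raise 'Too many redirects' if limit.zero?

  response = Net::HTTP.get_response(URI(url))
  if response.is_a?(Net::HTTPRedirection)
    get_following_redirects(response['location'], limit - 1)
  else
    response
  end
end
```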
/app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/protocol.rb:158:in `rbuf_fill': too many connection resets (due to Net::ReadTimeout - Net::ReadTimeout) after 0 requests on 47021302552820, last used 132.706131095 seconds ago (Net::HTTP::Persistent::Error)
from /app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/protocol.rb:136:in `readuntil'
from /app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/protocol.rb:146:in `readline'
from /app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/http/response.rb:40:in `read_status_line'
from /app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/http/response.rb:29:in `read_new'
from /app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/http.rb:1437:in `block in transport_request'
from /app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/http.rb:1434:in `catch'
from /app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/http.rb:1434:in `transport_request'
from /app/vendor/ruby-2.3.1/lib/ruby/2.3.0/net/http.rb:1407:in `request'
from /app/vendor/bundle/ruby/2.3.0/gems/net-http-persistent-3.0.0/lib/net/http/persistent.rb:933:in `block in request'
from /app/vendor/bundle/ruby/2.3.0/gems/net-http-persistent-3.0.0/lib/net/http/persistent.rb:630:in `connection_for'
from /app/vendor/bundle/ruby/2.3.0/gems/net-http-persistent-3.0.0/lib/net/http/persistent.rb:927:in `request'
from /app/vendor/bundle/ruby/2.3.0/gems/mechanize-2.7.6/lib/mechanize/http/agent.rb:280:in `fetch'
from /app/vendor/bundle/ruby/2.3.0/gems/mechanize-2.7.6/lib/mechanize.rb:464:in `get'
from /app/vendor/bundle/ruby/2.3.0/bundler/gems/uk_planning_scraper-23d5825e7331/lib/uk_planning_scraper.rb:98:in `block in search'
from /app/vendor/bundle/ruby/2.3.0/bundler/gems/uk_planning_scraper-23d5825e7331/lib/uk_planning_scraper.rb:95:in `each'
from /app/vendor/bundle/ruby/2.3.0/bundler/gems/uk_planning_scraper-23d5825e7331/lib/uk_planning_scraper.rb:95:in `each_with_index'
from /app/vendor/bundle/ruby/2.3.0/bundler/gems/uk_planning_scraper-23d5825e7331/lib/uk_planning_scraper.rb:95:in `search'
from scraper.rb:20:in `block in <main>'
from scraper.rb:19:in `each'
from scraper.rb:19:in `<main>'
Currently this fails silently with an empty array returned from `Authority#scrape`, eg:
UKPlanningScraper::Authority.named('Newham').scrape({ validated_days: 90 })
# => []
UKPlanningScraper::Authority.named('Newham').scrape({ validated_days: 9 })
Using Idox scraper.
Getting: https://pa.newham.gov.uk/online-applications/search.do?action=advanced
Found 10 apps on this page.
...
The search form returns this error in HTML:
<div class="messagebox errors">
<h2>Please check the search criteria:</h2>
<ul>
<li>Too many results found. Please enter some more parameters.</li>
</ul>
</div>
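A sketch of detecting that error box so `scrape` can raise instead of returning `[]` (plain string matching stands in for the gem's HTML parsing; the exception class is hypothetical):

```ruby
class TooManyResultsError < StandardError; end

# Raise if the Idox search results page contains the error message box.
def check_search_errors!(html)
  if html.include?('messagebox errors') && html.include?('Too many results found')
    raise TooManyResultsError, 'Too many results; add more search parameters.'
  end
end

check_search_errors!('<div class="results">...</div>') # no error raised
```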
Rename `decision` to `decision_raw`.
Create a new field `decision` and parse `decision_raw` into the LGA standard codes:
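A sketch of the parsing step (the patterns and codes below are illustrative only; the actual LGA standard code list isn't reproduced here):

```ruby
# Illustrative patterns only -- not the real LGA standard code list.
DECISION_PATTERNS = {
  /grant|approv|permit/i => 'Approved',
  /refus/i               => 'Refused',
  /withdraw/i            => 'Withdrawn'
}.freeze

def parse_decision(decision_raw)
  DECISION_PATTERNS.each do |pattern, code|
    return code if decision_raw =~ pattern
  end
  nil # leave unmatched decisions nil rather than guessing
end

parse_decision('Application Permitted') # => "Approved"
```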
This will allow us to ensure that fields are consistent across the different scrapers, to set up defaults consistently, and to validate `Application` objects before the data is returned.
We'll need a `to_hash` instance method to convert the data for returning to the user.
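A minimal sketch of what that class and its `to_hash` might look like (the attribute names are an assumed subset):

```ruby
# Value object for one planning application; to_hash converts it back to a
# plain hash for returning to the user.
class Application
  ATTRIBUTES = %i[council_reference date_received decision].freeze

  attr_accessor(*ATTRIBUTES)

  def to_hash
    ATTRIBUTES.each_with_object({}) { |attr, hash| hash[attr] = send(attr) }
  end
end

app = Application.new
app.council_reference = '18/01234/FUL'
app.to_hash
# => { council_reference: "18/01234/FUL", date_received: nil, decision: nil }
```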
How to reproduce:
auth = UKPlanningScraper::Authority.named("X") # Northgate authorities only
apps1 = auth.case_officer_code("123").decided_days(7).scrape
apps2 = auth.decided_days(7).scrape
This fails because the `url` param of `Authority` has been modified, so the `apps2` call to `scrape` fails because it's trying to scrape from the case officer page rather than the general search page.
The `url` param needs to stay constant throughout the lifetime of the object so `scrape` will always be requesting the right URL.
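One way to keep the base URL constant, sketched against assumed internals rather than the gem's real code:

```ruby
# Derive per-scrape URLs from a frozen base instead of mutating @url in place.
class Authority
  attr_reader :name, :url

  def initialize(name, url)
    @name = name
    @url = url.dup.freeze # base search URL; never modified after construction
  end

  def search_url(extra_query = nil)
    extra_query ? "#{@url}?#{extra_query}" : @url
  end
end

auth = Authority.new('X', 'https://example.gov.uk/search')
auth.search_url('caseofficer=123') # case officer search
auth.search_url                    # base URL, still untouched
```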
Semantic versioning 2.0.0 says:
Given a version number MAJOR.MINOR.PATCH, increment the:
MAJOR version when you make incompatible API changes,
MINOR version when you add functionality in a backwards-compatible manner, and
PATCH version when you make backwards-compatible bug fixes.
Additional labels for pre-release and build metadata are available as extensions to the MAJOR.MINOR.PATCH format.
This gem includes data in the `authorities.csv` file. Should changes to this file be regarded as adding functionality, requiring updating the MINOR version number, or just backwards-compatible bug fixes, requiring an update to the PATCH version?
Perhaps two scenarios:
Thoughts? How would this proposal tie in with keeping track of versions usefully in your Gemfile?
Currently from these lines in northgate.rb:
results_url = URI::encode(base_url + response2.headers['Location'].gsub!('PS=10', 'PS=99999'))
app.info_url = URI::encode(generic_url + cells[0].at('a')[:href].strip)
We need to improve the URL escaping by understanding what's precisely required here and doing it properly.
This post gives some guidance:
https://docs.knapsackpro.com/2020/uri-escape-is-obsolete-percent-encoding-your-query-string
See also #20.
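Along the lines of that post, a sketch that escapes only the query values with `CGI.escape` instead of `URI::encode`-ing the whole URL (helper name and params are illustrative):

```ruby
require 'cgi'

# Build a URL with each query key and value percent-encoded individually.
def build_url(base, path, params = {})
  query = params.map { |k, v| "#{CGI.escape(k.to_s)}=#{CGI.escape(v.to_s)}" }.join('&')
  query.empty? ? base + path : "#{base}#{path}?#{query}"
end

build_url('https://example.gov.uk', '/results', 'PS' => 99999, 'officer' => 'A B')
# => "https://example.gov.uk/results?PS=99999&officer=A+B"
```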
eg Northern Ireland:
http://epicpublic.planningni.gov.uk/publicaccess/search.do?action=advanced&searchType=Application
which is an Idox site that has a param called `searchCriteria.localGovernmentDistrict`.
An alternative approach would be to list each authority separately in `authorities.csv` and add extra scrape parameters to that file so that only the specific authority is searched.
See also #17.
Rename `status` to `status_raw`.
Parse `status_raw` into a new `Status` field according to the LGA standard:
Called in
This is the status at the extracted date.
Currently the scrape params are in a hash that doesn't get checked. The user has no way of knowing which of them have affected their scrape results.
We should check the params and raise an exception if an invalid parameter is present.
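A sketch of that check (the whitelist below is an assumed subset of the real params):

```ruby
# Assumed subset of the gem's valid scrape parameter names.
VALID_SCRAPE_PARAMS = %i[received_days validated_days decided_days keywords].freeze

def validate_scrape_params!(params)
  invalid = params.keys - VALID_SCRAPE_PARAMS
  unless invalid.empty?
    raise ArgumentError, "Invalid scrape parameter(s): #{invalid.join(', ')}"
  end
end

validate_scrape_params!(validated_days: 7)  # fine
# validate_scrape_params!(recieved_days: 7) # raises -- the typo is caught early
```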