Comments (9)
any idea of how you'd like to see pupa handle this?
from pupa.
I don't have strong opinions on how the API should work, but one way is to be able to change the "active jurisdiction" so that objects are yielded to the appropriate jurisdiction. Pseudo-code:
    # __init__.py
    from utils import CanadianJurisdiction

    class QuebecMunicipalities(CanadianJurisdiction):
        # Here you would find either:
        # * nothing, since this is a fake jurisdiction
        # * dummy variables which will be ignored by the scraper
        # * a list of all the jurisdictions if one of the above two can't be implemented
        pass

    # people.py
    from pupa.scrape import Scraper

    class QuebecMunicipalitiesPersonScraper(Scraper):
        def scrape(self):
            # get the list of municipalities
            for municipality in municipalities:
                # create a jurisdiction object
                # set_jurisdiction is the proposed method, not an existing one
                self.set_jurisdiction(jurisdiction)
                # yield a lot of people
However, I can imagine a lot of challenges in changing Pupa to work this way.
Maybe there are some Python metaprogramming tricks I can use to make it seem like there are several thousand modules with common people.py scraper code, without requiring me to have thousands of folders of __init__.py files and small people.py files all inheriting from the same meta-scraper class.
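One such trick, sketched under assumptions: Python's built-in type() can create one jurisdiction class per municipality at import time, so no per-municipality folder is needed. The CanadianJurisdiction class below is a stand-in (its real attributes in utils are not shown in this thread), and make_jurisdiction, MUNICIPALITIES, and JURISDICTIONS are hypothetical names.

```python
class CanadianJurisdiction:
    """Stand-in for the base class in utils; its attributes here are assumed."""
    division_name = None
    classification = "legislature"


def make_jurisdiction(name):
    """Build a jurisdiction subclass for one municipality with type()."""
    # Derive a valid class name, e.g. "Mont-Royal" -> "MontRoyal".
    class_name = "".join(ch for ch in name.title() if ch.isalnum())
    return type(class_name, (CanadianJurisdiction,), {"division_name": name})


# Hypothetical sample; the real list would come from the provincial directory.
MUNICIPALITIES = ["Saint-Lambert", "Mont-Royal", "Baie-Comeau"]

# One generated class per municipality, keyed by municipality name.
JURISDICTIONS = {name: make_jurisdiction(name) for name in MUNICIPALITIES}
```

Whether Pupa's update command could discover classes generated this way is a separate question; the sketch only shows that the classes themselves are cheap to produce.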
from pupa.
the people.py files won't be needed if they're all the same, as multiple jurisdictions can point to the same scraper(s)
your proposed solution might work, I'll play with some proof of concept code
(In reply to James McKinney's comment of Tue, May 20, 2014 at 3:54 PM, quoted above.)
from pupa.
Cool - how do you make multiple jurisdictions point to the same scrapers?
from pupa.
there's now an example of this in https://github.com/opencivicdata/scrapers-us-state
there's still one file per jurisdiction (maybe we can improve that, maybe this is good enough though) but they all point to the same scraper (and the jurisdictions in this case are actually auto-generated classes)
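A minimal sketch of that arrangement, using stand-in classes rather than the code in the linked repository: each jurisdiction stays tiny because a common base class carries a scrapers mapping that names the one shared scraper class. SharedPersonScraper, StateJurisdiction, and the two example states are illustrative names only.

```python
class SharedPersonScraper:
    """Stand-in for the single scraper class every jurisdiction reuses."""

    def __init__(self, jurisdiction):
        self.jurisdiction = jurisdiction

    def scrape(self):
        # A real scraper would fetch and yield people; this yields a label only.
        yield f"people for {self.jurisdiction.name}"


class StateJurisdiction:
    """Base class holding the mapping from scraper name to scraper class."""
    name = None
    scrapers = {"people": SharedPersonScraper}


# Each per-jurisdiction file then only needs to set what differs.
class Alabama(StateJurisdiction):
    name = "Alabama"


class Alaska(StateJurisdiction):
    name = "Alaska"


def run_people(jurisdiction_cls):
    """Instantiate the shared scraper for one jurisdiction and collect output."""
    scraper = jurisdiction_cls.scrapers["people"](jurisdiction_cls())
    return list(scraper.scrape())
```

Calling run_people(Alabama) and run_people(Alaska) exercises the same scraper class with different jurisdiction objects.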
from pupa.
Thanks! In Quebec I'll have 1000 auto-generated jurisdictions, mixed in with manual jurisdictions; we scrape the big cities individually (to get email addresses), but we're happy to use a provincial directory for the smaller cities (which has one email for the entire council). It may be confusing to have this mix, so avoiding one file per jurisdiction would still be ideal.
How is Pupa 0.4 coming along? How soon can I start upgrading to the PostgreSQL version?
from pupa.
pupa 0.4 is pretty much ready. There are still rough edges, but no more than existed in the mongo version, I believe. I was hoping to update some docs before calling it 0.4 officially, but we're using it in development now and will be releasing it as 0.4 and switching production over soon.
The 1000-jurisdiction issue still requires more work and thinking on the best way to do it. I think a different command like pupa bulkupdate might get around some of the challenges we'd face. Once things settle down here, I'll try to think of a cleaner interface for this.
from pupa.
Pinging for any updates on how to implement common scraper code for 1000s of jurisdictions.
In the update command's handle method, I'm wondering if instead of getting a single jurisdiction from a module, it might get a list of jurisdictions instead, and then loop over them. Alternatively, there could be a bulkupdate command as mentioned earlier, which expects the module to define multiple jurisdictions.
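The first option could look roughly like the sketch below. This is not Pupa's actual update command: the JURISDICTIONS and JURISDICTION module attributes, get_jurisdictions, and the signature of handle are all hypothetical names chosen for illustration, and the modules are simulated with SimpleNamespace.

```python
from types import SimpleNamespace


def get_jurisdictions(module):
    """Return every jurisdiction a module defines, as a list.

    Hypothetical convention: a module exposes either JURISDICTIONS (many)
    or JURISDICTION (one). Neither name is taken from Pupa.
    """
    many = getattr(module, "JURISDICTIONS", None)
    if many is not None:
        return list(many)
    return [module.JURISDICTION]


def handle(module, update_one):
    """Run update_one once per jurisdiction and collect the results."""
    return [update_one(jurisdiction) for jurisdiction in get_jurisdictions(module)]


# Stand-ins for imported scraper modules.
single = SimpleNamespace(JURISDICTION="Montreal")
bulk = SimpleNamespace(JURISDICTIONS=["Saint-Lambert", "Mont-Royal"])

print(handle(single, str.upper))  # ['MONTREAL']
print(handle(bulk, str.upper))    # ['SAINT-LAMBERT', 'MONT-ROYAL']
```

Normalizing to a list up front keeps the loop identical for single-jurisdiction and bulk modules, which is the main appeal of this option over a separate bulkupdate command.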
from pupa.
My workaround is to just put all the jurisdictions into one jurisdiction, in an organization hierarchy, which is fine for my needs, but maybe not in the general case. However, as there is no other demand for the general case, I'm closing.
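The thread does not show the workaround's code, so the following is only a guess at its general shape: one umbrella organization under a single jurisdiction, with each municipality's council attached as a child. The Organization dataclass is a stand-in, not Pupa's model, and the organization names are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Organization:
    """Stand-in organization record with an optional parent."""
    name: str
    parent: Optional["Organization"] = None
    children: List["Organization"] = field(default_factory=list)

    def add_child(self, name):
        """Create a child organization and link it in both directions."""
        child = Organization(name, parent=self)
        self.children.append(child)
        return child


# One umbrella organization for the single catch-all jurisdiction.
root = Organization("Quebec municipalities")

# Hypothetical sample of municipalities, each becoming a child council.
councils = [
    root.add_child(f"Conseil municipal de {municipality}")
    for municipality in ("Saint-Lambert", "Mont-Royal")
]
```

The trade-off is the one noted above: people end up grouped under one jurisdiction rather than under a jurisdiction per municipality.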
from pupa.