
Comments (9)

jamesturk commented on August 23, 2024

any idea of how you'd like to see pupa handle this?

from pupa.

jpmckinney commented on August 23, 2024

I don't have strong opinions on how the API should work, but one way is to be able to change the "active jurisdiction" so that objects are yielded to the appropriate jurisdiction. Pseudo-code:

# __init__.py
from utils import CanadianJurisdiction

class QuebecMunicipalities(CanadianJurisdiction):
    # Here you would find either:
    # * nothing, since this is a fake jurisdiction
    # * dummy variables which will be ignored by the scraper
    # * a list of all the jurisdictions if one of the above two can't be implemented
    pass

# people.py
from pupa.scrape import Scraper

class QuebecMunicipalitiesPersonScraper(Scraper):
    def scrape(self):
        # get the list of municipalities
        for municipality in municipalities:
            # create a jurisdiction object
            self.set_jurisdiction(jurisdiction)  # proposed method, not an existing one
            # yield a lot of people

However, I can imagine a lot of challenges in changing Pupa to work this way.

Maybe there are some Python metaprogramming tricks I can use, to make it seem like there are several thousand modules with common people.py scraper code, without requiring me to have thousands of folders of __init__.py files and small people.py files all inheriting from the same meta-scraper class.
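One possible form of the metaprogramming idea, as a rough sketch only: build each jurisdiction class with `type()` and register it in an in-memory module created with `types.ModuleType`, so that importing by name works without any folders on disk. `BaseJurisdiction`, `SharedPersonScraper`, `make_module`, and the module slugs are hypothetical stand-ins chosen for illustration. They are not Pupa's API, and whether Pupa's commands would accept modules built this way is untested here.

```python
import importlib
import sys
import types


class BaseJurisdiction:
    """Hypothetical stand-in for a common jurisdiction base class."""
    name = None


class SharedPersonScraper:
    """Hypothetical stand-in for the single people scraper shared by all."""


def make_module(module_name, jurisdiction_name):
    """Create an in-memory module holding one generated jurisdiction class."""
    cls = type(
        module_name.title().replace("_", ""),  # e.g. "ca_qc_alma" -> "CaQcAlma"
        (BaseJurisdiction,),
        {"name": jurisdiction_name, "scrapers": {"people": SharedPersonScraper}},
    )
    module = types.ModuleType(module_name)
    setattr(module, cls.__name__, cls)
    # Registering in sys.modules lets importlib.import_module() find it.
    sys.modules[module_name] = module
    return module


# Illustrative input; a real list would come from a provincial directory.
for slug, label in [("ca_qc_alma", "Alma"), ("ca_qc_amos", "Amos")]:
    make_module(slug, label)

module = importlib.import_module("ca_qc_alma")
print(module.CaQcAlma.name)
```

Every generated class points at the same scraper class, so the per-jurisdiction code shrinks to one row of data.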


jamesturk commented on August 23, 2024

the people.py files won't be needed if they're all the same, as multiple jurisdictions can point to the same scraper(s)

your proposed solution might work, I'll play with some proof of concept code

On Tue, May 20, 2014 at 3:54 PM, James McKinney wrote the comment above.


jpmckinney commented on August 23, 2024

Cool - how do you make multiple jurisdictions point to the same scrapers?


jamesturk commented on August 23, 2024

there's now an example of this in https://github.com/opencivicdata/scrapers-us-state

there's still one file per jurisdiction (maybe we can improve that, maybe this is good enough though) but they all point to the same scraper (and the jurisdictions in this case are actually auto-generated classes)


jpmckinney commented on August 23, 2024

Thanks! In Quebec I'll have 1000 auto-generated jurisdictions, mixed in with manual jurisdictions; we scrape the big cities individually (to get email addresses), but we're happy to use a provincial directory for the smaller cities (which has one email for the entire council). It may be confusing to have this mix, so avoiding one file per jurisdiction would still be ideal.

How is Pupa 0.4 coming along? How soon can I start upgrading to the PostgreSQL version?


jamesturk commented on August 23, 2024

pupa 0.4 is pretty much ready, there are still rough edges but no more than existed in the mongo version I believe. I was hoping to update some docs before calling it 0.4 officially, but we're using it in development now and will be releasing it as 0.4 and switching production over soon

the 1000 jurisdiction issue still requires more work/thinking on the best way to do it. I think a different command like pupa bulkupdate might get around some of the challenges we'd face. once things settle down here I'll try and think of a cleaner interface for this


jpmckinney commented on August 23, 2024

Pinging for any updates on how to implement common scraper code for 1000s of jurisdictions.

In the update command's handle method, I'm wondering if instead of getting a single jurisdiction from a module, it might get a list of jurisdictions instead, and then loop over them. Alternatively, there could be a bulkupdate command as mentioned earlier, which expects the module to define multiple jurisdictions.
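To make the first option concrete, here is a minimal sketch of a handler that accepts either one jurisdiction or a list and loops over them. It does not reproduce Pupa's actual update command: `Jurisdiction`, `get_jurisdictions`, `handle`, `scrape_people`, and the `jurisdiction` / `jurisdictions` module attributes are all hypothetical names assumed for illustration.

```python
import types


class Jurisdiction:
    """Hypothetical stand-in; not Pupa's Jurisdiction class."""

    def __init__(self, name):
        self.name = name


def get_jurisdictions(module):
    """Return the module's jurisdictions, accepting one or many.

    Assumed convention: the module exposes either a `jurisdictions`
    list or a single `jurisdiction` object.
    """
    many = getattr(module, "jurisdictions", None)
    if many is not None:
        return list(many)
    return [module.jurisdiction]


def handle(module, scrape):
    """Run `scrape` once per jurisdiction and collect results by name."""
    return {j.name: list(scrape(j)) for j in get_jurisdictions(module)}


def scrape_people(jurisdiction):
    # Placeholder scraper: a real one would yield scraped people.
    yield f"Mayor of {jurisdiction.name}"


# SimpleNamespace objects stand in for imported scraper modules.
bulk = types.SimpleNamespace(jurisdictions=[Jurisdiction("Alma"), Jurisdiction("Amos")])
single = types.SimpleNamespace(jurisdiction=Jurisdiction("Gatineau"))

print(handle(bulk, scrape_people))
print(handle(single, scrape_people))
```

Treating a single jurisdiction as a one-element list keeps existing single-jurisdiction modules working under the same loop, which is the main appeal of this option over a separate command.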


jpmckinney commented on August 23, 2024

My workaround is to just put all the jurisdictions into one jurisdiction, in an organization hierarchy, which is fine for my needs, but maybe not in the general case. However, as there is no other demand for the general case, I'm closing.
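The shape of this workaround can be sketched roughly as follows, assuming one root organization with one child organization per municipality. The `Organization` dataclass, its `parent_id` field, and `scrape_organizations` are hypothetical stand-ins for illustration, not Pupa's models, and the municipality names are sample input.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Optional

_next_id = count(1)  # simple in-memory ID generator for the sketch


@dataclass
class Organization:
    """Hypothetical stand-in for a scraped organization record."""
    name: str
    parent_id: Optional[int] = None
    id: int = field(default_factory=lambda: next(_next_id))


def scrape_organizations(municipalities):
    """Yield one root organization, then one child council per municipality."""
    root = Organization("Quebec municipalities")
    yield root
    for municipality in municipalities:
        yield Organization(f"{municipality} council", parent_id=root.id)


orgs = list(scrape_organizations(["Alma", "Amos"]))
for org in orgs:
    print(org.id, org.parent_id, org.name)
```

The trade-off is that every municipality shares a single jurisdiction record, so anything keyed on jurisdiction has to be filtered by organization instead.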

