Comments (19)

fgregg avatar fgregg commented on July 21, 2024

@rshorey @paultag is there anything I'm forgetting to do to make links for these entities?

from scrapers-us-municipal.

fgregg avatar fgregg commented on July 21, 2024

In the above commit I set the parent_ids for committees. Just fixing the code and rerunning the scraper resulted in no-ops for the committees, so I blew them away in the DB and reran.

Looking at the add_sponsorship and add_participant methods in pupa, I think I'm using them correctly?

https://github.com/opencivicdata/scrapers-us-municipal/blob/master/chicago/bills.py#L214

https://github.com/opencivicdata/scrapers-us-municipal/blob/master/chicago/events.py#L113

fgregg avatar fgregg commented on July 21, 2024

@rshorey, @paultag could you give me some guidance on null sponsor ids, and null event participant ids?

rshorey avatar rshorey commented on July 21, 2024

@fgregg could you shoot me an example of a detail page for a bill so I can compare more easily? Thanks!

fgregg avatar fgregg commented on July 21, 2024

Examples are at the top of the issue. Here they are again:

Bill (sponsor has null id)- http://api.opencivicdata.org/ocd-bill/0056ee50-61c6-47f4-b6c6-84ac2bdd39e9/
Event (both participants and bills have null ids)- http://api.opencivicdata.org/ocd-event/044cc556-a7d1-4c42-8637-2cc5e175081a/

rshorey avatar rshorey commented on July 21, 2024

Just to be clear, are these bills/individuals that ought to exist (and thus have IDs), or are they bills/individuals we haven't scraped for some reason?

Basically, are you trying to add people/bills with null ids (in which case this looks fine to me) or trying to get them matched, and they're not working?

fgregg avatar fgregg commented on July 21, 2024

these are bills, committees, and individuals that do exist and have been scraped (not completely sure about the bill actually). these should not be null ids.

fgregg avatar fgregg commented on July 21, 2024

Here's the page for the committee: http://api.opencivicdata.org/ocd-organization/fd64b9c4-4563-48df-911a-bc2354e90297/

Here's the page for the sponsor: http://api.opencivicdata.org/ocd-person/89a935c1-31d8-4dac-879b-9078c892f9f2/

paultag avatar paultag commented on July 21, 2024

It's an issue with entity resolution; this is where we'd do manual matching in openstates.

fgregg avatar fgregg commented on July 21, 2024

Ahh! Is there anything I can do to help here?

paultag avatar paultag commented on July 21, 2024

Hilariously, this isn't implemented yet!

Yay!

It's part of ongoing work that started just a week or two ago; building that out is currently on our plate.

paultag avatar paultag commented on July 21, 2024

It depends on both reporting and admin work, which is what team OCD is on right now.

fgregg avatar fgregg commented on July 21, 2024

okay, thanks for the update!

boblannon avatar boblannon commented on July 21, 2024

so let's be clear about what's being resolved. there are three kinds of resolution that pupa performs:

1. Deduplication of objects to be imported
Each importer also checks to make sure that it is not creating duplicate entries in the database. Most of this is done in the _prepare_imports method of the importer. BaseImporter uses omnihash to detect duplicate objects and constructs self.duplicates, a map that establishes one (arbitrarily selected) identifier as the "true" new object, and all other matching objects as duplicates thereof.

As with get_object, the other importer subclasses do specialized checks in their _prepare_imports methods. PersonImporter, for instance, checks to make sure that there are no people who share a name and cannot be disambiguated using their birth dates.

2. Resolution against objects imported by previous importers
Importers that deal with objects with references to other objects must make sure that they don't create duplicates. The BillImporter, for instance, performs its import after the IndividualImporter and OrganizationImporter. It checks with those importers to resolve identifiers and make sure it uses the (again, arbitrarily selected) "main" identifier as indicated by those importers' resolve_json_id methods.

3. Resolution against database
Prior to import, resolution is performed against existing data in the database to be sure that the appropriate operation is performed (either a create, update, or no-op). This happens in the import_item method of the BaseImporter class.

import_item calls out to get_object, which is implemented in the various subclasses of BaseImporter. Each one performs its own specific kind of check for whether an object already exists in the database. The PersonImporter, for instance, checks for already-imported people using their names and dates of birth.
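The numbered steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not pupa's code: `ToyImporter`, `content_hash`, the `_id` key, the in-memory `database` dict, and the sample person are all hypothetical, and a plain SHA-1 over sorted JSON stands in for omnihash. Only the method names `import_item` and `resolve_json_id` and the `duplicates` map are borrowed from the description.

```python
import hashlib
import json


def content_hash(obj):
    """Hash an object's content, ignoring its temporary '_id' (stand-in for omnihash)."""
    body = {k: v for k, v in obj.items() if k != "_id"}
    return hashlib.sha1(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


class ToyImporter:
    """Hypothetical stand-in for an importer; not pupa's BaseImporter."""

    def __init__(self, database):
        self.database = database    # name -> stored record (stand-in for the DB)
        self.duplicates = {}        # duplicate scraped id -> "main" scraped id
        self.json_to_db_id = {}     # main scraped id -> database id

    def prepare_imports(self, objects):
        """Kind 1: keep one object per distinct content, remember the rest as duplicates."""
        seen = {}
        unique = []
        for obj in objects:
            digest = content_hash(obj)
            if digest in seen:
                self.duplicates[obj["_id"]] = seen[digest]
            else:
                seen[digest] = obj["_id"]
                unique.append(obj)
        return unique

    def import_item(self, obj):
        """Kind 3: compare with the stored record and pick insert, update, or noop."""
        data = {k: v for k, v in obj.items() if k != "_id"}
        existing = self.database.get(data["name"])
        if existing is None:
            db_id = "db-{}".format(len(self.database) + 1)
            self.database[data["name"]] = dict(data, id=db_id)
            what = "insert"
        else:
            db_id = existing["id"]
            if {k: v for k, v in existing.items() if k != "id"} == data:
                what = "noop"
            else:
                existing.update(data)
                what = "update"
        self.json_to_db_id[obj["_id"]] = db_id
        return what

    def resolve_json_id(self, json_id):
        """Kind 2: map a scraped id (or a duplicate of it) to a database id."""
        main_id = self.duplicates.get(json_id, json_id)
        return self.json_to_db_id.get(main_id)  # None when nothing matched


# Hypothetical sample data: the same person scraped twice.
scraped = [
    {"_id": "tmp-1", "name": "Jane Doe", "birth_date": "1970-01-01"},
    {"_id": "tmp-2", "name": "Jane Doe", "birth_date": "1970-01-01"},
]
people = ToyImporter(database={})
operations = [people.import_item(p) for p in people.prepare_imports(scraped)]
sponsor_id = people.resolve_json_id("tmp-2")   # resolved through the duplicates map
missing_id = people.resolve_json_id("tmp-9")   # never imported

# A second run against the same database finds nothing to change.
rerun = ToyImporter(database=people.database)
rerun_operation = rerun.import_item(scraped[0])
```

In this toy version a later importer (for bills, say) would call `people.resolve_json_id` for each sponsor reference. A reference that never matches comes back as `None`, which resembles the null ids reported earlier in the thread.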

Not implemented: Resolution across jurisdictions
All of the resolutions above are performed within a single jurisdiction. Objects in one jurisdiction are not resolved with objects in another. The OpenStates team has done such resolutions before, but only manually.

Not implemented: Resolution across databases
If two separate instances of pupa scrape the same jurisdiction but write to different databases, pupa doesn't do anything to resolve the objects in one database with the objects in another. Nor does it resolve objects against any central authoritative database.

fgregg avatar fgregg commented on July 21, 2024

Thanks! Right now, I'm concerned with 2. So I need to dig into resolve_json_id.

fgregg avatar fgregg commented on July 21, 2024

Cool! I got sponsors linked now, but I don't see resolve_json_id ever called for event participants or bills attached to events. The only thing I see there is an 'id' and it looks like it won't accept a pseudo_id

fgregg avatar fgregg commented on July 21, 2024

@boblannon This works for sponsors, https://github.com/opencivicdata/scrapers-us-municipal/pull/58/files#diff-0d0f0450f469b3ad1598867cde71cb7bR42

Is that the right way to do it?

boblannon avatar boblannon commented on July 21, 2024

@paultag is probably a better one to ask. I don't have a lot of experience with legislative objects.

boblannon avatar boblannon commented on July 21, 2024

@fgregg and i discussed this off thread, but one way of addressing this is using pseudo ids. another way, which i used when working with disclosures, is to give bill importers access to (previously run) person and organization importers and resolve the ids using those importers' internal resolution functions. i did that in pupa/#170 (which is still open because it's rather oversized and requires some upstream choices to be made on things like JSONField), but @fgregg plans to excise the relevant lines of code and submit them as a separate PR.
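
For readers unfamiliar with the pseudo id idea, here is a minimal Python sketch of how such an id could be resolved. It rests on assumptions: the encoding used here (a `~` followed by a JSON object of fields to match) is my understanding of pupa's convention and should be checked against the pupa source, while `make_pseudo_id`, `resolve_id`, the committee names, and the `ocd-organization/hypothetical-*` ids are invented for illustration.

```python
import json


def make_pseudo_id(**fields):
    """Encode 'the object matching these fields' as a string id (assumed format)."""
    return "~" + json.dumps(fields, sort_keys=True)


def resolve_id(json_id, known_objects):
    """Return the id of the single known object matching a pseudo id.

    Plain ids are returned unchanged; zero or multiple matches yield None.
    """
    if not json_id.startswith("~"):
        return json_id
    spec = json.loads(json_id[1:])
    matches = [
        obj for obj in known_objects
        if all(obj.get(key) == value for key, value in spec.items())
    ]
    return matches[0]["id"] if len(matches) == 1 else None


# Hypothetical, already-imported organizations.
committees = [
    {"id": "ocd-organization/hypothetical-1", "name": "Committee on Finance"},
    {"id": "ocd-organization/hypothetical-2", "name": "Committee on Zoning"},
]

participant = make_pseudo_id(name="Committee on Finance")
resolved = resolve_id(participant, committees)
unresolved = resolve_id(make_pseudo_id(name="Committee on Parks"), committees)
```

The point of the sketch is that the scraper never needs to know the database id: it records what it does know (a name), and the match is deferred to import time. When the match fails or is ambiguous, the reference stays unresolved rather than pointing at the wrong object.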
