Comments (19)
@rshorey @paultag is there anything I'm forgetting to do to make links for these entities?
from scrapers-us-municipal.
In the above commit I set the parent_ids for committees. Just fixing the code and rerunning the scraper resulted in no-ops for the committees, so I blew them away in the DB and reran.
Looking at the add_sponsorship and add_participant methods in pupa, I think I'm using them correctly?
https://github.com/opencivicdata/scrapers-us-municipal/blob/master/chicago/bills.py#L214
https://github.com/opencivicdata/scrapers-us-municipal/blob/master/chicago/events.py#L113
from scrapers-us-municipal.
@rshorey, @paultag could you give me some guidance on null sponsor ids, and null event participant ids?
from scrapers-us-municipal.
@fgregg could you shoot me an example of a detail page for a bill so I can compare more easily? Thanks!
from scrapers-us-municipal.
Examples are at the top of the issue. Here they are again:
Bill (sponsor has null id)- http://api.opencivicdata.org/ocd-bill/0056ee50-61c6-47f4-b6c6-84ac2bdd39e9/
Event (both participants and bills have null ids)- http://api.opencivicdata.org/ocd-event/044cc556-a7d1-4c42-8637-2cc5e175081a/
from scrapers-us-municipal.
Just to be clear, are these bills/individuals that ought to exist (and thus have IDs), or are they bills/individuals we haven't scraped for some reason?
Basically, are you trying to add people/bills with null ids (in which case this looks fine to me) or trying to get them matched, and they're not working?
from scrapers-us-municipal.
these are bills, committees, and individuals that do exist and have been scraped (not completely sure about the bill actually). these should not be null ids.
from scrapers-us-municipal.
Here's the page for the committee: http://api.opencivicdata.org/ocd-organization/fd64b9c4-4563-48df-911a-bc2354e90297/
Here's the page for the sponsor: http://api.opencivicdata.org/ocd-person/89a935c1-31d8-4dac-879b-9078c892f9f2/
from scrapers-us-municipal.
It's an issue with entity resolution; this is where we'd do manual matching in openstates.
Paul Tagliamonte
Software Developer | Sunlight Foundation
from scrapers-us-municipal.
Ahh! Is there anything I can do to help here?
from scrapers-us-municipal.
Hilariously, this isn't implemented yet! Yay!
It's part of the ongoing work that just started a week or two ago; building that out is currently on the plate.
from scrapers-us-municipal.
It depends on both reporting and admin work, which is what team OCD is on right now.
from scrapers-us-municipal.
okay, thanks for the update!
from scrapers-us-municipal.
so let's be clear about what's being resolved. there are three kinds of resolution that pupa performs:
1. Deduplication of objects to be imported
Each importer also checks to make sure that it is not creating duplicate entries in the database. Most of this is done in the _prepare_imports method of the importer. BaseImporter uses omnihash to detect duplicate objects and constructs self.duplicates, a map that establishes one (arbitrarily selected) identifier as the "true" new object, and all other matching objects as duplicates thereof.
As with get_object, the other importer subclasses do specialized checks in their _prepare_imports methods. PersonImporter, for instance, checks to make sure that there are no people who share a name and cannot be disambiguated using their birth dates.
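To make the deduplication step concrete, here is a minimal sketch of hashing objects by content and building a duplicates map. It is not pupa's code: content_hash and find_duplicates are hypothetical helpers, the "_id" field name is an assumption, and SHA-1 over sorted JSON stands in for whatever omnihash actually does. The first object seen plays the role of the arbitrarily selected "true" object.

```python
import hashlib
import json


def content_hash(obj):
    """Hash a JSON-serializable dict by content, ignoring its own '_id'.

    Hypothetical stand-in for a content hash; not pupa's omnihash.
    """
    payload = {k: v for k, v in obj.items() if k != "_id"}
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha1(encoded).hexdigest()


def find_duplicates(objects):
    """Return (unique_objects, duplicates).

    duplicates maps each redundant '_id' to the '_id' of the first
    object seen with identical content.
    """
    seen = {}        # content hash -> id of the object kept
    duplicates = {}  # duplicate id -> id of the object kept
    unique = []
    for obj in objects:
        digest = content_hash(obj)
        if digest in seen:
            duplicates[obj["_id"]] = seen[digest]
        else:
            seen[digest] = obj["_id"]
            unique.append(obj)
    return unique, duplicates


# Made-up scrape output: the first two objects differ only in their ids.
scraped = [
    {"_id": "a1", "name": "Committee on Finance", "classification": "committee"},
    {"_id": "b2", "name": "Committee on Finance", "classification": "committee"},
    {"_id": "c3", "name": "Committee on Rules", "classification": "committee"},
]
unique, duplicates = find_duplicates(scraped)
```

Only "a1" and "c3" would go on to be imported; "b2" is recorded as a duplicate of "a1" so that later references to "b2" can be redirected.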
2. Resolution against objects imported by previous importers
Importers that deal with objects with references to other objects must make sure that they don't create duplicates. The BillImporter, for instance, performs its import after the IndividualImporter and OrganizationImporter. It checks with those importers to resolve identifiers and make sure it uses the (again, arbitrarily selected) "main" identifier as indicated by those importers' resolve_json_id methods.
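As a rough illustration of this second kind of resolution, the sketch below shows a bill's sponsorship references being resolved through importers that ran earlier. ToyImporter, resolve_sponsorships, the json_to_db_id map, and all ids are made up for the example; only the resolve_json_id name comes from the description above, and its body here is an assumption about the general idea rather than pupa's implementation.

```python
class ToyImporter:
    """Hypothetical stand-in for an importer that has already run."""

    def __init__(self, duplicates, json_to_db_id):
        self.duplicates = duplicates        # duplicate scrape id -> "main" scrape id
        self.json_to_db_id = json_to_db_id  # "main" scrape id -> database id

    def resolve_json_id(self, json_id):
        """Map a scrape-time id to a database id, or None if it is unknown."""
        if json_id is None:
            return None
        main_id = self.duplicates.get(json_id, json_id)
        return self.json_to_db_id.get(main_id)


def resolve_sponsorships(bill, person_importer, org_importer):
    """Return copies of the bill's sponsorships with database ids filled in."""
    resolved = []
    for sponsorship in bill["sponsorships"]:
        link = dict(sponsorship)
        link["person_id"] = person_importer.resolve_json_id(
            sponsorship.get("person_id"))
        link["organization_id"] = org_importer.resolve_json_id(
            sponsorship.get("organization_id"))
        resolved.append(link)
    return resolved


# Made-up ids for illustration only.
person_importer = ToyImporter({"p-dup": "p-main"}, {"p-main": "person-db-1"})
org_importer = ToyImporter({}, {"o-main": "org-db-9"})
bill = {
    "sponsorships": [
        {"name": "Some Alderman", "person_id": "p-dup"},
        {"name": "Some Committee", "organization_id": "o-main"},
        {"name": "Unmatched Sponsor", "person_id": "p-missing"},
    ]
}
resolved = resolve_sponsorships(bill, person_importer, org_importer)
```

A reference that neither importer knows about comes back as None, which is one way a sponsor can end up with a null id after import.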
3. Resolution against database
Prior to import, resolution is performed against existing data in the database to be sure that the appropriate operation is performed (either a create, update, or no-op). This happens in the import_item method of the BaseImporter class.
import_item calls out to get_object, which is implemented in the various subclasses of BaseImporter. Each one performs its own specific kind of check for whether an object already exists in the database. The PersonImporter, for instance, checks for already-imported people using their names and dates of birth.
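The create / update / no-op decision can be sketched as follows. This is a toy version that uses a plain dict as the "database" and a hypothetical key of name plus birth_date; the function only borrows the import_item name from the description above and does not reproduce pupa's actual logic.

```python
def import_item(data, database, key_fields=("name", "birth_date")):
    """Apply one scraped record to a dict-backed "database".

    Returns "create", "update", or "noop" depending on what was needed.
    The key_fields lookup is an assumed stand-in for get_object.
    """
    key = tuple(data.get(field) for field in key_fields)
    existing = database.get(key)
    if existing is None:
        database[key] = dict(data)  # nothing matched: create
        return "create"
    if existing == data:
        return "noop"               # matched and identical: nothing to do
    existing.update(data)           # matched but changed: update in place
    return "update"


# Made-up record for illustration only.
database = {}
person = {"name": "Jane Example", "birth_date": "1970-01-01", "image": ""}
first = import_item(person, database)
second = import_item(person, database)
third = import_item(dict(person, image="http://example.com/jane.jpg"), database)
```

Note that in a scheme like this, a field that is not part of the comparison would never trigger an update, so a rerun can report no-ops even though the scraper code changed.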
Not implemented: Resolution across jurisdictions
All of the resolutions above are performed within a single jurisdiction. Objects in one jurisdiction are not resolved with objects in another. The OpenStates team has done such resolutions before, but only manually.
Not implemented: Resolution across databases
If two separate instances of pupa scrape the same jurisdiction but write to different databases, pupa doesn't do anything to resolve the objects in one database with the objects in another. Nor does it resolve objects against any central authoritative database.
from scrapers-us-municipal.
Thanks! Right now, I'm concerned with 2. So I need to dig into resolve_json_id.
from scrapers-us-municipal.
Cool! I got sponsors linked now, but I don't see resolve_json_id ever called for event participants or bills attached to events. The only thing I see there is an 'id' and it looks like it won't accept a pseudo_id.
from scrapers-us-municipal.
@boblannon This works for sponsors, https://github.com/opencivicdata/scrapers-us-municipal/pull/58/files#diff-0d0f0450f469b3ad1598867cde71cb7bR42
Is that the right way to do it?
from scrapers-us-municipal.
@paultag is probably a better one to ask. I don't have a lot of experience with legislative objects.
from scrapers-us-municipal.
@fgregg and i discussed this off thread, but one way of addressing this is using pseudo ids. another way, which i used when working with disclosures, is to give bill importers access to (previously run) person and organization importers and resolve the ids using the person and org importers' internal resolution functions. i did that in pupa/#170 (which is still open because it's rather oversized and requires some upstream choices to be made on things like JSONField), but @fgregg plans to excise the relevant lines of code and submit them as a separate PR.
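For readers unfamiliar with the pseudo id approach, here is a minimal sketch of the idea. It assumes a pseudo id is a "~" followed by JSON-encoded lookup fields, which is my understanding of the convention and is not confirmed in this thread. make_pseudo_id, resolve_pseudo_id, and the record layout are hypothetical names for this example, not pupa's API.

```python
import json


def make_pseudo_id(**fields):
    """Encode lookup fields as a pseudo id: '~' plus sorted JSON (assumed format)."""
    return "~" + json.dumps(fields, sort_keys=True)


def resolve_pseudo_id(pseudo_id, records):
    """Return the id of the single record matching every encoded field.

    Returns None when nothing matches or when the match is ambiguous.
    """
    if not pseudo_id.startswith("~"):
        raise ValueError("not a pseudo id: %r" % pseudo_id)
    spec = json.loads(pseudo_id[1:])
    matches = [
        record for record in records
        if all(record.get(key) == value for key, value in spec.items())
    ]
    if len(matches) == 1:
        return matches[0]["id"]
    return None


# Made-up records standing in for already-imported organizations.
organizations = [
    {"id": "org-db-1", "name": "Committee on Finance"},
    {"id": "org-db-2", "name": "Committee on Rules"},
]
pseudo_id = make_pseudo_id(name="Committee on Finance")
resolved_id = resolve_pseudo_id(pseudo_id, organizations)
missing_id = resolve_pseudo_id(make_pseudo_id(name="No Such Committee"), organizations)
```

The trade-off against the importer-access approach is that the lookup is deferred: the scraper only states what it knows (a name, say), and the match happens at import time against whatever has already been imported.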
from scrapers-us-municipal.