This repository contains documentation for developers including:
- Writing Scrapers using Pupa
- Open Civic Data's Data Type Specifications
- Open Civic Data Proposals
Read these docs at https://open-civic-data.readthedocs.io/en/latest/
Open Civic Data project documentation
Home Page: https://open-civic-data.readthedocs.io
On the Filing data type in the campaign finance enhancement proposal, coverage_start_date and coverage_end_date are defined as date and (possibly) time values.
None of the filings we'll be loading from California will have time data associated with these values. Obviously, that's just one state, but do we have a sense of how many jurisdictions will have precise start and end times for each filing's coverage? And even if these precise times are often available, how important is this precision beyond the start and end date?
Maybe I'm getting too far into the implementation details in this space, but I want to be clear what I'm doing. For now, I'll leave these as DateTime fields and plan to set the time portion to midnight UTC when unknown.
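A minimal sketch of that placeholder approach, assuming only Python's standard datetime module. The helper name coverage_datetime is hypothetical and not part of any OCD package:

```python
from datetime import date, datetime, time, timezone


def coverage_datetime(day: date) -> datetime:
    """Promote a date-only value to an aware datetime at midnight UTC.

    Hypothetical helper: midnight UTC stands in for "time unknown".
    """
    return datetime.combine(day, time.min, tzinfo=timezone.utc)


# Example coverage period for a filing whose source only has dates.
coverage_start = coverage_datetime(date(2017, 1, 1))
coverage_end = coverage_datetime(date(2017, 6, 30))
print(coverage_start.isoformat())
```

One caveat with this convention: a filing that really does start at midnight UTC can't be told apart from one whose time is unknown.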
If the precision is important, then we might consider, someday, converting these to "fuzzy" DateTime fields that don't require time parts. OCD is already doing something like this for date fields where parts of the date are missing. But it would be better if we converted these to a custom model field with field lookups allowing users to query these fields as if they were regular Date or DateTime fields.
Something @palewire and I were just discussing. This is something we touched on in the previous conversation, but I don't think we fully resolved.
In CA, filers don't have to itemize contributions under $100. Instead, they can report a total for all unitemized contributions. It isn't clear where this would be recorded in our current schema, and it's pretty essential for figuring out the total amount raised.
We think we need a place to store this and other total or summary amounts. What we have in mind is an optional, repeating .totals or .summaries field on Filing that would have the following properties:
Would be interested to hear from others whether this would adequately cover other kinds of filings in other jurisdictions. Maybe in some cases we might also need to link the totals to specific elections?
@jpmckinney wrote:
The "Areas vs. Jurisdictions and Divisions" section conflates two questions: "Jurisdiction versus Division" and "Why we call it Division and not Area". An Area in Popolo is the same thing as a Division. There is nothing in Popolo for a Jurisdiction, and there will probably never be †. If you find it necessary to describe the difference between Jurisdictions and Divisions, that's probably best done in 0003 where Jurisdictions are introduced.
So, for the remaining question of "Why we call it Division and not Area". The justification given is "Area does (in essence) equate to Division however, and the different terminology a remnant of decisions made prior to Area being introduced in Popolo." My reasons for "area":
If area_id is adopted, this should be used in 0003 as well.
†: FYI, the reason jurisdictions will likely never be part of Popolo is that I don't think a jurisdiction actually exists distinctly from its top-level organization. I understand that, for OCD, having jurisdictions makes organizing data and APIs and writing code easier. However, from a pure modeling perspective, there's no real thing as a "jurisdiction".
@jamesturk wrote:
We can clarify this better in that paragraph, but I disagree about Area & Division being the same thing, and this is unlikely to change.
In response to specific points:
1 & 2. We actually do use the term division and have since day one, the use is integrated into other APIs that use Open Civic Data division identifiers too.
3. I'd disagree as to whether or not it is confusing, but more importantly OCD Divisions do exist and have properties (ones that do not match the Area schema Popolo added).
4. We have to be pragmatic and value backwards compatibility and practicality more than conformance, I think 100% adherence is unlikely especially given some fundamental differences in how Votes will be handled. If you prefer we stop using the term Popolo we can, or just give a nod to the fact that we were inspired by Popolo, but we aren't going to be adding/changing things like this for the sake of compliance.
@jpmckinney wrote:
It sounds like I need more information than I currently have, in order to evaluate the best way forward.
@jamesturk wrote:
2- Google uses the term division: https://developers.google.com/civic-information/docs/v1/divisions
As does OpenElections: https://github.com/openelections/specs/wiki/Elections-Data-Spec-Version-2
The endpoint for divisions has been http://api.opencivicdata.org/divisions/ since it was published.
3- Since the properties are all optional, I suppose this part is manageable, but there's a mismatch in thinking between Divisions & Areas and how they relate to boundaries. Divisions do not have a boundary but instead have a relationship to a boundary with start & end times. This seems like it'd be another noncompliance point since we'd have a field with (arguably) the same purpose but a different name & structure.
4- These two cases are recent enough revisions and were they the only place I'd be OK with it. I'm more concerned about us being a moving target for others, we've used the term division in our endpoints and IDs for over a year, changing things on them now just isn't practical.
5- Point well taken, we're behind on getting you that feedback but I've just asked Paul to chime in with that today.
@jamesturk wrote:
there is also at least one vendor API that is using the term division_id internally, that's less of an issue (esp. as they haven't published it yet) but worth noting
See opencivicdata/scrapers-us-municipal#17
@jamesturk and @paultag have some ideas for this.
Also, is Filing.filer guaranteed to be a Committee?
Per discussion at #71 (comment) - we'd like to be able to model the purpose and current status of campaign committees, which will be Organizations.
write a general introduction to contributing as a non-developer - pointing people at anthropod, etc.
maybe a good time to come up with a non-anthropod name for the deployment
(Creating this as a placeholder for revisiting during a major revision.)
Related Slack conversation: https://opencivicdata.slack.com/archives/pupa/p1454452385000025
On my reading, from_organization struck me as the organization (in my case, a committee) from which the Bill originated. The actual intention is that it represents the parent legislature organization. My thought is that this should probably be reflected in a better name, perhaps from_legislature.
cc: @fgregg
explain full text search, explain how operators break down
To make code list changes easier to review independently, I propose the following policy.
Other issues:
For all lists:
BILL_ACTION_CLASSIFICATION_CHOICES are not titlecase.
For ORGANIZATION_CLASSIFICATION_CHOICES and BILL_CLASSIFICATION_CHOICES:
For BILL_ACTION_CLASSIFICATION_CHOICES:
committee-passage-unfavorable.
It's not clear that CommitteeStatusUpdate can't be rolled into CommitteeAttributeUpdate - just add description to CommitteeAttributeUpdate.
Also, the class name already establishes the semantics, so I'd change attribute_to_update to property and new_attribute_value to value.
Right now this repo has:
I think in reconsidering the purpose of this repository it should have:
The "how cities can adopt OCD" stuff is stale and not worth keeping IMO. If the need arises I imagine we'd take a different approach now than what Sunlight was pursuing when those were started.
And the pupa docs should probably be moved to the pupa repository (& linked from the intro page)
If others are OK with this I'd like to start on this soon so that we can have Open States docs reference good OCD docs where appropriate
The /people endpoint is returning a 500 error this morning, but it appears the HTTP status code is 200, while the text says "
All errors should be returned as an HTTP status code so clients will act correctly.
In Chicago we have individual level participation in events (which aldermen went to what council meeting or committee meeting). @paultag asked me to open up an issue for extending the Events OCDPEP for this data.
says that the range for page is (0 - max_page), but that's evidently not true?
https://api.opencivicdata.org/jurisdictions/?apikey=xxxx&page=0
response:
{
"error": "No such page (heh, literally - its out of bounds)"
}
in https://github.com/opencivicdata/docs.opencivicdata.org/blob/master/proposals/drafts/elections.rst
a new Party type is created that is effectively a subclass of Organization that adds:
I'm not convinced on any of these fields being appropriate for inclusion and want to discuss their purpose & usage.
color
Parties (in the US at least) don't have official colors, red & blue have only been in use for the past. This seems like a property that should exist in applications using colors to represent parties, not in the core metadata.
Looking at Wikipedia, it does seem that some parties in other countries have official colors, but it is often more than one.
is_write_in
This doesn't feel like a permanent feature of a party either, as it would vary a lot from election to election. Ballot access is a complicated thing. In general, no party is guaranteed ballot access in a particular race; it usually depends on their showing in the last election. A good example here would be the US Green party, which currently has ballot access in 19 states, but that changes year to year.
abbreviation
This could be a useful addition, but I also wonder if it is redundant w/ alternate names.
It might make more sense to have a recommendation that organizations have an alternate name added with a particular note set.
Open to being convinced on any of these, but wanted to start the discussion before we had an implementation.
show the content of arrays better than we currently are, right now if there's an items key we pass unless there's also properties
The docs currently have a note on them:
Parts of Open Civic Data underwent a large refactor as of mid-2014, some information on this page may be out of date.
We’re working on updating this documentation as soon as possible.
Has the documentation been updated yet? If not, any thoughts on when it might be?
scraped_data is _data, scrape_cache is _cache, --fast is --fastmode
Correct all mentions of:
- get_scraper
- scrape_* (e.g. scrape_people)
- get_* (e.g. get_people) becomes scrape
- Legislator
I'm working on filling out @aepton's implementation of the Campaign Finance enhancement proposal, and I've got a few questions regarding the spec.
Currently, a Committee can be linked to a Jurisdiction in at least two ways:
- Committee references a CommitteeType which references a Jurisdiction.
- Through Organization, Committee also inherits an optional reference to a Jurisdiction.
Feels redundant. Looking back at our earlier convo, seems like Committee Type was defined as its own data type because we believe the available types and their regulatory meanings will vary across jurisdictions.
I don't doubt this, but I wonder how much bearing it should have on the specification. Why does it matter if "candidate" type committees in WA file at a different frequency, disclose different transactions or otherwise have different rules from "candidate" type filing committees in IL? What's wrong with allowing the filings, transactions, etc. associated with these committees to differ depending on the committee's jurisdiction, even while they're all grouped under the same label?
This of course wouldn't rule out jurisdiction-specific ETL code for integrating the committees and filings into the jurisdictionally agnostic models. I'm just saying the rules surrounding committee types might not need to be reflected in the schema.
For now, I've left the .jurisdiction field off of the Committee model. But my proposal would be:
- Drop Committee Type as its own data type and, instead, have an open text .type or .classification field on Committee
- Use the .jurisdiction field Committee inherits from Organization.
The image field on Person and Organization models is only for a URL string. It seems metadata attributing credit or name of the source for the photo (GPO Member Guide) and a possible note (e.g. "Official Congress Headshot") would provide more complete information about the photo.
I figure more tools for scraping and importing data into a common format is good.
- classification and order on Agenda Items
- date, text, links on Documents
- name was renamed to note on Media
- note on Location, remove it?
- type on Media, remove it?
The docs should include or link to instructions for running the database locally--something like
Adding docker instructions would be neat too!
first proposed here: opencivicdata/python-opencivicdata#30
I don't think I get the logic of making contributors "persons"--is this an optional designation?
@aepton, @jpmckinney : Consider the case of a committee giving to another committee. Wouldn't that mean the committee is a committee (and hence a subtype of a popolo org) when it is receiving money and other times the committee is a person (and then a subtype of a popolo person) when it is donating? To my nose that doesn't smell right and makes tracking the flow of money harder, not easier.
Moreover, differentiating contributor types is often the point of this sort of work, even if there aren't easy answers available in the source data. Being able to say that XX percent of funds came from corporate donors is pretty powerful... I don't really understand the rules here, but I'd make donor type its own field, where person and organization are options, but only assigned if there's solid reason for thinking this (in many jurisdictions this info can be gleaned, at least in part, though I'm sure that's not true everywhere). And, of course, detailed local knowledge may be the only way to know for sure...
Not sure where issues of data format modification suggestions should be raised so raising it in this project. Also, the spec in the docs may be outdated so if these things were later included then apologies.
Different kinds of votes in different legislatures require different percentages of support to pass. Seems like important information to store about a vote. The @opencongress congress scrapers include this attribute in votes as you can see in the following example.
{
"requires": "1/2",
"result": "Failed",
"result_text": "Failed"
}
The result_text may also be relevant since we're storing passing as a boolean value instead of the actual text specific to the vote type. For instance, "Nomination Confirmed" for federal votes in OCD would be reduced to a boolean value on passed so we'd lose how the legislature labels the passing of the vote.
Arguments for / against including these attributes?
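To make the case for a requires attribute concrete, here is a rough sketch of how a data user might apply a value like "1/2" to a vote count. The function name and the strict flag are assumptions for illustration; whether a threshold must be exceeded or merely met varies by legislature and is not something the source data above tells us:

```python
from fractions import Fraction


def meets_requirement(yes: int, no: int, requires: str, strict: bool = True) -> bool:
    """Check a yes/no tally against a threshold string such as "1/2".

    Hypothetical helper. strict=True means the yes share must exceed the
    threshold; strict=False means meeting it exactly is enough.
    """
    threshold = Fraction(requires)  # Fraction parses strings like "1/2"
    total = yes + no
    if total == 0:
        return False
    share = Fraction(yes, total)
    return share > threshold if strict else share >= threshold


print(meets_requirement(51, 49, "1/2"))
print(meets_requirement(66, 34, "2/3", strict=False))
```

Without requires on the vote, a consumer has to hard-code this rule per chamber and per vote type.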
Once you have a working scraper, you need a way to run it regularly and monitor its status. Could you add some docs on your Jenkins setup (or whatever else)?
Now that California Civic Data Alliance has published data with the election models, and the models have gotten a good shakedown, are we ready to accept the Election PEP?
Right now, all bill actions have an organization attribute:
organization, organization_id: The organization that this action took place within.
This seems not quite right for the actions of 'signing' or 'vetoing' done by the executive.
Here are the two ways I've approached this. Neither seems quite right to me.
bill_action = {'description': 'Veto',
               'date': action_date,
               'organization': 'Office of the Mayor',
               'classification': 'executive-veto'}
I don't really like this approach because the mayor doesn't get to veto legislation because he holds a position in the office of the mayor. He can do it because he is the mayor of the "City of Chicago".
So this is my current approach.
bill_action = {'description': 'Veto',
               'date': action_date,
               'organization': 'City of Chicago',
               'classification': 'executive-veto',
               'related_entities': [{'name': 'Rahm Emanuel', 'entity_type': 'person'}]}
I like this better, but "City of Chicago" also doesn't quite seem like the right container.
I think I would like the following to be legal; note the absence of organization:
bill_action = {'description': 'Veto',
               'date': action_date,
               'person': 'Rahm Emanuel',
               'classification': 'executive-veto'}
Thoughts?
Event object should list agenda as one of its properties.
The URL http://www.w3schools.com/xpath/xpath_syntax.asp is invalid.
Fixing with the correct URL.
In CA, I believe all transactions in a filing apply to the same election, but seems like this isn't the case in other jurisdictions. For example, in our earlier convo, @LindsayYoung raised the point that in the FEC schema one filing could apply to multiple elections, especially during primary season.
However, Lindsay also suggested that "election is generally more useful on the transaction level". If so, modeling that transaction-to-election relationship feels more straightforward and precise.
Would it be an improvement, then, if we:
- Remove the .election field from Filing
- Add an .election field to Transaction (and maybe CommitteeAttributeUpdate too)?
cc @aepton
Consistency:
- organization to match other schema, instead of from_organization?
- entity_type to _type?
Choice of terms:
- mimetype is old-fashioned. This has typically been called content_type for some time.
- versions.links and documents.links: These refer to different forms of the version/document. DCAT uses the term distributions. Whatever term you decide on, DCAT's definition of Distribution is very clear and can maybe be reused. Strictly speaking, links don't have a content type, but distributions do.
- versions.name and documents.name: Based on the examples, these are not really the names/titles of the documents - maybe note is closer to the intended meaning?
- For summaries and other_titles, it's not clear why the property name is text. I would either expect value (as in ContactDetail, Count, and most future Popolo subdocuments) or the singular form (as in Identifier, OtherName) - in this case summary or title.
Here are some suggested terms from the Dublin Core Metadata Terms, which Popolo is likely to adopt in a generic Document class, since they are the most broadly used metadata terms:
- identifier instead of name, since HB 2117 is better described as an identifier than as a name. The docs already acknowledge that name is easily confused with title.
- abstract or abstracts instead of summaries
Questions:
- Is primary a classification?
- Is actions.text a description of the action ("Referred to committee"), or the actual text of the action which may be identical to the text of a motion ("That Bill HB-1 be referred to the Committee on House Administration"), or both? text suggests that the action text is taken from official proceedings, but the definition of the term suggests it's a description of the action, not its official text. Depending on the most common case, it may be clearer to name it description.
Sometimes the source text for motions and bills is incomprehensible. It's been Open States practice to rewrite for human readability.
As a data user, I want to know the provenance of the information. If motion or bill title has been rewritten for clarity, I want to know that, and I'd also like to know the original text.
@showerst proposed adding an attribute like this to objects with modified texts (in his example a motion). { "modified" : {"motion": "cp/h lwr"}}
@jamesturk suggested that we use the existing extras field for this, since we may not be ready to standardize on this practice.
I'm curious to hear @jpmckinney's thoughts, as this would affect objects that are part of popolo (which we strive to maintain compatibility with).
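For discussion, a sketch of what combining the two suggestions could look like: the original text is kept under a "modified" key inside extras when a field is rewritten. The helper name, the nesting under extras, and the replacement string are all assumptions for illustration, not an agreed convention:

```python
import copy


def rewrite_field(obj: dict, field: str, readable_text: str) -> dict:
    """Return a copy of obj with `field` rewritten and the source text preserved.

    Hypothetical helper: stores the original value at extras["modified"][field].
    """
    updated = copy.deepcopy(obj)
    modified = updated.setdefault("extras", {}).setdefault("modified", {})
    modified[field] = updated[field]  # keep the text exactly as scraped
    updated[field] = readable_text
    return updated


vote = {"motion": "cp/h lwr", "extras": {}}
clean = rewrite_field(vote, "motion", "Human-readable motion text")
print(clean["extras"]["modified"])
```

A data user can then check for extras["modified"] to see both that a rewrite happened and what the source said.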
Right now, the Event model in opencivicdata-django requires a jurisdiction attribute.
This is so that all events related to a particular legislature can be easily grouped together.
However, not all the things that we want to model within the wider OCD world have jurisdictions, e.g., Election days.
I would like to propose that OCD Events have a scope attribute that can be a jurisdiction_id, division_id or None.
This would allow for current pupa practice to be largely unchanged, but also allow for events that are not associated with jurisdictions.
I don't love the name scope.
Thoughts? @jamesturk @gordonje @jpmckinney
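A rough sketch of the proposed attribute, using a plain dataclass rather than the actual opencivicdata-django model. The class, the scope_kind method, and the example IDs are illustrative assumptions; the only thing taken as given is that OCD jurisdiction and division IDs start with "ocd-jurisdiction/" and "ocd-division/" respectively:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    """Illustrative stand-in for an OCD Event with an optional scope."""

    name: str
    scope: Optional[str] = None  # a jurisdiction_id, a division_id, or None

    def scope_kind(self) -> Optional[str]:
        """Report which kind of ID the scope holds, based on its prefix."""
        if self.scope is None:
            return None
        if self.scope.startswith("ocd-jurisdiction/"):
            return "jurisdiction"
        if self.scope.startswith("ocd-division/"):
            return "division"
        raise ValueError(f"unrecognized scope: {self.scope!r}")


meeting = Event("Council meeting", "ocd-jurisdiction/country:us/state:il/place:chicago/government")
election_day = Event("Election day", "ocd-division/country:us/state:ca")
unscoped = Event("Unscoped event")
print(meeting.scope_kind(), election_day.scope_kind(), unscoped.scope_kind())
```

Existing pupa-style events would keep passing a jurisdiction ID, so the grouping by legislature still works.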
In California, campaign expenditures are reported on Schedule E of Form 460 which does not provide an obvious place for filers to report the date of a payment made. Thus, about 40% of Form 460 Schedule E Items are missing an expense date.
In the current version of the draft spec Transaction.date isn't labeled as optional, but maybe it should be?
In the draft implementation in python-opencivicdata, I'm going to allow NULLs in this field, for now.
This sort of overlaps with #98 in terms of how these decisions facilitate longitudinal analysis of transaction data.
Different kinds of votes in different legislatures require different percentages of support to pass. Seems like important information to store about a vote - more important than shoving it into extra attributes. The @opencongress congress scrapers include this attribute in votes as you can see in the following example.
{
"requires": "1/2",
"result": "Failed",
"result_text": "Failed"
}
The result_text may also be relevant since we're storing passing as a boolean value instead of the actual text specific to the vote type. It may make more sense, though, to push this to extra attributes.
Making a note that the documented sponsorships.id filter on the Bills search should be removed in favor of sponsorships__person__id and sponsorships__organization__id filters.
California campaign finance committees are required to itemize returned contributions on the same schedule that includes received contributions. The real world situation would be something like:
In our source data, we have a line item for the original contribution and another with a negative amount for the returned contribution:
filer | contributor | amount | date | type |
---|---|---|---|---|
GAVIN 4 GOV | JOHN DOE | 1000.00 | 5/1/2017 | Contribution |
GAVIN 4 GOV | JOHN DOE | -1000.00 | 10/1/2017 | Returned |
In mapping these records to the Transaction model, my initial thought was to flip the sender and receiver and take the absolute value of the amount. So the above source records would become:
sender | recipient | amount | date | classification |
---|---|---|---|---|
JOHN DOE | GAVIN 4 GOV | 1000.00 | 5/1/2017 | Contribution |
GAVIN 4 GOV | JOHN DOE | 1000.00 | 10/1/2017 | Returned |
However, @palewire and I discussed further and decided against this approach. We're worried about the potential for inaccuracies when summing the amount field. Instead, we're planning to leave the source values more or less unchanged in loading the Transaction model:
sender | recipient | amount | date | classification |
---|---|---|---|---|
GAVIN 4 GOV | JOHN DOE | 1000.00 | 5/1/2017 | Contribution |
GAVIN 4 GOV | JOHN DOE | -1000.00 | 10/1/2017 | Returned |
Have others seen similar use cases in other jurisdictions and, if so, does our approach for fitting it into our shared models make sense?
If we agree this is proper use, then we might expand the description on Transaction.amount to say that negative numbers are allowed and why.
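The summing concern can be shown with the two amounts from the example tables. This is only a sketch with plain Decimal values, not code against the Transaction model, and the variable names are made up for illustration:

```python
from decimal import Decimal

# Amounts as they would be stored under each approach (values from the tables above).
flipped_amounts = [Decimal("1000.00"), Decimal("1000.00")]   # absolute values, sender/recipient flipped
signed_amounts = [Decimal("1000.00"), Decimal("-1000.00")]   # source values left unchanged

# A naive SUM over the amount column, as an analyst might run it.
naive_flipped_total = sum(flipped_amounts, Decimal("0"))
signed_total = sum(signed_amounts, Decimal("0"))

print(naive_flipped_total)  # overstates what the committee kept
print(signed_total)         # nets out the returned contribution
```

With flipped records, getting the right net figure requires every query to also account for direction or classification; with signed amounts a plain sum already does.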
Event.{description, location, classification} to name a few, probably more? need to look at it closer
also cc @paultag in case he knows what's up
"A governing body that exists within a division. While ‘Florida’ would be a Jurisdiction, the Florida State Legislature would be a jurisdiction."
I assume the second "jurisdiction" is supposed to be "organization"?
If not yet done: Please join the W3C Open Government Community Group (http://www.w3.org/community/opengov/) ! (It very likely will become a lot more active in 2015 ;-)