Comments (27)
Looping in @drinks, who manages our CapitolWords project, which contains a big CR parser.
The text of amendments would be a huge get, and something I've wanted to parse out of the CR for a while. But I have no idea how hard that is.
from congress.
Last time I did it (which was years ago), it wasn't really hard, but annoying to deal with CR scraping on THOMAS.
If you have Subversion installed, you can get my old Perl CR parser this way:
svn cat svn://govtrack.us/govtrack/gather/us/parse_record.pl
The last two subs handle the amendment-specific stuff. Only about 100 lines. Probably nothing salvageable, but maybe a good sanity check on logic.
Thanks Chris (and Eric and Dan).
from congress.
This could also be a starting place, though I'm not sure how complete:
http://capitolwords.org/api/1/text.json?apikey=<key>&title=text%20of%20amendments
from congress.
I think we can get it all straight from THOMAS. Every amendment has a link to "text of amendment," which links to a landing page, which in turn links to a custom query (using their weird ephemeral URLs). Since amendments_info.py already hits this page, I may start there.
from congress.
That's welcome. The ideal thing here would be to extract and reuse whatever code CapitolWords is using to parse the CR, as something like a python-congressional-record library that unitedstates/congress could wield to retrieve amendment text and store metadata for it. But if that's not feasible, then doing a one-off that uses THOMAS.gov is fine.
I hope Congress.gov preserves the same features and linkage, though. THOMAS.gov's shutoff date is getting closer.
from congress.
I can't use permalinks for this stuff that I'm aware of, but it looks like if you go to an amendment's detail page and use the "Text of Amendment as Submitted" link, it takes you to a pretty terrible page, which links to a "printer friendly" page which looks a lot less terrible. I just looked at S. Amdt. 26 as an example, whose printer friendly page looks reasonable enough. The scraper might be a bit annoying and stateful, given the lack of permalinks.
My suggestion is to fetch text for amendments whose info has already been fetched, and maybe as a separate script, like amendment_text.py, which outputs new files in that amendment's output directory. The amendment number itself doesn't seem to appear on the printer friendly page.
from congress.
Very good suggestion. Otherwise, THOMAS appears to behave differently based on the length of the text, since some amendments are all of 200 words. The CapitolWords project looks awesome, but too complex for me to integrate at the moment.
from congress.
Actually, does this URL structure mean anything to anyone?
http://thomas.loc.gov/cgi-bin/query/z?r113:S12MR3-0044:/
That's what I get if I use the share button in the upper right and go to "save" for SA27, a short Rubio amendment to SA26.
SA26 goes to here:
http://thomas.loc.gov/cgi-bin/query/z?r113:S11MR3-0039:/
I don't see any immediately obvious way to get this URL directly from the amendment number, though.
from congress.
AFAICT, the format of the identifier is:
z?r[congress]:[chamber][day][month][year]-[item ID]:
Where [chamber] is "H" or "S" (or "E" for extension of remarks or "D" for digest), [day] is the 2-digit day of the month, [month] is a 2-character code for the month name, [year] is the last digit of the year ("3" for 2013), and [item ID] is the 4-digit number of the item on the day's register.
For example, SA27 is the 44th item on the Senate register for March 12, 2013 in the 113th Congress:
http://thomas.loc.gov/cgi-bin/query/B?r113:@FIELD%28FLD003+s%29+@FIELD%28DDATE+20130312%29
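For anyone who wants to pull those fields apart programmatically, here is a minimal sketch in Python that follows the format described above. The pattern name and the function `parse_thomas_id` are hypothetical, not part of the congress repo, and since the only month code seen in this thread is "MR", the month is returned as the raw 2-character code rather than guessed at.

```python
import re

# Hypothetical pattern for identifiers such as "r113:S12MR3-0044:",
# built only from the format described in this thread.
THOMAS_ID = re.compile(
    r"r(?P<congress>\d+):"
    r"(?P<chamber>[HSED])"      # H, S, E (extensions of remarks), D (digest)
    r"(?P<day>\d{2})"
    r"(?P<month>[A-Z]{2})"      # raw 2-character month code, e.g. "MR"
    r"(?P<year_digit>\d)"       # last digit of the year, e.g. 3 for 2013
    r"-(?P<item>\d{4}):"
)


def parse_thomas_id(url_or_id):
    """Return the identifier's fields as a dict, or None if nothing matches."""
    match = THOMAS_ID.search(url_or_id)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "congress": int(fields["congress"]),
        "chamber": fields["chamber"],
        "day": int(fields["day"]),
        "month_code": fields["month"],
        "year_digit": int(fields["year_digit"]),
        "item": int(fields["item"]),
    }


info = parse_thomas_id("http://thomas.loc.gov/cgi-bin/query/z?r113:S12MR3-0044:/")
```

This only decodes an identifier you already have; it does not solve the problem of getting from an amendment number to its identifier.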
from congress.
Thx, Gordon!
Just pushed a branch called amendment_text. I'm having some encoding nightmares, so there's a sloppy catch in the utils unescape function right now, with a URL in the comments that will break the original.
To try it, just pass --fulltext to the amendments call (see README in branch). Needs some work.
from congress.
I reworked it to just pull the text file of amendments from each day's CR, e.g.
http://www.gpo.gov/fdsys/pkg/CREC-2013-03-21/html/CREC-2013-03-21-pt1-PgS2169.htm
Much easier to download this file and regex it than to crawl through THOMAS.
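A rough sketch of the "regex it" step, in Python. It assumes each amendment in the day's text starts on its own line with a heading like "SA 27. Mr. ..."; that heading shape is an assumption about the CR layout, and `split_amendments` and the `SAMPLE` text are invented for illustration, not taken from the branch. Downloading the file is left out so the snippet runs offline.

```python
import re

# Assumed heading shape at the start of each amendment: "SA <number>. "
AMENDMENT_START = re.compile(r"^[ \t]*SA (\d+)\. ", re.MULTILINE)


def split_amendments(cr_text):
    """Map amendment number -> text running up to the next amendment heading."""
    starts = list(AMENDMENT_START.finditer(cr_text))
    amendments = {}
    for i, match in enumerate(starts):
        # Each amendment ends where the next heading begins (or at end of text).
        end = starts[i + 1].start() if i + 1 < len(starts) else len(cr_text)
        amendments[int(match.group(1))] = cr_text[match.start():end].strip()
    return amendments


# Invented sample text, not real Congressional Record content.
SAMPLE = """TEXT OF AMENDMENTS

  SA 26. Ms. EXAMPLE proposed an amendment to the bill; as follows:
  Strike section 1.

  SA 27. Mr. SAMPLE submitted an amendment; as follows:
  On page 4, line 6, decrease the amount by $20,000,000,000.
"""

amendments = split_amendments(SAMPLE)
```

A real page would first need its HTML wrapper stripped, and headings that do not fit the assumed pattern would be silently merged into the previous amendment.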
from congress.
That's exactly the approach Capitol Words takes. CW doesn't do anything special to parse out amendment text right now, but you still might look at its parser for lessons and ideas.
from congress.
In fact, I really should loop in @bycoffe too, since I believe he wrote the original version of that parser.
from congress.
Awesome, will do.
Has anyone tried to parse the amendment text and connect to the original legislation?
e.g., for "On page 4, line 6, decrease the amount by $20,000,000,000."
to connect to whatever's on page 4, line 6 of the actual legislation?
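To make the question concrete, here is a minimal Python sketch that turns that one kind of sentence into a structured instruction. It handles only the "increase/decrease the amount by" phrasing; the "increase" variant, the function name `parse_amount_instruction`, and the output keys are assumptions for illustration. Looking up page 4, line 6 in the bill itself is a separate step not shown here.

```python
import re

# Matches only sentences shaped like the example above.
INSTRUCTION = re.compile(
    r"On page (?P<page>\d+), line (?P<line>\d+), "
    r"(?P<action>increase|decrease) the amount by \$(?P<amount>[\d,]+)\."
)


def parse_amount_instruction(sentence):
    """Return page, line, action and dollar amount, or None if no match."""
    match = INSTRUCTION.search(sentence)
    if match is None:
        return None
    return {
        "page": int(match.group("page")),
        "line": int(match.group("line")),
        "action": match.group("action"),
        # Drop the thousands separators before converting to an integer.
        "amount": int(match.group("amount").replace(",", "")),
    }


parsed = parse_amount_instruction(
    "On page 4, line 6, decrease the amount by $20,000,000,000."
)
```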
from congress.
As far as I know, no one has tried either that or anything in the bigger picture of parsing how things change other things.
from congress.
People sure do talk about it a lot, though! Everyone wants this. Are you a bad enough dude to connect amendments to legislation through page and line numbers?
from congress.
Any suggestions for how to integrate the fdsys routines into the amendment task, thus grabbing the unadulterated text of the amendment directly from GPO? I have the script to download the text file from GPO, parse it, and place it in the right place, but right now it takes manual URLs from GPO (like the one a few messages back). Not sure how to go from an amendment (say, SA 136) to its location in the Congressional Record.
FYI, here's a crude demo of matching amendments to legislation. I can upload the guts of it once I can get amendment text working right.
http://experimentsinform.com/media/demos/revisions/site/
from congress.
...whoa. Even this "crude demo" is more interesting than anything I've seen yet in the US for matching up amendments.
You can see an example of me using fdsys.py as a support library (instead of a standalone task) in bill_versions.py, which downloads bill text and version metadata from GPO and outputs JSON files for each version of each bill:
https://github.com/unitedstates/congress/blob/master/tasks/bill_versions.py
I actually haven't integrated bill_versions output into my production systems yet (still using an older Ruby-based GPO sync script), but I plan to.
from congress.
Awesome, thank you!
from congress.
The "amendment_text" branch now has a "parse" command that attempts to interpret the text of the amendments (which now are retrieved from GPO). It's smart enough to figure out how to pinpoint "Title III" to a page and line if the amendment doesn't specify the exact location, but still has a long ways to go. See README.
from congress.
I just want to pop in to say that this is pretty awesome work.
from congress.
Thanks! Will take a lot more. All help much appreciated.
from congress.
Hey @wilson428, could I summon you to do a quick summary of the State of Amendment Text?
I'm putting the finishing touches on getting a /amendments endpoint done in Sunlight's Congress API using this data, and am making it full-text searchable on the purpose and description only. I'd love to get the full text indexed as well, even if it's not in structured form the way that bill text is.
What's the remaining path to dropping .txt files, even unformatted blob-of-words style, for each amendment?
from congress.
Awesome! This stalled over issues of where the raw text of amendments is stored in the CR, particularly in the House. I was following links from THOMAS to the CR to get the full text, but the agreement seemed to be that I should be using the fdsys scripts instead.
Then it turned out that House amendments in the CR are buried in the text of the daily activity. There was some sense that the text could be retrieved from a committee, I think?
I would love to dive back in, though I'd prefer to work on a forward-looking solution that doesn't die with THOMAS's end-of-life. Suggestions?
from congress.
If the text appears in the CR for both chambers, that's probably the most reliable path, even if the parsing is particularly messy. There's a lot of value even if the text is extracted in a non-readable, but indexable, form.
I don't think any committee would have floor amendments, although committee reports (which are released after huge delays) might have committee amendments in them sometimes. Also possible that the House Committee Repository might grow to include them for the House some day.
from congress.
Gotcha. I'll dive back in this week and give a more detailed report. Thanks for the nudge.
from congress.
And to be clear, I'm happy to work on this, either primarily or secondarily, once I emerge from my current unrelated work blitz.
from congress.