Comments (27)

konklone commented on July 28, 2024

Looping in @drinks, who manages our CapitolWords project, which contains a big CR parser.

The text of amendments would be a huge get, and something I've wanted to parse out of the CR for a while. But I have no idea how hard that is.

from congress.

JoshData commented on July 28, 2024

Last time I did it (which was years ago), it wasn't really hard, but annoying to deal with CR scraping on THOMAS.

If you have Subversion installed, you can get my old Perl CR parser this way:
svn cat svn://govtrack.us/govtrack/gather/us/parse_record.pl

The last two subs handle the amendment-specific stuff. Only about 100 lines. Probably nothing salvageable, but maybe a good sanity check on logic.

Thanks Chris (and Eric and Dan).

drinks commented on July 28, 2024

This could also be a starting place, though I'm not sure how complete:

http://capitolwords.org/api/1/text.json?apikey=<key>&title=text%20of%20amendments

wilson428 commented on July 28, 2024

I think we can get it all straight from THOMAS. Every amendment has a link to "text of amendment," which links to a landing page, which in turn links to a custom query (using their weird ephemeral URLs). Since amendments_info.py already hits this page, I may start there.

konklone commented on July 28, 2024

That's welcome. Ideally, we would extract and reuse whatever code CapitolWords uses to parse the CR, as something like a python-congressional-record library that unitedstates/congress could wield to retrieve amendment text and store metadata for it. But if that's not feasible, a one-off that uses THOMAS.gov is fine.

I hope Congress.gov preserves the same features and linkage, though: THOMAS.gov's shutoff date is getting closer.

konklone commented on July 28, 2024

I can't use permalinks for this stuff that I'm aware of, but it looks like if you go to an amendment's detail page and use the "Text of Amendment as Submitted" link, it takes you to a pretty terrible page, which links to a "printer friendly" page which looks a lot less terrible. I just looked at S. Amdt. 26 as an example, whose printer friendly page looks reasonable enough. The scraper might be a bit annoying and stateful, given the lack of permalinks.

My suggestion is to fetch text for amendments whose info has already been fetched, and maybe as a separate script, like amendment_text.py, which outputs new files in that amendment's output directory. The amendment number itself doesn't seem to appear on the printer friendly page.
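
A minimal sketch of that "separate script" idea: find amendment directories whose info has already been fetched but which have no text file yet. The file names `data.json` and `text.txt` and the function name are assumptions for illustration, not the project's actual layout.

```python
from pathlib import Path


def amendments_missing_text(data_root, info_name="data.json", text_name="text.txt"):
    """Yield amendment directories that have fetched info but no text file yet.

    info_name and text_name are hypothetical file names, not confirmed
    against the real output directory layout.
    """
    for info_path in sorted(Path(data_root).rglob(info_name)):
        amendment_dir = info_path.parent
        if not (amendment_dir / text_name).exists():
            yield amendment_dir
```

A script like amendment_text.py could loop over these directories, fetch the text for each one, and write it next to the existing info file.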

wilson428 commented on July 28, 2024

Very good suggestion. Otherwise, THOMAS appears to behave differently based on the length of the text, since some amendments are all of 200 words. The Capitol Words project looks awesome, but it's too complex for me to integrate at the moment.

wilson428 commented on July 28, 2024

Actually, does this URL structure mean anything to anyone?

http://thomas.loc.gov/cgi-bin/query/z?r113:S12MR3-0044:/

That's what I get if I use the share button in the upper right and go to "save" for SA27, a short Rubio amendment to SA26.

SA26 goes to here:

http://thomas.loc.gov/cgi-bin/query/z?r113:S11MR3-0039:/

I don't see any immediately obvious way to get this URL directly from the amendment number, though.

GPHemsley commented on July 28, 2024

AFAICT, the format of the identifier is:

z?r[congress]:[chamber][day][month][year]-[item ID]:

Where [chamber] is "H" or "S" (or "E" for extension of remarks or "D" for digest), [day] is the 2-digit day of the month, [month] is a 2-char representation of the month name, [year] is the significant digit of the year ("3" for 2013), and [item ID] is the 4-digit number of the item on the day's register.

For example, SA27 is the 44th item on the Senate register for March 12, 2013 in the 113th Congress:

http://thomas.loc.gov/cgi-bin/query/B?r113:@FIELD%28FLD003+s%29+@FIELD%28DDATE+20130312%29
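
The identifier format described above can be sketched as a regular expression. This is only an illustration of the pattern as guessed in this comment; the function name is made up, and the two-letter month code is kept as-is rather than mapped to a month, since only "MR" appears in this thread.

```python
import re

# r[congress]:[chamber][day][month][year]-[item ID]:
RECORD_ID = re.compile(
    r"r(?P<congress>\d+):"
    r"(?P<chamber>[HSED])"   # H, S, E (extension of remarks), D (digest)
    r"(?P<day>\d{2})"        # 2-digit day of the month
    r"(?P<month>[A-Z]{2})"   # 2-char month code, e.g. "MR"
    r"(?P<year>\d)"          # last digit of the year
    r"-(?P<item>\d{4}):"     # 4-digit item number
)


def parse_record_id(url):
    """Split a THOMAS query URL into the fields above, or return None."""
    match = RECORD_ID.search(url)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("congress", "day", "year", "item"):
        fields[key] = int(fields[key])
    return fields


print(parse_record_id("http://thomas.loc.gov/cgi-bin/query/z?r113:S12MR3-0044:/"))
# {'congress': 113, 'chamber': 'S', 'day': 12, 'month': 'MR', 'year': 3, 'item': 44}
```

This only goes from a URL to its parts. It does not solve the harder direction of getting from an amendment number to the item ID.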

wilson428 commented on July 28, 2024

Thx, Gordon!

Just pushed a branch called amendment_text. I'm having some encoding nightmares, so there's a sloppy catch in the utils unescape function right now with a URL in the comments that will break the original.

To try it, just pass --fulltext to the amendments call (see the README in the branch). Needs some work.

wilson428 commented on July 28, 2024

I reworked it to just pull the text file of amendments from each day's CR, e.g.
http://www.gpo.gov/fdsys/pkg/CREC-2013-03-21/html/CREC-2013-03-21-pt1-PgS2169.htm

Much easier to download this file and regex it than to crawl through THOMAS.
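
A rough sketch of the "download and regex it" step, assuming each amendment in the day's text starts with a heading of the form "SA <number>." at the beginning of a line. That heading format and the function name are assumptions, not taken from the branch, and the sample text is an invented placeholder rather than real CR content.

```python
import re

# Assumed heading format: "SA <number>. " at the start of a line.
HEADING = re.compile(r"^[ \t]*SA (\d+)\. ", re.MULTILINE)


def split_amendments(record_text):
    """Return {amendment number: text}, one entry per assumed 'SA <n>.' heading."""
    matches = list(HEADING.finditer(record_text))
    amendments = {}
    for i, match in enumerate(matches):
        # Each amendment runs until the next heading, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(record_text)
        amendments[int(match.group(1))] = record_text[match.start():end].strip()
    return amendments


# Invented placeholder input, only to show the shape of the result.
sample = (
    "  SA 26. (placeholder heading text)\n"
    "  First placeholder body.\n"
    "\n"
    "  SA 27. (placeholder heading text)\n"
    "  Second placeholder body.\n"
)
for number, text in split_amendments(sample).items():
    print(number, len(text))
```

Keying the result by amendment number would make it straightforward to drop each chunk into that amendment's output directory.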

konklone commented on July 28, 2024

That's exactly the approach Capitol Words takes. CW doesn't do anything special to parse out amendment text right now, but you still might look at its parser for lessons and ideas.

konklone commented on July 28, 2024

In fact, I really should loop in @bycoffe too, since I believe he wrote the original version of that parser.

wilson428 commented on July 28, 2024

Awesome, will do.

Has anyone tried to parse the amendment text and connect to the original legislation?

e.g., for "On page 4, line 6, decrease the amount by $20,000,000,000."

to connect to whatever's on page 4, line 6 of the actual legislation?
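
A minimal sketch of the first half of that problem: pulling the page, line, and dollar change out of an instruction shaped like the one quoted above. Only this one phrasing is handled. Accepting "increase" as well as "decrease" is an assumption, and the function name is hypothetical.

```python
import re

# Matches only the phrasing quoted above; "increase" is an assumed variant.
INSTRUCTION = re.compile(
    r"On page (?P<page>\d+), line (?P<line>\d+), "
    r"(?P<action>increase|decrease) the amount by \$(?P<amount>[\d,]+)\."
)


def parse_amount_instruction(text):
    """Return the page, line, and signed dollar change, or None if no match."""
    match = INSTRUCTION.search(text)
    if match is None:
        return None
    amount = int(match.group("amount").replace(",", ""))
    if match.group("action") == "decrease":
        amount = -amount
    return {
        "page": int(match.group("page")),
        "line": int(match.group("line")),
        "change": amount,
    }


print(parse_amount_instruction(
    "On page 4, line 6, decrease the amount by $20,000,000,000."
))
# {'page': 4, 'line': 6, 'change': -20000000000}
```

Connecting the result to the bill still requires a paginated, line-numbered copy of the legislation, which this sketch does not address.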

JoshData commented on July 28, 2024

As far as I know, no one has tried either that or anything in the bigger picture of parsing how things change other things.

konklone commented on July 28, 2024

People sure do talk about it a lot, though! Everyone wants this. Are you a bad enough dude to connect amendments to legislation through page and line numbers?

wilson428 commented on July 28, 2024

Any suggestions for how to integrate the fdsys routines into the amendment task, so it grabs the unadulterated text of the amendment directly from GPO? I have a script that downloads the text file from GPO, parses it, and puts the output in the right place, but right now it takes manual URLs from GPO (like the one a few messages back). I'm not sure how to go from an amendment (say, SA 136) to its location in the Congressional Record.

FYI, here's a crude demo of matching amendments to legislation. I can upload the guts of it once I get amendment text working right.

http://experimentsinform.com/media/demos/revisions/site/

konklone commented on July 28, 2024

...whoa. Even this "crude demo" is more interesting than anything I've seen yet in the US for matching up amendments.

You can see an example of me using fdsys.py as a support library (instead of a standalone task) in bill_versions.py, which downloads bill text and version metadata from GPO and outputs JSON files for each version of each bill:
https://github.com/unitedstates/congress/blob/master/tasks/bill_versions.py

I actually haven't integrated bill_versions output into my production systems yet (still using an older Ruby-based GPO sync script), but I plan to.

wilson428 commented on July 28, 2024

Awesome, thank you!

wilson428 commented on July 28, 2024

The "amendment_text" branch now has a "parse" command that attempts to interpret the text of the amendments (which now are retrieved from GPO). It's smart enough to figure out how to pinpoint "Title III" to a page and line if the amendment doesn't specify the exact location, but still has a long ways to go. See README.

GPHemsley commented on July 28, 2024

I just want to pop in to say that this is pretty awesome work.

wilson428 commented on July 28, 2024

Thanks! Will take a lot more. All help much appreciated.

konklone commented on July 28, 2024

Hey @wilson428, could I summon you to do a quick summary of the State of Amendment Text?

I'm putting the finishing touches on getting a /amendments endpoint done in Sunlight's Congress API using this data, and am making it full-text searchable on the purpose and description only. I'd love to get the full text indexed as well, even if it's not in structured form the way that bill text is.

What's the remaining path to dropping .txt files, even unformatted blob-of-words style, for each amendment?

wilson428 commented on July 28, 2024

Awesome! This stalled over issues of where the raw text of amendments is stored in the CR, particularly in the House. I was following links from THOMAS to the CR to get the full text, but the agreement seemed to be that I should be using the fdsys scripts instead.

Then it turned out that House amendments in the CR are buried in the text of the daily activity. There was some sense that the text could be retrieved from a committee, I think?

I would love to dive back in, though I'd prefer to work on a forward-looking solution that doesn't die with the THOMAS end-of-life. Suggestions?

konklone commented on July 28, 2024

If the text appears in the CR for both chambers, that's probably the most reliable path, even if the parsing is particularly messy. There's a lot of value even if the text is extracted in a non-readable, but indexable, form.

I don't think any committee would have floor amendments, although committee reports (which are released after huge delays) might have committee amendments in them sometimes. Also possible that the House Committee Repository might grow to include them for the House some day.

wilson428 commented on July 28, 2024

Gotcha. I'll dive back in this week and give a more detailed report. Thanks for the nudge.

konklone commented on July 28, 2024

And to be clear, I'm happy to work on this, either primarily or secondarily, once I emerge from my current unrelated work blitz.
