Comments (38)

konklone avatar konklone commented on July 28, 2024

Ooh - using fdsys.py?

Curious how you intend to use them?

from congress.

GPHemsley avatar GPHemsley commented on July 28, 2024

Josh can probably explain better, but the plan is to extract as much of the standard metadata from them as possible.


GPHemsley avatar GPHemsley commented on July 28, 2024

I've got a script now that extracts the basic Congress and bill numbers for all bills that became law. I'm going to attempt to squeeze some more data out of the MODS files in the coming days.

I'm also thinking that at some point the bill scraping code should be separated from the bill format/metadata files, since there will be multiple sources for bill data but only one way to output for each format (json/xml). I may tackle that at the same time.


JoshData avatar JoshData commented on July 28, 2024

@konklone: My primary goal is to turn it into bills on GovTrack (enacted only, of course). So we'll be generating output that looks similar to the bill output. Plus whatever other interesting metadata is in the MODS files. And something with the text layer in the PDF.


GPHemsley avatar GPHemsley commented on July 28, 2024

I think the goal should also be to be able to do something like this and have it Just Work™:

./run bills --congress=82

(Even if it doesn't get all the bills in the 82nd Congress at first.)


GPHemsley avatar GPHemsley commented on July 28, 2024

I've got a working version of the script here:

https://github.com/GPHemsley/congress/blob/historical-bills-1951/tasks/statutes.py

You can run it by running a command like this:

./run statutes --path=STATUTE/1951/STATUTE-65 --govtrack

(After you run the fdsys task to put the fdsys files in place.)


GPHemsley avatar GPHemsley commented on July 28, 2024

So Josh pulled this into master, with a bunch of documentation, if you want to use it. (There may be more to do still, so I don't know if this should be closed just yet.)


konklone avatar konklone commented on July 28, 2024

Ah ha! I get it now (looked at the code). So this is intended to fill in gaps from 1951 to 1972. I like this a lot. I have a couple thoughts (surprise), related to keeping the system sane as we expand into more scripts.

Since the data we get from THOMAS is uniformly superior to the data from the Statutes (right? is there anything unique to the Statutes collection?), the statutes script should probably default to an end year of 1972. This could become moot if we make the bills script default to a Statutes-driven approach for years before 1973.

It'd be great to keep this down to running one command, instead of two, using fdsys.py as a support library that the statutes.py script uses (rather than running it as a script as a prerequisite). You can see how I did this in bill_versions.py, to generate JSON files for each version of bill text.


konklone avatar konklone commented on July 28, 2024

Separate thought - how much work would have to go into using scanned copies of the Statutes at Large for years prior to 1951? Since there's so much value just in getting the metadata, and scanning accurate text is not a concern, would it be worth it to engage in a manual (and one-time!) metadata collection effort using copies of the Statutes going back into antiquity, if we had them scanned?


GPHemsley avatar GPHemsley commented on July 28, 2024

I think it would be good to separate the code that is related to all bills from the code related to a single source of bills. And on top of that, it might be good to have the scrapers be separate from the parsers. Then the source-specific scripts could import the generic processing and output methods (as I do in statutes.py). Consolidating fdsys might be a part of that.

The Statutes data turns gaping holes into slightly-less-gaping holes by reverse-engineering (sort of) the metadata related to bills that have become law. The metadata is provided by the LOC somewhat accidentally, as a byproduct of being needed for archiving. As it stands, this does not get any information about the many bills that were never passed/enacted during the period from 1951 to 1972, and the data that it does get sometimes suffers from poor OCR. So yes, the Statutes data is essentially only a fallback for the cases when THOMAS data is not available.

Prior to 1951, Statutes data is not even available for enacted bills (AFAIK—I could be wrong), at least not from FDsys. However, the LOC also has information available on its American Memory site, such as here: http://memory.loc.gov/ammem/amlaw/lwhbsb.html. That might be worth looking into. Some parts are in text form, which would be (relatively) easy to scrape, while others are in GIF/TIFF image format, which would be a little more difficult. However, this does provide even different versions of a given bill, for Congresses that are available.

But if by "manual" metadata collection, you mean human-read and -input, I would definitely advise against that. There is just way too much metadata to collect.


GPHemsley avatar GPHemsley commented on July 28, 2024

It might be worth noting that they don't appear to have begun using codes like "H.R." to refer to bills until the 9th Congress (or later, depending on what you use as a reference).


konklone avatar konklone commented on July 28, 2024

I do mean human-read and -input, but it'd be one-time only. It seems like it might be a worthwhile project, if the metadata that you've gathered from GPO's work is useful enough to build around. There is no fully official set of scanned Statutes before 1951 that I'm aware of, but I've definitely seen allegedly official unofficial sets of scanned Statutes PDFs going back a long way. Whether or not to consider them official enough for use would be an interesting question alone.

I don't think it's worth tearing up the way we've done bills yet in a big way. Right now, we have a solid scraper for bill metadata from 1973-now, a scraper for useful-if-holey data from 1951-1972, and a downloader for bill text from 1989-present (bill_versions.py). They all do very different things, they're all straightforward to use, and there's no friction or wasted effort yet. My inclination is usually to refactor reactively rather than proactively, and each scraper being relatively autonomous and separate allows us all to experiment more easily. I like how things are working.


GPHemsley avatar GPHemsley commented on July 28, 2024

If I'm understanding your intentions correctly, you're talking hundreds of thousands—if not millions—of bills, aren't you? I think a much more worthwhile project in the short term would be scraping American Memory. That has text data from the 6th through the 42nd Congresses which would be much easier to parse automatically.

Regarding refactoring, my original reason for suggesting it was that I needed to import bill_info.py into statutes.py in order to get the generic output methods, but along with them came the THOMAS-related methods that I didn't need. At the very least, I think those two should be separated.


konklone avatar konklone commented on July 28, 2024

I definitely don't want to micromanage anything here - if you think something can be improved, improve it. I would just be careful about adding any burden (making anyone writing a new script have to know more about how other scripts work) to make things feel cleaner.

One way to make things better might be: utils.py is getting pretty weighty, and is a mix of project-meta helpers and congress-meta helpers. Making a congress.py file, and moving bill_info.output_bill, utils.current_congress, utils.split_bill_id, etc. into it seems like a good idea - it separates them like you describe, while keeping the scripts following the same flat pattern of "I only depend on myself, plus there are a couple of pools of utility methods I can dip into". I could see fdsys.py becoming its own pool of methods, and those files being put in their own directory.

Again, I do not want to nitpick; this is all going to work. We're hitting an awesome stride of growth in this project, and it probably does merit a bit of reorganization. I just think it will be easier in the long run for all of us if this all stays flat and simple and mostly non-systematized.


GPHemsley avatar GPHemsley commented on July 28, 2024

Speaking from my experience in writing statutes.py, I think splitting things out would make it easier, not harder, to write new scripts. I spent most of my time trying to track down all the various *_for() methods and what they meant and did, to see which ones I needed or could use. If all the generic ones were in their own file, it would have been somewhat easier for me to understand what was going on, I think.

For the record, these are the bill_info methods I used:

  • latest_status
  • history_from_actions
  • slip_law_from
  • current_title_for
  • output_bill

And there are probably others that I just didn't need but could be split out alongside them.

Speaking of utils, we could probably use some consolidating of the congress and congress-legislators utils.py files, perhaps as a separate project/repo. I had to do a lot of hacky things for legacy conversion to make things work together happily.

But yeah, I'm not attempting to make any crazy hierarchies here. Just splitting the pie up into smaller slices so I can pick only exactly what I need (while also making sure I can actually get what I need).


GPHemsley avatar GPHemsley commented on July 28, 2024

Pull request #39 is an important fix for making sure you get the right correspondence between bill number and bill text.


JoshData avatar JoshData commented on July 28, 2024

Let's not refactor yet. The next thing is pulling out bill text from 1951-1993 (there are fewer years of bill text on GPO than bill metadata on THOMAS).


GPHemsley avatar GPHemsley commented on July 28, 2024

I've tied in the Statute PDFs in pull request #41, so now you can actually see the bill/law associated with the often obscure titles.

Of course, pulling the text out of those PDFs is going to be quite an adventure unto itself. (Perhaps even one reserved for a separate issue.)


GPHemsley avatar GPHemsley commented on July 28, 2024

Would it be appropriate to include a reference such as "STATUTE-72-Pg3" alongside the action of enactment? The references field seems geared specifically towards the Congressional Record, but I think it would be good to open it up a little more to allow for other sources of information.


konklone avatar konklone commented on July 28, 2024

I'd just add a new field. Even though it's called something general like "references", I think it should remain only CR refs, to keep assumptions when parsing that field simple. "source" might make sense.


GPHemsley avatar GPHemsley commented on July 28, 2024

Yeah, you're probably right. There will always be a record about it in the Congressional Record (or equivalent), but we might not always get the information from there. How about I make it a list named "sources", in case we ever have to combine sources to make a single action entry?


konklone avatar konklone commented on July 28, 2024

Sure.


GPHemsley avatar GPHemsley commented on July 28, 2024

Of course that would leave the citation format to be determined. Should each source have a code for the general document/organization and then a specific citation within it, or should it just be { ... "sources": [ "STATUTE-72-Pg3" ] ... }?


konklone avatar konklone commented on July 28, 2024

It's not too big a deal, since we can always regenerate it later, so how about just a URL to the original document for now? The other option is a full dict that's like [{source: "statutes", volume: "72", page: "3"}], which is also fine.


JoshData avatar JoshData commented on July 28, 2024

Both would be really helpful. I was going to add a source_url to all of our task output anyway, pointing to the page closest to where the information was scraped, suitable for "see more" type links. I'd like to see source_url added to all of the tasks, and for the Statutes-generated files just something special for that, i.e. statute_citation: { "volume": 72, "page": 3 }, which would match the "72 Stat 3" type citations people actually use.


konklone avatar konklone commented on July 28, 2024

So the sources field would look like:

[{
  "source": "statutes",
  "source_url": "...",
  "volume": 72,
  "page": 3
}]


GPHemsley avatar GPHemsley commented on July 28, 2024

What URL should I use for the source_url? MODS? PDF?

Also, I think it would be good to also include the access ID ("STATUTE-72-Pg3"), since that's the primary identifier of a particular statute at the GPO. (When multiple statutes appear on the same page, they get different access IDs; "72 Stat. 3" could be ambiguous, though I could also include a field containing the page position value.)


JoshData avatar JoshData commented on July 28, 2024

I'd like a URL I can link to, so these sort of pages would be good:
http://www.gpo.gov/fdsys/granule/STATUTE-118/STATUTE-118-Pg493/content-detail.html

I'm not sure the accessID is useful without also the package ID it's contained in (STATUTE-72). Feel free to include one or both.

The citation is ambiguous to a bill, but it's what lawyers use sometimes, so it's useful.


konklone avatar konklone commented on July 28, 2024

I think the "source_url" field should probably literally be the URL that was used to get the data being output, for provenance's sake. But you could add other URLs - and like you said, you can use the GPO identifier to construct other kinds of detail URLs client-side, too.


GPHemsley avatar GPHemsley commented on July 28, 2024

I currently have it outputting this:

  "sources": [
    {
      "access_id": "STATUTE-71-PgB6", 
      "page": "B6", 
      "position": "1", 
      "source": "statute", 
      "source_url": "http://www.gpo.gov/fdsys/granule/STATUTE-71/STATUTE-71-PgB6/content-detail.html", 
      "volume": "71"
    }
  ], 


JoshData avatar JoshData commented on July 28, 2024

@konklone If that's different from what I was suggesting, then I'm just going to ask for yet another field for a human-readable page....


JoshData avatar JoshData commented on July 28, 2024

Gordon- Looks great to me.


GPHemsley avatar GPHemsley commented on July 28, 2024

Updated:

  "sources": [
    {
      "access_id": "STATUTE-73-Pg14-2", 
      "package_id": "STATUTE-73", 
      "page": "14", 
      "position": "2", 
      "source": "statutes", 
      "source_url": "http://www.gpo.gov/fdsys/granule/STATUTE-73/STATUTE-73-Pg14-2/content-detail.html", 
      "volume": "73"
    }
  ], 
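For illustration, an entry like the one above could be assembled with a small helper. This is just a sketch: the field derivations are assumptions inferred from the IDs shown in this thread, and `statute_source` is a hypothetical name, not the actual statutes.py code.

```python
def statute_source(volume, page, position=None):
    """Build a 'sources' entry for a statute, deriving the GPO access ID
    and FDsys detail URL from the volume/page/position citation."""
    access_id = "STATUTE-%s-Pg%s" % (volume, page)
    if position is not None:
        # Disambiguates multiple statutes appearing on the same page.
        access_id += "-%s" % position
    entry = {
        "source": "statutes",
        "access_id": access_id,
        "package_id": "STATUTE-%s" % volume,
        "volume": str(volume),
        "page": str(page),
        "source_url": ("http://www.gpo.gov/fdsys/granule/STATUTE-%s/%s/content-detail.html"
                       % (volume, access_id)),
    }
    if position is not None:
        entry["position"] = str(position)
    return entry
```

For example, `statute_source(73, 14, 2)` would reproduce the entry shown above, and `statute_source(71, "B6")` the earlier one (modulo the position field).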


GPHemsley avatar GPHemsley commented on July 28, 2024

Pardon the funny speak in my commit summaries. Pull request #43.


JoshData avatar JoshData commented on July 28, 2024

I've got this new bill data from 1951-1972 up on GovTrack now (http://www.govtrack.us/congress/bills/browse). Nice work, Gordon.

For the text, I'm thinking we extract the text layer of the PDF into bills/x/xddd/text-versions/enr/document.txt. (That's where the fdsys --store command puts current bill text.) Thoughts?
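A minimal sketch of that extraction step, assuming the path layout described above and that poppler's `pdftotext` is available on the system; both function names here are hypothetical, not part of the project:

```python
import os
import subprocess

def text_version_path(bill_type, number, version="enr"):
    # bills/x/xddd/text-versions/enr/document.txt, per the layout above,
    # where x is the bill type and ddd is the bill number.
    bill_id = "%s%s" % (bill_type, number)
    return os.path.join("bills", bill_type, bill_id,
                        "text-versions", version, "document.txt")

def extract_text_layer(pdf_path, out_path):
    # Pull the embedded OCR text layer out of the statute PDF.
    os.makedirs(os.path.dirname(out_path), exist_ok=True)
    subprocess.check_call(["pdftotext", "-layout", pdf_path, out_path])
```

The `-layout` flag preserves the original column layout, which matters for the two-column Statutes pages; OCR quality would still be whatever GPO's text layer provides.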


konklone avatar konklone commented on July 28, 2024

That makes sense to me. bill_versions.py is near-identical, putting a file at bills/x/xddd/text-versions/enr.json. I'll change it to be enr/data.json instead.


GPHemsley avatar GPHemsley commented on July 28, 2024

@tauberer It looks like you missed 1951–1957 (82–84). Also, you might want to make sure that the 85–88 files have been generated by the latest version of all files/scripts involved.


konklone avatar konklone commented on July 28, 2024

This looks done enough to close. Re-open if I'm wrong, of course.

