ferc-xbrl-extractor's Issues

Add ability to filter XBRL extraction to a subset of tables

Traditionally we've allowed extracting only a subset of tables from the DBF-based FERC Form 1 data for testing purposes. This is a fairly easy update to make in the XBRL extraction, so this issue will add that functionality. This requires a small API update to provide an option for specifying tables.
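A rough sketch of what the filtering could look like, assuming a dict of table definitions keyed by table name (the function and argument names here are hypothetical, not the actual API):

def filter_table_defs(table_defs: dict, tables: set[str] | None = None) -> dict:
    """Keep only the requested tables; extract everything if no subset is given."""
    if tables is None:
        return table_defs
    missing = tables - set(table_defs)
    if missing:
        raise ValueError(f"Requested tables not found in taxonomy: {sorted(missing)}")
    return {name: defn for name, defn in table_defs.items() if name in tables}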

Implement logging.

Currently, the only feedback/output is a couple of print statements. These should be replaced with detailed logging.
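A minimal sketch of the kind of logging setup this is asking for, using the standard library logging module (the logger name mirrors the log output quoted in the issues below; the messages and function are illustrative):

import logging

logger = logging.getLogger("catalystcoop.ferc_xbrl_extractor.xbrl")

def configure_logging() -> None:
    # Timestamp, level, and logger name, similar to the log lines quoted later on this page.
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s [%(levelname)8s] %(name)s:%(lineno)s %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
    )

def process_filing(filing_name: str) -> None:
    # Replace bare print() calls with leveled log messages.
    logger.info("Extracting %s", filing_name)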

Identify lost XBRL facts

In @zschira's investigations into catalyst-cooperative/pudl#2755, he found that there are many XBRL facts that refer to contexts which are missing dimensions, which means that we don't assign those facts to any tables.

We would like to assign those facts to the best of our ability, which means understanding these "lost facts" that don't have a home.

First we should identify them. We may have to add the notion of a nullable dimension to the table schemas.
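As a first step, a sketch along these lines could enumerate the unassigned fact IDs, assuming the instance and extraction-stats structures used by the test suite quoted later on this page (fact_id_counts on each parsed instance, and stats["fact_ids"] keyed by table):

def find_lost_fact_ids(instances, stats) -> set[str]:
    """Return the IDs of facts that were parsed but never assigned to a table."""
    all_fact_ids = set()
    for instance in instances:
        all_fact_ids |= set(instance.fact_id_counts)
    used_fact_ids = set()
    for fact_ids in stats["fact_ids"].values():
        used_fact_ids |= set(fact_ids)
    return all_fact_ids - used_fact_ids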

Extract additional metadata from XBRL taxonomies

From this issue, we've found that there is additional metadata contained in XBRL taxonomies that will be useful in linking the XBRL and DBF data. Specifically, it would be useful to have the XBRL references (which contain FERC account numbers and form location info), the balance type (credit or debit), and the calculations which define relationships between columns (like aggregations). This information can all be pulled out of the taxonomies and put into a more usable format.

Improve documentation

The extractor needs some improvements in the README and docs to make it easier to use and understand technical details about how it works.

  • Add glossary of XBRL terms
  • Improve docs describing extraction process
  • Add concrete examples of CLI use to readme

Improve unit and integration test coverage

Now that we've got CI running (#46), we can start improving our test coverage! Some tests that we know we'd like to run:

  • Extract Taxonomy metadata from zip archives stored in the repository
  • Extract datapackage.json metadata from taxonomies stored in zip archives stored in the repository
  • Extract SQLite DB from zip archives stored in the repository
  • Test that the LC_ALL environment variable is not altered by Arelle (this should be fixed, but... just in case of regressions)
  • Add test data & test cases for all the different FERC forms we are trying to support (this will allow us to skip these time-consuming extractions in our fast integration tests in the main PUDL repo):
    • Form 1
    • Form 2
    • Form 6
    • Form 60
    • Form 714

Run Tox / pytest in CI

Add a GitHub Actions workflow to run Tox in CI, and test that the package works on Python 3.8-3.11.

We also need to expand the unit & integration tests for better coverage, but getting the CI up and running is a necessary step on its own.

Errors when attempting to use ferc1_to_sqlite

When attempting to run the ferc1_to_sqlite script processing all tables for years 2020 and 2021 (so, 1 year each from the two FERC data sources), I get a bunch of errors about files not existing (though looking in the cache directory, they do seem to exist):

2022-07-07 17:53:38 [    INFO] catalystcoop.ferc_xbrl_extractor.xbrl:155 Parsing taxonomy from https://eCollection.ferc.gov/taxonomy/form1/2022-01-01/form/form1/form-1_2022-01-01.xsd
2022-07-07 17:53:39 [    INFO] catalystcoop.ferc_xbrl_extractor.xbrl:155 Parsing taxonomy from https://eCollection.ferc.gov/taxonomy/form1/2022-01-01/form/form1/form-1_2022-01-01.xsd
2022-07-07 17:53:39 [    INFO] catalystcoop.ferc_xbrl_extractor.xbrl:155 Parsing taxonomy from https://eCollection.ferc.gov/taxonomy/form1/2022-01-01/form/form1/form-1_2022-01-01.xsd
2022-07-07 17:53:39 [    INFO] catalystcoop.ferc_xbrl_extractor.xbrl:155 Parsing taxonomy from https://eCollection.ferc.gov/taxonomy/form1/2022-01-01/form/form1/form-1_2022-01-01.xsd
2022-07-07 17:53:42,621 [webCache:cacheDownloadRenamingError] [Errno 2] No such file or directory: '/home/zane/.config/arelle/cache/https/eCollection.ferc.gov/taxonomy/form1/2022-01-01/schedules/ScheduleAccumulatedDeferredIncomeTaxes/sched-234_2022-01-01.xsd.tmp' -> '/home/zane/.config/arelle/cache/https/eCollection.ferc.gov/taxonomy/form1/2022-01-01/schedules/ScheduleAccumulatedDeferredIncomeTaxes/sched-234_2022-01-01.xsd' 
Unsuccessful renaming of downloaded file to active file /home/zane/.config/arelle/cache/https/eCollection.ferc.gov/taxonomy/form1/2022-01-01/schedules/ScheduleAccumulatedDeferredIncomeTaxes/sched-234_2022-01-01.xsd 
Please remove with file manager. - 

2022-07-07 17:53:42,624 [IOerror] sched-234_2022-01-01.xsd: file error: [Errno 2] No such file or directory: '/home/zane/.config/arelle/cache/https/eCollection.ferc.gov/taxonomy/form1/2022-01-01/schedules/ScheduleAccumulatedDeferredIncomeTaxes/sched-234_2022-01-01.xsd' - https://eCollection.ferc.gov/taxonomy/form1/2022-01-01/form/form1/form-1_2022-01-01.xsd 2

2022-07-07 17:53:54,479 [webCache:cacheDownloadRenamingError] [Errno 2] No such file or directory: '/home/zane/.config/arelle/cache/https/eCollection.ferc.gov/taxonomy/form1/2022-01-01/ferc-core_2022-01-01.xsd.tmp' -> '/home/zane/.config/arelle/cache/https/eCollection.ferc.gov/taxonomy/form1/2022-01-01/ferc-core_2022-01-01.xsd' 
Unsuccessful renaming of downloaded file to active file /home/zane/.config/arelle/cache/https/eCollection.ferc.gov/taxonomy/form1/2022-01-01/ferc-core_2022-01-01.xsd 
Please remove with file manager. - 

I notice there's some weird case-sensitivity stuff going on in my Arelle cache directory; the version from the last time I played around in April doesn't have any uppercase in the domain name / directory:

drwxr-xr-x 3 zane zane 4.0K 2022-04-12 21:51 ecollection.ferc.gov/
drwxr-xr-x 3 zane zane 4.0K 2022-07-07 17:53 eCollection.ferc.gov/

There are also some syntax errors coming up...

2022-07-07 17:58:45,875 [xmlSchema:syntax] Char 0x0 out of allowed range, sched-304_2022-01-01_def.xml, line 223, column 128 - ../../../../../../../sched-304_2022-01-01_def.xml , https://eCollection.ferc.gov/taxonomy/form1/2022-01-01/schedules/ScheduleSalesOfElectricityByRateSchedules/sched-304_2022-01-01.xsd 5
2022-07-07 17:58:45,875 [xmlSchema:syntax] AttValue: ' expected, sched-304_2022-01-01_def.xml, line 223, column 128 - ../../../../../../../sched-304_2022-01-01_def.xml , https://eCollection.ferc.gov/taxonomy/form1/2022-01-01/schedules/ScheduleSalesOfElectricityByRateSchedules/sched-304_2022-01-01.xsd 5
2022-07-07 17:58:45,875 [xmlSchema:syntax] attributes construct error, sched-304_2022-01-01_def.xml, line 223, column 128 - ../../../../../../../sched-304_2022-01-01_def.xml , https://eCollection.ferc.gov/taxonomy/form1/2022-01-01/schedules/ScheduleSalesOfElectricityByRateSchedules/sched-304_2022-01-01.xsd 5
2022-07-07 17:58:45,875 [xmlSchema:syntax] Couldn't find end of Start Tag loc line 223, sched-304_2022-01-01_def.xml, line 223, column 128 - ../../../../../../../sched-304_2022-01-01_def.xml , https://eCollection.ferc.gov/taxonomy/form1/2022-01-01/schedules/ScheduleSalesOfElectricityByRateSchedules/sched-304_2022-01-01.xsd 5
2022-07-07 17:58:45,875 [xmlSchema:syntax] Premature end of data in tag definitionLink line 17, sched-304_2022-01-01_def.xml, line 223, column 128 - ../../../../../../../sched-304_2022-01-01_def.xml , https://eCollection.ferc.gov/taxonomy/form1/2022-01-01/schedules/ScheduleSalesOfElectricityByRateSchedules/sched-304_2022-01-01.xsd 5

Despite the errors above, it seems to be attempting to extract some filings. However, in the end I get an error from concurrent.futures:

2022-07-07 18:02:37 [    INFO] catalystcoop.ferc_xbrl_extractor.xbrl:126 Extracting a6879fc6-64be-47e6-a830-0b9b4aaa24eb
2022-07-07 18:02:39 [    INFO] catalystcoop.ferc_xbrl_extractor.xbrl:126 Extracting 90b8961f-01d5-426c-b9e2-2c7b39884ced
2022-07-07 18:02:41 [    INFO] catalystcoop.ferc_xbrl_extractor.xbrl:126 Extracting f863e3cd-6ca9-42a9-b013-ca12fd301d2a
concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/zane/mambaforge/envs/pudl-dev/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/home/zane/mambaforge/envs/pudl-dev/lib/python3.10/concurrent/futures/process.py", line 205, in _process_chunk
    return [fn(*args) for args in chunk]
  File "/home/zane/mambaforge/envs/pudl-dev/lib/python3.10/concurrent/futures/process.py", line 205, in <listcomp>
    return [fn(*args) for args in chunk]
  File "/home/zane/mambaforge/envs/pudl-dev/lib/python3.10/site-packages/ferc_xbrl_extractor/xbrl.py", line 93, in process_batch
    instance_dfs = process_instance(instance, db_path, taxonomy, save_metadata)
  File "/home/zane/mambaforge/envs/pudl-dev/lib/python3.10/site-packages/ferc_xbrl_extractor/xbrl.py", line 124, in process_instance
    tables = get_fact_tables(taxonomy, db_path, save_metadata)
  File "/home/zane/mambaforge/envs/pudl-dev/lib/python3.10/site-packages/ferc_xbrl_extractor/xbrl.py", line 156, in get_fact_tables
    taxonomy = Taxonomy.from_path(taxonomy_path, save_metadata)
  File "/home/zane/mambaforge/envs/pudl-dev/lib/python3.10/site-packages/ferc_xbrl_extractor/taxonomy.py", line 175, in from_path
    roles = [
  File "/home/zane/mambaforge/envs/pudl-dev/lib/python3.10/site-packages/ferc_xbrl_extractor/taxonomy.py", line 176, in <listcomp>
    LinkRole.from_list(role, concept_dict) for role in view.jsonObject["roles"]
  File "/home/zane/mambaforge/envs/pudl-dev/lib/python3.10/site-packages/ferc_xbrl_extractor/taxonomy.py", line 147, in from_list
    concepts=Concept.from_list(linkrole_list[3], concept_dict),
IndexError: list index out of range
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/zane/mambaforge/envs/pudl-dev/bin/ferc1_to_sqlite", line 33, in <module>
    sys.exit(load_entry_point('catalystcoop.pudl', 'console_scripts', 'ferc1_to_sqlite')())
  File "/home/zane/code/catalyst/pudl/src/pudl/convert/ferc1_to_sqlite.py", line 113, in main
    pudl.extract.ferc1.xbrl2sqlite(
  File "/home/zane/code/catalyst/pudl/src/pudl/extract/ferc1.py", line 618, in xbrl2sqlite
    xbrl.extract(
  File "/home/zane/mambaforge/envs/pudl-dev/lib/python3.10/site-packages/ferc_xbrl_extractor/xbrl.py", line 61, in extract
    for i, batch in enumerate(results):
  File "/home/zane/mambaforge/envs/pudl-dev/lib/python3.10/concurrent/futures/process.py", line 570, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/home/zane/mambaforge/envs/pudl-dev/lib/python3.10/concurrent/futures/_base.py", line 609, in result_iterator
    yield fs.pop().result()
  File "/home/zane/mambaforge/envs/pudl-dev/lib/python3.10/concurrent/futures/_base.py", line 446, in result
    return self.__get_result()
  File "/home/zane/mambaforge/envs/pudl-dev/lib/python3.10/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
IndexError: list index out of range
[1]    50211 exit 1     ferc1_to_sqlite --clobber src/pudl/package_data/settings/etl_full.yml

Proportion of lost facts in tests seems non-deterministic

We had a slew of dependency updates this week, which meant a bunch of CI tests got run, and many of them are failing, but in slightly different ways. It looks like the proportion of facts that get lost isn't the same every time. Are we doing some random sampling that might get different answers on different runs? If so, maybe we want to select a fixed subset to test on, or set the random seed to a constant value?
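If random sampling is involved, one fix would be to pin the seed in an autouse fixture; a minimal sketch (this fixture is hypothetical, not something already in the test suite):

import random

import numpy as np
import pytest

@pytest.fixture(autouse=True)
def fixed_random_seed():
    # Pin both the stdlib and NumPy RNGs so repeated CI runs see the same sample.
    random.seed(42)
    np.random.seed(42)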

Of the 24 runs (pull request + push for each of Python 3.10 & 3.11, across 6 dependabot PRs) there were the following 4 failures. So like a 17% failure rate?

example 1

_______________________ test_lost_facts_pct[form2_2021] ________________________

extracted = ExtractOutput(table_defs={'corporate_officer_certification_001_duration': <ferc_xbrl_extractor.datapackage.FactTable o...s', 'c-00:compressor_station_equipment_gas_transmission_plant', 'c-1044:compressor_hours_of_operation_during_year'}})})
request = <FixtureRequest for <Function test_lost_facts_pct[form2_2021]>>

    def test_lost_facts_pct(extracted, request):
        table_defs, instances, table_data, stats = extracted
        total_facts = sum(len(i.fact_id_counts) for i in instances)
        total_used_facts = sum(len(f_ids) for f_ids in stats["fact_ids"].values())
    
        used_fact_ratio = total_used_facts / total_facts
    
        if "form6_" in request.node.name:
            # We have unallocated data for Form 6 for some reason.
            total_threshold = 0.9
            per_filing_threshold = 0.8
            # Assert that this is < 0.95 so we remember to fix this test once we
            # fix the bug. We don't use xfail here because the parametrization is
            # at the *fixture* level, and only the lost facts tests should fail
            # for form 6.
            assert used_fact_ratio > total_threshold and used_fact_ratio <= 0.95
        else:
            total_threshold = 0.99
            per_filing_threshold = 0.95
>           assert used_fact_ratio > total_threshold and used_fact_ratio <= 1
E           assert (0.9854457831325302 > 0.99)

example 2

 _______________________ test_lost_facts_pct[form6_2021] ________________________

extracted = ExtractOutput(table_defs={'annual_corporate_officer_certification_001_duration': <ferc_xbrl_extractor.datapackage.Fact...atio_for_long_term_debt_rate_of_return', 'iffa57d4b6aa645f68d9de3452bd9d47d_D20210101-20211231:schedule_exemption'}})})
request = <FixtureRequest for <Function test_lost_facts_pct[form6_2021]>>

    def test_lost_facts_pct(extracted, request):
        table_defs, instances, table_data, stats = extracted
        total_facts = sum(len(i.fact_id_counts) for i in instances)
        total_used_facts = sum(len(f_ids) for f_ids in stats["fact_ids"].values())
    
        used_fact_ratio = total_used_facts / total_facts
    
        if "form6_" in request.node.name:
            # We have unallocated data for Form 6 for some reason.
            total_threshold = 0.9
            per_filing_threshold = 0.8
            # Assert that this is < 0.95 so we remember to fix this test once we
            # fix the bug. We don't use xfail here because the parametrization is
            # at the *fixture* level, and only the lost facts tests should fail
            # for form 6.
>           assert used_fact_ratio > total_threshold and used_fact_ratio <= 0.95
E           assert (0.8165926212091945 > 0.9)

example 3

_______________________ test_lost_facts_pct[form2_2021] ________________________

extracted = ExtractOutput(table_defs={'corporate_officer_certification_001_duration': <ferc_xbrl_extractor.datapackage.FactTable o...1:disposition_of_excess_gas', 'c-910:other_utility_operating_income_associated_with_taxes_other_than_income_taxes'}})})
request = <FixtureRequest for <Function test_lost_facts_pct[form2_2021]>>

    def test_lost_facts_pct(extracted, request):
        table_defs, instances, table_data, stats = extracted
        total_facts = sum(len(i.fact_id_counts) for i in instances)
        total_used_facts = sum(len(f_ids) for f_ids in stats["fact_ids"].values())
    
        used_fact_ratio = total_used_facts / total_facts
    
        if "form6_" in request.node.name:
            # We have unallocated data for Form 6 for some reason.
            total_threshold = 0.9
            per_filing_threshold = 0.8
            # Assert that this is < 0.95 so we remember to fix this test once we
            # fix the bug. We don't use xfail here because the parametrization is
            # at the *fixture* level, and only the lost facts tests should fail
            # for form 6.
            assert used_fact_ratio > total_threshold and used_fact_ratio <= 0.95
        else:
            total_threshold = 0.99
            per_filing_threshold = 0.95
>           assert used_fact_ratio > total_threshold and used_fact_ratio <= 1
E           assert (0.972722891566265 > 0.99)

example 4

And then in just this one case, there aren't any keyword / tags associated with the datapackage?

        if datapackage_path:
            # Verify that datapackage descriptor is valid before outputting
            frictionless_package = Package(descriptor=datapackage.dict(by_alias=True))
            if not frictionless_package.metadata_valid:
>               raise RuntimeError(
                    f"Generated datapackage is invalid - {frictionless_package.metadata_errors}"
                )
E               RuntimeError: Generated datapackage is invalid - [{'code': 'package-error',
E                'description': 'A validation cannot be processed.',
E                'message': 'The data package has an error: "[] is too short" at "resources" '
E                           'in metadata and at "properties/resources/minItems" in profile',
E                'name': 'Package Error',
E                'note': '"[] is too short" at "resources" in metadata and at '
E                        '"properties/resources/minItems" in profile',
E                'tags': []}]

And also a bunch of errors when trying to access the taxonomy files?

2023-09-13 22:14:15,948 [webCache:retrievalError] Forbidden 
retrieving https://ecollection.ferc.gov/taxonomy/form714/2022-01-01/form/form714/form-714_2022-01-01.xsd - 

2023-09-13 22:14:15,949 [FileNotLoadable] File can not be loaded: https://ecollection.ferc.gov/taxonomy/form714/2022-01-01/form/form714/form-714_2022-01-01.xsd - https://ecollection.ferc.gov/taxonomy/form714/2022-01-01/form/form714/form-714_2022-01-01.xsd

Group extracted XBRL taxonomy metadata by table

Right now all of the XBRL taxonomy metadata we extract is organized into an array of JSON objects, each one pertaining to a single XBRL fact. E.g. in the context of the XBRL-derived electric_plant_in_service table, there's a top-level object for each of the following FERC accounts or aggregations of FERC accounts (as well as additional facts for each one with the suffixes _additions, _retirements, _adjustments, and _transfers):

  • land_and_land_rights_steam_production
  • structures_and_improvements_steam_production
  • boiler_plant_equipment_steam_production
  • engines_and_engine_driven_generators_steam_production
  • turbogenerator_units_steam_production
  • accessory_electric_equipment_steam_production
  • miscellaneous_power_plant_equipment_steam_production
  • steam_production_plant

But there's no programmatically usable indication that each one of these array elements belongs to the electric_plant_in_service table.

This complicates our use of the metadata in the ETL pipeline, since it means we can't select the appropriate subset of the metadata fields for use with a given table unless we already have a list of all of the columns that appear in that table. Right now we're reading the column names at the beginning of the XBRL table transform and then using them to grab the right metadata, but it would be much simpler if we could grab the right metadata at the beginning, knowing only the name of the table (i.e. electric_plant_in_service).

Given how generic some of the field names are (like organization...) it's also kind of surprising that they would be globally unique or always have the same meaning among the thousands of FERC 1 fields.

Clearly there's information in the XBRL taxonomies that indicates which fields are part of which tables. How can we integrate that into the taxonomy export so that we can select one table's worth of metadata at a time, without needing to extract a list of column names from the actual database first?

Would it make sense to make it a dictionary of dictionaries, with one top-level entry per table containing that table's "facts"? E.g.

{
  "electric_plant_in_service_204": {
    "land_and_land_rights_steam_production": {...},
    "structures_and_improvements_steam_production": {...},
    "boiler_plant_equipment_steam_production": {...},
    "engines_and_engine_driven_generators_steam_production": {...},
    "turbogenerator_units_steam_production": {...},
    "accessory_electric_equipment_steam_production": {...},
    "miscellaneous_power_plant_equipment_steam_production": {...},
    "steam_production_plant": {...},
    ...
  },
  "electric_energy_account_401a": {
    ...
  }
}

Or do exactly the same elements need to be associated with more than one table?
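If the grouped layout above works (and table membership really is unambiguous), selecting one table's worth of metadata in the ETL becomes a single dictionary lookup; a sketch assuming that structure:

import json

def metadata_for_table(metadata_path: str, table_name: str) -> dict:
    """Return the fact metadata for one table, e.g. electric_plant_in_service_204."""
    with open(metadata_path) as f:
        metadata = json.load(f)
    # No need to know the table's column names ahead of time.
    return metadata[table_name]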

Remove locale preserving context manager

With the integration of PR Arelle/Arelle#513 into the released Arelle 2.3.0, we no longer need to use a context manager to reset the LC_ALL environment variable back to its original value after using Arelle, so we should remove that workaround in our code (and test that it actually works...)
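For reference, the workaround being removed is roughly a context manager along these lines (a sketch of the idea, not the exact code in the repo):

import contextlib
import os

@contextlib.contextmanager
def preserve_lc_all():
    # Older Arelle versions could clobber LC_ALL; restore it on exit.
    original = os.environ.get("LC_ALL")
    try:
        yield
    finally:
        if original is None:
            os.environ.pop("LC_ALL", None)
        else:
            os.environ["LC_ALL"] = original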

FERC XBRL taxonomy parsing occasionally hangs for 10+ minutes

Occasionally, when extracting the FERC XBRL data, the process will hang for a very long time during the taxonomy parsing step. I've seen it take 15 minutes and then resume normal operation (and I've also quit out of it after 20 minutes). Normally this step takes only a few seconds.

It seems to be totally stochastic. If you run the extraction over and over again at the command line, it'll be fine 90% of the time, and then once in a while it gets stuck. It seems like it can happen on any of the FERC forms. Very odd.

Not sure if it's related at all, but I sometimes also get this failed assertion when it's hanging. Not sure why it would only appear sometimes:

dyld[70238]: Assertion failed: (this->magic == kMagic), function matchesPath, file Loader.cpp, line 154.

(WIP!) XBRL steam/fuel

@cmgosnell commented on Mon Jun 27 2022

status:

The steam and fuel tables are both processed/concatenated, and the FERC plant-ID-er works!

I'd love some overall design input... I pretty aggressively moved cleaning details into global variables so I could make all functions take a df and/or a table name and/or a source of the FERC 1 data (XBRL/DBF), in the hopes that this will be more dagster-friendly and standardized.

questions:

  • is that function call design even reasonable?
  • if so, can we/should we migrate a lot of the content in these global variables into the pudl metadata? I can see benefits of doing this, but most of these feel really FERC-specific (even if they are technically generalizable).
  • best way to pass around dfs (raw and transformed)? (I think this is in part answered in Ben's dagster comment)
    • I don't feel like I have a good way to do this in a standard way, especially because of the interdependence of some of these tables (steam and fuel for instance!).... It seems like using the dagster op/decorators would enable this standardization... but right now we either:
      • in the main transform function we always load all of the tables so we can be explicit about which raw/transformed tables are fed into each table transform.
      • do the above (explicit unique table args) but with an if table_name in ferc1_settings.tables: before each table call... sounds gross to me.
      • do some clean standard thing for all the tables except steam! I think we could do something like:
ferc1_tfr_dfs = {}
# make all the non-steam tables
for table_name in ferc1_settings.tables:
    if table_name == "plants_steam_ferc1":
        continue  # steam depends on the transformed fuel table; handled below
    ferc1_tfr_dfs[table_name] = globals().get(table_name)(
        ferc1_dbf_raw_dfs.get(table_name),
        ferc1_xbrl_raw_dfs.get(table_name),
    )
# make the steam table using fuel
ferc1_tfr_dfs["plants_steam_ferc1"] = plants_steam_ferc1(
    steam_dbf_raw=ferc1_dbf_raw_dfs.get("plants_steam_ferc1"),
    steam_xbrl_raw=ferc1_xbrl_raw_dfs.get("plants_steam_ferc1"),
    fuel_transformed=ferc1_tfr_dfs.get("fuel_ferc1"),
)

remaining tasks:

  • standardize the oob_to_nan
  • delete _old functions
  • dagster-friendly pudl.transform.ferc1.transform function
  • test if the dbf solutions for _multiplicative_error_correction are applicable for the xbrl data.
  • probably more...

tasks we're avoiding for now:

  • migrate the metadata stored here into the pudl table metadata (when possible/applicable).
  • the utility_id_ferc1
  • checking if the plant-ID-er is actually doing a good job
  • standardizing the extract step

See #1707 and #1722 for the table-specific task lists


@cmgosnell commented on Wed Jun 29 2022

a note about the extract step:

  • I think it would be cleaner for the extract step to save dfs with pudl table names, with _instant and _duration suffixes when applicable, instead of the current setup, which saves the FERC 1 raw table names. This way we could automatically grab the extracted tables with just the pudl table name (which is now the argument for many, many of these transform functions).


@zaneselvans commented on Fri Aug 12 2022

I asked some questions about the FERC1 transform refactor design in this comment on #1739


@zaneselvans commented on Wed Aug 17 2022

Also @bendnorman did you see this list of outstanding questions on the linked issue? catalyst-cooperative/pudl#1739 (comment)

Use archived taxonomies to parse XBRL

Relying on a live web resource to parse archived data is fragile, as the live resource may become unavailable or may have changed since the data was archived.

Update the FERC XBRL extraction process to rely on archived FERC XBRL taxonomies available from our Zenodo archives, rather than the live taxonomies available from FERC.

See also: catalyst-cooperative/pudl-scrapers#42

Comments on initial repo

Data Acquisition / Inputs

  • We need to archive the raw XBRL so that we can access the input files programmatically in the same way we access all the other data.
  • Recent years of XBRL data are split into multiple zip archives for what seems like no good reason. Should we combine them into a single ZIP for archiving purposes?
  • The names of the individual files making up the filings contain the old FERC Form 1 Respondent ID -- it's the number that follows the CamelCaseRespondentName in the file names. E.g. PSCo has respondent_id==145: PublicServiceCompanyOfColorado-145-2020Q1F1.xbrl
  • Where are the taxonomies downloaded from? I don't see any URLs stored in the repo. A concrete example should be in the README. When I go to try and find a taxonomy to download, the zipfile I get doesn't contain the file that's in the URL in the extract_fuel notebook. Where did you find that URL?
  • Can we programmatically download the taxonomies for each year? How much do they change over time? What's the pattern for the URL downloading them? Is there a different taxonomy for every quarter? Every year? Do they revise the taxonomies in retrospect? Are old versions available? Do we need to archive the taxonomies that pertain to a given downloaded version of the data? (yes, probably yes). Do we need to use a separate taxonomy to process every single quarter/year of data? What do we need to do to ensure that all of the different years of data can coexist in the same database schema?
  • It seems odd that the 2011 XBRL zipfile is only 29 MB, while the 2020 zipfiles add up to 122 MB. Why such a big difference?

xbrl_extract CLI

  • What happens if I don't give it a --to-sql argument? That's optional but... where does the data go if I don't give it that? The help message gives no indication.
  • The script needs to provide logging output. When I run it nothing happens. Is it downloading the taxonomy? Is it converting data? Is there a problem? There are only like 400 filings per year, so it can print a line out for each one and be fine.
  • Eventually it does start printing some output, but it appears to be going through the filings in a random order. Alphabetizing them would be good, so that adjacent records in the DB are related to each other, and also so that we can see roughly how far through the whole process we've gotten.
  • Timestamps, loglevel, and other logging metadata should be output. I can't tell if these are logs or just print() statements, or how long the whole thing is taking.
  • The output that's being generated is just paths to files. It doesn't say what is being done. I note that no SQLite DB is being created in the location that I specified. Is it going to read everything into memory before writing it out? The raw inputs for 2020 take up 1.2 GB on disk, so that could be a lot of memory.
  • It seems to be maxing out one CPU, but not writing anything to disk. Can the reading of the individual filings be easily parallelized?
  • For me it took 24 minutes to run through the 2020 data, and the resulting database is only 55 MB, which is crazy small compared to the 1.2 GB of the original data on disk.

extract_fuel.ipynb

  • The old database tables all start with f1_, not ferc_. The PUDL FERC 1 database tables all end with _ferc1. Because we have a bunch of different FERC datasets, it's important never to name anything just "ferc" -- it always needs to include the form number.
  • It's unclear to me how you're inferring the row numbers in the fuel table. Are you literally just counting up the number of rows? Where did you get the magic number of 15 rows per supplement from?
  • We should be able to do something simpler involving df.groupby("respondent_id").cumcount() to assign the row and supplement numbers, I think (see the sketch after this list).
  • I'm concerned that this method of assigning row numbers may fail catastrophically on some tables though. E.g. in the Plant In Service table, there are many header rows which take up row numbers, but do not contain data. Will those rows show up at all in the new data?
  • The FuelBurnedAverageHeatContent column looks totally jacked. But I guess that's because we haven't cleaned up all the units errors yet.
  • What's up with the -0 and -1 etc. suffixes in the plant_name column?
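A sketch of the cumcount-based approach mentioned above, assuming 15 data rows per supplement as in the notebook (the function and column names are illustrative):

import pandas as pd

def assign_row_and_supplement(df: pd.DataFrame, rows_per_supplement: int = 15) -> pd.DataFrame:
    # Number each respondent's records in order of appearance, then split that
    # running count into a row number within the supplement and a supplement number.
    position = df.groupby("respondent_id").cumcount()
    return df.assign(
        row_number=position % rows_per_supplement + 1,
        supplement_number=position // rows_per_supplement,
    )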

extract_steam.ipynb

  • Using column ordering to map old and new column names seems extremely fragile.
  • The way the new and old columns are named is extremely different. I'm skeptical that this can be automated / generalized. On the upside, the new column names are much more informative.

Row Numbers / DB structural differences.

  • There are 183 tables in the new DB (vs 116 in the old one)
  • There are 3679 columns in the new DB (vs about 3300 in the old one), but I think ~HALF of the columns in the old DB are _f footnote columns that don't contain real data, and I don't think any such columns exist in the new DB.
  • Overall the 2020 data contains 558,580 records
  • Clearly there are substantial structural differences between these two databases.
  • Given that the new DB seems to have no concept of row numbers, one major issue is going to be mapping old records to the new records. In the old DB there are ~3 different kinds of tables:
    • Tables where each record really does correspond to a single database record (e.g. f1_fuel or f1_steam)
    • Tables where individual rows have different semantic meanings, and those meanings potentially change over time as new rows are added. (e.g. the f1_plant_in_srvce table, where each row is a header, subtotal, total, or a particular FERC Account number).
    • Free form tables where row numbers are meaningless, and the respondents can enter whatever they want in one or more of the columns (e.g. the f1_edcfu / depreciation tables).
  • Looking at the Plant in Service table in the new DB, it appears that they have turned those crazy rows number based contents into columns which is sensible. That table now has 485 columns with long descriptive names (none of which include the FERC Account number of course!)
  • These structures are going to be challenging to reconcile with each other in an automated way.
  • Having 10 years of overlapping data that should be identical, reported in the two different formats, does open some potential ML based approaches...
  • I think we may end up having two entirely separate sources of FERC1 data, which have to be managed separately, with different transformations that lead to a single consolidated / cleaned output in the PUDL DB.
  • Looking at the modern version of f1_edcfu_epda (a messy depreciation table), the junk column where they list the FERC Acct. number (or a descriptive string) still seems to be a total mess. However, there are a couple of columns that look like they're supposed to somehow indicate the order in which these rows should appear, and potentially provide some additional information about how they are grouped.

Test official arelle package

Arelle has begun officially packaging the library for PyPI here. It would probably make sense to switch to this version, assuming it doesn't break anything, so we can get updates and improvements.

Datapackage descriptors annotating SQLite DBs are invalid

The datapackage descriptors we are currently generating to annotate the SQLite DBs which are derived from XBRL data are not valid. For example, in the ferc-xbrl-extractor environment, running this command:

frictionless validate ferc714_xbrl_datapackage.json

Results in a bunch of errors like:

# -------
# invalid: sqlite:////Users/zane/code/catalyst/pudl-work/output/ferc714_xbrl.sqlite
# -------

## Summary

+-----------------------------+--------------------------------------------------------------------------+
| Description                 | Size/Name/Count                                                          |
+=============================+==========================================================================+
| File name (Not Found)       | sqlite:////Users/zane/code/catalyst/pudl-work/output/ferc714_xbrl.sqlite |
+-----------------------------+--------------------------------------------------------------------------+
| File size                   | N/A                                                                      |
+-----------------------------+--------------------------------------------------------------------------+
| Total Time Taken (sec)      | 0.002                                                                    |
+-----------------------------+--------------------------------------------------------------------------+
| Total Errors                | 1                                                                        |
+-----------------------------+--------------------------------------------------------------------------+
| Scheme Error (scheme-error) | 1                                                                        |
+-----------------------------+--------------------------------------------------------------------------+

## Errors

+-------+---------+---------+---------------------------------------------------+
| row   | field   | code    | message                                           |
+=======+=========+=========+===================================================+
|       |         | scheme- | The data source could not be successfully loaded: |
|       |         | error   | cannot create loader "". Try installing           |
|       |         |         | "frictionless-"                                   |
+-------+---------+---------+---------------------------------------------------+

Or if we try to validate a single resource and return the errors in JSON form:

frictionless validate --json --resource-name planning_area_hourly_demand_and_forecast_summer_and_winter_peak_demand_and_annual_net_energy_for_load_table_03_2_instant ferc714_xbrl_datapackage.json
{
  "version": "4.40.11",
  "time": 0.001,
  "errors": [],
  "tasks": [
    {
      "resource": {
        "path": "sqlite:////Users/zane/code/catalyst/pudl-work/output/ferc714_xbrl.sqlite",
        "profile": "tabular-data-resource",
        "name": "planning_area_hourly_demand_and_forecast_summer_and_winter_peak_demand_and_annual_net_energy_for_load_table_03_2_instant",
        "dialect": {
          "table": "planning_area_hourly_demand_and_forecast_summer_and_winter_peak_demand_and_annual_net_energy_for_load_table_03_2_instant"
        },
        "title": "03.2 - Schedule - Planning Area Hourly Demand and Forecast Summer and Winter Peak Demand and Annual Net Energy for Load, Table - instant",
        "description": "ferc:SchedulePlanningAreaHourlyDemandAndForecastSummerAndWinterPeakDemandAndAnnualNetEnergyForLoadBAbstract",
        "format": "sqlite",
        "mediatype": "application/vnd.sqlite3",
        "schema": {
          "fields": [
            {
              "name": "entity_id",
              "title": "Entity Identifier",
              "type": "string",
              "format": "default",
              "description": "Unique identifier of respondent"
            },
            {
              "name": "filing_name",
              "title": "Filing Name",
              "type": "string",
              "format": "default",
              "description": "Name of filing"
            },
            {
              "name": "date",
              "title": "Instant Date",
              "type": "date",
              "format": "default",
              "description": "Date of instant period"
            },
            {
              "name": "planning_area_hourly_demand_and_forecast_summer_and_winter_peak_demand_and_annual_net_energy_for_load_axis",
              "title": "Planning Area Hourly Demand and Forecast Summer and Winter Peak Demand and Annual Net Energy for Load [Axis]",
              "type": "string",
              "format": "default",
              "description": "Typed dimension used to distinguish a set of related facts about planning area hourly demand and forecast summer and winter peak demand and annual net energy for load."
            }
          ],
          "primary_key": [
            "entity_id",
            "filing_name",
            "date",
            "planning_area_hourly_demand_and_forecast_summer_and_winter_peak_demand_and_annual_net_energy_for_load_axis"
          ]
        },
        "scheme": "",
        "hashing": "md5",
        "stats": {
          "hash": "",
          "bytes": 0
        }
      },
      "time": 0.001,
      "scope": [],
      "partial": false,
      "errors": [
        {
          "code": "scheme-error",
          "name": "Scheme Error",
          "tags": [],
          "note": "cannot create loader \"\". Try installing \"frictionless-\"",
          "message": "The data source could not be successfully loaded: cannot create loader \"\". Try installing \"frictionless-\"",
          "description": "Data reading error because of incorrect scheme."
        }
      ],
      "stats": {
        "errors": 1
      },
      "valid": false
    }
  ],
  "stats": {
    "errors": 1,
    "tasks": 1
  },
  "valid": false
}

The problem?

I think the issue here is that we are using v4 of the frictionless package, and the ability to annotate SQLite DBs was only introduced in v5. Looking at the datapackage.json file, I see that the dialect field is invalid. In frictionless v5, it would need to say sql and then point at the table within the sql dictionary. In previous versions it would describe the CSV dialect that's being used in the file that the path element points at. See this example of a data package annotating an SQLite DB.

{
            "path": "sqlite:////Users/zane/code/catalyst/pudl-work/output/ferc714_xbrl.sqlite",
            "profile": "tabular-data-resource",
            "name": "planning_area_hourly_demand_and_forecast_summer_and_winter_peak_demand_and_annual_net_energy_for_load_table_03_2_instant",
            "dialect": {
                "table": "planning_area_hourly_demand_and_forecast_summer_and_winter_peak_demand_and_annual_net_energy_for_load_table_03_2_instant"
            },
            "title": "03.2 - Schedule - Planning Area Hourly Demand and Forecast Summer and Winter Peak Demand and Annual Net Energy for Load, Table - instant",
            "description": "ferc:SchedulePlanningAreaHourlyDemandAndForecastSummerAndWinterPeakDemandAndAnnualNetEnergyForLoadBAbstract",
            "format": "sqlite",
            "mediatype": "application/vnd.sqlite3",
            "schema": {
                "fields": [
                    {
                        "name": "entity_id",
                        "title": "Entity Identifier",
                        "type": "string",
                        "format": "default",
                        "description": "Unique identifier of respondent"
                    },
                    {
                        "name": "filing_name",
                        "title": "Filing Name",
                        "type": "string",
                        "format": "default",
                        "description": "Name of filing"
                    },
                    {
                        "name": "date",
                        "title": "Instant Date",
                        "type": "date",
                        "format": "default",
                        "description": "Date of instant period"
                    },
                    {
                        "name": "planning_area_hourly_demand_and_forecast_summer_and_winter_peak_demand_and_annual_net_energy_for_load_axis",
                        "title": "Planning Area Hourly Demand and Forecast Summer and Winter Peak Demand and Annual Net Energy for Load [Axis]",
                        "type": "string",
                        "format": "default",
                        "description": "Typed dimension used to distinguish a set of related facts about planning area hourly demand and forecast summer and winter peak demand and annual net energy for load."
                    }
                ],
                "primary_key": [
                    "entity_id",
                    "filing_name",
                    "date",
                    "planning_area_hourly_demand_and_forecast_summer_and_winter_peak_demand_and_annual_net_energy_for_load_axis"
                ]
            }
        }

Frictionless v4 can't interpret sqlite:// URL as path

As it is, the system sees the sqlite:// URL in the path and has no idea how to interpret it to find the data.

The sqlite:// path must be relative not absolute

In addition to being unable to interpret the sqlite:// URL as a path at all, the URL uses an absolute path rather than a relative path, which is invalid. It is invalid both in that it violates the frictionless data resource specification, which says:

A “url-or-path” is a string with the following additional constraints:

  • MUST either be a URL or a POSIX path
  • URLs MUST be fully qualified. MUST be using either http or https scheme. (Absence of a scheme indicates MUST be a POSIX path)
  • POSIX paths (unix-style with / as separator) are supported for referencing local files, with the security restraint that they MUST be relative siblings or children of the descriptor. Absolute paths (/) and relative parent paths (../) MUST NOT be used, and implementations SHOULD NOT support these path types.

In addition, the absolute path is simply wrong if you download our nightly build outputs, since the path to which the descriptor and databases were written on the build server has no meaning on the user's machine:

sqlite:////home/catalyst/pudl_work/output/ferc1_xbrl.sqlite

What to do?

  • If we want to annotate SQLite DBs, I think we need to update to Frictionless v5 and follow the datapackage standards for annotating SQL sources, including using a relative path to the DB from the location of the datapackage.json file.
  • We should be validating every datapackage that we output during the nightly builds and in the integration tests for this repository.
  • If we want to, I think we can annotate all of the SQLite DBs and Parquet outputs using a single large datapackage descriptor.
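For comparison, a corrected resource entry might look roughly like the sketch below (shown as a Python dict, with the unchanged schema omitted): a relative POSIX path that is a sibling of datapackage.json, and the table name nested under a sql dialect key as the v5 format expects. The exact v5 field names should be double-checked against the frictionless v5 docs.

resource = {
    "name": "planning_area_hourly_demand_..._table_03_2_instant",  # truncated for readability
    # Relative sibling path instead of an absolute sqlite:// URL.
    "path": "ferc714_xbrl.sqlite",
    "format": "sqlite",
    "mediatype": "application/vnd.sqlite3",
    # frictionless v5 expects the table to be identified inside a "sql" dialect dict.
    "dialect": {"sql": {"table": "planning_area_hourly_demand_..._table_03_2_instant"}},
    # "schema" would be carried over unchanged from the descriptor shown above.
}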

Refactor instance parsing to wrap outputs in a single data structure

Currently, when parsing an XBRL instance, the code returns a tuple of dictionaries containing facts and contexts. Passing around these structures is not a very readable implementation and feels error-prone. Wrapping these structures in a class and moving some functionality to class methods should improve readability and robustness.
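A minimal sketch of the kind of wrapper this describes (the class and method names are illustrative, not the final API, and it assumes facts are keyed by context ID):

from dataclasses import dataclass

@dataclass
class ParsedInstance:
    """Bundle the facts and contexts parsed from a single XBRL filing."""

    facts: dict
    contexts: dict

    def facts_for_context(self, context_id: str) -> list:
        # Functionality like this can live on the class instead of being
        # reimplemented wherever the raw (facts, contexts) tuple was passed around.
        return self.facts.get(context_id, [])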

XBRL extraction is much slower with pandas 2.1.1 than pandas 2.0.3

Running the XBRL extractions using pandas 2.1.1 is much slower than 2.0.3. tox -e integration takes about 300 seconds with pandas 2.0.3, but 677 seconds with pandas 2.1.1.

The difference in runtime for pandas 2.1.1 seems to increase more than linearly with the size of the data being extracted, with the full extraction of all XBRL data taking 13 minutes using pandas 2.0.3 and more than 7 hours with pandas 2.1.1. See this comment thread.

Running

tox -re integration

Takes:

  • ~600s with pandas>=1.5,<2.2
  • ~500s with pandas[performance]>=1.5,<2.2
  • ~300s with pandas>=1.5,<2.1

I tried enabling the PyArrow-backed string inference default suggested in the pandas 2.1 release notes in all of the modules that import pandas as pd, in case the problem was slow object-dtype handling of strings, but it had no impact on overall runtime. Not sure if that's because it didn't actually change the treatment of strings, or if it did and that wasn't the problem:

pd.options.future.infer_string = True

There's one recent open performance regression issue for pandas 2.1: pandas-dev/pandas#55256

Update documentation

The README and docs are pretty sparse at the moment, and we should include some high-level descriptions of how XBRL parsing actually works. Most people are not very familiar with XBRL, so it would be helpful to have a brief overview of what XBRL is, how it's structured, and how we are working with it.

Add support for FERC Forms 714 and 2

Up until now, the FERC XBRL Extractor has only been tested on Form 1. Form 1 is the most thoroughly integrated into PUDL, which is why it has been the priority, but there is preliminary support for Form 714, with plans to further integrate Forms 714 and 2. The XBRL extractor is not meant to be specific to Form 1, so ideally this process will go smoothly.

Tasks

Form 714

  • Download test filings and test extraction
  • Explore output database

Form 2

  • Download test filings and test extraction
  • Explore output database

Stop concatenating null dataframes

In the main PUDL ETL we are currently getting tens of thousands of warnings like:

/Users/zane/miniforge3/envs/pudl-dev/lib/python3.11/site-packages/ferc_xbrl_extractor/xbrl.py:169: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.

We should probably stop attempting to do these null concatenations, lest they break in the future. Or maybe we can just assign explicit data types? Or suppress the warnings so we only see the message once, instead of 17,000 times?

For the moment these warnings are being ignored completely. See #170
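One simple option along the lines suggested above is to drop empty or all-NA frames before concatenating; a sketch, not the code currently in xbrl.py:

import pandas as pd

def concat_non_empty(dfs: list[pd.DataFrame]) -> pd.DataFrame:
    # Keep only frames that contain real data, so pandas never has to infer
    # result dtypes from empty/all-NA inputs (the source of the FutureWarning).
    non_empty = [df for df in dfs if not df.empty and not df.isna().all().all()]
    if not non_empty:
        return pd.DataFrame()
    return pd.concat(non_empty)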

Pretty print datapackage.json descriptors for human readability

We pretty-print the XBRL taxonomy JSON to make it human-readable, but the datapackage.json outputs are still all put on a single line, which is difficult to read and strains some editors, which read only some lines into memory to improve performance.

It seems like it should be easy to pretty-print the datapackage outputs as well.
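This should mostly be a matter of passing an indent when the descriptor is serialized; a sketch assuming we write the dict out with json.dump (or the equivalent call wherever the descriptor is currently written):

import json

def write_datapackage(datapackage_dict: dict, path: str) -> None:
    with open(path, "w") as f:
        # indent=2 pretty-prints the descriptor, just like the taxonomy JSON output.
        json.dump(datapackage_dict, f, indent=2)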
