catalyst-cooperative / ferc-xbrl-extractor
A tool for converting FERC filings published in XBRL into SQLite databases
License: MIT License
Even when our tests are passing, the badge in the README says they're failing. I don't get it. See GitHub's instructions for workflow status badges.
This is the address for the badge:
https://github.com/catalyst-cooperative/ferc-xbrl-extractor/workflows/tox-pytest/badge.svg
As of 2023-10-18T19:55 the badge shows failing (screenshot), but it's all green in the Actions tab (screenshot).
Traditionally we've allowed extracting only a subset of tables from the DBF-based FERC 1 data for testing purposes. This is a fairly easy update to make in the XBRL extraction, so this issue will add that functionality. This requires a small API update to provide an option for specifying tables.
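A minimal sketch of the filtering such an option implies (all names hypothetical, not the extractor's actual API): the extract call grows an optional tables argument, and the table definitions get filtered before any work is done.

def filter_tables(all_tables: dict, requested: list[str] | None) -> dict:
    # None preserves current behavior: extract everything.
    if requested is None:
        return all_tables
    missing = set(requested) - set(all_tables)
    if missing:
        raise ValueError(f"Unknown tables: {sorted(missing)}")
    return {name: t for name, t in all_tables.items() if name in requested}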
Currently, the only feedback/output is a couple of print statements. These should be replaced with detailed logging.
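A minimal sketch of the swap, reusing the catalystcoop.ferc_xbrl_extractor logger names that show up elsewhere in these logs:

import logging

logger = logging.getLogger("catalystcoop.ferc_xbrl_extractor.xbrl")
logging.basicConfig(level=logging.INFO)

taxonomy_url = "https://eCollection.ferc.gov/taxonomy/form1/2022-01-01/form/form1/form-1_2022-01-01.xsd"
logger.info("Parsing taxonomy from %s", taxonomy_url)  # instead of print()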
In @zschira's investigations into catalyst-cooperative/pudl#2755, he found that there are many XBRL facts that refer to contexts which are missing dimensions, which means that we don't assign those facts to any tables.
We would like to assign those facts to the best of our ability, which means understanding these "lost facts" that don't have a home.
First we should identify them. We may have to add the notion of a nullable dimension to the table schemas.
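A hedged sketch (all names hypothetical) of what a nullable dimension could mean for table matching: a fact whose context is missing a dimension still matches a table if the schema marks that dimension as nullable.

def context_matches(context_dims: dict[str, str], schema_dims: dict[str, bool]) -> bool:
    # schema_dims maps dimension name -> nullable flag
    for dim, nullable in schema_dims.items():
        if dim not in context_dims and not nullable:
            return False  # missing a required (non-nullable) dimension
    return True

# A context missing "utility_type_axis" matches only if that dimension is nullable:
assert context_matches({}, {"utility_type_axis": True})
assert not context_matches({}, {"utility_type_axis": False})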
From this issue, we've found that there is additional metadata contained in XBRL taxonomies that will be useful in linking XBRL/DBF data. Specifically, we see use in having the XBRL references (which contain FERC account numbers and form location info), the balance type (credit or debit), and the calculations which define relationships between columns (like aggregations). This information can all be pulled out of taxonomies and put into a more usable format.
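As a sketch of what a "more usable format" might look like, the metadata for a single concept could be flattened into a record like this (all field names are hypothetical, not the extractor's actual schema):

from dataclasses import dataclass, field

@dataclass
class ConceptMetadata:
    name: str                         # XBRL concept name
    balance: str | None = None        # "credit" or "debit"
    ferc_account: str | None = None   # FERC account number, from the references
    form_location: str | None = None  # schedule/page info, from the references
    calculations: list[dict] = field(default_factory=list)  # column relationships, e.g. aggregations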
The extractor needs some improvements in the README and docs to make it easier to use and understand technical details about how it works.
Now that we've got CI running #46 we can start improving our test coverage! Some tests that we know we'd like to run:
- The LC_ALL environment variable is not altered by Arelle (this should be fixed, but... just in case of regressions)

Add a GitHub Actions workflow to run Tox in CI, and test that the package works on Python 3.8-3.11.
We also need to expand the unit & integration tests for better coverage, but getting the CI up and running is a necessary step on its own.
When attempting to run the ferc1_to_sqlite script processing all tables for years 2020 and 2021 (so, one year each from the two FERC data sources), I get a bunch of errors about files not existing (but looking in the cache directory, they do seem to exist):
2022-07-07 17:53:38 [ INFO] catalystcoop.ferc_xbrl_extractor.xbrl:155 Parsing taxonomy from https://eCollection.ferc.gov/taxonomy/form1/2022-01-01/form/form1/form-1_2022-01-01.xsd
2022-07-07 17:53:39 [ INFO] catalystcoop.ferc_xbrl_extractor.xbrl:155 Parsing taxonomy from https://eCollection.ferc.gov/taxonomy/form1/2022-01-01/form/form1/form-1_2022-01-01.xsd
2022-07-07 17:53:39 [ INFO] catalystcoop.ferc_xbrl_extractor.xbrl:155 Parsing taxonomy from https://eCollection.ferc.gov/taxonomy/form1/2022-01-01/form/form1/form-1_2022-01-01.xsd
2022-07-07 17:53:39 [ INFO] catalystcoop.ferc_xbrl_extractor.xbrl:155 Parsing taxonomy from https://eCollection.ferc.gov/taxonomy/form1/2022-01-01/form/form1/form-1_2022-01-01.xsd
2022-07-07 17:53:42,621 [webCache:cacheDownloadRenamingError] [Errno 2] No such file or directory: '/home/zane/.config/arelle/cache/https/eCollection.ferc.gov/taxonomy/form1/2022-01-01/schedules/ScheduleAccumulatedDeferredIncomeTaxes/sched-234_2022-01-01.xsd.tmp' -> '/home/zane/.config/arelle/cache/https/eCollection.ferc.gov/taxonomy/form1/2022-01-01/schedules/ScheduleAccumulatedDeferredIncomeTaxes/sched-234_2022-01-01.xsd'
Unsuccessful renaming of downloaded file to active file /home/zane/.config/arelle/cache/https/eCollection.ferc.gov/taxonomy/form1/2022-01-01/schedules/ScheduleAccumulatedDeferredIncomeTaxes/sched-234_2022-01-01.xsd
Please remove with file manager. -
2022-07-07 17:53:42,624 [IOerror] sched-234_2022-01-01.xsd: file error: [Errno 2] No such file or directory: '/home/zane/.config/arelle/cache/https/eCollection.ferc.gov/taxonomy/form1/2022-01-01/schedules/ScheduleAccumulatedDeferredIncomeTaxes/sched-234_2022-01-01.xsd' - https://eCollection.ferc.gov/taxonomy/form1/2022-01-01/form/form1/form-1_2022-01-01.xsd 2
2022-07-07 17:53:54,479 [webCache:cacheDownloadRenamingError] [Errno 2] No such file or directory: '/home/zane/.config/arelle/cache/https/eCollection.ferc.gov/taxonomy/form1/2022-01-01/ferc-core_2022-01-01.xsd.tmp' -> '/home/zane/.config/arelle/cache/https/eCollection.ferc.gov/taxonomy/form1/2022-01-01/ferc-core_2022-01-01.xsd'
Unsuccessful renaming of downloaded file to active file /home/zane/.config/arelle/cache/https/eCollection.ferc.gov/taxonomy/form1/2022-01-01/ferc-core_2022-01-01.xsd
Please remove with file manager. -
I notice there's some weird case-sensitive stuff going on in my arelle cache directory, with the version from the last time I played around in April not having any uppercase in the domain name / directory:
drwxr-xr-x 3 zane zane 4.0K 2022-04-12 21:51 ecollection.ferc.gov/
drwxr-xr-x 3 zane zane 4.0K 2022-07-07 17:53 eCollection.ferc.gov/
There are also some syntax errors coming up...
2022-07-07 17:58:45,875 [xmlSchema:syntax] Char 0x0 out of allowed range, sched-304_2022-01-01_def.xml, line 223, column 128 - ../../../../../../../sched-304_2022-01-01_def.xml , https://eCollection.ferc.gov/taxonomy/form1/2022-01-01/schedules/ScheduleSalesOfElectricityByRateSchedules/sched-304_2022-01-01.xsd 5
2022-07-07 17:58:45,875 [xmlSchema:syntax] AttValue: ' expected, sched-304_2022-01-01_def.xml, line 223, column 128 - ../../../../../../../sched-304_2022-01-01_def.xml , https://eCollection.ferc.gov/taxonomy/form1/2022-01-01/schedules/ScheduleSalesOfElectricityByRateSchedules/sched-304_2022-01-01.xsd 5
2022-07-07 17:58:45,875 [xmlSchema:syntax] attributes construct error, sched-304_2022-01-01_def.xml, line 223, column 128 - ../../../../../../../sched-304_2022-01-01_def.xml , https://eCollection.ferc.gov/taxonomy/form1/2022-01-01/schedules/ScheduleSalesOfElectricityByRateSchedules/sched-304_2022-01-01.xsd 5
2022-07-07 17:58:45,875 [xmlSchema:syntax] Couldn't find end of Start Tag loc line 223, sched-304_2022-01-01_def.xml, line 223, column 128 - ../../../../../../../sched-304_2022-01-01_def.xml , https://eCollection.ferc.gov/taxonomy/form1/2022-01-01/schedules/ScheduleSalesOfElectricityByRateSchedules/sched-304_2022-01-01.xsd 5
2022-07-07 17:58:45,875 [xmlSchema:syntax] Premature end of data in tag definitionLink line 17, sched-304_2022-01-01_def.xml, line 223, column 128 - ../../../../../../../sched-304_2022-01-01_def.xml , https://eCollection.ferc.gov/taxonomy/form1/2022-01-01/schedules/ScheduleSalesOfElectricityByRateSchedules/sched-304_2022-01-01.xsd 5
Despite the errors above, it seems to be attempting to extract some filings. However, in the end I get an error from concurrent.futures:
2022-07-07 18:02:37 [ INFO] catalystcoop.ferc_xbrl_extractor.xbrl:126 Extracting a6879fc6-64be-47e6-a830-0b9b4aaa24eb
2022-07-07 18:02:39 [ INFO] catalystcoop.ferc_xbrl_extractor.xbrl:126 Extracting 90b8961f-01d5-426c-b9e2-2c7b39884ced
2022-07-07 18:02:41 [ INFO] catalystcoop.ferc_xbrl_extractor.xbrl:126 Extracting f863e3cd-6ca9-42a9-b013-ca12fd301d2a
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/zane/mambaforge/envs/pudl-dev/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/home/zane/mambaforge/envs/pudl-dev/lib/python3.10/concurrent/futures/process.py", line 205, in _process_chunk
return [fn(*args) for args in chunk]
File "/home/zane/mambaforge/envs/pudl-dev/lib/python3.10/concurrent/futures/process.py", line 205, in <listcomp>
return [fn(*args) for args in chunk]
File "/home/zane/mambaforge/envs/pudl-dev/lib/python3.10/site-packages/ferc_xbrl_extractor/xbrl.py", line 93, in process_batch
instance_dfs = process_instance(instance, db_path, taxonomy, save_metadata)
File "/home/zane/mambaforge/envs/pudl-dev/lib/python3.10/site-packages/ferc_xbrl_extractor/xbrl.py", line 124, in process_instance
tables = get_fact_tables(taxonomy, db_path, save_metadata)
File "/home/zane/mambaforge/envs/pudl-dev/lib/python3.10/site-packages/ferc_xbrl_extractor/xbrl.py", line 156, in get_fact_tables
taxonomy = Taxonomy.from_path(taxonomy_path, save_metadata)
File "/home/zane/mambaforge/envs/pudl-dev/lib/python3.10/site-packages/ferc_xbrl_extractor/taxonomy.py", line 175, in from_path
roles = [
File "/home/zane/mambaforge/envs/pudl-dev/lib/python3.10/site-packages/ferc_xbrl_extractor/taxonomy.py", line 176, in <listcomp>
LinkRole.from_list(role, concept_dict) for role in view.jsonObject["roles"]
File "/home/zane/mambaforge/envs/pudl-dev/lib/python3.10/site-packages/ferc_xbrl_extractor/taxonomy.py", line 147, in from_list
concepts=Concept.from_list(linkrole_list[3], concept_dict),
IndexError: list index out of range
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/zane/mambaforge/envs/pudl-dev/bin/ferc1_to_sqlite", line 33, in <module>
sys.exit(load_entry_point('catalystcoop.pudl', 'console_scripts', 'ferc1_to_sqlite')())
File "/home/zane/code/catalyst/pudl/src/pudl/convert/ferc1_to_sqlite.py", line 113, in main
pudl.extract.ferc1.xbrl2sqlite(
File "/home/zane/code/catalyst/pudl/src/pudl/extract/ferc1.py", line 618, in xbrl2sqlite
xbrl.extract(
File "/home/zane/mambaforge/envs/pudl-dev/lib/python3.10/site-packages/ferc_xbrl_extractor/xbrl.py", line 61, in extract
for i, batch in enumerate(results):
File "/home/zane/mambaforge/envs/pudl-dev/lib/python3.10/concurrent/futures/process.py", line 570, in _chain_from_iterable_of_lists
for element in iterable:
File "/home/zane/mambaforge/envs/pudl-dev/lib/python3.10/concurrent/futures/_base.py", line 609, in result_iterator
yield fs.pop().result()
File "/home/zane/mambaforge/envs/pudl-dev/lib/python3.10/concurrent/futures/_base.py", line 446, in result
return self.__get_result()
File "/home/zane/mambaforge/envs/pudl-dev/lib/python3.10/concurrent/futures/_base.py", line 391, in __get_result
raise self._exception
IndexError: list index out of range
[1] 50211 exit 1 ferc1_to_sqlite --clobber src/pudl/package_data/settings/etl_full.yml
We had a slew of dependency updates this week, which meant a bunch of CI tests got run, and many of them are failing, but in slightly different ways. It looks like the proportion of facts that get lost isn't the same every time. Are we doing some random sampling that might get different answers on different runs? If so, maybe we want to select a fixed subset to test on, or set the random seed to a constant value?
Of the 24 runs (pull request + push for each of Python 3.10 & 3.11, across 6 dependabot PRs) there were the following 4 failures. So like a 17% failure rate?
_______________________ test_lost_facts_pct[form2_2021] ________________________
extracted = ExtractOutput(table_defs={'corporate_officer_certification_001_duration': <ferc_xbrl_extractor.datapackage.FactTable o...s', 'c-00:compressor_station_equipment_gas_transmission_plant', 'c-1044:compressor_hours_of_operation_during_year'}})})
request = <FixtureRequest for <Function test_lost_facts_pct[form2_2021]>>
def test_lost_facts_pct(extracted, request):
table_defs, instances, table_data, stats = extracted
total_facts = sum(len(i.fact_id_counts) for i in instances)
total_used_facts = sum(len(f_ids) for f_ids in stats["fact_ids"].values())
used_fact_ratio = total_used_facts / total_facts
if "form6_" in request.node.name:
# We have unallocated data for Form 6 for some reason.
total_threshold = 0.9
per_filing_threshold = 0.8
# Assert that this is < 0.95 so we remember to fix this test once we
# fix the bug. We don't use xfail here because the parametrization is
# at the *fixture* level, and only the lost facts tests should fail
# for form 6.
assert used_fact_ratio > total_threshold and used_fact_ratio <= 0.95
else:
total_threshold = 0.99
per_filing_threshold = 0.95
> assert used_fact_ratio > total_threshold and used_fact_ratio <= 1
E assert (0.9854457831325302 > 0.99)
_______________________ test_lost_facts_pct[form6_2021] ________________________
extracted = ExtractOutput(table_defs={'annual_corporate_officer_certification_001_duration': <ferc_xbrl_extractor.datapackage.Fact...atio_for_long_term_debt_rate_of_return', 'iffa57d4b6aa645f68d9de3452bd9d47d_D20210101-20211231:schedule_exemption'}})})
request = <FixtureRequest for <Function test_lost_facts_pct[form6_2021]>>
def test_lost_facts_pct(extracted, request):
table_defs, instances, table_data, stats = extracted
total_facts = sum(len(i.fact_id_counts) for i in instances)
total_used_facts = sum(len(f_ids) for f_ids in stats["fact_ids"].values())
used_fact_ratio = total_used_facts / total_facts
if "form6_" in request.node.name:
# We have unallocated data for Form 6 for some reason.
total_threshold = 0.9
per_filing_threshold = 0.8
# Assert that this is < 0.95 so we remember to fix this test once we
# fix the bug. We don't use xfail here because the parametrization is
# at the *fixture* level, and only the lost facts tests should fail
# for form 6.
> assert used_fact_ratio > total_threshold and used_fact_ratio <= 0.95
E assert (0.8165926212091945 > 0.9)
_______________________ test_lost_facts_pct[form2_2021] ________________________
extracted = ExtractOutput(table_defs={'corporate_officer_certification_001_duration': <ferc_xbrl_extractor.datapackage.FactTable o...1:disposition_of_excess_gas', 'c-910:other_utility_operating_income_associated_with_taxes_other_than_income_taxes'}})})
request = <FixtureRequest for <Function test_lost_facts_pct[form2_2021]>>
def test_lost_facts_pct(extracted, request):
table_defs, instances, table_data, stats = extracted
total_facts = sum(len(i.fact_id_counts) for i in instances)
total_used_facts = sum(len(f_ids) for f_ids in stats["fact_ids"].values())
used_fact_ratio = total_used_facts / total_facts
if "form6_" in request.node.name:
# We have unallocated data for Form 6 for some reason.
total_threshold = 0.9
per_filing_threshold = 0.8
# Assert that this is < 0.95 so we remember to fix this test once we
# fix the bug. We don't use xfail here because the parametrization is
# at the *fixture* level, and only the lost facts tests should fail
# for form 6.
assert used_fact_ratio > total_threshold and used_fact_ratio <= 0.95
else:
total_threshold = 0.99
per_filing_threshold = 0.95
> assert used_fact_ratio > total_threshold and used_fact_ratio <= 1
E assert (0.972722891566265 > 0.99)
And then in just this one case, there aren't any keywords/tags associated with the datapackage?
if datapackage_path:
# Verify that datapackage descriptor is valid before outputting
frictionless_package = Package(descriptor=datapackage.dict(by_alias=True))
if not frictionless_package.metadata_valid:
> raise RuntimeError(
f"Generated datapackage is invalid - {frictionless_package.metadata_errors}"
)
E RuntimeError: Generated datapackage is invalid - [{'code': 'package-error',
E 'description': 'A validation cannot be processed.',
E 'message': 'The data package has an error: "[] is too short" at "resources" '
E 'in metadata and at "properties/resources/minItems" in profile',
E 'name': 'Package Error',
E 'note': '"[] is too short" at "resources" in metadata and at '
E '"properties/resources/minItems" in profile',
E 'tags': []}]
And also a bunch of errors when trying to access the taxonomy files?
2023-09-13 22:14:15,948 [webCache:retrievalError] Forbidden
retrieving https://ecollection.ferc.gov/taxonomy/form714/2022-01-01/form/form714/form-714_2022-01-01.xsd -
2023-09-13 22:14:15,949 [FileNotLoadable] File can not be loaded: https://ecollection.ferc.gov/taxonomy/form714/2022-01-01/form/form714/form-714_2022-01-01.xsd - https://ecollection.ferc.gov/taxonomy/form714/2022-01-01/form/form714/form-714_2022-01-01.xsd
Right now all of the XBRL taxonomy metadata we extract is organized into an array of JSON objects, with each one pertaining to a single XBRL fact. E.g. in the context of the XBRL-derived electric_plant_in_service table there's a top level object for each of the following FERC accounts or aggregations of FERC accounts (as well as additional facts for each one with the suffixes _additions, _retirements, _adjustments, and _transfers):
land_and_land_rights_steam_production
structures_and_improvements_steam_production
boiler_plant_equipment_steam_production
engines_and_engine_driven_generators_steam_production
turbogenerator_units_steam_production
accessory_electric_equipment_steam_production
miscellaneous_power_plant_equipment_steam_production
steam_production_plant
But there's no programmatically usable indication that each one of these array elements belongs to the electric_plant_in_service table.
This complicates our use of the metadata in the ETL pipeline, since it means we can't select the appropriate subset of the metadata fields for use with a given table unless we already have a list of all of the columns that appear in that table. Right now we're reading the column names at the beginning of the XBRL table transform, and then using them to grab the right metadata, but it would be much simpler if we could grab the right metadata at the beginning, knowing only the name of the table (i.e. electric_plant_in_service).
Given how generic some of the field names are (like organization...), it's also kind of surprising that they would be globally unique or always have the same meaning among the thousands of FERC 1 fields.
Clearly there's information in the XBRL taxonomies that indicates which fields are part of which tables. How can we integrate that into the taxonomy export so that we can select one table's worth of metadata at a time, without needing to extract a list of column names from the actual database first?
Would it make sense to make it a dictionary of dictionaries, keyed by table name, with each table's dictionary containing its "facts"? E.g.
{
  "electric_plant_in_service_204": {
    "land_and_land_rights_steam_production": {...},
    "structures_and_improvements_steam_production": {...},
    "boiler_plant_equipment_steam_production": {...},
    "engines_and_engine_driven_generators_steam_production": {...},
    "turbogenerator_units_steam_production": {...},
    "accessory_electric_equipment_steam_production": {...},
    "miscellaneous_power_plant_equipment_steam_production": {...},
    "steam_production_plant": {...},
    ...
  },
  "electric_energy_account_401a": {
    ...
  }
}
Or do exactly the same elements need to be associated with more than one table?
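Whichever shape wins, the payoff is that selecting one table's worth of metadata becomes a single lookup. A sketch, assuming the dictionary-of-dictionaries structure proposed above and a hypothetical output filename:

import json

with open("ferc1_xbrl_taxonomy_metadata.json") as f:  # hypothetical path
    metadata = json.load(f)

plant_in_service_meta = metadata["electric_plant_in_service_204"]  # one table's facts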
With the integration of PR Arelle/Arelle#513 into the released Arelle 2.3.0, we no longer need to use a context manager to reset the LC_ALL environment variable back to its original value after using Arelle, so we should remove that workaround in our code (and test that it actually works...).
Occasionally, when extracting the FERC XBRL data in the Taxonomy parsing step, the process will hang for a very long time. I've seen it take 15 minutes and then resume normal operation (and I've also quit out of it after 20 minutes). Normally this process takes only a few seconds.
It seems to be totally stochastic. Like if you run the extraction over and over and over again at the command line, it'll be fine 90% of the time and then once in a while it gets stuck. It seems like it can happen on any of the FERC forms. Very odd.
Not sure if it's related at all, but I sometimes also get this failed assertion when it's hanging. Not sure why it would only appear sometimes:
dyld[70238]: Assertion failed: (this->magic == kMagic), function matchesPath, file Loader.cpp, line 154.
@cmgosnell commented on Mon Jun 27 2022
the steam and fuel tables are both processed/concatenated and the ferc plant-id-er works!
I'd love some overall design input... I pretty aggressively moved cleaning details into global variables so I could make all functions take a df and/or a table name and/or a source of the ferc1 data (xbrl/dbf), in the hopes that this will be more dagster-friendly and standardized.
In the transform function we always load all of the tables so we can be explicit about which raw/transformed tables are fed into each table transform. The alternative, checking if table_name in ferc1_settings.tables: before each table call... sounds gross to me.

ferc1_tfr_dfs = {}
# make all the non-steam tables
for table_name in ferc1_settings.tables:
    ferc1_tfr_dfs[table_name] = globals().get(table_name)(
        ferc1_dbf_raw_dfs.get(table_name),
        ferc1_xbrl_raw_dfs.get(table_name),
    )
# make the steam table using fuel
ferc1_tfr_dfs["plants_steam_ferc1"] = plants_steam_ferc1(
    steam_dbf_raw=ferc1_dbf_raw_dfs.get("plants_steam_ferc1"),
    steam_xbrl_raw=ferc1_xbrl_raw_dfs.get("plants_steam_ferc1"),
    fuel_transformed=ferc1_tfr_dfs.get("fuel_ferc1"),
)
It's unclear whether helpers like oob_to_nan, the pudl.transform.ferc1.transform function, and _multiplicative_error_correction are applicable for the xbrl data. See #1707 and #1722 for the table-specific task lists.
@cmgosnell commented on Wed Jun 29 2022
a note about the extract step: we could save the raw tables under the pudl table names, with _instant and _duration suffixes when applicable, instead of the current setup which saves the ferc1 raw table names. This way we could automatically grab the extracted tables with just the pudl table name (which is now the argument for many many of these transform functions).
@zaneselvans commented on Fri Aug 12 2022
I asked some questions about the FERC1 transform refactor design in this comment on #1739
@zaneselvans commented on Wed Aug 17 2022
Also @bendnorman did you see this list of outstanding questions on the linked issue? catalyst-cooperative/pudl#1739 (comment)
Relying on a live web resource to parse archived data is fragile, as the live resource may not be available or may have changed later.
Update the FERC XBRL extraction process to rely on archived FERC XBRL taxonomies available from our Zenodo archives, rather than the live taxonomies available from FERC.
See also: catalyst-cooperative/pudl-scrapers#42
- respondent_id==145: PublicServiceCompanyOfColorado-145-2020Q1F1.xbrl, used in the extract_fuel notebook. Where did you find that URL?
- In the xbrl_extract CLI, what's the --to-sql argument? That's optional but... where does the data go if I don't give it that? The help message gives no indication.
- There's no feedback beyond a couple of print() statements, or any indication of how long the whole thing is taking.
In extract_fuel.ipynb:
- f1_ not ferc_: the PUDL FERC 1 database tables all end with _ferc1 -- because we have a bunch of different FERC datasets, it's important never to name anything just "ferc" -- it always needs to have the form number.
- df.groupby("respondent_id").cumcount() could be used to assign the row and supplement numbers, I think.
- The FuelBurnedAverageHeatContent column looks totally jacked. But I guess that's because we haven't cleaned up all the units errors yet.
- ... the -0 and -1 etc. suffixes in the plant_name column?
In extract_steam.ipynb:
- ... the _f footnote columns that don't contain real data, and I don't think any such columns exist in the new DB.
- ... (f1_fuel or f1_steam)
- ... the f1_plant_in_srvce table, where each row is a header, subtotal, total, or a particular FERC Account number).
- ... (the f1_edcfu / depreciation tables).
- In f1_edcfu_epda (a messy depreciation table) the junk column where they list the FERC Acct. number (or a descriptive string) still seems to be a total mess. However there's a couple of columns that look like they're supposed to somehow indicate the order in which these rows should appear, and potentially provide some additional information about how they are grouped.

Arelle has begun officially packaging the library for PyPI here. It would probably make sense to switch to this version, assuming it doesn't break anything, so we can get updates/improvements.
The datapackage descriptors we are currently generating to annotate the SQLite DBs which are derived from XBRL data are not valid. For example, in the ferc-xbrl-extractor environment, running this command:
frictionless validate ferc714_xbrl_datapackage.json
Results in a bunch of errors like:
# -------
# invalid: sqlite:////Users/zane/code/catalyst/pudl-work/output/ferc714_xbrl.sqlite
# -------
## Summary
+-----------------------------+--------------------------------------------------------------------------+
| Description | Size/Name/Count |
+=============================+==========================================================================+
| File name (Not Found) | sqlite:////Users/zane/code/catalyst/pudl-work/output/ferc714_xbrl.sqlite |
+-----------------------------+--------------------------------------------------------------------------+
| File size | N/A |
+-----------------------------+--------------------------------------------------------------------------+
| Total Time Taken (sec) | 0.002 |
+-----------------------------+--------------------------------------------------------------------------+
| Total Errors | 1 |
+-----------------------------+--------------------------------------------------------------------------+
| Scheme Error (scheme-error) | 1 |
+-----------------------------+--------------------------------------------------------------------------+
## Errors
+-------+---------+---------+---------------------------------------------------+
| row | field | code | message |
+=======+=========+=========+===================================================+
| | | scheme- | The data source could not be successfully loaded: |
| | | error | cannot create loader "". Try installing |
| | | | "frictionless-" |
+-------+---------+---------+---------------------------------------------------+
Or if we try and validate a single resource and return the errors in JSON form:
frictionless validate --json --resource-name planning_area_hourly_demand_and_forecast_summer_and_winter_peak_demand_and_annual_net_energy_for_load_table_03_2_instant ferc714_xbrl_datapackage.json
{
"version": "4.40.11",
"time": 0.001,
"errors": [],
"tasks": [
{
"resource": {
"path": "sqlite:////Users/zane/code/catalyst/pudl-work/output/ferc714_xbrl.sqlite",
"profile": "tabular-data-resource",
"name": "planning_area_hourly_demand_and_forecast_summer_and_winter_peak_demand_and_annual_net_energy_for_load_table_03_2_instant",
"dialect": {
"table": "planning_area_hourly_demand_and_forecast_summer_and_winter_peak_demand_and_annual_net_energy_for_load_table_03_2_instant"
},
"title": "03.2 - Schedule - Planning Area Hourly Demand and Forecast Summer and Winter Peak Demand and Annual Net Energy for Load, Table - instant",
"description": "ferc:SchedulePlanningAreaHourlyDemandAndForecastSummerAndWinterPeakDemandAndAnnualNetEnergyForLoadBAbstract",
"format": "sqlite",
"mediatype": "application/vnd.sqlite3",
"schema": {
"fields": [
{
"name": "entity_id",
"title": "Entity Identifier",
"type": "string",
"format": "default",
"description": "Unique identifier of respondent"
},
{
"name": "filing_name",
"title": "Filing Name",
"type": "string",
"format": "default",
"description": "Name of filing"
},
{
"name": "date",
"title": "Instant Date",
"type": "date",
"format": "default",
"description": "Date of instant period"
},
{
"name": "planning_area_hourly_demand_and_forecast_summer_and_winter_peak_demand_and_annual_net_energy_for_load_axis",
"title": "Planning Area Hourly Demand and Forecast Summer and Winter Peak Demand and Annual Net Energy for Load [Axis]",
"type": "string",
"format": "default",
"description": "Typed dimension used to distinguish a set of related facts about planning area hourly demand and forecast summer and winter peak demand and annual net energy for load."
}
],
"primary_key": [
"entity_id",
"filing_name",
"date",
"planning_area_hourly_demand_and_forecast_summer_and_winter_peak_demand_and_annual_net_energy_for_load_axis"
]
},
"scheme": "",
"hashing": "md5",
"stats": {
"hash": "",
"bytes": 0
}
},
"time": 0.001,
"scope": [],
"partial": false,
"errors": [
{
"code": "scheme-error",
"name": "Scheme Error",
"tags": [],
"note": "cannot create loader \"\". Try installing \"frictionless-\"",
"message": "The data source could not be successfully loaded: cannot create loader \"\". Try installing \"frictionless-\"",
"description": "Data reading error because of incorrect scheme."
}
],
"stats": {
"errors": 1
},
"valid": false
}
],
"stats": {
"errors": 1,
"tasks": 1
},
"valid": false
}
I think the issue here is that we are using v4 of the frictionless package, and the ability to annotate SQLite DBs was only introduced in v5. Looking at the datapackage.json file, I see that the dialect field is invalid. In frictionless v5, it would need to say sql and then point at the table within the sql dictionary, i.e. "dialect": {"sql": {"table": ...}}; in previous versions it would describe the CSV dialect that's being used in the file that the path element points at. See this example of a data package annotating an SQLite DB.
{
"path": "sqlite:////Users/zane/code/catalyst/pudl-work/output/ferc714_xbrl.sqlite",
"profile": "tabular-data-resource",
"name": "planning_area_hourly_demand_and_forecast_summer_and_winter_peak_demand_and_annual_net_energy_for_load_table_03_2_instant",
"dialect": {
"table": "planning_area_hourly_demand_and_forecast_summer_and_winter_peak_demand_and_annual_net_energy_for_load_table_03_2_instant"
},
"title": "03.2 - Schedule - Planning Area Hourly Demand and Forecast Summer and Winter Peak Demand and Annual Net Energy for Load, Table - instant",
"description": "ferc:SchedulePlanningAreaHourlyDemandAndForecastSummerAndWinterPeakDemandAndAnnualNetEnergyForLoadBAbstract",
"format": "sqlite",
"mediatype": "application/vnd.sqlite3",
"schema": {
"fields": [
{
"name": "entity_id",
"title": "Entity Identifier",
"type": "string",
"format": "default",
"description": "Unique identifier of respondent"
},
{
"name": "filing_name",
"title": "Filing Name",
"type": "string",
"format": "default",
"description": "Name of filing"
},
{
"name": "date",
"title": "Instant Date",
"type": "date",
"format": "default",
"description": "Date of instant period"
},
{
"name": "planning_area_hourly_demand_and_forecast_summer_and_winter_peak_demand_and_annual_net_energy_for_load_axis",
"title": "Planning Area Hourly Demand and Forecast Summer and Winter Peak Demand and Annual Net Energy for Load [Axis]",
"type": "string",
"format": "default",
"description": "Typed dimension used to distinguish a set of related facts about planning area hourly demand and forecast summer and winter peak demand and annual net energy for load."
}
],
"primary_key": [
"entity_id",
"filing_name",
"date",
"planning_area_hourly_demand_and_forecast_summer_and_winter_peak_demand_and_annual_net_energy_for_load_axis"
]
}
}
sqlite:// URL as path
As it is, the system sees the sqlite:// URL in the path and has no idea how to interpret it to find the data.
sqlite:// path must be relative not absolute
In addition to being unable to interpret the sqlite:// URL as a path at all, the URL uses an absolute path rather than a relative path, which is invalid. It is invalid both in that it violates the frictionless data resource specification, which says:
A “url-or-path” is a string with the following additional constraints:
In addition, the absolute path is simply wrong if you download our nightly build outputs since the path to which the descriptor and databases were written on the build server have no meaning on the user's machine:
sqlite:////home/catalyst/pudl_work/output/ferc1_xbrl.sqlite
Paths should instead be relative to the datapackage.json file.

Currently when parsing an XBRL instance, the code returns a tuple of dictionaries containing facts and contexts. Passing around these structures is not very readable and feels error prone. Wrapping these structures in a class, and moving some functionality to class methods, should improve readability and robustness.
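A minimal sketch of such a wrapper, assuming the facts and contexts are the dictionaries the parser currently returns (all names hypothetical):

from dataclasses import dataclass

@dataclass
class ParsedInstance:
    facts: dict      # fact_id -> fact
    contexts: dict   # context_id -> context

    def facts_for_context(self, context_id: str) -> list:
        # example of logic that moves from free functions onto the class
        return [f for f in self.facts.values() if f.get("context") == context_id]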
Running the XBRL extractions using pandas 2.1.1 is much slower than 2.0.3. tox -e integration takes about 300 seconds with pandas 2.0.3, but 677 seconds with pandas 2.1.1.
The difference in runtime for pandas 2.1.1 seems to increase more than linearly with the size of the data being extracted, with the full extraction of all XBRL data taking 13 minutes using pandas 2.0.3 and more than 7 hours with pandas 2.1.1. See this comment thread.
Running tox -re integration was timed with each of the following pandas version constraints:
pandas>=1.5,<2.2
pandas[performance]>=1.5,<2.2
pandas>=1.5,<2.1
I tried changing the string inference default to use PyArrow, as suggested in the pandas 2.1 release notes, in all of the modules that import pandas as pd, in case the problem was slow object treatment of strings, but it had no impact on overall runtime. Not sure if that's because it didn't actually have an effect on the treatment of strings, or if it did and that wasn't the problem:
pd.options.future.infer_string = True
There's one recent open performance regression issue for pandas 2.1: pandas-dev/pandas#55256
The README and docs are pretty sparse at the moment and we should include some high level descriptions of how XBRL parsing actually works. Most people are not very familiar with XBRL, so it would be helpful to have a brief overview of what XBRL is, how it's structured, and how we are working with it.
Up until now, the FERC XBRL Extractor has only been tested on Form 1. Form 1 is the most fully integrated into PUDL, which is why it has been the priority, but there is preliminary support for Form 714, with plans to further integrate Form 714 and Form 2. The XBRL extractor is not meant to be specific to Form 1, so ideally this process will go smoothly.
XBRL filings include a reference to the taxonomy used in the filing. The taxonomy should be deduced automatically from the filing, rather than requiring one to be passed in directly.
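Deducing it should be mechanical, since an XBRL instance declares its taxonomy in a link:schemaRef element near the top of the document. A sketch using lxml:

from lxml import etree

NSMAP = {
    "link": "http://www.xbrl.org/2003/linkbase",
    "xlink": "http://www.w3.org/1999/xlink",
}

def taxonomy_url(instance_path: str) -> str:
    # the instance's schemaRef points at the taxonomy entry point (.xsd)
    tree = etree.parse(instance_path)
    schema_ref = tree.find(".//link:schemaRef", NSMAP)
    return schema_ref.get("{http://www.w3.org/1999/xlink}href")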
In the main PUDL ETL we are currently getting tens of thousands of warnings like:
/Users/zane/miniforge3/envs/pudl-dev/lib/python3.11/site-packages/ferc_xbrl_extractor/xbrl.py:169: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
We should probably stop attempting to do these null concatenations, lest it break down in the future. Or maybe we can just assign explicit data types? Or suppress the warnings so we only get it 1 time, instead of 17,000 times?
For the moment these warnings are being ignored completely. See #170
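A sketch of the first option: drop empty or all-NA frames before concatenating, so the deprecated behavior is never triggered (helper name hypothetical, not the extractor's actual code):

import pandas as pd

def concat_non_empty(dfs: list[pd.DataFrame]) -> pd.DataFrame:
    # keep only frames that have at least one non-NA value
    kept = [df for df in dfs if not df.empty and not df.isna().all().all()]
    if not kept:
        return pd.DataFrame()
    return pd.concat(kept)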
We pretty-print the XBRL taxonomy in JSON to enable human readability, but the datapackage.json outputs are still all put on a single line, which is both difficult for a human to read and a strain for some editors, which read only some lines into memory to improve performance.
It seems like it should be easy to pretty-print the datapackage outputs as well.
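It is: the stdlib can round-trip the descriptor with indentation (a sketch; in practice the indent would be set wherever the extractor writes the file):

import json

with open("ferc714_xbrl_datapackage.json") as f:
    descriptor = json.load(f)

with open("ferc714_xbrl_datapackage.json", "w") as f:
    json.dump(descriptor, f, indent=2)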
To enable PUDL to access the datapackage descriptor generated during XBRL extraction, there needs to be an API update to allow specifying a path to write the datapackage descriptor to.