After careful consideration of the challenges presented, I've developed a proposal I'm eager to present. However, before we delve into the specifics, let's take a moment to review the data landscape we're navigating.
Exploring the Data Landscape
Let's take a look at a curated selection representing the variance in structure and content across different 10-Q filings:
10-Q/CAT/0000018230-03-000208
10-Q/BEN/0000038777-22-000138
10-Q/BSX/0001072613-08-001558
10-Q/CBOE/0001558370-20-012101
10Q/AAPL/0000320193_23_000077
Tree-oriented representation
Let's take a few examples of the design:
10-Q/CAT/0000018230-03-000208
from typing import Dict
table_element: TableElement = ...
tree: Dict = table_element.parse()
assert tree['2007']['Long-Term Debt']['Machinery and Engines'].value == 275
assert tree['2007']['Long-Term Debt']['Machinery and Engines'].scale == 'millions'
assert tree['2007']['Long-Term Debt']['Machinery and Engines'].unit == '$'
10-Q/CBOE/0001558370-20-012101
assert tree['Percentage of Total Revenues'] \
['Three Months Ended, September 30']['2020']['Operating Expenses'].value == 32.2
assert tree['Percentage of Total Revenues'] \
['Three Months Ended, September 30']['2020']['Operating Expenses'].unit == "%"
assert tree['Three Months Ended, September 30']['2020']['Operating Expenses'].value == 12.6
assert tree['Three Months Ended, September 30']['2020']['Operating Expenses'].scale == "millions"
assert tree['Three Months Ended, September 30']['2020']['Operating Expenses'].unit == "$"
Tabular-oriented representation
Year |
Period End Date |
Period Description |
Category |
Subcategory |
Value |
Scale |
Unit |
2007 |
- |
- |
Long-Term Debt |
Machinery and Engines |
275 |
millions |
$ |
2020 |
September 30 |
Three Months Ended |
Operating Expenses |
- |
12.6 |
millions |
$ |
2020 |
September 30 |
Three Months Ended |
Percentage of Total Revenues |
Operating Expenses |
32.2 |
- |
% |
Parsing process
The parsing process could be organized as HTML -> Tabular -> Dict
. This would involve parsing the HTML directly to a table, such as with pandas.read_html()
and then converting it to a python dictionary.
The other way would be HTML -> Dict -> Tabular
, this would involve traversing the HTML tags (the DOM Tree) directly with a tool like BeautifulSoup4
, and then constructing the Table from the constructed tree (Python dictionary).
Your turn
We're excited to share our proposal for a new standardized table parsing method using a tree structure, designed to streamline the representation of data from SEC EDGAR reports. We'd greatly appreciate your professional insights to refine this approach. Please share your thoughts!