sec-parser's Issues

`part2item6` common issue - exhibits with tables

Related to alphanome-ai/sec-ai#47


An AssertionError: Missing: ['part2item6'] is raised in the generalization tests when the exhibits table appears at the same level as the top-level section title.

This is a common issue in:

FAILED tests/generalization/processing_steps/test_top_level_section_title_classifier.py::test_top_level_section_title_classifier[10-Q_RTX_0000101829-23-000032] - AssertionError: Missing: ['part2item6']
FAILED tests/generalization/processing_steps/test_top_level_section_title_classifier.py::test_top_level_section_title_classifier[10-Q_TMO_0000097745-23-000059] - AssertionError: Missing: ['part2item6']
FAILED tests/generalization/processing_steps/test_top_level_section_title_classifier.py::test_top_level_section_title_classifier[10-Q_UPS_0001090727-23-000038] - AssertionError: Missing: ['part2item6']
FAILED tests/generalization/processing_steps/test_top_level_section_title_classifier.py::test_top_level_section_title_classifier[10-Q_MCD_0000063908-23-000076] - AssertionError: Missing: ['part2item6']
FAILED tests/generalization/processing_steps/test_top_level_section_title_classifier.py::test_top_level_section_title_classifier[10-Q_MDT_0001613103-23-000128] - AssertionError: Missing: ['part2item6']
FAILED tests/generalization/processing_steps/test_top_level_section_title_classifier.py::test_top_level_section_title_classifier[10-Q_GM_0001467858-23-000098] - AssertionError: Missing: ['part2item6'], Unexpected: ['part1item1']
FAILED tests/generalization/processing_steps/test_top_level_section_title_classifier.py::test_top_level_section_title_classifier[10-Q_KHC_0001637459-23-000114] - AssertionError: Missing: ['part2item6']

Almost all of them (RTX, TMO, MDT, KHC, GM) have the following pattern:

<div ...>

    <span ...>
    Item 6. Exhibits
    </span>

    <table ...>
    ...
    </table>

</div>

UPS differs slightly, with the title split across two spans:

<div ...>

    <span ...>
    Item 6. 
    </span>
    <span ...>
    Exhibits
    </span>

    <table ...>
    ...
    </table>

</div>

MCD differs more substantially: the title "Item 6. Exhibits" is written inside the table itself:

<div>
    <table>
        <tr></tr>

        <tr>
            <td ...>
                <div ...>
                    <span ...>
                        Item 6. Exhibits
                    </span>
                </div>
            </td>
        </tr>

        ...

        <tr></tr>
        <tr></tr>
    </table>
</div>

Screenshots for two major cases:

Table at the same level as top-level section title

image

Top-level section title written inside table:

image
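A minimal sketch of detecting the "Item 6. Exhibits" title across all of the patterns above: gather the element's visible text (even when the title is split across sibling spans or nested inside a table) and match it against an "Item 6" pattern. This uses only the standard library and is a simplification of what the real classifier would need; the function names are hypothetical, not sec-parser API.

```python
import re
from html.parser import HTMLParser

ITEM6_RE = re.compile(r"item\s*6\.?\s*exhibits", re.IGNORECASE)

class TextCollector(HTMLParser):
    """Collect all text nodes, regardless of how deeply they are nested."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def contains_item6_title(html: str) -> bool:
    collector = TextCollector()
    collector.feed(html)
    # Join fragments and normalize whitespace, so split spans still match
    text = " ".join("".join(collector.parts).split())
    return bool(ITEM6_RE.search(text))
```

This handles both the div-level pattern (spans followed by a table) and the MCD-style pattern (title inside a table cell), since it ignores the tag structure entirely.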

Singular Visual Line Should Be Identified as a Single TextElement

Problem

For MSFT 0000950170-23-014423, the top section title "PART I. FINANCIAL INFORMATION " is identified as two semantic elements:
[
  {
    "cls_name": "TopSectionTitle",
    "level": 0,
    "section_type": "part1",
    "text_content": "PART I. FINANCI"
  },
  {
    "cls_name": "TitleElement",
    "level": 0,
    "text_content": "AL INFORMATION"
  }
]

This should be:
{
  "cls_name": "TopSectionTitle",
  "level": 0,
  "section_type": "part1",
  "text_content": "PART I. FINANCIAL INFORMATION"
}

Ideas about a possible solution

Adjust text element merger to keep merging elements until a new visual line.
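The merging idea could be sketched as follows. Fragment and its line_index field are hypothetical stand-ins for sec-parser's internal element and layout types, not the library's API:

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    text: str
    line_index: int  # index of the visual line the fragment is rendered on

def merge_fragments_by_line(fragments: list) -> list:
    """Concatenate consecutive fragments that share the same visual line."""
    merged = []
    prev_line = None
    for frag in fragments:
        if merged and frag.line_index == prev_line:
            merged[-1] += frag.text  # same visual line: keep merging
        else:
            merged.append(frag.text)  # new visual line: start a new element
        prev_line = frag.line_index
    return merged
```

With this, the two MSFT fragments "PART I. FINANCI" and "AL INFORMATION" on the same visual line would come out as a single "PART I. FINANCIAL INFORMATION" element.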

Implement parsing accuracy end-to-end tests

Objective

Introduce an end-to-end test mechanism that allows freezing the expected parsing results, facilitating manual review and approval of changes in parsing behavior.

Tasks

  1. Freezing Mechanism: Implement a way to "lock" or "freeze" the current expected parsing output in a YAML document. This will be based on the syntax available in the parsing_plugins test modules.
  2. Output Review: Add a pre-commit check to enable manual review. This check will block the commit if there's a mismatch between parsing-result.txt and parsing-frozen.txt (names are tentative).
  3. Hashing HTML: For readability and storage efficiency, hash the HTML content for each expected semantic element.
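Task 3 could be sketched like this: store a short, stable hash of each element's HTML instead of the HTML itself, so the frozen YAML stays small and readable. The function and field names here are hypothetical:

```python
import hashlib

def hash_html(html: str) -> str:
    """Return a short, stable fingerprint of an element's HTML source."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()[:12]

# Example of what one frozen entry might look like
frozen_entry = {
    "cls_name": "TextElement",
    "html_hash": hash_html("<p>Some text</p>"),
}
```

A mismatch in html_hash between the frozen and current outputs would then flag the element for manual review without bloating the snapshot file.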

Show trailing text during `tree.render()`

I was having a look at the output of tree.render() and noticed that only the starting text is printed, as shown below.

├── TitleElement: Macroeconomic Conditions
│   ├── TextElement: Macroeconomic conditions, including inflation, cha ... 

Does the core logic guarantee that the entire text body is captured whenever the starting text is correct? If not, I would suggest printing the trailing text as well, as shown below:

├── TitleElement: Macroeconomic Conditions
│   ├── TextElement: Macroeconomic conditions, including inflation, cha ... y’s results of operations and financial condition.

This would be easy to implement by replacing

title = f"{title[:max_line_length]}..."

with

title = f"{title[:max_line_length]} ... {title[-max_line_length:]}" 
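One caveat with that replacement: for short titles, the naive slices would overlap and print the same text twice. A guarded variant (a sketch; max_line_length mirrors the existing variable name):

```python
def truncate_middle(title: str, max_line_length: int) -> str:
    """Show the leading and trailing text, truncating only the middle."""
    if len(title) <= 2 * max_line_length + 5:
        return title  # short enough that truncation would gain nothing
    return f"{title[:max_line_length]} ... {title[-max_line_length:]}"
```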

Request for Feedback: Architectural Design Proposal for Standardized Parsing of SEC EDGAR Tables

After careful consideration of the challenges presented, I've developed a proposal I'm eager to present. However, before we delve into the specifics, let's take a moment to review the data landscape we're navigating.

Exploring the Data Landscape

Let's take a look at a curated selection representing the variance in structure and content across different 10-Q filings:

10-Q/CAT/0000018230-03-000208

image

10-Q/BEN/0000038777-22-000138

image

10-Q/BSX/0001072613-08-001558

image

10-Q/CBOE/0001558370-20-012101

image

10-Q/AAPL/0000320193-23-000077

image

Tree-oriented representation

Let's take a few examples of the design:

10-Q/CAT/0000018230-03-000208

image
from typing import Dict

table_element: TableElement = ...
tree: Dict = table_element.parse()
assert tree['2007']['Long-Term Debt']['Machinery and Engines'].value == 275
assert tree['2007']['Long-Term Debt']['Machinery and Engines'].scale == 'millions'
assert tree['2007']['Long-Term Debt']['Machinery and Engines'].unit == '$'

10-Q/CBOE/0001558370-20-012101

image
assert tree['Percentage of Total Revenues'] \
           ['Three Months Ended, September 30']['2020']['Operating Expenses'].value == 32.2
assert tree['Percentage of Total Revenues'] \
           ['Three Months Ended, September 30']['2020']['Operating Expenses'].unit == "%"

assert tree['Three Months Ended, September 30']['2020']['Operating Expenses'].value == 12.6
assert tree['Three Months Ended, September 30']['2020']['Operating Expenses'].scale == "millions"
assert tree['Three Months Ended, September 30']['2020']['Operating Expenses'].unit == "$"

Tabular-oriented representation

| Year | Period End Date | Period Description | Category | Subcategory | Value | Scale | Unit |
|------|-----------------|--------------------|----------|-------------|-------|-------|------|
| 2007 | - | - | Long-Term Debt | Machinery and Engines | 275 | millions | $ |
| 2020 | September 30 | Three Months Ended | Operating Expenses | - | 12.6 | millions | $ |
| 2020 | September 30 | Three Months Ended | Percentage of Total Revenues | Operating Expenses | 32.2 | - | % |

Parsing process

The parsing process could be organized as HTML -> Tabular -> Dict. This would involve parsing the HTML directly into a table, for example with pandas.read_html(), and then converting it to a Python dictionary.

The other way would be HTML -> Dict -> Tabular: traverse the HTML tags (the DOM tree) directly with a tool like BeautifulSoup4, and then construct the table from the resulting tree (a Python dictionary).
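The Tabular -> Dict step could be sketched as follows: each flattened row carries its key path (e.g. year, category, subcategory) plus value/scale/unit, and the rows are folded into a nested dictionary. The Cell fields mirror the proposal above; everything else (names, row format) is a hypothetical illustration:

```python
from dataclasses import dataclass

@dataclass
class Cell:
    value: float
    scale: str
    unit: str

def rows_to_tree(rows):
    """Fold flattened rows (key..., value, scale, unit) into a nested dict."""
    tree = {}
    for *path, value, scale, unit in rows:
        node = tree
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = Cell(value, scale, unit)
    return tree

tree = rows_to_tree([
    ("2007", "Long-Term Debt", "Machinery and Engines", 275, "millions", "$"),
])
```

This reproduces the assertions from the tree-oriented examples, e.g. tree['2007']['Long-Term Debt']['Machinery and Engines'].value == 275.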

Your turn

We're excited to share our proposal for a new standardized table parsing method using a tree structure, designed to streamline the representation of data from SEC EDGAR reports. We'd greatly appreciate your professional insights to refine this approach. Please share your thoughts!

Raise Code Unit Test Coverage to 90-100%

codecov

Current state

Code coverage is at 72%.

Rationale

  • Increases robustness against regressions.
  • Increases community trust and encourages more contributions.

Goal

Raise it to 90-100%.

  • Important: Please ensure that tests are meaningful and cover edge cases, rather than just inflating the coverage percentage.

Resources

  1. Codecov.io

ModuleNotFoundError: No module named 'sec_downloader'

The import statement for sec_downloader in the "How to use" section will not work because instructions for installing sec_downloader are not provided. This results in:

ModuleNotFoundError: No module named 'sec_downloader'

Two ways to handle this:

Either add pip install sec_downloader to the README file's "Getting Started" section,

(or)

add a line with sec-downloader = "^0.2.3" to pyproject.toml's [tool.poetry.dependencies] instead of only adding it to the dev dependencies. This will ensure that the import statement works with just pip install sec-ai.

Fix the TopSectionTitle being split in MSFT filing

Context

MSFT accuracy-test (permalink at the time of posting)

Problem

Titles come out as two separate title elements

[
    {
        "text_content": "PART I. FINANCI"
    },
    {
        "text_content": "AL INFORMATION"
    }
]

This is because MSFT splits the section titles into two pieces for some reason.

Ideas about a possible solution

One idea is to incorporate line information into the solution: if two elements of the same type (and level) are on the same visual line, they should probably be merged into a single element.

small change in doc

image

Change "blue" to "green" in the sentence "For example, we can get a percentage of blue text:".

Adjusting Top Section Title Regular Expression to Handle Accented Characters

Problem

The regular expression employed in the top section manager for 10q needs modification, specifically to eliminate accented characters from both the regular expression itself and the input being matched.

Ideas about a possible solution

Remove the comma characters from the additional_title used in the regular expression.

The additional_title before:

additional_title =  ("Unregistered Sales of Equity Securities, Use of Proceeds, and "
                    "Issuer Purchases of Equity Securities")

and after:

additional_title =  ("Unregistered Sales of Equity Securities Use of Proceeds and "
                    "Issuer Purchases of Equity Securities")
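An alternative to maintaining comma-free copies of every title by hand would be to normalize both the expected title and the input before matching. A sketch (the function name and the exact punctuation set are assumptions):

```python
import re

def normalize_title(text: str) -> str:
    """Strip punctuation and collapse whitespace for punctuation-insensitive matching."""
    text = re.sub(r"[,.;:]", "", text)          # drop punctuation such as commas
    return re.sub(r"\s+", " ", text).strip().lower()

expected = ("Unregistered Sales of Equity Securities, Use of Proceeds, and "
            "Issuer Purchases of Equity Securities")
candidate = ("Unregistered Sales of Equity Securities Use of Proceeds and "
             "Issuer Purchases of Equity Securities")
```

With this normalization, the "before" and "after" titles above compare equal, so the regular expression source text would not need to be edited at all.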

Page headers should be identified as PageHeaderElement

Context

MSFT accuracy-test (permalink at the time of posting)

Problem

The header "PART I" is identified as a top section title element, when it should be identified as a page header element. Because of this, the actual top section title element is incorrectly identified as a title element.

image

Ideas about a possible solution

One possible solution: identify page header elements first; then the top section title classification will start working correctly, as it will no longer be confused by the header elements.

Manually find bugs in various 10-Q documents and create Github issues

Objective

To enhance sec-parser by manually reviewing 10-Q documents from key companies and addressing issues.

Process once the Github issues are created:

  1. Issue List: Manually review 10-Qs from selected companies and list parsing issues.
  2. Triage: Categorize issues as Critical, High, Medium, or Low impact.
  3. Plan: Develop a plan to fix these issues.
  4. Fix: Start implementing fixes based on priority.

Caching for .parse_latest

As noted in the readme.md

The parser utilizes caching, so multiple calls to retrieve the same data will not consume your API calls limit.

After running the code several times, I found that the API call quota is still being consumed. The .parse_latest() method currently doesn't seem to have any caching yet, or perhaps I am missing something. Can someone help confirm this?

Thanks

Add "Open in" badge(s) to two jupyter notebooks

As discussed in this thread, we are looking to enhance the accessibility of our guide notebooks by adding an "Open in SageMaker Studio Lab" badge. This badge will allow users to directly open and run the notebooks in a cloud environment, improving the user experience.

Tasks for Contributors:

  1. Update the links in the following badges to point to the guide notebooks (see links below).

    Open In SageMaker Studio Lab

    Kaggle

  2. Verify that clicking the updated link opens the notebook in the cloud environment.

  3. Add the updated badge to the following notebooks, positioning it after the existing badges:

  4. Ensure that the badge functions correctly when integrated into the notebook.

  5. Create a GitHub Pull Request with your changes.


Special thanks to @mahimairaja for the excellent idea! 🙌 🙌

This issue is ideal for new contributors, offering a great opportunity to make a significant impact on our project. We appreciate your contributions!

Welcome to sec-parser! Start Here for Contributing

Contribution Workflow

We're excited about your interest in contributing to Alphanome AI's projects! To ensure a smooth and efficient process for all contributors, we've established this workflow. Please follow these steps to contribute effectively and avoid overlapping efforts.

Step 1: Select a Task

  1. Option A: Explore Open Issues:

    • Check out our Request For Contributions board for tasks that are ready for contributions.
    • Alternatively, browse through the GitHub Issues page of a specific project, such as sec-parser Issues or sec-ai Issues.
    • Tips:
      • Look for tasks labeled contributions-welcome. These tasks align with the project goals.
      • If you're new to the project, look for tasks labeled good-first-issue.
      • Be sure to check if a task is already tagged in-progress to avoid duplicate efforts.
  2. Option B: Propose a New Task:

    • Go through our Short-Term Roadmap to understand our focus areas and upcoming projects.
    • If you discover an issue or have a novel idea, feel free to propose it. Initiate a conversation either in the Discussions forum or on our Discord server.

Step 2: Prepare for Contribution

  1. Read CONTRIBUTING.md:

    • Before you begin, read the CONTRIBUTING.md file of the project for guidelines on setup, coding standards, and codebase understanding.
  2. Fork the Project:

    • Fork the project on GitHub to create your own workspace.
  3. Communicate Your Plan:

    • We recommend commenting on the issue you're tackling to discuss your approach and seek guidance. This also allows us to tag the issue as in-progress.
  4. Continuously Sync Your Fork:

    • Follow this GitHub Guide to synchronize your fork with the main repository.

Step 3: Begin Your Contribution

  1. Submit a Pull Request:

    • Create a pull request with your changes, clearly explaining your contributions.
  2. Check for Errors:

    • Run our automated checks and your local tests to catch and fix any issues before final submission.

We're grateful for your contributions and look forward to your valuable input in our project!

Seeking Assistance and Asking Questions

If you have any questions, or concerns, or need further clarification, feel free to reach out. Please use our Discussions page for more detailed queries and Discord for quick, conversational questions. For questions specific to a GitHub issue or pull request, kindly post them directly in the respective issue or PR thread.


Note
For Maintainers: Content above is taken from our Common Contributing Guide. If any updates or changes are made here, please ensure they are also reflected in the original guide for consistency.

Additional Instructions to add API Key & Developer setup

Following @INF800's addition to the readme.md in issue #10, I personally think it would be better to add a .env tutorial for setting up the SECAPIO_API_KEY and to add these lines of code to the readme.md file.

from dotenv import load_dotenv
import os

env_path = ".env"
load_dotenv(dotenv_path=env_path)
api_key = os.environ.get('SECAPIO_API_KEY')

By the way, I just noticed that the changes merged for the previous issue were reverted by this commit: 6e2d1a8

Create a visualisation tool that overlays parsed elements with semi-transparent boxes

Background

In the example below, a Semantic Segmentation model consolidates multiple individual pixels into a single, coherent Semantic Element within an image.

image

Similarly, sec-parser features a Semantic Segmentation algorithm that consolidates multiple HTML tags into a single, coherent Semantic Element within a page.

Task

We'd like to have a new Python function that takes one input (a list of semantic elements) and produces one output (a string of HTML source code with all the semi-transparent overlays applied), for example:

image

The coloring should be based on the type of the semantic element (the following is just example code to clearly convey the idea):

colored_html = ""
for element in parsed_elements:
    if isinstance(element, TextElement): 
        html = element.get_source_code()
        colored_html = colored_html + add_color('yellow', html)
    ...

Parsing into a SemanticTree is most likely not needed; you can just use the list of semantic elements.

Notes

  • You can retrieve the HTML to be modified directly from the semantic elements themselves. Therefore, the input of your function could be just ingesting a list of semantic elements (or a semantic tree).
  • It can be either part of the debug_dashboard or it can be a completely separate tool in the dev_utils folder.
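The hypothetical add_color helper from the example above could be sketched like this. The color mapping and the wrapper markup are assumptions for illustration, not sec-parser API; the real tool would likely key the mapping off isinstance checks as in the pseudocode:

```python
# Assumed mapping from element class name to a semi-transparent overlay color
COLORS = {
    "TextElement": "rgba(255, 255, 0, 0.3)",
    "TitleElement": "rgba(0, 128, 255, 0.3)",
}

def add_color(color: str, html: str) -> str:
    """Wrap the element's HTML in a div with a semi-transparent background."""
    return f'<div style="background-color: {color};">{html}</div>'

def overlay(elements) -> str:
    """elements: iterable of (cls_name, html_source) pairs."""
    return "".join(
        add_color(COLORS.get(cls_name, "rgba(128, 128, 128, 0.2)"), html)
        for cls_name, html in elements
    )
```

Unknown element types fall back to a neutral gray overlay, so the tool degrades gracefully as new semantic element classes are added.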

The task snapshot-verify fails

Error
FileNotFoundError: [Errno 2] No such file or directory: '/home/deenaawny/issue-66/sec-parser-test-data/10-Q/AA/0001193125-18-236766/expected-semantic-elements-list.json'
task: Failed to run task "snapshot-verify": exit status 1

Solution
Comment out the accession number in selected-filings.yaml:

#- 0001193125-18-236766 # 10-Q AA Alcoa Corp 2018-06-30

Pull Request
To do

Write missing unit tests

Philosophy

Working with Complex Data: Unit Testing Approach

When dealing with complex data, a common and effective strategy is to encapsulate the complexity within a unit test. This approach involves defining the various scenarios you anticipate and then focusing on testing these scenarios rather than the entire document or using extensive debugging tools.
This method significantly reduces the time required to verify if your modifications are working as expected. Here's how you can do it:

1. Isolate the complexity: Identify the complex part of your data and isolate it as a unit test. This could be a function, a class, or any other component that you find complex.

2. Define the scenarios: Determine what you want to happen for different inputs or states of your program. These scenarios will form the basis of your unit tests.

3. Work with the unit test: Once you have your unit test set up, you can make changes and run the test to see if your changes are working as expected. This is much quicker and more efficient than working with the full document or using full debugging tools.

Remember, the goal here is to make your testing process more efficient and manageable. By isolating complexity and focusing on unit tests, you can achieve this goal and ensure your changes work as intended.

source

Technical details

The goal is to make the command task c finish successfully with 100% unit test coverage (it's currently at 98%).

Note
It's just a shortcut for task pre-commit-checks, which itself includes task unit-tests among other things; task unit-tests is in turn a shortcut for calling pytest tests/unit/.

Some helpful tips:

  1. You can get more detailed output by passing -vv to the pytest command:

task c -- -vv

  2. It's convenient to have the tests re-run automatically right after you save any file:

task monitor-unit-tests

or (as described previously)

task monitor-unit-tests -- -vv

  3. Install IDE extensions that can highlight which lines are covered by unit tests and which are not. For Cursor (or VSCode) users I recommend Coverage Gutters, which works out of the box with no additional configuration. Enable it as described here.

  4. Consider using the following GPT-4 prompt to get a first draft and continue from there:

Create a pytest test suite for a given functionality in my codebase. The tests should be structured according to the Arrange-Act-Assert (AAA) pattern for clarity and maintainability. Include comments within each test to clearly delineate the Arrange, Act, and Assert stages. Include tests for both normal and edge cases. Provide descriptive test function names. All code output must be full and complete to be pasted, don't leave it to me to finish it off (you can exclude imports though if you want). Use test tables (pytest parameterized) where appropriate. Here is my code:
{{PASTE CODE HERE}}
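As a concrete illustration of the Arrange-Act-Assert pattern and test tables mentioned in the prompt, here is a plain-Python sketch (the function under test is a made-up example, and in practice you would express the table with pytest.mark.parametrize):

```python
def classify_severity(score: int) -> str:
    """Made-up function under test: map a numeric score to a severity label."""
    if score >= 80:
        return "critical"
    if score >= 50:
        return "high"
    return "low"

# Arrange: a test table of (input, expected) pairs, including edge cases
CASES = [
    (100, "critical"),
    (80, "critical"),  # boundary
    (79, "high"),
    (50, "high"),      # boundary
    (49, "low"),
    (0, "low"),
]

def test_classify_severity():
    for score, expected in CASES:
        # Act
        result = classify_severity(score)
        # Assert
        assert result == expected, f"score={score}"
```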

Parsers for other types of docs

Hi, I'm impressed with your work. I'm expecting this work will enable us to extract valuable insights from SEC filings. Really appreciate it.

But I'm wondering if you are planning to add more parsers for other types of documents, such as 10-K, 8-K, etc.
It would be very useful to have more parsers.

Thanks.

Extracting Text Elements Using sec_parser

I am currently using sec_parser to parse SEC filings, and I've encountered issues while trying to extract specific text elements such as TopSectionTitle, TitleElement, and TextElement. The output includes ANSI escape codes which seem to interfere with straightforward text extraction.

Is there any subfunction to do this?

FAILED tests/generalization/processing_steps/test_top_level_section_title_classifier.py::test_top_level_section_title_classifier[10-Q_MMM_0000066740-23-000058] - AssertionError: Missing: ['part2item3']

Related to alphanome-ai/sec-ai#47


Currently, TextElement can become HighlightedTextElement only if 80% of the text content is bold with some font weight. This is not an ideal scenario in cases such as part2item3 of 10-Q_MMM_0000066740-23-000058 which looks like:

image

where ~55% of text content is bold with some font weight. The style_string default dict for it looks like:

image

Because of this, TextElement cannot become HighlightedTextElement, which in turn cannot become TitleElement, which in turn cannot become TopLevelSectionElement.

Solution:

Change PERCENTAGE_THRESHOLD from 80 to a value less than 54.9.

In order to find a sweet spot, I checked whether there were more failures like these where the threshold needs to be decreased, but could not find any in the dataset. So I think the best value as of now would be 50 percent, assuming neither the trailing description nor the bold heading has more text content.
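The threshold logic under discussion amounts to something like the following sketch (PERCENTAGE_THRESHOLD and the 50% value mirror the proposal; the function name and signature are hypothetical):

```python
PERCENTAGE_THRESHOLD = 50  # proposed value, lowered from 80

def is_highlighted(bold_chars: int, total_chars: int) -> bool:
    """True if the bold fraction of the text meets the threshold."""
    if total_chars == 0:
        return False
    return 100 * bold_chars / total_chars >= PERCENTAGE_THRESHOLD
```

With the old threshold of 80, the MMM case (~55% bold) fails; with 50, it passes while mostly-plain text stays unhighlighted.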

Make HighlightedTextClassifier work with `<b>` tags

Discussed in https://github.com/orgs/alphanome-ai/discussions/56

Originally posted by Elijas November 24, 2023

Example document

https://www.sec.gov/Archives/edgar/data/1675149/000119312518236766/d828236d10q.htm

image
 <p style="margin-top:9pt; margin-bottom:0pt; text-indent:4%; font-size:10pt; font-family:Times New Roman">
  Options to purchase 1 million shares of common stock at a weighted average exercise price of $36.28 were
outstanding as of June 30, 2017, but were not included in the computation of diluted EPS because they were anti-dilutive, as the exercise prices of the options were greater than the average market price of Alcoa Corporation's common stock.
 </p>
 <p style="margin-top:13pt; margin-bottom:0pt; font-size:10pt; font-family:Times New Roman">
  <b>
   G. Accumulated Other Comprehensive Loss
  </b>
 </p>
 <p style="margin-top:6pt; margin-bottom:0pt; text-indent:4%; font-size:10pt; font-family:Times New Roman">
  The following table details the activity of the three components that comprise Accumulated other comprehensive loss for both Alcoa
Corporation's shareholders and Noncontrolling interest:
 </p>

Goal

The "G. Accumulated Other Comprehensive Loss" should be recognized as HighlightedTextElement (and therefore, TitleElement).

Most likely, you will have to compute the percentage of text that is covered by the <b> tag, reusing the parts already implemented for HighlightedTextElement. This will help you avoid situations where text text text <b>bold</b> text text is recognized as highlighted.
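Measuring how much of an element's text sits inside <b> tags could be sketched with only the standard library (the class and function names are hypothetical; the real implementation would plug into the existing style-based machinery):

```python
from html.parser import HTMLParser

class BoldCoverage(HTMLParser):
    """Count characters of text inside vs. outside <b> tags."""
    def __init__(self):
        super().__init__()
        self.depth = 0        # current <b> nesting depth
        self.bold_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "b" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.depth:
            self.bold_chars += n

def bold_fraction(html: str) -> float:
    parser = BoldCoverage()
    parser.feed(html)
    return parser.bold_chars / parser.total_chars if parser.total_chars else 0.0
```

The fully-bold "G. Accumulated Other Comprehensive Loss" paragraph scores 1.0, while a paragraph with only a few bold words scores well below any sensible threshold.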

Download documents directly from SEC EDGAR instead of using sec-api.io API

We're currently using sec-api.io

  • To remove title pages;
  • To find separation points between top-level sections;

The goal of this issue is to implement logic for identifying the separation points between the title page and the first root section, and between different root sections in SEC EDGAR HTML documents.

  1. Implement logic to identify the separation point between the title page and the first root section.
  2. Implement logic to identify separation points between root sections.
  3. The logic should be robust, able to handle edge cases.
  4. Confirm the implementation's accuracy across a large dataset using 3rd party APIs.
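A starting-point sketch for tasks 1 and 2: locate "PART I" / "PART II" headings by scanning the document text with a regular expression. Real filings need far more robustness (which is exactly what tasks 3 and 4 call for), and the pattern here is an assumption, not the final logic:

```python
import re

# Matches "PART I", "PART II", etc., followed by a period or whitespace
PART_RE = re.compile(r"\bPART\s+(I{1,3}|IV)\b[.\s]", re.IGNORECASE)

def find_separation_points(text: str) -> list:
    """Return (offset, matched heading) pairs, in document order."""
    return [(m.start(), m.group(0).strip()) for m in PART_RE.finditer(text)]
```

Everything before the first match would be the title page; the spans between consecutive matches would be the root sections.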

Parse Page Numbers and Page Separators

image
  • It would probably be best to create new classes such as PageElement, inheriting from IrrelevantElement, to allow easy removal of all such elements in one sweep
  • Include unit tests with a variety of real-life samples. Use pytest test tables. Include the source document ticker and accession number in the test name.
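One candidate detection rule for the proposed PageElement class could be a check for short, standalone numeric lines (the class design is as proposed in the issue; the heuristic itself is an assumption and would need tuning against real filings):

```python
import re

# Matches bare page numbers such as "12", "- 12 -", or " 7 "
PAGE_NUMBER_RE = re.compile(r"^\s*(?:-\s*)?\d{1,3}(?:\s*-)?\s*$")

def looks_like_page_number(text: str) -> bool:
    """True for short standalone numerics that are likely page numbers."""
    return bool(PAGE_NUMBER_RE.match(text))
```

Elements whose text matches would be classified as PageElement and dropped along with the other IrrelevantElement subclasses.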
