Data Preparation & Experimentation: SEC EDGAR

⚠️ This Repo is Work in Progress ⚠️

This repository will contain two things:

source-data-pull: Scripts to download SEC EDGAR data and format it for Neo4j loading, analytics, and GenAI. Specifically, these scripts link Form 10-K and Form 13 data.
exploration: Exploratory notebooks and apps for testing loading and GenAI applications with the data.

Background

Linking Form 10-K documents to the issuing companies in Form 13 filings is non-trivial because the two data sources use different identifiers and the SEC EDGAR API provides no way to resolve directly between them. Form 10-K filings are identified by CIK (Central Index Key), an SEC system ID, while Form 13 filings identify issuers by CUSIP, a separate industry-standard ID.

Thankfully, others have created scripts for mapping CIK and CUSIP via scraping other SEC filings. This repository is one example, and they even provide a pre-calculated mapping in the form of a csv file.

In this project, we build on that work by taking a CIK-CUSIP mapping as input and then:

Pulling and parsing text from form 10Ks
Pulling and staging holdings data from form 13s

These will be staged in a format conducive to loading into Neo4j and linking the data together.
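As a rough illustration of the linking step, here is a minimal sketch that uses a CIK-CUSIP mapping to attach a CIK to each Form 13 holding, so holdings can later be joined to 10K data by CIK. The column names (`cik`, `cusip`), the holding fields, and the identifiers are assumptions made up for the example; they are not taken from the scripts in this repository.

```python
import csv
import io


def load_cusip_to_cik(mapping_file):
    """Read a CIK-CUSIP mapping CSV into a {cusip: cik} dictionary.

    Assumes header columns named 'cik' and 'cusip' (hypothetical names).
    Leading zeros are stripped from the CIK so padded and unpadded forms match.
    """
    reader = csv.DictReader(mapping_file)
    return {
        row["cusip"].strip().upper(): row["cik"].strip().lstrip("0")
        for row in reader
    }


def attach_cik(holdings, cusip_to_cik):
    """Yield only the holdings whose CUSIP resolves to a CIK, with the CIK added."""
    for holding in holdings:
        cik = cusip_to_cik.get(holding["cusip"].strip().upper())
        if cik is not None:
            yield {**holding, "cik": cik}


# Tiny in-memory example with made-up identifiers.
mapping_csv = io.StringIO("cik,cusip\n0000001111,AAA111111\n0000002222,BBB222222\n")
cusip_to_cik = load_cusip_to_cik(mapping_csv)

holdings = [
    {"manager": "Example Fund", "cusip": "aaa111111", "shares": 100},
    {"manager": "Example Fund", "cusip": "ZZZ999999", "shares": 50},  # not in mapping
]
linked = list(attach_cik(holdings, cusip_to_cik))
print(linked)
```

Holdings whose CUSIP is missing from the mapping are dropped here; a real pipeline might prefer to log them instead.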

Prerequisites for Source Data Pull

You need a CIK-CUSIP mapping CSV file as input. If you do not have one and want to test things out, see the cik-sample-mapping.csv file. You could use the CSV from the repo noted above, but it contains over 50k mappings, which can take a while to process.

[TODO] Add Python Package prereqs

Source Data Pull: Pulling and parsing text from form 10Ks

Currently, there are two command-line utilities for this. They will be combined into one as we clean up.

f10k-get-urls.py takes the CIK-CUSIP mapping as input along with a date range and grabs the URLs for raw 10K filings. It then writes them to another CSV.
f10k-download-parse-format.py takes the above output, downloads the raw 10K files, parses out the relevant 10K item text, and saves it to JSON files. See 10K Notes below for more details on the reasoning behind parsing and item selection.

Source Data Pull: Pulling and staging holdings data from form 13s

There are two command-line utilities for this: f13-download.py downloads the raw filings, and f13-parse-and-format.py parses, formats, and aggregates them into a CSV. These are split into two steps to allow faster iteration and experimentation. The download can take several hours or more (plan on an overnight run), so this way you can run it once and then change anything needed on the parsing and formatting side.
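The download-once, parse-many idea can be sketched as a small caching helper: a file is fetched only if it is not already on disk, so re-running the parsing step never triggers another download. This is an illustration of the pattern only, not the logic in f13-download.py. The `download_if_missing` helper, the `fetch` callable, and the example URL are hypothetical, and a stand-in fetcher is used so the sketch runs without network access.

```python
import tempfile
from pathlib import Path


def download_if_missing(url, dest, fetch):
    """Save fetch(url) to dest unless dest already exists; return dest as a Path.

    `fetch` is any callable that takes a URL and returns bytes (hypothetical).
    """
    dest = Path(dest)
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_bytes(fetch(url))
    return dest


# Stand-in fetcher that records each call instead of hitting the network.
calls = []


def fake_fetch(url):
    calls.append(url)
    return b"<raw filing placeholder>"


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "raw" / "filing-0001.txt"
    # First call downloads; second call finds the file and skips the fetch.
    download_if_missing("https://example.invalid/filing-0001", target, fake_fetch)
    download_if_missing("https://example.invalid/filing-0001", target, fake_fetch)
    cached = target.read_bytes()

print(len(calls))
```

Passing the fetcher in as an argument keeps the caching logic testable without touching SEC servers.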

10K Notes

A 10K is a comprehensive report filed annually by a publicly traded company about its financial performance and is required by the U.S. Securities and Exchange Commission (SEC). The report contains a comprehensive overview of the company's business and financial condition and includes audited financial statements. While 10Ks contain images and table figures, they primarily consist of free-form text which is what we are interested in extracting here.

Raw 10K reports are structured in iXBRL (Inline eXtensible Business Reporting Language), which is extremely verbose and contains more markup than actual text content. Here is an example from Apple.

This makes raw 10K files very large, unwieldy, and inefficient for direct use with LLM or text embedding services. For this reason, the program contained here, f10k-download-parse-format.py, applies regex and NLP to strip out as much iXBRL and other unnecessary content as possible so that the 10K text becomes useful.
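To give a feel for what removing the markup involves, here is a minimal sketch that reduces an iXBRL-style fragment to plain text. It uses Python's standard-library `html.parser` rather than the regex and NLP approach that f10k-download-parse-format.py applies, so treat it as a simplified stand-in. The `TextExtractor` class, the `strip_markup` function, and the sample fragment are invented for illustration.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping script, style, and ix:header elements."""

    SKIP = {"script", "style", "ix:header"}  # HTMLParser lowercases tag names

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_markup(raw):
    """Return the text of `raw` with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(raw)
    parser.close()
    # Joining with spaces keeps block-level text apart; it can also split a
    # word that is broken up by inline tags, which this sketch does not handle.
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


sample = (
    "<div><ix:header><ix:hidden>0000001111</ix:hidden></ix:header>"
    '<p style="font-size:10pt">Net sales were '
    '<ix:nonFraction name="us-gaap:Revenues" scale="6">1,234</ix:nonFraction>'
    " million.</p><style>p {color: red}</style></div>"
)
clean = strip_markup(sample)
print(clean)
```

Even in this tiny fragment the markup is several times longer than the sentence it wraps, which is the problem described above.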

In addition, f10k-download-parse-format.py extracts only a subset of items from the 10K: those we feel are most relevant for initial exploration and experimentation. These are the sections that discuss the overall business outlook and risk factors, specifically:

Item 1 – Business: This describes the business of the company: who the company is and what it does, what subsidiaries it owns, and what markets it operates in. It may also include recent events, competition, regulations, and labor issues. (Some industries are heavily regulated or have complex labor requirements, which can have significant effects on the business.) Other topics in this section may include special operating costs, seasonal factors, or insurance matters.
Item 1A – Risk Factors: Here, the company lays out anything that could go wrong, likely external effects, possible future failures to meet obligations, and other risks disclosed to adequately warn investors and potential investors.
Item 7 – Management's Discussion and Analysis of Financial Condition and Results of Operations: Here, management discusses the operations of the company in detail, usually by comparing the current period with the prior period. These comparisons give the reader an overview of the operational issues that caused increases or decreases in the business.
Item 7A – Quantitative and Qualitative Disclosures about Market Risks.
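Selecting these items from already-cleaned text can be sketched with a heading regex. This is a simplified sketch that assumes each item heading starts its own line in the form "Item 1A." (or with a colon or dash), and real filings vary more than that. The `ITEM_HEADING` pattern, the `extract_items` function, and the sample text are invented for the example and are not the logic used in f10k-download-parse-format.py.

```python
import re

# Matches headings such as "Item 1.", "ITEM 1A:", or "Item 7A -" at line start.
ITEM_HEADING = re.compile(
    r"^\s*item\s+(\d{1,2}[ab]?)\s*[.:\-\u2013]",
    re.IGNORECASE | re.MULTILINE,
)
WANTED = {"1", "1A", "7", "7A"}


def extract_items(text, wanted=WANTED):
    """Return {item_number: section_text} for the wanted 10K items.

    Each section runs from its heading to the next item heading. If a heading
    appears more than once (for example in a table of contents), the longest
    body is kept.
    """
    matches = list(ITEM_HEADING.finditer(text))
    sections = {}
    for i, match in enumerate(matches):
        item = match.group(1).upper()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        if item in wanted:
            body = text[match.end():end].strip()
            if len(body) > len(sections.get(item, "")):
                sections[item] = body
    return sections


sample = """Item 1. Business
We make widgets.
Item 1A. Risk Factors
Widgets may go out of fashion.
Item 2. Properties
We lease one office.
Item 7. Management's Discussion and Analysis
Sales rose.
Item 7A. Quantitative and Qualitative Disclosures about Market Risk
Rates may change.
Item 8. Financial Statements
See attached.
"""
items = extract_items(sample)
print(sorted(items))
```

Each returned section still begins with the item title (for example "Business"), which can be useful as context when the text is later chunked or embedded.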
