data-prep-sec-edgar's Introduction

Data Preparation & Experimentation: SEC EDGAR

⚠️ This Repo is Work in Progress ⚠️

This repository will contain two things:

  1. source-data-pull: Scripts to download SEC EDGAR data and format it for Neo4j loading, analytics, and GenAI, with a specific focus on linking together form 10-K and form 13 data.
  2. exploration: Exploratory notebooks and apps for testing loading and GenAI applications with the data.

Background

Linking form 10-K documents to the issuing companies listed in form 13 filings is non-trivial, since the two data sources use different identifiers and the SEC EDGAR API provides no way to resolve directly between them. Form 10-K filings are identified by CIK, an SEC system id, while form 13 filings identify issuers by CUSIP, a separate industry-standard id.

Thankfully, others have created scripts for mapping CIK and CUSIP via scraping other SEC filings. This repository is one example, and they even provide a pre-calculated mapping in the form of a csv file.

In this project we will build on that work by taking a CIK-CUSIP mapping as input, then:

  1. Pulling and parsing text from form 10Ks
  2. Pulling and staging holdings data from form 13s

These will be staged in a format conducive to loading into Neo4j and linking the data together.
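The linking step described above can be sketched as a simple lookup: form 13 holdings carry a CUSIP, the mapping translates it to a CIK, and the CIK identifies the 10-K filer. This is a minimal illustration, not the repository's code. The column names (`cik`, `cusip`), the holding fields, and every value below are made up for the example.

```python
import csv
import io

# Made-up sample data; column names and values are assumptions for illustration.
MAPPING_CSV = """cik,cusip
1001,AAAAAAAA1
1002,BBBBBBBB2
"""

HOLDINGS = [
    {"manager": "Example Fund A", "cusip": "AAAAAAAA1", "value": 1000},
    {"manager": "Example Fund B", "cusip": "BBBBBBBB2", "value": 500},
    {"manager": "Example Fund A", "cusip": "ZZZZZZZZ9", "value": 42},  # no mapping
]


def load_cusip_to_cik(csv_text):
    """Build a CUSIP -> CIK lookup from CSV text with `cik` and `cusip` columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["cusip"]: row["cik"] for row in reader}


def link_holdings(holdings, cusip_to_cik):
    """Attach a CIK to each holding; holdings with an unknown CUSIP are set aside."""
    linked, unmatched = [], []
    for holding in holdings:
        cik = cusip_to_cik.get(holding["cusip"])
        if cik is None:
            unmatched.append(holding)
        else:
            linked.append({**holding, "cik": cik})
    return linked, unmatched


cusip_to_cik = load_cusip_to_cik(MAPPING_CSV)
linked, unmatched = link_holdings(HOLDINGS, cusip_to_cik)
```

Keeping the unmatched holdings in a separate list makes it easy to see how much form 13 data cannot be linked with a given mapping.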

Prerequisites for Source Data Pull

You need a CIK-CUSIP mapping csv file as input. If you do not have one and want to test things out, see the cik-sample-mapping.csv file. You could use the csv from the repo noted above, but it contains over 50k mappings, which can take a while to process.
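For a quick test against a large mapping file, one option is to read only the first few rows. The sketch below assumes the csv has a header row with `cik` and `cusip` columns; the actual column names in cik-sample-mapping.csv may differ, and `read_mapping` is a hypothetical helper, not part of this repository.

```python
import csv
import io
from itertools import islice


def read_mapping(handle, limit=None):
    """Return unique (cik, cusip) pairs from an open csv file.

    Reads at most `limit` data rows; `limit=None` reads the whole file.
    Assumes a header row with `cik` and `cusip` columns.
    """
    reader = csv.DictReader(handle)
    pairs = []
    seen = set()
    for row in islice(reader, limit):
        pair = (row["cik"].strip(), row["cusip"].strip().upper())
        if pair not in seen:
            seen.add(pair)
            pairs.append(pair)
    return pairs


# Made-up rows standing in for a real mapping file.
sample = io.StringIO("cik,cusip\n1001,aaaaaaaa1\n1001,AAAAAAAA1\n1002,BBBBBBBB2\n")
pairs = read_mapping(sample, limit=2)
```

With a real file, the same function can be called as `read_mapping(open(path, newline=""), limit=100)`.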

[TODO] Add Python Package prereqs

Source Data Pull: Pulling and parsing text from form 10Ks

Currently, there are two command line utilities for this. They will be combined into one as the code is cleaned up.

  1. f10k-get-urls.py takes the cik-cusip mapping as input along with a date range and grabs the urls for raw 10k filings. It then writes them to another csv.
  2. f10k-download-parse-format.py takes the above output, downloads raw 10k files, parses out relevant 10K item text, and saves to json files. See 10K Notes below for more details on the reasoning behind parsing and item selection.
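The URL-gathering step can be pictured as filtering a company's filing list by form type and date, then building an archive URL for each hit. The sketch below is an assumption about how that might look, not the logic of f10k-get-urls.py. It assumes filing metadata shaped like the parallel arrays (`form`, `filingDate`, `accessionNumber`, `primaryDocument`) found in EDGAR's submissions JSON, and the sample values are made up.

```python
from datetime import date

ARCHIVE_ROOT = "https://www.sec.gov/Archives/edgar/data"


def ten_k_urls(cik, recent, start, end):
    """Return archive URLs for 10-K filings filed between `start` and `end`.

    `recent` is a dict of parallel lists (assumed layout, see lead-in).
    """
    rows = zip(
        recent["form"],
        recent["filingDate"],
        recent["accessionNumber"],
        recent["primaryDocument"],
    )
    urls = []
    for form, filed, accession, document in rows:
        if form != "10-K":
            continue
        if not (start <= date.fromisoformat(filed) <= end):
            continue
        # The archive folder name is the accession number without dashes.
        folder = accession.replace("-", "")
        urls.append(f"{ARCHIVE_ROOT}/{int(cik)}/{folder}/{document}")
    return urls


# Made-up filing metadata for illustration.
recent = {
    "form": ["10-K", "10-Q", "10-K"],
    "filingDate": ["2024-02-01", "2023-11-01", "2021-02-01"],
    "accessionNumber": [
        "0000000000-24-000001",
        "0000000000-23-000009",
        "0000000000-21-000003",
    ],
    "primaryDocument": ["example-10k.htm", "example-10q.htm", "old-10k.htm"],
}
urls = ten_k_urls("0000001001", recent, date(2023, 1, 1), date(2024, 12, 31))
```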

Source Data Pull: Pulling and staging holdings data from form 13s

There are two command line utilities for this: f13-download.py downloads the raw filings, and f13-parse-and-format.py parses, formats, and aggregates them into a csv. These are split into two steps to facilitate faster iteration and experimentation. The download can take several hours or more (an overnight type of run), so this way you can run it once and then change anything needed on the parsing/formatting side.
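The benefit of separating download from parsing can be illustrated with a small download-once cache: raw text is written to disk the first time and reused on later runs. This is a generic sketch, not the behaviour of f13-download.py. The `fetch_once` helper, the hashed file naming, and the stand-in `fake_fetch` function are all hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def fetch_once(url, cache_dir, fetch):
    """Return the cached file for `url`, calling `fetch(url)` only if it is missing."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL to get a stable, filesystem-safe file name.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".txt"
    target = cache_dir / name
    if not target.exists():
        target.write_text(fetch(url), encoding="utf-8")
    return target


# Demo with a stand-in fetcher so the sketch runs offline.
calls = []


def fake_fetch(url):
    calls.append(url)
    return "<raw filing text>"


with tempfile.TemporaryDirectory() as tmp:
    first = fetch_once("https://example.com/filing-1", tmp, fake_fetch)
    second = fetch_once("https://example.com/filing-1", tmp, fake_fetch)
    same_file = first == second
    cached_text = second.read_text(encoding="utf-8")
```

The second call returns the cached file without fetching again, which is what lets the parsing side be rerun cheaply.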

10K Notes

A 10K is a comprehensive report filed annually by a publicly traded company about its financial performance and is required by the U.S. Securities and Exchange Commission (SEC). The report contains a comprehensive overview of the company's business and financial condition and includes audited financial statements. While 10Ks contain images and table figures, they primarily consist of free-form text which is what we are interested in extracting here.

Raw 10K reports are structured in iXBRL (Inline eXtensible Business Reporting Language), which is extremely verbose and contains more markup than actual text content. Here is an example from Apple.

This makes raw 10K files very large, unwieldy, and inefficient for direct use with LLM or text embedding services. For this reason, the program contained here, f10k-download-parse-format.py, applies regex and NLP to strip out as much iXBRL and unnecessary content as possible to make the 10K text useful.
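To give a sense of what stripping markup with regex involves, here is a minimal sketch. It is not the cleaning logic of f10k-download-parse-format.py, which also applies NLP. The `strip_markup` function and the sample document are made up for illustration, and a real filing would need more careful handling (tables, hidden elements, page headers).

```python
import html
import re


def strip_markup(raw):
    """Reduce an HTML/iXBRL string to plain text with single spaces."""
    # Drop blocks whose content is not reader-facing text.
    text = re.sub(
        r"<(ix:header|script|style)\b.*?</\1\s*>",
        " ",
        raw,
        flags=re.DOTALL | re.IGNORECASE,
    )
    # Replace every remaining tag with a space, keeping the text inside it.
    text = re.sub(r"<[^>]+>", " ", text)
    # Decode entities such as &amp; and &nbsp;.
    text = html.unescape(text)
    # Collapse runs of whitespace (including non-breaking spaces).
    return re.sub(r"\s+", " ", text).strip()


sample = (
    "<html><body><ix:header><ix:hidden>meta</ix:hidden></ix:header>"
    '<div><ix:nonNumeric name="dei:EntityRegistrantName">Example&nbsp;Corp'
    "</ix:nonNumeric> sells &amp; services widgets.</div></body></html>"
)
clean = strip_markup(sample)
```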

f10k-download-parse-format.py also extracts only a subset of items from the 10K that we feel are most relevant for initial exploration and experimentation. These are the sections that discuss the overall business outlook and risk factors, specifically:

  • Item 1 – Business: This describes the business of the company: who the company is and what it does, what subsidiaries it owns, and what markets it operates in. It may also include recent events, competition, regulations, and labor issues. (Some industries are heavily regulated or have complex labor requirements, which have significant effects on the business.) Other topics in this section may include special operating costs, seasonal factors, or insurance matters.
  • Item 1A – Risk Factors: Here, the company lays out anything that could go wrong, likely external effects, possible future failures to meet obligations, and other risks disclosed to adequately warn investors and potential investors.
  • Item 7 – Management's Discussion and Analysis of Financial Condition and Results of Operations: Here, management discusses the operations of the company in detail, usually by comparing the current period with the prior period. These comparisons give the reader an overview of the operational issues behind increases or decreases in the business.
  • Item 7A – Quantitative and Qualitative Disclosures About Market Risk.
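Selecting these items from cleaned text can be sketched as finding "Item N" headings at the start of a line and slicing the text between them. This is a simplified assumption about the approach, not the repository's parser. The heading pattern, the `split_items` helper, and the sample text are hypothetical, and real filings vary enough in formatting that a production parser needs more safeguards.

```python
import re

# Matches headings such as "Item 1.", "ITEM 1A:", or "Item 7 -" at line start.
ITEM_HEADING = re.compile(
    r"^\s*item\s+(\d{1,2}[ab]?)\s*[.:\-\u2013]",
    re.IGNORECASE | re.MULTILINE,
)
WANTED = {"1", "1A", "7", "7A"}


def split_items(text):
    """Return {item label: text} for the wanted items.

    Each item runs from its heading to the next item heading. When a label
    appears more than once (for example in a table of contents), the longest
    occurrence is kept.
    """
    matches = list(ITEM_HEADING.finditer(text))
    items = {}
    for index, match in enumerate(matches):
        label = match.group(1).upper()
        if label not in WANTED:
            continue
        end = matches[index + 1].start() if index + 1 < len(matches) else len(text)
        body = text[match.start():end].strip()
        if len(body) > len(items.get(label, "")):
            items[label] = body
    return items


SAMPLE = """Item 1. Business
We make widgets.
Item 1A. Risk Factors
Widgets may break.
Item 1B. Unresolved Staff Comments
None.
Item 7. Management's Discussion and Analysis
Sales rose.
Item 7A. Quantitative and Qualitative Disclosures About Market Risk
Rates matter.
Item 8. Financial Statements
Numbers.
"""
items = split_items(SAMPLE)
```

Headings for items outside the wanted set (such as Item 1B or Item 8) still act as boundaries, so their text does not leak into the preceding item.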

data-prep-sec-edgar's People

Contributors

zach-blumenfeld, akollegger
