
edgar-doc-parser's Issues

Unexpected behavior of the old parser

  • The old parser completely ignores any annotation outside <span> tags, which happens when an annotation is applied to a whole section of the document.
  • Given a structure like <span A><ix B><span C>texts on spanC</span></ix></span>, "texts on spanC" will certainly be labeled as unannotated on the span C element. But for ix B, it is unclear whether "texts on spanC" is incorporated into the annotation of span A.
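A minimal stdlib sketch of the ambiguity, using xml.etree (the ids and structure are made up for illustration): every ancestor of the text node, including the intermediate ix element, sees the same descendant text, so which element "owns" the annotation is a policy choice, not something the markup decides.

```python
import xml.etree.ElementTree as ET

# Hypothetical nested structure from this issue: span -> ix -> span -> text.
snippet = '<span id="A"><ix id="B"><span id="C">texts on spanC</span></ix></span>'
root = ET.fromstring(snippet)

# ''.join(el.itertext()) collects all descendant text, so the inner text
# is visible from every ancestor, including the ix element.
for el in root.iter():
    print(el.tag, el.get('id'), repr(''.join(el.itertext())))
```

All three elements report the same text, which is exactly why the parser has to decide explicitly which level the annotation belongs to.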

Sanity check for regex parser

I cannot come up with a good way to sanity check the regex parser, whose return value is a list of paired integers marking the beginning and end of spans within the HTML string.
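One cheap option, sketched here under the assumption that the parser returns (start, end) index pairs in document order: check a handful of structural invariants (in bounds, non-empty, non-overlapping) rather than trying to validate content.

```python
def sanity_check_spans(html, spans):
    """Structural invariants for (start, end) index pairs into an HTML string:
    each span must be in bounds, non-empty, and non-overlapping in order."""
    prev_end = 0
    for start, end in spans:
        assert 0 <= start < end <= len(html), f"span ({start}, {end}) out of bounds"
        assert start >= prev_end, f"span ({start}, {end}) overlaps previous span"
        prev_end = end

sanity_check_spans("<p>alpha</p><p>beta</p>", [(0, 12), (12, 23)])  # passes silently
```

This catches off-by-one and overlap bugs cheaply; content-level checks (e.g. that each span starts on a `<`) could be layered on top.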

Get_annotation_features style

We need the webelement -> pandas DataFrame feature pipeline to have documented function definitions.

We want to remove commented-out legacy code, add comments for code blocks, and add docstrings to functions.
We need explanations of each column name generated by the functions.
We should rename some helper functions to the private naming convention, e.g. get_page_breaks -> _get_page_breaks.

Previously, some of the functions had legacy arguments -- for example, an out_path argument that was completely unused and ignored.

Create new function that reimplements a combined ```unpack_bulk``` and ```featurize_file```.

Basically, unpack_bulk saves intermediate files, and featurize_file with low memory mode loads those files.

If we just kept them in memory, our pipeline would probably double in speed as we are I/O bound, and we wouldn't need to write/read from disk nearly as often. We could even 'skip' unpacking files that we don't want to featurize in an 'early_quit' unpacking that depends on the featurization document_type.
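The combined pipeline could be sketched as a generator that never touches disk; unpack_one, featurize_one, and want below are hypothetical stand-ins for the per-file bodies of unpack_bulk, featurize_file, and the document_type-based 'early_quit' test.

```python
def unpack_featurize(files, unpack_one, featurize_one, want=lambda doc: True):
    """Stream documents through unpack -> featurize entirely in memory.

    `want` implements the 'early_quit' idea: documents whose type we will
    not featurize are skipped before any further work is done on them.
    """
    for f in files:
        doc = unpack_one(f)      # kept in memory, never written to disk
        if not want(doc):        # early quit on unwanted document types
            continue
        yield featurize_one(doc)

# Toy demo with stand-in callables: uppercase = "unpack", len = "featurize".
feats = list(unpack_featurize(
    ["a", "b", "c"],
    unpack_one=str.upper,
    featurize_one=len,
    want=lambda d: d != "B",
))
```

Because it is a generator, only one document's intermediate form is alive at a time, which also helps the memory pressure mentioned elsewhere in these issues.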

Backwards Compatibility with New Parser Featurization, Unit Testing

I only changed _parse_unannotated_text and _parse_annotated_text (some functions are commented out but remain unchanged). The code is ready to run, but unit tests are not done. There is a small demo under the examples folder that checks the text field found. But I don't know how to run parser._parse_annotated_text() on a specific file. Could someone please help me and add that to the Jupyter notebook? Thank you! Then I will write more tests to make sure that the new parser finds what needs to be found.

And here are some issues I found:

  • table in span
    • We currently only detect the 'span in table' situation. However, there could also be the span -> table -> annotation case.
    • If we fix this, in_table as a flat list is no longer enough, and the feature generation function needs to be adjusted accordingly.
  • Page numbers are not supported yet. It would be a quick fix to add the page number as a Python attribute on the element object, but that raises a conflict with the feature generation step.
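One way to handle both orderings is to record the full ancestor tag stack for each annotation instead of a single in_table flag; this sketch uses xml.etree, and the bare 'ix' tag name and id attribute are made up for illustration.

```python
import xml.etree.ElementTree as ET

def annotation_in_table(root):
    """Map each annotation element's id to whether ANY ancestor is a <table>,
    covering both 'span in table' and 'span -> table -> annotation'."""
    contexts = {}
    def walk(el, ancestors):
        if el.tag == 'ix':  # stand-in annotation tag name for this sketch
            contexts[el.get('id')] = 'table' in ancestors
        for child in el:
            walk(child, ancestors + [el.tag])
    walk(root, [])
    return contexts

# span -> table -> annotation: a single span-level in_table flag misses this.
nested = ET.fromstring('<span><table><tr><td><ix id="x">5</ix></td></tr></table></span>')
plain = ET.fromstring('<span><ix id="z">1</ix></span>')
```

Keeping the whole ancestor stack also leaves room for the feature generation step to ask other structural questions later (e.g. nesting depth) without reparsing.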

Relative file imports need to be sanitized before package upload to PyPI

This issue applies to all EDGAR classes. As an example, we previously computed parser.data_dir by taking the path of EDGAR/parser.py and concatenating a data_dir constructor argument. With PyPI, our packaged files end up inside some pip/conda/homebrew folder, which completely hides the data_dir from the user. We'd like the user to be able to see the data_dir and delete it if they wish.

Currently, I've pushed a fix where the user has to pass an explicit absolute path for the data_dir to all EDGAR classes. In preprocess.py this comes down to data_dir=os.path.join(os.path.dirname(os.path.abspath(__file__)), 'data')
I would like to re-implement the user being allowed to pass in relative paths.

For example:
We call metadata(data_dir='data')
and we'd want metadata.data_dir to resolve to os.path.join(os.path.dirname(preprocess.__file__), 'data')
That would be desirable, without the previously mentioned boilerplate.
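The desired behavior could be a small resolver that anchors relative paths at the owning module; the function and argument names here are hypothetical, not the repo's actual API.

```python
import os

def resolve_data_dir(data_dir, anchor_file):
    """Resolve a possibly-relative data_dir against the module that owns it
    (e.g. anchor_file=preprocess.__file__), so metadata(data_dir='data')
    lands next to preprocess.py rather than in the caller's cwd.
    Absolute paths are passed through untouched."""
    if os.path.isabs(data_dir):
        return data_dir
    return os.path.join(os.path.dirname(os.path.abspath(anchor_file)), data_dir)
```

Each EDGAR class could call this once in its constructor, preserving the explicit-absolute-path escape hatch while restoring the relative-path convenience.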

Finalize initial dataset

For each annotated document, convert it to a machine-learning-ready format and provide its sentences to the sentence processing pipeline.

Remove external ```FilingType``` kwargs from ```downloader.py```

FilingType is a type from secedgar. We should remove this type from the user-facing interface, as it requires an external install of secedgar and thus defeats the purpose of packaging this repository.

Let's replace it with a string that can call FilingType within our classes if necessary, but focus on allowing the user to input strings to manage the type.

Actually, can all FilingType kwargs be replaced by the new document_type kwarg system we are adding?

Group training samples by page

Currently our setup/generate_dataset.py creates a flat list of (x, y) tuples, where x is the training text and y is a list of ixbrl annotations. We instead want a list of lists, [[(x1, y1), (x2, y2)], ...], where each inner list corresponds to a page and each tuple represents an annotated web element on that page.
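Assuming each sample carries its page number, the regrouping is a one-liner with itertools.groupby; the (page, x, y) tuple layout here is an assumption, not the script's actual format.

```python
from itertools import groupby

def group_by_page(samples):
    """samples: iterable of (page_number, x, y) tuples, assumed already
    sorted by page. Returns a list of lists of (x, y), one per page."""
    return [
        [(x, y) for _, x, y in grp]
        for _, grp in groupby(samples, key=lambda s: s[0])
    ]

pages = group_by_page([(1, "a", ["tag1"]), (1, "b", []), (2, "c", ["tag2"])])
```

Note groupby only merges adjacent runs, so samples must be sorted by page first; an unsorted input would silently produce duplicate page groups.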

Autodetect missing API Keys, prompt user for keys, then save them in .keys private file

The api_keys.yaml solution is kinda janky. Mainly, it is a headache to work with as a contributor and annoying to deal with as a package user. We should instead use a '.keys.yaml' private file in the EDGAR directory. When the dataloader tries to load it, it can autodetect its absence, prompt the user for the necessary fields, and save them permanently for future use. This would be immensely useful to contributors because .keys.yaml would never appear in the commit history and would thus be safe when checking out branches.
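A sketch of the autodetect-prompt-save loop; the required key names are made up, and a minimal 'key: value' line format stands in for real YAML so the example stays dependency-free.

```python
import os

def load_keys(path='.keys.yaml', required=('example_api_key',), prompt=input):
    """Load API keys from a gitignored private file; prompt once for any
    missing required key and persist the result for future runs."""
    keys = {}
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                k, _, v = line.partition(':')
                keys[k.strip()] = v.strip()
    missing = [k for k in required if not keys.get(k)]
    for k in missing:
        keys[k] = prompt(f"Enter {k}: ")
    if missing:  # only rewrite the file when something was added
        with open(path, 'w') as f:
            for k, v in keys.items():
                f.write(f"{k}: {v}\n")
    return keys
```

Injecting `prompt` as a parameter keeps the interactive path testable; a real implementation would swap the hand-rolled reader/writer for yaml.safe_load / yaml.safe_dump.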

URGENT: Tables leaked into document data; cannot proceed with Annotation Tag Distribution Analysis

@NC-Zhao When we first talked about the annotated document parser we discussed that <table> annotations should be kept separate from all other non-table annotations

However, they are not separated at all -- This is breaking the machine learning model and analysis downstream.
I just found this out after working on figures and documents analyzing the distribution of annotation tags for a solid 4 hours.

I will have to re-do all of the analysis -- but I cannot proceed until the annotated document parser is fixed. Therefore my contributions are on-hold until this is fixed.

Add unannotated neighboring elements

In our fresh and newly polished generate_dataset.py we now output elements grouped together by page.

However, the output sample_data.csv (for nflx variant as well) only contains annotated webelements.

This is a legacy artifact from the non-grouped data.

We want to add all the unannotated neighbors back in on a per-page basis. However, we still probably don't want annotated table elements in the neighbor groups (or do we? This could be an explicit commented 'toggle' in the code.)

We will need downstream filtering that 'deletes/ignores' pages that only contain (annotated tables or unannotated elements).
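The downstream filter could look like this; the 'annotated'/'is_table' element flags are hypothetical stand-ins for whatever the real elements expose.

```python
def keep_page(page):
    """Drop pages whose elements are all annotated tables or all unannotated:
    a page is useful only if it has at least one annotated non-table element."""
    return any(e['annotated'] and not e['is_table'] for e in page)

pages = [
    [{'annotated': True,  'is_table': False}, {'annotated': False, 'is_table': False}],
    [{'annotated': True,  'is_table': True}],   # only an annotated table
    [{'annotated': False, 'is_table': False}],  # only unannotated text
]
filtered = [p for p in pages if keep_page(p)]   # keeps just the first page
```

Expressing the rule as a single predicate keeps the toggle idea simple too: including annotated tables in the neighbor groups would just be a second flag threaded through here.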

The current status of the data is really clean and impressive!

Need to scale internal dataset size

Could you expand the tikrs.txt list and confirm that the newly added tickers work in our framework?

We will rapidly approach local memory limits from the download sizes. I believe the next steps are to enable programmatic off-loading of raw data after it is processed.

My current machine sits at ~6.4 GB of memory used by 10 companies, but I believe the actual dataset I am using after processing is only ~130 MB.

IMS Documents are ignored by parser

As a first patch, we should output a warning that these documents are being skipped (they currently are, silently).

Afterwards, we can show some initial analysis on this population of documents (How many are we currently skipping?)

We can also write a parser to incorporate and annotate this document class

```parser.find_page_location``` does not work on entire dataset

setup/generate_dataset.py

```
Traceback (most recent call last):
  File "/home/seresne/ml/capstone/FinDocNLP/setup/generate_dataset.py", line 33, in <module>
    features = parser.process_file(tikr, doc, fname)
  File "/home/seresne/ml/capstone/FinDocNLP/setup/../EDGAR/parser.py", line 383, in process_file
    features = self.get_annotation_features(elems, annotation_dict)
  File "/home/seresne/ml/capstone/FinDocNLP/setup/../EDGAR/parser.py", line 328, in get_annotation_features
    page_location = self.find_page_location()
  File "/home/seresne/ml/capstone/FinDocNLP/setup/../EDGAR/parser.py", line 284, in find_page_location
    page_location = {page_number: [0, page_breaks[0].location["y"]]}
IndexError: list index out of range
```
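A guarded sketch of the failing lookup (the signature and page_height fallback are illustrative, not the repo's actual API): treat a document with no detected page breaks as a single page instead of blindly indexing page_breaks[0].

```python
from types import SimpleNamespace  # stand-in for a selenium WebElement

def find_page_location(page_breaks, page_number=1, page_height=None):
    """Return {page_number: [top, bottom]} for the first page, falling back
    to a single whole-document page when no page breaks were detected."""
    if not page_breaks:  # the IndexError case from the traceback above
        return {page_number: [0, page_height]}
    return {page_number: [0, page_breaks[0].location["y"]]}

brk = SimpleNamespace(location={"y": 792})  # mimics WebElement.location
```

Whether break-less documents should be processed as one page or skipped with a warning is a policy decision; the guard at least turns the crash into a choice.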

Figure out earliest annotated document date, add it as an option for downloads.

Suppose we only want to Query_Server for annotated documents. Sure, we can't explicitly guarantee we get only those documents, but we can explicitly avoid downloading documents from years that are not annotated. Presumably the cutoff is somewhere in 2018-2022, but we want it coded into the functions themselves as a bool kwarg in the interface.

Hopefully this cuts download time by ~5x, since we wouldn't download and then delete random files we don't need.

Analyze distribution of IXBRL labels

We need to analyze which ixbrl tags (numeric, non-numeric) to predict on, which are company-specific and should be filtered out, and which tags contain enough information (or not enough) to be predicted.
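A starting point for the distribution analysis, assuming each parsed document yields a list of tag-name strings; the us-gaap: prefix heuristic for separating standard-taxonomy tags from company extensions, and the example tag names, are assumptions to validate.

```python
from collections import Counter

def tag_distribution(documents):
    """Count ixbrl tag frequencies across documents (each document is a list
    of tag-name strings) and flag likely company-specific extension tags."""
    counts = Counter(tag for doc in documents for tag in doc)
    # Heuristic: tags outside the standard us-gaap taxonomy prefix are
    # company-specific extensions (e.g. an 'nflx:' prefix) to filter out.
    company_specific = {t for t in counts if not t.startswith('us-gaap:')}
    return counts, company_specific

docs = [['us-gaap:Revenues', 'nflx:StreamingRevenue'], ['us-gaap:Revenues']]
counts, extensions = tag_distribution(docs)
```

From `counts` one can then threshold rare tags (too few samples to predict) and inspect `extensions` before dropping them.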

Implement `smart` train-val-test split for dataset

In our multi-class, single-label classification task, a random train/val/test split does not ensure that a minimum number of samples per prediction class is present in each split.

For some metrics, such as roc_auc_score, this raises errors, since some classes being evaluated have 0 representative members in the test set.
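A stdlib sketch of the 'smart' split: sample per class so every label keeps at least min_per_class test members, which is essentially what sklearn's train_test_split(..., stratify=labels) provides; the function name and parameters here are illustrative.

```python
import random
from collections import defaultdict

def stratified_split(samples, labels, test_frac=0.2, min_per_class=1, seed=0):
    """Split per class so every label has at least min_per_class test samples,
    so metrics like roc_auc_score never see a class with 0 test members."""
    by_label = defaultdict(list)
    for s, y in zip(samples, labels):
        by_label[y].append(s)
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for y, group in by_label.items():
        rng.shuffle(group)
        n_test = max(min_per_class, int(len(group) * test_frac))
        test += [(s, y) for s in group[:n_test]]
        train += [(s, y) for s in group[n_test:]]
    return train, test
```

Classes with fewer than min_per_class + 1 samples would still end up train-starved, so those probably need to be merged or dropped upstream.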
