
edgar-doc-parser's Issues

Unexpected behavior of the old parser

  • The old parser completely ignores any annotation outside <span> tags, which happens when an annotation is applied to a whole section of the document.
  • Given a structure like <span A><ix B><span C>texts on spanC</span></ix></span>, "texts on spanC" will certainly be labeled as unannotated on the span C element. But for ix B, it is unclear whether "texts on spanC" is incorporated into the annotation of span A.
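A minimal stdlib sketch of the ambiguity, using xml.etree (the ids and structure are made up for illustration): every ancestor of the text node, including the intermediate ix element, sees the same descendant text, so which element "owns" the annotation is a policy choice, not something the markup decides.

```python
import xml.etree.ElementTree as ET

# Hypothetical nested structure from this issue: span -> ix -> span -> text.
snippet = '<span id="A"><ix id="B"><span id="C">texts on spanC</span></ix></span>'
root = ET.fromstring(snippet)

# ''.join(el.itertext()) collects all descendant text, so the inner text
# is visible from every ancestor, including the ix element.
for el in root.iter():
    print(el.tag, el.get('id'), repr(''.join(el.itertext())))
```

All three elements report the same text, which is exactly why the parser has to decide explicitly which level the annotation belongs to.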

Sanity check for regex parser

I cannot come up with a good way to sanity check the regex parser, whose return value is a list of paired integers marking the beginning and end of spans within the HTML string.
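One cheap option, sketched here under the assumption that the parser returns (start, end) index pairs in document order: check a handful of structural invariants (in bounds, non-empty, non-overlapping) rather than trying to validate content.

```python
def sanity_check_spans(html, spans):
    """Structural invariants for (start, end) index pairs into an HTML string:
    each span must be in bounds, non-empty, and non-overlapping in order."""
    prev_end = 0
    for start, end in spans:
        assert 0 <= start < end <= len(html), f"span ({start}, {end}) out of bounds"
        assert start >= prev_end, f"span ({start}, {end}) overlaps previous span"
        prev_end = end

sanity_check_spans("<p>alpha</p><p>beta</p>", [(0, 12), (12, 23)])  # passes silently
```

This catches off-by-one and overlap bugs cheaply; content-level checks (e.g. that each span starts on a `<`) could be layered on top.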

Get_annotation_features style

We need the webelement -> pandas DataFrame feature pipeline to have documented function definitions.

We want to remove commented-out legacy code, add comments for code blocks, and add docstrings to functions.
We need explanations of each column name generated by the functions.
We should rename some helper functions to the private naming convention, e.g. get_page_breaks -> _get_page_breaks.

Previously, some of the functions had legacy arguments -- for example, an out_path argument that was completely unused and ignored.

Create new function that reimplements a combined ```unpack_bulk``` and ```featurize_file```.

Basically, unpack_bulk saves intermediate files, and featurize_file with low memory mode loads those files.

If we just kept them in memory, our pipeline would probably double in speed as we are I/O bound, and we wouldn't need to write/read from disk nearly as often. We could even 'skip' unpacking files that we don't want to featurize in an 'early_quit' unpacking that depends on the featurization document_type.
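The combined pipeline could be sketched as a generator that never touches disk; unpack_one, featurize_one, and want below are hypothetical stand-ins for the per-file bodies of unpack_bulk, featurize_file, and the document_type-based 'early_quit' test.

```python
def unpack_featurize(files, unpack_one, featurize_one, want=lambda doc: True):
    """Stream documents through unpack -> featurize entirely in memory.

    `want` implements the 'early_quit' idea: documents whose type we will
    not featurize are skipped before any further work is done on them.
    """
    for f in files:
        doc = unpack_one(f)      # kept in memory, never written to disk
        if not want(doc):        # early quit on unwanted document types
            continue
        yield featurize_one(doc)

# Toy demo with stand-in callables: uppercase = "unpack", len = "featurize".
feats = list(unpack_featurize(
    ["a", "b", "c"],
    unpack_one=str.upper,
    featurize_one=len,
    want=lambda d: d != "B",
))
```

Because it is a generator, only one document's intermediate form is alive at a time, which also helps the memory pressure mentioned elsewhere in these issues.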

Backwards Compatibility with New Parser Featurization, Unit Testing

I only changed _parse_unannotated_text and _parse_annotated_text (some functions are commented out but remain unchanged). The code is ready to run, but unit tests are not done. There is a small demo under the examples folder that checks the text field found. But I don't know how to run parser._parse_annotated_text() on a specific file. Could someone please help me and add that to the Jupyter notebook? Thank you! Then I will write more tests to make sure that the new parser finds what needs to be found.

And here are some issues I found:

  • table in span
    • We currently only detect the 'span in table' situation. However, there could also be the span -> table -> annotation case.
    • If we fix this, in_table as a flat list is no longer enough, and the feature generation function needs to be adjusted accordingly.
  • Page numbers are not supported yet. It would be a quick fix to add the page number as a Python attribute on the element object, but that raises a conflict with the feature generation step.
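One way to handle both orderings is to record the full ancestor tag stack for each annotation instead of a single in_table flag; this sketch uses xml.etree, and the bare 'ix' tag name and id attribute are made up for illustration.

```python
import xml.etree.ElementTree as ET

def annotation_in_table(root):
    """Map each annotation element's id to whether ANY ancestor is a <table>,
    covering both 'span in table' and 'span -> table -> annotation'."""
    contexts = {}
    def walk(el, ancestors):
        if el.tag == 'ix':  # stand-in annotation tag name for this sketch
            contexts[el.get('id')] = 'table' in ancestors
        for child in el:
            walk(child, ancestors + [el.tag])
    walk(root, [])
    return contexts

# span -> table -> annotation: a single span-level in_table flag misses this.
nested = ET.fromstring('<span><table><tr><td><ix id="x">5</ix></td></tr></table></span>')
plain = ET.fromstring('<span><ix id="z">1</ix></span>')
```

Keeping the whole ancestor stack also leaves room for the feature generation step to ask other structural questions later (e.g. nesting depth) without reparsing.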

Relative file imports need to be sanitized before package upload to PyPI

This issue applies to all EDGAR classes. As an example, we previously computed parser.data_dir by taking the path of EDGAR/parser.py and concatenating a data_dir constructor argument. With PyPI, our packaged files end up inside some pip/conda/homebrew folder, which completely hides the data_dir from the user. We'd like the user to be able to see the data_dir and delete it if they wish.

Currently, I've pushed a fix where the user has to pass an explicit absolute path for the data_dir to all EDGAR classes. In preprocess.py this comes down to data_dir=os.path.join(os.path.dirname(os.path.abspath(__file__)), 'data')
I would like to re-implement the user being allowed to pass in relative paths.

For example:
We call metadata(data_dir='data')
and we'd want metadata.data_dir to resolve to os.path.join(os.path.dirname(preprocess.__file__), 'data')
That would be desirable, without the previously mentioned boilerplate.
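The desired behavior could be a small resolver that anchors relative paths at the owning module; the function and argument names here are hypothetical, not the repo's actual API.

```python
import os

def resolve_data_dir(data_dir, anchor_file):
    """Resolve a possibly-relative data_dir against the module that owns it
    (e.g. anchor_file=preprocess.__file__), so metadata(data_dir='data')
    lands next to preprocess.py rather than in the caller's cwd.
    Absolute paths are passed through untouched."""
    if os.path.isabs(data_dir):
        return data_dir
    return os.path.join(os.path.dirname(os.path.abspath(anchor_file)), data_dir)
```

Each EDGAR class could call this once in its constructor, preserving the explicit-absolute-path escape hatch while restoring the relative-path convenience.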

Finalize initial dataset

For each annotated document, convert it to a machine-learning-ready format and provide its sentences to the sentence processing pipeline.

Remove external ```FilingType``` kwargs from ```downloader.py```

FilingType is a type from secedgar. We should remove this type from the user-facing interface, as it requires an external install of secedgar and thus defeats the purpose of packaging this repository.

Let's replace it with a string that can call FilingType within our classes if necessary, but focus on allowing the user to input strings to manage the type.

Actually, can all FilingType kwargs be replaced by the new document_type kwarg system we are adding?

Group training samples by page

Currently our setup/generate_dataset.py creates a flat list of (x, y) tuples, where x is the training text and y is a list of ixbrl annotations. We instead want a list of lists, [[(x1, y1), (x2, y2)], ...], where each inner list corresponds to a page and each tuple represents an annotated web element on that page.
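Assuming each sample carries its page number, the regrouping is a one-liner with itertools.groupby; the (page, x, y) tuple layout here is an assumption, not the script's actual format.

```python
from itertools import groupby

def group_by_page(samples):
    """samples: iterable of (page_number, x, y) tuples, assumed already
    sorted by page. Returns a list of lists of (x, y), one per page."""
    return [
        [(x, y) for _, x, y in grp]
        for _, grp in groupby(samples, key=lambda s: s[0])
    ]

pages = group_by_page([(1, "a", ["tag1"]), (1, "b", []), (2, "c", ["tag2"])])
```

Note groupby only merges adjacent runs, so samples must be sorted by page first; an unsorted input would silently produce duplicate page groups.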

Autodetect missing API Keys, prompt user for keys, then save them in .keys private file

The api_keys.yaml solution is kinda janky. Mainly, it is a headache to work with as a contributor and annoying to deal with as a package user. We should instead use a '.keys.yaml' private file in the EDGAR directory. When the dataloader tries to load it, it can autodetect its absence, prompt the user for the necessary fields, and save them permanently for future use. This would be immensely useful to contributors because .keys.yaml would never appear in the commit history and would thus be safe when checking out branches.
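A sketch of the autodetect-prompt-save loop; the required key names are made up, and a minimal 'key: value' line format stands in for real YAML so the example stays dependency-free.

```python
import os

def load_keys(path='.keys.yaml', required=('example_api_key',), prompt=input):
    """Load API keys from a gitignored private file; prompt once for any
    missing required key and persist the result for future runs."""
    keys = {}
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                k, _, v = line.partition(':')
                keys[k.strip()] = v.strip()
    missing = [k for k in required if not keys.get(k)]
    for k in missing:
        keys[k] = prompt(f"Enter {k}: ")
    if missing:  # only rewrite the file when something was added
        with open(path, 'w') as f:
            for k, v in keys.items():
                f.write(f"{k}: {v}\n")
    return keys
```

Injecting `prompt` as a parameter keeps the interactive path testable; a real implementation would swap the hand-rolled reader/writer for yaml.safe_load / yaml.safe_dump.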

URGENT: Tables leaked into document data; cannot proceed with Annotation Tag Distribution Analysis

@NC-Zhao When we first talked about the annotated document parser we discussed that <table> annotations should be kept separate from all other non-table annotations

However, they are not separated at all -- This is breaking the machine learning model and analysis downstream.
I just found this out after working on figures and documents analyzing the distribution of annotation tags for a solid 4 hours.

I will have to re-do all of the analysis -- but I cannot proceed until the annotated document parser is fixed. Therefore my contributions are on-hold until this is fixed.

Add unannotated neighboring elements

In our fresh and newly polished generate_dataset.py we now output elements grouped together by page.

However, the output sample_data.csv (for nflx variant as well) only contains annotated webelements.

This is a legacy artifact from the non-grouped data.

We want to add all the unannotated neighbors back in on a per-page basis. However, we still probably don't want annotated table elements in the neighbor groups (or do we? This could be an explicit commented 'toggle' in the code.)

We will need downstream filtering that 'deletes/ignores' pages that only contain (annotated tables or unannotated elements).
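The downstream filter could look like this; the 'annotated'/'is_table' element flags are hypothetical stand-ins for whatever the real elements expose.

```python
def keep_page(page):
    """Drop pages whose elements are all annotated tables or all unannotated:
    a page is useful only if it has at least one annotated non-table element."""
    return any(e['annotated'] and not e['is_table'] for e in page)

pages = [
    [{'annotated': True,  'is_table': False}, {'annotated': False, 'is_table': False}],
    [{'annotated': True,  'is_table': True}],   # only an annotated table
    [{'annotated': False, 'is_table': False}],  # only unannotated text
]
filtered = [p for p in pages if keep_page(p)]   # keeps just the first page
```

Expressing the rule as a single predicate keeps the toggle idea simple too: including annotated tables in the neighbor groups would just be a second flag threaded through here.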

The current status of the data is really clean and impressive!

Need to scale internal dataset size

Could you expand the tikrs.txt list and confirm that the newly added tickers work in our framework?

We will rapidly approach local memory limits from the download sizes. I believe the next steps are to enable programmatic off-loading of raw data after it is processed.

My current machine sits at ~6.4 GB of memory used by 10 companies, but I believe the actual dataset I am using after processing is only ~130 MB.

IMS Documents are ignored by parser

As a first patch, we should output a warning that these documents are being skipped (they currently are, silently).

Afterwards, we can show some initial analysis on this population of documents (How many are we currently skipping?)

We can also write a parser to incorporate and annotate this document class

```parser.find_page_location``` does not work on entire dataset

setup/generate_dataset.py

```
Traceback (most recent call last):
  File "/home/seresne/ml/capstone/FinDocNLP/setup/generate_dataset.py", line 33, in <module>
    features = parser.process_file(tikr, doc, fname)
  File "/home/seresne/ml/capstone/FinDocNLP/setup/../EDGAR/parser.py", line 383, in process_file
    features = self.get_annotation_features(elems, annotation_dict)
  File "/home/seresne/ml/capstone/FinDocNLP/setup/../EDGAR/parser.py", line 328, in get_annotation_features
    page_location = self.find_page_location()
  File "/home/seresne/ml/capstone/FinDocNLP/setup/../EDGAR/parser.py", line 284, in find_page_location
    page_location = {page_number: [0, page_breaks[0].location["y"]]}
IndexError: list index out of range
```
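A guarded sketch of the failing lookup (the signature and page_height fallback are illustrative, not the repo's actual API): treat a document with no detected page breaks as a single page instead of blindly indexing page_breaks[0].

```python
from types import SimpleNamespace  # stand-in for a selenium WebElement

def find_page_location(page_breaks, page_number=1, page_height=None):
    """Return {page_number: [top, bottom]} for the first page, falling back
    to a single whole-document page when no page breaks were detected."""
    if not page_breaks:  # the IndexError case from the traceback above
        return {page_number: [0, page_height]}
    return {page_number: [0, page_breaks[0].location["y"]]}

brk = SimpleNamespace(location={"y": 792})  # mimics WebElement.location
```

Whether break-less documents should be processed as one page or skipped with a warning is a policy decision; the guard at least turns the crash into a choice.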

Figure out earliest annotated document date, add it as an option for downloads.

Suppose we only want to Query_Server for annotated documents. Sure, we can't explicitly guarantee we get only those documents, but we can explicitly avoid downloading documents from years that are not annotated. Presumably the cutoff is somewhere in 2018-2022, but we want it coded into the functions themselves as a bool kwarg in the interface.

Hopefully this cuts download time by ~5x, since we wouldn't download and then delete random files we don't need.

Analyze distribution of IXBRL labels

We need to analyze which ixbrl tags (numeric, non-numeric) to predict on, which are company-specific and should be filtered out, and which tags contain enough information (or not enough) to be predicted.
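A starting point for the distribution analysis, assuming each parsed document yields a list of tag-name strings; the us-gaap: prefix heuristic for separating standard-taxonomy tags from company extensions, and the example tag names, are assumptions to validate.

```python
from collections import Counter

def tag_distribution(documents):
    """Count ixbrl tag frequencies across documents (each document is a list
    of tag-name strings) and flag likely company-specific extension tags."""
    counts = Counter(tag for doc in documents for tag in doc)
    # Heuristic: tags outside the standard us-gaap taxonomy prefix are
    # company-specific extensions (e.g. an 'nflx:' prefix) to filter out.
    company_specific = {t for t in counts if not t.startswith('us-gaap:')}
    return counts, company_specific

docs = [['us-gaap:Revenues', 'nflx:StreamingRevenue'], ['us-gaap:Revenues']]
counts, extensions = tag_distribution(docs)
```

From `counts` one can then threshold rare tags (too few samples to predict) and inspect `extensions` before dropping them.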

Implement `smart` train-val-test split for dataset

In our multi-class, single-label classification task, a random train/val/test split does not ensure that a minimum number of samples per prediction class is present in each split.

For some metrics, such as roc_auc_score, this raises errors, since some classes being evaluated have 0 representative members in the test set.
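A stdlib sketch of the 'smart' split: sample per class so every label keeps at least min_per_class test members, which is essentially what sklearn's train_test_split(..., stratify=labels) provides; the function name and parameters here are illustrative.

```python
import random
from collections import defaultdict

def stratified_split(samples, labels, test_frac=0.2, min_per_class=1, seed=0):
    """Split per class so every label has at least min_per_class test samples,
    so metrics like roc_auc_score never see a class with 0 test members."""
    by_label = defaultdict(list)
    for s, y in zip(samples, labels):
        by_label[y].append(s)
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for y, group in by_label.items():
        rng.shuffle(group)
        n_test = max(min_per_class, int(len(group) * test_frac))
        test += [(s, y) for s in group[:n_test]]
        train += [(s, y) for s in group[n_test:]]
    return train, test
```

Classes with fewer than min_per_class + 1 samples would still end up train-starved, so those probably need to be merged or dropped upstream.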
