theOGognf / finagg
A Python package for aggregating and normalizing historical data from popular and free financial APIs.
License: Apache License 2.0
You mention that FRED data can be used instead of BEA data, but the FRED data looks quite scattered. Do you have a guide, or have you replicated the data before? Thanks.
Some dependencies we use are very useful but are still in their early stages and subject to breaking changes. It's probably better to manually vendor them into this project as submodules for now rather than adding them to the requirements. We can easily move back to using their wheels directly once they become more stable. The main ones include:
Repurposing this issue to focus on general SQL improvements. Most of the SQL tables follow a key-value/EAV model that is fine for low row counts but has huge slowdowns for larger row count tables. As an example, the SEC tables have about the same query performance either way, but the Yahoo! Finance tables are about 10x SLOWER using the key-value/EAV model vs a typical column-per-attribute model.
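For illustration, the difference between the two layouts can be sketched with a pandas pivot (tickers, attributes, and values here are made up for the example, not finagg's actual schema):

```python
import pandas as pd

# Key-value/EAV-style rows: one row per (ticker, attribute) pair.
eav = pd.DataFrame(
    {
        "ticker": ["AAPL", "AAPL", "MSFT", "MSFT"],
        "attr": ["Assets", "Liabilities", "Assets", "Liabilities"],
        "value": [352.0e9, 302.0e9, 411.0e9, 198.0e9],
    }
)

# Column-per-attribute layout: one row per ticker, one column per attribute.
wide = eav.pivot(index="ticker", columns="attr", values="value")
print(wide.loc["AAPL", "Assets"])  # 352000000000.0
```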
In addition, I'd like to rename some feature columns so the naming convention is a bit more consistent across subpackages. If a column is just the result of another column being passed to a function, then I'd like to use a naming convention similar to other SQL engines: FUNC(COL), where FUNC is the function and COL is the argument to the function. This'll change names like AssetsCurrent_pct_change to PCT_CHANGE(AssetsCurrent). Although it's a bit more verbose, this makes it easier to describe columns, especially since some columns will be processed by multiple functions (e.g., percent change columns normalized by industry for SEC features would be NORM(PCT_CHANGE(COL))).
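A sketch of how the rename could be applied (the helper name and suffix mapping are assumptions for illustration, not the actual implementation):

```python
import pandas as pd

df = pd.DataFrame({"AssetsCurrent_pct_change": [0.1, -0.05]})

def to_func_col(name: str, suffix: str = "_pct_change", func: str = "PCT_CHANGE") -> str:
    """Map an old suffix-style column name to the FUNC(COL) convention."""
    if name.endswith(suffix):
        return f"{func}({name[: -len(suffix)]})"
    return name

# rename accepts a callable applied to each column name.
df = df.rename(columns=to_func_col)
print(df.columns[0])  # PCT_CHANGE(AssetsCurrent)
```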
The pivoted table change and the column name change bring up another point: features have a column attribute that lists the names of all the columns returned by the feature, but, since we're swapping to a pivoted table representation and since the column names will have function names in them, it may be a bit more straightforward to remove the column attribute and just use the columns on the SQL table definitions.
To summarize the above and more:
- FUNC(COL) naming convention for refined columns
- Removing the column attribute

There are no foreign key constraints within any of the submodule table definitions. However, feature tables are heavily linked to the raw tables. Feature tables should be constrained to only allow inserts if rows with corresponding primary keys exist in the raw tables that they're derived from. We can probably add additional on-update or on-insert events to update the feature tables according to updates/inserts for the raw tables.
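A minimal SQLAlchemy sketch of such a constraint (the table and column names are illustrative, not finagg's actual schema):

```python
from sqlalchemy import Column, Float, ForeignKeyConstraint, MetaData, String, Table

metadata = MetaData()

# Hypothetical raw table keyed on (cik, filed).
raw = Table(
    "raw_quarterly",
    metadata,
    Column("cik", String, primary_key=True),
    Column("filed", String, primary_key=True),
    Column("value", Float),
)

# Feature table constrained so inserts require a matching raw row.
refined = Table(
    "refined_quarterly",
    metadata,
    Column("cik", String, primary_key=True),
    Column("filed", String, primary_key=True),
    Column("feature", Float),
    ForeignKeyConstraint(
        ["cik", "filed"],
        ["raw_quarterly.cik", "raw_quarterly.filed"],
    ),
)
```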
Morningstar is a pay2win site, but it may be useful for generating features that don't require much normalization (Morningstar already provides a fair value). It's at least worth looking into if their API is simple to implement.
Some features are percent change features, meaning they're computed using the current row and previous row of a column. When dates aren't aligned across features, they're forward-filled as a mechanism to align the dates. Forward-filling percent change features isn't exactly correct because the result would indicate a percent change is repeated, when in fact no percent change occurs at the filled index. It's probably better if a pad method was used for percent change columns instead.
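A small pandas illustration of the issue (dates and values are made up; zero-filling is one possible interpretation of the pad fix):

```python
import pandas as pd

prices = pd.Series(
    [100.0, 110.0, 121.0],
    index=pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-04"]),
)
pct = prices.pct_change()
idx = pd.date_range("2023-01-01", "2023-01-04")

# Forward-filling repeats the last observed change on the missing date...
ffilled = pct.reindex(idx).ffill()
print(ffilled["2023-01-03"])  # repeats the +10% change that never happened

# ...whereas filling with zero indicates no change occurred on that date.
padded = pct.reindex(idx, fill_value=0.0)
print(padded["2023-01-03"])  # 0.0
```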
FRED provides historical values for the S&P 500, Nasdaq 100, and so on. These values are commonly used in analysis. We should add them to the finagg.fred.feat.economic feature set (specifically the log change values).
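For reference, a log change is just the difference of logs, which approximates the percent change for small moves (index values here are made up):

```python
import numpy as np
import pandas as pd

sp500 = pd.Series([4000.0, 4100.0, 4050.0])

# Log change: log(x_t) - log(x_{t-1}).
log_change = np.log(sp500).diff()
print(log_change.round(4).tolist())  # [nan, 0.0247, -0.0123]
```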
Hello! I am a new user of finagg. I would like a well-structured dataframe of any financial statement (cash flow statement, balance sheet, and income statement) for many years. For instance, I would like to get the financial statements for AAPL for the last 10 years in a dataframe. Can I do that simply using finagg? Thanks in advance!
Most dates are represented as strings. It may be useful to default to a datetime object instead, though it may be a pain to implement. We can at least investigate this a bit
Either I didn't notice this before or this is a newly available file, but the SEC has a file that maps a company to the exchanges the company is listed on. This may be useful for people who only want to look at companies on a particular exchange. I should implement a simple API for this just like is implemented for the ticker file and MAYBE implement a table for storing that data (or merge it into the finagg.sec.sql.submissions table).
Seems like there are still some old docs leftover when building. Probably need to clear the docs build dir to get a fresh directory
Implement environment-specific, custom RLlib models and action distributions for the micro trader environment. They can be used as reference for the GPU implementations
Run tox pipelines in a GitHub workflow
Need to implement a GPU-only PPO algorithm. Part of it is already implemented - we just need to finish implementing interfaces:
- step method for performing gradient steps
- init_policy, init_optimizer, and init_scheduler methods
- collect method for collecting environment experiences

Seems that pyproject.toml is the new standard for Python project metadata/tool info. Phase out the setup.cfg and replace it with a pyproject.toml.
This is repurposed to download and install the nightly ZIP files available through the SEC EDGAR archives. This is noted to be significantly more efficient than getting many individual company facts/submissions through the REST API
New feature column ideas, but not prioritized until after project issues are done
Hello theOGognf! I have a little question for you. When I go on quickfs.net or Seeking Alpha, I get these results for the gross profit for AAPL:
When I use your code
df = finagg.sec.api.company_facts.get(ticker="AAPL") # get all data associated with AAPL
df = finagg.sec.api.get_unique_filings(df, form="10-K") # get all unique annual filings
df = finagg.sec.api.join_filings(df, form="10-K") # pivot the table for convenience
I got this result
Just one example: look at the gross profit in FY 2023, i.e., 1.5283e+11, and look where it is located in the spreadsheet. The same result is located in year 2021, so it is delayed by almost 2 years. Is this normal? There might be an issue somewhere. Can you look into it?
Get some sphinx docs working. Probably do away with the private module member naming just to make it easier for document generation as well. Host the docs on GitHub pages in a separate branch once they're working. We can work on a workflow for automating this in the future
I'm frequently using aggregated features from sec.features.quarterly_features and mixed.features.fundamental_features grouped by industry for analysis. It's probably time I make these aggregation functions an additional feature.
We default to using popular tickers/series, but we should have the ability to specify a custom ticker/series as well for each CLI
Need to implement an Env protocol for the GPU-only algorithm implementation. It just needs the following:
- num_envs, config, and device args and attributes
- step method
- reset method
- close method

Some feedback the package has gotten includes updating the docs to include some use cases, recipes, or common workflows for aggregating data, and also updating the docs to describe the use cases for each submodule, as it's not immediately clear to a new user which submodule to use.
The README could also be updated with small snippets of the above updates, along with some other FAQs and answers that I've gotten.
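Returning to the GPU-only algorithm work above: a minimal sketch of that Env protocol (only the member names come from the issue; types are assumptions):

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class Env(Protocol):
    """Environment protocol for the GPU-only algorithm."""

    num_envs: int
    config: dict[str, Any]
    device: str

    def reset(self) -> Any:
        """Reset all environments and return initial observations."""

    def step(self, action: Any) -> Any:
        """Apply actions and return (obs, reward, done, info)-style data."""

    def close(self) -> None:
        """Release any resources held by the environments."""
```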
Currently, the default is to recreate tables whenever an install command or method is called by dropping and then creating the tables using a subpackage's SQL metadata. This isn't ideal in that it wipes whatever data was already installed, so users have to keep reinstalling the same data as before if they want it.
Instead, installation commands should be additive in that they don't default to dropping tables and instead attempt to install new data without dropping, not losing whatever was previously installed. There should still be an option to recreate tables for all install commands in case a user has very limited storage to work with or doesn't care about previous data.
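An additive default could be as simple as skipping the drop and ignoring rows that already exist. A hypothetical sketch with SQLite (the prices table and rows are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
SCHEMA = "CREATE TABLE prices (ticker TEXT, date TEXT, close REAL, PRIMARY KEY (ticker, date))"
con.execute(SCHEMA)

def install(rows, *, recreate: bool = False) -> None:
    """Install rows additively; only drop and recreate when asked."""
    if recreate:
        con.execute("DROP TABLE IF EXISTS prices")
        con.execute(SCHEMA)
    # INSERT OR IGNORE keeps previously installed rows intact.
    con.executemany("INSERT OR IGNORE INTO prices VALUES (?, ?, ?)", rows)

install([("AAPL", "2023-01-03", 125.07)])
# Reinstalling overlapping data doesn't wipe or duplicate anything.
install([("AAPL", "2023-01-03", 125.07), ("AAPL", "2023-01-04", 126.36)])
print(con.execute("SELECT COUNT(*) FROM prices").fetchone()[0])  # 2
```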
A trainer is a level above an algorithm in that the trainer is used for running training routines/jobs and has interfaces for tracking experiments, checkpointing, evaluating, etc. The algorithm is solely focused on updating the policy and collecting environment samples, while the trainer is focused on all the workflows associated with training.
The trainer should have the following methods:
- train for running one training step and logging metrics to a tensorboard logger
- eval for running one evaluation step and logging metrics to a tensorboard logger
- describe for describing the trainer's parameters and all its current metrics (average reward, min reward, max reward, most recent losses)
- checkpoint for checkpointing the trainer and all its underlying pieces
- run for running all the above methods with some config options

I currently implement the SEC EDGAR API, but the API is still relatively new and doesn't contain all the data that may be available through the SEC EDGAR historical data file archives. I think we'd want to use the file-based SEC EDGAR data as an alternative to the SEC EDGAR API in cases where a company's data can't be found through the API. Just glancing at how the data files are organized, I don't think it'd be too big of an effort to implement. A first implementation should probably have the following elements
Need to implement algo schedulers for scheduling entropy coefficient and learning rates. They need to be implemented slightly differently since the LR scheduler will need to interface with PyTorch optimizers while the entropy scheduler just needs to update an entropy coefficient.
Just a thought I had: the API implementations are a bit clunky and could be slimmed down by writing a simple decorator that instantiates an API-specific class that has the respective get method and url attribute instead of writing standalone classes for each endpoint.
So instead of:

    class Series:
        url = ...

        def get():
            ...

we have:

    @secapi(url)
    def series():
        ...
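A hypothetical working version of that decorator (the class/attribute names mirror the snippet above; everything else, including the URL and request stand-in, is an assumption):

```python
from typing import Any, Callable


def secapi(url: str) -> Callable[[Callable[..., Any]], Any]:
    """Create a ready-to-use API endpoint object from a plain function."""

    def decorator(func: Callable[..., Any]) -> Any:
        # Build a class with the endpoint's url attribute and get method...
        cls = type(func.__name__.title(), (), {"url": url, "get": staticmethod(func)})
        # ...and expose an instance so callers use it directly.
        return cls()

    return decorator


@secapi("https://api.example.com/series")
def series(code: str) -> dict:
    # Hypothetical stand-in for the actual HTTP request.
    return {"code": code}


print(series.url)         # https://api.example.com/series
print(series.get("GDP"))  # {'code': 'GDP'}
```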
Although it's made development easy, the RL algo submodule is a bit weird to have as part of finagg. It may be better served as a dependency in its own project.
Placeholder for publishing the first public version of this package to PyPI
One of the end goals of this project is to enable the development of RL agents with a single-node GPU in minutes. This is easily possible using a combination of methods from IsaacGym, RLlib, and TorchRL. tensordict objects can be passed between modules, with all operations exclusively on a GPU, and customizations that enable user-defined models and action distributions.
Table names should be fully qualified such that they can be accessed according to their submodule location.
An example would be the prices table in yfinance. It should be something along the lines of "yfinance.raw.prices". Similarly, the daily features table should be something like "yfinance.refined.daily"
finagg.sec.feat.quarterly provides methods for quarterly SEC features from 10-Q forms, but 10-Q forms are unaudited and prone to missing or incorrect data (although the rate at which "bad" data occurs shouldn't be large enough to be too impactful). On the other hand, 10-K forms are audited, should be less prone to errors, and should be more widely available for companies. Maybe we should add a finagg.sec.feat.annually feature for exposing methods related to 10-K forms.
This may require some minor version updates to be able to distinguish between 10-Q forms and 10-K forms in the SQL tables and the raw features. I think all that's needed is:
- A finagg.sec.feat.tags.from_raw method that defaults to "10-Q"
- Adding form == "10-Q" to WHERE clauses for quarterly SEC features so as to not mix 10-Q and 10-K forms
- A RefinedAnnually class added to finagg.sec.feat with alias finagg.sec.feat.annually that replicates finagg.sec.feat.quarterly but only using 10-K forms. This includes all the normalized subfeatures (finagg.sec.feat.annually.industry and finagg.sec.feat.annually.normalized).

Maybe 10-K form features should only include the year and filing date as the index as well, since the fp column/field will always be "FY".
setuptools_scm seems like the de facto version-scheming tool now. This'll remove versioneer's files and dependency. We can also look at pyscaffold for other project best practices that we may be missing.
I frequently find myself trying to look up a particular set of values for a ticker, but I'm unable to do so because the ticker's dataset was entirely dropped due to it not having all the data required for a feature set. Maybe it's better to just store all the data scraped on install and then install features for tickers that have all the required columns.
Seems like it's common to use Twitter sentiment features as part of AI/ML for stock price predictions. It'd be nice to include an API, database, and feature set for that
Having a separate class for environment configuration is a bit verbose. This can probably be slimmed down by just passing args directly to the environment instead of a dictionary, although I'm not sure if RLlib supports environments like that.
Surely there's a great reason they broke backwards compatibility. We should investigate it
Not all wrappers are compatible with one another. There should be some specification of wrappers that either validates combinations of wrappers or catches common wrapper errors and reraises a more verbose error.
Hello! Do you have an idea how I could get the float (= shares outstanding - restricted shares) and the number of shares outstanding for the last couple of years for any stock? For instance, I would like to get the float and the number of shares outstanding for AAPL for the last 10 years in a nice pandas dataframe. Thanks in advance!
P.S. I would say the float is the most important part for me.
Hi,
I'm playing with your module and so far I find it really nice.
I'm wondering how to update the sqlite db. I did an install a few months ago and I'm wondering if I can bring it up to date, but when I check the CLI documentation it talks about a fresh install. If possible, I would like to add new values rather than start from zero, but I can't find a way to do it.
Add another FRED feature that's the time-normalized values of the FRED economic features (similar to the industry-normalized features for the SEC submodule)
Using finagg on a Mac with Python 3.10.4 via pyenv, I am running into this error when running "finagg install":
Traceback (most recent call last): sqlite3.OperationalError: no such table: sec.raw.submissions
Is there any fix to this?
Right now, each submodule has one installation function with options to install other features. It'd make more sense if the installation function was split into a sql.install function and a features.MyFeature.install method, where the features.MyFeature.install method installed based on rows written to a table via the sql.install function. This will make it easier to choose which features to install in the future.