theOGognf / finagg
A Python package for aggregating and normalizing historical data from popular and free financial APIs.
License: Apache License 2.0
You mention that FRED data can be used instead of BEA data, but the FRED data looks quite scattered. Do you have a guide, or have you replicated the data before? Thanks.
Some dependencies we use are very useful but are still in their early stages and subject to breaking changes. It's probably better to manually vendor them into this project as submodules for now rather than adding them to the requirements. We can easily move back to using their wheels directly once they become more stable. The main ones include:
Repurposing this issue to focus on general SQL improvements. Most of the SQL tables follow a key-value/EAV model that is fine for low row counts but has huge slowdowns for larger row count tables. As an example, the SEC tables have about the same query performance either way, but the Yahoo! Finance tables are about 10x SLOWER using the key-value/EAV model vs a typical column-per-attribute model.
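For illustration, the difference between the two layouts can be sketched with a pandas pivot (tickers, attributes, and values here are made up for the example, not finagg's actual schema):

```python
import pandas as pd

# Key-value/EAV-style rows: one row per (ticker, attribute) pair.
eav = pd.DataFrame(
    {
        "ticker": ["AAPL", "AAPL", "MSFT", "MSFT"],
        "attr": ["Assets", "Liabilities", "Assets", "Liabilities"],
        "value": [352.0e9, 302.0e9, 411.0e9, 198.0e9],
    }
)

# Column-per-attribute layout: one row per ticker, one column per attribute.
wide = eav.pivot(index="ticker", columns="attr", values="value")
print(wide.loc["AAPL", "Assets"])  # 352000000000.0
```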
In addition, I'd like to rename some feature columns so the naming convention is a bit more consistent across subpackages. If a column is just the result of another column being passed to a function, then I'd like to use a naming convention similar to other SQL engines: FUNC(COL), where FUNC is the function and COL is the argument to the function. This'll change names like AssetsCurrent_pct_change to PCT_CHANGE(AssetsCurrent). Although it's a bit more verbose, this makes it easier to describe columns, especially since some columns will be processed by multiple functions (e.g., percent change columns normalized by industry for SEC features would be NORM(PCT_CHANGE(COL))).
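A sketch of how the rename could be applied (the helper name and suffix mapping are assumptions for illustration, not the actual implementation):

```python
import pandas as pd

df = pd.DataFrame({"AssetsCurrent_pct_change": [0.1, -0.05]})

def to_func_col(name: str, suffix: str = "_pct_change", func: str = "PCT_CHANGE") -> str:
    """Map an old suffix-style column name to the FUNC(COL) convention."""
    if name.endswith(suffix):
        return f"{func}({name[: -len(suffix)]})"
    return name

# rename accepts a callable applied to each column name.
df = df.rename(columns=to_func_col)
print(df.columns[0])  # PCT_CHANGE(AssetsCurrent)
```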
The pivoted table change and the column name change bring up another point: features have a column attribute that lists the names of all the columns returned by the feature, but, since we're swapping to a pivoted table representation and since the column names will have function names in them, it may be a bit more straightforward to remove the column attribute and just use the columns on the SQL table definitions.
To summarize the above and more:
- FUNC(COL) naming convention for refined columns
- Removing the column attribute

There are no foreign key constraints within any of the submodule table definitions. However, feature tables are heavily linked to the raw tables. Feature tables should be constrained to only allow inserts if rows with corresponding primary keys exist in the raw tables that they're derived from. We can probably add additional on-update or on-insert events to update the feature tables according to updates/inserts for the raw tables.
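A minimal SQLAlchemy sketch of such a constraint (the table and column names are illustrative, not finagg's actual schema):

```python
from sqlalchemy import Column, Float, ForeignKeyConstraint, MetaData, String, Table

metadata = MetaData()

# Hypothetical raw table keyed on (cik, filed).
raw = Table(
    "raw_quarterly",
    metadata,
    Column("cik", String, primary_key=True),
    Column("filed", String, primary_key=True),
    Column("value", Float),
)

# Feature table constrained so inserts require a matching raw row.
refined = Table(
    "refined_quarterly",
    metadata,
    Column("cik", String, primary_key=True),
    Column("filed", String, primary_key=True),
    Column("feature", Float),
    ForeignKeyConstraint(
        ["cik", "filed"],
        ["raw_quarterly.cik", "raw_quarterly.filed"],
    ),
)
```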
Morningstar is a pay2win site, but it may be useful for generating features that don't require much normalization (Morningstar already provides a fair value). It's at least worth looking into if their API is simple to implement.
Some features are percent change features, meaning they're computed using the current row and previous row of a column. When dates aren't aligned across features, they're forward-filled as a mechanism to align the dates. Forward-filling percent change features isn't exactly correct because the result would indicate a percent change is repeated, when in fact no percent change occurs at the filled index. It's probably better if a pad method was used for percent change columns instead.
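A small pandas illustration of the issue (dates and values are made up; zero-filling is one possible interpretation of the pad fix):

```python
import pandas as pd

prices = pd.Series(
    [100.0, 110.0, 121.0],
    index=pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-04"]),
)
pct = prices.pct_change()
idx = pd.date_range("2023-01-01", "2023-01-04")

# Forward-filling repeats the last observed change on the missing date...
ffilled = pct.reindex(idx).ffill()
print(ffilled["2023-01-03"])  # repeats the +10% change that never happened

# ...whereas filling with zero indicates no change occurred on that date.
padded = pct.reindex(idx, fill_value=0.0)
print(padded["2023-01-03"])  # 0.0
```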
FRED provides historical values for the S&P 500, Nasdaq 100, and so on. These values are commonly used in analysis. We should add them to the finagg.fred.feat.economic feature set (specifically the log change values).
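For reference, a log change is just the difference of logs, which approximates the percent change for small moves (index values here are made up):

```python
import numpy as np
import pandas as pd

sp500 = pd.Series([4000.0, 4100.0, 4050.0])

# Log change: log(x_t) - log(x_{t-1}).
log_change = np.log(sp500).diff()
print(log_change.round(4).tolist())  # [nan, 0.0247, -0.0123]
```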
Hello! I am a new user of finagg. I would like a well-structured dataframe of any financial statement (cash flow statement, balance sheet, and income statement) for many years. For instance, I would like to get the financial statements for AAPL for the last 10 years in a dataframe. Can I do that simply using finagg? Thanks in advance!
Most dates are represented as strings. It may be useful to default to a datetime object instead, though it may be a pain to implement. We can at least investigate this a bit
Either I didn't notice this before or this is a newly available file, but the SEC has a file that maps a company to the exchanges the company is listed on. This may be useful for people who only want to look at companies on a particular exchange. I should implement a simple API for this just like is implemented for the ticker file and MAYBE implement a table for storing that data (or merge it into the finagg.sec.sql.submissions table).
Seems like there are still some old docs leftover when building. Probably need to clear the docs build dir to get a fresh directory
Implement environment-specific, custom RLlib models and action distributions for the micro trader environment. They can be used as reference for the GPU implementations
Run tox pipelines in a GitHub workflow
Need to implement a GPU-only PPO algorithm. Part of it is already implemented - we just need to finish implementing interfaces:
- step method for performing gradient steps
- init_policy, init_optimizer, and init_scheduler methods
- collect method for collecting environment experiences

Seems that pyproject.toml is the new standard for Python project metadata/tool info. Phase out the setup.cfg and replace it with a pyproject.toml.
This is repurposed to download and install the nightly ZIP files available through the SEC EDGAR archives. This is noted to be significantly more efficient than getting many individual company facts/submissions through the REST API
New feature column ideas, but not prioritized until after project issues are done
Hello theOGognf! I have a little question for you. When I go on quickfs.net or Seeking Alpha, I get these results for the gross profit for AAPL:
When I use your code
df = finagg.sec.api.company_facts.get(ticker="AAPL") # get all data associated with AAPL
df = finagg.sec.api.get_unique_filings(df, form="10-K") # get all unique annual filings
df = finagg.sec.api.join_filings(df, form="10-K") # pivot the table for convenience
I got this result
Just one example: look at the gross profit in FY 2023, i.e., 1.5283e+11, and look where it is located in the spreadsheet. The same result is located in year 2021, so it is delayed by almost 2 years. Is this normal? There might be an issue somewhere. Can you look into it?
Get some sphinx docs working. Probably do away with the private module member naming just to make it easier for document generation as well. Host the docs on GitHub pages in a separate branch once they're working. We can work on a workflow for automating this in the future
I'm frequently using aggregated features from sec.features.quarterly_features and mixed.features.fundamental_features grouped by industry for analysis. It's probably time I make these aggregation functions an additional feature.
We default to using popular tickers/series, but we should have the ability to specify a custom ticker/series as well for each CLI
Need to implement an Env protocol for the GPU-only algorithm implementation. It just needs the following:
- num_envs, config, and device args and attributes
- step method
- reset method
- close method

Some feedback the package has gotten includes updating the docs to include some use cases, recipes, or common workflows for aggregating data, and also updating the docs to describe the use cases for each submodule, as it's not immediately clear to a new user which submodule to use.
The README could also be updated with small snippets of the above updates, along with some other FAQs and answers that I've gotten.
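Returning to the GPU-only algorithm work above: a minimal sketch of that Env protocol (only the member names come from the issue; types are assumptions):

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class Env(Protocol):
    """Environment protocol for the GPU-only algorithm."""

    num_envs: int
    config: dict[str, Any]
    device: str

    def reset(self) -> Any:
        """Reset all environments and return initial observations."""

    def step(self, action: Any) -> Any:
        """Apply actions and return (obs, reward, done, info)-style data."""

    def close(self) -> None:
        """Release any resources held by the environments."""
```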
Currently, the default is to recreate tables whenever an install command or method is called by dropping and then creating the tables using a subpackage's SQL metadata. This isn't ideal in that it wipes whatever data was already installed, so users have to keep reinstalling the same data as before if they want it.
Instead, installation commands should be additive in that they don't default to dropping tables and instead attempt to install new data without dropping, not losing whatever was previously installed. There should still be an option to recreate tables for all install commands in case a user has very limited storage to work with or doesn't care about previous data.
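An additive default could be as simple as skipping the drop and ignoring rows that already exist. A hypothetical sketch with SQLite (the prices table and rows are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
SCHEMA = "CREATE TABLE prices (ticker TEXT, date TEXT, close REAL, PRIMARY KEY (ticker, date))"
con.execute(SCHEMA)

def install(rows, *, recreate: bool = False) -> None:
    """Install rows additively; only drop and recreate when asked."""
    if recreate:
        con.execute("DROP TABLE IF EXISTS prices")
        con.execute(SCHEMA)
    # INSERT OR IGNORE keeps previously installed rows intact.
    con.executemany("INSERT OR IGNORE INTO prices VALUES (?, ?, ?)", rows)

install([("AAPL", "2023-01-03", 125.07)])
# Reinstalling overlapping data doesn't wipe or duplicate anything.
install([("AAPL", "2023-01-03", 125.07), ("AAPL", "2023-01-04", 126.36)])
print(con.execute("SELECT COUNT(*) FROM prices").fetchone()[0])  # 2
```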
A trainer is a level above an algorithm in that the trainer is used for running training routines/jobs and has interfaces for tracking experiments, checkpointing, evaluating, etc. The algorithm is solely focused on updating the policy and collecting environment samples, while the trainer is focused on all the workflows associated with training.
The trainer should have the following methods:
- train for running one training step and logging metrics to a tensorboard logger
- eval for running one evaluation step and logging metrics to a tensorboard logger
- describe for describing the trainer's parameters and all its current metrics (average reward, min reward, max reward, most recent losses)
- checkpoint for checkpointing the trainer and all its underlying pieces
- run for running all the above methods with some config options

I currently implement the SEC EDGAR API, but the API is still relatively new and doesn't contain all the data that may be available through the SEC EDGAR historical data file archives. I think we'd want to use the file-based SEC EDGAR data as an alternative to the SEC EDGAR API in cases where a company's data can't be found through the API. Just glancing at how the data files are organized, I don't think it'd be too big of an effort to implement. A first implementation should probably have the following elements
Need to implement algo schedulers for scheduling entropy coefficient and learning rates. They need to be implemented slightly differently since the LR scheduler will need to interface with PyTorch optimizers while the entropy scheduler just needs to update an entropy coefficient.
Just a thought I had: the API implementations are a bit clunky and could be slimmed down by writing a simple decorator that instantiates an API-specific class that has the respective get method and url attribute instead of writing standalone classes for each endpoint.
So instead of:

    class Series:
        url = ...

        def get():
            ...

we have:

    @secapi(url)
    def series():
        ...
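A hypothetical working version of that decorator (the class/attribute names mirror the snippet above; everything else, including the URL and request stand-in, is an assumption):

```python
from typing import Any, Callable


def secapi(url: str) -> Callable[[Callable[..., Any]], Any]:
    """Create a ready-to-use API endpoint object from a plain function."""

    def decorator(func: Callable[..., Any]) -> Any:
        # Build a class with the endpoint's url attribute and get method...
        cls = type(func.__name__.title(), (), {"url": url, "get": staticmethod(func)})
        # ...and expose an instance so callers use it directly.
        return cls()

    return decorator


@secapi("https://api.example.com/series")
def series(code: str) -> dict:
    # Hypothetical stand-in for the actual HTTP request.
    return {"code": code}


print(series.url)         # https://api.example.com/series
print(series.get("GDP"))  # {'code': 'GDP'}
```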
Although it's made development easy, the RL algo submodule is a bit weird to have as part of finagg. It may be better served as a dependency in its own project.
Placeholder for publishing the first public version of this package to PyPI
One of the end goals of this project is to enable the development of RL agents with a single-node GPU in minutes. This is easily possible using a combination of methods from IsaacGym, RLlib, and TorchRL. tensordict objects can be passed between modules, with all operations exclusively on a GPU, and customizations that enable user-defined models and action distributions.
Table names should be fully qualified such that they can be accessed according to their submodule location.
An example would be the prices table in yfinance. It should be something along the lines of "yfinance.raw.prices". Similarly, the daily features table should be something like "yfinance.refined.daily"
finagg.sec.feat.quarterly provides methods for quarterly SEC features from 10-Q forms, but 10-Q forms are unaudited and prone to missing or incorrect data (although the rate at which "bad" data occurs shouldn't be large enough to be too impactful). On the other hand, 10-K forms are audited, should be less prone to errors, and should be more widely available for companies. Maybe we should add a finagg.sec.feat.annually feature for exposing methods related to 10-K forms.
This may require some minor version updates to be able to distinguish between 10-Q forms and 10-K forms in the SQL tables and the raw features. I think all that's needed is:
- A finagg.sec.feat.tags.from_raw method that defaults to "10-Q"
- Adding form == "10-Q" to WHERE clauses for quarterly SEC features so as to not mix 10-Q and 10-K forms
- A RefinedAnnually class added to finagg.sec.feat with alias finagg.sec.feat.annually that replicates finagg.sec.feat.quarterly but only using 10-K forms. This includes all the normalized subfeatures (finagg.sec.feat.annually.industry and finagg.sec.feat.annually.normalized).

Maybe 10-K form features should only include the year and filing date as the index as well, since the fp column/field will always be "FY".
setuptools_scm seems like the de facto version-scheming tool now. This'll remove versioneer's files and dependency. We can also look at pyscaffold for other project best practices that we may be missing.
I frequently find myself trying to look up a particular set of values for a ticker, but I'm unable to do so because the ticker's dataset was entirely dropped due to it not having all the data required for a feature set. Maybe it's better to just store all the data scraped on install and then install features for tickers that have all the required columns.
Seems like it's common to use Twitter sentiment features as part of AI/ML for stock price predictions. It'd be nice to include an API, database, and feature set for that
Having a separate class for environment configuration is a bit verbose. This can probably be slimmed down by just passing args directly to the environment instead of a dictionary, although I'm not sure if RLlib supports environments like that.
Surely there's a great reason they broke backwards compatibility. We should investigate it
Not all wrappers are compatible with one another. There should be some specification of wrappers that either validates combinations of wrappers or catches common wrapper errors and reraises a more verbose error.
Hello! Do you have an idea how I could get the float (= shares outstanding - restricted shares) and the number of shares outstanding for the last couple of years for any stock? For instance, I would like to get the float and the number of shares outstanding for AAPL for the last 10 years in a nice pandas dataframe. Thanks in advance!
P.S. I would say the float is the most important part for me.
Hi,
I'm playing with your module and so far I find it really nice.
I'm wondering how to update the sqlite db. I did an install a few months ago and I'm wondering if I can bring it up to date, but when I check the CLI documentation it talks about a fresh install. If possible, I would like to add new values rather than start from zero, but I can't find a way to do it.
Add another FRED feature that's the time-normalized values of the FRED economic features (similar to the industry-normalized features for the SEC submodule)
Using finagg on a Mac with Python 3.10.4 via pyenv, I am running into this error when running "finagg install":
Traceback (most recent call last): sqlite3.OperationalError: no such table: sec.raw.submissions
Is there any fix to this?
Right now, each submodule has one installation function with options to install other features. It'd make more sense if the installation function was split into a sql.install function and a features.MyFeature.install method, where the features.MyFeature.install method installed based on rows written to a table via the sql.install function. This will make it easier to choose which features to install in the future.