ubco-w2022t2-data301 / project-group-group15

An algorithmic asset allocation analysis project leveraging the power of quantitative analysis, financial modelling and data analytics 📈

License: MIT License

Languages: Jupyter Notebook 99.40%, Python 0.60%
Topics: asset-allocation, asset-management, equities, portfolio-management, risk-management, stocks, ubc, algorithmic-asset-allocation, data-analytics, data-science, python

project-group-group15's Introduction

Group 15 - Algorithmic Asset Allocation and Portfolio Construction

Milestones

Details for each milestone are available on Canvas (left sidebar, Course Project).

Describe your topic/interest in about 150-200 words

Our combined goal is to investigate which equity data is most important to consider when developing a portfolio of equities that favours either a growth or a value investment strategy. From an industry perspective, we intend to provide a solution to the problem of maximising client returns by balancing risk and return through optimal asset allocation. Clients have diverse investment goals, which is why we have divided our group project into the exploration of both growth and value investment strategies. As a successful analysis requires us to validate our findings with a series of control tests, we are equally passionate about discovering new trends in our data and about validating our findings against what is already known. Our research questions therefore prescribe an implicit comparison with known factors. As a group of Computer Science and Data Science students, we are passionate about revealing underlying trends across financial markets.

Describe your dataset in about 150-200 words

The raw data comprises 7 data sets of equity data covering the financial accounts, valuation, performance, dividends and margins of 8,116 publicly traded companies in the United States. The data is sourced from TradingView, which grants permission to use and distribute the data per the attribution policy cited in its Terms of Service1. The data has been extracted from TradingView's publicly available Stock Screener.

There are 76 data columns shared across all 7 data sets, of which 70 are unique. Select market data is provided by ICE Data Services and integrated by TradingView. As the S&P 500 Index is the generally accepted benchmark for the top-performing companies in the United States, 500 rows of data will be used, and the number of columns will be significantly reduced according to their relevance to our analysis. Consequently, the total number of raw data points is 616,816, of which our analysis will focus on 35,000 unique raw data points. The data sets capture static financial market data for the 30th of January, 2023; this snapshot can be updated by uploading new data, but will remain constant for our analysis.

Team Members

  • Colin Lefter: I am a first-year Computer Science student who is passionate about algorithmic financial modelling, with experience in algorithmic trading, Python, Java and R.
  • Keisha Kwek: I am a first-year student in the Faculty of Science on track to major in Data Science. I am passionate about environmental and economic sustainability, which I hope to help improve by applying the skills from my degree.


References

TradingView grants all users of tradingview.com, and all other available versions of the site, permission to use snapshots of TradingView charts in analysis, press releases, books, articles, blog posts and other publications. In addition, TradingView permits the use of all previously mentioned materials in education sessions, the display of TradingView charts during video broadcasts (including overviews, news and analytics), and otherwise using or promoting TradingView charts or any products from the TradingView website, on the condition that TradingView attribution is clearly visible at all times when such charts and products are used.

Attribution must include a reference to TradingView, including, but not limited to, those described herein.

Footnotes

  1. From TradingView's Terms of Service page:

project-group-group15's People

Contributors

colinlefter, firasm, github-classroom[bot], keishakwek


project-group-group15's Issues

Tier 2 Algorithm Proposal

  • Construct linear regression models for each equity, with the 3-month change (of every equity) on the y-axis and the values (for every equity) of a single financial variable on the x-axis.
  • Construct an optimization algorithm that does the above and extracts the r and r^2 values from each model, along with the name associated with each model (y on x).
  • The optimization algorithm should then calculate the z-score for each r / r^2 value in the list. It should filter these scores and select the variables that lie within the same positive z-score bracket (depending on whether a positive or negative value is considered favourable), e.g. [0, 0.5), [0.5, 1), [1, 1.5), [1.5, 2), [2, 2.5), [2.5, 3), to serve as predictors for the next multiple linear regression. This process continues recursively until all predictors lie within one bracket, at which point the base case is reached; that final model is said to best determine the price of the equity. A rough sketch of the scoring step is given below.
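A minimal sketch of the per-variable scoring step described above, assuming a tidy DataFrame with one row per equity. The column name "3-Month Change" and the 0.5-wide brackets follow the proposal; the function name and schema are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

def rank_predictors(df: pd.DataFrame, target: str = "3-Month Change") -> pd.DataFrame:
    """Regress the target on each candidate variable and bracket the fits by z-score."""
    records = []
    for col in df.columns.drop(target):
        mask = df[col].notna() & df[target].notna()
        fit = stats.linregress(df.loc[mask, col], df.loc[mask, target])
        records.append({"variable": col, "r": fit.rvalue, "r2": fit.rvalue ** 2})
    scores = pd.DataFrame(records)
    # Standardize the r^2 values so the 0.5-wide brackets are comparable across variables.
    scores["z"] = (scores["r2"] - scores["r2"].mean()) / scores["r2"].std()
    scores["bracket"] = np.floor(scores["z"] / 0.5) * 0.5
    return scores.sort_values("z", ascending=False)
```

The variables sharing the top occupied bracket would then feed the next multiple linear regression, with the recursion stopping once every surviving predictor falls in one bracket.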

Feedback Session 2

Feedback for Group

  1. What is your contracted grade?
  • A+
  2. Are there any group dynamics issues that we need to be aware of?
  • No
  3. Notes from [Islam/Saira] on feedback on the analysis:
  • Have a final plot that shows low companies and high companies
  • Okay to do things in seaborn and pandas
  • Otherwise on track for A+
  • Clean up your repo, move unused ipynb files to ungraded section.

Project To Do

Phase 1: Data Loading

Assigned to Colin:

  • Write Python code guidelines for project

Assigned to Keisha:

  • To be determined

Assigned to Yigit:

  • To be determined

Assigned to All:

  • Write code of conduct README file (content still needs to be approved as a group)
  • Write project vision statement (content still needs to be approved as a group)

Task list:

  • Identify all relevant variables for our analysis from each data set
  • Delete irrelevant columns from each data set (via Microsoft Excel) and submit a pull request on GitHub to upload all of the data files to the processed data folder
  • Discuss pull request as a group and finalize processed data set (accept pull request on GitHub)
  • Identify parameters for the central algorithm (e.g. holding period, preferred industries, capital invested)
  • Create general class framework in Jupyter Notebook and import relevant libraries (Plotly, Pandas, Numpy, Scikit-Learn, Seaborn)
  • Load data into Pandas data frames and develop any helper functions used in data wrangling, processing and cleaning
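As a starting point for the loading and helper-function tasks above, a minimal sketch assuming the processed CSV exports live under a data/processed folder (the path and the "Ticker" key column are hypothetical):

```python
import pandas as pd

def load_equity_data(path: str = "../data/processed/sp500_overview.csv") -> pd.DataFrame:
    """Load one processed data set and apply the shared cleaning steps."""
    return (
        pd.read_csv(path)                          # relative path into the processed data folder
        .rename(columns=lambda c: c.strip())       # normalize header whitespace
        .dropna(axis=1, how="all")                 # drop columns that are entirely empty
        .drop_duplicates(subset="Ticker")          # hypothetical key column: one row per equity
    )
```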

Phase 2: Analysis

Assigned to Colin:

  • To be determined

Assigned to Keisha:

  • To be determined

Assigned to Yigit:

  • To be determined

Assigned to All:

  • Identify analysis algorithms to be used and developed (concepts, use and method of construction)
  • Identify how each algorithm may build upon each other
  • Identify points where data visualization takes place (e.g. within all analysis algorithms, include a case where the relevant data is plotted on command)

Task list:

  • Write analysis algorithms into Jupyter Notebook documents (specific algorithms will be assigned to different group members at a later stage)
  • Write data visualization components into analysis algorithms / create separate functions for doing so (specific plots will be assigned to different group members at a later stage)

Error in plot for analysis2

Hi @keishakwek,

While @Iadel95 and I were going through your project repository, we noticed an error in one of your cells that prevented a plot from being produced.

Could you please fix this so we can appropriately assess your project?

[Screenshot attached: 2023-04-20 at 1:21 PM]

Feedback Session 3 (Student Notes)

Feedback for Group

  1. What is your contracted grade?
  • A+
  2. Are there any group dynamics issues that we need to be aware of?
  • No
  3. Notes from Dr. Moosvi on feedback on the analysis:
  • #19
  • The filters at the top should be placed next to the graphs so it is easier to see which filter applies to what graph
  • The classified portfolio should use a different colour scheme; it is somewhat confusing that it shares the same colour gradient as the funds-allocated filter even though the classified portfolio uses discrete colours (even if there is a connection, it is better to have another discrete filter)
  • You should have a filter that allows you to compare only one scatter plot from each section at a time (additionally, one way to get around the average-bar issue is to add an average number box at the top of the graphs)
  • You should also add notes throughout your dashboard for each section that needs explaining (like you did in the video)

Project Analysis Stages and Division

As we hope to produce investment portfolio software with an interactive user interface, we must ensure that the analysis section of the software is well researched and calculated before the client portfolio is aggregated. An illustration of our product-building stages is attached below (please feel free to comment if you have any questions or concerns).
[Diagram: Group 15 Project Steps]

To do this, we must ensure that each type of investment is examined to fit its own investment type (or company type). For instance, in our previous meeting we concluded that the three main types of stocks in a portfolio come from growth, value, and GARP companies. Moreover, the volatility, return, and overall nature of these companies vary, and so do their analysis procedures. To accommodate the intricate variables each investment type carries, such as differences in industry volatility and the data cleaning and processing required for each industry, I believe it would be most effective to have each member of the group focus on analyzing a specific investment type (or company type).

This way we are able to build a solid foundation for our software to be a reliable portfolio integrator: in the portfolio aggregation stage, we will be able to adjust the variance of each company appropriately based on the client's investment preferences. For example, if the client wants a riskier portfolio, the algorithm will weight the calculated stocks towards more growth companies, adjust to any client variation, and produce the best combination for the client; this can only be done with a solid foundation in the analysis of each stock type.

The stock type division will be as follows:

  1. Growth companies: Colin Lefter
  2. Value companies: Keisha Kwek
  3. GARP companies: Yigit Yurdusever

Exploratory Data Analysis and Algorithm Proposal

One of the fundamental components of our data analysis is determining which variables to conduct our analysis upon, i.e. identifying which variables -- when considered collectively -- are deterministic of a positive return on investment. Although there are some generally accepted variables, such as P/E (price-earnings ratio) and NCF (net cash flow), that are important to consider when picking a stock, there are many other variables that also need to be considered. Producing quality data analysis requires us to demonstrate that the variables we have chosen to analyze are indeed relevant to our analysis. Even more importantly, one of the first steps of data analysis is exploratory data analysis, which requires us to figure out the properties of our data set in order to draft approaches to analyzing it.

Keisha has suggested that we construct a statistical summary of our data set, which can be done with df.describe().T in Pandas, focusing on the standard deviation component of that output. Building on that, I propose that one way we can determine which variables are important is by examining how large the standard deviation is. We would need to calculate the z-score (standard score) of each standard deviation in order to fairly compare columns by their standard deviation, as well as equities within each column. The logic is that the higher the standard deviation, the higher the probability that the column is deterministic of the performance of an equity (because there is significant variation among companies for that column). Although this is only a hypothesis, and may very well be wrong, it is worth investigating to start our analysis. If this approach proves effective, we can take the z-scores for each column and run them through a data normalization algorithm (the MinMax scaler) to generate weighted values from 0 to 1. We can then assign a weight to each column so that we can quantify which variables are the most important (e.g. P/E may have one of the highest weights because it is generally regarded as one of the most important variables to examine when picking a stock). A sketch of this weighting step follows.
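A minimal sketch of that weighting idea, assuming a numeric DataFrame of equities by financial variables is already loaded (the function name and schema are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def variable_weights(df: pd.DataFrame) -> pd.Series:
    """Weight each financial variable by the standardized spread (std) of its values."""
    stds = df.describe().T["std"]                    # standard deviation of every numeric column
    z_scores = (stds - stds.mean()) / stds.std()     # standard score of each column's std
    scaled = MinMaxScaler().fit_transform(z_scores.to_frame())
    return pd.Series(scaled.ravel(), index=stds.index, name="weight").sort_values(ascending=False)
```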

Another interesting step is determining a threshold for classifying variables as important or less important. This threshold would be applied to the z-scores, and we can take all of the columns classified as "unimportant" and set them aside to be run through the Fourier Transform. The logic is that if there is extremely little variation among companies for a specific financial variable, it will be very hard to compare companies based on that value, implying that there is a lot of noise in the data, which is exactly what the discipline of signal processing deals with. Running a Fourier Transform can filter through this noise and identify peaks, which we can use to determine which companies stand out for that seemingly trendless variable.
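For illustration only, one loose interpretation of that Fourier idea, treating a single low-variation column as a signal across companies; this is a sketch of the proposal rather than an established method, and all names are hypothetical:

```python
import numpy as np
import pandas as pd
from scipy.signal import find_peaks

def standout_companies(series: pd.Series, n: int = 5) -> pd.Index:
    """Return companies whose values stand out after removing the slow trend components."""
    clean = series.dropna()
    spectrum = np.fft.rfft(clean.to_numpy(dtype=float))
    spectrum[: max(1, len(spectrum) // 20)] = 0        # zero out the lowest-frequency components
    residual = np.fft.irfft(spectrum, n=len(clean))    # what remains is the "noise" of interest
    peaks, _ = find_peaks(np.abs(residual))
    top = peaks[np.argsort(np.abs(residual)[peaks])[-n:]]
    return clean.index[top]
```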

Final Analysis Algorithm Plan

  • The research question has been changed to reflect the desire to find out what equity data is the most deterministic of the performance of an equity
  • This means the price change of the equity is no longer the benchmark used to determine whether financial variables dictate positive or negative performance, as no correlation with price change was found
  • Therefore, the new strategy is to see which variables dictate positive or negative changes in all other variables

Method 1: Cross-Predictor Heat Map Contrast

Data Fundamentals

  • 40 financial variables
  • 20 companies observed
  • Need to use the top 20 and bottom 20 companies by market capitalization as the basis of the comparison
  • Heat map structure: 40 financial variables x 20 companies
  • Requires the comparison of two heat maps for every sorted financial variable, implying 2 sets per variable, therefore a total of 80 heat maps
  • Can have 4 columns with 2 pairs of top and bottom heat maps respectively

Strategy (Build off of the correlation plot analysis)

  1. From the correlation plot, take the x-values that have been classified as being the most deterministic of the performance of the widest range of y-values by correlation
  2. For each of these x-values, plot two corresponding heat maps of the top 20 companies and bottom 20 companies sorted by that x value
  3. Observe the contrast in the number of yellow versus blue squares between the heat map of the top 20 companies and the heat map of the bottom 20 companies ordered by that x value
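A minimal sketch of the strategy above, assuming a DataFrame of companies by financial variables that has already been normalized to [0, 1] (names are hypothetical); the viridis colour map gives the yellow-high / blue-low contrast described in the next section:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def contrast_heatmaps(df: pd.DataFrame, sort_by: str, n: int = 20) -> None:
    """Plot heat maps of the top-n and bottom-n companies sorted by one leading variable."""
    ranked = df.sort_values(sort_by, ascending=False)
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))
    sns.heatmap(ranked.head(n), cmap="viridis", ax=axes[0], cbar=False)
    sns.heatmap(ranked.tail(n), cmap="viridis", ax=axes[1])
    axes[0].set_title(f"Top {n} companies by {sort_by}")
    axes[1].set_title(f"Bottom {n} companies by {sort_by}")
    plt.tight_layout()
```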

What the results indicate

  • The x values extracted from the correlation plot have been selected on the basis of the correlation that they have with other financial variables
  • The theory is that the more correlations that exist between variables, the higher the likelihood that sorting the heat map by that x value will dictate the overall performance of an equity
  • Therefore, if there is an apparent strong visual contrast in the number of yellow (high scores) versus blue (low scores) tiles, it is confirmed that the variable in question does indeed dictate the performance of an equity with significant effect

Method 2: Multiple Linear Regression

  • It is not enough to conclude that equity performance is dictated by a series of variables without the ability to quantify the observation that produced such a conclusion
  • Therefore, each heat map can be supported with the construction of a multiple linear regression that maps all of the financial variables in the data set against a singular y-value that was used as the leading x-value in the heat map plot comparison

Strategy (build off of the heat map analysis)

  1. For each x variable (predictor) that was used in the heat map analysis, construct a standardized multiple linear regression that takes that single x value as the y value, matched against all other remaining x values in the data set
  2. Collect summary statistics from each model and aggregate them in a data frame that matches each y variable against the corresponding summary statistics from its model
  3. Compare the summary statistics (mainly the coefficient of determination) with the observed percentage of yellow / blue tiles in each heat map and check whether the degree of contrast is supported by the value of the coefficient
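A minimal sketch of steps 1 and 2, assuming a standardized DataFrame and a list of the leading variables taken from the heat map analysis (names are hypothetical):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def regression_summaries(df: pd.DataFrame, targets: list[str]) -> pd.DataFrame:
    """Regress each target on all remaining variables and collect the R^2 of each model."""
    rows = []
    for target in targets:
        X, y = df.drop(columns=[target]), df[target]
        model = LinearRegression().fit(X, y)
        rows.append({"target": target, "r2": model.score(X, y), "n_predictors": X.shape[1]})
    return pd.DataFrame(rows).sort_values("r2", ascending=False)
```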

What the results indicate

  • If the coefficient of determination for the model that corresponds to each pair of heat map plots is high, and the heat maps show a stark contrast in the number of yellow versus blue tiles, then it is confirmed that the single x variable examined from the heat maps (corresponding to the single y value in the multiple linear regression model) is indeed, in large part, deterministic of the performance of an equity.

Method 3: Density Plots

  • After the weight of each variable has been determined, it is worth constructing two density plots that illustrate the distributions of the highest-weight financial variables compared to the lowest-weight financial variables

Strategy (build off the multiple linear regression analysis)

  1. Construct 4 density plots -- each density plot should be ordered by quantiles
  2. Plot 1: all financial variables that exist up to the 25th percentile by weighted value
  3. Plot 2: all financial variables that exist in the range of the 25th - 50th percentile by weighted value
  4. Plot 3: all financial variables that exist in the range of the 50th - 75th percentile by weighted value
  5. Plot 4: all financial variables that exist above the 75th percentile by weighted value
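A minimal sketch of the four quantile-ordered density plots, assuming the per-variable weights from the EDA step and a normalized data frame of values (names are hypothetical):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def quantile_density_plots(df: pd.DataFrame, weights: pd.Series) -> None:
    """Draw one density plot per weight quartile of the financial variables."""
    quartile = pd.qcut(weights, q=4, labels=["Q1", "Q2", "Q3", "Q4"])
    fig, axes = plt.subplots(2, 2, figsize=(12, 8))
    for ax, label in zip(axes.ravel(), ["Q1", "Q2", "Q3", "Q4"]):
        cols = quartile[quartile == label].index          # variables in this weight quartile
        sns.kdeplot(x=df[cols].melt()["value"], ax=ax, fill=True)
        ax.set_title(f"Variables in weight quartile {label} (Q1 = lowest weights)")
    plt.tight_layout()
```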

What the results indicate

  • It is possible that the highest-weight financial variables have negatively-skewed, tight distributions and that the lowest-weight financial variables have positively-skewed, wide distributions.
  • The theory is that the highest-weight financial variables are largely driven by the performance of outliers, implying that the aggregated performance of most companies in the S&P500 is relatively poor. This may suggest that positive outliers in this distribution dictate top performance amongst all other equities in the data set.

Final step -- portfolio construction

  • The coefficients of determination can be aggregated into an ordered set that is then normalized
  • These normalized values will become multipliers to the existing normalized values in the original data set
  • The normalized values across all financial columns for each equity are summed up and expressed as a percentage of the maximum possible attainable score (which is just the number of financial variables in the data set) in a new column
  • The net capital to be invested by the client is then multiplied by each value in the column (corresponding to each equity), giving the total amount of cash to be allocated to each equity
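A minimal sketch of this allocation step, assuming a normalized data frame of equity scores and the coefficients of determination collected in Method 2 (all names are hypothetical):

```python
import pandas as pd

def allocate_capital(df_norm: pd.DataFrame, r2: pd.Series, capital: float) -> pd.Series:
    """Turn weighted, normalized scores per equity into a cash allocation."""
    multipliers = (r2 - r2.min()) / (r2.max() - r2.min())   # normalize the ordered set of R^2 values
    weighted = df_norm.mul(multipliers, axis=1)             # scale each financial column by its multiplier
    score = weighted.sum(axis=1) / df_norm.shape[1]         # share of the maximum attainable score
    return capital * score                                  # cash to allocate to each equity
```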

Group Meeting #2

Main topics of discussion:

  • Rough draft of class MindMap
  • Potential values to consider in data visualization and analysis

Meeting notes:
Meeting2.pdf

Feedback Session 1

Feedback for Group

  1. What is your contracted grade?
  • A+
  2. Are there any group dynamics issues that we need to be aware of?
  • No
  3. Notes from [Islam/Saira] on feedback on the analysis:
  • Subset the number of companies you're working with, 50 or 100 should be enough
  • Faceted plots so it's less scrolling
  • Count to density -
  • Heatmap, show only the top 5 and bottom 5
  • Pick the top 10 companies based on dividend over that year (discussed this a bit more, and your idea is good)
  • Pick the top 3 dividend paying companies by sector
  • Coordinate data cleaning and processing
  • Suggest having one notebook for cleaning and EDA, another for analysis

Contract Grading Requirements (A+)

Contract for an A+ project:

  • Complex method chaining
  • At least three advanced operations (e.g. anonymous or lambda functions in assign or apply functions)
  • At least two method chains
  • Effective use of Git branches, code reviews, Pull Requests (and approvers)
  • Extra “flair” and effort
  • Visit Project student hours 2 (updated from 4) or more times for feedback (and implement it)
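For reference, a hedged illustration of what complex method chaining with lambda functions in assign/apply could look like on this data set (the path and column names are hypothetical):

```python
import pandas as pd

cleaned = (
    pd.read_csv("../data/processed/sp500_overview.csv")          # hypothetical relative path
    .rename(columns=lambda c: c.strip().title())                 # tidy header names
    .dropna(subset=["Price"])                                    # hypothetical column
    .assign(
        MarketCapBn=lambda d: d["Market Capitalization"] / 1e9,  # hypothetical column
        Sector=lambda d: d["Sector"].str.title(),
    )
    .query("MarketCapBn > 10")
    .sort_values("MarketCapBn", ascending=False)
)
```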

Lower-level requirements:

Contract for a C project:

Data Loading

  • Use relative paths to load datasets

Simple data cleaning, wrangling, processing

  • Dealing with missing values
  • Update column names
  • Fixing inconsistent capitalization issues in the dataset
  • Removing unnecessary columns
  • Create new “aggregate” columns (e.g. weighted score)

Simple data visualization

  • Axis titles (all plots must have titles)
  • Axis labels (all plots must have labels)
  • Ensure all titles and labels are appropriate (with units) and not the default column names
  • Larger font size to ensure all labels and titles are legible
  • Flip the x- and y-axes where appropriate, for example, plot categorical data on the y-axis (so it’s easier to read)

Effectively communicating data through visualizations

  • Captions describing each plot
  • At least one key insight from the plot

Contract for a B project:

Complex data visualizations (compound plots, complex plots)

  • Usage of subplots or small multiples (aka facet plots)

Complex data cleaning/wrangling/processing

  • Generation of additional dataframes that are summaries or aggregated from the original dataset
  • Use of Pivot Tables and/or Groupby objects

Complex research questions

  • Questions that need several plots to answer, or that generate additional sub-questions that can be answered with more plots or more data wrangling.

Equitable labour distribution

  • All students in the group should be putting in roughly the same amount of work on the project

Contract for an A project:

  • Use of functions in wrangling, processing, cleaning
  • Simple method chaining
  • At least four basic operations (e.g. rename, reorder, or drop columns, replace values within the dataframe)
  • At least two method chains
  • Deep analysis and insightful commentary
  • Effective use of GitHub issues
  • Sophisticated Tableau Dashboard
  • Visit instructor or Project TA student hours at least 2 (updated from 3) times for feedback (and implement it)
