Project_3: MoneyBall

Hypothesis:

Use a machine learning model to predict the 2021 OPS (On-base Plus Slugging) for the entire roster of MLB players and select a full roster of players with the highest OPS for their salary.

OPS Background:

OPS (On-base Plus Slugging) is the sum of a player's on-base percentage and slugging percentage. These two metrics effectively ask "Does the player get on base?" and "How many bases does he gain when he hits?" In short, OPS captures how many run-scoring opportunities a player creates and how productive his hits are. The objective is to use offensive statistics to build a lineup with the highest chance of winning. Traditional scouting relies on more qualitative measures of performance such as strength, bat speed, arm extension and follow-through, and aggression. Using these two simple statistics does not require individual analysis of every player and, in theory, can produce a cohesive team by combining compatible players based on their historical performance.

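The OPS formula is below:

$$\mathrm{OPS} = \mathrm{OBP} + \mathrm{SLG} = \frac{H + BB + HBP}{AB + BB + SF + HBP} + \frac{1B + (2 \times 2B) + (3 \times 3B) + (4 \times HR)}{AB}$$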

H = Hits - when the batter hits the ball into fair territory and reaches base safely without an error or fielder's choice

BB = Walks - when a pitcher throws four pitches out of the strike zone, none of which are swung at by the batter

HBP = Hit by pitch - when a batter is struck by a pitched ball without swinging at it and is awarded first base.

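AB = At bat - when a batter reaches base via fielder's choice, hit, or error (not including catcher's interference), or is put out on a non-sacrifice play. Walks, hit-by-pitches, and sacrifices do not count as at bats

SF = Sacrifice fly - when a batter hits a fly ball to the outfield or foul territory that is caught for an out but allows a runner to score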

1B = Single - when batter hits the ball and reaches first base

2B = Double - when batter hits the ball and reaches second base

3B = Triple - when batter hits the ball and reaches third base

HR = Home run - when batter hits the ball and circles all four bases
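As an illustration of how the terms above combine, here is a minimal Python sketch of the OPS calculation. The function names and the sample stat line are made up for this example and are not taken from the project's code.

```python
def on_base_percentage(h, bb, hbp, ab, sf):
    """OBP = (H + BB + HBP) / (AB + BB + SF + HBP)."""
    denominator = ab + bb + sf + hbp
    return (h + bb + hbp) / denominator if denominator else 0.0


def slugging_percentage(singles, doubles, triples, hr, ab):
    """SLG = total bases / AB, where each hit is weighted by bases gained."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * hr
    return total_bases / ab if ab else 0.0


def ops(singles, doubles, triples, hr, bb, hbp, ab, sf):
    """OPS = OBP + SLG."""
    hits = singles + doubles + triples + hr  # H is the sum of all hit types
    obp = on_base_percentage(hits, bb, hbp, ab, sf)
    slg = slugging_percentage(singles, doubles, triples, hr, ab)
    return obp + slg


# Made-up season line, for illustration only
example_ops = ops(singles=100, doubles=30, triples=5, hr=25,
                  bb=60, hbp=5, ab=500, sf=5)
print(round(example_ops, 3))
```

A player with no plate appearances returns 0.0 here instead of raising a division error, which is a choice made for this sketch.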

Model:

Long short-term memory (LSTM) recurrent neural network (RNN)

Model loss functions: We initially tested binary cross-entropy, which heavily penalizes predictions that are very confident and very wrong.

We also tested RMSE (http://cs229.stanford.edu/proj2019aut/data/assignment_308875_raw/26382632.pdf), since it is better suited to regression than to classification. We stuck with cross-entropy because switching to RMSE did not change the results significantly.
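To make the difference between the two losses concrete, the sketch below compares them on a single target. It is plain Python, not the project's training code, and it assumes targets and predictions are scaled into the range 0 to 1, which binary cross-entropy requires. How OPS was scaled in the project is not stated.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Illustrative values only (assumed to be scaled into [0, 1])
target = [1.0]
mild_miss = [0.4]        # wrong, but not confident
confident_miss = [0.01]  # very confident and very wrong

bce_ratio = binary_cross_entropy(target, confident_miss) / binary_cross_entropy(target, mild_miss)
rmse_ratio = rmse(target, confident_miss) / rmse(target, mild_miss)
print(bce_ratio, rmse_ratio)
```

The cross-entropy penalty grows several times faster than the RMSE penalty as the prediction moves toward a confident wrong answer.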

Optimizer: We used Adam, since it seems to be the most popular choice for LSTM RNN models and to outperform its close counterparts, RMSProp and AdaDelta.

Epochs: We tested with 50 and 100 epochs. For reference, http://cs229.stanford.edu/proj2019aut/data/assignment_308875_raw/26382632.pdf used 300 epochs for ~200 players.

Did more epochs do better? Not necessarily. More epochs trended toward more pessimistic OPS predictions.

Data and Parameters:

Data:

Used MLB historical game/player data from 2010 through 2020

2021 player salaries

Dropped players who had an OPS of 0 in a year they played
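The data preparation above might be sketched as follows. This is an assumption-laden example: it treats the filter as dropping individual player-seasons with an OPS of 0, and it groups the remaining seasons into one chronological sequence per player, which is a common input shape for an LSTM. The `(player_id, year, ops)` row layout and the function name are hypothetical.

```python
from collections import defaultdict


def build_player_sequences(rows):
    """Group (player_id, year, ops) rows into per-player chronological lists.

    Seasons with an OPS of 0 are skipped (assumed reading of the filter).
    """
    sequences = defaultdict(list)
    for player_id, year, ops in sorted(rows, key=lambda r: (r[0], r[1])):
        if ops == 0:
            continue  # drop zero-OPS seasons
        sequences[player_id].append((year, ops))
    return dict(sequences)


# Made-up rows, for illustration only
rows = [
    ("a", 2019, 0.800),
    ("a", 2018, 0.750),
    ("a", 2020, 0.0),
    ("b", 2020, 0.0),
    ("c", 2020, 0.690),
]
sequences = build_player_sequences(rows)
print(sequences)
```

A player whose only seasons have an OPS of 0 does not appear in the result at all.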

Model Performance Evaluation:

Model predictions were slightly conservative due to the unpredictability of OPS

Results for 50 epochs:

1 (pitcher) = P (Starting: Shohei Ohtani, Relief/Close: Will Smith)

2 (catcher) = C PJ Higgins

3 (first baseman) = 1B Ronald Guzman

4 (second baseman) = 2B Max Moroff

5 (third baseman) = 3B Isaac Paredes

6 (shortstop) = SS Gavin Lux

7 (left fielder) = LF Jake Cave

8 (center fielder) = CF Roman Quinn

9 (right fielder) = RF Kyle Garlick

[Figure: results for 50 epochs]

Results for 100 epochs:

1 (pitcher) = P (Starting: Shohei Ohtani, Relief/Close: Will Smith)

2 (catcher) = C PJ Higgins

3 (first baseman) = 1B Ronald Guzman

4 (second baseman) = 2B Max Moroff

5 (third baseman) = 3B Alex Blandino

6 (shortstop) = SS Gavin Lux

7 (left fielder) = LF Jake Cave

8 (center fielder) = CF Roman Quinn

9 (right fielder) = RF Jaylin Davis

[Figure: results for 100 epochs]
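The rosters above assign one player to each position. A minimal sketch of one way to make such a selection is shown here, assuming "highest OPS for their salary" means the highest ratio of predicted OPS to salary within each position. The exact selection rule the project used is not stated, and the players, salaries (in millions), and field names below are made up.

```python
def pick_roster(players):
    """Return {position: name} with the best predicted-OPS-per-salary player."""
    best = {}
    for player in players:
        if player["salary"] <= 0:
            continue  # skip rows with missing or invalid salary
        value = player["predicted_ops"] / player["salary"]
        current = best.get(player["position"])
        if current is None or value > current[0]:
            best[player["position"]] = (value, player["name"])
    return {position: name for position, (_, name) in best.items()}


# Hypothetical players, for illustration only
players = [
    {"name": "Player A", "position": "1B", "predicted_ops": 0.900, "salary": 20.0},
    {"name": "Player B", "position": "1B", "predicted_ops": 0.780, "salary": 0.6},
    {"name": "Player C", "position": "SS", "predicted_ops": 0.750, "salary": 0.7},
    {"name": "Player D", "position": "SS", "predicted_ops": 0.820, "salary": 5.0},
]
roster = pick_roster(players)
print(roster)
```

In this example the lower-paid player wins at both positions even though the higher-paid player has the better predicted OPS, because the ratio rewards low salaries.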

Correlation:

In addition to our models and team roster selections, we also explored the correlations between team wins, losses, runs scored, runs allowed, and run differential.

[Figure: teams]

Key for heatmap:

W = Wins

L = Losses

RS = Runs Scored

RA = Runs Allowed (every run scored against the defensive team)

Dff = Run differential (RS - RA)

[Figure: correlation heatmap of W, L, RS, RA, and Dff]

Findings:

We found a strong positive correlation between team wins and run differential. Run differential is used to predict a team's expected win total. It is positive if a team scores more runs than it allows and negative if it allows more runs than it scores. Generally, the higher the run differential, the better. The results below support this, since teams with a high (positive) run differential trend toward more wins.

[Figure: scatter plot]
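The relationship between run differential and wins can be checked with a Pearson correlation coefficient. The sketch below computes it in plain Python on made-up team seasons. The numbers are illustrative only and are not the project's data or results.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


# Made-up (RS, RA, W) team seasons, for illustration only
teams = [(800, 650, 98), (750, 700, 88), (700, 700, 81), (650, 720, 74), (600, 780, 63)]
run_diff = [rs - ra for rs, ra, _ in teams]  # Dff = RS - RA
wins = [w for _, _, w in teams]

r = pearson(run_diff, wins)
print(r)
```

A value of `r` close to 1 indicates that wins rise almost linearly with run differential in the sample.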

Contributors:

somyapanda, andyliang245, priscilla-wong, jfkraemer28, kwarari, brennanseanm
