This notebook/repo presents the results obtained during the MP data technical test, in which the goal is to:
- Implement the best possible model, judged on recall, to predict an NBA player's chances of staying more than 5 years in the NBA from tabular historical data
- Design and implement a REST API to query this model via a URL request.
We separate the preliminary data exploration in a few phases:
- Basic exploration
- Need for feature engineering exploration
- Missing data replacement / outliers
- What kind of task?
import pandas as pd
import matplotlib.pyplot as plt
import sklearn as sk
import numpy as np
import seaborn as sns
df = pd.read_csv('data/nba_logreg.csv')
len(df)
1340
df.head()
| | Name | GP | MIN | PTS | FGM | FGA | FG% | 3P Made | 3PA | 3P% | ... | FTA | FT% | OREB | DREB | REB | AST | STL | BLK | TOV | TARGET_5Yrs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Brandon Ingram | 36 | 27.4 | 7.4 | 2.6 | 7.6 | 34.7 | 0.5 | 2.1 | 25.0 | ... | 2.3 | 69.9 | 0.7 | 3.4 | 4.1 | 1.9 | 0.4 | 0.4 | 1.3 | 0.0 |
1 | Andrew Harrison | 35 | 26.9 | 7.2 | 2.0 | 6.7 | 29.6 | 0.7 | 2.8 | 23.5 | ... | 3.4 | 76.5 | 0.5 | 2.0 | 2.4 | 3.7 | 1.1 | 0.5 | 1.6 | 0.0 |
2 | JaKarr Sampson | 74 | 15.3 | 5.2 | 2.0 | 4.7 | 42.2 | 0.4 | 1.7 | 24.4 | ... | 1.3 | 67.0 | 0.5 | 1.7 | 2.2 | 1.0 | 0.5 | 0.3 | 1.0 | 0.0 |
3 | Malik Sealy | 58 | 11.6 | 5.7 | 2.3 | 5.5 | 42.6 | 0.1 | 0.5 | 22.6 | ... | 1.3 | 68.9 | 1.0 | 0.9 | 1.9 | 0.8 | 0.6 | 0.1 | 1.0 | 1.0 |
4 | Matt Geiger | 48 | 11.5 | 4.5 | 1.6 | 3.0 | 52.4 | 0.0 | 0.1 | 0.0 | ... | 1.9 | 67.4 | 1.0 | 1.5 | 2.5 | 0.3 | 0.3 | 0.4 | 0.8 | 1.0 |
5 rows × 21 columns
df.iloc[0].transpose()
Name Brandon Ingram
GP 36
MIN 27.4
PTS 7.4
FGM 2.6
FGA 7.6
FG% 34.7
3P Made 0.5
3PA 2.1
3P% 25.0
FTM 1.6
FTA 2.3
FT% 69.9
OREB 0.7
DREB 3.4
REB 4.1
AST 1.9
STL 0.4
BLK 0.4
TOV 1.3
TARGET_5Yrs 0.0
Name: 0, dtype: object
df.columns
Index(['Name', 'GP', 'MIN', 'PTS', 'FGM', 'FGA', 'FG%', '3P Made', '3PA',
'3P%', 'FTM', 'FTA', 'FT%', 'OREB', 'DREB', 'REB', 'AST', 'STL', 'BLK',
'TOV', 'TARGET_5Yrs'],
dtype='object')
Looking at the values in the table above, all data is already properly formatted and clean. Some of it looks redundant: the % fields are essentially the relevant "made" field divided by the matching "attempted" field, times 100 (e.g. FG% = FGM / FGA × 100). Feature engineering would be a good way to address this.
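As a rough check of that redundancy, the sketch below recomputes FG% from FGM and FGA for the five players shown in the `df.head()` output. The values are copied from the table above; the helper name `recomputed_pct` is only illustrative. Since the columns are displayed to one decimal place, a small gap between the reported and recomputed values is to be expected.

```python
# Rows copied from the df.head() output above: (name, FGM, FGA, FG%)
head_rows = [
    ("Brandon Ingram", 2.6, 7.6, 34.7),
    ("Andrew Harrison", 2.0, 6.7, 29.6),
    ("JaKarr Sampson", 2.0, 4.7, 42.2),
    ("Malik Sealy", 2.3, 5.5, 42.6),
    ("Matt Geiger", 1.6, 3.0, 52.4),
]


def recomputed_pct(made, attempted):
    """Return made / attempted as a percentage, or 0.0 when nothing was attempted."""
    return 100.0 * made / attempted if attempted else 0.0


gaps = []
for name, fgm, fga, fg_pct in head_rows:
    estimate = recomputed_pct(fgm, fga)
    gaps.append(abs(estimate - fg_pct))
    print(f"{name:16s} reported={fg_pct:5.1f} recomputed={estimate:5.1f}")

max_gap = max(gaps)
print(f"largest gap: {max_gap:.2f} percentage points")
```

On these five rows the recomputed value stays within one percentage point of the reported one, which supports treating the % columns as largely derivable from the others.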
A potential step to reduce the input size of the model would be to run a PCA and see how few components the dataset can be brought down to. However, this is usually done after a first round of fitting and model selection, so that there is a baseline to compare against, which is why it does not fall within the scope of this technical test.
sns.pairplot(data=df,hue = "TARGET_5Yrs", kind="kde")
<seaborn.axisgrid.PairGrid at 0x7f3be55dfbe0>
A quick glance shows that there don't seem to be many outliers. Orange is the positive ground truth: for most metrics, players who make it past the 5-year mark have higher stats than other players, which is to be expected. The exception is 3P%, where the distribution looks mostly the same for both kinds of players and is centered around two peaks. This makes sense if you know basketball: there are shooting specialists and non-shooting specialists.
fig, ax = plt.subplots(figsize=(5, 5))
# Share of each class; explicit column names keep this working across pandas versions
target_share = df['TARGET_5Yrs'].value_counts(normalize=True).rename_axis('target').reset_index(name='share')
sns.barplot(data=target_share, x='target', y='share', palette='spring', edgecolor='k', linewidth=1, alpha=.5)
plt.grid(True)
plt.title('Distribution of target')
plt.xlabel('Ground truth')
plt.ylabel('Share of players')
plt.show()
The approximate balance is 40-60 in favor of players who stay in the league. This is not an unreasonable imbalance, so we decide not to oversample or undersample the dataset.
Note that the target is binary: we are trying to predict whether or not a player will stay. Our task is therefore a binary classification task.
df.isna().sum()
Name 0
GP 0
MIN 0
PTS 0
FGM 0
FGA 0
FG% 0
3P Made 0
3PA 0
3P% 11
FTM 0
FTA 0
FT% 0
OREB 0
DREB 0
REB 0
AST 0
STL 0
BLK 0
TOV 0
TARGET_5Yrs 0
dtype: int64
Only one field has missing values: 3P%, with 11 of them. These occur only when 3PA is 0 (the percentage is undefined when no three-pointer was attempted), so we decide to replace them with 0, as shown in the code walkthrough.
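A minimal sketch of that replacement, on a made-up toy matrix rather than the real dataset: a boolean mask from `np.isnan` selects exactly the missing cells, so only those are overwritten. The column layout `[3P Made, 3PA, 3P%]` is an assumption chosen for the illustration.

```python
import numpy as np

# Toy feature matrix with columns [3P Made, 3PA, 3P%]; the values are made up
# for illustration and are not taken from the dataset.
X = np.array([
    [0.5, 2.1, 25.0],
    [0.0, 0.0, np.nan],   # no attempts -> percentage undefined
    [0.7, 2.8, 23.5],
    [0.0, 0.0, np.nan],
])

nan_mask = np.isnan(X)                  # boolean mask, same shape as X
rows_with_nan = nan_mask.any(axis=1)    # rows that contain at least one NaN

# Sanity check: every row with a missing 3P% has zero attempts (column 1).
assert np.all(X[rows_with_nan, 1] == 0.0)

X_clean = X.copy()
X_clean[nan_mask] = 0.0                 # overwrites only the NaN cells

print(X_clean)
```

`np.nan_to_num(X, nan=0.0)` would give the same result in one call; the mask version is shown because it makes the "only where 3PA is 0" check explicit.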
In this section we walk through the code in test.py to explain the different steps and to select the best model based on recall. Recall measures, among all the positive elements (i.e. the players who stay in the league more than 5 years), how many are retrieved. It tells us how many of the relevant elements are spotted: a hit rate, if you will.
Precision, on the other hand, measures how many of the retrieved items are relevant. It tells us how often the model is right when it predicts that a player will stay.
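To make the two definitions concrete, here is a small sketch that derives both metrics (and accuracy) from a confusion matrix. The counts are the logistic regression test-set results reported later in this notebook, in scikit-learn's `[[TN, FP], [FN, TP]]` layout.

```python
# Confusion matrix in scikit-learn's layout: rows are true classes,
# columns are predicted classes, i.e. [[TN, FP], [FN, TP]].
# These counts are the logistic regression test-set results reported later on.
confusion = [[32, 22],
             [13, 67]]

(tn, fp), (fn, tp) = confusion

recall = tp / (tp + fn)        # share of players who stayed that the model found
precision = tp / (tp + fp)     # share of predicted "stays" that really stayed
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"recall={recall:.4f} precision={precision:.4f} accuracy={accuracy:.4f}")
```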
From a business standpoint, missing a promising recruit costs stakeholders more than hiring a recruit who later underperforms: it is much easier to let a player go than to acquire a very good player from another team or bring one back into basketball. Recall is therefore our metric of choice for model selection. Let's import our model fitter class first:
from src.test import NBAevaluator
evaluator = NBAevaluator()
The first step is to load and clean the dataset (i.e. replace NaN values with 0). Note that scaling the whole dataset is not good practice, as it leaks information from the test set into the training set, so we do that later on. The code for this is shown here:
def load_and_clean(self):
    # Load dataset
    df = self.load()
    # Extract player names, labels, feature names and feature values
    names = df['Name'].values.tolist()  # players' names
    y = df['TARGET_5Yrs'].values  # labels
    features = df.drop(['TARGET_5Yrs', 'Name'], axis=1)
    paramset = features.columns.values  # feature names
    X = features.values
    # Replace NaN values (only present when a player attempted no 3-pointers).
    # The boolean mask guarantees that only the NaN cells are replaced.
    X = np.where(np.isnan(X), 0.0, X)
    return names, X, y
names, X, y = evaluator.load_and_clean()
Next, we split the data into a training set and a test set. Given the size of the dataset, we consider a 90-10 split appropriate for the task at hand:
def split_train_test(self, X, y):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.1, random_state=42)
    return X_train, X_test, y_train, y_test
evaluator.X_train, evaluator.X_test, evaluator.y_train, evaluator.y_test = evaluator.split_train_test(X, y)
We then fit the scaler on the TRAINING set ONLY and apply it to the test set, which gives scaled training and test sets without the data leakage that scaling them together would cause:
def scale_train_test(self, X_train, X_test):
    MMS = MinMaxScaler()
    X_train = MMS.fit_transform(X_train)
    self.scaler = MMS
    X_test = self.scaler.transform(X_test)
    return X_train, X_test
evaluator.X_train, evaluator.X_test = evaluator.scale_train_test(evaluator.X_train, evaluator.X_test)
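The effect of fitting the scaler on the training data only can be seen on a tiny example. The numbers below are made up for illustration; only `MinMaxScaler` and its `fit_transform` / `transform` methods come from scikit-learn.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature data (made up for illustration, not from the dataset).
X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[5.0], [40.0]])

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns min=10 and max=30 from train only
X_test_scaled = scaler.transform(X_test)        # reuses the training min and max

print(X_train_scaled.ravel())
print(X_test_scaled.ravel())
```

The training values land in [0, 1], while the test values come out as -0.25 and 1.5. Falling outside [0, 1] is expected here: it is the consequence of not letting the test set influence the scaling.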
Then, a list of classifiers is imported from config (classic classifiers for a binary classification task):
- KNN
- LogReg
- XGBoost
- Random Forest
- Support Vector Machine (two kernels)
- Multi-layer perceptron (MLPC in the outputs below)
We then run a grid search for each of these models on the training set to determine the optimal hyperparameters. The grid searches are optimized for accuracy, because optimizing them for recall led to degenerate models that always classify the player as a 1; the best model is nevertheless chosen based on recall. Each model is evaluated with 10-fold cross-validation, and we report its average performance over the validation folds:
def score_classifier(self, dataset, classifier, classifier_name, labels, gridsearch=None):
    """
    runs a 10-fold cross-validation to build a confusion matrix and returns it with the average precision, recall and accuracy scores
    :param dataset: the dataset to work on
    :param classifier: the classifier to use
    :param classifier_name: the name of the classifier
    :param labels: the labels used for training and validation
    :param gridsearch: optional parameter grid passed to GridSearchCV
    :return: dict with the summed confusion matrix, the averaged scores and the model fitted on the last fold
    """
    if gridsearch is not None:
        # We optimize for accuracy for a reason: optimizing for recall yields
        # best classifiers of recall 1 which always classify as a 1 (of course),
        # overfitting on the 1 class. So, we optimize for accuracy and choose the
        # best classifier for recall later on
        gs = GridSearchCV(estimator=classifier,
                          param_grid=gridsearch, scoring="accuracy")
    else:
        gs = classifier
    n_splits = 10
    kf = KFold(n_splits=n_splits, random_state=50, shuffle=True)
    confusion_mat = np.zeros((2, 2))
    recall, precision, accuracy = 0, 0, 0
    for training_ids, test_ids in kf.split(dataset):
        training_set = dataset[training_ids]
        training_labels = labels[training_ids]
        val_set = dataset[test_ids]
        val_labels = labels[test_ids]
        gs.fit(training_set, training_labels)
        if gridsearch is not None:
            classifier = gs.best_estimator_
        else:
            classifier = gs
        predicted_labels = classifier.predict(val_set)
        confusion_mat += confusion_matrix(val_labels, predicted_labels)
        recall += recall_score(val_labels, predicted_labels)
        precision += precision_score(val_labels, predicted_labels)
        accuracy += accuracy_score(val_labels, predicted_labels)
    # Average the per-fold scores
    recall /= n_splits
    precision /= n_splits
    accuracy /= n_splits
    return {
        'confusion_matrix': confusion_mat,
        'recall': recall,
        'precision': precision,
        'accuracy': accuracy,
        'model': classifier}
Note that, ideally, parameters such as the number of folds would be moved to a config file so that nothing is hard-coded in the base code.
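The comment inside `score_classifier` about tuning on accuracy rather than recall can be illustrated with a short sketch. It scores a degenerate "model" that always predicts 1, using the class counts of the test set (54 negatives and 80 positives, as read from the confusion matrices reported below). This is only an illustration of the failure mode, not the grid search itself.

```python
# Class counts taken from the test-set confusion matrices reported below:
# 54 players who did not stay (0) and 80 who did (1).
y_true = [0] * 54 + [1] * 80
y_pred = [1] * len(y_true)   # degenerate model: always predicts "stays"

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)

recall = tp / (tp + fn)
precision = tp / (tp + fp)
accuracy = (tp + tn) / len(y_true)

print(f"recall={recall:.3f} precision={precision:.3f} accuracy={accuracy:.3f}")
```

Recall is a perfect 1.0 even though the model has learned nothing, while precision and accuracy both drop to about 0.597, the share of positives. A search that maximizes recall alone is therefore free to converge on this kind of model.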
train_records = evaluator.fit_classifiers_(
evaluator.X_train, evaluator.y_train, gs=True)
evaluator.train_records = train_records
fitting: KNNC
fitting: SVC
fitting: SVCGamma
fitting: RFC
fitting: MLPC
We then score each classifier on our held-out test set to determine the best classifier for recall, which we save in the back-end folder of our API for inference. This is done in the following function:
def score_classifier_on_test_set(self, test_set, test_labels, classifier, classifier_name):
    predicted_labels = classifier.predict(test_set)
    confusion_mat = confusion_matrix(test_labels, predicted_labels)
    recall = recall_score(test_labels, predicted_labels)
    precision = precision_score(test_labels, predicted_labels)
    accuracy = accuracy_score(test_labels, predicted_labels)
    print(classifier_name + ':')
    print('confusion matrix: \n', confusion_mat)
    print(
        f'recall : {recall} - precision : {precision} - accuracy : {accuracy}')
    return {"confusion_matrix": confusion_mat,
            "precision": precision,
            "recall": recall,
            "accuracy": accuracy}
test_records = {}
for record in train_records.keys():
    test_records[record] = evaluator.score_classifier_on_test_set(
        evaluator.X_test, evaluator.y_test, train_records[record]['model'], record)
KNNC:
confusion matrix:
[[38 16]
[23 57]]
recall : 0.7125 - precision : 0.7808219178082192 - accuracy : 0.7089552238805971
SVC:
confusion matrix:
[[29 25]
[14 66]]
recall : 0.825 - precision : 0.7252747252747253 - accuracy : 0.7089552238805971
SVCGamma:
confusion matrix:
[[31 23]
[15 65]]
recall : 0.8125 - precision : 0.7386363636363636 - accuracy : 0.7164179104477612
RFC:
confusion matrix:
[[29 25]
[16 64]]
recall : 0.8 - precision : 0.7191011235955056 - accuracy : 0.6940298507462687
MLPC:
confusion matrix:
[[36 18]
[17 63]]
recall : 0.7875 - precision : 0.7777777777777778 - accuracy : 0.7388059701492538
XGBC:
confusion matrix:
[[31 23]
[16 64]]
recall : 0.8 - precision : 0.735632183908046 - accuracy : 0.7089552238805971
logreg:
confusion matrix:
[[32 22]
[13 67]]
recall : 0.8375 - precision : 0.7528089887640449 - accuracy : 0.7388059701492538
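The scores printed above can also be ranked programmatically. The sketch below copies them (rounded to four decimals) into a plain dictionary and sorts by recall, using accuracy as a tie-breaker; the tie-breaking rule is an assumption made for this illustration.

```python
# Test-set scores copied from the output above (rounded to 4 decimals).
test_scores = {
    "KNNC":     {"recall": 0.7125, "precision": 0.7808, "accuracy": 0.7090},
    "SVC":      {"recall": 0.8250, "precision": 0.7253, "accuracy": 0.7090},
    "SVCGamma": {"recall": 0.8125, "precision": 0.7386, "accuracy": 0.7164},
    "RFC":      {"recall": 0.8000, "precision": 0.7191, "accuracy": 0.6940},
    "MLPC":     {"recall": 0.7875, "precision": 0.7778, "accuracy": 0.7388},
    "XGBC":     {"recall": 0.8000, "precision": 0.7356, "accuracy": 0.7090},
    "logreg":   {"recall": 0.8375, "precision": 0.7528, "accuracy": 0.7388},
}

# Sort by recall first, then by accuracy to break ties (assumed rule).
ranking = sorted(
    test_scores,
    key=lambda name: (test_scores[name]["recall"], test_scores[name]["accuracy"]),
    reverse=True,
)
best_model_name = ranking[0]

for name in ranking:
    scores = test_scores[name]
    print(f"{name:9s} recall={scores['recall']:.4f} accuracy={scores['accuracy']:.4f}")
```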
We choose the model with the best recall and a good precision-accuracy-recall balance: Logistic Regression, which has the highest recall (0.8375) and ties for the highest accuracy (0.7388) on the test set. We can save it using the following function from the evaluator class:
def select_save_best_model(self, model_name="logreg"):
    print("performance of best selected model on test set: \n \n")
    self.score_classifier_on_test_set(
        self.X_test, self.y_test, self.train_records[model_name]['model'], model_name)
    pipeline = Pipeline(
        [('scaler', self.scaler), ('model', self.train_records[model_name]['model'])])
    joblib.dump(pipeline, 'nba_performance_prediction_back/pipelines/best_model.pkl')
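Bundling the fitted scaler and the fitted model into a single `Pipeline` means the API can feed raw, unscaled statistics to the saved object. The sketch below illustrates this round trip on synthetic stand-in data; the data, the temporary file location and the variable names are assumptions for the example, not the project's actual artifacts.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (19 features, like the dataset); not the real NBA data.
rng = np.random.default_rng(0)
X_raw = rng.uniform(0, 50, size=(200, 19))
y = (X_raw[:, 0] + X_raw[:, 1] > 50).astype(int)

# Fit the scaler and the model separately, as the evaluator does...
scaler = MinMaxScaler().fit(X_raw)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_raw), y)

# ...then bundle the already-fitted steps so that inference takes raw inputs.
pipeline = Pipeline([("scaler", scaler), ("model", model)])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "best_model.pkl")  # hypothetical location
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

raw_players = X_raw[:5]
same_predictions = np.array_equal(
    restored.predict(raw_players),
    model.predict(scaler.transform(raw_players)),
)
print(same_predictions)
```

The restored pipeline gives the same predictions on raw inputs as scaling by hand and then calling the model, so the back end does not need to know about the scaler at all.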
This is all condensed into the fitting_pipeline method of the evaluator class:
def fitting_pipeline(self, gs=False):
    print('loading and cleaning dataset')
    names, X, y = self.load_and_clean()
    # normalize dataset
    # NO SCALING ON ALL DATA => INFO LEAKAGE
    # X = MinMaxScaler().fit_transform(df_vals)
    print('splitting into test and train set')
    self.X_train, self.X_test, self.y_train, self.y_test = self.split_train_test(X, y)
    print("scaling train data and applying on test data")
    self.X_train, self.X_test = self.scale_train_test(self.X_train, self.X_test)
    print('fitting classifiers on train set')
    train_records = self.fit_classifiers_(
        self.X_train, self.y_train, gs=gs)
    self.train_records = train_records
    print('scoring best classifiers on test set')
    test_records = {}
    for record in train_records.keys():
        test_records[record] = self.score_classifier_on_test_set(
            self.X_test, self.y_test, train_records[record]['model'], record)
    return test_records
This can be run from the CLI using:
python src/main.py
We can also print the parameters of each optimal model, so that the grid search does not have to be run every time (gs = False).
for record in evaluator.train_records.keys():
    model = evaluator.train_records[record]['model']
    evaluator.score_classifier_on_test_set(
        evaluator.X_test, evaluator.y_test, evaluator.train_records[record]['model'], record)
    print(model.get_params())
Our goal is to design a REST API that takes unitary calls for the model to process. To do this, we use Django. All the source code can be found in the nba_performance_prediction_back folder. We also created a web app with React to showcase the results and linked it to the Django back end. To test the API, please install the required dependencies listed in requirements.txt in a virtualenv by running
pip install virtualenv
virtualenv nba
source nba/bin/activate
pip install -r nba_performance_prediction_back/requirements.txt
and run
python manage.py runserver
You can either test the API by sending a POST request to http://localhost:8000/scoreJson with the following body:
{ "GP": 10, "MIN": 5, "PTS": 10, "FGM": 5, "FGA": 2, "FG%": 65, "3PMade": 23, "3PA": 46, "3P%": 11, "FTM": 1, "FTA": 2, "FT%": 50, "OREB": 10, "DREB": 12, "REB": 15, "AST": 16, "STL": 12, "BLK": 4, "TOV": 1 }
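As a sketch of how such a call could be made from Python with only the standard library: the URL and the body are the ones given above, while the `Content-Type` header, the helper names `build_request` and `send`, and the assumption that the response is plain text or JSON are illustrative guesses, since the response format is not documented here.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/scoreJson"  # endpoint given above

# Request body copied from the example above.
payload = {
    "GP": 10, "MIN": 5, "PTS": 10, "FGM": 5, "FGA": 2, "FG%": 65,
    "3PMade": 23, "3PA": 46, "3P%": 11, "FTM": 1, "FTA": 2, "FT%": 50,
    "OREB": 10, "DREB": 12, "REB": 15, "AST": 16, "STL": 12, "BLK": 4, "TOV": 1,
}


def build_request(url, body):
    """Build (but do not send) a JSON POST request."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed to be expected by the API
        method="POST",
    )


def send(request, timeout=5.0):
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


request = build_request(API_URL, payload)
# With the Django server running, the call would be: print(send(request))
```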
Or you can run the front-end web app to get the full experience. To do this, install npm, React and Tailwind CSS, then try it out by running
npm install -g serve
serve -s build
From within nba_performance_prediction
I would have liked to dockerize the whole app, but unfortunately did not have the time to do so.
- prevent the user from entering negative values or impossible values (e.g. made > attempts)
- make app responsive
- automate training with clic
- feature engineering (PCA)
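The first item of this list could look roughly like the sketch below. It is not part of the current back end: `validate_stats` and the two constants are hypothetical names, and the key names follow the example request body above.

```python
# Pairs of (made, attempted) keys, named as in the example request body above.
MADE_ATTEMPTED_PAIRS = [("FGM", "FGA"), ("3PMade", "3PA"), ("FTM", "FTA")]
PERCENT_KEYS = ["FG%", "3P%", "FT%"]


def validate_stats(stats):
    """Return a list of readable problems; an empty list means the input looks sane."""
    problems = []
    # No statistic can be negative.
    for key, value in stats.items():
        if value < 0:
            problems.append(f"{key} must not be negative (got {value})")
    # A player cannot make more shots than were attempted.
    for made_key, attempted_key in MADE_ATTEMPTED_PAIRS:
        if made_key in stats and attempted_key in stats:
            if stats[made_key] > stats[attempted_key]:
                problems.append(f"{made_key} cannot exceed {attempted_key}")
    # Percentages must stay within 0-100.
    for key in PERCENT_KEYS:
        if key in stats and not 0 <= stats[key] <= 100:
            problems.append(f"{key} must be between 0 and 100")
    return problems


# Subset of the example request body above: FGM (5) is larger than FGA (2).
example = {"GP": 10, "FGM": 5, "FGA": 2, "FG%": 65,
           "3PMade": 23, "3PA": 46, "FTM": 1, "FTA": 2}
print(validate_stats(example))
```

Run on the example body, the check flags that FGM exceeds FGA, which is exactly the kind of input the API currently accepts without complaint.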