DrivenData's Predict Blood Donations

Feature engineering did not improve scores in most cases. Features were scaled for the algorithms that require it. Hyper-parameters were tuned with GridSearchCV, an exhaustive grid search scored by stratified 10-fold cross-validation.
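As a minimal sketch of that tuning setup, the snippet below runs GridSearchCV with stratified 10-fold cross-validation over a scaled logistic regression. The synthetic data, the parameter grid, and the neg_log_loss scoring are assumptions made for illustration; the grids and scorer actually used are not stated here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the contest data (hypothetical; the real features differ).
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Scaling lives inside the pipeline so it is refit on every training fold.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Hypothetical grid: the values actually searched are not given in the text.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

# Stratified 10-fold CV keeps the class balance roughly equal in each fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Assumed scorer: negative log loss (scikit-learn maximises, so the sign is flipped).
search = GridSearchCV(pipeline, param_grid, scoring="neg_log_loss", cv=cv)
search.fit(X, y)

best_params = search.best_params_
best_log_loss = -search.best_score_
print(best_params, round(best_log_loss, 4))
```

Keeping the scaler inside the pipeline means the held-out fold never influences the scaling statistics, which keeps the cross-validated score honest.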

leaderboard_score is the contest score for predictions on the held-out test set; lower is better. Camel-case model names refer to scikit-learn estimators (or, for XGBClassifier, the scikit-learn-compatible XGBoost wrapper); lower-case names were hand-crafted in some way.

model                         leaderboard_score
bagged_nolearn                0.4313
ensemble of averages          0.4370
voting ensemble               0.4396
LogisticRegression            0.4411
bagged_logit                  0.4442
GradientBoostingClassifier    0.4452
LogisticRegressionCV          0.4457
bagged_scikit_nn              0.4465
bagged_gbc                    0.4527
nolearn                       0.4566
ExtraTreesClassifier          0.4729
blending ensemble             0.4834
XGBClassifier                 0.4851
BaggingClassifier             0.4885
scikit_nn                     0.5020
boosted_svc                   0.5334
SVC                           0.5336
SGDClassifier                 0.5670
cosine_similarity             0.5732
boosted_logit                 0.5891
KMeans                        0.6289
AdaBoostClassifier            0.6642
KNeighborsClassifier          1.1870
RandomForestClassifier        1.7907

Plain logistic regression did quite well, and it seems odd that both bagging and boosting made it worse (0.4411 versus 0.4442 and 0.5891). In general, though, ensembling improved performance.
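The names bagged_logit and "ensemble of averages" suggest techniques along the following lines. This is only a sketch under stated assumptions: the data is synthetic, the choice of member models and the number of bagged estimators are hypothetical, and the author's exact construction is not described here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the contest data (hypothetical).
X, y = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

models = {
    "logit": LogisticRegression(max_iter=1000),
    # Bagging: fit copies of the base model on bootstrap resamples and
    # average their predicted probabilities.
    "bagged_logit": BaggingClassifier(
        LogisticRegression(max_iter=1000), n_estimators=25, random_state=0
    ),
    "gbc": GradientBoostingClassifier(random_state=0),
}

# Probability of the positive class from each fitted model.
probs = {
    name: model.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    for name, model in models.items()
}

# Assumed form of an averaging ensemble: the unweighted mean of the
# members' predicted probabilities.
average = np.mean(list(probs.values()), axis=0)
probs["average"] = average

scores = {name: log_loss(y_test, p) for name, p in probs.items()}
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name:<14} {score:.4f}")
```

Whether bagging or averaging helps depends on the data and the base model, which is consistent with the mixed results in the table.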


A number of statistics were recorded for each model from 10-fold CV predictions of the training data:

  • accuracy: the proportion of cases correctly predicted

  • logloss: the value of sklearn.metrics.log_loss

  • AUC: the area under the ROC curve

  • f1: the harmonic mean of precision and recall

  • mu: the mean of 100 cross-validated scores with permutations

  • std: the standard deviation of 100 cross-validated scores with permutations
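A rough sketch of how such statistics could be computed from out-of-fold predictions follows. The data is synthetic, and the reading of mu and std is an assumption: here they are taken from scikit-learn's permutation_test_score with 100 permutations and accuracy scoring, which may not be what the author actually did.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, log_loss, roc_auc_score
from sklearn.model_selection import (
    StratifiedKFold,
    cross_val_predict,
    permutation_test_score,
)

# Synthetic stand-in for the training data (hypothetical).
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
model = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Out-of-fold probabilities: each row is predicted by a model that never saw it.
proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
pred = (proba >= 0.5).astype(int)

stats = {
    "accuracy": accuracy_score(y, pred),
    "logloss": log_loss(y, proba),
    "AUC": roc_auc_score(y, proba),
    "f1": f1_score(y, pred),
}

# ASSUMPTION: mu and std are the mean and standard deviation of the
# cross-validated scores obtained on 100 label permutations.
_, permutation_scores, _ = permutation_test_score(
    model, X, y, cv=cv, n_permutations=100, scoring="accuracy", random_state=0
)
stats["mu"] = permutation_scores.mean()
stats["std"] = permutation_scores.std()

for name, value in stats.items():
    print(f"{name:<9} {value:.4f}")
```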

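Starting with all of these statistics as candidate predictors of leaderboard_score, R's step function (stepwise model selection) retained only mu and std and produced the following: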

Call:
lm(formula = leaderboard_score ~ mu + std, data = score_data,
    na.action = na.omit)

Residuals:
     Min       1Q   Median       3Q      Max
-0.18728 -0.05472 -0.03539  0.02082  0.42898

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   25.722      2.962   8.685 3.09e-07 ***
mu           -33.089      3.897  -8.490 4.11e-07 ***
std          -60.589      7.857  -7.711 1.35e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1499 on 15 degrees of freedom
  (8 observations deleted due to missingness)
Multiple R-squared:  0.8311,	Adjusted R-squared:  0.8086
F-statistic: 36.91 on 2 and 15 DF,  p-value: 1.61e-06

Possibly std serves as a proxy for variance in the statistical-learning (bias-variance) sense.
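To make the fitted model concrete, the sketch below plugs values into the regression equation from the lm() summary, leaderboard_score ≈ 25.722 - 33.089 * mu - 60.589 * std. The coefficients are copied from that output; the mu and std inputs are hypothetical values chosen only to show the arithmetic.

```python
# Coefficients copied from the lm() summary in the text.
INTERCEPT = 25.722
COEF_MU = -33.089
COEF_STD = -60.589


def predicted_leaderboard_score(mu: float, std: float) -> float:
    """Evaluate the fitted linear model for one (mu, std) pair."""
    return INTERCEPT + COEF_MU * mu + COEF_STD * std


# Hypothetical inputs, not taken from the author's score_data.
base = predicted_leaderboard_score(mu=0.75, std=0.010)
higher_mu = predicted_leaderboard_score(mu=0.76, std=0.010)
higher_std = predicted_leaderboard_score(mu=0.75, std=0.011)

print(f"base        {base:.5f}")
print(f"mu + 0.01   {higher_mu:.5f}  (change {higher_mu - base:+.5f})")
print(f"std + 0.001 {higher_std:.5f}  (change {higher_std - base:+.5f})")
```

Both coefficients are negative, so within this fit a higher mu or a higher std goes with a lower (better) predicted leaderboard score. With a residual standard error of 0.1499 on 15 degrees of freedom, such predictions are rough at best.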


The work is available on GitHub and BitBucket; only GitHub renders IPython notebooks for viewing.

The dataset is derived from the Blood Transfusion Service Center Data Set.

Yeh, I-Cheng, Yang, King-Jang, and Ting, Tao-Ming, "Knowledge discovery on RFM model using Bernoulli sequence", Expert Systems with Applications, 2008.

T. Santhanam and Shyam Sundaram, "Application of CART Algorithm in Blood Donors Classification", Journal of Computer Science 6 (5): 548-552, 2010.
