ml-notes

General

Hyperparameter Optimization

Boosting

Feature selection/importance

  • For linear regression (i.e. after using the lm function in R), a good package that provides feature selection based on various criteria (e.g. p-value) is SignifReg.
  • Correlation can be used in 2 ways:
    • Unsupervised filter: compute the pairwise correlations between the features of the training dataset (the response is NOT used here) and remove one feature from every pair with an absolute correlation of 0.75 or higher
    • Supervised ranking: features that correlate highly with the response have higher importance or significance
    • Visualize correlations with: corrplot library (examples)
  • Use Lasso (L1) regularization or Elastic Net (L1 and L2) to remove unimportant features - larger absolute coefficient (on standardized features) => more importance (zeroed coefficients => good riddance!)
  • Use randomForest package
    • Visualize importance: varImpPlot()
    • Article: Tune number of Trees? - 500 is alright in general; tune the mtry parameter using the function tuneRF()!
    • If randomForest() is run with proximity=TRUE (keep N less than 10000, depending on your RAM as well), it generates an N x N proximity (similarity) matrix (N = number of rows/data points). This can be scaled down to 2D using MDSplot() (pass the same random forest object and the same response vector used in training), which internally uses stats::cmdscale() and lets you see the dataset (every point) in 2D (very slow).
  • For faster and multi-threaded Random Forests, use the ranger R package
    • No tuning offered for mtry (so do that with randomForest on random samples of your dataset), but everything else is better and faster!
  • Use the Boruta R package and plot the importance boxplot result!
    • Uses ranger under the hood currently, so multi-thread support for free
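
The unsupervised correlation filter described above can be sketched in base R. This is a minimal illustration on simulated data, not a definitive implementation: the data frame train_x, its columns x1 to x4 and the variable names are all hypothetical, and the rule used (drop the later feature of every highly correlated pair) is only one simple greedy choice.

```r
# Simulated training predictors (hypothetical data; no response is involved)
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly a copy of x1
x3 <- rnorm(n)
x4 <- rnorm(n)
train_x <- data.frame(x1, x2, x3, x4)

# Absolute pairwise correlations. Zero out the lower triangle and the
# diagonal so that each pair is inspected once and, within a pair,
# the feature that comes first in the data frame is the one retained.
cor_mat <- abs(cor(train_x))
cor_mat[lower.tri(cor_mat, diag = TRUE)] <- 0

# Drop every feature whose correlation with an earlier feature
# reaches the cutoff (0.75, as in the notes above)
cutoff     <- 0.75
to_drop    <- names(which(apply(cor_mat, 2, max) >= cutoff))
filtered_x <- train_x[, setdiff(names(train_x), to_drop), drop = FALSE]

to_drop
names(filtered_x)
```

The same filter has to be applied to the test set using the to_drop vector computed on the training set. The caret package offers findCorrelation() for the same purpose, with a different rule for deciding which feature of a pair to remove.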

Correlation

  • Article: Correlation Test Between Two Variables in R
  • Use ggscatter (ggpubr package): with cor.coef = TRUE it prints the correlation coefficient R and the p-value on the plot.
  • Use ComplexHeatmap to visualize a correlation matrix between two sets of variables of interest, e.g. mRNA and protein expressions (if the data dimensions are large)
  • On which correlation measure to use between a continuous and a categorical variable, I quote from this article: "The idea behind using logistic regression to understand correlation between cont and categorical variables is actually quite straightforward and follows as such: If there is a relationship between the categorical and continuous variable, we should be able to construct an accurate predictor of the categorical variable from the continuous variable. If the resulting classifier has a high degree of fit, is accurate, sensitive, and specific we can conclude the two variables share a relationship and are indeed correlated."
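
The idea in the quote can be tried out with stats::glm. The sketch below uses simulated data (the variables group and value are hypothetical) and judges the classifier on the same data it was fitted on, so the numbers are optimistic; a held-out set or cross-validation would be the more careful choice.

```r
# Simulated data: a two-level categorical variable and a continuous
# variable whose mean depends on the category (hypothetical example)
set.seed(1)
n     <- 300
group <- factor(rep(c("A", "B"), each = n / 2))
value <- rnorm(n, mean = ifelse(group == "B", 2, 0), sd = 1)

# Predict the categorical variable from the continuous one.
# With a factor response, glm() treats the first level ("A") as failure,
# so the fitted values are estimates of P(group == "B").
fit  <- glm(group ~ value, family = binomial)
prob <- fitted(fit)
pred <- factor(ifelse(prob > 0.5, "B", "A"), levels = levels(group))

# Confusion matrix and the measures mentioned in the quote
conf_mat    <- table(predicted = pred, actual = group)
accuracy    <- sum(diag(conf_mat)) / n
sensitivity <- conf_mat["B", "B"] / sum(conf_mat[, "B"])
specificity <- conf_mat["A", "A"] / sum(conf_mat[, "A"])

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 3)
```

If value carried no information about group, all three measures would hover around 0.5 for balanced classes; values well above that suggest the two variables are related.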

Logistic Regression in R

  • Application and Interpretation article
  • If the response categorical variable is binary, use stats::glm(family = binomial)
  • For more than two response classes, use Ordinal Logistic Regression only when the classes are ordered and the proportional odds assumption holds (the relationship between each pair of response groups is the same, i.e. one set of coefficients describes the step from any response level to the higher ones). The model does not assume equally spaced levels: e.g. in a survey where you answer with 3 choices, the distance between unlikely and somewhat likely may be shorter than the distance between somewhat likely and very likely. In R, use MASS::polr or rms::lrm. Otherwise, go with Multinomial Logistic Regression: nnet::multinom.
  • Measures of goodness of fit (fit statistics - related to model validation) for binary logistic regression: article IBM
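
As a rough illustration of the ordinal vs multinomial choice, the sketch below fits both models to the housing dataset that ships with MASS (an ordered satisfaction response stored as counts). The likelihood-ratio comparison at the end is only an informal, approximate check of the proportional odds assumption, since the two models are not strictly nested; the names fit_ord, fit_multi, lr_stat, lr_df and p_value are my own.

```r
library(MASS)   # polr() and the housing data
library(nnet)   # multinom()

# housing: Sat is an ordered factor (Low < Medium < High);
# Freq holds the number of respondents in each cell
fit_ord <- polr(Sat ~ Infl + Type + Cont, weights = Freq,
                data = housing, Hess = TRUE)

# Multinomial model: a separate set of coefficients per non-reference class
fit_multi <- multinom(Sat ~ Infl + Type + Cont, weights = Freq,
                      data = housing, trace = FALSE)

# Informal check: does the more flexible multinomial model fit much better?
lr_stat <- fit_ord$deviance - fit_multi$deviance
lr_df   <- fit_multi$edf - fit_ord$edf
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

c(AIC_ordinal = AIC(fit_ord), AIC_multinomial = AIC(fit_multi))
c(LR = lr_stat, df = lr_df, p = p_value)
```

A large p-value here means the extra coefficients of the multinomial model do not buy a clearly better fit, which is consistent with (but does not prove) proportional odds.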

Dimensionality Reduction

  • PCA (Principal Component Analysis): article
    • Use R packages FactoMineR (run PCA) and factoextra (for visualization)
  • MCA (Multiple Correspondence Analysis - PCA for categorical variables): article
  • UMAP: a non-linear dimensionality reduction method, check R package uwot
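
For a quick look at what PCA produces without installing anything, here is a minimal sketch with base R's stats::prcomp on the built-in iris data. Note that this swaps in prcomp for the FactoMineR/factoextra workflow recommended above; the variable names are my own.

```r
# Numeric columns only; PCA is not defined for the Species factor
num_cols <- iris[, 1:4]

# Center and scale, so that variables on larger scales do not dominate
pca <- prcomp(num_cols, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(cumsum(var_explained), 3)

# Coordinates of every observation on the first two components
scores <- pca$x[, 1:2]
head(scores)
```

The scores matrix is what a 2D scatter plot (e.g. plot(scores, col = iris$Species)) would show, and the cumulative variance helps decide how many components to keep.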

PR (Precision-Recall curves)

Hierarchical Clustering Analysis

Survival Analysis
