BigQueryML

Model selection guide

spatial queries
https://cloud.google.com/bigquery/docs/reference/standard-sql/geography_functions

date queries
https://cloud.google.com/bigquery/docs/reference/standard-sql/date_functions

hyperparameter tuning
https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-hyperparameter-tuning

model export
https://cloud.google.com/bigquery/docs/exporting-models#export_model_formats_and_samples

Linear Regression

code

{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL}
model_name
[OPTIONS(MODEL_TYPE = { 'LINEAR_REG' | 'LOGISTIC_REG' },
    INPUT_LABEL_COLS = string_array,
    OPTIMIZE_STRATEGY = { 'AUTO_STRATEGY' | 'BATCH_GRADIENT_DESCENT' | 'NORMAL_EQUATION' },
    L1_REG = float64_value,
    L2_REG = float64_value,
    MAX_ITERATIONS = int64_value,
    LEARN_RATE_STRATEGY = { 'LINE_SEARCH' | 'CONSTANT' },
    LEARN_RATE = float64_value,
    EARLY_STOP = { TRUE | FALSE },
    MIN_REL_PROGRESS = float64_value,
    DATA_SPLIT_METHOD = { 'AUTO_SPLIT' | 'RANDOM' | 'CUSTOM' | 'SEQ' | 'NO_SPLIT' },
    DATA_SPLIT_EVAL_FRACTION = float64_value,
    DATA_SPLIT_COL = string_value,
    LS_INIT_LEARN_RATE = float64_value,
    WARM_START = { TRUE | FALSE },
    AUTO_CLASS_WEIGHTS = { TRUE | FALSE },
    CLASS_WEIGHTS = struct_array,
    ENABLE_GLOBAL_EXPLAIN = { TRUE | FALSE },
    CALCULATE_P_VALUES = { TRUE | FALSE },
    FIT_INTERCEPT = { TRUE | FALSE },
    CATEGORY_ENCODING_METHOD = { 'ONE_HOT_ENCODING' | 'DUMMY_ENCODING' }
)];
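
The OPTIMIZE_STRATEGY option above chooses between an iterative solver ('BATCH_GRADIENT_DESCENT') and a closed-form one ('NORMAL_EQUATION'). As a rough illustration of the closed-form idea for a single feature, here is a minimal Python sketch. It is not BigQuery ML's implementation; the function name and the data are made up for the example.

```python
def fit_simple_linear(xs, ys):
    """Least-squares fit of y = slope * x + intercept using the closed-form solution."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data that lies exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_simple_linear(xs, ys)
print(slope, intercept)
```

No learning rate or iteration count is involved here, which is why options such as LEARN_RATE only matter for the gradient-descent strategy.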

Logistic Regression

K-means clustering

limitations
- applies only to numerical attributes
- performs poorly when clusters vary in size and density
- sensitive to outliers, so clip them before training
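
The limitations above mention clipping outliers. One simple way to do that, sketched here in Python under the assumption that clipping to low and high percentiles is acceptable for the data, is shown below. The function name, the percentile defaults, and the sample values are illustrative only.

```python
def clip_to_percentiles(values, lower_pct=5, upper_pct=95):
    """Clip values to the range spanned by two (simple, rank-based) percentiles."""
    ordered = sorted(values)
    n = len(ordered)
    # Pick the bounds by rank; this is a rough percentile, not an interpolated one.
    low = ordered[int((n - 1) * lower_pct / 100)]
    high = ordered[int((n - 1) * upper_pct / 100)]
    return [min(max(v, low), high) for v in values]


# Hypothetical feature column with one extreme value.
values = [10, 12, 11, 13, 12, 11, 10, 12, 13, 500]
clipped = clip_to_percentiles(values)
print(clipped)
```

After clipping, the extreme value no longer dominates the distance computation that k-means relies on.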

tips
- useful for anomaly detection
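
A common way to use k-means for anomaly detection is to flag points that lie far from every centroid. The Python sketch below assumes the centroids have already been trained; the centroid coordinates, the points, and the threshold are all hypothetical.

```python
import math


def nearest_centroid_distance(point, centroids):
    """Euclidean distance from a point to its closest centroid."""
    return min(math.dist(point, c) for c in centroids)


def flag_anomalies(points, centroids, threshold):
    """Return the points whose nearest centroid is farther away than the threshold."""
    return [p for p in points if nearest_centroid_distance(p, centroids) > threshold]


# Hypothetical centroids from a two-cluster model, and a few points to score.
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.5, -0.2), (9.8, 10.1), (5.0, 5.0), (0.1, 0.3)]
anomalies = flag_anomalies(points, centroids, threshold=2.0)
print(anomalies)
```

The threshold is a judgment call; in practice it is often chosen from the distribution of distances rather than fixed in advance.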

code

{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL}
model_name
[OPTIONS(MODEL_TYPE = { 'KMEANS' },
    NUM_CLUSTERS = int64_value,
    KMEANS_INIT_METHOD = { 'RANDOM' | 'KMEANS++' | 'CUSTOM' },
    KMEANS_INIT_COL = string_value,
    DISTANCE_TYPE = { 'EUCLIDEAN' | 'COSINE' },
    STANDARDIZE_FEATURES = { TRUE | FALSE },
    MAX_ITERATIONS = int64_value,
    EARLY_STOP = { TRUE | FALSE },
    MIN_REL_PROGRESS = float64_value,
    WARM_START = { TRUE | FALSE }
)];

Boosted trees

types 
- adaptive boosting
- gradient boosting
- XGBoost

tips
- feature selection is usually unnecessary, but feature engineering is still needed
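
To make the gradient boosting entry above more concrete, here is a minimal Python sketch for regression with squared error: each round fits a one-split tree (a stump) to the current residuals and adds a scaled-down copy of it to the model. This is a teaching sketch, not XGBoost or BigQuery ML's implementation; all names and data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single split on x that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def fit_boosted_stumps(xs, ys, num_rounds=20, learn_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(num_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + learn_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return base, learn_rate, stumps


def predict(model, x):
    """Sum the base value and the scaled output of every stump."""
    base, learn_rate, stumps = model
    return base + sum(learn_rate * (lv if x <= t else rv) for t, lv, rv in stumps)


def mean_squared_error(model, xs, ys):
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data with a step between x = 4 and x = 5.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 8.0, 8.3, 7.9, 8.1]
baseline = (sum(ys) / len(ys), 0.3, [])  # mean-only model, no stumps
model = fit_boosted_stumps(xs, ys)
print(mean_squared_error(baseline, xs, ys), mean_squared_error(model, xs, ys))
```

Here `num_rounds` and `learn_rate` play roles similar to the number of boosting rounds and the shrinkage step size in a real boosted-tree library.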


XGBoost parameter
https://xgboost.readthedocs.io/en/latest/parameter.html#parameters-for-tree-booster

code

{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL} model_name
[OPTIONS(MODEL_TYPE = { 'BOOSTED_TREE_CLASSIFIER' | 'BOOSTED_TREE_REGRESSOR' },
         BOOSTER_TYPE = {'GBTREE' | 'DART'},
         NUM_PARALLEL_TREE = int64_value,
         DART_NORMALIZE_TYPE = {'TREE' | 'FOREST'},
         TREE_METHOD = {'AUTO' | 'EXACT' | 'APPROX' | 'HIST'},
         MIN_TREE_CHILD_WEIGHT = int64_value,
         COLSAMPLE_BYTREE = float64_value,
         COLSAMPLE_BYLEVEL = float64_value,
         COLSAMPLE_BYNODE = float64_value,
         MIN_SPLIT_LOSS = float64_value,
         MAX_TREE_DEPTH = int64_value,
         SUBSAMPLE = float64_value,
         AUTO_CLASS_WEIGHTS = { TRUE | FALSE },
         CLASS_WEIGHTS = struct_array,
         INSTANCE_WEIGHT_COL = string_value,
         L1_REG = float64_value,
         L2_REG = float64_value,
         EARLY_STOP = { TRUE | FALSE },
         LEARN_RATE = float64_value,
         INPUT_LABEL_COLS = string_array,
         MAX_ITERATIONS = int64_value,
         MIN_REL_PROGRESS = float64_value,
         DATA_SPLIT_METHOD = { 'AUTO_SPLIT' | 'RANDOM' | 'CUSTOM' | 'SEQ' | 'NO_SPLIT' },
         DATA_SPLIT_EVAL_FRACTION = float64_value,
         DATA_SPLIT_COL = string_value,
         ENABLE_GLOBAL_EXPLAIN = { TRUE | FALSE },
         XGBOOST_VERSION = { '0.9' | '1.1' }
)];

DNN

Sigmoid function
- prone to the vanishing gradient problem
ReLU function
- less vulnerable to the vanishing gradient problem

code

{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL} model_name
[OPTIONS(MODEL_TYPE = { 'DNN_CLASSIFIER' | 'DNN_REGRESSOR' },
         ACTIVATION_FN = { 'RELU' | 'RELU6' | 'CRELU' | 'ELU' | 'SELU' | 'SIGMOID' | 'TANH' },
         AUTO_CLASS_WEIGHTS = { TRUE | FALSE },
         BATCH_SIZE = int64_value,
         CLASS_WEIGHTS = struct_array,
         DROPOUT = float64_value,
         EARLY_STOP = { TRUE | FALSE },
         HIDDEN_UNITS = int_array,
         L1_REG = float64_value,
         L2_REG = float64_value,
         LEARN_RATE = float64_value,
         INPUT_LABEL_COLS = string_array,
         MAX_ITERATIONS = int64_value,
         MIN_REL_PROGRESS = float64_value,
         OPTIMIZER = { 'ADAGRAD' | 'ADAM' | 'FTRL' | 'RMSPROP' | 'SGD' },
         WARM_START = { TRUE | FALSE },
         DATA_SPLIT_METHOD = { 'AUTO_SPLIT' | 'RANDOM' | 'CUSTOM' | 'SEQ' | 'NO_SPLIT' },
         DATA_SPLIT_EVAL_FRACTION = float64_value,
         DATA_SPLIT_COL = string_value,
         ENABLE_GLOBAL_EXPLAIN = { TRUE | FALSE },
         INTEGRATED_GRADIENTS_NUM_STEPS = int64_value,
         TF_VERSION = { '1.15' | '2.8.0' }
)];
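
The note on sigmoid versus ReLU can be checked numerically. The Python sketch below multiplies the activation derivative across several layers, with every weight fixed at 1 as a simplifying assumption, to show how quickly the sigmoid gradient shrinks. The function names and the layer count are illustrative.

```python
import math


def sigmoid_grad(z):
    """Derivative of the sigmoid function; its maximum is 0.25, reached at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


def chained_gradient(grad_fn, z, num_layers):
    """Product of the activation derivative across layers (all weights assumed to be 1)."""
    return grad_fn(z) ** num_layers


sigmoid_chain = chained_gradient(sigmoid_grad, 0.0, 10)  # best case for sigmoid
relu_chain = chained_gradient(relu_grad, 1.0, 10)        # active ReLU units
print(sigmoid_chain, relu_chain)
```

Even in its best case the sigmoid derivative shrinks the gradient by a factor of four per layer, while an active ReLU passes it through unchanged. ReLU is less vulnerable rather than immune, because its derivative is zero for negative inputs.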

PCA

concept
- reduce the dimensionality to avoid overfitting

code

{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL}
model_name
[OPTIONS(MODEL_TYPE = { 'PCA' },
    NUM_PRINCIPAL_COMPONENTS = int64_value,
    PCA_EXPLAINED_VARIANCE_RATIO = float64_value,
    SCALE_FEATURES = { TRUE | FALSE },
    PCA_SOLVER = { 'FULL' | 'RANDOMIZED' | 'AUTO' }
)];
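
To illustrate the dimensionality-reduction idea behind PCA, the Python sketch below finds the first principal component of two-dimensional data in closed form and reports how much variance it explains. It is a toy for the two-feature case only, not how BigQuery ML computes PCA; the function name and the data are hypothetical.

```python
import math


def first_component_2d(points):
    """Return (direction, explained_variance_ratio) of the first principal component."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Sample covariance matrix [[a, b], [b, c]] of the centered data.
    a = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    c = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    b = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix in closed form.
    half_trace = (a + c) / 2
    spread = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    largest, smallest = half_trace + spread, half_trace - spread
    # Eigenvector for the largest eigenvalue.
    if b != 0:
        vx, vy = largest - c, b
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    direction = (vx / norm, vy / norm)
    return direction, largest / (largest + smallest)


# Hypothetical data lying exactly on the line y = x.
points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
direction, ratio = first_component_2d(points)
print(direction, ratio)
```

When one component explains nearly all of the variance, as in this example, the second feature can be dropped with little loss. That is the trade-off the NUM_PRINCIPAL_COMPONENTS and PCA_EXPLAINED_VARIANCE_RATIO options control.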
