
siml's People

Contributors

hrokr, taspinar

siml's Issues

wavelet forecasting question

Hi, thanks for the great blog posts and notebooks. This is very well done!

Regarding your comment that wavelets may not be that useful for dealing with 'online' data, have you considered using 'locally stationary wavelets' to achieve that?

I noticed the PyWavelets package already implements swt(); have you thought of adopting it to deal with 'online' data? It would be fun!
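
For illustration, a minimal sketch of what calling the stationary wavelet transform in PyWavelets looks like (the signal below is a made-up stand-in, and pywt.swt requires the input length to be a multiple of 2**level):

import numpy as np
import pywt

# Made-up test signal; pywt.swt needs len(signal) to be a multiple of 2**level.
t = np.linspace(0, 1, 512)
signal = np.sin(2 * np.pi * 7 * t) + 0.3 * np.random.randn(512)

# Undecimated transform: every level keeps the full input length,
# which is part of what makes it attractive for 'online' settings.
coeffs = pywt.swt(signal, 'db4', level=3)
for cA, cD in coeffs:
    print(cA.shape, cD.shape)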

Auto-correlation in Python

Hi @taspinar ,

in http://ataspinar.com/2018/04/04/machine-learning-with-signal-processing-techniques/ the auto-correlation is based on NumPy, which is totally fine. I also searched for ACF and PACF toolkits and wanted to point you in the direction of statsmodels.

e.g.

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.tsa.stattools import acf

# t_values, composite_y_value and N are as defined in the blog post.
plt.plot(t_values, acf(composite_y_value, nlags=N, fft=True), linestyle='-', color='blue')
plt.xlabel('time delay [s]')
plt.ylabel('Autocorrelation amplitude')
plt.show()

plot_acf(composite_y_value, zero=True, lags=N-1)

This returns plots with the ACF between -1 and 1, which is what I would rather expect. I am not sure why NumPy has that scaling; it seems weird to me, at least in terms of correlation.

Furthermore, statsmodels.tsa.stattools.acf has an fft parameter that relates to the fun fact you mention. Very interesting. Maybe you could elaborate on the transformation between the FFT, the PSD and the auto-correlation if you ever make adjustments.
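
For reference, a minimal sketch of that relation (Wiener-Khinchin: the auto-correlation equals the inverse FFT of the power spectral density; the signal below is a made-up stand-in for composite_y_value):

import numpy as np

# Made-up zero-mean signal standing in for composite_y_value.
t = np.linspace(0, 10, 1000)
x = np.sin(2 * np.pi * 1.5 * t)
x = x - x.mean()
n = len(x)

# FFT -> PSD -> inverse FFT gives the auto-correlation.
# Zero-padding to 2n avoids the circular wrap-around of the plain FFT.
fft_x = np.fft.fft(x, n=2 * n)
psd = fft_x * np.conj(fft_x)
acf_fft = np.fft.ifft(psd).real[:n]
acf_fft /= acf_fft[0]            # normalize so lag 0 equals 1

# Direct computation for comparison.
acf_direct = np.correlate(x, x, mode='full')[n - 1:]
acf_direct /= acf_direct[0]

assert np.allclose(acf_fft, acf_direct)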

Great blog!

error in naive_bayes

Error in https://github.com/taspinar/siml/blob/master/siml/naive_bayes.py:

File "e:\Baysian\code\siml-master see naive_bayes py plain basyesian\siml-master\siml\naive_bayes.py", line 131, in <module>
    predicted_Y = nbc.classify(X_test[:100])
File "e:\Baysian\code\siml-master see naive_bayes py plain basyesian\siml-master\siml\naive_bayes.py", line 119, in classify
    prediction = self.classify_single_elem(X_elem)
File "e:\Baysian\code\siml-master see naive_bayes py plain basyesian\siml-master\siml\naive_bayes.py", line 112, in classify_single_elem
    return self.get_max_value_key(Y_dict)
File "e:\Baysian\code\siml-master see naive_bayes py plain basyesian\siml-master\siml\naive_bayes.py", line 17, in get_max_value_key
    max_value_index = values.index(max(values))

builtins.AttributeError: 'dict_values' object has no attribute 'index'
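
A likely fix, for what it's worth: in Python 3, dict.values() returns a view object that has no .index() method. A minimal sketch of get_max_value_key that avoids the call entirely:

def get_max_value_key(y_dict):
    # max with a key function sidesteps .index() on dict_values,
    # which only exists on lists (the Python 2 behaviour).
    return max(y_dict, key=y_dict.get)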

Computational time

Hello, could you please guide me in calculating the computational time of both the FFT and the DWT, in order to compare their performance? Thanks!
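
One simple way to compare them, as a sketch (np.fft.fft and pywt.wavedec are the standard entry points; the signal length, wavelet and level are arbitrary choices here):

import timeit
import numpy as np
import pywt

# Arbitrary test signal; adjust length, wavelet and level to your use case.
signal = np.random.randn(2 ** 16)

fft_time = timeit.timeit(lambda: np.fft.fft(signal), number=100)
dwt_time = timeit.timeit(lambda: pywt.wavedec(signal, 'db4', level=5), number=100)

print("FFT: {:.6f} s per call".format(fft_time / 100))
print("DWT: {:.6f} s per call".format(dwt_time / 100))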

How to select a certain wavelet level for feature extraction?

I liked your detailed explanation of the wavelet transform and its implementation in Python. I have been through several articles on using wavelets for feature extraction; however, in some of them the authors mention that they select a certain detail/approximation coefficient level on which they perform the feature extraction. I want to ask: how can one decide which level to choose for feature extraction?
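
One common heuristic, sketched below under the assumption of a PyWavelets decomposition (the signal, wavelet and level are illustrative): compute the relative energy of the coefficients at each level and keep the levels that carry most of it.

import numpy as np
import pywt

# Illustrative signal; in practice this is the measured time series.
signal = np.random.randn(1024)

# wavedec returns [cA5, cD5, cD4, cD3, cD2, cD1].
coeffs = pywt.wavedec(signal, 'db4', level=5)
names = ['cA5'] + ['cD{}'.format(lvl) for lvl in range(5, 0, -1)]
energies = [np.sum(c ** 2) for c in coeffs]
total = sum(energies)
for name, energy in zip(names, energies):
    print('{}: {:.1%} of total energy'.format(name, energy / total))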

Question on sample points length

Great code! Very useful, thank you.
I have a question. I am trying to adopt this sort of method to analyse my resting-state EEG data of MCI vs. normal patients. These are .set files where, for each individual who was examined and for each channel (there are 19 electrodes in total), the sample points vary in length. How is this sort of problem addressed when, for each individual, we have an n_channels * n_times array in which n_times differs between individuals?
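
One common workaround, as a sketch (all names here are illustrative): slice each recording into fixed-length windows, so every sample ends up with the same shape regardless of the original n_times.

import numpy as np

def to_fixed_windows(recording, window_size, step=None):
    """Split an (n_channels, n_times) array into (n_windows, n_channels, window_size)."""
    step = step or window_size  # non-overlapping windows by default
    n_channels, n_times = recording.shape
    starts = range(0, n_times - window_size + 1, step)
    return np.stack([recording[:, s:s + window_size] for s in starts])

eeg = np.random.randn(19, 5000)        # 19 electrodes, arbitrary length
windows = to_fixed_windows(eeg, 1000)  # shape (5, 19, 1000)
print(windows.shape)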

"ValueError: not enough values to unpack (expected 4, got 2)"

Hi @taspinar,

I got this error when running your scaleogram ipynb, at the third block: "3. Plot the Scaleogram using the Continuous Wavelet Transform".


ValueError                       Traceback (most recent call last)
in <module>
     29
     30 fig, ax = plt.subplots(figsize=(10, 10))
---> 31 plot_wavelet(ax, time, signal, scales, xlabel=xlabel, ylabel=ylabel, title=title)
     32 plt.show()

in plot_wavelet(ax, time, signal, scales, waveletname, cmap, title, ylabel, xlabel)
      3
      4 dt = time[1] - time[0]
----> 5 [coefficients, frequencies, _, _] = pywt.cwt(signal, scales, waveletname, dt)
      6 power = (abs(coefficients)) ** 2
      7 period = 1. / frequencies

ValueError: not enough values to unpack (expected 4, got 2)

What am I doing wrong?

Thanks.
Best,
@bemoregt.
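
For what it's worth, pywt.cwt returns exactly two values (the coefficients and the frequencies), so a likely fix is to unpack only two:

# pywt.cwt returns (coefficients, frequencies), not four values.
coefficients, frequencies = pywt.cwt(signal, scales, waveletname, dt)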

Solver Needs Samples of at least two classes

I came across your Jupyter notebook and was pleased to find solutions to a problem that had been giving me headaches, namely the classification of data from a dataframe with columns that have numeric attributes. I have data that is similar to yours and I modified your code for my dataset, but it is not working. Your data has a column labelled "Type", which is just an array of ones.

Whenever I run your code on my dataset, I get the following error:
ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: '1'

Do you know why this error comes up in my case when it wouldn't in yours? I also tried the code from your webpage, which differs from the one here on GitHub on the following line:
website code: mask = np.random.rand(len(df)) &lt; ratio (an error comes up because the < is rendered as the HTML escape &lt;, and lt is not defined anywhere in the code)
github code: mask = np.random.rand(len(df)) < ratio

When I run the code that is given on your website and make the above change (replacing the &lt; with <), the error changes to KeyError: "Type".

Do you know how I can solve this? Thanks in advance for the help.

Here is my code for the dataframe preprocessing:
diffreport.txt

import warnings; warnings.simplefilter("ignore")
# Importing important libraries
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.formula.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
import csv

df = pd.read_csv("diffreport.csv", sep= ",")

d1 = df.drop("name", axis = 1)
d2 = d1.drop("isotopes", axis = 1)
d3 = d2.drop("adduct", axis = 1)
d4 = d3.drop("tstat", axis = 1)
d5 = d4.drop("pvalue", axis = 1)
d6 = d5.drop("fold", axis = 1)
d7 = d6.drop(d6.columns[0], axis = 1)
d8 = d7.drop("npeaks", axis = 1)
d9 = d8.drop("Eta6", axis = 1)
d10 = d9.drop("Eta8", axis = 1)
columns = ['Eta6_0', 'Eta6_2', 'Eta6_3', 'Eta8.1', 'Eta82', 'Eta83']
df1 = pd.DataFrame(d10, columns = columns)
df1['Type'] = "1"

The rest of my code is similar to yours but I have pasted it below for clarity
import time
import pandas as pd
import numpy as np

import pickle

# Some modules for plotting and visualizing

import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display

# And some Machine Learning modules from scikit-learn

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, LabelEncoder

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

dict_classifiers = {
    "Logistic Regression": LogisticRegression(),
    "Nearest Neighbors": KNeighborsClassifier(),
    "Linear SVM": SVC(),
    "Gradient Boosting Classifier": GradientBoostingClassifier(n_estimators=1000),
    "Decision Tree": tree.DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=1000),
    "Neural Net": MLPClassifier(alpha=1),
    "Naive Bayes": GaussianNB(),
    #"AdaBoost": AdaBoostClassifier(),
    #"QDA": QuadraticDiscriminantAnalysis(),
    #"Gaussian Process": GaussianProcessClassifier()
}

def batch_classify(X_train, Y_train, X_test, Y_test, no_classifiers=5, verbose=True):
    dict_models = {}
    for classifier_name, classifier in list(dict_classifiers.items())[:no_classifiers]:
        t_start = time.perf_counter()  # time.clock() was removed in Python 3.8
        classifier.fit(X_train, Y_train)
        t_end = time.perf_counter()

        t_diff = t_end - t_start
        train_score = classifier.score(X_train, Y_train)
        test_score = classifier.score(X_test, Y_test)

        dict_models[classifier_name] = {'model': classifier, 'train_score': train_score, 'test_score': test_score, 'train_time': t_diff}
        if verbose:
            print("trained {c} in {f:.2f} s".format(c=classifier_name, f=t_diff))
    return dict_models

def label_encode(df, list_columns):
    """
    This method label encodes all columns specified in list_columns.
    """
    for col in list_columns:
        le = LabelEncoder()
        col_values_unique = list(df[col].unique())
        le_fitted = le.fit(col_values_unique)

        col_values = list(df[col].values)
        col_values_transformed = le.transform(col_values)
        df[col] = col_values_transformed

def expand_columns(df, list_columns):
    for col in list_columns:
        colvalues = df[col].unique()
        for colvalue in colvalues:
            newcol_name = "{}is{}".format(col, colvalue)
            df.loc[df[col] == colvalue, newcol_name] = 1
            df.loc[df[col] != colvalue, newcol_name] = 0
    df.drop(list_columns, inplace=True, axis=1)

def get_train_test(df, y_col, x_cols, ratio):
    """
    This method transforms a dataframe into a train and test set, for this you need to specify:
    1. the ratio train : test (usually 0.7)
    2. the column with the Y_values
    """
    mask = np.random.rand(len(df)) < ratio
    df_train = df[mask]
    df_test = df[~mask]

    Y_train = df_train[y_col].values
    Y_test = df_test[y_col].values
    X_train = df_train[x_cols].values
    X_test = df_test[x_cols].values
    return df_train, df_test, X_train, Y_train, X_test, Y_test

def display_dict_models(dict_models, sort_by='test_score'):
    cls = [key for key in dict_models.keys()]
    test_s = [dict_models[key]['test_score'] for key in cls]
    training_s = [dict_models[key]['train_score'] for key in cls]
    training_t = [dict_models[key]['train_time'] for key in cls]

    df_ = pd.DataFrame(data=np.zeros(shape=(len(cls), 4)), columns=['classifier', 'train_score', 'test_score', 'train_time'])
    for ii in range(0, len(cls)):
        df_.loc[ii, 'classifier'] = cls[ii]
        df_.loc[ii, 'train_score'] = training_s[ii]
        df_.loc[ii, 'test_score'] = test_s[ii]
        df_.loc[ii, 'train_time'] = training_t[ii]

    display(df_.sort_values(by=sort_by, ascending=False))

def display_corr_with_col(df, col):
    correlation_matrix = df.corr()
    correlation_type = correlation_matrix[col].copy()
    abs_correlation_type = correlation_type.apply(lambda x: abs(x))
    desc_corr_values = abs_correlation_type.sort_values(ascending=False)
    y_values = list(desc_corr_values.values)[1:]
    x_values = range(0, len(y_values))
    xlabels = list(desc_corr_values.keys())[1:]
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.bar(x_values, y_values)
    ax.set_title('The correlation of all features with {}'.format(col), fontsize=20)
    ax.set_ylabel('Pearson correlation coefficient [abs value]', fontsize=16)
    plt.xticks(x_values, xlabels, rotation='vertical')
    plt.show()

# Classification

y_col_glass = 'Type'
x_cols_glass = list(df1.columns.values)
x_cols_glass.remove(y_col_glass)

train_test_ratio = 0.7
df_train, df_test, X_train, Y_train, X_test, Y_test = get_train_test(df1, y_col_glass, x_cols_glass, train_test_ratio)

dict_models = batch_classify(X_train, Y_train, X_test, Y_test, no_classifiers = 8)
display_dict_models(dict_models)
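
For what it's worth, the error message is consistent with the preprocessing above: the line df1['Type'] = "1" gives every row the same label, so the classifiers only ever see a single class. The target column needs at least two distinct values, e.g. (a purely hypothetical threshold on an illustrative column):

# Hypothetical two-class label; the column and threshold are illustrative only.
df1['Type'] = (df1['Eta6_0'] > df1['Eta6_0'].median()).astype(int)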
