
sutil's Introduction

sutil

This repository contains a set of tools for machine learning and natural language processing tasks, including classes for quick experimentation with different classification models.

Dataset

This class loads CSV-style datasets where all the features are comma-separated and the class label is in the last column. It includes functions to normalize the features, add a bias term, save the data to a file, and load it back. It also includes functions to split the data into train, validation, and test datasets.

from sutil.base.Dataset import Dataset

datafile = './sutil/datasets/ex2data1.txt'
d = Dataset.fromDataFile(datafile, ',')
print(d.size)

sample = d.sample(0.3)
print(sample.size)

sample.save("modelo_01")

train, validation, test = d.split(train = 0.8, validation = 0.2)
print(train.size)
print(validation.size)
print(test.size)
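
The feature normalization mentioned above can be applied before splitting; normalizeFeatures is the same method used in the neural network and experiment examples later in this README:

from sutil.base.Dataset import Dataset

datafile = './sutil/datasets/ex2data1.txt'
d = Dataset.fromDataFile(datafile, ',')
d.normalizeFeatures()
train, validation, test = d.split(train = 0.8, validation = 0.2)
print(train.size)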

Regularized Logistic Regression

You can also include your own models, such as a Regularized Logistic Regression implemented manually using numpy and included in the sutil.models package.

import numpy as np
from sutil.base.Dataset import Dataset
from sutil.models.RegularizedLogisticRegression import RegularizedLogisticRegression

datafile = './sutil/datasets/ex2data1.txt'
d = Dataset.fromDataFile(datafile, ',')
d.xlabel = 'Exam 1 score'
d.ylabel = 'Exam 2 score'
d.legend = ['Admitted', 'Not admitted']
iterations = 400
print('Plotting data with + indicating (y = 1) examples and o indicating (y = 0) examples.\n')
d.plotData()

theta = np.zeros((d.n + 1, 1))
lr = RegularizedLogisticRegression(theta, 0.03, 0, train=1)
lr.trainModel(d)
lr.score(d.X, d.y)
lr.roc.plot()
lr.roc.zoom((0, 0.4),(0.5, 1.0))

Sklearn model

You can also embed sklearn models in a wrapper class in order to run experiments with different models implemented in sklearn. In the same style, you can create TensorFlow, Keras, or PyTorch models by inheriting from the sutil.models.Model class and implementing the trainModel and predict methods (a sketch of a custom model follows the example below).

import numpy as np
from sutil.base.Dataset import Dataset
from sutil.models.SklearnModel import SklearnModel
from sklearn.linear_model import LogisticRegression

datafile = './sutil/datasets/ex2data1.txt'
d = Dataset.fromDataFile(datafile, ',')
ms = LogisticRegression()
m = SklearnModel('Sklearn Logistic', ms)
m.trainModel(d)
m.score(d.X, d.y)
m.roc.plot()
m.roc.zoom((0, 0.4),(0.5, 1.0))
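
As noted above, custom models are created by inheriting from the sutil.models.Model class and implementing trainModel and predict. The following is a minimal sketch; it assumes the base class is importable as sutil.models.Model, that its constructor needs no required arguments, that trainModel receives a Dataset, and that predict receives a feature matrix. None of those signatures are documented here, so treat this as illustrative only.

import numpy as np
from sutil.base.Dataset import Dataset
from sutil.models.Model import Model  # assumed import path for the base class

class MajorityClassModel(Model):
    """Toy example: always predicts the most frequent training label."""

    def trainModel(self, dataset):
        # Remember the most common label seen during training
        values, counts = np.unique(dataset.y, return_counts = True)
        self.label = values[np.argmax(counts)]

    def predict(self, X):
        # Predict the stored label for every row of X
        return np.full((X.shape[0], 1), self.label)

datafile = './sutil/datasets/ex2data1.txt'
d = Dataset.fromDataFile(datafile, ',')
model = MajorityClassModel()
model.trainModel(d)
print(model.predict(d.X)[:5])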

Neural Network Classifier

This class lets you perform classification using a neural network (multilayer perceptron) classifier. It wraps the sklearn MLPClassifier and implements a method to search over different activations, solvers, and hidden layer structures. You can also pass your own arguments to initialize the network as you want (a sketch follows the example below).

from sutil.base.Dataset import Dataset
from sutil.neuralnet.NeuralNetworkClassifier import NeuralNetworkClassifier

datafile = './sutil/datasets/ex2data1.txt'
d = Dataset.fromDataFile(datafile, ',')
d.normalizeFeatures()
sample = d.sample(examples = 30)

nn = NeuralNetworkClassifier((d.n, len(d.labels)))
nn.searchParameters(sample)
nn.trainModel(d)
nn.score(d.X, d.y)
nn.roc.plot()
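
Since NeuralNetworkClassifier wraps sklearn's MLPClassifier, one way to customize the network is to pass MLPClassifier-style keyword arguments at construction time. Whether the constructor forwards arbitrary keyword arguments this way is an assumption; the snippet below is a sketch, not documented behavior.

from sutil.base.Dataset import Dataset
from sutil.neuralnet.NeuralNetworkClassifier import NeuralNetworkClassifier

datafile = './sutil/datasets/ex2data1.txt'
d = Dataset.fromDataFile(datafile, ',')
d.normalizeFeatures()

# Hypothetical keyword arguments, assumed to be forwarded to MLPClassifier
nn = NeuralNetworkClassifier((d.n, len(d.labels)), activation = 'tanh',
                             solver = 'lbfgs', hidden_layer_sizes = (20, 10))
nn.trainModel(d)
nn.score(d.X, d.y)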

Experiment

The Experiment class lets you split the data and test it against different models to compare their performance automatically.

import numpy as np
from sutil.base.Dataset import Dataset
from sklearn.linear_model import LogisticRegression
from sutil.base.Experiment import Experiment
from sutil.models.SklearnModel import SklearnModel
from sutil.models.RegularizedLogisticRegression import RegularizedLogisticRegression
from sutil.neuralnet.NeuralNetworkClassifier import NeuralNetworkClassifier

# Load the data
datafile = './sutil/datasets/ex2data1.txt'
d = Dataset.fromDataFile(datafile, ',')
d.normalizeFeatures()
print("Size of the dataset... ")
print(d.size)
sample = d.sample(0.3)
print("Size of the sample... ")
print(sample.size)


# Create the models
theta = np.zeros((d.n + 1, 1))
lr = RegularizedLogisticRegression(theta, 0.03, 0)
m = SklearnModel('Sklearn Logistic', LogisticRegression())
# Look for the best parameters using a sample
nn = NeuralNetworkClassifier((d.n, len(d.labels)))
nn.searchParameters(sample)

input("Press enter to continue...")

# Create the experiment
experiment = Experiment(d, None, 0.8, 0.2)
experiment.addModel(lr, name = 'Sutil Logistic Regression')
experiment.addModel(m, name = 'Sklearn Logistic Regression')
experiment.addModel(nn, name = 'Sutil Neural Network')

# Run the experiment
experiment.run(plot = True)

Text utilities

Sutil includes text utilities to process and transform text for classification.

PreProcessor

The PreProcessor class lets you apply text preprocessing functions to transform the data. It wraps nltk methods and uses its own methods to perform:

  • Case normalization
  • Noise removal
  • Stemming
  • Lemmatizing
  • Pattern text normalization

from sutil.text.PreProcessor import PreProcessor

string = "La Gata maullaba en la noche $'@|··~½¬½¬{{[[]}aqAs   qasdas 1552638"
p = PreProcessor.standard()
print(p.preProcess(string))

patterns = [(r"\d+", "NUMBER")]
c = [("case", "lower"), ("denoise", "spanish"), ("stopwords", "spanish"), 
     ("stem", "spanish"), ("lemmatize", "spanish"), ("normalize", patterns)]
p2 = PreProcessor(c)
print(p2.preProcess(string))


c = [("case", "lower"), ("denoise", "spanish"), ("stem", "spanish"), 
     ("normalize", patterns)]
p3 = PreProcessor(c)
print(p3.preProcess(string))
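
The (r"\d+", "NUMBER") entry used above pairs a regular expression with a replacement token; the pattern normalization step presumably replaces every match with that token. The same idea in plain re, independent of PreProcessor:

import re

# Replace every run of digits with the token NUMBER
patterns = [(r"\d+", "NUMBER")]
text = "qasdas 1552638"
for pattern, token in patterns:
    text = re.sub(pattern, token, text)
print(text)  # qasdas NUMBER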

PhraseTokenizer

PhraseTokenizer lets you split a phrase into tokens given a delimiter character. There is also the GramTokenizer class, which lets you split words into chunks of a fixed number of characters.

from sutil.text.GramTokenizer import GramTokenizer
from sutil.text.PhraseTokenizer import PhraseTokenizer

string = "Hi I'm a really helpful string"
t = PhraseTokenizer()
print(t.tokenize(string))

t2 = GramTokenizer()
print(t2.tokenize(string))
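
To make the difference concrete: PhraseTokenizer splits on a delimiter, while GramTokenizer splits into fixed-size character chunks. A dependency-free sketch of the character-gram idea follows; the actual GramTokenizer defaults (gram size, whether grams overlap) are not documented here.

def char_grams(text, n = 3):
    # Overlapping character n-grams (illustration of the idea only)
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print("really helpful string".split(" "))  # delimiter split, like PhraseTokenizer
print(char_grams("helpful", 3))            # ['hel', 'elp', 'lpf', 'pfu', 'ful']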

TextVectorizer

The TextVectorizer class is an abstraction of methods to vectorize text and transform vectors back into text. Sutil implements OneHotVectorizer, TFIDFVectorizer, and CountVectorizer.

import pandas as pd
from sutil.text.TFIDFVectorizer import TFIDFVectorizer
from sutil.text.GramTokenizer import GramTokenizer
from sutil.text.PreProcessor import PreProcessor
from nltk.tokenize import TweetTokenizer

# Load the data
dir_data = "./sutil/datasets/"
df = pd.read_csv(dir_data + 'tweets.csv')

# Clean the data
patterns = [(r"\d+", "NUMBER")]
c = [("case", "lower"), ("denoise", "english"), ("stopwords", "english"), ("normalize", patterns)]
p2 = PreProcessor(c)
df['clean_tweet'] = df.tweet.apply(p2.preProcess)

vectorizer = TFIDFVectorizer({}, TweetTokenizer())
vectorizer.initialize(df.clean_tweet)
print(vectorizer.dictionary.head())


vectorizer2 = TFIDFVectorizer({}, GramTokenizer())
vectorizer2.initialize(df.clean_tweet)
print(vectorizer2.dictionary.head())

vector = vectorizer.encodePhrase(df.clean_tweet[0])
print(vectorizer.getValues()[0])
print(vector)

vector2 = vectorizer2.encodePhrase(df.clean_tweet[0])
print(vectorizer2.getValues()[0])
print(vector2)

print(df.clean_tweet[0])
print(vectorizer.decodeVector(vector))
print("*"*50)
print(vectorizer2.decodeVector(vector2))
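
OneHotVectorizer and CountVectorizer, mentioned above, can presumably be used in the same way. The sketch below assumes they share TFIDFVectorizer's constructor and initialize/encodePhrase interface and that they live under sutil.text; those are assumptions rather than documented API.

import pandas as pd
from sutil.text.CountVectorizer import CountVectorizer    # assumed import path
from sutil.text.OneHotVectorizer import OneHotVectorizer  # assumed import path
from sutil.text.PhraseTokenizer import PhraseTokenizer

df = pd.read_csv("./sutil/datasets/tweets.csv")

# Assuming the same interface as TFIDFVectorizer above
cv = CountVectorizer({}, PhraseTokenizer())
cv.initialize(df.tweet)
print(cv.encodePhrase(df.tweet[0]))

ohv = OneHotVectorizer({}, PhraseTokenizer())
ohv.initialize(df.tweet)
print(ohv.encodePhrase(df.tweet[0]))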

TextDataset

The TextDataset class abstracts a dataset made of texts. It includes a vectorizer and a preprocessor to preprocess the text and transform it from text to vector and from vector back to text:

from sutil.text.TextDataset import TextDataset
from sutil.text.GramTokenizer import GramTokenizer
from sutil.text.TFIDFVectorizer import TFIDFVectorizer
from sutil.text.PhraseTokenizer import PhraseTokenizer
from sutil.text.PreProcessor import PreProcessor

# Load the data in the standard way
filename = "./sutil/datasets/tweets.csv"
t = TextDataset.standard(filename, ",")

x = input("Presione enter para continuar...")
# Creaate the dataset with a custom vectorizer and pre processor
patterns = [("\d+", "NUMBER")]
c = [("case", "lower"), ("denoise", "spanish"), ("stopwords", "spanish"), ("normalize", patterns)]
preprocessor = PreProcessor(c)
vectorizer = TFIDFVectorizer({}, GramTokenizer())
t2 = TextDataset.setvectorizer(filename, vectorizer, preprocessor)
vector = t2.encodePhrase("United oh the")
print(t2.vectorizer.decodeVector(vector))

x = input("Presione enter para continuar...")
patterns = [("\d+", "NUMBER")]
c = [("case", "lower"), ("denoise", "spanish"), ("stopwords", "english"), ("normalize", patterns)]
pre2 = PreProcessor(c)
vectorizer = TFIDFVectorizer({}, PhraseTokenizer())
t3 = TextDataset.setvectorizer(filename, vectorizer, pre2)
vector = t3.encodePhrase("United oh the")
print(t3.decodeVector(vector))


sutil's Issues

Cannot install soldai-utils on Python 2 or Python 3

When trying to install soldai-utils in my virtualenv using Python 3.5.3, I receive the following error message:

ERROR: Could not find a version that satisfies the requirement soldai-utils (from versions: none)
ERROR: No matching distribution found for soldai-utils

I tried to install it using Python 2.7.12 and got the following message:

Cache entry deserialization failed, entry ignored
  Could not find a version that satisfies the requirement soldai-utils (from versions: )
No matching distribution found for soldai-utils

I'm using a Linux environment.
