
decisiontree's Introduction

Decision Tree

A Ruby library which implements the ID3 (information gain) algorithm for decision tree learning. Both continuous and discrete datasets can be learned.

  • The discrete model assumes unique labels and can be graphed and converted into a PNG for visual analysis.
  • The continuous model looks at all possible values for a variable and iteratively chooses the best threshold between all possible assignments. This results in a binary tree partitioned by the chosen threshold at every step (e.g. temperature > 20°C).

Features

  • ID3 algorithms for continuous and discrete cases, with support for inconsistent datasets.
  • Graphviz component to visualize the learned tree
  • Support for multiple and symbolic outputs, and graphing of continuous trees.
  • Returns a default value when no branch fits the input.

Implementation

  • Ruleset is a class that trains an ID3Tree with 2/3 of the training data, converts it into a set of rules, and prunes the rules with the remaining 1/3 of the training data (in a C4.5 fashion).
  • Bagging is a bagging-based trainer which trains 10 Ruleset trainers and, when predicting, chooses the best output by voting; see the sketch below.
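
A minimal usage sketch of these two trainers, assuming both constructors mirror ID3Tree's (attributes, data, default, type) and expose train/predict the same way; check the gem source for the exact signatures:

require 'decisiontree'

attributes = ['Temperature']
training = [
  [36.6, 'healthy'], [37, 'sick'], [38, 'sick'],
  [36.7, 'healthy'], [40, 'sick'], [50, 'really sick']
]

# Ruleset: trains on 2/3 of the data, prunes the rules against the other 1/3.
ruleset = DecisionTree::Ruleset.new(attributes, training, 'sick', :continuous)
ruleset.train
puts ruleset.predict([37, 'sick'])

# Bagging: trains 10 Ruleset trainers and predicts by majority vote.
bagger = DecisionTree::Bagging.new(attributes, training, 'sick', :continuous)
bagger.train
puts bagger.predict([37, 'sick'])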

Blog post with explanation & examples

Example

require 'decisiontree'

attributes = ['Temperature']
training = [
  [36.6, 'healthy'],
  [37, 'sick'],
  [38, 'sick'],
  [36.7, 'healthy'],
  [40, 'sick'],
  [50, 'really sick'],
]

# Instantiate the tree, and train it on the data (default decision: 'sick')
dec_tree = DecisionTree::ID3Tree.new(attributes, training, 'sick', :continuous)
dec_tree.train

test = [37, 'sick']
decision = dec_tree.predict(test)
puts "Predicted: #{decision} ... True decision: #{test.last}"

# => Predicted: sick ... True decision: sick

# Specify the type (:discrete or :continuous) of each attribute in the constructor
labels = ["hunger", "color"]
training = [
        [8, "red", "angry"],
        [6, "red", "angry"],
        [7, "red", "angry"],
        [7, "blue", "not angry"],
        [2, "red", "not angry"],
        [3, "blue", "not angry"],
        [2, "blue", "not angry"],
        [1, "red", "not angry"]
]

dec_tree = DecisionTree::ID3Tree.new(labels, training, "not angry", color: :discrete, hunger: :continuous)
dec_tree.train

test = [7, "red", "angry"]
decision = dec_tree.predict(test)
puts "Predicted: #{decision} ... True decision: #{test.last}"

# => Predicted: angry ... True decision: angry
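
The learned tree can also be rendered via the Graphviz component listed under Features. A minimal sketch, assuming the graphr gem and GraphViz are installed; the graph method is referenced in the issues below, but the file-name argument here is an assumption, so check the gem source for the exact signature:

# Hypothetical usage: writes a PNG of the learned tree for visual analysis.
dec_tree.graph('angry_tree')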

License

The MIT License - Copyright (c) 2006 Ilya Grigorik

decisiontree's People

Contributors

agarie, ahadc, ajmeese7, cheerfulstoic, dannyben, havenwood, igrigorik, laura-swe, louismullie, lukeasrodgers, rustyio, salunkheketki19, superchris


decisiontree's Issues

Issue with graph output

I am using the graphr Ruby gem and GraphViz on Ubuntu, and all I get is this: [screenshot: discrete]

Btw, why didn’t you put the graphr gem in the dependencies?

Predicting Patterns

How would I go about setting up a simple data set of a series of outcomes that are either 0 or 1 and predicting the nth outcome?
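
One possible setup (a sketch, not from the gem's docs): treat the previous k outcomes as discrete attributes and the outcome that follows them as the label, sliding a window over the series. Attribute names are kept non-numeric on purpose, since the "stack level too deep" issue below suggests numeric attribute names misbehave:

require 'decisiontree'

series = [0, 1, 1, 0, 1, 1, 0, 1, 1]
k = 3
attributes = (1..k).map { |i| "prev#{i}" }  # non-numeric names on purpose
training = series.each_cons(k + 1).to_a    # first k values, then the label

dec_tree = DecisionTree::ID3Tree.new(attributes, training, 0, :discrete)
dec_tree.train

# Predict the next (nth) outcome from the last k observed values; the trailing
# slot is where the true label would go in a test row.
puts dec_tree.predict(series.last(k) + [nil])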

Overfitting or am I getting the ruleset interpretation wrong?

By calling the ruleset method after train on a DecisionTree::ID3Tree object (set up for continuous data), I expect to get back a number of rules, each of which I interpret as a series of ANDed clauses.

However, I often get cases where the same attribute is repeated within a rule even though it already appears in a clause above it.

e.g.,

attrib_1 < 0.02123562506819547
attrib_2 >= 0.1922781177611915
attrib_3 < 0.2879504779121489
attrib_4 < 0.26382498790056597
attrib_4 < 0.193308315974597
=> class1()

In the above case the second mention of attrib_4 is superfluous if the rule is interpreted as:

if attrib_1 <  0.02123562506819547 
and attrib_2 >= 0.1922781177611915
and attrib_3 < 0.2879504779121489
and attrib_4 < 0.26382498790056597
and attrib_4 < 0.193308315974597
then class1()
end

So, am I wrong to assume a chain of ANDed clauses?
If not, then is the second occurrence of attrib_4 a sign of overfitting which I can safely ignore?
Could this just be a bug?
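
For what it's worth, if the rules really are ANDed chains, the repeated test is logically harmless: two "<" clauses on the same attribute collapse to the tighter bound. A sketch with an illustrative clause format (not the gem's internal representation):

clauses = [
  ['attrib_4', :<, 0.26382498790056597],
  ['attrib_4', :<, 0.193308315974597]
]

# Keep the smallest threshold per attribute; this handles only '<' clauses.
tightest = clauses.group_by(&:first).map do |attr, cs|
  [attr, :<, cs.map(&:last).min]
end
# => [["attrib_4", :<, 0.193308315974597]]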

Export ruleset to ruby code

Hi,

I'm using the Decisiontree gem in a project of mine, and I wrote some simple code to export the ruleset it generates into Ruby if/elsif conditions.

Would this be interesting to add to your gem? If so, I will push it.

Thanks

Code generated example:

def classify(sepallength, sepalwidth, petallength, petalwidth)
  if petallength >= 2.45 and
     petalwidth >= 1.75 and
     petallength >= 4.85
    then 'virginica'
  elsif petallength >= 2.45 and
        petalwidth >= 1.75 and
        petallength < 4.85 and
        sepallength >= 5.95
    then 'virginica'
  elsif petallength >= 2.45 and
        petalwidth >= 1.75 and
        petallength < 4.85 and
        sepallength < 5.95
    then 'versicolor'
  elsif petallength >= 2.45 and
        petalwidth < 1.75 and
        petallength >= 4.95 and
        petalwidth >= 1.55 and
        sepallength >= 6.95
    then 'virginica'
  elsif petallength >= 2.45 and
        petalwidth < 1.75 and
        petallength >= 4.95 and
        petalwidth >= 1.55 and
        sepallength < 6.95
    then 'versicolor'
  elsif petallength >= 2.45 and
        petalwidth < 1.75 and
        petallength >= 4.95 and
        petalwidth < 1.55
    then 'virginica'
  elsif petallength >= 2.45 and
        petalwidth < 1.75 and
        petallength < 4.95 and
        petalwidth >= 1.65
    then 'virginica'
  elsif petallength >= 2.45 and
        petalwidth < 1.75 and
        petallength < 4.95 and
        petalwidth < 1.65
    then 'versicolor'
  elsif petallength < 2.45
    then 'setosa'
  else false
  end
end

Usage example:
puts classify(5.4,3.4,1.5,0.4) #setosa
puts classify(7.2,3.0,5.8,1.6) #virginica
puts classify(5.0,2.0,3.5,1.0) #versicolor

Why are gains normalized at each node?

Thank you for your great gem, Ilya!

We are currently working on a project where we need to evaluate the importance of each attribute for a given decision tree. This is done by summing, over the tree, the gain for each attribute.

We observe that the results we get are quite different from the results produced by the rpart package in R and the scikit-learn package in Python. This is explained by the fact that the gain is normalized at each node in the decisiontree gem (here and here).

Hence our question: is there any reason for normalizing the gain at each node in the decisiontree gem?

If not, we would like to make a pull request for computing a non-normalized gain and, maybe later, for adding a function that computes the importance of each attribute.
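
For reference, a minimal sketch of the unnormalized information gain this issue asks about, in plain illustrative Ruby (not the gem's internals):

# Shannon entropy of a list of class labels.
def entropy(labels)
  n = labels.size.to_f
  labels.uniq.sum do |label|
    p = labels.count(label) / n
    -p * Math.log2(p)
  end
end

# Unnormalized gain of splitting rows (attribute values plus a trailing label)
# on the attribute at the given index.
def information_gain(rows, index)
  n = rows.size.to_f
  subsets = rows.group_by { |row| row[index] }.values
  entropy(rows.map(&:last)) -
    subsets.sum { |s| (s.size / n) * entropy(s.map(&:last)) }
end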

Continuous and discrete attributes in the same dataset

Unless I'm missing something, it appears that this implementation does not deal well with a data set where some of the attributes are discrete and some are continuous. I may have a go at seeing how hard this is to implement, unless it is already supported and I'm just missing it somehow.

undefined method `to_a` for "abc":String

There is already one similar open issue.
Consider a training set such as:

[
        ["myself", 0, "#{my_name}"],
        ["friend", 1, "#{friend_name}"],
        ["family", 1, "#{member_name}"],
        ["other", 1, "#{other_name}"]
]

If all the methods (my_name, friend_name, etc.) return the same name, the decision tree breaks.

Undefined method `sum'

I tried to run examples/simple.rb on master, but it throws this error:

entropy': undefined method `sum' for [2, 3, 1]:Array (NoMethodError)

Gem version 0.5.0 works fine; maybe this was introduced by pull request #32.
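
If the cause is running on a Ruby older than 2.4 (where Array#sum entered the core) after the gem dropped its own core extension, one possible workaround (an assumption, not a confirmed fix) is to backfill the method before requiring the gem:

# Backfill Array#sum for Rubies < 2.4; a no-op where the method already exists.
class Array
  unless method_defined?(:sum)
    def sum
      inject(0) { |acc, x| acc + x }
    end
  end
end

require 'decisiontree'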

Data error

In your file, the first line is "Age,Education,Income,Marital Status". You should change it to "Age,Education,Income,Marital,Status".

no graph created

Generally everything works fine, but no graph is created. I used the example code from the blog article: no effect and no error; the application puts the result but does not create any PNG file.

Error if attributes have the same value in training set.

I've encountered a problem where if you have a training set like:

[[5, 1, "9.990941"], [5, 1, "9.990926"], [5, 1, "9.991411"], [5, 1, "9.991286"], [5, 1, "9.9916579615681"], [5, 1, "9.9917500513096"], [5, 1, "9.991682"], [5, 1, "9.991675"], [5, 1, "9.990981"], [5, 1, "9.990918"], [5, 1, "9.990918"], [5, 1, "9.990926"], [5, 1, "9.990934"], [5, 1, "9.9907993677691"], [5, 1, "9.9907108548716"], [5, 1, "9.9907190386056"], [5, 1, "9.9907190386056"]]

When the attributes do not vary and only the target feature does, I get an error:

undefined method `to_a' for 9.990941:String

I encounter the same issue if the attributes vary but the target feature does not.

I can't seem to graph anything using this gem

I copied and pasted the example code, and I have installed graphr and graphviz, but nothing happens when I run the code. I was wondering how the graph function is supposed to work. Nothing gets saved to my directory either.

Infinite recursion due to the handling of zero gains in id3_continuous

On a real production dataset with 5 explanatory variables and ~1000 lines, I received a SystemStackError: stack level too deep when calling DecisionTree::ID3Tree#train.

Trying to figure out what was happening, I built the following simple dataset, which reveals the bug:

attributes = ["X0", "X1", "X2", "X3"]
data = [["a", 0, 1, 1, 1], ["a", 0, 1, 0, 0], ["a", 0, 0, 1, 0], ["a", 0, 0, 0, 1]]
data_type = {X0: :discrete, X1: :continuous, X2: :continuous, X3: :continuous}
tree = DecisionTree::ID3Tree.new(attributes, data, 1, data_type)
tree.train  # SystemStackError is raised here!

The reason for this bug seems to lie in the specific output ([-1, -1]) of DecisionTree::ID3Tree#id3_continuous in the case where values.size == 1 (see this line).

Returning [0, -1] instead of [-1, -1] in the values.size == 1 and gain.size == 1 cases of #id3_continuous solves the problem.

It would also be relevant to stop the recursion in the case where selecting any variable leads to a zero gain. That can be done by adding the following line in #id3_train:

return data.first.last if performance.all? { |a, b| a <= 0 }

after this line:

performance = attributes.collect { |attribute| fitness_for(attribute).call(data, attributes, attribute) }

What do you think?

Do you want me to make a pull request with these changes?

Question about real usage

Hello,
This is not an issue, it's more like a question about real usage.
For example, I would like to use this gem in a project of mine. I basically have training data like this:

Header: Client, Worker, Technology, Status
Data with past Requests:
Client A, John, Ruby, Won
Client A, John, Ruby, Won
Client B, Bob, Java, Won
Client B, John, Ruby, Lost
Client C, John, HTML, Lost
Client C, Alice, HTML, Won
Client C, Bob, HTML, Lost
....

Now, when a new request comes in, I want advice on who the best worker for it is. For example, if I assign "Bob" to a new client, what is the chance that the status will be "Lost"?

I hope you got the idea.

How many records do we need to have?
What should we do if a request has many technologies? Duplicate rows (with the same client and status)?

Thanks
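
One way to frame this with the gem (a sketch using the rows above): treat Client, Worker, and Technology as discrete attributes and Status as the label. Note that ID3 predicts a class, not a probability, so "the chance that status will be Lost" is not directly available; counting the votes of the Bagging trainer could approximate it.

require 'decisiontree'

attributes = ['Client', 'Worker', 'Technology']
training = [
  ['Client A', 'John',  'Ruby', 'Won'],
  ['Client B', 'Bob',   'Java', 'Won'],
  ['Client B', 'John',  'Ruby', 'Lost'],
  ['Client C', 'John',  'HTML', 'Lost'],
  ['Client C', 'Alice', 'HTML', 'Won'],
  ['Client C', 'Bob',   'HTML', 'Lost']
]

dec_tree = DecisionTree::ID3Tree.new(attributes, training, 'Lost', :discrete)
dec_tree.train

# What do we expect if Bob takes a Ruby request for Client C? The trailing
# slot is where the true label would go in a test row.
puts dec_tree.predict(['Client C', 'Bob', 'Ruby', nil])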

Any chance we could push a new version of the gem?

Just wondering if we could get out a new version of the gem in time for rubyconf next week. I'll be demoing the continuous attribute support and want to make sure it works with the current gem version if possible.

requiring graphr in production

Hi, I'm using decisiontree for a rails app (thanks!).
I found that my app was crashing at initialization time because of a failed require of graphr (http://pastebin.com/uMUDwfZf).
I could work around it by including graphr in my Gemfile...
But I think that's wrong, because graphr is a development dependency only.
So I think the bug is rather the require of graphr in id3_tree.rb, which is executed in all environments.
I removed it in my fork (Numerico@6aa6e25#L0L6) and two tests broke.
I'd happily fix them and submit a pull request if you agree.

Ruby ML list

Dear Ilya,

we've recently added your decisiontree to our Awesome RubyML list: https://github.com/arbox/machine-learning-with-ruby

Your project is really important for the Ruby ML ecosystem, so we introduced the rubyml topic, which will ideally connect all ML-related projects written in and for Ruby on GitHub.

Please consider adding the rubyml topic to your repository :)

Thank you!

A line in ID3Tree#id3_continuous has no effect

Thank you, Ilya, for your great work. I work with @nicomahler and we look forward to contributing to this project.

As I worked on #19, I noticed that the last line of the code snippet below, extracted from ID3Tree#id3_continuous (line 117 in id3_tree.rb), seems to have no effect at all:

    gain = thresholds.collect { |threshold|
      (...)
      [data.classification.entropy - pos*sp[0].classification.entropy - neg*sp[1].classification.entropy, threshold]
    }.max { |a,b| a[0] <=> b[0] }

    return [-1, -1] if gain.size == 0  # gain.size is never 0, so this line has no effect (but it will raise an exception if gain is nil).

gain is the result of the Enumerable#max method applied to an array of 2-element arrays. Its value is either nil, if the array is empty (the case where thresholds is empty; I don't know if that can happen, we never encountered it), or a 2-element array. It can never be an empty array, so gain.size is never 0.

Graph output unreadable

Not sure if this nice gem is still maintained; I hope it is.

I am facing difficulties visualizing the tree.
Using the dec_tree.graph method, I am getting an unreadable graph for any tree that is larger than the small examples.

Here is a small part of the image I get: [screenshot]

  1. I saw a probably similar issue at #28.
  2. I have looked at this section of the code, to see if there are options I can pass (there aren't).
  3. I thought about generating my own graph, but I see that build_tree is private, and I am not yet sure how to get the nodes and their connections on my own.
  4. I saw in issue #11 that someone (you?) said "very few people use the actual graphing output"; so I wonder, how do most people generate a visual representation of their tree, if not with this function?

Am I missing something?

Thanks.

stack level too deep

I think I found a weird behaviour too.
Getting

Failure/Error: Unable to find matching line from backtrace
     SystemStackError:
       stack level too deep
     # /home/numerico/.rvm/gems/ruby-1.9.3-p392/gems/decisiontree-0.4.1/lib/decisiontree/id3_tree.rb:122

when I give it attributes with numerical names, e.g.:

attributes = ['1', '2'] # numerical names
training = [[1, 1, true], [1, 2, false]] # ...

but this would work

attributes = ['one', 'two'] # non-numerical names
training = # ...

Does this make any sense?

License missing from gemspec

RubyGems.org doesn't report a license for your gem. This is because it is not specified in the gemspec of your last release.

via e.g.

spec.license = 'MIT'
# or
spec.licenses = ['MIT', 'GPL-2']

Including a license in your gemspec is an easy way for rubygems.org and other tools to check how your gem is licensed. As you can imagine, scanning your repository for a LICENSE file or parsing the README, and then attempting to identify the license or licenses is much more difficult and more error prone. So, even for projects that already specify a license, including a license in your gemspec is a good practice. See, for example, how rubygems.org uses the gemspec to display the rails gem license.

There is even a License Finder gem to help companies/individuals ensure all gems they use meet their licensing needs. This tool depends on license information being available in the gemspec. This is an important enough issue that even Bundler now generates gems with a default 'MIT' license.

I hope you'll consider specifying a license in your gemspec. If not, please just close the issue with a nice message. In either case, I'll follow up. Thanks for your time!

Appendix:

If you need help choosing a license (sorry, I haven't checked your readme or looked for a license file), GitHub has created a license picker tool. Code without a license specified defaults to 'All rights reserved', denying others all rights to use of the code.
Here's a list of the license names I've found and their frequencies

p.s. In case you're wondering how I found you and why I made this issue: I'm collecting stats on gems (I was originally looking for download data) and decided to collect license metadata, too, and to open issues for gemspecs not specifying a license as a public service :). See the previous link or my blog post about this project for more information.
