
mldr's People

Contributors

fcharte, fdavidcl

mldr's Issues

Parse attributes enclosed in double quotes

mldr does not allow for attribute names to be enclosed within double quotes.

Example case: I modified birds to have one attribute in double (instead of single) quotes.

Error reported by R:

bb <- mldr("birds-test")
Warning message:
In matrix(unlist(strsplit(arff_data, ",", fixed = T)), ncol = num_attrs,  :
  data length [179955] is not a sub-multiple or multiple of the number of rows [648]
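
A parser that accepts both quoting styles could look roughly like the sketch below. It is only an illustration: `parse_attribute_name` and the sample attribute names are hypothetical and not part of mldr, and escaped quotes inside a name are not handled.

```r
# Hypothetical helper (not part of mldr): extract the attribute name from an
# ARFF "@attribute" line, accepting single quotes, double quotes, or no quotes.
parse_attribute_name <- function(line) {
  # Drop the leading "@attribute" keyword (case-insensitive) and whitespace
  rest <- sub("^\\s*@attribute\\s+", "", line, ignore.case = TRUE)
  quote_char <- substr(rest, 1, 1)
  if (quote_char %in% c("'", "\"")) {
    # Quoted name: take everything up to the matching closing quote
    closing <- regexpr(quote_char, substring(rest, 2), fixed = TRUE)
    substr(rest, 2, closing)
  } else {
    # Unquoted name: take everything up to the first whitespace
    sub("\\s.*$", "", rest)
  }
}

# Sample lines with made-up attribute names
single_q <- parse_attribute_name("@attribute 'mean freq' numeric")
double_q <- parse_attribute_name("@attribute \"mean freq\" numeric")
unquoted <- parse_attribute_name("@attribute duration numeric")
```

With this approach the single-quoted and double-quoted lines yield the same name, so the number of parsed attributes would no longer depend on the quoting style.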

categorical attribute

Hello

I'm having trouble using categorical attributes in mldr datasets.
The mldr function ignores the levels declared in the ARFF file.

For example, I took flags.arff and removed almost all instances (I left just 10).
When I read the dataset using mldr and access a categorical attribute like "language", it is not possible to know its possible values.

For example:

test <- mldr("flags-test")
test$dataset[,"language"]
[1] "6" "7" "2" "8" "10" "1" "10" "6" "10" "1"

But I expected something like this:

test$dataset[,"language"]
[1] 6 7 2 8 10 1 10 6 10 1
Levels: 1 10 2 3 4 5 6 7 8 9
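
One way to get the expected behaviour is to build the factor from the levels declared in the ARFF header rather than from the values that happen to occur. The sketch below assumes the attribute is declared as `{1,2,...,10}` (an assumption about flags.arff) and uses a hypothetical helper, `parse_nominal_levels`, that is not part of mldr.

```r
# Hypothetical helper (not mldr's API): turn the nominal specification of an
# ARFF attribute, e.g. "{1,2,3}", into a character vector of level names.
parse_nominal_levels <- function(spec) {
  inner <- gsub("^\\s*\\{|\\}\\s*$", "", spec)   # strip the surrounding braces
  trimws(strsplit(inner, ",", fixed = TRUE)[[1]])
}

# Assumed declaration for the "language" attribute
declared <- parse_nominal_levels("{1,2,3,4,5,6,7,8,9,10}")

# The ten values read from the reduced dataset, as shown in the issue
values <- c("6", "7", "2", "8", "10", "1", "10", "6", "10", "1")

# Passing the declared levels keeps values that never occur in the sample
language <- factor(values, levels = declared)
levels(language)
```

Here the levels keep the order of the declaration; the alphabetical order shown in the expected output above would need `levels = sort(declared)` instead.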

question

Hello, could you tell me how to use different classifiers in R? Specifically, only IBk and J48?

Did the meaning of recall change?

Hi there,

I am having some trouble understanding the new implementation of the recall() function (I'm really bad with functionals).

In particular, the following sample data of true and predicted labels gives a different recall in the old vs. new implementation:

multi.y <-
structure(c(TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE
), .Dim = c(4L, 2L), .Dimnames = list(c("1", "2", "3", "4"), 
    c("tar1.multilabel", "tar2.multilabel")))

multi.p <-
structure(c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE
), .Dim = c(4L, 2L), .Dimnames = list(c("1", "2", "3", "4"), 
    c("tar1.multilabel", "tar2.multilabel")))

### old mldr

counters = data.frame(
  RealPositives = rowSums(multi.y),
  RealNegatives = rowSums(!multi.y),
  PredictedPositives = rowSums(multi.p),
  PredictedNegatives = rowSums(!multi.p),
  TruePositives = rowSums(multi.y & multi.p),
  TrueNegatives = rowSums(!multi.y & !multi.p))

# Calculate example based recall
mldr_Recall <- function(counters) {
  mean(counters$TruePositives / counters$RealPositives, na.rm  = TRUE)
}

mldr_Recall(counters)  # 0.66667

### new mldr

mldr::recall(multi.y, multi.p)  # 0.75

The mlr package uses mldr for a few tests here, and the updates caused a few of its tests to fail.

Could you explain what happened there? I couldn't figure it out myself, unfortunately.

Thanks in advance!
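
For reference, both numbers can be reproduced by hand from the sample data. The third example has no true labels, so its recall is 0 / 0. The sketch below is only an arithmetic observation: that the new implementation scores such undefined cases as 1 is an assumption, not something confirmed from mldr's source.

```r
# Same data as in the issue, written as plain logical matrices
multi.y <- matrix(c(TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE), ncol = 2)
multi.p <- matrix(c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE), ncol = 2)

true_pos <- rowSums(multi.y & multi.p)
real_pos <- rowSums(multi.y)
per_example <- true_pos / real_pos   # third row is 0 / 0, i.e. NaN

# Dropping the undefined row matches the old result (0.66667)
recall_drop <- mean(per_example, na.rm = TRUE)

# Scoring the undefined row as 1 matches the new result (0.75)
recall_one <- mean(ifelse(is.nan(per_example), 1, per_example))
```

So the gap between 0.66667 and 0.75 is fully explained by how the one undefined example is treated, not by a change in the definition of recall for the other three.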

Error to filter data

When the labels are in the first columns of the dataset, the filter function returns wrong results.
Example:

df <- data.frame(matrix(rnorm(1000), ncol = 10))
df$Label1 <- c(sample(c(0,1), 100, replace = TRUE))
df$Label2 <- c(sample(c(0,1), 100, replace = TRUE))
mymldr <- mldr_from_dataframe(df[rev(1:12)], labelIndices = c(1, 2), name = "testMLDR")

fold1 <- mymldr[1:10]
fold1$dataset[fold1$labels$index]

My solution for this was:

"[.mldr" <- function(mldrObject, rowFilter = T) {
  ...
  newDataset <- mldrObject$dataset[rows, seq(mldrObject$measures$num.attributes)]
  ...
}

Best

Bug in undefined strategy selection

> mldr::fmeasure(truth, response, undefined_value = 0)
Error in undefined_strats[[value]] : 
  attempt to select less than one element in get1index <real>
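
The message suggests that the numeric value is used directly as an index into a list of named strategies. The sketch below reproduces that failure with a plain list and shows one possible guard. The names `strategies` and `resolve_strategy` are hypothetical and the strategy table is illustrative, not mldr's own.

```r
# Hypothetical strategy table; the names are illustrative, not mldr's own
strategies <- list(
  ignore = function(tp, fp, tn, fn) NA_real_,
  one    = function(tp, fp, tn, fn) 1
)

# Indexing a list with the number 0 raises an error, as in the report
msg <- tryCatch(strategies[[0]], error = function(e) conditionMessage(e))

# Sketch of a guard: a number becomes a constant-returning function,
# a string is looked up by name, anything else is rejected
resolve_strategy <- function(value) {
  if (is.numeric(value)) {
    function(...) value
  } else if (is.character(value) && value %in% names(strategies)) {
    strategies[[value]]
  } else {
    stop("unknown strategy for undefined values: ", value)
  }
}

zero_result <- resolve_strategy(0)(tp = 0, fp = 0, tn = 4, fn = 0)
one_result <- resolve_strategy("one")(tp = 0, fp = 0, tn = 4, fn = 0)
```

Checking the type before indexing keeps numeric replacements and named strategies on separate paths, so a value such as 0 never reaches `[[`.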

plot(birds) doesn't work

I get the following error message:

> library(mldr)
Enter mldrGUI() to launch mldr's web-based GUI
> plot(birds)
Error in if (nrow(value) == length(rn) && ncol(value) == length(cn)) { : 
    missing value where TRUE/FALSE needed

I get the same message when I try to plot various datasets...

Version: I just installed mldr from github, R 3.2.0 from the CRAN ubuntu repository on Ubuntu vivid (15.04).

Question about implementation of Measures

Hi!
I have a few questions/remarks about the implementation of the example based measures Accuracy, Precision and Recall.

  • I am wondering why the numerators in your implementation are different.
  • Could you explain the denominator of your implementation of Accuracy? Why is it the sum of predicted positives and predicted negatives? Shouldn't the code for Accuracy be something like:
mldr_Accuracy <- function(counters) {
  mean((counters$TruePositives) / (counters$PredictedPositives + counters$RealPositives - counters$TruePositives))
}

I added your source code and a screenshot of your vignette of the three measures below:

[screenshot: measures]

# Calculate example based accuracy
mldr_Accuracy <- function(counters) {
  mean((counters$TruePositives + counters$TrueNegatives) / (counters$PredictedPositives + counters$PredictedNegatives))
}

# Calculate example based precision
mldr_Precision <- function(counters) {
  mean(counters$TruePositives / counters$PredictedPositives, na.rm = TRUE)
}

# Calculate example based recall
mldr_Recall <- function(counters) {
  mean(counters$TruePositives / counters$RealPositives, na.rm  = TRUE)
}
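
To make the difference concrete, the sketch below evaluates both formulas on a small made-up example (two instances, three labels). Since PredictedPositives + PredictedNegatives always equals the number of labels, the quoted accuracy is the fraction of labels predicted correctly per instance, whereas the proposed formula is the Jaccard index of the true and predicted label sets. The data and variable names are illustrative only.

```r
# Made-up truth and predictions: rows are instances, columns are labels
y <- matrix(c(TRUE, FALSE, TRUE,
              TRUE, FALSE, FALSE), nrow = 2, byrow = TRUE)
p <- matrix(c(TRUE, TRUE, FALSE,
              FALSE, FALSE, FALSE), nrow = 2, byrow = TRUE)

counters <- data.frame(
  RealPositives      = rowSums(y),
  PredictedPositives = rowSums(p),
  PredictedNegatives = rowSums(!p),
  TruePositives      = rowSums(y & p),
  TrueNegatives      = rowSums(!y & !p)
)

# Formula from the quoted source: correct labels / number of labels
agreement <- mean((counters$TruePositives + counters$TrueNegatives) /
                  (counters$PredictedPositives + counters$PredictedNegatives))

# Formula proposed in the question: |Y and Z| / |Y or Z| (Jaccard index)
jaccard <- mean(counters$TruePositives /
                (counters$PredictedPositives + counters$RealPositives -
                 counters$TruePositives))
```

On this data the two definitions give 0.5 and about 0.167 respectively, so they are clearly not interchangeable. Note that the Jaccard form is 0 / 0 for an instance with no true and no predicted labels.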

Tables inside "Attributes" tab doesn't look good

The tables used to show statistics for each attribute inside the "Attributes" page are not rendered properly in some browsers, such as Firefox running on OS X. HTML tags are shown instead of the tables.

circular plot in mldr

Hi,
I am trying to use the package. I did understand the LB histogram, for example. Numbers show the label frequencies. However, I don't understand numbers shown in each arc of the circular plot. They don't represent label frequencies and don't sum to the total number of instances in the dataset. Additionally, some labels don't appear in this plot. For example, I tried to plot the scene dataset and only 5 labels out of 6 were shown in the LC plot. Could you please elaborate?
Reem

NA values instead of zeroes in sparse ARFF files

It seems to me that, when reading sparse ARFF files (like the ones in the yahoo dataset), mldr is interpreting omitted values as NA values instead of zeros, which differs from the ARFF format description:

"Note that the omitted values in a sparse instance are 0, they are not "missing" values! If a value is unknown, you must explicitly represent it with a question mark (?)."

Due to this behaviour of the mldr parser, I am having to manually replace NA values with zeroes, and then rebuild the datasets.

Thank you very much for the great work in mldr!
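
The manual workaround described above can be sketched as follows, using a small made-up data frame in place of a real sparse dataset. It assumes all columns are numeric, and it cannot tell an omitted value from one that was explicitly given as "?" in the file, because both have already become NA at this point.

```r
# Made-up stand-in for a data frame read from a sparse ARFF file,
# where omitted entries were turned into NA
sparse_df <- data.frame(
  word1 = c(1, NA, 3),
  word2 = c(NA, NA, 2),
  label = c(0, 1, NA)
)

# Replace every NA with 0, as the ARFF description prescribes for
# omitted values in sparse instances
filled <- sparse_df
filled[is.na(filled)] <- 0
```

The resulting data frame could then be passed to `mldr_from_dataframe` to rebuild the dataset.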

bug in write_arff for dataset birds

I don't know why, but if I save the birds dataset and then try to load it, it does not work,
like this:

library(mldr)
data("birds")
write_arff(birds, "birds-test", T)
test <- mldr("birds-test")

Error in matrix(unlist(value, recursive = FALSE, use.names = FALSE), nrow = nr, :
'data' must be of a vector type, was 'NULL'
In addition: Warning messages:
1: In matrix(unlist(strsplit(arff_data, ",", fixed = T)), ncol = num_attrs, :
data length [179955] is not a sub-multiple or multiple of the number of rows [693]
2: In max(msr$count) : no non-missing arguments to max; returning -Inf

I'm using mldr_0.3.22.
Thanks.

read_header() fails to parse some headers

The read_header() function fails parsing the following patterns:

> read_header("@relation 'Slashdot: -C 22 -split-number 1891' %50")
$name
[1] "22"
# Doesn't return the correct name and misses the number of labels

> read_header("@relation 'birds'")
Error in strsplit(regmatches(arff_relation, rgx), "\\s*:\\s*-[Cc]\\s*")[[1]] : 
  subscript out of bounds
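
A more tolerant way to parse both patterns is sketched below. `parse_relation` is a hypothetical function, not mldr's `read_header`. It assumes the relation name is whatever precedes the first colon and that the label count, when present, follows a `-C` option.

```r
# Hypothetical sketch (not mldr's read_header): extract the relation name and
# an optional "-C <n>" label count from an ARFF "@relation" line.
parse_relation <- function(line) {
  body <- sub("^\\s*@relation\\s+", "", line, ignore.case = TRUE)

  # If the value is quoted, keep only the text between the quotes;
  # otherwise drop a trailing "%..." comment and surrounding whitespace
  quoted <- regmatches(body, regexec("^(['\"])(.*)\\1", body, perl = TRUE))[[1]]
  content <- if (length(quoted) > 0) quoted[3] else trimws(sub("%.*$", "", body))

  # Optional label count: absent in plain headers such as 'birds'
  count <- regmatches(content, regexec("-[Cc]\\s+(-?[0-9]+)", content))[[1]]
  num_labels <- if (length(count) > 0) as.integer(count[2]) else NA_integer_

  # The name is whatever precedes the first ":" (or the whole string)
  list(name = trimws(sub(":.*$", "", content)), num_labels = num_labels)
}

h1 <- parse_relation("@relation 'Slashdot: -C 22 -split-number 1891' %50")
h2 <- parse_relation("@relation 'birds'")
```

Under these assumptions the first header yields the name "Slashdot" with 22 labels, and the second yields "birds" with a missing label count instead of an error.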

Bugfix report (and possible solution)

Hello, I use this package, and for some sparse datasets I got an error when calling the mldr function.
For example: Yahoo datasets http://sourceforge.net/projects/mulan/files/datasets/yahoo.rar

I debugged this error, so I am reporting it and proposing a simple solution.
The function parse_sparse_data splits the data using:
unlist(strsplit(item, " "))

However, in some datasets there is a space after the comma, so the result of this code is slightly different from what is expected. My suggested solution is to trim the item first, like this:
unlist(strsplit(gsub("^\\s+|\\s+$", "", item), " "))

In my tests this works well.
Thanks
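
The effect of the leading space is easy to see on a made-up sparse row. In the sketch below the row content and variable names are illustrative. `trimws` from base R is used for the trimming, which should be equivalent to the `gsub` call suggested above.

```r
# Made-up sparse row body with a space after the comma
row <- "0 1, 3 1"
items <- strsplit(row, ",", fixed = TRUE)[[1]]   # "0 1" and " 3 1"

# Splitting the second item as-is yields an empty first element,
# so the "index value" pair appears to have three parts
untrimmed <- unlist(strsplit(items[2], " "))

# Trimming first gives the expected two parts
trimmed <- unlist(strsplit(trimws(items[2]), " "))
```

Here `untrimmed` has three elements, the first being an empty string, while `trimmed` holds just the index and the value.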

multilabel classification

Hello, I am a graduate student from China. My research area is multi-label classification. Recently I have been reading some of your articles on multi-label classification, and I think they are really great. May I ask you a question? Do you use R to pre-process the data and then feed the processed data into MEKA for classification?

v0.4.1 breaks mlr test

Hi,

just wanted to let you know that the recent update broke one of our tests:

mlr-org/mlr#2524

Haven't yet taken a closer look if the problem is on our side or yours. Could you reply briefly which side should take action? :)
Thanks.

Read ARFF files calculating no measures

Sometimes I have to read ARFF files and pre-process them, and in those cases, the measures calculated by mldr are not useful, and also take a lot of time to calculate. I tried the function read.arff from RWeka, but it does not seem to discriminate between features and labels, as mldr does. Therefore, it would be nice if mldr allowed me to just read a dataset, without calculating the measures, just providing me a data.frame and the indexes of features and labels. So, after pre-processing the data.frame, I would build an mldr object by means of mldr_from_dataframe. Thanks for mldr!

An argument in mldr_evaluate to specify if the predictions given as input are bipartitions or rankings

An execution of the mldr_evaluate function calculates two kinds of metrics (the bipartition based and ranking based evaluation metrics), but (if I understood well) the predictions given as input are bipartitions or rankings, but not both. Would it make sense to have an argument at mldr_evaluate in which one specifies if the output of the classifier is a bipartition or a ranking, in order to avoid calculating and showing unnecessary metrics?

Alternatively, I think it would be easier to read the output of mldr_evaluate if the bipartition based and ranking based evaluation metrics were grouped under two sublists (e.g., "bipartition metrics" and "ranking metrics").

Sorry if this is nonsense, and thanks for mldr!
