fcharte / mldr
R package for analyzing and manipulating multilabel datasets
License: GNU Lesser General Public License v3.0
In metrics such as FMeasure, the na.rm parameter causes the mean function to ignore instances with NA values, which results in different (overestimated) values than a sum(data, na.rm = TRUE)/length(data) approach.
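A minimal example of the difference (values invented): mean() with na.rm = TRUE divides by the number of non-NA instances, while the sum/length approach divides by all instances, effectively counting NAs as zero.

```r
x <- c(0.5, 1, NA, 0)
mean(x, na.rm = TRUE)            # 0.5   (sums 1.5, divides by 3 non-NA values)
sum(x, na.rm = TRUE) / length(x) # 0.375 (sums 1.5, divides by all 4 values)
```

This is why the first variant yields the higher (overestimated) metric values.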
Thanks for your good-looking and practical tool. I'm not familiar with R; I have just configured the environment. I can't load my XML file in the GUI. Which format should I use to import my data?
My XML file looks like https://i.bmp.ovh/imgs/2020/10/9bb82f146577643b.png.
It can be downloaded from https://cowtransfer.com/s/23880f576c3940.
A header line containing an escaped quote, such as
@attribute TAG_m\'usica {0, 1}
should be tokenized as "@attribute", "TAG_m\\'usica", "{0, 1}".
mldr does not allow attribute names to be enclosed in double quotes.
Example case: I modified birds so that one attribute uses double (instead of single) quotes.
Error reported by R:
bb <- mldr("birds-test")
Warning message:
In matrix(unlist(strsplit(arff_data, ",", fixed = T)), ncol = num_attrs, :
  data length [179955] is not a sub-multiple or multiple of the number of rows [648] in the matrix
The ARFF parser fails when header tags are written in uppercase, e.g. @RELATION instead of @relation.
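A possible fix is to match the header tags case-insensitively; a minimal sketch (the regex is illustrative, not mldr's actual code):

```r
# Match @relation regardless of case, so @RELATION and @Relation also work
lines <- c("@RELATION birds", "@attribute size numeric")
grepl("^\\s*@relation", lines, ignore.case = TRUE)  # TRUE FALSE
```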
Add a file with details on how to contribute to the project.
Hello,
I'm having trouble with categorical attributes in mldr datasets.
The mldr function ignores the levels contained in the ARFF file.
For example, I took flags.arff and removed almost all instances (I left just 10).
When I read the dataset using mldr and access a categorical attribute such as "language", it is not possible to know its possible values.
For example:
test <- mldr("flags-test")
test$dataset[,"language"]
[1] "6" "7" "2" "8" "10" "1" "10" "6" "10" "1"
But I expected something like this:
test$dataset[,"language"]
[1] 6 7 2 8 10 1 10 6 10 1
Levels: 1 10 2 3 4 5 6 7 8 9
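The expected behaviour can be sketched by building the factor from the levels declared in the ARFF header rather than from the values present (the declared levels here are assumed from the example above):

```r
# Levels as declared in the header, e.g. "@attribute language {1,2,...,10}"
declared_levels <- as.character(1:10)
values <- c("6", "7", "2", "8", "10", "1", "10", "6", "10", "1")
language <- factor(values, levels = declared_levels)
levels(language)  # all ten declared levels, even those absent from the 10 instances
```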
With an external MLD from the mldr.datasets package already loaded, such as enron, an attempt to load it again produces an error.
Ranking-based evaluation metrics (Average Precision, Coverage...) use order instead of rank to calculate label rankings. These functions produce different results [source], so the metrics could be wrong.
Hello, could you tell me how to use other classifiers in R? Are only IBk and J48 available?
Hi there,
I am having some trouble understanding the new implementation of the recall() function (I'm really bad with functionals).
In particular, the following sample data of true and predicted labels gives a different recall in the old vs. new implementation:
multi.y <-
structure(c(TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE
), .Dim = c(4L, 2L), .Dimnames = list(c("1", "2", "3", "4"),
c("tar1.multilabel", "tar2.multilabel")))
multi.p <-
structure(c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE
), .Dim = c(4L, 2L), .Dimnames = list(c("1", "2", "3", "4"),
c("tar1.multilabel", "tar2.multilabel")))
### old mldr
counters = data.frame(
RealPositives = rowSums(multi.y),
RealNegatives = rowSums(!multi.y),
PredictedPositives = rowSums(multi.p),
PredictedNegatives = rowSums(!multi.p),
TruePositives = rowSums(multi.y & multi.p),
TrueNegatives = rowSums(!multi.y & !multi.p))
# Calculate example based recall
mldr_Recall <- function(counters) {
mean(counters$TruePositives / counters$RealPositives, na.rm = TRUE)
}
mldr_Recall(counters) # 0.66667
### new mldr
mldr::recall(multi.y, multi.p) # 0.75
The mlr package uses mldr for a few tests here, and the updates caused a few of its tests to fail.
Could you explain what happened there? I couldn't figure it out myself, unfortunately.
Thanks in advance!
When the labels are in the first columns of the dataset, the filter function is wrong.
Example:
df <- data.frame(matrix(rnorm(1000), ncol = 10))
df$Label1 <- c(sample(c(0,1), 100, replace = TRUE))
df$Label2 <- c(sample(c(0,1), 100, replace = TRUE))
mymldr <- mldr_from_dataframe(df[rev(1:12)], labelIndices = c(1, 2), name = "testMLDR")
fold1 <- mymldr[1:10]
fold1$dataset[fold1$labels$index]
My solution for this was:
"[.mldr" <- function(mldrObject, rowFilter = T) {
...
newDataset <- mldrObject$dataset[rows, seq(mldrObject$measures$num.attributes)]
...
}
Best
> mldr::fmeasure(truth, response, undefined_value = 0)
Error in undefined_strats[[value]] :
attempt to select less than one element in get1index <real>
I get the following error message:
> library(mldr)
Enter mldrGUI() to launch mldr's web-based GUI
> plot(birds)
Error in if (nrow(value) == length(rn) && ncol(value) == length(cn)) { :
missing value where TRUE/FALSE needed
I get the same message when I try to plot various datasets...
Version: I just installed mldr from github, R 3.2.0 from the CRAN ubuntu repository on Ubuntu vivid (15.04).
Hi!
I have a few questions/remarks about the implementation of the example based measures Accuracy, Precision and Recall.
mldr_Accuracy <- function(counters) {
mean((counters$TruePositives) / (counters$PredictedPositives + counters$RealPositives - counters$TruePositives))
}
I've included your source code for the three measures below, along with a screenshot from your vignette:
# Calculate example based accuracy
mldr_Accuracy <- function(counters) {
mean((counters$TruePositives + counters$TrueNegatives) / (counters$PredictedPositives + counters$PredictedNegatives))
}
# Calculate example based precision
mldr_Precision <- function(counters) {
mean(counters$TruePositives / counters$PredictedPositives, na.rm = TRUE)
}
# Calculate example based recall
mldr_Recall <- function(counters) {
mean(counters$TruePositives / counters$RealPositives, na.rm = TRUE)
}
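For concreteness, a minimal worked example applying these definitions (the data is invented for illustration):

```r
# Two instances, two labels each (column-major: col 1 = label 1, col 2 = label 2)
y <- matrix(c(TRUE, TRUE, TRUE, FALSE), nrow = 2)  # true labels
p <- matrix(c(TRUE, FALSE, TRUE, TRUE), nrow = 2)  # predicted labels
counters <- data.frame(
  RealPositives      = rowSums(y),      # instance 1: 2, instance 2: 1
  PredictedPositives = rowSums(p),      # instance 1: 2, instance 2: 1
  TruePositives      = rowSums(y & p))  # instance 1: 2, instance 2: 0
# Example-based precision and recall, as defined above
mean(counters$TruePositives / counters$PredictedPositives, na.rm = TRUE)  # 0.5
mean(counters$TruePositives / counters$RealPositives, na.rm = TRUE)       # 0.5
```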
Some calculations in measures.R may be insufficiently commented as to how they correspond to their definition.
The computation of the SCUMBLE.CV for each label fails with some MLDs, genbase for instance.
The tables used to show statistics for each attribute inside the "Attributes" page are not rendered properly in some browsers, such as Firefox running in OS X. HTML tags are shown instead of the tables.
Hi,
I am trying to use the package. I did understand the LB histogram, for example. Numbers show the label frequencies. However, I don't understand numbers shown in each arc of the circular plot. They don't represent label frequencies and don't sum to the total number of instances in the dataset. Additionally, some labels don't appear in this plot. For example, I tried to plot the scene dataset and only 5 labels out of 6 were shown in the LC plot. Could you please elaborate?
Reem
It seems to me that, when reading sparse ARFF files (like the ones in the yahoo dataset), mldr interprets omitted values as NA instead of zeros, which differs from the ARFF format description:
"Note that the omitted values in a sparse instance are 0, they are not "missing" values! If a value is unknown, you must explicitly represent it with a question mark (?)."
Due to this behaviour of the mldr parser, I have to manually replace NA values with zeroes and then rebuild the datasets.
Thank you very much for the great work in mldr!
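A minimal sketch of the manual workaround described above (the data frame and column names are invented for illustration):

```r
# A small data frame standing in for a sparse-ARFF dataset read by mldr,
# where omitted values came back as NA instead of 0
df <- data.frame(f1 = c(1, NA, 2), label1 = c(1, 0, NA))
df[is.na(df)] <- 0  # per the ARFF spec, omitted sparse values mean 0
# the mldr object can then be rebuilt with mldr_from_dataframe(df, ...)
```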
I don't know why, but if I save and then try to load the birds dataset, it does not work, like this:
library(mldr)
data("birds")
write_arff(birds, "birds-test", T)
test <- mldr("birds-test")
Error in matrix(unlist(value, recursive = FALSE, use.names = FALSE), nrow = nr, :
'data' must be of a vector type, was 'NULL'
In addition: Warning messages:
1: In matrix(unlist(strsplit(arff_data, ",", fixed = T)), ncol = num_attrs, :
data length [179955] is not a sub-multiple or multiple of the number of rows [693]
2: In max(msr$count) : no non-missing arguments to max; returning -Inf
I'm using mldr 0.3.22.
Thanks.
Ability to pass a vector of names as files to be loaded.
In MEKA, a negative offset means reading labels from the end
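A sketch of that convention (the function name and signature are hypothetical, not MEKA's or mldr's API): with n attributes and MEKA option -C c, a non-negative c takes the first c attributes as labels, while a negative c takes the last |c|.

```r
# Hypothetical helper translating MEKA's -C value into label column indices
label_indices <- function(c, n) {
  if (c >= 0) seq_len(c) else (n + c + 1):n
}
label_indices(3, 10)   # 1 2 3  : labels at the beginning
label_indices(-3, 10)  # 8 9 10 : labels read from the end
```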
The read_header() function fails to parse the following patterns:
> read_header("@relation 'Slashdot: -C 22 -split-number 1891' %50")
$name
[1] "22"
# Doesn't return the correct name and misses the number of labels
> read_header("@relation 'birds'")
Error in strsplit(regmatches(arff_relation, rgx), "\\s*:\\s*-[Cc]\\s*")[[1]] :
subscript out of bounds
Hello, I use this package and for some sparse datasets I got an error when calling the mldr function.
For example, the Yahoo datasets: http://sourceforge.net/projects/mulan/files/datasets/yahoo.rar
I debugged the error, and I report it here along with a simple proposed solution.
The function parse_sparse_data splits the data using:
unlist(strsplit(item, " "))
However, in some datasets there is a space after the comma, so the result of this code differs slightly from what is expected. My suggested solution is to trim the item first, like this:
unlist(strsplit(gsub("^\\s+|\\s+$", "", item), " "))
In my tests this works well.
Thanks
Hello, I am a graduate student from China. My research area is multi-label classification, and I recently read some of your articles on the topic; they are really great. May I ask you a question? Do you use R to pre-process the data and then feed the processed data into MEKA for classification?
Hi,
just wanted to let you know that the recent update broke one of our tests.
I haven't yet taken a closer look at whether the problem is on our side or yours. Could you briefly reply about which side should take action? :)
Thanks.
Add load bars or wait indicators during dataset load and algorithm execution.
The read_xml() function fails to read the labels when the XML file contains whitespace between a label's opening and closing tags.
Sometimes I have to read ARFF files and pre-process them, and in those cases, the measures calculated by mldr are not useful, and also take a lot of time to calculate. I tried the function read.arff from RWeka, but it does not seem to discriminate between features and labels, as mldr does. Therefore, it would be nice if mldr allowed me to just read a dataset, without calculating the measures, just providing me a data.frame and the indexes of features and labels. So, after pre-processing the data.frame, I would build an mldr object by means of mldr_from_dataframe. Thanks for mldr!
The mldr() function fails trying to load a dataset. The bug is in the attribute parsing function, which looks for lowercase "@Attribute" tags instead of ignoring case.
An execution of the mldr_evaluate function calculates two kinds of metrics (the bipartition based and ranking based evaluation metrics), but (if I understood well) the predictions given as input are bipartitions or rankings, but not both. Would it make sense to have an argument at mldr_evaluate in which one specifies if the output of the classifier is a bipartition or a ranking, in order to avoid calculating and showing unnecessary metrics?
Alternatively, I think it would be easier to read the output of mldr_evaluate if the bipartition based and ranking based evaluation metrics were grouped under two sublists (e.g., "bipartition metrics" and "ranking metrics").
Sorry if this is nonsense, and thanks for mldr!