fcharte / mldr
R package for analyzing and manipulating multilabel datasets
License: GNU Lesser General Public License v3.0
In metrics such as FMeasure, the na.rm parameter causes the mean function to ignore instances with NA values, which results in different (overestimated) values than a sum(data, na.rm = TRUE)/length(data) approach.
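A minimal example of the difference (values invented): mean() with na.rm = TRUE divides by the number of non-NA instances, while the sum/length approach divides by all instances, effectively counting NAs as zero.

```r
x <- c(0.5, 1, NA, 0)
mean(x, na.rm = TRUE)            # 0.5   (sums 1.5, divides by 3 non-NA values)
sum(x, na.rm = TRUE) / length(x) # 0.375 (sums 1.5, divides by all 4 values)
```

This is why the first variant yields the higher (overestimated) metric values.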
Thanks for your good-looking and practical tool. I'm not familiar with R; I have just configured the environment. I can't load my XML file in the GUI. Which format should I use to import my data?
My XML file looks like https://i.bmp.ovh/imgs/2020/10/9bb82f146577643b.png.
It can be downloaded from https://cowtransfer.com/s/23880f576c3940.
A header line containing an escaped quote, such as
@attribute TAG_m\'usica {0, 1}
should be tokenized as "@attribute", "TAG_m\\'usica", "{0, 1}".
mldr does not allow attribute names to be enclosed in double quotes.
Example case: I modified birds so that one attribute uses double (instead of single) quotes.
Error reported by R:
bb <- mldr("birds-test")
Warning message:
In matrix(unlist(strsplit(arff_data, ",", fixed = T)), ncol = num_attrs, :
  data length [179955] is not a sub-multiple or multiple of the number of rows [648] in the matrix
The ARFF parser fails when header tags are written in uppercase, e.g. @RELATION instead of @relation.
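A possible fix is to match the header tags case-insensitively; a minimal sketch (the regex is illustrative, not mldr's actual code):

```r
# Match @relation regardless of case, so @RELATION and @Relation also work
lines <- c("@RELATION birds", "@attribute size numeric")
grepl("^\\s*@relation", lines, ignore.case = TRUE)  # TRUE FALSE
```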
Add a file with details on how to contribute to the project.
Hello,
I'm having trouble with categorical attributes in mldr datasets.
The mldr function ignores the levels contained in the ARFF file.
For example, I took flags.arff and removed almost all instances (I left just 10).
When I read the dataset using mldr and access a categorical attribute such as "language", it is not possible to know its possible values.
For example:
test <- mldr("flags-test")
test$dataset[,"language"]
[1] "6" "7" "2" "8" "10" "1" "10" "6" "10" "1"
But I expected something like this:
test$dataset[,"language"]
[1] 6 7 2 8 10 1 10 6 10 1
Levels: 1 10 2 3 4 5 6 7 8 9
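The expected behaviour can be sketched by building the factor from the levels declared in the ARFF header rather than from the values present (the declared levels here are assumed from the example above):

```r
# Levels as declared in the header, e.g. "@attribute language {1,2,...,10}"
declared_levels <- as.character(1:10)
values <- c("6", "7", "2", "8", "10", "1", "10", "6", "10", "1")
language <- factor(values, levels = declared_levels)
levels(language)  # all ten declared levels, even those absent from the 10 instances
```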
With an external MLD from the mldr.datasets package already loaded, such as enron, an attempt to load it again produces an error.
Ranking-based evaluation metrics (Average Precision, Coverage...) use order instead of rank to calculate label rankings. These functions produce different results [source], so the metrics could be wrong.
Hello, could you tell me how to use other classifiers in R? Are only IBk and J48 available?
Hi there,
I am having some trouble understanding the new implementation of the recall() function (I'm really bad with functionals).
In particular, the following sample data of true and predicted labels gives a different recall in the old vs. new implementation:
multi.y <-
structure(c(TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE
), .Dim = c(4L, 2L), .Dimnames = list(c("1", "2", "3", "4"),
c("tar1.multilabel", "tar2.multilabel")))
multi.p <-
structure(c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE
), .Dim = c(4L, 2L), .Dimnames = list(c("1", "2", "3", "4"),
c("tar1.multilabel", "tar2.multilabel")))
### old mldr
counters = data.frame(
RealPositives = rowSums(multi.y),
RealNegatives = rowSums(!multi.y),
PredictedPositives = rowSums(multi.p),
PredictedNegatives = rowSums(!multi.p),
TruePositives = rowSums(multi.y & multi.p),
TrueNegatives = rowSums(!multi.y & !multi.p))
# Calculate example based recall
mldr_Recall <- function(counters) {
mean(counters$TruePositives / counters$RealPositives, na.rm = TRUE)
}
mldr_Recall(counters) # 0.66667
### new mldr
mldr::recall(multi.y, multi.p) # 0.75
The mlr package uses mldr for a few tests here, and the updates caused a few of its tests to fail.
Could you explain what happened there? I couldn't figure it out myself, unfortunately.
Thanks in advance!
When the labels are in the first columns of the dataset, the filter function is wrong.
Example:
df <- data.frame(matrix(rnorm(1000), ncol = 10))
df$Label1 <- c(sample(c(0,1), 100, replace = TRUE))
df$Label2 <- c(sample(c(0,1), 100, replace = TRUE))
mymldr <- mldr_from_dataframe(df[rev(1:12)], labelIndices = c(1, 2), name = "testMLDR")
fold1 <- mymldr[1:10]
fold1$dataset[fold1$labels$index]
My solution for this was:
"[.mldr" <- function(mldrObject, rowFilter = T) {
...
newDataset <- mldrObject$dataset[rows, seq(mldrObject$measures$num.attributes)]
...
}
Best
> mldr::fmeasure(truth, response, undefined_value = 0)
Error in undefined_strats[[value]] :
attempt to select less than one element in get1index <real>
I get the following error message:
> library(mldr)
Enter mldrGUI() to launch mldr's web-based GUI
> plot(birds)
Error in if (nrow(value) == length(rn) && ncol(value) == length(cn)) { :
missing value where TRUE/FALSE needed
I get the same message when I try to plot various datasets...
Version: I just installed mldr from github, R 3.2.0 from the CRAN ubuntu repository on Ubuntu vivid (15.04).
Hi!
I have a few questions/remarks about the implementation of the example based measures Accuracy, Precision and Recall.
mldr_Accuracy <- function(counters) {
mean((counters$TruePositives) / (counters$PredictedPositives + counters$RealPositives - counters$TruePositives))
}
I've included your source code for the three measures below, along with a screenshot from your vignette:
# Calculate example based accuracy
mldr_Accuracy <- function(counters) {
mean((counters$TruePositives + counters$TrueNegatives) / (counters$PredictedPositives + counters$PredictedNegatives))
}
# Calculate example based precision
mldr_Precision <- function(counters) {
mean(counters$TruePositives / counters$PredictedPositives, na.rm = TRUE)
}
# Calculate example based recall
mldr_Recall <- function(counters) {
mean(counters$TruePositives / counters$RealPositives, na.rm = TRUE)
}
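For concreteness, a minimal worked example applying these definitions (the data is invented for illustration):

```r
# Two instances, two labels each (column-major: col 1 = label 1, col 2 = label 2)
y <- matrix(c(TRUE, TRUE, TRUE, FALSE), nrow = 2)  # true labels
p <- matrix(c(TRUE, FALSE, TRUE, TRUE), nrow = 2)  # predicted labels
counters <- data.frame(
  RealPositives      = rowSums(y),      # instance 1: 2, instance 2: 1
  PredictedPositives = rowSums(p),      # instance 1: 2, instance 2: 1
  TruePositives      = rowSums(y & p))  # instance 1: 2, instance 2: 0
# Example-based precision and recall, as defined above
mean(counters$TruePositives / counters$PredictedPositives, na.rm = TRUE)  # 0.5
mean(counters$TruePositives / counters$RealPositives, na.rm = TRUE)       # 0.5
```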
Some calculations in measures.R may be insufficiently commented as to how they correspond to their definition.
The computation of the SCUMBLE.CV for each label fails with some MLDs, genbase for instance.
The tables used to show statistics for each attribute inside the "Attributes" page are not rendered properly in some browsers, such as Firefox running in OS X. HTML tags are shown instead of the tables.
Hi,
I am trying to use the package. I did understand the LB histogram, for example. Numbers show the label frequencies. However, I don't understand numbers shown in each arc of the circular plot. They don't represent label frequencies and don't sum to the total number of instances in the dataset. Additionally, some labels don't appear in this plot. For example, I tried to plot the scene dataset and only 5 labels out of 6 were shown in the LC plot. Could you please elaborate?
Reem
It seems to me that, when reading sparse ARFF files (like the ones in the yahoo dataset), mldr interprets omitted values as NA instead of zeros, which differs from the ARFF format description:
"Note that the omitted values in a sparse instance are 0, they are not "missing" values! If a value is unknown, you must explicitly represent it with a question mark (?)."
Due to this behaviour of the mldr parser, I have to manually replace NA values with zeroes and then rebuild the datasets.
Thank you very much for the great work in mldr!
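A minimal sketch of the manual workaround described above (the data frame and column names are invented for illustration):

```r
# A small data frame standing in for a sparse-ARFF dataset read by mldr,
# where omitted values came back as NA instead of 0
df <- data.frame(f1 = c(1, NA, 2), label1 = c(1, 0, NA))
df[is.na(df)] <- 0  # per the ARFF spec, omitted sparse values mean 0
# the mldr object can then be rebuilt with mldr_from_dataframe(df, ...)
```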
I don't know why, but if I save and then try to load the birds dataset, it does not work, like this:
library(mldr)
data("birds")
write_arff(birds, "birds-test", T)
test <- mldr("birds-test")
Error in matrix(unlist(value, recursive = FALSE, use.names = FALSE), nrow = nr, :
'data' must be of a vector type, was 'NULL'
In addition: Warning messages:
1: In matrix(unlist(strsplit(arff_data, ",", fixed = T)), ncol = num_attrs, :
data length [179955] is not a sub-multiple or multiple of the number of rows [693]
2: In max(msr$count) : no non-missing arguments to max; returning -Inf
I'm using mldr 0.3.22.
Thanks.
Ability to pass a vector of names as files to be loaded.
In MEKA, a negative offset means reading labels from the end
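A sketch of that convention (the function name and signature are hypothetical, not MEKA's or mldr's API): with n attributes and MEKA option -C c, a non-negative c takes the first c attributes as labels, while a negative c takes the last |c|.

```r
# Hypothetical helper translating MEKA's -C value into label column indices
label_indices <- function(c, n) {
  if (c >= 0) seq_len(c) else (n + c + 1):n
}
label_indices(3, 10)   # 1 2 3  : labels at the beginning
label_indices(-3, 10)  # 8 9 10 : labels read from the end
```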
The read_header() function fails to parse the following patterns:
> read_header("@relation 'Slashdot: -C 22 -split-number 1891' %50")
$name
[1] "22"
# Doesn't return the correct name and misses the number of labels
> read_header("@relation 'birds'")
Error in strsplit(regmatches(arff_relation, rgx), "\\s*:\\s*-[Cc]\\s*")[[1]] :
subscript out of bounds
Hello, I use this package and for some sparse datasets I got an error when calling the mldr function.
For example, the Yahoo datasets: http://sourceforge.net/projects/mulan/files/datasets/yahoo.rar
I debugged the error, and I report it here along with a simple proposed solution.
The function parse_sparse_data splits the data using:
unlist(strsplit(item, " "))
However, in some datasets there is a space after the comma, so the result of this code differs slightly from what is expected. My suggested solution is to trim the item first, like this:
unlist(strsplit(gsub("^\\s+|\\s+$", "", item), " "))
In my tests this works well.
Thanks
Hello, I am a graduate student from China. My research area is multi-label classification, and I recently read some of your articles on the topic; they are really great. May I ask you a question? Do you use R to pre-process the data and then feed the processed data into MEKA for classification?
Hi,
just wanted to let you know that the recent update broke one of our tests.
I haven't yet taken a closer look at whether the problem is on our side or yours. Could you briefly reply about which side should take action? :)
Thanks.
Add load bars or wait indicators during dataset load and algorithm execution.
The read_xml() function fails to read the labels when the XML file contains whitespace between a label's opening and closing tags.
Sometimes I have to read ARFF files and pre-process them, and in those cases, the measures calculated by mldr are not useful, and also take a lot of time to calculate. I tried the function read.arff from RWeka, but it does not seem to discriminate between features and labels, as mldr does. Therefore, it would be nice if mldr allowed me to just read a dataset, without calculating the measures, just providing me a data.frame and the indexes of features and labels. So, after pre-processing the data.frame, I would build an mldr object by means of mldr_from_dataframe. Thanks for mldr!
The mldr() function fails trying to load a dataset. The bug is in the attribute parsing function, which looks for lowercase "@Attribute" tags instead of ignoring case.
An execution of the mldr_evaluate function calculates two kinds of metrics (the bipartition based and ranking based evaluation metrics), but (if I understood well) the predictions given as input are bipartitions or rankings, but not both. Would it make sense to have an argument at mldr_evaluate in which one specifies if the output of the classifier is a bipartition or a ranking, in order to avoid calculating and showing unnecessary metrics?
Alternatively, I think it would be easier to read the output of mldr_evaluate if the bipartition based and ranking based evaluation metrics were grouped under two sublists (e.g., "bipartition metrics" and "ranking metrics").
Sorry if this is nonsense, and thanks for mldr!