sramirez / spark-infotheoretic-feature-selection

This package contains a generic implementation of greedy Information Theoretic Feature Selection (FS) methods. The implementation is based on the common theoretic framework presented by Gavin Brown. Implementations of mRMR, InfoGain, JMI and other commonly used FS filters are provided.

Home Page: http://sci2s.ugr.es/BigData

License: Apache License 2.0


spark-infotheoretic-feature-selection's Introduction

An Information Theoretic Feature Selection Framework

The present framework implements Feature Selection (FS) on Spark for its application on Big Data problems. This package contains a generic implementation of greedy Information Theoretic Feature Selection methods. The implementation is based on the common theoretic framework presented in [1]. Implementations of mRMR, InfoGain, JMI and other commonly used FS filters are provided. In addition, the framework can be extended with other criteria provided by the user as long as the process complies with the framework proposed in [1].

Spark package: http://spark-packages.org/package/sramirez/spark-infotheoretic-feature-selection

Please cite as: S. Ramírez-Gallego; H. Mouriño-Talín; D. Martínez-Rego; V. Bolón-Canedo; J. M. Benítez; A. Alonso-Betanzos; F. Herrera, "An Information Theory-Based Feature Selection Framework for Big Data Under Apache Spark," in IEEE Transactions on Systems, Man, and Cybernetics: Systems, in press, pp.1-13, doi: 10.1109/TSMC.2017.2670926 URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7970198&isnumber=6376248

Main features:

  • Version for the new ml (DataFrame-based) API.
  • Support for sparse data and high-dimensional datasets (millions of features).
  • Improved performance (less than 1 minute per iteration for datasets like ECBDL14 and kddb with 400 cores).

This work has two associated contributions submitted to international journals, which will be linked here as soon as they are accepted. The software has been validated with large real-world datasets such as ECBDL14 and kddb.

Example (ml):

import org.apache.spark.ml.feature._

val selector = new InfoThSelector()
  .setSelectCriterion("mrmr")
  .setNPartitions(100)
  .setNumTopFeatures(10)
  .setFeaturesCol("features")
  .setLabelCol("class")
  .setOutputCol("selectedFeatures")

val result = selector.fit(df).transform(df)

Example (MLLIB):

import org.apache.spark.mllib.feature._
val criterion = new InfoThCriterionFactory("mrmr")
val nToSelect = 100
val nPartitions = 100

println("*** FS criterion: " + criterion.getCriterion.toString)
println("*** Number of features to select: " + nToSelect)
println("*** Number of partitions: " + nPartitions)

val featureSelector = new InfoThSelector(criterion, nToSelect, nPartitions).fit(data)

val reduced = data.map(i => LabeledPoint(i.label, featureSelector.transform(i.features)))
reduced.first()

Design doc: https://docs.google.com/document/d/1HOaPL_HJzTbL2tVdzbTjhr5wxVvPe9e-23S7rc2VcsY/edit?usp=sharing

Prerequisites:

LabeledPoint data must be discretized as integer values in double representation, ranging from 0 to 255. This allows double values to be cast directly to byte, which makes the overall selection process much more efficient (communication overhead is greatly reduced).

Please refer to the MDLP package if you need to discretize your dataset:

https://spark-packages.org/package/sramirez/spark-MDLP-discretization
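To illustrate the value range this prerequisite demands, here is a minimal sketch that maps raw feature values into 0-255 via plain min-max rescaling. The helper name `toByteRange` is hypothetical and not part of this package; a supervised discretizer such as the MDLP package above will generally produce better cut points than uniform rescaling.

```scala
// Hypothetical helper (not part of this package): min-max rescale a raw
// feature value into the 0-255 integer range, stored as a Double, which
// is the representation the selector expects for LabeledPoint features.
def toByteRange(value: Double, min: Double, max: Double): Double = {
  if (max == min) 0.0 // constant feature: everything falls into one bin
  else math.round((value - min) / (max - min) * 255.0).toDouble
}
```

Whatever discretization is used, the point is that every feature value must end up as one of at most 256 integer levels before fitting the selector.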

References

[1] Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). "Conditional likelihood maximisation: a unifying framework for information theoretic feature selection." The Journal of Machine Learning Research, 13(1), 27-66.

spark-infotheoretic-feature-selection's People

Contributors

borisestulin · hbghhy · jabenitez · jconwell · smarker · sramirez


spark-infotheoretic-feature-selection's Issues

How can I extract output features with scores from InfoThSelector?

I am using the MIM feature selector model. It runs perfectly fine and automatically prints the feature results to the terminal. How can I extract the output features together with their relevance scores? I want to save this output to a file.

val selector = new InfoThSelector()
  .setSelectCriterion("mim")
  .setNumTopFeatures(2)
  .setFeaturesCol("features")
  .setLabelCol("class")
  .setOutputCol("selectedFeatures")

Could you deploy a sources jar to maven along with the executable jar?

Would it be possible to build and deploy a sources jar along with the execution jar when you deploy to the maven repository? Having the source makes it easy to debug and follow the code during execution.

It's pretty easy: just add the sources plugin section (link below) to the pom file and it will create the sources jar when you run package.

https://maven.apache.org/plugin-developers/cookbook/attach-source-javadoc-artifacts.html
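For reference, a typical maven-source-plugin section looks roughly like this (the version number is an assumption; check the linked cookbook for the current one):

```xml
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-source-plugin</artifactId>
      <version>3.2.1</version>
      <executions>
        <execution>
          <id>attach-sources</id>
          <goals>
            <!-- jar-no-fork attaches sources without re-running the lifecycle -->
            <goal>jar-no-fork</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```

With this in place, `mvn package` produces a `-sources.jar` artifact alongside the main jar, and `mvn deploy` uploads both.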

Error: "type does not take parameters"

Hi, I'm trying the given example (mllib) but encountered the following error:

val featureSelector = InfoThSelector(criterion, nToSelect, nPartitions).fit(data)
error: org.apache.spark.mllib.feature.InfoThSelector.type does not take parameters

Any ideas? I would very much appreciate the help.

Wei

Spark 3 / Scala 2.12 release

Thanks for making this lib.

Can you please build a Spark 3 / Scala 2.12 JAR file and upload it to Maven?

Looks like you were using Spark Packages. I think it's best to shift off Spark Packages now that it's pretty much abandonware.

It's relatively straightforward to publish directly to Maven with sbt-sonatype. spark-daria is a good example.

Let me know if you'd be willing to go through this whole process. I am happy to help.

Publish on Spark Packages Main Repository

Hi @sramirez,

Would you like to make a release of this package on the Spark Packages Maven Repo as well? There is an sbt-plugin called sbt-spark-package that would help you make the release straight from your sbt console. All you need to do is set a couple configurations.

Publishing on the Spark Packages Repo will bump your ranking on the website, and will fill in the How To section, which users can use to include your package in their work.

Please let me know if you have any comments/questions/suggestions!

Best,
Burak

P.S. I bolded this as well so that you would at least know that this is not an automatically generated message :)

Failing to build on sbt

/path/to/spark-infotheoretic-feature-selection/build.sbt:19: error: not found: value sparkPackageName
sparkPackageName := "sramirez/infotheoretic-feature-selection"
^
[error] Type error in expression


All I changed was sparkVersion and the libraryDependencies to 1.4.0.

Thanks

Issues related to code execution

I would like to ask why my code reports an error: Exception in thread "main" java.lang.AbstractMethodError

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.{InfoThCriterionFactory, InfoThSelector}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

object FSTest01 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FS test")
    val sc = new SparkContext(conf)
    val raw = MLUtils.loadLibSVMFile(sc, "a2a.libsvm")
    // Convert the sparse libsvm vectors to dense ones
    val data: RDD[LabeledPoint] = raw.map { case LabeledPoint(label, values) =>
      LabeledPoint(label, Vectors.dense(values.toArray))
    }

    val criterion: InfoThCriterionFactory = new InfoThCriterionFactory("mrmr")
    val nToSelect = 5
    val nPartitions = 10

    val featureSelector = new InfoThSelector(criterion, nToSelect, nPartitions).fit(data)

    val redData = data.map { lp =>
      LabeledPoint(lp.label, featureSelector.transform(lp.features))
    }

    println(redData.first().toString())
  }
}

Error reported on the fit line of the code

All criteria give the same ranking

I just tried all the criteria on the same dataset, but I get the same ranking no matter what I set the selectCriterion to.


I was wondering if I need to apply a particular transform to my data before running the selector.

Currently I first apply a MinMaxScaler to all features, then multiply by 255, round to int, and finally subtract 128 so the values are in the byte range.
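For clarity, the transform described above can be sketched as follows (the helper name is hypothetical; this is the poster's recipe, not something the package provides):

```scala
// Sketch of the preprocessing described above: min-max scale to [0, 1],
// multiply by 255, round, then subtract 128 to land in signed-byte range.
def toSignedByte(value: Double, min: Double, max: Double): Int = {
  val scaled = if (max == min) 0.0 else (value - min) / (max - min)
  math.round(scaled * 255.0).toInt - 128
}
```

Note that the README's prerequisite asks for discretized values in the 0 to 255 range, so the final subtraction of 128 (which yields -128 to 127) may be worth double-checking against that requirement.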
