sramirez / spark-infotheoretic-feature-selection

This package contains a generic implementation of greedy Information Theoretic Feature Selection (FS) methods. The implementation is based on the common theoretic framework presented by Gavin Brown. Implementations of mRMR, InfoGain, JMI and other commonly used FS filters are provided.

Home Page: http://sci2s.ugr.es/BigData

License: Apache License 2.0


spark-infotheoretic-feature-selection's Introduction

An Information Theoretic Feature Selection Framework

The present framework implements Feature Selection (FS) on Spark for its application on Big Data problems. This package contains a generic implementation of greedy Information Theoretic Feature Selection methods. The implementation is based on the common theoretic framework presented in [1]. Implementations of mRMR, InfoGain, JMI and other commonly used FS filters are provided. In addition, the framework can be extended with other criteria provided by the user as long as the process complies with the framework proposed in [1].

Spark package: http://spark-packages.org/package/sramirez/spark-infotheoretic-feature-selection

Please cite as: S. Ramírez-Gallego; H. Mouriño-Talín; D. Martínez-Rego; V. Bolón-Canedo; J. M. Benítez; A. Alonso-Betanzos; F. Herrera, "An Information Theory-Based Feature Selection Framework for Big Data Under Apache Spark," in IEEE Transactions on Systems, Man, and Cybernetics: Systems, in press, pp.1-13, doi: 10.1109/TSMC.2017.2670926 URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7970198&isnumber=6376248

Main features:

  • Version for the new ml (DataFrame-based) API.
  • Support for sparse data and high-dimensional datasets (millions of features).
  • Improved performance (less than 1 minute per iteration for datasets like ECBDL14 and kddb with 400 cores).

This work has two associated contributions submitted to international journals, which will be linked here as soon as they are accepted. The software has been validated with large real-world datasets such as ECBDL14 and kddb.

Example (ml):

import org.apache.spark.ml.feature._

val selector = new InfoThSelector()
  .setSelectCriterion("mrmr")
  .setNPartitions(100)
  .setNumTopFeatures(10)
  .setFeaturesCol("features")
  .setLabelCol("class")
  .setOutputCol("selectedFeatures")

val result = selector.fit(df).transform(df)

Example (MLLIB):

import org.apache.spark.mllib.feature._
val criterion = new InfoThCriterionFactory("mrmr")
val nToSelect = 100
val nPartitions = 100

println("*** FS criterion: " + criterion.getCriterion.toString)
println("*** Number of features to select: " + nToSelect)
println("*** Number of partitions: " + nPartitions)

val featureSelector = new InfoThSelector(criterion, nToSelect, nPartitions).fit(data)

val reduced = data.map(i => LabeledPoint(i.label, featureSelector.transform(i.features)))
reduced.first()

Design doc: https://docs.google.com/document/d/1HOaPL_HJzTbL2tVdzbTjhr5wxVvPe9e-23S7rc2VcsY/edit?usp=sharing

Prerequisites:

LabeledPoint data must be discretized as integer values in double representation, ranging from 0 to 255. This allows double values to be cast directly to byte, which makes the overall selection process much more efficient (communication overhead is greatly reduced).

Please refer to the MDLP package if you need to discretize your dataset:

https://spark-packages.org/package/sramirez/spark-MDLP-discretization
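To illustrate the value range this prerequisite demands, here is a minimal sketch that maps raw feature values into 0-255 via plain min-max rescaling. The helper name `toByteRange` is hypothetical and not part of this package; a supervised discretizer such as the MDLP package above will generally produce better cut points than uniform rescaling.

```scala
// Hypothetical helper (not part of this package): min-max rescale a raw
// feature value into the 0-255 integer range, stored as a Double, which
// is the representation the selector expects for LabeledPoint features.
def toByteRange(value: Double, min: Double, max: Double): Double = {
  if (max == min) 0.0 // constant feature: everything falls into one bin
  else math.round((value - min) / (max - min) * 255.0).toDouble
}
```

Whatever discretization is used, the point is that every feature value must end up as one of at most 256 integer levels before fitting the selector.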

References

[1] Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). "Conditional likelihood maximisation: a unifying framework for information theoretic feature selection." The Journal of Machine Learning Research, 13(1), 27-66.

spark-infotheoretic-feature-selection's People

Contributors

borisestulin · hbghhy · jabenitez · jconwell · smarker · sramirez


spark-infotheoretic-feature-selection's Issues

How can I extract output features with scores from InfoThSelector?

I am using the MIM feature selector model. It runs perfectly fine and automatically prints the feature results to the terminal. How can I extract the output features together with their relevance scores? I want to save this output to a file.

val selector = new InfoThSelector()
  .setSelectCriterion("mim")
  .setNumTopFeatures(2)
  .setFeaturesCol("features")
  .setLabelCol("class")
  .setOutputCol("selectedFeatures")

Could you deploy a sources jar to maven along with the executable jar?

Would it be possible to build and deploy a sources jar along with the execution jar when you deploy to the maven repository? Having the source makes it easy to debug and follow the code during execution.

It's pretty easy: just add the sources plugin section (link below) to the pom file and it will create the sources jar when you run package.

https://maven.apache.org/plugin-developers/cookbook/attach-source-javadoc-artifacts.html
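For reference, a typical maven-source-plugin section looks roughly like this (the version number is an assumption; check the linked cookbook for the current one):

```xml
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-source-plugin</artifactId>
      <version>3.2.1</version>
      <executions>
        <execution>
          <id>attach-sources</id>
          <goals>
            <!-- jar-no-fork attaches sources without re-running the lifecycle -->
            <goal>jar-no-fork</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```

With this in place, `mvn package` produces a `-sources.jar` artifact alongside the main jar, and `mvn deploy` uploads both.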

Error: "type does not take parameters"

Hi, I'm trying the given example (mllib) but encountered the following error:

val featureSelector = InfoThSelector(criterion, nToSelect, nPartitions).fit(data)
error: org.apache.spark.mllib.feature.InfoThSelector.type does not take parameters

Any ideas? I would very much appreciate the help.

Wei

Spark 3 / Scala 2.12 release

Thanks for making this lib.

Can you please build a Spark 3 / Scala 2.12 JAR file and upload it to Maven?

Looks like you were using Spark Packages. I think it's best to shift off Spark Packages now that it's pretty much abandonware.

It's relatively straightforward to publish directly to Maven with sbt-sonatype. spark-daria is a good example.

Let me know if you'd be willing to go through this whole process. I am happy to help.

Publish on Spark Packages Main Repository

Hi @sramirez,

Would you like to make a release of this package on the Spark Packages Maven Repo as well? There is an sbt-plugin called sbt-spark-package that would help you make the release straight from your sbt console. All you need to do is set a couple configurations.

Publishing on the Spark Packages Repo will bump your ranking on the website, and will fill in the How To section, which users can use to include your package in their work.

Please let me know if you have any comments/questions/suggestions!

Best,
Burak

P.S. I bolded this as well so that you would at least know that this is not an automatically generated message :)

Failing to build on sbt

/path/to/spark-infotheoretic-feature-selection/build.sbt:19: error: not found: value sparkPackageName
sparkPackageName := "sramirez/infotheoretic-feature-selection"
^
[error] Type error in expression


All I changed was sparkVersion and the libraryDependencies to 1.4.0.

Thanks

Issues related to code execution

I would like to ask why my code reports an error: Exception in thread "main" java.lang.AbstractMethodError

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.{InfoThCriterionFactory, InfoThSelector}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

object FSTest01 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FS test")
    val sc = new SparkContext(conf)
    val raw = MLUtils.loadLibSVMFile(sc, "a2a.libsvm")
    // Convert the sparse libsvm vectors to dense ones
    val data: RDD[LabeledPoint] = raw.map { case LabeledPoint(label, values) =>
      LabeledPoint(label, Vectors.dense(values.toArray))
    }

    val criterion: InfoThCriterionFactory = new InfoThCriterionFactory("mrmr")
    val nToSelect = 5
    val nPartitions = 10

    val featureSelector = new InfoThSelector(criterion, nToSelect, nPartitions).fit(data)

    val redData = data.map { lp =>
      LabeledPoint(lp.label, featureSelector.transform(lp.features))
    }

    println(redData.first().toString())
  }
}

Error reported on the fit line of the code

All criteria give the same ranking

I just tried all the criteria on the same dataset, but I get the same ranking no matter what I set the selectCriterion to.


I was wondering if I need to apply a particular transform to my data before running the selector.

Currently I first apply a MinMaxScaler to all features, then multiply by 255, round to int, and finally subtract 128 so the values are in the byte range.
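For clarity, the transform described above can be sketched as follows (the helper name is hypothetical; this is the poster's recipe, not something the package provides):

```scala
// Sketch of the preprocessing described above: min-max scale to [0, 1],
// multiply by 255, round, then subtract 128 to land in signed-byte range.
def toSignedByte(value: Double, min: Double, max: Double): Int = {
  val scaled = if (max == min) 0.0 else (value - min) / (max - min)
  math.round(scaled * 255.0).toInt - 128
}
```

Note that the README's prerequisite asks for discretized values in the 0 to 255 range, so the final subtraction of 128 (which yields -128 to 127) may be worth double-checking against that requirement.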
