
featran's Issues

Add an Avro module

So we can write output records as Avro GenericRecords and, eventually, Parquet/Arrow-compatible formats.

Optimize extract with settings for streaming/backend use cases

Right now FeatureSpec#extractWithSettings does two things: parsing the JSON settings and extracting with .map. This could be inefficient in a backend where the data is a Seq[T] of one element.

We should either split it into two steps or add a streaming API so that elements can be fed into the Seq lazily.
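The two-step idea can be sketched in plain Scala. This is a hypothetical shape, not featran's actual API: the settings are parsed once at startup, and per-element extraction reuses the prepared state instead of re-parsing JSON on every call.

```scala
// Hypothetical sketch of splitting extractWithSettings into two steps.
// ParsedSettings / PreparedExtractor / StreamingExtract are illustrative
// names, not featran types; the "parsing" is a stand-in for real JSON work.
final case class ParsedSettings(names: List[String])

final class PreparedExtractor(settings: ParsedSettings) {
  // Per-element extraction no longer touches the settings JSON.
  def extractOne(x: Double): Map[String, Double] =
    settings.names.map(n => n -> x).toMap
}

object StreamingExtract {
  // Step 1: done once, e.g. at service startup.
  def prepare(settingsJson: String): PreparedExtractor = {
    val names = settingsJson.split(",").map(_.trim).toList
    new PreparedExtractor(ParsedSettings(names))
  }
}
```

A backend service would then call `prepare` once and `extractOne` per request, avoiding the repeated parse for single-element Seqs.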

Java wrapper?

We might want to add a Java wrapper so that featran can be used in a Java backend service.

Create load tests for Sparse/Dense arrays

Users report that the sparse representation of features might be more expensive than the dense one; AFAIR we do a little more work there, like creating new arrays (after #59) and mapping. It would be nice to add some load tests and optimize that code path.
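A load test could start from something like the following plain-Scala sketch (the shapes are assumptions, not featran's internals): build dense and sparse representations of the same one-hot-style feature and time each path to see where the extra allocation cost shows up.

```scala
// Micro-benchmark sketch for dense vs. sparse feature building.
// `hot` holds the indices of the non-zero entries.
object SparseDenseLoad {
  // Dense: allocate the full array, set the hot positions.
  def dense(size: Int, hot: Array[Int]): Array[Double] = {
    val a = new Array[Double](size)
    hot.foreach(i => a(i) = 1.0)
    a
  }

  // Sparse: just the (indices, values) pair, no full-size allocation.
  def sparse(size: Int, hot: Array[Int]): (Array[Int], Array[Double]) =
    (hot, Array.fill(hot.length)(1.0))

  // Crude wall-clock timer; a real load test would use JMH.
  def time[A](body: => A): Long = {
    val t0 = System.nanoTime()
    body
    System.nanoTime() - t0
  }
}
```

Running `time(dense(...))` vs. `time(sparse(...))` over many iterations and sizes would give a first signal before profiling the real code path.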

java.lang.NullPointerException when accessing featureNames/featureValues of MultiFeatureSpec

I am getting a java.lang.NullPointerException when trying to access either featureNames or featureValues. When I use either of the two specs separately it works fine, but when I combine them in a MultiFeatureSpec it fails. Is this a bug or am I doing something wrong?

@BigQueryType.fromQuery(
    """
      |#standardSQL
      |SELECT album_gid, album.num_tracks AS num_tracks,
      |album.availability.latest_date AS latest_date,
      |global_popularity.popularity_normalized AS popularity_normalized,
      |album.duration AS duration
      |FROM (SELECT * FROM `knowledge-graph-112233.album_entity.album_entity_%s` LIMIT 1000)
      |WHERE album.num_tracks >= 3
    """.stripMargin, "$LATEST"
  ) class AlbumMeta

  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, args) = ContextAndArgs(cmdlineArgs)

    val date = args("date").replace("-", "")
    val output = args("output")

    val albumFeatures = sc.typedBigQuery[AlbumMeta](AlbumMeta.query.format(date))

    val conSpec = FeatureSpec.of[AlbumMeta]
      .required(_.duration.get.toDouble)(StandardScaler("duration"))
      .required(_.duration.get.toDouble)(StandardScaler("duration_mean", withMean=true))
      .required(_.duration.get.toDouble)(Identity("identity"))
      .required(_.duration.get.toDouble)(MinMaxScaler("min_max"))

    val albumSpec = FeatureSpec.of[AlbumMeta]
      .required(_.album_gid.get)(OneHotEncoder("album"))

    //    val spec_extracted = albumSpec.extract(albumFeatures)
    val spec_extracted = MultiFeatureSpec(conSpec, albumSpec).extract(albumFeatures)

    val t = spec_extracted.featureNames

    sc.close().waitUntilFinish()
  }

Error:
Caused by: java.lang.NullPointerException
at com.spotify.featran.FeatureSet.multiFeatureNames(FeatureSpec.scala:231)
at com.spotify.featran.MultiFeatureExtractor$$anonfun$featureNames$1.apply(MultiFeatureExtractor.scala:56)
at com.spotify.featran.MultiFeatureExtractor$$anonfun$featureNames$1.apply(MultiFeatureExtractor.scala:56)
at com.spotify.scio.util.Functions$$anon$8.processElement(Functions.scala:145)

Frequency rank transformer

Feature request from an internal user:
"I want to transform a string feature to integer ids, where the most common string gets id 1, the second most common gets id 2, etc., with a cap at 10,000, so everything else gets mapped to 0. Is that easily doable with featran?
The original dataset might have say 1 billion strings (of which say 1 million are unique)."

Not sure if this can be done with approximation in a single pass. The naive approach would be building a Map[String, Int] in the reduce phase and a priority queue before the map phase.
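The naive two-pass approach described above can be sketched with local collections (in featran this would be a Transformer with the counting done in the reduce phase; all names here are illustrative):

```scala
// Sketch of frequency-rank encoding: most common string -> 1,
// second most common -> 2, ..., everything past maxId -> 0.
object FrequencyRank {
  // "Reduce" phase: count, rank by descending frequency, cap at maxId.
  def fit(xs: Seq[String], maxId: Int): Map[String, Int] =
    xs.groupBy(identity)
      .map { case (s, g) => (s, g.size) }
      .toSeq
      .sortBy { case (s, n) => (-n, s) } // break count ties deterministically
      .take(maxId)
      .zipWithIndex
      .map { case ((s, _), i) => s -> (i + 1) }
      .toMap

  // "Map" phase: strings outside the top maxId map to 0.
  def transform(dict: Map[String, Int])(s: String): Int =
    dict.getOrElse(s, 0)
}
```

With ~1M unique strings the dictionary itself stays small; the expensive part is the distributed count, which is exact here rather than approximate.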

Add PositionalOneHotEncoder Transformer

The name needs some work, but the idea is: when you have categorical features like (red, blue, green) and a feature value of green, instead of outputting <0, 0, 1> the transformer should output 2.
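A minimal sketch of the proposed behaviour (name and API are hypothetical, not featran's): emit the position of the matched category rather than a one-hot vector.

```scala
// Positional encoding: the value's index in the category list,
// instead of a one-hot vector of length categories.size.
object PositionalOneHot {
  // `categories` would come from featran's usual dictionary aggregation.
  def encode(categories: IndexedSeq[String])(x: String): Option[Int] =
    categories.indexOf(x) match {
      case -1 => None // unseen category; a real transformer must pick a policy
      case i  => Some(i)
    }
}
```

So for categories (red, blue, green), green encodes as 2, matching the example in the issue.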

Refactor tests

Split up transformer tests, add utilities for deterministic and non-deterministic transformers, and test Transformer directly.

Performance issue if `plus` is expensive

For transformers like *HotEncoder where plus is expensive, i.e. set or map concatenation, we might end up creating many temporary objects and causing GC pressure. We could potentially override sumOption, but it's not clear how to implement this across multiple transformers.
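The sumOption idea, sketched in plain Scala rather than featran/Algebird APIs: instead of folding with pairwise `plus` (which allocates an intermediate Set per step), sum a whole batch into a single mutable builder.

```scala
// Pairwise plus vs. a sumOption-style batched sum for Set aggregation.
object SetSum {
  // Pairwise: each `++` allocates a new immutable Set.
  def viaPlus(xs: Iterable[Set[String]]): Set[String] =
    xs.foldLeft(Set.empty[String])(_ ++ _)

  // sumOption-style: one mutable builder for the whole traversal,
  // one immutable Set allocated at the end.
  def viaSumOption(xs: Iterable[Set[String]]): Option[Set[String]] =
    if (xs.isEmpty) None
    else {
      val b = scala.collection.mutable.Set.empty[String]
      xs.foreach(b ++= _)
      Some(b.toSet)
    }
}
```

Both produce the same result; the batched version trades the per-element allocations for a single builder, which is where the GC savings would come from.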

Provide defaults for FeatranSpec

In my code, I need featran only for an identity transformation:

  def applyFeatranSpec(s: SCollection[TrainingExample], vecSize: Int)
  : FeatureExtractor[SCollection, TrainingExample] = {
    FeatureSpec.of[TrainingExample]
      .required(e => e.context.map(_.toDouble).toArray)(VectorIdentity("context", vecSize))
      .required(e => e.target.toDouble)(Identity("target"))
      .extract(s)
  }

where

  case class TrainingExample(context: List[Long],
                             target: Long)

Would love the featran API to provide good spec defaults for a given case class, so that I don't have to write this boilerplate unless I need some specific feature transformation logic.

Settings versioning

We need versioning to prevent loading settings from incompatible versions of featran.

one hot encoder for large N

The one hot encoder takes a long time to run when the number of tokens is large. I believe the problem lies in the fact that the current implementation iterates over each element in the dictionary:

https://github.com/spotify/featran/blob/master/core/src/main/scala/com/spotify/featran/transformers/OneHotEncoder.scala#L45

I tried writing an implementation that uses a hash map instead of a list to store the tokens (plus their indices), but it is still necessary to "inflate" the feature with empty elements.

https://github.com/slhansen/featran/blob/master/core/src/main/scala/com/spotify/featran/transformers/OneHotEncoder.scala#L57-#L68

Since we can assume that only one element will be added, it would be preferable to default to an empty feature vector and just add one element at the given index.
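The suggested fix, sketched in plain Scala outside featran's FeatureBuilder: look the token up in a Map[String, Int] and emit a single (index, 1.0) entry against an implicitly-empty vector, instead of walking the whole dictionary to write N-1 zeros.

```scala
// O(1) one-hot lookup via a token -> index map; the result is sparse,
// so nothing is "inflated" with explicit zeros.
object FastOneHot {
  // Build the dictionary once (sorted for a deterministic index order).
  def fit(tokens: Seq[String]): Map[String, Int] =
    tokens.distinct.sorted.zipWithIndex.toMap

  // Sparse result: (vector size, the single non-zero entry, if any).
  // An unknown token yields no entry, i.e. the all-zeros vector.
  def encode(dict: Map[String, Int])(token: String): (Int, Option[(Int, Double)]) =
    (dict.size, dict.get(token).map(i => (i, 1.0)))
}
```

This is per-element O(1) regardless of dictionary size; the cost of the current approach is the per-element O(N) scan plus the zero-fill.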

Collect feature stats

Could be useful for debugging. A couple of thoughts

  • Opt-in for user specified columns
  • Pre/post transformation
  • How do we deal with vectors, etc.?

Add scaling factor in Hash*HotEncoders

We can provide 2 params:

  • hashBucketSize: Int = 0
  • sizeScalingFactor: Double = 4.0

If hashBucketSize > 0 we use it for assigning labels to buckets; if it is 0 we use the HLL estimated size * sizeScalingFactor instead. 4x gives us ~5% collisions according to #23.
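The bucket-size rule reduces to a small function (parameter names from this issue; the HLL estimate would come from featran's aggregation, here it's just an input):

```scala
// Resolve the effective number of hash buckets: an explicit
// hashBucketSize wins; otherwise scale the HLL cardinality estimate.
object HashBucketSize {
  def resolve(hashBucketSize: Int,
              sizeScalingFactor: Double,
              hllEstimatedSize: Long): Long =
    if (hashBucketSize > 0) hashBucketSize.toLong
    else math.ceil(hllEstimatedSize * sizeScalingFactor).toLong
}
```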

Support Selective OneHotEncoder

Currently the OHE encodes all values, but there might be thousands of them, which will blow up the training data. It would be good to have a control to limit the number of columns generated by the OHE.

Some solutions from the discussions on slack:

  1. Have a parameter "N" which defines the number of columns and most occurring N categories will be kept.
  2. Have the list of required categories defined in the feature spec. The user defines a list of N categories, and only those are one-hot encoded; others are ignored.
  3. Have a percentile defined, e.g. keep all categories accounting for say 90% of the observations.

From initial discussions it seemed #1 and #3 are harder to implement than #2. Open to discussion.
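Option 3 can be sketched in plain Scala outside featran (all names illustrative): keep the most frequent categories until the kept ones cover the requested fraction of observations.

```scala
// Select the vocabulary covering `coverage` (e.g. 0.9) of all observations;
// everything outside it would be dropped from the one-hot encoding.
object SelectiveVocab {
  def fit(xs: Seq[String], coverage: Double): Set[String] = {
    val threshold = coverage * xs.size
    val byCount = xs.groupBy(identity)
      .map { case (s, g) => (s, g.size) }
      .toSeq
      .sortBy { case (s, n) => (-n, s) } // most frequent first, ties by name
    // Observations covered *before* each category is added.
    val cumBefore = byCount.scanLeft(0L) { case (acc, (_, n)) => acc + n }
    byCount.zip(cumBefore).collect {
      case ((s, _), before) if before < threshold => s
    }.toSet
  }
}
```

In featran this counting would happen in the reduce phase, like the existing OneHotEncoder dictionary; the selection itself is cheap once the counts exist.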

IndexOutOfBoundsException when using HashOneHotEncoder with SparseVector

Not sure if this is a user bug or a featran bug, but here's the partial stack trace:

Caused by: java.lang.IndexOutOfBoundsException: 8239914 not in [0,8239912)
	at breeze.linalg.VectorBuilder.breeze$linalg$VectorBuilder$$boundsCheck(VectorBuilder.scala:92)
	at breeze.linalg.VectorBuilder.add(VectorBuilder.scala:112)
	at breeze.linalg.SparseVector$$anonfun$apply$2.apply(SparseVector.scala:196)
	at breeze.linalg.SparseVector$$anonfun$apply$2.apply(SparseVector.scala:195)
	at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
	at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:73)
	at scala.collection.mutable.MutableList.foreach(MutableList.scala:30)
	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
	at breeze.linalg.SparseVector$.apply(SparseVector.scala:195)
	at com.spotify.featran.FeatureBuilder$$anon$5.result(FeatureBuilder.scala:134)
	at com.spotify.featran.FeatureBuilder$$anon$5.result(FeatureBuilder.scala:117)
	at com.spotify.featran.FeatureExtractor$$anonfun$featureValuesWithOriginal$1.apply(FeatureExtractor.scala:96)
	at com.spotify.featran.FeatureExtractor$$anonfun$featureValuesWithOriginal$1.apply(FeatureExtractor.scala:94)
	at com.spotify.scio.util.Functions$$anon$7.processElement(Functions.scala:145)

UnsupportedOperationException when using TensorFlowFeatureBuilder

I'm getting the following error when using TensorFlowFeatureBuilder to build an Example with fe.featureValues[Example]:

java.lang.UnsupportedOperationException
    at com.google.protobuf.MapField.ensureMutable(MapField.java:290)
    at com.google.protobuf.MapField$MutatabilityAwareMap.put(MapField.java:333)
    at org.tensorflow.example.Features$Builder.putFeature(Features.java:631)
    at com.spotify.featran.tensorflow.package$TensorFlowFeatureBuilder$.add(package.scala:30)
    at com.spotify.featran.CrossingFeatureBuilder.add(CrossingFeatureBuilder.scala:96)
    at com.spotify.featran.transformers.StandardScaler.buildFeatures(StandardScaler.scala:55)
    at com.spotify.featran.transformers.StandardScaler.buildFeatures(StandardScaler.scala:42)
    at com.spotify.featran.transformers.Transformer.optBuildFeatures(Transformer.scala:83)
    at com.spotify.featran.Feature.unsafeBuildFeatures(FeatureSpec.scala:148)
    at com.spotify.featran.FeatureSet.featureValues(FeatureSpec.scala:270)
    at com.spotify.featran.FeatureExtractor$$anonfun$featureResults$1.apply(FeatureExtractor.scala:94)
    at com.spotify.featran.FeatureExtractor$$anonfun$featureResults$1.apply(FeatureExtractor.scala:93)
    at com.spotify.scio.util.Functions$$anon$7.processElement(Functions.scala:145)

Spec:

    FeatureSpec
      .of[Features]
      .required(_.doubleFeature)(StandardScaler("reg_feature", withMean = true, withStd = true))

Using featran 0.1.11, scio 0.4.3, and protobuf-java 3.3.1.
