spotify / featran
A Scala feature transformation library for data science and machine learning
Home Page: https://spotify.github.io/featran
License: Apache License 2.0
Blocked until upstream twitter/chill#305 is merged; otherwise it won't work with Spark.
So we can write output records as Avro GenericRecords and eventually Parquet/Arrow-compatible formats.
Disregard
Schema generator and FeatureBuilder
Right now FeatureSpec#extractWithSettings does two things: parsing JSON settings and extracting with .map. This could be inefficient in a backend where the data is a Seq[T] of one element. We should either split it into two steps or add a streaming API so that elements can be fed into the Seq lazily.
We might want to add a Java wrapper so that it can be used in a Java backend service.
Users report that the sparse representation of features can be more expensive than the dense one; AFAIR we do a little bit more work there, like creating new arrays after #59 and mapping. It would be nice to add some load tests and optimize that code path.
We should add support for feature extraction in XGBoost formats: LabeledPoint, DMatrix, ?
I am getting a java.lang.NullPointerException when trying to access either featureNames or featureValues. When I use either of the two specs separately it works fine, but when I combine them in a MultiFeatureSpec it fails. Is it a bug or am I doing something wrong?
@BigQueryType.fromQuery(
"""
|#standardSQL
|SELECT album_gid, album.num_tracks AS num_tracks,
|album.availability.latest_date AS latest_date,
|global_popularity.popularity_normalized AS popularity_normalized,
|album.duration AS duration
|FROM (SELECT * FROM `knowledge-graph-112233.album_entity.album_entity_%s` LIMIT 1000)
|WHERE album.num_tracks >= 3
""".stripMargin, "$LATEST"
) class AlbumMeta
def main(cmdlineArgs: Array[String]): Unit = {
  val (sc, args) = ContextAndArgs(cmdlineArgs)
  val date = args("date").replace("-", "")
  val output = args("output")
  val albumFeatures = sc.typedBigQuery[AlbumMeta](AlbumMeta.query.format(date))
  val conSpec = FeatureSpec.of[AlbumMeta]
    .required(_.duration.get.toDouble)(StandardScaler("duration"))
    .required(_.duration.get.toDouble)(StandardScaler("duration_mean", withMean = true))
    .required(_.duration.get.toDouble)(Identity("identity"))
    .required(_.duration.get.toDouble)(MinMaxScaler("min_max"))
  val albumSpec = FeatureSpec.of[AlbumMeta]
    .required(_.album_gid.get)(OneHotEncoder("album"))
  // val spec_extracted = albumSpec.extract(albumFeatures)
  val spec_extracted = MultiFeatureSpec(conSpec, albumSpec).extract(albumFeatures)
  val t = spec_extracted.featureNames
  sc.close().waitUntilFinish()
}
Error:
Caused by: java.lang.NullPointerException
at com.spotify.featran.FeatureSet.multiFeatureNames(FeatureSpec.scala:231)
at com.spotify.featran.MultiFeatureExtractor$$anonfun$featureNames$1.apply(MultiFeatureExtractor.scala:56)
at com.spotify.featran.MultiFeatureExtractor$$anonfun$featureNames$1.apply(MultiFeatureExtractor.scala:56)
at com.spotify.scio.util.Functions$$anon$8.processElement(Functions.scala:145)
Right now each RecordExtractor has one input PipeIterator instance and synchronizes on feeding & result extraction. It should probably be thread-local to utilize all CPU cores?
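A minimal sketch of the thread-local idea, assuming a stand-in Extractor class in place of featran's RecordExtractor (names here are hypothetical, not featran's API):

```scala
// Hedged sketch: give each thread its own (non-thread-safe) extractor
// via ThreadLocal instead of synchronizing on a single shared instance.
// Extractor is a stand-in for featran's RecordExtractor.
final class Extractor {
  def extract(x: Double): Double = x * 2.0 // placeholder work
}

// Each thread that calls perThread.get receives its own instance,
// so no cross-thread synchronization is needed.
val perThread: ThreadLocal[Extractor] =
  ThreadLocal.withInitial(() => new Extractor)
```

Each worker thread then calls perThread.get.extract(...) without contending on a lock.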
Feature request from an internal user:
"I want to transform a string feature to integer ids, where the most common string gets id 1, the second most common gets id 2, etc., with a cap at 10000, so everything else gets mapped to 0. Is that easily doable with featran?
The original dataset might have say 1 billion strings (of which say 1 million are unique)."
Not sure if this can be done with approximation in a single pass. The naive approach would be building a Map[String, Int] in the reduce phase and a priority queue before the map phase.
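The naive two-pass approach can be sketched in plain Scala (not featran API; fitCapIndex and indexOf are hypothetical names):

```scala
// Hedged sketch: two-pass "top-K string indexer".
// Pass 1 counts string frequencies; the K most common strings get ids
// 1..K (most common first); everything else maps to 0.
def fitCapIndex(xs: Seq[String], cap: Int): Map[String, Int] = {
  val counts = xs.groupBy(identity).map { case (s, g) => s -> g.size }
  counts.toSeq
    .sortBy { case (s, c) => (-c, s) } // break count ties deterministically
    .take(cap)
    .zipWithIndex
    .map { case ((s, _), i) => s -> (i + 1) }
    .toMap
}

// Pass 2: look up each string; unseen or capped-out strings get id 0.
def indexOf(dict: Map[String, Int])(s: String): Int = dict.getOrElse(s, 0)
```

With a billion strings but only ~a million unique values, the counts map stays small enough to build in a reduce phase.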
The name needs some work, but the idea is: when you have a categorical feature like (red, blue, green) and a record has the value green, instead of outputting <0, 0, 1> the transformer should output 2.
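The idea reduces to an ordinal lookup; a minimal sketch (ordinal is a hypothetical name, not a featran transformer):

```scala
// Hedged sketch: given an ordered category list, emit the position of the
// matching category instead of a one-hot vector.
def ordinal(categories: IndexedSeq[String])(x: String): Int =
  categories.indexOf(x) // -1 when unseen; a real transformer would define that case
```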
Split up transformer tests, add utilities for deterministic and non-deterministic transformers, and test Transformer directly.
1.0-RC1 is out. Hopefully a stable release soon.
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/scala-breeze/8b5IGfo10rU/Z1HyEpTbAQAJ
By adding a method to FeatureBuilder, so any Transformer in the spec can call it to reject an entire record.
Each transformer should be able to ser/de parameters and aggregations to a JSON object.
For transformers like *HotEncoder where plus is expensive, i.e. set or map concatenation, we might end up creating many temporary objects and causing GC pressure. We could potentially override sumOption, but I'm not sure how to implement this across multiple transformers.
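A sketch of what overriding sumOption could buy, using a minimal stand-in for Algebird's Semigroup (the real one lives in com.twitter.algebird):

```scala
// Hedged sketch: for set-valued aggregators, override sumOption to
// accumulate into one mutable builder instead of allocating a new
// immutable Set per plus call.
trait Semigroup[T] {
  def plus(l: T, r: T): T
  // Default: fold with plus, allocating an intermediate per element.
  def sumOption(items: Iterable[T]): Option[T] = items.reduceOption(plus)
}

class SetSemigroup[A] extends Semigroup[Set[A]] {
  override def plus(l: Set[A], r: Set[A]): Set[A] = l ++ r
  override def sumOption(items: Iterable[Set[A]]): Option[Set[A]] =
    if (items.isEmpty) None
    else {
      // Single mutable accumulator: O(total elements) with no
      // per-step immutable Set allocation.
      val b = scala.collection.mutable.Set.empty[A]
      items.foreach(b ++= _)
      Some(b.toSet)
    }
}
```

The open question in the issue remains: featran would need a way to thread such overrides through every transformer's aggregator, not just one.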
In my code, I need to use featran to do identity transformation only:
def applyFeatranSpec(s: SCollection[TrainingExample], vecSize: Int)
: FeatureExtractor[SCollection, TrainingExample] = {
FeatureSpec.of[TrainingExample]
.required(e => e.context.map(_.toDouble).toArray)(VectorIdentity("context", vecSize))
.required(e => e.target.toDouble)(Identity("target"))
.extract(s)
}
where
case class TrainingExample(context: List[Long],
target: Long)
Would love the featran API to provide good Spec defaults given a case class, so that I don't have to write this boilerplate unless I need specific feature transformation logic.
We need versioning to prevent loading settings from incompatible versions of featran.
Each bundle is in the form of Seq[Array[B]]. We can probably wrap it in a view of Seq[B] for each individual column and use the base semigroup's sumOption.
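A minimal sketch of the column-view idea (column is a hypothetical helper, not featran code):

```scala
// Hedged sketch: expose column j of a bundle Seq[Array[B]] as a lazy
// view, so a per-column semigroup's sumOption can traverse it without
// copying the underlying arrays.
def column[B](bundle: Seq[Array[B]], j: Int) =
  bundle.view.map(_.apply(j))
```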
So that the test code doesn't depend on implementation details.
It's not clear to me how you would apply the transformation to the train dataset and then to the test dataset.
It is important to save the state of the transformation after applying it to the train dataset, so that the same values are used for the test set.
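A featran-free sketch of the principle: fit scaling parameters on the train set once, persist them, and apply the same parameters to the test set. If I recall the README correctly, featran supports this via the extractor's featureSettings and the spec's extractWithSettings; the MinMaxSettings type below is a hypothetical illustration, not featran code.

```scala
// Hedged sketch: parameters fitted on the train set and reused as-is
// on the test set, so both sets are scaled identically.
case class MinMaxSettings(min: Double, max: Double) {
  def transform(x: Double): Double =
    if (max == min) 0.0 else (x - min) / (max - min)
}

// Fit only on the train set; never refit on test data.
def fit(train: Seq[Double]): MinMaxSettings =
  MinMaxSettings(train.min, train.max)
```

Test values outside the train range simply extrapolate past [0, 1] rather than shifting the scale.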
So that we can detect edge cases like NPEs and ser/de issues early.
The one hot encoder takes a long time to run when the number of tokens is large. I believe the problem lies in the fact that the current implementation iterates over each element in the dictionary:
I tried writing an implementation that uses a hash map instead of a list to store the tokens (plus their index), but it is still necessary to "inflate" the feature with the empty elements.
Since we can assume that only one element will be added it would be preferable to have the default be an empty feature vector and just add one element to it given the index.
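The proposed lookup can be sketched as follows (oneHotSparse is a hypothetical name, not featran's implementation):

```scala
// Hedged sketch: O(1) one-hot via an index map. Start from an empty
// sparse feature vector and set only the hot index, instead of
// iterating the whole dictionary per record.
def oneHotSparse(index: Map[String, Int])(x: String): Seq[(Int, Double)] =
  index.get(x).map(i => Seq(i -> 1.0)).getOrElse(Seq.empty)
```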
Could be useful for debugging. A couple of thoughts
In the B
position of Transformer[A, B, C]
to reduce GC pressure.
Need to double check for correctness and make mutation detector happy (like the one in Beam) though.
So that they show up nicely in scaladoc site.
We can provide two params:
hashBucketSize: Int = 0
sizeScalingFactor: Double = 4.0
If hashBucketSize > 0 we use it for assigning labels to buckets, but if it is 0 we use the HLL estimated size * sizeScalingFactor instead; 4x gives us ~5% collision according to #23.
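The proposed sizing rule is small enough to sketch directly (bucketCount is a hypothetical name):

```scala
// Hedged sketch: explicit hashBucketSize wins when positive; otherwise
// derive the bucket count from the HLL cardinality estimate scaled by
// sizeScalingFactor (4x ~ 5% collisions per #23).
def bucketCount(hashBucketSize: Int,
                hllEstimate: Long,
                sizeScalingFactor: Double = 4.0): Int =
  if (hashBucketSize > 0) hashBucketSize
  else math.ceil(hllEstimate * sizeScalingFactor).toInt
```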
Currently the OHE encodes all the values, but there might be thousands of them, which will blow up the training data. It would be good to have a control to limit the number of columns generated after OHE.
Some solutions from the discussions on slack:
From initial discussions it seemed #1 and #3 are harder to implement than #2? Open to discussion.
Not sure if user or featran code bug but here's the partial stacktrace:
Caused by: java.lang.IndexOutOfBoundsException: 8239914 not in [0,8239912)
at breeze.linalg.VectorBuilder.breeze$linalg$VectorBuilder$$boundsCheck(VectorBuilder.scala:92)
at breeze.linalg.VectorBuilder.add(VectorBuilder.scala:112)
at breeze.linalg.SparseVector$$anonfun$apply$2.apply(SparseVector.scala:196)
at breeze.linalg.SparseVector$$anonfun$apply$2.apply(SparseVector.scala:195)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:73)
at scala.collection.mutable.MutableList.foreach(MutableList.scala:30)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at breeze.linalg.SparseVector$.apply(SparseVector.scala:195)
at com.spotify.featran.FeatureBuilder$$anon$5.result(FeatureBuilder.scala:134)
at com.spotify.featran.FeatureBuilder$$anon$5.result(FeatureBuilder.scala:117)
at com.spotify.featran.FeatureExtractor$$anonfun$featureValuesWithOriginal$1.apply(FeatureExtractor.scala:96)
at com.spotify.featran.FeatureExtractor$$anonfun$featureValuesWithOriginal$1.apply(FeatureExtractor.scala:94)
at com.spotify.scio.util.Functions$$anon$7.processElement(Functions.scala:145)
So we can ditch the commons-math3 dependency.
We would like to be able to extract features in Java as sparse vectors, similar to the way you can do extractor.featureValues[SparseVector[Float]] in Scala.
There might be concurrent access to fb if as is an in-memory parallel collection, i.e. if fb and its surrounding lambda are not copied via ser/de and are accessed in a multi-threaded environment.
https://github.com/spotify/featran/blob/master/core/src/main/scala/com/spotify/featran/FeatureExtractor.scala#L94
Getting the following error when using TensorFlowFeatureBuilder to build Example with fe.featureValues[Example]:
java.lang.UnsupportedOperationException
at com.google.protobuf.MapField.ensureMutable(MapField.java:290)
at com.google.protobuf.MapField$MutatabilityAwareMap.put(MapField.java:333)
at org.tensorflow.example.Features$Builder.putFeature(Features.java:631)
at com.spotify.featran.tensorflow.package$TensorFlowFeatureBuilder$.add(package.scala:30)
at com.spotify.featran.CrossingFeatureBuilder.add(CrossingFeatureBuilder.scala:96)
at com.spotify.featran.transformers.StandardScaler.buildFeatures(StandardScaler.scala:55)
at com.spotify.featran.transformers.StandardScaler.buildFeatures(StandardScaler.scala:42)
at com.spotify.featran.transformers.Transformer.optBuildFeatures(Transformer.scala:83)
at com.spotify.featran.Feature.unsafeBuildFeatures(FeatureSpec.scala:148)
at com.spotify.featran.FeatureSet.featureValues(FeatureSpec.scala:270)
at com.spotify.featran.FeatureExtractor$$anonfun$featureResults$1.apply(FeatureExtractor.scala:94)
at com.spotify.featran.FeatureExtractor$$anonfun$featureResults$1.apply(FeatureExtractor.scala:93)
at com.spotify.scio.util.Functions$$anon$7.processElement(Functions.scala:145)
Spec:
FeatureSpec
.of[Features]
.required(_.doubleFeature)(StandardScaler("reg_feature", withMean = true, withStd = true))
Using featran 0.1.11, scio 0.4.3, and protobuf-java 3.3.1.