spotify / featran
A Scala feature transformation library for data science and machine learning
Home Page: https://spotify.github.io/featran
License: Apache License 2.0
Blocked until upstream twitter/chill#305 is merged; otherwise it won't work with Spark.
So we can write output records as Avro GenericRecords and eventually Parquet/Arrow-compatible formats.
Disregard
Schema generator and FeatureBuilder
Right now FeatureSpec#extractWithSettings does two things: parsing JSON settings and extracting with .map. This could be inefficient in a backend where the data is a Seq[T] of one element. We should either split it into two steps or add a streaming API so that elements can be fed into the Seq lazily.
We might want to add a Java wrapper so that it can be used in a Java backend service.
Users report that the sparse representation of features can be more expensive than the dense one; AFAIR we do a little bit more work there, like creating new arrays after #59 and mapping. It would be nice to add some load tests and optimize that code path.
We should add support for feature extraction in XGBoost formats: LabeledPoint, DMatrix, ?
I am getting a java.lang.NullPointerException when trying to access either featureNames or featureValues. When I use either of the two specs separately it works fine, but when I combine them in a MultiFeatureSpec it fails. Is it a bug or am I doing something wrong?
@BigQueryType.fromQuery(
"""
|#standardSQL
|SELECT album_gid, album.num_tracks AS num_tracks,
|album.availability.latest_date AS latest_date,
|global_popularity.popularity_normalized AS popularity_normalized,
|album.duration AS duration
|FROM (SELECT * FROM `knowledge-graph-112233.album_entity.album_entity_%s` LIMIT 1000)
|WHERE album.num_tracks >= 3
""".stripMargin, "$LATEST"
) class AlbumMeta
def main(cmdlineArgs: Array[String]): Unit = {
  val (sc, args) = ContextAndArgs(cmdlineArgs)
  val date = args("date").replace("-", "")
  val output = args("output")
  val albumFeatures = sc.typedBigQuery[AlbumMeta](AlbumMeta.query.format(date))
  val conSpec = FeatureSpec.of[AlbumMeta]
    .required(_.duration.get.toDouble)(StandardScaler("duration"))
    .required(_.duration.get.toDouble)(StandardScaler("duration_mean", withMean = true))
    .required(_.duration.get.toDouble)(Identity("identity"))
    .required(_.duration.get.toDouble)(MinMaxScaler("min_max"))
  val albumSpec = FeatureSpec.of[AlbumMeta]
    .required(_.album_gid.get)(OneHotEncoder("album"))
  // val spec_extracted = albumSpec.extract(albumFeatures)
  val spec_extracted = MultiFeatureSpec(conSpec, albumSpec).extract(albumFeatures)
  val t = spec_extracted.featureNames
  sc.close().waitUntilFinish()
}
Error:
Caused by: java.lang.NullPointerException
at com.spotify.featran.FeatureSet.multiFeatureNames(FeatureSpec.scala:231)
at com.spotify.featran.MultiFeatureExtractor$$anonfun$featureNames$1.apply(MultiFeatureExtractor.scala:56)
at com.spotify.featran.MultiFeatureExtractor$$anonfun$featureNames$1.apply(MultiFeatureExtractor.scala:56)
at com.spotify.scio.util.Functions$$anon$8.processElement(Functions.scala:145)
Right now each RecordExtractor has one input PipeIterator instance and synchronizes on feeding & result extraction. It should probably be thread-local to utilize all CPU cores?
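A minimal sketch of the thread-local idea, assuming a stand-in Extractor class in place of featran's RecordExtractor (names here are hypothetical, not featran's API):

```scala
// Hedged sketch: give each thread its own (non-thread-safe) extractor
// via ThreadLocal instead of synchronizing on a single shared instance.
// Extractor is a stand-in for featran's RecordExtractor.
final class Extractor {
  def extract(x: Double): Double = x * 2.0 // placeholder work
}

// Each thread that calls perThread.get receives its own instance,
// so no cross-thread synchronization is needed.
val perThread: ThreadLocal[Extractor] =
  ThreadLocal.withInitial(() => new Extractor)
```

Each worker thread then calls perThread.get.extract(...) without contending on a lock.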
Feature request from an internal user:
"I want to transform a string feature to integer ids, where the most common string gets id 1, the second most common gets id 2, etc., with a cap at 10000, so everything else gets mapped to 0. Is that easily doable with featran?
The original dataset might have say 1 billion strings (of which say 1 million are unique)."
Not sure if this can be done with approximation in a single pass. The naive approach would be building a Map[String, Int] in the reduce phase and a priority queue before the map phase.
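The naive two-pass approach can be sketched in plain Scala (not featran API; fitCapIndex and indexOf are hypothetical names):

```scala
// Hedged sketch: two-pass "top-K string indexer".
// Pass 1 counts string frequencies; the K most common strings get ids
// 1..K (most common first); everything else maps to 0.
def fitCapIndex(xs: Seq[String], cap: Int): Map[String, Int] = {
  val counts = xs.groupBy(identity).map { case (s, g) => s -> g.size }
  counts.toSeq
    .sortBy { case (s, c) => (-c, s) } // break count ties deterministically
    .take(cap)
    .zipWithIndex
    .map { case ((s, _), i) => s -> (i + 1) }
    .toMap
}

// Pass 2: look up each string; unseen or capped-out strings get id 0.
def indexOf(dict: Map[String, Int])(s: String): Int = dict.getOrElse(s, 0)
```

With a billion strings but only ~a million unique values, the counts map stays small enough to build in a reduce phase.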
The name needs some work, but the idea is: when you have a categorical feature like (red, blue, green) and a record has the value green, instead of outputting <0, 0, 1> the transformer should output 2.
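The idea reduces to an ordinal lookup; a minimal sketch (ordinal is a hypothetical name, not a featran transformer):

```scala
// Hedged sketch: given an ordered category list, emit the position of the
// matching category instead of a one-hot vector.
def ordinal(categories: IndexedSeq[String])(x: String): Int =
  categories.indexOf(x) // -1 when unseen; a real transformer would define that case
```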
Split up transformer tests, add utilities for deterministic and non-deterministic transformers, and test Transformer directly.
1.0-RC1 is out. Hopefully a stable release soon.
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/scala-breeze/8b5IGfo10rU/Z1HyEpTbAQAJ
By adding a method to FeatureBuilder, so any Transformer in the spec can call it to reject an entire record.
Each transformer should be able to ser/de parameters and aggregations to a JSON object.
For transformers like *HotEncoder where plus is expensive, i.e. set or map concatenation, we might end up creating many temporary objects and causing GC pressure. We could potentially override sumOption, but I'm not sure how to implement this across multiple transformers.
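A sketch of what overriding sumOption could buy, using a minimal stand-in for Algebird's Semigroup (the real one lives in com.twitter.algebird):

```scala
// Hedged sketch: for set-valued aggregators, override sumOption to
// accumulate into one mutable builder instead of allocating a new
// immutable Set per plus call.
trait Semigroup[T] {
  def plus(l: T, r: T): T
  // Default: fold with plus, allocating an intermediate per element.
  def sumOption(items: Iterable[T]): Option[T] = items.reduceOption(plus)
}

class SetSemigroup[A] extends Semigroup[Set[A]] {
  override def plus(l: Set[A], r: Set[A]): Set[A] = l ++ r
  override def sumOption(items: Iterable[Set[A]]): Option[Set[A]] =
    if (items.isEmpty) None
    else {
      // Single mutable accumulator: O(total elements) with no
      // per-step immutable Set allocation.
      val b = scala.collection.mutable.Set.empty[A]
      items.foreach(b ++= _)
      Some(b.toSet)
    }
}
```

The open question in the issue remains: featran would need a way to thread such overrides through every transformer's aggregator, not just one.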
In my code, I need to use featran to do identity transformation only:
def applyFeatranSpec(s: SCollection[TrainingExample], vecSize: Int)
: FeatureExtractor[SCollection, TrainingExample] = {
FeatureSpec.of[TrainingExample]
.required(e => e.context.map(_.toDouble).toArray)(VectorIdentity("context", vecSize))
.required(e => e.target.toDouble)(Identity("target"))
.extract(s)
}
where
case class TrainingExample(context: List[Long],
target: Long)
Would love the featran API to provide good Spec defaults given a case class, so that I don't have to write this boilerplate unless I need specific feature transformation logic.
We need versioning to prevent loading settings from incompatible versions of featran.
Each bundle is in the form of Seq[Array[B]]. We can probably wrap it in a view of Seq[B] for each individual column and use the base semigroup's sumOption.
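A minimal sketch of the column-view idea (column is a hypothetical helper, not featran code):

```scala
// Hedged sketch: expose column j of a bundle Seq[Array[B]] as a lazy
// view, so a per-column semigroup's sumOption can traverse it without
// copying the underlying arrays.
def column[B](bundle: Seq[Array[B]], j: Int) =
  bundle.view.map(_.apply(j))
```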
So that the test code doesn't depend on implementation details.
It's not clear to me how you would apply the transformation to the train dataset and then to the test dataset.
It is important to save the state of the transformation after applying it to the train dataset, so that the same values are used for the test set.
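A featran-free sketch of the principle: fit scaling parameters on the train set once, persist them, and apply the same parameters to the test set. If I recall the README correctly, featran supports this via the extractor's featureSettings and the spec's extractWithSettings; the MinMaxSettings type below is a hypothetical illustration, not featran code.

```scala
// Hedged sketch: parameters fitted on the train set and reused as-is
// on the test set, so both sets are scaled identically.
case class MinMaxSettings(min: Double, max: Double) {
  def transform(x: Double): Double =
    if (max == min) 0.0 else (x - min) / (max - min)
}

// Fit only on the train set; never refit on test data.
def fit(train: Seq[Double]): MinMaxSettings =
  MinMaxSettings(train.min, train.max)
```

Test values outside the train range simply extrapolate past [0, 1] rather than shifting the scale.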
So that we can detect edge cases like NPEs and ser/de issues early.
The one hot encoder takes a long time to run when the number of tokens is large. I believe the problem lies in the fact that the current implementation iterates over each element in the dictionary:
I tried writing an implementation that uses a hash map instead of a list to store the tokens (plus their index), but it is still necessary to "inflate" the feature with the empty elements.
Since we can assume that only one element will be added it would be preferable to have the default be an empty feature vector and just add one element to it given the index.
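The proposed lookup can be sketched as follows (oneHotSparse is a hypothetical name, not featran's implementation):

```scala
// Hedged sketch: O(1) one-hot via an index map. Start from an empty
// sparse feature vector and set only the hot index, instead of
// iterating the whole dictionary per record.
def oneHotSparse(index: Map[String, Int])(x: String): Seq[(Int, Double)] =
  index.get(x).map(i => Seq(i -> 1.0)).getOrElse(Seq.empty)
```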
Could be useful for debugging. A couple of thoughts
In the B
position of Transformer[A, B, C]
to reduce GC pressure.
Need to double check for correctness and make mutation detector happy (like the one in Beam) though.
So that they show up nicely in scaladoc site.
We can provide two params:
hashBucketSize: Int = 0
sizeScalingFactor: Double = 4.0
If hashBucketSize > 0 we use it for assigning labels to buckets, but if it is 0 we use the HLL estimated size * sizeScalingFactor instead; 4x gives us ~5% collision according to #23.
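The proposed sizing rule is small enough to sketch directly (bucketCount is a hypothetical name):

```scala
// Hedged sketch: explicit hashBucketSize wins when positive; otherwise
// derive the bucket count from the HLL cardinality estimate scaled by
// sizeScalingFactor (4x ~ 5% collisions per #23).
def bucketCount(hashBucketSize: Int,
                hllEstimate: Long,
                sizeScalingFactor: Double = 4.0): Int =
  if (hashBucketSize > 0) hashBucketSize
  else math.ceil(hllEstimate * sizeScalingFactor).toInt
```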
Currently the OHE encodes all the values, but there might be thousands of them, which will blow up the training data. It would be good to have a control to limit the number of columns generated after OHE.
Some solutions from the discussions on slack:
From initial discussions it seemed #1 and #3 are harder to implement than #2? Open to discussion.
Not sure if user or featran code bug but here's the partial stacktrace:
Caused by: java.lang.IndexOutOfBoundsException: 8239914 not in [0,8239912)
at breeze.linalg.VectorBuilder.breeze$linalg$VectorBuilder$$boundsCheck(VectorBuilder.scala:92)
at breeze.linalg.VectorBuilder.add(VectorBuilder.scala:112)
at breeze.linalg.SparseVector$$anonfun$apply$2.apply(SparseVector.scala:196)
at breeze.linalg.SparseVector$$anonfun$apply$2.apply(SparseVector.scala:195)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:73)
at scala.collection.mutable.MutableList.foreach(MutableList.scala:30)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at breeze.linalg.SparseVector$.apply(SparseVector.scala:195)
at com.spotify.featran.FeatureBuilder$$anon$5.result(FeatureBuilder.scala:134)
at com.spotify.featran.FeatureBuilder$$anon$5.result(FeatureBuilder.scala:117)
at com.spotify.featran.FeatureExtractor$$anonfun$featureValuesWithOriginal$1.apply(FeatureExtractor.scala:96)
at com.spotify.featran.FeatureExtractor$$anonfun$featureValuesWithOriginal$1.apply(FeatureExtractor.scala:94)
at com.spotify.scio.util.Functions$$anon$7.processElement(Functions.scala:145)
So we can ditch the commons-math3 dependency.
We would like to be able to extract features in Java as sparse vectors, similar to the way you can do extractor.featureValues[SparseVector[Float]] in Scala.
There might be concurrent access to fb if as is an in-memory parallel collection, i.e. if fb and its surrounding lambda are not copied via ser/de and are accessed in a multi-threaded environment.
https://github.com/spotify/featran/blob/master/core/src/main/scala/com/spotify/featran/FeatureExtractor.scala#L94
Getting the following error when using TensorFlowFeatureBuilder to build Example with fe.featureValues[Example]:
java.lang.UnsupportedOperationException
at com.google.protobuf.MapField.ensureMutable(MapField.java:290)
at com.google.protobuf.MapField$MutatabilityAwareMap.put(MapField.java:333)
at org.tensorflow.example.Features$Builder.putFeature(Features.java:631)
at com.spotify.featran.tensorflow.package$TensorFlowFeatureBuilder$.add(package.scala:30)
at com.spotify.featran.CrossingFeatureBuilder.add(CrossingFeatureBuilder.scala:96)
at com.spotify.featran.transformers.StandardScaler.buildFeatures(StandardScaler.scala:55)
at com.spotify.featran.transformers.StandardScaler.buildFeatures(StandardScaler.scala:42)
at com.spotify.featran.transformers.Transformer.optBuildFeatures(Transformer.scala:83)
at com.spotify.featran.Feature.unsafeBuildFeatures(FeatureSpec.scala:148)
at com.spotify.featran.FeatureSet.featureValues(FeatureSpec.scala:270)
at com.spotify.featran.FeatureExtractor$$anonfun$featureResults$1.apply(FeatureExtractor.scala:94)
at com.spotify.featran.FeatureExtractor$$anonfun$featureResults$1.apply(FeatureExtractor.scala:93)
at com.spotify.scio.util.Functions$$anon$7.processElement(Functions.scala:145)
Spec:
FeatureSpec
.of[Features]
.required(_.doubleFeature)(StandardScaler("reg_feature", withMean = true, withStd = true))
Using featran 0.1.11, scio 0.4.3, and protobuf-java 3.3.1.