Comments (12)
Could be nice to have it output the protobuf format that is required by Facets so that we get the feature visualizations for free.
See: https://github.com/PAIR-code/facets/blob/master/facets_overview/proto/feature_statistics.proto
from featran.
I'm thinking maybe an extra `featureStatistics` method on `FeatureExtractor`, so this is done post transformation. We can build one Algebird `Moments` per column easily.
Optionally we can also let users opt in to a subset of transformers, but that's extra complexity.
OTOH not sure if we can compute stats pre-transformation, since it doesn't make sense for all inputs, e.g. strings, vectors.
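To illustrate the "one moments aggregate per column" idea, here is a minimal Python sketch, not featran or Algebird code. The `Moments` class and `column_moments` helper are hypothetical names; the merge step uses the standard pairwise update for count, mean and sum of squared deviations, which is associative and so could run inside a distributed reduce.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Moments:
    """Hypothetical mergeable summary: count, mean, sum of squared deviations."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    @staticmethod
    def of(x: float) -> "Moments":
        # Summary of a single observation.
        return Moments(1, x, 0.0)

    def merge(self, other: "Moments") -> "Moments":
        # Pairwise combination of two partial summaries.
        if self.n == 0:
            return other
        if other.n == 0:
            return self
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return Moments(n, mean, m2)

    @property
    def variance(self) -> float:
        # Population variance; NaN when no values were seen.
        return self.m2 / self.n if self.n else float("nan")

    @property
    def stddev(self) -> float:
        return math.sqrt(self.variance)


def column_moments(rows):
    """Reduce rows of equal-length numeric vectors to one Moments per column."""
    result = None
    for row in rows:
        current = [Moments.of(x) for x in row]
        if result is None:
            result = current
        else:
            result = [a.merge(b) for a, b in zip(result, current)]
    return result or []
```

Usage: `column_moments([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])` returns two `Moments` values, one per column, each exposing `n`, `mean`, `variance` and `stddev`.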
from featran.
FWIW here are a few things I've commonly checked in the past:
- Plot distribution per feature
- Descriptive stats on features: median, stddev, top_k_values, p05, p95, min, max
- Count Missing values, NaNs, outliers...
- Evolution of feature stats across batches (say, over several weeks)
- Correlation between target and features, and between features themselves
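As a rough illustration of the checks listed above for a single numeric feature, the following Python sketch computes them on an in-memory list. The `describe` and `percentile` helpers are hypothetical, the percentile uses the simple nearest-rank definition, and an exact computation like this is only practical on small or sampled data.

```python
import math
import statistics
from collections import Counter


def percentile(sorted_values, q):
    """Nearest-rank percentile of a non-empty sorted list, for q in (0, 100]."""
    rank = max(1, math.ceil(q / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def _is_nan(v):
    return isinstance(v, float) and math.isnan(v)


def describe(values, top_k=3):
    """Exact descriptive stats for one feature; None marks a missing value."""
    missing = sum(1 for v in values if v is None)
    nans = sum(1 for v in values if _is_nan(v))
    clean = sorted(v for v in values if v is not None and not _is_nan(v))
    stats = {"count": len(clean), "missing": missing, "nan": nans}
    if clean:
        stats.update({
            "min": clean[0],
            "max": clean[-1],
            "median": statistics.median(clean),
            "stddev": statistics.pstdev(clean),  # population standard deviation
            "p05": percentile(clean, 5),
            "p95": percentile(clean, 95),
            "top_k_values": Counter(clean).most_common(top_k),
        })
    return stats
```

Usage: `describe([1.0, 2.0, 2.0, None, float("nan"), 3.0, 4.0])` reports one missing value, one NaN, and the stats of the five remaining values.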
from featran.
@yonromai questions:
- Plot distribution per feature
Is this pre or post transformation? Some transformers, e.g. binarizer, may change the distribution, so I guess pre is more useful in that case? In some cases we might want post, e.g. one-hot encoder?
- Descriptive stats on features
These should be easy. Not sure if p05/p95 make sense here. Did you mean the 5% & 95% quantiles?
- Count missing values, NaNs, outliers...
This should be pre-transformation, right?
- Evolution
This is outside the scope of featran, but maybe suitable for the system one layer above?
from featran.
- Plot distribution per feature
Right, mostly pre transformation. This work can inform what transformations to apply.
- Descriptive stats on features
Yes
- Count missing values, NaNs, outliers...
Yes
- Evolution
👍
from featran.
So the pre-transform stats are potentially doable in the same `reduce` pass for feature settings, while the post-transform stats definitely require another `reduce` pass. Both should be opt-in obviously. We could also warn in cases of high dimensional features like *hot encoders.
@richwhitjr do you think it's worth doing the pre-transform stats in the same `reduce` as feature settings? IMO it's complex and probably not worth it since the user is most likely doing it ad-hoc to explore data.
from featran.
Seems complex, and it would be hard to mix monoids as needed. For example the transformation may want a QTree but for stats you will need a Moment Monoid. An ad-hoc "analysis" phase sounds promising though. I wonder though if this will require another type of Spec or if users will expect the same type of stats for the same transformers.
from featran.
@marcromeyn Facets seems to support a lot more things than we discussed here. Just checking if we can drop some to narrow the scope.
- Weighted feature - Right now we only have weighted labels for n-hot encoders. Do we need it for anything else? Can we drop it altogether?
- Median, histogram, rank histogram, frequency and value - These are hard to brute force for large datasets, is approximation acceptable? If so do we need tunable precision (could make it more complex)?
- Bytes input - There's no bytes type in featran, guess we can drop?
- Vectors - How do vectors (`Array[Double]`) fit in here?
from featran.
Seems it could be a lot of work to replicate all the logic in facets. I'm wondering if it's easier and better to just sample in featran and do the statistics summarization in facets?
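For the sampling option, one common way to draw a fixed-size uniform sample in a single pass is reservoir sampling. The sketch below is a generic Python illustration with a hypothetical `reservoir_sample` helper; it is not an existing featran API and says nothing about how featran would actually sample.

```python
import random


def reservoir_sample(records, k, seed=None):
    """Return up to k records drawn uniformly from an iterable, in one pass."""
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = record
    return sample
```

Usage: `reservoir_sample(rows, 10000, seed=42)` keeps at most 10000 rows in memory, which could then be written out and loaded into a tool like Facets.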
@marcromeyn @yonromai @richwhitjr thoughts?
from featran.
I like the idea of keeping the statistic summarization internal but making it easy for someone to take the stats and dump them to something like Facets. In the future we could have a sub-project to help do this in one step.
I just worry about serialization and dependency issues we may run into when introducing a new library, since Featran has to support a lot of different distributed systems.
from featran.
It turns out that Facets has the ability to import `TFRecord` files and compute stats on them. I tried it on my data and it seems to work fine, although it is quite slow on big files.
So it should be really easy to sample `TFRecord` files from the featran output and import them into Facets.
from featran.
It's easier to do this in TFDV, closing.
from featran.