Comments (11)
KeplerMapper's API has not evolved since its first iteration. Back then I only clustered on the projection, and it was more of a single function that did everything. Right now, I don't even know if fit_transform is the right term to use (you are essentially projecting).
Separating out fit, transform, and fit_transform would allow for mapping on unseen data (calculate which interval the new data falls in, and add it to the most relevant cluster).
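To make the "which interval does the new data fall in" step concrete, here is a minimal sketch for a 1-D lens. The helper names (cover_intervals, intervals_containing) and the way overlap is applied are assumptions for illustration, not KeplerMapper's actual cover code.

```python
def cover_intervals(lo, hi, n_cubes, overlap_perc):
    """Split [lo, hi] into n_cubes equal intervals, each widened by overlap_perc.

    Assumption: the overlap is split evenly over both ends of every interval.
    """
    width = (hi - lo) / n_cubes
    pad = width * overlap_perc / 2
    return [(lo + i * width - pad, lo + (i + 1) * width + pad)
            for i in range(n_cubes)]


def intervals_containing(value, intervals):
    """Return the indices of all intervals that contain a (new) lens value."""
    return [i for i, (start, end) in enumerate(intervals)
            if start <= value <= end]


# Intervals would be stored at fit time and reused for unseen points.
intervals = cover_intervals(0.0, 10.0, n_cubes=5, overlap_perc=0.2)
print(intervals_containing(2.0, intervals))   # sits in an overlap region
print(intervals_containing(5.0, intervals))   # sits in a single interval
print(intervals_containing(11.0, intervals))  # outside the fitted range
```

A point in an overlap region belongs to more than one interval, and a point outside the fitted range belongs to none, so a real transform would need a policy for both cases.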
For small changes like nr_cubes -> n_cubes, if it doesn't add too much complexity, I'd like this to accept both (non-breaking changes for old code, and updated documentation).
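One way to accept both spellings is a small resolver that warns on the old name. This is only a sketch: resolve_n_cubes and the default of 10 are hypothetical, not part of KeplerMapper.

```python
import warnings


def resolve_n_cubes(n_cubes=None, nr_cubes=None, default=10):
    """Accept the new n_cubes keyword and the deprecated nr_cubes alias.

    Hypothetical helper; the default value is an assumption.
    """
    if nr_cubes is not None:
        warnings.warn("nr_cubes is deprecated; use n_cubes instead",
                      DeprecationWarning, stacklevel=2)
        # The new name wins if both are given.
        if n_cubes is None:
            n_cubes = nr_cubes
    return default if n_cubes is None else n_cubes


print(resolve_n_cubes(n_cubes=20))   # new-style call
print(resolve_n_cubes(nr_cubes=15))  # old-style call still works, but warns
```

Old code keeps running, the documentation can show only n_cubes, and the alias can be dropped in a later major release.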
AFAIK, scikit-learn does not do graphs at all. The closest data structure is in http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html but this uses keyed NumPy arrays (probably forced, to be consistent with other structures/data representations/optimizations). I think either dicts or NetworkX objects are more flexible for experimentation (add link strengths, add link functions). With NetworkX I think it also becomes easier to run graph algorithms (shortest or longest paths) and to compare Gromov-Hausdorff distance between graphs.
As for the visualization side, I can accept both dicts and NetworkX objects. NumPy arrays would be a bit more difficult for me.
from kepler-mapper.
I don't even know if fit_transform is the right term to use (you are essentially projecting).
I'm still new to the scikit-learn API, but it seems fit/fit_transform is the main workhorse of each algorithm. Because of this I think map should be renamed to fit. We could roll the current map functionality into fit_transform and then allow users to provide a custom lens or a projection:
graph = mapper.fit_transform(data, lens, clusterer, cover)
or
graph = mapper.fit_transform(data, projection, clusterer, cover)
For small changes like nr_cubes -> n_cubes, if it doesn't add too much complexity, I'd like this to accept both (non-breaking changes for old code, and updated documentation).
Maintaining backwards compatibility is great, but we shouldn't try to keep it around forever. Maybe we should start dev'ing on a new branch and release backwards breaking changes with 1.2 or 2.0? It would be best to release all API changes at the same time, so letting them sit while we fine-tune them would be ideal.
I can accept both dicts and NetworkX objects. NumPy arrays would be a bit more difficult for me.
Let's stick with dicts and networkx objects then. Is it currently wired to accept networkx objects? Adding networkx as a requirement seems okay to me.
One thing I'd like to work on soon is adding the ability to plug in a custom linker. I'd like to be able to build arbitrary-dimensional simplicial complexes, and this could also provide an interface for specifying the type of output. I think this will also be an essential component for implementing multi-mapper (a custom cover scheme and associated linker to handle the filtration).
from kepler-mapper.
In my workflow I found it useful to call .fit_transform multiple times. I also like that behind the scenes it calls .fit_transform and that the output is a NumPy array. My first impression is not to consolidate .map and .fit_transform. Keeping them separate seems more friendly to custom (or chained) projections.
It should also be possible to pass inverse_X=None and cluster on the projection (and suffer the projection loss), for instance in the case where it is possible to create a lens from the data (final layer -> t-SNE(2-d)), but not possible to cluster on the original data (clustering on raw pixel values fails to provide meaningful clusterings).
But I don't have a strong opinion about this and need to think about this more (and incorporate feedback from using kmapper on many varied data sets).
from kepler-mapper.
Let's stick with dicts and networkx objects then. Is it currently wired to accept networkx objects? Adding networkx as a requirement seems okay to me.
I'll write a NetworkX -> dict converter which kicks in if you pass an object rather than a dict. Let's keep NetworkX optional (required only if you set a parameter to output a NetworkX object).
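A rough sketch of such a converter, written with duck typing so NetworkX never has to be imported by the visualization code. It relies only on the NetworkX-style nodes(data=True) and edges() methods; the dict layout ("nodes" and "links" keys) and the "members" node attribute are assumptions for illustration.

```python
def to_dict_graph(graph):
    """Return a plain dict graph, converting NetworkX-like objects if needed.

    Assumed dict layout: {"nodes": {id: [member indices]}, "links": {id: [ids]}}.
    Assumed node attribute holding the member indices: "members".
    """
    if isinstance(graph, dict):
        return graph  # already in the dict format, pass it through untouched

    nodes = {node: list(attrs.get("members", []))
             for node, attrs in graph.nodes(data=True)}
    links = {}
    for u, v in graph.edges():
        links.setdefault(u, []).append(v)
    return {"nodes": nodes, "links": links}


# Dicts pass straight through, so existing callers are unaffected.
already_dict = {"nodes": {"cube0_cluster0": [0, 1]}, "links": {}}
print(to_dict_graph(already_dict) is already_dict)
```

Because the converter only calls methods on whatever it is handed, NetworkX stays an optional dependency for anyone who sticks to dicts.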
from kepler-mapper.
Regardless of consolidating projecting and mapping, it would be cool to be able to pipeline lenses.
projected_X = mapper.fit_transform(X, projection=[PCA(2),
                                                  custom_projection_function,
                                                  "knn_distance_5"])
from kepler-mapper.
That would be a nice feature, and shouldn't be very difficult to dev.
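As a rough idea of how little is needed, the list form could be handled by a loop like the one sketched here. apply_projections is a hypothetical helper, and string projections such as "knn_distance_5" are deliberately left out because their lookup lives inside KeplerMapper.

```python
def apply_projections(X, projections):
    """Feed the output of each projection into the next one.

    Each item is either an object with a scikit-learn style fit_transform
    method or a plain callable. Strings are not handled in this sketch.
    """
    for projection in projections:
        if hasattr(projection, "fit_transform"):
            X = projection.fit_transform(X)
        elif callable(projection):
            X = projection(X)
        else:
            raise TypeError("unsupported projection: %r" % (projection,))
    return X


def row_sums(X):
    """Toy projection: collapse every row to its sum."""
    return [[sum(row)] for row in X]


def doubled(X):
    """Toy projection: multiply every value by two."""
    return [[2 * value for value in row] for row in X]


print(apply_projections([[1, 2], [3, 4]], [row_sums, doubled]))
```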
Is there any way we could incorporate chaining of function calls into the API? This would make the calls look something like:
lens = mapper(data).fit_transform(PCA(2))\
                   .fit_transform(custom_projection_function)\
                   .fit_transform("knn_distance_5")
This would be a bit more far-fetched and would require that lens be its own object. The more I think about it the less realistic it seems.
Either way, the API you showed should work without major changes.
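For what it's worth, the "lens as its own object" variant could look something like this sketch. The Lens class is entirely hypothetical and only shows the mechanics of chaining: every call returns a new wrapper around the projected data.

```python
class Lens:
    """Hypothetical wrapper that lets projections be chained fluently."""

    def __init__(self, data):
        self.data = data

    def fit_transform(self, projection):
        # Accept either a scikit-learn style transformer or a plain callable.
        if hasattr(projection, "fit_transform"):
            result = projection.fit_transform(self.data)
        else:
            result = projection(self.data)
        # Returning a new Lens is what makes the next call in the chain possible.
        return Lens(result)


def center(rows):
    """Toy projection: subtract the column means."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def l1_norm(rows):
    """Toy projection: reduce every row to the sum of absolute values."""
    return [[sum(abs(v) for v in row)] for row in rows]


lens = Lens([[0.0, 2.0], [2.0, 4.0]]).fit_transform(center).fit_transform(l1_norm)
print(lens.data)
```

The cost is that callers have to unwrap lens.data before passing it to anything that expects an array, which is part of why this feels less realistic than the list-based form.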
from kepler-mapper.
I like how so much can now fit inside a single tweet:
import kmapper as km
from sklearn.datasets import make_circles as c
from sklearn import manifold
inverse_X, y = c(3001)
K = km.KeplerMapper()
projected_X = K.fit_transform(inverse_X, projection=manifold.TSNE(3))
G = K.map(projected_X, inverse_X)
K.visualize(G, custom_tooltips=y)
I liked being able to call .fit_transform to create custom color_functions. I also liked chaining calls, or using them to create distance matrices.
If we treat .fit_transform as close to the scikit-learn API (fit a model to the data, then transform) we can build out a .transform for unseen data. Then you know which hypercube the new data fell into, and can try to assign it to one of that hypercube's clusters (1-nearest neighbor on cluster means?). After that you can use the previously recorded statistics from that cluster (how many defaulters ended up in that cluster?) as input for an ML algorithm like random forests. We would just have to store the cluster means and the projection transformers.
I see no objection in having another function where you do more things in one go (perhaps more like in a Pipeline manner or whatever suits your research).
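A minimal sketch of the 1-nearest-neighbor-on-cluster-means idea, assuming clusters are stored as a dict of row indices. The function names and the cluster ids are made up for illustration, and the hypercube lookup is left out.

```python
import math


def cluster_means(points, clusters):
    """Compute the mean point of every cluster.

    clusters maps a cluster id to the row indices of its members (assumed layout).
    """
    means = {}
    for cluster_id, members in clusters.items():
        rows = [points[i] for i in members]
        means[cluster_id] = [sum(col) / len(rows) for col in zip(*rows)]
    return means


def nearest_cluster(point, means):
    """Assign an unseen point to the cluster with the closest mean."""
    return min(means, key=lambda cluster_id: math.dist(point, means[cluster_id]))


# "Fit": only the cluster means need to be stored.
train = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [12.0, 10.0]]
clusters = {"cube0_cluster0": [0, 1], "cube1_cluster0": [2, 3]}
means = cluster_means(train, clusters)

# "Transform": look up the closest stored mean for each new point.
print(nearest_cluster([1.0, 1.0], means))
print(nearest_cluster([9.0, 9.0], means))
```

The returned cluster id can then be used to look up whatever statistics were recorded for that cluster at fit time.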
from kepler-mapper.
Wanting something like:
import kmapper as km
from sklearn import linear_model as lm
mapper = km.KeplerMapper()
projected_X = mapper.fit_transform(
    inverse_X=inverse_X,
    y=y,
    projection=lm.Ridge())
Where the projected_X is created by 5-fold out-of-fold prediction (we can write our own code for this, or wait for the upcoming StackingEstimator). This gets us lens support for all scikit-learn API-compatible supervised and unsupervised models.
You could run a quick XGBoost on the data and use this with the ground truth as a color function to see where tree-based models perform poorly (and need better feature engineering).
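The out-of-fold part is simple enough to sketch by hand. The helper and the toy MeanRegressor here are illustrative only; for real estimators, scikit-learn's cross_val_predict in sklearn.model_selection already does this.

```python
def out_of_fold_predictions(make_model, X, y, n_folds=5):
    """Predict every row with a model that never saw that row during fit.

    make_model is a zero-argument factory returning an object with
    scikit-learn style fit(X, y) and predict(X) methods.
    """
    n = len(X)
    predictions = [None] * n
    for fold in range(n_folds):
        # Simple interleaved split: row i belongs to fold i % n_folds.
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        model = make_model()
        model.fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        fold_pred = model.predict([X[i] for i in test_idx])
        for i, pred in zip(test_idx, fold_pred):
            predictions[i] = pred
    return predictions


class MeanRegressor:
    """Toy stand-in for a real model: always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


X = [[float(i)] for i in range(10)]
y = [float(i) for i in range(10)]
lens = out_of_fold_predictions(MeanRegressor, X, y, n_folds=5)
print(lens)
```

The resulting list has one prediction per row, so it can be used directly as a 1-D lens or as a color function next to the ground truth.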
from kepler-mapper.
Also, separating out fit_transform would make it easier to pickle?
import joblib
from sklearn.neural_network import MLPClassifier
from kmapper import Projecter, Mapper, Visualizer, Pipeline
projecter = Projecter(projection="dist_mean")
projected_X_train1 = projecter.fit_transform(X_train)
projected_X_test1 = projecter.transform(X_test)
joblib.dump(projecter, "projecter1.p")
projecter = Projecter(projection=MLPClassifier())
projected_X_train2 = projecter.fit_transform(X_train, y)
projected_X_test2 = projecter.transform(X_test)
joblib.dump(projecter, "projecter2.p")
pipeline = Pipeline([joblib.load("projecter1.p"), joblib.load("projecter2.p")])
projected_X_unseen = pipeline.transform(X_unseen)
from kepler-mapper.
I like being able to skip importing commonly used sklearn libraries like decomposition and manifold.
I think we can use:
import kmapper as km
mapper = km.KeplerMapper()
projected_X = mapper.fit_transform(X, projection=km.manifold.TSNE())
if we add from sklearn import manifold, decomposition, cluster in the __init__.py. Are there any possible problems with this? A violation of the Zen of Python?
We could also have a custom km.models that accepts a wide range of scikit-learn API compatible models (either with decision_function or predict_proba) through wrappers.
projection1=km.models.FastRGFClassifier()
projection2=km.models.LGBMClassifier()
https://github.com/baidu/fast_rgf
http://lightgbm.readthedocs.io/en/latest/Python-API.html#scikit-learn-api
Although perhaps a bit heavy on the machine-learning side (I don't want to shoehorn KeplerMapper into an auto-ML library), it would make KeplerMapper fully competitive with state-of-the-art supervised learning.
from kepler-mapper.
💭
import kmapper as km
mapper = km.Mapper(cover=km.covers.Cubical(n_cubes=[20, 30],
                                           overlap_perc=[0.1, 0.2]),
                   nerve=km.nerves.AdaptivePairwise(min_samples=3),
                   clusterer=km.clusterers.HDBSCAN())
graph_train = mapper.fit_transform(projected_X, inverse_X, y=y)
graph_test = mapper.transform(projected_X_unseen, inverse_X_unseen)
from kepler-mapper.