Comments (2)
Thinking out loud...:
Scope prioritization for this package should follow the 80/20 rule. I think the obvious main target should be tabular data used for classical ML problems, i.e. classification/regression. However, there are a few considerations:
- EDA for classical ML problems is most common and least likely to stand out against other tools. Coverage in this area should be focused on providing a polished, "one-stop shop" that can handle the most common analyses.
- We should also be opinionated about analyses that are both effective and technically sound e.g. we shouldn't exert effort on building / supporting pie charts and word clouds just because they are used frequently.
- Other types of data that can be used as a predictive feature in classical ML should be strongly considered. For example, locality (e.g. country, state, city, zip) is very commonly included as a predictive feature, while other geospatial-specific analyses (e.g. flight maps) are much less commonly applicable and may or may not be easily represented in tabular format.
- In the geospatial domain, we benefit from geopandas representing geospatial data in a Pandas-like API, which lowers the hurdle of incorporating support for certain types of geospatial data.
- In scoping out data sources, we should be careful not to conflate the source of data with the analysis approach. For example, signal data and alarm data can be framed as generic time series analysis or even as classic ML/tabular problems. Computer vision data can be framed as a correlation heatmap on the raw pixels.
- However, we should also keep an eye out for opportunities to cover more niche data types where there is a need that is not already covered by other (open source) tools. We should brainstorm further about what opportunities lie here.
- We should be judicious about how/where we provide "automated data preparation" / transformation for specific data types, as this package is not intended to eliminate the need for data preparation. I can think of two situations where we should lean towards providing this preparation as part of this package:
- Strong standardization in data formats in the industry/domain, as in geospatial
- Transformations that are considered to be part of the analysis itself: a simplistic example is the binning/counting for histograms
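To make the histogram case concrete, here is a minimal sketch of binning/counting as a transformation that belongs to the analysis itself. `bin_counts` is a hypothetical helper written only for illustration; it is not part of data-describe, and equal-width bins are an assumption. In practice something like `numpy.histogram` would likely do this work.

```python
from collections import Counter


def bin_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins.

    Hypothetical helper for illustration only (not a data-describe API).
    """
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 when all values are identical.
    width = (hi - lo) / n_bins or 1.0
    counts = Counter()
    for v in values:
        # Clamp so the maximum value lands in the last bin, not a new one.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, [counts[i] for i in range(n_bins)]


edges, counts = bin_counts([1, 2, 2, 3, 5, 8, 8, 9], n_bins=4)
print(edges)   # bin boundaries
print(counts)  # number of values per bin
```

The user hands over raw values and the package does the binning, because the binning is what defines the plot. That is different from general-purpose cleaning, which stays the user's responsibility.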
from data-describe.
My suggested next steps
- convert this into a proposal
- provide some simplistic data examples