Identifying scope of any package is important. We need to decide: What types of da

My suggested next steps

Add a document specifying scope of the package (data-describe, CLOSED)

data-describe commented on May 28, 2024

Add a document specifying scope of the package

from data-describe.

Comments (2)

dvdjlaw commented on May 28, 2024

Thinking out loud:

Scope prioritization for this package should follow the 80/20 rule. I think the obvious main target should be tabular data used for classical ML problems i.e. classification/regression. However, there are some main considerations:

EDA for classical ML problems is most common and least likely to stand out against other tools. Coverage in this area should be focused on providing a polished, "one-stop shop" that can handle the most common analyses.

We should also be opinionated about analyses that are both effective and technically sound e.g. we shouldn't exert effort on building / supporting pie charts and word clouds just because they are used frequently.

Other types of data that can be used as a predictive feature in classical ML should be strongly considered. For example, locality (e.g. country, state, city, zip) is very commonly included as a predictive feature, while other geospatial-specific analyses (e.g. flight maps) are much less commonly applicable and may or may not be easily represented in tabular format.

In the geospatial domain, we benefit from geopandas representing geospatial data in a Pandas-like API which reduces the hurdle of incorporating support for certain types of geospatial data.

In scoping out data sources we should be careful not to conflate the source of data with the analysis approach. For example, signal data and alarm data can be framed as generic time series analysis or even classic ML/tabular. Computer vision data can be framed as a correlation heatmap on the raw pixels.
However, we should also keep an eye out for opportunities to cover more niche data types where there is a need that is not already covered by other (open source) tools. We should brainstorm further about what opportunities lie here.
We should be judicious about how/where we provide "automated data preparation" / transformation for specific data types as this package is not intended to eliminate the need for data preparation. I can think of two situations where we should lean towards providing this preparation as part of this package:

Strong standardization in data formats in the industry/domain, as in geospatial
Transformations that are considered to be part of the analysis itself: a simplistic example is the binning/counting for histograms
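
To make the histogram case concrete, here is a rough sketch of what "transformation as part of the analysis" could look like. This assumes NumPy is available; the helper name `histogram_counts` is hypothetical and is not an existing data-describe function. The point is only that the binning/counting step belongs to the histogram itself, so the user should not have to do it beforehand.

```python
import numpy as np


def histogram_counts(values, bins=10):
    """Hypothetical helper: bin numeric values and count observations per bin."""
    arr = np.asarray(values, dtype=float)
    # Drop missing values so they do not break the bin edge computation.
    arr = arr[~np.isnan(arr)]
    # np.histogram returns the per-bin counts and the bin edges.
    counts, edges = np.histogram(arr, bins=bins)
    return counts, edges


counts, edges = histogram_counts(
    [1, 2, 2, 3, 3, 3, 4, 4, 5, float("nan")], bins=4
)
```

By contrast, something like imputing the missing value above would be general data preparation and would stay outside the package under this reasoning.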
