
Comments (2)

dvdjlaw commented on May 28, 2024

Thinking out loud...:

Scope prioritization for this package should follow the 80/20 rule. I think the obvious main target should be tabular data used for classical ML problems, i.e. classification/regression. However, there are a few key considerations:

  1. EDA for classical ML problems is the most common use case and the least likely to stand out against other tools. Coverage in this area should focus on providing a polished "one-stop shop" that handles the most common analyses.
     • We should also be opinionated about which analyses are both effective and technically sound, e.g. we shouldn't spend effort building or supporting pie charts and word clouds just because they are used frequently.

  2. Other types of data that can be used as predictive features in classical ML should be strongly considered. For example, locality (e.g. country, state, city, zip) is very commonly included as a predictive feature, while other geospatial-specific analyses (e.g. flight maps) are much less commonly applicable and may or may not be easily represented in tabular format.
     • In the geospatial domain, we benefit from geopandas representing geospatial data through a Pandas-like API, which lowers the hurdle of supporting certain types of geospatial data.

  3. In scoping out data sources we should be careful not to conflate the source of the data with the analysis approach. For example, signal data and alarm data can be framed as generic time series analysis or even as classic ML/tabular data. Computer vision data can be framed as a correlation heatmap on the raw pixels.

  4. However, we should also keep an eye out for opportunities to cover more niche data types where there is a need that is not already covered by other (open source) tools. We should brainstorm further about what opportunities lie here.

  5. We should be judicious about how and where we provide "automated data preparation" / transformation for specific data types, as this package is not intended to eliminate the need for data preparation. I can think of two situations where we should lean towards providing this preparation as part of the package:
     • Strong standardization of data formats in the industry/domain, as in geospatial
     • Transformations that are considered part of the analysis itself: a simplistic example is the binning/counting for histograms
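
To make the histogram example concrete, here is a rough sketch of a transformation that is part of the analysis itself: the binning and counting happens inside the analysis function, so the user passes raw values and never pre-bins them. The function name `histogram_counts` and its `bins` parameter are hypothetical and not part of data-describe; `numpy.histogram` does the equivalent in practice.

```python
import math


def histogram_counts(values, bins=10):
    """Return (counts, edges) for equal-width bins over the non-missing values.

    Hypothetical helper for illustration only; not a data-describe API.
    """
    # Dropping missing values is part of the analysis, not a user-side prep step.
    clean = [float(v) for v in values if v is not None and not math.isnan(float(v))]
    if not clean:
        return [0] * bins, []

    lo, hi = min(clean), max(clean)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 when all values are equal
    edges = [lo + i * width for i in range(bins + 1)]

    counts = [0] * bins
    for v in clean:
        # The last bin is closed on the right, so the maximum lands in it.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts, edges


counts, edges = histogram_counts([1, 2, 2, 3, 3, 3, 4, 4, 5, float("nan")], bins=4)
```

By contrast, something like imputing the missing value or encoding a categorical column would stay with the user, since those choices change the data rather than summarize it.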

from data-describe.

brianray commented on May 28, 2024

My suggested next steps

  • convert this into a proposal
  • provide some simplistic data examples

