data-describe: Pythonic EDA Accelerator for Data Science
License: Other
e.g. long text strings in test_topic_model.py. These can be moved to a separate file.
In geospatial/mapping.py
The class Scagnostics in metrics/bivariate.py is very long (~500 lines). We also need to differentiate between user-level functions and internal/developer-level functions.
core/cluster.py contains many broad exceptions. These should be narrowed to handle specific exception types, with clearer exception messages for users and developers.
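For instance, a broad handler could be narrowed like this (a hypothetical sketch; fit_clusters and its signature are illustrative, not the actual core/cluster.py API):

```python
# Sketch: catch only the specific errors we expect, with actionable messages.
def fit_clusters(data, n_clusters):
    try:
        n = int(n_clusters)
    except (TypeError, ValueError) as err:
        # Narrow except clause instead of a bare `except Exception`,
        # plus a message that tells the user exactly what was wrong.
        raise ValueError(
            f"n_clusters must be an integer, got {n_clusters!r}"
        ) from err
    if not data:
        raise ValueError("data must be a non-empty sequence of observations")
    return {"n_clusters": n, "n_points": len(data)}
```

Chaining with `from err` preserves the original traceback, which helps developers while still giving users a readable message.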
According to this, jupyter is a requirement. That may be a deliberate design decision, and I am OK with it, but if so, jupyter needs to be listed in requirements.txt, and that is a big requirement.
To serve users in academia, we should consider whether to invest in making the visualizations publication-ready and easy to use. A couple of examples of similar efforts:
A couple of open questions that @brianray mentioned earlier:
"../data/weatherAUS.csv"
We need to:
Dependencies are heavy. We need to divide them into categories (required vs. optional, or by application area, e.g. geospatial vs. text) and handle exceptions properly in each module's import statements.
@soshel Can you give your expert review of the current geospatial work?
Make the package installable via conda.
cluster(df, target='Target')
Mime type rendering requires nbformat>=4.2.0 but it is not installed
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/opt/conda/envs/sanofi_py37/lib/python3.7/site-packages/IPython/core/formatters.py in __call__(self, obj)
916 method = get_real_method(obj, self.print_method)
917 if method is not None:
--> 918 method()
919 return True
920
/opt/conda/envs/sanofi_py37/lib/python3.7/site-packages/plotly/basedatatypes.py in _ipython_display_(self)
458
459 if pio.renderers.render_on_display and pio.renderers.default:
--> 460 pio.show(self)
461 else:
462 print(repr(self))
/opt/conda/envs/sanofi_py37/lib/python3.7/site-packages/plotly/io/_renderers.py in show(fig, renderer, validate, **kwargs)
384 if not nbformat or LooseVersion(nbformat.__version__) < LooseVersion("4.2.0"):
385 raise ValueError(
--> 386 "Mime type rendering requires nbformat>=4.2.0 but it is not installed"
387 )
388
ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed
Should we support automatic visualization? Refer to the automatic visualization in H2O's Driverless AI.
Variables/strings that appear in multiple places should be moved to a constant class so they can be easier to reuse, e.g.
Currently all the Python files are executable. This could lead to potential security issues.
An example is nteract's line chart for time series. There are a lot of users that work on time series so this might be something we want to cover.
How do we handle large datasets? Do we have a progress bar or informative warning messages while loading large datasets?
Currently DD is only suitable for running in notebooks via SDK. Should we support interactive plots?
Trade-offs and maintenance effort need to be considered. For R packages this is relatively easy, since plotly's R package converts static ggplot2 plots to interactive ones.
A question from @brianray on the Google Doc: can this be within the jupyter lab domain only?
e.g. in utilities/load_data.py: better to use **kwargs instead of kwargs=None; **kwargs always defaults to a dictionary, so you can call kwargs.get() directly.
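A sketch of the suggested change (the signatures are illustrative, not the actual load_data API):

```python
# Before: a None default forces a guard before every lookup.
def load_data_before(path, kwargs=None):
    kwargs = kwargs or {}
    sep = kwargs.get("sep", ",")
    return path, sep

# After: **kwargs is always a dict, so kwargs.get() works directly,
# and callers pass keyword arguments naturally.
def load_data_after(path, **kwargs):
    sep = kwargs.get("sep", ",")
    return path, sep
```

Call sites then read as `load_data_after("a.csv", sep=";")` instead of `load_data_before("a.csv", {"sep": ";"})`.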
The clip parameter is hardcoded to "state":

ax = geoplot.kdeplot(
    df=data.geometry.centroid,
    figsize=(context.fig_width, context.fig_height),
    clip=data.dissolve("state").geometry,
    shade_lowest=False,
    cmap="viridis",
    shade=True,
    **kde_kwargs,
)
setup.py should not contain/ship UploadCommand, which is used to publish the package to PyPI and depends on other tools like twine and git.
e.g. type == 'kde' in geospatial/mapping.py
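As a sketch of the shared-constants idea (class and constant names are illustrative, not the package's actual strings):

```python
# Sketch of a shared constants module, e.g. a single _constants.py,
# so string literals like 'kde' are defined in exactly one place.
class PlotType:
    KDE = "kde"
    HIST = "hist"
    SCATTER = "scatter"

def make_plot(plot_type):
    # Call sites compare against the shared definition, not a raw literal.
    if plot_type == PlotType.KDE:
        return "rendering kde"
    raise ValueError(f"Unknown plot type: {plot_type!r}")
```

This also gives typos a single point of failure: a misspelled `PlotType.KDE` raises an AttributeError immediately, while a misspelled `'kde'` string fails silently.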
Identifying scope of any package is important. We need to decide:
What types of data should this package accept?
-- Signal Data
-- Alarm Data
-- Work order data
-- Geospatial data
-- Computer vision data
-- LIDAR data
-- Any others?
Once the data types are identified, we can define acceptable schemas and functionality for each data type.
One idea to really make DD stand out against similar packages is the added capability of identifying sensitive data like PII and PHI. I looked into several packages, and Presidio seems to be the most mature, with a lot of support from the open-source community. Most packages rely on some combination of regex, a rules-based approach, and NER (spaCy). Let me know what you guys think.
Once the sensitive data is identified, we can anonymize the data:
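As a minimal sketch of the detect-then-anonymize step using the regex/rules-based approach mentioned above (patterns and tag names are illustrative; a tool like Presidio layers NER on top of rules like these):

```python
import re

# Illustrative rules only: real PII detection needs broader patterns plus NER.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def anonymize(text):
    """Replace each detected PII span with a placeholder tag."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text
```

Other anonymization strategies (masking, hashing, synthetic replacement) could slot into the same detect-then-replace structure.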
Similar to #49, should we consider building a GUI around the functionalities?
There is a high risk of users depending on internal methods, which makes it hard to develop a release strategy.
Something that is not copyleft: https://en.wikipedia.org/wiki/Comparison_of_free_and_open-source_software_licences
We can switch to module-level variables instead of global variables.
When an argument is required, remove the default value instead of checking whether it’s None, e.g. the function calculate_metrics() in metrics/bivariate.py.
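A sketch of the suggested change (the signature of calculate_metrics() here is illustrative, not the actual one in metrics/bivariate.py):

```python
# Before: a "required" argument hidden behind a None default
# needs a manual check in the function body.
def calculate_metrics_before(data=None):
    if data is None:
        raise ValueError("data is required")
    return len(data)

# After: with no default, Python enforces the requirement itself,
# raising a TypeError at the call site with the missing argument's name.
def calculate_metrics_after(data):
    return len(data)
```

Dropping the default also makes the signature self-documenting: help() and IDEs show the argument as required.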