data-describe: Pythonic EDA Accelerator for Data Science
License: Other
e.g. long text strings in test_topic_model.py. These can be moved to a separate file.
In geospatial/mapping.py
The class Scagnostics in metrics/bivariate.py is very long (~500 lines). We also need to differentiate between user-level functions and internal/developer-level functions.
core/cluster.py contains many broad exceptions. These should be narrowed to handle specific exception types, with clearer exception messages for users and developers.
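For instance, a broad handler could be narrowed like this (a hypothetical sketch; fit_clusters and its signature are illustrative, not the actual core/cluster.py API):

```python
# Sketch: catch only the specific errors we expect, with actionable messages.
def fit_clusters(data, n_clusters):
    try:
        n = int(n_clusters)
    except (TypeError, ValueError) as err:
        # Narrow except clause instead of a bare `except Exception`,
        # plus a message that tells the user exactly what was wrong.
        raise ValueError(
            f"n_clusters must be an integer, got {n_clusters!r}"
        ) from err
    if not data:
        raise ValueError("data must be a non-empty sequence of observations")
    return {"n_clusters": n, "n_points": len(data)}
```

Chaining with `from err` preserves the original traceback, which helps developers while still giving users a readable message.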
According to this, jupyter is a requirement. That may be a deliberate design decision, and I am OK with it, but if so, jupyter needs to be listed in requirements.txt, and that is a big requirement.
To serve users in academia, we should consider whether to invest in making the visualizations publication-ready and easy to use. A couple of examples of similar efforts:
A couple of open questions that @brianray mentioned earlier:
"../data/weatherAUS.csv"
We need to:
Dependencies are heavy. We need to divide them into categories (required vs. optional, or by application area, e.g. geospatial vs. text) and handle exceptions properly in each module's import statements.
@soshel Can you give your expert review of the current geospatial work?
Make the package installable via conda.
cluster(df, target='Target')
Mime type rendering requires nbformat>=4.2.0 but it is not installed
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/opt/conda/envs/sanofi_py37/lib/python3.7/site-packages/IPython/core/formatters.py in __call__(self, obj)
916 method = get_real_method(obj, self.print_method)
917 if method is not None:
--> 918 method()
919 return True
920
/opt/conda/envs/sanofi_py37/lib/python3.7/site-packages/plotly/basedatatypes.py in _ipython_display_(self)
458
459 if pio.renderers.render_on_display and pio.renderers.default:
--> 460 pio.show(self)
461 else:
462 print(repr(self))
/opt/conda/envs/sanofi_py37/lib/python3.7/site-packages/plotly/io/_renderers.py in show(fig, renderer, validate, **kwargs)
384 if not nbformat or LooseVersion(nbformat.__version__) < LooseVersion("4.2.0"):
385 raise ValueError(
--> 386 "Mime type rendering requires nbformat>=4.2.0 but it is not installed"
387 )
388
ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed
Should we support automatic visualization? Refer to the automatic visualization in H2O's Driverless AI.
Variables/strings that appear in multiple places should be moved to a constant class so they can be easier to reuse, e.g.
Currently all the Python files are executable. This could lead to potential security issues.
An example is nteract's line chart for time series. There are a lot of users that work on time series so this might be something we want to cover.
How do we handle large datasets? Do we have a progress bar or informative warning messages while loading large datasets?
Currently DD is only suitable for running in notebooks via SDK. Should we support interactive plots?
Trade-offs and maintenance effort need to be considered. For R packages this is relatively easy, since plotly's R package converts static ggplot2 plots to interactive ones.
A question from @brianray on the Google Doc: can this be within the jupyter lab domain only?
e.g. in utilities/load_data.py: better to use **kwargs instead of kwargs=None; **kwargs always defaults to a dictionary, so you can call kwargs.get() directly.
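A sketch of the suggested change (the signatures are illustrative, not the actual load_data API):

```python
# Before: a None default forces a guard before every lookup.
def load_data_before(path, kwargs=None):
    kwargs = kwargs or {}
    sep = kwargs.get("sep", ",")
    return path, sep

# After: **kwargs is always a dict, so kwargs.get() works directly,
# and callers pass keyword arguments naturally.
def load_data_after(path, **kwargs):
    sep = kwargs.get("sep", ",")
    return path, sep
```

Call sites then read as `load_data_after("a.csv", sep=";")` instead of `load_data_before("a.csv", {"sep": ";"})`.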
The clip parameter is hardcoded to "state":

ax = geoplot.kdeplot(
    df=data.geometry.centroid,
    figsize=(context.fig_width, context.fig_height),
    clip=data.dissolve("state").geometry,
    shade_lowest=False,
    cmap="viridis",
    shade=True,
    **kde_kwargs,
)
setup.py should not contain/ship UploadCommand, which is used to publish the package to PyPI and depends on other tools like twine and git.
e.g. type == 'kde' in geospatial/mapping.py
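As a sketch of the shared-constants idea (class and constant names are illustrative, not the package's actual strings):

```python
# Sketch of a shared constants module, e.g. a single _constants.py,
# so string literals like 'kde' are defined in exactly one place.
class PlotType:
    KDE = "kde"
    HIST = "hist"
    SCATTER = "scatter"

def make_plot(plot_type):
    # Call sites compare against the shared definition, not a raw literal.
    if plot_type == PlotType.KDE:
        return "rendering kde"
    raise ValueError(f"Unknown plot type: {plot_type!r}")
```

This also gives typos a single point of failure: a misspelled `PlotType.KDE` raises an AttributeError immediately, while a misspelled `'kde'` string fails silently.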
Identifying scope of any package is important. We need to decide:
What types of data should this package accept?
-- Signal Data
-- Alarm Data
-- Work order data
-- Geospatial data
-- Computer vision data
-- LIDAR data
-- Any others?
Once the data types are identified, we can define acceptable schemas and functionality for each data type.
One idea to really make DD stand out against similar packages is the added capability of identifying sensitive data like PII and PHI. I looked into several packages, and Presidio seems to be the most mature, with a lot of support from the open-source community. Most packages rely on some combination of regex, a rules-based approach, and NER (spaCy). Let me know what you guys think.
Once the sensitive data is identified, we can anonymize the data:
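As a minimal sketch of the detect-then-anonymize step using the regex/rules-based approach mentioned above (patterns and tag names are illustrative; a tool like Presidio layers NER on top of rules like these):

```python
import re

# Illustrative rules only: real PII detection needs broader patterns plus NER.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def anonymize(text):
    """Replace each detected PII span with a placeholder tag."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text
```

Other anonymization strategies (masking, hashing, synthetic replacement) could slot into the same detect-then-replace structure.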
Similar to #49, should we consider building a GUI around the functionalities?
There is a high risk of users depending on internal methods, which makes it hard to develop a release strategy.
Something that is not copyleft: https://en.wikipedia.org/wiki/Comparison_of_free_and_open-source_software_licences
We can switch to module-level variables instead of global variables.
When an argument is required, remove the default value instead of checking whether it’s None, e.g. the function calculate_metrics() in metrics/bivariate.py.
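A sketch of the suggested change (the signature of calculate_metrics() here is illustrative, not the actual one in metrics/bivariate.py):

```python
# Before: a "required" argument hidden behind a None default
# needs a manual check in the function body.
def calculate_metrics_before(data=None):
    if data is None:
        raise ValueError("data is required")
    return len(data)

# After: with no default, Python enforces the requirement itself,
# raising a TypeError at the call site with the missing argument's name.
def calculate_metrics_after(data):
    return len(data)
```

Dropping the default also makes the signature self-documenting: help() and IDEs show the argument as required.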