Comments (8)
+1 to this idea and to incorporating it into this project, as we discussed earlier today.
In terms of maturity, Presidio seems like a good option. It makes more sense to reuse an existing solution, since this kind of detection takes a lot of domain knowledge and is often business-specific.
Regarding anonymization, Presidio also seems to support some level of customizable anonymization. Do we want to leverage that? Trumania doesn't look widely used. Perhaps we can just use Faker and build something suitable for our project?
from data-describe.
GCP DLP does this, but it is cloud native. I like the API: https://cloud.google.com/dlp/docs/apis. One thing I thought of was some sort of abstract data scheme, maybe with an architecture behind it (like Apache Arrow), that enforces end-to-end handling of PII/PHI/financial data. Can we:
- Detect where unknown
- Hash and encrypt where possible
- Keep the general Info Types intact (ref https://cloud.google.com/dlp/docs/infotypes-reference)
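The three asks above could be sketched roughly as follows. This is only an illustration using the standard library: the `DETECTORS` table, the `pseudonymize` helper, and the regex patterns are hypothetical and far cruder than what Presidio or DLP actually do. It covers the keyed-hash part only (no reversible encryption), and keeps the info type name visible next to each digest.

```python
import hashlib
import hmac
import re

# Hypothetical detector table: info type name -> regex.
# The names follow the DLP infoType naming style; the patterns are
# deliberately simple and only meant for illustration.
DETECTORS = {
    "EMAIL_ADDRESS": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE_NUMBER": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def pseudonymize(text: str, key: bytes) -> str:
    """Replace each detected value with '<INFO_TYPE:digest>'.

    The digest is a truncated HMAC-SHA256, so the same input and key
    always map to the same token, while the info type stays readable.
    """
    for info_type, pattern in DETECTORS.items():
        def _replace(match, info_type=info_type):
            digest = hmac.new(key, match.group().encode(), hashlib.sha256).hexdigest()
            return f"<{info_type}:{digest[:12]}>"

        text = pattern.sub(_replace, text)
    return text


masked = pseudonymize("Contact jane@example.com or 555-123-4567.", key=b"demo-key")
print(masked)
```

A keyed hash is used here rather than a plain hash so that tokens cannot be recomputed by anyone who lacks the key. Whether that is acceptable for a given dataset is a policy decision this sketch does not settle.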
from data-describe.
GCP DLP does this, but it is cloud native. I like the API: https://cloud.google.com/dlp/docs/apis. One thing I thought of was some sort of abstract data scheme, maybe with an architecture behind it (like Apache Arrow), that enforces end-to-end handling of PII/PHI/financial data.
Could you elaborate a bit on this? What do you mean by "abstract data scheme"?
Can we:
- [ ] Detect where unknown
- [ ] Hash and encrypt where possible
- [ ] Keep the general Info Types intact (ref https://cloud.google.com/dlp/docs/infotypes-reference)
These sound great to me.
from data-describe.
By "abstract data scheme" I mean something like a typed schema, as in PEP 589 (TypedDict) combined with NewType from PEP 484, e.g.:
from typing import NewType, TypedDict

# A distinct type for email addresses, so sensitive fields stand out in the schema.
EMAIL_ADDRESS = NewType('EMAIL_ADDRESS', str)

class Person(TypedDict):
    first: str
    last: str
    email: EMAIL_ADDRESS
Then, when writing to the Person dict, we take care to secure the data by whatever means we decide is acceptable. DLP has a number of ways to do this.
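One way this could look in practice, as a minimal sketch: the schema's type hints are inspected at write time, and any field whose declared type has a registered policy is transformed before it is stored. The `POLICIES` table and the `secure` helper are hypothetical names introduced for illustration, and the unsalted SHA-256 is only a placeholder for whatever method is eventually chosen.

```python
import hashlib
from typing import NewType, TypedDict, get_type_hints

EMAIL_ADDRESS = NewType("EMAIL_ADDRESS", str)


class Person(TypedDict):
    first: str
    last: str
    email: EMAIL_ADDRESS


def _sha256(value: str) -> str:
    # Placeholder transform; an unsalted hash of an email is weak in practice.
    return hashlib.sha256(value.encode()).hexdigest()


# Hypothetical policy table: sensitive type -> function that secures a value.
POLICIES = {EMAIL_ADDRESS: _sha256}


def secure(schema, record: dict) -> dict:
    """Return a copy of record with policy-covered fields transformed."""
    hints = get_type_hints(schema)
    return {
        field: POLICIES[hints[field]](value) if hints.get(field) in POLICIES else value
        for field, value in record.items()
    }


person = secure(Person, {"first": "Jane", "last": "Doe", "email": "jane@example.com"})
print(person)
```

The point of the design is that the schema, not the calling code, decides which fields are sensitive, so adding a new protected type means adding one `NewType` and one policy entry.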
from data-describe.
I see. That is a good idea.
from data-describe.
I just created a new issue, #34, for adding a design doc for this and assigned it to you, @truongc2. Please follow the template I put together in #33. It's easier to review designs in pull requests since we can comment in-line.
from data-describe.
Did you want to merge the template into master? Or was it intended for something else?
from data-describe.
Yes, but it needs at least one approval before merging. I just requested reviews.
from data-describe.