
data-sanitiser's Introduction

Sanitise (by removal, tokenisation, or redaction) Personal and Private Data

Code (regular expressions and NLTK) to tokenise (remove) Private Personal Data in unstructured data.

Essentially, it cleans data of personal information, using NLTK Named Entity Recognition along with a waterfall of regular expressions that identify and replace any words (entities) matching the expected pattern of known personal and private information. It does this in a way that retains some level of readability and semantic meaning in the data.

It turns this...

My email address is [email protected] and [email protected] N16 9Ln I like bank holidays and speaking french. my ssn is 078-06-1120 call me on 078371827735 or 0207 183 1573 - your sincerely Lindsay Smith and by the way I work at Telrock

into this. (something you could pass to a 3rd party and they wouldn't need to be classed as a GDPR Processor)

My email address is [email protected] and [email protected] UKPOSTCODE I like bank holidays and speaking french. my ssn is SSN call me on UKPHONE or UKPHONE - your sincerely PERSON and by the way I work at ORGANIZATION
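The transformation above can be sketched as a waterfall of regular expressions applied one after another. This is a minimal illustration, not the project's actual code: the three patterns are simplified assumptions, and the names `WATERFALL` and `sanitise` are hypothetical.

```python
import re

# Ordered (token, pattern) pairs. These simplified patterns are illustrative
# assumptions, not the project's real regular expressions.
WATERFALL = [
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("UKPOSTCODE", re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b", re.IGNORECASE)),
]


def sanitise(text):
    """Run every pattern in order, replacing each match with its token."""
    for token, pattern in WATERFALL:
        text = pattern.sub(token, text)
    return text


print(sanitise("my ssn is 078-06-1120 and I live in N16 9Ln"))
```

Replacing a match with a descriptive token, rather than deleting it, is what keeps the sentence readable after sanitising.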

Inspiration - models memorise secrets

"never feed secrets as training data"

The inspiration for this project is this paper, https://arxiv.org/abs/1802.08232, which The Register explains in its inimitable fashion. Briefly, Google trained their models with credit card numbers and now the card numbers are stored in the model. Whoops!

The paper has a decent suggestion to overcome the vulnerability, but assumes secrets have a low log perplexity (appear infrequently). That is often not a characteristic of PPI: some PPI has a high log perplexity, and some PPI is quite easily identified by pattern. What do you do if your training data is full of PPI?

"Intuitively, if the defender can identify secrets in the training data, then they can be removed from the model before it is trained. Such an approach guarantees to prevent memorization if the secrets can be identified, since the secrets will not appear in the training data, and thus not be observed by the model during training."

"The key challenge of this approach is how to identify the secrets in the training data. Several heuristics can be used. For example, if the secrets were known to follow some template (e.g., a regular expression), the defender may be able to remove all substrings matching the template from the training data in a preprocessing step. However, such heuristics cannot be exhaustive, and the defender never be aware of all potential templates that may exist in the training data. When the secrets cannot be captured by the heuristics, the defense will fail."

The Secret Sharer: Measuring Unintended Neural Network Memorization & Extracting Secrets

What is Personal and Private Data?

Here's the thing - there is no exhaustive list. At one level it is intuitive, but at best the guidance you will read is "data like examples a, b, c is personal/private/sensitive data". NIST in the US and the ICO in the UK provide such examples. (I'm working on the "periodic table" of personal, private and sensitive data)

This code will not exhaustively sanitise all Personal and Private Data because the definition of Personal and Private Data is not a definitive list. GDPR language is intentionally descriptive, not definitive. It's a truism: how can you exhaustively identify things you can't define?

However, it is possible to exhaustively test for some PPI. For example, there are 1.6m postcodes in the UK. The regex used here has been tested against all 1.6m with 100% accuracy. For some PPI, it's trickier, in particular names and addresses. However, it is conceivable to exhaustively test your rules against every name and address in the UK Electoral Register.
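A check of this kind could look roughly like the sketch below. The regex is a simplified assumption rather than the one used in this project, the four-item `sample` list stands in for the full list of postcodes (which would normally be read from a file), and the helper name `unmatched` is hypothetical.

```python
import re

# Simplified UK postcode pattern, assumed for illustration only.
UK_POSTCODE = re.compile(r"[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}", re.IGNORECASE)


def unmatched(pattern, candidates):
    """Return every candidate that the pattern fails to match in full."""
    return [c for c in candidates if not pattern.fullmatch(c.strip())]


# Stand-in for the complete list of known postcodes.
sample = ["N16 9LN", "SW1A 1AA", "M1 1AE", "EC1A 1BB"]
misses = unmatched(UK_POSTCODE, sample)
print(f"{len(sample) - len(misses)} of {len(sample)} matched")
```

An empty `misses` list only shows that no known postcode is missed. It says nothing about false positives, which need a separate set of strings that must not match.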

As someone said to me -

... there is no definitive list of attributes, indeed the challenge is with modern technology/data sources is that new attributes are continually being created, e.g. Geolocation data, timestamp data, descriptive data that can identify an individual - male, Kiwi accent, blue jeans with turnups, blue open neck sweater, black Doc Martin boots, Mildmay pub, Islington 5.30pm Friday May 25

What about the US, EU and every other country's definition of Personal and Private Data?

U.S. and EU privacy law diverge greatly. At the foundational level, they differ in their underlying philosophy: In the United States, privacy law focuses on redressing consumer harm and balancing privacy with efficient commercial transactions. In the European Union, privacy is hailed as a fundamental right that can trump other interests. Paul M. Schwartz and Daniel J. Solove, Reconciling Personal Information in the United States and European Union, 102 Calif. L. Rev. 877 (2014).

European data protection law does not utilise the concept of PII, and its scope is instead determined by a non-synonymous, wider concept of "personal data". A good overview can be found on Wikipedia - Personally_identifiable_information

Current Data Cleaning capabilities

Refer to the tests to understand exactly what these data items mean and how wide the matching works.

  • UK postcode
  • US social security number
  • email address
  • UK phone number
  • US and Canada phone number
  • Payment cards (Amex, BCGlobal, Carte Blanche Card, Diners Club Card, Discover Card, Insta Payment Card, JCB Card, KoreanLocalCard, Laser Card, Maestro Card, Mastercard, Solo Card, Switch Card, Union Pay Card, Visa Card, Visa Master Card)
  • US Zipcodes
  • Canadian Postcodes
  • account number (any run of 5 to 12 digits; matched last so that it does not pick up more specific matches)
  • person (name, title, initial)
  • organisation - merchant name
  • city
  • Locations (GPE)
  • state name
  • state code
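The note on the account number rule is about ordering: a generic "run of digits" pattern will swallow anything a more specific pattern would have caught if it runs first. The sketch below illustrates this with two simplified, assumed patterns. The `ACCOUNT` token and the `apply_in_order` helper are hypothetical names, and the phone number is a made-up example.

```python
import re

# Simplified patterns, assumed for illustration only.
UKPHONE = re.compile(r"\b0\d{10}\b")   # 11 digits starting with 0, no spaces
ACCOUNT = re.compile(r"\b\d{5,12}\b")  # any run of 5 to 12 digits


def apply_in_order(text, rules):
    """Apply each (token, pattern) rule in the order given."""
    for token, pattern in rules:
        text = pattern.sub(token, text)
    return text


text = "call me on 07700900123 about account 12345678"

# Specific rule first: the phone number keeps its more informative token.
specific_first = apply_in_order(text, [("UKPHONE", UKPHONE), ("ACCOUNT", ACCOUNT)])

# Generic rule first: the phone number is consumed as an account number.
generic_first = apply_in_order(text, [("ACCOUNT", ACCOUNT), ("UKPHONE", UKPHONE)])

print(specific_first)
print(generic_first)
```

Both orderings remove the digits, but only the specific-first ordering preserves what kind of data was there.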

To do

  • address (Street)
  • date (eg DOB but any date)
  • time
  • money amount
  • product name
  • and for fun let's do profanities

Use Cases

  • Sanitising training data before you feed it in to a Machine Learning model
  • Sanitising data if you want to move a copy of data out of Production for test purposes
  • Sanitising logs - re-write logs in-situ to remove PPI as an Infosec control
  • Data Loss Prevention - script it into your email server to sanitise outbound emails. Could definitely be done in Postfix without too many headaches
  • Sanitising historical audit data

NB There is no guarantee that this (or anything) will remove all PPI.

NLTK

It uses the default implementation of NER available in NLTK. It does OK at recognising names. GPE stands for "Geo-political entity", i.e. the name of a location such as a city.
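As a rough sketch of how NLTK's default NER can drive the replacement (an illustration under assumptions, not this project's code): `nltk.word_tokenize`, `nltk.pos_tag` and `nltk.ne_chunk` are real NLTK calls, while `label_entities` and `redact` are hypothetical helper names. NLTK is imported lazily so that `redact` can be used on its own.

```python
def label_entities(text):
    """Return (words, label) pairs; label is None for non-entity tokens.

    Requires NLTK and its tokeniser, tagger and chunker models.
    """
    import nltk  # imported here so redact() works without NLTK installed

    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))
    pairs = []
    for node in tree:
        if hasattr(node, "label"):
            # A subtree is a named entity; join its words into one item.
            words = " ".join(word for word, _tag in node.leaves())
            pairs.append((words, node.label()))
        else:
            word, _tag = node
            pairs.append((word, None))
    return pairs


def redact(pairs, labels=("PERSON", "ORGANIZATION", "GPE")):
    """Replace entities whose label is in `labels` with the label itself."""
    return " ".join(label if label in labels else words for words, label in pairs)


# Hand-written pairs, shaped like the output of label_entities().
example = [("I", None), ("work", None), ("at", None), ("Telrock", "ORGANIZATION")]
print(redact(example))
```

Joining tokens with single spaces does not restore the original spacing or punctuation, so a real implementation would more likely replace entity text within the original string.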

Credit to all the various sources for the Regex

There are probably Stack Overflow posts I've missed, happy to be corrected.

Go here first

Credit Cards

Test Card Numbers

Various phone number formats

UK Postcodes

US & Canadian zipcodes

Social Security Numbers

Email Addresses

Profanity & Censorship UK List from OfCom

These projects are interesting. Scrubadub is similar to this but I prefer the simplicity of a waterfall of Regex.

FAQ

Why do I get requests from NLTK to download stuff the first time I run it?

NLTK needs some basic models to run, and it works out which ones it needs the first time it is run. The error messages tell you what you need to do and how to do it. There are a few: nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'), nltk.download('maxent_ne_chunker'), nltk.download('words'). Read this: https://github.com/nltk/nltk/wiki/Frequently-Asked-Questions-(Stackoverflow-Edition)
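To avoid the prompts altogether, the four resources named above can be fetched once up front. A minimal sketch: `nltk.download` is the real NLTK call, while `NLTK_RESOURCES` and `download_nltk_resources` are hypothetical names, and the optional `download` argument exists only so the function can be exercised without network access.

```python
NLTK_RESOURCES = (
    "punkt",
    "averaged_perceptron_tagger",
    "maxent_ne_chunker",
    "words",
)


def download_nltk_resources(download=None):
    """Download each resource and return a {name: result} mapping."""
    if download is None:
        import nltk  # imported lazily; only needed for the real download

        download = nltk.download
    return {name: download(name) for name in NLTK_RESOURCES}
```

Calling `download_nltk_resources()` once, for example in a setup script, should be enough. Newer NLTK releases may ask for differently named resources, in which case the error message names the one to add.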

Interesting Papers

Out of date in terms of legal status of PPI but good list of techniques.

Guide to Protecting the Confidentiality of Personally Identifiable Information (PII) - Recommendations of the National Institute of Standards and Technology
