
DaAnonymization

Anonymization tool for Danish text


Description

A simple pipeline wrapped around spaCy and DaCy for anonymizing Danish corpora. The pipeline allows custom functions to be implemented and run in combination with the default masking functions.

The DaCy model is built on multilingual RoBERTa, which enables cross-lingual transfer from other languages, ultimately providing a robust named entity recognition model for anonymization that can handle noisy Danish text containing other languages.

The languages covered by multilingual RoBERTa can be found in Appendix A of the XLM-RoBERTa paper: Unsupervised Cross-lingual Representation Learning at Scale.

  • Free software: Apache-2.0 license

Disclaimer: As the pipeline relies on predictive models and regex functions to identify entities, there is no guarantee that all sensitive information has been removed.

Features

  • Regex for CPR numbers, telephone numbers and emails (illustrative patterns are sketched after this list)
  • Integration of custom functions as part of the pipeline
  • Named entity models for Danish (PER, LOC, ORG, MISC):
    • DaCy: https://github.com/KennethEnevoldsen/DaCy
    • Default entities to mask: PER, LOC and ORG (MISC can be specified but covers many different entities)
    • Batch mode and multiprocessing
    • DaCy is robust to language changes as it is fine-tuned from a multilingual RoBERTa model
  • Anonymization using suppression
  • Masking that is aware of prior knowledge about individuals occurring in the texts
  • Pseudonymization module (Person 1, Person 2, etc.)
  • Logging in the mask_corpus function, enabling tracking of warnings if no person was found in a text
  • Beta version: masking numbers or noising them with Laplace (epsilon) noise
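
The library's internal patterns and noise implementation are not reproduced in this README; the following is a minimal sketch, assuming standard Danish formats (CPR: six digits, an optional hyphen, four digits; telephone: eight digits with an optional +45 prefix), of what the regex and Laplace-noise features could look like. All names here are hypothetical, not the library's internals.

import re
import numpy as np

# Hypothetical illustrative patterns, not the library's internal ones
CPR_PATTERN = re.compile(r"\b\d{6}-?\d{4}\b")                      # e.g. 010203-2010
PHONE_PATTERN = re.compile(r"(?:\+45\s?)?\d{2}(?:\s?\d{2}){3}\b")  # e.g. +4545454545
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def find_cprs(text):
    # Masking CPRs before telephone numbers keeps the phone pattern
    # from matching digit runs inside an unmasked CPR
    return set(CPR_PATTERN.findall(text))

def laplace_noise(value, epsilon=1.0, sensitivity=1.0):
    # Differential-privacy-style noising: Laplace noise with scale sensitivity/epsilon
    return value + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)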

Installation

Install from pip

pip install DaAnonymization

Install from source

pip install git+https://github.com/martincjespersen/DaAnonymization.git

Quickstart

DaAnonymization's two main components are:

  • TextAnonymizer
  • TextPseudonymizer

Both components use their mask_corpus function to anonymize/pseudonymize text by removing persons, locations, organizations, emails, telephone numbers and CPR numbers. The default order of the masking methods is CPR, telephone number, email and then NER (PER, LOC, ORG), since NER would otherwise identify the names inside email addresses. The following example applies the default anonymization and shows how the model also transfers cross-lingually to English.

from textprivacy import TextAnonymizer

# list of texts (example with cross-lingual transfer to english)
corpus = [
    "Hej, jeg hedder Martin Jespersen og er fra Danmark og arbejder i "
    "Deloitte, mit cpr er 010203-2010, telefon: +4545454545 "
    "og email: [email protected]",
    "Hi, my name is Martin Jespersen and work in Deloitte. "
    "I used to be a PhD. at DTU in Machine Learning and B-cell immunoinformatics "
    "at Anker Engelunds Vej 1 Bygning 101A, 2800 Kgs. Lyngby.",
]

Anonymizer = TextAnonymizer(corpus)

# Anonymize person, location, organization, emails, CPR and telephone numbers
anonymized_corpus = Anonymizer.mask_corpus()

for text in anonymized_corpus:
    print(text)

Running this script outputs the following:

Hej, jeg hedder [PERSON] og er fra [LOKATION] og arbejder i [ORGANISATION], mit cpr er [CPR],
telefon: [TELEFON] og email: [EMAIL]

Hi, my name is [PERSON] and work in [ORGANISATION]. I used to be a PhD. at [ORGANISATION]
in Machine Learning and B-cell immunoinformatics at [LOKATION].

Using custom masking functions

As each project can have specific needs, DaAnonymization supports adding custom functions to the pipeline for masking additional entities that are not handled by default.

from textprivacy import TextAnonymizer
import re

# Takes a string as input and returns the set of all occurrences
example_custom_function = lambda x: set(re.findall(r"\d+ år", x))

# list of texts
corpus = [
    "Hej, jeg hedder Martin Jespersen, er 20 år, er fra Danmark og arbejder i "
    "Deloitte, mit cpr er 010203-2010, telefon: +4545454545 "
    "og email: [email protected]",
]

Anonymizer = TextAnonymizer(corpus)

# update the mapping with the new entity type and its replacement placeholder
Anonymizer.mapping.update({"ALDER": "[ALDER]"})

# add the new entity name to masking_order at the desired position
# register the custom function in custom_functions to extend the pool of masking functions
anonymized_corpus = Anonymizer.mask_corpus(
    masking_order=["CPR", "TELEFON", "EMAIL", "NER", "ALDER"],
    custom_functions={"ALDER": example_custom_function},
)

for text in anonymized_corpus:
    print(text)

Running this script outputs the following:

Hej, jeg hedder [PERSON], er [ALDER], er fra [LOKATION] og arbejder i [ORGANISATION],
mit cpr er [CPR], telefon: [TELEFON] og email: [EMAIL]

Pseudonymization with prior knowledge

Sometimes it is useful to maintain some context around the sensitive information in a text. Pseudonymization maintains the connections between entities while masking them: each individual, and the information belonging to them, is replaced by a unique identifier.

By using the optional input argument individuals, you can supply prior information about known individuals in the texts you want to mask. The structure of individuals is shown below: the outer dictionary is keyed by the index of the text in the corpus, the next level by the unique (integer) identifier of each individual, and the innermost dictionaries map entity types to the entities known in advance for each individual.

from textprivacy import TextPseudonymizer

# prior information about the text
individuals = {
    1: {
        1: {
            'PER': {'Martin Jespersen', 'Martin', 'Jespersen, Martin'},
            'CPR': {'010203-2010'},
            'EMAIL': {'[email protected]'},
            'LOC': {'Danmark'},
            'ORG': {'Deloitte'},
        },
        2: {
            'PER': {'Kristina'},
            'ORG': {'Novo Nordisk'},
        },
    }
}

# list of texts
corpus = [
    "Første tekst om intet, blot Martin",
    "Hej, jeg hedder Martin Jespersen og er fra Danmark og arbejder i "
    "Deloitte, mit cpr er 010203-2010, telefon: +4545454545 "
    "og email: [email protected]. Martin er en 20 årig mand. "
    "Kristina er en person som arbejder i Novo Nordisk. "
    "Frank er en mand som bor i Danmark og arbejder i Netto",
]

Pseudonymizer = TextPseudonymizer(corpus, individuals=individuals)

# Pseudonymize person, location, organization, emails, CPR and telephone numbers
pseudonymized_corpus = Pseudonymizer.mask_corpus()

for text in pseudonymized_corpus:
    print(text)

Running this script outputs the following:

Første tekst om intet, blot Person 1

Hej, jeg hedder Person 1 og er fra Lokation 1 og arbejder i Organisation 1, mit cpr er CPR 1,
telefon: Telefon 5 og email: Email 1. Person 1 er en 20 årig mand. Person 2 er en person som
arbejder i Organisation 2. Person 3 er en mand som bor i Lokation 1 og arbejder i Organisation 4

Demo using Streamlit

DaAnonymization now ships with an easy demo website built with Streamlit.

pip install streamlit==1.2.0
streamlit run app.py

Running the commands above launches a website demoing the use of DaAnonymization.
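
The repository ships its own app.py, which is not reproduced here; purely as an illustration of the idea, a minimal Streamlit demo wired to TextAnonymizer could look like this (hypothetical, the actual app.py may differ):

import streamlit as st
from textprivacy import TextAnonymizer

# Hypothetical minimal demo; the repository's app.py may differ
st.title("DaAnonymization demo")
text = st.text_area("Text to anonymize")
if st.button("Anonymize") and text:
    # Wrap the single input text in a one-element corpus
    anonymized_corpus = TextAnonymizer([text]).mask_corpus()
    st.write(anonymized_corpus[0])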

Fairness evaluations

Evaluations of gender and error biases are available in the DaCy documentation.

Next up

  • When spaCy fixes multiprocessing in nlp.pipe, remove the current workaround (see the sketch below)
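
For context, spaCy's nlp.pipe already exposes batching and multiprocessing through its n_process and batch_size arguments; a minimal sketch, assuming a separately installed Danish pipeline such as da_core_news_sm:

import spacy

# Assumes `python -m spacy download da_core_news_sm` has been run
nlp = spacy.load("da_core_news_sm")
texts = ["Martin arbejder i Deloitte.", "Kristina bor i Danmark."]

# n_process sets the number of worker processes, batch_size the chunk size
for doc in nlp.pipe(texts, n_process=2, batch_size=32):
    print([(ent.text, ent.label_) for ent in doc.ents])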

Contributors

deloitte-martinclosterjespersen, glichtner, kennethenevoldsen, martincjespersen


Issues

spaCy version mismatch causes failure

  • TextAnonymization version:
  • Python version: 3.8.13
  • Operating System: Windows, Ubuntu Linux

Description

I installed DaAnonymization with pip a week ago and tried to run your example from the readme, but it fails because of some mismatch between DaCy Large and the current spaCy version.

What I Did

The script anon_test.py:

from textprivacy import TextAnonymizer

# list of texts (example with cross-lingual transfer to english)
corpus = [
    "Hej, jeg hedder Martin Jespersen og er fra Danmark og arbejder i "
    "Deloitte, mit cpr er 010203-2010, telefon: +4545454545 "
    "og email: [email protected]",
    "Hi, my name is Martin Jespersen and work in Deloitte. "
    "I used to be a PhD. at DTU in Machine Learning and B-cell immunoinformatics "
    "at Anker Engelunds Vej 1 Bygning 101A, 2800 Kgs. Lyngby.",
]

Anonymizer = TextAnonymizer(corpus)

# Anonymize person, location, organization, emails, CPR and telephone numbers
anonymized_corpus = Anonymizer.mask_corpus()

for text in anonymized_corpus:
    print(text)

(anon): ~$ /home/akirkedal/software/anaconda/envs/anon/bin/python /home/akirkedal/software/anon/anon_test.py
/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/spacy/util.py:762: UserWarning: [W095] Model 'da_dacy_large_tft' (0.0.0) was trained with spaCy v3.0 and may not be 100% compatible with the current version (3.1.4). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
  warnings.warn(warn_msg)
Traceback (most recent call last):
  File "/home/akirkedal/software/anon/anon_test.py", line 1, in <module>
    from textprivacy import TextAnonymizer
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/textprivacy/__init__.py", line 7, in <module>
    from textprivacy.textanonymization import TextAnonymizer
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/textprivacy/textanonymization.py", line 34, in <module>
    ner_model = dacy.load("da_dacy_large_tft-0.0.0")
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/dacy/load.py", line 39, in load
    return spacy.load(path)
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/spacy/__init__.py", line 51, in load
    return util.load_model(
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/spacy/util.py", line 351, in load_model
    return load_model_from_path(Path(name), **kwargs)  # type: ignore[arg-type]
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/spacy/util.py", line 418, in load_model_from_path
    return nlp.from_disk(model_path, exclude=exclude, overrides=overrides)
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/spacy/language.py", line 2021, in from_disk
    util.from_disk(path, deserializers, exclude)  # type: ignore[arg-type]
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/spacy/util.py", line 1229, in from_disk
    reader(path / key)
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/spacy/language.py", line 2015, in <lambda>
    deserializers[name] = lambda p, proc=proc: proc.from_disk(  # type: ignore[misc]
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/spacy_transformers/pipeline_component.py", line 402, in from_disk
    util.from_disk(path, deserialize, exclude)
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/spacy/util.py", line 1229, in from_disk
    reader(path / key)
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/spacy_transformers/pipeline_component.py", line 391, in load_model
    tokenizer, transformer = huggingface_from_pretrained(
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/spacy_transformers/util.py", line 31, in huggingface_from_pretrained
    tokenizer = AutoTokenizer.from_pretrained(str_path, **config)
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 568, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1732, in from_pretrained
    return cls._from_pretrained(
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1850, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/transformers/models/xlm_roberta/tokenization_xlm_roberta_fast.py", line 134, in __init__
    super().__init__(
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 110, in __init__
    fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/transformers/convert_slow_tokenizer.py", line 829, in convert_slow_tokenizer
    return converter_class(transformer_tokenizer).converted()
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/transformers/convert_slow_tokenizer.py", line 375, in __init__
    from .utils import sentencepiece_model_pb2 as model_pb2
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/transformers/utils/sentencepiece_model_pb2.py", line 52, in <module>
    _descriptor.EnumValueDescriptor(name="UNIGRAM", index=0, number=1, options=None, type=None),
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/google/protobuf/descriptor.py", line 755, in __new__
    _message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates
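
The workarounds suggested by the error message itself translate directly into commands, for example:

pip install "protobuf<3.21"

or, at the cost of much slower pure-Python parsing:

export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python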
