martincjespersen / daanonymization

Simple customizable pipeline tool for anonymizing Danish text.
License: Apache License 2.0
I installed DaAnonymization with pip a week ago and tried to run the example from the readme, but it fails because of a mismatch between DaCy Large and the current spaCy version.
The script `anon_test.py`:

```python
from textprivacy import TextAnonymizer

# list of texts (example with cross-lingual transfer to English)
corpus = [
    "Hej, jeg hedder Martin Jespersen og er fra Danmark og arbejder i "
    "Deloitte, mit cpr er 010203-2010, telefon: +4545454545 "
    "og email: [email protected]",
    "Hi, my name is Martin Jespersen and work in Deloitte. "
    "I used to be a PhD. at DTU in Machine Learning and B-cell immunoinformatics "
    "at Anker Engelunds Vej 1 Bygning 101A, 2800 Kgs. Lyngby.",
]

Anonymizer = TextAnonymizer(corpus)

# Anonymize person, location, organization, emails, CPR and telephone numbers
anonymized_corpus = Anonymizer.mask_corpus()

for text in anonymized_corpus:
    print(text)
```
The output:

```
(anon): ~$ /home/akirkedal/software/anaconda/envs/anon/bin/python /home/akirkedal/software/anon/anon_test.py
/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/spacy/util.py:762: UserWarning: [W095] Model 'da_dacy_large_tft' (0.0.0) was trained with spaCy v3.0 and may not be 100% compatible with the current version (3.1.4). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
  warnings.warn(warn_msg)
Traceback (most recent call last):
  File "/home/akirkedal/software/anon/anon_test.py", line 1, in <module>
    from textprivacy import TextAnonymizer
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/textprivacy/__init__.py", line 7, in <module>
    from textprivacy.textanonymization import TextAnonymizer
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/textprivacy/textanonymization.py", line 34, in <module>
    ner_model = dacy.load("da_dacy_large_tft-0.0.0")
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/dacy/load.py", line 39, in load
    return spacy.load(path)
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/spacy/__init__.py", line 51, in load
    return util.load_model(
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/spacy/util.py", line 351, in load_model
    return load_model_from_path(Path(name), **kwargs)  # type: ignore[arg-type]
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/spacy/util.py", line 418, in load_model_from_path
    return nlp.from_disk(model_path, exclude=exclude, overrides=overrides)
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/spacy/language.py", line 2021, in from_disk
    util.from_disk(path, deserializers, exclude)  # type: ignore[arg-type]
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/spacy/util.py", line 1229, in from_disk
    reader(path / key)
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/spacy/language.py", line 2015, in <lambda>
    deserializers[name] = lambda p, proc=proc: proc.from_disk(  # type: ignore[misc]
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/spacy_transformers/pipeline_component.py", line 402, in from_disk
    util.from_disk(path, deserialize, exclude)
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/spacy/util.py", line 1229, in from_disk
    reader(path / key)
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/spacy_transformers/pipeline_component.py", line 391, in load_model
    tokenizer, transformer = huggingface_from_pretrained(
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/spacy_transformers/util.py", line 31, in huggingface_from_pretrained
    tokenizer = AutoTokenizer.from_pretrained(str_path, **config)
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 568, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1732, in from_pretrained
    return cls._from_pretrained(
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1850, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/transformers/models/xlm_roberta/tokenization_xlm_roberta_fast.py", line 134, in __init__
    super().__init__(
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 110, in __init__
    fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/transformers/convert_slow_tokenizer.py", line 829, in convert_slow_tokenizer
    return converter_class(transformer_tokenizer).converted()
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/transformers/convert_slow_tokenizer.py", line 375, in __init__
    from .utils import sentencepiece_model_pb2 as model_pb2
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/transformers/utils/sentencepiece_model_pb2.py", line 52, in <module>
    _descriptor.EnumValueDescriptor(name="UNIGRAM", index=0, number=1, options=None, type=None),
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/google/protobuf/descriptor.py", line 755, in __new__
    _message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).
More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates
```
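For reference, the second workaround suggested in the protobuf error can be applied from inside the script itself, as long as the environment variable is set before anything imports `google.protobuf`. A minimal sketch (I have not verified that this fixes DaAnonymization; the variable name is taken straight from the error message above):

```python
import os

# Workaround 2 from the protobuf error: force the pure-Python protobuf
# implementation. This must run before any import that pulls in
# google.protobuf (i.e. before `from textprivacy import TextAnonymizer`).
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

print(os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"])  # prints "python"
```

Note the error message's caveat: pure-Python parsing is much slower, so downgrading protobuf (workaround 1) is likely the better option for real workloads.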