document-anonymization's People

Contributors

josefawelling, simakro, thomasbtf

document-anonymization's Issues

Remove IDs from filename

  • The file names contain IDs (between the date and the title) that uniquely identify the documents internally, so these IDs should be replaced.
  • As we build out this pipeline, we should replace IDs in both the structured and the unstructured data with the same pseudonyms, issued by an independent escrow. We are already building that for the structured data, so the extension should be fairly straightforward.
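
A minimal sketch of the filename part, assuming names follow the `date.ID.title` layout seen in the logs (for example `2021-03-10.81304303.NoTitle`). The function name and the `pseudonyms` mapping are hypothetical; the mapping stands in for whatever lookup the escrow ends up providing.

```python
def pseudonymize_filename(name, pseudonyms):
    """Replace the ID field of a 'date.ID.title' name with its pseudonym."""
    # Split into date, ID, and the remainder (the title may itself contain dots).
    parts = name.split(".", 2)
    if len(parts) != 3:
        raise ValueError(f"unexpected file name layout: {name!r}")
    date, doc_id, title = parts
    # A KeyError here means there is no pseudonym for this ID yet;
    # failing loudly avoids writing the original ID to the output.
    return f"{date}.{pseudonyms[doc_id]}.{title}"


# Hypothetical mapping; in practice it would come from the escrow.
pseudonyms = {"81304303": "P0001"}
print(pseudonymize_filename("2021-03-10.81304303.NoTitle", pseudonyms))
```

Using the same mapping for the structured data would keep both data sets linked through the pseudonym alone.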

.None file is not processable

The problematic data is available under /local/data/prob-docs.

rule preprocess_page:
    input: results/5/uncompressed-docs/unpacked/2021-03-10.81304303.NoTitle/0.None.None.None
    output: results/5/preprocessed-docs/unpacked/2021-03-10.81304303.NoTitle/0.None.None.None
    log: logs/5/preprocess-page/unpacked/2021-03-10.81304303.NoTitle/0.None.None.None.log
    jobid: 4111
    wildcards: id=5, img=unpacked/2021-03-10.81304303.NoTitle/0.None.None.None

Select jobs to execute...
Activating conda environment: /home/tobias/git/document-anonymization/.snakemake/conda/98b5a3138345e48332173ef8adf6d54d
Activating conda environment: /home/tobias/git/document-anonymization/.snakemake/conda/98b5a3138345e48332173ef8adf6d54d
[Mon May 31 07:21:51 2021]
Error in rule preprocess_page:
    jobid: 4111
    output: results/5/preprocessed-docs/unpacked/2021-03-10.81304303.NoTitle/0.None.None.None
    log: logs/5/preprocess-page/unpacked/2021-03-10.81304303.NoTitle/0.None.None.None.log (check log file(s) for error message)
    conda-env: /home/tobias/git/document-anonymization/.snakemake/conda/98b5a3138345e48332173ef8adf6d54d
Error in rule preprocess_page:
    jobid: 15966
    output: results/1/preprocessed-docs/unpacked/2020-12-10.77953837.NoTitle/0.None.None
    log: logs/1/preprocess-page/unpacked/2020-12-10.77953837.NoTitle/0.None.None.log (check log file(s) for error message)
    conda-env: /home/tobias/git/document-anonymization/.snakemake/conda/98b5a3138345e48332173ef8adf6d54d
Traceback (most recent call last):
  File "/home/tobias/git/document-anonymization/.snakemake/scripts/tmp0kbpokh0.preprocess-page.py", line 63, in <module>
    processed_image = get_grayscale(image)
  File "/home/tobias/git/document-anonymization/.snakemake/scripts/tmp0kbpokh0.preprocess-page.py", line 15, in get_grayscale
    return cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
cv2.error: OpenCV(4.5.1) ../modules/imgproc/src/color.cpp:182: error: (-215:Assertion failed) !_src.empty() in function 'cvtColor
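
The `!_src.empty()` assertion means `cvtColor` received an empty input. `cv2.imread` returns `None` instead of raising when it cannot read or decode a file, so a `.None` file with no recognizable image format is one possible cause. A small guard, sketched here under that assumption, would turn the failure into a readable message; `require_image` is a hypothetical helper and not part of the pipeline.

```python
def require_image(image, path):
    """Return the image unchanged, or raise if the loader produced nothing."""
    # cv2.imread yields None for unreadable files; an array with size 0
    # is treated the same way.
    if image is None or getattr(image, "size", 0) == 0:
        raise ValueError(f"could not decode image: {path}")
    return image


# Intended use in the preprocessing script (assumption):
#   image = require_image(cv2.imread(path), path)
try:
    require_image(None, "0.None.None.None")
except ValueError as err:
    print(err)
```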

Falsify timestamps

Timestamps within the documents should be falsified, e.g. by adding a random offset that is fixed for all files of a patient. This must be the same offset that we already use for the structured data, so that the two data sets still fit together.
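
A rough sketch of such a shift, assuming dates appear in ISO form (`YYYY-MM-DD`); real documents may use other formats, which would need their own patterns. The `offsets` table and the patient key are hypothetical placeholders for the offsets already used in the structured data.

```python
import re
from datetime import date, timedelta

ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def shift_dates(text, offset_days):
    """Shift every valid ISO date in text by offset_days."""
    def _shift(match):
        year, month, day = map(int, match.groups())
        try:
            original = date(year, month, day)
        except ValueError:
            return match.group(0)  # not a real calendar date; leave untouched
        return (original + timedelta(days=offset_days)).isoformat()

    return ISO_DATE.sub(_shift, text)


# Hypothetical per-patient offsets, shared with the structured data.
offsets = {"patient-a": 17}
print(shift_dates("Admitted 2021-03-10, discharged 2021-03-20.", offsets["patient-a"]))
```

Because every date of a patient moves by the same amount, intervals such as the length of stay are preserved.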

Open questions

  • Don't delete data; keep it in a less identifying form, e.g. convert a birthday to "X years old".
  • Filtering of Mr / Mrs
  • Redact admission and discharge dates?
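
The birthday example from the first point could look roughly like this. It is only an illustration: `age_in_years` is a hypothetical helper, and the choice of reference date (document date, admission date, or something else) is still open.

```python
from datetime import date


def age_in_years(birth, reference):
    """Number of completed years between birth and the reference date."""
    # The birthday counts as reached if the reference month/day is not earlier.
    had_birthday = (reference.month, reference.day) >= (birth.month, birth.day)
    return reference.year - birth.year - (0 if had_birthday else 1)


age = age_in_years(date(1980, 5, 31), date(2021, 3, 10))
print(f"{age} years old")
```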

Extraction of personal data is problematic

Traceback (most recent call last):
  File "/local/data/dev/thomas/document-anonymization/.snakemake/scripts/tmpn0w1zv0b.extract-personal-data.py", line 128, in <module>
    var_data = variate_personal_data(personal_data[0], personal_data[1])
  File "/local/data/dev/thomas/document-anonymization/.snakemake/scripts/tmpn0w1zv0b.extract-personal-data.py", line 63, in variate_personal_data
    for i, perm in enumerate(all_name_perms):
TypeError: 'NoneType' object is not iterable
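
The traceback shows that `all_name_perms` was `None` when the loop started, presumably because no name could be extracted from the document. One way to avoid the crash, sketched here without knowledge of the actual script, is to have the permutation step always return a list. `name_permutations` is a hypothetical stand-in, not the project's function.

```python
from itertools import permutations


def name_permutations(name_parts):
    """Return every ordering of the name parts, or an empty list if none were found."""
    if not name_parts:  # covers both None and an empty list
        return []
    return [" ".join(p) for p in permutations(name_parts)]


# With an empty list the loop body simply never runs.
for i, perm in enumerate(name_permutations(None)):
    print(i, perm)
print(name_permutations(["Ada", "Lovelace"]))
```

Whether a document without extractable names should be skipped silently or flagged for manual review is a separate decision.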

Redact all internal IDs

As I understand it, other internal IDs, such as job numbers, also have to be removed.
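
As a starting point, and purely as an assumption about what these IDs look like, long digit runs in the extracted text could be masked. The threshold of six digits and the `[ID]` placeholder are arbitrary choices for illustration. The pattern would also hit other long numbers, such as phone numbers, which may or may not be desired.

```python
import re

# Assumption: internal IDs are standalone runs of at least six digits.
LONG_DIGIT_RUN = re.compile(r"\b\d{6,}\b")


def redact_ids(text, placeholder="[ID]"):
    """Replace every long digit run in text with the placeholder."""
    return LONG_DIGIT_RUN.sub(placeholder, text)


print(redact_ids("Job 81304303 created on 2021-03-10"))
```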
