
NLP Sandbox PHI Annotator Example


Introduction

NLPSandbox.io is an open platform for benchmarking modular natural language processing (NLP) tools on both public and private datasets. Academics, students, and industry professionals are invited to browse the available tasks and participate by developing and submitting an NLP Sandbox tool.

This repository provides an example implementation of the NLP Sandbox PHI Annotator API written in Python-Flask. An NLP Sandbox PHI annotator takes as input a clinical note (text) and outputs a list of predicted PHI annotations found in the clinical note. Here, PHI is identified using regular expressions.
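As a flavor of the approach, here is a minimal, hedged sketch of regex-based PHI detection in the spirit of this example tool; the annotation fields (start, length, text, confidence) approximate the NLP Sandbox schema, and the pattern is illustrative only.

import re

# Date-like strings are one category of PHI this example tool detects.
DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")

def annotate_dates(note_text):
    """Return date-like PHI annotations; fields approximate the NLP Sandbox schema."""
    return [
        {
            "start": m.start(),
            "length": m.end() - m.start(),
            "text": m.group(),
            "confidence": 95.0,  # assumed convention: confidence as a percentage
        }
        for m in DATE_PATTERN.finditer(note_text)
    ]

print(annotate_dates("Patient seen on 12/31/2020 for follow-up."))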

This tool is provided to NLP developers who develop in Python as a starting point for packaging their own PHI annotator as an NLP Sandbox tool (see the section Development). That section also describes how to generate a tool "stub" using openapi-generator for 50+ programming languages and frameworks. This repository includes a GitHub CI/CD workflow that lints, tests, builds, and pushes a Docker image of this tool to the Synapse Docker Registry. The image of this example tool can be submitted as-is on NLPSandbox.io to benchmark its performance -- just don't expect a high performance!

Contents

Specification

Requirements

Usage

Running with Docker

The command below starts this NLP Sandbox PHI annotator locally.

docker compose up --build

You can stop the running containers with Ctrl+C, followed by docker compose down.

Running with Python

Create a Conda environment.

conda create --name phi-annotator python=3.9 -y
conda activate phi-annotator

Install and start this NLP Sandbox PHI annotator.

cd server && pip install -r requirements.txt
python -m openapi_server
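Once the server is running, you can send a clinical note to one of the annotation endpoints. The sketch below is a hedged example: the endpoint path appears in this repository's logs, but the port (8080) and the exact request body fields are assumptions based on the NLP Sandbox PHI Annotator API and may differ from your setup.

import requests  # assumes the requests package is installed

payload = {
    "note": {  # field names are assumptions based on the NLP Sandbox schemas
        "identifier": "note-1",
        "noteType": "loinc:LP29684-5",
        "patientId": "patient-1",
        "text": "Contact Dr. Smith at 555-123-4567.",
    }
}

response = requests.post(
    "http://localhost:8080/api/v1/textContactAnnotations",  # assumed port
    json=payload,
    timeout=10,
)
print(response.status_code, response.json())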

Accessing this NLP Sandbox tool User Interface

This NLP Sandbox tool provides a web interface that you can use to annotate clinical notes. This web client has been automatically generated by openapi-generator. To access the UI, open a new tab in your browser and navigate to one of the following addresses, depending on whether you are running the tool with Docker (production) or Python (development).

Development

This section describes how to develop your own NLP Sandbox PHI annotator in Python-Flask and other programming languages and frameworks. This example tool is also available in Java in the GitHub repository nlpsandbox/phi-annotator-example-java.

Development requirements

Creating a GitHub repository

Depending on the language and framework you want to develop with:

You can also use a different code repository hosting service such as GitLab or Bitbucket.

Configuring the CI/CD workflow

This repository includes a GitHub CI/CD workflow that lints, tests, builds, and pushes a Docker image of this tool to the Synapse Docker Registry. For now, only images that have been pushed to the Synapse Docker Registry can be submitted to NLPSandbox.io benchmarks.

After creating your GitHub repository, you need to configure the CI/CD workflow if you want to benefit from automatic lint checks, tests and Docker builds.

  1. Create two GitHub secrets
  2. In the CI/CD workflow, update the environment variable docker_repository with the value docker.synapse.org/<synapse_project_id>/<docker_image> where:
    • <synapse_project_id> is the Synapse ID of a project you have created on Synapse.org.
    • <docker_image> is the name of your image/tool.

Enabling version updates

This repository includes a Dependabot configuration that instructs GitHub to let you know when an update is available for one of your dependencies (e.g. Python, Node, Docker). Dependabot will automatically open a PR when an update is available. If you have configured the CI/CD workflow that comes with this repository, the workflow will run automatically and notify you if the update breaks your code. You can then resolve the issue before merging the PR, hence making the update effective.

For more information on Dependabot, please visit the GitHub page Enabling and disabling version updates.

Generating a new NLP Sandbox tool using openapi-generator

The development of new NLP Sandbox tools is streamlined by using the openapi-generator to generate tool "stubs" for more than 50 programming languages and frameworks. Here a PHI annotator stub refers to an initial implementation that has been automatically generated by openapi-generator from the NLP Sandbox PHI Annotator API specification.

Run the command below to get the list of languages and frameworks supported by the openapi-generator (under the section SERVER generators).

npx @openapitools/openapi-generator-cli list

Generate the PHI annotator stub from an empty GitHub repository (here in Python-Flask):

mkdir server
npx @openapitools/openapi-generator-cli generate \
  -g python-flask \
  -o server \
  -i https://nlpsandbox.github.io/nlpsandbox-schemas/phi-annotator/latest/openapi.json

where the option -i refers to the OpenAPI specification of the NLP Sandbox PHI Annotator API.

The URL is composed of different elements:

  • phi-annotator - The type of NLP Sandbox tool to generate. The list of all the NLP Sandbox tool types available is defined in the NLP Sandbox schemas.
  • latest - The latest stable version of the NLP Sandbox schemas. This token can be replaced by a specific release version x.y.z of the NLP Sandbox schemas.

Keeping your tool up-to-date

The NLP Sandbox schemas are updated as contributions are received from the community. For example, the Patient schema may in the future include additional information that NLP Sandbox tools can leverage to generate more accurate predictions.

After an update of the NLP Sandbox schemas, NLPSandbox.io will only evaluate tools that implement the latest version of the schemas. It is therefore important to keep your tools up-to-date and re-submit them so that they continue to appear in the leaderboards and to be used by the community.

This GitHub repository includes a workflow that checks daily if a new release of the NLP Sandbox schemas is available, in which case a PR will be created. Follow the steps listed below to update your tool.

  1. Checkout the branch created by the workflow.

    git fetch
    git checkout <branch_name>
    
  2. Re-run the same openapi-generator command you used to generate the tool stub. If you started from an existing tool implementation like the one included in this GitHub repository, run the following command to update your tool to the latest version of the NLP Sandbox schemas (this command would be defined in package.json).

    npm run generate:server:latest
    
  3. Review the updates made to this tool in the NLP Sandbox schemas CHANGELOG.

  4. Review and merge the changes. If you are using VS Code, this step can be performed relatively easily using the section named "Source Control". This section lists the files that have been modified by the generator. When clicking on a file, VS Code shows side-by-side the current and updated version of the file. Changes can be accepted or rejected at the level of an entire file or for a selection of lines.

  5. Submit your updated tool to NLPSandbox.io.

Testing

If you started from an existing tool implementation like the one included in this GitHub repository, run the following command to lint and test your tool.

npm run lint
npm run test

For Python-Flask tools:

Preventing an NLP Sandbox tool from connecting to remote servers

The NLP Sandbox promotes the development of tools that are re-usable, reproducible, portable and cloud-ready. The table below describes how preventing a tool from connecting to remote servers contributes to some of these tool properties.

Property         Description
Reproducibility  The output of a tool may not be reproducible if the tool depends on external resources that may, for example, no longer be available in the future.
Security         A tool may attempt to upload sensitive information to a remote server.

The Docker Compose configuration included with this GitHub repository (docker-compose.yml) prevents the tool container from establishing remote connections. This is achieved through the use of an internal Docker network and an Nginx container placed in front of the tool container. One benefit is that you can test your tool locally and ensure that it works while it does not have access to the internet. Note that when a tool is being evaluated on NLPSandbox.io, additional measures are put in place to prevent it from connecting to remote servers.
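As a quick local check (not part of this repository), the hedged sketch below can be run inside the tool container, e.g. with docker exec, to confirm that outbound connections fail; the URL is arbitrary.

import urllib.request

try:
    # Any external URL works here; example.com is arbitrary.
    urllib.request.urlopen("https://example.com", timeout=5)
    print("WARNING: outbound connection succeeded; the network is not isolated")
except OSError as exc:
    print(f"Outbound connection blocked as expected: {exc}")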

Versioning

GitHub release tags

This repository uses semantic versioning to track the releases of this tool. This repository uses "non-moving" GitHub tags, that is, a tag will always point to the same git commit once it has been created.

Docker image tags

The artifact published by the CI/CD workflow of this GitHub repository is a Docker image pushed to the Synapse Docker Registry. This table lists the image tags pushed to the registry.

Tag name                 Moving  Description
latest                   Yes     Latest stable release.
edge                     Yes     Latest commit made to the default branch.
edge-<sha>               No      Same as edge, with a reference to the git commit.
<major>.<minor>.<patch>  No      Stable release.

You should avoid using a moving tag like latest when deploying containers in production, because this makes it hard to track which version of the image is running and hard to roll back.

Benchmarking on NLPSandbox.io

Visit nlpsandbox.io for instructions on how to submit your NLP Sandbox tool and evaluate its performance.

Contributing

Thinking about contributing to this project? Get started by reading our contribution guide.

License

Apache License 2.0


Issues

Errors when running tox command

Problem description

The following errors are obtained when the tox command is run:

ERROR: InvocationError for command /home/cessien/spark-nlp-phi-annotator/server/.tox/py38/bin/pytest --cov-config=setup.cfg --cov=openapi_server -v (exited with code 1)
py39 create: /home/cessien/spark-nlp-phi-annotator/server/.tox/py39
ERROR: InterpreterNotFound: python3.9
________________________________________________________ summary _________________________________________________________
ERROR:   py37: commands failed
ERROR:   py38: commands failed
ERROR:  py39: InterpreterNotFound: python3.9

Steps to reproduce the issue

  1. Go to the project folder
  2. cd server
  3. Run the command tox

Attempts to fix

  • I have excluded the py37 and py39 environments from tox.ini. The error still shows when I run tox -e py38.
  • I have also tried installing the latest version of tox. I think part of the error is with the sparknlp_jsl package.
  • This package is installed using the secret_code appended to the URL.
  • I have tried to see how to add the private index to the requirements.txt file (see the sketch after this list).
  • I have tried adding a higher version of pytest that supports Python 3.7+.
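One possible approach, offered as a hedged sketch only: since pip 10, requirements files support global options such as --extra-index-url and expand environment variables written as ${VAR}, so the secret can stay out of version control. The variable name SECRET_CODE and the package version below are placeholders.

# requirements.txt (sketch; ${SECRET_CODE} is expanded by pip from the environment)
--extra-index-url https://pypi.johnsnowlabs.com/${SECRET_CODE}
spark-nlp-jsl==3.1.0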

Issue installing Java 8 in the Dockerfile

Problem description

Java 8 is not installed when the installation command is supplied in the Dockerfile. Java 8 is a requirement for the Spark session to be initialized, and the JAVA_HOME environment variable should be set. The build fails with "Unable to locate package openjdk-8-jdk":

ERROR [ 3/17] RUN apt-get install openjdk-8-jdk 1.1s

[ 3/17] RUN apt-get install openjdk-8-jdk:
#6 0.301 + apt-get install openjdk-8-jdk
#6 0.331 Reading package lists...
#6 0.888 Building dependency tree...
#6 0.996 Reading state information...
#6 1.086 E: Unable to locate package openjdk-8-jdk

Steps to reproduce the issue

Add the following lines into the Dockerfile

  1. RUN apt-get update
  2. RUN apt-get install openjdk-8-jdk
  3. Then run docker-compose up

Expected behavior

Java 8 is expected to be installed while building the Docker image.

Attempted fixes

I tried to install from the official repositories instead of a manual installation by running the following commands:
  • apt-get install oracle-java8-installer
  • add-apt-repository ppa:webupd8team/java

phi_annotator

Use the Spark NLP library to implement the following (see the pipeline sketch after this list):

  • Person annotation
  • Date annotation
  • Contact annotation
  • Text annotation
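A hedged sketch of what such a de-identification pipeline could look like with Spark NLP; the model names match the ones downloaded later in this document, but the exact classes (e.g. NerDLModel vs. the licensed MedicalNerModel) depend on your library version, so treat this as an outline rather than the project's actual implementation.

from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import (
    SentenceDetector,
    Tokenizer,
    WordEmbeddingsModel,
    NerDLModel,
    NerConverter,
)

# Assumes a Spark (NLP) session has already been started.
# Assemble raw note text into documents, then sentences and tokens.
document = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
token = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")

# Clinical embeddings feed the de-identification NER model.
embeddings = (WordEmbeddingsModel
              .pretrained("embeddings_clinical", "en", "clinical/models")
              .setInputCols(["sentence", "token"])
              .setOutputCol("embeddings"))
ner = (NerDLModel
       .pretrained("ner_deid_large", "en", "clinical/models")
       .setInputCols(["sentence", "token", "embeddings"])
       .setOutputCol("ner"))

# Group IOB tags into chunks (person, date, contact, ...).
ner_chunk = (NerConverter()
             .setInputCols(["sentence", "token", "ner"])
             .setOutputCol("ner_chunk"))

pipeline = Pipeline(stages=[document, sentence, token, embeddings, ner, ner_chunk])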

Set up Spark NLP

Steps to install the Spark NLP library

Requirements & Setup

  1. Java 8
  2. ssh server
  3. Apache Spark 3.1.x (or 3.0.x, or 2.4.x, or 2.3.x)
  4. spark-nlp

Run the following commands to install Java

  1. sudo apt-get update
  2. sudo apt-get install openjdk-8-jdk
  3. export JAVA_HOME=path_to_java_home
  4. java -version
    This should return something like this:
    openjdk version "1.8.0_242"
    OpenJDK Runtime Environment (build 1.8.0_242-b09)
    OpenJDK 64-Bit Server VM (build 25.242-b09, mixed mode)

To install ssh server

If ssh is already installed and enabled, skip this step; otherwise run the following commands:

  • sudo apt-get install openssh-server
  • sudo systemctl enable ssh
  • sudo systemctl start ssh

To install Apache Spark

  1. wget https://downloads.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
  2. tar xvf spark-*
  3. sudo mv spark-3.0.1-bin-hadoop2.7/* /opt/spark
  4. nano ~/.bashrc (add the lines below)
  5. export SPARK_HOME=/opt/spark
    export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
    export PYSPARK_PYTHON=/usr/bin/python3
  6. To verify that Spark installed correctly, run the following:
    • start-master.sh
    • start-slave.sh spark://ubuntu1:7077
    • Open the following link in your browser: http://127.0.0.1:8080/
    • Then you can kill the process

Register for a Spark NLP trial license

  • Go to https://www.johnsnowlabs.com/spark-nlp-try-free/
  • Select the Spark NLP for Healthcare option
  • Register and you will receive an email containing a license.json file
  • Export all the fields in the file as environment variables (a helper sketch follows this list), e.g.:
    echo 'export SPARK_NLP_LICENSE=<license_value>' >> ~/.bashrc
    source ~/.bashrc
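A hedged helper for the step above: it exports every field of license.json as an environment variable for the current process. The key names inside the file (e.g. SPARK_NLP_LICENSE, SECRET) vary by license, so the printed check is only illustrative.

import json
import os

# Load the license file received by email and export each field.
with open("license.json") as f:
    for key, value in json.load(f).items():
        os.environ[key] = str(value)

# Illustrative check; adjust the key names to match your license file.
print({k: "set" for k in ("SPARK_NLP_LICENSE", "SECRET") if k in os.environ})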

To install Spark NLP

  1. Run conda install -c johnsnowlabs spark-nlp

  2. Register for a Spark NLP JSL license at https://nlp.johnsnowlabs.com/docs/en/licensed_install

  3. Run the following command: pip install -q spark-nlp-jsl==${version} --extra-index-url https://pypi.johnsnowlabs.com/${secret.code} --upgrade

  4. Run spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.1.0 to load Spark NLP in a Spark shell (see the session sketch after this list for the Python equivalent)
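From Python, the licensed session is started with sparknlp_jsl.start, matching the call visible in the traceback later in this document; the SECRET environment variable name and the memory setting below are assumptions.

import os
import sparknlp_jsl

# Optional Spark configuration passed through to the session builder.
params = {"spark.driver.memory": "8G"}  # assumed example setting

# The secret code comes with your John Snow Labs license.
spark = sparknlp_jsl.start(os.environ["SECRET"], params=params)
print(spark.version)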

Spark NLP Models

If you prefer to use the models online, skip this step; if offline is the preferred option, download the following from https://github.com/JohnSnowLabs/spark-nlp-models (a loading sketch follows this block).

  1. embeddings_clinical
  2. ner_deid_large
  3. sentence_detector_dl_healthcare

Create a folder called nlp_models in the server directory and place the downloaded models there, i.e.:

   aws s3 cp s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_large_en_2.5.3_2.4_1595427435246.zip nlp_models/
   aws s3 cp s3://auxdata.johnsnowlabs.com/clinical/models/sentence_detector_dl_healthcare_en_2.6.0_2.4_1600001082565.zip nlp_models/
   aws s3 cp s3://auxdata.johnsnowlabs.com/clinical/models/embeddings_clinical_en_2.4.0_2.4_1580237286004.zip nlp_models/
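A hedged sketch of loading the downloaded models from the local nlp_models folder instead of fetching them online; it assumes the zip archives have been unpacked into directories of the same name, and uses the standard Spark NLP .load() pattern.

from sparknlp.annotator import NerDLModel, WordEmbeddingsModel

# Assumes an active Spark session and that the archives were unzipped
# under server/nlp_models.
embeddings = (WordEmbeddingsModel
              .load("nlp_models/embeddings_clinical_en_2.4.0_2.4_1580237286004")
              .setInputCols(["sentence", "token"])
              .setOutputCol("embeddings"))
ner = (NerDLModel
       .load("nlp_models/ner_deid_large_en_2.5.3_2.4_1595427435246")
       .setInputCols(["sentence", "token", "embeddings"])
       .setOutputCol("ner"))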

Problems initializing the Spark session

phi-annotator    | Ivy Default Cache set to: /var/www/.ivy2/cache
phi-annotator    | The jars for the packages stored in: /var/www/.ivy2/jars
phi-annotator    | com.johnsnowlabs.nlp#spark-nlp_2.12 added as a dependency
phi-annotator    | :: resolving dependencies :: org.apache.spark#spark-submit-parent-9c8beda1-f199-4552-b85d-0f095156e99d;1.0
phi-annotator    |      confs: [default]
phi-annotator    | Exception in thread "main" java.io.FileNotFoundException: /var/www/.ivy2/cache/resolved-org.apache.spark-spark-submit-parent-9c8beda1-f199-4552-b85d-0f095156e99d-1.0.xml (No such file or directory)
phi-annotator    |      at java.io.FileOutputStream.open0(Native Method)
phi-annotator    |      at java.io.FileOutputStream.open(FileOutputStream.java:270)
phi-annotator    |      at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
phi-annotator    |      at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
phi-annotator    |      at org.apache.ivy.plugins.parser.xml.XmlModuleDescriptorWriter.write(XmlModuleDescriptorWriter.java:70)
phi-annotator    |      at org.apache.ivy.plugins.parser.xml.XmlModuleDescriptorWriter.write(XmlModuleDescriptorWriter.java:62)
phi-annotator    |      at org.apache.ivy.core.module.descriptor.DefaultModuleDescriptor.toIvyFile(DefaultModuleDescriptor.java:563)
phi-annotator    |      at org.apache.ivy.core.cache.DefaultResolutionCacheManager.saveResolvedModuleDescriptor(DefaultResolutionCacheManager.java:176)
phi-annotator    |      at org.apache.ivy.core.resolve.ResolveEngine.resolve(ResolveEngine.java:245)
phi-annotator    |      at org.apache.ivy.Ivy.resolve(Ivy.java:523)
phi-annotator    |      at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1427)
phi-annotator    |      at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
phi-annotator    |      at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:308)
phi-annotator    |      at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
phi-annotator    |      at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
phi-annotator    |      at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
phi-annotator    |      at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
phi-annotator    |      at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
phi-annotator    |      at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
phi-annotator    |      at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
phi-annotator    | Failed to add operation for POST /api/v1/textContactAnnotations
phi-annotator    | Traceback (most recent call last):
phi-annotator    |   File "openapi_server/__main__.py", line 10, in <module>
phi-annotator    |     app.add_api('openapi.yaml', pythonic_params=True)
phi-annotator    |   File "/usr/local/lib/python3.9/site-packages/connexion/apps/flask_app.py", line 57, in add_api
phi-annotator    |     api = super(FlaskApp, self).add_api(specification, **kwargs)
phi-annotator    |   File "/usr/local/lib/python3.9/site-packages/connexion/apps/abstract.py", line 144, in add_api
phi-annotator    |     api = self.api_cls(specification,
phi-annotator    |   File "/usr/local/lib/python3.9/site-packages/connexion/apis/abstract.py", line 111, in __init__
phi-annotator    |     self.add_paths()
phi-annotator    |   File "/usr/local/lib/python3.9/site-packages/connexion/apis/abstract.py", line 219, in add_paths
phi-annotator    |     self._handle_add_operation_error(path, method, sys.exc_info())
phi-annotator    |   File "/usr/local/lib/python3.9/site-packages/connexion/apis/abstract.py", line 231, in _handle_add_operation_error
phi-annotator    |     raise value.with_traceback(traceback)
phi-annotator    |   File "/usr/local/lib/python3.9/site-packages/connexion/apis/abstract.py", line 209, in add_paths
phi-annotator    |     self.add_operation(path, method)
phi-annotator    |   File "/usr/local/lib/python3.9/site-packages/connexion/apis/abstract.py", line 162, in add_operation
phi-annotator    |     operation = make_operation(
phi-annotator    |   File "/usr/local/lib/python3.9/site-packages/connexion/operations/__init__.py", line 8, in make_operation
phi-annotator    |     return spec.operation_cls.from_spec(spec, *args, **kwargs)
phi-annotator    |   File "/usr/local/lib/python3.9/site-packages/connexion/operations/openapi.py", line 128, in from_spec
phi-annotator    |     return cls(
phi-annotator    |   File "/usr/local/lib/python3.9/site-packages/connexion/operations/openapi.py", line 75, in __init__
phi-annotator    |     super(OpenAPIOperation, self).__init__(
phi-annotator    |   File "/usr/local/lib/python3.9/site-packages/connexion/operations/abstract.py", line 96, in __init__
phi-annotator    |     self._resolution = resolver.resolve(self)
phi-annotator    |   File "/usr/local/lib/python3.9/site-packages/connexion/resolver.py", line 40, in resolve
phi-annotator    |     return Resolution(self.resolve_function_from_operation_id(operation_id), operation_id)
phi-annotator    |   File "/usr/local/lib/python3.9/site-packages/connexion/resolver.py", line 61, in resolve_function_from_operation_id
phi-annotator    |     return self.function_resolver(operation_id)
phi-annotator    |   File "/usr/local/lib/python3.9/site-packages/connexion/utils.py", line 111, in get_function_from_name
phi-annotator    |     module = importlib.import_module(module_name)
phi-annotator    |   File "/usr/local/lib/python3.9/importlib/__init__.py", line 127, in import_module
phi-annotator    |     return _bootstrap._gcd_import(name[level:], package, level)
phi-annotator    |   File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
phi-annotator    |   File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
phi-annotator    |   File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
phi-annotator    |   File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
phi-annotator    |   File "<frozen importlib._bootstrap_external>", line 855, in exec_module
phi-annotator    |   File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
phi-annotator    |   File "/opt/app/./openapi_server/controllers/text_contact_annotation_controller.py", line 4, in <module>
phi-annotator    |     from openapi_server import nlp_config as cf
phi-annotator    |   File "/opt/app/./openapi_server/nlp_config.py", line 20, in <module>
phi-annotator    |     spark = Spark().spark
phi-annotator    |   File "/opt/app/./openapi_server/nlp_config.py", line 17, in __init__
phi-annotator    |     self.spark = sparknlp_jsl.start("3.1.0-1778570a3f59d3059ed1e58375192fd61b114fc9", params=params)
phi-annotator    |   File "/usr/local/lib/python3.9/site-packages/sparknlp_jsl/__init__.py", line 80, in start
phi-annotator    |     return builder.getOrCreate()
phi-annotator    |   File "/usr/local/lib/python3.9/site-packages/pyspark/sql/session.py", line 228, in getOrCreate
phi-annotator    |     sc = SparkContext.getOrCreate(sparkConf)
phi-annotator    |   File "/usr/local/lib/python3.9/site-packages/pyspark/context.py", line 384, in getOrCreate
phi-annotator    |     SparkContext(conf=conf or SparkConf())
phi-annotator    |   File "/usr/local/lib/python3.9/site-packages/pyspark/context.py", line 144, in __init__
phi-annotator    |     SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
phi-annotator    |   File "/usr/local/lib/python3.9/site-packages/pyspark/context.py", line 331, in _ensure_initialized
phi-annotator    |     SparkContext._gateway = gateway or launch_gateway(conf)
phi-annotator    |   File "/usr/local/lib/python3.9/site-packages/pyspark/java_gateway.py", line 108, in launch_gateway
phi-annotator    |     raise Exception("Java gateway process exited before sending its port number")
phi-annotator    | Exception: Java gateway process exited before sending its port number
phi-annotator    | unable to load app 0 (mountpoint='') (callable not found or import error)
phi-annotator    | *** no app loaded. going in full dynamic mode ***

Problem description

While initializing the Spark session, the program looks for Spark dependencies in /var/www/.ivy2/, a location that does not exist and that the process does not have access to.
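One hedged workaround is to point Ivy at a writable directory through Spark configuration. spark.jars.ivy is a standard Spark property, and passing it through the params dict matches the sparknlp_jsl.start call in the traceback above; the /tmp path and SECRET variable name are assumptions.

import os
import sparknlp_jsl

# Redirect Ivy's cache away from the unwritable /var/www/.ivy2 default.
params = {"spark.jars.ivy": "/tmp/.ivy2"}
spark = sparknlp_jsl.start(os.environ["SECRET"], params=params)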
