Open Targets ETL Literature

The aim of this application is to replace and enhance the LINK (LIterature coNcept Knowledgebase) project. The original project is available here

Currently, the application can execute the following steps:

  • Processing
  • Embedding

Requirements

  • OpenJDK 1.8
  • Scala 2.12.x (easily installed through SDKMAN)
  • Apache Spark 3.0.1
  • Open Targets entities
  • EPMC entities

Obtain Open Targets entities

These are the files generated by the platform-etl-backend search step. For a specific release they can usually be found in gs://ot-snapshots/etl/outputs/<release>/search/**/*.json
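
For example, these files can be listed or copied with gsutil (assuming read access to the bucket; replace <release> with the release of interest):

gsutil ls "gs://ot-snapshots/etl/outputs/<release>/search/**/*.json"
gsutil -m cp -r "gs://ot-snapshots/etl/outputs/<release>/search" .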

Obtain EPMC dataset

These files are provided by our collaborator EuropePMC. The files contain the list of public and private publications together with rich metadata. Moreover, EPMC extracts text and sentences using a machine-learning approach.
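
To get a feel for the data, you can inspect the first record of one of the JSONL files (a sketch assuming read access to the bucket used in the configuration below and jq installed; the file name is illustrative):

gsutil cat "gs://otar-epmc/literature-files/<file>.jsonl" | head -n 1 | jq .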

Processing step

The processing step extracts a common ground from the EPMC entities and the three main Open Targets entities (Disease, Target, Drug).

The input section needs two datasets. The ot-luts section contains the LookUp Table with the Open Targets info; the dataset used for this specific project is the output of the "search" step of the ETL project. The epmc section contains the EPMC literature dataset. The outputs section contains the paths for the co-occurrence and matches datasets.

  processing {
    ot-luts {
      format = "json"
      path = "gs://open-targets-data-releases/21.02/output/ETL/search/**/*.json"
    }
    epmc {
      format = "json"
      path = "gs://otar-epmc/literature-files/**/*.jsonl"
    }
    outputs {
      cooccurrences {
        format = ${common.output-format}
        path = ${common.output}"/cooccurrences"
      }
      matches {
        format = ${common.output-format}
        path = ${common.output}"/matches"
      }
    }
  }

After a first phase of shaping the data/entity types and normalising the texts, the next step joins the two datasets to create a single comprehensive dataset with the common information. The step generates two different datasets:

  • Co-occurrence
  • Matches

The co-occurrence dataset will be used by our data team to generate new features for our platform. The matches dataset will be used by the embedding step to generate the word2vec models.
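
As a quick sanity check, the matches output can be inspected with the Spark shell. A minimal sketch, assuming the job wrote JSON under the prod-latest prefix used later in this README (the actual format and path come from your common settings, and reading gs:// paths locally requires the GCS connector on the classpath):

/usr/lib/spark/bin/spark-shell <<'EOF'
val matches = spark.read.json("gs://ot-snapshots/etl-literature/prod-latest/matches")
matches.printSchema()
println(matches.count())
EOF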

Embedding step

TO DO

Create a fat JAR

Simply run the following command:

sbt assembly

The jar will be generated under target/scala-2.12/

Configuration

The base configuration is found under src/main/resources/reference.conf. If you want to use a specific configuration for a Spark job, see below.

Inputs

The input datasets are described under:

  • Obtain Open Targets entities
  • Obtain EPMC dataset

Outputs

output dir:
    matches
    cooccurrences
    word2vec
    word2vecSynonym
    xxyy

xxyy is used in the Open Targets front end via Elasticsearch.
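
For example, with the production output prefix used later in this README (assuming the steps have already been run against it):

gsutil ls gs://ot-snapshots/etl-literature/prod-latest/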

Running

Dataproc

Create cluster and launch

Here is how to create a cluster using the gcloud tool.

The current image version is preview-debian10 because it is the only image that supports Spark 3.

gcloud beta dataproc clusters create \
    etl-cluster \
    --image-version=preview-debian10 \
    --properties=yarn:yarn.nodemanager.vmem-check-enabled=false,spark:spark.debug.maxToStringFields=1024,spark:spark.master=yarn \
    --master-machine-type=n1-highmem-16 \
    --master-boot-disk-size=500 \
    --num-secondary-workers=0 \
    --worker-machine-type=n1-standard-64 \
    --num-workers=2 \
    --worker-boot-disk-size=2000 \
    --zone=europe-west1-d \
    --project=open-targets-eu-dev \
    --region=europe-west1 \
    --initialization-action-timeout=20m \
    --max-idle=30m
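
The cluster auto-deletes after 30 minutes of inactivity (--max-idle=30m); to tear it down manually:

gcloud dataproc clusters delete etl-cluster \
    --project=open-targets-eu-dev \
    --region=europe-west1
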
Submitting a job to an existing cluster

To submit the job, use either a local jar or one from a GCS bucket (gs://...):

gcloud dataproc jobs submit spark \
           --cluster=etl-cluster \
           --project=open-targets-eu-dev \
           --region=europe-west1 \
           --async \
           --jar=gs://ot-snapshots/...

Load with custom configuration

Add -Dconfig.file=application.conf to your run (command line, sbt task, or IntelliJ IDEA) and the configuration will be loaded from your ./ path or the project root. Missing fields will be resolved with reference.conf.
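
For example, a minimal sketch of a local run picking up a custom configuration from the project root (depending on your sbt setup, a forked run may instead need the property passed via javaOptions):

sbt -Dconfig.file=application.conf run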

The same happens with the logback configuration. You can add -Dlogback.configurationFile=application.xml and have the corresponding XML file in your project root or run path. An example log configuration file:

<configuration>

    <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
            <pattern>%level %logger{15} - %message%n%xException{10}</pattern>
        </encoder>
    </appender>

    <root level="WARN">
        <appender-ref ref="STDOUT" />
    </root>

    <logger name="io.opentargets.etl.literature" level="DEBUG"/>
    <logger name="org.apache.spark" level="WARN"/>

</configuration>

If you are using the Dataproc cluster, you need to add some additional arguments specifying where the configuration can be found.

gcloud dataproc jobs submit spark \
           --cluster=etl-cluster \
           --project=open-targets-eu-dev \
           --region=europe-west1 \
           --async \
           --files=application.conf \
           --properties=spark.executor.extraJavaOptions=-Dconfig.file=application.conf,spark.driver.extraJavaOptions=-Dconfig.file=application.conf \
           --jar=gs://ot-snapshots/...

where application.conf overrides a subset of reference.conf, for example:

common {
  output = "gs://ot-snapshots/etl-literature/prod-latest"
}
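
Any other key from reference.conf can be overridden in the same way, for instance (the "json" value is illustrative; common.output-format is the key referenced by the processing outputs above):

common {
  output = "gs://ot-snapshots/etl-literature/prod-latest"
  output-format = "json"
}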

Spark-submit

The fat jar can be executed on a local installation of Spark using spark-submit:

/usr/lib/spark/bin/spark-submit --class io.opentargets.etl.literature.Main \
--driver-memory $(free -g | awk 'NR==2 {print $7}')g \
--master local[*] \
<jar> --arg1 ... --arg2 ...

Creating a new release

  1. Add a tag to master so we can recreate the jar:

git tag -a <release> -m "Release <release>"
git push origin <release>

Where <release> is something like 20.11.0 (year, month, iteration). Hopefully we don't need the iteration.

  2. Create the jar and push it to cloud storage:

sbt assembly
gsutil cp target/scala-2.12/<jar> gs://open-targets-data-releases/<release>/platform-etl-literature/<jar>

  3. Generate the input files and push them to the cloud.
  4. Create an updated configuration file and push it to the cloud.
  • Save the file in the same place as the jar so the run can be reproduced if necessary.
  5. Run the steps.
  6. Create the Elasticsearch index (script in the platform-backend-etl repository).
  • The relevant output file is xxyy.

Versioning

Version   Date         Notes
1.0.0     March 2021   Initial release

Copyright

Copyright 2014-2018 Biogen, Celgene Corporation, EMBL - European Bioinformatics Institute, GlaxoSmithKline, Takeda Pharmaceutical Company and Wellcome Sanger Institute

This software was developed as part of the Open Targets project. For more information please see: http://www.opentargets.org

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
