This application aims to replace and enhance the LINK (LIterature coNcept Knowledgebase) project. The original project is available here
Currently, the application can execute the following steps:
- Processing
- Embedding

Requirements:
- OpenJDK 1.8
- Scala 2.12.x (simple to install through SDKMAN)
- Apache Spark 3.0.1

Inputs:
- Open Targets entities
- EPMC entities
These are the files generated by the platform-etl-backend search step. For a specific release they can usually be found in gs://ot-snapshots/etl/outputs/<release>/search/**/*.json
These files are provided by our collaborator EuropePMC. They contain the list of public and private publications together with extensive metadata. Moreover, EPMC extracts text and sentences using a machine learning approach.
The processing step extracts a common ground from the EPMC entities and the three main entities of Open Targets (Disease, Target, Drug).
The input section needs two datasets. The ot-luts section contains the LookUp Table with the Open Targets info; the dataset used for this specific project is the output of the "search" step of the ETL project. The epmc section contains the EPMC literature dataset. The outputs section contains the paths for the co-occurrences dataset and the matches dataset.
processing {
ot-luts {
format = "json"
path = "gs://open-targets-data-releases/21.02/output/ETL/search/**/*.json"
}
epmc {
format = "json"
path = "gs://otar-epmc/literature-files/**/*.jsonl"
}
outputs = {
cooccurrences {
format = ${common.output-format}
path = ${common.output}"/cooccurrences"
}
matches {
format = ${common.output-format}
path = ${common.output}"/matches"
}
}
}
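The ${common.output} and ${common.output-format} values are HOCON substitutions resolved against a common section of the configuration. A minimal sketch of such a section (the values below are illustrative, not the project defaults):

```hocon
common {
  # Base output location; each dataset is written under a sub-path of it
  output = "gs://some-bucket/literature-etl"
  # Format used for every output dataset
  output-format = "json"
}
```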
After a first phase of shaping the data/entity types and normalizing the texts, the two datasets are joined to create a single comprehensive dataset with the common information. The step generates two different datasets:
- Co-occurrences
- Matches

The co-occurrences dataset will be used by our data team to generate new features for our platform. The matches dataset will be used by the embedding step.
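As a rough illustration of the idea (not the project's actual code), sentence-level co-occurrences can be derived from the matches found in a single sentence: every unordered pair of entities matched in the same sentence yields one co-occurrence row. The case classes and function below are hypothetical names for this sketch.

```scala
// Hypothetical sketch: derive co-occurrences from the entity matches
// of one sentence. Names are illustrative, not the project's API.
case class Match(entityId: String, entityType: String)
case class CoOccurrence(id1: String, id2: String)

def coOccurrences(matches: Seq[Match]): Seq[CoOccurrence] =
  // Every unordered pair of matched entities becomes one co-occurrence.
  matches.combinations(2).collect {
    case Seq(a, b) => CoOccurrence(a.entityId, b.entityId)
  }.toSeq

// Example: a sentence mentioning one target, one disease and one drug
// produces three pairwise co-occurrences.
val sentenceMatches = Seq(
  Match("ENSG00000157764", "target"),
  Match("EFO_0000756", "disease"),
  Match("CHEMBL2028663", "drug")
)
```

In the real pipeline this pairing would run distributed over Spark, grouped by sentence; the sketch only shows the pairing logic itself.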
TO DO
Simply run the following command:
sbt assembly
The jar will be generated under target/scala-2.12/
The base configuration is found under src/main/resources/reference.conf. If you want to use specific configurations for a Spark job, see below.
The input information is described under:
- Obtain Opentarget entities
- Obtain EPMC entities
Output directories:
- matches
- cooccurrences
- word2vec
- word2vecSynonym
- xxyy

xxyy is used in the Open Targets front end via Elasticsearch.
Here is how to create a cluster using the gcloud tool. The current image version is preview-debian10 because it is the only image that supports Spark 3.
gcloud beta dataproc clusters create \
etl-cluster \
--image-version=preview-debian10 \
--properties=yarn:yarn.nodemanager.vmem-check-enabled=false,spark:spark.debug.maxToStringFields=1024,spark:spark.master=yarn \
--master-machine-type=n1-highmem-16 \
--master-boot-disk-size=500 \
--num-secondary-workers=0 \
--worker-machine-type=n1-standard-64 \
--num-workers=2 \
--worker-boot-disk-size=2000 \
--zone=europe-west1-d \
--project=open-targets-eu-dev \
--region=europe-west1 \
--initialization-action-timeout=20m \
--max-idle=30m
And to submit the job with either a local jar or one from a GCS bucket (gs://...):
gcloud dataproc jobs submit spark \
--cluster=etl-cluster \
--project=open-targets-eu-dev \
--region=europe-west1 \
--async \
--jar=gs://ot-snapshots/...
Add -Dconfig.file=application.conf to your run (on the command line, in an sbt task, or in an IntelliJ IDEA run configuration) and it will load the configuration from your ./ path or project root. Missing fields will be resolved with reference.conf.
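For example, an application.conf in the project root could override just one field of the processing step (the path below is illustrative):

```hocon
processing {
  epmc {
    format = "json"
    # Local sample instead of the full GCS dataset
    path = "file:///tmp/epmc-sample/**/*.jsonl"
  }
}
```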
The same happens with the logback configuration. You can add -Dlogback.configurationFile=application.xml and have a logback.xml in your project root or run path. An example log configuration file:
<configuration>
<appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
<encoder>
<pattern>%level %logger{15} - %message%n%xException{10}</pattern>
</encoder>
</appender>
<root level="WARN">
<appender-ref ref="STDOUT" />
</root>
<logger name="io.opentargets.etl.literature" level="DEBUG"/>
<logger name="org.apache.spark" level="WARN"/>
</configuration>
If you are using the Dataproc cluster you need to add some additional arguments specifying where the configuration can be found.
gcloud dataproc jobs submit spark \
--cluster=etl-cluster \
--project=open-targets-eu-dev \
--region=europe-west1 \
--async \
--files=application.conf \
--properties=spark.executor.extraJavaOptions=-Dconfig.file=application.conf,spark.driver.extraJavaOptions=-Dconfig.file=application.conf \
--jar=gs://ot-snapshots/...
where application.conf is a subset of reference.conf:
common {
output = "gs://ot-snapshots/etl-literature/prod-latest"
}
The fat jar can be executed on a local installation of Spark using spark-submit:
/usr/lib/spark/bin/spark-submit --class io.opentargets.etl.literature.Main \
--driver-memory $(free -g | awk '{print $7}')g \
--master local[*] \
<jar> --arg1 ... --arg2 ...
- Add tag to master so we can recreate the jar
git tag -a <release> -m "Release <release>"
git push origin <release>
Where <release> is something like 20.11.0 (year.month.iteration). Hopefully we don't need the iteration.
- Create jar and push to cloud storage
sbt assembly
gsutil cp target/scala-2.12/<jar> gs://open-targets-data-releases/<release>/platform-etl-literature/<jar>
- Generate input files and push to cloud
- Create updated configuration file and push to cloud
- Save file in same place as jar so it can be re-run if necessary.
- Run the steps
- Create Elasticsearch index (script in the platform-backend-etl repository)
- The relevant output file is xxyy
| Version | Date | Notes |
|---|---|---|
| 1.0.0 | March 2021 | Initial release |
Copyright 2014-2018 Biogen, Celgene Corporation, EMBL - European Bioinformatics Institute, GlaxoSmithKline, Takeda Pharmaceutical Company and Wellcome Sanger Institute
This software was developed as part of the Open Targets project. For more information please see: http://www.opentargets.org
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.