
ncov-ingest's Introduction

nCoV Ingestion Pipeline

Internal tooling for the Nextstrain team to ingest and curate SARS-CoV-2 genome sequences. The pipeline is open source, but we are not intending to support it for use by outside groups. Relies on data from https://simplemaps.com/data/us-cities.

Outputs documented here are part of ncov-ingest's public API: https://docs.nextstrain.org/projects/ncov/en/latest/reference/remote_inputs.html

Running locally

NOTE: The full set of sequences from GISAID/GenBank will most likely require more compute resources than what is available on your local computer.

To debug all rules on a subset of the data, you can use the config/debug_sample_genbank.yaml and config/debug_sample_gisaid.yaml config files. These download raw data from AWS S3, randomly keeping only a subset of lines of the input files (configurable in the config file). This way, the pipeline completes in a matter of minutes with modest storage requirements on a local machine. However, the output data should not be trusted, as the biosample and cog-uk input lines are sampled independently of the main ndjson.

To get started, you can run the following:

snakemake -j all --configfile config/debug_sample_genbank.yaml  -pF --ri --nt

Warning: If you are running the pipeline without a Nextclade cache, it will do a full Nextclade run that aligns all sequences, which will take significant time and resources!

Follow these instructions to run the ncov-ingest pipeline without all the bells and whistles used by internal Nextstrain runs that involve AWS S3, Slack notifications, and GitHub Action triggers:

GISAID

To pull sequences directly from GISAID, you are required to set two environment variables:

  • GISAID_API_ENDPOINT
  • GISAID_USERNAME_AND_PASSWORD

Then run the ncov-ingest pipeline with the nextstrain CLI:

nextstrain build \
  --image nextstrain/ncov-ingest \
  --env GISAID_API_ENDPOINT \
  --env GISAID_USERNAME_AND_PASSWORD \
  . \
    --configfile config/local_gisaid.yaml

GenBank

Sequences can be pulled from GenBank without any environment variables. Run the ncov-ingest pipeline with the nextstrain CLI:

nextstrain build \
  --image nextstrain/ncov-ingest \
  . \
  --configfile config/local_genbank.yaml

Running automatically

The ingest pipelines are triggered from the GitHub workflows .github/workflows/ingest-master-*.yml and …/ingest-branch-*.yml but run on AWS Batch via the nextstrain build --aws-batch infrastructure. They're run on pushes to master that modify source-data/*-annotations.tsv and on pushes to other branches. Pushes to branches other than master upload files to branch-specific paths in the S3 bucket, don't send notifications, and don't trigger Nextstrain rebuilds, so that they don't interfere with the production data.

AWS credentials are stored in this repository's secrets and are associated with the nextstrain-ncov-ingest-uploader IAM user in the Bedford Lab AWS account, which is locked down to reading and publishing only the gisaid.ndjson, metadata.tsv, and sequences.fasta files and their zipped equivalents in the nextstrain-ncov-private S3 bucket.

Manually triggering the automation

A full run is now done in 3 steps via manual triggers:

  1. Fetch new sequences and ingest them by running ./vendored/trigger nextstrain/ncov-ingest gisaid/fetch-and-ingest --user <your-github-username>.
  2. Add manual annotations, update location hierarchy as needed, and run ingest without fetching new sequences.
    • Pushes of source-data/*-annotations.tsv to the master branch will automatically trigger a run of ingest.
    • You can also run ingest manually by running ./vendored/trigger nextstrain/ncov-ingest gisaid/ingest --user <your-github-username>.
  3. Once all manual fixes are complete, trigger a rebuild of nextstrain/ncov by running ./vendored/trigger ncov gisaid/rebuild --user <your-github-username>.

See the output of ./vendored/trigger nextstrain/ncov-ingest gisaid/fetch-and-ingest --user <your-github-username>, ./vendored/trigger nextstrain/ncov-ingest gisaid/ingest or ./vendored/trigger nextstrain/ncov-ingest rebuild for more information about authentication with GitHub.

Note: running ./vendored/trigger nextstrain/ncov-ingest posts a GitHub repository_dispatch. Regardless of which branch you are on, it will trigger the specified action on the master branch.

Valid dispatch types for ./vendored/trigger nextstrain/ncov-ingest are:

  • ingest (both GISAID and GenBank)
  • gisaid/ingest
  • genbank/ingest
  • fetch-and-ingest (both GISAID and GenBank)
  • gisaid/fetch-and-ingest
  • genbank/fetch-and-ingest
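For reference, ./vendored/trigger ultimately posts a repository_dispatch event through the GitHub API. A rough, hedged equivalent using curl is shown below (the exact payload and authentication used by the script may differ); it assumes a personal access token in GITHUB_TOKEN and uses one of the dispatch types above:

curl -X POST https://api.github.com/repos/nextstrain/ncov-ingest/dispatches \
  --header "Accept: application/vnd.github+json" \
  --header "Authorization: Bearer $GITHUB_TOKEN" \
  --data '{"event_type": "gisaid/fetch-and-ingest"}'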

Updating manual annotations

Manual annotations should be added to source-data/gisaid_annotations.tsv. A common pattern is expected to be:

  1. Run https://github.com/nextstrain/ncov.
  2. Discover metadata that needs fixing.
  3. Update source-data/gisaid_annotations.tsv.
  4. Push changes to master and re-download gisaid/metadata.tsv.

Configuring alerts for new GISAID data from specific location hierarchy areas

Some Nextstrain team members may be interested in receiving alerts when new GISAID strains are added from specific locations, e.g. Portugal or Louisiana. To add a custom alert configuration, create a new entry in new-sequence-alerts-config.json. Each resolution (region, division, country, location) accepts a list of strings of areas of interest. Note that these strings must match the area name exactly.

To set up custom alerts, you'll need to retrieve your Slack member ID. Note that the user field in each alert configuration is for human use only -- it need not match your Slack display name or username. To view your Slack member ID, open up the Slack menu by clicking your name at the top, and then click on 'View profile'. Then, click on 'More'. You can then copy your Slack member ID from the menu that appears. Enter this into the slack_member_id field of your alert configuration.
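As a rough illustration only (check new-sequence-alerts-config.json for the exact schema), an entry might look something like this, with hypothetical values:

{
  "user": "emmahodcroft",
  "slack_member_id": "U0XXXXXXXXX",
  "division": ["Texas", "Louisiana"],
  "country": ["Switzerland"]
}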

Rerunning Nextclade ignoring cache after Nextclade dataset is updated

Clade assignments and other QC metadata output by Nextclade are currently cached in nextclade.tsv in the S3 bucket and only incremental additions for the new sequences are performed during the daily ingests. Whenever the underlying nextclade dataset (reference tree, QC rules) and/or nextclade software are updated, it is necessary to perform a full update of nextclade.tsv, rerunning for all of the GISAID and GenBank sequences all over again, to account for changes in the data and Nextclade algorithms.

In order to tell ingest not to use the cached nextclade.tsv/aligned.fasta and instead perform a full rerun, you need to add an (empty) touchfile to the S3 bucket (available as ./scripts/developer_scripts/rerun-nextclade.sh):

aws s3 cp - s3://nextstrain-ncov-private/nextclade.tsv.zst.renew < /dev/null
aws s3 cp - s3://nextstrain-data/files/ncov/open/nextclade.tsv.zst.renew < /dev/null

Ingest will automatically remove the touchfiles after it has completed the rerun.

To rerun Nextclade using the sars-cov-2-21L dataset - which is only necessary when the calculation of immune_escape and ace2_binding changes - you need to add an (empty) touchfile to the S3 bucket (available as ./scripts/developer_scripts/rerun-nextclade-21L.sh):

aws s3 cp - s3://nextstrain-ncov-private/nextclade_21L.tsv.zst.renew < /dev/null
aws s3 cp - s3://nextstrain-data/files/ncov/open/nextclade_21L.tsv.zst.renew < /dev/null

Required environment variables

  • GISAID_API_ENDPOINT
  • GISAID_USERNAME_AND_PASSWORD
  • AWS_DEFAULT_REGION
  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
  • SLACK_TOKEN
  • SLACK_CHANNELS

vendored

This repository uses git subrepo to manage copies of ingest scripts in vendored, from nextstrain/ingest. To pull new changes from the central ingest repository, first install git subrepo.

See vendored/README.md for instructions on how to update the vendored scripts. Note that this repo is a special case and does not put vendored scripts in an ingest/ directory, so modify the documented commands accordingly.


ncov-ingest's Issues

Add GISAID accessions to the GenBank metadata file

The GISAID metadata references GenBank accessions, but the GenBank metadata doesn't have any GISAID accessions. Refactor how these accessions are tracked and cross-referenced in transform-gisaid and transform-genbank so they share a source-data/ file separate from annotations.

See also #124 (comment) re: versioned vs. non-versioned GenBank accessions.

GitHub API soon to deprecate basic authentication

Yesterday I used the manual ./bin/trigger script for the first time with the --user option and received this automatic email from GitHub:

Hi @kairstenfay,

On July 3rd, 2020 at 17:15 (UTC) you used a password to access an endpoint through the GitHub API using curl/7.58.0:

https://api.github.com/repositories/239887733/dispatches

Basic authentication using a password to the API is deprecated and will soon no longer work. Visit https://developer.github.com/changes/2020-02-14-deprecating-password-auth/ for more information around suggested workarounds and removal dates.

We may want to consider removing this option from ./bin/trigger and updating the README.

Warn in `transform` if annotations fields cannot be applied b/c they aren't valid

Context
The ncov-ingest gisaid_transform script takes in a list of additions or modifications to the sequences from gisaid_annotations.tsv. These allow updates to particular 'fields' (ex: location, strain, country_exposure - called keys in the script) and are matched by EPI_ISL number to replace the data that comes from GISAID by default via JSON.

Description
It would be great to throw up warnings (as we currently do for malformed lines with != 4 columns) in two cases:

  • The EPI_ISL number specified does not exist in the data
  • The key that is being changed/added is not a valid field in the data

This would just allow us to flag up any mistakes that might be made when adding new entries to gisaid_annotations.tsv - mostly typos. Most new lines are generated automatically now by scripts, so I don't think this would be a common problem - but would be good to catch if it does happen!

Possible solution
Add a check to gisaid_annotations so that if EPI_ISL number isn't found, or the field asking to be replaced doesn't exist, then output a warning (which would get piped to Slack as they do now).
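A minimal Python sketch of the proposed check, assuming the annotations have already been parsed into (epi_isl, key, value) tuples and the GISAID records are available as a dict keyed by EPI_ISL number (names here are illustrative, not the actual variables in the transform script):

def warn_on_unusable_annotations(annotations, records_by_epi_isl, valid_keys):
    """Print warnings (piped to Slack, as existing warnings are) for annotations that cannot be applied."""
    for epi_isl, key, value in annotations:
        if epi_isl not in records_by_epi_isl:
            print(f"WARNING: annotation references unknown EPI_ISL {epi_isl!r} ({key} = {value})")
        elif key not in valid_keys:
            print(f"WARNING: annotation for {epi_isl} uses unknown field {key!r}")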

Only update sequences.fasta and metadata.tsv on diff

I updated fetch-data to only upload to S3 if SHA returns a diff. This should make tracking versions on S3 easier: 7972fd2#diff-ec427e971894f9cf86b2f01e8a4a5bb5

It would be a nice improvement to have the same procedure with transform-data.py, where S3 is only pushed if the file is actually new.

Without this, we have an issue where ncov Snakemake is smart about diffing the local data/metadata.tsv against the S3 version and will want to rerun even when it doesn't need to.

Standardize dates

Current Behavior
We don't do any date pre-processing or standardization, allowing invalid ISO 8601 dates (like 2020-5-8) to slip through. This was detected by the recent addition of the flagged_metadata.txt file where a bunch of invalid dates were flagged for not being less than or equal to the current date.

Expected behavior
Non-zero-padded or otherwise invalid dates like 2020-5-8 should be standardized to their expected format, e.g. 2020-05-08.
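A minimal sketch of the kind of normalization that could run during transform; this is an assumption about where it would live, and a real implementation would also need to handle partial or placeholder dates (e.g. 2020-05 or 2020-XX-XX):

from datetime import datetime

def normalize_date(value: str) -> str:
    """Zero-pad complete year-month-day dates, e.g. '2020-5-8' -> '2020-05-08'."""
    try:
        return datetime.strptime(value.strip(), "%Y-%m-%d").strftime("%Y-%m-%d")
    except ValueError:
        # Leave partial or otherwise unparseable dates for downstream flagging.
        return value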

Allow ncov-ingest to upload to non-AWS remote file paths

Context

Related to #307, we want to support users who run the ingest workflow on different cloud platforms like Google Cloud (for example, through Terra). Although we can bypass the ingest workflow's default upload target that expects an S3 bucket as its destination in WDL workflows on Terra and upload directly to GS from the WDL workflow, we would probably benefit in the long run from native support for uploads to other cloud platforms besides S3.

Description

As in #307, we should allow users to specify a generic "destination" URI corresponding to at least an S3 or GS bucket such that the upload rule uploads all data from the workflow to the given bucket. This means the remainder of the S3-specific logic in the workflow needs to be replaced with generic remote file support.

We might be able to reuse the remote file handling logic from ncov to support writing files to valid remote URIs. However, we may also want to compress files in the workflow prior to upload instead of compressing during the upload process.
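A hedged sketch of what scheme-based dispatch could look like (the function name and details are illustrative, not the workflow's actual rules; it assumes boto3 and the google-cloud-storage bindings are available):

from urllib.parse import urlparse

def upload(local_path: str, destination: str) -> None:
    """Upload a local file to an s3:// or gs:// destination based on the URI scheme."""
    parsed = urlparse(destination)
    bucket, key = parsed.netloc, parsed.path.lstrip("/")
    if parsed.scheme == "s3":
        import boto3
        boto3.client("s3").upload_file(local_path, bucket, key)
    elif parsed.scheme == "gs":
        from google.cloud import storage
        storage.Client().bucket(bucket).blob(key).upload_from_filename(local_path)
    else:
        raise ValueError(f"Unsupported destination scheme: {parsed.scheme!r}")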

Revisit fetch from GISAID

Context
On Dec 2, 2021, multiple fetch-and-ingest runs for GISAID failed. The failure pattern was we would download for a while and the transfer would get closed before it's completed. Subsequent attempts to fetch would hit a 503 error. We manually triggered fetch-and-ingest two more times and saw the same failure pattern.

Possible solution
The scheduled run today had no issues, so this may have just been unfortunate timing of our runs being interrupted by GISAID's reboots. We can revisit the following solutions in anticipation of similar future issues:

  1. Manual downloads from the same API endpoint were able to complete successfully when done without streaming decompression. We can update fetch-from-gisaid to stop decompressing during streaming to lower the open-connection time (see the sketch after this list); however, decompressing in a separate step would increase the total time to run fetch-and-ingest.
  2. Switch to an endpoint with xz, which has better compression ratio and decompression time than bzip2. Regardless of errors, this would be a huge improvement for us and dramatically decrease fetch-and-ingest runtime.
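A hedged sketch of the first option, assuming the endpoint serves bzip2-compressed NDJSON and accepts the existing credentials; the real fetch-from-gisaid may use a different downloader and flags:

# Download first, keeping the connection open only for the transfer itself,
# then decompress in a separate step.
curl --fail --silent --show-error \
     --user "$GISAID_USERNAME_AND_PASSWORD" \
     --output data/gisaid.ndjson.bz2 \
     "$GISAID_API_ENDPOINT"
lbzip2 -d data/gisaid.ndjson.bz2    # produces data/gisaid.ndjson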

Allow data from outside GISAID

We're often getting genomes + metadata shared prior to GISAID release. It would be good to allow this to be incorporated into this repo for consumption by the ncov pipeline. Not sure of the best way to address this -- we could commit a "non-GISAID" FASTA + TSV, but that may cause the repo to grow too big over time.

Run `./bin/check-annotations` as CI (or similar)

#38 introduced ./bin/check-annotations which helps identify problems with the contents of annotations.tsv. This highlights issues that may only become evident stochastically after running the auspice build.

It would be nice to run this as a CI (or similar). For this to happen the script would have to be modified to exit appropriately which it doesn't currently do.

Location is left blank after parsing

Thanks so much for such useful scripts!!!

Current Behavior
In raw gisaid data:
zstdgrep "Southern San Joaquin Valley" gisaid.ndjson.zst gives
{"covv_accession_id": "EPI_ISL_2171674", "covv_add_host_info": "", "covv_assembly_method": "iVar 1.2.2 / BWA 0.7.17-r1188", "covv_collection_date": "2021-01-24", "covv_comment": "", "covv_comment_type": "", "covv_host": "Human", "covv_location": "North America / USA / California / Southern San Joaquin Valley", ....

After

/ncov-ingest/bin/transform-gisaid \
    gisaid.ndjson \
    --output-metadata metadata.tsv \
    --output-fasta sequences.fasta \
    --output-unix-newline

The resulting metadata.tsv has Southern San Joaquin Valley removed from the location column.

Expected behavior
All location information be carried over to the transformed metadata.tsv

How to reproduce
Steps to reproduce the current behavior:
Latest version of the repo.

Or am I missing some filtering the ncov-ingest does on metadata?

Better Info when Assert in `check-locations`

Context
We don't permit any entries to have missing information in division_exposure, country_exposure, or region_exposure, which is checked in the check-locations script.

If an entry is blank, an assert is triggered and the ncov-ingest pipeline fails. Unfortunately, there's little other information provided, so it can be hard to figure out what triggered the assert.

Description
It would be great to give a list of sequences/epi_isl numbers of any sequences falling foul of the rule, so that it's easier to investigate and fix the problem.

Examples
An example happened on 27 July, with sequence EPI_ISL_498226 - the location was listed only as 'Latvia' with no region. Thus, the region was 'Latvia' and country and division (thus country_exposure & division_exposure) were empty. This triggered the assert.

However, this was done after making many changes to gisaid_annotations, which historically has been the cause of this error. Even though 37 new sequences were also ingested after the changes to gisaid_annotations, as the ncov-ingest pipeline wasn't completing, it was hard to see what these were, and we assumed it was the gisaid_annotations changes that had somehow caused the problem.

Being able to see that it was the Latvia/test1/2020 sequence causing the problem would have saved us a lot of time trying to find the issue!

GISAID ingest keeps running out of memory

These workflows:

keep running out of memory on AWS and being killed, e.g. https://github.com/nextstrain/ncov-ingest/runs/3764464129?check_suite_focus=true.

This likely happens during the run of https://github.com/nextstrain/ncov-ingest/blob/master/bin/transform-gisaid since it takes gisaid.ndjson (the raw GISAID full dataset, which is over 100GB) as input, and performs a bunch of operations on it.

To avoid continually increasing the resources we ask for on the batch job, here are some ideas:

@tsibley said:

We should also understand why the memory needs increased even though the core of the ETL is streaming, and maybe also consider running these on m5 instance family instead of c5.
(which could be as small a change as adding m5 instances to the job queue used.)

@rneher said:

there are some lengthy compression steps that could happen in parallel to the rest of the pipeline.
the gzip compression is about 1h. Changing the compression of the ndjson to xz -2 already saved a lot.

Use idiomatic features of Snakemake

Context

The Snakemake workflow for ingest was written to map as directly from the original ingest shell scripts to Snakemake as possible. As a result, some of the workflow's implementation doesn't follow standard Snakemake idioms.

This issue comes from a code review comment of the original Snakemake workflow PR.

Description

We should replace shell-based idioms with Snakemake's own idioms including:

check-locations: Columns have mixed types

A snippet from the GISAID ingest log:

+ ./bin/check-locations data/gisaid/metadata.tsv data/gisaid/location_hierarchy.tsv gisaid_epi_isl
sys:1: DtypeWarning: Columns (9,28,37,39,43) have mixed types.Specify dtype option on import or set low_memory=False

This is probably due to unexpected types in the data, in either metadata.tsv or location_hierarchy.tsv. For example, we might be assuming a column contains numbers, but in reality it contains mostly numbers and then some runaway strings. It might also be something more sophisticated.

How to investigate:

This can be investigated in isolation from the pipeline, by running the ./bin/check-locations script.

  • download data/gisaid/metadata.tsv from S3
  • review input data files for obvious defects
  • review ./bin/check-locations for obvious defects
  • try to do binary search on the input data files to find rows that trigger the issue: delete half of the file, if problem goes away then the problem is in the other half, if not, remove half of what's remaining. Repeat until the minimal set of rows is found that reproduces the issue. Inspect these rows.

There are a few places throughout the ingest scripts where these warnings were silenced by setting low_memory=False, as proposed in the warning message. But this might not be what we need and might just hide programming mistakes and generate bogus outputs. We might need to search for occurrences of low_memory in the codebase to see what's going on there.

Only trigger preprocess after ingest if there are new sequences or new annotations

Context
From watching behavior of GISAID fetch and ingest action and from reading through the script ingest-gisaid: https://github.com/nextstrain/ncov-ingest/blob/master/bin/ingest-gisaid#L166, I'm pretty sure that preprocess is always triggered.

This means that we can't have ingest-gisaid running every ~6 hours to better catch updates to the endpoint by GISAID. (Unless we want a whole lot of wasted processing, and even then it's going to overwrite manual updates)

Description
Running ingest-gisaid should compare metadata.tsv.gz produced during the script's run to the file metadata.tsv.gz on S3, along with a similar comparison of sequences.fasta.xz. Only if hashes of either metadata or sequences are different should preprocess be triggered.
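A hedged sketch of the comparison, assuming we keep storing a sha256sum in the S3 object metadata as we already do (the metadata key, bucket, and key names here are illustrative):

local_sha="$(sha256sum data/gisaid/metadata.tsv.gz | cut -d' ' -f1)"
remote_sha="$(aws s3api head-object \
    --bucket nextstrain-ncov-private --key metadata.tsv.gz \
    --query 'Metadata.sha256sum' --output text)"

if [[ "$local_sha" != "$remote_sha" ]]; then
    echo "metadata changed; triggering preprocess"
    # ...dispatch the preprocess workflow here...
fi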

cc @tsibley @joverlee521

Include BioSample metadata in GenBank ingest

Many sequences that we pull from the GenBank API do not include location because this data is added to the linked BioSample.
We can ingest and link the BioSample record to get the complete metadata for GenBank sequences.

Note that Bio.Entrez.read() currently does not work with BioSample XMLs because they do not have a properly formatted DTD.
We need to parse the XML ourselves.
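A hedged sketch of that parsing with the standard library; the element and attribute names below follow the public BioSample XML format and should be verified against real records:

import xml.etree.ElementTree as ET

def parse_biosample_attributes(xml_text: str) -> dict:
    """Map each BioSample accession to a dict of its attribute_name -> value pairs."""
    records = {}
    for biosample in ET.fromstring(xml_text).iter("BioSample"):
        accession = biosample.get("accession")
        records[accession] = {
            attribute.get("attribute_name"): (attribute.text or "").strip()
            for attribute in biosample.iter("Attribute")
        }
    return records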

Allow ncov-ingest to download from non-AWS remote file paths

Context

To support running ingest on Terra, we need to support downloading existing Nextclade alignments and metadata from remote storage other than AWS's S3.

Description

We need to support the definition of a "source" bucket in the workflow configuration YAML associated with at least S3 or GS URIs. This means changing the name of the configuration variable from s3_src to a more generic name and modifying all of the logic in the workflow that refers specifically to downloading from S3 (e.g., "download_from_s3" script, etc.).

As part of this work, we will also need to modify the Pipfile used to populate the ncov-ingest Docker image by adding the Python bindings for Google Cloud Storage. See the Dockerfile for the docker-base image. See @tsibley's comments below.

Examples

The modified ingest should continue to work with our production S3 buckets, but it should also work from GS buckets accessed through Terra.

See code in ncov for handling remote files.

Alert for 'Future Dates' When Pre-Processing (Ingesting)

Context
Currently full runs take 2 hours. If sequences aren't checked when they come in (by scrolling through the incredibly helpful messages that appear in Slack) then it's very easy to miss if dates are seriously incorrect - being from the future.

However, these dates mess up the whole run, as they ruin the tree, and the only fix is to add to exclude.txt and re-run. It only takes missing a couple to waste half a day on runs!

Often we can get quick feedback from submitters on the correct date (or, they can be corrected online), so we wouldn't necessarily want these auto-kicked out (I think?), but it would be great to be alerted to these at the same time as we are alerted to new sequences, so we can immediately add to exclude.txt and reach out to submitters.

Description
When sequences come in, can they be checked against today's date? If they are > today's date ('from the future'), then can we get a big alert on Slack, giving a list of the 'futuristic sequences' so that we know immediately? Then we are able to take action!

Examples & Possible solution

I think a message like the ones we currently get would be great. Maybe another Icon so that they are really easy to spot - if a lot comes through sometimes the messages are easy to miss!

Originating Lab and Submitting Lab locations

In existing curated metadata, I had been aiming for things like:

"covv_subm_lab": "Pathogen Genomics Center, National Institute of Infectious Diseases",
"covv_subm_lab_addr": "1-23-1 Toyama, Shinjuku-ku, Tokyo 162-8640, Japan"

to end up something like:

"submitting_lab": "Pathogen Genomics Center, National Institute of Infectious Diseases, Tokyo, Japan"

Otherwise there are a bunch of conflicting "Centers for Disease Control" and "National Institutes of Infectious Diseases". I think this could mostly work by taking the last two words in the address, dropping numbers to grab things like "Tokyo, Japan" and "Beijing, China". But we do have situations where address is "1600 Clifton Rd NE, Atlanta GA 30329" and "2121 West Taylor, Chicago IL 60612" (the US people tend to drop USA from the address).

I think the immediate solution can be just to append "covv_subm_lab_addr", but it would be nice to have this cleaner.

Improve Slack bot build notifications

Right now the nCov notifier Slack bot's messaging could be improved. Two ideas:

  • say which AWS Batch Job has been deployed to staging
  • report failed builds

Commented Annotations with 4 Columns are Still Applied

Current Behavior
As spotted by @eharkins - currently it seems rows that are commented but have 4 columns are applied, rather than ignored.

In this commit, lines 698-707, Eli included some commented information for reference but that shouldn't be processed.

Example:

# USA/OK-ADDL-03/2020	EPI_ISL_535362	location	City Suburb
# USA/OK-ADDL-04/2020	EPI_ISL_535363	location	City Suburb
# USA/PR-B35B/2020	EPI_ISL_534693	location	Ausilio Mutuo # Hospital in San Juan
# USA/PR-B3SB/2020	EPI_ISL_534694	location	Ausilio Mutuo # Hospital in San Juan

However, after pushing this update, we were notified that these locations had been applied despite being commented out.

Expected behavior
Any line that starts with # should be fully ignored, even if it has the right number of columns.

Possible solution
This seems to stem from this section of the code in gisaid_transform where we check whether a line has 4 columns; only if it does not do we then check whether it is commented. If the line has 4 columns, the script goes ahead and tries to process it, even if it starts with #.

We should change this so that it first checks to see if the first character (minus whitespace) is # and ignore the whole line if so (regardless of number of columns).
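A minimal sketch of the proposed order of checks (illustrative, not the actual transform code):

def parse_annotation_line(line: str):
    """Return (strain, epi_isl, key, value), or None for comment, blank, or malformed lines."""
    stripped = line.lstrip()
    if not stripped or stripped.startswith("#"):
        # Ignore comments entirely, regardless of how many columns they contain.
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        print(f"WARNING: skipping malformed annotation line: {line!r}")
        return None
    return tuple(fields)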

Temporary solution
Ensure that commented columns we don't want to be applied do not have 4 columns.

Stop triggering rebuild on edits to annotations

A rebuild is not needed every single time, and manual rebuilds are easy to re-trigger. This will result in about the same amount of effort (triggering 50% of the time vs. the status quo of cancelling 50% of the time), but is more logical and results in fewer Slack notifications.

(See this Slack thread for the original context.)

Slightly different strain names in sequences.fasta vs metadata

Current Behavior
While the vast majority of today's 150k strain names match between sequences.fasta and metadata downloaded from GISAID, there are 16 sequences with slightly different names in sequences.fasta compared to metadata.

Expected behavior
Ideally 100% of the names could match, so that the metadata for every sequence in sequences.fasta could be automatically associated with the sequence.

How to reproduce

  1. Download nextmeta and nextfasta from GISAID (currently metadata_2020-10-20_07-16.tsv.gz and sequences_2020-10-20_07-17.fasta.gz)

  2. Extract and compare strain names in a command shell:

gunzip -c metadata_2020-10-20_07-16.tsv.gz | tail -n +2 | cut -f 1 | sort > metadata.names
gunzip -c sequences_2020-10-20_07-17.fasta.gz | grep ^\> | sed -e 's/^>//' | sort > sequences.names
comm -3 metadata.names sequences.names
  3. Differences (first column is the name that appears in metadata_2020-10-20_07-16.tsv.gz, second column is the name that appears in sequences_2020-10-20_07-17.fasta.gz):
	/Italy/SIC-IZSBM169-57/2020
	/Italy/SIC-IZSBM169-76/2020
	Bangladesh/BCSIR-NILMRC-125/2020
	Bangladesh/BCSIR-NILMRC-131/2020
	Bangladesh/BCSIR-NILMRC-139/2020
	Bangladesh/BCSIR-NILMRC-145/2020
	Bangladesh/BCSIR-NILMRC-146/2020
	Bangladesh/BCSIR-NILMRC-150/2020
	Bangladesh/BCSIR-NILMRC-151/2020
	Bangladesh/BCSIR-NILMRC-152/2020
	Bangladesh/BCSIR-NILMRC-153/2020
Bangladesh/BCSIR-NILMRC_125/2020
Bangladesh/BCSIR-NILMRC_131/2020
Bangladesh/BCSIR-NILMRC_139/2020
Bangladesh/BCSIR-NILMRC_145/2020
Bangladesh/BCSIR-NILMRC_146/2020
Bangladesh/BCSIR-NILMRC_150/2020
Bangladesh/BCSIR-NILMRC_151/2020
Bangladesh/BCSIR-NILMRC_152/2020
Bangladesh/BCSIR-NILMRC_153/2020
India/HR-THSTI-42/2020
	India/HR-THSTI-BAL-42/2020
Italy/SIC-IZSBM169-57/2020
Italy/SIC-IZSBM169-76/2020
Spain/CT-015D0/2020
Spain/CT-2020030095/2020
	Spain/CT-IrsiCaixa-015D0/2020
	Spain/CT-IrsiCaixa-2020030095/2020
Switzerland/BS-42169171/2020
	Switzerland/BS-UHB-42169171/2020
env/Italy/LAZ-INMI-HSacco-1/2020
	env/Italy/LOM-INMI-HSacco-1/2020

Perhaps some name cleanups are applied to the metadata strain names that could also be applied to the sequence strain names?

Thank you so much for providing nextmeta and nextfasta, they're indispensable!

join-metadata-and-clades: A value is trying to be set on a copy of a slice from a DataFrame

A snippet from the GISAID ingest log:

+ ./bin/join-metadata-and-clades data/gisaid/metadata.tsv data/gisaid/nextclade.tsv -o data/gisaid/metadata.tsv
./bin/join-metadata-and-clades:110: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  result["clock_deviation"][np.isnan(div_array)|np.isnan(t)] = np.nan

This is an assignment to a copy of the data, and the updated copy is discarded, so in the end the assignment has no effect. This operates on metadata, so we need to carefully investigate the consequences here.

This can be investigated in isolation from the pipeline, by downloading the 2 input data files and running the ./bin/join-metadata-and-clades standalone script.
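If chained assignment does turn out to be the problem, the usual fix is a single .loc assignment on the original frame. A self-contained toy example of the pattern (not the script's actual data):

import numpy as np
import pandas as pd

# Chained indexing like result["clock_deviation"][mask] = np.nan may write to a
# temporary copy, so the change can be silently lost.
result = pd.DataFrame({"clock_deviation": [1.0, 2.0, 3.0]})
mask = np.array([True, False, True])

# A single .loc assignment writes to the original DataFrame and avoids the warning:
result.loc[mask, "clock_deviation"] = np.nan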

Use NCBI Datasets command line tool to download SARS-CoV-2 data from GenBank

Start using the NCBI Datasets command line tool to download SARS-CoV-2 data from GenBank. This would replace the current fetch-from-genbank script, which has been consistently failing over the last few weeks.

I think we would mostly be interested in the virus metadata report and the genomic FASTA.

This will also require some updates to the transform-genbank script since the metadata and sequences will be in separate files instead of one NDJSON.

Fix COG-UK data

Most of the COG-UK data is currently not included in our open builds, despite being on GenBank, because of insufficient metadata. But COG-UK explicitly suggests using the metadata provided on CLIMB to patch this. Below is a sample comment from a GenBank record.

COG_ACCESSION:COG-UK/BCNYJH/BIRM:20210316_1221_X5_FAP43200_37a8944f; COG_BASIC_QC COG_HIGH_QC:PASS; 
COG_NOTE:Sample metadata and QC flags may been updated since deposition in public databases. COG recommends users refer to data.covid19.climb.ac.uk for metadata and QC tables before conducting analysis.

Linking of the GenBank data could happen either via the COG_ACCESSION in the comment or via ena_sample.secondary_accession. The latter currently exists for 339,000 samples in GenBank.

The following code snippet would load these data from CLIMB and turn them into a data frame indexed by ena_sample.secondary_accession that is also present in the genbank.ndjson as "biosample_accession":"SAMEA8463954".

import pandas as pd
import uuid

def load_coguk_sample_accessions():
    a = pd.read_csv("https://cog-uk.s3.climb.ac.uk/accessions/latest.tsv", sep='\t', index_col="central_sample_id")
    return a.loc[~a.index.duplicated(keep='first')]

def load_coguk_metadata():
    # function to generate a sample id from the sequence name
    def get_sample_id(x):
        try:
            sample_id = x.split('/')[1]
        except:
            sample_id = f'problemsample_{str(uuid.uuid4())}'
        return sample_id

    metadata = pd.read_csv("https://cog-uk.s3.climb.ac.uk/phylogenetics/latest/cog_metadata.csv", sep=',')[['sequence_name', 'country', 'adm1', 'is_pillar_2', 'sample_date','epi_week', 'lineage', 'lineages_version']]
    coguk_sample_ids = metadata.sequence_name.apply(get_sample_id)
    metadata.index = coguk_sample_ids
    return metadata.loc[~metadata.index.duplicated(keep='first')]

def join_coguk_metadata():
    return pd.concat([load_coguk_metadata(), load_coguk_sample_accessions()], axis='columns', copy=False)


def coguk_by_sample_accession(data):
    has_sample = data.loc[~data["ena_sample.secondary_accession"].isna()]
    has_sample.index = has_sample["ena_sample.secondary_accession"]
    return has_sample.loc[~has_sample.index.duplicated(keep='first')]

meta = join_coguk_metadata()
meta_by_biosample = coguk_by_sample_accession(meta)

To make this work, we would need to add this as a filter in transform-genbank somewhere here:

https://github.com/nextstrain/ncov-ingest/blob/master/bin/transform-genbank#L414

Run ncov-ingest on Terra

Description

Allow users with GISAID API endpoint access to run ncov-ingest with their credentials on Terra.

Possible solutions

Create a separate task in the ncov WDL workflow that downloads and runs the ingest workflow.

Location hierarchy formatting issues

The source-data/location_hierarchy.tsv file contains some issues where

  • some rows only have 3 columns instead of 4 (which GitHub currently points out when you view this file in the browser), i.e. a missing final tab
  • some rows have 5 columns instead of 4, i.e. a surplus final tab

This bug was discovered when seemingly duplicate location hierarchies were appearing in Slack notifications describing them as new region-country-division-location combinations.

To prevent future formatting issues with the manually updated location hierarchy file, I suggest adding a script like bin/clean-location-hierarchy that either

  • does some bash magic (something like cut -f1-4 source-data/location-hierarchy.tsv to grab only the first four columns, then something else I haven't quite figured out to add a trailing tab to rows missing the 4th column -- awk and gsub()?), or
  • loads it into pandas and prints it back out again with only 4 columns, tab-delimited (as sketched below).

I also suggest sorting the file on region, country, division, and location in case a row is misplaced.

This cleaning script should be added to the ingest scripts.
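A hedged pandas sketch of the second option; it assumes the file has no header row and at most five tab-separated fields per row, so adjust accordingly:

import pandas as pd

COLUMNS = ["region", "country", "division", "location"]

# Read up to five fields per row (some rows currently have 3 or 5).
hierarchy = pd.read_csv(
    "source-data/location_hierarchy.tsv",
    sep="\t", header=None, names=list(range(5)), dtype=str,
)

# Keep only the first four columns, normalize missing values, sort, and rewrite.
hierarchy = hierarchy.iloc[:, :4].fillna("")
hierarchy.columns = COLUMNS
hierarchy.sort_values(COLUMNS).to_csv(
    "source-data/location_hierarchy.tsv", sep="\t", header=False, index=False,
)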

Include quick clade count after every ingest in logs

It'd be good, as a sanity check, to have clade counts output after every ingest run.

This command would be enough:

zcat metadata.tsv.gz | tsv-summarize -H --group-by Nextstrain_clade --count

Yielding something like:

Nextstrain_clade        count
20I (Alpha, V1) 1160560
21A (Delta)     109534
20G     125069
21F (Iota)      43376
21I (Delta)     256077
20A     310770
21J (Delta)     2966243
19B     16740
21D (Eta)       7553
20C     123191
20B     261388
20H (Beta, V2)  42294
20D     13044
21E (Theta)     638
21B (Kappa)     8114
20J (Gamma, V3) 116987
20E (EU1)       175137
21G (Lambda)    9245
21C (Epsilon)   64377
19A     24359
21H (Mu)        14275
20F     13766
21K (Omicron)   1450
21L     7
        2190

Improve error handling in `check-annotations`

@lmoncla and I recently ran into an error raised by check-annotations with the following output:

The following strains have a division_exposure but not a country_exposure.
This is ok if the country_exposure happens to match the country, but problems arise if this is not the case.
To make this clear, country_exposure should be explicitly set for the following:
strain
France/CIBU200107/2020    NaN
Name: strain, dtype: object
strains have a division_exposure but not a country_exposure

France/CIBU200107/2020 was correctly annotated for sampling and exposure locations. It had division and division_exposure set, but neither country nor country_exposure annotated. However, in the metadata, the country and country_exposure were France.

This issue was resolved by removing the division_exposure annotation for France/CIBU200107/2020. If we don't want to allow a division and division_exposure annotation without a country or country_exposure annotation, then the error message sent to the user should explain that the division_exposure annotation can be dropped.

Alert Certain Users to New Sequences from GISAID

Context
At least two Nextstrain members currently maintain custom builds (Emma maintains USA South-Central/Texas/Louisiana builds, Richard maintains ECDC Europe builds, and both maintain a Swiss build). With the decreasing frequency of ncov builds, it's become a little harder to notice when new sequences from a country/division/region of interest come in.

It would be cool if there were a way to ping one or more users on Slack about the arrival of new sequences from an area of interest. For example, Emma would find it useful to know whenever we get new sequences from Texas or Louisiana so she can re-run the custom local builds there. Both Emma & Richard would probably find it useful to know when new Swiss sequences appear.

Description
It would be neat if somewhere in 'ingest' it could check against a list of areas of interest and alert the appropriate people when new sequences from these areas pop up. Could be either just tags in a message in #ncov-gisaid-updates channel, or could be private messages - whatever would work easier I guess!

Perhaps somewhere in the 'ingest' script there could be a text/JSON/etc file where one can put their slack username and the area they'd like to be alerted to - then any Nextstrain member could easily add or remove alerts as they need/want! For example:

username    level     area-name
emmahodcroft    division    Texas
emmahodcroft    division    Louisiana
rneher    country    Switzerland
emmahodcroft    country    Switzerland
...

Compress files stored on S3 with gzip

This will dramatically improve transfer times.

Files can be piped through gzip on their way to aws s3 cp … and piped through gunzip on their way back down.

Object keys should be given a .gz extension, which will need to be reflected in an update to nextstrain/ncov/Snakefile.
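For example, roughly (bucket and key names illustrative):

# Upload: compress on the fly
gzip -c data/gisaid/metadata.tsv | aws s3 cp - s3://nextstrain-ncov-private/metadata.tsv.gz

# Download: decompress on the fly
aws s3 cp s3://nextstrain-ncov-private/metadata.tsv.gz - | gunzip -c > data/gisaid/metadata.tsv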

standardize characters and casing in location information during ingest

Currently, when new data is ingested, manual adjustments are frequently required for division and location fields to fix mundane, recurring issues. For example, division and location fields frequently include an underscore to separate words rather than a space. Other times, these fields are spelled correctly, but with incorrect casing (all uppercase rather than title case, for example). An example of this is sequences from Valencia, Spain, which are almost always ingested with the division Comunitat_Valenciana. Currently, these require manually adding each strain to annotations.tsv to set the division to Comunitat Valenciana.

A previous example was that we received sequences from the USA, which were ingested as region NORTH AMERICA. These had to be manually altered to region North America.

It would be amazing if all location fields (region, country, division, and location) could be automatically processed to:

  1. replace underscores between words with spaces
  2. convert to title case

These two checks/fixes (sketched below) would substantially reduce the amount of manual annotation needed for some batches of sequences in which this is common. This had been reasonably manageable until yesterday, when we received ~400 sequences from Spain, most of which required manual underscore removal across multiple fields.
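A minimal sketch of the two fixes; a real implementation would need an exception list for values that should not be title-cased (e.g. 'USA'):

def normalize_location_field(value: str) -> str:
    """Replace underscores with spaces and convert to title case.

    e.g. 'Comunitat_Valenciana' -> 'Comunitat Valenciana'
         'NORTH AMERICA'        -> 'North America'
    """
    return value.replace("_", " ").strip().title()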

Add too-short sequences to new exclude.txt

This depends on changes in #43 that introduce the new, automatically generated exclude.txt file.

Instead of silently dropping strains with insufficiently long sequences, add them to the exclude.txt with a comment describing why they were dropped. Downstream, augur will filter out these sequences as part of the ncov build.

Store line count in object metadata to avoid network transfer just to count lines

notify-on-record-change compares the line counts of a local file and a remote file. It counts lines in the remote file by downloading and decompressing it in its entirety.

src_record_count="$(wc -l < "$src")"
dst_record_count="$(wc -l < <(aws s3 cp --no-progress "$dst" - | xz -T0 -dcfq))"
added_records="$(( src_record_count - dst_record_count ))"

This presumably takes some non-trivial amount of network transfer time! Instead of doing that unnecessary work, we could store line counts in the object metadata, like we already do for sha256sum, and then very quickly be able to get a remote file line count.
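A hedged sketch of the idea using user-defined S3 object metadata (the metadata key and object names are illustrative):

# At upload time, record the uncompressed line count as object metadata:
record_count="$(wc -l < data/gisaid/metadata.tsv)"
xz -T0 -c data/gisaid/metadata.tsv \
    | aws s3 cp --metadata "recordcount=$record_count" - s3://nextstrain-ncov-private/metadata.tsv.xz

# Later, read the count back without transferring the object:
aws s3api head-object \
    --bucket nextstrain-ncov-private --key metadata.tsv.xz \
    --query 'Metadata.recordcount' --output text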

Additional metadata & annotations validations

This issue is meant to track ideas for additional metadata and/or annotation validations that aren't yet implemented in ./bin/transform or ./bin/check-annotations.

  • Ensure casing (i.e. capitalization) is consistent internal to annotations.tsv. See this comment for more context.

15k duplicate sequence names in ncov/open/aligned.fasta.xz

Current Behavior
The current https://data.nextstrain.org/files/ncov/open/aligned.fasta.xz contains ~15k duplicated sequences.

I haven't checked if the sequences themselves are identical, but it seems that GenBank only has one record for a name I spot checked (USA/DE-CDC-LC0461229/2022). The metadata.tsv.gz does not contain duplicate strain names.

See https://discussion.nextstrain.org/t/error-in-augur-tree-duplicated-sequence-name/970/4 for the full context that led to this discovery.

@jameshadfield @ivan-aksamentov ISTR recent changes introduced incremental alignments? Is this possibly related? I can imagine how naive concatenation could result in this.

Expected behavior
None of our sequence outputs (unaligned, aligned, etc) contain duplicated names and/or sequences.

How to reproduce

$ wget https://data.nextstrain.org/files/ncov/open/aligned.fasta.xz
$ xzcat -T4 < aligned.fasta.xz | grep '^>' | sort --parallel 4 | uniq -D > dups

dups (see output) will contain 30,926 lines, representing 15,463 names which appear twice in aligned.fasta.xz.

./bin/ingest sometimes fails on HTTP/2 request

./bin/ingest sometimes fails with the following log:

Uploading data to Slack with the message: metadata-changes.txt
curl: (16) Error in the HTTP2 framing layer
HTTP/2 200 
date: Fri, 15 May 2020 22:09:01 GMT
server: Apache
x-slack-req-id: d26004d83a56ca0cfb8116544529784d
x-oauth-scopes: incoming-webhook,files:write,chat:write
x-accepted-oauth-scopes: files:write
access-control-expose-headers: x-slack-req-id, retry-after
x-slack-backend: h
x-content-type-options: nosniff
expires: Mon, 26 Jul 1997 05:00:00 GMT
cache-control: private, no-cache, no-store, must-revalidate
x-xss-protection: 0
vary: Accept-Encoding
pragma: no-cache
access-control-allow-headers: slack-route, x-slack-version-ts, x-b3-traceid, x-b3-spanid, x-b3-parentspanid, x-b3-sampled, x-b3-flags
strict-transport-security: max-age=31536000; includeSubDomains; preload
referrer-policy: no-referrer
access-control-allow-origin: *
content-type: application/json; charset=utf-8
x-via: haproxy-www-kmep

From @tsibley:

Slack or curl's HTTP/2 support has a bug, probably a stochastic one. HTTP/2 is the next gen HTTP version which is not very widely deployed yet; it's a much more complex and unfriendly protocol than HTTP/1.1.

We could avoid the issue by forcing curl to use HTTP/1.1, instead of negotiating with the server, by running it with --http1.1.

compressed data error: bad block header magic after download

Current Behavior
After running bin/fetch-from-gisaid I get an error message: lbzip2: "data/gisaid.ndjson.new.bz2": compressed data error: bad block header magic (full log below). This happens after the download reported "(OK): download completed". At the point of the error message, the files had been downloaded and (partially) unpacked:

-rw-r--r-- 1 pvh editors 260053049344 Feb 25 10:22 gisaid.ndjson.new
-rw-r--r-- 1 pvh editors  21112070822 Feb 25 07:24 gisaid.ndjson.new.bz2

Expected behavior
I expected either:

  1. the download completes and unpacks successfully or
  2. the download fails with an error (if the downloaded data is indeed corrupt)

How to reproduce
Steps to reproduce the current behavior:

This is the script that I use to fetch the data:

#!/bin/bash

#SBATCH -c 2
#SBATCH --mem=10G

set -e

cd /usr/people/pvh/ncov-ingest 
source $HOME/miniconda3/bin/activate 
GISAID_API_ENDPOINT=ENDPOINT_URL
GISAID_USERNAME_AND_PASSWORD=XXXX:YYYY 
export GISAID_API_ENDPOINT GISAID_USERNAME_AND_PASSWORD
# this doesn't need pipenv anymore
bin/fetch-from-gisaid data/gisaid.ndjson.new 
pipenv run bin/transform-gisaid data/gisaid.ndjson.new
  1. Run this script
  2. Wait
  3. See error


Your environment:

  • Operating system: Ubuntu 20.04


Complete log:


02/25 05:23:54 [NOTICE] Downloading 1 item(s)
[#a2d0d9 0B/0B CN:1 DL:0B]
[#a2d0d9 19GiB/19GiB(98%) CN:8 DL:2.3MiB ETA:2m49s]
[#a2d0d9 19GiB/19GiB(98%) CN:8 DL:17MiB ETA:21s]
[#a2d0d9 19GiB/19GiB(98%) CN:8 DL:26MiB ETA:12s]
[#a2d0d9 19GiB/19GiB(98%) CN:8 DL:29MiB ETA:9s]
[#a2d0d9 19GiB/19GiB(98%) CN:8 DL:24MiB ETA:11s]
[#a2d0d9 19GiB/19GiB(98%) CN:8 DL:17MiB ETA:15s]
[#a2d0d9 19GiB/19GiB(98%) CN:8 DL:19MiB ETA:11s]
[#a2d0d9 19GiB/19GiB(99%) CN:8 DL:21MiB ETA:8s]
[#a2d0d9 19GiB/19GiB(99%) CN:8 DL:25MiB ETA:5s]
[#a2d0d9 19GiB/19GiB(99%) CN:8 DL:23MiB ETA:5s]
[#a2d0d9 19GiB/19GiB(99%) CN:8 DL:21MiB ETA:5s]
[#a2d0d9 19GiB/19GiB(99%) CN:8 DL:25MiB ETA:2s]
[#a2d0d9 19GiB/19GiB(99%) CN:8 DL:26MiB]
[#a2d0d9 19GiB/19GiB(100%) CN:0]

02/25 05:24:14 [NOTICE] Download complete: /usr/people/pvh/ncov-ingest/data/gisaid.ndjson.new.bz2

Download Results:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
a2d0d9|OK  |    24MiB/s|/usr/people/pvh/ncov-ingest/data/gisaid.ndjson.new.bz2

Status Legend:
(OK):download completed.
lbzip2: "data/gisaid.ndjson.new.bz2": compressed data error: bad block header magic

transform-gisaid produces output TSVs with CRLF line endings

Current Behavior
The latest transform-gisaid script, introduced by #90, has a bug that causes all output TSVs (metadata and additional info) to be written with Windows (CRLF) line endings on all operating systems (Mac, Linux) by default, instead of following the user's OS convention. Right now, a temporary workaround is implemented in the ingest-GISAID workflow: it always invokes the --output-unix-newline option when running transform-gisaid.

Expected behavior
If the user invokes transform-gisaid with the --output-unix-newline option, then all line endings are in Unix style (works as expected).
Otherwise, the script should output newlines native to the user's operating system (this does not work as expected; it produces CRLF line endings instead).

How to reproduce
Steps to reproduce the current behavior:

  1. Run transform-gisaid without the --output-unix-newline option.
  2. Run file metadata.tsv or file additional_info.tsv. The output is
UTF-8 Unicode text, with very long lines, with CRLF line terminators

instead of the expected

UTF-8 Unicode text, with very long lines

Remove "REDUNDANT ANNOTATED METADATA" messages from technical logs

[batch] REDUNDANT ANNOTATED METADATA : LR757995 division Hubei
[batch] REDUNDANT ANNOTATED METADATA : LR757995 location Wuhan
[batch] REDUNDANT ANNOTATED METADATA : LR824035 country Switzerland
[batch] REDUNDANT ANNOTATED METADATA : LR824035 region Europe
[batch] REDUNDANT ANNOTATED METADATA : MZ918713.1 gisaid_epi_isl EPI_ISL_2423684

It seems that this is domain-specific info that should be read by ingest build shepherds. But right now these messages only clog the logs on GitHub Actions and AWS Batch, making diagnostics harder.

Attached is 9 megs (!) of these messages from genbank ingest logs.

I propose we remove those or redirect them to a file, where shepherds can find them.

Thread on Slack: https://bedfordlab.slack.com/archives/CTZKJC7PZ/p1639149188358600

Attachments:
REDUNDANT ANNOTATED METADATA.txt

Rewrite transform-gisaid to perform a streaming transform

My system's out-of-memory killer is stopping my local run of ./bin/transform-gisaid. Because we're using Pandas, we load all of the GISAID data into memory. Consider rewriting the script to use a streaming transform instead.
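A hedged sketch of the streaming shape only; the real transform applies many more cleaning rules per record, and the field names used here (e.g. covv_virus_name, sequence) are assumptions about the raw NDJSON:

import csv
import json

def stream_transform(ndjson_path: str, metadata_path: str, fasta_path: str) -> None:
    """Process one GISAID record at a time instead of loading everything into Pandas."""
    columns = ["strain", "gisaid_epi_isl", "date"]  # illustrative subset of the metadata columns
    with open(ndjson_path) as ndjson, \
         open(metadata_path, "w", newline="") as metadata, \
         open(fasta_path, "w") as fasta:
        writer = csv.DictWriter(metadata, fieldnames=columns, delimiter="\t")
        writer.writeheader()
        for line in ndjson:
            record = json.loads(line)
            row = {
                "strain": record.get("covv_virus_name", ""),
                "gisaid_epi_isl": record.get("covv_accession_id", ""),
                "date": record.get("covv_collection_date", ""),
            }
            writer.writerow(row)
            fasta.write(f">{row['strain']}\n{record.get('sequence', '')}\n")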

Profile ncov-ingest

Currently, GISAID ingest takes ~4 hours. We should profile the pipeline to figure out if improvements can be made without a major overhaul (i.e. incremental ingest with caches or a database).

  • Once the pipeline has been converted to Snakemake via #231, we can use Snakemake's benchmark, --stats, and/or --report to get an overview of the pipeline. We should upload the outputs to S3 to have a record of changes over time.
  • @tsibley suggested using Python profilers such as py-spy to inspect specific scripts that are pain points in the pipeline.

Move ingest pipeline to Snakemake workflow and include alignment

Currently the ncov-ingest pipeline is a collection of GitHub Actions that run on GitHub's servers. We'd have more control over these and the ability to do heavier weight computation (like alignment) if we move to a generic Snakemake workflow that can be run on AWS (via nextstrain build --aws-batch) or on a cluster.

At this point, the alignment is uploaded from the ncov pipeline here: https://github.com/nextstrain/ncov/blob/master/workflow/snakemake_rules/export_for_nextstrain.smk#L283. However, it would make sense to provision entrypoint files like metadata.tsv, sequences.fasta, and aligned.fasta via ncov-ingest, to keep a clean separation between these two repos.
