
mpox's Introduction

Nextstrain repository for mpox virus


This repository contains three workflows for the analysis of mpox virus (MPXV) data:

  • ingest/ - Download data from GenBank, clean and curate it, and upload it to S3
  • phylogenetic/ - Filter sequences, align, construct phylogeny and export for visualization
  • nextclade/ - Make Nextclade datasets for nextstrain/nextclade_data

Each folder contains a README.md with more information. The results of running these workflows are publicly visible at nextstrain.org/mpox.

Installation

Follow the standard installation instructions for Nextstrain's suite of software tools.

Quickstart

Run the default phylogenetic workflow via:

cd phylogenetic/
nextstrain build .
nextstrain view .

Documentation

mpox's People

Contributors

babarlelephant, chaoran-chen, corneliusroemer, dependabot[bot], emmahodcroft, huddlej, ivan-aksamentov, j23414, jameshadfield, joverlee521, pre-commit-ci[bot], pvanheus, rneher, theosanderson, trvrb, tsibley, victorlin


mpox's Issues

ingest: notify Slack with metadata diff

Context

It would be helpful for build maintainers to see metadata diffs in Slack along with notifications of the metadata TSV being updated.

Possible solution

  1. Use diff. The output may be a chore to read since it outputs the entire line that has changed. (A minimal notification sketch for this option follows the list.)
  2. Use csv-diff as it was used in ncov-ingest/bin/notify-on-metadata-change. This sends notifications for changes and additions separately. We eventually stopped using it because it ran out of memory due to the large number of SARS-CoV-2 sequences.
  3. Use daff. Outputs diff in a table with a new column marking changes, additions, and deletions. (I personally use daff when comparing tabular files locally and find the output much easier to understand)
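
To make option 1 concrete, here is a minimal sketch in Python. It assumes a Slack incoming webhook URL in a SLACK_WEBHOOK_URL environment variable; the file paths and the truncation limit are hypothetical:

import os
import subprocess

import requests


def notify_metadata_diff(old_tsv, new_tsv):
    # diff exits non-zero when files differ, so don't use check=True.
    result = subprocess.run(
        ["diff", "--unified=0", old_tsv, new_tsv],
        capture_output=True, text=True,
    )
    if not result.stdout:
        return  # no changes, nothing to post
    # Truncate so the message stays within Slack's size limits.
    snippet = result.stdout[:3000]
    response = requests.post(
        os.environ["SLACK_WEBHOOK_URL"],
        json={"text": f"Metadata TSV changed:\n```{snippet}```"},
        timeout=30,
    )
    response.raise_for_status()


if __name__ == "__main__":
    notify_metadata_diff("data/metadata-previous.tsv", "data/metadata.tsv")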

`IndexError: tuple index out of range` in Snakemake file related to Nextalign rule, line 127

Current Behavior

An error is triggered when the monkeypox pipeline is run using Snakemake. The chunk of code that starts at line 127, which executes Nextalign, fails. This part of the code can be run directly on the command line and the output is produced correctly, but this extra step is not ideal.

Expected behavior

Snakemake should execute all the jobs without failing at line 127.

How to reproduce

Steps to reproduce the current behavior:

  1. Install Nextstrain using the Ambient directions
  2. Install the monkeypox Nextstrain pipeline according to instructions
  3. Run the pipeline after installation is complete using the command snakemake -j 1 -p --configfile config/config_hmpxv1.yaml
  4. See error:
Job 8: 
        Aligning sequences to config/reference.fasta
          - filling gaps with N
        
Reason: Missing output files: results/hmpxv1/aligned.fasta

RuleException in rule align in line 127 of /home/lmarcelat/monkeypox/workflow/snakemake_rules/core.smk:
IndexError: tuple index out of range, when formatting the following:

        nextalign run             --jobs {3}             --reference {input.reference}             --genemap {input.genemap}             --max-indel {params.max_indel}             --seed-spacing {params.seed_spacing}             --retry-reverse-complement             --output-fasta -             --output-insertions {output.insertions}             {input.sequences} | seqkit seq -i > {output.alignment}
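
For what it's worth, this IndexError is characteristic of Snakemake's shell-command handling: Snakemake runs Python's str.format() over the command string, so a literal {3} is read as positional argument number 3 and raises "tuple index out of range". A hedged sketch of the likely fix (directive names assumed from the error above) is to use the named {threads} placeholder instead, or to escape literal braces as {{3}}:

rule align:
    # ...inputs, outputs, and params as in core.smk...
    threads: 3
    shell:
        """
        nextalign run \
            --jobs {threads} \
            --reference {input.reference} \
            --genemap {input.genemap} \
            --max-indel {params.max_indel} \
            --seed-spacing {params.seed_spacing} \
            --retry-reverse-complement \
            --output-fasta - \
            --output-insertions {output.insertions} \
            {input.sequences} | seqkit seq -i > {output.alignment}
        """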

My environment: if running Nextstrain locally

Windows operating system running locally using WSL

ingest: thread Slack notifications

Context

We get multiple Slack notifications per ingest run, so it would be cleaner to have these notifications threaded.

Possible Solution

  • See the ncov workflow for an example of how this can be done with PersistentDict; a minimal threading sketch follows this list.

  • Maybe another way we can go about this is edit the global Snakemake config to store the thread_ts? (Just a thought, haven't really tested if this is possible...)
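
Here is a minimal sketch of the threading itself, assuming a bot token and channel ID in hypothetical SLACK_TOKEN and SLACK_CHANNEL environment variables (persisting thread_ts across rules, e.g. via PersistentDict, is the part that still needs deciding):

import os

import requests


def post_to_slack(text, thread_ts=None):
    """Post a message and return its ts, so follow-ups can thread under it."""
    payload = {
        "channel": os.environ["SLACK_CHANNEL"],
        "text": text,
    }
    if thread_ts:
        # When thread_ts is set, Slack attaches the message to that thread.
        payload["thread_ts"] = thread_ts
    response = requests.post(
        "https://slack.com/api/chat.postMessage",
        headers={"Authorization": f"Bearer {os.environ['SLACK_TOKEN']}"},
        json=payload,
        timeout=30,
    )
    data = response.json()
    if not data.get("ok"):
        raise RuntimeError(f"Slack API error: {data.get('error')}")
    return data["ts"]


# The first notification starts the thread; later ones reply to it.
thread = post_to_slack("Ingest run started")
post_to_slack("Metadata uploaded to S3", thread_ts=thread)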

CI is using an incompatible version of the Conda runtime

Currently (as observed in #176), the Conda runtime job instance of pathogen-ci is failing with the following error:

Current augur version: 22.1.0. Minimum required: 22.2.0

Augur version 22.1.0 is coming from this version of the Conda runtime: nextstrain-base 20230717T174555Z.

This used to work without any noticeable changes. Example: when the Augur minimum version was bumped to 22.2.0, Augur version 22.2.0 was available in this CI run. Notably, the version of the Conda runtime is nextstrain-base 20230731T212806Z.

This also seems to be working fine in the ncov repo, where the latest run resolved to nextstrain-base 20230830T164409Z.

My outstanding question is: why is an older version of the Conda runtime being resolved now, and seemingly only in this repo?

The geographic map is not appearing behind the geolocations in the mapbox image on our installed server as it does on the nextstrain.org server at https://nextstrain.org/monkeypox/mpxv?f_host=Homo%20sapiens

Current Behavior

The geographic map is not appearing behind the geolocations in the mapbox image on our installed server as it does on the nextstrain.org server at https://nextstrain.org/monkeypox/mpxv?f_host=Homo%20sapiens

Expected behavior

The geographic map should appear behind the geolocations, as it does on the nextstrain.org server.

How to reproduce

Steps to reproduce the current behavior:

  1. Installed nextstrain/monkeypox today (28 June 2022) using Docker on our Amazon Server (with nextstrain.cli 3.0.5):

nextstrain-cli/bin/nextstrain build --docker . data/sequences.fasta data/metadata.tsv

nextstrain-cli/bin/nextstrain build --docker --cpus 50 . --configfile config/config_mpxv.yaml

nextstrain-cli/bin/nextstrain build --docker --cpus 50 . --configfile config/config_hmpxv1.yaml

Visualize results

nextstrain-cli/bin/nextstrain view auspice/ --allow-remote-access

In Chrome browser:

http://awsgenomep:4000/monkeypox/mpxv?f_host=Homo%20sapiens

Your environment: if running Nextstrain locally

  • Operating system: Amazon Linux 2 AMI
  • Browser: Chrome
  • Version (e.g. auspice 2.7.0):

Additional context

[Screenshot attached: Screen Shot 2022-06-28 at 12.22.24 PM]

Transmission Line Visualization Feature in Monkeypox Pipeline, Similar to the ncov Build

Context

This feature request aims to enhance the understanding of monkeypox epidemiology and transmission patterns.

Description

Currently, in the ncov build of Nextstrain, when running the pipeline locally, the output JSON file includes a feature that displays the transmission lines between sequences from different countries. I would like to propose extending this feature to the monkeypox pipeline as well.

By including the transmission line visualization in the monkeypox pipeline, researchers and public health professionals can gain valuable insights into the spread and transmission dynamics of monkeypox. This visualization will help track the movement of the virus across geographical regions and identify potential sources of outbreaks.

Implementing this feature in the monkeypox pipeline will contribute to a better understanding of the epidemiology of monkeypox and support more effective disease surveillance and control measures.

Thank you for considering this feature request.

ingest: adopt geolocation rules

Context

Use standard geolocation rules to annotate geolocations so that we do not have to make an annotation for the same geolocation edits for multiple records. This would be a similar process to how ncov-ingest uses the gisaid_geoLocationRules.tsv.

Description

Ideally, this would use a centralized geolocation rules TSV (could be within augur/augur/data/) for the most general rules.
Then, there can be a monkeypox-specific TSV within the repo.

Within the ingest pipeline, we can fetch the general rules from augur's master branch and concatenate the local rules.
For the function that loads the geolocation rules, we can make sure that the local monkeypox rules can overwrite the general rules. Then include a transform step that overwrites geolocation fields using the full geolocation rules.
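
A minimal sketch of the override behavior, assuming a simple two-column TSV of raw and annotated region/country/division/location values joined by "/" (the real rule format may differ):

import csv

GEO_FIELDS = ("region", "country", "division", "location")


def load_geolocation_rules(*tsv_paths):
    """Load rules from general to local; later files overwrite earlier ones."""
    rules = {}
    for path in tsv_paths:
        with open(path, newline="") as handle:
            for row in csv.reader(handle, delimiter="\t"):
                if len(row) >= 2 and not row[0].startswith("#"):
                    rules[row[0]] = row[1]
    return rules


def transform_geolocation(record, rules):
    """Overwrite a record's geolocation fields using the combined rules."""
    raw = "/".join(record.get(field, "") for field in GEO_FIELDS)
    if raw in rules:
        record.update(zip(GEO_FIELDS, rules[raw].split("/")))
    return record


# General rules fetched from augur's master branch, then repo-local overrides.
rules = load_geolocation_rules("general_rules.tsv", "monkeypox_rules.tsv")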

Support for GISAID data

For users who want to use GISAID data with this workflow, the following steps work nearly as expected.

These steps assume you have downloaded:

  • all sequences in FASTA format with whitespace replaced by underscore
  • patient metadata
# Download sequences: data/gisaid_pox_2022_06_16_19.fasta
# Download patient metadata: data/gisaid_pox_2022_06_16_19.tsv
# Note: patient metadata lacks submitting/originating lab.

# Parse out metadata from sequence deflines.
augur parse \
  --sequences data/gisaid_pox_2022_06_16_19.fasta \
  --fields strain gisaid_epi_isl date \
  --output-sequences data/sequences.fasta \
  --output-metadata data/sequence_metadata.tsv

# Join sequence metadata with patient metadata.
csvtk --tabs join -f 1 \
  data/sequence_metadata.tsv \
  data/gisaid_pox_2022_06_16_19.tsv > data/metadata.tsv

# TODO: Need a transform for GISAID locations like the one we have for GenBank.

# Run workflow.
# TODO: This step requires users to know that the "wrangling" of metadata renames the "strain" column to "strain_original"
# so they can rename it back to "strain". Correspondingly, the user has to tell the workflow not to use "strain_original"
# as the display strain name.
nextstrain build \
  --docker \
  --image=nextstrain/base:branch-nextalign-v2 \
  --cpus 1 \
  . \
  --configfile config/config_mpxv.yaml \
  --config strain_id_field=strain_original display_strain_field=strain

Note, the biggest issue with the implementation above is that there is no transform command to convert GISAID's location field to the standard Nextstrain geographic columns (region, country, division, and location). This means the default Augur filter logic that groups by country and year prints a warning message that it cannot find a "country" column and only groups by year. In Augur 16.0.0, this missing group-by column will produce an error message, so we should consider implementing the transform for GISAID locations.
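
A minimal sketch of that transform, assuming GISAID's single location field is formatted as "Region / Country / Division / Location" with " / " separators (worth verifying against real exports):

GEO_FIELDS = ("region", "country", "division", "location")


def split_gisaid_location(location):
    """Split a GISAID location string into Nextstrain's geographic columns."""
    parts = [part.strip() for part in location.split("/")]
    # Pad with empty strings so short locations still yield all four columns.
    parts += [""] * (len(GEO_FIELDS) - len(parts))
    return dict(zip(GEO_FIELDS, parts[:len(GEO_FIELDS)]))


print(split_gisaid_location("Europe / Germany / Bavaria"))
# {'region': 'Europe', 'country': 'Germany', 'division': 'Bavaria', 'location': ''}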

Given the commands above, however, I get the following tree from the workflow:

[Tree image attached]

The very long branches also indicate that users will need to manage their own list of strains to exclude, since strain names will not match GenBank accessions.

ingest: Split monolith transform rule

Context

Currently in the ingest pipeline, there is a single transform rule that runs a shell pipeline of multiple Python scripts. This works in the automated ingest pipeline, but may be tedious to debug when developing or when there's an error in the pipeline.

Description

We can split up the single rule into multiple rules by using Snakemake's piped outputs feature. I don't think anyone in the group has used this feature, so we don't know the pitfalls.

README says 'not public'

I see that you just committed to this repo. Your README.md says 'this is not public', yet this GitHub Repo is public. Just wanted to let you know, in case you forgot to set this GitHub Repo to private.

`fix_tree.py` can create invalid tree

The hmpxv1_big build failed yesterday with a validation error from augur export v2

[batch] [2024-01-21T16:43:57-08:00] Validating schema of 'results/hmpxv1_big/nt_muts.json'...
[batch] [2024-01-21T16:43:57-08:00] Validating schema of 'results/hmpxv1_big/aa_muts.json'...
[batch] [2024-01-21T16:43:57-08:00] Validating config file config/hmpxv1_big/auspice_config.json against the JSON schema
[batch] [2024-01-21T16:43:57-08:00] Validating schema of 'config/hmpxv1_big/auspice_config.json'...
[batch] [2024-01-21T16:43:57-08:00] Validating produced JSON
[batch] [2024-01-21T16:43:57-08:00] Validating schema of 'results/hmpxv1_big/raw_tree.json'...
[batch] [2024-01-21T16:43:57-08:00] Validating that the JSON is internally consistent...
[batch] [2024-01-21T16:43:57-08:00] Node OP615261 appears multiple times in the tree.
[batch] [2024-01-21T16:43:57-08:00] ------------------------
[batch] [2024-01-21T16:43:57-08:00] Validation of results/hmpxv1_big/raw_tree.json failed. Please check this in a local instance of `auspice`, as it is not expected to display correctly. 

I searched for OP615261 in the results files and see that it only appears once in the tree_raw.nwk (produced by augur tree) but appears twice in the tree_fixed.nwk (produced by scripts/fix_tree.py). Somehow scripts/fix_tree.py is duplicating the node.
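
A quick diagnostic sketch using Biopython to confirm which tip names are duplicated in the fixed tree (file path taken from the build above):

from collections import Counter

from Bio import Phylo

tree = Phylo.read("results/hmpxv1_big/tree_fixed.nwk", "newick")
# Count terminal (tip) names; anything above 1 will fail augur validation.
counts = Counter(tip.name for tip in tree.get_terminals())
for name, count in counts.items():
    if count > 1:
        print(f"{name} appears {count} times")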

ENH: Don't subsample non-B.1 lineages

Context

Right now we may sample out some non-B.1 sequences because I argued we shouldn't subsample on lineage/country/month/year, only on country/month/year. My mistake.

We should sample by country within B.1 - but not subsample outside B.1. That way we combine the best of both.

Raised by @rambaut

Possible clade misannotations

Thank you very much for this resource.

On my very naive tree, the following sequences cluster with West African sequences:

MPXV_WRAIR7_61__Walter_Reed_267
COP_58
Liberia_1970_184
Ivory_Coast_2012
Sierra_Leone
USA_2003_044
USA_2003_039

added 23/5:
MPXV_TNP_2017_North_Ponan
3030

though they are annotated as CA. In every case I have so far looked into in the literature, a West Africa-clade annotation would seem reasonable to my non-expert eyes. (But I may well be wrong in terms of how you choose to define the clades, etc.)

Thanks again

/usr/bin/bash: line 1: tsv-filter: command not found

Running the monkeypox pipeline, the workflow errors out with "tsv-filter: command not found". I'm not sure which script file or Nextstrain package this command comes from. Am I missing a specific extra Nextstrain package or something that I should have installed?

[phylo] CI workflow DAG includes `update_example_data`

Context

Sometimes running the phylo workflow with the CI configs locally includes the update_example_data rule in the DAG:

$ nextstrain build . --configfile profiles/ci/builds.yaml -n
Building DAG of jobs...
Job stats:
job                            count
---------------------------  -------
align                              1
all                                1
ancestral                          1
clades                             1
colors                             1
combine_samples                    1
copy_example_data                  1
decompress                         1
download                           1
export                             1
filter                             1
final_strain_name                  1
fix_tree                           1
mask                               1
mutation_context                   1
recency                            1
refine                             1
rename_clades                      1
reverse_reverse_complements        1
subsample                          2
traits                             1
translate                          1
tree                               1
update_example_data                1
total                             25
...
Reasons:
    (check individual jobs above for details)
    code has changed since last execution:
        decompress
    input files updated by another job:
        align, all, ancestral, clades, colors, combine_samples, copy_example_data, decompress, export, filter, final_strain_name, fix_tree, mask, mutation_context, recency, refine, rename_clades, reverse_reverse_complements, subsample, traits, translate, tree, update_example_data
    missing output files:
        download
    set of input files has changed since last execution:
        decompress
Some jobs were triggered by provenance information, see 'reason' section in the rule displays above.
If you prefer that only modification time is used to determine whether a job shall be executed, use the command line option '--rerun-triggers mtime' (also see --help).
If you are sure that a change for a certain output file (say, <outfile>) won't change the result (e.g. because you just changed the formatting of a script or environment definition), you can also wipe its metadata to skip such a trigger via 'snakemake --cleanup-metadata <outfile>'. 
Rules with provenance triggered jobs: decompress

This is not an issue in our automated CI runs via GitHub Action because the GH Action workflow does a clean clone of the repo.

Possible solutions

  1. Manually removing the local .snakemake directory clears the Snakemake cache and resolves the issue.
  2. Move the chores.smk file to be conditionally included in the core phylo workflow (see the sketch after this list)
  3. Move the chores.smk file to a separate build-config that extends the workflow with custom_rules (conforms to the pathogen-repo-guide)
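
A sketch of option 2, assuming a hypothetical include_chores config flag; Snakemake supports conditional includes since Snakefiles are Python:

# In the core Snakefile (path to chores.smk assumed): only pull in the
# maintenance rules when explicitly requested, so local and CI DAGs
# never pick up update_example_data.
if config.get("include_chores", False):

    include: "workflow/snakemake_rules/chores.smk"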

Push aligned sequences up to data.nextstrain.org for download availability

Context

We currently provide links to download the curated sequences & metadata, which is great. However, many times one just wants to start with an aligned sequence set (particularly in cases when alignment can be tricky, as with MPX). We generate this as part of our workflow; it would be great to:

  • Add a rule that uploads aligned.fasta to data.nextstrain.org (either after alignment or at the end); a rule sketch follows this list
  • Include a link to this aligned file in the description at the bottom of builds (and in the github repo)
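
A hedged rule sketch for the upload step; the destination mirrors the s3://nextstrain-data/files/workflows/ layout mentioned elsewhere in this repo, but the exact path and rule wiring are assumptions:

rule upload_alignment:
    input:
        alignment="results/aligned.fasta",
    shell:
        """
        nextstrain remote upload \
            s3://nextstrain-data/files/workflows/monkeypox/ \
            {input.alignment}
        """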

Ingest: remove `reverse` column from metadata TSV

(Originally flagged the obsolete reverse column in #207 (comment))

Reverse complement sequences were initially manually flagged by the reverse column added in #79.

Since Nextclade v2.2.0, there's a built-in --retry-reverse-complement option that adds a new column isReverseComplement. This feature was used in the ingest pipeline starting from #89. Then in #94, the ingest/bin/reverse_reversed_sequences.py script was replaced with the built-in Nextclade functionality as well.

In #191, the phylogenetic pipeline switched over from using the reverse column to the is_reverse_complement column output from Nextclade. This seemingly makes the reverse column obsolete. When checking the latest metadata TSV (2023-10-13), the reverse column is completely empty.

From my point of view, we can just remove the reverse column from the metadata.tsv file, but wanted to confirm with other users of the pipeline/metadata.tsv file (cc: @corneliusroemer, @chaoran-chen).
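
The removal itself is a one-liner; a sketch with pandas, assuming the change is made wherever the final metadata TSV is written:

import pandas as pd

metadata = pd.read_csv("metadata.tsv", sep="\t", dtype=str)
# errors="ignore" makes this a no-op once the column is gone for good.
metadata.drop(columns=["reverse"], errors="ignore").to_csv(
    "metadata.tsv", sep="\t", index=False
)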

Ingest currently blocked by `fetch-from-ncbi-virus`

Current Behavior

Because of the behavior described in nextstrain/ingest#18, the ingest pipeline does not include sequences in its fetch from NCBI Virus. This results in all of the records being dropped in the pipeline, and the final outputs to s3://nextstrain-data/files/workflows/monkeypox/ are empty. This was first flagged internally by downstream CZI consumers on Slack.

We don't have insight into the undocumented NCBI Virus API and whether this new behavior is intentional, so the best thing might be to just switch to the NCBI Datasets CLI to fetch data.

Rename repo & builds

Context

General naming recommendations continue to deprecate (and are expected to further deprecate) the use of 'monkeypox'. We likely should replace 'monkeypox' with 'MPXV' (and possibly Mpox in some places). This will require:

ingest: deduplicate sequences using strain names

Context

Once we've completed #32, we can use strain names to deduplicate sequences.
This is necessary in case different groups sequence the same virus or if sequences are generated from different protocols.
(NOTE: This is separate from the versioning in GenBank; we already pull in the latest version of GenBank sequences.)

Description

The duplicate sequences should probably be filtered out in a new script (e.g. ingest/bin/deduplicate-records) or potentially with the augur deduplicate command (see nextstrain/augur#919).

We probably want to keep a file with all sequences in case people want the duplicate sequences for any reason.
The deduplicated files will be the main ones used for LAPIS and/or our monkeypox builds.
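
A minimal sketch of such a script, keeping the first record seen per strain; the field and file names are assumptions:

import csv


def deduplicate_records(metadata_tsv, output_tsv):
    """Write one record per strain name, keeping the first occurrence."""
    seen = set()
    with open(metadata_tsv, newline="") as infile, \
         open(output_tsv, "w", newline="") as outfile:
        reader = csv.DictReader(infile, delimiter="\t")
        writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames,
                                delimiter="\t")
        writer.writeheader()
        for row in reader:
            if row["strain"] in seen:
                continue  # duplicates stay available in the full file
            seen.add(row["strain"])
            writer.writerow(row)


deduplicate_records("data/metadata.tsv", "data/metadata_deduplicated.tsv")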

ingest: include `url` field

Context

See #72 (comment)

Possible solution

GenBank URLs can be constructed as https://www.ncbi.nlm.nih.gov/nuccore/<genbank_accession>.
URLs for arbitrary non-GenBank sequences will have to be added through manual annotations.
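
A sketch of the transform step, grounded in the URL pattern above; the record field names are assumptions:

def add_url(record):
    """Add a url field for GenBank records; manual annotations handle the rest."""
    accession = record.get("genbank_accession")
    if accession:
        record["url"] = f"https://www.ncbi.nlm.nih.gov/nuccore/{accession}"
    return record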

Potentially include year-only sequences

Right now we seem to exclude sequences from the B.1 build that lack a month, i.e. year-only dates such as 2022-XX-XX.

They get filtered out in subsampling as they don't fit neatly into a year/month sampling scheme. We could add a separate "year-only" filter to get them back in.
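
A sketch of selecting the year-only records for such a filter, assuming pandas and the XX-masked date convention shown above:

import pandas as pd

metadata = pd.read_csv("results/metadata.tsv", sep="\t", dtype=str)
# Dates like "2022" or "2022-XX-XX" carry a year but no usable month.
year_only = metadata["date"].str.fullmatch(r"\d{4}(-XX-XX)?", na=False)
metadata[year_only].to_csv("results/metadata_year_only.tsv", sep="\t", index=False)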

Update README

The README contains a bunch of outdated instructions.

ingest: canonicalize strain names

Context

Currently, the ingest pipeline accepts any format for the strain names.
We should canonicalize them to have prettier names for display in Auspice and to have a way to deduplicate sequences.

Description

We need a clear standard format for strain names. If we follow the existing pattern we use for other pathogens (e.g. SARS-CoV-2), this would be <country>/<sample_id>/<year>

Once we've decided on a format, we should add necessary transforms to ingest/bin/transform-strain-names.
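
A minimal sketch of that transform, assuming the <country>/<sample_id>/<year> format and that metadata already carries country and date fields (the sample-id derivation is hypothetical):

import re


def canonicalize_strain(record):
    """Build <country>/<sample_id>/<year>, falling back to the raw name."""
    country = (record.get("country") or "").replace(" ", "")
    year = (record.get("date") or "")[:4]
    # Hypothetical: treat the raw strain name as the sample id, after replacing
    # characters that are awkward in display names.
    sample_id = re.sub(r"[^A-Za-z0-9._-]", "-", record.get("strain", ""))
    if country and year.isdigit() and sample_id:
        return f"{country}/{sample_id}/{year}"
    return record.get("strain", "")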

LAPIS data: cannot reindex on an axis with duplicate labels

Context

When using LAPIS data (data_source: "lapis"), the filter rule exits with the error ValueError: cannot reindex on an axis with duplicate labels. I think augur is unhappy that a year column already exists in the LAPIS data.

Additional Context

I'm using a conda environment rather than the docker image. But the conda environment works flawlessly for Nextstrain data, just not LAPIS. I'm guessing it's because I'm using a newer version of pandas (v1.4.2) since augur is also raising FutureWarning: reindexing with a non-unique Index is deprecated.

Possible Solution

One way to solve this, would be to drop the year column before the filter rule. Adding the following segment to scripts/wrangle_metadata.py fixes the issue for me:

import pandas as pd  # needed for the pd.isna checks below

# Remove the year column, because it will break augur filter
if "year" in metadata.columns:
  new_dates = []
  # Iterate through the 'date' and 'year' columns
  for s_date, s_year in zip(metadata["date"], metadata["year"]):

    # If date is null, we use the year
    if pd.isna(s_date) and not pd.isna(s_year):
      new_dates.append("{}-XX-XX".format(int(s_year)))

    # if date is not null, use it
    elif not pd.isna(s_date):
      new_dates.append(s_date)

    # Otherwise, use none
    else:
      new_dates.append(None)

  metadata["date"] = new_dates
  metadata.drop(columns=["year"], inplace=True)

Steps to Reproduce

Here is the shell command in isolation (after LAPIS download):

augur filter \
  --sequences data/sequences.fasta \
  --metadata results/metadata.tsv \
  --exclude config/exclude_accessions_hmpxv1.txt \
  --output-sequences results/hmpxv1_lapis/filtered.fasta \
  --output-metadata results/hmpxv1_lapis/metadata.tsv \
  --group-by country year \
  --sequences-per-group 1000 \
  --min-date 2017 \
  --min-length 10000 \
  --output-log results/hmpxv1_lapis/filtered.log

Environment

name: nextstrain-mpx
channels:
  - bioconda
  - conda-forge
  - anaconda
  - defaults
dependencies:
  - anaconda::python=3.9.10
  - anaconda::pip=22.0.3
  - conda-forge::pandas=1.4.2
  # Workflow
  - bioconda::snakemake=7.3.6
  # Phylogeny
  - bioconda::iqtree=2.2.0.3
  # Misc
  - bioconda::epiweeks=2.1.4
  - conda-forge::gzip>=1.6
  - pip:
    - nextstrain-augur==16.0.1

# Notes:
# - nextclade and nextalign: v2 must be manually installed and renamed to nextclade2 and nextalign2
#     wget -O $CONDA_PREFIX/bin/nextclade2 https://github.com/nextstrain/nextclade/releases/download/2.0.0-beta.5/nextclade-x86_64-unknown-linux-gnu
#     wget -O $CONDA_PREFIX/bin/nextalign2 https://github.com/nextstrain/nextclade/releases/download/2.0.0-beta.5/nextalign-x86_64-unknown-linux-gnu

Full Traceback

/home/keaton/.conda/envs/nextstrain-mpx/lib/python3.9/site-packages/augur/filter.py:953: FutureWarning: reindexing with a non-unique Index is deprecated and will raise in a future version.
  df_skip = metadata[metadata['year'].isnull()]
Traceback (most recent call last):
  File "/home/keaton/.conda/envs/nextstrain-mpx/lib/python3.9/site-packages/augur/__init__.py", line 81, in run
    return args.__command__.run(args)
  File "/home/keaton/.conda/envs/nextstrain-mpx/lib/python3.9/site-packages/augur/filter.py", line 1424, in run
    group_by_strain, skipped_strains = get_groups_for_subsampling(
  File "/home/keaton/.conda/envs/nextstrain-mpx/lib/python3.9/site-packages/augur/filter.py", line 953, in get_groups_for_subsampling
    df_skip = metadata[metadata['year'].isnull()]
  File "/home/keaton/.conda/envs/nextstrain-mpx/lib/python3.9/site-packages/pandas/core/frame.py", line 3492, in __getitem__
    return self.where(key)
  File "/home/keaton/.conda/envs/nextstrain-mpx/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/keaton/.conda/envs/nextstrain-mpx/lib/python3.9/site-packages/pandas/core/frame.py", line 10955, in where
    return super().where(cond, other, inplace, axis, level, errors, try_cast)
  File "/home/keaton/.conda/envs/nextstrain-mpx/lib/python3.9/site-packages/pandas/core/generic.py", line 9308, in where
    return self._where(cond, other, inplace, axis, level, errors=errors)
  File "/home/keaton/.conda/envs/nextstrain-mpx/lib/python3.9/site-packages/pandas/core/generic.py", line 9075, in _where
    cond = cond.reindex(self._info_axis, axis=self._info_axis_number, copy=False)
  File "/home/keaton/.conda/envs/nextstrain-mpx/lib/python3.9/site-packages/pandas/util/_decorators.py", line 324, in wrapper
    return func(*args, **kwargs)
  File "/home/keaton/.conda/envs/nextstrain-mpx/lib/python3.9/site-packages/pandas/core/frame.py", line 4804, in reindex
    return super().reindex(**kwargs)
  File "/home/keaton/.conda/envs/nextstrain-mpx/lib/python3.9/site-packages/pandas/core/generic.py", line 4966, in reindex
    return self._reindex_axes(
  File "/home/keaton/.conda/envs/nextstrain-mpx/lib/python3.9/site-packages/pandas/core/frame.py", line 4617, in _reindex_axes
    frame = frame._reindex_columns(
  File "/home/keaton/.conda/envs/nextstrain-mpx/lib/python3.9/site-packages/pandas/core/frame.py", line 4662, in _reindex_columns
    return self._reindex_with_indexers(
  File "/home/keaton/.conda/envs/nextstrain-mpx/lib/python3.9/site-packages/pandas/core/generic.py", line 5032, in _reindex_with_indexers
    new_data = new_data.reindex_indexer(
  File "/home/keaton/.conda/envs/nextstrain-mpx/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 679, in reindex_indexer
    self.axes[axis]._validate_can_reindex(indexer)
  File "/home/keaton/.conda/envs/nextstrain-mpx/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 4107, in _validate_can_reindex
    raise ValueError("cannot reindex on an axis with duplicate labels")
ValueError: cannot reindex on an axis with duplicate labels
