
Comments (4)

ivan-aksamentov avatar ivan-aksamentov commented on May 29, 2024 2

We have been thinking about how to simplify making multiple runs with a single command, for example for different flu segments, and perhaps even combining results for different segments and processing them further. But we haven't figured anything out quite yet. Is this your use case?

In your proposal, how would you access the downloaded datasets afterwards? There is no way of knowing what's being downloaded, to where, and how many. And you'd still need to copy-paste nextclade run 8 times. Also, datasets come and go: today it might be 8, tomorrow there's 32.

Note the * to stop accidentally downloading all the datasets.

I did not understand how * should stop downloading all datasets. Usually in paths, * denotes a so-called wildcard, that is, any path under that particular path. Does nextclade currently download all datasets if you omit the *? (If so, it's a bug!) Can you clarify?

There are multiple ha datasets that are downloaded - this is okay in my opinion, but not ideal. In such cases, maybe download flu_h3n2_ha_broad?

How do we pick flu_h3n2_ha_broad among others? And what if it's not a flu/*, but sc2/*? Nextclade features should make sense for all viruses (even the ones that haven't been added yet).


While wildcard downloads are not supported yet, in the meantime there are a couple of other approaches you might consider:

  • If you don't use the dataset files outside of nextclade run, you can avoid a separate dataset download entirely. The nextclade run command accepts a --dataset-name argument, which makes it download the dataset in memory (without writing to disk) and run with it immediately. This way you don't need dataset get calls at all:

    nextclade run --dataset-name="nextstrain/flu/h3n2/pa" --output-dir="results/" my.fasta.gz
  • If you insist on having dataset files on disk (perhaps you use them in your processing after nextclade), you can use a loop to avoid repetition. Here are a few examples in bash (but you can also set up Snakemake or another workflow framework to run multiple things in a loop, according to a set of parameters):

    $ for v in pa mp; do nextclade dataset get --name="nextstrain/flu/h3n2/$v" --output-dir="outputdir/$v"; done
  • Instead of hardcoding dataset names, you can use dataset list with --search argument to find datasets using sub-string match. This is probably the closest thing to the wildcard * syntax you've requested:

    $ nextclade dataset list --only-names --search=flu/h3
    nextstrain/flu/h3n2/ha/CY163680
    nextstrain/flu/h3n2/ha/EPI1857216
    nextstrain/flu/h3n2/na/EPI1857215
    nextstrain/flu/h3n2/pb1
    nextstrain/flu/h3n2/np
    nextstrain/flu/h3n2/ns
    nextstrain/flu/h3n2/mp
    nextstrain/flu/h3n2/pa
    nextstrain/flu/h3n2/pb2

    Then you can feed this list into a loop instead of a hardcoded list:

    for v in $( nextclade dataset list --only-names --search=flu/h3 ); do
      nextclade dataset get --name="$v" --output-dir="outputdir/$v";
    done

    Nothing stops you from plugging your entire processing into this loop - this way you always know what is being downloaded and where:

    for v in $( nextclade dataset list --only-names --search=flu/h3 ); do
      nextclade dataset get --name="$v" --output-dir="outputdir/$v";
      nextclade run --input-dataset="outputdir/$v" --output-dir="results/$v" "my_$v.fasta.gz";
      my_script.py --virus="$v" --nextclade-tsv="results/$v/nextclade.tsv";
    done

    You can use GNU Parallel to run different datasets concurrently (workflows also often have a way to do this automatically):

    function run_one() {
      v=$1
      nextclade dataset get --name="$v" --output-dir="outputdir/$v";
      nextclade run --input-dataset="outputdir/$v" --output-dir="results/$v" "my_$v.fasta.gz";
      my_script.py --virus="$v" --nextclade-tsv="results/$v/nextclade.tsv";
    }
    export -f run_one
    
    parallel --jobs=4 run_one ::: $( nextclade dataset list --only-names --search=flu/h3 )

These approaches allow you to avoid code duplication, and you always know what's being downloaded and where: the exact names and paths. Loops of course complicate things quite a bit; that's a downside. And, as a sanity check, you likely want to verify that you've got all the datasets you expect, to avoid omissions.
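For that sanity check, one option is a small shell helper that fails if any expected dataset name is missing from the list. This is a sketch: `check_datasets` is a hypothetical name, and the dataset names in the comment are taken from the `dataset list` output above.

```shell
# Hypothetical helper: fail if any expected dataset name is missing
# from a newline-separated list (e.g. the output of `dataset list`).
check_datasets() {
  list="$1"; shift
  for expected in "$@"; do
    printf '%s\n' "$list" | grep -qx "$expected" || {
      echo "missing dataset: $expected" >&2
      return 1
    }
  done
}

# In a real pipeline you would pass the live output, e.g.:
#   check_datasets "$(nextclade dataset list --only-names --search=flu/h3)" \
#     nextstrain/flu/h3n2/mp nextstrain/flu/h3n2/pa
```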

If you need more control, you can filter datasets further by piping the list into grep or into a script:

$ nextclade dataset list --only-names --search=flu/ | grep -E '(mp|pa)' | sort
nextstrain/flu/h1n1pdm/mp
nextstrain/flu/h1n1pdm/pa
nextstrain/flu/h3n2/mp
nextstrain/flu/h3n2/pa

$ nextclade dataset list --only-names --search=flu/h3 | my_filter.py

and then feed the resulting list into the loop.
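For instance, the grep-filtered list above plugs straight into the same download loop as before (directory names here are just examples):

```shell
# Sketch: download every dataset whose name matches the grep filter.
# The loop body is the same as in the earlier examples.
for v in $( nextclade dataset list --only-names --search=flu/ | grep -E '(mp|pa)' ); do
  nextclade dataset get --name="$v" --output-dir="outputdir/$v"
done
```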

For even more control, you can also add the --json flag to the dataset list command. This will print your search results in JSON format, which you can then feed to jq or to a script. This way you can also implement your own search/filtering: dump the JSON of all datasets, choose a subset, and then download only that subset. Contrived example:

$ nextclade dataset list --json --search=flu/ | jq -r '.[] | select(.attributes.segment == "pa" and .attributes["reference name"] == "A/NewYork/392/2004") | .path'
nextstrain/flu/h3n2/pa

from nextclade.

rneher avatar rneher commented on May 29, 2024 1

@ammaraziz, what might be useful for you is nextclade sort. We are currently refining some of the matching parameters, but what you can do, for example, is

nextclade sort all_my_rsv_sequences.fasta --output-dir split_by_dataset --output-results-tsv table_with_matches.tsv

This will split your input sequences into files corresponding to datasets (and their prefixes).
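As a sketch of how the split files might then be processed: the output layout assumed below (one sequences.fasta per matched dataset under split_by_dataset/) is an assumption for illustration, so check what your version of `nextclade sort` actually writes before using this.

```shell
# Recover the dataset path from a split file path, assuming sort writes
# e.g. split_by_dataset/nextstrain/rsv/a/sequences.fasta (hypothetical layout).
dataset_from_path() {
  p=${1#split_by_dataset/}
  printf '%s\n' "${p%/sequences.fasta}"
}

# Run nextclade on each split file with its matching dataset:
if [ -d split_by_dataset ]; then
  find split_by_dataset -name sequences.fasta | while read -r f; do
    name=$(dataset_from_path "$f")
    nextclade run --dataset-name="$name" --output-dir="results/$name" "$f"
  done
fi
```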


ammaraziz avatar ammaraziz commented on May 29, 2024

We have been thinking about how to simplify making multiple runs with a single command, for example for different flu segments, and perhaps even combining results for different segments and processing them further. But we haven't figured anything out quite yet. Is this your use case?

Yes it's very close to the use case!

In your proposal, how would you access the downloaded datasets afterwards? There is no way of knowing what's being downloaded, to where, and how many. And you'd still need to copy-paste nextclade run 8 times. Also, datasets come and go: today it might be 8, tomorrow there's 32.

I had not considered this; it puts a big red ! on my request. But as you hinted above, the ability to make multiple runs with a single command falls within this idea.

I did not understand how * should stop downloading all datasets. Usually in paths, * denotes a so-called wildcard, that is, any path under that particular path. Does nextclade currently download all datasets if you omit the *? (If so, it's a bug!) Can you clarify?

To stop one accidentally entering nextstrain/flu/, which would download all datasets for all species of flu (but as you said, this doesn't generalise to other species supported by nextclade).

While wildcard downloads are not supported yet, in the meantime there are a couple of other approaches you might consider:
....

Thank you for the code and the explanation; this achieves my task (or what instigated this feature request). You've perfectly represented what I was trying to do in the code, that is, not hard-coding the flu datasets.

Going back to this:

We have been thinking how to simplify making multiple runs with a single command, for example for different flu segments.

This feature request would be part of this bigger picture of making multiple runs with a single command. Therefore, I'm closing this issue, as in hindsight it should have been a discussion.

Thanks again!


ammaraziz avatar ammaraziz commented on May 29, 2024

Hi Richard,

That's actually the use case which triggered this request.

Thanks again :)

