
Comments (6)

danielecook commented on June 2, 2024

@dbrami have you seen the case study?

https://github.com/google/deepvariant/blob/r1.6/docs/deepvariant-training-case-study.md

Can you share the full commands as an example if I want to re-train using multiple BAMs as input? (I saw you are using 18 different BAMs to train WGS 1.5.)

The commands we run are very similar to what you see in the case study; we just run it separately for each BAM, then combine the training examples via shuffling. See the make_examples command in the case study, run it for each BAM you intend to train on, and then perform shuffling:

time python3 ${SHUFFLE_SCRIPT_DIR}/shuffle_tfrecords_beam.py \
  --project="${YOUR_PROJECT}" \
  --input_pattern_list="${OUTPUT_BUCKET}"/training_set.with_label.tfrecord-?????-of-00016.gz \
  --output_pattern_prefix="${OUTPUT_BUCKET}/training_set.with_label.shuffled" \
  --output_dataset_name="HG001" \
  --output_dataset_config_pbtxt="${OUTPUT_BUCKET}/training_set.dataset_config.pbtxt" \
  --job_name=shuffle-tfrecords \
  --runner=DataflowRunner \
  --staging_location="${OUTPUT_BUCKET}/staging" \
  --temp_location="${OUTPUT_BUCKET}/tempdir" \
  --save_main_session \
  --region us-east1

Here, you would substitute the input_pattern_list with a glob that captures examples across multiple BAM inputs.
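For example (the file names here are purely illustrative), if each BAM's examples were written with a distinct prefix, you could pass either a comma-separated list of patterns or a single glob, assuming the shuffle script accepts both, as the flag name suggests:

--input_pattern_list="${OUTPUT_BUCKET}/HG001.training_set.with_label.tfrecord-?????-of-00016.gz,${OUTPUT_BUCKET}/HG002.training_set.with_label.tfrecord-?????-of-00016.gz"

# Or a single glob matching every BAM's example files:
--input_pattern_list="${OUTPUT_BUCKET}/*.training_set.with_label.tfrecord-?????-of-00016.gz"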

Do all the generated example files need to exist at the same time to perform training? (i.e., is there a way to iteratively sub-sample a large BAM and generate examples that get deleted once they are used?)

This is doable, but will require a lot of work to manage. You could generate training examples, shuffle, warmstart your model, train for a while, then delete the examples and repeat (warmstarting from the model you just trained), as sketched below. This approach has significant downsides, though: you have to manage the files yourself, and it will not be as good as training on the full dataset at once.
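A rough sketch of that loop in shell, with entirely hypothetical helper names and paths (the real make_examples, shuffle, and training invocations are in the case study):

# Hypothetical iterative scheme: each chunk's examples are generated,
# used for one round of training, then deleted before the next chunk.
CHECKPOINT="baseline_model.ckpt"                    # placeholder warmstart source
for chunk in chunk1 chunk2 chunk3; do
  generate_and_shuffle_examples "${chunk}"          # placeholder: make_examples + shuffle
  train_from "${CHECKPOINT}" "examples/${chunk}"    # placeholder: warmstart training
  CHECKPOINT="model_after_${chunk}.ckpt"            # next round warmstarts from here
  rm -r "examples/${chunk}"                         # free disk before the next chunk
done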

Do you have a good rule of thumb for how many examples are needed? (I saw you use over 350M for WGS 1.5.)

If you are training from scratch, a large number of training examples is helpful. However, if you are warmstarting, it is possible to get away with far fewer training examples. The case study highlights an example that uses only 342,758 training examples, and leads to modest SNP improvement and a nice increase in INDEL accuracy in only ~1.5 hours of training.

More data is better, but the quality of the training labels is more important.


dbrami commented on June 2, 2024

Thanks for the added code.
Here are my follow-up questions:

  • What file-naming pattern should I use if I'm processing multiple BAM files?
  • If I'm downsampling the same source BAM multiple times, do I write the loop myself?
  • Is there a seed parameter for the downsampling fraction?

Thank you!


danielecook commented on June 2, 2024

What file-naming pattern should I use if I'm processing multiple BAM files?

This can be whatever you like. When you perform shuffling, you want to specify a glob or list of patterns that will match.
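One workable convention (names are illustrative only) is to embed the sample or BAM name in the --examples path passed to make_examples, so a single glob matches them all at shuffle time:

# Hypothetical convention: one example set per BAM, keyed by sample name.
for bam in HG001.bam HG002.bam HG005.bam; do
  sample="${bam%.bam}"
  # ... make_examples --mode training --reads "${bam}" \
  #       --examples "${OUTPUT_BUCKET}/${sample}.training_set.with_label.tfrecord@16.gz" ...
done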

If I'm downsampling the same source BAM multiple times, do I write the loop myself?

I'm not sure what you mean here. Can you provide more information?

Is there a seed parameter for the downsampling fraction?

I don't think so, but I will double-check.


dbrami commented on June 2, 2024

I guess the naming pattern is related to the second question.
Say I have a 50x-depth BAM file that I want to downsample to 20%: I can squeeze about 5 downsampled BAMs out of the 50x file.
Given that I assume I will run make_examples with "--training" in a loop of, say, 5 iterations, what would the naming scheme look like? Hence the importance of the seed parameter when downsampling the same BAM multiple times within a loop; I would change the seed each time.

I hope this clears up the questions and the motivation behind them.


danielecook commented on June 2, 2024

Is there a seed parameter for the downsampling fraction?

I don't believe this is currently possible.

What you can do, though, is downsample using samtools.

From samtools view you can specify:

  -s FLOAT subsample reads (given INT.FRAC option value, 0.FRAC is the
           fraction of templates/read pairs to keep; INT part sets seed)

So you can subsample each BAM like this:

for i in $(seq 1 5); do
  # The integer part (${i}) sets the seed; .20 keeps 20% of read pairs.
  # -b emits BAM rather than SAM text.
  samtools view -b -s ${i}.20 input.bam > input.${i}.20.bam
done

Then run make_examples + shuffle to combine the results.
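Putting the pieces together, a hedged sketch (make_examples flags are abbreviated; the seed is baked into each output name so the files do not collide):

for i in $(seq 1 5); do
  samtools view -b -s ${i}.20 input.bam > input.${i}.20.bam
  samtools index input.${i}.20.bam
  # ... make_examples --mode training --reads input.${i}.20.bam \
  #       --examples "input.${i}.20.training.tfrecord@16.gz" ... (see the case study)
done
# Then shuffle all five example sets together via a glob, as shown earlier.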

Note that you want your training data to resemble the data you plan to run your model on. Subsampling can help improve performance in lower coverage regions, but if you plan to run this model at the original coverage level you'll want to train on that type of data as well.


pichuan commented on June 2, 2024

Hi @dbrami ,
I'll close this issue now. Feel free to follow up if you have more questions!

