
Comments (6)

danielecook commented on June 2, 2024

@dbrami have you seen the case study?

https://github.com/google/deepvariant/blob/r1.6/docs/deepvariant-training-case-study.md

Can you share the full commands as an example if I want to re-train using multiple BAMs as input? (I saw you are using 18 different BAMs to train WGS 1.5.)

The commands we run are very similar to what you see in the case study; we just run it separately for each BAM, then combine the training examples via shuffling. See the make_examples command in the case study, run it for each BAM you intend to train on, and then perform shuffling:

time python3 ${SHUFFLE_SCRIPT_DIR}/shuffle_tfrecords_beam.py \
  --project="${YOUR_PROJECT}" \
  --input_pattern_list="${OUTPUT_BUCKET}"/training_set.with_label.tfrecord-?????-of-00016.gz \
  --output_pattern_prefix="${OUTPUT_BUCKET}/training_set.with_label.shuffled" \
  --output_dataset_name="HG001" \
  --output_dataset_config_pbtxt="${OUTPUT_BUCKET}/training_set.dataset_config.pbtxt" \
  --job_name=shuffle-tfrecords \
  --runner=DataflowRunner \
  --staging_location="${OUTPUT_BUCKET}/staging" \
  --temp_location="${OUTPUT_BUCKET}/tempdir" \
  --save_main_session \
  --region us-east1

Here, you would substitute the input_pattern_list with a glob that captures examples across multiple BAM inputs.
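For example (the file names here are purely illustrative), if each BAM's examples were written with a distinct prefix, you could pass either a comma-separated list of patterns or a single glob, assuming the shuffle script accepts both, as the flag name suggests:

--input_pattern_list="${OUTPUT_BUCKET}/HG001.training_set.with_label.tfrecord-?????-of-00016.gz,${OUTPUT_BUCKET}/HG002.training_set.with_label.tfrecord-?????-of-00016.gz"

# Or a single glob matching every BAM's example files:
--input_pattern_list="${OUTPUT_BUCKET}/*.training_set.with_label.tfrecord-?????-of-00016.gz"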

Do all the generated example files need to exist at the same time to perform training? (i.e., is there a way to iteratively sub-sample a large BAM and generate examples that get deleted once they are used?)

This is doable, but will require a lot of work to manage. You could generate training examples, shuffle, warmstart your model, train for a while, then delete the examples and repeat (warmstarting from the model you just trained), as sketched below. This approach has significant downsides, though: you have to manage the files yourself, and it will not be as good as training on the full dataset at once.
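A rough sketch of that loop in shell, with entirely hypothetical helper names and paths (the real make_examples, shuffle, and training invocations are in the case study):

# Hypothetical iterative scheme: each chunk's examples are generated,
# used for one round of training, then deleted before the next chunk.
CHECKPOINT="baseline_model.ckpt"                    # placeholder warmstart source
for chunk in chunk1 chunk2 chunk3; do
  generate_and_shuffle_examples "${chunk}"          # placeholder: make_examples + shuffle
  train_from "${CHECKPOINT}" "examples/${chunk}"    # placeholder: warmstart training
  CHECKPOINT="model_after_${chunk}.ckpt"            # next round warmstarts from here
  rm -r "examples/${chunk}"                         # free disk before the next chunk
done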

Do you have a good rule of thumb for how many examples are needed? (I saw you use over 350M for WGS 1.5.)

If you are training from scratch, a large number of training examples is helpful. However, if you are warmstarting, it is possible to get away with far fewer training examples. The case study highlights an example that uses only 342,758 training examples, and leads to modest SNP improvement and a nice increase in INDEL accuracy in only ~1.5 hours of training.

More data is better, but the quality of the training labels is more important.


dbrami commented on June 2, 2024

Thanks for the added code.
Here are my follow-up questions:

  • What file-naming pattern should I use if I'm processing multiple BAM files?
  • If I'm downsampling the same source BAM multiple times, do I write the loop myself?
  • Is there a seed parameter for the downsampling fraction?

Thank you!


danielecook commented on June 2, 2024

What file-naming pattern should I use if I'm processing multiple BAM files?

This can be whatever you like. When you perform shuffling, you want to specify a glob or list of patterns that will match.
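One workable convention (names are illustrative only) is to embed the sample or BAM name in the --examples path passed to make_examples, so a single glob matches them all at shuffle time:

# Hypothetical convention: one example set per BAM, keyed by sample name.
for bam in HG001.bam HG002.bam HG005.bam; do
  sample="${bam%.bam}"
  # ... make_examples --mode training --reads "${bam}" \
  #       --examples "${OUTPUT_BUCKET}/${sample}.training_set.with_label.tfrecord@16.gz" ...
done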

If I'm downsampling the same source BAM multiple times, do I write the loop myself?

I'm not sure what you mean here. Can you provide more information?

Is there a seed parameter for the downsampling fraction?

I don't think so, but I will double-check.


dbrami commented on June 2, 2024

I guess the naming pattern is related to the second question.
Say I have a 50x-depth BAM file that I want to downsample to 20%: I can squeeze about 5 downsampled BAMs out of the 50x file.
Given that I assume I will run make_examples with "--training" in a loop of, say, 5 iterations, what would the naming scheme look like? Hence the importance of the seed parameter when downsampling the same BAM multiple times within a loop; I would change the seed each time.

I hope this clears up the questions and the motivation behind them.


danielecook commented on June 2, 2024

Is there a seed parameter for the downsampling fraction?

I don't believe this is currently possible.

What you can do, though, is downsample using samtools.

From samtools view you can specify:

  -s FLOAT subsample reads (given INT.FRAC option value, 0.FRAC is the
           fraction of templates/read pairs to keep; INT part sets seed)

So you can subsample each BAM like this:

for i in $(seq 1 5); do
  # The integer part (${i}) sets the seed; .20 keeps 20% of read pairs.
  # -b emits BAM rather than SAM text.
  samtools view -b -s ${i}.20 input.bam > input.${i}.20.bam
done

Then run make_examples + shuffle to combine the results.
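Putting the pieces together, a hedged sketch (make_examples flags are abbreviated; the seed is baked into each output name so the files do not collide):

for i in $(seq 1 5); do
  samtools view -b -s ${i}.20 input.bam > input.${i}.20.bam
  samtools index input.${i}.20.bam
  # ... make_examples --mode training --reads input.${i}.20.bam \
  #       --examples "input.${i}.20.training.tfrecord@16.gz" ... (see the case study)
done
# Then shuffle all five example sets together via a glob, as shown earlier.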

Note that you want your training data to resemble the data you plan to run your model on. Subsampling can help improve performance in lower coverage regions, but if you plan to run this model at the original coverage level you'll want to train on that type of data as well.


pichuan commented on June 2, 2024

Hi @dbrami ,
I'll close this issue now. Feel free to follow up if you have more questions!

