Comments (8)

jesseengel commented on August 16, 2024

Thanks for looking into this!

It seems you're using a GPU with about half the memory of what we've been testing on (V100), so sorry you bumped into this edge case.

I am a little confused why that code snippet works (since we don't use sessions in 2.0), but I assume it's somehow tapping into the same backend. Can you try the TF 2.0 code from https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth and see if it works for you too?

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    # Currently, memory growth needs to be the same across GPUs
    for gpu in gpus:
      tf.config.experimental.set_memory_growth(gpu, True)
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Memory growth must be set before GPUs have been initialized
    print(e)

jesseengel commented on August 16, 2024

Can you add the exact command you're running and any other details (dataset etc.) that might be relevant?

andreykramer commented on August 16, 2024

This is the command being run:

ddsp_run --mode=train --alsologtostderr --model_dir="C:\Users\andrey\Desktop\winDDSP\MODEL" --gin_file="C:/Users/andrey/Desktop/winDDSP/soloinstrument.gin" --gin_file="C:/Users/andrey/Anaconda3/envs/test/lib/site-packages/ddsp/training/gin/datasets/tfrecord.gin" --gin_param="batch_size=16" --gin_param="TFRecordProvider.file_pattern='C:/Users/andrey/Desktop/winDDSP/data/train.tfrecord*'" --gin_param="train_util.train.num_steps=30000" --gin_param="train_util.train.steps_per_save=100" --gin_param="train_util.Trainer.checkpoints_to_keep=10"

And I believe that the dataset at the moment is just a single wav (around 15 seconds) that I prepared with ddsp_prepare_tfrecord. You can find the tfrecord files attached.
data.zip

As I said, the thing that confuses me most is that the same command runs perfectly fine when training on the CPU only. At the same time, judging from running a small TensorFlow toy example, TF and CUDA seem to be configured correctly to work together.
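
For reference, the kind of toy check I mean is roughly this (a minimal sketch, not the exact script I ran; it just confirms TF can see the GPU and run an op on it):

import tensorflow as tf

# An empty list here would mean TF/CUDA are not wired up correctly.
print("GPUs visible to TF:", tf.config.experimental.list_physical_devices('GPU'))

# Log which device a small op lands on; it should be GPU:0.
tf.debugging.set_log_device_placement(True)
a = tf.random.normal([1000, 1000])
b = tf.random.normal([1000, 1000])
print("matmul ran on:", tf.matmul(a, b).device)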

andreykramer commented on August 16, 2024

This problem was also discussed in this TensorFlow issue: tensorflow/tensorflow#24496

Pasting this code inside train_util.py solved the problem.

# TF 1.x-compat way of letting the GPU memory allocation grow on demand.
from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)

What was happening is that the process started filling the GPU memory very quickly, and once it exceeded the available memory the aforementioned error popped up.

andreykramer commented on August 16, 2024

You're welcome :) Yes, that code also does the job for the training. By the way, ddsp_prepare_tfrecord also has the same (or similar) problem. The console output is different, but I can still see that it just allocates the whole GPU memory and then crashes. Where should I put that fix? I've tried putting it everywhere I can think of (prepare_tfrecord.py, prepare_tfrecord_lib.py, spectral_ops.py, core.py) and it doesn't seem to work.

Edit: I was trying to prepare a bigger dataset (970 audio files, 264 MB) when I got this error, and found out it didn't work even on CPU. A small dataset with only one wav is prepared correctly both with GPU and CPU. How can I work around this? Thank you very much.

(base) andrey@andrey-PC:~/Escritorio/voicemodIA/DDSP$ ddsp_prepare_tfrecord --input_audio_filepatterns="/media/andrey/DATOS/Datasets/english/train/voice/male/*" --output_tfrecord_path="data/train.tfrecord" --num_shards=10 --alsologtostderr
2020-02-26 14:33:58.648031: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2020-02-26 14:33:58.649092: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
2020-02-26 14:33:59.577649: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-26 14:33:59.598016: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-26 14:33:59.598330: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 2060 SUPER computeCapability: 7.5
coreClock: 1.68GHz coreCount: 34 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2020-02-26 14:33:59.598359: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-26 14:33:59.598407: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-26 14:33:59.599519: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-26 14:33:59.599689: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-26 14:33:59.600654: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-26 14:33:59.601324: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-26 14:33:59.601352: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-26 14:33:59.601433: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-26 14:33:59.601771: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-26 14:33:59.602051: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-02-26 14:33:59.602304: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-02-26 14:33:59.606149: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3699850000 Hz
2020-02-26 14:33:59.606322: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55b42a7ac960 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-26 14:33:59.606332: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-02-26 14:33:59.670131: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-26 14:33:59.670429: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55b42a79a280 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-02-26 14:33:59.670443: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce RTX 2060 SUPER, Compute Capability 7.5
2020-02-26 14:33:59.670574: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-26 14:33:59.670790: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 2060 SUPER computeCapability: 7.5
coreClock: 1.68GHz coreCount: 34 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2020-02-26 14:33:59.670810: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-26 14:33:59.670818: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-26 14:33:59.670834: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-26 14:33:59.670847: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-26 14:33:59.670858: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-26 14:33:59.670869: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-26 14:33:59.670877: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-26 14:33:59.670913: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-26 14:33:59.671140: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-26 14:33:59.671336: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-02-26 14:33:59.671355: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-26 14:33:59.826876: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-02-26 14:33:59.826903: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      0 
2020-02-26 14:33:59.826908: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0:   N 
2020-02-26 14:33:59.827068: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-26 14:33:59.827327: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-26 14:33:59.827535: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7028 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2060 SUPER, pci bus id: 0000:01:00.0, compute capability: 7.5)
1 Physical GPUs, 1 Logical GPUs
1 Physical GPUs, 1 Logical GPUs
1 Physical GPUs, 1 Logical GPUs
1 Physical GPUs, 1 Logical GPUs
I0226 14:34:00.700125 140064621770560 fn_api_runner_transforms.py:540] ==================== <function annotate_downstream_side_inputs at 0x7f61f93db4d0> ====================
I0226 14:34:00.700693 140064621770560 fn_api_runner_transforms.py:540] ==================== <function fix_side_input_pcoll_coders at 0x7f61f93db5f0> ====================
I0226 14:34:00.700984 140064621770560 fn_api_runner_transforms.py:540] ==================== <function lift_combiners at 0x7f61f93db680> ====================
I0226 14:34:00.701103 140064621770560 fn_api_runner_transforms.py:540] ==================== <function expand_sdf at 0x7f61f93db710> ====================
I0226 14:34:00.701330 140064621770560 fn_api_runner_transforms.py:540] ==================== <function expand_gbk at 0x7f61f93db7a0> ====================
I0226 14:34:00.701719 140064621770560 fn_api_runner_transforms.py:540] ==================== <function sink_flattens at 0x7f61f93db8c0> ====================
I0226 14:34:00.701858 140064621770560 fn_api_runner_transforms.py:540] ==================== <function greedily_fuse at 0x7f61f93db950> ====================
I0226 14:34:00.702906 140064621770560 fn_api_runner_transforms.py:540] ==================== <function read_to_impulse at 0x7f61f93db9e0> ====================
I0226 14:34:00.703006 140064621770560 fn_api_runner_transforms.py:540] ==================== <function impulse_to_input at 0x7f61f93dba70> ====================
I0226 14:34:00.703125 140064621770560 fn_api_runner_transforms.py:540] ==================== <function inject_timer_pcollections at 0x7f61f93dbc20> ====================
I0226 14:34:00.703323 140064621770560 fn_api_runner_transforms.py:540] ==================== <function sort_stages at 0x7f61f93dbcb0> ====================
I0226 14:34:00.703435 140064621770560 fn_api_runner_transforms.py:540] ==================== <function window_pcollection_coders at 0x7f61f93dbd40> ====================
I0226 14:34:00.704764 140064621770560 statecache.py:150] Creating state cache with size 100
I0226 14:34:00.704909 140064621770560 fn_api_runner.py:1797] Created Worker handler <apache_beam.runners.portability.fn_api_runner.EmbeddedWorkerHandler object at 0x7f61f935ac10> for environment urn: "beam:env:embedded_python:v1"

I0226 14:34:00.705106 140064621770560 fn_api_runner.py:822] Running ((((ref_AppliedPTransform_Create/Impulse_3)+(ref_AppliedPTransform_Create/FlatMap(<lambda at core.py:2597>)_4))+(ref_AppliedPTransform_Create/MaybeReshuffle/Reshuffle/AddRandomKeys_7))+(ref_AppliedPTransform_Create/MaybeReshuffle/Reshuffle/ReshufflePerKey/Map(reify_timestamps)_9))+(Create/MaybeReshuffle/Reshuffle/ReshufflePerKey/GroupByKey/Write)
I0226 14:34:00.731557 140064621770560 fn_api_runner.py:822] Running ((((((((((Create/MaybeReshuffle/Reshuffle/ReshufflePerKey/GroupByKey/Read)+(ref_AppliedPTransform_Create/MaybeReshuffle/Reshuffle/ReshufflePerKey/FlatMap(restore_timestamps)_14))+(ref_AppliedPTransform_Create/MaybeReshuffle/Reshuffle/RemoveRandomKeys_15))+(ref_AppliedPTransform_Create/Map(decode)_16))+(ref_AppliedPTransform_Map(_load_audio)_17))+(ref_AppliedPTransform_Map(_add_f0_estimate)_18))+(ref_AppliedPTransform_Map(_add_loudness)_19))+(ref_AppliedPTransform_FlatMap(_split_example)_20))+(ref_AppliedPTransform_Reshuffle/AddRandomKeys_22))+(ref_AppliedPTransform_Reshuffle/ReshufflePerKey/Map(reify_timestamps)_24))+(Reshuffle/ReshufflePerKey/GroupByKey/Write)
I0226 14:34:00.753737 140058657015552 prepare_tfrecord_lib.py:43] Loading '/media/andrey/DATOS/Datasets/english/train/voice/male/V001_0001595577.wav'.
2020-02-26 14:34:01.440541: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-26 14:34:01.586315: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
/home/andrey/anaconda3/lib/python3.7/site-packages/librosa/core/time_frequency.py:1208: RuntimeWarning: divide by zero encountered in log10
  - 0.5 * np.log10(f_sq + const[3]))
I0226 14:34:04.901932 140058657015552 prepare_tfrecord_lib.py:43] Loading '/media/andrey/DATOS/Datasets/english/train/voice/male/V001_0001866840.wav'.
Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 883, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 667, in apache_beam.runners.common.PerWindowInvoker.invoke_process
  File "apache_beam/runners/common.py", line 748, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/transforms/core.py", line 1435, in <lambda>
    wrapper = lambda x, *args, **kwargs: [fn(x, *args, **kwargs)]
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/ddsp/training/data_preparation/prepare_tfrecord_lib.py", line 69, in _add_f0_estimate
    f0_hz, f0_confidence = compute_f0(audio, sample_rate, frame_rate)
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/ddsp/spectral_ops.py", line 276, in compute_f0
    assert n_padding % 1 == 0
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/andrey/anaconda3/bin/ddsp_prepare_tfrecord", line 10, in <module>
    sys.exit(console_entry_point())
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/ddsp/training/data_preparation/prepare_tfrecord.py", line 105, in console_entry_point
    app.run(main)
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/ddsp/training/data_preparation/prepare_tfrecord.py", line 100, in main
    run()
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/ddsp/training/data_preparation/prepare_tfrecord.py", line 95, in run
    pipeline_options=FLAGS.pipeline_options)
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/ddsp/training/data_preparation/prepare_tfrecord_lib.py", line 170, in prepare_tfrecord
    coder=beam.coders.ProtoCoder(tf.train.Example))
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 481, in __exit__
    self.run().wait_until_finish()
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 461, in run
    self._options).run(False)
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 474, in run
    return self.runner.run_pipeline(self, self._options)
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/runners/direct/direct_runner.py", line 182, in run_pipeline
    return runner.run_pipeline(pipeline, options)
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner.py", line 486, in run_pipeline
    default_environment=self._default_environment))
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner.py", line 494, in run_via_runner_api
    return self.run_stages(stage_context, stages)
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner.py", line 583, in run_stages
    stage_context.safe_coders)
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner.py", line 904, in _run_stage
    result, splits = bundle_manager.process_bundle(data_input, data_output)
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner.py", line 2105, in process_bundle
    for result, split_result in executor.map(execute, part_inputs):
  File "/home/andrey/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 598, in result_iterator
    yield fs.pop().result()
  File "/home/andrey/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 435, in result
    return self.__get_result()
  File "/home/andrey/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/utils/thread_pool_executor.py", line 44, in run
    self._future.set_result(self._fn(*self._fn_args, **self._fn_kwargs))
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner.py", line 2102, in execute
    return bundle_manager.process_bundle(part_map, expected_outputs)
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner.py", line 2025, in process_bundle
    result_future = self._worker_handler.control_conn.push(process_bundle_req)
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner.py", line 1358, in push
    response = self.worker.do_instruction(request)
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 352, in do_instruction
    request.instruction_id)
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 386, in process_bundle
    bundle_processor.process_bundle(instruction_id))
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/runners/worker/bundle_processor.py", line 812, in process_bundle
    data.transform_id].process_encoded(data.data)
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/runners/worker/bundle_processor.py", line 205, in process_encoded
    self.output(decoded_value)
  File "apache_beam/runners/worker/operations.py", line 302, in apache_beam.runners.worker.operations.Operation.output
  File "apache_beam/runners/worker/operations.py", line 304, in apache_beam.runners.worker.operations.Operation.output
  File "apache_beam/runners/worker/operations.py", line 178, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 657, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/worker/operations.py", line 658, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/common.py", line 878, in apache_beam.runners.common.DoFnRunner.receive
  File "apache_beam/runners/common.py", line 885, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 941, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam/runners/common.py", line 883, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 497, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "apache_beam/runners/common.py", line 1028, in apache_beam.runners.common._OutputProcessor.process_outputs
  File "apache_beam/runners/worker/operations.py", line 178, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 657, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/worker/operations.py", line 658, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/common.py", line 878, in apache_beam.runners.common.DoFnRunner.receive
  File "apache_beam/runners/common.py", line 885, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 941, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam/runners/common.py", line 883, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 497, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "apache_beam/runners/common.py", line 1028, in apache_beam.runners.common._OutputProcessor.process_outputs
  File "apache_beam/runners/worker/operations.py", line 178, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 657, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/worker/operations.py", line 658, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/common.py", line 878, in apache_beam.runners.common.DoFnRunner.receive
  File "apache_beam/runners/common.py", line 885, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 941, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam/runners/common.py", line 883, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 497, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "apache_beam/runners/common.py", line 1028, in apache_beam.runners.common._OutputProcessor.process_outputs
  File "apache_beam/runners/worker/operations.py", line 178, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 657, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/worker/operations.py", line 658, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/common.py", line 878, in apache_beam.runners.common.DoFnRunner.receive
  File "apache_beam/runners/common.py", line 885, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 941, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam/runners/common.py", line 883, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 667, in apache_beam.runners.common.PerWindowInvoker.invoke_process
  File "apache_beam/runners/common.py", line 747, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
  File "apache_beam/runners/common.py", line 1028, in apache_beam.runners.common._OutputProcessor.process_outputs
  File "apache_beam/runners/worker/operations.py", line 178, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 657, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/worker/operations.py", line 658, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/common.py", line 878, in apache_beam.runners.common.DoFnRunner.receive
  File "apache_beam/runners/common.py", line 885, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 956, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/future/utils/__init__.py", line 421, in raise_with_traceback
    raise exc.with_traceback(traceback)
  File "apache_beam/runners/common.py", line 883, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 667, in apache_beam.runners.common.PerWindowInvoker.invoke_process
  File "apache_beam/runners/common.py", line 748, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/transforms/core.py", line 1435, in <lambda>
    wrapper = lambda x, *args, **kwargs: [fn(x, *args, **kwargs)]
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/ddsp/training/data_preparation/prepare_tfrecord_lib.py", line 69, in _add_f0_estimate
    f0_hz, f0_confidence = compute_f0(audio, sample_rate, frame_rate)
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/ddsp/spectral_ops.py", line 276, in compute_f0
    assert n_padding % 1 == 0
RuntimeError: AssertionError [while running 'Map(_add_f0_estimate)']

jesseengel commented on August 16, 2024

Cool, any interest in adding that to the code? I think it should probably just be a function allow_memory_growth() in train_util.py that gets called from ddsp_run.py when a boolean --allow_memory_growth flag is set (default to false).
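
Roughly what I have in mind (just a sketch; the flag help text and exact placement are up to whoever implements it):

# In train_util.py
import tensorflow as tf

def allow_memory_growth():
  """Let TF grow GPU memory usage on demand instead of reserving it all."""
  for gpu in tf.config.experimental.list_physical_devices('GPU'):
    try:
      tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
      # Memory growth must be set before the GPUs are initialized.
      print(e)

# In ddsp_run.py
from absl import flags
flags.DEFINE_boolean('allow_memory_growth', False,
                     'Whether to grow GPU memory usage on demand.')

# ...and early in main(), before any other TF calls:
#   if FLAGS.allow_memory_growth:
#     train_util.allow_memory_growth()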

The dataset creation seems to be a different issue, as it's being caught by this assert:

  File "apache_beam/runners/common.py", line 883, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 667, in apache_beam.runners.common.PerWindowInvoker.invoke_process
  File "apache_beam/runners/common.py", line 748, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/apache_beam/transforms/core.py", line 1435, in <lambda>
    wrapper = lambda x, *args, **kwargs: [fn(x, *args, **kwargs)]
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/ddsp/training/data_preparation/prepare_tfrecord_lib.py", line 69, in _add_f0_estimate
    f0_hz, f0_confidence = compute_f0(audio, sample_rate, frame_rate)
  File "/home/andrey/anaconda3/lib/python3.7/site-packages/ddsp/spectral_ops.py", line 276, in compute_f0
    assert n_padding % 1 == 0
AssertionError

Would you like to create a different issue for that?

andreykramer commented on August 16, 2024

I found out that the problem was with a specific .wav file and not with the size of the dataset. It would be interesting to find out why the code crashes on it, so I will open a new issue later. I also created a PR with the fix for this issue in the way you suggested, so I'm closing it.

Thank you for your responsiveness!

erl-j commented on August 16, 2024

I got a similar issue while training on a T4:

failed to initialize batched cufft plan with customized allocator: Failed to make cuFFT batched plan. Fatal Python error: Aborted

The code suggested by jesseengel (#29 (comment)) fixed the issue.
