
dolly's People

Contributors

baiqingl, edurdevic, eltociear, holdenk, matthayes, mike-conover-db, nfx, rmosleydb, rxin, samikalliomaki, srowen, tnixon, xuanyuanking

dolly's Issues

Downloading the model and performing inference only

Is there an easy way to download the model generated by Databricks in the blog post, instead of retraining?
In fact, for a number of reasons, retraining may not be easy for everyone (it requires access to Databricks, or to a Standard_ND96asr_v4 instance).

Also, what is the recommended instance type for inference only, i.e., just calling the generate_response function?

Thank you very much for the hard work.
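
(For reference, a minimal inference-only sketch using the Hugging Face transformers API, assuming a fine-tuned GPT-J-based checkpoint is available locally; the path, prompt, and generation settings are placeholders, and this is not the repo's actual generate_response implementation:)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical local path to a fine-tuned Dolly checkpoint (placeholder).
model_path = "local_dolly_training/dolly__<timestamp>"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Explain what Dolly is in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.92)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))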

Training with 2 Epochs fails

Hi,

I can successfully fine-tune the model for 1 epoch using the latest commits in the repo (all arguments left at their defaults).

However, when I set epochs to 2, it goes all the way to the end but then fails with many "No such file or directory" errors. I was wondering if something is hard-coded for 1 epoch.

Here is the full trace, broken into sections for clarity:

Finishing the second epoch:

***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
{'eval_loss': 1.4334062337875366, 'eval_runtime': 22.4126, 'eval_samples_per_second': 44.618, 'eval_steps_per_second': 1.428, 'epoch': 1.98}
{'loss': 0.7924, 'learning_rate': 1e-05, 'epoch': 1.99}
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
{'eval_loss': 1.4370625019073486, 'eval_runtime': 22.4154, 'eval_samples_per_second': 44.612, 'eval_steps_per_second': 1.428, 'epoch': 1.99}
{'loss': 0.762, 'learning_rate': 1e-05, 'epoch': 2.0}
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
{'eval_loss': 1.445156216621399, 'eval_runtime': 22.4227, 'eval_samples_per_second': 44.598, 'eval_steps_per_second': 1.427, 'epoch': 2.0}


Training completed. Do not forget to share your model on huggingface.co/models =)

Right before the first ERROR:

Loading best model from local_dolly_training/dolly__2023-03-28T07:46:32/checkpoint-1400 (score: 1.3070000410079956).
[2023-03-28 12:06:44,437] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed info: version=0.8.3, git-hash=unknown, git-branch=unknown
[2023-03-28 12:06:44,450] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /home/maziyar/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module fused_adam, skipping build step...
Loading extension module fused_adam...
Time to load fused_adam op: 0.0007417201995849609 seconds
Using /home/maziyar/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0001647472381591797 seconds
Using /home/maziyar/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module fused_adam, skipping build step...
Loading extension module fused_adam...
Time to load fused_adam op: 0.0008955001831054688 seconds
Using /home/maziyar/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module fused_adam, skipping build step...
Loading extension module fused_adam...
Time to load fused_adam op: 0.0008783340454101562 seconds
Using /home/maziyar/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module fused_adam, skipping build step...
Loading extension module fused_adam...
Time to load fused_adam op: 0.0008599758148193359 seconds
[2023-03-28 12:06:44,561] [INFO] [logging.py:93:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
Using /home/maziyar/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00018143653869628906 seconds
Using /home/maziyar/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00016498565673828125 seconds
[2023-03-28 12:06:44,572] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2023-03-28 12:06:44,572] [INFO] [utils.py:55:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
[2023-03-28 12:06:44,573] [INFO] [logging.py:93:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer
[2023-03-28 12:06:44,756] [INFO] [utils.py:829:see_memory_usage] Stage 3 initialize beginning
[2023-03-28 12:06:44,756] [INFO] [utils.py:830:see_memory_usage] MA 22.7 GB         Max_MA 31.48 GB         CA 46.49 GB         Max_CA 46 GB 
[2023-03-28 12:06:44,757] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory:  used = 23.99 GB, percent = 3.2%
[2023-03-28 12:06:44,758] [INFO] [stage3.py:113:__init__] Reduce bucket size 16777216
[2023-03-28 12:06:44,758] [INFO] [stage3.py:114:__init__] Prefetch bucket size 15099494
Using /home/maziyar/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0002541542053222656 seconds
[2023-03-28 12:06:44,893] [INFO] [utils.py:829:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2023-03-28 12:06:44,893] [INFO] [utils.py:830:see_memory_usage] MA 22.7 GB         Max_MA 22.7 GB         CA 46.49 GB         Max_CA 46 GB 
[2023-03-28 12:06:44,894] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory:  used = 23.99 GB, percent = 3.2%
Parameter Offload: Total persistent parameters: 811008 in 114 params
[2023-03-28 12:06:45,040] [INFO] [utils.py:829:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2023-03-28 12:06:45,040] [INFO] [utils.py:830:see_memory_usage] MA 22.7 GB         Max_MA 22.7 GB         CA 46.49 GB         Max_CA 46 GB 
[2023-03-28 12:06:45,040] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory:  used = 23.99 GB, percent = 3.2%
[2023-03-28 12:06:45,177] [INFO] [utils.py:829:see_memory_usage] Before creating fp16 partitions
[2023-03-28 12:06:45,177] [INFO] [utils.py:830:see_memory_usage] MA 22.7 GB         Max_MA 22.7 GB         CA 46.49 GB         Max_CA 46 GB 
[2023-03-28 12:06:45,177] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory:  used = 24.0 GB, percent = 3.2%
[2023-03-28 12:06:47,552] [INFO] [utils.py:829:see_memory_usage] After creating fp16 partitions: 2
[2023-03-28 12:06:47,553] [INFO] [utils.py:830:see_memory_usage] MA 25.52 GB         Max_MA 25.52 GB         CA 28.33 GB         Max_CA 46 GB 
[2023-03-28 12:06:47,553] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory:  used = 23.99 GB, percent = 3.2%
[2023-03-28 12:06:47,689] [INFO] [utils.py:829:see_memory_usage] Before creating fp32 partitions
[2023-03-28 12:06:47,689] [INFO] [utils.py:830:see_memory_usage] MA 25.52 GB         Max_MA 25.52 GB         CA 28.33 GB         Max_CA 28 GB 
[2023-03-28 12:06:47,690] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory:  used = 23.99 GB, percent = 3.2%
[2023-03-28 12:06:47,842] [INFO] [utils.py:829:see_memory_usage] After creating fp32 partitions
[2023-03-28 12:06:47,843] [INFO] [utils.py:830:see_memory_usage] MA 31.16 GB         Max_MA 32.1 GB         CA 35.85 GB         Max_CA 36 GB 
[2023-03-28 12:06:47,843] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory:  used = 24.07 GB, percent = 3.2%
[2023-03-28 12:06:47,979] [INFO] [utils.py:829:see_memory_usage] Before initializing optimizer states
[2023-03-28 12:06:47,980] [INFO] [utils.py:830:see_memory_usage] MA 31.16 GB         Max_MA 31.16 GB         CA 35.85 GB         Max_CA 36 GB 
[2023-03-28 12:06:47,980] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory:  used = 24.07 GB, percent = 3.2%
[2023-03-28 12:06:48,644] [INFO] [utils.py:829:see_memory_usage] After initializing optimizer states
[2023-03-28 12:06:48,644] [INFO] [utils.py:830:see_memory_usage] MA 42.43 GB         Max_MA 46.19 GB         CA 50.88 GB         Max_CA 51 GB 
[2023-03-28 12:06:48,644] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory:  used = 24.0 GB, percent = 3.2%
[2023-03-28 12:06:48,645] [INFO] [stage3.py:376:_setup_for_real_optimizer] optimizer state initialized
Using /home/maziyar/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0003037452697753906 seconds
Using /home/maziyar/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00035309791564941406 seconds
Using /home/maziyar/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00029277801513671875 seconds
[2023-03-28 12:06:48,919] [INFO] [utils.py:829:see_memory_usage] After initializing ZeRO optimizer
[2023-03-28 12:06:48,919] [INFO] [utils.py:830:see_memory_usage] MA 45.28 GB         Max_MA 46.05 GB         CA 65.12 GB         Max_CA 65 GB 
[2023-03-28 12:06:48,919] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory:  used = 24.31 GB, percent = 3.2%
[2023-03-28 12:06:48,920] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw
[2023-03-28 12:06:48,920] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed using configured LR scheduler = WarmupLR
[2023-03-28 12:06:48,920] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.WarmupLR object at 0x7fbdb36c2fa0>
[2023-03-28 12:06:48,920] [INFO] [logging.py:93:log_dist] [Rank 0] step=0, skipped=0, lr=[1e-05], mom=[[0.9, 0.999]]
[2023-03-28 12:06:48,920] [INFO] [config.py:1018:print] DeepSpeedEngine configuration:
[2023-03-28 12:06:48,920] [INFO] [config.py:1022:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2023-03-28 12:06:48,921] [INFO] [config.py:1022:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-03-28 12:06:48,921] [INFO] [config.py:1022:print]   amp_enabled .................. False
[2023-03-28 12:06:48,921] [INFO] [config.py:1022:print]   amp_params ................... False
[2023-03-28 12:06:48,921] [INFO] [config.py:1022:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2023-03-28 12:06:48,921] [INFO] [config.py:1022:print]   bfloat16_enabled ............. True
[2023-03-28 12:06:48,921] [INFO] [config.py:1022:print]   checkpoint_parallel_write_pipeline  False
[2023-03-28 12:06:48,921] [INFO] [config.py:1022:print]   checkpoint_tag_validation_enabled  True
[2023-03-28 12:06:48,921] [INFO] [config.py:1022:print]   checkpoint_tag_validation_fail  False
[2023-03-28 12:06:48,921] [INFO] [config.py:1022:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fbdb82352e0>
[2023-03-28 12:06:48,921] [INFO] [config.py:1022:print]   communication_data_type ...... None
[2023-03-28 12:06:48,921] [INFO] [config.py:1022:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-03-28 12:06:48,921] [INFO] [config.py:1022:print]   curriculum_enabled_legacy .... False
[2023-03-28 12:06:48,921] [INFO] [config.py:1022:print]   curriculum_params_legacy ..... False
[2023-03-28 12:06:48,921] [INFO] [config.py:1022:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-03-28 12:06:48,921] [INFO] [config.py:1022:print]   data_efficiency_enabled ...... False
[2023-03-28 12:06:48,921] [INFO] [config.py:1022:print]   dataloader_drop_last ......... False
[2023-03-28 12:06:48,921] [INFO] [config.py:1022:print]   disable_allgather ............ False
[2023-03-28 12:06:48,921] [INFO] [config.py:1022:print]   dump_state ................... False
[2023-03-28 12:06:48,921] [INFO] [config.py:1022:print]   dynamic_loss_scale_args ...... None
[2023-03-28 12:06:48,921] [INFO] [config.py:1022:print]   eigenvalue_enabled ........... False
[2023-03-28 12:06:48,921] [INFO] [config.py:1022:print]   eigenvalue_gas_boundary_resolution  1
[2023-03-28 12:06:48,921] [INFO] [config.py:1022:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2023-03-28 12:06:48,921] [INFO] [config.py:1022:print]   eigenvalue_layer_num ......... 0
[2023-03-28 12:06:48,921] [INFO] [config.py:1022:print]   eigenvalue_max_iter .......... 100
[2023-03-28 12:06:48,921] [INFO] [config.py:1022:print]   eigenvalue_stability ......... 1e-06
[2023-03-28 12:06:48,921] [INFO] [config.py:1022:print]   eigenvalue_tol ............... 0.01
[2023-03-28 12:06:48,921] [INFO] [config.py:1022:print]   eigenvalue_verbose ........... False
[2023-03-28 12:06:48,921] [INFO] [config.py:1022:print]   elasticity_enabled ........... False
[2023-03-28 12:06:48,921] [INFO] [config.py:1022:print]   flops_profiler_config ........ {
    "enabled": false, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2023-03-28 12:06:48,921] [INFO] [config.py:1022:print]   fp16_auto_cast ............... None
[2023-03-28 12:06:48,921] [INFO] [config.py:1022:print]   fp16_enabled ................. False
[2023-03-28 12:06:48,921] [INFO] [config.py:1022:print]   fp16_master_weights_and_gradients  False
[2023-03-28 12:06:48,921] [INFO] [config.py:1022:print]   global_rank .................. 0
[2023-03-28 12:06:48,921] [INFO] [config.py:1022:print]   grad_accum_dtype ............. None
[2023-03-28 12:06:48,921] [INFO] [config.py:1022:print]   gradient_accumulation_steps .. 1
[2023-03-28 12:06:48,921] [INFO] [config.py:1022:print]   gradient_clipping ............ 1.0
[2023-03-28 12:06:48,921] [INFO] [config.py:1022:print]   gradient_predivide_factor .... 1.0
[2023-03-28 12:06:48,921] [INFO] [config.py:1022:print]   initial_dynamic_scale ........ 1
[2023-03-28 12:06:48,922] [INFO] [config.py:1022:print]   load_universal_checkpoint .... False
[2023-03-28 12:06:48,922] [INFO] [config.py:1022:print]   loss_scale ................... 1.0
[2023-03-28 12:06:48,922] [INFO] [config.py:1022:print]   memory_breakdown ............. False
[2023-03-28 12:06:48,922] [INFO] [config.py:1022:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-03-28 12:06:48,922] [INFO] [config.py:1022:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2023-03-28 12:06:48,922] [INFO] [config.py:1022:print]   optimizer_legacy_fusion ...... False
[2023-03-28 12:06:48,922] [INFO] [config.py:1022:print]   optimizer_name ............... adamw
[2023-03-28 12:06:48,922] [INFO] [config.py:1022:print]   optimizer_params ............. {'lr': 1e-05, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.0}
[2023-03-28 12:06:48,922] [INFO] [config.py:1022:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-03-28 12:06:48,922] [INFO] [config.py:1022:print]   pld_enabled .................. False
[2023-03-28 12:06:48,922] [INFO] [config.py:1022:print]   pld_params ................... False
[2023-03-28 12:06:48,922] [INFO] [config.py:1022:print]   prescale_gradients ........... False
[2023-03-28 12:06:48,922] [INFO] [config.py:1022:print]   scheduler_name ............... WarmupLR
[2023-03-28 12:06:48,922] [INFO] [config.py:1022:print]   scheduler_params ............. {'warmup_min_lr': 0, 'warmup_max_lr': 1e-05, 'warmup_num_steps': 0}
[2023-03-28 12:06:48,922] [INFO] [config.py:1022:print]   sparse_attention ............. None
[2023-03-28 12:06:48,922] [INFO] [config.py:1022:print]   sparse_gradients_enabled ..... False
[2023-03-28 12:06:48,922] [INFO] [config.py:1022:print]   steps_per_print .............. 2000
[2023-03-28 12:06:48,922] [INFO] [config.py:1022:print]   train_batch_size ............. 32
[2023-03-28 12:06:48,922] [INFO] [config.py:1022:print]   train_micro_batch_size_per_gpu  8
[2023-03-28 12:06:48,922] [INFO] [config.py:1022:print]   use_node_local_storage ....... False
[2023-03-28 12:06:48,922] [INFO] [config.py:1022:print]   wall_clock_breakdown ......... False
[2023-03-28 12:06:48,922] [INFO] [config.py:1022:print]   world_size ................... 4
[2023-03-28 12:06:48,922] [INFO] [config.py:1022:print]   zero_allow_untested_optimizer  False
[2023-03-28 12:06:48,922] [INFO] [config.py:1022:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=16777216 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=15099494 param_persistence_threshold=40960 model_persistence_threshold=sys.maxsize max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=True stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False
[2023-03-28 12:06:48,922] [INFO] [config.py:1022:print]   zero_enabled ................. True
[2023-03-28 12:06:48,922] [INFO] [config.py:1022:print]   zero_force_ds_cpu_optimizer .. True
[2023-03-28 12:06:48,922] [INFO] [config.py:1022:print]   zero_optimization_stage ...... 3
[2023-03-28 12:06:48,922] [INFO] [config.py:1007:print_user_config]   json = {
    "bf16": {
        "enabled": true
    }, 
    "optimizer": {
        "type": "AdamW", 
        "params": {
            "lr": 1e-05, 
            "betas": [0.9, 0.999], 
            "eps": 1e-08, 
            "weight_decay": 0.0
        }
    }, 
    "scheduler": {
        "type": "WarmupLR", 
        "params": {
            "warmup_min_lr": 0, 
            "warmup_max_lr": 1e-05, 
            "warmup_num_steps": 0
        }
    }, 
    "zero_optimization": {
        "stage": 3, 
        "overlap_comm": true, 
        "contiguous_gradients": true, 
        "sub_group_size": 1.000000e+09, 
        "reduce_bucket_size": 1.677722e+07, 
        "stage3_prefetch_bucket_size": 1.509949e+07, 
        "stage3_param_persistence_threshold": 4.096000e+04, 
        "stage3_max_live_parameters": 1.000000e+09, 
        "stage3_max_reuse_distance": 1.000000e+09, 
        "stage3_gather_16bit_weights_on_model_save": true
    }, 
    "gradient_accumulation_steps": 1, 
    "gradient_clipping": 1.0, 
    "steps_per_print": 2.000000e+03, 
    "train_batch_size": 32, 
    "train_micro_batch_size_per_gpu": 8, 
    "wall_clock_breakdown": false
}
Using /home/maziyar/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00025963783264160156 seconds
Attempting to resume from local_dolly_training/dolly__2023-03-28T07:46:32/checkpoint-1400
[2023-03-28 12:06:48,926] [INFO] [torch_checkpoint_engine.py:23:load] [Torch] Loading checkpoint from local_dolly_training/dolly__2023-03-28T07:46:32/checkpoint-1400/global_step1400/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2023-03-28 12:06:48,976] [INFO] [torch_checkpoint_engine.py:25:load] [Torch] Loaded checkpoint from local_dolly_training/dolly__2023-03-28T07:46:32/checkpoint-1400/global_step1400/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2023-03-28 12:06:48,981] [INFO] [torch_checkpoint_engine.py:23:load] [Torch] Loading checkpoint from local_dolly_training/dolly__2023-03-28T07:46:32/checkpoint-1400/global_step1400/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2023-03-28 12:06:49,028] [INFO] [torch_checkpoint_engine.py:25:load] [Torch] Loaded checkpoint from local_dolly_training/dolly__2023-03-28T07:46:32/checkpoint-1400/global_step1400/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2023-03-28 12:06:49,469] [INFO] [torch_checkpoint_engine.py:23:load] [Torch] Loading checkpoint from local_dolly_training/dolly__2023-03-28T07:46:32/checkpoint-1400/global_step1400/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2023-03-28 12:06:56,282] [INFO] [torch_checkpoint_engine.py:25:load] [Torch] Loaded checkpoint from local_dolly_training/dolly__2023-03-28T07:46:32/checkpoint-1400/global_step1400/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2023-03-28 12:06:56,283] [INFO] [engine.py:3043:_get_all_zero_checkpoint_state_dicts] successfully read 4 ZeRO state_dicts for rank 0
[2023-03-28 12:07:00,465] [INFO] [engine.py:2983:_load_zero_checkpoint] loading 4 zero partition checkpoints for rank 0
{'train_runtime': 15522.5667, 'train_samples_per_second': 6.568, 'train_steps_per_second': 0.205, 'train_loss': 1.0332834323995606, 'epoch': 2.0}
Deleting older checkpoint [local_dolly_training/dolly__2023-03-28T07:46:32/checkpoint-3000] due to args.save_total_limit

Now the actual ERRORs:

2023-03-28 12:07:01 ERROR [__main__] main failed
Traceback (most recent call last):
  File "/home/maziyar/apps/dolly/training/trainer.py", line 260, in <module>
    main()
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/maziyar/apps/dolly/training/trainer.py", line 252, in main
    train(**kwargs)
  File "/home/maziyar/apps/dolly/training/trainer.py", line 215, in train
    trainer.train()
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/site-packages/transformers/trainer.py", line 1527, in train
    return inner_training_loop(
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/site-packages/transformers/trainer.py", line 1920, in _inner_training_loop
    shutil.rmtree(checkpoint)
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/shutil.py", line 734, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/shutil.py", line 690, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/shutil.py", line 688, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
FileNotFoundError: [Errno 2] No such file or directory: 'rng_state_3.pth'
Traceback (most recent call last):
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/runpy.py", line 197, in _run_module_as_main
2023-03-28 12:07:01 ERROR [__main__] main failed
Traceback (most recent call last):
  File "/home/maziyar/apps/dolly/training/trainer.py", line 260, in <module>
    main()
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/maziyar/apps/dolly/training/trainer.py", line 252, in main
    train(**kwargs)
  File "/home/maziyar/apps/dolly/training/trainer.py", line 215, in train
    trainer.train()
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/site-packages/transformers/trainer.py", line 1527, in train
    return inner_training_loop(
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/site-packages/transformers/trainer.py", line 1920, in _inner_training_loop
    shutil.rmtree(checkpoint)
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/shutil.py", line 734, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/shutil.py", line 690, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/shutil.py", line 688, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
FileNotFoundError: [Errno 2] No such file or directory: 'rng_state_3.pth'
Traceback (most recent call last):
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/maziyar/apps/dolly/training/trainer.py", line 260, in <module>
    return _run_code(code, main_globals, None,
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/runpy.py", line 87, in _run_code
    main()
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    exec(code, run_globals)
  File "/home/maziyar/apps/dolly/training/trainer.py", line 260, in <module>
    main()
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/site-packages/click/core.py", line 1053, in main
    return self.main(*args, **kwargs)
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    rv = self.invoke(ctx)
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/maziyar/apps/dolly/training/trainer.py", line 252, in main
    train(**kwargs)
  File "/home/maziyar/apps/dolly/training/trainer.py", line 215, in train
    return __callback(*args, **kwargs)
  File "/home/maziyar/apps/dolly/training/trainer.py", line 252, in main
    trainer.train()
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/site-packages/transformers/trainer.py", line 1527, in train
    train(**kwargs)
  File "/home/maziyar/apps/dolly/training/trainer.py", line 215, in train
    trainer.train()
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/site-packages/transformers/trainer.py", line 1527, in train
    return inner_training_loop(
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/site-packages/transformers/trainer.py", line 1920, in _inner_training_loop
    return inner_training_loop(
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/site-packages/transformers/trainer.py", line 1920, in _inner_training_loop
    shutil.rmtree(checkpoint)
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/shutil.py", line 734, in rmtree
    shutil.rmtree(checkpoint)
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/shutil.py", line 734, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/shutil.py", line 690, in _rmtree_safe_fd
        onerror(os.unlink, fullname, sys.exc_info())_rmtree_safe_fd(fd, path, onerror)

  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/shutil.py", line 688, in _rmtree_safe_fd
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/shutil.py", line 690, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
    onerror(os.unlink, fullname, sys.exc_info())
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/shutil.py", line 688, in _rmtree_safe_fd
FileNotFoundError: [Errno 2] No such file or directory: 'rng_state_3.pth'
    os.unlink(entry.name, dir_fd=topfd)
FileNotFoundError: [Errno 2] No such file or directory: 'rng_state_3.pth'
2023-03-28 12:07:01 ERROR [__main__] main failed
Traceback (most recent call last):
  File "/home/maziyar/apps/dolly/training/trainer.py", line 260, in <module>
    main()
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/maziyar/apps/dolly/training/trainer.py", line 252, in main
    train(**kwargs)
  File "/home/maziyar/apps/dolly/training/trainer.py", line 215, in train
    trainer.train()
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/site-packages/transformers/trainer.py", line 1527, in train
    return inner_training_loop(
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/site-packages/transformers/trainer.py", line 1920, in _inner_training_loop
    shutil.rmtree(checkpoint)
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/shutil.py", line 734, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/shutil.py", line 667, in _rmtree_safe_fd
    _rmtree_safe_fd(dirfd, fullname, onerror)
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/shutil.py", line 690, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/shutil.py", line 688, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
FileNotFoundError: [Errno 2] No such file or directory: 'bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt'
Traceback (most recent call last):
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/maziyar/apps/dolly/training/trainer.py", line 260, in <module>
    main()
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/maziyar/apps/dolly/training/trainer.py", line 252, in main
    train(**kwargs)
  File "/home/maziyar/apps/dolly/training/trainer.py", line 215, in train
    trainer.train()
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/site-packages/transformers/trainer.py", line 1527, in train
    return inner_training_loop(
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/site-packages/transformers/trainer.py", line 1920, in _inner_training_loop
    shutil.rmtree(checkpoint)
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/shutil.py", line 734, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/shutil.py", line 667, in _rmtree_safe_fd
    _rmtree_safe_fd(dirfd, fullname, onerror)
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/shutil.py", line 690, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/home/maziyar/anaconda3/envs/dolly/lib/python3.9/shutil.py", line 688, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
FileNotFoundError: [Errno 2] No such file or directory: 'bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt'
[2023-03-28 12:07:02,780] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 235676
[2023-03-28 12:07:04,022] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 235677
[2023-03-28 12:07:04,023] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 235678
[2023-03-28 12:07:04,037] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 235679
[2023-03-28 12:07:04,051] [ERROR] [launch.py:324:sigkill_handler] ['/home/maziyar/anaconda3/envs/dolly/bin/python', '-u', '-m', 'training.trainer', '--local_rank=3', '--deepspeed', '/home/maziyar/apps/dolly/config/ds_z3_bf16_config.json', '--epochs', '2', '--local-output-dir', 'local_dolly_training/dolly__2023-03-28T07:46:32', '--dbfs-output-dir', 'dolly_training/dolly__2023-03-28T07:46:32', '--per-device-train-batch-size', '8', '--per-device-eval-batch-size', '8', '--lr', '1e-5'] exits with return code = 1

Running the code without Databricks

Hi, in train_dolly.py there are a lot of MAGIC commands, which are used in Databricks notebooks. Do we need to run those commands separately if we are not using the Databricks framework?
If yes, can you also suggest how or what to modify in order to run this on one or two GPUs, as I have access to those.
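
(For what it's worth, the trainer appears to be launched outside notebooks via the deepspeed launcher in other issues on this page; a minimal sketch that wraps that same command with subprocess, where the GPU count, paths, and hyperparameters are placeholders to adjust for your own machine:)

import subprocess

# Argument names mirror the deepspeed command shown in the logs of other
# issues in this thread; all values below are placeholders.
cmd = [
    "deepspeed", "--num_gpus", "2",
    "--module", "training.trainer",
    "--deepspeed", "config/ds_z3_bf16_config.json",
    "--epochs", "1",
    "--local-output-dir", "local_dolly_training/run1",
    "--dbfs-output-dir", "dolly_training/run1",  # on a plain machine this is just another directory
    "--per-device-train-batch-size", "1",
    "--per-device-eval-batch-size", "1",
    "--lr", "1e-5",
]
subprocess.run(cmd, check=True)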

Legal Status of Dolly trained on tatsu-lab/alpaca data?

I see you have trained Dolly on the tatsu-lab/alpaca dataset, which was generated by OpenAI's text-davinci-003 engine. My understanding of the OpenAI license is that you cannot use their model output to compete with OpenAI. So, can I use Dolly commercially, even to compete with OpenAI?

I am not a lawyer; I am an ML engineer asking in good faith, hoping that maybe your legal team has already resolved this question for us all, so we can all benefit from the tatsu-lab/alpaca task dataset without concerns.

Training time (p4d.24xlarge)

Does anyone know how long Dolly took to train on a p4d.24xlarge instance? I know that Alpaca took around 3 hours to train on an identical instance, so I'm trying to gauge how much training Dolly will cost.

ValueError: Your setup doesn't support bf16/gpu.

I ran the notebook on 8 V100 GPUs, but an error occurred:

  File "<string>", line 105, in __init__
  File "/databricks/python/lib/python3.9/site-packages/transformers/training_args.py", line 1098, in __post_init__
    raise ValueError(
ValueError: Your setup doesn't support bf16/gpu. You need torch>=1.10, using Ampere GPU with cuda>=11.0

I changed "bf16" to "fp16" in ds_z3_bf16_config.json, but nothing happen...

License conflicts

I've shared this as open-source news, but then had a look at your dataset's license. This is not open source, so I see many conflicts here. NC is not a non-profit license (that would be nice, I know); it actually forbids any commercial activity, even getting paid as a researcher while using it.

So I consider this an epic issue to solve. Furthermore, it blocks contributions, as they would fall under the same conditions, which reduces the whole crowdsourcing effort to absurdity (and it means you as a team cannot be commercially involved at any point). I admire your efforts to create something beyond ChatGPT and would like to continue spreading the message, if you could carefully reconsider the license approach and make it truly open source.

All the best

Is it possible to provide any context in the prompt?

Hi,

Out of curiosity, is there any way to construct the PROMPT with some context, to make sure the response is generated from it? Something similar to this discussion: https://community.openai.com/t/how-to-prevent-chatgpt-from-answering-questions-that-are-outside-the-scope-of-the-provided-context-in-the-system-role-message/112027

My use case is to first take the query, find the top N semantically closest documents in a database, and then use them as context alongside the input when generating a response (a great way to make sure the generated response is not way off topic).

In Alpaca, I saw this prompt:

f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
# Instruction:
{instruction}
# Input:
{input}
# Response:
"""

Exits with return code = -9 with multiple GPUs

When I train the model with one GPU, it works. But when I set --num_gpus 8, it exits with return code = -9.

The command:

deepspeed --num_gpus 8 --module training.trainer --deepspeed /home/dolly/config/ds_z3_bf16_config.json --epochs 1 --local-output-dir /home/dolly/output/ --dbfs-output-dir /home/dolly/dbfs/ --per-device-train-batch-size 1 --per-device-eval-batch-size 1 --lr 1e-5

The output :

bash train.sh 
[2023-03-29 11:12:15,727] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 but ignoring it because one or several of --include/--exclude/--num_gpus/--num_nodes cl args were used. If you want to use CUDA_VISIBLE_DEVICES don't pass any of these arguments to deepspeed.
[2023-03-29 11:12:15,942] [INFO] [runner.py:548:main] cmd = /home/dolly/.venv/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --module --enable_each_rank_log=None training.trainer --deepspeed /home/dolly/config/ds_z3_bf16_config.json --epochs 1 --local-output-dir /home/dolly/output/ --dbfs-output-dir /home/dolly/dbfs/ --per-device-train-batch-size 1 --per-device-eval-batch-size 1 --lr 1e-5
[2023-03-29 11:12:21,865] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-03-29 11:12:21,865] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-03-29 11:12:21,865] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-03-29 11:12:21,865] [INFO] [launch.py:162:main] dist_world_size=8
[2023-03-29 11:12:21,865] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
2023-03-29 11:12:47 INFO [__main__] Loading tokenizer for EleutherAI/gpt-j-6B
2023-03-29 11:12:47 INFO [__main__] Loading tokenizer for EleutherAI/gpt-j-6B
2023-03-29 11:12:47 INFO [__main__] Loading tokenizer for EleutherAI/gpt-j-6B
2023-03-29 11:12:47 INFO [__main__] Loading tokenizer for EleutherAI/gpt-j-6B
2023-03-29 11:12:47 INFO [__main__] Loading tokenizer for EleutherAI/gpt-j-6B
2023-03-29 11:12:47 INFO [__main__] Loading tokenizer for EleutherAI/gpt-j-6B
2023-03-29 11:12:47 INFO [__main__] Loading tokenizer for EleutherAI/gpt-j-6B
2023-03-29 11:12:47 INFO [__main__] Loading tokenizer for EleutherAI/gpt-j-6B
2023-03-29 11:12:49 INFO [__main__] Loading model for EleutherAI/gpt-j-6B
2023-03-29 11:12:49 INFO [__main__] Loading model for EleutherAI/gpt-j-6B
2023-03-29 11:12:49 INFO [__main__] Loading model for EleutherAI/gpt-j-6B
2023-03-29 11:12:58 INFO [__main__] Loading model for EleutherAI/gpt-j-6B
2023-03-29 11:12:58 INFO [__main__] Loading model for EleutherAI/gpt-j-6B
2023-03-29 11:13:08 INFO [__main__] Loading model for EleutherAI/gpt-j-6B
2023-03-29 11:13:18 INFO [__main__] Loading model for EleutherAI/gpt-j-6B
2023-03-29 11:13:18 INFO [__main__] Loading model for EleutherAI/gpt-j-6B
[2023-03-29 11:15:00,385] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17373
[2023-03-29 11:15:01,360] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17374
[2023-03-29 11:15:02,153] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17375
[2023-03-29 11:15:03,026] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17376
[2023-03-29 11:15:03,806] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17377
[2023-03-29 11:15:03,944] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17378
[2023-03-29 11:15:04,724] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17379
[2023-03-29 11:15:04,726] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17380
[2023-03-29 11:15:05,588] [ERROR] [launch.py:324:sigkill_handler] ['/home/dolly/.venv/bin/python', '-u', '-m', 'training.trainer', '--local_rank=7', '--deepspeed', '/home/dolly/dolly/config/ds_z3_bf16_config.json', '--epochs', '1', '--local-output-dir', '/home/dolly/output/', '--dbfs-output-dir', '/home/dolly/dbfs/', '--per-device-train-batch-size', '1', '--per-device-eval-batch-size', '1', '--lr', '1e-5'] exits with return code = -9

Failing to create a cluster on Databricks

When creating a cluster on Databricks, it always says "Finding instances for new nodes, acquiring more instances if necessary..."

Is there anything wrong with Databricks? I logged in via AWS.

OOM while training (7xx step) on 4x A100 40GB

I am trying to train on a 4x A100 40GB system. None of the settings were changed from the default except for the --num_gpus 4 option.

It seems to be working really well, but a CUDA OOM suddenly occurs around step 7xx. It was clearly not a step that was running evaluation or saving the model.

Is this expected? Is it because I didn't use 8 GPUs? Does increasing the number of GPUs reduce the VRAM usage of each GPU?
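
(Not a confirmed fix for this OOM, but for reference: the config dump in a later issue on this page shows optimizer CPU offload enabled under ZeRO stage 3. A sketch of adding that block to the shipped config, with the output filename as a placeholder:)

import json

# Add optimizer CPU offload to the ZeRO stage 3 section, mirroring the
# "offload_optimizer" block visible in a later issue's config dump.
with open("config/ds_z3_bf16_config.json") as f:
    cfg = json.load(f)
cfg["zero_optimization"]["offload_optimizer"] = {"device": "cpu", "pin_memory": True}
with open("config/ds_z3_bf16_config_offload.json", "w") as f:
    json.dump(cfg, f, indent=2)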

Intended use for the new databricks-dolly-15k dataset

Hi,

I am really interested in the new databricks-dolly-15k dataset that was released today. I am just wondering:

  • Should I use it to further fine-tune a Dolly that has already been fine-tuned on the tatsu-lab/alpaca dataset,
  • or should I fine-tune on the databricks-dolly-15k dataset alone, without tatsu-lab/alpaca?

I would love to see what others do with this databricks-dolly-15k dataset.
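
(If it helps anyone experimenting: a minimal sketch for loading the dataset with the datasets library, assuming it is published on the Hugging Face Hub as databricks/databricks-dolly-15k:)

from datasets import load_dataset

# Assumed Hub id; adjust if the dataset is hosted under a different name.
dolly_15k = load_dataset("databricks/databricks-dolly-15k", split="train")
print(dolly_15k)      # row count and column names
print(dolly_15k[0])   # one instruction/response record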

Your setup doesn't support bf16/gpu. You need torch>=1.10, using Ampere GPU with cuda>=11.0

Hey 👋🏽,

I am not able to see the Standard_ND96asr_v4 instance that has been recommended.
I am on Azure in the West Europe location, and Standard_ND96asr_v4 is not available in that region.

However, I managed to get a Standard_NC24s_v3 (448 GB, 4 GPUs); see the screenshot below. (By the way, this wasn't an easy process either; it took me many attempts to get Azure to increase the quota.)
[screenshot]

However, upon training Dolly, I get the following error:
ValueError: Your setup doesn't support bf16/gpu. You need torch>=1.10, using Ampere GPU with cuda>=11.0

I am assuming it's because of the type of GPU selected to train Dolly.
Is there an alternative GPU that would work, other than the Standard_ND96asr_v4? Is there a workaround that someone has come across?

ConnectionResetError: [Errno 104] Connection reset by peer

I am running the sample training script with:

  • g5.24xlarge
  • cpu offload set in ds_z3_bf16_config.json
  • num_gpus to 4
  • train and eval batchsize = 4 (instead of 8)
  • logging_steps=100, eval_steps=1000, save_steps=2000
  • output folders as follows:
Local Output Dir: /dolly/local_training/dolly__2023-04-10T00:45:05
DBFS Output Dir: /dolly/output/dolly__2023-04-10T00:45:05
Tensorboard Display Dir: /dolly/local_training/dolly__2023-04-10T00:45:05/runs

and got the following error messages. It looks like the training itself had almost finished and then crashed at the very end.

Is there any way to avoid the error below?

Thanks.

---------------------------------------------------------------------------
The Python process exited with an unknown exit code.

The last 10 KB of the process's stderr and stdout can be found below. See driver logs for full logs.
---------------------------------------------------------------------------
Last messages on stderr:
Sun Apr  9 11:04:28 2023 Connection to spark from PID  2322
Sun Apr  9 11:04:28 2023 Initialized gateway on port 38899
Sun Apr  9 11:04:28 2023 Connected to spark.
2023-04-09 11:04:32.509624: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-04-09 11:04:47 INFO [training.trainer] Loading tatsu-lab/alpaca dataset
2023-04-09 11:04:49 WARNING [datasets.builder] Using custom data configuration tatsu-lab--alpaca-715f206eec35a791
2023-04-09 11:04:50 INFO [training.trainer] Found 52002 rows
2023-04-09 11:04:56 INFO [training.trainer] Loading tokenizer for EleutherAI/gpt-j-6B
2023-04-09 11:19:38 INFO [root] Exception while sending command.
Traceback (most recent call last):
  File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 503, in send_command
    self.socket.sendall(command.encode("utf-8"))
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 506, in send_command
    raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending
2023-04-09 11:20:13 INFO [root] Exception while sending command.
Traceback (most recent call last):
  File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 503, in send_command
    self.socket.sendall(command.encode("utf-8"))
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 506, in send_command
    raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending
---------------------------------------------------------------------------
Last messages on stdout:
ameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}

[2023-04-10 00:02:17,956] [INFO] [config.py:1012:print]   curriculum_enabled_legacy .... False

[2023-04-10 00:02:17,956] [INFO] [config.py:1012:print]   curriculum_params_legacy ..... False

[2023-04-10 00:02:17,956] [INFO] [config.py:1012:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}

[2023-04-10 00:02:17,956] [INFO] [config.py:1012:print]   data_efficiency_enabled ...... False

[2023-04-10 00:02:17,956] [INFO] [config.py:1012:print]   dataloader_drop_last ......... False

[2023-04-10 00:02:17,956] [INFO] [config.py:1012:print]   disable_allgather ............ False

[2023-04-10 00:02:17,956] [INFO] [config.py:1012:print]   dump_state ................... False

[2023-04-10 00:02:17,956] [INFO] [config.py:1012:print]   dynamic_loss_scale_args ...... None

[2023-04-10 00:02:17,956] [INFO] [config.py:1012:print]   eigenvalue_enabled ........... False

[2023-04-10 00:02:17,956] [INFO] [config.py:1012:print]   eigenvalue_gas_boundary_resolution  1

[2023-04-10 00:02:17,956] [INFO] [config.py:1012:print]   eigenvalue_layer_name ........ bert.encoder.layer

[2023-04-10 00:02:17,956] [INFO] [config.py:1012:print]   eigenvalue_layer_num ......... 0

[2023-04-10 00:02:17,956] [INFO] [config.py:1012:print]   eigenvalue_max_iter .......... 100

[2023-04-10 00:02:17,956] [INFO] [config.py:1012:print]   eigenvalue_stability ......... 1e-06

[2023-04-10 00:02:17,956] [INFO] [config.py:1012:print]   eigenvalue_tol ............... 0.01

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   eigenvalue_verbose ........... False

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   elasticity_enabled ........... False

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   flops_profiler_config ........ {

    "enabled": false, 

    "profile_step": 1, 

    "module_depth": -1, 

    "top_modules": 1, 

    "detailed": true, 

    "output_file": null

}

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   fp16_auto_cast ............... None

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   fp16_enabled ................. False

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   fp16_master_weights_and_gradients  False

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   global_rank .................. 0

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   grad_accum_dtype ............. None

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   gradient_accumulation_steps .. 1

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   gradient_clipping ............ 1.0

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   gradient_predivide_factor .... 1.0

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   initial_dynamic_scale ........ 1

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   load_universal_checkpoint .... False

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   loss_scale ................... 1.0

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   memory_breakdown ............. False

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   monitor_config ............... <deepspeed.monitor.config.DeepSpeedMonitorConfig object at 0x7fe5c7fd2760>

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   nebula_config ................ {

    "enabled": false, 

    "persistent_storage_path": null, 

    "persistent_time_interval": 100, 

    "num_of_version_in_retention": 2, 

    "enable_nebula_load": true, 

    "load_path": null

}

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   optimizer_legacy_fusion ...... False

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   optimizer_name ............... adamw

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   optimizer_params ............. {'lr': 1e-05, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.0}

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   pld_enabled .................. False

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   pld_params ................... False

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   prescale_gradients ........... False

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   scheduler_name ............... WarmupLR

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   scheduler_params ............. {'warmup_min_lr': 0, 'warmup_max_lr': 1e-05, 'warmup_num_steps': 0}

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   sparse_attention ............. None

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   sparse_gradients_enabled ..... False

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   steps_per_print .............. 2000

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   train_batch_size ............. 16

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   train_micro_batch_size_per_gpu  4

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   use_node_local_storage ....... False

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   wall_clock_breakdown ......... False

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   world_size ................... 4

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   zero_allow_untested_optimizer  False

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=16777216 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=True, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=15099494 param_persistence_threshold=40960 model_persistence_threshold=sys.maxsize max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=True stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False

[2023-04-10 00:02:17,958] [INFO] [config.py:1012:print]   zero_enabled ................. True

[2023-04-10 00:02:17,958] [INFO] [config.py:1012:print]   zero_optimization_stage ...... 3

[2023-04-10 00:02:17,958] [INFO] [config.py:997:print_user_config]   json = {

    "bf16": {

        "enabled": true

    }, 

    "optimizer": {

        "type": "AdamW", 

        "params": {

            "lr": 1e-05, 

            "betas": [0.9, 0.999], 

            "eps": 1e-08, 

            "weight_decay": 0.0

        }

    }, 

    "scheduler": {

        "type": "WarmupLR", 

        "params": {

            "warmup_min_lr": 0, 

            "warmup_max_lr": 1e-05, 

            "warmup_num_steps": 0

        }

    }, 

    "zero_optimization": {

        "stage": 3, 

        "overlap_comm": true, 

        "contiguous_gradients": true, 

        "sub_group_size": 1.000000e+09, 

        "reduce_bucket_size": 1.677722e+07, 

        "stage3_prefetch_bucket_size": 1.509949e+07, 

        "stage3_param_persistence_threshold": 4.096000e+04, 

        "stage3_max_live_parameters": 1.000000e+09, 

        "stage3_max_reuse_distance": 1.000000e+09, 

        "stage3_gather_16bit_weights_on_model_save": true, 

        "offload_optimizer": {

            "device": "cpu", 

            "pin_memory": true

        }

    }, 

    "gradient_accumulation_steps": 1, 

    "gradient_clipping": 1.0, 

    "steps_per_print": 2.000000e+03, 

    "train_batch_size": 16, 

    "train_micro_batch_size_per_gpu": 4, 

    "wall_clock_breakdown": false

}

Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...

No modifications detected for re-loaded extension module utils, skipping build step...

Loading extension module utils...

Time to load utils op: 0.00030159950256347656 seconds

Attempting to resume from /dolly/local_training/dolly__2023-04-09T11:19:57/checkpoint-2000

[2023-04-10 00:02:17,962] [INFO] [torch_checkpoint_engine.py:21:load] [Torch] Loading checkpoint from /dolly/local_training/dolly__2023-04-09T11:19:57/checkpoint-2000/global_step2000/zero_pp_rank_0_mp_rank_00_model_states.pt...

[2023-04-10 00:02:21,172] [INFO] [torch_checkpoint_engine.py:23:load] [Torch] Loaded checkpoint from /dolly/local_training/dolly__2023-04-09T11:19:57/checkpoint-2000/global_step2000/zero_pp_rank_0_mp_rank_00_model_states.pt.

[2023-04-10 00:02:21,173] [INFO] [torch_checkpoint_engine.py:21:load] [Torch] Loading checkpoint from /dolly/local_training/dolly__2023-04-09T11:19:57/checkpoint-2000/global_step2000/zero_pp_rank_0_mp_rank_00_model_states.pt...

[2023-04-10 00:02:21,202] [INFO] [torch_checkpoint_engine.py:23:load] [Torch] Loaded checkpoint from /dolly/local_training/dolly__2023-04-09T11:19:57/checkpoint-2000/global_step2000/zero_pp_rank_0_mp_rank_00_model_states.pt.

[2023-04-10 00:02:21,222] [INFO] [torch_checkpoint_engine.py:21:load] [Torch] Loading checkpoint from /dolly/local_training/dolly__2023-04-09T11:19:57/checkpoint-2000/global_step2000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...

[2023-04-10 00:11:10,776] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 5962

[2023-04-10 00:11:10,777] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 5963

Couldn't find the train_dolly notebook.

Open the train_dolly notebook in the dolly repo, attach to your GPU cluster, and run all cells. When training finishes, the notebook will save the model under /dbfs/dolly_training.
The Readme line above mentions a train_dolly notebook; however, the repo only contains a train_dolly.py file. Is that the file being referred to?

Python kernel is unresponsive due to Py4JSecurityException

I tried running

import numpy as np
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    PreTrainedModel,
    PreTrainedTokenizer
)
 
tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v1-6b", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v1-6b", device_map="auto", trust_remote_code=True, offload_folder="offload")

with this cluster configuration

Driver: m5d.large · Workers: m5d.large · 1-2 workers · Spot · fall back to On-demand · 12.2 LTS (includes Apache Spark 3.3.2, Scala 2.12) · us-east-1b

However, after 5 minutes of running, it throws this error

Fatal error: The Python kernel is unresponsive.

The Python process exited with an unknown exit code.

The last 10 KB of the process's stderr and stdout can be found below. See driver logs for full logs.

Last messages on stderr:
Wed Apr 5 13:52:47 2023 Connection to spark from PID 2437
Wed Apr 5 13:52:47 2023 Initialized gateway on port 41767
Wed Apr 5 13:52:47 2023 Connected to spark.
Unexpected internal error while setting REPL context in pre_command_execute: An error occurred while calling o402.tags. Trace:
py4j.security.Py4JSecurityException: Method public scala.collection.immutable.Map com.databricks.backend.common.rpc.CommandContext.tags() is not whitelisted on class class com.databricks.backend.common.rpc.CommandContext
at py4j.security.WhitelistingPy4JSecurityManager.checkCall(WhitelistingPy4JSecurityManager.java:473)
at py4j.Gateway.invoke(Gateway.java:305)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
at java.lang.Thread.run(Thread.java:750)

Are there special steps needed to overcome this? It seems like it's not happy with the Hugging Face method.

Thanks,
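
A hedged observation, not a confirmed diagnosis: the Py4JSecurityException is usually incidental, and the kernel death is more consistent with the driver running out of memory, since an m5d.large has only 8 GiB of RAM while a 6B-parameter model needs on the order of 12 GiB just for fp16 weights. A minimal sketch for a node that actually has the memory, using standard transformers arguments:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v1-6b", padding_side="left")
# low_cpu_mem_usage and float16 reduce peak memory while the checkpoint is loaded
model = AutoModelForCausalLM.from_pretrained(
    "databricks/dolly-v1-6b",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)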

Code example provided in readme does not work in Databricks notebook

Using runtime 12.2 LTS ML on a single-machine Standard_NCas_T4_v3 cluster,

these 2 lines do not work:

from transformers import pipeline
instruct_pipeline = pipeline(model="databricks/dolly-v2-12b", trust_remote_code=True, device_map="auto")

They return this error:
Fatal error: The Python kernel is unresponsive.

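A possible workaround rather than a confirmed fix: a single 16 GB T4 cannot hold the 12b checkpoint at full precision, so the kernel likely dies while loading. A smaller checkpoint (dolly-v2-3b is used below as an example) or 8-bit loading via bitsandbytes is usually needed on this hardware:

import torch
from transformers import pipeline

# dolly-v2-3b is the smallest published checkpoint and fits a 16 GB GPU far more comfortably;
# T4s predate bfloat16 hardware support, so fp16 is used here
instruct_pipeline = pipeline(
    model="databricks/dolly-v2-3b",
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map="auto",
)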

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

When I use the model, there was an error:

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

The code:

import numpy as np
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    PreTrainedModel,
    PreTrainedTokenizer
)

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v1-6b", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v1-6b", device_map="auto", trust_remote_code=True)

PROMPT_FORMAT = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
"""

def generate_response(instruction: str, *, model: PreTrainedModel, tokenizer: PreTrainedTokenizer,
                      do_sample: bool = True, max_new_tokens: int = 256, top_p: float = 0.92, top_k: int = 0, **kwargs) -> str:
    input_ids = tokenizer(PROMPT_FORMAT.format(instruction=instruction), return_tensors="pt").input_ids.to("cuda")

    # each of these is encoded to a single token
    response_key_token_id = tokenizer.encode("### Response:")[0]
    end_key_token_id = tokenizer.encode("### End")[0]

    gen_tokens = model.generate(input_ids, pad_token_id=tokenizer.pad_token_id, eos_token_id=end_key_token_id,
                                do_sample=do_sample, max_new_tokens=max_new_tokens, top_p=top_p, top_k=top_k, **kwargs)[0].cpu()

    # find where the response begins
    response_positions = np.where(gen_tokens == response_key_token_id)[0]

    if len(response_positions) > 0:
        response_pos = response_positions[0]

        # find where the response ends
        end_pos = None
        end_positions = np.where(gen_tokens == end_key_token_id)[0]
        if len(end_positions) > 0:
            end_pos = end_positions[0]

        return tokenizer.decode(gen_tokens[response_pos + 1 : end_pos]).strip()

    return None

while True:
    instruction=input("You:")
    response = generate_response(instruction, model=model, tokenizer=tokenizer)
    if response:
        print(f"Dolly: {response}\n")

Logging hangs when running code on Vertex AI

I've copied the training code here to a pipeline on a KFP cluster. The model, tokenizer, and dataset all load, but after Trainer.train(), the log just hangs after the following log and I can't figure out why:

0%| | 0/15294 [00:00<?, ?it/s]You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding

I've left it for 3+ hours and no new logs appear. Any help would go a long way to productionizing this.

Here is my component code for my pipeline.

from kfp.v2.dsl import (
    component,
    Input,
    Output,
    Artifact,
    Model,
)

@component(
    base_image="us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-13:latest",
    packages_to_install=[
        'accelerate',
        'bitsandbytes',
        'datasets',
        'transformers',
        'wandb'
    ]
)
def train_model(
        base_model: Input[Model],
        tokenizer_input: Input[Artifact],
        dataset_input: Input[Artifact],
        model_output: Output[Model]
    ):

    # ====================================
    # Imports
    # ====================================
    import numpy as np
    import os
    import pickle
    import torch
    import transformers
    import wandb

    from datasets import Dataset, load_from_disk

    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        PreTrainedTokenizer,
        Trainer,
        TrainingArguments,
        set_seed,
    )
    from typing import Any, Dict, List, Tuple, Union

    os.environ["PYTHONUNBUFFERED"] = "True"

    transformers.logging.set_verbosity_debug()

    RESPONSE_KEY = "### Response:\n"
    DEFAULT_SEED = 42
    TEST_SIZE = 1000

    EPOCHS = 3
    USE_CACHE = False # True is incompatible with gradient checkpointing
    PER_DEVICE_TRAIN_BATCH_SIZE = 10
    PER_DEVICE_EVAL_BATCH_SIZE = 10
    LEARNING_RATE = 1e-5
    BF16 = True
    DEEP_SPEED = None
    LOCAL_RANK = -1
    GRADIENT_CHECKPOINTING = True

    set_seed(DEFAULT_SEED)

    # ====================================
    # Set WandB variables
    # ====================================
    wandb.init()

    # ====================================
    # Custom Classes
    # ====================================
    class DataCollatorForCompletionOnlyLM(DataCollatorForLanguageModeling):
        def torch_call(self, examples: List[Union[List[int], Any, Dict[str, Any]]]) -> Dict[str, Any]:
            batch = super().torch_call(examples)

            response_token_ids = self.tokenizer.encode(RESPONSE_KEY)

            labels = batch["labels"].clone()

            for i in range(len(examples)):

                response_token_ids_start_idx = None
                for idx in np.where(batch["labels"][i] == response_token_ids[0])[0]:
                    if np.array_equal(response_token_ids, batch["labels"][i, idx : idx + len(response_token_ids)]):
                        response_token_ids_start_idx = idx
                        break

                if response_token_ids_start_idx is None:
                    raise RuntimeError("Could not find response key token IDs")

                response_token_ids_end_idx = response_token_ids_start_idx + len(response_token_ids)

                # Make pytorch loss function ignore all tokens up through the end of the response key
                labels[i, :response_token_ids_end_idx] = -100

            batch["labels"] = labels

            return batch


    # ====================================
    # Custom Functions
    # ====================================
   

    # ====================================
    # Retrieve artifacts
    # ====================================

    # Load the tokenizer
    print("Loading tokenizer...")
    tokenizer = transformers.AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B") # Also test use_fast=False
    tokenizer.pad_token = tokenizer.eos_token
    print("✅ - Tokenizer loaded.")

    # Load the model
    print("Loading model...")
    model = AutoModelForCausalLM.from_pretrained(
        "EleutherAI/gpt-j-6B",
        trust_remote_code=True,
        use_cache=False if GRADIENT_CHECKPOINTING else True
    )
    print("✅ - Model loaded.")

    # Load the dataset
    print("Loading dataset...")
    dataset_path = os.path.join(dataset_input.path, "dataset.pkl")
    dataset = load_from_disk(dataset_path)
    print("✅ - Dataset loaded.")

    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    print(f"Device set to {device}.")
    model.to(device)

    # Split dataset
    print("Splitting dataset...")
    split_dataset = dataset.train_test_split(test_size=TEST_SIZE, seed=DEFAULT_SEED)
    print("✅ - Dataset split.")

    # Create Data collator
    print("Creating data collator...")
    data_collator = DataCollatorForCompletionOnlyLM(
        tokenizer=tokenizer, mlm=False, return_tensors="pt", pad_to_multiple_of=8
    )
    print("✅ - Data collator created.")

    # Create model directory
    print("Create model directories...")
    output_dir = model_output.path
    logging_dir = os.path.join(output_dir, "runs")
    os.mkdir(output_dir)
    os.mkdir(logging_dir)
    print(f"✅ - Model output directory created at {output_dir}. Logging directory created at {logging_dir}.")

    # Create training arguments
    print("Creating training arguments...")
    training_args = TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=PER_DEVICE_TRAIN_BATCH_SIZE, # 10
        per_device_eval_batch_size=PER_DEVICE_EVAL_BATCH_SIZE, # 10
        fp16=False,
        bf16=BF16, # True
        learning_rate=LEARNING_RATE, # 1e-5
        num_train_epochs=EPOCHS, # 3
        deepspeed=DEEP_SPEED, # False
        gradient_checkpointing=GRADIENT_CHECKPOINTING, # True
        log_level="debug",
        logging_dir=logging_dir,
        logging_strategy="steps",
        logging_steps=1, # Default 10
        evaluation_strategy="steps",
        eval_steps=1, # Default 10
        save_strategy="steps",
        save_steps=10,
        save_total_limit=1,
        load_best_model_at_end=True,
        report_to="wandb",
        disable_tqdm=False,
        remove_unused_columns=False,
        local_rank=LOCAL_RANK, # 0
    )
    print("✅ - Training arguments created.")

    # Creating Trainer
    print("Instantiating Trainer...")
    trainer = Trainer(
        model=model,
        tokenizer=tokenizer,
        args=training_args,
        train_dataset=split_dataset["train"],
        eval_dataset=split_dataset["test"],
        data_collator=data_collator,
    )
    print("✅ - Trainer created.")

    print("Started training...")
    trainer.train()
    print("✅ - Training finished.")

    print(f"Saving Model to {output_dir}")
    trainer.save_model(output_dir=output_dir)

Problem when starting a cluster

When starting a cluster, I encounter the following problem:
Unexpected startup failure: An unexpected error occurred while setting up the cluster. If the problem persists, please try again and contact Databricks.
How can I solve this problem? Thanks for your help.

HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name'

I am getting a validation error on CMD 11:

from training.generate import generate_response, load_model_tokenizer_for_generate

model, tokenizer = load_model_tokenizer_for_generate(local_output_dir)

Here's the traceback:

---------------------------------------------------------------------------
HFValidationError                         Traceback (most recent call last)
File <command-597306789744509>:3
      1 from training.generate import generate_response, load_model_tokenizer_for_generate
----> 3 model, tokenizer = load_model_tokenizer_for_generate(local_output_dir, )

File /Workspace/Repos/[email protected]/dolly/training/generate.py:36, in load_model_tokenizer_for_generate(pretrained_model_name_or_path)
     25 def load_model_tokenizer_for_generate(
     26     pretrained_model_name_or_path: str,
     27 ) -> Tuple[PreTrainedModel, PreTrainedTokenizer]:
     28     """Loads the model and tokenizer so that it can be used for generating responses.
     29 
     30     Args:
   (...)
     34         Tuple[PreTrainedModel, PreTrainedTokenizer]: model and tokenizer
     35     """
---> 36     tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path, padding_side="left")
     37     model = AutoModelForCausalLM.from_pretrained(
     38         pretrained_model_name_or_path, device_map="auto", trust_remote_code=True
     39     )
     40     return model, tokenizer

File /databricks/python/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py:582, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    579     return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    581 # Next, let's try to use the tokenizer_config file to get the tokenizer class.
--> 582 tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
    583 if "_commit_hash" in tokenizer_config:
    584     kwargs["_commit_hash"] = tokenizer_config["_commit_hash"]

File /databricks/python/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py:433, in get_tokenizer_config(pretrained_model_name_or_path, cache_dir, force_download, resume_download, proxies, use_auth_token, revision, local_files_only, subfolder, **kwargs)
    371 """
    372 Loads the tokenizer configuration from a pretrained model tokenizer configuration.
    373 
   (...)
    430 tokenizer_config = get_tokenizer_config("tokenizer-test")
    431 ```"""
    432 commit_hash = kwargs.get("_commit_hash", None)
--> 433 resolved_config_file = cached_file(
    434     pretrained_model_name_or_path,
    435     TOKENIZER_CONFIG_FILE,
    436     cache_dir=cache_dir,
    437     force_download=force_download,
    438     resume_download=resume_download,
    439     proxies=proxies,
    440     use_auth_token=use_auth_token,
    441     revision=revision,
    442     local_files_only=local_files_only,
    443     subfolder=subfolder,
    444     _raise_exceptions_for_missing_entries=False,
    445     _raise_exceptions_for_connection_errors=False,
    446     _commit_hash=commit_hash,
    447 )
    448 if resolved_config_file is None:
    449     logger.info("Could not locate the tokenizer configuration file, will try to use the model config instead.")

File /databricks/python/lib/python3.9/site-packages/transformers/utils/hub.py:409, in cached_file(path_or_repo_id, filename, cache_dir, force_download, resume_download, proxies, use_auth_token, revision, local_files_only, subfolder, user_agent, _raise_exceptions_for_missing_entries, _raise_exceptions_for_connection_errors, _commit_hash)
    406 user_agent = http_user_agent(user_agent)
    407 try:
    408     # Load from URL or cache if already cached
--> 409     resolved_file = hf_hub_download(
    410         path_or_repo_id,
    411         filename,
    412         subfolder=None if len(subfolder) == 0 else subfolder,
    413         revision=revision,
    414         cache_dir=cache_dir,
    415         user_agent=user_agent,
    416         force_download=force_download,
    417         proxies=proxies,
    418         resume_download=resume_download,
    419         use_auth_token=use_auth_token,
    420         local_files_only=local_files_only,
    421     )
    423 except RepositoryNotFoundError:
    424     raise EnvironmentError(
    425         f"{path_or_repo_id} is not a local folder and is not a valid model identifier "
    426         "listed on 'https://huggingface.co/models'\nIf this is a private repository, make sure to "
    427         "pass a token having permission to this repo with `use_auth_token` or log in with "
    428         "`huggingface-cli login` and pass `use_auth_token=True`."
    429     )

File /databricks/python/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py:114, in validate_hf_hub_args.<locals>._inner_fn(*args, **kwargs)
    109 for arg_name, arg_value in chain(
    110     zip(signature.parameters, args),  # Args values
    111     kwargs.items(),  # Kwargs values
    112 ):
    113     if arg_name == "repo_id":
--> 114         validate_repo_id(arg_value)
    116     elif arg_name == "token" and arg_value is not None:
    117         has_token = True

File /databricks/python/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py:166, in validate_repo_id(repo_id)
    161     raise HFValidationError(
    162         f"Repo id must be a string, not {type(repo_id)}: '{repo_id}'."
    163     )
    165 if repo_id.count("/") > 1:
--> 166     raise HFValidationError(
    167         "Repo id must be in the form 'repo_name' or 'namespace/repo_name':"
    168         f" '{repo_id}'. Use `repo_type` argument if needed."
    169     )
    171 if not REPO_ID_REGEX.match(repo_id):
    172     raise HFValidationError(
    173         "Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are"
    174         " forbidden, '-' and '.' cannot start or end the name, max length is 96:"
    175         f" '{repo_id}'."
    176     )

HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/root/dolly_training/dolly__2023-03-30T01:11:56'. Use `repo_type` argument if needed.

Error is occurring for both:

model, tokenizer = load_model_tokenizer_for_generate(local_output_dir) # local_output_dir  = /root/dolly_training/dolly__2023-03-30T01:11:56

model, tokenizer = load_model_tokenizer_for_generate(dbfs_output_dir) # dbfs_output_dir = /dbfs/dolly_training/dolly__2023-03-30T01:11:56

Cluster config:

{
    "autoscale": {
        "min_workers": 2,
        "max_workers": 8
    },
    "cluster_name": "Dolly Cluster",
    "spark_version": "12.2.x-gpu-ml-scala2.12",
    "spark_conf": {},
    "aws_attributes": {
        "first_on_demand": 1,
        "availability": "SPOT_WITH_FALLBACK",
        "zone_id": "auto",
        "spot_bid_price_percent": 100,
        "ebs_volume_count": 0
    },
    "node_type_id": "g4dn.xlarge",
    "driver_node_type_id": "g4dn.xlarge",
    "ssh_public_keys": [],
    "custom_tags": {},
    "spark_env_vars": {},
    "autotermination_minutes": 120,
    "enable_elastic_disk": false,
    "cluster_source": "UI",
    "init_scripts": [],
    "single_user_name": "[email protected]",
    "enable_local_disk_encryption": false,
    "data_security_mode": "SINGLE_USER",
    "runtime_engine": "STANDARD",
    "cluster_id": "0329-202609-84uf8huw"
}
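
A hedged observation: transformers only falls back to treating the string as a Hugging Face repo id (which then fails validation because of the ':' in the timestamp) when the directory cannot be found locally, so the first thing to verify is that the output path actually exists and is populated on the node running the notebook. A minimal check, reusing the path from this issue:

import os

local_output_dir = "/dbfs/dolly_training/dolly__2023-03-30T01:11:56"  # path from this issue
print(os.path.isdir(local_output_dir))
print(os.listdir(local_output_dir) if os.path.isdir(local_output_dir) else "directory not found")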

Create an official Hugging Face model?

I found some Dolly-related models on the Hugging Face website, but I'm not sure whether they are the official version. Is there any plan to publish an official one on Hugging Face? That would make it easier for people to test Dolly's performance.

Dolly response generation takes 3 mins or more on GPU

The code takes around 3 minutes to generate a response. This line takes a long time even on a GPU. Any suggestions?

model.generate(input_ids, pad_token_id=tokenizer.pad_token_id, eos_token_id=end_key_token_id, do_sample=do_sample, max_new_tokens=max_new_tokens, top_p=top_p, top_k=top_k, **kwargs)[0].cpu()
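
A hedged suggestion rather than a definitive fix: loading the weights in half precision and keeping the whole model on the GPU usually cuts generation time substantially, as does lowering max_new_tokens when full-length answers aren't needed. A sketch using standard transformers arguments:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v1-6b", padding_side="left")
model = AutoModelForCausalLM.from_pretrained(
    "databricks/dolly-v1-6b",
    torch_dtype=torch.float16,   # half precision generates noticeably faster than fp32
    device_map="auto",           # requires `accelerate`; keeps the model on the GPU when it fits
    trust_remote_code=True,
)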

Not able to select any GPU machine on Azure

Hi.

Trying to run Dolly on MS Azure. When I try to create a compute cluster and choose Runtime 12.2 LTS, I cannot select any GPU machine, such as Standard_ND96asr_v4.
Error: This node type is not compatible with selected runtime.
Any suggestions on how to resolve this?

RuntimeError: Expected only a single token for '### Response: ' but found [50402, 198]

The issue:
Got a runtime error at line 87 of training/generate.py:
response_key_token_id = get_special_token_id(tokenizer, RESPONSE_KEY_NL)

The message:
RuntimeError: Expected only a single token for '### Response:
' but found [50402, 198]

The possible reason:
As RESPONSE_KEY_NL = f"### Response:\n"
tokenizer.encode(RESPONSE_KEY_NL) returns 2 tokens, hence the runtime error
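
A quick diagnostic, under the assumption that the repo's training code registers the response key as an additional special token (which is what lets it encode to a single id): if the tokenizer loaded at generation time is missing those added tokens, for example because it was loaded from the base model rather than from the fine-tuned checkpoint directory, encode returns two ids exactly as in the error above. The checkpoint path below is a hypothetical placeholder.

from transformers import AutoTokenizer

checkpoint_dir = "/dbfs/dolly_training/your-run"  # hypothetical: the fine-tuned output directory
tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir, padding_side="left")

RESPONSE_KEY_NL = "### Response:\n"
print(tokenizer.encode(RESPONSE_KEY_NL))  # one id if the added special tokens are present; [50402, 198] otherwise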

Error during installation

Operating system: Windows 10 Professional
Python version: Python 3.10.11

When I run pip3 install -r requirements_dev.txt, the following appears:
Using cached https://mirrors.aliyun.com/pypi/packages/0d/6a/216004220f0658ac4ee2c06f724b1557d1a5f0533f7fd0830845351d0b30/deepspeed-0.8.0.tar.gz (749 kB)
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [9 lines of output]
Traceback (most recent call last):
File "", line 2, in
File "", line 34, in
File "C:\Users\Administrator\AppData\Local\Temp\pip-install-b9_i45kc\deepspeed_212e4e5c59c94e27afa26375d642bd71\setup.py", line 122, in
assert torch_available, "Unable to pre-compile ops without torch installed. Please install torch before attempting to pre-compile ops."
AssertionError: Unable to pre-compile ops without torch installed. Please install torch before attempting to pre-compile ops.
[WARNING] Unable to import torch, pre-compiling ops will be disabled. Please visit https://pytorch.org/ to see how to properly install torch on your system.
[WARNING] unable to import torch, please install it if you want to pre-compile any deepspeed ops.
DS_BUILD_OPS=1
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

new error on "device_map" parameter in the model

Hi,
After I installed accelerate and re-ran the model, a new error (below) appears. I tried installing safetensors and restarting the kernel but still get the same error message. Please kindly advise how to fix this. Thanks.
"
The current device_map had weights offloaded to the disk. Please provide an offload_folder for
them. Alternatively, make sure you have safetensors installed if the model you are using offers the weights in
this format."

OOM issue when fine-tuning with V100

I tried to run the training.trainer script with batch size == 1 (originally it is 8), but hit an OOM issue on a V100.

Has anyone tried to fine-tune it with a V100-32GB, or any machine that doesn't have 80GB like an A100?

Here is my training command, based on the instructions:
deepspeed --num_gpus=8 --module training.trainer --deepspeed ./dolly/config/ds_z3_bf16_config.json --epochs 1 --local-output-dir ./dolly/local_output_dir --dbfs-output-dir ./dolly/dbfs_output_dir --per-device-train-batch-size 1 --per-device-eval-batch-size 1 --lr 1e-5
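
A hedged sketch, not a validated recipe: on 32 GB V100s it is usually necessary to offload the parameters themselves (not just optimizer state) to CPU, and note that V100s lack native bfloat16 support, so the bf16 setting in the stock config may itself be a problem. The snippet below only derives a modified DeepSpeed config from the repo's ds_z3_bf16_config.json using standard ZeRO-3 options; the output filename is arbitrary, and you would point --deepspeed at it in the command above.

import json

with open("config/ds_z3_bf16_config.json") as f:
    ds_config = json.load(f)

# the stock config already offloads optimizer state to CPU;
# additionally offload the model parameters themselves
ds_config["zero_optimization"]["offload_param"] = {"device": "cpu", "pin_memory": True}

with open("config/ds_z3_bf16_config_offload.json", "w") as f:
    json.dump(ds_config, f, indent=2)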

Generate takes 10-15 seconds on A100.

I'm running the generate.py code (from a couple commits ago) on a single A100 on a GCP VM in a container. It's taking 10-15 seconds per generation with a max_new_tokens of 128. Is it expected to be this slow?

Here's the code I'm currently debugging. I split the model and tokenizer loading to try and debug, but it's the same on exactly the same code from the repo:

import logging
import os
import re

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    PreTrainedModel,
    PreTrainedTokenizer,
)

logger = logging.getLogger(__name__)

# The format of the instruction the model has been trained on.
INTRO = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
INSTRUCTION_FORMAT = """{intro}
### Instruction:
{instruction}
### Response:
"""

# Check GPU or CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'\n\nUsing device: {device}.\n\n')

def load_model(model_path) -> PreTrainedModel:
    print(f"Loading model from {model_path}.")
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        device_map="auto",
        # load_in_8bit=True,
        # torch_dtype=torch.float16
    ).to(device)
    return model

def load_tokenizer() -> PreTrainedTokenizer:
    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B", padding_side="left", trust_remote_code=True)
    tokenizer.pad_token_id = tokenizer.eos_token_id
    return tokenizer

# Load Model
print("Loading model...")
model_path = os.environ["OUTPUT_PATH"]
model = load_model(model_path = model_path)
print("Model loaded.")

# Load Tokenizer
print("Loading tokenizer...")
tokenizer = load_tokenizer()
print("Tokenizer loaded.")

# Generate Response Function
def generate_response(
    instruction: str,
    *,
    model: PreTrainedModel,
    tokenizer: PreTrainedTokenizer,
    do_sample: bool = True,
    max_new_tokens: int = 128,
    top_p: float = 0.92,
    top_k: int = 0,
    **kwargs,
) -> str:
    """Given an instruction, uses the model and tokenizer to generate a response.  This formats the instruction in
    the instruction format that the model was fine-tuned on.
    Args:
        instruction (str): instruction to generate response for
        model (PreTrainedModel): model to use
        tokenizer (PreTrainedTokenizer): tokenizer to use
        do_sample (bool, optional): Whether or not to use sampling. Defaults to True.
        max_new_tokens (int, optional): Max new tokens after the prompt to generate. Defaults to 128.
        top_p (float, optional): If set to float < 1, only the smallest set of most probable tokens with probabilities
            that add up to top_p or higher are kept for generation. Defaults to 0.92.
        top_k (int, optional): The number of highest probability vocabulary tokens to keep for top-k-filtering.
            Defaults to 0.
    Returns:
        str: the generated response
    """
    input_ids = tokenizer(
        INSTRUCTION_FORMAT.format(intro=INTRO, instruction=instruction), return_tensors="pt"
    ).input_ids.to(device)

    gen_tokens = model.generate(
        input_ids,
        pad_token_id=tokenizer.pad_token_id,
        do_sample=do_sample,
        max_new_tokens=max_new_tokens,
        top_p=top_p,
        top_k=top_k,
        **kwargs,
    )[0].cpu()
    decoded = tokenizer.decode(gen_tokens).strip()
    #decoded = tokenizer.batch_decode(gen_tokens)[0]

    # The response appears after "### Response:".  The model has been trained to append "### End" at the end.
    m = re.search(r"#+\s*Response:\s*(.+?)#+\s*End", decoded, flags=re.DOTALL)

    response = None
    if m:
        response = m.group(1).strip()
    else:
        # The model might not generate the "### End" sequence before reaching the max tokens.  In this case, return
        # everything after "### Response:".
        m = re.search(r"#+\s*Response:\s*(.+)", decoded, flags=re.DOTALL)
        if m:
            response = m.group(1).strip()
        else:
            logger.warn(f"Failed to find response in:\n{decoded}")

    return response
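
A hedged variant of load_model above, not a guaranteed fix: with device_map="auto" the weights are already placed by accelerate, so the extra .to(device) is redundant, and float16 weights generally generate noticeably faster than the default float32 on an A100.

import torch
from transformers import AutoModelForCausalLM, PreTrainedModel

def load_model(model_path: str) -> PreTrainedModel:
    # accelerate handles placement via device_map; no explicit .to(device) needed
    return AutoModelForCausalLM.from_pretrained(
        model_path,
        device_map="auto",
        torch_dtype=torch.float16,
    )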

Could not load model with any of the following classes

import torch
from transformers import pipeline

generate_text = pipeline(model="databricks/dolly-v2-12b", torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")
This leads to the error:

in infer_framework_load_model
raise ValueError(f"Could not load model {model} with any of the following classes: {class_tuple}.")
ValueError: Could not load model databricks/dolly-v1-6b with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>, <class 'transformers.models.gptj.modeling_gptj.GPTJForCausalLM'>).

Tensors must be contiguous when deepspeed broadcasting

When I train the model with one GPU, it works. But when I set --num_gpus 8, it returns "Tensors must be contiguous".

The command:
deepspeed --num_gpus 8 --module training.trainer --deepspeed /root/github/dolly/config/ds_z3_bf16_config.json --epochs 1 --local-output-dir /root/models/dolly/dolly --per-device-train-batch-size 8 --per-device-eval-batch-size 8 --lr 1e-5

The output:
Traceback (most recent call last):
File "/root/miniconda3/envs/dolly/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,

File "/root/miniconda3/envs/dolly/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)

File "/root/github/dolly/training/trainer.py", line 265, in
main()

File "/root/miniconda3/envs/dolly/lib/python3.8/site-packages/click/core.py", line 1128, in call
return self.main(*args, **kwargs)

File "/root/miniconda3/envs/dolly/lib/python3.8/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)

File "/root/miniconda3/envs/dolly/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)

File "/root/miniconda3/envs/dolly/lib/python3.8/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)

File "/root/github/dolly/training/trainer.py", line 257, in main
train(**kwargs)

File "/root/github/dolly/training/trainer.py", line 220, in train
trainer.train()

File "/root/miniconda3/envs/dolly/lib/python3.8/site-packages/transformers/trainer.py", line 1543, in train
return inner_training_loop(

File "/root/miniconda3/envs/dolly/lib/python3.8/site-packages/transformers/trainer.py", line 1612, in _inner_training_loop
deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(

File "/root/miniconda3/envs/dolly/lib/python3.8/site-packages/transformers/deepspeed.py", line 344, in deepspeed_init
deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)

File "/root/miniconda3/envs/dolly/lib/python3.8/site-packages/deepspeed/init.py", line 125, in initialize
engine = DeepSpeedEngine(args=args,

File "/root/miniconda3/envs/dolly/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 301, in init
self._configure_distributed_model(model)

File "/root/miniconda3/envs/dolly/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1187, in _configure_distributed_model
self._broadcast_model()

File "/root/miniconda3/envs/dolly/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1102, in _broadcast_model
dist.broadcast(p,

File "/root/miniconda3/envs/dolly/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 127, in log_wrapper
return func(*args, **kwargs)

File "/root/miniconda3/envs/dolly/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 232, in broadcast
return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)

File "/root/miniconda3/envs/dolly/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 70, in broadcast
return torch.distributed.broadcast(tensor=tensor,

File "/root/miniconda3/envs/dolly/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast
work = group.broadcast([tensor], opts)
RuntimeError: Tensors must be contiguous
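
A workaround that is often suggested for this DeepSpeed broadcast error (not an official fix): force every parameter tensor to be contiguous right after the model is loaded, before the Trainer / deepspeed.initialize call in training/trainer.py.

# assumes `model` is the freshly loaded AutoModelForCausalLM instance from the trainer
for param in model.parameters():
    if not param.data.is_contiguous():
        param.data = param.data.contiguous()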

Using Bigscience Bloom 176B or Bloomz 176B instead of GPT-J 6B

Would it be possible to take this software and substitute the Bigscience Bloom 176B or Bloomz 176B models, instead of the present GPT-J 6B model, as a simple drop-in in the code? If so, would running such a refinement be expected to take an equivalently large amount of time and/or amount of GPU resources? Thanks.

Could not load model with AutoModelForCausalLM dolly-v2-12b

ValueError: Could not load model databricks/dolly-v2-12b with any of the following classes: (<class
'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>, <class
'transformers.models.gpt_neox.modeling_gpt_neox.GPTNeoXForCausalLM'>).

Can you please let me know where did I go wrong?

name 'init_empty_weights' is not defined

When I load the model with the code below from the instructions, it reports the error 'name 'init_empty_weights' is not defined'. Please kindly advise how to fix this, thanks a lot.
Input code:
import numpy as np
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    PreTrainedModel,
    PreTrainedTokenizer
)

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v1-6b", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v1-6b", device_map="auto", trust_remote_code=True)

Program crashes before saving the model

Training finished, but it returned an error before saving the model to the output dir.

Here is the log.


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from /home/dolly/output/checkpoint-1200 (score: 1.2900390625).
[2023-03-30 20:51:23,074] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.8.0, git-hash=eecd6a2, git-branch=HEAD
[2023-03-30 20:51:23,359] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /home/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module cpu_adam, skipping build step...
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.39576125144958496 seconds
Using /home/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module cpu_adam, skipping build step...
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.4530458450317383 seconds
Using /home/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module cpu_adam, skipping build step...
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.47116732597351074 seconds
Using /home/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module cpu_adam, skipping build step...
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.4737248420715332 seconds
Adam Optimizer #1 is created with AVX512 arithmetic capability.
Config: alpha=0.000010, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
[2023-03-30 20:51:26,334] [INFO] [logging.py:68:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2023-03-30 20:51:26,345] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2023-03-30 20:51:26,345] [INFO] [utils.py:52:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2023-03-30 20:51:26,345] [INFO] [logging.py:68:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer
Using /home//.cache/torch_extensions/py310_cu117 as PyTorch extensions root...Using /home//.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Using /home//.cache/torch_extensions/py310_cu117 as PyTorch extensions root...

No modifications detected for re-loaded extension module utils, skipping build step...No modifications detected for re-loaded extension module utils, skipping build step...

Loading extension module utils...Loading extension module utils...

No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.018253803253173828 secondsTime to load utils op: 0.018238306045532227 seconds

Time to load utils op: 0.018271446228027344 seconds
[2023-03-30 20:51:27,017] [INFO] [utils.py:831:see_memory_usage] Stage 3 initialize beginning
[2023-03-30 20:51:27,018] [INFO] [utils.py:832:see_memory_usage] MA 0.14 GB         Max_MA 10.54 GB         CA 13.41 GB         Max_CA 14 GB 
[2023-03-30 20:51:27,018] [INFO] [utils.py:840:see_memory_usage] CPU Virtual Memory:  used = 153.26 GB, percent = 81.8%
[2023-03-30 20:51:27,020] [INFO] [stage3.py:114:__init__] Reduce bucket size 16777216
[2023-03-30 20:51:27,020] [INFO] [stage3.py:115:__init__] Prefetch bucket size 15099494
Using /home//.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0006060600280761719 seconds
[2023-03-30 20:51:27,094] [INFO] [utils.py:831:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2023-03-30 20:51:27,095] [INFO] [utils.py:832:see_memory_usage] MA 0.14 GB         Max_MA 0.14 GB         CA 13.41 GB         Max_CA 13 GB 
[2023-03-30 20:51:27,096] [INFO] [utils.py:840:see_memory_usage] CPU Virtual Memory:  used = 153.26 GB, percent = 81.8%
Parameter Offload: Total persistent parameters: 811008 in 114 params
[2023-03-30 20:51:27,179] [INFO] [utils.py:831:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2023-03-30 20:51:27,180] [INFO] [utils.py:832:see_memory_usage] MA 0.14 GB         Max_MA 0.14 GB         CA 13.41 GB         Max_CA 13 GB 
[2023-03-30 20:51:27,180] [INFO] [utils.py:840:see_memory_usage] CPU Virtual Memory:  used = 153.26 GB, percent = 81.8%
[2023-03-30 20:51:27,254] [INFO] [utils.py:831:see_memory_usage] Before creating fp16 partitions
[2023-03-30 20:51:27,254] [INFO] [utils.py:832:see_memory_usage] MA 0.14 GB         Max_MA 0.14 GB         CA 13.41 GB         Max_CA 13 GB 
[2023-03-30 20:51:27,255] [INFO] [utils.py:840:see_memory_usage] CPU Virtual Memory:  used = 153.26 GB, percent = 81.8%
[2023-03-30 20:51:28,929] [INFO] [utils.py:831:see_memory_usage] After creating fp16 partitions: 2
[2023-03-30 20:51:28,948] [INFO] [utils.py:832:see_memory_usage] MA 0.14 GB         Max_MA 0.14 GB         CA 0.15 GB         Max_CA 13 GB 
[2023-03-30 20:51:28,949] [INFO] [utils.py:840:see_memory_usage] CPU Virtual Memory:  used = 175.59 GB, percent = 93.7%
[2023-03-30 20:51:29,193] [INFO] [utils.py:831:see_memory_usage] Before creating fp32 partitions
[2023-03-30 20:51:29,202] [INFO] [utils.py:832:see_memory_usage] MA 0.14 GB         Max_MA 0.14 GB         CA 0.15 GB         Max_CA 0 GB 
[2023-03-30 20:51:29,204] [INFO] [utils.py:840:see_memory_usage] CPU Virtual Memory:  used = 180.75 GB, percent = 96.4%
[2023-03-30 20:51:41,983] [INFO] [utils.py:831:see_memory_usage] After creating fp32 partitions
[2023-03-30 20:51:42,035] [INFO] [utils.py:832:see_memory_usage] MA 0.14 GB         Max_MA 0.14 GB         CA 0.15 GB         Max_CA 0 GB 
[2023-03-30 20:51:42,041] [INFO] [utils.py:840:see_memory_usage] CPU Virtual Memory:  used = 185.64 GB, percent = 99.1%
[2023-03-30 20:51:42,229] [INFO] [utils.py:831:see_memory_usage] Before initializing optimizer states
[2023-03-30 20:51:42,229] [INFO] [utils.py:832:see_memory_usage] MA 0.14 GB         Max_MA 0.14 GB         CA 0.15 GB         Max_CA 0 GB 
[2023-03-30 20:51:42,230] [INFO] [utils.py:840:see_memory_usage] CPU Virtual Memory:  used = 186.71 GB, percent = 99.6%
[2023-03-30 20:52:42,777] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 425355
[2023-03-30 20:52:48,097] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 425356
[2023-03-30 20:52:48,100] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 425357
[2023-03-30 20:52:52,963] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 425358
[2023-03-30 20:52:58,177] [ERROR] [launch.py:324:sigkill_handler] 

How to make Dolly run with a Gradio interface?

Downloaded all files here:


Using this ipynb notebook

import gradio as gr
from transformers import pipeline
import torch

theme = gr.themes.Monochrome(
    primary_hue="indigo",
    secondary_hue="blue",
    neutral_hue="slate",
    radius_size=gr.themes.sizes.radius_sm,
    font=[gr.themes.GoogleFont("Open Sans"), "ui-sans-serif", "system-ui", "sans-serif"],
)

instruct_pipeline = pipeline(model="F:/Dolly 2.0/dolly-v2-12b", torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="cuda:1")
def generate(instruction): 
    return instruct_pipeline(instruction)


examples = [
    "Instead of making a peanut butter and jelly sandwich, what else could I combine peanut butter with in a sandwich? Give five ideas",
    "How do I make a campfire?",
    "Write me a tweet about the release of Dolly 2.0, a new LLM"
   
]


def process_example(args):
    for x in generate(args):
        pass
    return x

css = ".generating {visibility: hidden}"

with gr.Blocks(theme=theme, analytics_enabled=False, css=css) as demo:
    with gr.Column():
        gr.Markdown(
            """ ## Dolly 2.0
            Dolly 2.0 is a 12B parameter language model based on the EleutherAI pythia model family and fine-tuned exclusively on a new, high-quality human generated instruction following dataset, crowdsourced among Databricks employees. For more details, please refer to the [model card](https://huggingface.co/databricks/dolly-v2-12b)
            
            Type in the box below and click the button to generate answers to your most pressing questions!
            
      """
        )
        gr.HTML("<p>You can duplicate this Space to run it privately without a queue for shorter queue times  : <a style='display:inline-block' href='https://huggingface.co/spaces/RamAnanth1/Dolly-v2?duplicate=true'><img src='https://img.shields.io/badge/-Duplicate%20Space-blue?labelColor=white&style=flat&logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAAAXNSR0IArs4c6QAAAP5JREFUOE+lk7FqAkEURY+ltunEgFXS2sZGIbXfEPdLlnxJyDdYB62sbbUKpLbVNhyYFzbrrA74YJlh9r079973psed0cvUD4A+4HoCjsA85X0Dfn/RBLBgBDxnQPfAEJgBY+A9gALA4tcbamSzS4xq4FOQAJgCDwV2CPKV8tZAJcAjMMkUe1vX+U+SMhfAJEHasQIWmXNN3abzDwHUrgcRGmYcgKe0bxrblHEB4E/pndMazNpSZGcsZdBlYJcEL9Afo75molJyM2FxmPgmgPqlWNLGfwZGG6UiyEvLzHYDmoPkDDiNm9JR9uboiONcBXrpY1qmgs21x1QwyZcpvxt9NS09PlsPAAAAAElFTkSuQmCC&logoWidth=14' alt='Duplicate Space'></a> </p>")

        with gr.Row():
            with gr.Column(scale=3):
                instruction = gr.Textbox(placeholder="Enter your question here", label="Question", elem_id="q-input")

                with gr.Box():
                    gr.Markdown("**Answer**")
                    output = gr.Markdown(elem_id="q-output")
                submit = gr.Button("Generate", variant="primary")
                gr.Examples(
                    examples=examples,
                    inputs=[instruction],
                    cache_examples=False,
                    fn=process_example,
                    outputs=[output],
                )


    submit.click(generate, inputs=[instruction], outputs=[output])
    instruction.submit(generate, inputs=[instruction], outputs=[output])

demo.queue(concurrency_count=16).launch(debug=True)

When running it, this is the error I am getting:


ValueError: Could not load model F:/Dolly 2.0/dolly-v2-12b with any of the following classes: (<class 
'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>, <class 
'transformers.models.gpt_neox.modeling_gpt_neox.GPTNeoXForCausalLM'>).
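
A hedged debugging step rather than a fix: loading the checkpoint directly usually surfaces the real underlying error (missing accelerate, an incomplete download, out-of-memory) instead of the generic "could not load with any of the following classes" message. Note also that a single device is normally passed to pipeline via device=..., while device_map usually takes "auto" or a mapping. The path below is the local one from this issue.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "F:/Dolly 2.0/dolly-v2-12b"
tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)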

RuntimeError: Placeholder storage has not been allocated on MPS device!

Env:

macOS== 13.3.1
M1Max, 64G
python==3.10.10
tensorflow-macos==2.12.0  
tensorflow-metal==0.8.0

It can run on Apple Silicon with CPU, but when I try to set the device to mps instead of cuda or cpu

device = torch.device("mps")
generate_text = InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer, device=device)
generate_text.model.to(device)

It shows this error message:

miniconda/envs/dolly/lib/python3.10/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Placeholder storage has not been allocated on MPS device!

It seems some module hasn't been moved to mps correctly, but I cannot find which one raised this error.
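
A minimal sketch of the usual cause, assuming model and tokenizer are the ones already loaded in this issue: the "Placeholder storage" error typically means the input tensors stayed on the CPU while the weights were moved to mps, so both have to be moved explicitly before generate is called.

import torch

device = torch.device("mps")
model = model.to(device)

prompt = "How do I make a campfire?"
inputs = tokenizer(prompt, return_tensors="pt").to(device)  # inputs must be on mps too
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128, pad_token_id=tokenizer.pad_token_id)
print(tokenizer.decode(output[0]))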

Problem when setting up the training environment

In training step 2, when I select Databricks Runtime '12.2 LTS ML (includes Apache Spark 3.3.2, GPU, Scala 2.12)', I find that I can't select 'Standard_ND96asr_v4'; the best available choice is 'Standard_NC64as_T4_v3'. Can I use a cluster with this configuration? Thanks for your help.

Is it possible to utilize NCasT4_v3-series clusters?

It seems like the cluster has plenty of memory available, but I'm getting the following error. "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 3; 15.75 GiB total capacity; 15.16 GiB already allocated; 37.62 MiB free; 15.16 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF"
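
A hedged mitigation taken straight from the error message, not a guarantee that 16 GB T4s are sufficient for this fine-tune: capping the allocator's split size can reduce fragmentation, and it has to be set before any CUDA allocation happens (for example at the very top of the training entry point).

import os

# must run before torch touches the GPU for it to take effect
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"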
