Comments (2)
I have the same problem and am still trying to understand what the message actually means.
As far as I can tell, this happens at the end of the first batch of 7 in the first epoch, when the code in train.py calls loss.backward(). Because the failure occurs inside this call, I'm assuming it is not a bug in PyTorch, which leads me to believe that some processing that should have finished before this point is still running in the background. That would be a classic case of data changing unexpectedly. Unfortunately, Python, PyTorch and this type of code aren't my main field of expertise, so I'm going in a bit blind here...
BTW, this is the same problem as issue #46.
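One way to test the "still running in the background" hypothesis is a minimal sketch like the following. It assumes the failure is an asynchronous GPU error that only surfaces during the backward pass. The helper name `checked_backward` and the toy model are hypothetical and are not taken from train.py. The helper waits for all queued GPU work before calling backward, and it turns on autograd anomaly detection so that a failing backward step is traced to the forward operation that produced it.

```python
import torch
import torch.nn as nn


def checked_backward(loss: torch.Tensor) -> None:
    """Run loss.backward() with extra checks (hypothetical helper, not from train.py)."""
    if torch.cuda.is_available():
        # Wait for all queued GPU work, so an error from an earlier kernel
        # surfaces here rather than inside backward().
        torch.cuda.synchronize()
    # Anomaly detection makes autograd report the forward operation that
    # produced a failing backward step. It is slow, so use it for debugging only.
    with torch.autograd.detect_anomaly():
        loss.backward()
    if torch.cuda.is_available():
        # Flush the backward kernels too, so any error is raised at this point.
        torch.cuda.synchronize()


# Toy stand-in for the real model and one batch of 7 samples (assumption).
model = nn.Linear(4, 1)
inputs = torch.randn(7, 4)
targets = torch.randn(7, 1)
loss = nn.functional.mse_loss(model(inputs), targets)
checked_backward(loss)
print(model.weight.grad.shape)
```

If the error moves to the first `synchronize()` call, the problem most likely started before the backward pass. Running the script with the environment variable `CUDA_LAUNCH_BLOCKING=1` is another way to make kernel launches synchronous while debugging.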
from training.
In an effort to clean up the git repo so we can maintain it better going forward, the MLPerf Training working group is closing out issues older than 2 years, since much has changed in the benchmark suite. If you think this issue is still relevant, please feel free to reopen it. Even better, please come to the working group meeting to discuss your issue.