Comments (8)
What stage is this happening on? Usually this happens when we try to do something like an in-place addition of tensors. It can usually be resolved by cloning inputs (or by setting inplace=False for the relevant operator).
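A minimal sketch of the failure mode described above and of the cloning workaround. This is not PipeDream code; the function names are hypothetical and the tensors are toy values, chosen only to show an in-place addition overwriting a value that autograd saved for the backward pass.

```python
import torch


def backward_with_inplace_add():
    """Reproduces the error: exp() saves its output, then we overwrite it."""
    x = torch.ones(3, requires_grad=True)
    y = x.exp()          # autograd keeps y to compute the gradient of exp
    y += 1               # in-place addition modifies the saved tensor
    y.sum().backward()   # raises RuntimeError about an inplace operation


def backward_with_clone():
    """Same computation, but the in-place addition is applied to a copy."""
    x = torch.ones(3, requires_grad=True)
    y = x.exp()
    z = y.clone()        # z has its own storage, so y stays untouched
    z += 1
    z.sum().backward()
    return x.grad
```

The clone costs one extra tensor of the same size, which is the memory trade-off for keeping the saved value intact.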
from pipedream.
It happened in the first stage, on rank 0 and rank 1. I didn't change your code, and I tried to find the in-place operation but couldn't. What confuses me is that the VGG model runs successfully in hybrid mode; it has only two stages, while ResNet has three.
I see. This is a different model specification, so this is possible.
It's probably something to do with the in-place ReLUs (https://github.com/msr-fiddle/pipedream/blob/master/runtime/image_classification/models/resnet50/gpus%3D4/stage0.py#L68), but I'm not sure exactly what.
One thing that should definitely fix this is replacing all instances of inplace=True in the constructor with inplace=False, but this will probably increase the memory footprint. You can also try adding some .clone() calls.
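Instead of editing every constructor call by hand, the inplace flag can also be switched off after the model is built. The sketch below assumes the stage is an ordinary torch.nn.Module; disable_inplace is a hypothetical helper, not part of PipeDream.

```python
import torch.nn as nn


def disable_inplace(model: nn.Module) -> int:
    """Set inplace=False on every submodule that currently has it enabled.

    Returns the number of modules changed. Modules such as nn.ReLU read
    self.inplace on each forward call, so the change takes effect at once.
    """
    changed = 0
    for module in model.modules():
        if getattr(module, "inplace", False):
            module.inplace = False
            changed += 1
    return changed


# Toy stand-in for a pipeline stage (hypothetical, for illustration only).
stage = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3),
    nn.ReLU(inplace=True),
    nn.ReLU(inplace=False),
)
num_changed = disable_inplace(stage)
```

As noted above, each ReLU then allocates a new output tensor, so the memory footprint grows.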
I haven't run ResNet-50 in hybrid mode for a while, since data parallelism (DP) often outperforms the hybrid setup, often by a lot; the text around Table 1 in our paper explains why this is the case.
This may happen when you run PipeDream with a recent version of PyTorch. I am running Python 3.7 with PyTorch release 1.5.0 and see the same issue.
PyTorch 3.8 or Python 3.8?
Sorry, it should be close to the latest PyTorch release, 1.5.0. Actually, I compiled a recent commit from the master branch of the PyTorch GitHub repo, so I cannot match it to a release version.
@deepakn94 @SimonZsx @kanonjz
I also got this issue, so I tried an older version of PyTorch:
pip install torch===1.2.0 torchvision===0.4.0 -f https://download.pytorch.org/whl/torch_stable.html
This worked fine for me. I am also not quite sure why it is not working with the latest (I initially tried with 1.5.0).
@deepakn94 Could this be due to changes on the PyTorch side, such as the distributed autograd updates?
By the way, I didn't install the patch. Will this cause trouble for getting the expected results?
See issue #52. The reason is that recent PyTorch versions added new version-consistency checks.
For the patch, it is only for profiling AFAIK. The runtime does not use the pre-hooking features provided by the patch.
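To illustrate the kind of version consistency check mentioned above, here is a small sketch that is independent of PipeDream. It assumes the check in question is autograd's tensor version counter, exposed through the internal attribute _version. The in-place update under torch.no_grad() stands in for an optimizer step that lands between a forward pass and its backward pass. The function name is hypothetical.

```python
import torch


def update_between_forward_and_backward():
    """Modify a weight in place after forward, then attempt backward."""
    w = torch.ones(2, requires_grad=True)
    loss = (w * w).sum()       # the multiplication saves w for backward

    version_before = w._version
    with torch.no_grad():
        w.add_(1.0)            # in-place update, like an optimizer step
    version_after = w._version

    try:
        loss.backward()        # saved w no longer matches its version
        failed = False
    except RuntimeError:
        failed = True
    return version_before, version_after, failed


version_before, version_after, failed = update_between_forward_and_backward()
```

If this reading is right, a backward pass that runs against weights updated in place since its forward pass would trip the same check.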