Comments (5)
I was able to run your code without errors from a simple virtual env, without Docker. I think it has something to do with the file permissions of your project & checkpoints folders. You may want to create the checkpoints folder in the Dockerfile / Python script instead of with mkdir from the CLI. The checkpoint handler needs full permissions on the checkpoint path. Lmk if you can fix it.
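A minimal sketch of that suggestion, assuming the script itself creates the checkpoint directory and checks that it can write there before handing the path to the checkpointer. The helper name ensure_writable_dir and the probe-file check are illustrative assumptions, not part of Orbax:

```python
import os
import tempfile


def ensure_writable_dir(path):
    """Create `path` if needed and confirm the current user can write to it.

    Hypothetical helper for illustration; not part of Orbax.
    """
    path = os.path.abspath(path)
    os.makedirs(path, exist_ok=True)
    # Create and delete a real probe file: on mounted volumes the
    # permission bits alone do not always tell the whole story.
    try:
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError as err:
        raise PermissionError(f"cannot write to {path}: {err}") from err
    return path
```

Calling ensure_writable_dir('checkpoints') before checkpointer.save would surface a permission problem early, with the failing path in the message.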
from orbax.
I just tried running it on Ubuntu without Docker and it worked, so it seems Docker is the problem. Creating the checkpoints folder with Python doesn't seem to change anything. I'll keep experimenting.
So the problem was the ownership of my mounted volume. Because I mounted /project from my host machine, I couldn't control its ownership from inside the container. I solved this by adding RUN mkdir /checkpoints to my Dockerfile in order to create a separate directory that is owned by the root user, then saving checkpoints to that directory instead of the one inside /project. The one problem with this is that everything inside /checkpoints will be lost when I stop the container. I got around this by only using /checkpoints as a temp directory and immediately copying any checkpoints to /project/checkpoints after they're created.
Here's the updated dockerfile:
FROM python:3.9.17-slim-bullseye
RUN mkdir /checkpoints
WORKDIR /project
COPY requirements.txt requirements.txt
RUN python -m pip install --upgrade pip
RUN python -m pip install -r requirements.txt
EXPOSE 8888
ENTRYPOINT ["jupyter", "lab", "--ip=0.0.0.0", "--allow-root", "--no-browser", "--NotebookApp.token=''", "--NotebookApp.password=''"]
So the root directory of my container now looks like this:
(usual linux stuff)
/checkpoints
/project
    /checkpoints
    /test.py
    ...
And /checkpoints is owned by the root user:
root@12b24ced75c4:/# ls -ld checkpoints
drwxr-xr-x 2 root root 4096 Aug 2 20:18 checkpoints
Whereas /project and by extension /project/checkpoints are not:
root@12b24ced75c4:/# ls -ld project
drwxrwxrwx 1 1000 1000 4096 Aug 2 20:17 project
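The same ownership comparison can be made from inside the Python process, which is what actually writes the checkpoint. This is a small sketch using only the standard library; the helper names are hypothetical, and os.geteuid is POSIX-only, so it assumes a Linux container like the one above:

```python
import os
import stat


def describe_owner(path):
    """Return (uid, gid, mode string) for `path`, similar to `ls -ld`."""
    info = os.stat(path)
    return info.st_uid, info.st_gid, stat.filemode(info.st_mode)


def owned_by_current_user(path):
    """True if the process's effective uid owns `path` (POSIX only)."""
    return os.stat(path).st_uid == os.geteuid()
```

With the listings above, describe_owner('/checkpoints') would report uid 0 and describe_owner('/project') would report uid 1000, so owned_by_current_user would differ between the two when the script runs as root.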
Finally, here's the updated python file with the copying technique that I mentioned:
import flax.linen as nn
from flax.training import train_state
import optax
import orbax.checkpoint as ocp
import jax
import jax.numpy as jnp
import os
import shutil


def create_train_state(module, rng):
    x = jnp.ones([1, 256, 256, 1])
    variables = module.init(rng, x)
    params = variables['params']
    tx = optax.adam(1e-3)
    ts = train_state.TrainState.create(
        apply_fn=module.apply, params=params, tx=tx
    )
    return ts


class TestModel(nn.Module):
    @nn.compact
    def __call__(self, x):
        x = nn.Conv(4, kernel_size=(3, 3))(x)
        return x


if __name__ == '__main__':
    init_rng = jax.random.PRNGKey(0)
    model = TestModel()
    state = create_train_state(model, init_rng)
    del init_rng
    checkpointer = ocp.Checkpointer(ocp.PyTreeCheckpointHandler(use_ocdbt=True))
    # Save to root owned checkpoints dir.
    checkpointer.save(os.path.abspath('../checkpoints/checkpoint1'), state)
    # Copy from root owned checkpoints dir, to checkpoints dir in mounted volume.
    shutil.copytree('../checkpoints/checkpoint1', 'checkpoints/checkpoint1')
    # Restore from checkpoints dir in mounted volume.
    state = checkpointer.restore(os.path.abspath('checkpoints/checkpoint1'))
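One caveat with the copy step: shutil.copytree raises FileExistsError when the destination directory already exists, so the copy would fail if checkpoints/checkpoint1 is still there from an earlier run. A hedged sketch of a rerunnable copy, assuming Python 3.8+ for dirs_exist_ok (the image above is 3.9); copy_checkpoint is a hypothetical helper name:

```python
import os
import shutil


def copy_checkpoint(src, dst):
    """Copy a finished checkpoint directory from `src` to `dst`.

    Hypothetical helper for illustration; not part of Orbax.
    """
    # Make sure the parent of the destination exists.
    os.makedirs(os.path.dirname(os.path.abspath(dst)), exist_ok=True)
    # dirs_exist_ok=True merges into an existing destination instead of
    # raising FileExistsError. It does not delete stale files in `dst`.
    return shutil.copytree(src, dst, dirs_exist_ok=True)
```

Because the copy merges rather than replaces, removing the old destination first with shutil.rmtree may be the safer choice if checkpoint contents can change between runs.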
@ChromeHearts thanks for pointing me in the right direction.
This is certainly very weird. Your container was actually running as root, so it shouldn't have had issues saving checkpoints directly to the checkpoint folder. In addition, if you can copy the checkpoints from the temp folder to the mounted folder, your Python script had no issues writing there either. It's definitely something to do with your Docker setup or local mount. I redid your setup in Docker and was able to checkpoint directly to the local mount folder.
sudo docker-compose exec test bash
root@e8a981978d40:/project# ls -l
total 16
-rw-r--r-- 1 1003 1003 268 Aug 3 02:26 Dockerfile
-rw-r--r-- 1 1003 1003 105 Aug 3 02:27 docker-compose.yaml
-rw-r--r-- 1 1003 1003 940 Aug 3 02:07 main.py
drwxr-xr-x 5 1003 1003 4096 Aug 3 02:08 py39
root@e8a981978d40:/project# mkdir checkpoints
root@e8a981978d40:/project# python main.py
No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
save_path='/project/checkpoints/checkpoint1'
root@e8a981978d40:/project# ls -l
total 20
-rw-r--r-- 1 1003 1003 268 Aug 3 02:26 Dockerfile
drwxr-xr-x 3 root root 4096 Aug 3 02:35 checkpoints
-rw-r--r-- 1 1003 1003 105 Aug 3 02:27 docker-compose.yaml
-rw-r--r-- 1 1003 1003 940 Aug 3 02:07 main.py
drwxr-xr-x 5 1003 1003 4096 Aug 3 02:08 py39
root@e8a981978d40:/project# find checkpoints
checkpoints
checkpoints/checkpoint1
checkpoints/checkpoint1/checkpoint
checkpoints/checkpoint1/d
checkpoints/checkpoint1/d/ce942794f70ea11a64fa0742f009b653
checkpoints/checkpoint1/d/2b248e926ce267f2604fe6215090a51b
checkpoints/checkpoint1/d/7d4d3dd57291b20fd8866db6035f8025
checkpoints/checkpoint1/manifest.ocdbt
root@e8a981978d40:/project#
The main.py is simply your Python script (1st version without the copy). I managed to save the checkpoint without issues. I suggest avoiding the copy from the temp folder to the mounted volume. Docker temp folders are not meant for storing large datasets. They are slow and have limited storage size.
It seems that we have at least one solution, so closing this issue.