
grid-docs's Introduction

Grid in 3 minutes

Introduction

Grid is designed for developing and training deep learning models at scale.

The TL;DR of using Grid is this:

  • Create a DATASTORE with your dataset.
  • Spin up an interactive SESSION to develop, analyze and prototype models/ideas.
  • When you have something that works, train it at scale via RUN.
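
Putting the three steps together, a minimal CLI sketch could look like this (the dataset folder, script name and script arguments are placeholders; the commands mirror those shown later in this document):

# 1. create a versioned datastore from a local folder
grid datastore create my_dataset_folder --name my-dataset

# 2. spin up an interactive session for prototyping
grid session create --instance_type 2_m60_8gb

# 3. train at scale, optionally as a hyperparameter sweep
grid run train.py --learning_rate "[0.001, 0.01]"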

This 3-minute video shows you how to execute code on cloud instances with zero code changes and how to debug/prototype and develop models with multi-GPU cloud instances.

intro.mp4

Here is a quick overview of:

  • Datastores
  • Sessions
  • Runs

Infrastructure is gone

Grid allocates all the machines and GPUs you need on demand, so you only pay for what you need when you need it.

Grid lets you focus on your work, NOT on the infrastructure. Create an account here to get free credits and get started!

Artifacts, logs, etc...

Grid handles all the other parts of developing and training at scale:

  • Artifacts
  • Logs
  • Metrics
  • etc...

Just run your files and watch the magic happen

Experiment Managers

Grid works with the experiment manager of your choice! 🔥🔥

No need to change your code!

Datastores: (Scalable datasets)

In Grid, we've introduced Datastores, high-performance, low-latency, versioned datasets.


The UI supports creating Datastores of < 1 GB

datastore.mp4

Use the CLI for larger datastores

grid datastore create imagenet_folder --name imagenet
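
Once a datastore exists, it can be attached to a session or referenced from a run. A sketch based on the commands used further down in this document (the names are placeholders, and --datadir is assumed to be the training script's own argument):

# mount the datastore in an interactive session
grid session create --name imagenet-dev --datastore_name imagenet

# or reference it from a run using the grid:<name>:<version> form
grid run train.py --datadir grid:imagenet:1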

Sessions (Interactive machines)

For prototyping/debugging/analyzing, sometimes you need a LIVE machine. We call these Sessions.

Web UI: Starting a new session

session.mp4

CLI: Starting a new session

# session with 2 M60 GPUs
grid session create --instance_type 2_m60_8gb
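
Once the session is up, you can attach to it over SSH. A short sketch, assuming a session named my-session and using the grid session ssh command that appears elsewhere in this document:

# open an SSH shell into the running session
grid session ssh my-session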

RUN (Sweep and train anything)

RUN any public or private repository with Grid in 5 steps:

This 1-minute video shows how to RUN from the web app:

run.mp4

If you prefer to use the CLI, simply replace python with grid run.

First, install Grid and log in:

pip install lightning-grid --upgrade
grid login

Now clone the repo and hit run!

# clone repo
git clone https://github.com/williamFalcon/hello
cd hello

# start the sweep
grid run hello.py --number "[1, 2]" --food_item "['pizza', 'pear']"

This command automatically produces these equivalent calls:

python hello.py --number 1 --food_item 'pizza'
python hello.py --number 2 --food_item 'pizza'

python hello.py --number 1 --food_item 'pear'
python hello.py --number 2 --food_item 'pear'
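
Each generated call runs as a separate experiment inside the run. A sketch of how you might monitor them, based on the grid status and grid logs commands shown further down in this document (run and experiment names are placeholders):

grid status                # list all runs
grid status my-run         # list the experiments inside one run
grid logs my-run-exp0      # stream logs for a single experiment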

That's it!

We learned that:

  • RUN executes scripts on cloud machines (and runs hyperparameter sweeps)
  • SESSION starts an interactive machine with the CPU/GPUs of your choice
  • DATASTORE is an optimized, low-latency auto-versioned dataset.
  • Grid has a Web app and a CLI with similar functionality.

That's all you need to know about Grid!

Next!

Now try our first tutorial

grid-docs's People

Contributors

adam-lightning, alexandercort, dmitsf, ematta, ericchea, krishnakalyan3, luca3rd, oojo12, panos-is, pritamsoni-hsr, rasbt, rlizzo, robert-s-lee, sunitaprakash


grid-docs's Issues

GitHub Enterprise support?

Hi, do you support GitHub Enterprise on the self-hosted version?

It should be as simple as making the host name github.com configurable.

Thanks.

How can I add RStudio web access support?

I could potentially create a conda environment.yaml that prepares a machine with RStudio installed, but how do I access the web interface? I suppose SSH tunneling is supported?

The other question is: could the RStudio front-end be added as another link in the Grid.AI web interface, next to the already available VSCode, JupyterLab and SSH options?
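
As a possible sketch for the web-access part of the question: assuming plain SSH port forwarding to the session host is available and RStudio Server listens on its default port 8787, the tunnel could look like this (the host name is a placeholder):

# forward RStudio Server's default port to the local machine
ssh -L 8787:localhost:8787 <session-host>
# then open http://localhost:8787 in a local browser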

BYOC: update adding-custom-cloud-credentials.md

  • add export in front of EXTERNAL_ID and ROLE_ARN (see the sketch after this list)
  • add osx install of git, terraform, jq and AWS CLI
  • add ubuntu / debian install of git, terraform, jq and AWS CLI
  • add redhat install of git, terraform, jq and AWS CLI
  • add grid tool install before running grid commands in the last step
  • s/terraform output --json/terraform output -json/
  • check instance availability
aws ec2 describe-instance-type-offerings --location-type availability-zone  --filters Name=instance-type,Values=p3.16xlarge
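
A sketch covering the first items in the checklist above; the values are placeholders and the package names/sources are assumptions, not taken from the Grid docs:

# export the values instead of plain assignment
export EXTERNAL_ID=<external-id>
export ROLE_ARN=<role-arn>

# macOS install of git, terraform, jq and the AWS CLI (assuming Homebrew)
brew install git terraform jq awscli

# Ubuntu / Debian equivalent (terraform typically comes from HashiCorp's apt
# repository or a release download rather than the default repos)
sudo apt-get install -y git jq awscli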

Can a single big instance be shared between users?

I saw there are settings like --gpus and --memory, but does this map to a similar concept as in SLURM, where users can share big instance types and SLURM partitions them between users?

From what I have seen so far, Grid.AI looks like a single-user, single-instance solution.

Further explain `scale_down_seconds`

Hi, could you further explain the scale_down_seconds setting?

I'm looking for a setting to allow me to keep the instance "hot" even after my jobs are done so I don't have to wait for an instance to start again, i.e. to avoid the cold-start problem.

Add details about Grid actions commands

The Grid YML spec supports three actions:

  • on_image_build: commands passed to the image builder, which are interpreted as RUN commands in a Dockerfile.

  • on_before_training_start: commands that are executed sequentially before the main experiment process starts.

  • on_after_training_end: same as above, but executed after the main process ends.
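
Purely as a hypothetical illustration of the three hooks above, the snippet below writes a config file that uses them. The file name and the surrounding YAML structure are assumptions and should be checked against the actual Grid YML spec; only the three action names come from this issue.

cat > grid-config.yml <<'EOF'
actions:
  on_image_build:
    - apt-get update && apt-get install -y ffmpeg   # becomes a RUN step in the image build
  on_before_training_start:
    - python prepare_data.py                        # runs sequentially before the experiment starts
  on_after_training_end:
    - python upload_results.py                      # runs after the main process ends
EOF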

Document that hard links work, but symlinks do not, in datastores

Assume there are many files and directories in the current directory. If only a subset of them is needed in a datastore, ln can be used to select that subset. ln supports hard and symbolic links: hard links work, symbolic links do not.

  • hard link example
mkdir tmp2
cd tmp2
ln ../test1.txt
ln ../test2.txt
grid datastore create --source . --name hardlink
│ prod-2     │     hardlink │       1 │   1.0 MB │ 2021-11-23 22:01 │ [email protected] │ Succeeded │
grid session create --name hardlinktest --datastore_name hardlink
# wait for session to come up
grid session ssh hardlinktest
cat /home/jovyan/hardlink/*
cat: /home/jovyan/hardlink/lost+found: Is a directory
test1.txt
test2.txt
  • sym link example [ does not work ]
echo "test1" > test1.txt
echo "test2" > test2.txt
mkdir tmp
cd tmp
ln -s ../test1.txt test1.txt
ln -s ../test2.txt test2.txt
cat *
grid datastore create --source . --name symlink

│ prod-2     │      symlink │       1 │  0 Bytes │ 2021-11-23 21:56 │ [email protected] │    Failed │

kinetics-video-classification: as-is script does not run

The https://github.com/robert-s-lee/KineticsDemo/tree/fix-requirements.txt branch has the WIP fix.

Two issues with https://docs.grid.ai/examples/vision/kinetics-video-classification:

  • Fixed No module named 'flash.data' by correcting two import typos in the code:

from flash.core.data.utils import download_data
from flash.core.utilities.imports import _KORNIA_AVAILABLE, _PYTORCHVIDEO_AVAILABLE

which fixes:

Traceback (most recent call last):
  File "train.py", line 14, in <module>
    from flash.data.utils import download_data
ModuleNotFoundError: No module named 'flash.data'
  • Still needs a fix for from pytorchvideo.transforms import ApplyTransformToKey, RandomShortSideScale, UniformTemporalSubsample:
% python train.py

/opt/miniconda3/envs/kd/lib/python3.7/site-packages/kornia/augmentation/augmentation.py:1833: DeprecationWarning: GaussianBlur is no longer maintained and will be removed from the future versions. Please use RandomGaussianBlur instead.
  category=DeprecationWarning,
Please, run `pip install torchvideo`
  • To test
conda create --name=kd python=3.7
conda activate kd
pip install lightning-grid --upgrade
grid login 
git clone https://github.com/aribornstein/KineticsDemo.git
cd KineticsDemo
pip install -r requirements.txt 

Fix Broken Links in grid-docs

To check for broken links I used a Python link-checking library. Perhaps it's a good idea to validate all links before a commit?

python -m pip install linkcheckmd
linkcheckMarkdown .
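
Assuming the link checker exits non-zero when it finds broken links, one way to validate links before every commit is a simple git pre-commit hook (a sketch, not part of the repository):

# install a pre-commit hook that runs the link checker
cat > .git/hooks/pre-commit <<'EOF'
#!/bin/sh
linkcheckMarkdown . || exit 1
EOF
chmod +x .git/hooks/pre-commit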

Files with broken links:

  • changelog.md
  • platform/11_known-issues.md
  • platform/10_tips-and-tricks.md
  • platform/1_Billing/billing-rates.md
  • features/runs/1_README.md
  • features/runs/2_private-repos.md
  • features/runs/1_Creating Runs/1_Basic Runs/2_Adv Runs/3_sweep-syntax.md
  • features/runs/1_Creating Runs/2_Adv Runs/3_creating-runs-from-config.md
  • features/runs/1_Creating Runs/2_Adv Runs/2_creating-runs-with-dockerfile.md
  • features/runs/1_Creating Runs/2_Adv Runs/5_auto-resume-experiments.md
  • features/sessions/changing-instance-type.md
  • features/sessions/8_how-to-ssh-into-a-session.md
  • examples/running-with-different-frameworks/running-julia-programs.md

coco: fix the warning and errors

The https://docs.grid.ai/examples/vision/coco run has the following warnings and fails. https://github.com/robert-s-lee/CocoDemo/tree/rslee-refresh has the WIP fix.

/opt/conda/lib/python3.8/site-packages/pl_bolts/utils/warnings.py:30: UserWarning: You want to use `gym` which is not installed yet, install it with `pip install gym`.
  stdout_func(
/opt/conda/lib/python3.8/site-packages/pl_bolts/utils/warnings.py:30: UserWarning: You want to use `cv2` which is not installed yet, install it with `pip install opencv-python`.
  stdout_func(

It fails with the following:

  File "/opt/conda/lib/python3.8/site-packages/flash/core/trainer.py", line 90, in finetune
    return super().fit(model, train_dataloader, val_dataloaders, datamodule)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 458, in fit
    self._run(model)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 756, in _run
    self.dispatch()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 797, in dispatch
    self.accelerator.start_training(self)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_trainin
g
    self.training_type_plugin.start_training(trainer)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 14
4, in start_training
    self._results = trainer.run_stage()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
    return self.run_train()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 865, in run_train
    self.train_loop.on_train_epoch_start(epoch)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 169, in on_train_epoch_
start
    self.trainer.call_hook("on_train_epoch_start")
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1223, in call_hook
    trainer_hook(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/callback_hook.py", line 90, in on_train_epoch_s
tart
    callback.on_train_epoch_start(self, self.lightning_module)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/finetuning.py", line 292, in on_train_epoch_s
tart
    self._store(pl_module, opt_idx, num_param_groups, current_param_groups)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/finetuning.py", line 278, in _store
    if opt_idx not in self._internal_state:
AttributeError: 'ObjectDetectionFineTuning' object has no attribute '_internal_state'
Training: 0it [00:00, ?it/s]

CocoDemo: laptop and grid run instructions need a refresh

https://docs.grid.ai/examples/vision/coco

local run fails

% python train.py --gpus=0 --max_epochs=1

/opt/miniconda3/envs/CocoDemo/lib/python3.8/site-packages/pytorch_lightning/metrics/__init__.py:43: LightningDeprecationWarning: `pytorch_lightning.metrics.*` module has been renamed to `torchmetrics.*` and split off to its own package (https://github.com/PyTorchLightning/metrics) since v1.3 and will be removed in v1.5
  rank_zero_deprecation(
/opt/miniconda3/envs/CocoDemo/lib/python3.8/site-packages/pl_bolts/utils/warnings.py:30: UserWarning: You want to use `wandb` which is not installed yet, install it with `pip install wandb`.
  stdout_func(
/opt/miniconda3/envs/CocoDemo/lib/python3.8/site-packages/pl_bolts/utils/warnings.py:30: UserWarning: You want to use `gym` which is not installed yet, install it with `pip install gym`.
  stdout_func(
Traceback (most recent call last):
  File "train.py", line 2, in <module>
    import flash
  File "/opt/miniconda3/envs/CocoDemo/lib/python3.8/site-packages/flash/__init__.py", line 51, in <module>
    from flash import tabular, text, vision
  File "/opt/miniconda3/envs/CocoDemo/lib/python3.8/site-packages/flash/vision/__init__.py", line 1, in <module>
    from flash.vision.classification import ImageClassificationData, ImageClassifier
  File "/opt/miniconda3/envs/CocoDemo/lib/python3.8/site-packages/flash/vision/classification/__init__.py", line 2, in <module>
    from flash.vision.classification.model import ImageClassifier
  File "/opt/miniconda3/envs/CocoDemo/lib/python3.8/site-packages/flash/vision/classification/model.py", line 23, in <module>
    from flash.vision.backbones import backbone_and_num_features
  File "/opt/miniconda3/envs/CocoDemo/lib/python3.8/site-packages/flash/vision/backbones.py", line 23, in <module>
    from pl_bolts.models.self_supervised import SimCLR, SwAV
  File "/opt/miniconda3/envs/CocoDemo/lib/python3.8/site-packages/pl_bolts/__init__.py", line 19, in <module>
    from pl_bolts import (  # noqa: E402
  File "/opt/miniconda3/envs/CocoDemo/lib/python3.8/site-packages/pl_bolts/datamodules/__init__.py", line 5, in <module>
    from pl_bolts.datamodules.experience_source import DiscountedExperienceSource, ExperienceSource, ExperienceSourceDataset
  File "/opt/miniconda3/envs/CocoDemo/lib/python3.8/site-packages/pl_bolts/datamodules/experience_source.py", line 24, in <module>
    class ExperienceSourceDataset(IterableDataset):
  File "/opt/miniconda3/envs/CocoDemo/lib/python3.8/site-packages/torch/utils/data/_typing.py", line 273, in __new__
    return super().__new__(cls, name, bases, namespace, **kwargs)  # type: ignore[call-overload]
  File "/opt/miniconda3/envs/CocoDemo/lib/python3.8/abc.py", line 85, in __new__
    cls = super().__new__(mcls, name, bases, namespace, **kwargs)
  File "/opt/miniconda3/envs/CocoDemo/lib/python3.8/site-packages/torch/utils/data/_typing.py", line 370, in _dp_init_subclass
    raise TypeError("Expected 'Iterator' as the return annotation for `__iter__` of {}"
TypeError: Expected 'Iterator' as the return annotation for `__iter__` of ExperienceSourceDataset, but found typing.Iterable

grid run fails the same way

grid run train.py \
--gpus=0 \
--max_epochs=1

 % grid logs nocturnal-fox-57-exp0

[stdout] [2021-06-28T15:04:44.196149+00:00] /opt/conda/lib/python3.8/site-packages/pl_bolts/utils/warnings.py:30: UserWarning: You want to use `gym` which is not installed yet, install it with `pip install gym`.
[stdout] [2021-06-28T15:04:44.196182+00:00]   stdout_func(
[stdout] [2021-06-28T15:04:44.196187+00:00] /opt/conda/lib/python3.8/site-packages/pl_bolts/utils/warnings.py:30: UserWarning: You want to use `cv2` which is not installed yet, install it with `pip install opencv-python`.
[stdout] [2021-06-28T15:04:44.196191+00:00]   stdout_func(
[stdout] [2021-06-28T15:04:44.196194+00:00]
[stdout] [2021-06-28T15:04:44.284747+00:00] /gridai/project/data/coco128.zip:   0%|          | 0/21628 [00:00<?, ?KB/s]
[stdout] [2021-06-28T15:04:44.384739+00:00] /gridai/project/data/coco128.zip:   9%|▉         | 1969/21628 [00:00<00:00, 19687.72KB/s]
[stdout] [2021-06-28T15:04:44.484853+00:00] /gridai/project/data/coco128.zip:  29%|██▉       | 6242/21628 [00:00<00:00, 23486.98KB/s]
[stdout] [2021-06-28T15:04:44.593370+00:00] /gridai/project/data/coco128.zip:  44%|████▍     | 9550/21628 [00:00<00:00, 25718.71KB/s]
[stdout] [2021-06-28T15:04:44.684966+00:00] /gridai/project/data/coco128.zip:  65%|██████▍   | 13962/21628 [00:00<00:00, 29396.62KB/s]
[stdout] [2021-06-28T15:04:44.786345+00:00] /gridai/project/data/coco128.zip:  79%|███████▉  | 17157/21628 [00:00<00:00, 30117.44KB/s]
[stdout] [2021-06-28T15:04:44.798606+00:00] /gridai/project/data/coco128.zip:  97%|█████████▋| 21046/21628 [00:00<00:00, 32303.19KB/s]
[stdout] [2021-06-28T15:04:45.931534+00:00] /gridai/project/data/coco128.zip: 21629KB [00:00, 35413.58KB/s]
[stdout] [2021-06-28T15:04:45.931550+00:00] Downloading: "https://download.pytorch.org/models/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth" to /root/.cache/torch/hub/checkpoints/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth
[stdout] [2021-06-28T15:04:45.931555+00:00]
[stdout] [2021-06-28T15:04:46.016983+00:00]   0%|          | 0.00/160M [00:00<?, ?B/s]
[stdout] [2021-06-28T15:04:46.155178+00:00]   5%|▍         | 7.30M/160M [00:00<00:02, 76.1MB/s]
[stdout] [2021-06-28T15:04:46.244420+00:00]  11%|█         | 17.2M/160M [00:00<00:01, 82.0MB/s]
[stdout] [2021-06-28T15:04:46.314801+00:00]  15%|█▌        | 24.0M/160M [00:00<00:01, 78.5MB/s]
[stdout] [2021-06-28T15:04:46.414542+00:00]  21%|██        | 33.8M/160M [00:00<00:01, 84.5MB/s]
[stdout] [2021-06-28T15:04:46.516028+00:00]  27%|██▋       | 43.7M/160M [00:00<00:01, 89.4MB/s]
[stdout] [2021-06-28T15:04:46.619571+00:00]  34%|███▍      | 54.9M/160M [00:00<00:01, 95.9MB/s]
[stdout] [2021-06-28T15:04:46.719658+00:00]  41%|████      | 65.6M/160M [00:00<00:00, 99.5MB/s]
[stdout] [2021-06-28T15:04:46.819577+00:00]  47%|████▋     | 75.2M/160M [00:00<00:00, 99.7MB/s]
[stdout] [2021-06-28T15:04:46.919790+00:00]  54%|█████▍    | 86.4M/160M [00:00<00:00, 104MB/s]
[stdout] [2021-06-28T15:04:47.019646+00:00]  61%|██████    | 97.5M/160M [00:01<00:00, 108MB/s]
[stdout] [2021-06-28T15:04:47.131654+00:00]  68%|██████▊   | 108M/160M [00:01<00:00, 110MB/s]
[stdout] [2021-06-28T15:04:47.230091+00:00]  74%|███████▍  | 119M/160M [00:01<00:00, 108MB/s]
[stdout] [2021-06-28T15:04:47.323644+00:00]  81%|████████  | 130M/160M [00:01<00:00, 109MB/s]
[stdout] [2021-06-28T15:04:47.442097+00:00]  88%|████████▊ | 140M/160M [00:01<00:00, 110MB/s]
[stdout] [2021-06-28T15:04:47.602608+00:00]  94%|█████████▍| 151M/160M [00:01<00:00, 105MB/s]
[stdout] [2021-06-28T15:04:47.810078+00:00] 100%|██████████| 160M/160M [00:01<00:00, 99.0MB/s]
[stdout] [2021-06-28T15:04:47.810114+00:00] GPU available: False, used: False
[stdout] [2021-06-28T15:04:47.810240+00:00] TPU available: False, using: 0 TPU cores
[stdout] [2021-06-28T15:04:47.858016+00:00] /opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:69: UserWarning: you defined a validation_step but have no val_dataloader. Skipping val loop
[stdout] [2021-06-28T15:04:47.858048+00:00]   warnings.warn(*args, **kwargs)
[stdout] [2021-06-28T15:04:47.858052+00:00]
[stdout] [2021-06-28T15:04:47.858056+00:00]   | Name    | Type       | Params
[stdout] [2021-06-28T15:04:47.858060+00:00] ---------------------------------------
[stdout] [2021-06-28T15:04:47.858064+00:00] 0 | model   | FasterRCNN | 41.8 M
[stdout] [2021-06-28T15:04:47.858068+00:00] 1 | metrics | ModuleDict | 0
[stdout] [2021-06-28T15:04:47.858072+00:00] ---------------------------------------
[stdout] [2021-06-28T15:04:47.858075+00:00] 15.0 M    Trainable params
[stdout] [2021-06-28T15:04:47.858079+00:00] 26.8 M    Non-trainable params
[stdout] [2021-06-28T15:04:47.858083+00:00] 41.8 M    Total params
[stdout] [2021-06-28T15:04:47.858086+00:00] 167.021   Total estimated model params size (MB)
[stdout] [2021-06-28T15:04:47.866515+00:00] loading annotations into memory...
[stdout] [2021-06-28T15:04:47.866577+00:00] Done (t=0.01s)
[stdout] [2021-06-28T15:04:47.866584+00:00] creating index...
[stdout] [2021-06-28T15:04:47.866588+00:00] index created!
[stdout] [2021-06-28T15:04:47.866592+00:00]
[stdout] [2021-06-28T15:04:47.866700+00:00] Validation sanity check: 0it [00:00, ?it/s]
[stdout] [2021-06-28T15:04:47.868226+00:00]
[stdout] [2021-06-28T15:04:47.868340+00:00]
[stdout] [2021-06-28T15:04:47.871139+00:00] Traceback (most recent call last):
[stdout] [2021-06-28T15:04:47.871153+00:00]   File "train.py", line 43, in <module>
[stdout] [2021-06-28T15:04:47.871158+00:00]     trainer.finetune(model, datamodule)
[stdout] [2021-06-28T15:04:47.871161+00:00]   File "/opt/conda/lib/python3.8/site-packages/flash/core/trainer.py", line 90, in finetune
[stdout] [2021-06-28T15:04:47.871166+00:00]     return super().fit(model, train_dataloader, val_dataloaders, datamodule)
[stdout] [2021-06-28T15:04:47.871170+00:00]   File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 458, in fit
[stdout] [2021-06-28T15:04:47.871174+00:00]     self._run(model)
[stdout] [2021-06-28T15:04:47.871178+00:00]   File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 756, in _run
[stdout] [2021-06-28T15:04:47.871182+00:00]     self.dispatch()
[stdout] [2021-06-28T15:04:47.871240+00:00]   File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 797, in dispatch
[stdout] [2021-06-28T15:04:47.871247+00:00]     self.accelerator.start_training(self)
[stdout] [2021-06-28T15:04:47.871250+00:00]   File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
[stdout] [2021-06-28T15:04:47.871254+00:00]     self.training_type_plugin.start_training(trainer)
[stdout] [2021-06-28T15:04:47.871258+00:00]   File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
[stdout] [2021-06-28T15:04:47.871262+00:00]     self._results = trainer.run_stage()
[stdout] [2021-06-28T15:04:47.871266+00:00]   File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
[stdout] [2021-06-28T15:04:47.871270+00:00]     return self.run_train()
[stdout] [2021-06-28T15:04:47.871273+00:00]   File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 865, in run_train
[stdout] [2021-06-28T15:04:47.871277+00:00]     self.train_loop.on_train_epoch_start(epoch)
[stdout] [2021-06-28T15:04:47.871281+00:00]   File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 169, in on_train_epoch_start
[stdout] [2021-06-28T15:04:47.871285+00:00]     self.trainer.call_hook("on_train_epoch_start")
[stdout] [2021-06-28T15:04:47.871288+00:00]   File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1223, in call_hook
[stdout] [2021-06-28T15:04:47.871292+00:00]     trainer_hook(*args, **kwargs)
[stdout] [2021-06-28T15:04:47.871296+00:00]   File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/callback_hook.py", line 90, in on_train_epoch_start
[stdout] [2021-06-28T15:04:47.871300+00:00]     callback.on_train_epoch_start(self, self.lightning_module)
[stdout] [2021-06-28T15:04:47.871303+00:00]   File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/finetuning.py", line 292, in on_train_epoch_start
[stdout] [2021-06-28T15:04:47.871307+00:00]     self._store(pl_module, opt_idx, num_param_groups, current_param_groups)
[stdout] [2021-06-28T15:04:47.871312+00:00]   File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/finetuning.py", line 278, in _store
[stdout] [2021-06-28T15:04:47.871316+00:00]     if opt_idx not in self._internal_state:
[stdout] [2021-06-28T15:04:47.871320+00:00] AttributeError: 'ObjectDetectionFineTuning' object has no attribute '_internal_state'

grid sync-env: create usage example

grid sync-env Overview

The grid sync-env command is used to generate requirements.txt and check it into GitHub. Below is a summary of the commands for using this functionality:

touch requirements.txt
grid sync-env
git add requirements.txt
git commit -m "requirements.txt synced with current environment"

grid sync-env detailed example

Let's assume a script ran successfully in a local environment and a hyperparameter sweep is the next step. Below is a detailed example starting from a repository that does not have a pre-existing requirements.txt file.

  • grid run will warn about the missing requirements.txt and recommend the dependencies it identified:
grid run pytorch_lightning_simple.py --datadir grid:fashionmnist:7

WARNING Neither a CPU or GPU number was specified. 1 CPU will be used as a default. To use N GPUs pass in '--grid_gpus N' flag.


        WARNING
        No requirements.txt or environment.yml found but we identified below
        dependencies from your source. Your build could crash or not
        start.

        torch
        pytorch_lightning
        torchvision
        optuna
        packaging


                Run submitted!
                `grid status` to list all runs
                `grid status invisible-swallow-386` to see all experiments for this run

                ----------------------
                Submission summary
                ----------------------
                script:                  pytorch_lightning_simple.py
                instance_type:           t2.medium
                use_spot:                False
                cloud_provider:          aws
                cloud_credentials:       cc-qdfdk
                grid_name:               invisible-swallow-386
                datastore_name:          None
                datastore_version:       None
                datastore_mount_dir:     None
  • Create requirements.txt as suggested:
touch requirements.txt
grid sync-env
git add requirements.txt
git commit -m "requirements.txt synced with current environment"
  • grid run now executes with requirements.txt in place:
grid run pytorch_lightning_simple.py --datadir grid:fashionmnist:7

WARNING Neither a CPU or GPU number was specified. 1 CPU will be used as a default. To use N GPUs pass in '--grid_gpus N' flag.

                Run submitted!
                `grid status` to list all runs
                `grid status ruby-mustang-170` to see all experiments for this run

                ----------------------
                Submission summary
                ----------------------
                script:                  pytorch_lightning_simple.py
                instance_type:           t2.medium
                use_spot:                False
                cloud_provider:          aws
                cloud_credentials:       cc-qdfdk
                grid_name:               ruby-mustang-170
                datastore_name:          None
                datastore_version:       None
                datastore_mount_dir:     None

List of limitations June 21

Running list of limitations, to be updated for each release:

  • Grid.ai Run icon: currently does not work when a Datastore is used in a run.
  • The same phone number cannot be shared across multiple login IDs.

BYOC: publish list of permission in docs referenced in Terraform

https://docs.grid.ai/platform/about-these-features/adding-custom-cloud-credentials needs to list the permissions requested by the https://github.com/gridai/terraform-aws-gridbyoc.git script. https://github.com/gridai/terraform-aws-gridbyoc/blob/main/main.tf has the list of permissions:

  • "eks:*",
  • "ecr:*",
  • "arn:aws:iam::aws:policy/AmazonEC2FullAccess",
  • "arn:aws:iam::aws:policy/AmazonGuardDutyFullAccess",
  • "arn:aws:iam::aws:policy/AmazonRoute53ResolverFullAccess",
  • "arn:aws:iam::aws:policy/AmazonS3FullAccess",
  • "arn:aws:iam::aws:policy/AmazonSNSFullAccess",
  • "arn:aws:iam::aws:policy/AmazonSQSFullAccess",
  • "arn:aws:iam::aws:policy/AmazonVPCFullAccess",
  • "arn:aws:iam::aws:policy/CloudWatchLogsFullAccess",
  • "arn:aws:iam::aws:policy/IAMFullAccess",
