
grid-docs's Introduction

Grid in 3 minutes

Introduction

Grid is designed for developing and training deep learning models at scale.

The TL;DR of using Grid is this:

  • Create a DATASTORE with your dataset.
  • Spin up an interactive SESSION to develop, analyze and prototype models/ideas.
  • When you have something that works, train it at scale via RUN.
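
Putting the three steps together, a minimal CLI sketch could look like this (the dataset folder, script name and script arguments are placeholders; the commands mirror those shown later in this document):

# 1. create a versioned datastore from a local folder
grid datastore create my_dataset_folder --name my-dataset

# 2. spin up an interactive session for prototyping
grid session create --instance_type 2_m60_8gb

# 3. train at scale, optionally as a hyperparameter sweep
grid run train.py --learning_rate "[0.001, 0.01]"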

This 3-minute video shows you how to execute code on cloud instances with zero code changes and how to debug/prototype and develop models with multi-GPU cloud instances.

intro.mp4

Here is a quick overview of:

  • Datastores
  • Sessions
  • Runs

Infrastructure is gone

Grid allocates all the machines and GPUs you need on demand, so you only pay for what you need when you need it.

Grid lets you focus on your work, NOT on the infrastructure. Create an account here to get free credits and get started!

Artifacts, logs, etc...

Grid handles all the other parts of developing and training at scale:

  • Artifacts
  • Logs
  • Metrics
  • etc...

Just run your files and watch the magic happen

Experiment Managers

Grid works with the experiment manager of your choice! 🔥🔥

No need to change your code!

Datastores: (Scalable datasets)

In Grid, we've introduced Datastores, high-performance, low-latency, versioned datasets.


The UI supports creating Datastores of < 1 GB

datastore.mp4

Use the CLI for larger datastores

grid datastore create imagenet_folder --name imagenet
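
Once a datastore exists, it can be attached to a session or referenced from a run. A sketch based on the commands used further down in this document (the names are placeholders, and --datadir is assumed to be the training script's own argument):

# mount the datastore in an interactive session
grid session create --name imagenet-dev --datastore_name imagenet

# or reference it from a run using the grid:<name>:<version> form
grid run train.py --datadir grid:imagenet:1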

Sessions (Interactive machines)

For prototyping/debugging/analyzing, sometimes you need a LIVE machine. We call these Sessions.

Web UI: Starting a new session

session.mp4

CLI: Starting a new session

# session with 2 M60 GPUs
grid session create --instance_type 2_m60_8gb
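
Once the session is up, you can attach to it over SSH. A short sketch, assuming a session named my-session and using the grid session ssh command that appears elsewhere in this document:

# open an SSH shell into the running session
grid session ssh my-session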

RUN (Sweep and train anything)

RUN any public or private repository with Grid in 5 steps:

This 1-minute video shows how to RUN from the web app:

run.mp4

If you prefer to use the CLI, simply replace python with grid run.

First, install Grid and log in:

pip install lightning-grid --upgrade
grid login

Now clone the repo and hit run!

# clone repo
git clone https://github.com/williamFalcon/hello
cd hello

# start the sweep
grid run hello.py --number "[1, 2]" --food_item "['pizza', 'pear']"

This command automatically produces these equivalent calls:

python hello.py --number 1 --food_item 'pizza'
python hello.py --number 2 --food_item 'pizza'

python hello.py --number 1 --food_item 'pear'
python hello.py --number 2 --food_item 'pear'
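
Each generated call runs as a separate experiment inside the run. A sketch of how you might monitor them, based on the grid status and grid logs commands shown further down in this document (run and experiment names are placeholders):

grid status                # list all runs
grid status my-run         # list the experiments inside one run
grid logs my-run-exp0      # stream logs for a single experiment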

That's it!

We learned that:

  • RUN executes scripts on cloud machines (and runs hyperparameter sweeps)
  • SESSION starts an interactive machine with the CPU/GPUs of your choice
  • DATASTORE is an optimized, low-latency auto-versioned dataset.
  • Grid has a Web app and a CLI with similar functionality.

That's all you need to know about Grid!

Next!

Now try our first tutorial

grid-docs's People

Contributors

adam-lightning, alexandercort, dmitsf, ematta, ericchea, krishnakalyan3, luca3rd, oojo12, panos-is, pritamsoni-hsr, rasbt, rlizzo, robert-s-lee, sunitaprakash


grid-docs's Issues

GitHub Enterprise support?

Hi, do you support GitHub Enterprise on the self-hosted version?

It should be as simple as making the host name github.com configurable.

Thanks.

How can I add RStudio web access support?

I could potentially create a conda environment.yaml that prepares a machine with RStudio installed, but how do I access the web interface? I suppose SSH tunneling is supported?

The other question is: could the RStudio front-end be added as another link in the Grid.AI web interface, next to the already available VSCode, JupyterLab and SSH options?
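
As a possible sketch for the web-access part of the question: assuming plain SSH port forwarding to the session host is available and RStudio Server listens on its default port 8787, the tunnel could look like this (the host name is a placeholder):

# forward RStudio Server's default port to the local machine
ssh -L 8787:localhost:8787 <session-host>
# then open http://localhost:8787 in a local browser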

BYOC: update adding-custom-cloud-credentials.md

  • add export in front of EXTERNAL_ID and ROLE_ARN (see the sketch after this list)
  • add osx install of git, terraform, jq and AWS CLI
  • add ubuntu / debian install of git, terraform, jq and AWS CLI
  • add redhat install of git, terraform, jq and AWS CLI
  • add grid tool install before running grid commands in the last step
  • s/terraform output --json/terraform output -json/
  • check instance availability
aws ec2 describe-instance-type-offerings --location-type availability-zone  --filters Name=instance-type,Values=p3.16xlarge
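
A sketch covering the first items in the checklist above; the values are placeholders and the package names/sources are assumptions, not taken from the Grid docs:

# export the values instead of plain assignment
export EXTERNAL_ID=<external-id>
export ROLE_ARN=<role-arn>

# macOS install of git, terraform, jq and the AWS CLI (assuming Homebrew)
brew install git terraform jq awscli

# Ubuntu / Debian equivalent (terraform typically comes from HashiCorp's apt
# repository or a release download rather than the default repos)
sudo apt-get install -y git jq awscli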

Can a single big instance be shared between users?

I saw there are settings like --gpus and --memory, but does this map to a similar concept as in SLURM, where users can share big instance types and SLURM partitions them between users?

From what I have seen so far, Grid.AI looks like a single-user, single-instance solution.

Further explain `scale_down_seconds`

Hi, could you further explain the scale_down_seconds setting?

I'm looking for a setting to allow me to keep the instance "hot" even after my jobs are done so I don't have to wait for an instance to start again, i.e. to avoid the cold-start problem.

Add details about Grid actions commands

The Grid YML spec supports three actions:

  • on_image_build: commands passed to the image builder, which are interpreted as RUN commands in a Dockerfile.

  • on_before_training_start: commands that are executed sequentially before the main experiment process starts.

  • on_after_training_end: same as above, but executed after the main process ends.
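
Purely as a hypothetical illustration of the three hooks above, the snippet below writes a config file that uses them. The file name and the surrounding YAML structure are assumptions and should be checked against the actual Grid YML spec; only the three action names come from this issue.

cat > grid-config.yml <<'EOF'
actions:
  on_image_build:
    - apt-get update && apt-get install -y ffmpeg   # becomes a RUN step in the image build
  on_before_training_start:
    - python prepare_data.py                        # runs sequentially before the experiment starts
  on_after_training_end:
    - python upload_results.py                      # runs after the main process ends
EOF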

Document that hard links work, but symlinks do not, in datastores

Assume there are many files and directories in the current directory. If only a subset of them is needed in a datastore, ln can be used to select that subset. ln supports hard and symbolic links: hard links work, symbolic links do not.

  • hard link example
mkdir tmp2
cd tmp2
ln ../test1.txt
ln ../test2.txt
grid datastore create --source . --name hardlink
│ prod-2     │     hardlink │       1 │   1.0 MB │ 2021-11-23 22:01 │ [email protected] │ Succeeded │
grid session create --name hardlinktest --datastore_name hardlink
# wait for session to come up
grid session ssh hardlinktest
cat /home/jovyan/hardlink/*
cat: /home/jovyan/hardlink/lost+found: Is a directory
test1.txt
test2.txt
  • sym link example [ does not work ]
echo "test1" > test1.txt
echo "test2" > test2.txt
mkdir tmp
cd tmp
ln -s ../test1.txt test1.txt
ln -s ../test2.txt test2.txt
cat *
grid datastore create --source . --name symlink

│ prod-2     │      symlink │       1 │  0 Bytes │ 2021-11-23 21:56 │ [email protected] │    Failed │

kinetics-video-classification: as-is script does not run

The https://github.com/robert-s-lee/KineticsDemo/tree/fix-requirements.txt branch has the WIP fix.

Two issues with https://docs.grid.ai/examples/vision/kinetics-video-classification:

  • Fixed No module named 'flash.data' by correcting two import typos in the code:

from flash.core.data.utils import download_data
from flash.core.utilities.imports import _KORNIA_AVAILABLE, _PYTORCHVIDEO_AVAILABLE

which fixes:

Traceback (most recent call last):
  File "train.py", line 14, in <module>
    from flash.data.utils import download_data
ModuleNotFoundError: No module named 'flash.data'
  • Still needs a fix for from pytorchvideo.transforms import ApplyTransformToKey, RandomShortSideScale, UniformTemporalSubsample:
% python train.py

/opt/miniconda3/envs/kd/lib/python3.7/site-packages/kornia/augmentation/augmentation.py:1833: DeprecationWarning: GaussianBlur is no longer maintained and will be removed from the future versions. Please use RandomGaussianBlur instead.
  category=DeprecationWarning,
Please, run `pip install torchvideo`
  • To test
conda create --name=kd python=3.7
conda activate kd
pip install lightning-grid --upgrade
grid login 
git clone https://github.com/aribornstein/KineticsDemo.git
cd KineticsDemo
pip install -r requirements.txt 

Fix Broken Links in grid-docs

To check for broken links I used a Python link-checking library. Perhaps it's a good idea to validate all links before a commit?

python -m pip install linkcheckmd
linkcheckMarkdown .
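
Assuming the link checker exits non-zero when it finds broken links, one way to validate links before every commit is a simple git pre-commit hook (a sketch, not part of the repository):

# install a pre-commit hook that runs the link checker
cat > .git/hooks/pre-commit <<'EOF'
#!/bin/sh
linkcheckMarkdown . || exit 1
EOF
chmod +x .git/hooks/pre-commit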

Files with broken links:

  • changelog.md
  • platform/11_known-issues.md
  • platform/10_tips-and-tricks.md
  • platform/1_Billing/billing-rates.md
  • features/runs/1_README.md
  • features/runs/2_private-repos.md
  • features/runs/1_Creating Runs/1_Basic Runs/2_Adv Runs/3_sweep-syntax.md
  • features/runs/1_Creating Runs/2_Adv Runs/3_creating-runs-from-config.md
  • features/runs/1_Creating Runs/2_Adv Runs/2_creating-runs-with-dockerfile.md
  • features/runs/1_Creating Runs/2_Adv Runs/5_auto-resume-experiments.md
  • features/sessions/changing-instance-type.md
  • features/sessions/8_how-to-ssh-into-a-session.md
  • examples/running-with-different-frameworks/running-julia-programs.md

coco: fix the warning and errors

The https://docs.grid.ai/examples/vision/coco run has the following warnings and fails. https://github.com/robert-s-lee/CocoDemo/tree/rslee-refresh has the WIP fix.

/opt/conda/lib/python3.8/site-packages/pl_bolts/utils/warnings.py:30: UserWarning: You want to use `gym` which is not installed yet, install it with `pip install gym`.
  stdout_func(
/opt/conda/lib/python3.8/site-packages/pl_bolts/utils/warnings.py:30: UserWarning: You want to use `cv2` which is not installed yet, install it with `pip install opencv-python`.
  stdout_func(

It fails with the following:

  File "/opt/conda/lib/python3.8/site-packages/flash/core/trainer.py", line 90, in finetune
    return super().fit(model, train_dataloader, val_dataloaders, datamodule)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 458, in fit
    self._run(model)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 756, in _run
    self.dispatch()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 797, in dispatch
    self.accelerator.start_training(self)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_trainin
g
    self.training_type_plugin.start_training(trainer)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 14
4, in start_training
    self._results = trainer.run_stage()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
    return self.run_train()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 865, in run_train
    self.train_loop.on_train_epoch_start(epoch)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 169, in on_train_epoch_
start
    self.trainer.call_hook("on_train_epoch_start")
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1223, in call_hook
    trainer_hook(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/callback_hook.py", line 90, in on_train_epoch_s
tart
    callback.on_train_epoch_start(self, self.lightning_module)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/finetuning.py", line 292, in on_train_epoch_s
tart
    self._store(pl_module, opt_idx, num_param_groups, current_param_groups)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/finetuning.py", line 278, in _store
    if opt_idx not in self._internal_state:
AttributeError: 'ObjectDetectionFineTuning' object has no attribute '_internal_state'
Training: 0it [00:00, ?it/s]

CocoDemo: laptop and grid run instructions need a refresh

https://docs.grid.ai/examples/vision/coco

local run fails

% python train.py --gpus=0 --max_epochs=1

/opt/miniconda3/envs/CocoDemo/lib/python3.8/site-packages/pytorch_lightning/metrics/__init__.py:43: LightningDeprecationWarning: `pytorch_lightning.metrics.*` module has been renamed to `torchmetrics.*` and split off to its own package (https://github.com/PyTorchLightning/metrics) since v1.3 and will be removed in v1.5
  rank_zero_deprecation(
/opt/miniconda3/envs/CocoDemo/lib/python3.8/site-packages/pl_bolts/utils/warnings.py:30: UserWarning: You want to use `wandb` which is not installed yet, install it with `pip install wandb`.
  stdout_func(
/opt/miniconda3/envs/CocoDemo/lib/python3.8/site-packages/pl_bolts/utils/warnings.py:30: UserWarning: You want to use `gym` which is not installed yet, install it with `pip install gym`.
  stdout_func(
Traceback (most recent call last):
  File "train.py", line 2, in <module>
    import flash
  File "/opt/miniconda3/envs/CocoDemo/lib/python3.8/site-packages/flash/__init__.py", line 51, in <module>
    from flash import tabular, text, vision
  File "/opt/miniconda3/envs/CocoDemo/lib/python3.8/site-packages/flash/vision/__init__.py", line 1, in <module>
    from flash.vision.classification import ImageClassificationData, ImageClassifier
  File "/opt/miniconda3/envs/CocoDemo/lib/python3.8/site-packages/flash/vision/classification/__init__.py", line 2, in <module>
    from flash.vision.classification.model import ImageClassifier
  File "/opt/miniconda3/envs/CocoDemo/lib/python3.8/site-packages/flash/vision/classification/model.py", line 23, in <module>
    from flash.vision.backbones import backbone_and_num_features
  File "/opt/miniconda3/envs/CocoDemo/lib/python3.8/site-packages/flash/vision/backbones.py", line 23, in <module>
    from pl_bolts.models.self_supervised import SimCLR, SwAV
  File "/opt/miniconda3/envs/CocoDemo/lib/python3.8/site-packages/pl_bolts/__init__.py", line 19, in <module>
    from pl_bolts import (  # noqa: E402
  File "/opt/miniconda3/envs/CocoDemo/lib/python3.8/site-packages/pl_bolts/datamodules/__init__.py", line 5, in <module>
    from pl_bolts.datamodules.experience_source import DiscountedExperienceSource, ExperienceSource, ExperienceSourceDataset
  File "/opt/miniconda3/envs/CocoDemo/lib/python3.8/site-packages/pl_bolts/datamodules/experience_source.py", line 24, in <module>
    class ExperienceSourceDataset(IterableDataset):
  File "/opt/miniconda3/envs/CocoDemo/lib/python3.8/site-packages/torch/utils/data/_typing.py", line 273, in __new__
    return super().__new__(cls, name, bases, namespace, **kwargs)  # type: ignore[call-overload]
  File "/opt/miniconda3/envs/CocoDemo/lib/python3.8/abc.py", line 85, in __new__
    cls = super().__new__(mcls, name, bases, namespace, **kwargs)
  File "/opt/miniconda3/envs/CocoDemo/lib/python3.8/site-packages/torch/utils/data/_typing.py", line 370, in _dp_init_subclass
    raise TypeError("Expected 'Iterator' as the return annotation for `__iter__` of {}"
TypeError: Expected 'Iterator' as the return annotation for `__iter__` of ExperienceSourceDataset, but found typing.Iterable

grid run fails the same way

grid run train.py \
--gpus=0 \
--max_epochs=1

 % grid logs nocturnal-fox-57-exp0

[stdout] [2021-06-28T15:04:44.196149+00:00] /opt/conda/lib/python3.8/site-packages/pl_bolts/utils/warnings.py:30: UserWarning: You want to use `gym` which is not installed yet, install it with `pip install gym`.
[stdout] [2021-06-28T15:04:44.196182+00:00]   stdout_func(
[stdout] [2021-06-28T15:04:44.196187+00:00] /opt/conda/lib/python3.8/site-packages/pl_bolts/utils/warnings.py:30: UserWarning: You want to use `cv2` which is not installed yet, install it with `pip install opencv-python`.
[stdout] [2021-06-28T15:04:44.196191+00:00]   stdout_func(
[stdout] [2021-06-28T15:04:44.196194+00:00]
[stdout] [2021-06-28T15:04:44.284747+00:00] /gridai/project/data/coco128.zip:   0%|          | 0/21628 [00:00<?, ?KB/s]
[stdout] [2021-06-28T15:04:44.384739+00:00] /gridai/project/data/coco128.zip:   9%|▉         | 1969/21628 [00:00<00:00, 19687.72KB/s]
[stdout] [2021-06-28T15:04:44.484853+00:00] /gridai/project/data/coco128.zip:  29%|██▉       | 6242/21628 [00:00<00:00, 23486.98KB/s]
[stdout] [2021-06-28T15:04:44.593370+00:00] /gridai/project/data/coco128.zip:  44%|████▍     | 9550/21628 [00:00<00:00, 25718.71KB/s]
[stdout] [2021-06-28T15:04:44.684966+00:00] /gridai/project/data/coco128.zip:  65%|██████▍   | 13962/21628 [00:00<00:00, 29396.62KB/s]
[stdout] [2021-06-28T15:04:44.786345+00:00] /gridai/project/data/coco128.zip:  79%|███████▉  | 17157/21628 [00:00<00:00, 30117.44KB/s]
[stdout] [2021-06-28T15:04:44.798606+00:00] /gridai/project/data/coco128.zip:  97%|█████████▋| 21046/21628 [00:00<00:00, 32303.19KB/s]
[stdout] [2021-06-28T15:04:45.931534+00:00] /gridai/project/data/coco128.zip: 21629KB [00:00, 35413.58KB/s]
[stdout] [2021-06-28T15:04:45.931550+00:00] Downloading: "https://download.pytorch.org/models/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth" to /root/.cache/torch/hub/checkpoints/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth
[stdout] [2021-06-28T15:04:45.931555+00:00]
[stdout] [2021-06-28T15:04:46.016983+00:00]   0%|          | 0.00/160M [00:00<?, ?B/s]
[stdout] [2021-06-28T15:04:46.155178+00:00]   5%|▍         | 7.30M/160M [00:00<00:02, 76.1MB/s]
[stdout] [2021-06-28T15:04:46.244420+00:00]  11%|█         | 17.2M/160M [00:00<00:01, 82.0MB/s]
[stdout] [2021-06-28T15:04:46.314801+00:00]  15%|█▌        | 24.0M/160M [00:00<00:01, 78.5MB/s]
[stdout] [2021-06-28T15:04:46.414542+00:00]  21%|██        | 33.8M/160M [00:00<00:01, 84.5MB/s]
[stdout] [2021-06-28T15:04:46.516028+00:00]  27%|██▋       | 43.7M/160M [00:00<00:01, 89.4MB/s]
[stdout] [2021-06-28T15:04:46.619571+00:00]  34%|███▍      | 54.9M/160M [00:00<00:01, 95.9MB/s]
[stdout] [2021-06-28T15:04:46.719658+00:00]  41%|████      | 65.6M/160M [00:00<00:00, 99.5MB/s]
[stdout] [2021-06-28T15:04:46.819577+00:00]  47%|████▋     | 75.2M/160M [00:00<00:00, 99.7MB/s]
[stdout] [2021-06-28T15:04:46.919790+00:00]  54%|█████▍    | 86.4M/160M [00:00<00:00, 104MB/s]
[stdout] [2021-06-28T15:04:47.019646+00:00]  61%|██████    | 97.5M/160M [00:01<00:00, 108MB/s]
[stdout] [2021-06-28T15:04:47.131654+00:00]  68%|██████▊   | 108M/160M [00:01<00:00, 110MB/s]
[stdout] [2021-06-28T15:04:47.230091+00:00]  74%|███████▍  | 119M/160M [00:01<00:00, 108MB/s]
[stdout] [2021-06-28T15:04:47.323644+00:00]  81%|████████  | 130M/160M [00:01<00:00, 109MB/s]
[stdout] [2021-06-28T15:04:47.442097+00:00]  88%|████████▊ | 140M/160M [00:01<00:00, 110MB/s]
[stdout] [2021-06-28T15:04:47.602608+00:00]  94%|█████████▍| 151M/160M [00:01<00:00, 105MB/s]
[stdout] [2021-06-28T15:04:47.810078+00:00] 100%|██████████| 160M/160M [00:01<00:00, 99.0MB/s]
[stdout] [2021-06-28T15:04:47.810114+00:00] GPU available: False, used: False
[stdout] [2021-06-28T15:04:47.810240+00:00] TPU available: False, using: 0 TPU cores
[stdout] [2021-06-28T15:04:47.858016+00:00] /opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:69: UserWarning: you defined a validation_step but have no val_dataloader. Skipping val loop
[stdout] [2021-06-28T15:04:47.858048+00:00]   warnings.warn(*args, **kwargs)
[stdout] [2021-06-28T15:04:47.858052+00:00]
[stdout] [2021-06-28T15:04:47.858056+00:00]   | Name    | Type       | Params
[stdout] [2021-06-28T15:04:47.858060+00:00] ---------------------------------------
[stdout] [2021-06-28T15:04:47.858064+00:00] 0 | model   | FasterRCNN | 41.8 M
[stdout] [2021-06-28T15:04:47.858068+00:00] 1 | metrics | ModuleDict | 0
[stdout] [2021-06-28T15:04:47.858072+00:00] ---------------------------------------
[stdout] [2021-06-28T15:04:47.858075+00:00] 15.0 M    Trainable params
[stdout] [2021-06-28T15:04:47.858079+00:00] 26.8 M    Non-trainable params
[stdout] [2021-06-28T15:04:47.858083+00:00] 41.8 M    Total params
[stdout] [2021-06-28T15:04:47.858086+00:00] 167.021   Total estimated model params size (MB)
[stdout] [2021-06-28T15:04:47.866515+00:00] loading annotations into memory...
[stdout] [2021-06-28T15:04:47.866577+00:00] Done (t=0.01s)
[stdout] [2021-06-28T15:04:47.866584+00:00] creating index...
[stdout] [2021-06-28T15:04:47.866588+00:00] index created!
[stdout] [2021-06-28T15:04:47.866592+00:00]
[stdout] [2021-06-28T15:04:47.866700+00:00] Validation sanity check: 0it [00:00, ?it/s]
[stdout] [2021-06-28T15:04:47.868226+00:00]
[stdout] [2021-06-28T15:04:47.868340+00:00]
[stdout] [2021-06-28T15:04:47.871139+00:00] Traceback (most recent call last):
[stdout] [2021-06-28T15:04:47.871153+00:00]   File "train.py", line 43, in <module>
[stdout] [2021-06-28T15:04:47.871158+00:00]     trainer.finetune(model, datamodule)
[stdout] [2021-06-28T15:04:47.871161+00:00]   File "/opt/conda/lib/python3.8/site-packages/flash/core/trainer.py", line 90, in finetune
[stdout] [2021-06-28T15:04:47.871166+00:00]     return super().fit(model, train_dataloader, val_dataloaders, datamodule)
[stdout] [2021-06-28T15:04:47.871170+00:00]   File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 458, in fit
[stdout] [2021-06-28T15:04:47.871174+00:00]     self._run(model)
[stdout] [2021-06-28T15:04:47.871178+00:00]   File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 756, in _run
[stdout] [2021-06-28T15:04:47.871182+00:00]     self.dispatch()
[stdout] [2021-06-28T15:04:47.871240+00:00]   File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 797, in dispatch
[stdout] [2021-06-28T15:04:47.871247+00:00]     self.accelerator.start_training(self)
[stdout] [2021-06-28T15:04:47.871250+00:00]   File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
[stdout] [2021-06-28T15:04:47.871254+00:00]     self.training_type_plugin.start_training(trainer)
[stdout] [2021-06-28T15:04:47.871258+00:00]   File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
[stdout] [2021-06-28T15:04:47.871262+00:00]     self._results = trainer.run_stage()
[stdout] [2021-06-28T15:04:47.871266+00:00]   File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
[stdout] [2021-06-28T15:04:47.871270+00:00]     return self.run_train()
[stdout] [2021-06-28T15:04:47.871273+00:00]   File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 865, in run_train
[stdout] [2021-06-28T15:04:47.871277+00:00]     self.train_loop.on_train_epoch_start(epoch)
[stdout] [2021-06-28T15:04:47.871281+00:00]   File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 169, in on_train_epoch_start
[stdout] [2021-06-28T15:04:47.871285+00:00]     self.trainer.call_hook("on_train_epoch_start")
[stdout] [2021-06-28T15:04:47.871288+00:00]   File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1223, in call_hook
[stdout] [2021-06-28T15:04:47.871292+00:00]     trainer_hook(*args, **kwargs)
[stdout] [2021-06-28T15:04:47.871296+00:00]   File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/callback_hook.py", line 90, in on_train_epoch_start
[stdout] [2021-06-28T15:04:47.871300+00:00]     callback.on_train_epoch_start(self, self.lightning_module)
[stdout] [2021-06-28T15:04:47.871303+00:00]   File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/finetuning.py", line 292, in on_train_epoch_start
[stdout] [2021-06-28T15:04:47.871307+00:00]     self._store(pl_module, opt_idx, num_param_groups, current_param_groups)
[stdout] [2021-06-28T15:04:47.871312+00:00]   File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/finetuning.py", line 278, in _store
[stdout] [2021-06-28T15:04:47.871316+00:00]     if opt_idx not in self._internal_state:
[stdout] [2021-06-28T15:04:47.871320+00:00] AttributeError: 'ObjectDetectionFineTuning' object has no attribute '_internal_state'

grid sync-env: create usage example

grid sync-env Overview

The grid sync-env command is used to generate requirements.txt and check it into GitHub. Below is a summary of the commands for using this functionality:

touch requirements.txt
grid sync-env
git add requirements.txt
git commit -m "requirements.txt synced with current environment"

grid sync-env detailed example

Let's assume a script ran successfully in a local environment and a hyperparameter sweep is the next step. Below is a detailed example starting from a repository that does not have a pre-existing requirements.txt file.

  • grid run will warn about the missing requirements.txt and recommend the dependencies it identified:
grid run pytorch_lightning_simple.py --datadir grid:fashionmnist:7

WARNING Neither a CPU or GPU number was specified. 1 CPU will be used as a default. To use N GPUs pass in '--grid_gpus N' flag.


        WARNING
        No requirements.txt or environment.yml found but we identified below
        dependencies from your source. Your build could crash or not
        start.

        torch
        pytorch_lightning
        torchvision
        optuna
        packaging


                Run submitted!
                `grid status` to list all runs
                `grid status invisible-swallow-386` to see all experiments for this run

                ----------------------
                Submission summary
                ----------------------
                script:                  pytorch_lightning_simple.py
                instance_type:           t2.medium
                use_spot:                False
                cloud_provider:          aws
                cloud_credentials:       cc-qdfdk
                grid_name:               invisible-swallow-386
                datastore_name:          None
                datastore_version:       None
                datastore_mount_dir:     None
  • Create requirements.txt as suggested:
touch requirements.txt
grid sync-env
git add requirements.txt
git commit -m "requirements.txt synced with current environment"
  • grid run now executes with requirements.txt in place:
grid run pytorch_lightning_simple.py --datadir grid:fashionmnist:7

WARNING Neither a CPU or GPU number was specified. 1 CPU will be used as a default. To use N GPUs pass in '--grid_gpus N' flag.

                Run submitted!
                `grid status` to list all runs
                `grid status ruby-mustang-170` to see all experiments for this run

                ----------------------
                Submission summary
                ----------------------
                script:                  pytorch_lightning_simple.py
                instance_type:           t2.medium
                use_spot:                False
                cloud_provider:          aws
                cloud_credentials:       cc-qdfdk
                grid_name:               ruby-mustang-170
                datastore_name:          None
                datastore_version:       None
                datastore_mount_dir:     None

List of limitations June 21

Running list of limitations, to be updated for each release:

  • Grid.ai Run icon: currently does not work when a Datastore is used in a run.
  • The same phone number cannot be shared across multiple login IDs.

BYOC: publish list of permission in docs referenced in Terraform

https://docs.grid.ai/platform/about-these-features/adding-custom-cloud-credentials needs to list the permissions requested by the https://github.com/gridai/terraform-aws-gridbyoc.git script. https://github.com/gridai/terraform-aws-gridbyoc/blob/main/main.tf has the list of permissions:

  • "eks:*",
  • "ecr:*",
  • "arn:aws:iam::aws:policy/AmazonEC2FullAccess",
  • "arn:aws:iam::aws:policy/AmazonGuardDutyFullAccess",
  • "arn:aws:iam::aws:policy/AmazonRoute53ResolverFullAccess",
  • "arn:aws:iam::aws:policy/AmazonS3FullAccess",
  • "arn:aws:iam::aws:policy/AmazonSNSFullAccess",
  • "arn:aws:iam::aws:policy/AmazonSQSFullAccess",
  • "arn:aws:iam::aws:policy/AmazonVPCFullAccess",
  • "arn:aws:iam::aws:policy/CloudWatchLogsFullAccess",
  • "arn:aws:iam::aws:policy/IAMFullAccess",
