Comments (13)

rafaspadilha commented on May 21, 2024

Thanks for using FarmVibes.AI and reporting the issue, @Amr-MKamal. I'll investigate this and return to you as soon as possible.

from farmvibes-ai.

Amr-MKamal commented on May 21, 2024

I used the same input_region provided in the example file, POLYGON((-118.83884470335134 46.135858606707956,-119.59905735755602 46.135858606707956,-119.59356419349352 46.759102248950285,-119.20796470232867 46.75533893062969,-118.84433786741384 46.759102248950285,-118.83884470335134 46.135858606707956)), but for the dates 2021 & 2022.

I also didn't change the chip size or the other training parameters:

CHIP_SIZE = 256
EPOCH_SIZE = 1024
BATCH_SIZE = 16
NDVI_STACK_BANDS = 37
NUM_WORKERS = 2  # Change this depending on available memory and number of cores
VAL_RATIO = 0.2 # Ratio of the validation subset of the input region

# Training hyperparameters
LR = 1e-3  # Learning rate
WD = 1e-2  # Weight decay
MAX_EPOCHS = 10  # How many epochs to train for

I will try decreasing the chip size/img_size to 128 & tell you how that goes.

rafaspadilha commented on May 21, 2024

@Amr-MKamal I couldn't properly reproduce your error.

Are you running for the same region of the notebook (within the Continental USA, where CDL is available)?
Are you using the crop_env.yaml conda environment?

The AssertionError that you are seeing is coming from the SegmentationModel._shared_step() method from notebooks/crop_segmentation/notebook_lib/models.py:

def _shared_step(self, batch: Dict[str, Any], batch_idx: int) -> Dict[str, Any]:
    pred = self(batch["image"])
    for t in pred, batch["mask"]:
        assert torch.all(torch.isfinite(t))
    loss = self.loss(pred, batch["mask"])

    return {"loss": loss, "preds": pred.detach(), "target": batch["mask"]}

That assertion checks that all the values in both the prediction and the mask (in this case, the CDL maps) are defined and finite. This should be the case for the CDL, as the samples generated by the CDLMask dataset are the result of a torch.isin operation that returns a boolean tensor.
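Because the loop in _shared_step covers both pred and batch["mask"], a bare AssertionError does not say which of the two tripped the check. The following is a minimal sketch of a counting helper that could narrow this down. The names count_non_finite and report are hypothetical and not part of notebook_lib, and the sample values are made up for illustration.

```python
import math
from typing import Dict, Iterable


def count_non_finite(values: Iterable[float]) -> Dict[str, int]:
    """Count NaN and infinite entries in a flat iterable of numbers."""
    counts = {"nan": 0, "inf": 0, "total": 0}
    for v in values:
        counts["total"] += 1
        if math.isnan(v):
            counts["nan"] += 1
        elif math.isinf(v):
            counts["inf"] += 1
    return counts


def report(named_values: Dict[str, Iterable[float]]) -> Dict[str, Dict[str, int]]:
    """Run the count for each named group of values."""
    return {name: count_non_finite(vals) for name, vals in named_values.items()}


# Made-up values standing in for a flattened image and mask.
example = report({
    "image": [0.1, float("nan"), 0.3],
    "mask": [0.0, 1.0, 1.0],
})
print(example)
```

With a real batch, the values could be obtained with something like batch["image"].flatten().tolist(), so the input NDVI stack can be checked alongside the prediction and the mask.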

Amr-MKamal commented on May 21, 2024

@rafaspadilha, that error was from the old version of this notebook. The error I'm getting now for local training in Section [4] (after trying the example area for 2021-2022) is:

RuntimeError                              Traceback (most recent call last)
Cell In[4], line 6
      4 plt.figure(figsize=(10, 10))
      5 ax = plt.gca()
----> 6 gpd.GeoSeries([bbox_to_shapely(b) for b in data.train_dataloader().sampler]).boundary.plot(ax=ax, color="C0")
      7 gpd.GeoSeries([bbox_to_shapely(b) for b in data.val_dataloader().sampler]).boundary.plot(ax=ax, color="C1")
      8 gpd.GeoSeries(bbox_to_shapely(train_roi)).boundary.plot(ax=ax, color="black")

Cell In[4], line 6, in <listcomp>(.0)
      4 plt.figure(figsize=(10, 10))
      5 ax = plt.gca()
----> 6 gpd.GeoSeries([bbox_to_shapely(b) for b in data.train_dataloader().sampler]).boundary.plot(ax=ax, color="C0")
      7 gpd.GeoSeries([bbox_to_shapely(b) for b in data.val_dataloader().sampler]).boundary.plot(ax=ax, color="C1")
      8 gpd.GeoSeries(bbox_to_shapely(train_roi)).boundary.plot(ax=ax, color="black")

File ~/farmvibes-ai/notebooks/crop_segmentation/notebook_lib/modules.py:75, in YearRandomGeoSampler.__iter__(self)
     74 def __iter__(self) -> Iterator[BoundingBox]:
---> 75     for bbox in super().__iter__():
     76         yield year_bbox(bbox)

File ~/anaconda3/envs/crop-seg/lib/python3.8/site-packages/torchgeo/samplers/single.py:130, in RandomGeoSampler.__iter__(self)
    123 """Return the index of a dataset.
    124 
    125 Returns:
    126     (minx, maxx, miny, maxy, mint, maxt) coordinates to index a dataset
    127 """
    128 for _ in range(len(self)):
    129     # Choose a random tile, weighted by area
--> 130     idx = torch.multinomial(self.areas, 1)
    131     hit = self.hits[idx]
    132     bounds = BoundingBox(*hit.bounds)

RuntimeError: cannot sample n_sample > prob_dist.size(-1) samples without replacement

rafaspadilha commented on May 21, 2024

This error seems to happen because the RandomGeoSampler is not able to sample chips (i.e., smaller image regions that will be used as training data for the segmentation model) within the input NDVI or CDL rasters. This may happen if your input region is very small or the chip size is too big.

Are you using the same region of the notebook or have you decreased the size of the input geometry?
If so, you may want to alter the parameter img_size of the CropSegDataModule (please, refer to the class definition).
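A rough way to sanity-check the "region too small or chip too big" case is to compare the chip's footprint on the ground with the region's extent. The sketch below is a simplified model and not torchgeo's actual code. The function name chip_fits is hypothetical, and the 10 m default resolution is an assumption taken from the "Converting CDLMask resolution from 30.0 to 10.0" log line in this thread.

```python
def chip_fits(
    width_m: float,
    height_m: float,
    chip_size_px: int,
    resolution_m: float = 10.0,  # assumed pixel size in meters
) -> bool:
    """Return True if a square chip fits inside a region of the given extent."""
    # Ground footprint of one chip side, in meters.
    chip_extent_m = chip_size_px * resolution_m
    return width_m >= chip_extent_m and height_m >= chip_extent_m


# A 2 km x 2 km region cannot hold a 256-pixel chip at 10 m (2560 m per side),
# but it can hold a 128-pixel chip (1280 m per side).
print(chip_fits(2000.0, 2000.0, 256))
print(chip_fits(2000.0, 2000.0, 128))
```

If this kind of check passes for the region and the sampler still fails, the cause is probably something other than the chip size.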

Please, let me know if that fixes your issue.

Amr-MKamal commented on May 21, 2024

@rafaspadilha I tried going all the way down to CHIP_SIZE = 1 and I still get the same error. Reducing these parameters, alone or together, doesn't solve it.

rafaspadilha commented on May 21, 2024

I see. Please, could you check for me:

How many NDVI and CDL rasters are you passing as input to CropSegDataModule? Could you run len(ndvi_rasters) and len(cdl_rasters) to check that?
Did you change the positive_indices parameter of CropSegDataModule?

Amr-MKamal commented on May 21, 2024

len(ndvi_rasters) is 330 and len(cdl_rasters) is 1.
However, I noticed in the crop segmentation model documentation that both train_years and val_years default to 2020. I assumed that running with a different date (2021) would update them automatically, but that was not the case.
After editing the CropSegDataModule call as follows, local training worked successfully:
data = CropSegDataModule(
    ndvi_rasters,
    cdl_rasters,
    ndvi_stack_bands=NDVI_STACK_BANDS,
    img_size=(CHIP_SIZE, CHIP_SIZE),
    epoch_size=EPOCH_SIZE,
    batch_size=BATCH_SIZE,
    num_workers=NUM_WORKERS,
    positive_indices=constants.CROP_INDICES,
    val_ratio=VAL_RATIO,
    train_years=[2021],
    val_years=[2021],
)
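The year mismatch is consistent with the earlier sampler error: if the sampler is asked for 2020 and every raster covers 2021, no tile matches and there is nothing to sample from. The sketch below illustrates that with plain Python. RasterStub, years_covered and hits_for_years are hypothetical names, and the assumption that each raster carries a (start, end) time_range is mine and not confirmed in this thread.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Tuple


@dataclass
class RasterStub:
    """Hypothetical stand-in for a raster that carries a time range."""
    time_range: Tuple[datetime, datetime]


def years_covered(rasters: List[RasterStub]) -> List[int]:
    """Collect every calendar year touched by the rasters' time ranges."""
    years = set()
    for raster in rasters:
        start, end = raster.time_range
        years.update(range(start.year, end.year + 1))
    return sorted(years)


def hits_for_years(rasters: List[RasterStub], query_years: List[int]) -> List[RasterStub]:
    """Keep only the rasters whose time range overlaps the requested years."""
    wanted = set(query_years)
    return [r for r in rasters if wanted & set(years_covered([r]))]


cdl = [RasterStub((datetime(2021, 1, 1), datetime(2021, 12, 31)))]
print(len(hits_for_years(cdl, [2020])))  # 0: default year, nothing to sample
print(len(hits_for_years(cdl, [2021])))  # 1: matching year
```

Deriving the year list from the data (here with years_covered) instead of hard-coding it would avoid relying on the 2020 default.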

Amr-MKamal commented on May 21, 2024

However, I still get the assertion error in cell [6] :\

rafaspadilha commented on May 21, 2024

Hey, @Amr-MKamal. Yes, train_years and val_years default to 2020 in the CropSegDataModule. I will update the notebook in the next release to make it clearer that these parameters should be updated accordingly. I'm sorry for that.

Are you still getting the RuntimeError: cannot sample n_sample > prob_dist.size(-1) samples without replacement?
Did you change anything from your previous run that worked successfully?

Amr-MKamal commented on May 21, 2024

@rafaspadilha Thank you. No, I'm now getting the same assertion error I got at the beginning, in Cell [6]:

  | Name          | Type              | Params
0 | model         | FPN               | 23.3 M
1 | loss          | BCEWithLogitsLoss | 0
2 | train_metrics | MetricCollection  | 0
3 | val_metrics   | MetricCollection  | 0

23.3 M Trainable params
0 Non-trainable params
23.3 M Total params
93.048 Total estimated model params size (MB)

Converting CDLMask CRS from EPSG:5070 to EPSG:32611
Converting CDLMask resolution from 30.0 to 10.0

Sanity Checking DataLoader 0: 0%
0/2 [00:00<?, ?it/s]

AssertionError Traceback (most recent call last)
Cell In[6], line 6
3 model = SegmentationModel.load_from_checkpoint(CHPT_PATH)
4 else:
5 # Train it now
----> 6 trainer.fit(model, data)

File ~/anaconda3/envs/crop-seg/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:696, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
677 r"""
678 Runs the full optimization routine.
679
(...)
693 datamodule: An instance of :class:`~pytorch_lightning.core.datamodule.LightningDataModule`.
694 """
695 self.strategy.model = model
--> 696 self._call_and_handle_interrupt(
697 self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
698 )

File ~/anaconda3/envs/crop-seg/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:650, in Trainer._call_and_handle_interrupt(self, trainer_fn, *args, **kwargs)
648 return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
649 else:
--> 650 return trainer_fn(*args, **kwargs)
651 # TODO(awaelchli): Unify both exceptions below, where KeyboardError doesn't re-raise
652 except KeyboardInterrupt as exception:

File ~/anaconda3/envs/crop-seg/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:735, in Trainer._fit_impl(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
731 ckpt_path = ckpt_path or self.resume_from_checkpoint
732 self._ckpt_path = self.__set_ckpt_path(
733 ckpt_path, model_provided=True, model_connected=self.lightning_module is not None
734 )
--> 735 results = self._run(model, ckpt_path=self.ckpt_path)
737 assert self.state.stopped
738 self.training = False

File ~/anaconda3/envs/crop-seg/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:1166, in Trainer._run(self, model, ckpt_path)
1162 self._checkpoint_connector.restore_training_state()
1164 self._checkpoint_connector.resume_end()
-> 1166 results = self._run_stage()
1168 log.detail(f"{self.__class__.__name__}: trainer tearing down")
1169 self._teardown()

File ~/anaconda3/envs/crop-seg/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:1252, in Trainer._run_stage(self)
1250 if self.predicting:
1251 return self._run_predict()
-> 1252 return self._run_train()

File ~/anaconda3/envs/crop-seg/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:1274, in Trainer._run_train(self)
1271 self._pre_training_routine()
1273 with isolate_rng():
-> 1274 self._run_sanity_check()
1276 # enable train mode
1277 self.model.train()

File ~/anaconda3/envs/crop-seg/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:1343, in Trainer._run_sanity_check(self)
1341 # run eval step
1342 with torch.no_grad():
-> 1343 val_loop.run()
1345 self._call_callback_hooks("on_sanity_check_end")
1347 # reset logger connector

File ~/anaconda3/envs/crop-seg/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py:200, in Loop.run(self, *args, **kwargs)
198 try:
199 self.on_advance_start(*args, **kwargs)
--> 200 self.advance(*args, **kwargs)
201 self.on_advance_end()
202 self._restarting = False

File ~/anaconda3/envs/crop-seg/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py:155, in EvaluationLoop.advance(self, *args, **kwargs)
153 if self.num_dataloaders > 1:
154 kwargs["dataloader_idx"] = dataloader_idx
--> 155 dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
157 # store batch level output per dataloader
158 self._outputs.append(dl_outputs)

File ~/anaconda3/envs/crop-seg/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py:143, in EvaluationEpochLoop.advance(self, data_fetcher, dl_max_batches, kwargs)
140 self.batch_progress.increment_started()
142 # lightning module methods
--> 143 output = self._evaluation_step(**kwargs)
144 output = self._evaluation_step_end(output)
146 self.batch_progress.increment_processed()

File ~/anaconda3/envs/crop-seg/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py:240, in EvaluationEpochLoop._evaluation_step(self, **kwargs)
229 """The evaluation step (validation_step or test_step depending on the trainer's state).
230
231 Args:
(...)
237 the outputs of the step
238 """
239 hook_name = "test_step" if self.trainer.testing else "validation_step"
--> 240 output = self.trainer._call_strategy_hook(hook_name, *kwargs.values())
242 return output

File ~/anaconda3/envs/crop-seg/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:1704, in Trainer._call_strategy_hook(self, hook_name, *args, **kwargs)
1701 return
1703 with self.profiler.profile(f"[Strategy]{self.strategy.__class__.__name__}.{hook_name}"):
-> 1704 output = fn(*args, **kwargs)
1706 # restore current_fx when nested context
1707 pl_module._current_fx_name = prev_fx_name

File ~/anaconda3/envs/crop-seg/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py:370, in Strategy.validation_step(self, *args, **kwargs)
368 with self.precision_plugin.val_step_context():
369 assert isinstance(self.model, ValidationStep)
--> 370 return self.model.validation_step(*args, **kwargs)

File ~/farmvibes-ai/notebooks/crop_segmentation/notebook_lib/models.py:97, in SegmentationModel.validation_step(self, batch, batch_idx)
96 def validation_step(self, batch: Dict[str, Any], batch_idx: int) -> Dict[str, Any]:
---> 97 return self._shared_step(batch, batch_idx)

File ~/farmvibes-ai/notebooks/crop_segmentation/notebook_lib/models.py:78, in SegmentationModel._shared_step(self, batch, batch_idx)
76 pred = self(batch["image"])
77 for t in pred, batch["mask"]:
---> 78 assert torch.all(torch.isfinite(t))
79 loss = self.loss(pred, batch["mask"])
81 return {"loss": loss, "preds": pred.detach(), "target": batch["mask"]}

AssertionError:

Amr-MKamal commented on May 21, 2024

@rafaspadilha As a last resort, I went into notebook_lib/models.py and commented out this section:

# for t in pred, batch["mask"]:
#     assert torch.all(torch.isfinite(t))

The rest of the cells in the local training notebook then worked successfully, and I was able to save the model as an ONNX model.
However, in the 04_inference notebook the inference workflow could not complete and failed with the following error:

RuntimeError: Failed to run op compute_onnx_from_sequence in workflow run id 3f6de9b0-45ba-43cb-a92e-04025bce9f6c for input with message id 00-3f6de9b045ba43cba92e04025bce9f6c-2c866d66b2dc8cdf-01. Error description: <class 'RuntimeError'>: Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/vibe_agent/worker.py", line 123, in run_op
    return factory.build(spec).run(input, cache_info)
  File "/opt/conda/lib/python3.8/site-packages/vibe_agent/ops.py", line 106, in run
    stac_results = self._call_validate_op(**{**items, **raw_items})
  File "/opt/conda/lib/python3.8/site-packages/vibe_agent/ops.py", line 72, in _call_validate_op
    results = self.callback(**kwargs)
  File "/app/ops/compute_onnx/compute_onnx.py", line 65, in compute_onnx
    model = ort.InferenceSession(model_path)
  File "/opt/conda/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 347, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/opt/conda/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 384, in _create_inference_session
    sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidProtobuf: [ONNXRuntimeError] : 7 : INVALID_PROTOBUF : Load model from /mnt/onnx_resources/ failed:Protobuf parsing failed.

Note that I get this error when running the provided example (same area and date, 2020) with the provided environment.
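The path in the error message, /mnt/onnx_resources/, ends at a directory and not at a file, which may mean that the model file name did not reach the op. That is only a guess from the message. A small pre-flight check such as the sketch below could confirm or rule it out before the session is created. The function name check_model_file is hypothetical and not part of FarmVibes.AI.

```python
from pathlib import Path


def check_model_file(model_path: str) -> Path:
    """Fail early with a readable message if the path cannot be a model file."""
    path = Path(model_path)
    if path.is_dir():
        raise ValueError(f"{path} is a directory; expected a path to an .onnx file")
    if not path.is_file():
        raise FileNotFoundError(f"No model file found at {path}")
    if path.stat().st_size == 0:
        raise ValueError(f"{path} is empty")
    return path
```

The returned path could then be passed on, for example as ort.InferenceSession(str(check_model_file(model_path))).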

tarishijain commented on May 21, 2024

I was getting a similar error at the training stage after running trainer.fit(model, data). I was working with a reduced dataset of 6 months rather than a year. Will the above solution work here? Also, is it advisable to train the model on less than a year of data?
I have used the same region and the provided crop_env.yaml conda environment.

AssertionError Traceback (most recent call last)
/tmp/ipykernel_294729/3723144614.py in <module>
4 else:
5 # Train it now
----> 6 trainer.fit(model, data)

~/miniconda3/envs/crop-seg/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
694 """
695 self.strategy.model = model
--> 696 self._call_and_handle_interrupt(
697 self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
698 )

~/miniconda3/envs/crop-seg/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py in _call_and_handle_interrupt(self, trainer_fn, *args, **kwargs)
648 return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
649 else:
--> 650 return trainer_fn(*args, **kwargs)
651 # TODO(awaelchli): Unify both exceptions below, where KeyboardError doesn't re-raise
652 except KeyboardInterrupt as exception:

~/miniconda3/envs/crop-seg/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py in _fit_impl(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
733 ckpt_path, model_provided=True, model_connected=self.lightning_module is not None
734 )
--> 735 results = self._run(model, ckpt_path=self.ckpt_path)
736
737 assert self.state.stopped

~/miniconda3/envs/crop-seg/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py in _run(self, model, ckpt_path)
1164 self._checkpoint_connector.resume_end()
1165
-> 1166 results = self._run_stage()
1167
1168 log.detail(f"{self.__class__.__name__}: trainer tearing down")

~/miniconda3/envs/crop-seg/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py in _run_stage(self)
1250 if self.predicting:
1251 return self._run_predict()
-> 1252 return self._run_train()
1253
1254 def _pre_training_routine(self):

~/miniconda3/envs/crop-seg/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py in _run_train(self)
1272
1273 with isolate_rng():
-> 1274 self._run_sanity_check()
1275
1276 # enable train mode

~/miniconda3/envs/crop-seg/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py in _run_sanity_check(self)
1341 # run eval step
1342 with torch.no_grad():
-> 1343 val_loop.run()
1344
1345 self._call_callback_hooks("on_sanity_check_end")

~/miniconda3/envs/crop-seg/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py in run(self, *args, **kwargs)
198 try:
199 self.on_advance_start(*args, **kwargs)
--> 200 self.advance(*args, **kwargs)
201 self.on_advance_end()
202 self._restarting = False

~/miniconda3/envs/crop-seg/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py in advance(self, *args, **kwargs)
153 if self.num_dataloaders > 1:
154 kwargs["dataloader_idx"] = dataloader_idx
--> 155 dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
156
157 # store batch level output per dataloader

~/miniconda3/envs/crop-seg/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py in advance(self, data_fetcher, dl_max_batches, kwargs)
141
142 # lightning module methods
--> 143 output = self._evaluation_step(**kwargs)
144 output = self._evaluation_step_end(output)
145

~/miniconda3/envs/crop-seg/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py in _evaluation_step(self, **kwargs)
238 """
239 hook_name = "test_step" if self.trainer.testing else "validation_step"
--> 240 output = self.trainer._call_strategy_hook(hook_name, *kwargs.values())
241
242 return output

~/miniconda3/envs/crop-seg/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py in _call_strategy_hook(self, hook_name, *args, **kwargs)
1702
1703 with self.profiler.profile(f"[Strategy]{self.strategy.__class__.__name__}.{hook_name}"):
-> 1704 output = fn(*args, **kwargs)
1705
1706 # restore current_fx when nested context

~/miniconda3/envs/crop-seg/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py in validation_step(self, *args, **kwargs)
368 with self.precision_plugin.val_step_context():
369 assert isinstance(self.model, ValidationStep)
--> 370 return self.model.validation_step(*args, **kwargs)
371
372 def test_step(self, *args: Any, **kwargs: Any) -> Optional[STEP_OUTPUT]:

~/farmvibes-ai/notebooks/crop_segmentation/notebook_lib/models.py in validation_step(self, batch, batch_idx)
95
96 def validation_step(self, batch: Dict[str, Any], batch_idx: int) -> Dict[str, Any]:
---> 97 return self._shared_step(batch, batch_idx)
98
99 def validation_step_end(self, outputs: Dict[str, Any]) -> None:

~/farmvibes-ai/notebooks/crop_segmentation/notebook_lib/models.py in _shared_step(self, batch, batch_idx)
76 pred = self(batch["image"])
77 for t in pred, batch["mask"]:
---> 78 assert torch.all(torch.isfinite(t))
79 loss = self.loss(pred, batch["mask"])
80

AssertionError:

Crop Segementation local training error & AML training error about farmvibes-ai HOT 13 OPEN
