
mlops-basics's Introduction

Hi there

About Me

I am currently working at Enterpret as a Founding Engineer - NLP.

My interests are in Unsupervised Algorithms, Semantic Similarity, and productionizing NLP models. I also like to follow the latest research happening in the NLP domain.

Check out my latest blogs here: Deep Learning Blogs

Besides work, I like cooking 🥘, cycling 🚴‍♀️, and kdramas 🎥.

Languages & Tools:

Python, PyTorch, Docker

Contact

Twitter LinkedIn Gmail


mlops-basics's People

Contributors

graviraja, ravirajag


mlops-basics's Issues

Lambda Environment Support for SQLite3 Older Versions

dvc pull fails with the error shown in the screenshot. I tried downloading the latest SQLite3 and compiling it from source, but it turns out the Lambda environment doesn't give much control.

# Configuring the remote storage in DVC
RUN dvc init --no-scm -f
RUN dvc remote add -d storage s3://basicmlops/dvcstore

# Pulling the trained model
RUN dvc pull dvcfiles/trained_model.dvc

[Screenshot of the dvc pull error]

Metric not matched in `early_stopping_callbacks` (Week 1)

Hi @graviraja ,

I found that train.py in Week 1 throws a RuntimeError while running.

RuntimeError: Early stopping conditioned on metric `val_loss` which is not available. 
Pass in or modify your `EarlyStopping` callback to use any of the following: 
`valid/loss_epoch`, `valid/acc`, `valid/precision_macro`, `valid/recall_macro`, 
`valid/precision_micro`, `valid/recall_micro`, `valid/f1`, `valid/loss`, `train/loss_step`, 
`train/acc_step`, `train/loss_epoch`, `train/acc_epoch`, `train/loss`, `train/acc`

"val_loss" on line 52 seems to be the reason since it does not match.

monitor="val_loss", patience=3, verbose=True, mode="min"

I believe this should be changed to "valid/loss"?
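For reference, a minimal sketch of the corrected callback, assuming PyTorch Lightning's standard EarlyStopping API:

from pytorch_lightning.callbacks import EarlyStopping

# Monitor a metric name that the model actually logs ("valid/loss")
early_stopping_callback = EarlyStopping(
    monitor="valid/loss", patience=3, verbose=True, mode="min"
)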

Thanks for the amazing resource.

AWS Lambda Function: Test error

I am following the Week 8 blog post. When I deploy the container using Lambda and try to test it from the Test section, the execution fails and I get the following log. Can you please help with this? Does this function already have internet access to download that model? (Sorry if the question is naive.)

es/transformers/file_utils.py", line 1518, in get_from_cache
os.makedirs(cache_dir, exist_ok=True)
File "/usr/lib/python3.6/os.py", line 210, in makedirs
makedirs(head, mode, exist_ok)
File "/usr/lib/python3.6/os.py", line 210, in makedirs
makedirs(head, mode, exist_ok)
File "/usr/lib/python3.6/os.py", line 210, in makedirs
makedirs(head, mode, exist_ok)
File "/usr/lib/python3.6/os.py", line 220, in makedirs
mkdir(name, mode)
OSError: [Errno 30] Read-only file system: '/home/sbx_user1051'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/bin/uvicorn", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1137, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1062, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 763, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/uvicorn/main.py", line 425, in main
run(app, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/uvicorn/main.py", line 447, in run
server.run()
File "/usr/local/lib/python3.6/dist-packages/uvicorn/server.py", line 69, in run
return asyncio.get_event_loop().run_until_complete(self.serve(sockets=sockets))
File "/usr/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
return future.result()
File "/usr/local/lib/python3.6/dist-packages/uvicorn/server.py", line 76, in serve
config.load()
File "/usr/local/lib/python3.6/dist-packages/uvicorn/config.py", line 448, in load
self.loaded_app = import_from_string(self.app)
File "/usr/local/lib/python3.6/dist-packages/uvicorn/importer.py", line 21, in import_from_string
module = importlib.import_module(module_str)
File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 994, in _gcd_import
File "<frozen importlib._bootstrap>", line 971, in _find_and_load
File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 678, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "./app.py", line 5, in <module>
predictor = ColaONNXPredictor("./models/model.onnx")
File "./inference_onnx.py", line 12, in __init__
self.processor = DataModule()
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/core/datamodule.py", line 49, in __call__
obj = type.__call__(cls, *args, **kwargs)
File "./data.py", line 20, in __init__
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
File "/usr/local/lib/python3.6/dist-packages/transformers/models/auto/tokenization_auto.py", line 534, in from_pretrained
config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/transformers/models/auto/configuration_auto.py", line 450, in from_pretrained
config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/transformers/configuration_utils.py", line 532, in get_config_dict
raise EnvironmentError(msg)
OSError: Can't load config for 'google/bert_uncased_L-2_H-128_A-2'. Make sure that:
- 'google/bert_uncased_L-2_H-128_A-2' is a correct model identifier listed on 'https://huggingface.co/models'
- or 'google/bert_uncased_L-2_H-128_A-2' is the correct path to a directory containing a config.json file
END RequestId: 95ab620c-bf63-46ab-8c02-27fb4099485b
REPORT RequestId: 95ab620c-bf63-46ab-8c02-27fb4099485b	Duration: 65041.97 ms	Billed Duration: 65042 ms	Memory Size: 1024 MB	Max Memory Used: 446 MB	
RequestId: 95ab620c-bf63-46ab-8c02-27fb4099485b Error: Runtime exited with error: exit status 1
Runtime.ExitError
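This isn't an authoritative fix, but a common workaround I'm aware of: the Lambda filesystem is read-only except for /tmp, so the Hugging Face cache has to be redirected there (or the model files baked into the image at build time) before transformers tries to download anything. A minimal sketch, with a placeholder cache path:

import os

# Lambda only allows writes under /tmp, so point the Hugging Face cache there
# before any download is triggered ("/tmp/hf_cache" is a placeholder path).
os.environ["TRANSFORMERS_CACHE"] = "/tmp/hf_cache"

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "google/bert_uncased_L-2_H-128_A-2",
    cache_dir="/tmp/hf_cache",  # cache_dir can also be passed per call
)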

Different module metrics for train/val

Module metrics store internal state computed over each call on different batches, so using the same instance for both train and val might not give correct results when computed over an epoch (with on_epoch=True) in the step hooks. I'd suggest creating separate instances for each stage (train & val).

ref: https://torchmetrics.readthedocs.io/en/latest/pages/quickstart.html#module-metrics

self.accuracy_metric = torchmetrics.Accuracy()
self.f1_metric = torchmetrics.F1(num_classes=self.num_classes)
self.precision_macro_metric = torchmetrics.Precision(
    average="macro", num_classes=self.num_classes
)
self.recall_macro_metric = torchmetrics.Recall(
    average="macro", num_classes=self.num_classes
)
self.precision_micro_metric = torchmetrics.Precision(average="micro")
self.recall_micro_metric = torchmetrics.Recall(average="micro")
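A minimal sketch of the suggested change, mirroring the __init__ fragment above (the train_/val_ attribute names are my own, not from the repo):

# Separate instances so train and validation accumulate their
# epoch-level state independently.
self.train_accuracy_metric = torchmetrics.Accuracy()
self.val_accuracy_metric = torchmetrics.Accuracy()
self.train_f1_metric = torchmetrics.F1(num_classes=self.num_classes)
self.val_f1_metric = torchmetrics.F1(num_classes=self.num_classes)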

KeyError in Week 1

Hi @graviraja,
I was following your tutorial on wandb logging and found a potential error in the training code when visualizing poorly performing samples with a wandb Table.

def on_validation_end(self, trainer, pl_module):
    val_batch = next(iter(self.datamodule.val_dataloader()))
    sentences = val_batch["sentence"]

When running, this results in a KeyError: "sentence", referring to line 21 (sentences = val_batch["sentence"]).

I think this is because "sentence" is not among the columns set up for val_data in data.py. Please correct me if I'm wrong. Thanks :)

def setup(self, stage=None):
    # we set up only relevant datasets when stage is specified
    if stage == "fit" or stage is None:
        self.train_data = self.train_data.map(self.tokenize_data, batched=True)
        self.train_data.set_format(
            type="torch", columns=["input_ids", "attention_mask", "label"]
        )
        self.val_data = self.val_data.map(self.tokenize_data, batched=True)
        self.val_data.set_format(
            type="torch",
            columns=["input_ids", "attention_mask", "label"],
            output_all_columns=True,
        )
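As a quick sanity check (just a sketch, assuming the DataModule class from data.py and that output_all_columns=True is set as above), the raw "sentence" strings should then survive the torch formatting and show up in the batch:

from data import DataModule

# Sketch: verify that the raw "sentence" column is still accessible
dm = DataModule()
dm.prepare_data()
dm.setup(stage="fit")
val_batch = next(iter(dm.val_dataloader()))
print(val_batch["sentence"][:3])  # expected: a list of raw sentence strings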

DVCFiles alternative not working

dvc add ../models/best-checkpoint.ckpt --file trained_model.dvc
This command is not working for me. I mean, if I run this command, the trained_model.dvc file is not created in the dvcfiles folder.

Does it work on Windows?

From my understanding, the Hugging Face Transformers Docker image can only work on Linux. Is that right?

Advice on how to deploy and run my Docker image on my own local machine

I have my own local machine, and I'd like to substitute it for what the AWS S3 bucket does.

As I understand it, these are the steps:

  1. Open a VPN to my local network
  2. Open SFTP/SSH on my local machine
  3. In GitHub Actions, add the VPN and SSH keys
  4. Send my own commands.

Would you give me any advice or articles to read?

What is Postman? How to set it up?

In the Week 8 blog post, you mentioned doing this:

Now that the API Gateway is integrated, let's call it. Go to Postman and create a POST method with the Invoke URL and a body containing the sentence parameter.

Can you please elaborate on this in the blog post? I don't know what Postman is. Is it a tab in the Lambda function or API Gateway, or is it a separate AWS service?

Thanks.
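For context, Postman is a third-party API client, not an AWS service. The same request can also be sent without it; here is a minimal sketch in Python, assuming the endpoint expects a JSON body with a sentence field (the invoke URL below is a placeholder):

import requests

# Placeholder: replace with the Invoke URL shown in the API Gateway console
INVOKE_URL = "https://<api-id>.execute-api.us-west-2.amazonaws.com/default/<lambda-name>"

response = requests.post(INVOKE_URL, json={"sentence": "The movie was great!"})
print(response.status_code, response.json())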

How to push a container to a specific repository in GitHub Actions?

To push an image xyz to an ECR repository abc using the CLI, we would do the following:

docker tag xyz 246113150184.dkr.ecr.us-west-2.amazonaws.com/abc
docker push 246113150184.dkr.ecr.us-west-2.amazonaws.com/abc

How do we do the same using GitHub Actions? In the example given in the Week 7 blog (shown below), the image name mlops-basics and the repository name mlops-basics are the same, so it works. How do we do it if they are different?

name: Create Docker Container

on: [push]

jobs:
  mlops-container:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: ./week_7_ecr
    steps:
      - name: Checkout
        uses: actions/checkout@v2
        with:
          ref: ${{ github.ref }}
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-west-2
      - name: Build container
        run: |
          docker build --build-arg AWS_ACCOUNT_ID=${{ secrets.AWS_ACCOUNT_ID }} \
                       --build-arg AWS_ACCESS_KEY_ID=${{ secrets.AWS_ACCESS_KEY_ID }} \
                       --build-arg AWS_SECRET_ACCESS_KEY=${{ secrets.AWS_SECRET_ACCESS_KEY }} \
                       --tag mlops-basics .
      - name: Push2ECR
        id: ecr
        uses: jwalton/gh-ecr-push@v1
        with:
          access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          region: us-west-2
          image: mlops-basics:latest

Please correct me if I have misunderstood something.

A question on week_0

Hello Raviraj, your great work is really helping me.

I installed all the packages using requirements.txt and trained a model without any issue.
But I have an issue when I run inference on sentences.

In my case, every sentence gets the same result (almost the same score).
Could you check it out? Thanks!


Potential Error in Blog of Week 0

Hey Raviraj, great work. I am learning a lot.

In the Week 0 blog post, you mentioned the following:
As an example, I will be implementing the EarlyStopping callback. This helps the model not to overfit by monitoring a certain parameter (val_loss in this case). The best model will be saved in the dirpath.

But in the code, you have used the ModelCheckpoint callback and did not use EarlyStopping. I believe EarlyStopping and ModelCheckpoint are two completely different callbacks. Please correct me if I am wrong. Thanks.
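For reference, a minimal sketch showing the two callbacks configured side by side and passed to the Trainer together, assuming PyTorch Lightning's standard API (the paths and metric names are illustrative):

import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint

# EarlyStopping halts training when the monitored metric stops improving;
# ModelCheckpoint saves the best weights to dirpath. They are independent callbacks.
early_stopping = EarlyStopping(monitor="val_loss", patience=3, mode="min")
checkpoint = ModelCheckpoint(
    dirpath="./models", monitor="val_loss", mode="min", save_top_k=1
)

trainer = pl.Trainer(callbacks=[early_stopping, checkpoint])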
