gokumohandas / made-with-ml
Learn how to design, develop, deploy and iterate on production-grade ML applications.
Home Page: https://madewithml.com
License: MIT License
Homepage does not seem to work.
In notebooks/01_Foundations/03_NumPy.ipynb, the images in the notebook are missing.
Under Modelling there is a sequence of 3D diagrams showing the flow of shapes. It seems that the vocab_size dimension disappears after the convolution step. The earlier GIFs showing convolution use only integers in each cell instead of a one-hot encoded vector. I was hoping for some explanation of where the vocab_size dimension went during convolution, i.e. what kind of aggregation happened there.
Annotations of the shapes as PyTorch requires them (including the manual transpose of axes 1 and 2) under each step would be very helpful. I have been trying to see the shapes throughout the flow using torchsummary.summary(model,(500,8,1)), but no matter what pattern I try it gives ValueError: too many values to unpack (expected 1).
It breaks in user-defined code, which is strange because I thought it would be torchsummary's issue. If I turn this 3-tuple into a single integer, the user code passes but torchsummary breaks, saying the integer is not iterable.
Does torchsummary work by sending random values through the pipeline to get the shapes, and is that why it has to run user code and why I see this unpacking error? How do I properly use torchsummary to view CNN shapes?
19
20 # Rearrange input so num_channels is in dim 1 (N, C, L)
---> 21 x_in, = inputs
22 if not channel_first:
23 x_in = x_in.transpose(1, 2)
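One possible reading of the traceback, assuming torchsummary builds a random tensor with a small batch dimension and passes it to the model directly: the line x_in, = inputs expects a one-element sequence (for example a batch packed as [X]), so a bare batch with more than one row along its first axis cannot be unpacked. The sketch below reproduces only that unpacking behaviour, with plain Python lists standing in for tensors; the helper name unpack_single is hypothetical and not part of the course code.

```python
def unpack_single(inputs):
    """Mimic `x_in, = inputs`: succeeds only for a one-element sequence."""
    x_in, = inputs
    return x_in


# A batch wrapped in a one-element list, as a training loop might pass it.
wrapped = [[[0.1, 0.2], [0.3, 0.4]]]
unwrapped_ok = unpack_single(wrapped)

# A bare batch with two rows stands in for a tensor whose first axis has size 2.
bare_batch = [[0.1, 0.2], [0.3, 0.4]]
try:
    unpack_single(bare_batch)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(unwrapped_ok)
print(error_message)
```

If that reading holds, the error comes from how the model expects its inputs to be packed rather than from the shape tuple itself, which would explain why changing the tuple never helps.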
In the NumPy notebook, in the section # 3-D array (matrix), I see that when you run the cell one of the outputs is x ndim: 2. The title seems to conflict with how NumPy categorizes it, and I've always considered [[], []] to be 2-D.
Katonic MLOps Platform is a collaborative platform with a Unified UI to manage all data science activities in one place and introduce MLOps practice into the production systems of customers and developers. It is a collection of cloud-native tools for all of these stages of MLOps:
-Data exploration
-Feature preparation
-Model training/tuning
-Model serving, testing and versioning
Katonic is for both data scientists and data engineers looking to build production-grade machine learning implementations and can be run either locally in your development environment or on a production cluster. Katonic provides a unified system—leveraging Kubernetes for containerization and scalability for the portability and repeatability of its pipelines.
It will be great if you can list it on your account
Website -
Katonic One Pager.pdf
Hi! Can you please post the old course in an archive. The new course does not have the foundations part.
While creating the LabelEncoder class, I couldn't understand why the method fit(self, y) ends with return self.
My understanding is that when we call this method, the object's attributes are updated in place, so there is no need to return self?
Please correct me if I'm wrong, I'm just trying to reason through each step of the code.
def fit(self, y):
    classes = np.unique(y)
    for i, class_ in enumerate(classes):
        self.class_to_index[class_] = i
    self.index_to_class = {v: k for k, v in self.class_to_index.items()}
    self.classes = list(self.class_to_index.keys())
    return self  # Why?
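A common reason for this pattern is method chaining: the attributes are indeed updated in place, and returning self only lets the caller construct, fit and use the object in one expression. The sketch below is a dependency-free stand-in, not the course's class: sorted(set(y)) replaces np.unique(y), and the encode method is an assumption added for illustration.

```python
class LabelEncoder:
    """Minimal label encoder; sorted(set(y)) stands in for np.unique(y)."""

    def __init__(self):
        self.class_to_index = {}
        self.index_to_class = {}
        self.classes = []

    def fit(self, y):
        for i, class_ in enumerate(sorted(set(y))):
            self.class_to_index[class_] = i
        self.index_to_class = {v: k for k, v in self.class_to_index.items()}
        self.classes = list(self.class_to_index.keys())
        return self  # hand the fitted object back so calls can be chained

    def encode(self, y):
        # Hypothetical helper: map each label to its integer index.
        return [self.class_to_index[class_] for class_ in y]


labels = ["benign", "malignant", "benign"]
encoder = LabelEncoder().fit(labels)  # construct and fit in one expression
encoded = LabelEncoder().fit(labels).encode(labels)  # chain a further call
print(encoder.classes, encoded)
```

Without return self, fit would return None and both chained expressions would fail, even though the attribute updates themselves would still happen.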
In "3. We'll apply convolution via filters (filter_size, vocab_size, num_filters)", should vocab_size be replaced by embedding_dim?
"first have to decice" (typo)
In "padding our inputs before convolution to result is outputs", "is" should be "in".
device = torch.device("cpu") moves things back to the CPU.
interpretable_trainer.predict_step(dataloader) breaks with AttributeError: 'list' object has no attribute 'dim'. The precise step is F.softmax(z), where for interpretable_model, z is a list of 3 items, so it tries to softmax a list instead of a tensor.
Could you please release instructions for running your code in Jupyter Notebook? In China, we cannot access Google.
Check my blog here: http://bit.ly/RG-mlops
Hi Goku,
I'm going through the Pandas lesson and I noticed that in the Feature engineering section you mention applying a lambda function to create a new feature, but the code for it does not appear. I think it's just a minor typo.
Regards,
Roberto
The topic "Gradients" is PyTorch, not TensorFlow.
Hi Goku, I really enjoy the contents of the course! I have two questions:
Under PyTorch --> Interpretability:
b_unscaled = b * y_scaler.scale_ + y_scaler.mean_ - np.sum(W_unscaled*X_scaler.mean_)
This line seems to be missing a * (y_scaler.scale_/X_scaler.scale_) in the last np.sum term.
The table for W unscaled was also confusing.
It has a sum term shown there. Does that mean that if X began with 2 predictors (this lesson only used 1 predictor), the scaled W would have 2 weights while the sum would aggregate them into 1 unscaled weight? I can't wrap my head around this.
Also, under PyTorch --> Interpretability, W_unscaled = W * (y_scaler.scale_/X_scaler.scale_) uses no sum, so it looks inconsistent with the formula in the table.
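One way to check the two formulas numerically, assuming both X and y were standardized as (value - mean_) / scale_: unscale with the quoted expressions and compare against predicting in standardized space and then inverting the target scaling. All statistics and weights below are made up for illustration, and plain lists stand in for the scaler attributes.

```python
# Hypothetical statistics for two predictors and one target.
x_mean, x_scale = [10.0, 200.0], [2.0, 50.0]
y_mean, y_scale = 5.0, 4.0

# Hypothetical weights and bias learned on standardized data.
W, b = [0.5, -1.5], 0.25

# One unscaled weight per predictor; no sum across predictors.
W_unscaled = [w * y_scale / s for w, s in zip(W, x_scale)]
# The sum appears only in the bias, and W_unscaled already carries the ratio.
b_unscaled = b * y_scale + y_mean - sum(w * m for w, m in zip(W_unscaled, x_mean))


def predict_scaled_then_unscale(x):
    """Standardize x, predict in standardized space, then undo the y scaling."""
    x_std = [(xi - m) / s for xi, m, s in zip(x, x_mean, x_scale)]
    y_std = b + sum(w * xi for w, xi in zip(W, x_std))
    return y_std * y_scale + y_mean


def predict_unscaled(x):
    """Predict directly from raw x with the unscaled parameters."""
    return b_unscaled + sum(w * xi for w, xi in zip(W_unscaled, x))


x = [13.0, 120.0]
print(predict_scaled_then_unscale(x), predict_unscaled(x))
```

Under these assumptions the two predictions agree, which suggests the quoted b_unscaled line needs no extra ratio (W_unscaled already contains it) and that the sum belongs to the bias term rather than to the weights.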
The Coarse-grained section at https://madewithml.com/courses/mlops/evaluation/#intuition imports the function precision_recall_curve with
from sklearn.metrics import precision_recall_curve
but a different function from the same module, precision_recall_fscore_support, is then called to compute the evaluation metrics:
overall_metrics = precision_recall_fscore_support(y_test, y_pred, average="weighted")
Webpage says W dimension is Dx1 but notebook says DxC. Prefer webpage to also show DxC to expose people to the more general multi-class W
Two errors prevent the notebook from running top-down:
a. Extra single quote after "k": plt.scatter(X[:, 0], X[:, 1], c=[colors[_y] for _y in y], s=25, edgecolors="k"')
b. SyntaxError: the double quotes used to index the dictionary close the f-string's double quotes early (happens in 2 cells): print (f"m:b = {class_counts["malignant"]/class_counts["benign"]:.2f}")
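For error b, a sketch of one fix is to use a different quote character inside the replacement field than the one delimiting the f-string. The counts below are made up.

```python
class_counts = {"malignant": 3, "benign": 2}  # made-up counts

# Single quotes inside the braces no longer terminate the double-quoted f-string.
ratio_text = f"m:b = {class_counts['malignant'] / class_counts['benign']:.2f}"
print(ratio_text)
```

Python 3.12 and later accept the original line as written, since PEP 701 allows reusing the enclosing quote character inside replacement fields, but the form above works on older versions too.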
I wish the matrix calculus section had more explanation. It feels like people who understand it won't need the formulas, while people who don't understand it won't be helped much by them.
Some questions I had going through that section:
How did the db = np.sum(dscores, axis=0, keepdims=True) implementation come about? I was expecting a formula describing the gradient with respect to the bias, but earlier the lesson says "We'll leave the bias weights out for now to avoid complicating the backpropagation calculation".
The W_{unscaled} formula includes a sum, which it shouldn't?
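On the bias question, a short derivation may help. It assumes the logits are computed as z = XW + b with one bias per class, and that dscores holds the gradient of the loss with respect to the logits for every sample (with any 1/N averaging already folded in); the index names i (sample), j (feature) and c (class) are introduced here for illustration.

```latex
z_{ic} = \sum_{j} X_{ij} W_{jc} + b_{c}
\qquad\Rightarrow\qquad
\frac{\partial z_{ic}}{\partial b_{c}} = 1

\frac{\partial J}{\partial b_{c}}
  = \sum_{i} \frac{\partial J}{\partial z_{ic}} \,
             \frac{\partial z_{ic}}{\partial b_{c}}
  = \sum_{i} \mathrm{dscores}_{ic}
```

Because the same bias is added to every sample's logits, its gradient is the column-wise sum of dscores, which is what np.sum(dscores, axis=0, keepdims=True) computes, with keepdims=True preserving a 1xC shape.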
Why doesn't it cover SVMs?
I think instead of
ax = sns.barplot(list(tags), list(tag_counts))
it should be
ax = sns.barplot(x=list(tags), y=list(tag_counts))
in code at https://madewithml.com/courses/mlops/exploratory-data-analysis/
In the table at the top, the outputs from the second layer are shown as NxH; should they be NxC?
SyntaxError: plt.scatter(X[:, 0], X[:, 1], c=[colors[_y] for _y in y], edgecolors="k"', s=25)
Extra single quote behind "k" in notebook
Is def init_weights(self): used anywhere? It seems this was defined but not applied anywhere, or does PyTorch implicitly apply it during some step? I was expecting model.apply(init_weights) somewhere.
The objective is to have weights that are able to produce outputs that follow a similar distribution across all neurons
Could there be more clarity on this statement? What exactly is a "distribution across neurons", and what does "similar" mean? What are the objects that we want to be similar? Is it that we have 1 distribution per layer of neurons, where each neuron's single output value contributes to this discrete distribution of outputs in a layer, and we're comparing similarity across layers? (This sounds wrong, because each layer would have a different number of neurons. Can discrete distributions with different numbers of items on the x-axis be compared?)
Is there a missing minus sign in the term (with 1/y) on the left side of "= a(y-1)" in the gradient derivation of dJ/dW2y?
Is it okay to contribute to the segmentation part of the computer vision section? I'm wondering if anyone is already working on it.
Hello! Great content =]
But are you sure you want to remove outliers before feature engineering? E.g. if a feature has a power law distribution (as many do) then you would have outliers that are no longer outliers once you take the log of the feature.
Maybe you could add a warning or something. It makes sense to deal with outliers before your feature store, but I wouldn't want to remove any outliers before having performed a thorough EDA. Now that I think about it, the same goes for dealing with missing values. Of course we are talking about MLOps, so you might have meant that one should follow this guide once they have a model they are happy with, but what you have created seems more all-encompassing.
Just a thought. Feel free to close this issue whenever you want.
Hi, thanks for these impressive courses. They really help me a lot in my career.
I have some thoughts that I want to discuss. As there are more and more end-to-end MLOps platforms that use notebooks to deliver models to production, what is your opinion about converting notebooks to fully testable Python modules (in 2023)? Does that still bring benefits if the platform can ensure reproducibility for training, data processing, and so on?
Thanks in advance for your reply.
Hanyuan
Problem: only the malignant legend was shown (plot data section of the Logistic Regression lesson).
Fix
I am not sure if I should create a PR for a notebook, so I created this issue with working code instead. Please see below.
# Define X and y
X = df[["leukocyte_count", "blood_pressure"]].values
y = df["tumor_class"].values
# Split the data into separate arrays for benign and malignant classes
X_benign = X[y == "benign"]
X_malignant = X[y == "malignant"]
# Plot the data for each class separately
fig, ax = plt.subplots()
ax.scatter(X_benign[:, 0], X_benign[:, 1], c="blue", s=25, edgecolors="k", label="benign")
ax.scatter(X_malignant[:, 0], X_malignant[:, 1], c="red", s=25, edgecolors="k", label="malignant")
ax.set_xlabel("leukocyte count")
ax.set_ylabel("blood pressure")
ax.legend(loc="upper right")
plt.show()
The collate_fn method of class Dataset needs a small change, as otherwise the following error is thrown when creating the dataloader:
ValueError: setting an array element with a sequence
Given Code
"""Processing on a batch."""
# Get inputs
batch = np.array(batch, dtype=object)
X = batch[:, 0] # This line execution throws above error
y = np.stack(batch[:, 1], axis=0)
Suggested solution
"""Processing on a batch."""
# Get inputs
batch = np.array(batch, dtype=object)
X = np.stack(batch[:, 0] ,axis=0)
y = np.stack(batch[:, 1], axis=0)
Looking forward to the next release. Hahaha, thank you very much.
There is a similar task called text classification.
But I want to find a kind of model whose input is a set of keywords, where the keywords do not come from a sentence.
For example:
input ["apple", "pear", "water melon"] --> target class "fruit"
input ["tomato", "potato"] --> target class "vegetable"
Another example:
input ["apple", "Peking", "in summer"] --> target class "Chinese fruit"
input ["tomato", "New York", "in winter"] --> target class "American vegetable"
input ["apple", "Peking", "in winter"] --> target class "Chinese fruit"
input ["tomato", "Peking", "in winter"] --> target class "Chinese vegetable"
Thank you.
Hi GokuMohandas,
I have translated all the content of Made-With-ML into Chinese and posted it on my blog (https://franztao.github.io) and my WeChat blog. As the original copyright owner, do you agree to this derived content being published?
Hi, Thank you for such excellent lessons!!!
I had 3 questions about the lecture. Can you please explain them?
When we pad the one-hot sequences to the max sequence length, why do we not put a 1 at the 0th index (so that it corresponds to the <pad> token)? Why is it currently all zeros?
When we're loading the weights into the interpretableCNN model, why don't we get a weight mismatch error (as we have dropped the FC layer part and we're also not using strict=False)?
My sns heatmap / conv_output has all values equal to 1. It does not resemble yours. Can you help me with this?
url = "https://raw.githubusercontent.com/GokuMohandas/Made-With-ML/main/datasets/projects.json"
projects = json.loads(urlopen(url).read())
print (f"{len(projects)} projects")
print (json.dumps(projects[0], indent=2))
This cell leads to a 404 error (the .json file is no longer in the directory; a .csv file replaces it).
In the "Training" cell of the "Multilayer Perceptrons" notebook, the sentence "6. Repeat steps 2 - 4 until model performs well." should be changed to "6. Repeat steps 2 - 5 until model performs well.", because gradient descent is performed in every iteration.
It'd be great to see how you can easily connect to data sources and do EDA on real-life data with JupySQL.
Happy to help with a PR if needed, we can take one of the available guides!
Problem: Starting from either https://practicalai.me/learn/lessons/ or https://github.com/practicalAI/practicalAI, when attempting to click any of the lessons I see "Notebook not found".
Proposed fix: Possibly "basic_ml" should be added to the path?
When I click "authorize with Github" I see the same thing:
The link given then does not work:
In the case of the "linear regression" notebook, the non-working link given on the "lessons" page is https://colab.research.google.com/github/practicalAI/practicalAI/blob/master/notebooks/04_Linear_Regression.ipynb
Whereas if you go find it on github directly, it is
https://colab.research.google.com/github/practicalAI/practicalAI/blob/master/notebooks/basic_ml/04_Linear_Regression.ipynb
Should I do a notebook on data visualization using matplotlib | seaborn to be added to this already amazing repo?
The following padding function used in https://madewithml.com/courses/foundations/convolutional-neural-networks/ refers to num_classes, which in the example used comes to 500. I was wondering if it should be called num_tokens (as used in other functions). I'm just getting confused since, as I understand it, num_classes = 4.
def pad_sequences(sequences, max_seq_len=0):
    """Pad sequences to max length in sequence."""
    max_seq_len = max(max_seq_len, max(len(sequence) for sequence in sequences))
    num_classes = sequences[0].shape[-1]
    padded_sequences = np.zeros((len(sequences), max_seq_len, num_classes))
    for i, sequence in enumerate(sequences):
        padded_sequences[i][:len(sequence)] = sequence
    return padded_sequences
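To see why the last dimension reads as a vocabulary size rather than a class count, here is a dependency-free sketch of the same padding idea. It assumes each sequence is a list of one-hot rows, uses plain lists instead of NumPy arrays, and adopts the num_tokens name suggested above; the toy vocabulary of 3 tokens is made up.

```python
def pad_sequences(sequences, max_seq_len=0):
    """Pad one-hot sequences (lists of one-hot rows) with all-zero rows."""
    max_seq_len = max(max_seq_len, max(len(sequence) for sequence in sequences))
    num_tokens = len(sequences[0][0])  # width of a one-hot row = vocabulary size
    padded = []
    for sequence in sequences:
        # Append as many all-zero rows as needed to reach max_seq_len.
        pad_rows = [[0] * num_tokens for _ in range(max_seq_len - len(sequence))]
        padded.append([list(row) for row in sequence] + pad_rows)
    return padded


# Toy vocabulary of 3 tokens; two sequences of lengths 2 and 1.
sequences = [[[1, 0, 0], [0, 1, 0]], [[0, 0, 1]]]
padded = pad_sequences(sequences)
print(len(padded), len(padded[0]), len(padded[0][0]))
```

In this sketch the third dimension equals the one-hot width, so it tracks the number of tokens and is unrelated to the number of target classes.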
Hi @GokuMohandas,
I've been recently taking a look at the sample Notebooks in this project and I found them really interesting and valuable for teaching purposes. We're even thinking about adding part of them to our curriculum at https://rmotr.com/ (cofounder and teacher here), in our Data Science program.
We have a small service at RMOTR that lets you run a Jupyter environment online in a single click. Similar to Google Colab or Binder, but also with the ability of installing custom requirements, clone an entire GH repo, etc. We use it for our students, so they don't have to hit the initial wall of installing the whole local Jupyter setup when they are getting started in the DS world.
You can see how practicalAI looks in the service using this link:
https://notebooks.rmotr.com/clone/gh/GokuMohandas/practicalAI
Note that all requirements listed in requirements.txt are already installed when the env is loaded, so people can start using it right away. That gives you the flexibility of adding any requirement, and not being tied to what Colab provides by default.
Do you think it would be a good choice to add it as a third launching option, as an alternative to Colab and Binder, which are already listed in the README?
I hope you like it, and I truly appreciate any feedback.
thanks.
In def predict_step, z = F.softmax(z).cpu().numpy() is shown on the webpage. The notebook correctly assigns to y_prob = F.softmax(z).cpu().numpy(), though.
plt.scatter(X[:, 0], X[:, 1], c=[colors[_y] for _y in y], s=25, edgecolors="k"') (happens 1x here, 2x in Data Quality page)
In def train_step,
z = self.model(inputs) # Forward pass
J = self.loss_fn(z, targets) # Define loss
is used without apply_softmax = True.
train_step's loss needs J.detach().item(), but eval_step uses J directly without detach and item.
In collate_fn, batch = np.array(batch, dtype=object) is used, but I didn't understand why it converts to object. Adding a note on what happens without it (VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated.) would be very helpful in preparing students for ragged tensors and padding in CNN/RNN later.
X = torch.FloatTensor(X.astype(np.float32)) breaks with ValueError: setting an array element with a sequence. because batch[:,0] indexing creates nested NumPy array objects that can't be cast. This nesting will not occur for y during batch[:,1], because y began as a 1-D object already, so there is no nested array and no problem casting. So is there no need to stack y? (Same for stacking y in CNN.)
padded_sequences = np.zeros began without nesting, and NumPy was also able to implicitly flatten the sequence NumPy array during padded_sequences[i][:len(sequence)] = sequence.
Under Gradients, the text
$ y = 3x + 2 $
$ y = \sum{y}/N $
$ \frac{\partial(z)}{\partial(x)} = \frac{\partial(z)}{\partial(y)} \frac{\partial(z)}{\partial(x)} = \frac{1}{N} * 3 = \frac{1}{12} * 3 = 0.25 $
should be
$ y = 3x + 2 $
$ z = \sum{y}/N $
$ \frac{\partial(z)}{\partial(x)} = \frac{\partial(z)}{\partial(y)} \frac{\partial(y)}{\partial(x)} = \frac{1}{N} * 3 = \frac{1}{12} * 3 = 0.25 $
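The corrected chain rule can be sanity-checked with a finite difference. The sketch below assumes N = 12 elements (matching the 1/12 in the text) and arbitrary values for x; it nudges one element of x and measures the change in z.

```python
N = 12  # number of elements, matching the 1/12 in the text
x = [float(i) for i in range(N)]  # arbitrary values


def z_of(values):
    y = [3 * v + 2 for v in values]  # y = 3x + 2
    return sum(y) / len(y)           # z = sum(y) / N


eps = 1e-6
bumped = list(x)
bumped[0] += eps  # perturb a single element of x
grad_first = (z_of(bumped) - z_of(x)) / eps  # finite-difference dz/dx_0
print(round(grad_first, 4))
```

The estimate should land close to 3/12 = 0.25 for any element of x, in line with dz/dy = 1/N and dy/dx = 3.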
Hi there,
While doing CNN module, I found that no batch normalization is applied in the forward pass?
class CNN(nn.Module):
    def __init__(self, vocab_size, num_filters, filter_size,
                 hidden_dim, dropout_p, num_classes):
        super(CNN, self).__init__()

        # Convolutional filters
        self.filter_size = filter_size
        self.conv = nn.Conv1d(
            in_channels=vocab_size, out_channels=num_filters,
            kernel_size=filter_size, stride=1, padding=0, padding_mode="zeros")
        self.batch_norm = nn.BatchNorm1d(num_features=num_filters)

        # FC layers
        self.fc1 = nn.Linear(num_filters, hidden_dim)
        self.dropout = nn.Dropout(dropout_p)
        self.fc2 = nn.Linear(hidden_dim, num_classes)

    def forward(self, inputs, channel_first=False,):
        # Rearrange input so num_channels is in dim 1 (N, C, L)
        x_in, = inputs
        if not channel_first:
            x_in = x_in.transpose(1, 2)

        # Padding for `SAME` padding
        max_seq_len = x_in.shape[2]
        padding_left = int((self.conv.stride[0]*(max_seq_len-1) - max_seq_len + self.filter_size)/2)
        padding_right = int(math.ceil((self.conv.stride[0]*(max_seq_len-1) - max_seq_len + self.filter_size)/2))

        # Conv outputs
        z = self.conv(F.pad(x_in, (padding_left, padding_right)))
        # ---------MISSING Batch Normalization here ? -----------
        z = F.max_pool1d(z, z.size(2)).squeeze(2)

        # FC layer
        z = self.fc1(z)
        z = self.dropout(z)
        z = self.fc2(z)
        return z
First of all, thank you so much for such amazing course material. I found that the productionization is done using Flask, which is not really scalable. I understand that a typical scaling mechanism like TF Serving is not easy to put in a beginner-level course. Is trying RedisAI as an alternative already on your roadmap?
PS: I am core dev from RedisAI team
ERROR: failed to pull data from the cloud - Checkout failed for following targets:
features.json
projects.json
tags.json
features.parquet
Hi Goku... I am really thankful for all your amazing tutorials.
I however was facing some issues in the Transformers lecture. There are a few minor bugs here with missing variables and imports; which was not an issue.
The training code however is missing the block:
# Train
best_model = trainer.train(
num_epochs, patience, train_dataloader, val_dataloader)
Also, when I wrote this and ran it, I got an error:
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:14: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:15: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
from ipykernel import kernelapp as app
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
[<ipython-input-68-8d0f0dee99db>](https://localhost:8080/#) in <module>()
1 # Train
2 best_model = trainer.train(
----> 3 num_epochs, patience, train_dataloader, val_dataloader)
6 frames
[/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py](https://localhost:8080/#) in dropout(input, p, training, inplace)
1277 if p < 0.0 or p > 1.0:
1278 raise ValueError("dropout probability has to be between 0 and 1, " "but got {}".format(p))
-> 1279 return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training)
1280
1281
TypeError: dropout(): argument 'input' (position 1) must be Tensor, not str
Apparently, the issue comes from the line:
seq, pool = self.transformer(input_ids=ids, attention_mask=masks)
where the "pool" returned is of class str.
Upon printing its type and value, I get the following:
<class 'str'>
pooler_output
Can you please have a look into this.
Thanks in Advance!!
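A possible explanation, assuming the installed transformers version returns a dict-like output object rather than a plain tuple: unpacking a mapping iterates over its keys, so pool receives the key name "pooler_output" as a string. The sketch below uses an OrderedDict with placeholder values as a stand-in for the model output; it does not call the transformers library.

```python
from collections import OrderedDict

# Stand-in for a dict-like model output (keys mirror the names in the issue).
outputs = OrderedDict(
    last_hidden_state=[[0.1, 0.2]],  # placeholder values
    pooler_output=[0.3, 0.4],        # placeholder values
)

seq, pool = outputs  # iterating a mapping yields its keys
print(type(pool), pool)  # a str holding the key name, not the pooled values

seq_ok, pool_ok = outputs.values()  # unpack the values instead
print(pool_ok)
```

If this is the cause, passing return_dict=False in the model call, or reading the pooler_output attribute from the returned object, may restore the expected tensors; neither has been verified against this notebook.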
Hi!
We appreciate your work here. My friends and I have learned a lot from it.
We happened to find a platform in mainland China called KESCI (www.kesci.com) that provides a service similar to Google Colab and Kaggle (as you may know, there are connectivity problems with Google services in mainland China). They provide dev-ready and up-to-date Python and R CPU environments, all for free, with GPU support upcoming.
We also managed to translate the whole series into Chinese and applied for a column to publish it on KESCI as a series. You can access it here: https://www.kesci.com/home/column/5c20e4c5916b6200104eea63
The Computer Vision notebook has already been translated but is still being trained in the transfer-learning section.
Also, do you think it is possible to add this as another launching option? I think there must be more people in China who could learn from your tutorials!
Great repo!
Can I translate it to Chinese?
I'm looking at the Product Design page, and I'm seeing two small errors:
product: what needs to be build to help our users reach their goals?
I am running the tagifai.ipynb notebook on Windows but am having difficulty viewing the experiment in MLflow.
Steps done:
"mlops-course\notebooks\tagifai.ipynb" in VS Code locally.
"mlflow server -h 0.0.0.0 -p 8000 --backend-store-uri /experiments/" from the location of the notebook; experiments is the next folder inside it. ($PWD is omitted because of Windows.)
Observation:
Please provide assistance with this issue.
Thanks