
micrograd's Introduction

I like deep neural nets.

micrograd's People

Contributors

bpesquet, karpathy


micrograd's Issues

Ensure backward() is idempotent

Hi Andrej,

Many thanks for micrograd & its accompanying video; they deepened my understanding of backprop considerably!

I notice that in the current implementation, calling backward() repeatedly is non-idempotent, because the grads just keep accumulating. This seems like something people are likely to trip over. The fix is simple: in the def of backward(), just above

        # go one variable at a time and apply the chain rule to get its gradient
        self.grad = 1

add

        # reset gradients to ensure they don't get repeatedly accumulated
        for v in reversed(topo):
            v.grad = 0

I've just submitted PR 54 for your consideration; it makes only that one change.

Example of non-idempotence with current master branch: given a simple tree where a = 3, b = 2, c = a + b, d = 1, e = c * d (all leaves as Values of course):

>>> print_grads()
a: 0, b: 0, c: 0, d: 0, e: 0
>>> e.backward()
>>> print_grads()
a: 1.0, b: 1.0, c: 1.0, d: 5.0, e: 1
>>> e.backward()
>>> print_grads()
a: 3.0, b: 3.0, c: 2.0, d: 10.0, e: 1
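
For reference, a sketch of what the full backward() would look like with that reset applied, assuming the topological-sort implementation from the repo's engine.py:

    def backward(self):
        # build the topological order of the graph (as in engine.py)
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)

        # reset gradients to ensure they don't get repeatedly accumulated
        for v in reversed(topo):
            v.grad = 0

        # go one variable at a time and apply the chain rule to get its gradient
        self.grad = 1
        for v in reversed(topo):
            v._backward()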

`other` should have a gradient in `__pow__` (?)

Hey Andrej -- just want to say thanks so much for your YouTube video on micrograd. The video has been absolutely enlightening.

Quick question -- while re-implementing micrograd on my end, I noticed that __pow__ (in Value) was missing a back-propagation definition for other. Is this expected?

def _backward():
    self.grad += (other * self.data**(other-1)) * out.grad
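
For context, a hedged sketch of what the missing term could look like if other were itself a Value (if I read the repo correctly, __pow__ currently restricts the exponent to a plain int/float, so there is no other.grad to update), using d(a**b)/db = a**b * ln(a):

import math

def _backward():
    # existing term: gradient with respect to the base
    self.grad += (other.data * self.data**(other.data - 1)) * out.grad
    # hypothetical extra term: gradient with respect to the exponent
    # (only valid for self.data > 0, where the log is defined)
    other.grad += (self.data**other.data) * math.log(self.data) * out.grad
out._backward = _backward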

Incorrect gradient when non-leaf Values are re-used

Thank you @evcu for raising this. My little 2D toy problem converged, and instead of going on to proper tests and double-checking through the recursion I got all trigger-happy and amused with puppies. The core issue is that if variables are re-used, their gradient will be accumulated once per path. Do you think this reference-counting idea could work as a simpler solution? The idea is to suppress backward() calls until the very last one.

(Love your Stylized puppy in your branch btw! :D)

class Value:
    """ stores a single scalar value and its gradient """

    def __init__(self, data):
        self.data = data
        self.grad = 0
        self.backward = lambda: None
        self.refs = 0

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data)
        self.refs += 1
        other.refs += 1
        
        def backward():
            if out.refs > 1:
                out.refs -= 1
                return
            self.grad += out.grad
            other.grad += out.grad
            self.backward()
            other.backward()
        out.backward = backward

        return out

    def __radd__(self, other):
        return self.__add__(other)

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data)
        self.refs += 1
        other.refs += 1
        
        def backward():
            if out.refs > 1:
                out.refs -= 1
                return
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
            self.backward()
            other.backward()
        out.backward = backward

        return out

    def __rmul__(self, other):
        return self.__mul__(other)

    def relu(self):
        out = Value(0 if self.data < 0 else self.data)
        self.refs += 1
        def backward():
            if out.refs > 1:
                out.refs -= 1
                return
            self.grad += (out.data > 0) * out.grad
            self.backward()
        out.backward = backward
        return out

    def __repr__(self):
        return f"Value(data={self.data}, grad={self.grad})"

Homework Assignment Error with softmax activation function

Hi @karpathy
I was solving the assignment mentioned in the YouTube video. In the softmax function, I was getting the following error: TypeError: unsupported operand type(s) for +: 'int' and 'Value'

This is the line where I am getting the error

def softmax(logits):
  counts = [logit.exp() for logit in logits]
  denominator = sum(counts) # Here I am getting the TypeError
  out = [c / denominator for c in counts]
  return out

And, my add function in Value Class is the following

def __add__(self, other): # exactly as in the video
    other = other if isinstance(other, Value) else Value(other)
    out = Value(self.data + other.data, (self, other), '+')
    
    def _backward():
      self.grad += 1.0 * out.grad
      other.grad += 1.0 * out.grad
    out._backward = _backward
    
    return out

So my question is about summing the list. I assumed sum() behaves roughly like counts[i].__add__(counts[i+1]) applied repeatedly along the list, so this __add__ should work fine. But I am not sure why it is not working; am I missing something?
Thanks in advance
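
In case it helps others hitting the same thing, a hedged note: Python's built-in sum() starts from the integer 0, so the first operation is 0 + Value. That falls back to Value.__radd__, and without it the int + Value combination raises exactly this TypeError. A minimal sketch of the reflected-add method, matching the style of the __add__ above:

def __radd__(self, other): # other + self
    return self + other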

Regarding the gradient update of the __sub__ operation


The sub operation implemented here reuses the _backward method of the add operation. I believe this is wrong, because the _backward method for the add operation accumulates out.grad into both operands, whereas for the sub operation it should accumulate out.grad into the positive operand and -out.grad into the negative operand.


for example:
a = b + c
d(a)/db = 1
d(a)/dc = 1

a = b - c
d(a)/db = 1
d(a)/dc = -1

So I think we need to add a separate _backward function for the sub operation, or modify the _backward method of the add operation.
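
A sketch of what a standalone sub operation with its own _backward could look like (hypothetical, mirroring the structure of the other ops rather than the repo's current composition of __add__ and __neg__):

def __sub__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    out = Value(self.data - other.data, (self, other), '-')

    def _backward():
        self.grad += out.grad        # d(self - other)/d(self) = 1
        other.grad += -out.grad      # d(self - other)/d(other) = -1
    out._backward = _backward

    return out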

Adjusting parameters by sign and magnitude of gradient

https://github.com/karpathy/micrograd/blame/c911406e5ace8742e5841a7e0df113ecb5d54685/demo.ipynb#L271C13-L271C45

I really appreciate your videos! Such a gift to all of us.

When adjusting parameters after computing the loss, the example multiplies the step size by the sign and magnitude of the gradient. In the case of a steep gradient near a local minimum, a large gradient value will jump the parameter far from the desired solution. In the case of a shallow gradient, the parameter will struggle to reach its local minimum in the given number of iterations.

Thus, I think the adjustment should be a step size times the sign of the gradient.

What are your thoughts?
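
A minimal sketch of the sign-only update being proposed (the step size and parameter loop here are assumptions for illustration, not the notebook's code):

step_size = 0.01
for p in model.parameters():
    if p.grad > 0:
        p.data -= step_size
    elif p.grad < 0:
        p.data += step_size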

Sequential MLP implementation

Maybe not PR-worthy, but I guess one can abstract the MLP implementation even further by passing the layers themselves instead of the numbers of inputs and outputs yet again, since each individual layer already knows its dimensions.

As such, I wrote it as:

class MLP:

  def __init__(self, layers):
    self.layers = layers

  def __call__(self, x):
    for l in self.layers:
      x = l(x)
    return x

  def parameters(self):
      return [p for layer in self.layers for p in layer.parameters()]

by which you can define a network more intuitively, much like PyTorch's Sequential:

n = MLP([Layer(3, 6), Layer(6, 3), Layer(3, 1)])

To be even more rigorous, a dimension assertion can be added in the __init__:

class MLP:

  def __init__(self, layers):
    self.layers = layers
    for i in range(1, len(layers)):
      assert layers[i-1].nout == layers[i].nin

for which I would also have to store nin and nout on the Layer class:

class Layer:

  def __init__(self, nin, nout):
    self.nin = nin
    self.nout = nout
    self.neurons = [Neuron(nin) for _ in range(nout)]
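
A brief usage sketch of the assertion variant, assuming the Layer and Neuron classes above:

n = MLP([Layer(3, 6), Layer(6, 3), Layer(3, 1)])   # dimensions chain correctly
# MLP([Layer(3, 6), Layer(4, 1)])                  # would trip the assert (6 != 4)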

Noob question about backprop implementation

Hello,
I came across this from your YT video tutorials, thank you for making these!

In engine.py, you implement back propagation using explicit topological order computation.
Are there any reasons why we would not recursively call _backward for every child?
e.g. implement the backward function in Value like this:

    def backward(self):
        self._backward()
        for v in self._prev:
            v.backward()

Does it have something to do with how backprop is implemented in actual NN libraries? Is recursion harder to parallelise in practice compared to using topological ordering?

Thank you
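
For illustration, a small standalone sketch (not the repo's code) of why plain recursion is tricky once a node is shared by several paths: its _backward runs once per path, before its own gradient has been fully accumulated, and shared subgraphs get revisited repeatedly.

class Node:
    def __init__(self, children=()):
        self.children = list(children)
        self.calls = 0            # how many times _backward ran on this node

    def _backward(self):
        self.calls += 1

    def backward(self):           # the recursive scheme from the question
        self._backward()
        for c in self.children:
            c.backward()

shared = Node()                   # one node feeding into two parents
left, right = Node([shared]), Node([shared])
root = Node([left, right])
root.backward()
print(shared.calls)               # 2: processed once per path through the graph

The topological sort avoids this: each node's _backward runs exactly once, and only after all of its consumers have contributed to its grad.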

Issue with zero_grad?

Hi, unless I'm misunderstanding something, zero_grad in nn.py is zeroing out the gradients on the parameter nodes, but shouldn't it do it on all the nodes in the graph?
Otherwise the inner nodes will keep accumulating them.

Rename engine.py to value.py

I suggest you rename engine.py to value.py.

Reasoning:

  • The name engine is misleading: the file doesn't contain any framework or engine-level logic.
  • engine.py contains a single class named Value, so value.py is the most fitting name for it.

PyPI package

Feature

  • Convert Micrograd into a PyPI package.

Need

  • It would be easier for institutions or bootcamps to adopt this for their students.
  • As an organizer of the Data Science Club at SJSU, I would love to introduce the fundamentals with this library.
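
A minimal packaging sketch (every piece of metadata here is a placeholder, not the repo's actual configuration): a setup.py that would let pip install . and a later PyPI upload work.

from setuptools import setup, find_packages

setup(
    name="micrograd",                 # placeholder; the PyPI name would need to be checked
    version="0.1.0",                  # placeholder version
    description="A tiny scalar-valued autograd engine and a small neural net library",
    packages=find_packages(),         # picks up the micrograd/ package directory
    python_requires=">=3.6",
)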

Resetting the grad of weights and biases is not enough

In the video "The spelled-out intro to neural networks and backpropagation: building micrograd" you present the following code:

n = MLP(3, [4, 4, 1])
xs = [
  [2.0, 3.0, -1.0],
  [3.0, -1.0, 0.5],
  [0.5, 1.0, 1.0],
  [1.0, 1.0, -1.0],
]
ys = [1.0, -1.0, -1.0, 1.0] # desired targets
for k in range(20):
  
  # forward pass
  ypred = [n(x) for x in xs]
  loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))
  
  # backward pass
  for p in n.parameters():
    p.grad = 0.0
  loss.backward()
  
  # update
  for p in n.parameters():
    p.data += -0.1 * p.grad
  
  print(k, loss.data) 

However, before calling loss.backward() we should reset the grad for ALL Values, not just for n.parameters(), because every call to loss.backward() accumulates (+=) into the grad of every node in the graph.
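
A sketch of the broader reset this issue asks for (not the repo's code), walking every node reachable from the loss through _prev:

def zero_all_grads(root):
    # reset the grad of every Value reachable from `root`, parameters included
    seen = set()
    def visit(v):
        if v in seen:
            return
        seen.add(v)
        v.grad = 0.0
        for child in v._prev:
            visit(child)
    visit(root)

# usage inside the training loop, before loss.backward():
# zero_all_grads(loss)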

backward member implementation question

Why can't this function simply be implemented as follows? Am I missing something? We are dealing with a composite structure.

  def backward(self, is_first=True):
    if is_first:
      self.grad = 1.0

    self._backward()

    for c in self._prev:
      c.backward(False)

Topological sort - bug

It's a nit that won't matter most of the time, but the topo sort implementation doesn't work if the graph has cycles.

i.e. there is a hard assumption you're operating over a DAG.

_backward as lambdas?

Hi @karpathy,

congratulations on this repo/talk. The educational value is truly immense. Good job!

Can you please explain the main motivation for _backward methods implemented as lambdas, as opposed to (one) regular method that starts with a hypothetical switch (self._op) and contains implementation for all arithmetic cases?
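
For comparison, a sketch of the single-method alternative the question describes (not the repo's code), assuming each Value stores its operand Values in self._prev and its op string in self._op, and ignoring edge cases such as the same operand appearing twice:

def _backward(self):
    if self._op == '+':
        a, b = self._prev
        a.grad += self.grad
        b.grad += self.grad
    elif self._op == '*':
        a, b = self._prev
        a.grad += b.data * self.grad
        b.grad += a.data * self.grad
    elif self._op == 'ReLU':
        (a,) = self._prev
        a.grad += (self.data > 0) * self.grad
    # ...one branch per remaining op

One apparent advantage of the closure style is that each local rule sits next to the forward code that produced the node and can capture extra operands (such as the int exponent in __pow__) without storing them on the Value itself.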
