ezyang / pytorch-unattached Goto Github PK

Tensors and Dynamic neural networks in Python with strong GPU acceleration

License: Other

CMake 1.93% Python 29.80% C++ 44.56% C 6.42% Shell 0.61% Lua 0.01% Objective-C 0.02% Vim Script 0.01% Cuda 15.38% Makefile 0.06% Metal 0.17% Objective-C++ 0.98% CSS 0.01% HTML 0.02% Batchfile 0.04%

pytorch-unattached's Introduction

PyTorch is a Python package that provides two high-level features:

Tensor computation (like NumPy) with strong GPU acceleration
Deep neural networks built on a tape-based autograd system

You can reuse your favorite Python packages such as NumPy, SciPy and Cython to extend PyTorch when needed.

We are in an early-release beta. Expect some adventures and rough edges.

More about PyTorch
Installation
Getting Started
Communication
Releases and Contributing
The Team

System	2.7	3.5
Linux CPU
Linux GPU
Windows GPU	—

More about PyTorch

At a granular level, PyTorch is a library that consists of the following components:

torch	a Tensor library like NumPy, with strong GPU support
torch.autograd	a tape-based automatic differentiation library that supports all differentiable Tensor operations in torch
torch.nn	a neural networks library deeply integrated with autograd designed for maximum flexibility
torch.multiprocessing	Python multiprocessing, but with magical memory sharing of torch Tensors across processes. Useful for data loading and Hogwild training.
torch.utils	DataLoader, Trainer and other utility functions for convenience
torch.legacy(.nn/.optim)	legacy code that has been ported over from torch for backward compatibility reasons

Usually one uses PyTorch either as:

a replacement for NumPy to use the power of GPUs.
a deep learning research platform that provides maximum flexibility and speed

Elaborating further:

A GPU-Ready Tensor Library

If you use NumPy, then you have used Tensors (a.k.a ndarray).

PyTorch provides Tensors that can live either on the CPU or the GPU, and accelerate compute by a huge amount.

We provide a wide variety of tensor routines to accelerate and fit your scientific computation needs such as slicing, indexing, math operations, linear algebra, reductions. And they are fast!

Dynamic Neural Networks: Tape-Based Autograd

PyTorch has a unique way of building neural networks: using and replaying a tape recorder.

Most frameworks such as TensorFlow, Theano, Caffe and CNTK have a static view of the world. One has to build a neural network, and reuse the same structure again and again. Changing the way the network behaves means that one has to start from scratch.

With PyTorch, we use a technique called reverse-mode auto-differentiation, which allows you to change the way your network behaves arbitrarily with zero lag or overhead. Our inspiration comes from several research papers on this topic, as well as current and past work such as torch-autograd, autograd, Chainer, etc.

While this technique is not unique to PyTorch, it's one of the fastest implementations of it to date. You get the best of speed and flexibility for your crazy research.

Python First

PyTorch is not a Python binding into a monolithic C++ framework. It is built to be deeply integrated into Python. You can use it naturally like you would use NumPy / SciPy / scikit-learn etc. You can write your new neural network layers in Python itself, using your favorite libraries and use packages such as Cython and Numba. Our goal is to not reinvent the wheel where appropriate.

Imperative Experiences

PyTorch is designed to be intuitive, linear in thought and easy to use. When you execute a line of code, it gets executed. There isn't an asynchronous view of the world. When you drop into a debugger, or receive error messages and stack traces, understanding them is straightforward. The stack trace points to exactly where your code was defined. We hope you never spend hours debugging your code because of bad stack traces or asynchronous and opaque execution engines.

Fast and Lean

PyTorch has minimal framework overhead. We integrate acceleration libraries such as Intel MKL and NVIDIA (cuDNN, NCCL) to maximize speed. At the core, its CPU and GPU Tensor and neural network backends (TH, THC, THNN, THCUNN) are written as independent libraries with a C99 API. They are mature and have been tested for years.

Hence, PyTorch is quite fast – whether you run small or large neural networks.

The memory usage in PyTorch is extremely efficient compared to Torch or some of the alternatives. We've written custom memory allocators for the GPU to make sure that your deep learning models are maximally memory efficient. This enables you to train bigger deep learning models than before.

Extensions without Pain

Writing new neural network modules, or interfacing with PyTorch's Tensor API was designed to be straightforward and with minimal abstractions.

You can write new neural network layers in Python using the torch API or your favorite NumPy-based libraries such as SciPy.

If you want to write your layers in C/C++, we provide a convenient extension API that is efficient and with minimal boilerplate. There is no wrapper code that needs to be written. You can see a tutorial here and an example here.

Installation

Binaries

Commands to install from binaries via Conda or pip wheels are on our website:

http://pytorch.org

From Source

If you are installing from source, we highly recommend installing an Anaconda environment. You will get a high-quality BLAS library (MKL) and you get a controlled compiler version regardless of your Linux distro.

Once you have Anaconda installed, here are the instructions.

If you want to compile with CUDA support, install

NVIDIA CUDA 7.5 or above
NVIDIA cuDNN v6.x or above

If you want to disable CUDA support, export environment variable NO_CUDA=1. Other potentially useful environment variables may be found in setup.py.

If you want to build on Windows, Visual Studio 2017 and NVTX are also needed.

Install optional dependencies

On Linux

export CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" # [anaconda root directory]

# Install basic dependencies
conda install numpy pyyaml mkl mkl-include setuptools cmake cffi typing
conda install -c mingfeima mkldnn

# Add LAPACK support for the GPU
conda install -c pytorch magma-cuda80 # or magma-cuda90 if CUDA 9

On macOS

export CMAKE_PREFIX_PATH=[anaconda root directory]
conda install numpy pyyaml mkl mkl-include setuptools cmake cffi typing

On Windows

conda install numpy pyyaml mkl mkl-include setuptools cmake cffi typing

Get the PyTorch source

git clone --recursive https://github.com/pytorch/pytorch
cd pytorch

Install PyTorch

On Linux

python setup.py install

On macOS

MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ python setup.py install

On Windows

set "VS150COMNTOOLS=C:\Program Files (x86)\Microsoft Visual Studio\2017\Enterprise\VC\Auxiliary\Build"
set CMAKE_GENERATOR=Visual Studio 15 2017 Win64
set DISTUTILS_USE_SDK=1
REM The following line is needed for Python 2.7, but the support for it is very experimental.
set MSSdk=1

call "%VS150COMNTOOLS%\vcvarsall.bat" x64 -vcvars_ver=14.11
python setup.py install

Docker image

Dockerfile is supplied to build images with cuda support and cudnn v7. Build as usual

docker build -t pytorch -f docker/pytorch/Dockerfile .

You can also pull a pre-built docker image from Docker Hub and run with nvidia-docker, but this is not currently maintained and will pull PyTorch 0.2.

nvidia-docker run --rm -ti --ipc=host pytorch/pytorch:latest

Please note that PyTorch uses shared memory to share data between processes, so if torch multiprocessing is used (e.g. for multithreaded data loaders) the default shared memory segment size that container runs with is not enough, and you should increase shared memory size either with --ipc=host or --shm-size command line options to nvidia-docker run.

Previous Versions

Installation instructions and binaries for previous PyTorch versions may be found on our website.

Getting Started

Three pointers to get you started:

Communication

forums: discuss implementations, research, etc. http://discuss.pytorch.org
GitHub issues: bug reports, feature requests, install issues, RFCs, thoughts, etc.
Slack: general chat, online discussions, collaboration etc. https://pytorch.slack.com/ . Our slack channel is invite-only to promote a healthy balance between power-users and beginners. If you need a slack invite, ping us at [email protected]
newsletter: no-noise, one-way email newsletter with important announcements about pytorch. You can sign-up here: http://eepurl.com/cbG0rv

Releases and Contributing

PyTorch has a 90 day release cycle (major releases). Its current state is Beta, we expect no obvious bugs. Please let us know if you encounter a bug by filing an issue.

We appreciate all contributions. If you are planning to contribute back bug-fixes, please do so without any further discussion.

If you plan to contribute new features, utility functions or extensions to the core, please first open an issue and discuss the feature with us. Sending a PR without discussion might end up resulting in a rejected PR, because we might be taking the core in a different direction than you might be aware of.

The Team

PyTorch is a community driven project with several skillful engineers and researchers contributing to it.

PyTorch is currently maintained by Adam Paszke, Sam Gross, Soumith Chintala and Gregory Chanan with major contributions coming from 10s of talented individuals in various forms and means. A non-exhaustive but growing list needs to mention: Trevor Killeen, Sasank Chilamkurthy, Sergey Zagoruyko, Adam Lerer, Francisco Massa, Alykhan Tejani, Luca Antiga, Alban Desmaison, Andreas Kopf, James Bradbury, Zeming Lin, Yuandong Tian, Guillaume Lample, Marat Dukhan, Natalia Gimelshein, Christian Sarofeen, Martin Raison, Edward Yang, Zachary Devito.

Note: this project is unrelated to hughperkins/pytorch with the same name. Hugh is a valuable contributor in the Torch community and has helped with many things Torch and PyTorch.

pytorch-unattached's People

Contributors

Stargazers

Watchers

Forkers

houseroad h6yv7um nd1511 doc22940 hixio-mh trellixvulnteam

pytorch-unattached's Issues

Error formatting strategy

Weekend hacking = good time to do speculative, "not a good use of your time" thinking.

In #120 it was mentioned that we didn't have a convenient mechanism for formatting and printing user facing errors, and that we should somehow make it easier to do so. Continuing to make use of printf is probably the most practical way to proceed for now, but as I was working on the patch I felt bad for a few reasons:

printf is not type safe at all. Usually C programmers can just deal, but use of printf-style functions in C++ have their own hazards. The biggie: you have to remember NOT to pass std::string for a %s formatter.
printf style formatters make it difficult to render multiple line messages with increased indentation, which Haskell error messages use a lot, to good effect IMO. (Though, many would disagree with a claim that Haskell error messages are good.) While going down this rabbit hole, I noticed compilers like GCC/Clang have decided that multi-line error messages are not something you ever want to do (instead, you just emit extra diagnostic notes for any extra information you want to tack on.) This is probably precisely because their APIs don't make it convenient to output indented information...
printf style formatters discourage rendering of more complex data, like lists, etc., which may be of use to users.

So I turned my attention to the question: "What pretty-printing library could we use in C++ for this matter"? There are a few options:

@zdevito wrote a simple code templater in torch/csrc/jit/code_template.h which does type safe, template interpolation. Unfortunately, the API is a bit verbose as you have to explicitly build the env with explicit keys, specifying all types involved, and then apply it to the template. In principle, we could probably extend it to have a more compact API, but I don't know if this would destroy the conceptual integrity of the API.
Instead of printf, we could use C++'s existing printing convention using overload "<<" operators. Without any API, this is wordy, because you have to declare a string stream, write your error into it, and then extract out the string into your error message. But this can be solved, because temporaries in C++ live until the end of the full-expression in which they are called. So you can play tricks as in https://groups.google.com/forum/#!msg/comp.lang.c++/_GWLGQhbxYE/IDHpHMFm5XcJ to write the stringstream in one pipe. (Another version of the trick here: https://codereview.stackexchange.com/questions/6094/a-version-of-operator-that-returns-ostringstream-instead-of-ostream ). This API isn't going to give you indentation, but you don't have to write too much code.
I've always wanted an ACTUAL pretty-printer for C++, modeled off of, e.g., Wadler-Leijen pretty printers (https://homepages.inf.ed.ac.uk/wadler/papers/prettier/prettier.pdf). Depending on how much effort you put into it, there are two major jumps in functionality: first is (compositional) indentation (which comes from building a data structure representing the abstract document before rendering; much easier to re-indent now!); second is actually generating "pretty" output, whereby there are multiple layouts and the pretty-printer selects a layout that looks the best.

Personally, I'd like something that is (1) concise, (2) type safe and (3) which supports indentation, but I could be convinced that I don't actually want indentation.

Does Batchnorm run correctly on reexecuted traces?

I am not sure, since the C++ object has some stashed mutable state. Maybe it's OK because batchnorm is always used as a Module, and the Module is persistent across training iterations.

IR_IF does not preserve const-ness

This means it cannot be used to case switch over const data.

Adapt torch.jit.Traceable to RNN-cell like usage

Right now it assumes that it encapsulates the whole model, or that it's a unique subgraph in the model. If a cell is used multiple times, we'll record the first forward run and compile a one-stage closure when it's reused. This means that the closure isn't differentiable even once.

Segfault when running test_simple

Hi @ezyang, I'd like to get my hands dirty with the IR code (great stuff BTW). I just built the jit branch and ran test_simple

x = Variable(torch.Tensor([0.4]), requires_grad=True)
y = Variable(torch.Tensor([0.7]), requires_grad=True)

torch._C._tracer_enter((x, y))
z = torch.sigmoid(torch.tanh(x * (x + y)))
trace = torch._C._tracer_exit((z,))

I get a segfault right at the last line. The segfault is due to THPGraphClass being nullptr in THPGraph_Wrap (in torch/csrc/jit/python_ir.cpp), in turn called at line 64 in torch/csrc/jit/python_tracer.cpp.

Apparently THPJIT_initExtension (in torch/csrc/jit/init.cpp) is never executed for me, so THPGraphClass is not initialized. Is there anything I'm missing on my end?

Thanks!

word_language_model has dodgy constants

See here: https://github.com/ezyang/pytorch/blob/jit/test/expect/TestModels.test_word_language_model_LSTM.expect#L31

Needs investigation.

Fix rbegin/rend() on OSX

Our tests still crash on OSX because of the iterator stuff. We will need to fix this before we can release.

Graph.op hotpatch is delicate

If you somehow interact with Graph.op without having imported torch.toffee, the method will be missing (which will make you a sad panda). One pattern for handling this sort of thing is to have _C export a "base class" which is than extended by a Python class that adds helpers. The initialization code C code gets is hands on the Python class and then creates THAT, so the helpers are available. This is what is done for Variable.

Printer for a Node by itself

Right now there is no public printer, and the << overload just prints the unique of the node. We need a more detailed printer.

Remove our use of the legacy attributes in ToffeeIR

@houseroad: This commit in ToffeeIR removes legacy attributes: https://github.com/ProjectToffee/ToffeeIR/blob/e1f50697ba376b0410d9e8b90d2607b735d0f1ec/toffee/frontend/c2_toffee.py

We need to update Pytorch so that it no longer generates these legacy attributes.

In particular we need to remove the attributes pad, kernel, dilation, stride, pad, adj, legacy_pad from Conv, ConvTranspose, MaxPool, and AveragePool. Instead use their plural equivalents.

You can get the number of dimensions of a node in a primspec function using: len(input.type().sizes())

Remove IR_IF casing on Nodes which need to be exhaustive

Now that we have a fully dynamic Node interface, we no longer have exhaustive pattern matching on types of nodes. This means that we SHOULD NOT use IR_IF for cases where exhaustivity is important. We need to go through and audit these cases.

Shouldn't always fuse after tracing

Fusion should be something that can be optionally applied after the fact. Deferring fusing will let us do end-to-end tests that don't depend on fusion working correctly. It also gives us a -O0 mode.

Print parameters for Cpp ops

Right now all we get is the op name.

Helper function for jitting functions

We have a helper that works for models, but nothing for plain old functions.

Lifetime for Type objects associated with nodes makes me nervous

In particular, we have a public setType() function which has the effect of invalidating any Type* pointers returned from type(). This feels error-prone.

Use state_dict so that parameters get named

Right now, parameters are completely unnamed (just a big pile of numbers) because they are retrieved using parameters(). However, we actually have names for them from the state_dict interface from modules. It would be great to use these names to make the traces more interpretable.

We'll need to adjust name allocation to make use of name hints, and it's probably simplest if we unconditionally include the unique number even with the name hint.

Record actual type information in Type

Right now we only record sizes and strides. Knowing if it's actually a float/double/etc may also be useful.

Make lint failure print the graph that failed lint

Self explanatory

Improve printing for multireturn nodes

Right now the only uses we print are their corresponding Select nodes which isn't very useful

Print types

We recorded the types, we should print 'em.

Inplace doesn't trace correctly

When I trace AlexNet, the inplace ReLus turn into constants.

-  %18.0 = CppOp[ConvForward](%17, %1, %2), uses = [[]];
-  %20 = Constant(), uses = [%21.i0];
-  %21.0 = CppOp[ConvForward](%20, %3, %4), uses = [[]];
-  %23 = Constant(), uses = [%24.i0];
-  %24.0 = CppOp[ConvForward](%23, %5, %6), uses = [[]];
-  %26 = Constant(), uses = [%27.i0];
-  %27.0 = CppOp[ConvForward](%26, %7, %8), uses = [[]];
-  %29 = Constant(), uses = [%30.i0];

Figure out why!

Nullability of type() is annoying

We force it to be non-null and enforce it in the lint.

Return nodes should not appear in list of nodes.

From yesterday's standup discussion.

Nail down the Select node invariant

Currently, the Select invariant is documented in text:

// Select nodes are used to handle multiple returns for the ops that actually return
// multiple values like PythonOp
// By convension, there is a unique select node for each output of an op
// so you can iterate over uses of a multi-return op to get all the select nodes.
// in this case
// number_of_outputs = op.uses().size()
// this will change if Tuples ever become first class.

This is not precise enough for me to write a machine check: the crux of the matter is, "What is a multi-return op?"

I propose that the multi-return-ness of a Node is statically determined by the kind of a node: in particular, with today's IR, only PythonOps are multi-return.

Figure out where to put pybind11 type_caster

There are a few cases where they are defined in cpp files but that means they can't be used elsewhere.

Figure out why we're not raising error when we hit legacy functions

Inplace

We need to think carefully about the correctness conditions for optimizations involving inplace operations. Under the current semantics, when an variable is consumed by an inplace operation, PyTorch arranges so that there is never a reference to the old variable (because its tensor has been destroyed). We need to make sure that the optimizer never breaks this invariant, and we need a lint that can verify this is the case.

We also need to make sure reexecution of forwards respects data dependencies. At the moment it does not. See this test:

    def test_inplace_race(self):
        x = Variable(torch.FloatTensor([0]))
        @torch.jit.trace(num_derivatives=0)
        def fn(x):
            y = x + x + x + x + x + x + x + x
            x.add_(1)
            return x + y
        self.assertEqual(fn(x.clone()), y)
        self.assertEqual(fn(x.clone()), y) # run traced version

Adding to complexity is that data dependency can be masked by an aliasing operation, though I am having some difficulty coming up with a test for this case.

Don't use JIT_ASSERT for invariant violations from Python-space

I saw this JIT_ASSERT while reviewing some code:

        // primspecs do not deal with Handles at the moment
        // so we map handles to Unused nodes
        auto typ = old->typeOption();
        JIT_ASSERTM(typ && typ->kind() == jit::TypeKind::HandleType,
          "primspec produced too few outputs");

This code is invoked when processing the outputs of a Python primspec. I don't think this should be an assert, because in my opinion, Python code counts as code "written by external users", and so of course they might misuse the API, and that is not an assert-failure (for which we should be permitted to compile without asserts in production), that is just a regular user failure.

There is one particularly awkward elephant in the room, however, which is that we are currently exposing the entire compiler IR API from Python for primspec construction, and it is dashingly easy to violate invariants from Python with this API. So, Python being the invariant violation barrier is not a hard rule--but I think failure to record enough outputs in a primspec squarely falls on the "raise an exception" side of the line.

CC @zdevito

Consider ignoring Scatter/Gather when exporting

Otherwise, you have to disable DataParallel modules to trace

Rename init_pass.h/cpp

We renamed it to MatchJITOps, so maybe graph_matcher.h/cpp is a better name, and it's uniform with graph_fuser. (Perhaps we should put these in a directory of their own.)

[Future optimization] Eval-optimization pass

Depending on how many Evals we'll have in our traces in the end, and how large is a constant Eval overhead, we might want to implement an optimization pass that will stitch together subgraphs of multiple connected Evals so that they run a single larger node.

init_pass doesn't work with backwards

Right now it assumes opaque handles are never used. That will essentially only be true for forwards only graphs.

Exception-unsafe unique pointer handling

This code isn't exception safe:

template<pass_type optimizer>
PyObject * wrap_optimizer(PyObject *_unused, PyObject *py_graph) {
  HANDLE_TH_ERRORS
  THPUtils_assert(THPGraph_Check(py_graph), "expected a Graph instance");
  THPGraph *graph = (THPGraph*)py_graph;
  graph->cdata = optimizer(std::unique_ptr<Graph>{graph->cdata}).release();
  Py_RETURN_NONE;
  END_HANDLE_TH_ERRORS
}

If optimizer throws an exception, release() will never be called; so we'll deallocate graph->cdata, leaving the underlying Python object with a dangling pointer.

Model tests take a long time

Doing a full run of the model test takes several minutes for me. This is bad, because it discourages people from running all of the tests. Is there anything we can do to make the tests run more quickly? A few possibilities: (1) run them on GPU, (2) have smaller parameters on the models.

More structured symbol type

Just because we've gone all dynamically typed on the IR, doesn't mean we can't have our symbols have a little more structure than just "strings". What we have found extremely useful in GHC is to be able to allocate one level of subnamespaces to our symbols. To keep things as strings, we achieve this by reserving the first letter of the symbol as the namespace signifier.

Looking at our list of built-in symbols, here are the namespaces I'd like to see:

A Toffee Op namespace 'o' (oMul, oNegate)
A Toffee Attribute namespace 'a' (aepsilon, amomentum)
A "known key-attribute" namespace 'k' (kOffset, kInPlaceOutputs)

And you can always add more as necessary. The namespace signifier can be easily removed to get a raw string as necessary.

CC @zdevito

Make interned_string.cpp thread safe.

Long term plan for adding operators

Right now, we have been adding operators to the AST one-by-one to support various operations we may be interested in. There are problems with doing this long term: every new operation you add, is another case that must be handled in any IR_IF case statement. IR_IF works well if you don't have too many forms in your language, not if you have a separate subkind for every single one of PyTorch's operations!

The object-oriented way to solve this problem is to keep subclasses, stop writing IR_IFs and move methods into the class definition. In GHC, we solved this problem differently, by maintaining an environment of "known" ops and recording with them information necessary to do analyses and optimizations with them (e.g., their types, strictness, unfoldings, etc.) I think this latter approach scales better with lots of ops (and is similar to how NNVM is structuring their operators: https://github.com/dmlc/nnvm/blob/master/docs/overview.md) but we should make a decision one way or another before we start adding tons and tons of operator definitions to fill out our PyTorch support.

Unlike NNVM, one thing I do NOT advocate is making this information public (at least for now.) So, no user-written ops that support fusion. That's a very tricky API to get right and there should be some concerted design that happens before we let people rely on it.

Get ToffeeIR test runner to use files from PyTorch repo

So that you can run caffe2 integration tests, but still have tests in PyTorch repo

Look into how we can improve readability of node names in protobuf

We should at least to use a raw number, because Caffe2 sometimes prints them without a prefix.

Block conversion of "DoubleTensor" networks to Toffee.

https://github.com/ProjectToffee/ToffeeIR/issues/16

Stack trace on C++ exception

It would be really great if we could read out the debug information and then print it from the Python level. Would save a trip to gdb to get the trace when an exception is thrown.

Trace printing is not correct at the moment

Here is what I see when I run test/test_jit.py:

graph(%1, %2) {
  %3 = Add False(%1, %2);
  %4 =   %5 = Mul(%1, %3.0);
  %6 =   %7 = Tanh(%5.0);
  %8 =   %9 = Sigmoid(%7.0);
  %10 =   return (%9.0);
}

For some reason, the output is being rendered twice, e.g. %4 = %5. Additionally, because we internally have select nodes, the output doesn't really match up with ToffeeIR's representation: we'll need some relabeling for select nodes to get the formats to match up.

AlexNet fails lint

(python3) [[email protected] ~/local/vision/torchvision/models] python alexnet.py                              
graph(%1, %2, %3, %4, %5, %6, %7, %8, %9, %10, %11, %12, %13, %14, %15, %16, %17) {
  %18.0 = CppOp[ConvForward](%17, %1, %2), uses = [[]];
  %20.0 = CppOp[ConvForward](%21, %3, %4), uses = [[]];
  %21 = Constant(), uses = [%20.i0];
  %23.0 = CppOp[ConvForward](%24, %5, %6), uses = [[]];
  %24 = Constant(), uses = [%23.i0];
  %26.0 = CppOp[ConvForward](%27, %7, %8), uses = [[]];
  %27 = Constant(), uses = [%26.i0];
  %29.0 = CppOp[ConvForward](%30, %9, %10), uses = [[]];
  %30 = Constant(), uses = [%29.i0];
  %32.0 = ^Transpose(0, 1)(%11), uses = [[%35.i2]];
  %34 = Constant(), uses = [%35.i1];
  %35.0 = ^Addmm(1, 1, False)(%12, %34, %32.0), uses = [[]];
  %37.0 = ^Transpose(0, 1)(%13), uses = [[%40.i2]];
  %39 = Constant(), uses = [%40.i1];
  %40.0 = ^Addmm(1, 1, False)(%14, %39, %37.0), uses = [[]];
  %42.0 = ^Transpose(0, 1)(%15), uses = [[%45.i2]];
  %44 = Constant(), uses = [%45.i1];
  %45.0 = ^Addmm(1, 1, False)(%16, %44, %42.0), uses = [[%0.i0]];
  return (%45.0);
}

(python3) [[email protected] ~/local/vision/torchvision/models] python alexnet.py 
Traceback (most recent call last):
  File "alexnet.py", line 67, in <module>
    model(x)
  File "/data/users/ezyang/pytorch/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/users/ezyang/pytorch/torch/jit.py", line 137, in forward
    tuple(self.parameters()) + flatten(args))
  File "/data/users/ezyang/pytorch/torch/jit.py", line 22, in record_trace
    torch._C._jit_pass_lint(trace)
RuntimeError: torch/csrc/jit/ir.cpp:266: lint: Assertion `in_scope.count(input) == 1` failed.

I'm investigating

Maybe MultiType should record the types that can be selected out of it

Because of the select invariant, the "proper" type of MultiType can always be reconstructed from the uses, but this is only if we never DCE select nodes. If we ever do DCE them, we may lose information about the type of the operator.

Generating unnecessary transposes on matrix multiplies

When I export a simple nn.Linear op, I get this:

version: 1
node {
  input: "1"
  output: "4"
  op_type: "Transpose"
  attribute {
    name: "axes"
    ints: 0
    ints: 1
  }
}
node {
  input: "3"
  input: "4"
  input: "2"
  output: "5"
  op_type: "FC"
}
name: "torch-jit-export"
input: "1"
input: "2"
input: "3"
output: "5"

The transpose is pretty pointless.

Python name matching is unsound

At the moment, we are using getPythonName to compute the correspondence between Python operands known wired-in ops. This is unsound for a few reasons. The most obvious is that __name__ only reports the unqualified name, which means that if someone else defines an Add function with different semantics, we'll incorrectly conclude that the operand can be used in this case. But the more insidious is that the C++ backend now has a dependency on the particular names being used in the Python level, when there shouldn't be any dependence at all.

An easy way to make things more robust is to look at the entire name, including module qualifier; then if we restrict ourselves to core library operands we can simply enforce as an invariant in PyTorch that people should not be renaming operands willy-nilly. You can avoid this invariant when we start moving operators to C++, in which case the C++ operand can be directly responsible for finding the IR. In the longest term, re #36, it would be best to not special-case on particular operator names, but design metadata for the operators which are sufficient for our passes to do what they need to do.

Expiration strategy doesn't work with module parameters

    def test_retrace_model(self):
        class MyNet(nn.Module):
            def __init__(self):
                super(MyNet, self).__init__()
                self.hidden = nn.Parameter(torch.randn(2, 2))
            def forward(self, input):
                return input
        m = MyNet()
        x = Variable(torch.randn(1))
        trace, _ = torch.jit.record_trace(m, x)
        y = Variable(torch.randn(1))
        trace, _ = torch.jit.record_trace(m, y)

Fails with:

======================================================================
ERROR: test_retrace_model (__main__.TestJit)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/test_jit.py", line 54, in test_retrace_model
    trace, _ = torch.jit.record_trace(m, y)
  File "/data/users/ezyang/pytorch/torch/jit.py", line 192, in record_trace
    args, parameters)
  File "/data/users/ezyang/pytorch/torch/jit.py", line 148, in record_trace
    self.saved_trace = torch._C._tracer_enter(trace_inputs)
RuntimeError: /data/users/ezyang/pytorch/torch/csrc/jit/tracer.h:167: enter: Assertion `input->tracing_state.state.expired()` failed.

I structured the example this way to avoid cries of "It's UB!" The code above is morally the same as this smaller repro:

    def test_retrace_model(self):
        def f(input):
            return input
        x = Variable(torch.randn(1))
        trace, _ = torch.jit.record_trace(f, x)
        trace, _ = torch.jit.record_trace(f, x)

And in case people may have forgotten, let me remind you of #57, which I claim would have solved this problem...

SimpleEval nodes might prevent resources from being freed

Because the hold a pointers to functions from arbitrary autograd graphs

Mark regions as untraceable

While thinking about Adam's work on handling untraceable functions, which handles backwards to backwards-backwards, I realized that the analogous situation arises on forwards to backwards when we introduce a "stop tracing in this region" combinator. So it may make sense to design these in tandem. Here is a simple example:

@nojit
def f(y):
  return y * 3

y = x * 2
z = f(y)
a = z + 2
return a

This should trace into

graph (%x) {
  %y = MulScalar [2] %x
  %z, %f_grad_fn = CallPython [f] %y
  %a = AddScalar [2] %z
  return %a
}

With backwards tracing, this should trace into:

graph (%x, %grad_a) {
  %y = MulScalar [2] %x
  %z, %f_grad_fn = CallPython [f] %y
  %a = AddScalar [2] %z
  ----------- STAGE 1 -----------
  %grad_z = %grad_a
  %grad_y = Eval %f_grad_fn, %grad_z
  %grad_x = MulScalar [2] %grad_y 
  return %a, %grad_x
}

We have to handle the same edge-cases as in the backwards trace. For example, let's consider how to handle inplace updates.

@nojit_inplace(True)
def f(y):
  y.add_(2)
  return y * 3

y = x * 2
z = f(y)
a = z + y + 2
return a

Now when we trace this, we need to get the following:

graph (%x) {
  %y = MulScalar [2] %x
  %z, %y_inplace, %f_grad_fn = CallPython [f] %y
  %y_plus_two = AddScalar [2] %y_inplace
  %a = Add %z, %y_plus_two
  return %a
}

The key is the new %y_inplace output: in straight-line Python code, we can simulate inplace operations purely by adding them as extra outputs to the function, state monad style. Because %y becomes dead at this point, it is impossible for subsequent tracing to violate uniqueness. We can only do this if we know a priori which variables in the uninterpretable function might be mutated.

This issue says little about implementation issues atm, but here are some examples we want to support.

TODO: Do examples with multiple inputs and outputs.