devitocodes / devito Goto Github PK

View Code? Open in Web Editor NEW

549.0 32.0 223.0 227.95 MB

DSL and compiler framework for automated finite-differences and stencil computation

Home Page: http://www.devitoproject.org

License: MIT License

Python 99.86% Shell 0.14%

finite-difference sympy jit performance fwi rtm dsl gpu hpc stencil compiler code-generation ultrasound-imaging

devito's Introduction

Devito: Fast Stencil Computation from Symbolic Specification

Devito is a Python package to implement optimized stencil computation (e.g., finite differences, image processing, machine learning) from high-level symbolic problem definitions. Devito builds on SymPy and employs automated code generation and just-in-time compilation to execute optimized computational kernels on several computer platforms, including CPUs, GPUs, and clusters thereof.

About Devito
Installation
Resources
FAQs
Performance
Get in touch
Interactive jupyter notebooks

About Devito

Devito provides a functional language to implement sophisticated operators that can be made up of multiple stencil computations, boundary conditions, sparse operations (e.g., interpolation), and much more. A typical use case is explicit finite difference methods for approximating partial differential equations. For example, a 2D diffusion operator may be implemented with Devito as follows

>>> grid = Grid(shape=(10, 10))
>>> f = TimeFunction(name='f', grid=grid, space_order=2)
>>> eqn = Eq(f.dt, 0.5 * f.laplace)
>>> op = Operator(Eq(f.forward, solve(eqn, f.forward)))

An Operator generates low-level code from an ordered collection of Eq (the example above being for a single equation). This code may also be compiled and executed

>>> op(t=timesteps, dt=dt)

There is virtually no limit to the complexity of an Operator -- the Devito compiler will automatically analyze the input, detect and apply optimizations (including single- and multi-node parallelism), and eventually generate code with suitable loops and expressions.

Key features include:

A functional language to express finite difference operators.
Straightforward mechanisms to adjust the discretization.
Constructs to express sparse operators (e.g., interpolation), classic linear operators (e.g., convolutions), and tensor contractions.
Seamless support for boundary conditions and adjoint operators.
A flexible API to define custom stencils, sub-domains, sub-sampling, and staggered grids.
Generation of highly optimized parallel code (SIMD vectorization, CPU and GPU parallelism via OpenMP and OpenACC, multi-node parallelism via MPI, blocking, aggressive symbolic transformations for FLOP reduction, etc.).
Distributed NumPy arrays over multi-node (MPI) domain decompositions.
Inspection and customization of the generated code.
Autotuning framework to ease performance tuning.
Smooth integration with popular Python packages such as NumPy, SymPy, Dask, and SciPy, as well as machine learning frameworks such as TensorFlow and PyTorch.

Installation

The easiest way to try Devito is through Docker using the following commands:

# get the code
git clone https://github.com/devitocodes/devito.git
cd devito

# start a jupyter notebook server on port 8888
docker-compose up devito

After running the last command above, the terminal will display a URL such as https://127.0.0.1:8888/?token=XXX. Copy-paste this URL into a browser window to start a Jupyter notebook session where you can go through the tutorials provided with Devito or create your own notebooks.

See here for detailed installation instructions and other options. If you encounter a problem during installation, please see the installation issues we have seen in the past.

Resources

To learn how to use Devito, here is a good place to start, with lots of examples and tutorials.

The website also provides access to other information, including documentation and instructions for citing us.

Some FAQs are discussed here.

Performance

If you are interested in any of the following

Generation of parallel code (CPU, GPU, multi-node via MPI);
Performance tuning;
Benchmarking operators;

then you should take a look at this README.

You may also be interested in TheMatrix -- a cross-architecture benchmarking framework showing the performance of several production-grade seismic operators implemented with Devito.

Get in touch

If you're using Devito, we would like to hear from you. Whether you are facing issues or just trying it out, join the conversation.

Interactive jupyter notebooks

The tutorial jupyter notebook are available interactively at the public binder jupyterhub.

devito's People

Contributors

Stargazers

Watchers

Forkers

kwinkunks xiaocenxiaocen issamsaid tjb900 nw0 gamdow alialrahim dalide fymenq esiwgnahz skkamyab alisiahkoohi aiiluo dlight vmickus numerical-fun richah2 shineusn woxin5295 cavalcantelucas maelso longyanu tccw rajatrasal nathangeology thast yuriyi fractalfragment georgebisbas ahartikainen rcluan antongk vincepandolfo rwalkerlewis chaoshunh vkrgithub cxz andrealoevemaerke jrt54 gabrielsvc ivasconcelosuu lukasmosser varinic britishpiper guoyinzh smartalech ofmla pedrospeixoto sujathakestur razorzxo cc-a kmch minasel rohitthecool brunomariz tallamjr jemrobinson iamleeg fabioluporini mhtbrshdfrd akpopoola mikeando ufrn-lapps rhodrin mopshell edcaunt zhang-rx eminsight benedito123 otavio-r-filho dciborow navjotk millennial-geoscience prathameshtugaonkar yohanesnuwara ckhimji wenj18 geophysics-opensource rawsh jaimesouza aeroowl felipeaugustogudes yanivhollander kenhester eric-wieser deepearth scdlresearch dabiged mrzjrbn acse-gc2016 maksgraczyk cha-tzi dbowman-ion jack-lascala hermes-senger samtkaplan danielpipa ma-sadeghi faizao nogueirapeterson

devito's Issues

indexing

@navjotk
the indexes x,y in __prepare_stencil are flipped (y,x) compared to the rest. It will create problem for non square domains

Centralised Configuration (Was Configuration File)

Over time, devito is using quite a few environment variables (4, if I am not mistaken). Since devito is being made as a library, it doesn't make sense to accept command line flags since the actual executable will not always be in our control (since the "end-user" of devito would be writing the executables). The remaining options to take in configuration parameters (like the choice of compiler, compiler flags, whether to enable debug/profiling mode or not) are Environment variables and a configuration file. Environment variables can become difficult to track when the number increases (and at 4, it is already complex). They also make debugging issues harder - setting up the project on a fresh machine will also be much quicker with a default configuration file.

Maybe we should move to reading these parameters from a config file?

Logging mechanism

We need a logging system to make it easy to find errors.
The same (or a different?) mechanism could be used for profiling and benchmarking.

Reduced test base for installation testing

We need to mark some tests as essential to produce a reduced test base to be run at installation time.

These tests will be used to test that an installation has been completed successfully. Running the complete test base for this purpose is an overkill and excessively time consuming.

quality of generated code

it's useful if we generate readable code. CSE helps, but it's not enough.

For example, at the moment we produce a long list of declarations:
float temp1;
float temp2;
float temp3;
...

Followed by a list of assignments
temp1 = ...
temp2 = ...
temp3 = ...

This essentially doubles the code size when hundreds of temporaries are used. We should rather have:
float temp1 = ...;
float temp2 = ...;
float temp3 = ...;

@vincepandolfo I'm assigning this to you but clearly this has lower priority than benchmarking

some python modules are executable

nothing more to say

Intermittent TTI failures on Travis OSX

The travis OSX builders frequently, but not regularly, segfault on the the standalone TTI example test. Presumably this might be due to the shaky nature of the travis OSX builders, so OSX testing is temporarily disabled. If anyone is interested in debugging and fixing this, simply remove the allow_failures line from the build matrix in .travis.yml.

Devito `Dimension` objects don't match `sympy.abc` symbols

The following snippet works and creates a sympy.Derivative object:

from devito.import TimeData, x
f = TimeData(name='f', shape=(3, 3), time_order=1, space_order=2)
print f.diff(x)

However, the statement below prints 0, although both should be equivalent:

from devito.import TimeData
from sympy.abc import x
f = TimeData(name='f', shape=(3, 3), time_order=1, space_order=2)
print f.diff(x)

Numpy dependency

In trying to install devito on the yemoja cluster I realised the following:

Our requirements.txt does not list numpy as a dependency
Travis.yaml does not download/install numpy in any special way
The tests pass on travis
Running the same tests on the cluster, with numpy 1.8, 3 tests fail. These same tests pass on version 1.11 of numpy.

Which version of numpy do we need?
Should this be added to requirements.txt?
How does travis know which version of numpy to use?

Generic symbolic dimensions

The default symbols used to represent the different spatial and temporal dimensions work fine for our current examples and test cases, but restrict users that may want to implement other, more complicated equations or time-stepping schemes. A more dynamic and customisable approach will facilitate further implementations and should allow us to significantly reduce the complexity of the Propagator. This, however, requires careful design which can be discussed in this issue.

My current thinking is based around the idea of introducing some type of Dimension class that, similar to our SymbolicData objects, associates SymPy dimension symbols (x, y, z, t) with meta-data about dimension sizes and loop iteration spaces. Such a model might work something like this:

A Propagator maintains a list of Dimensions, which it can query to generate the appropriate C loop code, reducing the complexity of the Propagator code.
A Dimension might yield multiple IterationSpace objects that represent sub-loops for loop blocking. These can be created according to various blocking schemes and across multiple dimensions.
Individual stencil equations coming into the Operator can be scanned for Dimension symbols, which allows us to capture free and bound variables and will allow us to automatically insert the stencil at the right depth in the loop nest. This will help with sanity checks for operator merging and inserting sparse operations, ie. source and receiver terms, more gracefully.
Additional SymPy expressions might be used to define custom stepping schemes for individual iteration spaces, which will allow user to implement custom time-stepping algorithms.

As stated above, this issue is intended for discussing and drafting a possible implementation, so comments and suggestions are highly welcome.

Print statement

Hi
Please put parenthesis to all the print statements:

print("stuff to print")
not
print "stuff to print"

The second one makes everything crash with python 3.x

I replaced all of it in my current PR but please remember it for the future.

Memory access issue

@navjotk @pvelesko @vincepandolfo @mlange05
I have an issue with my new Branch test_gradient. I am running the simple adjoint test test_adjointA.py that worked perfectly. I however did some rewrite of Acoustic_codegen and fwi_operators. I would be gratefull if anyone could help with it.
Currently if I run the exact same test 10 times, about 8 out of 10 times I get NaN instead of the proper values, while the other 2 other times the test pass perfectly.

valgrind --tool=memcheck py.test -vs tests/test_adjointA.py &> logAdj
I ran valgrind on it and I get the following error :

==8642== Invalid read of size 16 ==8642== at 0x103E8E30: ForwardOperator (ae8148534a2554d9701a4f62b11b7b82a54c9a34.cpp:38) ==8642== by 0xBF25ADB: ffi_call_unix64 (in /usr/lib/x86_64-linux-gnu/libffi.so.6.0.1) ==8642== by 0xBF2540B: ffi_call (in /usr/lib/x86_64-linux-gnu/libffi.so.6.0.1) ==8642== by 0xBD135FD: _ctypes_callproc (in /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so) ==8642== by 0xBD14F9D: ??? (in /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so) ==8642== by 0x505F95: PyObject_Call (in /usr/bin/python2.7) ==8642== by 0x49B079: PyEval_EvalFrameEx (in /usr/bin/python2.7) ==8642== by 0x4A090B: PyEval_EvalCodeEx (in /usr/bin/python2.7) ==8642== by 0x499A51: PyEval_EvalFrameEx (in /usr/bin/python2.7) ==8642== by 0x4A090B: PyEval_EvalCodeEx (in /usr/bin/python2.7) ==8642== by 0x499A51: PyEval_EvalFrameEx (in /usr/bin/python2.7) ==8642== by 0x4A1C99: ??? (in /usr/bin/python2.7) ==8642== Address 0xdc926d8 is 33,656 bytes inside a block of size 33,664 alloc'd ==8642== at 0x4C2AB80: malloc (vg_replace_malloc.c:296) ==8642== by 0x8C8AC0B: ??? (in /usr/lib/python2.7/dist-packages/numpy/core/multiarray.so) ==8642== by 0x8C29321: ??? (in /usr/lib/python2.7/dist-packages/numpy/core/multiarray.so) ==8642== by 0x8C2BCD1: ??? (in /usr/lib/python2.7/dist-packages/numpy/core/multiarray.so) ==8642== by 0x8C8C647: ??? (in /usr/lib/python2.7/dist-packages/numpy/core/multiarray.so) ==8642== by 0x49EC75: PyEval_EvalFrameEx (in /usr/bin/python2.7) ==8642== by 0x4A090B: PyEval_EvalCodeEx (in /usr/bin/python2.7) ==8642== by 0x499A51: PyEval_EvalFrameEx (in /usr/bin/python2.7) ==8642== by 0x499EF1: PyEval_EvalFrameEx (in /usr/bin/python2.7) ==8642== by 0x4A1C99: ??? (in /usr/bin/python2.7) ==8642== by 0x505F95: PyObject_Call (in /usr/bin/python2.7) ==8642== by 0x536094: ??? (in /usr/bin/python2.7) ==8642== ==8642== Invalid read of size 16 ==8642== at 0x105EBDD8: AdjointOperator (b79c4970ce58dd574bc69b44d71769d497267caf.cpp:38) ==8642== by 0xBF25ADB: ffi_call_unix64 (in /usr/lib/x86_64-linux-gnu/libffi.so.6.0.1) ==8642== by 0xBF2540B: ffi_call (in /usr/lib/x86_64-linux-gnu/libffi.so.6.0.1) ==8642== by 0xBD135FD: _ctypes_callproc (in /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so) ==8642== by 0xBD14F9D: ??? (in /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so) ==8642== by 0x505F95: PyObject_Call (in /usr/bin/python2.7) ==8642== by 0x49B079: PyEval_EvalFrameEx (in /usr/bin/python2.7) ==8642== by 0x4A090B: PyEval_EvalCodeEx (in /usr/bin/python2.7) ==8642== by 0x499A51: PyEval_EvalFrameEx (in /usr/bin/python2.7) ==8642== by 0x4A090B: PyEval_EvalCodeEx (in /usr/bin/python2.7) ==8642== by 0x49AB44: PyEval_EvalFrameEx (in /usr/bin/python2.7) ==8642== by 0x4A1C99: ??? (in /usr/bin/python2.7) ==8642== Address 0xdc926d8 is 33,656 bytes inside a block of size 33,664 alloc'd ==8642== at 0x4C2AB80: malloc (vg_replace_malloc.c:296) ==8642== by 0x8C8AC0B: ??? (in /usr/lib/python2.7/dist-packages/numpy/core/multiarray.so) ==8642== by 0x8C29321: ??? (in /usr/lib/python2.7/dist-packages/numpy/core/multiarray.so) ==8642== by 0x8C2BCD1: ??? (in /usr/lib/python2.7/dist-packages/numpy/core/multiarray.so) ==8642== by 0x8C8C647: ??? (in /usr/lib/python2.7/dist-packages/numpy/core/multiarray.so) ==8642== by 0x49EC75: PyEval_EvalFrameEx (in /usr/bin/python2.7) ==8642== by 0x4A090B: PyEval_EvalCodeEx (in /usr/bin/python2.7) ==8642== by 0x499A51: PyEval_EvalFrameEx (in /usr/bin/python2.7) ==8642== by 0x499EF1: PyEval_EvalFrameEx (in /usr/bin/python2.7) ==8642== by 0x4A1C99: ??? (in /usr/bin/python2.7) ==8642== by 0x505F95: PyObject_Call (in /usr/bin/python2.7) ==8642== by 0x536094: ??? (in /usr/bin/python2.7)

Copies in operator

I realized that all inputs to TTI_codegen and Acoustic_codegen are copied e.g :

self.m = DenseData(name="m", shape=self.model.vp.shape, dtype=self.dtype)
self.m.data[:] = self.model.vp**(-2)

Ii means that if model is modified, Acoustic won't be making the inversion workflow complicated. WOuld it be possible to do something like
self.m = DenseData(name="m", shape=self.model.vp.shape, dtype=self.dtype)
self.m.data[:] = self.model.get_m()

where self.model.get_m() would be a pointer to the get_m function so that everytime the operator is called it will run with whatever is vp at the current time. It will allow to separate the model structure (and Shot, rec) fro mthe actual propagator allowing model changes without having to recreate a new operator that wold copy again vp.

Code generation workflow

@ggorman @navjotk @mlange05
This is a general concern about the generation part. We do have a really nice tool however there is a few flags that will hit us hard if we don't look at it now. I will work on it a lot but some of it is completely out of my skills and knowledge to do anything about it. Feel free to add anyone to this issue.

The first one concern the symbolic expressions handling. It seems like the PDE stencils are not treated the same way as the src/rec and some other stencil.
one example is :
eqn = m * v.dt2 - v.laplace - damp * v.dt
stencil = solve(eqn, v.backward)[0]
# Add Gradient-specific updates. The dt2 is currently hacky as it has to match the cyclic indices
gradient_update = Eq(grad, grad - s**-2*(v + v.forward - 2 * v.forward.forward) * u.forward)
`stencils = [gradient_update, Eq(v.backward, stencil)]``

The actual stencil 'stencil' will be treated properly as designed for, however gradient_update won't because the left-hand side is not time dependent. This leads to this (v + v.forward - 2 * v.forward.forward) * u.forward that should only be v.dt2 * u. The reason being gradient_update completly skip the cyclic indexing forcing to look into the c code and force by hand the correct indices. This make the symbolic expression complely wrong from the mathematical point of view and defeat the purpose of it.

This problem also rises with the source/reciever part that is hard-coded to be injected t (translated to t2 in code) no matter what you want to do.

The second issue is mostly due to the Born function. As it is written in the python code Acoustic2D.py the computations is :
1- compute stencil 1 (u)
2- Add source to u
3- Compute stencil 2 (U)
4- add source to U
5- read data from U

The way it is currently implemente, it switch 1 and 2 and do 1,3,4 within the spacial loops. This does not do the same as the source for U depends on u and require u to have the source added. The only way to do that, is to do it like in the python and to do a first full spacial loop for the first stencil, add the source, and do a full second spatial loop for U. This is not currently possible in our workflow as it is solely based on
A pre-stencils
B stencils
C post-stencils

with no possibility to split the stencils part and add intermediate computations.

Finally i think our data structure have to me modified. There is a lot of copying and redundancy e.g
self.rec = SourceLike(name="rec", npoint=self.nrec, nt=self.nt, dt=self.dt, h=self.h, coordinates=self.data.receiver_coords, ndim=len(dimensions), dtype=self.dtype, nbpml=nbpml)

I do not understand why a source structure would require nbpml or h.
the rest of the data concern is in issue #66

Parallel test runs on travis

Should be fairly uncontroversial and needs to happen soon so that we catch OpenMP-related bugs before merging them.

Initialization of `SymbolicData` objects

Since #67 brought to light some issues with the initialization of SymbolicData objects (TimeData objects specifically) we need to find a coherent and uniform interface to initialize SymbolicData objects.

A possible solution for TimeData would be to have the data property reference _full_data instead of _data. In this way when self.initializer is called from initialize it will get as an argument the full data array instead of the data array with the hidden indexes.

This would also have the effect of removing the time padding from the user point of view.

Initial discussion with @mlange05 here: 09dc155#commitcomment-18517910

absolute imports

not much to say. Just that we should always use absolute imports. For example, instead of doing
import logger
we should have:
import devito.logger

Possible redundant code in devito/propagator.py, sympy_to_cgen() method and what does _pre_kernel_steps do?

        stmts = []

        for equality in stencils:
            stencil = self.convert_equality_to_cgen(equality)
            stmts.append(stencil)

        kernel = self._pre_kernel_steps
        kernel += stmts
        kernel += self._post_kernel_steps

        return cgen.Block(factors+decl+kernel)

When I comment out kernal = self._pre_kernel_steps, devito will generate exactly the same file without commenting it.

Could someone explain what does it do? What exactly is self._pre_kernel_steps?
I also found out _pre_kernel_steps is exactly the same as the stmts and _pre_kernel_steps will also vary as stmts varies.

For example, if I add a line of code stmts.append(cgen.Assign("testing", "justForTesting()")) before kernel = self._pre_kernel_steps. The _pre_kernel_steps will also contain one more item which is 'cgen.Assign("testing, "justForTesting()"))`.

Receiver setup

@navjotk @ggorman

I see that now, the receiver location (xrec) is defined as an attribute of the propagator. It should be defined the same ways as the source, mainly
xrec= self.data.get_rec_loc()
or something like that. All data related parameter (source location, receiver location,...) have to be inside the data object not the propagator.

Tests

As @mlange05 has pointed out in the past, we lack robust tests for the functionality that already exists.
While the need for tests is probably obvious to all of us, the specific tests that could be written have been largely elusive (to me) so far. I propose making a list of things that need to be tested more rigorously and also specific ideas on how to write those tests for each of them. Everyone please contribute to make this list as exhaustive as possible and we could probably farm out more specific issues from this (meta) issue once we have the specifics down.

The complete list of tests that Mathias implemented in the outer loop should be (re)implemented within the pytest framework to help more robust testing of the complete workflow. For now only the adjointA test has been implemented. The others include adjointJ, gradientFWI and gradientJ.
All operators should ideally support a debug mode where the sympy expression is internally converted to a python-callable function using lambdify and the loops implemented in python. The output of the debug mode could be compared to the output from the code generation in the test.
Tests for multi-threaded code: Compare output (norms) of single threaded code with multi-threaded code to ensure they are the same
Tests for 3d: Do we need specific tests for this if we have the debug (python) mode on all operators? Probably not.
Tests for higher order: As we realised a few days ago, the code was working fine for orders (2,2) but not for (4,x) or (2, very high order). To pick up such problems early in the future, we might want to have some sort of automatic testing for all orders (something beyond the gradient test since it didn't pick up these problems that @mloubout found).
Tests for variable translation: Another one of the bugs that @mloubout pointed out this week was related to how the model variables (x,y,z,t) were being mapped to the loop variables (i1, i2, i3, i4). To prevent such problems in the future maybe we could write a test that actually reads the generated code for a problem where all dimensions x, y, z are different and uses the length of the dimension to identify the mapping between (x,y,z,t) and (i1, i2, i3, i4) and verifies that they are being used consistently in array indices.

Profiling loop body bug

When profiling example retrieved times are:

kernel = 0.509173
loop_body = 8.062047
loop_stencils_a = 0.001152
loop_stencis_b = 95e-05

if my understanding is correct kernel should have a value which is larger than the rest, however loop_body timings are way higher.

@vincepandolfo Have a look at this.

Iteration extension

This is not really an issue but more of a devellopement direction. We currently have a separation between Iteration objects and symbolic objects. More specifically, no symbolic operation can be done on Iteration objects. However in a more general case, we may want to have Iteration accepting any kind of Symbolic data (e.g including derivatives) in order to avoid "by hand" implementation of some aaource terms or receiver terms.

Currently when we do src.add(m,u) it does not support doing - src.ad(m,u) (adding the source with a minus sign) and extending it to something that would allow to do (src.dt).add(m,u) would allow to work with a lot of more general inversion objective function implementations

Pre/Post stencils loops generation

@mlange05
I' don't know if I missed something but this lines trouble me :

        # Statements to be inserted into the time loop before the spatial loop
        time_loop_stencils_b = [self.time_substitutions(x)
                                for x in self.time_loop_stencils_b]
        time_loop_stencils_b = [self.convert_equality_to_cgen(x)
                                for x in self.time_loop_stencils_b]

        # Statements to be inserted into the time loop after the spatial loop
        time_loop_stencils_a = [self.time_substitutions(x)
                                for x in self.time_loop_stencils_a]
        time_loop_stencils_a = [self.convert_equality_to_cgen(x)
                                for x in self.time_loop_stencils_a]

The first statement is overwritten by the second one isn't it? If yes is it on purpose and can the time_substittion line be removed or is the second one supposed to act on time_loop_stencils_a instead of self.time_loop_stencils_a

Testing on travis on when needed

Hi guys, I am just thinking that we should not test very single push, but only the commit just before a PR merge? We seems to be the choking the server.

I am not familiar with Travis, is there way to set this up in Travis may be we should add [ci skip] in all but commits that are to be merged?

ABC and PML

A small issue I just realized. Currently we have nbpml extensions for the ABC and ~spc_order ghost points. So it turns out that for high order with have no ABC at all as it is mostly ghost points. I am pretty sure extending the model even more would not be a good idea.

I do not have a solution, this is linked to the "removing damp" problem and I just wanted it to be there for reference.

Enforce thread pinning through `Compiler` class

In order to ensure reproducible performance evaluation on NUMA systems we should enforce thread pinning in our Compiler classes for the most common tool chains (GNU and KMP). The challenge I see though is that simply "unrolling" the processor IDs as [0, .., N] might not be portable, since processor ID numbering conventions can vary between CPU generations. More generic settings like compact or scatter (KMP) might need to be explored for and compared to the "hand-unrolled" settings we use at the moment.

Consolidate compiler settings for different platforms

Compiler flags and other settings that change between different backends could be consolidated in a "backend" layer between the JitManager and the compiler. There could be various backends like:

Serial (no parallelisation)
OpenMP for Xeon on GCC
OpenMP for Xeon on Intel compiler
OpenMP for KNC
OpenMP for KNL
Choosing between these backends would require an environment variable.

This was suggested by @mlange05 . Inviting discussion on this implementation here.

Do without nt in the prepare step

This function expects nt as a parameter. However, the actual number is not really necessary in the final flow of things. This requirement needs to be removed to make the API compatible with the non-codegen version.

Codegen API

Right now, the generator class accepts an instance of the FunctionDescriptor object. This object contains all the information required to generate one C function. This could later be extended to a CodefileDescriptor, that could represent the information required to generate one .c file - which in turn could contain multiple FunctionDescriptors. The FunctionDescriptor contains fields corresponding to:

Function Parameters: These can be either matrix parameters or value(scalar) parameters. For a matrix parameter, a name and a shape is required. For a value parameter, a name and a type. One of the matrix parameters must be defined as the Looper. There must be one and only one looper for a FunctionDescriptor to be able to generate code successfully. The looper is the matrix over which the loop inside the function is defined. In the code written so far, the looper is the U parameter, i.e. the primary grid.
Function Body: Currently the body provided as a cgen Generable object is used as the body of the nested loop generated inside the function. We could later have a pre-loop and post-loop section as well.
Finer control over the generated loop: The API is meant to provide finer control over the generated loop, if required. This includes skipping border elements symmetrically (for ghost cells) and also controlling the counter direction (increasing or decreasing) in each dimension separately.

Looking for your comments, @mlange05

Time order

@opesci/opesci-dev
Why is the default time_order set to 4? I've said numerous times, 4th order is unconditionally unstable and should be completely removed for now as it requires a different PDE. So what is the next PR to go in to add it once for all.

The Operator API could take inspiration from loopy

The project loopy does something quite similar to what Operators in devito are doing. Probably their API is a good model for us to target.
Edit: To clarify - loopy is a suggestion for what the API should look like to the user of devito - there is no intention to interface with loopy.

Memory is not freed/keeps getting reallocated when creating SymbolicData objects

As demonstrated by this snippet:

import gc
import numpy as np
import resource
from sympy import Eq

from devito.interfaces import DenseData
from devito.operator import Operator


def init(data):
    data = np.zeros(data.shape)


def test_memory(nx=1000, ny=1000):

    for i in range(10):
        gc.collect()

        u = DenseData(name='u', shape=(nx, ny), dtype=np.float64, space_order=2, initializer=init)

        eqn = Eq(u, u + 3)

        op = Operator(stencils=eqn, shape=(nx, ny), nt=10, spc_border=1)

        op.apply()

        print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

Output:

The memory for the u object keeps getting reallocated.

This is probably caused by the caching of Symbolic objects.

Profiler: calculate total Flops count in Python

Currently, we resort on parsing the C code and calculating the total flops count at runtime. Ideally, we would want to do this directly in Python since we know the loop limits and we can derive the number of operations per iteration from the Sympy expressions.

This is quite hard to do with the current state of the Propagator (it would also make the code much messier that what it already is) but it should become much easier when the Propagator will be rewritten to make use of Iteration and StencilKernel classes.

We should keep this issue in mind when building the new Propagator

python 3

Small thing, current master is not compatible with python 3.5.X

Redundant OpenMP variable

I am opening an issue because I can't figure out how to comment on code already in master. Since we moved the compilation to compiler.py, the OpenMP flag is being set from the env variable directly inside compiler.py (https://github.com/opesci/devito/blob/master/devito/compiler.py#L176).

Why, then, does a variable with that name still exist in operator.py? https://github.com/opesci/devito/blob/master/devito/operator.py#L87

@mlange05

Browser-based configuration GUI

The first step towards a Devito web interface is a basic browser-based UI that provides similar capabilities to the current examples/benchmark.py script (select equation type, spatial order, etc.). The web interface should be based on the flask package and a UI template page, possibly based on Jekyll or similar. The aim is to be able to invoke a local benchmarking run through the browser GUI.

Easy-to-replicate performance benchmarks

For publication and performance regression purposes we should have one or two pre-defined benchmark tests that can be re-run with ease to test and verify new features and optimisations. These benchmarks should be fully automated and rely on a pre-defined example setup. Following PR #80 and with a set of benchmarking utilities provided in the opesci/opescibench repo we should now be able to provide this.

The proposed development spec:

Identify appropriate example tests that execute in a reasonable time frame without being too small or trivial, and re-structure them to ensure they can be fully parametrized.
Add either <example>_bench.py or add --benchmark option to standalone mode of the test itself.
Create a benchmark class <Example>Executor that inherits from opescibench.Executor and defines how to execute the test runs and collects performance data (runtime, GFlops), as well as meta-data (OI).
Define a <Example>Benchmark class that inherits fromopescibench.Benchmark`, which allows it to invoke repeated performance runs and store the collected data and meta-data in JSON files. The specific performance relevant parameters should be configurable, for example:

python <example>_bench.py --benchmark --compiler clang --cse --cache-blocking 8 16 ...

Use the Benchmark class and custom matplotlib code to pre-define various types of plots, including roofline, bar comparisons and parallel scaling graphs. This should be invoked with arguments --plot --plottype or similar and the relevant parameters to compare and plot, for example:

python <example>_bench.py --plot --plottype roofline --cse --compiler clang gcc intel

The final command line options and some default settings to benchmark execution and plotting should be documented in a separate README file in the examples directory.

Please note that opescibench is intended to provide most of the core functionalities detailed here, ranging from repeated execution and data collection to various utilities for pretty plotting. It is, however, very much a work in progress and more work is probably required on providing more (any) documentation and enhancing the current features.

Factors in TTI and general case

@pvelesko @navjotk
I found the problem with instability in TTI. It was not a precision issue, but a missing private statement with openmp. I currenly hard coded it by hand to be able to run tests, but this needs to be fixed in propagator. The constant factors (and the same will apply for CSE constants) have to be declared private in the omp environment.

Line length must be below 90

I'm happy with a straw poll, and maybe we don't really want to set a strict rule, but we should really stop writing long lines of code

PEP8 says 80, I think 90 is OK.

If necessary we should add the rule.

Opinions ?

If we decide to add the rule (such that flake8 complains if broken), please @vincepandolfo add it ASAP (im not sure I'll be looking into this as of tomorrow)

Profiling broken

The ability to profile code, a feature that was previously working is now broken on master.

_indices in interfaces

The way the dimensions are obtained in TimeData in interfaces is

        if dimensions is None:
            # Infer dimensions from default and data shape
            _indices = [t, x, y, z]
            shape = kwargs.get('shape')
            dimensions = _indices[:len(shape) + 1]
        return dimensions

however for 3D len(shape)=4 therefor len(shape)+1 =5 out of range
and for 2D len(shape)+1 =4 so it outputs 3D dimensions [t, x, y, z]

Final timestep index not clearly defined when `save=False` in TimeData

This is the stencil generated for test_diffusion.py:

extern "C" int Operator(float *u_vec)
{
  float (*u)[100][100] = (float (*)[100][100]) u_vec;
  {
    int t0;
    int t1;
    for (int i3 = 0; i3<19; i3+=1)
    {
      {
        t0 = (i3)%(2);
        t1 = (t0 + 1)%(2);
      }
      {
        for (int i1 = 1; i1<99; i1++)
        {
          #pragma GCC ivdep
          for (int i2 = 1; i2<99; i2++)
          {
            u[t1][i1][i2] = 1.11022302462516e-16F*u[t0][i1][i2] + 2.5e-1F*u[t0][i1][i2 - 1] + 2.5e-1F*u[t0][i1][i2 + 1] + 2.5e-1F*u[t0][i1 - 1][i2] + 2.5e-1F*u[t0][i1 + 1][i2];
          }
        }
      }
    }
  }
  return 0;
}

The values of t0 and t1 alternate, so it's not clearly defined what index contains the final timestep. If the number of timesteps was even, the final result would be at index 1, while in this code the final result is at index 0

Src/Rec positions

We need to remove the hard coding of the source/receiver position and reading/injecting the field. The way it currently is done works fine for small geometries, but involve re-generation of the code for every source. This is a cheap part of the computation so we can afford to replace this (40-100k lines) hardcoding with a few inputs and a general src/rec injection/read function allowing to generate and compile the code only once.

For example, I am running a test model of size 201 x 201 x 70 , quite small, that should run within a couple minutes, however with 40000 receivers, the compiler takes forever (already 20min of cc1plus and still no sign of even starting to run).

Diffusion test looks incomplete

In the code for the diffusion test, I see that no time_dim parameter is being passed to the TimeData constructor here. I think the behaviour of the entire TimeData object is undefined in such a case and there should be an assert in the constructor to make sure we don't execute in case this happens. In the particular case of the diffusion example, I doubt this code is doing what the author meant it to do.

Also, the save test which was inspired by this test actually fixes this.

Container for managing wave-field time history

Need to add a container for storing the time history of the forward model.

Strict requirements:

API should be abstracted for opesci/devito so that the container backend can be replaced in the future if necessary.
Support for very large datasets - out of core support.
Configurable to use available RAM - spilling to disk when necessary.
Need to be able to specify storage device.

Desirable features:

Data structure should behave as a stack - i.e. last in first out (LIFI) data structure.
Support for non-blocking push and pop.
Prefetching from disk while popping.

Initial suggestion is to build support around stxxl::stack - http://stxxl.sourceforge.net/tags/master/tutorial_stack.html

Caching of generated code

On our current FWI runs, most of the time is spent by the workers compiling code, running the DSE transformations etc. This is obviously needless repetition that we should be avoiding by caching these results. Although I call it caching of generated code in the issue title, I mean to suggest caching at a higher level than just the code since even the DSE need not be called. Feel free to suggest a different title for the issue.

Total loads count is not calculated appropriately by the Profiler

Currently, the total number of loads does not take into account the indexes at which a load is performance.

For example, if v[i1][i2][i3] appears more than once in the code, every single instance of it will be counted as a different load.

This causes problems when calculating the optimal block size dimension based on the architecture.

Pip installation

To facilitate easy installation and cluster deployment we should add a setup.py file to enable installation via pip.

Load/store OI calculations potentially should be done without Profiling.

Info like OI or load count might be needed even if we are running with profiling=False.

For example when estimating optimal block size load count is always needed.
As a workaround I had to create an additional temporally profiler which would extract it even when profiling = False.

For future use , we might need to move the load/store and OI calculations back to the sympy expressions

PML

@navjotk
damp proflile needs to have a 3D condition as discussed yesterday