inducer / loopy

A code generator for array-based code on CPUs and GPUs

Home Page: http://mathema.tician.de/software/loopy

License: MIT License

Python 99.64% Shell 0.06% Fortran 0.27% Vim Script 0.03%

python opencl cuda ispc code-generation code-generator code-optimization code-transformation reduction scan

loopy's Introduction

Loopy: Transformation-Based Generation of High-Performance CPU/GPU Code

Loopy lets you easily generate the tedious, complicated code that is necessary to get good performance out of GPUs and multi-core CPUs. Loopy's core idea is that a computation should be described simply and then transformed into a version that gets high performance. This transformation takes place under user control, from within Python.

It can capture the following types of optimizations:

Vector and multi-core parallelism in the OpenCL/CUDA model
Data layout transformations (structure of arrays to array of structures)
Loop unrolling
Loop tiling with efficient handling of boundary cases
Prefetching/copy optimizations
Instruction level parallelism
and many more

Loopy targets array-type computations, such as the following:

dense linear algebra,
convolutions,
n-body interactions,
PDE solvers, such as finite element, finite difference, and Fast-Multipole-type computations

It is not (and does not want to be) a general-purpose programming language.

Loopy is licensed under the liberal MIT license and free for commercial, academic, and private use. All of Loopy's dependencies can be automatically installed from the package index after using:

pip install loopy

In addition, Loopy is compatible with and enhances pyopencl.

---

Places on the web related to Loopy:

Python package index (download releases)
Documentation (read how things work)
Github (get latest source code, file bugs)
Homepage
Benchmarks

loopy's Issues

Kernel preprocessing removes CallInstructions without assignees

I am currently trying to make CallInstructions work with 0 assignees (as talked about briefly in Warwick). After the small issues I have already opened MRs for, I now have a bigger issue that I did not yet find a fix for.

preprocess_kernel, in particular realize_reduction, removes all CallInstructions without assignees from the input. See the following minimum example:

import loopy as lp
import numpy
import pymbolic.primitives as p
lp.CACHING_ENABLED = False

red = lp.Reduction("sum", ("i",), p.Subscript(p.Variable("x"), p.Variable("i")))
insn = lp.CallInstruction([], p.Call(p.Variable("f"), (red,)))

knl = lp.make_kernel("{[i] : 0<=i<n}", [insn], [lp.GlobalArg("x", dtype=numpy.float32)])
pknl = lp.preprocess_kernel(knl)
pknl.instructions

which outputs just the two expanded reduction instructions, but no call to f:

[Assignment(no_sync_with=frozenset([]), atomicity=(), boostable=False, conflicts_with_groups=frozenset([]), assignee=Variable('acc_i'), depends_on_is_final=False, depends_on=frozenset([]), tags=frozenset([]), priority=0, boostable_into=set([]), groups=frozenset([]), forced_iname_deps_is_final=False, predicates=frozenset([]), forced_iname_deps=frozenset([]), expression=0, id='insn_i_init', temp_var_type=None),
 Assignment(no_sync_with=frozenset([]), atomicity=(), boostable=False, conflicts_with_groups=frozenset([]), assignee=Variable('acc_i'), depends_on_is_final=False, depends_on=frozenset(['insn_i_init']), tags=frozenset([]), priority=0, boostable_into=set([]), groups=frozenset([]), forced_iname_deps_is_final=False, predicates=frozenset([]), forced_iname_deps=frozenset(['i']), expression=Sum((Variable('acc_i'), Subscript(Variable('x'), Variable('i')))), id='insn_i_update', temp_var_type=None)]

Adding an assignee to the CallInstruction in the above example works as expected.

float64 ^ int?!

l.auto should probably die

check_that_temporaries_are_defined_in_subkernels_where_used gives false positives

Consider this MWE:

import loopy as lp
import numpy as np
lp.CACHING_ENABLED = False
knl = lp.make_kernel("{ [i] : 0<=i<n }",
               ["a[i] = 0",
                "c[i] = b[i]"],
               temporary_variables={"a": lp.TemporaryVariable("a", dtype=np.float64, shape=("n",), base_storage="base"),
                                    "b": lp.TemporaryVariable("b", dtype=np.float64, shape=("n",), base_storage="base"),},
               target=lp.CTarget(),
               silenced_warnings=frozenset({"read_no_write(b)"}))
print(lp.generate_code(knl)[0])

check_that_temporaries_are_defined_in_subkernels_where_used throws a MissingDefinitionError here, where it definitely should not. I guess the root of the problem is the use of the base_storage mechanism. I am unsure how to solve this, though, as I don't know enough about this whole subkernel thing. (Are there docs yet?)

cc @mattwala

Introducing a pointer type

I am currently facing the problem that I need to generate code for a library function call that expects a pointer as an argument. I have an implementation sketched and prototyped. I would like to know whether there is a better approach already available in loopy and if not, what would need to be done to have it upstreamed.

So the sketch is:

Targets need to know their pointer type (size). That basically means that they need to know about 32-bit vs. 64-bit. This kinda hardcodes the assumption that pointer types do not have variable length (are there relevant platforms where this is not true?).
Add a PointerType in loopy.types. It contains callbacks into the target to learn about the actual dtype. This is what the function mangler would use.
Introduce additional type_context p for pointer types in ExpressionToCExpressionMapper.
Adjust map_variable such that given a kernel argument, it returns an address (sometimes wrapping address operators)
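
The first two points of the sketch could look roughly like the following. This is a stand-alone illustration rather than loopy code: HypotheticalTarget, pointer_ctype, itemsize, and c_declaration are made-up names, and mapping the pointer width to an unsigned integer type of the same size is an assumption.

```python
import ctypes


class HypotheticalTarget:
    """Stand-in for a loopy target; it only knows its pointer width."""

    def __init__(self, pointer_bits):
        if pointer_bits not in (32, 64):
            raise ValueError("expected a 32- or 64-bit target")
        self.pointer_bits = pointer_bits

    def pointer_ctype(self):
        # Map the pointer width to an unsigned integer type of the same size.
        return {32: ctypes.c_uint32, 64: ctypes.c_uint64}[self.pointer_bits]


class PointerType:
    """Target-independent pointer type that defers sizing to the target."""

    def __init__(self, pointee_name):
        self.pointee_name = pointee_name

    def itemsize(self, target):
        # Callback into the target: only the target knows the real size.
        return ctypes.sizeof(target.pointer_ctype())

    def c_declaration(self, name):
        return "%s *%s" % (self.pointee_name, name)


ptr = PointerType("double")
print(ptr.c_declaration("p"), ptr.itemsize(HypotheticalTarget(64)))
```

Keeping the size out of PointerType itself means the same type object can be reused across targets, which is the point of the callback design described above.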

set_loop_priority can only be called once

I would like to call set_loop_priority more than once, because my priority constraints arise locally during the construction of the kernel. However, any further calls override the previously set priority. See the following MWE, which is scheduled as k,j,i:

import loopy
k = loopy.make_kernel("{[i,j,k] : 0<=i,j,k<5}",
                      "a[i,j,k] = 1",
                      target=loopy.CTarget(),
                     )

k = loopy.set_loop_priority(k, "i,j")
k = loopy.set_loop_priority(k, "k,j")
print(loopy.generate_code(k)[0])

As a solution, set_loop_priority could merge the two lists, preserving the order of both lists.
I am willing to implement this if you consider it a good idea. If you don't, I will do it on my side and call set_loop_priority once.
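
The proposed merge amounts to a topological sort over the ordering constraints contributed by each call. A minimal sketch in plain Python, independent of loopy: merge_priorities is a hypothetical helper, and breaking ties by first appearance is an arbitrary choice.

```python
from collections import defaultdict


def merge_priorities(*priority_lists):
    """Merge ordered iname lists into one order that respects every list."""
    order = []                     # inames in first-seen order (tie-breaker)
    successors = defaultdict(set)  # iname -> inames that must come later
    indegree = {}                  # iname -> number of unmet predecessors

    for plist in priority_lists:
        for name in plist:
            if name not in indegree:
                indegree[name] = 0
                order.append(name)
        for before, after in zip(plist, plist[1:]):
            if after not in successors[before]:
                successors[before].add(after)
                indegree[after] += 1

    result = []
    remaining = list(order)
    while remaining:
        ready = [name for name in remaining if indegree[name] == 0]
        if not ready:
            # Every remaining iname waits on another one: the lists conflict.
            raise ValueError("conflicting priorities: " + ", ".join(remaining))
        name = ready[0]
        remaining.remove(name)
        result.append(name)
        for succ in successors[name]:
            indegree[succ] -= 1
    return result


merged = merge_priorities(["i", "j"], ["k", "j"])
print(merged)  # ['i', 'k', 'j']
```

For the MWE above, both "i before j" and "k before j" are kept. Contradictory lists such as "i,j" and "j,i" raise an error instead of silently overriding each other.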

NoOpInstruction behaves unexpectedly

When I first read about NoOpInstruction, I was pretty sure I understood its purpose and immediately had a use case ready. Trying to implement it, I came to the conclusion that we might have different ideas of what it does.

I want to have a set of no-op instructions stage{1,2,..}, where each one depends on its predecessor. IMO, having any instruction depend on one of these should then have an effect similar to grouping.
Minimum example:

import loopy
from pymbolic.primitives import Subscript, Variable

stage1 = loopy.NoOpInstruction(id='stage1')
stage2 = loopy.NoOpInstruction(id='stage2', depends_on=frozenset({"stage1"}))

i1 = loopy.Assignment(Subscript(Variable("x"), Variable("i")), 0, depends_on=frozenset({"stage1"}))
i2 = loopy.Assignment(Subscript(Variable("x"), Variable("i")), 1, depends_on=frozenset({"stage2"}))

knl = loopy.make_kernel("{[i]: 0<=i<n}",
                       [stage1, stage2, i1, i2])

The dependency structure of the kernel is correct, but generating code for this does not show the expected order of instructions.

Furthermore, looking at test_special_instructions, I realized that

The NoOpInstruction does not have any effect on scheduling; you can safely remove it.
In textual assignment, it is not even possible to assign an id to a NoOp.
That being said, having an instruction depend on a NoOp is not possible.

Not sure whether this is a bug or just a misconception...

Strides and Transposes

Hey, I have a quick question on the expected strides and numpy/opencl.array transposes. I hesitate to call this an "issue", as it's not a bug with loopy but a gap in my understanding of the loopy / numpy interface.

Consider the following example:

import loopy as lp
import pyopencl as cl
import numpy as np

lp.set_caching_enabled(False)

ctx = cl.create_some_context(interactive=True)
queue = cl.CommandQueue(ctx)

a_lp = lp.GlobalArg('a', shape=(7, 200), dtype=np.float64)
knl = lp.make_kernel('{[k]: 0<=k<200}',
         """
         for k
                T[k] = a[4, k] + a[0, k]
         end
         """,
        kernel_data=[a_lp, '...'])

knl = lp.fix_parameters(knl, n=200)
a = np.random.random((200, 7))
a2 = np.zeros(a.T.shape)
a2[:, :] = a.T[:, :]
assert np.allclose(a2, a.T)

try:
    evt, (out,) = knl(queue, a=a.T)  # this will fail
except Exception as e:
    print(e)
    evt, (out,) = knl(queue, a=a2)  # this will execute

In the above, we declare an array of shape (7, 200) as a global arg, and two test arrays:

a is in the incorrect format, (200, 7).
a2 is in the correct format (7, 200).

I had assumed I could simply pass a.T in, since assert np.allclose(a2, a.T) passes.
However, I guess that numpy's transpose (and consequently clarray's transpose) does not update the strides accordingly? (I.e., it's a view instead of a data change.)

Is there a better way to do transposes (other than the zeros and fill method I use to populate a2 here)?
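
For reference, numpy's transpose returns a view on the same buffer with swapped strides, so a.T has the right shape but is not C-contiguous. The short check below shows this, along with np.ascontiguousarray as a one-call replacement for the zeros-and-fill approach. That the kernel expects the default row-major layout for a (7, 200) argument is an assumption here, inferred from the failure described above.

```python
import numpy as np

a = np.random.random((200, 7))     # C-contiguous float64, shape (200, 7)
view = a.T                         # shape (7, 200), same buffer, swapped strides
copy = np.ascontiguousarray(a.T)   # shape (7, 200), fresh row-major buffer

print(a.strides)     # (56, 8)
print(view.strides)  # (8, 56): the data was not moved
print(copy.strides)  # (1600, 8): what a row-major (7, 200) array looks like

print(view.flags["C_CONTIGUOUS"], copy.flags["C_CONTIGUOUS"])  # False True
```

The values compare equal either way, which is why np.allclose passes; only the memory layout differs.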

Different behavior of variable length loops depending on iname definition type

The following example is (slightly) modified from sparse.py:

import loopy as lp
import numpy as np
lp.set_caching_enabled(False)

k = lp.make_kernel(
    "{ [i,j] : 0 <= i < m and 0 <= j < length }", #note that we use the 'compound' definition syntax here
    """
    <> rowstart = rowstarts[i]
    <> rowend = rowstarts[i]
    <> length = rowend - rowstart
    y[i] = sum(j, values[rowstart+j] * x[colindices[rowstart + j]])
    """)

k = lp.add_and_infer_dtypes(k, {
    "values,x": np.float64, "rowstarts,colindices": k.index_dtype
    })
print(lp.generate_code(k)[0])

However, when running this snippet, I get the following error (whereas sparse.py has no issues):

  File "build/bdist.linux-x86_64/egg/loopy/codegen/__init__.py", line 523, in generate_code
    codegen_result = generate_code_v2(kernel)
  File "build/bdist.linux-x86_64/egg/loopy/codegen/__init__.py", line 387, in generate_code_v2
    kernel = get_one_scheduled_kernel(kernel)
  File "build/bdist.linux-x86_64/egg/loopy/schedule/__init__.py", line 1979, in get_one_scheduled_kernel
    for scheduled_kernel in generate_loop_schedules(kernel):
  File "build/bdist.linux-x86_64/egg/loopy/schedule/__init__.py", line 1766, in generate_loop_schedules
    for sched in generate_loop_schedules_inner(kernel, debug_args=debug_args):
  File "build/bdist.linux-x86_64/egg/loopy/schedule/__init__.py", line 1779, in generate_loop_schedules_inner
    pre_schedule_checks(kernel)
  File "build/bdist.linux-x86_64/egg/loopy/check.py", line 367, in pre_schedule_checks
    check_write_destinations(kernel)
  File "build/bdist.linux-x86_64/egg/loopy/check.py", line 330, in check_write_destinations
    "inside a domain dependent on it" % wvar)
LoopyError: domain parameter 'length' may not be written inside a domain dependent on it
Uncaught exception. Entering post mortem debugging

It appears that defining the inames in this format causes kernel.get_inames_domain(kernel.insn_inames(insn)), where kernel.insn_inames(insn) == frozenset(['i']), to return the whole domain:

BasicSet("[length, m] -> { [i, j] : 0 <= i < m and 0 <= j < length }")

Hence i depends on length.

I'm not sure if this is the desired behavior of the domains (since technically a single domain is passed in for this example). However, it's somewhat confusing, as the domains are separable.

Op types in statistics should have a repr

Flop counting of FMA

@jdsteve2 @rckirby

One issue with FLOP counting is whether we count a*x + b as one or two operations. This is called an FMA (fused multiply-add). It's often faster, and it has better accuracy than the two operations carried out separately. But precisely because its accuracy is different, compilers generally won't compile them as an FMA unless you specify -cl-fast-relaxed-math. There is also an fma function in OpenCL that allows you to explicitly ask for an FMA on a per-operation basis.

Since there are enough moving parts and since loopy doesn't yet do anything to manage FMAs itself, I think we should have a knob on whether FMAs should count as one or two flops in the flop counter.
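
To make the proposed knob concrete, here is a toy counter over nested-tuple expressions. It is a sketch under stated assumptions and not loopy's statistics code: count_flops, fma_cost, and the tuple representation are hypothetical, and it treats any sum with a product operand as an FMA candidate.

```python
def count_flops(expr, fma_cost=2):
    """Count flops in an expression such as ("+", ("*", "a", "x"), "b")."""
    if not isinstance(expr, tuple):
        return 0  # variables and constants cost nothing
    op, lhs, rhs = expr
    if op == "+":
        # A sum with a product operand could be fused into a single FMA.
        for prod, other in ((lhs, rhs), (rhs, lhs)):
            if isinstance(prod, tuple) and prod[0] == "*":
                return (fma_cost
                        + count_flops(prod[1], fma_cost)
                        + count_flops(prod[2], fma_cost)
                        + count_flops(other, fma_cost))
    return 1 + count_flops(lhs, fma_cost) + count_flops(rhs, fma_cost)


axpy = ("+", ("*", "a", "x"), "b")
print(count_flops(axpy, fma_cost=2))  # 2
print(count_flops(axpy, fma_cost=1))  # 1
```

With fma_cost=2 the result matches counting the multiply and the add separately, so the knob only changes how fused candidates are reported.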

Unroll / ILP unaware of conditionals

Hi, I'm starting to get involved with loopy, and I noticed that the unroll / ILP tag doesn't seem to be aware of what happens in a conditional.

E.g. for this simple example:

import loopy as lp
import numpy as np
import pyopencl as cl

#init
testsize = 100
T = np.random.uniform(size=testsize, low=400, high=2300)
T_arr = lp.GlobalArg('T', shape=T.shape[0], dtype=T.dtype)

def __print_code(knl):
    code, _ = lp.generate_code(knl)
    print(code)

ctx = cl.create_some_context(interactive=False)
queue = cl.CommandQueue(ctx)

knl = lp.make_kernel('{{[k]: 0 <=k<{}}}'.format(testsize),
                 """
                     for k
                         <>Tcond = T[k] < 1000 {id=dep}
                         cp[k] = 2 * T[k] {if=Tcond}
                     end
                 """,
                 [T_arr, '...'])
__print_code(knl)

knl_ilp = lp.split_iname(knl, 'k', 2, inner_tag='ilp')
__print_code(knl_ilp)

I get the following output:

Without ILP:

#define lid(N) ((int) get_local_id(N))
#define gid(N) ((int) get_group_id(N))
#if __OPENCL_C_VERSION__ < 120
#pragma OPENCL EXTENSION cl_khr_fp64: enable
#endif

__kernel void __attribute__ ((reqd_work_group_size(1, 1, 1))) loopy_kernel(__global double const *restrict T, __global double *restrict cp)
{
  int Tcond;

  for (int k = 0; k <= 99; ++k)
  {
    Tcond = T[k] < 1000.0;
    if (Tcond)
      cp[k] = 2.0 * T[k];
  }
}


With ILP:
#define lid(N) ((int) get_local_id(N))
#define gid(N) ((int) get_group_id(N))
#if __OPENCL_C_VERSION__ < 120
#pragma OPENCL EXTENSION cl_khr_fp64: enable
#endif

__kernel void __attribute__ ((reqd_work_group_size(1, 1, 1))) loopy_kernel(__global double const *restrict T, __global double *restrict cp)
{
  int Tcond[2];

  for (int k_outer = 0; k_outer <= 49; ++k_outer)
  {
    Tcond[0] = T[2 * k_outer + 0] < 1000.0;
    if (Tcond)
      cp[2 * k_outer + 0] = 2.0 * T[2 * k_outer + 0];
    Tcond[1] = T[2 * k_outer + 1] < 1000.0;
    if (Tcond)
      cp[2 * k_outer + 1] = 2.0 * T[2 * k_outer + 1];
  }
}

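Note that the condition in if (Tcond) was not updated as expected: Tcond is now an array, but the conditionals still test Tcond itself rather than Tcond[0] and Tcond[1].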

If we use a similar example for unrolling:

import loopy as lp
import numpy as np
import pyopencl as cl

#init
testsize = 100
T = np.random.uniform(size=testsize, low=400, high=2300)
T_arr = lp.GlobalArg('T', shape=T.shape[0], dtype=T.dtype)

def __print_code(knl):
    code, _ = lp.generate_code(knl)
    print(code)

ctx = cl.create_some_context(interactive=False)
queue = cl.CommandQueue(ctx)

knl = lp.make_kernel('{{[k]: 0 <=k<{}}}'.format(testsize),
                 """
                     for k
                         <>Tcond[k] = T[k] < 1000 {id=dep}
                         cp[k] = 2 * T[k] {dep=dep,if=Tcond[k]}
                     end
                 """,
                 [T_arr, '...'])
__print_code(knl)

knl_ilp = lp.split_iname(knl, 'k', 2, inner_tag='unr')
__print_code(knl_ilp)

we get a similarly flawed output:

#define lid(N) ((int) get_local_id(N))
#define gid(N) ((int) get_group_id(N))
#if __OPENCL_C_VERSION__ < 120
#pragma OPENCL EXTENSION cl_khr_fp64: enable
#endif

__kernel void __attribute__ ((reqd_work_group_size(1, 1, 1))) loopy_kernel(__global double const *restrict T, __global double *restrict cp)
{
  int Tcond[100];

  for (int k_outer = 0; k_outer <= 49; ++k_outer)
  {
    Tcond[2 * k_outer + 0] = T[2 * k_outer + 0] < 1000.0;
    if (Tcond[k])
      cp[2 * k_outer + 0] = 2.0 * T[2 * k_outer + 0];
    Tcond[2 * k_outer + 1] = T[2 * k_outer + 1] < 1000.0;
    if (Tcond[k])
      cp[2 * k_outer + 1] = 2.0 * T[2 * k_outer + 1];
  }
}

I'm running the latest commit, r2e562728.
I'm fairly new to the machinery behind loopy. Where would I start looking to correct this?
Also, as a side-note, is it possible to do else statements?

Ignore the ambiguous schedule warning

Lately I have wanted to be able to silence the ambiguous schedule warning through silenced_warnings=['ambiguous_scheduling'], instead of going through gymnastics to get rid of it. Any chance of upstreaming that patch?

Any way to run all tests / determine current Intel brokenness level?

Is there any method to run all the test_whatever.py in the top test/ directory?
I am developing a chemical kinetics library which I hope to accelerate on CPU/GPU via loo.py. This means I will need to support Intel CPUs (which I will be developing on).

I see that in auto_test.py the function:

_enumerate_cl_devices_for_ref_test

seems to exclude Intel as a target for being buggy. Fair.
I commented this out to see where Intel's OpenCL support stands (I have the most recent version, 2017, of Parallel Studio Cluster Edition installed), but now I can't figure out how to run all tests (apart from running each script individually).
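
One way to avoid running each script by hand is a small driver that collects the test_*.py files and hands each one to a test runner. The sketch below assumes pytest is installed and is the intended runner, which is not confirmed above. find_test_scripts and run_all are hypothetical helper names.

```python
import glob
import os
import subprocess
import sys


def find_test_scripts(test_dir="test"):
    """Return the sorted test_*.py paths directly inside test_dir."""
    return sorted(glob.glob(os.path.join(test_dir, "test_*.py")))


def run_all(test_dir="test"):
    """Run each script under pytest; return a {path: exit code} mapping."""
    results = {}
    for path in find_test_scripts(test_dir):
        # Assumption: pytest is available in the current interpreter.
        results[path] = subprocess.call([sys.executable, "-m", "pytest", path])
    return results
```

Calling run_all() from the repository root and filtering for nonzero exit codes would then give a per-script picture of what fails on a given OpenCL platform.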

(Seemingly) Erroneous if statement applied to code

I've run into a strange bug (seemingly) where I'm not sure why a particular conditional is applied to part of my code. The kernel in question is rather complex, so I've prepared a (somewhat) simplified version for testing:

import loopy as lp
import numpy as np
import pyopencl as cl
from loopy.kernel.data import temp_var_scope as scopes

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

out_map = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 13, 14], dtype=np.int32)
if_val = np.array([-1, 0, -1, -1, -1, -1, 0, -1, -1, 0, 0, 0, -1, 0, -1], dtype=np.int32)
vals = np.array([2, 3, 0, 1, 2, 4, 1, 2, 4, 1, 3, 6, 0, 1, 4, 7, 1, 9, 10, 3, 10, 11, 1, 14, 15, 1, 12, 13, 21, 22, 23, 21, 22, 23, 3, 24, 25, 26, 27, 28, 29, 1, 3, 6], dtype=np.int32)
num_vals = np.array([2, 4, 3, 3, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4, 3], dtype=np.int32)
num_vals_offset = np.array(np.cumsum(num_vals) - num_vals, dtype=np.int32)

knl = lp.make_kernel(['{[i]: 0 <= i < 12}',
                '{[j]: 0 <= j < 100}',
                '{[a_count]: 0 <= a_count < a_end}',
                '{[b_count]: 0 <= b_count < b_end}'],
"""
for j
    for i
        <>i_map = out_map[i]

        #find P_sum
        <> a_end = abs(if_val[i_map])
        <> a_sum = 1.0d {id=ainit}
        if if_val[i_map] > 0
            <> a_val = 10.0d {id=aval_decl}
        else
            a_val = 0.1d {id=aval_decl1}
        end
        for a_count
            a_sum = a_sum * a_val {id=a_accum, dep=ainit:aval_decl:aval_decl1}
        end

        #find b_sum
        <>b_end = num_vals[i_map]
        <>offset = num_vals_offset[i_map] {id=offset}
        <>b_sum = 0 {id=b_init}
        for b_count
            <>val = vals[offset + b_count] {dep=offset}
            if if_val[i_map] != 0
                b_sum = b_sum + if_val[i_map] * B[j,val] {id=b_accum, dep=b_init}
            end
        end
        b_sum = exp(b_sum) {id=b_final, dep=b_accum}

        out[j,i] = a_sum * b_sum {dep=a_accum:b_final}
    end
end
""",
[lp.TemporaryVariable('out_map', initializer=out_map, shape=out_map.shape, read_only=True, scope=scopes.PRIVATE),
lp.TemporaryVariable('if_val', initializer=if_val, shape=if_val.shape, read_only=True, scope=scopes.PRIVATE),
lp.TemporaryVariable('vals', initializer=vals, shape=vals.shape, read_only=True, scope=scopes.PRIVATE),
lp.TemporaryVariable('num_vals', initializer=num_vals, shape=num_vals.shape, read_only=True, scope=scopes.PRIVATE),
lp.TemporaryVariable('num_vals_offset', initializer=num_vals_offset, shape=num_vals_offset.shape, read_only=True, scope=scopes.PRIVATE),
lp.GlobalArg('B', shape=(100, 31), dtype=np.float64),
lp.GlobalArg('out', shape=(100, 12), dtype=np.float64)])

knl = lp.prioritize_loops(knl, ['j', 'i'])
print(lp.generate_code(knl)[0])

This generates the following:

#define lid(N) ((int) get_local_id(N))
#define gid(N) ((int) get_group_id(N))
#if __OPENCL_C_VERSION__ < 120
#pragma OPENCL EXTENSION cl_khr_fp64: enable
#endif

__kernel void __attribute__ ((reqd_work_group_size(1, 1, 1))) loopy_kernel(__global double const *restrict B, __global double *restrict out)
{
  int a_end;
  double a_sum;
  double a_val;
  int b_end;
  double b_sum;
  int i_map;
  int const if_val[15] = { -1, 0, -1, -1, -1, -1, 0, -1, -1, 0, 0, 0, -1, 0, -1 };
  int const num_vals[15] = { 2, 4, 3, 3, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4, 3 };
  int const num_vals_offset[15] = { 0, 2, 6, 9, 12, 14, 16, 19, 22, 25, 28, 31, 34, 37, 41 };
  int offset;
  int const out_map[12] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 13, 14 };
  int val;
  int const vals[44] = { 2, 3, 0, 1, 2, 4, 1, 2, 4, 1, 3, 6, 0, 1, 4, 7, 1, 9, 10, 3, 10, 11, 1, 14, 15, 1, 12, 13, 21, 22, 23, 21, 22, 23, 3, 24, 25, 26, 27, 28, 29, 1, 3, 6 };

  for (int j = 0; j <= 99; ++j)
    for (int i = 0; i <= 11; ++i)
    {
      i_map = out_map[i];
      offset = num_vals_offset[i_map];
      b_end = num_vals[i_map];
      a_end = abs(if_val[i_map]);
      b_sum = 0.0;
      if (!(if_val[i_map] > 0))
        a_val = 0.1;
      if (if_val[i_map] > 0)
        a_val = 10.0;
      a_sum = 1.0;
      for (int b_count = 0; b_count <= -1 + b_end; ++b_count)
        **if (-1 + a_end >= 0)**
        {
          val = vals[offset + b_count];
          if (if_val[i_map] != 0)
            b_sum = b_sum + if_val[i_map] * B[31 * j + val];
        }
      b_sum = exp(b_sum);
      for (int a_count = 0; a_count <= -1 + a_end; ++a_count)
        a_sum = a_sum * a_val;
      out[12 * j + i] = a_sum * b_sum;
    }
}

I'm not sure if I'm just being dense and missing something obvious here, but I can't figure out why the ~~bolded~~ (starred, oops bolding doesn't work in code blocks) if-statement has been applied to the B-loop.

The above was generated using 32e23d2

Move single-writer resolution to kernel creation

And throw warnings for further dep additions from single-writer that are encountered during preprocess.

Not sure if this is doable, but it might be worth trying.

String syntax for zero-return-value call instructions

Not currently defined

cc @dokempf

Multiple CInstructions get automatically wrapped into Blocks

Not sure whether this is a bug or a feature...

When more than one CInstruction follows another, each of those instructions gets wrapped into a block. Minimum working example:

import loopy
from loopy.target.c import CTarget

insn = loopy.CInstruction(["i"], "mycode!")
insn2 = loopy.CInstruction(["i"], "mycode2!")

knl = loopy.make_kernel("{ [i]: 0<=i<n }",
                        [insn, insn2],
                        target=CTarget(),
                        )
knl = loopy.preprocess_kernel(knl)
print(loopy.generate_code(knl)[0])

Note that omitting one of these instructions will remove the block from the other instruction.

To me, this is a bug (besides violating the principle of least surprise), as I have a use case where I want to declare a variable in a CInstruction (though I know this is generally not a very good idea and probably against the design principles of loopy).

Scalars with base_storage generate wrong assignment code

This is a super-corner-case, but I stumbled over it while experimenting. MWE:

import loopy as lp
import numpy as np

knl = lp.make_kernel(
    ["{ [i]: 0<=i<1}"],
    ["a = 1"],
    [lp.TemporaryVariable("a", dtype=np.float64, shape=(), base_storage="base")],
    target=lp.CTarget(),
    )
print(lp.generate_code(knl)[0])

which produces

void loopy_kernel()
{
  char base[8] __attribute__ ((aligned (8)));
  double *const __restrict__ a = (double *const __restrict__ ) (base + 0);

  a = 1.0;
}

where it should say

*a = 1.0;

lp.show_dependency_graph bit rotting

For the first time, I wanted to use graph visualization for debugging. But it seems it has been broken since May or so.

  File ".../loopy/loopy/kernel/tools.py", line 543, in get_dot_dependency_graph
    raise LoopyError("schedule item not unterstood: %r" % sched_item)
loopy.diagnostic.LoopyError: schedule item not unterstood: CallKernel(extra_args=[], extra_inames=[], kernel_name='alpha_volume')

Allow additional arguments in ExpressionToCExpressionMapper mapper methods

Problem: I need to generate different code for lhs/rhs expressions, as I am implementing a backend that has different access methods for vector entries for writing/reading.

Solution: Pass a flag to the mapper method (Subscript, here) indicating whether lhs or rhs code is needed.
This can only be implemented by having all mapper methods pass around **kwargs through recursion.

If you agree and do not see a better solution, I will implement this.
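
A toy version of the idea, to show what passing **kwargs through the recursion means in practice. This is not the ExpressionToCExpressionMapper: TinyCMapper, the is_lhs flag, the tuple-based expressions, and the vec_read/vec_write accessors are all hypothetical.

```python
class TinyCMapper:
    """Toy expression-to-C mapper; expressions are nested tuples."""

    def rec(self, expr, **kwargs):
        if isinstance(expr, tuple):
            # Dispatch on the node kind and forward all keyword arguments.
            method = getattr(self, "map_" + expr[0])
            return method(expr, **kwargs)
        return str(expr)

    def map_sum(self, expr, **kwargs):
        return " + ".join(self.rec(child, **kwargs) for child in expr[1:])

    def map_subscript(self, expr, is_lhs=False, **kwargs):
        _, aggregate, index = expr
        # The index is always read, even when the subscript itself is written.
        index_code = self.rec(index, is_lhs=False, **kwargs)
        accessor = "vec_write" if is_lhs else "vec_read"
        return "%s(%s, %s)" % (accessor, aggregate, index_code)


mapper = TinyCMapper()
print(mapper.rec(("subscript", "x", "i"), is_lhs=True))   # vec_write(x, i)
print(mapper.rec(("sum", ("subscript", "x", "i"), 1)))    # vec_read(x, i) + 1
```

Methods that do not care about the flag, such as map_sum here, only need to forward **kwargs unchanged, which is the cost of this approach: every mapper method has to cooperate.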

CallInstruction with temp_var_types=None fails in argument guessing

The following minimal example

import loopy as lp
import pymbolic.primitives as p

insn = lp.CallInstruction([], p.Call(p.Variable("f"), ()))
lp.make_kernel([], [insn])

fails with:


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-1ee99214ee92> in <module>()
      3 
      4 insn = lp.CallInstruction([], p.Call(p.Variable("f"), ()))
----> 5 lp.make_kernel([], [insn])

/home/dominic/dune/dune-perftool/python/loopy/loopy/kernel/creation.pyc in make_kernel(***failed resolving arguments***)
   1537 
   1538     kernel_args = arg_guesser.convert_names_to_full_args(kernel_args)
-> 1539     kernel_args = arg_guesser.guess_kernel_args_if_requested(kernel_args)
   1540 
   1541     kwargs["substitutions"] = substitutions

/home/dominic/dune/dune-perftool/python/loopy/loopy/kernel/creation.pyc in guess_kernel_args_if_requested(self, kernel_args)
    884                 for assignee_var_name, temp_var_type in zip(
    885                         insn.assignee_var_names(),
--> 886                         insn.temp_var_types):
    887                     if temp_var_type is not None:
    888                         temp_var_names.add(assignee_var_name)

TypeError: izip argument #2 must support iteration

I can fix that by adding temp_var_types=(), but I read the docs as saying that the default None should work too.

I will open an MR with a tentative fix in a minute.
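
The failure comes from zipping over None. One possible guard, shown here as a stand-alone sketch and not as the actual patch, is to treat None as "no assignee declares a temporary type". The name declared_temp_var_names is hypothetical.

```python
def declared_temp_var_names(assignee_names, temp_var_types=None):
    """Return the assignee names that carry an explicit temporary type."""
    if temp_var_types is None:
        # Treat a missing tuple as "no assignee declares a type".
        temp_var_types = (None,) * len(assignee_names)
    return {
        name
        for name, temp_type in zip(assignee_names, temp_var_types)
        if temp_type is not None
    }


print(declared_temp_var_names(()))                              # set()
print(declared_temp_var_names(("a", "b"), (None, "float64")))   # {'b'}
```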

Insufficient scoping of single-iteration loop iname definitions

E.g. https://gitlab.tiker.net/inducer/loopy/merge_requests/33/diffs#c42851e5221945181c81c2e14f636778411ee409_805_792

What is Loopy's function namespace? (was: Is min a reduction or a 2-arg minimum like OpenCL)

Sorry about the many pull requests / issues today! Been playing around with reductions, and breaking things in all sorts of interesting ways :)

Here's a fun one:

import loopy as lp
import pyopencl as cl
lp.set_caching_enabled(False)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

knl = lp.make_kernel('{[i]: 0 < i < 10}',
"""
        out[i] = min(i, 5)
""",
['...'])

print(lp.generate_code(knl)[0])

Results in:

  File "test.py", line 14, in <module>
    print lp.generate_code(knl)[0]
  File "/home/nick/loopy/loopy/codegen/__init__.py", line 523, in generate_code
    codegen_result = generate_code_v2(kernel)
  File "/home/nick/loopy/loopy/codegen/__init__.py", line 383, in generate_code_v2
    kernel = preprocess_kernel(kernel)
  File "/home/nick/loopy/loopy/preprocess.py", line 912, in preprocess_kernel
    kernel = realize_reduction(kernel, unknown_types_ok=False)
  File "/home/nick/loopy/loopy/preprocess.py", line 649, in realize_reduction
    new_expressions = (cb_mapper(insn.expression),)
  File "/home/nick/.local/lib/python2.7/site-packages/pymbolic/mapper/__init__.py", line 134, in __call__
    return method(expr, *args, **kwargs)
  File "/home/nick/loopy/loopy/symbolic.py", line 1345, in map_reduction
    result = self.callback(expr, self.rec, **kwargs)
  File "/home/nick/loopy/loopy/preprocess.py", line 559, in map_reduction
    "supposed to reduce over: " + ", ".join(bad_inames))
loopy.diagnostic.LoopyError: reduction used within loop(s) that it was supposed to reduce over: i

However,

import loopy as lp
import pyopencl as cl
lp.set_caching_enabled(False)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

knl = lp.make_kernel('{[i]: 0 < i < 10}',
"""
        out[i] = abs(-i)
""",
['...'])

print(knl(queue)[1])

works as expected (both are defined in the opencl function manglers).

ILP not applied to variable loop length

Playing around with sparse / variable loop length codes, I came across this one:

import loopy as lp
import numpy as np
lp.set_caching_enabled(False)

k = lp.make_kernel([
    "{ [i] : 0 <= i < m }",
    "{ [j] : 0 <= j < length }"],
    """
    for i
        <> rowstart = rowstarts[i]
        <> rowend = rowstarts[i]
        <> length = rowend - rowstart
        y[i] = sum(j, values[rowstart+j] * x[colindices[rowstart + j]])
    end
    """)
k = lp.add_and_infer_dtypes(k, {'values,x':np.float64, 'rowstarts,colindices':k.index_dtype})
k = lp.split_iname(k, 'i', 2, inner_tag='ilp')
print(lp.generate_code(k)[0])

Generates:

#define lid(N) ((int) get_local_id(N))
#define gid(N) ((int) get_group_id(N))
#if __OPENCL_C_VERSION__ < 120
#pragma OPENCL EXTENSION cl_khr_fp64: enable
#endif
#define int_floor_div_pos_b(a,b) (                 ( (a) - ( ((a)<0) ? ((b)-1) : 0 )  ) / (b)                 )

__kernel void __attribute__ ((reqd_work_group_size(1, 1, 1))) loopy_kernel(__global int const *restrict colindices, int const m, __global int const *restrict rowstarts, __global double const *restrict values, __global double const *restrict x, __global double *restrict y)
{
  double acc_j[2];
  **int length[2]**;
  int rowend[2];
  int rowstart[2];

  for (int i_outer = 0; i_outer <= -1 + int_floor_div_pos_b(1 + m, 2); ++i_outer)
  {
    acc_j[0] = 0.0;
    rowend[0] = rowstarts[2 * i_outer];
    rowstart[0] = rowstarts[2 * i_outer];
    **length[0]** = rowend[0] + -1 * rowstart[0];
    if (-2 + -2 * i_outer + m >= 0)
    {
      acc_j[1] = 0.0;
      rowend[1] = rowstarts[2 * i_outer + 1];
      rowstart[1] = rowstarts[2 * i_outer + 1];
      **length[1]** = rowend[1] + -1 * rowstart[1];
    }
    for (int j = 0; j <= -1 + **length**; ++j)
    {
      acc_j[0] = acc_j[0] + values[rowstart[0] + j] * x[colindices[rowstart[0] + j]];
      if (-2 + -2 * i_outer + m >= 0)
        acc_j[1] = acc_j[1] + values[rowstart[1] + j] * x[colindices[rowstart[1] + j]];
    }
    y[2 * i_outer] = acc_j[0];
    if (-2 + -2 * i_outer + m >= 0)
      y[2 * i_outer + 1] = acc_j[1];
  }
}

Note the starred lines dealing with the length parameter (apparently bolding doesn't work in code markdown). The length parameter is correctly split by ILP except where it's used as the loop bound.

split_array_axis does not propagate nesting levels

Compound if statement generation vs changing of predicates

Predicates should get reevaluated if there's a chance they could have changed.

Fix reduction examples

Bring reduction examples in line with higher-perf examples from:

https://gitlab.tiker.net/papers/2016-loopy-kernel-splitting

(Instead of current idiotic code.)

Declaration of base storage arrays is broken

This is a regression on master that appeared with the Index CSE branch. MWE:

import loopy as lp
import numpy as np
lp.CACHING_ENABLED = False

knl = lp.make_kernel(
    ["{ [i]: 0<=i<n}"],
    ["a[i] = 1"],
    [lp.TemporaryVariable("a", dtype=np.float64, shape=("n",), base_storage="base"),
     lp.ValueArg("n")],
    target=lp.CTarget(),  
    )

Error:

....

.../cgen/cgen/mapper.pyc in map_array_of(self, node, *args, **kwargs)
     96         return type(node)(
     97                 self.rec(node.subdecl, *args, **kwargs),
---> 98                 self.map_expression(node.count, *args, **kwargs))
     99 
    100     def map_function_declaration(self, node, *args, **kwargs):

.../loopy/loopy/target/c/__init__.py in map_expression(self, expr)
    221         else:
    222             raise LoopyError(
--> 223                     "Unexpected expression type: %s" % type(expr).__name__)
    224 
    225 # }}}

LoopyError: Unexpected expression type: Product

The Product to be mapped is 8*n, which is the size of the base_storage needed for declaration.

package unavailable on PyPI, but README.rst states otherwise

Hello, I would like to just point out that loo.py is NOT available for download here: https://pypi.python.org/pypi/loo.py - because contrary to what the README.rst file is saying, there are no releases available for download there.

I've been having this issue for a very long time, so I ended up building my own loo.py.whl but in the end I think that the package should be just available on PyPI, shouldn't it?

Segfault executing simple Kernel on Intel CPU

This appears to be a bug in how Loopy simplifies some loop structures, combined with an interaction with unsigned integers.

Anyways, it can be recreated relatively simply:

#bug recreate
#necessary packages
import loopy as lp
import numpy as np
import pyopencl as cl
from loopy.kernel.data import temp_var_scope as scopes

#set to intel cpu
import os
device = '1'
os.environ['PYOPENCL_CTX'] = device
os.environ['PYOPENCL_COMPILER_OUTPUT'] = '1'

#setup loopy w/ intel CPU
lp.set_caching_enabled(False)
for p in cl.get_platforms():
    for d in p.get_devices():
        if 'Intel' in str(d):
            device = d
            ctx = cl.Context(devices=[device], properties=[(cl.context_properties.PLATFORM, p)])
            
queue = cl.CommandQueue(ctx)

#the number of items to look at
num = 81

params = np.random.random(size=num)
params_lp = lp.TemporaryVariable('params', initializer=params, shape=lp.auto, scope=scopes.GLOBAL, read_only=True)

knl = lp.make_kernel('{[i]: 0 <= i < n}',
                    """
                        for i
                            out[i] = params[i]
                        end
                    """, [params_lp, '...'],
                    target=lp.PyOpenCLTarget(device=device))
knl = lp.fix_parameters(knl, n=num)
code, _ = lp.generate_code(knl)
print(code)
knl(queue)

vec_knl = lp.split_iname(knl, 'i', 4, inner_tag='l.0')
code, _ = lp.generate_code(vec_knl)
print(code)

vec_knl(queue) #code dies here

The first print generates the following (excluding the definition of params, which changes on every run):

__kernel void __attribute__ ((reqd_work_group_size(1, 1, 1))) loopy_kernel(__global double *restrict out)
{
  for (int i = 0; i <= 80; ++i)
    out[i] = params[i];
}

While the second produces the seemingly reasonable:

__kernel void __attribute__ ((reqd_work_group_size(4, 1, 1))) loopy_kernel(__global double *restrict out)
{
  for (int i_outer = 0; i_outer <= 20 + -1 * lid(0) + (3 * lid(0) / 4); ++i_outer)
    out[4 * i_outer + lid(0)] = params[4 * i_outer + lid(0)];
}

However, running it produces the following valgrind trace:

==6767== Invalid read of size 1
==6767==    at 0x1F79EDEC: ??? (in /opt/intel/opencl-1.2-6.4.0.25/lib64/libOclCpuBackEnd.so)
==6767==    by 0x1F75F723: ??? (in /opt/intel/opencl-1.2-6.4.0.25/lib64/libOclCpuBackEnd.so)
==6767==    by 0x1F76160F: ??? (in /opt/intel/opencl-1.2-6.4.0.25/lib64/libOclCpuBackEnd.so)
==6767==    by 0x1F761885: ??? (in /opt/intel/opencl-1.2-6.4.0.25/lib64/libOclCpuBackEnd.so)
==6767==    by 0x1F761CDA: ??? (in /opt/intel/opencl-1.2-6.4.0.25/lib64/libOclCpuBackEnd.so)
==6767==    by 0x1F8F6556: ??? (in /opt/intel/opencl-1.2-6.4.0.25/lib64/libOclCpuBackEnd.so)
==6767==    by 0x1F6824B7: ??? (in /opt/intel/opencl-1.2-6.4.0.25/lib64/libOclCpuBackEnd.so)
==6767==    by 0x1F6634E6: ??? (in /opt/intel/opencl-1.2-6.4.0.25/lib64/libOclCpuBackEnd.so)
==6767==    by 0x1F67B908: ??? (in /opt/intel/opencl-1.2-6.4.0.25/lib64/libOclCpuBackEnd.so)
==6767==    by 0x1F660AD8: ??? (in /opt/intel/opencl-1.2-6.4.0.25/lib64/libOclCpuBackEnd.so)
==6767==    by 0x1F0835B6: ??? (in /opt/intel/opencl-1.2-6.4.0.25/lib64/libcpu_device.so)
==6767==    by 0x1E4961CF: ??? (in /opt/intel/opencl-1.2-6.4.0.25/lib64/libintelocl.so)
==6767==  Address 0x8 is not stack'd, malloc'd or (recently) free'd
==6767== 
Stack dump:
0.      Running pass 'PrepareKernelArgs' on module 'main'.
==6767== 
==6767== Process terminating with default action of signal 11 (SIGSEGV)
==6767==  Access not within mapped region at address 0x8
==6767==    at 0x1F79EDEC: ??? (in /opt/intel/opencl-1.2-6.4.0.25/lib64/libOclCpuBackEnd.so)
==6767==    by 0x1F75F723: ??? (in /opt/intel/opencl-1.2-6.4.0.25/lib64/libOclCpuBackEnd.so)
==6767==    by 0x1F76160F: ??? (in /opt/intel/opencl-1.2-6.4.0.25/lib64/libOclCpuBackEnd.so)
==6767==    by 0x1F761885: ??? (in /opt/intel/opencl-1.2-6.4.0.25/lib64/libOclCpuBackEnd.so)
==6767==    by 0x1F761CDA: ??? (in /opt/intel/opencl-1.2-6.4.0.25/lib64/libOclCpuBackEnd.so)
==6767==    by 0x1F8F6556: ??? (in /opt/intel/opencl-1.2-6.4.0.25/lib64/libOclCpuBackEnd.so)
==6767==    by 0x1F6824B7: ??? (in /opt/intel/opencl-1.2-6.4.0.25/lib64/libOclCpuBackEnd.so)
==6767==    by 0x1F6634E6: ??? (in /opt/intel/opencl-1.2-6.4.0.25/lib64/libOclCpuBackEnd.so)
==6767==    by 0x1F67B908: ??? (in /opt/intel/opencl-1.2-6.4.0.25/lib64/libOclCpuBackEnd.so)
==6767==    by 0x1F660AD8: ??? (in /opt/intel/opencl-1.2-6.4.0.25/lib64/libOclCpuBackEnd.so)
==6767==    by 0x1F0835B6: ??? (in /opt/intel/opencl-1.2-6.4.0.25/lib64/libcpu_device.so)
==6767==    by 0x1E4961CF: ??? (in /opt/intel/opencl-1.2-6.4.0.25/lib64/libintelocl.so)

The segfault seems to be due to the use of lid() in the loop bounds. If you supply edit_code=True to the loopy options:

vec_knl = lp.set_options(vec_knl, edit_code=True)

and change the loop to:

  for (int i_outer = 0; i_outer <= 20; ++i_outer)
  {
    if (i_outer * 4 + lid(0) <= 80) {
    out[4 * i_outer + lid(0)] = params[4 * i_outer + lid(0)];
    }
  }

or even:

__kernel void __attribute__ ((reqd_work_group_size(4, 1, 1))) loopy_kernel(__global double *restrict out)
{
  for (int i_outer = 0; i_outer <= 20 + (3 * lid(0) / 4) - lid(0); ++i_outer)
    out[4 * i_outer + lid(0)] = params[4 * i_outer + lid(0)];
}

the program runs just fine.

The issue here seems (to me) to be that get_local_id returns a size_t, which the OpenCL spec defines as unsigned, so -1 * lid(0) evaluates to some ludicrously large number.

Interestingly, this problem is not present if I use an NVIDIA GPU with the OpenCL implementation that comes with CUDA 8.0.
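As a sanity check on the two loop forms above, the index sets can be enumerated in plain Python. This is only a sketch under the assumption of ordinary signed integer arithmetic (Python's `//` agrees with C integer division for the non-negative values used here); the helper names are made up for illustration and nothing here runs OpenCL.

```python
NUM = 81    # number of items, as in the report
WIDTH = 4   # work-group size after split_iname


def indices_generated(num=NUM, width=WIDTH):
    """Indices touched by the generated loop with lid(0) in its bounds."""
    touched = []
    for lid in range(width):
        upper = 20 - lid + (3 * lid) // 4
        for i_outer in range(upper + 1):
            touched.append(4 * i_outer + lid)
    return touched


def indices_guarded(num=NUM, width=WIDTH):
    """Indices touched by the hand-edited loop with an explicit guard."""
    touched = []
    for lid in range(width):
        for i_outer in range(21):
            if i_outer * 4 + lid <= num - 1:
                touched.append(4 * i_outer + lid)
    return touched


generated = indices_generated()
guarded = indices_guarded()
print(sorted(generated) == list(range(NUM)))
print(sorted(guarded) == list(range(NUM)))
```

With signed arithmetic both variants visit each of the indices 0 through 80 exactly once, which suggests the bound expression itself is fine and that the failure depends on how the expression is evaluated or compiled.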

No declared encoding: Non-ASCII characters in new printing methods cause SyntaxError

After pulling the latest master, I get plenty of these errors complaining about all those special characters:

SyntaxError: Non-ASCII character '\xe2' in file .../loopy/loopy/kernel/tools.py on line 1201, but
no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
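For context, a short sketch of the PEP 263 declaration that Python 2 expects at the top of a source file containing non-ASCII characters. Which character actually appears on the reported line of tools.py is not stated in the report; the left arrow below is just an assumed example of a character whose UTF-8 encoding starts with the byte \xe2.

```python
# -*- coding: utf-8 -*-
# The first line is the PEP 263 encoding declaration. Python 2 requires it
# (on line 1 or 2) before non-ASCII bytes may appear in the source file;
# Python 3 assumes UTF-8 by default.

# Assumed example character: U+2190 LEFTWARDS ARROW.
LEFT_ARROW = u"\u2190"

# Its UTF-8 encoding starts with 0xe2, the byte named in the error message.
first_byte = LEFT_ARROW.encode("utf-8")[:1]
print(repr(first_byte))
```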

Forced iname deps are not propagated to Reduction instructions correctly

Setting forced_iname_deps_is_final=True on a Reduction operation does not seem to be well-defined. See this minimal example:

import loopy
import numpy  # needed for the numpy.float32 / numpy.int32 dtypes below
from pymbolic.primitives import Subscript, Variable

i1 = loopy.CInstruction("i",
                        "doSomethingToGetPhi();",
                        assignees=frozenset({"phi"}),
                        )

i2 = loopy.Assignment("a",
                      loopy.Reduction("sum", "j", Subscript(Variable("phi"), Variable("j"))),
                      forced_iname_deps=frozenset({}),
                      forced_iname_deps_is_final=True,
                     )

k = loopy.make_kernel("{[i,j] : 0<=i,j<n}",
                      [i1, i2],
                      [loopy.GlobalArg("a", dtype=numpy.float32, shape=()),
                       loopy.ValueArg("n", dtype=numpy.int32),
                       ],
                      target=loopy.CTarget(),
                      temporary_variables={"phi": loopy.TemporaryVariable("phi", dtype=numpy.float32, shape=("n",))}
                     )

k = loopy.preprocess_kernel(k)
print(k.stringify(with_dependencies=True))

which produces the following output:

...
[]                                   acc_j <- 0   # insn_0_j_init
[i,j]                                acc_j <- acc_j + phi[j]   # insn_0_j_update
...

So the forced_iname_deps_is_final property was discarded on the update instruction, while it was preserved on the init instruction. This results in wrong loop nesting for me.

I am still investigating this issue, but I think it might be connected to the fact that i1 is a CInstruction and not an Assignment.

Match expressions are not hashable in python3

With the feature to specify dependencies through match expressions, all match expressions need to be hashable (they are kept in a frozenset). However, in Python 3 a class that defines __eq__ without also defining __hash__ becomes unhashable. I fixed this locally by implementing __hash__, but I am unsure whether that is the correct upstream solution, as I do not know what the update_persistent_hash functions are about.
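The Python 3 behaviour described above can be reproduced without loopy. The classes below are hypothetical stand-ins, not loopy's actual match classes, and the sketch does not touch update_persistent_hash at all; it only shows why defining __eq__ alone breaks frozenset membership and how a matching __hash__ restores it.

```python
class TagMatch:
    """Hypothetical match expression that defines __eq__ only."""

    def __init__(self, tag):
        self.tag = tag

    def __eq__(self, other):
        return type(self) is type(other) and self.tag == other.tag


class HashableTagMatch(TagMatch):
    """Same comparison, plus a __hash__ consistent with __eq__."""

    def __hash__(self):
        return hash((type(self).__name__, self.tag))


try:
    frozenset([TagMatch("a")])
    unhashable = False
except TypeError:
    # Python 3 sets __hash__ to None when __eq__ is defined without it.
    unhashable = True

deps = frozenset([HashableTagMatch("a"), HashableTagMatch("a")])
print(unhashable, len(deps))
```

Objects that compare equal must hash equal, which is why the hash is built from the same fields that __eq__ compares.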

Store counting vs vectorization

@jdsteve2

Figure out a syntax for no_sync_with scopes

Follow-up to https://gitlab.tiker.net/inducer/loopy/merge_requests/21

cc @mattwala

Add a NOP for dependency gathering purposes

Vectorizability of CallInstruction

Right now, there is a malformed error check at https://github.com/inducer/loopy/blob/master/loopy/codegen/instruction.py#L223 , as CallInstruction does not have an atomicity field.

I ran into this when I wanted to generate code for a function call that takes a vector as an argument. Funnily, commenting out the check produces the correct result for me. I am not sure what other cases need to be considered here, though.

Improve type inference

For now, type inference falls flat for cases like this:

<> z = 0
z = z + (initially unknown type)

(it just derives the type of zero)
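One possible direction, sketched here purely as an illustration: treat the type of a temporary as the join of the types of everything assigned to it, and iterate until nothing changes. The `RANK` table is a deliberately simplified promotion order (not NumPy's rules), and `infer_dtypes`, the operand encoding, and the assumption that the literal 0 is an int32 are all hypothetical.

```python
# Simplified, assumed promotion order; real promotion rules are richer.
RANK = {"int32": 0, "float32": 1, "float64": 2}


def join(a, b):
    """Return the 'larger' of two dtypes, treating None as unknown."""
    if a is None:
        return b
    if b is None:
        return a
    return a if RANK[a] >= RANK[b] else b


def infer_dtypes(assignments, known):
    """Fixed-point inference over (target, operand_names) pairs."""
    dtypes = dict(known)
    changed = True
    while changed:
        changed = False
        for target, operands in assignments:
            new = dtypes.get(target)
            for name in operands:
                new = join(new, dtypes.get(name))
            if new != dtypes.get(target):
                dtypes[target] = new
                changed = True
    return dtypes


# z = 0; z = z + w, where w only becomes known to be float64 elsewhere.
assignments = [("z", ["0"]), ("z", ["z", "w"])]
known = {"0": "int32", "w": "float64"}

result = infer_dtypes(assignments, known)
print(result["z"])
```

Because dtypes only ever move up the promotion order, the iteration terminates; here z ends up as float64 rather than the int32 derived from the zero alone.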

Should vectorize x = 4 if x is a vector

Device mapping takes a very long time

@mattwala

The code below (unintentionally) triggers many write-after-write dependencies in the writing of all the f_new. This leads to a lot of kernel splitting, and the device mapping stage takes a very long time. This leads to a poor user experience, and we should think of a way to fix that. If nothing else, this may serve as a good example that provokes performance degradation.

import loopy as lp
import pyopencl as cl
import pyopencl.array
import numpy as np

nx, ny, nv = 100, 100, 12
f = np.zeros((nx, ny, nv), dtype=np.float32)
f_new = np.zeros_like(f)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

knl = lp.make_kernel(
     "{[ii,jj]:0<=ii<nx-2 and 0<=jj<ny-2}",
     """  # noqa (silences flake8 line length warning)
     i := ii + 1
     j := jj + 1
     for ii, jj
         <> m[0] =   +    f[i-1, j, 0] +    f[i, j-1, 1] + f[i+1, j, 2] +  f[i, j+1, 3]
         m[1] =   + 4.*f[i-1, j, 0] - 4.*f[i+1, j, 2]
         m[2] =   + 4.*f[i, j-1, 1] - 4.*f[i, j+1, 3]
         m[3] =   +    f[i-1, j, 0] -    f[i, j-1, 1] + f[i+1, j, 2] -  f[i, j+1, 3]
         m[4] =   +    f[i-1, j, 4] +    f[i, j-1, 5] + f[i+1, j, 6] +  f[i, j+1, 7]
         m[5] =   + 4.*f[i-1, j, 4] - 4.*f[i+1, j, 6]
         m[6] =   + 4.*f[i, j-1, 5] - 4.*f[i, j+1, 7]
         m[7] =   +    f[i-1, j, 4] -    f[i, j-1, 5] + f[i+1, j, 6] -  f[i, j+1, 7]
         m[8] =   +    f[i-1, j, 8] +    f[i, j-1, 9] + f[i+1, j, 10] + f[i, j+1, 11]
         m[9] =   + 4.*f[i-1, j, 8] - 4.*f[i+1, j, 10]
         m[10] =  + 4.*f[i, j-1, 9] - 4.*f[i, j+1, 11]
         m[11] =  +    f[i-1, j, 8] -    f[i, j-1, 9] + f[i+1, j, 10] - f[i, j+1, 11]

         m[1] = m[1] + 2.*(m[4] - m[1])
         m[2] = m[2] + 2.*(m[8] - m[2])
         m[3] = m[3]*(1. - 1.5)
         m[5] = m[5] + 1.5*(0.5*(m[0]*m[0]) + (m[4]*m[4])/m[0] - m[5])
         m[6] = m[6] + 1.5*(m[4]*m[8]/m[0] - m[6])
         m[7] = m[7]*(1. - 1.2000000000000000)
         m[9] = m[9] + 1.5*(m[4]*m[8]/m[0] - m[9])
         m[10] = m[10] + 1.5*(0.5*(m[0]*m[0]) + (m[8]*m[8])/m[0] - m[10])
         m[11] = m[11]*(1. - 1.2)

         # <>f_new[0] =  + 0.25*m[0] + 0.125*m[1] + 0.25*m[3]
         # f_new[1] =  + 0.25*m[0] + 0.125*m[2] - 0.25*m[3]
         # f_new[2] =  + 0.25*m[0] - 0.125*m[1] + 0.25*m[3]
         # f_new[3] =  + 0.25*m[0] - 0.125*m[2] - 0.25*m[3]
         # f_new[4] =  + 0.25*m[4] + 0.125*m[5] + 0.25*m[7]
         # f_new[5] =  + 0.25*m[4] + 0.125*m[6] - 0.25*m[7]
         # f_new[6] =  + 0.25*m[4] - 0.125*m[5] + 0.25*m[7]
         # f_new[7] =  + 0.25*m[4] - 0.125*m[6] - 0.25*m[7]
         # f_new[8] =  + 0.25*m[8] + 0.125*m[9] + 0.25*m[11]
         # f_new[9] =  + 0.25*m[8] + 0.125*m[10] - 0.25*m[11]
         # f_new[10] =  + 0.25*m[8] - 0.125*m[9] + 0.25*m[11]
         # f_new[11] =  + 0.25*m[8] - 0.125*m[10] - 0.25*m[11]

         f_new[i, j, 0] =  + 0.25*m[0] + 0.125*m[1] + 0.25*m[3]
         f_new[i, j, 1] =  + 0.25*m[0] + 0.125*m[2] - 0.25*m[3]
         f_new[i, j, 2] =  + 0.25*m[0] - 0.125*m[1] + 0.25*m[3]
         f_new[i, j, 3] =  + 0.25*m[0] - 0.125*m[2] - 0.25*m[3]
         f_new[i, j, 4] =  + 0.25*m[4] + 0.125*m[5] + 0.25*m[7]
         f_new[i, j, 5] =  + 0.25*m[4] + 0.125*m[6] - 0.25*m[7]
         f_new[i, j, 6] =  + 0.25*m[4] - 0.125*m[5] + 0.25*m[7]
         f_new[i, j, 7] =  + 0.25*m[4] - 0.125*m[6] - 0.25*m[7]
         f_new[i, j, 8] =  + 0.25*m[8] + 0.125*m[9] + 0.25*m[11]
         f_new[i, j, 9] =  + 0.25*m[8] + 0.125*m[10] - 0.25*m[11]
         f_new[i, j, 10] =  + 0.25*m[8] - 0.125*m[9] + 0.25*m[11]
         f_new[i, j, 11] =  + 0.25*m[8] - 0.125*m[10] - 0.25*m[11]
    end
    """, seq_dependencies=True)

knl = lp.add_and_infer_dtypes(knl, {"f": np.float32})
#knl = lp.add_and_infer_dtypes(knl, {"f_new": np.float32})

ref_knl = knl

knl = lp.split_iname(knl, "ii", 16, outer_tag="g.1", inner_tag="l.1")
knl = lp.split_iname(knl, "jj", 16, outer_tag="g.0", inner_tag="l.0")
knl = lp.expand_subst(knl)
knl = lp.add_prefetch(knl, "f", "ii_inner,jj_inner", fetch_bounding_box=True)
knl = lp.set_options(knl, write_cl=True)
knl = lp.preprocess_kernel(knl)
print(knl)
f = cl.array.to_device(queue, f)
f_new = cl.array.zeros_like(f)
evt, f_new = knl(queue, f=f)#, f_new=f_new)

Projection consistency of multiple domains

Given parent/child domains like this:

{[i]: 0<=i<10}
  {[j]: 0<=j<i and j<=6}

the domain of i disagrees with the projection of the domain for i and j onto just i. That should be flagged and prevented during preprocessing.
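The mismatch is easy to see by brute-force enumeration. The sketch below uses plain Python sets instead of isl, so it only illustrates the check on this small example and is not how loopy would implement it.

```python
# Parent domain: {[i]: 0 <= i < 10}
parent = set(range(10))

# Child domain: {[j]: 0 <= j < i and j <= 6}, enumerated together with i.
child = {(i, j) for i in range(10) for j in range(10) if 0 <= j < i and j <= 6}

# Project the combined (i, j) domain onto i.
projection = {i for i, _ in child}

# Values of i allowed by the parent but lost in the projection.
missing = sorted(parent - projection)
print(missing)  # [0]
```

For i = 0 there is no j with 0 <= j < 0, so i = 0 is in the parent domain but not in the projection.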

cc @mattwala

Generate code for loop bounds using "proper" PwAffs

Tracking bug for fix to the issue @mattwala ran into with scan codegen, regarding i_tracker.

FWIW, a stub for this already exists.

https://github.com/inducer/loopy/blob/master/loopy/symbolic.py#L1136

Temporary Elimination is unaware of aliases

When two temporaries refer to the same base storage and one of them is read-only, loopy eliminates the read-only one, even though it shouldn't.

Integer operators undercounted

@jdsteve2 Looking at this code:

loopy/test/test_statistics.py

Line 135 in 1cd6637

e[i, k] = (g[i,k] ^ k)*(~h[i,k+1]) + (g[i, k] << (h[i,k] >> k))

I figure there should be at least 6 or 7 integer operations, not 3. How come only 3 of them get seen?
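The "6 or 7" estimate can be checked mechanically, since the right-hand side happens to be valid Python syntax. The sketch below counts syntactic operators with the standard `ast` module; it says nothing about how loopy's own counting works, and whether the k+1 inside the subscript should count as an integer operation is a judgement call.

```python
import ast
from collections import Counter

# Right-hand side of the statement quoted from test_statistics.py.
EXPR = "(g[i,k] ^ k)*(~h[i,k+1]) + (g[i, k] << (h[i,k] >> k))"


def count_operators(source):
    """Count binary and unary operator nodes in a Python expression."""
    counts = Counter()
    for node in ast.walk(ast.parse(source, mode="eval")):
        if isinstance(node, (ast.BinOp, ast.UnaryOp)):
            counts[type(node.op).__name__] += 1
    return counts


counts = count_operators(EXPR)
total = sum(counts.values())
print(total, dict(counts))
```

This finds seven operators: ^, *, ~, <<, >> and two additions, one of which is the k+1 in the index. Leaving out the index arithmetic gives six.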

Multi-piece loop bounds use unnecessarily loose gist

@mattwala Sorry, I should have paid closer attention during the review earlier. This gist is unnecessarily loose:

https://gitlab.tiker.net/inducer/loopy/merge_requests/33/diffs#1be20bf1e82dd3ea7bc536cedab6c8f27e2259b8_397_393

It should take a gist with respect to the implemented domain, like this code:

https://github.com/inducer/loopy/blob/master/loopy/codegen/bounds.py#L47-L48

What's there now is not incorrect, but it doesn't "gist away" all the constraints it could. (In effect, it just gets whatever the domain says about the parameters, which is typically not very much--in loopy, that stuff typically winds up in "assumptions".)

Error compiling simple kernel, missing variable from islpy

I'm trying to compile an example kernel but running into the following exception:

In [5]: knl = lp.make_kernel("{[i]: 0 <= i < n}", "out[i] = 2*a[i]")
failed to parse domain '[n] -> {[i]: 0 <= i < n}'
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-861b95c9d0a6> in <module>()
----> 1 knl = lp.make_kernel("{[i]: 0 <= i < n}", "out[i] = 2*a[i]")

/Users/iskander/code/loopy/loopy/kernel/creation.py in make_kernel(***failed resolving arguments***)
   1156     # }}}
   1157
-> 1158     domains = parse_domains(domains, defines)
   1159
   1160     arg_guesser = ArgumentGuesser(domains, instructions,

/Users/iskander/code/loopy/loopy/kernel/creation.py in parse_domains(domains, defines)
    382
    383             try:
--> 384                 dom = isl.BasicSet.read_from_str(isl.DEFAULT_CONTEXT, dom)
    385             except:
    386                 print("failed to parse domain '%s'" % dom)

AttributeError: 'module' object has no attribute 'DEFAULT_CONTEXT'

This looks like an issue with ISL. Looking at the islpy module namespace, I don't, in fact, see a DEFAULT_CONTEXT binding. Maybe we're using different versions?
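To compare installations, something like the following could be run on both machines. It is a sketch that assumes Python 3.8 or newer (for `importlib.metadata`) and that islpy was installed as a regular package; `describe_islpy` is a made-up helper name.

```python
import importlib.util
from importlib import metadata


def describe_islpy():
    """Report the installed islpy version and whether DEFAULT_CONTEXT exists."""
    if importlib.util.find_spec("islpy") is None:
        return "islpy is not installed"
    import islpy as isl
    try:
        version = metadata.version("islpy")
    except metadata.PackageNotFoundError:
        # Installed in a way that left no package metadata behind.
        version = "unknown"
    has_ctx = hasattr(isl, "DEFAULT_CONTEXT")
    return "islpy %s, DEFAULT_CONTEXT present: %s" % (version, has_ctx)


print(describe_islpy())
```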

Tag instructions to depend on all writes to read temporaries

Loopy has this lovely heuristic:

Specifically, loopy will automatically add a dependency to an 
instruction reading a variable if there is exactly one instruction writing that variable.

I would love to have an instruction tag (say !allwrites) that automatically adds dependencies on all writes to the temporaries that the instruction reads. If you like the idea, I will open a PR.
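To make the difference concrete, here is a small model of the two behaviours in plain Python. The data layout (instruction id mapped to a pair of written and read variable names) and the `all_writes` flag are invented for this sketch and do not correspond to loopy's internal representation.

```python
def add_dependencies(instructions, all_writes=False):
    """Compute dependencies from reads to writers.

    instructions maps an instruction id to (writes, reads), both sets of
    variable names. With all_writes=False, a dependency is added only when
    exactly one instruction writes the variable (the existing heuristic).
    With all_writes=True, every writer becomes a dependency (the proposal).
    """
    writers = {}
    for insn_id, (writes, _) in instructions.items():
        for var in writes:
            writers.setdefault(var, set()).add(insn_id)

    deps = {}
    for insn_id, (_, reads) in instructions.items():
        deps[insn_id] = set()
        for var in reads:
            var_writers = writers.get(var, set())
            if all_writes or len(var_writers) == 1:
                deps[insn_id] |= var_writers - {insn_id}
    return deps


instructions = {
    "w1": ({"t"}, set()),
    "w2": ({"t"}, set()),
    "ws": ({"s"}, set()),
    "use_t": ({"out"}, {"t"}),
    "use_s": ({"out2"}, {"s"}),
}

default_deps = add_dependencies(instructions)
tagged_deps = add_dependencies(instructions, all_writes=True)
print(sorted(default_deps["use_t"]), sorted(tagged_deps["use_t"]))
```

With two writers of t, the heuristic adds nothing for use_t, while the proposed tag would make it depend on both w1 and w2.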

Import error while calling kernel

File "softmax.py", line 43, in softmax
    evt, (exp,) = exp_kernel(queue, z=input_array, E=2.7)
  File "/Users/iskander/code/loopy/loopy/kernel/__init__.py", line 1088, in __call__
    queue, **kwargs)
  File "/Users/iskander/code/loopy/loopy/compiled.py", line 948, in __call__
    frozenset(six.iteritems(arg_to_dtype)))
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pytools-2015.1.2-py2.7.egg/pytools/__init__.py", line 471, in wrapper
    result = method(self, *args, **kwargs)
  File "/Users/iskander/code/loopy/loopy/compiled.py", line 840, in cl_kernel_info
    kernel = self.get_typed_and_scheduled_kernel(arg_to_dtype_set)
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pytools-2015.1.2-py2.7.egg/pytools/__init__.py", line 471, in wrapper
    result = method(self, *args, **kwargs)
  File "/Users/iskander/code/loopy/loopy/compiled.py", line 831, in get_typed_and_scheduled_kernel
    kernel = preprocess_kernel(kernel)
  File "/Users/iskander/code/loopy/loopy/preprocess.py", line 1115, in preprocess_kernel
    preprocess_cache[input_kernel] = kernel
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pytools-2015.1.2-py2.7.egg/pytools/persistent_dict.py", line 424, in __setitem__
    return self.store(key, value)
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pytools-2015.1.2-py2.7.egg/pytools/persistent_dict.py", line 306, in store
    dump(value, outf, protocol=HIGHEST_PROTOCOL)
  File "/Users/iskander/code/loopy/loopy/tools.py", line 144, in __getstate__
    c_name = self.target.dtype_to_typename(self.dtype)
  File "/Users/iskander/code/loopy/loopy/target/c/__init__.py", line 55, in dtype_to_typename
    return self.get_dtype_registry().dtype_to_ctype(dtype)
  File "/Users/iskander/code/loopy/loopy/target/pyopencl/__init__.py", line 263, in get_dtype_registry
    from pyopencl.compyte.dtypes import TYPE_REGISTRY
ImportError: cannot import name TYPE_REGISTRY

(Local) barrier deadlock vs loop bounds with l.0

(So far) hypothetical failure case:

Sequential loop over i, with bounds depending on an iname involving l.0.
Some lanes of the l.0 iname enter (say) the last iteration, some don't.
The ones that entered the iteration hit a barrier, the ones that didn't, don't.
Deadlock
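A toy model of the scenario, assuming 81 items strided across 4 lanes (values chosen only for illustration): each lane runs its own number of sequential iterations, and a barrier inside the loop body would be reached once per iteration.

```python
N_ITEMS = 81  # assumed problem size
WIDTH = 4     # assumed number of l.0 lanes


def trip_count(lid, n=N_ITEMS, width=WIDTH):
    """Sequential iterations executed by lane lid when striding by width."""
    return len(range(lid, n, width))


barrier_arrivals = {lid: trip_count(lid) for lid in range(WIDTH)}
print(barrier_arrivals)

# Lanes disagree on the number of iterations, so they would also disagree
# on how many times they reach a barrier placed inside the loop body.
mismatch = len(set(barrier_arrivals.values())) > 1
print(mismatch)
```

Here lane 0 runs 21 iterations and the other lanes run 20, so lane 0 would wait at a barrier that the other lanes never reach.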

cc @mattwala
