My implementation of the forward pass (b1c9dd2, probably bugged) is not as much faster than my reference implementation as I expected. I wondered whether I was using the gep stuff correctly, so I tried reimplementing it in llvmlite with ArrayType() to see if that improves things.
See a600b87 array_llvmlite.py:
The setup is super short:
N = 4
arr = ir.ArrayType(ir.FloatType(), N)
fnty = ir.FunctionType(ir.FloatType(), (arr, arr))
module = ir.Module(name=__file__)
func = ir.Function(module, fnty, name="node")
block = func.append_basic_block(name="entry")
builder = ir.IRBuilder(block)
accum = ir.Constant(ir.FloatType(), 0)
for i in range(N):
    x = builder.extract_value(func.args[0], i)
    w = builder.extract_value(func.args[1], i)
    accum = builder.fma(x, w, accum)
builder.ret(accum)
and the LLVM IR looks correct to me:
; ModuleID = '<string>'
source_filename = "<string>"
target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"
target triple = "arm64-apple-darwin21.6.0"
; Function Attrs: nounwind readnone
define float @node([4 x float] %.1, [4 x float] %.2) local_unnamed_addr #0 {
entry:
%.4 = extractvalue [4 x float] %.1, 0
%.5 = extractvalue [4 x float] %.2, 0
%.6 = tail call float @llvm.fma.f32(float %.4, float %.5, float 0.000000e+00)
%.7 = extractvalue [4 x float] %.1, 1
%.8 = extractvalue [4 x float] %.2, 1
%.9 = tail call float @llvm.fma.f32(float %.7, float %.8, float %.6)
%.10 = extractvalue [4 x float] %.1, 2
%.11 = extractvalue [4 x float] %.2, 2
%.12 = tail call float @llvm.fma.f32(float %.10, float %.11, float %.9)
%.13 = extractvalue [4 x float] %.1, 3
%.14 = extractvalue [4 x float] %.2, 3
%.15 = tail call float @llvm.fma.f32(float %.13, float %.14, float %.12)
ret float %.15
}
; Function Attrs: nounwind readnone speculatable willreturn
declare float @llvm.fma.f32(float, float, float) #1
attributes #0 = { nounwind readnone }
attributes #1 = { nounwind readnone speculatable willreturn }
but then the asm looks suspiciously succinct?
.section __TEXT,__text,regular,pure_instructions
.build_version macos, 12, 0
.globl _node ; -- Begin function node
.p2align 2
_node: ; @node
.cfi_startproc
; %bb.0: ; %entry
fmov s16, wzr
fmadd s0, s0, s4, s16
fmadd s0, s1, s5, s0
fmadd s0, s2, s6, s0
fmadd s0, s3, s7, s0
ret
.cfi_endproc
; -- End function
.subsections_via_symbols
I got into trouble when trying to call it, though.
With pointers I used:
c_float_p = POINTER(c_float)
cfunc = CFUNCTYPE(c_float, c_float_p, c_float_p, c_int)(func_ptr)
ret = cfunc(inputs.ctypes.data_as(c_float_p), weights.ctypes.data_as(c_float_p), N)
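As a sanity check, the pointer side of this can be exercised without the JIT at all, since data_as just hands back the address of the numpy buffer (hypothetical snippet, not from the repo):

```python
import numpy as np
from ctypes import POINTER, c_float

c_float_p = POINTER(c_float)
inputs = np.arange(4, dtype=np.float32)

# data_as() returns a raw pointer into the numpy buffer -- the same
# address the jitted function receives in the pointer-based version.
ptr = inputs.ctypes.data_as(c_float_p)
print([ptr[i] for i in range(4)])  # [0.0, 1.0, 2.0, 3.0]
```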
but
c_float_arr = c_float * N
inputs.ctypes.data_as(c_float_arr)
didn't work at all, and complained:
Traceback (most recent call last):
File "/Users/lfrati/git/ddag/array_llvmlite.py", line 64, in <module>
_ = inputs.ctypes.data_as(c_float_arr)
File "/Users/lfrati/miniconda3/envs/deep/lib/python3.9/site-packages/numpy/core/_internal.py", line 282, in data_as
ptr = self._ctypes.cast(self._data, obj)
File "/Users/lfrati/miniconda3/envs/deep/lib/python3.9/ctypes/__init__.py", line 510, in cast
return _cast(obj, obj, typ)
TypeError: cast() argument 2 must be a pointer type, not c_float_Array_4
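The traceback actually explains it: data_as just calls ctypes.cast under the hood, and cast only accepts a *pointer* type as its target, so an array type like c_float * N is rejected. A pointer-to-array is accepted, though (sketch, assuming numpy):

```python
import numpy as np
from ctypes import POINTER, c_float

N = 4
inputs = np.arange(N, dtype=np.float32)

# cast() (which data_as calls internally) requires a pointer target type,
# so c_float * N fails but POINTER(c_float * N) is fine.
arr_ptr = inputs.ctypes.data_as(POINTER(c_float * N))
print(list(arr_ptr.contents))  # [0.0, 1.0, 2.0, 3.0]
```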
However, after some (a lot of) trial and error, I converged on this:
inputs = np.arange(N, dtype=np.float32)
weights = np.arange(N, dtype=np.float32)
c_float_arr = c_float * N
cfunc = CFUNCTYPE(c_float, c_float_arr, c_float_arr)(func_ptr)
inps, ws = (c_float_arr)(*inputs), (c_float_arr)(*weights)
ret = cfunc(inps, ws)
which seems to make the cfunc happy, but the return value is NaN
...
I've tried
builder = ir.IRBuilder(block)
xs, ws = func.args
x = builder.extract_value(xs, 3)
builder.ret(x)
and it seems that the values it gets are messed up (deterministically):
- 1.401298464324817e-45
- nan
- 6.998277952723291e+28
- 1.4918516417532683e-19
Why? Alignment? The values on the Python side seem fine right before the call:
inps, ws = (c_float_arr)(*inputs), (c_float_arr)(*weights)
for x, w in zip(inps, ws):
    print(x, w)
    # 0.0 0.0
    # 1.0 1.0
    # 2.0 2.0
    # 3.0 3.0
ret = cfunc(inps, ws)
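A plausible culprit (my guess, not verified against the repo): ctypes passes array instances by reference, i.e. it pushes their address, while the jitted @node takes its two [4 x float] arguments by value, which the AArch64 calling convention splits across float registers s0-s3 and s4-s7 (exactly the registers the asm above reads). So the function ends up reinterpreting two pointers, plus whatever junk is left in the remaining float registers, as float data, which would explain the deterministic garbage. The by-reference behavior is easy to demonstrate by handing array instances to a libc function that expects pointers:

```python
from ctypes import CDLL, c_float, sizeof
from ctypes.util import find_library

libc = CDLL(find_library("c"))
src = (c_float * 4)(0.0, 1.0, 2.0, 3.0)
dst = (c_float * 4)()

# memcpy(void *dst, const void *src, size_t n) expects pointers; passing the
# array instances directly works because ctypes passes arrays by reference
# (their address), never by value.
libc.memcpy(dst, src, sizeof(src))
print(list(dst))  # [0.0, 1.0, 2.0, 3.0]
```

If that is the cause, the fix would be to keep the pointer-based signature, or to declare the llvmlite parameters as pointer-to-array (e.g. arr.as_pointer()) and gep/load inside the function.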