idein / py-videocore

Python library for GPGPU on Raspberry Pi

License: MIT License

py-videocore's Introduction

PyVideoCore

This is a work-in-progress project. Backward compatibility is not guaranteed.

PyVideoCore is a Python library for GPGPU on Raspberry Pi boards. The Raspberry Pi SoC integrates a Broadcom VideoCore IV graphics core. It has 12 quad processor units (QPUs), each of which is a dual-issue 16-way (4-way pipelined and 4-way true) SIMD processor. Read the following guide thoroughly to study its architecture.

VideoCore(R) IV 3D Architecture Reference Guide (PDF) ¹
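
The figures above translate into a rough peak-throughput estimate. The sketch below multiplies the QPU count, the number of physical SIMD lanes, and the dual-issue width by a clock rate. The 250 MHz clock is an assumption that this README does not state, so treat the result as an illustration of the arithmetic rather than a specification.

```python
# Back-of-the-envelope peak single-precision throughput of the QPU array.
N_QPUS = 12          # from the description above
PHYSICAL_LANES = 4   # the "4 way true" SIMD lanes of each QPU
ISSUE_WIDTH = 2      # dual issue: one add-pipe and one mul-pipe operation per cycle
CLOCK_HZ = 250e6     # ASSUMPTION: clock rate, not given in this README


def peak_gflops(n_qpus=N_QPUS, lanes=PHYSICAL_LANES,
                issue=ISSUE_WIDTH, clock_hz=CLOCK_HZ):
    """Return the theoretical peak in GFLOPS for the given parameters."""
    return n_qpus * lanes * issue * clock_hz / 1e9


print(peak_gflops())
```

Changing clock_hz shows how the estimate scales on a board that runs the core at a different frequency.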

Several QPU assemblers have been written by pioneers (hermanhermitage, petewarden, elorimer and so on). There is also an implementation of OpenCL for QPU: VC4CL.

PyVideoCore's QPU assembler differs from theirs in that its assembly language is implemented as an internal DSL of Python. This makes GPGPU programming on Raspberry Pi relatively easier in the sense that

You can put host programs and GPU side programs in a single Python script.
You can execute the program without ahead-of-time compilation.
You can utilize Python functionality, libraries and tools to organize GPU programs.

Requirements

Raspberry Pi Zero, 1, 2, or 3
- For Raspberry Pi 4, use py-videocore6 instead.
Python 2 (>= 2.6) or Python 3
NumPy
rpi-vcsm ~= 3.0.0
ioctl-opt ~= 1.2
nose (if you want to run tests)

Installation

$ git clone https://github.com/nineties/py-videocore.git
$ cd py-videocore
$ python setup.py install

Note that PyVideoCore does not work with the CPU-side OpenGL graphics stack, so configure your Pi to use the legacy (original non-GL desktop) driver with the sudo raspi-config command (it simply comments out all the dtoverlay=vc4-kms-v3d and dtoverlay=vc4-fkms-v3d lines in /boot/config.txt).
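
To confirm that no GL overlay is still active, the config file can be scanned for uncommented overlay lines. The following is a minimal sketch that works on the file's text; the function name active_kms_overlays and the sample text are illustrative and not part of PyVideoCore.

```python
import re

# Matches an uncommented "dtoverlay=vc4-kms-v3d" or "dtoverlay=vc4-fkms-v3d" line.
KMS_OVERLAY = re.compile(r'^\s*dtoverlay\s*=\s*(vc4-f?kms-v3d)\b')


def active_kms_overlays(config_text):
    """Return the vc4 KMS overlays enabled by uncommented lines."""
    found = []
    for line in config_text.splitlines():
        # Commented lines start with '#', so the pattern does not match them.
        match = KMS_OVERLAY.match(line)
        if match:
            found.append(match.group(1))
    return found


# Illustrative config text; on a real board, read /boot/config.txt instead.
sample = """\
# dtoverlay=vc4-kms-v3d
dtoverlay=vc4-fkms-v3d
gpu_mem=128
"""
print(active_kms_overlays(sample))
```

An empty list means that every overlay line is commented out, which is the state the legacy driver setting is described as producing.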

Depending on your running kernel version, PyVideoCore allocates memory through /dev/vcsm or /dev/vcsm-cma, which are the devices of the VCSM (VideoCore shared memory service) and the VCSM-CMA (contiguous memory allocator) drivers, respectively. To access the devices, you must belong to the video group or be the root user. If you choose the former, run the following command and log in again.

$ sudo usermod --append --groups video $USER
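
A quick way to see whether either device is usable from the current account is to test for the device nodes directly. This is a small sketch using only the standard library; the helper name usable_vcsm_devices is hypothetical.

```python
import os

# Device nodes named in the text above.
VCSM_DEVICES = ('/dev/vcsm', '/dev/vcsm-cma')


def usable_vcsm_devices(candidates=VCSM_DEVICES):
    """Return the candidate device nodes this process can open read/write."""
    return [path for path in candidates
            if os.path.exists(path) and os.access(path, os.R_OK | os.W_OK)]


devices = usable_vcsm_devices()
if devices:
    print('usable VCSM devices:', ', '.join(devices))
else:
    print('no usable VCSM device; check the driver and video group membership')
```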

The plain VCSM driver allocates memory in the GPU-side memory, whose size can be configured with the gpu_mem=XXX option in /boot/config.txt (e.g. gpu_mem=128 for 128 MB). This can also be done via the sudo raspi-config command.

The VCSM-CMA driver, on the other hand, allocates memory in the CPU-side CMA memory, whose size can be configured with the dtoverlay=cma,cma-XXX option in /boot/config.txt (e.g. dtoverlay=cma,cma-128 for 128 MB).

In either case, VideoCore IV QPUs can access arbitrary regions of the main memory, so a buggy program may make your system unstable or even break your Pi. Beware of bugs in your programs.
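
The two options can be read back from the config text to check what a board is set up to provide. The sketch below handles only the exact forms shown above (gpu_mem=XXX and dtoverlay=cma,cma-XXX); the function name memory_settings and the sample text are illustrative.

```python
import re

GPU_MEM = re.compile(r'^gpu_mem\s*=\s*(\d+)$')
CMA = re.compile(r'^dtoverlay\s*=\s*cma,cma-(\d+)$')


def memory_settings(config_text):
    """Return the configured sizes in MB, or None for options that are absent."""
    settings = {'gpu_mem': None, 'cma': None}
    for raw in config_text.splitlines():
        line = raw.split('#', 1)[0].strip()   # drop comments and whitespace
        for key, pattern in (('gpu_mem', GPU_MEM), ('cma', CMA)):
            match = pattern.match(line)
            if match:
                settings[key] = int(match.group(1))
    return settings


# Illustrative config text; on a real board, read /boot/config.txt instead.
sample = "gpu_mem=128\n#gpu_mem=64\ndtoverlay=cma,cma-128\n"
print(memory_settings(sample))
```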

Getting Started

$ python examples/hello_world.py

Running Tests

$ nosetests -v

128 MB or more of GPU memory is required to pass the tests. Some tests fail with 64 MB or less.

Documentation

TBD

Tutorials

In Japanese.

Records

- Achieved 8 GFLOPS with sgemm. (Image: https://pbs.twimg.com/media/CWYjkH7U4AAh9VE.jpg)

License

Code and documentation are released under the MIT license.

¹ Supplementary information and errata list.

py-videocore's Issues

execute /tests functions and print the Y results

Hi,

I'm currently developing on the VideoCore GPU and your project is really interesting.
Sadly, I'm a total Python beginner.

Could you send me a command line that executes the "test_per_elmt_imm()" function in "test_alu.py" and prints the result in my Raspberry Pi terminal? I'm using Raspbian Lite.

Thanks :)

Tensorflow Op

Do you think this would be reasonable to implement with Tensorflow as a py_func op? I think it's reasonable to assume it would be worth creating a module as a substitute for sgemm or other heavy ops.

How can I read results of matmul?

Hello.
Thank you for the great work and for sharing your code.
I don't know assembly language well, so your code is a huge help to me.
Could you tell me how to read the matmul result from the GPU?
You show how to calculate the matrix product, but nothing is returned.
I will wait for your answer. Thanks!

Proposing a PR to fix a few small typos

Issue Type

[x] Bug (Typo)

Steps to Replicate and Expected Behaviour

Examine tests/test_alu.py and observe rotatoin, however expect to see rotation.
Examine tests/test_branch.py and observe returnes, however expect to see returns.
Examine videocore/encoding.py and observe purpuse, however expect to see purpose.
Examine videocore/encoding.py and observe accross, however expect to see across.
Examine videocore/encoding.py and observe accores, however expect to see across.

Notes

Semi-automated issue generated by
https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md

To avoid wasting CI processing resources a branch with the fix has been
prepared but a pull request has not yet been created. A pull request fixing
the issue can be prepared from the link below, feel free to create it or
request @timgates42 create the PR. Alternatively if the fix is undesired please
close the issue with a small comment about the reasoning.

https://github.com/timgates42/py-videocore/pull/new/bugfix_typos

Thanks.

Getting correct values in matrix multiplication

Hello there, I came across this code and wanted to use your matrix multiplication implementation (sgemm.py). It seems that the result I get from NumPy's reference calculation, A.dot(B) + C, differs from the output of the program. For example, I used 64 x 64 matrices for A, B and C, each filled with ones. The NumPy reference calculation gives a matrix filled with 65s, but in the program's output matrix C only one row is filled with 65s. Does this mean I am reading the output from the wrong matrix, or is something else acting strangely? Thanks.

Alpha and beta are both kept at 1.

Update README to reflect OpenCL on VC4

There is an OpenCL for RPi now: https://github.com/doe300/VC4CL

Arch reference guide link broken

The current link returns a 404.

benchmark of sgemm examples are not fair.

This benchmark is not fair.

The NumPy version runs on arrays that are allocated in GPU memory.
The time measurement for the GPU version should include the costs of the following steps:
- allocation of GPU memory
- initialization of uniforms
- copy from Host memory region to GPU memory region
- copy from GPU memory region to Host memory region
- and so on.
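
One way to make such a comparison explicit is to time each step separately and report the sum. The sketch below is hardware independent: the stage functions are stand-ins that copy plain byte buffers, not PyVideoCore calls, and the helper name time_stages is hypothetical.

```python
import time


def time_stages(stages):
    """Run (name, callable) pairs in order; return seconds per stage plus a total."""
    timings = {}
    for name, stage in stages:
        start = time.perf_counter()
        stage()
        timings[name] = time.perf_counter() - start
    timings['total'] = sum(timings.values())
    return timings


# Stand-in stages operating on ordinary buffers (NOT real GPU memory).
host_in = bytearray(1 << 20)
buffers = {}


def allocate():
    buffers['gpu'] = bytearray(len(host_in))


def copy_in():
    buffers['gpu'][:] = host_in


def compute():
    buffers['result'] = bytes(buffers['gpu'])


def copy_out():
    buffers['host_out'] = bytearray(buffers['result'])


report = time_stages([
    ('allocate', allocate),
    ('copy host to GPU', copy_in),
    ('compute', compute),
    ('copy GPU to host', copy_out),
])
for name, seconds in report.items():
    print('{:<18} {:.6f} s'.format(name, seconds))
```

Replacing the stand-ins with the real allocation, upload, execution, and download calls would give an end-to-end figure that can be compared against a NumPy run on ordinary host arrays.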

Bad precision and warnings when running under Raspberry PI 3

This project is really promising. Too bad there is no documentation and very few examples :-).

I was trying some examples and I see this:

root@osmcpi:/home/jcea/virtualenv/py-videocore/examples# ../../bin/python sgemm.py 
sgemm.py:655: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  uniforms[th, 4] = A.addresses()[i*16*h, 0]
sgemm.py:656: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  uniforms[th, 5] = B.addresses()[0, j*64*w]
sgemm.py:657: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  uniforms[th, 6] = C.addresses()[i*16*h, j*64*w]
==== sgemm example (96x363 times 363x3072) ====
threads: 12
numpy: 31.9355 sec, 0.0067 Gflops
GPU: 0.0321 sec, 6.7054 Gflops
maximum absolute error: 7.3559e+01

root@osmcpi:/home/jcea/virtualenv/py-videocore/examples# ../../bin/python sgemm_1thread.py 
==== sgemm example (96x363 times 363x3072) ====
threads: 1
numpy: 31.5038 sec, 0.0068 Gflops
GPU: 0.2676 sec, 0.8034 Gflops
maximum absolute error: 7.8866e+01

The virtualenv I am using runs Python 3.4.
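
The deprecation warnings are consistent with index expressions that become floats under Python 3, where / is true division. Whether sgemm.py computes h and w that way is an assumption here, since the relevant lines are not shown. The sketch below illustrates the difference with a plain list; row_blocks is a hypothetical value.

```python
# Under Python 3, "/" always returns a float, while "//" keeps integers.
p = 96            # rows of A, from the "96x363 times 363x3072" line in the log
row_blocks = 2    # HYPOTHETICAL split; the real value is defined in sgemm.py

h_float = p / 16 / row_blocks     # true division gives 3.0
h_int = p // 16 // row_blocks     # floor division gives 3

rows = list(range(p))
i = 1
print(rows[i * 16 * h_int])       # an integer index works

try:
    rows[i * 16 * h_float]        # a float index is rejected by a plain list
except TypeError as exc:
    print('float index rejected:', exc)
```

The NumPy release in the log only warned about such indices. Using // (or wrapping the value in int()) where the index is computed removes the warning.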

The Raspberry Pi 3 is running KODI at the same time, but it is not playing a video; it is just showing the main menu.

"Hello world" example works fine, with maximum error ~1e-7.

Maybe this is useful:

root@osmcpi:/home/jcea/virtualenv/py-videocore/examples# ../../bin/python mailbox.py 
firmware revision: 57bc72be
board model: 0
board revision: a02082
board serial: 000000008a498dfc

nosetests failed

256M memory for GPU

ERROR: test_raw.test_raw_hex

Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
self.test(*self.arg)
File "/home/pi/vision/py-videocore/tests/test_raw.py", line 16, in test_raw_hex
print_qhex(raw_hex, file = f)
File "/home/pi/vision/py-videocore/videocore/assembler.py", line 923, in print_qhex
print("0x{3:02X}{2:02X}{1:02X}{0:02X}, 0x{7:02X}{6:02X}{5:02X}{4:02X},".format(*c), file = file)
TypeError: unicode argument expected, got 'str'
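
This TypeError is what io.StringIO raises on Python 2 when it receives a byte string, so the test presumably passes an io.StringIO object as the file; that is an inference from the traceback, not something shown here. The sketch below picks an in-memory sink that accepts native string literals on either Python version. The helper name make_text_sink is hypothetical.

```python
import io
import sys


def make_text_sink():
    """Return an in-memory file that accepts native str on this Python version."""
    if sys.version_info[0] >= 3:
        return io.StringIO()   # native str is unicode text on Python 3
    return io.BytesIO()        # native str is a byte string on Python 2


sink = make_text_sink()
c = (1, 2, 3, 4, 5, 6, 7, 8)   # illustrative byte values
# Same format string as in the traceback; print(..., file=sink) writes the same text.
sink.write("0x{3:02X}{2:02X}{1:02X}{0:02X}, 0x{7:02X}{6:02X}{5:02X}{4:02X},".format(*c) + "\n")
print(sink.getvalue())
```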

can't call get_max_temperature/get_temperature: struct.error: pack_into expected 7 items for packing (got 6)

Hey, this is a little off topic, but .. not really. I'm trying to distill mailbox.py so I can use it in another open-source app without needing the heavy dependencies of py-videocore. I'm getting errors with the struct, and I don't think it has to do with what I've removed. Including the output first and then the code.

$ python3 vc.py
adding method: get_temperature
adding method: get_max_temperature
adding method: get_throttled
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_add_simple_method', '_simple_call', 'close', 'fd', 'get_max_temperature', 'get_temperature', 'get_throttled']
calling: get_throttled
tag sz: 4
boom. returning.
 n1
0
calling: get_max_temperature
tag sz: 8
Traceback (most recent call last):
  File "vc.py", line 89, in <module>
    print(m.get_max_temperature())
  File "vc.py", line 62, in f
    r = self._simple_call(name, tag, req_fmt, res_fmt, list(args))[5:]
  File "vc.py", line 46, in _simple_call
    *([24 + tag_size, PROCESS_REQUEST, tag, tag_size, tag_size] + args + [0]))
struct.error: pack_into expected 7 items for packing (got 6)

Here's the "distilled" class.

$ cat vc.py

# the following is distilled (reduced) from `py-videocore`, which has a larger purpose but
# happens to include these methods. The following has a MIT license.
# source: https://github.com/nineties/py-videocore

import os
from array import array
from struct import calcsize, pack_into, unpack_from
from fcntl import ioctl

IOCTL_MAILBOX = 0xC0046400   # _IOWR(100, 0, char *)
IOCTL_BUFSIZE = 1024

PROCESS_REQUEST = 0x00000000
REQUEST_SUCCESS = 0x80000000
PARSE_ERROR     = 0x80000001

class MailBoxException(Exception):
  'mailbox exception'

class MailBox(object):
  def __init__(self):
    self.fd = os.open('/dev/vcio', os.O_RDONLY)

  def close(self):
    if self.fd:
      os.close(self.fd)
    self.fd = None

  def __enter__(self):
    return self

  def __exit__(self, exc_type, exc_value, traceback):
    self.close()
    return exc_value is None

  def _simple_call(self, name, tag, req_fmt, res_fmt, args):
    'Call a method which has constant length response.'

    print("calling: {}".format(name))
    # Since the mailbox property interface overwrites the request tag buffer for returning
    # values to the host, size of the buffer must have enough space for both request
    # arguments and returned values. It must also be 32-bit aligned.
    tag_size = (max(calcsize(req_fmt), calcsize(res_fmt)) + 3) // 4 * 4
    print("tag sz: {}".format(tag_size))

    buf = array('B', [0]*IOCTL_BUFSIZE)
    pack_into('=5L' + req_fmt + 'L', buf, 0,
            *([24 + tag_size, PROCESS_REQUEST, tag, tag_size, tag_size] + args + [0]))

    ioctl(self.fd, IOCTL_MAILBOX, buf, True)

    r = unpack_from('=5L' + res_fmt, buf, 0)
    if r[1] != REQUEST_SUCCESS:
      raise MailBoxException('Request failed', name, *args)

    assert(r[4] == 0x80000000 | calcsize(res_fmt))
    print("boom. returning.")
    return r

  @classmethod
  def _add_simple_method(cls, name, tag, req_fmt, res_fmt):
    print("adding method: {}".format(name))
    def f(self, *args):
      r = self._simple_call(name, tag, req_fmt, res_fmt, list(args))[5:]
      n = len(r)
      if n == 1:
        print(" n1")
        return r[0]
      elif n > 1:
        print(" n>1")
        return r
    setattr(cls, name, f)

MAILBOX_METHODS = [
    ('get_temperature',                  0x00030006,  'L',    'LL'),
    ('get_max_temperature',              0x0003000a,  'L',    'LL'),
    ('get_throttled',                    0x00030046,  '',     'L'),

]
for name, tag, req_fmt, res_fmt in MAILBOX_METHODS:
  MailBox._add_simple_method(name, tag, req_fmt, res_fmt)


m = MailBox()
#for name, tag, req_fmt, res_fmt in MAILBOX_METHODS:
#  m._add_simple_method(name, tag, req_fmt, res_fmt)
print(dir(m))
#print(dir(MailBox))
#print(MailBox.get_throttled())
print(m.get_throttled())
print(m.get_max_temperature())
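
The error message lines up with the format string: '=5L' + req_fmt + 'L' with req_fmt 'L' describes seven values, while a call with no arguments supplies only the five header words and the end tag. The sketch below reproduces the packing step without touching /dev/vcio. It uses '=' in calcsize so the sizes are the same on 32-bit and 64-bit hosts, which differs slightly from the code above. Passing 0 as the temperature ID is an assumption about what the firmware expects for these tags.

```python
from struct import calcsize, pack_into, unpack_from, error as StructError

PROCESS_REQUEST = 0x00000000


def pack_request(tag, req_fmt, res_fmt, args):
    """Build a mailbox request buffer using the same layout as _simple_call."""
    tag_size = (max(calcsize('=' + req_fmt), calcsize('=' + res_fmt)) + 3) // 4 * 4
    buf = bytearray(24 + tag_size)
    values = [24 + tag_size, PROCESS_REQUEST, tag, tag_size, tag_size] + list(args) + [0]
    pack_into('=5L' + req_fmt + 'L', buf, 0, *values)
    return buf


GET_MAX_TEMPERATURE = 0x0003000a   # tag value taken from MAILBOX_METHODS above

try:
    pack_request(GET_MAX_TEMPERATURE, 'L', 'LL', [])    # no argument: 6 values for 7 slots
except StructError as exc:
    print('without an argument:', exc)

buf = pack_request(GET_MAX_TEMPERATURE, 'L', 'LL', [0])  # ASSUMED temperature ID 0
print('with an argument:', unpack_from('=5L', buf, 0))
```

In terms of the distilled class, this suggests calling m.get_max_temperature(0) and m.get_temperature(0) instead of calling them without arguments.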

Implementation in C

I've tried the Python examples on my RPi2, and the computation time of the multithreaded sgemm is really amazing.

I wonder whether there is an sgemm implementation in C, so that it can be used in other C applications that rely heavily on matrix computation (e.g. convolution in computer vision).

no /dev/vcsm found

If the developer has installed Raspbian Lite, there will be no /dev/vcsm, so developers need to be informed of this.