mc1's Introduction

This repo has moved to: https://gitlab.com/mrisc32/mrisc32

MRISC32

This is an open and free 32-bit RISC/Vector instruction set architecture (ISA), primarily inspired by the Cray-1 and MIPS architectures. The focus is to create a clean, modern ISA that is equally attractive to software, hardware and compiler developers.

This repository contains LaTeX documentation and databases of architectural information (e.g. instructions and system registers).

Documentation

The latest MRISC32 Instruction Set Manual (PDF) describes the MRISC32 ISA in detail.

Overview documents:

Features

  • Unified scalar/vector/integer/floating-point ISA.
  • There are two register files:
    • R0-R31: 32 scalar registers, each 32 bits wide.
      • Three registers have special meaning in hardware: Z, LR, VL.
      • 29 registers are general purpose (of which three are reserved by the ABI: SP, FP, TP).
      • All registers can be used for all types (integers, addresses and floating-point).
    • V0-V31: 32 vector registers, each with at least 16 32-bit elements.
      • All registers can be used for all types (integers, addresses and floating-point).
  • All instructions are 32 bits wide and easy to decode.
  • Most instructions are non-destructive 3-operand (two sources, one destination).
  • All conditionals are based on register content.
    • There are no condition code flags (carry, overflow, ...).
    • Compare instructions generate bit masks.
    • Branch instructions can act on bit masks (all bits set, all bits zero, etc) as well as signed quantities (less than zero, etc).
    • Bit masks are suitable for masking in conditional operations (for scalars, vectors and packed data types).
  • Powerful addressing modes:
    • Scaled indexed load/store (x1, x2, x4, x8).
    • Gather-scatter and stride-based vector load/store.
    • PC-relative and absolute load/store:
      • ±4 MiB range with one instruction.
      • Full 32-bit range with two instructions.
    • PC-relative and absolute branch:
      • ±4 MiB range with one instruction.
      • Full 32-bit range with two instructions.
  • Many traditional floating-point operations can be handled in whole or partially by integer operations, reducing the number of necessary instructions:
    • Load/store.
    • Branch.
    • Sign and bit manipulation (e.g. neg, abs).
  • Vector operations use a Cray-like model:
    • Vector operations are variable length (1-N elements).
    • Most integer and floating-point instructions come in both scalar and vector variants.
    • Vector instructions can use both vector and scalar operands (including immediate values), which removes the overhead of transferring scalar data into vector registers.
  • In addition to vector operations, there are also packed operations that operate on small data types (byte and half-word).
  • Fixed point operations are supported:
    • Single instruction multiplication of Q31, Q15 and Q7 fixed point numbers.
    • Single instruction conversion between floating-point and fixed point.
    • Saturating and halving addition and subtraction.

Note: There is no support for 64-bit floating-point operations (that is left for a 64-bit version of the ISA).
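
The mask-based conditionals listed above can be illustrated with a small C model. This is only a sketch of the idea: the helper names `cmp_slt`, `mask_select` and `min_s32` are hypothetical and do not correspond to specific MRISC32 mnemonics.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: models a "set if less than" compare that yields an
   all-ones mask (0xffffffff) when a < b, and zero otherwise. */
static uint32_t cmp_slt(int32_t a, int32_t b) {
    return (a < b) ? UINT32_MAX : 0u;
}

/* Hypothetical helper: bitwise select. Bits are taken from x where the
   mask is set, and from y where it is clear. */
static uint32_t mask_select(uint32_t mask, uint32_t x, uint32_t y) {
    return (x & mask) | (y & ~mask);
}

/* Branch-free signed minimum built from the two helpers above. */
static int32_t min_s32(int32_t a, int32_t b) {
    uint32_t mask = cmp_slt(a, b);
    assert(mask == 0u || mask == UINT32_MAX);  /* Masks are all-or-nothing. */
    /* The cast back to int32_t assumes a two's complement target. */
    return (int32_t)mask_select(mask, (uint32_t)a, (uint32_t)b);
}
```

Because the mask is an ordinary register value, the same select pattern applies unchanged to each element of a vector or to each lane of a packed word.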

mc1's People

Contributors

mbitsnbites

mc1's Issues

Optimize block RAM (BRAM) usage

Investigate if BRAM in a typical device can be better utilized.

For instance, the VCPP stacks only use 384 bits each, hardly filling up a single memory block (usually 9+ Kbit / block), meaning that lots of block RAM bits go to waste. Could we use MLAB instead of BRAM in Intel devices, for instance?

vcr: Add a BGCOL register

The BGCOL register would be a 24-bit RGB color that is displayed when no pixels are showing (i.e. outside HSTRT/HSTOP).

As an extension the pixels could be alpha blended with the background color.
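
A minimal C sketch of such a blend follows. The pixel layout (alpha in bits 31..24, three 8-bit colour channels below it) and the function names are assumptions made for illustration, not part of the proposal.

```c
#include <assert.h>
#include <stdint.h>

/* Blend one 8-bit channel: alpha = 255 gives src, alpha = 0 gives bg.
   The +127 rounds to nearest before the division by 255. */
static uint32_t blend_channel(uint32_t src, uint32_t bg, uint32_t alpha) {
    assert(src <= 255u && bg <= 255u && alpha <= 255u);
    return (src * alpha + bg * (255u - alpha) + 127u) / 255u;
}

/* Blend a 32-bit pixel (assumed layout: alpha in bits 31..24, three 8-bit
   colour channels in bits 23..0) over a 24-bit background colour. */
static uint32_t blend_over_bgcol(uint32_t pixel, uint32_t bgcol) {
    uint32_t alpha = pixel >> 24;
    uint32_t out = 0;
    for (int shift = 0; shift < 24; shift += 8) {
        uint32_t s = (pixel >> shift) & 0xffu;
        uint32_t b = (bgcol >> shift) & 0xffu;
        out |= blend_channel(s, b, alpha) << shift;
    }
    return out;
}
```

Each channel is processed identically, so the sketch does not depend on the order of the colour channels.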

ROM: Infer block ROM

According to Altera: Recommended HDL Coding Styles (page 6-30), this should infer synchronous ROM:

LIBRARY ieee;
USE ieee.std_logic_1164.all;

ENTITY sync_rom IS
  PORT (
    clock: IN STD_LOGIC;
    address: IN STD_LOGIC_VECTOR(7 downto 0);
    data_out: OUT STD_LOGIC_VECTOR(5 downto 0)
  );
END sync_rom;

ARCHITECTURE rtl OF sync_rom IS
BEGIN
  PROCESS (clock)
  BEGIN
    IF rising_edge (clock) THEN
      CASE address IS
        WHEN "00000000" => data_out <= "101111";
        WHEN "00000001" => data_out <= "110110";
        ...
        WHEN "11111110" => data_out <= "000001";
        WHEN "11111111" => data_out <= "101010";
        WHEN OTHERS => data_out <= "101111";
      END CASE;
    END IF;
  END PROCESS;
END rtl;

Create a VCP assembler

The VCP assembler should accept a simple assembly syntax, e.g. similar to:

    ; Video registers
    .set    ADDR, 0
    .set    XOFFS, 1
    .set    XINCR, 2
    .set    HSTRT, 3
    .set    HSTOP, 4
    .set    CMODE, 5

    ; CMODE constants
    .set    CM_RGBA8888, 0
    .set    CM_RGBA5551, 1
    .set    CM_PAL8, 2
    .set    CM_PAL4, 3
    .set    CM_PAL2, 4
    .set    CM_PAL1, 5

    ; Set the program start address
    .org    0x000100

main:
    ; Display nothing
    setreg  HSTRT, 0
    setreg  HSTOP, 0

    ; Set the video mode
    setreg  XOFFS, 0x000000
    setreg  XINCR, 0x008000   ; 640 pixels/row
    setreg  CMODE, CM_PAL8

    ; Set the palette
    jsr     load_palette_a

    ; Activate video output starting at row 0.
    wait    0
    setreg  HSTOP, 1280

    ; Generate video addresses for all rows.
    .set    row, 0
    .set    row_addr, 0x001000
    .rept   360
      wait    row
      setreg  ADDR, row_addr
      .add    row, 2
      .add    row_addr, 160   ; Row stride
    .endr

    ; End of program
    wait    32767

load_palette_a:
    ; Load a palette with 256 colors.
    setpal  0, 255
    .word   0xff00ff00
    .lerp   0x01010101, 0xffffffff, 255
    rts

Syntax rules should be obvious from the above example (anything fancier should be unnecessary, to start with).

The following directives would be useful:

  • .org - set the program location
  • .rept/.endr - repeat section
  • .set - define symbol
  • .add - add a value to a symbol (redundant if .set supports mathematical expressions)
  • .lerp - generate a linear RGBA8888 gradient (linear interpolation between two end-values)
  • .word - insert one or more 32-bit data elements into the stream
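
As an illustration of what .lerp could generate, here is a hedged C sketch. The function name `lerp_rgba8888` is hypothetical, and the convention that both end-values are included in the output is an assumption (it happens to make the palette example above step by exactly one per entry).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write `count` 32-bit words to `out`, forming a per-channel linear
   gradient from `first` to `last`. Assumption: both end-values are
   included, i.e. out[0] == first and out[count - 1] == last. */
static void lerp_rgba8888(uint32_t first, uint32_t last, size_t count,
                          uint32_t *out) {
    assert(count >= 1);
    int32_t steps = (count > 1) ? (int32_t)(count - 1) : 1;
    for (size_t i = 0; i < count; ++i) {
        uint32_t word = 0;
        for (int shift = 0; shift < 32; shift += 8) {
            int32_t a = (int32_t)((first >> shift) & 0xffu);
            int32_t b = (int32_t)((last >> shift) & 0xffu);
            /* Integer interpolation, truncated towards zero. */
            int32_t v = a + ((b - a) * (int32_t)i) / steps;
            word |= (uint32_t)v << shift;
        }
        out[i] = word;
    }
}
```

With the arguments from the example (0x01010101, 0xffffffff, 255) this produces 0x01010101, 0x02020202, and so on up to 0xffffffff.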

The output from the assembler should be one of the following:

  • A GNU assembler compatible source file, suitable for inclusion (.inc).
  • A raw binary file.

It would be useful for a CPU (MRISC32) program to get access to some of the VCP-symbols, e.g. to determine the location of palette data. This could be handled by a .global directive that adds a GNU assembler symbol with the value of the given VCP symbol, for instance.

libc: Add a simple version of printf()

In its simplest form, printf(const char* str, ...) will:

  • Support up to N variable arguments (e.g. N=7, which fits into registers).
  • Iterate str to find %d, %h and %s:
    • Copy the substring up to the % and call vcon_print() (use a static buffer, or malloc?).
      • Better yet: Introduce vcon_printn() that takes a length parameter.
    • Depending on the type (d, h or s), call vcon_print_dec(), vcon_print_hex() or vcon_print() with the correct argument.

It does not have to be a full printf() implementation, and we do not need to support sprintf() etc. Just have it around for convenience in C land.
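
The loop described above could look roughly like the following sketch. To keep it runnable anywhere, the console routines are replaced by hypothetical stand-ins (`out_printn`, `out_print`, `out_print_num`, `out_print_dec`) that append to a fixed buffer, and `mini_printf` is a placeholder name. Using a length-taking print routine avoids the need for a temporary substring buffer.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the console routines: they append to a
   fixed buffer instead of drawing on the screen. */
static char g_out[256];
static size_t g_len;

static void out_printn(const char *s, size_t n) {
    assert(g_len + n < sizeof g_out);
    memcpy(&g_out[g_len], s, n);
    g_len += n;
    g_out[g_len] = '\0';
}

static void out_print(const char *s) { out_printn(s, strlen(s)); }

static void out_print_num(unsigned value, unsigned base) {
    char tmp[32];
    size_t pos = sizeof tmp;
    do {
        tmp[--pos] = "0123456789abcdef"[value % base];
        value /= base;
    } while (value != 0);
    out_printn(&tmp[pos], sizeof tmp - pos);
}

static void out_print_dec(int value) {
    unsigned magnitude = (unsigned)value;
    if (value < 0) {
        out_printn("-", 1);
        magnitude = 0u - magnitude;
    }
    out_print_num(magnitude, 10);
}

/* Minimal printf: understands %d, %h (hexadecimal) and %s only. */
static void mini_printf(const char *str, ...) {
    va_list args;
    va_start(args, str);
    const char *run = str;  /* Start of the pending literal text. */
    for (const char *p = str; *p != '\0'; ++p) {
        if (*p != '%' || p[1] == '\0') continue;
        out_printn(run, (size_t)(p - run));  /* Flush text before '%'. */
        ++p;
        switch (*p) {
            case 'd': out_print_dec(va_arg(args, int)); break;
            case 'h': out_print_num(va_arg(args, unsigned), 16); break;
            case 's': out_print(va_arg(args, const char *)); break;
            default:  out_printn(p - 1, 2); break;  /* Unknown: echo as-is. */
        }
        run = p + 1;
    }
    out_print(run);  /* Flush the remaining tail. */
    va_end(args);
}
```

On the real target the stand-ins would simply forward to the corresponding console functions.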

Improve the MMIO registers

Set up R/W MMIO registers and map them to useful I/O, both external board I/O and internal machine I/O. These will be different for different platforms, so make it possible to map them to different things but try to keep the register names/locations as portable as possible.

Example board I/O:

  • LED:s [out] - One 32-bit register?
  • 7-16 segment display [out] - One register per char x 8 chars?
  • Switches [in] - One 32-bit register?
  • Buttons [in] - One 32-bit register?
  • PS/2 [in] - One 32-bit register?
  • microSD [in/out] - Two 32-bit registers?
  • GPIO [in/out] - Many 32-bit registers?

Example internal I/O:

  • CPU clock frequency in Hz [in] - One 32-bit register.
  • Memory size [in] - One 32-bit register per memory type (VRAM, DRAM?).
  • Native video resolution [in] - Three registers: width, height, screen refresh rate.
  • Video frame number (free running counter) [in] - One 32-bit register.
  • Clock counter (free running counter, e.g. µs resolution or CPU clock ticks) [in] - One or two 32-bit registers.
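
From the C side, such a register block might be described as in the sketch below. All names, the field order and the offsets are hypothetical; only the kinds of registers are taken from the lists above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register block. Names, order and offsets are
   illustrative only. */
typedef struct {
    /* Internal machine I/O (inputs). */
    volatile uint32_t cpu_clk_hz;    /* CPU clock frequency in Hz. */
    volatile uint32_t vram_size;     /* VRAM size in bytes. */
    volatile uint32_t vid_width;     /* Native video width. */
    volatile uint32_t vid_height;    /* Native video height. */
    volatile uint32_t vid_fps;       /* Screen refresh rate. */
    volatile uint32_t vid_frame_no;  /* Free running frame counter. */
    volatile uint32_t clk_cnt_lo;    /* Clock counter, low 32 bits. */
    volatile uint32_t clk_cnt_hi;    /* Clock counter, high 32 bits. */
    /* Board I/O. */
    volatile uint32_t switches;      /* Input. */
    volatile uint32_t buttons;       /* Input. */
    volatile uint32_t leds;          /* Output. */
} mmio_regs_t;

/* Read a 64-bit counter that is exposed as two 32-bit registers. The
   high half is read before and after the low half, and the read is
   retried if the high half changed in between. */
static uint64_t mmio_read_clk_cnt(const mmio_regs_t *regs) {
    assert(regs != NULL);
    uint32_t hi, lo;
    do {
        hi = regs->clk_cnt_hi;
        lo = regs->clk_cnt_lo;
    } while (hi != regs->clk_cnt_hi);
    return ((uint64_t)hi << 32) | lo;
}
```

Keeping the portable registers at fixed offsets at the start of the block would let the same binary run on different boards, with board-specific registers placed after them.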

video: Add simple 1-word caches to reduce VRAM traffic

The pixel pipeline often reads the same word over and over, especially in palette mode and low resolutions (e.g. 320x180 8bpp => 16 reads of each word, 1280x720 1bpp => 32 reads). Add a simple 1-word cache, or better yet a 2-word cache with prefetch.

Similarly the VCPP reads the same word over and over during WAIT commands. Add a 1-word cache.

The goal is to free up VRAM bandwidth to enable more things to be connected to the VRAM read port, in particular:

  1. A second video pipeline for two-layer graphics.
  2. Audio DMA.
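
The expected saving for the pixel pipeline can be estimated with a small software model. This is a sketch under stated assumptions: the names are hypothetical, XINCR is treated as a 16.16 fixed-point step, and pixels are assumed to be packed into 32-bit words.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a 1-word cache in front of the VRAM read port. */
typedef struct {
    uint32_t addr;        /* Word address currently held. */
    int valid;            /* Non-zero once a word has been loaded. */
    uint32_t vram_reads;  /* Number of reads that reached the VRAM. */
} word_cache_t;

static void cache_fetch(word_cache_t *c, uint32_t addr) {
    if (c->valid && c->addr == addr) return;  /* Hit: no VRAM access. */
    c->addr = addr;
    c->valid = 1;
    c->vram_reads++;
}

/* Count VRAM reads for one scanline of `native_width` output pixels.
   `xincr` is the source step per output pixel (assumed 16.16 fixed
   point). With use_cache == 0 every fetch is forced to miss. */
static uint32_t count_row_reads(uint32_t native_width, uint32_t xincr,
                                uint32_t bits_per_pixel, int use_cache) {
    assert(bits_per_pixel >= 1 && bits_per_pixel <= 32);
    assert(32 % bits_per_pixel == 0);
    word_cache_t cache = {0, 0, 0};
    uint32_t pixels_per_word = 32u / bits_per_pixel;
    uint32_t xpos = 0;  /* Source pixel position, 16.16 fixed point. */
    for (uint32_t x = 0; x < native_width; ++x) {
        uint32_t word_addr = (xpos >> 16) / pixels_per_word;
        if (!use_cache) cache.valid = 0;
        cache_fetch(&cache, word_addr);
        xpos += xincr;
    }
    return cache.vram_reads;
}
```

For a 1280 pixel wide native row showing 320 pixels at 8 bpp, the model gives 1280 reads without the cache and 80 with it, which matches the factor of 16 mentioned above.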

vcpp: Glitch when coming out from WAIT while VRAM is busy

The VCPP can feed incorrect data to the ID/EX stage (only for SETPAL commands?) when it comes out of a WAITX instruction and competes with the pixel pipeline for the VRAM.

Testcase

dual-gradients.vcp

Expected

The "lines" should be blue:ish.

Actual

After a certain row the lines become pink (i.e. the incorrect palette index gets set to the pink color).

vcpp: One instruction may be lost after a WAIT instruction

We currently have to insert a single NOP after each WAIT instruction, since the instruction directly after the WAIT instruction may be lost (depending on the pipeline/scheduling situation):

    wait    123
    nop             ; <- SHOULD NOT BE NEEDED!
    setreg  0, 0x004567

VCPP: Add a jump-to-subroutine instruction

A jump-to-subroutine instruction could be useful for several things, but two very important use cases are:

  1. Allow the first program instruction, which is located at a fixed memory address, to be a branch to the real VCP location, anywhere in the video RAM.
  2. Re-use large:ish program parts. This is most useful for alternating between two or more palettes, for instance (e.g. every second line is a darker version of the main palette).

We'd need an internal call stack, but it should be sufficient with a very small stack (e.g. 8 entries). The return instruction can be encoded as a jump to address zero (which is something you'd never want to do).
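
A behavioural model of the proposed call stack, written in C, could look like this. It is only a sketch: the type and function names are hypothetical, and it assumes word addressing with the return address being the word after the jump instruction.

```c
#include <assert.h>
#include <stdint.h>

#define VCPP_STACK_DEPTH 8  /* Stack size suggested in the text. */

typedef struct {
    uint32_t pc;                       /* Current program address. */
    uint32_t stack[VCPP_STACK_DEPTH];  /* Return addresses. */
    unsigned sp;                       /* Number of entries in use. */
} vcpp_model_t;

/* Execute a jump-to-subroutine instruction located at m->pc. A target
   of zero is interpreted as "return". */
static void vcpp_jsr(vcpp_model_t *m, uint32_t target) {
    if (target == 0) {
        assert(m->sp > 0);                 /* Return without a call. */
        m->pc = m->stack[--m->sp];
    } else {
        assert(m->sp < VCPP_STACK_DEPTH);  /* Call stack overflow. */
        m->stack[m->sp++] = m->pc + 1;     /* Assumed: word addresses. */
        m->pc = target;
    }
}
```

In hardware the two assertions would have to be replaced by a defined behaviour, for example ignoring the instruction or wrapping the stack pointer.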

video: Improve dithering

After the final color stage, add dithering with a configurable (compile time) target bit resolution.

E.g. if the hardware has 12-bit RGB output (4 bits per component), the lower 4 bits of each component will be dropped. Therefore we'll add dithering in the range [0, 2^4-1] (i.e. [0, 15]).

Design ideas:

  • It should be possible to disable dithering completely at compile time (e.g. if full 24-bit video output is available).
  • It should be possible to configure the dithering method at run time via a VCR.
  • Implement a white noise ditherer (mostly for reference).
  • Implement a simple error diffusion algorithm e.g. one-dimensional or two-dimensional error diffusion (the latter would require a dual-ported line buffer).
  • Add a noisy start value per line (may be useful for reducing one-dimensional error diffusion artifacts).
  • Add a noisy start value per frame (should give temporal dithering which may hide dithering patterns, but care must be taken not to introduce flickering).
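
The one-dimensional error diffusion variant can be sketched in C for a single colour channel as follows. The function name and the `seed` parameter (the start error for the row) are hypothetical, and the real implementation would of course be written in HDL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Quantize one row of 8-bit samples to `out_bits` bits. The bits that
   are dropped from one pixel are added to the next pixel in the row.
   `seed` is the start error for the row (0 for none, or a noisy value
   per line or per frame). */
static void dither_row(const uint8_t *in, uint8_t *out, size_t n,
                       unsigned out_bits, unsigned seed) {
    assert(out_bits >= 1 && out_bits <= 8);
    unsigned drop = 8u - out_bits;      /* Number of bits dropped. */
    unsigned mask = (1u << drop) - 1u;  /* E.g. 15 for 4-bit output. */
    unsigned err = seed & mask;
    for (size_t i = 0; i < n; ++i) {
        unsigned v = in[i] + err;
        if (v > 255u) v = 255u;         /* Saturate. */
        out[i] = (uint8_t)(v >> drop);
        err = v & mask;                 /* Carry the dropped bits forward. */
    }
}
```

For a constant input of 8 and 4-bit output, the result alternates between 0 and 1, so the average over the row is preserved.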

Introduce I/O buffer memory areas

Many I/O tasks require buffers (e.g. DMA, microSD I/O, keyboard FIFO, etc). These can be mapped into the MMIO memory area, but should use a RAM model rather than the current MMIO registers.

Decide whether to use a common RAM (e.g. BRAM) for all I/O areas, or independent RAM areas for each I/O device (the latter would eliminate bus contention, but requires more MUX logic on the CPU memory bus side, and possibly more BRAM blocks).

video: Add a second video layer

The second video layer will be a copy of the current video pipeline (so we get two VCPP:s, two sets of registers, two palettes and two pixel pipelines).

Use alpha blending (configurable) to mix the two layers.

Depends on #3

vcpp: Add a WAITX instruction

The WAIT instruction can be divided into a WAITY and a WAITX instruction.

One possibly useful application for the WAITX instruction is to mirror the image horizontally by waiting for a certain column and then negating the XINCR register.

VCPP: Change instr encoding: cmd bits in LSB:s

By moving the command bits (4 MSB) and some other bits (e.g. reg number in SETREG) to the least significant bits, many instructions can be created with one ldi MRISC32 instruction instead of two (ldhi + or).

ROM: Add a simple memory allocator

With the allocator in place it's easier to write modular software.

To start with the allocator can just be implemented as a sorted array of allocated blocks (pointer, size), and we can limit the array to 1024 allocation blocks, for instance.
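
A first-fit allocator along those lines might look like the sketch below. The names (`allocator_t`, `mem_init`, `mem_alloc`, `mem_free`) are hypothetical, and the rounding of sizes to whole 32-bit words assumes a word-aligned memory area.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_BLOCKS 1024  /* Limit suggested in the text. */

typedef struct { uintptr_t ptr; size_t size; } block_t;

typedef struct {
    uintptr_t base;              /* Start of the managed memory area. */
    size_t size;                 /* Size of the area in bytes. */
    block_t blocks[MAX_BLOCKS];  /* Allocated blocks, sorted by address. */
    size_t count;
} allocator_t;

static void mem_init(allocator_t *a, void *base, size_t size) {
    a->base = (uintptr_t)base;
    a->size = size;
    a->count = 0;
}

/* First fit: walk the sorted array and use the first gap that is big
   enough. Returns NULL if there is no such gap or the array is full. */
static void *mem_alloc(allocator_t *a, size_t size) {
    size = (size + 3u) & ~(size_t)3u;  /* Round up to whole words. */
    if (size == 0 || a->count == MAX_BLOCKS) return NULL;
    uintptr_t gap_start = a->base;
    size_t i;
    for (i = 0; i <= a->count; ++i) {
        uintptr_t gap_end =
            (i < a->count) ? a->blocks[i].ptr : a->base + a->size;
        if (gap_end - gap_start >= size) break;
        if (i < a->count) gap_start = a->blocks[i].ptr + a->blocks[i].size;
    }
    if (i > a->count) return NULL;  /* No gap was large enough. */
    /* Insert the new block at index i, keeping the array sorted. */
    memmove(&a->blocks[i + 1], &a->blocks[i],
            (a->count - i) * sizeof(block_t));
    a->blocks[i].ptr = gap_start;
    a->blocks[i].size = size;
    a->count++;
    return (void *)gap_start;
}

/* Free a block. The pointer must have been returned by mem_alloc(). */
static void mem_free(allocator_t *a, void *ptr) {
    for (size_t i = 0; i < a->count; ++i) {
        if (a->blocks[i].ptr == (uintptr_t)ptr) {
            memmove(&a->blocks[i], &a->blocks[i + 1],
                    (a->count - i - 1) * sizeof(block_t));
            a->count--;
            return;
        }
    }
    assert(!"mem_free: unknown pointer");
}
```

Since the bookkeeping lives outside the managed area, freeing a block automatically merges it with neighbouring gaps. The cost is a linear search and a memmove per operation, which should be acceptable for at most 1024 blocks.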

ROM: Add C API video routines

Example:

  • vid_create_fb_vcp() - Create a VCP for a regular framebuffer (width, height, cmode, ptr). Return VCP base address and address to the palette.
  • vid_active_vcp() - Set the active VCP.
  • vid_wait_vblank() - Wait for vertical blank.
