
Example: Video Pipelines

Julian Kemmerer edited this page Mar 20, 2026 · 23 revisions

This page describes implementing a basic video manipulation pipeline in PipelineC. This work is used as part of the StreamSoC audio video demo.

[Diagram: the three-module chain described below]

A pixel stream from something like an OV2640 camera's digital video port (DVP) is cropped to a selected area of the image, scaled up by an integer factor, and finally positioned somewhere in the output frame buffer.

Video Stream

This design uses valid-ready handshaking to transfer data one clock cycle at a time. A PipelineC stream represents the data bits and valid flag portion of the handshake: stream(type_t) is a struct with a uint1_t .valid flag and a type_t .data field. The stream/stream.h header provides macros for creating and using these stream types.

A hardware video stream typically consists of a bus that can deliver one or more pixels each clock cycle, along with frame boundary flags marking the last pixel in a line (EOL) and the last line in a frame (EOF). The ndarray.h header is used to define the video_t type for this design: ndarray(NDIMS, type_t) is a struct type with a type_t .data field and uint1_t .eod[NDIMS] flags marking the end of each dimension.

// Types for streaming a two dimensional frame one pixel at a time
#include "ndarray.h"
DECL_NDARRAY_TYPE(2,pixel_t)
DECL_STREAM_TYPE(ndarray(2,pixel_t))
#define video_t ndarray(2,pixel_t)

In this stream(video_t) design with a two-dimensional image: eod[0] is the end-of-line (EOL) 'last pixel in line' flag, and eod[1] is the end-of-frame (EOF) 'last line of pixels' flag (a third dimension's eod[2] could mark the end of a group of frames over time, etc.).

The blocks described below typically have inputs and outputs like:

// Input video handshake
stream(video_t) video_in; // input into module
uint1_t ready_for_video_in; // output from module

// Output video handshake
stream(video_t) video_out; // output from module
uint1_t ready_for_video_out; // input into module

In the diagrams below only the feed-forward data+valid stream wires are shown (the connecting arrow -> direction); the ready feedback flowing in the opposite direction is implied:

video_in           -> video_out           // Shown
ready_for_video_in <- ready_for_video_out // Not shown

For example, the code for a just-wires pass-through of data from input to output looks like:

video_out = video_in; // .data and .valid forward dataflow
ready_for_video_in = ready_for_video_out; // feedback flow control ready 

For more info on reading PipelineC, check out some digital logic basics.

Test Pattern

Below is test code for a pattern generator module. This design uses a pixel_t with 8-bit RGB channels. The output is a square made of three vertical stripes: red, green, and blue.

#define TPW 60
#define TPH 60
stream(video_t) test_pattern(
  uint1_t ready_for_video_out
){
  // Counters for x,y position
  static uint16_t x;
  static uint16_t y;
  // Test square RGB pattern, starting from a black pixel
  stream(video_t) video_out;
  video_out.data.data.r = 0;
  video_out.data.data.g = 0;
  video_out.data.data.b = 0;
  uint16_t color_width = TPW / 3;
  if(x < (1*color_width)){
    video_out.data.data.r = 255;
  }else if(x < (2*color_width)){
    video_out.data.data.g = 255;
  }else{
    video_out.data.data.b = 255;
  }
  video_out.valid = 1;
  video_out.data.eod[0] = x==(TPW-1); // last pixel / EOL
  video_out.data.eod[1] = y==(TPH-1); // last line / EOF
  if(video_out.valid & ready_for_video_out){
    x += 1;
    if(video_out.data.eod[0]){
      y += 1;
      x = 0;
      if(video_out.data.eod[1]){
        y = 0;
      }
    }
  }
  return video_out;
}
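The counter and flag behavior of this generator can be checked with a plain-C software model (a sketch of the accepted-cycle logic above, not the hardware module itself; test_pattern_step and tp_out_t are illustrative names):

```c
#include <assert.h>
#include <stdint.h>

#define TPW 60
#define TPH 60

typedef struct { uint8_t r, g, b; } pixel_t;
typedef struct { pixel_t data; uint8_t eol, eof; } tp_out_t;

// One accepted output cycle of the generator: compute the pixel for
// the current (x,y), set the frame boundary flags, advance counters.
tp_out_t test_pattern_step(uint16_t *x, uint16_t *y) {
    tp_out_t o = {0};
    uint16_t color_width = TPW / 3;
    if (*x < 1 * color_width)      o.data.r = 255;
    else if (*x < 2 * color_width) o.data.g = 255;
    else                           o.data.b = 255;
    o.eol = (*x == TPW - 1); // last pixel in line (EOL)
    o.eof = (*y == TPH - 1); // last line in frame (EOF)
    *x += 1;                 // models the valid & ready accepted case
    if (o.eol) { *x = 0; *y += 1; if (o.eof) *y = 0; }
    return o;
}
```

Stepping this model through one full frame produces exactly TPH EOL pulses and TPW pixels with EOF set, with the counters wrapping back to (0,0).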

Frame Decoder

Using the stream(video_t) type, with end-of-line (EOL) and end-of-frame (EOF) flags available alongside each pixel, it is possible to decode the current frame position from a video stream: when EOL/EOF is asserted, it is known that the next pixel begins the next line or the next frame.

typedef struct frame_decoder_t{
  // Output video stream with position x,y
  stream(video_t) video_out;
  uint16_t dim[2]; // 2d x,y pos
  // Output ready signal for input stream
  uint1_t ready_for_video_in;
}frame_decoder_t;
frame_decoder_t frame_decoder(
  // Input video stream to be decoded
  stream(video_t) video_in,
  // Input ready signal for output stream
  uint1_t ready_for_outputs
);

The frame_decoder takes an input video stream and outputs a copy of that stream along with x,y position values for the current pixel. The output stream only becomes active once the frame boundary signals (EOL/EOF) have been seen and the internal counters are aligned.
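The core counter logic can be sketched in plain C (a software model of the position counters only; the real module also gates on valid/ready and waits for the first EOF before its output becomes active; frame_decoder_step is an illustrative name):

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint16_t x, y; } frame_pos_t;

// Return the (x,y) of the current pixel, then advance the counters
// using the pixel's end-of-line / end-of-frame flags.
frame_pos_t frame_decoder_step(uint16_t *x, uint16_t *y,
                               int eol, int eof) {
    frame_pos_t pos = { *x, *y };
    *x += 1;
    if (eol) { *x = 0; *y += 1; if (eof) *y = 0; }
    return pos;
}
```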

Crop

The cropping module takes a video stream as input and outputs only a selected area of the image in a smaller frame size output stream.

                               parameters --|
                                            |
                                           \|/
input video stream -> frame_decoder -> pixel filter -> output video stream

First the input pixel's x,y position is decoded from the video stream. Once the position is known, a simple set of comparisons against the cropping parameters filters out all pixels that are not inside the cropped area. The math for the compare is not pipelined, but could be; see the later frame buffer Position section for a pipelining example.

typedef struct crop2d_params_t{
  uint16_t top_left_x;
  uint16_t top_left_y;
  uint16_t bot_right_x;
  uint16_t bot_right_y;
}crop2d_params_t;

New frame boundary EOL/EOF flags are calculated using the cropped frame dimensions:

o.video_out.data.eod[0] = decoded.dim[0] == params.bot_right_x;
o.video_out.data.eod[1] = decoded.dim[1] == params.bot_right_y;
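The pixel filter itself amounts to a bounds check of the decoded position against the crop rectangle. A plain-C sketch of that compare (in_crop_area is a hypothetical helper name, and both corners are assumed inclusive, matching the EOL/EOF compares above):

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint16_t top_left_x, top_left_y;
    uint16_t bot_right_x, bot_right_y;
} crop2d_params_t;

// Keep only pixels inside the crop rectangle (corners inclusive).
int in_crop_area(uint16_t x, uint16_t y, crop2d_params_t p) {
    return x >= p.top_left_x && x <= p.bot_right_x &&
           y >= p.top_left_y && y <= p.bot_right_y;
}
```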

The cropping parameters determine the output frame size. If these control register values change mid-frame, partially malformed frames with inconsistent line lengths can be output from the module. For this reason new control register values are only accepted between frames, ensuring complete frames are always output.

Scale

The scaling module of this design does simple integer ratio upscaling via pixel duplication. For example, if the input is four pixels in a 2x2 grid, scaling the output by 2 creates a 4x4 output frame where each pixel has been duplicated 2x2=4 times.

2x2   2x    4x4
     scale

A  B      A A B B
      ->  A A B B 
C  D      C C D D  
          C C D D

Outputting the sequence of repeated pixels is accomplished by recirculating data through a FIFO that is large enough to hold an entire input line.

    parameters --|
                 |
                \|/
          |<-  state  <-|
          |   machine   |
         \|/           /|\
video in --> line_fifo --> video out

The state machine has two primary functions: accepting the input video stream and loading a line of pixels into the FIFO, and outputting pixels from the FIFO while recirculating them back into it.

typedef enum scale_state_t{
  LOAD_PIXELS,
  OUTPUT_PIXELS
}scale_state_t;

For example, in the 2x2 -> 4x4 scaling diagram: the LOAD_PIXELS state loads the FIFO with the A,B pixels of the first input line. The OUTPUT_PIXELS state then iterates as follows: each pixel is output from the FIFO N times per line, and the pixels of an entire input line are recirculated back into the FIFO, for a total of N repeated output lines. Concretely, pixel A is output from the FIFO twice, and the second copy of A is fed back into the write side of the FIFO for the next line. B is then also output twice and written back into the FIFO once. The process then repeats for the entire duplicated second output line (output A twice, B twice). After the final duplicated line is output, the FSM returns to the LOAD_PIXELS state to begin loading the next input line, C,D.
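The end-to-end effect of the pixel duplication can be modeled in a few lines of plain C (a whole-frame software reference, not the streaming FIFO implementation; scale2d is an illustrative name):

```c
#include <assert.h>
#include <stdint.h>

// Each input pixel is repeated 'scale' times per line, and each
// input line is repeated 'scale' times. (The hardware achieves the
// line repeats by recirculating a line through a FIFO rather than
// re-reading the input.)
void scale2d(const uint8_t *in, int w, int h, int scale, uint8_t *out) {
    for (int y = 0; y < h * scale; y++)
        for (int x = 0; x < w * scale; x++)
            out[y * (w * scale) + x] = in[(y / scale) * w + (x / scale)];
}
```

Feeding it the 2x2 A,B,C,D frame with scale 2 reproduces the 4x4 pattern from the diagram above.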

While the FSM is outputting pixels it is not accepting input pixels. Future work might enable loading the next line of pixels during the final duplicated output line, where pixels are no longer being recirculated back into the FIFO (somewhat combining the LOAD and OUTPUT states). Because of these separate LOAD and OUTPUT states, high data rate video streams would likely need a ping-pong dual-FIFO setup, where one FIFO is filled while the other is emptied and repeatedly recirculated.

New frame boundary EOL/EOF flags are calculated using a combination of input frame boundary markers, counters for the width and height of duplicated pixels area, and the size scaling input parameter.

o.video_out.data.eod[0] = fifo.data_out.eod[0] & (width_counter==(params.scale-1));
o.video_out.data.eod[1] = is_last_input_line & (height_counter==(params.scale-1));

The scaling parameters determine the output frame size. As with cropping, if these control register values change mid-frame, partially malformed frames with inconsistent line lengths can be output from the module. For this reason new control register values are only accepted between frames, ensuring complete frames are always output.

Position

The frame buffer positioning module takes a video stream as input and draws it at a particular position inside the output display. Placing an image into the frame involves writing pixel data to the appropriate places in memory. In this design the StreamSoC VGA frame buffer is accessed via an AXI-Lite-like bus.

                        parameters --|
                                     |
                                    \|/
video in -> frame_decoder -> fb_pos_addr_pipeline -> AXI write FSM -> writes into frame buffer

First the input video stream's pixel x,y position is decoded. Once the input position of the pixel is known, the fb_pos_addr_pipeline is used to compute the destination memory address given the desired frame position parameters. Finally each address and pixel pair is sent into a helper FSM that drives the memory bus to do the write.

typedef struct fb_pos_params_t{
  uint16_t xpos;
  uint16_t ypos;
}fb_pos_params_t;

In order to use helper macros from the global_func_inst.h header, the function to be pipelined is formulated with an output_t func(input_t) signature: a single input and a single output, both struct types.

typedef struct fb_pos_addr_req_t{
  pixel_t p;
  uint16_t xpos;
  uint16_t ypos;
  fb_pos_params_t fb_pos;
}fb_pos_addr_req_t;
typedef struct fb_pos_addr_t{
  pixel_t p;
  uint32_t addr;
}fb_pos_addr_t;
fb_pos_addr_t fb_pos_addr(fb_pos_addr_req_t req){
  fb_pos_addr_t o;
  o.p = req.p;
  o.addr = pos_to_addr(req.fb_pos.xpos + req.xpos, req.fb_pos.ypos + req.ypos);
  return o;
}

Being a pure combinational logic function, the math of adding the relative pixel position values and the pos_to_addr address calculation (a few multiplies) can be automatically pipelined for any target FPGA or operating frequency. There is no need to manually plan each pipeline stage for one specific target FPGA, which would create code that is not easily portable from device to device (or even across clock frequency changes on the current device).
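As a sketch of the kind of math involved, here is a plain-C row-major address calculation (FRAME_WIDTH, PIXEL_BYTES, and the layout are assumptions for illustration, not the actual StreamSoC pos_to_addr definition):

```c
#include <assert.h>
#include <stdint.h>

#define FRAME_WIDTH 640   // assumed line stride in pixels
#define PIXEL_BYTES 4     // assumed 32b pixel slots in the frame buffer

// Hypothetical row-major address calculation: a y*width multiply
// plus offsets, exactly the kind of math that auto-pipelines well.
uint32_t pos_to_addr(uint16_t x, uint16_t y) {
    return ((uint32_t)y * FRAME_WIDTH + (uint32_t)x) * PIXEL_BYTES;
}
```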

The GLOBAL_VALID_READY_PIPELINE_INST macro is used to declare the fb_pos_addr_pipeline pipeline and related valid-ready handshaking signals to connect to the pipeline (ex. stream(fb_pos_addr_req_t) fb_pos_addr_pipeline_in).

#include "global_func_inst.h"
// fb_pos_addr_pipeline: a pipeline of fb_pos_addr_t fb_pos_addr(fb_pos_addr_req_t) signature
//   is given buffer space for up to 16 operations in flight
GLOBAL_VALID_READY_PIPELINE_INST(fb_pos_addr_pipeline, fb_pos_addr_t, fb_pos_addr, fb_pos_addr_req_t, 16)

Finally, an axi_shared_bus_t_write_start_logic helper FSM provided by the axi/axi_shared_bus.h header is used to connect the stream of pixels and addresses coming out of the fb_pos_addr_pipeline into 32-bit wide memory bus writes. The design does not wait to confirm that each write has completed before issuing the next; writes are made back to back as fast as possible.

An intentional overflow point was added in this final 'end of the chain' module. It handles the case where writes are not completing fast enough: pixels are dropped without losing frame sync and causing large visual artifacts.

StreamSoC Interface

The image processing chain in this design is controlled by the StreamSoC's RISC-V management CPU. Memory mapped control registers are used to supply the parameters for each module.

video/software/mm_regs.h is used to specify what data is to be memory mapped:

crop2d_params_t crop_params;
scale2d_params_t scale_params;
fb_pos_params_t fb_pos_params;

video/hardware/mm_regs.c is where the hardware memory mapped registers struct mm_regs is connected to global variable wires in the design. In this case the global wires have the same names as the registers:

crop_params = mm_regs.crop_params;
scale_params = mm_regs.scale_params;
fb_pos_params = mm_regs.fb_pos_params;

video/software/device.h is the device control software that uses the C global pointer mm_regs to access the hardware registers, for example incrementing the scale control register:

mm_regs->scale_params.scale += 1;
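On the software side this pointer-to-struct pattern looks roughly as follows in plain C (the struct contents and the MM_REGS_BASE address are illustrative placeholders, not the actual StreamSoC register layout; this sketch points mm_regs at a plain struct so it can run on a host):

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint16_t scale; } scale2d_params_t;
typedef struct { scale2d_params_t scale_params; /* other params */ } mm_regs_t;

// On the real SoC mm_regs would point at the device's base address,
// e.g.: volatile mm_regs_t *mm_regs = (volatile mm_regs_t *)MM_REGS_BASE;
// For this host-side sketch it points at an ordinary struct instead.
static mm_regs_t fake_regs;
static volatile mm_regs_t *mm_regs = &fake_regs;
```

The volatile qualifier keeps the compiler from caching or reordering the register accesses.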

Instantiating the chain of blocks is done by including their device.c files into the SoC's devices and dataflow area cpu/hardware/devices.c:

#include "../../video/crop/hardware/device.c"
#include "../../video/scale/hardware/device.c"
#include "../../video/fb_pos/hardware/device.c"

#pragma MAIN video_dataflow
void video_dataflow(){
  // Camera video into crop
  crop_video_in = cam_input_video_fifo_out;
  cam_input_video_fifo_out_ready = crop_video_in_ready;

  // Crop output into scale input
  scale_video_in = crop_video_out;
  crop_video_out_ready = scale_video_in_ready;

  // Scale output into position input
  fb_pos_video_in = scale_video_out;
  scale_video_out_ready = fb_pos_video_in_ready;
}

Connecting the control registers into the SoC similarly adds #includes for the video/ device registers defined above.

The device.h control software uses dev board switches and buttons to adjust register values:

// Four switches control button function (one-hot)
switches[3:0]:
  0b0001 : crop size
  0b0010 : crop position
  0b0100 : scale factor
  0b1000 : frame buffer position

// Four buttons are mapped to up,down,left,right
buttons[3:0] = R,L,D,U

These button and switch mappings were presented to Claude, which was asked to produce the simple code for handling parameter changes. The code handles two things: incrementing/decrementing the parameter while clamping to frame size boundaries, and ensuring old pixels left over from previous parameter settings are erased from the screen.
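The clamped increment/decrement amounts to a small helper like the following plain-C sketch (adjust_clamped and the [0, max] bound are illustrative; the actual generated code also handles the screen-erase step):

```c
#include <assert.h>
#include <stdint.h>

// Move a position parameter up or down by one step while clamping
// to [0, max] so the image can never be moved outside the frame.
uint16_t adjust_clamped(uint16_t value, int up, uint16_t max) {
    if (up) {
        if (value < max) value += 1;
    } else {
        if (value > 0) value -= 1;
    }
    return value;
}
```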
