Optimised video stream rendering with OpenGL


This article offers a technical insight into how the video of a VoIP call is rendered on a screen. While this may appear to be a rather basic operation at first glance, it is not, as it requires very CPU-intensive computations to convert the pixel format. By offloading these computations onto the graphics card, we can save a lot of CPU bandwidth, which can then be allocated to the video encoder, thus enabling better overall quality. Our software developer Thibault, who used to work on this project with the Liblinphone team and who is now part of the Flexisip team, has written the following explanation for you.

A developer's story time

OpenGL, Colourspaces, and Linphone

What is OpenGL? What is Y'CbCr? And why does Linphone need any of this?

In this article, we will focus on the very last part of the video processing pipeline of a VoIP call (capture, encoding, transmission, decoding, rendering): the video renderer.

The renderer takes its input from the video decoder (H.264, VP8, etc.) and paints its output to a graphical buffer, which is the rectangle you see on your screen (the one with the smiling face of your friend).

This is where we encounter Y'CbCr and colourspaces. For technical reasons, the output of a video decoder is a frame in a Y'CbCr colourspace.
Most programmers and digital artists are used to representing images in the RGB colourspace, which is a grid of pixels with three channels: the red, green, and blue channels (and sometimes an alpha channel for transparency).
A Y'CbCr data frame is almost the same, except it encodes the image with three different channels: luma (Y'), chroma blue (Cb), and chroma red (Cr).
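
To give the intuition without going into the details (the exact offsets and scale factors are spelled out in the Wikipedia article mentioned below): the luma channel Y' is a weighted sum of the gamma-corrected red, green, and blue components, and the two chroma channels measure how far the blue and red components are from that luma.

\[
Y' = K_r R' + K_g G' + K_b B', \qquad C_B \propto B' - Y', \qquad C_R \propto R' - Y'
\]

For ITU-R BT.601, the weights are K_r = 0.299, K_g = 0.587 and K_b = 0.114. Keep these K constants in mind; they will come back at the end of this story.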

At this point, and for details on how this colourspace works, I encourage you to read the summary and the Rationale section of the Wikipedia article on Y'CbCr.

All that is then left for the renderer to do is to translate the data frame to the RGB colourspace, scale it to match the resolution of the output buffer, and draw the result onto that buffer.
This is, however, a massively parallel task, as the same computation has to be repeated for every pixel. If you have a background in computer graphics, you know what I'm getting at: this is a job for your graphics card!

To be able to talk to the graphics card, you need OpenGL.
OpenGL (Open Graphics Library) is a cross-platform library and standard that allows a programmer to control a GPU (graphics processing unit) to carry out tasks such as 2D and 3D rendering.
It is well known in the video game industry, alongside DirectX and Metal (equivalent but platform-specific libraries), and is now slowly being superseded by the newer Vulkan standard.

So there you have it: Linphone uses OpenGL to perform the Y'CbCr → RGB translation for video rendering.

My mission around the video renderer was to make the existing code compliant with the OpenGL version 4.1 specification. There was existing code that I would need to either evolve or rewrite, but in any case I assumed I had a rather good understanding of what it was doing...

I found that the existing code was doing the Y'CbCr to RGB conversion in more or less this way:
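
(What follows is a sketch rather than Linphone's exact source: it uses the rounded BT.601 constants you will find all over the web, and the variable names are mine.)

    // Fragment shader (old-style GLSL): convert one planar Y'CbCr pixel to RGB.
    uniform sampler2D t_texture_y;   // luma (Y') plane
    uniform sampler2D t_texture_u;   // chroma blue (Cb) plane
    uniform sampler2D t_texture_v;   // chroma red (Cr) plane
    varying vec2 uvVarying;          // texture coordinate from the vertex shader

    void main()
    {
        // Each plane stores one 8-bit value per texel, mapped by OpenGL to [0, 1].
        float y = texture2D(t_texture_y, uvVarying).r;
        float u = texture2D(t_texture_u, uvVarying).r;
        float v = texture2D(t_texture_v, uvVarying).r;

        // Remove the limited-range offsets: Y' starts at 16/255, Cb/Cr are centred on 128/255.
        vec3 yuv = vec3(y - 0.0625, u - 0.5, v - 0.5);

        // The "magic" 3x3 conversion matrix (GLSL matrices are column-major):
        //   R = 1.164 Y'            + 1.596 Cr
        //   G = 1.164 Y' - 0.392 Cb - 0.813 Cr
        //   B = 1.164 Y' + 2.017 Cb
        mat3 yuv2rgb = mat3(1.164,  1.164, 1.164,
                            0.0,   -0.392, 2.017,
                            1.596, -0.813, 0.0);

        gl_FragColor = vec4(yuv2rgb * yuv, 1.0);
    }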

This is basically a 3x3 matrix that is applied to each Y'CbCr pixel to obtain an RGB pixel. A 720p image contains 1280 × 720 = 921,600 pixels, so that is a lot of computation, and that's where OpenGL helps.
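
To give an order of magnitude, assuming a typical 30 frames per second stream:

\[
921\,600 \ \text{pixels/frame} \times 30 \ \text{frames/s} \approx 2.8 \times 10^{7} \ \text{pixel conversions per second}
\]

and each conversion is a 3x3 matrix multiplication (9 multiplications and 6 additions) on top of the range offsets. A CPU can do it, but a GPU is built for exactly this kind of embarrassingly parallel work.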

In version 2.0 (2004), OpenGL introduced the concept of shaders through the OpenGL Shading Language (GLSL), a C-like language for expressing computations to be performed by the GPU. Although initially created to compute light and colour levels when rendering a 3D object, shaders have evolved to perform a variety of specialised functions. Here, a clever, if unorthodox, use of shaders is applied: there is no real 3D scene, we simply draw a flat texture on the screen. The texture, instead of being given as an RGB buffer as expected, is a Y'CbCr image, and the shader program executed by the GPU transforms this Y'CbCr texture into an RGB one. Simple, isn't it?

But, where does the transformation matrix above come from?

I searched for "yuv2rgb" on the web, as well as for some of the magic numbers used, like "1.164". Although I was able to find many examples of code (mostly C) that used similar constants to "convert YUV to RGB" (whatever that meant), none explained how they were obtained.

Refusing to give up, I finally landed on the YUV Wikipedia article, which—past the numbers and formulas that didn't seem to match the magic numbers in my shader—led me to the Y'CbCr article.

There, everything finally started to make sense. I had finally found where my magic numbers were coming from: they are the result of simplifying (in the mathematical sense) the inverse conversion matrix from the ITU-R BT.601 standard, combined with the shifting and scaling needed to undo the limited ("studio") value range! Notably, the 1.164 factor mentioned earlier is a rounded version of the 255/219 scale factor that can be found in the "ITU-R BT.601 conversion" section of that article.
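
For the red channel, for example (green and blue follow the same pattern), the formulas of that section give:

\[
R' = \tfrac{255}{219}\,(Y' - 16) + \tfrac{255}{224}\cdot 2\,(1 - K_r)\,(C_R - 128) \approx 1.164\,(Y' - 16) + 1.596\,(C_R - 128)
\]

with K_r = 0.299. The 255/219 factor stretches the 8-bit studio range [16, 235] back to the full [0, 255] range, which is exactly where the rounded 1.164 in the shader comes from.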

I know, I know—I don't understand half of this either—but the point is that my numbers went from magic to scientific.
I don't need to understand the physics of phosphor light emission, nor how Kr, Kg, and Kb were chosen. I just need to know that that's how they were chosen.

Example code with documentation

So after all my research and effort, this is the final code that we use today in the OpenGL 4.1 context: 

https://gitlab.linphone.org/BC/public/mediastreamer2/-/blob/master/src/Y...

I tried to design it so that it is easy to follow with the Wikipedia article open next to it, and I used explicit naming.

Conclusion

Offloading specific processing tasks to the GPU is often a good idea to save CPU time and energy. This colour space conversion, in an unoptimised form (plain C code without SIMD assembly instructions), would consume roughly as much CPU time as the whole video encoding process running on the main CPU. This shows you how important this optimisation is in the video stream processing pipeline!

The new Vulkan API, an open standard for 3D graphics, has announced GPU-accelerated video codec APIs (H.264, H.265, AV1). This looks extremely promising for further optimisation of Linphone's video processing pipeline by offloading tasks to the GPU.

For any further information, do not hesitate to contact our team!