Hardware Independent Accelerated Video Processing in Linux

2023-12-30 00:19

Detlev Zundel


Thinking about the significant amount of disk space that my movies occupy, I decided to revisit the question of transcoding them for archival purposes. I was also interested to see how much of the processing of movie files can be done in dedicated hardware instead of hogging the main CPU. Using an AMD Ryzen 5 2400G based desktop machine, I decided to find out if there is dedicated hardware that I can use from the standard GNU/Linux applications for movies. I never expected to spend so much time on this post, but then again I now have a better understanding of many concepts in this area. Hopefully you can also profit, my dear reader.

Finding The Theory

So looking at the CPU specification, I see that it features a Radeon RX Vega 11 GPU. Unfortunately, the Wikipedia page is silent about any IP blocks for video processing. Not being able to find anything on the AMD servers either, some more web searching was in order. After following some twisty passages, I finally found a screenshot from a presentation introducing the CPU back in 2017. According to that slide, the GPU offers the following hardware accelerated features:

IP Blocks

Video Decoding

Codec	Max FPS @ 1080p 4:2:0	Max FPS @ 2160p 4:2:0
MPEG2	60
VC1	60
VP9 8bpc	240	60
VP9 10bpc	240	60
H.264	240	60
HEVC (H.265) 8bpc	240	60
HEVC (H.265) 10bpc	240	60
JPEG 8bpc	240	60

Video Encoding

Codec	Max FPS @ 1080p	Max FPS @ 1440p	Max FPS @ 2160p
H.264 8bpc	120	60	30
HEVC (H.265) 8bpc	120	60	30

H.265 With A Caveat

So for playing movies, there are quite a few formats supported, but I really do miss support for the free AV1 format. From various sources I have come to believe that AV1 should be the format of choice for the future. Personally I cannot really explain why AV1 should be better than VP9, but from a discussion with an expert in the field I remember that on technical grounds it is more advanced than VP9 and should thus be the preferred choice. If you are interested in more detail, I found a good Technical Overview of AV1. The H.264 and H.265 formats are patent encumbered and used as a money printing machine for the MPEG LA and are thus in theory not a good choice for Free Software.

Somewhat unrelated, but while pondering the available file formats, I had to acknowledge that our family TV does not even support playing AV1 movies. There is simply no software support. In theory such an update is completely feasible, but the planned obsolescence that capitalism tends to arrive at leaves people with devices that no longer receive software updates. Because of this, new, technically interesting formats have a very hard time establishing themselves in reality. Technical excellence becomes unimportant when existing devices simply do not support a format because of missing updates. Our Samsung TV stopped receiving updates a long time ago, and so without buying a new TV, using AV1 is currently not an option for me.

In the end this leaves me with H.265 as the best choice for what the encoding IPs offer. The rest of the article will thus use H.265 as the target format, but depending on your use cases, you may opt for something else.

Linux Support For Accelerated Video Processing

When IP blocks for hardware accelerated de- and encoding came along, the usual thing happened and vendors implemented proprietary software architectures for integrating them into operating systems. Nvidia uses the proprietary NVDEC ecosystem, Intel came up with Quick Sync Video and AMD implemented the Advanced Media Framework AMF.

And this does not even include the many embedded vendors struggling with mainline Linux support for such IP blocks. The i.MX family from NXP is just one example.

This of course incurs a heavy price in "software quality". The operating system can no longer abstract the hardware for upper software layers, and thus the upper software layers become hardware dependent as there is no other way of accessing the functionality. This usually means that users need to download and install the proprietary drivers for the hardware they have available and that the tools to use those IP blocks are specific to the actual hardware. So searching for how to do hardware decoding and encoding in Linux splinters into many specialized discussions, each relevant only to one specific piece of hardware.

Wouldn't it be cool if Linux could offer one API unifying all those IP blocks to user space software like ffmpeg or gstreamer? Of course other people many times more clever than myself thought along the same lines and created the Video Acceleration API (VA-API). Strictly speaking, it does not live in the Linux kernel itself: it is a user space library (libva) that loads a hardware-specific driver and talks to the kernel's graphics driver through a device file, as we will see later. User space programs using this API should be able to use the encoders and decoders of any supported hardware, just like operating systems are meant to do. Usually AMD is pretty good at adopting free software solutions (one of the reasons why I use an AMD based desktop), so I decided to try and use the hardware acceleration in this way, which should be good for many years to come.

Querying VA-API

The Debian package vainfo offers a tool to query the VA-API support for our hardware. If it is not already installed, install the package with apt install vainfo and check its output:

dzu@krikkit:~$ vainfo 
libva info: VA-API version 1.17.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/radeonsi_drv_video.so
libva info: Found init function __vaDriverInit_1_17
libva info: va_openDriver() returns 0
vainfo: VA-API version: 1.17 (libva 2.12.0)
vainfo: Driver version: Mesa Gallium driver 22.3.6 for AMD Radeon Vega 11 Graphics (raven, LLVM 15.0.6, DRM 3.49, 6.1.0-13-amd64)
vainfo: Supported profile and entrypoints
      VAProfileMPEG2Simple            :	VAEntrypointVLD
      VAProfileMPEG2Main              :	VAEntrypointVLD
      VAProfileVC1Simple              :	VAEntrypointVLD
      VAProfileVC1Main                :	VAEntrypointVLD
      VAProfileVC1Advanced            :	VAEntrypointVLD
      VAProfileH264ConstrainedBaseline:	VAEntrypointVLD
      VAProfileH264ConstrainedBaseline:	VAEntrypointEncSlice
      VAProfileH264Main               :	VAEntrypointVLD
      VAProfileH264Main               :	VAEntrypointEncSlice
      VAProfileH264High               :	VAEntrypointVLD
      VAProfileH264High               :	VAEntrypointEncSlice
      VAProfileHEVCMain               :	VAEntrypointVLD
      VAProfileHEVCMain               :	VAEntrypointEncSlice
      VAProfileHEVCMain10             :	VAEntrypointVLD
      VAProfileJPEGBaseline           :	VAEntrypointVLD
      VAProfileVP9Profile0            :	VAEntrypointVLD
      VAProfileVP9Profile2            :	VAEntrypointVLD
      VAProfileNone                   :	VAEntrypointVideoProc
dzu@krikkit:~$

Every VAEntrypointVLD entry corresponds to hardware decoding of the specified format, and a VAEntrypointEncSlice entry shows that the format is supported for encoding. Very cool! So my AMD GPU is ready to be used through the VA API. Let's see how this transfers to the usual tools of GNU/Linux distros.
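To pull just the decode or just the encode profiles out of that listing, a small filter can help. The following is only a sketch: the function name va_profiles is made up for this post, and it assumes that vainfo keeps printing one "profile : entrypoint" pair per line as shown above.

```shell
# Print all profiles that offer the given entrypoint.
# Reads vainfo output on stdin (va_profiles is a hypothetical helper name).
va_profiles() {
    awk -v ep="$1" '$1 ~ /^VAProfile/ && $NF == ep {
        sub(/:$/, "", $1)   # the colon may be glued to long profile names
        print $1
    }'
}

# Usage (needs the vainfo tool and VA-API capable hardware):
#   vainfo 2>/dev/null | va_profiles VAEntrypointEncSlice
```

With the listing above as input, asking for VAEntrypointEncSlice should leave the three H.264 profiles and VAProfileHEVCMain.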

Decoding In Hardware

mpv

The mpv media player offers an easy way to query the hardware decoders:

dzu@krikkit:~$ mpv --hwdec=help
Valid values (with alternative full names):
  nvdec (h263-nvdec)
  nvdec (h263p-nvdec)
  nvdec (h264-nvdec)
  nvdec (hevc-nvdec)
  nvdec (mjpeg-nvdec)
  nvdec (mpeg1video-nvdec)
  nvdec (mpeg2video-nvdec)
  nvdec (mpeg4-nvdec)
  nvdec (vc1-nvdec)
  nvdec (vp8-nvdec)
  nvdec (vp9-nvdec)
  nvdec (wmv3-nvdec)
  nvdec (av1-nvdec)
  vaapi (h263-vaapi)
  vaapi (h263p-vaapi)
  vaapi (h264-vaapi)
  vaapi (hevc-vaapi)
  vaapi (mjpeg-vaapi)
  vaapi (mpeg2video-vaapi)
  vaapi (mpeg4-vaapi)
  vaapi (vc1-vaapi)
  vaapi (vp8-vaapi)
  vaapi (vp9-vaapi)
  vaapi (wmv3-vaapi)
  vaapi (av1-vaapi)
  vdpau (h263-vdpau)
  vdpau (h263p-vdpau)
  vdpau (h264-vdpau)
  vdpau (hevc-vdpau)
  vdpau (mpeg1video-vdpau)
  vdpau (mpeg2video-vdpau)
  vdpau (mpeg4-vdpau)
  vdpau (vc1-vdpau)
  vdpau (vp9-vdpau)
  vdpau (wmv3-vdpau)
  vdpau (av1-vdpau)
  nvdec-copy (h263-nvdec-copy)
  nvdec-copy (h263p-nvdec-copy)
  nvdec-copy (h264-nvdec-copy)
  nvdec-copy (hevc-nvdec-copy)
  nvdec-copy (mjpeg-nvdec-copy)
  nvdec-copy (mpeg1video-nvdec-copy)
  nvdec-copy (mpeg2video-nvdec-copy)
  nvdec-copy (mpeg4-nvdec-copy)
  nvdec-copy (vc1-nvdec-copy)
  nvdec-copy (vp8-nvdec-copy)
  nvdec-copy (vp9-nvdec-copy)
  nvdec-copy (wmv3-nvdec-copy)
  nvdec-copy (av1-nvdec-copy)
  vaapi-copy (h263-vaapi-copy)
  vaapi-copy (h263p-vaapi-copy)
  vaapi-copy (h264-vaapi-copy)
  vaapi-copy (hevc-vaapi-copy)
  vaapi-copy (mjpeg-vaapi-copy)
  vaapi-copy (mpeg2video-vaapi-copy)
  vaapi-copy (mpeg4-vaapi-copy)
  vaapi-copy (vc1-vaapi-copy)
  vaapi-copy (vp8-vaapi-copy)
  vaapi-copy (vp9-vaapi-copy)
  vaapi-copy (wmv3-vaapi-copy)
  vaapi-copy (av1-vaapi-copy)
  vdpau-copy (h263-vdpau-copy)
  vdpau-copy (h263p-vdpau-copy)
  vdpau-copy (h264-vdpau-copy)
  vdpau-copy (hevc-vdpau-copy)
  vdpau-copy (mpeg1video-vdpau-copy)
  vdpau-copy (mpeg2video-vdpau-copy)
  vdpau-copy (mpeg4-vdpau-copy)
  vdpau-copy (vc1-vdpau-copy)
  vdpau-copy (vp9-vdpau-copy)
  vdpau-copy (wmv3-vdpau-copy)
  vdpau-copy (av1-vdpau-copy)
  qsv (h264_qsv-qsv)
  qsv (hevc_qsv-qsv)
  qsv (mpeg2_qsv-qsv)
  qsv (vc1_qsv-qsv)
  cuda (av1_cuvid-cuda)
  qsv (av1_qsv-qsv)
  cuda (h264_cuvid-cuda)
  cuda (hevc_cuvid-cuda)
  cuda (mjpeg_cuvid-cuda)
  qsv (mjpeg_qsv-qsv)
  cuda (mpeg1_cuvid-cuda)
  cuda (mpeg2_cuvid-cuda)
  cuda (mpeg4_cuvid-cuda)
  cuda (vc1_cuvid-cuda)
  cuda (vp8_cuvid-cuda)
  qsv (vp8_qsv-qsv)
  cuda (vp9_cuvid-cuda)
  qsv (vp9_qsv-qsv)
  v4l2m2m-copy (h263_v4l2m2m-v4l2m2m-copy)
  v4l2m2m-copy (h264_v4l2m2m-v4l2m2m-copy)
  qsv-copy (h264_qsv-qsv-copy)
  qsv-copy (hevc_qsv-qsv-copy)
  v4l2m2m-copy (hevc_v4l2m2m-v4l2m2m-copy)
  v4l2m2m-copy (mpeg4_v4l2m2m-v4l2m2m-copy)
  v4l2m2m-copy (mpeg1_v4l2m2m-v4l2m2m-copy)
  v4l2m2m-copy (mpeg2_v4l2m2m-v4l2m2m-copy)
  qsv-copy (mpeg2_qsv-qsv-copy)
  qsv-copy (vc1_qsv-qsv-copy)
  v4l2m2m-copy (vc1_v4l2m2m-v4l2m2m-copy)
  v4l2m2m-copy (vp8_v4l2m2m-v4l2m2m-copy)
  v4l2m2m-copy (vp9_v4l2m2m-v4l2m2m-copy)
  cuda-copy (av1_cuvid-cuda-copy)
  qsv-copy (av1_qsv-qsv-copy)
  cuda-copy (h264_cuvid-cuda-copy)
  cuda-copy (hevc_cuvid-cuda-copy)
  cuda-copy (mjpeg_cuvid-cuda-copy)
  qsv-copy (mjpeg_qsv-qsv-copy)
  cuda-copy (mpeg1_cuvid-cuda-copy)
  cuda-copy (mpeg2_cuvid-cuda-copy)
  cuda-copy (mpeg4_cuvid-cuda-copy)
  cuda-copy (vc1_cuvid-cuda-copy)
  cuda-copy (vp8_cuvid-cuda-copy)
  qsv-copy (vp8_qsv-qsv-copy)
  cuda-copy (vp9_cuvid-cuda-copy)
  qsv-copy (vp9_qsv-qsv-copy)
  auto (yes '')
  no
  auto-safe
  auto-copy
  auto-copy-safe
dzu@krikkit:~$

As you can see, history has provided us with a lot of (duplicate) ways of achieving hardware accelerated video decoding. Without prior knowledge, it would be hard to single out vaapi as the choice that we want to use. But because of the prior investigation, we directly aim for this target.
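The long listing becomes easier to digest when it is narrowed down to a single method. Here is a minimal sketch of such a filter; hwdec_codecs is a hypothetical helper name, and the parsing simply assumes the "method (codec-method)" line format seen above.

```shell
# List the codecs that mpv offers for one hwdec method.
# Reads the output of `mpv --hwdec=help` on stdin (hypothetical helper).
hwdec_codecs() {
    awk -v m="$1" '$1 == m && NF == 2 {
        name = $2
        gsub(/[()]/, "", name)      # "(h264-vaapi)" -> "h264-vaapi"
        sub("-" m "$", "", name)    # strip the trailing "-vaapi"
        print name
    }'
}

# Usage (needs mpv):
#   mpv --hwdec=help | hwdec_codecs vaapi
```

Note that this only shows what mpv was built to support, not what the hardware can actually do. For the latter, the vainfo output remains the reference.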

Using the free movie Big Buck Bunny (in the 1080p, 30 fps version) we can verify that we are indeed decoding the movie in hardware:

dzu@krikkit:~$ time mpv --hwdec=vaapi /tmp/bbb_sunflower_1080p_30fps_normal.mp4 
 (+) Video --vid=1 (*) (h264 1920x1080 30.000fps)
 (+) Audio --aid=1 (*) (mp3 2ch 48000Hz)
     Audio --aid=2 (*) (ac3 6ch 48000Hz)
File tags:
 Artist: Blender Foundation 2008, Janus Bager Kristensen 2013
 Comment: Creative Commons Attribution 3.0 - http://bbb3d.renderfarming.net
 Composer: Sacha Goedegebure
 Genre: Animation
 Title: Big Buck Bunny, Sunflower version
[vo/gpu/wayland] GNOME's wayland compositor lacks support for the idle inhibit protocol. This means the screen can blank during playback.
Using hardware decoding (vaapi).
AO: [pipewire] 48000Hz stereo 2ch floatp
VO: [gpu] 1920x1080 vaapi[nv12]
AV: 00:10:34 / 00:10:34 (100%) A-V:  0.000 Dropped: 20

Exiting... (End of file)

real	10m35,577s
user	0m28,690s
sys	0m27,672s
dzu@krikkit:~$

From the output of the time command we can conclude that the mpv process only required about 9% CPU load, taking (user + sys) / real as the measure. Running the same command without the option shows that mpv then decodes in software:

dzu@krikkit:~$ time mpv /tmp/bbb_sunflower_1080p_30fps_normal.mp4 
 (+) Video --vid=1 (*) (h264 1920x1080 30.000fps)
 (+) Audio --aid=1 (*) (mp3 2ch 48000Hz)
     Audio --aid=2 (*) (ac3 6ch 48000Hz)
File tags:
 Artist: Blender Foundation 2008, Janus Bager Kristensen 2013
 Comment: Creative Commons Attribution 3.0 - http://bbb3d.renderfarming.net
 Composer: Sacha Goedegebure
 Genre: Animation
 Title: Big Buck Bunny, Sunflower version
[vo/gpu/wayland] GNOME's wayland compositor lacks support for the idle inhibit protocol. This means the screen can blank during playback.
AO: [pipewire] 48000Hz stereo 2ch floatp
VO: [gpu] 1920x1080 yuv420p
AV: 00:10:34 / 00:10:34 (100%) A-V:  0.000 Dropped: 24

Exiting... (End of file)

real	10m35,460s
user	3m21,950s
sys	0m20,303s
dzu@krikkit:~$

Compared to our previous invocation, we see that we now use a lot more CPU time, resulting in a CPU load of about 35%. So for mpv we currently need to provide an additional command line parameter to use the available acceleration hardware.
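The CPU load figures are simply (user + sys) / real. As a sketch of that arithmetic, the following helpers (to_seconds and cpu_load are names invented here) parse the durations exactly as bash's time printed them above, including the decimal comma of my German locale:

```shell
# Convert a duration as printed by bash's `time` (e.g. "10m35,577s") to seconds.
to_seconds() {
    printf '%s\n' "$1" | LC_ALL=C awk '{
        gsub(",", ".")        # accept a decimal comma
        sub("s$", "")         # drop the trailing unit
        split($0, t, "m")     # t[1] = minutes, t[2] = seconds
        printf "%.3f\n", t[1] * 60 + t[2]
    }'
}

# CPU load in percent: 100 * (user + sys) / real
cpu_load() {
    real=$(to_seconds "$1"); user=$(to_seconds "$2"); sys=$(to_seconds "$3")
    LC_ALL=C awk -v r="$real" -v u="$user" -v s="$sys" \
        'BEGIN { printf "%.1f\n", 100 * (u + s) / r }'
}

cpu_load 10m35,577s 0m28,690s 0m27,672s   # run with --hwdec=vaapi, prints 8.9
cpu_load 10m35,460s 3m21,950s 0m20,303s   # run without the option, prints 35.0
```

Keep in mind that this counts CPU time of the mpv process only. It says nothing about how busy the GPU was.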

Understanding What Is Going On

Now that we have a verifiable way to use or not use the hardware decoder, let's deepen our understanding of how this works in terms of system calls. Let's record one mpv session without and one with acceleration, capturing the openat system calls with strace:

dzu@krikkit:~$ strace -e trace=openat -fo /tmp/strace-no-accel mpv /tmp/bbb_sunflower_1080p_30fps_normal.mp4 
 (+) Video --vid=1 (*) (h264 1920x1080 30.000fps)
 (+) Audio --aid=1 (*) (mp3 2ch 48000Hz)
     Audio --aid=2 (*) (ac3 6ch 48000Hz)
File tags:
 Artist: Blender Foundation 2008, Janus Bager Kristensen 2013
 Comment: Creative Commons Attribution 3.0 - http://bbb3d.renderfarming.net
 Composer: Sacha Goedegebure
 Genre: Animation
 Title: Big Buck Bunny, Sunflower version
[vo/gpu/wayland] GNOME's wayland compositor lacks support for the idle inhibit protocol. This means the screen can blank during playback.
AO: [pipewire] 48000Hz stereo 2ch floatp
VO: [gpu] 1920x1080 yuv420p
AV: 00:00:02 / 00:10:34 (0%) A-V:  0.000 Dropped: 4

Exiting... (Quit)
dzu@krikkit:~$ strace -e trace=openat -fo /tmp/strace-accel mpv --hwdec=vaapi /tmp/bbb_sunflower_1080p_30fps_normal.mp4 
 (+) Video --vid=1 (*) (h264 1920x1080 30.000fps)
 (+) Audio --aid=1 (*) (mp3 2ch 48000Hz)
     Audio --aid=2 (*) (ac3 6ch 48000Hz)
File tags:
 Artist: Blender Foundation 2008, Janus Bager Kristensen 2013
 Comment: Creative Commons Attribution 3.0 - http://bbb3d.renderfarming.net
 Composer: Sacha Goedegebure
 Genre: Animation
 Title: Big Buck Bunny, Sunflower version
[vo/gpu/wayland] GNOME's wayland compositor lacks support for the idle inhibit protocol. This means the screen can blank during playback.
Using hardware decoding (vaapi).
AO: [pipewire] 48000Hz stereo 2ch floatp
VO: [gpu] 1920x1080 vaapi[nv12]
AV: 00:00:00 / 00:10:34 (0%) A-V:  0.000 Dropped: 5

Exiting... (Quit)
dzu@krikkit:~$

These trace files contain differences that we don't care about. The first column holds the PID, which will of course differ between our two recordings. Likewise, a successful openat system call returns a file descriptor, and its specific value can change between runs (ordering, etc.). So let's remove the first column and the numeric return value after the equal sign and diff the results. We also know that we only care about filenames containing the substring /dri:

 dzu@krikkit:~$ diff -c <(cat /tmp/strace-no-accel | \
                           grep '/dri' | sed -e 's/^[0-9]\+ //' -e 's/ = [0-9]\+$//') \
                        <(cat /tmp/strace-accel | \
                           grep '/dri' | sed -e 's/^[0-9]\+ //' -e 's/ = [0-9]\+$//')
 *** /dev/fd/63	2023-12-26 01:26:04.224000705 +0100
 --- /dev/fd/62	2023-12-26 01:26:04.216000655 +0100
 ***************
 *** 24,26 ****
 --- 24,40 ----
   openat(AT_FDCWD, "/usr/share/drirc.d/00-radv-defaults.conf", O_RDONLY)
   openat(AT_FDCWD, "/etc/drirc", O_RDONLY) = -1 ENOENT (Datei oder Verzeichnis nicht gefunden)
   openat(AT_FDCWD, "/dev/dri", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY)
 + openat(AT_FDCWD, "/dev/dri/renderD128", O_RDWR)
 + openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/dri/radeonsi_drv_video.so", O_RDONLY|O_CLOEXEC)
 + openat(AT_FDCWD, "/usr/share/drirc.d", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY)
 + openat(AT_FDCWD, "/usr/share/drirc.d/00-mesa-defaults.conf", O_RDONLY)
 + openat(AT_FDCWD, "/usr/share/drirc.d/00-radv-defaults.conf", O_RDONLY)
 + openat(AT_FDCWD, "/etc/drirc", O_RDONLY) = -1 ENOENT (Datei oder Verzeichnis nicht gefunden)
 + openat(AT_FDCWD, "/usr/share/drirc.d", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY)
 + openat(AT_FDCWD, "/usr/share/drirc.d/00-mesa-defaults.conf", O_RDONLY)
 + openat(AT_FDCWD, "/usr/share/drirc.d/00-radv-defaults.conf", O_RDONLY)
 + openat(AT_FDCWD, "/etc/drirc", O_RDONLY) = -1 ENOENT (Datei oder Verzeichnis nicht gefunden)
 + openat(AT_FDCWD, "/usr/share/drirc.d", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY)
 + openat(AT_FDCWD, "/usr/share/drirc.d/00-mesa-defaults.conf", O_RDONLY)
 + openat(AT_FDCWD, "/usr/share/drirc.d/00-radv-defaults.conf", O_RDONLY)
 + openat(AT_FDCWD, "/etc/drirc", O_RDONLY) = -1 ENOENT (Datei oder Verzeichnis nicht gefunden)
 dzu@krikkit:~$

So we now understand that user space programs accessing the VA API will do this by opening a device file below /dev/dri, and the magic happens through this file descriptor. So checking the strace output will tell us if a command uses hardware acceleration or not. Cool, we can use this to analyze other programs more quickly, but until the mpv command checks and uses the available hardware, let's encode our policy in a system-wide configuration file:

dzu@krikkit:~$ cat /etc/mpv/mpv.conf 
hwdec=vaapi
dzu@krikkit:~$

Be sure to check if you have an already existing /etc/mpv/mpv.conf before blindly copying my example, but once there is such a configuration file, mpv will now always use hardware acceleration if possible.
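The strace check from above can be wrapped into a small helper for reuse. This is a sketch under assumptions: opened_render_node is a name I made up, the log is expected in the format written by strace -f -o, and a successfully opened render node is only a heuristic, since a program may also open it for plain GPU rendering.

```shell
# Succeed if the strace log in "$1" shows a successful open of a DRM render
# node such as /dev/dri/renderD128 (hypothetical helper, heuristic only).
opened_render_node() {
    grep -Eq 'openat\(.*"/dev/dri/renderD[0-9]+".*\) = [0-9]+$' "$1"
}

# Usage (needs strace and a program to trace):
#   strace -e trace=openat -fo /tmp/trace some-player movie.mp4
#   opened_render_node /tmp/trace && echo "render node opened"
```

Failed attempts such as "= -1 ENOENT" do not match, because the pattern requires a non-negative return value at the end of the line.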

Totem

With our current understanding, it is easy to check if totem uses acceleration:

dzu@krikkit:~$ strace -fe trace=openat totem /tmp/bbb_sunflower_1080p_30fps_normal.mp4 2>&1 | grep /dev/dri 
[pid 109146] openat(AT_FDCWD, "/dev/dri", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 10
[pid 109146] openat(AT_FDCWD, "/dev/dri", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 8
[pid 109146] openat(AT_FDCWD, "/dev/dri", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 27
[pid 109146] openat(AT_FDCWD, "/dev/dri/renderD128", O_RDWR|O_CLOEXEC) = 27
[pid 109146] openat(AT_FDCWD, "/dev/dri", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 28
[pid 109146] openat(AT_FDCWD, "/dev/dri", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 31
dzu@krikkit:~$

So indeed, totem is ideal in this respect and simply uses the hardware block when it finds one. Very nice!

Encoding In Hardware

Being able to decode videos with the accelerator, we turn to the problem of using the acceleration for encoding. As we decided earlier, we are especially interested in encoding to H.265, but our hardware can only do H.264 and H.265, so the choice was not hard to begin with.

As ffmpeg is my go-to solution for video tasks, let's begin by checking its support for VA API:

dzu@krikkit:~$ ffprobe -encoders 2>&1 | grep vaapi
 V....D h264_vaapi           H.264/AVC (VAAPI) (codec h264)
 V....D hevc_vaapi           H.265/HEVC (VAAPI) (codec hevc)
 V....D mjpeg_vaapi          MJPEG (VAAPI) (codec mjpeg)
 V....D mpeg2_vaapi          MPEG-2 (VAAPI) (codec mpeg2video)
 V....D vp8_vaapi            VP8 (VAAPI) (codec vp8)
 V....D vp9_vaapi            VP9 (VAAPI) (codec vp9)
dzu@krikkit:~$

So for ffmpeg we are interested in the h264_vaapi and hevc_vaapi encoders. Before trying them out, let's get an idea of the parameters that they support as it will be important later on.

ffmpeg Encoder Options

HEVC (H265) Encoder Options

dzu@krikkit:~$ ffprobe -hide_banner -h encoder=hevc_vaapi
Encoder hevc_vaapi [H.265/HEVC (VAAPI)]:
    General capabilities: dr1 delay hardware 
    Threading capabilities: none
    Supported hardware devices: vaapi 
    Supported pixel formats: vaapi
h265_vaapi AVOptions:
  -low_power         <boolean>    E..V....... Use low-power encoding mode (only available on some platforms; may not support all encoding features) (default false)
  -idr_interval      <int>        E..V....... Distance (in I-frames) between IDR frames (from 0 to INT_MAX) (default 0)
  -b_depth           <int>        E..V....... Maximum B-frame reference depth (from 1 to INT_MAX) (default 1)
  -async_depth       <int>        E..V....... Maximum processing parallelism. Increase this to improve single channel performance. This option doesn't work if driver doesn't implement vaSyncBuffer function. (from 1 to 64) (default 2)
  -max_frame_size    <int>        E..V....... Maximum frame size (in bytes) (from 0 to INT_MAX) (default 0)
  -rc_mode           <int>        E..V....... Set rate control mode (from 0 to 6) (default auto)
     auto            0            E..V....... Choose mode automatically based on other parameters
     CQP             1            E..V....... Constant-quality
     CBR             2            E..V....... Constant-bitrate
     VBR             3            E..V....... Variable-bitrate
     ICQ             4            E..V....... Intelligent constant-quality
     QVBR            5            E..V....... Quality-defined variable-bitrate
     AVBR            6            E..V....... Average variable-bitrate
  -qp                <int>        E..V....... Constant QP (for P-frames; scaled by qfactor/qoffset for I/B) (from 0 to 52) (default 0)
  -aud               <boolean>    E..V....... Include AUD (default false)
  -profile           <int>        E..V....... Set profile (general_profile_idc) (from -99 to 255) (default -99)
     main            1            E..V.......
     main10          2            E..V.......
     rext            4            E..V.......
  -tier              <int>        E..V....... Set tier (general_tier_flag) (from 0 to 1) (default main)
     main            0            E..V.......
     high            1            E..V.......
  -level             <int>        E..V....... Set level (general_level_idc) (from -99 to 255) (default -99)
     1               30           E..V.......
     2               60           E..V.......
     2.1             63           E..V.......
     3               90           E..V.......
     3.1             93           E..V.......
     4               120          E..V.......
     4.1             123          E..V.......
     5               150          E..V.......
     5.1             153          E..V.......
     5.2             156          E..V.......
     6               180          E..V.......
     6.1             183          E..V.......
     6.2             186          E..V.......
  -sei               <flags>      E..V....... Set SEI to include (default hdr)
     hdr                          E..V....... Include HDR metadata for mastering display colour volume and content light level information
  -tiles             <image_size> E..V....... Tile columns x rows

dzu@krikkit:~$

H264 Options

dzu@krikkit:~$ ffprobe -hide_banner -h encoder=h264_vaapi
Encoder h264_vaapi [H.264/AVC (VAAPI)]:
    General capabilities: dr1 delay hardware 
    Threading capabilities: none
    Supported hardware devices: vaapi 
    Supported pixel formats: vaapi
h264_vaapi AVOptions:
  -low_power         <boolean>    E..V....... Use low-power encoding mode (only available on some platforms; may not support all encoding features) (default false)
  -idr_interval      <int>        E..V....... Distance (in I-frames) between IDR frames (from 0 to INT_MAX) (default 0)
  -b_depth           <int>        E..V....... Maximum B-frame reference depth (from 1 to INT_MAX) (default 1)
  -async_depth       <int>        E..V....... Maximum processing parallelism. Increase this to improve single channel performance. This option doesn't work if driver doesn't implement vaSyncBuffer function. (from 1 to 64) (default 2)
  -max_frame_size    <int>        E..V....... Maximum frame size (in bytes) (from 0 to INT_MAX) (default 0)
  -rc_mode           <int>        E..V....... Set rate control mode (from 0 to 6) (default auto)
     auto            0            E..V....... Choose mode automatically based on other parameters
     CQP             1            E..V....... Constant-quality
     CBR             2            E..V....... Constant-bitrate
     VBR             3            E..V....... Variable-bitrate
     ICQ             4            E..V....... Intelligent constant-quality
     QVBR            5            E..V....... Quality-defined variable-bitrate
     AVBR            6            E..V....... Average variable-bitrate
  -qp                <int>        E..V....... Constant QP (for P-frames; scaled by qfactor/qoffset for I/B) (from 0 to 52) (default 0)
  -quality           <int>        E..V....... Set encode quality (trades off against speed, higher is faster) (from -1 to INT_MAX) (default -1)
  -coder             <int>        E..V....... Entropy coder type (from 0 to 1) (default cabac)
     cavlc           0            E..V.......
     cabac           1            E..V.......
     vlc             0            E..V.......
     ac              1            E..V.......
  -aud               <boolean>    E..V....... Include AUD (default false)
  -sei               <flags>      E..V....... Set SEI to include (default identifier+timing+recovery_point)
     identifier                   E..V....... Include encoder version identifier
     timing                       E..V....... Include timing parameters (buffering_period and pic_timing)
     recovery_point               E..V....... Include recovery points where appropriate
  -profile           <int>        E..V....... Set profile (profile_idc and constraint_set*_flag) (from -99 to 65535) (default -99)
     constrained_baseline 578          E..V.......
     main            77           E..V.......
     high            100          E..V.......
  -level             <int>        E..V....... Set level (level_idc) (from -99 to 255) (default -99)
     1               10           E..V.......
     1.1             11           E..V.......
     1.2             12           E..V.......
     1.3             13           E..V.......
     2               20           E..V.......
     2.1             21           E..V.......
     2.2             22           E..V.......
     3               30           E..V.......
     3.1             31           E..V.......
     3.2             32           E..V.......
     4               40           E..V.......
     4.1             41           E..V.......
     4.2             42           E..V.......
     5               50           E..V.......
     5.1             51           E..V.......
     5.2             52           E..V.......
     6               60           E..V.......
     6.1             61           E..V.......
     6.2             62           E..V.......

dzu@krikkit:~$

While many options look the same, it is worth noting that the H.264 encoder supports a -quality parameter, while the HEVC encoder does not. The latter only features a -qp parameter for specifying a quality, but it seems to be meant for the "constant quantization parameter" mode of encoding. We will get back to this.
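Spotting such differences by eye in two long listings is error-prone, so here is a sketch that reduces an option listing to the bare option names, ready for comparison. The helper name encoder_opts is invented for this post, and it assumes the help format shown above, where every option line starts with a dash followed by a type in angle brackets.

```shell
# Print the sorted option names from encoder help text read on stdin
# (hypothetical helper; value lines such as "CQP 1 ..." are skipped).
encoder_opts() {
    awk '$1 ~ /^-[a-z_]+$/ && $2 ~ /^</ { print $1 }' | sort
}

# Usage (bash, needs ffmpeg): options only the H.264 encoder knows
#   comm -13 <(ffmpeg -hide_banner -h encoder=hevc_vaapi | encoder_opts) \
#            <(ffmpeg -hide_banner -h encoder=h264_vaapi | encoder_opts)
```

Going by the two listings above, this comparison should report -coder and -quality for H.264, while the reverse direction (comm -23) should report -tier and -tiles for HEVC.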

Looking into the performance and quality of an encoder requires some knowledge about Constant Bit rate (CBR), Variable Bit rate (VBR), Constant Rate Factor (CRF) and other special terms. I found this CRF Guide by Werner Robitza to be a very good resource to quickly learn about them. Maybe you also want to at least glance over that page before reading on.

CRF Encoding (SW)

For the rest of the post, I will now use a longer movie from my own collection (H.264, 14m07s, 1920x1080p, 5615 kbit/s, 588 MiB) instead of the short Big Buck Bunny movie. Of course the content of a movie influences how well an encoder can compress it (think of a movie showing only black for long stretches, which can be compressed heavily), so I wanted to take a real world example instead of an artificial computer generated movie for this section.

Understanding the basic parameters that go into an encoding, let's establish a baseline for the acceleration by doing a CRF encode of the sample file. Transcoding for archival purposes is usually best done with CRF as it allows for variable bit rates and a specification of required quality. To ease calling the tools multiple times, I put the invocations of ffmpeg into a script file, but essentially this is the command line that is executed:

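# libx265 defaults to CRF 28; use 26 unless CRF is already set in the environment
CRF="${CRF:-26}"
# Video via libx265, audio via libvorbis; keep the first stream and all audio streams
ffmpeg -i "$1" -c:v libx265 -crf "$CRF" -c:a libvorbis -map 0:0 -map 0:a "$OUTFILE"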

So the encoder is libx265 and unless given an explicit parameter, the script will use CRF=26, which is a little less than the default. I ended up with this value by doing some example encodes and comparing them visually on two monitors side by side. Here is the transcript of ffmpeg doing the encoding:

dzu@krikkit:/tmp$ time recode-video -c h265 movie.mp4
ffmpeg version 5.1.4-0+deb12u1 Copyright (c) 2000-2023 the FFmpeg developers
  built with gcc 12 (Debian 12.2.0-14)
  configuration: --prefix=/usr --extra-version=0+deb12u1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libglslang --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librist --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libsvtav1 --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --disable-sndio --enable-libjxl --enable-pocketsphinx --enable-librsvg --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libx264 --enable-libplacebo --enable-librav1e --enable-shared
  libavutil      57. 28.100 / 57. 28.100
  libavcodec     59. 37.100 / 59. 37.100
  libavformat    59. 27.100 / 59. 27.100
  libavdevice    59.  7.100 / 59.  7.100
  libavfilter     8. 44.100 /  8. 44.100
  libswscale      6.  7.100 /  6.  7.100
  libswresample   4.  7.100 /  4.  7.100
  libpostproc    56.  6.100 / 56.  6.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'movie.mp4':
  Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2avc1mp41
    title           : Title
    encoder         : Lavf58.20.100
    media_type      : 9
    hd_video        : 2
  Duration: 00:14:07.13, start: 0.000000, bitrate: 5817 kb/s
  Chapters:
    Chapter #0:0: start 0.000000, end 847.000000
      Metadata:
        title           : Chapter 1
  Stream #0:0[0x1](eng): Video: h264 (Main) (avc1 / 0x31637661), yuv420p(progressive), 1920x1080 [SAR 1:1 DAR 16:9], 5615 kb/s, 25 fps, 25 tbr, 12800 tbn (default)
    Metadata:
      handler_name    : VideoHandler
      vendor_id       : [0][0][0][0]
  Stream #0:1[0x2](eng): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 196 kb/s (default)
    Metadata:
      handler_name    : SoundHandler
      vendor_id       : [0][0][0][0]
  Stream #0:2[0x3](eng): Data: bin_data (text / 0x74786574)
    Metadata:
      handler_name    : SubtitleHandler
Stream mapping:
  Stream #0:0 -> #0:0 (h264 (native) -> hevc (libx265))
  Stream #0:1 -> #0:1 (aac (native) -> vorbis (libvorbis))
Press [q] to stop, [?] for help
x265 [info]: HEVC encoder version 3.5+1-f0c1022b6
x265 [info]: build info [Linux][GCC 12.2.0][64 bit] 8bit+10bit+12bit
x265 [info]: using cpu capabilities: MMX2 SSE2Fast LZCNT SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2
x265 [info]: Main profile, Level-4 (Main tier)
x265 [info]: Thread pool created using 8 threads
x265 [info]: Slices                              : 1
x265 [info]: frame threads / pool features       : 3 / wpp(17 rows)
x265 [info]: Coding QT: max CU size, min CU size : 64 / 8
x265 [info]: Residual QT: max TU size, max depth : 32 / 1 inter / 1 intra
x265 [info]: ME / range / subpel / merge         : hex / 57 / 2 / 3
x265 [info]: Keyframe min / max / scenecut / bias  : 25 / 250 / 40 / 5.00 
x265 [info]: Lookahead / bframes / badapt        : 20 / 4 / 2
x265 [info]: b-pyramid / weightp / weightb       : 1 / 1 / 0
x265 [info]: References / ref-limit  cu / depth  : 3 / off / on
x265 [info]: AQ: mode / str / qg-size / cu-tree  : 2 / 1.0 / 32 / 1
x265 [info]: Rate Control / qCompress            : CRF-26.0 / 0.60
x265 [info]: tools: rd=3 psy-rd=2.00 early-skip rskip mode=1 signhide tmvp
x265 [info]: tools: b-intra strong-intra-smoothing lslices=6 deblock sao
Output #0, matroska, to 'recode5B0u.mkv':
  Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2avc1mp41
    title           : Title
    hd_video        : 2
    media_type      : 9
    encoder         : Lavf59.27.100
  Chapters:
    Chapter #0:0: start 0.000000, end 847.000000
      Metadata:
        title           : Chapter 1
  Stream #0:0(eng): Video: hevc, yuv420p(progressive), 1920x1080 [SAR 1:1 DAR 16:9], q=2-31, 25 fps, 1k tbn (default)
    Metadata:
      handler_name    : VideoHandler
      vendor_id       : [0][0][0][0]
      encoder         : Lavc59.37.100 libx265
    Side data:
      cpb: bitrate max/min/avg: 0/0/0 buffer size: 0 vbv_delay: N/A
  Stream #0:1(eng): Audio: vorbis (oV[0][0] / 0x566F), 44100 Hz, stereo, fltp (default)
    Metadata:
      handler_name    : SoundHandler
      vendor_id       : [0][0][0][0]
      encoder         : Lavc59.37.100 libvorbis
frame=21178 fps= 18 q=34.0 Lsize=  101978kB time=00:14:07.08 bitrate= 986.2kbits/s speed=0.736x    
video:93449kB audio:8071kB subtitle:0kB other streams:0kB global headers:6kB muxing overhead: 0.450788%
x265 [info]: frame I:     92, Avg QP:24.38  kb/s: 6733.38 
x265 [info]: frame P:   5329, Avg QP:26.51  kb/s: 2188.34 
x265 [info]: frame B:  15757, Avg QP:32.23  kb/s: 434.12  
x265 [info]: Weighted P-Frames: Y:1.1% UV:0.8%
x265 [info]: consecutive B-frames: 1.4% 1.9% 1.8% 94.3% 0.5% 

encoded 21178 frames in 1151.51s (18.39 fps), 902.89 kb/s, Avg QP:30.76

real	19m11,814s
user	122m29,102s
sys	1m19,955s
dzu@krikkit:/tmp$ ls -lh movie.mp4*
-rw-r--r-- 1 dzu dzu 100M 22. Dez 20:07 movie.mp4
-rw-r--r-- 1 dzu dzu 588M 22. Dez 19:48 movie.mp4.bak
dzu@krikkit:/tmp$

We expected this to max out the CPUs, and indeed, calculating the CPU load by dividing the CPU time (user + sys) by the wall clock time (real), we see that this software-only encoding resulted in a CPU load of 645%. That tells me that the libx265 encoder is indeed a very efficient implementation, as it keeps the 8 available threads of my machine busy roughly 80% of the time. This may look easy, but it is actually very difficult to achieve, so kudos to the developers of libx265, even though this post is not at all about that encoder.
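The load figure is simply CPU time divided by wall clock time. A minimal sketch of that arithmetic follows; cpu_load is a hypothetical helper that is not part of my script, and it expects plain seconds with a decimal point (the time output above uses a decimal comma because of my locale).

```shell
# cpu_load REAL USER SYS: average CPU load in percent, all times in seconds.
# Hypothetical helper for illustration only.
cpu_load() {
    awk -v real="$1" -v user="$2" -v sys="$3" \
        'BEGIN { printf "%.0f%%\n", 100 * (user + sys) / real }'
}

# Times of the libx265 run: 19m11.814s real, 122m29.102s user, 1m19.955s sys
cpu_load 1151.814 7349.102 79.955    # prints 645%
```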

As you can see, my script replaces the original movie file and keeps the old file as a backup copy. Comparing the sizes gives us a quick glimpse of the savings and shows that H.265 is able to squeeze the movie down to 17% of its original size. Checking the original and the recoded movie side by side gives me the confidence that the settings are sane. This result is very cool and indeed what I was looking for. Using this on movie collections should save a lot of space.
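The 17% figure is just the ratio of the two file sizes. As a small sketch, assuming the recoded file and its backup sit next to each other as in the listing above (size_percent is a hypothetical helper name):

```shell
# size_percent NEW OLD: size of NEW as a percentage of OLD, based on byte counts.
# Hypothetical helper for illustration only.
size_percent() {
    new=$(wc -c < "$1")
    old=$(wc -c < "$2")
    awk -v n="$new" -v o="$old" 'BEGIN { printf "%.0f%%\n", 100 * n / o }'
}
```

Calling it as size_percent movie.mp4 movie.mp4.bak compares the exact byte counts instead of the rounded values printed by ls -lh.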

We will see that although the hardware encoder is much quicker and leaves the CPU cores alone, it will not be able to achieve this result, but let's take things one bit at a time and call the hardware encoder with settings suggested on the internet.

VBR Accelerated Encoding

Again I put the magic command line into my script, but this is the part for calling ffmpeg to use the hardware encoder:

ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 \
           -hwaccel_output_format vaapi -i "$1" \
           -c:v hevc_vaapi \
           -b:v 5M \
           "$OUTFILE"

Unfortunately, ffmpeg does not output which mode the encoder will be in if we do not specify anything, and the option help shows "auto" which is just as unhelpful. The VA API Documentation for ffmpeg mentions that VBR is the default mode and some experiments with also specifying -maxrate show that this is really the case. So not specifying any mode gives us a VBR encoding with a target average bit rate. Sometimes this is also referred to as ABR.

Just as expected, this runs with a CPU load of 28% (using a single thread 28% of the time) but encodes the movie at 146 frames per second, i.e. it is multiple times faster than real time (25-30 fps for most movies). So instead of waiting 20 minutes for the result, the transcoding is complete in 2 minutes and 25 seconds!

Looking at the result table below, we see that the saving in storage is minimal, but that was kind of expected if the original file has a bit rate of 5615 kbit/s, and we instruct the hardware encoder to encode it to 5000 kbit/s. Indeed, the encoder nicely hit the target as the result has a bit rate of 5080 kbit/s.
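The relation between bit rate and file size is easy to check by hand: size is the average overall bit rate multiplied by the duration. Here is a rough sketch, assuming the bit rate covers all streams and using the 847 seconds of the test movie; estimate_mib is a hypothetical helper and the result ignores container overhead.

```shell
# estimate_mib KBITS SECONDS: expected file size in MiB for an average
# overall bit rate (kbit/s) sustained over a duration (s).
# Hypothetical helper for illustration only.
estimate_mib() {
    awk -v kbps="$1" -v secs="$2" \
        'BEGIN { printf "%.0f\n", kbps * 1000 / 8 * secs / (1024 * 1024) }'
}

# libx265 run: 986 kbit/s over 847 s
estimate_mib 986 847    # prints 100
```

For the 5080 kbit/s result the same estimate gives about 513 MiB, so asking for roughly the original bit rate can only ever yield roughly the original size.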

Constant Quality Accelerated Encoding

Now we can use our hardware, but the result is nowhere near what we know is possible from our software baseline, so we need to find alternatives. While the documentation on the web thus far has been scattered and difficult to interpret, it was nearly impossible to find information on these more advanced topics. The most important thing here is that the VA API encoders do not have a CRF parameter at all. Looking at the help of ffmpeg, it seems that Constant Quality (CQP) may fit the bill. Using it is as easy as specifying -rc_mode 1. Here is the full invocation:

ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 \
           -hwaccel_output_format vaapi -i "$1" \
           -c:v hevc_vaapi \
           -rc_mode 1 \
           "$OUTFILE"

ffmpeg informed me that it uses a default quality of 25, because no other parameter was given. We keep this in mind for the next attempts. But already we see that the result is much better. We end up with a bit rate of 2170 kbit/s while the speed of the process stays exactly the same.

In order to specify our quality target, I tried specifying -quality, but that option is not supported by hevc_vaapi. Glancing at the H.264 encoder I see that it has such an option, but the HEVC encoder does not. There I only see -qp <num>, and so I tried that. This achieves the desired effect and playing with a few numbers yields the results given in the table below.

At -qp 28 I can already see clear visual differences (smoothing) compared to the libx265 version. I cut off the search at this point, because matching the quality of the software encoder will always yield larger files.
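Playing with a few numbers is easiest with a small loop. The following is only a dry-run sketch: it prints one command per QP value instead of running it, and both the function name qp_sweep and the output file naming are made up for this example.

```shell
# Dry run: print one ffmpeg command per QP value instead of executing it.
# qp_sweep and the qpNN.mkv output names are placeholders for illustration.
qp_sweep() {
    infile=$1
    hw='ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -hwaccel_output_format vaapi'
    for qp in 25 26 27 28 29; do
        printf '%s -i %s -c:v hevc_vaapi -rc_mode 1 -qp %s %s\n' \
            "$hw" "$infile" "$qp" "qp$qp.mkv"
    done
}

qp_sweep movie.mp4
```

Piping the output into sh would actually run the encodes, but only for input names without spaces or other shell special characters.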

Other Encoding Modes

Looking at the options for -rc_mode, it is a valid question whether there are other modes doing an even better job, but trying ICQ made ffmpeg error out with this message:

[hevc_vaapi @ 0x55d2ebade0c0] Driver does not support ICQ RC mode (supported modes: CQP, CBR, VBR).

Ok, so CQP, VBR and CBR are the only modes that our accelerator supports.

CBR Accelerated Encoding

For completeness, I tested the CBR encoding (-rc_mode 2), but I do not expect to use that for archival purposes. CBR is really meant to encode a video for a known communication channel with a limited bandwidth. Specifying the bit rate lower than this limit should ensure that we can always stream the movie without any buffer underflow. But obviously, this is not what I am looking for here. Funnily enough, the results seem to be comparable to the VBR encoding, but maybe the difference would be clear when looking at the continuous actual bit rate instead of just looking at the average bit rate over the whole file.

Result Table

Encoder	Options	Bit rate (kbit/s)	Size (MiB)	Fps	Real (s)	user+sys (s)	Load
<None>		5615	588	25
libx265	-crf 26	986	100	18	1152	7429	645%
hevc_vaapi	-rc_mode 3 -b:v 5M	5080	514	146	145	40	28%
hevc_vaapi	-rc_mode 3 -b:v 2M	2082	211	142	148	44	30%
hevc_vaapi	-rc_mode 2 -b:v 5M	5084	514	145	146	41	28%
hevc_vaapi	-rc_mode 2 -b:v 2M	2084	211	142	150	50	33%
hevc_vaapi	-rc_mode 1 -qp 25	2170	220	146	145	36	25%
hevc_vaapi	-rc_mode 1 -qp 26	1802	183	146	145	36	25%
hevc_vaapi	-rc_mode 1 -qp 27	1540	156	146	145	36	25%
hevc_vaapi	-rc_mode 1 -qp 28	1407	143	145	145	36	25%
hevc_vaapi	-rc_mode 1 -qp 29	1161	118	131
hevc_vaapi	-rc_mode 3 -qp 28	Error(1)

Where Error(1) is "[hevc_vaapi @ 0x563fc6e9d180] Bitrate must be set for VBR RC mode."
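The numeric -rc_mode values in the table are easy to mix up, as Error(1) shows. A small sketch of a helper that maps a readable mode name to the matching options could look like this; rc_args is a hypothetical name, and the mapping only covers the three modes my driver reported as supported.

```shell
# rc_args MODE VALUE: print the hevc_vaapi rate control options used in the table.
# MODE is one of cqp, cbr, vbr; VALUE is a QP for cqp and a bit rate otherwise.
# Hypothetical helper for illustration only.
rc_args() {
    if [ -z "${2:-}" ]; then
        echo "rc_args: missing value for mode '${1:-}'" >&2
        return 1
    fi
    case $1 in
        cqp) echo "-rc_mode 1 -qp $2" ;;
        cbr) echo "-rc_mode 2 -b:v $2" ;;
        vbr) echo "-rc_mode 3 -b:v $2" ;;
        *)   echo "rc_args: unsupported mode '$1'" >&2; return 1 ;;
    esac
}

rc_args cqp 26    # prints -rc_mode 1 -qp 26
rc_args vbr 2M    # prints -rc_mode 3 -b:v 2M
```

Because each mode is tied to its own parameter, a combination like VBR with -qp cannot be produced by accident.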

Summary

The trigger for this post was the innocent question of how exactly do I use the hardware video processing IP present in my hardware.

After learning about the relevant pieces of the GNU/Linux software stack, I was able to really use the hardware IP and evaluate its performance and quality. My usage of the hardware independent VA API transfers nicely to other hardware encoders supported in VA API, but for now I have to be content with the fact that the IP block does not fulfill my requirement of minimal bit rate for a given quality, i.e. for archival purposes. Even though I can encode movies roughly 8 times faster with VA API than with libx265 (145 s instead of 1152 s, and with minimal CPU load), the resulting bit rate would not be as good and so the result files would take up more space on the archive disks.

As it stands, the hardware encoder cannot match the quality and savings of the libx265 software option. Although it would save a lot of time (and energy) compared to the software encoding, quality matters more for archival purposes, so I will not use it after all. It is worth checking out future versions of GPUs, but for my current system the best choice is software encoding.

We also saw that because of proprietary solutions it is basically still impossible to answer a simple question like "can I use the hardware encoder" for all hardware supported by the Linux kernel. But thanks to the never ending motivation and lobbying of Free Software designers, there is now an API that comes close to that target, i.e. VA API. And with that I can use my (integrated) AMD GPU with hardware independent user space encoders. This is exactly how operating systems should work.

As a superfluous advertisement, I would like to mention that consumers can actually influence this positive development by considering "Software Support" ahead of time. The Free Software Foundation hosts the very good h-node project, which tries to document the compatibility of Free Software with physical hardware. Even though there is a section on graphics cards, it would not have helped us to decide which brand is better if we want to do hardware encoding. With "support" in this context I do not only mean "it works for use case X", but also "it is supported with minimal vendor dependencies". Following the development of the Linux kernel (as best as I can) allows me to at least have an opinion here.

From an engineering perspective it is just mind boggling how much effort is going into all of those encoder and decoder libraries in software. Even with a common kernel API, they use different upper level software (e.g. hevc_vaapi from ffmpeg, vaapih265enc from gstreamer, …) and one still needs to pick a software module that corresponds to a single piece of hardware. So simple scripts using any of those tools (mine included) are actually non-portable and that's a real shame. It should be possible to write portable scripts that simply specify a target format (e.g. HEVC) and rate control parameters and then let the OS decide what to use. Maybe we will get there in another 10 years. I have my fingers crossed.
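Until then, a script can at least try to guess. The following is a rough sketch of such a guess, not a robust solution: pick_hevc_encoder is a hypothetical helper, the render node path is the one from my machine, and finding the node plus the encoder name in ffmpeg is a necessary but not a sufficient condition, since the driver may still refuse HEVC encoding.

```shell
# pick_hevc_encoder [DEVICE]: print hevc_vaapi if a DRM render node exists and
# the local ffmpeg build lists the encoder, otherwise fall back to libx265.
# Hypothetical helper; it cannot tell whether the driver really encodes HEVC.
pick_hevc_encoder() {
    dev=${1:-/dev/dri/renderD128}
    if [ -c "$dev" ] &&
       ffmpeg -hide_banner -encoders 2>/dev/null | grep -q hevc_vaapi; then
        echo hevc_vaapi
    else
        echo libx265
    fi
}

pick_hevc_encoder
```

A caller still has to translate its rate control settings for the chosen encoder, e.g. -crf for libx265 versus -qp for hevc_vaapi, which is exactly the non-portable part lamented above.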

Questions or Suggestions?

I am really interested in your take on this complex topic, so feel free to drop me a mail at dzu@member.fsf.org, or use Disqus to comment on the post. And yeah, I am ashamed that I did not yet implement a Mastodon based comment system, but at least I have seen working code by now.