Hardware Independent Accelerated Video Processing in Linux
Thinking about the significant amount of disk space that my movies use, I decided to revisit the question of transcoding them for archival purposes. I was also interested to see how much of the processing of movie files can be done in dedicated hardware instead of hogging the main CPU. Using an AMD Ryzen 5 2400G based desktop machine, I decided to find out if there is dedicated hardware that I can use from the standard GNU/Linux applications for movies. I never expected to spend so much time on this post, but then again I now have a better understanding of many concepts in this area. Hopefully you can also profit, my dear reader.
Finding The Theory
Looking at the CPU specification, I see that it features a Radeon RX Vega 11 GPU. Unfortunately, the Wikipedia page is silent about any IP blocks for video processing. Not being able to find anything on the AMD servers either, some more web searching was in order. After following some twisty passages, I finally found a screenshot from a presentation introducing the CPU back in 2017. According to that slide, the GPU offers the following hardware accelerated features:
IP Blocks
Video Decoding
Codec | Max FPS @ 1080p 4:2:0 | Max FPS @ 2160p 4:2:0 |
---|---|---|
MPEG2 | 60 | |
VC1 | 60 | |
VP9 8bpc | 240 | 60 |
VP9 10bpc | 240 | 60 |
H.264 | 240 | 60 |
HEVC (H.265) 8bpc | 240 | 60 |
HEVC (H.265) 10bpc | 240 | 60 |
JPEG 8bpc | 240 | 60 |
Video Encoding
Codec | Max FPS @ 1080p | Max FPS @ 1440p | Max FPS @ 2160p |
---|---|---|---|
H.264 8bpc | 120 | 60 | 30 |
HEVC (H.265) 8bpc | 120 | 60 | 30 |
H.265 With A Caveat
So for playing movies, quite a few formats are supported, but I really do miss support for the free AV1 format. From various sources I have come to believe that AV1 should be the format of choice for the future. Personally I cannot really explain why AV1 should be better than VP9, but from a discussion with an expert in the field I took away that it is technically more advanced than VP9 and should thus be the preferred choice. If you are interested in more detail, I found a good Technical Overview of AV1. The H.264 and H.265 formats are patent encumbered and used as a money printing machine for the MPEG LA and are thus in theory not a good choice for Free Software.
Somewhat unrelated, but while pondering the available file formats, I had to acknowledge that our family TV does not even support playing AV1 movies. There is simply no software support. In theory such an update is completely feasible, but the planned obsolescence that capitalism tends to arrive at leaves people with devices that no longer receive software updates. Because of this, new and technically interesting formats have a very hard time establishing themselves in practice. Technical excellence becomes unimportant when existing devices simply do not support a format because of missing updates. Our Samsung TV stopped receiving updates a long time ago, so without buying a new TV, using AV1 is currently not an option for me.
In the end this leaves me with H.265 as the best choice for what the encoding IPs offer. The rest of the article will thus use H.265 as the target format, but depending on your use cases, you may opt for something else.
Linux Support For Accelerated Video Processing
When IP blocks for hardware accelerated de- and encoding came along, the usual thing happened and vendors implemented proprietary software architectures for integrating them into operating systems. Nvidia uses the proprietary NVDEC ecosystem, Intel came up with Quick Sync Video and AMD implemented the Advanced Media Framework AMF.
And this does not even include the many embedded vendors struggling with mainline Linux support for such IP blocks. The i.MX family from NXP is just one example.
This of course incurs a heavy price in "software quality". The operating system can no longer abstract the hardware for upper software layers, so those layers become hardware dependent as there is no other way of accessing the functionality. This usually means that users need to download and install the proprietary drivers for the hardware they have, and that the tools to use those IP blocks are specific to that hardware. So searching for how to do hardware decoding and encoding in Linux splinters into many specialized discussions, each relevant only to one specific piece of hardware.
Wouldn't it be cool if Linux offered an API unifying all those IP blocks and providing a common interface to user space software like ffmpeg or gstreamer? Of course other people many times more clever than myself thought along the same lines and introduced the Video Acceleration API (VA-API) for Linux. Strictly speaking it is not part of the kernel itself: it is implemented by the user space library libva, which loads a hardware specific driver that in turn talks to the kernel's graphics (DRM) driver. User space programs using this API should be able to use the encoders and decoders of any supported hardware, just like operating systems are meant to do. Usually AMD is pretty good at adopting free software solutions (one of the reasons why I use an AMD based desktop), so I decided to try and use the hardware acceleration in this way, which should be good for many years to come.
Querying VA-API
The Debian package vainfo offers a tool to query the VA API support for our hardware. If not already installed, just install the package with apt install vainfo and check its results:
dzu@krikkit:~$ vainfo
libva info: VA-API version 1.17.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/radeonsi_drv_video.so
libva info: Found init function __vaDriverInit_1_17
libva info: va_openDriver() returns 0
vainfo: VA-API version: 1.17 (libva 2.12.0)
vainfo: Driver version: Mesa Gallium driver 22.3.6 for AMD Radeon Vega 11 Graphics (raven, LLVM 15.0.6, DRM 3.49, 6.1.0-13-amd64)
vainfo: Supported profile and entrypoints
VAProfileMPEG2Simple : VAEntrypointVLD
VAProfileMPEG2Main : VAEntrypointVLD
VAProfileVC1Simple : VAEntrypointVLD
VAProfileVC1Main : VAEntrypointVLD
VAProfileVC1Advanced : VAEntrypointVLD
VAProfileH264ConstrainedBaseline: VAEntrypointVLD
VAProfileH264ConstrainedBaseline: VAEntrypointEncSlice
VAProfileH264Main : VAEntrypointVLD
VAProfileH264Main : VAEntrypointEncSlice
VAProfileH264High : VAEntrypointVLD
VAProfileH264High : VAEntrypointEncSlice
VAProfileHEVCMain : VAEntrypointVLD
VAProfileHEVCMain : VAEntrypointEncSlice
VAProfileHEVCMain10 : VAEntrypointVLD
VAProfileJPEGBaseline : VAEntrypointVLD
VAProfileVP9Profile0 : VAEntrypointVLD
VAProfileVP9Profile2 : VAEntrypointVLD
VAProfileNone : VAEntrypointVideoProc
dzu@krikkit:~$
Every VAEntrypointVLD entry corresponds to hardware decoding of the specified format, and a VAEntrypointEncSlice entry shows that the format is supported for encoding. Very cool! So my AMD GPU is ready to be used through the VA API. Let's see how this transfers to the usual tools of GNU/Linux distros.
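To pick just the decoding or the encoding capabilities out of that listing, a small filter can help. The following is a minimal sketch, not something from my original setup: the function name va_profiles is made up for illustration, and it assumes the vainfo output format shown above.

```shell
#!/bin/sh
# va_profiles ENTRYPOINT: read vainfo-style output on stdin and print the
# profiles that list the given entrypoint (hypothetical helper).
va_profiles() {
    awk -F':' -v ep="$1" '
        # Profile lines look like "VAProfileHEVCMain : VAEntrypointEncSlice"
        $1 ~ /VAProfile/ && $2 ~ ep {
            gsub(/[ \t]/, "", $1)   # strip the padding around the profile name
            print $1
        }'
}

# Only query the hardware when vainfo is actually installed.
if command -v vainfo >/dev/null 2>&1; then
    echo "Profiles with hardware encoding:"
    vainfo 2>/dev/null | va_profiles VAEntrypointEncSlice
fi
```

Calling the filter with VAEntrypointVLD instead lists the profiles that can be decoded in hardware.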
Decoding In Hardware
mpv
The mpv media player offers an easy way to query the hardware decoders:
dzu@krikkit:~$ mpv --hwdec=help
Valid values (with alternative full names):
nvdec (h263-nvdec)
nvdec (h263p-nvdec)
nvdec (h264-nvdec)
nvdec (hevc-nvdec)
nvdec (mjpeg-nvdec)
nvdec (mpeg1video-nvdec)
nvdec (mpeg2video-nvdec)
nvdec (mpeg4-nvdec)
nvdec (vc1-nvdec)
nvdec (vp8-nvdec)
nvdec (vp9-nvdec)
nvdec (wmv3-nvdec)
nvdec (av1-nvdec)
vaapi (h263-vaapi)
vaapi (h263p-vaapi)
vaapi (h264-vaapi)
vaapi (hevc-vaapi)
vaapi (mjpeg-vaapi)
vaapi (mpeg2video-vaapi)
vaapi (mpeg4-vaapi)
vaapi (vc1-vaapi)
vaapi (vp8-vaapi)
vaapi (vp9-vaapi)
vaapi (wmv3-vaapi)
vaapi (av1-vaapi)
vdpau (h263-vdpau)
vdpau (h263p-vdpau)
vdpau (h264-vdpau)
vdpau (hevc-vdpau)
vdpau (mpeg1video-vdpau)
vdpau (mpeg2video-vdpau)
vdpau (mpeg4-vdpau)
vdpau (vc1-vdpau)
vdpau (vp9-vdpau)
vdpau (wmv3-vdpau)
vdpau (av1-vdpau)
nvdec-copy (h263-nvdec-copy)
nvdec-copy (h263p-nvdec-copy)
nvdec-copy (h264-nvdec-copy)
nvdec-copy (hevc-nvdec-copy)
nvdec-copy (mjpeg-nvdec-copy)
nvdec-copy (mpeg1video-nvdec-copy)
nvdec-copy (mpeg2video-nvdec-copy)
nvdec-copy (mpeg4-nvdec-copy)
nvdec-copy (vc1-nvdec-copy)
nvdec-copy (vp8-nvdec-copy)
nvdec-copy (vp9-nvdec-copy)
nvdec-copy (wmv3-nvdec-copy)
nvdec-copy (av1-nvdec-copy)
vaapi-copy (h263-vaapi-copy)
vaapi-copy (h263p-vaapi-copy)
vaapi-copy (h264-vaapi-copy)
vaapi-copy (hevc-vaapi-copy)
vaapi-copy (mjpeg-vaapi-copy)
vaapi-copy (mpeg2video-vaapi-copy)
vaapi-copy (mpeg4-vaapi-copy)
vaapi-copy (vc1-vaapi-copy)
vaapi-copy (vp8-vaapi-copy)
vaapi-copy (vp9-vaapi-copy)
vaapi-copy (wmv3-vaapi-copy)
vaapi-copy (av1-vaapi-copy)
vdpau-copy (h263-vdpau-copy)
vdpau-copy (h263p-vdpau-copy)
vdpau-copy (h264-vdpau-copy)
vdpau-copy (hevc-vdpau-copy)
vdpau-copy (mpeg1video-vdpau-copy)
vdpau-copy (mpeg2video-vdpau-copy)
vdpau-copy (mpeg4-vdpau-copy)
vdpau-copy (vc1-vdpau-copy)
vdpau-copy (vp9-vdpau-copy)
vdpau-copy (wmv3-vdpau-copy)
vdpau-copy (av1-vdpau-copy)
qsv (h264_qsv-qsv)
qsv (hevc_qsv-qsv)
qsv (mpeg2_qsv-qsv)
qsv (vc1_qsv-qsv)
cuda (av1_cuvid-cuda)
qsv (av1_qsv-qsv)
cuda (h264_cuvid-cuda)
cuda (hevc_cuvid-cuda)
cuda (mjpeg_cuvid-cuda)
qsv (mjpeg_qsv-qsv)
cuda (mpeg1_cuvid-cuda)
cuda (mpeg2_cuvid-cuda)
cuda (mpeg4_cuvid-cuda)
cuda (vc1_cuvid-cuda)
cuda (vp8_cuvid-cuda)
qsv (vp8_qsv-qsv)
cuda (vp9_cuvid-cuda)
qsv (vp9_qsv-qsv)
v4l2m2m-copy (h263_v4l2m2m-v4l2m2m-copy)
v4l2m2m-copy (h264_v4l2m2m-v4l2m2m-copy)
qsv-copy (h264_qsv-qsv-copy)
qsv-copy (hevc_qsv-qsv-copy)
v4l2m2m-copy (hevc_v4l2m2m-v4l2m2m-copy)
v4l2m2m-copy (mpeg4_v4l2m2m-v4l2m2m-copy)
v4l2m2m-copy (mpeg1_v4l2m2m-v4l2m2m-copy)
v4l2m2m-copy (mpeg2_v4l2m2m-v4l2m2m-copy)
qsv-copy (mpeg2_qsv-qsv-copy)
qsv-copy (vc1_qsv-qsv-copy)
v4l2m2m-copy (vc1_v4l2m2m-v4l2m2m-copy)
v4l2m2m-copy (vp8_v4l2m2m-v4l2m2m-copy)
v4l2m2m-copy (vp9_v4l2m2m-v4l2m2m-copy)
cuda-copy (av1_cuvid-cuda-copy)
qsv-copy (av1_qsv-qsv-copy)
cuda-copy (h264_cuvid-cuda-copy)
cuda-copy (hevc_cuvid-cuda-copy)
cuda-copy (mjpeg_cuvid-cuda-copy)
qsv-copy (mjpeg_qsv-qsv-copy)
cuda-copy (mpeg1_cuvid-cuda-copy)
cuda-copy (mpeg2_cuvid-cuda-copy)
cuda-copy (mpeg4_cuvid-cuda-copy)
cuda-copy (vc1_cuvid-cuda-copy)
cuda-copy (vp8_cuvid-cuda-copy)
qsv-copy (vp8_qsv-qsv-copy)
cuda-copy (vp9_cuvid-cuda-copy)
qsv-copy (vp9_qsv-qsv-copy)
auto (yes '')
no
auto-safe
auto-copy
auto-copy-safe
dzu@krikkit:~$
As you can see, history has provided us with a lot of (duplicate) ways of achieving hardware accelerated video decoding. Without prior knowledge, it would be hard to single out vaapi as the choice that we want to use. But because of the prior investigation, we directly aim for this target.
Using the free movie Big Buck Bunny (in the 1080p, 30 fps version) we can verify that we are indeed decoding the movie in hardware:
dzu@krikkit:~$ time mpv --hwdec=vaapi /tmp/bbb_sunflower_1080p_30fps_normal.mp4
(+) Video --vid=1 (*) (h264 1920x1080 30.000fps)
(+) Audio --aid=1 (*) (mp3 2ch 48000Hz)
Audio --aid=2 (*) (ac3 6ch 48000Hz)
File tags:
Artist: Blender Foundation 2008, Janus Bager Kristensen 2013
Comment: Creative Commons Attribution 3.0 - http://bbb3d.renderfarming.net
Composer: Sacha Goedegebure
Genre: Animation
Title: Big Buck Bunny, Sunflower version
[vo/gpu/wayland] GNOME's wayland compositor lacks support for the idle inhibit protocol. This means the screen can blank during playback.
Using hardware decoding (vaapi).
AO: [pipewire] 48000Hz stereo 2ch floatp
VO: [gpu] 1920x1080 vaapi[nv12]
AV: 00:10:34 / 00:10:34 (100%) A-V: 0.000 Dropped: 20
Exiting... (End of file)
real 10m35,577s
user 0m28,690s
sys 0m27,672s
dzu@krikkit:~$
From the output of the time command we can conclude that the mpv process only required 9% CPU load, i.e. (user + sys) divided by the wall clock time (real). Running the same command without any options shows that it then uses software decoding:
dzu@krikkit:~$ time mpv /tmp/bbb_sunflower_1080p_30fps_normal.mp4
(+) Video --vid=1 (*) (h264 1920x1080 30.000fps)
(+) Audio --aid=1 (*) (mp3 2ch 48000Hz)
Audio --aid=2 (*) (ac3 6ch 48000Hz)
File tags:
Artist: Blender Foundation 2008, Janus Bager Kristensen 2013
Comment: Creative Commons Attribution 3.0 - http://bbb3d.renderfarming.net
Composer: Sacha Goedegebure
Genre: Animation
Title: Big Buck Bunny, Sunflower version
[vo/gpu/wayland] GNOME's wayland compositor lacks support for the idle inhibit protocol. This means the screen can blank during playback.
AO: [pipewire] 48000Hz stereo 2ch floatp
VO: [gpu] 1920x1080 yuv420p
AV: 00:10:34 / 00:10:34 (100%) A-V: 0.000 Dropped: 24
Exiting... (End of file)
real 10m35,460s
user 3m21,950s
sys 0m20,303s
dzu@krikkit:~$
Compared to our previous invocation, we see that we now use a lot more CPU power, resulting in a CPU load of roughly 35%. So obviously for mpv we currently need to provide additional command line parameters to use the present acceleration hardware.
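The CPU load figures are simply (user + sys) divided by real. As a minimal sketch of that arithmetic, the following helpers take the three values exactly as the shell's time command prints them above; the function names to_seconds and cpu_load are made up for illustration.

```shell
#!/bin/sh
# to_seconds: convert a time value such as "10m35,577s" into seconds.
to_seconds() {
    printf '%s\n' "$1" | LC_ALL=C awk -F'm' '{
        sub(/s$/, "", $2)     # drop the trailing "s"
        sub(/,/, ".", $2)     # accept a decimal comma as printed in some locales
        printf "%.3f\n", $1 * 60 + $2
    }'
}

# cpu_load REAL USER SYS: print (user + sys) / real as a percentage.
cpu_load() {
    LC_ALL=C awk -v r="$(to_seconds "$1")" -v u="$(to_seconds "$2")" \
        -v s="$(to_seconds "$3")" 'BEGIN { printf "%.1f%%\n", 100 * (u + s) / r }'
}

cpu_load 10m35,577s 0m28,690s 0m27,672s   # with --hwdec=vaapi
cpu_load 10m35,460s 3m21,950s 0m20,303s   # software decoding
```

For the two runs above this prints 8.9% and 35.0%, so hardware decoding cuts the CPU load of mpv to roughly a quarter.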
Understanding What Is Going On
Now that we have a verifiable way to use or not use the hardware decoder, let's deepen our understanding of how this works in terms of system calls. Let's record an mpv session without and with the acceleration by tracing its system calls with strace:
dzu@krikkit:~$ strace -e trace=openat -fo /tmp/strace-no-accel mpv /tmp/bbb_sunflower_1080p_30fps_normal.mp4
(+) Video --vid=1 (*) (h264 1920x1080 30.000fps)
(+) Audio --aid=1 (*) (mp3 2ch 48000Hz)
Audio --aid=2 (*) (ac3 6ch 48000Hz)
File tags:
Artist: Blender Foundation 2008, Janus Bager Kristensen 2013
Comment: Creative Commons Attribution 3.0 - http://bbb3d.renderfarming.net
Composer: Sacha Goedegebure
Genre: Animation
Title: Big Buck Bunny, Sunflower version
[vo/gpu/wayland] GNOME's wayland compositor lacks support for the idle inhibit protocol. This means the screen can blank during playback.
AO: [pipewire] 48000Hz stereo 2ch floatp
VO: [gpu] 1920x1080 yuv420p
AV: 00:00:02 / 00:10:34 (0%) A-V: 0.000 Dropped: 4
Exiting... (Quit)
dzu@krikkit:~$ strace -e trace=openat -fo /tmp/strace-accel mpv --hwdec=vaapi /tmp/bbb_sunflower_1080p_30fps_normal.mp4
(+) Video --vid=1 (*) (h264 1920x1080 30.000fps)
(+) Audio --aid=1 (*) (mp3 2ch 48000Hz)
Audio --aid=2 (*) (ac3 6ch 48000Hz)
File tags:
Artist: Blender Foundation 2008, Janus Bager Kristensen 2013
Comment: Creative Commons Attribution 3.0 - http://bbb3d.renderfarming.net
Composer: Sacha Goedegebure
Genre: Animation
Title: Big Buck Bunny, Sunflower version
[vo/gpu/wayland] GNOME's wayland compositor lacks support for the idle inhibit protocol. This means the screen can blank during playback.
Using hardware decoding (vaapi).
AO: [pipewire] 48000Hz stereo 2ch floatp
VO: [gpu] 1920x1080 vaapi[nv12]
AV: 00:00:00 / 00:10:34 (0%) A-V: 0.000 Dropped: 5
Exiting... (Quit)
dzu@krikkit:~$
Inside these trace files, there are differences that we don't care about. The first column contains the PID, which will of course be different for our two recordings. Also, the return value of a successful openat system call is a file descriptor, and its specific value can change between different runs (ordering, etc.). So let's remove the first column and everything after an equal sign and diff the results. We also know that we only care about filenames containing the substring /dri:
dzu@krikkit:~$ diff -c <(cat /tmp/strace-no-accel | \
grep '/dri' | sed -e 's/^[0-9]\+ //' -e 's/ = [0-9]\+$//') \
<(cat /tmp/strace-accel | \
grep '/dri' | sed -e 's/^[0-9]\+ //' -e 's/ = [0-9]\+$//')
*** /dev/fd/63 2023-12-26 01:26:04.224000705 +0100
--- /dev/fd/62 2023-12-26 01:26:04.216000655 +0100
***************
*** 24,26 ****
--- 24,40 ----
openat(AT_FDCWD, "/usr/share/drirc.d/00-radv-defaults.conf", O_RDONLY)
openat(AT_FDCWD, "/etc/drirc", O_RDONLY) = -1 ENOENT (Datei oder Verzeichnis nicht gefunden)
openat(AT_FDCWD, "/dev/dri", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY)
+ openat(AT_FDCWD, "/dev/dri/renderD128", O_RDWR)
+ openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/dri/radeonsi_drv_video.so", O_RDONLY|O_CLOEXEC)
+ openat(AT_FDCWD, "/usr/share/drirc.d", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY)
+ openat(AT_FDCWD, "/usr/share/drirc.d/00-mesa-defaults.conf", O_RDONLY)
+ openat(AT_FDCWD, "/usr/share/drirc.d/00-radv-defaults.conf", O_RDONLY)
+ openat(AT_FDCWD, "/etc/drirc", O_RDONLY) = -1 ENOENT (Datei oder Verzeichnis nicht gefunden)
+ openat(AT_FDCWD, "/usr/share/drirc.d", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY)
+ openat(AT_FDCWD, "/usr/share/drirc.d/00-mesa-defaults.conf", O_RDONLY)
+ openat(AT_FDCWD, "/usr/share/drirc.d/00-radv-defaults.conf", O_RDONLY)
+ openat(AT_FDCWD, "/etc/drirc", O_RDONLY) = -1 ENOENT (Datei oder Verzeichnis nicht gefunden)
+ openat(AT_FDCWD, "/usr/share/drirc.d", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY)
+ openat(AT_FDCWD, "/usr/share/drirc.d/00-mesa-defaults.conf", O_RDONLY)
+ openat(AT_FDCWD, "/usr/share/drirc.d/00-radv-defaults.conf", O_RDONLY)
+ openat(AT_FDCWD, "/etc/drirc", O_RDONLY) = -1 ENOENT (Datei oder Verzeichnis nicht gefunden)
dzu@krikkit:~$
So we now understand that user space programs accessing the VA API do this by loading the VA driver (radeonsi_drv_video.so here) and opening a device file below /dev/dri, and the magic happens through this file descriptor. So checking the strace output gives us a good indication of whether a command uses hardware acceleration or not. Cool, we can use this to analyze other programs more quickly, but until the mpv command checks for and uses the available hardware on its own, let's encode our policy in a system-wide configuration file:
dzu@krikkit:~$ cat /etc/mpv/mpv.conf
hwdec=vaapi
dzu@krikkit:~$
Be sure to check if you have an already existing /etc/mpv/mpv.conf before blindly copying my example, but once there is such a configuration file, mpv will always use hardware acceleration if possible.
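If you prefer not to touch the system-wide file, mpv also reads a per-user configuration file. The following is a minimal sketch of adding the setting without clobbering an existing choice; the helper name ensure_hwdec is made up for illustration and the path in the example is the usual per-user location, so adapt it to your setup.

```shell
#!/bin/sh
# ensure_hwdec FILE: add "hwdec=vaapi" to an mpv configuration file unless
# the file already contains an hwdec setting (hypothetical helper).
ensure_hwdec() {
    conf=$1
    mkdir -p "$(dirname "$conf")" || return 1
    touch "$conf" || return 1
    if grep -q '^[[:space:]]*hwdec[[:space:]]*=' "$conf"; then
        echo "hwdec is already configured in $conf, leaving it untouched"
    else
        printf 'hwdec=vaapi\n' >> "$conf"
        echo "added hwdec=vaapi to $conf"
    fi
}

# Example (per-user configuration instead of the system-wide file):
#   ensure_hwdec "$HOME/.config/mpv/mpv.conf"
```

Running the helper twice on the same file leaves exactly one hwdec line in it.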
Totem
With our current understanding, it is easy to check if totem uses acceleration:
dzu@krikkit:~$ strace -fe trace=openat totem /tmp/bbb_sunflower_1080p_30fps_normal.mp4 2>&1 | grep /dev/dri
[pid 109146] openat(AT_FDCWD, "/dev/dri", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 10
[pid 109146] openat(AT_FDCWD, "/dev/dri", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 8
[pid 109146] openat(AT_FDCWD, "/dev/dri", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 27
[pid 109146] openat(AT_FDCWD, "/dev/dri/renderD128", O_RDWR|O_CLOEXEC) = 27
[pid 109146] openat(AT_FDCWD, "/dev/dri", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 28
[pid 109146] openat(AT_FDCWD, "/dev/dri", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 31
dzu@krikkit:~$
So indeed, totem appears to be ideal in this respect and simply uses the hardware block when it finds one. Very nice!
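To repeat this check for other programs, the two signs seen in the mpv comparison (a render node below /dev/dri being opened and a VA driver library being loaded) can be looked for in a recorded trace. This is a minimal sketch under the assumption that the trace was recorded with strace -f -e trace=openat -o FILE; the function name va_trace_summary is made up. Keep in mind that a render node can also be opened for ordinary GPU rendering, so the loaded VA driver is the stronger hint.

```shell
#!/bin/sh
# va_trace_summary LOG: report whether an strace log of openat calls shows
# a DRM render node and a VA-API driver being opened (hypothetical helper).
va_trace_summary() {
    log=$1
    if grep -q '"/dev/dri/renderD[0-9][0-9]*"' "$log"; then
        echo "render node: opened"
    else
        echo "render node: not opened"
    fi
    if grep -q '_drv_video\.so"' "$log"; then
        echo "VA driver: loaded"
    else
        echo "VA driver: not loaded"
    fi
}

# Example:
#   strace -f -e trace=openat -o /tmp/trace mpv --hwdec=vaapi movie.mp4
#   va_trace_summary /tmp/trace
```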
Encoding In Hardware
Being able to decode videos with the accelerator, we turn to the problem of using the acceleration for encoding. As decided earlier, we are especially interested in encoding to H.265. Since our hardware can only encode H.264 and H.265, the choice was not hard to begin with.
As ffmpeg is my go-to solution for video tasks, let's begin by checking its support for VA API:
dzu@krikkit:~$ ffprobe -encoders 2>&1 | grep vaapi
V....D h264_vaapi H.264/AVC (VAAPI) (codec h264)
V....D hevc_vaapi H.265/HEVC (VAAPI) (codec hevc)
V....D mjpeg_vaapi MJPEG (VAAPI) (codec mjpeg)
V....D mpeg2_vaapi MPEG-2 (VAAPI) (codec mpeg2video)
V....D vp8_vaapi VP8 (VAAPI) (codec vp8)
V....D vp9_vaapi VP9 (VAAPI) (codec vp9)
dzu@krikkit:~$
So for ffmpeg we are interested in the h264_vaapi and hevc_vaapi encoders. Before trying them out, let's get an idea of the parameters that they support as it will be important later on.
ffmpeg Encoder Options
HEVC (H.265) Encoder Options
dzu@krikkit:~$ ffprobe -hide_banner -h encoder=hevc_vaapi
Encoder hevc_vaapi [H.265/HEVC (VAAPI)]:
General capabilities: dr1 delay hardware
Threading capabilities: none
Supported hardware devices: vaapi
Supported pixel formats: vaapi
h265_vaapi AVOptions:
-low_power <boolean> E..V....... Use low-power encoding mode (only available on some platforms; may not support all encoding features) (default false)
-idr_interval <int> E..V....... Distance (in I-frames) between IDR frames (from 0 to INT_MAX) (default 0)
-b_depth <int> E..V....... Maximum B-frame reference depth (from 1 to INT_MAX) (default 1)
-async_depth <int> E..V....... Maximum processing parallelism. Increase this to improve single channel performance. This option doesn't work if driver doesn't implement vaSyncBuffer function. (from 1 to 64) (default 2)
-max_frame_size <int> E..V....... Maximum frame size (in bytes) (from 0 to INT_MAX) (default 0)
-rc_mode <int> E..V....... Set rate control mode (from 0 to 6) (default auto)
auto 0 E..V....... Choose mode automatically based on other parameters
CQP 1 E..V....... Constant-quality
CBR 2 E..V....... Constant-bitrate
VBR 3 E..V....... Variable-bitrate
ICQ 4 E..V....... Intelligent constant-quality
QVBR 5 E..V....... Quality-defined variable-bitrate
AVBR 6 E..V....... Average variable-bitrate
-qp <int> E..V....... Constant QP (for P-frames; scaled by qfactor/qoffset for I/B) (from 0 to 52) (default 0)
-aud <boolean> E..V....... Include AUD (default false)
-profile <int> E..V....... Set profile (general_profile_idc) (from -99 to 255) (default -99)
main 1 E..V.......
main10 2 E..V.......
rext 4 E..V.......
-tier <int> E..V....... Set tier (general_tier_flag) (from 0 to 1) (default main)
main 0 E..V.......
high 1 E..V.......
-level <int> E..V....... Set level (general_level_idc) (from -99 to 255) (default -99)
1 30 E..V.......
2 60 E..V.......
2.1 63 E..V.......
3 90 E..V.......
3.1 93 E..V.......
4 120 E..V.......
4.1 123 E..V.......
5 150 E..V.......
5.1 153 E..V.......
5.2 156 E..V.......
6 180 E..V.......
6.1 183 E..V.......
6.2 186 E..V.......
-sei <flags> E..V....... Set SEI to include (default hdr)
hdr E..V....... Include HDR metadata for mastering display colour volume and content light level information
-tiles <image_size> E..V....... Tile columns x rows
dzu@krikkit:~$
H.264 Encoder Options
dzu@krikkit:~$ ffprobe -hide_banner -h encoder=h264_vaapi
Encoder h264_vaapi [H.264/AVC (VAAPI)]:
General capabilities: dr1 delay hardware
Threading capabilities: none
Supported hardware devices: vaapi
Supported pixel formats: vaapi
h264_vaapi AVOptions:
-low_power <boolean> E..V....... Use low-power encoding mode (only available on some platforms; may not support all encoding features) (default false)
-idr_interval <int> E..V....... Distance (in I-frames) between IDR frames (from 0 to INT_MAX) (default 0)
-b_depth <int> E..V....... Maximum B-frame reference depth (from 1 to INT_MAX) (default 1)
-async_depth <int> E..V....... Maximum processing parallelism. Increase this to improve single channel performance. This option doesn't work if driver doesn't implement vaSyncBuffer function. (from 1 to 64) (default 2)
-max_frame_size <int> E..V....... Maximum frame size (in bytes) (from 0 to INT_MAX) (default 0)
-rc_mode <int> E..V....... Set rate control mode (from 0 to 6) (default auto)
auto 0 E..V....... Choose mode automatically based on other parameters
CQP 1 E..V....... Constant-quality
CBR 2 E..V....... Constant-bitrate
VBR 3 E..V....... Variable-bitrate
ICQ 4 E..V....... Intelligent constant-quality
QVBR 5 E..V....... Quality-defined variable-bitrate
AVBR 6 E..V....... Average variable-bitrate
-qp <int> E..V....... Constant QP (for P-frames; scaled by qfactor/qoffset for I/B) (from 0 to 52) (default 0)
-quality <int> E..V....... Set encode quality (trades off against speed, higher is faster) (from -1 to INT_MAX) (default -1)
-coder <int> E..V....... Entropy coder type (from 0 to 1) (default cabac)
cavlc 0 E..V.......
cabac 1 E..V.......
vlc 0 E..V.......
ac 1 E..V.......
-aud <boolean> E..V....... Include AUD (default false)
-sei <flags> E..V....... Set SEI to include (default identifier+timing+recovery_point)
identifier E..V....... Include encoder version identifier
timing E..V....... Include timing parameters (buffering_period and pic_timing)
recovery_point E..V....... Include recovery points where appropriate
-profile <int> E..V....... Set profile (profile_idc and constraint_set*_flag) (from -99 to 65535) (default -99)
constrained_baseline 578 E..V.......
main 77 E..V.......
high 100 E..V.......
-level <int> E..V....... Set level (level_idc) (from -99 to 255) (default -99)
1 10 E..V.......
1.1 11 E..V.......
1.2 12 E..V.......
1.3 13 E..V.......
2 20 E..V.......
2.1 21 E..V.......
2.2 22 E..V.......
3 30 E..V.......
3.1 31 E..V.......
3.2 32 E..V.......
4 40 E..V.......
4.1 41 E..V.......
4.2 42 E..V.......
5 50 E..V.......
5.1 51 E..V.......
5.2 52 E..V.......
6 60 E..V.......
6.1 61 E..V.......
6.2 62 E..V.......
dzu@krikkit:~$
While many options look the same, it is worth noting that the H.264 encoder supports a -quality parameter, while the HEVC encoder does not. The latter only features a -qp parameter for specifying a quality, but it seems like it is meant for the "constant quantization parameter" mode of encoding. We will get back to this.
Looking into the performance and quality of an encoder requires some knowledge about Constant Bit rate (CBR), Variable Bit rate (VBR), Constant Rate Factor (CRF) and other special terms. I found this CRF Guide by Werner Robitza to be a very good resource to quickly learn about them. Maybe you also want to at least glance over that page before reading on.
CRF Encoding (SW)
For the rest of the post, I will use a longer movie from my own collection (H.264, 14m07s, 1920x1080p, 5615 kbit/s, 588 MiB) instead of the short Big Buck Bunny movie. Of course the content of a movie influences how well an encoder can compress it (think of a movie showing only black for long stretches, which can be compressed heavily), so I wanted a real world example instead of an artificial computer generated movie for this section.
Understanding the basic parameters that go into an encoding, let's establish a baseline for the acceleration by doing a CRF encode of the sample file. Transcoding for archival purposes is usually best done with CRF as it allows for variable bit rates and a specification of required quality. To ease calling the tools multiple times, I put the invocations of ffmpeg into a script file, but essentially this is the command line that is executed:
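# The libx265 default CRF is 28, but we lower it to 26 unless CRF is already set
[ -z "$CRF" ] && CRF=26
# Quote the variables so that file names containing spaces survive
ffmpeg -i "$1" -c:v libx265 -crf "$CRF" -c:a libvorbis -map 0:0 -map 0:a "$OUTFILE"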
So the encoder is libx265 and unless given an explicit parameter, the script will use CRF=26, which is a little lower than the default of 28 and thus asks for slightly higher quality. I ended up with this value by doing some example encodes and comparing them visually on two monitors side by side. Here is the transcript of ffmpeg doing the encoding:
dzu@krikkit:/tmp$ time recode-video -c h265 movie.mp4
ffmpeg version 5.1.4-0+deb12u1 Copyright (c) 2000-2023 the FFmpeg developers
built with gcc 12 (Debian 12.2.0-14)
configuration: --prefix=/usr --extra-version=0+deb12u1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libglslang --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librist --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libsvtav1 --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --disable-sndio --enable-libjxl --enable-pocketsphinx --enable-librsvg --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libx264 --enable-libplacebo --enable-librav1e --enable-shared
libavutil 57. 28.100 / 57. 28.100
libavcodec 59. 37.100 / 59. 37.100
libavformat 59. 27.100 / 59. 27.100
libavdevice 59. 7.100 / 59. 7.100
libavfilter 8. 44.100 / 8. 44.100
libswscale 6. 7.100 / 6. 7.100
libswresample 4. 7.100 / 4. 7.100
libpostproc 56. 6.100 / 56. 6.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'movie.mp4':
Metadata:
major_brand : isom
minor_version : 512
compatible_brands: isomiso2avc1mp41
title : Title
encoder : Lavf58.20.100
media_type : 9
hd_video : 2
Duration: 00:14:07.13, start: 0.000000, bitrate: 5817 kb/s
Chapters:
Chapter #0:0: start 0.000000, end 847.000000
Metadata:
title : Chapter 1
Stream #0:0[0x1](eng): Video: h264 (Main) (avc1 / 0x31637661), yuv420p(progressive), 1920x1080 [SAR 1:1 DAR 16:9], 5615 kb/s, 25 fps, 25 tbr, 12800 tbn (default)
Metadata:
handler_name : VideoHandler
vendor_id : [0][0][0][0]
Stream #0:1[0x2](eng): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 196 kb/s (default)
Metadata:
handler_name : SoundHandler
vendor_id : [0][0][0][0]
Stream #0:2[0x3](eng): Data: bin_data (text / 0x74786574)
Metadata:
handler_name : SubtitleHandler
Stream mapping:
Stream #0:0 -> #0:0 (h264 (native) -> hevc (libx265))
Stream #0:1 -> #0:1 (aac (native) -> vorbis (libvorbis))
Press [q] to stop, [?] for help
x265 [info]: HEVC encoder version 3.5+1-f0c1022b6
x265 [info]: build info [Linux][GCC 12.2.0][64 bit] 8bit+10bit+12bit
x265 [info]: using cpu capabilities: MMX2 SSE2Fast LZCNT SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2
x265 [info]: Main profile, Level-4 (Main tier)
x265 [info]: Thread pool created using 8 threads
x265 [info]: Slices : 1
x265 [info]: frame threads / pool features : 3 / wpp(17 rows)
x265 [info]: Coding QT: max CU size, min CU size : 64 / 8
x265 [info]: Residual QT: max TU size, max depth : 32 / 1 inter / 1 intra
x265 [info]: ME / range / subpel / merge : hex / 57 / 2 / 3
x265 [info]: Keyframe min / max / scenecut / bias : 25 / 250 / 40 / 5.00
x265 [info]: Lookahead / bframes / badapt : 20 / 4 / 2
x265 [info]: b-pyramid / weightp / weightb : 1 / 1 / 0
x265 [info]: References / ref-limit cu / depth : 3 / off / on
x265 [info]: AQ: mode / str / qg-size / cu-tree : 2 / 1.0 / 32 / 1
x265 [info]: Rate Control / qCompress : CRF-26.0 / 0.60
x265 [info]: tools: rd=3 psy-rd=2.00 early-skip rskip mode=1 signhide tmvp
x265 [info]: tools: b-intra strong-intra-smoothing lslices=6 deblock sao
Output #0, matroska, to 'recode5B0u.mkv':
Metadata:
major_brand : isom
minor_version : 512
compatible_brands: isomiso2avc1mp41
title : Title
hd_video : 2
media_type : 9
encoder : Lavf59.27.100
Chapters:
Chapter #0:0: start 0.000000, end 847.000000
Metadata:
title : Chapter 1
Stream #0:0(eng): Video: hevc, yuv420p(progressive), 1920x1080 [SAR 1:1 DAR 16:9], q=2-31, 25 fps, 1k tbn (default)
Metadata:
handler_name : VideoHandler
vendor_id : [0][0][0][0]
encoder : Lavc59.37.100 libx265
Side data:
cpb: bitrate max/min/avg: 0/0/0 buffer size: 0 vbv_delay: N/A
Stream #0:1(eng): Audio: vorbis (oV[0][0] / 0x566F), 44100 Hz, stereo, fltp (default)
Metadata:
handler_name : SoundHandler
vendor_id : [0][0][0][0]
encoder : Lavc59.37.100 libvorbis
frame=21178 fps= 18 q=34.0 Lsize= 101978kB time=00:14:07.08 bitrate= 986.2kbits/s speed=0.736x
video:93449kB audio:8071kB subtitle:0kB other streams:0kB global headers:6kB muxing overhead: 0.450788%
x265 [info]: frame I: 92, Avg QP:24.38 kb/s: 6733.38
x265 [info]: frame P: 5329, Avg QP:26.51 kb/s: 2188.34
x265 [info]: frame B: 15757, Avg QP:32.23 kb/s: 434.12
x265 [info]: Weighted P-Frames: Y:1.1% UV:0.8%
x265 [info]: consecutive B-frames: 1.4% 1.9% 1.8% 94.3% 0.5%
encoded 21178 frames in 1151.51s (18.39 fps), 902.89 kb/s, Avg QP:30.76
real 19m11,814s
user 122m29,102s
sys 1m19,955s
dzu@krikkit:/tmp$ ls -lh movie.mp4*
-rw-r--r-- 1 dzu dzu 100M 22. Dez 20:07 movie.mp4
-rw-r--r-- 1 dzu dzu 588M 22. Dez 19:48 movie.mp4.bak
dzu@krikkit:/tmp$
We expected this to max out the CPUs. Indeed, calculating the CPU load by dividing the CPU time (user + sys) by the wall clock time (real), we see that this software-only encoding resulted in a CPU load of 645%. That tells me that the libx265 encoder is a very efficient implementation, as it keeps the equivalent of about 6.5 of the 8 available threads of my machine busy all the time. This may look easy, but it is actually very difficult to achieve, so kudos to the developers of libx265, even though this post is not at all about that encoder.
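The load figure can be reproduced from the three values printed by time. The following is only a sketch: the helper names to_seconds and cpu_load are made up for illustration and are not part of the script used in this post. It assumes the "XmY,ZZZs" format shown above (with a comma or a dot as decimal separator).

```shell
#!/bin/sh
# Force the C locale so awk always parses and prints "." as decimal point.
LC_ALL=C; export LC_ALL

# Convert a value such as "19m11,814s" to seconds (hypothetical helper).
to_seconds() {
    printf '%s\n' "$1" | tr ',' '.' |
        awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# CPU load in percent: (user + sys) / real * 100 (hypothetical helper).
# Usage: cpu_load REAL USER SYS   (all in seconds)
cpu_load() {
    awk -v r="$1" -v u="$2" -v s="$3" \
        'BEGIN { printf "%.0f\n", (u + s) / r * 100 }'
}

real=$(to_seconds "19m11,814s")
user=$(to_seconds "122m29,102s")
sys=$(to_seconds "1m19,955s")
echo "CPU load: $(cpu_load "$real" "$user" "$sys")%"
```

A value above 100% simply means that more than one thread was busy on average.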
As you can see, my script replaces the original movie file and keeps the old file as a backup copy. Comparing the sizes gives us a quick glimpse of the savings and shows that H.265 is able to squeeze the movie down to 17% of its original size. Checking the original and the recoded movie side by side gives me the confidence that the settings are sane. This result is very cool and indeed what I was looking for. Using this on movie collections should save a lot of space.
We will see that although the hardware encoder is much quicker and leaves the CPU cores alone, it will not be able to achieve this result, but let's take things one bit at a time and call the hardware encoder with settings suggested on the internet.
VBR Accelerated Encoding
Again I put the magic command line into my script. This is the part that calls ffmpeg to use the hardware encoder:
ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 \
    -hwaccel_output_format vaapi -i "$1" \
    -c:v hevc_vaapi \
    -b:v 5M \
    "$OUTFILE"
Unfortunately, ffmpeg does not report which mode the encoder will use if we do not specify one, and the option help shows "auto", which is just as unhelpful. The VA API documentation for ffmpeg mentions that VBR is the default mode, and some experiments with additionally specifying -maxrate confirm that this is really the case. So not specifying any mode gives us a VBR encoding with a target average bit rate. Sometimes this is also referred to as ABR.
Just as expected, this runs with a CPU load of 28% (using a single thread 28% of the time) but encodes the movie at 146 frames per second, i.e. it is multiple times faster than real time (25-30 fps for most movies). So instead of waiting 20 minutes for the result, the transcoding is complete in 2 minutes and 25 seconds!
Looking at the result table below, we see that the saving in storage is minimal, but that was kind of expected if the original file has a bit rate of 5615 kbit/s, and we instruct the hardware encoder to encode it to 5000 kbit/s. Indeed, the encoder nicely hit the target as the result has a bit rate of 5080 kbit/s.
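One way to check the resulting bit rate is to ask ffprobe for it. The sketch below wraps this in a small function. The function name bitrate_kbps is made up, and the value printed is the overall (container) bit rate, so it can differ slightly from per-stream numbers or from numbers reported by other tools.

```shell
#!/bin/sh
# Force the C locale so awk always parses and prints "." as decimal point.
LC_ALL=C; export LC_ALL

# Print the overall bit rate of a media file in kbit/s (hypothetical helper).
# ffprobe reports format=bit_rate in bit/s, so we divide by 1000.
bitrate_kbps() {
    ffprobe -v error -show_entries format=bit_rate \
        -of default=noprint_wrappers=1:nokey=1 "$1" |
        awk '{ printf "%.0f\n", $1 / 1000 }'
}

# Example calls (the file names are placeholders):
#   bitrate_kbps movie.mp4.bak
#   bitrate_kbps movie.mp4
```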
Constant Quality Accelerated Encoding
Now that we can use our hardware, but the result is nowhere near what we know we can reach with our software baseline, we need to find alternatives. While the documentation on the web has so far been scattered and difficult to interpret, it was nearly impossible to find information on these more advanced topics. The most important point here is that the VA API encoders do not have a CRF parameter at all. Looking at the help of ffmpeg, it seems that the constant-quality mode (CQP, which fixes the quantization parameter) may fit our bill. Using it is as easy as specifying -rc_mode 1. Here is the full invocation:
ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 \
    -hwaccel_output_format vaapi -i "$1" \
    -c:v hevc_vaapi \
    -rc_mode 1 \
    "$OUTFILE"
ffmpeg informed me that it uses a default quality of 25, because no other parameter was given. We keep this in mind for the next attempts. But already we see that the result is much better: we end up with a bit rate of 2170 kbit/s while the speed of the process stays exactly the same.
In order to specify our quality target, I tried specifying -quality, but that option is not supported by hevc_vaapi. Glancing at the H.264 encoder I see that it has such an option, but the HEVC encoder does not. There I only see -qp <num>, and so I tried that. This achieves the desired effect, and playing with a few numbers yields the results given in the table below.
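Playing with a few numbers is easy to script. The following dry-run sketch only prints one ffmpeg command line per QP value instead of executing it. The function name qp_sweep and the out-qpNN.mkv output names are made up for illustration, and the printed commands are not quoted, so this assumes file names without spaces.

```shell
#!/bin/sh
# Print one ffmpeg command per QP value (dry run, nothing is encoded).
# Usage: qp_sweep INPUT QP...
qp_sweep() {
    in=$1
    shift
    for qp in "$@"; do
        # Same options as the CQP invocation above, plus an explicit -qp.
        printf '%s\n' "ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i $in -c:v hevc_vaapi -rc_mode 1 -qp $qp out-qp$qp.mkv"
    done
}

qp_sweep movie.mp4 25 26 27 28 29
```

Piping the output into sh would run the encodes one after the other.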
At -qp 28 I can already see clear visual differences (smoothing) compared to the libx265 version, so I cut off the search at this point: at the same quality, the hardware encoder will always yield larger files.
Other Encoding Modes
Looking at the options for -rc_mode, it is a valid question whether there are other modes doing an even better job, but trying ICQ made ffmpeg error out with this message:
[hevc_vaapi @ 0x55d2ebade0c0] Driver does not support ICQ RC mode (supported modes: CQP, CBR, VBR).
Ok, so CQP, VBR and CBR are the only modes that our accelerator supports.
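If a script should adapt to what the driver offers, the list of supported modes can be pulled out of exactly this kind of message. This is a sketch that assumes the message format shown above; other ffmpeg versions may word it differently. The function names supported_rc_modes and mode_supported are made up.

```shell
#!/bin/sh
# Read ffmpeg log lines on stdin and print the list that follows
# "supported modes:" (hypothetical helper, format assumed as shown above).
supported_rc_modes() {
    sed -n 's/.*supported modes: \([^)]*\)).*/\1/p'
}

# Succeed if MODE appears in a comma separated LIST such as "CQP, CBR, VBR".
# Usage: mode_supported MODE LIST
mode_supported() {
    case ", $2," in
        *", $1,"*) return 0 ;;
        *) return 1 ;;
    esac
}

msg='[hevc_vaapi @ 0x55d2ebade0c0] Driver does not support ICQ RC mode (supported modes: CQP, CBR, VBR).'
modes=$(printf '%s\n' "$msg" | supported_rc_modes)
echo "supported: $modes"
if mode_supported ICQ "$modes"; then echo "ICQ: yes"; else echo "ICQ: no"; fi
```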
CBR Accelerated Encoding
For completeness, I tested the CBR encoding (-rc_mode 2), but I do not expect to use that for archival purposes. CBR is really meant to encode a video for a known communication channel with a limited bandwidth. Specifying a bit rate lower than this limit should ensure that we can always stream the movie without any buffer underflow. But obviously, this is not what I am looking for here. Funnily enough, the results seem to be comparable to the VBR encoding, but maybe the difference would become clear when looking at the actual bit rate over time instead of just looking at the average bit rate over the whole file.
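Looking at the bit rate over time could be done by summing up the packet sizes per second. The sketch below does this for a list of "pts_time,size" lines as ffprobe can produce them. I did not run this for the post, so take it as an idea only; the function name per_second_kbps is made up and the column order is an assumption about the ffprobe CSV output.

```shell
#!/bin/sh
# Force the C locale so awk always parses and prints "." as decimal point.
LC_ALL=C; export LC_ALL

# Read "pts_time,size" lines (size in bytes) on stdin and print
# "second kbit/s" for every second of the stream (hypothetical helper).
per_second_kbps() {
    awk -F, '
        # Skip lines without a numeric timestamp (e.g. "N/A").
        $1 ~ /^[0-9]+(\.[0-9]+)?$/ {
            sec = int($1)
            bytes[sec] += $2
            if (sec > max) max = sec
        }
        END {
            for (s = 0; s <= max; s++)
                printf "%d %.0f\n", s, bytes[s] * 8 / 1000
        }'
}

# Example call (the file name is a placeholder):
#   ffprobe -v error -select_streams v:0 \
#       -show_entries packet=pts_time,size -of csv=p=0 movie.mp4 |
#       per_second_kbps
```

For a CBR file the second column should stay close to the target, while a VBR file should show larger swings around it.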
Result Table
Encoder | Options | Bit rate (kbit/s) | Size (MiB) | Fps | Real (s) | user+sys (s) | Load |
---|---|---|---|---|---|---|---|
<None> (original) | | 5615 | 588 | 25 | | | |
libx265 | -crf 26 | 986 | 100 | 18 | 1152 | 7429 | 645% |
hevc_vaapi | -rc_mode 3 -b:v 5M | 5080 | 514 | 146 | 145 | 40 | 28% |
hevc_vaapi | -rc_mode 3 -b:v 2M | 2082 | 211 | 142 | 148 | 44 | 30% |
hevc_vaapi | -rc_mode 2 -b:v 5M | 5084 | 514 | 145 | 146 | 41 | 28% |
hevc_vaapi | -rc_mode 2 -b:v 2M | 2084 | 211 | 142 | 150 | 50 | 33% |
hevc_vaapi | -rc_mode 1 -qp 25 | 2170 | 220 | 146 | 145 | 36 | 25% |
hevc_vaapi | -rc_mode 1 -qp 26 | 1802 | 183 | 146 | 145 | 36 | 25% |
hevc_vaapi | -rc_mode 1 -qp 27 | 1540 | 156 | 146 | 145 | 36 | 25% |
hevc_vaapi | -rc_mode 1 -qp 28 | 1407 | 143 | 145 | 145 | 36 | 25% |
hevc_vaapi | -rc_mode 1 -qp 29 | 1161 | 118 | 131 | | | |
hevc_vaapi | -rc_mode 3 -qp 28 | Error(1) | | | | | |
Where Error(1) is "[hevc_vaapi @ 0x563fc6e9d180] Bitrate must be set for VBR RC mode."
Summary
The trigger for this post was the innocent question of how exactly do I use the hardware video processing IP present in my hardware.
After learning about the relevant pieces of the GNU/Linux software stack, I was able to really use the hardware IP and evaluate its performance and quality. My usage of the hardware independent VA API transfers nicely to other hardware encoders supported in VA API, but for now I have to be content with the fact that the IP block does not fulfill my archival requirement of a minimal bit rate for a given quality. Even though I can encode movies roughly 8 times faster (and with minimal CPU load) with VA API than with libx265, the resulting bit rate would not be as good and so the result files would take up more space on the archive disks.
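Both sides of this trade-off can be put into numbers with the values from the result table. This is a small sketch with a made-up helper name (ratio). It compares the wall clock times and the file sizes of the libx265 run and the hevc_vaapi run with -qp 25.

```shell
#!/bin/sh
# Force the C locale so awk always parses and prints "." as decimal point.
LC_ALL=C; export LC_ALL

# Print A / B with one decimal place (hypothetical helper).
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# Wall clock seconds: libx265 (1152 s) vs. hevc_vaapi (145 s).
echo "speedup: $(ratio 1152 145)x"
# Size in MiB: hevc_vaapi -qp 25 (220) vs. libx265 -crf 26 (100).
echo "size factor: $(ratio 220 100)x"
```

Keep in mind that -qp 25 and -crf 26 are not guaranteed to give the same visual quality, so the size factor is only a rough indication.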
As it stands, the hardware encoder cannot match the quality and savings of the libx265 software option. Although it would save a lot of time (and energy) compared to the software encoding, quality matters more for archival purposes, so I will not use it after all. It is worth checking out future versions of GPUs, but for my current system the best choice is software encoding.
We also saw that because of proprietary solutions it is basically still impossible to answer a simple question like "can I use the hardware encoder" for all hardware supported by the Linux kernel. But thanks to the never ending motivation and lobbying of Free Software designers, there is now an API that comes close to that target, i.e. VA API. And with that I can use my (integrated) AMD GPU with hardware independent user space encoders. This is exactly how operating systems should work.
As a superfluous advertisement, I would like to mention that consumers can actually influence this positive development by considering "Software Support" ahead of time. The Free Software Foundation hosts the very good h-node project, which tries to document the compatibility of Free Software with physical hardware. Even though there is a section on graphics cards, it would not have helped us to decide which brand is better if we want to do hardware encoding. With "support" in this context I not only mean "it works for use case X", but also "it is supported with minimal vendor dependencies". Following the development of the Linux kernel (as best as I can) allows me to at least have an opinion here.
From an engineering perspective it is just mind boggling how much effort is going into all of those encoder and decoder libraries in software. Even with a common kernel API, they use different upper level software (e.g. hevc_vaapi from ffmpeg, vaapih265enc from gstreamer, …) and one still needs to pick a software module that corresponds to a single piece of hardware. So simple scripts using any of those tools (mine included) are actually non-portable and that's a real shame. It should be possible to write portable scripts that simply specify a target format (e.g. HEVC) and rate control parameters and then let the OS decide what to use. Maybe we will get there in another 10 years. I have my fingers crossed.
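Until then, a script can at least try to pick the encoder itself. The following is a rough sketch of that idea and not something I use in my script: the function name pick_hevc_encoder is made up, the render node path is the one from my machine, and the check only tells us that ffmpeg was built with hevc_vaapi, not that the driver really supports HEVC encoding.

```shell
#!/bin/sh
# Print the name of the HEVC encoder to use (hypothetical helper).
# Prefers hevc_vaapi if the render node exists and ffmpeg lists the
# encoder, and falls back to the libx265 software encoder otherwise.
# Usage: pick_hevc_encoder [RENDER_NODE]
pick_hevc_encoder() {
    dev=${1:-/dev/dri/renderD128}
    if [ -e "$dev" ] &&
       ffmpeg -hide_banner -encoders 2>/dev/null | grep -q 'hevc_vaapi'; then
        echo hevc_vaapi
    else
        echo libx265
    fi
}

# Example call:
#   encoder=$(pick_hevc_encoder)
```

The rate control options still differ between the two encoders (-crf versus -qp or -b:v), so this only solves a part of the portability problem.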
Questions or Suggestions?
I am really interested in your take on this complex topic, so feel free to drop me a mail at dzu@member.fsf.org, or use Disqus to comment on the post. And yeah, I am ashamed that I did not yet implement a Mastodon based comment system, but at least I have seen working code by now.