[Bug]: vaGetImage takes 90ms on each FullHD frame #1824

Open
aslobodeniuk opened this issue Jun 27, 2024 · 9 comments · May be fixed by #1837

@aslobodeniuk

aslobodeniuk commented Jun 27, 2024

Which component impacted?

Video Processing

Is it regression? Good in old configuration?

This issue doesn't reproduce with i965 driver

What happened?

This happens on a certain hardware with UHD Graphics 605 GPU.

This issue seems to affect all versions of the iHD driver; we checked 20.1.1 and 23.4.1.
It reproduces with both ffmpeg and gstreamer (latest versions of each) and with any Full HD video.

How to reproduce:

$ wget https://test-videos.co.uk/vids/bigbuckbunny/mkv/1080/Big_Buck_Bunny_1080_10s_1MB.mkv

$ ffmpeg -y -an -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i Big_Buck_Bunny_1080_10s_1MB.mkv -vf hwdownload,format=nv12 -f null -

In the ffmpeg output we can see that it only reaches ~10 fps.

The same ~10 fps is reached when downloading with the gstreamer vah264dec element:

gst-launch-1.0 -v filesrc location=Big_Buck_Bunny_1080_10s_1MB.mkv ! parsebin ! vah264dec ! "video/x-raw" ! fpsdisplaysink video-sink=fakesink sync=false

Checking the libva traces, we can see that the slowest part is vaGetImage, which always takes around 90-100 ms:

[20491.582113][ctx       none]=========vaCreateImage ret = VA_STATUS_SUCCESS, success (no error) 
[20491.680938][ctx       none]=========vaGetImage ret = VA_STATUS_SUCCESS, success (no error) 
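
The gap between those two trace lines can be read off from the leading timestamps. A small sketch of that arithmetic, assuming the first bracketed field is a time in seconds with microsecond resolution, as it appears here (the helper name `trace_timestamp` is made up for illustration):

```python
import re

TRACE = """\
[20491.582113][ctx       none]=========vaCreateImage ret = VA_STATUS_SUCCESS, success (no error)
[20491.680938][ctx       none]=========vaGetImage ret = VA_STATUS_SUCCESS, success (no error)
"""

def trace_timestamp(line):
    """Return the leading [seconds.micros] timestamp of a libva trace line."""
    match = re.match(r"\[(\d+\.\d+)\]", line)
    if match is None:
        raise ValueError(f"no timestamp in: {line!r}")
    return float(match.group(1))

create_ts, get_ts = (trace_timestamp(line) for line in TRACE.splitlines())
elapsed_ms = (get_ts - create_ts) * 1000.0
# If every frame pays this cost serially, the frame rate cannot exceed this.
fps_ceiling = 1000.0 / elapsed_ms

print(f"vaGetImage returned {elapsed_ms:.1f} ms after vaCreateImage")
print(f"upper bound on frame rate at that cost: {fps_ceiling:.1f} fps")
```

A per-frame cost of roughly 99 ms caps throughput at about 10 fps, which matches the rate seen in the ffmpeg and gstreamer runs.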

Meanwhile, without downloading to CPU memory, playback of the same file can reach 700 fps.
As an approximate benchmark of the CPU: software decoding of the same file reaches 300 fps, so the CPU itself is not especially slow.

Do you know a way to confirm whether it's a hardware or a driver issue?

What's the usage scenario when you are seeing the problem?

Playback

What impacted?

No response

Debug Information

lshw -C display

  *-display
       description: VGA compatible controller
       product: UHD Graphics 605
       vendor: Intel Corporation
       physical id: 2
       bus info: pci@0000:00:02.0
       version: 06
       width: 64 bits
       clock: 33MHz
       capabilities: pciexpress msi pm vga_controller bus_master cap_list rom
       configuration: driver=i915 latency=0
       resources: irq:128 memory:a0000000-a0ffffff memory:90000000-9fffffff ioport:f000(size=64) memory:c0000-dffff

cat /proc/cpuinfo

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 122
model name	: Intel(R) Celeron(R) J4125 CPU @ 2.00GHz
stepping	: 8
microcode	: 0xc
cpu MHz		: 900.000
cache size	: 4096 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 24
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch cpuid_fault cat_l2 cdp_l2 ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust smep erms mpx rdt_a rdseed smap clflushopt intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts umip rdpid arch_capabilities
bugs		: spectre_v1 spectre_v2 spec_store_bypass
bogomips	: 3993.60
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 122
model name	: Intel(R) Celeron(R) J4125 CPU @ 2.00GHz
stepping	: 8
microcode	: 0xc
cpu MHz		: 800.000
cache size	: 4096 KB
physical id	: 0
siblings	: 4
core id		: 1
cpu cores	: 4
apicid		: 2
initial apicid	: 2
fpu		: yes
fpu_exception	: yes
cpuid level	: 24
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch cpuid_fault cat_l2 cdp_l2 ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust smep erms mpx rdt_a rdseed smap clflushopt intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts umip rdpid arch_capabilities
bugs		: spectre_v1 spectre_v2 spec_store_bypass
bogomips	: 3993.60
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual
power management:

processor	: 2
vendor_id	: GenuineIntel
cpu family	: 6
model		: 122
model name	: Intel(R) Celeron(R) J4125 CPU @ 2.00GHz
stepping	: 8
microcode	: 0xc
cpu MHz		: 1178.455
cache size	: 4096 KB
physical id	: 0
siblings	: 4
core id		: 2
cpu cores	: 4
apicid		: 4
initial apicid	: 4
fpu		: yes
fpu_exception	: yes
cpuid level	: 24
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch cpuid_fault cat_l2 cdp_l2 ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust smep erms mpx rdt_a rdseed smap clflushopt intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts umip rdpid arch_capabilities
bugs		: spectre_v1 spectre_v2 spec_store_bypass
bogomips	: 3993.60
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual
power management:

processor	: 3
vendor_id	: GenuineIntel
cpu family	: 6
model		: 122
model name	: Intel(R) Celeron(R) J4125 CPU @ 2.00GHz
stepping	: 8
microcode	: 0xc
cpu MHz		: 800.000
cache size	: 4096 KB
physical id	: 0
siblings	: 4
core id		: 3
cpu cores	: 4
apicid		: 6
initial apicid	: 6
fpu		: yes
fpu_exception	: yes
cpuid level	: 24
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch cpuid_fault cat_l2 cdp_l2 ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust smep erms mpx rdt_a rdseed smap clflushopt intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts umip rdpid arch_capabilities
bugs		: spectre_v1 spectre_v2 spec_store_bypass
bogomips	: 3993.60
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual
power management:

Do you want to contribute a patch to fix the issue?

None

@aslobodeniuk
Author

update: with i965 driver it reaches ~100fps, in other words the issue doesn't reproduce
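
For a sense of scale, the copy rates implied by these numbers can be estimated. This is a rough sketch, assuming an NV12 frame of exactly 1920x1080 with no row padding (real surfaces are usually padded) and taking 95 ms as the midpoint of the measured 90-100 ms range. The i965 figure is only a lower bound, because the ~100 fps also includes decode time:

```python
WIDTH, HEIGHT = 1920, 1080          # Full HD, assumed unpadded
NV12_BYTES_PER_PIXEL = 1.5          # 8-bit luma plane + half-size interleaved chroma plane

frame_bytes = int(WIDTH * HEIGHT * NV12_BYTES_PER_PIXEL)

def copy_rate_mb_s(seconds_per_frame):
    """Effective per-frame throughput in MB/s (1 MB = 10**6 bytes)."""
    return frame_bytes / seconds_per_frame / 1e6

slow = copy_rate_mb_s(0.095)        # ~95 ms per vaGetImage call with iHD
fast = copy_rate_mb_s(1 / 100)      # ~100 fps end to end with i965

print(f"frame size: {frame_bytes} bytes")
print(f"iHD download:  ~{slow:.0f} MB/s")
print(f"i965 pipeline: at least ~{fast:.0f} MB/s")
```

Under these assumptions the iHD path moves a frame at a few tens of MB/s, roughly an order of magnitude below what the i965 run sustains on the same machine.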

@Jexu Jexu assigned MicroYY and unassigned Jexu and XinfengZhang Jul 1, 2024
@Jexu Jexu added the VP Video Processing label Jul 1, 2024
@intel-mediadev
Contributor

Auto Created VSMGWL-74602 for further analysis.

@hye5 hye5 added Common memory, surface, ddi and removed VP Video Processing labels Jul 2, 2024
MicroYY added a commit to MicroYY/media-driver that referenced this issue Jul 30, 2024
@MicroYY MicroYY linked a pull request Jul 30, 2024 that will close this issue
@MicroYY
Contributor

MicroYY commented Jul 30, 2024

I don't have UHD 605 in hand. May I know if #1837 can fix?

@aslobodeniuk
Author

I don't have UHD 605 in hand. May I know if #1837 can fix?

Hi @MicroYY, I checked out the PR branch, made sure the last commit is b69c087bb96e9a6dc809e77aabf332f8dc9ae678, and rebuilt the media driver against the master branches of libva and igdgmm.

I then tested it on the UHD 605 machine. I'm sure the new driver was loaded, but the same issue persists: vaGetImage now takes ~80 ms. That is less than the 90-100 ms it took before, but still far more than it should.

@hauke-work

I am seeing a similar problem on our systems with the following CPU:

model name	: Intel(R) Atom(TM) Processor A3950 @ 1.60GHz

I also see it with media driver version 24.3.0 and #1837
I am using kernel 5.4.276 with Mesa3D 23.2.1.

Extended renderer info (GLX_MESA_query_renderer):
    Vendor: Intel (0x8086)
    Device: Mesa Intel(R) HD Graphics 505 (APL 3) (0x5a84)
    Version: 23.2.1
    Accelerated: yes
    Video memory: 3888MB
    Unified memory: yes
    Preferred profile: core (0x1)
    Max core profile version: 4.6
    Max compat profile version: 4.6
    Max GLES1 profile version: 1.1
    Max GLES[23] profile version: 3.2
OpenGL vendor string: Intel
OpenGL renderer string: Mesa Intel(R) HD Graphics 505 (APL 3)
OpenGL core profile version string: 4.6 (Core Profile) Mesa 23.2.1
OpenGL core profile shading language version string: 4.60
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile
OpenGL core profile extensions:

Would an IPS ticket help?

@aslobodeniuk
Author

Note that with the i965 driver it reaches ~100 fps, so the issue doesn't reproduce (on UHD 605).

@hauke-work

hauke-work commented Aug 2, 2024

I will use the old libva driver for this use case for now, but I still want to get this to work as we need the media driver for other use cases.

I ran this under perf, using the vaapi plugin:

LIBVA_DRIVER_NAME=iHD GST_DEBUG=2 DISPLAY=:0 perf record -g  gst-launch-1.0 -v filesrc location=/tmp/Big_Buck_Bunny_1080_10s_1MB.mkv ! parsebin ! vaapih264dec ! "video/x-raw" ! autovideosink

and got this graph:

root@:~# perf report -n --stdio |head -n 100
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 21K of event 'cycles'
# Event count (approx.): 10209553885
#
# Children      Self       Samples  Command          Shared Object                       Symbol
# ........  ........  ............  ...............  ..................................  ................................................................................................................................................................................................
#
    87.56%     0.00%             0  qtdemux0:sink    [unknown]                           [.] 0000000000000000
            |
            ---0
               |
               |--82.74%--vaGetImage
               |          |
               |           --82.66%--memcpy
               |                     |
               |                      --0.60%--apic_timer_interrupt
               |                                |
               |                                 --0.57%--smp_apic_timer_interrupt
               |
                --2.92%--vaEndPicture
                          _Z19DdiMedia_EndPictureP15VADriverContextj
                          _Z20DdiDecode_EndPictureP15VADriverContextj
                          |
                           --2.92%--_ZN14DdiMediaDecode10EndPictureEP15VADriverContextj
                                     |
                                      --2.86%--_ZN14CodechalDecode7ExecuteEPv
                                                |
                                                 --1.76%--_ZN17CodechalDecodeAvc20DecodePrimitiveLevelEv
                                                           |
                                                            --1.65%--_ZN18GpuContextSpecific19SubmitCommandBufferEP14_MOS_INTERFACEP19_MOS_COMMAND_BUFFERb
                                                                      |
                                                                       --1.53%--_Z8do_exec2P12mos_linux_boiP17mos_linux_contextP13drm_clip_rectiijPi
                                                                                 |
                                                                                  --1.48%--ioctl
                                                                                            |
                                                                                             --1.48%--entry_SYSCALL_64_after_hwframe
                                                                                                       |
                                                                                                        --1.47%--do_syscall_64
                                                                                                                  __x64_sys_ioctl
                                                                                                                  ksys_ioctl
                                                                                                                  |
                                                                                                                   --1.47%--do_vfs_ioctl
                                                                                                                             drm_ioctl
                                                                                                                             drm_ioctl_kernel
                                                                                                                             i915_gem_execbuffer2_ioctl
                                                                                                                             |
                                                                                                                              --1.45%--i915_gem_do_execbuffer
                                                                                                                                        |
                                                                                                                                         --1.05%--eb_lookup_vmas
                                                                                                                                                   |
                                                                                                                                                    --1.00%--__i915_vma_do_pin
                                                                                                                                                              |
                                                                                                                                                               --0.97%--__i915_gem_object_get_pages
                                                                                                                                                                         shmem_get_pages
                                                                                                                                                                         |
                                                                                                                                                                          --0.73%--shmem_read_mapping_page_gfp
                                                                                                                                                                                    |
                                                                                                                                                                                     --0.72%--shmem_getpage_gfp.constprop.0

    82.75%     0.00%             1  qtdemux0:sink    libva.so.2.2000.0                   [.] vaGetImage
            |
             --82.74%--vaGetImage
                       |
                        --82.66%--memcpy
                                  |
                                   --0.60%--apic_timer_interrupt
                                             |
                                              --0.57%--smp_apic_timer_interrupt

    82.73%    81.61%         17214  qtdemux0:sink    libc.so.6                           [.] memcpy
            |
            |--81.61%--0
            |          |
            |           --81.57%--vaGetImage
            |                     |
            |                      --81.56%--memcpy
            |
             --1.12%--memcpy
                       |
                        --0.58%--apic_timer_interrupt
                                  |
                                   --0.57%--smp_apic_timer_interrupt

It looks similar when using the va plugin:

LIBVA_DRIVER_NAME=iHD GST_DEBUG=2 DISPLAY=:0 perf record -g  gst-launch-1.0 -v filesrc location=/tmp/Big_Buck_Bunny_1080_10s_1MB.mkv ! parsebin ! vah264dec ! "video/x-raw" ! autovideosink
root@:~# perf report -n --stdio |head -n 100
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 21K of event 'cycles'
# Event count (approx.): 8707899866
#
# Children      Self       Samples  Command         Shared Object                       Symbol
# ........  ........  ............  ..............  ..................................  .............................................................................................................................................................................................
#
    87.63%    86.72%         18119  qtdemux0:sink   libgstvideo-1.0.so.0.2209.0         [.] gst_video_frame_copy_plane
            |
            ---gst_video_frame_copy_plane
               |
               |--3.11%--apic_timer_interrupt
               |
                --1.00%--page_fault

     6.01%     0.00%             0  qtdemux0:sink   [unknown]                           [k] 0000000000000000
            |
            ---0
               |
               |--3.55%--vaEndPicture
               |          |
               |           --3.54%--_Z19DdiMedia_EndPictureP15VADriverContextj
               |                     |
               |                      --3.54%--_Z20DdiDecode_EndPictureP15VADriverContextj
               |                                |
               |                                 --3.52%--_ZN14DdiMediaDecode10EndPictureEP15VADriverContextj
               |                                           |
               |                                            --3.48%--_ZN14CodechalDecode7ExecuteEPv
               |                                                      |
               |                                                      |--2.13%--_ZN17CodechalDecodeAvc20DecodePrimitiveLevelEv
               |                                                      |          |
               |                                                      |           --1.99%--_ZN18GpuContextSpecific19SubmitCommandBufferEP14_MOS_INTERFACEP19_MOS_COMMAND_BUFFERb
               |                                                      |                     |
               |                                                      |                      --1.89%--_Z8do_exec2P12mos_linux_boiP17mos_linux_contextP13drm_clip_rectiijPi
               |                                                      |                                |
               |                                                      |                                 --1.84%--ioctl
               |                                                      |                                           |
               |                                                      |                                            --1.84%--entry_SYSCALL_64_after_hwframe
               |                                                      |                                                      |
               |                                                      |                                                       --1.83%--do_syscall_64
               |                                                      |                                                                 __x64_sys_ioctl
               |                                                      |                                                                 ksys_ioctl
               |                                                      |                                                                 |
               |                                                      |                                                                  --1.83%--do_vfs_ioctl
               |                                                      |                                                                            drm_ioctl
               |                                                      |                                                                            |
               |                                                      |                                                                             --1.82%--drm_ioctl_kernel
               |                                                      |                                                                                       i915_gem_execbuffer2_ioctl
               |                                                      |                                                                                       |
               |                                                      |                                                                                        --1.80%--i915_gem_do_execbuffer
               |                                                      |                                                                                                  |
               |                                                      |                                                                                                   --1.34%--eb_lookup_vmas
               |                                                      |                                                                                                             |
               |                                                      |                                                                                                              --1.23%--__i915_vma_do_pin
               |                                                      |                                                                                                                        |
               |                                                      |                                                                                                                         --1.19%--__i915_gem_object_get_pages
               |                                                      |                                                                                                                                   shmem_get_pages
               |                                                      |                                                                                                                                   |
               |                                                      |                                                                                                                                    --0.92%--shmem_read_mapping_page_gfp
               |                                                      |                                                                                                                                              |
               |                                                      |                                                                                                                                               --0.91%--shmem_getpage_gfp.constprop.0
               |                                                      |
               |                                                       --0.66%--_ZN17CodechalDecodeAvc16DecodeStateLevelEv
               |
                --0.53%--vaCreateBuffer
                          |
                           --0.52%--_Z21DdiMedia_CreateBufferP15VADriverContextj12VABufferTypejjPvPj
                                     |
                                      --0.51%--_ZN14DdiMediaDecode12CreateBufferE12VABufferTypejjPvPj

Both were done with Intel media driver 23.3.5.
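
One way to compare the two reports is by the Self column rather than by call chain. The sketch below assumes the `perf report --stdio` row layout exactly as pasted above (Children, Self, Samples, Command, Shared Object, Symbol); only a few rows from each report are copied in, and the function name `hottest_symbol` is made up for illustration:

```python
import re

# One summary row of `perf report -n --stdio`; call-graph lines do not match.
ROW = re.compile(
    r"^\s*(?P<children>\d+\.\d+)%\s+(?P<self>\d+\.\d+)%\s+(?P<samples>\d+)\s+"
    r"(?P<command>\S+)\s+(?P<dso>\S+)\s+\[.\]\s+(?P<symbol>.+)$"
)

VAAPIH264DEC_ROWS = """\
    87.56%     0.00%             0  qtdemux0:sink    [unknown]                           [.] 0000000000000000
    82.75%     0.00%             1  qtdemux0:sink    libva.so.2.2000.0                   [.] vaGetImage
    82.73%    81.61%         17214  qtdemux0:sink    libc.so.6                           [.] memcpy
"""

VAH264DEC_ROWS = """\
    87.63%    86.72%         18119  qtdemux0:sink   libgstvideo-1.0.so.0.2209.0         [.] gst_video_frame_copy_plane
     6.01%     0.00%             0  qtdemux0:sink   [unknown]                           [k] 0000000000000000
"""

def hottest_symbol(report_text):
    """Return (symbol, self_percent) for the row with the largest Self value."""
    rows = [m for m in map(ROW.match, report_text.splitlines()) if m]
    top = max(rows, key=lambda m: float(m["self"]))
    return top["symbol"].strip(), float(top["self"])

for name, text in (("vaapih264dec", VAAPIH264DEC_ROWS), ("vah264dec", VAH264DEC_ROWS)):
    symbol, self_pct = hottest_symbol(text)
    print(f"{name}: {symbol} ({self_pct:.2f}% self)")
```

Read this way, the largest Self share is memcpy (under vaGetImage) in the first report and gst_video_frame_copy_plane in the second.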

I see the problem with and without HuC and GuC.

In Chromium, the media driver works fine: decoding of VP9 YouTube videos is offloaded to the GPU.

IPS case: 00900208

@ceyusa

ceyusa commented Aug 3, 2024

@hauke-work how are those perf reports similar? In the first one there's a big memcpy which is, if I understand correctly, the main bottleneck, while in the second there isn't.

@fenhu
Contributor

fenhu commented Oct 11, 2024

@aslobodeniuk @hauke-work @ceyusa Sorry for replying so late.

I just checked the FFmpeg source code: https://github.com/FFmpeg/FFmpeg/blob/master/libavutil/hwcontext_vaapi.c#L850-L897

The surface download rules are as below:

  1. Try vaDeriveImage first. It is a HW download.
  2. If step 1 fails, it calls vaCreateImage and vaGetImage, which triggers a large CPU copy via memcpy(). This is almost a SW download.

Judging from the call stack, it takes path 2, so it has significant SW latency.
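
The two-step rule above can be sketched as plain control flow. This is only a model of the policy as described in this comment, not FFmpeg's code and not a libva binding: the three callables are hypothetical stand-ins named after the libva entry points, `DeriveUnsupported` stands in for a non-success status from vaDeriveImage, and the real FFmpeg code may apply additional conditions that are not modelled here.

```python
class DeriveUnsupported(Exception):
    """Stand-in for vaDeriveImage returning a non-success status."""

def download_surface(surface, derive_image, create_image, get_image):
    """Return (path_taken, image) following the derive-then-fallback rule.

    All three callables are hypothetical stand-ins, not real libva bindings.
    """
    try:
        # Step 1: expose the surface's own buffer as an image.
        return "derive", derive_image(surface)
    except DeriveUnsupported:
        # Step 2: allocate a separate image, then copy the surface into it.
        image = create_image(surface)
        get_image(surface, image)
        return "create+get", image

def failing_derive(surface):
    # Simulates a platform or format where deriving is not possible.
    raise DeriveUnsupported(surface)

calls = []
path, image = download_surface(
    "surface-0",
    derive_image=failing_derive,
    create_image=lambda s: {"for": s, "filled": False},
    get_image=lambda s, img: (calls.append("get_image"), img.update(filled=True)),
)
print(path, image, calls)
```

In this model the copy cost only appears on the fallback branch, which is the branch the perf call stack points to.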

Starting with DG2, libva provides a new HW GPU copy API: vaCopy. The app can use it for any GPU-to-CPU or CPU-to-GPU copy without any SW latency. In addition, we have recently been optimizing vaGetImage in the UMD by using GPU copy on MTL and later platforms.

Could you help confirm whether the current SW copy (vaGetImage) has any business impact on the current platform? Or does it only affect debugging (such as dump surfaces)? If it's only for debugging, I believe the issue will be resolved after MTL. We don't need any changes for Gen9-Gen12. If not, we still need to improve it.

@MicroYY MicroYY assigned XinfengZhang and unassigned MicroYY Jan 22, 2025