Windows multithreading test failure #55

Closed
frankplow opened this issue Apr 2, 2023 · 9 comments · Fixed by #95

Comments

@frankplow
Collaborator

frankplow commented Apr 2, 2023

Had some time so did a little bit more research on the Windows CI (#52) test failure.

memsetting the entire lc->sao_buffer like bab47ca does not fix the issue, so I don't think the issue is related to #26. With this change, valgrind and clang's address sanitiser don't report any memory issues.

I have compiled FFmpeg directly with MSVC/MSYS2 (i.e. not via FFVS-Project-Generator) and the problem is similar so I don't think it's anything to do with the build files. I haven't yet got the gcc/MSYS2 toolchain or MinGW gcc cross-compilation working unfortunately.

I can't get the LTRP_A_ERICSSON_3 failure to reproduce on my machine, so I don't think there's anything special about this test. The tests which fail most frequently on my machine are:

  • LMCS_A_Dolby_3.bit
  • WPP_B_Sharp_2.bit

The failures only occur when running tests concurrently; they do not occur when running the tests individually or when running tests using a single thread. I don't know whether this points towards libavcodec/vvc_thread.c at all?

This line is part of what is preventing cross-compilation at the moment. Should it not test for the compiler, using _MSC_VER or something similar, instead of checking the OS? See #57 for the fix. I don't believe this is related.
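To illustrate the compiler-versus-OS distinction raised above, here is a minimal sketch. It assumes the check in question uses an OS macro such as _WIN32, which the thread does not confirm. The DEMO_* macros and demo_* functions are hypothetical names used only for illustration. _WIN32 is defined for any Windows target, including MinGW gcc builds, while _MSC_VER is defined only by MSVC-compatible compilers.

```c
#include <assert.h>
#include <stdio.h>

/* Compiler check: set for MSVC and MSVC-compatible front ends such as clang-cl. */
#if defined(_MSC_VER)
#  define DEMO_COMPILER "msvc"
#  define DEMO_IS_MSVC 1
#elif defined(__clang__)
#  define DEMO_COMPILER "clang"
#  define DEMO_IS_MSVC 0
#elif defined(__GNUC__)
#  define DEMO_COMPILER "gcc"
#  define DEMO_IS_MSVC 0
#else
#  define DEMO_COMPILER "unknown"
#  define DEMO_IS_MSVC 0
#endif

/* OS check: set for every Windows target, so it is also true for MinGW gcc. */
#if defined(_WIN32)
#  define DEMO_TARGET_WINDOWS 1
#else
#  define DEMO_TARGET_WINDOWS 0
#endif

/* Hypothetical helper: should an MSVC-only workaround be compiled in? */
static int demo_is_msvc(void)
{
    return DEMO_IS_MSVC;
}

/* Hypothetical helper: write a one-line summary of the build into buf. */
static void demo_describe_build(char *buf, size_t size)
{
    assert(buf && size > 0);
    snprintf(buf, size, "compiler=%s windows=%d",
             DEMO_COMPILER, DEMO_TARGET_WINDOWS);
}
```

With a MinGW gcc cross-compile, a guard written against DEMO_TARGET_WINDOWS would still be taken, whereas one written against DEMO_IS_MSVC would not.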

@frankplow frankplow mentioned this issue Apr 2, 2023
@nuomi2021
Member

I'm not sure whether MSYS2 has valgrind. If it does, maybe you can use it to run some memory checks.

@frankplow
Collaborator Author

frankplow commented Apr 5, 2023

This issue is partially due to the lack of atomic operations for 8-bit types with MSVC winnt.h. Fixing this will require an upstream change (see patch here) and then changing VVCFrameThread.avails to an atomic_uint *.
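As a rough sketch of what storing availability flags in an atomic_uint array could look like, using C11 <stdatomic.h>: DemoFrameThread and the demo_ft_* functions are hypothetical stand-ins, not the real VVCFrameThread or its API, and the way flags are actually used in the decoder is an assumption here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-frame thread state (not the real struct). */
typedef struct DemoFrameThread {
    atomic_uint *avails;   /* one unsigned-int atomic per unit instead of an 8-bit type */
    size_t       nb_units;
} DemoFrameThread;

/* Allocate the flag array and clear every entry. Returns 0 on success. */
static int demo_ft_init(DemoFrameThread *ft, size_t nb_units)
{
    ft->avails = malloc(nb_units * sizeof(*ft->avails));
    if (!ft->avails)
        return -1;
    for (size_t i = 0; i < nb_units; i++)
        atomic_init(&ft->avails[i], 0);
    ft->nb_units = nb_units;
    return 0;
}

/* Atomically set the given flag bits for one unit. */
static void demo_ft_set_avail(DemoFrameThread *ft, size_t idx, unsigned flag)
{
    assert(idx < ft->nb_units);
    atomic_fetch_or(&ft->avails[idx], flag);
}

/* Atomically read one unit and test whether all given flag bits are set. */
static int demo_ft_is_avail(DemoFrameThread *ft, size_t idx, unsigned flag)
{
    assert(idx < ft->nb_units);
    return (atomic_load(&ft->avails[idx]) & flag) == flag;
}

/* Release the flag array. */
static void demo_ft_free(DemoFrameThread *ft)
{
    free(ft->avails);
    ft->avails   = NULL;
    ft->nb_units = 0;
}
```

The trade-off in this sketch is memory: each flag entry takes sizeof(unsigned) bytes rather than one, in exchange for using an atomic width that is available on the toolchains discussed here.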

With these patches + bab47ca applied, the errors are mostly gone except for LTRP_A_ERICSSON_3 – maybe there is something special about this test case after all? I can now reproduce the errors when a single test is run, rather than only as part of the suite, and the decoded MD5 is different each time. MSYS2 does not have valgrind unfortunately. I might try generating a VS solution with FFVS-Project-Generator and debugging with VS – I see how that's handy already!
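One way to check for the run-to-run variation described above is to run the same command several times and compare the first line it prints (for example an MD5). The sketch below assumes a POSIX system with popen (MSVC's runtime offers _popen instead). The function names are hypothetical and the command string is supplied by the caller. This is not how the project's test suite works.

```c
#define _POSIX_C_SOURCE 200809L  /* for popen/pclose; assumes a POSIX system */
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Run cmd once and copy the first line of its stdout into out. Returns 0 on success. */
static int demo_first_line_of(const char *cmd, char *out, size_t out_size)
{
    char scratch[256];
    FILE *p;

    assert(cmd && out && out_size > 0);
    p = popen(cmd, "r");
    if (!p)
        return -1;
    if (!fgets(out, (int)out_size, p))
        out[0] = '\0';
    /* Drain the remaining output so the child process can exit cleanly. */
    while (fgets(scratch, sizeof(scratch), p))
        ;
    return pclose(p) == -1 ? -1 : 0;
}

/* Returns 1 if every run printed the same first line, 0 if any differed, -1 on error. */
static int demo_output_is_stable(const char *cmd, int runs)
{
    char ref[256], cur[256];

    if (runs < 1 || demo_first_line_of(cmd, ref, sizeof(ref)) < 0)
        return -1;
    for (int i = 1; i < runs; i++) {
        if (demo_first_line_of(cmd, cur, sizeof(cur)) < 0)
            return -1;
        if (strcmp(ref, cur) != 0)
            return 0;
    }
    return 1;
}
```

A return value of 0 for a decode command that prints a checksum would indicate nondeterministic output, though it would not say where the nondeterminism comes from.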

@nuomi2021
Member

LTRP_A_ERICSSON_3
Since Linux always passes, it may be related to an invalid read/write too. Maybe you can try valgrind on Linux with this file and see what happens.

I see how that's handy already!
😊

@frankplow
Collaborator Author

The LTRP_A_ERICSSON_3 failure also affects Linux when assembly optimisations are enabled. I have created a new issue #59 for this.

@frankplow frankplow changed the title MSVC Test Failure Windows multithreading test failure Apr 6, 2023
@nuomi2021
Member

image
Tried the current code (b1c8bd1) with SLICES_A_HUAWEI_3.bit. We can still reproduce it, but the mismatched frame is different every time, even if I use a single thread.
Not easy to debug.

@frankplow
Collaborator Author

@nuomi2021 Is that with the memset to fix #26 (like bab47ca)?

@nuomi2021
Member

Not sure. The memset will affect thread scheduling, so even if it passes with the memset, that does not mean we have found the root cause.
If we can find a way to reproduce this with a single-threaded application, like checkasm, it may help us debug.

@frankplow
Collaborator Author

@nuomi2021 Sometimes, like here, ffvvc-test / windows/msvc/no asm fails so I don't think it is related to assembly optimisations.

There seem to be some bitstreams which fail much more frequently than others - maybe we could try identifying these and any similarities between them which may be suspect?

@nuomi2021
Member

These are multi-slice or multi-tile clips. But the weird thing is that the failed blocks are not at the slice/tile boundaries. It's pretty hard to find out what's happening. A possible way to isolate the issue that I have in mind:

  1. Check the failure history and put all failure-prone clips into a tmp directory.
  2. Set s->nb_fcs to 1 to disable threading.
  3. Run "ffmpeg.py tmp".
  4. If it fails, it may not be a multithreading issue.
  5. Try running rr (https://rr-project.org/) to capture a recording.
  6. Replay the rr recording to debug.
