Subtitles generated for a 1.5 hour long video, the timeline is inaccurate #955

Closed
guangxuanliu opened this issue Oct 19, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@guangxuanliu

When transcribing a 1.5-hour-long video, the generated subtitles have an inaccurate timeline and do not match the audio.

Even when using Whisper Large-v3, the situation remains the same.

What adjustments do I need to make so that the generated subtitles are more accurate?

Operating system: Windows 10
Software version: Buzz 1.1.0

@guangxuanliu
Author

In addition, Buzz performs well when transcribing short videos.

@raivisdejus
Collaborator

Some ideas that may help in the short term are in #946.

Work on a longer-term solution is in progress.

@raivisdejus added the bug label on Oct 20, 2024

@guangxuanliu
Author

guangxuanliu commented Oct 20, 2024

OK, thanks for your reply and advice.
I hope the new version solves this problem.

@ShakeWeLy

Ok, Thanks for your reply and advice. hope new version can solve this problem.

Has this been done?

@raivisdejus
Collaborator

There is some progress in integrating stable-ts, but more time is needed for a usable result. I hope to have some free time in the next couple of weeks or over the holiday season around Christmas.

@raivisdejus
Collaborator

@guangxuanliu @ShakeWeLy There is a small update with a partial fix for the problem. It is in the very latest development version, available here: https://github.com/chidiwilliams/buzz/actions/workflows/ci.yml?query=branch%3Amain (log in to GitHub, select the latest build, and scroll down to the artifacts section to get the installation files).

This version adds the ability to generate subtitles by combining transcripts with word-level timings: https://chidiwilliams.github.io/buzz/docs/usage/edit_and_resize

  1. Generate transcripts with "Word-level timings" enabled
  2. Use the "Resize" tool to generate the subtitles.

In my testing, this gives more precise timings, and you have more options for how to combine and generate the subtitles.
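To illustrate the idea behind combining word-level timings into subtitles, here is a minimal Python sketch. It is not Buzz's actual "Resize" implementation; the `Word` and `Subtitle` types, the `group_words` function, and the `max_chars` and `max_gap` parameters are all hypothetical names chosen for this example. It assumes each word arrives with its own start and end time in seconds, and it starts a new subtitle line when the line would grow too long or when there is a long pause.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class Subtitle:
    text: str
    start: float
    end: float


def _flush(words):
    # A subtitle spans from its first word's start to its last word's end.
    return Subtitle(" ".join(w.text for w in words), words[0].start, words[-1].end)


def group_words(words, max_chars=42, max_gap=0.8):
    """Merge consecutive words into subtitle lines.

    A new line starts when adding the next word would exceed max_chars,
    or when the silence before it is longer than max_gap seconds.
    Both thresholds are illustrative defaults, not values taken from Buzz.
    """
    subtitles = []
    current = []
    for word in words:
        if current:
            candidate = " ".join(w.text for w in current) + " " + word.text
            gap = word.start - current[-1].end
            if len(candidate) > max_chars or gap > max_gap:
                subtitles.append(_flush(current))
                current = []
        current.append(word)
    if current:
        subtitles.append(_flush(current))
    return subtitles


def srt_timestamp(seconds):
    # SRT timestamps use the form HH:MM:SS,mmm.
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(subtitles):
    # Number the cues from 1 and separate them with blank lines.
    blocks = []
    for index, sub in enumerate(subtitles, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(sub.start)} --> {srt_timestamp(sub.end)}\n{sub.text}\n"
        )
    return "\n".join(blocks)
```

Because every subtitle's start and end come directly from its own words, timing errors do not accumulate over a long recording the way they can when whole segments drift.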

I tested this approach on a movie. To improve subtitle quality even further, you can try separating the voice track from the video or audio, so that speech recognition runs on cleaner audio without background noise. For GUI tools that can separate voices from audio, see this section: https://github.com/facebookresearch/demucs?tab=readme-ov-file#graphical-interface
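As an alternative to the GUI tools, the Demucs command-line tool can be driven from a script. The sketch below assumes Demucs is installed and available on the PATH as `demucs`; the helper names `demucs_vocals_command` and `separate_vocals` and the default output folder are hypothetical and chosen for this example. It uses the `--two-stems=vocals` option described in the Demucs README to split the input into vocals and everything else.

```python
import shutil
import subprocess


def demucs_vocals_command(input_path, output_dir="separated"):
    # Build the argument list for a vocals-only separation.
    # --two-stems=vocals and -o are Demucs CLI options; the paths are placeholders.
    return ["demucs", "--two-stems=vocals", "-o", output_dir, input_path]


def separate_vocals(input_path, output_dir="separated"):
    # Fail early with a clear message if the demucs executable is missing.
    if shutil.which("demucs") is None:
        raise RuntimeError("demucs is not installed or not on PATH")
    subprocess.run(demucs_vocals_command(input_path, output_dir), check=True)
```

Calling `separate_vocals("movie_audio.wav")` with a real file would run the separation; the resulting vocals track can then be opened in Buzz in place of the original audio.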

A future version of Buzz may include built-in voice separation.

@raivisdejus
Collaborator

@ShakeWeLy The very latest development version adds the ability to extract speech before the audio is transcribed. This should reduce background noise and make transcripts more accurate. Please test it and let us know if any inaccuracies remain.
