Subtitles generated for a 1.5 hour long video, the timeline is inaccurate #955

guangxuanliu · 2024-10-19T23:43:24Z

When transcribing a 1.5-hour video, the generated subtitles have inaccurate timestamps and do not match the audio.

Even when using Whisper Large-v3, the situation remains the same.

What adjustments do I need to make so that the generated subtitles are more accurate?

Operating system: Windows 10
Software version: Buzz 1.1.0

guangxuanliu · 2024-10-19T23:48:08Z

In addition, Buzz performs well when transcribing short videos.

raivisdejus · 2024-10-20T05:59:37Z

Some ideas that may help in the short term are in #946.

Work on a longer-term solution is in progress.

guangxuanliu · 2024-10-20T07:01:09Z

OK, thanks for your reply and advice.
I hope a new version can solve this problem.

ShakeWeLy · 2024-12-08T11:57:14Z

> Ok, Thanks for your reply and advice. hope new version can solve this problem.

Has this been done?

raivisdejus · 2024-12-08T16:01:40Z

There is some progress in integrating stable-ts, but more time is needed for a usable result. I hope to have some free time in the next couple of weeks or over the holiday season around Christmas.

raivisdejus · 2024-12-29T19:47:48Z

@guangxuanliu @ShakeWeLy There is a small update with a partial fix for the problem. It is in the very latest development version, available here: https://github.com/chidiwilliams/buzz/actions/workflows/ci.yml?query=branch%3Amain (log in to GitHub, select the latest build, and scroll down to the artifacts section to get the installation files).

This version adds the ability to generate subtitles by combining transcripts with word-level timings: https://chidiwilliams.github.io/buzz/docs/usage/edit_and_resize

1. Generate transcripts with "Word-level timings" enabled.
2. Use the "Resize" tool to generate the subtitles.

In my testing, this gives more precise timings, and you have more options for how to combine and generate the subtitles.
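To illustrate the general idea of building subtitles from word-level timings, here is a minimal Python sketch. It is not Buzz's actual "Resize" implementation. The `Word` type, the function names, and the `max_chars` / `max_gap` thresholds are all hypothetical and chosen only for illustration. Each cue takes its start time from its first word and its end time from its last word, so the cue timing follows the speech instead of a fixed segment length.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """One transcribed word with its timing (hypothetical type for this sketch)."""
    text: str
    start: float  # seconds
    end: float    # seconds


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def group_words(words: List[Word], max_chars: int = 42,
                max_gap: float = 0.8) -> List[List[Word]]:
    """Start a new cue when the line would get too long or the speaker pauses.

    max_chars and max_gap are illustrative defaults, not Buzz settings.
    """
    cues: List[List[Word]] = []
    current: List[Word] = []
    for word in words:
        if current:
            length = len(" ".join(w.text for w in current)) + 1 + len(word.text)
            gap = word.start - current[-1].end
            if length > max_chars or gap > max_gap:
                cues.append(current)
                current = []
        current.append(word)
    if current:
        cues.append(current)
    return cues


def to_srt(cues: List[List[Word]]) -> str:
    """Render cues as SRT text; each cue spans its first to its last word."""
    blocks = []
    for index, cue in enumerate(cues, start=1):
        text = " ".join(w.text for w in cue)
        timing = f"{srt_time(cue[0].start)} --> {srt_time(cue[-1].end)}"
        blocks.append(f"{index}\n{timing}\n{text}\n")
    return "\n".join(blocks)
```

For example, `to_srt(group_words(words))` on a list of `Word` items would split the text at pauses longer than 0.8 seconds or once a line exceeds 42 characters.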

I tested this approach on a movie. To improve subtitle quality even more, you can try separating the voice track from the video or audio, so that speech recognition runs on cleaner audio with no background noise. See this section for more information on GUI tools that let you separate voices from the audio: https://github.com/facebookresearch/demucs?tab=readme-ov-file#graphical-interface
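If you prefer the command line over the GUI tools linked above, the Demucs README also documents a `--two-stems=vocals` option that separates the vocals from everything else. The Python sketch below only builds and optionally runs that command. It assumes the `demucs` command is installed and on your PATH; the helper names, the input file name, and the `separated` output folder are placeholders for illustration.

```python
import subprocess
from typing import List


def build_demucs_command(input_path: str, out_dir: str = "separated") -> List[str]:
    """Build a Demucs command that separates vocals from the rest of the audio.

    --two-stems=vocals and -o are options documented in the Demucs README.
    """
    return ["demucs", "--two-stems=vocals", "-o", out_dir, input_path]


def separate_vocals(input_path: str, out_dir: str = "separated") -> None:
    """Run Demucs; assumes the demucs CLI is installed (an assumption here)."""
    subprocess.run(build_demucs_command(input_path, out_dir), check=True)


if __name__ == "__main__":
    # "movie_audio.wav" is a placeholder file name.
    print(" ".join(build_demucs_command("movie_audio.wav")))
```

The separated vocals file can then be transcribed in Buzz in place of the original audio.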

A future version of Buzz may include voice separation built in.

raivisdejus · 2025-01-02T11:41:07Z

@ShakeWeLy The very latest development version adds the ability to extract speech before the audio is transcribed. This should reduce background noise and make transcripts more accurate. Please test this and let us know if any inaccuracies remain.

raivisdejus added the "bug" label (Something isn't working) Oct 20, 2024

raivisdejus closed this as completed Jan 2, 2025
