A tool to annotate videos of ultimate from recorded speech, and then edit the videos into clips matching selected parameters.
For example, if the audio includes "turnover" whenever one happens, the user can later make a compilation of all the turnovers in their video library.
The MVP, which uses speech recordings and AI to split the video into pieces and glue them back together with the excess between points removed, is working.
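To illustrate the annotation idea, here is a minimal sketch of how spoken keywords could be matched to timestamps. It assumes a WhisperX-style transcript of segments with `start`, `end`, and `text` fields; the helper name `find_keyword_times` is hypothetical and not part of this project's code.

```python
from typing import Dict, List


def find_keyword_times(segments: List[Dict], keyword: str) -> List[float]:
    """Return the start times (in seconds) of transcript segments
    that contain the keyword, case-insensitively."""
    return [
        seg["start"]
        for seg in segments
        if keyword.lower() in seg["text"].lower()
    ]


# Example transcript in a WhisperX-like layout (field names are an
# assumption; check the actual transcription output):
segments = [
    {"start": 12.4, "end": 14.1, "text": "Turnover"},
    {"start": 95.0, "end": 96.2, "text": "nice catch"},
    {"start": 180.3, "end": 181.0, "text": "turnover again"},
]
print(find_keyword_times(segments, "turnover"))  # → [12.4, 180.3]
```

The resulting timestamps are what a later editing step can cut clips around, e.g. to build the turnover compilation mentioned above.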
Using Conda for managing Python virtual environments is highly encouraged.
See the Conda documentation on installation.
We use a Python interface to ffmpeg to process the video files.
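As a rough sketch of the video-processing step, the command below cuts a clip out of a recording with a stream copy (no re-encoding). This is an illustration using the ffmpeg CLI directly via `subprocess`, not necessarily the interface this project uses; file names and times are made up.

```python
import subprocess


def cut_clip_cmd(src: str, dst: str, start: float, end: float) -> list:
    """Build an ffmpeg command that copies the span [start, end)
    from src to dst without re-encoding."""
    return [
        "ffmpeg",
        "-i", src,
        "-ss", str(start),
        "-to", str(end),
        "-c", "copy",
        dst,
    ]


cmd = cut_clip_cmd("game.mp4", "point01.mp4", 12.4, 95.0)
# subprocess.run(cmd, check=True)  # uncomment to actually invoke ffmpeg
print(cmd)
```

Stream copy keeps the cuts fast, at the cost of cut points snapping to the nearest keyframe; re-encoding would be needed for frame-accurate cuts.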
WhisperX is used to convert audio files into text annotations.
Mostly following the original WhisperX setup:
Tested for PyTorch 2.0 and Python 3.10 (use other versions at your own risk!). GPU execution requires the NVIDIA libraries cuBLAS 11.x and cuDNN 8.x to be installed on the system. Please refer to the CTranslate2 documentation.
You don't need GPU execution. It does speed things up, but a 2015 laptop handles speech-to-text with the medium model in almost real time.
conda create --name whisperx python=3.10
conda activate whisperx
Linux and Windows CUDA11.8:
conda install pytorch==2.0.0 torchaudio==2.0.0 pytorch-cuda=11.8 -c pytorch -c nvidia
CPU only (laptop):
conda install pytorch==2.0.0 torchaudio==2.0.0 cpuonly -c pytorch
See other methods here.
pip install git+https://github.com/m-bain/whisperx.git
If already installed, update the package to the most recent commit:
pip install git+https://github.com/m-bain/whisperx.git --upgrade
- How to run the program
- Step-by-step bullets
- Code blocks for commands
Teemu Säilynoja
This project is licensed under the [NAME HERE] License - see the LICENSE.md file for details