
Synchronizing transcripts and subtitles

Tools for synchronizing transcripts with subtitles. The project is developed by the Emory NLP lab.

Requirement

Install the fuzzywuzzy package (for example, with pip install fuzzywuzzy).

Usage

To run the program, follow the four steps below.
1. json_cleaning.py
  - Extracts every transcript, utterance_id, and speaker from the .json file.
  - Input: .json file.
  - Output: pickle-dumped .txt file.
2. smi_cleaning.py
  - Extracts each subtitle with its start and end time (in milliseconds) from the .smi (subtitle) file.
  - Input: .smi (subtitle) file.
  - Output: pickle-dumped .txt file.
3. smi_superset.py
  - Builds a subtitle superset covering all possible utterances from the .smi (subtitle) file.
  - Input: pickle-dumped .txt file (from step 2).
  - Output: pickle-dumped .txt file.
4. matching.py
  - Applies the fuzzywuzzy matching algorithms.
  - Input: step 1 output file (extracted transcript) and step 3 output file (subtitle superset).
  - Output: the matching result.
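
The sketch below illustrates how steps 2 to 4 might fit together. It is not the code from this repository. It makes several assumptions: the SMI fragment is made up, the function names (parse_smi, build_superset, best_match) are hypothetical, a cue is assumed to end where the next SYNC tag starts, and the "superset" is read as every run of up to max_span consecutive subtitles joined into one candidate. The standard-library difflib stands in for fuzzywuzzy so the example runs without extra packages.

```python
import pickle
import re
from difflib import SequenceMatcher

# Hypothetical SMI fragment; real files carry a header and more markup.
SMI_SAMPLE = """<SYNC Start=1000><P Class=ENCC>Hi, how are you?
<SYNC Start=2500><P Class=ENCC>&nbsp;
<SYNC Start=3000><P Class=ENCC>I'm fine,
<SYNC Start=4200><P Class=ENCC>thanks for asking.
<SYNC Start=6000><P Class=ENCC>&nbsp;
"""


def parse_smi(text):
    """Return (start_ms, end_ms, subtitle) tuples, skipping blank cues."""
    cues = re.findall(
        r"<SYNC Start=(\d+)>\s*<P[^>]*>(.*?)(?=<SYNC|\Z)",
        text,
        flags=re.I | re.S,
    )
    subtitles = []
    for i, (start, body) in enumerate(cues):
        # Drop leftover tags and non-breaking spaces, then squeeze whitespace.
        body = re.sub(r"<[^>]+>", " ", body).replace("&nbsp;", " ")
        body = " ".join(body.split())
        if not body:
            continue
        # Assumption: a cue lasts until the next SYNC tag begins.
        end = int(cues[i + 1][0]) if i + 1 < len(cues) else int(start)
        subtitles.append((int(start), end, body))
    return subtitles


def build_superset(subtitles, max_span=3):
    """Join every run of 1..max_span consecutive subtitles into one candidate."""
    superset = []
    for i in range(len(subtitles)):
        for j in range(i, min(i + max_span, len(subtitles))):
            window = subtitles[i:j + 1]
            text = " ".join(sub[2] for sub in window)
            superset.append((window[0][0], window[-1][1], text))
    return superset


def best_match(utterance, superset):
    """Return the closest candidate and a 0-100 similarity score."""
    def score(candidate):
        return SequenceMatcher(None, utterance.lower(), candidate[2].lower()).ratio()

    best = max(superset, key=score)
    return best, round(100 * score(best))


subtitles = parse_smi(SMI_SAMPLE)
# The scripts exchange pickled data; an in-memory round trip shows the idea.
superset = pickle.loads(pickle.dumps(build_superset(subtitles)))
match, similarity = best_match("I'm fine, thanks for asking.", superset)
print(match, similarity)
```

Matching against joined windows rather than single subtitles lets one transcript utterance pick up a start and end time even when the subtitle file splits it across several cues. The cost is that the candidate list grows with max_span.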

Future work
