User Instruction

Introduction

align4d is a Python package for aligning the text output of Speaker Diarization and Speech Recognition to a gold standard transcript, especially when speakers overlap. This user manual provides a step-by-step guide to installing, using, and troubleshooting the package.

Mechanism

align4d uses a global alignment algorithm, a multi-sequence variant of the Needleman-Wunsch algorithm, to align the hypothesis (the results generated by Speaker Diarization and Speech Recognition models) to the reference (usually a gold standard transcript, which is separated into multiple sequences if there are multiple speakers). The alignment happens at the token level. For long sequences, align4d automatically locates the absolutely aligned parts (called barriers), uses them to separate the sequence into smaller segments, aligns the segments separately, and finally assembles them.
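To make the idea concrete, here is a minimal sketch of the classic two-sequence Needleman-Wunsch algorithm applied to tokens. It is not align4d's implementation: the multi-sequence variant, the barrier-based segmentation, and the Levenshtein-based scoring are all left out, and the function name `needleman_wunsch` and the scores (match +1, mismatch -1, gap -1) are assumptions chosen only for illustration.

```python
def needleman_wunsch(hyp, ref, match=1, mismatch=-1, gap=-1):
    """Globally align two token lists; gaps are returned as empty strings."""
    n, m = len(hyp), len(ref)
    # score[i][j] is the best score for aligning hyp[:i] with ref[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = match if hyp[i - 1] == ref[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + step,  # align the two tokens
                score[i - 1][j] + gap,       # hypothesis token against a gap
                score[i][j - 1] + gap,       # reference token against a gap
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_hyp, aligned_ref = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if hyp[i - 1] == ref[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                aligned_hyp.append(hyp[i - 1])
                aligned_ref.append(ref[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_hyp.append(hyp[i - 1])
            aligned_ref.append("")
            i -= 1
        else:
            aligned_hyp.append("")
            aligned_ref.append(ref[j - 1])
            j -= 1
    return aligned_hyp[::-1], aligned_ref[::-1]


hyp_tokens = ["ok", "I", "am", "a", "fish"]
ref_tokens = ["I", "am", "a", "fish"]
aligned_hyp, aligned_ref = needleman_wunsch(hyp_tokens, ref_tokens)
print(aligned_hyp)  # ['ok', 'I', 'am', 'a', 'fish']
print(aligned_ref)  # ['', 'I', 'am', 'a', 'fish']
```

The score table grows with the product of the two sequence lengths, which is one reason splitting long inputs into segments helps with memory.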

align4d uses the Levenshtein Distance to measure the similarity between tokens during alignment. Each aligned position falls into one of 4 situations:

  1. Fully match. The two tokens are exactly the same (Levenshtein Distance is 0).
  2. Partially match. The two tokens are not exactly the same, but the Levenshtein Distance between them is within a boundary.
  3. Mismatch. The two tokens are different and the Levenshtein Distance between them exceeds the boundary.
  4. Gap. Only one token is present because it is aligned to a gap (insertion or deletion of tokens).
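The four situations above can be sketched as a small classifier. This is only an illustration of the rule, not the package's code: the helper names `levenshtein` and `classify` are hypothetical, the comparison is done on the raw tokens (no case folding or punctuation stripping), and a gap is assumed to be represented by an empty string.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or keep
            ))
        previous = current
    return previous[-1]


def classify(hyp_token: str, ref_token: str, partial_bound: int = 2) -> str:
    """Label one aligned position with one of the four situations."""
    if hyp_token == "" or ref_token == "":
        return "gap"
    distance = levenshtein(hyp_token, ref_token)
    if distance == 0:
        return "fully match"
    if distance <= partial_bound:
        return "partially match"
    return "mismatch"


print(classify("fish", "fish"))     # fully match
print(classify("color", "colour"))  # partially match (distance 1)
print(classify("hello", "fish"))    # mismatch
print(classify("ok", ""))           # gap
```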

Installation

To install align4d, you need to have Python version 3.10 or 3.11. Follow these steps:

  1. Open your terminal or command prompt.
  2. Type in the following command: pip install align4d
  3. Wait for the package to download and install.
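If you are unsure which interpreter pip will install into, a quick check like the following can help. It is a sketch only: the helper `is_supported` is hypothetical, and the set of versions simply restates the requirement given above.

```python
import sys

# Versions named in this manual as supported.
SUPPORTED_VERSIONS = {(3, 10), (3, 11)}


def is_supported(version=None) -> bool:
    """Return True if the (major, minor) part of the version is supported."""
    major, minor = tuple(version or sys.version_info)[:2]
    return (major, minor) in SUPPORTED_VERSIONS


if not is_supported():
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} "
          "is not one of the versions listed for align4d (3.10 or 3.11).")
```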

Usage

Importing align4d

To use align4d in your Python code, you need to import it. Here's how:

from align4d import align

Compile

Before any alignment can run, align4d must compile the C++ source code distributed with the package. To ensure successful compilation, a recent compiler that supports C++20 is required.

  1. For Windows, install or update to the latest version of Visual Studio with the latest version of Visual C++ (or Visual Studio version >= 17.4.4).
  2. For macOS, install or update to the latest version of Xcode with Apple Clang (or Xcode version >= 14.3 with Apple Clang version >= 14.0.3).
  3. For Linux, install or update to the latest version of GCC with G++ (or GCC version >= 11.2.0).

To compile the C++ source code, call the function align.compile():

align.compile()

At this stage, run only align.compile(); do not run any of the alignment functions introduced in the following sections. Once compilation has finished, you do not need to (and should not) run this function again when aligning. You do need to rerun align.compile() when you switch to a new environment or reinstall align4d.

Aligning Text Results

align4d can align results from Speaker Diarization and Speech Recognition. For simple, straightforward usage, call the function like this:

aligned_result = align.align(hypothesis, reference)

Here is an overview of all parameters of the function:

aligned_result = align.align(hypothesis: str | list[str], reference: list[list], partial_bound: int = 2, segment_length: int = None, barrier_length: int = None, strip_punctuation: bool = True)

The align() function takes 6 parameters: hypothesis and reference are required, and the other 4 are optional:

  1. hypothesis: A string, or a list of strings containing tokenized text. In the list form, each string is a word generated by the Speech Recognition model. It is suggested to remove all punctuation, escape sequences, and any other characters that are not part of the natural language.

    hypothesis = ["ok", "I", "am", "a", "fish", "Are", "you", "Hello", "there", "How", "are", "you", "ok"]
    # or 
    hypothesis = "ok I am a fish. Are you? Hello there. How are you? ok"
  2. reference: A nested list of strings containing the utterances and speaker labels from the gold standard text. In each inner list, the first string is the speaker label and the second string is the utterance. The utterance can also be a list of strings where each string is a token. It is suggested to remove all punctuation, escape sequences, and any other characters that are not part of the natural language.

    reference = [
        ["A", "I am a fish."],
        ["B", "okay."],
        ["C", "Are you?"],
        ["D", "Hello there."],
        ["E", "How are you?"]
    ]
    # or 
    reference = [
        ["A", ["I", "am", "a", "fish."]],
        ["B", ["okay."]],
        ["C", ["Are", "you?"]],
        ["D", ["Hello", "there."]],
        ["E", ["How", "are", "you?"]]
    ]
  3. partial_bound: An integer that specifies the boundary between a partial match and a mismatch, in terms of the Levenshtein Distance between the two tokens being compared. This parameter is optional and the default value is 2.

  4. segment_length: An integer that specifies the minimum length of each segment in number of hypothesis tokens. When segment_length and barrier_length are both provided, the program performs manual segmentation of long sequences before the actual alignment, based on these parameters.

    If segment_length and barrier_length are not provided and the hypothesis is longer than 100 tokens, the program automatically searches for and uses the optimal segment_length between 30 and 120.

    If segment_length and barrier_length are not provided and the hypothesis is shorter than 100 tokens, no segmentation is performed.

    If segment_length and barrier_length are provided and both are integers less than or equal to 0, no segmentation is performed.

    It is strongly suggested to use automatic or manual segmentation when the input sequences are long; otherwise the alignment may fail because of limited RAM.

    segment_length and barrier_length must be provided together to perform manual segmentation; otherwise an Exception is raised:

    Exception: Segment length or barrier length parameter incorrect or missing.
  5. barrier_length: An integer that specifies the length, in number of tokens, of the parts used to detect the absolutely aligned parts. This parameter is optional and the default value is 6 if it is not specified. When segment_length and barrier_length are both provided, the program performs manual segmentation of long sequences before the actual alignment, based on these parameters.

    segment_length and barrier_length must be provided together to perform manual segmentation; otherwise an Exception is raised:

    Exception: Segment length or barrier length parameter incorrect or missing.
  6. strip_punctuation: A boolean that specifies whether align4d strips all punctuation from the hypothesis and reference before aligning, which gives a more accurate alignment result. The default is True; the output still shows the alignment with the original punctuation.
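The segmentation rules for segment_length and barrier_length (parameters 4 and 5) can be restated as a small decision function. This is a sketch of the documented behavior, not code from the package: the function name `segmentation_mode` is hypothetical, it raises ValueError where the manual shows a generic Exception, and two cases the text leaves open are filled in by assumption (a hypothesis of exactly 100 tokens is treated as not segmented, and a mix of one positive and one non-positive value is treated as incorrect).

```python
ERROR_MESSAGE = "Segment length or barrier length parameter incorrect or missing."


def segmentation_mode(num_hyp_tokens, segment_length=None, barrier_length=None):
    """Return "auto", "manual", or "none" for the given parameters."""
    if segment_length is None and barrier_length is None:
        # Assumption: exactly 100 tokens counts as "not over 100".
        return "auto" if num_hyp_tokens > 100 else "none"
    if segment_length is None or barrier_length is None:
        # Only one of the two was provided.
        raise ValueError(ERROR_MESSAGE)
    if segment_length <= 0 and barrier_length <= 0:
        return "none"
    if segment_length > 0 and barrier_length > 0:
        return "manual"
    # Assumption: one positive and one non-positive value is an error.
    raise ValueError(ERROR_MESSAGE)


print(segmentation_mode(50))           # none
print(segmentation_mode(500))          # auto
print(segmentation_mode(500, 60, 6))   # manual
print(segmentation_mode(500, 0, 0))    # none
```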

The align() function returns a dictionary containing the aligned results. The aligned hypothesis is a list of strings (tokens) stored under the key “hypothesis”. The reference is separated into multiple sequences according to the provided speaker labels; each sequence is a list of strings (tokens) stored under its speaker label in a secondary dictionary, which is the value for the key “reference” in the primary dictionary. Tokens at the same index in these lists are aligned to each other, and a gap is denoted by “” (empty string). If there is punctuation in the input, it is preserved in the output.

import json

from align4d import align

hypothesis = "ok I am a fish. Are you? Hello there. How are you? ok"
reference = [
        ["A", "I am a fish. "],
        ["B", "okay. "],
        ["C", "Are you? "],
        ["D", "Hello there. "],
        ["E", "How are you? "]
]
align_result = align.align(hypothesis, reference)
# Pretty-print the returned dictionary.
print(json.dumps(align_result, indent=4))

Sample output from align():

# content in align_result
{
    "hypothesis": ['ok', 'I', 'am', 'a', 'fish.', 'Are', 'you?', 'Hello', 'there.', 'How', 'are', 'you?', 'ok'],
    "reference": {
        "A": ['', 'I', 'am', 'a', 'fish.', '', '', '', '', '', '', '', ''],
        "B": ['okay.', '', '', '', '', '', '', '', '', '', '', '', ''],
        "C": ['', '', '', '', '', 'Are', 'you?', '', '', '', '', '', ''],
        "D": ['', '', '', '', '', '', '', 'Hello', 'there.', '', '', '', ''],
        "E": ['', '', '', '', '', '', '', '', '', 'How', 'are', 'you?', '']
    }
}
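Because every list in the result has the same length, the result can be read column by column. The sketch below walks a dictionary shaped like the sample output above and reports, for each position, the hypothesis token and the reference tokens aligned to it. The helper name `aligned_columns` is hypothetical and not part of align4d; it only relies on the structure described in this section.

```python
def aligned_columns(align_result):
    """Yield (hypothesis_token, {speaker: reference_token}) per position."""
    hypothesis = align_result["hypothesis"]
    reference = align_result["reference"]
    for position, hyp_token in enumerate(hypothesis):
        # Keep only the speakers that have a real token (not a gap) here.
        matches = {
            speaker: tokens[position]
            for speaker, tokens in reference.items()
            if tokens[position] != ""
        }
        yield hyp_token, matches


# A shortened result in the same shape as the sample output.
sample_result = {
    "hypothesis": ["ok", "I", "am", "a", "fish.", "ok"],
    "reference": {
        "A": ["", "I", "am", "a", "fish.", ""],
        "B": ["okay.", "", "", "", "", ""],
    },
}

for hyp_token, matches in aligned_columns(sample_result):
    print(repr(hyp_token), matches)
```

An empty dictionary for a position means the hypothesis token is aligned to a gap in every reference sequence.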

Retrieve token match result

Based on the alignment result, this tool provides a function to retrieve the matching result (fully match, partially match, mismatch, gap) for each token. Use token_match() to retrieve the token-level matching result.

The criteria for determining the matching result are the following (also mentioned in the Mechanism section):

  1. fully match: Levenshtein Distance = 0
  2. partially match: 0 < Levenshtein Distance ≤ boundary (default to be 2)
  3. mismatch: Levenshtein Distance > boundary (default to be 2)
  4. gap: aligned to a gap

The token_match() function takes 3 parameters: align_result, which is the direct return value of align(); an optional parameter partial_bound, which must be the same as the partial_bound used in the align() call (default 2); and an optional parameter strip_punctuation, which must be the same as the strip_punctuation used in the align() call (default True).

hypothesis = "ok I am a fish. Are you? Hello there. How are you? ok"
reference = [
        ["A", "I am a fish. "],
        ["B", "okay. "],
        ["C", "Are you? "],
        ["D", "Hello there. "],
        ["E", "How are you? "]
]
align_result = align.align(hypothesis, reference)
token_match_result = align.token_match(align_result)
print(token_match_result)

The return value is a list of strings giving the token matching result; each entry is one of fully match, partially match, mismatch, or gap.

# possible output of token_match()
['mismatch', 'fully match', 'fully match', 'fully match', 'fully match', 'fully match', 'fully match', 'fully match', 'fully match', 'fully match', 'fully match', 'fully match', 'gap']
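A list like this is easy to summarize with the standard library. The following sketch counts each label and computes the share of fully matched positions; the helper name `summarize_matches` is hypothetical and the input is just the sample list shown above.

```python
from collections import Counter


def summarize_matches(token_match_result):
    """Return per-label counts and the fraction of fully matched positions."""
    counts = Counter(token_match_result)
    total = len(token_match_result)
    fully_matched = counts["fully match"] / total if total else 0.0
    return counts, fully_matched


sample_match_result = ["mismatch"] + ["fully match"] * 11 + ["gap"]
counts, fully_matched = summarize_matches(sample_match_result)
print(dict(counts))  # {'mismatch': 1, 'fully match': 11, 'gap': 1}
print(f"{fully_matched:.2f}")  # 0.85
```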

Retrieve mapping from reference to hypothesis

Based on the alignment result, this tool provides a function to retrieve the mapping from each token in the reference sequences to the hypothesis sequence. Each index gives the position in the hypothesis sequence of a non-gap token (fully match, partially match, or mismatch) from the separated reference sequences. An index of -1 means that the current token is not aligned to any token in the hypothesis (it is aligned to a gap).

To achieve this, use the function align_indices(). This function takes 2 parameters: align_result, which is the direct return value of align(), and an optional parameter strip_punctuation, which must be the same as the strip_punctuation used in the align() call (default True).

hypothesis = "ok I am a fish. Are you? Hello there. How are you? ok"
reference = [
        ["A", "I am a fish. "],
        ["B", "okay. "],
        ["C", "Are you? "],
        ["D", "Hello there. "],
        ["E", "How are you? "]
]
align_result = align.align(hypothesis, reference)
align_indices = align.align_indices(align_result)
print(align_indices)

The return value is a dictionary that maps each speaker label to a list of integers. Each integer is the index of the hypothesis token to which the corresponding reference token is mapped (for example, the first token in sequence “C” is mapped to the hypothesis token with index 5).

# possible output
{
    'A': [1, 2, 3, 4], 
    'B': [0], 
    'C': [5, 6], 
    'D': [7, 8], 
    'E': [9, 10, 11]
}
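One use of this mapping is to attach a speaker label to each hypothesis token. The sketch below inverts a dictionary shaped like the output above. The helper name `speakers_by_hypothesis_index` is hypothetical, and it assumes the integers index the hypothesis tokens directly and that -1 marks a reference token aligned to a gap, as described in this section.

```python
def speakers_by_hypothesis_index(indices_by_speaker, num_hyp_tokens):
    """Return a list with the speaker label (or None) per hypothesis index."""
    speakers = [None] * num_hyp_tokens
    for speaker, indices in indices_by_speaker.items():
        for index in indices:
            if index >= 0:  # -1 means the reference token aligned to a gap
                # If two speakers map to the same index, the later one wins.
                speakers[index] = speaker
    return speakers


sample_indices = {
    "A": [1, 2, 3, 4],
    "B": [0],
    "C": [5, 6],
    "D": [7, 8],
    "E": [9, 10, 11],
}
print(speakers_by_hypothesis_index(sample_indices, 13))
# ['B', 'A', 'A', 'A', 'A', 'C', 'C', 'D', 'D', 'E', 'E', 'E', None]
```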

Troubleshooting

This package currently only supports Windows 10/11 x86_64, Linux x86_64 (tested with Ubuntu 22.04), and macOS (M-series processor or Intel processor).

If you encounter any issues while using align4d, try the following:

  1. Make sure you have installed Python version 3.10 or 3.11.
  2. For compilation, make sure you have a compiler that supports C++20. The compilers can be acquired as follows:
    1. For Windows, install or update to the latest version of Visual Studio with the latest version of Visual C++ (or Visual Studio version >= 17.4.4).
    2. For macOS, install or update to the latest version of Xcode with Apple Clang (or Xcode version >= 14.3 with Apple Clang version >= 14.0.3).
    3. For Linux, install or update to the latest version of GCC with G++ (or GCC version >= 11.2.0).
  3. If you have permission or access issues during compilation, manually delete all compiled objects (files ending in .so, .pyd, or .dll) in the package, located in the same directory as align.py.
  4. Do not run align.compile() with other align functions (align(), token_match(), align_indices()) at the same time.
  5. Make sure you have installed the latest version of align4d.
  6. Check the input data to make sure it is in the correct format.
    1. All input strings must be encoded in UTF-8.
    2. Characters that are valid UTF-8 but not part of the natural language may affect the alignment performance. Remove them unless you are sure what these characters are used for.
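The input checks in the last item can be sketched with the standard library. This is one possible cleanup, not something align4d does for you: the helper names `is_utf8` and `clean_text` are hypothetical, and the choice to drop every character in a Unicode "C" category (control, format, and similar) is an assumption about what counts as not part of the natural language.

```python
import unicodedata


def is_utf8(data: bytes) -> bool:
    """Return True if the raw bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def clean_text(text: str) -> str:
    """Drop control/format characters and collapse runs of whitespace."""
    kept = []
    for char in text:
        if char.isspace():
            kept.append(" ")  # normalize tabs and newlines to spaces
        elif unicodedata.category(char).startswith("C"):
            continue  # skip control (Cc), format (Cf), and similar characters
        else:
            kept.append(char)
    return " ".join("".join(kept).split())


print(clean_text("Hello\u200b there\x00\n How\tare you"))  # Hello there How are you
```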