Korean text normalization and language preparation package for LM in Kaldi-based ASR system
Updated Apr 23, 2020 - Python
Simple-to-use scoring function for arbitrarily tokenized texts.
Keyword Search Recipe for Subword ASR
Subword-augmented Embedding for Cloze Reading Comprehension (COLING 2018)
johnny - a neural network graph-based DEPendency Parser
Effective Subword Segmentation for Text Comprehension (TASLP 2019)
Unsupervised Word Segmentation using Minimum Description Length for Neural Machine Translation (NMT)
An implementation of the subword division algorithm proposed by T. Mikolov (2012).
A framework for generating subword vocabulary from a tensorflow dataset and building custom BERT tokenizer models.
This repository contains source code implementation of assignments for NTU's MSAI course AI6127 on Deep Neural Networks for Natural Language Processing (2019 Sem 2).
A causal intervention framework to learn robust and interpretable character representations inside subword-based language models
The concept of DAWGs is based on: Blumer, A. et al. (1985). The smallest automaton recognizing the subwords of a text. Theoretical Computer Science, 40, 31–55.
Tokenization separates a piece of text into smaller units called tokens, which can be words, characters, or subwords. It can therefore be broadly classified into three types: word, character, and subword (character n-gram) tokenization.
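The three tokenization types described above can be illustrated with a minimal sketch. The function names (`word_tokenize`, `char_tokenize`, `char_ngram_tokenize`) and the default n-gram size of 3 are assumptions made for this example and do not come from any of the listed repositories. The subword variant shown is the plain character n-gram sense; learned schemes such as BPE build their vocabulary from data and behave differently.

```python
def word_tokenize(text):
    """Naive word tokenization: split on whitespace only."""
    return text.split()


def char_tokenize(text):
    """Character tokenization: every character (spaces included) is a token."""
    return list(text)


def char_ngram_tokenize(word, n=3):
    """Subword tokenization as overlapping character n-grams of a single word.

    Words no longer than n characters are returned whole.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


if __name__ == "__main__":
    text = "subword models"
    print(word_tokenize(text))
    print(char_tokenize(text))
    print([char_ngram_tokenize(w) for w in word_tokenize(text)])
```

Word tokens keep the vocabulary readable but cannot represent unseen words, character tokens cover everything at the cost of long sequences, and subword tokens sit between the two.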
Subword Neural Machine Translation