the Tokens we have In Common

WORK IN PROGRESS

Tokens-In-Common is a package for efficient autoregressive language model inference on multiple variants of the same text that differ from each other in pivotal ways, but nonetheless have a (large) part of their tokens in common.

The primary use case kept in mind when developing this package is to study the downstream effects of choosing one word/phrase/answer/etc. over another.

The package lets you represent such texts as graphs, with the variants as branches.

You can construct these trees easily from the data and substitutions you have in mind.

A naive and inefficient way to apply an autoregressive LM to such a tree would be to run it on each branch independently. But this would mean that the earlier parts of the text, which many variants have in common, are processed multiple times unnecessarily.

Instead, we can change the attention mask so that the model processes an entire tree in a single forward pass. We do this by having each token attend only to the tokens that precede it in its own branch.
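To make the masking rule concrete, here is a minimal sketch in plain Python. It is not this package's implementation: the function name `tree_attention_mask` and the parent-pointer encoding of the tree are assumptions chosen for illustration.

```python
def tree_attention_mask(parents):
    """Return mask[i][j] == True iff token i may attend to token j.

    parents[i] is the index of the token that precedes token i in its
    branch, or None for the first token. Illustrative encoding only;
    every parent must appear before its children.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i, parent in enumerate(parents):
        mask[i][i] = True  # a token always attends to itself
        if parent is not None:
            # Inherit everything the parent can see (its ancestors and itself).
            for j in range(i):
                if mask[parent][j]:
                    mask[i][j] = True
    return mask


# "The cat sat" and "The dog ran" share the token "The".
tokens = ["The", "cat", "sat", "dog", "ran"]
parents = [None, 0, 1, 0, 3]
mask = tree_attention_mask(parents)
for token, row in zip(tokens, mask):
    print(f"{token:>4}", "".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the row for "ran" marks "The", "dog", and "ran" but not "cat" or "sat", so the second branch never sees the first one.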

But before we can do this, we have to tokenize the text.

This step should be as effortless as it normally is, even when a token is formed from characters that lie in multiple vertices of the graph, as in the example below:

[Figure: the example graph shown as text and as tokens]
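The following sketch illustrates how a token can span two vertices. It is an assumption-laden toy, not the package's algorithm: `toy_tokenize` is a greedy longest-match stand-in for a real tokenizer, the `(text, children)` tuples are a hypothetical tree encoding, and the approach shown (tokenize every variant in full, then merge shared token prefixes) is just one simple way to obtain a tree of tokens.

```python
def toy_tokenize(text, vocab):
    """Greedy longest-match tokenizer; a stand-in for a real tokenizer."""
    tokens, start = [], 0
    while start < len(text):
        for end in range(len(text), start, -1):
            if text[start:end] in vocab:
                tokens.append(text[start:end])
                start = end
                break
        else:
            # Fall back to a single character so the loop always advances.
            tokens.append(text[start])
            start += 1
    return tokens


def paths(node, prefix=""):
    """Yield the full text of every root-to-leaf path of a (text, children) tree."""
    text, children = node
    if not children:
        yield prefix + text
    for child in children:
        yield from paths(child, prefix + text)


def token_trie(text_tree, vocab):
    """Tokenize each variant in full, then merge shared token prefixes into nested dicts."""
    trie = {}
    for variant in paths(text_tree):
        level = trie
        for token in toy_tokenize(variant, vocab):
            level = level.setdefault(token, {})
    return trie


vocab = {"I", " like", " liked", " cats", " it"}
# The root vertex "I like" has two continuations: " cats" and "d it".
text_tree = ("I like", [(" cats", []), ("d it", [])])
trie = token_trie(text_tree, vocab)
print(trie)
```

Here the token " liked" is built from characters in the root vertex and in the "d it" vertex, so the two variants share only the token "I" even though they share the text "I like".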

Workflow

Create a Multitree representation from (Unicode) text, using ...;
Provide the tokenizer of your choice, which is used to build a multitree of tokens;
Convert the multitree of tokens to tensors of input_ids, position_ids, and attention_mask (specifying attention between each pair of tokens);
Call the Hugging Face model implementation of your choice.
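The conversion step in the workflow above can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the package's API: `flatten_token_tree` is a hypothetical helper, the token tree is encoded as nested dicts keyed by made-up token ids, and lists stand in for tensors.

```python
def flatten_token_tree(trie):
    """Flatten a nested-dict token tree into parallel lists.

    Returns (input_ids, position_ids, attention_mask), where
    attention_mask[i][j] is 1 iff token i may attend to token j.
    """
    input_ids, position_ids, visible = [], [], []

    def visit(level, path):
        for token_id, children in level.items():
            index = len(input_ids)
            input_ids.append(token_id)
            # The position is the depth in the tree, not the flat index,
            # so every branch continues counting from the shared prefix.
            position_ids.append(len(path))
            visible.append(path + [index])
            visit(children, path + [index])

    visit(trie, [])
    n = len(input_ids)
    attention_mask = [[0] * n for _ in range(n)]
    for i, allowed in enumerate(visible):
        for j in allowed:
            attention_mask[i][j] = 1
    return input_ids, position_ids, attention_mask


# Hypothetical token ids: 10 is shared, then 11 -> 12 or 21 -> 22.
token_tree = {10: {11: {12: {}}, 21: {22: {}}}}
input_ids, position_ids, attention_mask = flatten_token_tree(token_tree)
print(input_ids)
print(position_ids)
for row in attention_mask:
    print(row)
```

Hugging Face causal language models generally take `input_ids`, `position_ids`, and `attention_mask` as keyword arguments. Whether a particular model accepts a per-pair mask like this one, and in what shape and dtype, depends on the model implementation and library version, so check that before relying on it.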

Also see the examples/ directory.

sfschouten/tokens-in-common
