# Native Node.js tokenizer for RWKV

A zero-dependency tokenizer for the RWKV project.

It should also work for EleutherAI's GPT-NeoX and Pythia models, as they use the same tokenizer.

## Setup

```bash
npm i rwkv-tokenizer-node
```

## Usage

```javascript
const tokenizer = require("rwkv-tokenizer-node");

// Encode into token ints: [12092, 3645, 2]
const tokens = tokenizer.encode("Hello World!");

// Decode back to "Hello World!"
const decoded = tokenizer.decode(tokens);
```
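The token IDs are plain JavaScript numbers, so they can be passed straight to a model binding or used for prompt-length budgeting. A minimal sketch building on the `encode`/`decode` calls above (the helper and the context size are just illustrative, not part of the library):

```javascript
const tokenizer = require("rwkv-tokenizer-node");

// Rough prompt-length budgeting: count tokens before sending text to a model
function fitsInContext(text, contextSize = 1024) {
  return tokenizer.encode(text).length <= contextSize;
}

// Round-trip sanity check on arbitrary UTF-8 input
const sample = "こんにちは世界! Hello World!";
console.log(fitsInContext(sample));                                  // true
console.log(tokenizer.decode(tokenizer.encode(sample)) === sample);  // should print true
```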

Its primary purpose is for use in implementing RWKV-cpp-node, though it could probably be used for other use cases (e.g. a pure-JS implementation of GPT-NeoX or RWKV).

## What can be improved?

- **Performance**: it is somewhat disappointing that this is easily 10x slower than the Python implementation (which, I believe, uses the Rust library under the hood); however, it is generally still fast enough for most use cases (see the benchmark sketch after this list).
- **Why not use the Hugging Face library?** Sadly, the official Hugging Face tokenizer library for Node.js is broken: huggingface/tokenizers#911
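For anyone wanting to dig into the performance gap, a rough micro-benchmark along these lines (hypothetical, not part of the repo; the sample file path is an assumption) gives a baseline tokens-per-second figure to compare against the Python/Rust tokenizer:

```javascript
const fs = require("fs");
const tokenizer = require("rwkv-tokenizer-node");

// Hypothetical micro-benchmark: encode a local UTF-8 text file repeatedly
// and report throughput. Point `samplePath` at any sizeable text file.
const samplePath = "./sample.txt";
const text = fs.readFileSync(samplePath, "utf8");

const start = process.hrtime.bigint();
let tokenCount = 0;
for (let i = 0; i < 100; i++) {
  tokenCount += tokenizer.encode(text).length;
}
const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;

console.log(`${tokenCount} tokens in ${elapsedMs.toFixed(1)} ms`);
console.log(`~${Math.round(tokenCount / (elapsedMs / 1000))} tokens/sec`);
```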

PS: Anyone who has ideas on how to improve its performance without failing the test suite is welcome to do so.

## How to run the test?

```bash
# This runs the sole test file: test/tokenizer.test.js
npm run test
```

The Python script used to seed the reference data (using the Hugging Face tokenizer) can be found at test/build-test-token-json.py. The test includes a very extensive UTF-8 test file covering all major (and many minor) languages.
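Conceptually, the test just compares this tokenizer's output against the reference data emitted by that Python script, along the lines of the sketch below. The reference file name and shape here are assumptions; the real assertions live in test/tokenizer.test.js.

```javascript
// Conceptual sketch only - not the actual test file.
// Assumes the Python script wrote an array of { text, tokens } pairs to JSON.
const assert = require("assert");
const tokenizer = require("rwkv-tokenizer-node");
const reference = require("./reference-tokens.json"); // hypothetical file name

for (const { text, tokens } of reference) {
  assert.deepStrictEqual(tokenizer.encode(text), tokens);
  assert.strictEqual(tokenizer.decode(tokens), text);
}
```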

## Designated maintainer

@picocreator is the current maintainer of the project; ping him on the RWKV Discord if you have any questions about this project.

## Special thanks & references

@saharNooby - whose implementation the current one is heavily based on

@cztomsik @josephrocca @BlinkDL - for their various implementations, which were used as references to squash mismatched encodings against the HF implementation.