Skip to content

Latest commit

 

History

History
276 lines (193 loc) · 10.7 KB

README.md

File metadata and controls

276 lines (193 loc) · 10.7 KB

string-encode Build Status codecov

  • Convert different types of JavaScript String to/from Uint8Array.
  • Check for String encoding.

The main target of this library is the Browser, where there is no Buffer type.

Node.js is welcome too, except for toString('base64') which depends on btoa. See Node.js equivalents.

Install

npm i -S string-encode

Or add it directly to the browser:

<script src="https://unpkg.com/string-encode"></script>
<script>
const { str2buffer, buffer2str /* ... */ } = stringEncode;
// ...
</script>

Usage

str2buffer() and buffer2str()

The most important functions of this library are str2buffer(str, asUtf8) and buffer2str(buf, asUtf8) for converting any String, including multibyte, to and from Uint8Array.

import { str2buffer, buffer2str } from 'string-encode';

// When you know your string doesn't contain multibyte characters:
let buffer = str2buffer(binaryString, false);
// ... do something with buffer ...
let processedSting = buffer2str(buffer, false);

// When you know your string might contain multibyte characters:
let buffer = str2buffer(mbString, true);
// ...
let processedMbString = buffer2str(buffer, true);

// Let it guess whether to utf8 encode/decode or not - not recommended:
let buffer = str2buffer(anyStr);
// ...
let processedSting = buffer2str(buffer);

Example: sha1

Simple sha1 function using crypto for Browser, that works with String and is compatible with the PHP counterpart:

import { str2buffer, toString } from 'string-encode';

const crypto = window.crypto || window.msCrypto || window.webkitCrypto;
const subtle = crypto.subtle || crypto.webkitSubtle;

async function sha1(str, enc='hex') {
    let buf = str2buffer(str, true);
    buf = await subtle.digest('SHA-1', buf);
    buf = new Uint8Array(buf);
    return toString.call(buf, enc);
}

How to use this sha1 function:

await sha1('something');        // "1af17e73721dbe0c40011b82ed4bb1a7dbe3ce29"
await sha1('something', false); // "\u001añ~sr\u001d¾\f@\u0001\u001b\u0082íK±§ÛãÎ)"
await sha1('что-то');           // "991fe0590dfec23402d71c0e817bc7a7ab217e2b"
await sha1('что-то', 'base64'); // "mR/gWQ3+wjQC1xwOgXvHp6shfis="

utf8Encode(str) and utf8Decode(str)

Example: btoa/atob

Base64 encode/decode a multibyte string:

import { utf8Encode, utf8Decode } from 'string-encode';

btoa(utf8Encode('⚔ или 😄')); // "4pqUINC40LvQuCDwn5iE"
utf8Decode(atob('4pqUINC40LvQuCDwn5iE')); // "⚔ или 😄"

Node.js equivalents

string-encode in Browser Buffer in Node.js
str2buffer(str, false) Buffer.from(str, 'binary')
str2buffer(str, true) Buffer.from(str, 'utf8')
hex2buffer(str) Buffer.from(str, 'hex')
str2buffer(atob(str), false) Buffer.from(str, 'base64')
- -
buffer2str(str, false) Buffer.toString('binary')
buffer2str(str, true) Buffer.toString('utf8')
buffer2hex(str) Buffer.toString('hex')
btoa(buffer2str(str, false)) Buffer.toString('base64')

.toString()

If you want your Uint8Array to be one step closer to the Node.js's Buffer, just add the .toString() method to it.

import { toString } from 'string-encode';

let buf = Uint8Array.from([65, 108, 111, 104, 97, 44]);
buf.toString = toString; // the magic method

console.log(buf + ' world!');
buf.toString('hex');    // "416c6f68612c"
buf.toString('base64'); // "QWxvaGEs"

Besides encoding/decoding, there are few more functions for testing string encoding.


The theory of String 😉

A JavaScript String is a unicode string, which means that it is a list of unicode characters, not a list of bytes! And it does not map one-to-one to an array of bytes without some encoding either. This is because a unicode character requires 3 bytes to be able to encode any of the growing list of about 144 000 symbols. Thus String is not the best data type for working with binary data.

This is the main reason why the Node.js devs have come up with the Buffer type. Later on there have been invented the TypedArray standard to the rescue and the Node.js devs have adopted the new type, namely Uint8Array, as the parent type for the existing Buffer type, starting with Node.js v4.

Meanwhile there have been written many libraries to encode, encrypt, hash or otherwise transform the data, all using the plain String type that was available to the community since the beginning of JS.

Even some browser built-in functions that came before the TypedArray standard rely on the String type to do their encoding (eg. btoa == "binary to ASCII").

Today, if you want to manipulate some bytes in JavaScript, you most likely need a Uint8Array instead of a String for best performance and compatibility with other environments and tools.

String kinds (or encodings)

Judging by content, there are a few kinds of JS Strings used in almost all applications.

Binary

Any String that do not contain multibyte characters can be considered a binary string. In other words, each character's code is in the range [0..255]. These strings can be mapped one-to-one to arrays of bytes, which Uint8Arrays basically are.

const binStr = 'when © × ® = ?';
isBinary(binStr); // true
hasMultibyte(binStr); // false
btoa(binStr); // "qSBpcyCu"
str2buffer(binStr); // Uint8Array([169, 32, 105, 115, 32, 174])

Most old-fashion encoding functions accept only this type of strings (eg. btoa).

Multibyte

In JS the most common string is a Multibyte string, one that contains unicode characters, which require more than a byte of memory.

const mbStr = '$ ⚔ ₽ 😄 € ™';
isBinary(mbStr); // false
hasMultibyte(mbStr); // '⚔'
ord(mbStr[2]); // 9876

Most encoding algorithms would not accept a multibyte String.

If you try to run btoa('€'), you'll get an error like:

Uncaught DOMException:
    Failed to execute 'btoa' on 'Window':
        The string to be encoded contains characters outside of the Latin1 range.

Because is a multibyte character.

The solution is to encode the multibyte string into a singe-byte string somehow.

UTF8 encoded

UTF8 is the most widely used byte encoding of unicode/multibyte strings in computers today. It is the default encoding of web pages that travel over the wire (content-type: text/html; charset=UTF-8) and the default in many programing languages. The important feature of UTF8 is that it is fully compatible with ASCII strings, which means any ASCII string is also a valid UTF8 encoded string. Unless you need symbols outside the ASCII table, this encoding is very compact, and uses more than a byte per character only where needed.

In fact, UTF8 should be the default choice of encoding you use in a program.

const mbStr = '$ ⚔ ₽ 😄 € ™';
const utf8Str = utf8Encode(mbStr);
isBinary(utf8Str); // true
isUTF8(utf8Str); // true

isUTF8(asciiStr); // true

btoa(utf8Str); // '4oK9IOKalCAkIPCfmIQg4oKsIOKEog=='
str2buffer(utf8Str); // Uint8Array([226, 130, 189, 32, 226, 154, 148, 32, 36, 32, 240, 159, 152, 132, 32, 226, 130, 172, 32, 226, 132, 162])

Even though utf8Str is still of type String, it is no longer a multibyte string, and thus can be manipulated as an array of bytes.

ASCII

A subset of binary strings is ASCII only strings, which represent the class of strings with character codes in the range [0..127]. Each ASCII character can be represented with only 7 bits.

const asciiStr = 'Any text using the 26 English letters, digits and punctuation!';
isASCII(asciiStr); // true

isASCII(binStr); // false
isASCII(utf8Str); // false

String Types Table

All table headings are functions exported by this library.

String guessEncoding hasMultibyte isBinary isASCII isUTF8 utf8bytes
"" hex false true true true 0
"English alphabet is 26" ascii false true true true 0
"$ ⚔ ₽ 😄 € ™" mb "⚔" false false false false
utf8Encode("$ ⚔ ₽ 😄 € ™") utf8 false true false true 16
"when © × ® = ?" binary false true false false false
"Xש" utf8 false true false true 2
utf8Decode("Xש") mb "Xש" false false false false
"© binary? ×" ~utf8 false true false false false | 2

I did not add the isHEX column because it is a trivial format - you can't confuse it with the others.

Note 1:

Sometimes you can't tell whether the string has been utf8Encodeed or it is just a unicode string that by coincidence is also a valid utf8 string.

In the table above "Xש" could be the original string or could be the encoded string.

Note 2:

When slicing utf8 encoded strings, you might cut a multibyte character in half. What you get as a result could be considered a valid utf8 string, with async utf8 characters at the edges.

In the table above "© binary? ×" is such a slice. The "©" symbol could be the last byte of a utf8 encoded character, and "×" - the first of the two bytes of another character.


To be continued...


Further reading: