Commit
WIP UTF-8 moved to Core and JSON minorities, prep for more
boazsegev committed Sep 28, 2024
1 parent fd3533f commit 024be51
Showing 14 changed files with 1,052 additions and 1,100 deletions.
1,039 changes: 522 additions & 517 deletions fio-stl.h

Large diffs are not rendered by default.

39 changes: 5 additions & 34 deletions fio-stl.md
@@ -3268,13 +3268,9 @@ This will use `FIO_FD_FIND_BLOCK` bytes on the stack to read the file in a loop.
#include "fio-stl.h"
```

The facil.io JSON parser is a non-strict parser, with support for trailing commas in collections, new-lines in strings, extended escape characters, comments, and octal, hex and binary numbers.
The facil.io JSON parser is a non-strict parser, with support for trailing commas in collections, new-lines in strings, extended escape characters, comments, and common numeral formats (octal, hex and binary).

The parser allows for streaming data and decouples the parsing process from the resulting data-structure by calling static callbacks for JSON related events.

To use the JSON parser, define `FIO_JSON` before including the `fio-stl.h` file and later define the static callbacks required by the parser (see list of callbacks).

**Note**: the FIOBJ soft types already use the JSON parser. For this reason, another JSON parser can't be implemented in the same translation unit as the FIOBJ implementation. To use another JSON parser, implement it in a different C file than the one where the FIOBJ types are implemented.
The facil.io JSON parser should be considered **unsafe** as overflow protection depends on the `NUL` character appearing at the end of the string passed to the parser.

**Note:** this module depends on the `FIO_ATOL` module which will be automatically included.
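
Because overflow protection relies on the terminating `NUL`, input that arrives as a length-delimited buffer (for example, bytes read from a socket) would need a terminator before it is handed to the parser. The following is a minimal sketch of that preparation step only; `json_dup_with_nul` is a hypothetical helper, not part of facil.io, and the parser call itself is left out because its signature is not shown here.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not a facil.io function): copies `len` bytes from
 * `data` into a fresh buffer and appends the NUL byte the parser relies on.
 * Returns NULL on allocation failure or size overflow; caller frees. */
static char *json_dup_with_nul(const char *data, size_t len) {
  char *copy;
  if (len == SIZE_MAX) /* len + 1 would overflow */
    return NULL;
  copy = (char *)malloc(len + 1);
  if (!copy)
    return NULL;
  if (len)
    memcpy(copy, data, len);
  copy[len] = '\0'; /* guarantees the parser finds a terminator */
  return copy;
}
```

The copy costs an allocation; when the buffer is owned by the caller and has spare capacity, writing the terminator in place avoids it.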

@@ -3286,35 +3282,8 @@ To use the JSON parser, define `FIO_JSON` before including the `fio-slt.h` file
#define FIO_JSON_MAX_DEPTH 512
#endif
```
The JSON parser isn't recursive, but it allocates a nesting bitmap on the stack, which consumes stack memory.

To ensure the stack isn't abused, the parser will limit JSON nesting levels to a customizable `FIO_JSON_MAX_DEPTH` number of nesting levels.


The JSON parser type. Memory must be initialized to 0 before first use (see `FIO_JSON_INIT`).

The type should be considered opaque. To add user data to the parser, use C-style inheritance and pointer arithmetic or simple type casting.

e.g.:

```c
typedef struct {
fio_json_parser_s private;
int my_data;
} my_json_parser_s;
// void use_in_callback (fio_json_parser_s * p) {
// my_json_parser_s *my = (my_json_parser_s *)p;
// }
```

#### `FIO_JSON_INIT`

```c
#define FIO_JSON_INIT \
{ .depth = 0 }
```

A convenient macro that could be used to initialize the parser's memory to 0.
To ensure the program's stack isn't abused, the parser will limit JSON nesting levels to a customizable `FIO_JSON_MAX_DEPTH` number of nesting levels.
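
To illustrate how a depth limit bounds stack use in a non-recursive parser, here is a minimal sketch of a fixed-size nesting record with one bit per level. It is not the facil.io implementation (that code is in the unrendered `fio-stl.h` diff); `nesting_s`, `nesting_push`, `nesting_pop` and `SKETCH_MAX_DEPTH` are hypothetical names chosen for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_DEPTH 512 /* assumption: mirrors the default shown above */

/* Hypothetical nesting tracker: one bit per level (1 = object, 0 = array).
 * The record has a fixed size, so it can live on the stack safely. */
typedef struct {
  size_t depth;
  uint8_t bitmap[(SKETCH_MAX_DEPTH + 7) / 8];
} nesting_s;

/* Enters a nested collection; returns -1 once the depth limit is reached. */
static int nesting_push(nesting_s *n, int is_object) {
  uint8_t bit;
  if (n->depth >= SKETCH_MAX_DEPTH)
    return -1;
  bit = (uint8_t)(1U << (n->depth & 7));
  if (is_object)
    n->bitmap[n->depth >> 3] |= bit;
  else
    n->bitmap[n->depth >> 3] &= (uint8_t)~bit;
  ++n->depth;
  return 0;
}

/* Leaves the innermost collection.
 * Returns 1 for an object, 0 for an array, -1 if nothing is open. */
static int nesting_pop(nesting_s *n) {
  if (!n->depth)
    return -1;
  --n->depth;
  return (n->bitmap[n->depth >> 3] >> (n->depth & 7)) & 1;
}
```

With the default limit of 512 levels the bitmap in this sketch occupies 64 bytes, which is why raising the limit stays cheap compared with a recursive design.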

### JSON parser API

Expand Down Expand Up @@ -5683,6 +5652,8 @@ If the String isn't UTF-8 valid up to the requested selection, than `pos` will b

The returned `len` value may be shorter than the original if there wasn't enough data left to accommodate the requested length. When a `len` value of `0` is returned, this means that `pos` marks the end of the String.

If `pos` is negative, the position is counted backwards from the end of the String (`-1` is the position of the last UTF-8 character).

Returns -1 on error and 0 on success.
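
The positional rules above (character positions rather than byte offsets, with negative values counting from the end) can be sketched as a small standalone function. This is an illustration only: `utf8_pos_to_offset` is a hypothetical name, its signature is not the library's, and it assumes the input is already valid UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: byte offset of character `pos` within the `len`-byte
 * UTF-8 string `s`. Negative positions count from the end (-1 is the last
 * character). A position equal to the character count maps to `len`, the
 * end of the string. Returns -1 when the position is out of range. */
static long utf8_pos_to_offset(const char *s, size_t len, long pos) {
  size_t i;
  if (pos >= 0) {
    i = 0;
    while (i < len) {
      if (!pos)
        return (long)i;
      --pos;
      ++i; /* step over the lead byte */
      while (i < len && ((unsigned char)s[i] & 0xC0U) == 0x80U)
        ++i; /* step over continuation bytes (10xxxxxx) */
    }
    return pos ? -1 : (long)len;
  }
  i = len;
  while (i) {
    --i;
    while (i && ((unsigned char)s[i] & 0xC0U) == 0x80U)
      --i; /* back up to the lead byte of this character */
    if (!++pos)
      return (long)i;
  }
  return -1;
}
```

For the string `aé€` (six bytes), position `2` and position `-1` both resolve to byte offset 3, the start of `€`.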

### Core String C / JSON escaping
80 changes: 80 additions & 0 deletions fio-stl/000 core.h
@@ -750,6 +750,86 @@ typedef struct fio_buf_info_s {
.buf = fio___stack_mem___##name, .capa = (capacity) \
}

/* *****************************************************************************
UTF-8 Support (basic)
***************************************************************************** */

/* Returns the number of bytes required to UTF-8 encode a code point `u` */
#define FIO_UTF8_CODE_LEN(u) \
(((size_t)((u) > ((1U << 21) - 1)) - 1) & \
(1U + ((u) > 127) + ((u) > 2047) + ((u) > 65535)))

/* Returns the number of valid UTF-8 bytes at pointer `str` (0 if invalid). */
#define FIO_UTF8_CHAR_LEN(str) \
((((((*(str)) & 0xF8U) == 0xF0U) & \
((((str)[(((*(str)) & 0xF8U) == 0xF0U) /* 1||0 */]) & 0xC0U) == 0x80U) & \
((((str)[(((*(str)) & 0xF8U) == 0xF0U) << 1]) & 0xC0U) == 0x80U) & \
((((str)[(((*(str)) & 0xF8U) == 0xF0U) | \
((((*(str)) & 0xF8U) == 0xF0U) << 1)]) & \
0xC0U) == 0x80U)) \
<< 2) | \
((((*(str)&0xF0U) == 0xE0U) & \
(((((str)[(((*(str)) & 0xF0U) == 0xE0U) /* 1||0 */]) & 0xC0U) == 0x80U) & \
((((str)[(((*(str)) & 0xF0U) == 0xE0U) << 1]) & 0xC0U) == 0x80U))) * \
3) | \
((((*(str)&0xE0U) == 0xC0U) & \
((((str)[((*(str)&0xE0U) == 0xC0U) /* 1 or 0 */]) & 0xC0U) == 0x80U)) \
<< 1) | \
(((*(str)) & 0x80U) == 0U))

/*
* Writes code point `u` to `dest`, assuming `dest` is a `uint8_t` pointer.
*
* Use:
*
* FIO_UTF8_WRITE(dest, 0x1D11E, FIO_UTF8_CODE_LEN(0x1D11E))
*
*/
#define FIO_UTF8_WRITE(dest, u, u_code_len) \
do { \
const uint8_t len__ = (u_code_len); \
const uint8_t offset__ = 0xF0U << (4U - len__); \
const uint8_t head__ = 0x80U << (len__ < 2); \
const uint8_t mask__ = 0xFFU >> ((len__ > 1) << 1); \
*(dest) = offset__ | ((u) >> (6 * (len__ - (len__ > 1)))); \
(dest) += (len__ > 1); \
*(dest) = head__ | (((u) >> 12) & mask__); \
(dest) += (len__ > 3); \
*(dest) = head__ | (((u) >> 6) & mask__); \
(dest) += (len__ > 2); \
*(dest) = head__ | ((u)&mask__); \
(dest) += (len__ > 0); \
} while (0)

/*
* Reads code point to `uint32_t` `target` from `uint8_t` pointer `src`.
*
* The `src` pointer will advance by `FIO_UTF8_CHAR_LEN` (0 on error).
*
* Use:
*
* uint32_t target;
* FIO_UTF8_READ(target, str)
*
*/
#define FIO_UTF8_READ(target, src) \
do { \
const uint8_t len__ = FIO_UTF8_CHAR_LEN(src); \
const uint8_t offset__ = ~(0xF0U << (4U - len__)); \
const uint8_t mask__ = ~(0x80U << (len__ < 2)); \
target = (0U - (len__ > 1)) & (*src & offset__); \
target <<= 6; \
src += (len__ > 1); \
target |= (0U - (len__ > 3)) & *src & mask__; \
target <<= ((len__ > 3) << 2) | ((len__ > 3) << 1); \
src += (len__ > 3); \
target |= (0U - (len__ > 2)) & *src & mask__; \
target <<= ((len__ > 2) << 2) | ((len__ > 2) << 1); \
src += (len__ > 2); \
target |= (*src & mask__); \
src += (len__ > 0); \
} while (0)
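
The macros above are written without branches, which makes them compact but hard to follow. As a reading aid, here is a branching sketch of the same encode and decode steps. It is not the library code: `utf8_code_len`, `utf8_write` and `utf8_read` are hypothetical names, they return lengths instead of advancing the pointer, and `utf8_read` stops at the first non-continuation byte rather than testing all bytes at once. Like the macros, it does not reject overlong forms or surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode `u`; 0 if `u` does not fit in 21 bits. */
static size_t utf8_code_len(uint32_t u) {
  if (u > 0x1FFFFFU)
    return 0;
  return (size_t)1 + (u > 127) + (u > 2047) + (u > 65535);
}

/* Writes `u` to `dest`; returns the number of bytes written (0 on error). */
static size_t utf8_write(uint8_t *dest, uint32_t u) {
  const size_t len = utf8_code_len(u);
  switch (len) {
  case 1:
    dest[0] = (uint8_t)u;
    break;
  case 2:
    dest[0] = (uint8_t)(0xC0U | (u >> 6));
    dest[1] = (uint8_t)(0x80U | (u & 0x3FU));
    break;
  case 3:
    dest[0] = (uint8_t)(0xE0U | (u >> 12));
    dest[1] = (uint8_t)(0x80U | ((u >> 6) & 0x3FU));
    dest[2] = (uint8_t)(0x80U | (u & 0x3FU));
    break;
  case 4:
    dest[0] = (uint8_t)(0xF0U | (u >> 18));
    dest[1] = (uint8_t)(0x80U | ((u >> 12) & 0x3FU));
    dest[2] = (uint8_t)(0x80U | ((u >> 6) & 0x3FU));
    dest[3] = (uint8_t)(0x80U | (u & 0x3FU));
    break;
  default:
    break; /* len == 0: nothing is written */
  }
  return len;
}

/* Reads one code point from `src` into `*out`.
 * Returns the number of bytes consumed, or 0 on a malformed sequence. */
static size_t utf8_read(const uint8_t *src, uint32_t *out) {
  const uint8_t b = src[0];
  size_t len, i;
  uint32_t u;
  if (b < 0x80U) {
    *out = b;
    return 1;
  }
  if ((b & 0xE0U) == 0xC0U) {
    len = 2;
    u = b & 0x1FU;
  } else if ((b & 0xF0U) == 0xE0U) {
    len = 3;
    u = b & 0x0FU;
  } else if ((b & 0xF8U) == 0xF0U) {
    len = 4;
    u = b & 0x07U;
  } else {
    return 0; /* stray continuation byte or invalid lead byte */
  }
  for (i = 1; i < len; ++i) {
    if ((src[i] & 0xC0U) != 0x80U)
      return 0; /* expected a 10xxxxxx continuation byte */
    u = (u << 6) | (src[i] & 0x3FU);
  }
  *out = u;
  return len;
}
```

Encoding the code point from the macro's usage comment, `0x1D11E`, yields the four bytes `F0 9D 84 9E`, and reading those bytes back returns the same code point.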

/* *****************************************************************************
Linked Lists Persistent Macros and Types
***************************************************************************** */