
Compression for serialized objects #886

Open

dselman opened this issue Jul 25, 2024 · 1 comment
Labels: Difficulty: Starter · Type: Enhancement ✨ Improvement to process or efficiency · Type: Feature Request 🛍️ New feature or request


dselman (Contributor) commented Jul 25, 2024

Feature Request 🛍️

Support compression of serialized Concerto objects.

Use Case

ASTs, and serialized objects in general, are verbose. They compress well due to repeated JSON properties such as $class.

Possible Solution

Provide compress/decompress functions within Concerto core or util.

Context

  • Working with large models can lead to HTTP timeouts or other storage issues

Detailed Description

Two approaches, which may be complementary, have been explored.

Class Map

This specifically targets the $class properties within the JSON objects produced by the Serializer. The JSON tree is visited to build a Map of all $class values in the JSON. $class entries that start with the same prefix as the root $class are shortened by removing the common prefix.

This map is used to replace the $class properties with indexes into the map, resulting in a JSON object that looks like:

{
  "$class": "1",
  "models": [
    {
      "$class": "2",
      "decorators": [],
      "namespace": "[email protected]",
      "imports": [],
      "declarations": [
        {
          "$class": "3",
          "name": "SSN",
          "location": {
            "$class": "4",
            "start": {
              "offset": 22,
              "line": 3,
              "column": 1,
              "$class": "5"
            },
            "end": {
              "offset": 124,
              "line": 9,
              "column": 1,
              "$class": "5"
            }
          }
        }
      ]
    }
  ],
  "$version": 1,
  "$classMap": {
    "1": ".Models",
    "2": ".Model",
    "3": ".StringScalar",
    "4": ".Range",
    "5": ".Position",
    "6": ".Decorator",
    "7": ".ConceptDeclaration",
    "8": ".StringProperty",
    "9": ".DecoratorString",
    "10": ".ObjectProperty",
    "11": ".TypeIdentifier",
    "12": ".IntegerProperty",
    "13": ".MapDeclaration",
    "14": ".StringMapKeyType",
    "15": ".StringMapValueType",
    "16": ".EnumDeclaration",
    "17": ".EnumProperty"
  },
  "$prefix": "[email protected]"
}
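
For illustration, here is a minimal sketch of how this transform could be implemented, written in TypeScript. The function name compressClassMap and the overall shape are hypothetical, not existing Concerto API; only the $classMap/$prefix output format follows the example above:

type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Sketch only: builds the class map and replaces $class values with indexes.
function compressClassMap(root: { [key: string]: Json }): { [key: string]: Json } {
  // The shared prefix is everything before the final "." of the root $class.
  const rootClass = root.$class as string;
  const prefix = rootClass.substring(0, rootClass.lastIndexOf("."));
  const classMap: { [index: string]: string } = {};
  const indexes = new Map<string, string>();

  const visit = (node: Json): Json => {
    if (Array.isArray(node)) return node.map(visit);
    if (node && typeof node === "object") {
      const copy: { [key: string]: Json } = {};
      for (const [key, value] of Object.entries(node)) {
        if (key === "$class" && typeof value === "string") {
          // Shorten values sharing the root prefix, then intern into the map.
          const short = value.startsWith(prefix) ? value.slice(prefix.length) : value;
          let index = indexes.get(short);
          if (index === undefined) {
            index = String(indexes.size + 1);
            indexes.set(short, index);
            classMap[index] = short;
          }
          copy[key] = index;
        } else {
          copy[key] = visit(value);
        }
      }
      return copy;
    }
    return node;
  };

  const result = visit(root) as { [key: string]: Json };
  result.$classMap = classMap;
  result.$prefix = prefix;
  return result;
}

Decompression would invert the transform: look each index up in $classMap, then re-apply $prefix to entries that start with ".".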

LZ Compression

LZ compression is applied to the JSON object (either the source object as-is, or the object after the Class Map transform), resulting in a JSON object that looks like:

{
  "compressed": "ᯡࠩƬ΀䌦㧤Ɛ䄣氧ァ☢㠥暠㨡㛻熤娠䷒䀠䁦ᄠ᥺၌䛛ࠣK嚴≄ú ",
  "format": "LZ_UTF16"
}
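
Here is a sketch of how the LZ step might look, assuming the lz-string npm package (the issue does not name a library; lz-string's compressToUTF16 happens to match the LZ_UTF16 format shown above, and the function names here are illustrative):

import LZString from "lz-string";

// Sketch only: wraps/unwraps the { compressed, format } envelope shown above.
function lzCompress(obj: object): { compressed: string; format: string } {
  return {
    compressed: LZString.compressToUTF16(JSON.stringify(obj)),
    format: "LZ_UTF16",
  };
}

function lzDecompress(envelope: { compressed: string; format: string }): object {
  if (envelope.format !== "LZ_UTF16") {
    throw new Error(`Unsupported format: ${envelope.format}`);
  }
  const json = LZString.decompressFromUTF16(envelope.compressed);
  if (json === null) {
    throw new Error("Decompression failed");
  }
  return JSON.parse(json);
}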

Results

  • Class Map: approximately 1.6x compression
  • Class Map + LZ: approximately 12x compression
  • LZ only: approximately 10x compression

dselman added the Type: Feature Request 🛍️ New feature or request, Difficulty: Starter, and Type: Enhancement ✨ Improvement to process or efficiency labels Jul 25, 2024
dselman self-assigned this Jul 25, 2024
DS-AdamMilazzo commented Jul 25, 2024

For LZ compression, how is the byte stream converted into a string in your example? I think you'd want to consider two things.

  • First, you may want to consider whether the encoder can produce strings that are invalid UTF-16 (e.g. containing unpaired surrogates), since UTF-16 is commonly used by languages to hold strings in memory. Invalid UTF-16 strings aren't necessarily bad, because they rarely cause actual problems, but it's worth investigating if you're going to be using high Unicode characters like that.
  • Second, given that UTF-8 is pretty much the ubiquitous standard for text encoding in storage and transmission, I think you'd want to be careful that the encoding of the compressed bytes into a string produces Unicode code points that will encode efficiently into UTF-8. For example, code points from 0-127 encode into a single byte, wasting 12.5% of the bits. Code points from 128 to 2047 encode into two bytes, wasting 31.25% of the bits. Code points from 2048 to 65535 encode into three bytes, wasting 33.33% of the bits, and higher ones waste about 34.4%. Basically, you'd be better off encoding everything into low ASCII, except that low ASCII has 34 characters that must be escaped in JSON (making them take either two or six bytes instead of one), including the NUL character, which causes problems with many systems (given that much code treats NULs as terminators). So, you should exclude from the alphabet those characters that must be escaped, especially NUL. (For example, Postgres can't store a string containing NUL.) Excluding those 34 characters means low ASCII wastes almost 18.1% of the bits, but that minimum requires a sophisticated algorithm to achieve. Considering only simple algorithms, base-64 wastes 25% and base-85 wastes 20%. 20% is close enough to 18.1%, so simply using base-85 with a JSON-safe alphabet such as Z85 (which avoids the quote and backslash characters that the standard Ascii85 alphabet includes) is probably your best bet; see the Z85 sketch after this list.
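
For reference, here is a minimal sketch of base-85 encoding with the Z85 alphabet (the ZeroMQ variant). The choice of Z85 is an assumption, picked because its alphabet contains no characters that JSON must escape:

const Z85_ALPHABET =
  "0123456789abcdefghijklmnopqrstuvwxyz" +
  "ABCDEFGHIJKLMNOPQRSTUVWXYZ.-:+=^!/*?&<>()[]{}@%$#";

// Sketch only: encodes each 4-byte group into 5 base-85 characters. Z85
// requires the input length to be a multiple of 4; real code would pad the
// input and record the pad length.
function z85Encode(bytes: Uint8Array): string {
  if (bytes.length % 4 !== 0) {
    throw new Error("Z85 input must be a multiple of 4 bytes");
  }
  let out = "";
  for (let i = 0; i < bytes.length; i += 4) {
    // Pack 4 bytes big-endian into an unsigned 32-bit value.
    let value =
      ((bytes[i] << 24) | (bytes[i + 1] << 16) | (bytes[i + 2] << 8) | bytes[i + 3]) >>> 0;
    // Emit 5 base-85 digits, most significant first (85^5 > 2^32).
    let chunk = "";
    for (let j = 0; j < 5; j++) {
      chunk = Z85_ALPHABET[value % 85] + chunk;
      value = Math.floor(value / 85);
    }
    out += chunk;
  }
  return out;
}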

That said, if you're considering LZ-type compression at all, you may consider storing the result natively in binary if the storage system can handle it, rather than encoding it into a string and then encoding the string into JSON and then encoding the JSON into UTF-8. Storing as binary wastes 0% of the bits. Cosmos DB supported binary attachments, but it's deprecated and they recommend moving to Azure Blob Storage instead; that has the downside of needing to talk to two services. You might consider using a different database than Cosmos DB if you're going to be storing a lot of binary data.

For the class map:

  • You'd almost certainly benefit more from a general string table that applies to both property names and property values.
  • The string table could be represented more efficiently in JSON as an array rather than a dictionary.
  • For $class properties, I'd suggest a mechanism that's probably better than a prefix in general - given that in many cases the prefix will be empty or will be only "com." or something - split the $class into namespace+version and type, and index them both into the string table separately. "[email protected]" might become "0.1" (where 0 is the index of "[email protected]" and 1 is the index of "ConceptDeclaration"). Abbreviated $class values, when implemented, would be recognized by not having a period. A sketch after this list illustrates this.
  • For indexes into the string table, it'd be a good idea to use a base-93 or so alphabet to represent the indexes rather than base-10. That will compress the indexes much better if the string table ends up becoming large.
  • I can also think of a way to avoid the overhead of having a separate string table - which would eliminate the string table section entirely while increasing the benefits - but it requires preservation of property order, which can sometimes be tricky. (I don't know if Cosmos DB will preserve property order - probably not - but if you're using LZ encoding on top then you could do it.) Avoiding a separate string table also has the benefit that it enables the data to be processed in a streaming fashion; otherwise, the client has to receive the string table before it can understand the data.
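
Here is a minimal sketch combining the first and third suggestions above - an array string table, with each $class split into namespace+version and type, interned separately. It is narrowed to $class values for brevity (the suggestion is to intern property names and other values too), and every name in it is illustrative:

function buildStringTable(root: unknown): { data: unknown; $strings: string[] } {
  const strings: string[] = [];
  const indexes = new Map<string, number>();

  // Intern a string, returning its index in the array-form table.
  const intern = (s: string): number => {
    let i = indexes.get(s);
    if (i === undefined) {
      i = strings.length;
      indexes.set(s, i);
      strings.push(s);
    }
    return i;
  };

  const visit = (node: unknown): unknown => {
    if (Array.isArray(node)) return node.map(visit);
    if (node && typeof node === "object") {
      const copy: { [key: string]: unknown } = {};
      for (const [key, value] of Object.entries(node)) {
        if (key === "$class" && typeof value === "string") {
          // Split "namespace@version.Type" at the final dot and intern each
          // part, so a $class might become e.g. "0.1".
          const dot = value.lastIndexOf(".");
          copy[key] = dot < 0
            ? String(intern(value))
            : `${intern(value.slice(0, dot))}.${intern(value.slice(dot + 1))}`;
        } else {
          copy[key] = visit(value);
        }
      }
      return copy;
    }
    return node;
  };

  return { data: visit(root), $strings: strings };
}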
