proposal: secondary indexes #918

Draft: wants to merge 1 commit into base: main
97 changes: 97 additions & 0 deletions website/docs/spec/index.md
@@ -60,9 +60,11 @@ The following records are allowed to appear in the data section:
- [Schema](#schema-op0x03)
- [Channel](#channel-op0x04)
- [Message](#message-op0x05)
- [Secondary Index Key](#secondary-index-key-op0x10)
- [Attachment](#attachment-op0x09)
- [Chunk](#chunk-op0x06)
- [Message Index](#message-index-op0x07)
- [Secondary Message Index](#secondary-message-index-op0x11)
- [Metadata](#metadata-op0x0C)
- [Data End](#data-end-op0x0F)

@@ -82,7 +84,9 @@ The following records are allowed to appear in the summary section:

- [Schema](#schema-op0x03)
- [Channel](#channel-op0x04)
- [Secondary Index Key](#secondary-index-key-op0x10)
- [Chunk Index](#chunk-index-op0x08)
- [Secondary Chunk Index](#secondary-chunk-index-op0x12)
- [Attachment Index](#attachment-index-op0x0A)
- [Metadata Index](#metadata-index-op0x0D)
- [Statistics](#statistics-op0x0B)
@@ -179,6 +183,30 @@ The message encoding and schema must match that of the Channel record correspond
| 8 | publish_time | Timestamp | Time at which the message was published. If not available, must be set to the log time. |
| N | data | Bytes | Message data, to be decoded according to the schema of the channel. |

### Secondary Index Key (op=0x10)
Contributor

I think we should come up with a more future-proof word than "secondary", in case we decide in a V2 to merge the primary indexes into this same record type.


A Secondary Index Key record defines a secondary timestamp index that will be used in this file.
Secondary Indexes can be used to quickly look up messages by timestamps other than `log_time`.
The `name` field identifies the timestamp key that messages will be indexed by. The [registry](./registry.md#secondary-index-keys) lists well-known secondary index key names.

A Secondary Index Key record must appear before any [Secondary Message Index](#secondary-message-index-op0x11) records
in the data section with this `secondary_index_id`.

Secondary Index Key records in the Data section must also appear in the Summary section, before
any [Secondary Chunk Index](#secondary-chunk-index-op0x12) records with this `secondary_index_id`.

| Bytes | Name | Type | Description |
| ----- | ------------------ | ------ | ----------------------------------------------------------------- |
| 2 | secondary_index_id | uint16 | A unique identifier for this secondary index within the file. |
| 4 + N | name | string | A name that describes the key, e.g. `publish_time`, `header.stamp` |

> Why do Secondary Index Key records appear in the Data section?
> When reading using an index, the Secondary Index Key would be read out of the Summary section
> before reading into the Data section. This means that the Secondary Index Key in the Data section
> is not normally used. However, if an MCAP is truncated and the summary section is lost, having the
> Secondary Index Key appear before any Secondary Message Index records allows the MCAP to be fully
> recovered.
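
As an illustration of the record layout in the table above, here is a minimal Python sketch that serializes and parses a Secondary Index Key record. It assumes the usual MCAP record framing (a 1-byte opcode followed by a little-endian uint64 content length) and a uint32 length-prefixed UTF-8 string. The function names `encode_secondary_index_key` and `decode_secondary_index_key` are hypothetical and do not come from any MCAP library.

```python
from __future__ import annotations

import struct

OP_SECONDARY_INDEX_KEY = 0x10  # opcode proposed in this PR


def encode_secondary_index_key(secondary_index_id: int, name: str) -> bytes:
    """Serialize a framed Secondary Index Key record (hypothetical helper)."""
    name_bytes = name.encode("utf-8")
    # Content: uint16 id, then a uint32 length-prefixed UTF-8 string.
    content = struct.pack("<HI", secondary_index_id, len(name_bytes)) + name_bytes
    # Framing (assumed): 1-byte opcode followed by a uint64 content length.
    return struct.pack("<BQ", OP_SECONDARY_INDEX_KEY, len(content)) + content


def decode_secondary_index_key(record: bytes) -> tuple[int, str]:
    """Parse a framed Secondary Index Key record back into (id, name)."""
    opcode, length = struct.unpack_from("<BQ", record, 0)
    if opcode != OP_SECONDARY_INDEX_KEY:
        raise ValueError(f"unexpected opcode 0x{opcode:02x}")
    content = record[9 : 9 + length]
    secondary_index_id, name_len = struct.unpack_from("<HI", content, 0)
    name = content[6 : 6 + name_len].decode("utf-8")
    return secondary_index_id, name
```

Because the record carries only an ID and a name, a recovery tool that scans a truncated file could rebuild the ID-to-name mapping from these records alone.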
Contributor

Why is it necessary to have this record type, vs the parser just inferring from the existence of the other two record types that the index is in use? I can see dropping this would require moving the "name" field into the secondary chunk index record but that doesn't seem like the biggest thing we stick in those records anyway.

Today the parser knows a file is indexed via the presence of the index records - we don't need a third record type for that right?

Contributor

I guess I should follow up with a concrete suggestion - what about

    SecondaryMessageIndex
        name
        channel_id
        records

    SecondaryChunkIndex
        name
        chunk_start_offset
        first_key
        last_key
        message_index_offsets
        metadata (??? - will elaborate in another comment)

Then in the summary offset section, we'd have a new group pointing at "SecondaryChunkIndex". The "name" key is stored in both locations to allow a partially-written file to still have index data recovered, which is a purpose your third record type also serves. The cost is the duplication of the "name" field in the SecondaryMessageIndex records. It would be good to get some data on how much this costs us - my assumption is that the effect would generally be swamped by the size of "records", so it doesn't justify the extra record type.

Contributor

forget the "metadata" part - I think it would be better implemented with a "chunk info" record as described in another comment.


### Chunk (op=0x06)

A Chunk contains a batch of Schema, Channel, and Message records. The batch of records contained in a chunk may be compressed or uncompressed.
@@ -207,6 +235,17 @@ A sequence of Message Index records occurs immediately after each chunk. Exactly

Messages outside of chunks cannot be indexed.

### Secondary Message Index (op=0x11)

A Secondary Message Index record allows readers to locate individual message records within a chunk using a
key defined in a [Secondary Index Key record](#secondary-index-key-op0x10).

| Bytes | Name | Type | Description |
| ----- | ------------------ | --------------------------------- | -------------------------------------------------------------------------------------------------------------- |
| 2 | channel_id | uint16 | Channel ID. |
| 2 | secondary_index_id | uint16 | Secondary Index ID. |
| 4 + N | records | `Array<Tuple<Timestamp, uint64>>` | Array of timestamp and offset for each record. Offset is relative to the start of the uncompressed chunk data. |
Contributor

should be key, offset I think
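
To make the proposed Secondary Message Index layout concrete, the following Python sketch parses the record content (without the opcode and length framing) and looks up message offsets for a key range. It assumes little-endian encoding, that the `records` array is prefixed with its byte length as a uint32, and that each tuple is two uint64 values. The names `parse_secondary_message_index` and `offsets_in_range` are hypothetical.

```python
from __future__ import annotations

import struct
from bisect import bisect_left, bisect_right


def parse_secondary_message_index(content: bytes) -> tuple[int, int, list[tuple[int, int]]]:
    """Parse record content into (channel_id, secondary_index_id, records)."""
    channel_id, secondary_index_id, records_len = struct.unpack_from("<HHI", content, 0)
    # Each entry is a (key, offset) pair of two uint64 values, 16 bytes in total.
    records = [
        struct.unpack_from("<QQ", content, 8 + pos)
        for pos in range(0, records_len, 16)
    ]
    return channel_id, secondary_index_id, records


def offsets_in_range(records: list[tuple[int, int]], start: int, end: int) -> list[int]:
    """Return chunk-relative offsets whose key lies in [start, end], in key order."""
    # Assumption: the proposal does not say whether records are sorted, so sort here.
    ordered = sorted(records)
    keys = [key for key, _ in ordered]
    lo = bisect_left(keys, start)
    hi = bisect_right(keys, end)
    return [offset for _, offset in ordered[lo:hi]]
```

The returned offsets are relative to the start of the uncompressed chunk data, so a reader would decompress the chunk before seeking to them.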


### Chunk Index (op=0x08)

A Chunk Index record contains the location of a Chunk record and its associated Message Index records.
@@ -229,6 +268,18 @@ A Schema and Channel record MUST exist in the summary section for all channels r

> Why? The typical use case for file readers using an index is fast random access to a specific message timestamp. Channel is a prerequisite for decoding Message record data. Without an easy-to-access copy of the Channel records, readers would need to search for Channel records from the start of the file, degrading random access read performance.

### Secondary Chunk Index (op=0x12)

A Secondary Chunk Index record supplements the corresponding Chunk Index record with the key range and Secondary Message Index offsets for one secondary index.

| Bytes | Name | Type | Description |
| ----- | --------------------- | --------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 2 | secondary_index_id | uint16 | Secondary Index ID. |
| 8 | chunk_start_offset | uint64 | Offset to the chunk record from the start of the file. |
| 8 | earliest_key | Timestamp | Earliest key in the chunk. Zero if the chunk contains no messages with this key. |
| 8 | latest_key | Timestamp | Latest key in the chunk. Zero if the chunk contains no messages with this key. |
Contributor

Using a zero timestamp as a sentinel is something we do elsewhere, but it is not strictly correct, since 0 is a valid timestamp.

Contributor

what about saying chunk indexes should be omitted when they would apply to no messages?

| 4 + N | message_index_offsets | `Map<uint16, uint64>` | Mapping from channel ID to the offset of the secondary message index record with this `secondary_index_id` for that channel after the chunk, from the start of the file. An empty map indicates no message indexing is available. |
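
A reader could use the `earliest_key` and `latest_key` fields to decide which chunks to open for a key-range query. The Python sketch below shows one way to do that with a plain interval-overlap test. The `SecondaryChunkIndex` dataclass and `chunks_for_key_range` function are hypothetical names that mirror the table above, and the sketch does not try to resolve the zero-sentinel ambiguity raised in review: a chunk reporting 0 for both keys is treated like any other range.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SecondaryChunkIndex:
    """In-memory mirror of the proposed record (hypothetical type)."""

    secondary_index_id: int
    chunk_start_offset: int
    earliest_key: int
    latest_key: int
    message_index_offsets: dict[int, int] = field(default_factory=dict)


def chunks_for_key_range(
    indexes: list[SecondaryChunkIndex],
    secondary_index_id: int,
    start: int,
    end: int,
) -> list[int]:
    """Return file offsets of chunks whose key range overlaps [start, end]."""
    hits = []
    for idx in indexes:
        if idx.secondary_index_id != secondary_index_id:
            continue
        # Two closed intervals overlap when each one starts before the other ends.
        if idx.earliest_key <= end and idx.latest_key >= start:
            hits.append(idx.chunk_start_offset)
    return sorted(hits)
```

Sorting the result by file offset lets the reader visit the selected chunks in a single forward pass.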

### Attachment (op=0x09)

Attachment records contain auxiliary artifacts such as text, core dumps, calibration data, or other arbitrary data.
@@ -522,6 +573,52 @@ A writer may choose to put messages in Chunks to compress record data. This MCAP
[Footer]
```

### Multiple Messages with a Secondary Index

```
[Header]
[Secondary Index Key 1]
[Chunk A]
[Schema A]
[Channel 1 (A)]
[Channel 2 (B)]
[Message on 1]
[Message on 1]
[Message on 2]
[Message Index 1]
[Message Index 2]
[Secondary Message Index 1 (Channel 1)]
[Secondary Message Index 1 (Channel 2)]
[Attachment 1]
[Chunk B]
[Schema B]
[Channel 3 (B)]
[Message on 3]
[Message on 1]
[Message Index 3]
[Message Index 1]
[Secondary Message Index 1 (Channel 3)]
[Secondary Message Index 1 (Channel 1)]
[Data End]
[Schema A]
[Schema B]
[Channel 1]
[Channel 2]
[Channel 3]
[Secondary Index Key 1]
[Chunk Index A]
[Chunk Index B]
[Secondary Chunk Index 1 (Chunk A)]
[Secondary Chunk Index 1 (Chunk B)]
[Attachment Index 1]
[Statistics]
[Summary Offset 0x01]
[Summary Offset 0x05]
[Summary Offset 0x07]
[Summary Offset 0x08]
Contributor

I think these summary offsets seem wrong, 0x01 is the header and that's not in the summary section.

I think they should be 0x03, 0x04, 0x10, 0x08, 0x12, 0x0A, 0x0B

[Footer]
```
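
The layout above follows the ordering rule from the Secondary Index Key section: a key record precedes any index record that refers to its `secondary_index_id`. The Python sketch below checks that rule for one section at a time. It assumes the records have already been reduced to `(opcode, secondary_index_id)` pairs, with `None` as the ID for record types that have none; `find_ordering_violations` is a hypothetical name.

```python
from __future__ import annotations

from typing import Iterable, Optional

# Opcodes proposed in this PR.
OP_SECONDARY_INDEX_KEY = 0x10
OP_SECONDARY_MESSAGE_INDEX = 0x11
OP_SECONDARY_CHUNK_INDEX = 0x12


def find_ordering_violations(
    records: Iterable[tuple[int, Optional[int]]],
) -> list[int]:
    """Return positions of index records whose key has not been seen yet.

    `records` is the (opcode, secondary_index_id) sequence of a single section,
    in file order.
    """
    seen_ids: set[int] = set()
    violations = []
    for position, (opcode, index_id) in enumerate(records):
        if opcode == OP_SECONDARY_INDEX_KEY:
            seen_ids.add(index_id)
        elif opcode in (OP_SECONDARY_MESSAGE_INDEX, OP_SECONDARY_CHUNK_INDEX):
            if index_id not in seen_ids:
                violations.append(position)
    return violations
```

Running the check separately on the data section and the summary section matches how the proposal states the rule for each section.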

## Further Reading

- [Feature explanations][feature_explanations]: includes usage details that may be useful to implementers of readers or writers.
14 changes: 14 additions & 0 deletions website/docs/spec/registry.md
@@ -152,3 +152,17 @@ The `ros2` profile describes how to create MCAP files for [ROS 2](https://docs.r
#### Schema

- `encoding`: MUST be either `ros2msg` or `ros2idl`

## Secondary index keys

The Secondary Index Key `name` field may contain the following options:
Contributor

Does "may" indicate I can put my own stuff in there and expect some tooling support? I think this would be good to shoot for. Rather than having studio or whatever hard code "header.stamp", "publish_time", etc, would it be viable to dynamically show a list of sort options based on the file's index section?

And likewise with the info command, CLI, reader support etc.

Collaborator Author

Yeah, I feel like you should expect some tooling support for any key, but tooling can make extra assumptions about well-known keys.


### `header.stamp`

Indexes the `stamp` value of the `std_msgs/msg/Header`-valued `header` field of the deserialized message data.

- `profile`: must be `ros1` or `ros2`

### `publish_time`

Indexes the `publish_time` value of Message records.
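
On the writer side, a `publish_time` index could be assembled while a chunk is being built. The Python sketch below groups messages by channel and derives the chunk-level key bounds. It assumes the writer already tracks `(channel_id, publish_time, offset)` for each message in the chunk, and it sorts entries by key even though the proposal does not state an ordering requirement. `build_publish_time_index` and `key_bounds` are hypothetical names.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable


def build_publish_time_index(
    messages: Iterable[tuple[int, int, int]],
) -> dict[int, list[tuple[int, int]]]:
    """Group (channel_id, publish_time, offset) triples into per-channel records.

    Each channel maps to the (key, offset) list for one Secondary Message Index.
    """
    per_channel: dict[int, list[tuple[int, int]]] = defaultdict(list)
    for channel_id, publish_time, offset in messages:
        per_channel[channel_id].append((publish_time, offset))
    # Assumption: entries are sorted by key for convenient lookup.
    return {cid: sorted(entries) for cid, entries in per_channel.items()}


def key_bounds(index: dict[int, list[tuple[int, int]]]) -> tuple[int, int]:
    """Return (earliest_key, latest_key) across all channels in a chunk."""
    keys = [key for entries in index.values() for key, _ in entries]
    if not keys:
        raise ValueError("chunk has no messages with this key")
    return min(keys), max(keys)
```

The bounds from `key_bounds` correspond to the `earliest_key` and `latest_key` fields of the Secondary Chunk Index record for that chunk.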