proposal: secondary indexes #918
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -60,9 +60,11 @@ The following records are allowed to appear in the data section: | |
- [Schema](#schema-op0x03) | ||
- [Channel](#channel-op0x04) | ||
- [Message](#message-op0x05) | ||
- [Secondary Index Key](#secondary-index-key-op0x10) | ||
- [Attachment](#attachment-op0x09) | ||
- [Chunk](#chunk-op0x06) | ||
- [Message Index](#message-index-op0x07) | ||
- [Secondary Message Index](#secondary-message-index-op0x11) | ||
- [Metadata](#metadata-op0x0C) | ||
- [Data End](#data-end-op0x0F) | ||
|
||
|
@@ -82,7 +84,9 @@ The following records are allowed to appear in the summary section: | |
|
||
- [Schema](#schema-op0x03) | ||
- [Channel](#channel-op0x04) | ||
- [Secondary Index Key](#secondary-index-key-op0x10) | ||
- [Chunk Index](#chunk-index-op0x08) | ||
- [Secondary Chunk Index](#secondary-chunk-index-op0x12) | ||
- [Attachment Index](#attachment-index-op0x0A) | ||
- [Metadata Index](#metadata-index-op0x0D) | ||
- [Statistics](#statistics-op0x0B) | ||
|
@@ -179,6 +183,30 @@ The message encoding and schema must match that of the Channel record correspond | |
| 8 | publish_time | Timestamp | Time at which the message was published. If not available, must be set to the log time. | | ||
| N | data | Bytes | Message data, to be decoded according to the schema of the channel. | | ||
|
||
### Secondary Index Key (op=0x10) | ||
|
||
A Secondary Index Key record defines a secondary timestamp index that will be used in this file. | ||
Secondary Indexes can be used to quickly look up messages by timestamps other than `log_time`. | ||
The `name` field identifies the timestamp key that messages will be indexed by. The [registry](./registry.md#secondary-index-keys) lists well-known secondary index key names. | ||
|
||
A Secondary Index Key record must appear before any [Secondary Message Index](#secondary-message-index-op0x11) records | ||
in the data section with this `secondary_index_id`. | ||
|
||
Secondary Index Key records in the Data section must also appear in the Summary section, before | ||
any [Secondary Chunk Index](#secondary-chunk-index-op0x12) records with this `secondary_index_id`. | ||
|
||
| Bytes | Name | Type | Description | | ||
| ----- | ------------------ | ------ | ----------------------------------------------------------------- | | ||
| 2 | secondary_index_id | uint16 | A unique identifier for this secondary index within the file. | | ||
| 4 + N | name               | string | A name that describes the key, e.g. `publish_time`, `header.stamp` | | ||
|
||
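For illustration only (this is not part of the proposed spec text), the record body in the table above could be decoded roughly as follows. The sketch assumes little-endian integers and a uint32 length prefix on the string, which is what the `4 + N` size suggests; the function names are hypothetical.

```python
import struct
from typing import Tuple

# Assumed layout: uint16 secondary_index_id, uint32 name length, UTF-8 name bytes.
_PREFIX = struct.Struct("<HI")


def encode_secondary_index_key(secondary_index_id: int, name: str) -> bytes:
    """Serialize the body of a proposed Secondary Index Key record (op=0x10)."""
    raw = name.encode("utf-8")
    return _PREFIX.pack(secondary_index_id, len(raw)) + raw


def parse_secondary_index_key(content: bytes) -> Tuple[int, str]:
    """Decode the body of a proposed Secondary Index Key record (op=0x10)."""
    secondary_index_id, name_len = _PREFIX.unpack_from(content, 0)
    name_bytes = content[_PREFIX.size : _PREFIX.size + name_len]
    if len(name_bytes) != name_len:
        raise ValueError("truncated name field")
    return secondary_index_id, name_bytes.decode("utf-8")
```

A writer would emit the encoded body once in the data section and again in the summary section, so that either copy is enough to resolve a `secondary_index_id` to its key name.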
> Why do Secondary Index Key records appear in the Data section? | ||
> When reading using an index, the Secondary Index Key would be read out of the Summary section | ||
> before reading into the Data section. This means that the Secondary Index Key in the Data section | ||
> is not normally used. However, if an MCAP is truncated and the summary section is lost, having the | ||
> Secondary Index Key appear before any Secondary Message Index records allows the MCAP to be fully | ||
> recovered. | ||
Review comment: Why is it necessary to have this record type, rather than having the parser infer from the existence of the other two record types that the index is in use? I can see that dropping this would require moving the "name" field into the secondary chunk index record, but that doesn't seem like the biggest thing we stick in those records anyway. Today the parser knows a file is indexed via the presence of the index records; we don't need a third record type for that, right?

Review comment: I guess I should follow up with a concrete suggestion - what about
then in the summary offset section, we'd have a new group pointing at "SecondaryChunkIndex". The "name" key is stored in both locations to allow a partially-written file to still have index data recovered, which is a purpose your third record type also serves. The cost is the duplication of the "name" field in the SecondaryMessageIndex records. It would be good to get some data on how much this costs us - my assumption is that the effect would generally be swamped by the size of "records", thus it doesn't justify the extra record type.

Review comment: Forget the "metadata" part - I think it would be better implemented with a "chunk info" record as described in another comment. |
||
|
||
### Chunk (op=0x06) | ||
|
||
A Chunk contains a batch of Schema, Channel, and Message records. The batch of records contained in a chunk may be compressed or uncompressed. | ||
|
@@ -207,6 +235,17 @@ A sequence of Message Index records occurs immediately after each chunk. Exactly | |
|
||
Messages outside of chunks cannot be indexed. | ||
|
||
### Secondary Message Index (op=0x11) | ||
|
||
A Secondary Message Index record allows readers to locate individual message records within a chunk using a | ||
key defined in a [Secondary Index Key record](#secondary-index-key-op0x10). | ||
|
||
| Bytes | Name | Type | Description | | ||
| ----- | ------------------ | --------------------------------- | -------------------------------------------------------------------------------------------------------------- | | ||
| 2 | channel_id | uint16 | Channel ID. | | ||
| 2 | secondary_index_id | uint16 | Secondary Index ID. | | ||
| 4 + N | records | `Array<Tuple<Timestamp, uint64>>` | Array of timestamp and offset for each record. Offset is relative to the start of the uncompressed chunk data. | | ||
Review comment: should be key, offset I think |
||
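As a rough sketch of how a reader might use the `records` field (not part of the proposal itself), the snippet below decodes the array and returns the offsets of messages whose key falls in a range. It assumes little-endian encoding, a uint32 byte-length prefix on the array, and that entries are not guaranteed to be sorted by key; `parse_records` and `offsets_in_key_range` are hypothetical names.

```python
import bisect
import struct
from typing import List, Tuple

# One entry: (key timestamp, offset into the uncompressed chunk), both uint64.
ENTRY = struct.Struct("<QQ")


def parse_records(buf: bytes, pos: int = 0) -> List[Tuple[int, int]]:
    """Decode the `records` array starting at `pos` (4 in a full record body,
    after the channel_id and secondary_index_id fields)."""
    (byte_len,) = struct.unpack_from("<I", buf, pos)
    pos += 4
    end = pos + byte_len
    if byte_len % ENTRY.size != 0 or end > len(buf):
        raise ValueError("malformed records array")
    return [ENTRY.unpack_from(buf, p) for p in range(pos, end, ENTRY.size)]


def offsets_in_key_range(
    records: List[Tuple[int, int]], start: int, end: int
) -> List[int]:
    """Return message offsets whose key is in [start, end], in key order."""
    ordered = sorted(records)  # assumption: the writer may not sort by key
    keys = [key for key, _ in ordered]
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_right(keys, end)
    return [offset for _, offset in ordered[lo:hi]]
```

Each returned offset is relative to the start of the uncompressed chunk data, so the reader still has to decompress the chunk before seeking to the message.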
|
||
### Chunk Index (op=0x08) | ||
|
||
A Chunk Index record contains the location of a Chunk record and its associated Message Index records. | ||
|
@@ -229,6 +268,18 @@ A Schema and Channel record MUST exist in the summary section for all channels r | |
|
||
> Why? The typical use case for file readers using an index is fast random access to a specific message timestamp. Channel is a prerequisite for decoding Message record data. Without an easy-to-access copy of the Channel records, readers would need to search for Channel records from the start of the file, degrading random access read performance. | ||
|
||
### Secondary Chunk Index (op=0x12) | ||
|
||
A Secondary Chunk Index record contains additional secondary index information on top of the corresponding Chunk Index record. | ||
|
||
| Bytes | Name | Type | Description | | ||
| ----- | --------------------- | --------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ||
| 2 | secondary_index_id | uint16 | Secondary Index ID. | | ||
| 8 | chunk_start_offset | uint64 | Offset to the chunk record from the start of the file. | | ||
| 8 | earliest_key | Timestamp | Earliest key in the chunk. Zero if the chunk contains no messages with this key. | | ||
| 8 | latest_key | Timestamp | Latest key in the chunk. Zero if the chunk contains no messages with this key. | | ||
Review comment: Using a zero timestamp as a sentinel is something we do elsewhere, but it is not strictly correct, since 0 is a valid timestamp.

Review comment: What about saying chunk indexes should be omitted when they would apply to no messages? |
||
| 4 + N | message_index_offsets | `Map<uint16, uint64>` | Mapping from channel ID to the offset of the secondary message index record with this `secondary_index_id` for that channel after the chunk, from the start of the file. An empty map indicates no message indexing is available. | | ||
|
||
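To illustrate how `earliest_key` and `latest_key` might be used, the sketch below picks the chunks that could contain messages in a key range for one secondary index. It is a minimal example, not the proposal's reference behavior: the dataclass and function names are hypothetical, and it makes no attempt to resolve the zero-sentinel ambiguity raised in review.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SecondaryChunkIndex:
    """In-memory form of the proposed record (op=0x12), fields as in the table."""
    secondary_index_id: int
    chunk_start_offset: int
    earliest_key: int
    latest_key: int
    message_index_offsets: Dict[int, int]


def chunks_overlapping(
    indexes: List[SecondaryChunkIndex],
    secondary_index_id: int,
    start: int,
    end: int,
) -> List[int]:
    """Return chunk_start_offset values for chunks whose key range
    [earliest_key, latest_key] intersects [start, end], ordered by earliest_key."""
    hits = [
        ci
        for ci in indexes
        if ci.secondary_index_id == secondary_index_id
        and ci.earliest_key <= end
        and ci.latest_key >= start
    ]
    hits.sort(key=lambda ci: ci.earliest_key)
    return [ci.chunk_start_offset for ci in hits]
```

Because a chunk with no keyed messages reports both bounds as zero, a query that includes key 0 would also select such chunks under this sketch; omitting the record for those chunks, as suggested in review, would avoid that.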
### Attachment (op=0x09) | ||
|
||
Attachment records contain auxiliary artifacts such as text, core dumps, calibration data, or other arbitrary data. | ||
|
@@ -522,6 +573,52 @@ A writer may choose to put messages in Chunks to compress record data. This MCAP | |
[Footer] | ||
``` | ||
|
||
### Multiple Messages with a Secondary Index | ||
|
||
``` | ||
[Header] | ||
[Secondary Index Key 1] | ||
[Chunk A] | ||
[Schema A] | ||
[Channel 1 (A)] | ||
[Channel 2 (B)] | ||
[Message on 1] | ||
[Message on 1] | ||
[Message on 2] | ||
[Message Index 1] | ||
[Message Index 2] | ||
[Secondary Message Index 1 (Channel 1)] | ||
[Secondary Message Index 1 (Channel 2)] | ||
[Attachment 1] | ||
[Chunk B] | ||
[Schema B] | ||
[Channel 3 (B)] | ||
[Message on 3] | ||
[Message on 1] | ||
[Message Index 3] | ||
[Message Index 1] | ||
[Secondary Message Index 1 (Channel 3)] | ||
[Secondary Message Index 1 (Channel 1)] | ||
[Data End] | ||
[Schema A] | ||
[Schema B] | ||
[Channel 1] | ||
[Channel 2] | ||
[Channel 3] | ||
[Secondary Index Key 1] | ||
[Chunk Index A] | ||
[Chunk Index B] | ||
[Secondary Chunk Index 1 (Chunk A)] | ||
[Secondary Chunk Index 1 (Chunk B)] | ||
[Attachment Index 1] | ||
[Statistics] | ||
[Summary Offset 0x01] | ||
[Summary Offset 0x05] | ||
[Summary Offset 0x07] | ||
[Summary Offset 0x08] | ||
Review comment: I think these summary offsets are wrong: 0x01 is the header, and that's not in the summary section. I think they should be 0x03, 0x04, 0x10, 0x08, 0x12, 0x0A, 0x0B. |
||
[Footer] | ||
``` | ||
|
||
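The layout above places `[Secondary Index Key 1]` ahead of every Secondary Message Index record that refers to it, as the proposal requires for the data section. A validator could check that rule along the following lines. This is only a sketch: it takes an already-parsed sequence of `(opcode, secondary_index_id)` pairs, with `None` for records that carry no secondary index ID, and the function name is hypothetical.

```python
from typing import Iterable, Optional, Set, Tuple

OP_SECONDARY_INDEX_KEY = 0x10      # opcode proposed in this PR
OP_SECONDARY_MESSAGE_INDEX = 0x11  # opcode proposed in this PR


def first_ordering_violation(
    records: Iterable[Tuple[int, Optional[int]]]
) -> Optional[int]:
    """Return the position of the first Secondary Message Index record that is
    not preceded by a Secondary Index Key with the same ID, or None if the
    sequence satisfies the ordering rule."""
    declared: Set[Optional[int]] = set()
    for position, (opcode, secondary_index_id) in enumerate(records):
        if opcode == OP_SECONDARY_INDEX_KEY:
            declared.add(secondary_index_id)
        elif (
            opcode == OP_SECONDARY_MESSAGE_INDEX
            and secondary_index_id not in declared
        ):
            return position
    return None
```

The same single pass is what makes recovery of a truncated file possible: by the time a Secondary Message Index record is encountered, the reader has already seen the key name it refers to.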
## Further Reading | ||
|
||
- [Feature explanations][feature_explanations]: includes usage details that may be useful to implementers of readers or writers. |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -152,3 +152,17 @@ The `ros2` profile describes how to create MCAP files for [ROS 2](https://docs.r | |
#### Schema | ||
|
||
- `encoding`: MUST be either `ros2msg` or `ros2idl` | ||
|
||
## Secondary index keys | ||
|
||
The Secondary Index Key `name` field may contain the following options: | ||
Review comment: Does "may" indicate I can put my own stuff in there and expect some tooling support? I think this would be good to shoot for. Rather than having Studio or whatever hard-code "header.stamp", "publish_time", etc., would it be viable to dynamically show a list of sort options based on the file's index section? And likewise with the info command, CLI, reader support, etc.

Review comment: Yeah, I feel like you should expect some tooling support for any key, but tooling can make extra assumptions about well-known keys. |
||
|
||
### `header.stamp` | ||
|
||
Indexes the `stamp` value of the `std_msgs/msg/Header`-valued `header` field of the deserialized message data. | ||
|
||
- `profile`: must be `ros1` or `ros2` | ||
|
||
### `publish_time` | ||
|
||
Indexes the `publish_time` value of Message records. |
Review comment: I think we should come up with a more future-proof word than "secondary", in case we decide in a V2 to merge the primary indexes into this same record type.