Merge pull request #9 from pinterest/update_readme
Update readme
jeffxiang authored Aug 23, 2024
2 parents a5a14dc + 64e00ca commit 6935a54
# Pinterest Tiered Storage for Apache Kafka®
Pinterest Tiered Storage for [Apache Kafka®](https://kafka.apache.org/) is a broker-independent framework that allows brokers
to offload finalized log segments to a remote storage system.
This allows Apache Kafka® to maintain a smaller disk footprint and reduce the need for expensive storage on the brokers.
The framework also provides a consumer client that can consume from both the broker and the remote storage system.

Pinterest's implementation of Tiered Storage for Apache Kafka® provides a ***broker-independent*** approach to Tiered Storage.
***See the differences between [Pinterest vs. Apache Kafka® Tiered Storage](#pinterest-vs-apache-kafka-tiered-storage)***.

It consists of two main components:
1. [Uploader](ts-segment-uploader): A continuous process that runs on each Apache Kafka® broker and uploads finalized log segments to a remote storage system (e.g. Amazon S3, with unique prefix per cluster and topic).
2. [Consumer](ts-consumer): A consumer client capable of consuming from both Tiered Storage log segments and the Apache Kafka® cluster.

A third module [ts-common](ts-common) contains common classes and interfaces that are used by the `ts-consumer` and `ts-segment-uploader` modules, such as Metrics, StorageEndpointProvider, etc.

Feel free to read into each module's README for more details.

# Why Tiered Storage?
[Apache Kafka®](https://kafka.apache.org/) is a distributed event streaming platform that stores partitioned and replicated log segments on disk for
a configurable retention period. However, as data volume and/or retention periods grow, the disk footprint of Apache Kafka® clusters can become expensive.
Tiered Storage allows brokers to offload finalized log segments to a more cost-effective remote storage system, reducing the need for expensive storage on the brokers.

With Tiered Storage, you can:
1. Maintain a smaller overall broker footprint, reducing operational costs
2. Retain data for longer periods of time while avoiding horizontal and vertical scaling of Apache Kafka® clusters
3. Reduce CPU, network, and disk I/O utilization on brokers by reading directly from remote storage

## Pinterest vs. Apache Kafka® Tiered Storage
### Apache Kafka® Tiered Storage
[KIP-405](https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage)
provides a native, open-source Tiered Storage offering for Apache Kafka® and is available starting from Apache Kafka® 3.6.0.
The native Tiered Storage implementation is broker-dependent, meaning that the broker process itself is responsible
for offloading finalized log segments to remote storage, and the ***broker is always in the critical path of consumption***.

### Pinterest Tiered Storage: Skip the broker
***Pinterest's implementation of Tiered Storage is broker-independent***, meaning that the Tiered Storage process runs as a separate process alongside the Apache Kafka® server process,
***and the broker is not always in the critical path of consumption***.
This allows for more flexibility in adopting Tiered Storage, and better accommodates unpredictable consumption patterns.
Some of the key advantages of a broker-independent approach are:

1. **You don't need to upgrade brokers**: While the native offering requires upgrading brokers to a version that supports Tiered Storage, a broker-independent approach does not.
2. **You can skip the broker entirely during consumption**: When in `TIERED_STORAGE_ONLY` mode, the consumption loop does not touch the broker itself, so unpredictable
spikes in consumption can be absorbed without affecting the broker. See [MemQ](https://github.com/pinterest/memq) for a PubSub system that uses this approach natively.
3. **Support consumer backfills and replays without affecting broker CPU**: When the broker is out of the critical path of consumption,
consumer backfills and replays can be done without needing to keep additional CPU buffer on the brokers just to support those surges.
4. **Avoid cross-AZ transfer costs**: While the native approach adds a cross-AZ network cost factor for consumers that are not AZ-aware,
this broker-independent approach avoids that cost for all consumers when reading directly from remote storage.
5. **Faster adoption, iteration, and improvements**: A broker-independent Tiered Storage solution lets you adopt and upgrade Tiered Storage without
waiting for Apache Kafka® upgrades. Improvements, bug fixes, and new features are released independently of Apache Kafka® releases.

# Highlights
- **Broker Independent**: The tiered storage solution is designed to be broker-independent. [Here's why we think it's better](#pinterest-vs-apache-kafka-tiered-storage).
- **Skip the broker entirely during consumption**: The consumer can read from both broker and Tiered Storage backend filesystem. When in `TIERED_STORAGE_ONLY` mode, the consumption loop does not touch the broker itself, allowing for reduction in broker resource utilization.
- **Pluggable Storage Backends**: The framework is designed to be backend-agnostic.
- **S3 Partitioning**: Prefix-entropy (salting) is configurable out-of-the-box to allow for prefix-partitioned S3 buckets, allowing for better scalability by avoiding request rate hotspots.
- **Fault Tolerant**: Broker restarts, replacements, leadership changes, and other common Apache Kafka® operations / issues are handled gracefully.
- **Metrics**: Comprehensive metrics are provided out-of-the-box for monitoring and alerting purposes.

# Quick Start
Detailed quickstart instructions are available [here](docs/quickstart.md).

# Usage
Using Pinterest Tiered Storage for Apache Kafka® consists of the following high-level steps:
1. Have a remote storage system ready to accept reads and writes of log segments (e.g. Amazon S3 bucket)
2. Configure and start [ts-segment-uploader](ts-segment-uploader) on each Apache Kafka® broker
3. Use [ts-consumer](ts-consumer) to read from either the broker or the remote storage system
4. Monitor and manage the tiered storage system using the provided metrics and tools

![Architecture](docs/images/architecture.png)

# Current Status
**Pinterest Tiered Storage for Apache Kafka® is currently under active development and the APIs may change over time.**

It currently supports the following remote storage systems:
- Amazon S3

Some planned features and improvements:

- KRaft support
- More storage system support (e.g. HDFS)
- Integration with [PubSub Client](https://github.com/pinterest/psc) (backend-agnostic client library)

Contributions are always welcome!

# Ecosystem
Check out some of the other Pinterest projects designed to make PubSub more automated, efficient, and reliable:
- [PubSub Client](https://github.com/pinterest/psc): A backend-agnostic client library for PubSub systems
- [MemQ](https://github.com/pinterest/memq): An efficient, scalable cloud native PubSub system
- [Orion](https://github.com/pinterest/orion): A generalized and pluggable management and automation platform for stateful distributed systems, such as Apache Kafka® and MemQ

# Maintainers
- Vahid Hashemian
- Jeff Xiang

# License
Pinterest Tiered Storage for Apache Kafka® is distributed under Apache License, Version 2.0.

# Trademark
Apache®, Apache Kafka, and Kafka are trademarks of the Apache Software Foundation.
# Tiered Storage Segment Uploader

## Overview
This module contains the uploader code that is used to upload Apache Kafka® log segments to the backing tiered storage filesystem.
It is designed to be a long-running and independent process that runs on each Apache Kafka® broker in order to upload
finalized (closed) log segments to a remote storage system. These log segments can later be read by a
[TieredStorageConsumer](../ts-consumer) even if the log segments have already been deleted from local storage on the broker, provided
that the retention period of the segments on the remote storage system is longer than that of the Apache Kafka® topic itself.

## Architecture
![Uploader Architecture](../docs/images/uploader.png)

The uploader process runs alongside the Apache Kafka® server process on every broker, and is responsible for uploading log segments
to the configured remote storage system.

The key components to the uploader are:
1. [KafkaSegmentUploader](src/main/java/com/pinterest/kafka/tieredstorage/uploader/KafkaSegmentUploader.java): The entrypoint class
2. [DirectoryTreeWatcher](src/main/java/com/pinterest/kafka/tieredstorage/uploader/DirectoryTreeWatcher.java): Watches the Apache Kafka® log directory for log rotations (closing and opening of new log segments)
3. [KafkaLeadershipWatcher](src/main/java/com/pinterest/kafka/tieredstorage/uploader/KafkaLeadershipWatcher.java): Monitors leadership changes in the Apache Kafka® cluster and updates the watched filepaths as necessary
4. [FileUploader](src/main/java/com/pinterest/kafka/tieredstorage/uploader/S3FileUploader.java): Uploads the log segments to the remote storage system
5. [KafkaEnvironmentProvider](src/main/java/com/pinterest/kafka/tieredstorage/uploader/KafkaEnvironmentProvider.java): Provides information about the Apache Kafka® environment for the uploader

If a broker is currently a leader for a topic-partition, the uploader will watch the log directory for that topic-partition.
When a previously-active log segment is closed, the uploader will upload the closed segment to the remote storage system. Typically,
a log segment is closed when it reaches a time or size-based threshold, configurable via Apache Kafka® broker properties.
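
For context, the time and size thresholds mentioned above are standard Apache Kafka® broker settings. A minimal example (the values shown are the common defaults, for illustration only, not recommendations):

```properties
# Roll a new segment once the active one reaches 1 GiB...
log.segment.bytes=1073741824
# ...or once it is 7 days old, whichever comes first.
log.roll.ms=604800000
```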

Each upload will consist of 3 parts: the segment file, the index file, and the time index file. Once these files are successfully
uploaded, an `offset.wm` file will also be uploaded for that topic partition which contains the offset of the last uploaded log segment.
This is used to resume uploads from the last uploaded offset in case of a failure.
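
That checkpoint-and-resume flow can be sketched as follows. This is a simplified, hypothetical illustration of the idea, not the module's actual code: the file layout and names like `segmentsToUpload` are assumptions. The uploader persists the offset of the last uploaded segment and, on restart, skips every segment whose base offset is at or below that checkpoint.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch only; NOT the ts-segment-uploader implementation.
public class ResumeSketch {
    // Read the last uploaded offset from a watermark file, or -1 if none exists yet.
    static long lastUploadedOffset(Path offsetWm) throws IOException {
        if (!Files.exists(offsetWm)) {
            return -1L;
        }
        return Long.parseLong(Files.readString(offsetWm).trim());
    }

    // Kafka names segment files by their base offset, e.g. 00000000000000001234.log.
    static long baseOffset(String segmentFileName) {
        return Long.parseLong(segmentFileName.substring(0, segmentFileName.indexOf('.')));
    }

    // Keep only closed segments whose base offset is past the checkpoint.
    static List<String> segmentsToUpload(List<String> closedSegments, long checkpoint) {
        return closedSegments.stream()
                .filter(name -> baseOffset(name) > checkpoint)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> pending = segmentsToUpload(
                List.of("00000000000000000000.log",
                        "00000000000000001234.log",
                        "00000000000000005678.log"),
                1234L);
        // Only segments strictly after the checkpoint remain pending.
        System.out.println(pending);
    }
}
```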
## Usage
The segment uploader entrypoint class is `KafkaSegmentUploader`. At a minimum, running the segment uploader requires:
1. **KafkaEnvironmentProvider**: This system property should be provided upon running the segment uploader (i.e. `-DkafkaEnvironmentProvider`). It should
provide the FQDN of the class that provides the Apache Kafka® environment, which should be an implementation of [KafkaEnvironmentProvider](src/main/java/com/pinterest/kafka/tieredstorage/uploader/KafkaEnvironmentProvider.java).
2. **Uploader Configurations**: The directory in which uploader configurations live should be provided as the first argument to the segment uploader main class.
It should point to a directory which contains a `.properties` file that contains the configurations for the segment uploader. The file that
the uploader chooses to use in this directory is determined by the Apache Kafka® cluster ID provided by the `clusterId()` method
of the `KafkaEnvironmentProvider` implementation. Therefore, the properties file should be named `<clusterId>.properties`.
3. **Storage Endpoint Configurations**: The segment uploader requires a [StorageServiceEndpointProvider](../ts-common/src/main/java/com/pinterest/kafka/tieredstorage/common/discovery/StorageServiceEndpointProvider.java)
to be specified in the properties file. This provider's implementation dictates where each topic's log segments should be uploaded to.
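
The configuration lookup in step 2 can be sketched as follows. This is a hypothetical illustration: the single-method `EnvironmentProvider` interface shown here is a stand-in for the real `KafkaEnvironmentProvider` (which may declare additional methods), and only the `<clusterId>.properties` naming convention from the text is modeled.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative sketch of how the uploader picks its properties file:
// the file name is derived from the environment provider's cluster ID.
public class ConfigResolutionSketch {
    // Stand-in for the class named via -DkafkaEnvironmentProvider;
    // only clusterId() is shown here.
    interface EnvironmentProvider {
        String clusterId();
    }

    static Path resolveProperties(Path configDirectory, EnvironmentProvider provider) {
        // The uploader reads <configDirectory>/<clusterId>.properties.
        return configDirectory.resolve(provider.clusterId() + ".properties");
    }

    public static void main(String[] args) {
        EnvironmentProvider provider = () -> "my_cluster";
        System.out.println(resolveProperties(Paths.get("/etc/ts-uploader"), provider));
    }
}
```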
With prefix entropy, one can pre-partition the S3 bucket using the prefix bits to avoid request rate hotspots.
Because the hash is generated via the cluster, topic, and partition ID combination, all log segments (and their associated `.index`, `.timeindex`, and `offset.wm` files)
for a given topic-partition on a given cluster will be uploaded to the same prefix in S3. This way, the [TieredStorageConsumer](../ts-consumer/src/main/java/com/pinterest/kafka/tieredstorage/consumer/TieredStorageConsumer.java)
can deterministically reconstruct the entire key during consumption. Due to this reason, **it is important to ensure that the `ts.segment.uploader.s3.prefix.entropy.bits`
configuration is consistent across all brokers in the Apache Kafka® cluster, and consistent in the consumer's configuration as well**.
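
The mechanism can be illustrated with a small sketch. This is an assumption-laden example, not the uploader's actual code: the real hash function and key layout may differ, and MD5 is used here only for demonstration. The point it shows is that a fixed number of hash bits derived from the cluster/topic/partition combination yields a prefix that spreads load across S3 partitions yet stays deterministic, so a consumer can reconstruct it.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative only: derives a deterministic n-bit prefix from the
// cluster/topic/partition combination. The real hash and key format may differ.
public class PrefixEntropySketch {
    static String prefixBits(String cluster, String topic, int partition, int bits) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(
                    (cluster + "-" + topic + "-" + partition).getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < bits; i++) {
                // Take the i-th bit of the digest, most significant bit first.
                int bit = (digest[i / 8] >> (7 - (i % 8))) & 1;
                sb.append(bit);
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Every segment of the same cluster/topic/partition lands under one prefix,
        // so the consumer can deterministically rebuild the full key.
        String prefix = prefixBits("cluster01", "my_topic", 3, 5);
        System.out.println(prefix + "/cluster01/my_topic-3/00000000000000001234.log");
    }
}
```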

## Configuration
The segment uploader configurations are passed via the aforementioned properties file.
