From fea32907944dc19e470dc7565e4a6bea7d555b9c Mon Sep 17 00:00:00 2001 From: Angelo Fausti Date: Thu, 26 Dec 2024 16:38:34 -0700 Subject: [PATCH] Add FAQ and resource sections --- docs/faq.rst | 116 +++++++++++++++++++++++++++++++++++++++++++++ docs/index.rst | 4 +- docs/resources.rst | 18 +++++++ 3 files changed, 137 insertions(+), 1 deletion(-) create mode 100644 docs/faq.rst create mode 100644 docs/resources.rst diff --git a/docs/faq.rst b/docs/faq.rst new file mode 100644 index 0000000..ee8b9c9 --- /dev/null +++ b/docs/faq.rst @@ -0,0 +1,116 @@ +:og:description: Sasquatch Frequently Asked Question. + +.. _faq: + +FAQ +=== + +Do you need dedicated DevOps personnel to maintain Sasquatch? If so, how many? +------------------------------------------------------------------------------ + +Typically, 0.2 FTE (Full-Time Equivalent) is sufficient to maintain Sasquatch after deployment. +Currently, we manage Sasquatch across five different environments (Kubernetes clusters) with less than 1 FTE. + +How do you deploy Sasquatch on Kubernetes? +------------------------------------------ + +We developed our own internal developer platform `Phalanx`_ to manage application deployment configuration, including the Rubin Science Platform (RSP) and ancillary services. + +.. _Phalanx: https://phalanx.lsst.io + +What is the throughput and number of simultaneous clients connecting to Sasquatch? +---------------------------------------------------------------------------------- + +- **Data Producers:** Approximately 65+ Control System Components (CSCs) grouped into ~20 connectors (writers) produce data at an estimated 10 MB/s during Rubin Observatory operations. +- **Data Consumers:** Users at the Summit control room view Chronograf dashboards, and engineers use the Rubin Science Platform (RSP) for querying InfluxDB via a Python client. Sasquatch data collected at the Summit is also replicated to USDF for project-wide availability. + +How do you interact with InfluxDB? +---------------------------------- + +In addition to Chronograf, the EFD Python client is our primary tool for interacting with data stored in InfluxDB facilitating data analysis within the Rubin Science Platform (RSP). + +Have you found a need for more InfluxDB tags usage? It seemed that Rubin Observatory wasn't using InfluxDB tags much for the EFD. +--------------------------------------------------------------------------------------------------------------------------------- + +Yes, we recently redesigned the InfluxDB schema for the LSST Camera to accommodate its hierarchical data structure (Camera, Raft, Sensor, Amplifier, etc) +This subsystem now uses many tags. You can view an `example configuration here`_. + +.. _example configuration here: https://github.com/lsst-sqre/phalanx/blob/main/applications/sasquatch/values-summit.yaml#L184 + +For other subsystems, while the InfluxDB schema could be improved, query performance has been acceptable without extensive tag usage. +Therefore, this has not been a priority. + +What happens when an InfluxDB data node goes offline? Is all data available? +---------------------------------------------------------------------------- + +Yes, with an InfluxDB Cluster and replication factor set to two, all data is still available when a node goes offline. + + +How easy is it to add a storage node to expand capacity? Can this be done without downtime? +------------------------------------------------------------------------------------------- + +We use local storage for our InfluxDB cluster to enhance performance. +Thus you can expand the InfluxDB cluster storage capacity by adding more data nodes while maintaining the data replication factor. + +Once the new data nodes are available, you just increase the replica count configuration for data pods. +This automatically registers the new cluster members, and thanks to the built-in high-availability (HA) configuration, this can be done without downtime. + +Some considerations: +- Familiarity with the ``influxctl`` admin tool to manage shards in the cluster is required. +- InfluxDB Enterprise includes an `Anti-Entropy (AE)`_ service to ensure data consistency across replicas. +This service must be turned off during manual shard restoration to avoid conflicts with the restore process. + +.. _Anti-Entropy (AE): https://docs.influxdata.com/enterprise_influxdb/v1/administration/configure/anti-entropy/ + +We recommend using a replication factor of two for the cluster. +With a replication factor of two each shard is replicated across at least two nodes. +Adding more than two data nodes increases storage capacity, however not all nodes will hold the same shards. + +Chronograf and Kapacitor usage. How difficult is to create lots of dashboards? Did one team develop all dashboards, or was it distributed between other software teams? Learning curve remarks. +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- + +We use Chronograf for data visualization/exploration and Kapacitor for alerting. +Here are our observations: + +- Chronograf is intuitive and enables our users to create their own dashboards. While we provide guidance, each team is responsible for creating and maintaining their dashboards. +- Lack of a "dashboard as code" approach. While dashboards can be exported as JSON, editing or version-controlling them is challenging. +- In the Chronograf UI, dashboards are presented as a flat list without folder organization or labels, making it hard to browse 200+ dashboards at the Summit. +- Query results are not cached, so shared links hit the database multiple times. +- Kapacitor has been effective for creating alert rules and integrating with Slack notifications. Chronograf provides a user-friendly interface for this. +- Both dashboards and alerts are stored in respective databases, requiring backups. Programmatic creation is not yet supported. + +**Future Plans:** We are evaluating **Grafana** and **Apache Superset** to address some of these limitations while continuing to use Chronograf for data exploration. +We value Chronograf's query builder, which is easy to use and supports both InfluxQL and Flux. + +How responsive is plotting look backs? How does this compare to Grafana? +-------------------------------------------------------------------------------- + +Chronograf is very responsive for long look backs. +Chronograf's ``:interval:`` template variable dynamically adjusts the time grouping in queries, ensuring smooth visualizations regardless of the time range. +It might be possible to implement similar functionality in Grafana using template variables, but we have not explored this yet. + +Was it easy to maintain/interface/add to Kafka? +----------------------------------------------- + +Kafka has a steep learning curve and a complex deployment setup with brokers, controllers, Kafka Connect, MirrorMaker 2, Schema Registry, etc. +The Strimzi Operator simplifies the deployment and management of Kafka on Kubernetes. + +Our Kafka clients are mainly written in Python and Java. +We also provide a REST API (Confluent's Kafka REST Proxy) for easier Kafka interfacing using HTTP clients. + +How straightforward are Kafka or JVM upgrades? Any JVM-related issues? +---------------------------------------------------------------------- + +We use `Strimzi`_ to manage Kafka deployments on Kubernetes, which simplifies many administration tasks including upgrades. + +.. _Strimzi: https://sasquatch.lsst.io/developer-guide/index.html + +No JVM-related issues have been encountered other than the usual memory tuning and garbage collection settings. + +How are Kafka topics managed and accessed? +------------------------------------------ + +Kafka topics have three replicas for fault tolerance. +We disable topic auto creation in Kafka, thus each client must request topic creation. +While Kafka lacks native namespaces, we use a topic naming convention to group topics by subsystem. +Each client has a unique Kafka user with ACLs specifying accessible topics. diff --git a/docs/index.rst b/docs/index.rst index 2fc1651..04ba769 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -17,5 +17,7 @@ Sasquatch is currently deployed at the Summit, USDF and test stands through `Pha User guide Environments Developer guide + FAQ + Resources -.. _Phalanx: https://phalanx.lsst.io \ No newline at end of file +.. _Phalanx: https://phalanx.lsst.io diff --git a/docs/resources.rst b/docs/resources.rst new file mode 100644 index 0000000..17f07a0 --- /dev/null +++ b/docs/resources.rst @@ -0,0 +1,18 @@ +:og:description: Sasquatch resources. + +.. _resources: + +Resources +========= + +- SPIE 2024 `paper`_ and `presentation`_ +- `Chronograf training for Rubin observers`_ +- `Acceptance criteria for InfluxDB Enterprise`_ +- `USDF EFD storage requirements`_ + + +.. _paper: https://dmtn-290.lsst.io/ +.. _presentation: https://docs.google.com/presentation/d/1M6ES4Uk8CM1fnZuxbx33tbwXLePBKGkJCGEAj0AKaog/edit?usp=sharing +.. _Chronograf training for Rubin observers: https://vimeo.com/1001641391?share=copy +.. _Acceptance criteria for InfluxDB Enterprise: https://docs.google.com/document/d/1OuXdtOGMLvrXeIsiE1rx2V_4FgraJ_ZfpfALYaeQB3s/edit?usp=sharing +.. _USDF EFD storage requirements: https://sqr-085.lsst.io