Skip to content

Commit

Permalink
Merge pull request #51 from lsst-sqre/tickets/DM-48272
Browse files Browse the repository at this point in the history
Add FAQ and resource sections
  • Loading branch information
afausti authored Dec 26, 2024
2 parents 55c8447 + fea3290 commit caafa0d
Show file tree
Hide file tree
Showing 3 changed files with 137 additions and 1 deletion.
116 changes: 116 additions & 0 deletions docs/faq.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,116 @@
:og:description: Sasquatch Frequently Asked Question.

.. _faq:

FAQ
===

Do you need dedicated DevOps personnel to maintain Sasquatch? If so, how many?
------------------------------------------------------------------------------

Typically, 0.2 FTE (Full-Time Equivalent) is sufficient to maintain Sasquatch after deployment.
Currently, we manage Sasquatch across five different environments (Kubernetes clusters) with less than 1 FTE.

How do you deploy Sasquatch on Kubernetes?
------------------------------------------

We developed our own internal developer platform `Phalanx`_ to manage application deployment configuration, including the Rubin Science Platform (RSP) and ancillary services.

.. _Phalanx: https://phalanx.lsst.io

What is the throughput and number of simultaneous clients connecting to Sasquatch?
----------------------------------------------------------------------------------

- **Data Producers:** Approximately 65+ Control System Components (CSCs) grouped into ~20 connectors (writers) produce data at an estimated 10 MB/s during Rubin Observatory operations.
- **Data Consumers:** Users at the Summit control room view Chronograf dashboards, and engineers use the Rubin Science Platform (RSP) for querying InfluxDB via a Python client. Sasquatch data collected at the Summit is also replicated to USDF for project-wide availability.

How do you interact with InfluxDB?
----------------------------------

In addition to Chronograf, the EFD Python client is our primary tool for interacting with data stored in InfluxDB facilitating data analysis within the Rubin Science Platform (RSP).

Have you found a need for more InfluxDB tags usage? It seemed that Rubin Observatory wasn't using InfluxDB tags much for the EFD.
---------------------------------------------------------------------------------------------------------------------------------

Yes, we recently redesigned the InfluxDB schema for the LSST Camera to accommodate its hierarchical data structure (Camera, Raft, Sensor, Amplifier, etc)
This subsystem now uses many tags. You can view an `example configuration here`_.

.. _example configuration here: https://github.com/lsst-sqre/phalanx/blob/main/applications/sasquatch/values-summit.yaml#L184

For other subsystems, while the InfluxDB schema could be improved, query performance has been acceptable without extensive tag usage.
Therefore, this has not been a priority.

What happens when an InfluxDB data node goes offline? Is all data available?
----------------------------------------------------------------------------

Yes, with an InfluxDB Cluster and replication factor set to two, all data is still available when a node goes offline.


How easy is it to add a storage node to expand capacity? Can this be done without downtime?
-------------------------------------------------------------------------------------------

We use local storage for our InfluxDB cluster to enhance performance.
Thus you can expand the InfluxDB cluster storage capacity by adding more data nodes while maintaining the data replication factor.

Once the new data nodes are available, you just increase the replica count configuration for data pods.
This automatically registers the new cluster members, and thanks to the built-in high-availability (HA) configuration, this can be done without downtime.

Some considerations:
- Familiarity with the ``influxctl`` admin tool to manage shards in the cluster is required.
- InfluxDB Enterprise includes an `Anti-Entropy (AE)`_ service to ensure data consistency across replicas.
This service must be turned off during manual shard restoration to avoid conflicts with the restore process.

.. _Anti-Entropy (AE): https://docs.influxdata.com/enterprise_influxdb/v1/administration/configure/anti-entropy/

We recommend using a replication factor of two for the cluster.
With a replication factor of two each shard is replicated across at least two nodes.
Adding more than two data nodes increases storage capacity, however not all nodes will hold the same shards.

Chronograf and Kapacitor usage. How difficult is to create lots of dashboards? Did one team develop all dashboards, or was it distributed between other software teams? Learning curve remarks.
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

We use Chronograf for data visualization/exploration and Kapacitor for alerting.
Here are our observations:

- Chronograf is intuitive and enables our users to create their own dashboards. While we provide guidance, each team is responsible for creating and maintaining their dashboards.
- Lack of a "dashboard as code" approach. While dashboards can be exported as JSON, editing or version-controlling them is challenging.
- In the Chronograf UI, dashboards are presented as a flat list without folder organization or labels, making it hard to browse 200+ dashboards at the Summit.
- Query results are not cached, so shared links hit the database multiple times.
- Kapacitor has been effective for creating alert rules and integrating with Slack notifications. Chronograf provides a user-friendly interface for this.
- Both dashboards and alerts are stored in respective databases, requiring backups. Programmatic creation is not yet supported.

**Future Plans:** We are evaluating **Grafana** and **Apache Superset** to address some of these limitations while continuing to use Chronograf for data exploration.
We value Chronograf's query builder, which is easy to use and supports both InfluxQL and Flux.

How responsive is plotting look backs? How does this compare to Grafana?
--------------------------------------------------------------------------------

Chronograf is very responsive for long look backs.
Chronograf's ``:interval:`` template variable dynamically adjusts the time grouping in queries, ensuring smooth visualizations regardless of the time range.
It might be possible to implement similar functionality in Grafana using template variables, but we have not explored this yet.

Was it easy to maintain/interface/add to Kafka?
-----------------------------------------------

Kafka has a steep learning curve and a complex deployment setup with brokers, controllers, Kafka Connect, MirrorMaker 2, Schema Registry, etc.
The Strimzi Operator simplifies the deployment and management of Kafka on Kubernetes.

Our Kafka clients are mainly written in Python and Java.
We also provide a REST API (Confluent's Kafka REST Proxy) for easier Kafka interfacing using HTTP clients.

How straightforward are Kafka or JVM upgrades? Any JVM-related issues?
----------------------------------------------------------------------

We use `Strimzi`_ to manage Kafka deployments on Kubernetes, which simplifies many administration tasks including upgrades.

.. _Strimzi: https://sasquatch.lsst.io/developer-guide/index.html

No JVM-related issues have been encountered other than the usual memory tuning and garbage collection settings.

How are Kafka topics managed and accessed?
------------------------------------------

Kafka topics have three replicas for fault tolerance.
We disable topic auto creation in Kafka, thus each client must request topic creation.
While Kafka lacks native namespaces, we use a topic naming convention to group topics by subsystem.
Each client has a unique Kafka user with ACLs specifying accessible topics.
4 changes: 3 additions & 1 deletion docs/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -17,5 +17,7 @@ Sasquatch is currently deployed at the Summit, USDF and test stands through `Pha
User guide <user-guide/index>
Environments <environments>
Developer guide <developer-guide/index>
FAQ <faq>
Resources <resources>

.. _Phalanx: https://phalanx.lsst.io
.. _Phalanx: https://phalanx.lsst.io
18 changes: 18 additions & 0 deletions docs/resources.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
:og:description: Sasquatch resources.

.. _resources:

Resources
=========

- SPIE 2024 `paper`_ and `presentation`_
- `Chronograf training for Rubin observers`_
- `Acceptance criteria for InfluxDB Enterprise`_
- `USDF EFD storage requirements`_


.. _paper: https://dmtn-290.lsst.io/
.. _presentation: https://docs.google.com/presentation/d/1M6ES4Uk8CM1fnZuxbx33tbwXLePBKGkJCGEAj0AKaog/edit?usp=sharing
.. _Chronograf training for Rubin observers: https://vimeo.com/1001641391?share=copy
.. _Acceptance criteria for InfluxDB Enterprise: https://docs.google.com/document/d/1OuXdtOGMLvrXeIsiE1rx2V_4FgraJ_ZfpfALYaeQB3s/edit?usp=sharing
.. _USDF EFD storage requirements: https://sqr-085.lsst.io

0 comments on commit caafa0d

Please sign in to comment.