EuroSys'21 Reviews and Rebuttal
Paper #656 A Causal Approach to Mitigate Non-Functional Faults Resulting from Misconfigurations
Cauper is a technique for diagnosing root-causes of performance problems in configurable software systems. It is based on causal analysis of configurable parameters and observable system state. By using causal analysis, Cauper is able to identify root causes of performance problems much faster than other techniques.
- Interesting solution
- Thorough evaluation
- Limitations of approach unclear
Thank you for submitting to Eurosys. I enjoyed reading the paper and found both the approach and the results to be compelling. Most of my comments revolve around the presentation in the paper.
Cauper combines a handful of off-the-shelf algorithms into a pipeline that carefully selects which configurations to explore. It was hard to get an intuition for how robust the resulting pipeline would be. For example, how well do the techniques work on very limited data; what the effect would be of making poor inferences; whether there are assumptions about e.g. linearity in relationships; if it's possible that the technique may never converge; how robust it is to lots of unobserved confounders, and so on. The evaluation was reasonably robust, and the technique proved to be good in practice. So my interest here is primarily with respect to the presentation, discussion, and rationale for using the above techniques.
Of the above, I was particularly interested in convergence. Are you certain that the technique will converge? The text says "estimations of causal effect will become more accurate". But how many unobserved confounders do you need before convergence becomes intractable? How much does CAUPER rely on perfectly identifying all relevant factors?
In the presentation of Cauper's pipeline, in some parts it was difficult to distinguish whether the text was describing details of the off-the-shelf technique used, or additional details added by Cauper (e.g. the description of NOTEARS).
It wasn't clear to me if there's any special treatment of continuous vs. discrete variables. Also, concretely, I couldn't wrap my head around how the approach would handle boolean configurations (i.e., A -> B only when C is enabled, otherwise there is no relationship), non-linearity, and non-convexity. Would Cauper just get stuck in some local maximum?
In the paper's introduction, Cauper is distinguished from prior work as being developer-guided. Despite the description in the text, I still didn't quite grasp whether this materially changes the problem (especially, for example, how the evaluation can directly compare to numerous prior techniques). I would appreciate more directly pointing out how this affects the solution (i.e., why can't Cauper be used to just optimize latency in general?).
In section 6.2, under "data collection", could you clarify the experiments you ran here? With 28 different configuration options, this seems like an enormous number of different configurations to exhaustively test (at least 2^28?). I think I'm missing something here.
The overall metric that seems to matter in the evaluation is Gain, for which Cauper does well. Accuracy, precision, and recall did not make sense to me in this setting, since there is surely no ground-truth for non-functional faults? Do you measure this with respect to, perhaps, some global optimum? Please clarify.
In 7.2, "Cauper never deteriorates any non-functional property". This is an important point but it wasn't obvious to me that it was a design goal earlier in the paper (maybe I missed it reading the text) nor was it clear how the design enforces this. Please elaborate.
Lastly, while the evaluation is nice, my overall impression of ML applications is that they are a convenient target for work like this. That is, they offer a simple, deterministic, stateless, easy-to-measure workload. This is a good thing for a robust evaluation, though it does lead me to wonder how well Cauper (and the other techniques) would do as the systems get larger and more configurable.
Nit:
- The "$5 trillion" citation in the introduction is a bunch of marketing claptrap, please remove it.
- New contribution
- Well-written
- Good
- Weak accept (OK paper, but I am not enthusiastic)
Modern computing platforms (e.g., GPU nodes) are highly configurable. This paper considers the problem of how to detect erroneous configurations that can cause non-functional faults such as heat dissipation, latency, or energy consumption issues. Current configuration optimization techniques use a search process that samples the configuration space exhaustively to find an optimal or sub-optimal configuration, but can take a long time to converge. ML-based approaches can also take significant time and require lots of data. The proposed solution, Cauper, uses causal models to represent the complex interactions between the configuration variables and the performance objectives. It also uses incremental learning, where it incrementally updates the internal causal model with new observations to infer a candidate configuration that fixes the fault.
- Useful background
- Well-written paper
- Technically-sound solution
- Extensive evaluation
- Rather complex system
- Some technical details are missing
- Not clear how to configure the system
The paper presents an elegant and detailed solution to the problem of identifying the root-cause of non-functional faults and finding a configuration fix. The intuition of why causal models can work better than an ML approach or other optimization techniques is clear; however, some details of the proposed approach are missing. It also wasn't clear to me how a developer can use Cauper and how easy it is to configure it. More details follow.
Reference 36 at line 94 seems inappropriate. The article linked by reference 36 talks about data breaches caused by cloud misconfigurations, and specifically 33 billion records that have been exposed because of inappropriate security measures. Instead, this paper is focusing on non-functional faults such as latency, energy consumption or heat dissipation. The article also refers to public cloud environments, while this paper seems to refer more to private computing scenarios, or at least the developer stories mentioned at lines 97-102 seem to imply so.
The developer example described in the Intro and throughout the paper uses the term latency but "performance" (frames/sec in the example) would be more appropriate. Moreover, in the Introduction the authors should define what "GPU growth" means. I had to read later sections to understand what this meant.
What are the "prescribed edge orientation rules" (line 455) used to produce a partial ancestral graph? Of course, a reader could go and read the 6 papers linked in the footnote, but giving an intuition on whether these rules are manually or automatically inferred and which kind of observations they rely on seems necessary. I also would like to understand why these edge orientation rules can orient some of the edges but not all. How does the number of partially directed edges depend on the observations provided? What's the trade-off? With more observations would it be possible to have zero partially directed edges and skip the NOTEARS phase? Or vice versa, could NOTEARS be used for all edges without relying on orientation rules?
In the use case presented in Figure 4, how does one select the configuration options to choose from?
In general, I would have liked to learn more about Cauper as a developer tool, such as what the system's inputs/outputs are, how one should specify search queries, what can be configured and what is fixed. More importantly, I would like to understand how one can discover and specify which configuration variables to consider. This seems a critical aspect to guarantee good performance. In the evaluation, the authors use 28 configuration options, but they acknowledge that modern systems may have thousands of interacting configuration options (line 10, 1300), so I'm wondering how a user would deal with this complexity. It is also not clear how one can define which system events to model.
line 483: there are two possible approaches to orient edges in a PAG, but what is the trade-off? The first one is a manual approach: would this guarantee higher accuracy?
line 925: what does "accuracy (Jaccard similarity)" mean in this context?
The evaluation is extensive and the comparison with all different approaches is very much appreciated. However, the paper does not really evaluate the correctness of the causal models the system internally builds. It would be good to provide more insights on how often the models can be inaccurate and how much they can affect performance.
The paper contains many typos.
Nits:
- Line 435: Table 3a -> Figure 3a
- Line 862: Please clarify that BERT does not perform sentiment analysis. BERT can be fine-tuned with an external dataset to perform sentiment analysis.
- New contribution
- Adequate
- Average
- Weak accept (OK paper, but I am not enthusiastic)
The paper proposes CAUPER, a diagnostic tool for identifying the source of non-functional faults in computer systems. CAUPER is based on a causal inference approach: it constructs a causal model graph based on observations, which it can then use for counterfactual reasoning to identify hardware faults. Experimentally, CAUPER compares well with both ML-based and search-based diagnostic and optimization approaches, both in the sense that it generally produces more accurate explanations for faults and that it runs faster.
The task of automatically identifying faults in software is certainly a compelling one, as the presence of many fault reports on NVIDIA's forum shows.
The idea of applying causal inference to this task, in order to support counterfactual reasoning about what would happen if parameters were changed (based on measured performance data) is quite natural and is well-motivated. This approach is particularly well-suited to non-functional faults in which some performance data is still available about the system even when the fault is present (compare faults that cause the machine or program to crash, where there is only one bit of information available: the fact the program failed).
The experimental results do seem very broad and convincing, showing the benefits of CAUPER over previous approaches. However, as I am not very familiar with this space, my impression here could be off-base.
The construction of the causality graph seems a bit ad hoc, with many components pulled from previous work and explained in a somewhat hasty manner. For example, the entirety of page 5 of the paper seems to be explaining how FCI and NOTEARS work, but I don't have an intuition of why these were chosen in comparison to other possible approaches (indeed I don't even know what the other approaches are). Is FCI + NOTEARS just the standard method used for this sort of causal inference?
Not much detail is given about the query engine CAUPER uses. Section 4.3 is vague about this: for example, it is not exactly clear how CAUPER gets from the text "How to improve my latency?" to the counterfactual statement described in (5).
- New contribution
- Well-written
- Good
- Weak accept (OK paper, but I am not enthusiastic)
- Updated: 14 Jan 2021 6:56:36pm GMT
This paper applies Causal Bayesian Networks to diagnose and repair non-functional faults (such as prolonged latency, thermal faults, and energy consumption issues) of SOMs (Systems on Module).
The diagnostic procedure can be summarized as three steps: first, the FCI (Fast Causal Inference) algorithm is applied to learn a general PAG (Partial Ancestral Graph) from observational data; second, NOTEARS is applied to reduce the general PAG to a specific causal DAG (Directed Acyclic Graph) that describes the causal relationships between the measured variables; third, the causal paths that rank highest by ATE (Average Treatment Effect), averaged over all variables residing on the path, are selected as "causal explanations".
The repair-suggestion procedure is straightforward: enumerate all possible configurations of the options on a selected causal path and suggest as a repair the configuration that most increases the probability of improving latency/energy/thermal behavior, i.e., the configuration with the highest ITE (Individual Treatment Effect). A sketch of both steps follows.
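For readers less familiar with this style of pipeline, here is a minimal sketch of the two steps summarized above. All names (rank_paths, suggest_repair, estimate_ate, estimate_ite, candidate_settings) are hypothetical illustrations, not the paper's actual code.

```python
# Hypothetical sketch of the diagnose-then-repair procedure summarized above;
# not CAUPER's actual implementation.
from itertools import product

def rank_paths(causal_paths, data, estimate_ate, k=3):
    """Rank causal paths by the average treatment effect (ATE) of their nodes."""
    scored = []
    for path in causal_paths:
        avg_ate = sum(estimate_ate(node, data) for node in path) / len(path)
        scored.append((avg_ate, path))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [path for _, path in scored[:k]]           # top-k paths = "causal explanations"

def suggest_repair(path, candidate_settings, estimate_ite):
    """Enumerate settings of the options on one path and pick the highest-ITE repair."""
    options = [opt for opt in path if opt in candidate_settings]
    best_ite, best_repair = float("-inf"), None
    for values in product(*(candidate_settings[opt] for opt in options)):
        repair = dict(zip(options, values))
        ite = estimate_ite(repair)                    # individual treatment effect of this repair
        if ite > best_ite:
            best_ite, best_repair = ite, repair
    return best_repair
```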
The system is prototyped and verified with Nvidia's Jetson platforms for workloads including CNN image recognition, BERT sentiment analysis, Deep Speech voice recognition, SQLite DBMS, and X264 encoding/decoding. The proposed method is validated to outperform simple baseline techniques such as CBI and BugDoc, both in the effectiveness of repairing non-functional faults and in the speed of the repair search.
(1) The authors extract all posts pertaining to NVIDIA Jetson Platforms after 2017, identify 47 performance-related issues, and identify three major categories of non-functional faults. It is nice that the filtered posts are also released with the paper.
(2) Applying causal learning and inference to diagnose and recover system issues is an important and active research topic.
(3) The experimental studies are solid and extensive. It covers three different SOMs (TX1, TX2 and Xavier) and five workloads. The applicability of the diagnostic and repairing procedure is well evaluated.
I am concerned about the causal methods used in this paper.
(1) The causal learning method is an "ensemble" of two different structural learning techniques. The technique used to resolve the conflicts between the two different methods is not rigorous at all and the validity is doubtful.
(2) Drawing causal conclusions with counterfactual inference methods, in general, requires many hefty assumptions/conditions, such as identifiability, faithfulness, and the treatment of unmeasured/unobserved hidden confounders. These assumptions and conditions may not hold in real system settings, and it is not easy to bridge the gap. It is important that, for system applications, these conditions and assumptions are well taken care of. I did not see the paper discuss these enough or contribute to reducing the gap.
(3) This paper selects a few top-ranked paths (default: top three paths) in the learned DAG, and then maps these few paths to causal explanations of performance issues. This is also not rigorous. Will the top paths cover the causal knowledge well? If a large number of paths is needed to cover the causal knowledge well, selecting only a few top paths seems problematic.
(4) Another typical challenge of applying causal inference to system applications is scalability. The paper also admits that a technical challenge is that the configuration space is combinatorially large, but I only see about 13 variables in the experiment part. More importantly, the major structural learning method applied in the paper (i.e., FCI) also does not scale up to high-dimensional settings (RFCI improves the efficiency of FCI a lot for high-dimensional structural learning). Will the approach scale up to more configurable variables in real systems? Many systems nowadays are microservice-architected; each microservice has its own configuration options, which jointly/interactively impact overall system performance. In this setting, the total number of variables largely exceeds 13.
(1) In Section 4.1, the study is not truly an observational study. The data is collected with controlled experiments where configuration variables are manually set to randomly chosen values. If you have control over all the variables, why not do randomized experiments to infer causal effects? Truly randomized experiments would also indirectly eliminate all the unmeasured hidden confounders. Your data is not like traditional observational data, where some variables (such as human age) are imbalanced between the treatment and control groups and you have no control over them.
(2) In Section 4.1, the hybrid method of FCI and NOTEARS seems problematic.
a) You use a hybrid of FCI and NOTEARS for DAG learning. For some edges in the DAG, you choose the result to be in accordance with NOTEARS, while for the others, you choose to follow FCI. For example, for bidirectional edges the result follows FCI, while for partially directed edges the result follows NOTEARS. What if both of them are incorrect? What if only one of them is correct but you chose the wrong one?
b) Suppose there is an unmeasured confounder and it is implicitly represented by a partially directed edge (solid arrow with a circle at the tail) in the FCI PAG. If it corresponds to a directed edge (plain solid arrow) in the NOTEARS DAG and you choose to replace the partially directed edge with a directed edge (in accordance with NOTEARS), does this not just eliminate the unmeasured confounder? How do you deal with hidden confounding bias?
c) You claim that it is very important to capture intrinsic non-linear interactions between options. But NOTEARS is in fact a continuous optimization technique for learning only linear SEMs (Structural Equation Models). Is this valid?
(3) In Section 4.2, you select the top k paths (default: k=3) in the learned DAG. Will the top paths cover the causal knowledge well? How do you know k is large enough? If k is very large and you extract a great number of paths, how could you interpret those as causal explanations? Besides, if the average ATE of the causal paths is evenly distributed (a long-tailed ATE distribution over paths), selecting only the top ones while filtering out the rest is obviously problematic.
(4) In Section 4.3, the repair procedure is basically an exhaustive search for the highest ITE over all possible configurations (in the repair set). Though the top selected causal paths reduce the size of the repair set a lot, it is still an exhaustive search and may fail to scale. When there are many selected paths and all the paths include a large number of decision variables, exhaustive search and frequent inference over the large DAG are computationally prohibitive.
(5) In Section 7.2, you claim that your approach is much faster than Bayesian Optimization (i.e., PESMO). The speed of Bayesian optimization largely depends on the complexity of the learning step, the complexity of the "argmax" step, and the intensity of exploration. Did you try different Bayesian optimization models or PESMO with different hyperparameter settings? In general, my understanding is that causal inference is for analysis rather than control, and black-box optimization techniques are more suitable for finding optimal repairs.
- Incremental improvement
- Well-written
- Good
- Weak reject (This paper should be rejected, but I'll not fight strongly)
Cauper is a system for automatically inferring causal dependencies between configuration parameters and observed performance metrics, and for reasoning about ways to repair non-functional faults (like poor performance or energy consumption). It draws on a substantial amount of work on probabilistic graphical models and other areas of statistics.
Several 'textbook' techniques are combined to build a system that is able to bootstrap a model of internal dependencies from a small number of randomly generated configurations, and use that to drive an automated exploration of possible 'repairs'.
The system is evaluated using a real-world bug database from several Nvidia ML embedded systems and compared against 5 previous works on automated diagnosis (based on ML and search/optimization). It is shown to find good-quality configuration changes in less time than the other methods.
- Draws on considerable knowledge and expertise in the probabilistic models domain.
- Provides an interesting overview of the capabilities of such techniques.
- Applied to some real world configuration problems with reasonable success.
- Compared against relevant alternatives using an impressive and apparently very time consuming study.
- Arguably the wrong venue to be reviewing this work - though it is applied to systems problems, this pretty much only changes the names of the variables in the diagrams.
- The limitations of the described techniques are not well explored.
- Very unclear how well Cauper would scale to larger and more complex scenarios (with many more variables and more complex configuration spaces) - both in terms of the diagnosis quality and whether the search could be parallelized
- Requires online trials for diagnosis - which is not always feasible.
Thank you for submitting your work to EuroSys'21. Over my career I have been exposed to a variety of probabilistic methods and inference techniques, and it's always interesting to come across some new ones and see how well they can be applied to systems problems. Both FCI and NOTEARS were new to me, and I needed to go and read up about them.
I have seen various papers like this in systems conferences that aimed to do root-cause diagnosis from observable performance counters. They often have a nice working example that exhibits strong correlation, and the systems reader will think "er, yes... those two counters basically measure the same thing". In this paper, that wasn't the case - but mainly because I have no idea what 'GPU growth' might mean! ;-)
But this brings me to a more substantial point - while it is interesting for a systems audience to read about some new techniques, they are not best placed to peer review the contributions which are clearly in the statistical learning field.
I appreciate the huge amount of work that was involved in the evaluation of this system on a real-world dataset, and Cauper seems to produce better automated diagnosis results than the other ML/search techniques you compare against. However, I did not get a lot of insight about the systems aspects of your techniques, e.g., how easy would it be to apply in other scenarios? How scalable/parallelizable is it? What are the ways in which this could go wrong or be misapplied?
Figure 1... this would perhaps be easier to understand if the term "GPU growth" (and other variables) meant anything to the reader. This appears to be the name of some performance counter specific to a piece of NVidia hardware, but might as well be X, Y or Z. There is no helpful systems intuition here (at least for me).
Similarly, on p7, "swap memory" being a probable root cause comes across as borderline gibberish.
Design
Your work builds on several previously published techniques that I'm assuming are commonly known in your field (at least many of the references come from textbooks by Judea Pearl!). The two main parts also seem to fall into that category:
Fast Causal Inference (FCI)
NOTEARS [102]
You provide a high level description of these algorithms but I don't come away with an idea of how expensive they are, e.g.
To generate a DAG, NOTEARS learns an adjacency matrix from the observational data.
The description of NOTEARS is quite abstract and doesn't appear to mention how the data is used in updating the adjacency matrix (which it surely must be)
I'm also not 100% clear if there were new techniques you needed to invent for this work, or whether it was a case of plugging together some well known and tastefully chosen algorithms?
4.4 Incremental learning
we generate a new configuration using the recommended repairs from Eq. (8). We reconfigure the system with the new configuration and we observe the system behavior. If the new configuration addresses the non-functional fault, we return the recommended repairs to the developer
How expensive is this in your case? Presumably to run one of your 5 workloads long enough to get some statistically significant observations of thermal load/energy takes a while? Was this time chosen by hand?
5 Case Study
Cauper could diagnose the root-cause of the misconfiguration and recommends a fix within 24 minutes.
I'm not sure how to interpret this statistic... How much of this time is spent in reconfiguring/building/deploying and measuring configurations? How many configurations needed to be explored? How would this time scale with a larger number of variables or configuration options? (What's the inherent complexity - or does it depend on how 'obvious' the causal relationships are in the observed variables?)
6 Experimental setup
In places it is unclear which parts of this section are required to train/use Cauper and which are purely part of building a labeled test dataset, e.g.
Data Collection. We exhaustively set each configuration option to all permitted values. Note, for configuration options that were continuous we choose 10 equally spaced values between the minimum and maximum permissible values
We create a ground-truth data by inspecting each configuration labeled faulty in Fig. 5 and identifying their root-causes manually.
Is this tractable? What happens with values that are unbounded and where sensible configuration values may not be uniformly distributed. What do you do with inconsistent/illegal parameter combinations? (these presumably will get 'learned' but if 99.999% of configurations are invalid/nonsensical this may take a while)
Repair quality metric:
This metric seems to assume a lot about the distribution of the variables that you are measuring. It can obviously be used to compare two repairs (i.e. it's a metric) - but your abstract quoted '14% better' as though this meant something.
7 Evaluation
This section compares Cauper with four state-of-the-art ML-based methods for fault diagnostics, namely...
I am impressed that you compared against these schemes given the huge amount of time this would clearly need to collect data and evaluate results.
However, I don't have a good idea as to what it means to apply these methods. Is this a matter of running some existing library code (or R program) against a spreadsheet of values? Or is this a more procedural/algorithmic process that involves automating trial experiments to measure potentially novel and unseen configurations?
Gain - this metric seems arbitrarily non-linear. How does it weight the various 'target metrics' or does it just consider the one that was 'faulty'?
What happens when there are a very large number of target variables that should not be adversely affected when 'fixing' the problem variable(s)? (I see that Sec7.2 describes a problem like this for SMAC - but is Cauper immune to such unwanted repair side-effects?)
Wallclock Time:
It would be good to break down what this includes. How much is automated? How much requires user intervention? How much is parallelizable?
Since Cauper uses incremental learning with only 25 initial samples, it is much faster. On average, Cauper is 7× faster than ML-based methods.
This seems a strong conclusion to make from a limited number of datapoints.
Nits:
There believe there are two reason for this
- Incremental improvement
- Well-written
- Good
- Weak reject (This paper should be rejected, but I'll not fight strongly)
The paper presents CAUPER, a causal performance debugging tool that enables users to identify, explain, and fix the root cause of non-functional faults early. The paper uses GPU performance issues as a showcase.
- Answers "what if" questions, e.g., "What would be the effect of changing another configuration X?" and "What is the effect of CPU growth on latency if the available swap memory was 2GB?"
- Uses novel causal models to express complex interactions between the configurations and the performance objectives with a rich graphical representation. Specifically, the use of causal structural discovery algorithms [86, 102].
- The paper studied 1000 posts on NVIDIA platforms and found 47 performance issues.
- Compares with the state of the art, such as delta debugging, CBI, BugDoc, and EnCore.
- Evaluation dataset is robust.
- It is unclear whether the "mathematical" model is scalable and how general it is.
- A lot of "we do ..." (which suggests a lot of manual effort) as opposed to "the model does ..." (which would be the more automated way if ML models are used).
- Depends on some observational data.
Great work! I enjoyed reading it. I have some questions, doubts, etc. but overall I think this paper warrants an acceptance. As listed above, there are more pros than cons.
"We gather a few dozen samples of observational data" -- I think this is as you state, a limitation of the techniques that you kind of need to have good samples. While if you use ML, you can just dump as much data to the model.
"We build a dense graph by connecting all pairs .." -- Here is where I got worried about the overall approach. Who is "we" here? If "we" is a human, then how general/scalable is the problem? How much effort one must put into to get a reasonable graph that accurately depicts a complex system? If "we" is not human but rather your analytical modeling, then please change all the "we do ..." to the right subject.
This is counterintuitive to what you mentioned in the introduction where you said ML will have issues when there are many configurations. To me, that's where ML will impress us, we just dump a lot of configurations, values, and observable data, and ML can help us connect the dots. ML is more scalable, but yes, I agree with you, it will be less accurate.
Ranking causal paths: "For further analysis, we use paths with the highest causal effect." -- This also worries me a bit. It seems you're assuming there is one dominant factor, but surely there are performance-configuration relationships that are multi-dimensional.
"Table 3a" -- where?
Some user study would be interesting, e.g., give a bad command line to compile or run the system, ask the users why the performance is bad, and then ask them to run CAUPER and use the recommendations.
I like how you can quantify things in the evaluation. I think your dataset is valuable, and other people can use the same dataset and build better models/solutions without going through the hassle of buying the devices, etc.
I recommend the authors find a better title that can attract a wider audience :) I think what is cool in this paper is the use of GPU performance faults, which matter a lot for many people these days. While configuration debugging is important, it's an almost overcrowded field. But if you put something like "Causal Performance Diagnosis (or Debugging) of GPU Non-Functional Faults", it might attract more readers. In other words, I might be interested in watching a YouTube video of "How I made my Tesla 2x faster" rather than just "How I made my car 2x faster" :)
- New contribution
- Well-written
- Excellent
- Accept (Good paper, I will advocate for it)
R-AuthResp Response by Rahul Krishna [email protected] (2145 words)
Dear Reviewers,
Thanks for the constructive feedback! We'll address all of the content and presentation issues in the next revision.
CAUPER was tested with 28 options, which yield a combinatorially large space of possible configurations.
Building: Causal discovery algorithms can be modified to handle thousands (even millions) of variables with small sample sizes (<1000) with high precision and recall, e.g., ParallelPC and FGEC. As our target systems had far fewer variables, building the causal graph was not a bottleneck.
Path discovery: The causal graph G=(V,E) for many real-world problems has been shown to be sparse (Kalisch2007; Colombo2014). The average degree of each vertex is usually small, so the number of causal paths to enumerate remains manageable.
Counterfactual repair: An exhaustive search over the paths (even on the top-k paths alone) may be a bottleneck. In practice, this wasn't limiting because: (a) the search happens on a subset of all the configuration options (i.e., those in the path); (b) no additional time is spent measuring every candidate repair on the real system; instead, the outcome is estimated offline from the existing model. Therefore, exhaustive search was relatively fast (roughly 5 seconds to evaluate all repairs; see Q4.5 below).
Choice: The task of learning correct causal structures is NP-hard (Chickering2004); practical algorithms guarantee asymptotic correctness at best. In practice, it's best to use expert judgment to validate the correctness of causal graphs, but such expertise can be impossible to have as the number of configuration options increases. Thus, we use a hybrid of two off-the-shelf discovery paradigms chosen based on two complementary assumptions---FCI is a statistical method that can handle hidden confounders, and NOTEARS (an optimization approach) can offer confidence scores for edge directions by relaxing the assumption of hidden confounders.
Orientation: Generally, we choose FCI over NOTEARS since FCI highlights hidden confounders (bidirected edges). Partially directed edges (e.g., o--->) may appear in FCI, indicating weak statistical evidence for directionality (due to insufficient data/latent confounders); only there do we use NOTEARS, which relaxes the assumption of latent confounders. If NOTEARS produces an edge under this relaxed assumption, we keep the directed edge.
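A small sketch of this edge-resolution policy, with assumed edge encodings (an illustration of the stated rule, not the actual implementation):

```python
# Illustrative edge-resolution rule: prefer FCI; consult NOTEARS only for edges
# that FCI leaves partially directed. Edge encodings are assumed for this sketch.
def resolve_edge(fci_edge, notears_edge):
    """fci_edge in {'-->', '<->', 'o->', 'o-o', None}; notears_edge in {'-->', '<--', None}."""
    if fci_edge in ('-->', '<->'):        # directed, or bidirected (hidden confounder): keep FCI
        return ('FCI', fci_edge)
    if fci_edge in ('o->', 'o-o'):        # weak/partial evidence for directionality
        if notears_edge is not None:      # NOTEARS (assuming no latent confounders) orients it
            return ('NOTEARS', notears_edge)
        return ('FCI', fci_edge)          # assumed fallback: keep the partial edge as-is
    return (None, None)                   # neither method reports an edge
```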
Correctness: The resulting causal graph may be approximate and incomplete. Instead of a (non-automated) manual refinement of the causal model, a greedy refinement strategy is used to incrementally update the causal graph with more samples. We use experimental metrics (gain) as a heuristic to determine if the causal model is approximately correct to detect and repair the faults.
We chose configuration options based on NVIDIA/Tensorflow's guides and tutorials. For evaluation, we randomly sample permitted values using a uniform distribution (we'll explore other sampling strategies in future work). For each sample, latency, energy, and heat are measured 5 times and averaged; this sampling is sketched below. Our collection budget is 24 hours per software/hardware combination. For 5 software systems and 3 hardware platforms, data collection took 15 days. From this data, we find the non-functional faults and we manually find the fix that achieves the best performance. These fixes are our ground truth. If CAUPER recommends fixes to the same options as our manual fixes, then the accuracy, precision, and recall will be high.
Note: We discard all invalid configurations (during data collection and iterative learning) to prevent learning incorrect relationships.
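A minimal sketch of the sampling and measurement procedure described above; the option table, measure() callback, and helper names are hypothetical, not the released data-collection scripts:

```python
# Hypothetical sketch of the data-collection step: continuous options are
# discretized into 10 equally spaced values, values are sampled uniformly at
# random, and each metric is measured 5 times and averaged.
import random
import numpy as np

def candidate_values(option):
    if option["type"] == "continuous":
        return list(np.linspace(option["min"], option["max"], 10))
    return list(option["permitted_values"])           # discrete: all permitted values

def sample_configuration(options):
    return {name: random.choice(candidate_values(opt)) for name, opt in options.items()}

def measure_config(config, measure, repeats=5):
    """Average latency/energy/heat over repeated runs of one sampled configuration."""
    runs = [measure(config) for _ in range(repeats)]  # measure() returns a dict of metrics
    return {m: sum(r[m] for r in runs) / repeats for m in runs[0]}
```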
Configurations: We use the default software and hardware configuration options and refer to system documentation (of NVIDIA/Tensorflow) to include additional performance-related configuration options. The user may include additional configurations, but as the number of chosen configuration options grows, they'll need larger numbers of samples, which can be expensive.
System Events: Common system events reported by iperf can be used. Depending on the task, additional system events can help enrich the causal graph, but this may incur instrumentation cost.
Counterfactual-Queries: Current work manually generates counterfactual expressions (eq. 5) for common user queries: improve performance (e.g., latency), keeping everything else unchanged. In the future, we hope to develop a more principled domain-specific language (DSL) to convert arbitrary user queries to counterfactual expressions in a fully fledged framework.
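As an illustration of this manual mapping (not a DSL the paper provides), a query like "How do I improve my latency?" could be captured as a small structured object before being turned into the counterfactual expression of eq. 5:

```python
# Hypothetical structured form of a user query; the actual conversion to the
# counterfactual expression (eq. 5) is done manually in the current work.
from dataclasses import dataclass

@dataclass
class CounterfactualQuery:
    improve: str                                # target to improve, e.g. "latency"
    hold: tuple = ("energy", "heat")            # targets that must not deteriorate

query = CounterfactualQuery(improve="latency")  # "How do I improve my latency?"
```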
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Q1.1 Convergence and Local Optima?
Causal discovery methods can only offer asymptotic convergence guarantees (Spirtes1999). Thus, our approach uses a greedy heuristic where we terminate once the causal graph can provide a suitable repair to fix the non-functional fault. Since CAUPER is not an optimization technique we do not seek a global optimum.
Q1.4 How CAUPER prevents non-deterioration?
We account for non-deterioration of other variables by formulating the counterfactual expression (eq. 5) to ensure that fixes to any one target variable (e.g., latency) don't negatively affect other target variables (e.g., energy/heat).
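A minimal sketch of this non-deterioration check, with assumed helper names (estimate_outcome stands in for evaluating the counterfactual expression of eq. 5, and the "lower is better" convention is an assumption of this sketch):

```python
# Illustrative acceptance test: a candidate repair is kept only if the estimated
# target improves while no other target variable deteriorates.
def acceptable(repair, estimate_outcome, current, target="latency",
               others=("energy", "heat")):
    predicted = estimate_outcome(repair)               # offline estimate from the causal model
    improves = predicted[target] < current[target]     # e.g., lower latency is better
    no_regression = all(predicted[o] <= current[o] for o in others)
    return improves and no_regression
```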
Q2.1 Why can't orientation rules direct all edges?
Edge orientation is an automated process that depends on the amount of available data. Sometimes, there may not be enough data to determine the orientation of some edges, resulting in partially directed edges.
Q2.2 No. of observations vs. no. of partially directed edges.
The number of partially directed edges should in theory reduce with the number of observations provided. While the theoretical upper bound is hard to establish, we will include a sensitivity analysis to offer empirical evidence.
Q2.3 How can one discover and specify which configuration variables to consider and which system events to model?
Discovering which configuration variables to consider requires some domain expertise. In addition to frequently used defaults, we can choose configuration options based on software and hardware documentation, guides, and tutorials. For system events, we may use the events reported by off-the-shelf profiling tools like iperf or NVIDIA perfworks. We'll clarify these in our next draft.
Q2.4 Edge orientation manual vs. ensemble trade-off.
Manual orientation is generally the most accurate, but it requires a lot of domain expertise and becomes intractable when the number of configuration options is high.
Q3.1 Is FCI + NOTEARS just a standard?
There are several other methods (statistical and optimization methods) for causal discovery that make different assumptions (e.g., discrete/continuous values, etc). We'll provide a detailed explanation of when and how to use other methods in our next draft.
Q3.2 How does CAUPER get from "how to improve my latency?" to the counterfactual statement.
In the current work, we manually convert the above textual query to the probabilistic expression on eq. 5 by assuming that the user wants to reduce their latency, given everything else stays the same. In the future, we plan to develop a more principled domain-specific-language to convert any user query to a counterfactual expression.
Q4.1 FCI+NOTEARS Hybrid--Edge orientation correctness.
For most edges, we choose FCI over NOTEARS since FCI indicates hidden confounders with bidirected edges. Sometimes, partially directed edges (o---o and o--->) appear, indicating weak or partial statistical evidence for directionality (because of latent confounders); only in such cases do we use NOTEARS, relaxing the condition (by assuming no latent confounders) to detect an edge. If NOTEARS produces an edge under this relaxed assumption, we keep the directed edge (assuming no latent confounder).
We acknowledge the incompleteness of this approach. To address it, we refine the causal graph with incremental learning, resolving inconsistencies as more samples arrive. Since it is hard to determine the true causal graph, we use experimental metrics (gain) to determine whether the causal model is approximately correct, i.e., whether it can discover a fix that addresses the non-functional fault.
Q4.2 Causal sufficiency, identifiability, faithfulness assumptions, and unmeasured confounders.
We implicitly ensure our counterfactual query is identifiable. We ignore non-identifiable paths and queries. Unmeasured confounders are accounted for by FCI (indicated with bidirected edges), and this is accounted for during counterfactual query evaluations. Faithfulness is a necessary assumption in order to build a causal graph; however, it is not an easily testable assumption in practice. Violations could result in incorrect/sub-par repairs. We'll discuss these in detail (with empirical evidence) in our next draft.
Q4.3 Why not do random interventions? Data is not like traditional observational data where some variables are not intervenable.
We avoid inferring causal relationships with random interventions because each intervention must be measured by deploying the system and this can take very long to run and validate. Further, like traditional data, many of our measured variables are in fact non-intervenable (e.g., system events like cache miss ratio, major faults, minor faults, etc.).
Q4.4 Choosing top-k paths
k is a tunable hyperparameter. Time/budget permitting, we may increase k further. But if we increase k to a large number, counterfactual repair generation may take too long, and this is an important trade-off to consider. In general, the distribution of ACE scores was skewed (the top-k paths had considerably larger ACE than the rest). For our use case, empirical evidence (which we shall add to our draft) showed that k=3 performed as well as k > 3.
Q4.5 Repair generation is exhaustive, even if there are only k paths.
We acknowledge that there may be a bottleneck with an exhaustive search over the paths (even if it's only on the top-k paths). In practice, this wasn't very limiting because the search uses counterfactual reasoning over existing data. We spend no additional time measuring each repair on the real system; the outcome is estimated from the existing model (and it is done offline). Therefore, evaluating large combinations was still relatively fast (it takes roughly 5 seconds to evaluate all repairs and find the best candidate).
Q4.6 Different BPO's or hyperparameters.
BPO hyperparameters were chosen after extensive tuning. We used the hyperparameters that gave us the best models. PESMO, for multi-objective optimization, was used with a GP as the surrogate model, and we used the Matern covariance function for the GPs. The acquisition function is minimized using L-BFGS. The gradients of the acquisition function are approximated by differences, as suggested by the original implementation of PESMO. For single-objective optimization, we used SMAC for minimization with a GP, with the number of restarts set to 20. We set the maximum number of iterations to 200.
Q5.1 How expensive is incremental learning, reconfiguring/building/deploying for a fix?
The wallclock time reported includes only the total time to discover a fix for all misconfigurations. Additional time to set up, build, and deploy a new system is subjective, depending on factors such as workload size and the hardware. This additional time applies to all methods. In our setup, each step of incremental learning takes avg. 2 minutes and involves: (1) counterfactual query evaluation (15-20 sec avg.), (2) reconfiguring/building/deploying the recommended fix (avg. 70-75 sec), and (3) updating the causal model (avg. 30 sec).
Q5.2 How did you compare with other ML methods?
We implemented the other approaches and procedurally compared the repairs offered by each method with CAUPER. We have made our implementations available in our replication package with detailed instructions on how to use them.
Q5.3 What happens when there are a very large number of target variables that should not be adversely affected when 'fixing' the problem variable(s)?
CAUPER is specifically designed for multi-objective problems where multiple variables are involved. Users of CAUPER incorporate this into the counterfactual question to ensure fixes don't adversely affect them. We enforced this as an implicit constraint in CAUPER for the 3 target variables studied here. For example, when reducing latency, we enforce (in eq. 5) that the heat and energy consumption must remain the same.
Q5.4 Learning from 25 samples is limited for conclusion.
We agree. We use 25 samples (this could be varied depending on the dimensionality, i.e., the size of the configuration space) as our initial sample set, but we incrementally increase the sample size and update our causal model until we find a repair. The final sample size was at most 100 samples (a small fraction of what other methods need).
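A sketch of the incremental loop this describes, with assumed helper names; the 25/100 sample counts come from the response above, and the explicit cap is only for illustration:

```python
# Hypothetical incremental-learning loop: learn a causal graph from the current
# samples, recommend a repair, and add observations until the fault is fixed.
def incremental_repair(initial_data, sample_more, learn_causal_graph,
                       recommend_repair, fault_is_fixed, max_samples=100):
    data = list(initial_data)                  # e.g., 25 initial observations
    while True:
        graph = learn_causal_graph(data)       # FCI + NOTEARS over current samples
        repair = recommend_repair(graph, data) # counterfactual query over the graph
        if fault_is_fixed(repair):             # deploy the repair and observe the system
            return repair
        if len(data) >= max_samples:
            return None                        # sampling budget exhausted (cap assumed)
        data.extend(sample_more(repair))       # gather new observations and iterate
```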
Q6.1 Who is "we"?
Apologies, "we" refers to analytical modeling, not the human. We'll change the subject.
Q6.2 Aren't performance-configuration relationships multi-dimensional?
Each causal path is a unique combination of configuration options leading to the performance. By choosing the top-k paths, we account for various combinations of performance-configuration relationships.
Q6.3 User study
This is our immediate next step: we're in contact with NVIDIA developers to get their feedback on using CAUPER.
Q6.4 Better Title
Thank you! We'll try to change the title along these lines.
[Kalisch2007] M. Kalisch and P. Buhlmann. Estimating high-dimensional directed acyclic graphs with the PC-algorithm. J. Mach. Learn. Res., 8:613-636, 2007.
[Colombo2014] D. Colombo, M. H. Maathuis, M. Kalisch, and T. Richardson. Learning high-dimensional directed acyclic graphs with latent and selection variables. Annals of Statistics, 40(1):294–321, 2012.
[Chickering2004] Chickering, D. M., Heckerman, D., and Meek, C. (2004). Large-sample learning of Bayesian networks is NP-hard. The Journal of Machine Learning Research, 5, 1287–1330.
[Spirtes1999] P. Spirtes, T. Richardson, and C. Meek. Causal discovery in the presence of latent variables and selection bias. In G. Cooper and C. Glymour, editors, Computation, Causality, and Discovery, pages 211-252. AAAI Press, 1999.
This paper was discussed at the meeting. While the reviewers appreciated the importance of the problem, everyone agreed that: (a) it is not clear what the main technical challenge in crafting the solution was (what was non trivial about combining the two techniques?) (b) it is not clear when the approach would not work, and (c) what scalability challenges the approach faces.