FSE'21 Reviews and Rebuttal
Paper #975 CAUPER: Debugging and Fixing Misconfigurations using Counterfactual Reasoning
- Weak reject
Non-functional (NF) faults (e.g., excessive latency or energy consumption) in configurable systems can lead to unexpected consequences by degrading system performance. These faults are often attributed to a misconfiguration (choosing an incorrect combination of configuration options) and fixed by selecting a different system configuration. However, most modern configuration spaces are huge, and analysis/optimization is confounded in systems that have both hardware and software configurations. This paper presents CAUPER, a technique that uses causal inference and counterfactual reasoning to identify and repair non-functional faults. A study on three Nvidia Jetson hardware platforms (using 5 software systems) demonstrates that CAUPER is more efficient and effective than state-of-the-art techniques. A public benchmark of non-functional faults is provided for the research community as part of this project.
Strengths
- Interesting idea
- Works on mixed hardware/software systems
- Dataset provided for the community
Weaknesses
- Lacking detail and intuition behind technique
- Not clear why sampling and classification or prediction cannot provide similar results
- Study lacks sufficient detail to understand the practical benefit of this approach
The overall idea of this paper is very interesting. I like the simplicity of causal inference, and the results are better than those of the state-of-the-art techniques used for comparison. I am also happy to see the benchmarks being made available. I see value in this paper and expect it would benefit the broader ESEC/FSE research community. I do have reservations about the explanation of and intuition for choosing this technique, as well as about the choice of comparisons against the state of the art. The working example is simplified to the point of showing only 2 configuration options, which makes it deceptively simple. Also, some simpler techniques are ignored, such as prediction (which uses regression equations and also does not need source code) and classification trees (which have been used in the past to characterize failures in configurable systems). I also believe that some approaches are discounted because of the mixed hardware/software setting, since they are white-box techniques; hence, this work might be better scoped with respect to the target domain. Lastly, the study presents the results at an abstract level; I find it hard to understand what the true benefit of this technique is.
-
In general, the intuition behind and description of the technical machinery used are missing. I would not expect this audience to have a background in causal inference. The approach is presented superficially, with decisions about stages of the PAG construction that seem arbitrary (e.g., "we use the information-theoretic approach..."). The reader has to take on faith that these choices and the use of causal inference are a good idea. Perhaps a motivating example that demonstrates why causal inference improves over other approaches would help.
-
The use case for this work is not clear. It seems to focus on finding the causes of existing faults, but the causal models are built without specific faults, so there is a disconnect between these two parts of the approach. Also, I believe that part of the real need for this work comes from the mixed hardware/software environment; hence, scoping this work could help with the overall story/contribution.
-
There has been research on performance prediction (using regression), which can also be used in combination with sampling. While it may not provide fixes, this approach can predict the performance of various NF properties (the paper references some of this work). It would be a better comparison since it uses small dynamic samples to derive regression models for the configuration space.
-
Since NF faults may be rare (in the configuration space), it would be nice to have evidence that the small random sample you use to construct the causal graphs is sufficient. There has been a lot of research on sampling for testing configurable systems, and it seems that this could be leveraged here and might perform better.
-
As a follow-on, I am surprised that there is no discussion of classification trees, which have been used for 'fault characterization' of functional faults in configurable systems. This approach has been shown to work with sampling, so it seems like it would also be fast and could be competitive.
-
For the counterfactual reasoning, a fault is needed as a starting point. But the fault also has to be tagged with the configuration options that were selected when it was found. Is there an assumption that this information is included with the fault (i.e., the full set of configuration options that were set)? This is important for understanding the practicality of the technique.
-
Figure 4 is used to explain how this technique works; however, it uses only 2 configuration options, which trivializes the problem. If there are 20+ configuration options, what are the other causal relationships, and how many would have to be considered? What happens when more than 2 or 3 patterns lead to similar NF faults? This example should be improved to explain the complete process.
-
Figure 1 needs a better explanation/caption. Figures 1a and 1b look identical except for the color. There is no explanation of what the values mean (1G, etc.). Maybe that is latency, but isn't latency on the Y axis? It is confusing and does not make the point intended in the paper.
-
Figure 2 should be made larger, with full captions/labels/legends. It is hard to understand what is being shown and what the takeaway from this data is.
-
For the data collection (page 7), the paper says it sets each configuration option. Is this an each-choice sample, where each value of each option is tried at least once? Or do you mean that you build the Cartesian product of all possible configuration settings combined? That would be huge for this dataset, hence I don't think that is the case. If you are only trying each configuration option once, then you lose the interactions between configuration options and cannot guarantee you have optimal data in your sample.
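To make the size gap between the two sampling readings concrete, here is a back-of-the-envelope comparison in Python (the 4-values-per-option domain size is an illustrative assumption; the 28-option count comes from the rebuttal below):

```python
# Illustrative arithmetic only: a full Cartesian product over 28 options
# is astronomically large, while an each-choice sample needs only about
# as many configurations as the largest option domain.
n_options = 28            # number of options reported in the rebuttal
values_per_option = 4     # illustrative assumption, not from the paper

print(values_per_option ** n_options)  # 72,057,594,037,927,936 configurations
print(values_per_option)               # each-choice: ~4 configurations suffice
```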
-
The study is lacking detail, which makes it hard to follow. For instance, you mention the dataset and how you collect the faults, but how many faults are there? How large is the configuration space, and how much of it did you explore? Are these faults related to a small set of the configuration options? None of this information is provided, making the study too abstract.
-
Did you reimplement the state-of-the-art techniques or use existing tools?
- For the results, the reader again has only an abstract view of the data (accuracy, precision, etc.), but there is no discussion of any concrete values/data, or even examples, so the data does not provide much insight into what the benefit of this technique is.
- How do the causal graphs handle multiple configuration options that influence the outcomes? Why show only 2 options in the example figure?
- How many actual faults are in the dataset, and are these distributed among different configuration options/NF outcomes?
- Have you compared against some sampling approaches to see if they produce better samples that are more representative of the data?
- Weak reject
This paper is about CAUPER, an approach for diagnosing non-functional issues (such as latency issues) in highly configurable systems with thousands of interacting configuration options and complex performance behavior. CAUPER learns and exploits the causal structure of configuration options, system events, and performance metrics to identify the source of non-functional issues. The authors show empirically that CAUPER effectively and quickly diagnoses the root cause of non-functional faults and recommends high-quality repairs to mitigate them.
- Simple approach for a timely and practical problem
- Reasons over both hardware and software failures
- Evaluation on real scenarios
- The paper lacks detail and intuition behind the design choices related to the technique
- Dataset, despite a link being provided, lacks details and data/scripts
** Soundness **
The problem is well motivated. However, the description of the approach, the experimental methodology, and the results lack details, which prevents properly assessing the soundness of the approach. A plus of this approach is that it reasons over both hardware and software issues.
** Significance **
Despite the issues with soundness, I think the technique has the potential to be significant based on the results reported in the empirical evaluation; namely, the proposed technique is shown to be better than state-of-the-art techniques.
** Novelty **
The idea of using causal inference and counterfactual reasoning to identify and repair non-functional faults is novel.
** Recoverability, Replicability and Reproducibility **
I commend the authors for promising to make the dataset available. The dataset, however, seems to be incomplete (e.g., an empty readme.md). Exploring the link, I fail to understand what is actually being provided; I don't think the authors are providing anything meaningful.
Moreover, there is no replicability and reproducibility package available. Furthermore, the paper lacks details, which would prevent one (or at least me) from implementing the approach.
** Presentation **
Generally well written and structured. However, it would be desirable to add details (e.g., descriptive statistics about the dataset, which are not given) and intuition behind the phases of the technique.
-
Are there plans to release a replication package, on top of the benchmarks?
-
Can you provide descriptive statistics on the dataset used in the empirical evaluation?
-
Did you check the results with the developers using NVIDIA's forum?
- Accept
The paper focuses on identifying the causes of non-functional faults and suggesting repairs for them. The approach operates in three phases. In the first, observational data is generated, from which a causal model is built. In the second phase, causal paths are identified that link configurations to faults. In the third phase, counterfactual queries are generated and evaluated to determine which intervention has the highest effect on the NF fault. The new configuration is then tested for suitability; if it is not suitable, the process iterates. The approach is tested on a number of benchmarks and against a number of approaches, showing superior accuracy and performance. A new benchmark is also provided.
- provides justification and targeted fixes compared to ML approaches
- evaluation shows improvement on state of the art
- minimizes configurations that require changes compared to other approaches
- limited to single causes
- link has no archived data or tool
Soundness: The approach to causal discovery is founded on existing techniques. One assumption that seems to be in place is that causes are single variables and are independent of context. If I understood correctly, the dependency is on paths that link single parent variables to the outcome variable (the NF fault) rather than some form of structural equation that captures dependencies on the values of several parent variables. It is not clear, when multiple causes are identified in a single iteration, whether they would all be fixed simultaneously or iteratively.
The approach is evaluated empirically. The paper, however, lacks formal guarantees of soundness and convergence.
Significance: The approach demonstrates reasonable improvement over state-of-the-art approaches across the board. Its reliance on causal reasoning rather than correlation means that the repairs are more targeted and irrelevant re-configurations are avoided/minimized.
Novelty: The use of causal reasoning in the context of configuration repair for non-functional requirements is, as far as I am aware, new and an interesting application of causal reasoning.
Recoverability, Replicability and Reproducibility: The principles underlying the approach are well explained. The authors list as one of their contributions the availability of the new fault dataset; however, the link shows no data, and the summary tables are somewhat different from those shown for RQ2 in the paper.
Presentation: The paper is well structured and fairly clear. It assumes the reader is already familiar with the background upon which it depends, which makes some sections less easy to follow. The approach is well illustrated with examples that simplify the process.
1- Are causes always single variables or does context play a role?
2- What formal guarantees (of convergence, number of repairs, etc.) does this approach have?
3- (How) can the number of observations required be determined in advance, if at all?
First-Response-Period Response by Md Shahriar Iqbal (1610 words)
We thank the reviewers for their constructive feedback. We'll fix all the presentation issues in the next revision.
Review-#975A
Q1: Multi-configuration options, example
P1: Multi-configuration options. In Section 3.2, we extract the top-K causal paths (…); these paths can involve multiple configuration options.
P2: Example. We presented a simplified example with 2 options to give readers an overview of the benefits of using causal inference without overburdening them with the complexity of multiple confounders. We chose this style to convey key concepts like Simpson's Paradox and to show the limitations of correlational model-based methods. Real systems, of course, have many more options; we evaluated CAUPER on systems with 28 configuration options and 19 events (lines 755-758).
Q2: Dataset description
Our NF-faults dataset contains 495 NF-faults in total: 453 single-objective (306 due to latency and 147 energy-related) and 42 multi-objective NF-faults. Depending on the particular hardware, 3% of our explored configurations suffer from latency faults, 5% from energy faults, and 1% from multi-objective NF-faults. Due to space limitations, we removed these details from the main text and will provide them in an appendix.
Q3: Comparison with sampling/trees
We compared CAUPER against 4 state-of-the-art ML-based and 2 optimization techniques, among which BugDoc (line 799) is a decision-tree-based approach (much richer in representation than a simple regression tree). Our experimental results indicate that CAUPER achieves as much as 32% more gain (lines 858-863), with 13% higher accuracy and 19% higher precision than BugDoc (lines 853-855).
Our baseline SMAC is an optimization approach that has been shown to outperform approaches like ParamILS and TB-SPO, which in turn performed better than sampling approaches with regression trees. Our results in lines 902-918 and 954-960 demonstrate that CAUPER outperforms SMAC for both latency and energy faults.
Nevertheless, the empirical study presented in (Medeiros, 2016) found similar performance among 10 different sampling algorithms for highly configurable systems, including random sampling. This is one of the reasons we did not compare directly with different sampling approaches.
Further, (Kaltenecker, 2020) showed that building a model and searching over it tends to perform better (generate better samples) for tasks like debugging and optimization than sampling with fixed strategies (e.g., random, pairwise).
Review-#975B
Q1: Replication package
We plan to release our code for regenerating the results. An anonymous version of the code is available here: https://github.com/onkfotocer/CAUPER/tree/master/results. The fault dataset is also included in the replication package: https://github.com/onkfotocer/CAUPER/tree/master/datasets. We updated the README for ease of use. If accepted, we will submit the tool and dataset for artifact evaluation.
Q2: Descriptive statistics
See Q2 in Review-#975A.
Q3: Nvidia's forum
For the real-world case study reported in Section 4, we compared our results with the accepted fix provided in the Nvidia forum (lines 684-735). Here, we adopted the same methodology as common program-repair approaches, which compare their results against already-fixed bugs. Part of our future work aims to perform extensive developer surveys.
Review-#975C
Q1: Single causes
Based on our empirical study, the causes of most of these faults span multiple options, showing that interactions are important. Lines 646-657 show the repair of an NF-fault involving 5 configuration options (CPU Cores, CPU/GPU/EMC Frequency, and CUDA_STATIC). The incremental learning in CAUPER aids in identifying multiple causes iteratively: with the addition of more samples, new edges are discovered. See also P1 in Q1 of Review-#975A.
Q2: Formal guarantees
The task of learning correct causal structures in methods like FCI is NP-hard (Chickering, 2004), and such methods offer at best asymptotic convergence guarantees (Spirtes, 2000). Both entropic algorithms return a local optimum (Entropic-Causality). The resulting causal graph may be incomplete; CAUPER greedily refines it using incremental learning. We use a metric (gain) as a heuristic to determine whether the causal model is approximately correct enough to repair the faults. An interesting observation from our experiments was that even with an incomplete causal structure, we were able to find configurations that resolve the performance issues.
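For readers unfamiliar with constraint-based structure learning, the following is a minimal, self-contained sketch of learning a PAG with FCI on synthetic data, using the open-source causal-learn package. This is our illustration of the general technique, not CAUPER's implementation, and the variable names are invented:

```python
# Minimal FCI sketch on toy data using the open-source causal-learn
# package (our illustration, not the authors' code). FCI returns a
# partial ancestral graph (PAG) over the observed columns.
import numpy as np
from causallearn.search.ConstraintBased.FCI import fci
from causallearn.utils.cit import fisherz

rng = np.random.default_rng(0)
n = 500
# Toy chain mimicking option -> system event -> performance metric.
option = rng.integers(0, 4, size=n).astype(float)
event = 2.0 * option + rng.normal(size=n)
latency = 3.0 * event + rng.normal(size=n)
data = np.column_stack([option, event, latency])

# alpha is the significance level of the conditional-independence tests.
g, edges = fci(data, independence_test_method=fisherz, alpha=0.05)
print(g)  # the learned PAG over columns 0 (option), 1 (event), 2 (latency)
```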
Q3: Determining the number of observations
To our knowledge, we cannot provide theoretical guarantees on the number of observations required, as a necessary condition for doing so is recovering the true causal graph. However, in our empirical study we observed that at most 61 observations were needed to resolve a particular fault (25 observations + 36 interventions).
Some additional references are at the bottom of the rebuttal.
Additional Comments:
#975A --- Novelty --- Intuition for Causal Inference
We describe our motivation for using causal inference for debugging and repair via the classical example of Simpson's Paradox (lines 64-77 and 98-122). Simpson's Paradox is a phenomenon in which a trend appears in several different groups of data but disappears or reverses when the groups are combined. Traditional correlational techniques fail in such scenarios because they cannot account for unmeasured confounders in the data, whereas causal inference methods can reason effectively in such cases. Our observations confirm the presence of such confounders (Figure 5) in the performance distributions of software systems, and this serves as our motivation to use causal inference. We believe that such causal reasoning has the potential to be a powerful debugging approach in SE.
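To make the paradox concrete, below is a self-contained numeric illustration (our own synthetic numbers, not the paper's data): pooled across hardware types, latency appears to fall as the option value grows, yet within every hardware group it rises.

```python
# Synthetic illustration of Simpson's Paradox (numbers invented for this
# sketch). hw_type plays the role of an unmeasured confounder.
import pandas as pd

df = pd.DataFrame({
    "hw_type":  ["A"] * 4 + ["B"] * 4,
    "cache_mb": [1, 2, 3, 4, 5, 6, 7, 8],
    "latency":  [10, 12, 14, 16, 3, 5, 7, 9],
})

# Pooled correlation: latency appears to DECREASE as cache_mb grows...
print(df["cache_mb"].corr(df["latency"]))            # negative

# ...but within each hardware group, latency INCREASES with cache_mb.
for hw, grp in df.groupby("hw_type"):
    print(hw, grp["cache_mb"].corr(grp["latency"]))  # both positive
```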
#975A --- Soundness --- Causal Sufficiency
Although learning causal models in high-dimensional settings with small sample sizes is a big challenge, our experimental setup shows that combining structure learning with active learning is an effective way to handle these situations. We confirm this claim with the experimental results in Section 6 (lines 825-1064).
#975A --- Novelty --- Unclear use case
The causal structure is discovered from observational data under normal execution to infer causal relationships between configuration options and performance. Given that there is some variance in the performance data, the causal model is able to capture the causes and effects of performance variation in the system. Then, during a counterfactual query, we query the model with respect to the faulty scenario to find "what would have happened if" a faulty configuration option had been changed. As opposed to black-box predictive (discriminative) models, causal models represent the underlying data-generating process (here, the performance behavior of the system). Therefore, causal models do not need access to input data containing faults, nor any constraints on the distribution of the labels.
#975A --- Soundness --- Tag options with faults
For effective use, we assume the user has some domain knowledge about the configuration options that can potentially influence performance (these can be found in developer forums). However, later in the paper, in the scalability study of RQ3 (lines 982-1065), we relax this assumption and include all options, assuming the user has no domain knowledge. Our results show that CAUPER is able to find an effective repair for the faults in such cases as well.
#975A --- Replicability --- Data Collection
We agree that collecting ground-truth data for all possible combinations of options and values is impractical. In our study, we randomly select configurations for measurement to obtain approximate ground truths, setting a generous time budget of 24 hours.
For counterfactual query evaluation, using all possible values for a configuration option could cause a combinatorial explosion if the graphs were dense. However, we observe that the obtained causal graphs are sparse, which significantly reduces the search space. From the ablation study provided in Table 2, we observe that the average degree of the nodes is relatively small (less than 4). The ITE computation of Eq. (5) (line 614) for estimating causal effects operates only on the observational data. Therefore, we may generate any number of repairs and reason about them using the learned causal model without having to deploy those interventions and measure their performance in the real world.
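As a rough illustration of estimating intervention effects from observational data alone, here is a generic treatment-effect sketch on toy measurements. This is our own simplification; the paper's Eq. (5) and its ITE estimator may differ, and all column names are invented:

```python
# Generic sketch: rank candidate values of one option by the change in
# expected latency relative to the faulty value, using only observed
# measurements (a simplification of ITE-style reasoning, not Eq. (5)).
import pandas as pd

# Toy observational data: one row per measured configuration.
obs = pd.DataFrame({
    "cpu_freq": [1, 1, 2, 2, 3, 3],
    "gpu_freq": [1, 2, 1, 2, 1, 2],
    "latency":  [9.0, 8.5, 6.0, 5.5, 4.0, 3.8],
})

faulty_value = 1  # cpu_freq in the faulty configuration

# Expected latency conditional on each cpu_freq value, from observations.
expected = obs.groupby("cpu_freq")["latency"].mean()

# Estimated effect of "set cpu_freq to v instead of the faulty value".
effect = expected - expected[faulty_value]
print(effect.sort_values())  # most negative = largest estimated improvement
```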
#975B --- Soundness --- Lacks Details
We selected a diverse set of software systems (three machine-learning systems, a video encoder, and a database system) in our study and deployed them on three Nvidia hardware platforms, as these devices are resource-constrained, micro-architecturally different, and known to be more prone to misconfiguration. We selected 28 configuration options and 19 system events to construct our causal graph, as it is interesting to determine how these options influence performance (e.g., directly or via some system events). We selected these options based on Nvidia's configuration guides. To identify single- and multi-objective NF-faults, we measured randomly selected configurations (approximately 2000, depending on the particular software/hardware) with a budget of 24 hours. We use the tail values (worse than the 99th percentile) to build the NF-faults dataset. For single-objective NF-faults, we manually select the configuration with maximum gain as the (approximate) ground truth. For multi-objective NF-faults, we select the Pareto-optimal configurations as the (approximate) ground truth. Q2 in Review-#975A provides the distribution of faults.
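The tail-based labeling step above can be summarized in a few lines; the following sketch uses invented column names and random data purely for illustration:

```python
# Sketch of the 99th-percentile fault labeling described above
# (synthetic data; the dataset's actual schema may differ).
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
measurements = pd.DataFrame({"latency": rng.lognormal(0.0, 0.5, 2000)})

# Configurations whose latency is worse than the 99th percentile are
# labeled as latency NF-faults.
threshold = measurements["latency"].quantile(0.99)
measurements["nf_fault"] = measurements["latency"] > threshold
print(int(measurements["nf_fault"].sum()), "latency NF-faults above", threshold)
```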
#975B/#975C --- Reproducibility --- Artifact
See Q1 of Review-#975B.
#975C --- Soundness --- Multiple Causes
See P1 in Q1 of Review-#975A.
Additional References:
Spirtes, Peter, et al. Causation, Prediction, and Search. MIT Press, 2000.
Chickering, David, et al. Large-Sample Learning of Bayesian Networks is NP-Hard. JMLR, 2004, 1287-1330.
Kaltenecker, Christian, et al. The Interplay of Sampling and Machine Learning for Software Performance Prediction. IEEE, 2020.
Entropic-Causality: Proposition 4 in the appendix of Entropic Causal Inference, and Theorem 1 of Applications of Common Entropy for Causal Inference.
Medeiros, Flavio, et al. A Comparison of 10 Sampling Algorithms for Configurable Systems. ICSE, 2016.
ParamILS: Hutter, Frank, et al. ParamILS: An Automatic Algorithm Configuration Framework. JAIR, 36:267-306, 2009.
TB-SPO: Hutter, Frank, et al. Time-Bounded Sequential Parameter Optimization. In Proc. of LION-4, pages 281-298, 2010.
The reviewers all concur that the paper presents an interesting approach to a problem of interest to the community. After the author responses, the reviewers discussed the paper and agree that (1) the approach lacks important details and (2) the examples provided oversimplify the problem. While we understand why the example might exclude some complexity to make the paper easier to read, together these issues obfuscate the real technical contribution and benefit of CAUPER.