ACM Transactions on Software Engineering and Methodology

Material type: Text
Series: ACM Transactions on Software Engineering and Methodology, Volume 32, Issue 2, 2023
Publication details: New York : Association for Computing Machinery, c2023
Description: [various pagings] : illustrations ; 26 cm
ISSN: 1049-331X
Subject(s): SOFTWARE ENGINEERING | SYSTEM DESCRIPTION LANGUAGES | SOFTWARE EVOLUTION | SOFTWARE VERIFICATION AND VALIDATION | SECURITY AND PRIVACY
Contents:
Single and Multi-objective Test Cases Prioritization for Self-driving Cars in Virtual Environments -- Graded Refinement, Retrenchment, and Simulation -- On the Significance of Category Prediction for Code-Comment Synchronization -- Dealing with Belief Uncertainty in Domain Models -- Aide-mémoire: Improving a Project’s Collective Memory via Pull Request–Issue Links -- iBiR: Bug-report-driven Fault Injection -- deGraphCS: Embedding Variable-based Flow Graph for Neural Code Search -- Accelerating Overdue Pull Requests toward Completion -- Toward More Efficient Statistical Debugging with Abstraction Refinement -- Estimating Probabilistic Safe WCET Ranges of Real-Time Systems at Design Stages -- Coverage-Based Debloating for Java Bytecode -- DIRE and its Data: Neural Decompiled Variable Renamings with Respect to Software Class -- A Characterization Study of Merge Conflicts in Java Projects -- Data Race Repair Using Static Analysis Summaries -- Evaluating Surprise Adequacy for Deep Learning System Testing -- Assessing the Alignment between the Information Needs of Developers and the Documentation of Programming Languages: A Case Study on Rust -- On Proving the Correctness of Refactoring Class Diagrams of MDE Metamodels -- Suboptimal Comments in Java Projects: From Independent Comment Changes to Commenting Practices -- HINNPerf: Hierarchical Interaction Neural Network for Performance Prediction of Configurable Systems -- A Machine Learning Approach for Automated Filling of Categorical Fields in Data Entry Forms -- Towards Learning Generalizable Code Embeddings Using Task-agnostic Graph Convolutional Networks -- Efficient and Effective Feature Space Exploration for Testing Deep Learning Systems -- Demystifying Hidden Sensitive Operations in Android Apps -- Testing, Validation, and Verification of Robotic and Autonomous Systems: A Systematic Review -- Dissecting American Fuzzy Lop: A FuzzBench Evaluation -- Fuzzing Configurations of Program Options -- Dissecting American Fuzzy Lop – A FuzzBench Evaluation - RCR Report -- Fuzzing Configurations of Program Options - RCR Report.
Summary:

[Article Title: Single and Multi-objective Test Cases Prioritization for Self-driving Cars in Virtual Environments/ Christian Birchler, Sajad Khatiri, Pouria Derakhshanfar, Sebastiano Panichella, and Annibale Panichella, p. 28:1-28:30]

Abstract: Testing with simulation environments helps to identify critical failing scenarios for self-driving cars (SDCs). Simulation-based tests are safer than in-field operational tests and allow detecting software defects before deployment. However, these tests are very expensive and too numerous to run frequently within limited time constraints. In this article, we investigate test case prioritization techniques to increase the ability to detect SDC regression faults with virtual tests earlier. Our approach, called SDC-Prioritizer, prioritizes virtual tests for SDCs according to static features of the roads designed to be used within the driving scenarios. These features can be collected without running the tests, which means that they do not require past execution results. We introduce two evolutionary approaches that prioritize the test cases using diversity metrics (black-box heuristics) computed on these static features. These two approaches, called SO-SDC-Prioritizer and MO-SDC-Prioritizer, use single-objective and multi-objective genetic algorithms (GAs), respectively, to find trade-offs between executing the less expensive tests and the most diverse test cases earlier. Our empirical study conducted in the SDC domain shows that MO-SDC-Prioritizer significantly (p-value <= 0.1e-10) improves the ability to detect safety-critical failures at the same level of execution time compared to baselines: random and greedy-based test case orderings. Moreover, our study indicates that multi-objective meta-heuristics outperform single-objective approaches when prioritizing simulation-based tests for SDCs. MO-SDC-Prioritizer prioritizes test cases with a large improvement in fault detection while its overhead (up to 0.45% of the test execution cost) is negligible.

https://doi.org/10.1145/3533818

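To make the cost/diversity trade-off concrete, the sketch below greedily orders tests by diversity-per-cost over static road features. It is a minimal illustration only, not the authors' GA-based implementation, and the feature values and costs are invented.

```python
import numpy as np

def prioritize(features: np.ndarray, costs: np.ndarray) -> list[int]:
    """Greedy diversity-per-cost ordering of test cases.

    features: (n_tests, n_features) static road features (e.g., road
              length, number of turns), standardized per column.
    costs:    (n_tests,) estimated execution time of each test.
    """
    remaining = set(range(len(costs)))
    order = []
    while remaining:
        best, best_score = None, -1.0
        for i in remaining:
            if order:
                # Diversity: distance to the closest already-selected test.
                div = min(np.linalg.norm(features[i] - features[j]) for j in order)
            else:
                div = 1.0
            score = div / costs[i]  # prefer diverse *and* cheap tests first
            if score > best_score:
                best, best_score = i, score
        order.append(best)
        remaining.remove(best)
    return order

# Example with four hypothetical tests, two road features each.
feats = np.array([[0.1, 0.9], [0.8, 0.2], [0.15, 0.85], [0.5, 0.5]])
costs = np.array([2.0, 1.0, 2.5, 1.5])
print(prioritize(feats, costs))
```
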
[Article Title: Graded Refinement, Retrenchment, and Simulation/ Richard Banach, p. 29:1-29:69]

Abstract: Refinement of formal system models towards implementation has been a mainstay of system development since the inception of formal and Correct by Construction approaches to system development. However, pure refinement approaches do not always deal fluently with all desirable system requirements. This prompted the development of alternatives and generalizations, such as retrenchment. The crucial concept of simulation is key to judging the quality of the conformance between abstract and more concrete system models. Reformulations of these theoretical approaches are reprised and are embedded in a graded framework. The added flexibility this offers is intended to deal more effectively with the needs of applications in which the relationship between different levels of abstraction is not straightforward, and in which behavior can oscillate between conforming quite closely to an idealized abstraction and deviating quite far from it. The framework developed is confronted with an intentionally demanding case study: a model active control system for the protection of buildings during earthquakes. This offers many challenges: it is hybrid/cyber-physical; it has to respond to rather unpredictable inputs; and it has to straddle the gap between continuous behavior and discretized/quantized/numerical implementation.

https://doi.org/10.1145/3534116

[Article Title: On the Significance of Category Prediction for Code-Comment Synchronization/ Zhen Yang, Jacky Wai Keung, Xiao Yu, Yan Xiao, Zhi Jin and Jingyu Zhang, p. 30:1-30:41]

Abstract: Software comments are sometimes not promptly updated in sync when the associated code is changed. The inconsistency between code and comments may mislead developers and result in future bugs. Thus, studies of code-comment synchronization, which aims to automatically synchronize comments with code changes, have become highly important. Existing code-comment synchronization approaches mainly fall into two types: (1) deep learning-based (e.g., CUP) and (2) heuristic-based (e.g., HebCUP). The former constructs a neural machine translation-structured semantic model, which generalizes better when synchronizing comments as software evolves and grows. The latter, in contrast, designs a series of rules for performing token-level replacements on old comments, which can generate completely correct comments for samples fully covered by its carefully designed heuristic rules. In this article, we propose a composite approach named CBS (i.e., Classifying Before Synchronizing) to further improve code-comment synchronization performance by combining the advantages of CUP and HebCUP with the assistance of inferred categories of Code-Comment Inconsistent (CCI) samples. Specifically, we first define two categories (i.e., heuristic-prone and non-heuristic-prone) for CCI samples and propose five features to assist category prediction. Samples whose comments can be correctly synchronized by HebCUP are heuristic-prone, while the others are non-heuristic-prone. Then, CBS employs our proposed Multi-Subsets Ensemble Learning (MSEL) classification algorithm to alleviate the class imbalance problem and construct the category prediction model. Next, CBS uses the trained MSEL to predict the category of each new sample. If the predicted category is heuristic-prone, CBS employs HebCUP to conduct the code-comment synchronization for the sample; otherwise, CBS allocates CUP to handle it. Our extensive experiments demonstrate that CBS statistically significantly outperforms CUP and HebCUP, obtaining average improvements of 23.47%, 22.84%, 3.04%, 3.04%, 1.64%, and 19.39% in terms of Accuracy, Recall@5, Average Edit Distance (AED), Relative Edit Distance (RED), BLEU-4, and Effective Synchronized Sample (ESS) ratio, respectively, which highlights that category prediction for CCI samples can boost code-comment synchronization performance.

https://doi.org/10.1145/3534117

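The classify-then-dispatch idea behind CBS can be sketched as follows. This is a simplified illustration: the sample features are unspecified, the two synchronizer stubs are placeholders for HebCUP and CUP, and the real MSEL algorithm is considerably richer than this subset-ensemble approximation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def hebcup_update(comment: str) -> str:   # stand-in for rule-based HebCUP
    return comment + " [rule-updated]"

def cup_update(comment: str) -> str:      # stand-in for the CUP neural model
    return comment + " [model-updated]"

def train_msel(X, y, n_subsets=5, seed=0):
    # Simplified multi-subsets ensemble: each member is trained on the
    # whole minority class (label 1, heuristic-prone) plus an equally
    # sized random subset of the (assumed larger) majority class.
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    members = []
    for _ in range(n_subsets):
        subset = rng.choice(majority, size=len(minority), replace=False)
        idx = np.concatenate([minority, subset])
        clf = RandomForestClassifier(n_estimators=50, random_state=seed)
        members.append(clf.fit(X[idx], y[idx]))
    return members

def synchronize(features, old_comment, members):
    # Dispatch by predicted category: heuristic-prone -> HebCUP, else CUP.
    votes = np.mean([m.predict(features.reshape(1, -1))[0] for m in members])
    return hebcup_update(old_comment) if votes >= 0.5 else cup_update(old_comment)

X = np.random.rand(200, 5)                      # invented sample features
y = (np.random.rand(200) < 0.2).astype(int)     # imbalanced labels
members = train_msel(X, y)
print(synchronize(X[0], "// returns the total", members))
```
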
[Article Title: Dealing with Belief Uncertainty in Domain Models/ Lola Burgueño, Paula Muñoz, Robert Clarisó, Jordi Cabot, Sébastien Gérard and Antonio Vallecillo, p. 31:1-31:34]

Abstract: There are numerous domains in which information systems need to deal with uncertain information. These uncertainties may originate from different reasons such as vagueness, imprecision, incompleteness, or inconsistencies, and in many cases they cannot be neglected. In this article, we are interested in representing and processing uncertain information in domain models, taking into account the stakeholders' beliefs (opinions). We show how to associate beliefs with model elements and how to propagate and operate with their associated uncertainty so that domain experts can individually reason about their models enriched with their personal opinions. In addition, we address the challenge of combining the opinions of different domain experts on the same model elements, with the goal of reaching informed collective decisions. We provide different strategies and a methodology to optimally merge individual opinions.

https://doi.org/10.1145/3542947

[Article Title: Aide-mémoire: Improving a Project’s Collective Memory via Pull Request–Issue Links/ Profir-Petru Pârţachi, David R. White and Earl T. Barr, p. 32:1-32:36]

Abstract: Links between pull requests and the issues they address document and accelerate the development of a software project but are often omitted. We present a new tool, Aide-mémoire, to suggest such links when a developer submits a pull request or closes an issue, smoothly integrating into existing workflows. In contrast to previous state-of-the-art approaches that repair related commit histories, Aide-mémoire is designed for continuous, real-time, and long-term use, employing a Mondrian forest to adapt over a project's lifetime and continuously improve traceability. Aide-mémoire is tailored for two specific instances of the general traceability problem, namely commit-to-issue and pull-request-to-issue links, with a focus on the latter, and exploits data inherent to these two problems to outperform tools for general-purpose link recovery. Our approach is online, language-agnostic, and scalable. We evaluate over a corpus of 213 projects and six programming languages, achieving a mean average precision of 0.95. Adopting Aide-mémoire is both efficient and effective: a programmer need only evaluate a single suggested link 94% of the time, and 16% of all discovered links were originally missed by developers.

https://doi.org/10.1145/3542937

[Article Title: iBiR: Bug-report-driven Fault Injection/ Ahmed Khanfir, Anil Koyuncu, Mike Papadakis, Maxime Cordy, Tegawendé F. Bissyandé, Jacques Klein and Yves Le Traon, p. 33:1-33:31]

Abstract: Much research on software engineering relies on experimental studies based on fault injection. Fault injection, however, often does not emulate real-world software faults well, since it “blindly” injects large numbers of faults. Indeed, it remains challenging to inject few but realistic faults that target a particular functionality in a program. In this work, we introduce iBiR, a fault injection tool that addresses this challenge by exploiting change patterns associated with user-reported faults. To inject realistic faults, we create mutants by re-targeting a bug-report-driven automated program repair system, i.e., reversing its code transformation templates. iBiR is further appealing in practice since it requires deep knowledge of neither the code nor the tests, just of the program's relevant bug reports. Thus, our approach focuses the fault injection on the features targeted by the bug reports. We assess iBiR on the Defects4J dataset. Experimental results show that our approach outperforms the fault injection performed by traditional mutation testing in terms of semantic similarity with the original bug, when applied at either the system or class level of granularity, and provides better, statistically significant estimations of test effectiveness (fault detection). Additionally, when injecting 100 faults, iBiR injects faults that couple with the real ones in around 36% of the cases, while mutation testing achieves less than 4%.

https://doi.org/10.1145/3542946

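As a toy illustration of merging expert opinions attached to a model element, the following uses a plain weighted average over (belief, disbelief, uncertainty) triples. This is an invented simplification, not the fusion operators or methodology the article actually develops.

```python
from dataclasses import dataclass

@dataclass
class Opinion:
    belief: float      # evidence for the statement
    disbelief: float   # evidence against it
    uncertainty: float # lack of evidence; the three sum to 1.0

def merge(opinions, weights):
    # Naive weighted-average fusion of expert opinions on one model
    # element; principled fusion operators are more subtle about how
    # uncertainty combines.
    total = sum(weights)
    b = sum(w * o.belief for o, w in zip(opinions, weights)) / total
    d = sum(w * o.disbelief for o, w in zip(opinions, weights)) / total
    u = sum(w * o.uncertainty for o, w in zip(opinions, weights)) / total
    return Opinion(b, d, u)

# Two experts judging "every Order has a Customer" in a domain model.
experts = [Opinion(0.8, 0.1, 0.1), Opinion(0.5, 0.2, 0.3)]
print(merge(experts, weights=[2.0, 1.0]))  # senior expert weighted higher
```
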
[Article Title: deGraphCS: Embedding Variable-based Flow Graph for Neural Code Search/ Chen Zeng, Yue Yu, Shanshan Li, Xin Xia, Zhiming Wang, Mingyang Geng, Linxiao Bai, Wei Dong and Xiangke Liao, p. 34:1-34:27]

Abstract: With the rapid increase of public code repositories, developers maintain a great desire to retrieve precise code snippets by using natural language. Despite existing deep learning-based approaches that provide end-to-end solutions (i.e., accept natural language as queries and show related code fragments), the accuracy of code search in large-scale repositories is still low because of the code representation (e.g., AST) and modeling (e.g., directly fusing features in the attention stage). In this paper, we propose a novel learnable deep Graph for Code Search (called deGraphCS) that transforms source code into variable-based flow graphs based on an intermediate representation technique, which can model code semantics more precisely than directly processing the code as text or using its syntax tree representation. Furthermore, we propose a graph optimization mechanism to refine the code representation and apply an improved gated graph neural network to model variable-based flow graphs. To evaluate the effectiveness of deGraphCS, we collect a large-scale dataset from GitHub containing 41,152 code snippets written in the C language and reproduce several typical deep code search methods for comparison. The experimental results show that deGraphCS can achieve state-of-the-art performance and accurately retrieve code snippets satisfying the needs of the users.

https://doi.org/10.1145/3546066

[Article Title: Nudge: Accelerating Overdue Pull Requests toward Completion/ Chandra Maddila, Sai Surya Upadrasta, Chetan Bansal, Nachiappan Nagappan, Georgios Gousios and Arie van Deursen, p. 35:1-35:30]

Abstract: Pull requests are a key part of the collaborative software development and code review process today. However, pull requests can also slow down the software development process when the reviewer(s) or the author do not actively engage with them. In this work, we design an end-to-end service, Nudge, for accelerating overdue pull requests toward completion by reminding the author or the reviewer(s) to engage with their overdue pull requests. First, we use models based on effort estimation and machine learning to predict the completion time for a given pull request. Second, we use activity detection to filter out pull requests that may be overdue but for which sufficient action is taking place nonetheless. Last, we use actor identification to understand who the blocker of the pull request is and nudge the appropriate actor (author or reviewer(s)). The key novelty of Nudge is that it succeeds in reducing pull request resolution time, while ensuring that developers perceive the notifications sent as useful, at the scale of thousands of repositories. In a randomized trial on 147 repositories in use at Microsoft, Nudge was able to reduce pull request resolution time by 60% for 8,500 pull requests, when compared to overdue pull requests for which Nudge did not send a notification. Furthermore, developers receiving Nudge notifications resolved 73% of them as positive. We observed similar results when scaling up the deployment of Nudge to 8,000 repositories at Microsoft, for which Nudge sent 210,000 notifications during a full year. This demonstrates Nudge's ability to scale to thousands of repositories. Last, our qualitative analysis of a selection of Nudge notifications indicates areas for future research, such as taking dependencies among pull requests and developer availability into account.

https://doi.org/10.1145/3544791

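Nudge's three-stage pipeline (overdue prediction, activity detection, actor identification) can be sketched schematically as below; the thresholds, the effort-model stub, and the data fields are invented for illustration and do not reflect Microsoft's deployed service.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class PullRequest:
    author: str
    reviewers: list
    created: datetime
    last_activity: datetime
    awaiting_author: bool  # e.g., a reviewer has requested changes

def predicted_lifetime(pr: PullRequest) -> timedelta:
    # Stand-in for Nudge's learned effort-estimation model.
    return timedelta(days=3)

def nudge_target(pr: PullRequest, now: datetime):
    # 1. Overdue check: past the predicted completion time.
    if now - pr.created <= predicted_lifetime(pr):
        return None
    # 2. Activity detection: skip PRs with recent engagement.
    if now - pr.last_activity < timedelta(days=1):
        return None
    # 3. Actor identification: notify whoever blocks progress.
    return pr.author if pr.awaiting_author else pr.reviewers

now = datetime(2023, 4, 14)
pr = PullRequest("ada", ["grace"], now - timedelta(days=6),
                 now - timedelta(days=4), awaiting_author=False)
print(nudge_target(pr, now))  # -> ['grace']
```
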
[Article Title: Toward More Efficient Statistical Debugging with Abstraction Refinement/ Zhiqiang Zuo, Xintao Niu, Siyi Zhang, Lu Fang, Siau Cheng Khoo, Shan Lu, Chengnian Sun and Guoqing Harry Xu, p. 36:1-36:38]

Abstract: Debugging is known to be a notoriously painstaking and time-consuming task. As one major family of automated debugging, statistical debugging approaches have been well investigated over the past decade; they collect failing and passing executions and apply statistical techniques to identify discriminative elements as potential bug causes. Most of the existing approaches instrument the entire program to produce execution profiles for debugging, thus incurring hefty instrumentation and analysis costs. However, since a major part of the program code is in fact error-free, full-scale program instrumentation is wasteful and unnecessary. This article presents a systematic abstraction refinement-based pruning technique for statistical debugging. Our technique only needs to instrument and analyze the code partially. Guided by a mathematically rigorous analysis, our technique is guaranteed to produce the same debugging results as an exhaustive analysis in deterministic settings. With the help of this effective and safe pruning, our technique greatly reduces the cost of failure diagnosis without sacrificing any debugging capability. We apply this technique to two different statistical debugging scenarios: in-house and production-run statistical debugging. The comprehensive evaluations validate that our technique can significantly improve the efficiency of statistical debugging in both scenarios without jeopardizing the debugging capability.

https://doi.org/10.1145/3544790

[Article Title: Estimating Probabilistic Safe WCET Ranges of Real-Time Systems at Design Stages/ Jaekwon Lee, Seung Yeob Shin, Shiva Nejati, Lionel Briand and Yago Isasi Parache, p. 37:1-37:33]

Abstract: Estimating worst-case execution time (WCET) is an important activity at early design stages of real-time systems. Based on WCET estimates, engineers make design and implementation decisions to ensure that task executions always complete before their specified deadlines. However, in practice, engineers often cannot provide precise point WCET estimates and prefer to provide plausible WCET ranges. Given a set of real-time tasks with such ranges, we provide an automated technique to determine for what WCET values the system is likely to meet its deadlines and, hence, operate safely with a probabilistic guarantee. Our approach combines a search algorithm for generating worst-case scheduling scenarios with polynomial logistic regression for inferring probabilistic safe WCET ranges. We evaluated our approach by applying it to three industrial systems from different domains and several synthetic systems. Our approach efficiently and accurately estimates probabilistic safe WCET ranges within which deadlines are likely to be satisfied with a high degree of confidence.

https://doi.org/10.1145/3546941

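The inference step of the WCET work can be illustrated with a small sketch: label sampled WCET vectors as schedulable or not, fit a polynomial logistic regression, and read off the range where the predicted safety probability stays high. The two-task system, the toy feasibility rule standing in for worst-case scheduling simulation, and the 0.99 threshold are all invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)

# Hypothetical system with two tasks; a scheduling analysis would label
# each sampled WCET vector as schedulable (1) or not (0). A synthetic
# rule stands in for that worst-case scenario search here.
X = rng.uniform(0.0, 10.0, size=(500, 2))          # candidate WCETs (ms)
y = (X[:, 0] + 2.0 * X[:, 1] <= 14.0).astype(int)  # toy feasibility rule

poly = PolynomialFeatures(degree=2)
clf = LogisticRegression(max_iter=1000).fit(poly.fit_transform(X), y)

# Probabilistic safe range for task 1, fixing task 2's WCET at 4 ms:
# the largest WCET whose predicted safety probability stays >= 0.99.
candidates = np.linspace(0.0, 10.0, 101)
grid = poly.transform(np.column_stack([candidates, np.full(101, 4.0)]))
safe = candidates[clf.predict_proba(grid)[:, 1] >= 0.99]
print(f"safe WCET for task 1: up to {safe.max():.1f} ms" if safe.size else "none")
```
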
[Article Title: Coverage-Based Debloating for Java Bytecode/ César Soto-Valero, Thomas Durieux, Nicolas Harrand and Benoit Baudry, p. 38:1-38:34]

Abstract: Software bloat is code that is packaged in an application but is actually not necessary to run the application. The presence of software bloat is an issue for security, performance, and maintenance. In this article, we introduce a novel technique for debloating, which we call coverage-based debloating. We implement the technique for one single language: Java bytecode. We leverage a combination of state-of-the-art Java bytecode coverage tools to precisely capture what parts of a project and its dependencies are used when running with a specific workload. Then, we automatically remove the parts that are not covered, in order to generate a debloated version of the project. We succeed in debloating 211 library versions from a dataset of 94 unique open-source Java libraries. The debloated versions are syntactically correct and preserve their original behaviour according to the workload. Our results indicate that 68.3% of the libraries' bytecode and 20.3% of their total dependencies can be removed through coverage-based debloating. For the first time in the literature on software debloating, we assess the utility of debloated libraries with respect to the client applications that reuse them. We select 988 client projects that either have a direct reference to the debloated library in their source code or whose test suite covers at least one class of the libraries that we debloat. Our results show that 81.5% of the clients, with at least one test that uses the library, successfully compile and pass their test suite when the original library is replaced by its debloated version.

https://doi.org/10.1145/3546948

[Article Title: DIRE and its Data: Neural Decompiled Variable Renamings with Respect to Software Class/ Luke Dramko, Jeremy Lacomis, Pengcheng Yin, Ed Schwartz, Miltiadis Allamanis, Graham Neubig, Bogdan Vasilescu and Claire Le Goues, p. 39:1-39:34]

Abstract: The decompiler is one of the most common tools for examining executable binaries without the corresponding source code. It transforms binaries into high-level code, reversing the compilation process. Unfortunately, decompiler output is far from readable because the decompilation process is often incomplete. State-of-the-art techniques use machine learning to predict missing information like variable names. While these approaches are often able to suggest good variable names in context, no existing work examines how the selection of training data influences these machine learning models. We investigate how data provenance and the quality of training data affect performance, and how well, if at all, trained models generalize across software domains. We focus on the variable renaming problem using one such machine learning model, DIRE. We first describe DIRE in detail and the accompanying technique used to generate training data from raw code. We also evaluate DIRE's overall performance without respect to data quality. Next, we show how training on more popular, possibly higher-quality code (measured using GitHub stars) leads to a more generalizable model, because popular code tends to have more diverse variable names. Finally, we evaluate how well DIRE predicts domain-specific identifiers, propose a modification to incorporate domain information, and show that it can predict identifiers in domain-specific scenarios 23% more frequently than the original DIRE model.

https://doi.org/10.1145/3546946

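The core removal step of coverage-based debloating can be sketched as filtering a JAR down to the classes observed at runtime. This is a deliberately naive sketch: the published technique works with dedicated bytecode coverage tools at finer granularity and must preserve classes that are needed only at load time.

```python
import zipfile

def debloat_jar(original: str, debloated: str, covered_classes: set[str]):
    # Keep only class files observed as covered under the workload
    # (plus all non-class resources).
    with zipfile.ZipFile(original) as src, \
         zipfile.ZipFile(debloated, "w") as dst:
        for entry in src.infolist():
            name = entry.filename
            if name.endswith(".class"):
                cls = name[: -len(".class")].replace("/", ".")
                if cls not in covered_classes:
                    continue  # drop bloat
            dst.writestr(entry, src.read(name))

# covered_classes would come from running the workload under a coverage
# tool such as JaCoCo and exporting the set of touched classes, e.g.:
# debloat_jar("lib.jar", "lib-debloated.jar", {"com.example.Core"})
```
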
[Article Title: A Characterization Study of Merge Conflicts in Java Projects/ Bowen Shen, Muhammad Ali Gulzar, Fei He and Na Meng, p. 40:1-40:28]

Abstract: In collaborative software development, programmers create software branches to add features and fix bugs tentatively, and then merge branches to integrate their edits. When edits from different branches textually overlap (i.e., textual conflicts) or lead to compilation and runtime errors (i.e., build and test conflicts), it is challenging for developers to remove such conflicts. Prior work proposed tools to detect and solve conflicts and investigated how conflicts relate to code smells and the software development process. However, many questions are still not fully investigated, such as what types of conflicts exist in real-world applications and how developers or tools handle them. For this article, we used automated textual merge, compilation, and testing to reveal three types of conflicts in 208 open-source repositories: textual conflicts, build conflicts (i.e., conflicts causing build errors), and test conflicts (i.e., conflicts triggering test failures). We manually inspected 538 conflicts and their resolutions to characterize merge conflicts from different angles. Our analysis revealed three interesting phenomena. First, higher-order conflicts (i.e., build and test conflicts) are harder to detect and resolve, while existing tools mainly focus on textual conflicts. Second, developers manually resolved most higher-order conflicts by applying similar edits to multiple program locations; their conflict resolutions share common editing patterns, implying great opportunities for future tool design. Third, developers resolved 64% of true textual conflicts by keeping complete edits from either the left or the right branch. Unlike prior studies, our research for the first time thoroughly characterizes three types of conflicts, with a special focus on higher-order conflicts and the limitations of existing tool design. Our work will shed light on future research on software merging.

https://doi.org/10.1145/3546944

[Article Title: Hippodrome: Data Race Repair Using Static Analysis Summaries/ Andreea Costea, Abhishek Tiwari, Sigmund Chianasta, Kishore R, Abhik Roychoudhury and Ilya Sergey, p. 41:1-41:33]

Abstract: Implementing bug-free concurrent programs is a challenging task in modern software development. State-of-the-art static analyses find hundreds of concurrency bugs in production code, scaling to large codebases. Yet, fixing these bugs in constantly changing codebases represents a daunting effort for programmers, particularly because a fix in the concurrent code can introduce other bugs in a subtle way. In this work, we show how to harness compositional static analysis for concurrency bug detection to enable a new Automated Program Repair (APR) technique for data races in large concurrent Java codebases. The key innovation of our work is an algorithm that translates procedure summaries, inferred by the analysis tool for the purpose of bug reporting, into small local patches that fix concurrency bugs (without introducing new ones). This synergy makes it possible to extend the virtues of compositional static concurrency analysis to APR, making our approach effective (it can detect and fix many more bugs than existing tools for data race repair), scalable (it takes seconds to analyze and suggest fixes for sizeable codebases), and usable (generally, it does not require annotations from the users and can perform continuous automated repair). Our study, conducted on popular open-source projects, has confirmed that our tool automatically produces concurrency fixes similar to those proposed by the developers in the past.

https://doi.org/10.1145/3546942

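The study's three conflict types suggest a straightforward detection pipeline: attempt the merge, then build, then test. The sketch below illustrates that pipeline; the repository path is a placeholder and the Maven commands are one assumed build setup among many.

```python
import subprocess

def run(*cmd, cwd):
    return subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)

def classify_merge(repo, left, right):
    # Reproduce a merge and probe for the three conflict types the
    # study distinguishes; build/test commands are project-specific.
    run("git", "checkout", left, cwd=repo)
    merge = run("git", "merge", "--no-edit", right, cwd=repo)
    if merge.returncode != 0:
        return "textual conflict"
    if run("mvn", "compile", cwd=repo).returncode != 0:
        return "build conflict"   # merged cleanly but does not compile
    if run("mvn", "test", cwd=repo).returncode != 0:
        return "test conflict"    # compiles but tests fail
    return "no conflict"

# print(classify_merge("/path/to/repo", "feature-a", "feature-b"))
```
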
[Article Title: Evaluating Surprise Adequacy for Deep Learning System Testing/ Jinhan Kim, Robert Feldt and Shin Yoo, p. 42:1-42:29]

Abstract: The rapid adoption of Deep Learning (DL) systems in safety-critical domains such as medical imaging and autonomous driving urgently calls for ways to test their correctness and robustness. Borrowing from the concept of test adequacy in traditional software testing, existing work on testing DL systems initially investigated them from a structural point of view, leading to a number of coverage metrics. Our lack of understanding of the internal mechanism of Deep Neural Networks (DNNs), however, means that coverage metrics defined on the Boolean dichotomy of coverage are hard to intuitively interpret and understand. We propose the degree of out-of-distribution-ness of a given input as its adequacy for testing: the more surprising a given input is to the DNN under test, the more likely the system will show unexpected behavior for the input. We develop the concept of surprise into a test adequacy criterion, called Surprise Adequacy (SA). Intuitively, SA measures the difference between the behavior of the DNN for the given input and its behavior for the training data. We posit that a good test input should be sufficiently, but not overtly, surprising compared to the training dataset. This article evaluates SA using a range of DL systems, from simple image classifiers to autonomous driving car platforms, as well as both small and large data benchmarks ranging from MNIST to ImageNet. The results show that the SA value of an input can be a reliable predictor of the correctness of the model behavior. We also show that SA can be used to detect adversarial examples, and that it can be efficiently computed against large training datasets such as ImageNet using sampling.

https://doi.org/10.1145/3546947

[Article Title: Assessing the Alignment between the Information Needs of Developers and the Documentation of Programming Languages: A Case Study on Rust/ Filipe Roseiro Cogo, Xin Xia, Ahmed E. Hassan, p. 43:1-43:48]

Abstract: Programming language documentation refers to the set of technical documents that provide application developers with a description of the high-level concepts of a language (e.g., manuals, tutorials, and API references). Such documentation is essential to support application developers in effectively using a programming language. One of the challenges faced by documenters (i.e., personnel that design and produce documentation for a programming language) is to ensure that documentation has relevant information that aligns with the concrete needs of developers, defined as the missing knowledge that developers acquire via voluntary search. In this article, we present an automated approach to support documenters in evaluating the differences and similarities between the concrete information needs of developers and the current state of documentation (a problem that we refer to as the topical alignment of a programming language documentation). Our approach leverages semi-supervised topic modelling that uses domain knowledge to guide the derivation of topics. We initially train a baseline topic model from a set of Rust-related Q&A posts. We then use this baseline model to determine the distribution of topic probabilities of each document of the official Rust documentation. Afterwards, we assess the similarities and differences between the topics of the Q&A posts and those of the official documentation. Our results show a relatively high level of topical alignment in Rust documentation. Still, information about specific topics is scarce in both the Q&A websites and the documentation, particularly topics related to programming niches such as network, game, and database development. For other topics (e.g., topics related to language features such as structs, patterns and matching, and the foreign function interface), information is only available on Q&A websites while lacking in the official documentation. Finally, we discuss implications for programming language documenters, particularly how to leverage our approach to prioritize topics that should be added to the documentation.

https://doi.org/10.1145/3546945

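A simplified sketch of the likelihood-based flavor of Surprise Adequacy follows: fit a density over activation traces of the training data, then score a new input by its negative log-density. The stand-in activations are synthetic, and the article additionally develops distance-based variants and per-layer trace selection not shown here.

```python
import numpy as np
from scipy.stats import gaussian_kde

def fit_sa(train_activations: np.ndarray):
    # Likelihood-based SA (simplified): higher score = more surprising
    # the input's activation trace is, relative to the training data.
    kde = gaussian_kde(train_activations.T)
    return lambda a: float(-np.log(kde(a.reshape(-1, 1))[0] + 1e-30))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(1000, 2))  # stand-in activation traces
sa = fit_sa(train)
print(sa(np.array([0.1, 0.0])))   # in-distribution: low surprise
print(sa(np.array([6.0, 6.0])))   # far from training data: high surprise
```
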
[Article Title: On Proving the Correctness of Refactoring Class Diagrams of MDE Metamodels/ Najd Altoyan and Don Batory, p. 44:1-44:42]

Abstract: Model Driven Engineering (MDE) is a general-purpose engineering methodology to elevate system design, maintenance, and analysis to corresponding activities on models. Models (graphical and/or textual) of a target application are automatically transformed into source code, performance models, Promela files (for model checking), and so on for system analysis and construction. Models are instances of metamodels. One form an MDE metamodel can take is a [class diagram, constraints] pair: the class diagram defines all object diagrams that could be metamodel instances; object constraint language (OCL) constraints eliminate semantically undesirable instances. A metamodel refactoring is an invertible semantics-preserving co-transformation, i.e., it transforms both a metamodel and its models without losing data. This article addresses a subproblem of metamodel refactoring: how to prove the correctness of refactorings of class diagrams without OCL constraints using the Coq Proof Assistant.

https://doi.org/10.1145/3549541

[Article Title: Suboptimal Comments in Java Projects: From Independent Comment Changes to Commenting Practices/ Chao Wang, Hao He, Uma Pal, Darko Marinov and Minghui Zhou, p. 45:1-45:33]

Abstract: High-quality source code comments are valuable for software development and maintenance; however, code often contains low-quality comments or lacks them altogether. We call such source code comments suboptimal comments. Suboptimal comments create challenges in code comprehension and maintenance. Despite substantial research on low-quality source code comments, empirical knowledge about the commenting practices that produce suboptimal comments and the reasons that lead to them is lacking. We help bridge this knowledge gap by investigating (1) independent comment changes (ICCs), i.e., comment changes committed independently of code changes, which likely address suboptimal comments, (2) commenting guidelines, and (3) comment-checking tools and comment-generating tools, which are often employed to help commenting practice, especially to prevent suboptimal comments. We collect 24M+ comment changes from 4,392 open-source GitHub Java repositories and find that ICCs widely exist. The ICC ratio (the proportion of ICCs among all comment changes) is ~15.5%, with 98.7% of the repositories having ICCs. Our thematic analysis of 3,533 randomly sampled ICCs provides a three-dimensional taxonomy for what is changed (four comment categories and 13 subcategories), how it changed (six commenting activity categories), and what factors are associated with the change (three factors). We investigate 600 repositories to understand the prevalence, content, impact, and violations of commenting guidelines. We find that only 15.5% of the 600 sampled repositories have any commenting guidelines. We provide the first taxonomy for elements in commenting guidelines: where and what to comment are particularly important. The repositories without such guidelines have a statistically significantly higher ICC ratio, indicating the negative impact of the lack of commenting guidelines. However, commenting guidelines are not strictly followed: 85.5% of the checked repositories have violations. We also systematically study how developers use two kinds of tools, comment-checking tools and comment-generating tools, in the 4,392 repositories. We find that the use of the Javadoc tool is negatively correlated with the ICC ratio, while the use of Checkstyle has no statistically significant correlation; the use of comment-generating tools leads to a higher ICC ratio. To conclude, we reveal issues and challenges in current commenting practice, which helps explain how suboptimal comments are introduced. We propose potential research directions on comment location prediction, comment generation, and comment quality assessment; suggest how developers can formulate commenting guidelines and enforce rules with tools; and recommend how to enhance current comment-checking and comment-generating tools.

https://doi.org/10.1145/3546949

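To illustrate what counts as an independent comment change, here is a minimal line-based check over a unified diff; the study's own mining of 24M+ comment changes uses far more robust parsing, so this is only an approximation for Java-style comments.

```python
import re

COMMENT = re.compile(r"^\s*(//|/\*|\*/|\*)")  # Java line/block comment markers

def is_independent_comment_change(diff: str) -> bool:
    # An ICC in the study's sense: every added or removed line in the
    # change touches comments only (no accompanying code change).
    changed = [l[1:] for l in diff.splitlines()
               if l.startswith(("+", "-")) and not l.startswith(("+++", "---"))]
    return bool(changed) and all(COMMENT.match(l) for l in changed)

diff = """\
--- a/Foo.java
+++ b/Foo.java
-// retruns the total
+// returns the total
"""
print(is_independent_comment_change(diff))  # True
```
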
[Article Title: HINNPerf: Hierarchical Interaction Neural Network for Performance Prediction of Configurable Systems/ Jiezhu Cheng, Cuiyun Gao and Zibin Zheng, p. 46:1-46:30]

Abstract: Modern software systems are usually highly configurable, providing users with customized functionality through various configuration options. Understanding how system performance varies with different option combinations is important for determining optimal configurations that meet specific requirements. Due to the complex interactions among multiple options and the high cost of performance measurement under a huge configuration space, it is challenging to study how different configurations influence system performance. To address these challenges, we propose HINNPerf, a novel hierarchical interaction neural network for performance prediction of configurable systems. HINNPerf employs an embedding method and hierarchical network blocks to model the complicated interplay between configuration options, which improves the prediction accuracy of the method. In addition, we devise a hierarchical regularization strategy to enhance model robustness. Empirical results on 10 real-world configurable systems show that our method statistically significantly outperforms state-of-the-art approaches, achieving an average 22.67% improvement in prediction accuracy. In addition, combined with the Integrated Gradients method, the designed hierarchical architecture provides insights about the interaction complexity and the significance of configuration options, which may help users and developers better understand how the configurable system works and efficiently identify the significant options affecting performance.

https://doi.org/10.1145/3528100

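The general shape of a hierarchical-interaction performance predictor can be sketched as stacked blocks that repeatedly combine the raw options with the interactions captured so far. This is a loose structural sketch under our own assumptions, not the published HINNPerf architecture, which additionally involves embeddings and hierarchical regularization.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    # One hierarchy level: refines its state from the raw options plus
    # the interactions captured by the previous level.
    def __init__(self, n_opts: int, hidden: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_opts + hidden, hidden), nn.ReLU())

    def forward(self, opts, state):
        return self.net(torch.cat([opts, state], dim=-1))

class TinyHINN(nn.Module):
    def __init__(self, n_opts: int, hidden: int = 32, depth: int = 3):
        super().__init__()
        self.embed = nn.Linear(n_opts, hidden)
        self.blocks = nn.ModuleList(Block(n_opts, hidden) for _ in range(depth))
        self.head = nn.Linear(hidden, 1)  # predicted performance value

    def forward(self, opts):
        state = torch.relu(self.embed(opts))
        for block in self.blocks:
            state = block(opts, state)
        return self.head(state)

model = TinyHINN(n_opts=8)
configs = torch.randint(0, 2, (4, 8)).float()  # 4 sampled configurations
print(model(configs).shape)  # torch.Size([4, 1])
```
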
[Article Title: A Machine Learning Approach for Automated Filling of Categorical Fields in Data Entry Forms/ Hichem Belgacem, Xiaochen Li, Domenico Bianculli and Lionel Briand, p. 47:1-47:40]

Abstract: Users frequently interact with software systems through data entry forms. However, form filling is time-consuming and error-prone. Although several techniques have been proposed to auto-complete or pre-fill fields in forms, they provide limited support to help users fill categorical fields, i.e., fields that require users to choose the right value among a large set of options. In this article, we propose LAFF, a learning-based automated approach for filling categorical fields in data entry forms. LAFF first builds Bayesian network models by learning field dependencies from a set of historical input instances, representing the values of the fields that have been filled in the past. To improve its learning ability, LAFF uses local modeling to effectively mine the local dependencies of fields in a cluster of input instances. During the form-filling phase, LAFF uses these models to predict possible values of a target field, based on the values in the already-filled fields of the form and their dependencies; the predicted values (endorsed based on field dependencies and prediction confidence) are then provided to the end user as a list of suggestions. We evaluated LAFF by assessing its effectiveness and efficiency in form filling on two datasets, one of them a proprietary dataset from the banking domain. Experimental results show that LAFF is able to provide accurate suggestions, with a Mean Reciprocal Rank value above 0.73. Furthermore, LAFF is efficient, requiring at most 317 ms per suggestion.

https://doi.org/10.1145/3533021

[Article Title: Towards Learning Generalizable Code Embeddings Using Task-agnostic Graph Convolutional Networks/ Zishuo Ding, Heng Li, Weiyi Shang and Tse-Hsun (Peter) Chen, p. 48:1-48:43]

Abstract: Code embeddings have seen increasing applications in software engineering (SE) research and practice recently. Despite the advances in embedding techniques applied in SE research, one of the main challenges is their generalizability. A recent study finds that code embeddings may not be readily leveraged for downstream tasks that the embeddings were not particularly trained for. Therefore, in this article, we propose GraphCodeVec, which represents source code as graphs and leverages Graph Convolutional Networks to learn more generalizable code embeddings in a task-agnostic manner. The edges in the graph representation are automatically constructed from the paths in the abstract syntax trees, and the nodes from the tokens in the source code. To evaluate the effectiveness of GraphCodeVec, we consider three downstream benchmark tasks (i.e., code comment generation, code authorship identification, and code clone detection) that were used in a prior benchmarking of code embeddings and add three new downstream tasks (i.e., source code classification, logging statement prediction, and software defect prediction), resulting in a total of six downstream tasks considered in our evaluation. For each downstream task, we apply the embeddings learned by GraphCodeVec and the embeddings learned by four baseline approaches and compare their respective performance. We find that GraphCodeVec outperforms all the baselines in five out of the six downstream tasks, and its performance is relatively stable across different tasks and datasets. In addition, we perform ablation experiments to understand the impacts of the training context (i.e., the graph context extracted from the abstract syntax trees) and the training model (i.e., the Graph Convolutional Networks) on the effectiveness of the generated embeddings. The results show that both the graph context and the Graph Convolutional Networks can benefit GraphCodeVec in producing high-quality embeddings for the downstream tasks, while the improvement from the Graph Convolutional Networks is more robust across different downstream tasks and datasets. Our findings suggest that future research and practice may consider using graph-based deep learning methods to capture the structural information of the source code for SE tasks.

https://doi.org/10.1145/3542944

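A naive stand-in for LAFF's idea is shown below: rank the values of a target field by how often they co-occur with the already-filled fields in historical submissions. The field names and records are invented; LAFF itself learns Bayesian networks with local modeling rather than counting raw co-occurrences.

```python
from collections import Counter

def suggest(history, filled, target, k=3):
    # Rank candidate values of `target` by co-occurrence with the
    # already-filled fields in historical form submissions.
    counts = Counter()
    for record in history:
        if all(record.get(f) == v for f, v in filled.items()):
            counts[record[target]] += 1
    return [value for value, _ in counts.most_common(k)]

history = [
    {"country": "PH", "currency": "PHP", "branch": "Manila"},
    {"country": "PH", "currency": "PHP", "branch": "Cebu"},
    {"country": "DE", "currency": "EUR", "branch": "Berlin"},
]
print(suggest(history, {"country": "PH"}, "currency"))  # ['PHP']
```
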
[Article Title: Efficient and Effective Feature Space Exploration for Testing Deep Learning Systems/ Tahereh Zohdinasab, Vincenzo Riccio, Alessio Gambi and Paolo Tonella, p. 49:1-49:38]

Abstract: Assessing the quality of Deep Learning (DL) systems is crucial, as they are increasingly adopted in safety-critical domains. Researchers have proposed several input generation techniques for DL systems. While such techniques can expose failures, they do not explain which features of the test inputs influenced the system's (mis-)behaviour. DeepHyperion was the first test generator to overcome this limitation by exploring the DL systems' feature space at large. In this article, we propose DeepHyperion-CS, a test generator for DL systems that enhances DeepHyperion by promoting the inputs that contributed more to feature space exploration during the previous search iterations. We performed an empirical study involving two different test subjects (i.e., a digit classifier and a lane-keeping system for self-driving cars). Our results show that the contribution-based guidance implemented within DeepHyperion-CS outperforms state-of-the-art tools and significantly improves the efficiency and effectiveness of DeepHyperion. DeepHyperion-CS exposed significantly more misbehaviours for five out of six feature combinations and was up to 65% more efficient than DeepHyperion in finding misbehaviour-inducing inputs and exploring the feature space. DeepHyperion-CS was useful for expanding the datasets used to train the DL systems, populating up to 200% more feature map cells than the original training set.

https://doi.org/10.1145/3544792

[Article Title: Demystifying Hidden Sensitive Operations in Android Apps/ Xiaoyu Sun, Xiao Chen, Li Li, Haipeng Cai, John Grundy, Jordan Samhi, Tegawendé Bissyandé and Jacques Klein, p. 50:1-50:30]

Abstract: Security of Android devices is now paramount, given their wide adoption among consumers. As researchers develop tools for statically or dynamically detecting suspicious apps, malware writers regularly update their attack mechanisms to hide malicious behavior implementation. This poses two problems to current research techniques: static analysis approaches, given their over-approximations, can report an overwhelming number of false alarms, while dynamic approaches will miss behaviors that are hidden through evasion techniques. We propose in this work a static approach, HiSenDroid, specifically targeted at highlighting hidden sensitive operations (HSOs), mainly sensitive data flows. Its prototype has been evaluated on a large-scale dataset of thousands of malware and goodware samples, on which it successfully revealed anti-analysis code snippets aiming at evading detection by dynamic analysis. We further experimentally show, with FlowDroid, that some of the hidden sensitive behaviors would eventually lead to private data leaks. Those leaks would be hard to spot either manually among the large number of false positives reported by state-of-the-art static analyzers, or by dynamic tools. Overall, by shedding light on hidden sensitive operations, HiSenDroid helps security analysts validate potentially sensitive data operations that would otherwise go unnoticed.

https://doi.org/10.1145/3574158

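Feature-space exploration of the DeepHyperion kind can be sketched as an illumination search that keeps one best input per cell of a feature grid. The toy domain below (numbers instead of digit images) is invented; the -CS variant's contribution-based selection bias is noted but not implemented.

```python
import random

def feature_map_search(seed_inputs, features, fitness, mutate, budget=1000):
    # Illumination-style search: keep the best input per feature-grid
    # cell, then keep mutating occupants. DeepHyperion-CS additionally
    # biases selection toward inputs that recently filled new cells.
    archive = {}
    pool = list(seed_inputs)
    for _ in range(budget):
        x = mutate(random.choice(pool))
        cell = features(x)
        best = archive.get(cell)
        if best is None or fitness(x) > fitness(best):
            archive[cell] = x
            pool.append(x)
    return archive

# Toy example: inputs are numbers; features = (parity, magnitude bucket).
arch = feature_map_search(
    seed_inputs=[1.0],
    features=lambda x: (int(x) % 2, min(int(abs(x) // 10), 9)),
    fitness=lambda x: -abs(x - 42),     # pretend "closeness to failure"
    mutate=lambda x: x + random.uniform(-5, 5),
)
print(len(arch), "cells explored")
```
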
[Article Title: Testing, Validation, and Verification of Robotic and Autonomous Systems: A Systematic Review/ Hugo Araujo, Mohammad Reza Mousavi and Mahsa Varshosaz, p. 51:1-51:61]

Abstract: We perform a systematic literature review on testing, validation, and verification of robotic and autonomous systems (RAS). The scope of this review covers peer-reviewed research papers proposing, improving, or evaluating testing techniques, processes, or tools that address the system-level qualities of RAS. Our survey is performed based on a rigorous methodology structured in three phases. First, we made use of a set of 26 seed papers (selected by domain experts) and the SERP-TEST taxonomy to design our search query and (domain-specific) taxonomy. Second, we conducted a search in three academic search engines and applied our inclusion and exclusion criteria to the results, using related work and domain specialists (50 academics and 15 industry experts) to validate and refine the search query. As a result, we encountered 10,735 studies, out of which 195 were included, reviewed, and coded. Our objective is to answer four research questions, pertaining to (1) the types of models, (2) measures of system performance and testing adequacy, (3) tools and their availability, and (4) evidence of applicability, particularly in industrial contexts. We analyse the results of our coding to identify strengths and gaps in the domain and present recommendations to researchers and practitioners. Our findings show that variants of temporal logics are most widely used for modelling requirements and properties, while variants of state machines and transition systems are widely used for modelling system behaviour. Other common models concern epistemic logics for specifying requirements and belief-desire-intention models for specifying system behaviour. Apart from time and epistemics, other aspects captured in models concern probabilities (e.g., for modelling uncertainty) and continuous trajectories (e.g., for modelling vehicle dynamics and kinematics). Many papers lack any rigorous measure of efficiency, effectiveness, or adequacy for their proposed techniques, processes, or tools. Among those that provide such a measure, the majority use domain-agnostic generic measures such as the number of failures, the size of the state space, or verification time. There is a trend toward addressing this research gap by developing domain-specific notions of performance and adequacy; defining widely accepted, rigorous measures of performance and adequacy for each domain remains an identified research gap. In terms of tools, the most widely used are well-established model checkers such as Prism and Uppaal, as well as simulation tools such as Gazebo; Matlab/Simulink is another widely used toolset in this domain. Overall, there is very limited evidence of industrial applicability in the papers published in this domain. There is also a gap concerning consolidated benchmarks for various types of autonomous systems.

https://doi.org/10.1145/3542945

[Article Title: Dissecting American Fuzzy Lop: A FuzzBench Evaluation/ Andrea Fioraldi, Alessandro Mantovani, Dominik Maier and Davide Balzarotti, p. 52:1-52:26]

Abstract: AFL is one of the most used and extended fuzzers, adopted by industry and academic researchers alike. Although the community agrees on AFL's effectiveness at discovering new vulnerabilities and its outstanding usability, many of its internal design choices remain untested to date. Security practitioners often clone the project “as-is” and use it as a starting point to develop new techniques, usually taking everything under the hood for granted. Instead, we believe that a careful analysis of the different parameters could help modern fuzzers improve their performance and explain how each choice can affect the outcome of security testing, either negatively or positively. The goal of this work is to provide a comprehensive understanding of the internal mechanisms of AFL by performing experiments and by comparing different metrics used to evaluate fuzzers. This can help to show the effectiveness of some techniques and to clarify which aspects are instead outdated. To perform our study, we carried out nine unique experiments on the popular FuzzBench platform. Each test focuses on a different aspect of AFL, ranging from its mutation approach to the feedback encoding scheme and its scheduling methodologies. Our findings show that each design choice affects different factors of AFL. Some of these are positively correlated with the number of detected bugs or the coverage of the target application, whereas other features are related to usability and reliability. Most importantly, we believe that the outcome of our experiments indicates which parts of AFL should be preserved in the design of modern fuzzers.

https://doi.org/10.1145/3580596

[Article Title: Fuzzing Configurations of Program Options/ Zenong Zhang, George Klees, Eric Wang, Michael Hicks and Shiyi Wei, p. 53:1-53:21]

Abstract: While many real-world programs are shipped with configurations to enable/disable functionalities, fuzzers have mostly been applied to test single configurations of these programs. In this work, we first conduct an empirical study to understand how program configurations affect fuzzing performance. We find that limiting a campaign to a single configuration can result in failing to cover a significant amount of code. We also observe that different program configurations contribute differing amounts of code coverage, challenging the idea that each one can be efficiently fuzzed individually. Motivated by these two observations, we propose ConfigFuzz, which can fuzz configurations along with normal inputs. ConfigFuzz transforms the target program to encode its program options within part of the fuzzable input, so existing fuzzers' mutation operators can be reused to fuzz program configurations. We instantiate ConfigFuzz on six configurable, common fuzzing targets and integrate their executions in FuzzBench. In our evaluation, ConfigFuzz outperforms two baseline fuzzers on four targets, while the results are mixed on the other targets, due to program size and configuration space. We also analyze the options fuzzed by ConfigFuzz and how they affect performance.

https://doi.org/10.1145/3580597

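The encoding trick behind ConfigFuzz can be sketched as a harness that decodes the leading bytes of the fuzzable input into option flags and treats the remainder as the ordinary input. The option grammar and target name below are invented; ConfigFuzz itself transforms the target program rather than wrapping it.

```python
import subprocess
import sys

# Schematic ConfigFuzz-style encoding: the first bytes of the fuzzable
# input select program options from a fixed grammar; the rest is data.
OPTION_GRAMMAR = [["-q"], ["-v"], ["--strict"], []]  # hypothetical options

def run_target(fuzz_input: bytes) -> int:
    n_flag_bytes = 2
    flags = []
    for b in fuzz_input[:n_flag_bytes]:
        flags += OPTION_GRAMMAR[b % len(OPTION_GRAMMAR)]
    data = fuzz_input[n_flag_bytes:]
    # Because flag bytes and data are mutated together by the fuzzer,
    # option combinations are explored by ordinary byte-level mutation.
    return subprocess.run(["./target", *flags], input=data).returncode

if __name__ == "__main__":
    run_target(sys.stdin.buffer.read())
```
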
[Article Title: Dissecting American Fuzzy Lop – A FuzzBench Evaluation - RCR Report/ Andrea Fioraldi, Alessandro Mantovani, Dominik Maier and Davide Balzarotti, p. 54:1-54:4]

Abstract: This report describes the artifacts of the “Dissecting American Fuzzy Lop – A FuzzBench Evaluation” paper. The artifacts are available online at https://github.com/eurecom-s3/dissecting_afl and archived at https://doi.org/10.6084/m9.figshare.21401280. They consist of the produced code, the setup to run the experiments in FuzzBench, and the generated reports. We claim the Functional badge, as the patches to AFL are easy to enable and the experiments are easy to run thanks to the FuzzBench service; the evaluations are self-contained, and the modifications to AFL are provided as is. For the purpose of reproducing the experiments, no particular skills are needed, as the process is straightforward and described in https://google.github.io/fuzzbench/getting-started/adding-a-new-fuzzer/#requesting-an-experiment.

https://doi.org/10.1145/3580600

[Article Title: Fuzzing Configurations of Program Options - RCR Report/ Zenong Zhang, George Klees, Eric Wang, Michael Hicks and Shiyi Wei, p. 55:1-55:3]

Abstract: This artifact contains the source code and instructions to reproduce the evaluation results of the article “Fuzzing Configurations of Program Options.” The source code includes the configuration grammars for six target programs, the scripts to generate configuration stubs, and the scripts to post-process fuzzing results. The README of the artifact includes the steps to prepare the experimental environment on a clean Ubuntu machine and step-by-step commands to reproduce the evaluation experiments. A VirtualBox image with ConfigFuzz properly set up is also included.

https://doi.org/10.1145/3580601
Item type: Serials
Item type: Serials
Current library: LRC - Main
Home library: National University - Manila
Collection: Gen. Ed. - CCIT
Shelving location: Periodicals
Call number: ACM Transactions on Software Engineering and Methodology, Volume 32, Issue 2, 2023
Copy number: c.1
Status: Available
Barcode: PER000000586
Browsing National University - Manila shelves, Shelving location: Periodicals, Collection: Gen. Ed. - CCIT
ACM Transactions on Software Engineering and Methodology, Volume 31, Issue 2, 2022
ACM Transactions on Software Engineering and Methodology, Volume 31, Issue 3, 2022
ACM Transactions on Software Engineering and Methodology, Volume 31, Issue 4, 2022
ACM Transactions on Software Engineering and Methodology, Volume 32, Issue 2, 2023
MIS Asia, Volume 2, March - April 2010
MIS Asia, Volume 5, September - October 2009
MIS Asia, Volume 6, November - December 2009

Includes bibliographical references.

Single and Multi-objective Test Cases Prioritization for Self-driving Cars in Virtual Environments -- Graded Refinement, Retrenchment, and Simulation -- On the Significance of Category Prediction for Code-Comment Synchronization -- Dealing with Belief Uncertainty in Domain Models -- Aide-mémoire: Improving a Project’s Collective Memory via Pull Request–Issue Links -- iBiR: Bug-report-driven Fault Injection -- deGraphCS: Embedding Variable-based Flow Graph for Neural Code Search -- Accelerating Overdue Pull Requests toward Completion -- Toward More Efficient Statistical Debugging with Abstraction Refinement -- Estimating Probabilistic Safe WCET Ranges of Real-Time Systems at Design Stages -- Coverage-Based Debloating for Java Bytecode -- DIRE and its Data: Neural Decompiled Variable Renamings with Respect to Software Class -- A Characterization Study of Merge Conflicts in Java Projects -- Data Race Repair Using Static Analysis Summaries -- Evaluating Surprise Adequacy for Deep Learning System Testing -- Assessing the Alignment between the Information Needs of Developers and the Documentation of Programming Languages: A Case Study on Rust -- On Proving the Correctness of Refactoring Class Diagrams of MDE Metamodels -- Suboptimal Comments in Java Projects: From Independent Comment Changes to Commenting Practices -- HINNPerf: Hierarchical Interaction Neural Network for Performance Prediction of Configurable Systems -- A Machine Learning Approach for Automated Filling of Categorical Fields in Data Entry Forms -- Towards Learning Generalizable Code Embeddings Using Task-agnostic Graph Convolutional Networks -- Efficient and Effective Feature Space Exploration for Testing Deep Learning Systems -- Demystifying Hidden Sensitive Operations in Android Apps -- Testing, Validation, and Verification of Robotic and Autonomous Systems: A Systematic Review -- Dissecting American Fuzzy Lop: A FuzzBench Evaluation -- Fuzzing Configurations of Program Options -- Dissecting American Fuzzy Lop – A FuzzBench Evaluation - RCR Report -- Fuzzing Configurations of Program Options - RCR Report.

[Article Title: Single and Multi-objective Test Cases Prioritization for Self-driving Cars in Virtual Environments/ Christian Birchler, Sajad Khatiri, Pouria Derakhshanfar, Sebastiano Panichella, and Annibale Panichella, p. 28:1-28:30]

Abstract: Testing with simulation environments helps to identify critical failing scenarios for self-driving cars (SDCs). Simulation-based tests are safer than in-field operational tests and allow detecting software defects before deployment. However, these tests are very expensive and are too many to be run frequently within limited time constraints.

In this article, we investigate test case prioritization techniques to increase the ability to detect SDC regression faults with virtual tests earlier. Our approach, called SDC-Prioritizer, prioritizes virtual tests for SDCs according to static features of the roads that we designed to be used within the driving scenarios. These features can be collected without running the tests, which means that they do not require past execution results. We introduce two evolutionary approaches to prioritize the test cases using diversity metrics (black-box heuristics) computed on these static features. These two approaches, called SO-SDC-Prioritizer and MO-SDC-Prioritizer, use single-objective and multi-objective genetic algorithms (GA), respectively, to find trade-offs between executing less expensive tests and executing the most diverse test cases earlier.

Our empirical study, conducted in the SDC domain, shows that MO-SDC-Prioritizer significantly (p-value <= 0.1e-10) improves the ability to detect safety-critical failures at the same level of execution time compared to baselines: random and greedy-based test case orderings. Moreover, our study indicates that multi-objective meta-heuristics outperform single-objective approaches when prioritizing simulation-based tests for SDCs.

MO-SDC-Prioritizer prioritizes test cases with a large improvement in fault detection while its overhead (up to 0.45% of the test execution cost) is negligible.
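
The trade-off the abstract describes can be pictured with a small sketch. The Python fragment below is illustrative only (road features, costs, and the random-search stand-in for the genetic algorithm are all invented, not taken from SDC-Prioritizer): it scores test orderings by position-weighted cost and diversity and keeps the non-dominated orderings.

```python
# Illustrative sketch only -- not the authors' SDC-Prioritizer implementation.
# It shows the kind of bi-objective trade-off the abstract describes:
# order tests so that cheap AND mutually diverse tests run first.
import random

# Hypothetical static road features per test: (num_turns, length_km, max_curvature)
tests = {
    "t1": (2, 1.0, 0.3), "t2": (9, 3.5, 0.9),
    "t3": (3, 1.2, 0.4), "t4": (8, 3.0, 0.8),
}
cost = {"t1": 10.0, "t2": 40.0, "t3": 12.0, "t4": 35.0}  # assumed execution times

def distance(a, b):
    """Euclidean distance between two feature vectors (a black-box diversity metric)."""
    return sum((x - y) ** 2 for x, y in zip(tests[a], tests[b])) ** 0.5

def fitness(order):
    """Return (position-weighted cost, position-weighted diversity).
    Earlier positions weigh more, so good orderings put cheap, diverse tests first."""
    n = len(order)
    weights = [n - i for i in range(n)]
    weighted_cost = sum(w * cost[t] for w, t in zip(weights, order))
    # diversity of each test w.r.t. those already scheduled before it
    div = 0.0
    for i, t in enumerate(order[1:], start=1):
        div += weights[i] * min(distance(t, prev) for prev in order[:i])
    return weighted_cost, div

# Stand-in for the genetic search: sample random permutations and keep the
# non-dominated ones (a real GA would evolve permutations instead).
front = []
for _ in range(500):
    perm = random.sample(list(tests), len(tests))
    c, d = fitness(perm)
    if not any(c2 <= c and d2 >= d for _, c2, d2 in front):
        front = [(p, c2, d2) for p, c2, d2 in front if not (c <= c2 and d >= d2)]
        front.append((perm, c, d))

for perm, c, d in front:
    print(perm, f"cost={c:.1f} diversity={d:.2f}")
```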

https://doi.org/10.1145/3533818

[Article Title: Graded Refinement, Retrenchment, and Simulation/ Richard Banach, p. 29:1-29:69]

Abstract: Refinement of formal system models towards implementation has been a mainstay of system development since the inception of formal and Correct by Construction approaches to system development. However, pure refinement approaches do not always deal fluently with all desirable system requirements. This prompted the development of alternatives and generalizations, such as retrenchment. The crucial concept of simulation is key to judging the quality of the conformance between abstract and more concrete system models. Reformulations of these theoretical approaches are reprised and are embedded in a graded framework. The added flexibility this offers is intended to deal more effectively with the needs of applications in which the relationship between different levels of abstraction is not straightforward, and in which behavior can oscillate between conforming quite closely to an idealized abstraction and deviating quite far from it. The framework developed is confronted with an intentionally demanding case study: a model active control system for the protection of buildings during earthquakes. This offers many challenges: it is hybrid/cyber-physical; it has to respond to rather unpredictable inputs; and it has to straddle the gap between continuous behavior and discretized/quantized/numerical implementation.

https://doi.org/10.1145/3534116

[Article Title: On the Significance of Category Prediction for Code-Comment Synchronization/ Zhen Yang, Jacky Wai Keung, Xiao Yu, Yan Xiao, Zhi Jin and Jingyu Zhang, p. 30:1-30:41]

Abstract: Software comments sometimes are not promptly updated in sync when the associated code is changed. The inconsistency between code and comments may mislead developers and result in future bugs. Thus, studies concerning code-comment synchronization, which aims to automatically synchronize comments with code changes, have become highly important. Existing code-comment synchronization approaches mainly fall into two types: (1) deep learning-based (e.g., CUP) and (2) heuristic-based (e.g., HebCUP). The former constructs a neural machine translation-structured semantic model, which generalizes better when synchronizing comments as software evolves and grows. The latter designs a series of rules for performing token-level replacements on old comments, which can generate completely correct comments for the samples fully covered by its carefully designed heuristic rules. In this article, we propose a composite approach named CBS (i.e., Classifying Before Synchronizing) to further improve code-comment synchronization performance, which combines the advantages of CUP and HebCUP with the assistance of inferred categories of Code-Comment Inconsistent (CCI) samples. Specifically, we first define two categories (i.e., heuristic-prone and non-heuristic-prone) for CCI samples and propose five features to assist category prediction. The samples whose comments can be correctly synchronized by HebCUP are heuristic-prone, while the others are non-heuristic-prone. Then, CBS employs our proposed Multi-Subsets Ensemble Learning (MSEL) classification algorithm to alleviate the class imbalance problem and construct the category prediction model. Next, CBS uses the trained MSEL to predict the category of each new sample. If the predicted category is heuristic-prone, CBS employs HebCUP to conduct the code-comment synchronization for the sample; otherwise, CBS assigns CUP to handle it. Our extensive experiments demonstrate that CBS statistically significantly outperforms CUP and HebCUP, obtaining an average improvement of 23.47%, 22.84%, 3.04%, 3.04%, 1.64%, and 19.39% in terms of Accuracy, Recall@5, Average Edit Distance (AED), Relative Edit Distance (RED), BLEU-4, and Effective Synchronized Sample (ESS) ratio, respectively, which highlights that category prediction for CCI samples can boost code-comment synchronization performance.
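
The routing step at the heart of CBS can be illustrated with a small sketch. The classes below are stubs standing in for the real MSEL classifier, HebCUP, and CUP (a sketch of the dispatch logic only, with invented sample data, not the authors' implementation):

```python
# Toy sketch of "classify before synchronizing": route each code-comment-
# inconsistent sample to the heuristic tool when a classifier predicts its
# rules will apply, otherwise to the learned model.
from dataclasses import dataclass

@dataclass
class Sample:
    old_comment: str
    old_code: str
    new_code: str

class StubClassifier:
    def predict(self, s: Sample) -> str:
        # The real CBS trains MSEL on five features; this toy proxy assumes
        # same-length token renames tend to be heuristic-prone.
        same_len = len(s.old_code.split()) == len(s.new_code.split())
        return "heuristic-prone" if same_len else "non-heuristic-prone"

class StubHebCUP:
    def update(self, s: Sample) -> str:
        # Token-level replacement: map changed code tokens into the comment.
        repl = {o: n for o, n in zip(s.old_code.split(), s.new_code.split()) if o != n}
        return " ".join(repl.get(tok, tok) for tok in s.old_comment.split())

class StubCUP:
    def update(self, s: Sample) -> str:
        return s.old_comment + "  (regenerated by learned model)"

def synchronize(s: Sample, clf, hebcup, cup) -> str:
    """CBS-style routing: heuristic-prone samples go to HebCUP, the rest to CUP."""
    return hebcup.update(s) if clf.predict(s) == "heuristic-prone" else cup.update(s)

s = Sample("returns the maxValue of the list", "int maxValue ( )", "int maxVal ( )")
print(synchronize(s, StubClassifier(), StubHebCUP(), StubCUP()))
# -> "returns the maxVal of the list"
```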

https://doi.org/10.1145/3534117

[Article Title: Dealing with Belief Uncertainty in Domain Models/ Lola Burgueño, Paula Muñoz, Robert Clarisó, Jordi Cabot, Sébastien Gérard and Antonio Vallecillo, p. 31:1-31:34]

Abstract: There are numerous domains in which information systems need to deal with uncertain information. These uncertainties may originate from different reasons such as vagueness, imprecision, incompleteness, or inconsistencies, and in many cases, they cannot be neglected. In this article, we are interested in representing and processing uncertain information in domain models, considering the stakeholders’ beliefs (opinions). We show how to associate beliefs to model elements and how to propagate and operate with their associated uncertainty so that domain experts can individually reason about their models enriched with their personal opinions. In addition, we address the challenge of combining the opinions of different domain experts on the same model elements, with the goal of reaching informed collective decisions. We provide different strategies and a methodology to optimally merge individual opinions.
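
To make the merging step concrete, here is a minimal Python sketch of one naive merge strategy, a plain weighted average over (belief, disbelief, uncertainty) triples. The article evaluates several strategies; the opinions and weights below are invented for illustration:

```python
# Illustrative sketch only: one simple way to merge experts' opinions about a
# model element. Each opinion is a (belief, disbelief, uncertainty) triple
# summing to 1; weights could encode expertise.
def merge_opinions(opinions, weights=None):
    if weights is None:
        weights = [1.0] * len(opinions)
    total = sum(weights)
    return tuple(
        sum(w * o[i] for w, o in zip(weights, opinions)) / total for i in range(3)
    )

# Two experts disagree on "every order has a valid customer":
expert_a = (0.8, 0.1, 0.1)   # fairly confident it holds
expert_b = (0.3, 0.4, 0.3)   # doubts it, with more uncertainty
print(merge_opinions([expert_a, expert_b], weights=[2.0, 1.0]))
```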

https://doi.org/10.1145/3542947

[Article Title: Aide-mémoire: Improving a Project’s Collective Memory via Pull Request–Issue Links/ Profir-Petru Pârţachi, David R. White and Earl T. Barr, p. 32:1-32:36]

Abstract: Links between pull requests and the issues they address document and accelerate the development of a software project but are often omitted. We present a new tool, Aide-mémoire, to suggest such links when a developer submits a pull request or closes an issue, smoothly integrating into existing workflows. In contrast to previous state-of-the-art approaches that repair related commit histories, Aide-mémoire is designed for continuous, real-time, and long-term use, employing a Mondrian forest to adapt over a project’s lifetime and continuously improve traceability. Aide-mémoire is tailored for two specific instances of the general traceability problem—namely, commit-to-issue and pull-request-to-issue links, with a focus on the latter—and exploits data inherent to these two problems to outperform tools for general-purpose link recovery. Our approach is online, language-agnostic, and scalable. We evaluate it over a corpus of 213 projects and six programming languages, achieving a mean average precision of 0.95. Adopting Aide-mémoire is both efficient and effective: a programmer need only evaluate a single suggested link 94% of the time, and 16% of all discovered links were originally missed by developers.

https://doi.org/10.1145/3542937

[Article Title: iBiR: Bug-report-driven Fault Injection/ Ahmed Khanfir, Anil Koyuncu, Mike Papadakis, Maxime Cordy, Tegawende F. Bissyandé, Jacques Klein and Yves Le Traon, p. 33:1-33:31]

Abstract: Much research on software engineering relies on experimental studies based on fault injection. Fault injection, however, often fails to emulate real-world software faults, since it “blindly” injects large numbers of faults. Indeed, it remains challenging to inject few but realistic faults that target a particular functionality in a program. In this work, we introduce iBiR, a fault injection tool that addresses this challenge by exploring change patterns associated with user-reported faults. To inject realistic faults, we create mutants by re-targeting a bug-report-driven automated program repair system, i.e., reversing its code transformation templates. iBiR is further appealing in practice since it requires deep knowledge of neither code nor tests, just of the program’s relevant bug reports. Thus, our approach focuses the fault injection on the feature targeted by the bug report. We assess iBiR on the Defects4J dataset. Experimental results show that our approach outperforms the fault injection performed by traditional mutation testing in terms of semantic similarity with the original bug, when applied at either the system or class level of granularity, and provides better, statistically significant estimations of test effectiveness (fault detection). Additionally, when injecting 100 faults, iBiR injects faults that couple with the real ones in around 36% of the cases, while mutation testing achieves less than 4%.
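
The template-reversal idea can be illustrated in a few lines. The sketch below inverts a single, assumed repair pattern ("guard a dereference with a null check") at the textual level; iBiR itself works on real repair templates applied at bug-report-localized code, so this is only a toy analogue:

```python
# Minimal sketch of the key idea (assumed pattern, not iBiR's actual
# templates): take a fix pattern from an APR system and apply it in reverse
# to inject a realistic fault at a location implicated by a bug report.
import re

# Repair template "wrap dereference in a null check", reversed: the
# injector *removes* an existing guard to re-create the bug class.
NULL_CHECK = re.compile(r"if \((\w+) != null\) \{ (.*?) \}")

def inject_fault(java_line: str) -> str:
    """Reverse the fix: drop the guard, keep the guarded statement."""
    return NULL_CHECK.sub(r"\2", java_line)

print(inject_fault("if (user != null) { user.save(); }"))
# -> "user.save();"  -- a mutant reproducing a null-dereference fault
```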

https://doi.org/10.1145/3542946

[Article Title: deGraphCS: Embedding Variable-based Flow Graph for Neural Code Search/ Chen Zeng, Yue Yu, Shanshan Li, Xin Xia, Zhiming Wang, Mingyang Geng, Linxiao Bai, Wei Dong and Xiangke Liao, p. 34:1-34:27]

Abstract: With the rapid increase of public code repositories, developers have a strong need to retrieve precise code snippets using natural language. Despite existing deep learning-based approaches that provide end-to-end solutions (i.e., accept natural language as queries and show related code fragments), code search over large-scale repositories still suffers from low accuracy because of the choice of code representation (e.g., AST) and modeling (e.g., directly fusing features in the attention stage).

In this paper, we propose a novel learnable deep Graph for Code Search (called deGraphCS), which transfers source code into variable-based flow graphs based on an intermediate representation technique and can model code semantics more precisely than directly processing the code as text or using the syntax tree representation. Furthermore, we propose a graph optimization mechanism to refine the code representation and apply an improved gated graph neural network to model variable-based flow graphs. To evaluate the effectiveness of deGraphCS, we collect a large-scale dataset from GitHub containing 41,152 code snippets written in the C language and reproduce several typical deep code search methods for comparison. The experimental results show that deGraphCS can achieve state-of-the-art performance and accurately retrieve code snippets satisfying the needs of the users.

https://doi.org/10.1145/3546066

[Article Title: Nudge: Accelerating Overdue Pull Requests toward Completion/ Chandra Maddila, Sai Surya Upadrasta, Chetan Bansal, Nachiappan Nagappan, Georgios Gousios and Arie van Deursen, p. 35:1-35:30]

Abstract: Pull requests are a key part of the collaborative software development and code review process today. However, pull requests can also slow down the software development process when the reviewer(s) or the author do not actively engage with the pull request. In this work, we design an end-to-end service, Nudge, for accelerating overdue pull requests toward completion by reminding the author or the reviewer(s) to engage with their overdue pull requests. First, we use models based on effort estimation and machine learning to predict the completion time for a given pull request. Second, we use activity detection to filter out pull requests that may be overdue but for which sufficient action is taking place nonetheless. Last, we use actor identification to understand who the blocker of the pull request is and nudge the appropriate actor (author or reviewer(s)). The key novelty of Nudge is that it succeeds in reducing pull request resolution time, while ensuring that developers perceive the notifications sent as useful, at the scale of thousands of repositories. In a randomized trial on 147 repositories in use at Microsoft, Nudge was able to reduce pull request resolution time by 60% for 8,500 pull requests, when compared to overdue pull requests for which Nudge did not send a notification. Furthermore, developers receiving Nudge notifications resolved 73% of these notifications positively. We observed similar results when scaling up the deployment of Nudge to 8,000 repositories at Microsoft, for which Nudge sent 210,000 notifications during a full year. This demonstrates Nudge’s ability to scale to thousands of repositories. Last, our qualitative analysis of a selection of Nudge notifications indicates areas for future research, such as taking dependencies among pull requests and developer availability into account.
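
The three-stage decision the abstract outlines lends itself to a compact sketch. Everything below (fields, thresholds, and the effort-estimation output) is invented for illustration; the real service learns these from repository data:

```python
# Sketch of the three-stage decision (stub logic, not Microsoft's Nudge
# service): predict how long a PR should take, filter out PRs that are
# still active, then pick whom to notify.
from dataclasses import dataclass

@dataclass
class PullRequest:
    age_days: float
    predicted_days: float      # from an effort-estimation model (assumed given)
    hours_since_activity: float
    awaiting: str              # "author" (changes requested) or "reviewer"

def nudge_decision(pr: PullRequest):
    if pr.age_days <= pr.predicted_days:
        return None            # not overdue yet
    if pr.hours_since_activity < 24:
        return None            # activity detection: someone is already on it
    return pr.awaiting         # actor identification: who blocks the PR

pr = PullRequest(age_days=9, predicted_days=3, hours_since_activity=72, awaiting="reviewer")
print(nudge_decision(pr))      # -> "reviewer": remind the reviewer(s), not the author
```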

https://doi.org/10.1145/3544791

[Article Title: Toward More Efficient Statistical Debugging with Abstraction Refinement/ Zhiqiang Zuo, Xintao Niu, Siyi Zhang, Lu Fang, Siau Cheng Khoo, Shan Lu, Chengnian Sun and Guoqing Harry Xu, p. 36:1-36:38]

Abstract: Debugging is known to be a notoriously painstaking and time-consuming task. As one major family of automated debugging, statistical debugging approaches have been well investigated over the past decade; they collect failing and passing executions and apply statistical techniques to identify discriminative elements as potential bug causes. Most of the existing approaches instrument the entire program to produce execution profiles for debugging, thus incurring hefty instrumentation and analysis costs. However, since a major part of the program code is in fact error-free, full-scale program instrumentation is wasteful and unnecessary.

This article presents a systematic abstraction refinement-based pruning technique for statistical debugging. Our technique only needs to instrument and analyze the code partially. Guided by a mathematically rigorous analysis, it is guaranteed to produce the same debugging results as an exhaustive analysis in deterministic settings. With the help of this effective and safe pruning, our technique greatly reduces the cost of failure diagnosis without sacrificing any debugging capability.

We apply this technique to two different statistical debugging scenarios: in-house and production-run statistical debugging. Comprehensive evaluations validate that our technique can significantly improve the efficiency of statistical debugging in both scenarios without jeopardizing the debugging capability.
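
A minimal sketch of the pruning intuition, with a toy Ochiai-style suspiciousness score and invented coverage counts (the article's analysis and safety guarantee are far more involved):

```python
# Illustrative sketch of abstraction refinement for statistical debugging:
# profile coarse units first, then "zoom in" and instrument only the
# suspicious ones at the statement level.
def suspiciousness(fail_cov, pass_cov, total_fail, total_pass):
    """A simple Ochiai-style score from failing/passing coverage counts."""
    denom = (total_fail * (fail_cov + pass_cov)) ** 0.5
    return fail_cov / denom if denom else 0.0

# Coarse level: per-function coverage counts in (failing, passing) runs.
functions = {"parse": (9, 40), "eval": (10, 2), "print": (1, 50)}
TOTAL_FAIL, TOTAL_PASS = 10, 60

scores = {f: suspiciousness(fc, pc, TOTAL_FAIL, TOTAL_PASS)
          for f, (fc, pc) in functions.items()}
suspicious = [f for f, s in sorted(scores.items(), key=lambda kv: -kv[1]) if s > 0.5]
print("refine (statement-level instrumentation) only:", suspicious)
# Pruned functions are never instrumented at the statement level, which is
# where the cost savings come from; the article shows the pruning is safe.
```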

https://doi.org/10.1145/3544790

[Article Title: Estimating Probabilistic Safe WCET Ranges of Real-Time Systems at Design Stages/ Jaekwon Lee, Seung Yeob Shin, Shiva Nejati, Lionel Briand and Yago Isasi Parache, p. 37:1-37:33]

Abstract: Estimating worst-case execution time (WCET) is an important activity at early design stages of real-time systems. Based on WCET estimates, engineers make design and implementation decisions to ensure that task executions always complete before their specified deadlines. However, in practice, engineers often cannot provide precise point WCET estimates and prefer to provide plausible WCET ranges. Given a set of real-time tasks with such ranges, we provide an automated technique to determine for what WCET values the system is likely to meet its deadlines and, hence, operate safely with a probabilistic guarantee. Our approach combines a search algorithm for generating worst-case scheduling scenarios with polynomial logistic regression for inferring probabilistic safe WCET ranges. We evaluated our approach by applying it to three industrial systems from different domains and several synthetic systems. Our approach efficiently and accurately estimates probabilistic safe WCET ranges within which deadlines are likely to be satisfied with a high degree of confidence.
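
The combination the abstract describes (sampling WCETs from the given ranges, labeling each sample with a schedulability check, and fitting a logistic model) can be sketched as follows. This toy version assumes two periodic tasks, requires scikit-learn, and substitutes the classic Liu-Layland utilization bound for the paper's search-based worst-case scheduling analysis:

```python
# Toy sketch of the idea (not the paper's tool): sample WCETs from the
# engineers' plausible ranges, label each sample by a schedulability check,
# then fit logistic regression to estimate P(deadlines met | WCETs).
import random
from sklearn.linear_model import LogisticRegression

random.seed(0)
PERIODS = (10.0, 25.0)                 # two periodic tasks (assumed)
RANGES = ((1.0, 6.0), (2.0, 12.0))     # engineer-provided WCET ranges

X, y = [], []
for _ in range(2000):
    wcets = [random.uniform(lo, hi) for lo, hi in RANGES]
    utilization = sum(c / t for c, t in zip(wcets, PERIODS))
    X.append(wcets)
    # Liu-Layland bound for n=2 tasks stands in for worst-case scheduling search.
    y.append(1 if utilization <= 2 * (2 ** 0.5 - 1) else 0)

model = LogisticRegression().fit(X, y)
# Probabilistic safe-range query: is (C1=3.0, C2=5.0) likely schedulable?
print(model.predict_proba([[3.0, 5.0]])[0][1])
```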

https://doi.org/10.1145/3546941

[Article Title: Coverage-Based Debloating for Java Bytecode/ César Soto-Valero, Thomas Durieux, Nicolas Harrand and Benoit Baudry, p. 38:1-38:34]

Abstract: Software bloat is code that is packaged in an application but is actually not necessary to run the application. The presence of software bloat is an issue for security, performance, and maintenance. In this article, we introduce a novel technique for debloating, which we call coverage-based debloating. We implement the technique for a single language: Java bytecode. We leverage a combination of state-of-the-art Java bytecode coverage tools to precisely capture what parts of a project and its dependencies are used when running with a specific workload. Then, we automatically remove the parts that are not covered in order to generate a debloated version of the project. We successfully debloat 211 library versions from a dataset of 94 unique open-source Java libraries. The debloated versions are syntactically correct and preserve their original behaviour according to the workload. Our results indicate that 68.3% of the libraries’ bytecode and 20.3% of their total dependencies can be removed through coverage-based debloating.

For the first time in the literature on software debloating, we assess the utility of debloated libraries with respect to the client applications that reuse them. We select 988 client projects that either have a direct reference to the debloated library in their source code or whose test suite covers at least one class of the libraries that we debloat. Our results show that 81.5% of the clients with at least one test that uses the library successfully compile and pass their test suite when the original library is replaced by its debloated version.
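
The core mechanism reduces to "keep what the workload touches." In the sketch below, plain Python sets stand in for bytecode-level coverage data (member names and traces are invented), to show the shape of the computation:

```python
# Minimal sketch of coverage-based debloating (toy, bytecode-free): run the
# workload under coverage, then drop everything the workload never touched.
project_members = {
    "json.Parser.parse", "json.Parser.parseArray",
    "json.Writer.write", "json.Writer.prettyPrint",
    "legacy.XmlBridge.convert",
}

def covered_members(workload_traces):
    """Union of members observed across all workload executions."""
    return set().union(*workload_traces)

traces = [
    {"json.Parser.parse", "json.Writer.write"},
    {"json.Parser.parse", "json.Parser.parseArray"},
]
keep = covered_members(traces)
debloated = project_members - keep
print("remove from the jar:", sorted(debloated))
# -> ['json.Writer.prettyPrint', 'legacy.XmlBridge.convert']
```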

https://doi.org/10.1145/3546948

[Article Title: DIRE and its Data: Neural Decompiled Variable Renamings with Respect to Software Class/ Luke Dramko, Jeremy Lacomis, Pengcheng Yin, Ed Schwartz, Miltiadis Allamanis, Graham Neubig, Bogdan Vasilescu and Claire Le Goues, p. 39:1-39:34]

Abstract: The decompiler is one of the most common tools for examining executable binaries without the corresponding source code. It transforms binaries into high-level code, reversing the compilation process. Unfortunately, decompiler output is far from readable because the decompilation process is often incomplete. State-of-the-art techniques use machine learning to predict missing information like variable names. While these approaches are often able to suggest good variable names in context, no existing work examines how the selection of training data influences these machine learning models. We investigate how data provenance and the quality of training data affect performance, and how well, if at all, trained models generalize across software domains. We focus on the variable renaming problem using one such machine learning model, DIRE. We first describe DIRE in detail and the accompanying technique used to generate training data from raw code. We also evaluate DIRE’s overall performance irrespective of data quality. Next, we show how training on more popular, possibly higher-quality code (measured using GitHub stars) leads to a more generalizable model, because popular code tends to have more diverse variable names. Finally, we evaluate how well DIRE predicts domain-specific identifiers, propose a modification to incorporate domain information, and show that it can predict identifiers in domain-specific scenarios 23% more frequently than the original DIRE model.

https://doi.org/10.1145/3546946

[Article Title: A Characterization Study of Merge Conflicts in Java Projects/ Bowen Shen, Muhammad Ali Gulzar, Fei He and Na Meng, p. 40:1-40:28]

Abstract: In collaborative software development, programmers create software branches to add features and fix bugs tentatively, and then merge branches to integrate edits. When edits from different branches textually overlap (i.e., textual conflicts) or lead to compilation and runtime errors (i.e., build and test conflicts), it is challenging for developers to remove such conflicts. Prior work proposed tools to detect and resolve conflicts and investigated how conflicts relate to code smells and the software development process. However, many questions are still not fully investigated, such as what types of conflicts exist in real-world applications and how developers or tools handle them. For this article, we used automated textual merge, compilation, and testing to reveal three types of conflicts in 208 open-source repositories: textual conflicts, build conflicts (i.e., conflicts causing build errors), and test conflicts (i.e., conflicts triggering test failures). We manually inspected 538 conflicts and their resolutions to characterize merge conflicts from different angles.

Our analysis revealed three interesting phenomena. First, higher-order conflicts (i.e., build and test conflicts) are harder to detect and resolve, while existing tools mainly focus on textual conflicts. Second, developers manually resolved most higher-order conflicts by applying similar edits to multiple program locations; their conflict resolutions share common editing patterns, implying great opportunities for future tool design. Third, developers resolved 64% of true textual conflicts by keeping complete edits from either the left or the right branch. Unlike prior studies, our research for the first time thoroughly characterizes three types of conflicts, with a special focus on higher-order conflicts and the limitations of existing tool design. Our work will shed light on future research on software merging.

https://doi.org/10.1145/3546944

[Article Title: Hippodrome: Data Race Repair Using Static Analysis Summaries/ Andreea Costea, Abhishek Tiwari, Sigmund Chianasta, Kishore R, Abhik Roychoudhury and Ilya Sergey, p. 41:1-41:33]

Abstract: Implementing bug-free concurrent programs is a challenging task in modern software development. State-of-the-art static analyses find hundreds of concurrency bugs in production code, scaling to large codebases. Yet, fixing these bugs in constantly changing codebases represents a daunting effort for programmers, particularly because a fix in the concurrent code can introduce other bugs in a subtle way.

In this work, we show how to harness compositional static analysis for concurrency bug detection to enable a new Automated Program Repair (APR) technique for data races in large concurrent Java codebases. The key innovation of our work is an algorithm that translates the procedure summaries inferred by the analysis tool for bug reporting into small local patches that fix concurrency bugs (without introducing new ones). This synergy makes it possible to extend the virtues of compositional static concurrency analysis to APR, making our approach effective (it can detect and fix many more bugs than existing tools for data race repair), scalable (it takes seconds to analyze and suggest fixes for sizeable codebases), and usable (generally, it does not require annotations from the users and can perform continuous automated repair). Our study, conducted on popular open-source projects, has confirmed that our tool automatically produces concurrency fixes similar to those proposed by the developers in the past.

https://doi.org/10.1145/3546942

[Article Title: Evaluating Surprise Adequacy for Deep Learning System Testing/ Jinhan Kim, Robert Feldt and Shin Yoo, p. 42:1-42:29]

Abstract: The rapid adoption of Deep Learning (DL) systems in safety-critical domains such as medical imaging and autonomous driving urgently calls for ways to test their correctness and robustness. Borrowing from the concept of test adequacy in traditional software testing, existing work on testing DL systems initially investigated them from a structural point of view, leading to a number of coverage metrics. Our lack of understanding of the internal mechanisms of Deep Neural Networks (DNNs), however, means that coverage metrics defined on the Boolean dichotomy of coverage are hard to intuitively interpret and understand. We propose the degree of out-of-distribution-ness of a given input as its adequacy for testing: the more surprising a given input is to the DNN under test, the more likely the system will show unexpected behavior for the input. We develop the concept of surprise into a test adequacy criterion, called Surprise Adequacy (SA). Intuitively, SA measures the difference between the behavior of the DNN for the given input and its behavior for the training data. We posit that a good test input should be sufficiently, but not overtly, surprising compared to the training dataset. This article evaluates SA using a range of DL systems, from simple image classifiers to autonomous driving car platforms, as well as both small and large data benchmarks ranging from MNIST to ImageNet. The results show that the SA value of an input can be a reliable predictor of the correctness of the model behavior. We also show that SA can be used to detect adversarial examples, and that it can be efficiently computed against large training datasets such as ImageNet using sampling.
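
A stripped-down sketch of distance-based SA, one of the variants the article evaluates: the activation traces below are invented three-neuron vectors, and the within-class normalization the article applies is omitted for brevity:

```python
# Sketch of distance-based Surprise Adequacy (simplified): the "surprise" of
# an input is how far its activation trace lies from the traces the DNN
# produced on its own training data.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_sa(trace, training_traces):
    """Distance to the nearest training trace (normalization omitted)."""
    return min(euclidean(trace, t) for t in training_traces)

training = [(0.1, 0.9, 0.0), (0.2, 0.8, 0.1), (0.9, 0.1, 0.5)]
print(distance_sa((0.15, 0.85, 0.05), training))  # in-distribution: low SA
print(distance_sa((0.9, 0.9, 0.9), training))     # surprising input: high SA
```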

https://doi.org/10.1145/3546947

[Article Title: Assessing the Alignment between the Information Needs of Developers and the Documentation of Programming Languages: A Case Study on Rust/ Filipe Roseiro Cogo, Xin Xia, Ahmed E. Hassan, p. 43:1-43:48]

Abstract: Programming language documentation refers to the set of technical documents that provide application developers with a description of the high-level concepts of a language (e.g., manuals, tutorials, and API references). Such documentation is essential to support application developers in effectively using a programming language. One of the challenges faced by documenters (i.e., personnel that design and produce documentation for a programming language) is to ensure that documentation has relevant information that aligns with the concrete needs of developers, defined as the missing knowledge that developers acquire via voluntary search. In this article, we present an automated approach to support documenters in evaluating the differences and similarities between the concrete information needs of developers and the current state of documentation (a problem that we refer to as the topical alignment of a programming language documentation). Our approach leverages semi-supervised topic modelling that uses domain knowledge to guide the derivation of topics. We initially train a baseline topic model from a set of Rust-related Q&A posts. We then use this baseline model to determine the distribution of topic probabilities for each document of the official Rust documentation. Afterwards, we assess the similarities and differences between the topics of the Q&A posts and those of the official documentation. Our results show a relatively high level of topical alignment in Rust documentation. Still, information about specific topics is scarce in both the Q&A websites and the documentation, particularly topics related to programming niches such as network, game, and database development. For other topics (e.g., topics related to language features such as structs, patterns and matchings, and the foreign function interface), information is only available on Q&A websites while lacking in the official documentation. Finally, we discuss implications for programming language documenters, particularly how to leverage our approach to prioritize topics that should be added to the documentation.
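
As a toy illustration of the alignment comparison (all topic names and numbers are invented; the article compares full topic probability distributions derived from the trained model, not four-topic vectors):

```python
# Toy sketch of the alignment measurement (not the authors' pipeline): given
# aggregate topic shares for Q&A posts and for documentation pages, flag
# topics whose demand on Q&A sites outstrips their documentation coverage.
TOPICS = ["ownership", "ffi", "networking", "game-dev"]

qa_topic_share  = [0.30, 0.25, 0.30, 0.15]   # inferred from Q&A posts
doc_topic_share = [0.60, 0.10, 0.15, 0.15]   # inferred from official docs

for topic, qa, doc in zip(TOPICS, qa_topic_share, doc_topic_share):
    gap = qa - doc
    flag = "under-documented" if gap > 0.1 else "aligned"
    print(f"{topic:11s} need={qa:.2f} doc={doc:.2f} -> {flag}")
```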

https://doi.org/10.1145/3546945

[Article Title: On Proving the Correctness of Refactoring Class Diagrams of MDE Metamodels/ Najd Altoyan and Don Batory, p. 44:1-44:42]

Abstract: Model Driven Engineering (MDE) is a general-purpose engineering methodology to elevate system design, maintenance, and analysis to corresponding activities on models. Models (graphical and/or textual) of a target application are automatically transformed into source code, performance models, Promela files (for model checking), and so on for system analysis and construction.

Models are instances of metamodels. One form an MDE metamodel can take is a [class diagram, constraints] pair: the class diagram defines all object diagrams that could be metamodel instances; object constraint language (OCL) constraints eliminate semantically undesirable instances.

A metamodel refactoring is an invertible semantics-preserving co-transformation, i.e., it transforms both a metamodel and its models without losing data. This article addresses a subproblem of metamodel refactoring: how to prove the correctness of refactorings of class diagrams without OCL constraints using the Coq Proof Assistant.

https://doi.org/10.1145/3549541

[Article Title: Suboptimal Comments in Java Projects: From Independent Comment Changes to Commenting Practices/ Chao Wang, Hao He, Uma Pal, Darko Marinov and Minghui Zhou, p. 45:1-45:33]

Abstract: High-quality source code comments are valuable for software development and maintenance; however, code often contains low-quality comments or lacks them altogether. We call such source code comments suboptimal comments. Suboptimal comments create challenges in code comprehension and maintenance. Despite substantial research on low-quality source code comments, empirical knowledge about the commenting practices that produce suboptimal comments and the reasons that lead to them is lacking. We help bridge this knowledge gap by investigating (1) independent comment changes (ICCs)—comment changes committed independently of code changes—which likely address suboptimal comments, (2) commenting guidelines, and (3) comment-checking tools and comment-generating tools, which are often employed to help commenting practice—especially to prevent suboptimal comments.

We collect 24M+ comment changes from 4,392 open-source GitHub Java repositories and find that ICCs widely exist. The ICC ratio—the proportion of ICCs among all comment changes—is ~15.5%, with 98.7% of the repositories having ICCs. Our thematic analysis of 3,533 randomly sampled ICCs provides a three-dimensional taxonomy for what is changed (four comment categories and 13 subcategories), how it changed (six commenting activity categories), and what factors are associated with the change (three factors). We investigate 600 repositories to understand the prevalence, content, impact, and violations of commenting guidelines. We find that only 15.5% of the 600 sampled repositories have any commenting guidelines. We provide the first taxonomy for elements in commenting guidelines: where and what to comment are particularly important. The repositories without such guidelines have a statistically significantly higher ICC ratio, indicating the negative impact of the lack of commenting guidelines. However, commenting guidelines are not strictly followed: 85.5% of checked repositories have violations. We also systematically study how developers use two kinds of tools, comment-checking tools and comment-generating tools, in the 4,392 repositories. We find that the use of the Javadoc tool is negatively correlated with the ICC ratio, while the use of Checkstyle has no statistically significant correlation; the use of comment-generating tools leads to a higher ICC ratio.

To conclude, we reveal issues and challenges in current commenting practice, which helps explain how suboptimal comments are introduced. We propose potential research directions on comment location prediction, comment generation, and comment quality assessment; suggest how developers can formulate commenting guidelines and enforce rules with tools; and recommend how to enhance current comment-checking and comment-generating tools.
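
The study's central construct, the ICC, admits a simple illustration: a commit whose diff touches only comment lines. Below is a toy line-level classifier (Java comment syntax only, with an invented diff; the study's actual detection works on full repository histories):

```python
# Toy sketch of spotting an independent comment change (ICC): a diff hunk
# that edits comment lines but no code lines.
def is_comment(line: str) -> bool:
    s = line.lstrip("+- ")
    return s.startswith("//") or s.startswith("*") or s.startswith("/*")

def is_icc(changed_lines) -> bool:
    """True if every added/removed line in the diff is a comment line."""
    edits = [l for l in changed_lines if l.startswith(("+", "-"))]
    return bool(edits) and all(is_comment(l) for l in edits)

diff = ["- // returns the maximum value",
        "+ // returns the maximum value, or null if empty"]
print(is_icc(diff))  # -> True: the comment changed with no code change
# The ICC ratio is then the share of such changes among all comment changes.
```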

https://doi.org/10.1145/3546949

[Article Title: HINNPerf: Hierarchical Interaction Neural Network for Performance Prediction of Configurable Systems/ Jiezhu Cheng, Cuiyun Gao and Zibin Zheng, p. 46:1-46:30]

Abstract: Modern software systems are usually highly configurable, providing users with customized functionality through various configuration options. Understanding how system performance varies with different option combinations is important for determining optimal configurations that meet specific requirements. Due to the complex interactions among multiple options and the high cost of performance measurement under a huge configuration space, it is challenging to study how different configurations influence system performance. To address these challenges, we propose HINNPerf, a novel hierarchical interaction neural network for performance prediction of configurable systems. HINNPerf employs an embedding method and hierarchic network blocks to model the complicated interplay between configuration options, which improves the prediction accuracy of the method. In addition, we devise a hierarchical regularization strategy to enhance model robustness. Empirical results on 10 real-world configurable systems show that our method statistically significantly outperforms state-of-the-art approaches, achieving an average improvement of 22.67% in prediction accuracy. Moreover, combined with the Integrated Gradients method, the designed hierarchical architecture provides insights about the interaction complexity and the significance of configuration options, which might help users and developers better understand how a configurable system works and efficiently identify significant options affecting its performance.

https://doi.org/10.1145/3528100

[Article Title: A Machine Learning Approach for Automated Filling of Categorical Fields in Data Entry Forms/ Hichem Belgacem, Xiaochen Li, Domenico Bianculli and Lionel Briand, p. 47:1-47:40]

Abstract: Users frequently interact with software systems through data entry forms. However, form filling is time-consuming and error-prone. Although several techniques have been proposed to auto-complete or pre-fill fields in the forms, they provide limited support to help users fill categorical fields, i.e., fields that require users to choose the right value among a large set of options.

In this article, we propose LAFF, a learning-based automated approach for filling categorical fields in data entry forms. LAFF first builds Bayesian Network models by learning field dependencies from a set of historical input instances, representing the values of the fields that have been filled in the past. To improve its learning ability, LAFF uses local modeling to effectively mine the local dependencies of fields in a cluster of input instances. During the form filling phase, LAFF uses such models to predict possible values of a target field, based on the values in the already-filled fields of the form and their dependencies; the predicted values (endorsed based on field dependencies and prediction confidence) are then provided to the end-user as a list of suggestions.

We evaluated LAFF by assessing its effectiveness and efficiency in form filling on two datasets, one of them proprietary from the banking domain. Experimental results show that LAFF is able to provide accurate suggestions with a Mean Reciprocal Rank value above 0.73. Furthermore, LAFF is efficient, requiring at most 317 ms per suggestion.
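
The prediction step can be pictured with a tiny frequency-table stand-in for LAFF's Bayesian networks (field names, history, and thresholds below are invented for illustration):

```python
# Minimal sketch of the idea behind LAFF (a conditional frequency table
# standing in for its Bayesian networks): learn how already-filled fields
# predict a target categorical field, then rank candidate values.
from collections import Counter, defaultdict

history = [  # past form submissions: (country, account_type)
    ("FR", "checking"), ("FR", "checking"), ("FR", "savings"),
    ("JP", "savings"), ("JP", "savings"), ("JP", "business"),
]

cond = defaultdict(Counter)
for country, account in history:
    cond[country][account] += 1

def suggest(country, top_k=2, min_confidence=0.3):
    """Rank values of the target field given a filled field; endorse only
    sufficiently confident predictions, as LAFF's endorser module does."""
    counts = cond[country]
    total = sum(counts.values())
    ranked = [(v, c / total) for v, c in counts.most_common(top_k)]
    return [(v, p) for v, p in ranked if p >= min_confidence]

print(suggest("JP"))  # -> [('savings', 0.67), ('business', 0.33)] approximately
```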

https://doi.org/10.1145/3533021

[Article Title: Towards Learning Generalizable Code Embeddings Using Task-agnostic Graph Convolutional Networks/ Zishuo Ding, Heng Li, Weiyi Shang and Tse-Hsun (Peter) Chen, p. 48:1-48:43]

Abstract: Code embeddings have seen increasing applications in software engineering (SE) research and practice recently. Despite the advances in embedding techniques applied in SE research, one of the main challenges is their generalizability. A recent study finds that code embeddings may not be readily leveraged for downstream tasks that the embeddings are not particularly trained for. Therefore, in this article, we propose GraphCodeVec, which represents the source code as graphs and leverages Graph Convolutional Networks to learn more generalizable code embeddings in a task-agnostic manner. The edges in the graph representation are automatically constructed from the paths in the abstract syntax trees, and the nodes from the tokens in the source code. To evaluate the effectiveness of GraphCodeVec, we consider three downstream benchmark tasks (i.e., code comment generation, code authorship identification, and code clone detection) that are used in a prior benchmarking of code embeddings and add three new downstream tasks (i.e., source code classification, logging statement prediction, and software defect prediction), resulting in a total of six downstream tasks considered in our evaluation. For each downstream task, we apply the embeddings learned by GraphCodeVec and the embeddings learned from four baseline approaches and compare their respective performance. We find that GraphCodeVec outperforms all the baselines in five out of the six downstream tasks, and its performance is relatively stable across different tasks and datasets. In addition, we perform ablation experiments to understand the impacts of the training context (i.e., the graph context extracted from the abstract syntax trees) and the training model (i.e., the Graph Convolutional Networks) on the effectiveness of the generated embeddings. The results show that both the graph context and the Graph Convolutional Networks can benefit GraphCodeVec in producing high-quality embeddings for the downstream tasks, while the improvement by the Graph Convolutional Networks is more robust across different downstream tasks and datasets. Our findings suggest that future research and practice may consider using graph-based deep learning methods to capture the structural information of the source code for SE tasks.

https://doi.org/10.1145/3542944

[Article Title: Efficient and Effective Feature Space Exploration for Testing Deep Learning Systems/ Tahereh Zohdinasab, Vincenzo Riccio, Alessio Gambi and Paolo Tonella, p. 49:1-49:38]

Abstract: Assessing the quality of Deep Learning (DL) systems is crucial, as they are increasingly adopted in safety-critical domains. Researchers have proposed several input generation techniques for DL systems. While such techniques can expose failures, they do not explain which features of the test inputs influenced the system’s (mis-)behaviour. DeepHyperion was the first test generator to overcome this limitation by exploring the DL systems’ feature space at large. In this article, we propose DeepHyperion-CS, a test generator for DL systems that enhances DeepHyperion by promoting the inputs that contributed more to feature space exploration during the previous search iterations. We performed an empirical study involving two different test subjects (i.e., a digit classifier and a lane-keeping system for self-driving cars). Our results show that the contribution-based guidance implemented within DeepHyperion-CS outperforms state-of-the-art tools and significantly improves the efficiency and effectiveness of DeepHyperion. DeepHyperion-CS exposed significantly more misbehaviours for five out of six feature combinations and was up to 65% more efficient than DeepHyperion in finding misbehaviour-inducing inputs and exploring the feature space. DeepHyperion-CS was also useful for expanding the datasets used to train the DL systems, populating up to 200% more feature map cells than the original training set.
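
The "promote inputs that contributed more" idea amounts to biased parent selection. A toy sketch follows (contribution counts are invented; the real tool tracks contributions to feature-map cells across search iterations):

```python
# Toy sketch of contribution-based selection (the DeepHyperion-CS extension):
# inputs that recently filled new feature-map cells get a higher chance of
# being selected for mutation again.
import random

population = {"a": 5, "b": 1, "c": 0}   # input -> cells it contributed so far

def pick_parent(pop):
    names, weights = zip(*((n, 1 + contrib) for n, contrib in pop.items()))
    return random.choices(names, weights=weights)[0]

print(pick_parent(population))  # "a" is the most likely parent to be picked
```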

https://doi.org/10.1145/3544792

[Article Title: Demystifying Hidden Sensitive Operations in Android Apps/ Xiaoyu Sun, Xiao Chen, Li Li, Haipeng Cai, John Grundy, Jordan Samhi, Tegawendé Bissyandé and Jacques Klein, p. 50:1-50:30]

Abstract: Security of Android devices is now paramount, given their wide adoption among consumers. As researchers develop tools for statically or dynamically detecting suspicious apps, malware writers regularly update their attack mechanisms to hide the implementation of malicious behavior. This poses two problems to current research techniques: static analysis approaches, given their over-approximations, can report an overwhelming number of false alarms, while dynamic approaches will miss behaviors that are hidden through evasion techniques. We propose in this work a static approach specifically targeted at highlighting hidden sensitive operations (HSOs), mainly sensitive data flows. The prototype version of HiSenDroid has been evaluated on a large-scale dataset of thousands of malware and goodware samples, on which it successfully revealed anti-analysis code snippets aimed at evading detection by dynamic analysis. We further experimentally show, with FlowDroid, that some of the hidden sensitive behaviors would eventually lead to private data leaks. Such leaks would be hard to spot either manually among the large number of false positives reported by state-of-the-art static analyzers, or with dynamic tools. Overall, by shining a light on hidden sensitive operations, HiSenDroid helps security analysts validate potentially sensitive data operations that would previously have gone unnoticed.

https://doi.org/10.1145/3574158

[Article Title: Testing, Validation, and Verification of Robotic and Autonomous Systems: A Systematic Review/ Hugo Araujo, Mohammad Reza Mousavi and Mahsa Varshosaz, p. 51:1-51:61]

Abstract: We perform a systematic literature review on testing, validation, and verification of robotic and autonomous systems (RAS). The scope of this review covers peer-reviewed research papers proposing, improving, or evaluating testing techniques, processes, or tools that address the system-level qualities of RAS.

Our survey was performed following a rigorous methodology structured in three phases. First, we made use of a set of 26 seed papers (selected by domain experts) and the SERP-TEST taxonomy to design our search query and (domain-specific) taxonomy. Second, we conducted a search in three academic search engines and applied our inclusion and exclusion criteria to the results; we relied on related work and on domain specialists (50 academics and 15 industry experts) to validate and refine the search query. As a result, we encountered 10,735 studies, out of which 195 were included, reviewed, and coded.

Our objective is to answer four research questions, pertaining to (1) the type of models, (2) measures for system performance and testing adequacy, (3) tools and their availability, and (4) evidence of applicability, particularly in industrial contexts. We analyse the results of our coding to identify strengths and gaps in the domain and present recommendations to researchers and practitioners.

Our findings show that variants of temporal logics are most widely used for modelling requirements and properties, while variants of state-machines and transition systems are used widely for modelling system behaviour. Other common models concern epistemic logics for specifying requirements and belief-desire-intention models for specifying system behaviour. Apart from time and epistemics, other aspects captured in models concern probabilities (e.g., for modelling uncertainty) and continuous trajectories (e.g., for modelling vehicle dynamics and kinematics).

Many papers lack any rigorous measure of efficiency, effectiveness, or adequacy for their proposed techniques, processes, or tools. Among those that provide such a measure, the majority use domain-agnostic generic measures such as the number of failures, the size of the state space, or verification time. There is a trend toward addressing this research gap by developing domain-specific notions of performance and adequacy; defining widely accepted, rigorous measures of performance and adequacy for each domain remains an identified research gap.

In terms of tools, the most widely used are well-established model checkers such as Prism and Uppaal, as well as simulation tools such as Gazebo; Matlab/Simulink is another widely used toolset in this domain.

Overall, there is very limited evidence of industrial applicability in the papers published in this domain. There is also a gap in consolidated benchmarks for various types of autonomous systems.

https://doi.org/10.1145/3542945

[Article Title: Dissecting American Fuzzy Lop: A FuzzBench Evaluation/ Andrea Fioraldi, Alessandro Mantovani, Dominik Maier and Davide Balzarotti, p. 52:1-52:26]

Abstract: AFL is one of the most used and extended fuzzers, adopted by industry and academic researchers alike. Although the community agrees on AFL’s effectiveness at discovering new vulnerabilities and its outstanding usability, many of its internal design choices remain untested to date. Security practitioners often clone the project “as-is” and use it as a starting point to develop new techniques, usually taking everything under the hood for granted. Instead, we believe that a careful analysis of the different parameters could help modern fuzzers improve their performance and explain how each choice can affect the outcome of security testing, either negatively or positively.

The goal of this work is to provide a comprehensive understanding of the internal mechanisms of AFL by performing experiments and by comparing different metrics used to evaluate fuzzers. This can help to show the effectiveness of some techniques and to clarify which aspects are instead outdated. Our study consists of nine unique experiments that we carried out on the popular FuzzBench platform. Each test focuses on a different aspect of AFL, ranging from its mutation approach to the feedback encoding scheme and its scheduling methodologies.

Our findings show that each design choice affects different factors of AFL. Some of these are positively correlated with the number of detected bugs or the coverage of the target application, whereas other features are related to usability and reliability. Most importantly, we believe that the outcome of our experiments indicates which parts of AFL should be preserved in the design of modern fuzzers.

https://doi.org/10.1145/3580596

[Article Title: Fuzzing Configurations of Program Options/ Zenong Zhang, George Klees, Eric Wang, Michael Hicks and Shiyi Wei, p. 53:1-53:21]

Abstract: While many real-world programs are shipped with configurations to enable/disable functionalities, fuzzers have mostly been applied to test single configurations of these programs. In this work, we first conduct an empirical study to understand how program configurations affect fuzzing performance. We find that limiting a campaign to a single configuration can result in failing to cover a significant amount of code. We also observe that different program configurations contribute differing amounts of code coverage, challenging the idea that each one can be efficiently fuzzed individually. Motivated by these two observations, we propose ConfigFuzz, which can fuzz configurations along with normal inputs. ConfigFuzz transforms the target program to encode its program options within part of the fuzzable input, so existing fuzzers’ mutation operators can be reused to fuzz program configurations. We instantiate ConfigFuzz on six configurable, common fuzzing targets and integrate their executions in FuzzBench. In our evaluation, ConfigFuzz outperforms two baseline fuzzers in four targets, while the results are mixed in the other targets due to program size and configuration space. We also analyze the options fuzzed by ConfigFuzz and how they affect performance.
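
The input transformation is easy to picture: reserve a leading slice of the fuzzable input for option bits, so ordinary byte-level mutations explore configurations "for free." The sketch below is a simplified decoder (the option names and one-byte encoding are invented; ConfigFuzz itself generates configuration stubs from grammars):

```python
# Sketch of the input transformation the abstract describes (simplified):
# the first byte of the fuzzable input selects program options bit-by-bit,
# and the remainder is the normal input.
OPTIONS = ["--verbose", "--strict", "--utf8", "--fast-parse"]

def decode(raw: bytes):
    """Split a fuzzer-generated byte string into (argv options, payload)."""
    if not raw:
        return [], b""
    flag_byte, payload = raw[0], raw[1:]
    argv = [opt for i, opt in enumerate(OPTIONS) if flag_byte & (1 << i)]
    return argv, payload

argv, payload = decode(b"\x05hello world")
print(argv, payload)  # -> ['--verbose', '--utf8'] b'hello world'
```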

https://doi.org/10.1145/3580597

[Article Title: Dissecting American Fuzzy Lop – A FuzzBench Evaluation - RCR Report/ Andrea Fioraldi, Alessandro Mantovani, Dominik Maier and Davide Balzarotti, p. 54:1-54:4]

Abstract: This report describes the artifacts of the “Dissecting American Fuzzy Lop – A FuzzBench Evaluation” paper. The artifacts are available online at https://github.com/eurecom-s3/dissecting_afl and archived at https://doi.org/10.6084/m9.figshare.21401280. They consist of the produced code, the setup to run the experiments in FuzzBench, and the generated reports. We claim the Functional badge, as the patches to AFL are easy to enable and the experiments are easy to run thanks to the FuzzBench service, but the evaluations are self-contained and the modifications to AFL are provided as is. For the purpose of reproducing the experiments, no particular skills are needed, as the process is straightforward and described in https://google.github.io/fuzzbench/getting-started/adding-a-new-fuzzer/#requesting-an-experiment.

https://doi.org/10.1145/3580600

[Article Title: Fuzzing Configurations of Program Options - RCR Report/ Zenong Zhang, George Klees, Eric Wang, Michael Hicks and Shiyi Wei, p. 55:1-55:3]

Abstract: This artifact contains the source code and instructions to reproduce the evaluation results of the article “Fuzzing Configurations of Program Options.” The source code includes the configuration grammars for six target programs, the scripts to generate configuration stubs, and the scripts to post-process fuzzing results. The README of the artifact includes the steps to prepare the experimental environment on a clean Ubuntu machine and step-by-step commands to reproduce the evaluation experiments. A VirtualBox image with ConfigFuzz properly set up is also included.

https://doi.org/10.1145/3580601
