ACM Transactions on Software Engineering and Methodology

Material type: Text

Series: ACM Transactions on Software Engineering and Methodology, Volume 31, Issue 3, 2022

Publication details: New York : Association for Computing Machinery, c2022

Description: [various pagings] : illustrations ; 26 cm

ISSN: 1049-331X

Subject(s): SOFTWARE ENGINEERING | INFORMATION SYSTEMS | CLOUD COMPUTING | SECURITY AND PRIVACY

Contents:

L2S: A Framework for Synthesizing the Most Probable Program under a Specification -- Context- and Fairness-Aware In-Process Crowdworker Recommendation -- ReCDroid+: Automated End-to-End Crash Reproduction from Bug Reports for Android Apps -- Verification of Distributed Systems via Sequential Emulation -- Opinion Mining for Software Development: A Systematic Literature Review -- Stateful Serverless Computing with Crucial -- Applying Bayesian Analysis Guidelines to Empirical Software Engineering Data: The Case of Programming Languages and Code Quality -- On the Faults Found in REST APIs by Automated Test Generation -- Using Personality Detection Tools for Software Engineering Research: How Far Can We Go? -- All in One: Design, Verification, and Implementation of SNOW-optimal Read Atomic Transactions -- Do Developers Really Know How to Use Git Commands? A Large-scale Study Using Stack Overflow -- Industry–Academia Research Collaboration and Knowledge Co-creation: Patterns and Anti-patterns -- Continuous and Proactive Software Architecture Evaluation: An IoT Case -- NPC: Neuron Path Coverage via Characterizing Decision Logic of Deep Neural Networks -- An Empirical Study of the Effectiveness of an Ensemble of Stand-alone Sentiment Detection Tools for Software Engineering Datasets -- Detecting and Augmenting Missing Key Aspects in Vulnerability Descriptions -- Towards Robustness of Deep Program Processing Models—Detection, Estimation, and Enhancement -- Context-Aware Code Change Embedding for Better Patch Correctness Assessment -- XCode: Towards Cross-Language Code Representation with Large-Scale Pre-Training -- An Empirical Study of the Impact of Hyperparameter Tuning and Model Optimization on the Performance Properties of Deep Neural Networks -- Time-travel Investigation: Toward Building a Scalable Attack Detection Framework on Ethereum -- Examining Penetration Tester Behavior in the Collegiate Penetration Testing Competition -- Predictive Models in Software Engineering: Challenges and Opportunities.

Summary: [Article Title: L2S: A Framework for Synthesizing the Most Probable Program under a Specification/ Yingfei Xiong and Bo Wang, p. 34:1-34:45] Abstract: In many scenarios, we need to find the most likely program that meets a specification under a local context, where the local context can be an incomplete program, a partial specification, natural language description, and so on. We call such a problem program estimation. In this article, we propose a framework, LingLong Synthesis Framework (L2S), to address this problem. Compared with existing work, our work is novel in the following aspects. (1) We propose a theory of expansion rules to describe how to decompose a program into choices. (2) We propose an approach based on abstract interpretation to efficiently prune off the program sub-space that does not satisfy the specification. (3) We prove that the probability of a program is the product of the probabilities of choosing expansion rules, regardless of the choosing order. (4) We reduce the program estimation problem to a pathfinding problem, enabling existing pathfinding algorithms to solve this problem. L2S has been applied to program generation and program repair. In this article, we report our instantiation of this framework for synthesizing conditional expressions (L2S-Cond) and repairing conditional statements (L2S-Hanabi). The experiments on L2S-Cond show that each option enabled by L2S, including the expansion rules, the pruning technique, and the use of different pathfinding algorithms, plays a major role in the performance of the approach. The default configuration of L2S-Cond correctly predicts nearly 60% of the conditional expressions in the top 5 candidates. Moreover, we evaluate L2S-Hanabi on 272 bugs from two real-world Java defects benchmarks, namely Defects4J and Bugs.jar. L2S-Hanabi correctly fixes 32 bugs with a high precision of 84%. 
In terms of repairing conditional statement bugs, L2S-Hanabi significantly outperforms all existing approaches in both precision and recall. https://doi.org/10.1145/3487570

Summary: [Article Title: Context- and Fairness-Aware In-Process Crowdworker Recommendation/ Junjie Wang, Ye Yang, Song Wang, Jun Hu and Qing Wang, p. 35:1-35:31] Abstract: Identifying and optimizing open participation is essential to the success of open software development. Existing studies highlighted the importance of worker recommendation for crowdtesting tasks in order to improve bug detection efficiency, i.e., detect more bugs with fewer workers. However, there are a couple of limitations in existing work. First, these studies mainly focus on one-time recommendations based on expertise matching at the beginning of a new task. Second, the recommendation results suffer from severe popularity bias, i.e., highly experienced workers are recommended in almost all the tasks, while less experienced workers rarely get recommended. This article argues the need for context- and fairness-aware in-process crowdworker recommendation in order to address these limitations. We motivate this study through a pilot study, revealing the prevalence of long-sized non-yielding windows, i.e., no new bugs are revealed in consecutive test reports during the process of a crowdtesting task. This indicates the potential opportunity for accelerating crowdtesting by recommending appropriate workers in a dynamic manner, so that the non-yielding windows could be shortened. Besides, motivated by the popularity bias in existing crowdworker recommendation approaches, this study also aims at alleviating the unfairness in recommendations. Driven by these observations, this article proposes a context- and fairness-aware in-process crowdworker recommendation approach, iRec2.0, to detect more bugs earlier, shorten the non-yielding windows, and alleviate the unfairness in recommendations.
It consists of three main components: (1) the modeling of dynamic testing context, (2) the learning-based ranking component, and (3) the multi-objective optimization-based re-ranking component. The evaluation is conducted on 636 crowdtesting tasks from one of the largest crowdtesting platforms, and results show the potential of iRec2.0 in improving the cost-effectiveness of crowdtesting by saving the cost, shortening the testing process, and alleviating the unfairness among workers. In detail, iRec2.0 could shorten the non-yielding window by a median of 50%–66% in different application scenarios, and consequently has the potential to save testing cost by a median of 8%–12%. Meanwhile, the recommendation frequency of the crowdworker drops from 34%–60% to 5%–26% under different scenarios, indicating its potential in alleviating the unfairness among crowdworkers. https://doi.org/10.1145/3487571

Summary: [Article Title: ReCDroid+: Automated End-to-End Crash Reproduction from Bug Reports for Android Apps/ Yu Zhao, Ting Su, Yang Liu, Wei Zheng, Xiaoxue Wu, Ramakanth Kavuluru, William G. J. Halfond and Tingting Yu, p. 36:1-36:33] Abstract: The large demand of mobile devices creates significant concerns about the quality of mobile applications (apps). Developers heavily rely on bug reports in issue tracking systems to reproduce failures (e.g., crashes). However, the process of crash reproduction is often manually done by developers, making the resolution of bugs inefficient, especially given that bug reports are often written in natural language. To improve the productivity of developers in resolving bug reports, in this paper, we introduce a novel approach, called ReCDroid+, that can automatically reproduce crashes from bug reports for Android apps. ReCDroid+ uses a combination of natural language processing (NLP), deep learning, and dynamic GUI exploration to synthesize event sequences with the goal of reproducing the reported crash.
We have evaluated ReCDroid+ on 66 original bug reports from 37 Android apps. The results show that ReCDroid+ successfully reproduced 42 crashes (63.6% success rate) directly from the textual description of the manually reproduced bug reports. A user study involving 12 participants demonstrates that ReCDroid+ can improve the productivity of developers when resolving crash bug reports. https://doi.org/10.1145/3488244

Summary: [Article Title: Verification of Distributed Systems via Sequential Emulation/ Luca Di Stefano, Rocco De Nicola and Omar Inverso, p. 37:1-37:41] Abstract: Sequential emulation is a semantics-based technique to automatically reduce property checking of distributed systems to the analysis of sequential programs. An automated procedure takes as input a formal specification of a distributed system, a property of interest, and the structural operational semantics of the specification language and generates a sequential program whose execution traces emulate the possible evolutions of the considered system. The problem as to whether the property of interest holds for the system can then be expressed either as a reachability or as a termination query on the program. This makes it possible to immediately adapt mature verification techniques developed for general-purpose languages to domain-specific languages, and to effortlessly integrate new techniques as soon as they become available. We test our approach on a selection of concurrent systems originating from different contexts, from population protocols to models of flocking behaviour. By combining a comprehensive range of program verification techniques, from traditional symbolic execution to modern inductive-based methods such as property-directed reachability, we are able to draw consistent and correct verification verdicts for the considered systems.
https://doi.org/10.1145/3490387

Summary: [Article Title: Opinion Mining for Software Development: A Systematic Literature Review/ Bin Lin, Nathan Cassee, Alexander Serebrenik, Gabriele Bavota, Nicole Novielli and Michele Lanza, p. 38:1-38:41] Abstract: Opinion mining, sometimes referred to as sentiment analysis, has gained increasing attention in software engineering (SE) studies. SE researchers have applied opinion mining techniques in various contexts, such as identifying developers’ emotions expressed in code comments and extracting users’ critics toward mobile apps. Given the large amount of relevant studies available, it can take considerable time for researchers and developers to figure out which approaches they can adopt in their own studies and what perils these approaches entail. We conducted a systematic literature review involving 185 papers. More specifically, we present (1) well-defined categories of opinion mining-related software development activities, (2) available opinion mining approaches, whether they are evaluated when adopted in other studies, and how their performance is compared, (3) available datasets for performance evaluation and tool customization, and (4) concerns or limitations SE researchers might need to take into account when applying/customizing these opinion mining techniques. The results of our study serve as references to choose suitable opinion mining tools for software development activities and provide critical insights for the further development of opinion mining techniques in the SE domain. https://doi.org/10.1145/3490388

Summary: [Article Title: Stateful Serverless Computing with Crucial/ Daniel Barcelona-Pons, Pierre Sutra, Marc Sánchez-Artigas, Gerard París and Pedro García-López, p. 39:1-39:38] Abstract: Serverless computing greatly simplifies the use of cloud resources.
In particular, Function-as-a-Service (FaaS) platforms enable programmers to develop applications as individual functions that can run and scale independently. Unfortunately, applications that require fine-grained support for mutable state and synchronization, such as machine learning (ML) and scientific computing, are notoriously hard to build with this new paradigm. In this work, we aim at bridging this gap. We present Crucial, a system to program highly-parallel stateful serverless applications. Crucial retains the simplicity of serverless computing. It is built upon the key insight that FaaS resembles concurrent programming at the scale of a datacenter. Accordingly, a distributed shared memory layer is the natural answer to the needs for fine-grained state management and synchronization. Crucial makes it possible to effortlessly port a multi-threaded code base to serverless, where it can benefit from the scalability and pay-per-use model of FaaS platforms. We validate Crucial with the help of micro-benchmarks and by considering various stateful applications. Beyond classical parallel tasks (e.g., a Monte Carlo simulation), these applications include representative ML algorithms such as k-means and logistic regression. Our evaluation shows that Crucial obtains superior or comparable performance to Apache Spark at similar cost (18%–40% faster). We also use Crucial to port (part of) a state-of-the-art multi-threaded ML library to serverless. The ported application is up to 30% faster than with a dedicated high-end server. Finally, we attest that Crucial can rival the performance of a single-machine, multi-threaded implementation of a complex coordination problem. Overall, Crucial delivers all these benefits with less than 6% of changes in the code bases of the evaluated applications.
https://doi.org/10.1145/3490386

Summary: [Article Title: Applying Bayesian Analysis Guidelines to Empirical Software Engineering Data: The Case of Programming Languages and Code Quality/ Carlo A. Furia, Richard Torkar and Robert Feldt, p. 40:1-40:38] Abstract: Statistical analysis is the tool of choice to turn data into information and then information into empirical knowledge. However, the process that goes from data to knowledge is long, uncertain, and riddled with pitfalls. To be valid, it should be supported by detailed, rigorous guidelines that help ferret out issues with the data or model and lead to qualified results that strike a reasonable balance between generality and practical relevance. Such guidelines are being developed by statisticians to support the latest techniques for Bayesian data analysis. In this article, we frame these guidelines in a way that is suited to empirical research in software engineering. To demonstrate the guidelines in practice, we apply them to reanalyze a GitHub dataset about code quality in different programming languages. The dataset’s original analysis [Ray et al. 55] and a critical reanalysis [Berger et al. 6] have attracted considerable attention, in no small part because they target a topic (the impact of different programming languages) on which strong opinions abound. The goals of our reanalysis are largely orthogonal to this previous work, as we are concerned with demonstrating, on data in an interesting domain, how to build a principled Bayesian data analysis and to showcase its benefits. In the process, we will also shed light on some critical aspects of the analyzed data and of the relationship between programming languages and code quality, such as the impact of project-specific characteristics other than the used programming language.
The high-level conclusions of our exercise will be that Bayesian statistical techniques can be applied to analyze software engineering data in a way that is principled, flexible, and leads to convincing results that inform the state-of-the-art while highlighting the boundaries of its validity. The guidelines can support building solid statistical analyses and connecting their results. Thus, they can help buttress continued progress in empirical software engineering research. https://doi.org/10.1145/3490953

Summary: [Article Title: On the Faults Found in REST APIs by Automated Test Generation/ Bogdan Marculescu, Man Zhang and Andrea Arcuri, p. 41:1-41:43] Abstract: RESTful web services are often used for building a wide variety of enterprise applications. The diversity and increased number of applications using RESTful APIs means that increasing amounts of resources are spent developing and testing these systems. Automation in test data generation provides a useful way of generating test data in a fast and efficient manner. However, automated test generation often results in large test suites that are hard to evaluate and investigate manually. This article proposes a taxonomy of the faults we have found using search-based software testing techniques applied on RESTful APIs. The taxonomy is a first step in understanding, analyzing, and ultimately fixing software faults in web services and enterprise applications. We propose to apply a density-based clustering algorithm to the test cases evolved during the search to allow a better separation between different groups of faults. This is needed to enable engineers to highlight and focus on the most serious faults. Tests were automatically generated for a set of eight case studies, seven open-source and one industrial. The test cases generated during the search are clustered based on the reported last executed line and based on the error messages returned, when such error messages were available.
The tests were manually evaluated to determine their root causes and to obtain additional information. The article presents a taxonomy of the faults found based on the manual analysis of 415 faults in the eight case studies and proposes a method to support the classification using clustering of the resulting test cases. https://doi.org/10.1145/3491038

Summary: [Article Title: Using Personality Detection Tools for Software Engineering Research: How Far Can We Go?/ Fabio Calefato and Filippo Lanubile, p. 42:1-42:48] Abstract: Assessing the personality of software engineers may help to match individual traits with the characteristics of development activities such as code review and testing, as well as support managers in team composition. However, self-assessment questionnaires are not a practical solution for collecting multiple observations on a large scale. Instead, automatic personality detection, while overcoming these limitations, is based on off-the-shelf solutions trained on non-technical corpora, which might not be readily applicable to technical domains like software engineering. In this article, we first assess the performance of general-purpose personality detection tools when applied to a technical corpus of developers’ e-mails retrieved from the public archives of the Apache Software Foundation. We observe a general low accuracy of predictions and an overall disagreement among the tools. Second, we replicate two previous research studies in software engineering by replacing the personality detection tool used to infer developers’ personalities from pull-request discussions and e-mails. We observe that the original results are not confirmed, i.e., changing the tool used in the original study leads to diverging conclusions. Our results suggest a need for personality detection tools specially targeted for the software engineering domain.
https://doi.org/10.1145/3491039

Summary: [Article Title: All in One: Design, Verification, and Implementation of SNOW-optimal Read Atomic Transactions/ Si Liu, p. 43:1-43:44] Abstract: Distributed read atomic transactions are important building blocks of modern cloud databases that magnificently bridge the gap between data availability and strong data consistency. The performance of their transactional reads is particularly critical to the overall system performance, as many real-world database workloads are dominated by reads. Following the SNOW design principle for optimal reads, we develop LORA, a novel SNOW-optimal algorithm for distributed read atomic transactions. LORA completes its reads in exactly one round trip, even in the presence of conflicting writes, without imposing additional overhead to the communication, and it outperforms the state-of-the-art read atomic algorithms. To guide LORA’s development, we present a rewriting-logic-based framework and toolkit for design, verification, implementation, and evaluation of distributed databases. Within the framework, we formalize LORA and mathematically prove its data consistency guarantees. We also apply automatic model checking and statistical verification to validate our proofs and to estimate LORA’s performance. We additionally generate from the formal model a correct-by-construction distributed implementation for testing and performance evaluation under realistic deployments. Our design-level and implementation-based experimental results are consistent, which together demonstrate LORA’s promising data consistency and performance achievement. https://doi.org/10.1145/3494517

Summary: [Article Title: Do Developers Really Know How to Use Git Commands? A Large-scale Study Using Stack Overflow/ Wenhua Yang, Chong Zhang, Minxue Pan, Chang Xu, Yu Zhou and Zhiqiu Huang, p. 44:1-44:29] Abstract: Git, a cross-platform and open source distributed version control tool, provides strong support for non-linear development and is capable of handling everything from small to large projects with speed and efficiency. It has become an indispensable tool for millions of software developers and is the de facto standard of version control in software development nowadays. However, despite its widespread use, developers still frequently face difficulties when using various Git commands to manage projects and collaborate. To better help developers use Git, it is necessary to understand the issues and difficulties that they may encounter when using Git. Unfortunately, this problem has not yet been comprehensively studied. To fill this knowledge gap, in this article, we conduct a large-scale study on Stack Overflow, a popular Q&A forum for developers. We extracted and analyzed 80,370 relevant questions from Stack Overflow, and reported the increasing popularity of the Git command questions. By analyzing the questions, we identified the Git commands that are frequently asked and those that are associated with difficult questions on Stack Overflow to help understand the difficulties developers may encounter when using Git commands. In addition, we conducted a survey to understand how developers learn Git commands in practice, showing that self-learning is the primary learning approach. These findings provide a range of actionable implications for researchers, educators, and developers. https://doi.org/10.1145/3494518

Summary: [Article Title: Industry–Academia Research Collaboration and Knowledge Co-creation: Patterns and Anti-patterns/ Dusica Marijan and Sagar Sen, p. 45:1-45:52] Abstract: Increasing the impact of software engineering research in the software industry and the society at large has long been a concern of high priority for the software engineering community.
The problem of two cultures, research conducted in a vacuum (disconnected from the real world), or misaligned time horizons are just some of the many complex challenges standing in the way of successful industry–academia collaborations. This article reports on the experience of research collaboration and knowledge co-creation between industry and academia in software engineering as a way to bridge the research–practice collaboration gap. Our experience spans 14 years of collaboration between researchers in software engineering and the European and Norwegian software and IT industry. Using the participant observation and interview methods, we have collected and afterwards analyzed an extensive record of qualitative data. Drawing upon the findings made and the experience gained, we provide a set of 14 patterns and 14 anti-patterns for industry–academia collaborations, aimed to support other researchers and practitioners in establishing and running research collaboration projects in software engineering. https://doi.org/10.1145/3494519

Summary: [Article Title: Continuous and Proactive Software Architecture Evaluation: An IoT Case/ Dalia Sobhy, Leandro Minku, Rami Bahsoon and Rick Kazman, p. 46:1-46:54] Abstract: Design-time evaluation is essential to build the initial software architecture to be deployed. However, experts’ assumptions made at design-time are unlikely to remain true indefinitely in systems that are characterized by scale, hyperconnectivity, dynamism, and uncertainty in operations (e.g. IoT). Therefore, experts’ design-time decisions can be challenged at run-time. A continuous architecture evaluation that systematically assesses and intertwines design-time and run-time decisions is thus necessary. This paper proposes the first proactive approach to continuous architecture evaluation of the system leveraging the support of simulation.
The approach evaluates software architectures by not only tracking their performance over time, but also forecasting their likely future performance through machine learning of simulated instances of the architecture. This enables architects to make cost-effective informed decisions on potential changes to the architecture. We perform an IoT case study to show how machine learning on simulated instances of architecture can fundamentally guide the continuous evaluation process and influence the outcome of architecture decisions. A series of experiments is conducted to demonstrate the applicability and effectiveness of the approach. We also provide the architect with recommendations on how to best benefit from the approach through choice of learners and input parameters, grounded on experimentation and evidence. https://doi.org/10.1145/3492762

Summary: [Article Title: NPC: Neuron Path Coverage via Characterizing Decision Logic of Deep Neural Networks/ Xiaofei Xie, Tianlin Li, Jian Wang, Lei Ma, Qing Guo, Felix Juefei-Xu and Yang Liu, p. 47:1-47:27] Abstract: Deep learning has recently been widely applied to many applications across different domains, e.g., image classification and audio recognition. However, the quality of Deep Neural Networks (DNNs) still raises concerns in the practical operational environment, which calls for systematic testing, especially in safety-critical scenarios. Inspired by software testing, a number of structural coverage criteria are designed and proposed to measure the test adequacy of DNNs. However, due to the blackbox nature of DNN, the existing structural coverage criteria are difficult to interpret, making it hard to understand the underlying principles of these criteria. The relationship between the structural coverage and the decision logic of DNNs is unknown.
Moreover, recent studies have further revealed the non-existence of correlation between the structural coverage and DNN defect detection, which further raises concerns about what a suitable DNN testing criterion should be. In this article, we propose the interpretable coverage criteria through constructing the decision structure of a DNN. Mirroring the control flow graph of the traditional program, we first extract a decision graph from a DNN based on its interpretation, where a path of the decision graph represents a decision logic of the DNN. Based on the control flow and data flow of the decision graph, we propose two variants of path coverage to measure the adequacy of the test cases in exercising the decision logic. The higher the path coverage, the more diverse the decision logic of the DNN that is expected to be explored. Our large-scale evaluation results demonstrate that the path in the decision graph is effective in characterizing the decision of the DNN, and the proposed coverage criteria are also sensitive to errors, including natural errors and adversarial examples, and strongly correlate with the output impartiality. https://doi.org/10.1145/3490489

Summary: [Article Title: An Empirical Study of the Effectiveness of an Ensemble of Stand-alone Sentiment Detection Tools for Software Engineering Datasets/ Gias Uddin, Yann-Gaël Guéhéneuc, Foutse Khomh and Chanchal K. Roy, p. 48:1-48:38] Abstract: Sentiment analysis in software engineering (SE) has shown promise to analyze and support diverse development activities. Recently, several tools have been proposed to detect sentiments in software artifacts. While the tools improve accuracy over off-the-shelf tools, recent research shows that their performance could still be unsatisfactory. A more accurate sentiment detector for SE can help reduce noise in analysis of software scenarios where sentiment analysis is required.
Recently, combinations, i.e., hybrids of stand-alone classifiers are found to offer better performance than the stand-alone classifiers for fault detection. However, we are aware of no such approach for sentiment detection for software artifacts. We report the results of an empirical study that we conducted to determine the feasibility of developing an ensemble engine by combining the polarity labels of stand-alone SE-specific sentiment detectors. Our study has two phases. In the first phase, we pick five SE-specific sentiment detection tools from two recently published papers by Lin et al. [29, 30], who first reported negative results with stand alone sentiment detectors and then proposed an improved SE-specific sentiment detector, POME [29]. We report the study results on 17,581 units (sentences/documents) coming from six currently available sentiment benchmarks for software engineering. We find that the existing tools can be complementary to each other in 85-95% of the cases, i.e., one is wrong but another is right. However, a majority voting-based ensemble of those tools fails to improve the accuracy of sentiment detection. We develop Sentisead, a supervised tool by combining the polarity labels and bag of words as features. Sentisead improves the performance (F1-score) of the individual tools by 4% (over Senti4SD [5]) – 100% (over POME [29]). The initial development of Sentisead occurred before we observed the use of deep learning models for SE-specific sentiment detection. In particular, recent papers show the superiority of advanced language-based pre-trained transformer models (PTM) over rule-based and shallow learning models. Consequently, in a second phase, we compare and improve Sentisead infrastructure using the PTMs. We find that a Sentisead infrastructure with RoBERTa as the ensemble of the five stand-alone rule-based and shallow learning SE-specific tools from Lin et al. 
[29, 30] offers the best F1-score of 0.805 across the six datasets, while a stand-alone RoBERTa shows an F1-score of 0.801. https://doi.org/10.1145/3491211

Summary: [Article Title: Detecting and Augmenting Missing Key Aspects in Vulnerability Descriptions/ Hao Guo, Sen Chen, Zhenchang Xing, Xiaohong Li, Yude Bai and Jiamou Sun, p. 49:1-49:27] Abstract: Security vulnerabilities have been continually disclosed and documented. For the effective understanding, management, and mitigation of the fast-growing number of vulnerabilities, an important practice in documenting vulnerabilities is to describe the key vulnerability aspects, such as vulnerability type, root cause, affected product, impact, attacker type, and attack vector. In this article, we first investigate 133,639 vulnerability reports in the Common Vulnerabilities and Exposures (CVE) database over the past 20 years. We find that 56%, 85%, 38%, and 28% of CVEs miss vulnerability type, root cause, attack vector, and attacker type, respectively. By comparing the differences of the latest updated CVE reports across different databases, we observe that 1,476 missing key aspects in 1,320 CVE descriptions were augmented manually in the National Vulnerability Database (NVD), which indicates that the vulnerability database maintainers try to complete the vulnerability descriptions in practice to mitigate such a problem. To help complete the missing information of key vulnerability aspects and reduce human efforts, we propose a neural-network-based approach called PMA to predict the missing key aspects of a vulnerability based on its known aspects. We systematically explore the design space of the neural network models and empirically identify the most effective model design in the scenario. Our ablation study reveals the prominent correlations among vulnerability aspects when predicting.
Trained with historical CVEs, our model achieves 88%, 71%, 61%, and 81% in F1 for predicting the missing vulnerability type, root cause, attacker type, and attack vector of 8,623 “future” CVEs across 3 years, respectively. Furthermore, we validate the predicting performance of key aspect augmentation of CVEs based on the manually augmented CVE data collected from NVD, which confirms the practicality of our approach. We finally highlight that PMA has the ability to reduce human efforts by recommending and augmenting missing key aspects for vulnerability databases, and to facilitate other research works such as severity level prediction of CVEs based on the vulnerability descriptions.

https://doi.org/10.1145/3498537

Summary: [Article Title: Towards Robustness of Deep Program Processing Models—Detection, Estimation, and Enhancement/ Huangzhao Zhang, Zhiyi Fu, Ge Li, Lei Ma, Zhehao Zhao, Hua’an Yang, Yizhe Sun, Yang Liu and Zhi Jin, p. 50:1-50:40]

Abstract: Deep learning (DL) has recently been widely applied to diverse source code processing tasks in the software engineering (SE) community, which achieves competitive performance (e.g., accuracy). However, the robustness, which requires the model to produce consistent decisions given minorly perturbed code inputs, still lacks systematic investigation as an important quality indicator. This article initiates an early step and proposes a framework CARROT for robustness detection, measurement, and enhancement of DL models for source code processing. We first propose an optimization-based attack technique CARROTA to generate valid adversarial source code examples effectively and efficiently. Based on this, we define the robustness metrics and propose robustness measurement toolkit CARROTM, which employs the worst-case performance approximation under the allowable perturbations. We further propose to improve the robustness of the DL models by adversarial training (CARROTT) with our proposed attack techniques. 
Our in-depth evaluations on three source code processing tasks (i.e., functionality classification, code clone detection, defect prediction) containing more than 3 million lines of code and the classic or SOTA DL models, including GRU, LSTM, ASTNN, LSCNN, TBCNN, CodeBERT, and CDLH, demonstrate the usefulness of our techniques for ❶ effective and efficient adversarial example detection, ❷ tight robustness estimation, and ❸ effective robustness enhancement.

https://doi.org/10.1145/3511887

Summary: [Article Title: Context-Aware Code Change Embedding for Better Patch Correctness Assessment/ Bo Lin, Shangwen Wang, Ming Wen and Xiaoguang Mao, p. 51:1-51:29]

Abstract: Despite the capability in successfully fixing more and more real-world bugs, existing Automated Program Repair (APR) techniques are still challenged by the long-standing overfitting problem (i.e., a generated patch that passes all tests is actually incorrect). Plenty of approaches have been proposed for automated patch correctness assessment (APCA). Nonetheless, dynamic ones (i.e., those that needed to execute tests) are time-consuming while static ones (i.e., those built on top of static code features) are less precise. Therefore, embedding techniques have been proposed recently, which assess patch correctness via embedding token sequences extracted from the changed code of a generated patch. However, existing techniques rarely considered the context information and program structures of a generated patch, which are crucial for patch correctness assessment as revealed by existing studies. In this study, we explore the idea of context-aware code change embedding considering program structures for patch correctness assessment. Specifically, given a patch, we not only focus on the changed code but also take the correlated unchanged part into consideration, through which the context information can be extracted and leveraged. 
We then utilize the AST path technique for representation where the structure information from AST node can be captured. Finally, based on several pre-defined heuristics, we build a deep learning based classifier to predict the correctness of the patch. We implemented this idea as Cache and performed extensive experiments to assess its effectiveness. Our results demonstrate that Cache can (1) perform better than previous representation learning based techniques (e.g., Cache relatively outperforms existing techniques by ≈ 6%, ≈ 3%, and ≈ 16%, respectively under three diverse experiment settings), and (2) achieve overall higher performance than existing APCA techniques while even being more precise than certain dynamic ones including PATCH-SIM (92.9% vs. 83.0%). Further results reveal that the context information and program structures leveraged by Cache contributed significantly to its outstanding performance.

https://doi.org/10.1145/3505247

Summary: [Article Title: XCode: Towards Cross-Language Code Representation with Large-Scale Pre-Training/ Zehao Lin, Guodun Li, Jingfeng Zhang, Yue Deng, Xiangji Zeng, Yin Zhang and Yao Wan, p. 52:1-52:44]

Abstract: Source code representation learning is the basis of applying artificial intelligence to many software engineering tasks such as code clone detection, algorithm classification, and code summarization. Recently, many works have tried to improve the performance of source code representation from various perspectives, e.g., introducing the structural information of programs into latent representation. However, when dealing with rapidly expanded unlabeled cross-language source code datasets from the Internet, there are still two issues. Firstly, deep learning models for many code-specific tasks still suffer from the lack of high-quality labels. Secondly, the structural differences among programming languages make it more difficult to process multiple languages in a single neural architecture. 
To address these issues, in this article, we propose a novel Cross-language Code representation with a large-scale pre-training (XCode) method. Concretely, we propose to use several abstract syntax trees and ELMo-enhanced variational autoencoders to obtain multiple pre-trained source code language models trained on about 1.5 million code snippets. To fully utilize the knowledge across programming languages, we further propose a Shared Encoder-Decoder (SED) architecture which uses the multi-teacher single-student method to transfer knowledge from the aforementioned pre-trained models to the distilled SED. The pre-trained models and SED will cooperate to better represent the source code. For evaluation, we examine our approach on three typical downstream cross-language tasks, i.e., source code translation, code clone detection, and code-to-code search, on a real-world dataset composed of programming exercises with multiple solutions. Experimental results demonstrate the effectiveness of our proposed approach on cross-language code representations. Meanwhile, our approach performs significantly better than several code representation baselines on different downstream tasks in terms of multiple automatic evaluation metrics.

https://doi.org/10.1145/3506696

Summary: [Article Title: An Empirical Study of the Impact of Hyperparameter Tuning and Model Optimization on the Performance Properties of Deep Neural Networks/ Lizhi Liao, Heng Li, Weiyi Shang and Lei Ma, p. 53:1-53:40]

Abstract: Deep neural network (DNN) models typically have many hyperparameters that can be configured to achieve optimal performance on a particular dataset. Practitioners usually tune the hyperparameters of their DNN models by training a number of trial models with different configurations of the hyperparameters, to find the optimal hyperparameter configuration that maximizes the training accuracy or minimizes the training loss. 
As such hyperparameter tuning usually focuses on the model accuracy or the loss function, it is not clear and remains under-explored how the process impacts other performance properties of DNN models, such as inference latency and model size. On the other hand, standard DNN models are often large in size and computing-intensive, prohibiting them from being directly deployed in resource-bounded environments such as mobile devices and Internet of Things (IoT) devices. To tackle this problem, various model optimization techniques (e.g., pruning or quantization) are proposed to make DNN models smaller and less computing-intensive so that they are better suited for resource-bounded environments. However, it is neither clear how the model optimization techniques impact other performance properties of DNN models such as inference latency and battery consumption, nor how the model optimization techniques impact the effect of hyperparameter tuning (i.e., the compounding effect). Therefore, in this paper, we perform a comprehensive study on four representative and widely-adopted DNN models, i.e., CNN image classification, Resnet-50, CNN text classification, and LSTM sentiment classification, to investigate how different DNN model hyperparameters affect the standard DNN models, as well as how the hyperparameter tuning combined with model optimization affect the optimized DNN models, in terms of various performance properties (e.g., inference latency or battery consumption). Our empirical results indicate that tuning specific hyperparameters has heterogeneous impact on the performance of DNN models across different models and different performance properties. In particular, although the top tuned DNN models usually have very similar accuracy, they may have significantly different performance in terms of other aspects (e.g., inference latency). We also observe that model optimization has a confounding effect on the impact of hyperparameters on DNN model performance. 
For example, two sets of hyperparameters may result in standard models with similar performance but their performance may become significantly different after they are optimized and deployed on the mobile device. Our findings highlight that practitioners can benefit from paying attention to a variety of performance properties and the confounding effect of model optimization when tuning and optimizing their DNN models.

https://doi.org/10.1145/3506695

Summary: [Article Title: Time-travel Investigation: Toward Building a Scalable Attack Detection Framework on Ethereum/ Siwei Wu, Lei Wu, Yajin Zhou, Runhuai Li, Zhi Wang, Xiapu Luo, Cong Wang and Kui Ren, p. 54:1-54:33]

Abstract: Ethereum has been attracting lots of attacks, hence there is a pressing need to perform timely investigation and detect more attack instances. However, existing systems suffer from the scalability issue due to the following reasons. First, the tight coupling between malicious contract detection and blockchain data importing makes them infeasible to repeatedly detect different attacks. Second, the coarse-grained archive data makes them inefficient to replay transactions. Third, the separation between malicious contract detection and runtime state recovery consumes lots of storage. In this article, we propose a scalable attack detection framework named EthScope, which overcomes the scalability issue by neatly re-organizing the Ethereum state and efficiently locating suspicious transactions. It leverages the fine-grained state to support the replay of arbitrary transactions and proposes a well-designed schema to optimize the storage consumption. The performance evaluation shows that EthScope can solve the scalability issue, i.e., efficiently performing a large-scale analysis on billions of transactions, and a speedup of around 2,300× when replaying transactions. It also has lower storage consumption compared with existing systems. 
Further analysis shows that EthScope can help analysts understand attack behaviors and detect more attack instances.

https://doi.org/10.1145/3505263

Summary: [Article Title: Examining Penetration Tester Behavior in the Collegiate Penetration Testing Competition/ Benjamin S. Meyers, Sultan Fahad Almassari, Brandon N. Keller and Andrew Meneely, p. 55:1-55:25]

Abstract: Penetration testing is a key practice toward engineering secure software. Malicious actors have many tactics at their disposal, and software engineers need to know what tactics attackers will prioritize in the first few hours of an attack. Projects like MITRE ATT&CK™ provide knowledge, but how do people actually deploy this knowledge in real situations? A penetration testing competition provides a realistic, controlled environment with which to measure and compare the efficacy of attackers. In this work, we examine the details of vulnerability discovery and attacker behavior with the goal of improving existing vulnerability assessment processes using data from the 2019 Collegiate Penetration Testing Competition (CPTC). We constructed 98 timelines of vulnerability discovery and exploits for 37 unique vulnerabilities discovered by 10 teams of penetration testers. We grouped related vulnerabilities together by mapping to Common Weakness Enumerations and MITRE ATT&CK™. We found that (1) vulnerabilities related to improper resource control (e.g., session fixation) are discovered faster and more often, as well as exploited faster, than vulnerabilities related to improper access control (e.g., weak password requirements), (2) there is a clear process followed by penetration testers of discovery/collection to lateral movement/pre-attack. Our methodology facilitates quicker analysis of vulnerabilities in future CPTC events. 
https://doi.org/10.1145/3514040

Summary: [Article Title: Predictive Models in Software Engineering: Challenges and Opportunities/ Yanming Yang, Xin Xia, David Lo, Tingting Bi, John Grundy and Xiaohu Yang, p. 56:1-56:72]

Abstract: Predictive models are one of the most important techniques that are widely applied in many areas of software engineering. There have been a large number of primary studies that apply predictive models and that present well-performed studies in various research domains, including software requirements, software design and development, testing and debugging, and software maintenance. This article is a first attempt to systematically organize knowledge in this area by surveying a body of 421 papers on predictive models published between 2009 and 2020. We describe the key models and approaches used, classify the different models, summarize the range of key application areas, and analyze research results. Based on our findings, we also propose a set of current challenges that still need to be addressed in future work and provide a proposed research road map for these opportunities.

https://doi.org/10.1145/3503509

Item type:

Serials

Item type	Current library	Home library	Collection	Shelving location	Call number	Copy number	Status	Date due	Barcode
Serials	LRC - Main	National University - Manila	Gen. Ed. - CCIT	Periodicals	ACM Transactions on Software Engineering and Methodology, Volume 31, Issue 3, 2022	c.1	Available		PER000000513

Includes bibliographical references.

[Article Title: L2S: A Framework for Synthesizing the Most Probable Program under a Specification/ Yingfei Xiong and Bo Wang, p. 34:1-34:45]

Abstract: In many scenarios, we need to find the most likely program that meets a specification under a local context, where the local context can be an incomplete program, a partial specification, natural language description, and so on. We call such a problem program estimation. In this article, we propose a framework, LingLong Synthesis Framework (L2S), to address this problem. Compared with existing work, our work is novel in the following aspects. (1) We propose a theory of expansion rules to describe how to decompose a program into choices. (2) We propose an approach based on abstract interpretation to efficiently prune off the program sub-space that does not satisfy the specification. (3) We prove that the probability of a program is the product of the probabilities of choosing expansion rules, regardless of the choosing order. (4) We reduce the program estimation problem to a pathfinding problem, enabling existing pathfinding algorithms to solve this problem.

L2S has been applied to program generation and program repair. In this article, we report our instantiation of this framework for synthesizing conditional expressions (L2S-Cond) and repairing conditional statements (L2S-Hanabi). The experiments on L2S-Cond show that each option enabled by L2S, including the expansion rules, the pruning technique, and the use of different pathfinding algorithms, plays a major role in the performance of the approach. The default configuration of L2S-Cond correctly predicts nearly 60% of the conditional expressions in the top 5 candidates. Moreover, we evaluate L2S-Hanabi on 272 bugs from two real-world Java defects benchmarks, namely Defects4J and Bugs.jar. L2S-Hanabi correctly fixes 32 bugs with a high precision of 84%. In terms of repairing conditional statement bugs, L2S-Hanabi significantly outperforms all existing approaches in both precision and recall.

https://doi.org/10.1145/3487570

[Article Title: Context- and Fairness-Aware In-Process Crowdworker Recommendation/ Junjie Wang, Ye Yang, Song Wang, Jun Hu and Qing Wang, p. 35:1-35:31]

Abstract: Identifying and optimizing open participation is essential to the success of open software development. Existing studies highlighted the importance of worker recommendation for crowdtesting tasks in order to improve bug detection efficiency, i.e., detect more bugs with fewer workers. However, there are a couple of limitations in existing work. First, these studies mainly focus on one-time recommendations based on expertise matching at the beginning of a new task. Second, the recommendation results suffer from severe popularity bias, i.e., highly experienced workers are recommended in almost all the tasks, while less experienced workers rarely get recommended. This article argues the need for context- and fairness-aware in-process crowdworker recommendation in order to address these limitations. We motivate this study through a pilot study, revealing the prevalence of long-sized non-yielding windows, i.e., no new bugs are revealed in consecutive test reports during the process of a crowdtesting task. This indicates the potential opportunity for accelerating crowdtesting by recommending appropriate workers in a dynamic manner, so that the non-yielding windows could be shortened. Besides, motivated by the popularity bias in existing crowdworker recommendation approaches, this study also aims at alleviating the unfairness in recommendations.

Driven by these observations, this article proposes a context- and fairness-aware in-process crowdworker recommendation approach, iRec2.0, to detect more bugs earlier, shorten the non-yielding windows, and alleviate the unfairness in recommendations. It consists of three main components: (1) the modeling of dynamic testing context, (2) the learning-based ranking component, and (3) the multi-objective optimization-based re-ranking component. The evaluation is conducted on 636 crowdtesting tasks from one of the largest crowdtesting platforms, and results show the potential of iRec2.0 in improving the cost-effectiveness of crowdtesting by saving the cost, shortening the testing process, and alleviating the unfairness among workers. In detail, iRec2.0 could shorten the non-yielding window by a median of 50%–66% in different application scenarios, and consequently has the potential to save testing cost by a median of 8%–12%. Meanwhile, the recommendation frequency of the crowdworker drops from 34%–60% to 5%–26% under different scenarios, indicating its potential in alleviating the unfairness among crowdworkers.

https://doi.org/10.1145/3487571

[Article Title: ReCDroid+: Automated End-to-End Crash Reproduction from Bug Reports for Android Apps/ Yu Zhao, Ting Su, Yang Liu, Wei Zheng, Xiaoxue Wu, Ramakanth Kavuluru, William G. J. Halfond and Tingting Yu, p. 36:1-36:33]

Abstract: The large demand for mobile devices creates significant concerns about the quality of mobile applications (apps). Developers heavily rely on bug reports in issue tracking systems to reproduce failures (e.g., crashes). However, the process of crash reproduction is often manually done by developers, making the resolution of bugs inefficient, especially given that bug reports are often written in natural language. To improve the productivity of developers in resolving bug reports, in this paper, we introduce a novel approach, called ReCDroid+, that can automatically reproduce crashes from bug reports for Android apps. ReCDroid+ uses a combination of natural language processing (NLP), deep learning, and dynamic GUI exploration to synthesize event sequences with the goal of reproducing the reported crash. We have evaluated ReCDroid+ on 66 original bug reports from 37 Android apps. The results show that ReCDroid+ successfully reproduced 42 crashes (63.6% success rate) directly from the textual description of the manually reproduced bug reports. A user study involving 12 participants demonstrates that ReCDroid+ can improve the productivity of developers when resolving crash bug reports.

https://doi.org/10.1145/3488244

[Article Title: Verification of Distributed Systems via Sequential Emulation/ Luca Di Stefano, Rocco De Nicola and Omar Inverso, p. 37:1-37:41]

Abstract: Sequential emulation is a semantics-based technique to automatically reduce property checking of distributed systems to the analysis of sequential programs. An automated procedure takes as input a formal specification of a distributed system, a property of interest, and the structural operational semantics of the specification language and generates a sequential program whose execution traces emulate the possible evolutions of the considered system. The problem as to whether the property of interest holds for the system can then be expressed either as a reachability or as a termination query on the program. This makes it possible to immediately adapt mature verification techniques developed for general-purpose languages to domain-specific languages, and to effortlessly integrate new techniques as soon as they become available. We test our approach on a selection of concurrent systems originating from different contexts, from population protocols to models of flocking behaviour. By combining a comprehensive range of program verification techniques, from traditional symbolic execution to modern inductive-based methods such as property-directed reachability, we are able to draw consistent and correct verification verdicts for the considered systems.

https://doi.org/10.1145/3490387

[Article Title: Opinion Mining for Software Development: A Systematic Literature Review/ Bin Lin, Nathan Cassee, Alexander Serebrenik, Gabriele Bavota, Nicole Novielli and Michele Lanza, p. 38:1-38:41]

Abstract: Opinion mining, sometimes referred to as sentiment analysis, has gained increasing attention in software engineering (SE) studies. SE researchers have applied opinion mining techniques in various contexts, such as identifying developers’ emotions expressed in code comments and extracting users’ criticisms of mobile apps. Given the large number of relevant studies available, it can take considerable time for researchers and developers to figure out which approaches they can adopt in their own studies and what perils these approaches entail.

We conducted a systematic literature review involving 185 papers. More specifically, we present (1) well-defined categories of opinion mining-related software development activities, (2) available opinion mining approaches, whether they are evaluated when adopted in other studies, and how their performance is compared, (3) available datasets for performance evaluation and tool customization, and (4) concerns or limitations SE researchers might need to take into account when applying/customizing these opinion mining techniques. The results of our study serve as references to choose suitable opinion mining tools for software development activities and provide critical insights for the further development of opinion mining techniques in the SE domain.

https://doi.org/10.1145/3490388

[Article Title: Stateful Serverless Computing with Crucial/ Daniel Barcelona-Pons, Pierre Sutra, Marc Sánchez-Artigas, Gerard París and Pedro García-López, p. 39:1-39:38]

Abstract: Serverless computing greatly simplifies the use of cloud resources. In particular, Function-as-a-Service (FaaS) platforms enable programmers to develop applications as individual functions that can run and scale independently. Unfortunately, applications that require fine-grained support for mutable state and synchronization, such as machine learning (ML) and scientific computing, are notoriously hard to build with this new paradigm. In this work, we aim at bridging this gap. We present Crucial, a system to program highly-parallel stateful serverless applications. Crucial retains the simplicity of serverless computing. It is built upon the key insight that FaaS resembles concurrent programming at the scale of a datacenter. Accordingly, a distributed shared memory layer is the natural answer to the needs for fine-grained state management and synchronization. Crucial makes it possible to effortlessly port a multi-threaded code base to serverless, where it can benefit from the scalability and pay-per-use model of FaaS platforms. We validate Crucial with the help of micro-benchmarks and by considering various stateful applications. Beyond classical parallel tasks (e.g., a Monte Carlo simulation), these applications include representative ML algorithms such as k-means and logistic regression. Our evaluation shows that Crucial obtains superior or comparable performance to Apache Spark at similar cost (18%–40% faster). We also use Crucial to port (part of) a state-of-the-art multi-threaded ML library to serverless. The ported application is up to 30% faster than with a dedicated high-end server. Finally, we attest that Crucial can rival the performance of a single-machine, multi-threaded implementation of a complex coordination problem. Overall, Crucial delivers all these benefits with less than 6% of changes in the code bases of the evaluated applications.

https://doi.org/10.1145/3490386

[Article Title: Applying Bayesian Analysis Guidelines to Empirical Software Engineering Data: The Case of Programming Languages and Code Quality/ Carlo A. Furia, Richard Torkar and Robert Feldt, p. 40:1-40:38]

Abstract: Statistical analysis is the tool of choice to turn data into information and then information into empirical knowledge. However, the process that goes from data to knowledge is long, uncertain, and riddled with pitfalls. To be valid, it should be supported by detailed, rigorous guidelines that help ferret out issues with the data or model and lead to qualified results that strike a reasonable balance between generality and practical relevance. Such guidelines are being developed by statisticians to support the latest techniques for Bayesian data analysis. In this article, we frame these guidelines in a way that is apt to empirical research in software engineering.

To demonstrate the guidelines in practice, we apply them to reanalyze a GitHub dataset about code quality in different programming languages. The dataset’s original analysis [Ray et al. 55] and a critical reanalysis [Berger et al. 6] have attracted considerable attention—in no small part because they target a topic (the impact of different programming languages) on which strong opinions abound. The goals of our reanalysis are largely orthogonal to this previous work, as we are concerned with demonstrating, on data in an interesting domain, how to build a principled Bayesian data analysis and to showcase its benefits. In the process, we will also shed light on some critical aspects of the analyzed data and of the relationship between programming languages and code quality—such as the impact of project-specific characteristics other than the used programming language.

The high-level conclusions of our exercise will be that Bayesian statistical techniques can be applied to analyze software engineering data in a way that is principled, flexible, and leads to convincing results that inform the state-of-the-art while highlighting the boundaries of its validity. The guidelines can support building solid statistical analyses and connecting their results. Thus, they can help buttress continued progress in empirical software engineering research.

https://doi.org/10.1145/3490953

[Article Title: On the Faults Found in REST APIs by Automated Test Generation/ Bogdan Marculescu, Man Zhang and Andrea Arcuri, p. 41:1-41:43]

Abstract: RESTful web services are often used for building a wide variety of enterprise applications. The diversity and increased number of applications using RESTful APIs mean that increasing amounts of resources are spent developing and testing these systems. Automation in test data generation provides a useful way of generating test data in a fast and efficient manner. However, automated test generation often results in large test suites that are hard to evaluate and investigate manually.

This article proposes a taxonomy of the faults we have found using search-based software testing techniques applied on RESTful APIs. The taxonomy is a first step in understanding, analyzing, and ultimately fixing software faults in web services and enterprise applications. We propose to apply a density-based clustering algorithm to the test cases evolved during the search to allow a better separation between different groups of faults. This is needed to enable engineers to highlight and focus on the most serious faults.

Tests were automatically generated for a set of eight case studies, seven open-source and one industrial. The test cases generated during the search are clustered based on the reported last executed line and based on the error messages returned, when such error messages were available. The tests were manually evaluated to determine their root causes and to obtain additional information.

The article presents a taxonomy of the faults found based on the manual analysis of 415 faults in the eight case studies and proposes a method to support the classification using clustering of the resulting test cases.

https://doi.org/10.1145/3491038

[Article Title: Using Personality Detection Tools for Software Engineering Research: How Far Can We Go?/ Fabio Calefato and Filippo Lanubile, p. 42:1-42:48]

Abstract: Assessing the personality of software engineers may help to match individual traits with the characteristics of development activities such as code review and testing, as well as support managers in team composition. However, self-assessment questionnaires are not a practical solution for collecting multiple observations on a large scale. Instead, automatic personality detection, while overcoming these limitations, is based on off-the-shelf solutions trained on non-technical corpora, which might not be readily applicable to technical domains like software engineering. In this article, we first assess the performance of general-purpose personality detection tools when applied to a technical corpus of developers’ e-mails retrieved from the public archives of the Apache Software Foundation. We observe a general low accuracy of predictions and an overall disagreement among the tools. Second, we replicate two previous research studies in software engineering by replacing the personality detection tool used to infer developers’ personalities from pull-request discussions and e-mails. We observe that the original results are not confirmed, i.e., changing the tool used in the original study leads to diverging conclusions. Our results suggest a need for personality detection tools specially targeted for the software engineering domain.

https://doi.org/10.1145/3491039

[Article Title: All in One: Design, Verification, and Implementation of SNOW-optimal Read Atomic Transactions/ Si Liu, p. 43:1-43:44]

Abstract: Distributed read atomic transactions are important building blocks of modern cloud databases that magnificently bridge the gap between data availability and strong data consistency. The performance of their transactional reads is particularly critical to the overall system performance, as many real-world database workloads are dominated by reads. Following the SNOW design principle for optimal reads, we develop LORA, a novel SNOW-optimal algorithm for distributed read atomic transactions. LORA completes its reads in exactly one round trip, even in the presence of conflicting writes, without imposing additional overhead to the communication, and it outperforms the state-of-the-art read atomic algorithms.

To guide LORA’s development, we present a rewriting-logic-based framework and toolkit for design, verification, implementation, and evaluation of distributed databases. Within the framework, we formalize LORA and mathematically prove its data consistency guarantees. We also apply automatic model checking and statistical verification to validate our proofs and to estimate LORA’s performance. We additionally generate from the formal model a correct-by-construction distributed implementation for testing and performance evaluation under realistic deployments. Our design-level and implementation-based experimental results are consistent, which together demonstrate LORA’s promising data consistency and performance achievement.

https://doi.org/10.1145/3494517

[Article Title: Do Developers Really Know How to Use Git Commands? A Large-scale Study Using Stack Overflow/ Wenhua Yang, Chong Zhang, Minxue Pan, Chang Xu, Yu Zhou and Zhiqiu Huang, p. 44:1-44:29]

Abstract: Git, a cross-platform and open source distributed version control tool, provides strong support for non-linear development and is capable of handling everything from small to large projects with speed and efficiency. It has become an indispensable tool for millions of software developers and is the de facto standard of version control in software development nowadays. However, despite its widespread use, developers still frequently face difficulties when using various Git commands to manage projects and collaborate. To better help developers use Git, it is necessary to understand the issues and difficulties that they may encounter when using Git. Unfortunately, this problem has not yet been comprehensively studied. To fill this knowledge gap, in this article, we conduct a large-scale study on Stack Overflow, a popular Q&A forum for developers. We extracted and analyzed 80,370 relevant questions from Stack Overflow, and reported the increasing popularity of the Git command questions. By analyzing the questions, we identified the Git commands that are frequently asked and those that are associated with difficult questions on Stack Overflow to help understand the difficulties developers may encounter when using Git commands. In addition, we conducted a survey to understand how developers learn Git commands in practice, showing that self-learning is the primary learning approach. These findings provide a range of actionable implications for researchers, educators, and developers.

https://doi.org/10.1145/3494518

[Article Title: Industry–Academia Research Collaboration and Knowledge Co-creation: Patterns and Anti-patterns/ Dusica Marijan and Sagar Sen, p. 45:1-45:52]

Abstract: Increasing the impact of software engineering research in the software industry and the society at large has long been a concern of high priority for the software engineering community. The problem of two cultures, research conducted in a vacuum (disconnected from the real world), or misaligned time horizons are just some of the many complex challenges standing in the way of successful industry–academia collaborations. This article reports on the experience of research collaboration and knowledge co-creation between industry and academia in software engineering as a way to bridge the research–practice collaboration gap. Our experience spans 14 years of collaboration between researchers in software engineering and the European and Norwegian software and IT industry. Using the participant observation and interview methods, we have collected and afterwards analyzed an extensive record of qualitative data. Drawing upon the findings made and the experience gained, we provide a set of 14 patterns and 14 anti-patterns for industry–academia collaborations, aimed to support other researchers and practitioners in establishing and running research collaboration projects in software engineering.

https://doi.org/10.1145/3494519

[Article Title: Continuous and Proactive Software Architecture Evaluation: An IoT Case/ Dalia Sobhy, Leandro Minku, Rami Bahsoon and Rick Kazman, p. 46:1-46:54]

Abstract: Design-time evaluation is essential to build the initial software architecture to be deployed. However, experts’ assumptions made at design-time are unlikely to remain true indefinitely in systems that are characterized by scale, hyperconnectivity, dynamism, and uncertainty in operations (e.g. IoT). Therefore, experts’ design-time decisions can be challenged at run-time. A continuous architecture evaluation that systematically assesses and intertwines design-time and run-time decisions is thus necessary. This paper proposes the first proactive approach to continuous architecture evaluation of the system leveraging the support of simulation. The approach evaluates software architectures by not only tracking their performance over time, but also forecasting their likely future performance through machine learning of simulated instances of the architecture. This enables architects to make cost-effective informed decisions on potential changes to the architecture. We perform an IoT case study to show how machine learning on simulated instances of architecture can fundamentally guide the continuous evaluation process and influence the outcome of architecture decisions. A series of experiments is conducted to demonstrate the applicability and effectiveness of the approach. We also provide the architect with recommendations on how to best benefit from the approach through choice of learners and input parameters, grounded on experimentation and evidence.

https://doi.org/10.1145/3492762

[Article Title: NPC: Neuron Path Coverage via Characterizing Decision Logic of Deep Neural Networks/ Xiaofei Xie, Tianlin Li, Jian Wang, Lei Ma, Qing Guo, Felix Juefei-Xu and Yang Liu, p. 47:1-47:27]

Abstract: Deep learning has recently been widely applied to many applications across different domains, e.g., image classification and audio recognition. However, the quality of Deep Neural Networks (DNNs) still raises concerns in the practical operational environment, which calls for systematic testing, especially in safety-critical scenarios. Inspired by software testing, a number of structural coverage criteria are designed and proposed to measure the test adequacy of DNNs. However, due to the blackbox nature of DNN, the existing structural coverage criteria are difficult to interpret, making it hard to understand the underlying principles of these criteria. The relationship between the structural coverage and the decision logic of DNNs is unknown. Moreover, recent studies have further revealed the non-existence of correlation between the structural coverage and DNN defect detection, which further posts concerns on what a suitable DNN testing criterion should be.

In this article, we propose the interpretable coverage criteria through constructing the decision structure of a DNN. Mirroring the control flow graph of the traditional program, we first extract a decision graph from a DNN based on its interpretation, where a path of the decision graph represents a decision logic of the DNN. Based on the control flow and data flow of the decision graph, we propose two variants of path coverage to measure the adequacy of the test cases in exercising the decision logic. The higher the path coverage, the more diverse decision logic the DNN is expected to be explored. Our large-scale evaluation results demonstrate that: The path in the decision graph is effective in characterizing the decision of the DNN, and the proposed coverage criteria are also sensitive with errors, including natural errors and adversarial examples, and strongly correlate with the output impartiality.

https://doi.org/10.1145/3490489

[Article Title: An Empirical Study of the Effectiveness of an Ensemble of Stand-alone Sentiment Detection Tools for Software Engineering Datasets/ Gias Uddin, Yann-Gaël Guéhénuc, Foutse Khomh and Chanchal K. Roy, p. 48:1-48:38]

Abstract: Sentiment analysis in software engineering (SE) has shown promise to analyze and support diverse development activities. Recently, several tools are proposed to detect sentiments in software artifacts. While the tools improve accuracy over off-the-shelf tools, recent research shows that their performance could still be unsatisfactory. A more accurate sentiment detector for SE can help reduce noise in analysis of software scenarios where sentiment analysis is required. Recently, combinations, i.e., hybrids of stand-alone classifiers are found to offer better performance than the stand-alone classifiers for fault detection. However, we are aware of no such approach for sentiment detection for software artifacts. We report the results of an empirical study that we conducted to determine the feasibility of developing an ensemble engine by combining the polarity labels of stand-alone SE-specific sentiment detectors. Our study has two phases. In the first phase, we pick five SE-specific sentiment detection tools from two recently published papers by Lin et al. [29, 30], who first reported negative results with stand alone sentiment detectors and then proposed an improved SE-specific sentiment detector, POME [29]. We report the study results on 17,581 units (sentences/documents) coming from six currently available sentiment benchmarks for software engineering. We find that the existing tools can be complementary to each other in 85-95% of the cases, i.e., one is wrong but another is right. However, a majority voting-based ensemble of those tools fails to improve the accuracy of sentiment detection. We develop Sentisead, a supervised tool by combining the polarity labels and bag of words as features. Sentisead improves the performance (F1-score) of the individual tools by 4% (over Senti4SD [5]) – 100% (over POME [29]). The initial development of Sentisead occurred before we observed the use of deep learning models for SE-specific sentiment detection. 
In particular, recent papers show the superiority of advanced language-based pre-trained transformer models (PTM) over rule-based and shallow learning models. Consequently, in a second phase, we compare and improve Sentisead infrastructure using the PTMs. We find that a Sentisead infrastructure with RoBERTa as the ensemble of the five stand-alone rule-based and shallow learning SE-specific tools from Lin et al. [29, 30] offers the best F1-score of 0.805 across the six datasets, while a stand-alone RoBERTa shows an F1-score of 0.801.

https://doi.org/10.1145/3491211

[Article Title: Detecting and Augmenting Missing Key Aspects in Vulnerability Descriptions/ Hao Guo, Sen Chen, Zhenchang Xing, Xiaohong Li, Yude Bai and Jiamou Sun, p. 49:1-49:27]

Abstract: Security vulnerabilities have been continually disclosed and documented. For the effective understanding, management, and mitigation of the fast-growing number of vulnerabilities, an important practice in documenting vulnerabilities is to describe the key vulnerability aspects, such as vulnerability type, root cause, affected product, impact, attacker type, and attack vector. In this article, we first investigate 133,639 vulnerability reports in the Common Vulnerabilities and Exposures (CVE) database over the past 20 years. We find that 56%, 85%, 38%, and 28% of CVEs miss vulnerability type, root cause, attack vector, and attacker type, respectively. By comparing the differences of the latest updated CVE reports across different databases, we observe that 1,476 missing key aspects in 1,320 CVE descriptions were augmented manually in the National Vulnerability Database (NVD), which indicates that the vulnerability database maintainers try to complete the vulnerability descriptions in practice to mitigate such a problem.

To help complete the missing information of key vulnerability aspects and reduce human efforts, we propose a neural-network-based approach called PMA to predict the missing key aspects of a vulnerability based on its known aspects. We systematically explore the design space of the neural network models and empirically identify the most effective model design in the scenario. Our ablation study reveals the prominent correlations among vulnerability aspects when predicting. Trained with historical CVEs, our model achieves 88%, 71%, 61%, and 81% in F1 for predicting the missing vulnerability type, root cause, attacker type, and attack vector of 8,623 “future” CVEs across 3 years, respectively. Furthermore, we validate the predicting performance of key aspect augmentation of CVEs based on the manually augmented CVE data collected from NVD, which confirms the practicality of our approach. We finally highlight that PMA has the ability to reduce human efforts by recommending and augmenting missing key aspects for vulnerability databases, and to facilitate other research works such as severity level prediction of CVEs based on the vulnerability descriptions.

https://doi.org/10.1145/3498537

[Article Title: Towards Robustness of Deep Program Processing Models—Detection, Estimation, and Enhancement/ Huangzhao Zhang, Zhiyi Fu, Ge Li, Lei Ma, Zhehao Zhao, Hua’an Yang, Yizhe Sun, Yang Liu and Zhi Jin, p. 50:1-50:40]

Abstract: Deep learning (DL) has recently been widely applied to diverse source code processing tasks in the software engineering (SE) community, which achieves competitive performance (e.g., accuracy). However, the robustness, which requires the model to produce consistent decisions given minorly perturbed code inputs, still lacks systematic investigation as an important quality indicator. This article initiates an early step and proposes a framework CARROT for robustness detection, measurement, and enhancement of DL models for source code processing. We first propose an optimization-based attack technique CARROTA to generate valid adversarial source code examples effectively and efficiently. Based on this, we define the robustness metrics and propose robustness measurement toolkit CARROTM, which employs the worst-case performance approximation under the allowable perturbations. We further propose to improve the robustness of the DL models by adversarial training (CARROTT) with our proposed attack techniques. Our in-depth evaluations on three source code processing tasks (i.e., functionality classification, code clone detection, defect prediction) containing more than 3 million lines of code and the classic or SOTA DL models, including GRU, LSTM, ASTNN, LSCNN, TBCNN, CodeBERT, and CDLH, demonstrate the usefulness of our techniques for ❶ effective and efficient adversarial example detection, ❷ tight robustness estimation, and ❸ effective robustness enhancement.

https://doi.org/10.1145/3511887

[Article Title: Context-Aware Code Change Embedding for Better Patch Correctness Assessment/ Bo Lin, Shangwen Wang, Ming Wen and Xiaoguang Mao, p. 51:1-51:29]

Abstract: Despite the capability in successfully fixing more and more real-world bugs, existing Automated Program Repair (APR) techniques are still challenged by the long-standing overfitting problem (i.e., a generated patch that passes all tests is actually incorrect). Plenty of approaches have been proposed for automated patch correctness assessment (APCA). Nonetheless, dynamic ones (i.e., those that needed to execute tests) are time-consuming while static ones (i.e., those built on top of static code features) are less precise. Therefore, embedding techniques have been proposed recently, which assess patch correctness via embedding token sequences extracted from the changed code of a generated patch. However, existing techniques rarely considered the context information and program structures of a generated patch, which are crucial for patch correctness assessment as revealed by existing studies. In this study, we explore the idea of context-aware code change embedding considering program structures for patch correctness assessment. Specifically, given a patch, we not only focus on the changed code but also take the correlated unchanged part into consideration, through which the context information can be extracted and leveraged. We then utilize the AST path technique for representation where the structure information from AST node can be captured. Finally, based on several pre-defined heuristics, we build a deep learning based classifier to predict the correctness of the patch. We implemented this idea as Cache and performed extensive experiments to assess its effectiveness. Our results demonstrate that Cache can (1) perform better than previous representation learning based techniques (e.g., Cache relatively outperforms existing techniques by ≈6%, ≈3%, and ≈16%, respectively under three diverse experiment settings), and (2) achieve overall higher performance than existing APCA techniques while even being more precise than certain dynamic ones including PATCH-SIM (92.9% vs. 83.0%). Further results reveal that the context information and program structures leveraged by Cache contributed significantly to its outstanding performance.

https://doi.org/10.1145/3505247

[Article Title: XCode: Towards Cross-Language Code Representation with Large-Scale Pre-Training/ Zehao Lin, Guodun Li, Jingfeng Zhang, Yue Deng, Xiangji Zeng, Yin Zhang and Yao Wan, p. 52:1-52:44]

Abstract: Source code representation learning is the basis of applying artificial intelligence to many software engineering tasks such as code clone detection, algorithm classification, and code summarization. Recently, many works have tried to improve the performance of source code representation from various perspectives, e.g., introducing the structural information of programs into latent representation. However, when dealing with rapidly expanded unlabeled cross-language source code datasets from the Internet, there are still two issues. Firstly, deep learning models for many code-specific tasks still suffer from the lack of high-quality labels. Secondly, the structural differences among programming languages make it more difficult to process multiple languages in a single neural architecture.

To address these issues, in this article, we propose a novel Cross-language Code representation with a large-scale pre-training (XCode) method. Concretely, we propose to use several abstract syntax trees and ELMo-enhanced variational autoencoders to obtain multiple pre-trained source code language models trained on about 1.5 million code snippets. To fully utilize the knowledge across programming languages, we further propose a Shared Encoder-Decoder (SED) architecture which uses the multi-teacher single-student method to transfer knowledge from the aforementioned pre-trained models to the distilled SED. The pre-trained models and SED will cooperate to better represent the source code. For evaluation, we examine our approach on three typical downstream cross-language tasks, i.e., source code translation, code clone detection, and code-to-code search, on a real-world dataset composed of programming exercises with multiple solutions. Experimental results demonstrate the effectiveness of our proposed approach on cross-language code representations. Meanwhile, our approach performs significantly better than several code representation baselines on different downstream tasks in terms of multiple automatic evaluation metrics.

https://doi.org/10.1145/3506696

[Article Title: An Empirical Study of the Impact of Hyperparameter Tuning and Model Optimization on the Performance Properties of Deep Neural Networks/ Lizhi Liao, Heng Li, Weiyi Shang and Lei Ma, p. 53:1-53:40]

Abstract: Deep neural network (DNN) models typically have many hyperparameters that can be configured to achieve optimal performance on a particular dataset. Practitioners usually tune the hyperparameters of their DNN models by training a number of trial models with different configurations of the hyperparameters, to find the optimal hyperparameter configuration that maximizes the training accuracy or minimizes the training loss. As such hyperparameter tuning usually focuses on the model accuracy or the loss function, it is not clear and remains under-explored how the process impacts other performance properties of DNN models, such as inference latency and model size. On the other hand, standard DNN models are often large in size and computing-intensive, prohibiting them from being directly deployed in resource-bounded environments such as mobile devices and Internet of Things (IoT) devices. To tackle this problem, various model optimization techniques (e.g., pruning or quantization) are proposed to make DNN models smaller and less computing-intensive so that they are better suited for resource-bounded environments. However, it is neither clear how the model optimization techniques impact other performance properties of DNN models such as inference latency and battery consumption, nor how the model optimization techniques impact the effect of hyperparameter tuning (i.e., the compounding effect). Therefore, in this paper, we perform a comprehensive study on four representative and widely-adopted DNN models, i.e., CNN image classification, Resnet-50, CNN text classification, and LSTM sentiment classification, to investigate how different DNN model hyperparameters affect the standard DNN models, as well as how the hyperparameter tuning combined with model optimization affect the optimized DNN models, in terms of various performance properties (e.g., inference latency or battery consumption). 
Our empirical results indicate that tuning specific hyperparameters has heterogeneous impact on the performance of DNN models across different models and different performance properties. In particular, although the top tuned DNN models usually have very similar accuracy, they may have significantly different performance in terms of other aspects (e.g., inference latency). We also observe that model optimization has a confounding effect on the impact of hyperparameters on DNN model performance. For example, two sets of hyperparameters may result in standard models with similar performance but their performance may become significantly different after they are optimized and deployed on the mobile device. Our findings highlight that practitioners can benefit from paying attention to a variety of performance properties and the confounding effect of model optimization when tuning and optimizing their DNN models.

https://doi.org/10.1145/3506695

[Article Title: Time-travel Investigation: Toward Building a Scalable Attack Detection Framework on Ethereum/ Siwei Wu, Lei Wu, Yajin Zhou, Runhuai Li, Zhi Wang, Xiapu Luo, Cong Wang and Kui Ren, p. 54:1-54:33]

Abstract: Ethereum has been attracting lots of attacks, hence there is a pressing need to perform timely investigation and detect more attack instances. However, existing systems suffer from the scalability issue due to the following reasons. First, the tight coupling between malicious contract detection and blockchain data importing makes them infeasible to repeatedly detect different attacks. Second, the coarse-grained archive data makes them inefficient to replay transactions. Third, the separation between malicious contract detection and runtime state recovery consumes lots of storage.

In this article, we propose a scalable attack detection framework named EthScope, which overcomes the scalability issue by neatly re-organizing the Ethereum state and efficiently locating suspicious transactions. It leverages the fine-grained state to support the replay of arbitrary transactions and proposes a well-designed schema to optimize the storage consumption. The performance evaluation shows that EthScope can solve the scalability issue, i.e., efficiently performing a large-scale analysis on billions of transactions, and a speedup of around 2,300× when replaying transactions. It also has lower storage consumption compared with existing systems. Further analysis shows that EthScope can help analysts understand attack behaviors and detect more attack instances.

https://doi.org/10.1145/3505263

[Article Title: Examining Penetration Tester Behavior in the Collegiate Penetration Testing Competition/ Benjamin S. Meyers, Sultan Fahad Almassari, Brandon N. Keller and Andrew Meneely, p. 55:1-55:25]

Abstract: Penetration testing is a key practice toward engineering secure software. Malicious actors have many tactics at their disposal, and software engineers need to know what tactics attackers will prioritize in the first few hours of an attack. Projects like MITRE ATT&CK™ provide knowledge, but how do people actually deploy this knowledge in real situations? A penetration testing competition provides a realistic, controlled environment with which to measure and compare the efficacy of attackers. In this work, we examine the details of vulnerability discovery and attacker behavior with the goal of improving existing vulnerability assessment processes using data from the 2019 Collegiate Penetration Testing Competition (CPTC). We constructed 98 timelines of vulnerability discovery and exploits for 37 unique vulnerabilities discovered by 10 teams of penetration testers. We grouped related vulnerabilities together by mapping to Common Weakness Enumerations and MITRE ATT&CK™. We found that (1) vulnerabilities related to improper resource control (e.g., session fixation) are discovered faster and more often, as well as exploited faster, than vulnerabilities related to improper access control (e.g., weak password requirements), (2) there is a clear process followed by penetration testers of discovery/collection to lateral movement/pre-attack. Our methodology facilitates quicker analysis of vulnerabilities in future CPTC events.

https://doi.org/10.1145/3514040

[Article Title: Predictive Models in Software Engineering: Challenges and Opportunities/ Yanming Yang, Xin Xia, David Lo, Tingting Bi, John Grundy and Xiaohu Yang, p. 56:1-56:72]

Abstract: Predictive models are one of the most important techniques that are widely applied in many areas of software engineering. There have been a large number of primary studies that apply predictive models and that present well-performed studies in various research domains, including software requirements, software design and development, testing and debugging, and software maintenance. This article is a first attempt to systematically organize knowledge in this area by surveying a body of 421 papers on predictive models published between 2009 and 2020. We describe the key models and approaches used, classify the different models, summarize the range of key application areas, and analyze research results. Based on our findings, we also propose a set of current challenges that still need to be addressed in future work and provide a proposed research road map for these opportunities.

https://doi.org/10.1145/3503509
