Software Engineering
- [1] arXiv:2406.03577 [pdf, ps, html, other]
-
Title: Explaining the Contributing Factors for Vulnerability Detection in Machine LearningSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
There is an increasing trend to mine vulnerabilities from software repositories and use machine learning techniques to automatically detect software vulnerabilities. A fundamental but unresolved research question is: how do different factors in the mining and learning process impact the accuracy of identifying vulnerabilities in software projects of varying characteristics? Substantial research has been dedicated in this area, including source code static analysis, software repository mining, and NLP-based machine learning. However, practitioners lack experience regarding the key factors for building a baseline model of the state-of-the-art. In addition, there lacks of experience regarding the transferability of the vulnerability signatures from project to project. This study investigates how the combination of different vulnerability features and three representative machine learning models impact the accuracy of vulnerability detection in 17 real-world projects. We examine two types of vulnerability representations: 1) code features extracted through NLP with varying tokenization strategies and three different embedding techniques (bag-of-words, word2vec, and fastText) and 2) a set of eight architectural metrics that capture the abstract design of the software systems. The three machine learning algorithms include a random forest model, a support vector machines model, and a residual neural network model. The analysis shows a recommended baseline model with signatures extracted through bag-of-words embedding, combined with the random forest, consistently increases the detection accuracy by about 4% compared to other combinations in all 17 projects. Furthermore, we observe the limitation of transferring vulnerability signatures across domains based on our experiments.
- [2] arXiv:2406.03660 [pdf, ps, html, other]
-
Title: Refactoring to Pythonic Idioms: A Hybrid Knowledge-Driven Approach Leveraging Large Language ModelsComments: Accepted by FSE 2024,22 pagesSubjects: Software Engineering (cs.SE)
Pythonic idioms are highly valued and widely used in the Python programming community. However, many Python users find it challenging to use Pythonic idioms. Adopting a rule-based approach or LLM-only approach is not sufficient to overcome three persistent challenges of code idiomatization including code miss, wrong detection and wrong refactoring. Motivated by the determinism of rules and adaptability of LLMs, we propose a hybrid approach consisting of three modules. We not only write prompts to instruct LLMs to complete tasks, but we also invoke Analytic Rule Interfaces (ARIs) to accomplish tasks. The ARIs are Python code generated by prompting LLMs to generate code. We first construct a knowledge module with three elements including ASTscenario, ASTcomponent and Condition, and prompt LLMs to generate Python code for incorporation into an ARI library for subsequent use. After that, for any syntax-error-free Python code, we invoke ARIs from the ARI library to extract ASTcomponent from the ASTscenario, and then filter out ASTcomponent that does not meet the condition. Finally, we design prompts to instruct LLMs to abstract and idiomatize code, and then invoke ARIs from the ARI library to rewrite non-idiomatic code into the idiomatic code. Next, we conduct a comprehensive evaluation of our approach, RIdiom, and Prompt-LLM on nine established Pythonic idioms in RIdiom. Our approach exhibits superior accuracy, F1-score, and recall, while maintaining precision levels comparable to RIdiom, all of which consistently exceed or come close to 90% for each metric of each idiom. Lastly, we extend our evaluation to encompass four new Pythonic idioms. Our approach consistently outperforms Prompt-LLM, achieving metrics with values consistently exceeding 90% for accuracy, F1-score, precision, and recall.
- [3] arXiv:2406.03839 [pdf, ps, html, other]
-
Title: PCART: Automated Repair of Python API Parameter Compatibility IssuesComments: Submitted to IEEE Transactions on Software EngineeringSubjects: Software Engineering (cs.SE)
In modern software development, Python third-party libraries have become crucial, particularly due to their widespread use in fields such as deep learning and scientific computing. However, the parameters of APIs in third-party libraries often change during evolution, causing compatibility issues for client applications that depend on specific versions. Due to Python's flexible parameter-passing mechanism, different methods of parameter passing can result in different API compatibility. Currently, no tool is capable of automatically detecting and repairing Python API parameter compatibility issues. To fill this gap, we propose PCART, the first to implement a fully automated process from API extraction, code instrumentation, and API mapping establishment, to compatibility assessment, and finally to repair and validation, for solving various types of Python API parameter compatibility issues, i.e., parameter addition, removal, renaming, reordering of parameters, as well as the conversion of positional parameters to keyword parameters. We construct a large-scale benchmark PCBENCH, including 47,478 test cases mutated from 844 parameter-changed APIs of 33 popular Python libraries, to evaluate PCART. The evaluation results show that PCART is effective yet efficient, significantly outperforming existing tools (MLCatchUp and Relancer) and the large language model ChatGPT-4, achieving an F-measure of 96.49% in detecting API parameter compatibility issues and a repair accuracy of 91.36%. The evaluation on 14 real-world Python projects from GitHub further demonstrates that PCART has good practicality. We believe PCART can help programmers reduce the time spent on maintaining Python API updates and facilitate automated Python API compatibility issue repair.
- [4] arXiv:2406.03870 [pdf, ps, html, other]
-
Title: GOOSE: Goal-Conditioned Reinforcement Learning for Safety-Critical Scenario GenerationSubjects: Software Engineering (cs.SE)
Scenario-based testing is considered state-of-the-art for verifying and validating Advanced Driver Assistance Systems (ADASs) and Automated Driving Systems (ADSs). However, the practical application of scenario-based testing requires an efficient method to generate or collect the scenarios that are needed for the safety assessment. In this paper, we propose Goal-conditioned Scenario Generation (GOOSE), a goal-conditioned reinforcement learning (RL) approach that automatically generates safety-critical scenarios to challenge ADASs or ADSs. In order to simultaneously set up and optimize scenarios, we propose to control vehicle trajectories at the scenario level. Each step in the RL framework corresponds to a scenario simulation. We use Non-Uniform Rational B-Splines (NURBS) for trajectory modeling. To guide the goal-conditioned agent, we formulate test-specific, constraint-based goals inspired by the OpenScenario Domain Specific Language(DSL). Through experiments conducted on multiple pre-crash scenarios derived from UN Regulation No. 157 for Active Lane Keeping Systems (ALKS), we demonstrate the effectiveness of GOOSE in generating scenarios that lead to safety-critical events.
- [5] arXiv:2406.04037 [pdf, ps, html, other]
-
Title: A Road-Map for Transferring Software Engineering methods for Model-Based Early V&V of Behaviour to Systems EngineeringComments: 9 pages, 2 figures, SE2030, Software Engineering in 2030 Workshop, ACM International Conference on the Foundations of Software Engineering (FSE) 2024, Porto de Galinhas, BrazilSubjects: Software Engineering (cs.SE)
In this paper we discuss the growing need for system behaviour to be validated and verified (V&V'ed) early in model-based systems engineering. Several aspects push companies towards integration of techniques, methods, and processes that promote specific and general V&V activities earlier to support more effective decision-making. As a result, there are incentives to introduce new technologies to remain competitive with the recently drastic changes in system complexity and heterogeneity. Performing V&V early on in development is a means of reducing risk for later error detection while moving key activities earlier in a process. We present a summary of the literature on early V&V and position existing challenges regarding potential solutions and future investigations. In particular, we reason that the software engineering community can act as a source for inspiration as many emerging technologies in the software domain are showing promise in the wider systems domain, and there already exist well formed methods for early V&V of software behaviour in the software modelling community. We conclude the paper with a road-map for future research and development for both researchers and practitioners to further develop the concepts discussed in the paper.
- [6] arXiv:2406.04057 [pdf, ps, html, other]
-
Title: Overwhelmed Software DevelopersLisa-Marie Michels, Aleksandra Petkova, Marcel Richter, Andreas Farley, Daniel Graziotin, Stefan WagnerComments: 8 pages. Published at IEEE Software. Based on the technical report arXiv:2401.02780Journal-ref: IEEE Software (Volume: 41, Issue: 4, July-Aug. 2024), Page(s): 51 - 59Subjects: Software Engineering (cs.SE); Computers and Society (cs.CY)
We have conducted a qualitative psychology study to explore the experience of feeling overwhelmed in the realm of software development. Through the candid confessions of two participants who have recently faced overwhelming challenges, we have identified seven distinct categories: communication-induced, disturbance-related, organizational, variety, technical, temporal, and positive overwhelm. While most types of overwhelm tend to deteriorate productivity and increase stress levels, developers sometimes perceive overwhelm as a catalyst for heightened focus, self-motivation, and productivity. Stress was often found to be a common companion of overwhelm. Our findings align with previous studies conducted in diverse disciplines. However, we believe that software developers possess unique traits that may enable them to navigate through the storm of overwhelm more effectively.
- [7] arXiv:2406.04066 [pdf, ps, other]
-
Title: Requirements for Organizational Resilience: Engineering Developer HappinessComments: 5 pagesJournal-ref: IEEE Software, Jul.-Aug. 2024, pp. 14-18, vol. 41Subjects: Software Engineering (cs.SE); Computers and Society (cs.CY)
Can the right requirements boost developer satisfaction and happiness? We believe they can. In keeping with this issue's theme, "Well-Being for Resilience: Developers Thrive," we discuss the connection between the three keywords, well-being, resilience, and thriving. How could requirements engineering foster these qualities? While there hasn't been much research on this topic, we see opportunities for future work. Let's initiate the discussion!
- [8] arXiv:2406.04152 [pdf, ps, html, other]
-
Title: Position: How Regulation Will Change Software Security ResearchComments: 5 pages, submitted to SE2023 workshop at FSE 2024Subjects: Software Engineering (cs.SE)
Software security has been an important research topic over the years. The community has proposed processes and tools for secure software development and security analysis. However, a significant number of vulnerabilities remains in real-world software-driven systems and products.
To alleviate this problem, legislation is being established to oblige manufacturers, for example, to comply with essential security requirements and to establish appropriate development practices. We argue that software engineering research needs to provide better tools and support that helps industry comply with the new standards while retaining effcient processes. We argue for a stronger cooperation between legal scholars and computer scientists, and for bridging the gap between higher-level regulation and code-level engineering.
New submissions for Friday, 7 June 2024 (showing 8 of 8 entries )
- [9] arXiv:2406.04027 (cross-list from cs.CR) [pdf, ps, html, other]
-
Title: PowerPeeler: A Precise and General Dynamic Deobfuscation Method for PowerShell ScriptsComments: To appear in the ACM CCS 2024Subjects: Cryptography and Security (cs.CR); Software Engineering (cs.SE)
PowerShell is a powerful and versatile task automation tool. Unfortunately, it is also widely abused by cyber attackers. To bypass malware detection and hinder threat analysis, attackers often employ diverse techniques to obfuscate malicious PowerShell scripts. Existing deobfuscation tools suffer from the limitation of static analysis, which fails to simulate the real deobfuscation process accurately.
In this paper, we propose PowerPeeler. To the best of our knowledge, it is the first dynamic PowerShell script deobfuscation approach at the instruction level. It utilizes expression-related Abstract Syntax Tree (AST) nodes to identify potential obfuscated script pieces. Then, PowerPeeler correlates the AST nodes with their corresponding instructions and monitors the script's entire execution process. Subsequently, PowerPeeler dynamically tracks the execution of these instructions and records their execution results. Finally, PowerPeeler stringifies these results to replace the corresponding obfuscated script pieces and reconstruct the deobfuscated script.
To evaluate the effectiveness of PowerPeeler, we collect 1,736,669 real-world malicious PowerShell samples with diversity obfuscation methods. We compare PowerPeeler with five state-of-the-art deobfuscation tools and GPT-4. The evaluation results demonstrate that PowerPeeler can effectively handle all well-known obfuscation methods. Additionally, the deobfuscation correctness rate of PowerPeeler reaches 95%, significantly surpassing that of other tools. PowerPeeler not only recovers the highest amount of sensitive data but also maintains a semantic consistency over 97%, which is also the best. Moreover, PowerPeeler effectively obtains the largest quantity of valid deobfuscated results within a limited time frame. Furthermore, PowerPeeler is extendable and can be used as a helpful tool for other cyber security solutions.
Cross submissions for Friday, 7 June 2024 (showing 1 of 1 entries )
- [10] arXiv:2306.01376 (replaced) [pdf, ps, html, other]
-
Title: DSHGT: Dual-Supervisors Heterogeneous Graph Transformer -- A pioneer study of using heterogeneous graph learning for detecting software vulnerabilitiesSubjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
Vulnerability detection is a critical problem in software security and attracts growing attention both from academia and industry. Traditionally, software security is safeguarded by designated rule-based detectors that heavily rely on empirical expertise, requiring tremendous effort from software experts to generate rule repositories for large code corpus. Recent advances in deep learning, especially Graph Neural Networks (GNN), have uncovered the feasibility of automatic detection of a wide range of software vulnerabilities. However, prior learning-based works only break programs down into a sequence of word tokens for extracting contextual features of codes, or apply GNN largely on homogeneous graph representation (e.g., AST) without discerning complex types of underlying program entities (e.g., methods, variables). In this work, we are one of the first to explore heterogeneous graph representation in the form of Code Property Graph and adapt a well-known heterogeneous graph network with a dual-supervisor structure for the corresponding graph learning task. Using the prototype built, we have conducted extensive experiments on both synthetic datasets and real-world projects. Compared with the state-of-the-art baselines, the results demonstrate promising effectiveness in this research direction in terms of vulnerability detection performance (average F1 improvements over 10\% in real-world projects) and transferability from C/C++ to other programming languages (average F1 improvements over 11%).
- [11] arXiv:2401.04621 (replaced) [pdf, ps, other]
-
Title: DebugBench: Evaluating Debugging Capability of Large Language ModelsRunchu Tian, Yining Ye, Yujia Qin, Xin Cong, Yankai Lin, Yinxu Pan, Yesai Wu, Haotian Hui, Weichuan Liu, Zhiyuan Liu, Maosong SunComments: Accepted as Findings of ACL 2024Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Large Language Models (LLMs) have demonstrated exceptional coding capability. However, as another critical component of programming proficiency, the debugging capability of LLMs remains relatively unexplored. Previous evaluations of LLMs' debugging ability are significantly limited by the risk of data leakage, the scale of the dataset, and the variety of tested bugs. To overcome these deficiencies, we introduce `DebugBench', an LLM debugging benchmark consisting of 4,253 instances. It covers four major bug categories and 18 minor types in C++, Java, and Python. To construct DebugBench, we collect code snippets from the LeetCode community, implant bugs into source data with GPT-4, and assure rigorous quality checks. We evaluate two commercial and four open-source models in a zero-shot scenario. We find that (1) while closed-source models exhibit inferior debugging performance compared to humans, open-source models relatively lower pass rate scores; (2) the complexity of debugging notably fluctuates depending on the bug category; (3) incorporating runtime feedback has a clear impact on debugging performance which is not always helpful. As an extension, we also compare LLM debugging and code generation, revealing a strong correlation between them for closed-source models. These findings will benefit the development of LLMs in debugging.
- [12] arXiv:2401.16467 (replaced) [pdf, ps, html, other]
-
Title: ReGAL: Refactoring Programs to Discover Generalizable AbstractionsComments: ICML 2024 Camera-Ready; First two authors contributed equally; Code: this https URLSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Programming Languages (cs.PL)
While large language models (LLMs) are increasingly being used for program synthesis, they lack the global view needed to develop useful abstractions; they generally predict programs one at a time, often repeating the same functionality. Generating redundant code from scratch is both inefficient and error-prone. To address this, we propose Refactoring for Generalizable Abstraction Learning (ReGAL), a gradient-free method for learning a library of reusable functions via code refactorization, i.e., restructuring code without changing its execution output. ReGAL learns from a small set of existing programs, iteratively verifying and refining its abstractions via execution. We find that the shared function libraries discovered by ReGAL make programs easier to predict across diverse domains. On five datasets -- LOGO graphics generation, Date reasoning, TextCraft (a Minecraft-based text-game) MATH, and TabMWP -- both open-source and proprietary LLMs improve in accuracy when predicting programs with ReGAL functions. For CodeLlama-13B, ReGAL results in absolute accuracy increases of 11.5% on LOGO, 26.1% on date understanding, and 8.1% on TextCraft, outperforming GPT-3.5 in two of three domains. Our analysis reveals ReGAL's abstractions encapsulate frequently-used subroutines as well as environment dynamics.
- [13] arXiv:2402.01156 (replaced) [pdf, ps, other]
-
Title: An Empirical Study on Low Code Programming using Traditional vs Large Language Model SupportYongkun Liu, Jiachi Chen, Tingting Bi, John Grundy, Yanlin Wang, Jianxing Yu, Ting Chen, Yutian Tang, Zibin ZhengSubjects: Software Engineering (cs.SE)
Low-code programming (LCP) refers to programming using models at higher levels of abstraction, resulting in less manual and more efficient programming, and reduced learning effort for amateur developers. Many LCP tools have rapidly evolved and have benefited from the concepts of visual programming languages (VPLs) and programming by demonstration (PBD). With huge increase in interest in using large language models (LLMs) in software engineering, LLM-based LCP has began to become increasingly important. However, the technical principles and application scenarios of traditional approaches to LCP and LLM-based LCP are significantly different. Understanding these key differences and characteristics in the application of the two approaches to LCP by users is crucial for LCP providers in improving existing and developing new LCP tools, and in better assisting users in choosing the appropriate LCP technology. We conducted an empirical study of both traditional LCP and LLM-based LCP. We analyzed developers' discussions on Stack Overflow (SO) over the past three years and then explored the similarities and differences between traditional LCP and LLM-based LCP features and developer feedback. Our findings reveal that while traditional LCP and LLM-based LCP share common primary usage scenarios, they significantly differ in scope, limitations and usage throughout the software development lifecycle, particularly during the implementation phase. We also examine how LLMs impact and integrate with LCP, discussing the latest technological developments in LLM-based LCP, such as its integration with VPLs and the application of LLM Agents in software engineering.
- [14] arXiv:2402.07844 (replaced) [pdf, ps, html, other]
-
Title: Mercury: A Code Efficiency Benchmark for LLM Code SynthesisSubjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
Amidst the recent strides in evaluating Large Language Models for Code (Code LLMs), existing benchmarks have mainly focused on the functional correctness of generated code, neglecting the importance of their computational efficiency. To fill the gap, we present Mercury, the first code efficiency benchmark for Code LLMs. It comprises 1,889 Python tasks, each accompanied by adequate solutions that serve as real-world efficiency baselines, enabling a comprehensive analysis of the runtime distribution. Based on the distribution, we introduce a new metric Beyond, which computes a runtime-percentile-weighted Pass score to reflect functional correctness and code efficiency simultaneously. On Mercury, leading Code LLMs can achieve 65% on Pass, while less than 50% on Beyond. Given that an ideal Beyond score would be aligned with the Pass score, it indicates that while Code LLMs exhibit impressive capabilities in generating functionally correct code, there remains a notable gap in their efficiency. Finally, our empirical experiments reveal that Direct Preference Optimization (DPO) serves as a robust baseline for enhancing code efficiency compared with Supervised Fine Tuning (SFT), which paves a promising avenue for future exploration of efficient code generation. Our code and data are available on GitHub: this https URL.
- [15] arXiv:2403.07974 (replaced) [pdf, ps, html, other]
-
Title: LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for CodeNaman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, Ion StoicaComments: Website - this https URLSubjects: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Large Language Models (LLMs) applied to code-related applications have emerged as a prominent field, attracting significant interest from both academia and industry. However, as new and improved LLMs are developed, existing evaluation benchmarks (e.g., HumanEval, MBPP) are no longer sufficient for assessing their capabilities. In this work, we propose LiveCodeBench, a comprehensive and contamination-free evaluation of LLMs for code, which continuously collects new problems over time from contests across three competition platforms, namely LeetCode, AtCoder, and CodeForces. Notably, our benchmark also focuses on a broader range of code related capabilities, such as self-repair, code execution, and test output prediction, beyond just code generation. Currently, LiveCodeBench hosts four hundred high-quality coding problems that were published between May 2023 and May 2024. We have evaluated 18 base LLMs and 34 instruction-tuned LLMs on LiveCodeBench. We present empirical findings on contamination, holistic performance comparisons, potential overfitting in existing benchmarks as well as individual model comparisons. We will release all prompts and model completions for further community analysis, along with a general toolkit for adding new scenarios and model
- [16] arXiv:2405.10467 (replaced) [pdf, ps, html, other]
-
Title: Agent Design Pattern Catalogue: A Collection of Architectural Patterns for Foundation Model based AgentsSubjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Foundation model-enabled generative artificial intelligence facilitates the development and implementation of agents, which can leverage distinguished reasoning and language processing capabilities to takes a proactive, autonomous role to pursue users' goals. Nevertheless, there is a lack of systematic knowledge to guide practitioners in designing the agents considering challenges of goal-seeking (including generating instrumental goals and plans), such as hallucinations inherent in foundation models, explainability of reasoning process, complex accountability, etc. To address this issue, we have performed a systematic literature review to understand the state-of-the-art foundation model-based agents and the broader ecosystem. In this paper, we present a pattern catalogue consisting of 17 architectural patterns with analyses of the context, forces, and trade-offs as the outcomes from the previous literature review. The proposed catalogue can provide holistic guidance for the effective use of patterns, and support the architecture design of foundation model-based agents by facilitating goal-seeking and plan generation.
- [17] arXiv:2406.02624 (replaced) [pdf, ps, html, other]
-
Title: Take a Step Further: Understanding Page Spray in Linux Kernel ExploitationZiyi Guo, Dang K Le, Zhenpeng Lin, Kyle Zeng, Ruoyu Wang, Tiffany Bao, Yan Shoshitaishvili, Adam Doupé, Xinyu XingSubjects: Cryptography and Security (cs.CR); Software Engineering (cs.SE)
Recently, a novel method known as Page Spray emerges, focusing on page-level exploitation for kernel vulnerabilities. Despite the advantages it offers in terms of exploitability, stability, and compatibility, comprehensive research on Page Spray remains scarce. Questions regarding its root causes, exploitation model, comparative benefits over other exploitation techniques, and possible mitigation strategies have largely remained unanswered. In this paper, we conduct a systematic investigation into Page Spray, providing an in-depth understanding of this exploitation technique. We introduce a comprehensive exploit model termed the \sys model, elucidating its fundamental principles. Additionally, we conduct a thorough analysis of the root causes underlying Page Spray occurrences within the Linux Kernel. We design an analyzer based on the Page Spray analysis model to identify Page Spray callsites. Subsequently, we evaluate the stability, exploitability, and compatibility of Page Spray through meticulously designed experiments. Finally, we propose mitigation principles for addressing Page Spray and introduce our own lightweight mitigation approach. This research aims to assist security researchers and developers in gaining insights into Page Spray, ultimately enhancing our collective understanding of this emerging exploitation technique and making improvements to the community.