{
"query": "(\"program repair\" OR \"software repair\" OR \"automatic repair\" OR \"code repair\" OR \"bug repair\" OR \"bug fix\" OR \"code fix\" OR \"automatic fix\" OR \"patch generation\" OR \"patch correctness\" OR \"patch validation\" OR \"fix generation\" OR \"code transformation\" OR \"code edit\" OR \"fix error\") AND (\"LLM\" OR \"LLMs\" OR \"Large Language Model\" OR \"Large Language Models\" OR \"Pre-trained\" OR \"Pretrained\" OR \"Pre-training\" OR \"Pretraining\" OR \"PLM\" OR \"PLMs\" OR \"BERT\" OR \"CodeBERT\" OR \"T5\" OR \"CodeT5\" OR \"GPT\" OR \"CodeGPT\" OR \"Codex\" OR \"ChatGPT\" OR \"Llama\" OR \"CodeLlama\" OR \"GPT-3\" OR \"GPT-4\" OR \"GPT-3.5\" OR \"neural\" OR \"machine learning\" OR \"deep learning\" OR \"transformer\" OR \"transformers\" OR \"model\" OR \"models\" OR \"transfer learning\" OR \"supervised learning\")",
"search_queries_executed": 15,
"filtering_groups": 1,
"total_unique_papers": 355,
"intersected_papers": 285,
"filtered_papers": 285,
"papers": [
{
"id": "2507.18755v1",
"title": "Agentic Program Repair from Test Failures at Scale: A Neuro-symbolic approach with static analysis and test execution feedback",
"authors": [
"Chandra Maddila",
"Adam Tait",
"Claire Chang",
"Daniel Cheng",
"Nauman Ahmad",
"Vijayaraghavan Murali",
"Marshall Roch",
"Arnaud Avondet",
"Aaron Meltzer",
"Victor Montalvao",
"Michael Hopko",
"Chris Waterson",
"Parth Thakkar",
"Renuka Fernandez",
"Kristian Kristensen",
"Sivan Barzily",
"Sherry Chen",
"Rui Abreu",
"Nachiappan Nagappan",
"Payam Shodjai",
"Killian Murphy",
"James Everingham",
"Aparna Ramani",
"Peter C. Rigby"
],
"abstract": "Aim: With the advent of LLMs, sophisticated agentic program repair has become\nviable at large organizations with large codebases. In this work, we develop an\nEngineering Agent that fixes the source code based on test failures at scale\nacross diverse software offerings internally.\n Method: Using Llama as the base, we employ the ReAct harness to develop an\nagent. We start with a test failure that was triaged by a rule-based test\nfailure bot. We then set up an agentic harness and allow the agent to reason\nand run a set of 15 actions from reading a file to generating a patch. We\nprovide feedback to the agent through static analysis and test failures so it\ncan refine its solution. We leverage an LLM-as-a-Judge to ensure that the patch\nconforms to the standards followed by a human review to land fixes.\n Benchmark Findings: We curated offline benchmarks for our patch generator,\nthe Engineering Agent loop, and the LLM-as-a-Judge. In offline evaluations we\nfound that a specialized 70B model is highly competitive with the much larger\nbut vanilla Llama-405B. In an ablation study, we found that the ReAct harness\n(neural model) benefited from the symbolic information from static analysis\ntools and test execution traces. A model that strikes a balance between the\nsolve rate and error rate vs the cost and latency has a benchmark solve rate of\n42.3% using an average 11.8 feedback iterations.\n Production Findings: In a three month period, 80% of the generated fixes were\nreviewed, of which 31.5% were landed (25.5% of the total number of generated\nfixes).\n Feedback from Engineers: We used open coding to extract qualitative themes\nfrom engineers' feedback. We saw positive feedback in the form of quick\napprovals, gratitude, and surprise. We also found mixed feedback when the\nEngineering Agent's solution was partially correct and it served as a good\nstarting point.",
"categories": [
"cs.SE",
"cs.AI",
"cs.PL"
],
"published": "2025-07-24T19:12:32+00:00",
"url": "http://arxiv.org/pdf/2507.18755v1",
"resource_uri": "arxiv://2507.18755v1",
"citation_count": 0
},
{
"id": "2507.18140v1",
"title": "MathOPEval: A Fine-grained Evaluation Benchmark for Visual Operations of MLLMs in Mathematical Reasoning",
"authors": [
"Xiaoyuan Li",
"Moxin Li",
"Wenjie Wang",
"Rui Men",
"Yichang Zhang",
"Fuli Feng",
"Dayiheng Liu",
"Junyang Lin"
],
"abstract": "Recent progress in Multi-modal Large Language Models (MLLMs) has enabled\nstep-by-step multi-modal mathematical reasoning by performing visual operations\nbased on the textual instructions. A promising approach uses code as an\nintermediate representation to precisely express and manipulate the images in\nthe reasoning steps. However, existing evaluations focus mainly on text-only\nreasoning outputs, leaving the MLLM's ability to perform accurate visual\noperations via code largely unexplored. This work takes a first step toward\naddressing that gap by evaluating MLLM's code-based capabilities in multi-modal\nmathematical reasoning.Specifically, our framework focuses on two key\nevaluation aspects: (1) Multi-modal Code Generation (MCG) evaluates the model's\nability to accurately understand and construct visualizations from scratch. (2)\nMulti-modal Code Editing (MCE) assesses the model's capacity for fine-grained\noperations, which include three types: Deletion, Modification and Annotation.\nTo evaluate the above tasks, we incorporate a dataset that covers the five most\npopular types of mathematical figures, including geometric diagrams, function\nplots, and three types of statistical charts, to provide a comprehensive and\neffective measurement of existing MLLMs. Our experimental evaluation involves\nnine mainstream MLLMs, and the results reveal that existing models still lag\nsignificantly behind human performance in performing fine-grained visual\noperations.",
"categories": [
"cs.CL"
],
"published": "2025-07-24T07:03:11+00:00",
"url": "http://arxiv.org/pdf/2507.18140v1",
"resource_uri": "arxiv://2507.18140v1",
"citation_count": 0
},
{
"id": "2507.17691v1",
"title": "CASCADE: LLM-Powered JavaScript Deobfuscator at Google",
"authors": [
"Shan Jiang",
"Pranoy Kovuri",
"David Tao",
"Zhixun Tan"
],
"abstract": "Software obfuscation, particularly prevalent in JavaScript, hinders code\ncomprehension and analysis, posing significant challenges to software testing,\nstatic analysis, and malware detection. This paper introduces CASCADE, a novel\nhybrid approach that integrates the advanced coding capabilities of Gemini with\nthe deterministic transformation capabilities of a compiler Intermediate\nRepresentation (IR), specifically JavaScript IR (JSIR). By employing Gemini to\nidentify critical prelude functions, the foundational components underlying the\nmost prevalent obfuscation techniques, and leveraging JSIR for subsequent code\ntransformations, CASCADE effectively recovers semantic elements like original\nstrings and API names, and reveals original program behaviors. This method\novercomes limitations of existing static and dynamic deobfuscation techniques,\neliminating hundreds to thousands of hardcoded rules while achieving\nreliability and flexibility. CASCADE is already deployed in Google's production\nenvironment, demonstrating substantial improvements in JavaScript deobfuscation\nefficiency and reducing reverse engineering efforts.",
"categories": [
"cs.SE",
"cs.AI",
"cs.CR",
"cs.LG",
"cs.PL"
],
"published": "2025-07-23T16:57:32+00:00",
"url": "http://arxiv.org/pdf/2507.17691v1",
"resource_uri": "arxiv://2507.17691v1",
"citation_count": 0
},
{
"id": "2507.17642v2",
"title": "Motivic classes of fixed-generators Hilbert schemes of unibranch curve singularities and Igusa zeta functions",
"authors": [
"Ilaria Rossinelli"
],
"abstract": "This paper delves into the study of Hilbert schemes of unibranch plane curves\nwhose points have a fixed number of minimal generators. Building on the work of\nOblomkov, Rasmussen and Shende we provide a formula for their motivic classes\nand investigate the relationship with principal Hilbert schemes of the same\ngiven unibranch curve. In addition, the paper specializes this study to the\ncase of $(p,q)$-curves, where we obtain more structured results for the motivic\nclasses of fixed-generators Hilbert schemes: their positivity and topological\ninvariance, and an explicit relationship to one-generator schemes i.e.\nprincipal ideals in $\\widehat{\\mathcal{O}}_{C,0}$. Finally, we focus on a\nspecial open component in the one-generator locus, whose motivic class is\nnaturally related to the motivic measure on the arc scheme $\\mathbb A^2_\\infty$\nof the plane introduced by Denef and Loeser as well as to the Igusa zeta\nfunction. We also provide an explicit formulation of these motivic classes in\nterms of an embedded resolution of the singularity, proving their polynomiality\nas well as making them an interesting topological invariant of the given curve.",
"categories": [
"math.AG"
],
"published": "2025-07-23T16:06:23+00:00",
"url": "http://arxiv.org/pdf/2507.17642v2",
"resource_uri": "arxiv://2507.17642v2",
"citation_count": 0
},
{
"id": "2507.17548v1",
"title": "CodeReasoner: Enhancing the Code Reasoning Ability with Reinforcement Learning",
"authors": [
"Lingxiao Tang",
"He Ye",
"Zhongxin Liu",
"Xiaoxue Ren",
"Lingfeng Bao"
],
"abstract": "Code reasoning is a fundamental capability for large language models (LLMs)\nin the code domain. It involves understanding and predicting a program's\nexecution behavior, such as determining the output for a given input or whether\na specific statement will be executed. This capability is essential for\ndownstream tasks like debugging, code generation, and program repair. Prior\napproaches mainly rely on supervised fine-tuning to improve performance in code\nreasoning tasks. However, they often show limited gains and fail to generalize\nacross diverse scenarios. We argue this is due to two core issues: the low\nquality of training data and the limitations of supervised fine-tuning, which\nstruggles to teach general reasoning skills. To address these challenges, we\npropose CodeReasoner, a framework that spans both dataset construction and a\ntwo-stage training process. First, we introduce a method to construct datasets\nthat focus on the core execution logic of Python programs. Next, we apply\ninstruction tuning to inject execution-specific knowledge distilled from a\npowerful teacher model. We then enhance reasoning and generalization through\nGRPO reinforcement learning on top of the fine-tuned model. Experiments on\nthree widely-used code reasoning benchmarks show that CodeReasoner improves\nperformance by 27.1% to 40.2% over prior methods using a 7B model. Notably, the\n7B model matches GPT-4o on key tasks like input/output and coverage prediction.\nWhen scaled to 14B, CodeReasoner outperforms GPT-4o across all benchmarks.\nAblation studies confirm the effectiveness of each training stage and highlight\nthe importance of reasoning chains.",
"categories": [
"cs.SE"
],
"published": "2025-07-23T14:26:58+00:00",
"url": "http://arxiv.org/pdf/2507.17548v1",
"resource_uri": "arxiv://2507.17548v1",
"citation_count": 0
},
{
"id": "2507.16887v1",
"title": "Revisiting Pre-trained Language Models for Vulnerability Detection",
"authors": [
"Youpeng Li",
"Weiliang Qi",
"Xuyu Wang",
"Fuxun Yu",
"Xinda Wang"
],
"abstract": "The rapid advancement of pre-trained language models (PLMs) has demonstrated\npromising results for various code-related tasks. However, their effectiveness\nin detecting real-world vulnerabilities remains a critical challenge. % for the\nsecurity community. While existing empirical studies evaluate PLMs for\nvulnerability detection (VD), their inadequate consideration in data\npreparation, evaluation setups, and experimental settings undermines the\naccuracy and comprehensiveness of evaluations. This paper introduces RevisitVD,\nan extensive evaluation of 17 PLMs spanning smaller code-specific PLMs and\nlarge-scale PLMs using newly constructed datasets. Specifically, we compare the\nperformance of PLMs under both fine-tuning and prompt engineering, assess their\neffectiveness and generalizability across various training and testing\nsettings, and analyze their robustness against code normalization, abstraction,\nand semantic-preserving transformations.\n Our findings reveal that, for VD tasks, PLMs incorporating pre-training tasks\ndesigned to capture the syntactic and semantic patterns of code outperform both\ngeneral-purpose PLMs and those solely pre-trained or fine-tuned on large code\ncorpora. However, these models face notable challenges in real-world scenarios,\nsuch as difficulties in detecting vulnerabilities with complex dependencies,\nhandling perturbations introduced by code normalization and abstraction, and\nidentifying semantic-preserving vulnerable code transformations. Also, the\ntruncation caused by the limited context windows of PLMs can lead to a\nnon-negligible amount of labeling errors. This study underscores the importance\nof thorough evaluations of model performance in practical scenarios and\noutlines future directions to help enhance the effectiveness of PLMs for\nrealistic VD applications.",
"categories": [
"cs.CR",
"cs.AI",
"cs.LG",
"cs.SE"
],
"published": "2025-07-22T17:58:49+00:00",
"url": "http://arxiv.org/pdf/2507.16887v1",
"resource_uri": "arxiv://2507.16887v1",
"citation_count": 0
},
{
"id": "2507.15822v1",
"title": "Do AI models help produce verified bug fixes?",
"authors": [
"Li Huang",
"Ilgiz Mustafin",
"Marco Piccioni",
"Alessandro Schena",
"Reto Weber",
"Bertrand Meyer"
],
"abstract": "Among areas of software engineering where AI techniques -- particularly,\nLarge Language Models -- seem poised to yield dramatic improvements, an\nattractive candidate is Automatic Program Repair (APR), the production of\nsatisfactory corrections to software bugs. Does this expectation materialize in\npractice? How do we find out, making sure that proposed corrections actually\nwork? If programmers have access to LLMs, how do they actually use them to\ncomplement their own skills?\n To answer these questions, we took advantage of the availability of a\nprogram-proving environment, which formally determines the correctness of\nproposed fixes, to conduct a study of program debugging with two randomly\nassigned groups of programmers, one with access to LLMs and the other without,\nboth validating their answers through the proof tools. The methodology relied\non a division into general research questions (Goals in the Goal-Query-Metric\napproach), specific elements admitting specific answers (Queries), and\nmeasurements supporting these answers (Metrics). While applied so far to a\nlimited sample size, the results are a first step towards delineating a proper\nrole for AI and LLMs in providing guaranteed-correct fixes to program bugs.\n These results caused surprise as compared to what one might expect from the\nuse of AI for debugging and APR. The contributions also include: a detailed\nmethodology for experiments in the use of LLMs for debugging, which other\nprojects can reuse; a fine-grain analysis of programmer behavior, made possible\nby the use of full-session recording; a definition of patterns of use of LLMs,\nwith 7 distinct categories; and validated advice for getting the best of LLMs\nfor debugging and Automatic Program Repair.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2025-07-21T17:30:16+00:00",
"url": "http://arxiv.org/pdf/2507.15822v1",
"resource_uri": "arxiv://2507.15822v1",
"citation_count": 0
},
{
"id": "2507.15664v1",
"title": "VeriRAG: A Retrieval-Augmented Framework for Automated RTL Testability Repair",
"authors": [
"Haomin Qi",
"Yuyang Du",
"Lihao Zhang",
"Soung Chang Liew",
"Kexin Chen",
"Yining Du"
],
"abstract": "Large language models (LLMs) have demonstrated immense potential in\ncomputer-aided design (CAD), particularly for automated debugging and\nverification within electronic design automation (EDA) tools. However, Design\nfor Testability (DFT) remains a relatively underexplored area. This paper\npresents VeriRAG, the first LLM-assisted DFT-EDA framework. VeriRAG leverages a\nRetrieval-Augmented Generation (RAG) approach to enable LLM to revise code to\nensure DFT compliance. VeriRAG integrates (1) an autoencoder-based similarity\nmeasurement model for precise retrieval of reference RTL designs for the LLM,\nand (2) an iterative code revision pipeline that allows the LLM to ensure DFT\ncompliance while maintaining synthesizability. To support VeriRAG, we introduce\nVeriDFT, a Verilog-based DFT dataset curated for DFT-aware RTL repairs. VeriRAG\nretrieves structurally similar RTL designs from VeriDFT, each paired with a\nrigorously validated correction, as references for code repair. With VeriRAG\nand VeriDFT, we achieve fully automated DFT correction -- resulting in a\n7.72-fold improvement in successful repair rate compared to the zero-shot\nbaseline (Fig. 5 in Section V). Ablation studies further confirm the\ncontribution of each component of the VeriRAG framework. We open-source our\ndata, models, and scripts at https://github.com/yuyangdu01/LLM4DFT.",
"categories": [
"cs.AR"
],
"published": "2025-07-21T14:25:52+00:00",
"url": "http://arxiv.org/pdf/2507.15664v1",
"resource_uri": "arxiv://2507.15664v1",
"citation_count": 0
},
{
"id": "2507.15599v1",
"title": "Applying the Chinese Wall Reverse Engineering Technique to Large Language Model Code Editing",
"authors": [
"Manatsawin Hanmongkolchai"
],
"abstract": "Large language models for code (Code LLM) are increasingly utilized in\nprogramming environments. Despite their utility, the training datasets for top\nLLM remain undisclosed, raising concerns about potential copyright violations.\nSome models, such as Pleias and Comma put emphasis on data curation and\nlicenses, however, with limited training data these models are not competitive\nand only serve as proof of concepts. To improve the utility of these models, we\npropose an application of the \"Chinese Wall\" technique, inspired by the reverse\nengineering technique of the same name -- a high quality model is used to\ngenerate detailed instructions for a weaker model. By doing so, a weaker but\nethically aligned model may be used to perform complicated tasks that,\notherwise, can only be completed by more powerful models. In our evaluation,\nwe've found that this technique improves Comma v0.1 1T's performance in\nCanItEdit benchmark by over 66%, and Starcoder2 Instruct by roughly 20%\ncompared to when running the same model on the benchmark alone. The practical\napplication of this technique today, however, may be limited due to the lack of\nmodels trained on public domain content without copyright restrictions.",
"categories": [
"cs.SE",
"cs.LG"
],
"published": "2025-07-21T13:21:29+00:00",
"url": "http://arxiv.org/pdf/2507.15599v1",
"resource_uri": "arxiv://2507.15599v1",
"citation_count": 0
},
{
"id": "2507.15251v1",
"title": "Input Reduction Enhanced LLM-based Program Repair",
"authors": [
"Boyang Yang",
"Luyao Ren",
"Xin Yin",
"Jiadong Ren",
"Haoye Tian",
"Shunfu Jin"
],
"abstract": "Large Language Models (LLMs) have shown great potential in Automated Program\nRepair (APR). Test inputs, being crucial for reasoning the root cause of\nfailures, are always included in the prompt for LLM-based APR. Unfortunately,\nLLMs struggle to retain key information in long prompts. When the test inputs\nare extensive in the prompt, this may trigger the \"lost-in-the-middle\" issue,\ncompromising repair performance. To address this, we propose ReduceFix, an\nLLM-based APR approach with a built-in component that automatically reduces\ntest inputs while retaining their failure-inducing behavior. ReduceFix prompts\nan LLM to generate a reducer that minimizes failure-inducing test inputs\nwithout human effort, and then feeds the reduced failure-inducing inputs to\nguide patch generation.\n For targeted evaluation, we constructed LFTBench, the first long-input APR\nbenchmark with 200 real bugs from 20 programming tasks, each paired with a\nfailure-inducing input whose median size is 1 MB. On this benchmark, ReduceFix\nshrinks inputs by 89.1% on average and improves overall pass@10 by up to 53.8%\nrelative to a prompt that includes the original test, and by 17.6% compared\nwith omitting the test entirely. Adding the same reduction step to ChatRepair\nincreases its fix rate by 21.3% without other changes. Ablation studies further\nhighlight the impact of input length and compressed failure information on\nrepair success. These results underscore that automatically reducing failing\ninputs is a practical and powerful complement to LLM-based APR, significantly\nimproving its scalability and effectiveness.",
"categories": [
"cs.SE"
],
"published": "2025-07-21T05:26:32+00:00",
"url": "http://arxiv.org/pdf/2507.15251v1",
"resource_uri": "arxiv://2507.15251v1",
"citation_count": 0
},
{
"id": "2507.12415v1",
"title": "SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories?",
"authors": [
"Xinyi He",
"Qian Liu",
"Mingzhe Du",
"Lin Yan",
"Zhijie Fan",
"Yiming Huang",
"Zejian Yuan",
"Zejun Ma"
],
"abstract": "Code performance optimization is paramount in real-world software engineering\nand critical for production-level systems. While Large Language Models (LLMs)\nhave demonstrated impressive capabilities in code generation and bug fixing,\ntheir proficiency in enhancing code performance at the repository level remains\nlargely unexplored. To address this gap, we introduce SWE-Perf, the first\nbenchmark specifically designed to systematically evaluate LLMs on code\nperformance optimization tasks within authentic repository contexts. SWE-Perf\ncomprises 140 carefully curated instances, each derived from\nperformance-improving pull requests from popular GitHub repositories. Each\nbenchmark instance includes the relevant codebase, target functions,\nperformance-related tests, expert-authored patches, and executable\nenvironments. Through a comprehensive evaluation of representative methods that\nspan file-level and repo-level approaches (e.g., Agentless and OpenHands), we\nreveal a substantial capability gap between existing LLMs and expert-level\noptimization performance, highlighting critical research opportunities in this\nemerging field.",
"categories": [
"cs.SE"
],
"published": "2025-07-16T17:05:17+00:00",
"url": "http://arxiv.org/pdf/2507.12415v1",
"resource_uri": "arxiv://2507.12415v1",
"citation_count": 0
},
{
"id": "2507.10535v1",
"title": "CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks",
"authors": [
"Hongchao Jiang",
"Yiming Chen",
"Yushi Cao",
"Hung-yi Lee",
"Robby T. Tan"
],
"abstract": "Large Language Models (LLMs) have significantly advanced the state-of-the-art\nin various coding tasks. Beyond directly answering user queries, LLMs can also\nserve as judges, assessing and comparing the quality of responses generated by\nother models. Such an evaluation capability is crucial both for benchmarking\ndifferent LLMs and for improving response quality through response ranking.\nHowever, despite the growing adoption of the LLM-as-a-Judge paradigm, its\neffectiveness in coding scenarios remains underexplored due to the absence of\ndedicated benchmarks. To address this gap, we introduce CodeJudgeBench, a\nbenchmark explicitly designed to evaluate the performance of LLM-as-a-Judge\nmodels across three critical coding tasks: code generation, code repair, and\nunit test generation. Through comprehensive benchmarking of 26 LLM-as-a-Judge\nmodels, we find that recent thinking models significantly outperform\nnon-thinking models on our carefully designed code judging tasks. Notably, even\nrelatively small thinking models, such as Qwen3-8B, can outperform specially\ntrained LLM-as-a-Judge models up to 70B in size. Nevertheless, all models still\nexhibit significant randomness in their judgment of coding tasks. For pairwise\njudging tasks, simply changing the order in which responses are presented can\nsubstantially impact accuracy. In addition, when judging code and unit tests\nwritten by different LLMs, LLM-as-a-Judge models also show variance in\nperformance. This sensitivity raises concerns about the reliability and\nconsistency of LLM-as-a-Judge in coding scenarios. Lastly, we study optimal\nprompting strategies for LLM-as-a-Judge. We find that using pair-wise\ncomparison outperforms scalar point-wise judging. Furthermore, retaining\ncomments and reasoning in the full, unprocessed LLM response leads to improved\njudge performance.",
"categories": [
"cs.CL",
"cs.AI",
"cs.SE"
],
"published": "2025-07-14T17:56:29+00:00",
"url": "http://arxiv.org/pdf/2507.10535v1",
"resource_uri": "arxiv://2507.10535v1",
"citation_count": 0
},
{
"id": "2507.10103v1",
"title": "Accelerating Automatic Program Repair with Dual Retrieval-Augmented Fine-Tuning and Patch Generation on Large Language Models",
"authors": [
"Hanyang Guo",
"Xiaoheng Xie",
"Hong-Ning Dai",
"Peng Di",
"Yu Zhang",
"Bishenghui Tao",
"Zibin Zheng"
],
"abstract": "Automated Program Repair (APR) is essential for ensuring software reliability\nand quality while enhancing efficiency and reducing developers' workload.\nAlthough rule-based and learning-based APR methods have demonstrated their\neffectiveness, their performance was constrained by the defect type of repair,\nthe quality of training data, and the size of model parameters. Recently, Large\nLanguage Models (LLMs) combined with Retrieval-Augmented-Generation (RAG) have\nbeen increasingly adopted in APR tasks. However, current code LLMs and RAG\ndesigns neither fully address code repair tasks nor consider code-specific\nfeatures. To overcome these limitations, we propose SelRepair, a novel APR\napproach with integration of a fine-tuned LLM with a newly-designed dual RAG\nmodule. This approach uses a bug-fix pair dataset for fine-tuning and\nincorporates semantic and syntactic/structural similarity information through\nan RAG selection gate. This design ensures relevant information is retrieved\nefficiently, thereby reducing token length and inference time. Evaluations on\nJava datasets show SelRepair outperforms other APR methods, achieving 26.29%\nand 17.64% in terms of exact match (EM) on different datasets while reducing\ninference time by at least 6.42% with controlled input lengths.",
"categories": [
"cs.SE",
"cs.CR"
],
"published": "2025-07-14T09:41:51+00:00",
"url": "http://arxiv.org/pdf/2507.10103v1",
"resource_uri": "arxiv://2507.10103v1",
"citation_count": 0
},
{
"id": "2507.09411v1",
"title": "LLMalMorph: On The Feasibility of Generating Variant Malware using Large-Language-Models",
"authors": [
"Md Ajwad Akil",
"Adrian Shuai Li",
"Imtiaz Karim",
"Arun Iyengar",
"Ashish Kundu",
"Vinny Parla",
"Elisa Bertino"
],
"abstract": "Large Language Models (LLMs) have transformed software development and\nautomated code generation. Motivated by these advancements, this paper explores\nthe feasibility of LLMs in modifying malware source code to generate variants.\nWe introduce LLMalMorph, a semi-automated framework that leverages semantical\nand syntactical code comprehension by LLMs to generate new malware variants.\nLLMalMorph extracts function-level information from the malware source code and\nemploys custom-engineered prompts coupled with strategically defined code\ntransformations to guide the LLM in generating variants without\nresource-intensive fine-tuning. To evaluate LLMalMorph, we collected 10 diverse\nWindows malware samples of varying types, complexity and functionality and\ngenerated 618 variants. Our thorough experiments demonstrate that it is\npossible to reduce the detection rates of antivirus engines of these malware\nvariants to some extent while preserving malware functionalities. In addition,\ndespite not optimizing against any Machine Learning (ML)-based malware\ndetectors, several variants also achieved notable attack success rates against\nan ML-based malware classifier. We also discuss the limitations of current LLM\ncapabilities in generating malware variants from source code and assess where\nthis emerging technology stands in the broader context of malware variant\ngeneration.",
"categories": [
"cs.CR"
],
"published": "2025-07-12T22:11:10+00:00",
"url": "http://arxiv.org/pdf/2507.09411v1",
"resource_uri": "arxiv://2507.09411v1",
"citation_count": 0
},
{
"id": "2507.08671v1",
"title": "LLMCup: Ranking-Enhanced Comment Updating with LLMs",
"authors": [
"Hua Ge",
"Juan Zhai",
"Minxue Pan",
"Fusen He",
"Ziyue Tan"
],
"abstract": "While comments are essential for enhancing code readability and\nmaintainability in modern software projects, developers are often motivated to\nupdate code but not comments, leading to outdated or inconsistent documentation\nthat hinders future understanding and maintenance. Recent approaches such as\nCUP and HebCup have attempted automatic comment updating using neural\nsequence-to-sequence models and heuristic rules, respectively. However, these\nmethods can miss or misinterpret crucial information during comment updating,\nresulting in inaccurate comments, and they often struggle with complex update\nscenarios. Given these challenges, a promising direction lies in leveraging\nlarge language models (LLMs), which have shown impressive performance in\nsoftware engineering tasks such as comment generation, code synthesis, and\nprogram repair. This suggests their strong potential to capture the logic\nbehind code modifications - an ability that is crucial for the task of comment\nupdating. Nevertheless, selecting an appropriate prompt strategy for an LLM on\neach update case remains challenging. To address this, we propose a novel\ncomment updating framework, LLMCup, which first uses multiple prompt strategies\nto provide diverse candidate updated comments via an LLM, and then employs a\nranking model, CupRank, to select the best candidate as final updated comment.\nExperimental results demonstrate the effectiveness of LLMCup, with improvements\nover state-of-the-art baselines (CUP and HebCup) by 49.0%-116.9% in Accuracy,\n10.8%-20% in BLEU-4, 4.6% in METEOR, 0.9%-1.9% in F1, and 2.1%-3.4% in\nSentenceBert similarity. Furthermore, a user study shows that comments updated\nby LLMCup sometimes surpass human-written updates, highlighting the importance\nof incorporating human evaluation in comment quality assessment.",
"categories": [
"cs.SE",
"D.2.3; D.2.7; I.2.6"
],
"published": "2025-07-11T15:11:27+00:00",
"url": "http://arxiv.org/pdf/2507.08671v1",
"resource_uri": "arxiv://2507.08671v1",
"citation_count": 0
},
{
"id": "2507.05512v1",
"title": "Disappearing Ink: Obfuscation Breaks N-gram Code Watermarks in Theory and Practice",
"authors": [
"Gehao Zhang",
"Eugene Bagdasarian",
"Juan Zhai",
"Shiqing Ma"
],
"abstract": "Distinguishing AI-generated code from human-written code is becoming crucial\nfor tasks such as authorship attribution, content tracking, and misuse\ndetection. Based on this, N-gram-based watermarking schemes have emerged as\nprominent, which inject secret watermarks to be detected during the generation.\n However, their robustness in code content remains insufficiently evaluated.\nMost claims rely solely on defenses against simple code transformations or code\noptimizations as a simulation of attack, creating a questionable sense of\nrobustness. In contrast, more sophisticated schemes already exist in the\nsoftware engineering world, e.g., code obfuscation, which significantly alters\ncode while preserving functionality. Although obfuscation is commonly used to\nprotect intellectual property or evade software scanners, the robustness of\ncode watermarking techniques against such transformations remains largely\nunexplored.\n In this work, we formally model the code obfuscation and prove the\nimpossibility of N-gram-based watermarking's robustness with only one intuitive\nand experimentally verified assumption, distribution consistency, satisfied.\nGiven the original false positive rate of the watermarking detection, the ratio\nthat the detector failed on the watermarked code after obfuscation will\nincrease to 1 - fpr.\n The experiments have been performed on three SOTA watermarking schemes, two\nLLMs, two programming languages, four code benchmarks, and four obfuscators.\nAmong them, all watermarking detectors show coin-flipping detection abilities\non obfuscated codes (AUROC tightly surrounds 0.5). Among all models,\nwatermarking schemes, and datasets, both programming languages own obfuscators\nthat can achieve attack effects with no detection AUROC higher than 0.6 after\nthe attack. Based on the theoretical and practical observations, we also\nproposed a potential path of robust code watermarking.",
"categories": [
"cs.CR",
"cs.AI"
],
"published": "2025-07-07T22:18:19+00:00",
"url": "http://arxiv.org/pdf/2507.05512v1",
"resource_uri": "arxiv://2507.05512v1",
"citation_count": 0
},
{
"id": "2507.03659v1",
"title": "Specification-Guided Repair of Arithmetic Errors in Dafny Programs using LLMs",
"authors": [
"Valentina Wu",
"Alexandra Mendes",
"Alexandre Abreu"
],
"abstract": "Formal verification offers strong assurances of software correctness.\nHowever, debugging and repairing the underlying faults can be complex and\ntime-consuming when verification fails. Automated Program Repair (APR) aims to\nease this by automatically identifying and fixing faults. Traditional APR\ntechniques often depend on test suites for validation, but these may fail to\ncapture all scenarios. In contrast, formal specifications provide stronger\ncorrectness criteria for effective repairs.\n We present an innovative APR tool for Dafny, a verification-aware programming\nlanguage that uses formal specifications - including pre-conditions,\npost-conditions, and invariants - as oracles for fault localization and repair.\nAssuming the correctness of the specifications and focusing on arithmetic bugs,\nwe localize faults through a series of steps, which include using Hoare Logic\nto determine the state of each statement within the program and\nstate-of-the-art Large Language Models (LLMs) to synthesize candidate fixes.\nThe chosen models were GPT-4o mini, Llama 3, Mistral 7B, and Llemma 7B.\n We evaluate our approach using DafnyBench, a benchmark of real-world Dafny\nprograms. Our tool achieves 89.6% accuracy in fault localization, with GPT-4o\nmini yielding the highest repair success rate (74.18%). These results highlight\nthe potential of combining formal reasoning with LLM-driven program synthesis\nfor automated program repair.",
"categories": [
"cs.SE",
"cs.PL"
],
"published": "2025-07-04T15:36:12+00:00",
"url": "http://arxiv.org/pdf/2507.03659v1",
"resource_uri": "arxiv://2507.03659v1",
"citation_count": 0
},
{
"id": "2507.05281v1",
"title": "CoreCodeBench: A Configurable Multi-Scenario Repository-Level Benchmark",
"authors": [
"Lingyue Fu",
"Hao Guan",
"Bolun Zhang",
"Haowei Yuan",
"Yaoming Zhu",
"Jun Xu",
"Zongyu Wang",
"Lin Qiu",
"Xunliang Cai",
"Xuezhi Cao",
"Weiwen Liu",
"Weinan Zhang",
"Yong Yu"
],
"abstract": "As Large Language Models (LLMs) demonstrate increasingly sophisticated code\nprocessing capabilities, evaluating their performance on engineering-level code\nremains challenging. Existing repository-level benchmarks primarily focus on\nsingle scenarios, such as code generation or bug fixing, without adequately\ncapturing the diversity and complexity of real-world software or project\nengineering workflows. Furthermore, these benchmarks suffer from limited\ncontrollability in question positioning and reliability issues in their\ngenerated test cases. To address these limitations, we present CorePipe, a\nfully automated pipeline that converts repositories into comprehensive test\ncases, and introduce CoreCodeBench, a configurable multi-scenario\nrepository-level benchmark. To simulate real engineering scenarios, CorePipe\ngenerates three types of atomic questions (Development, BugFix, and Test-Driven\nDevelopment) specifically targeting core code segments. These atomic questions\nare further combined into three types of composite questions, with difficulty\nlevels flexibly adjusted through hyperparameter tuning. CoreCodeBench provides\na comprehensive and extensive repository-level benchmark to investigate the\napplicability of LLMs in real-world engineering projects. Experiments with 16\nLLMs across diverse scenarios reveal varying capabilities and offer\nmulti-dimensional insights into LLM performance in engineering contexts. The\ncode for CorePipe is available at\nhttps://github.com/AGI-Eval-Official/CoreCodeBench, and the data for\nCoreCodeBench can be accessed at\nhttps://huggingface.co/collections/tubehhh/corecodebench-68256d2faabf4b1610a08caa.",
"categories": [
"cs.SE",
"cs.CL"
],
"published": "2025-07-04T09:42:04+00:00",
"url": "http://arxiv.org/pdf/2507.05281v1",
"resource_uri": "arxiv://2507.05281v1",
"citation_count": 0
},
{
"id": "2507.05269v1",
"title": "CORE: Benchmarking LLMs Code Reasoning Capabilities through Static Analysis Tasks",
"authors": [
"Danning Xie",
"Mingwei Zheng",
"Xuwei Liu",
"Jiannan Wang",
"Chengpeng Wang",
"Lin Tan",
"Xiangyu Zhang"
],
"abstract": "Large language models (LLMs) have been widely adopted across diverse software\nengineering domains, such as code generation, program repair, and vulnerability\ndetection. These applications require understanding beyond surface-level code\npatterns: value propagation, control flow, and interdependence between program\nelements. However, existing benchmarks primarily evaluate end-to-end outcomes,\nsuch as whether code is correctly repaired or generated, leaving the models\nability for program semantic reasoning underexplored. This work presents CoRe,\na high-quality, human-verified benchmark designed to evaluate LLMs on\nfundamental static analysis tasks. CoRe includes 12,553 task instances spanning\ndata dependency, control dependency, and information flow across programs\nwritten in C/C++, Java, and Python. To ensure semantic diversity and reasoning\ncomplexity, we propose a semantics-aware diverse sampling strategy that selects\ntargets and task instances based on structural coverage and dependency depth.\nWe evaluate 10 mainstream LLMs and show that, while they perform well at\nidentifying dependencies, models still struggle with tasks that require deeper\nsemantic understanding and multi-step reasoning. We further conduct qualitative\nanalyses to uncover key challenges, such as complex control structures and\nbackward dependency patterns, offering insights into improving LLMs code\nreasoning capabilities.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2025-07-03T01:35:58+00:00",
"url": "http://arxiv.org/pdf/2507.05269v1",
"resource_uri": "arxiv://2507.05269v1",
"citation_count": 0
},
{
"id": "2507.01827v1",
"title": "APRMCTS: Improving LLM-based Automated Program Repair with Iterative Tree Search",
"authors": [
"Haichuan Hu",
"Congqing He",
"Hao Zhang",
"Xiaochen Xie",
"Quanjun Zhang"
],
"abstract": "Automated Program Repair (APR) attempts to fix software bugs without human\nintervention, which plays a crucial role in software development and\nmaintenance. Recently, with the advances in Large Language Models (LLMs), a\nrapidly increasing number of APR techniques have been proposed with remarkable\nperformance. However, existing LLM-based APR techniques typically adopt\ntrial-and-error strategies, which suffer from two major drawbacks: (1)\ninherently limited patch effectiveness due to local exploration, and (2) low\nsearch efficiency due to redundant exploration. In this paper, we propose\nAPRMCTS, which uses iterative tree search to improve LLM-based APR. APRMCTS\nincorporates Monte Carlo Tree Search (MCTS) into patch searching by performing\na global evaluation of the explored patches and selecting the most promising\none for subsequent refinement and generation. APRMCTS effectively resolves the\nproblems of falling into local optima and thus helps improve the efficiency of\npatch searching. Our experiments on 835 bugs from Defects4J demonstrate that,\nwhen integrated with GPT-3.5, APRMCTS can fix a total of 201 bugs, which\noutperforms all state-of-the-art baselines. Besides, APRMCTS helps GPT-4o-mini,\nGPT-3.5, Yi-Coder-9B, and Qwen2.5-Coder-7B to fix 30, 27, 37, and 28 more bugs,\nrespectively. More importantly, APRMCTS boasts a significant performance\nadvantage while employing small patch size (16 and 32), notably fewer than the\n500 and 10,000 patches adopted in previous studies. In terms of cost, compared\nto existing state-of-the-art LLM-based APR methods, APRMCTS has time and\nmonetary costs of less than 20% and 50%, respectively. Our extensive study\ndemonstrates that APRMCTS exhibits good effectiveness and efficiency, with\nparticular advantages in addressing complex bugs.",
"categories": [
"cs.SE"
],
"published": "2025-07-02T15:44:12+00:00",
"url": "http://arxiv.org/pdf/2507.01827v1",
"resource_uri": "arxiv://2507.01827v1",
"citation_count": 0
},
{
"id": "2507.01628v1",
"title": "DaiFu: In-Situ Crash Recovery for Deep Learning Systems",
"authors": [
"Zilong He",
"Pengfei Chen",
"Hongyu Zhang",
"Xiaoyun Li",
"Guangba Yu",
"Hongyang Chen",
"Zibin Zheng"
],
"abstract": "Deep learning (DL) systems have been widely adopted in many areas, and are\nbecoming even more popular with the emergence of large language models.\nHowever, due to the complex software stacks involved in their development and\nexecution, crashes are unavoidable and common. Crashes severely waste computing\nresources and hinder development productivity, so efficient crash recovery is\ncrucial. Existing solutions, such as checkpoint-retry, are too heavyweight for\nfast recovery from crashes caused by minor programming errors or transient\nruntime errors. Therefore, we present DaiFu, an in-situ recovery framework for\nDL systems. Through a lightweight code transformation to a given DL system,\nDaiFu augments it to intercept crashes in situ and enables dynamic and instant\nupdates to its program running context (e.g., code, configurations, and other\ndata) for agile crash recovery. Our evaluation shows that DaiFu helps reduce\nthe restore time for crash recovery, achieving a 1372x speedup compared with\nstate-of-the-art solutions. Meanwhile, the overhead of DaiFu is negligible\n(under 0.40%). We also construct a benchmark spanning 7 distinct crash\nscenarios in DL systems, and show the effectiveness of DaiFu in diverse\nsituations.",
"categories": [
"cs.SE"
],
"published": "2025-07-02T11:58:38+00:00",
"url": "http://arxiv.org/pdf/2507.01628v1",
"resource_uri": "arxiv://2507.01628v1",
"citation_count": 0
},
{
"id": "2507.02976v2",
"title": "Are AI-Generated Fixes Secure? Analyzing LLM and Agent Patches on SWE-bench",
"authors": [
"Amirali Sajadi",
"Kostadin Damevski",
"Preetha Chatterjee"
],
"abstract": "Large Language Models (LLMs) and their agentic frameworks are increasingly\nadopted to automate software development tasks such as issue resolution and\nprogram repair. While prior work has identified security risks in LLM-generated\ncode, most evaluations have focused on synthetic or isolated settings, leaving\nopen questions about the security of these systems in real-world development\ncontexts. In this study, we present the first large-scale security analysis of\nLLM-generated patches using 20,000+ issues from the SWE-bench dataset. We\nevaluate patches produced by a standalone LLM (Llama 3.3) and compare them to\ndeveloper-written patches. We also assess the security of patches generated by\nthree top-performing agentic frameworks (OpenHands, AutoCodeRover, HoneyComb)\non a subset of our data. Finally, we analyze a wide range of code, issue, and\nproject-level factors to understand the conditions under which LLMs and agents\nare most likely to generate insecure code. Our findings reveal that the\nstandalone LLM introduces nearly 9x more new vulnerabilities than developers,\nwith many of these exhibiting unique patterns not found in developers' code.\nAgentic workflows also generate a significant number of vulnerabilities,\nparticularly when granting LLMs more autonomy, potentially increasing the\nlikelihood of misinterpreting project context or task requirements. We find\nthat vulnerabilities are more likely to occur in LLM patches associated with a\nhigher number of files, more lines of generated code, and GitHub issues that\nlack specific code snippets or information about the expected code behavior and\nsteps to reproduce. These results suggest that contextual factors play a\ncritical role in the security of the generated code and point toward the need\nfor proactive risk assessment methods that account for both code and\nissue-level information to complement existing vulnerability detection tools.",
"categories": [
"cs.CR",
"cs.LG",
"cs.SE"
],
"published": "2025-06-30T21:10:19+00:00",
"url": "http://arxiv.org/pdf/2507.02976v2",
"resource_uri": "arxiv://2507.02976v2",
"citation_count": 0
},
{
"id": "2506.24015v1",
"title": "Bug Fixing with Broader Context: Enhancing LLM-Based Program Repair via Layered Knowledge Injection",
"authors": [
"Ramtin Ehsani",
"Esteban Parra",
"Sonia Haiduc",
"Preetha Chatterjee"
],
"abstract": "Prompting LLMs with bug-related context (e.g., error messages, stack traces)\nimproves automated program repair, but many bugs still remain unresolved. In\nreal-world projects, developers often rely on broader repository and\nproject-level context beyond the local code to resolve such bugs. In this\npaper, we investigate how automatically extracting and providing such knowledge\ncan improve LLM-based program repair. We propose a layered knowledge injection\nframework that incrementally augments LLMs with structured context. It starts\nwith the Bug Knowledge Layer, which includes information such as the buggy\nfunction and failing tests; expands to the Repository Knowledge Layer, which\nadds structural dependencies, related files, and commit history; and finally\ninjects the Project Knowledge Layer, which incorporates relevant details from\ndocumentation and previously fixed bugs. We evaluate this framework on a\ndataset of 314 bugs from BugsInPy using two LLMs (Llama 3.3 and GPT-4o-mini),\nand analyze fix rates across six bug types. By progressively injecting\nknowledge across layers, our approach achieves a fix rate of 79% (250/314)\nusing Llama 3.3, a significant improvement of 23% over previous work. All bug\ntypes show improvement with the addition of repository-level context, while\nonly a subset benefit further from project-level knowledge, highlighting that\ndifferent bug types require different levels of contextual information for\neffective repair. We also analyze the remaining unresolved bugs and find that\nmore complex and structurally isolated bugs, such as Program Anomaly and GUI\nbugs, remain difficult even after injecting all available information. Our\nresults show that layered context injection improves program repair and suggest\nthe need for interactive and adaptive APR systems.",
"categories": [
"cs.SE"
],
"published": "2025-06-30T16:19:38+00:00",
"url": "http://arxiv.org/pdf/2506.24015v1",
"resource_uri": "arxiv://2506.24015v1",
"citation_count": 0
},
{
"id": "2506.23749v1",
"title": "A Survey of LLM-based Automated Program Repair: Taxonomies, Design Paradigms, and Applications",
"authors": [
"Boyang Yang",
"Zijian Cai",
"Fengling Liu",
"Bach Le",
"Lingming Zhang",
"Tegawendé F. Bissyandé",
"Yang Liu",
"Haoye Tian"
],
"abstract": "Large language models (LLMs) are reshaping automated program repair (APR). We\ncategorize the recent 63 LLM-based APR systems published from January 2022 to\nJune 2025 into four paradigms, and show how retrieval- or analysis-augmented\ncontexts strengthen any of them. This taxonomy clarifies key trade-offs:\nfine-tuning delivers strong task alignment at high training cost; prompting\nenables rapid deployment but is limited by prompt design and context windows;\nprocedural pipelines offer reproducible control with moderate overhead; agentic\nframeworks tackle multi-hunk or cross-file bugs at the price of increased\nlatency and complexity. Persistent challenges include verifying semantic\ncorrectness beyond test suites, repairing repository-scale defects, and\nlowering the costs of LLMs. We outline research directions that combine\nlightweight human feedback, repository-aware retrieval, code analysis, and\ncost-aware planning to advance reliable and efficient LLM-based APR.",
"categories": [
"cs.SE"
],
"published": "2025-06-30T11:46:01+00:00",
"url": "http://arxiv.org/pdf/2506.23749v1",
"resource_uri": "arxiv://2506.23749v1",
"citation_count": 0
},
{
"id": "2506.23100v1",
"title": "Repair Ingredients Are All You Need: Improving Large Language Model-Based Program Repair via Repair Ingredients Search",
"authors": [
"Jiayi Zhang",
"Kai Huang",
"Jian Zhang",
"Yang Liu",
"Chunyang Chen"
],
"abstract": "Automated Program Repair (APR) techniques aim to automatically fix buggy\nprograms. Among these, Large Language Model-based (LLM-based) approaches have\nshown great promise. Recent advances demonstrate that directly leveraging LLMs\ncan achieve leading results. However, these techniques remain suboptimal in\ngenerating contextually relevant and accurate patches, as they often overlook\nrepair ingredients crucial for practical program repair. In this paper, we\npropose ReinFix, a novel framework that enables LLMs to autonomously search for\nrepair ingredients throughout both the reasoning and solution phases of bug\nfixing. In the reasoning phase, ReinFix integrates static analysis tools to\nretrieve internal ingredients, such as variable definitions, to assist the LLM\nin root cause analysis when it encounters difficulty understanding the context.\nDuring the solution phase, when the LLM lacks experience in fixing specific\nbugs, ReinFix searches for external ingredients from historical bug fixes with\nsimilar bug patterns, leveraging both the buggy code and its root cause to\nguide the LLM in identifying appropriate repair actions, thereby increasing the\nlikelihood of generating correct patches. Evaluations on two popular benchmarks\n(Defects4J V1.2 and V2.0) demonstrate the effectiveness of our approach over\nSOTA baselines. Notably, ReinFix fixes 146 bugs, which is 32 more than the\nbaselines on Defects4J V1.2. On Defects4J V2.0, ReinFix fixes 38 more bugs than\nthe SOTA. Importantly, when evaluating on the recent benchmarks that are free\nof data leakage risk, ReinFix also maintains the best performance.",
"categories": [
"cs.SE"
],
"published": "2025-06-29T06:02:11+00:00",
"url": "http://arxiv.org/pdf/2506.23100v1",
"resource_uri": "arxiv://2506.23100v1",
"citation_count": 0
},
{
"id": "2506.21211v1",
"title": "$T^3$: Multi-level Tree-based Automatic Program Repair with Large Language Models",
"authors": [
"Quanming Liu",
"Xupeng Bu",
"Zhichao Yan",
"Ru Li"
],
"abstract": "Automatic Program Repair (APR) is a core technology in software development\nand maintenance, with aims to enable automated defect repair with minimal human\nintervention. In recent years, the substantial advancements in Large Language\nModels (LLMs) and the Chain-of-Thought (CoT) techniques have significantly\nenhanced the reasoning capabilities of these models. However, due to the\ncomplex logic and multi-step reasoning ability needed, the application of CoT\ntechniques in the APR domain remains insufficient. This study systematically\nevaluates the performance of several common CoT techniques in APR tasks and\nproposes an innovative framework $T^3$, which integrates the powerful reasoning\ncapabilities of LLMs with tree search, effectively improving the precision of\ngenerating candidate repair solutions. Furthermore, $T^3$ provides valuable\nguidance for optimizing sample selection and repair strategies in APR tasks,\nestablishing a robust framework for achieving efficient automated debugging.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2025-06-26T13:04:28+00:00",
"url": "http://arxiv.org/pdf/2506.21211v1",
"resource_uri": "arxiv://2506.21211v1",
"citation_count": 0
},
{
"id": "2506.18824v1",
"title": "Understanding Software Engineering Agents: A Study of Thought-Action-Result Trajectories",
"authors": [
"Islem Bouzenia",
"Michael Pradel"
],
"abstract": "Large Language Model (LLM)-based agents are increasingly employed to automate\ncomplex software engineering tasks such as program repair and issue resolution.\nThese agents operate by autonomously generating natural language thoughts,\ninvoking external tools, and iteratively refining their solutions. Despite\ntheir widespread adoption, the internal decision-making processes of these\nagents remain largely unexplored, limiting our understanding of their\noperational dynamics and failure modes. In this paper, we present a large-scale\nempirical study of the thought-action-result trajectories of three\nstate-of-the-art LLM-based agents: \\textsc{RepairAgent},\n\\textsc{AutoCodeRover}, and \\textsc{OpenHands}. We unify their interaction logs\ninto a common format, capturing 120 trajectories and 2822 LLM interactions\nfocused on program repair and issue resolution. Our study combines quantitative\nanalyses of structural properties, action patterns, and token usage with\nqualitative assessments of reasoning coherence and feedback integration. We\nidentify key trajectory characteristics such as iteration counts and token\nconsumption, recurring action sequences, and the semantic coherence linking\nthoughts, actions, and their results. Our findings reveal behavioral motifs and\nanti-patterns that distinguish successful from failed executions, providing\nactionable insights for improving agent design, including prompting strategies,\nfailure diagnosis, and anti-pattern detection. We release our dataset and\nannotation framework to support further research on transparent and robust\nautonomous software engineering agents.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2025-06-23T16:34:52+00:00",
"url": "http://arxiv.org/pdf/2506.18824v1",
"resource_uri": "arxiv://2506.18824v1",
"citation_count": 0
},
{
"id": "2506.18394v1",
"title": "Tracing Errors, Constructing Fixes: Repository-Level Memory Error Repair via Typestate-Guided Context Retrieval",
"authors": [
"Xiao Cheng",
"Zhihao Guo",
"Huan Huo",
"Yulei Sui"
],
"abstract": "Memory-related errors in C programming continue to pose significant\nchallenges in software development, primarily due to the complexities of manual\nmemory management inherent in the language. These errors frequently serve as\nvectors for severe vulnerabilities, while their repair requires extensive\nknowledge of program logic and C's memory model. Automated Program Repair (APR)\nhas emerged as a critical research area to address these challenges.\nTraditional APR approaches rely on expert-designed strategies and predefined\ntemplates, which are labor-intensive and constrained by the effectiveness of\nmanual specifications. Deep learning techniques offer a promising alternative\nby automatically extracting repair patterns, but they require substantial\ntraining datasets and often lack interpretability.\n This paper introduces LTFix, a novel approach that harnesses the potential of\nLarge Language Models (LLMs) for automated memory error repair, especially for\ncomplex repository-level errors that span multiple functions and files. We\naddress two fundamental challenges in LLM-based memory error repair: a limited\nunderstanding of interprocedural memory management patterns and context window\nlimitations for repository-wide analysis. Our approach utilizes a finite\ntypestate automaton to guide the tracking of error-propagation paths and\ncontext trace, capturing both spatial (memory states) and temporal (execution\nhistory) dimensions of error behavior. This typestate-guided context retrieval\nstrategy provides the LLM with concise yet semantically rich information\nrelevant to erroneous memory management, effectively addressing the token\nlimitation of LLMs.",
"categories": [
"cs.SE"
],
"published": "2025-06-23T08:30:00+00:00",
"url": "http://arxiv.org/pdf/2506.18394v1",
"resource_uri": "arxiv://2506.18394v1",
"citation_count": 0
},
{
"id": "2506.17772v1",
"title": "PAGENT: Learning to Patch Software Engineering Agents",
"authors": [
"Haoran Xue",
"Gias Uddin",
"Song Wang"
],
"abstract": "LLM Agents produce patches automatically to resolve an issue. However, they\ncan generate inaccurate patches. Little is known about the root causes behind\nthose failed patches or how those could be fixed. This paper reports an\nempirical study of the failed patches generated by seven top LLM code agents.\nWe collected 114 issues from the SWE-bench Lite dataset that remained\nunresolved across the agents. The seven agents produced a total of 769 failed\npatches for those issues, which we checked with a combination of GPT-4o and\nmanual analysis. We present a taxonomy of the failure reasons across the\npatches. The taxonomy contains six categories, with several sub-categories\nunder each category. For example, a frequently observed category is the\ninability of an LLM to correctly infer/produce the appropriate variable type in\nthe produced patch. As a first step towards addressing such type-related\nerrors, we designed PAGENT (Patch Agent). PAGENT utilizes program analysis\ntechniques like CFG creation and exploration to infer the type of information\nof a patch. PAGENT does this by applying repository-level static code analysis\ntechniques. Then, PAGENT refines the inferred type by further utilizing an\nLLM-based inference technique. We tested PAGENT on all 127 type-related failed\npatches from the top three agents in our study. PAGENT could fix 29 of the 127\nfailed patches.",
"categories": [
"cs.SE"
],
"published": "2025-06-21T18:00:00+00:00",
"url": "http://arxiv.org/pdf/2506.17772v1",
"resource_uri": "arxiv://2506.17772v1",
"citation_count": 0
},
{
"id": "2506.17632v1",
"title": "Optimization-Free Patch Attack on Stereo Depth Estimation",
"authors": [
"Hangcheng Liu",
"Xu Kuang",
"Xingshuo Han",
"Xingwan Wu",
"Haoran Ou",
"Shangwei Guo",
"Xingyi Huang",
"Tao Xiang",
"Tianwei Zhang"
],
"abstract": "Stereo Depth Estimation (SDE) is essential for scene understanding in\nvision-based systems like autonomous driving. However, recent studies show that\nSDE models are vulnerable to adversarial attacks, which are often limited to\nunrealistic settings, e.g., digital perturbations on separate stereo views in\nstatic scenes, restricting their real-world applicability. This raises a\ncritical question: how can we design physically realizable, scene-adaptive, and\ntransferable attacks against SDE under realistic constraints?\n To answer this, we make two key contributions. First, we propose a unified\nattack framework that extends optimization-based techniques to four core stages\nof stereo matching: feature extraction, cost-volume construction, cost\naggregation, and disparity regression. A comprehensive stage-wise evaluation\nacross 9 mainstream SDE models, under constraints like photometric consistency,\nreveals that optimization-based patches suffer from poor transferability.\nInterestingly, partially transferable patches suggest that patterns, rather\nthan pixel-level perturbations, may be key to generalizable attacks. Motivated\nby this, we present PatchHunter, the first optimization-free adversarial patch\nattack against SDE. PatchHunter formulates patch generation as a reinforcement\nlearning-driven search over a structured space of visual patterns crafted to\ndisrupt SDE assumptions.\n We validate PatchHunter across three levels: the KITTI dataset, the CARLA\nsimulator, and real-world vehicle deployment. PatchHunter not only surpasses\noptimization-based methods in effectiveness but also achieves significantly\nbetter black-box transferability. Even under challenging physical conditions\nlike low light, PatchHunter maintains high attack success (e.g., D1-all > 0.4),\nwhereas optimization-based methods fail.",
"categories": [
"cs.CV"
],
"published": "2025-06-21T08:23:02+00:00",
"url": "http://arxiv.org/pdf/2506.17632v1",
"resource_uri": "arxiv://2506.17632v1",
"citation_count": 0
},
{
"id": "2506.17507v1",
"title": "Optimal Parallel Algorithms for Convex Hulls in 2D and 3D under Noisy Primitive Operations",
"authors": [
"Michael T. Goodrich",
"Vinesh Sridhar"
],
"abstract": "In the noisy primitives model, each primitive comparison performed by an\nalgorithm, e.g., testing whether one value is greater than another, returns the\nincorrect answer with random, independent probability p < 1/2 and otherwise\nreturns a correct answer. This model was first applied in the context of\nsorting and searching, and recent work by Eppstein, Goodrich, and Sridhar\nextends this model to sequential algorithms involving geometric primitives such\nas orientation and sidedness tests. However, their approaches appear to be\ninherently sequential; hence, in this paper, we study parallel computational\ngeometry algorithms for 2D and 3D convex hulls in the noisy primitives model.\nWe give the first optimal parallel algorithms in the noisy primitives model for\n2D and 3D convex hulls in the CREW PRAM model. The main technical contribution\nof our work concerns our ability to detect and fix errors during intermediate\nsteps of our algorithm using a generalization of the failure sweeping\ntechnique.",
"categories": [
"cs.CG",
"cs.DC"
],
"published": "2025-06-20T23:09:23+00:00",
"url": "http://arxiv.org/pdf/2506.17507v1",
"resource_uri": "arxiv://2506.17507v1",
"citation_count": 0
},
{
"id": "2506.17471v1",
"title": "Code Generation for Near-Roofline Finite Element Actions on GPUs from Symbolic Variational Forms",
"authors": [
"Kaushik Kulkarni",
"Andreas Klöckner"
],
"abstract": "We present a novel parallelization strategy for evaluating Finite Element\nMethod (FEM) variational forms on GPUs, focusing on those that are expressible\nthrough the Unified Form Language (UFL) on simplex meshes. We base our approach\non code transformations, wherein we construct a space of scheduling candidates\nand rank them via a heuristic cost model to effectively handle the large\ndiversity of computational workloads that can be expressed in this way. We\npresent a design of a search space to which the cost model is applied, along\nwith an associated pruning strategy to limit the number of configurations that\nneed to be empirically evaluated. The goal of our design is to strike a balance\nbetween the device's latency-hiding capabilities and the amount of state space,\na key factor in attaining near-roofline performance.\n To make our work widely available, we have prototyped our parallelization\nstrategy within the \\textsc{Firedrake} framework, a UFL-based FEM solver. We\nevaluate the performance of our parallelization scheme on two generations of\nNvidia GPUs, specifically the Titan V (Volta architecture) and Tesla K40c\n(Kepler architecture), across a range of operators commonly used in\napplications, including fluid dynamics, wave propagation, and structural\nmechanics, in 2D and 3D geometries. Our results demonstrate that our proposed\nalgorithm achieves more than $50\\%$ roofline performance in $65\\%$ of the test\ncases on both devices.",
"categories": [
"cs.DC",
"cs.MS",
"cs.NA",
"cs.PF",
"math.NA",
"65Y05"
],
"published": "2025-06-20T20:23:42+00:00",
"url": "http://arxiv.org/pdf/2506.17471v1",
"resource_uri": "arxiv://2506.17471v1",
"citation_count": 0
},
{
"id": "2506.17208v1",
"title": "Dissecting the SWE-Bench Leaderboards: Profiling Submitters and Architectures of LLM- and Agent-Based Repair Systems",
"authors": [
"Matias Martinez",
"Xavier Franch"
],
"abstract": "The rapid progress in Automated Program Repair (APR) has been driven by\nadvances in AI, particularly large language models (LLMs) and agent-based\nsystems. SWE-Bench is a recent benchmark designed to evaluate LLM-based repair\nsystems using real issues and pull requests mined from 12 popular open-source\nPython repositories. Its public leaderboards, SWE-Bench Lite and SWE-Bench\nVerified, have become central platforms for tracking progress and comparing\nsolutions. However, because the submission process does not require detailed\ndocumentation, the architectural design and origin of many solutions remain\nunclear. In this paper, we present the first comprehensive study of all\nsubmissions to the SWE-Bench Lite (68 entries) and Verified (79 entries)\nleaderboards, analyzing 67 unique approaches across dimensions such as\nsubmitter type, product availability, LLM usage, and system architecture. Our\nfindings reveal the dominance of proprietary LLMs (especially Claude 3.5/3.7),\nthe presence of both agentic and non-agentic designs, and a contributor base\nspanning from individual developers to large tech companies.",
"categories": [
"cs.SE",
"cs.AI",
"cs.CL"
],
"published": "2025-06-20T17:57:08+00:00",
"url": "http://arxiv.org/pdf/2506.17208v1",
"resource_uri": "arxiv://2506.17208v1",
"citation_count": 0
},
{
"id": "2506.16650v1",
"title": "SemAgent: A Semantics Aware Program Repair Agent",
"authors": [
"Anvith Pabba",
"Alex Mathai",
"Anindya Chakraborty",
"Baishakhi Ray"
],
"abstract": "Large Language Models (LLMs) have shown impressive capabilities in downstream\nsoftware engineering tasks such as Automated Program Repair (APR). In\nparticular, there has been a lot of research on repository-level\nissue-resolution benchmarks such as SWE-Bench. Although there has been\nsignificant progress on this topic, we notice that in the process of solving\nsuch issues, existing agentic systems tend to hyper-localize on immediately\nsuspicious lines of code and fix them in isolation, without a deeper\nunderstanding of the issue semantics, code semantics, or execution semantics.\nConsequently, many existing systems generate patches that overfit to the user\nissue, even when a more general fix is preferable. To address this limitation,\nwe introduce SemAgent, a novel workflow-based procedure that leverages issue,\ncode, and execution semantics to generate patches that are complete -\nidentifying and fixing all lines relevant to the issue. We achieve this through\na novel pipeline that (a) leverages execution semantics to retrieve relevant\ncontext, (b) comprehends issue-semantics via generalized abstraction, (c)\nisolates code-semantics within the context of this abstraction, and (d)\nleverages this understanding in a two-stage architecture: a repair stage that\nproposes fine-grained fixes, followed by a reviewer stage that filters relevant\nfixes based on the inferred issue-semantics. Our evaluations show that our\nmethodology achieves a solve rate of 44.66% on the SWEBench-Lite benchmark\nbeating all other workflow-based approaches, and an absolute improvement of\n7.66% compared to our baseline, which lacks such deep semantic understanding.\nWe note that our approach performs particularly well on issues requiring\nmulti-line reasoning (and editing) and edge-case handling, suggesting that\nincorporating issue and code semantics into APR pipelines can lead to robust\nand semantically consistent repairs.",
"categories": [
"cs.SE",
"cs.AI",
"cs.MA"
],
"published": "2025-06-19T23:27:58+00:00",
"url": "http://arxiv.org/pdf/2506.16650v1",
"resource_uri": "arxiv://2506.16650v1",
"citation_count": 0
},
{
"id": "2506.16136v1",
"title": "Seeing is Fixing: Cross-Modal Reasoning with Multimodal LLMs for Visual Software Issue Fixing",
"authors": [
"Kai Huang",
"Jian Zhang",
"Xiaofei Xie",
"Chunyang Chen"
],
"abstract": "Large language model-(LLM) based automated program repair (APR) techniques\nhave shown promising results in resolving real-world GitHub issue tasks.\nExisting APR systems are primarily evaluated in unimodal settings (e.g.,\nSWE-bench). However, these autonomous systems struggle to resolve multimodal\nproblem scenarios (e.g., SWE-bench M) due to limitations in interpreting and\nleveraging visual information. In multimodal scenarios, LLMs need to rely on\nvisual information in the graphical user interface (GUI) to understand bugs and\ngenerate fixes. To bridge this gap, we propose GUIRepair, a cross-modal\nreasoning approach for resolving multimodal issue scenarios by understanding\nand capturing visual information. Specifically, GUIRepair integrates two key\ncomponents, Image2Code and Code2Image, to enhance fault comprehension and patch\nvalidation. Image2Code extracts relevant project documents based on the issue\nreport, then applies this domain knowledge to generate the reproduced code\nresponsible for the visual symptoms, effectively translating GUI images into\nexecutable context for better fault comprehension. Code2Image replays the\nvisual issue scenario using the reproduced code and captures GUI renderings of\nthe patched program to assess whether the fix visually resolves the issue,\nproviding feedback for patch validation. We evaluate GUIRepair on SWE-bench M,\nand the approach demonstrates significant effectiveness. When utilizing GPT-4o\nas the base model, GUIRepair solves 157 instances, outperforming the best\nopen-source baseline by 26 instances. Furthermore, when using o4-mini as the\nbase model, GUIRepair can achieve even better results and solve 175 instances,\noutperforming the top commercial system by 22 instances. This emphasizes the\nsuccess of our new perspective on incorporating cross-modal reasoning by\nunderstanding and capturing visual information to resolve multimodal issues.",
"categories": [
"cs.SE"
],
"published": "2025-06-19T08:42:11+00:00",
"url": "http://arxiv.org/pdf/2506.16136v1",
"resource_uri": "arxiv://2506.16136v1",
"citation_count": 0
},
{
"id": "2506.13186v1",
"title": "Empirical Evaluation of Large Language Models in Automated Program Repair",
"authors": [
"Jiajun Sun",
"Fengjie Li",
"Xinzhu Qi",
"Hongyu Zhang",
"Jiajun Jiang"
],
"abstract": "The increasing prevalence of software bugs has made automated program repair\n(APR) a key research focus. Large language models (LLMs) offer new\nopportunities for APR, but existing studies mostly rely on smaller,\nearlier-generation models and Java benchmarks. The repair capabilities of\nmodern, large-scale LLMs across diverse languages and scenarios remain\nunderexplored. To address this, we conduct a comprehensive empirical study of\nfour open-source LLMs, CodeLlama, LLaMA, StarCoder, and DeepSeek-Coder,\nspanning 7B to 33B parameters, diverse architectures, and purposes. We evaluate\nthem across two bug scenarios (enterprise-grades and algorithmic), three\nlanguages (Java, C/C++, Python), and four prompting strategies, analyzing over\n600K generated patches on six benchmarks. Key findings include: (1) model\nspecialization (e.g., CodeLlama) can outperform larger general-purpose models\n(e.g., LLaMA); (2) repair performance does not scale linearly with model size;\n(3) correct patches often appear early in generation; and (4) prompts\nsignificantly affect results. These insights offer practical guidance for\ndesigning effective and efficient LLM-based APR systems.",
"categories": [
"cs.SE"
],
"published": "2025-06-16T07:52:15+00:00",
"url": "http://arxiv.org/pdf/2506.13186v1",
"resource_uri": "arxiv://2506.13186v1",
"citation_count": 0
},
{
"id": "2506.13182v1",
"title": "From Empirical Evaluation to Context-Aware Enhancement: Repairing Regression Errors with LLMs",
"authors": [
"Anh Ho",
"Thanh Le-Cong",
"Bach Le",
"Christine Rizkallah"
],
"abstract": "[...] Since then, various APR approaches, especially those leveraging the\npower of large language models (LLMs), have been rapidly developed to fix\ngeneral software bugs. Unfortunately, the effectiveness of these advanced\ntechniques in the context of regression bugs remains largely unexplored. This\ngap motivates the need for an empirical study evaluating the effectiveness of\nmodern APR techniques in fixing real-world regression bugs.\n In this work, we conduct an empirical study of APR techniques on Java\nregression bugs. To facilitate our study, we introduce RegMiner4APR, a\nhigh-quality benchmark of Java regression bugs integrated into a framework\ndesigned to facilitate APR research. The current benchmark includes 99\nregression bugs collected from 32 widely used real-world Java GitHub\nrepositories. We begin by conducting an in-depth analysis of the benchmark,\ndemonstrating its diversity and quality. Building on this foundation, we\nempirically evaluate the capabilities of APR to regression bugs by assessing\nboth traditional APR tools and advanced LLM-based APR approaches. Our\nexperimental results show that classical APR tools fail to repair any bugs,\nwhile LLM-based APR approaches exhibit promising potential. Motivated by these\nresults, we investigate impact of incorporating bug-inducing change information\ninto LLM-based APR approaches for fixing regression bugs. Our results highlight\nthat this context-aware enhancement significantly improves the performance of\nLLM-based APR, yielding 1.8x more successful repairs compared to using\nLLM-based APR without such context.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2025-06-16T07:49:18+00:00",
"url": "http://arxiv.org/pdf/2506.13182v1",
"resource_uri": "arxiv://2506.13182v1",
"citation_count": 0
},
{
"id": "2506.12728v1",
"title": "MCTS-Refined CoT: High-Quality Fine-Tuning Data for LLM-Based Repository Issue Resolution",
"authors": [
"Yibo Wang",
"Zhihao Peng",
"Ying Wang",
"Zhao Wei",
"Hai Yu",
"Zhiliang Zhu"
],
"abstract": "LLMs demonstrate strong performance in auto-mated software engineering,\nparticularly for code generation and issue resolution. While proprietary models\nlike GPT-4o achieve high benchmarks scores on SWE-bench, their API dependence,\ncost, and privacy concerns limit adoption. Open-source alternatives offer\ntransparency but underperform in complex tasks, especially sub-100B parameter\nmodels. Although quality Chain-of-Thought (CoT) data can enhance reasoning,\ncurrent methods face two critical flaws: (1) weak rejection sampling reduces\ndata quality, and (2) inadequate step validation causes error accumulation.\nThese limitations lead to flawed reasoning chains that impair LLMs'ability to\nlearn reliable issue resolution. The paper proposes MCTS-REFINE, an enhanced\nMonte Carlo Tree Search (MCTS)-based algorithm that dynamically validates and\noptimizes intermediate reasoning steps through a rigorous rejection sampling\nstrategy, generating high-quality CoT data to improve LLM performance in issue\nresolution tasks. Key innovations include: (1) augmenting MCTS with a\nreflection mechanism that corrects errors via rejection sampling and\nrefinement, (2) decomposing issue resolution into three subtasks-File\nLocalization, Fault Localization, and Patch Generation-each with clear\nground-truth criteria, and (3) enforcing a strict sampling protocol where\nintermediate outputs must exactly match verified developer patches, ensuring\ncorrectness across reasoning paths. Experiments on SWE-bench Lite and SWE-bench\nVerified demonstrate that LLMs fine-tuned with our CoT dataset achieve\nsubstantial improvements over baselines.Notably, Qwen2.5-72B- Instruct achieves\n28.3%(Lite) and 35.0%(Verified) resolution rates, surpassing SOTA baseline\nSWE-Fixer-Qwen-72B with the same parameter scale, which only reached\n24.7%(Lite) and 32.8%(Verified).",
"categories": [
"cs.SE"
],
"published": "2025-06-15T05:42:01+00:00",
"url": "http://arxiv.org/pdf/2506.12728v1",
"resource_uri": "arxiv://2506.12728v1",
"citation_count": 0
},
{
"id": "2506.12320v1",
"title": "The Foundation Cracks: A Comprehensive Study on Bugs and Testing Practices in LLM Libraries",
"authors": [
"Weipeng Jiang",
"Xiaoyu Zhang",
"Xiaofei Xie",
"Jiongchi Yu",
"Yuhan Zhi",
"Shiqing Ma",
"Chao Shen"
],
"abstract": "Large Language Model (LLM) libraries have emerged as the foundational\ninfrastructure powering today's AI revolution, serving as the backbone for LLM\ndeployment, inference optimization, fine-tuning, and production serving across\ndiverse applications. Despite their critical role in the LLM ecosystem, these\nlibraries face frequent quality issues and bugs that threaten the reliability\nof AI systems built upon them. To address this knowledge gap, we present the\nfirst comprehensive empirical investigation into bug characteristics and\ntesting practices in modern LLM libraries. We examine 313 bug-fixing commits\nextracted across two widely-adopted LLM libraries: HuggingFace Transformers and\nvLLM.Through rigorous manual analysis, we establish comprehensive taxonomies\ncategorizing bug symptoms into 5 types and root causes into 14 distinct\ncategories.Our primary discovery shows that API misuse has emerged as the\npredominant root cause (32.17%-48.19%), representing a notable transition from\nalgorithm-focused defects in conventional deep learning frameworks toward\ninterface-oriented problems. Additionally, we examine 7,748 test functions to\nidentify 7 distinct test oracle categories employed in current testing\napproaches, with predefined expected outputs (such as specific tensors and text\nstrings) being the most common strategy. Our assessment of existing testing\neffectiveness demonstrates that the majority of bugs escape detection due to\ninadequate test cases (41.73%), lack of test drivers (32.37%), and weak test\noracles (25.90%). Drawing from these findings, we offer some recommendations\nfor enhancing LLM library quality assurance.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2025-06-14T03:00:36+00:00",
"url": "http://arxiv.org/pdf/2506.12320v1",
"resource_uri": "arxiv://2506.12320v1",
"citation_count": 0
},
{
"id": "2506.10484v1",
"title": "EXPEREPAIR: Dual-Memory Enhanced LLM-based Repository-Level Program Repair",
"authors": [
"Fangwen Mu",
"Junjie Wang",
"Lin Shi",
"Song Wang",
"Shoubin Li",
"Qing Wang"
],
"abstract": "Automatically repairing software issues remains a fundamental challenge at\nthe intersection of software engineering and AI. Although recent advancements\nin Large Language Models (LLMs) have demonstrated potential for\nrepository-level repair tasks, current methodologies exhibit two notable\nlimitations: (1) they often address issues in isolation, neglecting to\nincorporate insights from previously resolved issues, and (2) they rely on\nstatic and rigid prompting strategies, which constrain their ability to\ngeneralize across diverse and evolving issue scenarios. Inspired by the dual\nmemory systems of human cognition, where episodic and semantic memories work\nsynergistically to support human reasoning and decision-making, we propose\nExpeRepair, a novel LLM-based approach that continuously learns from historical\nrepair experiences through dual-channel knowledge accumulation. ExpeRepair\norganizes historical repair experiences into two complementary memories: an\nepisodic memory that stores concrete repair demonstrations, and a semantic\nmemory that encodes abstract reflective insights. At inference time, ExpeRepair\nactivates both memory systems by retrieving relevant demonstrations from\nepisodic memory and recalling high-level repair insights from semantic memory.\nIt further enhances adaptability through dynamic prompt composition,\nsynergistically integrating both memory types to replace static prompts with\ncontext-aware, experience-driven prompts. Experiments on the SWE-bench Lite\nbenchmark demonstrate that ExpeRepair achieves a pass@1 score of 49.3% with\nClaude 3.7 Sonnet, outperforming all state-of-the-art open-source methods.",
"categories": [
"cs.SE"
],
"published": "2025-06-12T08:39:27+00:00",
"url": "http://arxiv.org/pdf/2506.10484v1",
"resource_uri": "arxiv://2506.10484v1",
"citation_count": 0
},
{
"id": "2506.10426v1",
"title": "Towards Understanding Bugs in Distributed Training and Inference Frameworks for Large Language Models",
"authors": [
"Xiao Yu",
"Haoxuan Chen",
"Feifei Niu",
"Xing Hu",
"Jacky Wai Keung",
"Xin Xia"
],
"abstract": "With the rapid development of large language models (LLMs), distributed\ntraining and inference frameworks like DeepSpeed have become essential for\nscaling model training and inference across multiple GPUs or nodes. However,\nthe increasing complexity of these frameworks brings non-trivial software bugs,\nwhich may degrade training performance, cause unexpected failures, and result\nin significant resource waste. Understanding framework bugs' characteristics is\nfundamental for quality assurance, allowing the design of more effective\ndebugging and repair methods. Thus, our paper conducts the first large-scale\nempirical analysis of 308 fixed bugs across three popular distributed\ntraining/inference frameworks: DeepSpeed, Megatron-LM, and Colossal-AI. We\nexamine bug symptoms, root causes, bug identification and fixing efforts, and\ncommon low-effort fixing strategies. Additionally, the distributed nature of\nthese frameworks introduces unique bug root causes, such as allocation strategy\nerror and distributed communication error. Diagnosing and fixing complex bugs\nremains challenging due to factors like the disconnect between symptoms and\nroot causes, high bug reproduction costs, and low-level or cross-component\ninteractions. Interestingly, we observe that 48% of bug fixes require minimal\ncode changes (<=10 LOC) and follow simple strategies such as conditional logic\noptimization, parameter handling enhancement, or version compatibility\nhandling, indicating potential for automation. Based on these insights, we\noffer several implications for improving the reliability of both distributed\ntraining and inference frameworks and their dependent LLM projects, while also\nidentifying opportunities to leverage LLM-based tools for automated debugging\nand repair.",
"categories": [
"cs.SE"
],
"published": "2025-06-12T07:24:59+00:00",
"url": "http://arxiv.org/pdf/2506.10426v1",
"resource_uri": "arxiv://2506.10426v1",
"citation_count": 0
},
{
"id": "2506.08311v1",
"title": "Understanding Software Engineering Agents Through the Lens of Traceability: An Empirical Study",
"authors": [
"Ira Ceka",
"Saurabh Pujar",
"Shyam Ramji",
"Luca Buratti",
"Gail Kaiser",
"Baishakhi Ray"
],
"abstract": "With the advent of large language models (LLMs), software engineering agents\n(SWE agents) have emerged as a powerful paradigm for automating a range of\nsoftware tasks -- from code generation and repair to test case synthesis. These\nagents operate autonomously by interpreting user input and responding to\nenvironmental feedback. While various agent architectures have demonstrated\nstrong empirical performance, the internal decision-making worfklows that drive\ntheir behavior remain poorly understood. Deeper insight into these workflows\nhold promise for improving both agent reliability and efficiency. In this work,\nwe present the first systematic study of SWE agent behavior through the lens of\nexecution traces. Our contributions are as follows: (1) we propose the first\ntaxonomy of decision-making pathways across five representative agents; (2)\nusing this taxonomy, we identify three core components essential to agent\nsuccess -- bug localization, patch generation, and reproduction test generation\n-- and study each in depth; (3) we study the impact of test generation on\nsuccessful patch production; and analyze strategies that can lead to successful\ntest generation; (4) we further conduct the first large-scale code clone\nanalysis comparing agent-generated and developer-written patches and provide a\nqualitative study revealing structural and stylistic differences in patch\ncontent. Together, these findings offer novel insights into agent design and\nopen avenues for building agents that are both more effective and more aligned\nwith human development practices.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2025-06-10T00:41:54+00:00",
"url": "http://arxiv.org/pdf/2506.08311v1",
"resource_uri": "arxiv://2506.08311v1",
"citation_count": 0
},
{
"id": "2506.08173v1",
"title": "Repeton: Structured Bug Repair with ReAct-Guided Patch-and-Test Cycles",
"authors": [
"Nguyen Phu Vinh",
"Anh Chung Hoang",
"Chris Ngo",
"Truong-Son Hy"
],
"abstract": "Large Language Models (LLMs) have shown strong capabilities in code\ngeneration and comprehension, yet their application to complex software\nengineering tasks often suffers from low precision and limited\ninterpretability. We present Repeton, a fully open-source framework that\nleverages LLMs for precise and automated code manipulation in real-world Git\nrepositories. Rather than generating holistic fixes, Repeton operates through a\nstructured patch-and-test pipeline: it iteratively diagnoses issues, proposes\ncode changes, and validates each patch through automated testing. This stepwise\nprocess is guided by lightweight heuristics and development tools, avoiding\nreliance on embedding-based retrieval systems. Evaluated on the SWE-bench Lite\nbenchmark, our method shows good performance compared to RAG-based methods in\nboth patch validity and interpretability. By decomposing software engineering\ntasks into modular, verifiable stages, Repeton provides a practical path toward\nscalable and transparent autonomous debugging.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2025-06-09T19:36:40+00:00",
"url": "http://arxiv.org/pdf/2506.08173v1",
"resource_uri": "arxiv://2506.08173v1",
"citation_count": 0
},
{
"id": "2506.05614v1",
"title": "Which Prompting Technique Should I Use? An Empirical Investigation of Prompting Techniques for Software Engineering Tasks",
"authors": [
"E. G. Santana Jr",
"Gabriel Benjamin",
"Melissa Araujo",
"Harrison Santos",
"David Freitas",
"Eduardo Almeida",
"Paulo Anselmo da M. S. Neto",
"Jiawei Li",
"Jina Chun",
"Iftekhar Ahmed"
],
"abstract": "A growing variety of prompt engineering techniques has been proposed for\nLarge Language Models (LLMs), yet systematic evaluation of each technique on\nindividual software engineering (SE) tasks remains underexplored. In this\nstudy, we present a systematic evaluation of 14 established prompt techniques\nacross 10 SE tasks using four LLM models. As identified in the prior\nliterature, the selected prompting techniques span six core dimensions\n(Zero-Shot, Few-Shot, Thought Generation, Ensembling, Self-Criticism, and\nDecomposition). They are evaluated on tasks such as code generation, bug\nfixing, and code-oriented question answering, to name a few. Our results show\nwhich prompting techniques are most effective for SE tasks requiring complex\nlogic and intensive reasoning versus those that rely more on contextual\nunderstanding and example-driven scenarios. We also analyze correlations\nbetween the linguistic characteristics of prompts and the factors that\ncontribute to the effectiveness of prompting techniques in enhancing\nperformance on SE tasks. Additionally, we report the time and token consumption\nfor each prompting technique when applied to a specific task and model,\noffering guidance for practitioners in selecting the optimal prompting\ntechnique for their use cases.",
"categories": [
"cs.SE"
],
"published": "2025-06-05T21:58:44+00:00",
"url": "http://arxiv.org/pdf/2506.05614v1",
"resource_uri": "arxiv://2506.05614v1",
"citation_count": 0
},
{
"id": "2506.04987v1",
"title": "A Multi-Dataset Evaluation of Models for Automated Vulnerability Repair",
"authors": [
"Zanis Ali Khan",
"Aayush Garg",
"Qiang Tang"
],
"abstract": "Software vulnerabilities pose significant security threats, requiring\neffective mitigation. While Automated Program Repair (APR) has advanced in\nfixing general bugs, vulnerability patching, a security-critical aspect of APR\nremains underexplored. This study investigates pre-trained language models,\nCodeBERT and CodeT5, for automated vulnerability patching across six datasets\nand four languages. We evaluate their accuracy and generalization to unknown\nvulnerabilities. Results show that while both models face challenges with\nfragmented or sparse context, CodeBERT performs comparatively better in such\nscenarios, whereas CodeT5 excels in capturing complex vulnerability patterns.\nCodeT5 also demonstrates superior scalability. Furthermore, we test fine-tuned\nmodels on both in-distribution (trained) and out-of-distribution (unseen)\ndatasets. While fine-tuning improves in-distribution performance, models\nstruggle to generalize to unseen data, highlighting challenges in robust\nvulnerability detection. This study benchmarks model performance, identifies\nlimitations in generalization, and provides actionable insights to advance\nautomated vulnerability patching for real-world security applications.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2025-06-05T13:00:19+00:00",
"url": "http://arxiv.org/pdf/2506.04987v1",
"resource_uri": "arxiv://2506.04987v1",
"citation_count": 0
},
{
"id": "2506.04019v1",
"title": "CETBench: A Novel Dataset constructed via Transformations over Programs for Benchmarking LLMs for Code-Equivalence Checking",
"authors": [
"Neeva Oza",
"Ishaan Govil",
"Parul Gupta",
"Dinesh Khandelwal",
"Dinesh Garg",
"Parag Singla"
],
"abstract": "LLMs have been extensively used for the task of automated code generation. In\nthis work, we examine the applicability of LLMs for the related but relatively\nunexplored task of code-equivalence checking, i.e., given two programs, whether\nthey are functionally equivalent or not. This is an important problem since\nbenchmarking code equivalence can play a critical role in evaluating LLM\ncapabilities for tasks such as code re-writing and code translation. Towards\nthis end, we present CETBench - Code Equivalence with Transformations\nBenchmark, constructed via a repository of programs, where two programs in the\nrepository may be solving the same or different tasks. Each instance in our\ndataset is obtained by taking a pair of programs in the repository and applying\na random series of pre-defined code transformations, resulting in\n(non-)equivalent pairs. Our analysis on this dataset reveals a surprising\nfinding that very simple code transformations in the underlying pair of\nprograms can result in a significant drop in performance of SOTA LLMs for the\ntask of code-equivalence checking. To remedy this, we present a simple\nfine-tuning-based approach to boost LLM performance on the transformed pairs of\nprograms. Our approach for dataset generation is generic, and can be used with\nrepositories with varying program difficulty levels and allows for applying\nvarying numbers as well as kinds of transformations. In our experiments, we\nperform ablations over the difficulty level of original programs, as well as\nthe kind of transformations used in generating pairs for equivalence checking.\nOur analysis presents deep insights into the working of LLMs for the task of\ncode-equivalence, and points to the fact that they may still be far from what\ncould be termed as a semantic understanding of the underlying code.",
"categories": [
"cs.SE",
"cs.CL",
"cs.LG",
"cs.PL",
"68-02 (Primary) 68T50, 68T07, 68N19, 68N30 (Secondary)",
"I.2.7; I.2.6; I.2.5; D.3.0; D.3.3; D.3.1; F.3.2; F.3.1; F.3.3;\n D.2.3; D.2.5"
],
"published": "2025-06-04T14:47:14+00:00",
"url": "http://arxiv.org/pdf/2506.04019v1",
"resource_uri": "arxiv://2506.04019v1",
"citation_count": 0
},
{
"id": "2506.03921v1",
"title": "Boosting Open-Source LLMs for Program Repair via Reasoning Transfer and LLM-Guided Reinforcement Learning",
"authors": [
"Xunzhu Tang",
"Jacques Klein",
"Tegawendé F. Bissyandé"
],
"abstract": "Several closed-source LLMs have consistently outperformed open-source\nalternatives in program repair tasks, primarily due to their superior reasoning\ncapabilities and extensive pre-training. This paper introduces Repairity, a\nnovel three-stage methodology that significantly narrows this performance gap\nthrough reasoning extraction and reinforcement learning. Our approach: (1)\nsystematically filters high-quality reasoning traces from closed-source models\nusing correctness verification, (2) transfers this reasoning knowledge to\nopen-source models via supervised fine-tuning, and (3) develops reinforcement\nlearning with LLM-based feedback to further optimize performance. Empirical\nevaluation across multiple program repair benchmarks demonstrates that\nRepairity improves the performance of Qwen2.5-Coder-32B-Instruct, a base open\nsource LLM, by 8.68\\% on average, reducing the capability gap with\nClaude-Sonnet3.7, a state-of-the-art closed-source model, from 10.05% to 1.35%.\nAblation studies confirm that both reasoning extraction and LLM-guided\nreinforcement learning contribute significantly to these improvements. Our\nmethodology generalizes effectively to additional code-related tasks, enabling\norganizations to leverage high-quality program repair capabilities while\nmaintaining the customizability, transparency, and deployment flexibility\ninherent to open-source models.",
"categories": [
"cs.SE"
],
"published": "2025-06-04T13:13:58+00:00",
"url": "http://arxiv.org/pdf/2506.03921v1",
"resource_uri": "arxiv://2506.03921v1",
"citation_count": 0
},
{
"id": "2506.03524v2",
"title": "Seed-Coder: Let the Code Model Curate Data for Itself",
"authors": [
"ByteDance Seed",
"Yuyu Zhang",
"Jing Su",
"Yifan Sun",
"Chenguang Xi",
"Xia Xiao",
"Shen Zheng",
"Anxiang Zhang",
"Kaibo Liu",
"Daoguang Zan",
"Tao Sun",
"Jinhua Zhu",
"Shulin Xin",
"Dong Huang",
"Yetao Bai",
"Lixin Dong",
"Chao Li",
"Jianchong Chen",
"Hanzhi Zhou",
"Yifan Huang",
"Guanghan Ning",
"Xierui Song",
"Jiaze Chen",
"Siyao Liu",
"Kai Shen",
"Liang Xiang",
"Yonghui Wu"
],
"abstract": "Code data in large language model (LLM) pretraining is recognized crucial not\nonly for code-related tasks but also for enhancing general intelligence of\nLLMs. Current open-source LLMs often heavily rely on human effort to produce\ntheir code pretraining data, such as employing hand-crafted filtering rules\ntailored to individual programming languages, or using human-annotated data to\ntrain quality filters. However, these approaches are inherently limited in\nscalability, prone to subjective biases, and costly to extend and maintain\nacross diverse programming languages. To address these challenges, we introduce\nSeed-Coder, a series of open-source LLMs comprising base, instruct and\nreasoning models of 8B size, minimizing human involvement in data construction.\nOur code pretraining data is produced by a model-centric data pipeline, which\npredominantly leverages LLMs for scoring and filtering code data. The instruct\nmodel is further trained via supervised fine-tuning and preference\noptimization, and the reasoning model leverages Long-Chain-of-Thought (LongCoT)\nreinforcement learning to improve multi-step code reasoning. Seed-Coder\nachieves state-of-the-art results among open-source models of similar size and\neven surpasses some much larger models, demonstrating superior performance in\ncode generation, code completion, code editing, code reasoning, and software\nengineering tasks.",
"categories": [
"cs.CL",
"cs.SE"
],
"published": "2025-06-04T03:17:19+00:00",
"url": "http://arxiv.org/pdf/2506.03524v2",
"resource_uri": "arxiv://2506.03524v2",
"citation_count": 0
},
{
"id": "2506.03283v1",
"title": "Empirical Evaluation of Generalizable Automated Program Repair with Large Language Models",
"authors": [
"Viola Campos",
"Ridwan Shariffdeen",
"Adrian Ulges",
"Yannic Noller"
],
"abstract": "Automated Program Repair (APR) proposes bug fixes to aid developers in\nmaintaining software. The state of the art in this domain focuses on using\nLLMs, leveraging their strong capabilities to comprehend specifications in\nnatural language and to generate program code. Recent works have shown that\nLLMs can be used to generate repairs. However, despite the APR community's\nresearch achievements and several industry deployments in the last decade, APR\nstill lacks the capabilities to generalize broadly. In this work, we present an\nintensive empirical evaluation of LLMs for generating patches. We evaluate a\ndiverse set of 13 recent models, including open ones (e.g., Llama 3.3, Qwen 2.5\nCoder, and DeepSeek R1 (dist.)) and closed ones (e.g., o3-mini, GPT-4o, Claude\n3.7 Sonnet, Gemini 2.0 Flash). In particular, we explore language-agnostic\nrepairs by utilizing benchmarks for Java (e.g., Defects4J), JavaScript (e.g.,\nBugsJS), Python (e.g., BugsInPy), and PHP (e.g., BugsPHP). Besides the\ngeneralization between different languages and levels of patch complexity, we\nalso investigate the effects of fault localization (FL) as a preprocessing step\nand compare the progress for open vs closed models. Our evaluation represents a\nsnapshot of the current repair capabilities of the latest LLMs. Key results\ninclude: (1) Different LLMs tend to perform best for different languages, which\nmakes it hard to develop cross-platform repair techniques with single LLMs. (2)\nThe combinations of models add value with respect to uniquely fixed bugs, so a\ncommittee of expert models should be considered. (3) Under realistic\nassumptions of imperfect FL, we observe significant drops in accuracy from the\nusual practice of using perfect FL. Our findings and insights will help both\nresearchers and practitioners develop reliable and generalizable APR techniques\nand evaluate them in realistic and fair environments.",
"categories": [
"cs.SE"
],
"published": "2025-06-03T18:15:14+00:00",
"url": "http://arxiv.org/pdf/2506.03283v1",
"resource_uri": "arxiv://2506.03283v1",
"citation_count": 0
},
{
"id": "2506.02780v1",
"title": "Reuse or Generate? Accelerating Code Editing via Edit-Oriented Speculative Decoding",
"authors": [
"Peiding Wang",
"Li Zhang",
"Fang Liu",
"Yinghao Zhu",
"Wang Xu",
"Lin Shi",
"Xiaoli Lian",
"Minxiao Li",
"Bo Shen",
"An Fu"
],
"abstract": "Large Language Models (LLMs) have demonstrated remarkable capabilities in\ncode editing, substantially enhancing software development productivity.\nHowever, the inherent complexity of code editing tasks forces existing\napproaches to rely on LLMs' autoregressive end-to-end generation, where\ndecoding speed plays a critical role in efficiency. While inference\nacceleration techniques like speculative decoding are applied to improve the\ndecoding efficiency, these methods fail to account for the unique\ncharacteristics of code editing tasks where changes are typically localized and\nexisting code segments are reused. To address this limitation, we propose\nEfficientEdit, a novel method that improves LLM-based code editing efficiency\nthrough two key mechanisms based on speculative decoding: (1) effective reuse\nof original code segments while identifying potential edit locations, and (2)\nefficient generate edit content via high-quality drafts from edit-oriented\ndraft models and a dynamic verification mechanism that balances quality and\nacceleration. Experimental results show that EfficientEdit can achieve up to\n10.38$\\times$ and 13.09$\\times$ speedup compared to standard autoregressive\ndecoding in CanItEdit and CodeIF-Bench, respectively, outperforming\nstate-of-the-art inference acceleration approaches by up to 90.6%.",
"categories": [
"cs.SE"
],
"published": "2025-06-03T12:01:20+00:00",
"url": "http://arxiv.org/pdf/2506.02780v1",
"resource_uri": "arxiv://2506.02780v1",
"citation_count": 0
},
{
"id": "2506.02617v1",
"title": "Toward Understanding Bugs in Vector Database Management Systems",
"authors": [
"Yinglin Xie",
"Xinyi Hou",
"Yanjie Zhao",
"Shenao Wang",
"Kai Chen",
"Haoyu Wang"
],
"abstract": "Vector database management systems (VDBMSs) play a crucial role in\nfacilitating semantic similarity searches over high-dimensional embeddings from\ndiverse data sources. While VDBMSs are widely used in applications such as\nrecommendation, retrieval-augmented generation (RAG), and multimodal search,\ntheir reliability remains underexplored. Traditional database reliability\nmodels cannot be directly applied to VDBMSs because of fundamental differences\nin data representation, query mechanisms, and system architecture. To address\nthis gap, we present the first large-scale empirical study of software defects\nin VDBMSs. We manually analyzed 1,671 bug-fix pull requests from 15 widely used\nopen-source VDBMSs and developed a comprehensive taxonomy of bugs based on\nsymptoms, root causes, and developer fix strategies. Our study identifies five\ncategories of bug symptoms, with more than half manifesting as functional\nfailures. We further reveal 31 recurring fault patterns and highlight failure\nmodes unique to vector search systems. In addition, we summarize 12 common fix\nstrategies, whose distribution underscores the critical importance of correct\nprogram logic. These findings provide actionable insights into VDBMS\nreliability challenges and offer guidance for building more robust future\nsystems.",
"categories": [
"cs.SE"
],
"published": "2025-06-03T08:34:01+00:00",
"url": "http://arxiv.org/pdf/2506.02617v1",
"resource_uri": "arxiv://2506.02617v1",
"citation_count": 0
},
{
"id": "2506.00204v1",
"title": "Structure-Aware Fill-in-the-Middle Pretraining for Code",
"authors": [
"Linyuan Gong",
"Alvin Cheung",
"Mostafa Elhoushi",
"Sida Wang"
],
"abstract": "Fill-in-the-Middle (FIM) is a common pretraining method for code LLMs, where\nmodels complete code segments given surrounding context. However, existing LLMs\ntreat code as plain text and mask random character spans. We propose and\nevaluate AST-FIM, a pretraining strategy that leverages Abstract Syntax Trees\n(ASTs) to mask complete syntactic structures at scale, ensuring coherent\ntraining examples better aligned with universal code structures and common code\nediting patterns such as blocks, expressions, or functions. To evaluate\nreal-world fill-in-the-middle (FIM) programming tasks, we introduce\nReal-FIM-Eval, a benchmark derived from 30,000+ GitHub commits across 12\nlanguages. On infilling tasks, experiments on 1B and 8B parameter models show\nthat AST-FIM is particularly beneficial for real-world code editing as it\noutperforms standard random-character FIM by up to 5 pts on standard FIM\nbenchmarks. Our code is publicly available at\nhttps://github.com/gonglinyuan/ast_fim.",
"categories": [
"cs.CL",
"cs.AI",
"cs.SE"
],
"published": "2025-05-30T20:19:39+00:00",
"url": "http://arxiv.org/pdf/2506.00204v1",
"resource_uri": "arxiv://2506.00204v1",
"citation_count": 0
},
{
"id": "2506.00172v1",
"title": "Breakpoint: Scalable evaluation of system-level reasoning in LLM code agents",
"authors": [
"Kaivalya Hariharan",
"Uzay Girit",
"Atticus Wang",
"Jacob Andreas"
],
"abstract": "Benchmarks for large language models (LLMs) have predominantly assessed\nshort-horizon, localized reasoning. Existing long-horizon suites (e.g.\nSWE-bench) rely on manually curated issues, so expanding or tuning difficulty\ndemands expensive human effort and evaluations quickly saturate. However, many\nreal-world tasks, such as software engineering or scientific research, require\nagents to rapidly comprehend and manipulate novel, complex structures\ndynamically; evaluating these capabilities requires the ability to construct\nlarge and varied sets of problems for agents to solve. We introduce Breakpoint,\na benchmarking methodology that automatically generates code-repair tasks by\nadversarially corrupting functions within real-world software repositories.\nBreakpoint systematically controls task difficulty along two clear dimensions:\nlocal reasoning (characterized by code complexity metrics such as cyclomatic\ncomplexity) and system-level reasoning (characterized by call-graph centrality\nand the number of simultaneously corrupted interdependent functions). In\nexperiments across more than 900 generated tasks we demonstrate that our\nmethodology can scale to arbitrary difficulty, with state-of-the-art models'\nsuccess rates ranging from 55% on the easiest tasks down to 0% on the hardest.",
"categories": [
"cs.LG"
],
"published": "2025-05-30T19:23:51+00:00",
"url": "http://arxiv.org/pdf/2506.00172v1",
"resource_uri": "arxiv://2506.00172v1",
"citation_count": 0
},
{
"id": "2505.24715v1",
"title": "CoRet: Improved Retriever for Code Editing",
"authors": [
"Fabio Fehr",
"Prabhu Teja Sivaprasad",
"Luca Franceschi",
"Giovanni Zappella"
],
"abstract": "In this paper, we introduce CoRet, a dense retrieval model designed for\ncode-editing tasks that integrates code semantics, repository structure, and\ncall graph dependencies. The model focuses on retrieving relevant portions of a\ncode repository based on natural language queries such as requests to implement\nnew features or fix bugs. These retrieved code chunks can then be presented to\na user or to a second code-editing model or agent. To train CoRet, we propose a\nloss function explicitly designed for repository-level retrieval. On SWE-bench\nand Long Code Arena's bug localisation datasets, we show that our model\nsubstantially improves retrieval recall by at least 15 percentage points over\nexisting models, and ablate the design choices to show their importance in\nachieving these results.",
"categories": [
"cs.LG",
"cs.AI",
"cs.CL"
],
"published": "2025-05-30T15:36:37+00:00",
"url": "http://arxiv.org/pdf/2505.24715v1",
"resource_uri": "arxiv://2505.24715v1",
"citation_count": 0
},
{
"id": "2505.23932v2",
"title": "SwingArena: Competitive Programming Arena for Long-context GitHub Issue Solving",
"authors": [
"Wendong Xu",
"Jing Xiong",
"Chenyang Zhao",
"Qiujiang Chen",
"Haoran Wang",
"Hui Shen",
"Zhongwei Wan",
"Jianbo Dai",
"Taiqiang Wu",
"He Xiao",
"Chaofan Tao",
"Z. Morley Mao",
"Ying Sheng",
"Zhijiang Guo",
"Hongxia Yang",
"Bei Yu",
"Lingpeng Kong",
"Quanquan Gu",
"Ngai Wong"
],
"abstract": "We present SwingArena, a competitive evaluation framework for Large Language\nModels (LLMs) that closely mirrors real-world software development workflows.\nUnlike traditional static benchmarks, SwingArena models the collaborative\nprocess of software iteration by pairing LLMs as submitters, who generate\npatches, and reviewers, who create test cases and verify the patches through\ncontinuous integration (CI) pipelines. To support these interactive\nevaluations, we introduce a retrieval-augmented code generation (RACG) module\nthat efficiently handles long-context challenges by providing syntactically and\nsemantically relevant code snippets from large codebases, supporting multiple\nprogramming languages (C++, Python, Rust, and Go). This enables the framework\nto scale across diverse tasks and contexts while respecting token limitations.\nOur experiments, using over 400 high-quality real-world GitHub issues selected\nfrom a pool of 2,300 issues, show that models like GPT-4o excel at aggressive\npatch generation, whereas DeepSeek and Gemini prioritize correctness in CI\nvalidation. SwingArena presents a scalable and extensible methodology for\nevaluating LLMs in realistic, CI-driven software development settings. More\ndetails are available on our project page: swing-bench.github.io",
"categories": [
"cs.CL"
],
"published": "2025-05-29T18:28:02+00:00",
"url": "http://arxiv.org/pdf/2505.23932v2",
"resource_uri": "arxiv://2505.23932v2",
"citation_count": 0
},
{
"id": "2505.22954v1",
"title": "Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents",
"authors": [
"Jenny Zhang",
"Shengran Hu",
"Cong Lu",
"Robert Lange",
"Jeff Clune"
],
"abstract": "Today's AI systems have human-designed, fixed architectures and cannot\nautonomously and continuously improve themselves. The advance of AI could\nitself be automated. If done safely, that would accelerate AI development and\nallow us to reap its benefits much sooner. Meta-learning can automate the\ndiscovery of novel algorithms, but is limited by first-order improvements and\nthe human design of a suitable search space. The G\\\"odel machine proposed a\ntheoretical alternative: a self-improving AI that repeatedly modifies itself in\na provably beneficial manner. Unfortunately, proving that most changes are net\nbeneficial is impossible in practice. We introduce the Darwin G\\\"odel Machine\n(DGM), a self-improving system that iteratively modifies its own code (thereby\nalso improving its ability to modify its own codebase) and empirically\nvalidates each change using coding benchmarks. Inspired by Darwinian evolution\nand open-endedness research, the DGM maintains an archive of generated coding\nagents. It grows the archive by sampling an agent from it and using a\nfoundation model to create a new, interesting, version of the sampled agent.\nThis open-ended exploration forms a growing tree of diverse, high-quality\nagents and allows the parallel exploration of many different paths through the\nsearch space. Empirically, the DGM automatically improves its coding\ncapabilities (e.g., better code editing tools, long-context window management,\npeer-review mechanisms), increasing performance on SWE-bench from 20.0% to\n50.0%, and on Polyglot from 14.2% to 30.7%. Furthermore, the DGM significantly\noutperforms baselines without self-improvement or open-ended exploration. All\nexperiments were done with safety precautions (e.g., sandboxing, human\noversight). The DGM is a significant step toward self-improving AI, capable of\ngathering its own stepping stones along paths that unfold into endless\ninnovation.",
"categories": [
"cs.AI"
],
"published": "2025-05-29T00:26:15+00:00",
"url": "http://arxiv.org/pdf/2505.22954v1",
"resource_uri": "arxiv://2505.22954v1",
"citation_count": 0
},
{
"id": "2505.22304v1",
"title": "CADReview: Automatically Reviewing CAD Programs with Error Detection and Correction",
"authors": [
"Jiali Chen",
"Xusen Hei",
"HongFei Liu",
"Yuancheng Wei",
"Zikun Deng",
"Jiayuan Xie",
"Yi Cai",
"Li Qing"
],
"abstract": "Computer-aided design (CAD) is crucial in prototyping 3D objects through\ngeometric instructions (i.e., CAD programs). In practical design workflows,\ndesigners often engage in time-consuming reviews and refinements of these\nprototypes by comparing them with reference images. To bridge this gap, we\nintroduce the CAD review task to automatically detect and correct potential\nerrors, ensuring consistency between the constructed 3D objects and reference\nimages. However, recent advanced multimodal large language models (MLLMs)\nstruggle to recognize multiple geometric components and perform spatial\ngeometric operations within the CAD program, leading to inaccurate reviews. In\nthis paper, we propose the CAD program repairer (ReCAD) framework to\neffectively detect program errors and provide helpful feedback on error\ncorrection. Additionally, we create a dataset, CADReview, consisting of over\n20K program-image pairs, with diverse errors for the CAD review task. Extensive\nexperiments demonstrate that our ReCAD significantly outperforms existing\nMLLMs, which shows great potential in design applications.",
"categories": [
"cs.CV"
],
"published": "2025-05-28T12:41:00+00:00",
"url": "http://arxiv.org/pdf/2505.22304v1",
"resource_uri": "arxiv://2505.22304v1",
"citation_count": 0
},
{
"id": "2505.20854v1",
"title": "An LLM-as-Judge Metric for Bridging the Gap with Human Evaluation in SE Tasks",
"authors": [
"Xin Zhou",
"Kisub Kim",
"Ting Zhang",
"Martin Weyssow",
"Luis F. Gomes",
"Guang Yang",
"David Lo"
],
"abstract": "Large Language Models (LLMs) and other automated techniques have been\nincreasingly used to support software developers by generating software\nartifacts such as code snippets, patches, and comments. However, accurately\nassessing the correctness of these generated artifacts remains a significant\nchallenge. On one hand, human evaluation provides high accuracy but is\nlabor-intensive and lacks scalability. On the other hand, other existing\nautomatic evaluation metrics are scalable and require minimal human effort, but\nthey often fail to accurately reflect the actual correctness of generated\nsoftware artifacts.\n In this paper, we present SWE-Judge, the first evaluation metric for\nLLM-as-Ensemble-Judge specifically designed to accurately assess the\ncorrectness of generated software artifacts. SWE-Judge first defines five\ndistinct evaluation strategies, each implemented as an independent judge. A\ndynamic team selection mechanism then identifies the most appropriate subset of\njudges to produce a final correctness score through ensembling. We evaluate\nSWE-Judge across a diverse set of software engineering (SE) benchmarks,\nincluding CoNaLa, Card2Code, HumanEval-X, APPS, APR-Assess, and Summary-Assess.\nThese benchmarks span three SE tasks: code generation, automated program\nrepair, and code summarization. Experimental results demonstrate that SWE-Judge\nconsistently achieves a higher correlation with human judgments, with\nimprovements ranging from 5.9% to 183.8% over existing automatic metrics.\nFurthermore, SWE-Judge reaches agreement levels with human annotators that are\ncomparable to inter-annotator agreement in code generation and program repair\ntasks. These findings underscore SWE-Judge's potential as a scalable and\nreliable alternative to human evaluation.",
"categories": [
"cs.SE",
"cs.AI",
"cs.CL"
],
"published": "2025-05-27T08:04:34+00:00",
"url": "http://arxiv.org/pdf/2505.20854v1",
"resource_uri": "arxiv://2505.20854v1",
"citation_count": 0
},
{
"id": "2505.20207v1",
"title": "GPUMC: A Stateless Model Checker for GPU Weak Memory Concurrency",
"authors": [
"Soham Chakraborty",
"S. Krishna",
"Andreas Pavlogiannis",
"Omkar Tuppe"
],
"abstract": "GPU computing is embracing weak memory concurrency for performance\nimprovement. However, compared to CPUs, modern GPUs provide more fine-grained\nconcurrency features such as scopes, have additional properties like\ndivergence, and thereby follow different weak memory consistency models. These\nfeatures and properties make concurrent programming on GPUs more complex and\nerror-prone. To this end, we present GPUMC, a stateless model checker to check\nthe correctness of GPU shared-memory concurrent programs under scoped-RC11 weak\nmemory concurrency model. GPUMC explores all possible executions in GPU\nprograms to reveal various errors - races, barrier divergence, and assertion\nviolations. In addition, GPUMC also automatically repairs these errors in the\nappropriate cases.\n We evaluate GPUMC with benchmarks and real-life GPU programs. GPUMC is\nefficient both in time and memory in verifying large GPU programs where\nstate-of-the-art tools are timed out. In addition, GPUMC identifies all known\nerrors in these benchmarks compared to the state-of-the-art tools.",
"categories": [
"cs.LO",
"cs.PL",
"cs.SE"
],
"published": "2025-05-26T16:47:44+00:00",
"url": "http://arxiv.org/pdf/2505.20207v1",
"resource_uri": "arxiv://2505.20207v1",
"citation_count": 0
},
{
"id": "2505.19395v1",
"title": "VADER: A Human-Evaluated Benchmark for Vulnerability Assessment, Detection, Explanation, and Remediation",
"authors": [
"Ethan TS. Liu",
"Austin Wang",
"Spencer Mateega",
"Carlos Georgescu",
"Danny Tang"
],
"abstract": "Ensuring that large language models (LLMs) can effectively assess, detect,\nexplain, and remediate software vulnerabilities is critical for building robust\nand secure software systems. We introduce VADER, a human-evaluated benchmark\ndesigned explicitly to assess LLM performance across four key\nvulnerability-handling dimensions: assessment, detection, explanation, and\nremediation. VADER comprises 174 real-world software vulnerabilities, each\ncarefully curated from GitHub repositories and annotated by security experts.\nFor each vulnerability case, models are tasked with identifying the flaw,\nclassifying it using Common Weakness Enumeration (CWE), explaining its\nunderlying cause, proposing a patch, and formulating a test plan. Using a\none-shot prompting strategy, we benchmark six state-of-the-art LLMs (Claude 3.7\nSonnet, Gemini 2.5 Pro, GPT-4.1, GPT-4.5, Grok 3 Beta, and o3) on VADER, and\nhuman security experts evaluated each response according to a rigorous scoring\nrubric emphasizing remediation (quality of the code fix, 50%), explanation\n(20%), and classification and test plan (30%) according to a standardized\nrubric. Our results show that current state-of-the-art LLMs achieve only\nmoderate success on VADER - OpenAI's o3 attained 54.7% accuracy overall, with\nothers in the 49-54% range, indicating ample room for improvement. Notably,\nremediation quality is strongly correlated (Pearson r > 0.97) with accurate\nclassification and test plans, suggesting that models that effectively\ncategorize vulnerabilities also tend to fix them well. VADER's comprehensive\ndataset, detailed evaluation rubrics, scoring tools, and visualized results\nwith confidence intervals are publicly released, providing the community with\nan interpretable, reproducible benchmark to advance vulnerability-aware LLMs.\nAll code and data are available at: https://github.com/AfterQuery/vader",
"categories": [
"cs.CR",
"cs.AI"
],
"published": "2025-05-26T01:20:44+00:00",
"url": "http://arxiv.org/pdf/2505.19395v1",
"resource_uri": "arxiv://2505.19395v1",
"citation_count": 0
},
{
"id": "2505.18955v1",
"title": "Co-PatcheR: Collaborative Software Patching with Component(s)-specific Small Reasoning Models",
"authors": [
"Yuheng Tang",
"Hongwei Li",
"Kaijie Zhu",
"Michael Yang",
"Yangruibo Ding",
"Wenbo Guo"
],
"abstract": "Motivated by the success of general-purpose large language models (LLMs) in\nsoftware patching, recent works started to train specialized patching models.\nMost works trained one model to handle the end-to-end patching pipeline\n(including issue localization, patch generation, and patch validation).\nHowever, it is hard for a small model to handle all tasks, as different\nsub-tasks have different workflows and require different expertise. As such, by\nusing a 70 billion model, SOTA methods can only reach up to 41% resolved rate\non SWE-bench-Verified. Motivated by the collaborative nature, we propose\nCo-PatcheR, the first collaborative patching system with small and specialized\nreasoning models for individual components. Our key technique novelties are the\nspecific task designs and training recipes. First, we train a model for\nlocalization and patch generation. Our localization pinpoints the suspicious\nlines through a two-step procedure, and our generation combines patch\ngeneration and critique. We then propose a hybrid patch validation that\nincludes two models for crafting issue-reproducing test cases with and without\nassertions and judging patch correctness, followed by a majority vote-based\npatch selection. Through extensive evaluation, we show that Co-PatcheR achieves\n46% resolved rate on SWE-bench-Verified with only 3 x 14B models. This makes\nCo-PatcheR the best patcher with specialized models, requiring the least\ntraining resources and the smallest models. We conduct a comprehensive ablation\nstudy to validate our recipes, as well as our choice of training data number,\nmodel size, and testing-phase scaling strategy.",
"categories": [
"cs.AI",
"cs.CR",
"cs.SE"
],
"published": "2025-05-25T02:58:30+00:00",
"url": "http://arxiv.org/pdf/2505.18955v1",
"resource_uri": "arxiv://2505.18955v1",
"citation_count": 0
},
{
"id": "2505.16975v2",
"title": "SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development",
"authors": [
"Yaxin Du",
"Yuzhu Cai",
"Yifan Zhou",
"Cheng Wang",
"Yu Qian",
"Xianghe Pang",
"Qian Liu",
"Yue Hu",
"Siheng Chen"
],
"abstract": "Large Language Models (LLMs) have shown strong capability in diverse software\nengineering tasks, e.g. code completion, bug fixing, and document generation.\nHowever, feature-driven development (FDD), a highly prevalent real-world task\nthat involves developing new functionalities for large, existing codebases,\nremains underexplored. We therefore introduce SWE-Dev, the first large-scale\ndataset (with 14,000 training and 500 test samples) designed to evaluate and\ntrain autonomous coding systems on real-world feature development tasks. To\nensure verifiable and diverse training, SWE-Dev uniquely provides all instances\nwith a runnable environment and its developer-authored executable unit tests.\nThis collection not only provides high-quality data for Supervised Fine-Tuning\n(SFT), but also enables Reinforcement Learning (RL) by delivering accurate\nreward signals from executable unit tests. Our extensive evaluations on\nSWE-Dev, covering 17 chatbot LLMs, 10 reasoning models, and 10 Multi-Agent\nSystems (MAS), reveal that FDD is a profoundly challenging frontier for current\nAI (e.g., Claude-3.7-Sonnet achieves only 22.45\\% Pass@3 on the hard test\nsplit). Crucially, we demonstrate that SWE-Dev serves as an effective platform\nfor model improvement: fine-tuning on training set enabled a 7B model\ncomparable to GPT-4o on \\textit{hard} split, underscoring the value of its\nhigh-quality training data. Code is available here\n\\href{https://github.com/DorothyDUUU/SWE-Dev}{https://github.com/DorothyDUUU/SWE-Dev}.",
"categories": [
"cs.SE",
"cs.CL"
],
"published": "2025-05-22T17:51:49+00:00",
"url": "http://arxiv.org/pdf/2505.16975v2",
"resource_uri": "arxiv://2505.16975v2",
"citation_count": 0
},
{
"id": "2505.16402v1",
"title": "AdvReal: Adversarial Patch Generation Framework with Application to Adversarial Safety Evaluation of Object Detection Systems",
"authors": [
"Yuanhao Huang",
"Yilong Ren",
"Jinlei Wang",
"Lujia Huo",
"Xuesong Bai",
"Jinchuan Zhang",
"Haiyan Yu"
],
"abstract": "Autonomous vehicles are typical complex intelligent systems with artificial\nintelligence at their core. However, perception methods based on deep learning\nare extremely vulnerable to adversarial samples, resulting in safety accidents.\nHow to generate effective adversarial examples in the physical world and\nevaluate object detection systems is a huge challenge. In this study, we\npropose a unified joint adversarial training framework for both 2D and 3D\nsamples to address the challenges of intra-class diversity and environmental\nvariations in real-world scenarios. Building upon this framework, we introduce\nan adversarial sample reality enhancement approach that incorporates non-rigid\nsurface modeling and a realistic 3D matching mechanism. We compare with 5\nadvanced adversarial patches and evaluate their attack performance on 8 object\ndetecotrs, including single-stage, two-stage, and transformer-based models.\nExtensive experiment results in digital and physical environments demonstrate\nthat the adversarial textures generated by our method can effectively mislead\nthe target detection model. Moreover, proposed method demonstrates excellent\nrobustness and transferability under multi-angle attacks, varying lighting\nconditions, and different distance in the physical world. The demo video and\ncode can be obtained at https://github.com/Huangyh98/AdvReal.git.",
"categories": [
"cs.CV"
],
"published": "2025-05-22T08:54:03+00:00",
"url": "http://arxiv.org/pdf/2505.16402v1",
"resource_uri": "arxiv://2505.16402v1",
"citation_count": 0
},
{
"id": "2505.14978v1",
"title": "JARVIS: A Multi-Agent Code Assistant for High-Quality EDA Script Generation",
"authors": [
"Ghasem Pasandi",
"Kishor Kunal",
"Varun Tej",
"Kunjal Shan",
"Hanfei Sun",
"Sumit Jain",
"Chunhui Li",
"Chenhui Deng",
"Teodor-Dumitru Ene",
"Haoxing Ren",
"Sreedhar Pratty"
],
"abstract": "This paper presents JARVIS, a novel multi-agent framework that leverages\nLarge Language Models (LLMs) and domain expertise to generate high-quality\nscripts for specialized Electronic Design Automation (EDA) tasks. By combining\na domain-specific LLM trained with synthetically generated data, a custom\ncompiler for structural verification, rule enforcement, code fixing\ncapabilities, and advanced retrieval mechanisms, our approach achieves\nsignificant improvements over state-of-the-art domain-specific models. Our\nframework addresses the challenges of data scarcity and hallucination errors in\nLLMs, demonstrating the potential of LLMs in specialized engineering domains.\nWe evaluate our framework on multiple benchmarks and show that it outperforms\nexisting models in terms of accuracy and reliability. Our work sets a new\nprecedent for the application of LLMs in EDA and paves the way for future\ninnovations in this field.",
"categories": [
"cs.SE",
"cs.AI",
"cs.LG"
],
"published": "2025-05-20T23:40:57+00:00",
"url": "http://arxiv.org/pdf/2505.14978v1",
"resource_uri": "arxiv://2505.14978v1",
"citation_count": 0
},
{
"id": "2505.13103v2",
"title": "Fixing 7,400 Bugs for 1$: Cheap Crash-Site Program Repair",
"authors": [
"Han Zheng",
"Ilia Shumailov",
"Tianqi Fan",
"Aiden Hall",
"Mathias Payer"
],
"abstract": "The rapid advancement of bug-finding techniques has led to the discovery of\nmore vulnerabilities than developers can reasonably fix, creating an urgent\nneed for effective Automated Program Repair (APR) methods. However, the\ncomplexity of modern bugs often makes precise root cause analysis difficult and\nunreliable. To address this challenge, we propose crash-site repair to simplify\nthe repair task while still mitigating the risk of exploitation. In addition,\nwe introduce a template-guided patch generation approach that significantly\nreduces the token cost of Large Language Models (LLMs) while maintaining both\nefficiency and effectiveness.\n We implement our prototype system, WILLIAMT, and evaluate it against\nstate-of-the-art APR tools. Our results show that, when combined with the\ntop-performing agent CodeRover-S, WILLIAMT reduces token cost by 45.9% and\nincreases the bug-fixing rate to 73.5% (+29.6%) on ARVO, a ground-truth open\nsource software vulnerabilities benchmark. Furthermore, we demonstrate that\nWILLIAMT can function effectively even without access to frontier LLMs: even a\nlocal model running on a Mac M4 Mini achieves a reasonable repair rate. These\nfindings highlight the broad applicability and scalability of WILLIAMT.",
"categories": [
"cs.SE",
"cs.CR"
],
"published": "2025-05-19T13:32:51+00:00",
"url": "http://arxiv.org/pdf/2505.13103v2",
"resource_uri": "arxiv://2505.13103v2",
"citation_count": 0
},
{
"id": "2505.13008v2",
"title": "Adversarial Reasoning for Repair Based on Inferred Program Intent",
"authors": [
"He Ye",
"Aidan Z. H. Yang",
"Chang Hu",
"Yanlin Wang",
"Tao Zhang",
"Claire Le Goues"
],
"abstract": "Automated program repair (APR) has shown promising results, particularly with\nthe use of neural networks. Currently, most APR tools focus on code\ntransformations specified by test suites, rather than reasoning about the\nprogram intent and the high-level bug specification. Without a proper\nunderstanding of program intent, these tools tend to generate patches that\noverfit incomplete test suites and fail to reflect the developers intentions.\nHowever, reasoning about program intent is challenging. In our work, we propose\nan approach called AdverIntent-Agent, based on critique and adversarial\nreasoning. Our approach is novel to shift the focus from generating multiple\nAPR patches to inferring multiple potential program intents. Ideally, we aim to\ninfer intents that are, to some extent, adversarial to each other, maximizing\nthe probability that at least one aligns closely with the developers original\nintent. AdverIntent-Agent is a multi-agent approach consisting of three agents:\na reasoning agent, a test agent, and a repair agent. First, the reasoning agent\ngenerates adversarial program intents along with the corresponding faulty\nstatements. Next, the test agent produces adversarial test cases that align\nwith each inferred intent, constructing oracles that use the same inputs but\nhave different expected outputs. Finally, the repair agent uses dynamic and\nprecise LLM prompts to generate patches that satisfy both the inferred program\nintent and the generated tests. AdverIntent-Agent was evaluated on two\nbenchmarks: Defects4J 2.0 and HumanEval-Java. AdverIntent-Agent correctly\nrepaired 77 and 105 bugs in both benchmarks, respectively.",
"categories": [
"cs.SE"
],
"published": "2025-05-19T11:51:56+00:00",
"url": "http://arxiv.org/pdf/2505.13008v2",
"resource_uri": "arxiv://2505.13008v2",
"citation_count": 0
},
{
"id": "2505.08263v2",
"title": "LLM-Based Detection of Tangled Code Changes for Higher-Quality Method-Level Bug Datasets",
"authors": [
"Md Nahidul Islam Opu",
"Shaowei Wang",
"Shaiful Chowdhury"
],
"abstract": "Tangled code changes, commits that conflate unrelated modifications such as\nbug fixes, refactorings, and enhancements, introduce significant noise into bug\ndatasets and adversely affect the performance of bug prediction models.\nAddressing this issue at a fine-grained, method-level granularity remains\nunderexplored. This is critical to address, as recent bug prediction models,\ndriven by practitioner demand, are increasingly focusing on finer granularity\nrather than traditional class- or file-level predictions. This study\ninvestigates the utility of Large Language Models (LLMs) for detecting tangled\ncode changes by leveraging both commit messages and method-level code diffs. We\nformulate the problem as a binary classification task and evaluate multiple\nprompting strategies, including zero-shot, few-shot, and chain-of-thought\nprompting, using state-of-the-art proprietary LLMs such as GPT-4o and\nGemini-2.0-Flash. Our results demonstrate that combining commit messages with\ncode diffs significantly enhances model performance, with the combined few-shot\nand chain-of-thought prompting achieving an F1-score of 0.88. Additionally, we\nexplore machine learning models trained on LLM-generated embeddings, where a\nmulti-layer perceptron classifier achieves superior performance (F1-score:\n0.906, MCC: 0.807). Applying our approach to 49 open-source projects improves\nthe distributional separability of code metrics between buggy and non-buggy\nmethods, demonstrating the promise of LLMs for method-level commit untangling\nand potentially contributing to improving the accuracy of future bug prediction\nmodels.",
"categories": [
"cs.SE"
],
"published": "2025-05-13T06:26:13+00:00",
"url": "http://arxiv.org/pdf/2505.08263v2",
"resource_uri": "arxiv://2505.08263v2",
"citation_count": 0
},
{
"id": "2505.07522v1",
"title": "Byam: Fixing Breaking Dependency Updates with Large Language Models",
"authors": [
"Frank Reyes",
"May Mahmoud",
"Federico Bono",
"Sarah Nadi",
"Benoit Baudry",
"Martin Monperrus"
],
"abstract": "Application Programming Interfaces (APIs) facilitate the integration of\nthird-party dependencies within the code of client applications. However,\nchanges to an API, such as deprecation, modification of parameter names or\ntypes, or complete replacement with a new API, can break existing client code.\nThese changes are called breaking dependency updates; It is often tedious for\nAPI users to identify the cause of these breaks and update their code\naccordingly. In this paper, we explore the use of Large Language Models (LLMs)\nto automate client code updates in response to breaking dependency updates. We\nevaluate our approach on the BUMP dataset, a benchmark for breaking dependency\nupdates in Java projects. Our approach leverages LLMs with advanced prompts,\nincluding information from the build process and from the breaking dependency\nanalysis. We assess effectiveness at three granularity levels: at the build\nlevel, the file level, and the individual compilation error level. We\nexperiment with five LLMs: Google Gemini-2.0 Flash, OpenAI GPT4o-mini, OpenAI\no3-mini, Alibaba Qwen2.5-32b-instruct, and DeepSeek V3. Our results show that\nLLMs can automatically repair breaking updates. Among the considered models,\nOpenAI's o3-mini is the best, able to completely fix 27% of the builds when\nusing prompts that include contextual information such as the buggy line, API\ndifferences, error messages, and step-by-step reasoning instructions. Also, it\nfixes 78% of the individual compilation errors. Overall, our findings\ndemonstrate the potential for LLMs to fix compilation errors due to breaking\ndependency updates, supporting developers in their efforts to stay up-to-date\nwith changes in their dependencies.",
"categories": [
"cs.SE"
],
"published": "2025-05-12T13:03:26+00:00",
"url": "http://arxiv.org/pdf/2505.07522v1",
"resource_uri": "arxiv://2505.07522v1",
"citation_count": 0
},
{
"id": "2505.07372v1",
"title": "Synthetic Code Surgery: Repairing Bugs and Vulnerabilities with LLMs and Synthetic Data",
"authors": [
"David de-Fitero-Dominguez",
"Antonio Garcia-Cabot",
"Eva Garcia-Lopez"
],
"abstract": "This paper presents a novel methodology for enhancing Automated Program\nRepair (APR) through synthetic data generation utilizing Large Language Models\n(LLMs). Current APR systems are constrained by the limited availability of\nhigh-quality training data encompassing diverse bug types across multiple\nprogramming languages. The proposed approach addresses this limitation through\na two-phase process: a synthetic sample generation followed by a rigorous\nquality assessment. Multiple state-of-the-art LLMs were employed to generate\napproximately 30,000 paired examples of buggy and fixed code across 12\nprogramming languages and 13 bug categories. Subsequently, these samples\nunderwent cross-model evaluation against five criteria: correctness, code\nquality, security, performance, and completeness. Experimental evaluation on\nthe VulRepair test set dataset showed statistically significant improvements in\nPerfect Prediction rates, with the quality-filtered synthetic dataset\noutperforming both baseline and real-world commit data configurations in\ncertain scenarios. The methodology was validated through rigorous statistical\ntesting, including ANOVA and post-hoc Tukey's Honest Significant Difference\nanalysis. Furthermore, the best-performing configurations surpassed existing\nsystems despite using a less computationally intensive decoding strategy. This\nresearch establishes a self-bootstrapping paradigm in which LLMs generate and\nevaluate their own training data, potentially transforming approaches to data\nscarcity across software engineering tasks and advancing the development of\nrobust, adaptable tools for automated code maintenance.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2025-05-12T09:14:20+00:00",
"url": "http://arxiv.org/pdf/2505.07372v1",
"resource_uri": "arxiv://2505.07372v1",
"citation_count": 0
},
{
"id": "2505.07270v2",
"title": "Automated Repair of Ambiguous Natural Language Requirements",
"authors": [
"Haoxiang Jia",
"Robbie Morris",
"He Ye",
"Federica Sarro",
"Sergey Mechtaev"
],
"abstract": "The widespread adoption of large language models (LLMs) in software\nengineering has amplified the role of natural language (NL). The inherent\nambiguity of NL threatens software quality, because ambiguous requirements may\nlead to faulty program generation. The complexity of ambiguity detection and\nresolution motivates us to introduce automated repair of ambiguous NL\nrequirements, which we approach by reducing code generation uncertainty and\naligning NL with input-output examples. Repairing ambiguity in requirements is\na difficult challenge for LLMs, as it demands metacognition - the model must\nunderstand how its own interpretation changes when the text is altered. Our\nexperiments show that directly prompting an LLM to detect and resolve\nambiguities results in irrelevant or inconsistent clarifications. Our key\ninsight is to decompose this problem into simpler sub-problems that do not\nrequire metacognitive reasoning. First, we analyze and repair the LLM's\ninterpretation of requirements embodied by the distribution of programs they\ninduce by using traditional testing and program repair. Second, we repair\nrequirements based on the changes to the distribution via contrastive\nspecification inference. We implemented this proposal, dubbed as SpecFix, and\nevaluated it by using three state-of-the-art LLMs (GPT-4o, DeepSeek-V3 and\nQwen2.5-Coder-32b) across two widely used code generation benchmarks, namely\nHumanEval+ and MBPP+. Our results show that SpecFix, operating autonomously\nwithout human intervention or external information, modifies 23.93% of the\nrequirements, leading to a 33.66% improvement in model Pass@1 on the modified\nrequirements. Across the entire benchmark, this corresponds to an 4.3% increase\nin overall Pass@1. Importantly, SpecFix's repairs generalize across models:\nrequirements repaired by one model boost the performance of other models by\n9.6%.",
"categories": [
"cs.SE"
],
"published": "2025-05-12T06:47:53+00:00",
"url": "http://arxiv.org/pdf/2505.07270v2",
"resource_uri": "arxiv://2505.07270v2",
"citation_count": 0
},
{
"id": "2505.07897v2",
"title": "LongCodeBench: Evaluating Coding LLMs at 1M Context Windows",
"authors": [
"Stefano Rando",
"Luca Romani",
"Alessio Sampieri",
"Luca Franco",
"John Yang",
"Yuta Kyuragi",
"Fabio Galasso",
"Tatsunori Hashimoto"
],
"abstract": "Context lengths for models have grown rapidly, from thousands to millions of\ntokens in just a few years. The extreme context sizes of modern long-context\nmodels have made it difficult to construct realistic long-context benchmarks --\nnot only due to the cost of collecting million-context tasks but also in\nidentifying realistic scenarios that require significant contexts. We identify\ncode comprehension and repair as a natural testbed and challenge task for\nlong-context models and introduce LongCodeBench (LCB), a benchmark to test LLM\ncoding abilities in long-context scenarios. Our benchmark tests both the\ncomprehension and repair capabilities of LCLMs in realistic and important\nsettings by drawing from real-world GitHub issues and constructing QA\n(LongCodeQA) and bug fixing (LongSWE-Bench) tasks. We carefully stratify the\ncomplexity of our benchmark, enabling us to evaluate models across different\nscales -- ranging from Qwen2.5 14B Instruct to Google's flagship Gemini model.\nWe find that long-context remains a weakness for all models, with performance\ndrops such as from 29% to 3% for Claude 3.5 Sonnet, or from 70.2% to 40% for\nQwen2.5.",
"categories": [
"cs.CL",
"cs.AI"
],
"published": "2025-05-12T05:38:03+00:00",
"url": "http://arxiv.org/pdf/2505.07897v2",
"resource_uri": "arxiv://2505.07897v2",
"citation_count": 0
},
{
"id": "2505.04441v1",
"title": "Towards Effectively Leveraging Execution Traces for Program Repair with Code LLMs",
"authors": [
"Mirazul Haque",
"Petr Babkin",
"Farima Farmahinifarahani",
"Manuela Veloso"
],
"abstract": "Large Language Models (LLMs) show promising performance on various\nprogramming tasks, including Automatic Program Repair (APR). However, most\napproaches to LLM-based APR are limited to the static analysis of the programs,\nwhile disregarding their runtime behavior. Inspired by knowledge-augmented NLP,\nin this work, we aim to remedy this potential blind spot by augmenting standard\nAPR prompts with program execution traces. We evaluate our approach using the\nGPT family of models on three popular APR datasets. Our findings suggest that\nsimply incorporating execution traces into the prompt provides a limited\nperformance improvement over trace-free baselines, in only 2 out of 6 tested\ndataset / model configurations. We further find that the effectiveness of\nexecution traces for APR diminishes as their complexity increases. We explore\nseveral strategies for leveraging traces in prompts and demonstrate that\nLLM-optimized prompts help outperform trace-free prompts more consistently.\nAdditionally, we show trace-based prompting to be superior to finetuning a\nsmaller LLM on a small-scale dataset; and conduct probing studies reinforcing\nthe notion that execution traces can complement the reasoning abilities of the\nLLMs.",
"categories": [
"cs.LG",
"cs.SE"
],
"published": "2025-05-07T14:12:41+00:00",
"url": "http://arxiv.org/pdf/2505.04441v1",
"resource_uri": "arxiv://2505.04441v1",
"citation_count": 0
},
{
"id": "2505.04670v2",
"title": "LLM Code Customization with Visual Results: A Benchmark on TikZ",
"authors": [
"Charly Reux",
"Mathieu Acher",
"Djamel Eddine Khelladi",
"Olivier Barais",
"Clément Quinton"
],
"abstract": "With the rise of AI-based code generation, customizing existing code out of\nnatural language instructions to modify visual results -such as figures or\nimages -has become possible, promising to reduce the need for deep programming\nexpertise. However, even experienced developers can struggle with this task, as\nit requires identifying relevant code regions (feature location), generating\nvalid code variants, and ensuring the modifications reliably align with user\nintent. In this paper, we introduce vTikZ, the first benchmark designed to\nevaluate the ability of Large Language Models (LLMs) to customize code while\npreserving coherent visual outcomes. Our benchmark consists of carefully\ncurated vTikZ editing scenarios, parameterized ground truths, and a reviewing\ntool that leverages visual feedback to assess correctness. Empirical evaluation\nwith stateof-the-art LLMs shows that existing solutions struggle to reliably\nmodify code in alignment with visual intent, highlighting a gap in current\nAI-assisted code editing approaches. We argue that vTikZ opens new research\ndirections for integrating LLMs with visual feedback mechanisms to improve code\ncustomization tasks in various domains beyond TikZ, including image processing,\nart creation, Web design, and 3D modeling.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2025-05-07T08:26:54+00:00",
"url": "http://arxiv.org/pdf/2505.04670v2",
"resource_uri": "arxiv://2505.04670v2",
"citation_count": 0
},
{
"id": "2505.04040v1",
"title": "Identification and Optimization of Redundant Code Using Large Language Models",
"authors": [
"Shamse Tasnim Cynthia"
],
"abstract": "Redundant code is a persistent challenge in software development that makes\nsystems harder to maintain, scale, and update. It adds unnecessary complexity,\nhinders bug fixes, and increases technical debt. Despite their impact, removing\nredundant code manually is risky and error-prone, often introducing new bugs or\nmissing dependencies. While studies highlight the prevalence and negative\nimpact of redundant code, little focus has been given to Artificial\nIntelligence (AI) system codebases and the common patterns that cause\nredundancy. Additionally, the reasons behind developers unintentionally\nintroducing redundant code remain largely unexplored. This research addresses\nthese gaps by leveraging large language models (LLMs) to automatically detect\nand optimize redundant code in AI projects. Our research aims to identify\nrecurring patterns of redundancy and analyze their underlying causes, such as\noutdated practices or insufficient awareness of best coding principles.\nAdditionally, we plan to propose an LLM agent that will facilitate the\ndetection and refactoring of redundancies on a large scale while preserving\noriginal functionality. This work advances the application of AI in identifying\nand optimizing redundant code, ultimately helping developers maintain cleaner,\nmore readable, and scalable codebases.",
"categories": [
"cs.SE"
],
"published": "2025-05-07T00:44:32+00:00",
"url": "http://arxiv.org/pdf/2505.04040v1",
"resource_uri": "arxiv://2505.04040v1",
"citation_count": 0
},
{
"id": "2505.02931v1",
"title": "The Art of Repair: Optimizing Iterative Program Repair with Instruction-Tuned Models",
"authors": [
"Fernando Vallecillos Ruiz",
"Max Hort",
"Leon Moonen"
],
"abstract": "Automatic program repair (APR) aims to reduce the manual efforts required to\nidentify and fix errors in source code. Before the rise of LLM-based agents, a\ncommon strategy was to increase the number of generated patches, sometimes to\nthe thousands, to achieve better repair results on benchmarks. More recently,\nself-iterative capabilities enabled LLMs to refine patches over multiple rounds\nguided by feedback. However, literature often focuses on many iterations and\ndisregards different numbers of outputs.\n We investigate an APR pipeline that balances these two approaches, the\ngeneration of multiple outputs and multiple rounds of iteration, while imposing\na limit of 10 total patches per bug. We apply three SOTA instruction-tuned LLMs\n- DeepSeekCoder-Instruct, Codellama-Instruct, Llama3.1-Instruct - to the APR\ntask. We further fine-tune each model on an APR dataset with three sizes (1K,\n30K, 65K) and two techniques (Full Fine-Tuning and LoRA), allowing us to assess\ntheir repair capabilities on two APR benchmarks: HumanEval-Java and Defects4J.\n Our results show that by using only a fraction (<1%) of the fine-tuning\ndataset, we can achieve improvements of up to 78% in the number of plausible\npatches generated, challenging prior studies that reported limited gains using\nFull Fine-Tuning. However, we find that exceeding certain thresholds leads to\ndiminishing outcomes, likely due to overfitting. Moreover, we show that base\nmodels greatly benefit from creating patches in an iterative fashion rather\nthan generating them all at once. In addition, the benefit of iterative\nstrategies becomes more pronounced in complex benchmarks. Even fine-tuned\nmodels, while benefiting less from iterations, still gain advantages,\nparticularly on complex benchmarks. The research underscores the need for\nbalanced APR strategies that combine multi-output generation and iterative\nrefinement.",
"categories": [
"cs.SE",
"cs.AI",
"cs.CL",
"cs.LG"
],
"published": "2025-05-05T18:06:51+00:00",
"url": "http://arxiv.org/pdf/2505.02931v1",
"resource_uri": "arxiv://2505.02931v1",
"citation_count": 0
},
{
"id": "2505.02629v1",
"title": "Parameter-Efficient Fine-Tuning with Attributed Patch Semantic Graph for Automated Patch Correctness Assessment",
"authors": [
"Zhenyu Yang",
"Jingwen Wu",
"Zhen Yang",
"Zhongxing Yu"
],
"abstract": "Automated program repair (APR) aims to automatically repair program errors\nwithout human intervention, and recent years have witnessed a growing interest\non this research topic. While much progress has been made and techniques\noriginating from different disciplines have been proposed, APR techniques\ngenerally suffer from the patch overfitting issue, i.e., the generated patches\nare not genuinely correct despite they pass the employed tests. To alleviate\nthis issue, many research efforts have been devoted for automated patch\ncorrectness assessment (APCA). In particular, with the emergence of large\nlanguage model (LLM) technology, researchers have employed LLM to assess the\npatch correctness and have obtained the state-of-the-art performance. The\nliterature on APCA has demonstrated the importance of capturing patch semantic\nand explicitly considering certain code attributes in predicting patch\ncorrectness. However, existing LLM-based methods typically treat code as token\nsequences and ignore the inherent formal structure for code, making it\ndifficult to capture the deep patch semantics. Moreover, these LLM-based\nmethods also do not explicitly account for enough code attributes. To overcome\nthese drawbacks, we in this paper design a novel patch graph representation\nnamed attributed patch semantic graph (APSG), which adequately captures the\npatch semantic and explicitly reflects important patch attributes. To\neffectively use graph information in APSG, we accordingly propose a new\nparameter-efficient fine-tuning (PEFT) method of LLMs named Graph-LoRA.\nExtensive evaluations have been conducted to evaluate our method, and the\nresults show that compared to the state-of-the-art methods, our method improves\naccuracy and F1 score by 2.3% to 6.6% and 1.8% to 6.1% respectively.",
"categories": [
"cs.SE"
],
"published": "2025-05-05T13:15:53+00:00",
"url": "http://arxiv.org/pdf/2505.02629v1",
"resource_uri": "arxiv://2505.02629v1",
"citation_count": 0
},
{
"id": "2505.02275v1",
"title": "A Path Less Traveled: Reimagining Software Engineering Automation via a Neurosymbolic Paradigm",
"authors": [
"Antonio Mastropaolo",
"Denys Poshyvanyk"
],
"abstract": "The emergence of Large Code Models (LCMs) has transformed software\nengineering (SE) automation, driving significant advancements in tasks such as\ncode generation, source code documentation, code review, and bug fixing.\nHowever, these advancements come with trade-offs: achieving high performance\noften entails exponential computational costs, reduced interpretability, and an\nincreasing dependence on data-intensive models with hundreds of billions of\nparameters. In this paper, we propose Neurosymbolic Software Engineering, in\nshort NSE, as a promising paradigm combining neural learning with symbolic\n(rule-based) reasoning, while strategically introducing a controlled source of\nchaos to simulate the complex dynamics of real-world software systems. This\nhybrid methodology aims to enhance efficiency, reliability, and transparency in\nAI-driven software engineering while introducing controlled randomness to adapt\nto evolving requirements, unpredictable system behaviors, and non-deterministic\nexecution environments. By redefining the core principles of AI-driven software\nengineering automation, NSE lays the groundwork for solutions that are more\nadaptable, transparent, and closely aligned with the evolving demands of modern\nsoftware development practices.",
"categories": [
"cs.SE"
],
"published": "2025-05-04T22:10:21+00:00",
"url": "http://arxiv.org/pdf/2505.02275v1",
"resource_uri": "arxiv://2505.02275v1",
"citation_count": 0
},
{
"id": "2505.01108v1",
"title": "Towards an Interpretable Analysis for Estimating the Resolution Time of Software Issues",
"authors": [
"Dimitrios-Nikitas Nastos",
"Themistoklis Diamantopoulos",
"Davide Tosi",
"Martina Tropeano",
"Andreas L. Symeonidis"
],
"abstract": "Lately, software development has become a predominantly online process, as\nmore teams host and monitor their projects remotely. Sophisticated approaches\nemploy issue tracking systems like Jira, predicting the time required to\nresolve issues and effectively assigning and prioritizing project tasks.\nSeveral methods have been developed to address this challenge, widely known as\nbug-fix time prediction, yet they exhibit significant limitations. Most\nconsider only textual issue data and/or use techniques that overlook the\nsemantics and metadata of issues (e.g., priority or assignee expertise). Many\nalso fail to distinguish actual development effort from administrative delays,\nincluding assignment and review phases, leading to estimates that do not\nreflect the true effort needed. In this work, we build an issue monitoring\nsystem that extracts the actual effort required to fix issues on a per-project\nbasis. Our approach employs topic modeling to capture issue semantics and\nleverages metadata (components, labels, priority, issue type, assignees) for\ninterpretable resolution time analysis. Final predictions are generated by an\naggregated model, enabling contributors to make informed decisions. Evaluation\nacross multiple projects shows the system can effectively estimate resolution\ntime and provide valuable insights.",
"categories": [
"cs.SE"
],
"published": "2025-05-02T08:38:59+00:00",
"url": "http://arxiv.org/pdf/2505.01108v1",
"resource_uri": "arxiv://2505.01108v1",
"citation_count": 0
},
{
"id": "2505.01022v3",
"title": "Detecting the Root Cause Code Lines in Bug-Fixing Commits by Heterogeneous Graph Learning",
"authors": [
"Liguo Ji",
"Chenchen Li",
"Shenglin Wang",
"Furui Zhan"
],
"abstract": "With the continuous growth in the scale and complexity of software systems,\ndefect remediation has become increasingly difficult and costly. Automated\ndefect prediction tools can proactively identify software changes prone to\ndefects within software projects, thereby enhancing software development\nefficiency. However, existing work in heterogeneous and complex software\nprojects continues to face challenges, such as struggling with heterogeneous\ncommit structures and ignoring cross-line dependencies in code changes, which\nultimately reduce the accuracy of defect identification. To address these\nchallenges, we propose an approach called RC_Detector. RC_Detector comprises\nthree main components: the bug-fixing graph construction component, the code\nsemantic aggregation component, and the cross-line semantic retention\ncomponent. The bug-fixing graph construction component identifies the code\nsyntax structures and program dependencies within bug-fixing commits and\ntransforms them into heterogeneous graph formats by converting the source code\ninto vector representations. The code semantic aggregation component adapts to\nheterogeneous data by using heterogeneous attention to learn the hidden\nsemantic representation of target code lines. The cross-line semantic retention\ncomponent regulates propagated semantic information by using attenuation and\nreinforcement gates derived from old and new code semantic representations,\neffectively preserving cross-line semantic relationships. Extensive experiments\nwere conducted to evaluate the performance of our model by collecting data from\n87 open-source projects, including 675 bug-fixing commits. The experimental\nresults demonstrate that our model outperforms state-of-the-art approaches,\nachieving significant improvements of\n83.15%,96.83%,78.71%,74.15%,54.14%,91.66%,91.66%, and 34.82% in MFR,\nrespectively, compared with the state-of-the-art approaches.",
"categories": [
"cs.SE"
],
"published": "2025-05-02T05:39:50+00:00",
"url": "http://arxiv.org/pdf/2505.01022v3",
"resource_uri": "arxiv://2505.01022v3",
"citation_count": 0
},
{
"id": "2505.00990v1",
"title": "Identifying Root Cause of bugs by Capturing Changed Code Lines with Relational Graph Neural Networks",
"authors": [
"Jiaqi Zhang",
"Shikai Guo",
"Hui Li",
"Chenchen Li",
"Yu Chai",
"Rong Chen"
],
"abstract": "The Just-In-Time defect prediction model helps development teams improve\nsoftware quality and efficiency by assessing whether code changes submitted by\ndevelopers are likely to introduce defects in real-time, allowing timely\nidentification of potential issues during the commit stage. However, two main\nchallenges exist in current work due to the reality that all deleted and added\nlines in bug-fixing commits may be related to the root cause of the introduced\nbug: 1) lack of effective integration of heterogeneous graph information, and\n2) lack of semantic relationships between changed code lines. To address these\nchallenges, we propose a method called RC-Detection, which utilizes relational\ngraph convolutional network to capture the semantic relationships between\nchanged code lines. RC-Detection is used to detect root-cause deletion lines in\nchanged code lines, thereby identifying the root cause of introduced bugs in\nbug-fixing commits. To evaluate the effectiveness of RC-Detection, we used\nthree datasets that contain high-quality bug-fixing and bug-introducing\ncommits. Extensive experiments were conducted to evaluate the performance of\nour model by collecting data from 87 open-source projects, including 675\nbug-fix commits. The experimental results show that, compared to the most\nadvanced root cause detection methods, RC-Detection improved Recall@1,\nRecall@2, Recall@3, and MFR by at 4.107%, 5.113%, 4.289%, and 24.536%,\nrespectively.",
"categories": [
"cs.SE"
],
"published": "2025-05-02T04:29:09+00:00",
"url": "http://arxiv.org/pdf/2505.00990v1",
"resource_uri": "arxiv://2505.00990v1",
"citation_count": 0
},
{
"id": "2505.00703v2",
"title": "T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT",
"authors": [
"Dongzhi Jiang",
"Ziyu Guo",
"Renrui Zhang",
"Zhuofan Zong",
"Hao Li",
"Le Zhuo",
"Shilin Yan",
"Pheng-Ann Heng",
"Hongsheng Li"
],
"abstract": "Recent advancements in large language models have demonstrated how\nchain-of-thought (CoT) and reinforcement learning (RL) can improve performance.\nHowever, applying such reasoning strategies to the visual generation domain\nremains largely unexplored. In this paper, we present T2I-R1, a novel\nreasoning-enhanced text-to-image generation model, powered by RL with a\nbi-level CoT reasoning process. Specifically, we identify two levels of CoT\nthat can be utilized to enhance different stages of generation: (1) the\nsemantic-level CoT for high-level planning of the prompt and (2) the\ntoken-level CoT for low-level pixel processing during patch-by-patch\ngeneration. To better coordinate these two levels of CoT, we introduce\nBiCoT-GRPO with an ensemble of generation rewards, which seamlessly optimizes\nboth generation CoTs within the same training step. By applying our reasoning\nstrategies to the baseline model, Janus-Pro, we achieve superior performance\nwith 13% improvement on T2I-CompBench and 19% improvement on the WISE\nbenchmark, even surpassing the state-of-the-art model FLUX.1. Code is available\nat: https://github.com/CaraJ7/T2I-R1",
"categories": [
"cs.CV",
"cs.AI",
"cs.CL",
"cs.LG"
],
"published": "2025-05-01T17:59:46+00:00",
"url": "http://arxiv.org/pdf/2505.00703v2",
"resource_uri": "arxiv://2505.00703v2",
"citation_count": 0
},
{
"id": "2504.20412v2",
"title": "CrashFixer: A crash resolution agent for the Linux kernel",
"authors": [
"Alex Mathai",
"Chenxi Huang",
"Suwei Ma",
"Jihwan Kim",
"Hailie Mitchell",
"Aleksandr Nogikh",
"Petros Maniatis",
"Franjo Ivančić",
"Junfeng Yang",
"Baishakhi Ray"
],
"abstract": "Code large language models (LLMs) have shown impressive capabilities on a\nmultitude of software engineering tasks. In particular, they have demonstrated\nremarkable utility in the task of code repair. However, common benchmarks used\nto evaluate the performance of code LLMs are often limited to small-scale\nsettings. In this work, we build upon kGym, which shares a benchmark for\nsystem-level Linux kernel bugs and a platform to run experiments on the Linux\nkernel.\n This paper introduces CrashFixer, the first LLM-based software repair agent\nthat is applicable to Linux kernel bugs. Inspired by the typical workflow of a\nkernel developer, we identify the key capabilities an expert developer\nleverages to resolve a kernel crash. Using this as our guide, we revisit the\nkGym platform and identify key system improvements needed to practically run\nLLM-based agents at the scale of the Linux kernel (50K files and 20M lines of\ncode). We implement these changes by extending kGym to create an improved\nplatform - called kGymSuite, which will be open-sourced. Finally, the paper\npresents an evaluation of various repair strategies for such complex kernel\nbugs and showcases the value of explicitly generating a hypothesis before\nattempting to fix bugs in complex systems such as the Linux kernel. We also\nevaluated CrashFixer's capabilities on still open bugs, and found at least two\npatch suggestions considered plausible to resolve the reported bug.",
"categories": [
"cs.SE",
"cs.AI",
"cs.OS"
],
"published": "2025-04-29T04:18:51+00:00",
"url": "http://arxiv.org/pdf/2504.20412v2",
"resource_uri": "arxiv://2504.20412v2",
"citation_count": 0
},
{
"id": "2504.20196v1",
"title": "Prompting LLMs for Code Editing: Struggles and Remedies",
"authors": [
"Daye Nam",
"Ahmed Omran",
"Ambar Murillo",
"Saksham Thakur",
"Abner Araujo",
"Marcel Blistein",
"Alexander Frömmgen",
"Vincent Hellendoorn",
"Satish Chandra"
],
"abstract": "Large Language Models (LLMs) are rapidly transforming software engineering,\nwith coding assistants embedded in an IDE becoming increasingly prevalent.\nWhile research has focused on improving the tools and understanding developer\nperceptions, a critical gap exists in understanding how developers actually use\nthese tools in their daily workflows, and, crucially, where they struggle. This\npaper addresses part of this gap through a multi-phased investigation of\ndeveloper interactions with an LLM-powered code editing and transformation\nfeature, Transform Code, in an IDE widely used at Google. First, we analyze\ntelemetry logs of the feature usage, revealing that frequent re-prompting can\nbe an indicator of developer struggles with using Transform Code. Second, we\nconduct a qualitative analysis of unsatisfactory requests, identifying five key\ncategories of information often missing from developer prompts. Finally, based\non these findings, we propose and evaluate a tool, AutoPrompter, for\nautomatically improving prompts by inferring missing information from the\nsurrounding code context, leading to a 27% improvement in edit correctness on\nour test set.",
"categories": [
"cs.SE",
"cs.AI",
"cs.HC"
],
"published": "2025-04-28T18:59:28+00:00",
"url": "http://arxiv.org/pdf/2504.20196v1",
"resource_uri": "arxiv://2504.20196v1",
"citation_count": 0
},
{
"id": "2504.19283v1",
"title": "Efficient Serverless Cold Start: Reducing Library Loading Overhead by Profile-guided Optimization",
"authors": [
"Syed Salauddin Mohammad Tariq",
"Ali Al Zein",
"Soumya Sripad Vaidya",
"Arati Khanolkar",
"Zheng Song",
"Probir Roy"
],
"abstract": "Serverless computing abstracts away server management, enabling automatic\nscaling, efficient resource utilization, and cost-effective pricing models.\nHowever, despite these advantages, it faces the significant challenge of\ncold-start latency, adversely impacting end-to-end performance. Our study shows\nthat many serverless functions initialize libraries that are rarely or never\nused under typical workloads, thus introducing unnecessary overhead. Although\nexisting static analysis techniques can identify unreachable libraries, they\nfail to address workload-dependent inefficiencies, resulting in limited\nperformance improvements. To overcome these limitations, we present SLIMSTART,\na profile-guided optimization tool designed to identify and mitigate\ninefficient library usage patterns in serverless applications. By leveraging\nstatistical sampling and call-path profiling, SLIMSTART collects runtime\nlibrary usage data, generates detailed optimization reports, and applies\nautomated code transformations to reduce cold-start overhead. Furthermore,\nSLIMSTART integrates seamlessly into CI/CD pipelines, enabling adaptive\nmonitoring and continuous optimizations tailored to evolving workloads. Through\nextensive evaluation across three benchmark suites and four real-world\nserverless applications, SLIMSTART achieves up to a 2.30X speedup in\ninitialization latency, a 2.26X improvement in end-to-end latency, and a 1.51X\nreduction in memory usage, demonstrating its effectiveness in addressing\ncold-start inefficiencies and optimizing resource utilization.",
"categories": [
"cs.DC",
"cs.PF"
],
"published": "2025-04-27T15:50:45+00:00",
"url": "http://arxiv.org/pdf/2504.19283v1",
"resource_uri": "arxiv://2504.19283v1",
"citation_count": 0
},
{
"id": "2504.19110v2",
"title": "APE-Bench I: Towards File-level Automated Proof Engineering of Formal Math Libraries",
"authors": [
"Huajian Xin",
"Luming Li",
"Xiaoran Jin",
"Jacques Fleuriot",
"Wenda Li"
],
"abstract": "Recent progress in large language models (LLMs) has shown promise in formal\ntheorem proving, yet existing benchmarks remain limited to isolated, static\nproof tasks, failing to capture the iterative, engineering-intensive workflows\nof real-world formal mathematics libraries. Motivated by analogous advances in\nsoftware engineering, we introduce the paradigm of Automated Proof Engineering\n(APE), which aims to automate proof engineering tasks such as feature addition,\nproof refactoring, and bug fixing using LLMs. To facilitate research in this\ndirection, we present APE-Bench I, the first realistic benchmark built from\nreal-world commit histories of Mathlib4, featuring diverse file-level tasks\ndescribed in natural language and verified via a hybrid approach combining the\nLean compiler and LLM-as-a-Judge. We further develop Eleanstic, a scalable\nparallel verification infrastructure optimized for proof checking across\nmultiple versions of Mathlib. Empirical results on state-of-the-art LLMs\ndemonstrate strong performance on localized edits but substantial degradation\non handling complex proof engineering. This work lays the foundation for\ndeveloping agentic workflows in proof engineering, with future benchmarks\ntargeting multi-file coordination, project-scale verification, and autonomous\nagents capable of planning, editing, and repairing formal libraries.",
"categories": [
"cs.CL"
],
"published": "2025-04-27T05:04:02+00:00",
"url": "http://arxiv.org/pdf/2504.19110v2",
"resource_uri": "arxiv://2504.19110v2",
"citation_count": 0
},
{
"id": "2504.19099v1",
"title": "VeriDebug: A Unified LLM for Verilog Debugging via Contrastive Embedding and Guided Correction",
"authors": [
"Ning Wang",
"Bingkun Yao",
"Jie Zhou",
"Yuchen Hu",
"Xi Wang",
"Nan Guan",
"Zhe Jiang"
],
"abstract": "Large Language Models (LLMs) have demonstrated remarkable potential in\ndebugging for various programming languages. However, the application of LLMs\nto Verilog debugging remains insufficiently explored. Here, we present\nVeriDebug, an approach that integrates contrastive representation and guided\ncorrection capabilities for automated Verilog debugging. Unlike existing\nmethods, VeriDebug employs an embedding-based technique to accurately retrieve\ninternal information, followed by bug-fixing. VeriDebug unifies Verilog bug\ndetection and correction through a shared parameter space. By simultaneously\nlearning bug patterns and fixes, it streamlines debugging via contrastive\nembedding and guided correction. Empirical results show the efficacy of\nVeriDebug in enhancing Verilog debugging. Our VeriDebugLoc, Type model achieves\n64.7 accuracy in bug fixing (Acc1), a significant improvement from the existing\nopen-source SOTAs 11.3. This performance not only outperforms open-source\nalternatives but also exceeds larger closed-source models like GPT-3.5-turbo\n(36.6), offering a more accurate alternative to conventional debugging methods.",
"categories": [
"cs.SE",
"cs.AI",
"cs.AR"
],
"published": "2025-04-27T04:09:48+00:00",
"url": "http://arxiv.org/pdf/2504.19099v1",
"resource_uri": "arxiv://2504.19099v1",
"citation_count": 0
},
{
"id": "2504.18702v1",
"title": "Codetations: Intelligent, Persistent Notes and UIs for Programs and Other Documents",
"authors": [
"Edward Misback",
"Erik Vank",
"Zachary Tatlock",
"Steven Tanimoto"
],
"abstract": "Software developers maintain extensive mental models of code they produce and\nits context, often relying on memory to retrieve or reconstruct design\ndecisions, edge cases, and debugging experiences. These missing links and data\nobstruct both developers and, more recently, large language models (LLMs)\nworking with unfamiliar code. We present Codetations, a system that helps\ndevelopers contextualize documents with rich notes and tools. Unlike previous\napproaches, notes in Codetations stay outside the document to prevent code\nclutter, attaching to spans in the document using a hybrid\nedit-tracking/LLM-based method. Their content is dynamic, interactive, and\nsynchronized with code changes. A worked example shows that relevant notes with\ninteractively-collected data improve LLM performance during code repair. In our\nuser evaluation, developers praised these properties and saw significant\npotential in annotation types that we generated with an LLM in just a few\nminutes.",
"categories": [
"cs.SE",
"cs.HC"
],
"published": "2025-04-25T21:33:25+00:00",
"url": "http://arxiv.org/pdf/2504.18702v1",
"resource_uri": "arxiv://2504.18702v1",
"citation_count": 0
},
{
"id": "2504.18050v1",
"title": "Validating Network Protocol Parsers with Traceable RFC Document Interpretation",
"authors": [
"Mingwei Zheng",
"Danning Xie",
"Qingkai Shi",
"Chengpeng Wang",
"Xiangyu Zhang"
],
"abstract": "Validating the correctness of network protocol implementations is highly\nchallenging due to the oracle and traceability problems. The former determines\nwhen a protocol implementation can be considered buggy, especially when the\nbugs do not cause any observable symptoms. The latter allows developers to\nunderstand how an implementation violates the protocol specification, thereby\nfacilitating bug fixes. Unlike existing works that rarely take both problems\ninto account, this work considers both and provides an effective solution using\nrecent advances in large language models (LLMs). Our key observation is that\nnetwork protocols are often released with structured specification documents,\na.k.a. RFC documents, which can be systematically translated to formal protocol\nmessage specifications via LLMs. Such specifications, which may contain errors\ndue to the hallucination of LLMs, are used as a quasi-oracle to validate\nprotocol parsers, while the validation results in return gradually refine the\noracle. Since the oracle is derived from the document, any bugs we find in a\nprotocol implementation can be traced back to the document, thus addressing the\ntraceability problem. We have extensively evaluated our approach using nine\nnetwork protocols and their implementations written in C, Python, and Go. The\nresults show that our approach outperforms the state-of-the-art and has\ndetected 69 bugs, with 36 confirmed. The project also demonstrates the\npotential for fully automating software validation based on natural language\nspecifications, a process previously considered predominantly manual due to the\nneed to understand specification documents and derive expected outputs for test\ninputs.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2025-04-25T03:39:19+00:00",
"url": "http://arxiv.org/pdf/2504.18050v1",
"resource_uri": "arxiv://2504.18050v1",
"citation_count": 0
},
{
"id": "2504.17824v1",
"title": "EduBot -- Can LLMs Solve Personalized Learning and Programming Assignments?",
"authors": [
"Yibin Wang",
"Jiaxi Xie",
"Lakshminarayanan Subramanian"
],
"abstract": "The prevalence of Large Language Models (LLMs) is revolutionizing the process\nof writing code. General and code LLMs have shown impressive performance in\ngenerating standalone functions and code-completion tasks with one-shot\nqueries. However, the ability to solve comprehensive programming tasks with\nrecursive requests and bug fixes remains questionable. In this paper, we\npropose EduBot, an intelligent automated assistant system that combines\nconceptual knowledge teaching, end-to-end code development, personalized\nprogramming through recursive prompt-driven methods, and debugging with limited\nhuman interventions powered by LLMs. We show that EduBot can solve complicated\nprogramming tasks consisting of sub-tasks with increasing difficulties ranging\nfrom conceptual to coding questions by recursive automatic prompt-driven\nsystems without finetuning on LLMs themselves. To further evaluate EduBot's\nperformance, we design and conduct a benchmark suite consisting of 20 scenarios\nin algorithms, machine learning, and real-world problems. The result shows that\nEduBot can complete most scenarios in less than 20 minutes. Based on the\nbenchmark suites, we perform a comparative study to take different LLMs as the\nbackbone and to verify EduBot's compatibility and robustness across LLMs with\nvarying capabilities. We believe that EduBot is an exploratory approach to\nexplore the potential of pre-trained LLMs in multi-step reasoning and code\ngeneration for solving personalized assignments with knowledge learning and\ncode generation.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2025-04-23T23:25:13+00:00",
"url": "http://arxiv.org/pdf/2504.17824v1",
"resource_uri": "arxiv://2504.17824v1",
"citation_count": 0
},
{
"id": "2504.17069v1",
"title": "Distilling semantically aware orders for autoregressive image generation",
"authors": [
"Rishav Pramanik",
"Antoine Poupon",
"Juan A. Rodriguez",
"Masih Aminbeidokhti",
"David Vazquez",
"Christopher Pal",
"Zhaozheng Yin",
"Marco Pedersoli"
],
"abstract": "Autoregressive patch-based image generation has recently shown competitive\nresults in terms of image quality and scalability. It can also be easily\nintegrated and scaled within Vision-Language models. Nevertheless,\nautoregressive models require a defined order for patch generation. While a\nnatural order based on the dictation of the words makes sense for text\ngeneration, there is no inherent generation order that exists for image\ngeneration. Traditionally, a raster-scan order (from top-left to bottom-right)\nguides autoregressive image generation models. In this paper, we argue that\nthis order is suboptimal, as it fails to respect the causality of the image\ncontent: for instance, when conditioned on a visual description of a sunset, an\nautoregressive model may generate clouds before the sun, even though the color\nof clouds should depend on the color of the sun and not the inverse. In this\nwork, we show that first by training a model to generate patches in\nany-given-order, we can infer both the content and the location (order) of each\npatch during generation. Secondly, we use these extracted orders to finetune\nthe any-given-order model to produce better-quality images. Through our\nexperiments, we show on two datasets that this new generation method produces\nbetter images than the traditional raster-scan approach, with similar training\ncosts and no extra annotations.",
"categories": [
"cs.CV",
"cs.AI"
],
"published": "2025-04-23T19:33:58+00:00",
"url": "http://arxiv.org/pdf/2504.17069v1",
"resource_uri": "arxiv://2504.17069v1",
"citation_count": 0
},
{
"id": "2504.16310v1",
"title": "Improving Automated Secure Code Reviews: A Synthetic Dataset for Code Vulnerability Flaws",
"authors": [
"Leonardo Centellas-Claros",
"Juan J. Alonso-Lecaros",
"Juan Pablo Sandoval Alcocer",
"Andres Neyem"
],
"abstract": "Automation of code reviews using AI models has garnered substantial attention\nin the software engineering community as a strategy to reduce the cost and\neffort associated with traditional peer review processes. These models are\ntypically trained on extensive datasets of real-world code reviews that address\ndiverse software development concerns, including testing, refactoring, bug\nfixes, performance optimization, and maintainability improvements. However, a\nnotable limitation of these datasets is the under representation of code\nvulnerabilities, critical flaws that pose significant security risks, with\nsecurity-focused reviews comprising a small fraction of the data. This scarcity\nof vulnerability-specific data restricts the effectiveness of AI models in\nidentifying and commenting on security-critical code. To address this issue, we\npropose the creation of a synthetic dataset consisting of vulnerability-focused\nreviews that specifically comment on security flaws. Our approach leverages\nLarge Language Models (LLMs) to generate human-like code review comments for\nvulnerabilities, using insights derived from code differences and commit\nmessages. To evaluate the usefulness of the generated synthetic dataset, we\nplan to use it to fine-tune three existing code review models. We anticipate\nthat the synthetic dataset will improve the performance of the original code\nreview models.",
"categories": [
"cs.SE"
],
"published": "2025-04-22T23:07:24+00:00",
"url": "http://arxiv.org/pdf/2504.16310v1",
"resource_uri": "arxiv://2504.16310v1",
"citation_count": 0
},
{
"id": "2504.15989v2",
"title": "Optimizing Token Consumption in LLMs: A Nano Surge Approach for Code Reasoning Efficiency",
"authors": [
"Junwei Hu",
"Weicheng Zheng",
"Yihan Liu",
"Yan Liu"
],
"abstract": "With the increasing adoption of large language models (LLMs) in software\nengineering, the Chain of Thought (CoT) reasoning paradigm has become an\nessential approach for automated code repair. However, the explicit multi-step\nreasoning in CoT leads to substantial increases in token consumption, reducing\ninference efficiency and raising computational costs, especially for complex\ncode repair tasks. Most prior research has focused on improving the correctness\nof code repair while largely overlooking the resource efficiency of the\nreasoning process itself. To address this challenge, this paper proposes three\ntargeted optimization strategies: Context Awareness, Responsibility Tuning, and\nCost Sensitive. Context Awareness guides the model to focus on key contextual\ninformation, Responsibility Tuning refines the structure of the reasoning\nprocess through clearer role and responsibility assignment, and Cost Sensitive\nincorporates resource-awareness to suppress unnecessary token generation during\ninference. Experiments across diverse code repair scenarios demonstrate that\nthese methods can significantly reduce token consumption in CoT-based reasoning\nwithout compromising repair quality. This work provides novel insights and\nmethodological guidance for enhancing the efficiency of LLM-driven code repair\ntasks in software engineering.",
"categories": [
"cs.SE"
],
"published": "2025-04-22T15:51:00+00:00",
"url": "http://arxiv.org/pdf/2504.15989v2",
"resource_uri": "arxiv://2504.15989v2",
"citation_count": 0
},
{
"id": "2504.15637v1",
"title": "DR.FIX: Automatically Fixing Data Races at Industry Scale",
"authors": [
"Farnaz Behrang",
"Zhizhou Zhang",
"Georgian-Vlad Saioc",
"Peng Liu",
"Milind Chabbi"
],
"abstract": "Data races are a prevalent class of concurrency bugs in shared-memory\nparallel programs, posing significant challenges to software reliability and\nreproducibility. While there is an extensive body of research on detecting data\nraces and a wealth of practical detection tools across various programming\nlanguages, considerably less effort has been directed toward automatically\nfixing data races at an industrial scale. In large codebases, data races are\ncontinuously introduced and exhibit myriad patterns, making automated fixing\nparticularly challenging.\n In this paper, we tackle the problem of automatically fixing data races at an\nindustrial scale. We present Dr.Fix, a tool that combines large language models\n(LLMs) with program analysis to generate fixes for data races in real-world\nsettings, effectively addressing a broad spectrum of racy patterns in complex\ncode contexts. Implemented for Go--the programming language widely used in\nmodern microservice architectures where concurrency is pervasive and data races\nare common--Dr.Fix seamlessly integrates into existing development workflows.\nWe detail the design of Dr.Fix and examine how individual design choices\ninfluence the quality of the fixes produced. Over the past 18 months, Dr.Fix\nhas been integrated into developer workflows at Uber demonstrating its\npractical utility. During this period, Dr.Fix produced patches for 224 (55%)\nfrom a corpus of 404 data races spanning various categories; 193 of these\npatches (86%) were accepted by more than a hundred developers via code reviews\nand integrated into the codebase.",
"categories": [
"cs.DC",
"cs.AI",
"cs.LG",
"cs.PL",
"cs.SE"
],
"published": "2025-04-22T06:56:15+00:00",
"url": "http://arxiv.org/pdf/2504.15637v1",
"resource_uri": "arxiv://2504.15637v1",
"citation_count": 0
},
{
"id": "2504.14757v1",
"title": "SWE-Synth: Synthesizing Verifiable Bug-Fix Data to Enable Large Language Models in Resolving Real-World Bugs",
"authors": [
"Minh V. T. Pham",
"Huy N. Phan",
"Hoang N. Phan",
"Cuong Le Chi",
"Tien N. Nguyen",
"Nghi D. Q. Bui"
],
"abstract": "Large language models (LLMs) are transforming automated program repair (APR)\nthrough agent-based approaches that localize bugs, generate patches, and verify\nfixes. However, the lack of high-quality, scalable training datasets,\nespecially those with verifiable outputs and intermediate reasoning\ntraces-limits progress, particularly for open-source models. In this work, we\npresent SWE-Synth, a framework for synthesizing realistic, verifiable, and\nprocess-aware bug-fix datasets at the repository level. SWE-Synth leverages LLM\nagents to simulate debugging workflows, producing not only bug-fix pairs but\nalso test cases and structured repair trajectories. Compared to manually\ncurated datasets, our method scales with minimal human effort while preserving\ncontextual richness and correctness. Experiments show that models trained on\nSWE-Synth outperform those trained on real-world datasets by 2.3% on SWE-Bench\nLite. Our results highlight the potential of synthetic, agent-generated data to\nadvance the state of the art in APR and software engineering automation.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2025-04-20T22:37:43+00:00",
"url": "http://arxiv.org/pdf/2504.14757v1",
"resource_uri": "arxiv://2504.14757v1",
"citation_count": 0
},
{
"id": "2504.14024v1",
"title": "Simplicity by Obfuscation: Evaluating LLM-Driven Code Transformation with Semantic Elasticity",
"authors": [
"Lorenzo De Tomasi",
"Claudio Di Sipio",
"Antinisca Di Marco",
"Phuong T. Nguyen"
],
"abstract": "Code obfuscation is the conversion of original source code into a\nfunctionally equivalent but less readable form, aiming to prevent reverse\nengineering and intellectual property theft. This is a challenging task since\nit is crucial to maintain functional correctness of the code while\nsubstantially disguising the input code. The recent development of large\nlanguage models (LLMs) paves the way for practical applications in different\ndomains, including software engineering. This work performs an empirical study\non the ability of LLMs to obfuscate Python source code and introduces a metric\n(i.e., semantic elasticity) to measure the quality degree of obfuscated code.\nWe experimented with 3 leading LLMs, i.e., Claude-3.5-Sonnet, Gemini-1.5,\nGPT-4-Turbo across 30 Python functions from diverse computational domains. Our\nfindings reveal GPT-4-Turbo's remarkable effectiveness with few-shot prompting\n(81% pass rate versus 29% standard prompting), significantly outperforming both\nGemini-1.5 (39%) and Claude-3.5-Sonnet (30%). Notably, we discovered a\ncounter-intuitive \"obfuscation by simplification\" phenomenon where models\nconsistently reduce rather than increase cyclomatic complexity. This study\nprovides a methodological framework for evaluating AI-driven obfuscation while\nhighlighting promising directions for leveraging LLMs in software security.",
"categories": [
"cs.SE"
],
"published": "2025-04-18T18:29:23+00:00",
"url": "http://arxiv.org/pdf/2504.14024v1",
"resource_uri": "arxiv://2504.14024v1",
"citation_count": 0
},
{
"id": "2504.13272v1",
"title": "Using LLMs for Library Migration",
"authors": [
"Md Mohayeminul Islam",
"Ajay Kumar Jha",
"May Mahmoud",
"Ildar Akhmetov",
"Sarah Nadi"
],
"abstract": "Library migration is the process of replacing a used software library with\nanother library that provides similar functionality. Manual library migration\nis time-consuming and error prone, as it requires developers to understand the\nAPIs of both libraries, map them, and perform the necessary code\ntransformations. Due to its difficulty, most of the existing automated\ntechniques and tooling stop at the API mapping stage or support a limited set\nof code transformations. On the other hand, Large Language Models (LLMs) are\ngood at generating and transforming code and finding similar code, which are\nnecessary upstream tasks for library migration. Such capabilities suggest that\nLLMs may be suitable for library migration. Therefore, in this paper, we\ninvestigate the effectiveness of LLMs for migration between Python libraries.\nWe evaluate three LLMs, LLama 3.1, GPT-4o mini, and GPT-4o on PyMigBench, where\nwe migrate 321 real-world library migrations that include 2,989\nmigration-related code changes. We measure the correctness of the migration\nresults in two ways. We first compare the LLM's migrated code with the\ndevelopers' migrated code in the benchmark and then run the unit tests\navailable in the client repositories. We find that LLama 3.1, GPT-4o mini, and\nGPT-4o correctly migrate 89%, 89%, and 94% of the migration-related code\nchanges. respectively. We also find that 36%, 52% and 64% of the LLama 3.1,\nGPT-4o mini, and GPT-4o migrations pass the same tests that passed in the\ndeveloper's migration. Overall, our results suggest that LLMs can be effective\nin migrating code between libraries, but we also identify cases that pose\ndifficulties for the LLM.",
"categories": [
"cs.SE"
],
"published": "2025-04-17T18:32:48+00:00",
"url": "http://arxiv.org/pdf/2504.13272v1",
"resource_uri": "arxiv://2504.13272v1",
"citation_count": 0
},
{
"id": "2504.12445v1",
"title": "RePurr: Automated Repair of Block-Based Learners' Programs",
"authors": [
"Sebastian Schweikl",
"Gordon Fraser"
],
"abstract": "Programming is increasingly taught using block-based languages like Scratch.\nWhile the use of blocks prevents syntax errors, learners can still make\nsemantic mistakes, requiring feedback and help. As teachers may be overwhelmed\nby help requests in a classroom, may lack programming expertise themselves, or\nmay be unavailable in independent learning scenarios, automated hint generation\nis desirable. Automated program repair (APR) can provide the foundation for\nthis, but relies on multiple assumptions: (1) APR usually targets isolated\nbugs, but learners may fundamentally misunderstand tasks or request help for\nsubstantially incomplete code. (2) Software tests are required to guide the\nsearch and localize broken blocks, but tests for block-based programs are\ndifferent to those in past APR research: They consist of system tests, and very\nfew of them already fully cover the code. At the same time, they have vastly\nlonger runtimes due to animations and interactions on Scratch programs, which\ninhibits the applicability of search. (3) The plastic surgery hypothesis\nassumes the code necessary for repairs already exists in the codebase.\nBlock-based programs tend to be small and may lack this redundancy. To study if\nAPR of such programs is still feasible, we introduce, to the best of our\nknowledge, the first APR approach for Scratch based on evolutionary search. Our\nRePurr prototype includes novel refinements of fault localization to improve\nthe guidance of test suites, recovers the plastic surgery hypothesis by\nexploiting that learning scenarios provide model and student solutions, and\nreduces the costs of fitness evaluations via test parallelization and\nacceleration. Empirical evaluation on a set of real learners' programs confirms\nthe anticipated challenges, but also demonstrates APR can still effectively\nimprove and fix learners' programs, enabling automated generation of hints and\nfeedback.",
"categories": [
"cs.SE"
],
"published": "2025-04-16T19:22:51+00:00",
"url": "http://arxiv.org/pdf/2504.12445v1",
"resource_uri": "arxiv://2504.12445v1",
"citation_count": 0
},
{
"id": "2504.12268v1",
"title": "HLS-Eval: A Benchmark and Framework for Evaluating LLMs on High-Level Synthesis Design Tasks",
"authors": [
"Stefan Abi-Karam",
"Cong Hao"
],
"abstract": "The rapid scaling of large language model (LLM) training and inference has\ndriven their adoption in semiconductor design across academia and industry.\nWhile most prior work evaluates LLMs on hardware description language (HDL)\ntasks, particularly Verilog, designers are increasingly using high-level\nsynthesis (HLS) to build domain-specific accelerators and complex hardware\nsystems. However, benchmarks and tooling to comprehensively evaluate LLMs for\nHLS design tasks remain scarce.\n To address this, we introduce HLS-Eval, the first complete benchmark and\nevaluation framework for LLM-driven HLS design. HLS-Eval targets two core\ntasks: (1) generating HLS code from natural language descriptions, and (2)\nperforming HLS-specific code edits to optimize performance and hardware\nefficiency. The benchmark includes 94 unique designs drawn from standard HLS\nbenchmarks and novel sources. Each case is prepared via a semi-automated flow\nthat produces a natural language description and a paired testbench for\nC-simulation and synthesis validation, ensuring each task is \"LLM-ready.\"\n Beyond the benchmark, HLS-Eval offers a modular Python framework for\nautomated, parallel evaluation of both local and hosted LLMs. It includes a\nparallel evaluation engine, direct HLS tool integration, and abstractions for\nto support different LLM interaction paradigms, enabling rapid prototyping of\nnew benchmarks, tasks, and LLM methods.\n We demonstrate HLS-Eval through baseline evaluations of open-source LLMs on\nVitis HLS, measuring outputs across four key metrics - parseability,\ncompilability, runnability, and synthesizability - reflecting the iterative HLS\ndesign cycle. We also report pass@k metrics, establishing clear baselines and\nreusable infrastructure for the broader LLM-for-hardware community.\n All benchmarks, framework code, and results are open-sourced at\nhttps://github.com/stefanpie/hls-eval.",
"categories": [
"cs.AR",
"cs.AI"
],
"published": "2025-04-16T17:30:36+00:00",
"url": "http://arxiv.org/pdf/2504.12268v1",
"resource_uri": "arxiv://2504.12268v1",
"citation_count": 0
},
{
"id": "2504.10449v1",
"title": "M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models",
"authors": [
"Junxiong Wang",
"Wen-Ding Li",
"Daniele Paliotta",
"Daniel Ritter",
"Alexander M. Rush",
"Tri Dao"
],
"abstract": "Effective reasoning is crucial to solving complex mathematical problems.\nRecent large language models (LLMs) have boosted performance by scaling\ntest-time computation through long chain-of-thought reasoning. However,\ntransformer-based models are inherently limited in extending context length due\nto their quadratic computational complexity and linear memory requirements. In\nthis paper, we introduce a novel hybrid linear RNN reasoning model, M1, built\non the Mamba architecture, which allows memory-efficient inference. Our\napproach leverages a distillation process from existing reasoning models and is\nfurther enhanced through RL training. Experimental results on the AIME and MATH\nbenchmarks show that M1 not only outperforms previous linear RNN models but\nalso matches the performance of state-of-the-art Deepseek R1 distilled\nreasoning models at a similar scale. We also compare our generation speed with\na highly performant general purpose inference engine, vLLM, and observe more\nthan a 3x speedup compared to a same size transformer. With throughput speedup,\nwe are able to achieve higher accuracy compared to DeepSeek R1 distilled\ntransformer reasoning models under a fixed generation time budget using\nself-consistency voting. Overall, we introduce a hybrid Mamba reasoning model\nand provide a more effective approach to scaling test-time generation using\nself-consistency or long chain of thought reasoning.",
"categories": [
"cs.LG"
],
"published": "2025-04-14T17:38:25+00:00",
"url": "http://arxiv.org/pdf/2504.10449v1",
"resource_uri": "arxiv://2504.10449v1",
"citation_count": 0
},
{
"id": "2504.08703v3",
"title": "SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents",
"authors": [
"Muhammad Shihab Rashid",
"Christian Bock",
"Yuan Zhuang",
"Alexander Buchholz",
"Tim Esler",
"Simon Valentin",
"Luca Franceschi",
"Martin Wistuba",
"Prabhu Teja Sivaprasad",
"Woo Jung Kim",
"Anoop Deoras",
"Giovanni Zappella",
"Laurent Callot"
],
"abstract": "Coding agents powered by large language models have shown impressive\ncapabilities in software engineering tasks, but evaluating their performance\nacross diverse programming languages and real-world scenarios remains\nchallenging. We introduce SWE-PolyBench, a new multi-language benchmark for\nrepository-level, execution-based evaluation of coding agents. SWE-PolyBench\ncontains 2110 instances from 21 repositories and includes tasks in Java (165),\nJavaScript (1017), TypeScript (729) and Python (199), covering bug fixes,\nfeature additions, and code refactoring. We provide a task and\nrepository-stratified subsample (SWE-PolyBench500) and release an evaluation\nharness allowing for fully automated evaluation. To enable a more comprehensive\ncomparison of coding agents, this work also presents a novel set of metrics\nrooted in syntax tree analysis. We evaluate leading open source coding agents\non SWE-PolyBench, revealing their strengths and limitations across languages,\ntask types, and complexity classes. Our experiments show that current agents\nexhibit uneven performances across languages and struggle with complex problems\nwhile showing higher performance on simpler tasks. SWE-PolyBench aims to drive\nprogress in developing more versatile and robust AI coding assistants for\nreal-world software engineering. Our datasets and code are available at:\nhttps://github.com/amazon-science/SWE-PolyBench",
"categories": [
"cs.SE"
],
"published": "2025-04-11T17:08:02+00:00",
"url": "http://arxiv.org/pdf/2504.08703v3",
"resource_uri": "arxiv://2504.08703v3",
"citation_count": 0
},
{
"id": "2504.08368v1",
"title": "FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations",
"authors": [
"Cheng-Yu Hsieh",
"Pavan Kumar Anasosalu Vasu",
"Fartash Faghri",
"Raviteja Vemulapalli",
"Chun-Liang Li",
"Ranjay Krishna",
"Oncel Tuzel",
"Hadi Pouransari"
],
"abstract": "Visual understanding is inherently contextual -- what we focus on in an image\ndepends on the task at hand. For instance, given an image of a person holding a\nbouquet of flowers, we may focus on either the person such as their clothing,\nor the type of flowers, depending on the context of interest. Yet, most\nexisting image encoding paradigms represent an image as a fixed, generic\nfeature vector, overlooking the potential needs of prioritizing varying visual\ninformation for different downstream use cases. In this work, we introduce\nFocalLens, a conditional visual encoding method that produces different\nrepresentations for the same image based on the context of interest, expressed\nflexibly through natural language. We leverage vision instruction tuning data\nand contrastively finetune a pretrained vision encoder to take natural language\ninstructions as additional inputs for producing conditional image\nrepresentations. Extensive experiments validate that conditional image\nrepresentation from FocalLens better pronounce the visual features of interest\ncompared to generic features produced by standard vision encoders like CLIP. In\naddition, we show FocalLens further leads to performance improvements on a\nrange of downstream tasks including image-image retrieval, image\nclassification, and image-text retrieval, with an average gain of 5 and 10\npoints on the challenging SugarCrepe and MMVP-VLM benchmarks, respectively.",
"categories": [
"cs.CV",
"cs.CL",
"cs.LG"
],
"published": "2025-04-11T09:07:05+00:00",
"url": "http://arxiv.org/pdf/2504.08368v1",
"resource_uri": "arxiv://2504.08368v1",
"citation_count": 0
},
{
"id": "2504.07634v1",
"title": "Agent That Debugs: Dynamic State-Guided Vulnerability Repair",
"authors": [
"Zhengyao Liu",
"Yunlong Ma",
"Jingxuan Xu",
"Junchen Ai",
"Xiang Gao",
"Hailong Sun",
"Abhik Roychoudhury"
],
"abstract": "In recent years, more vulnerabilities have been discovered every day, while\nmanual vulnerability repair requires specialized knowledge and is\ntime-consuming. As a result, many detected or even published vulnerabilities\nremain unpatched, thereby increasing the exposure of software systems to\nattacks. Recent advancements in agents based on Large Language Models have\ndemonstrated their increasing capabilities in code understanding and\ngeneration, which can be promising to achieve automated vulnerability repair.\nHowever, the effectiveness of agents based on static information retrieval is\nstill not sufficient for patch generation. To address the challenge, we propose\na program repair agent called VulDebugger that fully utilizes both static and\ndynamic context, and it debugs programs in a manner akin to humans. The agent\ninspects the actual state of the program via the debugger and infers expected\nstates via constraints that need to be satisfied. By continuously comparing the\nactual state with the expected state, it deeply understands the root causes of\nthe vulnerabilities and ultimately accomplishes repairs. We experimentally\nevaluated VulDebugger on 50 real-life projects. With 60.00% successfully fixed,\nVulDebugger significantly outperforms state-of-the-art approaches for\nvulnerability repair.",
"categories": [
"cs.SE"
],
"published": "2025-04-10T10:31:10+00:00",
"url": "http://arxiv.org/pdf/2504.07634v1",
"resource_uri": "arxiv://2504.07634v1",
"citation_count": 0
},
{
"id": "2504.07027v1",
"title": "Using ML filters to help automated vulnerability repairs: when it helps and when it doesn't",
"authors": [
"Maria Camporese",
"Fabio Massacci"
],
"abstract": "[Context:] The acceptance of candidate patches in automated program repair\nhas been typically based on testing oracles. Testing requires typically a\ncostly process of building the application while ML models can be used to\nquickly classify patches, thus allowing more candidate patches to be generated\nin a positive feedback loop. [Problem:] If the model predictions are unreliable\n(as in vulnerability detection) they can hardly replace the more reliable\noracles based on testing. [New Idea:] We propose to use an ML model as a\npreliminary filter of candidate patches which is put in front of a traditional\nfilter based on testing. [Preliminary Results:] We identify some theoretical\nbounds on the precision and recall of the ML algorithm that makes such\noperation meaningful in practice. With these bounds and the results published\nin the literature, we calculate how fast some of state-of-the art vulnerability\ndetectors must be to be more effective over a traditional AVR pipeline such as\nAPR4Vuln based just on testing.",
"categories": [
"cs.SE",
"cs.CR",
"cs.LG"
],
"published": "2025-04-09T16:39:09+00:00",
"url": "http://arxiv.org/pdf/2504.07027v1",
"resource_uri": "arxiv://2504.07027v1",
"citation_count": 0
},
{
"id": "2504.06939v1",
"title": "FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks",
"authors": [
"Dekun Dai",
"MingWei Liu",
"Anji Li",
"Jialun Cao",
"Yanlin Wang",
"Chong Wang",
"Xin Peng",
"Zibin Zheng"
],
"abstract": "Code repair is a fundamental task in software development, facilitating\nefficient bug resolution and software maintenance. Although large language\nmodels (LLMs) have demonstrated considerable potential in automated code\nrepair, their ability to comprehend and effectively leverage diverse types of\nfeedback remains insufficiently understood. To bridge this gap, we introduce\nFeedbackEval, a systematic benchmark for evaluating LLMs' feedback\ncomprehension and performance in code repair tasks. We conduct a comprehensive\nempirical study on five state-of-the-art LLMs, including GPT-4o, Claude-3.5,\nGemini-1.5, GLM-4, and Qwen2.5, to evaluate their behavior under both\nsingle-iteration and iterative code repair settings. Our results show that\nstructured feedback, particularly in the form of test feedback, leads to the\nhighest repair success rates, while unstructured feedback proves significantly\nless effective. Iterative feedback further enhances repair performance, though\nthe marginal benefit diminishes after two or three rounds. Moreover, prompt\nstructure is shown to be critical: incorporating docstrings, contextual\ninformation, and explicit guidelines substantially improves outcomes, whereas\npersona-based, chain-of-thought, and few-shot prompting strategies offer\nlimited benefits in single-iteration scenarios. This work introduces a robust\nbenchmark and delivers practical insights to advance the understanding and\ndevelopment of feedback-driven code repair using LLMs.",
"categories": [
"cs.SE"
],
"published": "2025-04-09T14:43:08+00:00",
"url": "http://arxiv.org/pdf/2504.06939v1",
"resource_uri": "arxiv://2504.06939v1",
"citation_count": 0
},
{
"id": "2504.04708v1",
"title": "SapiensID: Foundation for Human Recognition",
"authors": [
"Minchul Kim",
"Dingqiang Ye",
"Yiyang Su",
"Feng Liu",
"Xiaoming Liu"
],
"abstract": "Existing human recognition systems often rely on separate, specialized models\nfor face and body analysis, limiting their effectiveness in real-world\nscenarios where pose, visibility, and context vary widely. This paper\nintroduces SapiensID, a unified model that bridges this gap, achieving robust\nperformance across diverse settings. SapiensID introduces (i) Retina Patch\n(RP), a dynamic patch generation scheme that adapts to subject scale and\nensures consistent tokenization of regions of interest, (ii) a masked\nrecognition model (MRM) that learns from variable token length, and (iii)\nSemantic Attention Head (SAH), an module that learns pose-invariant\nrepresentations by pooling features around key body parts. To facilitate\ntraining, we introduce WebBody4M, a large-scale dataset capturing diverse poses\nand scale variations. Extensive experiments demonstrate that SapiensID achieves\nstate-of-the-art results on various body ReID benchmarks, outperforming\nspecialized models in both short-term and long-term scenarios while remaining\ncompetitive with dedicated face recognition systems. Furthermore, SapiensID\nestablishes a strong baseline for the newly introduced challenge of Cross\nPose-Scale ReID, demonstrating its ability to generalize to complex, real-world\nconditions.",
"categories": [
"cs.CV"
],
"published": "2025-04-07T03:38:07+00:00",
"url": "http://arxiv.org/pdf/2504.04708v1",
"resource_uri": "arxiv://2504.04708v1",
"citation_count": 0
},
{
"id": "2504.04657v1",
"title": "ACE-RLHF: Automated Code Evaluation and Socratic Feedback Generation Tool using Large Language Models and Reinforcement Learning with Human Feedback",
"authors": [
"Tasnia Rahman",
"Sathish A. P. Kumar",
"Sumit Jha",
"Arvind Ramanathan"
],
"abstract": "Automated Program Repair tools are developed for generating feedback and\nsuggesting a repair method for erroneous code. State of the art (SOTA) code\nrepair methods rely on data-driven approaches and often fail to deliver\nsolution for complicated programming questions. To interpret the natural\nlanguage of unprecedented programming problems, using Large Language Models\n(LLMs) for code-feedback generation is crucial. LLMs generate more\ncomprehensible feedback than compiler-generated error messages, and\nReinforcement Learning with Human Feedback (RLHF) further enhances quality by\nintegrating human-in-the-loop which helps novice students to lean programming\nfrom scratch interactively. We are applying RLHF fine-tuning technique for an\nexpected Socratic response such as a question with hint to solve the\nprogramming issue. We are proposing code feedback generation tool by\nfine-tuning LLM with RLHF, Automated Code Evaluation with RLHF (ACE-RLHF),\ncombining two open-source LLM models with two different SOTA optimization\ntechniques. The quality of feedback is evaluated on two benchmark datasets\ncontaining basic and competition-level programming questions where the later is\nproposed by us. We achieved 2-5% higher accuracy than RL-free SOTA techniques\nusing Llama-3-7B-Proximal-policy optimization in automated evaluation and\nsimilar or slightly higher accuracy compared to reward model-free RL with AI\nFeedback (RLAIF). We achieved almost 40% higher accuracy with GPT-3.5 Best-of-n\noptimization while performing manual evaluation.",
"categories": [
"cs.LG"
],
"published": "2025-04-07T01:11:22+00:00",
"url": "http://arxiv.org/pdf/2504.04657v1",
"resource_uri": "arxiv://2504.04657v1",
"citation_count": 0
},
{
"id": "2504.04372v2",
"title": "How Accurately Do Large Language Models Understand Code?",
"authors": [
"Sabaat Haroon",
"Ahmad Faraz Khan",
"Ahmad Humayun",
"Waris Gill",
"Abdul Haddi Amjad",
"Ali R. Butt",
"Mohammad Taha Khan",
"Muhammad Ali Gulzar"
],
"abstract": "Large Language Models (LLMs) are increasingly used in post-development tasks\nsuch as code repair and testing. A key factor in these tasks' success is the\nmodel's deep understanding of code. However, the extent to which LLMs truly\nunderstand code remains largely unevaluated. Quantifying code comprehension is\nchallenging due to its abstract nature and the lack of a standardized metric.\nPreviously, this was assessed through developer surveys, which are not feasible\nfor evaluating LLMs. Existing LLM benchmarks focus primarily on code\ngeneration, fundamentally different from code comprehension. Additionally,\nfixed benchmarks quickly become obsolete as they become part of the training\ndata. This paper presents the first large-scale empirical investigation into\nLLMs' ability to understand code. Inspired by mutation testing, we use an LLM's\nfault-finding ability as a proxy for its deep code understanding. This approach\nis based on the insight that a model capable of identifying subtle functional\ndiscrepancies must understand the code well. We inject faults in real-world\nprograms and ask the LLM to localize them, ensuring the specifications suffice\nfor fault localization. Next, we apply semantic-preserving code mutations\n(SPMs) to the faulty programs and test whether the LLMs still locate the\nfaults, verifying their confidence in code understanding. We evaluate nine\npopular LLMs on 600,010 debugging tasks from 670 Java and 637 Python programs.\nWe find that LLMs lose the ability to debug the same bug in 78% of faulty\nprograms when SPMs are applied, indicating a shallow understanding of code and\nreliance on features irrelevant to semantics. We also find that LLMs understand\ncode earlier in the program better than later. This suggests that LLMs' code\ncomprehension remains tied to lexical and syntactic features due to\ntokenization designed for natural languages, which overlooks code semantics.",
"categories": [
"cs.SE",
"cs.AI",
"cs.LG"
],
"published": "2025-04-06T05:59:29+00:00",
"url": "http://arxiv.org/pdf/2504.04372v2",
"resource_uri": "arxiv://2504.04372v2",
"citation_count": 0
},
{
"id": "2504.02141v1",
"title": "On Simulation-Guided LLM-based Code Generation for Safe Autonomous Driving Software",
"authors": [
"Ali Nouri",
"Johan Andersson",
"Kailash De Jesus Hornig",
"Zhennan Fei",
"Emil Knabe",
"Hakan Sivencrona",
"Beatriz Cabrero-Daniel",
"Christian Berger"
],
"abstract": "Automated Driving System (ADS) is a safety-critical software system\nresponsible for the interpretation of the vehicle's environment and making\ndecisions accordingly. The unbounded complexity of the driving context,\nincluding unforeseeable events, necessitate continuous improvement, often\nachieved through iterative DevOps processes. However, DevOps processes are\nthemselves complex, making these improvements both time- and\nresource-intensive. Automation in code generation for ADS using Large Language\nModels (LLM) is one potential approach to address this challenge. Nevertheless,\nthe development of ADS requires rigorous processes to verify, validate, assess,\nand qualify the code before it can be deployed in the vehicle and used. In this\nstudy, we developed and evaluated a prototype for automatic code generation and\nassessment using a designed pipeline of a LLM-based agent, simulation model,\nand rule-based feedback generator in an industrial setup. The LLM-generated\ncode is evaluated automatically in a simulation model against multiple critical\ntraffic scenarios, and an assessment report is provided as feedback to the LLM\nfor modification or bug fixing. We report about the experimental results of the\nprototype employing Codellama:34b, DeepSeek (r1:32b and Coder:33b),\nCodeGemma:7b, Mistral:7b, and GPT4 for Adaptive Cruise Control (ACC) and\nUnsupervised Collision Avoidance by Evasive Manoeuvre (CAEM). We finally\nassessed the tool with 11 experts at two Original Equipment Manufacturers\n(OEMs) by conducting an interview study.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2025-04-02T21:35:11+00:00",
"url": "http://arxiv.org/pdf/2504.02141v1",
"resource_uri": "arxiv://2504.02141v1",
"citation_count": 0
},
{
"id": "2504.01523v1",
"title": "Adapting Knowledge Prompt Tuning for Enhanced Automated Program Repair",
"authors": [
"Xuemeng Cai",
"Lingxiao Jiang"
],
"abstract": "Automated Program Repair (APR) aims to enhance software reliability by\nautomatically generating bug-fixing patches. Recent work has improved the\nstate-of-the-art of APR by fine-tuning pre-trained large language models\n(LLMs), such as CodeT5, for APR. However, the effectiveness of fine-tuning\nbecomes weakened in data scarcity scenarios, and data scarcity can be a common\nissue in practice, limiting fine-tuning performance. To alleviate this\nlimitation, this paper adapts prompt tuning for enhanced APR and conducts a\ncomprehensive study to evaluate its effectiveness in data scarcity scenarios,\nusing three LLMs of different sizes and six diverse datasets across four\nprogramming languages. Prompt tuning rewrites the input to a model by adding\nextra prompt tokens and tunes both the model and the prompts on a small\ndataset. These tokens provide task-specific knowledge that can improve the\nmodel for APR, which is especially critical in data scarcity scenarios.\nMoreover, domain knowledge has proven crucial in many code intelligence tasks,\nbut existing studies fail to leverage domain knowledge during the prompt tuning\nfor APR. To close this gap, we introduce knowledge prompt tuning, an approach\nthat adapts prompt tuning with six distinct types of code- or bug-related\ndomain knowledge for APR. Our work, to the best of our knowledge, is the first\nto adapt and evaluate prompt tuning and the effectiveness of code- or\nbug-related domain knowledge for APR, particularly under data scarcity\nsettings. Our evaluation results demonstrate that prompt tuning with knowledge\ngenerally outperforms fine-tuning under various experimental settings,\nachieving an average improvement of 87.33% over fine-tuning in data scarcity\nscenarios.",
"categories": [
"cs.SE"
],
"published": "2025-04-02T09:10:02+00:00",
"url": "http://arxiv.org/pdf/2504.01523v1",
"resource_uri": "arxiv://2504.01523v1",
"citation_count": 0
},
{
"id": "2504.01404v1",
"title": "LLM4SZZ: Enhancing SZZ Algorithm with Context-Enhanced Assessment on Large Language Models",
"authors": [
"Lingxiao Tang",
"Jiakun Liu",
"Zhongxin Liu",
"Xiaohu Yang",
"Lingfeng Bao"
],
"abstract": "The SZZ algorithm is the dominant technique for identifying bug-inducing\ncommits and serves as a foundation for many software engineering studies, such\nas bug prediction and static code analysis. Researchers have proposed many\nvariants to enhance the SZZ algorithm's performance since its introduction. The\nmajority of them rely on static techniques or heuristic assumptions, making\nthem easy to implement, but their performance improvements are often limited.\nRecently, a deep learning-based SZZ algorithm has been introduced to enhance\nthe original SZZ algorithm. However, it requires complex preprocessing and is\nrestricted to a single programming language. Additionally, while it enhances\nprecision, it sacrifices recall. Furthermore, most of variants overlook crucial\ninformation, such as commit messages and patch context, and are limited to\nbug-fixing commits involving deleted lines. The emergence of large language\nmodels (LLMs) offers an opportunity to address these drawbacks. In this study,\nwe investigate the strengths and limitations of LLMs and propose LLM4SZZ, which\nemploys two approaches (i.e., rank-based identification and context-enhanced\nidentification) to handle different types of bug-fixing commits. We determine\nwhich approach to adopt based on the LLM's ability to comprehend the bug and\nidentify whether the bug is present in a commit. The context-enhanced\nidentification provides the LLM with more context and requires it to find the\nbug-inducing commit among a set of candidate commits. In rank-based\nidentification, we ask the LLM to select buggy statements from the bug-fixing\ncommit and rank them based on their relevance to the root cause. Experimental\nresults show that LLM4SZZ outperforms all baselines across three datasets,\nimproving F1-score by 6.9% to 16.0% without significantly sacrificing recall.",
"categories": [
"cs.SE"
],
"published": "2025-04-02T06:40:57+00:00",
"url": "http://arxiv.org/pdf/2504.01404v1",
"resource_uri": "arxiv://2504.01404v1",
"citation_count": 0
},
{
"id": "2503.23803v2",
"title": "Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute",
"authors": [
"Yingwei Ma",
"Yongbin Li",
"Yihong Dong",
"Xue Jiang",
"Rongyu Cao",
"Jue Chen",
"Fei Huang",
"Binhua Li"
],
"abstract": "Recent advancements in software engineering agents have demonstrated\npromising capabilities in automating program improvements. However, their\nreliance on closed-source or resource-intensive models introduces significant\ndeployment challenges in private environments, prompting a critical question:\n\\textit{How can personally deployable open-source LLMs achieve comparable code\nreasoning performance?}\n To this end, we propose a unified Test-Time Compute scaling framework that\nleverages increased inference-time computation instead of larger models. Our\nframework incorporates two complementary strategies: internal TTC and external\nTTC. Internally, we introduce a \\textit{development-contextualized trajectory\nsynthesis} method leveraging real-world software repositories to bootstrap\nmulti-stage reasoning processes, such as fault localization and patch\ngeneration. We further enhance trajectory quality through rejection sampling,\nrigorously evaluating trajectories along accuracy and complexity. Externally,\nwe propose a novel \\textit{development-process-based search} strategy guided by\nreward models and execution verification. This approach enables targeted\ncomputational allocation at critical development decision points, overcoming\nlimitations of existing \"end-point only\" verification methods.\n Evaluations on SWE-bench Verified demonstrate our \\textbf{32B model achieves\na 46\\% issue resolution rate}, surpassing significantly larger models such as\nDeepSeek R1 671B and OpenAI o1. Additionally, we provide the empirical\nvalidation of the test-time scaling phenomenon within SWE agents, revealing\nthat \\textbf{models dynamically allocate more tokens to increasingly\nchallenging problems}, effectively enhancing reasoning capabilities. We\npublicly release all training data, models, and code to facilitate future\nresearch. https://github.com/yingweima2022/SWE-Reasoner",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2025-03-31T07:31:32+00:00",
"url": "http://arxiv.org/pdf/2503.23803v2",
"resource_uri": "arxiv://2503.23803v2",
"citation_count": 0
},
{
"id": "2503.23791v1",
"title": "LLMigrate: Transforming \"Lazy\" Large Language Models into Efficient Source Code Migrators",
"authors": [
"Yuchen Liu",
"Junhao Hu",
"Yingdi Shan",
"Ge Li",
"Yanzhen Zou",
"Yihong Dong",
"Tao Xie"
],
"abstract": "Rewriting C code in Rust provides stronger memory safety, yet migrating large\ncodebases such as the 32-million-line Linux kernel remains challenging. While\nrule-based translators (e.g., C2Rust) provide accurate yet largely unsafe Rust\nprograms, recent Large Language Model (LLM) approaches produce more idiomatic,\nsafe Rust programs but frequently exhibit \"laziness\", omitting significant\nportions of the target code. To address the issue, in this paper, we present\nLLMigrate, an LLM-based C-to-Rust translation tool that splits modules into\ndiscrete functions, translating them individually, and then reintegrating them.\nLLMigrate uses static analysis to retain necessary context, pairs GPT-4o (a\nstate-of-the-art LLM) with compiler-driven translation and program-repair\ntechniques for complex core functions, and leverages call-graph-guided\ntranslation to ensure consistent interfaces. Evaluations on three\nrepresentative Linux kernel modules (math, sort, and ramfs) show that LLMigrate\nrequires modifying less than 15\\% of the target code, significantly\noutperforming a pure GPT-4o-based migration.",
"categories": [
"cs.PL",
"cs.SE"
],
"published": "2025-03-31T07:09:07+00:00",
"url": "http://arxiv.org/pdf/2503.23791v1",
"resource_uri": "arxiv://2503.23791v1",
"citation_count": 0
},
{
"id": "2503.22821v1",
"title": "Identifying and Mitigating API Misuse in Large Language Models",
"authors": [
"Terry Yue Zhuo",
"Junda He",
"Jiamou Sun",
"Zhenchang Xing",
"David Lo",
"John Grundy",
"Xiaoning Du"
],
"abstract": "API misuse in code generated by large language models (LLMs) represents a\nserious emerging challenge in software development. While LLMs have\ndemonstrated impressive code generation capabilities, their interactions with\ncomplex library APIs remain highly prone to errors, potentially leading to\nsoftware failures and security vulnerabilities. This paper presents the first\ncomprehensive study of API misuse patterns in LLM-generated code, analyzing\nboth method selection and parameter usage across Python and Java. Through\nextensive manual annotation of 3,892 method-level and 2,560 parameter-level\nmisuses, we develop a novel taxonomy of four distinct API misuse types specific\nto LLMs, which significantly differ from traditional human-centric misuse\npatterns. Our evaluation of two widely used LLMs, StarCoder-7B (open-source)\nand Copilot (closed-source), reveals significant challenges in API usage,\nparticularly in areas of hallucination and intent misalignment. We propose\nDr.Fix, a novel LLM-based automatic program repair approach for API misuse\nbased on the aforementioned taxonomy. Our method substantially improves repair\naccuracy for real-world API misuse, demonstrated by increases of up to 38.4\npoints in BLEU scores and 40 percentage points in exact match rates across\ndifferent models and programming languages. This work provides crucial insights\ninto the limitations of current LLMs in API usage and presents an effective\nsolution for the automated repair of API misuse in LLM-generated code.",
"categories": [
"cs.SE"
],
"published": "2025-03-28T18:43:12+00:00",
"url": "http://arxiv.org/pdf/2503.22821v1",
"resource_uri": "arxiv://2503.22821v1",
"citation_count": 0
},
{
"id": "2503.22512v3",
"title": "Unlocking LLM Repair Capabilities in Low-Resource Programming Languages Through Cross-Language Translation and Multi-Agent Refinement",
"authors": [
"Wenqiang Luo",
"Jacky Wai Keung",
"Boyang Yang",
"Jacques Klein",
"Tegawende F. Bissyande",
"Haoye Tian",
"Bach Le"
],
"abstract": "Recent advances in leveraging LLMs for APR have demonstrated impressive\ncapabilities in fixing software defects. However, current LLM-based approaches\npredominantly focus on mainstream programming languages like Java and Python,\nneglecting less prevalent but emerging languages such as Rust due to expensive\ntraining resources, limited datasets, and insufficient community support. This\nnarrow focus creates a significant gap in repair capabilities across the\nprogramming language spectrum, where the full potential of LLMs for\ncomprehensive multilingual program repair remains largely unexplored. To\naddress this limitation, we introduce a novel cross-language program repair\napproach LANTERN that leverages LLMs' differential proficiency across languages\nthrough a multi-agent iterative repair paradigm. Our technique strategically\ntranslates defective code from languages where LLMs exhibit weaker repair\ncapabilities to languages where they demonstrate stronger performance, without\nrequiring additional training. A key innovation of our approach is an LLM-based\ndecision-making system that dynamically selects optimal target languages based\non bug characteristics and continuously incorporates feedback from previous\nrepair attempts. We evaluate our method on xCodeEval, a comprehensive\nmultilingual benchmark comprising 5,068 bugs across 11 programming languages.\nResults demonstrate significant enhancement in repair effectiveness,\nparticularly for underrepresented languages, with Rust showing a 22.09%\nimprovement in Pass@10 metrics. Our research provides the first empirical\nevidence that cross-language translation significantly expands the repair\ncapabilities of LLMs and effectively bridges the performance gap between\nprogramming languages with different levels of popularity, opening new avenues\nfor truly language-agnostic automated program repair.",
"categories": [
"cs.SE"
],
"published": "2025-03-28T15:15:56+00:00",
"url": "http://arxiv.org/pdf/2503.22512v3",
"resource_uri": "arxiv://2503.22512v3",
"citation_count": 0
},
{
"id": "2503.22424v1",
"title": "CoSIL: Software Issue Localization via LLM-Driven Code Repository Graph Searching",
"authors": [
"Zhonghao Jiang",
"Xiaoxue Ren",
"Meng Yan",
"Wei Jiang",
"Yong Li",
"Zhongxin Liu"
],
"abstract": "Large language models (LLMs) have significantly advanced autonomous software\nengineering, leading to a growing number of software engineering agents that\nassist developers in automatic program repair. Issue localization forms the\nbasis for accurate patch generation. However, because of limitations caused by\nthe context window length of LLMs, existing issue localization methods face\nchallenges in balancing concise yet effective contexts and adequately\ncomprehensive search spaces. In this paper, we introduce CoSIL, an LLM driven,\nsimple yet powerful function level issue localization method without training\nor indexing. CoSIL reduces the search space through module call graphs,\niteratively searches the function call graph to obtain relevant contexts, and\nuses context pruning to control the search direction and manage contexts\neffectively. Importantly, the call graph is dynamically constructed by the LLM\nduring search, eliminating the need for pre-parsing. Experiment results\ndemonstrate that CoSIL achieves a Top-1 localization success rate of 43 percent\nand 44.6 percent on SWE bench Lite and SWE bench Verified, respectively, using\nQwen2.5 Coder 32B, outperforming existing methods by 8.6 to 98.2 percent. When\nCoSIL is applied to guide the patch generation stage, the resolved rate further\nimproves by 9.3 to 31.5 percent.",
"categories": [
"cs.SE",
"cs.AI",
"cs.CL"
],
"published": "2025-03-28T13:36:26+00:00",
"url": "http://arxiv.org/pdf/2503.22424v1",
"resource_uri": "arxiv://2503.22424v1",
"citation_count": 0
},
{
"id": "2503.22388v2",
"title": "Why Stop at One Error? Benchmarking LLMs as Data Science Code Debuggers for Multi-Hop and Multi-Bug Errors",
"authors": [
"Zhiyu Yang",
"Shuo Wang",
"Yukun Yan",
"Yang Deng"
],
"abstract": "LLMs are transforming software development, yet current code generation and\ncode repair benchmarks mainly assess syntactic and functional correctness in\nsimple, single-error cases. LLMs' capabilities to autonomously find and fix\nruntime logical errors in complex data science code remain largely unexplored.\nTo address this gap, we introduce DSDBench: the Data Science Debugging\nBenchmark, the first benchmark for systematic evaluation of LLMs on multi-hop\nerror tracing and multi-bug detection in data science code debugging. DSDBench\nadapts datasets from existing data science task benchmarks, such as DABench and\nMatPlotBench, featuring realistic data science debugging tasks with\nautomatically synthesized multi-hop, multi-bug code snippets. DSDBench includes\n1,117 annotated samples with 741 cause-effect error pairs and runtime error\nmessages. Evaluations of state-of-the-art LLMs on DSDBench show significant\nperformance gaps, highlighting challenges in debugging logical runtime errors\nin data science code. DSDBench offers a crucial resource to evaluate and\nimprove LLMs' debugging and reasoning capabilities, enabling more reliable\nAI-assisted data science in the future. DSDBench is publicly available at\ngithub.com/KevinCL16/DSDBench.",
"categories": [
"cs.CL"
],
"published": "2025-03-28T12:46:54+00:00",
"url": "http://arxiv.org/pdf/2503.22388v2",
"resource_uri": "arxiv://2503.22388v2",
"citation_count": 0
},
{
"id": "2503.21971v3",
"title": "RocketPPA: Code-Level Power, Performance, and Area Prediction via LLM and Mixture of Experts",
"authors": [
"Armin Abdollahi",
"Mehdi Kamal",
"Massoud Pedram"
],
"abstract": "This paper presents RocketPPA, a novel ultra-fast power, performance (delay),\nand area (PPA) estimator operating directly at the code-level abstraction using\nHDL code as input. The key technical innovation is its LLM-based regression\nmodel, which uniquely integrates a large language model (LLM) with a\nmixture-of-experts (MoE) architecture composed of multilayer perceptrons\n(MLPs). The LLM interprets the input HDL code and then utilizes its final\nhidden-layer representations to predict PPA metrics. Low-rank adaptation (LoRA)\nis used for parameter-efficient fine-tuning to enable efficient LLM training.\nFurthermore, the work includes the development of an LLM-based HDL code repair\nframework to generate a large and synthesizable training dataset. Experimental\nresults on the VerilogEval benchmark demonstrate that RocketPPA achieves\nsignificant improvements in the accuracy of PPA estimation compared to previous\nstate-of-the-art methods like Llama3-MetRex-8B. Specifically, at a 10% relative\nerror threshold, RocketPPA enhances the pass rate for area prediction by 13.6%,\ndelay by 9.4%, and power by 14.7%. At a 20% threshold, the improvements are\n9.6% for area, 10.8% for delay, and 18.5% for power. Moreover, RocketPPA\nachieves a speedup of over 20x compared to MetRex and 30x over MasterRTL in\nprocessing the test set. The impact of RocketPPA is the potential to\nsubstantially accelerate the hardware design process by providing accurate PPA\nestimations early in the design cycle, thus avoiding the overhead of manual\nfeature engineering and time-consuming synthesis flows.",
"categories": [
"cs.LG",
"cs.SE"
],
"published": "2025-03-27T20:35:09+00:00",
"url": "http://arxiv.org/pdf/2503.21971v3",
"resource_uri": "arxiv://2503.21971v3",
"citation_count": 0
},
{
"id": "2503.21710v1",
"title": "Enhancing Repository-Level Software Repair via Repository-Aware Knowledge Graphs",
"authors": [
"Boyang Yang",
"Haoye Tian",
"Jiadong Ren",
"Shunfu Jin",
"Yang Liu",
"Feng Liu",
"Bach Le"
],
"abstract": "Repository-level software repair faces challenges in bridging semantic gaps\nbetween issue descriptions and code patches. Existing approaches, which mostly\ndepend on large language models (LLMs), suffer from semantic ambiguities,\nlimited structural context understanding, and insufficient reasoning\ncapability. To address these limitations, we propose KGCompass with two\ninnovations: (1) a novel repository-aware knowledge graph (KG) that accurately\nlinks repository artifacts (issues and pull requests) and codebase entities\n(files, classes, and functions), allowing us to effectively narrow down the\nvast search space to only 20 most relevant functions with accurate candidate\nbug locations and contextual information, and (2) a path-guided repair\nmechanism that leverages KG-mined entity path, tracing through which allows us\nto augment LLMs with relevant contextual information to generate precise\npatches along with their explanations. Experimental results in the\nSWE-Bench-Lite demonstrate that KGCompass achieves state-of-the-art repair\nperformance (45.67%) and function-level localization accuracy (51.33%) across\nopen-source approaches, costing only $0.20 per repair. Our analysis reveals\nthat among successfully localized bugs, 69.7% require multi-hop traversals\nthrough the knowledge graph, without which LLM-based approaches struggle to\naccurately locate bugs. The knowledge graph built in KGCompass is language\nagnostic and can be incrementally updated, making it a practical solution for\nreal-world development environments.",
"categories": [
"cs.SE"
],
"published": "2025-03-27T17:21:47+00:00",
"url": "http://arxiv.org/pdf/2503.21710v1",
"resource_uri": "arxiv://2503.21710v1",
"citation_count": 0
},
{
"id": "2503.19449v3",
"title": "VecTrans: Enhancing Compiler Auto-Vectorization through LLM-Assisted Code Transformations",
"authors": [
"Zhongchun Zheng",
"Kan Wu",
"Long Cheng",
"Lu Li",
"Rodrigo C. O. Rocha",
"Tianyi Liu",
"Wei Wei",
"Jianjiang Zeng",
"Xianwei Zhang",
"Yaoqing Gao"
],
"abstract": "Auto-vectorization is a fundamental optimization for modern compilers to\nexploit SIMD parallelism. However, state-of-the-art approaches still struggle\nto handle intricate code patterns, often requiring manual hints or\ndomain-specific expertise. Large language models (LLMs), with their ability to\ncapture intricate patterns, provide a promising solution, yet their effective\napplication in compiler optimizations remains an open challenge due to issues\nsuch as hallucinations and a lack of domain-specific reasoning. In this paper,\nwe present VecTrans, a novel framework that leverages LLMs to enhance\ncompiler-based code vectorization. VecTrans first employs compiler analysis to\nidentify potentially vectorizable code regions. It then utilizes an LLM to\nrefactor these regions into patterns that are more amenable to the compilers\nauto-vectorization. To ensure semantic correctness, VecTrans further integrates\na hybrid validation mechanism at the intermediate representation (IR) level.\nWith the above efforts, VecTrans combines the adaptability of LLMs with the\nprecision of compiler vectorization, thereby effectively opening up the\nvectorization opportunities. experimental results show that among all TSVC\nfunctions unvectorizable by GCC, ICC, Clang, and BiSheng Compiler, VecTrans\nachieves an geomean speedup of 1.77x and successfully vectorizes 24 of 51 test\ncases. This marks a significant advancement over state-of-the-art approaches\nwhile maintaining a cost efficiency of $0.012 per function optimization for LLM\nAPI usage.",
"categories": [
"cs.SE",
"cs.AI",
"cs.LG",
"cs.PF"
],
"published": "2025-03-25T08:39:35+00:00",
"url": "http://arxiv.org/pdf/2503.19449v3",
"resource_uri": "arxiv://2503.19449v3",
"citation_count": 0
},
{
"id": "2503.17080v1",
"title": "Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection",
"authors": [
"Gensheng Pei",
"Tao Chen",
"Yujia Wang",
"Xinhao Cai",
"Xiangbo Shu",
"Tianfei Zhou",
"Yazhou Yao"
],
"abstract": "The CLIP model has demonstrated significant advancements in aligning visual\nand language modalities through large-scale pre-training on image-text pairs,\nenabling strong zero-shot classification and retrieval capabilities on various\ndomains. However, CLIP's training remains computationally intensive, with high\ndemands on both data processing and memory. To address these challenges, recent\nmasking strategies have emerged, focusing on the selective removal of image\npatches to improve training efficiency. Although effective, these methods often\ncompromise key semantic information, resulting in suboptimal alignment between\nvisual features and text descriptions. In this work, we present a concise yet\neffective approach called Patch Generation-to-Selection to enhance CLIP's\ntraining efficiency while preserving critical semantic content. Our method\nintroduces a gradual masking process in which a small set of candidate patches\nis first pre-selected as potential mask regions. Then, we apply Sobel edge\ndetection across the entire image to generate an edge mask that prioritizes the\nretention of the primary object areas. Finally, similarity scores between the\ncandidate mask patches and their neighboring patches are computed, with optimal\ntransport normalization refining the selection process to ensure a balanced\nsimilarity matrix. Our approach, CLIP-PGS, sets new state-of-the-art results in\nzero-shot classification and retrieval tasks, achieving superior performance in\nrobustness evaluation and language compositionality benchmarks.",
"categories": [
"cs.CV"
],
"published": "2025-03-21T12:10:38+00:00",
"url": "http://arxiv.org/pdf/2503.17080v1",
"resource_uri": "arxiv://2503.17080v1",
"citation_count": 0
},
{
"id": "2503.15815v1",
"title": "Attention Pruning: Automated Fairness Repair of Language Models via Surrogate Simulated Annealing",
"authors": [
"Vishnu Asutosh Dasu",
"Md Rafi ur Rashid",
"Vipul Gupta",
"Saeid Tizpaz-Niari",
"Gang Tan"
],
"abstract": "This paper explores pruning attention heads as a post-processing bias\nmitigation method for large language models (LLMs). Modern AI systems such as\nLLMs are expanding into sensitive social contexts where fairness concerns\nbecome especially crucial. Since LLMs develop decision-making patterns by\ntraining on massive datasets of human-generated content, they naturally encode\nand perpetuate societal biases. While modifying training datasets and\nalgorithms is expensive and requires significant resources; post-processing\ntechniques-such as selectively deactivating neurons and attention heads in\npre-trained LLMs-can provide feasible and effective approaches to improve\nfairness. However, identifying the optimal subset of parameters to prune\npresents a combinatorial challenge within LLMs' immense parameter space,\nrequiring solutions that efficiently balance competing objectives across the\nfrontiers of model fairness and utility.\n To address the computational challenges, we explore a search-based program\nrepair approach via randomized simulated annealing. Given the prohibitive\nevaluation costs in billion-parameter LLMs, we develop surrogate deep neural\nnetworks that efficiently model the relationship between attention head states\n(active/inactive) and their corresponding fairness/utility metrics. This allows\nus to perform optimization over the surrogate models and efficiently identify\noptimal subsets of attention heads for selective pruning rather than directly\nsearching through the LLM parameter space. This paper introduces Attention\nPruning, a fairness-aware surrogate simulated annealing approach to prune\nattention heads in LLMs that disproportionately contribute to bias while\nminimally impacting overall model utility. Our experiments show that Attention\nPruning achieves up to $40\\%$ reduction in gender bias and outperforms the\nstate-of-the-art bias mitigation strategies.",
"categories": [
"cs.AI"
],
"published": "2025-03-20T03:02:32+00:00",
"url": "http://arxiv.org/pdf/2503.15815v1",
"resource_uri": "arxiv://2503.15815v1",
"citation_count": 0
},
{
"id": "2503.15050v2",
"title": "Studying and Understanding the Effectiveness and Failures of Conversational LLM-Based Repair",
"authors": [
"Aolin Chen",
"Haojun Wu",
"Qi Xin",
"Steven P. Reiss",
"Jifeng Xuan"
],
"abstract": "Automated program repair (APR) is designed to automate the process of\nbug-fixing. In recent years, thanks to the rapid development of large language\nmodels (LLMs), automated repair has achieved remarkable progress. Advanced APR\ntechniques powered by conversational LLMs, most notably ChatGPT, have exhibited\nimpressive repair abilities and gained increasing popularity due to the\ncapabilities of the underlying LLMs in providing repair feedback and performing\niterative patch improvement. Despite the superiority, conversational APR\ntechniques still fail to repair a large number of bugs. For example, a\nstate-of-the-art conversational technique ChatRepair does not correctly repair\nover half of the single-function bugs in the Defects4J dataset. To understand\nthe effectiveness and failures of conversational LLM-based repair and provide\npossible directions for improvement, we studied the exemplary ChatRepair with a\nfocus on comparing the effectiveness of its cloze-style and full function\nrepair strategies, assessing its key iterative component for patch improvement,\nand analyzing the repair failures. Our study has led to a series of findings,\nwhich we believe provide key implications for future research.",
"categories": [
"cs.SE"
],
"published": "2025-03-19T09:39:32+00:00",
"url": "http://arxiv.org/pdf/2503.15050v2",
"resource_uri": "arxiv://2503.15050v2",
"citation_count": 0
},
{
"id": "2503.14924v1",
"title": "UTFix: Change Aware Unit Test Repairing using LLM",
"authors": [
"Shanto Rahman",
"Sachit Kuhar",
"Berk Cirisci",
"Pranav Garg",
"Shiqi Wang",
"Xiaofei Ma",
"Anoop Deoras",
"Baishakhi Ray"
],
"abstract": "Software updates, including bug repair and feature additions, are frequent in\nmodern applications but they often leave test suites outdated, resulting in\nundetected bugs and increased chances of system failures. A recent study by\nMeta revealed that 14%-22% of software failures stem from outdated tests that\nfail to reflect changes in the codebase. This highlights the need to keep tests\nin sync with code changes to ensure software reliability.\n In this paper, we present UTFix, a novel approach for repairing unit tests\nwhen their corresponding focal methods undergo changes. UTFix addresses two\ncritical issues: assertion failure and reduced code coverage caused by changes\nin the focal method. Our approach leverages language models to repair unit\ntests by providing contextual information such as static code slices, dynamic\ncode slices, and failure messages. We evaluate UTFix on our generated synthetic\nbenchmarks (Tool-Bench), and real-world benchmarks. Tool- Bench includes\ndiverse changes from popular open-source Python GitHub projects, where UTFix\nsuccessfully repaired 89.2% of assertion failures and achieved 100% code\ncoverage for 96 tests out of 369 tests. On the real-world benchmarks, UTFix\nrepairs 60% of assertion failures while achieving 100% code coverage for 19 out\nof 30 unit tests. To the best of our knowledge, this is the first comprehensive\nstudy focused on unit test in evolving Python projects. Our contributions\ninclude the development of UTFix, the creation of Tool-Bench and real-world\nbenchmarks, and the demonstration of the effectiveness of LLM-based methods in\naddressing unit test failures due to software evolution.",
"categories": [
"cs.SE"
],
"published": "2025-03-19T06:10:03+00:00",
"url": "http://arxiv.org/pdf/2503.14924v1",
"resource_uri": "arxiv://2503.14924v1",
"citation_count": 0
},
{
"id": "2503.14724v1",
"title": "CodingGenie: A Proactive LLM-Powered Programming Assistant",
"authors": [
"Sebastian Zhao",
"Alan Zhu",
"Hussein Mozannar",
"David Sontag",
"Ameet Talwalkar",
"Valerie Chen"
],
"abstract": "While developers increasingly adopt tools powered by large language models\n(LLMs) in day-to-day workflows, these tools still require explicit user\ninvocation. To seamlessly integrate LLM capabilities to a developer's workflow,\nwe introduce CodingGenie, a proactive assistant integrated into the code\neditor. CodingGenie autonomously provides suggestions, ranging from bug fixing\nto unit testing, based on the current code context and allows users to\ncustomize suggestions by providing a task description and selecting what\nsuggestions are shown. We demonstrate multiple use cases to show how proactive\nsuggestions from CodingGenie can improve developer experience, and also analyze\nthe cost of adding proactivity. We believe this open-source tool will enable\nfurther research into proactive assistants. CodingGenie is open-sourced at\nhttps://github.com/sebzhao/CodingGenie/ and video demos are available at\nhttps://sebzhao.github.io/CodingGenie/.",
"categories": [
"cs.HC"
],
"published": "2025-03-18T20:54:40+00:00",
"url": "http://arxiv.org/pdf/2503.14724v1",
"resource_uri": "arxiv://2503.14724v1",
"citation_count": 0
},
{
"id": "2503.14023v2",
"title": "Synthetic Data Generation Using Large Language Models: Advances in Text and Code",
"authors": [
"Mihai Nadas",
"Laura Diosan",
"Andreea Tomescu"
],
"abstract": "This survey reviews how large language models (LLMs) are transforming\nsynthetic training data generation in both natural language and code domains.\nBy producing artificial but task-relevant examples, these models can\nsignificantly augment or even substitute for real-world datasets, particularly\nin scenarios where labeled data is scarce, expensive, or sensitive. This paper\nsurveys recent advances in leveraging LLMs to create synthetic text and code,\nhighlighting key techniques such as prompt-based generation,\nretrieval-augmented pipelines, and iterative self-refinement. We examine how\nthese methods can enrich low-resource tasks (e.g., classification, question\nanswering) and facilitate code-centric applications (e.g., instruction tuning,\ncode translation, bug repair) through automated verification of functional\ncorrectness. Alongside potential benefits - cost-effectiveness, broad coverage,\nand controllable diversity - we discuss the accompanying challenges, including\nfactual inaccuracies in generated text, insufficient stylistic or\ndistributional realism, and risks of bias amplification. Proposed mitigation\nstrategies range from filtering and weighting synthetic outputs to\nreinforcement learning with execution feedback in code domains. We conclude by\noutlining open research directions, such as automated prompt engineering,\ncross-modal data synthesis, and robust evaluation frameworks, underscoring the\ngrowing importance of LLM-generated synthetic data in accelerating AI\ndevelopment while emphasizing ethical and quality safeguards.",
"categories": [
"cs.CL"
],
"published": "2025-03-18T08:34:03+00:00",
"url": "http://arxiv.org/pdf/2503.14023v2",
"resource_uri": "arxiv://2503.14023v2",
"citation_count": 0
},
{
"id": "2503.13772v1",
"title": "Do Large Language Models Understand Performance Optimization?",
"authors": [
"Bowen Cui",
"Tejas Ramesh",
"Oscar Hernandez",
"Keren Zhou"
],
"abstract": "Large Language Models (LLMs) have emerged as powerful tools for software\ndevelopment tasks such as code completion, translation, and optimization.\nHowever, their ability to generate efficient and correct code, particularly in\ncomplex High-Performance Computing (HPC) contexts, has remained underexplored.\nTo address this gap, this paper presents a comprehensive benchmark suite\nencompassing multiple critical HPC computational motifs to evaluate the\nperformance of code optimized by state-of-the-art LLMs, including OpenAI o1,\nClaude-3.5, and Llama-3.2. In addition to analyzing basic computational\nkernels, we developed an agent system that integrates LLMs to assess their\neffectiveness in real HPC applications. Our evaluation focused on key criteria\nsuch as execution time, correctness, and understanding of HPC-specific\nconcepts. We also compared the results with those achieved using traditional\nHPC optimization tools. Based on the findings, we recognized the strengths of\nLLMs in understanding human instructions and performing automated code\ntransformations. However, we also identified significant limitations, including\ntheir tendency to generate incorrect code and their challenges in comprehending\ncomplex control and data flows in sophisticated HPC code.",
"categories": [
"cs.DC",
"cs.SE"
],
"published": "2025-03-17T23:30:23+00:00",
"url": "http://arxiv.org/pdf/2503.13772v1",
"resource_uri": "arxiv://2503.13772v1",
"citation_count": 0
},
{
"id": "2503.09217v1",
"title": "Evaluating the Generalizability of LLMs in Automated Program Repair",
"authors": [
"Fengjie Li",
"Jiajun Jiang",
"Jiajun Sun",
"Hongyu Zhang"
],
"abstract": "LLM-based automated program repair methods have attracted significant\nattention for their state-of-the-art performance. However, they were primarily\nevaluated on a few well known datasets like Defects4J, raising questions about\ntheir effectiveness on new datasets. In this study, we evaluate 11\ntop-performing LLMs on DEFECTS4J-TRANS, a new dataset derived from transforming\nDefects4J while maintaining the original semantics. Results from experiments on\nboth Defects4J and DEFECTS4J-TRANS show that all studied LLMs have limited\ngeneralizability in APR tasks, with the average number of correct and plausible\npatches decreasing by 49.48% and 42.90%, respectively, on DEFECTS4J-TRANS.\nFurther investigation into incorporating additional repair-relevant information\nin repair prompts reveals that, although this information significantly\nenhances the LLMs' capabilities (increasing the number of correct and plausible\npatches by up to 136.67% and 121.82%, respectively), performance still falls\nshort of their original results. This indicates that prompt engineering alone\nis insufficient to substantially enhance LLMs' repair capabilities. Based on\nour study, we also offer several recommendations for future research.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2025-03-12T10:03:58+00:00",
"url": "http://arxiv.org/pdf/2503.09217v1",
"resource_uri": "arxiv://2503.09217v1",
"citation_count": 0
},
{
"id": "2503.08532v1",
"title": "Bogus Bugs, Duplicates, and Revealing Comments: Data Quality Issues in NPR",
"authors": [
"Julian Aron Prenner",
"Romain Robbes"
],
"abstract": "The performance of a machine learning system is not only determined by the\nmodel but also, to a substantial degree, by the data it is trained on. With the\nincreasing use of machine learning, issues related to data quality have become\na concern also in automated program repair research. In this position paper, we\nreport some of the data-related issues we have come across when working with\nseveral large APR datasets and benchmarks, including, for instance, duplicates\nor \"bogus bugs\". We briefly discuss the potential impact of these problems on\nrepair performance and propose possible remedies. We believe that more\ndata-focused approaches could improve the performance and robustness of current\nand future APR systems.",
"categories": [
"cs.SE"
],
"published": "2025-03-11T15:23:13+00:00",
"url": "http://arxiv.org/pdf/2503.08532v1",
"resource_uri": "arxiv://2503.08532v1",
"citation_count": 0
},
{
"id": "2504.15284v4",
"title": "EditLord: Learning Code Transformation Rules for Code Editing",
"authors": [
"Weichen Li",
"Albert Jan",
"Baishakhi Ray",
"Junfeng Yang",
"Chengzhi Mao",
"Kexin Pei"
],
"abstract": "Code editing is a foundational task in software development, where its\neffectiveness depends on whether it introduces desired code property changes\nwithout changing the original code's intended functionality. Existing\napproaches often formulate code editing as an implicit end-to-end task,\nomitting the fact that code-editing procedures inherently consist of discrete\nand explicit steps. Thus, they suffer from suboptimal performance and lack of\nrobustness and generalization. We introduce EditLord, a code editing framework\nthat makes the code transformation steps explicit. Our key insight is to employ\na language model (LM) as an inductive learner to extract code editing rules\nfrom the training code pairs as concise meta-rule sets. Such rule sets will be\nmanifested for each training sample to augment them for finetuning or assist in\nprompting- and iterative-based code editing. EditLord outperforms the\nstate-of-the-art by an average of 22.7% in editing performance and 58.1% in\nrobustness while achieving 20.2% higher functional correctness across critical\nsoftware engineering and security applications, LM models, and editing modes.",
"categories": [
"cs.SE",
"cs.CR",
"cs.LG"
],
"published": "2025-03-10T16:33:59+00:00",
"url": "http://arxiv.org/pdf/2504.15284v4",
"resource_uri": "arxiv://2504.15284v4",
"citation_count": 0
},
{
"id": "2503.07058v1",
"title": "Breaking the Limits of Quantization-Aware Defenses: QADT-R for Robustness Against Patch-Based Adversarial Attacks in QNNs",
"authors": [
"Amira Guesmi",
"Bassem Ouni",
"Muhammad Shafique"
],
"abstract": "Quantized Neural Networks (QNNs) have emerged as a promising solution for\nreducing model size and computational costs, making them well-suited for\ndeployment in edge and resource-constrained environments. While quantization is\nknown to disrupt gradient propagation and enhance robustness against\npixel-level adversarial attacks, its effectiveness against patch-based\nadversarial attacks remains largely unexplored. In this work, we demonstrate\nthat adversarial patches remain highly transferable across quantized models,\nachieving over 70\\% attack success rates (ASR) even at extreme bit-width\nreductions (e.g., 2-bit). This challenges the common assumption that\nquantization inherently mitigates adversarial threats. To address this, we\npropose Quantization-Aware Defense Training with Randomization (QADT-R), a\nnovel defense strategy that integrates Adaptive Quantization-Aware Patch\nGeneration (A-QAPA), Dynamic Bit-Width Training (DBWT), and\nGradient-Inconsistent Regularization (GIR) to enhance resilience against highly\ntransferable patch-based attacks. A-QAPA generates adversarial patches within\nquantized models, ensuring robustness across different bit-widths. DBWT\nintroduces bit-width cycling during training to prevent overfitting to a\nspecific quantization setting, while GIR injects controlled gradient\nperturbations to disrupt adversarial optimization. Extensive evaluations on\nCIFAR-10 and ImageNet show that QADT-R reduces ASR by up to 25\\% compared to\nprior defenses such as PBAT and DWQ. Our findings further reveal that\nPBAT-trained models, while effective against seen patch configurations, fail to\ngeneralize to unseen patches due to quantization shift. Additionally, our\nempirical analysis of gradient alignment, spatial sensitivity, and patch\nvisibility provides insights into the mechanisms that contribute to the high\ntransferability of patch-based attacks in QNNs.",
"categories": [
"cs.CV"
],
"published": "2025-03-10T08:43:36+00:00",
"url": "http://arxiv.org/pdf/2503.07058v1",
"resource_uri": "arxiv://2503.07058v1",
"citation_count": 0
},
{
"id": "2503.06680v2",
"title": "FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation",
"authors": [
"Wei Li",
"Xin Zhang",
"Zhongxin Guo",
"Shaoguang Mao",
"Wen Luo",
"Guangyue Peng",
"Yangyu Huang",
"Houfeng Wang",
"Scarlett Li"
],
"abstract": "Implementing new features in repository-level codebases is a crucial\napplication of code generation models. However, current benchmarks lack a\ndedicated evaluation framework for this capability. To fill this gap, we\nintroduce FEA-Bench, a benchmark designed to assess the ability of large\nlanguage models (LLMs) to perform incremental development within code\nrepositories. We collect pull requests from 83 GitHub repositories and use\nrule-based and intent-based filtering to construct task instances focused on\nnew feature development. Each task instance containing code changes is paired\nwith relevant unit test files to ensure that the solution can be verified. The\nfeature implementation requires LLMs to simultaneously possess code completion\ncapabilities for new components and code editing abilities for other relevant\nparts in the code repository, providing a more comprehensive evaluation method\nof LLMs' automated software engineering capabilities. Experimental results show\nthat LLMs perform significantly worse in the FEA-Bench, highlighting\nconsiderable challenges in such repository-level incremental code development.",
"categories": [
"cs.SE",
"cs.CL"
],
"published": "2025-03-09T16:11:57+00:00",
"url": "http://arxiv.org/pdf/2503.06680v2",
"resource_uri": "arxiv://2503.06680v2",
"citation_count": 0
},
{
"id": "2503.05860v1",
"title": "Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Enhancement Protocol",
"authors": [
"Roham Koohestani",
"Philippe de Bekker",
"Maliheh Izadi"
],
"abstract": "Benchmarks are essential for consistent evaluation and reproducibility. The\nintegration of Artificial Intelligence into Software Engineering (AI4SE) has\ngiven rise to numerous benchmarks for tasks such as code generation and bug\nfixing. However, this surge presents challenges: (1) scattered benchmark\nknowledge across tasks, (2) difficulty in selecting relevant benchmarks, (3)\nthe absence of a uniform standard for benchmark development, and (4)\nlimitations of existing benchmarks. In this paper, we review 173 studies and\nidentify 204 AI4SE benchmarks. We classify these benchmarks, analyze their\nlimitations, and expose gaps in practices. Based on our review, we created\nBenchScout, a semantic search tool to find relevant benchmarks, using automated\nclustering of the contexts from associated studies. We conducted a user study\nwith 22 participants to evaluate BenchScout's usability, effectiveness, and\nintuitiveness which resulted in average scores of 4.5, 4.0, and 4.1 out of 5.\nTo advance benchmarking standards, we propose BenchFrame, a unified method to\nenhance benchmark quality. As a case study, we applied BenchFrame to the\nHumanEval benchmark and addressed its main limitations. This led to\nHumanEvalNext, featuring (1) corrected errors, (2) improved language\nconversion, (3) expanded test coverage, and (4) increased difficulty. We then\nevaluated ten state-of-the-art code language models on HumanEval,\nHumanEvalPlus, and HumanEvalNext. On HumanEvalNext, models showed a pass@1\nscore reduction of 31.22% and 19.94% compared to HumanEval and HumanEvalPlus,\nrespectively.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2025-03-07T18:44:32+00:00",
"url": "http://arxiv.org/pdf/2503.05860v1",
"resource_uri": "arxiv://2503.05860v1",
"citation_count": 0
},
{
"id": "2503.04214v1",
"title": "Extracting Fix Ingredients using Language Models",
"authors": [
"Julian Aron Prenner",
"Romain Robbes"
],
"abstract": "Deep learning and language models are increasingly dominating automated\nprogram repair research. While previous generate-and-validate approaches were\nable to find and use fix ingredients on a file or even project level, neural\nlanguage models are limited to the code that fits their input window. In this\nwork we investigate how important identifier ingredients are in neural program\nrepair and present ScanFix, an approach that leverages an additional scanner\nmodel to extract identifiers from a bug's file and potentially project-level\ncontext. We find that lack of knowledge of far-away identifiers is an important\ncause of failed repairs. Augmenting repair model input with scanner-extracted\nidentifiers yields relative improvements of up to 31%. However, ScanFix is\noutperformed by a model with a large input window (> 5k tokens). When passing\ningredients from the ground-truth fix, improvements are even higher. This shows\nthat, with refined extraction techniques, ingredient scanning, similar to fix\ncandidate ranking, could have the potential to become an important subtask of\nfuture automated repair systems. At the same time, it also demonstrates that\nthis idea is subject to Sutton's bitter lesson and may be rendered unnecessary\nby new code models with ever-increasing context windows.",
"categories": [
"cs.SE"
],
"published": "2025-03-06T08:48:52+00:00",
"url": "http://arxiv.org/pdf/2503.04214v1",
"resource_uri": "arxiv://2503.04214v1",
"citation_count": 0
},
{
"id": "2503.04057v1",
"title": "Insights from Rights and Wrongs: A Large Language Model for Solving Assertion Failures in RTL Design",
"authors": [
"Jie Zhou",
"Youshu Ji",
"Ning Wang",
"Yuchen Hu",
"Xinyao Jiao",
"Bingkun Yao",
"Xinwei Fang",
"Shuai Zhao",
"Nan Guan",
"Zhe Jiang"
],
"abstract": "SystemVerilog Assertions (SVAs) are essential for verifying Register Transfer\nLevel (RTL) designs, as they can be embedded into key functional paths to\ndetect unintended behaviours. During simulation, assertion failures occur when\nthe design's behaviour deviates from expectations. Solving these failures,\ni.e., identifying and fixing the issues causing the deviation, requires\nanalysing complex logical and timing relationships between multiple signals.\nThis process heavily relies on human expertise, and there is currently no\nautomatic tool available to assist with it. Here, we present AssertSolver, an\nopen-source Large Language Model (LLM) specifically designed for solving\nassertion failures. By leveraging synthetic training data and learning from\nerror responses to challenging cases, AssertSolver achieves a bug-fixing pass@1\nmetric of 88.54% on our testbench, significantly outperforming OpenAI's\no1-preview by up to 11.97%. We release our model and testbench for public\naccess to encourage further research:\nhttps://github.com/SEU-ACAL/reproduce-AssertSolver-DAC-25.",
"categories": [
"cs.AR"
],
"published": "2025-03-06T03:17:48+00:00",
"url": "http://arxiv.org/pdf/2503.04057v1",
"resource_uri": "arxiv://2503.04057v1",
"citation_count": 0
},
{
"id": "2503.03656v2",
"title": "Robust Learning of Diverse Code Edits",
"authors": [
"Tushar Aggarwal",
"Swayam Singh",
"Abhijeet Awasthi",
"Aditya Kanade",
"Nagarajan Natarajan"
],
"abstract": "Software engineering activities frequently involve edits to existing code.\nHowever, contemporary code language models (LMs) lack the ability to handle\ndiverse types of code-edit requirements. In this work, we attempt to overcome\nthis shortcoming through (1) a novel synthetic data generation pipeline and (2)\na robust model adaptation algorithm. Starting with seed code examples and\ndiverse editing criteria, our pipeline generates high-quality samples\ncomprising original and modified code, along with natural language instructions\nin different styles and verbosity. Today's code LMs come bundled with strong\nabilities, such as code generation and instruction following, which should not\nbe lost due to fine-tuning. To ensure this, we propose a novel adaptation\nalgorithm, SeleKT, that (a) leverages a dense gradient-based step to identify\nthe weights that are most important for code editing, and (b) does a sparse\nprojection onto the base model to avoid overfitting. Using our approach, we\nobtain a new series of models NextCoder (adapted from QwenCoder-2.5) that\nachieves strong results on five code-editing benchmarks, outperforming\ncomparable size models and even several larger ones. We show the generality of\nour approach on two model families (DeepSeekCoder and QwenCoder), compare\nagainst other fine-tuning approaches, and demonstrate robustness by showing\nretention of code generation and general problem-solving abilities post\nadaptation. We opensource the models, synthetic dataset, and implementation at\nhttps://aka.ms/nextcoder.",
"categories": [
"cs.SE",
"cs.LG"
],
"published": "2025-03-05T16:39:04+00:00",
"url": "http://arxiv.org/pdf/2503.03656v2",
"resource_uri": "arxiv://2503.03656v2",
"citation_count": 0
},
{
"id": "2503.01098v1",
"title": "SolBench: A Dataset and Benchmark for Evaluating Functional Correctness in Solidity Code Completion and Repair",
"authors": [
"Zaoyu Chen",
"Haoran Qin",
"Nuo Chen",
"Xiangyu Zhao",
"Lei Xue",
"Xiapu Luo",
"Xiao-Ming Wu"
],
"abstract": "Smart contracts are crucial programs on blockchains, and their immutability\npost-deployment makes functional correctness vital. Despite progress in code\ncompletion models, benchmarks for Solidity, the primary smart contract\nlanguage, are lacking. Existing metrics like BLEU do not adequately assess the\nfunctional correctness of generated smart contracts. To fill this gap, we\nintroduce SolBench, a benchmark for evaluating the functional correctness of\nSolidity smart contracts generated by code completion models. SolBench includes\n4,178 functions from 1,155 Ethereum-deployed contracts. Testing advanced models\nrevealed challenges in generating correct code without context, as Solidity\nfunctions rely on context-defined variables and interfaces. To address this, we\npropose a Retrieval-Augmented Code Repair framework. In this framework, an\nexecutor verifies functional correctness, and if necessary, an LLM repairs the\ncode using retrieved snippets informed by executor traces. We conduct a\ncomprehensive evaluation of both closed-source and open-source LLMs across\nvarious model sizes and series to assess their performance in smart contract\ncompletion. The results show that code repair and retrieval techniques\neffectively enhance the correctness of smart contract completion while reducing\ncomputational costs.",
"categories": [
"cs.SE",
"cs.AI",
"cs.CL"
],
"published": "2025-03-03T01:55:20+00:00",
"url": "http://arxiv.org/pdf/2503.01098v1",
"resource_uri": "arxiv://2503.01098v1",
"citation_count": 0
},
{
"id": "2502.20127v1",
"title": "SoRFT: Issue Resolving with Subtask-oriented Reinforced Fine-Tuning",
"authors": [
"Zexiong Ma",
"Chao Peng",
"Pengfei Gao",
"Xiangxin Meng",
"Yanzhen Zou",
"Bing Xie"
],
"abstract": "Mainstream issue-resolving frameworks predominantly rely on commercial\nmodels, leading to high costs and privacy concerns. Existing training\napproaches for issue resolving struggle with poor generalization and fail to\nfully leverage open-source development resources. We propose Subtask-oriented\nReinforced Fine-Tuning (SoRFT), a novel training approach to enhance the issue\nresolving capability of LLMs. We decomposes issue resolving into structured\nsubtasks: file localization, function localization, line localization, and code\nedit generation. SoRFT consists of two training stages: (1) rejection-sampled\nsupervised fine-tuning, Chain of Thought (CoT) data is filtered using\nground-truth before fine-tuning the LLM, and (2) rule-based reinforcement\nlearning, which leverages PPO with ground-truth based rewards. We evaluate the\nSoRFT-trained model on SWE-Bench Verified and SWE-Bench Lite, achieving\nstate-of-the-art (SOTA) performance among open-source models (e.g., resolve\n21.4% issues on SWE-Bench Verified with SoRFT-Qwen-7B). The experimental\nresults demonstrate that SoRFT significantly enhances issue-resolving\nperformance, improves model generalization, and provides a cost-efficient\nalternative to commercial models.",
"categories": [
"cs.SE",
"cs.AI",
"cs.CL"
],
"published": "2025-02-27T14:19:45+00:00",
"url": "http://arxiv.org/pdf/2502.20127v1",
"resource_uri": "arxiv://2502.20127v1",
"citation_count": 0
},
{
"id": "2502.19960v2",
"title": "SeisMoLLM: Advancing Seismic Monitoring via Cross-modal Transfer with Pre-trained Large Language Model",
"authors": [
"Xinghao Wang",
"Feng Liu",
"Rui Su",
"Zhihui Wang",
"Lihua Fang",
"Lianqing Zhou",
"Lei Bai",
"Wanli Ouyang"
],
"abstract": "Recent advances in deep learning have revolutionized seismic monitoring, yet\ndeveloping a foundation model that performs well across multiple complex tasks\nremains challenging, particularly when dealing with degraded signals or data\nscarcity. This work presents SeisMoLLM, the first foundation model that\nutilizes cross-modal transfer for seismic monitoring, to unleash the power of\nlarge-scale pre-training from a large language model without requiring direct\npre-training on seismic datasets. Through elaborate waveform tokenization and\nfine-tuning of pre-trained GPT-2 model, SeisMoLLM achieves state-of-the-art\nperformance on the DiTing and STEAD datasets across five critical tasks:\nback-azimuth estimation, epicentral distance estimation, magnitude estimation,\nphase picking, and first-motion polarity classification. It attains 36 best\nresults out of 43 task metrics and 12 top scores out of 16 few-shot\ngeneralization metrics, with many relative improvements ranging from 10% to\n50%. In addition to its superior performance, SeisMoLLM maintains efficiency\ncomparable to or even better than lightweight models in both training and\ninference. These findings establish SeisMoLLM as a promising foundation model\nfor practical seismic monitoring and highlight cross-modal transfer as an\nexciting new direction for earthquake studies, showcasing the potential of\nadvanced deep learning techniques to propel seismology research forward.",
"categories": [
"cs.LG"
],
"published": "2025-02-27T10:35:53+00:00",
"url": "http://arxiv.org/pdf/2502.19960v2",
"resource_uri": "arxiv://2502.19960v2",
"citation_count": 0
},
{
"id": "2502.19710v1",
"title": "SAP-DIFF: Semantic Adversarial Patch Generation for Black-Box Face Recognition Models via Diffusion Models",
"authors": [
"Mingsi Wang",
"Shuaiyin Yao",
"Chang Yue",
"Lijie Zhang",
"Guozhu Meng"
],
"abstract": "Given the need to evaluate the robustness of face recognition (FR) models,\nmany efforts have focused on adversarial patch attacks that mislead FR models\nby introducing localized perturbations. Impersonation attacks are a significant\nthreat because adversarial perturbations allow attackers to disguise themselves\nas legitimate users. This can lead to severe consequences, including data\nbreaches, system damage, and misuse of resources. However, research on such\nattacks in FR remains limited. Existing adversarial patch generation methods\nexhibit limited efficacy in impersonation attacks due to (1) the need for high\nattacker capabilities, (2) low attack success rates, and (3) excessive query\nrequirements. To address these challenges, we propose a novel method SAP-DIFF\nthat leverages diffusion models to generate adversarial patches via semantic\nperturbations in the latent space rather than direct pixel manipulation. We\nintroduce an attention disruption mechanism to generate features unrelated to\nthe original face, facilitating the creation of adversarial samples and a\ndirectional loss function to guide perturbations toward the target identity\nfeature space, thereby enhancing attack effectiveness and efficiency. Extensive\nexperiments on popular FR models and datasets demonstrate that our method\noutperforms state-of-the-art approaches, achieving an average attack success\nrate improvement of 45.66% (all exceeding 40%), and a reduction in the number\nof queries by about 40% compared to the SOTA approach",
"categories": [
"cs.CV",
"cs.CR"
],
"published": "2025-02-27T02:57:29+00:00",
"url": "http://arxiv.org/pdf/2502.19710v1",
"resource_uri": "arxiv://2502.19710v1",
"citation_count": 0
},
{
"id": "2502.19407v2",
"title": "Learning Code-Edit Embedding to Model Student Debugging Behavior",
"authors": [
"Hasnain Heickal",
"Andrew Lan"
],
"abstract": "Providing effective feedback for programming assignments in computer science\neducation can be challenging: students solve problems by iteratively submitting\ncode, executing it, and using limited feedback from the compiler or the\nauto-grader to debug. Analyzing student debugging behavior in this process may\nreveal important insights into their knowledge and inform better personalized\nsupport tools. In this work, we propose an encoder-decoder-based model that\nlearns meaningful code-edit embeddings between consecutive student code\nsubmissions, to capture their debugging behavior. Our model leverages\ninformation on whether a student code submission passes each test case to\nfine-tune large language models (LLMs) to learn code editing representations.\nIt enables personalized next-step code suggestions that maintain the student's\ncoding style while improving test case correctness. Our model also enables us\nto analyze student code-editing patterns to uncover common student errors and\ndebugging behaviors, using clustering techniques. Experimental results on a\nreal-world student code submission dataset demonstrate that our model excels at\ncode reconstruction and personalized code suggestion while revealing\ninteresting patterns in student debugging behavior.",
"categories": [
"cs.SE",
"cs.CL"
],
"published": "2025-02-26T18:54:39+00:00",
"url": "http://arxiv.org/pdf/2502.19407v2",
"resource_uri": "arxiv://2502.19407v2",
"citation_count": 0
},
{
"id": "2502.13966v2",
"title": "Where's the Bug? Attention Probing for Scalable Fault Localization",
"authors": [
"Adam Stein",
"Arthur Wayne",
"Aaditya Naik",
"Mayur Naik",
"Eric Wong"
],
"abstract": "Ensuring code correctness remains a challenging problem even as large\nlanguage models (LLMs) become increasingly capable at code-related tasks. While\nLLM-based program repair systems can propose bug fixes using only a user's bug\nreport, their effectiveness is fundamentally limited by their ability to\nperform fault localization (FL), a challenging problem for both humans and\nLLMs. Existing FL approaches rely on executable test cases, require training on\ncostly and often noisy line-level annotations, or demand resource-intensive\nLLMs. In this paper, we present Bug Attention Probe (BAP), a method which\nlearns state-of-the-art fault localization without any direct localization\nlabels, outperforming traditional FL baselines and prompting of large-scale\nLLMs. We evaluate our approach across a variety of code settings, including\nreal-world Java bugs from the standard Defects4J dataset as well as seven other\ndatasets which span a diverse set of bug types and languages. Averaged across\nall eight datasets, BAP improves by 34.6% top-1 accuracy compared to the\nstrongest baseline and 93.4% over zero-shot prompting GPT-4o. BAP is also\nsignificantly more efficient than prompting, outperforming large open-weight\nmodels at a small fraction of the computational cost.",
"categories": [
"cs.SE",
"cs.LG"
],
"published": "2025-02-19T18:59:32+00:00",
"url": "http://arxiv.org/pdf/2502.13966v2",
"resource_uri": "arxiv://2502.13966v2",
"citation_count": 0
},
{
"id": "2502.12115v4",
"title": "SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?",
"authors": [
"Samuel Miserendino",
"Michele Wang",
"Tejal Patwardhan",
"Johannes Heidecke"
],
"abstract": "We introduce SWE-Lancer, a benchmark of over 1,400 freelance software\nengineering tasks from Upwork, valued at \\$1 million USD total in real-world\npayouts. SWE-Lancer encompasses both independent engineering tasks--ranging\nfrom \\$50 bug fixes to \\$32,000 feature implementations--and managerial tasks,\nwhere models choose between technical implementation proposals. Independent\ntasks are graded with end-to-end tests triple-verified by experienced software\nengineers, while managerial decisions are assessed against the choices of the\noriginal hired engineering managers. We evaluate model performance and find\nthat frontier models are still unable to solve the majority of tasks. To\nfacilitate future research, we open-source a unified Docker image and a public\nevaluation split, SWE-Lancer Diamond\n(https://github.com/openai/SWELancer-Benchmark). By mapping model performance\nto monetary value, we hope SWE-Lancer enables greater research into the\neconomic impact of AI model development.",
"categories": [
"cs.LG",
"cs.SE"
],
"published": "2025-02-17T18:41:16+00:00",
"url": "http://arxiv.org/pdf/2502.12115v4",
"resource_uri": "arxiv://2502.12115v4",
"citation_count": 0
},
{
"id": "2502.11844v3",
"title": "BaxBench: Can LLMs Generate Correct and Secure Backends?",
"authors": [
"Mark Vero",
"Niels Mündler",
"Victor Chibotaru",
"Veselin Raychev",
"Maximilian Baader",
"Nikola Jovanović",
"Jingxuan He",
"Martin Vechev"
],
"abstract": "Automatic program generation has long been a fundamental challenge in\ncomputer science. Recent benchmarks have shown that large language models\n(LLMs) can effectively generate code at the function level, make code edits,\nand solve algorithmic coding tasks. However, to achieve full automation, LLMs\nshould be able to generate production-quality, self-contained application\nmodules. To evaluate the capabilities of LLMs in solving this challenge, we\nintroduce BaxBench, a novel evaluation benchmark consisting of 392 tasks for\nthe generation of backend applications. We focus on backends for three critical\nreasons: (i) they are practically relevant, building the core components of\nmost modern web and cloud software, (ii) they are difficult to get right,\nrequiring multiple functions and files to achieve the desired functionality,\nand (iii) they are security-critical, as they are exposed to untrusted\nthird-parties, making secure solutions that prevent deployment-time attacks an\nimperative. BaxBench validates the functionality of the generated applications\nwith comprehensive test cases, and assesses their security exposure by\nexecuting end-to-end exploits. Our experiments reveal key limitations of\ncurrent LLMs in both functionality and security: (i) even the best model,\nOpenAI o1, achieves a mere 62% on code correctness; (ii) on average, we could\nsuccessfully execute security exploits on around half of the correct programs\ngenerated by each LLM; and (iii) in less popular backend frameworks, models\nfurther struggle to generate correct and secure applications. Progress on\nBaxBench signifies important steps towards autonomous and secure software\ndevelopment with LLMs.",
"categories": [
"cs.CR",
"cs.AI",
"cs.LG",
"cs.PL"
],
"published": "2025-02-17T14:37:47+00:00",
"url": "http://arxiv.org/pdf/2502.11844v3",
"resource_uri": "arxiv://2502.11844v3",
"citation_count": 0
},
{
"id": "2502.10953v1",
"title": "Empirical evaluation of LLMs in predicting fixes of Configuration bugs in Smart Home System",
"authors": [
"Sheikh Moonwara Anjum Monisha",
"Atul Bharadwaj"
],
"abstract": "This empirical study evaluates the effectiveness of Large Language Models\n(LLMs) in predicting fixes for configuration bugs in smart home systems. The\nresearch analyzes three prominent LLMs - GPT-4, GPT-4o (GPT-4 Turbo), and\nClaude 3.5 Sonnet - using four distinct prompt designs to assess their ability\nto identify appropriate fix strategies and generate correct solutions. The\nstudy utilized a dataset of 129 debugging issues from the Home Assistant\nCommunity, focusing on 21 randomly selected cases for in-depth analysis.\nResults demonstrate that GPT-4 and Claude 3.5 Sonnet achieved 80\\% accuracy in\nstrategy prediction when provided with both bug descriptions and original\nscripts. GPT-4 exhibited consistent performance across different prompt types,\nwhile GPT-4o showed advantages in speed and cost-effectiveness despite slightly\nlower accuracy. The findings reveal that prompt design significantly impacts\nmodel performance, with comprehensive prompts containing both description and\noriginal script yielding the best results. This research provides valuable\ninsights for improving automated bug fixing in smart home system configurations\nand demonstrates the potential of LLMs in addressing configuration-related\nchallenges.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2025-02-16T02:11:36+00:00",
"url": "http://arxiv.org/pdf/2502.10953v1",
"resource_uri": "arxiv://2502.10953v1",
"citation_count": 0
},
{
"id": "2502.09771v1",
"title": "Knowledge-Enhanced Program Repair for Data Science Code",
"authors": [
"Shuyin Ouyang",
"Jie M. Zhang",
"Zeyu Sun",
"Albert Merono Penuela"
],
"abstract": "This paper introduces DSrepair, a knowledge-enhanced program repair method\ndesigned to repair the buggy code generated by LLMs in the data science domain.\nDSrepair uses knowledge graph based RAG for API knowledge retrieval as well as\nbug knowledge enrichment to construct repair prompts for LLMs. Specifically, to\nenable knowledge graph based API retrieval, we construct DS-KG (Data Science\nKnowledge Graph) for widely used data science libraries. For bug knowledge\nenrichment, we employ an abstract syntax tree (AST) to localize errors at the\nAST node level. DSrepair's effectiveness is evaluated against five\nstate-of-the-art LLM-based repair baselines using four advanced LLMs on the\nDS-1000 dataset. The results show that DSrepair surpasses all five baselines.\nSpecifically, when compared to the second-best baseline, DSrepair demonstrates\nsignificant improvements, fixing 44.4%, 14.2%, 20.6%, and 32.1% more buggy code\nsnippets for each of the four evaluated LLMs, respectively. Additionally, it\nachieves greater efficiency, reducing the number of tokens required per code\ntask by 17.49%, 34.24%, 24.71%, and 17.59%, respectively.",
"categories": [
"cs.SE"
],
"published": "2025-02-13T21:00:21+00:00",
"url": "http://arxiv.org/pdf/2502.09771v1",
"resource_uri": "arxiv://2502.09771v1",
"citation_count": 0
},
{
"id": "2502.09065v2",
"title": "Lowering the Error Floor of Error Correction Code Transformer",
"authors": [
"Taewoo Park",
"Seong-Joon Park",
"Hee-Youl Kwak",
"Sang-Hyo Kim",
"Yongjune Kim"
],
"abstract": "With the success of transformer architectures across diverse applications,\nthe error correction code transformer (ECCT) has gained significant attention\nfor its superior decoding performance. In spite of its advantages, the error\nfloor problem in ECCT decoding remains unexplored. We present the first\ninvestigation into this issue, revealing that ECCT encounters error floors,\nlimiting its effectiveness in practical settings. To address this error floor\nproblem, we adopt a hybrid decoding framework that integrates ECCT with\nconventional hard decision decoders. Unlike prior hybrid decoding schemes, our\nkey contribution lies in proposing a novel loss function that explicitly takes\ninto account the interaction between ECCT and hard decision decoders during\ntraining. The proposed loss function guides ECCT to focus on residual errors\nthat are not corrected by the hard decision stages, effectively lowering the\nerror floor. Simulation results confirm that the hybrid decoder trained with\nthe proposed loss function achieves substantial performance gains over standard\nECCT in both the waterfall and the error floor regions.",
"categories": [
"cs.IT",
"math.IT"
],
"published": "2025-02-13T08:26:57+00:00",
"url": "http://arxiv.org/pdf/2502.09065v2",
"resource_uri": "arxiv://2502.09065v2",
"citation_count": 0
},
{
"id": "2502.18487v1",
"title": "AuPair: Golden Example Pairs for Code Repair",
"authors": [
"Aditi Mavalankar",
"Hassan Mansoor",
"Zita Marinho",
"Masha Samsikova",
"Tom Schaul"
],
"abstract": "Scaling up inference-time compute has proven to be a valuable strategy in\nimproving the performance of Large Language Models (LLMs) without fine-tuning.\nAn important task that can benefit from additional inference-time compute is\nself-repair; given an initial flawed response, or guess, the LLM corrects its\nown mistake and produces an improved response, or fix. We leverage the\nin-context learning ability of LLMs to perform self-repair in the coding\ndomain. The key contribution of our paper is an approach that synthesises and\nselects an ordered set of golden example pairs, or AuPairs, of these initial\nguesses and subsequent fixes for the corresponding problems. Each such AuPair\nis provided as a single in-context example at inference time to generate a\nrepaired solution. For an inference-time compute budget of $N$ LLM calls per\nproblem, $N$ AuPairs are used to generate $N$ repaired solutions, out of which\nthe highest-scoring solution is selected as the final answer. The underlying\nintuition is that if the LLM is given a different example of fixing an\nincorrect guess each time, it can subsequently generate a diverse set of\nrepaired solutions. Our algorithm selects these AuPairs in a manner that\nmaximises complementarity and usefulness. We demonstrate the results of our\nalgorithm on 5 LLMs across 7 competitive programming datasets for the code\nrepair task. Our algorithm yields a significant boost in performance compared\nto best-of-$N$ and self-repair, and also exhibits strong generalisation across\ndatasets and models. Moreover, our approach shows significantly stronger\nscaling with inference-time compute budget compared to baselines.",
"categories": [
"cs.SE",
"cs.AI",
"cs.CL",
"cs.LG"
],
"published": "2025-02-12T11:07:04+00:00",
"url": "http://arxiv.org/pdf/2502.18487v1",
"resource_uri": "arxiv://2502.18487v1",
"citation_count": 0
},
{
"id": "2502.08260v1",
"title": "FixDrive: Automatically Repairing Autonomous Vehicle Driving Behaviour for $0.08 per Violation",
"authors": [
"Yang Sun",
"Christopher M. Poskitt",
"Kun Wang",
"Jun Sun"
],
"abstract": "Autonomous Vehicles (AVs) are advancing rapidly, with Level-4 AVs already\noperating in real-world conditions. Current AVs, however, still lag behind\nhuman drivers in adaptability and performance, often exhibiting overly\nconservative behaviours and occasionally violating traffic laws. Existing\nsolutions, such as runtime enforcement, mitigate this by automatically\nrepairing the AV's planned trajectory at runtime, but such approaches lack\ntransparency and should be a measure of last resort. It would be preferable for\nAV repairs to generalise beyond specific incidents and to be interpretable for\nusers. In this work, we propose FixDrive, a framework that analyses driving\nrecords from near-misses or law violations to generate AV driving strategy\nrepairs that reduce the chance of such incidents occurring again. These repairs\nare captured in {\\mu}Drive, a high-level domain-specific language for\nspecifying driving behaviours in response to event-based triggers. Implemented\nfor the state-of-the-art autonomous driving system Apollo, FixDrive identifies\nand visualises critical moments from driving records, then uses a Multimodal\nLarge Language Model (MLLM) with zero-shot learning to generate {\\mu}Drive\nprograms. We tested FixDrive on various benchmark scenarios, and found that the\ngenerated repairs improved the AV's performance with respect to following\ntraffic laws, avoiding collisions, and successfully reaching destinations.\nFurthermore, the direct costs of repairing an AV -- 15 minutes of offline\nanalysis and $0.08 per violation -- are reasonable in practice.",
"categories": [
"cs.SE"
],
"published": "2025-02-12T10:07:56+00:00",
"url": "http://arxiv.org/pdf/2502.08260v1",
"resource_uri": "arxiv://2502.08260v1",
"citation_count": 0
},
{
"id": "2502.07067v1",
"title": "Repository-level Code Search with Neural Retrieval Methods",
"authors": [
"Siddharth Gandhi",
"Luyu Gao",
"Jamie Callan"
],
"abstract": "This paper presents a multi-stage reranking system for repository-level code\nsearch, which leverages the vastly available commit histories of large\nopen-source repositories to aid in bug fixing. We define the task of\nrepository-level code search as retrieving the set of files from the current\nstate of a code repository that are most relevant to addressing a user's\nquestion or bug. The proposed approach combines BM25-based retrieval over\ncommit messages with neural reranking using CodeBERT to identify the most\npertinent files. By learning patterns from diverse repositories and their\ncommit histories, the system can surface relevant files for the task at hand.\nThe system leverages both commit messages and source code for relevance\nmatching, and is evaluated in both normal and oracle settings. Experiments on a\nnew dataset created from 7 popular open-source repositories demonstrate\nsubstantial improvements of up to 80% in MAP, MRR and P@1 over the BM25\nbaseline, across a diverse set of queries, demonstrating the effectiveness this\napproach. We hope this work aids LLM agents as a tool for better code search\nand understanding. Our code and results obtained are publicly available.",
"categories": [
"cs.IR"
],
"published": "2025-02-10T21:59:01+00:00",
"url": "http://arxiv.org/pdf/2502.07067v1",
"resource_uri": "arxiv://2502.07067v1",
"citation_count": 0
},
{
"id": "2502.06215v1",
"title": "LessLeak-Bench: A First Investigation of Data Leakage in LLMs Across 83 Software Engineering Benchmarks",
"authors": [
"Xin Zhou",
"Martin Weyssow",
"Ratnadira Widyasari",
"Ting Zhang",
"Junda He",
"Yunbo Lyu",
"Jianming Chang",
"Beiqi Zhang",
"Dan Huang",
"David Lo"
],
"abstract": "Large Language Models (LLMs) are widely utilized in software engineering (SE)\ntasks, such as code generation and automated program repair. However, their\nreliance on extensive and often undisclosed pre-training datasets raises\nsignificant concerns about data leakage, where the evaluation benchmark data is\nunintentionally ``seen'' by LLMs during the model's construction phase. The\ndata leakage issue could largely undermine the validity of LLM-based research\nand evaluations. Despite the increasing use of LLMs in the SE community, there\nis no comprehensive study that assesses the extent of data leakage in SE\nbenchmarks for LLMs yet. To address this gap, this paper presents the first\nlarge-scale analysis of data leakage in 83 SE benchmarks concerning LLMs. Our\nresults show that in general, data leakage in SE benchmarks is minimal, with\naverage leakage ratios of only 4.8\\%, 2.8\\%, and 0.7\\% for Python, Java, and\nC/C++ benchmarks, respectively. However, some benchmarks exhibit relatively\nhigher leakage ratios, which raises concerns about their bias in evaluation.\nFor instance, QuixBugs and BigCloneBench have leakage ratios of 100.0\\% and\n55.7\\%, respectively. Furthermore, we observe that data leakage has a\nsubstantial impact on LLM evaluation. We also identify key causes of high data\nleakage, such as the direct inclusion of benchmark data in pre-training\ndatasets and the use of coding platforms like LeetCode for benchmark\nconstruction. To address the data leakage, we introduce\n\\textbf{LessLeak-Bench}, a new benchmark that removes leaked samples from the\n83 SE benchmarks, enabling more reliable LLM evaluations in future research.\nOur study enhances the understanding of data leakage in SE benchmarks and\nprovides valuable insights for future research involving LLMs in SE.",
"categories": [
"cs.SE",
"cs.AI",
"cs.CL"
],
"published": "2025-02-10T07:33:49+00:00",
"url": "http://arxiv.org/pdf/2502.06215v1",
"resource_uri": "arxiv://2502.06215v1",
"citation_count": 0
},
{
"id": "2502.04947v1",
"title": "Enriching continuous Lagrange finite element approximation spaces using neural networks",
"authors": [
"Hélène Barucq",
"Michel Duprez",
"Florian Faucher",
"Emmanuel Franck",
"Frédérique Lecourtier",
"Vanessa Lleras",
"Victor Michel-Dansac",
"Nicolas Victorion"
],
"abstract": "In this work, we present a preliminary study combining two approaches in the\ncontext of solving PDEs: the classical finite element method (FEM) and more\nrecent techniques based on neural networks. Indeed, in recent years,\nphysics-informed neural networks (PINNs) have become particularly interesting\nfor rapidly solving such problems, especially in high dimensions. However,\ntheir lack of accuracy is a significant drawback in this context, hence the\ninterest in combining them with FEM, for which error estimators are already\nknown. The complete pipeline proposed here, therefore, consists of modifying\nclassical FEM approximation spaces by taking information from a prior, chosen\nhere as the prediction of a neural network. On the one hand, this combination\nimproves and certifies the prediction of neural networks to obtain a fast and\naccurate solution. On the other hand, error estimates are proven, showing that\nsuch strategies outperform classical ones by a factor that depends only on the\nquality of the prior. We validate our approach with numerical results obtained\nfor this preliminary work on parametric problems with one- and two-dimensional\ngeometries. They demonstrate that to achieve a fixed error target, a coarser\nmesh can be used with our enhanced FEM compared to the standard one, leading to\nreduced computation time, particularly for parametric problems.",
"categories": [
"math.NA",
"cs.NA",
"35A35, 65N30, 68T01"
],
"published": "2025-02-07T14:12:35+00:00",
"url": "http://arxiv.org/pdf/2502.04947v1",
"resource_uri": "arxiv://2502.04947v1",
"citation_count": 0
},
{
"id": "2502.03930v3",
"title": "DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation",
"authors": [
"Dongya Jia",
"Zhuo Chen",
"Jiawei Chen",
"Chenpeng Du",
"Jian Wu",
"Jian Cong",
"Xiaobin Zhuang",
"Chumin Li",
"Zhen Wei",
"Yuping Wang",
"Yuxuan Wang"
],
"abstract": "Several recent studies have attempted to autoregressively generate continuous\nspeech representations without discrete speech tokens by combining diffusion\nand autoregressive models, yet they often face challenges with excessive\ncomputational loads or suboptimal outcomes. In this work, we propose Diffusion\nTransformer Autoregressive Modeling (DiTAR), a patch-based autoregressive\nframework combining a language model with a diffusion transformer. This\napproach significantly enhances the efficacy of autoregressive models for\ncontinuous tokens and reduces computational demands. DiTAR utilizes a\ndivide-and-conquer strategy for patch generation, where the language model\nprocesses aggregated patch embeddings and the diffusion transformer\nsubsequently generates the next patch based on the output of the language\nmodel. For inference, we propose defining temperature as the time point of\nintroducing noise during the reverse diffusion ODE to balance diversity and\ndeterminism. We also show in the extensive scaling analysis that DiTAR has\nsuperb scalability. In zero-shot speech generation, DiTAR achieves\nstate-of-the-art performance in robustness, speaker similarity, and\nnaturalness.",
"categories": [
"eess.AS",
"cs.AI",
"cs.CL",
"cs.LG",
"cs.SD"
],
"published": "2025-02-06T10:09:49+00:00",
"url": "http://arxiv.org/pdf/2502.03930v3",
"resource_uri": "arxiv://2502.03930v3",
"citation_count": 0
},
{
"id": "2502.03719v2",
"title": "Code Shaping: Iterative Code Editing with Free-form AI-Interpreted Sketching",
"authors": [
"Ryan Yen",
"Jian Zhao",
"Daniel Vogel"
],
"abstract": "We introduce the concept of code shaping, an interaction paradigm for editing\ncode using free-form sketch annotations directly on top of the code and console\noutput. To evaluate this concept, we conducted a three-stage design study with\n18 different programmers to investigate how sketches can communicate intended\ncode edits to an AI model for interpretation and execution. The results show\nhow different sketches are used, the strategies programmers employ during\niterative interactions with AI interpretations, and interaction design\nprinciples that support the reconciliation between the code editor and\nsketches. Finally, we demonstrate the practical application of the code shaping\nconcept with two use case scenarios, illustrating design implications from the\nstudy.",
"categories": [
"cs.HC"
],
"published": "2025-02-06T02:16:08+00:00",
"url": "http://arxiv.org/pdf/2502.03719v2",
"resource_uri": "arxiv://2502.03719v2",
"citation_count": 0
},
{
"id": "2502.03617v1",
"title": "Resource-Efficient & Effective Code Summarization",
"authors": [
"Saima Afrin",
"Joseph Call",
"Khai-Nguyen Nguyen",
"Oscar Chaparro",
"Antonio Mastropaolo"
],
"abstract": "Code Language Models (CLMs) have demonstrated high effectiveness in\nautomating software engineering tasks such as bug fixing, code generation, and\ncode documentation. This progress has been driven by the scaling of large\nmodels, ranging from millions to trillions of parameters (e.g., GPT-4).\nHowever, as models grow in scale, sustainability concerns emerge, as they are\nextremely resource-intensive, highlighting the need for efficient,\nenvironmentally conscious solutions. GreenAI techniques, such as QLoRA\n(Quantized Low-Rank Adaptation), offer a promising path for dealing with large\nmodels' sustainability as they enable resource-efficient model fine-tuning.\nPrevious research has shown the effectiveness of QLoRA in code-related tasks,\nparticularly those involving natural language inputs and code as the target\noutput (NL-to-Code), such as code generation. However, no studies have explored\nits application to tasks that are fundamentally similar to NL-to-Code (natural\nlanguage to code) but operate in the opposite direction, such as code\nsummarization. This leaves a gap in understanding how well QLoRA can generalize\nto Code-to-NL tasks, which are equally important for supporting developers in\nunderstanding and maintaining code. To address this gap, we investigate the\nextent to which QLoRA's capabilities in NL-to-Code tasks can be leveraged and\ntransferred to code summarization, one representative Code-to-NL task. Our\nstudy evaluates two state-of-the-art CLMs (CodeLlama and DeepSeek-Coder) across\ntwo programming languages: Python and Java. Our research tasked models with\ngenerating descriptions for Python and Java code methods. The results align\nwith prior findings on QLoRA for source code generation, showing that QLoRA\nenables efficient fine-tuning of CLMs for code summarization.",
"categories": [
"cs.SE"
],
"published": "2025-02-05T21:06:30+00:00",
"url": "http://arxiv.org/pdf/2502.03617v1",
"resource_uri": "arxiv://2502.03617v1",
"citation_count": 0
},
{
"id": "2502.03560v1",
"title": "Simulating Errors in Touchscreen Typing",
"authors": [
"Danqing Shi",
"Yujun Zhu",
"Francisco Erivaldo Fernandes Junior",
"Shumin Zhai",
"Antti Oulasvirta"
],
"abstract": "Empirical evidence shows that typing on touchscreen devices is prone to\nerrors and that correcting them poses a major detriment to users' performance.\nDesign of text entry systems that better serve users, across their broad\ncapability range, necessitates understanding the cognitive mechanisms that\nunderpin these errors. However, prior models of typing cover only motor slips.\nThe paper reports on extending the scope of computational modeling of typing to\ncover the cognitive mechanisms behind the three main types of error: slips\n(inaccurate execution), lapses (forgetting), and mistakes (incorrect\nknowledge). Given a phrase, a keyboard, and user parameters, Typoist simulates\neye and finger movements while making human-like insertion, omission,\nsubstitution, and transposition errors. Its main technical contribution is the\nformulation of a supervisory control problem wherein the controller allocates\ncognitive resources to detect and fix errors generated by the various\nmechanisms. The model generates predictions of typing performance that can\ninform design, for better text entry systems.",
"categories": [
"cs.HC"
],
"published": "2025-02-05T19:17:33+00:00",
"url": "http://arxiv.org/pdf/2502.03560v1",
"resource_uri": "arxiv://2502.03560v1",
"citation_count": 0
},
{
"id": "2502.03492v1",
"title": "Teaching Language Models to Critique via Reinforcement Learning",
"authors": [
"Zhihui Xie",
"Jie chen",
"Liyu Chen",
"Weichao Mao",
"Jingjing Xu",
"Lingpeng Kong"
],
"abstract": "Teaching large language models (LLMs) to critique and refine their outputs is\ncrucial for building systems that can iteratively improve, yet it is\nfundamentally limited by the ability to provide accurate judgments and\nactionable suggestions. In this work, we study LLM critics for code generation\nand propose $\\texttt{CTRL}$, a framework for $\\texttt{C}$ritic\n$\\texttt{T}$raining via $\\texttt{R}$einforcement $\\texttt{L}$earning, which\ntrains a critic model to generate feedback that maximizes correction\nperformance for a fixed generator model without human supervision. Our results\ndemonstrate that critics trained with $\\texttt{CTRL}$ significantly enhance\npass rates and mitigate compounding errors across both base and stronger\ngenerator models. Furthermore, we show that these critic models act as accurate\ngenerative reward models and enable test-time scaling through iterative\ncritique-revision, achieving up to 106.1% relative improvements across\nchallenging code generation benchmarks.",
"categories": [
"cs.LG",
"cs.AI",
"cs.CL"
],
"published": "2025-02-05T02:18:46+00:00",
"url": "http://arxiv.org/pdf/2502.03492v1",
"resource_uri": "arxiv://2502.03492v1",
"citation_count": 0
},
{
"id": "2502.02009v1",
"title": "LLMSecConfig: An LLM-Based Approach for Fixing Software Container Misconfigurations",
"authors": [
"Ziyang Ye",
"Triet Huynh Minh Le",
"M. Ali Babar"
],
"abstract": "Security misconfigurations in Container Orchestrators (COs) can pose serious\nthreats to software systems. While Static Analysis Tools (SATs) can effectively\ndetect these security vulnerabilities, the industry currently lacks automated\nsolutions capable of fixing these misconfigurations. The emergence of Large\nLanguage Models (LLMs), with their proven capabilities in code understanding\nand generation, presents an opportunity to address this limitation. This study\nintroduces LLMSecConfig, an innovative framework that bridges this gap by\ncombining SATs with LLMs. Our approach leverages advanced prompting techniques\nand Retrieval-Augmented Generation (RAG) to automatically repair security\nmisconfigurations while preserving operational functionality. Evaluation of\n1,000 real-world Kubernetes configurations achieved a 94\\% success rate while\nmaintaining a low rate of introducing new misconfigurations.\n Our work makes a promising step towards automated container security\nmanagement, reducing the manual effort required for configuration maintenance.",
"categories": [
"cs.SE",
"cs.AI",
"cs.CR",
"cs.LG"
],
"published": "2025-02-04T04:56:34+00:00",
"url": "http://arxiv.org/pdf/2502.02009v1",
"resource_uri": "arxiv://2502.02009v1",
"citation_count": 0
},
{
"id": "2502.01821v2",
"title": "Agentic Bug Reproduction for Effective Automated Program Repair at Google",
"authors": [
"Runxiang Cheng",
"Michele Tufano",
"Jürgen Cito",
"José Cambronero",
"Pat Rondon",
"Renyao Wei",
"Aaron Sun",
"Satish Chandra"
],
"abstract": "Bug reports often lack sufficient detail for developers to reproduce and fix\nthe underlying defects. Bug Reproduction Tests (BRTs), tests that fail when the\nbug is present and pass when it has been resolved, are crucial for debugging,\nbut they are rarely included in bug reports, both in open-source and in\nindustrial settings. Thus, automatically generating BRTs from bug reports has\nthe potential to accelerate the debugging process and lower time to repair.\nThis paper investigates automated BRT generation within an industry setting,\nspecifically at Google, focusing on the challenges of a large-scale,\nproprietary codebase and considering real-world industry bugs extracted from\nGoogle's internal issue tracker. We adapt and evaluate a state-of-the-art BRT\ngeneration technique, LIBRO, and present our agent-based approach, BRT Agent,\nwhich makes use of a fine-tuned Large Language Model (LLM) for code editing.\nOur BRT Agent significantly outperforms LIBRO, achieving a 28% plausible BRT\ngeneration rate, compared to 10% by LIBRO, on 80 human-reported bugs from\nGoogle's internal issue tracker. We further investigate the practical value of\ngenerated BRTs by integrating them with an Automated Program Repair (APR)\nsystem at Google. Our results show that providing BRTs to the APR system\nresults in 30% more bugs with plausible fixes. Additionally, we introduce\nEnsemble Pass Rate (EPR), a metric which leverages the generated BRTs to select\nthe most promising fixes from all fixes generated by APR system. Our evaluation\non EPR for Top-K and threshold-based fix selections demonstrates promising\nresults and trade-offs. For example, EPR correctly selects a plausible fix from\na pool of 20 candidates in 70% of cases, based on its top-1 ranking.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2025-02-03T20:57:17+00:00",
"url": "http://arxiv.org/pdf/2502.01821v2",
"resource_uri": "arxiv://2502.01821v2",
"citation_count": 0
},
{
"id": "2502.00998v2",
"title": "Generating logical magic states with the aid of non-Abelian topological order",
"authors": [
"Sheng-Jie Huang",
"Yanzhu Chen"
],
"abstract": "In fault-tolerant quantum computing with the surface code, non-Clifford gates\nare crucial for universal computation. However, implementing these gates using\nmethods like magic state distillation and code switching requires significant\nresources. In this work, we propose a new protocol that combines magic state\npreparation and code transformation to realize logical non-Clifford operations\nwith the potential for fault tolerance. Our approach begins with a special\nlogical state in the $\\mathbb{Z}_4$ surface code. By applying a sequence of\ntransformations, the system goes through different topological codes, including\nthe non-Abelian $D_4$ quantum double model. This process ultimately produces a\nmagic state encoded in the $\\mathbb{Z}_{2}$ surface code. A logical $T$ gate\ncan be implemented in the standard $\\mathbb{Z}_2$ surface code by gate\nteleportation. In our analysis, we employ a framework where the topological\ncodes are represented by their topological orders and all the transformations\nare considered as topological manipulations such as gauging symmetries and\ncondensing anyons. This perspective is particularly useful for understanding\ntransformations between topological codes.",
"categories": [
"quant-ph",
"cond-mat.str-el",
"hep-th"
],
"published": "2025-02-03T02:38:32+00:00",
"url": "http://arxiv.org/pdf/2502.00998v2",
"resource_uri": "arxiv://2502.00998v2",
"citation_count": 0
},
{
"id": "2502.00350v1",
"title": "OrcaLoca: An LLM Agent Framework for Software Issue Localization",
"authors": [
"Zhongming Yu",
"Hejia Zhang",
"Yujie Zhao",
"Hanxian Huang",
"Matrix Yao",
"Ke Ding",
"Jishen Zhao"
],
"abstract": "Recent developments in Large Language Model (LLM) agents are revolutionizing\nAutonomous Software Engineering (ASE), enabling automated coding, problem\nfixes, and feature improvements. However, localization -- precisely identifying\nsoftware problems by navigating to relevant code sections -- remains a\nsignificant challenge. Current approaches often yield suboptimal results due to\na lack of effective integration between LLM agents and precise code search\nmechanisms. This paper introduces OrcaLoca, an LLM agent framework that\nimproves accuracy for software issue localization by integrating priority-based\nscheduling for LLM-guided action, action decomposition with relevance scoring,\nand distance-aware context pruning. Experimental results demonstrate that\nOrcaLoca becomes the new open-source state-of-the-art (SOTA) in function match\nrate (65.33%) on SWE-bench Lite. It also improves the final resolved rate of an\nopen-source framework by 6.33 percentage points through its patch generation\nintegration.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2025-02-01T07:15:03+00:00",
"url": "http://arxiv.org/pdf/2502.00350v1",
"resource_uri": "arxiv://2502.00350v1",
"citation_count": 0
},
{
"id": "2501.18950v3",
"title": "Fantastic Targets for Concept Erasure in Diffusion Models and Where To Find Them",
"authors": [
"Anh Bui",
"Trang Vu",
"Long Vuong",
"Trung Le",
"Paul Montague",
"Tamas Abraham",
"Junae Kim",
"Dinh Phung"
],
"abstract": "Concept erasure has emerged as a promising technique for mitigating the risk\nof harmful content generation in diffusion models by selectively unlearning\nundesirable concepts. The common principle of previous works to remove a\nspecific concept is to map it to a fixed generic concept, such as a neutral\nconcept or just an empty text prompt. In this paper, we demonstrate that this\nfixed-target strategy is suboptimal, as it fails to account for the impact of\nerasing one concept on the others. To address this limitation, we model the\nconcept space as a graph and empirically analyze the effects of erasing one\nconcept on the remaining concepts. Our analysis uncovers intriguing geometric\nproperties of the concept space, where the influence of erasing a concept is\nconfined to a local region. Building on this insight, we propose the Adaptive\nGuided Erasure (AGE) method, which \\emph{dynamically} selects optimal target\nconcepts tailored to each undesirable concept, minimizing unintended side\neffects. Experimental results show that AGE significantly outperforms\nstate-of-the-art erasure methods on preserving unrelated concepts while\nmaintaining effective erasure performance. Our code is published at\n{https://github.com/tuananhbui89/Adaptive-Guided-Erasure}.",
"categories": [
"cs.LG",
"cs.AI",
"cs.CV"
],
"published": "2025-01-31T08:17:23+00:00",
"url": "http://arxiv.org/pdf/2501.18950v3",
"resource_uri": "arxiv://2501.18950v3",
"citation_count": 0
},
{
"id": "2501.18438v2",
"title": "o3-mini vs DeepSeek-R1: Which One is Safer?",
"authors": [
"Aitor Arrieta",
"Miriam Ugarte",
"Pablo Valle",
"José Antonio Parejo",
"Sergio Segura"
],
"abstract": "The irruption of DeepSeek-R1 constitutes a turning point for the AI industry\nin general and the LLMs in particular. Its capabilities have demonstrated\noutstanding performance in several tasks, including creative thinking, code\ngeneration, maths and automated program repair, at apparently lower execution\ncost. However, LLMs must adhere to an important qualitative property, i.e.,\ntheir alignment with safety and human values. A clear competitor of DeepSeek-R1\nis its American counterpart, OpenAI's o3-mini model, which is expected to set\nhigh standards in terms of performance, safety and cost. In this technical\nreport, we systematically assess the safety level of both DeepSeek-R1 (70b\nversion) and OpenAI's o3-mini (beta version). To this end, we make use of our\nrecently released automated safety testing tool, named ASTRAL. By leveraging\nthis tool, we automatically and systematically generated and executed 1,260\ntest inputs on both models. After conducting a semi-automated assessment of the\noutcomes provided by both LLMs, the results indicate that DeepSeek-R1 produces\nsignificantly more unsafe responses (12%) than OpenAI's o3-mini (1.2%).",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2025-01-30T15:45:56+00:00",
"url": "http://arxiv.org/pdf/2501.18438v2",
"resource_uri": "arxiv://2501.18438v2",
"citation_count": 0
},
{
"id": "2501.17740v2",
"title": "Attacker Control and Bug Prioritization",
"authors": [
"Guilhem Lacombe",
"Sébastien Bardin"
],
"abstract": "As bug-finding methods improve, bug-fixing capabilities are exceeded,\nresulting in an accumulation of potential vulnerabilities. There is thus a need\nfor efficient and precise bug prioritization based on exploitability. In this\nwork, we explore the notion of control of an attacker over a vulnerability's\nparameters, which is an often overlooked factor of exploitability. We show that\ntaint as well as straightforward qualitative and quantitative notions of\ncontrol are not enough to effectively differentiate vulnerabilities. Instead,\nwe propose to focus analysis on feasible value sets, which we call domains of\ncontrol, in order to better take into account threat models and expert insight.\nOur new Shrink and Split algorithm efficiently extracts domains of control from\npath constraints obtained with symbolic execution and renders them in an easily\nprocessed, human-readable form. This in turn allows to automatically compute\nmore complex control metrics, such as weighted Quantitative Control, which\nfactors in the varying threat levels of different values. Experiments show that\nour method is both efficient and precise. In particular, it is the only one\nable to distinguish between vulnerabilities such as cve-2019-14192 and\ncve-2022-30552, while revealing a mistake in the human evaluation of\ncve-2022-30790. The high degree of automation of our tool also brings us closer\nto a fully-automated evaluation pipeline.",
"categories": [
"cs.CR"
],
"published": "2025-01-29T16:27:43+00:00",
"url": "http://arxiv.org/pdf/2501.17740v2",
"resource_uri": "arxiv://2501.17740v2",
"citation_count": 0
},
{
"id": "2502.18465v1",
"title": "Empirical Research on Utilizing LLM-based Agents for Automated Bug Fixing via LangGraph",
"authors": [
"Jialin Wang",
"Zhihua Duan"
],
"abstract": "This paper presents a novel framework for automated code generation and\ndebugging, designed to improve accuracy, efficiency, and scalability in\nsoftware development. The proposed system integrates three core components\nLangGraph, GLM4 Flash, and ChromaDB within a four step iterative workflow to\ndeliver robust performance and seamless functionality.\n LangGraph serves as a graph-based library for orchestrating tasks, providing\nprecise control and execution while maintaining a unified state object for\ndynamic updates and consistency. It supports multi-agent, hierarchical, and\nsequential processes, making it highly adaptable to complex software\nengineering workflows. GLM4 Flash, a large language model, leverages its\nadvanced capabilities in natural language understanding, contextual reasoning,\nand multilingual support to generate accurate code snippets based on user\nprompts. ChromaDB acts as a vector database for semantic search and contextual\nmemory storage, enabling the identification of patterns and the generation of\ncontext-aware bug fixes based on historical data.\n The system operates through a structured four-step process: (1) Code\nGeneration, which translates natural language descriptions into executable\ncode; (2) Code Execution, which validates the code by identifying runtime\nerrors and inconsistencies; (3) Code Repair, which iteratively refines buggy\ncode using ChromaDB's memory capabilities and LangGraph's state tracking; and\n(4) Code Update, which ensures the code meets functional and performance\nrequirements through iterative modifications.",
"categories": [
"cs.SE"
],
"published": "2025-01-29T12:01:00+00:00",
"url": "http://arxiv.org/pdf/2502.18465v1",
"resource_uri": "arxiv://2502.18465v1",
"citation_count": 0
},
{
"id": "2501.16655v1",
"title": "Large Language Model Critics for Execution-Free Evaluation of Code Changes",
"authors": [
"Aashish Yadavally",
"Hoan Nguyen",
"Laurent Callot",
"Gauthier Guinet"
],
"abstract": "Large language models (LLMs) offer a promising way forward for automating\nsoftware engineering tasks, such as bug fixes, feature additions, etc., via\nmulti-step LLM-based agentic workflows. However, existing metrics for\nevaluating such workflows, mainly build status and occasionally log analysis,\nare too sparse and limited in providing the information needed to assess the\nquality of changes made. In this work, we designed LLM-based critics to derive\nwell-structured and rigorous intermediate/step-level, execution-free evaluation\nproxies for repo-level code changes. Importantly, we assume access to the gold\ntest patch for the problem (i.e., reference-aware) to assess both semantics and\nexecutability of generated patches. With the gold test patch as a reference, we\npredict executability of all editing locations with an F1 score of 91.6%,\naggregating which, we can predict the build status in 84.8% of the instances in\nSWE-bench. In particular, such an execution-focused LLM critic outperforms\nother reference-free and reference-aware LLM critics by 38.9% to 72.5%.\nMoreover, we demonstrate the usefulness of such a reference-aware framework in\ncomparing patches generated by different agentic workflows. Finally, we\nopen-source the library developed for this project, which allows further usage\nfor either other agentic workflows or other benchmarks. The source code is\navailable at https://github.com/amazon-science/code-agent-eval.",
"categories": [
"cs.CL",
"cs.AI",
"cs.SE"
],
"published": "2025-01-28T02:38:56+00:00",
"url": "http://arxiv.org/pdf/2501.16655v1",
"resource_uri": "arxiv://2501.16655v1",
"citation_count": 0
},
{
"id": "2501.16191v1",
"title": "Raiders of the Lost Dependency: Fixing Dependency Conflicts in Python using LLMs",
"authors": [
"Antony Bartlett",
"Cynthia Liem",
"Annibale Panichella"
],
"abstract": "Fixing Python dependency issues is a tedious and error-prone task for\ndevelopers, who must manually identify and resolve environment dependencies and\nversion constraints of third-party modules and Python interpreters. Researchers\nhave attempted to automate this process by relying on large knowledge graphs\nand database lookup tables. However, these traditional approaches face\nlimitations due to the variety of dependency error types, large sets of\npossible module versions, and conflicts among transitive dependencies. This\nstudy explores the potential of using large language models (LLMs) to\nautomatically fix dependency issues in Python programs. We introduce PLLM\n(pronounced \"plum\"), a novel technique that employs retrieval-augmented\ngeneration (RAG) to help an LLM infer Python versions and required modules for\na given Python file. PLLM builds a testing environment that iteratively (1)\nprompts the LLM for module combinations, (2) tests the suggested changes, and\n(3) provides feedback (error messages) to the LLM to refine the fix. This\nfeedback cycle leverages natural language processing (NLP) to intelligently\nparse and interpret build error messages. We benchmark PLLM on the Gistable\nHG2.9K dataset, a collection of challenging single-file Python gists. We\ncompare PLLM against two state-of-the-art automatic dependency inference\napproaches, namely PyEGo and ReadPyE, w.r.t. the ability to resolve dependency\nissues. Our results indicate that PLLM can fix more dependency issues than the\ntwo baselines, with +218 (+15.97%) more fixes over ReadPyE and +281 (+21.58%)\nover PyEGo. Our deeper analyses suggest that PLLM is particularly beneficial\nfor projects with many dependencies and for specific third-party numerical and\nmachine-learning modules. Our findings demonstrate the potential of LLM-based\napproaches to iteratively resolve Python dependency issues.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2025-01-27T16:45:34+00:00",
"url": "http://arxiv.org/pdf/2501.16191v1",
"resource_uri": "arxiv://2501.16191v1",
"citation_count": 0
},
{
"id": "2501.16149v2",
"title": "PATCH: Empowering Large Language Model with Programmer-Intent Guidance and Collaborative-Behavior Simulation for Automatic Bug Fixing",
"authors": [
"Yuwei Zhang",
"Zhi Jin",
"Ying Xing",
"Ge Li",
"Fang Liu",
"Jiaxin Zhu",
"Wensheng Dou",
"Jun Wei"
],
"abstract": "Bug fixing holds significant importance in software development and\nmaintenance. Recent research has made substantial strides in exploring the\npotential of large language models (LLMs) for automatically resolving software\nbugs. However, a noticeable gap in existing approaches lies in the oversight of\ncollaborative facets intrinsic to bug resolution, treating the process as a\nsingle-stage endeavor. Moreover, most approaches solely take the buggy code\nsnippet as input for LLMs during the patch generation stage. To mitigate the\naforementioned limitations, we introduce a novel stage-wise framework named\nPATCH. Specifically, we first augment the buggy code snippet with corresponding\ndependence context and intent information to better guide LLMs in generating\nthe correct candidate patches. Additionally, by taking inspiration from bug\nmanagement practices, we decompose the bug-fixing task into four distinct\nstages: bug reporting, bug diagnosis, patch generation, and patch verification.\nThese stages are performed interactively by LLMs, aiming to simulate the\ncollaborative behavior of programmers during the resolution of software bugs.\nBy harnessing these collective contributions, PATCH effectively enhances the\nbug-fixing capability of LLMs. We implement PATCH by employing the powerful\ndialogue-based LLM ChatGPT. Our evaluation on the widely used bug-fixing\nbenchmark BFP demonstrates that PATCH has achieved better performance than\nstate-of-the-art LLMs.",
"categories": [
"cs.SE"
],
"published": "2025-01-27T15:43:04+00:00",
"url": "http://arxiv.org/pdf/2501.16149v2",
"resource_uri": "arxiv://2501.16149v2",
"citation_count": 0
},
{
"id": "2501.16044v1",
"title": "MultiMend: Multilingual Program Repair with Context Augmentation and Multi-Hunk Patch Generation",
"authors": [
"Reza Gharibi",
"Mohammad Hadi Sadreddini",
"Seyed Mostafa Fakhrahmad"
],
"abstract": "Context: Bugs in code are inevitable and can lead to severe consequences,\nranging from security vulnerabilities to operational failures. Debugging\nsoftware remains challenging despite advances in testing and verification,\noften requiring extensive manual effort. Learning-based automated program\nrepair (APR) has shown promise in reducing the time, effort, and cost of\nmanually fixing bugs. However, existing techniques face several challenges,\nincluding language-dependent strategies, limited bug context utilization, and\ndifficulties in handling bugs that span multiple locations in the code.\n Objective: This paper introduces MultiMend, a learning-based APR approach\ndesigned to improve repair performance on multiple programming languages with\nlanguage-independent context augmentation and multi-hunk patch generation.\n Method: MultiMend fine-tunes a pre-trained encoder-decoder transformer model\n(CodeT5) to generate bug-fixing patches. It embeds source code lines and\napplies retrieval-augmented generation to augment the buggy context with\nrelevant lines during patch generation. The approach systematically constructs\npatches for multi-hunk bugs to reduce the needed patch validations. We evaluate\nMultiMend on four benchmarks with four programming languages and compare it\nwith state-of-the-art methods.\n Results: Experimental results show that MultiMend achieves competitive\neffectiveness and efficiency against compared tools. Across all benchmarks,\nMultiMend fixes 2,077 bugs, of which 1,455 are identical to the developer's\npatch, and 106 are for multi-hunk bugs. Both context augmentation and\nmulti-hunk patch generation positively contribute to the results.\n Conclusion: MultiMend shows promising performance across benchmarks. The\nfindings highlight its applicability to real-world software maintenance and its\npotential to reduce manual debugging efforts.",
"categories": [
"cs.SE"
],
"published": "2025-01-27T13:37:43+00:00",
"url": "http://arxiv.org/pdf/2501.16044v1",
"resource_uri": "arxiv://2501.16044v1",
"citation_count": 0
},
{
"id": "2501.14438v1",
"title": "Data-efficient Performance Modeling via Pre-training",
"authors": [
"Chunting Liu",
"Riyadh Baghdadi"
],
"abstract": "Performance models are essential for automatic code optimization, enabling\ncompilers to predict the effects of code transformations on performance and\nguide search for optimal transformations. Building state-of-the-art performance\nmodels with deep learning, however, requires vast labeled datasets of random\nprograms -- an expensive and time-consuming process, stretching over months.\nThis paper introduces a self-supervised pre-training scheme with autoencoders\nto reduce the need for labeled data. By pre-training on a large dataset of\nrandom programs, the autoencoder learns representations of code and\ntransformations, which are then used to embed programs for the performance\nmodel. Implemented in the Tiramisu autoscheduler, our approach improves model\naccuracy with less data. For example, to achieve a MAPE of 20.72%, the original\nmodel requires 18 million data points, whereas our method achieves a similar\nMAPE of 22.44% with only 3.6 million data points, reducing data requirements by\n5x.",
"categories": [
"cs.PL",
"cs.DC",
"cs.LG"
],
"published": "2025-01-24T12:14:53+00:00",
"url": "http://arxiv.org/pdf/2501.14438v1",
"resource_uri": "arxiv://2501.14438v1",
"citation_count": 0
},
{
"id": "2501.12079v1",
"title": "Directional Diffusion-Style Code Editing Pre-training",
"authors": [
"Qingyuan Liang",
"Zeyu Sun",
"Qihao Zhu",
"Junhao Hu",
"Yifan Zhao",
"Yizhou Chen",
"Mingxuan Zhu",
"Guoqing Wang",
"Lu Zhang"
],
"abstract": "Code pre-trained models have shown promising effectiveness in various\nsoftware engineering tasks. Among these tasks, many tasks are related to\nsoftware evolution and/or code editing. However, existing code pre-trained\nmodels often overlook the real-world code editing data and the evolutionary\nnature of the editing process. In this paper, to simulate the step-by-step code\nediting process of human developers, we propose DivoT5, a pre-trained model\nbased on directional diffusion at the data level. In DivoT5, we adopt two\ncategories of pre-training tasks. The first category is mask and denoising\ntasks augmented with a diffusion direction representing code evolution. That\nis, we first apply a noising process to the code snippets before evolution, and\nthen ask the pre-training process to restore the snippets with noise into the\ncode snippets after evolution. The second category is tasks aiming to reinforce\nthe evolutionary direction. That is, we first generate various intermediate\nversions for each pair of snippets before and after evolution, and then ask the\npre-training process to transform the intermediate versions into the snippet\nafter evolution for each pair. We evaluate DivoT5 for two code-editing\nscenarios and one non-editing scenario using five downstream tasks. Given each\ndownstream task, we fine-tune the pre-trained DivoT5 to evaluate its\neffectiveness. Our experimental results show that DivoT5 achieves\nstate-of-the-art (SOTA) performance on most tasks in comparison to models of\nthe same scale (220M), large scale (770M) models in fine-tuning, and\nbillion-scale (6.7B, 8B, ChatGPT) models in few-shot settings. For one\ncode-editing task (i.e., automated code review), DivoT5 pre-trained on top of\nCodeT5-small (60M) can even outperform CodeT5-base (220M) and other pre-trained\nmodels with 220M parameters except for DivoT5 pre-trained on top of CodeT5-base\n(220M).",
"categories": [
"cs.SE"
],
"published": "2025-01-21T12:10:18+00:00",
"url": "http://arxiv.org/pdf/2501.12079v1",
"resource_uri": "arxiv://2501.12079v1",
"citation_count": 0
},
{
"id": "2501.11006v2",
"title": "GREEN-CODE: Learning to Optimize Energy Efficiency in LLM-based Code Generation",
"authors": [
"Shashikant Ilager",
"Lukas Florian Briem",
"Ivona Brandic"
],
"abstract": "Large Language Models (LLMs) are becoming integral to daily life, showcasing\ntheir vast potential across various Natural Language Processing (NLP) tasks.\nBeyond NLP, LLMs are increasingly used in software development tasks, such as\ncode completion, modification, bug fixing, and code translation. Software\nengineers widely use tools like GitHub Copilot and Amazon Q, streamlining\nworkflows and automating tasks with high accuracy. While the resource and\nenergy intensity of LLM training is often highlighted, inference can be even\nmore resource-intensive over time, as it's a continuous process with a high\nnumber of invocations. Therefore, developing resource-efficient alternatives\nfor LLM inference is crucial for sustainability. This work proposes GREEN-CODE,\na framework for energy-aware code generation in LLMs. GREEN-CODE performs\ndynamic early exit during LLM inference. We train a Reinforcement Learning (RL)\nagent that learns to balance the trade-offs between accuracy, latency, and\nenergy consumption. Our approach is evaluated on two open-source LLMs, Llama\n3.2 3B and OPT 2.7B, using the JavaCorpus and PY150 datasets. Results show that\nour method reduces the energy consumption between 23-50 % on average for code\ngeneration tasks without significantly affecting accuracy.",
"categories": [
"cs.DC",
"cs.AI",
"cs.PF",
"cs.SE",
"C.4; D.0; E.4; I.7"
],
"published": "2025-01-19T10:44:03+00:00",
"url": "http://arxiv.org/pdf/2501.11006v2",
"resource_uri": "arxiv://2501.11006v2",
"citation_count": 0
},
{
"id": "2501.09888v1",
"title": "Understanding the Effectiveness of LLMs in Automated Self-Admitted Technical Debt Repayment",
"authors": [
"Mohammad Sadegh Sheikhaei",
"Yuan Tian",
"Shaowei Wang",
"Bowen Xu"
],
"abstract": "Self-Admitted Technical Debt (SATD), cases where developers intentionally\nacknowledge suboptimal solutions in code through comments, poses a significant\nchallenge to software maintainability. Left unresolved, SATD can degrade code\nquality and increase maintenance costs. While Large Language Models (LLMs) have\nshown promise in tasks like code generation and program repair, their potential\nin automated SATD repayment remains underexplored.\n In this paper, we identify three key challenges in training and evaluating\nLLMs for SATD repayment: (1) dataset representativeness and scalability, (2)\nremoval of irrelevant SATD repayments, and (3) limitations of existing\nevaluation metrics. To address the first two dataset-related challenges, we\nadopt a language-independent SATD tracing tool and design a 10-step filtering\npipeline to extract SATD repayments from repositories, resulting two\nlarge-scale datasets: 58,722 items for Python and 97,347 items for Java. To\nimprove evaluation, we introduce two diff-based metrics, BLEU-diff and\nCrystalBLEU-diff, which measure code changes rather than whole code.\nAdditionally, we propose another new metric, LEMOD, which is both interpretable\nand informative. Using our new benchmarks and evaluation metrics, we evaluate\ntwo types of automated SATD repayment methods: fine-tuning smaller models, and\nprompt engineering with five large-scale models. Our results reveal that\nfine-tuned small models achieve comparable Exact Match (EM) scores to\nprompt-based approaches but underperform on BLEU-based metrics and LEMOD.\nNotably, Gemma-2-9B leads in EM, addressing 10.1% of Python and 8.1% of Java\nSATDs, while Llama-3.1-70B-Instruct and GPT-4o-mini excel on BLEU-diff,\nCrystalBLEU-diff, and LEMOD metrics. Our work contributes a robust benchmark,\nimproved evaluation metrics, and a comprehensive evaluation of LLMs, advancing\nresearch on automated SATD repayment.",
"categories": [
"cs.SE",
"K.6.3"
],
"published": "2025-01-17T00:23:44+00:00",
"url": "http://arxiv.org/pdf/2501.09888v1",
"resource_uri": "arxiv://2501.09888v1",
"citation_count": 0
},
{
"id": "2501.09745v1",
"title": "Suggesting Code Edits in Interactive Machine Learning Notebooks Using Large Language Models",
"authors": [
"Bihui Jin",
"Jiayue Wang",
"Pengyu Nie"
],
"abstract": "Machine learning developers frequently use interactive computational\nnotebooks, such as Jupyter notebooks, to host code for data processing and\nmodel training. Jupyter notebooks provide a convenient tool for writing machine\nlearning pipelines and interactively observing outputs, however, maintaining\nJupyter notebooks, e.g., to add new features or fix bugs, can be challenging\ndue to the length and complexity of the notebooks. Moreover, there is no\nexisting benchmark related to developer edits on Jupyter notebooks. To address\nthis, we present the first dataset of 48,398 Jupyter notebook edits derived\nfrom 20,095 revisions of 792 machine learning repositories on GitHub, and\nperform the first study of the using LLMs to predict code edits in Jupyter\nnotebooks. Our dataset captures granular details of cell-level and line-level\nmodifications, offering a foundation for understanding real-world maintenance\npatterns in machine learning workflows. We observed that the edits on Jupyter\nnotebooks are highly localized, with changes averaging only 166 lines of code\nin repositories. While larger models outperform smaller counterparts in code\nediting, all models have low accuracy on our dataset even after finetuning,\ndemonstrating the complexity of real-world machine learning maintenance tasks.\nOur findings emphasize the critical role of contextual information in improving\nmodel performance and point toward promising avenues for advancing large\nlanguage models' capabilities in engineering machine learning code.",
"categories": [
"cs.SE",
"cs.CL",
"cs.LG"
],
"published": "2025-01-16T18:55:38+00:00",
"url": "http://arxiv.org/pdf/2501.09745v1",
"resource_uri": "arxiv://2501.09745v1",
"citation_count": 0
},
{
"id": "2501.09135v1",
"title": "HAFix: History-Augmented Large Language Models for Bug Fixing",
"authors": [
"Yu Shi",
"Abdul Ali Bangash",
"Emad Fallahzadeh",
"Bram Adams",
"Ahmed E. Hassan"
],
"abstract": "Recent studies have explored the performance of Large Language Models (LLMs)\non various Software Engineering (SE) tasks, such as code generation and bug\nfixing. However, these approaches typically rely on the context data from the\ncurrent snapshot of the project, overlooking the potential of rich historical\ndata from real-world software repositories. Additionally, the impact of prompt\nstyles on LLM performance within a historical context remains underexplored. To\naddress these gaps, we propose HAFix, which stands for History-Augmented LLMs\non Bug Fixing, a novel approach that leverages individual historical heuristics\nassociated with bugs and aggregates the results of these heuristics (HAFix-Agg)\nto enhance LLMs' bug-fixing capabilities. To empirically evaluate HAFix, we\nemploy Code Llama on a dataset of 51 single-line bugs, sourced from 11\nopen-source projects, by mining the historical context data of bugs and\noperationalizing this context in the form of seven heuristics. Our evaluation\ndemonstrates that historical heuristics significantly enhance bug-fixing\nperformance. For example, the FLN-all heuristic achieves a 10% improvement in\nperformance compared to a non-historical baseline inspired by GitHub Copilot.\nFurthermore, HAFix-Agg fixes 45% more bugs than the baseline, outperforming\nFLN-all and demonstrating the best performance overall. Moreover, within the\ncontext of historical heuristics, we identify the Instruction style prompt as\nthe most effective template for LLMs in bug fixing. Finally, we provide a\npragmatic trade-off analysis of bug-fixing performance, cost, and time\nefficiency, offering valuable insights for the practical deployment of our\napproach in real-world scenarios.",
"categories": [
"cs.SE"
],
"published": "2025-01-15T20:39:32+00:00",
"url": "http://arxiv.org/pdf/2501.09135v1",
"resource_uri": "arxiv://2501.09135v1",
"citation_count": 0
},
{
"id": "2501.07531v1",
"title": "Evaluating Agent-based Program Repair at Google",
"authors": [
"Pat Rondon",
"Renyao Wei",
"José Cambronero",
"Jürgen Cito",
"Aaron Sun",
"Siddhant Sanyam",
"Michele Tufano",
"Satish Chandra"
],
"abstract": "Agent-based program repair offers to automatically resolve complex bugs\nend-to-end by combining the planning, tool use, and code generation abilities\nof modern LLMs. Recent work has explored the use of agent-based repair\napproaches on the popular open-source SWE-Bench, a collection of bugs from\nhighly-rated GitHub Python projects. In addition, various agentic approaches\nsuch as SWE-Agent have been proposed to solve bugs in this benchmark. This\npaper explores the viability of using an agentic approach to address bugs in an\nenterprise context. To investigate this, we curate an evaluation set of 178\nbugs drawn from Google's issue tracking system. This dataset spans both\nhuman-reported (78) and machine-reported bugs (100).\n To establish a repair performance baseline on this benchmark, we implement\nPasserine, an agent similar in spirit to SWE-Agent that can work within\nGoogle's development environment. We show that with 20 trajectory samples and\nGemini 1.5 Pro, Passerine can produce a patch that passes bug tests (i.e.,\nplausible) for 73% of machine-reported and 25.6% of human-reported bugs in our\nevaluation set. After manual examination, we found that 43% of machine-reported\nbugs and 17.9% of human-reported bugs have at least one patch that is\nsemantically equivalent to the ground-truth patch.\n These results establish a baseline on an industrially relevant benchmark,\nwhich as we show, contains bugs drawn from a different distribution -- in terms\nof language diversity, size, and spread of changes, etc. -- compared to those\nin the popular SWE-Bench dataset.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2025-01-13T18:09:25+00:00",
"url": "http://arxiv.org/pdf/2501.07531v1",
"resource_uri": "arxiv://2501.07531v1",
"citation_count": 0
},
{
"id": "2501.07339v1",
"title": "Evaluating Pre-Trained Models for Multi-Language Vulnerability Patching",
"authors": [
"Zanis Ali Khan",
"Aayush Garg",
"Yuejun Guo",
"Qiang Tang"
],
"abstract": "Software vulnerabilities pose critical security risks, demanding prompt and\neffective mitigation strategies. While advancements in Automated Program Repair\n(APR) have primarily targeted general software bugs, the domain of\nvulnerability patching, which is a security-critical subset of APR, remains\nunderexplored. This paper investigates the potential of pre-trained language\nmodels, CodeBERT and CodeT5, for automated vulnerability patching across\ndiverse datasets and five programming languages. We evaluate these models on\ntheir accuracy, computational efficiency, and how the length of vulnerable code\npatches impacts performance. Our findings reveal promising accuracy levels,\nparticularly for CodeT5 on datasets with complex vulnerability patterns, while\nCodeBERT demonstrates strengths in handling fragmented or context-limited\ndatasets. CodeT5 further showcases superior efficiency, making it well-suited\nfor large-scale applications. However, both models face challenges in\nmaintaining performance as patch length increases, highlighting the complexity\nof addressing extended in program repair specifically aimed at fixing\nvulnerabilities. This study benchmarks model performance, highlights key\nlimitations, and offers insights to improve automated vulnerability patching\nfor practical security applications.",
"categories": [
"cs.SE"
],
"published": "2025-01-13T13:51:05+00:00",
"url": "http://arxiv.org/pdf/2501.07339v1",
"resource_uri": "arxiv://2501.07339v1",
"citation_count": 0
},
{
"id": "2501.05040v3",
"title": "SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution",
"authors": [
"Chengxing Xie",
"Bowen Li",
"Chang Gao",
"He Du",
"Wai Lam",
"Difan Zou",
"Kai Chen"
],
"abstract": "Large Language Models (LLMs) have demonstrated remarkable proficiency across\na variety of complex tasks. One significant application of LLMs is in tackling\nsoftware engineering challenges, particularly in resolving real-world tasks on\nGitHub by fixing code based on the issues reported by the users. However, many\ncurrent approaches rely on proprietary LLMs, which limits reproducibility,\naccessibility, and transparency. The critical components of LLMs for addressing\nsoftware engineering issues and how their capabilities can be effectively\nenhanced remain unclear. To address these challenges, we introduce SWE-Fixer, a\nnovel open-source framework designed to effectively and efficiently resolve\nGitHub issues. SWE-Fixer comprises two essential modules: a code file retrieval\nmodule and a code editing module. The retrieval module employs BM25 along with\na lightweight model to achieve coarse-to-fine file retrieval. Subsequently, the\ncode editing module utilizes the other model to generate patches for the\nidentified files. To mitigate the lack of publicly available datasets, we\ncompile an extensive dataset that includes 110K GitHub issues along with their\ncorresponding patches and train the two models of SWE-Fixer separately. We\nassess our approach on the SWE-Bench Lite and Verified benchmarks, achieving\ncompetitive performance among open-source models with scores of 22.0% and\n30.2%. Furthermore, SWE-Fixer reaches state-of-the-art performance (24.7% on\nLite and 32.8% on Verified) with PASS_TO_PASS (P2P) filtering. Additionally,\nour approach requires only two model calls per instance, making it\nsignificantly more efficient than existing methods. These results highlight the\neffectiveness of SWE-Fixer in real-world code-fixing scenarios. We will make\nour model, dataset, and code publicly available at\nhttps://github.com/InternLM/SWE-Fixer.",
"categories": [
"cs.CL"
],
"published": "2025-01-09T07:54:24+00:00",
"url": "http://arxiv.org/pdf/2501.05040v3",
"resource_uri": "arxiv://2501.05040v3",
"citation_count": 0
},
{
"id": "2501.04572v3",
"title": "Regret Analysis: a control perspective",
"authors": [
"Travis E. Gibson",
"Sawal Acharya"
],
"abstract": "Online learning and model reference adaptive control have many interesting\nintersections. One area where they differ however is in how the algorithms are\nanalyzed and what objective or metric is used to discriminate \"good\" algorithms\nfrom \"bad\" algorithms. In adaptive control there are usually two objectives: 1)\nprove that all time varying parameters/states of the system are bounded, and 2)\nthat the instantaneous error between the adaptively controlled system and a\nreference system converges to zero over time (or at least a compact set). For\nonline learning the performance of algorithms is often characterized by the\nregret the algorithm incurs. Regret is defined as the cumulative loss (cost)\nover time from the online algorithm minus the cumulative loss (cost) of the\nsingle optimal fixed parameter choice in hindsight. Another significant\ndifference between the two areas of research is with regard to the assumptions\nmade in order to obtain said results. Adaptive control makes assumptions about\nthe input-output properties of the control problem and derives solutions for a\nfixed error model or optimization task. In the online learning literature\nresults are derived for classes of loss functions (i.e. convex) while a priori\nassuming certain signals are bounded. In this work we discuss these differences\nin detail through the regret based analysis of gradient descent for convex\nfunctions and the control based analysis of a streaming regression problem. We\nclose with a discussion about the newly defined paradigm of online adaptive\ncontrol.",
"categories": [
"eess.SY",
"cs.LG",
"cs.SY",
"math.OC"
],
"published": "2025-01-08T15:42:41+00:00",
"url": "http://arxiv.org/pdf/2501.04572v3",
"resource_uri": "arxiv://2501.04572v3",
"citation_count": 0
},
{
"id": "2501.04285v4",
"title": "Separate Source Channel Coding Is Still What You Need: An LLM-based Rethinking",
"authors": [
"Tianqi Ren",
"Rongpeng Li",
"Ming-min Zhao",
"Xianfu Chen",
"Guangyi Liu",
"Yang Yang",
"Zhifeng Zhao",
"Honggang Zhang"
],
"abstract": "Along with the proliferating research interest in Semantic Communication\n(SemCom), Joint Source Channel Coding (JSCC) has dominated the attention due to\nthe widely assumed existence in efficiently delivering information semantics.\nNevertheless, this paper challenges the conventional JSCC paradigm, and\nadvocates for adoption of Separate Source Channel Coding (SSCC) to enjoy the\nunderlying more degree of freedom for optimization. We demonstrate that SSCC,\nafter leveraging the strengths of Large Language Model (LLM) for source coding\nand Error Correction Code Transformer (ECCT) complemented for channel decoding,\noffers superior performance over JSCC. Our proposed framework also effectively\nhighlights the compatibility challenges between SemCom approaches and digital\ncommunication systems, particularly concerning the resource costs associated\nwith the transmission of high precision floating point numbers. Through\ncomprehensive evaluations, we establish that empowered by LLM-based compression\nand ECCT-enhanced error correction, SSCC remains a viable and effective\nsolution for modern communication systems. In other words, separate source and\nchannel coding is still what we need!",
"categories": [
"cs.IT",
"eess.SP",
"math.IT"
],
"published": "2025-01-08T05:17:09+00:00",
"url": "http://arxiv.org/pdf/2501.04285v4",
"resource_uri": "arxiv://2501.04285v4",
"citation_count": 0
},
{
"id": "2501.02628v1",
"title": "Cracks in The Stack: Hidden Vulnerabilities and Licensing Risks in LLM Pre-Training Datasets",
"authors": [
"Mahmoud Jahanshahi",
"Audris Mockus"
],
"abstract": "A critical part of creating code suggestion systems is the pre-training of\nLarge Language Models on vast amounts of source code and natural language text,\noften of questionable origin or quality. This may contribute to the presence of\nbugs and vulnerabilities in code generated by LLMs. While efforts to identify\nbugs at or after code generation exist, it is preferable to pre-train or\nfine-tune LLMs on curated, high-quality, and compliant datasets. The need for\nvast amounts of training data necessitates that such curation be automated,\nminimizing human intervention.\n We propose an automated source code autocuration technique that leverages the\ncomplete version history of open-source software projects to improve the\nquality of training data. This approach leverages the version history of all\nOSS projects to identify training data samples that have been modified or have\nundergone changes in at least one OSS project, and pinpoint a subset of samples\nthat include fixes for bugs or vulnerabilities. We evaluate this method using\nThe Stack v2 dataset, and find that 17% of the code versions in the dataset\nhave newer versions, with 17% of those representing bug fixes, including 2.36%\naddressing known CVEs. The deduplicated version of Stack v2 still includes\nblobs vulnerable to 6,947 known CVEs. Furthermore, 58% of the blobs in the\ndataset were never modified after creation, suggesting they likely represent\nsoftware with minimal or no use. Misidentified blob origins present an\nadditional challenge, as they lead to the inclusion of non-permissively\nlicensed code, raising serious compliance concerns.\n By addressing these issues, the training of new models can avoid perpetuating\nbuggy code patterns or license violations. We expect our results to inspire\nprocess improvements for automated data curation, with the potential to enhance\nthe reliability of outputs generated by AI tools.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2025-01-05T18:54:25+00:00",
"url": "http://arxiv.org/pdf/2501.02628v1",
"resource_uri": "arxiv://2501.02628v1",
"citation_count": 0
},
{
"id": "2501.02446v1",
"title": "RTLMarker: Protecting LLM-Generated RTL Copyright via a Hardware Watermarking Framework",
"authors": [
"Kun Wang",
"Kaiyan Chang",
"Mengdi Wang",
"Xinqi Zou",
"Haobo Xu",
"Yinhe Han",
"Ying Wang"
],
"abstract": "Recent advances of large language models in the field of Verilog generation\nhave raised several ethical and security concerns, such as code copyright\nprotection and dissemination of malicious code. Researchers have employed\nwatermarking techniques to identify codes generated by large language models.\nHowever, the existing watermarking works fail to protect RTL code copyright due\nto the significant syntactic and semantic differences between RTL code and\nsoftware code in languages such as Python. This paper proposes a hardware\nwatermarking framework RTLMarker that embeds watermarks into RTL code and\ndeeper into the synthesized netlist. We propose a set of rule-based Verilog\ncode transformations , ensuring the watermarked RTL code's syntactic and\nsemantic correctness. In addition, we consider an inherent tradeoff between\nwatermark transparency and watermark effectiveness and jointly optimize them.\nThe results demonstrate RTLMarker's superiority over the baseline in RTL code\nwatermarking.",
"categories": [
"cs.CR",
"cs.AI"
],
"published": "2025-01-05T05:38:28+00:00",
"url": "http://arxiv.org/pdf/2501.02446v1",
"resource_uri": "arxiv://2501.02446v1",
"citation_count": 0
},
{
"id": "2501.01040v1",
"title": "Event Masked Autoencoder: Point-wise Action Recognition with Event-Based Cameras",
"authors": [
"Jingkai Sun",
"Qiang Zhang",
"Jiaxu Wang",
"Jiahang Cao",
"Renjing Xu"
],
"abstract": "Dynamic vision sensors (DVS) are bio-inspired devices that capture visual\ninformation in the form of asynchronous events, which encode changes in pixel\nintensity with high temporal resolution and low latency. These events provide\nrich motion cues that can be exploited for various computer vision tasks, such\nas action recognition. However, most existing DVS-based action recognition\nmethods lose temporal information during data transformation or suffer from\nnoise and outliers caused by sensor imperfections or environmental factors. To\naddress these challenges, we propose a novel framework that preserves and\nexploits the spatiotemporal structure of event data for action recognition. Our\nframework consists of two main components: 1) a point-wise event masked\nautoencoder (MAE) that learns a compact and discriminative representation of\nevent patches by reconstructing them from masked raw event camera points data;\n2) an improved event points patch generation algorithm that leverages an event\ndata inlier model and point-wise data augmentation techniques to enhance the\nquality and diversity of event points patches. To the best of our knowledge,\nour approach introduces the pre-train method into event camera raw points data\nfor the first time, and we propose a novel event points patch embedding to\nutilize transformer-based models on event cameras.",
"categories": [
"cs.CV"
],
"published": "2025-01-02T03:49:03+00:00",
"url": "http://arxiv.org/pdf/2501.01040v1",
"resource_uri": "arxiv://2501.01040v1",
"citation_count": 0
},
{
"id": "2412.20340v2",
"title": "Distilling Desired Comments for Enhanced Code Review with Large Language Models",
"authors": [
"Yongda Yu",
"Lei Zhang",
"Guoping Rong",
"Haifeng Shen",
"Jiahao Zhang",
"Haoxiang Yan",
"Guohao Shi",
"Dong Shao",
"Ruiqi Pan",
"Yuan Li",
"Qiushi Wang",
"Zhao Tian"
],
"abstract": "There has been a growing interest in using Large Language Models (LLMs) for\ncode review thanks to their proven proficiency in code comprehension. The\nprimary objective of most review scenarios is to generate desired review\ncomments (DRCs) that explicitly identify issues to trigger code fixes. However,\nexisting LLM-based solutions are not so effective in generating DRCs for\nvarious reasons such as hallucination. To enhance their code review ability,\nthey need to be fine-tuned with a customized dataset that is ideally full of\nDRCs. Nevertheless, such a dataset is not yet available, while manual\nannotation of DRCs is too laborious to be practical. In this paper, we propose\na dataset distillation method, Desiview, which can automatically construct a\ndistilled dataset by identifying DRCs from a code review dataset. Experiments\non the CodeReviewer dataset comprising more than 150K review entries show that\nDesiview achieves an impressive performance of 88.93%, 80.37%, 86.67%, and\n84.44% in terms of Precision, Recall, Accuracy, and F1, respectively,\nsurpassing state-of-the-art methods. To validate the effect of such a distilled\ndataset on enhancing LLMs' code review ability, we first fine-tune the latest\nLLaMA series (i.e., LLaMA 3 and LLaMA 3.1) to build model Desiview4FT. We then\nenhance the model training effect through KTO alignment by feeding those review\ncomments identified as non-DRCs to the LLMs, resulting in model Desiview4FA.\nVerification results indicate that Desiview4FA slightly outperforms\nDesiview4FT, while both models have significantly improved against the base\nmodels in terms of generating DRCs. Human evaluation confirms that both models\nidentify issues more accurately and tend to generate review comments that\nbetter describe the issues contained in the code than the base LLMs do.",
"categories": [
"cs.SE",
"cs.AI",
"D.2.3; I.2.7"
],
"published": "2024-12-29T03:49:13+00:00",
"url": "http://arxiv.org/pdf/2412.20340v2",
"resource_uri": "arxiv://2412.20340v2",
"citation_count": 0
},
{
"id": "2412.19031v1",
"title": "Repository Structure-Aware Training Makes SLMs Better Issue Resolver",
"authors": [
"Zexiong Ma",
"Shengnan An",
"Zeqi Lin",
"Yanzhen Zou",
"Bing Xie"
],
"abstract": "Language models have been applied to various software development tasks, but\nthe performance varies according to the scale of the models. Large Language\nModels (LLMs) outperform Small Language Models (SLMs) in complex tasks like\nrepository-level issue resolving, but raise concerns about privacy and cost. In\ncontrast, SLMs are more accessible but under-perform in complex tasks. In this\npaper, we introduce ReSAT (Repository Structure-Aware Training), construct\ntraining data based on a large number of issues and corresponding pull requests\nfrom open-source communities to enhance the model's understanding of repository\nstructure and issue resolving ability. We construct two types of training data:\n(1) localization training data, a multi-level progressive localization data to\nimprove code understanding and localization capability; (2) code edit training\ndata, which improves context-based code editing capability. The evaluation\nresults on SWE-Bench-verified and RepoQA demonstrate that ReSAT effectively\nenhances SLMs' issue-resolving and repository-level long-context understanding\ncapabilities.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2024-12-26T03:01:32+00:00",
"url": "http://arxiv.org/pdf/2412.19031v1",
"resource_uri": "arxiv://2412.19031v1",
"citation_count": 0
},
{
"id": "2412.18750v3",
"title": "The Impact of Input Order Bias on Large Language Models for Software Fault Localization",
"authors": [
"Md Nakhla Rafi",
"Dong Jae Kim",
"Tse-Hsun Chen",
"Shaowei Wang"
],
"abstract": "Large Language Models (LLMs) have shown significant potential in software\nengineering tasks such as Fault Localization (FL) and Automatic Program Repair\n(APR). This study investigates how input order and context size influence LLM\nperformance in FL, a crucial step for many downstream software engineering\ntasks. We evaluate different method orderings using Kendall Tau distances,\nincluding \"perfect\" (where ground truths appear first) and \"worst\" (where\nground truths appear last), across two benchmarks containing Java and Python\nprojects. Our results reveal a strong order bias: in Java projects, Top-1 FL\naccuracy drops from 57% to 20% when reversing the order, while in Python\nprojects, it decreases from 38% to approximately 3%. However, segmenting inputs\ninto smaller contexts mitigates this bias, reducing the performance gap in FL\nfrom 22% and 6% to just 1% across both benchmarks. We replaced method names\nwith semantically meaningful alternatives to determine whether this bias is due\nto data leakage. The observed trends remained consistent, suggesting that the\nbias is not caused by memorization from training data but rather by the\ninherent effect of input order. Additionally, we explored ordering methods\nbased on traditional FL techniques and metrics, finding that DepGraph's ranking\nachieves 48% Top-1 accuracy, outperforming simpler approaches such as\nCallGraph(DFS). These findings highlight the importance of structuring inputs,\nmanaging context effectively, and selecting appropriate ordering strategies to\nenhance LLM performance in FL and other software engineering applications.",
"categories": [
"cs.SE",
"cs.AI",
"cs.LG"
],
"published": "2024-12-25T02:48:53+00:00",
"url": "http://arxiv.org/pdf/2412.18750v3",
"resource_uri": "arxiv://2412.18750v3",
"citation_count": 0
},
{
"id": "2412.17315v1",
"title": "CodeV: Issue Resolving with Visual Data",
"authors": [
"Linhao Zhang",
"Daoguang Zan",
"Quanshun Yang",
"Zhirong Huang",
"Dong Chen",
"Bo Shen",
"Tianyu Liu",
"Yongshun Gong",
"Pengjie Huang",
"Xudong Lu",
"Guangtai Liang",
"Lizhen Cui",
"Qianxiang Wang"
],
"abstract": "Large Language Models (LLMs) have advanced rapidly in recent years, with\ntheir applications in software engineering expanding to more complex\nrepository-level tasks. GitHub issue resolving is a key challenge among these\ntasks. While recent approaches have made progress on this task, they focus on\ntextual data within issues, neglecting visual data. However, this visual data\nis crucial for resolving issues as it conveys additional knowledge that text\nalone cannot. We propose CodeV, the first approach to leveraging visual data to\nenhance the issue-resolving capabilities of LLMs. CodeV resolves each issue by\nfollowing a two-phase process: data processing and patch generation. To\nevaluate CodeV, we construct a benchmark for visual issue resolving, namely\nVisual SWE-bench. Through extensive experiments, we demonstrate the\neffectiveness of CodeV, as well as provide valuable insights into leveraging\nvisual data to resolve GitHub issues.",
"categories": [
"cs.SE",
"cs.AI",
"cs.CL"
],
"published": "2024-12-23T06:17:11+00:00",
"url": "http://arxiv.org/pdf/2412.17315v1",
"resource_uri": "arxiv://2412.17315v1",
"citation_count": 0
},
{
"id": "2412.14857v1",
"title": "On the standard models of del Pezzo fibrations of degree four",
"authors": [
"Natsume Kitagawa"
],
"abstract": "Corti defined the notion of standard models of del Pezzo fibrations, and\nstudied their existence over $\\mathbb{C}$ with a fixed generic fibre. In this\npaper, we prove the existence of standard models of del Pezzo fibrations of\ndegree $4$ in characteristic $>2$. To show this, we use the notion of Koll\\'ar\nstability, which was introduced by Koll\\'ar and Abban-Fedorchuk-Krylov.",
"categories": [
"math.AG"
],
"published": "2024-12-19T13:51:25+00:00",
"url": "http://arxiv.org/pdf/2412.14857v1",
"resource_uri": "arxiv://2412.14857v1",
"citation_count": 0
},
{
"id": "2502.07786v1",
"title": "Counterexample Guided Program Repair Using Zero-Shot Learning and MaxSAT-based Fault Localization",
"authors": [
"Pedro Orvalho",
"Mikoláš Janota",
"Vasco Manquinho"
],
"abstract": "Automated Program Repair (APR) for introductory programming assignments\n(IPAs) is motivated by the large number of student enrollments in programming\ncourses each year. Since providing feedback on IPAs requires substantial time\nand effort from faculty, personalized feedback often involves suggesting fixes\nto students' programs. Formal Methods (FM)-based semantic repair approaches,\ncheck a program's execution against a test suite or reference solution, are\neffective but limited. These tools excel at identifying buggy parts but can\nonly fix programs if the correct implementation and the faulty one share the\nsame control flow graph. Conversely, Large Language Models (LLMs) are used for\nAPR but often make extensive instead of minimal rewrites. This leads to more\ninvasive fixes, making it harder for students to learn from their mistakes. In\nsummary, LLMs excel at completing strings, while FM-based fault localization\nexcel at identifying buggy parts of a program. In this paper, we propose a\nnovel approach that combines the strengths of both FM-based fault localization\nand LLMs, via zero-shot learning, to enhance APR for IPAs. Our method uses\nMaxSAT-based fault localization to identify buggy parts of a program, then\npresents the LLM with a program sketch devoid of these buggy statements. This\nhybrid approach follows a CEGIS loop to iteratively refine the program. We ask\nthe LLM to synthesize the missing parts, which are then checked against a test\nsuite. If the suggested program is incorrect, a counterexample from the test\nsuite is fed back to the LLM. Our experiments show that our counterexample\nguided approach, using MaxSAT-based bug-free program sketches, significantly\nimproves the repair capabilities of all six evaluated LLMs. This method allows\nLLMs to repair more programs with smaller fixes, outperforming other\nconfigurations and state-of-the-art symbolic program repair tools.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2024-12-19T12:08:44+00:00",
"url": "http://arxiv.org/pdf/2502.07786v1",
"resource_uri": "arxiv://2502.07786v1",
"citation_count": 0
},
{
"id": "2412.14598v2",
"title": "Can We Get Rid of Handcrafted Feature Extractors? SparseViT: Nonsemantics-Centered, Parameter-Efficient Image Manipulation Localization through Spare-Coding Transformer",
"authors": [
"Lei Su",
"Xiaochen Ma",
"Xuekang Zhu",
"Chaoqun Niu",
"Zeyu Lei",
"Ji-Zhe Zhou"
],
"abstract": "Non-semantic features or semantic-agnostic features, which are irrelevant to\nimage context but sensitive to image manipulations, are recognized as\nevidential to Image Manipulation Localization (IML). Since manual labels are\nimpossible, existing works rely on handcrafted methods to extract non-semantic\nfeatures. Handcrafted non-semantic features jeopardize IML model's\ngeneralization ability in unseen or complex scenarios. Therefore, for IML, the\nelephant in the room is: How to adaptively extract non-semantic features?\nNon-semantic features are context-irrelevant and manipulation-sensitive. That\nis, within an image, they are consistent across patches unless manipulation\noccurs. Then, spare and discrete interactions among image patches are\nsufficient for extracting non-semantic features. However, image semantics vary\ndrastically on different patches, requiring dense and continuous interactions\namong image patches for learning semantic representations. Hence, in this\npaper, we propose a Sparse Vision Transformer (SparseViT), which reformulates\nthe dense, global self-attention in ViT into a sparse, discrete manner. Such\nsparse self-attention breaks image semantics and forces SparseViT to adaptively\nextract non-semantic features for images. Besides, compared with existing IML\nmodels, the sparse self-attention mechanism largely reduced the model size (max\n80% in FLOPs), achieving stunning parameter efficiency and computation\nreduction. Extensive experiments demonstrate that, without any handcrafted\nfeature extractors, SparseViT is superior in both generalization and efficiency\nacross benchmark datasets.",
"categories": [
"cs.CV"
],
"published": "2024-12-19T07:39:06+00:00",
"url": "http://arxiv.org/pdf/2412.14598v2",
"resource_uri": "arxiv://2412.14598v2",
"citation_count": 0
},
{
"id": "2412.08035v1",
"title": "Scalable, Validated Code Translation of Entire Projects using Large Language Models",
"authors": [
"Hanliang Zhang",
"Cristina David",
"Meng Wang",
"Brandon Paulsen",
"Daniel Kroening"
],
"abstract": "Large language models (LLMs) show promise in code translation due to their\nability to generate idiomatic code. However, a significant limitation when\nusing LLMs for code translation is scalability: existing works have shown a\ndrop in translation success rates for code exceeding around 100 lines. We\novercome this limitation by developing a modular approach to translation, where\nwe partition the code into small code fragments which can be translated\nindependently and semantically validated (that is, checking I/O equivalence).\nWhen this approach is applied naively, we discover that LLMs are unreliable\nwhen translating features of the source language that do not have a direct\nmapping to the target language, and that the LLM often gets stuck in repair\nloops when attempting to fix errors. To address these issues, we introduce two\nkey concepts: (1) feature mapping, which integrates predefined translation\nrules with LLM-based translation to guide the LLM in navigating subtle language\ndifferences and producing semantically accurate code; and (2)\ntype-compatibility, which facilitates localized checks at the function\nsignature level to detect errors early, thereby narrowing the scope of\npotential repairs. We apply our approach to translating real-world Go codebases\nto Rust, demonstrating that we can consistently generate reliable Rust\ntranslations for projects up to 6,600 lines of code and 369 functions, with an\naverage of 73% of functions successfully validated for I/O equivalence,\nconsiderably higher than any existing work.",
"categories": [
"cs.PL",
"cs.SE"
],
"published": "2024-12-11T02:31:46+00:00",
"url": "http://arxiv.org/pdf/2412.08035v1",
"resource_uri": "arxiv://2412.08035v1",
"citation_count": 0
},
{
"id": "2412.08014v2",
"title": "MAGIC: Mastering Physical Adversarial Generation in Context through Collaborative LLM Agents",
"authors": [
"Yun Xing",
"Nhat Chung",
"Jie Zhang",
"Yue Cao",
"Ivor Tsang",
"Yang Liu",
"Lei Ma",
"Qing Guo"
],
"abstract": "Physical adversarial attacks in driving scenarios can expose critical\nvulnerabilities in visual perception models. However, developing such attacks\nremains challenging due to diverse real-world environments and the requirement\nfor maintaining visual naturality. Building upon this challenge, we reformulate\nphysical adversarial attacks as a one-shot patch generation problem. Our\napproach generates adversarial patches through a deep generative model that\nconsiders the specific scene context, enabling direct physical deployment in\nmatching environments. The primary challenge lies in simultaneously achieving\ntwo objectives: generating adversarial patches that effectively mislead object\ndetection systems while determining contextually appropriate deployment within\nthe scene. We propose MAGIC (Mastering Physical Adversarial Generation In\nContext), a novel framework powered by multi-modal LLM agents to address these\nchallenges. MAGIC automatically understands scene context and generates\nadversarial patch through the synergistic interaction of language and vision\ncapabilities. In particular, MAGIC orchestrates three specialized LLM agents:\nThe adv-patch generation agent (GAgent) masters the creation of deceptive\npatches through strategic prompt engineering for text-to-image models. The\nadv-patch deployment agent (DAgent) ensures contextual coherence by determining\noptimal deployment strategies based on scene understanding. The\nself-examination agent (EAgent) completes this trilogy by providing critical\noversight and iterative refinement of both processes. We validate our method on\nboth digital and physical levels, i.e., nuImage and manually captured\nreal-world scenes, where both statistical and visual results prove that our\nMAGIC is powerful and effective for attacking widely applied object detection\nsystems, i.e., YOLO and DETR series.",
"categories": [
"cs.CV",
"cs.AI"
],
"published": "2024-12-11T01:41:19+00:00",
"url": "http://arxiv.org/pdf/2412.08014v2",
"resource_uri": "arxiv://2412.08014v2",
"citation_count": 0
},
{
"id": "2412.05098v1",
"title": "From Defects to Demands: A Unified, Iterative, and Heuristically Guided LLM-Based Framework for Automated Software Repair and Requirement Realization",
"authors": [
"Alex",
"Liu",
"Vivian",
"Chi"
],
"abstract": "This manuscript signals a new era in the integration of artificial\nintelligence with software engineering, placing machines at the pinnacle of\ncoding capability. We present a formalized, iterative methodology proving that\nAI can fully replace human programmers in all aspects of code creation and\nrefinement. Our approach, combining large language models with formal\nverification, test-driven development, and incremental architectural guidance,\nachieves a 38.6% improvement over the current top performer's 48.33% accuracy\non the SWE-bench benchmark. This surpasses previously assumed limits, signaling\nthe end of human-exclusive coding and the rise of autonomous AI-driven software\ninnovation. More than a technical advance, our work challenges centuries-old\nassumptions about human creativity. We provide robust evidence of AI\nsuperiority, demonstrating tangible gains in practical engineering contexts and\nlaying the foundation for a future in which computational creativity outpaces\nhuman ingenuity.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2024-12-06T14:54:21+00:00",
"url": "http://arxiv.org/pdf/2412.05098v1",
"resource_uri": "arxiv://2412.05098v1",
"citation_count": 0
},
{
"id": "2412.03905v3",
"title": "Integrating Various Software Artifacts for Better LLM-based Bug Localization and Program Repair",
"authors": [
"Qiong Feng",
"Xiaotian Ma",
"Jiayi Sheng",
"Ziyuan Feng",
"Wei Song",
"Peng Liang"
],
"abstract": "LLMs have garnered considerable attention for their potential to streamline\nAutomated Program Repair (APR). LLM-based approaches can either insert the\ncorrect code or directly generate patches when provided with buggy methods.\nHowever, most of LLM-based APR methods rely on a single type of software\ninformation, without fully leveraging different software artifacts. Despite\nthis, many LLM-based approaches do not explore which specific types of\ninformation best assist in APR. Addressing this gap is crucial for advancing\nLLM-based APR techniques. We propose DEVLoRe to use issue content (description\nand message) and stack error traces to localize buggy methods, then rely on\ndebug information in buggy methods and issue content and stack error to\nlocalize buggy lines and generate plausible patches which can pass all unit\ntests. The results show that while issue content is particularly effective in\nassisting LLMs with fault localization and program repair, different types of\nsoftware artifacts complement each other. By incorporating different artifacts,\nDEVLoRe successfully locates 49.3% and 47.6% of single and non-single buggy\nmethods and generates 56.0% and 14.5% plausible patches for the Defects4J v2.0\ndataset, respectively. This outperforms current state-of-the-art APR methods.\nFurthermore, we re-implemented and evaluated our framework, demonstrating its\neffectiveness in its effectiveness in resolving 9 unique issues compared to\nother state-of-the-art frameworks using the same or more advanced models on\nSWE-bench Lite.We also discussed whether a leading framework for Python code\ncan be directly applied to Java code, or vice versa. The source code and\nexperimental results of this work for replication are available at\nhttps://github.com/XYZboom/DEVLoRe.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2024-12-05T06:21:31+00:00",
"url": "http://arxiv.org/pdf/2412.03905v3",
"resource_uri": "arxiv://2412.03905v3",
"citation_count": 0
},
{
"id": "2412.01447v1",
"title": "PLD+: Accelerating LLM inference by leveraging Language Model Artifacts",
"authors": [
"Shwetha Somasundaram",
"Anirudh Phukan",
"Apoorv Saxena"
],
"abstract": "To reduce the latency associated with autoretrogressive LLM inference,\nspeculative decoding has emerged as a novel decoding paradigm, where future\ntokens are drafted and verified in parallel. However, the practical deployment\nof speculative decoding is hindered by its requirements for additional\ncomputational resources and fine-tuning, which limits its out-of-the-box\nusability. To address these challenges, we present PLD+, a suite of novel\nalgorithms developed to accelerate the inference process of LLMs, particularly\nfor input-guided tasks. These tasks, which include code editing, text editing,\nsummarization, etc., often feature outputs with substantial overlap with their\ninputs-an attribute PLD+ is designed to exploit. PLD+ also leverages the\nartifacts (attention and hidden states) generated during inference to\naccelerate inference speed. We test our approach on five input-guided tasks and\nthrough extensive experiments we find that PLD+ outperforms all tuning-free\napproaches. In the greedy setting, it even outperforms the state-of-the-art\ntuning-dependent approach EAGLE on four of the tasks. (by a margin of upto 2.31\nin terms of avg. speedup). Our approach is tuning free, does not require any\nadditional compute and can easily be used for accelerating inference of any\nLLM.",
"categories": [
"cs.CL",
"cs.AI"
],
"published": "2024-12-02T12:36:27+00:00",
"url": "http://arxiv.org/pdf/2412.01447v1",
"resource_uri": "arxiv://2412.01447v1",
"citation_count": 0
},
{
"id": "2412.01072v1",
"title": "When Fine-Tuning LLMs Meets Data Privacy: An Empirical Study of Federated Learning in LLM-Based Program Repair",
"authors": [
"Wenqiang Luo",
"Jacky Wai Keung",
"Boyang Yang",
"He Ye",
"Claire Le Goues",
"Tegawende F. Bissyande",
"Haoye Tian",
"Bach Le"
],
"abstract": "Software systems have been evolving rapidly and inevitably introducing bugs\nat an increasing rate, leading to significant losses in resources consumed by\nsoftware maintenance. Recently, large language models (LLMs) have demonstrated\nremarkable potential in enhancing software development and maintenance\npractices, particularly in automated program repair (APR) with improved\naccuracy and efficiency of bug fixing. However, LLM-based APR heavily relies on\nhigh-quality code repositories. A larger portion of existing code repositories\nare for private use and proprietary assets from various industries, reflecting\nmore diversity and nuances in the data since real-world industries often have\nmore extensive software development practices, which cannot be covered by\nmerely public datasets. Therefore, utilizing private datasets shows significant\npotential in enhancing software development and maintenance. However, obtaining\nsuch data from various industries is hindered by data privacy concerns, as\ncompanies are reluctant to share their codebases. To address the gap, we\ninvestigate the use of federated learning as a privacy-preserving approach that\nenables private entities to fine-tune LLMs on proprietary and decentralized\ndata, facilitating the collaboration between clients to fully utilize their\ndata to help enhance software development and maintenance. Our evaluation\nreveals that federated fine-tuning can effectively enhance program repair\ncapabilities. Notably, the impact of heterogeneous code on LLM fine-tuning is\nnegligible, indicating that real-world industries can benefit from\ncollaborative development regardless of diverse data distributions.\nFurthermore, each type of federated algorithm exhibits unique strengths across\ndifferent LLMs, suggesting that fine-tuning for program repair can be enhanced\nby tailoring the optimization process to specific characteristics of different\nLLMs.",
"categories": [
"cs.SE"
],
"published": "2024-12-02T03:18:47+00:00",
"url": "http://arxiv.org/pdf/2412.01072v1",
"resource_uri": "arxiv://2412.01072v1",
"citation_count": 0
},
{
"id": "2412.01007v3",
"title": "CoRNStack: High-Quality Contrastive Data for Better Code Retrieval and Reranking",
"authors": [
"Tarun Suresh",
"Revanth Gangi Reddy",
"Yifei Xu",
"Zach Nussbaum",
"Andriy Mulyar",
"Brandon Duderstadt",
"Heng Ji"
],
"abstract": "Effective code retrieval plays a crucial role in advancing code generation,\nbug fixing, and software maintenance, particularly as software systems increase\nin complexity. While current code embedding models have demonstrated promise in\nretrieving code snippets for small-scale, well-defined tasks, they often\nunderperform in more demanding real-world applications such as bug localization\nwithin GitHub repositories. We hypothesize that a key issue is their reliance\non noisy and inconsistent datasets for training, which impedes their ability to\ngeneralize to more complex retrieval scenarios. To address these limitations,\nwe introduce CoRNStack, a large-scale, high-quality contrastive training\ndataset for code that spans multiple programming languages. This dataset is\ncurated using consistency filtering to eliminate noisy positives and is further\nenriched with mined hard negatives, thereby facilitating more effective\nlearning. We demonstrate that contrastive training of embedding models using\nCoRNStack leads to state-of-the-art performance across a variety of code\nretrieval tasks. Furthermore, the dataset can be leveraged for training code\nreranking models, a largely underexplored area compared to text reranking. Our\nfinetuned code reranking model significantly improves the ranking quality over\nthe retrieved results. Finally, by employing our code retriever and reranker\ntogether, we demonstrate significant improvements in function localization for\nGitHub issues, an important component of real-world software development.",
"categories": [
"cs.CL",
"cs.IR"
],
"published": "2024-12-01T23:54:12+00:00",
"url": "http://arxiv.org/pdf/2412.01007v3",
"resource_uri": "arxiv://2412.01007v3",
"citation_count": 0
},
{
"id": "2411.19152v2",
"title": "Universal approximation of continuous functions with minimal quantum circuits",
"authors": [
"Adrián Pérez-Salinas",
"Mahtab Yaghubi Rad",
"Alice Barthe",
"Vedran Dunjko"
],
"abstract": "The conventional paradigm of quantum computing is discrete: it utilizes\ndiscrete sets of gates to realize bitstring-to-bitstring mappings, some of them\narguably intractable for classical computers. In parameterized quantum\napproaches, widely used in quantum optimization and quantum machine learning,\nthe input becomes continuous and the output represents real-valued functions.\nVarious strategies exist to encode the input into a quantum circuit. While the\nbitstring-to-bitstring universality of quantum computers is quite well\nunderstood, basic questions remained open in the continuous case. For example,\nit was proven that full multivariate function universality requires either (i)\na fixed encoding procedure with a number of qubits scaling as the dimension of\nthe input or (ii) a tunable encoding procedure in single-qubit circuits. This\nreveals a trade-off between the complexity of the data encoding and the qubit\nrequirements. The question of whether universality can be reached with a fixed\nencoding and constantly many qubits has been open for the last five years. In\nthis paper, we answer this remaining fundamental question in the affirmative.\nWe provide a constructive method to approximate arbitrary multivariate\nfunctions using just a single qubit and a fixed-generator parametrization, at\nthe expense of increasing the depth. We also prove universality for a few of\nalternative fixed encoding strategies which may have independent interest. Our\nresults rely on a combination of techniques from harmonic analysis and quantum\nsignal processing.",
"categories": [
"quant-ph"
],
"published": "2024-11-28T13:52:43+00:00",
"url": "http://arxiv.org/pdf/2411.19152v2",
"resource_uri": "arxiv://2411.19152v2",
"citation_count": 0
},
{
"id": "2411.18019v1",
"title": "A Real-World Benchmark for Evaluating Fine-Grained Issue Solving Capabilities of Large Language Models",
"authors": [
"Ruida Hu",
"Chao Peng",
"Jingyi Ren",
"Bo Jiang",
"Xiangxin Meng",
"Qinyun Wu",
"Pengfei Gao",
"Xinchen Wang",
"Cuiyun Gao"
],
"abstract": "Automatically resolving software issues is crucial for software development\nin practice, impacting the software quality and user experience. The process of\nresolving real-world issues encompasses tasks such as question-answering (QA),\nfault localization, and code editing. Existing benchmarks such as HumanEval\nfall short in their ability to assess LLMs' proficiency in solving issues\nwithin a codebase. Although benchmarks like SWE-Bench are designed to evaluate\nthe LLMs' capability to handle real-world GitHub issues, the end-to-end\nevaluation method cannot provide granular insights on the performance of\nsubtasks involved in issue solving. To address existing deficiencies in\nbenchmarking LLMs for practical software engineering tasks, we introduce\nFAUN-Eval, a benchmark specifically designed to evaluate the Fine-grAined issUe\nsolviNg capabilities of LLMs. FAUN-Eval systematically assesses LLMs across\nthree distinct tasks: QA, fault localization, and code editing. This benchmark\nis constructed using a dataset curated from 30 well-known GitHub repositories.\nFor each entry, issue and pull request (PR) pairs are meticulously compiled and\nvalidated using cross-referencing and keyword verification methods. FAUN-Eval\nincludes 300 entries and employs both LLM and manual checks to ensure data\nquality. We evaluate ten LLMs with FAUN-Eval, including four closed-source and\nsix open-source models. Our experimental results reveal several key findings.\nWe find that the top-performing LLMs differ across the different tasks.\nAdditionally, features in issues may lead LLMs to generate incorrect\ninformation. Moreover, models may vary in their proficiency with texts of\ndifferent lengths.",
"categories": [
"cs.SE"
],
"published": "2024-11-27T03:25:44+00:00",
"url": "http://arxiv.org/pdf/2411.18019v1",
"resource_uri": "arxiv://2411.18019v1",
"citation_count": 0
},
{
"id": "2411.17927v1",
"title": "Measuring Emergent Capabilities of LLMs for Software Engineering: How Far Are We?",
"authors": [
"Conor O'Brien",
"Daniel Rodriguez-Cardenas",
"Alejandro Velasco",
"David N. Palacio",
"Denys Poshyvanyk"
],
"abstract": "The adoption of Large Language Models (LLMs) across multiple contexts has\nsparked interest in understanding how scaling model size might lead to\nbehavioral changes, as LLMs can exhibit behaviors not observed in their smaller\ncounterparts. Understanding these emergent capabilities is essential for\nadvancing LLM development and improving their interpretability across diverse\ntasks. However, whether LLMs exhibit true emergence in the context of Software\nEngineering remains an unexplored topic, as most research has focused on NLP\ntasks. In this paper, we investigate the emergence of capabilities in the\ncontext of SE. We propose a model-agnostic pipeline for evaluating this\nphenomenon across three SE tasks: bug fixing, code translation, and commit\nmessage generation. More precisely, for each task, we present a case study\ninstantiating our pipeline to analyze the emergence of capabilities in\nCodeGen1-multi across four scales ranging from 350M to 16.1B parameters. Our\nfindings do not not provide evidence to support the idea of emergent\ncapabilities resulting from scaling the model size in the selected set of\ntasks. We hope our results can pave the way to a more nuanced understanding of\nemergent capabilities of LLMs within the SE domain, guiding future research to\nfocus on task-specific evaluations and the identification of alternative\nfactors contributing to this phenomenon. Our work underscores the importance of\ntask diversity in examining model behaviors and highlights potential\nlimitations in transferring prior understandings of and approaches to emergence\nfrom NLP to Software Engineering.",
"categories": [
"cs.SE"
],
"published": "2024-11-26T22:48:55+00:00",
"url": "http://arxiv.org/pdf/2411.17927v1",
"resource_uri": "arxiv://2411.17927v1",
"citation_count": 0
},
{
"id": "2411.17274v6",
"title": "CleanVul: Automatic Function-Level Vulnerability Detection in Code Commits Using LLM Heuristics",
"authors": [
"Yikun Li",
"Ting Zhang",
"Ratnadira Widyasari",
"Yan Naing Tun",
"Huu Hung Nguyen",
"Tan Bui",
"Ivana Clairine Irsan",
"Yiran Cheng",
"Xiang Lan",
"Han Wei Ang",
"Frank Liauw",
"Martin Weyssow",
"Hong Jin Kang",
"Eng Lieh Ouh",
"Lwin Khin Shar",
"David Lo"
],
"abstract": "Accurate identification of software vulnerabilities is crucial for system\nintegrity. Vulnerability datasets, often derived from the National\nVulnerability Database (NVD) or directly from GitHub, are essential for\ntraining machine learning models to detect these security flaws. However, these\ndatasets frequently suffer from significant noise, typically 40% to 75%, due\nprimarily to the automatic and indiscriminate labeling of all changes in\nvulnerability-fixing commits (VFCs) as vulnerability-related. This\nmisclassification occurs because not all changes in a commit aimed at fixing\nvulnerabilities pertain to security threats; many are routine updates like bug\nfixes or test improvements.\n This paper introduces the first methodology that uses the Large Language\nModel (LLM) with a heuristic enhancement to automatically identify\nvulnerability-fixing changes from VFCs, achieving an F1-score of 0.82.\nVulSifter was applied to a large-scale study, where we conducted a crawl of\n127,063 repositories on GitHub, resulting in the acquisition of 5,352,105\ncommits. VulSifter involves utilizing an LLM to comprehend code semantics and\ncontextual information, while applying heuristics to filter out unrelated\nchanges. We then developed CleanVul, a high-quality dataset comprising 8,203\nfunctions using our LLM heuristic enhancement approach, demonstrating\nCorrectness (90.6%) comparable to established datasets such as SVEN and\nPrimeVul.\n To evaluate the CleanVul dataset, we conducted experiments focusing on\nfine-tuning various LLMs on CleanVul and other high-quality datasets.\nEvaluation results reveal that LLMs fine-tuned on CleanVul not only exhibit\nenhanced accuracy but also superior generalization capabilities compared to\nthose trained on uncleaned datasets. Specifically, models trained on CleanVul\nand tested on PrimeVul achieve accuracy higher than those trained and tested\nexclusively on PrimeVul.",
"categories": [
"cs.SE",
"cs.CR"
],
"published": "2024-11-26T09:51:55+00:00",
"url": "http://arxiv.org/pdf/2411.17274v6",
"resource_uri": "arxiv://2411.17274v6",
"citation_count": 0
},
{
"id": "2411.13587v3",
"title": "Exploring the Adversarial Vulnerabilities of Vision-Language-Action Models in Robotics",
"authors": [
"Taowen Wang",
"Cheng Han",
"James Chenhao Liang",
"Wenhao Yang",
"Dongfang Liu",
"Luna Xinyu Zhang",
"Qifan Wang",
"Jiebo Luo",
"Ruixiang Tang"
],
"abstract": "Recently in robotics, Vision-Language-Action (VLA) models have emerged as a\ntransformative approach, enabling robots to execute complex tasks by\nintegrating visual and linguistic inputs within an end-to-end learning\nframework. While VLA models offer significant capabilities, they also introduce\nnew attack surfaces, making them vulnerable to adversarial attacks. With these\nvulnerabilities largely unexplored, this paper systematically quantifies the\nrobustness of VLA-based robotic systems. Recognizing the unique demands of\nrobotic execution, our attack objectives target the inherent spatial and\nfunctional characteristics of robotic systems. In particular, we introduce two\nuntargeted attack objectives that leverage spatial foundations to destabilize\nrobotic actions, and a targeted attack objective that manipulates the robotic\ntrajectory. Additionally, we design an adversarial patch generation approach\nthat places a small, colorful patch within the camera's view, effectively\nexecuting the attack in both digital and physical environments. Our evaluation\nreveals a marked degradation in task success rates, with up to a 100\\%\nreduction across a suite of simulated robotic tasks, highlighting critical\nsecurity gaps in current VLA architectures. By unveiling these vulnerabilities\nand proposing actionable evaluation metrics, we advance both the understanding\nand enhancement of safety for VLA-based robotic systems, underscoring the\nnecessity for continuously developing robust defense strategies prior to\nphysical-world deployments.",
"categories": [
"cs.RO",
"cs.AI"
],
"published": "2024-11-18T01:52:20+00:00",
"url": "http://arxiv.org/pdf/2411.13587v3",
"resource_uri": "arxiv://2411.13587v3",
"citation_count": 0
},
{
"id": "2411.10890v2",
"title": "An Empirical Investigation on the Challenges in Scientific Workflow Systems Development",
"authors": [
"Khairul Alam",
"Banani Roy",
"Chanchal K. Roy",
"Kartik Mittal"
],
"abstract": "Scientific Workflow Systems (SWSs) are advanced software frameworks that\ndrive modern research by orchestrating complex computational tasks and managing\nextensive data pipelines. These systems offer a range of essential features,\nincluding modularity, abstraction, interoperability, workflow composition\ntools, resource management, error handling, and comprehensive documentation.\nUtilizing these frameworks accelerates the development of scientific computing,\nresulting in more efficient and reproducible research outcomes. However,\ndeveloping a user-friendly, efficient, and adaptable SWS poses several\nchallenges. This study explores these challenges through an in-depth analysis\nof interactions on Stack Overflow (SO) and GitHub, key platforms where\ndevelopers and researchers discuss and resolve issues. In particular, we\nleverage topic modeling (BERTopic) to understand the topics SWSs developers\ndiscuss on these platforms. We identified 10 topics developers discuss on SO\n(e.g., Workflow Creation and Scheduling, Data Structures and Operations,\nWorkflow Execution) and found that workflow execution is the most challenging.\nBy analyzing GitHub issues, we identified 13 topics (e.g., Errors and Bug\nFixing, Documentation, Dependencies) and discovered that data structures and\noperations is the most difficult. We also found common topics between SO and\nGitHub, such as data structures and operations, task management, and workflow\nscheduling. Additionally, we categorized each topic by type (How, Why, What,\nand Others). We observed that the How type consistently dominates across all\ntopics, indicating a need for procedural guidance among developers. The\ndominance of the How type is also evident in domains like Chatbots and Mobile\ndevelopment. Our study will guide future research in proposing tools and\ntechniques to help the community overcome the challenges developers face when\ndeveloping SWSs.",
"categories": [
"cs.SE",
"ACM"
],
"published": "2024-11-16T21:14:11+00:00",
"url": "http://arxiv.org/pdf/2411.10890v2",
"resource_uri": "arxiv://2411.10890v2",
"citation_count": 0
},
{
"id": "2411.10213v1",
"title": "An Empirical Study on LLM-based Agents for Automated Bug Fixing",
"authors": [
"Xiangxin Meng",
"Zexiong Ma",
"Pengfei Gao",
"Chao Peng"
],
"abstract": "Large language models (LLMs) and LLM-based Agents have been applied to fix\nbugs automatically, demonstrating the capability in addressing software defects\nby engaging in development environment interaction, iterative validation and\ncode modification. However, systematic analysis of these agent and non-agent\nsystems remain limited, particularly regarding performance variations among\ntop-performing ones. In this paper, we examine seven proprietary and\nopen-source systems on the SWE-bench Lite benchmark for automated bug fixing.\nWe first assess each system's overall performance, noting instances solvable by\nall or none of these sytems, and explore why some instances are uniquely solved\nby specific system types. We also compare fault localization accuracy at file\nand line levels and evaluate bug reproduction capabilities, identifying\ninstances solvable only through dynamic reproduction. Through analysis, we\nconcluded that further optimization is needed in both the LLM itself and the\ndesign of Agentic flow to improve the effectiveness of the Agent in bug fixing.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2024-11-15T14:19:15+00:00",
"url": "http://arxiv.org/pdf/2411.10213v1",
"resource_uri": "arxiv://2411.10213v1",
"citation_count": 0
},
{
"id": "2411.07586v1",
"title": "A Comprehensive Survey of AI-Driven Advancements and Techniques in Automated Program Repair and Code Generation",
"authors": [
"Avinash Anand",
"Akshit Gupta",
"Nishchay Yadav",
"Shaurya Bajaj"
],
"abstract": "Bug fixing and code generation have been core research topics in software\ndevelopment for many years. The recent explosive growth in Large Language\nModels has completely transformed these spaces, putting in reach incredibly\npowerful tools for both. In this survey, 27 recent papers have been reviewed\nand split into two groups: one dedicated to Automated Program Repair (APR) and\nLLM integration and the other to code generation using LLMs. The first group\nconsists of new methods for bug detection and repair, which include locating\nsemantic errors, security vulnerabilities, and runtime failure bugs. The place\nof LLMs in reducing manual debugging efforts is emphasized in this work by APR\ntoward context-aware fixes, with innovations that boost accuracy and efficiency\nin automatic debugging. The second group dwells on code generation, providing\nan overview of both general-purpose LLMs fine-tuned for programming and\ntask-specific models. It also presents methods to improve code generation, such\nas identifier-aware training, fine-tuning at the instruction level, and\nincorporating semantic code structures. This survey work contrasts the\nmethodologies in APR and code generation to identify trends such as using LLMs,\nfeedback loops to enable iterative code improvement and open-source models. It\nalso discusses the challenges of achieving functional correctness and security\nand outlines future directions for research in LLM-based software development.",
"categories": [
"cs.AI"
],
"published": "2024-11-12T06:47:54+00:00",
"url": "http://arxiv.org/pdf/2411.07586v1",
"resource_uri": "arxiv://2411.07586v1",
"citation_count": 0
},
{
"id": "2411.06767v1",
"title": "PDC & DM-SFT: A Road for LLM SQL Bug-Fix Enhancing",
"authors": [
"Yiwen Duan",
"Yonghong Yu",
"Xiaoming Zhao",
"Yichang Wu",
"Wenbo Liu"
],
"abstract": "Code Large Language Models (Code LLMs), such as Code llama and\nDeepSeek-Coder, have demonstrated exceptional performance in the code\ngeneration tasks. However, most existing models focus on the abilities of\ngenerating correct code, but often struggle with bug repair. We introduce a\nsuit of methods to enhance LLM's SQL bug-fixing abilities. The methods are\nmainly consisted of two parts: A Progressive Dataset Construction (PDC) from\nscratch and Dynamic Mask Supervised Fine-tuning (DM-SFT). PDC proposes two data\nexpansion methods from the perspectives of breadth first and depth first\nrespectively. DM-SFT introduces an efficient bug-fixing supervised learning\napproach, which effectively reduce the total training steps and mitigate the\n\"disorientation\" in SQL code bug-fixing training. In our evaluation, the code\nLLM models trained with two methods have exceeds all current best performing\nmodel which size is much larger.",
"categories": [
"cs.CL",
"cs.AI",
"cs.LG"
],
"published": "2024-11-11T07:47:20+00:00",
"url": "http://arxiv.org/pdf/2411.06767v1",
"resource_uri": "arxiv://2411.06767v1",
"citation_count": 0
},
{
"id": "2411.07794v1",
"title": "Feature Fusion Transferability Aware Transformer for Unsupervised Domain Adaptation",
"authors": [
"Xiaowei Yu",
"Zhe Huang",
"Zao Zhang"
],
"abstract": "Unsupervised domain adaptation (UDA) aims to leverage the knowledge learned\nfrom labeled source domains to improve performance on the unlabeled target\ndomains. While Convolutional Neural Networks (CNNs) have been dominant in\nprevious UDA methods, recent research has shown promise in applying Vision\nTransformers (ViTs) to this task. In this study, we propose a novel Feature\nFusion Transferability Aware Transformer (FFTAT) to enhance ViT performance in\nUDA tasks. Our method introduces two key innovations: First, we introduce a\npatch discriminator to evaluate the transferability of patches, generating a\ntransferability matrix. We integrate this matrix into self-attention, directing\nthe model to focus on transferable patches. Second, we propose a feature fusion\ntechnique to fuse embeddings in the latent space, enabling each embedding to\nincorporate information from all others, thereby improving generalization.\nThese two components work in synergy to enhance feature representation\nlearning. Extensive experiments on widely used benchmarks demonstrate that our\nmethod significantly improves UDA performance, achieving state-of-the-art\n(SOTA) results.",
"categories": [
"cs.CV",
"cs.AI"
],
"published": "2024-11-10T22:23:12+00:00",
"url": "http://arxiv.org/pdf/2411.07794v1",
"resource_uri": "arxiv://2411.07794v1",
"citation_count": 0
},
{
"id": "2411.04300v2",
"title": "Slow Mixing of Quantum Gibbs Samplers",
"authors": [
"David Gamarnik",
"Bobak T. Kiani",
"Alexander Zlokapa"
],
"abstract": "Preparing thermal (Gibbs) states is a common task in physics and computer\nscience. Recent algorithms mimic cooling via system-bath coupling, where the\ncost is determined by mixing time, akin to classical Metropolis-like\nalgorithms. However, few methods exist to demonstrate slow mixing in quantum\nsystems, unlike the well-established classical tools for systems like the Ising\nmodel and constraint satisfaction problems. We present a quantum generalization\nof these tools through a generic bottleneck lemma that implies slow mixing in\nquantum systems. This lemma focuses on quantum measures of distance, analogous\nto the classical Hamming distance but rooted in uniquely quantum principles and\nquantified either through Bohr spectrum jumps or operator locality.\n Using our bottleneck lemma, we establish unconditional lower bounds on the\nmixing times of Gibbs samplers for several families of Hamiltonians at low\ntemperatures. For classical Hamiltonians with mixing time lower bounds\n$T_\\mathrm{mix} = \\exp[\\Omega(n^\\alpha)]$, we prove that quantum Gibbs samplers\nalso have $T_\\mathrm{mix} = \\exp[\\Omega(n^\\alpha)]$. This applies to models\nlike random $K$-SAT instances and spin glasses. For stabilizer Hamiltonians, we\nprovide a concise proof of exponential lower bounds $T_\\mathrm{mix} =\n\\exp[\\Omega(n)]$ on mixing times of good $n$-qubit stabilizer codes at low\nconstant temperature. Finally, we consider constant-degree classical\nHamiltonians and show how to lift classical slow mixing results in the presence\nof a transverse field using Poisson Feynman-Kac techniques. We show generic\nresults for models with linear free energy barriers, and we demonstrate that\nour techniques extend to models with sublinear free energy barriers by proving\n$T_\\mathrm{mix} = \\exp[n^{1/2-o(1)}]$ for the ferromagnetic 2D transverse field\nIsing model.",
"categories": [
"quant-ph",
"math-ph",
"math.MP",
"math.PR"
],
"published": "2024-11-06T22:51:27+00:00",
"url": "http://arxiv.org/pdf/2411.04300v2",
"resource_uri": "arxiv://2411.04300v2",
"citation_count": 0
},
{
"id": "2411.03471v2",
"title": "MetRex: A Benchmark for Verilog Code Metric Reasoning Using LLMs",
"authors": [
"Manar Abdelatty",
"Jingxiao Ma",
"Sherief Reda"
],
"abstract": "Large Language Models (LLMs) have been applied to various hardware design\ntasks, including Verilog code generation, EDA tool scripting, and RTL bug\nfixing. Despite this extensive exploration, LLMs are yet to be used for the\ntask of post-synthesis metric reasoning and estimation of HDL designs. In this\npaper, we assess the ability of LLMs to reason about post-synthesis metrics of\nVerilog designs. We introduce MetRex, a large-scale dataset comprising 25,868\nVerilog HDL designs and their corresponding post-synthesis metrics, namely\narea, delay, and static power. MetRex incorporates a Chain of Thought (CoT)\ntemplate to enhance LLMs' reasoning about these metrics. Extensive experiments\nshow that Supervised Fine-Tuning (SFT) boosts the LLM's reasoning capabilities\non average by 37.0\\%, 25.3\\%, and 25.7\\% on the area, delay, and static power,\nrespectively. While SFT improves performance on our benchmark, it remains far\nfrom achieving optimal results, especially on complex problems. Comparing to\nstate-of-the-art regression models, our approach delivers accurate\npost-synthesis predictions for 17.4\\% more designs (within a 5\\% error margin),\nin addition to offering a 1.7x speedup by eliminating the need for\npre-processing. This work lays the groundwork for advancing LLM-based Verilog\ncode metric reasoning.",
"categories": [
"cs.AR",
"cs.CL"
],
"published": "2024-11-05T19:52:58+00:00",
"url": "http://arxiv.org/pdf/2411.03471v2",
"resource_uri": "arxiv://2411.03471v2",
"citation_count": 0
},
{
"id": "2411.02310v2",
"title": "MdEval: Massively Multilingual Code Debugging",
"authors": [
"Shukai Liu",
"Linzheng Chai",
"Jian Yang",
"Jiajun Shi",
"He Zhu",
"Liran Wang",
"Ke Jin",
"Wei Zhang",
"Hualei Zhu",
"Shuyue Guo",
"Tao Sun",
"Jiaheng Liu",
"Yunlong Duan",
"Yu Hao",
"Liqun Yang",
"Guanglin Niu",
"Ge Zhang",
"Zhoujun Li"
],
"abstract": "Code large language models (LLMs) have made significant progress in code\ndebugging by directly generating the correct code based on the buggy code\nsnippet. Programming benchmarks, typically consisting of buggy code snippet and\ntheir associated test cases, are used to assess the debugging capabilities of\nLLMs. However, many existing benchmarks primarily focus on Python and are often\nlimited in terms of language diversity (e.g., DebugBench and DebugEval). To\nadvance the field of multilingual debugging with LLMs, we propose the first\nmassively multilingual debugging benchmark, which includes 3.6K test samples of\n18 programming languages and covers the automated program repair (APR) task,\nthe code review (CR) task, and the bug identification (BI) task. Further, we\nintroduce the debugging instruction corpora MDEVAL-INSTRUCT by injecting bugs\ninto the correct multilingual queries and solutions (xDebugGen). Further, a\nmultilingual debugger xDebugCoder trained on MDEVAL-INSTRUCT as a strong\nbaseline specifically to handle the bugs of a wide range of programming\nlanguages (e.g. \"Missing Mut\" in language Rust and \"Misused Macro Definition\"\nin language C). Our extensive experiments on MDEVAL reveal a notable\nperformance gap between open-source models and closed-source LLMs (e.g., GPT\nand Claude series), highlighting huge room for improvement in multilingual code\ndebugging scenarios.",
"categories": [
"cs.CL"
],
"published": "2024-11-04T17:36:40+00:00",
"url": "http://arxiv.org/pdf/2411.02310v2",
"resource_uri": "arxiv://2411.02310v2",
"citation_count": 0
},
{
"id": "2411.03346v2",
"title": "Fixing Security Vulnerabilities with AI in OSS-Fuzz",
"authors": [
"Yuntong Zhang",
"Jiawei Wang",
"Dominic Berzin",
"Martin Mirchev",
"Dongge Liu",
"Abhishek Arya",
"Oliver Chang",
"Abhik Roychoudhury"
],
"abstract": "Critical open source software systems undergo significant validation in the\nform of lengthy fuzz campaigns. The fuzz campaigns typically conduct a biased\nrandom search over the domain of program inputs, to find inputs which crash the\nsoftware system. Such fuzzing is useful to enhance the security of software\nsystems in general since even closed source software may use open source\ncomponents. Hence testing open source software is of paramount importance.\nCurrently OSS-Fuzz is the most significant and widely used infrastructure for\ncontinuous validation of open source systems. Unfortunately even though\nOSS-Fuzz has identified more than 10,000 vulnerabilities across 1000 or more\nsoftware projects, the detected vulnerabilities may remain unpatched, as\nvulnerability fixing is often manual in practice. In this work, we rely on the\nrecent progress in Large Language Model (LLM) agents for autonomous program\nimprovement including bug fixing. We customise the well-known AutoCodeRover\nagent for fixing security vulnerabilities. This is because LLM agents like\nAutoCodeRover fix bugs from issue descriptions via code search. Instead for\nsecurity patching, we rely on the test execution of the exploit input to\nextract code elements relevant to the fix. Our experience with OSS-Fuzz\nvulnerability data shows that LLM agent autonomy is useful for successful\nsecurity patching, as opposed to approaches like Agentless where the control\nflow is fixed. More importantly our findings show that we cannot measure\nquality of patches by code similarity of the patch with reference codes (as in\nCodeBLEU scores used in VulMaster), since patches with high CodeBLEU scores\nstill fail to pass given the given exploit input. Our findings indicate that\nsecurity patch correctness needs to consider dynamic attributes like test\nexecutions as opposed to relying of standard text/code similarity metrics.",
"categories": [
"cs.CR",
"cs.SE"
],
"published": "2024-11-03T16:20:32+00:00",
"url": "http://arxiv.org/pdf/2411.03346v2",
"resource_uri": "arxiv://2411.03346v2",
"citation_count": 0
},
{
"id": "2411.00940v3",
"title": "Collider-Flavour Complementarity from the bottom to the top",
"authors": [
"Oliver Atkinson",
"Christoph Englert",
"Matthew Kirk",
"Gilberto Tetlalmatzi-Xolocotzi"
],
"abstract": "Motivated by recently observed anomalies in the flavour sector, we analyse\nthe potential of measurements of top quarks at the Large Hadron Collider (LHC)\nto provide complementary constraints on interactions that shape low-energy\nprecision investigations in the $B$ sector. The measurement of top quark\nproperties, such as the top width and the abundant top pair production\nchannels, are already reaching the percent level at this relatively early stage\nof the LHC phenomenology program. A focused analysis of four-fermion\ninteractions, employing effective field theory without flavour structure\nassumptions and incorporating renormalization group evolution effects, bridges\n$B$ meson scale phenomena with key top quark measurements. We demonstrate that\nthe LHC is increasingly competitive with, and complementary to, flavour physics\nconstraints. Our results, which include a first comprehensive analysis of\nnon-leptonic B decays in this context, suggest that the LHC's top physics\nprogram could serve as a valuable, complementary tool in the search for physics\nbeyond the Standard Model within the flavour sector.",
"categories": [
"hep-ph",
"hep-ex"
],
"published": "2024-11-01T18:00:01+00:00",
"url": "http://arxiv.org/pdf/2411.00940v3",
"resource_uri": "arxiv://2411.00940v3",
"citation_count": 0
},
{
"id": "2410.23331v1",
"title": "Can Models Help Us Create Better Models? Evaluating LLMs as Data Scientists",
"authors": [
"Michał Pietruszka",
"Łukasz Borchmann",
"Aleksander Jędrosz",
"Paweł Morawiecki"
],
"abstract": "We present a benchmark for large language models designed to tackle one of\nthe most knowledge-intensive tasks in data science: writing feature engineering\ncode, which requires domain knowledge in addition to a deep understanding of\nthe underlying problem and data structure. The model is provided with a dataset\ndescription in a prompt and asked to generate code transforming it. The\nevaluation score is derived from the improvement achieved by an XGBoost model\nfit on the modified dataset compared to the original data. By an extensive\nevaluation of state-of-the-art models and comparison to well-established\nbenchmarks, we demonstrate that the FeatEng of our proposal can cheaply and\nefficiently assess the broad capabilities of LLMs, in contrast to the existing\nmethods.",
"categories": [
"cs.CL"
],
"published": "2024-10-30T17:59:01+00:00",
"url": "http://arxiv.org/pdf/2410.23331v1",
"resource_uri": "arxiv://2410.23331v1",
"citation_count": 0
},
{
"id": "2410.23101v2",
"title": "Guided Game Level Repair via Explainable AI",
"authors": [
"Mahsa Bazzaz",
"Seth Cooper"
],
"abstract": "Procedurally generated levels created by machine learning models can be\nunsolvable without further editing. Various methods have been developed to\nautomatically repair these levels by enforcing hard constraints during the\npost-processing step. However, as levels increase in size, these\nconstraint-based repairs become increasingly slow. This paper proposes using\nexplainability methods to identify specific regions of a level that contribute\nto its unsolvability. By assigning higher weights to these regions,\nconstraint-based solvers can prioritize these problematic areas, enabling more\nefficient repairs. Our results, tested across three games, demonstrate that\nthis approach can help to repair procedurally generated levels faster.",
"categories": [
"cs.AI",
"cs.LG"
],
"published": "2024-10-30T15:12:36+00:00",
"url": "http://arxiv.org/pdf/2410.23101v2",
"resource_uri": "arxiv://2410.23101v2",
"citation_count": 0
},
{
"id": "2410.20962v1",
"title": "Combining Logic with Large Language Models for Automatic Debugging and Repair of ASP Programs",
"authors": [
"Ricardo Brancas",
"Vasco Manquinho",
"Ruben Martins"
],
"abstract": "Logic programs are a powerful approach for solving NP-Hard problems. However,\ndue to their declarative nature, debugging logic programs poses significant\nchallenges. Unlike procedural paradigms, which allow for step-by-step\ninspection of program state, logic programs require reasoning about logical\nstatements for fault localization. This complexity is amplified in learning\nenvironments due to students' inexperience.\n We introduce FormHe, a novel tool that combines logic-based techniques and\nLarge Language Models to identify and correct issues in Answer Set Programming\nsubmissions. FormHe consists of two components: a fault localization module and\na program repair module. First, the fault localizer identifies a set of faulty\nprogram statements requiring modification. Subsequently, FormHe employs program\nmutation techniques and Large Language Models to repair the flawed ASP program.\nThese repairs can then serve as guidance for students to correct their\nprograms.\n Our experiments with real buggy programs submitted by students show that\nFormHe accurately detects faults in 94% of cases and successfully repairs 58%\nof incorrect submissions.",
"categories": [
"cs.SE",
"cs.LO"
],
"published": "2024-10-28T12:30:48+00:00",
"url": "http://arxiv.org/pdf/2410.20962v1",
"resource_uri": "arxiv://2410.20962v1",
"citation_count": 0
},
{
"id": "2410.18582v1",
"title": "LLM-Aided Efficient Hardware Design Automation",
"authors": [
"Kangwei Xu",
"Ruidi Qiu",
"Zhuorui Zhao",
"Grace Li Zhang",
"Ulf Schlichtmann",
"Bing Li"
],
"abstract": "With the rapidly increasing complexity of modern chips, hardware engineers\nare required to invest more effort in tasks such as circuit design,\nverification, and physical implementation. These workflows often involve\ncontinuous modifications, which are labor-intensive and prone to errors.\nTherefore, there is an increasing need for more efficient and cost-effective\nElectronic Design Automation (EDA) solutions to accelerate new hardware\ndevelopment. Recently, large language models (LLMs) have made significant\nadvancements in contextual understanding, logical reasoning, and response\ngeneration. Since hardware designs and intermediate scripts can be expressed in\ntext format, it is reasonable to explore whether integrating LLMs into EDA\ncould simplify and fully automate the entire workflow. Accordingly, this paper\ndiscusses such possibilities in several aspects, covering hardware description\nlanguage (HDL) generation, code debugging, design verification, and physical\nimplementation. Two case studies, along with their future outlook, are\nintroduced to highlight the capabilities of LLMs in code repair and testbench\ngeneration. Finally, future directions and challenges are highlighted to\nfurther explore the potential of LLMs in shaping the next-generation EDA",
"categories": [
"eess.SY",
"cs.SY"
],
"published": "2024-10-24T09:35:21+00:00",
"url": "http://arxiv.org/pdf/2410.18582v1",
"resource_uri": "arxiv://2410.18582v1",
"citation_count": 0
},
{
"id": "2410.18241v1",
"title": "Characterising Open Source Co-opetition in Company-hosted Open Source Software Projects: The Cases of PyTorch, TensorFlow, and Transformers",
"authors": [
"Cailean Osborne",
"Farbod Daneshyan",
"Runzhi He",
"Hengzhi Ye",
"Yuxia Zhang",
"Minghui Zhou"
],
"abstract": "Companies, including market rivals, have long collaborated on the development\nof open source software (OSS), resulting in a tangle of co-operation and\ncompetition known as \"open source co-opetition\". While prior work investigates\nopen source co-opetition in OSS projects that are hosted by vendor-neutral\nfoundations, we have a limited understanding thereof in OSS projects that are\nhosted and governed by one company. Given their prevalence, it is timely to\ninvestigate open source co-opetition in such contexts. Towards this end, we\nconduct a mixed-methods analysis of three company-hosted OSS projects in the\nartificial intelligence (AI) industry: Meta's PyTorch (prior to its donation to\nthe Linux Foundation), Google's TensorFlow, and Hugging Face's Transformers. We\ncontribute three key findings. First, while the projects exhibit similar code\nauthorship patterns between host and external companies (80%/20% of commits),\ncollaborations are structured differently (e.g., decentralised vs.\nhub-and-spoke networks). Second, host and external companies engage in\nstrategic, non-strategic, and contractual collaborations, with varying\nincentives and collaboration practices. Some of the observed collaborations are\nspecific to the AI industry (e.g., hardware-software optimizations or AI model\nintegrations), while others are typical of the broader software industry (e.g.,\nbug fixing or task outsourcing). Third, single-vendor governance creates a\npower imbalance that influences open source co-opetition practices and\npossibilities, from the host company's singular decision-making power (e.g.,\nthe risk of license change) to their community involvement strategy (e.g., from\nover-control to over-delegation). We conclude with recommendations for future\nresearch.",
"categories": [
"cs.SE",
"cs.AI",
"cs.CY"
],
"published": "2024-10-23T19:35:41+00:00",
"url": "http://arxiv.org/pdf/2410.18241v1",
"resource_uri": "arxiv://2410.18241v1",
"citation_count": 0
},
{
"id": "2410.17820v2",
"title": "Understanding When Tree of Thoughts Succeeds: Larger Models Excel in Generation, Not Discrimination",
"authors": [
"Qiqi Chen",
"Xinpeng Wang",
"Philipp Mondorf",
"Michael A. Hedderich",
"Barbara Plank"
],
"abstract": "Tree of Thoughts (ToT) is a reasoning strategy for Large Language Models\n(LLMs) that employs a generator to suggest reasoning steps and a discriminator\nto decide which steps to implement. ToT demonstrates strong performance on\nreasoning tasks, often surpassing simple methods such as Input-Output (IO)\nprompting and Chain-of-Thought (CoT) reasoning. However, ToT does not\nconsistently outperform such simpler methods across all models, leaving large\nknowledge gaps on the conditions under which ToT is most beneficial. In this\npaper, we analyze the roles of the generator and discriminator separately to\nbetter understand the conditions when ToT is beneficial. We find that the\ngenerator plays a more critical role than the discriminator in driving the\nsuccess of ToT. Scaling the generator leads to notable improvements in ToT\nperformance, even when using a smaller model as the discriminator, whereas\nscaling the discriminator with a fixed generator yields only marginal gains.\nOur results show that models across different scales exhibit comparable\ndiscrimination capabilities, yet differ significantly in their generative\nperformance for ToT.",
"categories": [
"cs.CL"
],
"published": "2024-10-23T12:26:10+00:00",
"url": "http://arxiv.org/pdf/2410.17820v2",
"resource_uri": "arxiv://2410.17820v2",
"citation_count": 0
},
{
"id": "2410.16655v1",
"title": "Semantic-guided Search for Efficient Program Repair with Large Language Models",
"authors": [
"Thanh Le-Cong",
"Bach Le",
"Toby Murray"
],
"abstract": "In this paper, we first show that increases in beam size of even just\nsmall-sized LLM (1B-7B parameters) require an extensive GPU resource\nconsumption, leading to up to 80% of recurring crashes due to memory overloads\nin LLM-based APR. Seemingly simple solutions to reduce memory consumption are\n(1) to quantize LLM models, i.e., converting the weights of a LLM from\nhigh-precision values to lower-precision ones. and (2) to make beam search\nsequential, i.e., forwarding each beam through the model sequentially and then\nconcatenate them back into a single model output. However, we show that these\napproaches still do not work via both theoretical analysis and experiments. To\naddress this, we introduce FLAMES, a novel LLM-based APR technique that employs\nsemantic-guided patch generation to enhance repair effectiveness and memory\nefficiency. Unlike conventional methods that rely on beam search, FLAMES\nutilizes greedy decoding to enhance memory efficiency while steering the search\nto more potentially good repair candidates via a semantic-guided best-first\nsearch algorithm. At each decoding step, FLAMES uses semantic feedback from\ntest validation such as the number of passing and failing test cases to select\nthe most promising token to explore further. Our empirical evaluation on the\nDefects4J and HumanEval-Java datasets shows that FLAMES not only substantially\nreduces memory consumption by up to 83% compared to conventional LLM-based APR,\nbut also accelerates the repair process. Remarkably, FLAMES successfully\ngenerated 133 and 103 correct fixes for 333 and 163 bugs in the Defects4J and\nHumanEval-Java datasets, respectively. This suggests that FLAMES is not only\nmore efficient but also outperforms state-of-the-art techniques, fixing at\nleast 10 and 11 more bugs than SOTA baselines in the Defects4J and\nHumanEval-Java datasets, respectively.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2024-10-22T02:59:47+00:00",
"url": "http://arxiv.org/pdf/2410.16655v1",
"resource_uri": "arxiv://2410.16655v1",
"citation_count": 0
},
{
"id": "2410.16447v2",
"title": "How much secure randomness is in a quantum state?",
"authors": [
"Kriss Gutierrez Anco",
"Tristan Nemoz",
"Peter Brown"
],
"abstract": "How much cryptographically-secure randomness can be extracted from a quantum\nstate? This fundamental question probes the absolute limits of quantum random\nnumber generation (QRNG) and yet, despite the technological maturity of QRNGs,\nit remains unsolved. In this work we consider a general adversarial model that\nallows for an adversary who has quantum side-information about both the source\nand the measurement device. Using links between randomness extraction rates and\nsandwiched R\\'enyi entropies, we provide compact, easy to compute, achievable\nrates of secure randomness extraction from quantum states. In turn, this\nprovides a simple to evaluate benchmarking tool for the randomness generation\nrates of QRNG protocols.",
"categories": [
"quant-ph"
],
"published": "2024-10-21T19:16:56+00:00",
"url": "http://arxiv.org/pdf/2410.16447v2",
"resource_uri": "arxiv://2410.16447v2",
"citation_count": 0
},
{
"id": "2410.15899v2",
"title": "On the Design and Performance of Machine Learning Based Error Correcting Decoders",
"authors": [
"Yuncheng Yuan",
"Péter Scheepers",
"Lydia Tasiou",
"Yunus Can Gültekin",
"Federico Corradi",
"Alex Alvarado"
],
"abstract": "This paper analyzes the design and competitiveness of four neural network\n(NN) architectures recently proposed as decoders for forward error correction\n(FEC) codes. We first consider the so-called single-label neural network (SLNN)\nand the multi-label neural network (MLNN) decoders which have been reported to\nachieve near maximum likelihood (ML) performance. Here, we show analytically\nthat SLNN and MLNN decoders can always achieve ML performance, regardless of\nthe code dimensions -- although at the cost of computational complexity -- and\nno training is in fact required. We then turn our attention to two\ntransformer-based decoders: the error correction code transformer (ECCT) and\nthe cross-attention message passing transformer (CrossMPT). We compare their\nperformance against traditional decoders, and show that ordered statistics\ndecoding outperforms these transformer-based decoders. The results in this\npaper cast serious doubts on the application of NN-based FEC decoders in the\nshort and medium block length regime.",
"categories": [
"eess.SP",
"cs.LG"
],
"published": "2024-10-21T11:23:23+00:00",
"url": "http://arxiv.org/pdf/2410.15899v2",
"resource_uri": "arxiv://2410.15899v2",
"citation_count": 0
},
{
"id": "2410.14393v1",
"title": "Debug Smarter, Not Harder: AI Agents for Error Resolution in Computational Notebooks",
"authors": [
"Konstantin Grotov",
"Artem Borzilov",
"Maksim Krivobok",
"Timofey Bryksin",
"Yaroslav Zharov"
],
"abstract": "Computational notebooks became indispensable tools for research-related\ndevelopment, offering unprecedented interactivity and flexibility in the\ndevelopment process. However, these benefits come at the cost of\nreproducibility and an increased potential for bugs. With the rise of\ncode-fluent Large Language Models empowered with agentic techniques, smart\nbug-fixing tools with a high level of autonomy have emerged. However, those\ntools are tuned for classical script programming and still struggle with\nnon-linear computational notebooks. In this paper, we present an AI agent\ndesigned specifically for error resolution in a computational notebook. We have\ndeveloped an agentic system capable of exploring a notebook environment by\ninteracting with it -- similar to how a user would -- and integrated the system\ninto the JetBrains service for collaborative data science called Datalore. We\nevaluate our approach against the pre-existing single-action solution by\ncomparing costs and conducting a user study. Users rate the error resolution\ncapabilities of the agentic system higher but experience difficulties with UI.\nWe share the results of the study and consider them valuable for further\nimproving user-agent collaboration.",
"categories": [
"cs.LG",
"cs.AI"
],
"published": "2024-10-18T11:55:34+00:00",
"url": "http://arxiv.org/pdf/2410.14393v1",
"resource_uri": "arxiv://2410.14393v1",
"citation_count": 0
},
{
"id": "2410.11300v1",
"title": "Instructive Code Retriever: Learn from Large Language Model's Feedback for Code Intelligence Tasks",
"authors": [
"Jiawei Lu",
"Haoye Wang",
"Zhongxin Liu",
"Keyu Liang",
"Lingfeng Bao",
"Xiaohu Yang"
],
"abstract": "Recent studies proposed to leverage large language models (LLMs) with\nIn-Context Learning (ICL) to handle code intelligence tasks without\nfine-tuning. ICL employs task instructions and a set of examples as\ndemonstrations to guide the model in generating accurate answers without\nupdating its parameters. While ICL has proven effective for code intelligence\ntasks, its performance heavily relies on the selected examples. Previous work\nhas achieved some success in using BM25 to retrieve examples for code\nintelligence tasks. However, existing approaches lack the ability to understand\nthe semantic and structural information of queries, resulting in less helpful\ndemonstrations. Moreover, they do not adapt well to the complex and dynamic\nnature of user queries in diverse domains. In this paper, we introduce a novel\napproach named Instructive Code Retriever (ICR), which is designed to retrieve\nexamples that enhance model inference across various code intelligence tasks\nand datasets. We enable ICR to learn the semantic and structural information of\nthe corpus by a tree-based loss function. To better understand the correlation\nbetween queries and examples, we incorporate the feedback from LLMs to guide\nthe training of the retriever. Experimental results demonstrate that our\nretriever significantly outperforms state-of-the-art approaches. We evaluate\nour model's effectiveness on various tasks, i.e., code summarization, program\nsynthesis, and bug fixing. Compared to previous state-of-the-art algorithms,\nour method achieved improvements of 50.0% and 90.0% in terms of BLEU-4 for two\ncode summarization datasets, 74.6% CodeBLEU on program synthesis dataset, and\nincreases of 3.6 and 3.2 BLEU-4 on two bug fixing datasets.",
"categories": [
"cs.SE"
],
"published": "2024-10-15T05:44:00+00:00",
"url": "http://arxiv.org/pdf/2410.11300v1",
"resource_uri": "arxiv://2410.11300v1",
"citation_count": 0
},
{
"id": "2410.09997v1",
"title": "Collu-Bench: A Benchmark for Predicting Language Model Hallucinations in Code",
"authors": [
"Nan Jiang",
"Qi Li",
"Lin Tan",
"Tianyi Zhang"
],
"abstract": "Despite their success, large language models (LLMs) face the critical\nchallenge of hallucinations, generating plausible but incorrect content. While\nmuch research has focused on hallucinations in multiple modalities including\nimages and natural language text, less attention has been given to\nhallucinations in source code, which leads to incorrect and vulnerable code\nthat causes significant financial loss. To pave the way for research in LLMs'\nhallucinations in code, we introduce Collu-Bench, a benchmark for predicting\ncode hallucinations of LLMs across code generation (CG) and automated program\nrepair (APR) tasks. Collu-Bench includes 13,234 code hallucination instances\ncollected from five datasets and 11 diverse LLMs, ranging from open-source\nmodels to commercial ones. To better understand and predict code\nhallucinations, Collu-Bench provides detailed features such as the per-step log\nprobabilities of LLMs' output, token types, and the execution feedback of LLMs'\ngenerated code for in-depth analysis. In addition, we conduct experiments to\npredict hallucination on Collu-Bench, using both traditional machine learning\ntechniques and neural networks, which achieves 22.03 -- 33.15% accuracy. Our\nexperiments draw insightful findings of code hallucination patterns, reveal the\nchallenge of accurately localizing LLMs' hallucinations, and highlight the need\nfor more sophisticated techniques.",
"categories": [
"cs.SE",
"cs.AI",
"cs.CL"
],
"published": "2024-10-13T20:41:47+00:00",
"url": "http://arxiv.org/pdf/2410.09997v1",
"resource_uri": "arxiv://2410.09997v1",
"citation_count": 0
},
{
"id": "2410.09174v1",
"title": "Context-Aware SQL Error Correction Using Few-Shot Learning -- A Novel Approach Based on NLQ, Error, and SQL Similarity",
"authors": [
"Divyansh Jain",
"Eric Yang"
],
"abstract": "In recent years, the demand for automated SQL generation has increased\nsignificantly, driven by the need for efficient data querying in various\napplications. However, generating accurate SQL queries remains a challenge due\nto the complexity and variability of natural language inputs. This paper\nintroduces a novel few-shot learning-based approach for error correction in SQL\ngeneration, enhancing the accuracy of generated queries by selecting the most\nsuitable few-shot error correction examples for a given natural language\nquestion (NLQ). In our experiments with the open-source Gretel dataset, the\nproposed model offers a 39.2% increase in fixing errors from the baseline\napproach with no error correction and a 10% increase from a simple error\ncorrection method. The proposed technique leverages embedding-based similarity\nmeasures to identify the closest matches from a repository of few-shot\nexamples. Each example comprises an incorrect SQL query, the resulting error,\nthe correct SQL query, and detailed steps to transform the incorrect query into\nthe correct one. By employing this method, the system can effectively guide the\ncorrection of errors in newly generated SQL queries. Our approach demonstrates\nsignificant improvements in SQL generation accuracy by providing contextually\nrelevant examples that facilitate error identification and correction. The\nexperimental results highlight the effectiveness of embedding-based selection\nin enhancing the few-shot learning process, leading to more precise and\nreliable SQL query generation. This research contributes to the field of\nautomated SQL generation by offering a robust framework for error correction,\npaving the way for more advanced and user-friendly database interaction tools.",
"categories": [
"cs.CL"
],
"published": "2024-10-11T18:22:08+00:00",
"url": "http://arxiv.org/pdf/2410.09174v1",
"resource_uri": "arxiv://2410.09174v1",
"citation_count": 0
},
{
"id": "2410.08806v1",
"title": "Don't Transform the Code, Code the Transforms: Towards Precise Code Rewriting using LLMs",
"authors": [
"Chris Cummins",
"Volker Seeker",
"Jordi Armengol-Estapé",
"Aram H. Markosyan",
"Gabriel Synnaeve",
"Hugh Leather"
],
"abstract": "Tools for rewriting, refactoring and optimizing code should be fast and\ncorrect. Large language models (LLMs), by their nature, possess neither of\nthese qualities. Yet, there remains tremendous opportunity in using LLMs to\nimprove code.\n We explore the use of LLMs not to transform code, but to code transforms. We\npropose a chain-of-thought approach to synthesizing code transformations from a\nsmall number of input/output code examples that incorporates execution and\nfeedback. Unlike the direct rewrite approach, LLM-generated transformations are\neasy to inspect, debug, and validate. The logic of the rewrite is explicitly\ncoded and easy to adapt. The compute required to run code transformations is\nminute compared to that of LLM rewriting.\n We test our approach on 16 Python code transformations and find that LLM-\ngenerated transforms are perfectly precise for 7 of them and less imprecise\nthan direct LLM rewriting on the others. We hope to encourage further research\nto improving the precision of LLM code rewriting.",
"categories": [
"cs.LG"
],
"published": "2024-10-11T13:45:16+00:00",
"url": "http://arxiv.org/pdf/2410.08806v1",
"resource_uri": "arxiv://2410.08806v1",
"citation_count": 0
},
{
"id": "2410.21285v1",
"title": "FastFixer: An Efficient and Effective Approach for Repairing Programming Assignments",
"authors": [
"Fang Liu",
"Zhenwei Liu",
"Qianhui Zhao",
"Jing Jiang",
"Li Zhang",
"Ge Li",
"Zian Sun",
"Zhongqi Li",
"Yuchi Ma"
],
"abstract": "Providing personalized and timely feedback for student's programming\nassignments is useful for programming education. Automated program repair (APR)\ntechniques have been used to fix the bugs in programming assignments, where the\nLarge Language Models (LLMs) based approaches have shown promising results.\nGiven the growing complexity of identifying and fixing bugs in advanced\nprogramming assignments, current fine-tuning strategies for APR are inadequate\nin guiding the LLM to identify bugs and make accurate edits during the\ngenerative repair process. Furthermore, the autoregressive decoding approach\nemployed by the LLM could potentially impede the efficiency of the repair,\nthereby hindering the ability to provide timely feedback. To tackle these\nchallenges, we propose FastFixer, an efficient and effective approach for\nprogramming assignment repair. To assist the LLM in accurately identifying and\nrepairing bugs, we first propose a novel repair-oriented fine-tuning strategy,\naiming to enhance the LLM's attention towards learning how to generate the\nnecessary patch and its associated context. Furthermore, to speed up the patch\ngeneration, we propose an inference acceleration approach that is specifically\ntailored for the program repair task. The evaluation results demonstrate that\nFastFixer obtains an overall improvement of 20.46% in assignment fixing when\ncompared to the state-of-the-art baseline. Considering the repair efficiency,\nFastFixer achieves a remarkable inference speedup of 16.67 times compared to\nthe autoregressive decoding algorithm.",
"categories": [
"cs.CY",
"cs.SE"
],
"published": "2024-10-11T10:17:02+00:00",
"url": "http://arxiv.org/pdf/2410.21285v1",
"resource_uri": "arxiv://2410.21285v1",
"citation_count": 0
},
{
"id": "2410.07516v2",
"title": "Exploring and Lifting the Robustness of LLM-powered Automated Program Repair with Metamorphic Testing",
"authors": [
"Pengyu Xue",
"Linhao Wu",
"Zhen Yang",
"Zhongxing Yu",
"Zhi Jin",
"Ge Li",
"Yan Xiao",
"Shuo Liu",
"Xinyi Li",
"Hongyi Lin",
"Jingwen Wu"
],
"abstract": "In recent years, Large language model-powered Automated Program Repair (LAPR)\ntechniques have achieved state-of-the-art bug-fixing performance and have been\npervasively applied and studied in both industry and academia. Nonetheless,\nLLMs were proved to be highly sensitive to input prompts, with slight\ndifferences in the expressions of semantically equivalent programs potentially\ncausing repair failures. Therefore, it is crucial to conduct robustness testing\non LAPR techniques before their practical deployment. However, related research\nis scarce. To this end, we propose MT-LAPR, a Metamorphic Testing framework\nexclusively for LAPR techniques, which summarizes nine widely-recognized\nMetamorphic Relations (MRs) by developers across three perturbation levels:\ntoken, statement, and block. Afterward, our proposed MRs are applied to buggy\ncodes to generate test cases, which are semantically equivalent yet to affect\nthe inference of LAPR. Experiments are carried out on two extensively examined\nbug-fixing datasets, i.e., Defect4J and QuixBugs, and four bug-fixing abled\nLLMs released recently, demonstrating that 34.4% - 48.5% of the test cases\nexpose the instability of LAPR techniques on average, showing the effectiveness\nof MT-LAPR and uncovering a positive correlation between code readability and\nthe robustness of LAPR techniques. Inspired by the above findings, this paper\nuses the test cases generated by MT-LAPR as samples to train a CodeT5-based\ncode editing model aiming at improving code readability and then embeds it into\nthe LAPR workflow as a data preprocessing step. Extensive experiments\ndemonstrate that this approach significantly enhances the robustness of LAPR by\n49.32% at most.",
"categories": [
"cs.SE"
],
"published": "2024-10-10T01:14:58+00:00",
"url": "http://arxiv.org/pdf/2410.07516v2",
"resource_uri": "arxiv://2410.07516v2",
"citation_count": 0
},
{
"id": "2410.07356v1",
"title": "Optimizing High-Level Synthesis Designs with Retrieval-Augmented Large Language Models",
"authors": [
"Haocheng Xu",
"Haotian Hu",
"Sitao Huang"
],
"abstract": "High-level synthesis (HLS) allows hardware designers to create hardware\ndesigns with high-level programming languages like C/C++/OpenCL, which greatly\nimproves hardware design productivity. However, existing HLS flows require\nprogrammers' hardware design expertise and rely on programmers' manual code\ntransformations and directive annotations to guide compiler optimizations.\nOptimizing HLS designs requires non-trivial HLS expertise and tedious iterative\nprocess in HLS code optimization. Automating HLS code optimizations has become\na burning need. Recently, large language models (LLMs) trained on massive code\nand programming tasks have demonstrated remarkable proficiency in comprehending\ncode, showing the ability to handle domain-specific programming queries\ndirectly without labor-intensive fine-tuning. In this work, we propose a novel\nretrieval-augmented LLM-based approach to effectively optimize high-level\nsynthesis (HLS) programs. Our proposed method leverages few-shot learning,\nenabling large language models to adopt domain-specific knowledge through\nnatural language prompts. We propose a unique framework, Retrieve Augmented\nLarge Language Model Aided Design (RALAD), designed to enhance LLMs'\nperformance in HLS code optimization tasks. RALAD employs advanced embedding\ntechniques and top-\\emph{k} search algorithms to dynamically source relevant\nknowledge from extensive databases, thereby providing contextually appropriate\nresponses to complex programming queries. Our implementation of RALAD on two\nspecialized domains, utilizing comparatively smaller language models, achieves\nan impressive 80\\% success rate in compilation tasks and outperforms general\nLLMs by 3.7 -- 19$\\times$ in latency improvement.",
"categories": [
"cs.AR",
"cs.PL"
],
"published": "2024-10-09T18:11:14+00:00",
"url": "http://arxiv.org/pdf/2410.07356v1",
"resource_uri": "arxiv://2410.07356v1",
"citation_count": 0
},
{
"id": "2410.07002v3",
"title": "CursorCore: Assist Programming through Aligning Anything",
"authors": [
"Hao Jiang",
"Qi Liu",
"Rui Li",
"Shengyu Ye",
"Shijin Wang"
],
"abstract": "Large language models have been successfully applied to programming\nassistance tasks, such as code completion, code insertion, and instructional\ncode editing. However, these applications remain insufficiently automated and\nstruggle to effectively integrate various types of information during the\nprogramming process, including coding history, current code, and user\ninstructions. In this work, we propose a new conversational framework that\ncomprehensively integrates these information sources, collect data to train our\nmodels and evaluate their performance. Firstly, to thoroughly evaluate how well\nmodels align with different types of information and the quality of their\noutputs, we introduce a new benchmark, APEval (Assist Programming Eval), to\ncomprehensively assess the performance of models in programming assistance\ntasks. Then, for data collection, we develop a data generation pipeline,\nProgramming-Instruct, which synthesizes training data from diverse sources,\nsuch as GitHub and online judge platforms. This pipeline can automatically\ngenerate various types of messages throughout the programming process. Finally,\nusing this pipeline, we generate 219K samples, fine-tune multiple models, and\ndevelop the CursorCore series. We show that CursorCore outperforms other models\nof comparable size. This framework unifies applications such as inline chat and\nautomated editing, contributes to the advancement of coding assistants. Code,\nmodels and data are freely available at\nhttps://github.com/TechxGenus/CursorCore.",
"categories": [
"cs.CL",
"cs.AI",
"cs.SE"
],
"published": "2024-10-09T15:45:52+00:00",
"url": "http://arxiv.org/pdf/2410.07002v3",
"resource_uri": "arxiv://2410.07002v3",
"citation_count": 0
},
{
"id": "2410.06440v1",
"title": "Checker Bug Detection and Repair in Deep Learning Libraries",
"authors": [
"Nima Shiri Harzevili",
"Mohammad Mahdi Mohajer",
"Jiho Shin",
"Moshi Wei",
"Gias Uddin",
"Jinqiu Yang",
"Junjie Wang",
"Song Wang",
"Zhen Ming",
"Jiang",
"Nachiappan Nagappan"
],
"abstract": "Checker bugs in Deep Learning (DL) libraries are critical yet not\nwell-explored. These bugs are often concealed in the input validation and\nerror-checking code of DL libraries and can lead to silent failures, incorrect\nresults, or unexpected program behavior in DL applications. Despite their\npotential to significantly impact the reliability and performance of DL-enabled\nsystems built with these libraries, checker bugs have received limited\nattention.\n We present the first comprehensive study of DL checker bugs in two\nwidely-used DL libraries, i.e., TensorFlow and PyTorch. Initially, we\nautomatically collected a dataset of 2,418 commits from TensorFlow and PyTorch\nrepositories on GitHub from Sept. 2016 to Dec. 2023 using specific keywords\nrelated to checker bugs. Through manual inspection, we identified 527 DL\nchecker bugs. Subsequently, we analyzed these bugs from three perspectives,\ni.e., root causes, symptoms, and fixing patterns. Using the knowledge gained\nvia root cause analysis of checker bugs, we further propose TensorGuard, a\nproof-of-concept RAG-based LLM-based tool to detect and fix checker bugs in DL\nlibraries via prompt engineering a series of ChatGPT prompts. We evaluated\nTensorGuard's performance on a test dataset that includes 92 buggy and 135\nclean checker-related changes in TensorFlow and PyTorch from January 2024 to\nJuly 2024. Our results demonstrate that TensorGuard has high average recall\n(94.51\\%) using Chain of Thought prompting, a balanced performance between\nprecision and recall using Zero-Shot prompting and Few-Shot prompting\nstrategies. In terms of patch generation, TensorGuard achieves an accuracy of\n11.1\\%, which outperforms the state-of-the-art bug repair baseline by 2\\%. We\nhave also applied TensorGuard on the latest six months' checker-related changes\n(493 changes) of the JAX library from Google, which resulted in the detection\nof 64 new checker bugs.",
"categories": [
"cs.SE"
],
"published": "2024-10-09T00:48:12+00:00",
"url": "http://arxiv.org/pdf/2410.06440v1",
"resource_uri": "arxiv://2410.06440v1",
"citation_count": 0
},
{
"id": "2410.06351v1",
"title": "Moving Faster and Reducing Risk: Using LLMs in Release Deployment",
"authors": [
"Rui Abreu",
"Vijayaraghavan Murali",
"Peter C Rigby",
"Chandra Maddila",
"Weiyan Sun",
"Jun Ge",
"Kaavya Chinniah",
"Audris Mockus",
"Megh Mehta",
"Nachiappan Nagappan"
],
"abstract": "Release engineering has traditionally focused on continuously delivering\nfeatures and bug fixes to users, but at a certain scale, it becomes impossible\nfor a release engineering team to determine what should be released. At Meta's\nscale, the responsibility appropriately and necessarily falls back on the\nengineer writing and reviewing the code. To address this challenge, we\ndeveloped models of diff risk scores (DRS) to determine how likely a diff is to\ncause a SEV, i.e., a severe fault that impacts end-users. Assuming that SEVs\nare only caused by diffs, a naive model could randomly gate X% of diffs from\nlanding, which would automatically catch X% of SEVs on average. However, we\naimed to build a model that can capture Y% of SEVs by gating X% of diffs, where\nY >> X. By training the model on historical data on diffs that have caused SEVs\nin the past, we can predict the riskiness of an outgoing diff to cause a SEV.\nDiffs that are beyond a particular threshold of risk can then be gated. We have\nfour types of gating: no gating (green), weekend gating (weekend), medium\nimpact on end-users (yellow), and high impact on end-users (red). The input\nparameter for our models is the level of gating, and the outcome measure is the\nnumber of captured SEVs. Our research approaches include a logistic regression\nmodel, a BERT-based model, and generative LLMs. Our baseline regression model\ncaptures 18.7%, 27.9%, and 84.6% of SEVs while respectively gating the top 5%\n(weekend), 10% (yellow), and 50% (red) of risky diffs. The BERT-based model,\nStarBERT, only captures 0.61x, 0.85x, and 0.81x as many SEVs as the logistic\nregression for the weekend, yellow, and red gating zones, respectively. The\ngenerative LLMs, iCodeLlama-34B and iDiffLlama-13B, when risk-aligned, capture\nmore SEVs than the logistic regression model in production: 1.40x, 1.52x,\n1.05x, respectively.",
"categories": [
"cs.SE"
],
"published": "2024-10-08T20:40:38+00:00",
"url": "http://arxiv.org/pdf/2410.06351v1",
"resource_uri": "arxiv://2410.06351v1",
"citation_count": 0
},
{
"id": "2410.18107v1",
"title": "In-Context Code-Text Learning for Bimodal Software Engineering",
"authors": [
"Xunzhu Tang",
"Liran Wang",
"Yonghui Liu",
"Linzheng Chai",
"Jian Yang",
"Zhoujun Li",
"Haoye Tian",
"Jacques Klein",
"Tegawende F. Bissyande"
],
"abstract": "Bimodal software analysis initially appeared to be within reach with the\nadvent of large language models. Unfortunately, the complex interplay of\nnatural language text and code in software engineering, presents unique\nchallenges that prevent pretrained models to generalize to a variety of tasks.\nWe postulate that in-context learning for the code-text bimodality is a\npromising avenue. This paper thus introduces a comprehensive study of\nin-context code-text learning, focusing on leveraging pretrained CodeLLAMA\nmodels.\n We consider a diverse dataset encompassing 23 software engineering tasks,\nwhich we transform in an in-context learning format. To effectively extract\ninformative features, we propose a configurable prompt template. Our proposed\npipeline, InCTRL, then unifies prompt learning across various software\nengineering tasks. Extensive evaluation on the study datasets demonstrates the\nsuperiority of INCTRL-models in few-shot performance, surpassing\nstate-of-the-art models including the support model, CodeLLAMA. Typically, we\nobserve that applied to the CodeLLAMA model, INCTRL brings improvements in\nterms of precision (at least about 12\\%) and recall (up to 93.88\\%) on various\ntasks. For example, on the task of program repair, INCTRL improves the BLEU\nscore of CodeLLAMA by 85 points, while for clone detection, INCTRL achieves an\nimprovement of 69 percentage points. Moreover, INCTRL-models offer\nstate-of-the-art performance when using retrieval-augmented generation on\nindividual downstream tasks. Finally, we qualitatively analyze the benefits of\nINCTRL over CodeLLAMA and open-source all models for broader impact.\n We make our code and dataset publicly available at: \\begin{center}\n {\\url{https://anonymous.4open.science/r/inctrl-B65B}} \\end{center}",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2024-10-08T19:42:00+00:00",
"url": "http://arxiv.org/pdf/2410.18107v1",
"resource_uri": "arxiv://2410.18107v1",
"citation_count": 0
},
{
"id": "2410.05911v1",
"title": "Accelerating Error Correction Code Transformers",
"authors": [
"Matan Levy",
"Yoni Choukroun",
"Lior Wolf"
],
"abstract": "Error correction codes (ECC) are crucial for ensuring reliable information\ntransmission in communication systems. Choukroun & Wolf (2022b) recently\nintroduced the Error Correction Code Transformer (ECCT), which has demonstrated\npromising performance across various transmission channels and families of\ncodes. However, its high computational and memory demands limit its practical\napplications compared to traditional decoding algorithms. Achieving effective\nquantization of the ECCT presents significant challenges due to its inherently\nsmall architecture, since existing, very low-precision quantization techniques\noften lead to performance degradation in compact neural networks. In this\npaper, we introduce a novel acceleration method for transformer-based decoders.\nWe first propose a ternary weight quantization method specifically designed for\nthe ECCT, inducing a decoder with multiplication-free linear layers. We present\nan optimized self-attention mechanism to reduce computational complexity via\ncodeaware multi-heads processing. Finally, we provide positional encoding via\nthe Tanner graph eigendecomposition, enabling a richer representation of the\ngraph connectivity. The approach not only matches or surpasses ECCT's\nperformance but also significantly reduces energy consumption, memory\nfootprint, and computational complexity. Our method brings transformer-based\nerror correction closer to practical implementation in resource-constrained\nenvironments, achieving a 90% compression ratio and reducing arithmetic\noperation energy consumption by at least 224 times on modern hardware.",
"categories": [
"cs.LG",
"cs.AI",
"cs.IT",
"math.IT"
],
"published": "2024-10-08T11:07:55+00:00",
"url": "http://arxiv.org/pdf/2410.05911v1",
"resource_uri": "arxiv://2410.05911v1",
"citation_count": 0
},
{
"id": "2410.04485v1",
"title": "Exploring the Potential of Conversational Test Suite Based Program Repair on SWE-bench",
"authors": [
"Anton Cheshkov",
"Pavel Zadorozhny",
"Rodion Levichev",
"Evgeny Maslov",
"Ronaldo Franco Jaldin"
],
"abstract": "Automatic program repair at project level may open yet to be seen\nopportunities in various fields of human activity. Since the SWE-Bench\nchallenge was presented, we have seen numerous of solutions. Patch generation\nis a part of program repair, and test suite-based conversational patch\ngeneration has proven its effectiveness. However, the potential of\nconversational patch generation has not yet specifically estimated on\nSWE-Bench. This study reports experimental results aimed at evaluating the\nindividual effectiveness of conversational patch generation on problems from\nSWE-Bench. The experiments show that a simple conversational pipeline based on\nLLaMA 3.1 70B can generate valid patches in 47\\% of cases, which is comparable\nto the state-of-the-art in program repair on SWE-Bench.",
"categories": [
"cs.SE",
"cs.AI",
"cs.MA"
],
"published": "2024-10-06T13:55:33+00:00",
"url": "http://arxiv.org/pdf/2410.04485v1",
"resource_uri": "arxiv://2410.04485v1",
"citation_count": 0
},
{
"id": "2410.03364v2",
"title": "Error Correction Code Transformer: From Non-Unified to Unified",
"authors": [
"Yongli Yan",
"Jieao Zhu",
"Tianyue Zheng",
"Jiaqi He",
"Linglong Dai"
],
"abstract": "Channel coding is vital for reliable data transmission in modern wireless\nsystems, and its significance will increase with the emergence of\nsixth-generation (6G) networks, which will need to support various error\ncorrection codes. However, traditional decoders were typically designed as\nfixed hardware circuits tailored to specific decoding algorithms, leading to\ninefficiencies and limited flexibility. To address these challenges, this paper\nproposes a unified, code-agnostic Transformer-based decoding architecture\ncapable of handling multiple linear block codes, including Polar, Low-Density\nParity-Check (LDPC), and Bose-Chaudhuri-Hocquenghem (BCH), within a single\nframework. To achieve this, standardized units are employed to harmonize\nparameters across different code types, while the redesigned unified attention\nmodule compresses the structural information of various codewords.\nAdditionally, a sparse mask, derived from the sparsity of the parity-check\nmatrix, is introduced to enhance the model's ability to capture inherent\nconstraints between information and parity-check bits, resulting in improved\ndecoding accuracy and robustness. Extensive experimental results demonstrate\nthat the proposed unified Transformer-based decoder not only outperforms\nexisting methods but also provides a flexible, efficient, and high-performance\nsolution for next-generation wireless communication systems.",
"categories": [
"cs.IT",
"cs.LG",
"math.IT"
],
"published": "2024-10-04T12:30:42+00:00",
"url": "http://arxiv.org/pdf/2410.03364v2",
"resource_uri": "arxiv://2410.03364v2",
"citation_count": 0
},
{
"id": "2410.03210v2",
"title": "Tadashi: Enabling AI-Based Automated Code Generation With Guaranteed Correctness",
"authors": [
"Emil Vatai",
"Aleksandr Drozd",
"Ivan R. Ivanov",
"Joao E. Batista",
"Yinghao Ren",
"Mohamed Wahib"
],
"abstract": "Frameworks and domain-specific languages for auto-generating code have\ntraditionally depended on human experts to implement rigorous methods ensuring\nthe legality of code transformations. Recently, machine learning (ML) has\ngained traction for generating code optimized for specific hardware targets.\nHowever, ML approaches-particularly black-box neural networks-offer no\nguarantees on the correctness or legality of the transformations they produce.\nTo address this gap, we introduce Tadashi, an end-to-end system that leverages\nthe polyhedral model to support researchers in curating datasets critical for\nML-based code generation. Tadashi provides an end-to-end system capable of\napplying, verifying, and evaluating candidate transformations on polyhedral\nschedules with both reliability and practicality. We formally prove that\nTadashi guarantees the legality of generated transformations, demonstrate its\nlow runtime overhead, and showcase its broad applicability. Tadashi available\nat https://github.com/vatai/tadashi/.",
"categories": [
"cs.LG"
],
"published": "2024-10-04T07:56:05+00:00",
"url": "http://arxiv.org/pdf/2410.03210v2",
"resource_uri": "arxiv://2410.03210v2",
"citation_count": 0
},
{
"id": "2410.02749v3",
"title": "Training Language Models on Synthetic Edit Sequences Improves Code Synthesis",
"authors": [
"Ulyana Piterbarg",
"Lerrel Pinto",
"Rob Fergus"
],
"abstract": "Software engineers mainly write code by editing existing programs. In\ncontrast, language models (LMs) autoregressively synthesize programs in a\nsingle pass. One explanation for this is the scarcity of sequential edit data.\nWhile high-quality instruction data for code synthesis is scarce, edit data for\nsynthesis is even scarcer. To fill this gap, we develop a synthetic data\ngeneration algorithm called LintSeq. This algorithm refactors programs into\nsequences of synthetic edits by using a linter to procedurally sample across\ninterdependent lines of source code. Synthetic edits sampled with LintSeq\nreflect the syntax and semantics of their programming language. To test the\nalgorithm, we use it to refactor a dataset of instruction + program pairs into\ninstruction + program-diff-sequence tuples. Then, we fine-tune a series of\nsmaller LMs ranging from 2.6B to 14B parameters on both the re-factored and\noriginal versions of this dataset. We perform comprehensive evaluations\ncomparing edit sequence code LMs against baselines on HumanEval, MBPP(+),\nCodeContests, DS-1000, and BigCodeBench. We show that models fine-tuned to\niteratively synthesize code match or outperform baselines on pass@1, and\nexhibit better scaling across higher pass@k as a function of total test-time\nFLOPs. Finally, we also pretrain our own tiny LMs for code understanding. We\nshow that fine-tuning these models to synthesize code edit-by-edit results in\nstrong performance on HumanEval and MBPP(+) compared to existing code language\nmodels of similar scale such as CodeT5+, AlphaCode, and Codex.",
"categories": [
"cs.LG",
"cs.CL"
],
"published": "2024-10-03T17:57:22+00:00",
"url": "http://arxiv.org/pdf/2410.02749v3",
"resource_uri": "arxiv://2410.02749v3",
"citation_count": 0
},
{
"id": "2410.01999v4",
"title": "CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding & Reasoning Capabilities of CodeLLMs",
"authors": [
"Dung Nguyen Manh",
"Thang Phan Chau",
"Nam Le Hai",
"Thong T. Doan",
"Nam V. Nguyen",
"Quang Pham",
"Nghi D. Q. Bui"
],
"abstract": "Recent advances in Code Large Language Models (CodeLLMs) have primarily\nfocused on open-ended code generation, often overlooking the crucial aspect of\ncode understanding and reasoning. To bridge this gap, we introduce CodeMMLU, a\ncomprehensive multiple-choice benchmark designed to evaluate the depth of\nsoftware and code comprehension in LLMs. CodeMMLU includes nearly 20,000\nquestions spanning diverse domains, including code analysis, defect detection,\nand software engineering principles across multiple programming languages.\nUnlike traditional benchmarks that emphasize code generation, CodeMMLU assesses\na model's ability to reason about programs across a wide-range of tasks such as\ncode repair, execution reasoning, and fill-in-the-blank challenges. Our\nextensive evaluation reveals that even state-of-the-art models struggle with\nCodeMMLU, highlighting significant gaps in comprehension beyond generation. By\nemphasizing the essential connection between code understanding and effective\nAI-assisted development, CodeMMLU provides a critical resource for advancing\nmore reliable and capable coding assistants.",
"categories": [
"cs.SE"
],
"published": "2024-10-02T20:04:02+00:00",
"url": "http://arxiv.org/pdf/2410.01999v4",
"resource_uri": "arxiv://2410.01999v4",
"citation_count": 0
},
{
"id": "2409.19922v1",
"title": "Benchmarking ChatGPT, Codeium, and GitHub Copilot: A Comparative Study of AI-Driven Programming and Debugging Assistants",
"authors": [
"Md Sultanul Islam Ovi",
"Nafisa Anjum",
"Tasmina Haque Bithe",
"Md. Mahabubur Rahman",
"Mst. Shahnaj Akter Smrity"
],
"abstract": "With the increasing adoption of AI-driven tools in software development,\nlarge language models (LLMs) have become essential for tasks like code\ngeneration, bug fixing, and optimization. Tools like ChatGPT, GitHub Copilot,\nand Codeium provide valuable assistance in solving programming challenges, yet\ntheir effectiveness remains underexplored. This paper presents a comparative\nstudy of ChatGPT, Codeium, and GitHub Copilot, evaluating their performance on\nLeetCode problems across varying difficulty levels and categories. Key metrics\nsuch as success rates, runtime efficiency, memory usage, and error-handling\ncapabilities are assessed. GitHub Copilot showed superior performance on easier\nand medium tasks, while ChatGPT excelled in memory efficiency and debugging.\nCodeium, though promising, struggled with more complex problems. Despite their\nstrengths, all tools faced challenges in handling harder problems. These\ninsights provide a deeper understanding of each tool's capabilities and\nlimitations, offering guidance for developers and researchers seeking to\noptimize AI integration in coding workflows.",
"categories": [
"cs.SE"
],
"published": "2024-09-30T03:53:40+00:00",
"url": "http://arxiv.org/pdf/2409.19922v1",
"resource_uri": "arxiv://2409.19922v1",
"citation_count": 0
},
{
"id": "2409.19715v2",
"title": "Coffee-Gym: An Environment for Evaluating and Improving Natural Language Feedback on Erroneous Code",
"authors": [
"Hyungjoo Chae",
"Taeyoon Kwon",
"Seungjun Moon",
"Yongho Song",
"Dongjin Kang",
"Kai Tzu-iunn Ong",
"Beong-woo Kwak",
"Seonghyeon Bae",
"Seung-won Hwang",
"Jinyoung Yeo"
],
"abstract": "This paper presents Coffee-Gym, a comprehensive RL environment for training\nmodels that provide feedback on code editing. Coffee-Gym includes two major\ncomponents: (1) Coffee, a dataset containing humans' code edit traces for\ncoding questions and machine-written feedback for editing erroneous code; (2)\nCoffeeEval, a reward function that faithfully reflects the helpfulness of\nfeedback by assessing the performance of the revised code in unit tests. With\nthem, Coffee-Gym addresses the unavailability of high-quality datasets for\ntraining feedback models with RL, and provides more accurate rewards than the\nSOTA reward model (i.e., GPT-4). By applying Coffee-Gym, we elicit feedback\nmodels that outperform baselines in enhancing open-source code LLMs' code\nediting, making them comparable with closed-source LLMs. We make the dataset\nand the model checkpoint publicly available.",
"categories": [
"cs.CL"
],
"published": "2024-09-29T14:14:25+00:00",
"url": "http://arxiv.org/pdf/2409.19715v2",
"resource_uri": "arxiv://2409.19715v2",
"citation_count": 0
},
{
"id": "2409.18952v1",
"title": "RepairBench: Leaderboard of Frontier Models for Program Repair",
"authors": [
"André Silva",
"Martin Monperrus"
],
"abstract": "AI-driven program repair uses AI models to repair buggy software by producing\npatches. Rapid advancements in AI surely impact state-of-the-art performance of\nprogram repair. Yet, grasping this progress requires frequent and standardized\nevaluations. We propose RepairBench, a novel leaderboard for AI-driven program\nrepair. The key characteristics of RepairBench are: 1) it is execution-based:\nall patches are compiled and executed against a test suite, 2) it assesses\nfrontier models in a frequent and standardized way. RepairBench leverages two\nhigh-quality benchmarks, Defects4J and GitBug-Java, to evaluate frontier models\nagainst real-world program repair tasks. We publicly release the evaluation\nframework of RepairBench. We will update the leaderboard as new frontier models\nare released.",
"categories": [
"cs.SE",
"cs.LG"
],
"published": "2024-09-27T17:52:34+00:00",
"url": "http://arxiv.org/pdf/2409.18952v1",
"resource_uri": "arxiv://2409.18952v1",
"citation_count": 0
},
{
"id": "2409.15186v2",
"title": "Location is Key: Leveraging Large Language Model for Functional Bug Localization in Verilog",
"authors": [
"Bingkun Yao",
"Ning Wang",
"Jie Zhou",
"Xi Wang",
"Hong Gao",
"Zhe Jiang",
"Nan Guan"
],
"abstract": "Bug localization in Verilog code is a crucial and time-consuming task during\nthe verification of hardware design. Since introduction, Large Language Models\n(LLMs) have showed their strong programming capabilities. However, no work has\nyet considered using LLMs for bug localization in Verilog code. This paper\npresents Location-is-Key, an opensource LLM solution to locate functional\nerrors in Verilog snippets. LiK achieves high localization accuracy, with a\npass@1 localization accuracy of 93.3% on our test dataset based on RTLLM,\nsurpassing GPT-4's 77.9% and comparable to Claude-3.5's 90.8%. Additionally,\nthe bug location obtained by LiK significantly improves GPT-3.5's bug repair\nefficiency (Functional pass@1 increased from 40.39% to 58.92%), highlighting\nthe importance of bug localization in LLM-based Verilog debugging. Compared to\nexisting methods, LiK only requires the design specification and the erroneous\ncode snippet, without the need for testbenches, assertions, or any other EDA\ntools. This research demonstrates the feasibility of using LLMs for Verilog\nerror localization, thus providing a new direction for automatic Verilog code\ndebugging.",
"categories": [
"cs.AR",
"cs.AI"
],
"published": "2024-09-23T16:38:53+00:00",
"url": "http://arxiv.org/pdf/2409.15186v2",
"resource_uri": "arxiv://2409.15186v2",
"citation_count": 0
},
{
"id": "2409.14301v3",
"title": "Multi-Grained Specifications for Distributed System Model Checking and Verification",
"authors": [
"Lingzhi Ouyang",
"Xudong Sun",
"Ruize Tang",
"Yu Huang",
"Madhav Jivrajani",
"Xiaoxing Ma",
"Tianyin Xu"
],
"abstract": "This paper presents our experience specifying and verifying the correctness\nof ZooKeeper, a complex and evolving distributed coordination system. We use\nTLA+ to model fine-grained behaviors of ZooKeeper and use the TLC model checker\nto verify its correctness properties; we also check conformance between the\nmodel and code. The fundamental challenge is to balance the granularity of\nspecifications and the scalability of model checking -- fine-grained\nspecifications lead to state-space explosion, while coarse-grained\nspecifications introduce model-code gaps. To address this challenge, we write\nspecifications with different granularities for composable modules, and compose\nthem into mixed-grained specifications based on specific scenarios. For\nexample, to verify code changes, we compose fine-grained specifications of\nchanged modules and coarse-grained specifications that abstract away details of\nunchanged code with preserved interactions. We show that writing multi-grained\nspecifications is a viable practice and can cope with model-code gaps without\nuntenable state space, especially for evolving software where changes are\ntypically local and incremental. We detected six severe bugs that violate five\ntypes of invariants and verified their code fixes; the fixes have been merged\nto ZooKeeper. We also improve the protocol design to make it easy to implement\ncorrectly.",
"categories": [
"cs.DC",
"cs.SE"
],
"published": "2024-09-22T02:59:56+00:00",
"url": "http://arxiv.org/pdf/2409.14301v3",
"resource_uri": "arxiv://2409.14301v3",
"citation_count": 0
},
{
"id": "2409.12993v2",
"title": "CraftRTL: High-quality Synthetic Data Generation for Verilog Code Models with Correct-by-Construction Non-Textual Representations and Targeted Code Repair",
"authors": [
"Mingjie Liu",
"Yun-Da Tsai",
"Wenfei Zhou",
"Haoxing Ren"
],
"abstract": "Despite the significant progress made in code generation with large language\nmodels, challenges persist, especially with hardware description languages such\nas Verilog. This paper first presents an analysis of fine-tuned LLMs on Verilog\ncoding, with synthetic data from prior methods. We identify two main issues:\ndifficulties in handling non-textual representations (Karnaugh maps,\nstate-transition diagrams and waveforms) and significant variability during\ntraining with models randomly making \"minor\" mistakes. To address these\nlimitations, we enhance data curation by creating correct-by-construction data\ntargeting non-textual representations. Additionally, we introduce an automated\nframework that generates error reports from various model checkpoints and\ninjects these errors into open-source code to create targeted code repair data.\nOur fine-tuned Starcoder2-15B outperforms prior state-of-the-art results by\n3.8%, 10.9%, 6.6% for pass@1 on VerilogEval-Machine, VerilogEval-Human, and\nRTLLM.",
"categories": [
"cs.AR",
"cs.CL"
],
"published": "2024-09-19T12:15:55+00:00",
"url": "http://arxiv.org/pdf/2409.12993v2",
"resource_uri": "arxiv://2409.12993v2",
"citation_count": 0
},
{
"id": "2409.11190v2",
"title": "SuperCoder2.0: Technical Report on Exploring the feasibility of LLMs as Autonomous Programmer",
"authors": [
"Anmol Gautam",
"Kishore Kumar",
"Adarsh Jha",
"Mukunda NS",
"Ishaan Bhola"
],
"abstract": "We present SuperCoder2.0, an advanced autonomous system designed to enhance\nsoftware development through artificial intelligence. The system combines an\nAI-native development approach with intelligent agents to enable fully\nautonomous coding. Key focus areas include a retry mechanism with error output\ntraceback, comprehensive code rewriting and replacement using Abstract Syntax\nTree (ast) parsing to minimize linting issues, code embedding technique for\nretrieval-augmented generation, and a focus on localizing methods for\nproblem-solving rather than identifying specific line numbers. The methodology\nemploys a three-step hierarchical search space reduction approach for code base\nnavigation and bug localization:utilizing Retrieval Augmented Generation (RAG)\nand a Repository File Level Map to identify candidate files, (2) narrowing down\nto the most relevant files using a File Level Schematic Map, and (3) extracting\n'relevant locations' within these files. Code editing is performed through a\ntwo-part module comprising CodeGeneration and CodeEditing, which generates\nmultiple solutions at different temperature values and replaces entire methods\nor classes to maintain code integrity. A feedback loop executes\nrepository-level test cases to validate and refine solutions. Experiments\nconducted on the SWE-bench Lite dataset demonstrate SuperCoder2.0's\neffectiveness, achieving correct file localization in 84.33% of cases within\nthe top 5 candidates and successfully resolving 34% of test instances. This\nperformance places SuperCoder2.0 fourth globally on the SWE-bench leaderboard.\nThe system's ability to handle diverse repositories and problem types\nhighlights its potential as a versatile tool for autonomous software\ndevelopment. Future work will focus on refining the code editing process and\nexploring advanced embedding models for improved natural language to code\nmapping.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2024-09-17T13:44:42+00:00",
"url": "http://arxiv.org/pdf/2409.11190v2",
"resource_uri": "arxiv://2409.11190v2",
"citation_count": 0
},
{
"id": "2409.10033v3",
"title": "Can GPT-O1 Kill All Bugs? An Evaluation of GPT-Family LLMs on QuixBugs",
"authors": [
"Haichuan Hu",
"Ye Shang",
"Guolin Xu",
"Congqing He",
"Quanjun Zhang"
],
"abstract": "LLMs have long demonstrated remarkable effectiveness in automatic program\nrepair (APR), with OpenAI's ChatGPT being one of the most widely used models in\nthis domain. Through continuous iterations and upgrades of GPT-family models,\ntheir performance in fixing bugs has already reached state-of-the-art levels.\nHowever, there are few works comparing the effectiveness and variations of\ndifferent versions of GPT-family models on APR. In this work, inspired by the\nrecent public release of the GPT-o1 models, we conduct the first study to\ncompare the effectiveness of different versions of the GPT-family models in\nAPR. We evaluate the performance of the latest version of the GPT-family models\n(i.e., O1-preview and O1-mini), GPT-4o, and the historical version of ChatGPT\non APR. We conduct an empirical study of the four GPT-family models against\nother LLMs and APR techniques on the QuixBugs benchmark from multiple\nevaluation perspectives, including repair success rate, repair cost, response\nlength, and behavior patterns. The results demonstrate that O1's repair\ncapability exceeds that of prior GPT-family models, successfully fixing all 40\nbugs in the benchmark. Our work can serve as a foundation for further in-depth\nexploration of the applications of GPT-family models in APR.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2024-09-16T06:51:32+00:00",
"url": "http://arxiv.org/pdf/2409.10033v3",
"resource_uri": "arxiv://2409.10033v3",
"citation_count": 0
},
{
"id": "2409.09661v1",
"title": "ContractTinker: LLM-Empowered Vulnerability Repair for Real-World Smart Contracts",
"authors": [
"Che Wang",
"Jiashuo Zhang",
"Jianbo Gao",
"Libin Xia",
"Zhi Guan",
"Zhong Chen"
],
"abstract": "Smart contracts are susceptible to being exploited by attackers, especially\nwhen facing real-world vulnerabilities. To mitigate this risk, developers often\nrely on third-party audit services to identify potential vulnerabilities before\nproject deployment. Nevertheless, repairing the identified vulnerabilities is\nstill complex and labor-intensive, particularly for developers lacking security\nexpertise. Moreover, existing pattern-based repair tools mostly fail to address\nreal-world vulnerabilities due to their lack of high-level semantic\nunderstanding. To fill this gap, we propose ContractTinker, a Large Language\nModels (LLMs)-empowered tool for real-world vulnerability repair. The key\ninsight is our adoption of the Chain-of-Thought approach to break down the\nentire generation task into sub-tasks. Additionally, to reduce hallucination,\nwe integrate program static analysis to guide the LLM. We evaluate\nContractTinker on 48 high-risk vulnerabilities. The experimental results show\nthat among the patches generated by ContractTinker, 23 (48%) are valid patches\nthat fix the vulnerabilities, while 10 (21%) require only minor modifications.\nA video of ContractTinker is available at https://youtu.be/HWFVi-YHcPE.",
"categories": [
"cs.SE",
"cs.CR"
],
"published": "2024-09-15T08:24:01+00:00",
"url": "http://arxiv.org/pdf/2409.09661v1",
"resource_uri": "arxiv://2409.09661v1",
"citation_count": 0
},
{
"id": "2409.07362v1",
"title": "GitSEED: A Git-backed Automated Assessment Tool for Software Engineering and Programming Education",
"authors": [
"Pedro Orvalho",
"Mikoláš Janota",
"Vasco Manquinho"
],
"abstract": "Due to the substantial number of enrollments in programming courses, a key\nchallenge is delivering personalized feedback to students. The nature of this\nfeedback varies significantly, contingent on the subject and the chosen\nevaluation method. However, tailoring current Automated Assessment Tools (AATs)\nto integrate other program analysis tools is not straightforward. Moreover,\nAATs usually support only specific programming languages, providing feedback\nexclusively through dedicated websites based on test suites.\n This paper introduces GitSEED, a language-agnostic automated assessment tool\ndesigned for Programming Education and Software Engineering (SE) and backed by\nGitLab. The students interact with GitSEED through GitLab. Using GitSEED,\nstudents in Computer Science (CS) and SE can master the fundamentals of git\nwhile receiving personalized feedback on their programming assignments and\nprojects. Furthermore, faculty members can easily tailor GitSEED's pipeline by\nintegrating various code evaluation tools (e.g., memory leak detection, fault\nlocalization, program repair, etc.) to offer personalized feedback that aligns\nwith the needs of each CS/SE course. Our experiments assess GitSEED's efficacy\nvia comprehensive user evaluation, examining the impact of feedback mechanisms\nand features on student learning outcomes. Findings reveal positive\ncorrelations between GitSEED usage and student engagement.",
"categories": [
"cs.SE",
"cs.CY"
],
"published": "2024-09-11T15:50:42+00:00",
"url": "http://arxiv.org/pdf/2409.07362v1",
"resource_uri": "arxiv://2409.07362v1",
"citation_count": 0
},
{
"id": "2409.16299v2",
"title": "HyperAgent: Generalist Software Engineering Agents to Solve Coding Tasks at Scale",
"authors": [
"Huy Nhat Phan",
"Tien N. Nguyen",
"Phong X. Nguyen",
"Nghi D. Q. Bui"
],
"abstract": "Large Language Models (LLMs) have revolutionized software engineering (SE),\nshowcasing remarkable proficiency in various coding tasks. Despite recent\nadvancements that have enabled the creation of autonomous software agents\nutilizing LLMs for end-to-end development tasks, these systems are typically\ndesigned for specific SE functions. We introduce HyperAgent, an innovative\ngeneralist multi-agent system designed to tackle a wide range of SE tasks\nacross different programming languages by mimicking the workflows of human\ndevelopers. HyperAgent features four specialized agents-Planner, Navigator,\nCode Editor, and Executor-capable of handling the entire lifecycle of SE tasks,\nfrom initial planning to final verification. HyperAgent sets new benchmarks in\ndiverse SE tasks, including GitHub issue resolution on the renowned SWE-Bench\nbenchmark, outperforming robust baselines. Furthermore, HyperAgent demonstrates\nexceptional performance in repository-level code generation (RepoExec) and\nfault localization and program repair (Defects4J), often surpassing\nstate-of-the-art baselines.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2024-09-09T19:35:34+00:00",
"url": "http://arxiv.org/pdf/2409.16299v2",
"resource_uri": "arxiv://2409.16299v2",
"citation_count": 0
},
{
"id": "2409.03711v3",
"title": "Relaxation times for disoriented isospin condensates in high energy heavy ion collisions",
"authors": [
"Olivia Chabowski",
"Joseph I. Kapusta",
"Mayank Singh"
],
"abstract": "Fluctuations between charged and neutral kaons measured by the ALICE\nCollaboration in Pb-Pb collisions at the LHC exceed conventional explanations.\nPreviously it was shown that if the scalar condensate is accompanied by an\nelectrically neutral isospin--1 field then the combination can produce large\nequilibrium fluctuations where $\\langle \\bar{u}u\\rangle \\ne \\langle\n\\bar{d}d\\rangle$. Hadronizing strange and anti-strange quarks might then\nstrongly fluctuate between charged ($u\\bar{s}$ or $s\\bar{u}$) and neutral\n($d\\bar{s}$ or $s\\bar{d}$) kaons. Here we estimate the times for the\ncondensates to achieve their equilibrium probability distributions within\ncausal volumes in high energy heavy ion collisions. This is achieved by\nmodeling the temperature dependence of the condensates, mesonic collective\nexcitations, decay rates of the associated fields, and employing the Langevin\nand Fokker-Planck equations. Within this model, we find that the equilibration\ntimes are short compared with the expansion time, making disoriented isospin\ncondensates a viable explanation for the anomalous fluctuations observed at the\nLHC.",
"categories": [
"hep-ph",
"nucl-th"
],
"published": "2024-09-05T17:17:55+00:00",
"url": "http://arxiv.org/pdf/2409.03711v3",
"resource_uri": "arxiv://2409.03711v3",
"citation_count": 0
},
{
"id": "2409.03267v1",
"title": "No Man is an Island: Towards Fully Automatic Programming by Code Search, Code Generation and Program Repair",
"authors": [
"Quanjun Zhang",
"Chunrong Fang",
"Ye Shang",
"Tongke Zhang",
"Shengcheng Yu",
"Zhenyu Chen"
],
"abstract": "Automatic programming attempts to minimize human intervention in the\ngeneration of executable code, and has been a long-standing challenge in the\nsoftware engineering community. To advance automatic programming, researchers\nare focusing on three primary directions: (1) code search that reuses existing\ncode snippets from external databases; (2) code generation that produces new\ncode snippets from natural language; and (3) program repair that refines\nexisting code snippets by fixing detected bugs. Despite significant\nadvancements, the effectiveness of state-of-the-art techniques is still\nlimited, such as the usability of searched code and the correctness of\ngenerated code.\n Motivated by the real-world programming process, where developers usually use\nvarious external tools to aid their coding processes, such as code search\nengines and code testing tools, in this work, we propose \\toolname{}, an\nautomatic programming framework that leverages recent large language models\n(LLMs) to integrate the three research areas to address their inherent\nlimitations. In particular, our framework first leverages different code search\nstrategies to retrieve similar code snippets, which are then used to further\nguide the code generation process of LLMs. Our framework further validates the\nquality of generated code by compilers and test cases, and constructs repair\nprompts to query LLMs for generating correct patches. We conduct preliminary\nexperiments to demonstrate the potential of our framework, \\eg helping\nCodeLlama solve 267 programming problems with an improvement of 62.53\\%. As a\ngeneric framework, \\toolname{} can integrate various code search, generation,\nand repair tools, combining these three research areas together for the first\ntime. More importantly, it demonstrates the potential of using traditional SE\ntools to enhance the usability of LLMs in automatic programming.",
"categories": [
"cs.SE"
],
"published": "2024-09-05T06:24:29+00:00",
"url": "http://arxiv.org/pdf/2409.03267v1",
"resource_uri": "arxiv://2409.03267v1",
"citation_count": 0
},
{
"id": "2409.00899v2",
"title": "MarsCode Agent: AI-native Automated Bug Fixing",
"authors": [
"Yizhou Liu",
"Pengfei Gao",
"Xinchen Wang",
"Jie Liu",
"Yexuan Shi",
"Zhao Zhang",
"Chao Peng"
],
"abstract": "Recent advances in large language models (LLMs) have shown significant\npotential to automate various software development tasks, including code\ncompletion, test generation, and bug fixing. However, the application of LLMs\nfor automated bug fixing remains challenging due to the complexity and\ndiversity of real-world software systems. In this paper, we introduce MarsCode\nAgent, a novel framework that leverages LLMs to automatically identify and\nrepair bugs in software code. MarsCode Agent combines the power of LLMs with\nadvanced code analysis techniques to accurately localize faults and generate\npatches. Our approach follows a systematic process of planning, bug\nreproduction, fault localization, candidate patch generation, and validation to\nensure high-quality bug fixes. We evaluated MarsCode Agent on SWE-bench, a\ncomprehensive benchmark of real-world software projects, and our results show\nthat MarsCode Agent achieves a high success rate in bug fixing compared to most\nof the existing automated approaches.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2024-09-02T02:24:38+00:00",
"url": "http://arxiv.org/pdf/2409.00899v2",
"resource_uri": "arxiv://2409.00899v2",
"citation_count": 0
},
{
"id": "2409.00571v1",
"title": "Enhancing Source Code Security with LLMs: Demystifying The Challenges and Generating Reliable Repairs",
"authors": [
"Nafis Tanveer Islam",
"Joseph Khoury",
"Andrew Seong",
"Elias Bou-Harb",
"Peyman Najafirad"
],
"abstract": "With the recent unprecedented advancements in Artificial Intelligence (AI)\ncomputing, progress in Large Language Models (LLMs) is accelerating rapidly,\npresenting challenges in establishing clear guidelines, particularly in the\nfield of security. That being said, we thoroughly identify and describe three\nmain technical challenges in the security and software engineering literature\nthat spans the entire LLM workflow, namely; \\textbf{\\textit{(i)}} Data\nCollection and Labeling; \\textbf{\\textit{(ii)}} System Design and Learning; and\n\\textbf{\\textit{(iii)}} Performance Evaluation. Building upon these challenges,\nthis paper introduces \\texttt{SecRepair}, an instruction-based LLM system\ndesigned to reliably \\textit{identify}, \\textit{describe}, and automatically\n\\textit{repair} vulnerable source code. Our system is accompanied by a list of\nactionable guides on \\textbf{\\textit{(i)}} Data Preparation and Augmentation\nTechniques; \\textbf{\\textit{(ii)}} Selecting and Adapting state-of-the-art LLM\nModels; \\textbf{\\textit{(iii)}} Evaluation Procedures. \\texttt{SecRepair} uses\na reinforcement learning-based fine-tuning with a semantic reward that caters\nto the functionality and security aspects of the generated code. Our empirical\nanalysis shows that \\texttt{SecRepair} achieves a \\textit{12}\\% improvement in\nsecurity code repair compared to other LLMs when trained using reinforcement\nlearning. Furthermore, we demonstrate the capabilities of \\texttt{SecRepair} in\ngenerating reliable, functional, and compilable security code repairs against\nreal-world test cases using automated evaluation metrics.",
"categories": [
"cs.CR",
"cs.AI"
],
"published": "2024-09-01T00:41:40+00:00",
"url": "http://arxiv.org/pdf/2409.00571v1",
"resource_uri": "arxiv://2409.00571v1",
"citation_count": 0
},
{
"id": "2408.13597v2",
"title": "APPATCH: Automated Adaptive Prompting Large Language Models for Real-World Software Vulnerability Patching",
"authors": [
"Yu Nong",
"Haoran Yang",
"Long Cheng",
"Hongxin Hu",
"Haipeng Cai"
],
"abstract": "Timely and effective vulnerability patching is essential for cybersecurity\ndefense, for which various approaches have been proposed yet still struggle to\ngenerate valid and correct patches for real-world vulnerabilities. In this\npaper, we leverage the power and merits of pre-trained language language models\n(LLMs) to enable automated vulnerability patching using no test input/exploit\nevidence and without model training/fine-tuning. To elicit LLMs to effectively\nreason about vulnerable code behaviors, which is essential for quality patch\ngeneration, we introduce vulnerability semantics reasoning and adaptive\nprompting on LLMs and instantiate the methodology as APPATCH, an automated\nLLM-based patching system. Our evaluation of APPATCH on 97 zero-day\nvulnerabilities and 20 existing vulnerabilities demonstrates its superior\nperformance to both existing prompting methods and state-of-the-art\nnon-LLM-based techniques (by up to 28.33% in F1 and 182.26% in recall over the\nbest baseline). Through APPATCH, we demonstrate what helps for LLM-based\npatching and how, as well as discussing what still lacks and why.",
"categories": [
"cs.CR",
"cs.SE"
],
"published": "2024-08-24T14:51:50+00:00",
"url": "http://arxiv.org/pdf/2408.13597v2",
"resource_uri": "arxiv://2408.13597v2",
"citation_count": 0
},
{
"id": "2408.12056v2",
"title": "Enhancing Automated Program Repair with Solution Design",
"authors": [
"Jiuang Zhao",
"Donghao Yang",
"Li Zhang",
"Xiaoli Lian",
"Zitian Yang",
"Fang Liu"
],
"abstract": "Automatic Program Repair (APR) endeavors to autonomously rectify issues\nwithin specific projects, which generally encompasses three categories of\ntasks: bug resolution, new feature development, and feature enhancement.\nDespite extensive research proposing various methodologies, their efficacy in\naddressing real issues remains unsatisfactory. It's worth noting that,\ntypically, engineers have design rationales (DR) on solution-planed solutions\nand a set of underlying reasons-before they start patching code. In open-source\nprojects, these DRs are frequently captured in issue logs through project\nmanagement tools like Jira. This raises a compelling question: How can we\nleverage DR scattered across the issue logs to efficiently enhance APR? To\ninvestigate this premise, we introduce DRCodePilot, an approach designed to\naugment GPT-4-Turbo's APR capabilities by incorporating DR into the prompt\ninstruction. Furthermore, given GPT-4's constraints in fully grasping the\nbroader project context and occasional shortcomings in generating precise\nidentifiers, we have devised a feedback-based self-reflective framework, in\nwhich we prompt GPT-4 to reconsider and refine its outputs by referencing a\nprovided patch and suggested identifiers. We have established a benchmark\ncomprising 938 issue-patch pairs sourced from two open-source repositories\nhosted on GitHub and Jira. Our experimental results are impressive: DRCodePilot\nachieves a full-match ratio that is a remarkable 4.7x higher than when GPT-4 is\nutilized directly. Additionally, the CodeBLEU scores also exhibit promising\nenhancements. Moreover, our findings reveal that the standalone application of\nDR can yield promising increase in the full-match ratio across CodeLlama,\nGPT-3.5, and GPT-4 within our benchmark suite. We believe that our DRCodePilot\ninitiative heralds a novel human-in-the-loop avenue for advancing the field of\nAPR.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2024-08-22T01:13:02+00:00",
"url": "http://arxiv.org/pdf/2408.12056v2",
"resource_uri": "arxiv://2408.12056v2",
"citation_count": 0
},
{
"id": "2408.11710v1",
"title": "Leveraging Large Language Models for Enhancing the Understandability of Generated Unit Tests",
"authors": [
"Amirhossein Deljouyi",
"Roham Koohestani",
"Maliheh Izadi",
"Andy Zaidman"
],
"abstract": "Automated unit test generators, particularly search-based software testing\ntools like EvoSuite, are capable of generating tests with high coverage.\nAlthough these generators alleviate the burden of writing unit tests, they\noften pose challenges for software engineers in terms of understanding the\ngenerated tests. To address this, we introduce UTGen, which combines\nsearch-based software testing and large language models to enhance the\nunderstandability of automatically generated test cases. We achieve this\nenhancement through contextualizing test data, improving identifier naming, and\nadding descriptive comments. Through a controlled experiment with 32\nparticipants from both academia and industry, we investigate how the\nunderstandability of unit tests affects a software engineer's ability to\nperform bug-fixing tasks. We selected bug-fixing to simulate a real-world\nscenario that emphasizes the importance of understandable test cases. We\nobserve that participants working on assignments with UTGen test cases fix up\nto 33% more bugs and use up to 20% less time when compared to baseline test\ncases. From the post-test questionnaire, we gathered that participants found\nthat enhanced test names, test data, and variable names improved their\nbug-fixing process.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2024-08-21T15:35:34+00:00",
"url": "http://arxiv.org/pdf/2408.11710v1",
"resource_uri": "arxiv://2408.11710v1",
"citation_count": 0
},
{
"id": "2408.11296v1",
"title": "RePair: Automated Program Repair with Process-based Feedback",
"authors": [
"Yuze Zhao",
"Zhenya Huang",
"Yixiao Ma",
"Rui Li",
"Kai Zhang",
"Hao Jiang",
"Qi Liu",
"Linbo Zhu",
"Yu Su"
],
"abstract": "The gap between the trepidation of program reliability and the expense of\nrepairs underscores the indispensability of Automated Program Repair (APR). APR\nis instrumental in transforming vulnerable programs into more robust ones,\nbolstering program reliability while simultaneously diminishing the financial\nburden of manual repairs. Commercial-scale language models (LM) have taken APR\nto unprecedented levels. However, the emergence reveals that for models fewer\nthan 100B parameters, making single-step modifications may be difficult to\nachieve the desired effect. Moreover, humans interact with the LM through\nexplicit prompts, which hinders the LM from receiving feedback from compiler\nand test cases to automatically optimize its repair policies. In this\nliterature, we explore how small-scale LM (less than 20B) achieve excellent\nperformance through process supervision and feedback. We start by constructing\na dataset named CodeNet4Repair, replete with multiple repair records, which\nsupervises the fine-tuning of a foundational model. Building upon the\nencouraging outcomes of reinforcement learning, we develop a reward model that\nserves as a critic, providing feedback for the fine-tuned LM's action,\nprogressively optimizing its policy. During inference, we require the LM to\ngenerate solutions iteratively until the repair effect no longer improves or\nhits the maximum step limit. The results show that process-based not only\noutperforms larger outcome-based generation methods, but also nearly matches\nthe performance of closed-source commercial large-scale LMs.",
"categories": [
"cs.SE",
"cs.CL"
],
"published": "2024-08-21T02:53:23+00:00",
"url": "http://arxiv.org/pdf/2408.11296v1",
"resource_uri": "arxiv://2408.11296v1",
"citation_count": 0
},
{
"id": "2408.11081v2",
"title": "What can Large Language Models Capture about Code Functional Equivalence?",
"authors": [
"Nickil Maveli",
"Antonio Vergari",
"Shay B. Cohen"
],
"abstract": "Code-LLMs, LLMs pre-trained on large code corpora, have shown great progress\nin learning rich representations of the structure and syntax of code,\nsuccessfully using it to generate or classify code fragments. At the same time,\nunderstanding if they are able to do so because they capture code semantics,\nand how well, is still an open question. In this paper, we tackle this problem\nby introducing SeqCoBench, a benchmark for systematically assessing how\nCode-LLMs can capture code functional equivalence. SeqCoBench contains over 20\ncode transformations that either preserve or alter the semantics of Python\nprograms. We conduct extensive evaluations in different settings, including\nzero-shot and parameter-efficient finetuning methods on state-of-the-art\n(Code)-LLMs to see if they can discern semantically equivalent or different\npairs of programs in SeqCoBench. We find that the performance gap between these\nLLMs and classical match-based retrieval scores is minimal, with both\napproaches showing a concerning lack of depth in understanding code semantics.",
"categories": [
"cs.SE",
"cs.AI",
"cs.CL",
"cs.LG"
],
"published": "2024-08-20T11:19:06+00:00",
"url": "http://arxiv.org/pdf/2408.11081v2",
"resource_uri": "arxiv://2408.11081v2",
"citation_count": 0
},
{
"id": "2408.10486v2",
"title": "Revisiting Evolutionary Program Repair via Code Language Model",
"authors": [
"Yunan Wang",
"Tingyu Guo",
"Zilong Huang",
"Yuan Yuan"
],
"abstract": "Software defects are an inherent part of software development and\nmaintenance. To address these defects, Automated Program Repair (APR) has been\ndeveloped to fix bugs automatically. With the advent of Large Language Models,\nCode Language Models (CLMs) trained on code corpora excels in code generation,\nmaking them suitable for APR applications. Despite this progress, a significant\nlimitation remains: many bugs necessitate multi-point edits for repair, yet\ncurrent CLM-based APRs are restricted to single-point bug fixes, which severely\nnarrows the scope of repairable bugs. Moreover, these tools typically only\nconsider the direct context of the buggy line when building prompts for the\nCLM, leading to suboptimal repair outcomes due to the limited information\nprovided. This paper introduces a novel approach, ARJA-CLM, which integrates\nthe multiobjective evolutionary algorithm with CLM to fix multilocation bugs in\nJava projects. We also propose a context-aware prompt construction stratege,\nwhich enriches the prompt with additional information about accessible fields\nand methods for the CLM generating candidate statements. Our experiments on the\nDefects4J and APR-2024 competition benchmark demonstrate that ARJA-CLM\nsurpasses many state-of-the-art repair systems, and performs well on\nmulti-point bugs. The results also reveal that CLMs effectively utilize the\nprovided field and method information within context-aware prompts to produce\ncandidate statements.",
"categories": [
"cs.SE"
],
"published": "2024-08-20T01:57:45+00:00",
"url": "http://arxiv.org/pdf/2408.10486v2",
"resource_uri": "arxiv://2408.10486v2",
"citation_count": 0
},
{
"id": "2408.09657v1",
"title": "Impact of Large Language Models of Code on Fault Localization",
"authors": [
"Suhwan Ji",
"Sanghwa Lee",
"Changsup Lee",
"Hyeonseung Im",
"Yo-Sub Han"
],
"abstract": "Identifying the point of error is imperative in software debugging.\nTraditional fault localization (FL) techniques rely on executing the program\nand using the code coverage matrix in tandem with test case results to\ncalculate a suspiciousness score for each function or line. Recently,\nlearning-based FL techniques have harnessed machine learning models to extract\nmeaningful features from the code coverage matrix and improve FL performance.\nThese techniques, however, require compilable source code, existing test cases,\nand specialized tools for generating the code coverage matrix for each\nprogramming language of interest.\n In this paper, we propose, for the first time, a simple but effective\nsequence generation approach for fine-tuning large language models of code\n(LLMCs) for FL tasks. LLMCs have recently received much attention for various\nsoftware engineering problems. In line with these, we leverage the innate\nunderstanding of code that LLMCs have acquired through pre-training on large\ncode corpora. Specifically, we fine-tune representative encoder,\nencoder-decoder, and decoder-based 13 LLMCs for FL tasks. Unlike previous\napproaches, LLMCs can analyze code sequences even with syntactic errors, since\nthey do not rely on compiled input. Still, they have a limitation on the length\nof the input data. Therefore, for a fair comparison with existing FL\ntechniques, we extract methods with errors from the project-level benchmark,\nDefects4J, and analyze them at the line level. Experimental results show that\nLLMCs fine-tuned with our approach successfully pinpoint error positions in\n50.6\\%, 64.2\\%, and 72.3\\% of 1,291 methods in Defects4J for Top-1/3/5\nprediction, outperforming the best learning-based state-of-the-art technique by\nup to 1.35, 1.12, and 1.08 times, respectively. Our findings suggest promising\nresearch directions for FL and automated program repair tasks using LLMCs.",
"categories": [
"cs.SE"
],
"published": "2024-08-19T02:36:07+00:00",
"url": "http://arxiv.org/pdf/2408.09657v1",
"resource_uri": "arxiv://2408.09657v1",
"citation_count": 0
},
{
"id": "2408.09568v3",
"title": "MergeRepair: An Exploratory Study on Merging Task-Specific Adapters in Code LLMs for Automated Program Repair",
"authors": [
"Meghdad Dehghan",
"Jie JW Wu",
"Fatemeh H. Fard",
"Ali Ouni"
],
"abstract": "Large Language Models (LLMs) have shown high capabilities in several software\ndevelopment-related tasks such as program repair, documentation, code\nrefactoring, debugging, and testing. However, training these models requires\nmassive amount of data and significant computational resources. Adapters are\nspecialized, small modules designed for parameter efficient fine-tuning of LLMs\nfor specific tasks, domains, or applications without requiring extensive\nretraining of the entire model. These adapters offer a more efficient way to\ncustomize LLMs for particular needs, leveraging the pre-existing capabilities\nof the large model. Model (and adapter) merging have emerged as a technique to\ndevelop one model capable of multiple tasks, with minimal or no training\nrequired. Although model and adapter merging has shown promising performance in\ndomains such as natural language processing and computer vision, its\napplicability to software engineering tasks remains underexplored. In this\npaper, we investigate the effectiveness of merged adapters within the context\nof software engineering, with a particular focus on the Automated Program\nRepair (APR) task, through our approach, MergeRepair. In particular, we merge\nmultiple task-specific adapters using three different merging methods,\nincluding weight-averaging, ties, and dare-ties, and evaluate the performance\nof the merged adapter on the APR task. We introduce a continual merging\napproach, a novel method in which we sequentially merge the task-specific\nadapters where the order and weight of the merged adapters play a significant\nrole. We further compare the performance of our approach with a baseline method\nconsisting of equal-weight merging applied on parameters of different adapters,\nwhere all adapters are of equal importance.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2024-08-18T18:45:48+00:00",
"url": "http://arxiv.org/pdf/2408.09568v3",
"resource_uri": "arxiv://2408.09568v3",
"citation_count": 0
},
{
"id": "2408.09078v1",
"title": "An Exploratory Study on Fine-Tuning Large Language Models for Secure Code Generation",
"authors": [
"Junjie Li",
"Fazle Rabbi",
"Cheng Cheng",
"Aseem Sangalay",
"Yuan Tian",
"Jinqiu Yang"
],
"abstract": "AI-powered coding assistants such as GitHub Copilot and OpenAI ChatGPT have\nachieved notable success in automating code generation. However, these tools\nrely on pre-trained Large Language Models (LLMs) that are typically trained on\nhuman-written code sourced from open-source project hosting sites like GitHub,\nwhich often contains inherent security vulnerabilities. These vulnerabilities\nmay then be mirrored in the code generated by these LLMs, a critical risk\nrevealed and highlighted by recent empirical studies. In this work, we present\nan exploratory study on whether fine-tuning pre-trained LLMs on datasets of\nvulnerability-fixing commits can promote secure code generation. We explored\ntwo parameter-efficient fine-tuning techniques (LoRa and IA3) on two\npre-trained LLMs for code generation. We crawled a fine-tuning dataset (14,622\nC and C++ files) for secure code generation by collecting code fixes of\nconfirmed vulnerabilities from open-source repositories. Our evaluation dataset\ncomprises 52 vulnerability scenarios designed to cover the top most dangerous C\nand C++ Common Weakness Enumerations (CWEs). Each scenario is a prompt that may\ninduce LLMs to generate vulnerable code. Our exploration reveals that\nfine-tuning LLMs can improve secure code generation by 6.4% in C language and\n5.4% in C++ language. We further experimented with fine-tuning LLMs using\ndifferent versions of the collected secure code dataset (block, function, and\nline). We found that fine-tuning with function-level and block-level datasets\nachieves the best secure code generation performance, compared to the\nalternatives (file-level and line-level).",
"categories": [
"cs.SE",
"D.2.0"
],
"published": "2024-08-17T02:51:27+00:00",
"url": "http://arxiv.org/pdf/2408.09078v1",
"resource_uri": "arxiv://2408.09078v1",
"citation_count": 0
},
{
"id": "2408.05006v3",
"title": "COAST: Enhancing the Code Debugging Ability of LLMs through Communicative Agent Based Data Synthesis",
"authors": [
"Weiqing Yang",
"Hanbin Wang",
"Zhenghao Liu",
"Xinze Li",
"Yukun Yan",
"Shuo Wang",
"Yu Gu",
"Minghe Yu",
"Zhiyuan Liu",
"Ge Yu"
],
"abstract": "Code debugging is a vital stage of software development, essential for\nensuring the reliability and performance of Large Language Models (LLMs) in the\ncode generation task. Human debugging typically follows a multi-stage process,\nwhich includes Bug Localization, Bug Identification, Code Repair, and Code\nRecognition. However, existing code debugging benchmarks predominantly focus on\nthe Code Repair stage, which offers only a limited perspective on evaluating\nthe debugging capabilities of LLMs. In this paper, we introduce DEBUGEVAL, a\ncomprehensive benchmark for evaluating the debugging abilities of LLMs by\nemulating the multi-stage human debugging process. Through evaluating on\nDEBUGEVAL, we observe that 7B-scale models consistently underperform compared\nto their larger counterparts, highlighting their limitations in comprehending\ncode semantics. In this case, we propose the COmmunicative Agent-based data\nSynThesis (COAST) framework, which employs a multi-agent system to generate\nhigh-quality training data for supervised fine-tuning (SFT). Experimental\nresults demonstrate that COAST-generated data outperform human-curated and\nGPT-4-generated data, enabling 7B-scale LLMs to achieve debugging performance\ncomparable to GPT-3.5. All data and codes are available at\nhttps://github.com/NEUIR/COAST.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2024-08-09T11:35:44+00:00",
"url": "http://arxiv.org/pdf/2408.05006v3",
"resource_uri": "arxiv://2408.05006v3",
"citation_count": 0
},
{
"id": "2408.03827v1",
"title": "Automated Code Fix Suggestions for Accessibility Issues in Mobile Apps",
"authors": [
"Forough Mehralian",
"Titus Barik",
"Jeff Nichols",
"Amanda Swearngin"
],
"abstract": "Accessibility is crucial for inclusive app usability, yet developers often\nstruggle to identify and fix app accessibility issues due to a lack of\nawareness, expertise, and inadequate tools. Current accessibility testing tools\ncan identify accessibility issues but may not always provide guidance on how to\naddress them. We introduce FixAlly, an automated tool designed to suggest\nsource code fixes for accessibility issues detected by automated accessibility\nscanners. FixAlly employs a multi-agent LLM architecture to generate fix\nstrategies, localize issues within the source code, and propose code\nmodification suggestions to fix the accessibility issue. Our empirical study\ndemonstrates FixAlly's capability in suggesting fixes that resolve issues found\nby accessibility scanners -- with an effectiveness of 77% in generating\nplausible fix suggestions -- and our survey of 12 iOS developers finds they\nwould be willing to accept 69.4% of evaluated fix suggestions.",
"categories": [
"cs.SE",
"cs.AI",
"cs.HC",
"D.2.5; I.2"
],
"published": "2024-08-07T15:06:07+00:00",
"url": "http://arxiv.org/pdf/2408.03827v1",
"resource_uri": "arxiv://2408.03827v1",
"citation_count": 0
},
{
"id": "2408.02710v1",
"title": "RCDM: Enabling Robustness for Conditional Diffusion Model",
"authors": [
"Weifeng Xu",
"Xiang Zhu",
"Xiaoyong Li"
],
"abstract": "The conditional diffusion model (CDM) enhances the standard diffusion model\nby providing more control, improving the quality and relevance of the outputs,\nand making the model adaptable to a wider range of complex tasks. However,\ninaccurate conditional inputs in the inverse process of CDM can easily lead to\ngenerating fixed errors in the neural network, which diminishes the\nadaptability of a well-trained model. The existing methods like data\naugmentation, adversarial training, robust optimization can improve the\nrobustness, while they often face challenges such as high computational\ncomplexity, limited applicability to unknown perturbations, and increased\ntraining difficulty. In this paper, we propose a lightweight solution, the\nRobust Conditional Diffusion Model (RCDM), based on control theory to\ndynamically reduce the impact of noise and significantly enhance the model's\nrobustness. RCDM leverages the collaborative interaction between two neural\nnetworks, along with optimal control strategies derived from control theory, to\noptimize the weights of two networks during the sampling process. Unlike\nconventional techniques, RCDM establishes a mathematical relationship between\nfixed errors and the weights of the two neural networks without incurring\nadditional computational overhead. Extensive experiments were conducted on\nMNIST and CIFAR-10 datasets, and the results demonstrate the effectiveness and\nadaptability of our proposed model.",
"categories": [
"cs.LG",
"cs.CV"
],
"published": "2024-08-05T13:12:57+00:00",
"url": "http://arxiv.org/pdf/2408.02710v1",
"resource_uri": "arxiv://2408.02710v1",
"citation_count": 0
},
{
"id": "2408.02232v4",
"title": "SpecRover: Code Intent Extraction via LLMs",
"authors": [
"Haifeng Ruan",
"Yuntong Zhang",
"Abhik Roychoudhury"
],
"abstract": "Autonomous program improvement typically involves automatically producing bug\nfixes and feature additions. Such program improvement can be accomplished by a\ncombination of large language model (LLM) and program analysis capabilities, in\nthe form of an LLM agent. Since program repair or program improvement typically\nrequires a specification of intended behavior - specification inference can be\nuseful for producing high quality program patches. In this work, we examine\nefficient and low-cost workflows for iterative specification inference within\nan LLM agent. Given a GitHub issue to be resolved in a software project, our\ngoal is to conduct iterative code search accompanied by specification inference\n- thereby inferring intent from both the project structure and behavior. The\nintent thus captured is examined by a reviewer agent with the goal of vetting\nthe patches as well as providing a measure of confidence in the vetted patches.\nOur approach SpecRover (AutoCodeRover-v2) is built on the open-source LLM agent\nAutoCodeRover. In an evaluation on the full SWE-Bench consisting of 2294 GitHub\nissues, it shows more than 50% improvement in efficacy over AutoCodeRover.\nCompared to the open-source agents available, our work shows modest cost ($0.65\nper issue) in resolving an average GitHub issue in SWE-Bench lite. The\nproduction of explanation by SpecRover allows for a better \"signal\" to be given\nto the developer, on when the suggested patches can be accepted with\nconfidence. SpecRover also seeks to demonstrate the continued importance of\nspecification inference in automated program repair, even as program repair\ntechnologies enter the LLM era.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2024-08-05T04:53:01+00:00",
"url": "http://arxiv.org/pdf/2408.02232v4",
"resource_uri": "arxiv://2408.02232v4",
"citation_count": 0
},
{
"id": "2408.01801v1",
"title": "Introducing Bidirectional Programming in Constructive Solid Geometry-Based CAD",
"authors": [
"J. Felipe Gonzalez",
"Danny Kieken",
"Thomas Pietrzak",
"Audrey Girouard",
"Géry Casiez"
],
"abstract": "3D Computer-Aided Design (CAD) users need to overcome several obstacles to\nbenefit from the flexibility of programmatic interface tools. Besides the\nbarriers of any programming language, users face challenges inherent to 3D\nspatial interaction. Scripting simple operations, such as moving an element in\n3D space, can be significantly more challenging than performing the same task\nusing direct manipulation. We introduce the concept of bidirectional\nprogramming for Constructive Solid Geometry (CSG) CAD tools, informed by\ninterviews we performed with programmatic interface users. We describe how\nusers can navigate and edit the 3D model using direct manipulation in the view\nor code editing while the system ensures consistency between both spaces. We\nalso detail a proof-of-concept implementation using a modified version of\nOpenSCAD.",
"categories": [
"cs.HC"
],
"published": "2024-08-03T15:05:19+00:00",
"url": "http://arxiv.org/pdf/2408.01801v1",
"resource_uri": "arxiv://2408.01801v1",
"citation_count": 0
},
{
"id": "2408.01733v1",
"title": "CoEdPilot: Recommending Code Edits with Learned Prior Edit Relevance, Project-wise Awareness, and Interactive Nature",
"authors": [
"Chenyan Liu",
"Yufan Cai",
"Yun Lin",
"Yuhuan Huang",
"Yunrui Pei",
"Bo Jiang",
"Ping Yang",
"Jin Song Dong",
"Hong Mei"
],
"abstract": "Recent years have seen the development of LLM-based code generation. Compared\nto generating code in a software project, incremental code edits are\nempirically observed to be more frequent. The emerging code editing approaches\nusually formulate the problem as generating an edit based on known relevant\nprior edits and context. However, practical code edits can be more complicated.\nFirst, an editing session can include multiple (ir)relevant edits to the code\nunder edit. Second, the inference of the subsequent edits is non-trivial as the\nscope of its ripple effect can be the whole project. In this work, we propose\nCoEdPilot, an LLM-driven solution to recommend code edits by discriminating the\nrelevant edits, exploring their interactive natures, and estimating its ripple\neffect in the project. Specifically, CoEdPilot orchestrates multiple neural\ntransformers to identify what and how to edit in the project regarding both\nedit location and edit content. When a user accomplishes an edit with an\noptional editing description, a Subsequent Edit Analysis first reports the most\nrelevant files in the project with what types of edits (e.g., keep, insert, and\nreplace) can happen for each line of their code. Next, an Edit-content\nGenerator generates concrete edit options for the lines of code, regarding its\nrelevant prior changes reported by an Edit-dependency Analyzer. Lastly, both\nthe Subsequent Edit Analysis and the Edit-content Generator capture relevant\nprior edits as feedback to readjust their recommendations. We train our models\nby collecting over 180K commits from 471 open-source projects in 5 programming\nlanguages. Our extensive experiments show that CoEdPilot can well predict the\nedits (i.e., predicting edit location with an accuracy of 70.8%-85.3%, and the\nedit content with an exact match rate of 41.8% and BLEU4 score of 60.7)...",
"categories": [
"cs.SE"
],
"published": "2024-08-03T10:23:05+00:00",
"url": "http://arxiv.org/pdf/2408.01733v1",
"resource_uri": "arxiv://2408.01733v1",
"citation_count": 0
},
{
"id": "2407.20898v1",
"title": "ThinkRepair: Self-Directed Automated Program Repair",
"authors": [
"Xin Yin",
"Chao Ni",
"Shaohua Wang",
"Zhenhao Li",
"Limin Zeng",
"Xiaohu Yang"
],
"abstract": "Though many approaches have been proposed for Automated Program Repair (APR)\nand indeed achieved remarkable performance, they still have limitations in\nfixing bugs that require analyzing and reasoning about the logic of the buggy\nprogram. Recently, large language models (LLMs) instructed by prompt\nengineering have attracted much attention for their powerful ability to address\nmany kinds of tasks including bug-fixing. However, the quality of the prompt\nwill highly affect the ability of LLMs and manually constructing high-quality\nprompts is a costly endeavor.\n To address this limitation, we propose a self-directed LLM-based automated\nprogram repair, ThinkRepair, with two main phases: collection phase and fixing\nphase. The former phase automatically collects various chains of thoughts that\nconstitute pre-fixed knowledge by instructing LLMs with the Chain-of-Thought\n(CoT) prompt. The latter phase targets fixing a bug by first selecting examples\nfor few-shot learning and second automatically interacting with LLMs,\noptionally appending with feedback of testing information.\n Evaluations on two widely studied datasets (Defects4J and QuixBugs) by\ncomparing ThinkRepair with 12 SOTA APRs indicate the priority of ThinkRepair in\nfixing bugs. Notably, ThinkRepair fixes 98 bugs and improves baselines by\n27%-344.4% on Defects4J V1.2. On Defects4J V2.0, ThinkRepair fixes 12-65 more\nbugs than the SOTA APRs. Additionally, ThinkRepair also makes a considerable\nimprovement on QuixBugs (31 for Java and 21 for Python at most).",
"categories": [
"cs.SE"
],
"published": "2024-07-30T15:17:07+00:00",
"url": "http://arxiv.org/pdf/2407.20898v1",
"resource_uri": "arxiv://2407.20898v1",
"citation_count": 0
},
{
"id": "2407.19055v1",
"title": "Effective Large Language Model Debugging with Best-first Tree Search",
"authors": [
"Jialin Song",
"Jonathan Raiman",
"Bryan Catanzaro"
],
"abstract": "Large Language Models (LLMs) show promise in code generation tasks. However,\ntheir code-writing abilities are often limited in scope: while they can\nsuccessfully implement simple functions, they struggle with more complex tasks.\nA fundamental difference with how an LLM writes code, compared to a human\nprogrammer, is that it cannot consistently spot and fix bugs. Debugging is a\ncrucial skill for programmers and it enables iterative code refinement towards\na correct implementation. In this work, we propose a novel algorithm to enable\nLLMs to debug their code via self-reflection and search where a model attempts\nto identify its previous mistakes. Our key contributions are 1) a best-first\ntree search algorithm with self-reflections (BESTER) that achieves\nstate-of-the-art Pass@1 in three code generation benchmarks. BESTER maintains\nits superiority when we measure pass rates taking into account additional\ninference costs incurred by tree search. 2) A novel interpretability study on\nwhat self-reflections attend to in buggy programs and how they impact bug\nfixes, which provides a deeper understanding of the debugging process. 3) An\nextensive study on when self-reflections are effective in finding bugs.",
"categories": [
"cs.SE",
"cs.AI",
"cs.LG"
],
"published": "2024-07-26T19:26:00+00:00",
"url": "http://arxiv.org/pdf/2407.19055v1",
"resource_uri": "arxiv://2407.19055v1",
"citation_count": 0
},
{
"id": "2407.18423v1",
"title": "HDL-GPT: High-Quality HDL is All You Need",
"authors": [
"Bhuvnesh Kumar",
"Saurav Nanda",
"Ganapathy Parthasarathy",
"Pawan Patil",
"Austin Tsai",
"Parivesh Choudhary"
],
"abstract": "This paper presents Hardware Description Language Generative Pre-trained\nTransformers (HDL-GPT), a novel approach that leverages the vast repository of\nopen-source High Definition Language (HDL) codes to train superior quality\nlarge code models. The core premise of this paper is the hypothesis that\nhigh-quality HDL is all you need to create models with exceptional performance\nand broad zero-shot generalization abilities. The paper elucidates the methods\nemployed for the curation and augmentation of large corpora from open-source\nHDL code, transforming highly variable quality data into high-quality data\nthrough careful prompting and context maintenance. We demonstrate that the\ncareful selection, filtering, and augmentation of data across HDLs can yield\npowerful models that surpass current state-of-the-art models. We also explore\nthe impact of different fine-tuning methods on the quality of results. We\ndescribe experimental results across a range of fine-tuned SOTA LLMs,\nsubstantiating our claims. We demonstrate improvements of 50% to 200% over SOTA\nHDL models on current benchmarks in tasks ranging from HDL circuit\nexplanations, code generation, formal and simulation testbench creation,\ntriaging bugs, and fixing them. HDL-GPT opens new avenues for the development\nof advanced model training techniques for circuit design tasks.",
"categories": [
"cs.LG",
"cs.AI"
],
"published": "2024-07-25T22:48:08+00:00",
"url": "http://arxiv.org/pdf/2407.18423v1",
"resource_uri": "arxiv://2407.18423v1",
"citation_count": 0
},
{
"id": "2407.16667v1",
"title": "RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent",
"authors": [
"Huiyu Xu",
"Wenhui Zhang",
"Zhibo Wang",
"Feng Xiao",
"Rui Zheng",
"Yunhe Feng",
"Zhongjie Ba",
"Kui Ren"
],
"abstract": "Recently, advanced Large Language Models (LLMs) such as GPT-4 have been\nintegrated into many real-world applications like Code Copilot. These\napplications have significantly expanded the attack surface of LLMs, exposing\nthem to a variety of threats. Among them, jailbreak attacks that induce toxic\nresponses through jailbreak prompts have raised critical safety concerns. To\nidentify these threats, a growing number of red teaming approaches simulate\npotential adversarial scenarios by crafting jailbreak prompts to test the\ntarget LLM. However, existing red teaming methods do not consider the unique\nvulnerabilities of LLM in different scenarios, making it difficult to adjust\nthe jailbreak prompts to find context-specific vulnerabilities. Meanwhile,\nthese methods are limited to refining jailbreak templates using a few mutation\noperations, lacking the automation and scalability to adapt to different\nscenarios. To enable context-aware and efficient red teaming, we abstract and\nmodel existing attacks into a coherent concept called \"jailbreak strategy\" and\npropose a multi-agent LLM system named RedAgent that leverages these strategies\nto generate context-aware jailbreak prompts. By self-reflecting on contextual\nfeedback in an additional memory buffer, RedAgent continuously learns how to\nleverage these strategies to achieve effective jailbreaks in specific contexts.\nExtensive experiments demonstrate that our system can jailbreak most black-box\nLLMs in just five queries, improving the efficiency of existing red teaming\nmethods by two times. Additionally, RedAgent can jailbreak customized LLM\napplications more efficiently. By generating context-aware jailbreak prompts\ntowards applications on GPTs, we discover 60 severe vulnerabilities of these\nreal-world applications with only two queries per vulnerability. We have\nreported all found issues and communicated with OpenAI and Meta for bug fixes.",
"categories": [
"cs.CR",
"cs.AI",
"cs.CL"
],
"published": "2024-07-23T17:34:36+00:00",
"url": "http://arxiv.org/pdf/2407.16667v1",
"resource_uri": "arxiv://2407.16667v1",
"citation_count": 0
},
{
"id": "2407.16557v3",
"title": "Patched RTC: evaluating LLMs for diverse software development tasks",
"authors": [
"Asankhaya Sharma"
],
"abstract": "This paper introduces Patched Round-Trip Correctness (Patched RTC), a novel\nevaluation technique for Large Language Models (LLMs) applied to diverse\nsoftware development tasks, particularly focusing on \"outer loop\" activities\nsuch as bug fixing, code review, and documentation updates. Patched RTC extends\nthe original Round-Trip Correctness method to work with any LLM and downstream\ntask, offering a self-evaluating framework that measures consistency and\nrobustness of model responses without human intervention. The study\ndemonstrates a correlation between Patched RTC scores and task-specific\naccuracy metrics, presenting it as an alternative to the LLM-as-Judge paradigm\nfor open-domain task evaluation. We implement Patched RTC in an open-source\nframework called patchwork, allowing for transparent evaluation during\ninference across various patchflows. Experiments comparing GPT-3.5 and GPT-4\nmodels across different software development tasks reveal that Patched RTC\neffectively distinguishes model performance and task difficulty. The paper also\nexplores the impact of consistency prompts on improving model accuracy,\nsuggesting that Patched RTC can guide prompt refinement and model selection for\ncomplex software development workflows.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2024-07-23T15:12:14+00:00",
"url": "http://arxiv.org/pdf/2407.16557v3",
"resource_uri": "arxiv://2407.16557v3",
"citation_count": 0
},
{
"id": "2407.14435v3",
"title": "Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders",
"authors": [
"Senthooran Rajamanoharan",
"Tom Lieberum",
"Nicolas Sonnerat",
"Arthur Conmy",
"Vikrant Varma",
"János Kramár",
"Neel Nanda"
],
"abstract": "Sparse autoencoders (SAEs) are a promising unsupervised approach for\nidentifying causally relevant and interpretable linear features in a language\nmodel's (LM) activations. To be useful for downstream tasks, SAEs need to\ndecompose LM activations faithfully; yet to be interpretable the decomposition\nmust be sparse -- two objectives that are in tension. In this paper, we\nintroduce JumpReLU SAEs, which achieve state-of-the-art reconstruction fidelity\nat a given sparsity level on Gemma 2 9B activations, compared to other recent\nadvances such as Gated and TopK SAEs. We also show that this improvement does\nnot come at the cost of interpretability through manual and automated\ninterpretability studies. JumpReLU SAEs are a simple modification of vanilla\n(ReLU) SAEs -- where we replace the ReLU with a discontinuous JumpReLU\nactivation function -- and are similarly efficient to train and run. By\nutilising straight-through-estimators (STEs) in a principled manner, we show\nhow it is possible to train JumpReLU SAEs effectively despite the discontinuous\nJumpReLU function introduced in the SAE's forward pass. Similarly, we use STEs\nto directly train L0 to be sparse, instead of training on proxies such as L1,\navoiding problems like shrinkage.",
"categories": [
"cs.LG"
],
"published": "2024-07-19T16:07:19+00:00",
"url": "http://arxiv.org/pdf/2407.14435v3",
"resource_uri": "arxiv://2407.14435v3",
"citation_count": 0
},
{
"id": "2407.14044v2",
"title": "ECCO: Can We Improve Model-Generated Code Efficiency Without Sacrificing Functional Correctness?",
"authors": [
"Siddhant Waghjale",
"Vishruth Veerendranath",
"Zora Zhiruo Wang",
"Daniel Fried"
],
"abstract": "Although large language models (LLMs) have been largely successful in\ngenerating functionally correct programs, conditioning models to produce\nefficient solutions while ensuring correctness remains a challenge. Further,\nunreliability in benchmarking code efficiency is a hurdle across varying\nhardware specifications for popular interpreted languages such as Python. In\nthis paper, we present ECCO, a reproducible benchmark for evaluating program\nefficiency via two paradigms: natural language (NL) based code generation and\nhistory-based code editing. On ECCO, we adapt and thoroughly investigate the\nthree most promising existing LLM-based approaches: in-context learning,\niterative refinement with execution or NL feedback, and fine-tuning conditioned\non execution and editing history. While most methods degrade functional\ncorrectness and moderately increase program efficiency, we find that adding\nexecution information often helps maintain functional correctness, and NL\nfeedback enhances more on efficiency. We release our benchmark to support\nfuture work on LLM-based generation of efficient code.",
"categories": [
"cs.CL",
"cs.AI"
],
"published": "2024-07-19T05:47:40+00:00",
"url": "http://arxiv.org/pdf/2407.14044v2",
"resource_uri": "arxiv://2407.14044v2",
"citation_count": 0
},
{
"id": "2407.11438v2",
"title": "Trust No Bot: Discovering Personal Disclosures in Human-LLM Conversations in the Wild",
"authors": [
"Niloofar Mireshghallah",
"Maria Antoniak",
"Yash More",
"Yejin Choi",
"Golnoosh Farnadi"
],
"abstract": "Measuring personal disclosures made in human-chatbot interactions can provide\na better understanding of users' AI literacy and facilitate privacy research\nfor large language models (LLMs). We run an extensive, fine-grained analysis on\nthe personal disclosures made by real users to commercial GPT models,\ninvestigating the leakage of personally identifiable and sensitive information.\nTo understand the contexts in which users disclose to chatbots, we develop a\ntaxonomy of tasks and sensitive topics, based on qualitative and quantitative\nanalysis of naturally occurring conversations. We discuss these potential\nprivacy harms and observe that: (1) personally identifiable information (PII)\nappears in unexpected contexts such as in translation or code editing (48% and\n16% of the time, respectively) and (2) PII detection alone is insufficient to\ncapture the sensitive topics that are common in human-chatbot interactions,\nsuch as detailed sexual preferences or specific drug use habits. We believe\nthat these high disclosure rates are of significant importance for researchers\nand data curators, and we call for the design of appropriate nudging mechanisms\nto help users moderate their interactions.",
"categories": [
"cs.CL"
],
"published": "2024-07-16T07:05:31+00:00",
"url": "http://arxiv.org/pdf/2407.11438v2",
"resource_uri": "arxiv://2407.11438v2",
"citation_count": 0
},
{
"id": "2407.09593v3",
"title": "Non-universal probes of composite Higgs models: New bounds and prospects for FCC-ee",
"authors": [
"Ben A. Stefanek"
],
"abstract": "We study the leading loop-level phenomenology of composite Higgs models via\nthe effective field theory of a strongly interacting light Higgs and top quark\n(SILH+TQ). We systematically analyze the renormalization group evolution (RGE)\nof tree-generated operators in the SILH+TQ scenario, finding large mixings of\nflavor non-universal operators into those affecting electroweak precision\nobservables. We show that these model-independent RG contributions are more\nimportant than typical estimates for finite matching terms. Flavor\nnon-universal effects are completely captured by examining three options for\nthe top mixing: fully composite $q_L^3$, equal compositeness, and fully\ncomposite $t_R$. In the most phenomenologically viable case of a fully\ncomposite $t_R$, we show that the strongest bound on the natural parameter\nspace comes from a 2-loop double-log contribution of the 4-top operator $O_{tt}\n= (\\bar t_R \\gamma_\\mu t_R)(\\bar t_R \\gamma^\\mu t_R)$ into the Peskin-Takeuchi\n$T$ parameter. In general, we find that this 2-loop effect allows existing\nelectroweak precision data to give better constraints on 4-top operators than\nhigh-energy probes from top production at the LHC. Independent of the top\nmixing, we find that a future tera-$Z$ machine such as FCC-ee has the potential\nto probe composite Higgs models up to a scale of $m_* \\gtrsim 25$ TeV, and test\nthe naturalness of the electroweak scale at the $\\lesssim 10^{-4}$ level.",
"categories": [
"hep-ph"
],
"published": "2024-07-12T18:00:00+00:00",
"url": "http://arxiv.org/pdf/2407.09593v3",
"resource_uri": "arxiv://2407.09593v3",
"citation_count": 0
},
{
"id": "2407.08958v1",
"title": "Towards Practical and Useful Automated Program Repair for Debugging",
"authors": [
"Qi Xin",
"Haojun Wu",
"Steven P. Reiss",
"Jifeng Xuan"
],
"abstract": "Current automated program repair (APR) techniques are far from being\npractical and useful enough to be considered for realistic debugging. They rely\non unrealistic assumptions including the requirement of a comprehensive suite\nof test cases as the correctness criterion and frequent program re-execution\nfor patch validation; they are not fast; and their ability of repairing the\ncommonly arising complex bugs by fixing multiple locations of the program is\nvery limited. We hope to substantially improve APR's practicality,\neffectiveness, and usefulness to help people debug. Towards this goal, we\nenvision PracAPR, an interactive repair system that works in an Integrated\nDevelopment Environment (IDE) to provide effective repair suggestions for\ndebugging. PracAPR does not require a test suite or program re-execution. It\nassumes that the developer uses an IDE debugger and the program has suspended\nat a location where a problem is observed. It interacts with the developer to\nobtain a problem specification. Based on the specification, it performs\ntest-free, flow-analysis-based fault localization, patch generation that\ncombines large language model-based local repair and tailored strategy-driven\nglobal repair, and program re-execution-free patch validation based on\nsimulated trace comparison to suggest repairs. By having PracAPR, we hope to\ntake a significant step towards making APR useful and an everyday part of\ndebugging.",
"categories": [
"cs.SE"
],
"published": "2024-07-12T03:19:54+00:00",
"url": "http://arxiv.org/pdf/2407.08958v1",
"resource_uri": "arxiv://2407.08958v1",
"citation_count": 0
},
{
"id": "2407.05366v2",
"title": "An activity transition in FRB 20201124A: methodological rigor, detection of frequency-dependent cessation, and a geometric magnetar model",
"authors": [
"A. V. Bilous",
"J. van Leeuwen",
"Y. Maan",
"I. Pastor-Marazuela",
"L. C. Oostrum",
"K. M. Rajwade",
"Y. Y. Wang"
],
"abstract": "We report detections of fast radio bursts (FRBs) from the repeating source\nFRB 20201124A with Apertif/WSRT and GMRT, and measurements of basic burst\nproperties, especially the dispersion measure (DM) and fluence. Based on\ncomparisons of these properties with previously published larger samples, we\nargue that the excess DM reported earlier for pulses with integrated signal to\nnoise ratio $\\lesssim 1000$ is due to incompletely accounting for the so-called\nsad trombone effect, even when using structure-maximizing DM algorithms. Our\ninvestigations of fluence distributions next lead us to advise against formal\npower-law fitting, especially dissuading the use of the least-square method,\nand we demonstrate the large biases involved. A maximum likelihood estimator\n(MLE) provides a much more accurate estimate of the power law and we provide\naccessible code for direct inclusion in future research. Our GMRT observations\nwere fortuitously scheduled around the end of the activity cycle as recorded by\nFAST. We detected several bursts (one of them very strong) at 400/600 MHz, a\nfew hours after sensitive FAST non-detections already showed the 1.3 GHz FRB\nemission to have ceased. After FRB 20180916B, this is a second example of a\nfrequency-dependent activity window identified in a repeating FRB source. Since\nnumerous efforts have so-far failed to determine a spin period for FRB\n20201124A, we conjecture it to be an ultra-long period magnetar, with a period\non the scale of months, and with a very wide, highly irregular duty cycle.\nAssuming the emission comes from closed field lines, we use radius-to-frequency\nmapping and polarization information from other studies to constrain the\nmagnetospheric geometry and location of the emission region. Our initial\nfindings are consistent with a possible connection between FRBs and crustal\nmotion events.",
"categories": [
"astro-ph.HE",
"astro-ph.GA"
],
"published": "2024-07-07T13:40:43+00:00",
"url": "http://arxiv.org/pdf/2407.05366v2",
"resource_uri": "arxiv://2407.05366v2",
"citation_count": 0
},
{
"id": "2407.03889v1",
"title": "Automated C/C++ Program Repair for High-Level Synthesis via Large Language Models",
"authors": [
"Kangwei Xu",
"Grace Li Zhang",
"Xunzhao Yin",
"Cheng Zhuo",
"Ulf Schlichtmann",
"Bing Li"
],
"abstract": "In High-Level Synthesis (HLS), converting a regular C/C++ program into its\nHLS-compatible counterpart (HLS-C) still requires tremendous manual effort.\nVarious program scripts have been introduced to automate this process. But the\nresulting codes usually contain many issues that should be manually repaired by\ndevelopers. Since Large Language Models (LLMs) have the ability to automate\ncode generation, they can also be used for automated program repair in HLS.\nHowever, due to the limited training of LLMs considering hardware and software\nsimultaneously, hallucinations may occur during program repair using LLMs,\nleading to compilation failures. Besides, using LLMs for iterative repair also\nincurs a high cost. To address these challenges, we propose an LLM-driven\nprogram repair framework that takes regular C/C++ code as input and\nautomatically generates its corresponding HLS-C code for synthesis while\nminimizing human repair effort. To mitigate the hallucinations in LLMs and\nenhance the prompt quality, a Retrieval-Augmented Generation (RAG) paradigm is\nintroduced to guide the LLMs toward correct repair. In addition, we use LLMs to\ncreate a static bit width optimization program to identify the optimized bit\nwidths for variables. Moreover, LLM-driven HLS optimization strategies are\nintroduced to add/tune pragmas in HLS-C programs for circuit optimization.\nExperimental results demonstrate that the proposed LLM-driven automated\nframework can achieve much higher repair pass rates in 24 real-world\napplications compared with the traditional scripts and the direct application\nof LLMs for program repair.",
"categories": [
"eess.SY",
"cs.SY"
],
"published": "2024-07-04T12:29:06+00:00",
"url": "http://arxiv.org/pdf/2407.03889v1",
"resource_uri": "arxiv://2407.03889v1",
"citation_count": 0
},
{
"id": "2407.03625v2",
"title": "Fix the Tests: Augmenting LLMs to Repair Test Cases with Static Collector and Neural Reranker",
"authors": [
"Jun Liu",
"Jiwei Yan",
"Yuanyuan Xie",
"Jun Yan",
"Jian Zhang"
],
"abstract": "During software evolution, it is advocated that test code should co-evolve\nwith production code. In real development scenarios, test updating may lag\nbehind production code changing, which may cause compilation failure or bring\nother troubles. Existing techniques based on pre-trained language models can be\ndirectly adopted to repair obsolete tests caused by such unsynchronized code\nchanges, especially syntactic-related ones. However, the lack of task-oriented\ncontextual information affects the repair accuracy on large-scale projects.\nStarting from an obsolete test, the key challenging task is precisely\nidentifying and constructing Test-Repair-Oriented Contexts (TROCtxs) from the\nwhole repository within a limited token size. In this paper, we propose SYNTER\n(SYNtactic-breaking-changes-induced TEst Repair), a novel approach based on\nLLMs to automatically repair obsolete test cases via precise and concise\nTROCtxs construction. Inspired by developers' programming practices, we design\nthree types of TROCtx: class context, usage context, and environment context.\nGiven an obsolete test case to repair, SYNTER firstly collects the related code\ninformation for each type of TROCtx through static analysis techniques\nautomatically. Then, it generates reranking queries to identify the most\nrelevant TROCtxs, which will be taken as the repair-required key contexts and\nbe input to the large language model for the final test repair. To evaluate the\neffectiveness of SYNTER, we construct a benchmark dataset that contains a set\nof obsolete tests caused by syntactic breaking changes. The experimental\nresults show that SYNTER outperforms baseline approaches both on textual- and\nintent-matching metrics. With the augmentation of constructed TROCtxs,\nhallucinations are reduced by 57.1%.",
"categories": [
"cs.SE"
],
"published": "2024-07-04T04:24:43+00:00",
"url": "http://arxiv.org/pdf/2407.03625v2",
"resource_uri": "arxiv://2407.03625v2",
"citation_count": 0
},
{
"id": "2407.03611v1",
"title": "An Empirical Study on Capability of Large Language Models in Understanding Code Semantics",
"authors": [
"Thu-Trang Nguyen",
"Thanh Trong Vu",
"Hieu Dinh Vo",
"Son Nguyen"
],
"abstract": "Large Language Models for Code (code LLMs) have demonstrated remarkable\nperformance across various software engineering (SE) tasks, increasing the\napplication of code LLMs in software development. Despite the success of code\nLLMs, there remain significant concerns about the actual capabilities and\nreliability of these models, \"whether these models really learn the semantics\nof code from the training data and leverage the learned knowledge to perform\nthe SE tasks\". In this paper, we introduce EMPICA, a comprehensive framework\ndesigned to systematically and empirically evaluate the capabilities of code\nLLMs in understanding code semantics. Specifically, EMPICA systematically\nintroduces controlled modifications/transformations into the input code and\nexamines the models' responses. Generally, code LLMs must be robust to\nsemantically equivalent code inputs and be sensitive to non-equivalent ones for\nall SE tasks. Specifically, for every SE task, given an input code snippet c\nand its semantic equivalent variants, code LLMs must robustly produce\nconsistent/equivalent outputs while they are expected to generate different\noutputs for c and its semantic non-equivalent variants. Our experimental\nresults on three representative code understanding tasks, including code\nsummarization, method name prediction, and output prediction, reveal that the\nrobustness and sensitivity of the state-of-the-art code LLMs to code\ntransformations vary significantly across tasks and transformation operators.\nIn addition, the code LLMs exhibit better robustness to the semantic preserving\ntransformations than their sensitivity to the semantic non-preserving\ntransformations. These results highlight a need to enhance the model's\ncapabilities of understanding code semantics, especially the sensitivity\nproperty.",
"categories": [
"cs.SE",
"cs.AI"
],
"published": "2024-07-04T03:40:58+00:00",
"url": "http://arxiv.org/pdf/2407.03611v1",
"resource_uri": "arxiv://2407.03611v1",
"citation_count": 0
},
{
"id": "2407.03157v2",
"title": "Let the Code LLM Edit Itself When You Edit the Code",
"authors": [
"Zhenyu He",
"Jun Zhang",
"Shengjie Luo",
"Jingjing Xu",
"Zhi Zhang",
"Di He"
],
"abstract": "In this work, we investigate a typical scenario in code generation where a\ndeveloper edits existing code in real time and requests a code assistant, e.g.,\na large language model, to re-predict the next token or next line on the fly.\nNaively, the LLM needs to re-encode the entire KV cache to provide an accurate\nprediction. However, this process is computationally expensive, especially when\nthe sequence length is long. Simply encoding the edited subsequence and\nintegrating it to the original KV cache meets the temporal confusion problem,\nleading to significantly worse performance. We address this efficiency and\naccuracy trade-off by introducing \\underline{\\textbf{Positional\n\\textbf{I}ntegrity \\textbf{E}ncoding} (PIE). Building upon the rotary\npositional encoding, PIE first removes the rotary matrices in the Key cache\nthat introduce temporal confusion and then reapplies the correct rotary\nmatrices. This process ensures that positional relationships between tokens are\ncorrect and requires only a single round of matrix multiplication. We validate\nthe effectiveness of PIE through extensive experiments on the RepoBench-C-8k\ndataset, utilizing DeepSeek-Coder models with 1.3B, 6.7B, and 33B parameters.\nOur evaluation includes three real-world coding tasks: code insertion, code\ndeletion, and multi-place code editing. Results demonstrate that PIE reduces\ncomputational overhead by over 85% compared to the standard full recomputation\napproach across all model sizes and tasks while well approximating the model\nperformance.",
"categories": [
"cs.CL",
"cs.AI",
"cs.LG",
"cs.SE"
],
"published": "2024-07-03T14:34:03+00:00",
"url": "http://arxiv.org/pdf/2407.03157v2",
"resource_uri": "arxiv://2407.03157v2",
"citation_count": 0
},
{
"id": "2407.02395v2",
"title": "Is Your AI-Generated Code Really Safe? Evaluating Large Language Models on Secure Code Generation with CodeSecEval",
"authors": [
"Jiexin Wang",
"Xitong Luo",
"Liuwen Cao",
"Hongkui He",
"Hailin Huang",
"Jiayuan Xie",
"Adam Jatowt",
"Yi Cai"
],
"abstract": "Large language models (LLMs) have brought significant advancements to code\ngeneration and code repair, benefiting both novice and experienced developers.\nHowever, their training using unsanitized data from open-source repositories,\nlike GitHub, raises the risk of inadvertently propagating security\nvulnerabilities. Despite numerous studies investigating the safety of code\nLLMs, there remains a gap in comprehensively addressing their security\nfeatures. In this work, we aim to present a comprehensive study aimed at\nprecisely evaluating and enhancing the security aspects of code LLMs. To\nsupport our research, we introduce CodeSecEval, a meticulously curated dataset\ndesigned to address 44 critical vulnerability types with 180 distinct samples.\nCodeSecEval serves as the foundation for the automatic evaluation of code\nmodels in two crucial tasks: code generation and code repair, with a strong\nemphasis on security. Our experimental results reveal that current models\nfrequently overlook security issues during both code generation and repair\nprocesses, resulting in the creation of vulnerable code. In response, we\npropose different strategies that leverage vulnerability-aware information and\ninsecure code explanations to mitigate these security vulnerabilities.\nFurthermore, our findings highlight that certain vulnerability types\nparticularly challenge model performance, influencing their effectiveness in\nreal-world applications. Based on these findings, we believe our study will\nhave a positive impact on the software engineering community, inspiring the\ndevelopment of improved methods for training and utilizing LLMs, thereby\nleading to safer and more trustworthy model deployment.",
"categories": [
"cs.SE",
"cs.CL"
],
"published": "2024-07-02T16:13:21+00:00",
"url": "http://arxiv.org/pdf/2407.02395v2",
"resource_uri": "arxiv://2407.02395v2",
"citation_count": 0
},
{
"id": "2407.01489v2",
"title": "Agentless: Demystifying LLM-based Software Engineering Agents",
"authors": [
"Chunqiu Steven Xia",
"Yinlin Deng",
"Soren Dunn",
"Lingming Zhang"
],
"abstract": "Recent advancements in large language models (LLMs) have significantly\nadvanced the automation of software development tasks, including code\nsynthesis, program repair, and test generation. More recently, researchers and\nindustry practitioners have developed various autonomous LLM agents to perform\nend-to-end software development tasks. These agents are equipped with the\nability to use tools, run commands, observe feedback from the environment, and\nplan for future actions. However, the complexity of these agent-based\napproaches, together with the limited abilities of current LLMs, raises the\nfollowing question: Do we really have to employ complex autonomous software\nagents? To attempt to answer this question, we build Agentless -- an agentless\napproach to automatically solve software development problems. Compared to the\nverbose and complex setup of agent-based approaches, Agentless employs a\nsimplistic three-phase process of localization, repair, and patch validation,\nwithout letting the LLM decide future actions or operate with complex tools.\nOur results on the popular SWE-bench Lite benchmark show that surprisingly the\nsimplistic Agentless is able to achieve both the highest performance (32.00%,\n96 correct fixes) and low cost ($0.70) compared with all existing open-source\nsoftware agents! Furthermore, we manually classified the problems in SWE-bench\nLite and found problems with exact ground truth patch or\ninsufficient/misleading issue descriptions. As such, we construct SWE-bench\nLite-S by excluding such problematic issues to perform more rigorous evaluation\nand comparison. Our work highlights the current overlooked potential of a\nsimple, interpretable technique in autonomous software development. We hope\nAgentless will help reset the baseline, starting point, and horizon for\nautonomous software agents, and inspire future work along this crucial\ndirection.",
"categories": [
"cs.SE",
"cs.AI",
"cs.CL",
"cs.LG"
],
"published": "2024-07-01T17:24:45+00:00",
"url": "http://arxiv.org/pdf/2407.01489v2",
"resource_uri": "arxiv://2407.01489v2",
"citation_count": 0
}
],
"query_components": [
[
"program repair",
"software repair",
"automatic repair",
"code repair",
"bug repair",
"bug fix",
"code fix",
"automatic fix",
"patch generation",
"patch correctness",
"patch validation",
"fix generation",
"code transformation",
"code edit",
"fix error"
],
[
"LLM",
"LLMs",
"Large Language Model",
"Large Language Models",
"Pre-trained",
"Pretrained",
"Pre-training",
"Pretraining",
"PLM",
"PLMs",
"BERT",
"CodeBERT",
"T5",
"CodeT5",
"GPT",
"CodeGPT",
"Codex",
"ChatGPT",
"Llama",
"CodeLlama",
"GPT-3",
"GPT-4",
"GPT-3.5",
"neural",
"machine learning",
"deep learning",
"transformer",
"transformers",
"model",
"models",
"transfer learning",
"supervised learning"
]
],
"filters": {
"date_from": "2024-07-01",
"min_citations": null,
"ccf_a_only": false
}
}