LLM-powered Datalog Code Translation and Incremental Program Analysis Framework

Authors: Wang Xizao, Shen Tianqi, Bin Xiangrong, Bu Lei

Corresponding author: Bu Lei, E-mail: bulei@nju.edu.cn

CLC number: TP311

Funding: National Natural Science Foundation of China (62232008, 62172200); Frontier Leading Technology Basic Research Project of Jiangsu Province (BK20202001)

    Abstract:

    Datalog, a declarative logic programming language, is widely applied in various fields. In recent years, growing interest in Datalog from both academia and industry has led to the design and development of multiple Datalog engines and corresponding dialects. However, one problem brought about by the multiple dialects is that code implemented in one Datalog dialect generally cannot be executed on the engine of another dialect. Therefore, when a new Datalog engine is adopted, existing Datalog code needs to be translated into the new dialect. Current Datalog code translation techniques fall into two categories, manually rewriting the code and manually designing translation rules, both of which are time-consuming, involve a large amount of repetitive work, and lack flexibility and scalability. In this study, a Datalog code translation technique empowered by large language models (LLMs) is proposed. By leveraging the powerful code understanding and generation capabilities of LLMs, together with a divide-and-conquer translation strategy, prompt engineering based on few-shot and chain-of-thought prompting, and an iterative check-feedback-repair error-correction mechanism, high-precision code translation between different Datalog dialects can be achieved, reducing the workload of developers in repeatedly developing translation rules. Based on this code translation technique, a general declarative incremental program analysis framework based on Datalog is designed and implemented. The performance of the proposed LLM-powered Datalog code translation technique is evaluated on different pairs of Datalog dialects, and the results verify its effectiveness. An experimental evaluation of the general declarative incremental program analysis framework further verifies the speedup achieved by incremental program analysis built on the proposed code translation technique.
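
    As an illustration of the dialect gap described above, the sketch below (not taken from the paper; the relations and types are hypothetical) shows the same transitive-closure program written in two common dialects: that of Soufflé, which declares relations with .decl/.input/.output directives, and that of DDlog, which uses typed input/output relation declarations. The rule bodies are nearly identical, yet neither file is accepted by the other engine, because declarations, primitive types, and I/O conventions differ.

        // Soufflé dialect (illustrative): reachability over an edge relation
        .decl Edge(x: number, y: number)
        .decl Reach(x: number, y: number)
        .input Edge      // facts read from Edge.facts
        .output Reach    // results written to Reach.csv
        Reach(x, y) :- Edge(x, y).
        Reach(x, z) :- Reach(x, y), Edge(y, z).

        // DDlog dialect (illustrative): the same program with typed relation declarations
        input relation Edge(x: u32, y: u32)
        output relation Reach(x: u32, y: u32)
        Reach(x, y) :- Edge(x, y).
        Reach(x, z) :- Reach(x, y), Edge(y, z).

    A translator therefore has to map declarations, types, and I/O directives consistently rather than merely copy the rules, which is the kind of dialect gap the proposed translation technique is meant to bridge.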

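    The incremental analysis side can be illustrated with a similarly simplified sketch (again not from the paper; the relation names New, Assign, and VarPointsTo are hypothetical) of the kind of declarative analysis such a framework hosts: a toy Andersen-style points-to analysis written as Datalog rules over program facts extracted from source code, shown here in Soufflé-style syntax. Once rules of this kind are translated into the dialect of an incremental engine, an edit to the analyzed program changes only a few input facts, and the engine re-derives only the affected VarPointsTo tuples instead of re-running the whole analysis, which is where the reported speedup of incremental program analysis comes from.

        // Toy Andersen-style points-to analysis (illustrative only)
        .decl New(v: symbol, o: symbol)              // v = new O(), with allocation site o
        .decl Assign(target: symbol, source: symbol) // target = source
        .decl VarPointsTo(v: symbol, o: symbol)
        .input New
        .input Assign
        .output VarPointsTo
        VarPointsTo(v, o) :- New(v, o).
        VarPointsTo(v, o) :- Assign(v, w), VarPointsTo(w, o).
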
Cite this article:

Wang XZ, Shen TQ, Bin XR, Bu L. LLM-powered Datalog code translation and incremental program analysis framework. Ruan Jian Xue Bao/Journal of Software, 2025, 36(6): 2515–2535 (in Chinese with English abstract).

History:
  • Received: 2024-08-26
  • Revised: 2024-10-14
  • Published online: 2024-12-10