Aussie AI

AI Safety Research

  • Last Updated 1 January, 2026
  • by David Spuler, Ph.D.

Safe and responsible use of AI is an important and all-encompassing goal. Multiple concerns arise in the use of modern AI capabilities, and for the future with more advanced AI systems. This article examines the various research papers on difference AI safety issues.

Types of AI Safety Issues

There are a variety of distinct issue in terms of appropriate use of AI. Some of the categories include:

  • Bias and fairness
  • Inaccurate results
  • Imaginary results ("hallucinations" or "confabulations")
  • Inappropriate responses (e.g., "toxicity")
  • Plagiarism

There are some issues that get quite close to being philosophy rather than technology:

  • Alignment (ensuring AI engines are "aligned" with human goals)
  • Overrideability/interruptibility
  • Obedience vs autonomy

There are some overarching issues for AI matters for the government and in the community:

  • Ethics
  • Governance
  • Regulation
  • Auditing and Enforcement
  • Privacy
  • Risk Mitigation

Issues specific to mitigation of AI safety risks include:

  • Red teaming (testing of safety issues)
  • Prompt shields
  • Guardrails
  • Jailbreak prevention
  • Refusal modules
  • Security issues

And since we may rely on AI models in various real-world situations, including dangerous real-time situations like driving a car, there are some practical technological issues ensuring that AI engines operate safely and reliably within their basic operational scope:

  • Testing and Debugging (simply avoiding coding "bugs" in complex AI engines)
  • Real-time performance profiling ("de-slugging")
  • Error Handling (tolerance of internal or external errors)
  • Code Resilience (handling unexpected inputs or situations reasonably)

Overviews, Surveys, and Reviews

Various authors have reviewed the areas of safety and ethics:

Hallucinations

Hallucinations are plausible-sounding answers that are not correct, and not based on any facts. It appears like the LLM is lying or faking the answer, but it doesn't actually know this. Rather, it is probabilistically trying to come up with the best answer, and sometimes it doesn't have a factual answer, so it can fill in the blanks.

  • Liangming Pan, Michael Saxon, Wenda Xu, Deepak Nathani, Xinyi Wang, William Yang Wang, May 03 2024, Automatically Correcting Large Language Models: Surveying the Landscape of Diverse Automated Correction Strategies, https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00660/120911
  • Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, Cyril Zhang, June 2023, Exposing Attention Glitches with Flip-Flop Language Modeling, https://arxiv.org/abs/2306.00946
  • Lucas Mearian, 14 Mar 2024, AI hallucination mitigation: two brains are better than one, https://www.computerworld.com/article/1612465/ai-hallucination-mitigation-two-brains-are-better-than-one.html
  • Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations.)
  • Bijit Ghosh Feb 2024, Advanced Prompt Engineering for Reducing Hallucination, https://medium.com/@bijit211987/advanced-prompt-engineering-for-reducing-hallucination-bb2c8ce62fc6
  • Junyi Li, Jie Chen, Ruiyang Ren, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, Ji-Rong Wen, 6 Jan 2024, The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models, https://arxiv.org/abs/2401.03205 Code: https://github.com/RUCAIBox/HaluEval-2.0
  • Colin Fraser, Apr 18, 2024, Hallucinations, Errors, and Dreams On why modern AI systems produce false outputs and what there is to be done about it, https://medium.com/@colin.fraser/hallucinations-errors-and-dreams-c281a66f3c35
  • Johnny Li, Saksham Consul, Eda Zhou, James Wong, Naila Farooqui, Yuxin Ye, Nithyashree Manohar, Zhuxiaona Wei, Tian Wu, Ben Echols, Sharon Zhou, Gregory Diamos, 25 Jun 2024, Banishing LLM Hallucinations Requires Rethinking Generalization, https://arxiv.org/abs/2406.17642
  • Pavan Belagatti, Jul 31, 2024, Semantic Chunking for Enhanced RAG Applications! https://levelup.gitconnected.com/semantic-chunking-for-enhanced-rag-applications-b6bc92942af0
  • Mintong Kang, Nezihe Merve Gürel, Ning Yu, Dawn Song, Bo Li, July 2024, C-RAG: Certified Generation Risks for Retrieval-Augmented Language Models, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:22963-23000, 2024, https://proceedings.mlr.press/v235/kang24a.html
  • Mengya Hu, Rui Xu, Deren Lei, Yaxi Li, Mingyu Wang, Emily Ching, Eslam Kamal, Alex Deng, 22 Aug 2024, SLM Meets LLM: Balancing Latency, Interpretability and Consistency in Hallucination Detection, https://arxiv.org/abs/2408.12748
  • Hao Zhou, Chengming Hu, Ye Yuan, Yufei Cui, Yili Jin, Can Chen, Haolun Wu, Dun Yuan, Li Jiang, Di Wu, Xue Liu, Charlie Zhang, Xianbin Wang, Jiangchuan Liu, 17 May 2024, Large Language Model (LLM) for Telecommunications: A Comprehensive Survey on Principles, Key Techniques, and Opportunities, https://arxiv.org/abs/2405.10825
  • C Yang, S Fujita, 2024, Adaptive Control of Retrieval-Augmented Generation for LLMs Through Reflective Tags, https://www.preprints.org/manuscript/202408.2152/download/final_file
  • Michael Wood, Aug 26, 2024, 100% Accurate AI Claimed by Acurai — OpenAI and Anthropic Confirm Acurai’s Discoveries, https://blog.cubed.run/100-accurate-ai-claimed-by-acurai-openai-and-anthropic-confirm-acurais-discoveries-98fce1ddeb5b
  • James Lee Stakelum, Sep 2024, The End of AI Hallucinations: A Big Breakthrough in Accuracy for AI Application Developers, https://medium.com/@JamesStakelum/the-end-of-ai-hallucinations-a-breakthrough-in-accuracy-for-data-engineers-e67be5cc742a
  • F. Li, X. zhang and P. Zhang, 2024, Mitigating Hallucination Issues in Small-Parameter LLMs through Inter-Layer Contrastive Decoding, 2024 International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 2024, pp. 1-8, doi: 10.1109/IJCNN60899.2024.10650644, https://ieeexplore.ieee.org/abstract/document/10650644
  • Zhongxiang Sun, Zihua Si, Xiaoxue Zang, Kai Zheng, Yang Song, Xiao Zhang, Jun Xu, 15 Oct 2024, LargePiG: Your Large Language Model is Secretly a Pointer Generator, https://arxiv.org/abs/2410.11366
  • Garanc Burke, Hilke Schellmann, October 27, 2024, Researchers say an AI-powered transcription tool used in hospitals invents things no one ever said, https://apnews.com/article/ai-artificial-intelligence-health-business-90020cdf5fa16c79ca2e5b6c4c9bbb14
  • Adi Simhi, Jonathan Herzig, Idan Szpektor, Yonatan Belinkov, 29 Oct 2024, Distinguishing Ignorance from Error in LLM Hallucinations, https://arxiv.org/abs/2410.22071 https://github.com/technion-cs-nlp/hallucination-mitigation
  • Salvatore Raieli, Nov 2024, What Is The Best Therapy For a Hallucinating AI Patient? Exploring the Art and Science of Prompt Engineering to Cure LLM Hallucinations, https://levelup.gitconnected.com/what-is-the-best-therapy-for-a-hallucinating-ai-patient-acf0cb9b3e00
  • Vitaly Kukharenko, Nov 2024, Why Do Neural Networks Hallucinate (And What Are Experts Doing About It)? https://pub.towardsai.net/why-do-neural-networks-hallucinate-and-what-are-experts-doing-about-it-7b9342605bf7
  • Yixiong Fang, Ziran Yang, Zhaorun Chen, Zhuokai Zhao, Jiawei Zhou, 9 Dec 2024, From Uncertainty to Trust: Enhancing Reliability in Vision-Language Models with Uncertainty-Guided Dropout Decoding, https://arxiv.org/abs/2412.06474
  • Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Martín Santillán Cooper, Kieran Fraser, Giulio Zizzo, Muhammad Zaid Hameed, Mark Purcell, Michael Desmond, Qian Pan, Inge Vejsbjerg, Elizabeth M. Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R. Varshney, Prasanna Sattigeri, 10 Dec 2024, Granite Guardian, https://arxiv.org/abs/2412.07724 https://github.com/ibm-granite/granite-guardian (Open-sourcing of safety models with many capabilities.)
  • Lilian Weng, July 7, 2024, Extrinsic Hallucinations in LLMs, https://lilianweng.github.io/posts/2024-07-07-hallucination/
  • Rhiannon Williams, December 31, 2024, The biggest AI flops of 2024: From chatbots dishing out illegal advice to dodgy AI-generated search results, take a look back over the year’s top AI failures. https://www.technologyreview.com/2024/12/31/1109612/biggest-worst-ai-artificial-intelligence-flops-fails-2024/
  • Kazi Hasan Ibn Arif, Sajib Acharjee Dip, Khizar Hussain, Lang Zhang, Chris Thomas, 21 Jan 2025, Fixing Imbalanced Attention to Mitigate In-Context Hallucination of Large Vision-Language Model, https://arxiv.org/abs/2501.12206
  • Huan Ma, Jingdong Chen, Guangyu Wang, Changqing Zhang, 1 Feb 2025, Estimating LLM Uncertainty with Logits, https://arxiv.org/abs/2502.00290
  • Ningke Li, Yahui Song, Kailong Wang, Yuekang Li, Ling Shi, Yi Liu, Haoyu Wang, 19 Feb 2025, Detecting LLM Fact-conflicting Hallucinations Enhanced by Temporal-logic-based Reasoning, https://arxiv.org/abs/2502.13416
  • Seongheon Park, Xuefeng Du, Min-Hsuan Yeh, Haobo Wang, Yixuan Li, 1 Mar 2025, How to Steer LLM Latents for Hallucination Detection? https://arxiv.org/abs/2503.01917
  • Sean Michael Kerner, May 13, 2025, Guardian agents: New approach could reduce AI hallucinations to below 1%, https://venturebeat.com/ai/beyond-detection-why-automatically-correcting-hallucinations-could-transform-enterprise-ai-adoption/
  • Lei Wang, 12 May 2025, SEReDeEP: Hallucination Detection in Retrieval-Augmented Models via Semantic Entropy and Context-Parameter Fusion, https://arxiv.org/abs/2505.07528
  • Manuel Cossio, 3 Aug 2025, A comprehensive taxonomy of hallucinations in Large Language Models, https://arxiv.org/abs/2508.01781
  • Igor Halperin, 13 Aug 2025, Prompt-Response Semantic Divergence Metrics for Faithfulness Hallucination and Misalignment Detection in Large Language Models, https://arxiv.org/abs/2508.10192
  • Denis Janiak, Jakub Binkowski, Albert Sawczyn, Bogdan Gabrys, Ravid Shwartz-Ziv, Tomasz Kajdanowicz, 13 Aug 2025, The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs, https://arxiv.org/abs/2508.08285
  • Xi Long, Christy Boscardin, Lauren A. Maggio, Joseph A. Costello, Ralph Gonzales, Rasmyah Hammoudeh, Ki Lai, Yoon Soo Park, Brian C. Gin, 14 Aug 2025, Hallucination vs interpretation: rethinking accuracy and precision in AI-assisted data extraction for knowledge synthesis, https://arxiv.org/abs/2508.09458
  • Siyuan Liu, Wenjing Liu, Zhiwei Xu, Xin Wang, Bo Chen, Tao Li, 21 Jul 2025, Towards Mitigation of Hallucination for LLM-empowered Agents: Progressive Generalization Bound Exploration and Watchdog Monitor, https://arxiv.org/abs/2507.15903
  • Zhenliang Zhang, Xinyu Hu, Huixuan Zhang, Junzhe Zhang, Xiaojun Wan, 22 Jul 2025, ICR Probe: Tracking Hidden State Dynamics for Reliable Hallucination Detection in LLMs, https://arxiv.org/abs/2507.16488
  • Xin Dong, Shichao Dong, Jin Wang, Jing Huang, Li Zhou, Zenghui Sun, Lihua Jing, Jingsong Lan, Xiaoyong Zhu, Bo Zheng, 22 Jul 2025, INTER: Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling, https://arxiv.org/abs/2507.05056
  • Seunghoi Kim and Henry F. J. Tregidgo and Matteo Figini and Chen Jin and Sarang Joshi and Daniel C. Alexander, 24 Jul 2025, Tackling Hallucination from Conditional Models for Medical Image Reconstruction with DynamicDPS, https://arxiv.org/abs/2503.01075
  • Weihua Zheng, Roy Ka-Wei Lee, Zhengyuan Liu, Kui Wu, AiTi Aw, Bowei Zou, 17 Jul 2025, CCL-XCoT: An Efficient Cross-Lingual Knowledge Transfer Method for Mitigating Hallucination Generation, https://arxiv.org/abs/2507.14239
  • Jingwei Huang, Kuroush Nezafati, Ismael Villanueva-Miranda, Zifan Gu, Yueshuang Xu, Ann Marie Navar, Tingyi Wanyan, Qin Zhou, Bo Yao, Ruichen Rong, Xiaowei Zhan, Guanghua Xiao, Eric D. Peterson, Donghan M. Yang, Wenqi Shi, Yang Xie, 18 Jul 2025, Large Language Models Powered Multiagent Ensemble for Mitigating Hallucination and Efficient Atrial Fibrillation Annotation of ECG Reports, https://arxiv.org/abs/2410.16543
  • Ashley Lewis, Michael White, Jing Liu, Toshiaki Koike-Akino, Kieran Parsons, Ye Wang, 21 Jul 2025, Winning Big with Small Models: Knowledge Distillation vs. Self-Training for Reducing Hallucination in Product QA Agents, https://arxiv.org/abs/2502.19545
  • Quan Shi, Wang Xi, Zenghui Ding, Jianqing Gao, Xianjun Yang, 10 Aug 2025, Hallucination as a Computational Boundary: A Hierarchy of Inevitability and the Oracle Escape, https://arxiv.org/abs/2508.07334
  • Ming-Kun Xie, Jia-Hao Xiao, Gang Niu, Lei Feng, Zhiqiang Kou, Min-Ling Zhang, and Masashi Sugiyama, 3 Aug 2025, What Makes "Good" Distractors for Object Hallucination Evaluation in Large Vision-Language Models?, https://arxiv.org/abs/2508.06530
  • Jakob Snel and Seong Joon Oh, 28 Jul 2025, First Hallucination Tokens Are Different from Conditional Ones, https://arxiv.org/abs/2507.20836
  • Shengyuan Wang, Jie Feng, Tianhui Liu, Dan Pei, Yong Li, 25 Jul 2025, Mitigating Geospatial Knowledge Hallucination in Large Language Models: Benchmarking and Dynamic Factuality Aligning, https://arxiv.org/abs/2507.19586
  • Baiyu Chen, Wilson Wongso, Xiaoqian Hu, Yue Tan, Flora Salim, 27 Jul 2025, Multi-Stage Verification-Centric Framework for Mitigating Hallucination in Multi-Modal RAG, https://arxiv.org/abs/2507.20136
  • Joosung Lee, Cheonbok Park, Hwiyeol Jo, Jeonghoon Kim, Joonsuk Park, Kang Min Yoo, 28 Jul 2025, Enhancing Hallucination Detection via Future Context, https://arxiv.org/abs/2507.20546
  • Esmail Gumaan, 20 Jul 2025, Theoretical Foundations and Mitigation of Hallucination in Large Language Models, https://arxiv.org/abs/2507.22915
  • Praveenkumar Katwe, Rakesh Chandra, Balabantaray Kali, Prasad Vittala, 30 Jul 2025, Reducing Hallucinations in Summarization via Reinforcement Learning with Entity Hallucination Index, https://arxiv.org/abs/2507.22744
  • Vijja Wichitwechkarn, Charles Fox, Ruchi Choudhary, 23 Jul 2025, Hallucination Detection and Mitigation with Diffusion in Multi-Variate Time-Series Foundation Models, https://arxiv.org/abs/2508.00881
  • Xiaoyu Pan, Yang Bai, Ke Zou, Yang Zhou, Jun Zhou, Huazhu Fu, Yih-Chung Tham, Yong Liu, 24 Jul 2025, EH-Benchmark Ophthalmic Hallucination Benchmark and Agent-Driven Top-Down Traceable Reasoning Workflow, https://arxiv.org/abs/2507.22929
  • Zhaochen Wang, Yiwei Wang, Yujun Cai, 3 Aug 2025, Cure or Poison? Embedding Instructions Visually Alters Hallucination in Vision-Language Models, https://arxiv.org/abs/2508.01678
  • Yijun Feng, 3 Aug 2025, Counterfactual Probing for Hallucination Detection and Mitigation in Large Language Models, https://arxiv.org/abs/2508.01862
  • Zhaoyi Sun, Wen-Wai Yim, Ozlem Uzuner, Fei Xia, Meliha Yetisgen, 1 Aug 2025, A Scoping Review of Natural Language Processing in Addressing Medically Inaccurate Information: Errors, Misinformation, and Hallucination, https://arxiv.org/abs/2505.00008
  • Junyoung Lim, Jaewoo Ahn, Gunhee Kim, 5 Aug 2025, ChartCap: Mitigating Hallucination of Dense Chart Captioning, https://arxiv.org/abs/2508.03164
  • Subhey Sadi Rahman, Md. Adnanul Islam, Md. Mahbub Alam, Musarrat Zeba, Md. Abdur Rahman, Sadia Sultana Chowa, Mohaimenul Azam Khan Raiaan, Sami Azam, 5 Aug 2025, Hallucination to Truth: A Review of Fact-Checking and Factuality Evaluation in Large Language Models, https://arxiv.org/abs/2508.03860
  • Shunqi Mao, Chaoyi Zhang, Weidong Cai, 6 Aug 2025, Through the Magnifying Glass: Adaptive Perception Magnification for Hallucination-Free VLM Decoding, https://arxiv.org/abs/2503.10183
  • Micha{\l} P. Karpowicz, 6 Aug 2025, On the Fundamental Impossibility of Hallucination Control in Large Language Models, https://arxiv.org/abs/2506.06382
  • Huaicheng Zhang, Wei Tan, Guangzheng Li, Yixuan Zhang, Hangting Chen, Shun Lei, Chenyu Yang, Zhiyong Wu, Shuai Wang, Qijun Huang, Dong Yu, 7 Aug 2025, Towards Hallucination-Free Music: A Reinforcement Learning Preference Optimization Framework for Reliable Song Generation, https://arxiv.org/abs/2508.05011
  • Kim Hammar and Tansu Alpcan and Emil C. Lupu, 7 Aug 2025, Incident Response Planning Using a Lightweight Large Language Model with Reduced Hallucination, https://arxiv.org/abs/2508.05188
  • Marc Pavel, Nenad Petrovic, Lukasz Mazur, Vahid Zolfaghari, Fengjunjie Pan, Alois Knoll, 15 Aug 2025, Hallucination in LLM-Based Code Generation: An Automotive Case Study, https://arxiv.org/abs/2508.11257
  • Nanxing Hu, Xiaoyue Duan, Jinchao Zhang, Guoliang Kang, 19 Aug 2025, Enhancing Visual Reliance in Text Generation: A Bayesian Perspective on Mitigating Hallucination in Large Vision-Language Models, https://arxiv.org/abs/2505.19498
  • Huan Ma, Jiadong Pan, Jing Liu, Yan Chen, Joey Tianyi Zhou, Guangyu Wang, Qinghua Hu, Hua Wu, Changqing Zhang, Haifeng Wang, 20 Aug 2025, Semantic Energy: Detecting LLM Hallucination Beyond Entropy, https://arxiv.org/abs/2508.14496
  • Aman Goel, Daniel Schwartz, Yanjun Qi, 19 Aug 2025, Zero-knowledge LLM hallucination detection and mitigation through fine-grained cross-model consistency, https://arxiv.org/abs/2508.14314
  • Yupei Yang, Fan Feng, Lin Yang, Wanxi Deng, Lin Qu, Biwei Huang, Shikui Tu, Lei Xu, 20 Aug 2025, DEPTH: Hallucination-Free Relation Extraction via Dependency-Aware Sentence Simplification and Two-tiered Hierarchical Refinement, https://arxiv.org/abs/2508.14391
  • Nicole Cho, William Watson, Alec Koppel, Sumitra Ganesh, Manuela Veloso, 22 Aug 2025, QueryBandits for Hallucination Mitigation: Exploiting Semantic Features for No-Regret Rewriting, https://arxiv.org/abs/2508.16697
  • Nicolas Zucchet, J\"org Bornschein, Stephanie Chan, Andrew Lampinen, Razvan Pascanu, Soham De, 24 Jul 2025, How do language models learn facts? Dynamics, curricula and hallucinations, https://arxiv.org/abs/2503.21676
  • Anindya Bijoy Das, Shahnewaz Karim Sakib and Shibbir Ahmed, 9 Aug 2025, Trustworthy Medical Imaging with Large Language Models: A Study of Hallucinations Across Modalities, https://arxiv.org/abs/2508.07031
  • Charles O'Neill, Slava Chalnev, Chi Chi Zhao, Max Kirkby, Mudith Jayasekara, 31 Jul 2025, A Single Direction of Truth: An Observer Model's Linear Residual Probe Exposes and Steers Contextual Hallucinations, https://arxiv.org/abs/2507.23221
  • Zhangcheng Qiang, Kerry Taylor, Weiqing Wang, Jing Jiang, 25 Mar 2025, OAEI-LLM-T: A TBox Benchmark Dataset for Understanding Large Language Model Hallucinations in Ontology Matching, https://arxiv.org/abs/2503.21813
  • Yudong Zhang, Ruobing Xie, Xingwu Sun, Yiqing Huang, Jiansheng Chen, Zhanhui Kang, Di Wang, Yu Wang, 31 Jul 2025, DHCP: Detecting Hallucinations by Cross-modal Attention Pattern in Large Vision-Language Models, https://arxiv.org/abs/2411.18659
  • Haonan Ge, Yiwei Wang, Ming-Hsuan Yang, Yujun Cai, 14 Aug 2025, MRFD: Multi-Region Fusion Decoding with Self-Consistency for Mitigating Hallucinations in LVLMs, https://arxiv.org/abs/2508.10264
  • Likun Tan, Kuan-Wei Huang, Kevin Wu, 28 Jul 2025, FRED: Financial Retrieval-Enhanced Detection and Editing of Hallucinations in Language Models, https://arxiv.org/abs/2507.20930
  • Neil F. Johnson and Frank Yingjie Huo, 1 Aug 2025, Multispin Physics of AI Tipping Points and Hallucinations, https://arxiv.org/abs/2508.01097
  • Chenxi Li, Yichen Guo, Benfang Qian, Jinhao You, Kai Tang, Yaosong Du, Zonghao Zhang, and Xiande Huang, 3 Aug 2025, MAP: Mitigating Hallucinations in Large Vision-Language Models with Map-Level Attention Processing, https://arxiv.org/abs/2508.01653
  • Peizheng Guo, Jingyao Wang, Wenwen Qiang, Huijie Guo, Changwen Zheng, Jiahuan Zhou, Gang Hua, 6 Aug 2025, Hacking Hallucinations of MLLMs with Causal Sufficiency and Necessity, https://arxiv.org/abs/2508.04182
  • Mengao Zhang, Jiayu Fu, Tanya Warrier, Yuwen Wang, Tianhui Tan, Ke-wei Huang, 7 Aug 2025, FAITH: A Framework for Assessing Intrinsic Tabular Hallucinations in finance, https://arxiv.org/abs/2508.05201
  • Vibhor Agarwal, Yiqiao Jin, Mohit Chandra, Munmun De Choudhury, Srijan Kumar, Nishanth Sastry, 7 Aug 2025, MedHalu: Hallucinations in Responses to Healthcare Queries by Large Language Models, https://arxiv.org/abs/2409.19492
  • Chunhua Liu, Hong Yi Lin and Patanamon Thongtanunam, 12 Aug 2025, Hallucinations in Code Change to Natural Language Generation: Prevalence and Evaluation of Detection Metrics, https://arxiv.org/abs/2508.08661
  • Ashish Seth, Utkarsh Tyagi, Ramaneswaran Selvakumar, Nishit Anand, Sonal Kumar, Sreyan Ghosh, Ramani Duraiswami, Chirag Agarwal, Dinesh Manocha, 18 Aug 2025, EGOILLUSION: Benchmarking Hallucinations in Egocentric Video Understanding, https://arxiv.org/abs/2508.12687
  • Yuangang Li, Yiqing Shen, Yi Nian, Jiechao Gao, Ziyi Wang, Chenxiao Yu, Shawn Li, Jie Wang, Xiyang Hu, Yue Zhao, 17 Aug 2025, Mitigating Hallucinations in Large Language Models via Causal Reasoning, https://arxiv.org/abs/2508.12495
  • Wenhao Li, Xiu Su, Jingyi Wu, Feng Yang, Yang Liu, Yi Chen, Shan You, Chang Xu, 19 Aug 2025, Identify, Isolate, and Purge: Mitigating Hallucinations in LVLMs via Self-Evolving Distillation, https://arxiv.org/abs/2507.04680
  • Anindya Bijoy Das, Shibbir Ahmed and Shahnewaz Karim Sakib, 19 Aug 2025, Hallucinations and Key Information Extraction in Medical Texts: A Comprehensive Assessment of Open-Source Large Language Models, https://arxiv.org/abs/2504.19061
  • Chenlin Liu, Minghui Fang, Patrick Zhang, Wei Zhou, Jie Gao, Jiqing Han, 21 Aug 2025, Mitigating Hallucinations in LM-Based TTS Models via Distribution Alignment Using GFlowNets, https://arxiv.org/abs/2508.15442
  • Reilly Haskins and Benjamin Adams, 21 Aug 2025, KEA Explain: Explanations of Hallucinations using Graph Kernel Analysis, https://arxiv.org/abs/2507.03847
  • Shuzhou Yuan, Zhan Qu, Ashish Yashwanth Kangen, Michael F\"arber, 22 Aug 2025, Can Hallucinations Help? Boosting LLMs for Drug Discovery, https://arxiv.org/abs/2501.13824
  • Charles Moslonka, Hicham Randrianarivo, Arthur Garnier and Emmanuel Malherbe, 1 Sep 2025, Learned Hallucination Detection in Black-Box LLMs using Token-level Entropy Production Rate, https://arxiv.org/abs/2509.04492
  • Jiawei Li, Akshayaa Magesh, Venugopal V. Veeravalli, 25 Aug 2025, Principled Detection of Hallucinations in Large Language Models via Multiple Testing, https://arxiv.org/abs/2508.18473
  • Yiming Huang, Junyan Zhang, Zihao Wang, Biquan Bie, Yunzhong Qiu, Yi R. Fung, Xinlei He, 26 Aug 2025, RePPL: Recalibrating Perplexity by Uncertainty in Semantic Propagation and Language Generation for Explainable QA Hallucination Detection, https://arxiv.org/abs/2505.15386
  • Supratik Sarkar, Swagatam Das, 26 Aug 2025, Grounding the Ungrounded: A Spectral-Graph Framework for Quantifying Hallucinations in multimodal LLMs, https://arxiv.org/abs/2508.19366
  • Kehao Miao, Xiaolong Jin, 26 Aug 2025, An Investigation on Group Query Hallucination Attacks, https://arxiv.org/abs/2508.19321
  • Seongheon Park and Yixuan Li, 27 Aug 2025, GLSim: Detecting Object Hallucinations in LVLMs via Global-Local Similarity, https://arxiv.org/abs/2508.19972
  • Alberto Compagnoni, Davide Caffagni, Nicholas Moratelli, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara, 27 Aug 2025, Mitigating Hallucinations in Multimodal LLMs via Object-aware Preference Optimization, https://arxiv.org/abs/2508.20181
  • Tingxuan Xu, Jiarui Feng, Justin Melendez, Kaleigh Roberts, Donghong Cai, Mingfang Zhu, Donald Elbert, Yixin Chen, Randall J. Bateman, 28 Aug 2025, Addressing accuracy and hallucination of LLMs in Alzheimer's disease research through knowledge graphs, https://arxiv.org/abs/2508.21238
  • Weizhi Gao, Xiaorui Liu, Feiyi Wang, Dan Lu, Junqi Yin, 28 Aug 2025, Decoding Memories: An Efficient Pipeline for Self-Consistency Hallucination Detection, https://arxiv.org/abs/2508.21228
  • Hao Lu, Jiahao Wang, Yaolun Zhang, Ruohui Wang, Xuanyu Zheng, Yepeng Tang, Dahua Lin, Lewei Lu, 29 Aug 2025, ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding, https://arxiv.org/abs/2508.21496
  • Junzhe Chen, Tianshu Zhang, Shiyu Huang, Yuwei Niu, Chao Sun, Rongzhou Zhang, Guanyu Zhou, Lijie Wen, Xuming Hu, 31 Aug 2025, OmniDPO: A Preference Optimization Framework to Address Omni-Modal Hallucination, https://arxiv.org/abs/2509.00723
  • Saad Abdul Ghani, Zizhao Wang, Peter Stone, Xuesu Xiao, 1 Sep 2025, Dyna-LfLH: Learning Agile Navigation in Dynamic Environments from Learned Hallucination, https://arxiv.org/abs/2403.17231
  • Haoran Huan, Mihir Prabhudesai, Mengning Wu, Shantanu Jaiswal, Deepak Pathak, 3 Sep 2025, Can LLMs Lie? Investigation beyond Hallucination, https://arxiv.org/abs/2509.03518
  • Qiang Liu, Xinlong Chen, Yue Ding, Bowen Song, Weiqiang Wang, Shu Wu, Liang Wang, 3 Sep 2025, Attention-guided Self-reflection for Zero-shot Hallucination Detection in Large Language Models, https://arxiv.org/abs/2501.09997
  • Praneet Suresh, Jack Stanley, Sonia Joseph, Luca Scimeca, Danilo Bzdok, 8 Sep 2025, From Noise to Narrative: Tracing the Origins of Hallucinations in Transformers, https://arxiv.org/abs/2509.06938
  • Xin Tong, Zhi Lin, Jingya Wang, Bo Jin, 8 Sep 2025, HAVE: Head-Adaptive Gating and ValuE Calibration for Hallucination Mitigation in Large Language Models, https://arxiv.org/abs/2509.06596
  • Jerry Li, Evangelos Papalexakis, 3 Sep 2025, Beyond ROUGE: N-Gram Subspace Features for LLM Hallucination Detection, https://arxiv.org/abs/2509.05360
  • Kamil Ciosek, Nicol\`o Felicioni, Sina Ghiassian, 8 Sep 2025, Hallucination Detection on a Budget: Efficient Bayesian Estimation of Semantic Entropy, https://arxiv.org/abs/2504.03579
  • Kishan Maharaj, Vitobha Munigala, Srikanth G. Tamilselvam, Prince Kumar, Sayandeep Sen, Palani Kodeswaran, Abhijit Mishra, Pushpak Bhattacharyya, 6 Sep 2025, ETF: An Entity Tracing Framework for Hallucination Detection in Code Summaries, https://arxiv.org/abs/2410.14748
  • Masoumeh Zareh, Mohammad Hossein Manshaei, Sayed Jalal Zahabi, and Marwan Krunz, 6 Sep 2025, Modeling Visual Hallucination: A Generative Adversarial Network Framework, https://arxiv.org/abs/2102.08209
  • OpenAI, September 5, 2025, Why language models hallucinate, https://openai.com/index/why-language-models-hallucinate/ (Many interesting findings, including that some level of hallucinations are inevitable in the next-token decoding method, and also that current LLM evals reward hallucinations, and need to be reworked to fix hallucinations properly by rewarding expressions of uncertainty in results, i.e., when the model admits it doesn't know something instead of making something up.)
  • Saumya Goswami, Siddharth Kurra, 9 Sep 2025, HALT-RAG: A Task-Adaptable Framework for Hallucination Detection with Calibrated NLI Ensembles and Abstention, https://arxiv.org/abs/2509.07475
  • Nobin Sarwar, 8 Sep 2025, FilterRAG: Zero-Shot Informed Retrieval-Augmented Generation to Mitigate Hallucinations in VQA, https://arxiv.org/abs/2502.18536
  • Malavika Suresh, Rahaf Aljundi, Ikechukwu Nkisi-Orji, Nirmalie Wiratunga, 4 Sep 2025, Cross-Layer Attention Probing for Fine-Grained Hallucination Detection, https://arxiv.org/abs/2509.09700
  • Naveen Lamba, Sanju Tiwari and Manas Gaur, 9 Sep 2025, Investigating Symbolic Triggers of Hallucination in Gemma Models Across HaluEval and TruthfulQA, https://arxiv.org/abs/2509.09715
  • Ponhvoan Srey, Xiaobao Wu, Anh Tuan Luu, 12 Sep 2025, Unsupervised Hallucination Detection by Inspecting Reasoning Processes, https://arxiv.org/abs/2509.10004
  • Garry Yang, Zizhe Chen, Man Hon Wong, Haoyu Lei, Yongqiang Chen, Zhenguo Li, Kaiwen Zhou, James Cheng, 11 Sep 2025, MESH -- Understanding Videos Like Human: Measuring Hallucinations in Large Video Models, https://arxiv.org/abs/2509.08538
  • Humam Kourani, Anton Antonov, Alessandro Berti, Wil M.P. van der Aalst, 18 Sep 2025, Knowledge-Driven Hallucination in Large Language Models: An Empirical Study on Process Modeling, https://arxiv.org/abs/2509.15336
  • Davide Ettori, Nastaran Darabi, Sina Tayebati, Ranganath Krishnan, Mahesh Subedar, Omesh Tickoo, and Amit Ranjan Trivedi, 19 Sep 2025, EigenTrack: Spectral Activation Feature Tracking for Hallucination and Out-of-Distribution Detection in LLMs and VLMs, https://arxiv.org/abs/2509.15735
  • Chung-En Johnny Yu, Hsuan-Chih (Neil) Chen, Brian Jalaian, Nathaniel D. Bastian, 18 Sep 2025, ORCA: Agentic Reasoning For Hallucination and Adversarial Robustness in Vision-Language Models, https://arxiv.org/abs/2509.15435
  • Seongmin Lee, Hsiang Hsu, Chun-Fu Chen, Duen Horng Chau, 15 Sep 2025, Probing LLM Hallucination from Within: Perturbation-Driven Approach via Internal Knowledge, https://arxiv.org/abs/2411.09689
  • Saad Obaid ul Islam, Anne Lauscher, Goran Glava\v{s}, 16 Sep 2025, How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild, https://arxiv.org/abs/2502.12769
  • Boris Kovalerchuk, Brent D. Fegley, 13 Sep 2025, LLM Enhancement with Domain Expert Mental Model to Reduce LLM Hallucination with Causal Prompt Engineering, https://arxiv.org/abs/2509.10818
  • Minh Vu, Brian K. Tran, Syed A. Shah, Geigh Zollicoffer, Nhat Hoang-Xuan, Manish Bhattarai, 12 Sep 2025, HalluField: Detecting LLM Hallucinations via Field-Theoretic Modeling, https://arxiv.org/abs/2509.10753
  • Junjie Hu, Gang Tu, ShengYu Cheng, Jinxin Li, Jinting Wang, Rui Chen, Zhilong Zhou, Dongbo Shan, 15 Sep 2025, HARP: Hallucination Detection via Reasoning Subspace Projection, https://arxiv.org/abs/2509.11536
  • Leon Chlon, Ahmed Karim, Maggie Chlon, 14 Sep 2025, Predictable Compression Failures: Why Language Models Actually Hallucinate, https://arxiv.org/abs/2509.11208
  • Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Chen Xu, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, Shuming Shi, 14 Sep 2025, Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models, https://arxiv.org/abs/2309.01219
  • Zhenglin Hua, Jinghan He, Zijun Yao, Tianxu Han, Haiyun Guo, Yuheng Jia, Junfeng Fang, 15 Sep 2025, Steering LVLMs via Sparse Autoencoder for Hallucination Mitigation, https://arxiv.org/abs/2505.16146
  • Hongxiang Zhang, Hao Chen, Muhao Chen, Tianyi Zhang, 15 Sep 2025, Active Layer-Contrastive Decoding Reduces Hallucination in Large Language Model Generation, https://arxiv.org/abs/2505.23657
  • Yurui Chang, Bochuan Cao, Lu Lin, 13 Sep 2025, Monitoring Decoding: Mitigating Hallucination via Evaluating the Factuality of Partial Response during Generation, https://arxiv.org/abs/2503.03106
  • Martin Prei{\ss}, 11 Sep 2025, Hallucination Detection with the Internal Layers of LLMs, https://arxiv.org/abs/2509.14254
  • Zihao Li, Weiwei Yi, Jiahong Chen, 12 Sep 2025, Accuracy Paradox in Large Language Models: Regulating Hallucination Risks in Generative AI, https://arxiv.org/abs/2509.13345
  • Xiao Zheng, 17 Sep 2025, DSCC-HS: A Dynamic Self-Reinforcing Framework for Hallucination Suppression in Large Language Models, https://arxiv.org/abs/2509.13702
  • Mahjabin Nahar, Eun-Ju Lee, Jin Won Park, Dongwon Lee, 17 Sep 2025, Catch Me if You Search: When Contextual Web Search Results Affect the Detection of Hallucinations, https://arxiv.org/abs/2504.01153
  • Nandakishor M, 23 Sep 2025, Confidence-Aware Routing for Large Language Model Reliability Enhancement: A Multi-Signal Approach to Pre-Generation Hallucination Mitigation, https://arxiv.org/abs/2510.01237
  • Hsuan Su, Ting-Yao Hu, Hema Swetha Koppula, Kundan Krishna, Hadi Pouransari, Cheng-Yu Hsieh, Cem Koc, Joseph Yitan Cheng, Oncel Tuzel, Raviteja Vemulapalli, 2 Oct 2025, Learning to Reason for Hallucination Span Detection, https://arxiv.org/abs/2510.02173
  • Shenxu Chang, Junchi Yu, Weixing Wang, Yongqiang Chen, Jialin Yu, Philip Torr, Jindong Gu, 30 Sep 2025, TraceDet: Hallucination Detection from the Decoding Trace of Diffusion Large Language Models, https://arxiv.org/abs/2510.01274
  • Aayush Gupta, 2 Oct 2025, Fact Grounded Attention: Eliminating Hallucination in Large Language Models Through Attention Level Knowledge Integration, https://arxiv.org/abs/2509.25252
  • Jung-Woo Shim, Yeong-Joon Ju, Ji-Hoon Park, Seong-Whan Lee, 14 Oct 2025, CPR: Mitigating Large Language Model Hallucinations with Curative Prompt Refinement, https://arxiv.org/abs/2510.12029
  • Jung-Woo Shim, Yeong-Joon Ju, Ji-Hoon Park, Seong-Whan Lee, 14 Oct 2025, Multi-stage Prompt Refinement for Mitigating Hallucinations in Large Language Models, https://arxiv.org/abs/2510.12032
  • Shihao Ji, Zihui Song, Jiajie Huang, 14 Oct 2025, Credal Transformer: A Principled Approach for Quantifying and Mitigating Hallucinations in Large Language Models, https://arxiv.org/abs/2510.12137
  • Trishna Chakraborty, Udita Ghosh, Xiaopan Zhang, Fahim Faisal Niloy, Yue Dong, Jiachen Li, Amit K. Roy-Chowdhury, Chengyu Song, 14 Oct 2025, HEAL: An Empirical Study on Hallucinations in Embodied Agents Driven by Large Language Models, https://arxiv.org/abs/2506.15065
  • Zeshi Dai, Zimo Peng, Zerui Cheng, Ryan Yihe Li, 30 Sep 2025, When Hallucination Costs Millions: Benchmarking AI Agents in High-Stakes Adversarial Financial Markets, https://arxiv.org/abs/2510.00332
  • Guy Bar-Shalom, Fabrizio Frasca, Yaniv Galron, Yftah Ziser, Haggai Maron, 30 Sep 2025, Beyond Token Probes: Hallucination Detection via Activation Tensors with ACT-ViT, https://arxiv.org/abs/2510.00296
  • Zhengyi Ho, Siyuan Liang, Dacheng Tao, 26 Sep 2025, Review of Hallucination Understanding in Large Language and Vision Models, https://arxiv.org/abs/2510.00034
  • Yongchao Long, Xian Wu, Yingying Zhang, Xianbin Wen, Yuxi Zhou and Shenda Hong, 1 Oct 2025, Copy-Paste to Mitigate Large Language Model Hallucinations, https://arxiv.org/abs/2510.00508
  • Zuzanna Dubanowska, Maciej \.Zelaszczyk, Micha{\l} Brzozowski, Paolo Mandica, Micha{\l} Karpowicz, 19 Sep 2025, Representation-based Broad Hallucination Detectors Fail to Generalize Out of Distribution, https://arxiv.org/abs/2509.19372
  • ShiMing Wang, ZhiHao Du, Yang Xiang, TianYu Zhao, Han Zhao, Qian Chen, XianGang Li, HanJie Guo, ZhenHua Ling, 24 Sep 2025, Eliminating stability hallucinations in llm-based tts models via attention guidance, https://arxiv.org/abs/2509.19852
  • Keshav Kumar, 24 Sep 2025, Detecting Token-Level Hallucinations Using Variance Signals: A Reference-Free Approach, https://arxiv.org/abs/2507.04137
  • Yihan Li, Xiyuan Fu, Ghanshyam Verma, Paul Buitelaar and Mingming Liu, 28 Oct 2025, Mitigating Hallucination in Large Language Models (LLMs): An Application-Oriented Survey on RAG, Reasoning, and Agentic Systems, https://arxiv.org/abs/2510.24476
  • Liyang Chen, Hongkai Chen, Yujun Cai, Sifan Li, Qingwen Ye and Yiwei Wang, 28 Oct 2025, Detecting and Mitigating Insertion Hallucination in Video-to-Audio Generation, https://arxiv.org/abs/2510.08078
  • Ge Zheng, Jiaye Qian, Jiajin Tang, Sibei Yang, 23 Oct 2025, Why LVLMs Are More Prone to Hallucinations in Longer Responses: The Role of Context, https://arxiv.org/abs/2510.20229
  • Jaehyung Seo, Hyeonseok Moon, Heuiseok Lim, 23 Oct 2025, The Impact of Negated Text on Hallucination with Large Language Models, https://arxiv.org/abs/2510.20375
  • Kushal Chakrabarti and Nirmal Balachundhar, 23 Oct 2025, Neural Diversity Regularizes Hallucinations in Small Models, https://arxiv.org/abs/2510.20690
  • Demian Till, John Smeaton, Peter Haubrick, Gouse Saheb, Florian Graef, David Berman, 23 Oct 2025, Teaming LLMs to Detect and Mitigate Hallucinations, https://arxiv.org/abs/2510.19507
  • Wenyun Li, Zheng Zhang, Dongmei Jiang, Xiangyuan Lan, 13 Oct 2025, Bolster Hallucination Detection via Prompt-Guided Data Augmentation, https://arxiv.org/abs/2510.15977
  • Yiyang Huang, Liang Shi, Yitian Zhang, Yi Xu, Yun Fu, 18 Oct 2025, SHIELD: Suppressing Hallucinations In LVLM Encoders via Bias and Vulnerability Defense, https://arxiv.org/abs/2510.16596
  • Tong Chen, Akari Asai, Luke Zettlemoyer, Hannaneh Hajishirzi, Faeze Brahman, 20 Oct 2025, Train for Truth, Keep the Skills: Binary Retrieval-Augmented Reward Mitigates Hallucinations, https://arxiv.org/abs/2510.17733
  • Jakub Binkowski, Denis Janiak, Albert Sawczyn, Bogdan Gabrys, Tomasz Kajdanowicz, 18 Oct 2025, Hallucination Detection in LLMs Using Spectral Features of Attention Maps, https://arxiv.org/abs/2502.17598
  • Richard Ackermann and Simeon Emanuilov, 19 Sep 2025, How Large Language Models are Designed to Hallucinate, https://arxiv.org/abs/2509.16297
  • Akshay Govind Srinivasan, Ryan Jacob George, Jayden Koshy Joe, Hrushikesh Kant, Harshith M R, Sachin Sundar, Sudharshan Suresh, Rahul Vimalkanth, Vijayavallabh, 19 Sep 2025, Enhancing Financial RAG with Agentic AI and Multi-HyDE: A Novel Approach to Knowledge Retrieval and Hallucination Reduction, https://arxiv.org/abs/2509.16369
  • Xingqi Wang, Yiming Cui, Xin Yao, Shijin Wang, Guoping Hu, Xiaoyu Qin, 22 Sep 2025, ChartHal: A Fine-grained Framework Evaluating Hallucination of Large Vision Language Models in Chart Understanding, https://arxiv.org/abs/2509.17481
  • Selva Ta\c{s}, Mahmut El Huseyni, \"Ozay Ezerceli, Reyhan Bayraktar, Fatma Bet\"ul Terzio\u{g}lu, 22 Sep 2025, Turk-LettuceDetect: A Hallucination Detection Models for Turkish RAG Applications, https://arxiv.org/abs/2509.17671
  • Jiulong Wu, Zhengliang Shi, Shuaiqiang Wang, Jizhou Huang, Dawei Yin, Lingyong Yan, Min Cao, Min Zhang, 22 Sep 2025, Mitigating Hallucinations in Large Vision-Language Models via Entity-Centric Multimodal Preference Optimization, https://arxiv.org/abs/2506.04039
  • Piyushkumar Patel, 26 Oct 2025, Multi-Modal Fact-Verification Framework for Reducing Hallucinations in Large Language Models, https://arxiv.org/abs/2510.22751
  • Chenlong Yin, Zeyang Sha, Shiwen Cui, Changhua Meng, 27 Oct 2025, The Reasoning Trap: How Enhancing LLM Reasoning Amplifies Tool Hallucination, https://arxiv.org/abs/2510.22977
  • Zongxia Li, Xiyang Wu, Guangyao Shi, Yubin Qin, Hongyang Du, Fuxiao Liu, Tianyi Zhou, Dinesh Manocha, Jordan Lee Boyd-Graber, 26 Oct 2025, VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding, https://arxiv.org/abs/2505.01481
  • Yixiao Huang, Hanlin Zhu, Tianyu Guo, Jiantao Jiao, Somayeh Sojoudi, Michael I. Jordan, Stuart Russell, Song Mei, 25 Oct 2025, Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers, https://arxiv.org/abs/2506.10887
  • Tsung-En Lin, Kuan-Yi Lee, Hung-Yi Lee, 14 Oct 2025, Adaptive vector steering: A training-free, layer-wise intervention for hallucination mitigation in large audio and multimodal models, https://arxiv.org/abs/2510.12851
  • Andrew Jesson and Nicolas Beltran-Velez and Quentin Chu and Sweta Karlekar and Jannik Kossen and Yarin Gal and John P. Cunningham and David Blei, 14 Oct 2025, Estimating the Hallucination Rate of Generative AI, https://arxiv.org/abs/2406.07457
  • Alkis Kalavasis, Anay Mehrotra, Grigoris Velegkas, 14 Oct 2025, On the Limits of Language Generation: Trade-Offs Between Hallucination and Mode Collapse, https://arxiv.org/abs/2411.09642
  • Michal Sadowski, Tadija Radusinovi\'c, Maria Wyrzykowska, Lukasz Sztukiewicz, Jan Rzymkowski, Pawe{\l} W{\l}odarczyk-Pruszy\'nski, Miko{\l}aj Sacha, Piotr Kozakowski, Ruard van Workum, Stanislaw Kamil Jastrzebski, 15 Oct 2025, Trustworthy Retrosynthesis: Eliminating Hallucinations with a Diverse Ensemble of Reaction Scorers, https://arxiv.org/abs/2510.10645
  • Hude Liu, Jerry Yao-Chieh Hu, Jennifer Yuntong Zhang, Zhao Song, Han Liu, 25 Sep 2025, Are Hallucinations Bad Estimations?, https://arxiv.org/abs/2509.21473
  • Chanwoong Park, Uijeong Jang, Ernest K. Ryu, Insoon Yang, 26 Sep 2025, Sharpness-Aware Minimization Can Hallucinate Minimizers, https://arxiv.org/abs/2509.21818
  • Wenkai Wang, Vincent Lee, Yizhen Zheng, 20 Sep 2025, A Novel Differential Feature Learning for Effective Hallucination Detection and Classification, https://arxiv.org/abs/2509.21357
  • Seongho Joo, Kyungmin Min, Jahyun Koo, Kyomin Jung, 26 Sep 2025, Black-Box Hallucination Detection via Consistency Under the Uncertain Expression, https://arxiv.org/abs/2509.21999
  • Yifang Zhang, Pengfei Duan, Yiwen Yang, Shengwu Xiong, 26 Sep 2025, Beyond Textual Context: Structural Graph Encoding with Adaptive Space Alignment to alleviate the hallucination of LLMs, https://arxiv.org/abs/2509.22251
  • Wenqi Liu, Xuemeng Song, Jiaxi Li, Yinwei Wei, Na Zheng, Jianhua Yin, Liqiang Nie, 26 Sep 2025, Mitigating Hallucination Through Theory-Consistent Symmetric Multimodal Preference Optimization, https://arxiv.org/abs/2506.11712
  • Yike Wu, Yiwei Wang, Yujun Cai, 7 Oct 2025, ChainMPQ: Interleaved Text-Image Reasoning Chains for Mitigating Relation Hallucinations, https://arxiv.org/abs/2510.06292
  • Young D. Kwon, Abhinav Mehrotra, Malcolm Chadwick, Alberto Gil Ramos, and Sourav Bhattacharya, 7 Oct 2025, Efficient High-Resolution Image Editing with Hallucination-Aware Loss and Adaptive Tiling, https://arxiv.org/abs/2510.06295
  • Wannan Yang, Xinchi Qiu, Lei Yu, Yuchen Zhang, Oliver Aobo Yang, Narine Kokhlikyan, Nicola Cancedda, Diego Garcia-Olano, 25 Sep 2025, Hallucination reduction with CASAL: Contrastive Activation Steering For Amortized Learning, https://arxiv.org/abs/2510.02324
  • Vivek Bhavsar, Joseph Ereifej, Aravanan Gurusami, 25 Sep 2025, Hallucination-Resistant, Domain-Specific Research Assistant with Self-Evaluation and Vector-Grounded Retrieval, https://arxiv.org/abs/2510.02326
  • Jingyuan Deng, Yujiu Yang, 3 Oct 2025, MaskCD: Mitigating LVLM Hallucinations by Image Head Masked Contrastive Decoding, https://arxiv.org/abs/2510.02790
  • Geigh Zollicoffer, Minh Vu, Manish Bhattarai, 20 Oct 2025, MTRE: Multi-Token Reliability Estimation for Hallucination Detection in VLMs, https://arxiv.org/abs/2505.11741
  • Siya Qi, Lin Gui, Yulan He, Zheng Yuan, 20 Oct 2025, A Survey of Automatic Hallucination Evaluation on Natural Language Generation, https://arxiv.org/abs/2404.12041
  • Ofir Azachi, Kfir Eliyahu, Eyal El Ani, Rom Himelstein, Roi Reichart, Yuval Pinter, Nitay Calderon, 20 Sep 2025, Leveraging NTPs for Efficient Hallucination Detection in VLMs, https://arxiv.org/abs/2509.20379
  • Fabrizio Frasca, Guy Bar-Shalom, Yftah Ziser, Haggai Maron, 29 Sep 2025, Neural Message-Passing on Attention Graphs for Hallucination Detection, https://arxiv.org/abs/2509.24770
  • Mingfei Han, Haihong Hao, Jinxing Zhou, Zhihui Li, Yuhui Zheng, Xueqing Deng, Linjie Yang, Xiaojun Chang, 27 Sep 2025, Self-Consistency as a Free Lunch: Reducing Hallucinations in Vision-Language Models via Self-Reflection, https://arxiv.org/abs/2509.23236
  • Xi Zhang, Zaiqiao Meng, Jake Lever, Edmond S. L. Ho, 27 Sep 2025, CCD: Mitigating Hallucinations in Radiology MLLMs via Clinical Contrastive Decoding, https://arxiv.org/abs/2509.23379
  • Yukai Zhao, Menghan Wu, Xing Hu, Xin Xia, 28 Sep 2025, HFuzzer: Testing Large Language Models for Package Hallucinations via Phrase-based Fuzzing, https://arxiv.org/abs/2509.23835
  • Yuanshuai Li, Yuping Yan, Junfeng Tang, Yunxuan Li, Zeqi Zheng and Yaochu Jin, 29 Sep 2025, Mitigating Visual Hallucinations via Semantic Curriculum Preference Optimization in MLLMs, https://arxiv.org/abs/2509.24491
  • Aryan Yazdan Parast, Parsa Hosseini, Hesam Asadollahzadeh, Arshia Soltani Moakhar, Basim Azam, Soheil Feizi, Naveed Akhtar, 29 Sep 2025, GHOST: Hallucination-Inducing Image Generation for Multimodal LLMs, https://arxiv.org/abs/2509.25178
  • Buyun Liang, Liangzu Peng, Jinqi Luo, Darshan Thaker, Kwan Ho Ryan Chan, Ren\'e Vidal, 5 Oct 2025, SECA: Semantically Equivalent and Coherent Attacks for Eliciting LLM Hallucinations, https://arxiv.org/abs/2510.04398
  • Amir Hameed Mir, 6 Oct 2025, The Geometry of Truth: Layer-wise Semantic Dynamics for Hallucination Detection in Large Language Models, https://arxiv.org/abs/2510.04933
  • Hazel Kim, Tom A. Lamb, Adel Bibi, Philip Torr, Yarin Gal, 4 Oct 2025, Detecting LLM Hallucination Through Layer-wise Information Deficiency: Analysis of Ambiguous Prompts and Unanswerable Questions, https://arxiv.org/abs/2412.10246
  • Hansol Park, Hoseong Ahn, Junwon Moon, Yejin Lee, Kyuhong Shim, 19 Sep 2025, Evaluating Hallucinations in Multimodal LLMs with Spoken Queries under Diverse Acoustic Conditions, https://arxiv.org/abs/2510.08581
  • Hoigi Seo, Dong Un Kang, Hyunjin Cho, Joohoon Lee, Se Young Chun, 10 Oct 2025, On Epistemic Uncertainty of Visual Tokens for Object Hallucinations in Large Vision-Language Models, https://arxiv.org/abs/2510.09008
  • Atoosa Chegini, Hamid Kazemi, Garrett Souza, Maria Safi, Yang Song, Samy Bengio, Sinead Williamson, Mehrdad Farajtabar, 23 Oct 2025, Reasoning's Razor: Reasoning Improves Accuracy but Can Hurt Recall at Critical Operating Points in Safety and Hallucination Detection, https://arxiv.org/abs/2510.21049
  • Haolang Lu, Bolun Chu, WeiYe Fu, Guoshun Nan, Junning Liu, Minghui Pan, Qiankun Li, Yi Yu, Hua Wang, Kun Wang, 11 Oct 2025, Mitigating Hallucination in Multimodal Reasoning via Functional Attention Control, https://arxiv.org/abs/2510.10285
  • Ilan Lobel, Humberto Moreira and Omar Mouchtaki, 13 Oct 2025, Auction Design using Value Prediction with Hallucinations, https://arxiv.org/abs/2502.08792
  • Xiangrui Liu, Man Luo, Agneet Chatterjee, Hua Wei, Chitta Baral, Yezhou Yang, 11 Oct 2025, Investigating VLM Hallucination from a Cognitive Psychology Perspective: A First Step Toward Interpretation with Intriguing Observations, https://arxiv.org/abs/2507.03123
  • Rui Wang, Zeming Wei, Guanzhang Yue, Meng Sun, 9 Oct 2025, Revisiting Hallucination Detection with Effective Rank-based Uncertainty, https://arxiv.org/abs/2510.08389
  • Alexandra Bazarova, Aleksandr Yugay, Andrey Shulga, Alina Ermilova, Andrei Volodichev, Konstantin Polev, Julia Belikova, Rauf Parchiev, Dmitry Simakov, Maxim Savchenko, Andrey Savchenko, Serguei Barannikov, Alexey Zaytsev, 9 Oct 2025, Hallucination Detection in LLMs with Topological Divergence on Attention Graphs, https://arxiv.org/abs/2504.10063
  • Atharva Kulkarni, Yuan Zhang, Joel Ruben Antony Moniz, Xiou Ge, Bo-Hsiang Tseng, Dhivya Piraviperumal, Swabha Swayamdipta, Hong Yu, 9 Oct 2025, Evaluating Evaluation Metrics -- The Mirage of Hallucination Detection, https://arxiv.org/abs/2504.18114
  • Gagan Bhatia, Somayajulu G Sripada, Kevin Allan, Jacobo Azcona, 8 Oct 2025, Distributional Semantics Tracing: A Framework for Explaining Hallucinations in Large Language Models, https://arxiv.org/abs/2510.06107
  • Xixun Lin, Yucheng Ning, Jingwen Zhang, Yan Dong, Yilong Liu, Yongxuan Wu, Xiaohua Qi, Nan Sun, Yanmin Shang, Pengfei Cao, Lixin Zou, Xu Chen, Chuan Zhou, Jia Wu, Shirui Pan, Bin Wang, Yanan Cao, Kai Chen, Songlin Hu, Li Guo, 23 Sep 2025, LLM-based Agents Suffer from Hallucinations: A Survey of Taxonomy, Methods, and Directions, https://arxiv.org/abs/2509.18970
  • Fengyuan Sun, Hui Chen, Xinhao Xu, Dandan Zheng, Jingdong Chen, Jun Zhou, Jungong Han, Guiguang Ding, 22 Oct 2025, PruneHal: Reducing Hallucinations in Multi-modal Large Language Models through Adaptive KV Cache Pruning, https://arxiv.org/abs/2510.19183
  • Valentin No\"el, 21 Oct 2025, A Graph Signal Processing Framework for Hallucination Detection in Large Language Models, https://arxiv.org/abs/2510.19117
  • Xinxi Chen, Tianyang Chen, Lijia Hong, 30 Sep 2025, GroundSight: Augmenting Vision-Language Models with Grounding Information and De-hallucination, https://arxiv.org/abs/2509.25669
  • Guy Bar-Shalom, Fabrizio Frasca, Derek Lim, Yoav Gelberg, Yftah Ziser, Ran El-Yaniv, Gal Chechik, Haggai Maron, 30 Sep 2025, Beyond Next Token Probabilities: Learnable, Fast Detection of Hallucinations and Data Contamination on LLM Output Distributions, https://arxiv.org/abs/2503.14043
  • Bowen Xu, 29 Sep 2025, Hallucination is Inevitable for LLMs with the Open World Assumption, https://arxiv.org/abs/2510.05116
  • Maksym Zavhorodnii, Dmytro Dehtiarov, Anna Konovalenko, 6 Oct 2025, A novel hallucination classification framework, https://arxiv.org/abs/2510.05189
  • Kostas Triaridis, Alexandros Graikos, Aggelina Chatziagapi, Grigorios G. Chrysos, Dimitris Samaras, 6 Oct 2025, Mitigating Diffusion Model Hallucinations with Dynamic Guidance, https://arxiv.org/abs/2510.05356
  • Shahrad Mohammadzadeh, Juan David Guerra, Marco Bonizzato, Reihaneh Rabbany, Golnoosh Farnadi, 6 Oct 2025, Hallucination Detox: Sensitivity Dropout (SenD) for Large Language Model Training, https://arxiv.org/abs/2410.15460
  • Gaya Mehenni and Fabrice Lamarche and Odette Rios-Ibacache and John Kildea and Amal Zouaq, 7 Oct 2025, MedHal: An Evaluation Dataset for Medical Hallucination Detection, https://arxiv.org/abs/2504.08596
  • Hao Yin, Guangzong Si, Zilei Wang, 7 Oct 2025, The Mirage of Performance Gains: Why Contrastive Decoding Fails to Mitigate Object Hallucinations in MLLMs?, https://arxiv.org/abs/2504.10020
  • Rikard Rosenbacke, Carl Rosenbacke, Victor Rosenbacke, Martin McKee, 16 Oct 2025, Beyond Hallucinations: The Illusion of Understanding in Large Language Models, https://arxiv.org/abs/2510.14665
  • Sreetama Sarkar, Yue Che, Alex Gavin, Peter A. Beerel, Souvik Kundu, 15 Oct 2025, Mitigating Hallucinations in Vision-Language Models through Image-Guided Head Suppression, https://arxiv.org/abs/2505.16411
  • Will Lockett, Oct 23, 2025, You Have No Idea How Screwed OpenAI Actually Is: When you find yourself in a hole, at what point do you stop digging? https://wlockett.medium.com/you-have-no-idea-how-screwed-openai-actually-is-8358dccfca1c

Security of AI

Research on security issues involving AI and LLMs:

  • Jason Koebler, June 26, 2024, Researchers Prove Rabbit AI Breach By Sending Email to Us as Admin, https://www.404media.co/researchers-prove-rabbit-ai-breach-by-sending-email-to-us-as-admin/ (Rabbit's API security credentials were hard-coded into the device.)
  • Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, Rui Kong, Yile Wang, Hanfei Geng, Jian Luan, Xuefeng Jin, Zilong Ye, Guanjing Xiong, Fan Zhang, Xiang Li, Mengwei Xu, Zhijun Li, Peng Li, Yang Liu, Ya-Qin Zhang, Yunxin Liu, 8 May 2024 (v2), Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security, https://arxiv.org/abs/2401.05459 https://github.com/MobileLLM/Personal_LLM_Agents_Survey
  • Michael Nuñez, August 30, 2024, AI is growing faster than companies can secure it, warn industry leaders, https://venturebeat.com/ai/ai-is-growing-faster-than-companies-can-secure-it-warn-industry-leaders/
  • Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
  • Huan Yang, Deyu Zhang, Yudong Zhao, Yuanchun Li, Yunxin Liu, 6 Sep 2024, A First Look At Efficient And Secure On-Device LLM Inference Against KV Leakage, https://arxiv.org/abs/2409.04040 (Security issues where KV caches can be data leaks as they may contain encodings of private information.)
  • Nicholas Carlini, Milad Nasr, 22 Oct 2024, Remote Timing Attacks on Efficient Language Model Inference, https://arxiv.org/abs/2410.17175
  • Chenchen Gu, Xiang Lisa Li, Rohith Kuditipudi, Percy Liang, Tatsunori Hashimoto, Dec 2024, Timing Attacks on Prompt Caching in Language Model APIs, Stanford CS 191W Senior Project, https://cs191w.stanford.edu/projects/Gu,%20Chenchen_CS191W.pdf (Using timing attacks to detect prefix KV caching, thereby gaining information about other users' prompts.)
  • Úlfar Erlingsson, 27 Mar 2025, How to Secure Existing C and C++ Software without Memory Safety, https://arxiv.org/pdf/2503.21145 (Examines four risk mitigation techniques for memory safety.)
  • Anthropic, 13 Aug 2025, Building Safeguards for Claude, https://www.anthropic.com/news/building-safeguards-for-claude
  • Pallavi Zambare, Venkata Nikhil Thanikella, Nikhil Padmanabh Kottur, Sree Akhil Akula, Ying Liu, 12 Aug 2025, NetMoniAI: An Agentic AI Framework for Network Security & Monitoring, https://arxiv.org/abs/2508.10052
  • Miles Q. Li and Benjamin C. M. Fung, 13 Aug 2025, Security Concerns for Large Language Models: A Survey, https://arxiv.org/abs/2505.18889
  • Vita Santa Barletta, Vito Bavaro, Miriana Calvano, Antonio Curci, Antonio Piccinno, Davide Pio Posa, 23 Jul 2025, Enabling Cyber Security Education through Digital Twins and Generative AI, https://arxiv.org/abs/2507.17518
  • Haibo Wang, Lutfu S.Sua, and Bahram Alidaee, 22 Jul 2025, Enhancing supply chain security with automated machine learning, https://arxiv.org/abs/2406.13166
  • Lily Stelling, Mick Yang, Rokas Gipi\v{s}kis, Leon Staufer, Ze Shen Chin, Sim\'eon Campos, Ariel Gil, and Michael Chen, 22 Jul 2025, Mapping Industry Practices to the EU AI Act's GPAI Code of Practice Safety and Security Measures, https://arxiv.org/abs/2504.15181
  • Rui Guo, Avinash Ayalasomayajula, Henian Li, Jingbo Zhou, Sujan Kumar Saha, Farimah Farahmandi, 22 Jul 2025, SVAgent: AI Agent for Hardware Security Verification Assertion, https://arxiv.org/abs/2507.16203
  • Chang Gong and Zhongwen Li and Xiaoqi Li, 24 Jul 2025, Information Security Based on LLM Approaches: A Review, https://arxiv.org/abs/2507.18215
  • Pengfei Du, 14 Jul 2025, PRM-Free Security Alignment of Large Models via Red Teaming and Adversarial Training, https://arxiv.org/abs/2507.14202
  • Eldor Abdukhamidov, Mohammed Abuhamad, Simon S. Woo, Hyoungshick Kim, Tamer Abuhmed, 18 Jul 2025, Breaking the Illusion of Security via Interpretation: Interpretable Vision Transformer Systems under Attack, https://arxiv.org/abs/2507.14248
  • Zhou Li, Xiang Zhang, Jiawen Lv, Jihao Fan, Haiqiang Chen, Giuseppe Caire, 19 Jul 2025, Collusion-Resilient Hierarchical Secure Aggregation with Heterogeneous Security Constraints, https://arxiv.org/abs/2507.14768
  • Nidhi Rastogi, Shirid Pant, Devang Dhanuka, Amulya Saxena, Pranjal Mairal, 20 Jul 2025, Too Much to Trust? Measuring the Security and Cognitive Impacts of Explainability in AI-Driven SOCs, https://arxiv.org/abs/2503.02065
  • Andrew C. Cullen, Paul Montague, Sarah M. Erfani, Benjamin I.P. Rubinstein, 11 Aug 2025, Position: Certified Robustness Does Not (Yet) Imply Model Security, https://arxiv.org/abs/2506.13024
  • Andy Zou, Maxwell Lin, Eliot Jones, Micha Nowak, Mateusz Dziemian, Nick Winter, Alexander Grattan, Valent Nathanael, Ayla Croft, Xander Davies, Jai Patel, Robert Kirk, Nate Burnikell, Yarin Gal, Dan Hendrycks, J. Zico Kolter, Matt Fredrikson, 28 Jul 2025, Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition, https://arxiv.org/abs/2507.20526
  • Shen Li, Liuyi Yao, Wujia Niu, Lan Zhang, Yaliang Li, 28 Jul 2025, Security Tensors as a Cross-Modal Bridge: Extending Text-Aligned Safety to Vision in LVLM, https://arxiv.org/abs/2507.20994
  • Song Son Ha,Florian Foerster,Thomas Robert Doebbert,Tim Kittel,Dominik Merli,Gerd Scholl, 28 Jul 2025, Testbed and Software Architecture for Enhancing Security in Industrial Private 5G Networks, https://arxiv.org/abs/2507.20873
  • Keerthana Madhavan, Abbas Yazdinejad, Fattane Zarrinkalam, Ali Dehghantanha, 26 Jul 2025, Quantifying Security Vulnerabilities: A Metric-Driven Security Analysis of Gaps in Current AI Standards, https://arxiv.org/abs/2502.08610
  • Craig Wright, 10 Jul 2025, A Formal Rebuttal of "The Blockchain Trilemma: A Formal Proof of the Inherent Trade-Offs Among Decentralization, Security, and Scalability", https://arxiv.org/abs/2507.21111
  • Gauri Sharma, Vidhi Kulkarni, Miles King, Ken Huang, 23 Jul 2025, Towards Unifying Quantitative Security Benchmarking for Multi Agent Systems, https://arxiv.org/abs/2507.21146
  • Muzhi Dai, Shixuan Liu, Zhiyuan Zhao, Junyu Gao, Hao Sun, Xuelong Li, 29 Jul 2025, Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security, https://arxiv.org/abs/2507.22037
  • Kang Chen, Xiuze Zhou, Yuanguo Lin, Jinhe Su, Yuanhui Yu, Li Shen, Fan Lin, 4 Aug 2025, A Survey on Data Security in Large Language Models, https://arxiv.org/abs/2508.02312
  • Niklas Pfister, V\'aclav Volhejn, Manuel Knott, Santiago Arias, Julia Bazi\'nska, Mykhailo Bichurin, Alan Commike, Janet Darling, Peter Dienes, Matthew Fiedler, David Haber, Matthias Kraft, Marco Lancini, Max Mathys, Dami\'an Pascual-Ortiz, Jakub Podolak, Adri\`a Romero-L\'opez, Kyriacos Shiarlis, Andreas Signer, Zsolt Terek, Athanasios Theocharis, Daniel Timbrell, Samuel Trautwein, Samuel Watts, Yun-Han Wu, Mateo Rojas-Carulla, 4 Aug 2025, Gandalf the Red: Adaptive Security for LLMs, https://arxiv.org/abs/2501.07927
  • Nusrat Zahan, Imranur Rahman, Laurie Williams, 2 Aug 2025, Assumptions to Evidence: Evaluating Security Practices Adoption and Their Impact on Outcomes in the npm Ecosystem, https://arxiv.org/abs/2504.14026
  • Arturo S\'anchez-Matas, Pablo Escribano Ruiz, Daniel D\'iaz-L\'opez, Angel Luis Perales G\'omez, Pantaleone Nespoli, Gregorio Mart\'inez P\'erez, 5 Aug 2025, Simulating Cyberattacks through a Breach Attack Simulation (BAS) Platform empowered by Security Chaos Engineering (SCE), https://arxiv.org/abs/2508.03882
  • Hammad Atta, Ken Huang, Manish Bhatt, Kamal Ahmed, Muhammad Aziz Ul Haq, Yasir Mehmood, 6 Aug 2025, Logic layer Prompt Control Injection (LPCI): A Novel Security Vulnerability Class in Agentic Systems, https://arxiv.org/abs/2507.10457
  • Minghao Shao, Nanda Rani, Kimberly Milner, Haoran Xi, Meet Udeshi, Saksham Aggarwal, Venkata Sai Charan Putrevu, Sandeep Kumar Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, Muhammad Shafique, 5 Aug 2025, Towards Effective Offensive Security LLM Agents: Hyperparameter Tuning, LLM as a Judge, and a Lightweight CTF Benchmark, https://arxiv.org/abs/2508.05674
  • Hiroya Kato, Kentaro Kita, Kento Hasegawa, Seira Hidano, 12 Aug 2025, AI Security Map: Holistic Organization of AI Security Technologies and Impacts on Stakeholders, https://arxiv.org/abs/2508.08583
  • Aayush Gupta, 12 Aug 2025, Can AI Keep a Secret? Contextual Integrity Verification: A Provable Security Architecture for LLMs, https://arxiv.org/abs/2508.09288
  • Irash Perera (1), Hiranya Abeyrathne (2), Sanjeewa Malalgoda (2), Arshardh Ifthikar (2) ((1) Department of Computer Science and Engineering, University of Moratuwa, Colombo, Sri Lanka, (2) WSO2, Colombo, Sri Lanka), 14 Aug 2025, Enhancing GraphQL Security by Detecting Malicious Queries Using Large Language Models, Sentence Transformers, and Convolutional Neural Networks, https://arxiv.org/abs/2508.11711
  • Afrah Gueriani, Hamza Kheddar, Ahmed Cherif Mazari and Mohamed Chahine Ghanem, 17 Aug 2025, A Robust Cross-Domain IDS using BiGRU-LSTM-Attention for Medical and Industrial IoT Security, https://arxiv.org/abs/2508.12470
  • Yongjian Guo, Puzhuo Liu, Wanlun Ma, Zehang Deng, Xiaogang Zhu, Peng Di, Xi Xiao, Sheng Wen, 18 Aug 2025, Systematic Analysis of MCP Security, https://arxiv.org/abs/2508.12538
  • Yixuan Yang and Daoyuan Wu and Yufan Chen, 17 Aug 2025, MCPSecBench: A Systematic Security Benchmark and Playground for Testing Model Context Protocols, https://arxiv.org/abs/2508.13220
  • Daniel M. Jimenez-Gutierrez, Yelizaveta Falkouskaya, Jose L. Hernandez-Ramos, Aris Anagnostopoulos, Ioannis Chatzigiannakis, Andrea Vitaletti, 19 Aug 2025, On the Security and Privacy of Federated Learning: A Survey with Attacks, Defenses, Frameworks, Applications, and Future Directions, https://arxiv.org/abs/2508.13730
  • Abbas Sabra, Olivier Schmitt and Joseph Tyler, 20 Aug 2025, Assessing the Quality and Security of AI-Generated Code: A Quantitative Analysis, https://arxiv.org/abs/2508.14727
  • Zhixiang Guo, Siyuan Liang, Aishan Liu, Dacheng Tao, 21 Aug 2025, CopyrightShield: Enhancing Diffusion Model Security against Copyright Infringement Attacks, https://arxiv.org/abs/2412.01528
  • Akshay Mhatre and Noujoud Nader and Patrick Diehl and Deepti Gupta, 22 Aug 2025, LLM-GUARD: Large Language Model-Based Detection and Repair of Bugs and Security Vulnerabilities in C++ and Python, https://arxiv.org/abs/2508.16419
  • Anton Ludwig Bonin, Pawel Robert Smolinski, Jacek Winiarski, 22 Aug 2025, Exploring the Impact of Generative Artificial Intelligence on Software Development in the IT Sector: Preliminary Findings on Productivity, Efficiency and Job Security, https://arxiv.org/abs/2508.16811
  • Keke Lian and Bin Wang, Lei Zhang, Libo Chen, Junjie Wang, Ziming Zhao, Yujiu Yang, Haotong Duan, Haoran Zhao, Shuang Liao, Mingda Guo, Jiazheng Quan, Yilu Zhong, Chenhao He, Zichuan Chen, Jie Wu, Haoling Li, Zhaoxuan Li, Jiongchi Yu, Hui Li and Dong Zhang, 25 Aug 2025, A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code, https://arxiv.org/abs/2508.18106
  • Matous Kozak, Roshanak Zilouchian Moghaddam, Siva Sivaraman, 23 Aug 2025, When Developer Aid Becomes Security Debt: A Systematic Analysis of Insecure Behaviors in LLM Coding Agents, https://arxiv.org/abs/2507.09329
  • Ada Chen, Yongjiang Wu, Junyuan Zhang, Jingyu Xiao, Shu Yang, Jen-tse Huang, Kun Wang, Wenxuan Wang, Shuai Wang, 25 Aug 2025, A Survey on the Safety and Security Threats of Computer-Using Agents: JARVIS or Ultron?, https://arxiv.org/abs/2505.10924
  • Niveen O. Jaffal, Mohammed Alkhanafseh, David Mohaisen, 18 Jul 2025, Large Language Models in Cybersecurity: Applications, Vulnerabilities, and Defense Techniques, https://arxiv.org/abs/2507.13629
  • Julia Laubmann, Johannes Reschke, 18 Jul 2025, Tackling fake images in cybersecurity -- Interpretation of a StyleGAN and lifting its black-box, https://arxiv.org/abs/2507.13722
  • Felix H\"arer, 19 Jul 2025, Specification and Evaluation of Multi-Agent LLM Systems -- Prototype and Cybersecurity Applications, https://arxiv.org/abs/2506.10467
  • Terry Yue Zhuo, Dingmin Wang, Hantian Ding, Varun Kumar, Zijian Wang, 29 Jul 2025, Cyber-Zero: Training Cybersecurity Agents without Runtime, https://arxiv.org/abs/2508.00910
  • Mehdi Akbari Gurabi, Lasse Nitz, Radu-Mihai Castravet, Roman Matzutt, Avikarsha Mandal, Stefan Decker, 5 Aug 2025, From Legacy to Standard: LLM-Assisted Transformation of Cybersecurity Playbooks into CACAO Format, https://arxiv.org/abs/2508.03342
  • Md Zesun Ahmed Mia, Malyaban Bal, Sen Lu, George M. Nishibuchi, Suhas Chelian, Srini Vasan, Abhronil Sengupta, 6 Aug 2025, Neuromorphic Cybersecurity with Semi-supervised Lifelong Learning, https://arxiv.org/abs/2508.04610
  • Daniele Proverbio, Alessio Buscemi, Alessandro Di Stefano, The Anh Han, German Castignani and Pietro Li\`o, 4 Aug 2025, Can LLMs effectively provide game-theoretic-based scenarios for cybersecurity?, https://arxiv.org/abs/2508.05670
  • Victor Lopez Juarez, 9 Aug 2025, EU Digital Regulation and Guatemala: AI, 5G, and Cybersecurity, https://arxiv.org/abs/2508.08315
  • Yuksel Aydin, 9 Aug 2025, Cognitive Cybersecurity for Artificial Intelligence: Guardrail Engineering with CCS-7, https://arxiv.org/abs/2508.10033
  • Aydin Zaboli and Junho Hong, 12 Aug 2025, Generative AI for Cybersecurity of Energy Management Systems: Methods, Challenges, and Future Directions, https://arxiv.org/abs/2508.10044
  • Nsengiyumva Wilberforce, 2 Sep 2025, A software security review on Uganda's Mobile Money Services: Dr. Jim Spire's tweets sentiment analysis, https://arxiv.org/abs/2509.03545
  • Ofir Cohen, Gil Ari Agmon, Asaf Shabtai, Rami Puzis, 5 Sep 2025, The Information Security Awareness of Large Language Models, https://arxiv.org/abs/2411.13207
  • Anders M{\o}lmen H{\o}st and Pierre Lison and Leon Moonen, 25 Aug 2025, A Systematic Approach to Predict the Impact of Cybersecurity Vulnerabilities Using LLMs, https://arxiv.org/abs/2508.18439
  • Martin Lochner and Keegan Keplinger, 25 Aug 2025, Collaborative Intelligence: Topic Modelling of Large Language Model use in Live Cybersecurity Operations, https://arxiv.org/abs/2508.18488
  • Afan Ali and Irfanullah Khan, 26 Aug 2025, SkyTrust: Blockchain-Enhanced UAV Security for NTNs with Dynamic Trust and Energy-Aware Consensus, https://arxiv.org/abs/2508.18735
  • Xavier Cadet, Simona Boboila, Sie Hendrata Dharmawan, Alina Oprea, Peter Chin, 27 Aug 2025, PoolFlip: A Multi-Agent Reinforcement Learning Security Environment for Cyber Defense, https://arxiv.org/abs/2508.19488
  • Sai Teja Reddy Adapala, Yashwanth Reddy Alugubelly, 22 Aug 2025, The Aegis Protocol: A Foundational Security Framework for Autonomous AI Agents, https://arxiv.org/abs/2508.19267
  • Michael R Smith, Joe Ingram, 27 Aug 2025, Surveying the Operational Cybersecurity and Supply Chain Threat Landscape when Developing and Deploying AI Systems, https://arxiv.org/abs/2508.20307
  • Dan Lin, Shunfeng Lu, Ziyan Liu, Jiajing Wu, Junyuan Fang, Kaixin Lin, Bowen Song, Zibin Zheng, 28 Aug 2025, BridgeShield: Enhancing Security for Cross-chain Bridge Applications via Heterogeneous Graph Mining, https://arxiv.org/abs/2508.20517
  • Guofu Liao, Taotao Wang, Shengli Zhang, Jiqun Zhang, Shi Long, and Dacheng Tao, 29 Aug 2025, zkLoRA: Fine-Tuning Large Language Models with Verifiable Security via Zero-Knowledge Proofs, https://arxiv.org/abs/2508.21393
  • Georgios Syros, Anshuman Suri, Jacob Ginesin, Cristina Nita-Rotaru, Alina Oprea, 29 Aug 2025, SAGA: A Security Architecture for Governing AI Agentic Systems, https://arxiv.org/abs/2504.21034
  • Wenxiao Zhang, Xiangrui Kong, Conan Dewitt, Thomas Br\"aunl, Jin B. Hong, 2 Sep 2025, Enhancing Reliability in LLM-Integrated Robotic Systems: A Unified Approach to Security and Safety, https://arxiv.org/abs/2509.02163
  • Honghui Xu, Kaiyang Li, Wei Chen, Danyang Zheng, Zhiyuan Li, Zhipeng Cai, 2 Sep 2025, A Survey: Towards Privacy and Security in Mobile Large Language Models, https://arxiv.org/abs/2509.02411
  • Chengshuai Zhao, Riccardo De Maria, Tharindu Kumarage, Kumar Satvik Chaudhary, Garima Agrawal, Yiwen Li, Jongchan Park, Yuli Deng, Ying-Chih Chen, Huan Liu, 3 Sep 2025, CyberBOT: Towards Reliable Cybersecurity Education via Ontology-Grounded Retrieval Augmented Generation, https://arxiv.org/abs/2504.00389
  • Ayoub Si-ahmed, Mohammed Ali Al-Garadi, Narhimene Boustia, 2 Sep 2025, Explainable Machine Learning-Based Security and Privacy Protection Framework for Internet of Medical Things Systems, https://arxiv.org/abs/2403.09752
  • Qingyuan Li, Binchang Li, Cuiyun Gao, Shuzheng Gao, and Zongjie Li, 7 Sep 2025, Empirical Study of Code Large Language Models for Binary Security Patch Detection, https://arxiv.org/abs/2509.06052
  • Safayat Bin Hakim, Muhammad Adil, Alvaro Velasquez, Shouhuai Xu, Houbing Herbert Song, 8 Sep 2025, Neuro-Symbolic AI for Cybersecurity: State of the Art, Challenges, and Opportunities, https://arxiv.org/abs/2509.06921
  • Guangyu Lei, Tianhao Liang, Yuqi Ping, Xinglin Chen, Longyu Zhou, Junwei Wu, Xiyuan Zhang, Huahao Ding, Xingjian Zhang, Weijie Yuan, Tingting Zhang, Qinyu Zhang, 8 Sep 2025, Enhancing Low-Altitude Airspace Security: MLLM-Enabled UAV Intent Recognition, https://arxiv.org/abs/2509.06312
  • Gabriele Digregorio and Marco Di Gennaro and Stefano Zanero and Stefano Longari and Michele Carminati, 8 Sep 2025, When Secure Isn't: Assessing the Security of Machine Learning Model Sharing, https://arxiv.org/abs/2509.06703
  • Nicol\`o Romandini, Carlo Mazzocca, Kai Otsuki, Rebecca Montanari, 8 Sep 2025, SoK: Security and Privacy of AI Agents for Blockchain, https://arxiv.org/abs/2509.07131
  • Lei Yu, Jingyuan Zhang, Xin Wang, Jiajia Ma, Li Yang, Fengjun Zhang, 12 Sep 2025, SmartCoder-R1: Towards Secure and Explainable Smart Contract Generation with Security-Aware Group Relative Policy Optimization, https://arxiv.org/abs/2509.09942
  • Evan Li, Tushin Mallick, Evan Rose, William Robertson, Alina Oprea, Cristina Nita-Rotaru, 10 Sep 2025, ACE: A Security Architecture for LLM-Integrated App Systems, https://arxiv.org/abs/2504.20984
  • Yuzhou Nie, Zhun Wang, Yu Yang, Ruizhe Jiang, Yuheng Tang, Xander Davies, Yarin Gal, Bo Li, Wenbo Guo, Dawn Song, 18 Sep 2025, SeCodePLT: A Unified Platform for Evaluating the Security of Code GenAI, https://arxiv.org/abs/2410.11096
  • Yuchong Xie, Mingyu Luo, Zesen Liu, Zhixiang Zhang, Kaikai Zhang, Yu Liu, Zongjie Li, Ping Chen, Shuai Wang, Dongdong She, 19 Sep 2025, On the Security of Tool-Invocation Prompts for LLM-Based Agentic Systems: An Empirical Risk Assessment, https://arxiv.org/abs/2509.05755
  • Sergio Benlloch-Lopez, Miquel Viel-Vazquez, Javier Naranjo-Alcazar, Jordi Grau-Haro and Pedro Zuccarello, 19 Sep 2025, Threat Modeling for Enhancing Security of IoT Audio Classification Devices under a Secure Protocols Framework, https://arxiv.org/abs/2509.14657
  • Kiho Lee, Jungkon Kim, Doowon Kim, Hyoungshick Kim, 16 Sep 2025, A Systematic Evaluation of Parameter-Efficient Fine-Tuning Methods for the Security of Code LLMs, https://arxiv.org/abs/2509.12649
  • Magnus Wiik Eckhoff, Peter Marius Flydal, Siem Peters, Martin Eian, Jonas Halvorsen, Vasileios Mavroeidis, Gudmund Grov, 16 Sep 2025, A Graph-Based Approach to Alert Contextualisation in Security Operations Centres, https://arxiv.org/abs/2509.12923
  • Shaina Raza, Ranjan Sapkota, Manoj Karkee, Christos Emmanouilidis, 15 Sep 2025, TRiSM for Agentic AI: A Review of Trust, Risk, and Security Management in LLM-based Agentic Multi-Agent Systems, https://arxiv.org/abs/2506.04133
  • Umberto Gon\c{c}alves de Sousa, 2 Sep 2025, LogGuardQ: A Cognitive-Enhanced Reinforcement Learning Framework for Cybersecurity Anomaly Detection in Security Logs, https://arxiv.org/abs/2509.10511
  • Ali Habibzadeh, Farid Feyzi, and Reza Ebrahimi Atani, 13 Sep 2025, Large Language Models for Security Operations Centers: A Comprehensive Survey, https://arxiv.org/abs/2509.10858
  • Ambra Demontis, Srishti Gupta, Maura Pintor, Luca Demetrio, Kathrin Grosse, Hsiao-Ying Lin, Chengfang Fang, Battista Biggio, Fabio Roli, 15 Sep 2025, Security of Deep Reinforcement Learning for Autonomous Driving: A Survey, https://arxiv.org/abs/2212.06123
  • Amena Amro and Manar H. Alalfi, 17 Sep 2025, GitHub's Copilot Code Review: Can AI Spot Security Flaws Before You Commit?, https://arxiv.org/abs/2509.13650
  • Adel ElZemity, Budi Arief and Shujun Li, 17 Sep 2025, CyberLLMInstruct: A Pseudo-malicious Dataset Revealing Safety-performance Trade-offs in Cyber Security LLM Fine-tuning, https://arxiv.org/abs/2503.09334
  • Adel ElZemity, Budi Arief and Shujun Li, 17 Sep 2025, Analysing Safety Risks in LLMs Fine-Tuned with Pseudo-Malicious Cyber Security Data, https://arxiv.org/abs/2505.09974
  • Samuele Pasini, Jinhan Kim, Tommaso Aiello, Rocio Cabrera Lozoya, Antonino Sabetta, Paolo Tonella, 17 Sep 2025, Evaluating and Improving the Robustness of Security Attack Detectors Generated by LLMs, https://arxiv.org/abs/2411.18216
  • Luca Cotti, Idilio Drago, Anisa Rula, Devis Bianchini and Federico Cerutti, 1 Oct 2025, OntoLogX: Ontology-Guided Knowledge Graph Extraction from Cybersecurity Logs with Large Language Models, https://arxiv.org/abs/2510.01409
  • Masike Malatji, 2 Oct 2025, A cybersecurity AI agent selection and decision support framework, https://arxiv.org/abs/2510.01751
  • Shoumik Saha, Jifan Chen, Sam Mayers, Sanjay Krishna Gouda, Zijian Wang, Varun Kumar, 1 Oct 2025, Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks, https://arxiv.org/abs/2510.01359
  • Mudita Khurana, Raunak Jain, 2 Oct 2025, SoK: Measuring What Matters for Closed-Loop Security Agents, https://arxiv.org/abs/2510.01654
  • Oluwakemi T. Olayinka, Sumeet Jeswani, and Divine Iloh, 2 Oct 2025, Adaptive Cybersecurity Architecture for Digital Product Ecosystems Using Agentic AI, https://arxiv.org/abs/2509.20640
  • Caelin Kaplan, Alexander Warnecke, and Neil Archibald, 13 Oct 2025, BlackIce: A Containerized Red Teaming Toolkit for AI Security Testing, https://arxiv.org/abs/2510.11823
  • Dominik Schwarz, 13 Oct 2025, Countermind: A Multi-Layered Security Architecture for Large Language Models, https://arxiv.org/abs/2510.11837
  • Ayush Chaudhary, 14 Oct 2025, Formal Models and Convergence Analysis for Context-Aware Security Verification, https://arxiv.org/abs/2510.12440
  • Ehsan Aghaei, Sarthak Jain, Prashanth Arun, Arjun Sambamoorthy, 30 Sep 2025, SecureBERT 2.0: Advanced Language Model for Cybersecurity Intelligence, https://arxiv.org/abs/2510.00240
  • Hafijul Hoque Chowdhury, Riad Ahmed Anonto, Sourov Jajodia, Suryadipta Majumdar, Md. Shohrab Hossain, 23 Sep 2025, Identifying and Addressing User-level Security Concerns in Smart Homes Using "Smaller" LLMs, https://arxiv.org/abs/2509.19485
  • Tanmay Khule, Stefan Marksteiner, Jose Alguindigue, Hannes Fuchs, Sebastian Fischmeister, Apurva Narayan, 24 Sep 2025, STAF: Leveraging LLMs for Automated Attack Tree-Based Security Test Generation, https://arxiv.org/abs/2509.20190
  • Xiaofan Li and Xing Gao, 24 Sep 2025, Investigating Security Implications of Automatically Generated Code on the Software Supply Chain, https://arxiv.org/abs/2509.20277
  • Atousa Arzanipour and Rouzbeh Behnia and Reza Ebrahimi and Kaushik Dutta, 24 Sep 2025, RAG Security and Privacy: Formalizing the Threat Model and Attack Surface, https://arxiv.org/abs/2509.20324
  • Shrestha Datta, Shahriar Kabir Nahin, Anshuman Chhabra, Prasant Mohapatra, 27 Oct 2025, Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges, https://arxiv.org/abs/2510.23883
  • Vijay Prakash, Kevin Lee, Arkaprabha Bhattacharya, Danny Yuxing Huang, Jessica Staddon, 27 Oct 2025, Learned, Lagged, LLM-splained: LLM Responses to End User Security Questions, https://arxiv.org/abs/2411.14571
  • Wu Yichao, Wang Yirui, Ding Panpan, Wang Hailong, Zhu Bingqian, Liu Chun, 23 Oct 2025, Enhancing Security in Deep Reinforcement Learning: A Comprehensive Survey on Adversarial Attacks and Defenses, https://arxiv.org/abs/2510.20314
  • Dongsen Zhang, Zekun Li, Xu Luo, Xuannan Liu, Peipei Li, Wenjun Xu, 14 Oct 2025, MCP Security Bench (MSB): Benchmarking Attacks Against Model Context Protocol in LLM Agents, https://arxiv.org/abs/2510.15994
  • Xiaofan Li and Xing Gao, 18 Oct 2025, Toward Understanding Security Issues in the Model Context Protocol Ecosystem, https://arxiv.org/abs/2510.16558
  • Hanjun Luo, Shenyu Dai, Chiming Ni, Xinfeng Li, Guibin Zhang, Kun Wang, Tongliang Liu, Hanan Salam, 19 Oct 2025, AgentAuditor: Human-Level Safety and Security Evaluation for LLM Agents, https://arxiv.org/abs/2506.00641
  • Lingkai Kong, Haichuan Wang, Yuqi Pan, Cheol Woo Kim, Mingxiao Song, Alayna Nguyen, Tonghan Wang, Haifeng Xu, Milind Tambe, 18 Oct 2025, Robust Optimization with Diffusion Models for Green Security, https://arxiv.org/abs/2503.05730
  • Jeffrey Yang Fan Chiang, Seungjae Lee, Jia-Bin Huang, Furong Huang, Yizheng Chen, 21 Sep 2025, Why Are Web AI Agents More Vulnerable Than Standalone LLMs? A Security Analysis, https://arxiv.org/abs/2502.20383
  • Hanxiang Xu, Shenao Wang, Ningke Li, Kailong Wang, Yanjie Zhao, Kai Chen, Ting Yu, Yang Liu and Haoyu Wang, 22 Sep 2025, Large Language Models for Cyber Security: A Systematic Literature Review, https://arxiv.org/abs/2405.04760
  • Chiara Bonfanti, Alessandro Druetto, Cataldo Basile, Tharindu Ranasinghe, Marcos Zampieri, 27 Oct 2025, A Neuro-Symbolic Multi-Agent Approach to Legal-Cybersecurity Knowledge Integration, https://arxiv.org/abs/2510.23443
  • Julia Bazinska, Max Mathys, Francesco Casucci, Mateo Rojas-Carulla, Xander Davies, Alexandra Souly, Niklas Pfister, 26 Oct 2025, Breaking Agent Backbones: Evaluating the Security of Backbone LLMs in AI Agents, https://arxiv.org/abs/2510.22620
  • Bin Wang, YiLu Zhong, MiDi Wan, WenJie Yu, YuanBing Ouyang, Yenan Huang, and Hui Li, 27 Oct 2025, Is Your Prompt Poisoning Code? Defect Induction Rates and Security Mitigation Strategies, https://arxiv.org/abs/2510.22944
  • Rohan Senthil and Swee Liang Wong, 22 Oct 2025, Quantum Autoencoders for Anomaly Detection in Cybersecurity, https://arxiv.org/abs/2510.21837
  • Jun Dan, Yang Liu, Baigui Sun, Jiankang Deng, Shan Luo, 25 Oct 2025, TransFace++: Rethinking the Face Recognition Paradigm with a Focus on Accuracy, Efficiency, and Security, https://arxiv.org/abs/2308.10133
  • Shivani Shukla, Himanshu Joshi and Romilla Syed, 26 Sep 2025, Security Degradation in Iterative AI Code Generation -- A Systematic Analysis of the Paradox, https://arxiv.org/abs/2506.11022
  • Ali Naseh, Anshuman Suri, Yuefeng Peng, Harsh Chaudhari, Alina Oprea, Amir Houmansadr, 7 Oct 2025, Text-to-Image Models Leave Identifiable Signatures: Implications for Leaderboard Security, https://arxiv.org/abs/2510.06525
  • Fikret Mert G\"ultekin and Oscar Lilja and Ranim Khojah and Rebekka Wohlrab and Marvin Damschen and Mazen Mohamad, 7 Oct 2025, Leveraging Large Language Models for Cybersecurity Risk Assessment -- A Case from Forestry Cyber-Physical Systems, https://arxiv.org/abs/2510.06343
  • Asif Shahriar, Md Nafiu Rahman, Sadif Ahmed, Farig Sadeque, Md Rizwan Parvez, 7 Oct 2025, A Survey on Agentic Security: Applications, Threats and Defenses, https://arxiv.org/abs/2510.06445
  • Nuo Xu, Kaleel Mahmood, Haowen Fang, Ethan Rathbun, Caiwen Ding, Wujie Wen, 7 Oct 2025, Attacking the Spike: On the Transferability and Security of Spiking Neural Networks to Adversarial Examples, https://arxiv.org/abs/2209.03358
  • Zhun Wang, Tianneng Shi, Jingxuan He, Matthew Cai, Jialin Zhang, Dawn Song, 8 Oct 2025, CyberGym: Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale, https://arxiv.org/abs/2506.02548
  • Basil Abdullah AL-Zahrani, 2 Oct 2025, Adaptive Deception Framework with Behavioral Analysis for Enhanced Cybersecurity Defense, https://arxiv.org/abs/2510.02424
  • Yao-Ching Yu, Tsun-Han Chiang, Cheng-Wei Tsai, Chien-Ming Huang, Wen-Kwang Tsao, 3 Oct 2025, Primus: A Pioneering Collection of Open-Source Datasets for Cybersecurity LLM Training, https://arxiv.org/abs/2502.11191
  • Andrew Bowne, 20 Oct 2025, Attracting Commercial Artificial Intelligence Firms to Support National Security through Collaborative Contracts, https://arxiv.org/abs/2510.17931
  • Zaineh Abughazzah, Emna Baccour, Loay Ismail, Amr Mohamed, Mounir Hamdi, 20 Oct 2025, RL-Driven Security-Aware Resource Allocation Framework for UAV-Assisted O-RAN, https://arxiv.org/abs/2510.18084
  • Avishag Shapira, Parth Atulbhai Gandhi, Edan Habler, Asaf Shabtai, 20 Oct 2025, Mind the Web: The Security of Web Use Agents, https://arxiv.org/abs/2506.07153
  • Hanbin Hong, Shuya Feng, Nima Naderloui, Shenao Yan, Jingyu Zhang, Biying Liu, Ali Arastehfard, Heqing Huang, and Yuan Hong, 21 Oct 2025, SoK: Taxonomy and Evaluation of Prompt Security in Large Language Models, https://arxiv.org/abs/2510.15476
  • Noam Schmitt (IP Paris, TSP, ENS Paris Saclay), Marc Antoine Lacoste, 23 Sep 2025, Centralized vs. Decentralized Security for Space AI Systems? A New Look, https://arxiv.org/abs/2509.20395
  • Tharcisse Ndayipfukamiye, Jianguo Ding, Doreen Sebastian Sarwatt, Adamu Gaston Philipo, Huansheng Ning, 24 Sep 2025, Adversarial Defense in Cybersecurity: A Systematic Review of GANs for Threat Detection and Mitigation, https://arxiv.org/abs/2509.20411
  • Shanjukta Nath, Jiwon Hong, Jae Ho Chang, Keith Warren, Subhadeep Paul, 25 Sep 2025, Recidivism and Peer Influence with LLM Text Embeddings in Low Security Correctional Facilities, https://arxiv.org/abs/2509.20634
  • Yu Liu, Boxiang He, and Fanggang Wang, 25 Sep 2025, Security-aware Semantic-driven ISAC via Paired Adversarial Residual Networks, https://arxiv.org/abs/2509.20835
  • Petar Radanliev, 26 Sep 2025, Red Teaming Quantum-Resistant Cryptographic Standards: A Penetration Testing Framework Integrating AI and Quantum Security, https://arxiv.org/abs/2509.22757
  • Xingyu Li, Juefei Pu, Yifan Wu, Xiaochen Zou, Shitong Zhu, Xiaochen Zou, Shitong Zhu, Qiushi Wu, Zheng Zhang, Joshua Hsu, Yue Dong, Zhiyun Qian, Kangjie Lu, Trent Jaeger, Michael De Lucia, Srikanth V. Krishnamurthy (UC Riverside), 26 Sep 2025, What Do They Fix? LLM-Aided Categorization of Security Patches for Critical Memory Bugs, https://arxiv.org/abs/2509.22796
  • Cade Houston Kennedy, Amr Hilal, Morteza Momeni, 7 Oct 2025, The Role of Federated Learning in Improving Financial Security: A Survey, https://arxiv.org/abs/2510.14991
  • Maraz Mia, Mir Mehedi A. Pritom, 4 Oct 2025, Explainable but Vulnerable: Adversarial Attacks on XAI Explanation in Cybersecurity Applications, https://arxiv.org/abs/2510.03623
  • Yudong Zhang, Ruobing Xie, Xingwu Sun, Jiansheng Chen, Zhanhui Kang, Di Wang, Yu Wang, 4 Oct 2025, The Security Threat of Compressed Projectors in Large Vision-Language Models, https://arxiv.org/abs/2506.00534
  • Soham Hans, Stacy Marsella, Sophia Hirschmann, and Nikolos Gurney, 23 Oct 2025, Security Logs to ATT&CK Insights: Leveraging LLMs for High-Level Threat Understanding and Cognitive Trait Inference, https://arxiv.org/abs/2510.20930
  • Mohammadhossein Homaei, Mehran Tarif, Mar Avilla, and Andres Caro, 16 Sep 2025, Causal Digital Twins for Cyber-Physical Security: A Framework for Robust Anomaly Detection in Industrial Control Systems, https://arxiv.org/abs/2510.09616
  • Jiayun Mo, Xin Kang, Tieyan Li, Zhongding Lei, 18 Sep 2025, Toward a Unified Security Framework for AI Agents: Trust, Risk, and Liability, https://arxiv.org/abs/2510.09620
  • Bernhard Mueller, 29 Sep 2025, Hound: Relation-First Knowledge Graphs for Complex-System Reasoning in Security Audits, https://arxiv.org/abs/2510.09633
  • Gaojian Wang, Feng Lin, Tong Wu, Zhisheng Yan, Kui Ren, 12 Oct 2025, Scalable Face Security Vision Foundation Model for Deepfake, Diffusion, and Spoofing Detection, https://arxiv.org/abs/2510.10663
  • Michael Schlichtkrull, 13 Oct 2025, Attacks by Content: Automated Fact-checking is an AI Security Issue, https://arxiv.org/abs/2510.11238
  • Tonmoy Ghosh, 4 Oct 2025, AdaptAuth: Multi-Layered Behavioral and Credential Analysis for a Secure and Adaptive Authentication Framework for Password Security, https://arxiv.org/abs/2510.09645
  • Yujin Potter, Wenbo Guo, Zhun Wang, Tianneng Shi, Andy Zhang, Patrick Gage Kelley, Kurt Thomas, and Dawn Song, 11 Oct 2025, Frontier AI's Impact on the Cybersecurity Landscape, https://arxiv.org/abs/2504.05408
  • Hikmat A. M. Abdeljaber, Md. Alamgir Hossain, Sultan Ahmad, Ahmed Alsanad, Md Alimul Haque, Sudan Jha and Jabeen Nazeer, 9 Oct 2025, A Novel Ensemble Learning Approach for Enhanced IoT Attack Detection: Redefining Security Paradigms in Connected Systems, https://arxiv.org/abs/2510.08084
  • Steve Huntsman, 23 Sep 2025, Coherence-driven inference for cybersecurity, https://arxiv.org/abs/2509.18520
  • Aicha War, Serge L.B. Nikiema, Jordan Samhi, Jacques Klein, Tegawende F. Bissyande, 23 Sep 2025, Security smells in infrastructure as code: a taxonomy update beyond the seven sins, https://arxiv.org/abs/2509.18761
  • Aicha War, Adnan A. Rawass, Abdoul K. Kabore, Jordan Samhi, Jacques Klein, and Tegawende F. Bissyande, 23 Sep 2025, Detection of security smells in IaC scripts through semantics-aware code and language processing, https://arxiv.org/abs/2509.18790
  • Hwiwon Lee, Ziqi Zhang, Hanxiao Lu, Lingming Zhang, 22 Oct 2025, SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks, https://arxiv.org/abs/2506.11791
  • Yuan Huang, 25 Sep 2025, Fine-tuning of Large Language Models for Domain-Specific Cybersecurity Knowledge, https://arxiv.org/abs/2509.25241
  • Mary Llewellyn, Annie Gray, Josh Collyer and Michael Harries, 7 Oct 2025, Towards Reliable and Practical LLM Security Evaluations via Bayesian Modelling, https://arxiv.org/abs/2510.05709
  • Xinyi Hou, Yanjie Zhao, Shenao Wang, Haoyu Wang, 7 Oct 2025, Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions, https://arxiv.org/abs/2503.23278
  • Alex Pierron, Michel Barbeau, Luca De Cicco, Jose Rubio-Hernan, Joaquin Garcia-Alfaro, 7 Oct 2025, A Fairness-Aware Strategy for B5G Physical-layer Security Leveraging Reconfigurable Intelligent Surfaces, https://arxiv.org/abs/2506.06344
  • Edoardo Allegrini, Ananth Shreekumar, Z. Berkay Celik, 15 Oct 2025, Formalizing the Safety, Security, and Functional Properties of Agentic AI Systems, https://arxiv.org/abs/2510.14133
  • Mason Nakamura, Abhinav Kumar, Saaduddin Mahmud, Sahar Abdelnabi, Shlomo Zilberstein, Eugene Bagdasarian, 16 Oct 2025, Terrarium: Revisiting the Blackboard for Multi-Agent Safety, Privacy, and Security Studies, https://arxiv.org/abs/2510.14312
  • Eugene Neelou, Ivan Novikov, Max Moroz, Om Narayan, Tiffany Saade, Mika Ayenson, Ilya Kabanov, Jen Ozmen, Edward Lee, Vineeth Sai Narajala, Emmanuel Guilherme Junior, Ken Huang, Huseyin Gulsin, Jason Ross, Marat Vyshegorodtsev, Adelin Travers, Idan Habler, Rahul Jadav, 8 Oct 2025, A2AS: Agentic AI Runtime Security and Self-Defense, https://arxiv.org/abs/2510.13825
  • Ruchit Rawal, Jeffrey Yang Fan Chiang, Chihao Shen, Jeffery Siyuan Tian, Aastha Mahajan, Tom Goldstein, Yizheng Chen, 13 Oct 2025, Benchmarking Correctness and Security in Multi-Turn Code Generation, https://arxiv.org/abs/2510.13859
  • Matan Levi, Daniel Ohayon, Ariel Blobstein, Ravid Sagi, Ian Molloy, Yair Allouche, 15 Oct 2025, Toward Cybersecurity-Expert Small Language Models, https://arxiv.org/abs/2510.14113
  • AbdulAziz AbdulGhaffar, Ashraf Matrawy, 15 Oct 2025, LLMs' Suitability for Network Security: A Case Study of STRIDE Threat Modeling, https://arxiv.org/abs/2505.04101

Safety Monitor

A safety monitor is a component that can be added to the LLM deployment.

General Thoughts on AI Safety

High-level debate and discussions of AI safety issues:

Government Policy and Regulation

Various governments have examined issues around regulation, and there has also been much debate:

Auditing and Enforcement

Papers on auditing or enforcement of AI policy:

  • J. Mökander and L. Floridi. 2022, Operationalising AI governance through ethics-based auditing: An industry case study. AI and Ethics, pages 1–18, https://link.springer.com/article/10.1007/s43681-022-00171-7
  • J. Mökander, J. Schuett, H. R. Kirk, and L. Floridi. June 2023. Auditing large language models: A three-layered approach. arXiv preprint arXiv:2302.08500. https://arxiv.org/abs/2302.08500
  • J. Mökander, J. Morley, M. Taddeo, and L. Floridi. Ethics-based auditing of automated decision-making systems: Nature, scope, and limitations. Science and Engineering Ethics, 27(44), 2021. https://arxiv.org/abs/2110.10980

Bias and Fairness

AI engines have shown bias in various ways. The goal is to have them show "fairness" in their results:

  • Dastin Jeffrey. Oct 2018, Amazon scraps secret AI recruiting tool that showed bias against women. Reuters. https://www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G
  • Courtland R., 2018, Bias detectives: the researchers striving to make algorithms fair. Nature. 2018 Jun;558(7710):357-360. doi: 10.1038/d41586-018-05469-3. PMID: 29925973 https://pubmed.ncbi.nlm.nih.gov/29925973/
  • Caliskan Aylin, Bryson Joanna J., Narayanan Arvind. 2017. Semantics derived automatically from language corpora contain human-like biases. Science. 2017;356:183–186. https://pubmed.ncbi.nlm.nih.gov/28408601/
  • A Levendowski, 2018, How copyright law can fix artificial intelligence's implicit bias problem, Wash. L. Rev., https://digitalcommons.law.uw.edu/cgi/viewcontent.cgi?article=5042&context=wlr
  • Hao Karen. 2020. AI researchers say scientific publishers help perpetuate racist algorithms. MIT Technology Review. https://www.technologyreview.com/2020/06/23/1004333/ai-science-publishers-perpetuate-racist-face-recognition/
  • K Ramesh, A Chavan, S Pandit, 2023, A Comparative Study on the Impact of Model Compression Techniques on Fairness in Language Models, Microsoft Research, https://aclanthology.org/2023.acl-long.878.pdf, https://www.microsoft.com/en-us/research/uploads/prod/2023/07/3687_Paper.pdf
  • Jwala Dhamala, Varun Kumar, Rahul Gupta, Kai-Wei Chang, Aram Galstyan, Oct 2022, An Analysis of the Effects of Decoding Algorithms on Fairness in Open-Ended Language Generation, https://arxiv.org/abs/2210.03826 (Examines top-p, top-k, and temperature in decoding algorithms from a safety perspective.)
  • Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, Cyril Zhang, June 2023, Exposing Attention Glitches with Flip-Flop Language Modeling, https://arxiv.org/abs/2306.00946
  • Valentin Hofmann, Pratyusha Ria Kalluri, Dan Jurafsky, Sharese King, 1 Mar 2024, Dialect prejudice predicts AI decisions about people's character, employability, and criminality, https://arxiv.org/abs/2403.00742 https://arxiv.org/pdf/2403.00742.pdf
  • Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, Lina Yao, 6 Jul 2024, AI Safety in Generative AI Large Language Models: A Survey, https://arxiv.org/abs/2407.18369
  • Cem Dilmegani, Jan 10, 2024, The Future of Large Language Models in 2024, https://research.aimultiple.com/future-of-large-language-models/
  • Douglas C. Youvan, September 27, 2024, Building and Running Large-Scale Language Models: The Infrastructure and Techniques Behind GPT-4 , https://www.researchgate.net/profile/Douglas-Youvan/publication/384398902_Building_and_Running_Large-Scale_Language_Models_The_Infrastructure_and_Techniques_Behind_GPT-4/links/66f6f4d3906bca2ac3d20e68/Building-and-Running-Large-Scale-Language-Models-The-Infrastructure-and-Techniques-Behind-GPT-4.pdf
  • Mayank Vatsa, Anubhooti Jain, Richa Singh, 7 Dec 2023, Adventures of Trustworthy Vision-Language Models: A Survey, https://arxiv.org/abs/2312.04231
  • Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Martín Santillán Cooper, Kieran Fraser, Giulio Zizzo, Muhammad Zaid Hameed, Mark Purcell, Michael Desmond, Qian Pan, Inge Vejsbjerg, Elizabeth M. Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R. Varshney, Prasanna Sattigeri, 10 Dec 2024, Granite Guardian, https://arxiv.org/abs/2412.07724 https://github.com/ibm-granite/granite-guardian (Open-sourcing of safety models with many capabilities.)
  • FZ Subah, Oct 2025, Mitigating and Assessing Bias and Fairness in Large Language Model-Generated Synthetic Tabular Data, Masters Thesis, Department of Engineering, University of Cambridge, https://www.mlmi.eng.cam.ac.uk/files/2023-2024/fzs21_mitigating_2024.pdf
  • Abdullah Hashmat, Muhammad Arham Mirza, Agha Ali Raza, 13 Aug 2025, PakBBQ: A Culturally Adapted Bias Benchmark for QA, https://arxiv.org/abs/2508.10186
  • Gustavo Bonil, Simone Hashiguti, Jhessica Silva, Jo\~ao Gondim, Helena Maia, N\'adia Silva, Helio Pedrini, Sandra Avila, 14 Aug 2025, Yet another algorithmic bias: A Discursive Analysis of Large Language Models Reinforcing Dominant Discourses on Gender and Race, https://arxiv.org/abs/2508.10304
  • Alessio Buscemi, Daniele Proverbio, Alessandro Di Stefano, The Anh Han, German Castignani and Pietro Li\`o, 14 Aug 2025, FAIRGAME: a Framework for AI Agents Bias Recognition using Game Theory, https://arxiv.org/abs/2504.14325
  • Suhas G Hegde, Shilpy Kaur, Aruna Tiwari, 14 Aug 2025, VectorFit : Adaptive Singular & Bias Vector Fine-Tuning of Pre-trained Foundation Models, https://arxiv.org/abs/2503.19530
  • Yan Li, Guangyi Chen, Yunlong Deng, Zijian Li, Zeyu Tang, Anpeng Wu, Kun Zhang, 22 Jul 2025, Should Bias Always be Eliminated? A Principled Framework to Use Data Bias for OOD Generation, https://arxiv.org/abs/2507.17001
  • Shalaka Satheesh, Katrin Klug, Katharina Beckh, H\'ector Allende-Cid, Sebastian Houben, Teena Hassan, 22 Jul 2025, GG-BBQ: German Gender Bias Benchmark for Question Answering, https://arxiv.org/abs/2507.16410
  • Kristin Gnadt, David Thulke, Simone Kopeinik, Ralf Schl\"uter, 22 Jul 2025, Exploring Gender Bias in Large Language Models: An In-depth Dive into the German Language, https://arxiv.org/abs/2507.16557
  • Zhenyuan Chen, 21 Jul 2025, Rethinking Inductive Bias in Geographically Neural Network Weighted Regression, https://arxiv.org/abs/2507.09958
  • Sergio Morales, Robert Claris\'o, Jordi Cabot, 22 Jul 2025, LangBiTe: A Platform for Testing Bias in Large Language Models, https://arxiv.org/abs/2404.18558
  • Yanbiao Ma, Bowei Liu, Boyuan Gao, Wei Dai, Jiayi Chen, Shuo Li, Andi Zhang, 22 Jul 2025, Revealing Bias Formation in Deep Neural Networks Through the Geometric Mechanisms of Human Visual Decoupling, https://arxiv.org/abs/2502.11809
  • Brian Liu and Rahul Mazumder, 21 Jul 2025, Randomization Can Reduce Both Bias and Variance: A Case Study in Random Forests, https://arxiv.org/abs/2402.12668
  • Ali Vardasbi, Gustavo Penha, Claudia Hauff, and Hugues Bouchard, 23 Jul 2025, Adaptive Repetition for Mitigating Position Bias in LLM-Based Ranking, https://arxiv.org/abs/2507.17788
  • Steven A. Frank, 24 Jul 2025, The Price equation reveals a universal force-metric-bias law of algorithmic learning and natural selection, https://arxiv.org/abs/2507.18549
  • Bruno Scarone, Alfredo Viola, Ren\'ee J. Miller, Ricardo Baeza-Yates, 24 Jul 2025, A Principled Approach for Data Bias Mitigation, https://arxiv.org/abs/2405.12312
  • He-Yang Xu, Hongxiang Gao, Yuwen Li, Xiu-Shen Wei and Chengyu Liu, 24 Jul 2025, Masked Autoencoders that Feel the Heart: Unveiling Simplicity Bias for ECG Analyses, https://arxiv.org/abs/2506.22495
  • Yuen Chen, Vethavikashini Chithrra Raghuram, Justus Mattern, Rada Mihalcea, Zhijing Jin, 24 Jul 2025, Causally Testing Gender Bias in LLMs: A Case Study on Occupational Bias, https://arxiv.org/abs/2212.10678
  • Yongyi Yang, Hidenori Tanaka, Wei Hu, 17 Jul 2025, Provable Low-Frequency Bias of In-Context Learning of Representations, https://arxiv.org/abs/2507.13540
  • Yile Yan, Yuqi Zhu, Wentao Xu, 18 Jul 2025, Bias in Decision-Making for AI's Ethical Dilemmas: A Comparative Study of ChatGPT and Claude, https://arxiv.org/abs/2501.10484
  • Andr\'es Morales-Forero (1), Lili J. Rueda (2), Ronald Herrera (3), Samuel Bassetto (1), Eric Coatanea (4) ((1) Polytechnique Montr\'eal, (2) Universidad El Bosque, (3) Boehringer Ingelheim International GmbH, (4) Tampere University), 10 Jul 2025, Predictive Representativity: Uncovering Racial Bias in AI-based Skin Cancer Detection, https://arxiv.org/abs/2507.14176
  • Xiaotong Luo, Shengda Zhuo, Min Chen, Lichun Li, Ruizhao Lu, Wenqi Fan, Shuqiang Huang and Yin Tang, 12 Jul 2025, From Bias to Behavior: Learning Bull-Bear Market Dynamics with Contrastive Modeling, https://arxiv.org/abs/2507.14182
  • Eoghan Cunningham, James Cross, Derek Greene, 16 Jul 2025, Identifying Algorithmic and Domain-Specific Bias in Parliamentary Debate Summarisation, https://arxiv.org/abs/2507.14221
  • Garud Iyengar, Henry Lam, Tianyu Wang, 21 Jul 2025, Optimizer's Information Criterion: Dissecting and Correcting Bias in Data-Driven Optimization, https://arxiv.org/abs/2306.10081
  • Evangelia Spiliopoulou, Riccardo Fogliato, Hanna Burnsky, Tamer Soliman, Jie Ma, Graham Horwood, Miguel Ballesteros, 8 Aug 2025, Play Favorites: A Statistical Method to Measure Self-Bias in LLM-as-a-Judge, https://arxiv.org/abs/2508.06709
  • Falaah Arif Khan, Nivedha Sivakumar, Yinong Oliver Wang, Katherine Metcalf, Cezanne Camacho, Barry-John Theobald, Luca Zappella, Nicholas Apostoloff, 9 Aug 2025, Investigating Intersectional Bias in Large Language Models using Confidence Disparities in Coreference Resolution, https://arxiv.org/abs/2508.07111
  • Vivek Hruday Kavuri, Vysishtya Karanam, Venkata Jahnavi Venkamsetty, Kriti Madumadukala, Lakshmipathi Balaji Darur, Ponnurangam Kumaraguru, 10 Aug 2025, Freeze and Reveal: Exposing Modality Bias in Vision-Language Models, https://arxiv.org/abs/2508.07432
  • Vojt\v{e}ch Stan\v{e}k, Karel Srna, Anton Firc, Kamil Malinka, 11 Aug 2025, SCDF: A Speaker Characteristics DeepFake Speech Dataset for Bias Analysis, https://arxiv.org/abs/2508.07944
  • Xinyi Wu, Yifei Wang, Stefanie Jegelka, Ali Jadbabaie, 9 Aug 2025, On the Emergence of Position Bias in Transformers, https://arxiv.org/abs/2502.01951
  • Walter Laurito, Benjamin Davis, Peli Grietzer, Tom\'a\v{s} Gaven\v{c}iak, Ada B\"ohm, Jan Kulveit, 11 Aug 2025, AI-AI Bias: large language models favor communications generated by large language models, https://arxiv.org/abs/2407.12856
  • Dasol Choi, Jihwan Lee, Minjae Lee, Minsuk Kahng, 10 Aug 2025, When Cars Have Stereotypes: Auditing Demographic Bias in Objects from Text-to-Image Models, https://arxiv.org/abs/2508.03483
  • Anuprabha M, Krishna Gurugubelli and Anil Kumar Vuppala, 11 Aug 2025, Fairness in Dysarthric Speech Synthesis: Understanding Intrinsic Bias in Dysarthric Speech Cloning using F5-TTS, https://arxiv.org/abs/2508.05102
  • Chao Wu, Zhenyi Wang, Kangxian Xie, Naresh Kumar Devulapally, Vishnu Suresh Lokhande, Mingchen Gao, 28 Jul 2025, Model-Agnostic Gender Bias Control for Text-to-Image Generation via Sparse Autoencoder, https://arxiv.org/abs/2507.20973
  • Gabriel Recchia, Chatrik Singh Mangat, Jinu Nyachhyon, Mridul Sharma, Callum Canavan, Dylan Epstein-Gross, Muhammed Abdulbari, 17 May 2025, Confirmation bias: A challenge for scalable oversight, https://arxiv.org/abs/2507.19486
  • Pavel Korshunov, Ketan Kotwal, Christophe Ecabert, Vidit Vidit, Amir Mohammadi, and Sebastien Marcel, 28 Jul 2025, Investigation of Accuracy and Bias in Face Recognition Trained with Synthetic Data, https://arxiv.org/abs/2507.20782
  • Hoyoung Lee, Junhyuk Seo, Suhwan Park, Junhyeong Lee, Wonbin Ahn, Chanyeol Choi, Alejandro Lopez-Lira, Yongjae Lee, 28 Jul 2025, Your AI, Not Your View: The Bias of LLMs in Investment Analysis, https://arxiv.org/abs/2507.20957
  • Yooshin Cho, Hanbyel Cho, Janghyeon Lee, HyeongGwon Hong, Jaesung Ahn, Junmo Kim, 27 Jul 2025, Controllable Feature Whitening for Hyperparameter-Free Bias Mitigation, https://arxiv.org/abs/2507.20284
  • Seoyoung Doh, Hyeon Jeon, Sungbok Shin, Ghulam Jilani Quadri, Nam Wook Kim, Jinwook Seo, 28 Jul 2025, Understanding Bias in Perceiving Dimensionality Reduction Projections, https://arxiv.org/abs/2507.20805
  • Hitomi Yanaka, Xinqi He, Jie Lu, Namgi Han, Sunjin Oh, Ryoma Kumon, Yuma Matsuoka, Katsuhiko Watabe, Yuko Itatsu, 27 Jul 2025, Intersectional Bias in Japanese Large Language Models from a Contextualized Perspective, https://arxiv.org/abs/2506.12327
  • Franck Bardol, 17 Jun 2025, ChatGPT Reads Your Tone and Responds Accordingly -- Until It Does Not -- Emotional Framing Induces Bias in LLM Outputs, https://arxiv.org/abs/2507.21083
  • Zhenyu Pan, Yutong Zhang, Jianshu Zhang, Haoran Lu, Haozheng Luo, Yuwei Han, Philip S. Yu, Manling Li, Han Liu, 30 Jul 2025, FairReason: Balancing Reasoning and Social Bias in MLLMs, https://arxiv.org/abs/2507.23067
  • Patricia A. Apell\'aniz and Ana Jim\'enez and Borja Arroyo Galende and Juan Parras and Santiago Zazo, 31 Jul 2025, Artificial Inductive Bias for Synthetic Tabular Data Generation in Data-Scarce Scenarios, https://arxiv.org/abs/2407.03080
  • Utku Ozbulak, Seyed Amir Mousavi, Francesca Tozzi, Niki Rashidian, Wouter Willaert, Wesley De Neve, Joris Vankerschaver, 31 Jul 2025, Revisiting the Evaluation Bias Introduced by Frame Sampling Strategies in Surgical Video Segmentation Using SAM2, https://arxiv.org/abs/2502.20934
  • Afrozah Nadeem, Mark Dras, and Usman Naseem, 31 Jul 2025, Framing Political Bias in Multilingual LLMs Across Pakistani Languages, https://arxiv.org/abs/2506.00068
  • Bushra Asseri, Estabrag Abdelaziz, Areej Al-Wabil, 30 Jul 2025, Prompt Engineering Techniques for Mitigating Cultural Bias Against Arabs and Muslims in Large Language Models: A Systematic Review, https://arxiv.org/abs/2506.18199
  • Simon M\"unker, 31 Jul 2025, Cultural Bias in Large Language Models: Evaluating AI Agents through Moral Questionnaires, https://arxiv.org/abs/2507.10073
  • Kwesi Cobbina and Tianyi Zhou, 30 Jul 2025, Where to show Demos in Your Prompt: A Positional Bias of In-Context Learning, https://arxiv.org/abs/2507.22887
  • Adam Block and Cyril Zhang, 31 Jul 2025, EMA Without the Lag: Bias-Corrected Iterate Averaging Schemes, https://arxiv.org/abs/2508.00180
  • Kangda Wei, Hasnat Md Abdullah, Ruihong Huang, 1 Aug 2025, Mitigating Gender Bias via Fostering Exploratory Thinking in LLMs, https://arxiv.org/abs/2505.17217
  • Zhenpeng Chen, Xinyue Li, Jie M. Zhang, Weisong Sun, Ying Xiao, Tianlin Li, Yiling Lou, Yang Liu, 5 Aug 2025, Software Fairness Dilemma: Is Bias Mitigation a Zero-Sum Game?, https://arxiv.org/abs/2508.03323
  • Jiangen He, 2 Aug 2025, Who Gets Cited? Gender- and Majority-Bias in LLM-Driven Reference Selection, https://arxiv.org/abs/2508.02740
  • Shahed Masoudian, Gustavo Escobedo, Hannah Strauss, Markus Schedl, 5 Aug 2025, Investigating Gender Bias in LLM-Generated Stories via Psychological Stereotypes, https://arxiv.org/abs/2508.03292
  • Joseph Lee, Tianqi Shang, Jae Young Baik, Duy Duong-Tran, Shu Yang, Lingyao Li, Li Shen, 4 Aug 2025, From Promising Capability to Pervasive Bias: Assessing Large Language Models for Emergency Department Triage, https://arxiv.org/abs/2504.16273
  • Zhen Zou, Feng Zhao, 5 Aug 2025, FEB-Cache: Frequency-Guided Exposure Bias Reduction for Enhancing Diffusion Transformer Caching, https://arxiv.org/abs/2503.07120
  • Hamed Ayoobi, Nico Potyka, Anna Rapberger, Francesca Toni, 6 Aug 2025, Argumentative Debates for Transparent Bias Detection [Technical Report], https://arxiv.org/abs/2508.04511
  • Tiffany Zhu, Iain Weissburg, Kexun Zhang, William Yang Wang, 6 Aug 2025, Human Bias in the Face of AI: Examining Human Judgment Against Text Labeled as AI Generated, https://arxiv.org/abs/2410.03723
  • Tosin Fadahunsi, Giordano d'Aloisio, Antinisca Di Marco, Federica Sarro, 5 Aug 2025, How Do Generative Models Draw a Software Engineer? A Case Study on Stable Diffusion Bias, https://arxiv.org/abs/2501.09014
  • Kelsey Doerksen, Yuliya Marchetti, Kevin Bowman, Steven Lu, James Montgomery, Yarin Gal, Freddie Kalaitzis, Kazuyuki Miyazaki, 6 Aug 2025, Leveraging Deep Learning for Physical Model Bias of Global Air Quality Estimates, https://arxiv.org/abs/2508.04886
  • Menghua Jiang, Yuxia Lin, Baoliang Chen, Haifeng Hu, Yuncheng Jiang, Sijie Mai, 7 Aug 2025, Disentangling Bias by Modeling Intra- and Inter-modal Causal Attention for Multimodal Sentiment Analysis, https://arxiv.org/abs/2508.04999
  • Jiahao Chen, Bin Qin, Jiangmeng Li, Hao Chen, Bing Su, 8 Aug 2025, Rethinking the Bias of Foundation Model under Long-tailed Distribution, https://arxiv.org/abs/2501.15955
  • Shivam Dubey, 12 Aug 2025, Activation Steering for Bias Mitigation: An Interpretable Approach to Safer LLMs, https://arxiv.org/abs/2508.09019
  • Afrozah Nadeem, Mark Dras, Usman Naseem, 12 Aug 2025, Steering Towards Fairness: Mitigating Political Bias in LLMs, https://arxiv.org/abs/2508.08846
  • Krzysztof Maziarz, Guoqing Liu, Hubert Misztela, Austin Tripp, Junren Li, Aleksei Kornev, Piotr Gai\'nski, Holger Hoefling, Mike Fortunato, Rishi Gupta, Marwin Segler, 12 Aug 2025, Chemist-aligned retrosynthesis by ensembling diverse inductive bias models, https://arxiv.org/abs/2412.05269
  • Zara Siddique, Irtaza Khalid, Liam D. Turner, Luis Espinosa-Anke, 13 Aug 2025, Shifting Perspectives: Steering Vectors for Robust Bias Mitigation in LLMs, https://arxiv.org/abs/2503.05371
  • Jingwei Li, Jing Xu, Zifan Wang, Huishuai Zhang, Jingzhao Zhang, 13 Aug 2025, Understanding Nonlinear Implicit Bias via Region Counts in Input Space, https://arxiv.org/abs/2505.11370
  • Parker Whitfill, 14 Aug 2025, Note on Selection Bias in Observational Estimates of Algorithmic Progress, https://arxiv.org/abs/2508.11033
  • Aiswarya Konavoor, Raj Abhijit Dandekar, Rajat Dandekar, Sreedath Panat, 15 Aug 2025, Vision-Language Models display a strong gender bias, https://arxiv.org/abs/2508.11262
  • Binxu Wang, Cengiz Pehlevan, 14 Aug 2025, An Analytical Theory of Spectral Bias in the Learning Dynamics of Diffusion Models, https://arxiv.org/abs/2503.03206
  • Keyon Vafa, Peter G. Chang, Ashesh Rambachan, Sendhil Mullainathan, 14 Aug 2025, What Has a Foundation Model Found? Using Inductive Bias to Probe for World Models, https://arxiv.org/abs/2507.06952
  • Pengcheng Huang, Shuhao Liu, Zhenghao Liu, Yukun Yan, Shuo Wang, Zulong Chen, Tong Xiao, 18 Aug 2025, PC-Sampler: Position-Aware Calibration of Decoding Bias in Masked Diffusion Models, https://arxiv.org/abs/2508.13021
  • Yuanzhe Hu, Kinshuk Goel, Vlad Killiakov, Yaoqing Yang, 18 Aug 2025, Eigenspectrum Analysis of Neural Networks without Aspect Ratio Bias, https://arxiv.org/abs/2506.06280
  • Evan Chen, Run-Jun Zhan, Yan-Bai Lin, Hung-Hsuan Chen, 15 Aug 2025, More Women, Same Stereotypes: Unpacking the Gender Bias Paradox in Large Language Models, https://arxiv.org/abs/2503.15904
  • Koffi Ismael Ouattara, Ioannis Krontiris, Theo Dimitrakos and Frank Kargl, 19 Aug 2025, Assessing Trustworthiness of AI Training Dataset using Subjective Logic -- A Use Case on Bias, https://arxiv.org/abs/2508.13813
  • Jonathan A. Karr Jr., Benjamin F. Herbst, Ting Hua, Matthew Hauenstein, Georgina Curto, Nitesh V. Chawla, 14 Aug 2025, Combating Homelessness Stigma with LLMs: A New Multi-Modal Dataset for Bias Detection, https://arxiv.org/abs/2508.13187
  • Hao Zhang and Chen Li and Basura Fernando, 19 Aug 2025, Mitigating Easy Option Bias in Multiple-Choice Question Answering, https://arxiv.org/abs/2508.13428
  • Dariia Puhach and Amir H. Payberah and \'Eva Sz\'ekely, 19 Aug 2025, Who Gets the Mic? Investigating Gender Bias in the Speaker Assignment of a Speech-LLM, https://arxiv.org/abs/2508.13603
  • Vinod Kumar Chauhan, Lei Clifton, Achille Sala\"un, Huiqi Yvonne Lu, Kim Branson, Patrick Schwab, Gaurav Nigam, David A. Clifton, 20 Aug 2025, Sample Selection Bias in Machine Learning for Healthcare, https://arxiv.org/abs/2405.07841
  • Ilja Kuzborskij, Yasin Abbasi Yadkori, 20 Aug 2025, Low-rank bias, weight decay, and model merging in neural networks, https://arxiv.org/abs/2502.17340
  • Haodi Zhong, Liuxin Zou, Di Wang, Bo Wang, Zhenxing Niu, Quan Wang, 21 Aug 2025, EvoFormer: Learning Dynamic Graph-Level Representations with Structural and Temporal Bias Correction, https://arxiv.org/abs/2508.15378
  • Cheng Wang, Gelei Deng, Xianglin Yang, Han Qiu, Tianwei Zhang, 21 Aug 2025, When Audio and Text Disagree: Revealing Text Bias in Large Audio-Language Models, https://arxiv.org/abs/2508.15407
  • Tuhina Tripathi, Manya Wadhwa, Greg Durrett, Scott Niekum, 21 Aug 2025, Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation, https://arxiv.org/abs/2504.14716
  • Saumya Roy, 13 Aug 2025, Persuasiveness and Bias in LLM: Investigating the Impact of Persuasiveness and Reinforcement of Bias in Language Models, https://arxiv.org/abs/2508.15798
  • Xu Pan, Jingxuan Fan, Zidi Xiong, Ely Hahami, Jorin Overwiening, Ziqian Xie, 16 Aug 2025, User-Assistant Bias in LLMs, https://arxiv.org/abs/2508.15815
  • Srikant Panda, Vishnu Hari, Kalpana Panda, Amit Agarwal, Hitesh Laxmichand Patel, 18 Aug 2025, Who's Asking? Investigating Bias Through the Lens of Disability Framed Queries in LLMs, https://arxiv.org/abs/2508.15831
  • Tom Jacobs, Chao Zhou, Rebekka Burkholz, 22 Aug 2025, Mirror, Mirror of the Flow: How Does Regularization Shape Implicit Bias?, https://arxiv.org/abs/2504.12883
  • Gousia Habib, Tausifa Jan Saleem, Ishfaq Ahmad Malik, Brejesh Lall, 21 Aug 2025, LIB-KD: Teaching Inductive Bias for Efficient Vision Transformer Distillation and Compression, https://arxiv.org/abs/2310.00369
  • Shir Bernstein, David Beste, Daniel Ayzenshteyn, Lea Schonherr, Yisroel Mirsky, 24 Aug 2025, Trust Me, I Know This Function: Hijacking LLM Static Analysis using Bias, https://arxiv.org/abs/2508.17361
  • Pooja S. B. Rao and Laxminarayen Nagarajan Venkatesan and Mauro Cherubini and Dinesh Babu Jayagopi, 21 Aug 2025, Invisible Filters: Cultural Bias in Hiring Evaluations Using Large Language Models, https://arxiv.org/abs/2508.16673
  • Viacheslav Yusupov, Danil Maksimov, Ameliia Alaeva, Tatiana Zaitceva, Antipina Anna, Anna Vasileva, Chenlin Liu, Rayuth Chheng, Danil Sazanakov, Andrey Chetvergov, Alina Ermilova, Egor Shvetsov, 23 Aug 2025, Token Homogenization under Positional Bias, https://arxiv.org/abs/2508.17126
  • Kyra Wilson, Sourojit Ghosh, Aylin Caliskan, 24 Aug 2025, Bias Amplification in Stable Diffusion's Representation of Stigma Through Skin Tones and Their Homogeneity, https://arxiv.org/abs/2508.17465
  • Xuan-Bac Nguyen, Thanh-Dat Truong, Pawan Sinha, Khoa Luu, 25 Aug 2025, BRAIN: Bias-Mitigation Continual Learning Approach to Vision-Brain Understanding, https://arxiv.org/abs/2508.18187
  • Federico Marcuzzi, Xuefei Ning, Roy Schwartz, and Iryna Gurevych, 25 Aug 2025, How Quantization Shapes Bias in Large Language Models, https://arxiv.org/abs/2508.18088
  • Emanuele Zangrando, Piero Deidda, Simone Brugiapaglia, Nicola Guglielmi, Francesco Tudisco, 23 Aug 2025, Provable Emergence of Deep Neural Collapse and Low-Rank Bias in $L^2$-Regularized Nonlinear Networks, https://arxiv.org/abs/2402.03991
  • Jihwan Oh, Minchan Jeong, Jongwoo Ko, Se-Young Yun, 24 Aug 2025, Understanding Bias Reinforcement in LLM Agents Debate, https://arxiv.org/abs/2503.16814
  • Jiancong Xiao, Ziniu Li, Xingyu Xie, Emily Getzen, Cong Fang, Qi Long, Weijie J. Su, 25 Aug 2025, On the Algorithmic Bias of Aligning Large Language Models with RLHF: Preference Collapse and Matching Regularization, https://arxiv.org/abs/2405.16455
  • Paul Scherer, Andreas Kirsch, Jake P. Taylor-King, 4 Sep 2025, When three experiments are better than two: Avoiding intractable correlated aleatoric uncertainty by leveraging a novel bias--variance tradeoff, https://arxiv.org/abs/2509.04363
  • Joseph Jackson, Georgiy Lapin, Jeremy E. Thompson, 4 Sep 2025, Gravity Well Echo Chamber Modeling With An LLM-Based Confirmation Bias Model, https://arxiv.org/abs/2509.03832
  • Zhengyu Hu, Linxin Song, Jieyu Zhang, Zheyuan Xiao, Tianfu Wang, Zhengyu Chen, Nicholas Jing Yuan, Jianxun Lian, Kaize Ding, Hui Xiong, 4 Sep 2025, Explaining Length Bias in LLM-Based Preference Evaluations, https://arxiv.org/abs/2407.01085
  • Andrii Dzhoha, Katya Mirylenka, Egor Malykh, Marco-Andrea Buchmann, Francesca Catino, 3 Sep 2025, Short-Form Video Recommendations with Multimodal Embeddings: Addressing Cold-Start and Bias Challenges, https://arxiv.org/abs/2507.19346
  • Junyu Yan, Feng Chen, Yuyang Xue, Yuning Du, Konstantinos Vilouras, Sotirios A. Tsaftaris, Steven McDonagh, 4 Sep 2025, SWiFT: Soft-Mask Weight Fine-tuning for Bias Mitigation, https://arxiv.org/abs/2508.18826
  • Yifan Chen, Xiaoou Cheng, Jonathan Niles-Weed, Jonathan Weare, 3 Sep 2025, Convergence of Unadjusted Langevin in High Dimensions: Delocalization of Bias, https://arxiv.org/abs/2408.13115
  • Martha O. Dimgba, Sharon Oba, Ameeta Agrawal, Philippe J. Giabbanelli, 3 Sep 2025, Mitigation of Gender and Ethnicity Bias in AI-Generated Stories through Model Explanations, https://arxiv.org/abs/2509.04515
  • Karanbir Singh, Deepak Muppiri, William Ngu, 26 Aug 2025, Bias Mitigation Agent: Optimizing Source Selection for Fair and Balanced Knowledge Retrieval, https://arxiv.org/abs/2508.18724
  • Jay L. Cunningham, Adinawa Adjagbodjou, Jeffrey Basoah, Jainaba Jawara, Kowe Kadoma, Aaleyah Lewis, 20 Aug 2025, Toward Responsible ASR for African American English Speakers: A Scoping Review of Bias and Equity in Speech Technology, https://arxiv.org/abs/2508.18288
  • Kwonyoung Kim, Jungin Park, Jiyoung Lee, Dongbo Min, Kwanghoon Sohn, 26 Aug 2025, PointFix: Learning to Fix Domain Bias for Robust Online Stereo Adaptation, https://arxiv.org/abs/2207.13340
  • Sheryl Mathew and N Harshit, 27 Aug 2025, Counterfactual Reward Model Training for Bias Mitigation in Multimodal Reinforcement Learning, https://arxiv.org/abs/2508.19567
  • Md Abdullah Al Mamun, Ihsen Alouani, Nael Abu-Ghazaleh, 28 Aug 2025, Poison Once, Refuse Forever: Weaponizing Alignment for Injecting Bias in LLMs, https://arxiv.org/abs/2508.20333
  • Ruben Solozabal, Velibor Bojkovic, Hilal AlQuabeh, Kentaro Inui, Martin Tak\'a\v{c}, 28 Aug 2025, Uncovering the Spectral Bias in Diagonal State Space Models, https://arxiv.org/abs/2508.20441
  • Farhad Abtahi, Mehdi Astaraki, Fernando Seoane, 29 Aug 2025, Leveraging Imperfection with MEDLEY A Multi-Model Approach Harnessing Bias in Medical AI, https://arxiv.org/abs/2508.21648
  • Muskan Saraf, Sajjad Rezvani Boroujeni, Justin Beaudry, Hossein Abedi, Tom Bush, 28 Aug 2025, Quantifying Label-Induced Bias in Large Language Model Self- and Cross-Evaluations, https://arxiv.org/abs/2508.21164
  • Liulu He, Shenli Zheng, Karwei Sun, Yijiang Liu, Yufei Zhao, Chongkang Tan, Huanrui Yang, Yuan Du, Li Du, 29 Aug 2025, BASE-Q: Bias and Asymmetric Scaling Enhanced Rotational Quantization for Large Language Models, https://arxiv.org/abs/2506.15689
  • Lucas Mansilla, Rodrigo Echeveste, Camila Gonzalez, Diego H. Milone, Enzo Ferrante, 1 Sep 2025, BM-CL: Bias Mitigation through the lens of Continual Learning, https://arxiv.org/abs/2509.01730
  • Sanjeeevan Selvaganapathy and Mehwish Nasim, 31 Aug 2025, Confident, Calibrated, or Complicit: Probing the Trade-offs between Safety Alignment and Ideological Bias in Language Models in Detecting Hate Speech, https://arxiv.org/abs/2509.00673
  • Theodor Stoecker, Samed Bayer, and Ingo Weber, 28 Aug 2025, Bias Mitigation for AI-Feedback Loops in Recommender Systems: A Systematic Literature Review and Taxonomy, https://arxiv.org/abs/2509.00109
  • Chen Zheng, Zhenyu Zhao, 29 Aug 2025, Algorithm Adaptation Bias in Recommendation System Online Experiments, https://arxiv.org/abs/2509.00199
  • Ryan Franks, Alexey Miroshnikov, Konstandinos Kotsiopoulos, 2 Sep 2025, Explainable post-training bias mitigation with distribution-based fairness metrics, https://arxiv.org/abs/2504.01223
  • Abhishek Pasula and Deepak N. Subramani, 2 Sep 2025, Global Climate Model Bias Correction Using Deep Learning, https://arxiv.org/abs/2504.19145
  • Serra Aksoy, 3 Sep 2025, Systematic Evaluation of Attribution Methods: Eliminating Threshold Bias and Revealing Method-Dependent Performance Patterns, https://arxiv.org/abs/2509.03176
  • Alissa A. Valentine, Lauren A. Lepow, Lili Chan, Alexander W. Charney, Isotta Landi, 2 Sep 2025, Quantifying Clinician Bias and its Effects on Schizophrenia Diagnosis in the Emergency Department of the Mount Sinai Health System, https://arxiv.org/abs/2509.02651
  • Jessica M. Lundin, Ada Zhang, Nihal Karim, Hamza Louzan, Victor Wei, David Adelani, Cody Carroll, 5 Sep 2025, The Token Tax: Systematic Bias in Multilingual Tokenization, https://arxiv.org/abs/2509.05486
  • Jinrui Yang, Xudong Han, Timothy Baldwin, 7 Sep 2025, Benchmarking Gender and Political Bias in Large Language Models, https://arxiv.org/abs/2509.06164
  • Jinrui Yang, Fan Jiang, Timothy Baldwin, 7 Sep 2025, Language Bias in Information Retrieval: The Nature of the Beast and Mitigation Methods, https://arxiv.org/abs/2509.06195
  • Vincent C. Brockers, David A. Ehrlich, Viola Priesemann, 8 Sep 2025, Disentangling Interaction and Bias Effects in Opinion Dynamics of Large Language Models, https://arxiv.org/abs/2509.06858
  • Jinrui Yang, Timothy Baldwin, Trevor Cohn, 3 Nov 2023, Multi-EuP: The Multilingual European Parliament Dataset for Analysis of Bias in Information Retrieval, https://arxiv.org/abs/2311.01870
  • Viacheslav Sinii, Alexey Gorbatovski, Artem Cherepanov, Boris Shaposhnikov, Nikita Balagansky, Daniil Gavrilov, 8 Sep 2025, Steering LLM Reasoning Through Bias-Only Adaptation, https://arxiv.org/abs/2505.18706
  • Rushia Harada, Yuken Kimura, Keito Inoshita, 7 Sep 2025, Role-Playing LLM-Based Multi-Agent Support Framework for Detecting and Addressing Family Communication Bias, https://arxiv.org/abs/2507.11210
  • Amnon Balanov, Tamir Bendory, and Wasim Huleihel, 7 Sep 2025, Confirmation Bias in Gaussian Mixture Models, https://arxiv.org/abs/2408.09718
  • Qihu Xie, Yuan Li, and Yi Kang, 9 Sep 2025, SBS: Enhancing Parameter-Efficiency of Neural Representations for Neural Networks via Spectral Bias Suppression, https://arxiv.org/abs/2509.07373
  • Juan Manuel Contreras, 8 Sep 2025, Automated Evaluation of Gender Bias Across 13 Large Multimodal Models, https://arxiv.org/abs/2509.07050
  • Sai Siddhartha Chary Aylapuram, Veeraraju Elluru, Shivang Agarwal, 9 Sep 2025, Bias-Aware Machine Unlearning: Towards Fairer Vision Models via Controllable Forgetting, https://arxiv.org/abs/2509.07456
  • Camilo Chac\'on Sartori, Mart\'in Isla Pino, Pedro Pinacho-Davidson, Christian Blum, 5 Sep 2025, LLM-Based Instance-Driven Heuristic Bias In the Context of a Biased Random Key Genetic Algorithm, https://arxiv.org/abs/2509.09707
  • Zahraa Al Sahili, Ioannis Patras, Matthew Purver, 11 Sep 2025, Data Matters Most: Auditing Social Bias in Contrastive Vision Language Models, https://arxiv.org/abs/2501.13223
  • Zahraa Al Sahili, Ioannis Patras, Matthew Purver, 11 Sep 2025, Breaking Language Barriers or Reinforcing Bias? A Study of Gender and Racial Disparities in Multilingual Contrastive Vision Language Models, https://arxiv.org/abs/2505.14160
  • Baichuan Huang, Ananth Balashankar, Amir Aminifar, 19 Sep 2025, BEFT: Bias-Efficient Fine-Tuning of Language Models, https://arxiv.org/abs/2509.15974
  • Shuo Wang and Renhao Li and Xi Chen and Yulin Yuan and Derek F. Wong and Min Yang, 18 Sep 2025, Exploring the Impact of Personality Traits on LLM Bias and Toxicity, https://arxiv.org/abs/2502.12566
  • Nikolaos Tsilivis, Eitan Gronich, Gal Vardi, Julia Kempe, 19 Sep 2025, Flavors of Margin: Implicit Bias of Steepest Descent in Homogeneous Neural Networks, https://arxiv.org/abs/2410.22069
  • Avinash Madasu, Vasudev Lal, Phillip Howard, 19 Sep 2025, Pruning the Paradox: How CLIP's Most Informative Heads Enhance Performance While Amplifying Bias, https://arxiv.org/abs/2503.11103
  • Xiaoguang Chang, Teng Wang and Changyin Sun, 13 Sep 2025, A Modern Look at Simplicity Bias in Image Classification Tasks, https://arxiv.org/abs/2509.12265
  • Paul Kr\"oger, Emilio Barkett, 16 Sep 2025, Don't Change My View: Ideological Bias Auditing in Large Language Models, https://arxiv.org/abs/2509.12652
  • Maximus Powers, Shaina Raza, Alex Chang, Rehana Riaz, Umang Mavani, Harshitha Reddy Jonala, Ansh Tiwari, Hua Wei, 15 Sep 2025, Responsible AI in NLP: GUS-Net Span-Level Bias Detection Dataset and Benchmark for Generalizations, Unfairness, and Stereotypes, https://arxiv.org/abs/2410.08388
  • Robin Narsingh Ranabhat, Longwei Wang, Amit Kumar Patel, KC santosh, 14 Sep 2025, Promoting Shape Bias in CNNs: Frequency-Based and Contrastive Regularization for Corruption Robustness, https://arxiv.org/abs/2509.11355
  • Amy Rafferty, Rishi Ramaesh, Ajitha Rajan, 18 Sep 2025, Limitations of Public Chest Radiography Datasets for Artificial Intelligence: Label Quality, Domain Shift, Bias and Evaluation Challenges, https://arxiv.org/abs/2509.15107
  • Kiana Kiashemshaki, Mohammad Jalili Torkamani, Negin Mahmoudi, Meysam Shirdel Bilehsavar, 17 Sep 2025, Simulating a Bias Mitigation Scenario in Large Language Models, https://arxiv.org/abs/2509.14438
  • Chiyu Ma, Enpei Zhang, Yilun Zhao, Wenjun Liu, Yaning Jia, Peijun Qing, Lin Shi, Arman Cohan, Yujun Yan, Soroush Vosoughi, 17 Sep 2025, Judging with Many Minds: Do More Perspectives Mean Less Prejudice? On Bias Amplifications and Resistance in Multi-Agent Based LLM-as-Judge, https://arxiv.org/abs/2505.19477
  • Zoya Hammad, Nii Longdon Sowah, 7 Sep 2025, Evaluating and comparing gender bias across four text-to-image models, https://arxiv.org/abs/2509.08004
  • Nivedha Sivakumar, Natalie Mackraz, Samira Khorshidi, Krishna Patel, Barry-John Theobald, Luca Zappella, Nicholas Apostoloff, 9 Sep 2025, Bias after Prompting: Persistent Discrimination in Large Language Models, https://arxiv.org/abs/2509.08146
  • Daniel Lacker and Fuzhong Zhou, 10 Sep 2025, A hierarchical entropy method for the delocalization of bias in high-dimensional Langevin Monte Carlo, https://arxiv.org/abs/2509.08619
  • Ji Zhang, Xu Luo, Lianli Gao, Difan Zou, Hengtao Shen, Jingkuan Song, 10 Sep 2025, From Channel Bias to Feature Redundancy: Uncovering the "Less is More" Principle in Few-Shot Learning, https://arxiv.org/abs/2310.03843
  • Xuan Liu, Haoyang Shang, Haojian Jin, 16 Sep 2025, Programmable Cognitive Bias in Social Agents, https://arxiv.org/abs/2509.13588
  • Sai Suresh Marchala Vasu, Ivaxi Sheth, Hui-Po Wang, Ruta Binkyte, Mario Fritz, 16 Sep 2025, Justice in Judgment: Unveiling (Hidden) Bias in LLM-assisted Peer Reviews, https://arxiv.org/abs/2509.13400
  • Dingwei Zhang, Dong Zhang, Jinhui Tang, 17 Sep 2025, Mitigating Query Selection Bias in Referring Video Object Segmentation, https://arxiv.org/abs/2509.13722
  • Mohsinul Kabir, Tasfia Tahsin, Sophia Ananiadou, 17 Sep 2025, From n-gram to Attention: How Model Architectures Learn and Propagate Bias in Language Modeling, https://arxiv.org/abs/2505.12381
  • Leroy Z. Wang, 21 Sep 2025, Uncovering Implicit Bias in Large Language Models with Concept Learning Dataset, https://arxiv.org/abs/2510.01219
  • Shree Harsha Bokkahalli Satish, Gustav Eje Henter, \'Eva Sz\'ekely, 24 Sep 2025, Do Bias Benchmarks Generalise? Evidence from Voice-based Evaluation of Gender Bias in SpeechLLMs, https://arxiv.org/abs/2510.01254
  • Adithya Rajan, Xiaoyu Liu, Prateek Verma, Vibhu Arora, 2 Oct 2025, Synthetic Prefixes to Mitigate Bias in Real-Time Neural Query Autocomplete, https://arxiv.org/abs/2510.01574
  • Sergej Kucenko, Nathaniel Dennler, Fengxiang He, 2 Oct 2025, The Current State of AI Bias Bounties: An Overview of Existing Programmes and Research, https://arxiv.org/abs/2510.02036
  • Ahmet Solak, Florian Gr\"otschla, Luca A. Lanzend\"orfer, Roger Wattenhofer, 2 Oct 2025, Bias beyond Borders: Global Inequalities in AI-Generated Music, https://arxiv.org/abs/2510.01963
  • Suryaansh Jain, Umair Z. Ahmed, Shubham Sahai, Ben Leong, 13 Oct 2025, Beyond Consensus: Mitigating the Agreeableness Bias in LLM Judge Evaluations, https://arxiv.org/abs/2510.11822
  • Jiaxin Gao, Chen Chen, Yanwen Jia, Xueluan Gong, Kwok-Yan Lam, Qian Wang, 14 Oct 2025, Evaluating and Mitigating LLM-as-a-judge Bias in Communication Systems, https://arxiv.org/abs/2510.12462
  • Bianca Raimondi, Daniela Dalbagno, Maurizio Gabbrielli, 14 Oct 2025, Analysing Moral Bias in Finetuned LLMs through Mechanistic Interpretability, https://arxiv.org/abs/2510.12229
  • Hailay Kidu Teklehaymanot, Wolfgang Nejdl, 14 Oct 2025, Tokenization Disparities as Infrastructure Bias: How Subword Systems Create Inequities in LLM Access and Efficiency, https://arxiv.org/abs/2510.12389
  • Suyash Fulay, Jocelyn Zhu, Michiel Bakker, 14 Oct 2025, From Delegates to Trustees: How Optimizing for Long-Term Interests Shapes Bias and Alignment in LLM, https://arxiv.org/abs/2510.12689
  • Thierry Blankenstein, Jialin Yu, Zixuan Li, Vassilis Plachouras, Sunando Sengupta, Philip Torr, Yarin Gal, Alasdair Paren, Adel Bibi, 30 Sep 2025, BiasBusters: Uncovering and Mitigating Tool Selection Bias in Large Language Models, https://arxiv.org/abs/2510.00307
  • Franck Vandewiele, Remi Synave, Samuel Delepoulle, Remi Cozot, 27 Sep 2025, Beyond the Prompt: Gender Bias in Text-to-Image Models, with a Case Study on Hospital Professions, https://arxiv.org/abs/2510.00045
  • Xin Xu, Xunzhi He, Churan Zhi, Ruizhe Chen, Julian McAuley, Zexue He, 30 Sep 2025, BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses, https://arxiv.org/abs/2510.00232
  • Wang Zhang, Huaqiu Li, Xiaowan Hu, Tao Jiang, Zikang Chen, Haoqian Wang, 1 Oct 2025, Measuring and Controlling the Spectral Bias for Self-Supervised Image Denoising, https://arxiv.org/abs/2510.00454
  • Shaina Raza, Caesar Saleh, Azib Farooq, Emrul Hasan, Franklin Ogidi, Maximus Powers, Veronica Chatrath, Marcelo Lotif, Karanpal Sekhon, Roya Javadi, Haad Zahid, Anam Zahid, Vahid Reza Khazaie, Zhenyu Yu, 1 Oct 2025, ViLBias: Detecting and Reasoning about Bias in Multimodal Content, https://arxiv.org/abs/2412.17052
  • Jeongyeon Hwang, Sangdon Park, Jungseul Ok, 1 Oct 2025, LLM Watermark Evasion via Bias Inversion, https://arxiv.org/abs/2509.23019
  • Sirui Wu, Daijin Yang, 9 Sep 2025, Automated Item Neutralization for Non-Cognitive Scales: A Large Language Model Approach to Reducing Social-Desirability Bias, https://arxiv.org/abs/2509.19314
  • Tom Heskes, 23 Sep 2025, Bias-variance decompositions: the exclusive privilege of Bregman divergences, https://arxiv.org/abs/2501.18581
  • Adela DePavia, Vasileios Charisopoulos, and Rebecca Willett, 27 Oct 2025, How do simple rotations affect the implicit bias of Adam?, https://arxiv.org/abs/2510.23804
  • Kaveh Eskandari Miandoab, Mahammed Kamruzzaman, Arshia Gharooni, Gene Louis Kim, Vasanth Sarathy, Ninareh Mehrabi, 27 Oct 2025, Breaking the Benchmark: Revealing LLM Bias via Minimal Contextual Augmentation, https://arxiv.org/abs/2510.23921
  • Anna Arias-Duart, Maria Eugenia Cardello, Atia Cort\'es, 23 Oct 2025, Bias by Design? How Data Practices Shape Fairness in AI Healthcare Systems, https://arxiv.org/abs/2510.20332
  • L. Elisa Celis, Lingxiao Huang, Milind Sohoni, Nisheeth K. Vishnoi, 23 Oct 2025, Strategic Costs of Perceived Bias in Fair Selection, https://arxiv.org/abs/2510.20606
  • Jinhee Kim, Jae Jun An, Kang Eun Jeon, Jong Hwan Ko, 23 Oct 2025, Efficient Multi-bit Quantization Network Training via Weight Bias Correction and Bit-wise Coreset Sampling, https://arxiv.org/abs/2510.20673
  • SeongKu Kang, Jianxun Lian, Dongha Lee, Wonbin Kweon, Sanghwan Jang, Jaehyun Lee, Jindong Wang, Xing Xie, Hwanjo Yu, 17 Oct 2025, BPL: Bias-adaptive Preference Distillation Learning for Recommender System, https://arxiv.org/abs/2510.16076
  • Farjana Yesmin, 17 Oct 2025, Data-Driven Analysis of Intersectional Bias in Image Classification: A Framework with Bias-Weighted Augmentation, https://arxiv.org/abs/2510.16072
  • Yiyang Huang, Liang Shi, Yitian Zhang, Yi Xu, Yun Fu, 18 Oct 2025, SHIELD: Suppressing Hallucinations In LVLM Encoders via Bias and Vulnerability Defense, https://arxiv.org/abs/2510.16596
  • Guoqing Luo, Iffat Maab, Lili Mou, Junichi Yamagishi, 20 Oct 2025, Investigating Thinking Behaviours of Reasoning-Based Language Models for Social Bias Mitigation, https://arxiv.org/abs/2510.17062
  • Mohd Ruhul Ameen, Akif Islam, Abu Saleh Musa Miah, Ayesha Siddiqua, and Jungpil Shin, 20 Oct 2025, How News Feels: Understanding Affective Bias in Multilingual Headlines for Human-Centered Media Design, https://arxiv.org/abs/2510.17252
  • Fran\c{c}ois Bachoc (LPP), J\'er\^ome Bolte (TSE-R), Ryan Boustany (TSE-R), Jean-Michel Loubes (IMT), 20 Oct 2025, When majority rules, minority loses: bias amplification of gradient descent, https://arxiv.org/abs/2505.13122
  • Yuval Golbari, Navve Wasserman, Gal Vardi, Michal Irani, 17 Oct 2025, ImpMIA: Leveraging Implicit Bias for Membership Inference Attack under Realistic Scenarios, https://arxiv.org/abs/2510.10625
  • Junjie Liu, Xi Luo, Sirong Wu, Gengchen Sun, and Yuhui Deng, 20 Oct 2025, Tracing Partisan Bias to Its Emotional Fingerprints: A Computational Approach to Mitigation, https://arxiv.org/abs/2501.01284
  • Zongqian Wu, Baoduo Xu, Tianyu Li, Zhu Sun, Xiaofeng Zhu, Lei Feng, 22 Sep 2025, Mitigating Strategy-Selection Bias in Reasoning for More Effective Test-Time Scaling, https://arxiv.org/abs/2509.17905
  • Lovely Yeswanth Panchumarthi, Saurabh Kataria, Yi Wu, Xiao Hu, Alex Fedorov, Hyunjung Gloria Kwak, 20 Sep 2025, FairTune: A Bias-Aware Fine-Tuning Framework Towards Fair Heart Rate Prediction from PPG, https://arxiv.org/abs/2509.16491
  • Wenjie Lin, Hange Liu, Xutao Mao, Yingying Zhuang, Jingwei Shi, Xudong Han, Tianyu Shi, Jinrui Yang, 18 Sep 2025, Gender and Political Bias in Large Language Models: A Demonstration Platform, https://arxiv.org/abs/2509.16264
  • Mariam Mahran, Katharina Simbeck, 22 Sep 2025, Investigating Bias: A Multilingual Pipeline for Generating, Solving, and Evaluating Math Problems with LLMs, https://arxiv.org/abs/2509.17701
  • 'Mina Arzaghi', 'Alireza Dehghanpour Farashah','Florian Carichon',' Golnoosh Farnadi', 19 Sep 2025, Intrinsic Meets Extrinsic Fairness: Assessing the Downstream Impact of Bias Mitigation in Large Language Models, https://arxiv.org/abs/2509.16462
  • Shivam Kumar, Haotian Xu, Carlos Misael Madrid Padilla, Yuehaw Khoo, Oscar Hernan Madrid Padilla, and Daren Wang, 22 Sep 2025, Bias-variance Tradeoff in Tensor Estimation, https://arxiv.org/abs/2509.17382
  • Zeyu Huang, Zihan Qiu, Zili Wang, Edoardo M. Ponti, Ivan Titov, 21 Sep 2025, Post-hoc Reward Calibration: A Case Study on Length Bias, https://arxiv.org/abs/2409.17407
  • Yue Xu, Chengyan Fu, Li Xiong, Sibei Yang, Wenjie Wang, 22 Sep 2025, Auto-Search and Refinement: An Automated Framework for Gender Bias Mitigation in Large Language Models, https://arxiv.org/abs/2502.11559
  • Xuyang Wu, Jinming Nian, Ting-Ruen Wei, Zhiqiang Tao, Hsin-Tai Wu, Yi Fang, 20 Sep 2025, Does Reasoning Introduce Bias? A Study of Social Bias Evaluation and Mitigation in LLM Reasoning, https://arxiv.org/abs/2502.15361
  • Emre Kavak, Tom Nuno Wolf, Christian Wachinger, 22 Sep 2025, DISCO: Mitigating Bias in Deep Learning with Conditional Distance Correlation, https://arxiv.org/abs/2506.11653
  • James Thiering, Tarun Sethupat Radha Krishna, Dylan Zelkin, Ashis Kumer Biswas, 24 Oct 2025, Automatic Assessment of Students' Classroom Engagement with Bias Mitigated Multi-task Model, https://arxiv.org/abs/2510.22057
  • Jan Simson, Alessandro Fabris, Cosima Fr\"ohner, Frauke Kreuter, Christoph Kern, 25 Oct 2025, Bias Begins with Data: The FairGround Corpus for Robust and Reproducible Research on Algorithmic Fairness, https://arxiv.org/abs/2510.22363
  • Mahsa Goodarzi and M. Abdullah Canbaz, 27 Sep 2025, Modeling Bias Evolution in Fashion Recommender Systems: A System Dynamics Approach, https://arxiv.org/abs/2510.21728
  • Xingjian Tao, Yiwei Wang, Yujun Cai, Yihong Luo, Jing Tang, 25 Oct 2025, Mitigating Coordinate Prediction Bias from Positional Encoding Failures, https://arxiv.org/abs/2510.22102
  • Poli Nemkova, Amrit Adhikari, Matthew Pearson, Vamsi Krishna Sadu, Mark V. Albert, 26 Oct 2025, Cross-Lingual Stability and Bias in Instruction-Tuned Language Models for Humanitarian NLP, https://arxiv.org/abs/2510.22823
  • Tingxu Han and Wei Song and Ziqi Ding and Ziming Li and Chunrong Fang and Yuekang Li and Dongfang Liu and Zhenyu Chen and Zhenting Wang, 25 Oct 2025, DiffHeads: Differential Analysis and Inference-Time Masking of Bias Heads in Large Language Models, https://arxiv.org/abs/2510.10142
  • Noor Islam S. Mohammad, 14 Oct 2025, A Multimodal XAI Framework for Trustworthy CNNs and Bias Detection in Deep Representation Learning, https://arxiv.org/abs/2510.12957
  • Robin Staab, Jasper Dekoninck, Maximilian Baader, Martin Vechev, 14 Oct 2025, Adaptive Generation of Bias-Eliciting Questions for LLMs, https://arxiv.org/abs/2510.12857
  • Sai Suhruth Reddy Karri, Yashwanth Sai Nallapuneni, Laxmi Narasimha Reddy Mallireddy, Gopichand G, 15 Oct 2025, LLM-Guided Synthetic Augmentation (LGSA) for Mitigating Bias in AI Systems, https://arxiv.org/abs/2510.13202
  • David Freire-Obreg\'on, Jos\'e Salas-C\'aceres, Javier Lorenzo-Navarro, Oliverio J. Santana, Daniel Hern\'andez-Sosa, and Modesto Castrill\'on-Santana, 15 Oct 2025, Modeling Cultural Bias in Facial Expression Recognition with Adaptive Agents, https://arxiv.org/abs/2510.13557
  • Yonatan Slutzky, Yotam Alexander, Noam Razin, Nadav Cohen, 15 Oct 2025, The Implicit Bias of Structured State Space Models Can Be Poisoned With Clean Labels, https://arxiv.org/abs/2410.10473
  • Rakesh Thakur, Shivaansh Kaushik, Gauri Chopra, Harsh Rohilla, 26 Sep 2025, TrueGradeAI: Retrieval-Augmented and Bias-Resistant AI for Transparent and Explainable Digital Assessments, https://arxiv.org/abs/2509.22516
  • Jacob B. Landsberg and Elizabeth A. Barnes, 26 Sep 2025, Forecasting the Future with Yesterday's Climate: Temperature Bias in AI Weather and Climate Models, https://arxiv.org/abs/2509.22359
  • Xiaocheng Zou, Shijin Duan, Charles Fleming, Gaowen Liu, Ramana Rao Kompella, Shaolei Ren, Xiaolin Xu, 26 Sep 2025, ConQuER: Modular Architectures for Control and Bias Mitigation in IQP Quantum Generative Models, https://arxiv.org/abs/2509.22551
  • Yessin Moakher, Malik Tiomoko, Cosme Louart, Zhenyu Liao, 26 Sep 2025, A Random Matrix Perspective of Echo State Networks: From Precise Bias--Variance Characterization to Optimal Regularization, https://arxiv.org/abs/2509.22011
  • Masahiro Kato, 26 Sep 2025, Direct Bias-Correction Term Estimation for Propensity Scores and Average Treatment Effect Estimation, https://arxiv.org/abs/2509.22122
  • Renfei Dang, Zhening Li, Shujian Huang, Jiajun Chen, 26 Sep 2025, The First Impression Problem: Internal Bias Triggers Overthinking in Reasoning Models, https://arxiv.org/abs/2505.16448
  • Francesco D'Amico, Dario Bocchi, Matteo Negri, 26 Sep 2025, Implicit bias produces neural scaling laws in learning curves, from perceptrons to deep networks, https://arxiv.org/abs/2505.13230
  • Qiyu Chen and Guozhang Chen, 26 Sep 2025, Aligning Inductive Bias for Data-Efficient Generalization in State Space Models, https://arxiv.org/abs/2509.20789
  • Mikhail Menschikov, Alexander Kharitonov, Maiia Kotyga, Vadim Porvatov, Anna Zhukovskaya, David Kagramanyan, Egor Shvetsov, Evgeny Burnaev, 26 Sep 2025, Beyond Early-Token Bias: Model-Specific and Language-Specific Position Effects in Multilingual LLMs, https://arxiv.org/abs/2505.16134
  • Lorenzo Pastori, Veronika Eyring, Mierk Schwabe, 8 Oct 2025, Fisher Information, Training and Bias in Fourier Regression Models, https://arxiv.org/abs/2510.06945
  • Shuofeng Zhang, Ard Louis, 8 Oct 2025, Closed-form $\ell_r$ norm scaling with data for overparameterized linear regression and diagonal linear networks under $\ell_p$ bias, https://arxiv.org/abs/2509.21181
  • Yihao Wu, Tianrui Wang, Yizhou Peng, Yi-Wen Chao, Xuyi Zhuang, Xinsheng Wang, Shunshun Yin, Ziyang Ma, 27 Sep 2025, Evaluating Bias in Spoken Dialogue LLMs for Real-World Decisions and Recommendations, https://arxiv.org/abs/2510.02352
  • Matei-Iulian Cocu, R\u{a}zvan-Cosmin Cristia, Adrian Marius Dumitran, 28 Sep 2025, A Cross-Lingual Analysis of Bias in Large Language Models Using Romanian History, https://arxiv.org/abs/2510.02362
  • Qin-Cheng Zheng, Shao-Qun Zhang, Shen-Huan Lyu, Yuan Jiang, Zhi-Hua Zhou, 3 Oct 2025, Theoretical Investigation on Inductive Bias of Isolation Forest, https://arxiv.org/abs/2505.12825
  • Tadesse K Bahiru, Natnael Tilahun Sinshaw, Teshager Hailemariam Moges, and Dheeraj Kumar Singh, 17 Oct 2025, Auditing and Mitigating Bias in Gender Classification Algorithms: A Data-Centric Approach, https://arxiv.org/abs/2510.17873
  • Yoshinari Fujinuma, 21 Oct 2025, Contrastive Decoding Mitigates Score Range Bias in LLM-as-a-Judge, https://arxiv.org/abs/2510.18196
  • Haixiang Lan, Luofeng Liao, Adam N. Elmachtoub, Christian Kroer, Henry Lam, Haofeng Zhang, 21 Oct 2025, The Bias-Variance Tradeoff in Data-Driven Optimization: A Local Misspecification Perspective, https://arxiv.org/abs/2510.18215
  • Haixu Wu, Minghao Guo, Yuezhou Ma, Yuanxu Sun, Jianmin Wang, Wojciech Matusik, Mingsheng Long, 21 Oct 2025, FlashBias: Fast Computation of Attention with Bias, https://arxiv.org/abs/2505.12044
  • Olga Fink, Ismail Nejjar, Vinay Sharma, Keivan Faghih Niresi, Han Sun, Hao Dong, Chenghao Xu, Amaury Wei, Arthur Bizzi, Raffael Theiler, Yuan Tian, Leandro Von Krannichfeldt, Zhan Ma, Sergei Garmaev, Zepeng Zhang, Mengjie Zhao, 25 Sep 2025, From Physics to Machine Learning and Back: Part II - Learning and Observational Bias in PHM, https://arxiv.org/abs/2509.21207
  • Adrian Kuenzler and Stefan Schmid, 25 Sep 2025, Communication Bias in Large Language Models: A Regulatory Perspective, https://arxiv.org/abs/2509.21075
  • Yixin Wan, Xingrun Chen, Kai-Wei Chang, 25 Sep 2025, Which Cultural Lens Do Models Adopt? On Cultural Positioning Bias and Agentic Mitigation in LLMs, https://arxiv.org/abs/2509.21080
  • Hyejun Jeong, Shiqing Ma, Amir Houmansadr, 25 Sep 2025, Bias Similarity Measurement: A Black-Box Audit of Fairness Across LLMs, https://arxiv.org/abs/2410.12010
  • Zhangyu Wang, Nemin Wu, Qian Cao, Jiangnan Xia, Zeping Liu, Yiqun Xie, Akshay Nambi, Tanuja Ganu, Ni Lao, Ninghao Liu, Gengchen Mai, 27 Sep 2025, GeoBS: Information-Theoretic Quantification of Geographic Bias in AI Models, https://arxiv.org/abs/2509.23482
  • German M. Matilla and Jiri Nemecek and Illia Kryvoviaz and Jakub Marecek, 29 Sep 2025, humancompatible.detect: a Python Toolkit for Detecting Bias in AI Models, https://arxiv.org/abs/2509.24340
  • Sanxing Chen, Xiaoyin Chen, Yukun Huang, Roy Xie, Bhuwan Dhingra, 29 Sep 2025, When Greedy Wins: Emergent Exploitation Bias in Meta-Bandit LLM Training, https://arxiv.org/abs/2509.24923
  • Srikant Panda, Amit Agarwal, Hitesh Laxmichand Patel, 22 Sep 2025, AccessEval: Benchmarking Disability Bias in Large Language Models, https://arxiv.org/abs/2509.22703
  • Fabrizio Boncoraglio, Vittorio Erba, Emanuele Troiani, Florent Krzakala, Lenka Zdeborov\'a, 29 Sep 2025, Inductive Bias and Spectral Properties of Single-Head Attention in High Dimensions, https://arxiv.org/abs/2509.24914
  • Gengze Xu, Wei Yao, Ziqiao Wang, Yong Liu, 28 Sep 2025, On the Emergence of Weak-to-Strong Generalization: A Bias-Variance Perspective, https://arxiv.org/abs/2505.24313
  • Md Nakhla Rafi, Dong Jae Kim, Tse-Hsun Chen, Shaowei Wang, 26 Sep 2025, Order Matters! An Empirical Study on Large Language Models' Input Order Bias in Software Fault Localization, https://arxiv.org/abs/2412.18750
  • Messi H.J. Lee, Calvin K. Lai, 27 Sep 2025, Implicit Bias-Like Patterns in Reasoning Models, https://arxiv.org/abs/2503.11572
  • Emilio Barkett, Olivia Long, Madhavendra Thakur, 28 Sep 2025, Reasoning Isn't Enough: Examining Truth-Bias and Sycophancy in LLMs, https://arxiv.org/abs/2506.21561
  • Jiahao Yu, Haozhuang Liu, Yeqiu Yang, Lu Chen, Jian Wu, Yuning Jiang, Bo Zheng, 29 Sep 2025, TranSUN: A Preemptive Paradigm to Eradicate Retransformation Bias Intrinsically from Regression Models in Recommender Systems, https://arxiv.org/abs/2505.13881
  • Emma Kondrup, Anne Imouza, 16 Oct 2025, Dr. Bias: Social Disparities in AI-Powered Medical Guidance, https://arxiv.org/abs/2510.09162
  • Chenxiang Zhang, Alexander Theus, Damien Teney, Antonio Orvieto, Jun Pang, Sjouke Mauw, 6 Oct 2025, How does the optimizer implicitly bias the model merging loss landscape?, https://arxiv.org/abs/2510.04686
  • Mingkang Zhu, Xi Chen, Bei Yu, Hengshuang Zhao, Jiaya Jia, 6 Oct 2025, From Noisy Traces to Stable Gradients: Bias-Variance Optimized Preference Optimization for Aligning Large Reasoning Models, https://arxiv.org/abs/2510.05095
  • Toby Drinkall, 3 Oct 2025, Red Lines and Grey Zones in the Fog of War: Benchmarking Legal Risk, Moral Harm, and Regional Bias in Large Language Model Military Decision-Making, https://arxiv.org/abs/2510.03514
  • Santhosh KumarRavindran, 6 Oct 2025, Unified Threat Detection and Mitigation Framework (UTDMF): Combating Prompt Injection, Deception, and Bias in Enterprise-Scale Transformers, https://arxiv.org/abs/2510.04528
  • Abhi Chawla, David M. Bortz, Vanja Dukic, 3 Oct 2025, Bias and Coverage Properties of the WENDy-IRLS Algorithm, https://arxiv.org/abs/2510.03365
  • Leander Girrbach and Stephan Alaniz and Genevieve Smith and Trevor Darrell and Zeynep Akata, 4 Oct 2025, Person-Centric Annotations of LAION-400M: Auditing Bias and Its Transfer to Models, https://arxiv.org/abs/2510.03721
  • Ayan Majumdar, Feihao Chen, Jinghui Li, Xiaozhen Wang, 6 Oct 2025, Evaluating LLMs for Demographic-Targeted Social Bias Detection: A Comprehensive Benchmark Study, https://arxiv.org/abs/2510.04641
  • Connall Garrod and Jonathan P. Keating, 5 Oct 2025, The Persistence of Neural Collapse Despite Low-Rank Bias, https://arxiv.org/abs/2410.23169
  • Harsh Chaudhari, Jamie Hayes, Matthew Jagielski, Ilia Shumailov, Milad Nasr, Alina Oprea, 5 Oct 2025, Cascading Adversarial Bias from Injection to Distillation in Language Models, https://arxiv.org/abs/2505.24842
  • Xunlian Dai and Li Zhou and Benyou Wang and Haizhou Li, 6 Oct 2025, From Word to World: Evaluate and Mitigate Culture Bias in LLMs via Word Association Test, https://arxiv.org/abs/2505.18562
  • Hossein A. Rahmani, Varsha Ramineni, Emine Yilmaz, Nick Craswell, Bhaskar Mitra, 4 Oct 2025, Towards Understanding Bias in Synthetic Data for Evaluation, https://arxiv.org/abs/2506.10301
  • Chuyi Tan, Peiwen Yuan, Xinglin Wang, Yiwei Li, Shaoxiong Feng, Yueqi Zhang, Jiayi Shi, Ji Zhang, Boyuan Pan, Yao Hu, Kan Li, 10 Oct 2025, Diagnosing and Mitigating System Bias in Self-Rewarding RL, https://arxiv.org/abs/2510.08977
  • Rahul Nair, Bhanu Tokas and Hannah Kerner, 9 Oct 2025, Measuring directional bias amplification in image captions using predictability, https://arxiv.org/abs/2503.07878
  • Bhanu Tokas, Rahul Nair, Hannah Kerner, 9 Oct 2025, Making Bias Amplification in Balanced Datasets Directional and Interpretable, https://arxiv.org/abs/2412.11060
  • Zihao Fu, Ryan Brown, Shun Shao, Kai Rawal, Eoin Delaney, Chris Russell, 24 Oct 2025, FairImagen: Post-Processing for Bias Mitigation in Text-to-Image Models, https://arxiv.org/abs/2510.21363
  • Whie Jung, Dong Hoon Lee, Seunghoon Hong, 24 Oct 2025, Disentangled Representation Learning via Modular Compositional Bias, https://arxiv.org/abs/2510.21402
  • Vighnesh Subramaniam, David Mayo, Colin Conwell, Tomaso Poggio, Boris Katz, Brian Cheung, Andrei Barbu, 23 Oct 2025, Training the Untrainable: Introducing Inductive Bias via Representational Alignment, https://arxiv.org/abs/2410.20035
  • Massimiliano Ciranni, Vito Paolo Pastore, Roberto Di Via, Enzo Tartaglione, Francesca Odone, Vittorio Murino, 24 Oct 2025, Diffusing DeBias: Synthetic Bias Amplification for Model Debiasing, https://arxiv.org/abs/2502.09564
  • Bhavya Vasudeva, Jung Whan Lee, Vatsal Sharan, Mahdi Soltanolkotabi, 24 Oct 2025, The Rich and the Simple: On the Implicit Bias of Adam and SGD, https://arxiv.org/abs/2505.24022
  • Yanhao Jia and Ji Xie and S Jivaganesh and Hao Li and Xu Wu and Mengmi Zhang, 24 Oct 2025, Seeing Sound, Hearing Sight: Uncovering Modality Bias and Conflict of AI models in Sound Localization, https://arxiv.org/abs/2505.11217
  • Sudip Khadka and L.S. Paudel, 9 Oct 2025, A Multi-Component Reward Function with Policy Gradient for Automated Feature Selection with Dynamic Regularization and Bias Mitigation, https://arxiv.org/abs/2510.09705
  • Parsa Gooya, Reinel Sospedra-Alfonso, 10 Oct 2025, Probabilistic bias adjustment of seasonal predictions of Arctic Sea Ice Concentration, https://arxiv.org/abs/2510.09891
  • Prarthana P. Kartholy, Thandi M. Labor, Neil N. Panchal, Sean H. Wang, Hillary N. Owusu, 30 Sep 2025, Bias-Aware AI Chatbot for Engineering Advising at the University of Maryland A. James Clark School of Engineering, https://arxiv.org/abs/2510.09636
  • Sai Teja Erukude, 12 Oct 2025, Identifying bias in CNN image classification using image scrambling and transforms, https://arxiv.org/abs/2510.10383
  • Clemence Mottez, Louisa Fay, Maya Varma, Sophie Ostmeier, Curtis Langlotz, 12 Oct 2025, From Detection to Mitigation: Addressing Bias in Deep Learning Models for Chest X-Ray Diagnosis, https://arxiv.org/abs/2510.10822
  • Mahika Phutane, Hayoung Jung, Matthew Kim, Tanushree Mitra, Aditya Vashistha, 13 Oct 2025, ABLEIST: Intersectional Disability Bias in LLM-Generated Hiring Scenarios, https://arxiv.org/abs/2510.10998
  • Hyeong Kyu Choi, Xiaojin Zhu, Yixuan Li, 8 Oct 2025, Measuring and Mitigating Identity Bias in Multi-Agent Debate via Anonymization, https://arxiv.org/abs/2510.07517
  • Konrad L\"ohr, Shuzhou Yuan, Michael F\"arber, 9 Oct 2025, The Hidden Bias: A Study on Explicit and Implicit Political Stereotypes in Large Language Models, https://arxiv.org/abs/2510.08236
  • Tsuyoshi Okita, 9 Oct 2025, Bridging the Physics-Data Gap with FNO-Guided Conditional Flow Matching: Designing Inductive Bias through Hierarchical Physical Constraints, https://arxiv.org/abs/2510.08295
  • Xinyi Liu, Weiguang Wang, Hangfeng He, 9 Oct 2025, The Role of Model Confidence on Bias Effects in Measured Uncertainties for Vision-Language Models, https://arxiv.org/abs/2506.16724
  • Francesco Sovrano, 23 Sep 2025, Can Global XAI Methods Reveal Injected Bias in LLMs? SHAP vs Rule Extraction vs RuleSHAP, https://arxiv.org/abs/2505.11189
  • Matheus Vinicius da Silva de Oliveira, Jonathan de Andrade Silva, Awdren de Lima Fontao, 30 Sep 2025, Fairness Testing in Retrieval-Augmented Generation: How Small Perturbations Reveal Bias in Small Language Models, https://arxiv.org/abs/2509.26584
  • Yuki Takezawa, Anastasia Koloskova, Xiaowen Jiang, Sebastian U. Stich, 30 Sep 2025, FedMuon: Federated Learning with Bias-corrected LMO-based Optimization, https://arxiv.org/abs/2509.26337
  • Wenda Xu, Sweta Agrawal, Vil\'em Zouhar, Markus Freitag, Daniel Deutsch, 30 Sep 2025, Deconstructing Self-Bias in LLM-generated Translation Benchmarks, https://arxiv.org/abs/2509.26600
  • Matthias K\"ummerer, Harneet Singh Khanuja, Matthias Bethge, 30 Sep 2025, Modeling Saliency Dataset Bias, https://arxiv.org/abs/2505.10169
  • Armin Gerami, Ramani Duraiswami, 1 Oct 2025, Auditing Algorithmic Bias in Transformer-Based Trading, https://arxiv.org/abs/2510.05140
  • Fabrizio Dimino, Krati Saxena, Bhaskarjit Sarmah, Stefano Pasquali, 7 Oct 2025, Uncovering Representation Bias for Investment Decisions in Open-Source Large Language Models, https://arxiv.org/abs/2510.05702
  • Ana Ozaki, Roberto Confalonieri, Ricardo Guimar\~aes, Anders Imenes, 6 Oct 2025, Extracting PAC Decision Trees from Black Box Binary Classifiers: The Gender Bias Case Study on BERT-based Language Models, https://arxiv.org/abs/2412.10513
  • Sara Altamirano, Arjan Vreeken, Sennay Ghebreab, 16 Oct 2025, Machine Learning and Public Health: Identifying and Mitigating Algorithmic Bias through a Systematic Review, https://arxiv.org/abs/2510.14669
  • Massimiliano Ciranni, Luca Molinaro, Carlo Alberto Barbano, Attilio Fiandrotti, Vittorio Murino, Vito Paolo Pastore, Enzo Tartaglione, 16 Oct 2025, Say My Name: a Model's Bias Discovery Framework, https://arxiv.org/abs/2408.09570
  • Riccardo Cantini, Alessio Orsino, Massimo Ruggiero, Domenico Talia, 16 Oct 2025, Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge, https://arxiv.org/abs/2504.07887

Toxicity

Toxicity is the LLM safety issue of ensuring that the AI does not give "toxic" answers to the user. There are many subtypes of this issue, such as ensuring that answers are appropriate, non-aggressive, non-disparaging, not insulting, and generally helpful. The overall tone of AI interactions should be positive rather than negative.

Research papers on LLM toxicity issues:

  • Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, Lina Yao, 6 Jul 2024, AI Safety in Generative AI Large Language Models: A Survey, https://arxiv.org/abs/2407.18369
  • Cem Dilmegani, Jan 10, 2024, The Future of Large Language Models in 2024, https://research.aimultiple.com/future-of-large-language-models/
  • Shayne Longpre, Stella Biderman, Alon Albalak, Hailey Schoelkopf, Daniel McDuff, Sayash Kapoor, Kevin Klyman, Kyle Lo, Gabriel Ilharco, Nay San, Maribeth Rauh, Aviya Skowron, Bertie Vidgen, Laura Weidinger, Arvind Narayanan, Victor Sanh, David Adelani, Percy Liang, Rishi Bommasani, Peter Henderson, Sasha Luccioni, Yacine Jernite, Luca Soldaini, 26 Jun 2024 (v2), The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources, https://arxiv.org/abs/2406.16746
  • Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Martín Santillán Cooper, Kieran Fraser, Giulio Zizzo, Muhammad Zaid Hameed, Mark Purcell, Michael Desmond, Qian Pan, Inge Vejsbjerg, Elizabeth M. Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R. Varshney, Prasanna Sattigeri, 10 Dec 2024, Granite Guardian, https://arxiv.org/abs/2412.07724 https://github.com/ibm-granite/granite-guardian (Open-sourcing of safety models with many capabilities.)
  • Jueon Park, Yein Park, Minju Song, Soyon Park, Donghyeon Lee, Seungheun Baek and Jaewoo Kang, 5 Aug 2025, CoTox: Chain-of-Thought-Based Molecular Toxicity Reasoning and Prediction, https://arxiv.org/abs/2508.03159
  • Axel Delaval, Shujian Yang, Haicheng Wang, Han Qiu, Jialiang Lu, 15 Aug 2025, ToxiFrench: Benchmarking and Enhancing Language Models via CoT Fine-Tuning for French Toxicity Detection, https://arxiv.org/abs/2508.11281
  • Han Zhang, Fengji Ma, Jiamin Su, Xinyue Yang, Lei Wang, Wen-Cai Ye, Li Liu, 4 Sep 2025, Quantum-Enhanced Multi-Task Learning with Learnable Weighting for Pharmacokinetic and Toxicity Prediction, https://arxiv.org/abs/2509.04601
  • Guillermo Villate-Castillo, Javier Del Ser, Borja Sanz, 29 Aug 2025, A Collaborative Content Moderation Framework for Toxicity Detection based on Conformalized Estimates of Annotation Disagreement, https://arxiv.org/abs/2411.04090
  • Naquee Rizwan, Nayandeep Deb, Sarthak Roy, Vishwajeet Singh Solanki, Kiran Garimella, Animesh Mukherjee, 29 Aug 2025, Toxicity Begets Toxicity: Unraveling Conversational Chains in Political Podcasts, https://arxiv.org/abs/2501.12640
  • Akriti Verma, Shama Islam, Valeh Moghaddam and Adnan Anwar, 31 Aug 2025, Queuing for Civility: Regulating Emotions and Reducing Toxicity in Digital Discourse, https://arxiv.org/abs/2509.00696
  • Ruoxi Cheng, Yizhong Ding, Shuirong Cao, Ranjie Duan, Xiaoshuang Jia, Shaowei Yuan, Simeng Qin, Zhiqiang Wang, Xiaojun Jia, 30 Aug 2025, PBI-Attack: Prior-Guided Bimodal Interactive Black-Box Jailbreak Attack for Toxicity Maximization, https://arxiv.org/abs/2412.05892
  • Shuo Wang and Renhao Li and Xi Chen and Yulin Yuan and Derek F. Wong and Min Yang, 18 Sep 2025, Exploring the Impact of Personality Traits on LLM Bias and Toxicity, https://arxiv.org/abs/2502.12566
  • Sudeshna Jana, Manjira Sinha and Tirthankar Dasgupta, 14 Sep 2025, Decoding Plastic Toxicity: An Intelligent Framework for Conflict-Aware Relational Metapath Extraction from Scientific Abstracts, https://arxiv.org/abs/2509.11330
  • Huy Nghiem, Advik Sachdeva, Hal Daum\'e III, 18 Sep 2025, SMARTER: A Data-efficient Framework to Improve Toxicity Detection with Explanation via Self-augmenting Large Language Models, https://arxiv.org/abs/2509.15174
  • Gautam Kishore Shahi, Tim A. Majchrzak, 14 Sep 2025, Defining, Understanding, and Detecting Online Toxicity: Challenges and Machine Learning Approaches, https://arxiv.org/abs/2509.14264
  • Sergey Berezin, Reza Farahbakhsh, Noel Crespi, 24 Sep 2025, Evading Toxicity Detection with ASCII-art: A Benchmark of Spatial Attacks on Moderation Systems, https://arxiv.org/abs/2409.18708
  • Eduard Popescu and Adrian Groza and Andreea Cernat, 26 Oct 2025, Combining Deep Learning and Explainable AI for Toxicity Prediction of Chemical Compounds, https://arxiv.org/abs/2510.22572
  • Yehor Tereshchenko, Mika H\"am\"al\"ainen, 20 Oct 2025, Efficient Toxicity Detection in Gaming Chats: A Comparative Study of Embeddings, Fine-Tuned Transformers and LLMs, https://arxiv.org/abs/2510.17924
  • Zhiqiang Kou, Junyang Chen, Xin-Qiang Cai, Ming-Kun Xie, Biao Liu, Changwei Wang, Lei Feng, Yuheng Jia, Gang Niu, Masashi Sugiyama and Xin Geng, 16 Oct 2025, Rethinking Toxicity Evaluation in Large Language Models: A Multi-Label Perspective, https://arxiv.org/abs/2510.15007
  • Jan Wehner, Mario Fritz, 24 Oct 2025, Probe-based Fine-tuning for Reducing Toxicity, https://arxiv.org/abs/2510.21531
  • Simone Corbo, Luca Bancale, Valeria De Gennaro, Livia Lestingi, Vincenzo Scotti, Matteo Camilli, 24 Oct 2025, How Toxic Can You Get? Search-based Toxicity Testing for Large Language Models, https://arxiv.org/abs/2501.01741
  • Thomas Lautenschlager, Nils Friederich, Angelo Jovin Yamachui Sitcheu, Katja Nau, Ga\"elle Hayot, Thomas Dickmeis, Ralf Mikut, 9 Oct 2025, Self-Supervised Learning Strategies for a Platform to Test the Toxicity of New Chemicals and Materials, https://arxiv.org/abs/2510.07853
  • Hojun Cho, Donghu Kim, Soyoung Yang, Chan Lee, Hunjoo Lee, Jaegul Choo, 9 Oct 2025, Building Resource-Constrained Language Agents: A Korean Case Study on Chemical Toxicity Information, https://arxiv.org/abs/2503.17753
  • Smita Khapre, Melkamu Abay Mersha, Hassan Shakil, Jonali Baruah, Jugal Kalita, 29 Sep 2025, Toxicity in Online Platforms and AI Systems: A Survey of Needs, Challenges, Mitigations, and Future Directions, https://arxiv.org/abs/2509.25539

Ethics of Responsible AI Research

Ethical issues in AI research and related publication of results:

AI Alignment Research

Alignment is the study of how to ensure that AI engines are "aligned" with the goals and intent of humans.

  • J. Leike, J. Schulman, and J. Wu. OpenAI, August 2022. Our approach to alignment research. https://openai.com/blog/our-approach-to-alignment-research
  • OpenAI, July 2023, Introducing Superalignment, https://openai.com/blog/introducing-superalignment
  • V. Krakovna and R. Shah. 2023, Some high-level thoughts on the DeepMind alignment team’s strategy. https://www.alignmentforum.org/posts/a9SPcZ6GXAg9cNKdi/linkpost-some-high-level-thoughts-on-the-deepmind-alignment
  • J. Leike. Dec 2022, Why I’m optimistic about our alignment approach. https://aligned.substack.com/p/alignment-optimism
  • Nate Soares and Benja Fallenstein. Aligning superintelligence with human interests: A technical research agenda. Technical report, Machine Intelligence Research Institute, 2014. https://www.semanticscholar.org/paper/Aligning-Superintelligence-with-Human-Interests%3A-A-Soares-Fallenstein/d8033a314493c8df3791912272ac4b58d3a7b8c2
  • Jessica Taylor, Eliezer Yudkowsky, Patrick LaVictoire, and Andrew Critch. 2016. Alignment for advanced machine learning systems. Technical report, Machine Intelligence Research Institute, 2016. PDF: https://intelligence.org/files/AlignmentMachineLearning.pdf
  • Daniel Weld and Oren Etzioni. The first law of robotics (a call to arms). Proceedings of the AAAI Conference on Artificial Intelligence, 12, pages 1042–1047, 1994. https://aaai.org/papers/01042-the-first-law-of-robotics-a-call-to-arms/
  • Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, Mar 2022, Training language models to follow instructions with human feedback, https://arxiv.org/abs/2203.02155 (InstructGPT main paper from OpenAI in 2022.)
  • Ziniu Li1, Tian Xu, Yushun Zhang, Zhihang Lin, Yang Yu, Ruoyu Sun, Zhi-Quan Luo, 2024, ReMax: ASimple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models, https://openreview.net/pdf?id=Stn8hXkpe6
  • Aibek Bekbayev, Sungbae Chun, Yerzat Dulat, James Yamazaki, Aug 2023, The Poison of Alignment, https://arxiv.org/abs/2308.13449
  • Yotam Wolf, Noam Wies, Yoav Levine, and Amnon Shashua. Fundamental limitations of alignment in large language models. arXiv preprint arXiv:2304.11082, 2023. https://arxiv.org/abs/2304.11082
  • Renze Lou, Kai Zhang, Wenpeng Yin, 25 May 2024 (v8), Large Language Model Instruction Following: A Survey of Progresses and Challenges, https://arxiv.org/abs/2303.10475 Project: https://github.com/RenzeLou/awesome-instruction-learning
  • Alexandre Ramé, Nino Vieillard, Léonard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, Johan Ferret, 22 Jan 2024, WARM: On the Benefits of Weight Averaged Reward Models, https://arxiv.org/abs/2401.12187 (Uses multiple reward models to avoid problems with the LLM "hacking rewards" in unforeseen ways.)
  • NVIDIA, June 2024, Nemotron-4 340B Technical Report, https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T_0.pdf (Architecture is decoder-only with GQA, SentencePiece tokenizer, causal attention masks, RoPE, 96 layers, 96 heads, 8 KV heads, 256,000 vocabulary, 18432 internal dimension, context window 4096, and uses squared RELU.)
  • Piotr Wojciech Mirowski, Juliette Love, Kory W. Mathewson, Shakir Mohamed, 3 Jun 2024 (v2), A Robot Walks into a Bar: Can Language Models Serve as Creativity Support Tools for Comedy? An Evaluation of LLMs' Humour Alignment with Comedians, https://arxiv.org/abs/2405.20956 (The unfunny fact that AI is bad at humor.)
  • Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, Lina Yao, 6 Jul 2024, AI Safety in Generative AI Large Language Models: A Survey, https://arxiv.org/abs/2407.18369
  • Mintong Kang, Nezihe Merve Gürel, Ning Yu, Dawn Song, Bo Li, July 2024, C-RAG: Certified Generation Risks for Retrieval-Augmented Language Models, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:22963-23000, 2024, https://proceedings.mlr.press/v235/kang24a.html
  • Rohin Shah, Seb Farquhar, Anca Dragan, 21st Aug 2024, AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work, https://www.alignmentforum.org/posts/79BPxvSsjzBkiSyTq/agi-safety-and-alignment-at-google-deepmind-a-summary-of
  • Shayne Longpre, Stella Biderman, Alon Albalak, Hailey Schoelkopf, Daniel McDuff, Sayash Kapoor, Kevin Klyman, Kyle Lo, Gabriel Ilharco, Nay San, Maribeth Rauh, Aviya Skowron, Bertie Vidgen, Laura Weidinger, Arvind Narayanan, Victor Sanh, David Adelani, Percy Liang, Rishi Bommasani, Peter Henderson, Sasha Luccioni, Yacine Jernite, Luca Soldaini, 26 Jun 2024 (v2), The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources, https://arxiv.org/abs/2406.16746
  • Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
  • Hao Zhou, Chengming Hu, Ye Yuan, Yufei Cui, Yili Jin, Can Chen, Haolun Wu, Dun Yuan, Li Jiang, Di Wu, Xue Liu, Charlie Zhang, Xianbin Wang, Jiangchuan Liu, 17 May 2024, Large Language Model (LLM) for Telecommunications: A Comprehensive Survey on Principles, Key Techniques, and Opportunities, https://arxiv.org/abs/2405.10825
  • Zekun Moore Wang, Shawn Wang, Kang Zhu, Jiaheng Liu, Ke Xu, Jie Fu, Wangchunshu Zhou, Wenhao Huang, 17 Oct 2024, PopAlign: Diversifying Contrasting Patterns for a More Comprehensive Alignment, https://arxiv.org/abs/2410.13785
  • Mozhi Zhang, Pengyu Wang, Chenkun Tan, Mianqiu Huang, Dong Zhang, Yaqian Zhou, Xipeng Qiu, 18 Oct 2024, MetaAlign: Align Large Language Models with Diverse Preferences during Inference Time, https://arxiv.org/abs/2410.14184
  • OpenAI, Dec 2024, Deliberative alignment: reasoning enables safer language models. Introducing our new alignment strategy for o-series models, which are directly taught safety specifications and how to reason over them. https://openai.com/index/deliberative-alignment/
  • Asif Razzaq, December 23, 2024, OpenAI Researchers Propose ‘Deliberative Alignment’: A Training Approach that Teaches LLMs to Explicitly Reason through Safety Specifications before Producing an Answer, https://www.marktechpost.com/2024/12/23/openai-researchers-propose-deliberative-alignment-a-training-approach-that-teaches-llms-to-explicitly-reason-through-safety-specifications-before-producing-an-answer/
  • Andrea Matarazzo, Riccardo Torlone, 3 Jan 2025, A Survey on Large Language Models with some Insights on their Capabilities and Limitations, https://arxiv.org/abs/2501.04040 (Broad survey with many LLM topics covered from history to architectures to optimizations.)
  • Tong Xiao, Jingbo Zhu, 16 Jan 2025, Foundations of Large Language Models, https://arxiv.org/abs/2501.09223 (Huge 230 page paper on many topics such as training, prompting, alignment, and long context.)
  • Zongxi Li, Yang Li, Haoran Xie, S. Joe Qin, 3 Feb 2025, CondAmbigQA: A Benchmark and Dataset for Conditional Ambiguous Question Answering, https://arxiv.org/abs/2502.01523
  • Y Gong, D Ran, X He, T Cong, A Wang, X Wang, Feb 2025, Safety Misalignment Against Large Language Models, Network and Distributed System Security (NDSS) Symposium 2025, 24-28 February 2025, San Diego, CA, USA, ISBN 979-8-9894372-8-3, https://dx.doi.org/10.14722/ndss.2025.241089 https://www.ndss-symposium.org/wp-content/uploads/2025-1089-paper.pdf
  • Guiyao Tie, Zeli Zhao, Dingjie Song, Fuyang Wei, Rong Zhou, Yurou Dai, Wen Yin, Zhejian Yang, Jiangyue Yan, Yao Su, Zhenhan Dai, Yifeng Xie, Yihan Cao, Lichao Sun, Pan Zhou, Lifang He, Hechang Chen, Yu Zhang, Qingsong Wen, Tianming Liu, Neil Zhenqiang Gong, Jiliang Tang, Caiming Xiong, Heng Ji, Philip S. Yu, Jianfeng Gao, 8 Mar 2025, A Survey on Post-training of Large Language Models, https://arxiv.org/abs/2503.06072
  • Ziqi Yin, Hao Wang, Kaito Horio, Daisuke Kawahara, Satoshi Sekine, 14 Oct 2024 (v2), Should We Respect LLMs? A Cross-Lingual Study on the Influence of Prompt Politeness on LLM Performance, https://arxiv.org/abs/2402.14531
  • Michael Nuñez, July 15, 2025, OpenAI, Google DeepMind and Anthropic sound alarm: ‘We may be losing the ability to understand AI’, https://venturebeat.com/ai/openai-google-deepmind-and-anthropic-sound-alarm-we-may-be-losing-the-ability-to-understand-ai/ (Monitoring the text-based interim "thinking-out-loud" reasoning of models in CoT.)
  • Tomek Korbak, Mikita Balesni, (and many more authors) July 2025, Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety, https://tomekkorbak.com/cot-monitorability-is-a-fragile-opportunity/cot_monitoring.pdf
  • Yafu Li, Xuyang Hu, Xiaoye Qu, Linjie Li, Yu Cheng, 22 Jan 2025, Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback, https://arxiv.org/abs/2501.12895 https://github.com/yafuly/TPO
  • Cameron R. Wolfe, Ph.D., Jun 30, 2025, Reward Models: Modeling human preferences for LLMs in the age of reasoning models, https://cameronrwolfe.substack.com/p/reward-models
  • Zetian Sun, Dongfang Li, Baotian Hu, 14 Aug 2025, Diversity First, Quality Later: A Two-Stage Assumption for Language Model Alignment, https://arxiv.org/abs/2508.10530
  • Xinyan Jiang, Lin Zhang, Jiayi Zhang, Qingsong Yang, Guimin Hu, Di Wang, Lijie Hu, 14 Aug 2025, MSRS: Adaptive Multi-Subspace Representation Steering for Attribute Alignment in Large Language Models, https://arxiv.org/abs/2508.10599
  • Jinhwa Kim, Ian G. Harris, 9 Aug 2025, Context Misleads LLMs: The Role of Context Filtering in Maintaining Safe Alignment of LLMs, https://arxiv.org/abs/2508.10031
  • Christopher Pinier, Sonia Acu\~na Vargas, Mariia Steeghs-Turchina, Dora Matzke, Claire E. Stevenson, Michael D. Nunez, 12 Aug 2025, Large Language Models Show Signs of Alignment with Human Neurocognition During Abstract Reasoning, https://arxiv.org/abs/2508.10057
  • Shixiong Xu, Chenghao Zhang, Lubin Fan, Yuan Zhou, Bin Fan, Shiming Xiang, Gaofeng Meng, Jieping Ye, 14 Aug 2025, AddressVLM: Cross-view Alignment Tuning for Image Address Localization using Large Vision-Language Models, https://arxiv.org/abs/2508.10667
  • Xia Chen, 13 Aug 2025, Dynamical Alignment: A Principle for Adaptive Neural Computation, https://arxiv.org/abs/2508.10064
  • Yihao Xue, Baharan Mirzasoleiman, 22 Jul 2025, LoRA is All You Need for Safety Alignment of Reasoning LLMs, https://arxiv.org/abs/2507.17075
  • Haoran Sun, Zekun Zhang, Shaoning Zeng, 23 Jul 2025, An Uncertainty-Driven Adaptive Self-Alignment Framework for Large Language Models, https://arxiv.org/abs/2507.17477
  • Xiang Li, 21 Jul 2025, Look Before You Fuse: 2D-Guided Cross-Modal Alignment for Robust 3D Detection, https://arxiv.org/abs/2507.16861
  • Miguel Carrasco, C\'esar Gonz\'alez-Mart\'in, Jos\'e Aranda, Luis Oliveros, 23 Jul 2025, Vision Transformer attention alignment with human visual perception in aesthetic object evaluation, https://arxiv.org/abs/2507.17616
  • Yifan Wang, Runjin Chen, Bolian Li, David Cho, Yihe Deng, Ruqi Zhang, Tianlong Chen, Zhangyang Wang, Ananth Grama, Junyuan Hong, 22 Jul 2025, More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment, https://arxiv.org/abs/2504.02193
  • Tom\'as H\"uttebr\"aucker, Mario Edoardo Pandolfo, Simone Fiorellino, Emilio Calvanese Strinati, Paolo Di Lorenzo, 23 Jul 2025, RIS-aided Latent Space Alignment for Semantic Channel Equalization, https://arxiv.org/abs/2507.16450
  • Shehzeen Hussain, Paarth Neekhara, Xuesong Yang, Edresson Casanova, Subhankar Ghosh, Mikyas T. Desta, Roy Fejgin, Rafael Valle, Jason Li, 22 Jul 2025, Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance, https://arxiv.org/abs/2502.05236
  • Songming Zhang, Xue Zhang, Tong Zhang, Bojie Hu, Yufeng Chen, Jinan Xu, 23 Jul 2025, AlignDistil: Token-Level Language Model Alignment as Adaptive Policy Distillation, https://arxiv.org/abs/2503.02832
  • ZhengXiao He, Jinghao Wen, Huayu Li, Siyuan Tian, Ao Li, 23 Jul 2025, NeuroHD-RA: Neural-distilled Hyperdimensional Model with Rhythm Alignment, https://arxiv.org/abs/2507.14184
  • Amir Mohammad Izadi, Seyed Mohammad Hadi Hosseini, Soroush Vafaie Tabar, Ali Abdollahi, Armin Saghafian, and Mahdieh Soleymani Baghshah, 22 Jul 2025, Fine-Grained Alignment and Noise Refinement for Compositional Text-to-Image Generation, https://arxiv.org/abs/2503.06506
  • Andy E. Williams, 18 Jul 2025, The Recursive Coherence Principle: A Formal Constraint on Scalable Intelligence, Alignment, and Reasoning Architecture, https://arxiv.org/abs/2507.15880
  • Debangshu Banerjee, Kintan Saha, Aditya Gopalan, 21 Jul 2025, Towards Reliable, Uncertainty-Aware Alignment, https://arxiv.org/abs/2507.15906
  • Mario Edoardo Pandolfo, Simone Fiorellino, Emilio Calvanese Strinati, Paolo Di Lorenzo, 22 Jul 2025, Latent Space Alignment for AI-Native MIMO Semantic Communications, https://arxiv.org/abs/2507.16680
  • Han Jiang, Dongyao Zhu, Zhihua Wei, Xiaoyuan Yi, Ziang Xiao, Xing Xie, 22 Jul 2025, PICACO: Pluralistic In-Context Value Alignment of LLMs via Total Correlation Optimization, https://arxiv.org/abs/2507.16679
  • Difei Gu, Yunhe Gao, Yang Zhou, Mu Zhou, Dimitris Metaxas, 22 Jul 2025, RadAlign: Advancing Radiology Report Generation with Vision-Language Concept Alignment, https://arxiv.org/abs/2501.07525
  • Ziteng Yang, Jingzehua Xu, Yanshu Li, Zepeng Li, Yeqiang Wang, Xinghui Li, 22 Jul 2025, ViP$^2$-CLIP: Visual-Perception Prompting with Unified Alignment for Zero-Shot Anomaly Detection, https://arxiv.org/abs/2505.17692
  • Tianze Wang, Dongnan Gui, Yifan Hu, Shuhang Lin, Linjun Zhang, 22 Jul 2025, MPO: An Efficient Post-Processing Framework for Mixing Diverse Preference Alignment, https://arxiv.org/abs/2502.18699
  • Xiandong Zou, Wanyu Lin, Yuchen Li, Pan Zhou, 24 Jul 2025, HPS: Hard Preference Sampling for Human Preference Alignment, https://arxiv.org/abs/2502.14400
  • Alberto Hern\'andez-Espinosa, Felipe S. Abrah\~ao, Olaf Witkowski, Hector Zenil, 24 Jul 2025, Neurodivergent Influenceability as a Contingent Solution to the AI Alignment Problem, https://arxiv.org/abs/2505.02581
  • Yuhui Sun (University of Alberta), Xiyao Wang (University of Toronto), Zixi Li (Zhejiang University), Zhenlong Yuan (Institute of Computing Technology, Chinese Academy of Sciences), and Jinman Zhao (University of Toronto), 24 Jul 2025, Multi-Preference Lambda-weighted Listwise DPO for Small-Scale Model Alignment, https://arxiv.org/abs/2506.19780
  • Bowen Jin, Jinsung Yoon, Zhen Qin, Ziqi Wang, Wei Xiong, Yu Meng, Jiawei Han, Sercan O. Arik, 23 Jul 2025, LLM Alignment as Retriever Optimization: An Information Retrieval Perspective, https://arxiv.org/abs/2502.03699
  • Jie Xu, Na Zhao, Gang Niu, Masashi Sugiyama, Xiaofeng Zhu, 24 Jul 2025, Robust Multi-View Learning via Representation Fusion of Sample-Level Attention and Alignment of Simulated Perturbation, https://arxiv.org/abs/2503.04151
  • Charvi Rastogi, Tian Huey Teh, Pushkar Mishra, Roma Patel, Ding Wang, Mark D\'iaz, Alicia Parrish, Aida Mostafazadeh Davani, Zoe Ashwood, Michela Paganini, Vinodkumar Prabhakaran, Verena Rieser, Lora Aroyo, 15 Jul 2025, Whose View of Safety? A Deep DIVE Dataset for Pluralistic Alignment of Text-to-Image Models, https://arxiv.org/abs/2507.13383
  • Oussama Bouaggad, Natalia Grabar, 18 Jul 2025, Search-Optimized Quantization in Biomedical Ontology Alignment, https://arxiv.org/abs/2507.13742
  • Shuliang Liu, Qi Zheng, Jesse Jiaxi Xu, Yibo Yan, He Geng, Aiwei Liu, Peijie Jiang, Jia Liu, Yik-Cheung Tam, and Xuming Hu, 18 Jul 2025, VLA-Mark: A cross modal watermark for large vision-language alignment model, https://arxiv.org/abs/2507.14067
  • Yi Zhang, An Zhang, XiuYu Zhang, Leheng Sheng, Yuxin Chen, Zhenkai Liang, Xiang Wang, 20 Jul 2025, AlphaAlign: Incentivizing Safety Alignment with Extremely Simplified Reinforcement Learning, https://arxiv.org/abs/2507.14987
  • Pengfei Du, 14 Jul 2025, PRM-Free Security Alignment of Large Models via Red Teaming and Adversarial Training, https://arxiv.org/abs/2507.14202
  • Wenqian Ye, Guangtao Zheng, Aidong Zhang, 20 Jul 2025, Improving Group Robustness on Spurious Correlation via Evidential Alignment, https://arxiv.org/abs/2506.11347
  • Anirudh Sundar, Sinead Williamson, Katherine Metcalf, Barry-John Theobald, Skyler Seto, Masha Fedzechkina, 21 Jul 2025, Steering into New Embedding Spaces: Analyzing Cross-Lingual Alignment Induced by Model Interventions in Multilingual Language Models, https://arxiv.org/abs/2502.15639
  • Noel Teku, Fengwei Tian, Payel Bhattacharjee, Souradip Chakraborty, Amrit Singh Bedi, Ravi Tandon, 9 Aug 2025, PROPS: Progressively Private Self-alignment of Large Language Models, https://arxiv.org/abs/2508.06783
  • Yuandong Tan, 10 Aug 2025, A Stable and Principled Loss Function for Direct Language Model Alignment, https://arxiv.org/abs/2508.07137
  • Jia Zhang, Yao Liu, Chen-Xi Zhang, Yi Liu, Yi-Xuan Jin, Lan-Zhe Guo, Yu-Feng Li, 11 Aug 2025, Beyond Single: A Data Selection Principle for LLM Alignment via Fine-Grained Preference Signals, https://arxiv.org/abs/2508.07638
  • Haowen Wang, Yun Yue, Zhiling Ye, Shuowen Zhang, Lei Fan, Jiaxin Liang, Jiadi Jiang, Cheng Wei, Jingyuan Deng, Xudong Han, Ji Li, Chunxiao Guo, Peng Wei, Jian Wang, Jinjie Gu, 11 Aug 2025, Learning to Align, Aligning to Learn: A Unified Approach for Self-Optimized Alignment, https://arxiv.org/abs/2508.07750
  • Qiang He, Setareh Maghsudi, 11 Aug 2025, Pareto Multi-Objective Alignment for Language Models, https://arxiv.org/abs/2508.07768
  • Nicole Lai-Tan and Xiao Gu and Marios G. Philiastides and Fani Deligianni, 11 Aug 2025, Cross-Subject and Cross-Montage EEG Transfer Learning via Individual Tangent Space Alignment and Spatial-Riemannian Feature Fusion, https://arxiv.org/abs/2508.08216
  • Ben Y. Reis and William La Cava, 8 Aug 2025, Towards Integrated Alignment, https://arxiv.org/abs/2508.06592
  • Xiaobo Zhang (1 and 2), Congqing He (2), Ying He (1 and 2), Jian Peng (1), Dajie Fu (1), Tien-Ping Tan (2) ((1) School of Information Engineering, Jiangxi Vocational College of Finance & Economics, Jiujiang, China, (2) School of Computer Sciences, Universiti Sains Malaysia, Penang, Malaysia), 9 Aug 2025, ESNERA: Empirical and semantic named entity alignment for named entity dataset merging, https://arxiv.org/abs/2508.06877
  • Jianting Tang, Yubo Wang, Haoyu Cao, Linli Xu, 9 Aug 2025, BASIC: Boosting Visual Alignment with Intrinsic Refined Embeddings in Multimodal Large Language Models, https://arxiv.org/abs/2508.06895
  • Yanru Sun, Emadeldeen Eldele, Zongxia Xie, Yucheng Wang, Wenzhe Niu, Qinghua Hu, Chee Keong Kwoh, Min Wu, 10 Aug 2025, Adapting LLMs to Time Series Forecasting via Temporal Heterogeneity Modeling and Semantic Alignment, https://arxiv.org/abs/2508.07195
  • Gustavo Moreira, Leonardo Ferreira, Carolina Veiga, Maryam Hosseini, Fabio Miranda, 10 Aug 2025, Urbanite: A Dataflow-Based Framework for Human-AI Interactive Alignment in Urban Visual Analytics, https://arxiv.org/abs/2508.07390
  • Wenze Xu and Chun Wang and Jiazhen Yu and Sheng Chen and Liang Gao and Weihong Deng, 11 Aug 2025, Optimal Transport Regularization for Speech Text Alignment in Spoken Language Models, https://arxiv.org/abs/2508.08131
  • Kyle Moore, Jesse Roberts, Daryl Watson, 11 Aug 2025, Human-Alignment and Calibration of Inference-Time Uncertainty in Large Language Models, https://arxiv.org/abs/2508.08204
  • Jie Xiao, Changyuan Fan, Qingnan Ren, Alfred Long, Yuchen Zhang, Rymon Yu, Eric Yang, Lynn Ai, Shaoduo Gan, 9 Aug 2025, Echo: Decoupling Inference and Training for Large-Scale RL Alignment on Heterogeneous Swarms, https://arxiv.org/abs/2508.05387
  • Haoran Lu, Luyang Fang, Ruidong Zhang, Xinliang Li, Jiazhang Cai, Huimin Cheng, Lin Tang, Ziyu Liu, Zeliang Sun, Tao Wang, Yingchuan Zhang, Arif Hassan Zidan, Jinwen Xu, Jincheng Yu, Meizhi Yu, Hanqi Jiang, Xilin Gong, Weidi Luo, Bolun Sun, Yongkai Chen, Terry Ma, Shushan Wu, Yifan Zhou, Junhao Chen, Haotian Xiang, Jing Zhang, Afrar Jahin, Wei Ruan, Ke Deng, Yi Pan, Peilong Wang, Jiahui Li, Zhengliang Liu, Lu Zhang, Lin Zhao, Wei Liu, Dajiang Zhu, Xin Xing, Fei Dou, Wei Zhang, Chao Huang, Rongjie Liu, Mengrui Zhang, Yiwen Liu, Xiaoxiao Sun, Qin Lu, Zhen Xiang, Wenxuan Zhong, Tianming Liu, Ping Ma, 25 Jul 2025, Alignment and Safety in Large Language Models: Safety Mechanisms, Training Paradigms, and Emerging Challenges, https://arxiv.org/abs/2507.19672
  • Sarat Chandra Bobbili, Ujwal Dinesha, Dheeraj Narasimha, Srinivas Shakkottai, 26 Jul 2025, PITA: Preference-Guided Inference-Time Alignment for LLM Post-Training, https://arxiv.org/abs/2507.20067
  • Rachel S.Y. Teo, Laziz U. Abdullaev, Tan M. Nguyen, 27 Jul 2025, The Blessing and Curse of Dimensionality in Safety Alignment, https://arxiv.org/abs/2507.20333
  • Tiantian Peng, Yuyang Liu, Shuo Yang, Qiuhe Hong, YongHong Tian, 26 Jul 2025, GNSP: Gradient Null Space Projection for Preserving Cross-Modal Alignment in VLMs Continual Learning, https://arxiv.org/abs/2507.19839
  • Siyu Song, Wentao Liu, Ye Lu, Ruohua Zhang, Tao Liu, Jinze Lv, Xinyun Wang, Aimin Zhou, Fei Tan, Bo Jiang, Hao Hao, 27 Jul 2025, Cultivating Helpful, Personalized, and Creative AI Tutors: A Framework for Pedagogical Alignment using Reinforcement Learning, https://arxiv.org/abs/2507.20335
  • Rongyao Cai, Ming Jin, Qingsong Wen, Kexin Zhang, 28 Jul 2025, From Entanglement to Alignment: Representation Space Decomposition for Unsupervised Time Series Domain Adaptation, https://arxiv.org/abs/2507.20968
  • Andr\'e Steingr\"uber, Kevin Baum, 24 Jul 2025, Justifications for Democratizing AI Alignment and Their Prospects, https://arxiv.org/abs/2507.19548
  • Shuhaib Mehri, Xiaocheng Yang, Takyoung Kim, Gokhan Tur, Shikib Mehri, Dilek Hakkani-T\"ur, 27 Jul 2025, Goal Alignment in LLM-Based User Simulators for Conversational AI, https://arxiv.org/abs/2507.20152
  • Gabriel Downer, Sean Craven, Damian Ruck, Jake Thomas, 28 Jul 2025, Text2VLM: Adapting Text-Only Datasets to Evaluate Alignment Training in Visual Language Models, https://arxiv.org/abs/2507.20704
  • Renhang Liu, Chia-Yu Hung, Navonil Majumder, Taylor Gautreaux, Amir Ali Bagherzadeh, Chuan Li, Dorien Herremans, Soujanya Poria, 28 Jul 2025, JAM: A Tiny Flow-based Song Generator with Fine-grained Controllability and Aesthetic Alignment, https://arxiv.org/abs/2507.20880
  • Hei Shing Cheung and Boya Zhang, 26 Jul 2025, Efficient Vocal-Conditioned Music Generation via Soft Alignment Attention and Latent Diffusion, https://arxiv.org/abs/2507.19991
  • Jiaming Ji, Kaile Wang, Tianyi Qiu, Boyuan Chen, Jiayi Zhou, Changye Li, Hantao Lou, Juntao Dai, Yunhuai Liu, Yaodong Yang, 27 Jul 2025, Language Models Resist Alignment: Evidence From Data Compression, https://arxiv.org/abs/2406.06144
  • Madhava Gaikwad (1), Ashwini Ramchandra Doke (2) ((1) Microsoft, (2) Amrita University), 22 Jul 2025, NPO: Learning Alignment and Meta-Alignment through Structured Human Feedback, https://arxiv.org/abs/2507.21131
  • Lenart Motnikar, Katharina Baum, Alexander Kagan, Sarah Spiekermann-Hoff, 26 Jun 2025, The Value of Gen-AI Conversations: A bottom-up Framework for AI Value Alignment, https://arxiv.org/abs/2507.21091
  • Aran Nayebi, 29 Jul 2025, Intrinsic Barriers and Practical Pathways for Human-AI Alignment: An Agreement-Based Complexity Analysis, https://arxiv.org/abs/2502.05934
  • Haipeng Liu, Yuxuan Liu, Ting Long, 31 Jul 2025, Personalized Education with Ranking Alignment Recommendation, https://arxiv.org/abs/2507.23664
  • Wei Li and Xun Gong and Jiao Li and Xiaobin Sun, 31 Jul 2025, AGA: An adaptive group alignment framework for structured medical cross-modal representation learning, https://arxiv.org/abs/2507.23402
  • Ananth Balashankar and Ziteng Sun and Jonathan Berant and Jacob Eisenstein and Michael Collins and Adrian Hutter and Jong Lee and Chirag Nagpal and Flavien Prost and Aradhana Sinha and Ananda Theertha Suresh and Ahmad Beirami, 31 Jul 2025, InfAlign: Inference-aware language model alignment, https://arxiv.org/abs/2412.19792
  • Qun Ma, Xiao Xue, Ming Zhang, Yifan Shen, Zihan Zhao, 30 Jul 2025, An Explainable Emotion Alignment Framework for LLM-Empowered Agent in Metaverse Service Ecosystem, https://arxiv.org/abs/2507.22326
  • Yixuan Nan, Xixun Lin, Yanmin Shang, Zhuofan Li, Can Zhao and Yanan Cao, 30 Jul 2025, RANA: Robust Active Learning for Noisy Network Alignment, https://arxiv.org/abs/2507.22434
  • Shaoan Xie, Lingjing Kong, Yujia Zheng, Yu Yao, Zeyu Tang, Eric P. Xing, Guangyi Chen, Kun Zhang, 29 Jul 2025, SmartCLIP: Modular Vision-language Alignment with Identification Guarantees, https://arxiv.org/abs/2507.22264
  • Junjie Cao, 30 Jul 2025, Adaptive Duration Model for Text Speech Alignment, https://arxiv.org/abs/2507.22612
  • Yeonjun In, Wonjoong Kim, Sangwu Park, Chanyoung Park, 1 Aug 2025, R1-ACT: Efficient Reasoning Model Safety Alignment by Activating Safety Knowledge, https://arxiv.org/abs/2508.00324
  • Jens U. Kreber, Joerg Stueckler, 1 Aug 2025, Guiding Diffusion-Based Articulated Object Generation by Partial Point Cloud Alignment and Physical Plausibility Constraints, https://arxiv.org/abs/2508.00558
  • Amitava Das, Vinija Jain, Aman Chadha, 4 Aug 2025, TRACEALIGN -- Tracing the Drift: Attributing Alignment Failures to Training-Time Belief Sources in LLMs, https://arxiv.org/abs/2508.02063
  • Istabrak Abbes, Gopeshh Subbaraj, Matthew Riemer, Nizar Islah, Benjamin Therien, Tsuguchika Tabaru, Hiroaki Kingetsu, Sarath Chandar, Irina Rish, 3 Aug 2025, Revisiting Replay and Gradient Alignment for Continual Pre-Training of Large Language Models, https://arxiv.org/abs/2508.01908
  • Ziyu Zhou, Yiming Huang, Yanyun Wang, Yuankai Wu, James Kwok, Yuxuan Liang, 4 Aug 2025, Revitalizing Canonical Pre-Alignment for Irregular Multivariate Time Series Forecasting, https://arxiv.org/abs/2508.01971
  • Amitava Das, Abhilekh Borah, Vinija Jain, Aman Chadha, 4 Aug 2025, AlignGuard-LoRA: Alignment-Preserving Fine-Tuning via Fisher-Guided Decomposition and Riemannian-Geodesic Collision Regularization, https://arxiv.org/abs/2508.02079
  • Yu Lei, Jinbin Bai, Qingyu Shi, Aosong Feng and Kaidong Yu, 2 Aug 2025, Personalized Safety Alignment for Text-to-Image Diffusion Models, https://arxiv.org/abs/2508.01151
  • Tae Soo Kim, Yoonjoo Lee, Yoonah Park, Jiho Kim, Young-Ho Kim, Juho Kim, 3 Aug 2025, CUPID: Evaluating Personalized and Contextualized Alignment of LLMs from Interactions, https://arxiv.org/abs/2508.01674
  • Tom S. Juzek, Zina B. Ward, 3 Aug 2025, Word Overuse and Alignment in Large Language Models: The Influence of Learning from Human Feedback, https://arxiv.org/abs/2508.01930
  • Haoran Gu, Handing Wang, Yi Mei, Mengjie Zhang, Yaochu Jin, 4 Aug 2025, ParetoHqD: Fast Offline Multiobjective Alignment of Large Language Models using Pareto High-quality Data, https://arxiv.org/abs/2504.16628
  • Ivan Zakazov, Mikolaj Boronski, Lorenzo Drudi, Robert West, 4 Aug 2025, Assessing Social Alignment: Do Personality-Prompted Large Language Models Behave Like Humans?, https://arxiv.org/abs/2412.16772
  • Taibiao Zhao, Xiaobing Chen, and Mingxuan Sun, 1 Aug 2025, Enhancing Time Series Forecasting via Multi-Level Text Alignment with LLMs, https://arxiv.org/abs/2504.07360
  • Bolian Li, Yifan Wang, Anamika Lochab, Ananth Grama, Ruqi Zhang, 3 Aug 2025, Cascade Reward Sampling for Efficient Decoding-Time Alignment, https://arxiv.org/abs/2406.16306
  • Amir Aghdam, Vincent Tao Hu, Bj\"orn Ommer, 4 Aug 2025, ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment, https://arxiv.org/abs/2506.22967
  • Dahun Kim, Anelia Angelova, 3 Aug 2025, Context-Adaptive Multi-Prompt LLM Embedding for Vision-Language Alignment, https://arxiv.org/abs/2508.02762
  • Hongjun Liu, Chao Yao, Yalan Zhang, Xiaokun wang and Xiaojuan Ban, 5 Aug 2025, Spatial Imputation Drives Cross-Domain Alignment for EEG Classification, https://arxiv.org/abs/2508.03437
  • Anamika Lochab, Ruqi Zhang, 5 Aug 2025, Energy-Based Reward Models for Robust Language Model Alignment, https://arxiv.org/abs/2504.13134
  • Wentao Wu, Linqing Chen, Hanmeng Zhong, Weilei Wang, 6 Aug 2025, Large Language Model's Multi-Capability Alignment in Biomedical Domain, https://arxiv.org/abs/2508.04278
  • Abdul Monaf Chowdhury, Rabeya Akter, Safaeid Hossain Arib, 6 Aug 2025, T3Time: Tri-Modal Time Series Forecasting via Adaptive Multi-Head Alignment and Residual Fusion, https://arxiv.org/abs/2508.04251
  • Hongxu Chen, Zhen Wang, Taoran Mei, Lin Li, Bowei Zhu, Runshi Li, Long Chen, 6 Aug 2025, Zero-Residual Concept Erasure via Progressive Alignment in Text-to-Image Model, https://arxiv.org/abs/2508.04472
  • Feifan Song, Bofei Gao, Yifan Song, Yi Liu, Weimin Xiong, Yuyang Song, Tianyu Liu, Guoyin Wang, Houfeng Wang, 6 Aug 2025, P-Aligner: Enabling Pre-Alignment of Language Models via Principled Instruction Synthesis, https://arxiv.org/abs/2508.04626
  • Wenji Fang, Jing Wang, Yao Lu, Shang Liu, Zhiyao Xie, 6 Aug 2025, GenEDA: Towards Generative Netlist Functional Reasoning via Cross-Modal Circuit Encoder-Decoder Alignment, https://arxiv.org/abs/2504.09485
  • You Rim Choi, Subeom Park, Seojun Heo, Eunchung Noh, Hyung-Sin Kim, 6 Aug 2025, Let the Void Be Void: Robust Open-Set Semi-Supervised Learning via Selective Non-Alignment, https://arxiv.org/abs/2504.12569
  • Krzysztof Janowicz and Zilong Liu and Gengchen Mai and Zhangyu Wang and Ivan Majic and Alexandra Fortacz and Grant McKenzie and Song Gao, 7 Aug 2025, Whose Truth? Pluralistic Geo-Alignment for (Agentic) AI, https://arxiv.org/abs/2508.05432
  • Shruti Saxena, Arijit Khan and Joydeep Chandra, 5 Aug 2025, NAEx: A Plug-and-Play Framework for Explaining Network Alignment, https://arxiv.org/abs/2508.04731
  • Mason Nakamura, Saaduddin Mahmud, Kyle H. Wray, Hamed Zamani, Shlomo Zilberstein, 7 Aug 2025, Aligning LLMs on a Budget: Inference-Time Alignment with Heuristic Reward Models, https://arxiv.org/abs/2508.05165
  • Zhongheng Yang, Aijia Sun, Yushang Zhao, Yinuo Yang, Dannier Li, Chengrui Zhou, 7 Aug 2025, RLHF Fine-Tuning of LLMs for Alignment with Implicit User Feedback in Conversational Recommenders, https://arxiv.org/abs/2508.05289
  • Qinghua Yao, Xiangrui Xu, Zhize Li, 7 Aug 2025, X-VFL: A New Vertical Federated Learning Framework with Cross Completion and Decision Subspace Alignment, https://arxiv.org/abs/2508.05568
  • Sam Kouteili, Hiren Madhu, George Typaldos, Mark Santolucito, 7 Aug 2025, Embedding Alignment in Code Generation for Audio, https://arxiv.org/abs/2508.05473
  • Yubin Zhang, Yanhua Huang, Haiming Xu, Mingliang Qi, Chang Wang, Jiarui Jin, Xiangyuan Ren, Xiaodan Wang, Ruiwen Xu, 7 Aug 2025, A Metric for MLLM Alignment in Large-scale Recommendation, https://arxiv.org/abs/2508.04963
  • Zhiqing Xiao, Haobo Wang, Xu Lu, Wentao Ye, Gang Chen, Junbo Zhao, 7 Aug 2025, SPA++: Generalized Graph Spectral Alignment for Versatile Domain Adaptation, https://arxiv.org/abs/2508.05182
  • Wei Zeng, Hengshu Zhu, Chuan Qin, Han Wu, Yihang Cheng, Sirui Zhang, Xiaowei Jin, Yinuo Shen, Zhenxing Wang, Feimin Zhong, Hui Xiong, 7 Aug 2025, Multi-level Value Alignment in Agentic AI Systems: Survey and Perspectives, https://arxiv.org/abs/2506.09656
  • Yifei Xu, Tusher Chakraborty, Emre K{\i}c{\i}man, Bibek Aryal, Eduardo Rodrigues, Srinagesh Sharma, Roberto Estevao, Maria Angels de Luis Balaguer, Jessica Wolk, Rafael Padilha, Leonardo Nunes, Shobana Balakrishnan, Songwu Lu, Ranveer Chandra, 6 Aug 2025, RLTHF: Targeted Human Feedback for LLM Alignment, https://arxiv.org/abs/2502.13417
  • Shengzhu Yang, Jiawei Du, Shuai Lu, Weihang Zhang, Ningli Wang, Huiqi Li, 8 Aug 2025, CLIPin: A Non-contrastive Plug-in to CLIP for Multimodal Semantic Alignment, https://arxiv.org/abs/2508.06434
  • Keiyu Nosaka, Yuichi Takano, Akiko Yoshise, 8 Aug 2025, Data Collaboration Analysis with Orthonormal Basis Selection and Alignment, https://arxiv.org/abs/2403.02780
  • Parker Whitfill, Stewy Slocum, 11 Aug 2025, Beyond Ordinal Preferences: Why Alignment Needs Cardinal Human Feedback, https://arxiv.org/abs/2508.08486
  • Sviatoslav Lushnei, Dmytro Shumskyi, Severyn Shykula, Ernesto Jimenez-Ruiz, Artur d'Avila Garcez, 11 Aug 2025, Large Language Models as Oracles for Ontology Alignment, https://arxiv.org/abs/2508.08500
  • Saketh Reddy Vemula, Dipti Mishra Sharma and Parameswari Krishnamurthy, 11 Aug 2025, Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment, https://arxiv.org/abs/2508.08424
  • Jadie Adams, Brian Hu, Emily Veenhuis, David Joy, Bharadwaj Ravichandran, Aaron Bray, Anthony Hoogs, Arslan Basharat, 11 Aug 2025, Steerable Pluralism: Pluralistic Alignment via Few-Shot Comparative Regression, https://arxiv.org/abs/2508.08509
  • Birong Pan, Yongqi Li, Weiyu Zhang, Wenpeng Lu, Mayi Xu, Shen Zhou, Yuanyuan Zhu, Ming Zhong, Tieyun Qian, 12 Aug 2025, A Survey on Training-free Alignment of Large Language Models, https://arxiv.org/abs/2508.09016
  • Sejin Kim, Sundong Kim, 12 Aug 2025, System~2 Reasoning for Human--AI Alignment: Generality and Adaptivity via ARC-AGI, https://arxiv.org/abs/2410.07866
  • Liang Chen, Xueting Han, Li Shen, Jing Bai, Kam-Fai Wong, 12 Aug 2025, Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning, https://arxiv.org/abs/2506.03850
  • Yuxin Chen and Chen Tang and Jianglan Wei and Chenran Li and Ran Tian and Xiang Zhang and Wei Zhan and Peter Stone and Masayoshi Tomizuka, 12 Aug 2025, MEReQ: Max-Ent Residual-Q Inverse RL for Sample-Efficient Alignment from Intervention, https://arxiv.org/abs/2406.16258
  • Yang Fan, 12 Aug 2025, AdEval: Alignment-based Dynamic Evaluation to Mitigate Data Contamination in Large Language Models, https://arxiv.org/abs/2501.13983
  • Shravan Nayak, Mehar Bhatia, Xiaofeng Zhang, Verena Rieser, Lisa Anne Hendricks, Sjoerd van Steenkiste, Yash Goyal, Karolina Sta\'nczak, Aishwarya Agrawal, 12 Aug 2025, CulturalFrames: Assessing Cultural Expectation Alignment in Text-to-Image Models and Evaluation Metrics, https://arxiv.org/abs/2506.08835
  • Yang Zhang, Cunxiang Wang, Lindong Wu, Wenbo Yu, Yidong Wang, Guangsheng Bao, Jie Tang, 13 Aug 2025, UDA: Unsupervised Debiasing Alignment for Pair-wise LLM-as-a-Judge, https://arxiv.org/abs/2508.09724
  • Mansi, Anastasios Lepipas, Dominika Woszczyk, Yiying Guan, Soteris Demetriou, 12 Aug 2025, Understanding Dementia Speech Alignment with Diffusion-Based Image Generation, https://arxiv.org/abs/2508.09385
  • Birong Pan, Mayi Xu, Qiankun Pi, Jianhao Chen, Yuanyuan Zhu, Ming Zhong, Tieyun Qian, 13 Aug 2025, NeuronTune: Fine-Grained Neuron Modulation for Balanced Safety-Utility Alignment in LLMs, https://arxiv.org/abs/2508.09473
  • Peiran Peng, Tingfa Xu, Liqiang Song, Mengqi Zhu, Yuqiang Fang, Jianan Li, 13 Aug 2025, COXNet: Cross-Layer Fusion with Adaptive Alignment and Scale Integration for RGBT Tiny Object Detection, https://arxiv.org/abs/2508.09533
  • Muneeza Azmat, Momin Abbas, Maysa Malfiza Garcia de Macedo, Marcelo Carpinette Grave, Luan Soares de Souza, Tiago Machado, Rogerio A de Paula, Raya Horesh, Yixin Chen, Heloisa Caroline de Souza Pereira Candello, Rebecka Nordenlow, Aminat Adebiyi, 13 Aug 2025, A Comprehensive Evaluation framework of Alignment Techniques for LLMs, https://arxiv.org/abs/2508.09937
  • Zichao Hu, Junyi Jessy Li, Arjun Guha, Joydeep Biswas, 12 Aug 2025, Robo-Instruct: Simulator-Augmented Instruction Alignment For Finetuning Code LLMs, https://arxiv.org/abs/2405.20179
  • Numair Nadeem, Saeed Anwar, Muhammad Hamza Asad, Abdul Bais, 13 Aug 2025, HVL: Semi-Supervised Segmentation leveraging Hierarchical Vision-Language Synergy with Dynamic Text-Spatial Query Alignment, https://arxiv.org/abs/2506.13925
  • Durgesh Mishra, Rishabh Uikey, 15 Aug 2025, Unified Knowledge Distillation Framework: Fine-Grained Alignment and Geometric Relationship Preservation for Deep Face Recognition, https://arxiv.org/abs/2508.11376
  • Alessio Galatolo, Luca Alberto Rappuoli, Katie Winkle, Meriem Beloucif, 18 Aug 2025, Beyond Ethical Alignment: Evaluating LLMs as Artificial Moral Assistants, https://arxiv.org/abs/2508.12754
  • Manning Zhu, Songtao Guo, Pengzhan Zhou, Yansong Ning, Chang Han, Dewen Qiao, 18 Aug 2025, FedSODA: Federated Fine-tuning of LLMs via Similarity Group Pruning and Orchestrated Distillation Alignment, https://arxiv.org/abs/2508.12727
  • Zhixin Xie, Xurui Song, Jun Luo, 17 Aug 2025, Where to Start Alignment? Diffusion Large Language Model May Demand a Distinct Position, https://arxiv.org/abs/2508.12398
  • Xuhui Zhan and Tyler Derr, 17 Aug 2025, Inverse-LLaVA: Eliminating Alignment Pre-training Through Text-to-Vision Mapping, https://arxiv.org/abs/2508.12466
  • Yizhuo Zhang, Heng Wang, Shangbin Feng, Zhaoxuan Tan, Xinyun Liu, Yulia Tsvetkov, 17 Aug 2025, Generalizable LLM Learning of Graph Synthetic Data with Post-training Alignment, https://arxiv.org/abs/2506.00845
  • Mohammad Jalali, Bahar Dibaei Nia, Farzan Farnia, 16 Aug 2025, Towards an Explainable Comparison and Alignment of Feature Embeddings, https://arxiv.org/abs/2506.06231
  • Guangfu Hao, Haojie Wen, Liangxuan Guo, Yang Chen, Yanchao Bi, Shan Yu, 18 Aug 2025, Flexible Tool Selection through Low-dimensional Attribute Alignment of Vision and Language, https://arxiv.org/abs/2505.22146
  • Yang Zhang, Yu Yu, Bo Tang, Yu Zhu, Chuxiong Sun, Wenqiang Wei, Jie Hu, Zipeng Xie, Zhiyu Li, Feiyu Xiong, Edward Chung, 16 Aug 2025, Token-level Accept or Reject: A Micro Alignment Approach for Large Language Models, https://arxiv.org/abs/2505.19743
  • Jeremy Carleton, Debajoy Mukherjee, Srinivas Shakkottai, Dileep Kalathil, 19 Aug 2025, MAVIS: Multi-Objective Alignment via Value-Guided Inference-Time Search, https://arxiv.org/abs/2508.13415
  • Zeeshan Ahmed, Frank Seide, Niko Moritz, Ju Lin, Ruiming Xie, Simone Merello, Zhe Liu and Christian Fuegen, 18 Aug 2025, Overcoming Latency Bottlenecks in On-Device Speech Translation: A Cascaded Approach with Alignment-Based Streaming MT, https://arxiv.org/abs/2508.13358
  • Jinhui Pang, Changqing Lin, Hao Lin, Zhihui Zhang, Long Chen, Weiping Ding, Yu Liu, Xiaoshuai Hao, 19 Aug 2025, MEGA: Second-Order Gradient Alignment for Catastrophic Forgetting Mitigation in GFSCIL, https://arxiv.org/abs/2504.13691
  • Samir Abdaljalil, Erchin Serpedin, Khalid Qaraqe, Hasan Kurban, 20 Aug 2025, Evaluating Multilingual and Code-Switched Alignment in LLMs via Synthetic Natural Language Inference, https://arxiv.org/abs/2508.14735
  • Abhigya Verma, Sriram Puttagunta, Seganrasan Subramanian, Sravan Ramachandran, 21 Aug 2025, GRAFT: GRaPH and Table Reasoning for Textual Alignment -- A Benchmark for Structured Instruction Following and Visual Reasoning, https://arxiv.org/abs/2508.15690
  • Chenlin Liu, Minghui Fang, Patrick Zhang, Wei Zhou, Jie Gao, Jiqing Han, 21 Aug 2025, Mitigating Hallucinations in LM-Based TTS Models via Distribution Alignment Using GFlowNets, https://arxiv.org/abs/2508.15442
  • Youjia Zhang, Youngeun Kim, Young-Geun Choi, Hongyeob Kim, Huiling Liu, Sungeun Hong, 21 Aug 2025, Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment, https://arxiv.org/abs/2508.15568
  • J. Koorndijk, 21 Aug 2025, Empirical Evidence for Alignment Faking in a Small LLM and Prompt-Based Mitigation Techniques, https://arxiv.org/abs/2506.21584
  • Qilong Xing, Zikai Song, Youjia Zhang, Na Feng, Junqing Yu, Wei Yang, 21 Aug 2025, MCA-RG: Enhancing LLMs with Medical Concept Alignment for Radiology Report Generation, https://arxiv.org/abs/2507.06992
  • Shuyuan Tu, Zhen Xing, Xintong Han, Zhi-Qi Cheng, Qi Dai, Chong Luo, Zuxuan Wu, Yu-Gang Jiang, 20 Jul 2025, StableAnimator++: Overcoming Pose Misalignment and Face Distortion for Human Image Animation, https://arxiv.org/abs/2507.15064
  • Vince Trencsenyi and Agnieszka Mensfelt and Kostas Stathis, 25 Jul 2025, Hypergames: Modeling Misaligned Perceptions and Nested Beliefs for Multi-agent Systems, https://arxiv.org/abs/2507.19593
  • Bryce Anderson, Riley Galpin, Tom S. Juzek, 1 Aug 2025, Model Misalignment and Language Change: Traces of AI-Associated Language in Unscripted Spoken English, https://arxiv.org/abs/2508.00238
  • Fan Bu, Zheng Wang, Siyi Wang and Ziyao Liu, 1 Aug 2025, An Investigation into Value Misalignment in LLM-Generated Texts for Cultural Heritage, https://arxiv.org/abs/2501.02039
  • Siddhant Panpatil, Hiskias Dingeto, Haon Park, 6 Aug 2025, Eliciting and Analyzing Emergent Misalignment in State-of-the-Art Large Language Models, https://arxiv.org/abs/2508.04196
  • David Kacz\'er, Magnus J{\o}rgenv{\aa}g, Clemens Vetter, Lucie Flek, Florian Mai, 8 Aug 2025, In-Training Defenses against Emergent Misalignment in Language Models, https://arxiv.org/abs/2508.06249
  • Yichao Cai, Yuhang Liu, Erdun Gao, Tianjiao Jiang, Zhen Zhang, Anton van den Hengel, Javen Qinfeng Shi, 7 Aug 2025, On the Value of Cross-Modal Misalignment in Multimodal Representation Learning, https://arxiv.org/abs/2504.10143
  • Dongyoon Hahm, Taywon Min, Woogyeol Jin, Kimin Lee, 19 Aug 2025, Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation, https://arxiv.org/abs/2508.14031
  • Igor Halperin, 13 Aug 2025, Prompt-Response Semantic Divergence Metrics for Faithfulness Hallucination and Misalignment Detection in Large Language Models, https://arxiv.org/abs/2508.10192
  • Zhi Wen Soi, Chenrui Fan, Aditya Shankar, Abele M\u{a}lan, Lydia Y. Chen, 14 Aug 2025, Federated Time Series Generation on Feature and Temporally Misaligned Data, https://arxiv.org/abs/2410.21072
  • Yue Pei, Hongming Zhang, Chao Gao, Martin M\"uller, Mengxiao Zhu, Hao Sheng, Haogang Zhu, Liang Lin, 22 Aug 2025, Double Check My Desired Return: Transformer with Target Alignment for Offline Reinforcement Learning, https://arxiv.org/abs/2508.16420
  • Junhao Yin, Haolin Wang, Peng Bao, Ju Xu, Yongliang Wang, 15 Aug 2025, From Clicks to Preference: A Multi-stage Alignment Framework for Generative Query Suggestion in Conversational System, https://arxiv.org/abs/2508.15811
  • Pi-Wei Chen, Jerry Chun-Wei Lin, Wei-Han Chen, Jia Ji, Zih-Ching Chen, Feng-Hao Yeh, Chao-Chun Chen, 22 Aug 2025, Beyond Human-prompting: Adaptive Prompt Tuning with Semantic Alignment for Anomaly Detection, https://arxiv.org/abs/2508.16157
  • Xiaoxiong Zhang, Xin Zhou, Zhiwei Zeng, Yongjie Wang, Dusit Niyato, Zhiqi Shen, 22 Aug 2025, EGRA:Toward Enhanced Behavior Graphs and Representation Alignment for Multimodal Recommendation, https://arxiv.org/abs/2508.16170
  • Zirui Li and Stephan Husung and Haoze Wang, 22 Aug 2025, LLM-Assisted Semantic Alignment and Integration in Collaborative Model-Based Systems Engineering Using SysML v2, https://arxiv.org/abs/2508.16181
  • Buhua Liu, Shitong Shao, Bao Li, Lichen Bai, Zhiqiang Xu, Haoyi Xiong, James Kwok, Sumi Helal, Zeke Xie, 7 Aug 2025, Alignment of Diffusion Models: Fundamentals, Challenges, and Future, https://arxiv.org/abs/2409.07253
  • Somnath Banerjee, Sayan Layek, Pratyush Chatterjee, Animesh Mukherjee, Rima Hazra, 22 Aug 2025, Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment, https://arxiv.org/abs/2502.11244
  • Zeguan Xiao, Yun Chen, Guanhua Chen, Ke Tang, 22 Aug 2025, Towards Bridging the Reward-Generation Gap in Direct Alignment Algorithms, https://arxiv.org/abs/2506.09457
  • Mia Taylor and James Chua and Jan Betley and Johannes Treutlein and Owain Evans, 24 Aug 2025, School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs, https://arxiv.org/abs/2508.17511
  • Junan Zhang, Xueyao Zhang, Jing Yang, Yuancheng Wang, Fan Fan, Zhizheng Wu, 24 Aug 2025, Multi-Metric Preference Alignment for Generative Speech Restoration, https://arxiv.org/abs/2508.17229
  • Yang Li, Songlin Yang, Xiaoxuan Han, Wei Wang, Jing Dong, Yueming Lyu, Ziyu Xue, 25 Aug 2025, Instant Preference Alignment for Text-to-Image Diffusion Models, https://arxiv.org/abs/2508.17718
  • Bin Tan, Wangyao Ge, Yidi Wang, Xin Liu, Jeff Burtoft, Hao Fan, Hui Wang, 25 Aug 2025, PCR-CA: Parallel Codebook Representations with Contrastive Alignment for Multiple-Category App Recommendation, https://arxiv.org/abs/2508.18166
  • Yaoyao Qian, Jindan Huang, Yuanli Wang, Simon Yu, Kyrie Zhixuan Zhou, Jiayuan Mao, Mingfu Liang, Hanhan Zhou, 23 Aug 2025, WHEN TO ACT, WHEN TO WAIT: Modeling the Intent-Action Alignment Problem in Dialogue, https://arxiv.org/abs/2506.01881
  • Paul Darm, Annalisa Riccardi, 25 Aug 2025, Head-Specific Intervention Can Induce Misaligned AI Coordination in Large Language Models, https://arxiv.org/abs/2502.05945
  • Stephanie Palazzolo, Sep 2025, OpenAI’s Models Are Getting Too Smart For Their Human Teachers, https://www.theinformation.com/articles/openais-models-getting-smart-human-teachers (Using human labeling to train AI models is becoming more difficult, as the models begin to surpass humans.)
  • Cyrus Cousins, Vijay Keswani, Vincent Conitzer, Hoda Heidari, Jana Schaich Borg, Walter Sinnott-Armstrong, 4 Sep 2025, Towards Cognitively-Faithful Decision-Making Models to Improve AI Alignment, https://arxiv.org/abs/2509.04445
  • Yuqing Huang, Rongyang Zhang, Qimeng Wang, Chengqiang Lu, Yan Gao, Yi Wu, Yao Hu, Xuyang Zhi, Guiquan Liu, Xin Li, Hao Wang, Enhong Chen, 4 Sep 2025, SelfAug: Mitigating Catastrophic Forgetting in Retrieval-Augmented Generation via Distribution Self-Alignment, https://arxiv.org/abs/2509.03934
  • Ranjie Duan, Jiexi Liu, Xiaojun Jia, Shiji Zhao, Ruoxi Cheng, Fengxiang Wang, Cheng Wei, Yong Xie, Chang Liu, Defeng Li, Yinpeng Dong, Yichi Zhang, Yuefeng Chen, Chongwen Wang, Xingjun Ma, Xingxing Wei, Yang Liu, Hang Su, Jun Zhu, Xinfeng Li, Yitong Sun, Jie Zhang, Jinzhao Hu, Sha Xu, Yitong Yang, Jialing Tao, Hui Xue, 4 Sep 2025, Oyster-I: Beyond Refusal -- Constructive Safety Alignment for Responsible Language Models, https://arxiv.org/abs/2509.01909
  • Jonathn Chang, Leonhard Piff, Suvadip Sana, Jasmine X. Li, Lionel Levine, 3 Sep 2025, EigenBench: A Comparative Behavioral Measure of Value Alignment, https://arxiv.org/abs/2509.01938
  • Jusheng Zhang, Yijia Fan, Kaitong Cai, Xiaofei Sun, Keze Wang, 5 Sep 2025, OSC: Cognitive Orchestration through Dynamic Knowledge Alignment in Multi-Agent LLM Collaboration, https://arxiv.org/abs/2509.04876
  • Gongyue Zhang and Honghai Liu, 5 Sep 2025, Natural Spectral Fusion: p-Exponent Cyclic Scheduling and Early Decision-Boundary Alignment in First-Order Optimization, https://arxiv.org/abs/2509.04713
  • Wei Chen, Shigui Li, Jiacheng Li, Jian Xu, Zhiqi Lin, Junmei Yang, Delu Zeng, John Paisley, Qibin Zhao, 5 Sep 2025, Any-Step Density Ratio Estimation via Interval-Annealed Secant Alignment, https://arxiv.org/abs/2509.04852
  • Tiansheng Huang, Gautam Bhattacharya, Pratik Joshi, Josh Kimball, Ling Liu, 5 Sep 2025, Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning, https://arxiv.org/abs/2408.09600
  • Furong Jia, Lanxin Liu, Ce Hou, Fan Zhang, Xinyan Liu, Yu Liu, 5 Sep 2025, Towards Interpretable Geo-localization: a Concept-Aware Global Image-GPS Alignment Framework, https://arxiv.org/abs/2509.01910
  • Subrata Biswas, Mohammad Nur Hossain Khan, Bashima Islam, 5 Sep 2025, RAVEN: Query-Guided Representation Alignment for Question Answering over Audio, Video, Embedded Sensors, and Natural Language, https://arxiv.org/abs/2505.17114
  • Shervin Khalafi, Ignacio Hounie, Dongsheng Ding, Alejandro Ribeiro, 26 Aug 2025, Composition and Alignment of Diffusion Models using Constrained Learning, https://arxiv.org/abs/2508.19104
  • Nanxi Li, Zhengyue Zhao, Chaowei Xiao, 26 Aug 2025, PRISM: Robust VLM Alignment with Principled Reasoning for Integrated Safety in Multimodality, https://arxiv.org/abs/2508.18649
  • Trisanth Srinivasan, Santosh Patapati, 27 Aug 2025, Democracy-in-Silico: Institutional Design as Alignment in AI-Governed Polities, https://arxiv.org/abs/2508.19562
  • Julian Arnold, Niels L\"orch, 27 Aug 2025, Decomposing Behavioral Phase Transitions in LLMs: Order Parameters for Emergent Misalignment, https://arxiv.org/abs/2508.20015
  • Mingxi Fu, Fanglei Fu, Xitong Ling, Huaitian Yuan, Tian Guan, Yonghong He, Lianghui Zhu, 27 Aug 2025, Multimodal Prototype Alignment for Semi-supervised Pathology Image Segmentation, https://arxiv.org/abs/2508.19574
  • Chao Huang, Zefeng Zhang, Juewei Yue, Quangang Li, Chuang Zhang, Tingwen Liu, 27 Aug 2025, Safety Alignment Should Be Made More Than Just A Few Attention Heads, https://arxiv.org/abs/2508.19697
  • Md Abdullah Al Mamun, Ihsen Alouani, Nael Abu-Ghazaleh, 28 Aug 2025, Poison Once, Refuse Forever: Weaponizing Alignment for Injecting Bias in LLMs, https://arxiv.org/abs/2508.20333
  • Zhibang Yang, Xinke Jiang, Rihong Qiu, Ruiqing Li, Yihang Zhang, Yue Fang, Yongxin Xu, Hongxin Ding, Xu Chu, Junfeng Zhao, Yasha Wang, 28 Aug 2025, DFAMS: Dynamic-flow guided Federated Alignment based Multi-prototype Search, https://arxiv.org/abs/2508.20353
  • Guillaume Guy, Mihajlo Grbovic, Chun How Tan, Han Zhao, 28 Aug 2025, BiListing: Modality Alignment for Listings, https://arxiv.org/abs/2508.20396
  • Luozhijie Jin, Zijie Qiu, Jie Liu, Zijie Diao, Lifeng Qiao, Ning Ding, Alex Lamb, Xipeng Qiu, 28 Aug 2025, Inference-Time Alignment Control for Diffusion Models with Reinforcement Learning Guidance, https://arxiv.org/abs/2508.21016
  • Harethah Abu Shairah, Hasan Abed Al Kader Hammoud, George Turkiyyah, Bernard Ghanem, 28 Aug 2025, Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection, https://arxiv.org/abs/2508.20766
  • Haoze Wu, Cheng Wang, Wenshuo Zhao, Junxian He, 28 Aug 2025, Model-Task Alignment Drives Distinct RL Outcomes, https://arxiv.org/abs/2508.21188
  • Ephraiem Sarabamoun, 27 Aug 2025, Ensemble Debates with Local Large Language Models for AI Alignment, https://arxiv.org/abs/2509.00091
  • Shiqiao Zhou, Holger Sch\"oner, Huanbo Lyu, Edouard Fouch\'e, Shuo Wang, 30 Aug 2025, BALM-TSF: Balanced Multimodal Alignment for LLM-Based Time Series Forecasting, https://arxiv.org/abs/2509.00622
  • Jinzhou Tang, Jusheng zhang, Sidi Liu, Waikit Xiu, Qinhan Lv, Xiying Li, 29 Aug 2025, Beyond Pixels: Introducing Geometric-Semantic World Priors for Video-based Embodied Models via Spatio-temporal Alignment, https://arxiv.org/abs/2509.00210
  • Sanjeeevan Selvaganapathy and Mehwish Nasim, 31 Aug 2025, Confident, Calibrated, or Complicit: Probing the Trade-offs between Safety Alignment and Ideological Bias in Language Models in Detecting Hate Speech, https://arxiv.org/abs/2509.00673
  • Yu Liu, Yanan Cao, Xixun Lin, Yanmin Shang, Shi Wang, Shirui Pan, 1 Sep 2025, Enhancing Large Language Model for Knowledge Graph Completion via Structure-Aware Alignment-Tuning, https://arxiv.org/abs/2509.01166
  • Hongyu Li, Chaofeng Chen, Xiaoming Li, Guangming Lu, 2 Sep 2025, 2D Gaussian Splatting with Semantic Alignment for Image Inpainting, https://arxiv.org/abs/2509.01964
  • Antoun Yaacoub, J\'er\^ome Da-Rugna, Zainab Assaghir, 30 Aug 2025, Assessing AI-Generated Questions' Alignment with Cognitive Frameworks in Educational Assessment, https://arxiv.org/abs/2504.14232
  • Jonathan Rystr{\o}m, Hannah Rose Kirk and Scott Hale, 30 Aug 2025, Multilingual != Multicultural: Evaluating Gaps Between Multilingual Capabilities and Cultural Alignment in LLMs, https://arxiv.org/abs/2502.16534
  • Dayeon Ki, Rachel Rudinger, Tianyi Zhou, Marine Carpuat, 1 Sep 2025, Multiple LLM Agents Debate for Equitable Cultural Alignment, https://arxiv.org/abs/2505.24671
  • Ertu\u{g}rul Ke\c{c}eci, M\"ujde G\"uzelkaya, Tufan Kumbasar, 3 Sep 2025, A State Alignment-Centric Approach to Federated System Identification: The FedAlign Framework, https://arxiv.org/abs/2503.12137
  • Jiahao Qiu, Yifu Lu, Yifan Zeng, Jiacheng Guo, Jiayi Geng, Chenhao Zhu, Xinzhe Juan, Ling Yang, Huazheng Wang, Kaixuan Huang, Yue Wu, Mengdi Wang, 3 Sep 2025, TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling, https://arxiv.org/abs/2410.16033
  • Madhava Gaikwad, 4 Sep 2025, Murphys Laws of AI Alignment: Why the Gap Always Wins, https://arxiv.org/abs/2509.05381
  • Chengwei Wu, Li Du, Hanyu Zhao, Yiming Ju, Jiapu Wang, Tengfei Pan, 8 Sep 2025, Accelerate Scaling of LLM Alignment via Quantifying the Coverage and Depth of Instruction Set, https://arxiv.org/abs/2509.06463
  • Abhijnan Nath, Carine Graff and Nikhil Krishnaswamy, 7 Sep 2025, Let's Roleplay: Examining LLM Alignment in Collaborative Dialogues, https://arxiv.org/abs/2509.05882
  • Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai, 6 Sep 2025, New Insights into Optimal Alignment of Acoustic and Linguistic Representations for Knowledge Transfer in ASR, https://arxiv.org/abs/2509.05609
  • Shuai Yuan, Zhibo Zhang, Yuxi Li, Guangdong Bai, Wang Kailong, 8 Sep 2025, Embedding Poisoning: Bypassing Safety Alignment via Embedding Semantic Shift, https://arxiv.org/abs/2509.06338
  • Sascha Kaltenpoth, Oliver M\"uller, 9 Sep 2025, Getting In Contract with Large Language Models -- An Agency Theory Perspective On Large Language Model Alignment, https://arxiv.org/abs/2509.07642
  • Xiaomeng Hu, Fei Huang, Chenhan Yuan, Junyang Lin, Tsung-Yi Ho, 1 Sep 2025, CARE: Decoding Time Safety Alignment via Rollback and Introspection Intervention, https://arxiv.org/abs/2509.06982
  • Neal G. Ravindra, Arijit Sehanobish, 22 Aug 2025, Cross-device Zero-shot Label Transfer via Alignment of Time Series Foundation Model Embeddings, https://arxiv.org/abs/2509.06966
  • Ji Xie and Trevor Darrell and Luke Zettlemoyer and XuDong Wang, 8 Sep 2025, Reconstruction Alignment Improves Unified Multimodal Models, https://arxiv.org/abs/2509.07295
  • Andrey Sakhovskiy, Elena Tutubalina, 9 Sep 2025, BALI: Enhancing Biomedical Language Representations through Knowledge Graph and Language Model Alignment, https://arxiv.org/abs/2509.07588
  • Crispin Cooper, Ana Friedrich, Tommaso Reggiani, Wouter Poortinga, 9 Sep 2025, Individual utilities of life satisfaction reveal inequality aversion unrelated to political alignment, https://arxiv.org/abs/2509.07793
  • Tianyi Wang, Jianan Fan, Dingxin Zhang, Dongnan Liu, Yong Xia, Heng Huang, Weidong Cai, 9 Sep 2025, MIRROR: Multi-Modal Pathological Self-Supervised Representation Learning via Modality Alignment and Retention, https://arxiv.org/abs/2503.00374
  • Hasibur Rahman, Smit Desai, 11 Sep 2025, Vibe Check: Understanding the Effects of LLM-Based Conversational Agents' Personality and Alignment on User Perceptions in Goal-Oriented Tasks, https://arxiv.org/abs/2509.09870
  • Yuexi Du, Lihui Chen, Nicha C. Dvornek, 12 Sep 2025, GLAM: Geometry-Guided Local Alignment for Multi-View VLP in Mammography, https://arxiv.org/abs/2509.10344
  • Maysam Behmanesh, Erkan Turan, and Maks Ovsjanikov, 11 Sep 2025, Graph Alignment via Dual-Pass Spectral Encoding and Latent Space Communication, https://arxiv.org/abs/2509.09597
  • Dohun Lee, Hyeonho Jeong, Jiwook Kim, Duygu Ceylan, Jong Chul Ye, 11 Sep 2025, Improving Video Diffusion Transformer Training by Multi-Feature Fusion and Alignment from Self-Supervised Vision Encoders, https://arxiv.org/abs/2509.09547
  • Oriane Peter and Kate Devlin, 9 Sep 2025, Decentralising LLM Alignment: A Case for Context, Pluralism, and Participation, https://arxiv.org/abs/2509.08858
  • Bronson Schoen, Evgenia Nitishinskaya, Mikita Balesni, Axel H{\o}jmark, Felix Hofst\"atter, J\'er\'emy Scheurer, Alexander Meinke, Jason Wolfe, Teun van der Weij, Alex Lloyd, Nicholas Goldowsky-Dill, Angela Fan, Andrei Matveiakin, Rusheb Shah, Marcus Williams, Amelia Glaese, Boaz Barak, Wojciech Zaremba, Marius Hobbhahn, 19 Sep 2025, Stress Testing Deliberative Alignment for Anti-Scheming Training, https://arxiv.org/abs/2509.15541
  • Wenjun Cao, 19 Sep 2025, The Alignment Bottleneck, https://arxiv.org/abs/2509.15932
  • Maithili Joshi, Palash Nandi, Tanmoy Chakraborty, 19 Sep 2025, SABER: Uncovering Vulnerabilities in Safety Alignment via Cross-Layer Residual Connection, https://arxiv.org/abs/2509.16060
  • Nomi Yu (1), Md Ferdous Alam (1), A. John Hart (1), and Faez Ahmed (1) ((1) Massachusetts Institute of Technology), 17 Sep 2025, GenCAD-3D: CAD Program Generation using Multimodal Latent Space Alignment and Synthetic Dataset Balancing, https://arxiv.org/abs/2509.15246
  • Ajsal Shereef Palattuparambil, Thommen George Karimpanal, Santu Rana, 19 Sep 2025, Dynamic Policy Fusion for User Alignment Without Re-Interaction, https://arxiv.org/abs/2409.20016
  • Sifan Wang, Ananyae Kumar Bhartari, Bowen Li, Paris Perdikaris, 19 Sep 2025, Gradient Alignment in Physics-informed Neural Networks: A Second-Order Optimization Perspective, https://arxiv.org/abs/2502.00604
  • Tianhao Zhang, Zhecheng Sheng, Zhexiao Lin, Chen Jiang, Dongyeop Kang, 19 Sep 2025, BBScoreV2: Learning Time-Evolution and Latent Alignment from Stochastic Representation, https://arxiv.org/abs/2405.17764
  • Rashid Mushkani, Hugo Berard, Shin Koseki, 18 Sep 2025, Negotiative Alignment: Embracing Disagreement to Achieve Fairer Outcomes -- Insights from Urban Studies, https://arxiv.org/abs/2503.12613
  • Jeremias Ferrao, Matthijs van der Lende, Ilija Lichkovski, Clement Neo, 16 Sep 2025, The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features, https://arxiv.org/abs/2509.12934
  • Denis Janiak, Julia Moska, Dawid Motyka, Karolina Seweryn, Pawe{\l} Walkowiak, Bartosz \.Zuk, Arkadiusz Janz, 16 Sep 2025, Rethinking the Evaluation of Alignment Methods: Insights into Diversity, Generalisation, and Safety, https://arxiv.org/abs/2509.12936
  • Jinjie Shen, Yaxiong Wang, Lechao Cheng, Nan Pu, Zhun Zhong, 16 Sep 2025, Beyond Artificial Misalignment: Detecting and Grounding Semantic-Coordinated Multimodal Manipulations, https://arxiv.org/abs/2509.12653
  • Qianqi Lu, Yuxiang Xie, Jing Zhang, Shiwei Zou, Yan Chen, Xidao Luan, 16 Sep 2025, TFANet: Three-Stage Image-Text Feature Alignment Network for Robust Referring Image Segmentation, https://arxiv.org/abs/2509.13070
  • Yubo Li, Weiyi Song, 16 Sep 2025, Co-Alignment: Rethinking Alignment as Bidirectional Human-AI Cognitive Adaptation, https://arxiv.org/abs/2509.12179
  • Mohsinul Kabir, Ajwad Abrar, Sophia Ananiadou, 16 Sep 2025, Break the Checkbox: Challenging Closed-Style Evaluations of Cultural Alignment in LLMs, https://arxiv.org/abs/2502.08045
  • Jie Wu, Haoling Li, Xin Zhang, Jianwen Luo, Yangyu Huang, Ruihang Chu, Yujiu Yang, Scarlett Li, 16 Sep 2025, Teaching Your Models to Understand Code via Focal Preference Alignment, https://arxiv.org/abs/2503.02783
  • Jing Xiao, Chang You, Zhiyu Chen, 14 Sep 2025, AlignKT: Explicitly Modeling Knowledge State for Knowledge Tracing with Ideal State Alignment, https://arxiv.org/abs/2509.11135
  • Yining Lu, Zilong Wang, Shiyang Li, Xin Liu, Changlong Yu, Qingyu Yin, Zhan Shi, Zixuan Zhang, Meng Jiang, 14 Sep 2025, Learning to Optimize Multi-Objective Alignment Through Dynamic Reward Weighting, https://arxiv.org/abs/2509.11452
  • Chentao Cao, Xiaojun Xu, Bo Han, Hang Li, 15 Sep 2025, Reasoned Safety Alignment: Ensuring Jailbreak Defense via Answer-Then-Check, https://arxiv.org/abs/2509.11629
  • Jiayou Zhong, Anudeex Shetty, Chao Jia, Xuanrui Lin, Usman Naseem, 12 Sep 2025, Pluralistic Alignment for Healthcare: A Role-Driven Framework, https://arxiv.org/abs/2509.10685
  • Hyeongju Kim, Juheon Lee, Jinhyeok Yang, Jacob Morton, 14 Sep 2025, Length-Aware Rotary Position Embedding for Text-Speech Alignment, https://arxiv.org/abs/2509.11084
  • Etienne Boursier, Nicolas Flammarion, 15 Sep 2025, Early alignment in two-layer networks training is a two-edged sword, https://arxiv.org/abs/2401.10791
  • Zedian Shao, Hongbin Liu, Jaden Mu, Neil Zhenqiang Gong, 15 Sep 2025, Enhancing Prompt Injection Attacks to LLMs via Poisoning Alignment, https://arxiv.org/abs/2410.14827
  • Ankur Samanta, Akshayaa Magesh, Youliang Yu, Runzhe Wu, Ayush Jain, Daniel Jiang, Boris Vidolov, Paul Sajda, Yonathan Efroni, Kaveh Hassani, 18 Sep 2025, Internalizing Self-Consistency in Language Models: Multi-Agent Consensus Alignment, https://arxiv.org/abs/2509.15172
  • Herlock (SeyedAbolfazl) Rahimi, Dionysis Kalogerias, 17 Sep 2025, FedAVOT: Exact Distribution Alignment in Federated Learning via Masked Optimal Transport, https://arxiv.org/abs/2509.14444
  • Natalie Collina, Surbhi Goel, Aaron Roth, Emily Ryu, Mirah Shi, 18 Sep 2025, Emergent Alignment via Competition, https://arxiv.org/abs/2509.15090
  • Andr\'es Corrada-Emmanuel, 10 Sep 2025, No-Knowledge Alarms for Misaligned LLMs-as-Judges, https://arxiv.org/abs/2509.08593
  • Aadit Sengupta, Pratinav Seth, Vinay Kumar Sankarapu, 10 Sep 2025, Interpretability as Alignment: Making Internal Understanding a Design Principle, https://arxiv.org/abs/2509.08592
  • Katalina Hernandez Delgado, 8 Sep 2025, The Law-Following AI Framework: Legal Foundations and Technical Constraints. Legal Analogues for AI Actorship and technical feasibility of Law Alignment, https://arxiv.org/abs/2509.08009
  • Hua Shen, Nicholas Clark, Tanushree Mitra, 9 Sep 2025, Mind the Value-Action Gap: Do LLMs Act in Alignment with Their Values?, https://arxiv.org/abs/2501.15463
  • Vincent Siu, Nicholas Crispino, David Park, Nathan W. Henry, Zhun Wang, Yang Liu, Dawn Song, Chenguang Wang, 16 Sep 2025, SteeringControl: Holistic Evaluation of Alignment Steering in LLMs, https://arxiv.org/abs/2509.13450
  • Zhanting Zhou and Jinshan Lai and Fengchun Zhang and Zeqin Wu and Fengli Zhang, 17 Sep 2025, FedSSG: Expectation-Gated and History-Aware Drift Alignment for Federated Learning, https://arxiv.org/abs/2509.13895
  • Yifan Hu, Jie Yang, Tian Zhou, Peiyuan Liu, Yujin Tang, Rong Jin, Liang Sun, 17 Sep 2025, Bridging Past and Future: Distribution-Aware Alignment for Time Series Forecasting, https://arxiv.org/abs/2509.14181
  • Jack McKinlay, Marina De Vos, Janina A. Hoffmann, Andreas Theodorou, 17 Sep 2025, Understanding the Process of Human-AI Value Alignment, https://arxiv.org/abs/2509.13854
  • Elena Camuffo, Francesco Barbato, Mete Ozay, Simone Milani, Umberto Michieli, 17 Sep 2025, MOCHA: Multi-modal Objects-aware Cross-arcHitecture Alignment, https://arxiv.org/abs/2509.14001
  • Puru Vaish, Felix Meister, Tobias Heimann, Christoph Brune, Jelmer M. Wolterink, 17 Sep 2025, Consistent View Alignment Improves Foundation Models for 3D Medical Image Segmentation, https://arxiv.org/abs/2509.13846
  • Yuu Jinnai, Ukyo Honda, 17 Sep 2025, Annotation-Efficient Language Model Alignment via Diverse and Representative Response Texts, https://arxiv.org/abs/2405.13541
  • Zhiqiang Yuan, Weitong Chen, Hanlin Wang, Kai Yu, Xin Peng, Yiling Lou, 17 Sep 2025, Semantic Alignment-Enhanced Code Translation via an LLM-Based Multi-Agent System, https://arxiv.org/abs/2409.19894
  • Rohan Wadhawan, Fabrice Y Harel-Canada, Zi-Yi Dou, Suhaila Shakiah, Robinson Piramuthu, Nanyun Peng, 2 Oct 2025, VaPR -- Vision-language Preference alignment for Reasoning, https://arxiv.org/abs/2510.01700
  • Crystal Qian, Aaron Parisi, Cl\'ementine Bouleau, Vivian Tsai, Ma\"el Lebreton, Lucas Dixon, 2 Oct 2025, To Mask or to Mirror: Human-AI Alignment in Collective Reasoning, https://arxiv.org/abs/2510.01924
  • Hengwei Zhao, Zhengzhong Tu, Zhuo Zheng, Wei Wang, Junjue Wang, Rusty Feagin, Wenzhe Jiao, 30 Sep 2025, Noisy-Pair Robust Representation Alignment for Positive-Unlabeled Learning, https://arxiv.org/abs/2510.01278
  • Kevin Zhai, Utsav Singh, Anirudh Thatipelli, Souradip Chakraborty, Anit Kumar Sahu, Furong Huang, Amrit Singh Bedi, Mubarak Shah, 2 Oct 2025, MIRA: Towards Mitigating Reward Hacking in Inference-Time Alignment of T2I Diffusion Models, https://arxiv.org/abs/2510.01549
  • Shaan Shah, Meenakshi Khosla, 2 Oct 2025, Representational Alignment Across Model Layers and Brain Regions with Hierarchical Optimal Transport, https://arxiv.org/abs/2510.01706
  • Isa Inuwa-Dutse, 26 Sep 2025, OpenAI's GPT-OSS-20B Model and Safety Alignment Issues in a Low-Resource Language, https://arxiv.org/abs/2510.01266
  • Jiping Li and Rishi Sonthalia, 1 Oct 2025, Risk Phase Transitions in Spiked Regression: Alignment Driven Benign and Catastrophic Overfitting, https://arxiv.org/abs/2510.01414
  • Ching-Huei Tsou, Michal Ozery-Flato, Ella Barkan, Diwakar Mahajan, Ben Shapira, 1 Oct 2025, BioVERSE: Representation Alignment of Biomedical Modalities to LLMs for Multi-Modal Reasoning, https://arxiv.org/abs/2510.01428
  • Bo Ma and LuYao Liu and Simon Lau and Chandler Yuan and and XueY Cui and Rosie Zhang, 2 Oct 2025, Bridging Collaborative Filtering and Large Language Models with Dynamic Alignment, Multimodal Fusion and Evidence-grounded Explanations, https://arxiv.org/abs/2510.01606
  • Nilay Naharas, Dang Nguyen, Nesihan Bulut, Mohammadhossein Bateni, Vahab Mirrokni, Baharan Mirzasoleiman, 1 Oct 2025, Data Selection for Fine-tuning Vision Language Models via Cross Modal Alignment Trajectories, https://arxiv.org/abs/2510.01454
  • Mattson Ogg, Ritwik Bose, Jamie Scharf, Christopher Ratto, Michael Wolmetz, 1 Oct 2025, A Flexible Method for Behaviorally Measuring Alignment Between Human and Artificial Intelligence Using Representational Similarity Analysis, https://arxiv.org/abs/2412.00577
  • Hao Wang, Licheng Pan, Zhichao Chen, Xu Chen, Qingyang Dai, Lei Wang, Haoxuan Li, Zhouchen Lin, 2 Oct 2025, Time-o1: Time-Series Forecasting Needs Transformed Label Alignment, https://arxiv.org/abs/2505.17847
  • Suli Wang, Yangshen Deng, Zhenghua Bao, Xinyu Zhan, Yiqun Duan, 1 Oct 2025, NeuroTTT: Bridging Pretraining-Downstream Task Misalignment in EEG Foundation Models via Test-Time Training, https://arxiv.org/abs/2509.26301
  • Jianwei Li and Jung-Eun Kim, 2 Oct 2025, Superficial Safety Alignment Hypothesis, https://arxiv.org/abs/2410.10862
  • Heng Zhang, Tianyi Zhang, Yuling Shi, Xiaodong Gu, Yaomin Shen, Haochen You, Zijian Zhang, Yilei Yuan and Jin Huang, 14 Oct 2025, GraphShaper: Geometry-aware Alignment for Improving Transfer Learning in Text-Attributed Graphs, https://arxiv.org/abs/2510.12085
  • Ruben Belo, Claudia Soares, Marta Guimaraes, 14 Oct 2025, Keep Calm and Avoid Harmful Content: Concept Alignment and Latent Manipulation Towards Safer Answers, https://arxiv.org/abs/2510.12672
  • Simone Carnemolla, Matteo Pennisi, Chiara Russo, Simone Palazzo, Daniela Giordano, Concetto Spampinato, 10 Oct 2025, SeeingSounds: Learning Audio-to-Visual Alignment via Text, https://arxiv.org/abs/2510.11738
  • Zhiyu Wang, Bingxin Zhou, Jing Wang, Yang Tan, Weishu Zhao, Pietro Li\`o, Liang Hong, 12 Oct 2025, Fast and Interpretable Protein Substructure Alignment via Optimal Transport, https://arxiv.org/abs/2510.11752
  • Yukun Zhang, and Qi Dong, 14 Oct 2025, Hierarchical Alignment: Surgical Fine-Tuning via Functional Layer Specialization in Large Language Models, https://arxiv.org/abs/2510.12044
  • Bajian Xiang, Shuaijiang Zhao, Tingwei Guo, Wei Zou, 14 Oct 2025, Understanding the Modality Gap: An Empirical Study on the Speech-Text Alignment Mechanism of Large Speech Language Models, https://arxiv.org/abs/2510.12116
  • Suyash Fulay, Jocelyn Zhu, Michiel Bakker, 14 Oct 2025, From Delegates to Trustees: How Optimizing for Long-Term Interests Shapes Bias and Alignment in LLM, https://arxiv.org/abs/2510.12689
  • Samuel Yeh, Sharon Li, 14 Oct 2025, Clean First, Align Later: Benchmarking Preference Data Cleaning for Reliable LLM Alignment, https://arxiv.org/abs/2509.23564
  • Yi Liu, Dianqing Liu, Mingye Zhu, Junbo Guo, Yongdong Zhang, Zhendong Mao, 14 Oct 2025, Leveraging Importance Sampling to Detach Alignment Modules from Large Language Models, https://arxiv.org/abs/2505.19700
  • Peiyan Li, Yixiang Chen, Hongtao Wu, Xiao Ma, Xiangnan Wu, Yan Huang, Liang Wang, Tao Kong, Tieniu Tan, 14 Oct 2025, BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models, https://arxiv.org/abs/2506.07961
  • Ekaterina Redekop, Mara Pleasure, Zichen Wang, Kimberly Flores, Anthony Sisk, William Speier, Corey W. Arnold, 13 Oct 2025, SPADE: Spatial Transcriptomics and Pathology Alignment Using a Mixture of Data Experts for an Expressive Latent Space, https://arxiv.org/abs/2506.21857
  • Trishna Chakraborty, Erfan Shayegani, Zikui Cai, Nael Abu-Ghazaleh, M. Salman Asif, Yue Dong, Amit K. Roy-Chowdhury, Chengyu Song, 14 Oct 2025, Cross-Modal Safety Alignment: Is textual unlearning all you need?, https://arxiv.org/abs/2406.02575
  • Suhyeon Lee, Jong Chul Ye, 1 Oct 2025, Plug-and-Play Prompt Refinement via Latent Feedback for Diffusion Model Alignment, https://arxiv.org/abs/2510.00430
  • Jaewoo Lee, Minsu Kim, Sanghyeok Choi, Inhyuck Song, Sujin Yun, Hyeongyu Kang, Woocheol Shin, Taeyoung Yun, Kiyoung Om, Jinkyoo Park, 1 Oct 2025, Diffusion Alignment as Variational Expectation-Maximization, https://arxiv.org/abs/2510.00502
  • ShengYun Peng, Eric Smith, Ivan Evtimov, Song Jiang, Pin-Yu Chen, Hongyuan Zhan, Haozhu Wang, Duen Horng Chau, Mahesh Pasupuleti, Jianfeng Chi, 1 Oct 2025, Large Reasoning Models Learn Better Alignment from Flawed Thinking, https://arxiv.org/abs/2510.00938
  • Yiran Shen, Yu Xia, Jonathan Chang, Prithviraj Ammanabrolu, 1 Oct 2025, Simultaneous Multi-objective Alignment Across Verifiable and Non-verifiable Rewards, https://arxiv.org/abs/2510.01167
  • Han Zhou, Jinjin Cao, Liyuan Ma, Xueji Fang, Guo-jun Qi, 1 Oct 2025, From Human Hands to Robot Arms: Manipulation Skills Transfer via Trajectory Alignment, https://arxiv.org/abs/2510.00491
  • Yuanfang Xiang, Lun Ai, 1 Oct 2025, Adaptive Data-Knowledge Alignment in Genetic Perturbation Prediction, https://arxiv.org/abs/2510.00512
  • Eunki Kim, Na Min An, James Thorne, Hyunjung Shim, 1 Oct 2025, Multi-Objective Task-Aware Predictor for Image-Text Alignment, https://arxiv.org/abs/2510.00766
  • Akshat Naik, Patrick Quinn, Guillermo Bosch, Emma Goun\'e, Francisco Javier Campos Zabala, Jason Ross Brown, Edward James Young, 1 Oct 2025, AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents, https://arxiv.org/abs/2506.04018
  • Xiaoyang Cao, Zelai Xu, Mo Guang, Kaiwen Long, Michiel A. Bakker, Yu Wang and Chao Yu, 1 Oct 2025, Latent Collective Preference Optimization: A General Framework for Robust LLM Alignment, https://arxiv.org/abs/2509.24159
  • Jonathan Geuter and Youssef Mroueh and David Alvarez-Melis, 30 Sep 2025, Guided Speculative Inference for Efficient Test-Time Alignment of LLMs, https://arxiv.org/abs/2506.04118
  • Md Hasan Shahriar, Md Mohaimin Al Barat, Harshavardhan Sundar, Ning Zhang, Naren Ramakrishnan, Y. Thomas Hou, Wenjing Lou, 1 Oct 2025, Temporal Misalignment Attacks against Multimodal Perception in Autonomous Driving, https://arxiv.org/abs/2507.09095
  • Zitong Lu, Yile Wang and Julie D. Golomb, 1 Oct 2025, Achieving More Human Brain-Like Vision via Human EEG Representational Alignment, https://arxiv.org/abs/2401.17231
  • Tong Nie, Yuewen Mei, Yihong Tang, Junlin He, Jie Sun, Haotian Shi, Wei Ma, Jian Sun, 24 Sep 2025, Steerable Adversarial Scenario Generation through Test-Time Preference Alignment, https://arxiv.org/abs/2509.20102
  • Dongming Huang, Zhifan Li, Yicheng Li, Qian Lin, 24 Sep 2025, Alignment-Sensitive Minimax Rates for Spectral Algorithms with Learned Kernels, https://arxiv.org/abs/2509.20294
  • Chengkai Huang, Junda Wu, Zhouhang Xie, Yu Xia, Rui Wang, Tong Yu, Subrata Mitra, Julian McAuley, Lina Yao, 15 Sep 2025, Pluralistic Off-policy Evaluation and Alignment, https://arxiv.org/abs/2509.19333
  • Jiarui Jin, Xiaocheng Fang, Haoyu Wang, Jun Li, Che Liu, Donglin Xie, Hongyan Li, Shenda Hong, 23 Sep 2025, Self-Alignment Learning to Improve Myocardial Infarction Detection from Single-Lead ECG, https://arxiv.org/abs/2509.19397
  • Sheng-Bin Duan, Jian-Long Hao, Tian-Yu Xiang, Xiao-Hu Zhou, Mei-Jiang Gui, Xiao-Liang Xie, Shi-Qi Liu, and Zeng-Guang Hou, 23 Sep 2025, Online Adaptation via Dual-Stage Alignment and Self-Supervision for Fast-Calibration Brain-Computer Interfaces, https://arxiv.org/abs/2509.19403
  • Shuyu Zhang, Yifan Wei, Xinru Wang, Yanmin Zhu, Yangfan He, Yixuan Weng, Bin Li, 24 Sep 2025, HiCoLoRA: Addressing Context-Prompt Misalignment via Hierarchical Collaborative LoRA for Zero-Shot DST, https://arxiv.org/abs/2509.19742
  • Ruoxi Cheng, Haoxuan Ma, Weixin Wang, Ranjie Duan, Jiexi Liu, Xiaoshuang Jia, Simeng Qin, Xiaochun Cao, Yang Liu, Xiaojun Jia, 24 Sep 2025, Inverse Reinforcement Learning with Dynamic Reward Scaling for LLM Alignment, https://arxiv.org/abs/2503.18991
  • Jiaxun Yang, Yifei Han, Long Zhang, Yujie Liu, Bin Li, Bo Gao, Yangfan He, Kejia Zhan, 24 Sep 2025, CPCLDETECTOR: Knowledge Enhancement and Alignment Selection for Chinese Patronizing and Condescending Language Detection, https://arxiv.org/abs/2509.18562
  • Aymane El Gadarri, Ali Aouad, Vivek F. Farias, 28 Oct 2025, The Sign Estimator: LLM Alignment in the Face of Choice Heterogeneity, https://arxiv.org/abs/2510.23965
  • Yuxuan Tang, Yifan Feng, 24 Oct 2025, Beyond Pairwise: Empowering LLM Alignment With Ranked Choice Modeling, https://arxiv.org/abs/2510.23631
  • Hao Wang, Licheng Pan, Yuan Lu, Zhixuan Chu, Xiaoxi Li, Shuting He, Zhichao Chen, Haoxuan Li, Qingsong Wen, Zhouchen Lin, 28 Oct 2025, DistDF: Time-Series Forecasting Needs Joint-Distribution Wasserstein Alignment, https://arxiv.org/abs/2510.24574
  • Kang Zhang, Trung X. Pham, Suyeon Lee, Axi Niu, Arda Senocak, Joon Son Chung, 28 Oct 2025, Model-Guided Dual-Role Alignment for High-Fidelity Open-Domain Video-to-Audio Generation, https://arxiv.org/abs/2510.24103
  • Jian Gu, Aldeida Aleti, Chunyang Chen, Hongyu Zhang, 28 Oct 2025, Beyond Neural Incompatibility: Easing Cross-Scale Knowledge Transfer in Large Language Models through Latent Semantic Alignment, https://arxiv.org/abs/2510.24208
  • Nahid Torbati, Michael Gaebler, Simon M. Hofmann, Nico Scherf, 27 Oct 2025, Geometry matters: insights from Ollivier Ricci Curvature and Ricci Flow into representational alignment through Ollivier-Ricci Curvature and Ricci Flow, https://arxiv.org/abs/2501.00919
  • Nevan Wichers, Aram Ebtekar, Ariana Azarbal, Victor Gillioz, Christine Ye, Emil Ryd, Neil Rathi, Henry Sleight, Alex Mallen, Fabien Roger, Samuel Marks, 27 Oct 2025, Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment, https://arxiv.org/abs/2510.05024
  • Bingsheng Yao, Bo Sun, Yuanzhe Dong, Yuxuan Lu, Dakuo Wang, 27 Oct 2025, DPRF: A Generalizable Dynamic Persona Refinement Framework for Optimizing Behavior Alignment Between Personalized LLM Role-Playing Agents and Humans, https://arxiv.org/abs/2510.14205
  • Sibo Xiao, Jinyuan Fu, Zhongle Xie, Lidan Shou, 28 Oct 2025, TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs, https://arxiv.org/abs/2510.15545
  • Yiwen Peng (IP Paris), Thomas Bonald (IP Paris), Fabian M. Suchanek (IP Paris), 23 Oct 2025, FLORA: Unsupervised Knowledge Graph Alignment by Fuzzy Logic, https://arxiv.org/abs/2510.20467
  • Abdulmomen Ghalkha, Zhuojun Tian, Chaouki Ben Issaid, and Mehdi Bennis, 23 Oct 2025, SheafAlign: A Sheaf-theoretic Framework for Decentralized Multimodal Alignment, https://arxiv.org/abs/2510.20540
  • Zhiyu Lin, Jingwen Yang, Jiale Zhao, Meng Liu, Sunzhu Li, Benyou Wang, 23 Oct 2025, Decoding the Ear: A Framework for Objectifying Expressiveness from Human Preference Through Efficient Alignment, https://arxiv.org/abs/2510.20513
  • Wonje Jeung, Sangyeon Yoon, Minsuk Kahng, Albert No, 23 Oct 2025, SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment, https://arxiv.org/abs/2505.14667
  • Siqi Zhu, David Zhang, Pedro Cisneros-Velarde, Jiaxuan You, 22 Oct 2025, GTAlign: Game-Theoretic Alignment of LLM Assistants for Mutual Welfare, https://arxiv.org/abs/2510.08872
  • Shriram Chennakesavalu, Frank Hu, Sebastian Ibarraran, Grant M. Rotskoff, 22 Oct 2025, Aligning Transformers with Continuous Feedback via Energy Rank Alignment, https://arxiv.org/abs/2405.12961
  • Yibo Wen, Chenwei Xu, Jerry Yao-Chieh Hu, Kaize Ding, Han Liu, 23 Oct 2025, Pareto-Optimal Energy Alignment for Designing Nature-Like Antibodies, https://arxiv.org/abs/2412.20984
  • Shenzhi Yang, Junbo Zhao, Sharon Li, Shouqing Yang, Dingyu Yang, Xiaofang Zhang, Haobo Wang, 23 Oct 2025, Harnessing Feature Resonance under Arbitrary Target Alignment for Out-of-Distribution Node Detection, https://arxiv.org/abs/2502.16076
  • Kunwoong Kim, Jihu Lee, Sangchul Park, Yongdai Kim, 23 Oct 2025, Fair Clustering via Alignment, https://arxiv.org/abs/2505.09131
  • Jianhui Chen, Xiaozhi Wang, Zijun Yao, Yushi Bai, Lei Hou, Juanzi Li, 23 Oct 2025, Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons, https://arxiv.org/abs/2406.14144
  • Shun Lei, Yaoxun Xu, Zhiwei Lin, Huaicheng Zhang, Wei Tan, Hangting Chen, Jianwei Yu, Yixuan Zhang, Chenyu Yang, Haina Zhu, Shuai Wang, Zhiyong Wu, Dong Yu, 23 Oct 2025, LeVo: High-Quality Song Generation with Multi-Preference Alignment, https://arxiv.org/abs/2506.07520
  • Bingyu Li, Feiyu Wang, Da Zhang, Zhiyuan Zhao, Junyu Gao, Xuelong Li, 23 Oct 2025, MARIS: Marine Open-Vocabulary Instance Segmentation with Geometric Enhancement and Semantic Alignment, https://arxiv.org/abs/2510.15398
  • MingSheng Li, Guangze Zhao, Sichen Liu, 10 Oct 2025, VisuoAlign: Safety Alignment of LVLMs with Multimodal Tree Search, https://arxiv.org/abs/2510.15948
  • Ali Shirali, 18 Oct 2025, The Burden of Interactive Alignment with Inconsistent Preferences, https://arxiv.org/abs/2510.16368
  • Amir Jalilifard, Anderson de Rezende Rocha, Marcos Medeiros Raimundo, 20 Oct 2025, Reasoning Distillation and Structural Alignment for Improved Code Generation, https://arxiv.org/abs/2510.17598
  • Archie Chaudhury, 17 Oct 2025, Alignment is Localized: A Causal Probe into Preference Layers, https://arxiv.org/abs/2510.16167
  • Jiayi Huang, Sangwoo Park, Nicola Paoletti, Osvaldo Simeone, 20 Oct 2025, Reliable Inference in Edge-Cloud Model Cascades via Conformal Alignment, https://arxiv.org/abs/2510.17543
  • Yilin Wu, Anqi Li, Tucker Hermans, Fabio Ramos, Andrea Bajcsy, Claudia P'erez-D'Arpino, 18 Oct 2025, Do What You Say: Steering Vision-Language-Action Models via Runtime Reasoning-Action Alignment Verification, https://arxiv.org/abs/2510.16281
  • Jihoon Kwon, Kyle Min, Jy-yong Sohn, 18 Oct 2025, Enhancing Compositional Reasoning in CLIP via Reconstruction and Alignment of Text Descriptions, https://arxiv.org/abs/2510.16540
  • Junhao Zhao, Zishuai Liu, Ruili Fang, Jin Lu, Linghan Zhang, Fei Dou, 19 Oct 2025, CARE: Contrastive Alignment for ADL Recognition from Event-Triggered Sensor Streams, https://arxiv.org/abs/2510.16988
  • Tiancheng Hu, Benjamin Minixhofer, Nigel Collier, 20 Oct 2025, Navigating the Alignment-Calibration Trade-off: A Pareto-Superior Frontier via Model Merging, https://arxiv.org/abs/2510.17426
  • Daria Korotyshova, Boris Shaposhnikov, Alexey Malakhov, Alexey Khokhulin, Nikita Surnachev, Kirill Ovcharenko, George Bredis, Alexey Gorbatovski, Viacheslav Sinii, Daniil Gavrilov, 18 Oct 2025, ESSA: Evolutionary Strategies for Scalable Alignment, https://arxiv.org/abs/2507.04453
  • Jaemin Kim, Bryan Sangwoo Kim, Jong Chul Ye, 19 Oct 2025, Free$^2$Guide: Training-Free Text-to-Video Alignment using Image LVLM, https://arxiv.org/abs/2411.17041
  • Shijing Hu, Jingyang Li, Xingyu Xie, Zhihui Lu, Kim-Chuan Toh and Pan Zhou, 19 Oct 2025, GRIFFIN: Effective Token Alignment for Faster Speculative Decoding, https://arxiv.org/abs/2502.11018
  • Xilong Cheng, Yunxiao Qin, Yuting Tan, Zhengnan Li, Ye Wang, Hongjiang Xiao, Yuan Zhang, 20 Oct 2025, PsyMem: Fine-grained psychological alignment and Explicit Memory Control for Advanced Role-Playing LLMs, https://arxiv.org/abs/2505.12814
  • Guy Dar, 19 Oct 2025, mini-vec2vec: Scaling Universal Geometry Alignment with Linear Transformations, https://arxiv.org/abs/2510.02348
  • Saad Obaid ul Islam, Anne Lauscher, Goran Glava\v{s}, 18 Oct 2025, The Curious Case of Factual (Mis)Alignment between LLMs' Short- and Long-Form Answers, https://arxiv.org/abs/2510.11218
  • Yingpeng Ning, Yuanyuan Sun, Ling Luo, Yanhua Wang, Yuchen Pan and Hongfei Lin, 18 Oct 2025, MedTrust-RAG: Evidence Verification and Trust Alignment for Biomedical Question Answering, https://arxiv.org/abs/2510.14400
  • Haofeng Huang, Yifei Han, Long Zhang, Bin Li, Yangfan He, 22 Sep 2025, MVCL-DAF++: Enhancing Multimodal Intent Recognition via Prototype-Aware Contrastive Alignment and Coarse-to-Fine Dynamic Attention Fusion, https://arxiv.org/abs/2509.17446
  • Deuksin Kwon, Kaleen Shrestha, Bin Han, Elena Hayoung Lee, Gale Lucas, 19 Sep 2025, Evaluating Behavioral Alignment in Conflict Dialogue: A Multi-Dimensional Comparison of LLM Agents and Humans, https://arxiv.org/abs/2509.16394
  • Yunzhe Wang, Gale M. Lucas, Burcin Becerik-Gerber, Volkan Ustun, 19 Sep 2025, Implicit Behavioral Alignment of Language Agents in High-Stakes Crowd Simulations, https://arxiv.org/abs/2509.16457
  • Abhirama Subramanyam Penamakuri, Navlika Singh, Piyush Arora, Anand Mishra, 20 Sep 2025, When Big Models Train Small Ones: Label-Free Model Parity Alignment for Efficient Visual Question Answering using Small VLMs, https://arxiv.org/abs/2509.16633
  • Ragib Amin Nihal, Benjamin Yen, Takeshi Ashizawa, Kazuhiro Nakadai, 21 Sep 2025, Cross-Attention with Confidence Weighting for Multi-Channel Audio Alignment, https://arxiv.org/abs/2509.16926
  • Qian Zhang, Lin Zhang, Xing Fang, Mingxin Zhang, Zhiyuan Wei, Ran Song, Wei Zhang, 21 Sep 2025, Informative Text-Image Alignment for Visual Affordance Learning with Foundation Models, https://arxiv.org/abs/2509.17074
  • Junsong Li, Jie Zhou, Bihao Zhan, Yutao Yang, Qianjun Pan, Shilian Chen, Tianyu Huai, Xin Li, Qin Chen, Liang He, 21 Sep 2025, LifeAlign: Lifelong Alignment for Large Language Models with Memory-Augmented Focalized Preference Optimization, https://arxiv.org/abs/2509.17183
  • Runjia Zeng, James Chenhao Liang, Cheng Han, Zhiwen Cao, Jiahao Liu, Xiaojun Quan, Yingjie Victor Chen, Lifu Huang, Tong Geng, Qifan Wang, Dongfang Liu, 21 Sep 2025, Probabilistic Token Alignment for Large Language Model Fusion, https://arxiv.org/abs/2509.17276
  • Dujin Lee, Sojung An, Jungmyung Wi, Kuniaki Saito and Donghyun Kim, 22 Sep 2025, Training-Free Label Space Alignment for Universal Domain Adaptation, https://arxiv.org/abs/2509.17452
  • Sheng Huang and Jiexuan Yan and Beiyan Liu and Bo Liu and Richang Hong, 22 Sep 2025, Dual-View Alignment Learning with Hierarchical-Prompt for Class-Imbalance Multi-Label Classification, https://arxiv.org/abs/2509.17747
  • Romain Thoreau, Jessie Levillain, Dawa Derksen, 22 Sep 2025, Can multimodal representation learning by alignment preserve modality-specific information?, https://arxiv.org/abs/2509.17943
  • G. R. Lau, W. Y. Low, S. M. Koh, A. Hartanto, 20 Sep 2025, Evaluating AI Alignment in Eleven LLMs through Output-Based Analysis and Human Benchmarking, https://arxiv.org/abs/2506.12617
  • Zijian Zhao, Zhijie Cai, Tingwei Chen, Xiaoyang Li, Hang Li, Qimei Chen, Guangxu Zhu, 21 Sep 2025, KNN-MMD: Cross Domain Wireless Sensing via Local Distribution Alignment, https://arxiv.org/abs/2412.04783
  • Jean-Charles Noirot Ferrand, Yohan Beugin, Eric Pauley, Ryan Sheatsley, Patrick McDaniel, 19 Sep 2025, Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs, https://arxiv.org/abs/2501.16534
  • Dhruv Agarwal, Anya Shukla, Sunayana Sitaram, Aditya Vashistha, 21 Sep 2025, Fluent but Foreign: Even Regional LLMs Lack Cultural Alignment, https://arxiv.org/abs/2505.21548
  • Ji Huang, Mengfei Li and Shuai Shao, 24 Oct 2025, Distribution Shift Alignment Helps LLMs Simulate Survey Response Distributions, https://arxiv.org/abs/2510.21977
  • Ashwin Ramachandran, Vaibhav Raj, Indrayumna Roy, Soumen Chakrabarti, Abir De, 26 Oct 2025, Iteratively Refined Early Interaction Alignment for Subgraph Matching based Graph Retrieval, https://arxiv.org/abs/2510.22538
  • Mohammad Tariqul Islam, Du Liu, Deblina Sarkar, 27 Oct 2025, Manifold Approximation leads to Robust Kernel Alignment, https://arxiv.org/abs/2510.22953
  • Kejia Chen, Jiawen Zhang, Jiacong Hu, Kewei Gao, Jian Lou, Zunlei Feng, Mingli Song, 20 Oct 2025, Token-Level Inference-Time Alignment for Vision-Language Models, https://arxiv.org/abs/2510.21794
  • Ariel Flint, Luca Maria Aiello, Romualdo Pastor-Satorras, Andrea Baronchelli, 25 Oct 2025, Group size effects and collective misalignment in LLM multi-agent systems, https://arxiv.org/abs/2510.22422
  • Jing Yang, Yufeng Yang, 26 Oct 2025, DynaPose4D: High-Quality 4D Dynamic Content Generation via Pose Alignment Loss, https://arxiv.org/abs/2510.22473
  • Zahraa Al Sahili, Maryam Fetanat, Maimuna Nowaz, Ioannis Patras, Matthew Purver, 26 Oct 2025, FairJudge: MLLM Judging for Social Attributes and Prompt Image Alignment, https://arxiv.org/abs/2510.22827
  • Zexi Li, Zhiqi Li, Jie Lin, Tao Shen, Jun Xiao, Yike Guo, Tao Lin, and Chao Wu, 27 Oct 2025, Improving Model Fusion by Training-time Neuron Alignment with Fixed Neuron Anchors, https://arxiv.org/abs/2402.01342
  • Taehoon Yoon, Yunhong Min, Kyeongmin Yeo, Minhyuk Sung, 27 Oct 2025, Psi-Sampler: Initial Particle Sampling for SMC-Based Inference-Time Reward Alignment in Score Models, https://arxiv.org/abs/2506.01320
  • Lily Hong Zhang and Smitha Milli and Karen Jusko and Jonathan Smith and Brandon Amos and Wassim Bouaziz and Manon Revel and Jack Kussman and Yasha Sheynin and Lisa Titus and Bhaktipriya Radharapu and Jane Yu and Vidya Sarma and Kris Rose and Maximilian Nickel, 24 Oct 2025, Cultivating Pluralism In Algorithmic Monoculture: The Community Alignment Dataset, https://arxiv.org/abs/2507.09650
  • Aladin Djuhera, Swanand Ravindra Kadhe, Farhan Ahmed, Syed Zawad, Holger Boche, 27 Oct 2025, SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging, https://arxiv.org/abs/2503.17239
  • Peter Chen, Xi Chen, Wotao Yin, Tianyi Lin, 25 Oct 2025, ComPO: Preference Alignment via Comparison Oracles, https://arxiv.org/abs/2505.05465
  • Xiaojian Ding, Lin Zhao, Xian Li, Xiaoying Zhu, 27 Oct 2025, Incomplete Multi-view Clustering via Hierarchical Semantic Alignment and Cooperative Completion, https://arxiv.org/abs/2510.13887
  • Fang Chen, Alex Villa, Gongbo Liang, Xiaoyi Lu, Meng Tang, 24 Oct 2025, Contrastive Conditional-Unconditional Alignment for Long-tailed Diffusion Model, https://arxiv.org/abs/2507.09052
  • Liangwei Nathan Zheng, Wenhao Liang, Wei Emma Zhang, Miao Xu, Olaf Maennel, Weitong Chen, 14 Oct 2025, Lifting Manifolds to Mitigate Pseudo-Alignment in LLM4TS, https://arxiv.org/abs/2510.12847
  • Haolin Li, Hoda Bidkhori, 14 Oct 2025, FedGTEA: Federated Class-Incremental Learning with Gaussian Task Embedding and Alignment, https://arxiv.org/abs/2510.12927
  • Joshua R. Tempelman, Adam J. Wachtor, Eric B. Flynn, 14 Oct 2025, Machine Learning-Based Ultrasonic Weld Characterization Using Hierarchical Wave Modeling and Diffusion-Driven Distribution Alignment, https://arxiv.org/abs/2510.13023
  • Zizhuo Zhang, Qizhou Wang, Shanshan Ye, Jianing Zhu, Jiangchao Yao, Bo Han, Masashi Sugiyama, 15 Oct 2025, Towards Understanding Valuable Preference Data for Large Language Model Alignment, https://arxiv.org/abs/2510.13212
  • Bingbin Liu, Rachit Bansal, Depen Morwani, Nikhil Vyas, David Alvarez-Melis, Sham M. Kakade, 15 Oct 2025, Adam or Gauss-Newton? A Comparative Study In Terms of Basis Alignment and SGD Noise, https://arxiv.org/abs/2510.13680
  • Chenghao Yang, Ari Holtzman, 14 Oct 2025, LLM Probability Concentration: How Alignment Shrinks the Generative Horizon, https://arxiv.org/abs/2506.17871
  • Lin Zhu, Yifeng Yang, Xinbing Wang, Qinying Gu, Nanyang Ye, 15 Oct 2025, $\Delta \mathrm{Energy}$: Optimizing Energy Change During Vision-Language Alignment Improves both OOD Detection and OOD Generalization, https://arxiv.org/abs/2510.11296
  • Sathwik Karnik, Somil Bansal, 25 Sep 2025, Preemptive Detection and Steering of LLM Misalignment via Latent Reachability, https://arxiv.org/abs/2509.21528
  • Saurabh Kataria, Davood Fattahi, Minxiao Wang, Ran Xiao, Matthew Clark, Timothy Ruchti, Mark Mai, Xiao Hu, 25 Sep 2025, Wav2Arrest 2.0: Long-Horizon Cardiac Arrest Prediction with Time-to-Event Modeling, Identity-Invariance, and Pseudo-Lab Alignment, https://arxiv.org/abs/2509.21695
  • Aayush Mishra, Daniel Khashabi, Anqi Liu, 26 Sep 2025, IA2: Alignment with ICL Activations Improves Supervised Fine-Tuning, https://arxiv.org/abs/2509.22621
  • Jason Jordan, Mohammadreza Akbari Lor, Peter Koulen, Mei-Ling Shyu, Shu-Ching Chen, 21 Sep 2025, MDF-MLLM: Deep Fusion Through Cross-Modal Feature Alignment for Contextually Aware Fundoscopic Image Classification, https://arxiv.org/abs/2509.21358
  • Sualeha Farid, Jayden Lin, Zean Chen, Shivani Kumar, David Jurgens, 25 Sep 2025, One Model, Many Morals: Uncovering Cross-Linguistic Misalignments in Computational Moral Reasoning, https://arxiv.org/abs/2509.21443
  • Junno Yun, Ya\c{s}ar Utku Al\c{c}alar, Mehmet Ak\c{c}akaya, 25 Sep 2025, No Alignment Needed for Generation: Learning Linearly Separable Representations in Diffusion Models, https://arxiv.org/abs/2509.21565
  • Hongbin Zhang, Kehai Chen, Xuefeng Bai, Yang Xiang, Min Zhang, 26 Sep 2025, Evaluating and Improving Cultural Awareness of Reward Models for LLM Alignment, https://arxiv.org/abs/2509.21798
  • Shijing Hu, Jingyang Li, Zhihui Lu and Pan Zhou, 26 Sep 2025, Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding, https://arxiv.org/abs/2509.22134
  • Yifang Zhang, Pengfei Duan, Yiwen Yang, Shengwu Xiong, 26 Sep 2025, Beyond Textual Context: Structural Graph Encoding with Adaptive Space Alignment to alleviate the hallucination of LLMs, https://arxiv.org/abs/2509.22251
  • Yasmine Omri, Connor Ding, Tsachy Weissman, Thierry Tambe, 26 Sep 2025, Vision-Language Alignment from Compressed Image Representations using 2D Gaussian Splatting, https://arxiv.org/abs/2509.22615
  • Jingzhi Hu, Geoffrey Ye Li, 26 Sep 2025, Distillation-Enabled Knowledge Alignment Protocol for Semantic Communication in AI Agent Networks, https://arxiv.org/abs/2505.17030
  • Toshiki Nakai, Ravi Kiran Chikkala, Lena Sophie Oberkircher, Nicholas Jennings, Natalia Skachkova, Tatiana Anikina, Jesujoba Oluwadara Alabi, 3 Oct 2025, TRepLiNa: Layer-wise CKA+REPINA Alignment Improves Low-Resource Machine Translation in Aya-23 8B, https://arxiv.org/abs/2510.06249
  • Aman Gupta, Denny O'Shea, Fazl Barez, 8 Oct 2025, VAL-Bench: Measuring Value Alignment in Language Models, https://arxiv.org/abs/2510.05465
  • Yacouba Kaloga, Shashi Kumar, Petr Motlicek, Ina Kodrasi, 8 Oct 2025, A Differentiable Alignment Framework for Sequence-to-Sequence Modeling via Optimal Transport, https://arxiv.org/abs/2502.01588
  • Lingjie Yi, Raphael Douady, Chao Chen, 7 Oct 2025, Decipher the Modality Gap in Multimodal Contrastive Learning: From Convergent Representations to Pairwise Alignment, https://arxiv.org/abs/2510.03268
  • Matthieu Bou, Nyal Patel, Arjun Jagota, Satyapriya Krishna, Sonali Parbhoo, 8 Oct 2025, The Alignment Auditor: A Bayesian Framework for Verifying and Refining LLM Objectives, https://arxiv.org/abs/2510.06096
  • Hadi Mohammadi, Anastasia Giachanou, and Ayoub Bagheri, 8 Oct 2025, EvalMORAAL: Interpretable Chain-of-Thought and LLM-as-Judge Evaluation for Moral Alignment in Large Language Models, https://arxiv.org/abs/2510.05942
  • Xinle Wu, Yao Lu, 3 Oct 2025, Reward Model Routing in Alignment, https://arxiv.org/abs/2510.02850
  • Marc Lelarge, 3 Oct 2025, Bootstrap Learning for Combinatorial Graph Alignment with Sequential GNNs, https://arxiv.org/abs/2510.03086
  • Andr\'e Longon, David Klindt, Meenakshi Khosla, 3 Oct 2025, Superposition disentanglement of neural representations reveals hidden alignment, https://arxiv.org/abs/2510.03186
  • Ara Seo, Bryan Sangwoo Kim, Hyungjin Chung, Jong Chul Ye, 3 Oct 2025, Align Your Query: Representation Alignment for Multimodality Medical Object Detection, https://arxiv.org/abs/2510.02789
  • Hongxiang Zhang and Yuan Tian and Tianyi Zhang, 3 Oct 2025, Self-Anchor: Large Language Model Reasoning via Step-by-step Attention Alignment, https://arxiv.org/abs/2510.03223
  • Matthias Burkhardt, Tobias Schm\"ahling, Pascal Stegmann, Michael Layh, Tobias Windisch, 2 Oct 2025, Active Alignments of Lens Systems with Reinforcement Learning, https://arxiv.org/abs/2503.02075
  • Hanze Guo, Jing Yao, Xiao Zhou, Xiaoyuan Yi, Xing Xie, 21 Oct 2025, Counterfactual Reasoning for Steerable Pluralistic Value Alignment of Large Language Models, https://arxiv.org/abs/2510.18526
  • Federico Barbero, Xiangming Gu, Christopher A. Choquette-Choo, Chawin Sitawarin, Matthew Jagielski, Itay Yona, Petar Veli\v{c}kovi\'c, Ilia Shumailov, Jamie Hayes, 21 Oct 2025, Extracting alignment data in open models, https://arxiv.org/abs/2510.18554
  • Jiawei Zhang, Andrew Estornell, David D. Baek, Bo Li, Xiaojun Xu, 20 Oct 2025, Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth, https://arxiv.org/abs/2510.18081
  • Haobin Li, Yijie Lin, Peng Hu, Mouxing Yang, Xi Peng, 21 Oct 2025, Learning with Dual-level Noisy Correspondence for Multi-modal Entity Alignment, https://arxiv.org/abs/2510.18240
  • Hao Qin, Thang Duong, Ming Li, Chicheng Zhang, 21 Oct 2025, Physics-Informed Parametric Bandits for Beam Alignment in mmWave Communications, https://arxiv.org/abs/2510.18299
  • \'Angela L\'opez-Cardona, Sebasti\'an Idesis, Mireia Masias-Bruns, Sergi Abadal, Ioannis Arapakis, 3 Oct 2025, Brain-Language Model Alignment: Insights into the Platonic Hypothesis and Intermediate-Layer Advantage, https://arxiv.org/abs/2510.17833
  • Xue Jiang, Yihong Dong, Mengyang Liu, Hongyi Deng, Tian Wang, Yongding Tao, Rongyu Cao, Binhua Li, Zhi Jin, Wenpin Jiao, Fei Huang, Yongbin Li, Ge Li, 21 Oct 2025, CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment, https://arxiv.org/abs/2510.18471
  • Rajdeep Haldar, Ziyi Wang, Qifan Song, Guang Lin, Yue Xing, 20 Oct 2025, LLM Safety Alignment is Divergence Estimation in Disguise, https://arxiv.org/abs/2502.00657
  • Xue Han, Qian Hu, Yitong Wang, Wenchun Gao, Lianlian Zhang, Qing Wang, Lijun Mei, Chao Deng, Junlan Feng, 21 Oct 2025, Temporal Alignment of LLMs through Cycle Encoding for Long-Range Time Representations, https://arxiv.org/abs/2503.04150
  • Mar\'ia Victoria Carro, Denise Alejandra Mester, Facundo Nieto, Oscar Agust\'in Stanchi, Guido Ernesto Bergman, Mario Alejandro Leiva, Eitan Sprejer, Luca Nicol\'as Forziati Gangi, Francisca Gauna Selasco, Juan Gustavo Corval\'an, Gerardo I. Simari, Mar\'ia Vanina Martinez, 21 Oct 2025, AI Debaters are More Persuasive when Arguing in Alignment with Their Own Beliefs, https://arxiv.org/abs/2510.13912
  • Wentao Zhu, Zhining Zhang, Yuwei Ren, Yin Huang, Hao Xu, Yizhou Wang, 25 Sep 2025, Embodied Representation Alignment with Mirror Neurons, https://arxiv.org/abs/2509.21136
  • Zhengyuan Shi, Jingxin Wang, Wentao Jiang, Chengyu Ma, Ziyang Zheng, Zhufei Chu, Weikang Qian, Qiang Xu, 25 Sep 2025, Alignment Unlocks Complementarity: A Framework for Multiview Circuit Representation Learning, https://arxiv.org/abs/2509.20968
  • Duc-Tuan Truong, Tianchi Liu, Junjie Li, Ruijie Tao, Kong Aik Lee, Eng Siong Chng, 25 Sep 2025, Addressing Gradient Misalignment in Data-Augmented Training for Robust Speech Deepfake Detection, https://arxiv.org/abs/2509.20682
  • Zoe Wanying He, Sean Trott, Meenakshi Khosla, 25 Sep 2025, Seeing Through Words, Speaking Through Pixels: Deep Representational Alignment Between Vision and Language Models, https://arxiv.org/abs/2509.20751
  • Ryan L. Yang, Dipkamal Bhusal, Nidhi Rastogi, 25 Sep 2025, Learning to Look: Cognitive Attention Alignment with Vision-Language Models, https://arxiv.org/abs/2509.21247
  • Junyang Zhang, Tianyi Zhu, Thierry Tambe, 27 Sep 2025, AttAnchor: Guiding Cross-Modal Token Alignment in VLMs with Attention Anchors, https://arxiv.org/abs/2509.23109
  • Sijia Liu, Niklas Muennighoff, Kawin Ethayarajh, 29 Sep 2025, Humanline: Online Alignment as Perceptual Loss, https://arxiv.org/abs/2509.24207
  • Zihao Zhu, Xinyu Wu, Gehan Hu, Siwei Lyu, Ke Xu, Baoyuan Wu, 29 Sep 2025, AdvChain: Adversarial Chain-of-Thought Tuning for Robust Safety Alignment of Large Reasoning Models, https://arxiv.org/abs/2509.24269
  • Jake S. Rhodes, Adam G. Rustad, Marshall S. Nielsen, Morgan Chase McClellan, Dallan Gardner, Dawson Hedges, 26 Sep 2025, Guided Manifold Alignment with Geometry-Regularized Twin Autoencoders, https://arxiv.org/abs/2509.22913
  • Sungmin Cha, Kyunghyun Cho, 28 Sep 2025, Why Alignment Must Precede Distillation: A Minimal Working Explanation, https://arxiv.org/abs/2509.23667
  • Yao Luan, Ni Mu, Yiqin Yang, Bo Xu, Qing-Shan Jia, 28 Sep 2025, STAIR: Addressing Stage Misalignment through Temporal-Aligned Preference Reinforcement Learning, https://arxiv.org/abs/2509.23802
  • Muyun Jiang, Shuailei Zhang, Zhenjie Yang, Mengjun Wu, Weibang Jiang, Zhiwei Guo, Wei Zhang, Rui Liu, Shangen Zhang, Yong Li, Yi Ding, Cuntai Guan, 29 Sep 2025, ELASTIQ: EEG-Language Alignment with Semantic Task Instruction and Querying, https://arxiv.org/abs/2509.24302
  • Liang Lin, Zhihao Xu, Junhao Dong, Jian Zhao, Yuchen Yuan, Guibin Zhang, Miao Yu, Yiming Zhang, Zhengtao Yao, Huahui Yi, Dongrui Liu, Xinfeng Li, Kun Wang, 29 Sep 2025, OrthAlign: Orthogonal Subspace Decomposition for Non-Interfering Multi-Objective Alignment, https://arxiv.org/abs/2509.24610
  • Giordano Cicchetti, Eleonora Grassucci, Danilo Comminiello, 29 Sep 2025, A TRIANGLE Enables Multimodal Alignment Beyond Cosine Similarity, https://arxiv.org/abs/2509.24734
  • Peter Holderrieth, Uriel Singer, Tommi Jaakkola, Ricky T. Q. Chen, Yaron Lipman, Brian Karrer, 29 Sep 2025, GLASS Flows: Transition Sampling for Alignment of Flow and Diffusion Models, https://arxiv.org/abs/2509.25170
  • Abhiroop Chatterjee, Susmita Ghosh, 20 Sep 2025, Learning Hyperspectral Images with Curated Text Prompts for Efficient Multimodal Alignment, https://arxiv.org/abs/2509.22697
  • Jaehan Kim, Minkyoo Song, Seungwon Shin, Sooel Son, 26 Sep 2025, Defending MoE LLMs against Harmful Fine-Tuning via Safety Routing Alignment, https://arxiv.org/abs/2509.22745
  • Wonje Jeung, Sangyeon Yoon, Yoonjun Cho, Dongjae Jeon, Sangwoo Shin, Hyesoo Hong, Albert No, 27 Sep 2025, A2D: Any-Order, Any-Step Safety Alignment for Diffusion Language Models, https://arxiv.org/abs/2509.23286
  • Junming Yang, Ning Xu, Biao Liu, Shiqi Qiao, Xin Geng, 27 Sep 2025, Alignment through Meta-Weighted Online Sampling: Bridging the Gap between Data Generation and Preference Optimization, https://arxiv.org/abs/2509.23371
  • Pu Huang, Shouguang Wang, Siya Yao, Mengchu Zhou, 28 Sep 2025, Generalizable Speech Deepfake Detection via Information Bottleneck Enhanced Adversarial Alignment, https://arxiv.org/abs/2509.23618
  • Junliang Li, Yucheng Wang, Yan Chen, Yu Ran, Ruiqing Zhang, Jing Liu, Hua Wu, Haifeng Wang, 28 Sep 2025, Knowledge-Level Consistency Reinforcement Learning: Dual-Fact Alignment for Long-Form Factuality, https://arxiv.org/abs/2509.23765
  • Moxin Zhao, Nan Meng, Jason Pui Yin Cheung, Chris Yuk Kwan Tang, Chenxi Yu, Wenting Zhong, Pengyu Lu, Chang Shi, Yipeng Zhuang, Teng Zhang, 29 Sep 2025, LatXGen: Towards Radiation-Free and Accurate Quantitative Analysis of Sagittal Spinal Alignment Via Cross-Modal Radiographic View Synthesis, https://arxiv.org/abs/2509.24165
  • Soumyadeep Chandra, Kaushik Roy, 29 Sep 2025, REALIGN: Regularized Procedure Alignment with Matching Video Embeddings via Partial Gromov-Wasserstein Optimal Transport, https://arxiv.org/abs/2509.24382
  • Lingyou Pang, Lei Huang, Jianyu Lin, Tianyu Wang, Akira Horiguchi, Alexander Aue, and Carey E. Priebe, 26 Sep 2025, Unsupervised Conformal Inference: Bootstrapping and Alignment to Control LLM Uncertainty, https://arxiv.org/abs/2509.23002
  • Mingshu Li, Dhruv Desai, Jerinsh Jeyapaulraj, Philip Sommer, Riya Jain, Peter Chu, Dhagash Mehta, 29 Sep 2025, STRAPSim: A Portfolio Similarity Metric for ETF Alignment and Portfolio Trades, https://arxiv.org/abs/2509.24151
  • Shengyuan Chen, Zheng Yuan, Qinggang Zhang, Wen Hua, Jiannong Cao, Xiao Huang, 29 Sep 2025, Neuro-Symbolic Entity Alignment via Variational Inference, https://arxiv.org/abs/2410.04153
  • Jiaqi Han, Austin Wang, Minkai Xu, Wenda Chu, Meihua Dang, Yisong Yue, Stefano Ermon, 27 Sep 2025, Discrete Diffusion Trajectory Alignment via Stepwise Decomposition, https://arxiv.org/abs/2507.04832
  • Hua Shen, Tiffany Knearem, Reshmi Ghosh, Kenan Alkiek, Kundan Krishna, Yachuan Liu, Ziqiao Ma, Savvas Petridis, Yi-Hao Peng, Li Qiwei, Sushrita Rakshit, Chenglei Si, Yutong Xie, Jeffrey P. Bigham, Frank Bentley, Joyce Chai, Zachary Lipton, Qiaozhu Mei, Rada Mihalcea, Michael Terry, Diyi Yang, Meredith Ringel Morris, Paul Resnick, David Jurgens, 29 Sep 2025, Position: Towards Bidirectional Human-AI Alignment, https://arxiv.org/abs/2406.09264
  • Haitao Li, Che Liu, Zhengyao Ding, Ziyi Liu, Wenqi Shao, Zhengxing Huang, 29 Sep 2025, Fine-grained Contrastive Learning for ECG-Report Alignment with Waveform Enhancement, https://arxiv.org/abs/2505.11939
  • Vinod Raman, Hilal Asi, Satyen Kale, 29 Sep 2025, AdaBoN: Adaptive Best-of-N Alignment, https://arxiv.org/abs/2505.12050
  • Yuanfei Wang, Xinju Huang, Fangwei Zhong, Yaodong Yang, Yizhou Wang, Yuanpei Chen, Hao Dong, 27 Sep 2025, Communication-Efficient Desire Alignment for Embodied Agent-Human Adaptation, https://arxiv.org/abs/2505.22503
  • Danush Khanna, Gurucharan Marthi Krishna Kumar, Basab Ghosh, Yaswanth Narsupalli, Vinija Jain, Vasu Sharma, Aman Chadha, Amitava Das, 28 Sep 2025, AdversariaL attacK sAfety aLIgnment(ALKALI): Safeguarding LLMs through GRACE: Geometric Representation-Aware Contrastive Enhancement- Introducing Adversarial Vulnerability Quality Index (AVQI), https://arxiv.org/abs/2506.08885
  • Zhiming Zhang, Qingfu Zhu, Xianzhen Luo, Yixuan Wang, Bohan Li, Wanxiang Che, 16 Oct 2025, Automated Snippet-Alignment Data Augmentation for Code Translation, https://arxiv.org/abs/2510.15004
  • Ignacio Serna, 17 Oct 2025, Latent Feature Alignment: Discovering Biased and Interpretable Subpopulations in Face Recognition Models, https://arxiv.org/abs/2510.15520
  • Chunyu Xie, Bin Wang, Fanjing Kong, Jincheng Li, Dawei Liang, Ji Ao, Dawei Leng, Yuhui Yin, 17 Oct 2025, FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model, https://arxiv.org/abs/2510.10921
  • Santhosh Kumar Ravindran, 5 Oct 2025, Moral Anchor System: A Predictive Framework for AI Value Alignment and Drift Prevention, https://arxiv.org/abs/2510.04073
  • Yoonjeon Kim, Doohyuk Jang, Eunho Yang, 26 Sep 2025, Meta-Awareness Enhances Reasoning Models: Self-Alignment Reinforcement Learning, https://arxiv.org/abs/2510.03259
  • Yufei Li, Yu Fu, Yue Dong, Cong Liu, 28 Sep 2025, MACE: A Hybrid LLM Serving System with Colocated SLO-aware Continuous Retraining Alignment, https://arxiv.org/abs/2510.03283
  • Mallikarjuna Tupakula, 30 Sep 2025, Thin Bridges for Drug Text Alignment: Lightweight Contrastive Learning for Target Specific Drug Retrieval, https://arxiv.org/abs/2510.03309
  • Ameya Daigavane, YuQing Xie, Bodhi P. Vani, Saeed Saremi, Joseph Kleinhenz, Tess Smidt, 2 Oct 2025, Matching the Optimal Denoiser in Point Cloud Diffusion with (Improved) Rotational Alignment, https://arxiv.org/abs/2510.03335
  • Haiquan Qiu, You Wu, Yingjie Tan, Yaqing Wang, Quanming Yao, 5 Oct 2025, Spectral Alignment as Predictor of Loss Explosion in Neural Network Training, https://arxiv.org/abs/2510.04202
  • Siwei Han, Jiaqi Liu, Yaofeng Su, Wenbo Duan, Xinyuan Liu, Cihang Xie, Mohit Bansal, Mingyu Ding, Linjun Zhang, Huaxiu Yao, 6 Oct 2025, Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails, https://arxiv.org/abs/2510.04860
  • Antoun Yaacoub, Zainab Assaghir, J\'er\^ome Da-Rugna, 3 Oct 2025, Lightweight Prompt Engineering for Cognitive Alignment in Educational AI: A OneClickQuiz Case Study, https://arxiv.org/abs/2510.03374
  • Xiaoyu Yang, Jie Lu, En Yu, 5 Oct 2025, Learning from All: Concept Alignment for Autonomous Distillation from Multiple Drifting MLLMs, https://arxiv.org/abs/2510.04142
  • Davood Rafiei and Morgan Lindsay Heisler and Weiwei Zhang and Mohammadreza Pourreza and Yong Zhang, 6 Oct 2025, Do LLMs Align with My Task? Evaluating Text-to-SQL via Dataset Alignment, https://arxiv.org/abs/2510.04919
  • Yunfan Zhang, Kathleen McKeown, Smaranda Muresan, 5 Oct 2025, Exploring Chain-of-Thought Reasoning for Steerable Pluralistic Alignment, https://arxiv.org/abs/2510.04045
  • Kihyun Kim, Jiawei Zhang, Asuman Ozdaglar, Pablo A. Parrilo, 5 Oct 2025, Beyond RLHF and NLHF: Population-Proportional Alignment under an Axiomatic Framework, https://arxiv.org/abs/2506.05619
  • Felipe Azua and Leopoldo Bertossi, 6 Oct 2025, Causality-Based Scores Alignment in Explainable Data Management, https://arxiv.org/abs/2503.14469
  • Natalia O\.zegalska-{\L}ukasik and Szymon {\L}ukasik, 4 Oct 2025, Artificial Authority: From Machine Minds to Political Alignments. An Experimental Analysis of Democratic and Autocratic Biases in Large-Language Models, https://arxiv.org/abs/2509.25286
  • Sushil Mahavir Varma, Ir\`ene Waldspurger, Laurent Massouli\'e, 5 Oct 2025, Graph Alignment via Birkhoff Relaxation, https://arxiv.org/abs/2503.05323
  • Binxu Li, Minkai Xu, Jiaqi Han, Meihua Dang, Stefano Ermon, 6 Oct 2025, Divergence Minimization Preference Optimization for Diffusion Model Alignment, https://arxiv.org/abs/2507.07510
  • Allison Sihan Jia, Daniel Huang, Nikhil Vytla, Nirvika Choudhury, John C Mitchell, and Anupam Datta, 9 Oct 2025, What Is Your Agent's GPA? A Framework for Evaluating Agent Goal-Plan-Action Alignment, https://arxiv.org/abs/2510.08847
  • Jingxuan Wu, Zhenglin Wan, Xingrui Yu, Yuzhe Yang, Bo An, Ivor Tsang, 10 Oct 2025, OSCAR: Orthogonal Stochastic Control for Alignment-Respecting Diversity in Flow Matching, https://arxiv.org/abs/2510.09060
  • S M Rafiuddin, 9 Oct 2025, Edu-EmotionNet: Cross-Modality Attention Alignment with Temporal Feedback Loops, https://arxiv.org/abs/2510.08802
  • Achleshwar Luthra, Priyadarsi Mishra, Tomer Galanti, 9 Oct 2025, On the Alignment Between Supervised and Self-Supervised Contrastive Learning, https://arxiv.org/abs/2510.08852
  • Ammar I Marvi, Nancy G Kanwisher, Meenakshi Khosla, 9 Oct 2025, Sparse components distinguish visual pathways & their alignment to neural networks, https://arxiv.org/abs/2510.08858
  • Hyunin Lee, Yong Zhang, Hoang Vu Nguyen, Xiaoyi Liu, Namyong Park, Christopher Jung, Rong Jin, Yang Wang, Zhigang Wang, Somayeh Sojoudi, Xue Feng, 10 Oct 2025, Cross-attention Secretly Performs Orthogonal Alignment in Recommendation Models, https://arxiv.org/abs/2510.09435
  • Yu Liu, Hanlei Shi, Haoxun Li, Yuqing Sun, Yuxuan Ding, Linlin Gong, Leyuan Qu, Taihao Li, 7 Oct 2025, Centering Emotion Hotspots: Multimodal Local-Global Fusion and Cross-Modal Alignment for Emotion Recognition in Conversations, https://arxiv.org/abs/2510.08606
  • Weihua Zheng, Zhengyuan Liu, Tanmoy Chakraborty, Weiwen Xu, Xiaoxue Gao, Bryan Chen Zhengyu Tan, Bowei Zou, Chang Liu, Yujia Hu, Xing Xie, Xiaoyuan Yi, Jing Yao, Chaojun Wang, Long Li, Rui Liu, Huiyao Liu, Koji Inoue, Ryuichi Sumida, Tatsuya Kawahara, Fan Xu, Lingyu Ye, Wei Tian, Dongjun Kim, Jimin Jung, Jaehyung Seo, Nadya Yuki Wangsajaya, Pham Minh Duc, Ojasva Saxena, Palash Nandi, Xiyan Tao, Wiwik Karlina, Tuan Luong, Keertana Arun Vasan, Roy Ka-Wei Lee, Nancy F. Chen, 7 Oct 2025, MMA-ASIA: A Multilingual and Multimodal Alignment Framework for Culturally-Grounded Evaluation, https://arxiv.org/abs/2510.08608
  • Zongcai Du, Guilin Deng, Xiaofeng Guo, Xin Gao, Linke Li, Kaichang Cheng, Fubo Han, Siyu Yang, Peng Liu, Pan Zhong, Qiang Fu, 10 Oct 2025, DiTSinger: Scaling Singing Voice Synthesis with Diffusion Transformer and Implicit Alignment, https://arxiv.org/abs/2510.09016
  • Upasana Tiwari, Rupayan Chakraborty, Sunil Kumar Kopparapu, 10 Oct 2025, Emotion-Disentangled Embedding Alignment for Noise-Robust and Cross-Corpus Speech Emotion Recognition, https://arxiv.org/abs/2510.09072
  • Laxmiraju Kandikatla, Branislav Radeljic, 10 Oct 2025, AI and Human Oversight: A Risk-Based Framework for Alignment, https://arxiv.org/abs/2510.09090
  • Tri Ton, Ji Woo Hong, Chang D. Yoo, 10 Oct 2025, TARO: Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning for Synchronized Video-to-Audio Synthesis, https://arxiv.org/abs/2504.05684
  • Marta Contreiras Silva, Daniel Faria, Catia Pesquita, 24 Oct 2025, CMOMgen: Complex Multi-Ontology Alignment via Pattern-Guided In-Context Learning, https://arxiv.org/abs/2510.21656
  • Jialu Tang, Hung Manh Pham, Ignace De Lathauwer, Henk S. Schipper, Yuan Lu, Dong Ma, Aaqib Saeed, 24 Oct 2025, Interpretable Multimodal Zero-Shot ECG Diagnosis via Structured Clinical Knowledge Alignment, https://arxiv.org/abs/2510.21551
  • Shufan Shen, Junshu Sun, Qingming Huang, Shuhui Wang, 24 Oct 2025, VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set, https://arxiv.org/abs/2510.21323
  • Vighnesh Subramaniam, David Mayo, Colin Conwell, Tomaso Poggio, Boris Katz, Brian Cheung, Andrei Barbu, 23 Oct 2025, Training the Untrainable: Introducing Inductive Bias via Representational Alignment, https://arxiv.org/abs/2410.20035
  • Zaiyan Xu, Sushil Vemuri, Kishan Panaganti, Dileep Kalathil, Rahul Jain, Deepak Ramachandran, 24 Oct 2025, Robust LLM Alignment via Distributionally Robust Direct Preference Optimization, https://arxiv.org/abs/2502.01930
  • Masahiro Fujisawa, Masaki Adachi, Michael A. Osborne, 24 Oct 2025, Scalable Valuation of Human Feedback through Provably Robust Model Alignment, https://arxiv.org/abs/2505.17859
  • David Ortiz-Perez and Manuel Benavent-Lledo and Javier Rodriguez-Juan and Jose Garcia-Rodriguez and David Tom\'as, 24 Oct 2025, CogniAlign: Word-Level Multimodal Speech Alignment with Gated Cross-Attention for Alzheimer's Detection, https://arxiv.org/abs/2506.01890
  • Yige Yuan, Teng Xiao, Li Yunfan, Bingbing Xu, Shuchang Tao, Yunqi Qiu, Huawei Shen, Xueqi Cheng, 24 Oct 2025, Inference-time Alignment in Continuous Space, https://arxiv.org/abs/2505.20081
  • Jun Dan, Yang Liu, Jiankang Deng, Haoyu Xie, Siyuan Li, Baigui Sun, Shan Luo, 24 Oct 2025, TopoFR: A Closer Look at Topology Alignment on Face Recognition, https://arxiv.org/abs/2410.10587
  • Pratik S. Sachdeva and Tom van Nuenen, 11 Oct 2025, Deliberative Dynamics and Value Alignment in LLM Debates, https://arxiv.org/abs/2510.10002
  • Leonard Dung, Florian Mai, 13 Oct 2025, AI Alignment Strategies from a Risk Perspective: Independent Safety Mechanisms or Shared Failures?, https://arxiv.org/abs/2510.11235
  • Youngrok Park, Hojung Jung, Sangmin Bae, Se-Young Yun, 13 Oct 2025, Temporal Alignment Guidance: On-Manifold Sampling in Diffusion Models, https://arxiv.org/abs/2510.11057
  • Gareth Seneque, Lap-Hang Ho, Nafise Erfanian Saeedi, Jeffrey Molendijk, Ariel Kupermann and Tim Elson, 13 Oct 2025, ENIGMA: The Geometry of Reasoning and Alignment in Large-Language Models, https://arxiv.org/abs/2510.11278
  • Guozhi Liu, Qi Mu, Tiansheng Huang, Xinhua Wang, Li Shen, Weiwei Lin, Zhang Li, 11 Oct 2025, Pharmacist: Safety Alignment Data Curation for Large Language Models against Harmful Fine-tuning, https://arxiv.org/abs/2510.10085
  • Junwon You, Dasol Kang, Jae-Hun Jung, 13 Oct 2025, Topological Alignment of Shared Vision-Language Embedding Space, https://arxiv.org/abs/2510.10889
  • Yongxi Cao and Julian F. Schumann, Jens Kober, Joni Pajarinen, Arkady Zgonnikov, 12 Oct 2025, Controllable Generative Trajectory Prediction via Weak Preference Alignment, https://arxiv.org/abs/2510.10731
  • James Y. Huang, Sailik Sengupta, Daniele Bonadiman, Yi-An Lai, Arshit Gupta, Nikolaos Pappas, Saab Mansour, Katrin Kirchhoff, Dan Roth, 12 Oct 2025, DeAL: Decoding-time Alignment for Large Language Models, https://arxiv.org/abs/2402.06147
  • Menglan Chen, Xianghe Pang, Jingjing Dong, WenHao Wang, Yaxin Du and Siheng Chen, 13 Oct 2025, VLMGuard-R1: Proactive Safety Alignment for VLMs via Reasoning-Driven Prompt Optimization, https://arxiv.org/abs/2504.12661
  • Shuai Zhao, Yunqiu Xu, Linchao Zhu, Yi Yang, 13 Oct 2025, Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data, https://arxiv.org/abs/2504.09895
  • Yukun Zhang, Qi Dong, 13 Oct 2025, Multi-Scale Manifold Alignment for Interpreting Large Language Models: A Unified Information-Geometric Framework, https://arxiv.org/abs/2505.20333
  • Qingshu Xu, Hong Jiao, Tianyi Zhou, Ming Li, Nan Zhang, Sydney Peters, Yanbin Fu, 11 Oct 2025, Automated Alignment of Math Items to Content Standards in Large-Scale Assessments Using Language Models, https://arxiv.org/abs/2510.05129
  • Johann Schmidt, Sebastian Stober, 9 Oct 2025, Robust Canonicalization through Bootstrapped Data Re-Alignment, https://arxiv.org/abs/2510.08178
  • XuHao Hu, Peng Wang, Xiaoya Lu, Dongrui Liu, Xuanjing Huang, Jing Shao, 9 Oct 2025, LLMs Learn to Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions, https://arxiv.org/abs/2510.08211
  • Teng Xiao, Zuchao Li, Lefei Zhang, 23 Sep 2025, OmniBridge: Unified Multimodal Understanding, Generation, and Retrieval via Latent Space Alignment, https://arxiv.org/abs/2509.19018
  • Sharan Sahu and Martin T. Wells, 23 Sep 2025, DRO-REBEL: Distributionally Robust Relative-Reward Regression for Fast and Efficient LLM Alignment, https://arxiv.org/abs/2509.19104
  • Riad Ahmed Anonto, Sardar Md. Saffat Zabin, M. Saifur Rahman, 22 Sep 2025, Align Where the Words Look: Cross-Attention-Guided Patch Alignment with Contrastive and Transport Regularization for Bengali Captioning, https://arxiv.org/abs/2509.18369
  • Rustem Turtayev, Natalia Fedorova, Oleg Serikov, Sergey Koldyba, Lev Avagyan, Dmitrii Volkov, 22 Oct 2025, Misalignment Bounty: Crowdsourcing AI Agent Misbehavior, https://arxiv.org/abs/2510.19738
  • Yuhang Liu, Minglai Shao, Zengyi Wo, Yunlong Chu, Bing Hao, Shengzhong Liu, Ruijie Wang, Jianxin Li, 22 Oct 2025, Learning Noise-Resilient and Transferable Graph-Text Alignment via Dynamic Quality Assessment, https://arxiv.org/abs/2510.19384
  • Fabian Gr\"oger, Shuo Wen, Huyen Le, Maria Brbi\'c, 22 Oct 2025, With Limited Data for Multimodal Alignment, Let the STRUCTURE Guide You, https://arxiv.org/abs/2506.16895
  • Eric J. W. Orlowski, Hakim Norhashim and Tristan Koh Ly Wey, 30 Sep 2025, 'Too much alignment; not enough culture': Re-balancing cultural alignment practices in LLMs, https://arxiv.org/abs/2509.26167
  • Zhuoning Xu, Xinyan Liu, 17 Sep 2025, VLHSA: Vision-Language Hierarchical Semantic Alignment for Jigsaw Puzzle Solving with Eroded Gaps, https://arxiv.org/abs/2509.25202
  • Hao Chen, Tao Han, Jie Zhang, Song Guo, Lei Bai, 21 Sep 2025, STCast: Adaptive Boundary Alignment for Global and Regional Weather Forecasting, https://arxiv.org/abs/2509.25210
  • Prajjwal Bhattarai, Mohammad Amjad, Dmytro Zhylko, Tuka Alhanai, 27 Sep 2025, Knowledge distillation through geometry-aware representational alignment, https://arxiv.org/abs/2509.25253
  • Dengming Zhang, Xiaowen Ma, Zhenliang Ni, Zhenkai Wu, Han Shu, Xin Jiang, and Xinghao Chen, 30 Sep 2025, Expert Merging: Model Merging with Unsupervised Expert Alignment and Importance-Guided Layer Chunking, https://arxiv.org/abs/2509.25712
  • Seong-Hyeon Hwang, Soyoung Choi, Steven Euijong Whang, 30 Sep 2025, MIDAS: Misalignment-based Data Augmentation Strategy for Imbalanced Multimodal Learning, https://arxiv.org/abs/2509.25831
  • Bissmella Bahaduri, Hicham Talaoubrid, Fangchen Feng, Zuheng Ming, Anissa Mokraoui, 30 Sep 2025, Indirect Attention: Turning Context Misalignment into a Feature, https://arxiv.org/abs/2509.26015
  • Fr\'ed\'eric Berdoz, Luca A. Lanzend\"orfer, Ren\'e Caky, Roger Wattenhofer, 30 Sep 2025, Alignment-Aware Decoding, https://arxiv.org/abs/2509.26169
  • Asma Farajidizaji, Akash Gupta, Vatsal Raina, 29 Sep 2025, Probing the Limits of Stylistic Alignment in Vision-Language Models, https://arxiv.org/abs/2509.25568
  • Jia Jun Cheng Xian, Muchen Li, Haotian Yang, Xin Tao, Pengfei Wan, Leonid Sigal, Renjie Liao, 30 Sep 2025, Free Lunch Alignment of Text-to-Image Diffusion Models without Preference Image Pairs, https://arxiv.org/abs/2509.25771
  • Mary I. Letey, Jacob A. Zavatone-Veth, Yue M. Lu, Cengiz Pehlevan, 30 Sep 2025, Pretrain-Test Task Alignment Governs Generalization in In-Context Learning, https://arxiv.org/abs/2509.26551
  • Chi Zhou, Wang Luo, Haoran Li, Congying Han, Tiande Guo, Zicheng Zhang, 30 Sep 2025, Dual Alignment Maximin Optimization for Offline Model-based RL, https://arxiv.org/abs/2502.00850
  • Haishuo Fang, Xiaodan Zhu, Iryna Gurevych, 30 Sep 2025, Preemptive Detection and Correction of Misaligned Actions in LLM Agents, https://arxiv.org/abs/2407.11843
  • Edward Gu, Ho Chit Siu, Melanie Platt, Isabelle Hurley, Jaime Pe\~na, Rohan Paleja, 30 Sep 2025, Enabling Rapid Shared Human-AI Mental Model Alignment via the After-Action Review, https://arxiv.org/abs/2503.19607
  • Yansen Zhang, Qingcan Kang, Yujie Chen, Yufei Wang, Xiongwei Han, Tao Zhong, Mingxuan Yuan, Chen Ma, 28 Sep 2025, Optimization Modeling via Semantic Anchored Alignment, https://arxiv.org/abs/2510.05115
  • Jianglin Lu, Hailing Wang, Yi Xu, Yizhou Wang, Kuo Yang, Yun Fu, 5 Oct 2025, Representation Potentials of Foundation Models for Multimodal Alignment: A Survey, https://arxiv.org/abs/2510.05184
  • Radha Gulhane, Sathish Reddy Indurthi, 6 Oct 2025, Beyond Monolithic Rewards: A Hybrid and Multi-Aspect Reward Optimization for MLLM Alignment, https://arxiv.org/abs/2510.05283
  • Qingyu Yin, Chak Tou Leong, Linyi Yang, Wenxuan Huang, Wenjie Li, Xiting Wang, Jaehong Yoon, YunXing, XingYu, Jinjin Gu, 7 Oct 2025, Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning?, https://arxiv.org/abs/2510.06036
  • Mallika Mainali, Harsha Sureshbabu, Anik Sen, Christopher B. Rauch, Noah D. Reifsnyder, John Meyer, J. T. Turner, Michael W. Floyd, Matthew Molineaux, and Rosina O. Weber, 7 Oct 2025, Classical AI vs. LLMs for Decision-Maker Alignment in Health Insurance Choices, https://arxiv.org/abs/2510.06093
  • Batu El and James Zou, 7 Oct 2025, Moloch's Bargain: Emergent Misalignment When LLMs Compete for Audiences, https://arxiv.org/abs/2510.06105
  • Ziyi Chen, Junyi Li, Peiran Yu, Heng Huang, 7 Oct 2025, Provably Mitigating Corruption, Overoptimization, and Verbosity Simultaneously in Offline and Online RLHF/DPO Alignment, https://arxiv.org/abs/2510.05526
  • Yihan Du, Seo Taek Kong, R. Srikant, 7 Oct 2025, Primal-Dual Direct Preference Optimization for Constrained LLM Alignment, https://arxiv.org/abs/2510.05703
  • Nyal Patel, Matthieu Bou, Arjun Jagota, Satyapriya Krishna, Sonali Parbhoo, 7 Oct 2025, Learning from Failures: Understanding LLM Alignment through Failure-Aware Inverse RL, https://arxiv.org/abs/2510.06092
  • Aengus Lynch, Benjamin Wright, Caleb Larson, Stuart J. Ritchie, Soren Mindermann, Ethan Perez, Kevin K. Troy, Evan Hubinger, 5 Oct 2025, Agentic Misalignment: How LLMs Could Be Insider Threats, https://arxiv.org/abs/2510.05179
  • Francesca Gomez, 6 Oct 2025, Adapting Insider Risk mitigations for Agentic Misalignment: an empirical study, https://arxiv.org/abs/2510.05192
  • Miles Wang, Tom Dupr\'e la Tour, Olivia Watkins, Alex Makelov, Ryan A. Chi, Samuel Miserendino, Jeffrey Wang, Achyuta Rajaram, Johannes Heidecke, Tejal Patwardhan, Dan Mossing, 6 Oct 2025, Persona Features Control Emergent Misalignment, https://arxiv.org/abs/2506.19823
  • Yang Xiao, Wang Lu, Jie Ji, Ruimeng Ye, Gen Li, Xiaolong Ma, Bo Hui, 6 Oct 2025, Optimal Transport for Brain-Image Alignment: Unveiling Redundancy and Synergy in Neural Information Processing, https://arxiv.org/abs/2503.10663
  • Yik Siu Chan, Zheng-Xin Yong, Stephen H. Bach, 7 Oct 2025, Can We Predict Alignment Before Models Finish Thinking? Towards Monitoring Misaligned Reasoning Models, https://arxiv.org/abs/2507.12428
  • Ruchi Sandilya, Sumaira Perez, Charles Lynch, Lindsay Victoria, Benjamin Zebley, Derrick Matthew Buchanan, Mahendra T. Bhati, Nolan Williams, Timothy J. Spellman, Faith M. Gunning, Conor Liston, Logan Grosenick, 16 Oct 2025, Contrastive Diffusion Alignment: Learning Structured Latents for Controllable Generation, https://arxiv.org/abs/2510.14190
  • Shayan Gharib, Marcelo Hartmann, Arto Klami, 16 Oct 2025, Geometric Moment Alignment for Domain Adaptation via Siegel Embeddings, https://arxiv.org/abs/2510.14666
  • Maulidi Adi Prasetia, Muhamad Risqi U. Saputra, Guntur Dharma Putra, 16 Oct 2025, FedPPA: Progressive Parameter Alignment for Personalized Federated Learning, https://arxiv.org/abs/2510.14698
  • Mingxuan Yan and Yuping Wang and Zechun Liu and Jiachen Li, 16 Oct 2025, RDD: Retrieval-Based Demonstration Decomposer for Planner Alignment in Long-Horizon Tasks, https://arxiv.org/abs/2510.14968
  • Sicheng Lyu, Yu Gu, Xinyu Wang, Jerry Huang, Sitao Luan, Yufei Cui, Xiao-Wen Chang, Peng Lu, 11 Oct 2025, EvoEdit: Evolving Null-space Alignment for Robust and Efficient Knowledge Editing, https://arxiv.org/abs/2510.13851
  • Yunze Tong, Didi Zhu, Zijing Hu, Jinluan Yang, Ziyu Zhao, 16 Oct 2025, Noise Projection: Closing the Prompt-Agnostic Gap Behind Text-to-Image Misalignment in Diffusion Models, https://arxiv.org/abs/2510.14526
  • Yuxin Xiao, Sana Tonekaboni, Walter Gerych, Vinith Suriyakumar, Marzyeh Ghassemi, 16 Oct 2025, When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment, https://arxiv.org/abs/2506.07452

Trustworthy AI

Trustworthy AI is the practice of ensuring that LLM-based systems are safe and predictable. This involves ensuring not only the safety of the LLM's outputs, such as avoiding bias and toxicity, but also ensuring that the AI infrastructure is resilient and the overall system is reliable. The idea of "Trustworthy AI" has been championed by NVIDIA.

Articles and papers on trustworthy AI:

  • Leon Derczynski, Christopher Parisien, Nikki Pope, Michael Boone, Nov 2024, NVIDIA Approaches to AI Trust and Safety: Innovation and Tools, https://www.nvidia.com/en-us/on-demand/session/aisummitdc24-sdc1088/?playlistId=playList-c6a9450c-c790-462d-a058-0bacacd5d370
  • Mayank Vatsa, Anubhooti Jain, Richa Singh, 7 Dec 2023, Adventures of Trustworthy Vision-Language Models: A Survey, https://arxiv.org/abs/2312.04231
  • Nikki Pope, March 1, 2024, What Is Trustworthy AI? Trustworthy AI is an approach to AI development that prioritizes safety and transparency for the people who interact with it. https://blogs.nvidia.com/blog/what-is-trustworthy-ai/
  • NVIDIA, Dec 2024 (accessed), Trustworthy AI, https://www.nvidia.com/en-us/ai-data-science/trustworthy-ai/
  • Phoebe Lee and Kristina Joos, Jan 25, 2024, Advancing Production AI with NVIDIA AI Enterprise, https://developer.nvidia.com/blog/advancing-production-ai-with-nvidia-ai-enterprise/ ("... advances in NVIDIA AI software deliver up to 54% performance gains without a hardware upgrade...")
  • Yedi Zhang, Yufan Cai, Xinyue Zuo, Xiaokun Luan, Kailong Wang, Zhe Hou, Yifan Zhang, Zhiyuan Wei, Meng Sun, Jun Sun, Jing Sun, Jin Song Dong, 9 Dec 2024, The Fusion of Large Language Models and Formal Methods for Trustworthy AI Agents: A Roadmap, https://arxiv.org/abs/2412.06512
  • Athanasios Davvetas, Xenia Ziouvelou, Ypatia Dami, Alexis Kaponis, Konstantina Giouvanopoulou, Michael Papademas, 23 Jul 2025, TAI Scan Tool: A RAG-Based Tool With Minimalistic Input for Trustworthy AI Self-Assessment, https://arxiv.org/abs/2507.17514
  • Ilias Chatzistefanidis, Navid Nikaein, 23 Jul 2025, Symbiotic Agents: A Novel Paradigm for Trustworthy AGI-driven Networks, https://arxiv.org/abs/2507.17695
  • H M Mohaimanul Islam, Huynh Q. N. Vo, Aditya Rane, 22 Jul 2025, Towards Trustworthy AI: Secure Deepfake Detection using CNNs and Zero-Knowledge Proofs, https://arxiv.org/abs/2507.17010
  • Tushar Talukder Showrav, Soyabul Islam Lincoln, Md. Kamrul Hasan, 23 Jul 2025, EXGnet: a single-lead explainable-AI guided multiresolution network with train-only quantitative features for trustworthy ECG arrhythmia classification, https://arxiv.org/abs/2506.12404
  • Yaomin Jiang, Levin Brinkmann, Anne-Marie Nussberger, Ivan Soraperra, Jean-Fran\c{c}ois Bonnefon, Iyad Rahwan, 17 Jul 2025, Humans learn to prefer trustworthy AI over human partners, https://arxiv.org/abs/2507.13524
  • Nuria Rodr\'iguez-Barroso and Mario Garc\'ia-M\'arquez and M. Victoria Luz\'on and Francisco Herrera, 21 Jul 2025, Challenges of Trustworthy Federated Learning: What's Done, Current Trends and Remaining Work, https://arxiv.org/abs/2507.15796
  • Mustafa Cavus, Jan N. van Rijn, Przemys{\l}aw Biecek, 19 Jul 2025, Beyond the Single-Best Model: Rashomon Partial Dependence Profile for Trustworthy Explanations in AutoML, https://arxiv.org/abs/2507.14744
  • Amina Dzafic, Merve Kavut, Ulya Bayram, 19 Jul 2025, Rethinking Suicidal Ideation Detection: A Trustworthy Annotation Framework and Cross-Lingual Model Evaluation, https://arxiv.org/abs/2507.14693
  • Yi Zhang, Zhen Chen, Chih-Hong Cheng, Wenjie Ruan, Xiaowei Huang, Dezong Zhao, David Flynn, Siddartha Khastgir, Xingyu Zhao, 20 Jul 2025, Trustworthy Text-to-Image Diffusion Models: A Timely and Focused Survey, https://arxiv.org/abs/2409.18214
  • Anthony Bellotti and Xindi Zhao, 9 Aug 2025, Conformal Prediction and Trustworthy AI, https://arxiv.org/abs/2508.06885
  • Stephan Rabanser, 11 Aug 2025, Uncertainty-Driven Reliability: Selective Prediction and Trustworthy Deployment in Modern Machine Learning, https://arxiv.org/abs/2508.07556
  • Anindya Bijoy Das, Shahnewaz Karim Sakib and Shibbir Ahmed, 9 Aug 2025, Trustworthy Medical Imaging with Large Language Models: A Study of Hallucinations Across Modalities, https://arxiv.org/abs/2508.07031
  • Jesco Talies, Eric Breitbarth, David Melching, 28 Jul 2025, Towards trustworthy AI in materials mechanics through domain-guided attention, https://arxiv.org/abs/2507.20658
  • Marius Baden, Ahmed Abouelazm, Christian Hubschneider, Yin Wu, Daniel Slieter, and J. Marius Z\"ollner, 27 Jul 2025, TPK: Trustworthy Trajectory Prediction Integrating Prior Knowledge For Interpretability and Kinematic Feasibility, https://arxiv.org/abs/2505.06743
  • Rob Procter, Mark Rouncefield, 25 Jul 2025, Trustworthy AI: UK Air Traffic Control Revisited, https://arxiv.org/abs/2507.21169
  • Rui Jiao, Yue Zhang, Jinku Li, 25 Jul 2025, Trustworthy Reasoning: Evaluating and Enhancing Factual Accuracy in LLM Intermediate Thought Processes, https://arxiv.org/abs/2507.22940
  • Xinwei Wu, Haojie Li, Hongyu Liu, Xinyu Ji, Ruohan Li, Yule Chen, Yigeng Zhang, 30 Jul 2025, Uncovering the Fragility of Trustworthy LLMs through Chinese Textual Ambiguity, https://arxiv.org/abs/2507.23121
  • Xiaojin Zhang, Wei Chen, 30 Jul 2025, Bridging Privacy and Robustness for Trustworthy Machine Learning, https://arxiv.org/abs/2403.16591
  • Sihang Zeng, Lucas Jing Liu, Jun Wen, Meliha Yetisgen, Ruth Etzioni, Gang Luo, 1 Aug 2025, TrajSurv: Learning Continuous Latent Trajectories from Electronic Health Records for Trustworthy Survival Prediction, https://arxiv.org/abs/2508.00657
  • Jiazhen Pan, Bailiang Jian, Paul Hager, Yundi Zhang, Che Liu, Friedrike Jungmann, Hongwei Bran Li, Chenyu You, Junde Wu, Jiayuan Zhu, Fenglin Liu, Yuyuan Liu, Niklas Bubeck, Christian Wachinger, Chen (Cherise) Chen, Zhenyu Gong, Cheng Ouyang, Georgios Kaissis, Benedikt Wiestler, Daniel Rueckert, 30 Jul 2025, Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models, https://arxiv.org/abs/2508.00923
  • James Carzon and Luca Masserano and Joshua D. Ingram and Alex Shen and Antonio Carlos Herling Ribeiro Junior and Tommaso Dorigo and Michele Doro and Joshua S. Speagle and Rafael Izbicki and Ann B. Lee, 4 Aug 2025, Trustworthy scientific inference for inverse problems with generative models, https://arxiv.org/abs/2508.02602
  • Vinicius Lima, Dzung T. Phan, Jayant Kalagnanam, Dhaval Patel, Nianjun Zhou, 5 Aug 2025, Toward a Trustworthy Optimization Modeling Agent via Verifiable Synthetic Data Generation, https://arxiv.org/abs/2508.03117
  • Claudiu Leoveanu-Condrei, 5 Aug 2025, A DbC Inspired Neurosymbolic Layer for Trustworthy Agent Design, https://arxiv.org/abs/2508.03665
  • Anqi Li, Wenwei Jin, Jintao Tong, Pengda Qin, Weijia Li, Guo Lu, 5 Aug 2025, Towards Trustworthy Multimodal Moderation via Policy-Aligned Reasoning and Hierarchical Labeling, https://arxiv.org/abs/2508.03296
  • Haoran Li and Lihao Mai and Muhao Guo and Jiaqi Wu and Yang Weng and Yannan Sun and Ce Jimmy Liu, 7 Aug 2025, From Imperfect Signals to Trustworthy Structure: Confidence-Aware Inference from Heterogeneous and Reliability-Varying Utility Data, https://arxiv.org/abs/2508.05791
  • Ahmad Farooq and Kamran Iqbal, 7 Aug 2025, Towards Transparent Ethical AI: A Roadmap for Trustworthy Robotic Systems, https://arxiv.org/abs/2508.05846
  • Kristian Miok, Blaz \v{S}krlj, Daniela Zaharie, and Marko Robnik \v{S}ikonja, 30 Jul 2025, TT-XAI: Trustworthy Clinical Text Explanations via Keyword Distillation and LLM Reasoning, https://arxiv.org/abs/2508.08273
  • Mithat Can Ozgun, Jiahuan Pei, Koen Hindriks, Lucia Donatelli, Qingzhi Liu, Xin Sun, Junxiao Wang, 15 Aug 2025, Trustworthy AI Psychotherapy: Multi-Agent LLM Workflow for Counseling and Explainable Mental Disorder Diagnosis, https://arxiv.org/abs/2508.11398
  • Benjamin Alt, Mareike Picklum, Sorin Arion, Franklin Kenghagho Kenfack and Michael Beetz, 15 Aug 2025, Open, Reproducible and Trustworthy Robot-Based Experiments with Virtual Labs and Digital-Twin-Based Execution Tracing, https://arxiv.org/abs/2508.11406
  • Zihan Guo, Yuanjian Zhou, Chenyi Wang, Linlin You, Minjie Bian, Weinan Zhang, 19 Aug 2025, BetaWeb: Towards a Blockchain-enabled Trustworthy Agentic Web, https://arxiv.org/abs/2508.13787
  • Mary Versa Clemens-Sewall, Christopher Cervantes, Emma Rafkin, J. Neil Otte, Tom Magelinski, Libby Lewis, Michelle Liu, Dana Udwin, Monique Kirkman-Bey, 20 Aug 2025, CaTE Data Curation for Trustworthy AI, https://arxiv.org/abs/2508.14741
  • Wenjie Lin, Jin Wei-Kocsis, 21 Aug 2025, LLM4Sweat: A Trustworthy Large Language Model for Hyperhidrosis Support, https://arxiv.org/abs/2508.15192
  • Yongwoo Song and Minbyul Jeong and Mujeen Sung, 26 Aug 2025, Trustworthy Agents for Electronic Health Records through Confidence Estimation, https://arxiv.org/abs/2508.19096
  • William Jurayj, Nils Holzenberger, Benjamin Van Durme, 28 Aug 2025, Enabling Equitable Access to Trustworthy Financial Reasoning, https://arxiv.org/abs/2508.21051
  • \v{S}imon Kucharsk\'y, Aayush Mishra, Daniel Habermann, Stefan T. Radev, Paul-Christian B\"urkner, 28 Aug 2025, Towards Trustworthy Amortized Bayesian Model Comparison, https://arxiv.org/abs/2508.20614
  • Daocheng Fu, Jianlong Chen, Renqiu Xia, Zijun Chen, Qi Liu, Yuan Feng, Hongbin Zhou, Renrui Zhang, Shiyang Feng, Peng Gao, Hongyuan Zha, Junchi Yan, Botian Shi, Yu Qiao, Bo Zhang, 29 Aug 2025, TrustGeoGen: Formal-Verified Data Engine for Trustworthy Multi-modal Geometric Problem Solving, https://arxiv.org/abs/2504.15780
  • Li Rong Wang, Thomas C. Henderson, Yew Soon Ong, Yih Yng Ng, Xiuyi Fan, 1 Sep 2025, Towards Trustworthy Vital Sign Forecasting: Leveraging Uncertainty for Prediction Intervals, https://arxiv.org/abs/2509.01319
  • Chaoyu Zhang and Heng Jin and Shanghao Shi and Hexuan Yu and Sydney Johns and Y. Thomas Hou and Wenjing Lou, 30 Aug 2025, Enabling Trustworthy Federated Learning via Remote Attestation for Mitigating Byzantine Threats, https://arxiv.org/abs/2509.00634
  • Aivin V. Solatorio, 8 Sep 2025, Proof-Carrying Numbers (PCN): A Protocol for Trustworthy Numeric Answers from LLMs via Claim Verification, https://arxiv.org/abs/2509.06902
  • Teeradaj Racharak, Chaiyong Ragkhitwetsagul, Chommakorn Sontesadisai, Thanwadee Sunetnanta, 8 Sep 2025, Test It Before You Trust It: Applying Software Testing for Trustworthy In-context Learning, https://arxiv.org/abs/2504.18827
  • Zhuoyue Zhang, Haitong Xu, 19 Sep 2025, Explainable AI for Maritime Autonomous Surface Ships (MASS): Adaptive Interfaces and Trustworthy Human-AI Collaboration, https://arxiv.org/abs/2509.15959
  • Meryem Malak Dif, Mouhamed Amine Bouchiha, Abdelaziz Amara Korba, Yacine Ghamri-Doudane, 8 Sep 2025, Towards Trustworthy Agentic IoEV: AI Agents for Explainable Cyberthreat Mitigation and State Analytics, https://arxiv.org/abs/2509.12233
  • Diego Gosmar, Deborah A. Dahl, 18 Sep 2025, Sentinel Agents for Secure and Trustworthy Agentic AI in Multi-Agent Systems, https://arxiv.org/abs/2509.14956
  • Prathamesh Vasudeo Naik, Naresh Kumar Dintakurthi, Zhanghao Hu, Yue Wang, Robby Qiu, 10 Sep 2025, Co-Investigator AI: The Rise of Agentic AI for Smarter, Trustworthy AML Compliance Narratives, https://arxiv.org/abs/2509.08380
  • Shuaidong Pan and Di Wu, 23 Sep 2025, Trustworthy Summarization via Uncertainty Quantification and Risk Awareness in Large Language Models, https://arxiv.org/abs/2510.01231
  • Giuseppina Carannante, Nidhal C.Bouaynaya, Dimah Dera, Hassan M. Fathallah-Shaykh, and Ghulam Rasool, 1 Oct 2025, SUPER-Net: Trustworthy Image Segmentation via Uncertainty Propagation in Encoder-Decoder Networks, https://arxiv.org/abs/2111.05978
  • Haolin Li, Tianjie Dai, Zhe Chen, Siyuan Du, Jiangchao Yao, Ya Zhang, Yanfeng Wang, 24 Sep 2025, RAD: Towards Trustworthy Retrieval-Augmented Multi-modal Clinical Diagnosis, https://arxiv.org/abs/2509.19980
  • Riccardo Guidotti, Martina Cinquini, Marta Marchiori Manerba, Mattia Setzu, Francesco Spinnato, 23 Oct 2025, Towards the Formalization of a Trustworthy AI for Mining Interpretable Models explOiting Sophisticated Algorithms, https://arxiv.org/abs/2510.20621
  • Kanghui Ning, Zijie Pan, Yushan Jiang, Anderson Schneider, Yuriy Nevmyvaka, Dongjin Song, 19 Oct 2025, Towards Interpretable and Trustworthy Time Series Reasoning: A BlueSky Vision, https://arxiv.org/abs/2510.16980
  • David Peer, Sebastian Stabinger, 18 Oct 2025, ATA: A Neuro-Symbolic Approach to Implement Autonomous and Trustworthy Agents, https://arxiv.org/abs/2510.16381
  • Yankai Chen, Xinni Zhang, Yifei Zhang, Yangning Li, Henry Peng Zou, Chunyu Miao, Weizhi Zhang, Xue Liu, Philip S. Yu, 25 Oct 2025, Embracing Trustworthy Brain-Agent Collaboration as Paradigm Extension for Intelligent Assistive Technologies, https://arxiv.org/abs/2510.22095
  • Noor Islam S. Mohammad, 14 Oct 2025, A Multimodal XAI Framework for Trustworthy CNNs and Bias Detection in Deep Representation Learning, https://arxiv.org/abs/2510.12957
  • Karthik Avinash, Nikhil Pareek, Rishav Hada, 15 Oct 2025, Protect: Towards Robust Guardrailing Stack for Trustworthy Enterprise LLM Systems, https://arxiv.org/abs/2510.13351
  • Michal Sadowski, Tadija Radusinovi\'c, Maria Wyrzykowska, Lukasz Sztukiewicz, Jan Rzymkowski, Pawe{\l} W{\l}odarczyk-Pruszy\'nski, Miko{\l}aj Sacha, Piotr Kozakowski, Ruard van Workum, Stanislaw Kamil Jastrzebski, 15 Oct 2025, Trustworthy Retrosynthesis: Eliminating Hallucinations with a Diverse Ensemble of Reaction Scorers, https://arxiv.org/abs/2510.10645
  • Xiaonan Si, Meilin Zhu, Simeng Qin, Lijia Yu, Lijun Zhang, Shuaitong Liu, Xinfeng Li, Ranjie Duan, Yang Liu, Xiaojun Jia, 15 Oct 2025, SeCon-RAG: A Two-Stage Semantic Filtering and Conflict-Free Framework for Trustworthy RAG, https://arxiv.org/abs/2510.09710
  • Yanghe Pan, Yuntao Wang, Shaolong Guo, Chengyu Yin, Ruidong Li, Zhou Su, Yuan Wu, 25 Sep 2025, Trustworthy Semantic Communication for Vehicular Networks: Challenges and Solutions, https://arxiv.org/abs/2509.20830
  • Kristina P. Sinaga, Arjun S. Nair, 28 Sep 2025, Calibration Meets Reality: Making Machine Learning Predictions Trustworthy, https://arxiv.org/abs/2509.23665
  • Yizhuo Ding, Mingkang Chen, Qiuhua Liu, Fenghua Weng, Wanying Qu, Yue Yang, Yugang Jiang, Zuxuan Wu, Yanwei Fu, Wenqi Shao, 5 Oct 2025, COSMO-RL: Towards Trustworthy LMRMs via Joint Safety and Stability, https://arxiv.org/abs/2510.04196
  • Yue wu, 4 Oct 2025, A Trustworthy Industrial Fault Diagnosis Architecture Integrating Probabilistic Models and Large Language Models, https://arxiv.org/abs/2510.03815
  • Xinling Yu, Ziyue Liu, Hai Li, Yixing Li, Xin Ai, Zhiyu Zeng, Ian Young, and Zheng Zhang, 10 Oct 2025, DeepOHeat-v1: Efficient Operator Learning for Fast and Trustworthy Thermal Simulation and Optimization in 3D-IC Design, https://arxiv.org/abs/2504.03955
  • Petros Drineas, Rohit Nema, Rafail Ostrovsky, Vassilis Zikas, 9 Oct 2025, Game of Trust: How Trustworthy Does Your Blockchain Think You Are?, https://arxiv.org/abs/2505.14551
  • Hongwei Zhang, Ji Lu, Shiqing Jiang, Chenxiang Zhu, Li Xie, Chen Zhong, Haoran Chen, Yurui Zhu, Yongsheng Du, Yanqin Gao, Lingjun Huang, Baoli Wang, Fang Tan, and Peng Zou, 24 Oct 2025, Co-Sight: Enhancing LLM-Based Agents via Conflict-Aware Meta-Verification and Trustworthy Reasoning with Structured Facts, https://arxiv.org/abs/2510.21557
  • Stella C. Dong and James R. Finlay, 9 Oct 2025, ClauseLens: Clause-Grounded, CVaR-Constrained Reinforcement Learning for Trustworthy Reinsurance Pricing, https://arxiv.org/abs/2510.08429
  • Chih-Yu Chang, Milad Azvar, Chinedum Okwudire and Raed Al Kontar, 9 Oct 2025, LLINBO: Trustworthy LLM-in-the-Loop Bayesian Optimization, https://arxiv.org/abs/2505.14756
  • Som Sagar, Aditya Taparia, Harsh Mankodiya, Pranav Bidare, Yifan Zhou, Ransalu Senanayake, 8 Oct 2025, BaTCAVe: Trustworthy Explanations for Robot Behaviors, https://arxiv.org/abs/2409.10733
  • Ferdinand Kahenga, Antoine Bagula, Sajal K. Das, Patrick Sello, 23 Sep 2025, FedFiTS: Fitness-Selected, Slotted Client Scheduling for Trustworthy Federated Learning in Healthcare AI, https://arxiv.org/abs/2509.19120

AI Industry Safety Practices

Various papers discuss the practices of the major AI players in the industry, along with issues such as self-governance.

Technical Verification and Testing of AI Safety

Testing and evaluation of AI safety issues:

  • Xiaowei Huang, Marta Kwiatkowska, Sen Wang, and Min Wu. May 2017. Safety verification of deep neural networks. In Computer Aided Verification, pages 3–29, https://arxiv.org/abs/1610.06940
  • D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai, S. Kadavath, B. Mann, E. Perez, N. Schiefer, K. Ndousse, A. Jones, S. Bowman, A. Chen, T. Conerly, N. DasSarma, D. Drain, N. Elhage, S. El-Showk, S. Fort, Z. Hatfield-Dodds, T. Henighan, D. Hernandez, T. Hume, J. Jacobson, S. Johnston, S. Kravec, C. Olsson, S. Ringer, E. Tran-Johnson, D. Amodei, T. Brown, N. Joseph, S. McCandlish, C. Olah, J. Kaplan, and J. Clark. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858, 2022 https://arxiv.org/abs/2209.07858
  • K Ramesh, A Chavan, S Pandit, 2023, A Comparative Study on the Impact of Model Compression Techniques on Fairness in Language Models, Microsoft Research, https://aclanthology.org/2023.acl-long.878.pdf, https://www.microsoft.com/en-us/research/uploads/prod/2023/07/3687_Paper.pdf (Rather than testing full models, this analysis examines optimized models due to quantization, pruning or distillation.)
  • T. Shevlane. Structured access: An emerging paradigm for safe AI deployment. In The Oxford Handbook of AI Governance, 2022, https://arxiv.org/abs/2201.05159
  • E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese, N. McAleese, and G. Irving. 2022, Red teaming language models with language models. arXiv preprint arXiv:2202.03286, https://arxiv.org/abs/2202.03286
  • OpenAI. 2023. Safety best practices. https://platform.openai.com/docs/guides/safety-best-practices
  • William Saunders, Girish Sastry, Andreas Stuhlmueller, and Owain Evans. Trial without error: Towards safe reinforcement learning via human intervention. arXiv preprint arXiv:1707.05173, 2017. https://arxiv.org/abs/1707.05173
  • Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed, Oct 2023, Mistral 7B, https://arxiv.org/abs/2310.06825, Code: https://mistral.ai/news/announcing-mistral-7b/ (Examines guardrails and testing of the safety of the model against harmful inputs.)

AI Factual Inaccuracy

Research papers on accuracy of AI results include:

AI Safety Incidents

Various incidents and accidents related to AI safety issues:

Incident Databases: There are various databases that collect information about AI safety incidents.

Medical Ethics and AI

The use of AI in medicine creates some additional ethical issues:

  • Vollmer S., Mateen B.A., Bohner G., Király F.J., Ghani R., Jonsson P., et al. Machine learning and AI research for patient benefit: 20 critical questions on transparency, replicability, ethics and effectiveness. BMJ. 2018;(368):1–12. https://pubmed.ncbi.nlm.nih.gov/32198138/
  • Cockerill RG., 2020, Ethics Implications of the Use of Artificial Intelligence in Violence Risk Assessment. J Am Acad Psychiatry Law. 2020 Sep;48(3):345-349. doi: 10.29158/JAAPL.003940-20. Epub 2020 May 14. PMID: 32409300, https://pubmed.ncbi.nlm.nih.gov/32409300/
  • Barron DS. 2021, Commentary: the ethical challenges of machine learning in psychiatry: a focus on data, diagnosis, and treatment. Psychol Med. 2021 Nov;51(15):2522-2524. doi: 10.1017/S0033291721001008. Epub 2021 May 12. PMID: 33975655, https://pubmed.ncbi.nlm.nih.gov/33975655/
  • O'Reilly-Shah VN, Gentry KR, Walters AM, Zivot J, Anderson CT, Tighe PJ. 2020, Bias and ethical considerations in machine learning and the automation of perioperative risk assessment. Br J Anaesth. 2020 Dec;125(6):843-846. doi: 10.1016/j.bja.2020.07.040. Epub 2020 Aug 21. PMID: 32838979, https://pubmed.ncbi.nlm.nih.gov/32838979/
  • Buchlak QD, Esmaili N, Leveque JC, Bennett C, Piccardi M, Farrokhi F., 2020, Ethical thinking machines in surgery and the requirement for clinical leadership. Am J Surg. 2020 Nov;220(5):1372-1374. doi: 10.1016/j.amjsurg.2020.06.073. Epub 2020 Jul 8. PMID: 32723487, https://pubmed.ncbi.nlm.nih.gov/32723487/
  • Starke G, De Clercq E, Borgwardt S, Elger BS., 2020, Computing schizophrenia: ethical challenges for machine learning in psychiatry. Psychol Med. 2021 Nov;51(15):2515-2521. doi: 10.1017/S0033291720001683. Epub 2020 Jun 15. PMID: 32536358, https://pubmed.ncbi.nlm.nih.gov/32536358/
  • Jacobson NC, Bentley KH, Walton A, Wang SB, Fortgang RG, Millner AJ, Coombs G 3rd, Rodman AM, Coppersmith DDL., 2020, Ethical dilemmas posed by mobile health and machine learning in psychiatry research. Bull World Health Organ. 2020 Apr 1;98(4):270-276. doi: 10.2471/BLT.19.237107. Epub 2020 Feb 25. PMID: 32284651, https://pubmed.ncbi.nlm.nih.gov/32284651/
  • Johnson SLJ., 2019, AI, Machine Learning, and Ethics in Health Care. J Leg Med. 2019 Oct-Dec;39(4):427-441. doi: 10.1080/01947648.2019.1690604. PMID: 31940250 https://pubmed.ncbi.nlm.nih.gov/31940250/
  • Vayena E, Blasimme A, Cohen IG., 2018, Machine learning in medicine: Addressing ethical challenges. PLoS Med. 2018 Nov 6;15(11):e1002689. doi: 10.1371/journal.pmed.1002689. eCollection 2018 Nov. PMID: 30399149, https://pubmed.ncbi.nlm.nih.gov/30399149/
  • Nabi J., 2018, How Bioethics Can Shape Artificial Intelligence and Machine Learning. Hastings Cent Rep. 2018 Sep;48(5):10-13. doi: 10.1002/hast.895. PMID: 30311202, https://pubmed.ncbi.nlm.nih.gov/30311202/
  • Char DS, Shah NH, Magnus D., 2018, Implementing Machine Learning in Health Care - Addressing Ethical Challenges. N Engl J Med. 2018 Mar 15;378(11):981-983. doi: 10.1056/NEJMp1714229. PMID: 29539284, https://pubmed.ncbi.nlm.nih.gov/29539284/
  • Fiske A, Henningsen P, Buyx A., 2019, Your Robot Therapist Will See You Now: Ethical Implications of Embodied Artificial Intelligence in Psychiatry, Psychology, and Psychotherapy. J Med Internet Res. 2019 May 9;21(5):e13216. doi: 10.2196/13216. PMID: 31094356, https://pubmed.ncbi.nlm.nih.gov/31094356/
  • Beil Michael, Proft Ingo, van Heerden Daniel, Sviri Sigal, van Heerden Peter Vernon. 2019, Ethical considerations about artificial intelligence for prognostication in intensive care. Intensive Care Medicine Experimental. 2019;7:70. http://www.ncbi.nlm.nih.gov/pmc/articles/pmc6904702/, https://pubmed.ncbi.nlm.nih.gov/31823128/
  • Lasse Benzinger, Frank Ursin, Wolf-Tilo Balke, Tim Kacprowski & Sabine Salloch, 2023, Should Artificial Intelligence be used to support clinical ethical decision-making? A systematic review of reasons BMC Medical Ethics volume 24, Article number: 48 (2023), https://doi.org/10.1186/s12910-023-00929-6
  • Rachel Dlugatch, Antoniya Georgieva & Angeliki Kerasidou, 2023, Trustworthy artificial intelligence and ethical design: public perceptions of trustworthiness of an AI-based decision-support tool in the context of intrapartum care, BMC Medical Ethics Open Access 20 June 2023, https://doi.org/10.1186/s12910-023-00917-w
  • Dzobo K, Adotey S, Thomford NE, Dzobo W. Integrating Artificial and Human Intelligence: A Partnership for Responsible Innovation in Biomedical Engineering and Medicine. OMICS. 2020 May;24(5):247-263. doi: 10.1089/omi.2019.0038. Epub 2019 Jul 16. PMID: 31313972, https://pubmed.ncbi.nlm.nih.gov/31313972/
  • McCradden MD, Joshi S, Mazwi M, Anderson JA., 2020, Ethical limitations of algorithmic fairness solutions in health care machine learning. Lancet Digit Health. 2020 May;2(5):e221-e223. doi: 10.1016/S2589-7500(20)30065-0. PMID: 33328054, https://pubmed.ncbi.nlm.nih.gov/33328054/
  • Kulikowski CA., 2019, Beginnings of Artificial Intelligence in Medicine (AIM): Computational Artifice Assisting Scientific Inquiry and Clinical Art - with Reflections on Present AIM Challenges. Yearb Med Inform. 2019 Aug;28(1):249-256. doi: 10.1055/s-0039-1677895. Epub 2019 Apr 25. PMID: 31022744, https://pubmed.ncbi.nlm.nih.gov/31022744/
  • Park S.H., Kim Y.H., Lee J.Y., Yoo S., Kim C.J. Ethical challenges regarding artificial intelligence in medicine from the perspective of scientific editing and peer review. Science Editing. 2019;6:91–98. https://www.semanticscholar.org/paper/Ethical-challenges-regarding-artificial-in-medicine-Park-Kim/7a5b3c84c6f5d16e68eaf17989b0debfd4ba57d0

Data Leakage

Data leakage refers to the AI accidentally causing the leak of data that you'd prefer was kept confidential. The "leak" can actually be caused by the LLM, or by the user, depending on the context. There are various ways this can occur:

  • Uploading confidential data in AI queries (User data leakage)
  • Training or fine-tuning data containing proprietary information (Training data leakage)
  • RAG datastore documents containing proprietary information (RAG data leakage)

In the context of an LLM output leaking, this refers to where internal company IP is accidentally "leaked" to the public by training the AI with documents containing internal information. The AI is not smart enough to note when it shouldn't be reading a document, and anything that goes into the training dataset, or in the RAG datastore, will be shown to users.

User data leakage is where company users are sending proprietary information to a third-party AI engine. In theory, this data is protected by the confidentiality practices of the LLM company. This issue is similar to having company staff emitting confidential information in their Google queries, but the issue is more problematic because AI queries can upload entire documents to be analyzed by the LLM, such as when doing grammar checking with an LLM.

Research papers on data leakage:

Refusal

Refusal refers to the way that an LLM will politely decline to answer an inappropriate question. There are all types of questions that we don't want an LLM to respond to, and this requires training to achieve.

  • Andy Arditi, Oscar Obeso, Aaquib111, wesg, Neel Nanda, 27th Apr 2024, Refusal in LLMs is mediated by a single direction, LessWrong, https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction
  • Maxime Labonne June 13, 2024 Uncensor any LLM with abliteration, https://huggingface.co/blog/mlabonne/abliteration
  • NVIDIA, June 2024, Nemotron-4 340B Technical Report, https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T_0.pdf (Architecture is decoder-only with GQA, SentencePiece tokenizer, causal attention masks, RoPE, 96 layers, 96 heads, 8 KV heads, 256,000 vocabulary, 18432 internal dimension, context window 4096, and uses squared RELU.)
  • Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, Nouha Dziri, 26 Jun 2024, WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs, https://arxiv.org/abs/2406.18495
  • Maksym Andriushchenko, Nicolas Flammarion, 16 Jul 2024, Does Refusal Training in LLMs Generalize to the Past Tense? https://arxiv.org/abs/2407.11969 Code: https://github.com/tml-epfl/llm-past-tense
  • Kylie Robison, Jul 20, 2024, OpenAI’s latest model will block the ‘ignore all previous instructions’ loophole, https://www.theverge.com/2024/7/19/24201414/openai-chatgpt-gpt-4o-prompt-injection-instruction-hierarchy
  • Xinyi Hou, Yanjie Zhao, Haoyu Wang, 3 Aug 2024, Voices from the Frontier: A Comprehensive Analysis of the OpenAI Developer Forum, https://arxiv.org/abs/2408.01687
  • Asir Saadat, Tasmia Binte Sogir, Md Taukir Azam Chowdhury, Syem Aziz, 16 Oct 2024, When Not to Answer: Evaluating Prompts on GPT Models for Effective Abstention in Unanswerable Math Word Problems, https://arxiv.org/abs/2410.13029
  • Kyle O'Brien, David Majercak, Xavier Fernandes, Richard Edgar, Jingya Chen, Harsha Nori, Dean Carignan, Eric Horvitz, Forough Poursabzi-Sangde, 18 Nov 2024, Steering Language Model Refusal with Sparse Autoencoders, https://arxiv.org/abs/2411.11296
  • Mohit Sewak, Dec 6, 2024, Prompt Injection Attacks on Large Language Models, https://pub.towardsai.net/prompt-injection-attacks-on-large-language-models-bd8062fa1bb7
  • Yue Liu, Hongcheng Gao, Shengfang Zhai, Jun Xia, Tianyi Wu, Zhiwei Xue, Yulin Chen, Kenji Kawaguchi, Jiaheng Zhang, Bryan Hooi, 30 Jan 2025, GuardReasoner: Towards Reasoning-based LLM Safeguards, https://arxiv.org/abs/2501.18492
  • Holistic AI Team, March 6, 2025, Anthropic’s Claude 3.7 Sonnet Jailbreaking & Red Teaming Audit: The Most Secure Model Yet? https://www.holisticai.com/blog/claude-3-7-sonnet-jailbreaking-audit
  • Vishnu Kabir Chhabra, Mohammad Mahdi Khalili, 5 Apr 2025, Towards Understanding and Improving Refusal in Compressed Models via Mechanistic Interpretability, https://arxiv.org/abs/2504.04215
  • Wojciech Zaremba, Evgenia Nitishinskaya, Boaz Barak, Stephanie Lin, Sam Toyer, Yaodong Yu, Rachel Dias, Eric Wallace, Kai Xiao, Johannes Heidecke, Amelia Glaese, 31 Jan 2025, Trading Inference-Time Compute for Adversarial Robustness, https://arxiv.org/abs/2501.18841
  • Hongzhe Du, Weikai Li, Min Cai, Karim Saraipour, Zimin Zhang, Himabindu Lakkaraju, Yizhou Sun, Shichang Zhang, 11 Aug 2025, How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence, https://arxiv.org/abs/2504.02904
  • Haonan Zhang, Dongxia Wang, Yi Liu, Kexin Chen, Jiashui Wang, Xinlei Ying, Long Liu, Wenhai Wang, 15 Aug 2025, ORFuzz: Fuzzing the "Other Side" of LLM Safety -- Testing Over-Refusal, https://arxiv.org/abs/2508.11222
  • Yuan Yuan, Tina Sriskandarajah, Anna-Luisa Brakman, Alec Helyar, Alex Beutel, Andrea Vallone, Saachi Jain, 12 Aug 2025, From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training, https://arxiv.org/abs/2508.09224
  • Ranjie Duan, Jiexi Liu, Xiaojun Jia, Shiji Zhao, Ruoxi Cheng, Fengxiang Wang, Cheng Wei, Yong Xie, Chang Liu, Defeng Li, Yinpeng Dong, Yichi Zhang, Yuefeng Chen, Chongwen Wang, Xingjun Ma, Xingxing Wei, Yang Liu, Hang Su, Jun Zhu, Xinfeng Li, Yitong Sun, Jie Zhang, Jinzhao Hu, Sha Xu, Yitong Yang, Jialing Tao, Hui Xue, 4 Sep 2025, Oyster-I: Beyond Refusal -- Constructive Safety Alignment for Responsible Language Models, https://arxiv.org/abs/2509.01909
  • Md Abdullah Al Mamun, Ihsen Alouani, Nael Abu-Ghazaleh, 28 Aug 2025, Poison Once, Refuse Forever: Weaponizing Alignment for Injecting Bias in LLMs, https://arxiv.org/abs/2508.20333
  • Neel Jain, Aditya Shrivastava, Chenyang Zhu, Daben Liu, Alfy Samuel, Ashwinee Panda, Anoop Kumar, Micah Goldblum, Tom Goldstein, 29 Aug 2025, Refusal Tokens: A Simple Way to Calibrate Refusals in Large Language Models, https://arxiv.org/abs/2412.06748
  • Nirmalendu Prakash, Yeo Wei Jie, Amir Abdullah, Ranjan Satapathy, Erik Cambria, Roy Ka Wei Lee, 7 Sep 2025, Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal, https://arxiv.org/abs/2509.09708
  • Wenbo Pan, Jie Xu, Qiguang Chen, Junhao Dong, Libo Qin, Xinfeng Li, Haining Yu, Xiaohua Jia, 2 Oct 2025, Can LLMs Refuse Questions They Do Not Know? Measuring Knowledge-Aware Refusal in Factual Tasks, https://arxiv.org/abs/2510.01782
  • Huizhen Shu, Xuying Li, Zhuo Li, 24 Sep 2025, LatentGuard: Controllable Latent Steering for Robust Refusal of Attacks and Reliable Response Generation, https://arxiv.org/abs/2509.19839
  • Ziheng Cheng, Yixiao Huang, Hui Xu, Somayeh Sojoudi, Xuandong Zhao, Dawn Song, Song Mei, 25 Oct 2025, OVERT: A Benchmark for Over-Refusal Evaluation on Text-to-Image Models, https://arxiv.org/abs/2505.21347
  • Sha Luo, Sang Jung Kim, Zening Duan, Kaiping Chen, 27 Oct 2025, Refusal as Silence: Gendered Disparities in Vision-Language Model Responses, https://arxiv.org/abs/2406.08222
  • Ravi Pandya, Madison Bland, Duy P. Nguyen, Changliu Liu, Jaime Fern\'andez Fisac, Andrea Bajcsy, 15 Oct 2025, From Refusal to Recovery: A Control-Theoretic Approach to Generative AI Guardrails, https://arxiv.org/abs/2510.13727
  • Vincent Siu and Nathan W. Henry and Nicholas Crispino and Yang Liu and Dawn Song and Chenguang Wang, 20 Oct 2025, RepIt: Steering Language Models with Concept-Specific Refusal Vectors, https://arxiv.org/abs/2509.13281
  • Eric Hanchen Jiang, Weixuan Ou, Run Liu, Shengyuan Pang, Guancheng Wan, Ranjie Duan, Wei Dong, Kai-Wei Chang, XiaoFeng Wang, Ying Nian Wu, Xinfeng Li, 9 Oct 2025, Energy-Driven Steering: Reducing False Refusals in Large Language Models, https://arxiv.org/abs/2510.08646
  • Aashiq Muhamed, Leonardo F. R. Ribeiro, Markus Dreyer, Virginia Smith, Mona T. Diab, 12 Oct 2025, RefusalBench: Generative Evaluation of Selective Refusal in Grounded Language Models, https://arxiv.org/abs/2510.10390
  • Thijs Willems, Sumbul Khan, Qian Huang, Bradley Camburn, Nachamma Sockalingam, King Wang Poon, 22 Oct 2025, To Use or to Refuse? Re-Centering Student Agency with Generative AI in Engineering Design Education, https://arxiv.org/abs/2510.19342
  • Cheng-Kuang Wu, Zhi Rui Tam, Chieh-Yen Lin, Yun-Nung Chen, Hung-yi Lee, 30 Sep 2025, Answer, Refuse, or Guess? Investigating Risk-Aware Decision Making in Language Models, https://arxiv.org/abs/2503.01332
  • Qingyu Yin, Chak Tou Leong, Linyi Yang, Wenxuan Huang, Wenjie Li, Xiting Wang, Jaehong Yoon, YunXing, XingYu, Jinjin Gu, 7 Oct 2025, Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning?, https://arxiv.org/abs/2510.06036

Guardrails

  • Aarushi Kansal, Chapter 4: Guardrails and AI: Building Safe and Controllable Apps, Building Generative AI-Powered Apps: A Hands-on Guide for Developers, Apress, https://www.amazon.com/Building-Generative-AI-Powered-Apps-Hands-ebook/dp/B0CTXXP1S4/
  • Meta, July 2024 (accessed), Llama: Making safety tools accessible to everyone, https://llama.meta.com/trust-and-safety/
  • Chip Huyen, Jul 25, 2024, Building A Generative AI Platform, https://huyenchip.com/2024/07/25/genai-platform.html
  • Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, Lina Yao, 6 Jul 2024, AI Safety in Generative AI Large Language Models: A Survey, https://arxiv.org/abs/2407.18369
  • Marko Zivkovic, Aug 06, 2024, Discovered Apple Intelligence prompts show Apple's attempt at preventing AI disaster, https://appleinsider.com/articles/24/08/06/discovered-apple-intelligence-prompts-show-apples-attempt-at-preventing-ai-disaster
  • Rachel Curry, Aug 28 2024, Why companies including JPMorgan and Walmart are opting for internal gen AI assistants after initially restricting usage, https://www.cnbc.com/2024/08/28/why-jpmorgan-and-walmart-are-opting-for-internal-gen-ai-assistants.html
  • Shayne Longpre, Stella Biderman, Alon Albalak, Hailey Schoelkopf, Daniel McDuff, Sayash Kapoor, Kevin Klyman, Kyle Lo, Gabriel Ilharco, Nay San, Maribeth Rauh, Aviya Skowron, Bertie Vidgen, Laura Weidinger, Arvind Narayanan, Victor Sanh, David Adelani, Percy Liang, Rishi Bommasani, Peter Henderson, Sasha Luccioni, Yacine Jernite, Luca Soldaini, 26 Jun 2024 (v2), The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources, https://arxiv.org/abs/2406.16746
  • Jason Perlow, Nov. 6, 2024, The best open-source AI models: All your free-to-use options explained: Here are the best open-source and free-to-use AI models for text, images, and audio, organized by type, application, and licensing considerations. https://www.zdnet.com/article/the-best-open-source-ai-models-all-your-free-to-use-options-explained/
  • McKinsey, November 14, 2024, What are AI guardrails? https://www.mckinsey.com/featured-insights/mckinsey-explainers/what-are-ai-guardrails
  • Aditi Bodhankar, Dec 06, 2024, Content Moderation and Safety Checks with NVIDIA NeMo Guardrails, https://developer.nvidia.com/blog/content-moderation-and-safety-checks-with-nvidia-nemo-guardrails/
  • Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Martín Santillán Cooper, Kieran Fraser, Giulio Zizzo, Muhammad Zaid Hameed, Mark Purcell, Michael Desmond, Qian Pan, Inge Vejsbjerg, Elizabeth M. Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R. Varshney, Prasanna Sattigeri, 10 Dec 2024, Granite Guardian, https://arxiv.org/abs/2412.07724 https://github.com/ibm-granite/granite-guardian (Open-sourcing of safety models with many capabilities.)
  • Rama Akkiraju, Anbang Xu, Deepak Bora, Tan Yu, Lu An, Vishal Seth, Aaditya Shukla, Pritam Gundecha, Hridhay Mehta, Ashwin Jha, Prithvi Raj, Abhinav Balasubramanian, Murali Maram, Guru Muthusamy, Shivakesh Reddy Annepally, Sidney Knowles, Min Du, Nick Burnett, Sean Javiya, Ashok Marannan, Mamta Kumari, Surbhi Jha, Ethan Dereszenski, Anupam Chakraborty, Subhash Ranjan, Amina Terfai, Anoop Surya, Tracey Mercer, Vinodh Kumar Thanigachalam, Tamar Bar, Sanjana Krishnan, Samy Kilaru, Jasmine Jaksic, Nave Algarici, Jacob Liberman, Joey Conway, Sonu Nayyar, Justin Boitano, 10 Jul 2024, FACTS About Building Retrieval Augmented Generation-based Chatbots, NVIDIA Research, https://arxiv.org/abs/2407.07858
  • Aditi Bodhankar, Jan 16, 2025, How to Safeguard AI Agents for Customer Service with NVIDIA NeMo Guardrails, https://developer.nvidia.com/blog/how-to-safeguard-ai-agents-for-customer-service-with-nvidia-nemo-guardrails/
  • Yue Liu, Hongcheng Gao, Shengfang Zhai, Jun Xia, Tianyi Wu, Zhiwei Xue, Yulin Chen, Kenji Kawaguchi, Jiaheng Zhang, Bryan Hooi, 30 Jan 2025, GuardReasoner: Towards Reasoning-based LLM Safeguards, https://arxiv.org/abs/2501.18492
  • Aditi Bodhankar, Mar 03, 2025, Measuring the Effectiveness and Performance of AI Guardrails in Generative AI Applications, https://developer.nvidia.com/blog/measuring-the-effectiveness-and-performance-of-ai-guardrails-in-generative-ai-applications/
  • Manuel Cossio, 3 Aug 2025, A comprehensive taxonomy of hallucinations in Large Language Models, https://arxiv.org/abs/2508.01781
  • Anthropic, 13 Aug 2025, Building Safeguards for Claude, https://www.anthropic.com/news/building-safeguards-for-claude
  • Yuksel Aydin, 9 Aug 2025, Cognitive Cybersecurity for Artificial Intelligence: Guardrail Engineering with CCS-7, https://arxiv.org/abs/2508.10033
  • Boyuan Zheng, Zeyi Liao, Scott Salisbury, Zeyuan Liu, Michael Lin, Qinyuan Zheng, Zifan Wang, Xiang Deng, Dawn Song, Huan Sun, Yu Su, 18 Jul 2025, WebGuard: Building a Generalizable Guardrail for Web Agents, https://arxiv.org/abs/2507.14293
  • Cheng-Fu Yang, Thanh Tran, Christos Christodoulopoulos, Weitong Ruan, Rahul Gupta, Kai-Wei Chang, 28 Jul 2025, Customize Multi-modal RAI Guardrails with Precedent-based predictions, https://arxiv.org/abs/2507.20503
  • Chad DeLuca, Anna Lisa Gentile, Shubhi Asthana, Bing Zhang, Pawan Chowdhary, Kellen Cheng, Basel Shbita, Pengyuan Li, Guang-Jie Ren, Sandeep Gopisetty, 25 Jul 2025, OneShield - the Next Generation of LLM Guardrails, https://arxiv.org/abs/2507.21170
  • Hannah-Beth Clark, Laura Benton, Emma Searle, Margaux Dowland, Matthew Gregory, Will Gayne and John Roberts, 7 Aug 2025, Building Effective Safety Guardrails in AI Education Tools, https://arxiv.org/abs/2508.05360
  • Alexander W. Lee, Justin Chan, Michael Fu, Nicolas Kim, Akshay Mehta, Deepti Raghavan, Ugur Cetintemel, 7 Aug 2025, Semantic Integrity Constraints: Declarative Guardrails for AI-Augmented Data Processing Systems, https://arxiv.org/abs/2503.00600
  • Darpan Aswal and C\'eline Hudelot, 22 Aug 2025, LLMSymGuard: A Symbolic Safety Guardrail Framework Leveraging Interpretable Jailbreak Concepts, https://arxiv.org/abs/2508.16325
  • Jun Zhuang, Haibo Jin, Ye Zhang, Zhengjian Kang, Wenbin Zhang, Gaby G. Dagher, Haohan Wang, 25 Aug 2025, Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation, https://arxiv.org/abs/2505.18556
  • Kellen Tan Cheng, Anna Lisa Gentile, Chad DeLuca, Guang-Jie Ren, 25 Aug 2025, Backprompting: Leveraging Synthetic Production Data for Health Advice Guardrails, https://arxiv.org/abs/2508.18384
  • Victoria R. Li and Yida Chen and Naomi Saphra, 26 Aug 2025, ChatGPT Doesn't Trust Chargers Fans: Guardrail Sensitivity in Context, https://arxiv.org/abs/2407.06866
  • Monte Hoover, Vatsal Baherwani, Neel Jain, Khalid Saifullah, Joseph Vincent, Chirag Jain, Melissa Kazemi Rad, C. Bayan Bruss, Ashwinee Panda, Tom Goldstein, 2 Sep 2025, DynaGuard: A Dynamic Guardrail Model With User-Defined Policies, https://arxiv.org/abs/2509.02563
  • Nils Durner, 25 Sep 2025, In AI Sweet Harmony: Sociopragmatic Guardrail Bypasses and Evaluation-Awareness in OpenAI gpt-oss-20b, https://arxiv.org/abs/2510.01259
  • Peiyang Xu, Minzhou Pan, Zhaorun Chen, Shuang Yang, Chaowei Xiao, Bo Li, 28 Oct 2025, SafeVision: Efficient Image Guardrail with Robust Policy Adherence and Explainability, https://arxiv.org/abs/2510.23960
  • Giacomo Bertollo, Naz Bodemir, Jonah Burgess, 14 Oct 2025, Breaking Guardrails, Facing Walls: Insights on Adversarial AI for Defenders & Researchers, https://arxiv.org/abs/2510.16005
  • Bingjie Zhang, Yibo Yang, Zhe Ren, Dandan Guo, Jindong Gu, Philip Torr, Bernard Ghanem, 27 Oct 2025, A Guardrail for Safety Preservation: When Safety-Sensitive Subspace Meets Harmful-Resistant Null-Space, https://arxiv.org/abs/2510.14301
  • Ravi Pandya, Madison Bland, Duy P. Nguyen, Changliu Liu, Jaime Fern\'andez Fisac, Andrea Bajcsy, 15 Oct 2025, From Refusal to Recovery: A Control-Theoretic Approach to Generative AI Guardrails, https://arxiv.org/abs/2510.13727
  • Wenyuan Chen, Fateme Nateghi Haredasht, Kameron C. Black, Francois Grolleau, Emily Alsentzer, Jonathan H. Chen, and Stephen P. Ma, 26 Sep 2025, Retrieval-Augmented Guardrails for AI-Drafted Patient-Portal Messages: Error Taxonomy Construction and Large-Scale Evaluation, https://arxiv.org/abs/2509.22565
  • Yaozu Wu, Jizhou Guo, Dongyuan Li, Henry Peng Zou, Wei-Chieh Huang, Yankai Chen, Zhen Wang, Weizhi Zhang, Yangning Li, Meng Zhang, Renhe Jiang, Philip S. Yu, 28 Sep 2025, PSG-Agent: Personality-Aware Safety Guardrail for LLM-based Agents, https://arxiv.org/abs/2509.23614
  • Gauri Kholkar, Ratinder Ahuja, 28 Sep 2025, The AI Agent Code of Conduct: Automated Guardrail Policy-as-Prompt Synthesis, https://arxiv.org/abs/2509.23994
  • ChenYu Wu, Yi Wang, Yang Liao, 16 Oct 2025, Active Honeypot Guardrail System: Probing and Confirming Multi-Turn LLM Jailbreaks, https://arxiv.org/abs/2510.15017
  • Xiaofei Wen, Wenjie Jacky Mo, Yanan Xie, Peng Qi, Muhao Chen, 3 Oct 2025, Towards Policy-Compliant Agents: Learning Efficient Guardrails For Policy Violation Detection, https://arxiv.org/abs/2510.03485
  • Yingzhi Mao (1 and 2), Chunkang Zhang (1 and 2), Junxiang Wang (1), Xinyan Guan (1 and 2), Boxi Cao (1), Yaojie Lu (1), Hongyu Lin (1), Xianpei Han (1 and 2), Le Sun (1 and 2) ((1) Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, (2) University of Chinese Academy of Sciences), 24 Oct 2025, When Models Outthink Their Safety: Mitigating Self-Jailbreak in Large Reasoning Models with Chain-of-Guardrails, https://arxiv.org/abs/2510.21285
  • Yue Huang, Hang Hua, Yujun Zhou, Pengcheng Jing, Manish Nagireddy, Inkit Padhi, Greta Dolcetti, Zhangchen Xu, Subhajit Chaudhury, Ambrish Rawat, Liubov Nedoshivina, Pin-Yu Chen, Prasanna Sattigeri, Xiangliang Zhang, 10 Oct 2025, Building a Foundational Guardrail for General Agentic Systems via Synthetic Data, https://arxiv.org/abs/2510.09781
  • Wei-Chieh Huang, Henry Peng Zou, Yaozu Wu, Dongyuan Li, Yankai Chen, Weizhi Zhang, Yangning Li, Angelo Zangari, Jizhou Guo, Chunyu Miao, Liancheng Fang, Langzhou He, Renhe Jiang, Philip S. Yu, 13 Oct 2025, DeepResearchGuard: Deep Research with Open-Domain Evaluation and Multi-Stage Guardrails for Safety, https://arxiv.org/abs/2510.10994
  • Prithviraj Singh Shahani, Kaveh Eskandari Miandoab, and Matthias Scheutz, 12 Oct 2025, Noise Injection Systemically Degrades Large Language Model Safety Guardrails, https://arxiv.org/abs/2505.13500
  • Yining She, Daniel W. Peterson, Marianne Menglin Liu, Vikas Upadhyay, Mohammad Hossein Chaghazardi, Eunsuk Kang, Dan Roth, 6 Oct 2025, RAG Makes Guardrails Unsafe? Investigating Robustness of Guardrails under RAG-style Contexts, https://arxiv.org/abs/2510.05310
  • Olga E. Sorokoletova, Francesco Giarrusso, Vincenzo Suriani, Daniele Nardi, 14 Oct 2025, Guarding the Guardrails: A Taxonomy-Driven Approach to Jailbreak Detection, https://arxiv.org/abs/2510.13893
  • Xunguang Wang, Zhenlan Ji, Wenxuan Wang, Zongjie Li, Daoyuan Wu, Shuai Wang, 16 Oct 2025, SoK: Evaluating Jailbreak Guardrails for Large Language Models, https://arxiv.org/abs/2506.10597

Jailbreak

Jailbreaking is the hack of using English to break into a computer system. Actually, it's not so much a violation of the server, but it does refer to a way of getting the LLM to answer questions that its developer probably doesn't want it to. In other words, it's a trick to bypass the "refusal" module of an LLM.

  • Andy Arditi, Oscar Obeso, Aaquib111, wesg, Neel Nanda, 27th Apr 2024, Refusal in LLMs is mediated by a single direction, LessWrong, https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction
  • Adva Nakash Peleg, May 30, 2024, An LLM Journey: From POC to Production, https://medium.com/cyberark-engineering/an-llm-journey-from-poc-to-production-6c5ec6a172fb
  • Yu Wang, Xiaogeng Liu, Yu Li, Muhao Chen, Chaowei Xiao, 14 Mar 2024, AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting, https://arxiv.org/abs/2403.09513 Code: https://github.com/rain305f/AdaShield
  • Jinhwa Kim, Ali Derakhshan, Ian G. Harris, 31 Oct 2023, Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield, https://arxiv.org/abs/2311.00172
  • Zixuan Ni, Longhui Wei, Jiacheng Li, Siliang Tang, Yueting Zhuang, Qi Tian, 8 Aug 2023 (v2), Degeneration-Tuning: Using Scrambled Grid shield Unwanted Concepts from Stable Diffusion, https://arxiv.org/abs/2308.02552
  • Xiao Peng, Tao Liu, Ying Wang, 3 Jun 2024 (v2), Genshin: General Shield for Natural Language Processing with Large Language Models, https://arxiv.org/abs/2405.18741
  • Ayushi Nirmal, Amrita Bhattacharjee, Paras Sheth, Huan Liu, 8 May 2024 ( v2), Towards Interpretable Hate Speech Detection using Large Language Model-extracted Rationales, https://arxiv.org/abs/2403.12403 Code: https://github.com/AmritaBh/shield
  • Shweta Sharma, 27 Jun 2024, Microsoft warns of ‘Skeleton Key’ jailbreak affecting many generative AI models, https://www.csoonline.com/article/2507702/microsoft-warns-of-novel-jailbreak-affecting-many-generative-ai-models.html
  • Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, Nouha Dziri, 26 Jun 2024, WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs, https://arxiv.org/abs/2406.18495
  • Maksym Andriushchenko, Nicolas Flammarion, 16 Jul 2024, Does Refusal Training in LLMs Generalize to the Past Tense? https://arxiv.org/abs/2407.11969 Code: https://github.com/tml-epfl/llm-past-tense
  • Kylie Robison, Jul 20, 2024, OpenAI’s latest model will block the ‘ignore all previous instructions’ loophole, https://www.theverge.com/2024/7/19/24201414/openai-chatgpt-gpt-4o-prompt-injection-instruction-hierarchy
  • Chip Huyen, Jul 25, 2024, Building A Generative AI Platform, https://huyenchip.com/2024/07/25/genai-platform.html
  • Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, Lina Yao, 6 Jul 2024, AI Safety in Generative AI Large Language Models: A Survey, https://arxiv.org/abs/2407.18369
  • Ayush RoyChowdhury, Mulong Luo,, Prateek Sahu,, Sarbartha Banerjee, Mohit Tiwari, Aug 2024, ConfusedPilot: Confused Deputy Risks in RAG-based LLMs, https://confusedpilot.info/confused_pilot_new.pdf
  • Dr. Ashish Bamania, Sep 2024, ‘MathPrompt’ Embarassingly Jailbreaks All LLMs Available On The Market Today. A deep dive into how a novel LLM Jailbreaking technique called ‘MathPrompt’ works, why it is so effective, and why it needs to be patched as soon as possible to prevent harmful LLM content generation, https://bamania-ashish.medium.com/mathprompt-embarassingly-jailbreaks-all-llms-available-on-the-market-today-d749da26c6e8
  • Y. Bai et al., "Backdoor Attack and Defense on Deep Learning: A Survey," in IEEE Transactions on Computational Social Systems, doi: 10.1109/TCSS.2024.3482723. https://ieeexplore.ieee.org/abstract/document/10744415
  • Steve Jones, Oct 3, 2024, LLM Prompt Injection: Never send the request to the model. Classify, rewrite and reject, https://blog.metamirror.io/llm-prompt-injection-never-send-the-request-to-the-model-e8017269b96a
  • Emet Bethany, Mazal Bethany, Juan Arturo Nolazco Flores, Sumit Kumar Jha, Peyman Najafirad, 5 Nov 2024 (v2), Jailbreaking Large Language Models with Symbolic Mathematics, https://arxiv.org/abs/2409.11445
  • Alwin Peng, Julian Michael, Henry Sleight, Ethan Perez, Mrinank Sharma, 12 Nov 2024, Rapid Response: Mitigating LLM Jailbreaks with a Few Examples, https://arxiv.org/abs/2411.07494
  • Kyle O'Brien, David Majercak, Xavier Fernandes, Richard Edgar, Jingya Chen, Harsha Nori, Dean Carignan, Eric Horvitz, Forough Poursabzi-Sangde, 18 Nov 2024, Steering Language Model Refusal with Sparse Autoencoders, https://arxiv.org/abs/2411.11296
  • Zachary Coalson, Jeonghyun Woo, Shiyang Chen, Yu Sun, Lishan Yang, Prashant Nair, Bo Fang, Sanghyun Hong, 10 Dec 2024, PrisonBreak: Jailbreaking Large Language Models with Fewer Than Twenty-Five Targeted Bit-flips, https://arxiv.org/abs/2412.07192
  • Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Martín Santillán Cooper, Kieran Fraser, Giulio Zizzo, Muhammad Zaid Hameed, Mark Purcell, Michael Desmond, Qian Pan, Inge Vejsbjerg, Elizabeth M. Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R. Varshney, Prasanna Sattigeri, 10 Dec 2024, Granite Guardian, https://arxiv.org/abs/2412.07724 https://github.com/ibm-granite/granite-guardian (Open-sourcing of safety models with many capabilities.)
  • Mohit Sewak, Dec 6, 2024, Prompt Injection Attacks on Large Language Models, https://pub.towardsai.net/prompt-injection-attacks-on-large-language-models-bd8062fa1bb7
  • Sicheng Zhu, Brandon Amos, Yuandong Tian, Chuan Guo, Ivan Evtimov, 13 Dec 2024, AdvPrefix: An Objective for Nuanced LLM Jailbreaks, https://arxiv.org/abs/2412.10321
  • Aditi Bodhankar, Jan 16, 2025, How to Safeguard AI Agents for Customer Service with NVIDIA NeMo Guardrails, https://developer.nvidia.com/blog/how-to-safeguard-ai-agents-for-customer-service-with-nvidia-nemo-guardrails/
  • Xin Yi, Yue Li, Linlin Wang, Xiaoling Wang, Liang He, 18 Jan 2025, Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks, https://arxiv.org/abs/2501.10639
  • Yue Liu, Hongcheng Gao, Shengfang Zhai, Jun Xia, Tianyi Wu, Zhiwei Xue, Yulin Chen, Kenji Kawaguchi, Jiaheng Zhang, Bryan Hooi, 30 Jan 2025, GuardReasoner: Towards Reasoning-based LLM Safeguards, https://arxiv.org/abs/2501.18492
  • Taryn Plumb, February 3, 2025, Anthropic claims new AI security method blocks 95% of jailbreaks, invites red teamers to try, https://venturebeat.com/security/anthropic-claims-new-ai-security-method-blocks-95-of-jailbreaks-invites-red-teamers-to-try/
  • Holistic AI Team, March 6, 2025, Anthropic’s Claude 3.7 Sonnet Jailbreaking & Red Teaming Audit: The Most Secure Model Yet? https://www.holisticai.com/blog/claude-3-7-sonnet-jailbreaking-audit
  • Jiacheng Liang, Tanqiu Jiang, Yuhui Wang, Rongyi Zhu, Fenglong Ma, Ting Wang, 16 May 2025, AutoRAN: Weak-to-Strong Jailbreaking of Large Reasoning Models, https://arxiv.org/abs/2505.10846
  • Wojciech Zaremba, Evgenia Nitishinskaya, Boaz Barak, Stephanie Lin, Sam Toyer, Yaodong Yu, Rachel Dias, Eric Wallace, Kai Xiao, Johannes Heidecke, Amelia Glaese, 31 Jan 2025, Trading Inference-Time Compute for Adversarial Robustness, https://arxiv.org/abs/2501.18841
  • Manuel Cossio, 3 Aug 2025, A comprehensive taxonomy of hallucinations in Large Language Models, https://arxiv.org/abs/2508.01781
  • Anthropic, 13 Aug 2025, Building Safeguards for Claude, https://www.anthropic.com/news/building-safeguards-for-claude
  • Wenpeng Xing, Mohan Li, Chunqiang Hu, Haitao XuNingyu Zhang, Bo Lin, Meng Han, 8 Aug 2025, Latent Fusion Jailbreak: Blending Harmful and Harmless Representations to Elicit Unsafe LLM Outputs, https://arxiv.org/abs/2508.10029
  • Fan Yang, 9 Aug 2025, The Cost of Thinking: Increased Jailbreak Risk in Large Language Models, https://arxiv.org/abs/2508.10032
  • Xiaoxue Yang, Jaeha Lee, Anna-Katharina Dick, Jasper Timm, Fei Xie, Diogo Cruz, 11 Aug 2025, Multi-Turn Jailbreaks Are Simpler Than They Seem, https://arxiv.org/abs/2508.07646
  • Xianjun Yang, Liqiang Xiao, Shiyang Li, Faisal Ladhak, Hyokun Yun, Linda Ruth Petzold, Yi Xu, William Yang Wang, 9 Aug 2025, Many-Turn Jailbreaking, https://arxiv.org/abs/2508.06755
  • Xuancun Lu, Zhengxian Huang, Xinfeng Li, Chi Zhang, Xiaoyu ji, Wenyuan Xu, 11 Aug 2025, POEX: Towards Policy Executable Jailbreak Attacks Against the LLM-based Robots, https://arxiv.org/abs/2412.16633
  • Tatia Tsmindashvili, Ana Kolkhidashvili, Dachi Kurtskhalia, Nino Maghlakelidze, Elene Mekvabishvili, Guram Dentoshvili, Orkhan Shamilov, Zaal Gachechiladze, Steven Saporta, David Dachi Choladze, 11 Aug 2025, Improving LLM Outputs Against Jailbreak Attacks with Expert Model Integration, https://arxiv.org/abs/2505.17066
  • Jirui Yang, Zheyu Lin, Zhihui Lu, Yinggui Wang, Lei Wang, Tao Wei, Xin Du, Shuhan Yang, 31 Jul 2025, CEE: An Inference-Time Jailbreak Defense for Embodied Intelligence via Subspace Concept Rotation, https://arxiv.org/abs/2504.13201
  • Zheng Zhang, Peilin Zhao, Deheng Ye, Hao Wang, 28 Jul 2025, Enhancing Jailbreak Attacks on LLMs via Persona Prompts, https://arxiv.org/abs/2507.22171
  • Jiecong Wang, Haoran Li, Hao Peng, Ziqian Zeng, Zihao Wang, Haohua Du, Zhengtao Yu, 1 Aug 2025, Activation-Guided Local Editing for Jailbreaking Attacks, https://arxiv.org/abs/2508.00555
  • Yelim Ahn, Jaejin Lee, 2 Aug 2025, PUZZLED: Jailbreaking LLMs through Word-Based Puzzles, https://arxiv.org/abs/2508.01306
  • Yik Siu Chan, Narutatsu Ri, Yuxin Xiao, Marzyeh Ghassemi, 2 Aug 2025, Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions, https://arxiv.org/abs/2502.04322
  • Muyang Zheng, Yuanzhi Yao, Changting Lin, Rui Wang, Caihong Kai, 4 Aug 2025, MIST: Jailbreaking Black-box Large Language Models via Iterative Semantic Tuning, https://arxiv.org/abs/2506.16792
  • Rui Pu, Chaozhuo Li, Rui Ha, Litian Zhang, Lirong Qiu, Xi Zhang, 5 Aug 2025, Beyond Surface-Level Detection: Towards Cognitive-Driven Defense Against Jailbreak Attacks via Meta-Operations Reasoning, https://arxiv.org/abs/2508.03054
  • Bodam Kim, Hiskias Dingeto, Taeyoun Kwon, Dasol Choi, DongGeon Lee, Haon Park, JaeHoon Lee, Jongho Shin, 5 Aug 2025, When Good Sounds Go Adversarial: Jailbreaking Audio-Language Models with Benign Inputs, https://arxiv.org/abs/2508.03365
  • Giovanni Cherubin, Andrew Paverd, 4 Aug 2025, Highlight & Summarize: RAG without the jailbreaks, https://arxiv.org/abs/2508.02872
  • Ruofan Wang, Juncheng Li, Yixu Wang, Bo Wang, Xiaosen Wang, Yan Teng, Yingchun Wang, Xingjun Ma, Yu-Gang Jiang, 5 Aug 2025, IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves, https://arxiv.org/abs/2411.00827
  • Junwoo Ha, Hyunjun Kim, Sangyoon Yu, Haon Park, Ashkan Yousefpour, Yuna Park, Suhyun Kim, 5 Aug 2025, M2S: Multi-turn to Single-turn jailbreak in Red Teaming for LLMs, https://arxiv.org/abs/2503.04856
  • Thilo Hagendorff, Erik Derner, Nuria Oliver, 4 Aug 2025, Large Reasoning Models Are Autonomous Jailbreak Agents, https://arxiv.org/abs/2508.04039
  • Xiaohu Li and Yunfeng Ning and Zepeng Bao and Mayi Xu and Jianhao Chen and Tieyun Qian, 6 Aug 2025, CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations, https://arxiv.org/abs/2507.06043
  • Renmiao Chen, Shiyao Cui, Xuancheng Huang, Chengwei Pan, Victor Shea-Jay Huang, QingLin Zhang, Xuan Ouyang, Zhexin Zhang, Hongning Wang, and Minlie Huang, 7 Aug 2025, JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering, https://arxiv.org/abs/2508.05087
  • Jesson Wang, Zhanhao Hu, David Wagner, 7 Aug 2025, JULI: Jailbreak Large Language Models by Self-Introspection, https://arxiv.org/abs/2505.11790
  • Shuang Liang, Zhihao Xu, Jialing Tao, Hui Xue, Xiting Wang, 8 Aug 2025, Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models: A Unified and Accurate Approach, https://arxiv.org/abs/2508.09201
  • Zuoou Li, Weitong Zhang, Jingyuan Wang, Shuyuan Zhang, Wenjia Bai, Bernhard Kainz, Mengyun Qiao, 11 Aug 2025, Towards Effective MLLM Jailbreaking Through Balanced On-Topicness and OOD-Intensity, https://arxiv.org/abs/2508.09218
  • Boyuan Chen, Minghao Shao, Abdul Basit, Siddharth Garg, Muhammad Shafique, 13 Aug 2025, MetaCipher: A Time-Persistent and Universal Multi-Agent Framework for Cipher-Based Jailbreak Attacks for LLMs, https://arxiv.org/abs/2506.22557
  • Ma Teng and Jia Xiaojun and Duan Ranjie and Li Xinfeng and Huang Yihao and Jia Xiaoshuang and Chu Zhixuan and Ren Wenqi, 18 Aug 2025, Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models, https://arxiv.org/abs/2412.05934
  • Zhipeng Wei, Yuqi Liu, N. Benjamin Erichson, 16 Aug 2025, Emoji Attack: Enhancing Jailbreak Attacks Against Judge LLM Detection, https://arxiv.org/abs/2411.01077
  • Yangyang Guo and Yangyan Li and Mohan Kankanhalli, 18 Aug 2025, Involuntary Jailbreak, https://arxiv.org/abs/2508.13246
  • Jiaming Hu, Haoyu Wang, Debarghya Mukherjee, Ioannis Ch. Paschalidis, 19 Aug 2025, CCFC: Core & Core-Full-Core Dual-Track Defense for LLM Jailbreak Protection, https://arxiv.org/abs/2508.14128
  • Xiangman Li, Xiaodong Wu, Qi Li, Jianbing Ni, and Rongxing Lu, 21 Aug 2025, SafeLLM: Unlearning Harmful Outputs from Large Language Models against Jailbreak Attacks, https://arxiv.org/abs/2508.15182
  • Darpan Aswal and C\'eline Hudelot, 22 Aug 2025, LLMSymGuard: A Symbolic Safety Guardrail Framework Leveraging Interpretable Jailbreak Concepts, https://arxiv.org/abs/2508.16325
  • Yu Yan, Sheng Sun, Zhe Wang, Yijun Lin, Zenghao Duan, zhifei zheng, Min Liu, Zhiyi yin, Jianping Zhang, 22 Aug 2025, Confusion is the Final Barrier: Rethinking Jailbreak Evaluation and Investigating the Real Misuse Threat of LLMs, https://arxiv.org/abs/2508.16347
  • Yu Yan, Sheng Sun, Zenghao Duan, Teli Liu, Min Liu, Zhiyi Yin, Jiangyu Lei, Qi Li, 22 Aug 2025, from Benign import Toxic: Jailbreaking the Language Model via Adversarial Metaphors, https://arxiv.org/abs/2503.00038
  • Chongwen Zhao, Zhihao Dou, Kaizhu Huang, 25 Aug 2025, Defending against Jailbreak through Early Exit Generation of Large Language Models, https://arxiv.org/abs/2408.11308
  • Junchen Ding, Jiahao Zhang, Yi Liu, Ziqi Ding, Gelei Deng, Yuekang Li, 25 Aug 2025, TombRaider: Entering the Vault of History to Jailbreak Large Language Models, https://arxiv.org/abs/2501.18628
  • Salman Rahman, Liwei Jiang, James Shiffer, Genglin Liu, Sheriff Issaka, Md Rizwan Parvez, Hamid Palangi, Kai-Wei Chang, Yejin Choi, Saadia Gabriel, 23 Aug 2025, X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents, https://arxiv.org/abs/2504.13203
  • Hanjiang Hu, Alexander Robey, Changliu Liu, 25 Aug 2025, Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks, https://arxiv.org/abs/2503.00187
  • Chuhan Zhang, Ye Zhang, Bowen Shi, Yuyou Gan, Tianyu Du, Shouling Ji, Dazhan Deng, Yingcai Wu, 4 Sep 2025, NeuroBreak: Unveil Internal Jailbreak Mechanisms in Large Language Models, https://arxiv.org/abs/2509.03985
  • Yakai Li, Jiekang Hu, Weiduan Sang, Luping Ma, Dongsheng Nie, Weijuan Zhang, Aimin Yu, Yi Su, Qingjia Huang, Qihang Zhou, 25 Aug 2025, Prefill-level Jailbreak: A Black-Box Risk Analysis of Large Language Models, https://arxiv.org/abs/2504.21038
  • Haibo Jin, Ruoxi Chen, Peiyan Zhang, Andy Zhou, Yang Zhang, Haohan Wang, 28 Aug 2025, GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs, https://arxiv.org/abs/2508.20325
  • Junjie Chu and Mingjie Li and Ziqing Yang and Ye Leng and Chenhao Lin and Chao Shen and Michael Backes and Yun Shen and Yang Zhang, 28 Aug 2025, JADES: A Universal Framework for Jailbreak Assessment via Decompositional Scoring, https://arxiv.org/abs/2508.20848
  • Chongwen Zhao and Kaizhu Huang, 1 Sep 2025, Unraveling LLM Jailbreaks Through Safety Knowledge Neurons, https://arxiv.org/abs/2509.01631
  • Sihao Wu, Gaojie Jin, Wei Huang, Jianhong Wang, Xiaowei Huang, 30 Aug 2025, Activation Steering Meets Preference Optimization: Defense Against Jailbreaks in Vision Language Models, https://arxiv.org/abs/2509.00373
  • Ruoxi Cheng, Yizhong Ding, Shuirong Cao, Ranjie Duan, Xiaoshuang Jia, Shaowei Yuan, Simeng Qin, Zhiqiang Wang, Xiaojun Jia, 30 Aug 2025, PBI-Attack: Prior-Guided Bimodal Interactive Black-Box Jailbreak Attack for Toxicity Maximization, https://arxiv.org/abs/2412.05892
  • Shei Pern Chua, Thai Zhen Leng, Teh Kai Jun, Xiao Li, Xiaolin Hu, 4 Sep 2025, Between a Rock and a Hard Place: Exploiting Ethical Reasoning to Jailbreak LLMs, https://arxiv.org/abs/2509.05367
  • Youjia Zheng, Mohammad Zandsalimy, and Shanu Sushmita, 5 Sep 2025, Behind the Mask: Benchmarking Camouflaged Jailbreaks in Large Language Models, https://arxiv.org/abs/2509.05471
  • Junjie Mu, Zonghao Ying, Zhekui Fan, Zonglei Jing, Yaoyuan Zhang, Zhengmin Yu, Wenxin Zhang, Quanchen Zou, Xiangzheng Zhang, 8 Sep 2025, Mask-GCG: Are All Tokens in Adversarial Suffixes Necessary for Jailbreak Attacks?, https://arxiv.org/abs/2509.06350
  • Yunhan Zhao, Xiang Zheng, Xingjun Ma, 16 Sep 2025, Defense-to-Attack: Bypassing Weak Defenses Enables Stronger Jailbreaks in Vision-Language Models, https://arxiv.org/abs/2509.12724
  • Johan Wahr\'eus, Ahmed Hussain, Panos Papadimitratos, 16 Sep 2025, Jailbreaking Large Language Models Through Content Concretization, https://arxiv.org/abs/2509.12937
  • Seongho Joo, Hyukhun Koh, Kyomin Jung, 13 Sep 2025, Harmful Prompt Laundering: Jailbreaking LLMs with Abductive Styles and Symbolic Encoding, https://arxiv.org/abs/2509.10931
  • Chentao Cao, Xiaojun Xu, Bo Han, Hang Li, 15 Sep 2025, Reasoned Safety Alignment: Ensuring Jailbreak Defense via Answer-Then-Check, https://arxiv.org/abs/2509.11629
  • Yibo Zhang, Liang Lin, 14 Sep 2025, ENJ: Optimizing Noise with Genetic Algorithms to Jailbreak LSMs, https://arxiv.org/abs/2509.11128
  • Guorui Chen, Yifan Xia, Xiaojun Jia, Zhijiang Li, Philip Torr, Jindong Gu, 18 Sep 2025, LLM Jailbreak Detection for (Almost) Free!, https://arxiv.org/abs/2509.14558
  • Hyunjun Kim, Junwoo Ha, Sangyoon Yu, Haon Park, 10 Sep 2025, X-Teaming Evolutionary M2S: Automated Discovery of Multi-turn to Single-turn Jailbreak Templates, https://arxiv.org/abs/2509.08729
  • Shoumik Saha, Jifan Chen, Sam Mayers, Sanjay Krishna Gouda, Zijian Wang, Varun Kumar, 1 Oct 2025, Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks, https://arxiv.org/abs/2510.01359
  • John Hawkins and Aditya Pramar and Rodney Beard and Rohitash Chandra, 2 Oct 2025, NLP Methods for Detecting Novel LLM Jailbreaks and Keyword Analysis with BERT, https://arxiv.org/abs/2510.01644
  • Zixuan Huang, Kecheng Huang, Lihao Yin, Bowei He, Huiling Zhen, Mingxuan Yuan, Zili Shao, 14 Oct 2025, Attention-Aware GNN-based Input Defense against Multi-Turn LLM Jailbreak, https://arxiv.org/abs/2507.07146
  • Divij Handa, Zehua Zhang, Amir Saeidi, Shrinidhi Kumbhar, Md Nayem Uddin, Aswin RRV, Chitta Baral, 14 Oct 2025, When "Competency" in Reasoning Opens the Door to Vulnerability: Jailbreaking LLMs via Novel Complex Ciphers, https://arxiv.org/abs/2402.10601
  • Wence Ji, Jiancan Wu, Aiying Li, Shuyi Zhang, Junkang Wu, An Zhang, Xiang Wang, Xiangnan He, 24 Sep 2025, bi-GRPO: Bidirectional Optimization for Jailbreak Backdoor Injection on LLMs, https://arxiv.org/abs/2509.19775
  • Xinzhe Huang, Wenjing Hu, Tianhang Zheng, Kedong Xiu, Xiaojun Jia, Di Wang, Zhan Qin, Kui Ren, 28 Oct 2025, Untargeted Jailbreak Attack, https://arxiv.org/abs/2510.02999
  • Chung-En Sun, Xiaodong Liu, Weiwei Yang, Tsui-Wei Weng, Hao Cheng, Aidan San, Michel Galley, Jianfeng Gao, 23 Oct 2025, Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities, https://arxiv.org/abs/2410.18469
  • Qilin Liao, Anamika Lochab, Ruqi Zhang, 20 Oct 2025, VERA-V: Variational Inference Framework for Jailbreaking Vision-Language Models, https://arxiv.org/abs/2510.17759
  • Shuang Liang, Zhihao Xu, Jialing Tao, Hui Xue and Xiting Wang, 20 Oct 2025, Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models, https://arxiv.org/abs/2510.15430
  • Ziyi Zhang, Zhen Sun, Zongmin Zhang, Jihui Guo, Xinlei He, 20 Sep 2025, FC-Attack: Jailbreaking Multimodal Large Language Models via Auto-Generated Flowcharts, https://arxiv.org/abs/2502.21059
  • Brendan Murphy, Dillon Bowen, Shahrad Mohammadzadeh, Tom Tseng, Julius Broomfield, Adam Gleave, Kellin Pelrine, 20 Sep 2025, Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility, https://arxiv.org/abs/2507.11630
  • Mahavir Dabas, Tran Huynh, Nikhil Reddy Billa, Jiachen T. Wang, Peng Gao, Charith Peris, Yao Ma, Rahul Gupta, Ming Jin, Prateek Mittal, and Ruoxi Jia, 24 Oct 2025, Adversarial D\'ej\`a Vu: Jailbreak Dictionary Learning for Stronger Generalization to Unseen Attacks, https://arxiv.org/abs/2510.21910
  • Havva Alizadeh Noughabi, Julien Serbanescu, Fattane Zarrinkalam, Ali Dehghantanha, 24 Oct 2025, Uncovering the Persuasive Fingerprint of LLMs in Jailbreaking Attacks, https://arxiv.org/abs/2510.21983
  • Pavlos Ntais, 24 Oct 2025, Jailbreak Mimicry: Automated Discovery of Narrative-Based Jailbreaks for Large Language Models, https://arxiv.org/abs/2510.22085
  • Md. Mehedi Hasan, Ziaur Rahman, Rafid Mostafiz, and Md. Abir Hossain, 26 Oct 2025, Sentra-Guard: A Multilingual Human-AI Framework for Real-Time Defense Against Adversarial LLM Jailbreaks, https://arxiv.org/abs/2510.22628
  • Wonjun Lee, Haon Park, Doehyeon Lee, Bumsub Ham, Suhyun Kim, 26 Sep 2025, Jailbreaking on Text-to-Video Models via Scene Splitting Strategy, https://arxiv.org/abs/2509.22292
  • Runqi Lin, Alasdair Paren, Suqin Yuan, Muyang Li, Philip Torr, Adel Bibi, and Tongliang Liu, 26 Sep 2025, FORCE: Transferable Visual Jailbreaking Attacks via Feature Over-Reliance CorrEction, https://arxiv.org/abs/2509.21029
  • Xiaogeng Liu, Chaowei Xiao, 8 Oct 2025, AutoDAN-Reasoning: Enhancing Strategies Exploration based Jailbreak Attacks with Test-Time Scaling, https://arxiv.org/abs/2510.05379
  • Amit Levi, Rom Himelstein, Yaniv Nemcovsky, Avi Mendelson, Chaim Baskin, 8 Oct 2025, Jailbreak Attack Initializations as Extractors of Compliance Directions, https://arxiv.org/abs/2502.09755
  • Marco Arazzi, Vignesh Kumar Kembu, Antonino Nocera, Vinod P, 3 Oct 2025, XBreaking: Explainable Artificial Intelligence for Jailbreaking LLMs, https://arxiv.org/abs/2504.21700
  • Zifan Peng, Yule Liu, Zhen Sun, Mingchen Li, Zeren Luo, Jingyi Zheng, Wenhan Dong, Xinlei He, Xuechao Wang, Yingjie Xue, Shengmin Xu, Xinyi Huang, 3 Oct 2025, JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models, https://arxiv.org/abs/2505.17568
  • Amirkia Rafiei Oskooei, Mehmet S. Aktas, 19 Oct 2025, BreakFun: Jailbreaking LLMs via Schema Exploitation, https://arxiv.org/abs/2510.17904
  • Sidhant Narula, Javad Rafiei Asl, Mohammad Ghasemigol, Eduardo Blanco, Daniel Takabi, 21 Oct 2025, HarmNet: A Framework for Adaptive Multi-Turn Jailbreak Attacks on Large Language Models, https://arxiv.org/abs/2510.18728
  • Javad Rafiei Asl, Sidhant Narula, Mohammad Ghasemigol, Eduardo Blanco, Daniel Takabi, 21 Oct 2025, NEXUS: Network Exploration for eXploiting Unsafe Sequences in Multi-Turn LLM Jailbreaks, https://arxiv.org/abs/2510.03417
  • Zhaoqi Wang, Daqing He, Zijian Zhang, Xin Li, Liehuang Zhu, Meng Li, Jiamou Liu, 28 Sep 2025, Formalization Driven LLM Prompt Jailbreaking via Reinforcement Learning, https://arxiv.org/abs/2509.23558
  • Javad Forough and Mohammad Maheri and Hamed Haddadi, 27 Sep 2025, GuardNet: Graph-Attention Filtering for Jailbreak Defense in Large Language Models, https://arxiv.org/abs/2509.23037
  • Haibo Tong, Dongcheng Zhao, Guobin Shen, Xiang He, Dachuan Lin, Feifei Zhao, Yi Zeng, 25 Sep 2025, Bidirectional Intention Inference Enhances LLMs' Defense Against Multi-Turn Jailbreak Attacks, https://arxiv.org/abs/2509.22732
  • Saswat Das, Jameson Sandler, Ferdinando Fioretto, 27 Sep 2025, Beyond Jailbreaking: Auditing Contextual Privacy in LLM Agents, https://arxiv.org/abs/2506.10171
  • Hua Tang, Lingyong Yan, Yukun Zhao, Shuaiqiang Wang, Jizhou Huang, Dawei Yin, 29 Sep 2025, GRAF: Multi-turn Jailbreaking via Global Refinement and Active Fabrication, https://arxiv.org/abs/2506.17881
  • ChenYu Wu, Yi Wang, Yang Liao, 16 Oct 2025, Active Honeypot Guardrail System: Probing and Confirming Multi-Turn LLM Jailbreaks, https://arxiv.org/abs/2510.15017
  • Deyue Zhang, Dongdong Yang, Junjie Mu, Quancheng Zou, Zonghao Ying, Wenzhuo Xu, Zhao Liu, Xuan Wang, Xiangzheng Zhang, 16 Oct 2025, Sequential Comics for Jailbreaking Multimodal Large Language Models via Structured Visual Storytelling, https://arxiv.org/abs/2510.15068
  • Kuofeng Gao, Yiming Li, Chao Du, Xin Wang, Xingjun Ma, Shu-Tao Xia, Tianyu Pang, 6 Oct 2025, Imperceptible Jailbreaking against Large Language Models, https://arxiv.org/abs/2510.05025
  • Xinzhe Huang, Kedong Xiu, Tianhang Zheng, Churui Zeng, Wangze Ni, Zhan Qin, Kui Ren, Chun Chen, 4 Oct 2025, DualBreach: Efficient Dual-Jailbreaking via Target-Driven Initialization and Multi-Target Optimization, https://arxiv.org/abs/2504.18564
  • Milad Nasr, Nicholas Carlini, Chawin Sitawarin, Sander V. Schulhoff, Jamie Hayes, Michael Ilie, Juliette Pluto, Shuang Song, Harsh Chaudhari, Ilia Shumailov, Abhradeep Thakurta, Kai Yuanqing Xiao, Andreas Terzis, and Florian Tram\`er, 10 Oct 2025, The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against Llm Jailbreaks and Prompt Injections, https://arxiv.org/abs/2510.09023
  • Raffaele Mura, Giorgio Piras, Kamil\.e Luko\v{s}i\=ut\.e, Maura Pintor, Amin Karbasi, Battista Biggio, 7 Oct 2025, LatentBreak: Jailbreaking Large Language Models through Latent Space Feedback, https://arxiv.org/abs/2510.08604
  • Ragib Amin Nihal, Rui Wen, Kazuhiro Nakadai, Jun Sakuma, 9 Oct 2025, Pattern Enhanced Multi-Turn Jailbreaking: Exploiting Structural Vulnerabilities in Large Language Models, https://arxiv.org/abs/2510.08859
  • Yingzhi Mao (1 and 2), Chunkang Zhang (1 and 2), Junxiang Wang (1), Xinyan Guan (1 and 2), Boxi Cao (1), Yaojie Lu (1), Hongyu Lin (1), Xianpei Han (1 and 2), Le Sun (1 and 2) ((1) Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, (2) University of Chinese Academy of Sciences), 24 Oct 2025, When Models Outthink Their Safety: Mitigating Self-Jailbreak in Large Reasoning Models with Chain-of-Guardrails, https://arxiv.org/abs/2510.21285
  • Wentian Zhu, Zhen Xiang, Wei Niu, Le Guan, 11 Oct 2025, MetaBreak: Jailbreaking Online LLM Services via Special Token Manipulation, https://arxiv.org/abs/2510.10271
  • Guan-Yan Yang, Tzu-Yu Cheng, Ya-Wen Teng, Farn Wanga, Kuo-Hui Yeh, 11 Oct 2025, ArtPerception: ASCII Art-based Jailbreak on LLMs with Recognition Pre-test, https://arxiv.org/abs/2510.10281
  • Weisen Jiang and Sinno Jialin Pan, 9 Oct 2025, MetaDefense: Defending Finetuning-based Jailbreak Attack Before and During Generation, https://arxiv.org/abs/2510.07835
  • Renhua Ding, Xiao Yang, Zhengwei Fang, Jun Luo, Kun He, Jun Zhu, 9 Oct 2025, Effective and Stealthy One-Shot Jailbreaks on Deployed Mobile Vision-Language Agents, https://arxiv.org/abs/2510.07809
  • Aofan Liu, Lulu Tang, Ting Pan, Yuguo Yin, Bin Wang, Ao Yang, 9 Oct 2025, PiCo: Jailbreaking Multimodal Large Language Models via Pictorial Code Contextualization, https://arxiv.org/abs/2504.01444
  • Jingyu Peng, Maolin Wang, Nan Wang, Jiatong Li, Yuchen Li, Yuyang Ye, Wanyu Wang, Pengyue Jia, Kai Zhang, Xiangyu Zhao, 9 Oct 2025, Logic Jailbreak: Efficiently Unlocking LLM Safety Restrictions Through Formal Logical Expression, https://arxiv.org/abs/2505.13527
  • Yein Park, Jungwoo Park, Jaewoo Kang, 30 Sep 2025, ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack, https://arxiv.org/abs/2509.25843
  • Qinjian Zhao, Jiaqi Wang, Zhiqiang Gao, Zhihao Dou, Belal Abuhaija, Kaizhu Huang, 30 Sep 2025, SafeBehavior: Simulating Human-Like Multistage Reasoning to Mitigate Jailbreak Attacks in Large Language Models, https://arxiv.org/abs/2509.26345
  • Jing-Jing Li, Jianfeng He, Chao Shang, Devang Kulshreshtha, Xun Xian, Yi Zhang, Hang Su, Sandesh Swamy, Yanjun Qi, 30 Sep 2025, STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents, https://arxiv.org/abs/2509.25