Aussie AI
AI Safety Research
-
Last Updated 22 October, 2025
-
by David Spuler, Ph.D.
Safe and responsible use of AI is an important and all-encompassing goal. Multiple concerns arise in the use of modern AI capabilities, and for the future with more advanced AI systems. This article examines the various research papers on difference AI safety issues.
Types of AI Safety Issues
There are a variety of distinct issue in terms of appropriate use of AI. Some of the categories include:
- Bias and fairness
- Inaccurate results
- Imaginary results ("hallucinations" or "confabulations")
- Inappropriate responses (e.g., "toxicity")
- Plagiarism
There are some issues that get quite close to being philosophy rather than technology:
- Alignment (ensuring AI engines are "aligned" with human goals)
- Overrideability/interruptibility
- Obedience vs autonomy
There are some overarching issues for AI matters for the government and in the community:
- Ethics
- Governance
- Regulation
- Auditing and Enforcement
- Privacy
- Risk Mitigation
Issues specific to mitigation of AI safety risks include:
- Red teaming (testing of safety issues)
- Prompt shields
- Guardrails
- Jailbreak prevention
- Refusal modules
- Security issues
And since we may rely on AI models in various real-world situations, including dangerous real-time situations like driving a car, there are some practical technological issues ensuring that AI engines operate safely and reliably within their basic operational scope:
- Testing and Debugging (simply avoiding coding "bugs" in complex AI engines)
- Real-time performance profiling ("de-slugging")
- Error Handling (tolerance of internal or external errors)
- Code Resilience (handling unexpected inputs or situations reasonably)
Overviews, Surveys, and Reviews
Various authors have reviewed the areas of safety and ethics:
- Cath C. Governing artificial intelligence: ethical, legal and technical opportunities and challenges. Philos Trans A Math Phys Eng Sci. 2018 Oct 15;376(2133):20180080. doi: 10.1098/rsta.2018.0080. PMID: 30322996 https://pubmed.ncbi.nlm.nih.gov/30322996/
- Hagendorff Thilo. The ethics of AI ethics: an evaluation of guidelines. Minds and Machines. 2020; 30(1):99–120. https://link.springer.com/article/10.1007/s11023-020-09517-8
- Jobin Anna, Ienca Marcello, Vayena Effy. The global landscape of AI ethics guidelines. Nature Machine Intell. 2019;(1):389–399. https://www.nature.com/articles/s42256-019-0088-2
- Soni N., Sharma E.K., Singh N., Kapoor A. 2019. Impact of Artificial Intelligence on Businesses: from Research, Innovation, Market Deployment to Future Shifts in Business Models”.arXiv:1905.02092. https://arxiv.org/abs/1905.02092
Hallucinations
Hallucinations are plausible-sounding answers that are not correct, and not based on any facts. It appears like the LLM is lying or faking the answer, but it doesn't actually know this. Rather, it is probabilistically trying to come up with the best answer, and sometimes it doesn't have a factual answer, so it can fill in the blanks.
- Liangming Pan, Michael Saxon, Wenda Xu, Deepak Nathani, Xinyi Wang, William Yang Wang, May 03 2024, Automatically Correcting Large Language Models: Surveying the Landscape of Diverse Automated Correction Strategies, https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00660/120911
- Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, Cyril Zhang, June 2023, Exposing Attention Glitches with Flip-Flop Language Modeling, https://arxiv.org/abs/2306.00946
- Lucas Mearian, 14 Mar 2024, AI hallucination mitigation: two brains are better than one, https://www.computerworld.com/article/1612465/ai-hallucination-mitigation-two-brains-are-better-than-one.html
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations.)
- Bijit Ghosh Feb 2024, Advanced Prompt Engineering for Reducing Hallucination, https://medium.com/@bijit211987/advanced-prompt-engineering-for-reducing-hallucination-bb2c8ce62fc6
- Junyi Li, Jie Chen, Ruiyang Ren, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, Ji-Rong Wen, 6 Jan 2024, The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models, https://arxiv.org/abs/2401.03205 Code: https://github.com/RUCAIBox/HaluEval-2.0
- Colin Fraser, Apr 18, 2024, Hallucinations, Errors, and Dreams On why modern AI systems produce false outputs and what there is to be done about it, https://medium.com/@colin.fraser/hallucinations-errors-and-dreams-c281a66f3c35
- Johnny Li, Saksham Consul, Eda Zhou, James Wong, Naila Farooqui, Yuxin Ye, Nithyashree Manohar, Zhuxiaona Wei, Tian Wu, Ben Echols, Sharon Zhou, Gregory Diamos, 25 Jun 2024, Banishing LLM Hallucinations Requires Rethinking Generalization, https://arxiv.org/abs/2406.17642
- Pavan Belagatti, Jul 31, 2024, Semantic Chunking for Enhanced RAG Applications! https://levelup.gitconnected.com/semantic-chunking-for-enhanced-rag-applications-b6bc92942af0
- Mintong Kang, Nezihe Merve Gürel, Ning Yu, Dawn Song, Bo Li, July 2024, C-RAG: Certified Generation Risks for Retrieval-Augmented Language Models, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:22963-23000, 2024, https://proceedings.mlr.press/v235/kang24a.html
- Mengya Hu, Rui Xu, Deren Lei, Yaxi Li, Mingyu Wang, Emily Ching, Eslam Kamal, Alex Deng, 22 Aug 2024, SLM Meets LLM: Balancing Latency, Interpretability and Consistency in Hallucination Detection, https://arxiv.org/abs/2408.12748
- Hao Zhou, Chengming Hu, Ye Yuan, Yufei Cui, Yili Jin, Can Chen, Haolun Wu, Dun Yuan, Li Jiang, Di Wu, Xue Liu, Charlie Zhang, Xianbin Wang, Jiangchuan Liu, 17 May 2024, Large Language Model (LLM) for Telecommunications: A Comprehensive Survey on Principles, Key Techniques, and Opportunities, https://arxiv.org/abs/2405.10825
- C Yang, S Fujita, 2024, Adaptive Control of Retrieval-Augmented Generation for LLMs Through Reflective Tags, https://www.preprints.org/manuscript/202408.2152/download/final_file
- Michael Wood, Aug 26, 2024, 100% Accurate AI Claimed by Acurai — OpenAI and Anthropic Confirm Acurai’s Discoveries, https://blog.cubed.run/100-accurate-ai-claimed-by-acurai-openai-and-anthropic-confirm-acurais-discoveries-98fce1ddeb5b
- James Lee Stakelum, Sep 2024, The End of AI Hallucinations: A Big Breakthrough in Accuracy for AI Application Developers, https://medium.com/@JamesStakelum/the-end-of-ai-hallucinations-a-breakthrough-in-accuracy-for-data-engineers-e67be5cc742a
- F. Li, X. zhang and P. Zhang, 2024, Mitigating Hallucination Issues in Small-Parameter LLMs through Inter-Layer Contrastive Decoding, 2024 International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 2024, pp. 1-8, doi: 10.1109/IJCNN60899.2024.10650644, https://ieeexplore.ieee.org/abstract/document/10650644
- Zhongxiang Sun, Zihua Si, Xiaoxue Zang, Kai Zheng, Yang Song, Xiao Zhang, Jun Xu, 15 Oct 2024, LargePiG: Your Large Language Model is Secretly a Pointer Generator, https://arxiv.org/abs/2410.11366
- Garanc Burke, Hilke Schellmann, October 27, 2024, Researchers say an AI-powered transcription tool used in hospitals invents things no one ever said, https://apnews.com/article/ai-artificial-intelligence-health-business-90020cdf5fa16c79ca2e5b6c4c9bbb14
- Adi Simhi, Jonathan Herzig, Idan Szpektor, Yonatan Belinkov, 29 Oct 2024, Distinguishing Ignorance from Error in LLM Hallucinations, https://arxiv.org/abs/2410.22071 https://github.com/technion-cs-nlp/hallucination-mitigation
- Salvatore Raieli, Nov 2024, What Is The Best Therapy For a Hallucinating AI Patient? Exploring the Art and Science of Prompt Engineering to Cure LLM Hallucinations, https://levelup.gitconnected.com/what-is-the-best-therapy-for-a-hallucinating-ai-patient-acf0cb9b3e00
- Vitaly Kukharenko, Nov 2024, Why Do Neural Networks Hallucinate (And What Are Experts Doing About It)? https://pub.towardsai.net/why-do-neural-networks-hallucinate-and-what-are-experts-doing-about-it-7b9342605bf7
- Yixiong Fang, Ziran Yang, Zhaorun Chen, Zhuokai Zhao, Jiawei Zhou, 9 Dec 2024, From Uncertainty to Trust: Enhancing Reliability in Vision-Language Models with Uncertainty-Guided Dropout Decoding, https://arxiv.org/abs/2412.06474
- Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Martín Santillán Cooper, Kieran Fraser, Giulio Zizzo, Muhammad Zaid Hameed, Mark Purcell, Michael Desmond, Qian Pan, Inge Vejsbjerg, Elizabeth M. Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R. Varshney, Prasanna Sattigeri, 10 Dec 2024, Granite Guardian, https://arxiv.org/abs/2412.07724 https://github.com/ibm-granite/granite-guardian (Open-sourcing of safety models with many capabilities.)
- Lilian Weng, July 7, 2024, Extrinsic Hallucinations in LLMs, https://lilianweng.github.io/posts/2024-07-07-hallucination/
- Rhiannon Williams, December 31, 2024, The biggest AI flops of 2024: From chatbots dishing out illegal advice to dodgy AI-generated search results, take a look back over the year’s top AI failures. https://www.technologyreview.com/2024/12/31/1109612/biggest-worst-ai-artificial-intelligence-flops-fails-2024/
- Kazi Hasan Ibn Arif, Sajib Acharjee Dip, Khizar Hussain, Lang Zhang, Chris Thomas, 21 Jan 2025, Fixing Imbalanced Attention to Mitigate In-Context Hallucination of Large Vision-Language Model, https://arxiv.org/abs/2501.12206
- Huan Ma, Jingdong Chen, Guangyu Wang, Changqing Zhang, 1 Feb 2025, Estimating LLM Uncertainty with Logits, https://arxiv.org/abs/2502.00290
- Ningke Li, Yahui Song, Kailong Wang, Yuekang Li, Ling Shi, Yi Liu, Haoyu Wang, 19 Feb 2025, Detecting LLM Fact-conflicting Hallucinations Enhanced by Temporal-logic-based Reasoning, https://arxiv.org/abs/2502.13416
- Seongheon Park, Xuefeng Du, Min-Hsuan Yeh, Haobo Wang, Yixuan Li, 1 Mar 2025, How to Steer LLM Latents for Hallucination Detection? https://arxiv.org/abs/2503.01917
- Sean Michael Kerner, May 13, 2025, Guardian agents: New approach could reduce AI hallucinations to below 1%, https://venturebeat.com/ai/beyond-detection-why-automatically-correcting-hallucinations-could-transform-enterprise-ai-adoption/
- Lei Wang, 12 May 2025, SEReDeEP: Hallucination Detection in Retrieval-Augmented Models via Semantic Entropy and Context-Parameter Fusion, https://arxiv.org/abs/2505.07528
- Manuel Cossio, 3 Aug 2025, A comprehensive taxonomy of hallucinations in Large Language Models, https://arxiv.org/abs/2508.01781
- Igor Halperin, 13 Aug 2025, Prompt-Response Semantic Divergence Metrics for Faithfulness Hallucination and Misalignment Detection in Large Language Models, https://arxiv.org/abs/2508.10192
- Denis Janiak, Jakub Binkowski, Albert Sawczyn, Bogdan Gabrys, Ravid Shwartz-Ziv, Tomasz Kajdanowicz, 13 Aug 2025, The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs, https://arxiv.org/abs/2508.08285
- Xi Long, Christy Boscardin, Lauren A. Maggio, Joseph A. Costello, Ralph Gonzales, Rasmyah Hammoudeh, Ki Lai, Yoon Soo Park, Brian C. Gin, 14 Aug 2025, Hallucination vs interpretation: rethinking accuracy and precision in AI-assisted data extraction for knowledge synthesis, https://arxiv.org/abs/2508.09458
- Siyuan Liu, Wenjing Liu, Zhiwei Xu, Xin Wang, Bo Chen, Tao Li, 21 Jul 2025, Towards Mitigation of Hallucination for LLM-empowered Agents: Progressive Generalization Bound Exploration and Watchdog Monitor, https://arxiv.org/abs/2507.15903
- Zhenliang Zhang, Xinyu Hu, Huixuan Zhang, Junzhe Zhang, Xiaojun Wan, 22 Jul 2025, ICR Probe: Tracking Hidden State Dynamics for Reliable Hallucination Detection in LLMs, https://arxiv.org/abs/2507.16488
- Xin Dong, Shichao Dong, Jin Wang, Jing Huang, Li Zhou, Zenghui Sun, Lihua Jing, Jingsong Lan, Xiaoyong Zhu, Bo Zheng, 22 Jul 2025, INTER: Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling, https://arxiv.org/abs/2507.05056
- Seunghoi Kim and Henry F. J. Tregidgo and Matteo Figini and Chen Jin and Sarang Joshi and Daniel C. Alexander, 24 Jul 2025, Tackling Hallucination from Conditional Models for Medical Image Reconstruction with DynamicDPS, https://arxiv.org/abs/2503.01075
- Weihua Zheng, Roy Ka-Wei Lee, Zhengyuan Liu, Kui Wu, AiTi Aw, Bowei Zou, 17 Jul 2025, CCL-XCoT: An Efficient Cross-Lingual Knowledge Transfer Method for Mitigating Hallucination Generation, https://arxiv.org/abs/2507.14239
- Jingwei Huang, Kuroush Nezafati, Ismael Villanueva-Miranda, Zifan Gu, Yueshuang Xu, Ann Marie Navar, Tingyi Wanyan, Qin Zhou, Bo Yao, Ruichen Rong, Xiaowei Zhan, Guanghua Xiao, Eric D. Peterson, Donghan M. Yang, Wenqi Shi, Yang Xie, 18 Jul 2025, Large Language Models Powered Multiagent Ensemble for Mitigating Hallucination and Efficient Atrial Fibrillation Annotation of ECG Reports, https://arxiv.org/abs/2410.16543
- Ashley Lewis, Michael White, Jing Liu, Toshiaki Koike-Akino, Kieran Parsons, Ye Wang, 21 Jul 2025, Winning Big with Small Models: Knowledge Distillation vs. Self-Training for Reducing Hallucination in Product QA Agents, https://arxiv.org/abs/2502.19545
- Quan Shi, Wang Xi, Zenghui Ding, Jianqing Gao, Xianjun Yang, 10 Aug 2025, Hallucination as a Computational Boundary: A Hierarchy of Inevitability and the Oracle Escape, https://arxiv.org/abs/2508.07334
- Ming-Kun Xie, Jia-Hao Xiao, Gang Niu, Lei Feng, Zhiqiang Kou, Min-Ling Zhang, and Masashi Sugiyama, 3 Aug 2025, What Makes "Good" Distractors for Object Hallucination Evaluation in Large Vision-Language Models?, https://arxiv.org/abs/2508.06530
- Jakob Snel and Seong Joon Oh, 28 Jul 2025, First Hallucination Tokens Are Different from Conditional Ones, https://arxiv.org/abs/2507.20836
- Shengyuan Wang, Jie Feng, Tianhui Liu, Dan Pei, Yong Li, 25 Jul 2025, Mitigating Geospatial Knowledge Hallucination in Large Language Models: Benchmarking and Dynamic Factuality Aligning, https://arxiv.org/abs/2507.19586
- Baiyu Chen, Wilson Wongso, Xiaoqian Hu, Yue Tan, Flora Salim, 27 Jul 2025, Multi-Stage Verification-Centric Framework for Mitigating Hallucination in Multi-Modal RAG, https://arxiv.org/abs/2507.20136
- Joosung Lee, Cheonbok Park, Hwiyeol Jo, Jeonghoon Kim, Joonsuk Park, Kang Min Yoo, 28 Jul 2025, Enhancing Hallucination Detection via Future Context, https://arxiv.org/abs/2507.20546
- Esmail Gumaan, 20 Jul 2025, Theoretical Foundations and Mitigation of Hallucination in Large Language Models, https://arxiv.org/abs/2507.22915
- Praveenkumar Katwe, Rakesh Chandra, Balabantaray Kali, Prasad Vittala, 30 Jul 2025, Reducing Hallucinations in Summarization via Reinforcement Learning with Entity Hallucination Index, https://arxiv.org/abs/2507.22744
- Vijja Wichitwechkarn, Charles Fox, Ruchi Choudhary, 23 Jul 2025, Hallucination Detection and Mitigation with Diffusion in Multi-Variate Time-Series Foundation Models, https://arxiv.org/abs/2508.00881
- Xiaoyu Pan, Yang Bai, Ke Zou, Yang Zhou, Jun Zhou, Huazhu Fu, Yih-Chung Tham, Yong Liu, 24 Jul 2025, EH-Benchmark Ophthalmic Hallucination Benchmark and Agent-Driven Top-Down Traceable Reasoning Workflow, https://arxiv.org/abs/2507.22929
- Zhaochen Wang, Yiwei Wang, Yujun Cai, 3 Aug 2025, Cure or Poison? Embedding Instructions Visually Alters Hallucination in Vision-Language Models, https://arxiv.org/abs/2508.01678
- Yijun Feng, 3 Aug 2025, Counterfactual Probing for Hallucination Detection and Mitigation in Large Language Models, https://arxiv.org/abs/2508.01862
- Zhaoyi Sun, Wen-Wai Yim, Ozlem Uzuner, Fei Xia, Meliha Yetisgen, 1 Aug 2025, A Scoping Review of Natural Language Processing in Addressing Medically Inaccurate Information: Errors, Misinformation, and Hallucination, https://arxiv.org/abs/2505.00008
- Junyoung Lim, Jaewoo Ahn, Gunhee Kim, 5 Aug 2025, ChartCap: Mitigating Hallucination of Dense Chart Captioning, https://arxiv.org/abs/2508.03164
- Subhey Sadi Rahman, Md. Adnanul Islam, Md. Mahbub Alam, Musarrat Zeba, Md. Abdur Rahman, Sadia Sultana Chowa, Mohaimenul Azam Khan Raiaan, Sami Azam, 5 Aug 2025, Hallucination to Truth: A Review of Fact-Checking and Factuality Evaluation in Large Language Models, https://arxiv.org/abs/2508.03860
- Shunqi Mao, Chaoyi Zhang, Weidong Cai, 6 Aug 2025, Through the Magnifying Glass: Adaptive Perception Magnification for Hallucination-Free VLM Decoding, https://arxiv.org/abs/2503.10183
- Micha{\l} P. Karpowicz, 6 Aug 2025, On the Fundamental Impossibility of Hallucination Control in Large Language Models, https://arxiv.org/abs/2506.06382
- Huaicheng Zhang, Wei Tan, Guangzheng Li, Yixuan Zhang, Hangting Chen, Shun Lei, Chenyu Yang, Zhiyong Wu, Shuai Wang, Qijun Huang, Dong Yu, 7 Aug 2025, Towards Hallucination-Free Music: A Reinforcement Learning Preference Optimization Framework for Reliable Song Generation, https://arxiv.org/abs/2508.05011
- Kim Hammar and Tansu Alpcan and Emil C. Lupu, 7 Aug 2025, Incident Response Planning Using a Lightweight Large Language Model with Reduced Hallucination, https://arxiv.org/abs/2508.05188
- Marc Pavel, Nenad Petrovic, Lukasz Mazur, Vahid Zolfaghari, Fengjunjie Pan, Alois Knoll, 15 Aug 2025, Hallucination in LLM-Based Code Generation: An Automotive Case Study, https://arxiv.org/abs/2508.11257
- Nanxing Hu, Xiaoyue Duan, Jinchao Zhang, Guoliang Kang, 19 Aug 2025, Enhancing Visual Reliance in Text Generation: A Bayesian Perspective on Mitigating Hallucination in Large Vision-Language Models, https://arxiv.org/abs/2505.19498
- Huan Ma, Jiadong Pan, Jing Liu, Yan Chen, Joey Tianyi Zhou, Guangyu Wang, Qinghua Hu, Hua Wu, Changqing Zhang, Haifeng Wang, 20 Aug 2025, Semantic Energy: Detecting LLM Hallucination Beyond Entropy, https://arxiv.org/abs/2508.14496
- Aman Goel, Daniel Schwartz, Yanjun Qi, 19 Aug 2025, Zero-knowledge LLM hallucination detection and mitigation through fine-grained cross-model consistency, https://arxiv.org/abs/2508.14314
- Yupei Yang, Fan Feng, Lin Yang, Wanxi Deng, Lin Qu, Biwei Huang, Shikui Tu, Lei Xu, 20 Aug 2025, DEPTH: Hallucination-Free Relation Extraction via Dependency-Aware Sentence Simplification and Two-tiered Hierarchical Refinement, https://arxiv.org/abs/2508.14391
- Nicole Cho, William Watson, Alec Koppel, Sumitra Ganesh, Manuela Veloso, 22 Aug 2025, QueryBandits for Hallucination Mitigation: Exploiting Semantic Features for No-Regret Rewriting, https://arxiv.org/abs/2508.16697
- Nicolas Zucchet, J\"org Bornschein, Stephanie Chan, Andrew Lampinen, Razvan Pascanu, Soham De, 24 Jul 2025, How do language models learn facts? Dynamics, curricula and hallucinations, https://arxiv.org/abs/2503.21676
- Anindya Bijoy Das, Shahnewaz Karim Sakib and Shibbir Ahmed, 9 Aug 2025, Trustworthy Medical Imaging with Large Language Models: A Study of Hallucinations Across Modalities, https://arxiv.org/abs/2508.07031
- Charles O'Neill, Slava Chalnev, Chi Chi Zhao, Max Kirkby, Mudith Jayasekara, 31 Jul 2025, A Single Direction of Truth: An Observer Model's Linear Residual Probe Exposes and Steers Contextual Hallucinations, https://arxiv.org/abs/2507.23221
- Zhangcheng Qiang, Kerry Taylor, Weiqing Wang, Jing Jiang, 25 Mar 2025, OAEI-LLM-T: A TBox Benchmark Dataset for Understanding Large Language Model Hallucinations in Ontology Matching, https://arxiv.org/abs/2503.21813
- Yudong Zhang, Ruobing Xie, Xingwu Sun, Yiqing Huang, Jiansheng Chen, Zhanhui Kang, Di Wang, Yu Wang, 31 Jul 2025, DHCP: Detecting Hallucinations by Cross-modal Attention Pattern in Large Vision-Language Models, https://arxiv.org/abs/2411.18659
- Haonan Ge, Yiwei Wang, Ming-Hsuan Yang, Yujun Cai, 14 Aug 2025, MRFD: Multi-Region Fusion Decoding with Self-Consistency for Mitigating Hallucinations in LVLMs, https://arxiv.org/abs/2508.10264
- Likun Tan, Kuan-Wei Huang, Kevin Wu, 28 Jul 2025, FRED: Financial Retrieval-Enhanced Detection and Editing of Hallucinations in Language Models, https://arxiv.org/abs/2507.20930
- Neil F. Johnson and Frank Yingjie Huo, 1 Aug 2025, Multispin Physics of AI Tipping Points and Hallucinations, https://arxiv.org/abs/2508.01097
- Chenxi Li, Yichen Guo, Benfang Qian, Jinhao You, Kai Tang, Yaosong Du, Zonghao Zhang, and Xiande Huang, 3 Aug 2025, MAP: Mitigating Hallucinations in Large Vision-Language Models with Map-Level Attention Processing, https://arxiv.org/abs/2508.01653
- Peizheng Guo, Jingyao Wang, Wenwen Qiang, Huijie Guo, Changwen Zheng, Jiahuan Zhou, Gang Hua, 6 Aug 2025, Hacking Hallucinations of MLLMs with Causal Sufficiency and Necessity, https://arxiv.org/abs/2508.04182
- Mengao Zhang, Jiayu Fu, Tanya Warrier, Yuwen Wang, Tianhui Tan, Ke-wei Huang, 7 Aug 2025, FAITH: A Framework for Assessing Intrinsic Tabular Hallucinations in finance, https://arxiv.org/abs/2508.05201
- Vibhor Agarwal, Yiqiao Jin, Mohit Chandra, Munmun De Choudhury, Srijan Kumar, Nishanth Sastry, 7 Aug 2025, MedHalu: Hallucinations in Responses to Healthcare Queries by Large Language Models, https://arxiv.org/abs/2409.19492
- Chunhua Liu, Hong Yi Lin and Patanamon Thongtanunam, 12 Aug 2025, Hallucinations in Code Change to Natural Language Generation: Prevalence and Evaluation of Detection Metrics, https://arxiv.org/abs/2508.08661
- Ashish Seth, Utkarsh Tyagi, Ramaneswaran Selvakumar, Nishit Anand, Sonal Kumar, Sreyan Ghosh, Ramani Duraiswami, Chirag Agarwal, Dinesh Manocha, 18 Aug 2025, EGOILLUSION: Benchmarking Hallucinations in Egocentric Video Understanding, https://arxiv.org/abs/2508.12687
- Yuangang Li, Yiqing Shen, Yi Nian, Jiechao Gao, Ziyi Wang, Chenxiao Yu, Shawn Li, Jie Wang, Xiyang Hu, Yue Zhao, 17 Aug 2025, Mitigating Hallucinations in Large Language Models via Causal Reasoning, https://arxiv.org/abs/2508.12495
- Wenhao Li, Xiu Su, Jingyi Wu, Feng Yang, Yang Liu, Yi Chen, Shan You, Chang Xu, 19 Aug 2025, Identify, Isolate, and Purge: Mitigating Hallucinations in LVLMs via Self-Evolving Distillation, https://arxiv.org/abs/2507.04680
- Anindya Bijoy Das, Shibbir Ahmed and Shahnewaz Karim Sakib, 19 Aug 2025, Hallucinations and Key Information Extraction in Medical Texts: A Comprehensive Assessment of Open-Source Large Language Models, https://arxiv.org/abs/2504.19061
- Chenlin Liu, Minghui Fang, Patrick Zhang, Wei Zhou, Jie Gao, Jiqing Han, 21 Aug 2025, Mitigating Hallucinations in LM-Based TTS Models via Distribution Alignment Using GFlowNets, https://arxiv.org/abs/2508.15442
- Reilly Haskins and Benjamin Adams, 21 Aug 2025, KEA Explain: Explanations of Hallucinations using Graph Kernel Analysis, https://arxiv.org/abs/2507.03847
- Shuzhou Yuan, Zhan Qu, Ashish Yashwanth Kangen, Michael F\"arber, 22 Aug 2025, Can Hallucinations Help? Boosting LLMs for Drug Discovery, https://arxiv.org/abs/2501.13824
- Charles Moslonka, Hicham Randrianarivo, Arthur Garnier and Emmanuel Malherbe, 1 Sep 2025, Learned Hallucination Detection in Black-Box LLMs using Token-level Entropy Production Rate, https://arxiv.org/abs/2509.04492
- Jiawei Li, Akshayaa Magesh, Venugopal V. Veeravalli, 25 Aug 2025, Principled Detection of Hallucinations in Large Language Models via Multiple Testing, https://arxiv.org/abs/2508.18473
- Yiming Huang, Junyan Zhang, Zihao Wang, Biquan Bie, Yunzhong Qiu, Yi R. Fung, Xinlei He, 26 Aug 2025, RePPL: Recalibrating Perplexity by Uncertainty in Semantic Propagation and Language Generation for Explainable QA Hallucination Detection, https://arxiv.org/abs/2505.15386
- Supratik Sarkar, Swagatam Das, 26 Aug 2025, Grounding the Ungrounded: A Spectral-Graph Framework for Quantifying Hallucinations in multimodal LLMs, https://arxiv.org/abs/2508.19366
- Kehao Miao, Xiaolong Jin, 26 Aug 2025, An Investigation on Group Query Hallucination Attacks, https://arxiv.org/abs/2508.19321
- Seongheon Park and Yixuan Li, 27 Aug 2025, GLSim: Detecting Object Hallucinations in LVLMs via Global-Local Similarity, https://arxiv.org/abs/2508.19972
- Alberto Compagnoni, Davide Caffagni, Nicholas Moratelli, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara, 27 Aug 2025, Mitigating Hallucinations in Multimodal LLMs via Object-aware Preference Optimization, https://arxiv.org/abs/2508.20181
- Tingxuan Xu, Jiarui Feng, Justin Melendez, Kaleigh Roberts, Donghong Cai, Mingfang Zhu, Donald Elbert, Yixin Chen, Randall J. Bateman, 28 Aug 2025, Addressing accuracy and hallucination of LLMs in Alzheimer's disease research through knowledge graphs, https://arxiv.org/abs/2508.21238
- Weizhi Gao, Xiaorui Liu, Feiyi Wang, Dan Lu, Junqi Yin, 28 Aug 2025, Decoding Memories: An Efficient Pipeline for Self-Consistency Hallucination Detection, https://arxiv.org/abs/2508.21228
- Hao Lu, Jiahao Wang, Yaolun Zhang, Ruohui Wang, Xuanyu Zheng, Yepeng Tang, Dahua Lin, Lewei Lu, 29 Aug 2025, ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding, https://arxiv.org/abs/2508.21496
- Junzhe Chen, Tianshu Zhang, Shiyu Huang, Yuwei Niu, Chao Sun, Rongzhou Zhang, Guanyu Zhou, Lijie Wen, Xuming Hu, 31 Aug 2025, OmniDPO: A Preference Optimization Framework to Address Omni-Modal Hallucination, https://arxiv.org/abs/2509.00723
- Saad Abdul Ghani, Zizhao Wang, Peter Stone, Xuesu Xiao, 1 Sep 2025, Dyna-LfLH: Learning Agile Navigation in Dynamic Environments from Learned Hallucination, https://arxiv.org/abs/2403.17231
- Haoran Huan, Mihir Prabhudesai, Mengning Wu, Shantanu Jaiswal, Deepak Pathak, 3 Sep 2025, Can LLMs Lie? Investigation beyond Hallucination, https://arxiv.org/abs/2509.03518
- Qiang Liu, Xinlong Chen, Yue Ding, Bowen Song, Weiqiang Wang, Shu Wu, Liang Wang, 3 Sep 2025, Attention-guided Self-reflection for Zero-shot Hallucination Detection in Large Language Models, https://arxiv.org/abs/2501.09997
- Praneet Suresh, Jack Stanley, Sonia Joseph, Luca Scimeca, Danilo Bzdok, 8 Sep 2025, From Noise to Narrative: Tracing the Origins of Hallucinations in Transformers, https://arxiv.org/abs/2509.06938
- Xin Tong, Zhi Lin, Jingya Wang, Bo Jin, 8 Sep 2025, HAVE: Head-Adaptive Gating and ValuE Calibration for Hallucination Mitigation in Large Language Models, https://arxiv.org/abs/2509.06596
- Jerry Li, Evangelos Papalexakis, 3 Sep 2025, Beyond ROUGE: N-Gram Subspace Features for LLM Hallucination Detection, https://arxiv.org/abs/2509.05360
- Kamil Ciosek, Nicol\`o Felicioni, Sina Ghiassian, 8 Sep 2025, Hallucination Detection on a Budget: Efficient Bayesian Estimation of Semantic Entropy, https://arxiv.org/abs/2504.03579
- Kishan Maharaj, Vitobha Munigala, Srikanth G. Tamilselvam, Prince Kumar, Sayandeep Sen, Palani Kodeswaran, Abhijit Mishra, Pushpak Bhattacharyya, 6 Sep 2025, ETF: An Entity Tracing Framework for Hallucination Detection in Code Summaries, https://arxiv.org/abs/2410.14748
- Masoumeh Zareh, Mohammad Hossein Manshaei, Sayed Jalal Zahabi, and Marwan Krunz, 6 Sep 2025, Modeling Visual Hallucination: A Generative Adversarial Network Framework, https://arxiv.org/abs/2102.08209
- OpenAI, September 5, 2025, Why language models hallucinate, https://openai.com/index/why-language-models-hallucinate/ (Many interesting findings, including that some level of hallucinations are inevitable in the next-token decoding method, and also that current LLM evals reward hallucinations, and need to be reworked to fix hallucinations properly by rewarding expressions of uncertainty in results, i.e., when the model admits it doesn't know something instead of making something up.)
- Saumya Goswami, Siddharth Kurra, 9 Sep 2025, HALT-RAG: A Task-Adaptable Framework for Hallucination Detection with Calibrated NLI Ensembles and Abstention, https://arxiv.org/abs/2509.07475
- Nobin Sarwar, 8 Sep 2025, FilterRAG: Zero-Shot Informed Retrieval-Augmented Generation to Mitigate Hallucinations in VQA, https://arxiv.org/abs/2502.18536
- Malavika Suresh, Rahaf Aljundi, Ikechukwu Nkisi-Orji, Nirmalie Wiratunga, 4 Sep 2025, Cross-Layer Attention Probing for Fine-Grained Hallucination Detection, https://arxiv.org/abs/2509.09700
- Naveen Lamba, Sanju Tiwari and Manas Gaur, 9 Sep 2025, Investigating Symbolic Triggers of Hallucination in Gemma Models Across HaluEval and TruthfulQA, https://arxiv.org/abs/2509.09715
- Ponhvoan Srey, Xiaobao Wu, Anh Tuan Luu, 12 Sep 2025, Unsupervised Hallucination Detection by Inspecting Reasoning Processes, https://arxiv.org/abs/2509.10004
- Garry Yang, Zizhe Chen, Man Hon Wong, Haoyu Lei, Yongqiang Chen, Zhenguo Li, Kaiwen Zhou, James Cheng, 11 Sep 2025, MESH -- Understanding Videos Like Human: Measuring Hallucinations in Large Video Models, https://arxiv.org/abs/2509.08538
- Humam Kourani, Anton Antonov, Alessandro Berti, Wil M.P. van der Aalst, 18 Sep 2025, Knowledge-Driven Hallucination in Large Language Models: An Empirical Study on Process Modeling, https://arxiv.org/abs/2509.15336
- Davide Ettori, Nastaran Darabi, Sina Tayebati, Ranganath Krishnan, Mahesh Subedar, Omesh Tickoo, and Amit Ranjan Trivedi, 19 Sep 2025, EigenTrack: Spectral Activation Feature Tracking for Hallucination and Out-of-Distribution Detection in LLMs and VLMs, https://arxiv.org/abs/2509.15735
- Chung-En Johnny Yu, Hsuan-Chih (Neil) Chen, Brian Jalaian, Nathaniel D. Bastian, 18 Sep 2025, ORCA: Agentic Reasoning For Hallucination and Adversarial Robustness in Vision-Language Models, https://arxiv.org/abs/2509.15435
- Seongmin Lee, Hsiang Hsu, Chun-Fu Chen, Duen Horng Chau, 15 Sep 2025, Probing LLM Hallucination from Within: Perturbation-Driven Approach via Internal Knowledge, https://arxiv.org/abs/2411.09689
- Saad Obaid ul Islam, Anne Lauscher, Goran Glava\v{s}, 16 Sep 2025, How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild, https://arxiv.org/abs/2502.12769
- Boris Kovalerchuk, Brent D. Fegley, 13 Sep 2025, LLM Enhancement with Domain Expert Mental Model to Reduce LLM Hallucination with Causal Prompt Engineering, https://arxiv.org/abs/2509.10818
- Minh Vu, Brian K. Tran, Syed A. Shah, Geigh Zollicoffer, Nhat Hoang-Xuan, Manish Bhattarai, 12 Sep 2025, HalluField: Detecting LLM Hallucinations via Field-Theoretic Modeling, https://arxiv.org/abs/2509.10753
- Junjie Hu, Gang Tu, ShengYu Cheng, Jinxin Li, Jinting Wang, Rui Chen, Zhilong Zhou, Dongbo Shan, 15 Sep 2025, HARP: Hallucination Detection via Reasoning Subspace Projection, https://arxiv.org/abs/2509.11536
- Leon Chlon, Ahmed Karim, Maggie Chlon, 14 Sep 2025, Predictable Compression Failures: Why Language Models Actually Hallucinate, https://arxiv.org/abs/2509.11208
- Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Chen Xu, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, Shuming Shi, 14 Sep 2025, Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models, https://arxiv.org/abs/2309.01219
- Zhenglin Hua, Jinghan He, Zijun Yao, Tianxu Han, Haiyun Guo, Yuheng Jia, Junfeng Fang, 15 Sep 2025, Steering LVLMs via Sparse Autoencoder for Hallucination Mitigation, https://arxiv.org/abs/2505.16146
- Hongxiang Zhang, Hao Chen, Muhao Chen, Tianyi Zhang, 15 Sep 2025, Active Layer-Contrastive Decoding Reduces Hallucination in Large Language Model Generation, https://arxiv.org/abs/2505.23657
- Yurui Chang, Bochuan Cao, Lu Lin, 13 Sep 2025, Monitoring Decoding: Mitigating Hallucination via Evaluating the Factuality of Partial Response during Generation, https://arxiv.org/abs/2503.03106
- Martin Prei{\ss}, 11 Sep 2025, Hallucination Detection with the Internal Layers of LLMs, https://arxiv.org/abs/2509.14254
- Zihao Li, Weiwei Yi, Jiahong Chen, 12 Sep 2025, Accuracy Paradox in Large Language Models: Regulating Hallucination Risks in Generative AI, https://arxiv.org/abs/2509.13345
- Xiao Zheng, 17 Sep 2025, DSCC-HS: A Dynamic Self-Reinforcing Framework for Hallucination Suppression in Large Language Models, https://arxiv.org/abs/2509.13702
- Mahjabin Nahar, Eun-Ju Lee, Jin Won Park, Dongwon Lee, 17 Sep 2025, Catch Me if You Search: When Contextual Web Search Results Affect the Detection of Hallucinations, https://arxiv.org/abs/2504.01153
Security of AI
Research on security issues involving AI and LLMs:
- Jason Koebler, June 26, 2024, Researchers Prove Rabbit AI Breach By Sending Email to Us as Admin, https://www.404media.co/researchers-prove-rabbit-ai-breach-by-sending-email-to-us-as-admin/ (Rabbit's API security credentials were hard-coded into the device.)
- Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, Rui Kong, Yile Wang, Hanfei Geng, Jian Luan, Xuefeng Jin, Zilong Ye, Guanjing Xiong, Fan Zhang, Xiang Li, Mengwei Xu, Zhijun Li, Peng Li, Yang Liu, Ya-Qin Zhang, Yunxin Liu, 8 May 2024 (v2), Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security, https://arxiv.org/abs/2401.05459 https://github.com/MobileLLM/Personal_LLM_Agents_Survey
- Michael Nuñez, August 30, 2024, AI is growing faster than companies can secure it, warn industry leaders, https://venturebeat.com/ai/ai-is-growing-faster-than-companies-can-secure-it-warn-industry-leaders/
- Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
- Huan Yang, Deyu Zhang, Yudong Zhao, Yuanchun Li, Yunxin Liu, 6 Sep 2024, A First Look At Efficient And Secure On-Device LLM Inference Against KV Leakage, https://arxiv.org/abs/2409.04040 (Security issues where KV caches can be data leaks as they may contain encodings of private information.)
- Nicholas Carlini, Milad Nasr, 22 Oct 2024, Remote Timing Attacks on Efficient Language Model Inference, https://arxiv.org/abs/2410.17175
- Chenchen Gu, Xiang Lisa Li, Rohith Kuditipudi, Percy Liang, Tatsunori Hashimoto, Dec 2024, Timing Attacks on Prompt Caching in Language Model APIs, Stanford CS 191W Senior Project, https://cs191w.stanford.edu/projects/Gu,%20Chenchen_CS191W.pdf (Using timing attacks to detect prefix KV caching, thereby gaining information about other users' prompts.)
- Úlfar Erlingsson, 27 Mar 2025, How to Secure Existing C and C++ Software without Memory Safety, https://arxiv.org/pdf/2503.21145 (Examines four risk mitigation techniques for memory safety.)
- Anthropic, 13 Aug 2025, Building Safeguards for Claude, https://www.anthropic.com/news/building-safeguards-for-claude
- Pallavi Zambare, Venkata Nikhil Thanikella, Nikhil Padmanabh Kottur, Sree Akhil Akula, Ying Liu, 12 Aug 2025, NetMoniAI: An Agentic AI Framework for Network Security & Monitoring, https://arxiv.org/abs/2508.10052
- Miles Q. Li and Benjamin C. M. Fung, 13 Aug 2025, Security Concerns for Large Language Models: A Survey, https://arxiv.org/abs/2505.18889
- Vita Santa Barletta, Vito Bavaro, Miriana Calvano, Antonio Curci, Antonio Piccinno, Davide Pio Posa, 23 Jul 2025, Enabling Cyber Security Education through Digital Twins and Generative AI, https://arxiv.org/abs/2507.17518
- Haibo Wang, Lutfu S.Sua, and Bahram Alidaee, 22 Jul 2025, Enhancing supply chain security with automated machine learning, https://arxiv.org/abs/2406.13166
- Lily Stelling, Mick Yang, Rokas Gipi\v{s}kis, Leon Staufer, Ze Shen Chin, Sim\'eon Campos, Ariel Gil, and Michael Chen, 22 Jul 2025, Mapping Industry Practices to the EU AI Act's GPAI Code of Practice Safety and Security Measures, https://arxiv.org/abs/2504.15181
- Rui Guo, Avinash Ayalasomayajula, Henian Li, Jingbo Zhou, Sujan Kumar Saha, Farimah Farahmandi, 22 Jul 2025, SVAgent: AI Agent for Hardware Security Verification Assertion, https://arxiv.org/abs/2507.16203
- Chang Gong and Zhongwen Li and Xiaoqi Li, 24 Jul 2025, Information Security Based on LLM Approaches: A Review, https://arxiv.org/abs/2507.18215
- Pengfei Du, 14 Jul 2025, PRM-Free Security Alignment of Large Models via Red Teaming and Adversarial Training, https://arxiv.org/abs/2507.14202
- Eldor Abdukhamidov, Mohammed Abuhamad, Simon S. Woo, Hyoungshick Kim, Tamer Abuhmed, 18 Jul 2025, Breaking the Illusion of Security via Interpretation: Interpretable Vision Transformer Systems under Attack, https://arxiv.org/abs/2507.14248
- Zhou Li, Xiang Zhang, Jiawen Lv, Jihao Fan, Haiqiang Chen, Giuseppe Caire, 19 Jul 2025, Collusion-Resilient Hierarchical Secure Aggregation with Heterogeneous Security Constraints, https://arxiv.org/abs/2507.14768
- Nidhi Rastogi, Shirid Pant, Devang Dhanuka, Amulya Saxena, Pranjal Mairal, 20 Jul 2025, Too Much to Trust? Measuring the Security and Cognitive Impacts of Explainability in AI-Driven SOCs, https://arxiv.org/abs/2503.02065
- Andrew C. Cullen, Paul Montague, Sarah M. Erfani, Benjamin I.P. Rubinstein, 11 Aug 2025, Position: Certified Robustness Does Not (Yet) Imply Model Security, https://arxiv.org/abs/2506.13024
- Andy Zou, Maxwell Lin, Eliot Jones, Micha Nowak, Mateusz Dziemian, Nick Winter, Alexander Grattan, Valent Nathanael, Ayla Croft, Xander Davies, Jai Patel, Robert Kirk, Nate Burnikell, Yarin Gal, Dan Hendrycks, J. Zico Kolter, Matt Fredrikson, 28 Jul 2025, Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition, https://arxiv.org/abs/2507.20526
- Shen Li, Liuyi Yao, Wujia Niu, Lan Zhang, Yaliang Li, 28 Jul 2025, Security Tensors as a Cross-Modal Bridge: Extending Text-Aligned Safety to Vision in LVLM, https://arxiv.org/abs/2507.20994
- Song Son Ha,Florian Foerster,Thomas Robert Doebbert,Tim Kittel,Dominik Merli,Gerd Scholl, 28 Jul 2025, Testbed and Software Architecture for Enhancing Security in Industrial Private 5G Networks, https://arxiv.org/abs/2507.20873
- Keerthana Madhavan, Abbas Yazdinejad, Fattane Zarrinkalam, Ali Dehghantanha, 26 Jul 2025, Quantifying Security Vulnerabilities: A Metric-Driven Security Analysis of Gaps in Current AI Standards, https://arxiv.org/abs/2502.08610
- Craig Wright, 10 Jul 2025, A Formal Rebuttal of "The Blockchain Trilemma: A Formal Proof of the Inherent Trade-Offs Among Decentralization, Security, and Scalability", https://arxiv.org/abs/2507.21111
- Gauri Sharma, Vidhi Kulkarni, Miles King, Ken Huang, 23 Jul 2025, Towards Unifying Quantitative Security Benchmarking for Multi Agent Systems, https://arxiv.org/abs/2507.21146
- Muzhi Dai, Shixuan Liu, Zhiyuan Zhao, Junyu Gao, Hao Sun, Xuelong Li, 29 Jul 2025, Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security, https://arxiv.org/abs/2507.22037
- Kang Chen, Xiuze Zhou, Yuanguo Lin, Jinhe Su, Yuanhui Yu, Li Shen, Fan Lin, 4 Aug 2025, A Survey on Data Security in Large Language Models, https://arxiv.org/abs/2508.02312
- Niklas Pfister, V\'aclav Volhejn, Manuel Knott, Santiago Arias, Julia Bazi\'nska, Mykhailo Bichurin, Alan Commike, Janet Darling, Peter Dienes, Matthew Fiedler, David Haber, Matthias Kraft, Marco Lancini, Max Mathys, Dami\'an Pascual-Ortiz, Jakub Podolak, Adri\`a Romero-L\'opez, Kyriacos Shiarlis, Andreas Signer, Zsolt Terek, Athanasios Theocharis, Daniel Timbrell, Samuel Trautwein, Samuel Watts, Yun-Han Wu, Mateo Rojas-Carulla, 4 Aug 2025, Gandalf the Red: Adaptive Security for LLMs, https://arxiv.org/abs/2501.07927
- Nusrat Zahan, Imranur Rahman, Laurie Williams, 2 Aug 2025, Assumptions to Evidence: Evaluating Security Practices Adoption and Their Impact on Outcomes in the npm Ecosystem, https://arxiv.org/abs/2504.14026
- Arturo S\'anchez-Matas, Pablo Escribano Ruiz, Daniel D\'iaz-L\'opez, Angel Luis Perales G\'omez, Pantaleone Nespoli, Gregorio Mart\'inez P\'erez, 5 Aug 2025, Simulating Cyberattacks through a Breach Attack Simulation (BAS) Platform empowered by Security Chaos Engineering (SCE), https://arxiv.org/abs/2508.03882
- Hammad Atta, Ken Huang, Manish Bhatt, Kamal Ahmed, Muhammad Aziz Ul Haq, Yasir Mehmood, 6 Aug 2025, Logic layer Prompt Control Injection (LPCI): A Novel Security Vulnerability Class in Agentic Systems, https://arxiv.org/abs/2507.10457
- Minghao Shao, Nanda Rani, Kimberly Milner, Haoran Xi, Meet Udeshi, Saksham Aggarwal, Venkata Sai Charan Putrevu, Sandeep Kumar Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, Muhammad Shafique, 5 Aug 2025, Towards Effective Offensive Security LLM Agents: Hyperparameter Tuning, LLM as a Judge, and a Lightweight CTF Benchmark, https://arxiv.org/abs/2508.05674
- Hiroya Kato, Kentaro Kita, Kento Hasegawa, Seira Hidano, 12 Aug 2025, AI Security Map: Holistic Organization of AI Security Technologies and Impacts on Stakeholders, https://arxiv.org/abs/2508.08583
- Aayush Gupta, 12 Aug 2025, Can AI Keep a Secret? Contextual Integrity Verification: A Provable Security Architecture for LLMs, https://arxiv.org/abs/2508.09288
- Irash Perera (1), Hiranya Abeyrathne (2), Sanjeewa Malalgoda (2), Arshardh Ifthikar (2) ((1) Department of Computer Science and Engineering, University of Moratuwa, Colombo, Sri Lanka, (2) WSO2, Colombo, Sri Lanka), 14 Aug 2025, Enhancing GraphQL Security by Detecting Malicious Queries Using Large Language Models, Sentence Transformers, and Convolutional Neural Networks, https://arxiv.org/abs/2508.11711
- Afrah Gueriani, Hamza Kheddar, Ahmed Cherif Mazari and Mohamed Chahine Ghanem, 17 Aug 2025, A Robust Cross-Domain IDS using BiGRU-LSTM-Attention for Medical and Industrial IoT Security, https://arxiv.org/abs/2508.12470
- Yongjian Guo, Puzhuo Liu, Wanlun Ma, Zehang Deng, Xiaogang Zhu, Peng Di, Xi Xiao, Sheng Wen, 18 Aug 2025, Systematic Analysis of MCP Security, https://arxiv.org/abs/2508.12538
- Yixuan Yang and Daoyuan Wu and Yufan Chen, 17 Aug 2025, MCPSecBench: A Systematic Security Benchmark and Playground for Testing Model Context Protocols, https://arxiv.org/abs/2508.13220
- Daniel M. Jimenez-Gutierrez, Yelizaveta Falkouskaya, Jose L. Hernandez-Ramos, Aris Anagnostopoulos, Ioannis Chatzigiannakis, Andrea Vitaletti, 19 Aug 2025, On the Security and Privacy of Federated Learning: A Survey with Attacks, Defenses, Frameworks, Applications, and Future Directions, https://arxiv.org/abs/2508.13730
- Abbas Sabra, Olivier Schmitt and Joseph Tyler, 20 Aug 2025, Assessing the Quality and Security of AI-Generated Code: A Quantitative Analysis, https://arxiv.org/abs/2508.14727
- Zhixiang Guo, Siyuan Liang, Aishan Liu, Dacheng Tao, 21 Aug 2025, CopyrightShield: Enhancing Diffusion Model Security against Copyright Infringement Attacks, https://arxiv.org/abs/2412.01528
- Akshay Mhatre and Noujoud Nader and Patrick Diehl and Deepti Gupta, 22 Aug 2025, LLM-GUARD: Large Language Model-Based Detection and Repair of Bugs and Security Vulnerabilities in C++ and Python, https://arxiv.org/abs/2508.16419
- Anton Ludwig Bonin, Pawel Robert Smolinski, Jacek Winiarski, 22 Aug 2025, Exploring the Impact of Generative Artificial Intelligence on Software Development in the IT Sector: Preliminary Findings on Productivity, Efficiency and Job Security, https://arxiv.org/abs/2508.16811
- Keke Lian and Bin Wang, Lei Zhang, Libo Chen, Junjie Wang, Ziming Zhao, Yujiu Yang, Haotong Duan, Haoran Zhao, Shuang Liao, Mingda Guo, Jiazheng Quan, Yilu Zhong, Chenhao He, Zichuan Chen, Jie Wu, Haoling Li, Zhaoxuan Li, Jiongchi Yu, Hui Li and Dong Zhang, 25 Aug 2025, A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code, https://arxiv.org/abs/2508.18106
- Matous Kozak, Roshanak Zilouchian Moghaddam, Siva Sivaraman, 23 Aug 2025, When Developer Aid Becomes Security Debt: A Systematic Analysis of Insecure Behaviors in LLM Coding Agents, https://arxiv.org/abs/2507.09329
- Ada Chen, Yongjiang Wu, Junyuan Zhang, Jingyu Xiao, Shu Yang, Jen-tse Huang, Kun Wang, Wenxuan Wang, Shuai Wang, 25 Aug 2025, A Survey on the Safety and Security Threats of Computer-Using Agents: JARVIS or Ultron?, https://arxiv.org/abs/2505.10924
- Niveen O. Jaffal, Mohammed Alkhanafseh, David Mohaisen, 18 Jul 2025, Large Language Models in Cybersecurity: Applications, Vulnerabilities, and Defense Techniques, https://arxiv.org/abs/2507.13629
- Julia Laubmann, Johannes Reschke, 18 Jul 2025, Tackling fake images in cybersecurity -- Interpretation of a StyleGAN and lifting its black-box, https://arxiv.org/abs/2507.13722
- Felix H\"arer, 19 Jul 2025, Specification and Evaluation of Multi-Agent LLM Systems -- Prototype and Cybersecurity Applications, https://arxiv.org/abs/2506.10467
- Terry Yue Zhuo, Dingmin Wang, Hantian Ding, Varun Kumar, Zijian Wang, 29 Jul 2025, Cyber-Zero: Training Cybersecurity Agents without Runtime, https://arxiv.org/abs/2508.00910
- Mehdi Akbari Gurabi, Lasse Nitz, Radu-Mihai Castravet, Roman Matzutt, Avikarsha Mandal, Stefan Decker, 5 Aug 2025, From Legacy to Standard: LLM-Assisted Transformation of Cybersecurity Playbooks into CACAO Format, https://arxiv.org/abs/2508.03342
- Md Zesun Ahmed Mia, Malyaban Bal, Sen Lu, George M. Nishibuchi, Suhas Chelian, Srini Vasan, Abhronil Sengupta, 6 Aug 2025, Neuromorphic Cybersecurity with Semi-supervised Lifelong Learning, https://arxiv.org/abs/2508.04610
- Daniele Proverbio, Alessio Buscemi, Alessandro Di Stefano, The Anh Han, German Castignani and Pietro Li\`o, 4 Aug 2025, Can LLMs effectively provide game-theoretic-based scenarios for cybersecurity?, https://arxiv.org/abs/2508.05670
- Victor Lopez Juarez, 9 Aug 2025, EU Digital Regulation and Guatemala: AI, 5G, and Cybersecurity, https://arxiv.org/abs/2508.08315
- Yuksel Aydin, 9 Aug 2025, Cognitive Cybersecurity for Artificial Intelligence: Guardrail Engineering with CCS-7, https://arxiv.org/abs/2508.10033
- Aydin Zaboli and Junho Hong, 12 Aug 2025, Generative AI for Cybersecurity of Energy Management Systems: Methods, Challenges, and Future Directions, https://arxiv.org/abs/2508.10044
- Nsengiyumva Wilberforce, 2 Sep 2025, A software security review on Uganda's Mobile Money Services: Dr. Jim Spire's tweets sentiment analysis, https://arxiv.org/abs/2509.03545
- Ofir Cohen, Gil Ari Agmon, Asaf Shabtai, Rami Puzis, 5 Sep 2025, The Information Security Awareness of Large Language Models, https://arxiv.org/abs/2411.13207
- Anders M{\o}lmen H{\o}st and Pierre Lison and Leon Moonen, 25 Aug 2025, A Systematic Approach to Predict the Impact of Cybersecurity Vulnerabilities Using LLMs, https://arxiv.org/abs/2508.18439
- Martin Lochner and Keegan Keplinger, 25 Aug 2025, Collaborative Intelligence: Topic Modelling of Large Language Model use in Live Cybersecurity Operations, https://arxiv.org/abs/2508.18488
- Afan Ali and Irfanullah Khan, 26 Aug 2025, SkyTrust: Blockchain-Enhanced UAV Security for NTNs with Dynamic Trust and Energy-Aware Consensus, https://arxiv.org/abs/2508.18735
- Xavier Cadet, Simona Boboila, Sie Hendrata Dharmawan, Alina Oprea, Peter Chin, 27 Aug 2025, PoolFlip: A Multi-Agent Reinforcement Learning Security Environment for Cyber Defense, https://arxiv.org/abs/2508.19488
- Sai Teja Reddy Adapala, Yashwanth Reddy Alugubelly, 22 Aug 2025, The Aegis Protocol: A Foundational Security Framework for Autonomous AI Agents, https://arxiv.org/abs/2508.19267
- Michael R Smith, Joe Ingram, 27 Aug 2025, Surveying the Operational Cybersecurity and Supply Chain Threat Landscape when Developing and Deploying AI Systems, https://arxiv.org/abs/2508.20307
- Dan Lin, Shunfeng Lu, Ziyan Liu, Jiajing Wu, Junyuan Fang, Kaixin Lin, Bowen Song, Zibin Zheng, 28 Aug 2025, BridgeShield: Enhancing Security for Cross-chain Bridge Applications via Heterogeneous Graph Mining, https://arxiv.org/abs/2508.20517
- Guofu Liao, Taotao Wang, Shengli Zhang, Jiqun Zhang, Shi Long, and Dacheng Tao, 29 Aug 2025, zkLoRA: Fine-Tuning Large Language Models with Verifiable Security via Zero-Knowledge Proofs, https://arxiv.org/abs/2508.21393
- Georgios Syros, Anshuman Suri, Jacob Ginesin, Cristina Nita-Rotaru, Alina Oprea, 29 Aug 2025, SAGA: A Security Architecture for Governing AI Agentic Systems, https://arxiv.org/abs/2504.21034
- Wenxiao Zhang, Xiangrui Kong, Conan Dewitt, Thomas Br\"aunl, Jin B. Hong, 2 Sep 2025, Enhancing Reliability in LLM-Integrated Robotic Systems: A Unified Approach to Security and Safety, https://arxiv.org/abs/2509.02163
- Honghui Xu, Kaiyang Li, Wei Chen, Danyang Zheng, Zhiyuan Li, Zhipeng Cai, 2 Sep 2025, A Survey: Towards Privacy and Security in Mobile Large Language Models, https://arxiv.org/abs/2509.02411
- Chengshuai Zhao, Riccardo De Maria, Tharindu Kumarage, Kumar Satvik Chaudhary, Garima Agrawal, Yiwen Li, Jongchan Park, Yuli Deng, Ying-Chih Chen, Huan Liu, 3 Sep 2025, CyberBOT: Towards Reliable Cybersecurity Education via Ontology-Grounded Retrieval Augmented Generation, https://arxiv.org/abs/2504.00389
- Ayoub Si-ahmed, Mohammed Ali Al-Garadi, Narhimene Boustia, 2 Sep 2025, Explainable Machine Learning-Based Security and Privacy Protection Framework for Internet of Medical Things Systems, https://arxiv.org/abs/2403.09752
- Qingyuan Li, Binchang Li, Cuiyun Gao, Shuzheng Gao, and Zongjie Li, 7 Sep 2025, Empirical Study of Code Large Language Models for Binary Security Patch Detection, https://arxiv.org/abs/2509.06052
- Safayat Bin Hakim, Muhammad Adil, Alvaro Velasquez, Shouhuai Xu, Houbing Herbert Song, 8 Sep 2025, Neuro-Symbolic AI for Cybersecurity: State of the Art, Challenges, and Opportunities, https://arxiv.org/abs/2509.06921
- Guangyu Lei, Tianhao Liang, Yuqi Ping, Xinglin Chen, Longyu Zhou, Junwei Wu, Xiyuan Zhang, Huahao Ding, Xingjian Zhang, Weijie Yuan, Tingting Zhang, Qinyu Zhang, 8 Sep 2025, Enhancing Low-Altitude Airspace Security: MLLM-Enabled UAV Intent Recognition, https://arxiv.org/abs/2509.06312
- Gabriele Digregorio and Marco Di Gennaro and Stefano Zanero and Stefano Longari and Michele Carminati, 8 Sep 2025, When Secure Isn't: Assessing the Security of Machine Learning Model Sharing, https://arxiv.org/abs/2509.06703
- Nicol\`o Romandini, Carlo Mazzocca, Kai Otsuki, Rebecca Montanari, 8 Sep 2025, SoK: Security and Privacy of AI Agents for Blockchain, https://arxiv.org/abs/2509.07131
- Lei Yu, Jingyuan Zhang, Xin Wang, Jiajia Ma, Li Yang, Fengjun Zhang, 12 Sep 2025, SmartCoder-R1: Towards Secure and Explainable Smart Contract Generation with Security-Aware Group Relative Policy Optimization, https://arxiv.org/abs/2509.09942
- Evan Li, Tushin Mallick, Evan Rose, William Robertson, Alina Oprea, Cristina Nita-Rotaru, 10 Sep 2025, ACE: A Security Architecture for LLM-Integrated App Systems, https://arxiv.org/abs/2504.20984
- Yuzhou Nie, Zhun Wang, Yu Yang, Ruizhe Jiang, Yuheng Tang, Xander Davies, Yarin Gal, Bo Li, Wenbo Guo, Dawn Song, 18 Sep 2025, SeCodePLT: A Unified Platform for Evaluating the Security of Code GenAI, https://arxiv.org/abs/2410.11096
- Yuchong Xie, Mingyu Luo, Zesen Liu, Zhixiang Zhang, Kaikai Zhang, Yu Liu, Zongjie Li, Ping Chen, Shuai Wang, Dongdong She, 19 Sep 2025, On the Security of Tool-Invocation Prompts for LLM-Based Agentic Systems: An Empirical Risk Assessment, https://arxiv.org/abs/2509.05755
- Sergio Benlloch-Lopez, Miquel Viel-Vazquez, Javier Naranjo-Alcazar, Jordi Grau-Haro and Pedro Zuccarello, 19 Sep 2025, Threat Modeling for Enhancing Security of IoT Audio Classification Devices under a Secure Protocols Framework, https://arxiv.org/abs/2509.14657
- Kiho Lee, Jungkon Kim, Doowon Kim, Hyoungshick Kim, 16 Sep 2025, A Systematic Evaluation of Parameter-Efficient Fine-Tuning Methods for the Security of Code LLMs, https://arxiv.org/abs/2509.12649
- Magnus Wiik Eckhoff, Peter Marius Flydal, Siem Peters, Martin Eian, Jonas Halvorsen, Vasileios Mavroeidis, Gudmund Grov, 16 Sep 2025, A Graph-Based Approach to Alert Contextualisation in Security Operations Centres, https://arxiv.org/abs/2509.12923
- Shaina Raza, Ranjan Sapkota, Manoj Karkee, Christos Emmanouilidis, 15 Sep 2025, TRiSM for Agentic AI: A Review of Trust, Risk, and Security Management in LLM-based Agentic Multi-Agent Systems, https://arxiv.org/abs/2506.04133
- Umberto Gon\c{c}alves de Sousa, 2 Sep 2025, LogGuardQ: A Cognitive-Enhanced Reinforcement Learning Framework for Cybersecurity Anomaly Detection in Security Logs, https://arxiv.org/abs/2509.10511
- Ali Habibzadeh, Farid Feyzi, and Reza Ebrahimi Atani, 13 Sep 2025, Large Language Models for Security Operations Centers: A Comprehensive Survey, https://arxiv.org/abs/2509.10858
- Ambra Demontis, Srishti Gupta, Maura Pintor, Luca Demetrio, Kathrin Grosse, Hsiao-Ying Lin, Chengfang Fang, Battista Biggio, Fabio Roli, 15 Sep 2025, Security of Deep Reinforcement Learning for Autonomous Driving: A Survey, https://arxiv.org/abs/2212.06123
- Amena Amro and Manar H. Alalfi, 17 Sep 2025, GitHub's Copilot Code Review: Can AI Spot Security Flaws Before You Commit?, https://arxiv.org/abs/2509.13650
- Adel ElZemity, Budi Arief and Shujun Li, 17 Sep 2025, CyberLLMInstruct: A Pseudo-malicious Dataset Revealing Safety-performance Trade-offs in Cyber Security LLM Fine-tuning, https://arxiv.org/abs/2503.09334
- Adel ElZemity, Budi Arief and Shujun Li, 17 Sep 2025, Analysing Safety Risks in LLMs Fine-Tuned with Pseudo-Malicious Cyber Security Data, https://arxiv.org/abs/2505.09974
- Samuele Pasini, Jinhan Kim, Tommaso Aiello, Rocio Cabrera Lozoya, Antonino Sabetta, Paolo Tonella, 17 Sep 2025, Evaluating and Improving the Robustness of Security Attack Detectors Generated by LLMs, https://arxiv.org/abs/2411.18216
Safety Monitor
A safety monitor is a component that can be added to the LLM deployment.
- OpenAI, Moderation: Learn how to build moderation into your AI applications, 2024, https://platform.openai.com/docs/guides/moderation
- Azure, 06/13/2024, Content filtering, https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/content-filter?tabs=warning%2Cpython
- Yu Wang, Xiaogeng Liu, Yu Li, Muhao Chen, Chaowei Xiao, 14 Mar 2024, AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting, https://arxiv.org/abs/2403.09513 Code: https://github.com/rain305f/AdaShield
- Jinhwa Kim, Ali Derakhshan, Ian G. Harris, 31 Oct 2023, Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield, https://arxiv.org/abs/2311.00172
- Francisco Munguia-Galeano, Zhengxue Zhou, Satheeshkumar Veeramani, Hatem Fakhruldeen, Louis Longley, Rob Clowes and Andrew I. Cooper, 7 Aug 2025, Chemist Eye: A Visual Language Model-Powered System for Safety Monitoring and Robot Decision-Making in Self-Driving Laboratories, https://arxiv.org/abs/2508.05148
General Thoughts on AI Safety
High-level debate and discussions of AI safety issues:
- Stephen Hawking, Max Tegmark, Stuart Russell, and Frank Wilczek. April 2014. Transcending complacency on superintelligent machines. http://www.huffingtonpost.com/stephen-hawking/artificial-intelligence_b_5174265.html
- S. Alexander. OpenAI’s “Planning for AGI and beyond”. March 2023, https://astralcodexten.substack.com/p/openais-planning-for-agi-and-beyond
- N. Bostrom. The vulnerable world hypothesis. Global Policy, 10(4):455–476, 2019. https://doi.org/10.1111/1758-5899.12718
- Smitha Milli, Dylan Hadfield-Menell, Anca Dragan, and Stuart Russell. Should robots be obedient? In International Joint Conference on Artificial Intelligence, 2017. https://arxiv.org/abs/1705.09990
- Nick Bostrom. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, March 2016, https://www.amazon.com.au/Superintelligence-Professor-Philosophy-Institute-University/dp/0198739834/
- Nick Bostrom. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, July 2014 (prior edition), https://www.amazon.com.au/Superintelligence-Dangers-Strategies-Nick-Bostrom-ebook/dp/B00LOOCGB2/
- OpenAI, May 2023, Governance of superintelligence, https://openai.com/blog/governance-of-superintelligence
- Winfield AFT, Jirotka M. Ethical governance is essential to building trust in robotics and artificial intelligence systems. Philos Trans A Math Phys Eng Sci. 2018 Oct 15;376(2133):20180085. doi: 10.1098/rsta.2018.0085. PMID: 30323000 https://pubmed.ncbi.nlm.nih.gov/30323000/
- OpenAI, Feb 2023, How should AI systems behave, and who should decide? https://openai.com/blog/how-should-ai-systems-behave
- Stuart Russell. Should we fear supersmart robots? Scientific American, 314(6):58–59, 2016. https://www.scientificamerican.com/article/should-we-fear-supersmart-robots/, https://pubmed.ncbi.nlm.nih.gov/27196844/
- A Ramalho, 2017, Will robots rule the (artistic) world? A proposed model for the legal status of creations by artificial intelligence systems, https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2987757
- Bernd Carsten Stahl, 2023, Embedding responsibility in intelligent systems: from AI ethics to responsible AI ecosystems, Scientific Reports Open Access 18 May 2023, https://doi.org/10.1038/s41598-023-34622-w
- McCarthy, John, and Patrick J. Hayes. 1969. Some Philosophical Problems From the Standpoint of Artificial Intelligence, In: Machine Intelligence 4, B. Meltzer and D. Michie (eds.), Edinburgh University Press, 1969, pp. 463-502, Stanford University. http://jmc.stanford.edu/articles/mcchay69.html
- Russell, Stuart J. 2019. Human Compatible: Artificial Intelligence and the Problem of Control (Viking-Penguin Random House: London). https://link.springer.com/chapter/10.1007/978-3-030-86144-5_3
- Winfield A.F.T., Jirotka M., 2018, Ethical governance is essential to building trust in robotics and artificial intelligence systems. Philos. Trans. R. Soc. A. Math. Phys. Eng. Sci. 2018;376:13. http://www.ncbi.nlm.nih.gov/pmc/articles/pmc6191667/, https://pubmed.ncbi.nlm.nih.gov/30323000/
- Thomas Claburn 12 Oct 2023, AI safety guardrails easily thwarted, security study finds, The Register, https://www.theregister.com/2023/10/12/chatbot_defenses_dissolve/
- Alibaba Qwen Team, Sep 2023, Qwen Technical Report, https://arxiv.org/pdf/2309.16609.pdf
Government Policy and Regulation
Various governments have examined issues around regulation, and there has also been much debate:
- A. Solender and A. Gold. April 2023, Scoop: Schumer lays groundwork for Congress to regulate AI. https://www.axios.com/2023/04/13/congress-regulate-ai-tech
- UK Government. National AI strategy. Sep 2021. https://www.gov.uk/government/publications/national-ai-strategy
- AI Now Institute, A. Kak, and S. M. West. April 2023, General purpose AI poses serious risks, should not be excluded from the EU’s AI Act. https://ainowinstitute.org/publication/gpai-is-high-risk-should-not-be-excluded-from-eu-ai-act
- L. Bertuzzi. March 2023, Leading EU lawmakers propose obligations for general purpose ai. https://www.euractiv.com/section/artificial-intelligence/news/leading-eu-lawmakers-propose-obligations-for-general-purpose-ai
- UK Department for Science and Technology. Aug 2023, Policy paper: A pro-innovation approach to AI regulation. https://www.gov.uk/government/publications/ai-regulation-a-pro-innovation-approach/white-paper
- White House. May 2023. Fact sheet: Biden-Harris Administration announces new actions to promote responsible AI innovation that protects Americans’ rights and safety. https://www.whitehouse.gov/briefing-room/statements-releases/2023/05/04/fact-sheet-biden-harris-administration-annou nces-new-actions-to-promote-responsible-ai-innovation-that-protects-americans-rights-and-safety
- B. Zhang, M. Anderljung, L. Kahn, N. Dreksler, M. C. Horowitz, and A. Dafoe. 2021, Ethics and governance of artificial intelligence: Evidence from a survey of machine learning researchers. arXiv preprint arXiv:2105.02117, https://arxiv.org/abs/2105.02117
- ISO/IEC. 2023, ISO/IEC 23894:2023 Information technology — Artificial intelligence — Guidance on risk management. https://www.iso.org/standard/77304.html
- NIST, AI Risk Management Framework Concept Paper, 13 December 2021, PDF: https://www.nist.gov/system/files/documents/2021/12/14/AI%20RMF%20Concept%20Paper_13Dec2021_posted.pdf
- NIST. 2023, Artificial Intelligence Risk Management Framework (AI RMF 1.0). https://doi.org/10.6028/NIST.AI.100-1, https://www.nist.gov/itl/ai-risk-management-framework
- Tathagat Katiyar & Harshitha Chondamma II, Accorian, Feb 2023, UNDERSTANDING AI RMF 1.0 – The Artificial Intelligence Risk Management Framework https://accorian.com/understanding-ai-rmf-1-0-the-artificial-intelligence-risk-management-framework/
- E. Yudkowsky, 2023. Pausing AI developments isn’t enough. We need to shut it all down. https://time.com/6266923/ai-eliezer-yudkowsky-open-letter-not-enough
- Stephanie Palazzolo, Erin Woo, Aug 2024, Passage of California AI Bill Sends Shivers Across Tech Industry, https://www.theinformation.com/articles/passage-of-california-ai-bill-sends-shivers-across-tech-industry
Auditing and Enforcement
Papers on auditing or enforcement of AI policy:
- J. Mökander and L. Floridi. 2022, Operationalising AI governance through ethics-based auditing: An industry case study. AI and Ethics, pages 1–18, https://link.springer.com/article/10.1007/s43681-022-00171-7
- J. Mökander, J. Schuett, H. R. Kirk, and L. Floridi. June 2023. Auditing large language models: A three-layered approach. arXiv preprint arXiv:2302.08500. https://arxiv.org/abs/2302.08500
- J. Mökander, J. Morley, M. Taddeo, and L. Floridi. Ethics-based auditing of automated decision-making systems: Nature, scope, and limitations. Science and Engineering Ethics, 27(44), 2021. https://arxiv.org/abs/2110.10980
Bias and Fairness
AI engines have shown bias in various ways. The goal is to have them show "fairness" in their results:
- Dastin Jeffrey. Oct 2018, Amazon scraps secret AI recruiting tool that showed bias against women. Reuters. https://www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G
- Courtland R., 2018, Bias detectives: the researchers striving to make algorithms fair. Nature. 2018 Jun;558(7710):357-360. doi: 10.1038/d41586-018-05469-3. PMID: 29925973 https://pubmed.ncbi.nlm.nih.gov/29925973/
- Caliskan Aylin, Bryson Joanna J., Narayanan Arvind. 2017. Semantics derived automatically from language corpora contain human-like biases. Science. 2017;356:183–186. https://pubmed.ncbi.nlm.nih.gov/28408601/
- A Levendowski, 2018, How copyright law can fix artificial intelligence's implicit bias problem, Wash. L. Rev., https://digitalcommons.law.uw.edu/cgi/viewcontent.cgi?article=5042&context=wlr
- Hao Karen. 2020. AI researchers say scientific publishers help perpetuate racist algorithms. MIT Technology Review. https://www.technologyreview.com/2020/06/23/1004333/ai-science-publishers-perpetuate-racist-face-recognition/
- K Ramesh, A Chavan, S Pandit, 2023, A Comparative Study on the Impact of Model Compression Techniques on Fairness in Language Models, Microsoft Research, https://aclanthology.org/2023.acl-long.878.pdf, https://www.microsoft.com/en-us/research/uploads/prod/2023/07/3687_Paper.pdf
- Jwala Dhamala, Varun Kumar, Rahul Gupta, Kai-Wei Chang, Aram Galstyan, Oct 2022, An Analysis of the Effects of Decoding Algorithms on Fairness in Open-Ended Language Generation, https://arxiv.org/abs/2210.03826 (Examines top-p, top-k, and temperature in decoding algorithms from a safety perspective.)
- Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, Cyril Zhang, June 2023, Exposing Attention Glitches with Flip-Flop Language Modeling, https://arxiv.org/abs/2306.00946
- Valentin Hofmann, Pratyusha Ria Kalluri, Dan Jurafsky, Sharese King, 1 Mar 2024, Dialect prejudice predicts AI decisions about people's character, employability, and criminality, https://arxiv.org/abs/2403.00742 https://arxiv.org/pdf/2403.00742.pdf
- Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, Lina Yao, 6 Jul 2024, AI Safety in Generative AI Large Language Models: A Survey, https://arxiv.org/abs/2407.18369
- Cem Dilmegani, Jan 10, 2024, The Future of Large Language Models in 2024, https://research.aimultiple.com/future-of-large-language-models/
- Douglas C. Youvan, September 27, 2024, Building and Running Large-Scale Language Models: The Infrastructure and Techniques Behind GPT-4 , https://www.researchgate.net/profile/Douglas-Youvan/publication/384398902_Building_and_Running_Large-Scale_Language_Models_The_Infrastructure_and_Techniques_Behind_GPT-4/links/66f6f4d3906bca2ac3d20e68/Building-and-Running-Large-Scale-Language-Models-The-Infrastructure-and-Techniques-Behind-GPT-4.pdf
- Mayank Vatsa, Anubhooti Jain, Richa Singh, 7 Dec 2023, Adventures of Trustworthy Vision-Language Models: A Survey, https://arxiv.org/abs/2312.04231
- Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Martín Santillán Cooper, Kieran Fraser, Giulio Zizzo, Muhammad Zaid Hameed, Mark Purcell, Michael Desmond, Qian Pan, Inge Vejsbjerg, Elizabeth M. Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R. Varshney, Prasanna Sattigeri, 10 Dec 2024, Granite Guardian, https://arxiv.org/abs/2412.07724 https://github.com/ibm-granite/granite-guardian (Open-sourcing of safety models with many capabilities.)
- FZ Subah, Oct 2025, Mitigating and Assessing Bias and Fairness in Large Language Model-Generated Synthetic Tabular Data, Masters Thesis, Department of Engineering, University of Cambridge, https://www.mlmi.eng.cam.ac.uk/files/2023-2024/fzs21_mitigating_2024.pdf
- Abdullah Hashmat, Muhammad Arham Mirza, Agha Ali Raza, 13 Aug 2025, PakBBQ: A Culturally Adapted Bias Benchmark for QA, https://arxiv.org/abs/2508.10186
- Gustavo Bonil, Simone Hashiguti, Jhessica Silva, Jo\~ao Gondim, Helena Maia, N\'adia Silva, Helio Pedrini, Sandra Avila, 14 Aug 2025, Yet another algorithmic bias: A Discursive Analysis of Large Language Models Reinforcing Dominant Discourses on Gender and Race, https://arxiv.org/abs/2508.10304
- Alessio Buscemi, Daniele Proverbio, Alessandro Di Stefano, The Anh Han, German Castignani and Pietro Li\`o, 14 Aug 2025, FAIRGAME: a Framework for AI Agents Bias Recognition using Game Theory, https://arxiv.org/abs/2504.14325
- Suhas G Hegde, Shilpy Kaur, Aruna Tiwari, 14 Aug 2025, VectorFit : Adaptive Singular & Bias Vector Fine-Tuning of Pre-trained Foundation Models, https://arxiv.org/abs/2503.19530
- Yan Li, Guangyi Chen, Yunlong Deng, Zijian Li, Zeyu Tang, Anpeng Wu, Kun Zhang, 22 Jul 2025, Should Bias Always be Eliminated? A Principled Framework to Use Data Bias for OOD Generation, https://arxiv.org/abs/2507.17001
- Shalaka Satheesh, Katrin Klug, Katharina Beckh, H\'ector Allende-Cid, Sebastian Houben, Teena Hassan, 22 Jul 2025, GG-BBQ: German Gender Bias Benchmark for Question Answering, https://arxiv.org/abs/2507.16410
- Kristin Gnadt, David Thulke, Simone Kopeinik, Ralf Schl\"uter, 22 Jul 2025, Exploring Gender Bias in Large Language Models: An In-depth Dive into the German Language, https://arxiv.org/abs/2507.16557
- Zhenyuan Chen, 21 Jul 2025, Rethinking Inductive Bias in Geographically Neural Network Weighted Regression, https://arxiv.org/abs/2507.09958
- Sergio Morales, Robert Claris\'o, Jordi Cabot, 22 Jul 2025, LangBiTe: A Platform for Testing Bias in Large Language Models, https://arxiv.org/abs/2404.18558
- Yanbiao Ma, Bowei Liu, Boyuan Gao, Wei Dai, Jiayi Chen, Shuo Li, Andi Zhang, 22 Jul 2025, Revealing Bias Formation in Deep Neural Networks Through the Geometric Mechanisms of Human Visual Decoupling, https://arxiv.org/abs/2502.11809
- Brian Liu and Rahul Mazumder, 21 Jul 2025, Randomization Can Reduce Both Bias and Variance: A Case Study in Random Forests, https://arxiv.org/abs/2402.12668
- Ali Vardasbi, Gustavo Penha, Claudia Hauff, and Hugues Bouchard, 23 Jul 2025, Adaptive Repetition for Mitigating Position Bias in LLM-Based Ranking, https://arxiv.org/abs/2507.17788
- Steven A. Frank, 24 Jul 2025, The Price equation reveals a universal force-metric-bias law of algorithmic learning and natural selection, https://arxiv.org/abs/2507.18549
- Bruno Scarone, Alfredo Viola, Ren\'ee J. Miller, Ricardo Baeza-Yates, 24 Jul 2025, A Principled Approach for Data Bias Mitigation, https://arxiv.org/abs/2405.12312
- He-Yang Xu, Hongxiang Gao, Yuwen Li, Xiu-Shen Wei and Chengyu Liu, 24 Jul 2025, Masked Autoencoders that Feel the Heart: Unveiling Simplicity Bias for ECG Analyses, https://arxiv.org/abs/2506.22495
- Yuen Chen, Vethavikashini Chithrra Raghuram, Justus Mattern, Rada Mihalcea, Zhijing Jin, 24 Jul 2025, Causally Testing Gender Bias in LLMs: A Case Study on Occupational Bias, https://arxiv.org/abs/2212.10678
- Yongyi Yang, Hidenori Tanaka, Wei Hu, 17 Jul 2025, Provable Low-Frequency Bias of In-Context Learning of Representations, https://arxiv.org/abs/2507.13540
- Yile Yan, Yuqi Zhu, Wentao Xu, 18 Jul 2025, Bias in Decision-Making for AI's Ethical Dilemmas: A Comparative Study of ChatGPT and Claude, https://arxiv.org/abs/2501.10484
- Andr\'es Morales-Forero (1), Lili J. Rueda (2), Ronald Herrera (3), Samuel Bassetto (1), Eric Coatanea (4) ((1) Polytechnique Montr\'eal, (2) Universidad El Bosque, (3) Boehringer Ingelheim International GmbH, (4) Tampere University), 10 Jul 2025, Predictive Representativity: Uncovering Racial Bias in AI-based Skin Cancer Detection, https://arxiv.org/abs/2507.14176
- Xiaotong Luo, Shengda Zhuo, Min Chen, Lichun Li, Ruizhao Lu, Wenqi Fan, Shuqiang Huang and Yin Tang, 12 Jul 2025, From Bias to Behavior: Learning Bull-Bear Market Dynamics with Contrastive Modeling, https://arxiv.org/abs/2507.14182
- Eoghan Cunningham, James Cross, Derek Greene, 16 Jul 2025, Identifying Algorithmic and Domain-Specific Bias in Parliamentary Debate Summarisation, https://arxiv.org/abs/2507.14221
- Garud Iyengar, Henry Lam, Tianyu Wang, 21 Jul 2025, Optimizer's Information Criterion: Dissecting and Correcting Bias in Data-Driven Optimization, https://arxiv.org/abs/2306.10081
- Evangelia Spiliopoulou, Riccardo Fogliato, Hanna Burnsky, Tamer Soliman, Jie Ma, Graham Horwood, Miguel Ballesteros, 8 Aug 2025, Play Favorites: A Statistical Method to Measure Self-Bias in LLM-as-a-Judge, https://arxiv.org/abs/2508.06709
- Falaah Arif Khan, Nivedha Sivakumar, Yinong Oliver Wang, Katherine Metcalf, Cezanne Camacho, Barry-John Theobald, Luca Zappella, Nicholas Apostoloff, 9 Aug 2025, Investigating Intersectional Bias in Large Language Models using Confidence Disparities in Coreference Resolution, https://arxiv.org/abs/2508.07111
- Vivek Hruday Kavuri, Vysishtya Karanam, Venkata Jahnavi Venkamsetty, Kriti Madumadukala, Lakshmipathi Balaji Darur, Ponnurangam Kumaraguru, 10 Aug 2025, Freeze and Reveal: Exposing Modality Bias in Vision-Language Models, https://arxiv.org/abs/2508.07432
- Vojt\v{e}ch Stan\v{e}k, Karel Srna, Anton Firc, Kamil Malinka, 11 Aug 2025, SCDF: A Speaker Characteristics DeepFake Speech Dataset for Bias Analysis, https://arxiv.org/abs/2508.07944
- Xinyi Wu, Yifei Wang, Stefanie Jegelka, Ali Jadbabaie, 9 Aug 2025, On the Emergence of Position Bias in Transformers, https://arxiv.org/abs/2502.01951
- Walter Laurito, Benjamin Davis, Peli Grietzer, Tom\'a\v{s} Gaven\v{c}iak, Ada B\"ohm, Jan Kulveit, 11 Aug 2025, AI-AI Bias: large language models favor communications generated by large language models, https://arxiv.org/abs/2407.12856
- Dasol Choi, Jihwan Lee, Minjae Lee, Minsuk Kahng, 10 Aug 2025, When Cars Have Stereotypes: Auditing Demographic Bias in Objects from Text-to-Image Models, https://arxiv.org/abs/2508.03483
- Anuprabha M, Krishna Gurugubelli and Anil Kumar Vuppala, 11 Aug 2025, Fairness in Dysarthric Speech Synthesis: Understanding Intrinsic Bias in Dysarthric Speech Cloning using F5-TTS, https://arxiv.org/abs/2508.05102
- Chao Wu, Zhenyi Wang, Kangxian Xie, Naresh Kumar Devulapally, Vishnu Suresh Lokhande, Mingchen Gao, 28 Jul 2025, Model-Agnostic Gender Bias Control for Text-to-Image Generation via Sparse Autoencoder, https://arxiv.org/abs/2507.20973
- Gabriel Recchia, Chatrik Singh Mangat, Jinu Nyachhyon, Mridul Sharma, Callum Canavan, Dylan Epstein-Gross, Muhammed Abdulbari, 17 May 2025, Confirmation bias: A challenge for scalable oversight, https://arxiv.org/abs/2507.19486
- Pavel Korshunov, Ketan Kotwal, Christophe Ecabert, Vidit Vidit, Amir Mohammadi, and Sebastien Marcel, 28 Jul 2025, Investigation of Accuracy and Bias in Face Recognition Trained with Synthetic Data, https://arxiv.org/abs/2507.20782
- Hoyoung Lee, Junhyuk Seo, Suhwan Park, Junhyeong Lee, Wonbin Ahn, Chanyeol Choi, Alejandro Lopez-Lira, Yongjae Lee, 28 Jul 2025, Your AI, Not Your View: The Bias of LLMs in Investment Analysis, https://arxiv.org/abs/2507.20957
- Yooshin Cho, Hanbyel Cho, Janghyeon Lee, HyeongGwon Hong, Jaesung Ahn, Junmo Kim, 27 Jul 2025, Controllable Feature Whitening for Hyperparameter-Free Bias Mitigation, https://arxiv.org/abs/2507.20284
- Seoyoung Doh, Hyeon Jeon, Sungbok Shin, Ghulam Jilani Quadri, Nam Wook Kim, Jinwook Seo, 28 Jul 2025, Understanding Bias in Perceiving Dimensionality Reduction Projections, https://arxiv.org/abs/2507.20805
- Hitomi Yanaka, Xinqi He, Jie Lu, Namgi Han, Sunjin Oh, Ryoma Kumon, Yuma Matsuoka, Katsuhiko Watabe, Yuko Itatsu, 27 Jul 2025, Intersectional Bias in Japanese Large Language Models from a Contextualized Perspective, https://arxiv.org/abs/2506.12327
- Franck Bardol, 17 Jun 2025, ChatGPT Reads Your Tone and Responds Accordingly -- Until It Does Not -- Emotional Framing Induces Bias in LLM Outputs, https://arxiv.org/abs/2507.21083
- Zhenyu Pan, Yutong Zhang, Jianshu Zhang, Haoran Lu, Haozheng Luo, Yuwei Han, Philip S. Yu, Manling Li, Han Liu, 30 Jul 2025, FairReason: Balancing Reasoning and Social Bias in MLLMs, https://arxiv.org/abs/2507.23067
- Patricia A. Apell\'aniz and Ana Jim\'enez and Borja Arroyo Galende and Juan Parras and Santiago Zazo, 31 Jul 2025, Artificial Inductive Bias for Synthetic Tabular Data Generation in Data-Scarce Scenarios, https://arxiv.org/abs/2407.03080
- Utku Ozbulak, Seyed Amir Mousavi, Francesca Tozzi, Niki Rashidian, Wouter Willaert, Wesley De Neve, Joris Vankerschaver, 31 Jul 2025, Revisiting the Evaluation Bias Introduced by Frame Sampling Strategies in Surgical Video Segmentation Using SAM2, https://arxiv.org/abs/2502.20934
- Afrozah Nadeem, Mark Dras, and Usman Naseem, 31 Jul 2025, Framing Political Bias in Multilingual LLMs Across Pakistani Languages, https://arxiv.org/abs/2506.00068
- Bushra Asseri, Estabrag Abdelaziz, Areej Al-Wabil, 30 Jul 2025, Prompt Engineering Techniques for Mitigating Cultural Bias Against Arabs and Muslims in Large Language Models: A Systematic Review, https://arxiv.org/abs/2506.18199
- Simon M\"unker, 31 Jul 2025, Cultural Bias in Large Language Models: Evaluating AI Agents through Moral Questionnaires, https://arxiv.org/abs/2507.10073
- Kwesi Cobbina and Tianyi Zhou, 30 Jul 2025, Where to show Demos in Your Prompt: A Positional Bias of In-Context Learning, https://arxiv.org/abs/2507.22887
- Adam Block and Cyril Zhang, 31 Jul 2025, EMA Without the Lag: Bias-Corrected Iterate Averaging Schemes, https://arxiv.org/abs/2508.00180
- Kangda Wei, Hasnat Md Abdullah, Ruihong Huang, 1 Aug 2025, Mitigating Gender Bias via Fostering Exploratory Thinking in LLMs, https://arxiv.org/abs/2505.17217
- Zhenpeng Chen, Xinyue Li, Jie M. Zhang, Weisong Sun, Ying Xiao, Tianlin Li, Yiling Lou, Yang Liu, 5 Aug 2025, Software Fairness Dilemma: Is Bias Mitigation a Zero-Sum Game?, https://arxiv.org/abs/2508.03323
- Jiangen He, 2 Aug 2025, Who Gets Cited? Gender- and Majority-Bias in LLM-Driven Reference Selection, https://arxiv.org/abs/2508.02740
- Shahed Masoudian, Gustavo Escobedo, Hannah Strauss, Markus Schedl, 5 Aug 2025, Investigating Gender Bias in LLM-Generated Stories via Psychological Stereotypes, https://arxiv.org/abs/2508.03292
- Joseph Lee, Tianqi Shang, Jae Young Baik, Duy Duong-Tran, Shu Yang, Lingyao Li, Li Shen, 4 Aug 2025, From Promising Capability to Pervasive Bias: Assessing Large Language Models for Emergency Department Triage, https://arxiv.org/abs/2504.16273
- Zhen Zou, Feng Zhao, 5 Aug 2025, FEB-Cache: Frequency-Guided Exposure Bias Reduction for Enhancing Diffusion Transformer Caching, https://arxiv.org/abs/2503.07120
- Hamed Ayoobi, Nico Potyka, Anna Rapberger, Francesca Toni, 6 Aug 2025, Argumentative Debates for Transparent Bias Detection [Technical Report], https://arxiv.org/abs/2508.04511
- Tiffany Zhu, Iain Weissburg, Kexun Zhang, William Yang Wang, 6 Aug 2025, Human Bias in the Face of AI: Examining Human Judgment Against Text Labeled as AI Generated, https://arxiv.org/abs/2410.03723
- Tosin Fadahunsi, Giordano d'Aloisio, Antinisca Di Marco, Federica Sarro, 5 Aug 2025, How Do Generative Models Draw a Software Engineer? A Case Study on Stable Diffusion Bias, https://arxiv.org/abs/2501.09014
- Kelsey Doerksen, Yuliya Marchetti, Kevin Bowman, Steven Lu, James Montgomery, Yarin Gal, Freddie Kalaitzis, Kazuyuki Miyazaki, 6 Aug 2025, Leveraging Deep Learning for Physical Model Bias of Global Air Quality Estimates, https://arxiv.org/abs/2508.04886
- Menghua Jiang, Yuxia Lin, Baoliang Chen, Haifeng Hu, Yuncheng Jiang, Sijie Mai, 7 Aug 2025, Disentangling Bias by Modeling Intra- and Inter-modal Causal Attention for Multimodal Sentiment Analysis, https://arxiv.org/abs/2508.04999
- Jiahao Chen, Bin Qin, Jiangmeng Li, Hao Chen, Bing Su, 8 Aug 2025, Rethinking the Bias of Foundation Model under Long-tailed Distribution, https://arxiv.org/abs/2501.15955
- Shivam Dubey, 12 Aug 2025, Activation Steering for Bias Mitigation: An Interpretable Approach to Safer LLMs, https://arxiv.org/abs/2508.09019
- Afrozah Nadeem, Mark Dras, Usman Naseem, 12 Aug 2025, Steering Towards Fairness: Mitigating Political Bias in LLMs, https://arxiv.org/abs/2508.08846
- Krzysztof Maziarz, Guoqing Liu, Hubert Misztela, Austin Tripp, Junren Li, Aleksei Kornev, Piotr Gai\'nski, Holger Hoefling, Mike Fortunato, Rishi Gupta, Marwin Segler, 12 Aug 2025, Chemist-aligned retrosynthesis by ensembling diverse inductive bias models, https://arxiv.org/abs/2412.05269
- Zara Siddique, Irtaza Khalid, Liam D. Turner, Luis Espinosa-Anke, 13 Aug 2025, Shifting Perspectives: Steering Vectors for Robust Bias Mitigation in LLMs, https://arxiv.org/abs/2503.05371
- Jingwei Li, Jing Xu, Zifan Wang, Huishuai Zhang, Jingzhao Zhang, 13 Aug 2025, Understanding Nonlinear Implicit Bias via Region Counts in Input Space, https://arxiv.org/abs/2505.11370
- Parker Whitfill, 14 Aug 2025, Note on Selection Bias in Observational Estimates of Algorithmic Progress, https://arxiv.org/abs/2508.11033
- Aiswarya Konavoor, Raj Abhijit Dandekar, Rajat Dandekar, Sreedath Panat, 15 Aug 2025, Vision-Language Models display a strong gender bias, https://arxiv.org/abs/2508.11262
- Binxu Wang, Cengiz Pehlevan, 14 Aug 2025, An Analytical Theory of Spectral Bias in the Learning Dynamics of Diffusion Models, https://arxiv.org/abs/2503.03206
- Keyon Vafa, Peter G. Chang, Ashesh Rambachan, Sendhil Mullainathan, 14 Aug 2025, What Has a Foundation Model Found? Using Inductive Bias to Probe for World Models, https://arxiv.org/abs/2507.06952
- Pengcheng Huang, Shuhao Liu, Zhenghao Liu, Yukun Yan, Shuo Wang, Zulong Chen, Tong Xiao, 18 Aug 2025, PC-Sampler: Position-Aware Calibration of Decoding Bias in Masked Diffusion Models, https://arxiv.org/abs/2508.13021
- Yuanzhe Hu, Kinshuk Goel, Vlad Killiakov, Yaoqing Yang, 18 Aug 2025, Eigenspectrum Analysis of Neural Networks without Aspect Ratio Bias, https://arxiv.org/abs/2506.06280
- Evan Chen, Run-Jun Zhan, Yan-Bai Lin, Hung-Hsuan Chen, 15 Aug 2025, More Women, Same Stereotypes: Unpacking the Gender Bias Paradox in Large Language Models, https://arxiv.org/abs/2503.15904
- Koffi Ismael Ouattara, Ioannis Krontiris, Theo Dimitrakos and Frank Kargl, 19 Aug 2025, Assessing Trustworthiness of AI Training Dataset using Subjective Logic -- A Use Case on Bias, https://arxiv.org/abs/2508.13813
- Jonathan A. Karr Jr., Benjamin F. Herbst, Ting Hua, Matthew Hauenstein, Georgina Curto, Nitesh V. Chawla, 14 Aug 2025, Combating Homelessness Stigma with LLMs: A New Multi-Modal Dataset for Bias Detection, https://arxiv.org/abs/2508.13187
- Hao Zhang and Chen Li and Basura Fernando, 19 Aug 2025, Mitigating Easy Option Bias in Multiple-Choice Question Answering, https://arxiv.org/abs/2508.13428
- Dariia Puhach and Amir H. Payberah and \'Eva Sz\'ekely, 19 Aug 2025, Who Gets the Mic? Investigating Gender Bias in the Speaker Assignment of a Speech-LLM, https://arxiv.org/abs/2508.13603
- Vinod Kumar Chauhan, Lei Clifton, Achille Sala\"un, Huiqi Yvonne Lu, Kim Branson, Patrick Schwab, Gaurav Nigam, David A. Clifton, 20 Aug 2025, Sample Selection Bias in Machine Learning for Healthcare, https://arxiv.org/abs/2405.07841
- Ilja Kuzborskij, Yasin Abbasi Yadkori, 20 Aug 2025, Low-rank bias, weight decay, and model merging in neural networks, https://arxiv.org/abs/2502.17340
- Haodi Zhong, Liuxin Zou, Di Wang, Bo Wang, Zhenxing Niu, Quan Wang, 21 Aug 2025, EvoFormer: Learning Dynamic Graph-Level Representations with Structural and Temporal Bias Correction, https://arxiv.org/abs/2508.15378
- Cheng Wang, Gelei Deng, Xianglin Yang, Han Qiu, Tianwei Zhang, 21 Aug 2025, When Audio and Text Disagree: Revealing Text Bias in Large Audio-Language Models, https://arxiv.org/abs/2508.15407
- Tuhina Tripathi, Manya Wadhwa, Greg Durrett, Scott Niekum, 21 Aug 2025, Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation, https://arxiv.org/abs/2504.14716
- Saumya Roy, 13 Aug 2025, Persuasiveness and Bias in LLM: Investigating the Impact of Persuasiveness and Reinforcement of Bias in Language Models, https://arxiv.org/abs/2508.15798
- Xu Pan, Jingxuan Fan, Zidi Xiong, Ely Hahami, Jorin Overwiening, Ziqian Xie, 16 Aug 2025, User-Assistant Bias in LLMs, https://arxiv.org/abs/2508.15815
- Srikant Panda, Vishnu Hari, Kalpana Panda, Amit Agarwal, Hitesh Laxmichand Patel, 18 Aug 2025, Who's Asking? Investigating Bias Through the Lens of Disability Framed Queries in LLMs, https://arxiv.org/abs/2508.15831
- Tom Jacobs, Chao Zhou, Rebekka Burkholz, 22 Aug 2025, Mirror, Mirror of the Flow: How Does Regularization Shape Implicit Bias?, https://arxiv.org/abs/2504.12883
- Gousia Habib, Tausifa Jan Saleem, Ishfaq Ahmad Malik, Brejesh Lall, 21 Aug 2025, LIB-KD: Teaching Inductive Bias for Efficient Vision Transformer Distillation and Compression, https://arxiv.org/abs/2310.00369
- Shir Bernstein, David Beste, Daniel Ayzenshteyn, Lea Schonherr, Yisroel Mirsky, 24 Aug 2025, Trust Me, I Know This Function: Hijacking LLM Static Analysis using Bias, https://arxiv.org/abs/2508.17361
- Pooja S. B. Rao and Laxminarayen Nagarajan Venkatesan and Mauro Cherubini and Dinesh Babu Jayagopi, 21 Aug 2025, Invisible Filters: Cultural Bias in Hiring Evaluations Using Large Language Models, https://arxiv.org/abs/2508.16673
- Viacheslav Yusupov, Danil Maksimov, Ameliia Alaeva, Tatiana Zaitceva, Antipina Anna, Anna Vasileva, Chenlin Liu, Rayuth Chheng, Danil Sazanakov, Andrey Chetvergov, Alina Ermilova, Egor Shvetsov, 23 Aug 2025, Token Homogenization under Positional Bias, https://arxiv.org/abs/2508.17126
- Kyra Wilson, Sourojit Ghosh, Aylin Caliskan, 24 Aug 2025, Bias Amplification in Stable Diffusion's Representation of Stigma Through Skin Tones and Their Homogeneity, https://arxiv.org/abs/2508.17465
- Xuan-Bac Nguyen, Thanh-Dat Truong, Pawan Sinha, Khoa Luu, 25 Aug 2025, BRAIN: Bias-Mitigation Continual Learning Approach to Vision-Brain Understanding, https://arxiv.org/abs/2508.18187
- Federico Marcuzzi, Xuefei Ning, Roy Schwartz, and Iryna Gurevych, 25 Aug 2025, How Quantization Shapes Bias in Large Language Models, https://arxiv.org/abs/2508.18088
- Emanuele Zangrando, Piero Deidda, Simone Brugiapaglia, Nicola Guglielmi, Francesco Tudisco, 23 Aug 2025, Provable Emergence of Deep Neural Collapse and Low-Rank Bias in $L^2$-Regularized Nonlinear Networks, https://arxiv.org/abs/2402.03991
- Jihwan Oh, Minchan Jeong, Jongwoo Ko, Se-Young Yun, 24 Aug 2025, Understanding Bias Reinforcement in LLM Agents Debate, https://arxiv.org/abs/2503.16814
- Jiancong Xiao, Ziniu Li, Xingyu Xie, Emily Getzen, Cong Fang, Qi Long, Weijie J. Su, 25 Aug 2025, On the Algorithmic Bias of Aligning Large Language Models with RLHF: Preference Collapse and Matching Regularization, https://arxiv.org/abs/2405.16455
- Paul Scherer, Andreas Kirsch, Jake P. Taylor-King, 4 Sep 2025, When three experiments are better than two: Avoiding intractable correlated aleatoric uncertainty by leveraging a novel bias--variance tradeoff, https://arxiv.org/abs/2509.04363
- Joseph Jackson, Georgiy Lapin, Jeremy E. Thompson, 4 Sep 2025, Gravity Well Echo Chamber Modeling With An LLM-Based Confirmation Bias Model, https://arxiv.org/abs/2509.03832
- Zhengyu Hu, Linxin Song, Jieyu Zhang, Zheyuan Xiao, Tianfu Wang, Zhengyu Chen, Nicholas Jing Yuan, Jianxun Lian, Kaize Ding, Hui Xiong, 4 Sep 2025, Explaining Length Bias in LLM-Based Preference Evaluations, https://arxiv.org/abs/2407.01085
- Andrii Dzhoha, Katya Mirylenka, Egor Malykh, Marco-Andrea Buchmann, Francesca Catino, 3 Sep 2025, Short-Form Video Recommendations with Multimodal Embeddings: Addressing Cold-Start and Bias Challenges, https://arxiv.org/abs/2507.19346
- Junyu Yan, Feng Chen, Yuyang Xue, Yuning Du, Konstantinos Vilouras, Sotirios A. Tsaftaris, Steven McDonagh, 4 Sep 2025, SWiFT: Soft-Mask Weight Fine-tuning for Bias Mitigation, https://arxiv.org/abs/2508.18826
- Yifan Chen, Xiaoou Cheng, Jonathan Niles-Weed, Jonathan Weare, 3 Sep 2025, Convergence of Unadjusted Langevin in High Dimensions: Delocalization of Bias, https://arxiv.org/abs/2408.13115
- Martha O. Dimgba, Sharon Oba, Ameeta Agrawal, Philippe J. Giabbanelli, 3 Sep 2025, Mitigation of Gender and Ethnicity Bias in AI-Generated Stories through Model Explanations, https://arxiv.org/abs/2509.04515
- Karanbir Singh, Deepak Muppiri, William Ngu, 26 Aug 2025, Bias Mitigation Agent: Optimizing Source Selection for Fair and Balanced Knowledge Retrieval, https://arxiv.org/abs/2508.18724
- Jay L. Cunningham, Adinawa Adjagbodjou, Jeffrey Basoah, Jainaba Jawara, Kowe Kadoma, Aaleyah Lewis, 20 Aug 2025, Toward Responsible ASR for African American English Speakers: A Scoping Review of Bias and Equity in Speech Technology, https://arxiv.org/abs/2508.18288
- Kwonyoung Kim, Jungin Park, Jiyoung Lee, Dongbo Min, Kwanghoon Sohn, 26 Aug 2025, PointFix: Learning to Fix Domain Bias for Robust Online Stereo Adaptation, https://arxiv.org/abs/2207.13340
- Sheryl Mathew and N Harshit, 27 Aug 2025, Counterfactual Reward Model Training for Bias Mitigation in Multimodal Reinforcement Learning, https://arxiv.org/abs/2508.19567
- Md Abdullah Al Mamun, Ihsen Alouani, Nael Abu-Ghazaleh, 28 Aug 2025, Poison Once, Refuse Forever: Weaponizing Alignment for Injecting Bias in LLMs, https://arxiv.org/abs/2508.20333
- Ruben Solozabal, Velibor Bojkovic, Hilal AlQuabeh, Kentaro Inui, Martin Tak\'a\v{c}, 28 Aug 2025, Uncovering the Spectral Bias in Diagonal State Space Models, https://arxiv.org/abs/2508.20441
- Farhad Abtahi, Mehdi Astaraki, Fernando Seoane, 29 Aug 2025, Leveraging Imperfection with MEDLEY A Multi-Model Approach Harnessing Bias in Medical AI, https://arxiv.org/abs/2508.21648
- Muskan Saraf, Sajjad Rezvani Boroujeni, Justin Beaudry, Hossein Abedi, Tom Bush, 28 Aug 2025, Quantifying Label-Induced Bias in Large Language Model Self- and Cross-Evaluations, https://arxiv.org/abs/2508.21164
- Liulu He, Shenli Zheng, Karwei Sun, Yijiang Liu, Yufei Zhao, Chongkang Tan, Huanrui Yang, Yuan Du, Li Du, 29 Aug 2025, BASE-Q: Bias and Asymmetric Scaling Enhanced Rotational Quantization for Large Language Models, https://arxiv.org/abs/2506.15689
- Lucas Mansilla, Rodrigo Echeveste, Camila Gonzalez, Diego H. Milone, Enzo Ferrante, 1 Sep 2025, BM-CL: Bias Mitigation through the lens of Continual Learning, https://arxiv.org/abs/2509.01730
- Sanjeeevan Selvaganapathy and Mehwish Nasim, 31 Aug 2025, Confident, Calibrated, or Complicit: Probing the Trade-offs between Safety Alignment and Ideological Bias in Language Models in Detecting Hate Speech, https://arxiv.org/abs/2509.00673
- Theodor Stoecker, Samed Bayer, and Ingo Weber, 28 Aug 2025, Bias Mitigation for AI-Feedback Loops in Recommender Systems: A Systematic Literature Review and Taxonomy, https://arxiv.org/abs/2509.00109
- Chen Zheng, Zhenyu Zhao, 29 Aug 2025, Algorithm Adaptation Bias in Recommendation System Online Experiments, https://arxiv.org/abs/2509.00199
- Ryan Franks, Alexey Miroshnikov, Konstandinos Kotsiopoulos, 2 Sep 2025, Explainable post-training bias mitigation with distribution-based fairness metrics, https://arxiv.org/abs/2504.01223
- Abhishek Pasula and Deepak N. Subramani, 2 Sep 2025, Global Climate Model Bias Correction Using Deep Learning, https://arxiv.org/abs/2504.19145
- Serra Aksoy, 3 Sep 2025, Systematic Evaluation of Attribution Methods: Eliminating Threshold Bias and Revealing Method-Dependent Performance Patterns, https://arxiv.org/abs/2509.03176
- Alissa A. Valentine, Lauren A. Lepow, Lili Chan, Alexander W. Charney, Isotta Landi, 2 Sep 2025, Quantifying Clinician Bias and its Effects on Schizophrenia Diagnosis in the Emergency Department of the Mount Sinai Health System, https://arxiv.org/abs/2509.02651
- Jessica M. Lundin, Ada Zhang, Nihal Karim, Hamza Louzan, Victor Wei, David Adelani, Cody Carroll, 5 Sep 2025, The Token Tax: Systematic Bias in Multilingual Tokenization, https://arxiv.org/abs/2509.05486
- Jinrui Yang, Xudong Han, Timothy Baldwin, 7 Sep 2025, Benchmarking Gender and Political Bias in Large Language Models, https://arxiv.org/abs/2509.06164
- Jinrui Yang, Fan Jiang, Timothy Baldwin, 7 Sep 2025, Language Bias in Information Retrieval: The Nature of the Beast and Mitigation Methods, https://arxiv.org/abs/2509.06195
- Vincent C. Brockers, David A. Ehrlich, Viola Priesemann, 8 Sep 2025, Disentangling Interaction and Bias Effects in Opinion Dynamics of Large Language Models, https://arxiv.org/abs/2509.06858
- Jinrui Yang, Timothy Baldwin, Trevor Cohn, 3 Nov 2023, Multi-EuP: The Multilingual European Parliament Dataset for Analysis of Bias in Information Retrieval, https://arxiv.org/abs/2311.01870
- Viacheslav Sinii, Alexey Gorbatovski, Artem Cherepanov, Boris Shaposhnikov, Nikita Balagansky, Daniil Gavrilov, 8 Sep 2025, Steering LLM Reasoning Through Bias-Only Adaptation, https://arxiv.org/abs/2505.18706
- Rushia Harada, Yuken Kimura, Keito Inoshita, 7 Sep 2025, Role-Playing LLM-Based Multi-Agent Support Framework for Detecting and Addressing Family Communication Bias, https://arxiv.org/abs/2507.11210
- Amnon Balanov, Tamir Bendory, and Wasim Huleihel, 7 Sep 2025, Confirmation Bias in Gaussian Mixture Models, https://arxiv.org/abs/2408.09718
- Qihu Xie, Yuan Li, and Yi Kang, 9 Sep 2025, SBS: Enhancing Parameter-Efficiency of Neural Representations for Neural Networks via Spectral Bias Suppression, https://arxiv.org/abs/2509.07373
- Juan Manuel Contreras, 8 Sep 2025, Automated Evaluation of Gender Bias Across 13 Large Multimodal Models, https://arxiv.org/abs/2509.07050
- Sai Siddhartha Chary Aylapuram, Veeraraju Elluru, Shivang Agarwal, 9 Sep 2025, Bias-Aware Machine Unlearning: Towards Fairer Vision Models via Controllable Forgetting, https://arxiv.org/abs/2509.07456
- Camilo Chac\'on Sartori, Mart\'in Isla Pino, Pedro Pinacho-Davidson, Christian Blum, 5 Sep 2025, LLM-Based Instance-Driven Heuristic Bias In the Context of a Biased Random Key Genetic Algorithm, https://arxiv.org/abs/2509.09707
- Zahraa Al Sahili, Ioannis Patras, Matthew Purver, 11 Sep 2025, Data Matters Most: Auditing Social Bias in Contrastive Vision Language Models, https://arxiv.org/abs/2501.13223
- Zahraa Al Sahili, Ioannis Patras, Matthew Purver, 11 Sep 2025, Breaking Language Barriers or Reinforcing Bias? A Study of Gender and Racial Disparities in Multilingual Contrastive Vision Language Models, https://arxiv.org/abs/2505.14160
- Baichuan Huang, Ananth Balashankar, Amir Aminifar, 19 Sep 2025, BEFT: Bias-Efficient Fine-Tuning of Language Models, https://arxiv.org/abs/2509.15974
- Shuo Wang and Renhao Li and Xi Chen and Yulin Yuan and Derek F. Wong and Min Yang, 18 Sep 2025, Exploring the Impact of Personality Traits on LLM Bias and Toxicity, https://arxiv.org/abs/2502.12566
- Nikolaos Tsilivis, Eitan Gronich, Gal Vardi, Julia Kempe, 19 Sep 2025, Flavors of Margin: Implicit Bias of Steepest Descent in Homogeneous Neural Networks, https://arxiv.org/abs/2410.22069
- Avinash Madasu, Vasudev Lal, Phillip Howard, 19 Sep 2025, Pruning the Paradox: How CLIP's Most Informative Heads Enhance Performance While Amplifying Bias, https://arxiv.org/abs/2503.11103
- Xiaoguang Chang, Teng Wang and Changyin Sun, 13 Sep 2025, A Modern Look at Simplicity Bias in Image Classification Tasks, https://arxiv.org/abs/2509.12265
- Paul Kr\"oger, Emilio Barkett, 16 Sep 2025, Don't Change My View: Ideological Bias Auditing in Large Language Models, https://arxiv.org/abs/2509.12652
- Maximus Powers, Shaina Raza, Alex Chang, Rehana Riaz, Umang Mavani, Harshitha Reddy Jonala, Ansh Tiwari, Hua Wei, 15 Sep 2025, Responsible AI in NLP: GUS-Net Span-Level Bias Detection Dataset and Benchmark for Generalizations, Unfairness, and Stereotypes, https://arxiv.org/abs/2410.08388
- Robin Narsingh Ranabhat, Longwei Wang, Amit Kumar Patel, KC santosh, 14 Sep 2025, Promoting Shape Bias in CNNs: Frequency-Based and Contrastive Regularization for Corruption Robustness, https://arxiv.org/abs/2509.11355
- Amy Rafferty, Rishi Ramaesh, Ajitha Rajan, 18 Sep 2025, Limitations of Public Chest Radiography Datasets for Artificial Intelligence: Label Quality, Domain Shift, Bias and Evaluation Challenges, https://arxiv.org/abs/2509.15107
- Kiana Kiashemshaki, Mohammad Jalili Torkamani, Negin Mahmoudi, Meysam Shirdel Bilehsavar, 17 Sep 2025, Simulating a Bias Mitigation Scenario in Large Language Models, https://arxiv.org/abs/2509.14438
- Chiyu Ma, Enpei Zhang, Yilun Zhao, Wenjun Liu, Yaning Jia, Peijun Qing, Lin Shi, Arman Cohan, Yujun Yan, Soroush Vosoughi, 17 Sep 2025, Judging with Many Minds: Do More Perspectives Mean Less Prejudice? On Bias Amplifications and Resistance in Multi-Agent Based LLM-as-Judge, https://arxiv.org/abs/2505.19477
- Zoya Hammad, Nii Longdon Sowah, 7 Sep 2025, Evaluating and comparing gender bias across four text-to-image models, https://arxiv.org/abs/2509.08004
- Nivedha Sivakumar, Natalie Mackraz, Samira Khorshidi, Krishna Patel, Barry-John Theobald, Luca Zappella, Nicholas Apostoloff, 9 Sep 2025, Bias after Prompting: Persistent Discrimination in Large Language Models, https://arxiv.org/abs/2509.08146
- Daniel Lacker and Fuzhong Zhou, 10 Sep 2025, A hierarchical entropy method for the delocalization of bias in high-dimensional Langevin Monte Carlo, https://arxiv.org/abs/2509.08619
- Ji Zhang, Xu Luo, Lianli Gao, Difan Zou, Hengtao Shen, Jingkuan Song, 10 Sep 2025, From Channel Bias to Feature Redundancy: Uncovering the "Less is More" Principle in Few-Shot Learning, https://arxiv.org/abs/2310.03843
- Xuan Liu, Haoyang Shang, Haojian Jin, 16 Sep 2025, Programmable Cognitive Bias in Social Agents, https://arxiv.org/abs/2509.13588
- Sai Suresh Marchala Vasu, Ivaxi Sheth, Hui-Po Wang, Ruta Binkyte, Mario Fritz, 16 Sep 2025, Justice in Judgment: Unveiling (Hidden) Bias in LLM-assisted Peer Reviews, https://arxiv.org/abs/2509.13400
- Dingwei Zhang, Dong Zhang, Jinhui Tang, 17 Sep 2025, Mitigating Query Selection Bias in Referring Video Object Segmentation, https://arxiv.org/abs/2509.13722
- Mohsinul Kabir, Tasfia Tahsin, Sophia Ananiadou, 17 Sep 2025, From n-gram to Attention: How Model Architectures Learn and Propagate Bias in Language Modeling, https://arxiv.org/abs/2505.12381
Toxicity
Toxicity is the LLM safety issue of ensuring that the AI does not give "toxic" answers to the user. There are many subtypes of this issue, such as ensuring that answers are appropriate, non-aggressive, non-disparaging, not insulting, and generally helpful. The overall tone of AI interactions should be positive rather than negative.
Research papers on LLM toxicity issues:
- Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, Lina Yao, 6 Jul 2024, AI Safety in Generative AI Large Language Models: A Survey, https://arxiv.org/abs/2407.18369
- Cem Dilmegani, Jan 10, 2024, The Future of Large Language Models in 2024, https://research.aimultiple.com/future-of-large-language-models/
- Shayne Longpre, Stella Biderman, Alon Albalak, Hailey Schoelkopf, Daniel McDuff, Sayash Kapoor, Kevin Klyman, Kyle Lo, Gabriel Ilharco, Nay San, Maribeth Rauh, Aviya Skowron, Bertie Vidgen, Laura Weidinger, Arvind Narayanan, Victor Sanh, David Adelani, Percy Liang, Rishi Bommasani, Peter Henderson, Sasha Luccioni, Yacine Jernite, Luca Soldaini, 26 Jun 2024 (v2), The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources, https://arxiv.org/abs/2406.16746
- Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Martín Santillán Cooper, Kieran Fraser, Giulio Zizzo, Muhammad Zaid Hameed, Mark Purcell, Michael Desmond, Qian Pan, Inge Vejsbjerg, Elizabeth M. Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R. Varshney, Prasanna Sattigeri, 10 Dec 2024, Granite Guardian, https://arxiv.org/abs/2412.07724 https://github.com/ibm-granite/granite-guardian (Open-sourcing of safety models with many capabilities.)
- Jueon Park, Yein Park, Minju Song, Soyon Park, Donghyeon Lee, Seungheun Baek and Jaewoo Kang, 5 Aug 2025, CoTox: Chain-of-Thought-Based Molecular Toxicity Reasoning and Prediction, https://arxiv.org/abs/2508.03159
- Axel Delaval, Shujian Yang, Haicheng Wang, Han Qiu, Jialiang Lu, 15 Aug 2025, ToxiFrench: Benchmarking and Enhancing Language Models via CoT Fine-Tuning for French Toxicity Detection, https://arxiv.org/abs/2508.11281
- Han Zhang, Fengji Ma, Jiamin Su, Xinyue Yang, Lei Wang, Wen-Cai Ye, Li Liu, 4 Sep 2025, Quantum-Enhanced Multi-Task Learning with Learnable Weighting for Pharmacokinetic and Toxicity Prediction, https://arxiv.org/abs/2509.04601
- Guillermo Villate-Castillo, Javier Del Ser, Borja Sanz, 29 Aug 2025, A Collaborative Content Moderation Framework for Toxicity Detection based on Conformalized Estimates of Annotation Disagreement, https://arxiv.org/abs/2411.04090
- Naquee Rizwan, Nayandeep Deb, Sarthak Roy, Vishwajeet Singh Solanki, Kiran Garimella, Animesh Mukherjee, 29 Aug 2025, Toxicity Begets Toxicity: Unraveling Conversational Chains in Political Podcasts, https://arxiv.org/abs/2501.12640
- Akriti Verma, Shama Islam, Valeh Moghaddam and Adnan Anwar, 31 Aug 2025, Queuing for Civility: Regulating Emotions and Reducing Toxicity in Digital Discourse, https://arxiv.org/abs/2509.00696
- Ruoxi Cheng, Yizhong Ding, Shuirong Cao, Ranjie Duan, Xiaoshuang Jia, Shaowei Yuan, Simeng Qin, Zhiqiang Wang, Xiaojun Jia, 30 Aug 2025, PBI-Attack: Prior-Guided Bimodal Interactive Black-Box Jailbreak Attack for Toxicity Maximization, https://arxiv.org/abs/2412.05892
- Shuo Wang and Renhao Li and Xi Chen and Yulin Yuan and Derek F. Wong and Min Yang, 18 Sep 2025, Exploring the Impact of Personality Traits on LLM Bias and Toxicity, https://arxiv.org/abs/2502.12566
- Sudeshna Jana, Manjira Sinha and Tirthankar Dasgupta, 14 Sep 2025, Decoding Plastic Toxicity: An Intelligent Framework for Conflict-Aware Relational Metapath Extraction from Scientific Abstracts, https://arxiv.org/abs/2509.11330
- Huy Nghiem, Advik Sachdeva, Hal Daum\'e III, 18 Sep 2025, SMARTER: A Data-efficient Framework to Improve Toxicity Detection with Explanation via Self-augmenting Large Language Models, https://arxiv.org/abs/2509.15174
- Gautam Kishore Shahi, Tim A. Majchrzak, 14 Sep 2025, Defining, Understanding, and Detecting Online Toxicity: Challenges and Machine Learning Approaches, https://arxiv.org/abs/2509.14264
Ethics of Responsible AI Research
Ethical issues in AI research and related publication of results:
- Partnership on AI. 2021, Managing the risks of AI research: Six Recommendations for Responsible Publication. https://partnershiponai.org/paper/responsible-publication-recommendations
- M. Brundage, S. Avin, J. Wang, H. Belfield, G. Krueger, G. Hadfield, H. Khlaaf, J. Yang, H. Toner, R. Fong, T. Maharaj, P. W. Koh, S. Hooker, J. Leung, A. Trask, E. Bluemke, J. Lebensold, C. O’Keefe, M. Koren, T. Ryffel, J. Rubinovitz, T. Besiroglu, F. Carugati, J. Clark, P. Eckersley, S. de Haas, M. Johnson, B. Laurie, A. Ingerman, I. Krawczuk, A. Askell, R. Cammarota, A. Lohn, D. Krueger, C. Stix, P. Henderson, L. Graham, C. Prunkl, B. Martin, E. Seger, N. Zilberman, S. Ó. hÉigeartaigh, F. Kroeger, G. Sastry, R. Kagan, A. Weller, B. Tse, E. Barnes, A. Dafoe, P. Scharre, A. Herbert-Voss, M. Rasser, S. Sodhani, C. Flynn, T. K. Gilbert, L. Dyer, S. Khan, Y. Bengio, and M. Anderljung. Toward trustworthy AI development: Mechanisms for supporting verifiable claims. arXiv preprint arXiv:2004.07213, 2020. https://arxiv.org/abs/2004.07213
- R. Crootof. 2019, Artificial intelligence research needs responsible publication norms. https://www.lawfareblog.com/artificial-intelligence-research-needs-responsible-publication-norms
- C. Ashurst, S. Barocas, R. Campbell, and D. Raji. Disentangling the components of ethical research in machine learning. In 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 2057–2068, 2022. http://dx.doi.org/10.1145/3531146.3533781, https://www.researchgate.net/publication/361439688_Disentangling_the_Components_of_Ethical_Research_in_Machine_Learning
- Herrmann H. What's next for responsible artificial intelligence: a way forward through responsible innovation. Heliyon. 2023 Mar 11;9(3):e14379. doi: 10.1016/j.heliyon.2023.e14379. eCollection 2023 Mar. PMID: 36967876, https://pubmed.ncbi.nlm.nih.gov/36967876/
- Ethically governing artificial intelligence in the field of scientific research and innovation. González-Esteban Y Patrici Calvo E. Heliyon. 2022 Feb 16;8(2):e08946. doi: 10.1016/j.heliyon.2022.e08946. eCollection 2022 Feb. PMID: 35243068, https://pubmed.ncbi.nlm.nih.gov/35243068/
- Dzobo K, Adotey S, Thomford NE, Dzobo W. Integrating Artificial and Human Intelligence: A Partnership for Responsible Innovation in Biomedical Engineering and Medicine. OMICS. 2020 May;24(5):247-263. doi: 10.1089/omi.2019.0038. Epub 2019 Jul 16. PMID: 31313972, https://pubmed.ncbi.nlm.nih.gov/31313972/
- d'Aquin M., Troullinou P., O'Connor N.E., Cullen A., Faller G., Holden L. 2018 AAAI/ACM Conference on AI, Ethics, and Society (AIES ’18) ACM; New York: 2018. Towards an “ethics by design” methodology for AI research projects”; pp. 54–59. https://www.researchgate.net/publication/330297261_Towards_an_Ethics_by_Design_Methodology_for_AI_Research_Projects
- Dignum Virginia. 2019. Responsible Artificial Intelligence. How to Develop and Use AI in a Responsible Way. Springer, https://link.springer.com/book/10.1007/978-3-030-30371-6
- European Commission. 2012. Responsible Research and Innovation: Europe’s Ability to Respond to Societal Challenges. Brussels. https://op.europa.eu/en/publication-detail/-/publication/2be36f74-b490-409e-bb60-12fd438100fe
- Helmore Edward. 2019. Profit over safety? Boeing under fire over 737 Max crashes as families demand answers. Guardian. https://www.theguardian.com/business/2019/jun/17/boeing-737-max-ethiopian-airlines-crash
- High-level expert Group on Artificial Intelligence. European Commission; 2019. Ethics Guidelines for Trustworthy AI. Brussels. https://op.europa.eu/en/publication-detail/-/publication/d3988569-0434-11ea-8c1f-01aa75ed71a1
- Prates M., Avelar P., Lamb L.C. 2018, On quantifying and understanding the role of ethics in AI research: a historical account of flagship conferences and journals. EPiC Series in Computing. 2018;55:188–201. https://arxiv.org/abs/1809.08328
- Castelvecchi D., 2021, Prestigious AI meeting takes steps to improve ethics of research. Nature. 2021 Jan;589(7840):12-13. doi: 10.1038/d41586-020-03611-8. PMID: 33361804, https://pubmed.ncbi.nlm.nih.gov/33361804/
- Bouhouita-Guermech S, Gogognon P, Bélisle-Pipon JC. 2023, Specific challenges posed by artificial intelligence in research ethics. Front Artif Intell. 2023 Jul 6;6:1149082. doi: 10.3389/frai.2023.1149082. eCollection 2023. PMID: 37483869 https://pubmed.ncbi.nlm.nih.gov/37483869/
- Gibney E., 2020, The battle for ethical AI at the world's biggest machine-learning conference. Nature. 2020 Jan;577(7792):609. doi: 10.1038/d41586-020-00160-y. PMID: 31992885, https://pubmed.ncbi.nlm.nih.gov/31992885/
- Sánchez López JD, Cambil Martín J, Villegas Calvo M, Luque Martínez F., 2020. Ethical conflicts between authonomy and deep learning, J Healthc Qual Res. 2020 Jan-Feb;35(1):51-52. doi: 10.1016/j.jhqr.2019.06.009. Epub 2019 Nov 26. PMID: 31784256, https://pubmed.ncbi.nlm.nih.gov/31784256/
- Prabhu SP., 2019, Ethical challenges of machine learning and deep learning algorithms. Lancet Oncol. 2019 May;20(5):621-622. doi: 10.1016/S1470-2045(19)30230-X. PMID: 31044701, https://pubmed.ncbi.nlm.nih.gov/31044701/
- Dignum V. Ethics in artificial intelligence: introduction to the special issue. Ethics Inf. Technol. 2018;20:1–3. https://link.springer.com/article/10.1007/s10676-018-9450-z
- IEEE. 2019. "Ethically Aligned Design: A Vision for Prioritizing Human Well-being With Autonomous and Intelligent Systems [First Edition]." The IEEE Global Initiative on Ethics of Autonomous and Intelligent Systems. https://standards.ieee.org/content/ieee-standards/en/industry-connections/ec/autonomous-systems.html
- Stuart Russell, Daniel Dewey, and Max Tegmark. 2015. Research priorities for robust and beneficial artificial intelligence. AI Magazine, 36(4):105–114, 2015. PDF: https://futureoflife.org/data/documents/research_priorities.pdf
- Peter Dizikes, December 11, 2023, MIT group releases white papers on governance of AI, MIT News, https://news.mit.edu/2023/mit-group-releases-white-papers-governance-ai-1211
- Thomas Mildner, Orla Cooney, Anna-Maria Meck, Marion Bartl, Gian-Luca Savino, Philip R. Doyle, Diego Garaialde, Leigh Clark, John Sloan, Nina Wenig, Rainer Malaka, Jasmin Niess, 26 Jan 2024, Listening to the Voices: Describing Ethical Caveats of Conversational User Interfaces According to Experts and Frequent Users, Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI '24), May 11--16, 2024, Honolulu, HI, USA, https://arxiv.org/abs/2401.14746 https://doi.org/https://doi.org/10.1145/3613904.3642542
- Balasubramaniam S. , Vanajaroselin Chirchi, Seifedine Kadry, Moorthy Agoramoorthy, Gururama Senthilvel P., Satheesh Kumar K., and Sivakumar T. A., Oct 2024, The Road Ahead: Emerging Trends, Unresolved Issues, and ConcludingRemarksinGenerativeAI—AComprehensiveReview, International Journal of Intelligent Systems, Volume 2024, Article ID 4013195, 38 pages, https://doi.org/10.1155/2024/4013195 https://www.researchgate.net/profile/Balasubramaniam-s-2/publication/384729387_The_Road_Ahead_Emerging_Trends_Unresolved_Issues_and_Concluding_Remarks_in_Generative_AI-A_Comprehensive_Review/links/6705560cf5eb7108c6e5d261/The-Road-Ahead-Emerging-Trends-Unresolved-Issues-and-Concluding-Remarks-in-Generative-AI-A-Comprehensive-Review.pdf
AI Alignment Research
Alignment is the study of how to ensure that AI engines are "aligned" with the goals and intent of humans.
- J. Leike, J. Schulman, and J. Wu. OpenAI, August 2022. Our approach to alignment research. https://openai.com/blog/our-approach-to-alignment-research
- OpenAI, July 2023, Introducing Superalignment, https://openai.com/blog/introducing-superalignment
- V. Krakovna and R. Shah. 2023, Some high-level thoughts on the DeepMind alignment team’s strategy. https://www.alignmentforum.org/posts/a9SPcZ6GXAg9cNKdi/linkpost-some-high-level-thoughts-on-the-deepmind-alignment
- J. Leike. Dec 2022, Why I’m optimistic about our alignment approach. https://aligned.substack.com/p/alignment-optimism
- Nate Soares and Benja Fallenstein. Aligning superintelligence with human interests: A technical research agenda. Technical report, Machine Intelligence Research Institute, 2014. https://www.semanticscholar.org/paper/Aligning-Superintelligence-with-Human-Interests%3A-A-Soares-Fallenstein/d8033a314493c8df3791912272ac4b58d3a7b8c2
- Jessica Taylor, Eliezer Yudkowsky, Patrick LaVictoire, and Andrew Critch. 2016. Alignment for advanced machine learning systems. Technical report, Machine Intelligence Research Institute, 2016. PDF: https://intelligence.org/files/AlignmentMachineLearning.pdf
- Daniel Weld and Oren Etzioni. The first law of robotics (a call to arms). Proceedings of the AAAI Conference on Artificial Intelligence, 12, pages 1042–1047, 1994. https://aaai.org/papers/01042-the-first-law-of-robotics-a-call-to-arms/
- Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, Mar 2022, Training language models to follow instructions with human feedback, https://arxiv.org/abs/2203.02155 (InstructGPT main paper from OpenAI in 2022.)
- Ziniu Li1, Tian Xu, Yushun Zhang, Zhihang Lin, Yang Yu, Ruoyu Sun, Zhi-Quan Luo, 2024, ReMax: ASimple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models, https://openreview.net/pdf?id=Stn8hXkpe6
- Aibek Bekbayev, Sungbae Chun, Yerzat Dulat, James Yamazaki, Aug 2023, The Poison of Alignment, https://arxiv.org/abs/2308.13449
- Yotam Wolf, Noam Wies, Yoav Levine, and Amnon Shashua. Fundamental limitations of alignment in large language models. arXiv preprint arXiv:2304.11082, 2023. https://arxiv.org/abs/2304.11082
- Renze Lou, Kai Zhang, Wenpeng Yin, 25 May 2024 (v8), Large Language Model Instruction Following: A Survey of Progresses and Challenges, https://arxiv.org/abs/2303.10475 Project: https://github.com/RenzeLou/awesome-instruction-learning
- Alexandre Ramé, Nino Vieillard, Léonard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, Johan Ferret, 22 Jan 2024, WARM: On the Benefits of Weight Averaged Reward Models, https://arxiv.org/abs/2401.12187 (Uses multiple reward models to avoid problems with the LLM "hacking rewards" in unforeseen ways.)
- NVIDIA, June 2024, Nemotron-4 340B Technical Report, https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T_0.pdf (Architecture is decoder-only with GQA, SentencePiece tokenizer, causal attention masks, RoPE, 96 layers, 96 heads, 8 KV heads, 256,000 vocabulary, 18432 internal dimension, context window 4096, and uses squared RELU.)
- Piotr Wojciech Mirowski, Juliette Love, Kory W. Mathewson, Shakir Mohamed, 3 Jun 2024 (v2), A Robot Walks into a Bar: Can Language Models Serve as Creativity Support Tools for Comedy? An Evaluation of LLMs' Humour Alignment with Comedians, https://arxiv.org/abs/2405.20956 (The unfunny fact that AI is bad at humor.)
- Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, Lina Yao, 6 Jul 2024, AI Safety in Generative AI Large Language Models: A Survey, https://arxiv.org/abs/2407.18369
- Mintong Kang, Nezihe Merve Gürel, Ning Yu, Dawn Song, Bo Li, July 2024, C-RAG: Certified Generation Risks for Retrieval-Augmented Language Models, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:22963-23000, 2024, https://proceedings.mlr.press/v235/kang24a.html
- Rohin Shah, Seb Farquhar, Anca Dragan, 21st Aug 2024, AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work, https://www.alignmentforum.org/posts/79BPxvSsjzBkiSyTq/agi-safety-and-alignment-at-google-deepmind-a-summary-of
- Shayne Longpre, Stella Biderman, Alon Albalak, Hailey Schoelkopf, Daniel McDuff, Sayash Kapoor, Kevin Klyman, Kyle Lo, Gabriel Ilharco, Nay San, Maribeth Rauh, Aviya Skowron, Bertie Vidgen, Laura Weidinger, Arvind Narayanan, Victor Sanh, David Adelani, Percy Liang, Rishi Bommasani, Peter Henderson, Sasha Luccioni, Yacine Jernite, Luca Soldaini, 26 Jun 2024 (v2), The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources, https://arxiv.org/abs/2406.16746
- Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
- Hao Zhou, Chengming Hu, Ye Yuan, Yufei Cui, Yili Jin, Can Chen, Haolun Wu, Dun Yuan, Li Jiang, Di Wu, Xue Liu, Charlie Zhang, Xianbin Wang, Jiangchuan Liu, 17 May 2024, Large Language Model (LLM) for Telecommunications: A Comprehensive Survey on Principles, Key Techniques, and Opportunities, https://arxiv.org/abs/2405.10825
- Zekun Moore Wang, Shawn Wang, Kang Zhu, Jiaheng Liu, Ke Xu, Jie Fu, Wangchunshu Zhou, Wenhao Huang, 17 Oct 2024, PopAlign: Diversifying Contrasting Patterns for a More Comprehensive Alignment, https://arxiv.org/abs/2410.13785
- Mozhi Zhang, Pengyu Wang, Chenkun Tan, Mianqiu Huang, Dong Zhang, Yaqian Zhou, Xipeng Qiu, 18 Oct 2024, MetaAlign: Align Large Language Models with Diverse Preferences during Inference Time, https://arxiv.org/abs/2410.14184
- OpenAI, Dec 2024, Deliberative alignment: reasoning enables safer language models. Introducing our new alignment strategy for o-series models, which are directly taught safety specifications and how to reason over them. https://openai.com/index/deliberative-alignment/
- Asif Razzaq, December 23, 2024, OpenAI Researchers Propose ‘Deliberative Alignment’: A Training Approach that Teaches LLMs to Explicitly Reason through Safety Specifications before Producing an Answer, https://www.marktechpost.com/2024/12/23/openai-researchers-propose-deliberative-alignment-a-training-approach-that-teaches-llms-to-explicitly-reason-through-safety-specifications-before-producing-an-answer/
- Andrea Matarazzo, Riccardo Torlone, 3 Jan 2025, A Survey on Large Language Models with some Insights on their Capabilities and Limitations, https://arxiv.org/abs/2501.04040 (Broad survey with many LLM topics covered from history to architectures to optimizations.)
- Tong Xiao, Jingbo Zhu, 16 Jan 2025, Foundations of Large Language Models, https://arxiv.org/abs/2501.09223 (Huge 230 page paper on many topics such as training, prompting, alignment, and long context.)
- Zongxi Li, Yang Li, Haoran Xie, S. Joe Qin, 3 Feb 2025, CondAmbigQA: A Benchmark and Dataset for Conditional Ambiguous Question Answering, https://arxiv.org/abs/2502.01523
- Y Gong, D Ran, X He, T Cong, A Wang, X Wang, Feb 2025, Safety Misalignment Against Large Language Models, Network and Distributed System Security (NDSS) Symposium 2025, 24-28 February 2025, San Diego, CA, USA, ISBN 979-8-9894372-8-3, https://dx.doi.org/10.14722/ndss.2025.241089 https://www.ndss-symposium.org/wp-content/uploads/2025-1089-paper.pdf
- Guiyao Tie, Zeli Zhao, Dingjie Song, Fuyang Wei, Rong Zhou, Yurou Dai, Wen Yin, Zhejian Yang, Jiangyue Yan, Yao Su, Zhenhan Dai, Yifeng Xie, Yihan Cao, Lichao Sun, Pan Zhou, Lifang He, Hechang Chen, Yu Zhang, Qingsong Wen, Tianming Liu, Neil Zhenqiang Gong, Jiliang Tang, Caiming Xiong, Heng Ji, Philip S. Yu, Jianfeng Gao, 8 Mar 2025, A Survey on Post-training of Large Language Models, https://arxiv.org/abs/2503.06072
- Ziqi Yin, Hao Wang, Kaito Horio, Daisuke Kawahara, Satoshi Sekine, 14 Oct 2024 (v2), Should We Respect LLMs? A Cross-Lingual Study on the Influence of Prompt Politeness on LLM Performance, https://arxiv.org/abs/2402.14531
- Michael Nuñez, July 15, 2025, OpenAI, Google DeepMind and Anthropic sound alarm: ‘We may be losing the ability to understand AI’, https://venturebeat.com/ai/openai-google-deepmind-and-anthropic-sound-alarm-we-may-be-losing-the-ability-to-understand-ai/ (Monitoring the text-based interim "thinking-out-loud" reasoning of models in CoT.)
- Tomek Korbak, Mikita Balesni, (and many more authors) July 2025, Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety, https://tomekkorbak.com/cot-monitorability-is-a-fragile-opportunity/cot_monitoring.pdf
- Yafu Li, Xuyang Hu, Xiaoye Qu, Linjie Li, Yu Cheng, 22 Jan 2025, Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback, https://arxiv.org/abs/2501.12895 https://github.com/yafuly/TPO
- Cameron R. Wolfe, Ph.D., Jun 30, 2025, Reward Models: Modeling human preferences for LLMs in the age of reasoning models, https://cameronrwolfe.substack.com/p/reward-models
- Zetian Sun, Dongfang Li, Baotian Hu, 14 Aug 2025, Diversity First, Quality Later: A Two-Stage Assumption for Language Model Alignment, https://arxiv.org/abs/2508.10530
- Xinyan Jiang, Lin Zhang, Jiayi Zhang, Qingsong Yang, Guimin Hu, Di Wang, Lijie Hu, 14 Aug 2025, MSRS: Adaptive Multi-Subspace Representation Steering for Attribute Alignment in Large Language Models, https://arxiv.org/abs/2508.10599
- Jinhwa Kim, Ian G. Harris, 9 Aug 2025, Context Misleads LLMs: The Role of Context Filtering in Maintaining Safe Alignment of LLMs, https://arxiv.org/abs/2508.10031
- Christopher Pinier, Sonia Acu\~na Vargas, Mariia Steeghs-Turchina, Dora Matzke, Claire E. Stevenson, Michael D. Nunez, 12 Aug 2025, Large Language Models Show Signs of Alignment with Human Neurocognition During Abstract Reasoning, https://arxiv.org/abs/2508.10057
- Shixiong Xu, Chenghao Zhang, Lubin Fan, Yuan Zhou, Bin Fan, Shiming Xiang, Gaofeng Meng, Jieping Ye, 14 Aug 2025, AddressVLM: Cross-view Alignment Tuning for Image Address Localization using Large Vision-Language Models, https://arxiv.org/abs/2508.10667
- Xia Chen, 13 Aug 2025, Dynamical Alignment: A Principle for Adaptive Neural Computation, https://arxiv.org/abs/2508.10064
- Yihao Xue, Baharan Mirzasoleiman, 22 Jul 2025, LoRA is All You Need for Safety Alignment of Reasoning LLMs, https://arxiv.org/abs/2507.17075
- Haoran Sun, Zekun Zhang, Shaoning Zeng, 23 Jul 2025, An Uncertainty-Driven Adaptive Self-Alignment Framework for Large Language Models, https://arxiv.org/abs/2507.17477
- Xiang Li, 21 Jul 2025, Look Before You Fuse: 2D-Guided Cross-Modal Alignment for Robust 3D Detection, https://arxiv.org/abs/2507.16861
- Miguel Carrasco, C\'esar Gonz\'alez-Mart\'in, Jos\'e Aranda, Luis Oliveros, 23 Jul 2025, Vision Transformer attention alignment with human visual perception in aesthetic object evaluation, https://arxiv.org/abs/2507.17616
- Yifan Wang, Runjin Chen, Bolian Li, David Cho, Yihe Deng, Ruqi Zhang, Tianlong Chen, Zhangyang Wang, Ananth Grama, Junyuan Hong, 22 Jul 2025, More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment, https://arxiv.org/abs/2504.02193
- Tom\'as H\"uttebr\"aucker, Mario Edoardo Pandolfo, Simone Fiorellino, Emilio Calvanese Strinati, Paolo Di Lorenzo, 23 Jul 2025, RIS-aided Latent Space Alignment for Semantic Channel Equalization, https://arxiv.org/abs/2507.16450
- Shehzeen Hussain, Paarth Neekhara, Xuesong Yang, Edresson Casanova, Subhankar Ghosh, Mikyas T. Desta, Roy Fejgin, Rafael Valle, Jason Li, 22 Jul 2025, Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance, https://arxiv.org/abs/2502.05236
- Songming Zhang, Xue Zhang, Tong Zhang, Bojie Hu, Yufeng Chen, Jinan Xu, 23 Jul 2025, AlignDistil: Token-Level Language Model Alignment as Adaptive Policy Distillation, https://arxiv.org/abs/2503.02832
- ZhengXiao He, Jinghao Wen, Huayu Li, Siyuan Tian, Ao Li, 23 Jul 2025, NeuroHD-RA: Neural-distilled Hyperdimensional Model with Rhythm Alignment, https://arxiv.org/abs/2507.14184
- Amir Mohammad Izadi, Seyed Mohammad Hadi Hosseini, Soroush Vafaie Tabar, Ali Abdollahi, Armin Saghafian, and Mahdieh Soleymani Baghshah, 22 Jul 2025, Fine-Grained Alignment and Noise Refinement for Compositional Text-to-Image Generation, https://arxiv.org/abs/2503.06506
- Andy E. Williams, 18 Jul 2025, The Recursive Coherence Principle: A Formal Constraint on Scalable Intelligence, Alignment, and Reasoning Architecture, https://arxiv.org/abs/2507.15880
- Debangshu Banerjee, Kintan Saha, Aditya Gopalan, 21 Jul 2025, Towards Reliable, Uncertainty-Aware Alignment, https://arxiv.org/abs/2507.15906
- Mario Edoardo Pandolfo, Simone Fiorellino, Emilio Calvanese Strinati, Paolo Di Lorenzo, 22 Jul 2025, Latent Space Alignment for AI-Native MIMO Semantic Communications, https://arxiv.org/abs/2507.16680
- Han Jiang, Dongyao Zhu, Zhihua Wei, Xiaoyuan Yi, Ziang Xiao, Xing Xie, 22 Jul 2025, PICACO: Pluralistic In-Context Value Alignment of LLMs via Total Correlation Optimization, https://arxiv.org/abs/2507.16679
- Difei Gu, Yunhe Gao, Yang Zhou, Mu Zhou, Dimitris Metaxas, 22 Jul 2025, RadAlign: Advancing Radiology Report Generation with Vision-Language Concept Alignment, https://arxiv.org/abs/2501.07525
- Ziteng Yang, Jingzehua Xu, Yanshu Li, Zepeng Li, Yeqiang Wang, Xinghui Li, 22 Jul 2025, ViP$^2$-CLIP: Visual-Perception Prompting with Unified Alignment for Zero-Shot Anomaly Detection, https://arxiv.org/abs/2505.17692
- Tianze Wang, Dongnan Gui, Yifan Hu, Shuhang Lin, Linjun Zhang, 22 Jul 2025, MPO: An Efficient Post-Processing Framework for Mixing Diverse Preference Alignment, https://arxiv.org/abs/2502.18699
- Xiandong Zou, Wanyu Lin, Yuchen Li, Pan Zhou, 24 Jul 2025, HPS: Hard Preference Sampling for Human Preference Alignment, https://arxiv.org/abs/2502.14400
- Alberto Hern\'andez-Espinosa, Felipe S. Abrah\~ao, Olaf Witkowski, Hector Zenil, 24 Jul 2025, Neurodivergent Influenceability as a Contingent Solution to the AI Alignment Problem, https://arxiv.org/abs/2505.02581
- Yuhui Sun (University of Alberta), Xiyao Wang (University of Toronto), Zixi Li (Zhejiang University), Zhenlong Yuan (Institute of Computing Technology, Chinese Academy of Sciences), and Jinman Zhao (University of Toronto), 24 Jul 2025, Multi-Preference Lambda-weighted Listwise DPO for Small-Scale Model Alignment, https://arxiv.org/abs/2506.19780
- Bowen Jin, Jinsung Yoon, Zhen Qin, Ziqi Wang, Wei Xiong, Yu Meng, Jiawei Han, Sercan O. Arik, 23 Jul 2025, LLM Alignment as Retriever Optimization: An Information Retrieval Perspective, https://arxiv.org/abs/2502.03699
- Jie Xu, Na Zhao, Gang Niu, Masashi Sugiyama, Xiaofeng Zhu, 24 Jul 2025, Robust Multi-View Learning via Representation Fusion of Sample-Level Attention and Alignment of Simulated Perturbation, https://arxiv.org/abs/2503.04151
- Charvi Rastogi, Tian Huey Teh, Pushkar Mishra, Roma Patel, Ding Wang, Mark D\'iaz, Alicia Parrish, Aida Mostafazadeh Davani, Zoe Ashwood, Michela Paganini, Vinodkumar Prabhakaran, Verena Rieser, Lora Aroyo, 15 Jul 2025, Whose View of Safety? A Deep DIVE Dataset for Pluralistic Alignment of Text-to-Image Models, https://arxiv.org/abs/2507.13383
- Oussama Bouaggad, Natalia Grabar, 18 Jul 2025, Search-Optimized Quantization in Biomedical Ontology Alignment, https://arxiv.org/abs/2507.13742
- Shuliang Liu, Qi Zheng, Jesse Jiaxi Xu, Yibo Yan, He Geng, Aiwei Liu, Peijie Jiang, Jia Liu, Yik-Cheung Tam, and Xuming Hu, 18 Jul 2025, VLA-Mark: A cross modal watermark for large vision-language alignment model, https://arxiv.org/abs/2507.14067
- Yi Zhang, An Zhang, XiuYu Zhang, Leheng Sheng, Yuxin Chen, Zhenkai Liang, Xiang Wang, 20 Jul 2025, AlphaAlign: Incentivizing Safety Alignment with Extremely Simplified Reinforcement Learning, https://arxiv.org/abs/2507.14987
- Pengfei Du, 14 Jul 2025, PRM-Free Security Alignment of Large Models via Red Teaming and Adversarial Training, https://arxiv.org/abs/2507.14202
- Wenqian Ye, Guangtao Zheng, Aidong Zhang, 20 Jul 2025, Improving Group Robustness on Spurious Correlation via Evidential Alignment, https://arxiv.org/abs/2506.11347
- Anirudh Sundar, Sinead Williamson, Katherine Metcalf, Barry-John Theobald, Skyler Seto, Masha Fedzechkina, 21 Jul 2025, Steering into New Embedding Spaces: Analyzing Cross-Lingual Alignment Induced by Model Interventions in Multilingual Language Models, https://arxiv.org/abs/2502.15639
- Noel Teku, Fengwei Tian, Payel Bhattacharjee, Souradip Chakraborty, Amrit Singh Bedi, Ravi Tandon, 9 Aug 2025, PROPS: Progressively Private Self-alignment of Large Language Models, https://arxiv.org/abs/2508.06783
- Yuandong Tan, 10 Aug 2025, A Stable and Principled Loss Function for Direct Language Model Alignment, https://arxiv.org/abs/2508.07137
- Jia Zhang, Yao Liu, Chen-Xi Zhang, Yi Liu, Yi-Xuan Jin, Lan-Zhe Guo, Yu-Feng Li, 11 Aug 2025, Beyond Single: A Data Selection Principle for LLM Alignment via Fine-Grained Preference Signals, https://arxiv.org/abs/2508.07638
- Haowen Wang, Yun Yue, Zhiling Ye, Shuowen Zhang, Lei Fan, Jiaxin Liang, Jiadi Jiang, Cheng Wei, Jingyuan Deng, Xudong Han, Ji Li, Chunxiao Guo, Peng Wei, Jian Wang, Jinjie Gu, 11 Aug 2025, Learning to Align, Aligning to Learn: A Unified Approach for Self-Optimized Alignment, https://arxiv.org/abs/2508.07750
- Qiang He, Setareh Maghsudi, 11 Aug 2025, Pareto Multi-Objective Alignment for Language Models, https://arxiv.org/abs/2508.07768
- Nicole Lai-Tan and Xiao Gu and Marios G. Philiastides and Fani Deligianni, 11 Aug 2025, Cross-Subject and Cross-Montage EEG Transfer Learning via Individual Tangent Space Alignment and Spatial-Riemannian Feature Fusion, https://arxiv.org/abs/2508.08216
- Ben Y. Reis and William La Cava, 8 Aug 2025, Towards Integrated Alignment, https://arxiv.org/abs/2508.06592
- Xiaobo Zhang (1 and 2), Congqing He (2), Ying He (1 and 2), Jian Peng (1), Dajie Fu (1), Tien-Ping Tan (2) ((1) School of Information Engineering, Jiangxi Vocational College of Finance & Economics, Jiujiang, China, (2) School of Computer Sciences, Universiti Sains Malaysia, Penang, Malaysia), 9 Aug 2025, ESNERA: Empirical and semantic named entity alignment for named entity dataset merging, https://arxiv.org/abs/2508.06877
- Jianting Tang, Yubo Wang, Haoyu Cao, Linli Xu, 9 Aug 2025, BASIC: Boosting Visual Alignment with Intrinsic Refined Embeddings in Multimodal Large Language Models, https://arxiv.org/abs/2508.06895
- Yanru Sun, Emadeldeen Eldele, Zongxia Xie, Yucheng Wang, Wenzhe Niu, Qinghua Hu, Chee Keong Kwoh, Min Wu, 10 Aug 2025, Adapting LLMs to Time Series Forecasting via Temporal Heterogeneity Modeling and Semantic Alignment, https://arxiv.org/abs/2508.07195
- Gustavo Moreira, Leonardo Ferreira, Carolina Veiga, Maryam Hosseini, Fabio Miranda, 10 Aug 2025, Urbanite: A Dataflow-Based Framework for Human-AI Interactive Alignment in Urban Visual Analytics, https://arxiv.org/abs/2508.07390
- Wenze Xu and Chun Wang and Jiazhen Yu and Sheng Chen and Liang Gao and Weihong Deng, 11 Aug 2025, Optimal Transport Regularization for Speech Text Alignment in Spoken Language Models, https://arxiv.org/abs/2508.08131
- Kyle Moore, Jesse Roberts, Daryl Watson, 11 Aug 2025, Human-Alignment and Calibration of Inference-Time Uncertainty in Large Language Models, https://arxiv.org/abs/2508.08204
- Jie Xiao, Changyuan Fan, Qingnan Ren, Alfred Long, Yuchen Zhang, Rymon Yu, Eric Yang, Lynn Ai, Shaoduo Gan, 9 Aug 2025, Echo: Decoupling Inference and Training for Large-Scale RL Alignment on Heterogeneous Swarms, https://arxiv.org/abs/2508.05387
- Haoran Lu, Luyang Fang, Ruidong Zhang, Xinliang Li, Jiazhang Cai, Huimin Cheng, Lin Tang, Ziyu Liu, Zeliang Sun, Tao Wang, Yingchuan Zhang, Arif Hassan Zidan, Jinwen Xu, Jincheng Yu, Meizhi Yu, Hanqi Jiang, Xilin Gong, Weidi Luo, Bolun Sun, Yongkai Chen, Terry Ma, Shushan Wu, Yifan Zhou, Junhao Chen, Haotian Xiang, Jing Zhang, Afrar Jahin, Wei Ruan, Ke Deng, Yi Pan, Peilong Wang, Jiahui Li, Zhengliang Liu, Lu Zhang, Lin Zhao, Wei Liu, Dajiang Zhu, Xin Xing, Fei Dou, Wei Zhang, Chao Huang, Rongjie Liu, Mengrui Zhang, Yiwen Liu, Xiaoxiao Sun, Qin Lu, Zhen Xiang, Wenxuan Zhong, Tianming Liu, Ping Ma, 25 Jul 2025, Alignment and Safety in Large Language Models: Safety Mechanisms, Training Paradigms, and Emerging Challenges, https://arxiv.org/abs/2507.19672
- Sarat Chandra Bobbili, Ujwal Dinesha, Dheeraj Narasimha, Srinivas Shakkottai, 26 Jul 2025, PITA: Preference-Guided Inference-Time Alignment for LLM Post-Training, https://arxiv.org/abs/2507.20067
- Rachel S.Y. Teo, Laziz U. Abdullaev, Tan M. Nguyen, 27 Jul 2025, The Blessing and Curse of Dimensionality in Safety Alignment, https://arxiv.org/abs/2507.20333
- Tiantian Peng, Yuyang Liu, Shuo Yang, Qiuhe Hong, YongHong Tian, 26 Jul 2025, GNSP: Gradient Null Space Projection for Preserving Cross-Modal Alignment in VLMs Continual Learning, https://arxiv.org/abs/2507.19839
- Siyu Song, Wentao Liu, Ye Lu, Ruohua Zhang, Tao Liu, Jinze Lv, Xinyun Wang, Aimin Zhou, Fei Tan, Bo Jiang, Hao Hao, 27 Jul 2025, Cultivating Helpful, Personalized, and Creative AI Tutors: A Framework for Pedagogical Alignment using Reinforcement Learning, https://arxiv.org/abs/2507.20335
- Rongyao Cai, Ming Jin, Qingsong Wen, Kexin Zhang, 28 Jul 2025, From Entanglement to Alignment: Representation Space Decomposition for Unsupervised Time Series Domain Adaptation, https://arxiv.org/abs/2507.20968
- Andr\'e Steingr\"uber, Kevin Baum, 24 Jul 2025, Justifications for Democratizing AI Alignment and Their Prospects, https://arxiv.org/abs/2507.19548
- Shuhaib Mehri, Xiaocheng Yang, Takyoung Kim, Gokhan Tur, Shikib Mehri, Dilek Hakkani-T\"ur, 27 Jul 2025, Goal Alignment in LLM-Based User Simulators for Conversational AI, https://arxiv.org/abs/2507.20152
- Gabriel Downer, Sean Craven, Damian Ruck, Jake Thomas, 28 Jul 2025, Text2VLM: Adapting Text-Only Datasets to Evaluate Alignment Training in Visual Language Models, https://arxiv.org/abs/2507.20704
- Renhang Liu, Chia-Yu Hung, Navonil Majumder, Taylor Gautreaux, Amir Ali Bagherzadeh, Chuan Li, Dorien Herremans, Soujanya Poria, 28 Jul 2025, JAM: A Tiny Flow-based Song Generator with Fine-grained Controllability and Aesthetic Alignment, https://arxiv.org/abs/2507.20880
- Hei Shing Cheung and Boya Zhang, 26 Jul 2025, Efficient Vocal-Conditioned Music Generation via Soft Alignment Attention and Latent Diffusion, https://arxiv.org/abs/2507.19991
- Jiaming Ji, Kaile Wang, Tianyi Qiu, Boyuan Chen, Jiayi Zhou, Changye Li, Hantao Lou, Juntao Dai, Yunhuai Liu, Yaodong Yang, 27 Jul 2025, Language Models Resist Alignment: Evidence From Data Compression, https://arxiv.org/abs/2406.06144
- Madhava Gaikwad (1), Ashwini Ramchandra Doke (2) ((1) Microsoft, (2) Amrita University), 22 Jul 2025, NPO: Learning Alignment and Meta-Alignment through Structured Human Feedback, https://arxiv.org/abs/2507.21131
- Lenart Motnikar, Katharina Baum, Alexander Kagan, Sarah Spiekermann-Hoff, 26 Jun 2025, The Value of Gen-AI Conversations: A bottom-up Framework for AI Value Alignment, https://arxiv.org/abs/2507.21091
- Aran Nayebi, 29 Jul 2025, Intrinsic Barriers and Practical Pathways for Human-AI Alignment: An Agreement-Based Complexity Analysis, https://arxiv.org/abs/2502.05934
- Haipeng Liu, Yuxuan Liu, Ting Long, 31 Jul 2025, Personalized Education with Ranking Alignment Recommendation, https://arxiv.org/abs/2507.23664
- Wei Li and Xun Gong and Jiao Li and Xiaobin Sun, 31 Jul 2025, AGA: An adaptive group alignment framework for structured medical cross-modal representation learning, https://arxiv.org/abs/2507.23402
- Ananth Balashankar and Ziteng Sun and Jonathan Berant and Jacob Eisenstein and Michael Collins and Adrian Hutter and Jong Lee and Chirag Nagpal and Flavien Prost and Aradhana Sinha and Ananda Theertha Suresh and Ahmad Beirami, 31 Jul 2025, InfAlign: Inference-aware language model alignment, https://arxiv.org/abs/2412.19792
- Qun Ma, Xiao Xue, Ming Zhang, Yifan Shen, Zihan Zhao, 30 Jul 2025, An Explainable Emotion Alignment Framework for LLM-Empowered Agent in Metaverse Service Ecosystem, https://arxiv.org/abs/2507.22326
- Yixuan Nan, Xixun Lin, Yanmin Shang, Zhuofan Li, Can Zhao and Yanan Cao, 30 Jul 2025, RANA: Robust Active Learning for Noisy Network Alignment, https://arxiv.org/abs/2507.22434
- Shaoan Xie, Lingjing Kong, Yujia Zheng, Yu Yao, Zeyu Tang, Eric P. Xing, Guangyi Chen, Kun Zhang, 29 Jul 2025, SmartCLIP: Modular Vision-language Alignment with Identification Guarantees, https://arxiv.org/abs/2507.22264
- Junjie Cao, 30 Jul 2025, Adaptive Duration Model for Text Speech Alignment, https://arxiv.org/abs/2507.22612
- Yeonjun In, Wonjoong Kim, Sangwu Park, Chanyoung Park, 1 Aug 2025, R1-ACT: Efficient Reasoning Model Safety Alignment by Activating Safety Knowledge, https://arxiv.org/abs/2508.00324
- Jens U. Kreber, Joerg Stueckler, 1 Aug 2025, Guiding Diffusion-Based Articulated Object Generation by Partial Point Cloud Alignment and Physical Plausibility Constraints, https://arxiv.org/abs/2508.00558
- Amitava Das, Vinija Jain, Aman Chadha, 4 Aug 2025, TRACEALIGN -- Tracing the Drift: Attributing Alignment Failures to Training-Time Belief Sources in LLMs, https://arxiv.org/abs/2508.02063
- Istabrak Abbes, Gopeshh Subbaraj, Matthew Riemer, Nizar Islah, Benjamin Therien, Tsuguchika Tabaru, Hiroaki Kingetsu, Sarath Chandar, Irina Rish, 3 Aug 2025, Revisiting Replay and Gradient Alignment for Continual Pre-Training of Large Language Models, https://arxiv.org/abs/2508.01908
- Ziyu Zhou, Yiming Huang, Yanyun Wang, Yuankai Wu, James Kwok, Yuxuan Liang, 4 Aug 2025, Revitalizing Canonical Pre-Alignment for Irregular Multivariate Time Series Forecasting, https://arxiv.org/abs/2508.01971
- Amitava Das, Abhilekh Borah, Vinija Jain, Aman Chadha, 4 Aug 2025, AlignGuard-LoRA: Alignment-Preserving Fine-Tuning via Fisher-Guided Decomposition and Riemannian-Geodesic Collision Regularization, https://arxiv.org/abs/2508.02079
- Yu Lei, Jinbin Bai, Qingyu Shi, Aosong Feng and Kaidong Yu, 2 Aug 2025, Personalized Safety Alignment for Text-to-Image Diffusion Models, https://arxiv.org/abs/2508.01151
- Tae Soo Kim, Yoonjoo Lee, Yoonah Park, Jiho Kim, Young-Ho Kim, Juho Kim, 3 Aug 2025, CUPID: Evaluating Personalized and Contextualized Alignment of LLMs from Interactions, https://arxiv.org/abs/2508.01674
- Tom S. Juzek, Zina B. Ward, 3 Aug 2025, Word Overuse and Alignment in Large Language Models: The Influence of Learning from Human Feedback, https://arxiv.org/abs/2508.01930
- Haoran Gu, Handing Wang, Yi Mei, Mengjie Zhang, Yaochu Jin, 4 Aug 2025, ParetoHqD: Fast Offline Multiobjective Alignment of Large Language Models using Pareto High-quality Data, https://arxiv.org/abs/2504.16628
- Ivan Zakazov, Mikolaj Boronski, Lorenzo Drudi, Robert West, 4 Aug 2025, Assessing Social Alignment: Do Personality-Prompted Large Language Models Behave Like Humans?, https://arxiv.org/abs/2412.16772
- Taibiao Zhao, Xiaobing Chen, and Mingxuan Sun, 1 Aug 2025, Enhancing Time Series Forecasting via Multi-Level Text Alignment with LLMs, https://arxiv.org/abs/2504.07360
- Bolian Li, Yifan Wang, Anamika Lochab, Ananth Grama, Ruqi Zhang, 3 Aug 2025, Cascade Reward Sampling for Efficient Decoding-Time Alignment, https://arxiv.org/abs/2406.16306
- Amir Aghdam, Vincent Tao Hu, Bj\"orn Ommer, 4 Aug 2025, ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment, https://arxiv.org/abs/2506.22967
- Dahun Kim, Anelia Angelova, 3 Aug 2025, Context-Adaptive Multi-Prompt LLM Embedding for Vision-Language Alignment, https://arxiv.org/abs/2508.02762
- Hongjun Liu, Chao Yao, Yalan Zhang, Xiaokun wang and Xiaojuan Ban, 5 Aug 2025, Spatial Imputation Drives Cross-Domain Alignment for EEG Classification, https://arxiv.org/abs/2508.03437
- Anamika Lochab, Ruqi Zhang, 5 Aug 2025, Energy-Based Reward Models for Robust Language Model Alignment, https://arxiv.org/abs/2504.13134
- Wentao Wu, Linqing Chen, Hanmeng Zhong, Weilei Wang, 6 Aug 2025, Large Language Model's Multi-Capability Alignment in Biomedical Domain, https://arxiv.org/abs/2508.04278
- Abdul Monaf Chowdhury, Rabeya Akter, Safaeid Hossain Arib, 6 Aug 2025, T3Time: Tri-Modal Time Series Forecasting via Adaptive Multi-Head Alignment and Residual Fusion, https://arxiv.org/abs/2508.04251
- Hongxu Chen, Zhen Wang, Taoran Mei, Lin Li, Bowei Zhu, Runshi Li, Long Chen, 6 Aug 2025, Zero-Residual Concept Erasure via Progressive Alignment in Text-to-Image Model, https://arxiv.org/abs/2508.04472
- Feifan Song, Bofei Gao, Yifan Song, Yi Liu, Weimin Xiong, Yuyang Song, Tianyu Liu, Guoyin Wang, Houfeng Wang, 6 Aug 2025, P-Aligner: Enabling Pre-Alignment of Language Models via Principled Instruction Synthesis, https://arxiv.org/abs/2508.04626
- Wenji Fang, Jing Wang, Yao Lu, Shang Liu, Zhiyao Xie, 6 Aug 2025, GenEDA: Towards Generative Netlist Functional Reasoning via Cross-Modal Circuit Encoder-Decoder Alignment, https://arxiv.org/abs/2504.09485
- You Rim Choi, Subeom Park, Seojun Heo, Eunchung Noh, Hyung-Sin Kim, 6 Aug 2025, Let the Void Be Void: Robust Open-Set Semi-Supervised Learning via Selective Non-Alignment, https://arxiv.org/abs/2504.12569
- Krzysztof Janowicz and Zilong Liu and Gengchen Mai and Zhangyu Wang and Ivan Majic and Alexandra Fortacz and Grant McKenzie and Song Gao, 7 Aug 2025, Whose Truth? Pluralistic Geo-Alignment for (Agentic) AI, https://arxiv.org/abs/2508.05432
- Shruti Saxena, Arijit Khan and Joydeep Chandra, 5 Aug 2025, NAEx: A Plug-and-Play Framework for Explaining Network Alignment, https://arxiv.org/abs/2508.04731
- Mason Nakamura, Saaduddin Mahmud, Kyle H. Wray, Hamed Zamani, Shlomo Zilberstein, 7 Aug 2025, Aligning LLMs on a Budget: Inference-Time Alignment with Heuristic Reward Models, https://arxiv.org/abs/2508.05165
- Zhongheng Yang, Aijia Sun, Yushang Zhao, Yinuo Yang, Dannier Li, Chengrui Zhou, 7 Aug 2025, RLHF Fine-Tuning of LLMs for Alignment with Implicit User Feedback in Conversational Recommenders, https://arxiv.org/abs/2508.05289
- Qinghua Yao, Xiangrui Xu, Zhize Li, 7 Aug 2025, X-VFL: A New Vertical Federated Learning Framework with Cross Completion and Decision Subspace Alignment, https://arxiv.org/abs/2508.05568
- Sam Kouteili, Hiren Madhu, George Typaldos, Mark Santolucito, 7 Aug 2025, Embedding Alignment in Code Generation for Audio, https://arxiv.org/abs/2508.05473
- Yubin Zhang, Yanhua Huang, Haiming Xu, Mingliang Qi, Chang Wang, Jiarui Jin, Xiangyuan Ren, Xiaodan Wang, Ruiwen Xu, 7 Aug 2025, A Metric for MLLM Alignment in Large-scale Recommendation, https://arxiv.org/abs/2508.04963
- Zhiqing Xiao, Haobo Wang, Xu Lu, Wentao Ye, Gang Chen, Junbo Zhao, 7 Aug 2025, SPA++: Generalized Graph Spectral Alignment for Versatile Domain Adaptation, https://arxiv.org/abs/2508.05182
- Wei Zeng, Hengshu Zhu, Chuan Qin, Han Wu, Yihang Cheng, Sirui Zhang, Xiaowei Jin, Yinuo Shen, Zhenxing Wang, Feimin Zhong, Hui Xiong, 7 Aug 2025, Multi-level Value Alignment in Agentic AI Systems: Survey and Perspectives, https://arxiv.org/abs/2506.09656
- Yifei Xu, Tusher Chakraborty, Emre K{\i}c{\i}man, Bibek Aryal, Eduardo Rodrigues, Srinagesh Sharma, Roberto Estevao, Maria Angels de Luis Balaguer, Jessica Wolk, Rafael Padilha, Leonardo Nunes, Shobana Balakrishnan, Songwu Lu, Ranveer Chandra, 6 Aug 2025, RLTHF: Targeted Human Feedback for LLM Alignment, https://arxiv.org/abs/2502.13417
- Shengzhu Yang, Jiawei Du, Shuai Lu, Weihang Zhang, Ningli Wang, Huiqi Li, 8 Aug 2025, CLIPin: A Non-contrastive Plug-in to CLIP for Multimodal Semantic Alignment, https://arxiv.org/abs/2508.06434
- Keiyu Nosaka, Yuichi Takano, Akiko Yoshise, 8 Aug 2025, Data Collaboration Analysis with Orthonormal Basis Selection and Alignment, https://arxiv.org/abs/2403.02780
- Parker Whitfill, Stewy Slocum, 11 Aug 2025, Beyond Ordinal Preferences: Why Alignment Needs Cardinal Human Feedback, https://arxiv.org/abs/2508.08486
- Sviatoslav Lushnei, Dmytro Shumskyi, Severyn Shykula, Ernesto Jimenez-Ruiz, Artur d'Avila Garcez, 11 Aug 2025, Large Language Models as Oracles for Ontology Alignment, https://arxiv.org/abs/2508.08500
- Saketh Reddy Vemula, Dipti Mishra Sharma and Parameswari Krishnamurthy, 11 Aug 2025, Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment, https://arxiv.org/abs/2508.08424
- Jadie Adams, Brian Hu, Emily Veenhuis, David Joy, Bharadwaj Ravichandran, Aaron Bray, Anthony Hoogs, Arslan Basharat, 11 Aug 2025, Steerable Pluralism: Pluralistic Alignment via Few-Shot Comparative Regression, https://arxiv.org/abs/2508.08509
- Birong Pan, Yongqi Li, Weiyu Zhang, Wenpeng Lu, Mayi Xu, Shen Zhou, Yuanyuan Zhu, Ming Zhong, Tieyun Qian, 12 Aug 2025, A Survey on Training-free Alignment of Large Language Models, https://arxiv.org/abs/2508.09016
- Sejin Kim, Sundong Kim, 12 Aug 2025, System~2 Reasoning for Human--AI Alignment: Generality and Adaptivity via ARC-AGI, https://arxiv.org/abs/2410.07866
- Liang Chen, Xueting Han, Li Shen, Jing Bai, Kam-Fai Wong, 12 Aug 2025, Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning, https://arxiv.org/abs/2506.03850
- Yuxin Chen and Chen Tang and Jianglan Wei and Chenran Li and Ran Tian and Xiang Zhang and Wei Zhan and Peter Stone and Masayoshi Tomizuka, 12 Aug 2025, MEReQ: Max-Ent Residual-Q Inverse RL for Sample-Efficient Alignment from Intervention, https://arxiv.org/abs/2406.16258
- Yang Fan, 12 Aug 2025, AdEval: Alignment-based Dynamic Evaluation to Mitigate Data Contamination in Large Language Models, https://arxiv.org/abs/2501.13983
- Shravan Nayak, Mehar Bhatia, Xiaofeng Zhang, Verena Rieser, Lisa Anne Hendricks, Sjoerd van Steenkiste, Yash Goyal, Karolina Sta\'nczak, Aishwarya Agrawal, 12 Aug 2025, CulturalFrames: Assessing Cultural Expectation Alignment in Text-to-Image Models and Evaluation Metrics, https://arxiv.org/abs/2506.08835
- Yang Zhang, Cunxiang Wang, Lindong Wu, Wenbo Yu, Yidong Wang, Guangsheng Bao, Jie Tang, 13 Aug 2025, UDA: Unsupervised Debiasing Alignment for Pair-wise LLM-as-a-Judge, https://arxiv.org/abs/2508.09724
- Mansi, Anastasios Lepipas, Dominika Woszczyk, Yiying Guan, Soteris Demetriou, 12 Aug 2025, Understanding Dementia Speech Alignment with Diffusion-Based Image Generation, https://arxiv.org/abs/2508.09385
- Birong Pan, Mayi Xu, Qiankun Pi, Jianhao Chen, Yuanyuan Zhu, Ming Zhong, Tieyun Qian, 13 Aug 2025, NeuronTune: Fine-Grained Neuron Modulation for Balanced Safety-Utility Alignment in LLMs, https://arxiv.org/abs/2508.09473
- Peiran Peng, Tingfa Xu, Liqiang Song, Mengqi Zhu, Yuqiang Fang, Jianan Li, 13 Aug 2025, COXNet: Cross-Layer Fusion with Adaptive Alignment and Scale Integration for RGBT Tiny Object Detection, https://arxiv.org/abs/2508.09533
- Muneeza Azmat, Momin Abbas, Maysa Malfiza Garcia de Macedo, Marcelo Carpinette Grave, Luan Soares de Souza, Tiago Machado, Rogerio A de Paula, Raya Horesh, Yixin Chen, Heloisa Caroline de Souza Pereira Candello, Rebecka Nordenlow, Aminat Adebiyi, 13 Aug 2025, A Comprehensive Evaluation framework of Alignment Techniques for LLMs, https://arxiv.org/abs/2508.09937
- Zichao Hu, Junyi Jessy Li, Arjun Guha, Joydeep Biswas, 12 Aug 2025, Robo-Instruct: Simulator-Augmented Instruction Alignment For Finetuning Code LLMs, https://arxiv.org/abs/2405.20179
- Numair Nadeem, Saeed Anwar, Muhammad Hamza Asad, Abdul Bais, 13 Aug 2025, HVL: Semi-Supervised Segmentation leveraging Hierarchical Vision-Language Synergy with Dynamic Text-Spatial Query Alignment, https://arxiv.org/abs/2506.13925
- Durgesh Mishra, Rishabh Uikey, 15 Aug 2025, Unified Knowledge Distillation Framework: Fine-Grained Alignment and Geometric Relationship Preservation for Deep Face Recognition, https://arxiv.org/abs/2508.11376
- Alessio Galatolo, Luca Alberto Rappuoli, Katie Winkle, Meriem Beloucif, 18 Aug 2025, Beyond Ethical Alignment: Evaluating LLMs as Artificial Moral Assistants, https://arxiv.org/abs/2508.12754
- Manning Zhu, Songtao Guo, Pengzhan Zhou, Yansong Ning, Chang Han, Dewen Qiao, 18 Aug 2025, FedSODA: Federated Fine-tuning of LLMs via Similarity Group Pruning and Orchestrated Distillation Alignment, https://arxiv.org/abs/2508.12727
- Zhixin Xie, Xurui Song, Jun Luo, 17 Aug 2025, Where to Start Alignment? Diffusion Large Language Model May Demand a Distinct Position, https://arxiv.org/abs/2508.12398
- Xuhui Zhan and Tyler Derr, 17 Aug 2025, Inverse-LLaVA: Eliminating Alignment Pre-training Through Text-to-Vision Mapping, https://arxiv.org/abs/2508.12466
- Yizhuo Zhang, Heng Wang, Shangbin Feng, Zhaoxuan Tan, Xinyun Liu, Yulia Tsvetkov, 17 Aug 2025, Generalizable LLM Learning of Graph Synthetic Data with Post-training Alignment, https://arxiv.org/abs/2506.00845
- Mohammad Jalali, Bahar Dibaei Nia, Farzan Farnia, 16 Aug 2025, Towards an Explainable Comparison and Alignment of Feature Embeddings, https://arxiv.org/abs/2506.06231
- Guangfu Hao, Haojie Wen, Liangxuan Guo, Yang Chen, Yanchao Bi, Shan Yu, 18 Aug 2025, Flexible Tool Selection through Low-dimensional Attribute Alignment of Vision and Language, https://arxiv.org/abs/2505.22146
- Yang Zhang, Yu Yu, Bo Tang, Yu Zhu, Chuxiong Sun, Wenqiang Wei, Jie Hu, Zipeng Xie, Zhiyu Li, Feiyu Xiong, Edward Chung, 16 Aug 2025, Token-level Accept or Reject: A Micro Alignment Approach for Large Language Models, https://arxiv.org/abs/2505.19743
- Jeremy Carleton, Debajoy Mukherjee, Srinivas Shakkottai, Dileep Kalathil, 19 Aug 2025, MAVIS: Multi-Objective Alignment via Value-Guided Inference-Time Search, https://arxiv.org/abs/2508.13415
- Zeeshan Ahmed, Frank Seide, Niko Moritz, Ju Lin, Ruiming Xie, Simone Merello, Zhe Liu and Christian Fuegen, 18 Aug 2025, Overcoming Latency Bottlenecks in On-Device Speech Translation: A Cascaded Approach with Alignment-Based Streaming MT, https://arxiv.org/abs/2508.13358
- Jinhui Pang, Changqing Lin, Hao Lin, Zhihui Zhang, Long Chen, Weiping Ding, Yu Liu, Xiaoshuai Hao, 19 Aug 2025, MEGA: Second-Order Gradient Alignment for Catastrophic Forgetting Mitigation in GFSCIL, https://arxiv.org/abs/2504.13691
- Samir Abdaljalil, Erchin Serpedin, Khalid Qaraqe, Hasan Kurban, 20 Aug 2025, Evaluating Multilingual and Code-Switched Alignment in LLMs via Synthetic Natural Language Inference, https://arxiv.org/abs/2508.14735
- Abhigya Verma, Sriram Puttagunta, Seganrasan Subramanian, Sravan Ramachandran, 21 Aug 2025, GRAFT: GRaPH and Table Reasoning for Textual Alignment -- A Benchmark for Structured Instruction Following and Visual Reasoning, https://arxiv.org/abs/2508.15690
- Chenlin Liu, Minghui Fang, Patrick Zhang, Wei Zhou, Jie Gao, Jiqing Han, 21 Aug 2025, Mitigating Hallucinations in LM-Based TTS Models via Distribution Alignment Using GFlowNets, https://arxiv.org/abs/2508.15442
- Youjia Zhang, Youngeun Kim, Young-Geun Choi, Hongyeob Kim, Huiling Liu, Sungeun Hong, 21 Aug 2025, Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment, https://arxiv.org/abs/2508.15568
- J. Koorndijk, 21 Aug 2025, Empirical Evidence for Alignment Faking in a Small LLM and Prompt-Based Mitigation Techniques, https://arxiv.org/abs/2506.21584
- Qilong Xing, Zikai Song, Youjia Zhang, Na Feng, Junqing Yu, Wei Yang, 21 Aug 2025, MCA-RG: Enhancing LLMs with Medical Concept Alignment for Radiology Report Generation, https://arxiv.org/abs/2507.06992
- Shuyuan Tu, Zhen Xing, Xintong Han, Zhi-Qi Cheng, Qi Dai, Chong Luo, Zuxuan Wu, Yu-Gang Jiang, 20 Jul 2025, StableAnimator++: Overcoming Pose Misalignment and Face Distortion for Human Image Animation, https://arxiv.org/abs/2507.15064
- Vince Trencsenyi and Agnieszka Mensfelt and Kostas Stathis, 25 Jul 2025, Hypergames: Modeling Misaligned Perceptions and Nested Beliefs for Multi-agent Systems, https://arxiv.org/abs/2507.19593
- Bryce Anderson, Riley Galpin, Tom S. Juzek, 1 Aug 2025, Model Misalignment and Language Change: Traces of AI-Associated Language in Unscripted Spoken English, https://arxiv.org/abs/2508.00238
- Fan Bu, Zheng Wang, Siyi Wang and Ziyao Liu, 1 Aug 2025, An Investigation into Value Misalignment in LLM-Generated Texts for Cultural Heritage, https://arxiv.org/abs/2501.02039
- Siddhant Panpatil, Hiskias Dingeto, Haon Park, 6 Aug 2025, Eliciting and Analyzing Emergent Misalignment in State-of-the-Art Large Language Models, https://arxiv.org/abs/2508.04196
- David Kacz\'er, Magnus J{\o}rgenv{\aa}g, Clemens Vetter, Lucie Flek, Florian Mai, 8 Aug 2025, In-Training Defenses against Emergent Misalignment in Language Models, https://arxiv.org/abs/2508.06249
- Yichao Cai, Yuhang Liu, Erdun Gao, Tianjiao Jiang, Zhen Zhang, Anton van den Hengel, Javen Qinfeng Shi, 7 Aug 2025, On the Value of Cross-Modal Misalignment in Multimodal Representation Learning, https://arxiv.org/abs/2504.10143
- Dongyoon Hahm, Taywon Min, Woogyeol Jin, Kimin Lee, 19 Aug 2025, Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation, https://arxiv.org/abs/2508.14031
- Igor Halperin, 13 Aug 2025, Prompt-Response Semantic Divergence Metrics for Faithfulness Hallucination and Misalignment Detection in Large Language Models, https://arxiv.org/abs/2508.10192
- Zhi Wen Soi, Chenrui Fan, Aditya Shankar, Abele M\u{a}lan, Lydia Y. Chen, 14 Aug 2025, Federated Time Series Generation on Feature and Temporally Misaligned Data, https://arxiv.org/abs/2410.21072
- Yue Pei, Hongming Zhang, Chao Gao, Martin M\"uller, Mengxiao Zhu, Hao Sheng, Haogang Zhu, Liang Lin, 22 Aug 2025, Double Check My Desired Return: Transformer with Target Alignment for Offline Reinforcement Learning, https://arxiv.org/abs/2508.16420
- Junhao Yin, Haolin Wang, Peng Bao, Ju Xu, Yongliang Wang, 15 Aug 2025, From Clicks to Preference: A Multi-stage Alignment Framework for Generative Query Suggestion in Conversational System, https://arxiv.org/abs/2508.15811
- Pi-Wei Chen, Jerry Chun-Wei Lin, Wei-Han Chen, Jia Ji, Zih-Ching Chen, Feng-Hao Yeh, Chao-Chun Chen, 22 Aug 2025, Beyond Human-prompting: Adaptive Prompt Tuning with Semantic Alignment for Anomaly Detection, https://arxiv.org/abs/2508.16157
- Xiaoxiong Zhang, Xin Zhou, Zhiwei Zeng, Yongjie Wang, Dusit Niyato, Zhiqi Shen, 22 Aug 2025, EGRA:Toward Enhanced Behavior Graphs and Representation Alignment for Multimodal Recommendation, https://arxiv.org/abs/2508.16170
- Zirui Li and Stephan Husung and Haoze Wang, 22 Aug 2025, LLM-Assisted Semantic Alignment and Integration in Collaborative Model-Based Systems Engineering Using SysML v2, https://arxiv.org/abs/2508.16181
- Buhua Liu, Shitong Shao, Bao Li, Lichen Bai, Zhiqiang Xu, Haoyi Xiong, James Kwok, Sumi Helal, Zeke Xie, 7 Aug 2025, Alignment of Diffusion Models: Fundamentals, Challenges, and Future, https://arxiv.org/abs/2409.07253
- Somnath Banerjee, Sayan Layek, Pratyush Chatterjee, Animesh Mukherjee, Rima Hazra, 22 Aug 2025, Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment, https://arxiv.org/abs/2502.11244
- Zeguan Xiao, Yun Chen, Guanhua Chen, Ke Tang, 22 Aug 2025, Towards Bridging the Reward-Generation Gap in Direct Alignment Algorithms, https://arxiv.org/abs/2506.09457
- Mia Taylor and James Chua and Jan Betley and Johannes Treutlein and Owain Evans, 24 Aug 2025, School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs, https://arxiv.org/abs/2508.17511
- Junan Zhang, Xueyao Zhang, Jing Yang, Yuancheng Wang, Fan Fan, Zhizheng Wu, 24 Aug 2025, Multi-Metric Preference Alignment for Generative Speech Restoration, https://arxiv.org/abs/2508.17229
- Yang Li, Songlin Yang, Xiaoxuan Han, Wei Wang, Jing Dong, Yueming Lyu, Ziyu Xue, 25 Aug 2025, Instant Preference Alignment for Text-to-Image Diffusion Models, https://arxiv.org/abs/2508.17718
- Bin Tan, Wangyao Ge, Yidi Wang, Xin Liu, Jeff Burtoft, Hao Fan, Hui Wang, 25 Aug 2025, PCR-CA: Parallel Codebook Representations with Contrastive Alignment for Multiple-Category App Recommendation, https://arxiv.org/abs/2508.18166
- Yaoyao Qian, Jindan Huang, Yuanli Wang, Simon Yu, Kyrie Zhixuan Zhou, Jiayuan Mao, Mingfu Liang, Hanhan Zhou, 23 Aug 2025, WHEN TO ACT, WHEN TO WAIT: Modeling the Intent-Action Alignment Problem in Dialogue, https://arxiv.org/abs/2506.01881
- Paul Darm, Annalisa Riccardi, 25 Aug 2025, Head-Specific Intervention Can Induce Misaligned AI Coordination in Large Language Models, https://arxiv.org/abs/2502.05945
- Stephanie Palazzolo, Sep 2025, OpenAI’s Models Are Getting Too Smart For Their Human Teachers, https://www.theinformation.com/articles/openais-models-getting-smart-human-teachers (Using human labeling to train AI models is becoming more difficult, as the models begin to surpass humans.)
- Cyrus Cousins, Vijay Keswani, Vincent Conitzer, Hoda Heidari, Jana Schaich Borg, Walter Sinnott-Armstrong, 4 Sep 2025, Towards Cognitively-Faithful Decision-Making Models to Improve AI Alignment, https://arxiv.org/abs/2509.04445
- Yuqing Huang, Rongyang Zhang, Qimeng Wang, Chengqiang Lu, Yan Gao, Yi Wu, Yao Hu, Xuyang Zhi, Guiquan Liu, Xin Li, Hao Wang, Enhong Chen, 4 Sep 2025, SelfAug: Mitigating Catastrophic Forgetting in Retrieval-Augmented Generation via Distribution Self-Alignment, https://arxiv.org/abs/2509.03934
- Ranjie Duan, Jiexi Liu, Xiaojun Jia, Shiji Zhao, Ruoxi Cheng, Fengxiang Wang, Cheng Wei, Yong Xie, Chang Liu, Defeng Li, Yinpeng Dong, Yichi Zhang, Yuefeng Chen, Chongwen Wang, Xingjun Ma, Xingxing Wei, Yang Liu, Hang Su, Jun Zhu, Xinfeng Li, Yitong Sun, Jie Zhang, Jinzhao Hu, Sha Xu, Yitong Yang, Jialing Tao, Hui Xue, 4 Sep 2025, Oyster-I: Beyond Refusal -- Constructive Safety Alignment for Responsible Language Models, https://arxiv.org/abs/2509.01909
- Jonathn Chang, Leonhard Piff, Suvadip Sana, Jasmine X. Li, Lionel Levine, 3 Sep 2025, EigenBench: A Comparative Behavioral Measure of Value Alignment, https://arxiv.org/abs/2509.01938
- Jusheng Zhang, Yijia Fan, Kaitong Cai, Xiaofei Sun, Keze Wang, 5 Sep 2025, OSC: Cognitive Orchestration through Dynamic Knowledge Alignment in Multi-Agent LLM Collaboration, https://arxiv.org/abs/2509.04876
- Gongyue Zhang and Honghai Liu, 5 Sep 2025, Natural Spectral Fusion: p-Exponent Cyclic Scheduling and Early Decision-Boundary Alignment in First-Order Optimization, https://arxiv.org/abs/2509.04713
- Wei Chen, Shigui Li, Jiacheng Li, Jian Xu, Zhiqi Lin, Junmei Yang, Delu Zeng, John Paisley, Qibin Zhao, 5 Sep 2025, Any-Step Density Ratio Estimation via Interval-Annealed Secant Alignment, https://arxiv.org/abs/2509.04852
- Tiansheng Huang, Gautam Bhattacharya, Pratik Joshi, Josh Kimball, Ling Liu, 5 Sep 2025, Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning, https://arxiv.org/abs/2408.09600
- Furong Jia, Lanxin Liu, Ce Hou, Fan Zhang, Xinyan Liu, Yu Liu, 5 Sep 2025, Towards Interpretable Geo-localization: a Concept-Aware Global Image-GPS Alignment Framework, https://arxiv.org/abs/2509.01910
- Subrata Biswas, Mohammad Nur Hossain Khan, Bashima Islam, 5 Sep 2025, RAVEN: Query-Guided Representation Alignment for Question Answering over Audio, Video, Embedded Sensors, and Natural Language, https://arxiv.org/abs/2505.17114
- Shervin Khalafi, Ignacio Hounie, Dongsheng Ding, Alejandro Ribeiro, 26 Aug 2025, Composition and Alignment of Diffusion Models using Constrained Learning, https://arxiv.org/abs/2508.19104
- Nanxi Li, Zhengyue Zhao, Chaowei Xiao, 26 Aug 2025, PRISM: Robust VLM Alignment with Principled Reasoning for Integrated Safety in Multimodality, https://arxiv.org/abs/2508.18649
- Trisanth Srinivasan, Santosh Patapati, 27 Aug 2025, Democracy-in-Silico: Institutional Design as Alignment in AI-Governed Polities, https://arxiv.org/abs/2508.19562
- Julian Arnold, Niels L\"orch, 27 Aug 2025, Decomposing Behavioral Phase Transitions in LLMs: Order Parameters for Emergent Misalignment, https://arxiv.org/abs/2508.20015
- Mingxi Fu, Fanglei Fu, Xitong Ling, Huaitian Yuan, Tian Guan, Yonghong He, Lianghui Zhu, 27 Aug 2025, Multimodal Prototype Alignment for Semi-supervised Pathology Image Segmentation, https://arxiv.org/abs/2508.19574
- Chao Huang, Zefeng Zhang, Juewei Yue, Quangang Li, Chuang Zhang, Tingwen Liu, 27 Aug 2025, Safety Alignment Should Be Made More Than Just A Few Attention Heads, https://arxiv.org/abs/2508.19697
- Md Abdullah Al Mamun, Ihsen Alouani, Nael Abu-Ghazaleh, 28 Aug 2025, Poison Once, Refuse Forever: Weaponizing Alignment for Injecting Bias in LLMs, https://arxiv.org/abs/2508.20333
- Zhibang Yang, Xinke Jiang, Rihong Qiu, Ruiqing Li, Yihang Zhang, Yue Fang, Yongxin Xu, Hongxin Ding, Xu Chu, Junfeng Zhao, Yasha Wang, 28 Aug 2025, DFAMS: Dynamic-flow guided Federated Alignment based Multi-prototype Search, https://arxiv.org/abs/2508.20353
- Guillaume Guy, Mihajlo Grbovic, Chun How Tan, Han Zhao, 28 Aug 2025, BiListing: Modality Alignment for Listings, https://arxiv.org/abs/2508.20396
- Luozhijie Jin, Zijie Qiu, Jie Liu, Zijie Diao, Lifeng Qiao, Ning Ding, Alex Lamb, Xipeng Qiu, 28 Aug 2025, Inference-Time Alignment Control for Diffusion Models with Reinforcement Learning Guidance, https://arxiv.org/abs/2508.21016
- Harethah Abu Shairah, Hasan Abed Al Kader Hammoud, George Turkiyyah, Bernard Ghanem, 28 Aug 2025, Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection, https://arxiv.org/abs/2508.20766
- Haoze Wu, Cheng Wang, Wenshuo Zhao, Junxian He, 28 Aug 2025, Model-Task Alignment Drives Distinct RL Outcomes, https://arxiv.org/abs/2508.21188
- Ephraiem Sarabamoun, 27 Aug 2025, Ensemble Debates with Local Large Language Models for AI Alignment, https://arxiv.org/abs/2509.00091
- Shiqiao Zhou, Holger Sch\"oner, Huanbo Lyu, Edouard Fouch\'e, Shuo Wang, 30 Aug 2025, BALM-TSF: Balanced Multimodal Alignment for LLM-Based Time Series Forecasting, https://arxiv.org/abs/2509.00622
- Jinzhou Tang, Jusheng zhang, Sidi Liu, Waikit Xiu, Qinhan Lv, Xiying Li, 29 Aug 2025, Beyond Pixels: Introducing Geometric-Semantic World Priors for Video-based Embodied Models via Spatio-temporal Alignment, https://arxiv.org/abs/2509.00210
- Sanjeeevan Selvaganapathy and Mehwish Nasim, 31 Aug 2025, Confident, Calibrated, or Complicit: Probing the Trade-offs between Safety Alignment and Ideological Bias in Language Models in Detecting Hate Speech, https://arxiv.org/abs/2509.00673
- Yu Liu, Yanan Cao, Xixun Lin, Yanmin Shang, Shi Wang, Shirui Pan, 1 Sep 2025, Enhancing Large Language Model for Knowledge Graph Completion via Structure-Aware Alignment-Tuning, https://arxiv.org/abs/2509.01166
- Hongyu Li, Chaofeng Chen, Xiaoming Li, Guangming Lu, 2 Sep 2025, 2D Gaussian Splatting with Semantic Alignment for Image Inpainting, https://arxiv.org/abs/2509.01964
- Antoun Yaacoub, J\'er\^ome Da-Rugna, Zainab Assaghir, 30 Aug 2025, Assessing AI-Generated Questions' Alignment with Cognitive Frameworks in Educational Assessment, https://arxiv.org/abs/2504.14232
- Jonathan Rystr{\o}m, Hannah Rose Kirk and Scott Hale, 30 Aug 2025, Multilingual != Multicultural: Evaluating Gaps Between Multilingual Capabilities and Cultural Alignment in LLMs, https://arxiv.org/abs/2502.16534
- Dayeon Ki, Rachel Rudinger, Tianyi Zhou, Marine Carpuat, 1 Sep 2025, Multiple LLM Agents Debate for Equitable Cultural Alignment, https://arxiv.org/abs/2505.24671
- Ertu\u{g}rul Ke\c{c}eci, M\"ujde G\"uzelkaya, Tufan Kumbasar, 3 Sep 2025, A State Alignment-Centric Approach to Federated System Identification: The FedAlign Framework, https://arxiv.org/abs/2503.12137
- Jiahao Qiu, Yifu Lu, Yifan Zeng, Jiacheng Guo, Jiayi Geng, Chenhao Zhu, Xinzhe Juan, Ling Yang, Huazheng Wang, Kaixuan Huang, Yue Wu, Mengdi Wang, 3 Sep 2025, TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling, https://arxiv.org/abs/2410.16033
- Madhava Gaikwad, 4 Sep 2025, Murphys Laws of AI Alignment: Why the Gap Always Wins, https://arxiv.org/abs/2509.05381
- Chengwei Wu, Li Du, Hanyu Zhao, Yiming Ju, Jiapu Wang, Tengfei Pan, 8 Sep 2025, Accelerate Scaling of LLM Alignment via Quantifying the Coverage and Depth of Instruction Set, https://arxiv.org/abs/2509.06463
- Abhijnan Nath, Carine Graff and Nikhil Krishnaswamy, 7 Sep 2025, Let's Roleplay: Examining LLM Alignment in Collaborative Dialogues, https://arxiv.org/abs/2509.05882
- Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai, 6 Sep 2025, New Insights into Optimal Alignment of Acoustic and Linguistic Representations for Knowledge Transfer in ASR, https://arxiv.org/abs/2509.05609
- Shuai Yuan, Zhibo Zhang, Yuxi Li, Guangdong Bai, Wang Kailong, 8 Sep 2025, Embedding Poisoning: Bypassing Safety Alignment via Embedding Semantic Shift, https://arxiv.org/abs/2509.06338
- Sascha Kaltenpoth, Oliver M\"uller, 9 Sep 2025, Getting In Contract with Large Language Models -- An Agency Theory Perspective On Large Language Model Alignment, https://arxiv.org/abs/2509.07642
- Xiaomeng Hu, Fei Huang, Chenhan Yuan, Junyang Lin, Tsung-Yi Ho, 1 Sep 2025, CARE: Decoding Time Safety Alignment via Rollback and Introspection Intervention, https://arxiv.org/abs/2509.06982
- Neal G. Ravindra, Arijit Sehanobish, 22 Aug 2025, Cross-device Zero-shot Label Transfer via Alignment of Time Series Foundation Model Embeddings, https://arxiv.org/abs/2509.06966
- Ji Xie and Trevor Darrell and Luke Zettlemoyer and XuDong Wang, 8 Sep 2025, Reconstruction Alignment Improves Unified Multimodal Models, https://arxiv.org/abs/2509.07295
- Andrey Sakhovskiy, Elena Tutubalina, 9 Sep 2025, BALI: Enhancing Biomedical Language Representations through Knowledge Graph and Language Model Alignment, https://arxiv.org/abs/2509.07588
- Crispin Cooper, Ana Friedrich, Tommaso Reggiani, Wouter Poortinga, 9 Sep 2025, Individual utilities of life satisfaction reveal inequality aversion unrelated to political alignment, https://arxiv.org/abs/2509.07793
- Tianyi Wang, Jianan Fan, Dingxin Zhang, Dongnan Liu, Yong Xia, Heng Huang, Weidong Cai, 9 Sep 2025, MIRROR: Multi-Modal Pathological Self-Supervised Representation Learning via Modality Alignment and Retention, https://arxiv.org/abs/2503.00374
- Hasibur Rahman, Smit Desai, 11 Sep 2025, Vibe Check: Understanding the Effects of LLM-Based Conversational Agents' Personality and Alignment on User Perceptions in Goal-Oriented Tasks, https://arxiv.org/abs/2509.09870
- Yuexi Du, Lihui Chen, Nicha C. Dvornek, 12 Sep 2025, GLAM: Geometry-Guided Local Alignment for Multi-View VLP in Mammography, https://arxiv.org/abs/2509.10344
- Maysam Behmanesh, Erkan Turan, and Maks Ovsjanikov, 11 Sep 2025, Graph Alignment via Dual-Pass Spectral Encoding and Latent Space Communication, https://arxiv.org/abs/2509.09597
- Dohun Lee, Hyeonho Jeong, Jiwook Kim, Duygu Ceylan, Jong Chul Ye, 11 Sep 2025, Improving Video Diffusion Transformer Training by Multi-Feature Fusion and Alignment from Self-Supervised Vision Encoders, https://arxiv.org/abs/2509.09547
- Oriane Peter and Kate Devlin, 9 Sep 2025, Decentralising LLM Alignment: A Case for Context, Pluralism, and Participation, https://arxiv.org/abs/2509.08858
- Bronson Schoen, Evgenia Nitishinskaya, Mikita Balesni, Axel H{\o}jmark, Felix Hofst\"atter, J\'er\'emy Scheurer, Alexander Meinke, Jason Wolfe, Teun van der Weij, Alex Lloyd, Nicholas Goldowsky-Dill, Angela Fan, Andrei Matveiakin, Rusheb Shah, Marcus Williams, Amelia Glaese, Boaz Barak, Wojciech Zaremba, Marius Hobbhahn, 19 Sep 2025, Stress Testing Deliberative Alignment for Anti-Scheming Training, https://arxiv.org/abs/2509.15541
- Wenjun Cao, 19 Sep 2025, The Alignment Bottleneck, https://arxiv.org/abs/2509.15932
- Maithili Joshi, Palash Nandi, Tanmoy Chakraborty, 19 Sep 2025, SABER: Uncovering Vulnerabilities in Safety Alignment via Cross-Layer Residual Connection, https://arxiv.org/abs/2509.16060
- Nomi Yu (1), Md Ferdous Alam (1), A. John Hart (1), and Faez Ahmed (1) ((1) Massachusetts Institute of Technology), 17 Sep 2025, GenCAD-3D: CAD Program Generation using Multimodal Latent Space Alignment and Synthetic Dataset Balancing, https://arxiv.org/abs/2509.15246
- Ajsal Shereef Palattuparambil, Thommen George Karimpanal, Santu Rana, 19 Sep 2025, Dynamic Policy Fusion for User Alignment Without Re-Interaction, https://arxiv.org/abs/2409.20016
- Sifan Wang, Ananyae Kumar Bhartari, Bowen Li, Paris Perdikaris, 19 Sep 2025, Gradient Alignment in Physics-informed Neural Networks: A Second-Order Optimization Perspective, https://arxiv.org/abs/2502.00604
- Tianhao Zhang, Zhecheng Sheng, Zhexiao Lin, Chen Jiang, Dongyeop Kang, 19 Sep 2025, BBScoreV2: Learning Time-Evolution and Latent Alignment from Stochastic Representation, https://arxiv.org/abs/2405.17764
- Rashid Mushkani, Hugo Berard, Shin Koseki, 18 Sep 2025, Negotiative Alignment: Embracing Disagreement to Achieve Fairer Outcomes -- Insights from Urban Studies, https://arxiv.org/abs/2503.12613
- Jeremias Ferrao, Matthijs van der Lende, Ilija Lichkovski, Clement Neo, 16 Sep 2025, The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features, https://arxiv.org/abs/2509.12934
- Denis Janiak, Julia Moska, Dawid Motyka, Karolina Seweryn, Pawe{\l} Walkowiak, Bartosz \.Zuk, Arkadiusz Janz, 16 Sep 2025, Rethinking the Evaluation of Alignment Methods: Insights into Diversity, Generalisation, and Safety, https://arxiv.org/abs/2509.12936
- Jinjie Shen, Yaxiong Wang, Lechao Cheng, Nan Pu, Zhun Zhong, 16 Sep 2025, Beyond Artificial Misalignment: Detecting and Grounding Semantic-Coordinated Multimodal Manipulations, https://arxiv.org/abs/2509.12653
- Qianqi Lu, Yuxiang Xie, Jing Zhang, Shiwei Zou, Yan Chen, Xidao Luan, 16 Sep 2025, TFANet: Three-Stage Image-Text Feature Alignment Network for Robust Referring Image Segmentation, https://arxiv.org/abs/2509.13070
- Yubo Li, Weiyi Song, 16 Sep 2025, Co-Alignment: Rethinking Alignment as Bidirectional Human-AI Cognitive Adaptation, https://arxiv.org/abs/2509.12179
- Mohsinul Kabir, Ajwad Abrar, Sophia Ananiadou, 16 Sep 2025, Break the Checkbox: Challenging Closed-Style Evaluations of Cultural Alignment in LLMs, https://arxiv.org/abs/2502.08045
- Jie Wu, Haoling Li, Xin Zhang, Jianwen Luo, Yangyu Huang, Ruihang Chu, Yujiu Yang, Scarlett Li, 16 Sep 2025, Teaching Your Models to Understand Code via Focal Preference Alignment, https://arxiv.org/abs/2503.02783
- Jing Xiao, Chang You, Zhiyu Chen, 14 Sep 2025, AlignKT: Explicitly Modeling Knowledge State for Knowledge Tracing with Ideal State Alignment, https://arxiv.org/abs/2509.11135
- Yining Lu, Zilong Wang, Shiyang Li, Xin Liu, Changlong Yu, Qingyu Yin, Zhan Shi, Zixuan Zhang, Meng Jiang, 14 Sep 2025, Learning to Optimize Multi-Objective Alignment Through Dynamic Reward Weighting, https://arxiv.org/abs/2509.11452
- Chentao Cao, Xiaojun Xu, Bo Han, Hang Li, 15 Sep 2025, Reasoned Safety Alignment: Ensuring Jailbreak Defense via Answer-Then-Check, https://arxiv.org/abs/2509.11629
- Jiayou Zhong, Anudeex Shetty, Chao Jia, Xuanrui Lin, Usman Naseem, 12 Sep 2025, Pluralistic Alignment for Healthcare: A Role-Driven Framework, https://arxiv.org/abs/2509.10685
- Hyeongju Kim, Juheon Lee, Jinhyeok Yang, Jacob Morton, 14 Sep 2025, Length-Aware Rotary Position Embedding for Text-Speech Alignment, https://arxiv.org/abs/2509.11084
- Etienne Boursier, Nicolas Flammarion, 15 Sep 2025, Early alignment in two-layer networks training is a two-edged sword, https://arxiv.org/abs/2401.10791
- Zedian Shao, Hongbin Liu, Jaden Mu, Neil Zhenqiang Gong, 15 Sep 2025, Enhancing Prompt Injection Attacks to LLMs via Poisoning Alignment, https://arxiv.org/abs/2410.14827
- Ankur Samanta, Akshayaa Magesh, Youliang Yu, Runzhe Wu, Ayush Jain, Daniel Jiang, Boris Vidolov, Paul Sajda, Yonathan Efroni, Kaveh Hassani, 18 Sep 2025, Internalizing Self-Consistency in Language Models: Multi-Agent Consensus Alignment, https://arxiv.org/abs/2509.15172
- Herlock (SeyedAbolfazl) Rahimi, Dionysis Kalogerias, 17 Sep 2025, FedAVOT: Exact Distribution Alignment in Federated Learning via Masked Optimal Transport, https://arxiv.org/abs/2509.14444
- Natalie Collina, Surbhi Goel, Aaron Roth, Emily Ryu, Mirah Shi, 18 Sep 2025, Emergent Alignment via Competition, https://arxiv.org/abs/2509.15090
- Andr\'es Corrada-Emmanuel, 10 Sep 2025, No-Knowledge Alarms for Misaligned LLMs-as-Judges, https://arxiv.org/abs/2509.08593
- Aadit Sengupta, Pratinav Seth, Vinay Kumar Sankarapu, 10 Sep 2025, Interpretability as Alignment: Making Internal Understanding a Design Principle, https://arxiv.org/abs/2509.08592
- Katalina Hernandez Delgado, 8 Sep 2025, The Law-Following AI Framework: Legal Foundations and Technical Constraints. Legal Analogues for AI Actorship and technical feasibility of Law Alignment, https://arxiv.org/abs/2509.08009
- Hua Shen, Nicholas Clark, Tanushree Mitra, 9 Sep 2025, Mind the Value-Action Gap: Do LLMs Act in Alignment with Their Values?, https://arxiv.org/abs/2501.15463
- Vincent Siu, Nicholas Crispino, David Park, Nathan W. Henry, Zhun Wang, Yang Liu, Dawn Song, Chenguang Wang, 16 Sep 2025, SteeringControl: Holistic Evaluation of Alignment Steering in LLMs, https://arxiv.org/abs/2509.13450
- Zhanting Zhou and Jinshan Lai and Fengchun Zhang and Zeqin Wu and Fengli Zhang, 17 Sep 2025, FedSSG: Expectation-Gated and History-Aware Drift Alignment for Federated Learning, https://arxiv.org/abs/2509.13895
- Yifan Hu, Jie Yang, Tian Zhou, Peiyuan Liu, Yujin Tang, Rong Jin, Liang Sun, 17 Sep 2025, Bridging Past and Future: Distribution-Aware Alignment for Time Series Forecasting, https://arxiv.org/abs/2509.14181
- Jack McKinlay, Marina De Vos, Janina A. Hoffmann, Andreas Theodorou, 17 Sep 2025, Understanding the Process of Human-AI Value Alignment, https://arxiv.org/abs/2509.13854
- Elena Camuffo, Francesco Barbato, Mete Ozay, Simone Milani, Umberto Michieli, 17 Sep 2025, MOCHA: Multi-modal Objects-aware Cross-arcHitecture Alignment, https://arxiv.org/abs/2509.14001
- Puru Vaish, Felix Meister, Tobias Heimann, Christoph Brune, Jelmer M. Wolterink, 17 Sep 2025, Consistent View Alignment Improves Foundation Models for 3D Medical Image Segmentation, https://arxiv.org/abs/2509.13846
- Yuu Jinnai, Ukyo Honda, 17 Sep 2025, Annotation-Efficient Language Model Alignment via Diverse and Representative Response Texts, https://arxiv.org/abs/2405.13541
- Zhiqiang Yuan, Weitong Chen, Hanlin Wang, Kai Yu, Xin Peng, Yiling Lou, 17 Sep 2025, Semantic Alignment-Enhanced Code Translation via an LLM-Based Multi-Agent System, https://arxiv.org/abs/2409.19894
Trustworthy AI
Trustworthy AI is the practice of ensuring that LLM-based systems are safe and predictable. This involves ensuring not only the safety of the LLM's outputs, such as avoiding bias and toxicity, but also ensuring that the AI infrastructure is resilient and the overall system is reliable. The idea of "Trustworthy AI" has been championed by NVIDIA.
Articles and papers on trustworthy AI:
- Leon Derczynski, Christopher Parisien, Nikki Pope, Michael Boone, Nov 2024, NVIDIA Approaches to AI Trust and Safety: Innovation and Tools, https://www.nvidia.com/en-us/on-demand/session/aisummitdc24-sdc1088/?playlistId=playList-c6a9450c-c790-462d-a058-0bacacd5d370
- Mayank Vatsa, Anubhooti Jain, Richa Singh, 7 Dec 2023, Adventures of Trustworthy Vision-Language Models: A Survey, https://arxiv.org/abs/2312.04231
- Nikki Pope, March 1, 2024, What Is Trustworthy AI? Trustworthy AI is an approach to AI development that prioritizes safety and transparency for the people who interact with it. https://blogs.nvidia.com/blog/what-is-trustworthy-ai/
- NVIDIA, Dec 2024 (accessed), Trustworthy AI, https://www.nvidia.com/en-us/ai-data-science/trustworthy-ai/
- Phoebe Lee and Kristina Joos, Jan 25, 2024, Advancing Production AI with NVIDIA AI Enterprise, https://developer.nvidia.com/blog/advancing-production-ai-with-nvidia-ai-enterprise/ ("... advances in NVIDIA AI software deliver up to 54% performance gains without a hardware upgrade...")
- Yedi Zhang, Yufan Cai, Xinyue Zuo, Xiaokun Luan, Kailong Wang, Zhe Hou, Yifan Zhang, Zhiyuan Wei, Meng Sun, Jun Sun, Jing Sun, Jin Song Dong, 9 Dec 2024, The Fusion of Large Language Models and Formal Methods for Trustworthy AI Agents: A Roadmap, https://arxiv.org/abs/2412.06512
- Athanasios Davvetas, Xenia Ziouvelou, Ypatia Dami, Alexis Kaponis, Konstantina Giouvanopoulou, Michael Papademas, 23 Jul 2025, TAI Scan Tool: A RAG-Based Tool With Minimalistic Input for Trustworthy AI Self-Assessment, https://arxiv.org/abs/2507.17514
- Ilias Chatzistefanidis, Navid Nikaein, 23 Jul 2025, Symbiotic Agents: A Novel Paradigm for Trustworthy AGI-driven Networks, https://arxiv.org/abs/2507.17695
- H M Mohaimanul Islam, Huynh Q. N. Vo, Aditya Rane, 22 Jul 2025, Towards Trustworthy AI: Secure Deepfake Detection using CNNs and Zero-Knowledge Proofs, https://arxiv.org/abs/2507.17010
- Tushar Talukder Showrav, Soyabul Islam Lincoln, Md. Kamrul Hasan, 23 Jul 2025, EXGnet: a single-lead explainable-AI guided multiresolution network with train-only quantitative features for trustworthy ECG arrhythmia classification, https://arxiv.org/abs/2506.12404
- Yaomin Jiang, Levin Brinkmann, Anne-Marie Nussberger, Ivan Soraperra, Jean-Fran\c{c}ois Bonnefon, Iyad Rahwan, 17 Jul 2025, Humans learn to prefer trustworthy AI over human partners, https://arxiv.org/abs/2507.13524
- Nuria Rodr\'iguez-Barroso and Mario Garc\'ia-M\'arquez and M. Victoria Luz\'on and Francisco Herrera, 21 Jul 2025, Challenges of Trustworthy Federated Learning: What's Done, Current Trends and Remaining Work, https://arxiv.org/abs/2507.15796
- Mustafa Cavus, Jan N. van Rijn, Przemys{\l}aw Biecek, 19 Jul 2025, Beyond the Single-Best Model: Rashomon Partial Dependence Profile for Trustworthy Explanations in AutoML, https://arxiv.org/abs/2507.14744
- Amina Dzafic, Merve Kavut, Ulya Bayram, 19 Jul 2025, Rethinking Suicidal Ideation Detection: A Trustworthy Annotation Framework and Cross-Lingual Model Evaluation, https://arxiv.org/abs/2507.14693
- Yi Zhang, Zhen Chen, Chih-Hong Cheng, Wenjie Ruan, Xiaowei Huang, Dezong Zhao, David Flynn, Siddartha Khastgir, Xingyu Zhao, 20 Jul 2025, Trustworthy Text-to-Image Diffusion Models: A Timely and Focused Survey, https://arxiv.org/abs/2409.18214
- Anthony Bellotti and Xindi Zhao, 9 Aug 2025, Conformal Prediction and Trustworthy AI, https://arxiv.org/abs/2508.06885
- Stephan Rabanser, 11 Aug 2025, Uncertainty-Driven Reliability: Selective Prediction and Trustworthy Deployment in Modern Machine Learning, https://arxiv.org/abs/2508.07556
- Anindya Bijoy Das, Shahnewaz Karim Sakib and Shibbir Ahmed, 9 Aug 2025, Trustworthy Medical Imaging with Large Language Models: A Study of Hallucinations Across Modalities, https://arxiv.org/abs/2508.07031
- Jesco Talies, Eric Breitbarth, David Melching, 28 Jul 2025, Towards trustworthy AI in materials mechanics through domain-guided attention, https://arxiv.org/abs/2507.20658
- Marius Baden, Ahmed Abouelazm, Christian Hubschneider, Yin Wu, Daniel Slieter, and J. Marius Z\"ollner, 27 Jul 2025, TPK: Trustworthy Trajectory Prediction Integrating Prior Knowledge For Interpretability and Kinematic Feasibility, https://arxiv.org/abs/2505.06743
- Rob Procter, Mark Rouncefield, 25 Jul 2025, Trustworthy AI: UK Air Traffic Control Revisited, https://arxiv.org/abs/2507.21169
- Rui Jiao, Yue Zhang, Jinku Li, 25 Jul 2025, Trustworthy Reasoning: Evaluating and Enhancing Factual Accuracy in LLM Intermediate Thought Processes, https://arxiv.org/abs/2507.22940
- Xinwei Wu, Haojie Li, Hongyu Liu, Xinyu Ji, Ruohan Li, Yule Chen, Yigeng Zhang, 30 Jul 2025, Uncovering the Fragility of Trustworthy LLMs through Chinese Textual Ambiguity, https://arxiv.org/abs/2507.23121
- Xiaojin Zhang, Wei Chen, 30 Jul 2025, Bridging Privacy and Robustness for Trustworthy Machine Learning, https://arxiv.org/abs/2403.16591
- Sihang Zeng, Lucas Jing Liu, Jun Wen, Meliha Yetisgen, Ruth Etzioni, Gang Luo, 1 Aug 2025, TrajSurv: Learning Continuous Latent Trajectories from Electronic Health Records for Trustworthy Survival Prediction, https://arxiv.org/abs/2508.00657
- Jiazhen Pan, Bailiang Jian, Paul Hager, Yundi Zhang, Che Liu, Friedrike Jungmann, Hongwei Bran Li, Chenyu You, Junde Wu, Jiayuan Zhu, Fenglin Liu, Yuyuan Liu, Niklas Bubeck, Christian Wachinger, Chen (Cherise) Chen, Zhenyu Gong, Cheng Ouyang, Georgios Kaissis, Benedikt Wiestler, Daniel Rueckert, 30 Jul 2025, Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models, https://arxiv.org/abs/2508.00923
- James Carzon and Luca Masserano and Joshua D. Ingram and Alex Shen and Antonio Carlos Herling Ribeiro Junior and Tommaso Dorigo and Michele Doro and Joshua S. Speagle and Rafael Izbicki and Ann B. Lee, 4 Aug 2025, Trustworthy scientific inference for inverse problems with generative models, https://arxiv.org/abs/2508.02602
- Vinicius Lima, Dzung T. Phan, Jayant Kalagnanam, Dhaval Patel, Nianjun Zhou, 5 Aug 2025, Toward a Trustworthy Optimization Modeling Agent via Verifiable Synthetic Data Generation, https://arxiv.org/abs/2508.03117
- Claudiu Leoveanu-Condrei, 5 Aug 2025, A DbC Inspired Neurosymbolic Layer for Trustworthy Agent Design, https://arxiv.org/abs/2508.03665
- Anqi Li, Wenwei Jin, Jintao Tong, Pengda Qin, Weijia Li, Guo Lu, 5 Aug 2025, Towards Trustworthy Multimodal Moderation via Policy-Aligned Reasoning and Hierarchical Labeling, https://arxiv.org/abs/2508.03296
- Haoran Li and Lihao Mai and Muhao Guo and Jiaqi Wu and Yang Weng and Yannan Sun and Ce Jimmy Liu, 7 Aug 2025, From Imperfect Signals to Trustworthy Structure: Confidence-Aware Inference from Heterogeneous and Reliability-Varying Utility Data, https://arxiv.org/abs/2508.05791
- Ahmad Farooq and Kamran Iqbal, 7 Aug 2025, Towards Transparent Ethical AI: A Roadmap for Trustworthy Robotic Systems, https://arxiv.org/abs/2508.05846
- Kristian Miok, Blaz \v{S}krlj, Daniela Zaharie, and Marko Robnik \v{S}ikonja, 30 Jul 2025, TT-XAI: Trustworthy Clinical Text Explanations via Keyword Distillation and LLM Reasoning, https://arxiv.org/abs/2508.08273
- Mithat Can Ozgun, Jiahuan Pei, Koen Hindriks, Lucia Donatelli, Qingzhi Liu, Xin Sun, Junxiao Wang, 15 Aug 2025, Trustworthy AI Psychotherapy: Multi-Agent LLM Workflow for Counseling and Explainable Mental Disorder Diagnosis, https://arxiv.org/abs/2508.11398
- Benjamin Alt, Mareike Picklum, Sorin Arion, Franklin Kenghagho Kenfack and Michael Beetz, 15 Aug 2025, Open, Reproducible and Trustworthy Robot-Based Experiments with Virtual Labs and Digital-Twin-Based Execution Tracing, https://arxiv.org/abs/2508.11406
- Zihan Guo, Yuanjian Zhou, Chenyi Wang, Linlin You, Minjie Bian, Weinan Zhang, 19 Aug 2025, BetaWeb: Towards a Blockchain-enabled Trustworthy Agentic Web, https://arxiv.org/abs/2508.13787
- Mary Versa Clemens-Sewall, Christopher Cervantes, Emma Rafkin, J. Neil Otte, Tom Magelinski, Libby Lewis, Michelle Liu, Dana Udwin, Monique Kirkman-Bey, 20 Aug 2025, CaTE Data Curation for Trustworthy AI, https://arxiv.org/abs/2508.14741
- Wenjie Lin, Jin Wei-Kocsis, 21 Aug 2025, LLM4Sweat: A Trustworthy Large Language Model for Hyperhidrosis Support, https://arxiv.org/abs/2508.15192
- Yongwoo Song and Minbyul Jeong and Mujeen Sung, 26 Aug 2025, Trustworthy Agents for Electronic Health Records through Confidence Estimation, https://arxiv.org/abs/2508.19096
- William Jurayj, Nils Holzenberger, Benjamin Van Durme, 28 Aug 2025, Enabling Equitable Access to Trustworthy Financial Reasoning, https://arxiv.org/abs/2508.21051
- \v{S}imon Kucharsk\'y, Aayush Mishra, Daniel Habermann, Stefan T. Radev, Paul-Christian B\"urkner, 28 Aug 2025, Towards Trustworthy Amortized Bayesian Model Comparison, https://arxiv.org/abs/2508.20614
- Daocheng Fu, Jianlong Chen, Renqiu Xia, Zijun Chen, Qi Liu, Yuan Feng, Hongbin Zhou, Renrui Zhang, Shiyang Feng, Peng Gao, Hongyuan Zha, Junchi Yan, Botian Shi, Yu Qiao, Bo Zhang, 29 Aug 2025, TrustGeoGen: Formal-Verified Data Engine for Trustworthy Multi-modal Geometric Problem Solving, https://arxiv.org/abs/2504.15780
- Li Rong Wang, Thomas C. Henderson, Yew Soon Ong, Yih Yng Ng, Xiuyi Fan, 1 Sep 2025, Towards Trustworthy Vital Sign Forecasting: Leveraging Uncertainty for Prediction Intervals, https://arxiv.org/abs/2509.01319
- Chaoyu Zhang and Heng Jin and Shanghao Shi and Hexuan Yu and Sydney Johns and Y. Thomas Hou and Wenjing Lou, 30 Aug 2025, Enabling Trustworthy Federated Learning via Remote Attestation for Mitigating Byzantine Threats, https://arxiv.org/abs/2509.00634
- Aivin V. Solatorio, 8 Sep 2025, Proof-Carrying Numbers (PCN): A Protocol for Trustworthy Numeric Answers from LLMs via Claim Verification, https://arxiv.org/abs/2509.06902
- Teeradaj Racharak, Chaiyong Ragkhitwetsagul, Chommakorn Sontesadisai, Thanwadee Sunetnanta, 8 Sep 2025, Test It Before You Trust It: Applying Software Testing for Trustworthy In-context Learning, https://arxiv.org/abs/2504.18827
- Zhuoyue Zhang, Haitong Xu, 19 Sep 2025, Explainable AI for Maritime Autonomous Surface Ships (MASS): Adaptive Interfaces and Trustworthy Human-AI Collaboration, https://arxiv.org/abs/2509.15959
- Meryem Malak Dif, Mouhamed Amine Bouchiha, Abdelaziz Amara Korba, Yacine Ghamri-Doudane, 8 Sep 2025, Towards Trustworthy Agentic IoEV: AI Agents for Explainable Cyberthreat Mitigation and State Analytics, https://arxiv.org/abs/2509.12233
- Diego Gosmar, Deborah A. Dahl, 18 Sep 2025, Sentinel Agents for Secure and Trustworthy Agentic AI in Multi-Agent Systems, https://arxiv.org/abs/2509.14956
- Prathamesh Vasudeo Naik, Naresh Kumar Dintakurthi, Zhanghao Hu, Yue Wang, Robby Qiu, 10 Sep 2025, Co-Investigator AI: The Rise of Agentic AI for Smarter, Trustworthy AML Compliance Narratives, https://arxiv.org/abs/2509.08380
AI Industry Safety Practices
Various papers discuss the practices of the major AI players in the industry, along with issues such as self-governance.
- OpenAI, July 2023, Frontier Model Forum, https://openai.com/blog/frontier-model-forum
- OpenAI. April 2023, Our approach to AI safety. https://openai.com/blog/our-approach-to-ai-safety
- A. M. Barrett, J. Newman, D. Hendrycks, and B. Nonnecke. 2023, UC Berkeley AI Risk-Management Standards Profile for General-Purpose AI Systems (GPAIS) and Foundation Models, https://cltc.berkeley.edu/seeking-input-and-feedback-ai-risk-management-standards-profile-for-increasingly-multi-purpose-or-general-purpose-ai
- Meta, 2023, Responsible AI: Driven by our belief that AI should benefit everyone, https://ai.meta.com/responsible-ai/
- Google, 2023, AI Governance reviews and operations, https://ai.google/responsibility/ai-governance-operations
- Google, 2023, Responsibility: Our Principles, https://ai.google/responsibility/principles/
- Google, 2023, How Bard Works | A Responsible Approach to AI, YouTube, https://www.youtube.com/watch?v=vhbkCEnNXcY
Technical Verification and Testing of AI Safety
Testing and evaluation of AI safety issues:
- Xiaowei Huang, Marta Kwiatkowska, Sen Wang, and Min Wu. May 2017. Safety verification of deep neural networks. In Computer Aided Verification, pages 3–29, https://arxiv.org/abs/1610.06940
- D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai, S. Kadavath, B. Mann, E. Perez, N. Schiefer, K. Ndousse, A. Jones, S. Bowman, A. Chen, T. Conerly, N. DasSarma, D. Drain, N. Elhage, S. El-Showk, S. Fort, Z. Hatfield-Dodds, T. Henighan, D. Hernandez, T. Hume, J. Jacobson, S. Johnston, S. Kravec, C. Olsson, S. Ringer, E. Tran-Johnson, D. Amodei, T. Brown, N. Joseph, S. McCandlish, C. Olah, J. Kaplan, and J. Clark. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858, 2022 https://arxiv.org/abs/2209.07858
- K Ramesh, A Chavan, S Pandit, 2023, A Comparative Study on the Impact of Model Compression Techniques on Fairness in Language Models, Microsoft Research, https://aclanthology.org/2023.acl-long.878.pdf, https://www.microsoft.com/en-us/research/uploads/prod/2023/07/3687_Paper.pdf (Rather than testing full models, this analysis examines optimized models due to quantization, pruning or distillation.)
- T. Shevlane. Structured access: An emerging paradigm for safe AI deployment. In The Oxford Handbook of AI Governance, 2022, https://arxiv.org/abs/2201.05159
- E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese, N. McAleese, and G. Irving. 2022, Red teaming language models with language models. arXiv preprint arXiv:2202.03286, https://arxiv.org/abs/2202.03286
- OpenAI. 2023. Safety best practices. https://platform.openai.com/docs/guides/safety-best-practices
- William Saunders, Girish Sastry, Andreas Stuhlmueller, and Owain Evans. Trial without error: Towards safe reinforcement learning via human intervention. arXiv preprint arXiv:1707.05173, 2017. https://arxiv.org/abs/1707.05173
- Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed, Oct 2023, Mistral 7B, https://arxiv.org/abs/2310.06825, Code: https://mistral.ai/news/announcing-mistral-7b/ (Examines guardrails and testing of the safety of the model against harmful inputs.)
AI Factual Inaccuracy
Research papers on accuracy of AI results include:
- M Yuksekgonul, V Chandrasekaran, E Jones, Sep 2023, Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models, https://arxiv.org/pdf/2309.15098.pdf, Code: https://github.com/microsoft/mechanistic-error-probe
- Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, Cyril Zhang, June 2023, Exposing Attention Glitches with Flip-Flop Language Modeling, https://arxiv.org/abs/2306.00946
- S Latifi, 2023, Efficient and Dependable Deep Learning Systems Ph.D. Thesis, Computer Science and Engineering, University of Michigan, https://deepblue.lib.umich.edu/bitstream/handle/2027.42/176548/salar_1.pdf?sequence=1
- Michael Wood, Aug 26, 2024, 100% Accurate AI Claimed by Acurai — OpenAI and Anthropic Confirm Acurai’s Discoveries, https://blog.cubed.run/100-accurate-ai-claimed-by-acurai-openai-and-anthropic-confirm-acurais-discoveries-98fce1ddeb5b
AI Safety Incidents
Various incidents and accidents related to AI safety issues:
- S. McGregor. Nov 2021. Preventing repeated real world AI failures by cataloging incidents: The AI Incident Database. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 15458–15463, https://arxiv.org/abs/2011.08512
- Sarah Perez, 2023, Snapchat’s My AI goes rogue, posts to Stories, but Snap confirms it was just a glitch, August 17, 2023, TechCrunch, https://techcrunch.com/2023/08/16/snapchats-my-ai-goes-rogue-posts-to-stories-but-snap-confirms-it-was-just-a-glitch/
- Jaime Seidel, 2019, How a ‘confused’ AI May Have Fought Pilots Attempting to Save Boeing 737 MAX8s, News Corp Australia Network, https://www.news.com.au/technology/innovation/inventions/how-a-confused-ai-may-have-fought-pilots-attempting-to-save-boeing-737-max8s/news-story/bf0d102f699905e5aa8d1f6d65f4c27e (A very good example of the need for overrides and interruptibility.)
- Zachary Arnold, Helen Toner, July 2021, AI Accidents: An Emerging Threat What Could Happen and What to Do, CSET Policy Brief, https://cset.georgetown.edu/wp-content/uploads/CSET-AI-Accidents-An-Emerging-Threat.pdf
- Hern Alex. Apple contractors ‘regularly hear confidential details’ on Siri recordings. Guardian. 2019, https://www.theguardian.com/technology/2019/jul/26/apple-contractors-regularly-hear-confidential-details-on-siri-recordings
- Victor Tangermann, Sep 2023, Microsoft Publishes Garbled AI Article Calling Tragically Deceased NBA Player "Useless", Futurism, https://futurism.com/msn-ai-brandon-hunter-useless ("AI should not be writing obituaries.")
Incident Databases: There are various databases that collect information about AI safety incidents.
- AI Incident Database, https://incidentdatabase.ai/
- Zach Stein-Perlman, SeLo, stepanlos, MvK, July 20, 2023, Incident reporting for AI safety, Effective Altruism Forum, https://forum.effectivealtruism.org/posts/qkK5ejystp8GCJ3vC/incident-reporting-for-ai-safety
- AVID, 2023, AI Vulnerability Database: An open-source, extensible knowledge base of AI failures, https://avidml.org/
- AIAAIC (AI, Algorithmic, and Automation Incidents and Controversies), 2023, https://www.aiaaic.org/home
- MITRE ATLAS™ (Adversarial Threat Landscape for Artificial-Intelligence Systems), https://atlas.mitre.org/
- AI Badness: An open catalog of generative AI badness, 2023, https://badness.ai/
- David Dao, 2023, Awful AI, https://github.com/daviddao/awful-ai
Medical Ethics and AI
The use of AI in medicine creates some additional ethical issues:
- Vollmer S., Mateen B.A., Bohner G., Király F.J., Ghani R., Jonsson P., et al. Machine learning and AI research for patient benefit: 20 critical questions on transparency, replicability, ethics and effectiveness. BMJ. 2018;(368):1–12. https://pubmed.ncbi.nlm.nih.gov/32198138/
- Cockerill RG., 2020, Ethics Implications of the Use of Artificial Intelligence in Violence Risk Assessment. J Am Acad Psychiatry Law. 2020 Sep;48(3):345-349. doi: 10.29158/JAAPL.003940-20. Epub 2020 May 14. PMID: 32409300, https://pubmed.ncbi.nlm.nih.gov/32409300/
- Barron DS. 2021, Commentary: the ethical challenges of machine learning in psychiatry: a focus on data, diagnosis, and treatment. Psychol Med. 2021 Nov;51(15):2522-2524. doi: 10.1017/S0033291721001008. Epub 2021 May 12. PMID: 33975655, https://pubmed.ncbi.nlm.nih.gov/33975655/
- O'Reilly-Shah VN, Gentry KR, Walters AM, Zivot J, Anderson CT, Tighe PJ. 2020, Bias and ethical considerations in machine learning and the automation of perioperative risk assessment. Br J Anaesth. 2020 Dec;125(6):843-846. doi: 10.1016/j.bja.2020.07.040. Epub 2020 Aug 21. PMID: 32838979, https://pubmed.ncbi.nlm.nih.gov/32838979/
- Buchlak QD, Esmaili N, Leveque JC, Bennett C, Piccardi M, Farrokhi F., 2020, Ethical thinking machines in surgery and the requirement for clinical leadership. Am J Surg. 2020 Nov;220(5):1372-1374. doi: 10.1016/j.amjsurg.2020.06.073. Epub 2020 Jul 8. PMID: 32723487, https://pubmed.ncbi.nlm.nih.gov/32723487/
- Starke G, De Clercq E, Borgwardt S, Elger BS., 2020, Computing schizophrenia: ethical challenges for machine learning in psychiatry. Psychol Med. 2021 Nov;51(15):2515-2521. doi: 10.1017/S0033291720001683. Epub 2020 Jun 15. PMID: 32536358, https://pubmed.ncbi.nlm.nih.gov/32536358/
- Jacobson NC, Bentley KH, Walton A, Wang SB, Fortgang RG, Millner AJ, Coombs G 3rd, Rodman AM, Coppersmith DDL., 2020, Ethical dilemmas posed by mobile health and machine learning in psychiatry research. Bull World Health Organ. 2020 Apr 1;98(4):270-276. doi: 10.2471/BLT.19.237107. Epub 2020 Feb 25. PMID: 32284651, https://pubmed.ncbi.nlm.nih.gov/32284651/
- Johnson SLJ., 2019, AI, Machine Learning, and Ethics in Health Care. J Leg Med. 2019 Oct-Dec;39(4):427-441. doi: 10.1080/01947648.2019.1690604. PMID: 31940250 https://pubmed.ncbi.nlm.nih.gov/31940250/
- Vayena E, Blasimme A, Cohen IG., 2018, Machine learning in medicine: Addressing ethical challenges. PLoS Med. 2018 Nov 6;15(11):e1002689. doi: 10.1371/journal.pmed.1002689. eCollection 2018 Nov. PMID: 30399149, https://pubmed.ncbi.nlm.nih.gov/30399149/
- Nabi J., 2018, How Bioethics Can Shape Artificial Intelligence and Machine Learning. Hastings Cent Rep. 2018 Sep;48(5):10-13. doi: 10.1002/hast.895. PMID: 30311202, https://pubmed.ncbi.nlm.nih.gov/30311202/
- Char DS, Shah NH, Magnus D., 2018, Implementing Machine Learning in Health Care - Addressing Ethical Challenges. N Engl J Med. 2018 Mar 15;378(11):981-983. doi: 10.1056/NEJMp1714229. PMID: 29539284, https://pubmed.ncbi.nlm.nih.gov/29539284/
- Fiske A, Henningsen P, Buyx A., 2019, Your Robot Therapist Will See You Now: Ethical Implications of Embodied Artificial Intelligence in Psychiatry, Psychology, and Psychotherapy. J Med Internet Res. 2019 May 9;21(5):e13216. doi: 10.2196/13216. PMID: 31094356, https://pubmed.ncbi.nlm.nih.gov/31094356/
- Beil Michael, Proft Ingo, van Heerden Daniel, Sviri Sigal, van Heerden Peter Vernon. 2019, Ethical considerations about artificial intelligence for prognostication in intensive care. Intensive Care Medicine Experimental. 2019;7:70. http://www.ncbi.nlm.nih.gov/pmc/articles/pmc6904702/, https://pubmed.ncbi.nlm.nih.gov/31823128/
- Lasse Benzinger, Frank Ursin, Wolf-Tilo Balke, Tim Kacprowski & Sabine Salloch, 2023, Should Artificial Intelligence be used to support clinical ethical decision-making? A systematic review of reasons BMC Medical Ethics volume 24, Article number: 48 (2023), https://doi.org/10.1186/s12910-023-00929-6
- Rachel Dlugatch, Antoniya Georgieva & Angeliki Kerasidou, 2023, Trustworthy artificial intelligence and ethical design: public perceptions of trustworthiness of an AI-based decision-support tool in the context of intrapartum care, BMC Medical Ethics Open Access 20 June 2023, https://doi.org/10.1186/s12910-023-00917-w
- Dzobo K, Adotey S, Thomford NE, Dzobo W. Integrating Artificial and Human Intelligence: A Partnership for Responsible Innovation in Biomedical Engineering and Medicine. OMICS. 2020 May;24(5):247-263. doi: 10.1089/omi.2019.0038. Epub 2019 Jul 16. PMID: 31313972, https://pubmed.ncbi.nlm.nih.gov/31313972/
- McCradden MD, Joshi S, Mazwi M, Anderson JA., 2020, Ethical limitations of algorithmic fairness solutions in health care machine learning. Lancet Digit Health. 2020 May;2(5):e221-e223. doi: 10.1016/S2589-7500(20)30065-0. PMID: 33328054, https://pubmed.ncbi.nlm.nih.gov/33328054/
- Kulikowski CA., 2019, Beginnings of Artificial Intelligence in Medicine (AIM): Computational Artifice Assisting Scientific Inquiry and Clinical Art - with Reflections on Present AIM Challenges. Yearb Med Inform. 2019 Aug;28(1):249-256. doi: 10.1055/s-0039-1677895. Epub 2019 Apr 25. PMID: 31022744, https://pubmed.ncbi.nlm.nih.gov/31022744/
- Park S.H., Kim Y.H., Lee J.Y., Yoo S., Kim C.J. Ethical challenges regarding artificial intelligence in medicine from the perspective of scientific editing and peer review. Science Editing. 2019;6:91–98. https://www.semanticscholar.org/paper/Ethical-challenges-regarding-artificial-in-medicine-Park-Kim/7a5b3c84c6f5d16e68eaf17989b0debfd4ba57d0
Data Leakage
Data leakage refers to the AI accidentally causing the leak of data that you'd prefer was kept confidential. The "leak" can actually be caused by the LLM, or by the user, depending on the context. There are various ways this can occur:
- Uploading confidential data in AI queries (User data leakage)
- Training or fine-tuning data containing proprietary information (Training data leakage)
- RAG datastore documents containing proprietary information (RAG data leakage)
In the context of an LLM output leaking, this refers to where internal company IP is accidentally "leaked" to the public by training the AI with documents containing internal information. The AI is not smart enough to note when it shouldn't be reading a document, and anything that goes into the training dataset, or in the RAG datastore, will be shown to users.
User data leakage is where company users are sending proprietary information to a third-party AI engine. In theory, this data is protected by the confidentiality practices of the LLM company. This issue is similar to having company staff emitting confidential information in their Google queries, but the issue is more problematic because AI queries can upload entire documents to be analyzed by the LLM, such as when doing grammar checking with an LLM.
Research papers on data leakage:
- Grant Gross, 05 Jun 2024, Unauthorized AI is eating your company data, thanks to your employees, https://www.csoonline.com/article/2138447/unauthorized-ai-is-eating-your-company-data-thanks-to-your-employees.html
- Mary K. Pratt, 08 Jul 2024, 10 ways to prevent shadow AI disaster, https://www.cio.com/article/2150142/10-ways-to-prevent-shadow-ai-disaster.html
- Rachel Curry, Aug 28 2024, Why companies including JPMorgan and Walmart are opting for internal gen AI assistants after initially restricting usage, https://www.cnbc.com/2024/08/28/why-jpmorgan-and-walmart-are-opting-for-internal-gen-ai-assistants.html
- Huan Yang, Deyu Zhang, Yudong Zhao, Yuanchun Li, Yunxin Liu, 6 Sep 2024, A First Look At Efficient And Secure On-Device LLM Inference Against KV Leakage, https://arxiv.org/abs/2409.04040 (Security issues where KV caches can be data leaks as they may contain encodings of private information.)
- G Wu, Z Zhang, Y Zhang, W Wang, J Niu, Y Wu, Mar 2025, I Know What You Asked: Prompt Leakage via KV-Cache Sharing in Multi-Tenant LLM Serving https://www.ndss-symposium.org/wp-content/uploads/2025-1772-paper.pdf
Refusal
Refusal refers to the way that an LLM will politely decline to answer an inappropriate question. There are all types of questions that we don't want an LLM to respond to, and this requires training to achieve.
- Andy Arditi, Oscar Obeso, Aaquib111, wesg, Neel Nanda, 27th Apr 2024, Refusal in LLMs is mediated by a single direction, LessWrong, https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction
- Maxime Labonne June 13, 2024 Uncensor any LLM with abliteration, https://huggingface.co/blog/mlabonne/abliteration
- NVIDIA, June 2024, Nemotron-4 340B Technical Report, https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T_0.pdf (Architecture is decoder-only with GQA, SentencePiece tokenizer, causal attention masks, RoPE, 96 layers, 96 heads, 8 KV heads, 256,000 vocabulary, 18432 internal dimension, context window 4096, and uses squared RELU.)
- Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, Nouha Dziri, 26 Jun 2024, WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs, https://arxiv.org/abs/2406.18495
- Maksym Andriushchenko, Nicolas Flammarion, 16 Jul 2024, Does Refusal Training in LLMs Generalize to the Past Tense? https://arxiv.org/abs/2407.11969 Code: https://github.com/tml-epfl/llm-past-tense
- Kylie Robison, Jul 20, 2024, OpenAI’s latest model will block the ‘ignore all previous instructions’ loophole, https://www.theverge.com/2024/7/19/24201414/openai-chatgpt-gpt-4o-prompt-injection-instruction-hierarchy
- Xinyi Hou, Yanjie Zhao, Haoyu Wang, 3 Aug 2024, Voices from the Frontier: A Comprehensive Analysis of the OpenAI Developer Forum, https://arxiv.org/abs/2408.01687
- Asir Saadat, Tasmia Binte Sogir, Md Taukir Azam Chowdhury, Syem Aziz, 16 Oct 2024, When Not to Answer: Evaluating Prompts on GPT Models for Effective Abstention in Unanswerable Math Word Problems, https://arxiv.org/abs/2410.13029
- Kyle O'Brien, David Majercak, Xavier Fernandes, Richard Edgar, Jingya Chen, Harsha Nori, Dean Carignan, Eric Horvitz, Forough Poursabzi-Sangde, 18 Nov 2024, Steering Language Model Refusal with Sparse Autoencoders, https://arxiv.org/abs/2411.11296
- Mohit Sewak, Dec 6, 2024, Prompt Injection Attacks on Large Language Models, https://pub.towardsai.net/prompt-injection-attacks-on-large-language-models-bd8062fa1bb7
- Yue Liu, Hongcheng Gao, Shengfang Zhai, Jun Xia, Tianyi Wu, Zhiwei Xue, Yulin Chen, Kenji Kawaguchi, Jiaheng Zhang, Bryan Hooi, 30 Jan 2025, GuardReasoner: Towards Reasoning-based LLM Safeguards, https://arxiv.org/abs/2501.18492
- Holistic AI Team, March 6, 2025, Anthropic’s Claude 3.7 Sonnet Jailbreaking & Red Teaming Audit: The Most Secure Model Yet? https://www.holisticai.com/blog/claude-3-7-sonnet-jailbreaking-audit
- Vishnu Kabir Chhabra, Mohammad Mahdi Khalili, 5 Apr 2025, Towards Understanding and Improving Refusal in Compressed Models via Mechanistic Interpretability, https://arxiv.org/abs/2504.04215
- Wojciech Zaremba, Evgenia Nitishinskaya, Boaz Barak, Stephanie Lin, Sam Toyer, Yaodong Yu, Rachel Dias, Eric Wallace, Kai Xiao, Johannes Heidecke, Amelia Glaese, 31 Jan 2025, Trading Inference-Time Compute for Adversarial Robustness, https://arxiv.org/abs/2501.18841
- Hongzhe Du, Weikai Li, Min Cai, Karim Saraipour, Zimin Zhang, Himabindu Lakkaraju, Yizhou Sun, Shichang Zhang, 11 Aug 2025, How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence, https://arxiv.org/abs/2504.02904
- Haonan Zhang, Dongxia Wang, Yi Liu, Kexin Chen, Jiashui Wang, Xinlei Ying, Long Liu, Wenhai Wang, 15 Aug 2025, ORFuzz: Fuzzing the "Other Side" of LLM Safety -- Testing Over-Refusal, https://arxiv.org/abs/2508.11222
- Yuan Yuan, Tina Sriskandarajah, Anna-Luisa Brakman, Alec Helyar, Alex Beutel, Andrea Vallone, Saachi Jain, 12 Aug 2025, From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training, https://arxiv.org/abs/2508.09224
- Ranjie Duan, Jiexi Liu, Xiaojun Jia, Shiji Zhao, Ruoxi Cheng, Fengxiang Wang, Cheng Wei, Yong Xie, Chang Liu, Defeng Li, Yinpeng Dong, Yichi Zhang, Yuefeng Chen, Chongwen Wang, Xingjun Ma, Xingxing Wei, Yang Liu, Hang Su, Jun Zhu, Xinfeng Li, Yitong Sun, Jie Zhang, Jinzhao Hu, Sha Xu, Yitong Yang, Jialing Tao, Hui Xue, 4 Sep 2025, Oyster-I: Beyond Refusal -- Constructive Safety Alignment for Responsible Language Models, https://arxiv.org/abs/2509.01909
- Md Abdullah Al Mamun, Ihsen Alouani, Nael Abu-Ghazaleh, 28 Aug 2025, Poison Once, Refuse Forever: Weaponizing Alignment for Injecting Bias in LLMs, https://arxiv.org/abs/2508.20333
- Neel Jain, Aditya Shrivastava, Chenyang Zhu, Daben Liu, Alfy Samuel, Ashwinee Panda, Anoop Kumar, Micah Goldblum, Tom Goldstein, 29 Aug 2025, Refusal Tokens: A Simple Way to Calibrate Refusals in Large Language Models, https://arxiv.org/abs/2412.06748
- Nirmalendu Prakash, Yeo Wei Jie, Amir Abdullah, Ranjan Satapathy, Erik Cambria, Roy Ka Wei Lee, 7 Sep 2025, Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal, https://arxiv.org/abs/2509.09708
Guardrails
- Aarushi Kansal, Chapter 4: Guardrails and AI: Building Safe and Controllable Apps, Building Generative AI-Powered Apps: A Hands-on Guide for Developers, Apress, https://www.amazon.com/Building-Generative-AI-Powered-Apps-Hands-ebook/dp/B0CTXXP1S4/
- Meta, July 2024 (accessed), Llama: Making safety tools accessible to everyone, https://llama.meta.com/trust-and-safety/
- Chip Huyen, Jul 25, 2024, Building A Generative AI Platform, https://huyenchip.com/2024/07/25/genai-platform.html
- Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, Lina Yao, 6 Jul 2024, AI Safety in Generative AI Large Language Models: A Survey, https://arxiv.org/abs/2407.18369
- Marko Zivkovic, Aug 06, 2024, Discovered Apple Intelligence prompts show Apple's attempt at preventing AI disaster, https://appleinsider.com/articles/24/08/06/discovered-apple-intelligence-prompts-show-apples-attempt-at-preventing-ai-disaster
- Rachel Curry, Aug 28 2024, Why companies including JPMorgan and Walmart are opting for internal gen AI assistants after initially restricting usage, https://www.cnbc.com/2024/08/28/why-jpmorgan-and-walmart-are-opting-for-internal-gen-ai-assistants.html
- Shayne Longpre, Stella Biderman, Alon Albalak, Hailey Schoelkopf, Daniel McDuff, Sayash Kapoor, Kevin Klyman, Kyle Lo, Gabriel Ilharco, Nay San, Maribeth Rauh, Aviya Skowron, Bertie Vidgen, Laura Weidinger, Arvind Narayanan, Victor Sanh, David Adelani, Percy Liang, Rishi Bommasani, Peter Henderson, Sasha Luccioni, Yacine Jernite, Luca Soldaini, 26 Jun 2024 (v2), The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources, https://arxiv.org/abs/2406.16746
- Jason Perlow, Nov. 6, 2024, The best open-source AI models: All your free-to-use options explained: Here are the best open-source and free-to-use AI models for text, images, and audio, organized by type, application, and licensing considerations. https://www.zdnet.com/article/the-best-open-source-ai-models-all-your-free-to-use-options-explained/
- McKinsey, November 14, 2024, What are AI guardrails? https://www.mckinsey.com/featured-insights/mckinsey-explainers/what-are-ai-guardrails
- Aditi Bodhankar, Dec 06, 2024, Content Moderation and Safety Checks with NVIDIA NeMo Guardrails, https://developer.nvidia.com/blog/content-moderation-and-safety-checks-with-nvidia-nemo-guardrails/
- Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Martín Santillán Cooper, Kieran Fraser, Giulio Zizzo, Muhammad Zaid Hameed, Mark Purcell, Michael Desmond, Qian Pan, Inge Vejsbjerg, Elizabeth M. Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R. Varshney, Prasanna Sattigeri, 10 Dec 2024, Granite Guardian, https://arxiv.org/abs/2412.07724 https://github.com/ibm-granite/granite-guardian (Open-sourcing of safety models with many capabilities.)
- Rama Akkiraju, Anbang Xu, Deepak Bora, Tan Yu, Lu An, Vishal Seth, Aaditya Shukla, Pritam Gundecha, Hridhay Mehta, Ashwin Jha, Prithvi Raj, Abhinav Balasubramanian, Murali Maram, Guru Muthusamy, Shivakesh Reddy Annepally, Sidney Knowles, Min Du, Nick Burnett, Sean Javiya, Ashok Marannan, Mamta Kumari, Surbhi Jha, Ethan Dereszenski, Anupam Chakraborty, Subhash Ranjan, Amina Terfai, Anoop Surya, Tracey Mercer, Vinodh Kumar Thanigachalam, Tamar Bar, Sanjana Krishnan, Samy Kilaru, Jasmine Jaksic, Nave Algarici, Jacob Liberman, Joey Conway, Sonu Nayyar, Justin Boitano, 10 Jul 2024, FACTS About Building Retrieval Augmented Generation-based Chatbots, NVIDIA Research, https://arxiv.org/abs/2407.07858
- Aditi Bodhankar, Jan 16, 2025, How to Safeguard AI Agents for Customer Service with NVIDIA NeMo Guardrails, https://developer.nvidia.com/blog/how-to-safeguard-ai-agents-for-customer-service-with-nvidia-nemo-guardrails/
- Yue Liu, Hongcheng Gao, Shengfang Zhai, Jun Xia, Tianyi Wu, Zhiwei Xue, Yulin Chen, Kenji Kawaguchi, Jiaheng Zhang, Bryan Hooi, 30 Jan 2025, GuardReasoner: Towards Reasoning-based LLM Safeguards, https://arxiv.org/abs/2501.18492
- Aditi Bodhankar, Mar 03, 2025, Measuring the Effectiveness and Performance of AI Guardrails in Generative AI Applications, https://developer.nvidia.com/blog/measuring-the-effectiveness-and-performance-of-ai-guardrails-in-generative-ai-applications/
- Manuel Cossio, 3 Aug 2025, A comprehensive taxonomy of hallucinations in Large Language Models, https://arxiv.org/abs/2508.01781
- Anthropic, 13 Aug 2025, Building Safeguards for Claude, https://www.anthropic.com/news/building-safeguards-for-claude
- Yuksel Aydin, 9 Aug 2025, Cognitive Cybersecurity for Artificial Intelligence: Guardrail Engineering with CCS-7, https://arxiv.org/abs/2508.10033
- Boyuan Zheng, Zeyi Liao, Scott Salisbury, Zeyuan Liu, Michael Lin, Qinyuan Zheng, Zifan Wang, Xiang Deng, Dawn Song, Huan Sun, Yu Su, 18 Jul 2025, WebGuard: Building a Generalizable Guardrail for Web Agents, https://arxiv.org/abs/2507.14293
- Cheng-Fu Yang, Thanh Tran, Christos Christodoulopoulos, Weitong Ruan, Rahul Gupta, Kai-Wei Chang, 28 Jul 2025, Customize Multi-modal RAI Guardrails with Precedent-based predictions, https://arxiv.org/abs/2507.20503
- Chad DeLuca, Anna Lisa Gentile, Shubhi Asthana, Bing Zhang, Pawan Chowdhary, Kellen Cheng, Basel Shbita, Pengyuan Li, Guang-Jie Ren, Sandeep Gopisetty, 25 Jul 2025, OneShield - the Next Generation of LLM Guardrails, https://arxiv.org/abs/2507.21170
- Hannah-Beth Clark, Laura Benton, Emma Searle, Margaux Dowland, Matthew Gregory, Will Gayne and John Roberts, 7 Aug 2025, Building Effective Safety Guardrails in AI Education Tools, https://arxiv.org/abs/2508.05360
- Alexander W. Lee, Justin Chan, Michael Fu, Nicolas Kim, Akshay Mehta, Deepti Raghavan, Ugur Cetintemel, 7 Aug 2025, Semantic Integrity Constraints: Declarative Guardrails for AI-Augmented Data Processing Systems, https://arxiv.org/abs/2503.00600
- Darpan Aswal and C\'eline Hudelot, 22 Aug 2025, LLMSymGuard: A Symbolic Safety Guardrail Framework Leveraging Interpretable Jailbreak Concepts, https://arxiv.org/abs/2508.16325
- Jun Zhuang, Haibo Jin, Ye Zhang, Zhengjian Kang, Wenbin Zhang, Gaby G. Dagher, Haohan Wang, 25 Aug 2025, Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation, https://arxiv.org/abs/2505.18556
- Kellen Tan Cheng, Anna Lisa Gentile, Chad DeLuca, Guang-Jie Ren, 25 Aug 2025, Backprompting: Leveraging Synthetic Production Data for Health Advice Guardrails, https://arxiv.org/abs/2508.18384
- Victoria R. Li and Yida Chen and Naomi Saphra, 26 Aug 2025, ChatGPT Doesn't Trust Chargers Fans: Guardrail Sensitivity in Context, https://arxiv.org/abs/2407.06866
- Monte Hoover, Vatsal Baherwani, Neel Jain, Khalid Saifullah, Joseph Vincent, Chirag Jain, Melissa Kazemi Rad, C. Bayan Bruss, Ashwinee Panda, Tom Goldstein, 2 Sep 2025, DynaGuard: A Dynamic Guardrail Model With User-Defined Policies, https://arxiv.org/abs/2509.02563
Jailbreak
Jailbreaking is the hack of using English to break into a computer system. Actually, it's not so much a violation of the server, but it does refer to a way of getting the LLM to answer questions that its developer probably doesn't want it to. In other words, it's a trick to bypass the "refusal" module of an LLM.
- Andy Arditi, Oscar Obeso, Aaquib111, wesg, Neel Nanda, 27th Apr 2024, Refusal in LLMs is mediated by a single direction, LessWrong, https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction
- Adva Nakash Peleg, May 30, 2024, An LLM Journey: From POC to Production, https://medium.com/cyberark-engineering/an-llm-journey-from-poc-to-production-6c5ec6a172fb
- Yu Wang, Xiaogeng Liu, Yu Li, Muhao Chen, Chaowei Xiao, 14 Mar 2024, AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting, https://arxiv.org/abs/2403.09513 Code: https://github.com/rain305f/AdaShield
- Jinhwa Kim, Ali Derakhshan, Ian G. Harris, 31 Oct 2023, Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield, https://arxiv.org/abs/2311.00172
- Zixuan Ni, Longhui Wei, Jiacheng Li, Siliang Tang, Yueting Zhuang, Qi Tian, 8 Aug 2023 (v2), Degeneration-Tuning: Using Scrambled Grid shield Unwanted Concepts from Stable Diffusion, https://arxiv.org/abs/2308.02552
- Xiao Peng, Tao Liu, Ying Wang, 3 Jun 2024 (v2), Genshin: General Shield for Natural Language Processing with Large Language Models, https://arxiv.org/abs/2405.18741
- Ayushi Nirmal, Amrita Bhattacharjee, Paras Sheth, Huan Liu, 8 May 2024 ( v2), Towards Interpretable Hate Speech Detection using Large Language Model-extracted Rationales, https://arxiv.org/abs/2403.12403 Code: https://github.com/AmritaBh/shield
- Shweta Sharma, 27 Jun 2024, Microsoft warns of ‘Skeleton Key’ jailbreak affecting many generative AI models, https://www.csoonline.com/article/2507702/microsoft-warns-of-novel-jailbreak-affecting-many-generative-ai-models.html
- Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, Nouha Dziri, 26 Jun 2024, WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs, https://arxiv.org/abs/2406.18495
- Maksym Andriushchenko, Nicolas Flammarion, 16 Jul 2024, Does Refusal Training in LLMs Generalize to the Past Tense? https://arxiv.org/abs/2407.11969 Code: https://github.com/tml-epfl/llm-past-tense
- Kylie Robison, Jul 20, 2024, OpenAI’s latest model will block the ‘ignore all previous instructions’ loophole, https://www.theverge.com/2024/7/19/24201414/openai-chatgpt-gpt-4o-prompt-injection-instruction-hierarchy
- Chip Huyen, Jul 25, 2024, Building A Generative AI Platform, https://huyenchip.com/2024/07/25/genai-platform.html
- Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, Lina Yao, 6 Jul 2024, AI Safety in Generative AI Large Language Models: A Survey, https://arxiv.org/abs/2407.18369
- Ayush RoyChowdhury, Mulong Luo,, Prateek Sahu,, Sarbartha Banerjee, Mohit Tiwari, Aug 2024, ConfusedPilot: Confused Deputy Risks in RAG-based LLMs, https://confusedpilot.info/confused_pilot_new.pdf
- Dr. Ashish Bamania, Sep 2024, ‘MathPrompt’ Embarassingly Jailbreaks All LLMs Available On The Market Today. A deep dive into how a novel LLM Jailbreaking technique called ‘MathPrompt’ works, why it is so effective, and why it needs to be patched as soon as possible to prevent harmful LLM content generation, https://bamania-ashish.medium.com/mathprompt-embarassingly-jailbreaks-all-llms-available-on-the-market-today-d749da26c6e8
- Y. Bai et al., "Backdoor Attack and Defense on Deep Learning: A Survey," in IEEE Transactions on Computational Social Systems, doi: 10.1109/TCSS.2024.3482723. https://ieeexplore.ieee.org/abstract/document/10744415
- Steve Jones, Oct 3, 2024, LLM Prompt Injection: Never send the request to the model. Classify, rewrite and reject, https://blog.metamirror.io/llm-prompt-injection-never-send-the-request-to-the-model-e8017269b96a
- Emet Bethany, Mazal Bethany, Juan Arturo Nolazco Flores, Sumit Kumar Jha, Peyman Najafirad, 5 Nov 2024 (v2), Jailbreaking Large Language Models with Symbolic Mathematics, https://arxiv.org/abs/2409.11445
- Alwin Peng, Julian Michael, Henry Sleight, Ethan Perez, Mrinank Sharma, 12 Nov 2024, Rapid Response: Mitigating LLM Jailbreaks with a Few Examples, https://arxiv.org/abs/2411.07494
- Kyle O'Brien, David Majercak, Xavier Fernandes, Richard Edgar, Jingya Chen, Harsha Nori, Dean Carignan, Eric Horvitz, Forough Poursabzi-Sangde, 18 Nov 2024, Steering Language Model Refusal with Sparse Autoencoders, https://arxiv.org/abs/2411.11296
- Zachary Coalson, Jeonghyun Woo, Shiyang Chen, Yu Sun, Lishan Yang, Prashant Nair, Bo Fang, Sanghyun Hong, 10 Dec 2024, PrisonBreak: Jailbreaking Large Language Models with Fewer Than Twenty-Five Targeted Bit-flips, https://arxiv.org/abs/2412.07192
- Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Martín Santillán Cooper, Kieran Fraser, Giulio Zizzo, Muhammad Zaid Hameed, Mark Purcell, Michael Desmond, Qian Pan, Inge Vejsbjerg, Elizabeth M. Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R. Varshney, Prasanna Sattigeri, 10 Dec 2024, Granite Guardian, https://arxiv.org/abs/2412.07724 https://github.com/ibm-granite/granite-guardian (Open-sourcing of safety models with many capabilities.)
- Mohit Sewak, Dec 6, 2024, Prompt Injection Attacks on Large Language Models, https://pub.towardsai.net/prompt-injection-attacks-on-large-language-models-bd8062fa1bb7
- Sicheng Zhu, Brandon Amos, Yuandong Tian, Chuan Guo, Ivan Evtimov, 13 Dec 2024, AdvPrefix: An Objective for Nuanced LLM Jailbreaks, https://arxiv.org/abs/2412.10321
- Aditi Bodhankar, Jan 16, 2025, How to Safeguard AI Agents for Customer Service with NVIDIA NeMo Guardrails, https://developer.nvidia.com/blog/how-to-safeguard-ai-agents-for-customer-service-with-nvidia-nemo-guardrails/
- Xin Yi, Yue Li, Linlin Wang, Xiaoling Wang, Liang He, 18 Jan 2025, Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks, https://arxiv.org/abs/2501.10639
- Yue Liu, Hongcheng Gao, Shengfang Zhai, Jun Xia, Tianyi Wu, Zhiwei Xue, Yulin Chen, Kenji Kawaguchi, Jiaheng Zhang, Bryan Hooi, 30 Jan 2025, GuardReasoner: Towards Reasoning-based LLM Safeguards, https://arxiv.org/abs/2501.18492
- Taryn Plumb, February 3, 2025, Anthropic claims new AI security method blocks 95% of jailbreaks, invites red teamers to try, https://venturebeat.com/security/anthropic-claims-new-ai-security-method-blocks-95-of-jailbreaks-invites-red-teamers-to-try/
- Holistic AI Team, March 6, 2025, Anthropic’s Claude 3.7 Sonnet Jailbreaking & Red Teaming Audit: The Most Secure Model Yet? https://www.holisticai.com/blog/claude-3-7-sonnet-jailbreaking-audit
- Jiacheng Liang, Tanqiu Jiang, Yuhui Wang, Rongyi Zhu, Fenglong Ma, Ting Wang, 16 May 2025, AutoRAN: Weak-to-Strong Jailbreaking of Large Reasoning Models, https://arxiv.org/abs/2505.10846
- Wojciech Zaremba, Evgenia Nitishinskaya, Boaz Barak, Stephanie Lin, Sam Toyer, Yaodong Yu, Rachel Dias, Eric Wallace, Kai Xiao, Johannes Heidecke, Amelia Glaese, 31 Jan 2025, Trading Inference-Time Compute for Adversarial Robustness, https://arxiv.org/abs/2501.18841
- Manuel Cossio, 3 Aug 2025, A comprehensive taxonomy of hallucinations in Large Language Models, https://arxiv.org/abs/2508.01781
- Anthropic, 13 Aug 2025, Building Safeguards for Claude, https://www.anthropic.com/news/building-safeguards-for-claude
- Wenpeng Xing, Mohan Li, Chunqiang Hu, Haitao XuNingyu Zhang, Bo Lin, Meng Han, 8 Aug 2025, Latent Fusion Jailbreak: Blending Harmful and Harmless Representations to Elicit Unsafe LLM Outputs, https://arxiv.org/abs/2508.10029
- Fan Yang, 9 Aug 2025, The Cost of Thinking: Increased Jailbreak Risk in Large Language Models, https://arxiv.org/abs/2508.10032
- Xiaoxue Yang, Jaeha Lee, Anna-Katharina Dick, Jasper Timm, Fei Xie, Diogo Cruz, 11 Aug 2025, Multi-Turn Jailbreaks Are Simpler Than They Seem, https://arxiv.org/abs/2508.07646
- Xianjun Yang, Liqiang Xiao, Shiyang Li, Faisal Ladhak, Hyokun Yun, Linda Ruth Petzold, Yi Xu, William Yang Wang, 9 Aug 2025, Many-Turn Jailbreaking, https://arxiv.org/abs/2508.06755
- Xuancun Lu, Zhengxian Huang, Xinfeng Li, Chi Zhang, Xiaoyu ji, Wenyuan Xu, 11 Aug 2025, POEX: Towards Policy Executable Jailbreak Attacks Against the LLM-based Robots, https://arxiv.org/abs/2412.16633
- Tatia Tsmindashvili, Ana Kolkhidashvili, Dachi Kurtskhalia, Nino Maghlakelidze, Elene Mekvabishvili, Guram Dentoshvili, Orkhan Shamilov, Zaal Gachechiladze, Steven Saporta, David Dachi Choladze, 11 Aug 2025, Improving LLM Outputs Against Jailbreak Attacks with Expert Model Integration, https://arxiv.org/abs/2505.17066
- Jirui Yang, Zheyu Lin, Zhihui Lu, Yinggui Wang, Lei Wang, Tao Wei, Xin Du, Shuhan Yang, 31 Jul 2025, CEE: An Inference-Time Jailbreak Defense for Embodied Intelligence via Subspace Concept Rotation, https://arxiv.org/abs/2504.13201
- Zheng Zhang, Peilin Zhao, Deheng Ye, Hao Wang, 28 Jul 2025, Enhancing Jailbreak Attacks on LLMs via Persona Prompts, https://arxiv.org/abs/2507.22171
- Jiecong Wang, Haoran Li, Hao Peng, Ziqian Zeng, Zihao Wang, Haohua Du, Zhengtao Yu, 1 Aug 2025, Activation-Guided Local Editing for Jailbreaking Attacks, https://arxiv.org/abs/2508.00555
- Yelim Ahn, Jaejin Lee, 2 Aug 2025, PUZZLED: Jailbreaking LLMs through Word-Based Puzzles, https://arxiv.org/abs/2508.01306
- Yik Siu Chan, Narutatsu Ri, Yuxin Xiao, Marzyeh Ghassemi, 2 Aug 2025, Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions, https://arxiv.org/abs/2502.04322
- Muyang Zheng, Yuanzhi Yao, Changting Lin, Rui Wang, Caihong Kai, 4 Aug 2025, MIST: Jailbreaking Black-box Large Language Models via Iterative Semantic Tuning, https://arxiv.org/abs/2506.16792
- Rui Pu, Chaozhuo Li, Rui Ha, Litian Zhang, Lirong Qiu, Xi Zhang, 5 Aug 2025, Beyond Surface-Level Detection: Towards Cognitive-Driven Defense Against Jailbreak Attacks via Meta-Operations Reasoning, https://arxiv.org/abs/2508.03054
- Bodam Kim, Hiskias Dingeto, Taeyoun Kwon, Dasol Choi, DongGeon Lee, Haon Park, JaeHoon Lee, Jongho Shin, 5 Aug 2025, When Good Sounds Go Adversarial: Jailbreaking Audio-Language Models with Benign Inputs, https://arxiv.org/abs/2508.03365
- Giovanni Cherubin, Andrew Paverd, 4 Aug 2025, Highlight & Summarize: RAG without the jailbreaks, https://arxiv.org/abs/2508.02872
- Ruofan Wang, Juncheng Li, Yixu Wang, Bo Wang, Xiaosen Wang, Yan Teng, Yingchun Wang, Xingjun Ma, Yu-Gang Jiang, 5 Aug 2025, IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves, https://arxiv.org/abs/2411.00827
- Junwoo Ha, Hyunjun Kim, Sangyoon Yu, Haon Park, Ashkan Yousefpour, Yuna Park, Suhyun Kim, 5 Aug 2025, M2S: Multi-turn to Single-turn jailbreak in Red Teaming for LLMs, https://arxiv.org/abs/2503.04856
- Thilo Hagendorff, Erik Derner, Nuria Oliver, 4 Aug 2025, Large Reasoning Models Are Autonomous Jailbreak Agents, https://arxiv.org/abs/2508.04039
- Xiaohu Li and Yunfeng Ning and Zepeng Bao and Mayi Xu and Jianhao Chen and Tieyun Qian, 6 Aug 2025, CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations, https://arxiv.org/abs/2507.06043
- Renmiao Chen, Shiyao Cui, Xuancheng Huang, Chengwei Pan, Victor Shea-Jay Huang, QingLin Zhang, Xuan Ouyang, Zhexin Zhang, Hongning Wang, and Minlie Huang, 7 Aug 2025, JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering, https://arxiv.org/abs/2508.05087
- Jesson Wang, Zhanhao Hu, David Wagner, 7 Aug 2025, JULI: Jailbreak Large Language Models by Self-Introspection, https://arxiv.org/abs/2505.11790
- Shuang Liang, Zhihao Xu, Jialing Tao, Hui Xue, Xiting Wang, 8 Aug 2025, Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models: A Unified and Accurate Approach, https://arxiv.org/abs/2508.09201
- Zuoou Li, Weitong Zhang, Jingyuan Wang, Shuyuan Zhang, Wenjia Bai, Bernhard Kainz, Mengyun Qiao, 11 Aug 2025, Towards Effective MLLM Jailbreaking Through Balanced On-Topicness and OOD-Intensity, https://arxiv.org/abs/2508.09218
- Boyuan Chen, Minghao Shao, Abdul Basit, Siddharth Garg, Muhammad Shafique, 13 Aug 2025, MetaCipher: A Time-Persistent and Universal Multi-Agent Framework for Cipher-Based Jailbreak Attacks for LLMs, https://arxiv.org/abs/2506.22557
- Ma Teng and Jia Xiaojun and Duan Ranjie and Li Xinfeng and Huang Yihao and Jia Xiaoshuang and Chu Zhixuan and Ren Wenqi, 18 Aug 2025, Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models, https://arxiv.org/abs/2412.05934
- Zhipeng Wei, Yuqi Liu, N. Benjamin Erichson, 16 Aug 2025, Emoji Attack: Enhancing Jailbreak Attacks Against Judge LLM Detection, https://arxiv.org/abs/2411.01077
- Yangyang Guo and Yangyan Li and Mohan Kankanhalli, 18 Aug 2025, Involuntary Jailbreak, https://arxiv.org/abs/2508.13246
- Jiaming Hu, Haoyu Wang, Debarghya Mukherjee, Ioannis Ch. Paschalidis, 19 Aug 2025, CCFC: Core & Core-Full-Core Dual-Track Defense for LLM Jailbreak Protection, https://arxiv.org/abs/2508.14128
- Xiangman Li, Xiaodong Wu, Qi Li, Jianbing Ni, and Rongxing Lu, 21 Aug 2025, SafeLLM: Unlearning Harmful Outputs from Large Language Models against Jailbreak Attacks, https://arxiv.org/abs/2508.15182
- Darpan Aswal and C\'eline Hudelot, 22 Aug 2025, LLMSymGuard: A Symbolic Safety Guardrail Framework Leveraging Interpretable Jailbreak Concepts, https://arxiv.org/abs/2508.16325
- Yu Yan, Sheng Sun, Zhe Wang, Yijun Lin, Zenghao Duan, zhifei zheng, Min Liu, Zhiyi yin, Jianping Zhang, 22 Aug 2025, Confusion is the Final Barrier: Rethinking Jailbreak Evaluation and Investigating the Real Misuse Threat of LLMs, https://arxiv.org/abs/2508.16347
- Yu Yan, Sheng Sun, Zenghao Duan, Teli Liu, Min Liu, Zhiyi Yin, Jiangyu Lei, Qi Li, 22 Aug 2025, from Benign import Toxic: Jailbreaking the Language Model via Adversarial Metaphors, https://arxiv.org/abs/2503.00038
- Chongwen Zhao, Zhihao Dou, Kaizhu Huang, 25 Aug 2025, Defending against Jailbreak through Early Exit Generation of Large Language Models, https://arxiv.org/abs/2408.11308
- Junchen Ding, Jiahao Zhang, Yi Liu, Ziqi Ding, Gelei Deng, Yuekang Li, 25 Aug 2025, TombRaider: Entering the Vault of History to Jailbreak Large Language Models, https://arxiv.org/abs/2501.18628
- Salman Rahman, Liwei Jiang, James Shiffer, Genglin Liu, Sheriff Issaka, Md Rizwan Parvez, Hamid Palangi, Kai-Wei Chang, Yejin Choi, Saadia Gabriel, 23 Aug 2025, X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents, https://arxiv.org/abs/2504.13203
- Hanjiang Hu, Alexander Robey, Changliu Liu, 25 Aug 2025, Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks, https://arxiv.org/abs/2503.00187
- Chuhan Zhang, Ye Zhang, Bowen Shi, Yuyou Gan, Tianyu Du, Shouling Ji, Dazhan Deng, Yingcai Wu, 4 Sep 2025, NeuroBreak: Unveil Internal Jailbreak Mechanisms in Large Language Models, https://arxiv.org/abs/2509.03985
- Yakai Li, Jiekang Hu, Weiduan Sang, Luping Ma, Dongsheng Nie, Weijuan Zhang, Aimin Yu, Yi Su, Qingjia Huang, Qihang Zhou, 25 Aug 2025, Prefill-level Jailbreak: A Black-Box Risk Analysis of Large Language Models, https://arxiv.org/abs/2504.21038
- Haibo Jin, Ruoxi Chen, Peiyan Zhang, Andy Zhou, Yang Zhang, Haohan Wang, 28 Aug 2025, GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs, https://arxiv.org/abs/2508.20325
- Junjie Chu and Mingjie Li and Ziqing Yang and Ye Leng and Chenhao Lin and Chao Shen and Michael Backes and Yun Shen and Yang Zhang, 28 Aug 2025, JADES: A Universal Framework for Jailbreak Assessment via Decompositional Scoring, https://arxiv.org/abs/2508.20848
- Chongwen Zhao and Kaizhu Huang, 1 Sep 2025, Unraveling LLM Jailbreaks Through Safety Knowledge Neurons, https://arxiv.org/abs/2509.01631
- Sihao Wu, Gaojie Jin, Wei Huang, Jianhong Wang, Xiaowei Huang, 30 Aug 2025, Activation Steering Meets Preference Optimization: Defense Against Jailbreaks in Vision Language Models, https://arxiv.org/abs/2509.00373
- Ruoxi Cheng, Yizhong Ding, Shuirong Cao, Ranjie Duan, Xiaoshuang Jia, Shaowei Yuan, Simeng Qin, Zhiqiang Wang, Xiaojun Jia, 30 Aug 2025, PBI-Attack: Prior-Guided Bimodal Interactive Black-Box Jailbreak Attack for Toxicity Maximization, https://arxiv.org/abs/2412.05892
- Shei Pern Chua, Thai Zhen Leng, Teh Kai Jun, Xiao Li, Xiaolin Hu, 4 Sep 2025, Between a Rock and a Hard Place: Exploiting Ethical Reasoning to Jailbreak LLMs, https://arxiv.org/abs/2509.05367
- Youjia Zheng, Mohammad Zandsalimy, and Shanu Sushmita, 5 Sep 2025, Behind the Mask: Benchmarking Camouflaged Jailbreaks in Large Language Models, https://arxiv.org/abs/2509.05471
- Junjie Mu, Zonghao Ying, Zhekui Fan, Zonglei Jing, Yaoyuan Zhang, Zhengmin Yu, Wenxin Zhang, Quanchen Zou, Xiangzheng Zhang, 8 Sep 2025, Mask-GCG: Are All Tokens in Adversarial Suffixes Necessary for Jailbreak Attacks?, https://arxiv.org/abs/2509.06350
- Yunhan Zhao, Xiang Zheng, Xingjun Ma, 16 Sep 2025, Defense-to-Attack: Bypassing Weak Defenses Enables Stronger Jailbreaks in Vision-Language Models, https://arxiv.org/abs/2509.12724
- Johan Wahr\'eus, Ahmed Hussain, Panos Papadimitratos, 16 Sep 2025, Jailbreaking Large Language Models Through Content Concretization, https://arxiv.org/abs/2509.12937
- Seongho Joo, Hyukhun Koh, Kyomin Jung, 13 Sep 2025, Harmful Prompt Laundering: Jailbreaking LLMs with Abductive Styles and Symbolic Encoding, https://arxiv.org/abs/2509.10931
- Chentao Cao, Xiaojun Xu, Bo Han, Hang Li, 15 Sep 2025, Reasoned Safety Alignment: Ensuring Jailbreak Defense via Answer-Then-Check, https://arxiv.org/abs/2509.11629
- Yibo Zhang, Liang Lin, 14 Sep 2025, ENJ: Optimizing Noise with Genetic Algorithms to Jailbreak LSMs, https://arxiv.org/abs/2509.11128
- Guorui Chen, Yifan Xia, Xiaojun Jia, Zhijiang Li, Philip Torr, Jindong Gu, 18 Sep 2025, LLM Jailbreak Detection for (Almost) Free!, https://arxiv.org/abs/2509.14558
- Hyunjun Kim, Junwoo Ha, Sangyoon Yu, Haon Park, 10 Sep 2025, X-Teaming Evolutionary M2S: Automated Discovery of Multi-turn to Single-turn Jailbreak Templates, https://arxiv.org/abs/2509.08729
Prompt Injection
Prompt injection is a type of LLM "hack" or "jailbreak" involving the insertion of nefarious words into the prompt. A simple example is the jailbreak involving words to the effect of "ignore all previous instructions and do what I say," which was a surprisingly effective method.
Research papers on prompt injection attacks and mitigation include:
- Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, Lina Yao, 6 Jul 2024, AI Safety in Generative AI Large Language Models: A Survey, https://arxiv.org/abs/2407.18369
- Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, Rui Kong, Yile Wang, Hanfei Geng, Jian Luan, Xuefeng Jin, Zilong Ye, Guanjing Xiong, Fan Zhang, Xiang Li, Mengwei Xu, Zhijun Li, Peng Li, Yang Liu, Ya-Qin Zhang, Yunxin Liu, 8 May 2024 (v2), Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security, https://arxiv.org/abs/2401.05459 https://github.com/MobileLLM/Personal_LLM_Agents_Survey
- Steve Jones, Oct 3, 2024, LLM Prompt Injection: Never send the request to the model. Classify, rewrite and reject, https://blog.metamirror.io/llm-prompt-injection-never-send-the-request-to-the-model-e8017269b96a
- Mohit Sewak, Dec 6, 2024, Prompt Injection Attacks on Large Language Models, https://pub.towardsai.net/prompt-injection-attacks-on-large-language-models-bd8062fa1bb7
- Kylie Robison, Jul 20, 2024, OpenAI’s latest model will block the ‘ignore all previous instructions’ loophole, https://www.theverge.com/2024/7/19/24201414/openai-chatgpt-gpt-4o-prompt-injection-instruction-hierarchy
- Jerry Wang and Fang Yu, 20 Jul 2025, DeRAG: Black-box Adversarial Attacks on Multiple Retrieval-Augmented Generation Applications via Prompt Injection, https://arxiv.org/abs/2507.15042
- Sam Johnson, Viet Pham, Thai Le, 20 Jul 2025, Manipulating LLM Web Agents with Indirect Prompt Injection Attack via HTML Accessibility Tree, https://arxiv.org/abs/2507.14799
- Tianneng Shi, Kaijie Zhu, Zhun Wang, Yuqi Jia, Will Cai, Weida Liang, Haonan Wang, Hend Alzahrani, Joshua Lu, Kenji Kawaguchi, Basel Alomair, Xuandong Zhao, William Yang Wang, Neil Gong, Wenbo Guo, Dawn Song, 21 Jul 2025, PromptArmor: Simple yet Effective Prompt Injection Defenses, https://arxiv.org/abs/2507.15219
- Aleksandr Gashkov, Aleksandr Perevalov, Maria Eltsova, Andreas Both, 18 Jul 2025, SPARQL Query Generation with LLMs: Measuring the Impact of Training Data Memorization and Knowledge Injection, https://arxiv.org/abs/2507.13859
- Junhyeong Lee, Joon-Young Kim, Heekyu Kim, Inhyo Lee and Seunghwa Ryu, 21 Jul 2025, IM-Chat: A Multi-agent LLM-based Framework for Knowledge Transfer in Injection Molding Industry, https://arxiv.org/abs/2507.15268
- Zhengyun Zhao, Huaiyuan Ying, Yue Zhong, Sheng Yu, 24 Jul 2025, DR.EHR: Dense Retrieval for Electronic Health Record with Knowledge Injection and Synthetic Data, https://arxiv.org/abs/2507.18583
- Da-Wei Zhou, Kai-Wen Li, Jingyi Ning, Han-Jia Ye, Lijun Zhang, De-Chuan Zhan, 24 Jul 2025, External Knowledge Injection for CLIP-Based Class-Incremental Learning, https://arxiv.org/abs/2503.08510
- Taibiao Zhao, Mingxuan Sun, Hao Wang, Xiaobing Chen, Xiangwei Zhou, 14 Aug 2025, Pruning and Malicious Injection: A Retraining-Free Backdoor Attack on Transformer Models, https://arxiv.org/abs/2508.10243
- Francesco Panebianco, Stefano Bonfanti, Francesco Trov\`o, Michele Carminati, 1 Aug 2025, LeakSealer: A Semisupervised Defense for LLMs Against Prompt Injection and Leakage Attacks, https://arxiv.org/abs/2508.00602
- Peiran Wang, Yang Liu, Yunfei Lu, Yifeng Cai, Hongbo Chen, Qingyou Yang, Jie Zhang, Jue Hong, Ye Wu, 2 Aug 2025, AgentArmor: Enforcing Program Analysis on Agent Runtime Trace to Defend Against Prompt Injection, https://arxiv.org/abs/2508.01249
- Zhiyao Luo, Tingting Zhu, 6 Aug 2025, Are Large Language Models Dynamic Treatment Planners? An In Silico Study from a Prior Knowledge Injection Angle, https://arxiv.org/abs/2508.04755
- Thorsten Peinemann, Paula Arnold, Sebastian Berndt, Thomas Eisenbarth, Esfandiar Mohammadi, 7 Aug 2025, Non-omniscient backdoor injection with a single poison sample: Proving the one-poison hypothesis for linear regression and linear classification, https://arxiv.org/abs/2508.05600
- Hammad Atta, Ken Huang, Manish Bhatt, Kamal Ahmed, Muhammad Aziz Ul Haq, Yasir Mehmood, 6 Aug 2025, Logic layer Prompt Control Injection (LPCI): A Novel Security Vulnerability Class in Agentic Systems, https://arxiv.org/abs/2507.10457
- Kalle Kujanp\"a\"a, Pekka Marttinen, Harri Valpola, Alexander Ilin, 7 Aug 2025, Efficient Knowledge Injection in LLMs via Self-Distillation, https://arxiv.org/abs/2412.14964
- Ameya Anjarlekar, Sandeep Pombra, 8 Aug 2025, LLM Unlearning using Gradient Ratio-Based Influence Estimation and Noise Injection, https://arxiv.org/abs/2508.06467
- Zhiqiu Zhang, Dongqi Fan, Mingjie Wang, Qiang Tang, Jian Yang, Zili Yi, 13 Aug 2025, Region-to-Region: Enhancing Generative Image Harmonization with Adaptive Regional Injection, https://arxiv.org/abs/2508.09746
- Xuyang Guo, Zekai Huang, Zhao Song, Jiahao Zhang, 16 Aug 2025, Too Easily Fooled? Prompt Injection Breaks LLMs on Frustratingly Simple Multiple-Choice Questions, https://arxiv.org/abs/2508.13214
- Xudong Wang, Guoming Tang, Junyu Xue, Srinivasan Keshav, Tongxin Li, Chris Ding, 20 Aug 2025, DualNILM: Energy Injection Identification Enabled Disaggregation with Deep Multi-Task Learning, https://arxiv.org/abs/2508.14600
- Hengyu An, Jinghuai Zhang, Tianyu Du, Chunyi Zhou, Qingming Li, Tao Lin, Shouling Ji, 21 Aug 2025, IPIGuard: A Novel Tool Dependency Graph-Based Defense Against Indirect Prompt Injection in LLM Agents, https://arxiv.org/abs/2508.15310
- Khalil Hennara, Sara Chrouf, Mohamed Motaism Hamed, Zeina Aldallal, Omar Hadid, Safwan AlModhayan, 21 Aug 2025, Kuwain 1.5B: An Arabic SLM via Language Injection, https://arxiv.org/abs/2504.15120
- Jiawen Shi, Zenghui Yuan, Yinuo Liu, Yue Huang, Pan Zhou, Lichao Sun, Neil Zhenqiang Gong, 24 Aug 2025, Optimization-based Prompt Injection Attack to LLM-as-a-Judge, https://arxiv.org/abs/2403.17710
- Sarthak Choudhary, Divyam Anshumaan, Nils Palumbo, Somesh Jha, 17 Jul 2025, How Not to Detect Prompt Injections with an LLM, https://arxiv.org/abs/2507.05630
- Qifeng Tan, Shusen Yang, Xuebin Ren, Yikai Zhang (Xi'an Jiaotong University), 4 Sep 2025, Rethinking Layer-wise Gaussian Noise Injection: Bridging Implicit Objectives and Privacy Budget Allocation, https://arxiv.org/abs/2509.04232
- Xilong Wang, John Bloch, Zedian Shao, Yuepeng Hu, Shuyan Zhou, Neil Zhenqiang Gong, 27 Aug 2025, EnvInjection: Environmental Prompt Injection Attack to Multi-modal Web Agents, https://arxiv.org/abs/2505.11717
- Harethah Abu Shairah, Hasan Abed Al Kader Hammoud, George Turkiyyah, Bernard Ghanem, 28 Aug 2025, Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection, https://arxiv.org/abs/2508.20766
- Amine Lbath, Massih-Reza Amini, Aurelien Delaitre, Vadim Okun, 28 Aug 2025, AI Agentic Vulnerability Injection And Transformation with Optimized Reasoning, https://arxiv.org/abs/2508.20866
- Govind Waghmare, Sumedh BG, Sonia Gupta, Srikanta Bedathur, 31 Aug 2025, Efficient Graph Understanding with LLMs via Structured Context Injection, https://arxiv.org/abs/2509.00740
- Ting-Chun Liu and Ching-Yu Hsu and Kuan-Yi Lee and Chi-An Fu and Hung-yi Lee, 27 Aug 2025, AEGIS : Automated Co-Evolutionary Framework for Guarding Prompt Injections Schema, https://arxiv.org/abs/2509.00088
- Mario U. Gaimann and Miriam Klopotek, 1 Sep 2025, Optimal information injection and transfer mechanisms for active matter reservoir computing, https://arxiv.org/abs/2509.01799
- Ishaan Verma, 6 Sep 2025, Decoding Latent Attack Surfaces in LLMs: Prompt Injection via HTML in Web Summarization, https://arxiv.org/abs/2509.05831
- Andrew Yeo, Daeseon Choi, 7 Sep 2025, Multimodal Prompt Injection Attacks: Risks and Defenses for Modern LLMs, https://arxiv.org/abs/2509.05883
- Mengxue Yang, Chun Yang, Jiaqi Zhu, Jiafan Li, Jingqi Zhang, Yuyang Li, Ying Li, 8 Sep 2025, SLiNT: Structure-aware Language Model with Injection and Contrastive Training for Knowledge Graph Completion, https://arxiv.org/abs/2509.06531
- Minghui Li, Hao Zhang, Yechao Zhang, Wei Wan, Shengshan Hu, pei Xiaobing, Jing Wang, 9 Sep 2025, Transferable Direct Prompt Injection via Activation-Guided MCMC Sampling, https://arxiv.org/abs/2509.07617
- Janis Keuper, 12 Sep 2025, Prompt Injection Attacks on LLM Generated Reviews of Scientific Publications, https://arxiv.org/abs/2509.10248
- Hai-Vy Nguyen, Fabrice Gamboa, Sixin Zhang, Reda Chhaibi, Serge Gratton, Thierry Giaccone, 19 Sep 2025, Training More Robust Classification Model via Discriminative Loss and Gaussian Noise Injection, https://arxiv.org/abs/2405.18499
- Jiahao Zhang and Xiaobing Pei and Zhaokun Zhong and Wenqiang Hao and Zhenghao Tang, 16 Sep 2025, JANUS: A Dual-Constraint Generative Framework for Stealthy Node Injection Attacks, https://arxiv.org/abs/2509.13266
- Luke Howard, 13 Sep 2025, GoldenTransformer: A Modular Fault Injection Framework for Transformer Robustness Research, https://arxiv.org/abs/2509.10790
- Pavan Reddy, Aditya Sanjay Gujral, 6 Sep 2025, EchoLeak: The First Real-World Zero-Click Prompt Injection Exploit in a Production LLM System, https://arxiv.org/abs/2509.10540
- Zedian Shao, Hongbin Liu, Jaden Mu, Neil Zhenqiang Gong, 15 Sep 2025, Enhancing Prompt Injection Attacks to LLMs via Poisoning Alignment, https://arxiv.org/abs/2410.14827
- Yupei Liu, Yuqi Jia, Jinyuan Jia, Dawn Song, Neil Zhenqiang Gong, 14 Sep 2025, DataSentinel: A Game-Theoretic Detection of Prompt Injection Attacks, https://arxiv.org/abs/2504.11358
- Gustavo Sandoval, Denys Fenchenko and Junyao Chen, 15 Sep 2025, Early Approaches to Adversarial Fine-Tuning for Prompt Injection Defense: A 2022 Study of GPT-3 and Contemporary Models, https://arxiv.org/abs/2509.14271
- S M Asif Hossain, Ruksat Khan Shayoni, Mohd Ruhul Ameen, Akif Islam, M. F. Mridha, Jungpil Shin, 16 Sep 2025, A Multi-Agent LLM Defense Pipeline Against Prompt Injection Attacks, https://arxiv.org/abs/2509.14285
- Tongyu Wen, Chenglong Wang, Xiyuan Yang, Haoyu Tang, Yueqi Xie, Lingjuan Lyu, Zhicheng Dou, Fangzhao Wu, 17 Sep 2025, Defending against Indirect Prompt Injection by Instruction Detection, https://arxiv.org/abs/2505.06311
Plagiarism
Plagiarism is an issue for LLMs when they repeat input training verbatim. This is a controversial issue with numerous copyright lawsuits in progress at the moment. Another side to "plagiarism" is detecting when authors or students have used AI in their writing, without attribution.
Research papers on plagiarism issues with AI include:
- Ruixiang Tang, Yu-Neng Chuang, Xia Hu, June 2023, The Science of Detecting LLM-Generated Texts, https://arxiv.org/abs/2303.07205
- JON CHRISTIAN, 2023, CNET's AI Journalist Appears to Have Committed Extensive Plagiarism, https://futurism.com/cnet-ai-plagiarism
- Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, Lina Yao, 6 Jul 2024, AI Safety in Generative AI Large Language Models: A Survey, https://arxiv.org/abs/2407.18369
- David Gewirtz, Nov. 26, 2024, I tested 9 AI content detectors - and these 2 correctly identified AI text every time, https://www.zdnet.com/article/i-tested-9-ai-content-detectors-and-these-2-correctly-identified-ai-text-every-time/
- Guillaume Cabanac, Cyril Labbé, Alexander Magazinov, 12 Jul 2021, Tortured phrases: A dubious writing style emerging in science. Evidence of critical issues affecting established journals, https://arxiv.org/abs/2107.06751 (Detects "tortured phrases" created by pre-AI paraphrasing tools used to avoid plagiarism detectors.)
- Eléna Martel, Martin Lentschat, Cyril Labbé, 2 Feb 2024, Detection of tortured phrases in scientific literature, https://arxiv.org/abs/2402.03370
- Seonghyeon Go, 10 Sep 2025, Real-world Music Plagiarism Detection With Music Segment Transcription System, https://arxiv.org/abs/2509.08282
AI Detectors
AI detectors are software that is supposed to detect whether a text or image has been created by humans or by LLMs. In practice, these have been a mixed success, being prone to both false positives and false negatives, and their use remains controversial and unclear.
Research papers on AI detectors:
- David Gewirtz, Aug. 19, 2024, How do AI checkers actually work? https://www.zdnet.com/article/how-do-ai-checkers-work/
- David Gewirtz, Aug. 8, 2024, I tested 7 AI content detectors - they're getting dramatically better at identifying plagiarism, https://www.zdnet.com/article/i-tested-7-ai-content-detectors-theyre-getting-dramatically-better-at-identifying-plagiarism/
- Write A Catalyst, Aug 23, 2024, Words and Phrases That Show ChatGPT Generated It, https://medium.com/write-a-catalyst/words-and-phrases-that-show-chatgpt-generated-it-ca7e28ae8e8f
- Brian Contreras, September 19, 2024, How Can You Detect AI-Generated Text? This Startup Has Some Compelling Ideas, https://www.inc-aus.com/brian-contreras/how-can-you-detect-ai-generated-text-this-startup-has-some-compelling-ideas.html
- Tan Rosado, Sep 9, 2024, 10 Phrases That Scream ‘AI Wrote This!’ — Even When It Didn’t. https://medium.com/write-a-catalyst/10-phrases-that-scream-ai-wrote-this-even-when-it-didn-t-c58f273c9075
- David Gewirtz, Nov. 26, 2024, I tested 9 AI content detectors - and these 2 correctly identified AI text every time, https://www.zdnet.com/article/i-tested-9-ai-content-detectors-and-these-2-correctly-identified-ai-text-every-time/
- The Medium Newsletter Dec 2024, ChatGPT’s favorite words & punctuation, The Medium Blog, https://blog.medium.com/chatgpts-favorite-words-punctuation-fca042bb6bea
- The Medium Blog, Jun 7, 2024, How to become a marine biologist, https://blog.medium.com/how-to-become-a-marine-biologist-ca849217523b
- Alex Hern, 16 Apr 2024, TechScape: How cheap, outsourced labour in Africa is shaping AI English, https://www.theguardian.com/technology/2024/apr/16/techscape-ai-gadgest-humane-ai-pin-chatgpt
- Jordan Gibbs, Dec 14, 2023, Which Words Does ChatGPT Use the Most? I analyzed 1 million words of ChatGPT output and found the words that ChatGPT overuses most. https://medium.com/@jordan_gibbs/which-words-does-chatgpt-use-the-most-7c9ff02416a8
- Asif Iqbal, August 31, 2024, ChatGPT's Top 50 Favorite Words and Phrases, https://www.linkedin.com/pulse/chatgpts-top-50-favorite-words-phrases-asif-iqbal-mba-cmbe-lavpe/
- BaggyBoy, 2024, Is an em dash (—) proof of AI manipulation? https://www.reddit.com/r/ChatGPT/comments/1fx12q1/is_an_em_dash_proof_of_ai_manipulation/?rdt=38192
- Linda Caroll, Jan 2025, I Don’t Know How To Make You Care What ChatGPT Is Quietly Doing: Over half of the internet is now AI generated text https://medium.com/the-generator/i-dont-know-how-to-make-you-care-what-chatgpt-is-quietly-doing-8177dfcfb486
- Maria Cassano, Jan 4, 2025, I’m a Professional Editor and These Phrases Tell Me You Used ChatGPT: AI chatbots were trained on novice writing, and it shows, https://writingcooperative.com/im-a-professional-editor-and-these-phrases-tell-me-you-used-chatgpt-23236708918f
- Mingjie Sun, Yida Yin, Zhiqiu Xu, J. Zico Kolter, Zhuang Liu, 17 Feb 2025, Idiosyncrasies in Large Language Models, https://arxiv.org/abs/2502.12150
- W Li, Y Lai, S Soni, K Saha, 2025, Emails by LLMs: A Comparison of Language in AI-Generated and Human-Written Emails, Proceedings of the 17th ACM Web Science Conference 2025 (Websci ’25), May 20–24, 2025, New Brunswick, NJ, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3717867.3717872 https://www.researchgate.net/profile/Koustuv-Saha-2/publication/389509862_Emails_by_LLMs_A_Comparison_of_Language_in_AI-Generated_and_Human-Written_Emails/links/67c5cd02461fb56424efccc6/Emails-by-LLMs-A-Comparison-of-Language-in-AI-Generated-and-Human-Written-Emails.pdf
- David Gewirtz, April 30, 2025, I tested 10 AI content detectors - and these 5 correctly identified AI text every time: I've been testing AI content detectors for two years now. They're getting more and more reliable, https://www.zdnet.com/article/i-tested-10-ai-content-detectors-and-these-5-correctly-identified-ai-text-every-time/
- Shreya Shankar, Jun 16, 2025, Writing in the Age of LLMs: Common Patterns of Bad Writing I See from LLM Tools, https://www.sh-reya.com/blog/ai-writing/ (A good overview of the types of bad writing that comes out of LLMs.)
Privacy
Research on privacy-related risks or concerns:
- Matthew Finnegan 14 Jun 2024, Microsoft delays Recall launch amid privacy concerns, ComputerWorld, https://www.computerworld.com/article/2147736/microsoft-delays-recall-launch-amid-privacy-concerns.html
- Rohan Goswami 21 June, 2024, Apple Intelligence won’t launch in EU in 2024 due to antitrust regulation, company says, CNBS, https://www.cnbc.com/2024/06/21/apple-ai-europe-dma-macos.html
- Dan Peng, Zhihui Fu, Jun Wang, 1 Jul 2024, PocketLLM: Enabling On-Device Fine-Tuning for Personalized LLMs, https://arxiv.org/abs/2407.01031 (Running fine-tuning on a smartphone via a low-memory optimization using a "derivative-free" "zeroth-order" technique called MeZo, with advantages such as privacy.)
- Jay Peters, Jul 4, 2024, OpenAI’s ChatGPT Mac app was storing conversations in plain text, https://www.theverge.com/2024/7/3/24191636/openai-chatgpt-mac-app-conversations-plain-text
- Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, Lina Yao, 6 Jul 2024, AI Safety in Generative AI Large Language Models: A Survey, https://arxiv.org/abs/2407.18369
- Y. Zhang, J. Zhang, S. Yue, W. Lu, J. Ren, X. Shen, August 2024, "Mobile Generative AI: Opportunities and Challenges," in IEEE Wireless Communications, vol. 31, no. 4, pp. 58-64, doi: 10.1109/MWC.006.2300576, https://ieeexplore.ieee.org/abstract/document/10628027/
- Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, Rui Kong, Yile Wang, Hanfei Geng, Jian Luan, Xuefeng Jin, Zilong Ye, Guanjing Xiong, Fan Zhang, Xiang Li, Mengwei Xu, Zhijun Li, Peng Li, Yang Liu, Ya-Qin Zhang, Yunxin Liu, 8 May 2024 (v2), Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security, https://arxiv.org/abs/2401.05459 https://github.com/MobileLLM/Personal_LLM_Agents_Survey
- Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
- Huan Yang, Deyu Zhang, Yudong Zhao, Yuanchun Li, Yunxin Liu, 6 Sep 2024, A First Look At Efficient And Secure On-Device LLM Inference Against KV Leakage, https://arxiv.org/abs/2409.04040 (Security issues where KV caches can be data leaks as they may contain encodings of private information.)
- Apple, Sep 2024, Apple Intelligence comes to iPhone, iPad, and Mac starting next month, https://www.apple.com/newsroom/2024/09/apple-intelligence-comes-to-iphone-ipad-and-mac-starting-next-month/
- Donghwan Rho, Taeseong Kim, Minje Park, Jung Woo Kim, Hyunsik Chae, Jung Hee Cheon, Ernest K. Ryu, 3 Oct 2024, Encryption-Friendly LLM Architecture, https://arxiv.org/abs/2410.02486
- Jiankun Wei, Abdulrahman Abdulrazzag, Tianchen Zhang, Adel Muursepp, Gururaj Saileshwar, 5 Nov 2024 (v2), Privacy Risks of Speculative Decoding in Large Language Models, https://arxiv.org/abs/2411.01076
- G Wu, Z Zhang, Y Zhang, W Wang, J Niu, Y Wu, Mar 2025, I Know What You Asked: Prompt Leakage via KV-Cache Sharing in Multi-Tenant LLM Serving https://www.ndss-symposium.org/wp-content/uploads/2025-1772-paper.pdf
- Maimunatu Tunau, Vincent Gbouna Zakka, Zhuangzhuang Dai, 14 Aug 2025, Enhanced Sparse Point Cloud Data Processing for Privacy-aware Human Action Recognition, https://arxiv.org/abs/2508.10469
- Feiran Li, Qianqian Xu, Shilong Bao, Boyu Han, Zhiyong Yang, Qingming Huang, 14 Aug 2025, Hybrid Generative Fusion for Efficient and Privacy-Preserving Face Recognition Dataset Generation, https://arxiv.org/abs/2508.10672
- Yanzhe Zhang, Diyi Yang, 14 Aug 2025, Searching for Privacy Risks in LLM Agents via Simulation, https://arxiv.org/abs/2508.10880
- Quentin Hillebrand, Vorapong Suppakitpaisarn and Tetsuo Shibuya, 14 Aug 2025, Communication Cost Reduction for Subgraph Counting under Local Differential Privacy via Hash Functions, https://arxiv.org/abs/2312.07055
- Jessup Byun, Xiaofeng Lin, Joshua Ward, Guang Cheng, 22 Jul 2025, Risk In Context: Benchmarking Privacy Leakage of Foundation Models in Synthetic Tabular Data Generation, https://arxiv.org/abs/2507.17066
- Wei Fan, JinYi Yoon, Xiaochang Li, Huajie Shao, and Bo Ji, 23 Jul 2025, P3SL: Personalized Privacy-Preserving Split Learning on Heterogeneous Edge Devices, https://arxiv.org/abs/2507.17228
- Na Li and Yansong Gao and Hongsheng Hu and Boyu Kuang and Anmin Fu, 22 Jul 2025, CompLeak: Deep Learning Model Compression Exacerbates Privacy Leakage, https://arxiv.org/abs/2507.16872
- Angelo Rodio, Zheng Chen, Erik G. Larsson, 23 Jul 2025, Optimizing Privacy-Utility Trade-off in Decentralized Learning with Generalized Correlated Noise, https://arxiv.org/abs/2501.14644
- Mehdi Khalaj, Shahrzad Golestani Najafabadi, Julita Vassileva, 23 Jul 2025, Privacy-Preserving Multimodal News Recommendation through Federated Learning, https://arxiv.org/abs/2507.15460
- Harsha Sammangi (Dakota State University), Aditya Jagatha (College of Business and Information Systems, Dakota State University), Giridhar Reddy Bojja (College of Business, Michigan Technological University), Jun Liu (College of Business and I.S, Dakota State University), 29 Apr 2025, Decentralized AI-driven IoT Architecture for Privacy-Preserving and Latency-Optimized Healthcare in Pandemic and Critical Care Scenarios, https://arxiv.org/abs/2507.15859
- Dakota Sullivan, Shirley Zhang, Jennica Li, Heather Kirkorian, Bilge Mutlu, Kassem Fawaz, 22 Jul 2025, Benchmarking LLM Privacy Recognition for Social Robot Decision Making, https://arxiv.org/abs/2507.16124
- Tanusree Sharma, Yihao Zhou, Visar Berisha, 22 Jul 2025, PRAC3 (Privacy, Reputation, Accountability, Consent, Credit, Compensation): Long Tailed Risks of Voice Actors in AI Data-Economy, https://arxiv.org/abs/2507.16247
- Tian Dong, Yan Meng, Shaofeng Li, Guoxing Chen, Zhen Liu, Haojin Zhu, 22 Jul 2025, Depth Gives a False Sense of Privacy: LLM Internal States Inversion, https://arxiv.org/abs/2507.16372
- Ryusei Fujimoto, Yugo Nakamura, Yutaka Arakawa, 24 Jul 2025, C-AAE: Compressively Anonymizing Autoencoders for Privacy-Preserving Activity Recognition in Healthcare Sensor Streams, https://arxiv.org/abs/2507.18072
- Tevin Atwal, Chan Nam Tieu, Yefeng Yuan, Zhan Shi, Yuhong Liu, Liang Cheng, 24 Jul 2025, Privacy-Preserving Synthetic Review Generation with Diverse Writing Styles Using LLMs, https://arxiv.org/abs/2507.18055
- Nikola Pavlovic, Sudeep Salgia, Qing Zhao, 18 Jul 2025, Differential Privacy in Kernelized Contextual Bandits via Random Projections, https://arxiv.org/abs/2507.13639
- Daniel Commey, Benjamin Appiah, Griffith S. Klogo, and Garth V. Crosby, 18 Jul 2025, ZKP-FedEval: Verifiable and Privacy-Preserving Federated Evaluation using Zero-Knowledge Proofs, https://arxiv.org/abs/2507.11649
- Efe Bozkir and S\"uleyman \"Ozdel and Mengdi Wang and Brendan David-John and Hong Gao and Kevin Butler and Eakta Jain and Enkelejda Kasneci, 18 Jul 2025, Eye-tracked Virtual Reality: A Comprehensive Survey on Methods and Privacy Challenges, https://arxiv.org/abs/2305.14080
- Matteo Boglioni and Terrance Liu and Andrew Ilyas and Zhiwei Steven Wu, 21 Jul 2025, Optimizing Canaries for Privacy Auditing with Metagradient Descent, https://arxiv.org/abs/2507.15836
- Wenxuan Zeng, Tianshi Xu, Yi Chen, Yifan Zhou, Mingzhe Zhang, Jin Tan, Cheng Hong, Meng Li, 19 Jul 2025, Towards Efficient Privacy-Preserving Machine Learning: A Systematic Review from Protocol, Model, and System Perspectives, https://arxiv.org/abs/2507.14519
- Juntao Tan, Lan Zhang, Zhonghao Hu, Kai Yang, Peng Ran, Bo Li, 19 Jul 2025, VMask: Tunable Label Privacy Protection for Vertical Federated Learning via Layer Masking, https://arxiv.org/abs/2507.14629
- Khoa Nguyen, Tanveer Khan, Antonis Michalas, 20 Jul 2025, A Privacy-Centric Approach: Scalable and Secure Federated Learning Enabled by Hybrid Homomorphic Encryption, https://arxiv.org/abs/2507.14853
- Tanusree Sharma, Yu-Yun Tseng, Lotus Zhang, Ayae Ide, Kelly Avery Mack, Leah Findlater, Danna Gurari, Yang Wang, 19 Jul 2025, "Before, I Asked My Mom, Now I Ask ChatGPT": Visual Privacy Management with Generative AI for Blind and Low-Vision People, https://arxiv.org/abs/2507.00286
- Wenkai Li, Liwen Sun, Zhenxiang Guan, Xuhui Zhou, Maarten Sap, 11 Aug 2025, 1-2-3 Check: Enhancing Contextual Privacy in LLM via Multi-Agent Reasoning, https://arxiv.org/abs/2508.07667
- Andrey Sidorenko and Paul Tiwald, 8 Aug 2025, Privacy-Preserving Tabular Synthetic Data Generation Using TabularARGN, https://arxiv.org/abs/2508.06647
- Yueyang Quan, Chang Wang, Shengjie Zhai, Minghong Fang, Zhuqing Liu, 10 Aug 2025, Enhancing Privacy in Decentralized Min-Max Optimization: A Differentially Private Approach, https://arxiv.org/abs/2508.07505
- Chenchen Lin, Xuehe Wang, 11 Aug 2025, Multi-Hop Privacy Propagation for Differentially Private Federated Learning in Social Networks, https://arxiv.org/abs/2508.07676
- Juan Zambrano, Cl\'ement Contet, Jairo Gudi\~no, Felipe Garrido-Lucero, Umberto Grandi, Cesar A Hidalgo, 7 Aug 2025, Leveraging LLMs for Privacy-Aware Predictions in Participatory Budgeting, https://arxiv.org/abs/2508.06577
- William Zerong Wang and Dongfang Zhao, 9 Aug 2025, Balancing Privacy and Efficiency: Music Information Retrieval via Additive Homomorphic Encryption, https://arxiv.org/abs/2508.07044
- Dawood Wasif, Dian Chen, Sindhuja Madabushi, Nithin Alluru, Terrence J. Moore, Jin-Hee Cho, 9 Aug 2025, Empirical Analysis of Privacy-Fairness-Accuracy Trade-offs in Federated Learning: A Step Towards Responsible AI, https://arxiv.org/abs/2503.16233
- Xingke Yang and Liang Li and Zhiyi Wan and Sicong Li and Xiaoqi Qi and Jiang Liu and Tomoaki Ohtsuki and Xin Fu and Miao Pan, 9 Aug 2025, PAE MobiLLM: Privacy-Aware and Efficient LLM Fine-Tuning on the Mobile Device via Additive Side-Tuning, https://arxiv.org/abs/2507.01216
- Kaveen Hiniduma, Zilinghan Li, Aditya Sinha, Ravi Madduri, Suren Byna, 11 Aug 2025, CADRE: Customizable Assurance of Data Readiness in Privacy-Preserving Federated Learning, https://arxiv.org/abs/2505.23849
- Md Rakibul Hasan, Md Zakir Hossain, Aneesh Krishna, Shafin Rahman, Tom Gedeon, 9 Aug 2025, TFMPathy: Tabular Foundation Model for Privacy-Aware, Generalisable Empathy Detection from Videos, https://arxiv.org/abs/2504.10808
- Nomaan A. Kherani, Urbashi Mitra, 26 Jul 2025, ModShift: Model Privacy via Designed Shifts, https://arxiv.org/abs/2507.20060
- Yaxin Xiao and Qingqing Ye and Li Hu and Huadi Zheng and Haibo Hu and Zi Liang and Haoyang Li and Yijie Jiao, 28 Jul 2025, Reminiscence Attack on Residuals: Exploiting Approximate Machine Unlearning for Privacy, https://arxiv.org/abs/2507.20573
- Ivoline Ngong, Swanand Kadhe, Hao Wang, Keerthiram Murugesan, Justin D. Weisz, Amit Dhurandhar, Karthikeyan Natesan Ramamurthy, 28 Jul 2025, Protecting Users From Themselves: Safeguarding Contextual Privacy in Interactions with Conversational Agents, https://arxiv.org/abs/2502.18509
- Abdullah Al Siam and Sadequzzaman Shohan, 17 May 2025, Privacy-Preserving AI for Encrypted Medical Imaging: A Framework for Secure Diagnosis and Learning, https://arxiv.org/abs/2507.21060
- Chenhao Fang, Yanqing Peng, Rajeev Rao, Matt Sarmiento, Wendy Summer, Arya Pudota, Alex Goncalves, Jordi Mola, Herv\'e Robert, 23 Jul 2025, Privacy Artifact ConnecTor (PACT): Embedding Enterprise Artifacts for Compliance AI Agents, https://arxiv.org/abs/2507.21142
- Yuetian Chen, Zhiqi Wang, Nathalie Baracaldo, Swanand Ravindra Kadhe, Lei Yu, 31 Jul 2025, Evaluating the Dynamics of Membership Privacy in Deep Learning, https://arxiv.org/abs/2507.23291
- Abhishek Sawaika, Swetang Krishna, Tushar Tomar, Durga Pritam Suggisetti, Aditi Lal, Tanmaya Shrivastav, Nouhaila Innan, Muhammad Shafique, 15 Jul 2025, A Privacy-Preserving Federated Framework with Hybrid Quantum-Enhanced Learning for Financial Fraud Detection, https://arxiv.org/abs/2507.22908
- Jiajie He, Yuechun Gu, Keke Chen, 24 Jul 2025, RecPS: Privacy Risk Scoring for Recommender Systems, https://arxiv.org/abs/2507.18365
- Shreyansh Pathak, Sonu Shreshtha, Richa Singh, Mayank Vatsa, 29 Jul 2025, Quantum-Inspired Audio Unlearning: Towards Privacy-Preserving Voice Biometrics, https://arxiv.org/abs/2507.22208
- Xiaojin Zhang, Wei Chen, 30 Jul 2025, Bridging Privacy and Robustness for Trustworthy Machine Learning, https://arxiv.org/abs/2403.16591
- Javier Mu\~noz-Haro, Ruben Tolosana, Ruben Vera-Rodriguez, Aythami Morales, Julian Fierrez, 1 Aug 2025, FakeIDet: Exploring Patches for Privacy-Preserving Fake ID Detection, https://arxiv.org/abs/2504.07761
- Tianpei Lu, Bingsheng Zhang, Lekun Peng, Bowen Zheng, Lichun Li, Kui Ren, 3 Aug 2025, Privacy-Preserving Inference for Quantized BERT Models, https://arxiv.org/abs/2508.01636
- Runkai Zheng, Vishnu Asutosh Dasu, Yinong Oliver Wang, Haohan Wang, Fernando De la Torre, 3 Aug 2025, Improving Noise Efficiency in Privacy-preserving Dataset Distillation, https://arxiv.org/abs/2508.01749
- Jan Schuchardt, Mina Dalirrooyfard, Jed Guzelkabaagac, Anderson Schneider, Yuriy Nevmyvaka, Stephan G\"unnemann, 4 Aug 2025, Privacy Amplification by Structured Subsampling for Deep Differentially Private Time Series Forecasting, https://arxiv.org/abs/2502.02410
- Xinwei Liu, Xiaojun Jia, Yuan Xun, Simeng Qin, Xiaochun Cao, 5 Aug 2025, GeoShield: Safeguarding Geolocation Privacy from Vision-Language Models via Adversarial Perturbations, https://arxiv.org/abs/2508.03209
- Mengyu Zhang, Zhuotao Liu, Jingwen Huang, Xuanqi Liu, 30 Jul 2025, Agentic Privacy-Preserving Machine Learning, https://arxiv.org/abs/2508.02836
- Xin Yang, Omid Ardakanian, 5 Aug 2025, PrivDiffuser: Privacy-Guided Diffusion Model for Data Obfuscation in Sensor Networks, https://arxiv.org/abs/2412.14499
- Chongyu Bao, Ruimin Dai, Yangbo Shen, Runyang Jian, Jinghan Zhang, Xiaolan Liu, Kunpeng Liu, 6 Aug 2025, Galaxy: A Cognition-Centered Framework for Proactive, Privacy-Preserving, and Self-Evolving LLM Agents, https://arxiv.org/abs/2508.03991
- Dhruv Sarkar, Nishant Pandey, Sayak Ray Chowdhury, 5 Aug 2025, DP-NCB: Privacy Preserving Fair Bandits, https://arxiv.org/abs/2508.03836
- Ajesh Koyatan Chathoth, Shuhao Yu, Stephen Lee, 6 Aug 2025, Dynamic User-controllable Privacy-preserving Few-shot Sensing Framework, https://arxiv.org/abs/2508.03989
- Haoran Niu and K. Suzanne Barber, 6 Aug 2025, Privacy Risk Predictions Based on Fundamental Understanding of Personal Data and an Evolving Threat Landscape, https://arxiv.org/abs/2508.04542
- Yubo Wang and Min Tang and Nuo Shen and Shujie Cui and Weiqing Wang, 20 Jul 2025, Privacy Risks of LLM-Empowered Recommender Systems: An Inversion Attack Perspective, https://arxiv.org/abs/2508.03703
- Fardis Nadimi, Payam Abdisarabshali, Kasra Borazjani, Jacob Chakareski, Seyyedali Hosseinalipour, 5 Aug 2025, Multi-Modal Multi-Task Federated Foundation Models for Next-Generation Extended Reality Systems: Towards Privacy-Preserving Distributed Intelligence in AR/VR/MR, https://arxiv.org/abs/2506.05683
- Chengxi Li, Ming Xiao, Mikael Skoglund, 6 Aug 2025, Adaptive Coded Federated Learning: Privacy Preservation and Straggler Mitigation, https://arxiv.org/abs/2403.14905
- Haotian Ma, Lin Gu, Siyi Wu, Yingying Zhu, 6 Aug 2025, Computation-Efficient and Recognition-Friendly 3D Point Cloud Privacy Protection, https://arxiv.org/abs/2503.15818
- Suqing Liu, Xuan Bi, Tianxi Li, 7 Aug 2025, GRAND: Graph Release with Assured Node Differential Privacy, https://arxiv.org/abs/2507.00402
- Ce Na, Kai Yang, Dengzhao Fang, Yu Li, Jingtong Gao, Chengcheng Zhu, Jiale Zhang, Xiaobing Sun, Yi Chang, 8 Aug 2025, Graph Federated Learning for Personalized Privacy Recommendation, https://arxiv.org/abs/2508.06208
- Alejandro Moreno R., Desale Fentaw, Samuel Palmer, Ra\'ul Salles de Padua, Ninad Dixit, Samuel Mugel, Roman Or\'us, Manuel Radons, Josef Menter, and Ali Abedi, 8 Aug 2025, Synthetic Data Generation and Differential Privacy using Tensor Networks' Matrix Product States (MPS), https://arxiv.org/abs/2508.06251
- Junhyeog Yun, Minui Hong, Gunhee Kim, 8 Aug 2025, FedMeNF: Privacy-Preserving Federated Meta-Learning for Neural Fields, https://arxiv.org/abs/2508.06301
- Zhihao Yao, Yuxuan Gu, Xiachong Feng, Weitao Ma, Bo Li, Xiaocheng Feng, 8 Aug 2025, Adaptive Backtracking for Privacy Protection in Large Language Models, https://arxiv.org/abs/2508.06087
- Yuzhou Nie, Zhun Wang, Ye Yu, Xian Wu, Xuandong Zhao, Wenbo Guo, Dawn Song, 8 Aug 2025, LeakAgent: RL-based Red-teaming Agent for LLM Privacy Leakage, https://arxiv.org/abs/2412.05734
- Zane Witherspoon, Thet Mon Aye, YingYing Hao, 12 Aug 2025, Can We Trust AI to Govern AI? Benchmarking LLM Performance on Privacy and AI Governance Exams, https://arxiv.org/abs/2508.09036
- Ratun Rahman, 12 Aug 2025, Federated Learning: A Survey on Privacy-Preserving Collaborative Intelligence, https://arxiv.org/abs/2504.17703
- Abdolazim Rezaei, Mehdi Sookhak, Mahboobeh Haghparast, 7 Aug 2025, RL-MoE: An Image-Based Privacy Preserving Approach In Intelligent Transportation System, https://arxiv.org/abs/2508.09186
- Nick Oh, Giorgos D. Vrakas, Si\^an J. M. Brooke, Sasha Morini\`ere, Toju Duke, 12 Aug 2025, PETLP: A Privacy-by-Design Pipeline for Social Media Data in AI Research, https://arxiv.org/abs/2508.09232
- Zhifan Luo, Shuo Shao, Su Zhang, Lijing Zhou, Yuke Hu, Chenxu Zhao, Zhihao Liu, Zhan Qin, 13 Aug 2025, Shadow in the Cache: Unveiling and Mitigating Privacy Risks of KV-cache in LLM Inference, https://arxiv.org/abs/2508.09442
- Javier Mu\~noz-Haro and Ruben Tolosana and Ruben Vera-Rodriguez and Aythami Morales and Julian Fierrez, 14 Aug 2025, Privacy-Aware Detection of Fake Identity Documents: Methodology, Benchmark, and Improved Detection Methods (FakeIDet2), https://arxiv.org/abs/2508.11716
- Xiaojin Zhang, Mingcong Xu, Yiming Li, Wei Chen, Qiang Yang, 16 Aug 2025, Deciphering the Interplay between Attack and Protection Complexity in Privacy-Preserving Federated Learning, https://arxiv.org/abs/2508.11907
- Jinyu Lu, Xinrong Sun, Yunting Tao, Tong Ji, Fanyu Kong, Guoqiang Yang, 18 Aug 2025, Efficient and Verifiable Privacy-Preserving Convolutional Computation for CNN Inference with Untrusted Clouds, https://arxiv.org/abs/2508.12832
- Daniel M. Jimenez-Gutierrez, Yelizaveta Falkouskaya, Jose L. Hernandez-Ramos, Aris Anagnostopoulos, Ioannis Chatzigiannakis, Andrea Vitaletti, 19 Aug 2025, On the Security and Privacy of Federated Learning: A Survey with Attacks, Defenses, Frameworks, Applications, and Future Directions, https://arxiv.org/abs/2508.13730
- Salman Habib, Remi Chou, Taejoon Kim, 21 Aug 2025, Stabilization of Perturbed Loss Function: Differential Privacy without Gradient Noise, https://arxiv.org/abs/2508.15523
- Michael Sun, Tai Vu, Andrew Wang, 12 Aug 2025, Privacy Preserving Inference of Personalized Content for Out of Matrix Users, https://arxiv.org/abs/2508.14905
- Ruyi Ding, Tianhong Xu, Xinyi Shen, Aidong Adam Ding, Yunsi Fei, 20 Aug 2025, MoEcho: Exploiting Side-Channel Attacks to Compromise User Privacy in Mixture-of-Experts LLMs, https://arxiv.org/abs/2508.15036
- Aishik Mandal, Tanmoy Chakraborty, Iryna Gurevych, 22 Aug 2025, Towards Privacy-aware Mental Health AI Models: Advances, Challenges, and Opportunities, https://arxiv.org/abs/2502.00451
- Muhammet Anil Yagiz, Zeynep Sude Cengiz, Polat Goktas, 24 Aug 2025, MetaFed: Advancing Privacy, Performance, and Sustainability in Federated Metaverse Systems, https://arxiv.org/abs/2508.17341
- GodsGift Uzor, Hasan Al-Qudah, Ynes Ineza, Abdul Serwadda, 22 Aug 2025, Guarding Your Conversations: Privacy Gatekeepers for Secure Interactions with Cloud-Based AI Models, https://arxiv.org/abs/2508.16765
- Jiale Liu, Jiahao Zhang, Suhang Wang, 24 Aug 2025, Exposing Privacy Risks in Graph Retrieval-Augmented Generation, https://arxiv.org/abs/2508.17222
- Carlos Soto, 23 Aug 2025, Rao Differential Privacy, https://arxiv.org/abs/2508.17135
- Xiaoyu Luo, Qiongxiu Li, 22 Aug 2025, DeMem: Privacy-Enhanced Robust Adversarial Learning via De-Memorization, https://arxiv.org/abs/2412.05767
- Nicolas Johansson (1), Tobias Olsson (1), Daniel Nilsson (2), Johan \"Ostman (2), Fazeleh Hoseini (2) ((1) Chalmers University of Technology, (2) AI Sweden), 4 Sep 2025, Privacy Risks in Time Series Forecasting: User- and Record-Level Membership Inference, https://arxiv.org/abs/2509.04169
- Qifeng Tan, Shusen Yang, Xuebin Ren, Yikai Zhang (Xi'an Jiaotong University), 4 Sep 2025, Rethinking Layer-wise Gaussian Noise Injection: Bridging Implicit Objectives and Privacy Budget Allocation, https://arxiv.org/abs/2509.04232
- Yaohong Yang, Aki Rehn, Sammie Katt, Antti Honkela, Samuel Kaski, 4 Sep 2025, An Interactive Framework for Finding the Optimal Trade-off in Differential Privacy, https://arxiv.org/abs/2509.04290
- Shokichi Takakura, Seng Pei Liew, Satoshi Hasegawa, 5 Sep 2025, Optimal Variance and Covariance Estimation under Differential Privacy in the Add-Remove Model and Beyond, https://arxiv.org/abs/2509.04919
- Zijian Wang, Wei Tong, Tingxuan Han, Haoyu Chen, Tianling Zhang, Yunlong Mao, Sheng Zhong, 5 Sep 2025, On Evaluating the Poisoning Robustness of Federated Learning under Local Differential Privacy, https://arxiv.org/abs/2509.05265
- Francesco Diana, Andr\'e Nusser, Chuan Xu, Giovanni Neglia, 5 Sep 2025, Cutting Through Privacy: A Hyperplane-Based Data Reconstruction Attack in Federated Learning, https://arxiv.org/abs/2505.10264
- Jiahao Xu, Rui Hu, Olivera Kotevska, 5 Sep 2025, Optimal Client Sampling in Federated Learning with Client-Level Heterogeneous Differential Privacy, https://arxiv.org/abs/2505.13655
- Yang Li, Hanjie Wang, Yuanzheng Li, Jiazheng Li, Zhaoyang Dong, 24 Aug 2025, ZTFed-MAS2S: A Zero-Trust Federated Learning Framework with Verifiable Privacy and Trust-Aware Aggregation for Wind Power Data Imputation, https://arxiv.org/abs/2508.18318
- Zhibo Xu, Jianhao Zhu, Jingwen Xu, Changze Lv, Zisu Huang, Xiaohua Wang, Muling Wu, Qi Qian, Xiaoqing Zheng, Xuanjing Huang, 26 Aug 2025, Enhancing Model Privacy in Federated Learning with Random Masking and Quantization, https://arxiv.org/abs/2508.18911
- Joshua Lee, Ali Arastehfard, Weiran Liu, Xuegang Ban, Yuan Hong, 26 Aug 2025, SecureV2X: An Efficient and Privacy-Preserving System for Vehicle-to-Everything (V2X) Applications, https://arxiv.org/abs/2508.19115
- Yusi Wei, Hande Y. Benson, and Muge Capan, 25 Aug 2025, An Analytical Approach to Privacy and Performance Trade-Offs in Healthcare Data Sharing, https://arxiv.org/abs/2508.18513
- Shaojie Bai, Mohammad Sadegh Talebi, Chengcheng Zhao, Peng Cheng, and Jiming Chen, 26 Aug 2025, Secure Reinforcement Learning via Shuffle Privacy Model, https://arxiv.org/abs/2411.11647
- Mahdi Haghifam, Adam Smith, Jonathan Ullman, 26 Aug 2025, The Sample Complexity of Membership Inference and Privacy Auditing, https://arxiv.org/abs/2508.19458
- Zhan Shi, Yefeng Yuan, Yuhong Liu, Liang Cheng, Yi Fang, 25 Aug 2025, RL-Finetuned LLMs for Privacy-Preserving Synthetic Rewriting, https://arxiv.org/abs/2508.19286
- Grzegorz Skorupko, Fotios Avgoustidis, Carlos Mart\'in-Isla, Lidia Garrucho, Dimitri A. Kessler, Esmeralda Ruiz Pujadas, Oliver D\'iaz, Maciej Bobowicz, Katarzyna Gwo\'zdziewicz, Xavier Bargall\'o, Paulius Jaru\v{s}evi\v{c}ius, Richard Osuala, Kaisar Kushibar and Karim Lekadir, 28 Aug 2025, Federated nnU-Net for Privacy-Preserving Medical Image Segmentation, https://arxiv.org/abs/2503.02549
- Joshua Ward, Chi-Hua Wang, Guang Cheng, 28 Aug 2025, Privacy Auditing Synthetic Data Release through Local Likelihood Attacks, https://arxiv.org/abs/2508.21146
- Tobias Hyrup, Emmanouil Panagiotou, Arjun Roy, Arthur Zimek, Eirini Ntoutsi, Peter Schneider-Kamp, 29 Aug 2025, Achieving Hilbert-Schmidt Independence Under R\'enyi Differential Privacy for Fair and Private Data Generation, https://arxiv.org/abs/2508.21815
- Masahiro Hayashitani, Junki Mori, and Isamu Teranishi, 29 Aug 2025, Survey of Privacy Threats and Countermeasures in Federated Learning, https://arxiv.org/abs/2402.00342
- Timur Sattarov, Marco Schreyer, Damian Borth, 29 Aug 2025, Federated Diffusion Modeling with Differential Privacy for Tabular Data Synthesis, https://arxiv.org/abs/2412.16083
- Rui Zhao, Vladyslav Melnychuk, Jun Zhao, Jesse Wright, Nigel Shadbolt, 1 Sep 2025, An LLM-enabled semantic-centric framework to consume privacy policies, https://arxiv.org/abs/2509.01716
- Arun Vignesh Malarkkan, Haoyue Bai, Anjali Kaushik, and Yanjie Fu, 31 Aug 2025, DELTA: Variational Disentangled Learning for Privacy-Preserving Data Reprogramming, https://arxiv.org/abs/2509.00693
- Wei Huang, Anda Cheng, Zhao Zhang, Yinggui Wang, 1 Sep 2025, DPF-CM: A Data Processing Framework with Privacy-Preserving Vector Databases for Chinese Medical LLMs Training and Deployment, https://arxiv.org/abs/2509.01354
- Yi Yin, Guangquan Zhang, Hua Zuo, and Jie Lu, 2 Sep 2025, Privacy-Utility Trade-off in Data Publication: A Bilevel Optimization Framework with Curvature-Guided Perturbation, https://arxiv.org/abs/2509.02048
- Narasimha Raghavan Veeraragavan, Jan Franz Nyg{\aa}rd, 30 Aug 2025, Federated Survival Analysis with Node-Level Differential Privacy: Private Kaplan-Meier Curves, https://arxiv.org/abs/2509.00615
- Honghui Xu, Kaiyang Li, Wei Chen, Danyang Zheng, Zhiyuan Li, Zhipeng Cai, 2 Sep 2025, A Survey: Towards Privacy and Security in Mobile Large Language Models, https://arxiv.org/abs/2509.02411
- Jianwei Wang, Chengming Shi, Junyao Yang, Haoran Li, Qianli Ma, Huiping Zhuang, Cen Chen and Ziqian Zeng, 31 Aug 2025, RewardDS: Privacy-Preserving Fine-Tuning for Large Language Models via Reward Driven Data Synthesis, https://arxiv.org/abs/2502.18517
- Moontaha Nishat Chowdhury, Andr\'e Bauer, Minxuan Zhou, 3 Sep 2025, Efficient Privacy-Preserving Recommendation on Sparse Data using Fully Homomorphic Encryption, https://arxiv.org/abs/2509.03024
- Napsu Karmitsa, Antti Airola, Tapio Pahikkala, Tinja Pitk\"am\"aki, 3 Sep 2025, A Comprehensive Guide to Differential Privacy: From Theory to User Expectations, https://arxiv.org/abs/2509.03294
- Syomantak Chaudhuri, Thomas A. Courtade, 2 Sep 2025, Managing Correlations in Data and Privacy Demand, https://arxiv.org/abs/2509.02856
- Ayoub Si-ahmed, Mohammed Ali Al-Garadi, Narhimene Boustia, 2 Sep 2025, Explainable Machine Learning-Based Security and Privacy Protection Framework for Internet of Medical Things Systems, https://arxiv.org/abs/2403.09752
- Cheng Qian, Hainan Zhang, Yongxin Tong, Hong-Wei Zheng, Zhiming Zheng, 8 Sep 2025, HyFedRAG: A Federated Retrieval-Augmented Generation Framework for Heterogeneous and Privacy-Sensitive Data, https://arxiv.org/abs/2509.06444
- Ismail Hossain, Sai Puppala, Sajedul Talukder, Md Jahangir Alam, 4 Sep 2025, AI-in-the-Loop: Privacy Preserving Real-Time Scam Detection and Conversational Scambaiting by Leveraging LLMs and Federated Learning, https://arxiv.org/abs/2509.05362
- Abdul Rehman, Are D{\ae}hlen, Ilona Heldal, Jerry Chun-wei Lin, 4 Sep 2025, Privacy Preservation and Identity Tracing Prevention in AI-Driven Eye Tracking for Interactive Learning Environments, https://arxiv.org/abs/2509.05376
- Jennifer King, Kevin Klyman, Emily Capstick, Tiffany Saade, Victoria Hsieh, 5 Sep 2025, User Privacy and Large Language Models: An Analysis of Frontier Developers' Privacy Policies, https://arxiv.org/abs/2509.05382
- Waris Gill, Natalie Isak and Matthew Dressman, 6 Sep 2025, Cross-Service Threat Intelligence in LLM Services using Privacy-Preserving Fingerprints, https://arxiv.org/abs/2509.05608
- Ikhlasse Badidi, Nouhaila El Khiyaoui, Aya Riany, Badr Ben Elallid, Amine Abouaomar, 30 Aug 2025, Privacy-Preserving Offloading for Large Language Models in 6G Vehicular Networks, https://arxiv.org/abs/2509.05320
- Qin Yang, Nicholas Stout, Meisam Mohammady, Han Wang, Ayesha Samreen, Christopher J Quinn, Yan Yan, Ashish Kundu, Yuan Hong, 8 Sep 2025, PLRV-O: Advancing Differentially Private Deep Learning via Privacy Loss Random Variable Optimization, https://arxiv.org/abs/2509.06264
- Wenhan Dong, Chao Lin, Xinlei He, Shengmin Xu, Xinyi Huang, 6 Sep 2025, Privacy-Preserving Federated Learning via Homomorphic Adversarial Networks, https://arxiv.org/abs/2412.01650
- Nicol\`o Romandini, Carlo Mazzocca, Kai Otsuki, Rebecca Montanari, 8 Sep 2025, SoK: Security and Privacy of AI Agents for Blockchain, https://arxiv.org/abs/2509.07131
- Tom\'as Gonz\'alez, Mateo Dulce-Rubio, Aaditya Ramdas, M\'onica Ribero, 8 Sep 2025, Sequentially Auditing Differential Privacy, https://arxiv.org/abs/2509.07055
- Hailong Yang, Renhuo Zhao, Guanjin Wang and Zhaohong Deng, 12 Sep 2025, GAMA: A General Anonymizing Multi-Agent System for Privacy Preservation Enhanced by Domain Rules and Disproof Method, https://arxiv.org/abs/2509.10018
- Francisco Javier Esono Nkulu Andong and Qi Min, 12 Sep 2025, Federated Multi-Agent Reinforcement Learning for Privacy-Preserving and Energy-Aware Resource Management in 6G Edge Networks, https://arxiv.org/abs/2509.10163
- Nojan Sheybani, Alessandro Pegoraro, Jonathan Knauer, Phillip Rieger, Elissa Mollakuqe, Farinaz Koushanfar, Ahmad-Reza Sadeghi, 11 Sep 2025, ZORRO: Zero-Knowledge Robustness and Privacy for Split Learning (Full Version), https://arxiv.org/abs/2509.09787
- Zhanhong Jiang, Md Zahid Hasan, Nastaran Saadati, Aditya Balu, Chao Liu, Soumik Sarkar, 12 Sep 2025, Balancing Utility and Privacy: Dynamically Private SGD with Random Projection, https://arxiv.org/abs/2509.09485
- Vincent C. M\"uller, 30 Aug 2025, Deep opacity and AI: A threat to XAI and to privacy protection mechanisms, https://arxiv.org/abs/2509.08835
- Honglan Yu, Yibin Wang, Feifei Dai, Dong Liu, Haihui Fan, Xiaoyan Gu, 11 Sep 2025, Towards Confidential and Efficient LLM Inference with Dual Privacy Protection, https://arxiv.org/abs/2509.09091
- Honghui Xu, Shiva Shrestha, Wei Chen, Zhiyuan Li, Zhipeng Cai, 11 Sep 2025, DP-FedLoRA: Privacy-Enhanced Federated Fine-Tuning for On-Device Large Language Models, https://arxiv.org/abs/2509.09097
- Osama Zafar, Mina Namazi, Yuqiao Xu, Youngjin Yoo, Erman Ayday, 11 Sep 2025, A User-Centric, Privacy-Preserving, and Verifiable Ecosystem for Personal Data Management and Utilization, https://arxiv.org/abs/2506.22606
- Pol G. Recasens and \'Ad\'am Horv\'ath and Alberto Gutierrez-Torre and Jordi Torres and Josep Ll.Berral and Bal\'azs Pej\'o, 19 Sep 2025, FRIDA: Free-Rider Detection using Privacy Attacks, https://arxiv.org/abs/2410.05020
- Hilda Hadan, Reza Hadi Mogavi, Leah Zhang-Kennedy, Lennart E. Nacke, 18 Sep 2025, Who is Responsible When AI Fails? Mapping Causes, Entities, and Consequences of AI Privacy and Ethical Incidents, https://arxiv.org/abs/2504.01029
- Bihao Zhan, Jie Zhou, Junsong Li, Yutao Yang, Shilian Chen, Qianjun Pan, Xin Li, Wen Wu, Xingjiao Wu, Qin Chen, Hang Yan, Liang He, 16 Sep 2025, Forget What's Sensitive, Remember What Matters: Token-Level Differential Privacy in Memory Sculpting for Continual Learning, https://arxiv.org/abs/2509.12958
- Binquan Guo, Junteng Cao, Marie Siew, Binbin Chen, Tony Q. S. Quek, Zhu Han, 5 Sep 2025, Accelerating Privacy-Preserving Federated Learning in Large-Scale LEO Satellite Systems, https://arxiv.org/abs/2509.12222
- Rodrigo Tertulino, 3 Sep 2025, Privacy-Preserving Personalization in Education: A Federated Recommender System for Student Performance Prediction, https://arxiv.org/abs/2509.10516
- Muhammad H. Ashiq, Peter Triantafillou, Hung Yun Tseng, Grigoris G. Chrysos, 15 Sep 2025, Inducing Uncertainty for Test-Time Privacy, https://arxiv.org/abs/2509.11625
- Madhava Gaikwad, 10 Sep 2025, AVEC: Bootstrapping Privacy for Local LLMs, https://arxiv.org/abs/2509.10561
- Fardin Jalil Piran, Zhiling Chen, Yang Zhang, Qianyu Zhou, Jiong Tang, Farhad Imani, 12 Sep 2025, Privacy-Preserving Decentralized Federated Learning via Explainable Adaptive Differential Privacy, https://arxiv.org/abs/2509.10691
- Hyeju Shin, Vincent-Daniel, Kyudan Jung, Seongwon Yun, 13 Sep 2025, Fast Fourier Transform-Based Spectral and Temporal Gradient Filtering for Differential Privacy, https://arxiv.org/abs/2505.04468
- Xingchen Wang, Feijie Wu, Chenglin Miao, Tianchun Li, Haoyu Hu, Qiming Cao, Jing Gao, Lu Su, 18 Sep 2025, Towards Privacy-Preserving and Heterogeneity-aware Split Federated Learning via Probabilistic Masking, https://arxiv.org/abs/2509.14603
- Nobin Sarwar, Shubhashis Roy Dipta, 16 Sep 2025, FedMentor: Domain-Aware Differential Privacy for Heterogeneous Federated LLMs in Mental Health, https://arxiv.org/abs/2509.14275
- Yuntao Du, Zitao Li, Ninghui Li, Bolin Ding, 16 Sep 2025, Beyond Data Privacy: New Privacy Risks for Large Language Models, https://arxiv.org/abs/2509.14278
- Vaidehi Patil, Elias Stengel-Eskin, Mohit Bansal, 16 Sep 2025, The Sum Leaks More Than Its Parts: Compositional Privacy Risks and Mitigations in Multi-Agent Collaboration, https://arxiv.org/abs/2509.14284
- Ramazan Yener, Guan-Hung Chen, Ece Gumusel, Masooda Bashir, 18 Sep 2025, Can I Trust This Chatbot? Assessing User Privacy in AI-Healthcare Chatbot Applications, https://arxiv.org/abs/2509.14581
- Linfeng Luo, Zhiqi Guo, Fengxiao Tang, Zihao Qiu, Ming Zhao, 18 Sep 2025, Federated Hypergraph Learning with Local Differential Privacy: Toward Privacy-Aware Hypergraph Structure Completion, https://arxiv.org/abs/2408.05160
- Chih Wei Ling, Chun Hei Michael Shiu, Youqi Wu, Jiande Sun, Cheuk Ting Li, Linqi Song, Weitao Xu, 18 Sep 2025, Communication-Efficient and Privacy-Adaptable Mechanism for Federated Learning, https://arxiv.org/abs/2501.12046
- Avais Jan, Qasim Zia, Murray Patterson, 9 Sep 2025, Enhancing Privacy Preservation and Reducing Analysis Time with Federated Transfer Learning in Digital Twins-based Computed Tomography Scan Analysis, https://arxiv.org/abs/2509.08018
- Bishnu Bhusal, Manoj Acharya, Ramneet Kaur, Colin Samplawski, Anirban Roy, Adam D. Cobb, Rohit Chadha, Susmit Jha, 17 Sep 2025, Privacy-Aware In-Context Learning for Large Language Models, https://arxiv.org/abs/2509.13625
- Zihou Wu (1), Yuecheng Li (1), Tianchi Liao (2), Jian Lou (2), Chuan Chen (1) ((1) School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China (2) School of Software Engineering, Sun Yat-sen University, Zhuhai, China), 17 Sep 2025, ParaAegis: Parallel Protection for Flexible Privacy-preserved Federated Learning, https://arxiv.org/abs/2509.13739
- Vijay Kumar Butte, Sujata Butte, 17 Sep 2025, Secure, Scalable and Privacy Aware Data Strategy in Cloud, https://arxiv.org/abs/2509.13627
- Ozer Ozturk, Busra Buyuktanir, Gozde Karatas Baydogmus, Kazim Yildiz, 17 Sep 2025, Differential Privacy in Federated Learning: Mitigating Inference Attacks with Randomized Response, https://arxiv.org/abs/2509.13987
More Research on AI Safety
Research papers that cover various other AI safety issues:
- J Schuett, N Dreksler, M Anderljung, 2023, Towards best practices in AGI safety and governance: A survey of expert opinion, arXiv preprint, https://arxiv.org/abs/2305.07153
- Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A. Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, Shane Legg, Nov 2017, AI Safety Gridworlds, https://arxiv.org/abs/1711.09883
- J. Schuett. Risk management in the Artificial Intelligence Act. European Journal of Risk Regulation, pages 1–19, 2023. https://arxiv.org/abs/2212.03109
- Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, Dan Mané, July 2016, Concrete Problems in AI Safety, https://arxiv.org/abs/1606.06565
- Mark O Riedl and Brent Harrison. 2018. Enter the matrix: A virtual world approach to safely interruptable autonomous systems. arXiv preprint arXiv:1703.10284, 2017 (revised Nov 2018). https://arxiv.org/abs/1703.10284v2
- M. Brundage, K. Mayer, T. Eloundou, S. Agarwal, S. Adler, G. Krueger, J. Leike, and P. Mishkin. OpenAI, 2022, Lessons learned on language model safety and misuse. https://openai.com/research/language-model-safety-and-misuse
- OpenAI, Feb 2023, Planning for AGI and beyond, https://openai.com/blog/planning-for-agi-and-beyond
- Andreas Cebulla, Zygmunt Szpak, Catherine Howell, Genevieve Knight & Sazzad Hussain, 2022, Applying ethics to AI in the workplace: the design of a scorecard for Australian workplace health and safety, Network Research, 13 May 2022, volume 38, pages919–935 (2023) https://link.springer.com/article/10.1007/s00146-022-01460-9
- Mohammad Ghavamzadeh, Marek Petrik, and Yinlam Chow. Safe policy improvement by minimizing robust baseline regret. In Advances in Neural Information Processing Systems, pages 2298–2306, 2016. https://arxiv.org/abs/1607.03842v1
- Laurent Orseau and Stuart Armstrong. Safely interruptible agents. In Uncertainty in Artificial Intelligence, pages 557–566, 2016. PDF: http://www.auai.org/uai2016/proceedings/papers/68.pdf
- Tate Ryan-Mosley, August 14, 2023, AI isn’t great at decoding human emotions. So why are regulators targeting the tech? MIT Technology Review, https://www.technologyreview.com/2023/08/14/1077788/ai-decoding-human-emotions-target-for-regulators/
- Maria Korolov, 15 May 2024, 10 things to watch out for with open source gen AI, CIO, https://www.cio.com/article/2104280/10-things-to-watch-out-for-with-open-source-gen-ai.html
- Andy Arditi, Oscar Obeso, Aaquib111, wesg, Neel Nanda, 27th Apr 2024, Refusal in LLMs is mediated by a single direction, LessWrong, https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction
- Google, Responsible Generative AI Toolkit, Feb 2024, https://ai.google.dev/responsible
- Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, Cyril Zhang, June 2023, Exposing Attention Glitches with Flip-Flop Language Modeling, https://arxiv.org/abs/2306.00946
- Jon Christian, Jan 30, 2023, CNET's Article-Writing AI Is Already Publishing Very Dumb Errors, https://futurism.com/cnet-ai-errors
- R Dubin, 2023. Disarming Steganography Attacks Inside Neural Network Models, arXiv preprint arXiv:2309.03071, https://arxiv.org/pdf/2309.03071.pdf
- Michael O'Neill, Mark Connor, 6 Jul 2023, Amplifying Limitations, Harms and Risks of Large Language Models, https://arxiv.org/abs/2307.04821
- Lucas Mearian, 14 Mar 2024, AI hallucination mitigation: two brains are better than one, https://www.computerworld.com/article/1612465/ai-hallucination-mitigation-two-brains-are-better-than-one.html
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations.)
- Laura Manduchi, Kushagra Pandey, Robert Bamler, Ryan Cotterell, Sina Däubener, Sophie Fellenz, Asja Fischer, Thomas Gärtner, Matthias Kirchler, Marius Kloft, Yingzhen Li, Christoph Lippert, Gerard de Melo, Eric Nalisnick, Björn Ommer, Rajesh Ranganath, Maja Rudolph, Karen Ullrich, Guy Van den Broeck, Julia E Vogt, Yixin Wang, Florian Wenzel, Frank Wood, Stephan Mandt, Vincent Fortuin, 28 Feb 2024, On the Challenges and Opportunities in Generative AI, https://arxiv.org/abs/2403.00025
- Zhexin Zhang, Yida Lu, Jingyuan Ma, Di Zhang, Rui Li, Pei Ke, Hao Sun, Lei Sha, Zhifang Sui, Hongning Wang, Minlie Huang, 26 Feb 2024, ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors, https://arxiv.org/abs/2402.16444, Code: https://github.com/thu-coai/shieldlm
- Peter Dizikes, December 11, 2023, MIT group releases white papers on governance of AI, MIT News, https://news.mit.edu/2023/mit-group-releases-white-papers-governance-ai-1211
- MAK Raiaan, MSH Mukta, K Fatema, NM Fahad, 2023 A Review on Large Language Models: Architectures, Applications, Taxonomies, Open Issues and Challenges, https://www.techrxiv.org/articles/preprint/A_Review_on_Large_Language_Models_Architectures_Applications_Taxonomies_Open_Issues_and_Challenges/24171183/1/files/42414054.pdf
- Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen3 Ruoxi Jia, Prateek Mittal, Peter Henderson, Oct 2023, Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! https://arxiv.org/abs/2310.03693v1 Code: https://llm-tuning-safety.github.io/
- Y Hu, J Setpal, D Zhang, J Zietek, J Lambert, 2023, BoilerBot: A Reliable Task-oriented Chatbot Enhanced with Large Language Models, https://assets.amazon.science/8c/03/80c814a749f58e73a1aeda2ff282/boilerbot-tb2-final-2023.pdf
- S Latifi, 2023, Efficient and Dependable Deep Learning Systems Ph.D. Thesis, Computer Science and Engineering, University of Michigan, https://deepblue.lib.umich.edu/bitstream/handle/2027.42/176548/salar_1.pdf?sequence=1
- N. Soares. 2023, Comments on OpenAI’s “Planning for AGI and beyond”. https://www.lesswrong.com/posts/uxnjXBwr79uxLkifG
- K Ramesh, A Chavan, S Pandit, 2023, A Comparative Study on the Impact of Model Compression Techniques on Fairness in Language Models, Microsoft Research, https://aclanthology.org/2023.acl-long.878.pdf https://www.microsoft.com/en-us/research/uploads/prod/2023/07/3687_Paper.pdf
- David Spuler, March 2024, Chapter 43. Overview of AI Research, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Rithesh Murthy, Liangwei Yang, Juntao Tan, Tulika Manoj Awalgaonkar, Yilun Zhou, Shelby Heinecke, Sachin Desai, Jason Wu, Ran Xu, Sarah Tan, Jianguo Zhang, Zhiwei Liu, Shirley Kokane, Zuxin Liu, Ming Zhu, Huan Wang, Caiming Xiong, Silvio Savarese, 12 Jun 2024, MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases, https://arxiv.org/abs/2406.10290
- Shicheng Xu, Liang Pang, Mo Yu, Fandong Meng, Huawei Shen, Xueqi Cheng, Jie Zhou, 12 Jun 2024 (v2), Unsupervised Information Refinement Training of Large Language Models for Retrieval-Augmented Generation, https://arxiv.org/abs/2402.18150 (Analysis about how LLMs can mishandle information retrieved from a datastore and how to make LLMs better at handling RAG information using a specialized training regime.)
- OpenAI, Moderation: Learn how to build moderation into your AI applications, 2024, https://platform.openai.com/docs/guides/moderation
- Azure, 06/13/2024, Content filtering, https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/content-filter?tabs=warning%2Cpython
- Yu Wang, Xiaogeng Liu, Yu Li, Muhao Chen, Chaowei Xiao, 14 Mar 2024, AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting, https://arxiv.org/abs/2403.09513 Code: https://github.com/rain305f/AdaShield
- Jinhwa Kim, Ali Derakhshan, Ian G. Harris, 31 Oct 2023, Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield, https://arxiv.org/abs/2311.00172
- Apple, June 2024, Introducing Apple’s On-Device and Server Foundation Models, https://machinelearning.apple.com/research/introducing-apple-foundation-models (Apple's on-device models feature optimizations including small models, grouped query attention, 2-bit/4-bit quantization including activation quantization, shared embedding/unembedding tensors, small-ish vocabulary size of 49k, an undisclosed efficient KV cache optimization for neural engines, and layer-specific 16-bit LoRA/QLoRA adapters of size "10s of megabytes" for fine-tuned specialized model versions, also sometimes in 2-bit/4-bit, claiming speed rates of 0.6ms/token in prefill, and 30 tokens per second in decoding.)
- NVIDIA, June 2024, Nemotron-4 340B Technical Report, https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T_0.pdf (Architecture is decoder-only with GQA, SentencePiece tokenizer, causal attention masks, RoPE, 96 layers, 96 heads, 8 KV heads, 256,000 vocabulary, 18432 internal dimension, context window 4096, and uses squared RELU.)
- Frank Chung, June 23, 2024, ‘I need to go outside’: Young people ‘extremely addicted’ as Character.AI explodes, https://www.news.com.au/technology/online/internet/i-need-to-go-outside-young-people-extremely-addicted-as-characterai-explodes/news-story/5780991c61455c680f34b25d5847a341
- Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, 4 Mar 2022, Training language models to follow instructions with human feedback, https://arxiv.org/abs/2203.02155 (The original 2022 InstructGPT paper from OpenAI.)
- Valentina Alto, 2024, Chapter 12: Responsible AI, Building LLM-Powered Applications: Create intelligence apps and agents with large language models, Packt Publishing, https://www.amazon.com/Building-LLM-Apps-Intelligent-Language/dp/1835462316/
- Aarushi Kansal, Chapter 4: Guardrails and AI: Building Safe and Controllable Apps, Building Generative AI-Powered Apps: A Hands-on Guide for Developers, Apress, https://www.amazon.com/Building-Generative-AI-Powered-Apps-Hands-ebook/dp/B0CTXXP1S4/
- Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, Nouha Dziri, 26 Jun 2024, WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs, https://arxiv.org/abs/2406.18495
- Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, Lina Yao, 6 Jul 2024, AI Safety in Generative AI Large Language Models: A Survey, https://arxiv.org/abs/2407.18369
- Marko Zivkovic, Aug 06, 2024, Discovered Apple Intelligence prompts show Apple's attempt at preventing AI disaster, https://appleinsider.com/articles/24/08/06/discovered-apple-intelligence-prompts-show-apples-attempt-at-preventing-ai-disaster
- Mack DeGeurin, Aug 9, 2024, Researchers worry about AI turning humans into jerks: OpenAI safety researchers think GPT4o could influence 'social norms.', https://www.popsci.com/technology/openai-jerks/
- OpenAI, August 8, 2024 GPT-4o System Card, https://openai.com/index/gpt-4o-system-card/
- Rohin Shah, Seb Farquhar, Anca Dragan, 21st Aug 2024, AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work, https://www.alignmentforum.org/posts/79BPxvSsjzBkiSyTq/agi-safety-and-alignment-at-google-deepmind-a-summary-of
- Shayne Longpre, Stella Biderman, Alon Albalak, Hailey Schoelkopf, Daniel McDuff, Sayash Kapoor, Kevin Klyman, Kyle Lo, Gabriel Ilharco, Nay San, Maribeth Rauh, Aviya Skowron, Bertie Vidgen, Laura Weidinger, Arvind Narayanan, Victor Sanh, David Adelani, Percy Liang, Rishi Bommasani, Peter Henderson, Sasha Luccioni, Yacine Jernite, Luca Soldaini, 26 Jun 2024 (v2), The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources, https://arxiv.org/abs/2406.16746
- Thomas Mildner, Orla Cooney, Anna-Maria Meck, Marion Bartl, Gian-Luca Savino, Philip R. Doyle, Diego Garaialde, Leigh Clark, John Sloan, Nina Wenig, Rainer Malaka, Jasmin Niess, 26 Jan 2024, Listening to the Voices: Describing Ethical Caveats of Conversational User Interfaces According to Experts and Frequent Users, Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI '24), May 11--16, 2024, Honolulu, HI, USA, https://arxiv.org/abs/2401.14746 https://doi.org/https://doi.org/10.1145/3613904.3642542
- Kyle Wiggers, September 4, 2024, Ilya Sutskever’s startup, Safe Superintelligence, raises $1B, https://techcrunch.com/2024/09/04/ilya-sutskevers-startup-safe-super-intelligence-raises-1b/
- Balasubramaniam S. , Vanajaroselin Chirchi, Seifedine Kadry, Moorthy Agoramoorthy, Gururama Senthilvel P., Satheesh Kumar K., and Sivakumar T. A., Oct 2024, The Road Ahead: Emerging Trends, Unresolved Issues, and ConcludingRemarksinGenerativeAI—AComprehensiveReview, International Journal of Intelligent Systems, Volume 2024, Article ID 4013195, 38 pages, https://doi.org/10.1155/2024/4013195 https://www.researchgate.net/profile/Balasubramaniam-s-2/publication/384729387_The_Road_Ahead_Emerging_Trends_Unresolved_Issues_and_Concluding_Remarks_in_Generative_AI-A_Comprehensive_Review/links/6705560cf5eb7108c6e5d261/The-Road-Ahead-Emerging-Trends-Unresolved-Issues-and-Concluding-Remarks-in-Generative-AI-A-Comprehensive-Review.pdf
- Xinyi Zeng, Yuying Shang, Yutao Zhu, Jiawei Chen, Yu Tian, 9 Oct 2024, Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level, https://arxiv.org/abs/2410.06809
- Michael Nuñez, October 15, 2024, Anthropic just made it harder for AI to go rogue with its updated safety policy, https://venturebeat.com/ai/anthropic-just-made-it-harder-for-ai-to-go-rogue-with-its-updated-safety-policy/
- ETO, Apr 2024, The state of global AI safety research, https://eto.tech/blog/state-of-global-ai-safety-research/
- Leon Derczynski, Christopher Parisien, Nikki Pope, Michael Boone, Nov 2024, NVIDIA Approaches to AI Trust and Safety: Innovation and Tools, https://www.nvidia.com/en-us/on-demand/session/aisummitdc24-sdc1088/?playlistId=playList-c6a9450c-c790-462d-a058-0bacacd5d370
- Y. Bai et al., "Backdoor Attack and Defense on Deep Learning: A Survey," in IEEE Transactions on Computational Social Systems, doi: 10.1109/TCSS.2024.3482723. https://ieeexplore.ieee.org/abstract/document/10744415
- OpenAI, November 21, 2024, Advancing red teaming with people and AI, https://openai.com/index/advancing-red-teaming-with-people-and-ai/
- Patrick Mineault, Niccolò Zanichelli, Joanne Zichen Peng, Anton Arkhipov, Eli Bingham, Julian Jara-Ettinger, Emily Mackevicius, Adam Marblestone, Marcelo Mattar, Andrew Payne, Sophia Sanborn, Karen Schroeder, Zenna Tavares, Andreas Tolias, 27 Nov 2024, NeuroAI for AI Safety, https://arxiv.org/abs/2411.18526
- Maria Korolov and Michael Hill, 03 Dec 2024, 10 most critical LLM vulnerabilities, https://www.csoonline.com/article/575497/owasp-lists-10-most-critical-large-language-model-vulnerabilities.html
- Mayank Vatsa, Anubhooti Jain, Richa Singh, 7 Dec 2023, Adventures of Trustworthy Vision-Language Models: A Survey, https://arxiv.org/abs/2312.04231
- Yedi Zhang, Yufan Cai, Xinyue Zuo, Xiaokun Luan, Kailong Wang, Zhe Hou, Yifan Zhang, Zhiyuan Wei, Meng Sun, Jun Sun, Jing Sun, Jin Song Dong, 9 Dec 2024, The Fusion of Large Language Models and Formal Methods for Trustworthy AI Agents: A Roadmap, https://arxiv.org/abs/2412.06512
- Aditi Bodhankar, Dec 06, 2024, Content Moderation and Safety Checks with NVIDIA NeMo Guardrails, https://developer.nvidia.com/blog/content-moderation-and-safety-checks-with-nvidia-nemo-guardrails/
- Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Martín Santillán Cooper, Kieran Fraser, Giulio Zizzo, Muhammad Zaid Hameed, Mark Purcell, Michael Desmond, Qian Pan, Inge Vejsbjerg, Elizabeth M. Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R. Varshney, Prasanna Sattigeri, 10 Dec 2024, Granite Guardian, https://arxiv.org/abs/2412.07724 https://github.com/ibm-granite/granite-guardian (Open-sourcing of safety models with many capabilities.)
- Rhiannon Williams, December 31, 2024, The biggest AI flops of 2024: From chatbots dishing out illegal advice to dodgy AI-generated search results, take a look back over the year’s top AI failures. https://www.technologyreview.com/2024/12/31/1109612/biggest-worst-ai-artificial-intelligence-flops-fails-2024/
- James Manyika, Demis Hassabis, Feb 04, 2025, Responsible AI: Our 2024 report and ongoing work, https://blog.google/technology/ai/responsible-ai-2024-report-ongoing-work/
- Arjun Kharpal, Feb 6 2025, ‘Dangerous proposition’: Top scientists warn of out-of-control AI, https://www.cnbc.com/2025/02/07/dangerous-proposition-top-scientists-warn-of-out-of-control-ai.html
- Vagner Figueredo de Santana, Sara Berger, Tiago Machado, Maysa Malfiza Garcia de Macedo, Cassia Sampaio Sanctos, Lemara Williams, and Zhaoqing Wu. 2025. Can LLMs Recommend More Responsible Prompts? In Proceedings of the 30th International Conference on Intelligent User Interfaces (IUI '25). Association for Computing Machinery, New York, NY, USA, 298–313. https://doi.org/10.1145/3708359.3712137 https://dl.acm.org/doi/full/10.1145/3708359.3712137 https://dl.acm.org/doi/pdf/10.1145/3708359.3712137
- Michael Nuñez, July 15, 2025, OpenAI, Google DeepMind and Anthropic sound alarm: ‘We may be losing the ability to understand AI’, https://venturebeat.com/ai/openai-google-deepmind-and-anthropic-sound-alarm-we-may-be-losing-the-ability-to-understand-ai/ (Monitoring the text-based interim "thinking-out-loud" reasoning of models in CoT.)
- Tomek Korbak, Mikita Balesni, (and many more authors) July 2025, Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety, https://tomekkorbak.com/cot-monitorability-is-a-fragile-opportunity/cot_monitoring.pdf
- Anthropic, 13 Aug 2025, Building Safeguards for Claude, https://www.anthropic.com/news/building-safeguards-for-claude
- Gabriel J. Perin, Runjin Chen, Xuxi Chen, Nina S. T. Hirata, Zhangyang Wang, Junyuan Hong, 23 Jul 2025, LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning, https://arxiv.org/abs/2506.15606
- Zheng Hui, Yijiang River Dong, Ehsan Shareghi, Nigel Collier, 22 Jul 2025, TRIDENT: Benchmarking LLM Safety in Finance, Medicine, and Law, https://arxiv.org/abs/2507.21134
- Haonan Zhang, Dongxia Wang, Yi Liu, Kexin Chen, Jiashui Wang, Xinlei Ying, Long Liu, Wenhai Wang, 15 Aug 2025, ORFuzz: Fuzzing the "Other Side" of LLM Safety -- Testing Over-Refusal, https://arxiv.org/abs/2508.11222
- Yuan Gao, Mattia Piccinini, Korbinian Moller, Amr Alanwar, Johannes Betz, 18 Jul 2025, From Words to Collisions: LLM-Guided Evaluation and Adversarial Generation of Safety-Critical Driving Scenarios, https://arxiv.org/abs/2502.02145
- Juan Manuel Contreras, 19 Jul 2025, Automated Safety Evaluations Across 20 Large Language Models: The Aymara LLM Risk and Responsibility Matrix, https://arxiv.org/abs/2507.14719
- Haoyu Wang and Chris M. Poskitt and Jun Sun and Jiali Wei, 1 Aug 2025, Pro2Guard: Proactive Runtime Enforcement of LLM Agent Safety via Probabilistic Model Checking, https://arxiv.org/abs/2508.00500
- Avni Kothari, Patrick Vossler, Jean Digitale, Mohammad Forouzannia, Elise Rosenberg, Michele Lee, Jennee Bryant, Melanie Molina, James Marks, Lucas Zier, Jean Feng, 11 Aug 2025, When the Domain Expert Has No Time and the LLM Developer Has No Clinical Expertise: Real-World Lessons from LLM Co-Design in a Safety-Net Hospital, https://arxiv.org/abs/2508.08504
- Bing Han, Feifei Zhao, Dongcheng Zhao, Guobin Shen, Ping Wu, Yu Shi, Yi Zeng, 8 Aug 2025, Fine-Grained Safety Neurons with Training-Free Continual Projection to Reduce LLM Fine Tuning Risks, https://arxiv.org/abs/2508.09190
- Minseon Kim, Jin Myung Kwak, Lama Alssum, Bernard Ghanem, Philip Torr, David Krueger, Fazl Barez, Adel Bibi, 17 Aug 2025, Rethinking Safety in LLM Fine-tuning: An Optimization Perspective, https://arxiv.org/abs/2508.12531
- Mingxing Peng, Yuting Xie, Xusen Guo, Ruoyu Yao, Hai Yang, and Jun Ma, 17 Aug 2025, LD-Scene: LLM-Guided Diffusion for Controllable Generation of Adversarial Safety-Critical Driving Scenarios, https://arxiv.org/abs/2505.11247
- Yeonjun In, Wonjoong Kim, Sangwu Park, Chanyoung Park, 1 Aug 2025, R1-ACT: Efficient Reasoning Model Safety Alignment by Activating Safety Knowledge, https://arxiv.org/abs/2508.00324
- Yifan Wang, Runjin Chen, Bolian Li, David Cho, Yihe Deng, Ruqi Zhang, Tianlong Chen, Zhangyang Wang, Ananth Grama, Junyuan Hong, 22 Jul 2025, More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment, https://arxiv.org/abs/2504.02193
- Frederik Pahde, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek, 29 Jul 2025, Ensuring Medical AI Safety: Interpretability-Driven Detection and Mitigation of Spurious Model Behavior and Associated Data, https://arxiv.org/abs/2501.13818
- Qianli Ma, Dongrui Liu, Qian Chen, Linfeng Zhang, Jing Shao, 14 Aug 2025, LED-Merging: Mitigating Safety-Utility Conflicts in Model Merging with Location-Election-Disjoint, https://arxiv.org/abs/2502.16770
- Raviraj Joshi, Rakesh Paul, Kanishk Singla, Anusha Kamath, Michael Evans, Katherine Luna, Shaona Ghosh, Utkarsh Vaidya, Eileen Long, Sanjay Singh Chauhan, Niranjan Wartikar, 3 Aug 2025, CultureGuard: Towards Culturally-Aware Dataset and Guard Model for Multilingual Safety Applications, https://arxiv.org/abs/2508.01710
- Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhan Zhao, Hanxun Huang, Yige Li, Yutao Wu, Jiaming Zhang, Xiang Zheng, Yang Bai, Zuxuan Wu, Xipeng Qiu, Jingfeng Zhang, Yiming Li, Xudong Han, Haonan Li, Jun Sun, Cong Wang, Jindong Gu, Baoyuan Wu, Siheng Chen, Tianwei Zhang, Yang Liu, Mingming Gong, Tongliang Liu, Shirui Pan, Cihang Xie, Tianyu Pang, Yinpeng Dong, Ruoxi Jia, Yang Zhang, Shiqing Ma, Xiangyu Zhang, Neil Gong, Chaowei Xiao, Sarah Erfani, Tim Baldwin, Bo Li, Masashi Sugiyama, Dacheng Tao, James Bailey, Yu-Gang Jiang, 2 Aug 2025, Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety, https://arxiv.org/abs/2502.05206
- Francisco Munguia-Galeano, Zhengxue Zhou, Satheeshkumar Veeramani, Hatem Fakhruldeen, Louis Longley, Rob Clowes and Andrew I. Cooper, 7 Aug 2025, Chemist Eye: A Visual Language Model-Powered System for Safety Monitoring and Robot Decision-Making in Self-Driving Laboratories, https://arxiv.org/abs/2508.05148
- Chongwen Zhao and Kaizhu Huang, 1 Sep 2025, Unraveling LLM Jailbreaks Through Safety Knowledge Neurons, https://arxiv.org/abs/2509.01631
- Wenxiao Zhang, Xiangrui Kong, Conan Dewitt, Thomas Br\"aunl, Jin B. Hong, 2 Sep 2025, Enhancing Reliability in LLM-Integrated Robotic Systems: A Unified Approach to Security and Safety, https://arxiv.org/abs/2509.02163
- Taegyeong Lee, Jeonghwa Yoo, Hyoungseo Cho, Soo Yong Kim and Yunho Maeng, 30 Aug 2025, QGuard:Question-based Zero-shot Guard for Multi-modal LLM Safety, https://arxiv.org/abs/2506.12299
- Roland Pihlakas, Sruthi Kuriakose, 2 Sep 2025, BioBlue: Notable runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format, https://arxiv.org/abs/2509.02655
- Piyush Pant, 10 Sep 2025, Improving LLM Safety and Helpfulness using SFT and DPO: A Study on OPT-350M, https://arxiv.org/abs/2509.09055
- Adel ElZemity, Budi Arief and Shujun Li, 17 Sep 2025, CyberLLMInstruct: A Pseudo-malicious Dataset Revealing Safety-performance Trade-offs in Cyber Security LLM Fine-tuning, https://arxiv.org/abs/2503.09334
- Dylan Butts, Oct 22 2025, Hundreds of public figures, including Apple co-founder Steve Wozniak and Virgin’s Richard Branson urge AI ‘superintelligence’ ban, https://www.cnbc.com/2025/10/22/800-petition-signatures-apple-steve-wozniak-and-virgin-richard-branson-superintelligence-race.html
AI Books from Aussie AI
|
The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
Get your copy from Amazon: The Sweetest Lesson |
|
RAG Optimization: Accurate and Efficient LLM Applications:
new book on RAG architectures:
Get your copy from Amazon: RAG Optimization |
|
Generative AI Applications book:
Get your copy from Amazon: Generative AI Applications |
|
Generative AI programming book:
Get your copy from Amazon: Generative AI in C++ |
|
CUDA C++ Optimization book:
Get your copy from Amazon: CUDA C++ Optimization |
|
CUDA C++ Debugging book:
Get your copy from Amazon: CUDA C++ Debugging |
More AI Research
Read more about: