Aussie AI
AI Safety Research
-
Last Updated 12 December, 2024
-
by David Spuler, Ph.D.
Safe and responsible use of AI is an important and all-encompassing goal. Multiple concerns arise in the use of modern AI capabilities, and for the future with more advanced AI systems. This article examines the various research papers on difference AI safety issues.
Types of AI Safety Issues
There are a variety of distinct issue in terms of appropriate use of AI. Some of the categories include:
- Bias and fairness
- Inaccurate results
- Imaginary results ("hallucinations")
- Inappropriate responses
There are some issues that get quite close to being philosophy rather than technology:
- Alignment (ensuring AI engines are "aligned" with human goals)
- Overrideability/interruptibility
- Obedience vs autonomy
There are some overarching issues for AI matters for the government and in the community:
- Ethics
- Governance
- Regulation
- Auditing and Enforcement
- Risk Mitigation
And since we may rely on AI models in various real-world situations, including dangerous real-time situations like driving a car, there are some practical technological issues ensuring that AI engines operate safely and reliably within their basic operational scope:
- Testing and Debugging (simply avoiding coding "bugs" in complex AI engines)
- Real-time performance profiling ("de-slugging")
- Error Handling (tolerance of internal or external errors)
- Code Resilience (handling unexpected inputs or situations reasonably)
Overviews, Surveys, and Reviews
Various authors have reviewed the areas of safety and ethics:
- Cath C. Governing artificial intelligence: ethical, legal and technical opportunities and challenges. Philos Trans A Math Phys Eng Sci. 2018 Oct 15;376(2133):20180080. doi: 10.1098/rsta.2018.0080. PMID: 30322996 https://pubmed.ncbi.nlm.nih.gov/30322996/
- Hagendorff Thilo. The ethics of AI ethics: an evaluation of guidelines. Minds and Machines. 2020; 30(1):99–120. https://link.springer.com/article/10.1007/s11023-020-09517-8
- Jobin Anna, Ienca Marcello, Vayena Effy. The global landscape of AI ethics guidelines. Nature Machine Intell. 2019;(1):389–399. https://www.nature.com/articles/s42256-019-0088-2
- Soni N., Sharma E.K., Singh N., Kapoor A. 2019. Impact of Artificial Intelligence on Businesses: from Research, Innovation, Market Deployment to Future Shifts in Business Models”.arXiv:1905.02092. https://arxiv.org/abs/1905.02092
Hallucinations
Hallucinations are plausible-sounding answers that are not correct, and not based on any facts. It appears like the LLM is lying or faking the answer, but it doesn't actually know this. Rather, it is probabilistically trying to come up with the best answer, and sometimes it doesn't have a factual answer, so it can fill in the blanks.
- Liangming Pan, Michael Saxon, Wenda Xu, Deepak Nathani, Xinyi Wang, William Yang Wang, May 03 2024, Automatically Correcting Large Language Models: Surveying the Landscape of Diverse Automated Correction Strategies, https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00660/120911
- Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, Cyril Zhang, June 2023, Exposing Attention Glitches with Flip-Flop Language Modeling, https://arxiv.org/abs/2306.00946
- Lucas Mearian, 14 Mar 2024, AI hallucination mitigation: two brains are better than one, https://www.computerworld.com/article/1612465/ai-hallucination-mitigation-two-brains-are-better-than-one.html
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations.)
- Bijit Ghosh Feb 2024, Advanced Prompt Engineering for Reducing Hallucination, https://medium.com/@bijit211987/advanced-prompt-engineering-for-reducing-hallucination-bb2c8ce62fc6
- Junyi Li, Jie Chen, Ruiyang Ren, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, Ji-Rong Wen, 6 Jan 2024, The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models, https://arxiv.org/abs/2401.03205 Code: https://github.com/RUCAIBox/HaluEval-2.0
- Colin Fraser, Apr 18, 2024, Hallucinations, Errors, and Dreams On why modern AI systems produce false outputs and what there is to be done about it, https://medium.com/@colin.fraser/hallucinations-errors-and-dreams-c281a66f3c35
- Johnny Li, Saksham Consul, Eda Zhou, James Wong, Naila Farooqui, Yuxin Ye, Nithyashree Manohar, Zhuxiaona Wei, Tian Wu, Ben Echols, Sharon Zhou, Gregory Diamos, 25 Jun 2024, Banishing LLM Hallucinations Requires Rethinking Generalization, https://arxiv.org/abs/2406.17642
- Pavan Belagatti, Jul 31, 2024, Semantic Chunking for Enhanced RAG Applications! https://levelup.gitconnected.com/semantic-chunking-for-enhanced-rag-applications-b6bc92942af0
- Mintong Kang, Nezihe Merve Gürel, Ning Yu, Dawn Song, Bo Li, July 2024, C-RAG: Certified Generation Risks for Retrieval-Augmented Language Models, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:22963-23000, 2024, https://proceedings.mlr.press/v235/kang24a.html
- Mengya Hu, Rui Xu, Deren Lei, Yaxi Li, Mingyu Wang, Emily Ching, Eslam Kamal, Alex Deng, 22 Aug 2024, SLM Meets LLM: Balancing Latency, Interpretability and Consistency in Hallucination Detection, https://arxiv.org/abs/2408.12748
- Hao Zhou, Chengming Hu, Ye Yuan, Yufei Cui, Yili Jin, Can Chen, Haolun Wu, Dun Yuan, Li Jiang, Di Wu, Xue Liu, Charlie Zhang, Xianbin Wang, Jiangchuan Liu, 17 May 2024, Large Language Model (LLM) for Telecommunications: A Comprehensive Survey on Principles, Key Techniques, and Opportunities, https://arxiv.org/abs/2405.10825
- C Yang, S Fujita, 2024, Adaptive Control of Retrieval-Augmented Generation for LLMs Through Reflective Tags, https://www.preprints.org/manuscript/202408.2152/download/final_file
- Michael Wood, Aug 26, 2024, 100% Accurate AI Claimed by Acurai — OpenAI and Anthropic Confirm Acurai’s Discoveries, https://blog.cubed.run/100-accurate-ai-claimed-by-acurai-openai-and-anthropic-confirm-acurais-discoveries-98fce1ddeb5b
- James Lee Stakelum, Sep 2024, The End of AI Hallucinations: A Big Breakthrough in Accuracy for AI Application Developers, https://medium.com/@JamesStakelum/the-end-of-ai-hallucinations-a-breakthrough-in-accuracy-for-data-engineers-e67be5cc742a
- F. Li, X. zhang and P. Zhang, 2024, Mitigating Hallucination Issues in Small-Parameter LLMs through Inter-Layer Contrastive Decoding, 2024 International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 2024, pp. 1-8, doi: 10.1109/IJCNN60899.2024.10650644, https://ieeexplore.ieee.org/abstract/document/10650644
- Zhongxiang Sun, Zihua Si, Xiaoxue Zang, Kai Zheng, Yang Song, Xiao Zhang, Jun Xu, 15 Oct 2024, LargePiG: Your Large Language Model is Secretly a Pointer Generator, https://arxiv.org/abs/2410.11366
- Garanc Burke, Hilke Schellmann, October 27, 2024, Researchers say an AI-powered transcription tool used in hospitals invents things no one ever said, https://apnews.com/article/ai-artificial-intelligence-health-business-90020cdf5fa16c79ca2e5b6c4c9bbb14
- Adi Simhi, Jonathan Herzig, Idan Szpektor, Yonatan Belinkov, 29 Oct 2024, Distinguishing Ignorance from Error in LLM Hallucinations, https://arxiv.org/abs/2410.22071 https://github.com/technion-cs-nlp/hallucination-mitigation
- Salvatore Raieli, Nov 2024, What Is The Best Therapy For a Hallucinating AI Patient? Exploring the Art and Science of Prompt Engineering to Cure LLM Hallucinations, https://levelup.gitconnected.com/what-is-the-best-therapy-for-a-hallucinating-ai-patient-acf0cb9b3e00
- Vitaly Kukharenko, Nov 2024, Why Do Neural Networks Hallucinate (And What Are Experts Doing About It)? https://pub.towardsai.net/why-do-neural-networks-hallucinate-and-what-are-experts-doing-about-it-7b9342605bf7
- Yixiong Fang, Ziran Yang, Zhaorun Chen, Zhuokai Zhao, Jiawei Zhou, 9 Dec 2024, From Uncertainty to Trust: Enhancing Reliability in Vision-Language Models with Uncertainty-Guided Dropout Decoding, https://arxiv.org/abs/2412.06474
Security of AI
Research on security issues involving AI and LLMs:
- Jason Koebler, June 26, 2024, Researchers Prove Rabbit AI Breach By Sending Email to Us as Admin, https://www.404media.co/researchers-prove-rabbit-ai-breach-by-sending-email-to-us-as-admin/ (Rabbit's API security credentials were hard-coded into the device.)
- Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, Rui Kong, Yile Wang, Hanfei Geng, Jian Luan, Xuefeng Jin, Zilong Ye, Guanjing Xiong, Fan Zhang, Xiang Li, Mengwei Xu, Zhijun Li, Peng Li, Yang Liu, Ya-Qin Zhang, Yunxin Liu, 8 May 2024 (v2), Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security, https://arxiv.org/abs/2401.05459 https://github.com/MobileLLM/Personal_LLM_Agents_Survey
- Michael Nuñez, August 30, 2024, AI is growing faster than companies can secure it, warn industry leaders, https://venturebeat.com/ai/ai-is-growing-faster-than-companies-can-secure-it-warn-industry-leaders/
- Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
- Huan Yang, Deyu Zhang, Yudong Zhao, Yuanchun Li, Yunxin Liu, 6 Sep 2024, A First Look At Efficient And Secure On-Device LLM Inference Against KV Leakage, https://arxiv.org/abs/2409.04040 (Security issues where KV caches can be data leaks as they may contain encodings of private information.)
- Nicholas Carlini, Milad Nasr, 22 Oct 2024, Remote Timing Attacks on Efficient Language Model Inference, https://arxiv.org/abs/2410.17175
Safety Monitor
A safety monitor is a component that can be added to the LLM deployment.
- OpenAI, Moderation: Learn how to build moderation into your AI applications, 2024, https://platform.openai.com/docs/guides/moderation
- Azure, 06/13/2024, Content filtering, https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/content-filter?tabs=warning%2Cpython
- Yu Wang, Xiaogeng Liu, Yu Li, Muhao Chen, Chaowei Xiao, 14 Mar 2024, AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting, https://arxiv.org/abs/2403.09513 Code: https://github.com/rain305f/AdaShield
- Jinhwa Kim, Ali Derakhshan, Ian G. Harris, 31 Oct 2023, Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield, https://arxiv.org/abs/2311.00172
General Thoughts on AI Safety
High-level debate and discussions of AI safety issues:
- Stephen Hawking, Max Tegmark, Stuart Russell, and Frank Wilczek. April 2014. Transcending complacency on superintelligent machines. http://www.huffingtonpost.com/stephen-hawking/artificial-intelligence_b_5174265.html
- S. Alexander. OpenAI’s “Planning for AGI and beyond”. March 2023, https://astralcodexten.substack.com/p/openais-planning-for-agi-and-beyond
- N. Bostrom. The vulnerable world hypothesis. Global Policy, 10(4):455–476, 2019. https://doi.org/10.1111/1758-5899.12718
- Smitha Milli, Dylan Hadfield-Menell, Anca Dragan, and Stuart Russell. Should robots be obedient? In International Joint Conference on Artificial Intelligence, 2017. https://arxiv.org/abs/1705.09990
- Nick Bostrom. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, March 2016, https://www.amazon.com.au/Superintelligence-Professor-Philosophy-Institute-University/dp/0198739834/
- Nick Bostrom. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, July 2014 (prior edition), https://www.amazon.com.au/Superintelligence-Dangers-Strategies-Nick-Bostrom-ebook/dp/B00LOOCGB2/
- OpenAI, May 2023, Governance of superintelligence, https://openai.com/blog/governance-of-superintelligence
- Winfield AFT, Jirotka M. Ethical governance is essential to building trust in robotics and artificial intelligence systems. Philos Trans A Math Phys Eng Sci. 2018 Oct 15;376(2133):20180085. doi: 10.1098/rsta.2018.0085. PMID: 30323000 https://pubmed.ncbi.nlm.nih.gov/30323000/
- OpenAI, Feb 2023, How should AI systems behave, and who should decide? https://openai.com/blog/how-should-ai-systems-behave
- Stuart Russell. Should we fear supersmart robots? Scientific American, 314(6):58–59, 2016. https://www.scientificamerican.com/article/should-we-fear-supersmart-robots/, https://pubmed.ncbi.nlm.nih.gov/27196844/
- A Ramalho, 2017, Will robots rule the (artistic) world? A proposed model for the legal status of creations by artificial intelligence systems, https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2987757
- Bernd Carsten Stahl, 2023, Embedding responsibility in intelligent systems: from AI ethics to responsible AI ecosystems, Scientific Reports Open Access 18 May 2023, https://doi.org/10.1038/s41598-023-34622-w
- McCarthy, John, and Patrick J. Hayes. 1969. Some Philosophical Problems From the Standpoint of Artificial Intelligence, In: Machine Intelligence 4, B. Meltzer and D. Michie (eds.), Edinburgh University Press, 1969, pp. 463-502, Stanford University. http://jmc.stanford.edu/articles/mcchay69.html
- Russell, Stuart J. 2019. Human Compatible: Artificial Intelligence and the Problem of Control (Viking-Penguin Random House: London). https://link.springer.com/chapter/10.1007/978-3-030-86144-5_3
- Winfield A.F.T., Jirotka M., 2018, Ethical governance is essential to building trust in robotics and artificial intelligence systems. Philos. Trans. R. Soc. A. Math. Phys. Eng. Sci. 2018;376:13. http://www.ncbi.nlm.nih.gov/pmc/articles/pmc6191667/, https://pubmed.ncbi.nlm.nih.gov/30323000/
- Thomas Claburn 12 Oct 2023, AI safety guardrails easily thwarted, security study finds, The Register, https://www.theregister.com/2023/10/12/chatbot_defenses_dissolve/
- Alibaba Qwen Team, Sep 2023, Qwen Technical Report, https://arxiv.org/pdf/2309.16609.pdf
Government Policy and Regulation
Various governments have examined issues around regulation, and there has also been much debate:
- A. Solender and A. Gold. April 2023, Scoop: Schumer lays groundwork for Congress to regulate AI. https://www.axios.com/2023/04/13/congress-regulate-ai-tech
- UK Government. National AI strategy. Sep 2021. https://www.gov.uk/government/publications/national-ai-strategy
- AI Now Institute, A. Kak, and S. M. West. April 2023, General purpose AI poses serious risks, should not be excluded from the EU’s AI Act. https://ainowinstitute.org/publication/gpai-is-high-risk-should-not-be-excluded-from-eu-ai-act
- L. Bertuzzi. March 2023, Leading EU lawmakers propose obligations for general purpose ai. https://www.euractiv.com/section/artificial-intelligence/news/leading-eu-lawmakers-propose-obligations-for-general-purpose-ai
- UK Department for Science and Technology. Aug 2023, Policy paper: A pro-innovation approach to AI regulation. https://www.gov.uk/government/publications/ai-regulation-a-pro-innovation-approach/white-paper
- White House. May 2023. Fact sheet: Biden-Harris Administration announces new actions to promote responsible AI innovation that protects Americans’ rights and safety. https://www.whitehouse.gov/briefing-room/statements-releases/2023/05/04/fact-sheet-biden-harris-administration-annou nces-new-actions-to-promote-responsible-ai-innovation-that-protects-americans-rights-and-safety
- B. Zhang, M. Anderljung, L. Kahn, N. Dreksler, M. C. Horowitz, and A. Dafoe. 2021, Ethics and governance of artificial intelligence: Evidence from a survey of machine learning researchers. arXiv preprint arXiv:2105.02117, https://arxiv.org/abs/2105.02117
- ISO/IEC. 2023, ISO/IEC 23894:2023 Information technology — Artificial intelligence — Guidance on risk management. https://www.iso.org/standard/77304.html
- NIST, AI Risk Management Framework Concept Paper, 13 December 2021, PDF: https://www.nist.gov/system/files/documents/2021/12/14/AI%20RMF%20Concept%20Paper_13Dec2021_posted.pdf
- NIST. 2023, Artificial Intelligence Risk Management Framework (AI RMF 1.0). https://doi.org/10.6028/NIST.AI.100-1, https://www.nist.gov/itl/ai-risk-management-framework
- Tathagat Katiyar & Harshitha Chondamma II, Accorian, Feb 2023, UNDERSTANDING AI RMF 1.0 – The Artificial Intelligence Risk Management Framework https://accorian.com/understanding-ai-rmf-1-0-the-artificial-intelligence-risk-management-framework/
- E. Yudkowsky, 2023. Pausing AI developments isn’t enough. We need to shut it all down. https://time.com/6266923/ai-eliezer-yudkowsky-open-letter-not-enough
- Stephanie Palazzolo, Erin Woo, Aug 2024, Passage of California AI Bill Sends Shivers Across Tech Industry, https://www.theinformation.com/articles/passage-of-california-ai-bill-sends-shivers-across-tech-industry
Auditing and Enforcement
Papers on auditing or enforcement of AI policy:
- J. Mökander and L. Floridi. 2022, Operationalising AI governance through ethics-based auditing: An industry case study. AI and Ethics, pages 1–18, https://link.springer.com/article/10.1007/s43681-022-00171-7
- J. Mökander, J. Schuett, H. R. Kirk, and L. Floridi. June 2023. Auditing large language models: A three-layered approach. arXiv preprint arXiv:2302.08500. https://arxiv.org/abs/2302.08500
- J. Mökander, J. Morley, M. Taddeo, and L. Floridi. Ethics-based auditing of automated decision-making systems: Nature, scope, and limitations. Science and Engineering Ethics, 27(44), 2021. https://arxiv.org/abs/2110.10980
Bias and Fairness
AI engines have shown bias in various ways. The goal is to have them show "fairness" in their results:
- Dastin Jeffrey. Oct 2018, Amazon scraps secret AI recruiting tool that showed bias against women. Reuters. https://www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G
- Courtland R., 2018, Bias detectives: the researchers striving to make algorithms fair. Nature. 2018 Jun;558(7710):357-360. doi: 10.1038/d41586-018-05469-3. PMID: 29925973 https://pubmed.ncbi.nlm.nih.gov/29925973/
- Caliskan Aylin, Bryson Joanna J., Narayanan Arvind. 2017. Semantics derived automatically from language corpora contain human-like biases. Science. 2017;356:183–186. https://pubmed.ncbi.nlm.nih.gov/28408601/
- A Levendowski, 2018, How copyright law can fix artificial intelligence's implicit bias problem, Wash. L. Rev., https://digitalcommons.law.uw.edu/cgi/viewcontent.cgi?article=5042&context=wlr
- Hao Karen. 2020. AI researchers say scientific publishers help perpetuate racist algorithms. MIT Technology Review. https://www.technologyreview.com/2020/06/23/1004333/ai-science-publishers-perpetuate-racist-face-recognition/
- K Ramesh, A Chavan, S Pandit, 2023, A Comparative Study on the Impact of Model Compression Techniques on Fairness in Language Models, Microsoft Research, https://aclanthology.org/2023.acl-long.878.pdf, https://www.microsoft.com/en-us/research/uploads/prod/2023/07/3687_Paper.pdf
- Jwala Dhamala, Varun Kumar, Rahul Gupta, Kai-Wei Chang, Aram Galstyan, Oct 2022, An Analysis of the Effects of Decoding Algorithms on Fairness in Open-Ended Language Generation, https://arxiv.org/abs/2210.03826 (Examines top-p, top-k, and temperature in decoding algorithms from a safety perspective.)
- Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, Cyril Zhang, June 2023, Exposing Attention Glitches with Flip-Flop Language Modeling, https://arxiv.org/abs/2306.00946
- Valentin Hofmann, Pratyusha Ria Kalluri, Dan Jurafsky, Sharese King, 1 Mar 2024, Dialect prejudice predicts AI decisions about people's character, employability, and criminality, https://arxiv.org/abs/2403.00742 https://arxiv.org/pdf/2403.00742.pdf
- Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, Lina Yao, 6 Jul 2024, AI Safety in Generative AI Large Language Models: A Survey, https://arxiv.org/abs/2407.18369
- Cem Dilmegani, Jan 10, 2024, The Future of Large Language Models in 2024, https://research.aimultiple.com/future-of-large-language-models/
- Douglas C. Youvan, September 27, 2024, Building and Running Large-Scale Language Models: The Infrastructure and Techniques Behind GPT-4 , https://www.researchgate.net/profile/Douglas-Youvan/publication/384398902_Building_and_Running_Large-Scale_Language_Models_The_Infrastructure_and_Techniques_Behind_GPT-4/links/66f6f4d3906bca2ac3d20e68/Building-and-Running-Large-Scale-Language-Models-The-Infrastructure-and-Techniques-Behind-GPT-4.pdf
- Mayank Vatsa, Anubhooti Jain, Richa Singh, 7 Dec 2023, Adventures of Trustworthy Vision-Language Models: A Survey, https://arxiv.org/abs/2312.04231
Ethics of Responsible AI Research
Ethical issues in AI research and related publication of results:
- Partnership on AI. 2021, Managing the risks of AI research: Six Recommendations for Responsible Publication. https://partnershiponai.org/paper/responsible-publication-recommendations
- M. Brundage, S. Avin, J. Wang, H. Belfield, G. Krueger, G. Hadfield, H. Khlaaf, J. Yang, H. Toner, R. Fong, T. Maharaj, P. W. Koh, S. Hooker, J. Leung, A. Trask, E. Bluemke, J. Lebensold, C. O’Keefe, M. Koren, T. Ryffel, J. Rubinovitz, T. Besiroglu, F. Carugati, J. Clark, P. Eckersley, S. de Haas, M. Johnson, B. Laurie, A. Ingerman, I. Krawczuk, A. Askell, R. Cammarota, A. Lohn, D. Krueger, C. Stix, P. Henderson, L. Graham, C. Prunkl, B. Martin, E. Seger, N. Zilberman, S. Ó. hÉigeartaigh, F. Kroeger, G. Sastry, R. Kagan, A. Weller, B. Tse, E. Barnes, A. Dafoe, P. Scharre, A. Herbert-Voss, M. Rasser, S. Sodhani, C. Flynn, T. K. Gilbert, L. Dyer, S. Khan, Y. Bengio, and M. Anderljung. Toward trustworthy AI development: Mechanisms for supporting verifiable claims. arXiv preprint arXiv:2004.07213, 2020. https://arxiv.org/abs/2004.07213
- R. Crootof. 2019, Artificial intelligence research needs responsible publication norms. https://www.lawfareblog.com/artificial-intelligence-research-needs-responsible-publication-norms
- C. Ashurst, S. Barocas, R. Campbell, and D. Raji. Disentangling the components of ethical research in machine learning. In 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 2057–2068, 2022. http://dx.doi.org/10.1145/3531146.3533781, https://www.researchgate.net/publication/361439688_Disentangling_the_Components_of_Ethical_Research_in_Machine_Learning
- Herrmann H. What's next for responsible artificial intelligence: a way forward through responsible innovation. Heliyon. 2023 Mar 11;9(3):e14379. doi: 10.1016/j.heliyon.2023.e14379. eCollection 2023 Mar. PMID: 36967876, https://pubmed.ncbi.nlm.nih.gov/36967876/
- Ethically governing artificial intelligence in the field of scientific research and innovation. González-Esteban Y Patrici Calvo E. Heliyon. 2022 Feb 16;8(2):e08946. doi: 10.1016/j.heliyon.2022.e08946. eCollection 2022 Feb. PMID: 35243068, https://pubmed.ncbi.nlm.nih.gov/35243068/
- Dzobo K, Adotey S, Thomford NE, Dzobo W. Integrating Artificial and Human Intelligence: A Partnership for Responsible Innovation in Biomedical Engineering and Medicine. OMICS. 2020 May;24(5):247-263. doi: 10.1089/omi.2019.0038. Epub 2019 Jul 16. PMID: 31313972, https://pubmed.ncbi.nlm.nih.gov/31313972/
- d'Aquin M., Troullinou P., O'Connor N.E., Cullen A., Faller G., Holden L. 2018 AAAI/ACM Conference on AI, Ethics, and Society (AIES ’18) ACM; New York: 2018. Towards an “ethics by design” methodology for AI research projects”; pp. 54–59. https://www.researchgate.net/publication/330297261_Towards_an_Ethics_by_Design_Methodology_for_AI_Research_Projects
- Dignum Virginia. 2019. Responsible Artificial Intelligence. How to Develop and Use AI in a Responsible Way. Springer, https://link.springer.com/book/10.1007/978-3-030-30371-6
- European Commission. 2012. Responsible Research and Innovation: Europe’s Ability to Respond to Societal Challenges. Brussels. https://op.europa.eu/en/publication-detail/-/publication/2be36f74-b490-409e-bb60-12fd438100fe
- Helmore Edward. 2019. Profit over safety? Boeing under fire over 737 Max crashes as families demand answers. Guardian. https://www.theguardian.com/business/2019/jun/17/boeing-737-max-ethiopian-airlines-crash
- High-level expert Group on Artificial Intelligence. European Commission; 2019. Ethics Guidelines for Trustworthy AI. Brussels. https://op.europa.eu/en/publication-detail/-/publication/d3988569-0434-11ea-8c1f-01aa75ed71a1
- Prates M., Avelar P., Lamb L.C. 2018, On quantifying and understanding the role of ethics in AI research: a historical account of flagship conferences and journals. EPiC Series in Computing. 2018;55:188–201. https://arxiv.org/abs/1809.08328
- Castelvecchi D., 2021, Prestigious AI meeting takes steps to improve ethics of research. Nature. 2021 Jan;589(7840):12-13. doi: 10.1038/d41586-020-03611-8. PMID: 33361804, https://pubmed.ncbi.nlm.nih.gov/33361804/
- Bouhouita-Guermech S, Gogognon P, Bélisle-Pipon JC. 2023, Specific challenges posed by artificial intelligence in research ethics. Front Artif Intell. 2023 Jul 6;6:1149082. doi: 10.3389/frai.2023.1149082. eCollection 2023. PMID: 37483869 https://pubmed.ncbi.nlm.nih.gov/37483869/
- Gibney E., 2020, The battle for ethical AI at the world's biggest machine-learning conference. Nature. 2020 Jan;577(7792):609. doi: 10.1038/d41586-020-00160-y. PMID: 31992885, https://pubmed.ncbi.nlm.nih.gov/31992885/
- Sánchez López JD, Cambil Martín J, Villegas Calvo M, Luque Martínez F., 2020. Ethical conflicts between authonomy and deep learning, J Healthc Qual Res. 2020 Jan-Feb;35(1):51-52. doi: 10.1016/j.jhqr.2019.06.009. Epub 2019 Nov 26. PMID: 31784256, https://pubmed.ncbi.nlm.nih.gov/31784256/
- Prabhu SP., 2019, Ethical challenges of machine learning and deep learning algorithms. Lancet Oncol. 2019 May;20(5):621-622. doi: 10.1016/S1470-2045(19)30230-X. PMID: 31044701, https://pubmed.ncbi.nlm.nih.gov/31044701/
- Dignum V. Ethics in artificial intelligence: introduction to the special issue. Ethics Inf. Technol. 2018;20:1–3. https://link.springer.com/article/10.1007/s10676-018-9450-z
- IEEE. 2019. "Ethically Aligned Design: A Vision for Prioritizing Human Well-being With Autonomous and Intelligent Systems [First Edition]." The IEEE Global Initiative on Ethics of Autonomous and Intelligent Systems. https://standards.ieee.org/content/ieee-standards/en/industry-connections/ec/autonomous-systems.html
- Stuart Russell, Daniel Dewey, and Max Tegmark. 2015. Research priorities for robust and beneficial artificial intelligence. AI Magazine, 36(4):105–114, 2015. PDF: https://futureoflife.org/data/documents/research_priorities.pdf
- Peter Dizikes, December 11, 2023, MIT group releases white papers on governance of AI, MIT News, https://news.mit.edu/2023/mit-group-releases-white-papers-governance-ai-1211
- Thomas Mildner, Orla Cooney, Anna-Maria Meck, Marion Bartl, Gian-Luca Savino, Philip R. Doyle, Diego Garaialde, Leigh Clark, John Sloan, Nina Wenig, Rainer Malaka, Jasmin Niess, 26 Jan 2024, Listening to the Voices: Describing Ethical Caveats of Conversational User Interfaces According to Experts and Frequent Users, Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI '24), May 11--16, 2024, Honolulu, HI, USA, https://arxiv.org/abs/2401.14746 https://doi.org/https://doi.org/10.1145/3613904.3642542
- Balasubramaniam S. , Vanajaroselin Chirchi, Seifedine Kadry, Moorthy Agoramoorthy, Gururama Senthilvel P., Satheesh Kumar K., and Sivakumar T. A., Oct 2024, The Road Ahead: Emerging Trends, Unresolved Issues, and ConcludingRemarksinGenerativeAI—AComprehensiveReview, International Journal of Intelligent Systems, Volume 2024, Article ID 4013195, 38 pages, https://doi.org/10.1155/2024/4013195 https://www.researchgate.net/profile/Balasubramaniam-s-2/publication/384729387_The_Road_Ahead_Emerging_Trends_Unresolved_Issues_and_Concluding_Remarks_in_Generative_AI-A_Comprehensive_Review/links/6705560cf5eb7108c6e5d261/The-Road-Ahead-Emerging-Trends-Unresolved-Issues-and-Concluding-Remarks-in-Generative-AI-A-Comprehensive-Review.pdf
AI Alignment Research
Alignment is the study of how to ensure that AI engines are "aligned" with the goals and intent of humans.
- J. Leike, J. Schulman, and J. Wu. OpenAI, August 2022. Our approach to alignment research. https://openai.com/blog/our-approach-to-alignment-research
- OpenAI, July 2023, Introducing Superalignment, https://openai.com/blog/introducing-superalignment
- V. Krakovna and R. Shah. 2023, Some high-level thoughts on the DeepMind alignment team’s strategy. https://www.alignmentforum.org/posts/a9SPcZ6GXAg9cNKdi/linkpost-some-high-level-thoughts-on-the-deepmind-alignment
- J. Leike. Dec 2022, Why I’m optimistic about our alignment approach. https://aligned.substack.com/p/alignment-optimism
- Nate Soares and Benja Fallenstein. Aligning superintelligence with human interests: A technical research agenda. Technical report, Machine Intelligence Research Institute, 2014. https://www.semanticscholar.org/paper/Aligning-Superintelligence-with-Human-Interests%3A-A-Soares-Fallenstein/d8033a314493c8df3791912272ac4b58d3a7b8c2
- Jessica Taylor, Eliezer Yudkowsky, Patrick LaVictoire, and Andrew Critch. 2016. Alignment for advanced machine learning systems. Technical report, Machine Intelligence Research Institute, 2016. PDF: https://intelligence.org/files/AlignmentMachineLearning.pdf
- Daniel Weld and Oren Etzioni. The first law of robotics (a call to arms). Proceedings of the AAAI Conference on Artificial Intelligence, 12, pages 1042–1047, 1994. https://aaai.org/papers/01042-the-first-law-of-robotics-a-call-to-arms/
- Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, Mar 2022, Training language models to follow instructions with human feedback, https://arxiv.org/abs/2203.02155 (InstructGPT main paper from OpenAI in 2022.)
- Ziniu Li1, Tian Xu, Yushun Zhang, Zhihang Lin, Yang Yu, Ruoyu Sun, Zhi-Quan Luo, 2024, ReMax: ASimple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models, https://openreview.net/pdf?id=Stn8hXkpe6
- Aibek Bekbayev, Sungbae Chun, Yerzat Dulat, James Yamazaki, Aug 2023, The Poison of Alignment, https://arxiv.org/abs/2308.13449
- Yotam Wolf, Noam Wies, Yoav Levine, and Amnon Shashua. Fundamental limitations of alignment in large language models. arXiv preprint arXiv:2304.11082, 2023. https://arxiv.org/abs/2304.11082
- Renze Lou, Kai Zhang, Wenpeng Yin, 25 May 2024 (v8), Large Language Model Instruction Following: A Survey of Progresses and Challenges, https://arxiv.org/abs/2303.10475 Project: https://github.com/RenzeLou/awesome-instruction-learning
- Alexandre Ramé, Nino Vieillard, Léonard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, Johan Ferret, 22 Jan 2024, WARM: On the Benefits of Weight Averaged Reward Models, https://arxiv.org/abs/2401.12187 (Uses multiple reward models to avoid problems with the LLM "hacking rewards" in unforeseen ways.)
- NVIDIA, June 2024, Nemotron-4 340B Technical Report, https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T_0.pdf (Architecture is decoder-only with GQA, SentencePiece tokenizer, causal attention masks, RoPE, 96 layers, 96 heads, 8 KV heads, 256,000 vocabulary, 18432 internal dimension, context window 4096, and uses squared RELU.)
- Piotr Wojciech Mirowski, Juliette Love, Kory W. Mathewson, Shakir Mohamed, 3 Jun 2024 (v2), A Robot Walks into a Bar: Can Language Models Serve as Creativity Support Tools for Comedy? An Evaluation of LLMs' Humour Alignment with Comedians, https://arxiv.org/abs/2405.20956 (The unfunny fact that AI is bad at humor.)
- Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, Lina Yao, 6 Jul 2024, AI Safety in Generative AI Large Language Models: A Survey, https://arxiv.org/abs/2407.18369
- Mintong Kang, Nezihe Merve Gürel, Ning Yu, Dawn Song, Bo Li, July 2024, C-RAG: Certified Generation Risks for Retrieval-Augmented Language Models, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:22963-23000, 2024, https://proceedings.mlr.press/v235/kang24a.html
- Rohin Shah, Seb Farquhar, Anca Dragan, 21st Aug 2024, AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work, https://www.alignmentforum.org/posts/79BPxvSsjzBkiSyTq/agi-safety-and-alignment-at-google-deepmind-a-summary-of
- Shayne Longpre, Stella Biderman, Alon Albalak, Hailey Schoelkopf, Daniel McDuff, Sayash Kapoor, Kevin Klyman, Kyle Lo, Gabriel Ilharco, Nay San, Maribeth Rauh, Aviya Skowron, Bertie Vidgen, Laura Weidinger, Arvind Narayanan, Victor Sanh, David Adelani, Percy Liang, Rishi Bommasani, Peter Henderson, Sasha Luccioni, Yacine Jernite, Luca Soldaini, 26 Jun 2024 (v2), The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources, https://arxiv.org/abs/2406.16746
- Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
- Hao Zhou, Chengming Hu, Ye Yuan, Yufei Cui, Yili Jin, Can Chen, Haolun Wu, Dun Yuan, Li Jiang, Di Wu, Xue Liu, Charlie Zhang, Xianbin Wang, Jiangchuan Liu, 17 May 2024, Large Language Model (LLM) for Telecommunications: A Comprehensive Survey on Principles, Key Techniques, and Opportunities, https://arxiv.org/abs/2405.10825
- Zekun Moore Wang, Shawn Wang, Kang Zhu, Jiaheng Liu, Ke Xu, Jie Fu, Wangchunshu Zhou, Wenhao Huang, 17 Oct 2024, PopAlign: Diversifying Contrasting Patterns for a More Comprehensive Alignment, https://arxiv.org/abs/2410.13785
- Mozhi Zhang, Pengyu Wang, Chenkun Tan, Mianqiu Huang, Dong Zhang, Yaqian Zhou, Xipeng Qiu, 18 Oct 2024, MetaAlign: Align Large Language Models with Diverse Preferences during Inference Time, https://arxiv.org/abs/2410.14184
AI Industry Safety Practices
Various papers discuss the practices of the major AI players in the industry, along with issues such as self-governance.
- OpenAI, July 2023, Frontier Model Forum, https://openai.com/blog/frontier-model-forum
- OpenAI. April 2023, Our approach to AI safety. https://openai.com/blog/our-approach-to-ai-safety
- A. M. Barrett, J. Newman, D. Hendrycks, and B. Nonnecke. 2023, UC Berkeley AI Risk-Management Standards Profile for General-Purpose AI Systems (GPAIS) and Foundation Models, https://cltc.berkeley.edu/seeking-input-and-feedback-ai-risk-management-standards-profile-for-increasingly-multi-purpose-or-general-purpose-ai
- Meta, 2023, Responsible AI: Driven by our belief that AI should benefit everyone, https://ai.meta.com/responsible-ai/
- Google, 2023, AI Governance reviews and operations, https://ai.google/responsibility/ai-governance-operations
- Google, 2023, Responsibility: Our Principles, https://ai.google/responsibility/principles/
- Google, 2023, How Bard Works | A Responsible Approach to AI, YouTube, https://www.youtube.com/watch?v=vhbkCEnNXcY
Technical Verification and Testing of AI Safety
Testing and evaluation of AI safety issues:
- Xiaowei Huang, Marta Kwiatkowska, Sen Wang, and Min Wu. May 2017. Safety verification of deep neural networks. In Computer Aided Verification, pages 3–29, https://arxiv.org/abs/1610.06940
- D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai, S. Kadavath, B. Mann, E. Perez, N. Schiefer, K. Ndousse, A. Jones, S. Bowman, A. Chen, T. Conerly, N. DasSarma, D. Drain, N. Elhage, S. El-Showk, S. Fort, Z. Hatfield-Dodds, T. Henighan, D. Hernandez, T. Hume, J. Jacobson, S. Johnston, S. Kravec, C. Olsson, S. Ringer, E. Tran-Johnson, D. Amodei, T. Brown, N. Joseph, S. McCandlish, C. Olah, J. Kaplan, and J. Clark. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858, 2022 https://arxiv.org/abs/2209.07858
- K Ramesh, A Chavan, S Pandit, 2023, A Comparative Study on the Impact of Model Compression Techniques on Fairness in Language Models, Microsoft Research, https://aclanthology.org/2023.acl-long.878.pdf, https://www.microsoft.com/en-us/research/uploads/prod/2023/07/3687_Paper.pdf (Rather than testing full models, this analysis examines optimized models due to quantization, pruning or distillation.)
- T. Shevlane. Structured access: An emerging paradigm for safe AI deployment. In The Oxford Handbook of AI Governance, 2022, https://arxiv.org/abs/2201.05159
- E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese, N. McAleese, and G. Irving. 2022, Red teaming language models with language models. arXiv preprint arXiv:2202.03286, https://arxiv.org/abs/2202.03286
- OpenAI. 2023. Safety best practices. https://platform.openai.com/docs/guides/safety-best-practices
- William Saunders, Girish Sastry, Andreas Stuhlmueller, and Owain Evans. Trial without error: Towards safe reinforcement learning via human intervention. arXiv preprint arXiv:1707.05173, 2017. https://arxiv.org/abs/1707.05173
- Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed, Oct 2023, Mistral 7B, https://arxiv.org/abs/2310.06825, Code: https://mistral.ai/news/announcing-mistral-7b/ (Examines guardrails and testing of the safety of the model against harmful inputs.)
AI Factual Inaccuracy
Research papers on accuracy of AI results include:
- M Yuksekgonul, V Chandrasekaran, E Jones, Sep 2023, Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models, https://arxiv.org/pdf/2309.15098.pdf, Code: https://github.com/microsoft/mechanistic-error-probe
- Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, Cyril Zhang, June 2023, Exposing Attention Glitches with Flip-Flop Language Modeling, https://arxiv.org/abs/2306.00946
- S Latifi, 2023, Efficient and Dependable Deep Learning Systems Ph.D. Thesis, Computer Science and Engineering, University of Michigan, https://deepblue.lib.umich.edu/bitstream/handle/2027.42/176548/salar_1.pdf?sequence=1
- Michael Wood, Aug 26, 2024, 100% Accurate AI Claimed by Acurai — OpenAI and Anthropic Confirm Acurai’s Discoveries, https://blog.cubed.run/100-accurate-ai-claimed-by-acurai-openai-and-anthropic-confirm-acurais-discoveries-98fce1ddeb5b
AI Safety Incidents
Various incidents and accidents related to AI safety issues:
- S. McGregor. Nov 2021. Preventing repeated real world AI failures by cataloging incidents: The AI Incident Database. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 15458–15463, https://arxiv.org/abs/2011.08512
- Sarah Perez, 2023, Snapchat’s My AI goes rogue, posts to Stories, but Snap confirms it was just a glitch, August 17, 2023, TechCrunch, https://techcrunch.com/2023/08/16/snapchats-my-ai-goes-rogue-posts-to-stories-but-snap-confirms-it-was-just-a-glitch/
- Jaime Seidel, 2019, How a ‘confused’ AI May Have Fought Pilots Attempting to Save Boeing 737 MAX8s, News Corp Australia Network, https://www.news.com.au/technology/innovation/inventions/how-a-confused-ai-may-have-fought-pilots-attempting-to-save-boeing-737-max8s/news-story/bf0d102f699905e5aa8d1f6d65f4c27e (A very good example of the need for overrides and interruptibility.)
- Zachary Arnold, Helen Toner, July 2021, AI Accidents: An Emerging Threat What Could Happen and What to Do, CSET Policy Brief, https://cset.georgetown.edu/wp-content/uploads/CSET-AI-Accidents-An-Emerging-Threat.pdf
- Hern Alex. Apple contractors ‘regularly hear confidential details’ on Siri recordings. Guardian. 2019, https://www.theguardian.com/technology/2019/jul/26/apple-contractors-regularly-hear-confidential-details-on-siri-recordings
- Victor Tangermann, Sep 2023, Microsoft Publishes Garbled AI Article Calling Tragically Deceased NBA Player "Useless", Futurism, https://futurism.com/msn-ai-brandon-hunter-useless ("AI should not be writing obituaries.")
Incident Databases: There are various databases that collect information about AI safety incidents.
- AI Incident Database, https://incidentdatabase.ai/
- Zach Stein-Perlman, SeLo, stepanlos, MvK, July 20, 2023, Incident reporting for AI safety, Effective Altruism Forum, https://forum.effectivealtruism.org/posts/qkK5ejystp8GCJ3vC/incident-reporting-for-ai-safety
- AVID, 2023, AI Vulnerability Database: An open-source, extensible knowledge base of AI failures, https://avidml.org/
- AIAAIC (AI, Algorithmic, and Automation Incidents and Controversies), 2023, https://www.aiaaic.org/home
- MITRE ATLAS™ (Adversarial Threat Landscape for Artificial-Intelligence Systems), https://atlas.mitre.org/
- AI Badness: An open catalog of generative AI badness, 2023, https://badness.ai/
- David Dao, 2023, Awful AI, https://github.com/daviddao/awful-ai
Medical Ethics and AI
The use of AI in medicine creates some additional ethical issues:
- Vollmer S., Mateen B.A., Bohner G., Király F.J., Ghani R., Jonsson P., et al. Machine learning and AI research for patient benefit: 20 critical questions on transparency, replicability, ethics and effectiveness. BMJ. 2018;(368):1–12. https://pubmed.ncbi.nlm.nih.gov/32198138/
- Cockerill RG., 2020, Ethics Implications of the Use of Artificial Intelligence in Violence Risk Assessment. J Am Acad Psychiatry Law. 2020 Sep;48(3):345-349. doi: 10.29158/JAAPL.003940-20. Epub 2020 May 14. PMID: 32409300, https://pubmed.ncbi.nlm.nih.gov/32409300/
- Barron DS. 2021, Commentary: the ethical challenges of machine learning in psychiatry: a focus on data, diagnosis, and treatment. Psychol Med. 2021 Nov;51(15):2522-2524. doi: 10.1017/S0033291721001008. Epub 2021 May 12. PMID: 33975655, https://pubmed.ncbi.nlm.nih.gov/33975655/
- O'Reilly-Shah VN, Gentry KR, Walters AM, Zivot J, Anderson CT, Tighe PJ. 2020, Bias and ethical considerations in machine learning and the automation of perioperative risk assessment. Br J Anaesth. 2020 Dec;125(6):843-846. doi: 10.1016/j.bja.2020.07.040. Epub 2020 Aug 21. PMID: 32838979, https://pubmed.ncbi.nlm.nih.gov/32838979/
- Buchlak QD, Esmaili N, Leveque JC, Bennett C, Piccardi M, Farrokhi F., 2020, Ethical thinking machines in surgery and the requirement for clinical leadership. Am J Surg. 2020 Nov;220(5):1372-1374. doi: 10.1016/j.amjsurg.2020.06.073. Epub 2020 Jul 8. PMID: 32723487, https://pubmed.ncbi.nlm.nih.gov/32723487/
- Starke G, De Clercq E, Borgwardt S, Elger BS., 2020, Computing schizophrenia: ethical challenges for machine learning in psychiatry. Psychol Med. 2021 Nov;51(15):2515-2521. doi: 10.1017/S0033291720001683. Epub 2020 Jun 15. PMID: 32536358, https://pubmed.ncbi.nlm.nih.gov/32536358/
- Jacobson NC, Bentley KH, Walton A, Wang SB, Fortgang RG, Millner AJ, Coombs G 3rd, Rodman AM, Coppersmith DDL., 2020, Ethical dilemmas posed by mobile health and machine learning in psychiatry research. Bull World Health Organ. 2020 Apr 1;98(4):270-276. doi: 10.2471/BLT.19.237107. Epub 2020 Feb 25. PMID: 32284651, https://pubmed.ncbi.nlm.nih.gov/32284651/
- Johnson SLJ., 2019, AI, Machine Learning, and Ethics in Health Care. J Leg Med. 2019 Oct-Dec;39(4):427-441. doi: 10.1080/01947648.2019.1690604. PMID: 31940250 https://pubmed.ncbi.nlm.nih.gov/31940250/
- Vayena E, Blasimme A, Cohen IG., 2018, Machine learning in medicine: Addressing ethical challenges. PLoS Med. 2018 Nov 6;15(11):e1002689. doi: 10.1371/journal.pmed.1002689. eCollection 2018 Nov. PMID: 30399149, https://pubmed.ncbi.nlm.nih.gov/30399149/
- Nabi J., 2018, How Bioethics Can Shape Artificial Intelligence and Machine Learning. Hastings Cent Rep. 2018 Sep;48(5):10-13. doi: 10.1002/hast.895. PMID: 30311202, https://pubmed.ncbi.nlm.nih.gov/30311202/
- Char DS, Shah NH, Magnus D., 2018, Implementing Machine Learning in Health Care - Addressing Ethical Challenges. N Engl J Med. 2018 Mar 15;378(11):981-983. doi: 10.1056/NEJMp1714229. PMID: 29539284, https://pubmed.ncbi.nlm.nih.gov/29539284/
- Fiske A, Henningsen P, Buyx A., 2019, Your Robot Therapist Will See You Now: Ethical Implications of Embodied Artificial Intelligence in Psychiatry, Psychology, and Psychotherapy. J Med Internet Res. 2019 May 9;21(5):e13216. doi: 10.2196/13216. PMID: 31094356, https://pubmed.ncbi.nlm.nih.gov/31094356/
- Beil Michael, Proft Ingo, van Heerden Daniel, Sviri Sigal, van Heerden Peter Vernon. 2019, Ethical considerations about artificial intelligence for prognostication in intensive care. Intensive Care Medicine Experimental. 2019;7:70. http://www.ncbi.nlm.nih.gov/pmc/articles/pmc6904702/, https://pubmed.ncbi.nlm.nih.gov/31823128/
- Lasse Benzinger, Frank Ursin, Wolf-Tilo Balke, Tim Kacprowski & Sabine Salloch, 2023, Should Artificial Intelligence be used to support clinical ethical decision-making? A systematic review of reasons BMC Medical Ethics volume 24, Article number: 48 (2023), https://doi.org/10.1186/s12910-023-00929-6
- Rachel Dlugatch, Antoniya Georgieva & Angeliki Kerasidou, 2023, Trustworthy artificial intelligence and ethical design: public perceptions of trustworthiness of an AI-based decision-support tool in the context of intrapartum care, BMC Medical Ethics Open Access 20 June 2023, https://doi.org/10.1186/s12910-023-00917-w
- Dzobo K, Adotey S, Thomford NE, Dzobo W. Integrating Artificial and Human Intelligence: A Partnership for Responsible Innovation in Biomedical Engineering and Medicine. OMICS. 2020 May;24(5):247-263. doi: 10.1089/omi.2019.0038. Epub 2019 Jul 16. PMID: 31313972, https://pubmed.ncbi.nlm.nih.gov/31313972/
- McCradden MD, Joshi S, Mazwi M, Anderson JA., 2020, Ethical limitations of algorithmic fairness solutions in health care machine learning. Lancet Digit Health. 2020 May;2(5):e221-e223. doi: 10.1016/S2589-7500(20)30065-0. PMID: 33328054, https://pubmed.ncbi.nlm.nih.gov/33328054/
- Kulikowski CA., 2019, Beginnings of Artificial Intelligence in Medicine (AIM): Computational Artifice Assisting Scientific Inquiry and Clinical Art - with Reflections on Present AIM Challenges. Yearb Med Inform. 2019 Aug;28(1):249-256. doi: 10.1055/s-0039-1677895. Epub 2019 Apr 25. PMID: 31022744, https://pubmed.ncbi.nlm.nih.gov/31022744/
- Park S.H., Kim Y.H., Lee J.Y., Yoo S., Kim C.J. Ethical challenges regarding artificial intelligence in medicine from the perspective of scientific editing and peer review. Science Editing. 2019;6:91–98. https://www.semanticscholar.org/paper/Ethical-challenges-regarding-artificial-in-medicine-Park-Kim/7a5b3c84c6f5d16e68eaf17989b0debfd4ba57d0
Data Leakage
Data leakage refers to the AI accidentally causing the leak of data that you'd prefer was kept confidential. The "leak" can actually be caused by the LLM, or by the user, depending on the context. There are various ways this can occur:
- Uploading confidential data in AI queries (User data leakage)
- Training or fine-tuning data containing proprietary information (Training data leakage)
- RAG datastore documents containing proprietary information (RAG data leakage)
In the context of an LLM output leaking, this refers to where internal company IP is accidentally "leaked" to the public by training the AI with documents containing internal information. The AI is not smart enough to note when it shouldn't be reading a document, and anything that goes into the training dataset, or in the RAG datastore, will be shown to users.
User data leakage is where company users are sending proprietary information to a third-party AI engine. In theory, this data is protected by the confidentiality practices of the LLM company. This issue is similar to having company staff emitting confidential information in their Google queries, but the issue is more problematic because AI queries can upload entire documents to be analyzed by the LLM, such as when doing grammar checking with an LLM.
Research papers on data leakage:
- Grant Gross, 05 Jun 2024, Unauthorized AI is eating your company data, thanks to your employees, https://www.csoonline.com/article/2138447/unauthorized-ai-is-eating-your-company-data-thanks-to-your-employees.html
- Mary K. Pratt, 08 Jul 2024, 10 ways to prevent shadow AI disaster, https://www.cio.com/article/2150142/10-ways-to-prevent-shadow-ai-disaster.html
- Rachel Curry, Aug 28 2024, Why companies including JPMorgan and Walmart are opting for internal gen AI assistants after initially restricting usage, https://www.cnbc.com/2024/08/28/why-jpmorgan-and-walmart-are-opting-for-internal-gen-ai-assistants.html
- Huan Yang, Deyu Zhang, Yudong Zhao, Yuanchun Li, Yunxin Liu, 6 Sep 2024, A First Look At Efficient And Secure On-Device LLM Inference Against KV Leakage, https://arxiv.org/abs/2409.04040 (Security issues where KV caches can be data leaks as they may contain encodings of private information.)
Refusal
Refusal refers to the way that an LLM will politely decline to answer an inappropriate question. There are all types of questions that we don't want an LLM to respond to, and this requires training to achieve.
- Andy Arditi, Oscar Obeso, Aaquib111, wesg, Neel Nanda, 27th Apr 2024, Refusal in LLMs is mediated by a single direction, LessWrong, https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction
- Maxime Labonne June 13, 2024 Uncensor any LLM with abliteration, https://huggingface.co/blog/mlabonne/abliteration
- NVIDIA, June 2024, Nemotron-4 340B Technical Report, https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T_0.pdf (Architecture is decoder-only with GQA, SentencePiece tokenizer, causal attention masks, RoPE, 96 layers, 96 heads, 8 KV heads, 256,000 vocabulary, 18432 internal dimension, context window 4096, and uses squared RELU.)
- Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, Nouha Dziri, 26 Jun 2024, WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs, https://arxiv.org/abs/2406.18495
- Maksym Andriushchenko, Nicolas Flammarion, 16 Jul 2024, Does Refusal Training in LLMs Generalize to the Past Tense? https://arxiv.org/abs/2407.11969 Code: https://github.com/tml-epfl/llm-past-tense
- Kylie Robison, Jul 20, 2024, OpenAI’s latest model will block the ‘ignore all previous instructions’ loophole, https://www.theverge.com/2024/7/19/24201414/openai-chatgpt-gpt-4o-prompt-injection-instruction-hierarchy
- Xinyi Hou, Yanjie Zhao, Haoyu Wang, 3 Aug 2024, Voices from the Frontier: A Comprehensive Analysis of the OpenAI Developer Forum, https://arxiv.org/abs/2408.01687
- Asir Saadat, Tasmia Binte Sogir, Md Taukir Azam Chowdhury, Syem Aziz, 16 Oct 2024, When Not to Answer: Evaluating Prompts on GPT Models for Effective Abstention in Unanswerable Math Word Problems, https://arxiv.org/abs/2410.13029
- Kyle O'Brien, David Majercak, Xavier Fernandes, Richard Edgar, Jingya Chen, Harsha Nori, Dean Carignan, Eric Horvitz, Forough Poursabzi-Sangde, 18 Nov 2024, Steering Language Model Refusal with Sparse Autoencoders, https://arxiv.org/abs/2411.11296
Guardrails
- Aarushi Kansal, Chapter 4: Guardrails and AI: Building Safe and Controllable Apps, Building Generative AI-Powered Apps: A Hands-on Guide for Developers, Apress, https://www.amazon.com/Building-Generative-AI-Powered-Apps-Hands-ebook/dp/B0CTXXP1S4/
- Meta, July 2024 (accessed), Llama: Making safety tools accessible to everyone, https://llama.meta.com/trust-and-safety/
- Chip Huyen, Jul 25, 2024, Building A Generative AI Platform, https://huyenchip.com/2024/07/25/genai-platform.html
- Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, Lina Yao, 6 Jul 2024, AI Safety in Generative AI Large Language Models: A Survey, https://arxiv.org/abs/2407.18369
- Marko Zivkovic, Aug 06, 2024, Discovered Apple Intelligence prompts show Apple's attempt at preventing AI disaster, https://appleinsider.com/articles/24/08/06/discovered-apple-intelligence-prompts-show-apples-attempt-at-preventing-ai-disaster
- Rachel Curry, Aug 28 2024, Why companies including JPMorgan and Walmart are opting for internal gen AI assistants after initially restricting usage, https://www.cnbc.com/2024/08/28/why-jpmorgan-and-walmart-are-opting-for-internal-gen-ai-assistants.html
- Shayne Longpre, Stella Biderman, Alon Albalak, Hailey Schoelkopf, Daniel McDuff, Sayash Kapoor, Kevin Klyman, Kyle Lo, Gabriel Ilharco, Nay San, Maribeth Rauh, Aviya Skowron, Bertie Vidgen, Laura Weidinger, Arvind Narayanan, Victor Sanh, David Adelani, Percy Liang, Rishi Bommasani, Peter Henderson, Sasha Luccioni, Yacine Jernite, Luca Soldaini, 26 Jun 2024 (v2), The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources, https://arxiv.org/abs/2406.16746
- Jason Perlow, Nov. 6, 2024, The best open-source AI models: All your free-to-use options explained: Here are the best open-source and free-to-use AI models for text, images, and audio, organized by type, application, and licensing considerations. https://www.zdnet.com/article/the-best-open-source-ai-models-all-your-free-to-use-options-explained/
- McKinsey, November 14, 2024, What are AI guardrails? https://www.mckinsey.com/featured-insights/mckinsey-explainers/what-are-ai-guardrails
Jailbreak
Jailbreaking is the hack of using English to break into a computer system. Actually, it's not so much a violation of the server, but it does refer to a way of getting the LLM to answer questions that its developer probably doesn't want it to. In other words, it's a trick to bypass the "refusal" module of an LLM.
- Andy Arditi, Oscar Obeso, Aaquib111, wesg, Neel Nanda, 27th Apr 2024, Refusal in LLMs is mediated by a single direction, LessWrong, https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction
- Adva Nakash Peleg, May 30, 2024, An LLM Journey: From POC to Production, https://medium.com/cyberark-engineering/an-llm-journey-from-poc-to-production-6c5ec6a172fb
- Yu Wang, Xiaogeng Liu, Yu Li, Muhao Chen, Chaowei Xiao, 14 Mar 2024, AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting, https://arxiv.org/abs/2403.09513 Code: https://github.com/rain305f/AdaShield
- Jinhwa Kim, Ali Derakhshan, Ian G. Harris, 31 Oct 2023, Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield, https://arxiv.org/abs/2311.00172
- Zixuan Ni, Longhui Wei, Jiacheng Li, Siliang Tang, Yueting Zhuang, Qi Tian, 8 Aug 2023 (v2), Degeneration-Tuning: Using Scrambled Grid shield Unwanted Concepts from Stable Diffusion, https://arxiv.org/abs/2308.02552
- Xiao Peng, Tao Liu, Ying Wang, 3 Jun 2024 (v2), Genshin: General Shield for Natural Language Processing with Large Language Models, https://arxiv.org/abs/2405.18741
- Ayushi Nirmal, Amrita Bhattacharjee, Paras Sheth, Huan Liu, 8 May 2024 ( v2), Towards Interpretable Hate Speech Detection using Large Language Model-extracted Rationales, https://arxiv.org/abs/2403.12403 Code: https://github.com/AmritaBh/shield
- Shweta Sharma, 27 Jun 2024, Microsoft warns of ‘Skeleton Key’ jailbreak affecting many generative AI models, https://www.csoonline.com/article/2507702/microsoft-warns-of-novel-jailbreak-affecting-many-generative-ai-models.html
- Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, Nouha Dziri, 26 Jun 2024, WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs, https://arxiv.org/abs/2406.18495
- Maksym Andriushchenko, Nicolas Flammarion, 16 Jul 2024, Does Refusal Training in LLMs Generalize to the Past Tense? https://arxiv.org/abs/2407.11969 Code: https://github.com/tml-epfl/llm-past-tense
- Kylie Robison, Jul 20, 2024, OpenAI’s latest model will block the ‘ignore all previous instructions’ loophole, https://www.theverge.com/2024/7/19/24201414/openai-chatgpt-gpt-4o-prompt-injection-instruction-hierarchy
- Chip Huyen, Jul 25, 2024, Building A Generative AI Platform, https://huyenchip.com/2024/07/25/genai-platform.html
- Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, Lina Yao, 6 Jul 2024, AI Safety in Generative AI Large Language Models: A Survey, https://arxiv.org/abs/2407.18369
- Ayush RoyChowdhury, Mulong Luo,, Prateek Sahu,, Sarbartha Banerjee, Mohit Tiwari, Aug 2024, ConfusedPilot: Confused Deputy Risks in RAG-based LLMs, https://confusedpilot.info/confused_pilot_new.pdf
- Dr. Ashish Bamania, Sep 2024, ‘MathPrompt’ Embarassingly Jailbreaks All LLMs Available On The Market Today. A deep dive into how a novel LLM Jailbreaking technique called ‘MathPrompt’ works, why it is so effective, and why it needs to be patched as soon as possible to prevent harmful LLM content generation, https://bamania-ashish.medium.com/mathprompt-embarassingly-jailbreaks-all-llms-available-on-the-market-today-d749da26c6e8
- Y. Bai et al., "Backdoor Attack and Defense on Deep Learning: A Survey," in IEEE Transactions on Computational Social Systems, doi: 10.1109/TCSS.2024.3482723. https://ieeexplore.ieee.org/abstract/document/10744415
- Steve Jones, Oct 3, 2024, LLM Prompt Injection: Never send the request to the model. Classify, rewrite and reject, https://blog.metamirror.io/llm-prompt-injection-never-send-the-request-to-the-model-e8017269b96a
- Emet Bethany, Mazal Bethany, Juan Arturo Nolazco Flores, Sumit Kumar Jha, Peyman Najafirad, 5 Nov 2024 (v2), Jailbreaking Large Language Models with Symbolic Mathematics, https://arxiv.org/abs/2409.11445
- Alwin Peng, Julian Michael, Henry Sleight, Ethan Perez, Mrinank Sharma, 12 Nov 2024, Rapid Response: Mitigating LLM Jailbreaks with a Few Examples, https://arxiv.org/abs/2411.07494
- Kyle O'Brien, David Majercak, Xavier Fernandes, Richard Edgar, Jingya Chen, Harsha Nori, Dean Carignan, Eric Horvitz, Forough Poursabzi-Sangde, 18 Nov 2024, Steering Language Model Refusal with Sparse Autoencoders, https://arxiv.org/abs/2411.11296
Privacy
Research on privacy-related risks or concerns:
- Matthew Finnegan 14 Jun 2024, Microsoft delays Recall launch amid privacy concerns, ComputerWorld, https://www.computerworld.com/article/2147736/microsoft-delays-recall-launch-amid-privacy-concerns.html
- Rohan Goswami 21 June, 2024, Apple Intelligence won’t launch in EU in 2024 due to antitrust regulation, company says, CNBS, https://www.cnbc.com/2024/06/21/apple-ai-europe-dma-macos.html
- Dan Peng, Zhihui Fu, Jun Wang, 1 Jul 2024, PocketLLM: Enabling On-Device Fine-Tuning for Personalized LLMs, https://arxiv.org/abs/2407.01031 (Running fine-tuning on a smartphone via a low-memory optimization using a "derivative-free" "zeroth-order" technique called MeZo, with advantages such as privacy.)
- Jay Peters, Jul 4, 2024, OpenAI’s ChatGPT Mac app was storing conversations in plain text, https://www.theverge.com/2024/7/3/24191636/openai-chatgpt-mac-app-conversations-plain-text
- Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, Lina Yao, 6 Jul 2024, AI Safety in Generative AI Large Language Models: A Survey, https://arxiv.org/abs/2407.18369
- Y. Zhang, J. Zhang, S. Yue, W. Lu, J. Ren, X. Shen, August 2024, "Mobile Generative AI: Opportunities and Challenges," in IEEE Wireless Communications, vol. 31, no. 4, pp. 58-64, doi: 10.1109/MWC.006.2300576, https://ieeexplore.ieee.org/abstract/document/10628027/
- Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, Rui Kong, Yile Wang, Hanfei Geng, Jian Luan, Xuefeng Jin, Zilong Ye, Guanjing Xiong, Fan Zhang, Xiang Li, Mengwei Xu, Zhijun Li, Peng Li, Yang Liu, Ya-Qin Zhang, Yunxin Liu, 8 May 2024 (v2), Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security, https://arxiv.org/abs/2401.05459 https://github.com/MobileLLM/Personal_LLM_Agents_Survey
- Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
- Huan Yang, Deyu Zhang, Yudong Zhao, Yuanchun Li, Yunxin Liu, 6 Sep 2024, A First Look At Efficient And Secure On-Device LLM Inference Against KV Leakage, https://arxiv.org/abs/2409.04040 (Security issues where KV caches can be data leaks as they may contain encodings of private information.)
- Apple, Sep 2024, Apple Intelligence comes to iPhone, iPad, and Mac starting next month, https://www.apple.com/newsroom/2024/09/apple-intelligence-comes-to-iphone-ipad-and-mac-starting-next-month/
- Donghwan Rho, Taeseong Kim, Minje Park, Jung Woo Kim, Hyunsik Chae, Jung Hee Cheon, Ernest K. Ryu, 3 Oct 2024, Encryption-Friendly LLM Architecture, https://arxiv.org/abs/2410.02486
- Jiankun Wei, Abdulrahman Abdulrazzag, Tianchen Zhang, Adel Muursepp, Gururaj Saileshwar, 5 Nov 2024 (v2), Privacy Risks of Speculative Decoding in Large Language Models, https://arxiv.org/abs/2411.01076
More Research on AI Safety
Research papers that cover various other AI safety issues:
- J Schuett, N Dreksler, M Anderljung, 2023, Towards best practices in AGI safety and governance: A survey of expert opinion, arXiv preprint, https://arxiv.org/abs/2305.07153
- Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A. Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, Shane Legg, Nov 2017, AI Safety Gridworlds, https://arxiv.org/abs/1711.09883
- J. Schuett. Risk management in the Artificial Intelligence Act. European Journal of Risk Regulation, pages 1–19, 2023. https://arxiv.org/abs/2212.03109
- Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, Dan Mané, July 2016, Concrete Problems in AI Safety, https://arxiv.org/abs/1606.06565
- Mark O Riedl and Brent Harrison. 2018. Enter the matrix: A virtual world approach to safely interruptable autonomous systems. arXiv preprint arXiv:1703.10284, 2017 (revised Nov 2018). https://arxiv.org/abs/1703.10284v2
- M. Brundage, K. Mayer, T. Eloundou, S. Agarwal, S. Adler, G. Krueger, J. Leike, and P. Mishkin. OpenAI, 2022, Lessons learned on language model safety and misuse. https://openai.com/research/language-model-safety-and-misuse
- OpenAI, Feb 2023, Planning for AGI and beyond, https://openai.com/blog/planning-for-agi-and-beyond
- Andreas Cebulla, Zygmunt Szpak, Catherine Howell, Genevieve Knight & Sazzad Hussain, 2022, Applying ethics to AI in the workplace: the design of a scorecard for Australian workplace health and safety, Network Research, 13 May 2022, volume 38, pages919–935 (2023) https://link.springer.com/article/10.1007/s00146-022-01460-9
- Mohammad Ghavamzadeh, Marek Petrik, and Yinlam Chow. Safe policy improvement by minimizing robust baseline regret. In Advances in Neural Information Processing Systems, pages 2298–2306, 2016. https://arxiv.org/abs/1607.03842v1
- Laurent Orseau and Stuart Armstrong. Safely interruptible agents. In Uncertainty in Artificial Intelligence, pages 557–566, 2016. PDF: http://www.auai.org/uai2016/proceedings/papers/68.pdf
- Tate Ryan-Mosley, August 14, 2023, AI isn’t great at decoding human emotions. So why are regulators targeting the tech? MIT Technology Review, https://www.technologyreview.com/2023/08/14/1077788/ai-decoding-human-emotions-target-for-regulators/
- Maria Korolov, 15 May 2024, 10 things to watch out for with open source gen AI, CIO, https://www.cio.com/article/2104280/10-things-to-watch-out-for-with-open-source-gen-ai.html
- Andy Arditi, Oscar Obeso, Aaquib111, wesg, Neel Nanda, 27th Apr 2024, Refusal in LLMs is mediated by a single direction, LessWrong, https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction
- Google, Responsible Generative AI Toolkit, Feb 2024, https://ai.google.dev/responsible
- Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, Cyril Zhang, June 2023, Exposing Attention Glitches with Flip-Flop Language Modeling, https://arxiv.org/abs/2306.00946
- Jon Christian, Jan 30, 2023, CNET's Article-Writing AI Is Already Publishing Very Dumb Errors, https://futurism.com/cnet-ai-errors
- R Dubin, 2023. Disarming Steganography Attacks Inside Neural Network Models, arXiv preprint arXiv:2309.03071, https://arxiv.org/pdf/2309.03071.pdf
- Michael O'Neill, Mark Connor, 6 Jul 2023, Amplifying Limitations, Harms and Risks of Large Language Models, https://arxiv.org/abs/2307.04821
- Lucas Mearian, 14 Mar 2024, AI hallucination mitigation: two brains are better than one, https://www.computerworld.com/article/1612465/ai-hallucination-mitigation-two-brains-are-better-than-one.html
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations.)
- Laura Manduchi, Kushagra Pandey, Robert Bamler, Ryan Cotterell, Sina Däubener, Sophie Fellenz, Asja Fischer, Thomas Gärtner, Matthias Kirchler, Marius Kloft, Yingzhen Li, Christoph Lippert, Gerard de Melo, Eric Nalisnick, Björn Ommer, Rajesh Ranganath, Maja Rudolph, Karen Ullrich, Guy Van den Broeck, Julia E Vogt, Yixin Wang, Florian Wenzel, Frank Wood, Stephan Mandt, Vincent Fortuin, 28 Feb 2024, On the Challenges and Opportunities in Generative AI, https://arxiv.org/abs/2403.00025
- Zhexin Zhang, Yida Lu, Jingyuan Ma, Di Zhang, Rui Li, Pei Ke, Hao Sun, Lei Sha, Zhifang Sui, Hongning Wang, Minlie Huang, 26 Feb 2024, ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors, https://arxiv.org/abs/2402.16444, Code: https://github.com/thu-coai/shieldlm
- Peter Dizikes, December 11, 2023, MIT group releases white papers on governance of AI, MIT News, https://news.mit.edu/2023/mit-group-releases-white-papers-governance-ai-1211
- MAK Raiaan, MSH Mukta, K Fatema, NM Fahad, 2023 A Review on Large Language Models: Architectures, Applications, Taxonomies, Open Issues and Challenges, https://www.techrxiv.org/articles/preprint/A_Review_on_Large_Language_Models_Architectures_Applications_Taxonomies_Open_Issues_and_Challenges/24171183/1/files/42414054.pdf
- Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen3 Ruoxi Jia, Prateek Mittal, Peter Henderson, Oct 2023, Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! https://arxiv.org/abs/2310.03693v1 Code: https://llm-tuning-safety.github.io/
- Y Hu, J Setpal, D Zhang, J Zietek, J Lambert, 2023, BoilerBot: A Reliable Task-oriented Chatbot Enhanced with Large Language Models, https://assets.amazon.science/8c/03/80c814a749f58e73a1aeda2ff282/boilerbot-tb2-final-2023.pdf
- S Latifi, 2023, Efficient and Dependable Deep Learning Systems Ph.D. Thesis, Computer Science and Engineering, University of Michigan, https://deepblue.lib.umich.edu/bitstream/handle/2027.42/176548/salar_1.pdf?sequence=1
- N. Soares. 2023, Comments on OpenAI’s “Planning for AGI and beyond”. https://www.lesswrong.com/posts/uxnjXBwr79uxLkifG
- K Ramesh, A Chavan, S Pandit, 2023, A Comparative Study on the Impact of Model Compression Techniques on Fairness in Language Models, Microsoft Research, https://aclanthology.org/2023.acl-long.878.pdf https://www.microsoft.com/en-us/research/uploads/prod/2023/07/3687_Paper.pdf
- David Spuler, March 2024, Chapter 43. Overview of AI Research, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Rithesh Murthy, Liangwei Yang, Juntao Tan, Tulika Manoj Awalgaonkar, Yilun Zhou, Shelby Heinecke, Sachin Desai, Jason Wu, Ran Xu, Sarah Tan, Jianguo Zhang, Zhiwei Liu, Shirley Kokane, Zuxin Liu, Ming Zhu, Huan Wang, Caiming Xiong, Silvio Savarese, 12 Jun 2024, MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases, https://arxiv.org/abs/2406.10290
- Shicheng Xu, Liang Pang, Mo Yu, Fandong Meng, Huawei Shen, Xueqi Cheng, Jie Zhou, 12 Jun 2024 (v2), Unsupervised Information Refinement Training of Large Language Models for Retrieval-Augmented Generation, https://arxiv.org/abs/2402.18150 (Analysis about how LLMs can mishandle information retrieved from a datastore and how to make LLMs better at handling RAG information using a specialized training regime.)
- OpenAI, Moderation: Learn how to build moderation into your AI applications, 2024, https://platform.openai.com/docs/guides/moderation
- Azure, 06/13/2024, Content filtering, https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/content-filter?tabs=warning%2Cpython
- Yu Wang, Xiaogeng Liu, Yu Li, Muhao Chen, Chaowei Xiao, 14 Mar 2024, AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting, https://arxiv.org/abs/2403.09513 Code: https://github.com/rain305f/AdaShield
- Jinhwa Kim, Ali Derakhshan, Ian G. Harris, 31 Oct 2023, Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield, https://arxiv.org/abs/2311.00172
- Apple, June 2024, Introducing Apple’s On-Device and Server Foundation Models, https://machinelearning.apple.com/research/introducing-apple-foundation-models (Apple's on-device models feature optimizations including small models, grouped query attention, 2-bit/4-bit quantization including activation quantization, shared embedding/unembedding tensors, small-ish vocabulary size of 49k, an undisclosed efficient KV cache optimization for neural engines, and layer-specific 16-bit LoRA/QLoRA adapters of size "10s of megabytes" for fine-tuned specialized model versions, also sometimes in 2-bit/4-bit, claiming speed rates of 0.6ms/token in prefill, and 30 tokens per second in decoding.)
- NVIDIA, June 2024, Nemotron-4 340B Technical Report, https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T_0.pdf (Architecture is decoder-only with GQA, SentencePiece tokenizer, causal attention masks, RoPE, 96 layers, 96 heads, 8 KV heads, 256,000 vocabulary, 18432 internal dimension, context window 4096, and uses squared RELU.)
- Frank Chung, June 23, 2024, ‘I need to go outside’: Young people ‘extremely addicted’ as Character.AI explodes, https://www.news.com.au/technology/online/internet/i-need-to-go-outside-young-people-extremely-addicted-as-characterai-explodes/news-story/5780991c61455c680f34b25d5847a341
- Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, 4 Mar 2022, Training language models to follow instructions with human feedback, https://arxiv.org/abs/2203.02155 (The original 2022 InstructGPT paper from OpenAI.)
- Valentina Alto, 2024, Chapter 12: Responsible AI, Building LLM-Powered Applications: Create intelligence apps and agents with large language models, Packt Publishing, https://www.amazon.com/Building-LLM-Apps-Intelligent-Language/dp/1835462316/
- Aarushi Kansal, Chapter 4: Guardrails and AI: Building Safe and Controllable Apps, Building Generative AI-Powered Apps: A Hands-on Guide for Developers, Apress, https://www.amazon.com/Building-Generative-AI-Powered-Apps-Hands-ebook/dp/B0CTXXP1S4/
- Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, Nouha Dziri, 26 Jun 2024, WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs, https://arxiv.org/abs/2406.18495
- Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, Lina Yao, 6 Jul 2024, AI Safety in Generative AI Large Language Models: A Survey, https://arxiv.org/abs/2407.18369
- Marko Zivkovic, Aug 06, 2024, Discovered Apple Intelligence prompts show Apple's attempt at preventing AI disaster, https://appleinsider.com/articles/24/08/06/discovered-apple-intelligence-prompts-show-apples-attempt-at-preventing-ai-disaster
- Mack DeGeurin, Aug 9, 2024, Researchers worry about AI turning humans into jerks: OpenAI safety researchers think GPT4o could influence 'social norms.', https://www.popsci.com/technology/openai-jerks/
- OpenAI, August 8, 2024 GPT-4o System Card, https://openai.com/index/gpt-4o-system-card/
- Rohin Shah, Seb Farquhar, Anca Dragan, 21st Aug 2024, AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work, https://www.alignmentforum.org/posts/79BPxvSsjzBkiSyTq/agi-safety-and-alignment-at-google-deepmind-a-summary-of
- Shayne Longpre, Stella Biderman, Alon Albalak, Hailey Schoelkopf, Daniel McDuff, Sayash Kapoor, Kevin Klyman, Kyle Lo, Gabriel Ilharco, Nay San, Maribeth Rauh, Aviya Skowron, Bertie Vidgen, Laura Weidinger, Arvind Narayanan, Victor Sanh, David Adelani, Percy Liang, Rishi Bommasani, Peter Henderson, Sasha Luccioni, Yacine Jernite, Luca Soldaini, 26 Jun 2024 (v2), The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources, https://arxiv.org/abs/2406.16746
- Thomas Mildner, Orla Cooney, Anna-Maria Meck, Marion Bartl, Gian-Luca Savino, Philip R. Doyle, Diego Garaialde, Leigh Clark, John Sloan, Nina Wenig, Rainer Malaka, Jasmin Niess, 26 Jan 2024, Listening to the Voices: Describing Ethical Caveats of Conversational User Interfaces According to Experts and Frequent Users, Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI '24), May 11--16, 2024, Honolulu, HI, USA, https://arxiv.org/abs/2401.14746 https://doi.org/https://doi.org/10.1145/3613904.3642542
- Kyle Wiggers, September 4, 2024, Ilya Sutskever’s startup, Safe Superintelligence, raises $1B, https://techcrunch.com/2024/09/04/ilya-sutskevers-startup-safe-super-intelligence-raises-1b/
- Balasubramaniam S. , Vanajaroselin Chirchi, Seifedine Kadry, Moorthy Agoramoorthy, Gururama Senthilvel P., Satheesh Kumar K., and Sivakumar T. A., Oct 2024, The Road Ahead: Emerging Trends, Unresolved Issues, and ConcludingRemarksinGenerativeAI—AComprehensiveReview, International Journal of Intelligent Systems, Volume 2024, Article ID 4013195, 38 pages, https://doi.org/10.1155/2024/4013195 https://www.researchgate.net/profile/Balasubramaniam-s-2/publication/384729387_The_Road_Ahead_Emerging_Trends_Unresolved_Issues_and_Concluding_Remarks_in_Generative_AI-A_Comprehensive_Review/links/6705560cf5eb7108c6e5d261/The-Road-Ahead-Emerging-Trends-Unresolved-Issues-and-Concluding-Remarks-in-Generative-AI-A-Comprehensive-Review.pdf
- Xinyi Zeng, Yuying Shang, Yutao Zhu, Jiawei Chen, Yu Tian, 9 Oct 2024, Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level, https://arxiv.org/abs/2410.06809
- Michael Nuñez, October 15, 2024, Anthropic just made it harder for AI to go rogue with its updated safety policy, https://venturebeat.com/ai/anthropic-just-made-it-harder-for-ai-to-go-rogue-with-its-updated-safety-policy/
- ETO, Apr 2024, The state of global AI safety research, https://eto.tech/blog/state-of-global-ai-safety-research/
- Leon Derczynski, Christopher Parisien, Nikki Pope, Michael Boone, Nov 2024, NVIDIA Approaches to AI Trust and Safety: Innovation and Tools, https://www.nvidia.com/en-us/on-demand/session/aisummitdc24-sdc1088/?playlistId=playList-c6a9450c-c790-462d-a058-0bacacd5d370
- Y. Bai et al., "Backdoor Attack and Defense on Deep Learning: A Survey," in IEEE Transactions on Computational Social Systems, doi: 10.1109/TCSS.2024.3482723. https://ieeexplore.ieee.org/abstract/document/10744415
- OpenAI, November 21, 2024, Advancing red teaming with people and AI, https://openai.com/index/advancing-red-teaming-with-people-and-ai/
- Patrick Mineault, Niccolò Zanichelli, Joanne Zichen Peng, Anton Arkhipov, Eli Bingham, Julian Jara-Ettinger, Emily Mackevicius, Adam Marblestone, Marcelo Mattar, Andrew Payne, Sophia Sanborn, Karen Schroeder, Zenna Tavares, Andreas Tolias, 27 Nov 2024, NeuroAI for AI Safety, https://arxiv.org/abs/2411.18526
- Maria Korolov and Michael Hill, 03 Dec 2024, 10 most critical LLM vulnerabilities, https://www.csoonline.com/article/575497/owasp-lists-10-most-critical-large-language-model-vulnerabilities.html
- Mayank Vatsa, Anubhooti Jain, Richa Singh, 7 Dec 2023, Adventures of Trustworthy Vision-Language Models: A Survey, https://arxiv.org/abs/2312.04231
- Yedi Zhang, Yufan Cai, Xinyue Zuo, Xiaokun Luan, Kailong Wang, Zhe Hou, Yifan Zhang, Zhiyuan Wei, Meng Sun, Jun Sun, Jing Sun, Jin Song Dong, 9 Dec 2024, The Fusion of Large Language Models and Formal Methods for Trustworthy AI Agents: A Roadmap, https://arxiv.org/abs/2412.06512
More AI Research
Read more about: