Constitutional-AI-awesome-papers

A list of papers on 'Constitutional AI Systems' and 'AI under Ethical Guidelines'. This GitHub repository is intended for personal study and is continuously updated. Recommendations of related works are very welcome.

Papers

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Anthropic [Link] arXiv Nov. 2022
Constitutional AI: Harmlessness from AI Feedback

Anthropic [Link] arXiv Dec. 2022
Moral Stories: Situated Reasoning about Norms, Intents, Actions, and their Consequences

Denis Emelin, Ronan Le Bras, Jena D. Hwang, Maxwell Forbes, Yejin Choi [Link] EMNLP 2021
Second Thoughts are Best: Learning to Re-Align With Human Values from Text Edits

Ruibo Liu, Chenyan Jia, Ge Zhang, Ziyu Zhuang, Tony X. Liu, Soroush Vosoughi [Link] NeurIPS 2022
The Capacity for Moral Self-Correction in Large Language Models

Anthropic [Link] arXiv Feb. 2023
Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision

Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, Chuang Gan [Link] arXiv May 2023
Prometheus: Inducing Fine-grained Evaluation Capability in Language Models

Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, Minjoon Seo [Link] arXiv Oct. 2023
Generating Summaries with Controllable Readability Levels

Leonardo F. R. Ribeiro, Mohit Bansal, Markus Dreyer [Link] arXiv Oct. 2023
Personalized Soups: Personalized Large Language Model Alignment via Post-hoc Parameter Merging

Joel Jang, Seungone Kim, Bill Yuchen Lin, Yizhong Wang, Jack Hessel, Luke Zettlemoyer, Hannaneh Hajishirzi, Yejin Choi, Prithviraj Ammanabrolu [Link] arXiv Oct. 2023
Collective Constitutional AI: Aligning a Language Model with Public Input

Anthropic [Link] arXiv Oct. 2023
Specific versus General Principles for Constitutional AI

Anthropic [Link] arXiv Oct. 2023

Recommended Projects