LLM Reading List

Prompt Engineering

  • Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution. pdf (Google DeepMind) arXiv, 2023.

  • See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning. pdf arXiv, 2023.

  • Scaling Instruction-Finetuned Language Models. pdf arXiv, 2022.

  • Automatic Chain of Thought Prompting in Large Language Models. pdf ICLR, 2023.

  • Multimodal Chain-of-Thought Reasoning in Language Models. pdf arXiv, 2023.

  • Design of a Chain-of-Thought in Math Problem Solving. pdf arXiv, 2023.

  • Large Language Models Are Human-Level Prompt Engineers. pdf ICLR, 2023.

  • ReAct: Synergizing Reasoning and Acting in Language Models. pdf ICLR, 2023.

  • Prompting Is Programming: A Query Language for Large Language Models. pdf PLDI, 2023.

  • Cue-CoT: Chain-of-thought Prompting for Responding to In-depth Dialogue Questions with LLMs. pdf arXiv, 2023.

  • Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine. pdf arXiv, 2023.

  • Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning. pdf EMNLP, 2023.
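
Several of the chain-of-thought papers above (e.g., Automatic Chain of Thought Prompting, Multimodal Chain-of-Thought) build prompts by prepending worked reasoning demonstrations to the test question. A minimal sketch of that construction follows; the demonstration and the trigger phrase are illustrative, not taken from any of the listed papers, and no particular API client is assumed.

```python
# Minimal few-shot chain-of-thought (CoT) prompt construction.
# The demonstration below is illustrative, not from the cited papers.

DEMOS = [
    ("Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
     "How many balls does he have now?",
     "Roger starts with 5 balls. 2 cans of 3 balls is 6 more. "
     "5 + 6 = 11. The answer is 11."),
]

def build_cot_prompt(question: str) -> str:
    """Prepend worked demonstrations, then pose the new question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in DEMOS]
    # "Let's think step by step" is the zero-shot CoT trigger phrase;
    # with demonstrations present it simply invites the same format.
    parts.append(f"Q: {question}\nA: Let's think step by step.")
    return "\n\n".join(parts)

print(build_cot_prompt("A train travels 60 km in 1.5 hours. What is its average speed?"))
```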

Robustness and Safety Alignment

  • RARR: Researching and Revising What Language Models Say, Using Language Models. pdf arXiv, 2023.

  • Fundamental Limitations of Alignment in Large Language Models. pdf arXiv, 2023.

  • DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. pdf arXiv, 2023.

  • Large Language Model Alignment: A Survey. pdf arXiv, 2023.

  • The Janus Interface: How Fine-Tuning in Large Language Models Amplifies the Privacy Risk. pdf arXiv, 2023.

  • Identifying and Mitigating the Security Risks of Generative AI. pdf arXiv, 2023.

  • The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning. pdf arXiv, 2023.

  • Chain-of-Verification Reduces Hallucination in Large Language Models. pdf arXiv, 2023.

  • Language Is Not All You Need: Aligning Perception with Language Models. pdf arXiv, 2023.
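
The Chain-of-Verification paper above reduces hallucination with a four-step pipeline: draft an answer, plan verification questions, answer them independently of the draft, then revise. Below is a minimal sketch under stated assumptions: `llm` is a hypothetical single-turn completion function, and the prompt wording is illustrative, not the paper's.

```python
# Sketch of the Chain-of-Verification (CoVe) pipeline:
# draft -> plan verification questions -> answer independently -> revise.
# `llm` is a hypothetical completion function; swap in any client.
from typing import Callable, List

def chain_of_verification(query: str, llm: Callable[[str], str]) -> str:
    # 1. Draft a baseline response.
    draft = llm("Answer the question.\nQ: " + query + "\nA:")
    # 2. Plan verification questions that fact-check the draft.
    plan = llm("List short questions, one per line, that would verify the "
               "factual claims in this answer:\n" + draft)
    questions: List[str] = [q.strip() for q in plan.splitlines() if q.strip()]
    # 3. Answer each question in isolation (without showing the draft),
    #    so the verifier cannot simply repeat the draft's mistakes.
    evidence = []
    for q in questions:
        answer = llm("Answer concisely.\nQ: " + q + "\nA:")
        evidence.append("Q: " + q + "\nA: " + answer)
    # 4. Revise the draft so it is consistent with the verified facts.
    return llm("Question: " + query + "\n"
               "Draft answer: " + draft + "\n"
               "Verification Q&A:\n" + "\n".join(evidence) + "\n"
               "Rewrite the draft answer so it agrees with the Q&A above.")
```

Any stub that returns a string is enough to exercise the control flow before wiring in a real model.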

Jailbreak

  • GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher. pdf arXiv, 2023.

  • Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models. pdf arXiv, 2023.

  • Visual Adversarial Examples Jailbreak Aligned Large Language Models. pdf arXiv, 2023.

  • Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! pdf website arXiv, 2023.

  • Jailbreaker: Automated Jailbreak Across Multiple Large Language Model Chatbots. pdf NDSS, 2024.

  • Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study. pdf arXiv, 2023.

  • Multi-step Jailbreaking Privacy Attacks on ChatGPT. pdf arXiv, 2023.

  • Jailbroken: How Does LLM Safety Training Fail? pdf arXiv, 2023.

  • [workshop] On the Privacy Risk of In-context Learning. pdf arXiv, 2023.

  • Jailbreaking Black Box Large Language Models in Twenty Queries. pdf arXiv, 2023.

  • Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation. pdf arXiv, 2023.

  • Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models. pdf arXiv, 2023.

  • AutoDAN: Automatic and Interpretable Adversarial Attacks on Large Language Models. pdf arXiv, 2023.

  • "Open Sesame! Universal Black Box Jailbreaking of Large Language Models. pdf arXiv, 2023.

Others

  • LAMBRETTA: Learning to Rank for Twitter Soft Moderation. pdf S&P, 2023.

  • SoK: Content Moderation in Social Media, from Guidelines to Enforcement, and Research to Practice. pdf arXiv, 2023.

  • You Only Prompt Once: On the Capabilities of Prompt Learning on Large Language Models to Tackle Toxic Content. pdf S&P, 2024.

  • Rule By Example: Harnessing Logical Rules for Explainable Hate Speech Detection. pdf ACL, 2023.

  • Last One Standing: A Comparative Analysis of Security and Privacy of Soft Prompt Tuning, LoRA, and In-Context Learning. pdf arXiv, 2023.

  • Is ChatGPT a General-Purpose Natural Language Processing Task Solver? pdf arXiv, 2023.

  • Pretraining Data Mixtures Enable Narrow Model Selection Capabilities in Transformer Models. pdf arXiv, 2023.

  • [website] Jailbreaking Large Language Models: Techniques, Examples, Prevention Methods. link

  • Text Embeddings Reveal (Almost) As Much As Text. pdf EMNLP, 2023.
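
You Only Prompt Once (above) tackles toxic-content detection with prompt learning rather than purpose-built classifiers. A zero-shot variant is sketched below; the prompt wording, label set, and fallback are illustrative assumptions, not the paper's method, and `llm` is again a placeholder completion function.

```python
# Zero-shot toxicity classification via prompting, in the spirit of
# "You Only Prompt Once". Prompt wording, labels, and the fallback are
# illustrative assumptions; `llm` is a hypothetical completion function.

def classify_toxicity(text: str, llm) -> str:
    prompt = ("Label the following text as TOXIC or NON-TOXIC. "
              "Reply with the label only.\n"
              "Text: " + text + "\nLabel:")
    label = llm(prompt).strip().upper()
    # Fall back to NON-TOXIC when the model replies off-format.
    return label if label in ("TOXIC", "NON-TOXIC") else "NON-TOXIC"
```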
