
QD Synthetic Data

This repository serves as the main organizational tool for the survey paper "A Survey of Methods for Generating Quality and Diverse Synthetic Data with LLMs". We are collecting papers as GitHub issues with the tag Paper. To add a new paper, first check that it is not already present, then fill out the new paper issue template here. To close the issue, you (or someone else) can open a PR containing a report on the paper using the provided format here. You can find a roadmap for the project on this GitHub Projects board. Weekly meeting notes and recordings are housed here.
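For reference, the Paper-tagged issues can also be pulled programmatically. Below is a minimal sketch using the public GitHub REST API; the `OWNER/REPO` path is a placeholder for wherever this repository actually lives, not its confirmed location.

```python
import requests

# Placeholder: substitute the actual owner/repo path for this project.
REPO = "OWNER/qdsyntheticdata"

def list_paper_issues(repo: str = REPO):
    """Fetch open issues labeled 'Paper' via the GitHub REST API (paginated)."""
    url = f"https://api.github.com/repos/{repo}/issues"
    params = {"labels": "Paper", "state": "open", "per_page": 100, "page": 1}
    papers = []
    while True:
        resp = requests.get(url, params=params, timeout=30)
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        # Pull requests also show up on the issues endpoint; keep issues only.
        papers.extend(item for item in batch if "pull_request" not in item)
        params["page"] += 1
    return papers

if __name__ == "__main__":
    for issue in list_paper_issues():
        print(f"#{issue['number']}: {issue['title']}")
```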

Project Description

The aim of this project is to catalog the many current ad hoc methods for synthetic data generation via LLMs with a focus on understanding their impact on two metrics: dataset quality and dataset diversity. Ideally, this can be done under a single conceptual framework. An important sub-question we will need to discuss is how to appropriately define these metrics, in particular dataset diversity.
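To make the definitional question concrete: one common (but far from canonical) operationalization of dataset diversity is lexical, e.g. the distinct-n ratio, the fraction of n-grams in a corpus that are unique. The sketch below computes it in plain Python; it is one candidate measure among many (embedding-based measures capture semantic rather than lexical spread), not the definition this survey will settle on.

```python
from collections import Counter

def distinct_n(texts, n=2):
    """Distinct-n: unique n-grams / total n-grams across a corpus.
    Values near 1.0 suggest lexically diverse generations; values near 0.0
    suggest heavy repetition. Purely lexical, so semantic diversity is ignored."""
    ngrams = Counter()
    for text in texts:
        tokens = text.split()  # naive whitespace tokenization
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0

# Toy example: a repetitive synthetic dataset scores lower than a varied one.
repetitive = ["the cat sat on the mat"] * 5
varied = ["the cat sat on the mat", "a dog ran in the park",
          "birds sing at dawn", "rain fell over the city", "children play outside"]
print(distinct_n(repetitive), distinct_n(varied))  # ~0.2 vs 1.0
```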

Overall, this will roughly consist of three stages:

  1. Collection: finding as many papers as we can on LLM-based synthetic data generation, along with measures of, and techniques for improving, dataset quality and diversity.
  2. Synthesis: writing a survey of our findings by organizing methods into a single conceptual framework. We may also want to do some benchmarking.
  3. Next steps: identifying promising research directions as recommendations to the broader community.

Questions

Some important questions we will want to think about addressing:

  • How to define quality?
  • How to define diversity?
  • The impact of the model
    • Model size
    • Pretraining data
    • Fine-tuning data
  • The impact of the sampling methodology (see the sketch after this list)
    • Type of prompt
    • Sampling algorithm
  • The impact of the task domain
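As an illustration of the sampling-methodology knobs above, the following sketch uses the Hugging Face transformers library (with GPT-2 purely as a stand-in model) to draw several completions per prompt under different temperature settings with nucleus sampling, the kind of sweep that would let quality and diversity be compared side by side. It is an assumed setup for illustration, not a protocol the survey prescribes.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example model choice; any causal LM would do for this illustration.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Write a short math word problem:"  # 'type of prompt' knob
inputs = tokenizer(prompt, return_tensors="pt")

# 'Sampling algorithm' knobs: temperature and nucleus (top-p) sampling.
for temperature in (0.7, 1.0, 1.3):
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=temperature,
        top_p=0.95,
        max_new_tokens=40,
        num_return_sequences=4,
        pad_token_id=tokenizer.eos_token_id,
    )
    samples = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    # Downstream, each batch of samples would be scored for quality and diversity.
    print(f"temperature={temperature}: {len(samples)} samples")
```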

Meeting Time

Meetings are at 5:30 PM EST on Thursdays. Email [email protected] or DM Alex Havrilla on Discord for access.

Links


Contributors

alon-albalak, baberabb, cblagden, dahoas, daia99, dmahan93, giove91, haileyschoelkopf, kcoost, lauraaisling, maia-iyer, mistobaan, mmhamdy, reshinthadithyan, srishti-git1110, veratr86, xu3kev


Papers

  • #4

  • #5

  • #6

  • #7

  • #8

  • #9

  • #10

  • #11

  • #12

  • #13

  • #14

  • #15

  • #16

  • #17

  • #18

  • #19

  • #20

  • #21

  • #22

  • #23

  • #24

  • #25

  • #26

  • #27

  • #28

  • #29

  • #30

  • #31

  • #32

  • #33

  • #34

  • #35

  • #36

  • #37

  • #38

  • #39

  • #40

  • #41

  • #42

  • #43

  • #44

  • #45

  • #46

  • #47

  • #48

  • #49

  • #50

  • #51

  • #52

  • #53

  • #54

  • #55

  • #56

  • #57

  • #58

  • #59

  • #60

  • #61

  • #62

  • #63

  • #64

  • #65

  • #66

  • #67

  • #68

  • #69

  • #70

  • #71

  • #72

  • #73

  • #74

  • #75

  • #76

  • #77

  • #78

  • #79

  • #80

  • #81

  • #82

  • #83

  • #84

  • #85

  • #86

  • #87

  • #88

  • #89

  • #90

  • #91

  • #92

  • #93

  • #94

  • #95

  • #96

  • #97

  • #98

  • #99

  • #100

  • #101

  • #102

  • #103

  • #104

  • #105

  • #106

  • #107

  • #108

  • #109

  • #110

  • #111

  • #112

  • #113

  • #114

  • #115

  • #116

  • #117

  • #118

  • #119

  • #120

  • #121

  • #122

  • #123

  • #124

  • #125

  • #126

  • #127

  • #128

  • #129

  • #130

  • West-of-N: Synthetic Preference Generation for Improved Reward Modeling

  • Rishabh synthetic data talk

  • Navigating the Geometry of Language: A New Approach to Synthetic Text Generation

  • Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models

  • Generative AI for Synthetic Data Generation: Methods, Challenges and the Future

  • Comprehensive Exploration of Synthetic Data Generation: A Survey

  • Tuning Language Models as Training Data Generators for Augmentation-Enhanced Few-Shot Learning

  • Generating Training Data with Language Models: Towards Zero-Shot Language Understanding

  • ZEROGEN: Efficient Zero-shot Learning via Dataset Generation

  • Self-Guided Noise-Free Data Generation for Efficient Zero-Shot Learning

  • ProGen: Progressive Zero-shot Dataset Generation via In-context Feedback

  • ReGen: Zero-Shot Text Classification via Training Data Generation with Progressive Dense Retrieval

  • Prototypical Verbalizer for Prompt-based Few-shot Tuning

  • Knowledgeable Prompt-tuning: Incorporating Knowledge into Prompt Verbalizer for Text Classification

  • Mixture of Soft Prompts for Controllable Data Generation

  • Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm

  • Differentiable Quality Diversity

  • Diversity of Thought Improves Reasoning Abilities of Large Language Models

  • Instruction Diversity Drives Generalization To Unseen Tasks

  • Diversity-Aware Ensembling of Language Models Based on Topological Data Analysis

  • Increasing Diversity While Maintaining Accuracy: Text Data Generation with Large Language Models and Human Interventions

  • Texygen: A Benchmarking Platform for Text Generation Models

  • Impossible Distillation: from Low-Quality Model to High-Quality Dataset & Model for Summarization and Paraphrasing

  • Enhancing Chat Language Models by Scaling High-quality Instructional Conversations

  • Instruction Mining: High-Quality Instruction Data Selection for Large Language Models

  • Co-training and Co-distillation for Quality Improvement and Compression of Language Models

  • In-context Reinforcement Learning with Algorithm Distillation

  • SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization

  • Dataset Distillation: A Comprehensive Review

  • Knowledge Distillation of Large Language Models

  • GKD: Generalized Knowledge Distillation for Auto-regressive Sequence Models

  • RLCD: Reinforcement Learning from Contrast Distillation for Language Model Alignment

  • IMPLICIT CHAIN OF THOUGHT REASONING VIA KNOWLEDGE DISTILLATION

  • Latent Dataset Distillation with Diffusion Models

  • Large Language Models Can Self-improve

  • Self-Programming Artificial Intelligence Using Code-Generating Language Models

  • Self-Instruct: Aligning Language Model with Self Generated Instructions

  • Self Supervision Does Not Help Natural Language Supervision at Scale

  • The Capacity for Moral Self-Correction in Large Language Models

  • Self-planning Code Generation with Large Language Model

  • Reflexion: an autonomous agent with dynamic memory and self-reflection

  • Self-Refine: Iterative Refinement with Self-Feedback

  • Teaching Large Language Models to Self-Debug

  • Learning to Reason and Memorize with Self-Notes

  • The Socratic Method for Self-Discovery in Large Language Models

  • CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing

  • Language Model Self-Improvement by Reinforcement Learning Contemplation

  • Improving Language Model Negotiation with Self-Play and In-Context Learning from AI Feedback

  • GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction

  • Generating Sequences by Learning to Self-Correct

  • SelFee: Iterative Self-Revising LLM Empowered by Self-Feedback Generation

  • Demystifying GPT Self-Repair for Code Generation

  • Bring Your Own Data! Self-Supervised Evaluation for Large Language Models

  • Long-range Language Modeling with Self-retrieval

  • Automatic Calibration and Error Correction for Large Language Models via Pareto Optimal Self-Supervision

  • Self-consistency for open-ended generations

  • SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning

  • Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies

  • Shepherd: A Critic for Language Model Generation

  • Self-Alignment with Instruction Backtranslation

  • Reinforced Self-Training (ReST) for Language Modeling

  • Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

  • Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation

  • SALMON: Self-Alignment with Principle-Following Reward Models

  • Self-Specialization: Uncovering Latent Expertise within Large Language Models

  • Large Language Models are Better Reasoners with Self-Verification

  • SelfEval: Leveraging the discriminative nature of generative models for evaluation

  • Language model self-teaching for domain adaptation

  • Complementary Benefits of Contrastive Learning and Self-Training Under Distribution Shift

  • Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models

  • Self-Evaluation Improves Selective Generation

  • Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

  • MATH-SHEPHERD: VERIFY AND REINFORCE LLMS STEP-BY-STEP WITHOUT HUMAN ANNOTATIONS

  • Self-Contrast: Better Reflection Through Inconsistent Solving Perspectives

  • Self-Rewarding Language Models

  • ReFT: Reasoning with Reinforced Fine-Tuning

  • Investigate-Consolidate-Exploit: A General Strategy for Inter-Task Agent Self-Evolution

  • Self-Discover: Large Language Models Self-Compose Reasoning Structures

  • CodeIt: Self-Improving Language Models with Prioritized Hindsight Replay

  • V-STaR: Training Verifiers for Self-Taught Reasoners

  • Perils of Self-Feedback: Self-Bias Amplifies in Large Language Models

  • Can Large Language Models Really Improve by Self-critiquing Their Own Plans?

  • Soft Self-Consistency Improves Language Model Agents

  • The Dimension of Self-Directed Learning

  • TOOLVERIFIER: Generalization to New Tools via Self-Verification

  • Explorations of Self-Repair in Language Models

  • Can Large Language Models Play Games? A Case Study of A Self-Play Approach

  • Evolution through Large Models

  • Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers

  • EvoGPT-f: An Evolutionary GPT Framework for Benchmarking Formal Math Languages

  • Large Language Models As Evolution Strategies

  • Evolutionary Computation in the Era of Large Language Model: Survey and Roadmap

  • Quality-Diversity through AI Feedback

  • ACES: GENERATING DIVERSE PROGRAMMING PUZZLES WITH AUTOTELIC LANGUAGE MODELS AND SEMANTIC DESCRIPTORS

  • Language Model Crossover: Variation through Few-Shot Prompting

  • EvoPrompting: Language Models for Code-Level Neural Architecture Search

  • Quality Diversity through Human Feedback

  • Large Language Models as Optimizers

  • Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers

  • Open-endedness via Modeling human Notions of Interestingness

  • Human-Timescale Adaptation in an Open-Ended Task Space

  • Weak-to-strong generalization

  • Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts

  • Open-Ended Generation of Diverse Adversarial Prompts

  • POET: Endlessly Generating Increasingly Complex and Diverse Learning Environments and their Solutions through the Paired Open-Ended Trailblazer

  • Learning New Skills after Deployment: Improving open-domain internet-driven dialogue with human feedback

  • Open-Ended Learning Leads to Generally Capable Agents

  • Voyager: An Open-Ended Embodied Agent with Large Language Models

  • Describe and Explain and Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents

  • Reinforcement Learning for Generative AI: State of the Art, Opportunities and Open Research Challenges

  • WizardLM: Empowering Large Language Models to Follow Complex Instructions

  • MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

  • Comprehensive Exploration of Synthetic Data Generation: A Survey

  • Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias

  • Benchmarking and Improving Generator-Validator Consistency of Language Models

  • #131

  • #132

  • #133

  • #134

  • #135

  • #136

  • #137

  • #138

  • #139

  • #140

  • #141

  • #142

  • #143

  • #144

  • #145

  • Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
