ml-lucid-datagen's Introduction

LUCID

This software project accompanies the research paper LUCID: LLM-Generated Utterances for Complex and Interesting Dialogues.

LUCID is a highly automated, LLM-driven data generation system for task-oriented dialogues. LUCID aims to produce realistic, diverse and challenging conversations, with highly accurate labels. LUCID takes a modularised approach to data generation, compartmentalising the data generation task into manageable steps that an LLM can consistently perform accurately. For more details, please see our paper.

This repo contains the code for the data generation system (which can be used to generate more data), the data we have already generated for our paper (LUCIDv1.0), and the code for our baseline models.

Documentation

Getting Started

Step 1: Generating intents

To create new intents from a description:

  • Open lucid_generate_data/run_scripts/create_intents_from_description.py
  • In this file, update INTENTS, a dictionary containing the domains and the desired intent descriptions within each domain
  • Once finished, run the .py file from the root directory (python lucid_generate_data/run_scripts/create_intents_from_description.py)
  • The new intents will be generated in lucid_generate_data/intents_for_data_generation
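
As a rough illustration of the INTENTS dictionary described above, here is a minimal sketch. The domain names, the intent descriptions, and the domain-to-list layout are all assumptions made for this example; check create_intents_from_description.py for the structure the script actually expects.

```python
# Hypothetical example: the domains, descriptions, and the
# "domain -> list of descriptions" layout are assumptions for illustration,
# not taken from the repository.
INTENTS = {
    "travel": [
        "Book a flight between two cities on a given date",
        "Cancel an existing hotel reservation",
    ],
    "messaging": [
        "Send a text message to a contact",
    ],
}


def count_intents(intents):
    """Return the total number of intent descriptions across all domains."""
    return sum(len(descriptions) for descriptions in intents.values())


print(f"{len(INTENTS)} domains, {count_intents(INTENTS)} intent descriptions")
```

Counting the entries before running the script is a quick way to confirm how many intents you are about to generate.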

Step 2: Generating conversations

  • Open lucid_generate_data/run_scripts/run_conversations.py

  • Inside this file, decide how many conversations to generate per intent (CONVS_PER_INTENT) and the maximum number of intents for a conversation (MAX_INTENTS_IN_CONVERSATION)
  • You also need to specify the conversational phenomena that you would like for the conversation (UNHAPPY_PATHS). Note that for the data generated for the paper, these were randomly sampled for each conversation (with either 0, 1 or 2 unhappy paths per conversation).
  • Your saved conversations will be stored in lucid_generate_data/saved_conversations
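
The settings above, and the random sampling of 0, 1 or 2 unhappy paths per conversation, could be sketched as follows. This is not the repository's implementation: the numeric values, the phenomenon names, the helper sample_unhappy_paths, and the choice of a uniform draw over 0, 1 and 2 are all assumptions for illustration.

```python
import random

# Illustrative values only; choose your own in run_conversations.py.
CONVS_PER_INTENT = 5
MAX_INTENTS_IN_CONVERSATION = 3

# Hypothetical phenomenon names; see run_conversations.py for the real options.
AVAILABLE_UNHAPPY_PATHS = [
    "example_phenomenon_a",
    "example_phenomenon_b",
    "example_phenomenon_c",
]


def sample_unhappy_paths(rng, available=AVAILABLE_UNHAPPY_PATHS, max_paths=2):
    """Pick between 0 and max_paths distinct phenomena for one conversation.

    The number of paths is drawn uniformly here, which is an assumption;
    the source only says 0, 1 or 2 paths were sampled per conversation.
    """
    n_paths = rng.randint(0, min(max_paths, len(available)))
    return rng.sample(available, n_paths)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
for conv_id in range(CONVS_PER_INTENT):
    unhappy_paths = sample_unhappy_paths(rng)
    print(f"conversation {conv_id}: UNHAPPY_PATHS = {unhappy_paths}")
```

Sampling without replacement keeps each phenomenon from appearing twice in the same conversation.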

Step 3: Data formatting and post-processing

  • To assemble your generated conversations into your final dataset, run lucid_generate_data/compile_data.py
  • Your final dataset will be called LUCID_data.json
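
Once LUCID_data.json exists, a quick sanity check is to load it and see how many top-level entries it contains. The sketch below makes no assumption about the dataset's schema. The helper summarise_dataset is hypothetical, and the file path is an assumption; adjust it to wherever compile_data.py writes its output.

```python
import json
from pathlib import Path


def summarise_dataset(path):
    """Load a JSON file and return (top-level type name, number of top-level entries)."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    size = len(data) if isinstance(data, (list, dict)) else 1
    return type(data).__name__, size


# Assumed location; change this to match where compile_data.py saves the file.
dataset_path = Path("LUCID_data.json")
if dataset_path.exists():
    kind, size = summarise_dataset(dataset_path)
    print(f"Loaded a {kind} with {size} top-level entries")
else:
    print(f"{dataset_path} not found; run compile_data.py first")
```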

Step 4: Running our baseline model

To run the LUCID baselines, please use: python running_baseline/run_llm.py

ml-lucid-datagen's People

Contributors

johnptorr
