This project demonstrates how to generate SQL queries from natural language using a large language model (LLM). The demo uses a sample SQLite database (demo.sqlite
) and a Python script (text_to_sql.py
) to interact with the LLM and execute the generated SQL queries.
demo.sqlite
: An SQLite database containing dummy data for demonstration purposes.prompt.txt
: A prompt template used by the LLM to generate SQL queries based on natural language input.schema.txt
: A Data Definition Language (DDL) description of the data structure contained within thedemo.sqlite
database.text_to_sql.py
: A Python script that demonstrates how to generate SQL queries from natural language using an LLM and execute them against thedemo.sqlite
database.
Before running the script, ensure you have the following:
- Python installed on your system
- An Anthropic API key with access to the Claude model
To install the required Python packages, run the following command:
pip install anthropic
For more information on setting up the Anthropic API and obtaining an API key, refer to the Anthropic documentation.
- Make sure you have fulfilled the prerequisites mentioned above.
- Run the
text_to_sql.py
script using the commandpython text_to_sql.py
. - Enter a natural language question when prompted. The script will generate an SQL query based on the provided question and execute it against the
demo.sqlite
database. - The script will display the results of the executed SQL query.
- The
text_to_sql.py
script reads theprompt.txt
template and theschema.txt
DDL description. - It prompts the user to enter a natural language question.
- The script sends the question, along with the prompt template and schema, to the LLM for processing.
- The LLM generates an SQL query based on the provided natural language question and the database schema.
- The script executes the generated SQL query against the
demo.sqlite
database and displays the results.
prompt.txt
contains a prompt template that is used to generate SQL from text. The template has the following variables:
{$SCHEMA}
: The database schema in SQL DDL format.{$QUESTION}
: The natural language question to be translated into an SQL query.{$DATABASE_TYPE}
: The type of the database (e.g., SQLite).{$SIMILAR_QUERIES}
: Examples of similar questions and their corresponding SQL queries.
For more information on prompting, read the docs here.
In this prompt, we are leveraging several prompting techniques at once:
-
Schema Specification: By providing the database schema in SQL DDL format, we give the LLM a clear understanding of the structure and relationships within the database. This allows the LLM to generate more accurate and relevant SQL queries based on the given schema.
-
Question Template: The prompt includes a placeholder for the natural language question (
{$QUESTION}
). This allows us to easily inject different questions into the prompt without modifying the entire template. -
Database Type Specification: The prompt includes a placeholder for the database type (
{$DATABASE_TYPE}
). This enables the LLM to generate SQL queries that are specific to the target database system, taking into account any differences in syntax or supported features. -
Similar Query Examples: The prompt includes a placeholder for examples of similar questions and their corresponding SQL queries (
{$SIMILAR_QUERIES}
). By providing these examples, we give the LLM additional context and guidance on how to translate natural language questions into SQL queries. The examples serve as a reference for the LLM to understand the desired format and structure of the generated SQL. -
Step-by-Step Reasoning: The prompt instructs the LLM to break down the steps and reasoning used to translate the question into an SQL query. This encourages the LLM to provide a clear and logical thought process, making the generated SQL more understandable and easier to debug or modify if needed.
-
Iterative Refinement: The prompt asks the LLM to provide a first draft SQL query, then review and refine it based on the original question and the desired output. This iterative process helps improve the accuracy and relevance of the generated SQL query.
-
Structured Output: The prompt leverages a defined XML response schema this ensures that we can parse the data in a repeatable manner.
-
Sampling Techniques: When generating a response from the API we ensure that the beginning and end of the sequence to be generated is fixed. To control the beginning we add a message to the
messages
array after our user prompt, this message is the first xml tag in our response schema. This will ensure there is no pre-amble. Next we append to thestop_sequences
array the final xml tag from the response schema. This enables a clean extraction of the contents within the xml response schema, avoiding preamble or postamble.
By combining these prompting techniques, we can guide the LLM to generate more accurate and targeted SQL queries based on natural language questions, while also providing flexibility and adaptability to different database schemas and types.