mrseanryan / gpt-dm



Data modelling via natural language using an LLM. Outputs JSON or SQL. Also generates test data in SQL or CSV format.

License: MIT License

Python 21.71% Shell 0.25% Jupyter Notebook 78.04%

aiad database-schema prompt-engineering sql-gen test-data-gen

gpt-dm's Introduction

Greetings, Earthling 👋

Hi 👋, I'm Sean, a passionate full stack developer from Ireland now living in the Netherlands

🌱 I’m currently learning more about Clean and Sustainable architecture, Machine Learning, Data Engineering and Functional Programming ...
⚡ Fun fact: I enjoy cooking new things ...

📝 Blogs

Anti-Patterns Codin' - coding problems and solutions


UI Blunderz - software engineering and quality commentary


Medium



gpt-dm's Issues

Try to remove LangChain - via extraction or a new menu prompt

There is a dependency on LangChain, which makes porting this away from Python difficult. It may also couple the project to OpenAI.

Options:

extract the multi prompt processing from LangChain
try a simpler “menu prompt” approach to just ask the LLM to pick the best command or else the menu summary
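
A minimal sketch of the "menu prompt" option, with no LangChain dependency. The command names and descriptions in `COMMANDS` are hypothetical placeholders, not the actual gpt-dm commands, and the call to the LLM itself is left out: the sketch only builds the prompt and parses the reply.

```python
# Hypothetical command menu: names and descriptions are illustrative only.
COMMANDS = {
    "create-schema": "Create a new database schema from a description",
    "generate-sql": "Generate SQL DDL for the current schema",
    "generate-test-data": "Generate test data for the current schema",
}


def build_menu_prompt(user_input: str) -> str:
    """Build a prompt asking the LLM for exactly one command name, or 'menu'."""
    lines = [f"- {name}: {desc}" for name, desc in COMMANDS.items()]
    return (
        "Pick the single best command for the user request.\n"
        "Reply with the command name only, or 'menu' if none fits.\n\n"
        "Commands:\n" + "\n".join(lines) + f"\n\nUser request: {user_input}\n"
    )


def parse_choice(llm_reply: str) -> str:
    """Map the raw LLM reply to a known command, falling back to 'menu'."""
    reply = llm_reply.strip().strip("'\"`.").lower()
    return reply if reply in COMMANDS else "menu"
```

Falling back to `menu` on any unrecognised reply means the caller can simply show the menu summary when the LLM does not pick a command.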

Try generating higher volume of test data (via Mistral 7B ?)

Approach:

use a larger context window
repeat and concatenate, with higher temperature
multiply programmatically (post-process)
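
The "multiply programmatically" option could look roughly like the sketch below. It assumes the LLM has already produced a handful of seed rows as dicts that share the same keys and have an `id` column. The function name `multiply_rows` is made up for illustration.

```python
import random


def multiply_rows(seed_rows, target_count, rng=None):
    """Expand a few LLM-generated rows into target_count rows.

    Each non-id column is sampled independently from the values seen in
    the seed rows, and 'id' is reassigned sequentially from 1.
    Assumes every seed row has the same keys.
    """
    rng = rng or random.Random(0)  # fixed seed so runs are repeatable
    columns = [c for c in seed_rows[0] if c != "id"]
    pools = {c: [row[c] for row in seed_rows] for c in columns}
    out = []
    for new_id in range(1, target_count + 1):
        row = {"id": new_id}
        for c in columns:
            row[c] = rng.choice(pools[c])
        out.append(row)
    return out
```

Sampling each column independently is cheap but can produce unrealistic combinations (for example a make paired with a model from another make), so it suits volume testing better than realistic demo data.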

LLM - Mistral 7B

Context size: a sliding 4K window - see https://huggingface.co/mistralai/Mistral-7B-v0.1/discussions/4

There are many flavours of Mistral-7B:

regular mistral-7B, quantized

TheBloke/Mistral-7B-v0.1-GGUF = https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF

cognitivecomputations/dolphin-2.6-mistral-7b-dpo

good for coding, has 16K context (note: is not 'ethically trained')
https://huggingface.co/cognitivecomputations/dolphin-2.6-mistral-7b-dpo
quantized = https://huggingface.co/TheBloke/dolphin-2.6-mistral-7B-dpo-GPTQ

mistral-7B with larger context window

TheBloke_Writing_Partner_Mistral_7B-GPTQ_gptq-8bit-32g-actorder_True = https://huggingface.co/TheBloke/Writing_Partner_Mistral_7B-GPTQ/tree/gptq-8bit-32g-actorder_True

instruct prompting, and newer:

https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GPTQ

a base version, suitable for FT

https://huggingface.co/TheBloke/mistral-ft-optimized-1227-GPTQ

hermes-2.5 - mistral-7B

phi from Microsoft

phi-1.5 (from Microsoft) - Open LLM leaderboard
= https://huggingface.co/microsoft/phi-1_5
phi-2 = https://huggingface.co/microsoft/phi-2

Add feature to summarize a large schema

take a list of tables, with their properties

output high level summary:

classify tables - but grouping via associations could be more important.
table summary
class description
higher level class of classes
(top level category, then table category)

use LLM
or just an embedding like SBERT (or word2vec), then cluster - or DistilBERT uncased
but then need to name the cluster

use llm for that.
or better have categories and convert to embedding. pick the closest one:

dot product of 2 normalised vectors = cos(angle)
cosine distance = 1 - v.w (for normalised v and w)
smaller distance means closer
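
A small sketch of the "pick the closest category" idea using plain Python lists. The embedding vectors are assumed to come from some embedding model, which is not shown. The function names are illustrative, and zero-length vectors are assumed not to occur.

```python
import math


def normalise(v):
    """Scale v to unit length (assumes v is not the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_distance(v, w):
    """1 - cos(angle) between v and w; smaller means closer."""
    v, w = normalise(v), normalise(w)
    return 1.0 - sum(a * b for a, b in zip(v, w))


def closest_category(table_vec, category_vecs):
    """Return the category name whose embedding is nearest to table_vec."""
    return min(
        category_vecs,
        key=lambda name: cosine_distance(table_vec, category_vecs[name]),
    )
```

Because the categories are just a dict of name to vector, a user-defined "view" can be modelled as a different dict passed to `closest_category`.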

user can add categories. different views (sets of categories).
stem entity names by their casing

Try modelling the DM as a graph. Then can use DOT

A Data Model can be modelled as a graph:

dm -> entity
entity -> attribute [label = name, type]
entity -> entity [label = association-type]

Then can use DOT instead of JSON
can use smaller LLM like CodeLlama 13B
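
As a rough illustration of the graph mapping above, the sketch below converts a data model dict into DOT text. It assumes the JSON shape with `application`, `entity-names`, `entity-attributes` and `entity-relationships` keys. Attribute types are omitted because that shape only carries attribute names, and the attribute node naming (`Entity.attr`) is an arbitrary choice.

```python
def data_model_to_dot(dm: dict) -> str:
    """Render a data model dict as a DOT digraph string."""
    lines = [f'digraph "{dm["application"]}" {{']
    for entity in dm["entity-names"]:
        # dm -> entity
        lines.append(f'  "dm" -> "{entity}";')
        # entity -> attribute [label = name]
        for attr in dm["entity-attributes"].get(entity, []):
            lines.append(f'  "{entity}" -> "{entity}.{attr}" [label="{attr}"];')
    # entity -> entity [label = association-type]
    for source, targets in dm["entity-relationships"].items():
        for target, multiplicity in targets.items():
            lines.append(f'  "{source}" -> "{target}" [label="{multiplicity}"];')
    lines.append("}")
    return "\n".join(lines)
```

The resulting string can be saved to a `.dot` file and rendered with Graphviz.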

see gpt-workflow issues

try local llama 2

https://replicate.com/blog/run-llama-locally

Train local gpt-dm LLM via qLora

ref https://youtu.be/J_3hDqSvpmg?si=UpY418pEr4jxT0qx

Try BPML as alt to DOT

BPML = Business Process Modeling Language

a DSL so could be better format than DOT...

Try other LLM - HuggingFace's Code Llama (Llama 2 learns to code)

https://huggingface.co/blog/codellama

Try self hosted Mistral 7B-instruct

Mistral 7B instruct is easy to host and already seems promising, without any fine tuning:

Prompt format:

text = "<s>[INST] What is your favourite condiment? [/INST]"
"Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!</s> "
"[INST] Do you have mayonnaise recipes? [/INST]"

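The prompt format above can be assembled from a list of conversation turns. This is a minimal sketch: the function name `build_mistral_prompt` is made up, and it assumes the literal `<s>` should be part of the string. Some tokenizers add the BOS token themselves, so check that before including it.

```python
def build_mistral_prompt(turns):
    """Build a Mistral-instruct prompt string.

    turns is a list of (user_message, assistant_reply) pairs; use None as
    the reply of the last turn to leave it open for the model to answer.
    """
    prompt = "<s>"
    for user_msg, assistant_msg in turns:
        prompt += f"[INST] {user_msg} [/INST]"
        if assistant_msg is not None:
            # A completed assistant turn is closed with the EOS marker.
            prompt += f"{assistant_msg}</s> "
    return prompt
```

Keeping the turns as data makes it easy to append each new user message and model reply, as in the multi-step example that follows.
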
Example:

<s>[INST] 

Follow these steps:

- Step 1: Ask the user what is the name of the application. In later steps, this will be referred to as <application-name>.
- Step 2: Ask the user for the list of entity names. In later steps, this will be referred to as <entities-list>.
- Step 3: Ask the user to describe how the entities are related.
- Step 4: Create a database schema in JSON for the application named <application-name>.
This application manages the entities: <entities-list>.
The entities are related as follows: <entity-relationships-list>.

The output MUST be in JSON only, based on the following example:
---
{
  "application": "my-cars",
  "entity-names": ["Users","Projects","Tasks"],
  "entity-attributes": {
    <entity-A-name>: [<list of suitable attributes>],
    <entity-B-name>: [<list of suitable attributes>]
  },
  "entity-relationships": {
    <entity-A-name>: {
      <entity-B-name>: "<multiplicity>"
    }
  }
}
---

Where:
  - <multiplicity> can be one of these: one-to-one, one-to-many, many-to-one, many-to-many
  - <entity-A-name> is the name of some entity
  - <entity-B-name> is the name of some other entity

IMPORTANT: For step 4, only output valid JSON.
[/INST] Sure, I'd be happy to help you with that! Can you please provide me with more information about the entities and their relationships?</s>

[INST]
A Car has Wheels. A Wheel has a Tyre.
[/INST]
Great, thank you for providing the entity relationship details. Based on your input, here's an example of how the JSON output could look like:
---json
{
	"application": "car_app",
	"entity-names": ["Cars", "Wheels", "Tyres"],
	"entity-attributes": {
		"Cars": ["color", "make", "model", "year"],
		"Wheels": ["size", "brand", "material"],
		"Tyres": ["pressure", "tread", "wear"]
	},
	"entity-relationships": {
		"Cars": {
			"Wheels": "many-to-many"
		},
		"Wheels": {
			"Tyres": "many-to-many"
		}
	}
}
---</s>
[INST]Generate example data in JSON format[/INST]
Sure, here's an example of how the data for each entity could look like in JSON format:
---json
// Cars
{
  "id": 1,
  "color": "red",
  "make": "Toyota",
  "model": "Camry",
  "year": 2021
}

// Wheels
{
  "id": 1,
  "size": 15,
  "brand": "Goodyear",
  "material": "steel"
}

// Tyres
{
  "id": 1,
  "pressure": 30,
  "tread": 8,
  "wear": 20
}
---
Please note that the above data is just an example and may not necessarily reflect real-world data. The actual data would depend on the specific requirements of the

note: in this example, each run of 3 backticks was replaced with --- so that I could paste it into Markdown format
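
Since the model wraps its JSON in conversational text, the schema reply needs to be extracted and checked before use. The sketch below is one possible approach, not part of gpt-dm: it takes the span from the first `{` to the last `}` and checks it against the multiplicities listed in the prompt. It is meant for the schema reply only. The example-data reply above contains `//` comments and several objects, so it would not parse this way.

```python
import json
import re

MULTIPLICITIES = {"one-to-one", "one-to-many", "many-to-one", "many-to-many"}


def extract_json(reply: str) -> dict:
    """Parse the span from the first '{' to the last '}' in an LLM reply."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in reply")
    return json.loads(match.group(0))


def validate_schema(dm: dict) -> list:
    """Return a list of problems found in the schema (empty if none)."""
    errors = []
    names = set(dm.get("entity-names", []))
    for entity in dm.get("entity-attributes", {}):
        if entity not in names:
            errors.append(f"unknown entity in attributes: {entity}")
    for source, targets in dm.get("entity-relationships", {}).items():
        for target, mult in targets.items():
            for name in (source, target):
                if name not in names:
                    errors.append(f"unknown entity in relationship: {name}")
            if mult not in MULTIPLICITIES:
                errors.append(
                    f"invalid multiplicity {mult!r} for {source} -> {target}"
                )
    return errors
```

If `validate_schema` returns any errors, they could be sent back to the model in a follow-up `[INST]` turn asking it to correct the JSON.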