SQL-PILOT

A small Llama model fine-tuned to generate SQL queries.

Dataset

b-mc2/sql-create-context:
https://huggingface.co/datasets/b-mc2/sql-create-context/tree/main

Train a SentencePiece tokenizer on the dataset.

  • This saves a SentencePiece model in the cache directory.

    python scripts/train_tokenizer.py --vocab_size=<vocab size> --data_cache_dir=<your fav dir>

  • A Tokenizer class with limited functionality wraps this SentencePiece model. The trained model can be loaded into the Tokenizer class like this:

    from tokenizer import Tokenizer
    tokenizer = Tokenizer('data/tok3072.model')
    

** The Tokenizer class and its training code are adapted from karpathy/llama2.c.

Padding details and Dataset Preparation:

  • Sequences in a batch are left-padded.
  • In the example below, the context and question strings are concatenated to form the context (prompt) string that precedes the answer.
    A data sample:
    {
        "context": "some context",
        "question": "some question",
        "answer": "some answer"
    }
        
    <bos>: begin sequence token
    <eos>: end sequence token
    <pad>: pad token
    -100: label ignored by PyTorch's CrossEntropyLoss by default (its default ignore_index)

    encoded context tokens: [C1, C2, C3, C4, C5]
    encoded answer tokens:  [A1, A2, A3]

    input sequence  : [<bos>,   C1,   C2,   C3,   C4,   C5,   A1,   A2,    A3, <eos>]
    target sequence : [ -100, -100, -100, -100, -100,   A1,   A2,   A3, <eos>,  -100]

  • Running the following script saves a preprocessed dataset in the cache directory.

    python scripts/prepare_dataset.py --tokenizer_path=<your fav dir> --data_cache_dir=<your fav dir>
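
The input/target layout and left padding described above can be sketched in plain Python. This is a minimal illustration, not the code in scripts/prepare_dataset.py: the function names `build_example` and `left_pad_batch` and the special-token ids (`BOS_ID`, `EOS_ID`, `PAD_ID`) are hypothetical and chosen only for this example. It assumes padded target positions are also set to -100 so they are ignored by the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss

# Hypothetical special-token ids, used only for this illustration.
BOS_ID, EOS_ID, PAD_ID = 1, 2, 0


def build_example(context_ids, answer_ids):
    """Return (input_ids, target_ids) for one sample, following the layout above."""
    input_ids = [BOS_ID] + context_ids + answer_ids + [EOS_ID]
    # Targets are the inputs shifted left by one position; the last
    # position (<eos>) has nothing left to predict, so it is ignored.
    shifted = input_ids[1:] + [IGNORE_INDEX]
    # The first len(context_ids) positions predict context tokens,
    # which should not contribute to the loss.
    n_masked = len(context_ids)
    target_ids = [IGNORE_INDEX] * n_masked + shifted[n_masked:]
    return input_ids, target_ids


def left_pad_batch(examples):
    """Left-pad a list of (input_ids, target_ids) pairs to a common length."""
    max_len = max(len(inp) for inp, _ in examples)
    inputs, targets = [], []
    for inp, tgt in examples:
        n_pad = max_len - len(inp)
        inputs.append([PAD_ID] * n_pad + inp)
        # Assumption: padded positions are masked out of the loss as well.
        targets.append([IGNORE_INDEX] * n_pad + tgt)
    return inputs, targets


# Two samples of different lengths, using made-up token ids.
long_sample = build_example([11, 12, 13, 14, 15], [21, 22, 23])
short_sample = build_example([11, 12], [21])
batch_inputs, batch_targets = left_pad_batch([long_sample, short_sample])
```

With left padding, the last position of every row in the batch is a real token (here `<eos>`) rather than padding.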

Finetune

Use finetune.py to fine-tune the model.
