
tiktoken-go's Introduction

tiktoken-go


OpenAI's tiktoken in Go.

Tiktoken is a fast BPE tokeniser for use with OpenAI's models.

This is a port of the original tiktoken.

Usage

Install

go get github.com/pkoukk/tiktoken-go

Cache

Tiktoken-go has the same cache mechanism as the original Tiktoken library.

You can set the cache directory by using the environment variable TIKTOKEN_CACHE_DIR.

Once this variable is set, tiktoken-go will use this directory to cache the token dictionary.

If you don't set this environment variable, tiktoken-go will download the dictionary each time an encoding is initialized for the first time.
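For example, the cache directory can also be set from inside the program, as long as it happens before the first encoding is initialized (a minimal sketch; the directory path is illustrative):

package main

import (
	"fmt"
	"os"

	"github.com/pkoukk/tiktoken-go"
)

func main() {
	// Cache the downloaded dictionary under a local directory;
	// "/tmp/tiktoken-cache" is just an example path.
	os.Setenv("TIKTOKEN_CACHE_DIR", "/tmp/tiktoken-cache")

	tke, err := tiktoken.GetEncoding("cl100k_base")
	if err != nil {
		fmt.Println("getEncoding:", err)
		return
	}
	fmt.Println(len(tke.Encode("Hello, world!", nil, nil)))
}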

Alternative BPE loaders

If you don't want to use the cache or download the dictionary each time, you can use an alternative BPE loader.

Just call tiktoken.SetBpeLoader before calling tiktoken.GetEncoding or tiktoken.EncodingForModel.

BpeLoader is an interface; you can implement your own BPE loader by implementing it.
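As a rough sketch, a loader that reads a .tiktoken dictionary from the local filesystem could look like the following. This assumes the interface method is LoadTiktokenBpe(tiktokenBpeFile string) (map[string]int, error) and that dictionary files use the usual one-token-per-line "base64-token rank" format; check the package source for the exact signature.

package main

import (
	"bufio"
	"encoding/base64"
	"os"
	"path"
	"strconv"
	"strings"

	"github.com/pkoukk/tiktoken-go"
)

// fileLoader is a hypothetical BpeLoader that reads .tiktoken dictionaries
// from a local directory instead of downloading them.
type fileLoader struct {
	dir string
}

// LoadTiktokenBpe parses the "base64-token rank" lines of a .tiktoken file.
// The file name is derived from the URL tiktoken-go would normally download.
func (l *fileLoader) LoadTiktokenBpe(tiktokenBpeFile string) (map[string]int, error) {
	f, err := os.Open(path.Join(l.dir, path.Base(tiktokenBpeFile)))
	if err != nil {
		return nil, err
	}
	defer f.Close()

	ranks := make(map[string]int)
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		parts := strings.Fields(scanner.Text())
		if len(parts) != 2 {
			continue
		}
		token, err := base64.StdEncoding.DecodeString(parts[0])
		if err != nil {
			return nil, err
		}
		rank, err := strconv.Atoi(parts[1])
		if err != nil {
			return nil, err
		}
		ranks[string(token)] = rank
	}
	return ranks, scanner.Err()
}

func main() {
	// Register the custom loader before any encoding is requested.
	tiktoken.SetBpeLoader(&fileLoader{dir: "./bpe"})
}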

Offline BPE loader

The offline BPE loader loads the BPE dictionary from embedded files; it helps if you don't want to download the dictionary at runtime.

Due to the size of the BPE dictionary, this loader lives in a separate project.

If you require this loader, include tiktoken_loader.
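A minimal usage sketch (the import path below is taken from the tiktoken_loader project; verify it against that repository):

package main

import (
	"fmt"

	"github.com/pkoukk/tiktoken-go"
	tiktoken_loader "github.com/pkoukk/tiktoken-go-loader"
)

func main() {
	// Serve the BPE dictionary from files embedded in the loader module,
	// so nothing is downloaded at runtime.
	tiktoken.SetBpeLoader(tiktoken_loader.NewOfflineLoader())

	tke, err := tiktoken.GetEncoding("cl100k_base")
	if err != nil {
		fmt.Println("getEncoding:", err)
		return
	}
	fmt.Println(len(tke.Encode("Hello, world!", nil, nil)))
}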

Examples

Get Token By Encoding

package main

import (
	"fmt"

	"github.com/pkoukk/tiktoken-go"
)

func main() {
	text := "Hello, world!"
	encoding := "cl100k_base"

	// If you don't want to download the dictionary at runtime, use the offline loader:
	// tiktoken.SetBpeLoader(tiktoken_loader.NewOfflineLoader())
	tke, err := tiktoken.GetEncoding(encoding)
	if err != nil {
		fmt.Println("getEncoding:", err)
		return
	}

	// encode
	token := tke.Encode(text, nil, nil)

	// tokens
	fmt.Println(token)
	// num_tokens
	fmt.Println(len(token))
}

Get Token By Model

package main

import (
	"fmt"

	"github.com/pkoukk/tiktoken-go"
)

func main() {
	text := "Hello, world!"
	model := "gpt-3.5-turbo"

	tkm, err := tiktoken.EncodingForModel(model)
	if err != nil {
		fmt.Println("encodingForModel:", err)
		return
	}

	// encode
	token := tkm.Encode(text, nil, nil)

	// tokens
	fmt.Println(token)
	// num_tokens
	fmt.Println(len(token))
}

Counting Tokens For Chat API Calls

Below is an example function for counting tokens for messages passed to gpt-3.5-turbo or gpt-4.

The following code was written based on the openai-cookbook examples as of Wednesday, 28 June 2023.

Please note that the way tokens are counted for messages may change at any time, so this code may not remain accurate in the future.

If you need accurate calculation, please refer to the official documentation.

If you find that this code is no longer applicable, please feel free to submit a PR or Issue.

package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/pkoukk/tiktoken-go"
	"github.com/sashabaranov/go-openai"
)

// OpenAI Cookbook: https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
func NumTokensFromMessages(messages []openai.ChatCompletionMessage, model string) (numTokens int) {
	tkm, err := tiktoken.EncodingForModel(model)
	if err != nil {
		err = fmt.Errorf("encoding for model: %v", err)
		log.Println(err)
		return
	}

	var tokensPerMessage, tokensPerName int
	switch model {
	case "gpt-3.5-turbo-0613",
		"gpt-3.5-turbo-16k-0613",
		"gpt-4-0314",
		"gpt-4-32k-0314",
		"gpt-4-0613",
		"gpt-4-32k-0613":
		tokensPerMessage = 3
		tokensPerName = 1
	case "gpt-3.5-turbo-0301":
		tokensPerMessage = 4 // every message follows <|start|>{role/name}\n{content}<|end|>\n
		tokensPerName = -1   // if there's a name, the role is omitted
	default:
		if strings.Contains(model, "gpt-3.5-turbo") {
			log.Println("warning: gpt-3.5-turbo may update over time. Returning num tokens assuming gpt-3.5-turbo-0613.")
			return NumTokensFromMessages(messages, "gpt-3.5-turbo-0613")
		} else if strings.Contains(model, "gpt-4") {
			log.Println("warning: gpt-4 may update over time. Returning num tokens assuming gpt-4-0613.")
			return NumTokensFromMessages(messages, "gpt-4-0613")
		} else {
			err = fmt.Errorf("num_tokens_from_messages() is not implemented for model %s. See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens.", model)
			log.Println(err)
			return
		}
	}

	for _, message := range messages {
		numTokens += tokensPerMessage
		numTokens += len(tkm.Encode(message.Content, nil, nil))
		numTokens += len(tkm.Encode(message.Role, nil, nil))
		numTokens += len(tkm.Encode(message.Name, nil, nil))
		if message.Name != "" {
			numTokens += tokensPerName
		}
	}
	numTokens += 3 // every reply is primed with <|start|>assistant<|message|>
	return numTokens
}

Available Encodings

| Encoding name       | OpenAI models                                    |
| ------------------- | ------------------------------------------------ |
| cl100k_base         | gpt-4, gpt-3.5-turbo, text-embedding-ada-002     |
| p50k_base           | Codex models, text-davinci-002, text-davinci-003 |
| r50k_base (or gpt2) | GPT-3 models like davinci                        |

Available Models

| Model name                   | Encoding    |
| ---------------------------- | ----------- |
| gpt-4-*                      | cl100k_base |
| gpt-3.5-turbo-*              | cl100k_base |
| gpt-4                        | cl100k_base |
| gpt-3.5-turbo                | cl100k_base |
| text-davinci-003             | p50k_base   |
| text-davinci-002             | p50k_base   |
| text-davinci-001             | r50k_base   |
| text-curie-001               | r50k_base   |
| text-babbage-001             | r50k_base   |
| text-ada-001                 | r50k_base   |
| davinci                      | r50k_base   |
| curie                        | r50k_base   |
| babbage                      | r50k_base   |
| ada                          | r50k_base   |
| code-davinci-002             | p50k_base   |
| code-davinci-001             | p50k_base   |
| code-cushman-002             | p50k_base   |
| code-cushman-001             | p50k_base   |
| davinci-codex                | p50k_base   |
| cushman-codex                | p50k_base   |
| text-davinci-edit-001        | p50k_edit   |
| code-davinci-edit-001        | p50k_edit   |
| text-embedding-ada-002       | cl100k_base |
| text-similarity-davinci-001  | r50k_base   |
| text-similarity-curie-001    | r50k_base   |
| text-similarity-babbage-001  | r50k_base   |
| text-similarity-ada-001      | r50k_base   |
| text-search-davinci-doc-001  | r50k_base   |
| text-search-curie-doc-001    | r50k_base   |
| text-search-babbage-doc-001  | r50k_base   |
| text-search-ada-doc-001      | r50k_base   |
| code-search-babbage-code-001 | r50k_base   |
| code-search-ada-code-001     | r50k_base   |
| gpt2                         | gpt2        |

Test

You can run the tests in the test folder. They compare the results with the original tiktoken, both for getting tokens by encoding and for getting tokens by model.

Benchmark

You can run the benchmark in the test folder.
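For reference, a minimal Go benchmark for Encode might look like this (a sketch; the results below were measured on the UDHR text, while the string here is just a stand-in):

package tiktoken_test

import (
	"testing"

	"github.com/pkoukk/tiktoken-go"
)

// BenchmarkEncode measures how fast Encode tokenizes a fixed string.
func BenchmarkEncode(b *testing.B) {
	tke, err := tiktoken.GetEncoding("cl100k_base")
	if err != nil {
		b.Fatal(err)
	}
	text := "The quick brown fox jumps over the lazy dog."
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		tke.Encode(text, nil, nil)
	}
}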

Benchmark result

| name        | time/op | os         | cpu      | text | times  |
| ----------- | ------- | ---------- | -------- | ---- | ------ |
| tiktoken-go | 8795ns  | macOS 13.2 | Apple M1 | UDHR | 100000 |
| tiktoken    | 8838ns  | macOS 13.2 | Apple M1 | UDHR | 100000 |

The performance looks almost identical; any difference may come from machine variance, or my benchmark method may not be appropriate.

If you have a better benchmark method, or if you want to add your own benchmark result, please feel free to submit a PR.

License

MIT

tiktoken-go's People

Contributors

bakks, bobheadxi, doublemine, elvuel, nasa1024, philippgille, pkoukk, yohnzhu, zachsmith1


tiktoken-go's Issues

Counting Example changed a bit.

As of 6/27, the way tokens are counted has changed slightly.
I updated the ChatMessage example accordingly.

// The link below may not render in Chrome (error: Unable to render code block);
// if so, use Firefox.
// OpenAI Cookbook: https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
func NumTokensFromMessages(messages []openai.ChatCompletionMessage, model string) (numTokens int) {
	tkm, err := tiktoken.EncodingForModel(model)
	if err != nil {
		err = fmt.Errorf("encoding for model: %v", err)
		log.Println(err)
		return
	}

	var tokensPerMessage, tokensPerName int

	if model == "gpt-3.5-turbo-0613" ||
		model == "gpt-3.5-turbo-16k-0613" ||
		model == "gpt-4-0314" ||
		model == "gpt-4-32k-0314" ||
		model == "gpt-4-0613" ||
		model == "gpt-4-32k-0613" {
		tokensPerMessage = 3
		tokensPerName = 1
	} else if model == "gpt-3.5-turbo-0301" {
		tokensPerMessage = 4 // every message follows <|start|>{role/name}\n{content}<|end|>\n
		tokensPerName = -1   // if there's a name, the role is omitted
	} else if model == "gpt-3.5-turbo" {
		log.Println("warning: gpt-3.5-turbo may update over time. Returning num tokens assuming gpt-3.5-turbo-0613.")
		return NumTokensFromMessages(messages, "gpt-3.5-turbo-0613")
	} else if model == "gpt-4" {
		log.Println("warning: gpt-4 may update over time. Returning num tokens assuming gpt-4-0613.")
		return NumTokensFromMessages(messages, "gpt-4-0613")
	} else {
		err := errors.New("warning: model not found. Using cl100k_base encoding")
		log.Println(err)
		return
	}

	for _, message := range messages {
		numTokens += tokensPerMessage
		numTokens += len(tkm.Encode(message.Content, nil, nil))
		numTokens += len(tkm.Encode(message.Role, nil, nil))
		numTokens += len(tkm.Encode(message.Name, nil, nil))
		if message.Name != "" {
			numTokens += tokensPerName
		}
	}
	numTokens += 3 // every reply is primed with <|start|>assistant<|message|>
	return numTokens
}

sortedTokenBytes doesn't seem necessary

Thank you for your efforts. I found that in the NewCoreBPE function, the result of the following code does not seem to be used anywhere,

	sortedTokenBytes := make([][]byte, 0, len(encoder))
	for k := range encoder {
		sortedTokenBytes = append(sortedTokenBytes, []byte(k))
	}
	sort.Slice(sortedTokenBytes, func(i, j int) bool {
		return bytes.Compare(sortedTokenBytes[i], sortedTokenBytes[j]) < 0
	})

	return &CoreBPE{
		......
		sortedTokenBytes:     sortedTokenBytes,
	}, nil

but this sorting operation seems quite expensive. Is there a reason for keeping it?

Can a Tiktoken instance be shared by multiple goroutines?

It seems GetEncoding is not cheap to call, since it has to compile regexes. Creating a new Tiktoken instance before every token calculation is inefficient. If a Tiktoken instance can be shared by multiple goroutines, we would only need to create it once, which is more efficient.
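One workaround, assuming the encoder is safe for concurrent Encode calls (which is exactly what this issue is asking), is to create it once and reuse it; a minimal sketch:

package main

import (
	"fmt"
	"sync"

	"github.com/pkoukk/tiktoken-go"
)

var (
	encOnce sync.Once
	enc     *tiktoken.Tiktoken
	encErr  error
)

// sharedEncoding builds the encoder on first use and reuses it afterwards.
func sharedEncoding() (*tiktoken.Tiktoken, error) {
	encOnce.Do(func() {
		enc, encErr = tiktoken.GetEncoding("cl100k_base")
	})
	return enc, encErr
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			tke, err := sharedEncoding()
			if err != nil {
				fmt.Println(err)
				return
			}
			fmt.Println(id, len(tke.Encode("Hello, world!", nil, nil)))
		}(i)
	}
	wg.Wait()
}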

Should allowedSpecial be used instead of disallowedSpecial?

// func (t *Tiktoken) Encode(text string, allowedSpecial []string, disallowedSpecial []string) []int {
	var allowedSpecialSet map[string]any
	if len(allowedSpecial) == 0 {
		allowedSpecialSet = map[string]any{}
	} else if len(disallowedSpecial) == 1 && disallowedSpecial[0] == "all" {
		allowedSpecialSet = t.specialTokensSet
	} else {
		allowedSpecialSet = map[string]any{}
		for _, v := range allowedSpecial {
			allowedSpecialSet[v] = nil
		}
	}

@pkoukk I haven't read through all of the code logic yet, but disallowedSpecial here looks odd; shouldn't it be allowedSpecial?

EncodingForModel: no encoding for model gpt-3.5-turbo-0301

tkm, err := tiktoken.EncodingForModel(Model)
if err != nil {
	fmt.Println(fmt.Errorf("EncodingForModel: %v", err))
	return
}

When I run this with the model gpt-3.5-turbo-0301, it fails with a "no encoding for model" error. I checked the source, and MODEL_TO_ENCODING does not contain this model.

tiktoken-go getEncoding isn't thread-safe

Howdy,
There's potential concurrent access to ENCODING_MAP in the getEncoding function here:

func getEncoding(encodingName string) (*Encoding, error) {
	encoding, ok := ENCODING_MAP[encodingName]
	if !ok {
		initEncoding, err := initEncoding(encodingName)
		if err != nil {
			return nil, err
		}
		encoding = initEncoding
		ENCODING_MAP[encodingName] = encoding
	}
	return encoding, nil
}

There may be some other issues in the package that make it unsafe to run in multiple goroutines, which isn't expected since we're picking up unique instances via tiktoken.EncodingForModel(model). You might want to move this (and other) globals into a struct.
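One possible fix, sketched against the snippet above rather than the library's actual code, is to guard the map with a mutex:

// Assumes the package-level names from the snippet above
// (Encoding, initEncoding, ENCODING_MAP) and "sync" imported.
var encodingMu sync.Mutex // protects ENCODING_MAP

func getEncoding(encodingName string) (*Encoding, error) {
	encodingMu.Lock()
	defer encodingMu.Unlock()

	encoding, ok := ENCODING_MAP[encodingName]
	if !ok {
		newEncoding, err := initEncoding(encodingName)
		if err != nil {
			return nil, err
		}
		encoding = newEncoding
		ENCODING_MAP[encodingName] = encoding
	}
	return encoding, nil
}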

The calculated result has a discrepancy; how do I adjust it?

My code is as follows:

package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/pkoukk/tiktoken-go"
	"github.com/sashabaranov/go-openai"
)

func main() {
	ins := []openai.ChatCompletionMessage{
		{
			Role:    "user",
			Content: "Hello!",
		},
		{
			Role:    "assistant",
			Content: "Hello! How can I assist you today?",
		},
	}

	fmt.Println(NumTokensFromMessages(ins, "gpt-3.5-turbo-0613"))
}

func NumTokensFromMessages(messages []openai.ChatCompletionMessage, model string) (numTokens int) {
	tkm, err := tiktoken.EncodingForModel(model)
	if err != nil {
		err = fmt.Errorf("encoding for model: %v", err)
		log.Println(err)
		return
	}

	var tokensPerMessage, tokensPerName int
	switch model {
	case "gpt-3.5-turbo-0613",
		"gpt-3.5-turbo-16k-0613",
		"gpt-4-0314",
		"gpt-4-32k-0314",
		"gpt-4-0613",
		"gpt-4-32k-0613":
		tokensPerMessage = 3
		tokensPerName = 1
	case "gpt-3.5-turbo-0301":
		tokensPerMessage = 4 // every message follows <|start|>{role/name}\n{content}<|end|>\n
		tokensPerName = -1   // if there's a name, the role is omitted
	default:
		if strings.Contains(model, "gpt-3.5-turbo") {
			log.Println("warning: gpt-3.5-turbo may update over time. Returning num tokens assuming gpt-3.5-turbo-0613.")
			return NumTokensFromMessages(messages, "gpt-3.5-turbo-0613")
		} else if strings.Contains(model, "gpt-4") {
			log.Println("warning: gpt-4 may update over time. Returning num tokens assuming gpt-4-0613.")
			return NumTokensFromMessages(messages, "gpt-4-0613")
		} else {
			err = fmt.Errorf("num_tokens_from_messages() is not implemented for model %s. See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens.", model)
			log.Println(err)
			return
		}
	}

	for _, message := range messages {
		numTokens += tokensPerMessage
		numTokens += len(tkm.Encode(message.Content, nil, nil))
		numTokens += len(tkm.Encode(message.Role, nil, nil))
		numTokens += len(tkm.Encode(message.Name, nil, nil))
		if message.Name != "" {
			numTokens += tokensPerName
		}
	}
	numTokens += 3
	return numTokens
}

My output is as follows:

$ go run main.go
22

The value returned by the official API differs (screenshot omitted).

How should I modify this to get the correct token count?

A relatively large discrepancy in the calculation

Hi, the token counts I calculate using the method below differ a lot from the results of OpenAI's official calculation tool. Have you encountered this on your side? Model: gpt-3.5. I used the following demo method for the calculation:

func NumTokensFromMessages(messages []openai.ChatCompletionMessage, model string) (num_tokens int) {
	tkm, err := tiktoken.EncodingForModel(model)
	if err != nil {
		err = fmt.Errorf("EncodingForModel: %v", err)
		fmt.Println(err)
		return
	}

	var tokens_per_message int
	var tokens_per_name int
	if model == "gpt-3.5-turbo-0301" || model == "gpt-3.5-turbo" {
		tokens_per_message = 4
		tokens_per_name = -1
	} else if model == "gpt-4-0314" || model == "gpt-4" {
		tokens_per_message = 3
		tokens_per_name = 1
	} else {
		fmt.Println("Warning: model not found. Using cl100k_base encoding.")
		tokens_per_message = 3
		tokens_per_name = 1
	}

	for _, message := range messages {
		num_tokens += tokens_per_message
		num_tokens += len(tkm.Encode(message.Content, nil, nil))
		num_tokens += len(tkm.Encode(message.Role, nil, nil))
		num_tokens += len(tkm.Encode(message.Name, nil, nil))
		if message.Name != "" {
			num_tokens += tokens_per_name
		}
	}
	num_tokens += 3
	return num_tokens
}

OOM server error

I have a basic use case (a token counter); however, because GetEncoding is so expensive, my server with 50M of memory gets an OOM error immediately with only about 7 goroutines calling it at the same time. It would be great if this were optimized, or if the tkm could be shared and reused.

func countTokens(messages []openai.ChatCompletionMessage) int {
	tkm, err := tiktoken.GetEncoding(tiktoken.MODEL_CL100K_BASE)
	if err != nil {
		panic(err)
	}

	tokensPerMessage := 3
	var tokenCount int
	for _, message := range messages {
		tokenCount += tokensPerMessage
		tokenCount += len(tkm.Encode(message.Content, nil, nil))
		tokenCount += len(tkm.Encode(message.Role, nil, nil))
	}
	tokenCount += tokensPerMessage // every reply is primed with <|start|>assistant<|message|>

	return tokenCount
}

License?

Hi, pkoukk!

Nice port! However, I noticed that there is no explicit license notice in the repository, which means the code cannot be used by others and is considered proprietary.

If you would like to publish the code as a Free and Open Source Software, I recommend choosing a license from https://www.gnu.org/licenses/license-list.html.

Thank you for publishing this code and I hope you find my suggestion helpful.

The calculated token count is inaccurate

My prompt:

golang, Please use Markdown syntax to reply

model: gpt-3.5-turbo
encoding: cl100k_base

The calculated result is 9 tokens.
But the actual count is 17, and the API returns this error:

This model's maximum context length is 4097 tokens. However, you requested 4104 tokens (17 in the messages, 4087 in the completion). Please reduce the length of the messages or completion.

The code is as follows:

func (g *GPT) getTikTokenByEncoding(prompt string) (int, error) {
	encoding := g.getAvailableEncodingModel(Model)
	g.App.LogInfo("encoding: ", encoding)
	tkm, err := tiktoken.GetEncoding(encoding)
	if err != nil {
		return 0, err
	}
	token := tkm.Encode(prompt, nil, nil)
	return len(token), nil
}

How can I solve this?
