BPE tokenizer for OpenAI models; encodes text into compact token sequences. https://github.com/openai/tiktoken

天问 15a7c61bb6 Update 'README.md' 3 weeks ago

tiktoken

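tiktoken is OpenAI's open-source BPE (byte pair encoding) tokenizer. It splits text into tokens and maps them to the integer IDs used by OpenAI models, which also makes token counting and token-frequency statistics straightforward. Part-of-speech tagging is not part of tiktoken.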
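
To make the BPE idea concrete, here is a minimal toy sketch in plain Python. It starts from single bytes and repeatedly merges the most frequent adjacent pair. The helper names (`most_frequent_pair`, `merge_pair`, `toy_bpe`) are hypothetical and are not part of tiktoken; tiktoken itself applies a fixed, pretrained merge table rather than learning merges from the input, so this only illustrates the merging principle.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most common adjacent pair, or None if there are no pairs."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def toy_bpe(text, num_merges):
    """Start from single bytes and apply up to `num_merges` greedy merges."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
    return tokens


print(toy_bpe("aaabdaaabac", 3))
```

Each merge shortens the sequence, which is where the compression comes from: frequent byte runs end up as single tokens, and joining the tokens always gives back the original bytes.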

Usage

pip install tiktoken

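Text splitting

import tiktoken

# Load the encoding used by gpt-3.5-turbo
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
text = "hello, world!"
# Encode to token IDs, then show each token as its raw bytes
token_ids = encoding.encode(text)
tokens = [encoding.decode_single_token_bytes(t) for t in token_ids]
print(tokens)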

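Encoding and decoding:

import tiktoken

text = """
hello world
"""
# Alternatively: tiktoken.encoding_for_model("gpt-4")
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

# Encode: text -> list of integer token IDs
tokens = encoding.encode(text)
print(tokens)

# Decode: each token ID -> the raw bytes it stands for
print([encoding.decode_single_token_bytes(token) for token in tokens])

# Decode the whole list back into the original string
print(encoding.decode(tokens))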
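
`decode_single_token_bytes` returns `bytes` rather than `str` because a BPE token is a byte sequence that does not have to end on a UTF-8 character boundary. The sketch below shows the effect using only the standard library. The token boundaries in `pieces` are chosen by hand for illustration and are an assumption, not real tiktoken output; `try_decode` is a hypothetical helper.

```python
# Hand-picked byte pieces (an assumption, not real tiktoken output):
# "café" with the two-byte character "é" split across two pieces.
pieces = [b"caf", b"\xc3", b"\xa9"]


def try_decode(piece):
    """Decode one piece as UTF-8, or return None if it is not valid on its own."""
    try:
        return piece.decode("utf-8")
    except UnicodeDecodeError:
        return None


# Decoding piece by piece fails for the halves of the split character...
per_piece = [try_decode(p) for p in pieces]
# ...while joining the bytes first and decoding once succeeds.
whole = b"".join(pieces).decode("utf-8")

print(per_piece)
print(whole)
```

This is why per-token output is best inspected as bytes, while whole sequences can be turned back into text in one step.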

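Token frequency counting

from collections import Counter

import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
text = "I am running, I am happy, I like running."
# Split the text into tokens (as raw bytes), then count each token
tokens = [encoding.decode_single_token_bytes(t) for t in encoding.encode(text)]
token_counts = Counter(tokens)
print(token_counts)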
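
If whole-word frequencies are wanted rather than token frequencies, the standard library is enough. This is a minimal sketch, assuming a word is a run of ASCII letters or apostrophes and that case should be ignored; `count_words` here is a hypothetical helper defined in the snippet, not a tiktoken function.

```python
import re
from collections import Counter


def count_words(text):
    """Count case-insensitive words (hypothetical helper, not part of tiktoken)."""
    # Assumption: a word is a run of ASCII letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


word_counts = count_words("I am running, I am happy, I like running.")
print(word_counts.most_common(3))
```

Word counts and token counts generally differ, because a BPE tokenizer may split one word into several tokens or attach leading spaces to a token.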