tiktoken is OpenAI's open-source tokenizer library based on the BPE (Byte Pair Encoding) algorithm. It splits text into tokens, encodes text to token IDs, decodes them back, and makes token counting easy. (It is a tokenizer, not a full NLP toolkit, so it does not provide part-of-speech tagging.)
pip install tiktoken
Text splitting
import tiktoken

# tiktoken has no Tokenizer class; load an encoding directly
encoding = tiktoken.get_encoding("cl100k_base")
text = "hello, world!"
tokens = encoding.encode(text)
print(tokens)
Encoding and decoding:
import tiktoken
text = """
hello world
"""
# tiktoken.encoding_for_model("gpt-4") works the same way
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
# Encode: text -> token IDs
tokens = encoding.encode(text)
print(tokens)
# Decode each token ID individually to its raw bytes
print([encoding.decode_single_token_bytes(token) for token in tokens])
# Decode the whole sequence back to the original string
print(encoding.decode(tokens))
Frequency counting
import tiktoken
from collections import Counter

# tiktoken has no Tokenizer/count_words API; use collections.Counter
# over the decoded tokens instead
encoding = tiktoken.get_encoding("cl100k_base")
text = "I am running, I am happy, I like running."
tokens = encoding.encode(text)
word_counts = Counter(encoding.decode([t]) for t in tokens)
print(word_counts)
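Note that BPE tokens are subword units (a word can be split into several tokens, and a token can start with a leading space), so token frequencies are not the same as word frequencies. For a plain word-level count, the standard library alone is enough; a minimal sketch:

```python
import re
from collections import Counter

# Word-level frequency count for comparison with token-level counts
text = "I am running, I am happy, I like running."
words = re.findall(r"[A-Za-z']+", text.lower())
word_counts = Counter(words)
print(word_counts)  # e.g. 'i' appears 3 times, 'running' 2 times
```

Use this when you want counts of whole words; use the token-based count above when you need to reason about what the model actually sees.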