Split by tokens
Language models have a token limit, which you should not exceed. When you split your text into chunks, it is therefore a good idea to count the number of tokens. When you count tokens in your text, you should use the same tokenizer that the language model uses.
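As a quick illustration of token counting, here is a minimal sketch that counts the tokens in a short string with tiktoken (assuming tiktoken is installed; the cl100k_base encoding name is only an example, pick the encoding that matches your model):
import tiktoken
# Example encoding name; use the encoding that matches your model.
encoding = tiktoken.get_encoding("cl100k_base")
text = "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman."
print(len(encoding.encode(text)))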
tiktoken
tiktoken is a fast BPE tokenizer created by OpenAI.
We can use it to estimate the number of tokens used. It will probably be more accurate for OpenAI models.
- How the text is split: by the characters passed in.
- How the chunk size is measured: by the tiktoken tokenizer.
# This is a long document we can split up.
with open("../../../state_of_the_union.txt") as f:
state_of_the_union = f.read()
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
Output:
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.
Last year COVID-19 kept us apart. This year we are finally together again.
Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.
With a duty to one another to the American people to the Constitution.
Info
Note that if we use CharacterTextSplitter.from_tiktoken_encoder, the text is only split by the CharacterTextSplitter and the tiktoken tokenizer is used to merge the splits. That means a split can still be larger than the chunk size measured by the tiktoken tokenizer.
We can use RecursiveCharacterTextSplitter.from_tiktoken_encoder to make sure the splits are not larger than the chunk size of tokens allowed by the language model; any split that is still too large is recursively re-split, as in the sketch below.
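A minimal sketch of that approach, reusing the state_of_the_union text loaded above (the chunk_size value is illustrative):
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Splits that exceed chunk_size tokens are recursively re-split.
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])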
We can also load a tiktoken splitter directly, which ensures each split is smaller than the chunk size.
from langchain.text_splitter import TokenTextSplitter
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
spaCy
spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython.
Another alternative to NLTK is to use the spaCy tokenizer.
- How the text is split: by the spaCy tokenizer.
- How the chunk size is measured: by number of characters.
# This is a long document we can split up.
with open("../../../state_of_the_union.txt") as f:
state_of_the_union = f.read()
from langchain.text_splitter import SpacyTextSplitter
text_splitter = SpacyTextSplitter(chunk_size=1000)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
Output:
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.
Members of Congress and the Cabinet.
Justices of the Supreme Court.
My fellow Americans.
Last year COVID-19 kept us apart.
This year we are finally together again.
Tonight, we meet as Democrats Republicans and Independents.
But most importantly as Americans.
With a duty to one another to the American people to the Constitution.
And with an unwavering resolve that freedom will always triumph over tyranny.
Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways.
But he badly miscalculated.
He thought he could roll into Ukraine and the world would roll over.
Instead he met a wall of strength he never imagined.
He met the Ukrainian people.
From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.
SentenceTransformers
The SentenceTransformersTokenTextSplitter is a specialized text splitter for use with sentence-transformer models. Its default behaviour is to split the text into chunks that fit the token window of the sentence-transformer model you want to use.
from langchain.text_splitter import SentenceTransformersTokenTextSplitter
splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0)
text = "Lorem "
count_start_and_stop_tokens = 2
text_token_count = splitter.count_tokens(text=text) - count_start_and_stop_tokens
print(text_token_count)
token_multiplier = splitter.maximum_tokens_per_chunk // text_token_count + 1
# `text_to_split` does not fit in a single chunk
text_to_split = text * token_multiplier
print(f"tokens in text to split: {splitter.count_tokens(text=text_to_split)}")
text_chunks = splitter.split_text(text=text_to_split)
print(text_chunks[1])
NLTK
The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English, written in the Python programming language.
Rather than just splitting on "\n\n", we can use NLTK to split based on the NLTK tokenizer.
- How the text is split: by the NLTK tokenizer.
- How the chunk size is measured: by number of characters.
# This is a long document we can split up.
with open("../../../state_of_the_union.txt") as f:
state_of_the_union = f.read()
from langchain.text_splitter import NLTKTextSplitter
text_splitter = NLTKTextSplitter(chunk_size=1000)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
Output:
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.
Members of Congress and the Cabinet.
Justices of the Supreme Court.
My fellow Americans.
Last year COVID-19 kept us apart.
This year we are finally together again.
Tonight, we meet as Democrats Republicans and Independents.
But most importantly as Americans.
With a duty to one another to the American people to the Constitution.
And with an unwavering resolve that freedom will always triumph over tyranny.
Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways.
But he badly miscalculated.
He thought he could roll into Ukraine and the world would roll over.
Instead he met a wall of strength he never imagined.
He met the Ukrainian people.
From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.
Groups of citizens blocking tanks with their bodies.
Hugging Face tokenizer
Hugging Face has many tokenizers.
We use the Hugging Face tokenizer GPT2TokenizerFast to count the text length in tokens.
- How the text is split: by the characters passed in.
- How the chunk size is measured: by the number of tokens calculated by the Hugging Face tokenizer.
from transformers import GPT2TokenizerFast
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
# This is a long document we can split up.
with open("../../../state_of_the_union.txt") as f:
state_of_the_union = f.read()
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
tokenizer, chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
Output:
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.
Last year COVID-19 kept us apart. This year we are finally together again.
Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.
With a duty to one another to the American people to the Constitution.