Recursively split by character

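This text splitter is the recommended one for generic text. It is parameterized by a list of characters. It tries to split on each of them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. The effect is to keep paragraphs (then sentences, then words) together for as long as possible, since these generally appear to be the most strongly semantically related pieces of text.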

How the text is split: by a list of characters.
How the chunk size is measured: by number of characters.
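
To make the "try each separator in order" idea concrete, here is a minimal sketch in plain Python. It is not LangChain's implementation: the function name `recursive_split` is hypothetical, it ignores `chunk_overlap`, and it drops empty pieces between repeated separators. It only illustrates the fallback from coarse separators to finer ones.

```python
from typing import List, Sequence

# Same default order as described above: paragraphs, lines, words, characters.
DEFAULT_SEPARATORS = ("\n\n", "\n", " ", "")


def recursive_split(
    text: str,
    chunk_size: int,
    separators: Sequence[str] = DEFAULT_SEPARATORS,
) -> List[str]:
    """Simplified illustration (hypothetical helper, no overlap handling)."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    if len(text) <= chunk_size:
        return [text] if text else []

    # Use the first separator that occurs in the text; "" always matches.
    sep, rest = separators[-1], ()
    for i, candidate in enumerate(separators):
        if candidate == "" or candidate in text:
            sep, rest = candidate, tuple(separators[i + 1:])
            break

    pieces = list(text) if sep == "" else [p for p in text.split(sep) if p]

    chunks: List[str] = []
    current = ""
    for piece in pieces:
        if len(piece) > chunk_size:
            # Piece is still too big: flush what we have, then retry the
            # piece with the remaining, finer-grained separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, chunk_size, rest or ("",)))
            continue
        merged = piece if not current else current + sep + piece
        if len(merged) <= chunk_size:
            current = merged  # keep neighbouring pieces together
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("aaa bbb ccc\n\nddd eee fff ggg", chunk_size=8))
# ['aaa bbb', 'ccc', 'ddd eee', 'fff ggg']
```

In this sketch the paragraph break is tried first; only when a paragraph is still longer than `chunk_size` does the function fall back to splitting on spaces, and finally on individual characters.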

# This is a long document we can split up.
with open('../../../state_of_the_union.txt') as f:
    state_of_the_union = f.read()

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)

texts = text_splitter.create_documents([state_of_the_union])

print(texts[0])
print(texts[1])

Output:

page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and' lookup_str='' metadata={} lookup_index=0
page_content='of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.' lookup_str='' metadata={} lookup_index=0

text_splitter.split_text(state_of_the_union)[:2]

Output:

['Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and',
 'of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.']
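
The two chunks shown above repeat some text at their boundary, which is the visible effect of `chunk_overlap`. The short sketch below measures that shared text directly from the two output strings. The helper `shared_overlap` is hypothetical and is not part of LangChain.

```python
def shared_overlap(left: str, right: str) -> str:
    """Return the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return right[:size]
    return ""


# The two chunks from the output above.
first = (
    "Madam Speaker, Madam Vice President, our First Lady and "
    "Second Gentleman. Members of Congress and"
)
second = (
    "of Congress and the Cabinet. Justices of the Supreme Court. "
    "My fellow Americans."
)

overlap = shared_overlap(first, second)
print(repr(overlap), len(overlap))
# 'of Congress and' 15
```

The shared text is 15 characters long, which is within the configured `chunk_overlap` of 20. A likely reason it is shorter than 20 is that the overlap is built from whole pieces (here, words) rather than cut mid-word, although that reading is an inference from the output and not something this page states.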