MultiQueryRetriever¶
基於距離的向量資料庫檢索在高維空間中嵌入(表示)查詢,並根據「距離」找到相似的嵌入文件。但是,如果查詢措辭發生細微變化,或者嵌入不能很好地捕獲資料的語義,檢索可能會產生不同的結果。有時會進行及時的工程/調整來手動解決這些問題,但這可能很乏味。
MultiQueryRetriever
透過使用 LLM 從不同角度為給定的使用者輸入查詢來產生多個查詢,從而自動執行提示調整流程。對於每個查詢,它都會檢索一組相關文檔,並採用所有查詢之間的唯一聯集來獲取更大的一組潛在相關文檔。透過對同一問題產生多個視角,MultiQueryRetriever
或許能夠克服基於距離的檢索的一些限制,並獲得更豐富的結果集。
# Build a sample vectorDB
from langchain.vectorstores import Chroma
from langchain.document_loaders import WebBaseLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Load blog post
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
data = loader.load()
# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
splits = text_splitter.split_documents(data)
# VectorDB
embedding = OpenAIEmbeddings()
vectordb = Chroma.from_documents(documents=splits, embedding=embedding)
API Reference:
簡單使用範例¶
指定用於查詢產生的 LLM,檢索器將完成其餘的工作。
from langchain.chat_models import ChatOpenAI
from langchain.retrievers.multi_query import MultiQueryRetriever
question = "What are the approaches to Task Decomposition?"
llm = ChatOpenAI(temperature=0)
retriever_from_llm = MultiQueryRetriever.from_llm(
retriever=vectordb.as_retriever(), llm=llm
)
API Reference:
# Set logging for the queries
import logging
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)
unique_docs = retriever_from_llm.get_relevant_documents(query=question)
len(unique_docs)
結果:
INFO:langchain.retrievers.multi_query:Generated queries: ['1. How can Task Decomposition be approached?', '2. What are the different methods for Task Decomposition?', '3. What are the various approaches to decomposing tasks?']
5
提供自己的提示¶
您還可以提供提示和輸出解析器,以將結果拆分為查詢清單。
from typing import List
from langchain.chains import LLMChain
from pydantic import BaseModel, Field
from langchain.prompts import PromptTemplate
from langchain.output_parsers import PydanticOutputParser
# Output parser will split the LLM result into a list of queries
class LineList(BaseModel):
# "lines" is the key (attribute name) of the parsed output
lines: List[str] = Field(description="Lines of text")
class LineListOutputParser(PydanticOutputParser):
def __init__(self) -> None:
super().__init__(pydantic_object=LineList)
def parse(self, text: str) -> LineList:
lines = text.strip().split("\n")
return LineList(lines=lines)
output_parser = LineListOutputParser()
QUERY_PROMPT = PromptTemplate(
input_variables=["question"],
template="""You are an AI language model assistant. Your task is to generate five
different versions of the given user question to retrieve relevant documents from a vector
database. By generating multiple perspectives on the user question, your goal is to help
the user overcome some of the limitations of the distance-based similarity search.
Provide these alternative questions seperated by newlines.
Original question: {question}""",
)
llm = ChatOpenAI(temperature=0)
# Chain
llm_chain = LLMChain(llm=llm, prompt=QUERY_PROMPT, output_parser=output_parser)
# Other inputs
question = "What are the approaches to Task Decomposition?"
API Reference:
# Run
retriever = MultiQueryRetriever(
retriever=vectordb.as_retriever(), llm_chain=llm_chain, parser_key="lines"
) # "lines" is the key (attribute name) of the parsed output
# Results
unique_docs = retriever.get_relevant_documents(結果:
5
query="What does the course say about regression?" ) len(unique_docs)
結果:
INFO:langchain.retrievers.multi_query:Generated queries: ["1. What is the course's perspective on regression?", '2. Can you provide information on regression as discussed in the course?', '3. How does the course cover the topic of regression?', "4. What are the course's teachings on regression?", '5. In relation to the course, what is mentioned about regression?']
11