mirror of https://github.com/hwchase17/langchain
310 lines
11 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "raw",
|
|
"id": "b5fc1fc7-c4c5-418f-99da-006c604a7ea6",
|
|
"metadata": {},
|
|
"source": [
|
|
"---\n",
|
|
"title: Custom Retriever\n",
|
|
"---"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "ff6f3c79-0848-4956-9115-54f6b2134587",
|
|
"metadata": {},
|
|
"source": [
|
|
"# How to create a custom Retriever\n",
|
|
"\n",
|
|
"## Overview\n",
|
|
"\n",
|
|
"Many LLM applications involve retrieving information from external data sources using a `Retriever`. \n",
|
|
"\n",
"A retriever is responsible for retrieving a list of `Document` objects relevant to a given user `query`.\n",
|
|
"\n",
"The retrieved documents are often formatted into prompts that are fed into an LLM, allowing the LLM to use the information in them to generate an appropriate response (e.g., answering a user question based on a knowledge base).\n",
|
|
"\n",
|
|
"## Interface\n",
|
|
"\n",
|
|
"To create your own retriever, you need to extend the `BaseRetriever` class and implement the following methods:\n",
|
|
"\n",
|
|
"| Method | Description | Required/Optional |\n",
|
|
"|--------------------------------|--------------------------------------------------|-------------------|\n",
|
|
"| `_get_relevant_documents` | Get documents relevant to a query. | Required |\n",
|
|
"| `_aget_relevant_documents` | Implement to provide async native support. | Optional |\n",
|
|
"\n",
|
|
"\n",
|
|
"The logic inside of `_get_relevant_documents` can involve arbitrary calls to a database or to the web using requests.\n",
|
|
"\n",
|
|
":::{.callout-tip}\n",
"By inheriting from `BaseRetriever`, your retriever automatically becomes a LangChain [Runnable](/docs/expression_language/interface) and gains the standard `Runnable` functionality out of the box!\n",
|
|
":::\n",
|
|
"\n",
|
|
"\n",
|
|
":::{.callout-info}\n",
|
|
"You can use a `RunnableLambda` or `RunnableGenerator` to implement a retriever.\n",
|
|
"\n",
"The main benefit of implementing a retriever as a `BaseRetriever` vs. a `RunnableLambda` (a custom [runnable function](/docs/how_to/functions)) is that a `BaseRetriever` is a well-known\n",
"LangChain entity, so some monitoring tooling may implement specialized behavior for retrievers. Another difference\n",
"is that a `BaseRetriever` behaves slightly differently from a `RunnableLambda` in some APIs; e.g., the `start` event\n",
"in the `astream_events` API will be `on_retriever_start` instead of `on_chain_start`.\n",
|
|
":::\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "2be9fe82-0757-41d1-a647-15bed11fd3bf",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Example\n",
|
|
"\n",
|
|
"Let's implement a toy retriever that returns all documents whose text contains the text in the user query."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 26,
|
|
"id": "bdf61902-2984-493b-a002-d4fced6df590",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from typing import List\n",
|
|
"\n",
|
|
"from langchain_core.callbacks import CallbackManagerForRetrieverRun\n",
|
|
"from langchain_core.documents import Document\n",
|
|
"from langchain_core.retrievers import BaseRetriever\n",
|
|
"\n",
|
|
"\n",
|
|
"class ToyRetriever(BaseRetriever):\n",
"    \"\"\"A toy retriever that returns up to k documents containing the user query.\n",
"\n",
"    This retriever only implements the sync method _get_relevant_documents.\n",
"\n",
"    If the retriever were to involve file access or network access, it could benefit\n",
"    from a native async implementation of `_aget_relevant_documents`.\n",
"\n",
"    As usual, with Runnables, there's a default async implementation that's provided\n",
"    that delegates to the sync implementation running on another thread.\n",
"    \"\"\"\n",
"\n",
"    documents: List[Document]\n",
"    \"\"\"List of documents to retrieve from.\"\"\"\n",
"    k: int\n",
"    \"\"\"Number of top results to return\"\"\"\n",
"\n",
"    def _get_relevant_documents(\n",
"        self, query: str, *, run_manager: CallbackManagerForRetrieverRun\n",
"    ) -> List[Document]:\n",
"        \"\"\"Sync implementation for the retriever.\"\"\"\n",
"        matching_documents = []\n",
"        for document in self.documents:\n",
"            # Stop as soon as k matches have been collected.\n",
"            if len(matching_documents) >= self.k:\n",
"                return matching_documents\n",
"\n",
"            if query.lower() in document.page_content.lower():\n",
"                matching_documents.append(document)\n",
"        return matching_documents\n",
"\n",
"    # Optional: Provide a more efficient native implementation by overriding\n",
"    # _aget_relevant_documents\n",
"    # async def _aget_relevant_documents(\n",
"    #     self, query: str, *, run_manager: AsyncCallbackManagerForRetrieverRun\n",
"    # ) -> List[Document]:\n",
"    #     \"\"\"Asynchronously get documents relevant to a query.\n",
"\n",
"    #     Args:\n",
"    #         query: String to find relevant documents for\n",
"    #         run_manager: The callbacks handler to use\n",
"\n",
"    #     Returns:\n",
"    #         List of relevant documents\n",
"    #     \"\"\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "2eac1f28-29c1-4888-b3aa-b4fa70c73b4c",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Test it 🧪"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 21,
|
|
"id": "ea868db5-48cc-4ec2-9b0a-1ab94c32b302",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"documents = [\n",
|
|
" Document(\n",
|
|
" page_content=\"Dogs are great companions, known for their loyalty and friendliness.\",\n",
|
|
" metadata={\"type\": \"dog\", \"trait\": \"loyalty\"},\n",
|
|
" ),\n",
|
|
" Document(\n",
|
|
" page_content=\"Cats are independent pets that often enjoy their own space.\",\n",
|
|
" metadata={\"type\": \"cat\", \"trait\": \"independence\"},\n",
|
|
" ),\n",
|
|
" Document(\n",
|
|
" page_content=\"Goldfish are popular pets for beginners, requiring relatively simple care.\",\n",
|
|
" metadata={\"type\": \"fish\", \"trait\": \"low maintenance\"},\n",
|
|
" ),\n",
|
|
" Document(\n",
|
|
" page_content=\"Parrots are intelligent birds capable of mimicking human speech.\",\n",
|
|
" metadata={\"type\": \"bird\", \"trait\": \"intelligence\"},\n",
|
|
" ),\n",
|
|
" Document(\n",
|
|
" page_content=\"Rabbits are social animals that need plenty of space to hop around.\",\n",
|
|
" metadata={\"type\": \"rabbit\", \"trait\": \"social\"},\n",
|
|
" ),\n",
|
|
"]\n",
|
|
"retriever = ToyRetriever(documents=documents, k=3)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 22,
|
|
"id": "18be85e9-6ef0-4ee0-ae5d-a0810c38b254",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'}),\n",
|
|
" Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'type': 'rabbit', 'trait': 'social'})]"
|
|
]
|
|
},
|
|
"execution_count": 22,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"retriever.invoke(\"that\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "13f76f6e-cf2b-4f67-859b-0ef8be98abbe",
|
|
"metadata": {},
|
|
"source": [
|
|
"It's a **runnable** so it'll benefit from the standard Runnable Interface! 🤩"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 23,
|
|
"id": "3672e9fe-4365-4628-9d25-31924cfaf784",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'}),\n",
|
|
" Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'type': 'rabbit', 'trait': 'social'})]"
|
|
]
|
|
},
|
|
"execution_count": 23,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"await retriever.ainvoke(\"that\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 24,
|
|
"id": "e2c96eed-6813-421c-acf2-6554839840ee",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"[[Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'type': 'dog', 'trait': 'loyalty'})],\n",
|
|
" [Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'})]]"
|
|
]
|
|
},
|
|
"execution_count": 24,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"retriever.batch([\"dog\", \"cat\"])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 25,
|
|
"id": "978b6636-bf36-42c2-969c-207718f084cf",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"{'event': 'on_retriever_start', 'run_id': 'f96f268d-8383-4921-b175-ca583924d9ff', 'name': 'ToyRetriever', 'tags': [], 'metadata': {}, 'data': {'input': 'bar'}}\n",
|
|
"{'event': 'on_retriever_stream', 'run_id': 'f96f268d-8383-4921-b175-ca583924d9ff', 'tags': [], 'metadata': {}, 'name': 'ToyRetriever', 'data': {'chunk': []}}\n",
|
|
"{'event': 'on_retriever_end', 'name': 'ToyRetriever', 'run_id': 'f96f268d-8383-4921-b175-ca583924d9ff', 'tags': [], 'metadata': {}, 'data': {'output': []}}\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"async for event in retriever.astream_events(\"bar\", version=\"v1\"):\n",
|
|
" print(event)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "7b45c404-37bf-4370-bb7c-26556777ff46",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Contributing\n",
|
|
"\n",
|
|
"We appreciate contributions of interesting retrievers!\n",
|
|
"\n",
|
|
"Here's a checklist to help make sure your contribution gets added to LangChain:\n",
|
|
"\n",
|
|
"Documentation:\n",
|
|
"\n",
"* [ ] The retriever contains doc-strings for all initialization arguments, as these will be surfaced in the [API Reference](https://api.python.langchain.com/en/stable/langchain_api_reference.html).\n",
"* [ ] The class doc-string for the retriever contains a link to any relevant APIs it uses (e.g., if the retriever retrieves from Wikipedia, it's good to link to the Wikipedia API!)\n",
|
|
"\n",
|
|
"Tests:\n",
|
|
"\n",
|
|
"* [ ] Add unit or integration tests to verify that `invoke` and `ainvoke` work.\n",
|
|
"\n",
|
|
"Optimizations:\n",
|
|
"\n",
|
|
"If the retriever is connecting to external data sources (e.g., an API or a file), it'll almost certainly benefit from an async native optimization!\n",
|
|
" \n",
|
|
"* [ ] Provide a native async implementation of `_aget_relevant_documents` (used by `ainvoke`)"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.10.1"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 5
|
|
}
|
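The notebook's `ToyRetriever` docstring notes that the default async path delegates to the sync implementation running on another thread. The following is a minimal, dependency-free sketch of that idea together with the top-k substring matching used in the example. It is not LangChain's actual code: `Doc`, `get_relevant`, and `aget_relevant` are hypothetical names introduced here for illustration, and `asyncio.to_thread` (Python 3.9+) is assumed as the way to hand the sync call to a worker thread.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for langchain_core.documents.Document."""

    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def get_relevant(documents: List[Doc], query: str, k: int) -> List[Doc]:
    """Return up to k documents whose text contains the query (case-insensitive)."""
    needle = query.lower()
    matches: List[Doc] = []
    for doc in documents:
        if len(matches) >= k:  # stop once k matches have been collected
            break
        if needle in doc.page_content.lower():
            matches.append(doc)
    return matches


async def aget_relevant(documents: List[Doc], query: str, k: int) -> List[Doc]:
    """Async wrapper (assumption): run the sync lookup in a worker thread."""
    return await asyncio.to_thread(get_relevant, documents, query, k)


docs = [
    Doc("Cats are independent pets that often enjoy their own space."),
    Doc("Rabbits are social animals that need plenty of space to hop around."),
    Doc("Parrots are intelligent birds capable of mimicking human speech."),
]

# Both paths run the same matching logic, so they should agree.
sync_result = get_relevant(docs, "space", k=1)
async_result = asyncio.run(aget_relevant(docs, "space", k=1))
```

A thread-based wrapper like this keeps the event loop free but does not make the lookup itself any faster; a retriever that talks to a network service or a file would likely gain more from a truly native async override of `_aget_relevant_documents`, as the Contributing checklist suggests.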