langchain/docs/docs/how_to/custom_retriever.ipynb

{
 "cells": [
  {
   "cell_type": "raw",
   "id": "b5fc1fc7-c4c5-418f-99da-006c604a7ea6",
   "metadata": {},
   "source": [
    "---\n",
    "title: Custom Retriever\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ff6f3c79-0848-4956-9115-54f6b2134587",
   "metadata": {},
   "source": [
    "# How to create a custom Retriever\n",
    "\n",
    "## Overview\n",
    "\n",
    "Many LLM applications involve retrieving information from external data sources using a `Retriever`. \n",
    "\n",
    "A retriever is responsible for retrieving a list of relevant `Documents` to a given user `query`.\n",
    "\n",
    "The retrieved documents are often formatted into prompts that are fed into an LLM, allowing the LLM to use the information in the to generate an appropriate response (e.g., answering a user question based on a knowledge base).\n",
    "\n",
    "## Interface\n",
    "\n",
    "To create your own retriever, you need to extend the `BaseRetriever` class and implement the following methods:\n",
    "\n",
    "| Method                         | Description                                      | Required/Optional |\n",
    "|--------------------------------|--------------------------------------------------|-------------------|\n",
    "| `_get_relevant_documents`      | Get documents relevant to a query.               | Required          |\n",
    "| `_aget_relevant_documents`     | Implement to provide async native support.       | Optional          |\n",
    "\n",
    "\n",
    "The logic inside of `_get_relevant_documents` can involve arbitrary calls to a database or to the web using requests.\n",
    "\n",
    ":::{.callout-tip}\n",
    "By inherting from `BaseRetriever`, your retriever automatically becomes a LangChain [Runnable](/docs/expression_language/interface) and will gain the standard `Runnable` functionality out of the box!\n",
    ":::\n",
    "\n",
    "\n",
    ":::{.callout-info}\n",
    "You can use a `RunnableLambda` or `RunnableGenerator` to implement a retriever.\n",
    "\n",
    "The main benefit of implementing a retriever as a `BaseRetriever` vs. a `RunnableLambda` (a custom [runnable function](/docs/how_to/functions)) is that a `BaseRetriever` is a well\n",
    "known LangChain entity so some tooling for monitoring may implement specialized behavior for retrievers. Another difference\n",
    "is that a `BaseRetriever` will behave slightly differently from `RunnableLambda` in some APIs; e.g., the `start` event\n",
    "in `astream_events` API will be `on_retriever_start` instead of `on_chain_start`.\n",
    ":::\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2be9fe82-0757-41d1-a647-15bed11fd3bf",
   "metadata": {},
   "source": [
    "## Example\n",
    "\n",
    "Let's implement a toy retriever that returns all documents whose text contains the text in the user query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "bdf61902-2984-493b-a002-d4fced6df590",
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import List\n",
    "\n",
    "from langchain_core.callbacks import CallbackManagerForRetrieverRun\n",
    "from langchain_core.documents import Document\n",
    "from langchain_core.retrievers import BaseRetriever\n",
    "\n",
    "\n",
    "class ToyRetriever(BaseRetriever):\n",
    "    \"\"\"A toy retriever that contains the top k documents that contain the user query.\n",
    "\n",
    "    This retriever only implements the sync method _get_relevant_documents.\n",
    "\n",
    "    If the retriever were to involve file access or network access, it could benefit\n",
    "    from a native async implementation of `_aget_relevant_documents`.\n",
    "\n",
    "    As usual, with Runnables, there's a default async implementation that's provided\n",
    "    that delegates to the sync implementation running on another thread.\n",
    "    \"\"\"\n",
    "\n",
    "    documents: List[Document]\n",
    "    \"\"\"List of documents to retrieve from.\"\"\"\n",
    "    k: int\n",
    "    \"\"\"Number of top results to return\"\"\"\n",
    "\n",
    "    def _get_relevant_documents(\n",
    "        self, query: str, *, run_manager: CallbackManagerForRetrieverRun\n",
    "    ) -> List[Document]:\n",
    "        \"\"\"Sync implementations for retriever.\"\"\"\n",
    "        matching_documents = []\n",
    "        for document in documents:\n",
    "            if len(matching_documents) > self.k:\n",
    "                return matching_documents\n",
    "\n",
    "            if query.lower() in document.page_content.lower():\n",
    "                matching_documents.append(document)\n",
    "        return matching_documents\n",
    "\n",
    "    # Optional: Provide a more efficient native implementation by overriding\n",
    "    # _aget_relevant_documents\n",
    "    # async def _aget_relevant_documents(\n",
    "    #     self, query: str, *, run_manager: AsyncCallbackManagerForRetrieverRun\n",
    "    # ) -> List[Document]:\n",
    "    #     \"\"\"Asynchronously get documents relevant to a query.\n",
    "\n",
    "    #     Args:\n",
    "    #         query: String to find relevant documents for\n",
    "    #         run_manager: The callbacks handler to use\n",
    "\n",
    "    #     Returns:\n",
    "    #         List of relevant documents\n",
    "    #     \"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2eac1f28-29c1-4888-b3aa-b4fa70c73b4c",
   "metadata": {},
   "source": [
    "## Test it 🧪"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "ea868db5-48cc-4ec2-9b0a-1ab94c32b302",
   "metadata": {},
   "outputs": [],
   "source": [
    "documents = [\n",
    "    Document(\n",
    "        page_content=\"Dogs are great companions, known for their loyalty and friendliness.\",\n",
    "        metadata={\"type\": \"dog\", \"trait\": \"loyalty\"},\n",
    "    ),\n",
    "    Document(\n",
    "        page_content=\"Cats are independent pets that often enjoy their own space.\",\n",
    "        metadata={\"type\": \"cat\", \"trait\": \"independence\"},\n",
    "    ),\n",
    "    Document(\n",
    "        page_content=\"Goldfish are popular pets for beginners, requiring relatively simple care.\",\n",
    "        metadata={\"type\": \"fish\", \"trait\": \"low maintenance\"},\n",
    "    ),\n",
    "    Document(\n",
    "        page_content=\"Parrots are intelligent birds capable of mimicking human speech.\",\n",
    "        metadata={\"type\": \"bird\", \"trait\": \"intelligence\"},\n",
    "    ),\n",
    "    Document(\n",
    "        page_content=\"Rabbits are social animals that need plenty of space to hop around.\",\n",
    "        metadata={\"type\": \"rabbit\", \"trait\": \"social\"},\n",
    "    ),\n",
    "]\n",
    "retriever = ToyRetriever(documents=documents, k=3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "18be85e9-6ef0-4ee0-ae5d-a0810c38b254",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'}),\n",
       " Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'type': 'rabbit', 'trait': 'social'})]"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "retriever.invoke(\"that\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13f76f6e-cf2b-4f67-859b-0ef8be98abbe",
   "metadata": {},
   "source": [
    "It's a **runnable** so it'll benefit from the standard Runnable Interface! 🤩"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "3672e9fe-4365-4628-9d25-31924cfaf784",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'}),\n",
       " Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'type': 'rabbit', 'trait': 'social'})]"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "await retriever.ainvoke(\"that\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "e2c96eed-6813-421c-acf2-6554839840ee",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[[Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'type': 'dog', 'trait': 'loyalty'})],\n",
       " [Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'})]]"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "retriever.batch([\"dog\", \"cat\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "978b6636-bf36-42c2-969c-207718f084cf",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'event': 'on_retriever_start', 'run_id': 'f96f268d-8383-4921-b175-ca583924d9ff', 'name': 'ToyRetriever', 'tags': [], 'metadata': {}, 'data': {'input': 'bar'}}\n",
      "{'event': 'on_retriever_stream', 'run_id': 'f96f268d-8383-4921-b175-ca583924d9ff', 'tags': [], 'metadata': {}, 'name': 'ToyRetriever', 'data': {'chunk': []}}\n",
      "{'event': 'on_retriever_end', 'name': 'ToyRetriever', 'run_id': 'f96f268d-8383-4921-b175-ca583924d9ff', 'tags': [], 'metadata': {}, 'data': {'output': []}}\n"
     ]
    }
   ],
   "source": [
    "async for event in retriever.astream_events(\"bar\", version=\"v1\"):\n",
    "    print(event)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7b45c404-37bf-4370-bb7c-26556777ff46",
   "metadata": {},
   "source": [
    "## Contributing\n",
    "\n",
    "We appreciate contributions of interesting retrievers!\n",
    "\n",
    "Here's a checklist to help make sure your contribution gets added to LangChain:\n",
    "\n",
    "Documentation:\n",
    "\n",
    "* The retriever contains doc-strings for all initialization arguments, as these will be surfaced in the [API Reference](https://api.python.langchain.com/en/stable/langchain_api_reference.html).\n",
    "* The class doc-string for the model contains a link to any relevant APIs used for the retriever (e.g., if the retriever is retrieving from wikipedia, it'll be good to link to the wikipedia API!)\n",
    "\n",
    "Tests:\n",
    "\n",
    "* [ ] Add unit or integration tests to verify that `invoke` and `ainvoke` work.\n",
    "\n",
    "Optimizations:\n",
    "\n",
    "If the retriever is connecting to external data sources (e.g., an API or a file), it'll almost certainly benefit from an async native optimization!\n",
    " \n",
    "* [ ] Provide a native async implementation of `_aget_relevant_documents` (used by `ainvoke`)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}