langchain/docs/docs/integrations/retrievers/fleet_context.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a33a03c9-f11d-45ef-a563-9da0652fcf92",
   "metadata": {},
   "source": [
    "# Fleet AI Context\n",
    "\n",
    ">[Fleet AI Context](https://www.fleet.so/context) is a dataset of high-quality embeddings of the top 1200 most popular & permissive Python Libraries & their documentation.\n",
    ">\n",
    ">The `Fleet AI` team is on a mission to embed the world's most important data. They've started by embedding the top 1200 Python libraries to enable code generation with up-to-date knowledge. They've been kind enough to share their embeddings of the [LangChain docs](/docs/get_started/introduction) and [API reference](https://api.python.langchain.com/en/latest/api_reference.html).\n",
    "\n",
    "Let's take a look at how we can use these embeddings to power a docs retrieval system and ultimately a simple code-generating chain!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fe79536b-8b06-44a9-b81b-f2af16521c3a",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install --upgrade --quiet  langchain fleet-context langchain-openai pandas faiss-cpu # faiss-gpu for CUDA supported GPU"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "467eaea5-c6fa-45bb-973d-1bc92d2b8a48",
   "metadata": {},
   "outputs": [],
   "source": [
    "from operator import itemgetter\n",
    "from typing import Any, Optional, Type\n",
    "\n",
    "import pandas as pd\n",
    "from langchain.retrievers import MultiVectorRetriever\n",
    "from langchain_community.vectorstores import FAISS\n",
    "from langchain_core.documents import Document\n",
    "from langchain_core.stores import BaseStore\n",
    "from langchain_core.vectorstores import VectorStore\n",
    "from langchain_openai import OpenAIEmbeddings\n",
    "\n",
    "\n",
    "def load_fleet_retriever(\n",
    "    df: pd.DataFrame,\n",
    "    *,\n",
    "    vectorstore_cls: Type[VectorStore] = FAISS,\n",
    "    docstore: Optional[BaseStore] = None,\n",
    "    **kwargs: Any,\n",
    "):\n",
    "    vectorstore = _populate_vectorstore(df, vectorstore_cls)\n",
    "    if docstore is None:\n",
    "        return vectorstore.as_retriever(**kwargs)\n",
    "    else:\n",
    "        _populate_docstore(df, docstore)\n",
    "        return MultiVectorRetriever(\n",
    "            vectorstore=vectorstore, docstore=docstore, id_key=\"parent\", **kwargs\n",
    "        )\n",
    "\n",
    "\n",
    "def _populate_vectorstore(\n",
    "    df: pd.DataFrame,\n",
    "    vectorstore_cls: Type[VectorStore],\n",
    ") -> VectorStore:\n",
    "    if not hasattr(vectorstore_cls, \"from_embeddings\"):\n",
    "        raise ValueError(\n",
    "            f\"Incompatible vector store class {vectorstore_cls}.\"\n",
    "            \"Must implement `from_embeddings` class method.\"\n",
    "        )\n",
    "    texts_embeddings = []\n",
    "    metadatas = []\n",
    "    for _, row in df.iterrows():\n",
    "        texts_embeddings.append((row.metadata[\"text\"], row[\"dense_embeddings\"]))\n",
    "        metadatas.append(row.metadata)\n",
    "    return vectorstore_cls.from_embeddings(\n",
    "        texts_embeddings,\n",
    "        OpenAIEmbeddings(model=\"text-embedding-ada-002\"),\n",
    "        metadatas=metadatas,\n",
    "    )\n",
    "\n",
    "\n",
    "def _populate_docstore(df: pd.DataFrame, docstore: BaseStore) -> None:\n",
    "    parent_docs = []\n",
    "    df = df.copy()\n",
    "    df[\"parent\"] = df.metadata.apply(itemgetter(\"parent\"))\n",
    "    for parent_id, group in df.groupby(\"parent\"):\n",
    "        sorted_group = group.iloc[\n",
    "            group.metadata.apply(itemgetter(\"section_index\")).argsort()\n",
    "        ]\n",
    "        text = \"\".join(sorted_group.metadata.apply(itemgetter(\"text\")))\n",
    "        metadata = {\n",
    "            k: sorted_group.iloc[0].metadata[k] for k in (\"title\", \"type\", \"url\")\n",
    "        }\n",
    "        text = metadata[\"title\"] + \"\\n\" + text\n",
    "        metadata[\"id\"] = parent_id\n",
    "        parent_docs.append(Document(page_content=text, metadata=metadata))\n",
    "    docstore.mset(((d.metadata[\"id\"], d) for d in parent_docs))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "01287610-9224-45c1-94c6-35b8c002bd49",
   "metadata": {},
   "source": [
    "## Retriever chunks\n",
    "\n",
    "As part of their embedding process, the Fleet AI team first chunked long documents before embedding them. This means the vectors correspond to sections of pages in the LangChain docs, not entire pages. By default, when we spin up a retriever from these embeddings, we'll be retrieving these embedded chunks.\n",
    "\n",
    "We will be using Fleet Context's `download_embeddings()` to grab Langchain's documentation embeddings. You can view all supported libraries' documentation at https://fleet.so/context."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "ad91e026-8f05-4868-8c03-34f7dd254263",
   "metadata": {},
   "outputs": [],
   "source": [
    "from context import download_embeddings\n",
    "\n",
    "df = download_embeddings(\"langchain\")\n",
    "vecstore_retriever = load_fleet_retriever(df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "519c6898-3ef7-4b0d-94e3-e60ac7da51a3",
   "metadata": {},
   "outputs": [],
   "source": [
    "vecstore_retriever.invoke(\"How does the multi vector retriever work\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "32bb9085",
   "metadata": {},
   "source": [
    "## Other packages\n",
    "\n",
    "You can download and use other embeddings from [this Dropbox link](https://www.dropbox.com/scl/fo/54t2e7fogtixo58pnlyub/h?rlkey=tne16wkssgf01jor0p1iqg6p9&dl=0)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b9ad8f2a-bcfc-4784-83c0-a4d0a88eec3a",
   "metadata": {},
   "source": [
    "## Retrieve parent docs\n",
    "\n",
    "The embeddings provided by Fleet AI contain metadata that indicates which embedding chunks correspond to the same original document page. If we'd like we can use this information to retrieve whole parent documents, and not just embedded chunks. Under the hood, we'll use a MultiVectorRetriever and a BaseStore object to search for relevant chunks and then map them to their parent document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "908d74da-7d63-49ed-bda5-91fc5d2f9568",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.storage import InMemoryStore\n",
    "\n",
    "parent_retriever = load_fleet_retriever(\n",
    "    \"https://www.dropbox.com/scl/fi/4rescpkrg9970s3huz47l/libraries_langchain_release.parquet?rlkey=283knw4wamezfwiidgpgptkep&dl=1\",\n",
    "    docstore=InMemoryStore(),\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "540a54c2-b9a0-475c-ada0-91c5f67dd0fa",
   "metadata": {},
   "outputs": [],
   "source": [
    "parent_retriever.invoke(\"How does the multi vector retriever work\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d42be550-87e3-49fe-9fb8-1e36d67b3f0b",
   "metadata": {},
   "source": [
    "## Putting it in a chain\n",
    "\n",
    "Let's try using our retrieval systems in a simple chain!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "28baece3-577e-4236-be5f-38db67b34352",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_core.output_parsers import StrOutputParser\n",
    "from langchain_core.prompts import ChatPromptTemplate\n",
    "from langchain_core.runnables import RunnablePassthrough\n",
    "from langchain_openai import ChatOpenAI\n",
    "\n",
    "prompt = ChatPromptTemplate.from_messages(\n",
    "    [\n",
    "        (\n",
    "            \"system\",\n",
    "            \"\"\"You are a great software engineer who is very familiar \\\n",
    "with Python. Given a user question or request about a new Python library called LangChain and \\\n",
    "parts of the LangChain documentation, answer the question or generate the requested code. \\\n",
    "Your answers must be accurate, should include code whenever possible, and should assume anything \\\n",
    "about LangChain which is note explicitly stated in the LangChain documentation. If the required \\\n",
    "information is not available, just say so.\n",
    "\n",
    "LangChain Documentation\n",
    "------------------\n",
    "\n",
    "{context}\"\"\",\n",
    "        ),\n",
    "        (\"human\", \"{question}\"),\n",
    "    ]\n",
    ")\n",
    "\n",
    "model = ChatOpenAI(model=\"gpt-3.5-turbo-16k\")\n",
    "\n",
    "chain = (\n",
    "    {\n",
    "        \"question\": RunnablePassthrough(),\n",
    "        \"context\": parent_retriever\n",
    "        | (lambda docs: \"\\n\\n\".join(d.page_content for d in docs)),\n",
    "    }\n",
    "    | prompt\n",
    "    | model\n",
    "    | StrOutputParser()\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f1b9f091-a5f9-4468-9d52-633bf3361f4e",
   "metadata": {},
   "outputs": [],
   "source": [
    "for chunk in chain.invoke(\n",
    "    \"How do I create a FAISS vector store retriever that returns 10 documents per search query\"\n",
    "):\n",
    "    print(chunk, end=\"\", flush=True)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}