langchain/docs/docs/how_to/extraction_long_text.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "9e161a8a-fcf0-4d55-933e-da271ce28d7e",
   "metadata": {},
   "source": [
    "# How to handle long text when doing extraction\n",
    "\n",
    "When working with files, like PDFs, you're likely to encounter text that exceeds your language model's context window. To process this text, consider these strategies:\n",
    "\n",
    "1. **Change LLM** Choose a different LLM that supports a larger context window.\n",
    "2. **Brute Force** Chunk the document, and extract content from each chunk.\n",
    "3. **RAG** Chunk the document, index the chunks, and only extract content from a subset of chunks that look \"relevant\".\n",
    "\n",
    "Keep in mind that these strategies have different trade off and the best strategy likely depends on the application that you're designing!\n",
    "\n",
    "This guide demonstrates how to implement strategies 2 and 3."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "57969139-ad0a-487e-97d8-cb30e2af9742",
   "metadata": {},
   "source": [
    "## Set up\n",
    "\n",
    "We need some example data! Let's download an article about [cars from wikipedia](https://en.wikipedia.org/wiki/Car) and load it as a LangChain [Document](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "84460db2-36e1-4037-bfa6-2a11883c2ba5",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "import requests\n",
    "from langchain_community.document_loaders import BSHTMLLoader\n",
    "\n",
    "# Download the content\n",
    "response = requests.get(\"https://en.wikipedia.org/wiki/Car\")\n",
    "# Write it to a file\n",
    "with open(\"car.html\", \"w\", encoding=\"utf-8\") as f:\n",
    "    f.write(response.text)\n",
    "# Load it with an HTML parser\n",
    "loader = BSHTMLLoader(\"car.html\")\n",
    "document = loader.load()[0]\n",
    "# Clean up code\n",
    "# Replace consecutive new lines with a single new line\n",
    "document.page_content = re.sub(\"\\n\\n+\", \"\\n\", document.page_content)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "fcb6917b-123d-4630-a0ce-ed8b293d482d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "79174\n"
     ]
    }
   ],
   "source": [
    "print(len(document.page_content))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "af3ffb8d-587a-4370-886a-e56e617bcb9c",
   "metadata": {},
   "source": [
    "## Define the schema\n",
    "\n",
    "Following the [extraction tutorial](/docs/tutorials/extraction), we will use Pydantic to define the schema of information we wish to extract. In this case, we will extract a list of \"key developments\" (e.g., important historical events) that include a year and description.\n",
    "\n",
    "Note that we also include an `evidence` key and instruct the model to provide in verbatim the relevant sentences of text from the article. This allows us to compare the extraction results to (the model's reconstruction of) text from the original document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "a3b288ed-87a6-4af0-aac8-20921dc370d4",
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import List, Optional\n",
    "\n",
    "from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder\n",
    "from langchain_core.pydantic_v1 import BaseModel, Field\n",
    "\n",
    "\n",
    "class KeyDevelopment(BaseModel):\n",
    "    \"\"\"Information about a development in the history of cars.\"\"\"\n",
    "\n",
    "    year: int = Field(\n",
    "        ..., description=\"The year when there was an important historic development.\"\n",
    "    )\n",
    "    description: str = Field(\n",
    "        ..., description=\"What happened in this year? What was the development?\"\n",
    "    )\n",
    "    evidence: str = Field(\n",
    "        ...,\n",
    "        description=\"Repeat in verbatim the sentence(s) from which the year and description information were extracted\",\n",
    "    )\n",
    "\n",
    "\n",
    "class ExtractionData(BaseModel):\n",
    "    \"\"\"Extracted information about key developments in the history of cars.\"\"\"\n",
    "\n",
    "    key_developments: List[KeyDevelopment]\n",
    "\n",
    "\n",
    "# Define a custom prompt to provide instructions and any additional context.\n",
    "# 1) You can add examples into the prompt template to improve extraction quality\n",
    "# 2) Introduce additional parameters to take context into account (e.g., include metadata\n",
    "#    about the document from which the text was extracted.)\n",
    "prompt = ChatPromptTemplate.from_messages(\n",
    "    [\n",
    "        (\n",
    "            \"system\",\n",
    "            \"You are an expert at identifying key historic development in text. \"\n",
    "            \"Only extract important historic developments. Extract nothing if no important information can be found in the text.\",\n",
    "        ),\n",
    "        (\"human\", \"{text}\"),\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3909e22e-8a00-4f3d-bbf2-4762a0558af3",
   "metadata": {},
   "source": [
    "## Create an extractor\n",
    "\n",
    "Let's select an LLM. Because we are using tool-calling, we will need a model that supports a tool-calling feature. See [this table](/docs/integrations/chat) for available LLMs.\n",
    "\n",
    "```{=mdx}\n",
    "import ChatModelTabs from \"@theme/ChatModelTabs\";\n",
    "\n",
    "<ChatModelTabs\n",
    "  customVarName=\"llm\"\n",
    "  openaiParams={`model=\"gpt-4-0125-preview\", temperature=0`}\n",
    "/>\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "109f4f05-d0ff-431d-93d9-8f5aa34979a6",
   "metadata": {},
   "outputs": [],
   "source": [
    "# | output: false\n",
    "# | echo: false\n",
    "\n",
    "from langchain_openai import ChatOpenAI\n",
    "\n",
    "llm = ChatOpenAI(model=\"gpt-4-0125-preview\", temperature=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "aa4ae224-6d3d-4fe2-b210-7db19a9fe580",
   "metadata": {},
   "outputs": [],
   "source": [
    "extractor = prompt | llm.with_structured_output(\n",
    "    schema=ExtractionData,\n",
    "    include_raw=False,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13aebafb-26b5-42b2-ae8e-9c05cd56e5c5",
   "metadata": {},
   "source": [
    "## Brute force approach\n",
    "\n",
    "Split the documents into chunks such that each chunk fits into the context window of the LLMs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "27b8a373-14b3-45ea-8bf5-9749122ad927",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_text_splitters import TokenTextSplitter\n",
    "\n",
    "text_splitter = TokenTextSplitter(\n",
    "    # Controls the size of each chunk\n",
    "    chunk_size=2000,\n",
    "    # Controls overlap between chunks\n",
    "    chunk_overlap=20,\n",
    ")\n",
    "\n",
    "texts = text_splitter.split_text(document.page_content)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5b43d7e0-3c85-4d97-86c7-e8c984b60b0a",
   "metadata": {},
   "source": [
    "Use [batch](https://api.python.langchain.com/en/latest/runnables/langchain_core.runnables.base.Runnable.html) functionality to run the extraction in **parallel** across each chunk! \n",
    "\n",
    ":::{.callout-tip}\n",
    "You can often use .batch() to parallelize the extractions! `.batch` uses a threadpool under the hood to help you parallelize workloads.\n",
    "\n",
    "If your model is exposed via an API, this will likely speed up your extraction flow!\n",
    ":::"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "6ba766b5-8d6c-48e6-8d69-f391a66b65d2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Limit just to the first 3 chunks\n",
    "# so the code can be re-run quickly\n",
    "first_few = texts[:3]\n",
    "\n",
    "extractions = extractor.batch(\n",
    "    [{\"text\": text} for text in first_few],\n",
    "    {\"max_concurrency\": 5},  # limit the concurrency by passing max concurrency!\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "67da8904-e927-406b-a439-2a16f6087ccf",
   "metadata": {},
   "source": [
    "### Merge results\n",
    "\n",
    "After extracting data from across the chunks, we'll want to merge the extractions together."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "c3f77470-ce6c-477f-8957-650913218632",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[KeyDevelopment(year=1966, description='The Toyota Corolla began production, becoming the best-selling series of automobile in history.', evidence='The Toyota Corolla, which has been in production since 1966, is the best-selling series of automobile in history.'),\n",
       " KeyDevelopment(year=1769, description='Nicolas-Joseph Cugnot built the first steam-powered road vehicle.', evidence='The French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle in 1769.'),\n",
       " KeyDevelopment(year=1808, description='François Isaac de Rivaz designed and constructed the first internal combustion-powered automobile.', evidence='the Swiss inventor François Isaac de Rivaz designed and constructed the first internal combustion-powered automobile in 1808.'),\n",
       " KeyDevelopment(year=1886, description='Carl Benz patented his Benz Patent-Motorwagen, inventing the modern car.', evidence='The modern car—a practical, marketable automobile for everyday use—was invented in 1886, when the German inventor Carl Benz patented his Benz Patent-Motorwagen.'),\n",
       " KeyDevelopment(year=1908, description='Ford Model T, one of the first cars affordable by the masses, began production.', evidence='One of the first cars affordable by the masses was the Ford Model T, begun in 1908, an American car manufactured by the Ford Motor Company.'),\n",
       " KeyDevelopment(year=1888, description=\"Bertha Benz undertook the first road trip by car to prove the road-worthiness of her husband's invention.\", evidence=\"In August 1888, Bertha Benz, the wife of Carl Benz, undertook the first road trip by car, to prove the road-worthiness of her husband's invention.\"),\n",
       " KeyDevelopment(year=1896, description='Benz designed and patented the first internal-combustion flat engine, called boxermotor.', evidence='In 1896, Benz designed and patented the first internal-combustion flat engine, called boxermotor.'),\n",
       " KeyDevelopment(year=1897, description='Nesselsdorfer Wagenbau produced the Präsident automobil, one of the first factory-made cars in the world.', evidence='The first motor car in central Europe and one of the first factory-made cars in the world, was produced by Czech company Nesselsdorfer Wagenbau (later renamed to Tatra) in 1897, the Präsident automobil.'),\n",
       " KeyDevelopment(year=1890, description='Daimler Motoren Gesellschaft (DMG) was founded by Daimler and Maybach in Cannstatt.', evidence='Daimler and Maybach founded Daimler Motoren Gesellschaft (DMG) in Cannstatt in 1890.'),\n",
       " KeyDevelopment(year=1891, description='Auguste Doriot and Louis Rigoulot completed the longest trip by a petrol-driven vehicle with a Daimler powered Peugeot Type 3.', evidence='In 1891, Auguste Doriot and his Peugeot colleague Louis Rigoulot completed the longest trip by a petrol-driven vehicle when their self-designed and built Daimler powered Peugeot Type 3 completed 2,100 kilometres (1,300 mi) from Valentigney to Paris and Brest and back again.')]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "key_developments = []\n",
    "\n",
    "for extraction in extractions:\n",
    "    key_developments.extend(extraction.key_developments)\n",
    "\n",
    "key_developments[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "48afd4a7-abcd-48b4-8ff1-6ca485f529e3",
   "metadata": {},
   "source": [
    "## RAG based approach\n",
    "\n",
    "Another simple idea is to chunk up the text, but instead of extracting information from every chunk, just focus on the the most relevant chunks.\n",
    "\n",
    ":::{.callout-caution}\n",
    "It can be difficult to identify which chunks are relevant.\n",
    "\n",
    "For example, in the `car` article we're using here, most of the article contains key development information. So by using\n",
    "**RAG**, we'll likely be throwing out a lot of relevant information.\n",
    "\n",
    "We suggest experimenting with your use case and determining whether this approach works or not.\n",
    ":::\n",
    "\n",
    "To implement the RAG based approach: \n",
    "\n",
    "1. Chunk up your document(s) and index them (e.g., in a vectorstore);\n",
    "2. Prepend the `extractor` chain with a retrieval step using the vectorstore.\n",
    "\n",
    "Here's a simple example that relies on the `FAISS` vectorstore."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "aaf37c82-625b-4fa1-8e88-73303f08ac16",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_community.vectorstores import FAISS\n",
    "from langchain_core.documents import Document\n",
    "from langchain_core.runnables import RunnableLambda\n",
    "from langchain_openai import OpenAIEmbeddings\n",
    "from langchain_text_splitters import CharacterTextSplitter\n",
    "\n",
    "texts = text_splitter.split_text(document.page_content)\n",
    "vectorstore = FAISS.from_texts(texts, embedding=OpenAIEmbeddings())\n",
    "\n",
    "retriever = vectorstore.as_retriever(\n",
    "    search_kwargs={\"k\": 1}\n",
    ")  # Only extract from first document"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "013ecad9-f80f-477c-b954-494b46a02a07",
   "metadata": {},
   "source": [
    "In this case the RAG extractor is only looking at the top document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "47aad00b-7013-4f7f-a1b0-02ef269093bf",
   "metadata": {},
   "outputs": [],
   "source": [
    "rag_extractor = {\n",
    "    \"text\": retriever | (lambda docs: docs[0].page_content)  # fetch content of top doc\n",
    "} | extractor"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "68f2de01-0cd8-456e-a959-db236189d41b",
   "metadata": {},
   "outputs": [],
   "source": [
    "results = rag_extractor.invoke(\"Key developments associated with cars\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "1788e2d6-77bb-417f-827c-eb96c035164e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "year=1869 description='Mary Ward became one of the first documented car fatalities in Parsonstown, Ireland.' evidence='Mary Ward became one of the first documented car fatalities in 1869 in Parsonstown, Ireland,'\n",
      "year=1899 description=\"Henry Bliss one of the US's first pedestrian car casualties in New York City.\" evidence=\"Henry Bliss one of the US's first pedestrian car casualties in 1899 in New York City.\"\n",
      "year=2030 description='All fossil fuel vehicles will be banned in Amsterdam.' evidence='all fossil fuel vehicles will be banned in Amsterdam from 2030.'\n"
     ]
    }
   ],
   "source": [
    "for key_development in results.key_developments:\n",
    "    print(key_development)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cf36e626-cf5d-4324-ba29-9bd602be9b97",
   "metadata": {},
   "source": [
    "## Common issues\n",
    "\n",
    "Different methods have their own pros and cons related to cost, speed, and accuracy.\n",
    "\n",
    "Watch out for these issues:\n",
    "\n",
    "* Chunking content means that the LLM can fail to extract information if the information is spread across multiple chunks.\n",
    "* Large chunk overlap may cause the same information to be extracted twice, so be prepared to de-duplicate!\n",
    "* LLMs can make up data. If looking for a single fact across a large text and using a brute force approach, you may end up getting more made up data."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}