langchain/docs/docs/how_to/extraction_examples.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "70403d4f-50c1-43f8-a7ea-a211167649a5",
   "metadata": {},
   "source": [
    "# How to use reference examples when doing extraction\n",
    "\n",
    "The quality of extractions can often be improved by providing reference examples to the LLM.\n",
    "\n",
    "Data extraction attempts to generate structured representations of information found in text and other unstructured or semi-structured formats. [Tool-calling](/docs/concepts#functiontool-calling) LLM features are often used in this context. This guide demonstrates how to build few-shot examples of tool calls to help steer the behavior of extraction and similar applications.\n",
    "\n",
    ":::{.callout-tip}\n",
    "While this guide focuses how to use examples with a tool calling model, this technique is generally applicable, and will work\n",
    "also with JSON more or prompt based techniques.\n",
    ":::\n",
    "\n",
    "LangChain implements a [tool-call attribute](https://api.python.langchain.com/en/latest/messages/langchain_core.messages.ai.AIMessage.html#langchain_core.messages.ai.AIMessage.tool_calls) on messages from LLMs that include tool calls. See our [how-to guide on tool calling](/docs/how_to/tool_calling) for more detail. To build reference examples for data extraction, we build a chat history containing a sequence of: \n",
    "\n",
    "- [HumanMessage](https://api.python.langchain.com/en/latest/messages/langchain_core.messages.human.HumanMessage.html) containing example inputs;\n",
    "- [AIMessage](https://api.python.langchain.com/en/latest/messages/langchain_core.messages.ai.AIMessage.html) containing example tool calls;\n",
    "- [ToolMessage](https://api.python.langchain.com/en/latest/messages/langchain_core.messages.tool.ToolMessage.html) containing example tool outputs.\n",
    "\n",
    "LangChain adopts this convention for structuring tool calls into conversation across LLM model providers.\n",
    "\n",
    "First we build a prompt template that includes a placeholder for these messages:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "89579144-bcb3-490a-8036-86a0a6bcd56b",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder\n",
    "\n",
    "# Define a custom prompt to provide instructions and any additional context.\n",
    "# 1) You can add examples into the prompt template to improve extraction quality\n",
    "# 2) Introduce additional parameters to take context into account (e.g., include metadata\n",
    "#    about the document from which the text was extracted.)\n",
    "prompt = ChatPromptTemplate.from_messages(\n",
    "    [\n",
    "        (\n",
    "            \"system\",\n",
    "            \"You are an expert extraction algorithm. \"\n",
    "            \"Only extract relevant information from the text. \"\n",
    "            \"If you do not know the value of an attribute asked \"\n",
    "            \"to extract, return null for the attribute's value.\",\n",
    "        ),\n",
    "        # ↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓\n",
    "        MessagesPlaceholder(\"examples\"),  # <-- EXAMPLES!\n",
    "        # ↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑\n",
    "        (\"human\", \"{text}\"),\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2484008c-ba1a-42a5-87a1-628a900de7fd",
   "metadata": {},
   "source": [
    "Test out the template:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "610c3025-ea63-4cd7-88bd-c8cbcb4d8a3f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "ChatPromptValue(messages=[SystemMessage(content=\"You are an expert extraction algorithm. Only extract relevant information from the text. If you do not know the value of an attribute asked to extract, return null for the attribute's value.\"), HumanMessage(content='testing 1 2 3'), HumanMessage(content='this is some text')])"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain_core.messages import (\n",
    "    HumanMessage,\n",
    ")\n",
    "\n",
    "prompt.invoke(\n",
    "    {\"text\": \"this is some text\", \"examples\": [HumanMessage(content=\"testing 1 2 3\")]}\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "368abd80-0cf0-41a7-8224-acf90dd6830d",
   "metadata": {},
   "source": [
    "## Define the schema\n",
    "\n",
    "Let's re-use the person schema from the [extraction tutorial](/docs/tutorials/extraction)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "d875a49a-d2cb-4b9e-b5bf-41073bc3905c",
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import List, Optional\n",
    "\n",
    "from langchain_core.pydantic_v1 import BaseModel, Field\n",
    "from langchain_openai import ChatOpenAI\n",
    "\n",
    "\n",
    "class Person(BaseModel):\n",
    "    \"\"\"Information about a person.\"\"\"\n",
    "\n",
    "    # ^ Doc-string for the entity Person.\n",
    "    # This doc-string is sent to the LLM as the description of the schema Person,\n",
    "    # and it can help to improve extraction results.\n",
    "\n",
    "    # Note that:\n",
    "    # 1. Each field is an `optional` -- this allows the model to decline to extract it!\n",
    "    # 2. Each field has a `description` -- this description is used by the LLM.\n",
    "    # Having a good description can help improve extraction results.\n",
    "    name: Optional[str] = Field(..., description=\"The name of the person\")\n",
    "    hair_color: Optional[str] = Field(\n",
    "        ..., description=\"The color of the peron's eyes if known\"\n",
    "    )\n",
    "    height_in_meters: Optional[str] = Field(..., description=\"Height in METERs\")\n",
    "\n",
    "\n",
    "class Data(BaseModel):\n",
    "    \"\"\"Extracted data about people.\"\"\"\n",
    "\n",
    "    # Creates a model so that we can extract multiple entities.\n",
    "    people: List[Person]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "96c42162-e4f6-4461-88fd-c76f5aab7e32",
   "metadata": {},
   "source": [
    "## Define reference examples\n",
    "\n",
    "Examples can be defined as a list of input-output pairs. \n",
    "\n",
    "Each example contains an example `input` text and an example `output` showing what should be extracted from the text.\n",
    "\n",
    ":::{.callout-important}\n",
    "This is a bit in the weeds, so feel free to skip.\n",
    "\n",
    "The format of the example needs to match the API used (e.g., tool calling or JSON mode etc.).\n",
    "\n",
    "Here, the formatted examples will match the format expected for the tool calling API since that's what we're using.\n",
    ":::"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "08356810-77ce-4e68-99d9-faa0326f2cee",
   "metadata": {},
   "outputs": [],
   "source": [
    "import uuid\n",
    "from typing import Dict, List, TypedDict\n",
    "\n",
    "from langchain_core.messages import (\n",
    "    AIMessage,\n",
    "    BaseMessage,\n",
    "    HumanMessage,\n",
    "    SystemMessage,\n",
    "    ToolMessage,\n",
    ")\n",
    "from langchain_core.pydantic_v1 import BaseModel, Field\n",
    "\n",
    "\n",
    "class Example(TypedDict):\n",
    "    \"\"\"A representation of an example consisting of text input and expected tool calls.\n",
    "\n",
    "    For extraction, the tool calls are represented as instances of pydantic model.\n",
    "    \"\"\"\n",
    "\n",
    "    input: str  # This is the example text\n",
    "    tool_calls: List[BaseModel]  # Instances of pydantic model that should be extracted\n",
    "\n",
    "\n",
    "def tool_example_to_messages(example: Example) -> List[BaseMessage]:\n",
    "    \"\"\"Convert an example into a list of messages that can be fed into an LLM.\n",
    "\n",
    "    This code is an adapter that converts our example to a list of messages\n",
    "    that can be fed into a chat model.\n",
    "\n",
    "    The list of messages per example corresponds to:\n",
    "\n",
    "    1) HumanMessage: contains the content from which content should be extracted.\n",
    "    2) AIMessage: contains the extracted information from the model\n",
    "    3) ToolMessage: contains confirmation to the model that the model requested a tool correctly.\n",
    "\n",
    "    The ToolMessage is required because some of the chat models are hyper-optimized for agents\n",
    "    rather than for an extraction use case.\n",
    "    \"\"\"\n",
    "    messages: List[BaseMessage] = [HumanMessage(content=example[\"input\"])]\n",
    "    tool_calls = []\n",
    "    for tool_call in example[\"tool_calls\"]:\n",
    "        tool_calls.append(\n",
    "            {\n",
    "                \"id\": str(uuid.uuid4()),\n",
    "                \"args\": tool_call.dict(),\n",
    "                # The name of the function right now corresponds\n",
    "                # to the name of the pydantic model\n",
    "                # This is implicit in the API right now,\n",
    "                # and will be improved over time.\n",
    "                \"name\": tool_call.__class__.__name__,\n",
    "            },\n",
    "        )\n",
    "    messages.append(AIMessage(content=\"\", tool_calls=tool_calls))\n",
    "    tool_outputs = example.get(\"tool_outputs\") or [\n",
    "        \"You have correctly called this tool.\"\n",
    "    ] * len(tool_calls)\n",
    "    for output, tool_call in zip(tool_outputs, tool_calls):\n",
    "        messages.append(ToolMessage(content=output, tool_call_id=tool_call[\"id\"]))\n",
    "    return messages"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "463aa282-51c4-42bf-9463-6ca3b2c08de6",
   "metadata": {},
   "source": [
    "Next let's define our examples and then convert them into message format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "7f59a745-5c81-4011-a4c5-a33ec1eca7ef",
   "metadata": {},
   "outputs": [],
   "source": [
    "examples = [\n",
    "    (\n",
    "        \"The ocean is vast and blue. It's more than 20,000 feet deep. There are many fish in it.\",\n",
    "        Person(name=None, height_in_meters=None, hair_color=None),\n",
    "    ),\n",
    "    (\n",
    "        \"Fiona traveled far from France to Spain.\",\n",
    "        Person(name=\"Fiona\", height_in_meters=None, hair_color=None),\n",
    "    ),\n",
    "]\n",
    "\n",
    "\n",
    "messages = []\n",
    "\n",
    "for text, tool_call in examples:\n",
    "    messages.extend(\n",
    "        tool_example_to_messages({\"input\": text, \"tool_calls\": [tool_call]})\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6fdbda30-e7e3-46b5-a54a-1769c580af93",
   "metadata": {},
   "source": [
    "Let's test out the prompt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "976bb7b8-09c4-4a3e-80df-49a483705c08",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "system: content=\"You are an expert extraction algorithm. Only extract relevant information from the text. If you do not know the value of an attribute asked to extract, return null for the attribute's value.\"\n",
      "human: content=\"The ocean is vast and blue. It's more than 20,000 feet deep. There are many fish in it.\"\n",
      "ai: content='' tool_calls=[{'name': 'Person', 'args': {'name': None, 'hair_color': None, 'height_in_meters': None}, 'id': 'b843ba77-4c9c-48ef-92a4-54e534f24521'}]\n",
      "tool: content='You have correctly called this tool.' tool_call_id='b843ba77-4c9c-48ef-92a4-54e534f24521'\n",
      "human: content='Fiona traveled far from France to Spain.'\n",
      "ai: content='' tool_calls=[{'name': 'Person', 'args': {'name': 'Fiona', 'hair_color': None, 'height_in_meters': None}, 'id': '46f00d6b-50e5-4482-9406-b07bb10340f6'}]\n",
      "tool: content='You have correctly called this tool.' tool_call_id='46f00d6b-50e5-4482-9406-b07bb10340f6'\n",
      "human: content='this is some text'\n"
     ]
    }
   ],
   "source": [
    "example_prompt = prompt.invoke({\"text\": \"this is some text\", \"examples\": messages})\n",
    "\n",
    "for message in example_prompt.messages:\n",
    "    print(f\"{message.type}: {message}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "47b0bbef-bc6b-4535-a8e2-5c84f09d5637",
   "metadata": {},
   "source": [
    "## Create an extractor\n",
    "\n",
    "Let's select an LLM. Because we are using tool-calling, we will need a model that supports a tool-calling feature. See [this table](/docs/integrations/chat) for available LLMs.\n",
    "\n",
    "```{=mdx}\n",
    "import ChatModelTabs from \"@theme/ChatModelTabs\";\n",
    "\n",
    "<ChatModelTabs\n",
    "  customVarName=\"llm\"\n",
    "  openaiParams={`model=\"gpt-4-0125-preview\", temperature=0`}\n",
    "/>\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "df2e1ee1-69e8-4c4d-b349-95f2e320317b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# | output: false\n",
    "# | echo: false\n",
    "\n",
    "from langchain_openai import ChatOpenAI\n",
    "\n",
    "llm = ChatOpenAI(model=\"gpt-4-0125-preview\", temperature=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ef21e8cb-c4df-4e12-9be7-37ac9d291d42",
   "metadata": {},
   "source": [
    "Following the [extraction tutorial](/docs/tutorials/extraction), we use the `.with_structured_output` method to structure model outputs according to the desired schema:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "dbfea43d-769b-42e9-a76f-ce722f7d6f93",
   "metadata": {},
   "outputs": [],
   "source": [
    "runnable = prompt | llm.with_structured_output(\n",
    "    schema=Data,\n",
    "    method=\"function_calling\",\n",
    "    include_raw=False,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "58a8139e-f201-4b8e-baf0-16a83e5fa987",
   "metadata": {},
   "source": [
    "## Without examples 😿\n",
    "\n",
    "Notice that even capable models can fail with a **very simple** test case!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "66545cab-af2a-40a4-9dc9-b4110458b7d3",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "people=[Person(name='earth', hair_color='null', height_in_meters='null')]\n",
      "people=[Person(name='earth', hair_color='null', height_in_meters='null')]\n",
      "people=[]\n",
      "people=[Person(name='earth', hair_color='null', height_in_meters='null')]\n",
      "people=[]\n"
     ]
    }
   ],
   "source": [
    "for _ in range(5):\n",
    "    text = \"The solar system is large, but earth has only 1 moon.\"\n",
    "    print(runnable.invoke({\"text\": text, \"examples\": []}))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "09840f17-ab26-4ea2-8a39-c747103804ec",
   "metadata": {},
   "source": [
    "## With examples 😻\n",
    "\n",
    "Reference examples helps to fix the failure!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "1c09d805-ec16-4123-aef9-6a5b59499b5c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "people=[]\n",
      "people=[]\n",
      "people=[]\n",
      "people=[]\n",
      "people=[]\n"
     ]
    }
   ],
   "source": [
    "for _ in range(5):\n",
    "    text = \"The solar system is large, but earth has only 1 moon.\"\n",
    "    print(runnable.invoke({\"text\": text, \"examples\": messages}))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3855cad5-dfee-4b42-ad35-b28d4d98902e",
   "metadata": {},
   "source": [
    "Note that we can see the few-shot examples as tool-calls in the [Langsmith trace](https://smith.langchain.com/public/4c436bc2-a1ce-440b-82f5-093947542e40/r).\n",
    "\n",
    "And we retain performance on a positive sample:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "a9b7a762-1b75-4f9f-b9d9-6732dd05802c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Data(people=[Person(name='Harrison', hair_color='black', height_in_meters=None)])"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "runnable.invoke(\n",
    "    {\n",
    "        \"text\": \"My name is Harrison. My hair is black.\",\n",
    "        \"examples\": messages,\n",
    "    }\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}