langchain/docs/docs/how_to/HTML_section_aware_splitter...

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "c95fcd15cd52c944",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "# How to split by HTML sections\n",
    "## Description and motivation\n",
    "Similar in concept to the [HTMLHeaderTextSplitter](/docs/how_to/HTML_header_metadata_splitter), the `HTMLSectionSplitter` is a \"structure-aware\" chunker that splits text at the element level and adds metadata for each header \"relevant\" to any given chunk.\n",
    "\n",
    "It can return chunks element by element or combine elements with the same metadata, with the objectives of (a) keeping related text grouped (more or less) semantically and (b) preserving context-rich information encoded in document structures.\n",
    "\n",
    "Use `xslt_path` to provide an absolute path to transform the HTML so that it can detect sections based on provided tags. The default is to use the `converting_to_header.xslt` file in the `data_connection/document_transformers` directory. This is for converting the html to a format/layout that is easier to detect sections. For example, `span` based on their font size can be converted to header tags to be detected as a section.\n",
    "\n",
    "## Usage examples\n",
    "### 1) How to split HTML strings:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "initial_id",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-10-02T18:57:49.208965400Z",
     "start_time": "2023-10-02T18:57:48.899756Z"
    },
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='Foo \\n Some intro text about Foo.', metadata={'Header 1': 'Foo'}),\n",
       " Document(page_content='Bar main section \\n Some intro text about Bar. \\n Bar subsection 1 \\n Some text about the first subtopic of Bar. \\n Bar subsection 2 \\n Some text about the second subtopic of Bar.', metadata={'Header 2': 'Bar main section'}),\n",
       " Document(page_content='Baz \\n Some text about Baz \\n \\n \\n Some concluding text about Foo', metadata={'Header 2': 'Baz'})]"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain_text_splitters import HTMLSectionSplitter\n",
    "\n",
    "html_string = \"\"\"\n",
    "    <!DOCTYPE html>\n",
    "    <html>\n",
    "    <body>\n",
    "        <div>\n",
    "            <h1>Foo</h1>\n",
    "            <p>Some intro text about Foo.</p>\n",
    "            <div>\n",
    "                <h2>Bar main section</h2>\n",
    "                <p>Some intro text about Bar.</p>\n",
    "                <h3>Bar subsection 1</h3>\n",
    "                <p>Some text about the first subtopic of Bar.</p>\n",
    "                <h3>Bar subsection 2</h3>\n",
    "                <p>Some text about the second subtopic of Bar.</p>\n",
    "            </div>\n",
    "            <div>\n",
    "                <h2>Baz</h2>\n",
    "                <p>Some text about Baz</p>\n",
    "            </div>\n",
    "            <br>\n",
    "            <p>Some concluding text about Foo</p>\n",
    "        </div>\n",
    "    </body>\n",
    "    </html>\n",
    "\"\"\"\n",
    "\n",
    "headers_to_split_on = [(\"h1\", \"Header 1\"), (\"h2\", \"Header 2\")]\n",
    "\n",
    "html_splitter = HTMLSectionSplitter(headers_to_split_on)\n",
    "html_header_splits = html_splitter.split_text(html_string)\n",
    "html_header_splits"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e29b4aade2a0070c",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "### 2) How to constrain chunk sizes:\n",
    "\n",
    "`HTMLSectionSplitter` can be used with other text splitters as part of a chunking pipeline. Internally, it uses the `RecursiveCharacterTextSplitter` when the section size is larger than the chunk size. It also considers the font size of the text to determine whether it is a section or not based on the determined font size threshold."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "6ada8ea093ea0475",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-10-02T18:57:51.016141300Z",
     "start_time": "2023-10-02T18:57:50.647495400Z"
    },
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='Foo \\n Some intro text about Foo.', metadata={'Header 1': 'Foo'}),\n",
       " Document(page_content='Bar main section \\n Some intro text about Bar.', metadata={'Header 2': 'Bar main section'}),\n",
       " Document(page_content='Bar subsection 1 \\n Some text about the first subtopic of Bar.', metadata={'Header 3': 'Bar subsection 1'}),\n",
       " Document(page_content='Bar subsection 2 \\n Some text about the second subtopic of Bar.', metadata={'Header 3': 'Bar subsection 2'}),\n",
       " Document(page_content='Baz \\n Some text about Baz \\n \\n \\n Some concluding text about Foo', metadata={'Header 2': 'Baz'})]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "\n",
    "html_string = \"\"\"\n",
    "    <!DOCTYPE html>\n",
    "    <html>\n",
    "    <body>\n",
    "        <div>\n",
    "            <h1>Foo</h1>\n",
    "            <p>Some intro text about Foo.</p>\n",
    "            <div>\n",
    "                <h2>Bar main section</h2>\n",
    "                <p>Some intro text about Bar.</p>\n",
    "                <h3>Bar subsection 1</h3>\n",
    "                <p>Some text about the first subtopic of Bar.</p>\n",
    "                <h3>Bar subsection 2</h3>\n",
    "                <p>Some text about the second subtopic of Bar.</p>\n",
    "            </div>\n",
    "            <div>\n",
    "                <h2>Baz</h2>\n",
    "                <p>Some text about Baz</p>\n",
    "            </div>\n",
    "            <br>\n",
    "            <p>Some concluding text about Foo</p>\n",
    "        </div>\n",
    "    </body>\n",
    "    </html>\n",
    "\"\"\"\n",
    "\n",
    "headers_to_split_on = [\n",
    "    (\"h1\", \"Header 1\"),\n",
    "    (\"h2\", \"Header 2\"),\n",
    "    (\"h3\", \"Header 3\"),\n",
    "    (\"h4\", \"Header 4\"),\n",
    "]\n",
    "\n",
    "html_splitter = HTMLSectionSplitter(headers_to_split_on)\n",
    "\n",
    "html_header_splits = html_splitter.split_text(html_string)\n",
    "\n",
    "chunk_size = 500\n",
    "chunk_overlap = 30\n",
    "text_splitter = RecursiveCharacterTextSplitter(\n",
    "    chunk_size=chunk_size, chunk_overlap=chunk_overlap\n",
    ")\n",
    "\n",
    "# Split\n",
    "splits = text_splitter.split_documents(html_header_splits)\n",
    "splits"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}