{ "cells": [ { "cell_type": "markdown", "id": "1ad7250ddd99fba9", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "source": [ "# Databricks Vector Search\n", "\n", ">[Databricks Vector Search](https://docs.databricks.com/en/generative-ai/vector-search.html) is a serverless similarity search engine that allows you to store a vector representation of your data, including metadata, in a vector database. With Vector Search, you can create auto-updating vector search indexes from Delta tables managed by Unity Catalog and query them with a simple API to return the most similar vectors.\n", "\n", "\n", "In this walkthrough, we'll demo the `SelfQueryRetriever` with a Databricks Vector Search index." ] }, { "cell_type": "markdown", "id": "209652d4ab38ba7f", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "source": [ "## Creating a Databricks vector store index\n", "First we'll create a Databricks vector store index and seed it with some data. We've created a small demo set of documents that contain summaries of movies.\n", "\n", "**Note:** The self-query retriever requires you to have `lark` installed (`pip install lark`) along with integration-specific requirements." ] }, { "cell_type": "code", "execution_count": 1, "id": "b68da3303b0625f2", "metadata": { "ExecuteTime": { "end_time": "2024-03-29T02:39:28.887634Z", "start_time": "2024-03-29T02:39:27.277978Z" }, "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install --upgrade --quiet langchain langchain-community langchain-core databricks-vectorsearch langchain-openai tiktoken lark" ] }, { "cell_type": "markdown", "id": "a1113af6008f3f3d", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "source": [ "We want to use `OpenAIEmbeddings`, so we need an OpenAI API key. The next cell also prompts for the Databricks workspace host and a personal access token." 
] }, { "cell_type": "code", "execution_count": 2, "id": "c243e15bcf72d539", "metadata": { "ExecuteTime": { "end_time": "2024-03-29T02:40:59.788206Z", "start_time": "2024-03-29T02:40:59.783798Z" }, "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdin", "output_type": "stream", "text": [ "OpenAI API Key: ········\n", "Databricks host: ········\n", "Databricks token: ········\n" ] } ], "source": [ "import getpass\n", "import os\n", "\n", "os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")\n", "databricks_host = getpass.getpass(\"Databricks host:\")\n", "databricks_token = getpass.getpass(\"Databricks token:\")" ] }, { "cell_type": "code", "execution_count": 13, "id": "fd0c70c0be7d7130", "metadata": { "ExecuteTime": { "end_time": "2024-03-29T02:42:28.467682Z", "start_time": "2024-03-29T02:42:21.255335Z" }, "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[NOTICE] Using a Personal Authentication Token (PAT). Recommended for development only. For improved performance, please use Service Principal based authentication. 
To disable this message, pass disable_notice=True to VectorSearchClient().\n" ] } ], "source": [ "from databricks.vector_search.client import VectorSearchClient\n", "from langchain_openai import OpenAIEmbeddings\n", "\n", "embeddings = OpenAIEmbeddings()\n", "# Embed a sample string to discover the embedding dimension\n", "emb_dim = len(embeddings.embed_query(\"hello\"))\n", "\n", "vector_search_endpoint_name = \"vector_search_demo_endpoint\"\n", "\n", "\n", "vsc = VectorSearchClient(\n", " workspace_url=databricks_host, personal_access_token=databricks_token\n", ")\n", "vsc.create_endpoint(name=vector_search_endpoint_name, endpoint_type=\"STANDARD\")" ] }, { "cell_type": "code", "execution_count": 14, "id": "3ead3943-7dd6-448c-bead-01157a000221", "metadata": {}, "outputs": [], "source": [ "index_name = \"udhay_demo.10x.demo_index\"\n", "\n", "index = vsc.create_direct_access_index(\n", " endpoint_name=vector_search_endpoint_name,\n", " index_name=index_name,\n", " primary_key=\"id\",\n", " embedding_dimension=emb_dim,\n", " embedding_vector_column=\"text_vector\",\n", " schema={\n", " \"id\": \"string\",\n", " \"page_content\": \"string\",\n", " \"year\": \"int\",\n", " \"rating\": \"float\",\n", " \"genre\": \"string\",\n", " \"text_vector\": \"array<float>\",\n", " },\n", ")\n", "\n", "index.describe()" ] }, { "cell_type": "code", "execution_count": 15, "id": "3e62fc39-51d9-4757-a449-f543638b3cd1", "metadata": {}, "outputs": [], "source": [ "index = vsc.get_index(endpoint_name=vector_search_endpoint_name, index_name=index_name)\n", "\n", "index.describe()" ] }, { "cell_type": "code", "execution_count": 7, "id": "13863677-8123-4b36-82bc-2c28ee2a90fb", "metadata": {}, "outputs": [], "source": [ "from langchain_core.documents import Document\n", "\n", "docs = [\n", " Document(\n", " page_content=\"A bunch of scientists bring back dinosaurs and mayhem breaks loose\",\n", " metadata={\"id\": 1, \"year\": 1993, \"rating\": 7.7, \"genre\": \"action\"},\n", " ),\n", " Document(\n", " page_content=\"Leo DiCaprio gets lost in a dream within a 
dream within a dream within a ...\",\n", " metadata={\"id\": 2, \"year\": 2010, \"genre\": \"thriller\", \"rating\": 8.2},\n", " ),\n", " Document(\n", " page_content=\"A bunch of normal-sized women are supremely wholesome and some men pine after them\",\n", " metadata={\"id\": 3, \"year\": 2019, \"rating\": 8.3, \"genre\": \"drama\"},\n", " ),\n", " Document(\n", " page_content=\"Three men walk into the Zone, three men walk out of the Zone\",\n", " metadata={\"id\": 4, \"year\": 1979, \"rating\": 9.9, \"genre\": \"science fiction\"},\n", " ),\n", " Document(\n", " page_content=\"A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea\",\n", " metadata={\"id\": 5, \"year\": 2006, \"genre\": \"thriller\", \"rating\": 9.0},\n", " ),\n", " Document(\n", " page_content=\"Toys come alive and have a blast doing so\",\n", " metadata={\"id\": 6, \"year\": 1995, \"genre\": \"animated\", \"rating\": 9.3},\n", " ),\n", "]" ] }, { "cell_type": "code", "execution_count": 16, "id": "6fdc8f55-5b4c-4506-97ac-59d9b9ef8ffc", "metadata": {}, "outputs": [], "source": [ "from langchain_community.vectorstores import DatabricksVectorSearch\n", "\n", "vector_store = DatabricksVectorSearch(\n", " index,\n", " text_column=\"page_content\",\n", " embedding=embeddings,\n", " columns=[\"year\", \"rating\", \"genre\"],\n", ")" ] }, { "cell_type": "code", "execution_count": 10, "id": "826375af-3fd7-4d41-9c7b-c273653c46b6", "metadata": {}, "outputs": [], "source": [ "vector_store.add_documents(docs)" ] }, { "cell_type": "markdown", "id": "3810b731a981a957", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "source": [ "## Creating our self-querying retriever\n", "Now we can instantiate our retriever. To do this we'll need to provide some information upfront about the metadata fields that our documents support and a short description of the document contents." 
] }, { "cell_type": "code", "execution_count": 17, "id": "7095b68ea997468c", "metadata": { "ExecuteTime": { "end_time": "2024-03-29T02:42:37.901230Z", "start_time": "2024-03-29T02:42:36.836827Z" }, "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "from langchain.chains.query_constructor.base import AttributeInfo\n", "from langchain.retrievers.self_query.base import SelfQueryRetriever\n", "from langchain_openai import OpenAI\n", "\n", "metadata_field_info = [\n", " AttributeInfo(\n", " name=\"genre\",\n", " description=\"The genre of the movie\",\n", " type=\"string\",\n", " ),\n", " AttributeInfo(\n", " name=\"year\",\n", " description=\"The year the movie was released\",\n", " type=\"integer\",\n", " ),\n", " AttributeInfo(\n", " name=\"rating\", description=\"A 1-10 rating for the movie\", type=\"float\"\n", " ),\n", "]\n", "document_content_description = \"Brief summary of a movie\"\n", "llm = OpenAI(temperature=0)\n", "retriever = SelfQueryRetriever.from_llm(\n", " llm, vector_store, document_content_description, metadata_field_info, verbose=True\n", ")" ] }, { "cell_type": "markdown", "id": "65ff2054be9d5236", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "source": [ "## Test it out\n", "And now we can try actually using our retriever!\n" ] }, { "cell_type": "code", "execution_count": 18, "id": "267e2a68f26505b1", "metadata": { "ExecuteTime": { "end_time": "2024-03-29T02:42:51.526470Z", "start_time": "2024-03-29T02:42:48.328191Z" }, "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "[Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'year': 1993.0, 'rating': 7.7, 'genre': 'action', 'id': 1.0}),\n", " Document(page_content='Toys come alive and have a blast doing so', metadata={'year': 1995.0, 'rating': 9.3, 'genre': 'animated', 'id': 6.0}),\n", " Document(page_content='Three men walk into 
the Zone, three men walk out of the Zone', metadata={'year': 1979.0, 'rating': 9.9, 'genre': 'science fiction', 'id': 4.0}),\n", " Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'year': 2006.0, 'rating': 9.0, 'genre': 'thriller', 'id': 5.0})]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This example only specifies a relevant query\n", "retriever.invoke(\"What are some movies about dinosaurs\")" ] }, { "cell_type": "code", "execution_count": 19, "id": "3afd98ca20782dda", "metadata": { "ExecuteTime": { "end_time": "2024-03-29T02:42:55.179002Z", "start_time": "2024-03-29T02:42:53.057022Z" }, "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "[Document(page_content='Toys come alive and have a blast doing so', metadata={'year': 1995.0, 'rating': 9.3, 'genre': 'animated', 'id': 6.0}),\n", " Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'year': 1979.0, 'rating': 9.9, 'genre': 'science fiction', 'id': 4.0})]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This example specifies a filter\n", "retriever.invoke(\"What are some highly rated movies (above 9)?\")" ] }, { "cell_type": "code", "execution_count": 20, "id": "9974f641e11abfe8", "metadata": { "ExecuteTime": { "end_time": "2024-03-29T02:42:58.472620Z", "start_time": "2024-03-29T02:42:56.131594Z" }, "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "[Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'year': 2006.0, 'rating': 9.0, 'genre': 'thriller', 'id': 5.0}),\n", " Document(page_content='Leo DiCaprio gets lost in a dream within a dream within a dream within a 
...', metadata={'year': 2010.0, 'rating': 8.2, 'genre': 'thriller', 'id': 2.0})]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This example specifies both a relevant query and a filter\n", "retriever.invoke(\"What are the thriller movies that are highly rated?\")" ] }, { "cell_type": "code", "execution_count": 21, "id": "edd31040-ede0-40bb-bfcd-962118df4ffb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'year': 1993.0, 'rating': 7.7, 'genre': 'action', 'id': 1.0})]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This example specifies a query and composite filter\n", "retriever.invoke(\n", " \"What's a movie after 1990 but before 2005 that's all about dinosaurs, \\\n", " and preferably has a lot of action\"\n", ")" ] }, { "cell_type": "markdown", "id": "be593d3a6c508517", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "source": [ "## Filter k\n", "\n", "We can also use the self query retriever to specify `k`: the number of documents to fetch.\n", "\n", "We can do this by passing `enable_limit=True` to the constructor." 
] }, { "cell_type": "code", "execution_count": 22, "id": "e255b69c937fa424", "metadata": { "ExecuteTime": { "end_time": "2024-03-29T02:43:02.779337Z", "start_time": "2024-03-29T02:43:02.759900Z" }, "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "retriever = SelfQueryRetriever.from_llm(\n", " llm,\n", " vector_store,\n", " document_content_description,\n", " metadata_field_info,\n", " verbose=True,\n", " enable_limit=True,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "45674137c7f8a9d", "metadata": { "ExecuteTime": { "end_time": "2024-03-29T02:43:07.357830Z", "start_time": "2024-03-29T02:43:04.854323Z" }, "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "retriever.invoke(\"What are two movies about dinosaurs?\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.16" } }, "nbformat": 4, "nbformat_minor": 5 }