{
"cells": [
{
"cell_type": "markdown",
"id": "655b8f55-2089-4733-8b09-35dea9580695",
"metadata": {},
"source": [
"# Google Vertex AI Vector Search\n",
"\n",
"This notebook shows how to use functionality related to the `Google Cloud Vertex AI Vector Search` vector database.\n",
"\n",
"> [Google Vertex AI Vector Search](https://cloud.google.com/vertex-ai/docs/vector-search/overview), formerly known as Vertex AI Matching Engine, provides the industry's leading high-scale low latency vector database. These vector databases are commonly referred to as vector similarity-matching or approximate nearest neighbor (ANN) services.\n",
"\n",
"**Note**: The LangChain API expects an endpoint and a deployed index to already exist. Index creation can take up to one hour.\n",
"\n",
"> To see how to create an index, refer to the section [Create Index and deploy it to an Endpoint](#create-index-and-deploy-it-to-an-endpoint). \n",
"If you already have an index deployed, skip to [Create Vector Store from texts](#create-vector-store-from-texts)."
]
},
{
"cell_type": "markdown",
"id": "aca99382",
"metadata": {},
"source": [
"## Create Index and deploy it to an Endpoint\n",
"- This section demonstrates creating a new index and deploying it to an endpoint"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "35b5f3c5",
"metadata": {},
"outputs": [],
"source": [
"# TODO : Set values as per your requirements\n",
"# Project and Storage Constants\n",
"PROJECT_ID = \"<my_project_id>\"\n",
"REGION = \"<my_region>\"\n",
"BUCKET = \"<my_gcs_bucket>\"\n",
"BUCKET_URI = f\"gs://{BUCKET}\"\n",
"\n",
"# textembedding-gecko@003 produces 768-dimensional embeddings.\n",
"# If you use a different embedding model, adjust DIMENSIONS to match.\n",
"DIMENSIONS = 768\n",
"\n",
"# Index Constants\n",
"DISPLAY_NAME = \"<my_matching_engine_index_id>\"\n",
"DEPLOYED_INDEX_ID = \"<my_matching_engine_endpoint_id>\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ce74ea7e",
"metadata": {},
"outputs": [],
"source": [
"# Create a bucket.\n",
"! gsutil mb -l $REGION -p $PROJECT_ID $BUCKET_URI"
]
},
{
"cell_type": "markdown",
"id": "28d93078",
"metadata": {},
"source": [
"### Use [VertexAIEmbeddings](https://python.langchain.com/docs/integrations/text_embedding/google_vertex_ai_palm/) as the embeddings model"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dfa92a08",
"metadata": {},
"outputs": [],
"source": [
"from google.cloud import aiplatform\n",
"from langchain_google_vertexai import VertexAIEmbeddings"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "58e5c762",
"metadata": {},
"outputs": [],
"source": [
"aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c795913e",
"metadata": {},
"outputs": [],
"source": [
"embedding_model = VertexAIEmbeddings(model_name=\"textembedding-gecko@003\")"
]
},
{
"cell_type": "markdown",
"id": "73c2e7b5",
"metadata": {},
"source": [
"### Create an empty Index "
]
},
{
"cell_type": "markdown",
"id": "5b347e21",
"metadata": {},
"source": [
"**Note:** When creating an index, you must set `index_update_method` to either `BATCH_UPDATE` or `STREAM_UPDATE`.\n",
"> A batch index is for when you want to update your index in a batch, with data that has been stored over a set amount of time, such as systems that are processed weekly or monthly. A streaming index is for when you want index data to be updated as new data is added to your datastore, for instance, if you have a bookstore and want to show new inventory online as soon as possible. Which type you choose is important, since setup and requirements are different.\n",
"\n",
"Refer to the [official documentation](https://cloud.google.com/vertex-ai/docs/vector-search/create-manage-index#create-index-batch) for more details on configuring indexes.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "37fdc7f1",
"metadata": {},
"outputs": [],
"source": [
"# NOTE: This operation can take up to 30 seconds\n",
"my_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(\n",
" display_name=DISPLAY_NAME,\n",
" dimensions=DIMENSIONS,\n",
" approximate_neighbors_count=150,\n",
" distance_measure_type=\"DOT_PRODUCT_DISTANCE\",\n",
" index_update_method=\"STREAM_UPDATE\", # allowed values BATCH_UPDATE , STREAM_UPDATE\n",
")"
]
},
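{
"cell_type": "markdown",
"id": "sketch-dot-product-md",
"metadata": {},
"source": [
"The `distance_measure_type=\"DOT_PRODUCT_DISTANCE\"` setting above means neighbors are ranked by the dot product between the query embedding and each stored embedding. As a rough illustration of that ranking only, the sketch below scores a few made-up 3-dimensional vectors exhaustively in plain Python. The names `toy_vectors`, `query`, and `dot` are hypothetical and not part of any library, and an approximate nearest neighbor index such as the tree-AH index is designed to avoid this kind of full scan."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-dot-product-code",
"metadata": {},
"outputs": [],
"source": [
"# Illustrative only: exhaustive dot-product ranking over toy vectors.\n",
"# All values and names here are made up for this sketch.\n",
"toy_vectors = {\n",
"    \"doc_a\": [1.0, 0.0, 0.0],\n",
"    \"doc_b\": [0.6, 0.8, 0.0],\n",
"    \"doc_c\": [0.0, 0.0, 1.0],\n",
"}\n",
"query = [0.8, 0.6, 0.0]\n",
"\n",
"\n",
"def dot(u, v):\n",
"    \"\"\"Return the dot product of two equal-length vectors.\"\"\"\n",
"    return sum(a * b for a, b in zip(u, v))\n",
"\n",
"\n",
"# A larger dot product means a closer match, so sort in descending order\n",
"ranked = sorted(\n",
"    toy_vectors, key=lambda name: dot(query, toy_vectors[name]), reverse=True\n",
")\n",
"ranked"
]
},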
{
"cell_type": "markdown",
"id": "1723d40a",
"metadata": {},
"source": [
"### Create an Endpoint"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f4059888",
"metadata": {},
"outputs": [],
"source": [
"# Create an endpoint\n",
"my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(\n",
" display_name=f\"{DISPLAY_NAME}-endpoint\", public_endpoint_enabled=True\n",
")"
]
},
{
"cell_type": "markdown",
"id": "43a85682",
"metadata": {},
"source": [
"### Deploy Index to the Endpoint"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a6582ec1",
"metadata": {},
"outputs": [],
"source": [
"# NOTE: This operation can take up to 20 minutes\n",
"my_index_endpoint = my_index_endpoint.deploy_index(\n",
" index=my_index, deployed_index_id=DEPLOYED_INDEX_ID\n",
")\n",
"\n",
"my_index_endpoint.deployed_indexes"
]
},
{
"cell_type": "markdown",
"id": "a9971578-0ae9-4809-9e80-e5f9d3dcc98a",
"metadata": {},
"source": [
"## Create Vector Store from texts"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f7c96da4-8d97-4f69-8c13-d2fcafc03b05",
"metadata": {},
"outputs": [],
"source": [
"from langchain_google_vertexai import (\n",
" VectorSearchVectorStore,\n",
" VectorSearchVectorStoreDatastore,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "b25a61eb",
"metadata": {},
"source": [
"*[Figure: Langchainassets.png]*"
]
},
{
"cell_type": "markdown",
"id": "97ac49ae",
"metadata": {},
"source": [
"### Create a simple vector store (without filters)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "58b70880-edd9-46f3-b769-f26c2bcc8395",
"metadata": {},
"outputs": [],
"source": [
"# Input texts\n",
"texts = [\n",
" \"The cat sat on\",\n",
" \"the mat.\",\n",
" \"I like to\",\n",
" \"eat pizza for\",\n",
" \"dinner.\",\n",
" \"The sun sets\",\n",
" \"in the west.\",\n",
"]\n",
"\n",
"# Create a Vector Store\n",
"vector_store = VectorSearchVectorStore.from_components(\n",
" project_id=PROJECT_ID,\n",
" region=REGION,\n",
" gcs_bucket_name=BUCKET,\n",
" index_id=my_index.name,\n",
" endpoint_id=my_index_endpoint.name,\n",
" embedding=embedding_model,\n",
" stream_update=True,\n",
")\n",
"\n",
"# Add vectors and mapped text chunks to your vector store\n",
"vector_store.add_texts(texts=texts)"
]
},
{
"cell_type": "markdown",
"id": "080cbbdc",
"metadata": {},
"source": [
"### Optional: You can also create vectors and store chunks in a Datastore"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "97ef5dfd",
"metadata": {},
"outputs": [],
"source": [
"# NOTE: This operation can take up to 20 minutes\n",
"vector_store = VectorSearchVectorStoreDatastore.from_components(\n",
" project_id=PROJECT_ID,\n",
" region=REGION,\n",
" index_id=my_index.name,\n",
" endpoint_id=my_index_endpoint.name,\n",
" embedding=embedding_model,\n",
" stream_update=True,\n",
")\n",
"\n",
"vector_store.add_texts(texts=texts, is_complete_overwrite=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b7c65716",
"metadata": {},
"outputs": [],
"source": [
"# Try running a similarity search\n",
"vector_store.similarity_search(\"pizza\")"
]
},
{
"cell_type": "markdown",
"id": "65d92635",
"metadata": {},
"source": [
"### Create vectorstore with metadata filters"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "986951f7",
"metadata": {},
"outputs": [],
"source": [
"# Input text with metadata\n",
"record_data = [\n",
" {\n",
" \"description\": \"A versatile pair of dark-wash denim jeans. \"\n",
" \"Made from durable cotton with a classic straight-leg cut, these jeans\"\n",
" \" transition easily from casual days to dressier occasions.\",\n",
" \"price\": 65.00,\n",
" \"color\": \"blue\",\n",
" \"season\": [\"fall\", \"winter\", \"spring\"],\n",
" },\n",
" {\n",
" \"description\": \"A lightweight linen button-down shirt in a crisp white.\"\n",
" \" Perfect for keeping cool with breathable fabric and a relaxed fit.\",\n",
" \"price\": 34.99,\n",
" \"color\": \"white\",\n",
" \"season\": [\"summer\", \"spring\"],\n",
" },\n",
" {\n",
" \"description\": \"A soft, chunky knit sweater in a vibrant forest green. \"\n",
" \"The oversized fit and cozy wool blend make this ideal for staying warm \"\n",
" \"when the temperature drops.\",\n",
" \"price\": 89.99,\n",
" \"color\": \"green\",\n",
" \"season\": [\"fall\", \"winter\"],\n",
" },\n",
" {\n",
" \"description\": \"A classic crewneck t-shirt in a soft, heathered blue. \"\n",
" \"Made from comfortable cotton jersey, this t-shirt is a wardrobe essential \"\n",
" \"that works for every season.\",\n",
" \"price\": 19.99,\n",
" \"color\": \"blue\",\n",
" \"season\": [\"fall\", \"winter\", \"summer\", \"spring\"],\n",
" },\n",
" {\n",
" \"description\": \"A flowing midi-skirt in a delicate floral print. \"\n",
" \"Lightweight and airy, this skirt adds a touch of feminine style \"\n",
" \"to warmer days.\",\n",
" \"price\": 45.00,\n",
" \"color\": \"white\",\n",
" \"season\": [\"spring\", \"summer\"],\n",
" },\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6cd5fba1",
"metadata": {},
"outputs": [],
"source": [
"# Parse and prepare input data\n",
"\n",
"texts = []\n",
"metadatas = []\n",
"for record in record_data:\n",
"    record = record.copy()\n",
"    page_content = record.pop(\"description\")\n",
"    # Append text and metadata together so the two lists stay aligned\n",
"    if isinstance(page_content, str):\n",
"        texts.append(page_content)\n",
"        metadatas.append(record)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fc6f0e08",
"metadata": {},
"outputs": [],
"source": [
"# Inspect metadatas\n",
"metadatas"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "eb993e1a",
"metadata": {},
"outputs": [],
"source": [
"# NOTE: This operation can take more than 20 minutes\n",
"vector_store = VectorSearchVectorStore.from_components(\n",
" project_id=PROJECT_ID,\n",
" region=REGION,\n",
" gcs_bucket_name=BUCKET,\n",
" index_id=my_index.name,\n",
" endpoint_id=my_index_endpoint.name,\n",
" embedding=embedding_model,\n",
")\n",
"\n",
"vector_store.add_texts(texts=texts, metadatas=metadatas, is_complete_overwrite=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dac171b9",
"metadata": {},
"outputs": [],
"source": [
"from google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint import (\n",
" Namespace,\n",
" NumericNamespace,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "03ed6710",
"metadata": {},
"outputs": [],
"source": [
"# Try running a simple similarity search\n",
"\n",
"# Below code should return 5 results\n",
"vector_store.similarity_search(\"shirt\", k=5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d084f0e7",
"metadata": {},
"outputs": [],
"source": [
"# Try running a similarity search with text filter\n",
"filters = [Namespace(name=\"season\", allow_tokens=[\"spring\"])]\n",
"\n",
"# Below code should return 4 results now\n",
"vector_store.similarity_search(\"shirt\", k=5, filter=filters)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3eb3206e",
"metadata": {},
"outputs": [],
"source": [
"# Try running a similarity search with combination of text and numeric filter\n",
"filters = [Namespace(name=\"season\", allow_tokens=[\"spring\"])]\n",
"numeric_filters = [NumericNamespace(name=\"price\", value_float=40.0, op=\"LESS\")]\n",
"\n",
"# Below code should return 2 results now\n",
"vector_store.similarity_search(\n",
" \"shirt\", k=5, filter=filters, numeric_filter=numeric_filters\n",
")"
]
},
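{
"cell_type": "markdown",
"id": "sketch-filter-counts-md",
"metadata": {},
"source": [
"To make the expected result counts in the two filtered searches above easier to verify, the sketch below restates the filters as plain Python list comprehensions over the `price` and `season` values from `record_data`. It only approximates the filter semantics for this small example (the name `toy_metadata` is hypothetical) and is not how Vector Search evaluates filters internally.\n",
"\n",
"With this data, four items are tagged `spring`, and two of them cost less than 40.0, which matches the counts noted in the comments above."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-filter-counts-code",
"metadata": {},
"outputs": [],
"source": [
"# Illustrative only: price and season values copied from `record_data`\n",
"toy_metadata = [\n",
"    {\"price\": 65.00, \"season\": [\"fall\", \"winter\", \"spring\"]},\n",
"    {\"price\": 34.99, \"season\": [\"summer\", \"spring\"]},\n",
"    {\"price\": 89.99, \"season\": [\"fall\", \"winter\"]},\n",
"    {\"price\": 19.99, \"season\": [\"fall\", \"winter\", \"summer\", \"spring\"]},\n",
"    {\"price\": 45.00, \"season\": [\"spring\", \"summer\"]},\n",
"]\n",
"\n",
"# Plain-Python analogue of Namespace(name=\"season\", allow_tokens=[\"spring\"])\n",
"spring_items = [m for m in toy_metadata if \"spring\" in m[\"season\"]]\n",
"\n",
"# Analogue of also applying NumericNamespace(name=\"price\", value_float=40.0, op=\"LESS\")\n",
"cheap_spring_items = [m for m in spring_items if m[\"price\"] < 40.0]\n",
"\n",
"len(spring_items), len(cheap_spring_items)"
]
},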
{
"cell_type": "markdown",
"id": "4de820b3",
"metadata": {},
"source": [
"### Use Vector Store as retriever"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0ebe598e",
"metadata": {},
"outputs": [],
"source": [
"# Initialize the vector_store as a retriever\n",
"retriever = vector_store.as_retriever()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "98a251b1",
"metadata": {},
"outputs": [],
"source": [
"# perform simple similarity search on retriever\n",
"retriever.invoke(\"What are my options in breathable fabric?\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "61ab5631",
"metadata": {},
"outputs": [],
"source": [
"# Try running a similarity search with text filter\n",
"filters = [Namespace(name=\"season\", allow_tokens=[\"spring\"])]\n",
"\n",
"retriever.search_kwargs = {\"filter\": filters}\n",
"\n",
"# perform similarity search with filters on retriever\n",
"retriever.invoke(\"What are my options in breathable fabric?\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5bfcec72",
"metadata": {},
"outputs": [],
"source": [
"# Try running a similarity search with combination of text and numeric filter\n",
"filters = [Namespace(name=\"season\", allow_tokens=[\"spring\"])]\n",
"numeric_filters = [NumericNamespace(name=\"price\", value_float=40.0, op=\"LESS\")]\n",
"\n",
"\n",
"retriever.search_kwargs = {\"filter\": filters, \"numeric_filter\": numeric_filters}\n",
"\n",
"retriever.invoke(\"What are my options in breathable fabric?\")"
]
},
{
"cell_type": "markdown",
"id": "2def7692",
"metadata": {},
"source": [
"### Use filters with retriever in Question Answering Chains"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a0f6e31c",
"metadata": {},
"outputs": [],
"source": [
"from langchain_google_vertexai import VertexAI\n",
"\n",
"llm = VertexAI(model_name=\"gemini-pro\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6e9054c1",
"metadata": {},
"outputs": [],
"source": [
"from langchain.chains import RetrievalQA\n",
"\n",
"filters = [Namespace(name=\"season\", allow_tokens=[\"spring\"])]\n",
"numeric_filters = [NumericNamespace(name=\"price\", value_float=40.0, op=\"LESS\")]\n",
"\n",
"retriever.search_kwargs = {\"k\": 2, \"filter\": filters, \"numeric_filter\": numeric_filters}\n",
"\n",
"retrieval_qa = RetrievalQA.from_chain_type(\n",
" llm=llm,\n",
" chain_type=\"stuff\",\n",
" retriever=retriever,\n",
" return_source_documents=True,\n",
")\n",
"\n",
"question = \"What are my options in breathable fabric?\"\n",
"response = retrieval_qa.invoke({\"query\": question})\n",
"print(f\"{response['result']}\")\n",
"print(\"REFERENCES\")\n",
"print(f\"{response['source_documents']}\")"
]
},
{
"cell_type": "markdown",
"id": "e987ddef",
"metadata": {},
"source": [
"## Read, Chunk, Vectorise and Index PDFs"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "77675a97",
"metadata": {},
"outputs": [],
"source": [
"!pip install pypdf"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aad1896b",
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.document_loaders import PyPDFLoader\n",
"from langchain_text_splitters import RecursiveCharacterTextSplitter"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0454681b",
"metadata": {},
"outputs": [],
"source": [
"loader = PyPDFLoader(\"https://arxiv.org/pdf/1706.03762.pdf\")\n",
"pages = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "159e5722",
"metadata": {},
"outputs": [],
"source": [
"text_splitter = RecursiveCharacterTextSplitter(\n",
" # Split into chunks of at most 1000 characters, overlapping by 20 characters.\n",
" chunk_size=1000,\n",
" chunk_overlap=20,\n",
" length_function=len,\n",
" is_separator_regex=False,\n",
")\n",
"doc_splits = text_splitter.split_documents(pages)"
]
},
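{
"cell_type": "markdown",
"id": "sketch-chunk-overlap-md",
"metadata": {},
"source": [
"As a simplified picture of what `chunk_size` and `chunk_overlap` control, the sketch below cuts a string into fixed-size windows in which each chunk repeats the last `chunk_overlap` characters of the previous one. This is an illustration under stated assumptions: the helper `sliding_window_chunks` is hypothetical, and `RecursiveCharacterTextSplitter` additionally tries to split on separators such as paragraph and line breaks, so its chunks will generally differ."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-chunk-overlap-code",
"metadata": {},
"outputs": [],
"source": [
"# Illustrative only: a fixed-size sliding window, not the splitter's actual algorithm\n",
"def sliding_window_chunks(text, chunk_size, chunk_overlap):\n",
"    \"\"\"Split text into windows of chunk_size that overlap by chunk_overlap characters.\"\"\"\n",
"    if chunk_overlap >= chunk_size:\n",
"        raise ValueError(\"chunk_overlap must be smaller than chunk_size\")\n",
"    # Each new window starts this many characters after the previous one\n",
"    step = chunk_size - chunk_overlap\n",
"    return [text[i : i + chunk_size] for i in range(0, len(text), step)]\n",
"\n",
"\n",
"sliding_window_chunks(\"abcdefghij\", chunk_size=4, chunk_overlap=1)"
]
},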
{
"cell_type": "code",
"execution_count": null,
"id": "5a598ec8",
"metadata": {},
"outputs": [],
"source": [
"texts = [doc.page_content for doc in doc_splits]\n",
"metadatas = [doc.metadata for doc in doc_splits]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4dc880d6",
"metadata": {},
"outputs": [],
"source": [
"texts[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "558f9495",
"metadata": {},
"outputs": [],
"source": [
"# Inspect the metadata of the first chunk\n",
"metadatas[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "81143e4b",
"metadata": {},
"outputs": [],
"source": [
"vector_store = VectorSearchVectorStore.from_components(\n",
" project_id=PROJECT_ID,\n",
" region=REGION,\n",
" gcs_bucket_name=BUCKET,\n",
" index_id=my_index.name,\n",
" endpoint_id=my_index_endpoint.name,\n",
" embedding=embedding_model,\n",
")\n",
"\n",
"vector_store.add_texts(texts=texts, metadatas=metadatas, is_complete_overwrite=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "711efca3",
"metadata": {},
"outputs": [],
"source": [
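"# Alternatively, load an existing index and endpoint by resource ID (replace with your own IDs)\n",
"my_index = aiplatform.MatchingEngineIndex(\"5908955807575179264\")\n",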
"my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint(\"7751631742611488768\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7557d531",
"metadata": {},
"outputs": [],
"source": [
"vector_store = VectorSearchVectorStore.from_components(\n",
" project_id=PROJECT_ID,\n",
" region=REGION,\n",
" gcs_bucket_name=BUCKET,\n",
" index_id=my_index.name,\n",
" endpoint_id=my_index_endpoint.name,\n",
" embedding=embedding_model,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "31222b03",
"metadata": {},
"source": []
}
],
"metadata": {
"environment": {
"kernel": "python3",
"name": "common-cpu.m107",
"type": "gcloud",
"uri": "gcr.io/deeplearning-platform-release/base-cpu:m107"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}