You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
langchain/docs/docs/integrations/vectorstores/timescalevector.ipynb

1733 lines
91 KiB
Plaintext

This file contains ambiguous Unicode characters!

This file contains ambiguous Unicode characters that may be confused with others in your current locale. If your use case is intentional and legitimate, you can safely ignore this warning. Use the Escape button to highlight these characters.

{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Timescale Vector (Postgres)\n",
"\n",
">[Timescale Vector](https://www.timescale.com/ai?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral) is `PostgreSQL++` vector database for AI applications.\n",
"\n",
"This notebook shows how to use the Postgres vector database `Timescale Vector`. You'll learn how to use TimescaleVector for (1) semantic search, (2) time-based vector search, (3) self-querying, and (4) how to create indexes to speed up queries.\n",
"\n",
"## What is Timescale Vector?\n",
"\n",
"`Timescale Vector` enables you to efficiently store and query millions of vector embeddings in `PostgreSQL`.\n",
"- Enhances `pgvector` with faster and more accurate similarity search on 100M+ vectors via `DiskANN` inspired indexing algorithm.\n",
"- Enables fast time-based vector search via automatic time-based partitioning and indexing.\n",
"- Provides a familiar SQL interface for querying vector embeddings and relational data.\n",
"\n",
"`Timescale Vector` is cloud `PostgreSQL` for AI that scales with you from POC to production:\n",
"- Simplifies operations by enabling you to store relational metadata, vector embeddings, and time-series data in a single database.\n",
"- Benefits from rock-solid PostgreSQL foundation with enterprise-grade features like streaming backups and replication, high availability and row-level security.\n",
"- Enables a worry-free experience with enterprise-grade security and compliance.\n",
"\n",
"## How to access Timescale Vector\n",
"\n",
"`Timescale Vector` is available on [Timescale](https://www.timescale.com/ai?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral), the cloud PostgreSQL platform. (There is no self-hosted version at this time.)\n",
"\n",
"LangChain users get a 90-day free trial for Timescale Vector.\n",
"- To get started, [signup](https://console.cloud.timescale.com/signup?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral) to Timescale, create a new database and follow this notebook!\n",
"- See the [Timescale Vector explainer blog](https://www.timescale.com/blog/how-we-made-postgresql-the-best-vector-database/?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral) for more details and performance benchmarks.\n",
"- See the [installation instructions](https://github.com/timescale/python-vector) for more details on using Timescale Vector in Python."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"Follow these steps to get ready to follow this tutorial."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Pip install necessary packages\n",
"%pip install --upgrade --quiet timescale-vector\n",
"%pip install --upgrade --quiet langchain-openai\n",
"%pip install --upgrade --quiet tiktoken"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"In this example, we'll use `OpenAIEmbeddings`, so let's load your OpenAI API key."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"# Run export OPENAI_API_KEY=sk-YOUR_OPENAI_API_KEY...\n",
"# Get openAI api key by reading local .env file\n",
"from dotenv import find_dotenv, load_dotenv\n",
"\n",
"_ = load_dotenv(find_dotenv())\n",
"OPENAI_API_KEY = os.environ[\"OPENAI_API_KEY\"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Get the API key and save it as an environment variable\n",
"# import os\n",
"# import getpass\n",
"# os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from typing import Tuple"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we'll import the needed Python libraries and libraries from LangChain. Note that we import the `timescale-vector` library as well as the TimescaleVector LangChain vectorstore."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from datetime import datetime, timedelta\n",
"\n",
"from langchain_community.docstore.document import Document\n",
"from langchain_community.document_loaders import TextLoader\n",
"from langchain_community.document_loaders.json_loader import JSONLoader\n",
"from langchain_community.vectorstores.timescalevector import TimescaleVector\n",
"from langchain_openai import OpenAIEmbeddings\n",
"from langchain_text_splitters import CharacterTextSplitter"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Similarity Search with Euclidean Distance (Default)\n",
"\n",
"First, we'll look at an example of doing a similarity search query on the State of the Union speech to find the most similar sentences to a given query sentence. We'll use the [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance) as our similarity metric."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# Load the text and split it into chunks\n",
"loader = TextLoader(\"../../../extras/modules/state_of_the_union.txt\")\n",
"documents = loader.load()\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"docs = text_splitter.split_documents(documents)\n",
"\n",
"embeddings = OpenAIEmbeddings()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we'll load the service URL for our Timescale database. \n",
"\n",
"If you haven't already, [signup for Timescale](https://console.cloud.timescale.com/signup?utm_campaign=vectorlaunch&utm_source=langchain&utm_medium=referral), and create a new database.\n",
"\n",
"Then, to connect to your PostgreSQL database, you'll need your service URI, which can be found in the cheatsheet or `.env` file you downloaded after creating a new database. \n",
"\n",
"The URI will look something like this: `postgres://tsdbadmin:<password>@<id>.tsdb.cloud.timescale.com:<port>/tsdb?sslmode=require`. "
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# Timescale Vector needs the service url to your cloud database. You can see this as soon as you create the\n",
"# service in the cloud UI or in your credentials.sql file\n",
"SERVICE_URL = os.environ[\"TIMESCALE_SERVICE_URL\"]\n",
"\n",
"# Specify directly if testing\n",
"# SERVICE_URL = \"postgres://tsdbadmin:<password>@<id>.tsdb.cloud.timescale.com:<port>/tsdb?sslmode=require\"\n",
"\n",
"# # You can get also it from an environment variables. We suggest using a .env file.\n",
"# import os\n",
"# SERVICE_URL = os.environ.get(\"TIMESCALE_SERVICE_URL\", \"\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we create a TimescaleVector vectorstore. We specify a collection name, which will be the name of the table our data is stored in. \n",
"\n",
"Note: When creating a new instance of TimescaleVector, the TimescaleVector Module will try to create a table with the name of the collection. So, make sure that the collection name is unique (i.e it doesn't already exist)."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# The TimescaleVector Module will create a table with the name of the collection.\n",
"COLLECTION_NAME = \"state_of_the_union_test\"\n",
"\n",
"# Create a Timescale Vector instance from the collection of documents\n",
"db = TimescaleVector.from_documents(\n",
" embedding=embeddings,\n",
" documents=docs,\n",
" collection_name=COLLECTION_NAME,\n",
" service_url=SERVICE_URL,\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we've loaded our data, we can perform a similarity search."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs_with_score = db.similarity_search_with_score(query)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--------------------------------------------------------------------------------\n",
"Score: 0.18443380687035138\n",
"Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.\n",
"--------------------------------------------------------------------------------\n",
"--------------------------------------------------------------------------------\n",
"Score: 0.18452197313308139\n",
"Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.\n",
"--------------------------------------------------------------------------------\n",
"--------------------------------------------------------------------------------\n",
"Score: 0.21720781018594182\n",
"A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since shes been nominated, shes received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. \n",
"\n",
"And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. \n",
"\n",
"We can do both. At our border, weve installed new technology like cutting-edge scanners to better detect drug smuggling. \n",
"\n",
"Weve set up joint patrols with Mexico and Guatemala to catch more human traffickers. \n",
"\n",
"Were putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster. \n",
"\n",
"Were securing commitments and supporting partners in South and Central America to host more refugees and secure their own borders.\n",
"--------------------------------------------------------------------------------\n",
"--------------------------------------------------------------------------------\n",
"Score: 0.21724902288621384\n",
"A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since shes been nominated, shes received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. \n",
"\n",
"And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. \n",
"\n",
"We can do both. At our border, weve installed new technology like cutting-edge scanners to better detect drug smuggling. \n",
"\n",
"Weve set up joint patrols with Mexico and Guatemala to catch more human traffickers. \n",
"\n",
"Were putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster. \n",
"\n",
"Were securing commitments and supporting partners in South and Central America to host more refugees and secure their own borders.\n",
"--------------------------------------------------------------------------------\n"
]
}
],
"source": [
"for doc, score in docs_with_score:\n",
" print(\"-\" * 80)\n",
" print(\"Score: \", score)\n",
" print(doc.page_content)\n",
" print(\"-\" * 80)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using a Timescale Vector as a Retriever\n",
"After initializing a TimescaleVector store, you can use it as a [retriever](/docs/modules/data_connection/retrievers/)."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"# Use TimescaleVector as a retriever\n",
"retriever = db.as_retriever()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tags=['TimescaleVector', 'OpenAIEmbeddings'] metadata=None vectorstore=<langchain_community.vectorstores.timescalevector.TimescaleVector object at 0x10fc8d070> search_type='similarity' search_kwargs={}\n"
]
}
],
"source": [
"print(retriever)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look at an example of using Timescale Vector as a retriever with the RetrievalQA chain and the stuff documents chain.\n",
"\n",
"In this example, we'll ask the same query as above, but this time we'll pass the relevant documents returned from Timescale Vector to an LLM to use as context to answer our question.\n",
"\n",
"First we'll create our stuff chain:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# Initialize GPT3.5 model\n",
"from langchain_openai import ChatOpenAI\n",
"\n",
"llm = ChatOpenAI(temperature=0.1, model=\"gpt-3.5-turbo-16k\")\n",
"\n",
"# Initialize a RetrievalQA class from a stuff chain\n",
"from langchain.chains import RetrievalQA\n",
"\n",
"qa_stuff = RetrievalQA.from_chain_type(\n",
" llm=llm,\n",
" chain_type=\"stuff\",\n",
" retriever=retriever,\n",
" verbose=True,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"\u001b[1m> Entering new RetrievalQA chain...\u001b[0m\n",
"\n",
"\u001b[1m> Finished chain.\u001b[0m\n"
]
}
],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson?\"\n",
"response = qa_stuff.run(query)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The President said that he nominated Circuit Court of Appeals Judge Ketanji Brown Jackson, who is one of our nation's top legal minds and will continue Justice Breyer's legacy of excellence. He also mentioned that since her nomination, she has received a broad range of support from various groups, including the Fraternal Order of Police and former judges appointed by Democrats and Republicans.\n"
]
}
],
"source": [
"print(response)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Similarity Search with time-based filtering\n",
"\n",
"A key use case for Timescale Vector is efficient time-based vector search. Timescale Vector enables this by automatically partitioning vectors (and associated metadata) by time. This allows you to efficiently query vectors by both similarity to a query vector and time.\n",
"\n",
"Time-based vector search functionality is helpful for applications like:\n",
"- Storing and retrieving LLM response history (e.g. chatbots)\n",
"- Finding the most recent embeddings that are similar to a query vector (e.g recent news).\n",
"- Constraining similarity search to a relevant time range (e.g asking time-based questions about a knowledge base)\n",
"\n",
"To illustrate how to use TimescaleVector's time-based vector search functionality, we'll ask questions about the git log history for TimescaleDB . We'll illustrate how to add documents with a time-based uuid and how run similarity searches with time range filters."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Extract content and metadata from git log JSON\n",
"First lets load in the git log data into a new collection in our PostgreSQL database named `timescale_commits`."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": []
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll define a helper funciton to create a uuid for a document and associated vector embedding based on its timestamp. We'll use this function to create a uuid for each git log entry.\n",
"\n",
"Important note: If you are working with documents and want the current date and time associated with vector for time-based search, you can skip this step. A uuid will be automatically generated when the documents are ingested by default."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"from timescale_vector import client\n",
"\n",
"\n",
"# Function to take in a date string in the past and return a uuid v1\n",
"def create_uuid(date_string: str):\n",
" if date_string is None:\n",
" return None\n",
" time_format = \"%a %b %d %H:%M:%S %Y %z\"\n",
" datetime_obj = datetime.strptime(date_string, time_format)\n",
" uuid = client.uuid_from_time(datetime_obj)\n",
" return str(uuid)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we'll define a metadata function to extract the relevant metadata from the JSON record. We'll pass this function to the JSONLoader. See the [JSON document loader docs](/docs/modules/data_connection/document_loaders/json) for more details."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"# Helper function to split name and email given an author string consisting of Name Lastname <email>\n",
"def split_name(input_string: str) -> Tuple[str, str]:\n",
" if input_string is None:\n",
" return None, None\n",
" start = input_string.find(\"<\")\n",
" end = input_string.find(\">\")\n",
" name = input_string[:start].strip()\n",
" email = input_string[start + 1 : end].strip()\n",
" return name, email\n",
"\n",
"\n",
"# Helper function to transform a date string into a timestamp_tz string\n",
"def create_date(input_string: str) -> datetime:\n",
" if input_string is None:\n",
" return None\n",
" # Define a dictionary to map month abbreviations to their numerical equivalents\n",
" month_dict = {\n",
" \"Jan\": \"01\",\n",
" \"Feb\": \"02\",\n",
" \"Mar\": \"03\",\n",
" \"Apr\": \"04\",\n",
" \"May\": \"05\",\n",
" \"Jun\": \"06\",\n",
" \"Jul\": \"07\",\n",
" \"Aug\": \"08\",\n",
" \"Sep\": \"09\",\n",
" \"Oct\": \"10\",\n",
" \"Nov\": \"11\",\n",
" \"Dec\": \"12\",\n",
" }\n",
"\n",
" # Split the input string into its components\n",
" components = input_string.split()\n",
" # Extract relevant information\n",
" day = components[2]\n",
" month = month_dict[components[1]]\n",
" year = components[4]\n",
" time = components[3]\n",
" timezone_offset_minutes = int(components[5]) # Convert the offset to minutes\n",
" timezone_hours = timezone_offset_minutes // 60 # Calculate the hours\n",
" timezone_minutes = timezone_offset_minutes % 60 # Calculate the remaining minutes\n",
" # Create a formatted string for the timestamptz in PostgreSQL format\n",
" timestamp_tz_str = (\n",
" f\"{year}-{month}-{day} {time}+{timezone_hours:02}{timezone_minutes:02}\"\n",
" )\n",
" return timestamp_tz_str\n",
"\n",
"\n",
"# Metadata extraction function to extract metadata from a JSON record\n",
"def extract_metadata(record: dict, metadata: dict) -> dict:\n",
" record_name, record_email = split_name(record[\"author\"])\n",
" metadata[\"id\"] = create_uuid(record[\"date\"])\n",
" metadata[\"date\"] = create_date(record[\"date\"])\n",
" metadata[\"author_name\"] = record_name\n",
" metadata[\"author_email\"] = record_email\n",
" metadata[\"commit_hash\"] = record[\"commit\"]\n",
" return metadata"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, you'll need to [download the sample dataset](https://s3.amazonaws.com/assets.timescale.com/ai/ts_git_log.json) and place it in the same directory as this notebook.\n",
"\n",
"You can use following command:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "shellscript"
}
},
"outputs": [],
"source": [
"# Download the file using curl and save it as commit_history.csv\n",
"# Note: Execute this command in your terminal, in the same directory as the notebook\n",
"!curl -O https://s3.amazonaws.com/assets.timescale.com/ai/ts_git_log.json"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally we can initialize the JSON loader to parse the JSON records. We also remove empty records for simplicity."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"# Define path to the JSON file relative to this notebook\n",
"# Change this to the path to your JSON file\n",
"FILE_PATH = \"../../../../../ts_git_log.json\"\n",
"\n",
"# Load data from JSON file and extract metadata\n",
"loader = JSONLoader(\n",
" file_path=FILE_PATH,\n",
" jq_schema=\".commit_history[]\",\n",
" text_content=False,\n",
" metadata_func=extract_metadata,\n",
")\n",
"documents = loader.load()\n",
"\n",
"# Remove documents with None dates\n",
"documents = [doc for doc in documents if doc.metadata[\"date\"] is not None]"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"page_content='{\"commit\": \"44e41c12ab25e36c202f58e068ced262eadc8d16\", \"author\": \"Lakshmi Narayanan Sreethar<lakshmi@timescale.com>\", \"date\": \"Tue Sep 5 21:03:21 2023 +0530\", \"change summary\": \"Fix segfault in set_integer_now_func\", \"change details\": \"When an invalid function oid is passed to set_integer_now_func, it finds out that the function oid is invalid but before throwing the error, it calls ReleaseSysCache on an invalid tuple causing a segfault. Fixed that by removing the invalid call to ReleaseSysCache. Fixes #6037 \"}' metadata={'source': '/Users/avtharsewrathan/sideprojects2023/timescaleai/tsv-langchain/ts_git_log.json', 'seq_num': 1, 'id': '8b407680-4c01-11ee-96a6-b82284ddccc6', 'date': '2023-09-5 21:03:21+0850', 'author_name': 'Lakshmi Narayanan Sreethar', 'author_email': 'lakshmi@timescale.com', 'commit_hash': '44e41c12ab25e36c202f58e068ced262eadc8d16'}\n"
]
}
],
"source": [
"print(documents[0])"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load documents and metadata into TimescaleVector vectorstore\n",
"Now that we have prepared our documents, let's process them and load them, along with their vector embedding representations into our TimescaleVector vectorstore.\n",
"\n",
"Since this is a demo, we will only load the first 500 records. In practice, you can load as many records as you want."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"NUM_RECORDS = 500\n",
"documents = documents[:NUM_RECORDS]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we use the CharacterTextSplitter to split the documents into smaller chunks if needed for easier embedding. Note that this splitting process retains the metadata for each document."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"# Split the documents into chunks for embedding\n",
"text_splitter = CharacterTextSplitter(\n",
" chunk_size=1000,\n",
" chunk_overlap=200,\n",
")\n",
"docs = text_splitter.split_documents(documents)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we'll create a Timescale Vector instance from the collection of documents that we finished pre-processsing.\n",
"\n",
"First, we'll define a collection name, which will be the name of our table in the PostgreSQL database. \n",
"\n",
"We'll also define a time delta, which we pass to the `time_partition_interval` argument, which will be used to as the interval for partitioning the data by time. Each partition will consist of data for the specified length of time. We'll use 7 days for simplicity, but you can pick whatever value make sense for your use case -- for example if you query recent vectors frequently you might want to use a smaller time delta like 1 day, or if you query vectors over a decade long time period then you might want to use a larger time delta like 6 months or 1 year.\n",
"\n",
"Finally, we'll create the TimescaleVector instance. We specify the `ids` argument to be the `uuid` field in our metadata that we created in the pre-processing step above. We do this because we want the time part of our uuids to reflect dates in the past (i.e when the commit was made). However, if we wanted the current date and time to be associated with our document, we can remove the id argument and uuid's will be automatically created with the current date and time."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"# Define collection name\n",
"COLLECTION_NAME = \"timescale_commits\"\n",
"embeddings = OpenAIEmbeddings()\n",
"\n",
"# Create a Timescale Vector instance from the collection of documents\n",
"db = TimescaleVector.from_documents(\n",
" embedding=embeddings,\n",
" ids=[doc.metadata[\"id\"] for doc in docs],\n",
" documents=docs,\n",
" collection_name=COLLECTION_NAME,\n",
" service_url=SERVICE_URL,\n",
" time_partition_interval=timedelta(days=7),\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Querying vectors by time and similarity\n",
"\n",
"Now that we have loaded our documents into TimescaleVector, we can query them by time and similarity.\n",
"\n",
"TimescaleVector provides multiple methods for querying vectors by doing similarity search with time-based filtering.\n",
"\n",
"Let's take a look at each method below:"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"# Time filter variables\n",
"start_dt = datetime(2023, 8, 1, 22, 10, 35) # Start date = 1 August 2023, 22:10:35\n",
"end_dt = datetime(2023, 8, 30, 22, 10, 35) # End date = 30 August 2023, 22:10:35\n",
"td = timedelta(days=7) # Time delta = 7 days\n",
"\n",
"query = \"What's new with TimescaleDB functions?\""
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Method 1: Filter within a provided start date and end date.\n"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--------------------------------------------------------------------------------\n",
"Score: 0.17488396167755127\n",
"Date: 2023-08-29 18:13:24+0320\n",
"{\"commit\": \" e4facda540286b0affba47ccc63959fefe2a7b26\", \"author\": \"Sven Klemm<sven@timescale.com>\", \"date\": \"Tue Aug 29 18:13:24 2023 +0200\", \"change summary\": \"Add compatibility layer for _timescaledb_internal functions\", \"change details\": \"With timescaledb 2.12 all the functions present in _timescaledb_internal were moved into the _timescaledb_functions schema to improve schema security. This patch adds a compatibility layer so external callers of these internal functions will not break and allow for more flexibility when migrating. \"}\n",
"--------------------------------------------------------------------------------\n",
"--------------------------------------------------------------------------------\n",
"Score: 0.18102192878723145\n",
"Date: 2023-08-20 22:47:10+0320\n",
"{\"commit\": \" 0a66bdb8d36a1879246bd652e4c28500c4b951ab\", \"author\": \"Sven Klemm<sven@timescale.com>\", \"date\": \"Sun Aug 20 22:47:10 2023 +0200\", \"change summary\": \"Move functions to _timescaledb_functions schema\", \"change details\": \"To increase schema security we do not want to mix our own internal objects with user objects. Since chunks are created in the _timescaledb_internal schema our internal functions should live in a different dedicated schema. This patch make the necessary adjustments for the following functions: - to_unix_microseconds(timestamptz) - to_timestamp(bigint) - to_timestamp_without_timezone(bigint) - to_date(bigint) - to_interval(bigint) - interval_to_usec(interval) - time_to_internal(anyelement) - subtract_integer_from_now(regclass, bigint) \"}\n",
"--------------------------------------------------------------------------------\n",
"--------------------------------------------------------------------------------\n",
"Score: 0.18150119891755445\n",
"Date: 2023-08-22 12:01:19+0320\n",
"{\"commit\": \" cf04496e4b4237440274eb25e4e02472fc4e06fc\", \"author\": \"Sven Klemm<sven@timescale.com>\", \"date\": \"Tue Aug 22 12:01:19 2023 +0200\", \"change summary\": \"Move utility functions to _timescaledb_functions schema\", \"change details\": \"To increase schema security we do not want to mix our own internal objects with user objects. Since chunks are created in the _timescaledb_internal schema our internal functions should live in a different dedicated schema. This patch make the necessary adjustments for the following functions: - generate_uuid() - get_git_commit() - get_os_info() - tsl_loaded() \"}\n",
"--------------------------------------------------------------------------------\n",
"--------------------------------------------------------------------------------\n",
"Score: 0.18422493887617963\n",
"Date: 2023-08-9 15:26:03+0500\n",
"{\"commit\": \" 44eab9cf9bef34274c88efd37a750eaa74cd8044\", \"author\": \"Konstantina Skovola<konstantina@timescale.com>\", \"date\": \"Wed Aug 9 15:26:03 2023 +0300\", \"change summary\": \"Release 2.11.2\", \"change details\": \"This release contains bug fixes since the 2.11.1 release. We recommend that you upgrade at the next available opportunity. **Features** * #5923 Feature flags for TimescaleDB features **Bugfixes** * #5680 Fix DISTINCT query with JOIN on multiple segmentby columns * #5774 Fixed two bugs in decompression sorted merge code * #5786 Ensure pg_config --cppflags are passed * #5906 Fix quoting owners in sql scripts. * #5912 Fix crash in 1-step integer policy creation **Thanks** * @mrksngl for submitting a PR to fix extension upgrade scripts * @ericdevries for reporting an issue with DISTINCT queries using segmentby columns of compressed hypertable \"}\n",
"--------------------------------------------------------------------------------\n"
]
}
],
"source": [
"# Method 1: Query for vectors between start_date and end_date\n",
"docs_with_score = db.similarity_search_with_score(\n",
" query, start_date=start_dt, end_date=end_dt\n",
")\n",
"\n",
"for doc, score in docs_with_score:\n",
" print(\"-\" * 80)\n",
" print(\"Score: \", score)\n",
" print(\"Date: \", doc.metadata[\"date\"])\n",
" print(doc.page_content)\n",
" print(\"-\" * 80)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Note how the query only returns results within the specified date range."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Method 2: Filter within a provided start date, and a time delta later."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--------------------------------------------------------------------------------\n",
"Score: 0.18458807468414307\n",
"Date: 2023-08-3 14:30:23+0500\n",
"{\"commit\": \" 7aeed663b9c0f337b530fd6cad47704a51a9b2ec\", \"author\": \"Dmitry Simonenko<dmitry@timescale.com>\", \"date\": \"Thu Aug 3 14:30:23 2023 +0300\", \"change summary\": \"Feature flags for TimescaleDB features\", \"change details\": \"This PR adds several GUCs which allow to enable/disable major timescaledb features: - enable_hypertable_create - enable_hypertable_compression - enable_cagg_create - enable_policy_create \"}\n",
"--------------------------------------------------------------------------------\n",
"--------------------------------------------------------------------------------\n",
"Score: 0.20492422580718994\n",
"Date: 2023-08-7 18:31:40+0320\n",
"{\"commit\": \" 07762ea4cedefc88497f0d1f8712d1515cdc5b6e\", \"author\": \"Sven Klemm<sven@timescale.com>\", \"date\": \"Mon Aug 7 18:31:40 2023 +0200\", \"change summary\": \"Test timescaledb debian 12 packages in CI\", \"change details\": \"\"}\n",
"--------------------------------------------------------------------------------\n",
"--------------------------------------------------------------------------------\n",
"Score: 0.21106326580047607\n",
"Date: 2023-08-3 14:36:39+0500\n",
"{\"commit\": \" 2863daf3df83c63ee36c0cf7b66c522da5b4e127\", \"author\": \"Dmitry Simonenko<dmitry@timescale.com>\", \"date\": \"Thu Aug 3 14:36:39 2023 +0300\", \"change summary\": \"Support CREATE INDEX ONLY ON main table\", \"change details\": \"This PR adds support for CREATE INDEX ONLY ON clause which allows to create index only on the main table excluding chunks. Fix #5908 \"}\n",
"--------------------------------------------------------------------------------\n",
"--------------------------------------------------------------------------------\n",
"Score: 0.21698051691055298\n",
"Date: 2023-08-2 20:24:14+0140\n",
"{\"commit\": \" 3af0d282ea71d9a8f27159a6171e9516e62ec9cb\", \"author\": \"Lakshmi Narayanan Sreethar<lakshmi@timescale.com>\", \"date\": \"Wed Aug 2 20:24:14 2023 +0100\", \"change summary\": \"PG16: ExecInsertIndexTuples requires additional parameter\", \"change details\": \"PG16 adds a new boolean parameter to the ExecInsertIndexTuples function to denote if the index is a BRIN index, which is then used to determine if the index update can be skipped. The fix also removes the INDEX_ATTR_BITMAP_ALL enum value. Adapt these changes by updating the compat function to accomodate the new parameter added to the ExecInsertIndexTuples function and using an alternative for the removed INDEX_ATTR_BITMAP_ALL enum value. postgres/postgres@19d8e23 \"}\n",
"--------------------------------------------------------------------------------\n"
]
}
],
"source": [
"# Method 2: Query for vectors between start_dt and a time delta td later\n",
"# Most relevant vectors between 1 August and 7 days later\n",
"docs_with_score = db.similarity_search_with_score(\n",
" query, start_date=start_dt, time_delta=td\n",
")\n",
"\n",
"for doc, score in docs_with_score:\n",
" print(\"-\" * 80)\n",
" print(\"Score: \", score)\n",
" print(\"Date: \", doc.metadata[\"date\"])\n",
" print(doc.page_content)\n",
" print(\"-\" * 80)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Once again, notice how we get results within the specified time filter, different from the previous query."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Method 3: Filter within a provided end date and a time delta earlier."
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--------------------------------------------------------------------------------\n",
"Score: 0.17488396167755127\n",
"Date: 2023-08-29 18:13:24+0320\n",
"{\"commit\": \" e4facda540286b0affba47ccc63959fefe2a7b26\", \"author\": \"Sven Klemm<sven@timescale.com>\", \"date\": \"Tue Aug 29 18:13:24 2023 +0200\", \"change summary\": \"Add compatibility layer for _timescaledb_internal functions\", \"change details\": \"With timescaledb 2.12 all the functions present in _timescaledb_internal were moved into the _timescaledb_functions schema to improve schema security. This patch adds a compatibility layer so external callers of these internal functions will not break and allow for more flexibility when migrating. \"}\n",
"--------------------------------------------------------------------------------\n",
"--------------------------------------------------------------------------------\n",
"Score: 0.18496227264404297\n",
"Date: 2023-08-29 10:49:47+0320\n",
"{\"commit\": \" a9751ccd5eb030026d7b975d22753f5964972389\", \"author\": \"Sven Klemm<sven@timescale.com>\", \"date\": \"Tue Aug 29 10:49:47 2023 +0200\", \"change summary\": \"Move partitioning functions to _timescaledb_functions schema\", \"change details\": \"To increase schema security we do not want to mix our own internal objects with user objects. Since chunks are created in the _timescaledb_internal schema our internal functions should live in a different dedicated schema. This patch make the necessary adjustments for the following functions: - get_partition_for_key(val anyelement) - get_partition_hash(val anyelement) \"}\n",
"--------------------------------------------------------------------------------\n",
"--------------------------------------------------------------------------------\n",
"Score: 0.1871250867843628\n",
"Date: 2023-08-28 23:26:23+0320\n",
"{\"commit\": \" b2a91494a11d8b82849b6f11f9ea6dc26ef8a8cb\", \"author\": \"Sven Klemm<sven@timescale.com>\", \"date\": \"Mon Aug 28 23:26:23 2023 +0200\", \"change summary\": \"Move ddl_internal functions to _timescaledb_functions schema\", \"change details\": \"To increase schema security we do not want to mix our own internal objects with user objects. Since chunks are created in the _timescaledb_internal schema our internal functions should live in a different dedicated schema. This patch make the necessary adjustments for the following functions: - chunk_constraint_add_table_constraint(_timescaledb_catalog.chunk_constraint) - chunk_drop_replica(regclass,name) - chunk_index_clone(oid) - chunk_index_replace(oid,oid) - create_chunk_replica_table(regclass,name) - drop_stale_chunks(name,integer[]) - health() - hypertable_constraint_add_table_fk_constraint(name,name,name,integer) - process_ddl_event() - wait_subscription_sync(name,name,integer,numeric) \"}\n",
"--------------------------------------------------------------------------------\n",
"--------------------------------------------------------------------------------\n",
"Score: 0.18867712088363497\n",
"Date: 2023-08-27 13:20:04+0320\n",
"{\"commit\": \" e02b1f348eb4c48def00b7d5227238b4d9d41a4a\", \"author\": \"Sven Klemm<sven@timescale.com>\", \"date\": \"Sun Aug 27 13:20:04 2023 +0200\", \"change summary\": \"Simplify schema move update script\", \"change details\": \"Use dynamic sql to create the ALTER FUNCTION statements for those functions that may not exist in previous versions. \"}\n",
"--------------------------------------------------------------------------------\n"
]
}
],
"source": [
"# Method 3: Query for vectors between end_dt and a time delta td earlier\n",
"# Most relevant vectors between 30 August and 7 days earlier\n",
"docs_with_score = db.similarity_search_with_score(query, end_date=end_dt, time_delta=td)\n",
"\n",
"for doc, score in docs_with_score:\n",
" print(\"-\" * 80)\n",
" print(\"Score: \", score)\n",
" print(\"Date: \", doc.metadata[\"date\"])\n",
" print(doc.page_content)\n",
" print(\"-\" * 80)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Method 4: We can also filter for all vectors after a given date by only specifying a start date in our query.\n",
"\n",
"Method 5: Similarly, we can filter for or all vectors before a given date by only specify an end date in our query."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--------------------------------------------------------------------------------\n",
"Score: 0.17488396167755127\n",
"Date: 2023-08-29 18:13:24+0320\n",
"{\"commit\": \" e4facda540286b0affba47ccc63959fefe2a7b26\", \"author\": \"Sven Klemm<sven@timescale.com>\", \"date\": \"Tue Aug 29 18:13:24 2023 +0200\", \"change summary\": \"Add compatibility layer for _timescaledb_internal functions\", \"change details\": \"With timescaledb 2.12 all the functions present in _timescaledb_internal were moved into the _timescaledb_functions schema to improve schema security. This patch adds a compatibility layer so external callers of these internal functions will not break and allow for more flexibility when migrating. \"}\n",
"--------------------------------------------------------------------------------\n",
"--------------------------------------------------------------------------------\n",
"Score: 0.18102192878723145\n",
"Date: 2023-08-20 22:47:10+0320\n",
"{\"commit\": \" 0a66bdb8d36a1879246bd652e4c28500c4b951ab\", \"author\": \"Sven Klemm<sven@timescale.com>\", \"date\": \"Sun Aug 20 22:47:10 2023 +0200\", \"change summary\": \"Move functions to _timescaledb_functions schema\", \"change details\": \"To increase schema security we do not want to mix our own internal objects with user objects. Since chunks are created in the _timescaledb_internal schema our internal functions should live in a different dedicated schema. This patch make the necessary adjustments for the following functions: - to_unix_microseconds(timestamptz) - to_timestamp(bigint) - to_timestamp_without_timezone(bigint) - to_date(bigint) - to_interval(bigint) - interval_to_usec(interval) - time_to_internal(anyelement) - subtract_integer_from_now(regclass, bigint) \"}\n",
"--------------------------------------------------------------------------------\n",
"--------------------------------------------------------------------------------\n",
"Score: 0.18150119891755445\n",
"Date: 2023-08-22 12:01:19+0320\n",
"{\"commit\": \" cf04496e4b4237440274eb25e4e02472fc4e06fc\", \"author\": \"Sven Klemm<sven@timescale.com>\", \"date\": \"Tue Aug 22 12:01:19 2023 +0200\", \"change summary\": \"Move utility functions to _timescaledb_functions schema\", \"change details\": \"To increase schema security we do not want to mix our own internal objects with user objects. Since chunks are created in the _timescaledb_internal schema our internal functions should live in a different dedicated schema. This patch make the necessary adjustments for the following functions: - generate_uuid() - get_git_commit() - get_os_info() - tsl_loaded() \"}\n",
"--------------------------------------------------------------------------------\n",
"--------------------------------------------------------------------------------\n",
"Score: 0.18422493887617963\n",
"Date: 2023-08-9 15:26:03+0500\n",
"{\"commit\": \" 44eab9cf9bef34274c88efd37a750eaa74cd8044\", \"author\": \"Konstantina Skovola<konstantina@timescale.com>\", \"date\": \"Wed Aug 9 15:26:03 2023 +0300\", \"change summary\": \"Release 2.11.2\", \"change details\": \"This release contains bug fixes since the 2.11.1 release. We recommend that you upgrade at the next available opportunity. **Features** * #5923 Feature flags for TimescaleDB features **Bugfixes** * #5680 Fix DISTINCT query with JOIN on multiple segmentby columns * #5774 Fixed two bugs in decompression sorted merge code * #5786 Ensure pg_config --cppflags are passed * #5906 Fix quoting owners in sql scripts. * #5912 Fix crash in 1-step integer policy creation **Thanks** * @mrksngl for submitting a PR to fix extension upgrade scripts * @ericdevries for reporting an issue with DISTINCT queries using segmentby columns of compressed hypertable \"}\n",
"--------------------------------------------------------------------------------\n"
]
}
],
"source": [
"# Method 4: Query all vectors after start_date\n",
"docs_with_score = db.similarity_search_with_score(query, start_date=start_dt)\n",
"\n",
"for doc, score in docs_with_score:\n",
" print(\"-\" * 80)\n",
" print(\"Score: \", score)\n",
" print(\"Date: \", doc.metadata[\"date\"])\n",
" print(doc.page_content)\n",
" print(\"-\" * 80)"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--------------------------------------------------------------------------------\n",
"Score: 0.16723191738128662\n",
"Date: 2023-04-11 22:01:14+0320\n",
"{\"commit\": \" 0595ff0888f2ffb8d313acb0bda9642578a9ade3\", \"author\": \"Sven Klemm<sven@timescale.com>\", \"date\": \"Tue Apr 11 22:01:14 2023 +0200\", \"change summary\": \"Move type support functions into _timescaledb_functions schema\", \"change details\": \"\"}\n",
"--------------------------------------------------------------------------------\n",
"--------------------------------------------------------------------------------\n",
"Score: 0.1706540584564209\n",
"Date: 2023-04-6 13:00:00+0320\n",
"{\"commit\": \" 04f43335dea11e9c467ee558ad8edfc00c1a45ed\", \"author\": \"Sven Klemm<sven@timescale.com>\", \"date\": \"Thu Apr 6 13:00:00 2023 +0200\", \"change summary\": \"Move aggregate support function into _timescaledb_functions\", \"change details\": \"This patch moves the support functions for histogram, first and last into the _timescaledb_functions schema. Since we alter the schema of the existing functions in upgrade scripts and do not change the aggregates this should work completely transparently for any user objects using those aggregates. \"}\n",
"--------------------------------------------------------------------------------\n",
"--------------------------------------------------------------------------------\n",
"Score: 0.17462033033370972\n",
"Date: 2023-03-31 08:22:57+0320\n",
"{\"commit\": \" feef9206facc5c5f506661de4a81d96ef059b095\", \"author\": \"Sven Klemm<sven@timescale.com>\", \"date\": \"Fri Mar 31 08:22:57 2023 +0200\", \"change summary\": \"Add _timescaledb_functions schema\", \"change details\": \"Currently internal user objects like chunks and our functions live in the same schema making locking down that schema hard. This patch adds a new schema _timescaledb_functions that is meant to be the schema used for timescaledb internal functions to allow separation of code and chunks or other user objects. \"}\n",
"--------------------------------------------------------------------------------\n",
"--------------------------------------------------------------------------------\n",
"Score: 0.17488396167755127\n",
"Date: 2023-08-29 18:13:24+0320\n",
"{\"commit\": \" e4facda540286b0affba47ccc63959fefe2a7b26\", \"author\": \"Sven Klemm<sven@timescale.com>\", \"date\": \"Tue Aug 29 18:13:24 2023 +0200\", \"change summary\": \"Add compatibility layer for _timescaledb_internal functions\", \"change details\": \"With timescaledb 2.12 all the functions present in _timescaledb_internal were moved into the _timescaledb_functions schema to improve schema security. This patch adds a compatibility layer so external callers of these internal functions will not break and allow for more flexibility when migrating. \"}\n",
"--------------------------------------------------------------------------------\n"
]
}
],
"source": [
"# Method 5: Query all vectors before end_date\n",
"docs_with_score = db.similarity_search_with_score(query, end_date=end_dt)\n",
"\n",
"for doc, score in docs_with_score:\n",
" print(\"-\" * 80)\n",
" print(\"Score: \", score)\n",
" print(\"Date: \", doc.metadata[\"date\"])\n",
" print(doc.page_content)\n",
" print(\"-\" * 80)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The main takeaway is that in each result above, only vectors within the specified time range are returned. These queries are very efficient as they only need to search the relevant partitions."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also use this functionality for question answering, where we want to find the most relevant vectors within a specified time range to use as context for answering a question. Let's take a look at an example below, using Timescale Vector as a retriever:"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"# Set timescale vector as a retriever and specify start and end dates via kwargs\n",
"retriever = db.as_retriever(search_kwargs={\"start_date\": start_dt, \"end_date\": end_dt})"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"\u001b[1m> Entering new RetrievalQA chain...\u001b[0m\n",
"\n",
"\u001b[1m> Finished chain.\u001b[0m\n",
"The following changes were made to the timescaledb functions:\n",
"\n",
"1. \"Add compatibility layer for _timescaledb_internal functions\" - This change was made on Tue Aug 29 18:13:24 2023 +0200.\n",
"2. \"Move functions to _timescaledb_functions schema\" - This change was made on Sun Aug 20 22:47:10 2023 +0200.\n",
"3. \"Move utility functions to _timescaledb_functions schema\" - This change was made on Tue Aug 22 12:01:19 2023 +0200.\n",
"4. \"Move partitioning functions to _timescaledb_functions schema\" - This change was made on Tue Aug 29 10:49:47 2023 +0200.\n"
]
}
],
"source": [
"from langchain_openai import ChatOpenAI\n",
"\n",
"llm = ChatOpenAI(temperature=0.1, model=\"gpt-3.5-turbo-16k\")\n",
"\n",
"from langchain.chains import RetrievalQA\n",
"\n",
"qa_stuff = RetrievalQA.from_chain_type(\n",
" llm=llm,\n",
" chain_type=\"stuff\",\n",
" retriever=retriever,\n",
" verbose=True,\n",
")\n",
"\n",
"query = (\n",
" \"What's new with the timescaledb functions? Tell me when these changes were made.\"\n",
")\n",
"response = qa_stuff.run(query)\n",
"print(response)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the context the LLM uses to compose an answer are from retrieved documents only within the specified date range. \n",
"\n",
"This shows how you can use Timescale Vector to enhance retrieval augmented generation by retrieving documents within time ranges relevant to your query."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Using ANN Search Indexes to Speed Up Queries\n",
"\n",
"You can speed up similarity queries by creating an index on the embedding column. You should only do this once you have ingested a large part of your data.\n",
"\n",
"Timescale Vector supports the following indexes:\n",
"- timescale_vector index (tsv): a disk-ann inspired graph index for fast similarity search (default).\n",
"- pgvector's HNSW index: a hierarchical navigable small world graph index for fast similarity search.\n",
"- pgvector's IVFFLAT index: an inverted file index for fast similarity search.\n",
"\n",
"Important note: In PostgreSQL, each table can only have one index on a particular column. So if you'd like to test the performance of different index types, you can do so either by (1) creating multiple tables with different indexes, (2) creating multiple vector columns in the same table and creating different indexes on each column, or (3) by dropping and recreating the index on the same column and comparing results."
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [],
"source": [
"# Initialize an existing TimescaleVector store\n",
"COLLECTION_NAME = \"timescale_commits\"\n",
"embeddings = OpenAIEmbeddings()\n",
"db = TimescaleVector(\n",
" collection_name=COLLECTION_NAME,\n",
" service_url=SERVICE_URL,\n",
" embedding_function=embeddings,\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the `create_index()` function without additional arguments will create a timescale_vector_index by default, using the default parameters."
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [],
"source": [
"# create an index\n",
"# by default this will create a Timescale Vector (DiskANN) index\n",
"db.create_index()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also specify the parameters for the index. See the Timescale Vector documentation for a full discussion of the different parameters and their effects on performance.\n",
"\n",
"Note: You don't need to specify parameters as we set smart defaults. But you can always specify your own parameters if you want to experiment eek out more performance for your specific dataset."
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [],
"source": [
"# drop the old index\n",
"db.drop_index()\n",
"\n",
"# create an index\n",
"# Note: You don't need to specify m and ef_construction parameters as we set smart defaults.\n",
"db.create_index(index_type=\"tsv\", max_alpha=1.0, num_neighbors=50)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Timescale Vector also supports the HNSW ANN indexing algorithm, as well as the ivfflat ANN indexing algorithm. Simply specify in the `index_type` argument which index you'd like to create, and optionally specify the parameters for the index."
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [],
"source": [
"# drop the old index\n",
"db.drop_index()\n",
"\n",
"# Create an HNSW index\n",
"# Note: You don't need to specify m and ef_construction parameters as we set smart defaults.\n",
"db.create_index(index_type=\"hnsw\", m=16, ef_construction=64)"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [],
"source": [
"# drop the old index\n",
"db.drop_index()\n",
"\n",
"# Create an IVFFLAT index\n",
"# Note: You don't need to specify num_lists and num_records parameters as we set smart defaults.\n",
"db.create_index(index_type=\"ivfflat\", num_lists=20, num_records=1000)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"In general, we recommend using the default timescale vector index, or the HNSW index."
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [],
"source": [
"# drop the old index\n",
"db.drop_index()\n",
"# Create a new timescale vector index\n",
"db.create_index()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Self Querying Retriever with Timescale Vector\n",
"\n",
"Timescale Vector also supports the self-querying retriever functionality, which gives it the ability to query itself. Given a natural language query with a query statement and filters (single or composite), the retriever uses a query constructing LLM chain to write a SQL query and then applies it to the underlying PostgreSQL database in the Timescale Vector vectorstore.\n",
"\n",
"For more on self-querying, [see the docs](/docs/modules/data_connection/retrievers/self_query/)."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate self-querying with Timescale Vector, we'll use the same gitlog dataset from Part 3."
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [],
"source": [
"COLLECTION_NAME = \"timescale_commits\"\n",
"vectorstore = TimescaleVector(\n",
" embedding_function=OpenAIEmbeddings(),\n",
" collection_name=COLLECTION_NAME,\n",
" service_url=SERVICE_URL,\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we'll create our self-querying retriever. To do this we'll need to provide some information upfront about the metadata fields that our documents support and a short description of the document contents."
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [],
"source": [
"from langchain.chains.query_constructor.base import AttributeInfo\n",
"from langchain.retrievers.self_query.base import SelfQueryRetriever\n",
"from langchain_openai import OpenAI\n",
"\n",
"# Give LLM info about the metadata fields\n",
"metadata_field_info = [\n",
" AttributeInfo(\n",
" name=\"id\",\n",
" description=\"A UUID v1 generated from the date of the commit\",\n",
" type=\"uuid\",\n",
" ),\n",
" AttributeInfo(\n",
" name=\"date\",\n",
" description=\"The date of the commit in timestamptz format\",\n",
" type=\"timestamptz\",\n",
" ),\n",
" AttributeInfo(\n",
" name=\"author_name\",\n",
" description=\"The name of the author of the commit\",\n",
" type=\"string\",\n",
" ),\n",
" AttributeInfo(\n",
" name=\"author_email\",\n",
" description=\"The email address of the author of the commit\",\n",
" type=\"string\",\n",
" ),\n",
"]\n",
"document_content_description = \"The git log commit summary containing the commit hash, author, date of commit, change summary and change details\"\n",
"\n",
"# Instantiate the self-query retriever from an LLM\n",
"llm = OpenAI(temperature=0)\n",
"retriever = SelfQueryRetriever.from_llm(\n",
" llm,\n",
" vectorstore,\n",
" document_content_description,\n",
" metadata_field_info,\n",
" enable_limit=True,\n",
" verbose=True,\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's test out the self-querying retriever on our gitlog dataset. \n",
"\n",
"Run the queries below and note how you can specify a query, query with a filter, and query with a composite filter (filters with AND, OR) in natural language and the self-query retriever will translate that query into SQL and perform the search on the Timescale Vector PostgreSQL vectorstore.\n",
"\n",
"This illustrates the power of the self-query retriever. You can use it to perform complex searches over your vectorstore without you or your users having to write any SQL directly!"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/avtharsewrathan/sideprojects2023/timescaleai/tsv-langchain/langchain/libs/langchain/langchain/chains/llm.py:275: UserWarning: The predict_and_parse method is deprecated, instead pass an output parser directly to LLMChain.\n",
" warnings.warn(\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"query='improvements to continuous aggregates' filter=None limit=None\n"
]
},
{
"data": {
"text/plain": [
"[Document(page_content='{\"commit\": \" 35c91204987ccb0161d745af1a39b7eb91bc65a5\", \"author\": \"Fabr\\\\u00edzio de Royes Mello<fabriziomello@gmail.com>\", \"date\": \"Thu Nov 24 13:19:36 2022 -0300\", \"change summary\": \"Add Hierarchical Continuous Aggregates validations\", \"change details\": \"Commit 3749953e introduce Hierarchical Continuous Aggregates (aka Continuous Aggregate on top of another Continuous Aggregate) but it lacks of some basic validations. Validations added during the creation of a Hierarchical Continuous Aggregate: * Forbid create a continuous aggregate with fixed-width bucket on top of a continuous aggregate with variable-width bucket. * Forbid incompatible bucket widths: - should not be equal; - bucket width of the new continuous aggregate should be greater than the source continuous aggregate; - bucket width of the new continuous aggregate should be multiple of the source continuous aggregate. \"}', metadata={'id': 'c98d1c00-6c13-11ed-9bbe-23925ce74d13', 'date': '2022-11-24 13:19:36+-500', 'source': '/Users/avtharsewrathan/sideprojects2023/timescaleai/tsv-langchain/langchain/docs/docs/modules/ts_git_log.json', 'seq_num': 446, 'author_name': 'Fabrízio de Royes Mello', 'commit_hash': ' 35c91204987ccb0161d745af1a39b7eb91bc65a5', 'author_email': 'fabriziomello@gmail.com'}),\n",
" Document(page_content='{\"commit\": \" 3749953e9704e45df8f621607989ada0714ce28d\", \"author\": \"Fabr\\\\u00edzio de Royes Mello<fabriziomello@gmail.com>\", \"date\": \"Wed Oct 5 18:45:40 2022 -0300\", \"change summary\": \"Hierarchical Continuous Aggregates\", \"change details\": \"Enable users create Hierarchical Continuous Aggregates (aka Continuous Aggregates on top of another Continuous Aggregates). With this PR users can create levels of aggregation granularity in Continuous Aggregates making the refresh process even faster. A problem with this feature can be in upper levels we can end up with the \\\\\"average of averages\\\\\". But to get the \\\\\"real average\\\\\" we can rely on \\\\\"stats_aggs\\\\\" TimescaleDB Toolkit function that calculate and store the partials that can be finalized with other toolkit functions like \\\\\"average\\\\\" and \\\\\"sum\\\\\". Closes #1400 \"}', metadata={'id': '0df31a00-44f7-11ed-9794-ebcc1227340f', 'date': '2022-10-5 18:45:40+-500', 'source': '/Users/avtharsewrathan/sideprojects2023/timescaleai/tsv-langchain/langchain/docs/docs/modules/ts_git_log.json', 'seq_num': 470, 'author_name': 'Fabrízio de Royes Mello', 'commit_hash': ' 3749953e9704e45df8f621607989ada0714ce28d', 'author_email': 'fabriziomello@gmail.com'}),\n",
" Document(page_content='{\"commit\": \" a6ff7ba6cc15b280a275e5acd315741ec9c86acc\", \"author\": \"Mats Kindahl<mats@timescale.com>\", \"date\": \"Tue Feb 28 12:04:17 2023 +0100\", \"change summary\": \"Rename columns in old-style continuous aggregates\", \"change details\": \"For continuous aggregates with the old-style partial aggregates renaming columns that are not in the group-by clause will generate an error when upgrading to a later version. The reason is that it is implicitly assumed that the name of the column is the same as for the direct view. This holds true for new-style continous aggregates, but is not always true for old-style continuous aggregates. In particular, columns that are not part of the `GROUP BY` clause can have an internally generated name. This commit fixes that by extracting the name of the column from the partial view and use that when renaming the partial view column and the materialized table column. \"}', metadata={'id': 'a49ace80-b757-11ed-8138-2390fd44ffd9', 'date': '2023-02-28 12:04:17+0140', 'source': '/Users/avtharsewrathan/sideprojects2023/timescaleai/tsv-langchain/langchain/docs/docs/modules/ts_git_log.json', 'seq_num': 294, 'author_name': 'Mats Kindahl', 'commit_hash': ' a6ff7ba6cc15b280a275e5acd315741ec9c86acc', 'author_email': 'mats@timescale.com'}),\n",
" Document(page_content='{\"commit\": \" 5bba74a2ec083728f8e93e09d03d102568fd72b5\", \"author\": \"Fabr\\\\u00edzio de Royes Mello<fabriziomello@gmail.com>\", \"date\": \"Mon Aug 7 19:49:47 2023 -0300\", \"change summary\": \"Relax strong table lock when refreshing a CAGG\", \"change details\": \"When refreshing a Continuous Aggregate we take a table lock on _timescaledb_catalog.continuous_aggs_invalidation_threshold when processing the invalidation logs (the first transaction of the refresh Continuous Aggregate procedure). It means that even two different Continuous Aggregates over two different hypertables will wait each other in the first phase of the refreshing procedure. Also it lead to problems when a pg_dump is running because it take an AccessShareLock on tables so Continuous Aggregate refresh execution will wait until the pg_dump finish. Improved it by relaxing the strong table-level lock to a row-level lock so now the Continuous Aggregate refresh procedure can be executed in multiple sessions with less locks. Fix #3554 \"}', metadata={'id': 'b5583780-3574-11ee-a5ba-2e305874a58f', 'date': '2023-08-7 19:49:47+-500', 'source': '/Users/avtharsewrathan/sideprojects2023/timescaleai/tsv-langchain/langchain/docs/docs/modules/ts_git_log.json', 'seq_num': 27, 'author_name': 'Fabrízio de Royes Mello', 'commit_hash': ' 5bba74a2ec083728f8e93e09d03d102568fd72b5', 'author_email': 'fabriziomello@gmail.com'})]"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# This example specifies a relevant query\n",
"retriever.invoke(\"What are improvements made to continuous aggregates?\")"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"query=' ' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='author_name', value='Sven Klemm') limit=None\n"
]
},
{
"data": {
"text/plain": [
"[Document(page_content='{\"commit\": \" e2e7ae304521b74ac6b3f157a207da047d44ab06\", \"author\": \"Sven Klemm<sven@timescale.com>\", \"date\": \"Fri Mar 3 11:22:06 2023 +0100\", \"change summary\": \"Don\\'t run sanitizer test on individual PRs\", \"change details\": \"Sanitizer tests take a long time to run so we don\\'t want to run them on individual PRs but instead run them nightly and on commits to master. \"}', metadata={'id': '3f401b00-b9ad-11ed-b5ea-a3fd40b9ac16', 'date': '2023-03-3 11:22:06+0140', 'source': '/Users/avtharsewrathan/sideprojects2023/timescaleai/tsv-langchain/langchain/docs/docs/modules/ts_git_log.json', 'seq_num': 295, 'author_name': 'Sven Klemm', 'commit_hash': ' e2e7ae304521b74ac6b3f157a207da047d44ab06', 'author_email': 'sven@timescale.com'}),\n",
" Document(page_content='{\"commit\": \" d8f19e57a04d17593df5f2c694eae8775faddbc7\", \"author\": \"Sven Klemm<sven@timescale.com>\", \"date\": \"Wed Feb 1 08:34:20 2023 +0100\", \"change summary\": \"Bump version of setup-wsl github action\", \"change details\": \"The currently used version pulls in Node.js 12 which is deprecated on github. https://github.blog/changelog/2022-09-22-github-actions-all-actions-will-begin-running-on-node16-instead-of-node12/ \"}', metadata={'id': 'd70de600-a202-11ed-85d6-30b6df240f49', 'date': '2023-02-1 08:34:20+0140', 'source': '/Users/avtharsewrathan/sideprojects2023/timescaleai/tsv-langchain/langchain/docs/docs/modules/ts_git_log.json', 'seq_num': 350, 'author_name': 'Sven Klemm', 'commit_hash': ' d8f19e57a04d17593df5f2c694eae8775faddbc7', 'author_email': 'sven@timescale.com'}),\n",
" Document(page_content='{\"commit\": \" 83b13cf6f73a74656dde9cc6ec6cf76740cddd3c\", \"author\": \"Sven Klemm<sven@timescale.com>\", \"date\": \"Fri Nov 25 08:27:45 2022 +0100\", \"change summary\": \"Use packaged postgres for sqlsmith and coverity CI\", \"change details\": \"The sqlsmith and coverity workflows used the cache postgres build but could not produce a build by themselves and therefore relied on other workflows to produce the cached binaries. This patch changes those workflows to use normal postgres packages instead of custom built postgres to remove that dependency. \"}', metadata={'id': 'a786ae80-6c92-11ed-bd6c-a57bd3348b97', 'date': '2022-11-25 08:27:45+0140', 'source': '/Users/avtharsewrathan/sideprojects2023/timescaleai/tsv-langchain/langchain/docs/docs/modules/ts_git_log.json', 'seq_num': 447, 'author_name': 'Sven Klemm', 'commit_hash': ' 83b13cf6f73a74656dde9cc6ec6cf76740cddd3c', 'author_email': 'sven@timescale.com'}),\n",
" Document(page_content='{\"commit\": \" b1314e63f2ff6151ab5becfb105afa3682286a4d\", \"author\": \"Sven Klemm<sven@timescale.com>\", \"date\": \"Thu Dec 22 12:03:35 2022 +0100\", \"change summary\": \"Fix RPM package test for PG15 on centos 7\", \"change details\": \"Installing PG15 on Centos 7 requires the EPEL repository to satisfy the dependencies. \"}', metadata={'id': '477b1d80-81e8-11ed-9c8c-9b5abbd67c98', 'date': '2022-12-22 12:03:35+0140', 'source': '/Users/avtharsewrathan/sideprojects2023/timescaleai/tsv-langchain/langchain/docs/docs/modules/ts_git_log.json', 'seq_num': 408, 'author_name': 'Sven Klemm', 'commit_hash': ' b1314e63f2ff6151ab5becfb105afa3682286a4d', 'author_email': 'sven@timescale.com'})]"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# This example specifies a filter\n",
"retriever.invoke(\"What commits did Sven Klemm add?\")"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"query='timescaledb_functions' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='author_name', value='Sven Klemm') limit=None\n"
]
},
{
"data": {
"text/plain": [
"[Document(page_content='{\"commit\": \" 04f43335dea11e9c467ee558ad8edfc00c1a45ed\", \"author\": \"Sven Klemm<sven@timescale.com>\", \"date\": \"Thu Apr 6 13:00:00 2023 +0200\", \"change summary\": \"Move aggregate support function into _timescaledb_functions\", \"change details\": \"This patch moves the support functions for histogram, first and last into the _timescaledb_functions schema. Since we alter the schema of the existing functions in upgrade scripts and do not change the aggregates this should work completely transparently for any user objects using those aggregates. \"}', metadata={'id': '2cb47800-d46a-11ed-8f0e-2b624245c561', 'date': '2023-04-6 13:00:00+0320', 'source': '/Users/avtharsewrathan/sideprojects2023/timescaleai/tsv-langchain/langchain/docs/docs/modules/ts_git_log.json', 'seq_num': 233, 'author_name': 'Sven Klemm', 'commit_hash': ' 04f43335dea11e9c467ee558ad8edfc00c1a45ed', 'author_email': 'sven@timescale.com'}),\n",
" Document(page_content='{\"commit\": \" feef9206facc5c5f506661de4a81d96ef059b095\", \"author\": \"Sven Klemm<sven@timescale.com>\", \"date\": \"Fri Mar 31 08:22:57 2023 +0200\", \"change summary\": \"Add _timescaledb_functions schema\", \"change details\": \"Currently internal user objects like chunks and our functions live in the same schema making locking down that schema hard. This patch adds a new schema _timescaledb_functions that is meant to be the schema used for timescaledb internal functions to allow separation of code and chunks or other user objects. \"}', metadata={'id': '7a257680-cf8c-11ed-848c-a515e8687479', 'date': '2023-03-31 08:22:57+0320', 'source': '/Users/avtharsewrathan/sideprojects2023/timescaleai/tsv-langchain/langchain/docs/docs/modules/ts_git_log.json', 'seq_num': 239, 'author_name': 'Sven Klemm', 'commit_hash': ' feef9206facc5c5f506661de4a81d96ef059b095', 'author_email': 'sven@timescale.com'}),\n",
" Document(page_content='{\"commit\": \" 0a66bdb8d36a1879246bd652e4c28500c4b951ab\", \"author\": \"Sven Klemm<sven@timescale.com>\", \"date\": \"Sun Aug 20 22:47:10 2023 +0200\", \"change summary\": \"Move functions to _timescaledb_functions schema\", \"change details\": \"To increase schema security we do not want to mix our own internal objects with user objects. Since chunks are created in the _timescaledb_internal schema our internal functions should live in a different dedicated schema. This patch make the necessary adjustments for the following functions: - to_unix_microseconds(timestamptz) - to_timestamp(bigint) - to_timestamp_without_timezone(bigint) - to_date(bigint) - to_interval(bigint) - interval_to_usec(interval) - time_to_internal(anyelement) - subtract_integer_from_now(regclass, bigint) \"}', metadata={'id': 'bb99db00-3f9a-11ee-a8dc-0b9c1a5a37c4', 'date': '2023-08-20 22:47:10+0320', 'source': '/Users/avtharsewrathan/sideprojects2023/timescaleai/tsv-langchain/langchain/docs/docs/modules/ts_git_log.json', 'seq_num': 41, 'author_name': 'Sven Klemm', 'commit_hash': ' 0a66bdb8d36a1879246bd652e4c28500c4b951ab', 'author_email': 'sven@timescale.com'}),\n",
" Document(page_content='{\"commit\": \" 56ea8b4de93cefc38e002202d8ac96947dcbaa77\", \"author\": \"Sven Klemm<sven@timescale.com>\", \"date\": \"Thu Apr 13 13:16:14 2023 +0200\", \"change summary\": \"Move trigger functions to _timescaledb_functions schema\", \"change details\": \"To increase schema security we do not want to mix our own internal objects with user objects. Since chunks are created in the _timescaledb_internal schema our internal functions should live in a different dedicated schema. This patch make the necessary adjustments for our trigger functions. \"}', metadata={'id': '9a255300-d9ec-11ed-988f-7086c8ca463a', 'date': '2023-04-13 13:16:14+0320', 'source': '/Users/avtharsewrathan/sideprojects2023/timescaleai/tsv-langchain/langchain/docs/docs/modules/ts_git_log.json', 'seq_num': 44, 'author_name': 'Sven Klemm', 'commit_hash': ' 56ea8b4de93cefc38e002202d8ac96947dcbaa77', 'author_email': 'sven@timescale.com'})]"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# This example specifies a query and filter\n",
"retriever.invoke(\"What commits about timescaledb_functions did Sven Klemm add?\")"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"query=' ' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.GTE: 'gte'>, attribute='date', value='2023-07-01T00:00:00Z'), Comparison(comparator=<Comparator.LTE: 'lte'>, attribute='date', value='2023-07-31T23:59:59Z')]) limit=None\n"
]
},
{
"data": {
"text/plain": [
"[Document(page_content='{\"commit\": \" 5cf354e2469ee7e43248bed382a4b49fc7ccfecd\", \"author\": \"Markus Engel<engel@sero-systems.de>\", \"date\": \"Mon Jul 31 11:28:25 2023 +0200\", \"change summary\": \"Fix quoting owners in sql scripts.\", \"change details\": \"When referring to a role from a string type, it must be properly quoted using pg_catalog.quote_ident before it can be casted to regrole. Fixed this, especially in update scripts. \"}', metadata={'id': '99590280-2f84-11ee-915b-5715b2447de4', 'date': '2023-07-31 11:28:25+0320', 'source': '/Users/avtharsewrathan/sideprojects2023/timescaleai/tsv-langchain/langchain/docs/docs/modules/ts_git_log.json', 'seq_num': 76, 'author_name': 'Markus Engel', 'commit_hash': ' 5cf354e2469ee7e43248bed382a4b49fc7ccfecd', 'author_email': 'engel@sero-systems.de'}),\n",
" Document(page_content='{\"commit\": \" 88aaf23ae37fe7f47252b87325eb570aa417c607\", \"author\": \"noctarius aka Christoph Engelbert<me@noctarius.com>\", \"date\": \"Wed Jul 12 14:53:40 2023 +0200\", \"change summary\": \"Allow Replica Identity (Alter Table) on CAGGs (#5868)\", \"change details\": \"This commit is a follow up of #5515, which added support for ALTER TABLE\\\\r ... REPLICA IDENTITY (FULL | INDEX) on hypertables.\\\\r \\\\r This commit allows the execution against materialized hypertables to\\\\r enable update / delete operations on continuous aggregates when logical\\\\r replication in enabled for them.\"}', metadata={'id': '1fcfa200-20b3-11ee-9a18-370561c7cb1a', 'date': '2023-07-12 14:53:40+0320', 'source': '/Users/avtharsewrathan/sideprojects2023/timescaleai/tsv-langchain/langchain/docs/docs/modules/ts_git_log.json', 'seq_num': 96, 'author_name': 'noctarius aka Christoph Engelbert', 'commit_hash': ' 88aaf23ae37fe7f47252b87325eb570aa417c607', 'author_email': 'me@noctarius.com'}),\n",
" Document(page_content='{\"commit\": \" d5268c36fbd23fa2a93c0371998286e8688247bb\", \"author\": \"Alexander Kuzmenkov<36882414+akuzm@users.noreply.github.com>\", \"date\": \"Fri Jul 28 13:35:05 2023 +0200\", \"change summary\": \"Fix SQLSmith workflow\", \"change details\": \"The build was failing because it was picking up the wrong version of Postgres. Remove it. \"}', metadata={'id': 'cc0fba80-2d3a-11ee-ae7d-36dc25cad3b8', 'date': '2023-07-28 13:35:05+0320', 'source': '/Users/avtharsewrathan/sideprojects2023/timescaleai/tsv-langchain/langchain/docs/docs/modules/ts_git_log.json', 'seq_num': 82, 'author_name': 'Alexander Kuzmenkov', 'commit_hash': ' d5268c36fbd23fa2a93c0371998286e8688247bb', 'author_email': '36882414+akuzm@users.noreply.github.com'}),\n",
" Document(page_content='{\"commit\": \" 61c288ec5eb966a9b4d8ed90cd026ffc5e3543c9\", \"author\": \"Lakshmi Narayanan Sreethar<lakshmi@timescale.com>\", \"date\": \"Tue Jul 25 16:11:35 2023 +0530\", \"change summary\": \"Fix broken CI after PG12 removal\", \"change details\": \"The commit cdea343cc updated the gh_matrix_builder.py script but failed to import PG_LATEST variable into the script thus breaking the CI. Import that variable to fix the CI tests. \"}', metadata={'id': 'd3835980-2ad7-11ee-b98d-c4e3092e076e', 'date': '2023-07-25 16:11:35+0850', 'source': '/Users/avtharsewrathan/sideprojects2023/timescaleai/tsv-langchain/langchain/docs/docs/modules/ts_git_log.json', 'seq_num': 84, 'author_name': 'Lakshmi Narayanan Sreethar', 'commit_hash': ' 61c288ec5eb966a9b4d8ed90cd026ffc5e3543c9', 'author_email': 'lakshmi@timescale.com'})]"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# This example specifies a time-based filter\n",
"retriever.invoke(\"What commits were added in July 2023?\")"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"query='hierarchical continuous aggregates' filter=None limit=2\n"
]
},
{
"data": {
"text/plain": [
"[Document(page_content='{\"commit\": \" 35c91204987ccb0161d745af1a39b7eb91bc65a5\", \"author\": \"Fabr\\\\u00edzio de Royes Mello<fabriziomello@gmail.com>\", \"date\": \"Thu Nov 24 13:19:36 2022 -0300\", \"change summary\": \"Add Hierarchical Continuous Aggregates validations\", \"change details\": \"Commit 3749953e introduce Hierarchical Continuous Aggregates (aka Continuous Aggregate on top of another Continuous Aggregate) but it lacks of some basic validations. Validations added during the creation of a Hierarchical Continuous Aggregate: * Forbid create a continuous aggregate with fixed-width bucket on top of a continuous aggregate with variable-width bucket. * Forbid incompatible bucket widths: - should not be equal; - bucket width of the new continuous aggregate should be greater than the source continuous aggregate; - bucket width of the new continuous aggregate should be multiple of the source continuous aggregate. \"}', metadata={'id': 'c98d1c00-6c13-11ed-9bbe-23925ce74d13', 'date': '2022-11-24 13:19:36+-500', 'source': '/Users/avtharsewrathan/sideprojects2023/timescaleai/tsv-langchain/langchain/docs/docs/modules/ts_git_log.json', 'seq_num': 446, 'author_name': 'Fabrízio de Royes Mello', 'commit_hash': ' 35c91204987ccb0161d745af1a39b7eb91bc65a5', 'author_email': 'fabriziomello@gmail.com'}),\n",
" Document(page_content='{\"commit\": \" 3749953e9704e45df8f621607989ada0714ce28d\", \"author\": \"Fabr\\\\u00edzio de Royes Mello<fabriziomello@gmail.com>\", \"date\": \"Wed Oct 5 18:45:40 2022 -0300\", \"change summary\": \"Hierarchical Continuous Aggregates\", \"change details\": \"Enable users create Hierarchical Continuous Aggregates (aka Continuous Aggregates on top of another Continuous Aggregates). With this PR users can create levels of aggregation granularity in Continuous Aggregates making the refresh process even faster. A problem with this feature can be in upper levels we can end up with the \\\\\"average of averages\\\\\". But to get the \\\\\"real average\\\\\" we can rely on \\\\\"stats_aggs\\\\\" TimescaleDB Toolkit function that calculate and store the partials that can be finalized with other toolkit functions like \\\\\"average\\\\\" and \\\\\"sum\\\\\". Closes #1400 \"}', metadata={'id': '0df31a00-44f7-11ed-9794-ebcc1227340f', 'date': '2022-10-5 18:45:40+-500', 'source': '/Users/avtharsewrathan/sideprojects2023/timescaleai/tsv-langchain/langchain/docs/docs/modules/ts_git_log.json', 'seq_num': 470, 'author_name': 'Fabrízio de Royes Mello', 'commit_hash': ' 3749953e9704e45df8f621607989ada0714ce28d', 'author_email': 'fabriziomello@gmail.com'})]"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# This example specifies a query and a LIMIT value\n",
"retriever.invoke(\"What are two commits about hierarchical continuous aggregates?\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Working with an existing TimescaleVector vectorstore\n",
"\n",
"In the examples above, we created a vectorstore from a collection of documents. However, often we want to work insert data into and query data from an existing vectorstore. Let's see how to initialize, add documents to, and query an existing collection of documents in a TimescaleVector vector store.\n",
"\n",
"To work with an existing Timescale Vector store, we need to know the name of the table we want to query (`COLLECTION_NAME`) and the URL of the cloud PostgreSQL database (`SERVICE_URL`)."
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [],
"source": [
"# Initialize the existing\n",
"COLLECTION_NAME = \"timescale_commits\"\n",
"embeddings = OpenAIEmbeddings()\n",
"vectorstore = TimescaleVector(\n",
" collection_name=COLLECTION_NAME,\n",
" service_url=SERVICE_URL,\n",
" embedding_function=embeddings,\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"To load new data into the table, we use the `add_document()` function. This function takes a list of documents and a list of metadata. The metadata must contain a unique id for each document. \n",
"\n",
"If you want your documents to be associated with the current date and time, you do not need to create a list of ids. A uuid will be automatically generated for each document.\n",
"\n",
"If you want your documents to be associated with a past date and time, you can create a list of ids using the `uuid_from_time` function in the `timecale-vector` python library, as shown in Section 2 above. This function takes a datetime object and returns a uuid with the date and time encoded in the uuid."
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['a34f2b8a-53d7-11ee-8cc3-de1e4b2a0118']"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Add documents to a collection in TimescaleVector\n",
"ids = vectorstore.add_documents([Document(page_content=\"foo\")])\n",
"ids"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [],
"source": [
"# Query the vectorstore for similar documents\n",
"docs_with_score = vectorstore.similarity_search_with_score(\"foo\")"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(Document(page_content='foo', metadata={}), 5.006789860928507e-06)"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs_with_score[0]"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(Document(page_content='{\"commit\": \" 00b566dfe478c11134bcf1e7bcf38943e7fafe8f\", \"author\": \"Fabr\\\\u00edzio de Royes Mello<fabriziomello@gmail.com>\", \"date\": \"Mon Mar 6 15:51:03 2023 -0300\", \"change summary\": \"Remove unused functions\", \"change details\": \"We don\\'t use `ts_catalog_delete[_only]` functions anywhere and instead we rely on `ts_catalog_delete_tid[_only]` functions so removing it from our code base. \"}', metadata={'id': 'd7f5c580-bc4f-11ed-9712-ffa0126a201a', 'date': '2023-03-6 15:51:03+-500', 'source': '/Users/avtharsewrathan/sideprojects2023/timescaleai/tsv-langchain/langchain/docs/docs/modules/ts_git_log.json', 'seq_num': 285, 'author_name': 'Fabrízio de Royes Mello', 'commit_hash': ' 00b566dfe478c11134bcf1e7bcf38943e7fafe8f', 'author_email': 'fabriziomello@gmail.com'}),\n",
" 0.23607668446580354)"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs_with_score[1]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Deleting Data \n",
"\n",
"You can delete data by uuid or by a filter on the metadata."
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ids = vectorstore.add_documents([Document(page_content=\"Bar\")])\n",
"\n",
"vectorstore.delete(ids)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Deleting using metadata is especially useful if you want to periodically update information scraped from a particular source, or particular date or some other metadata attribute."
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['c6367004-53d7-11ee-8cc3-de1e4b2a0118']"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vectorstore.add_documents(\n",
" [Document(page_content=\"Hello World\", metadata={\"source\": \"www.example.com/hello\"})]\n",
")\n",
"vectorstore.add_documents(\n",
" [Document(page_content=\"Adios\", metadata={\"source\": \"www.example.com/adios\"})]\n",
")\n",
"\n",
"vectorstore.delete_by_metadata({\"source\": \"www.example.com/adios\"})\n",
"\n",
"vectorstore.add_documents(\n",
" [\n",
" Document(\n",
" page_content=\"Adios, but newer!\",\n",
" metadata={\"source\": \"www.example.com/adios\"},\n",
" )\n",
" ]\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Overriding a vectorstore\n",
"\n",
"If you have an existing collection, you override it by doing `from_documents` and setting `pre_delete_collection` = True"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"db = TimescaleVector.from_documents(\n",
" documents=docs,\n",
" embedding=embeddings,\n",
" collection_name=COLLECTION_NAME,\n",
" service_url=SERVICE_URL,\n",
" pre_delete_collection=True,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"docs_with_score = db.similarity_search_with_score(\"foo\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"docs_with_score[0]"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
}
},
"nbformat": 4,
"nbformat_minor": 4
}