From 8021d2a2abeabef75cd63c3ed6972269fd379233 Mon Sep 17 00:00:00 2001 From: Rohan Aggarwal Date: Fri, 3 May 2024 20:15:35 -0700 Subject: [PATCH] community[minor]: Oraclevs integration (#21123) Thank you for contributing to LangChain! - Oracle AI Vector Search Oracle AI Vector Search is designed for Artificial Intelligence (AI) workloads and allows you to query data based on semantics rather than keywords. One of the biggest benefits of Oracle AI Vector Search is that semantic search on unstructured data can be combined with relational search on business data in one single system. This is not only powerful but also significantly more effective because you don't need to add a specialized vector database, eliminating the pain of data fragmentation between multiple systems. This pull request adds the following functionalities: Oracle AI Vector Search : Vector Store Oracle AI Vector Search : Document Loader Oracle AI Vector Search : Document Splitter Oracle AI Vector Search : Summary Oracle AI Vector Search : Oracle Embeddings - We have added unit tests and have our own local unit test suite that verifies all the code is correct. We have made sure to add guides for each of the components and one end-to-end guide that shows how the entire thing runs. - We have made sure that make format and make lint run clean. Additional guidelines: - Make sure optional dependencies are imported within a function. - Please do not add dependencies to pyproject.toml files (even optional ones) unless they are required for unit tests. - Most PRs should not touch more than one package. - Changes should be backwards compatible. - If you are adding something to community, do not re-import it in langchain. If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, hwchase17. 
--------- Co-authored-by: skmishraoracle Co-authored-by: hroyofc Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com> Co-authored-by: Bagatur --- cookbook/README.md | 1 + cookbook/oracleai_demo.ipynb | 872 ++++++++++++++++ .../document_loaders/oracleai.ipynb | 236 +++++ docs/docs/integrations/providers/oracleai.mdx | 65 ++ .../text_embedding/oracleai.ipynb | 262 +++++ docs/docs/integrations/tools/oracleai.ipynb | 174 ++++ .../integrations/vectorstores/oracle.ipynb | 469 +++++++++ .../document_loaders/__init__.py | 8 + .../document_loaders/oracleai.py | 447 ++++++++ .../embeddings/__init__.py | 5 + .../embeddings/oracleai.py | 182 ++++ .../langchain_community/utilities/__init__.py | 5 + .../langchain_community/utilities/oracleai.py | 201 ++++ .../vectorstores/__init__.py | 5 + .../vectorstores/oraclevs.py | 930 +++++++++++++++++ libs/community/poetry.lock | 54 +- libs/community/pyproject.toml | 4 +- .../document_loaders/test_oracleds.py | 447 ++++++++ .../vectorstores/test_oraclevs.py | 955 ++++++++++++++++++ .../document_loaders/test_imports.py | 2 + .../unit_tests/embeddings/test_imports.py | 1 + .../unit_tests/utilities/test_imports.py | 1 + .../unit_tests/vectorstores/test_imports.py | 1 + .../vectorstores/test_indexing_docs.py | 1 + .../vectorstores/test_public_api.py | 1 + 25 files changed, 5325 insertions(+), 4 deletions(-) create mode 100644 cookbook/oracleai_demo.ipynb create mode 100644 docs/docs/integrations/document_loaders/oracleai.ipynb create mode 100644 docs/docs/integrations/providers/oracleai.mdx create mode 100644 docs/docs/integrations/text_embedding/oracleai.ipynb create mode 100644 docs/docs/integrations/tools/oracleai.ipynb create mode 100644 docs/docs/integrations/vectorstores/oracle.ipynb create mode 100644 libs/community/langchain_community/document_loaders/oracleai.py create mode 100644 libs/community/langchain_community/embeddings/oracleai.py create mode 100644 libs/community/langchain_community/utilities/oracleai.py create mode 100644 libs/community/langchain_community/vectorstores/oraclevs.py create mode 100644 libs/community/tests/integration_tests/document_loaders/test_oracleds.py create mode 100644 libs/community/tests/integration_tests/vectorstores/test_oraclevs.py diff --git a/cookbook/README.md b/cookbook/README.md index 87437895e8..f8724c6a17 100644 --- a/cookbook/README.md +++ b/cookbook/README.md @@ -57,3 +57,4 @@ Notebook | Description [two_agent_debate_tools.ipynb](https://github.com/langchain-ai/langchain/tree/master/cookbook/two_agent_debate_tools.ipynb) | Simulate multi-agent dialogues where the agents can utilize various tools. [two_player_dnd.ipynb](https://github.com/langchain-ai/langchain/tree/master/cookbook/two_player_dnd.ipynb) | Simulate a two-player dungeons & dragons game, where a dialogue simulator class is used to coordinate the dialogue between the protagonist and the dungeon master. [wikibase_agent.ipynb](https://github.com/langchain-ai/langchain/tree/master/cookbook/wikibase_agent.ipynb) | Create a simple wikibase agent that utilizes sparql generation, with testing done on http://wikidata.org. +[oracleai_demo.ipynb](https://github.com/langchain-ai/langchain/tree/master/cookbook/oracleai_demo.ipynb) | This guide outlines how to utilize Oracle AI Vector Search alongside LangChain for an end-to-end RAG pipeline, providing step-by-step examples. 
The process includes loading documents from various sources using OracleDocLoader, summarizing them either within or outside the database with OracleSummary, and generating embeddings similarly through OracleEmbeddings. It also covers chunking documents according to specific requirements using Advanced Oracle Capabilities from OracleTextSplitter, and finally, storing and indexing these documents in a Vector Store for querying with OracleVS. \ No newline at end of file diff --git a/cookbook/oracleai_demo.ipynb b/cookbook/oracleai_demo.ipynb new file mode 100644 index 0000000000..7e49408689 --- /dev/null +++ b/cookbook/oracleai_demo.ipynb @@ -0,0 +1,872 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Oracle AI Vector Search with Document Processing\n", + "Oracle AI Vector Search is designed for Artificial Intelligence (AI) workloads and allows you to query data based on semantics rather than keywords.\n", + "One of the biggest benefits of Oracle AI Vector Search is that semantic search on unstructured data can be combined with relational search on business data in one single system. This is not only powerful but also significantly more effective because you don't need to add a specialized vector database, eliminating the pain of data fragmentation between multiple systems.\n", + "\n", + "In addition, because Oracle has been building database technologies for so long, your vectors can benefit from all of Oracle Database's most powerful features, like the following:\n", + "\n", + " * Partitioning Support\n", + " * Real Application Clusters scalability\n", + " * Exadata smart scans\n", + " * Shard processing across geographically distributed databases\n", + " * Transactions\n", + " * Parallel SQL\n", + " * Disaster recovery\n", + " * Security\n", + " * Oracle Machine Learning\n", + " * Oracle Graph Database\n", + " * Oracle Spatial and Graph\n", + " * Oracle Blockchain\n", + " * JSON\n", + "\n", + "This guide demonstrates how Oracle AI Vector Search can be used with LangChain to serve an end-to-end RAG pipeline. This guide goes through examples of:\n", + "\n", + " * Loading the documents from various sources using OracleDocLoader\n", + " * Summarizing them within/outside the database using OracleSummary\n", + " * Generating embeddings for them within/outside the database using OracleEmbeddings\n", + " * Chunking them according to different requirements using Advanced Oracle Capabilities from OracleTextSplitter\n", + " * Storing and indexing them in a vector store, and querying them with OracleVS" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Prerequisites\n", + "\n", + "Please install the Oracle Python client driver to use LangChain with Oracle AI Vector Search. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# pip install oracledb" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Create Demo User\n", + "First, create a demo user with all the required privileges. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Connection successful!\n", + "User setup done!\n" + ] + } + ], + "source": [ + "import sys\n", + "\n", + "import oracledb\n", + "\n", + "# please update with your username, password, hostname and service_name\n", + "# please make sure this user has sufficient privileges to perform all below\n", + "username = \"\"\n", + "password = \"\"\n", + "dsn = \"\"\n", + "\n", + "try:\n", + " conn = oracledb.connect(user=username, password=password, dsn=dsn)\n", + " print(\"Connection successful!\")\n", + "\n", + " cursor = conn.cursor()\n", + " cursor.execute(\n", + " \"\"\"\n", + " begin\n", + " -- drop user\n", + " begin\n", + " execute immediate 'drop user testuser cascade';\n", + " exception\n", + " when others then\n", + " dbms_output.put_line('Error setting up user.');\n", + " end;\n", + " execute immediate 'create user testuser identified by testuser';\n", + " execute immediate 'grant connect, unlimited tablespace, create credential, create procedure, create any index to testuser';\n", + " execute immediate 'create or replace directory DEMO_PY_DIR as ''/scratch/hroy/view_storage/hroy_devstorage/demo/orachain''';\n", + " execute immediate 'grant read, write on directory DEMO_PY_DIR to public';\n", + " execute immediate 'grant create mining model to testuser';\n", + "\n", + " -- network access\n", + " begin\n", + " DBMS_NETWORK_ACL_ADMIN.APPEND_HOST_ACE(\n", + " host => '*',\n", + " ace => xs$ace_type(privilege_list => xs$name_list('connect'),\n", + " principal_name => 'testuser',\n", + " principal_type => xs_acl.ptype_db));\n", + " end;\n", + " end;\n", + " \"\"\"\n", + " )\n", + " print(\"User setup done!\")\n", + " cursor.close()\n", + " conn.close()\n", + "except Exception as e:\n", + " print(\"User setup failed!\")\n", + " cursor.close()\n", + " conn.close()\n", + " sys.exit(1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Process Documents using Oracle AI\n", + "Let's think about a scenario that the users have some documents in Oracle Database or in a file system. They want to use the data for Oracle AI Vector Search using Langchain.\n", + "\n", + "For that, the users need to do some document preprocessing. The first step would be to read the documents, generate their summary(if needed) and then chunk/split them if needed. After that, they need to generate the embeddings for those chunks and store into Oracle AI Vector Store. Finally, the users will perform some semantic queries on those data. \n", + "\n", + "Oracle AI Vector Search Langchain library provides a range of document processing functionalities including document loading, splitting, generating summary and embeddings.\n", + "\n", + "In the following sections, we will go through how to use Oracle AI Langchain APIs to achieve each of these functionalities individually. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Connect to Demo User\n", + "The following sample code will show how to connect to Oracle Database. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Connection successful!\n" + ] + } + ], + "source": [ + "import sys\n", + "\n", + "import oracledb\n", + "\n", + "# please update with your username, password, hostname and service_name\n", + "username = \"\"\n", + "password = \"\"\n", + "dsn = \"\"\n", + "\n", + "try:\n", + " conn = oracledb.connect(user=username, password=password, dsn=dsn)\n", + " print(\"Connection successful!\")\n", + "except Exception as e:\n", + " print(\"Connection failed!\")\n", + " sys.exit(1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Populate a Demo Table\n", + "Create a demo table and insert some sample documents." + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Table created and populated.\n" + ] + } + ], + "source": [ + "try:\n", + " cursor = conn.cursor()\n", + "\n", + " drop_table_sql = \"\"\"drop table demo_tab\"\"\"\n", + " cursor.execute(drop_table_sql)\n", + "\n", + " create_table_sql = \"\"\"create table demo_tab (id number, data clob)\"\"\"\n", + " cursor.execute(create_table_sql)\n", + "\n", + " insert_row_sql = \"\"\"insert into demo_tab values (:1, :2)\"\"\"\n", + " rows_to_insert = [\n", + " (\n", + " 1,\n", + " \"If the answer to any preceding questions is yes, then the database stops the search and allocates space from the specified tablespace; otherwise, space is allocated from the database default shared temporary tablespace.\",\n", + " ),\n", + " (\n", + " 2,\n", + " \"A tablespace can be online (accessible) or offline (not accessible) whenever the database is open.\\nA tablespace is usually online so that its data is available to users. The SYSTEM tablespace and temporary tablespaces cannot be taken offline.\",\n", + " ),\n", + " (\n", + " 3,\n", + " \"The database stores LOBs differently from other data types. Creating a LOB column implicitly creates a LOB segment and a LOB index. The tablespace containing the LOB segment and LOB index, which are always stored together, may be different from the tablespace containing the table.\\nSometimes the database can store small amounts of LOB data in the table itself rather than in a separate LOB segment.\",\n", + " ),\n", + " ]\n", + " cursor.executemany(insert_row_sql, rows_to_insert)\n", + "\n", + " conn.commit()\n", + "\n", + " print(\"Table created and populated.\")\n", + " cursor.close()\n", + "except Exception as e:\n", + " print(\"Table creation failed.\")\n", + " cursor.close()\n", + " conn.close()\n", + " sys.exit(1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "Now that we have a demo user and a demo table with some data, we just need to do one more setup. For embedding and summary, we have a few provider options that the users can choose from such as database, 3rd party providers like ocigenai, huggingface, openai, etc. If the users choose to use 3rd party provider, they need to create a credential with corresponding authentication information. On the other hand, if the users choose to use 'database' as provider, they need to load an onnx model to Oracle Database for embeddings; however, for summary, they don't need to do anything." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Load ONNX Model\n", + "\n", + "To generate embeddings, Oracle provides a few provider options for users to choose from. The users can choose the 'database' provider or third-party providers like OCIGENAI, HuggingFace, etc.\n", + "\n", + "***Note*** If the users choose the 'database' option, they need to load an ONNX model into Oracle Database. They do not need to load an ONNX model if they choose a third-party provider to generate embeddings.\n", + "\n", + "One of the core benefits of using an ONNX model is that the users do not need to transfer their data to a third party to generate embeddings. Also, since it does not involve any network or REST API calls, it may provide better performance.\n", + "\n", + "Here is the sample code to load an ONNX model to Oracle Database:" + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "ONNX model loaded.\n" + ] + } + ], + "source": [ + "from langchain_community.embeddings.oracleai import OracleEmbeddings\n", + "\n", + "# please update with your related information\n", + "# make sure that you have onnx file in the system\n", + "onnx_dir = \"DEMO_PY_DIR\"\n", + "onnx_file = \"tinybert.onnx\"\n", + "model_name = \"demo_model\"\n", + "\n", + "try:\n", + " OracleEmbeddings.load_onnx_model(conn, onnx_dir, onnx_file, model_name)\n", + " print(\"ONNX model loaded.\")\n", + "except Exception as e:\n", + " print(\"ONNX model loading failed!\")\n", + " sys.exit(1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Create Credential\n", + "\n", + "On the other hand, if the users choose a third-party provider to generate embeddings and summaries, they need to create a credential to access that provider's endpoints.\n", + "\n", + "***Note:*** The users do not need to create any credential if they choose the 'database' provider to generate embeddings and summaries. Should the users choose a third-party provider, they need to create a credential for the provider they want to use. \n", + "\n", + "Here is a sample:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# any error here will simply propagate; a try/except that only re-raises is unnecessary\n", + "cursor = conn.cursor()\n", + "cursor.execute(\n", + " \"\"\"\n", + " declare\n", + " jo json_object_t;\n", + " begin\n", + " -- HuggingFace\n", + " dbms_vector_chain.drop_credential(credential_name => 'HF_CRED');\n", + " jo := json_object_t();\n", + " jo.put('access_token', '');\n", + " dbms_vector_chain.create_credential(\n", + " credential_name => 'HF_CRED',\n", + " params => json(jo.to_string));\n", + "\n", + " -- OCIGENAI\n", + " dbms_vector_chain.drop_credential(credential_name => 'OCI_CRED');\n", + " jo := json_object_t();\n", + " jo.put('user_ocid','');\n", + " jo.put('tenancy_ocid','');\n", + " jo.put('compartment_ocid','');\n", + " jo.put('private_key','');\n", + " jo.put('fingerprint','');\n", + " dbms_vector_chain.create_credential(\n", + " credential_name => 'OCI_CRED',\n", + " params => json(jo.to_string));\n", + " end;\n", + " \"\"\"\n", + ")\n", + "cursor.close()\n", + "print(\"Credentials created.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Load Documents\n", + "The users can load the documents from Oracle Database or a file system or both. 
They just need to set the loader parameters accordingly. Please refer to the Oracle AI Vector Search Guide for complete information about these parameters.\n", + "\n", + "The main benefit of using OracleDocLoader is that it can handle 150+ different file formats. You don't need to use different types of loaders for different file formats. Here is the list of formats that we support: [Oracle Text Supported Document Formats](https://docs.oracle.com/en/database/oracle/oracle-database/23/ccref/oracle-text-supported-document-formats.html)\n", + "\n", + "The following sample code will show how to do that:" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Number of docs loaded: 3\n" + ] + } + ], + "source": [ + "from langchain_community.document_loaders.oracleai import OracleDocLoader\n", + "from langchain_core.documents import Document\n", + "\n", + "# loading from Oracle Database table\n", + "# make sure you have the table with this specification\n", + "loader_params = {\n", + " \"owner\": \"testuser\",\n", + " \"tablename\": \"demo_tab\",\n", + " \"colname\": \"data\",\n", + "}\n", + "\n", + "\"\"\" load the docs \"\"\"\n", + "loader = OracleDocLoader(conn=conn, params=loader_params)\n", + "docs = loader.load()\n", + "\n", + "\"\"\" verify \"\"\"\n", + "print(f\"Number of docs loaded: {len(docs)}\")\n", + "# print(f\"Document-0: {docs[0].page_content}\") # content" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Generate Summary\n", + "Now that the users have loaded the documents, they may want to generate a summary for each document. The Oracle AI Vector Search LangChain library provides an API to do that. There are a few summary generation provider options, including Database, OCIGENAI, HuggingFace, and so on. The users can choose their preferred provider to generate a summary. Like before, they just need to set the summary parameters accordingly. Please refer to the Oracle AI Vector Search Guide for complete information about these parameters." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "***Note:*** The users may need to set a proxy if they want to use a third-party summary generation provider other than Oracle's in-house (and default) provider, 'database'. If you don't have a proxy, please remove the proxy parameter when you instantiate OracleSummary."
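For comparison, a third-party summary configuration has the same shape but points at the provider's endpoint; the sketch below mirrors the 'ocigenai' parameters shown later in the summary guide (the credential 'OCI_CRED' must already exist, and the URL/model values are illustrative):

```python
# Summary parameters for a third-party provider (illustrative values);
# 'OCI_CRED' refers to a credential created earlier with dbms_vector_chain.
summary_params_ocigenai = {
    "provider": "ocigenai",
    "credential_name": "OCI_CRED",
    "url": "https://inference.generativeai.us-chicago-1.oci.oraclecloud.com/20231130/actions/summarizeText",
    "model": "cohere.command",
}
```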
+ ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [], + "source": [ + "# proxy to be used when we instantiate summary and embedder object\n", + "proxy = \"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following sample code will show how to generate a summary:" + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Number of Summaries: 3\n" + ] + } + ], + "source": [ + "from langchain_community.utilities.oracleai import OracleSummary\n", + "from langchain_core.documents import Document\n", + "\n", + "# using 'database' provider\n", + "summary_params = {\n", + " \"provider\": \"database\",\n", + " \"glevel\": \"S\",\n", + " \"numParagraphs\": 1,\n", + " \"language\": \"english\",\n", + "}\n", + "\n", + "# get the summary instance\n", + "# Remove proxy if not required\n", + "summ = OracleSummary(conn=conn, params=summary_params, proxy=proxy)\n", + "\n", + "list_summary = []\n", + "for doc in docs:\n", + " summary = summ.get_summary(doc.page_content)\n", + " list_summary.append(summary)\n", + "\n", + "\"\"\" verify \"\"\"\n", + "print(f\"Number of Summaries: {len(list_summary)}\")\n", + "# print(f\"Summary-0: {list_summary[0]}\") #content" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Split Documents\n", + "The documents can come in different sizes: small, medium, large, or very large. The users typically want to split/chunk their documents into smaller pieces to generate embeddings. There are many different splitting customizations the users can apply. Please refer to the Oracle AI Vector Search Guide for complete information about these parameters.\n", + "\n", + "The following sample code will show how to do that:" + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Number of Chunks: 3\n" + ] + } + ], + "source": [ + "from langchain_community.document_loaders.oracleai import OracleTextSplitter\n", + "from langchain_core.documents import Document\n", + "\n", + "# split by default parameters\n", + "splitter_params = {\"normalize\": \"all\"}\n", + "\n", + "\"\"\" get the splitter instance \"\"\"\n", + "splitter = OracleTextSplitter(conn=conn, params=splitter_params)\n", + "\n", + "list_chunks = []\n", + "for doc in docs:\n", + " chunks = splitter.split_text(doc.page_content)\n", + " list_chunks.extend(chunks)\n", + "\n", + "\"\"\" verify \"\"\"\n", + "print(f\"Number of Chunks: {len(list_chunks)}\")\n", + "# print(f\"Chunk-0: {list_chunks[0]}\") # content" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Generate Embeddings\n", + "Now that the documents are chunked as per requirements, the users may want to generate embeddings for these chunks. Oracle AI Vector Search provides a number of ways to generate embeddings. The users can load an ONNX embedding model to Oracle Database and use it to generate embeddings, or use third-party API endpoints to generate embeddings. Please refer to the Oracle AI Vector Search Guide for complete information about these parameters." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "***Note:*** The users may need to set a proxy if they want to use a third-party embedding generation provider other than the 'database' provider (i.e., an ONNX model)."
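Likewise, a third-party embedding configuration only differs in its parameters; this sketch mirrors the 'ocigenai' shape from the embeddings guide (credential, URL, and model values are illustrative):

```python
# Embedding parameters for a third-party provider (illustrative values);
# the 'OCI_CRED' credential must have been created beforehand.
embedder_params_ocigenai = {
    "provider": "ocigenai",
    "credential_name": "OCI_CRED",
    "url": "https://inference.generativeai.us-chicago-1.oci.oraclecloud.com/20231130/actions/embedText",
    "model": "cohere.embed-english-light-v3.0",
}
```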
+ ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "# proxy to be used when we instantiate summary and embedder object\n", + "proxy = \"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following sample code will show how to generate embeddings:" + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Number of embeddings: 3\n" + ] + } + ], + "source": [ + "from langchain_community.embeddings.oracleai import OracleEmbeddings\n", + "from langchain_core.documents import Document\n", + "\n", + "# using ONNX model loaded to Oracle Database\n", + "embedder_params = {\"provider\": \"database\", \"model\": \"demo_model\"}\n", + "\n", + "# get the embedding instance\n", + "# Remove proxy if not required\n", + "embedder = OracleEmbeddings(conn=conn, params=embedder_params, proxy=proxy)\n", + "\n", + "embeddings = []\n", + "for doc in docs:\n", + " chunks = splitter.split_text(doc.page_content)\n", + " for chunk in chunks:\n", + " embed = embedder.embed_query(chunk)\n", + " embeddings.append(embed)\n", + "\n", + "\"\"\" verify \"\"\"\n", + "print(f\"Number of embeddings: {len(embeddings)}\")\n", + "# print(f\"Embedding-0: {embeddings[0]}\") # content" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create Oracle AI Vector Store\n", + "Now that you know how to use the Oracle AI LangChain library APIs individually to process the documents, let us show how to integrate with Oracle AI Vector Store to facilitate semantic search." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "First, let's import all the dependencies." + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "metadata": {}, + "outputs": [], + "source": [ + "import sys\n", + "\n", + "import oracledb\n", + "from langchain_community.document_loaders.oracleai import (\n", + " OracleDocLoader,\n", + " OracleTextSplitter,\n", + ")\n", + "from langchain_community.embeddings.oracleai import OracleEmbeddings\n", + "from langchain_community.utilities.oracleai import OracleSummary\n", + "from langchain_community.vectorstores import oraclevs\n", + "from langchain_community.vectorstores.oraclevs import OracleVS\n", + "from langchain_community.vectorstores.utils import DistanceStrategy\n", + "from langchain_core.documents import Document" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, let's combine all the document processing stages. 
Here is the sample code:" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Connection successful!\n", + "ONNX model loaded.\n", + "Number of total chunks with metadata: 3\n" + ] + } + ], + "source": [ + "\"\"\"\n", + "In this sample example, we will use the 'database' provider for both summary and embeddings.\n", + "So, we don't need to do the following:\n", + " - set proxy for 3rd party providers\n", + " - create credential for 3rd party providers\n", + "\n", + "If you choose a 3rd party provider,\n", + "please follow the necessary steps for proxy and credential.\n", + "\"\"\"\n", + "\n", + "# oracle connection\n", + "# please update with your username, password, hostname, and service_name\n", + "username = \"\"\n", + "password = \"\"\n", + "dsn = \"\"\n", + "\n", + "try:\n", + " conn = oracledb.connect(user=username, password=password, dsn=dsn)\n", + " print(\"Connection successful!\")\n", + "except Exception as e:\n", + " print(\"Connection failed!\")\n", + " sys.exit(1)\n", + "\n", + "\n", + "# load onnx model\n", + "# please update with your related information\n", + "onnx_dir = \"DEMO_PY_DIR\"\n", + "onnx_file = \"tinybert.onnx\"\n", + "model_name = \"demo_model\"\n", + "try:\n", + " OracleEmbeddings.load_onnx_model(conn, onnx_dir, onnx_file, model_name)\n", + " print(\"ONNX model loaded.\")\n", + "except Exception as e:\n", + " print(\"ONNX model loading failed!\")\n", + " sys.exit(1)\n", + "\n", + "\n", + "# params\n", + "# please update necessary fields with related information\n", + "loader_params = {\n", + " \"owner\": \"testuser\",\n", + " \"tablename\": \"demo_tab\",\n", + " \"colname\": \"data\",\n", + "}\n", + "summary_params = {\n", + " \"provider\": \"database\",\n", + " \"glevel\": \"S\",\n", + " \"numParagraphs\": 1,\n", + " \"language\": \"english\",\n", + "}\n", + "splitter_params = {\"normalize\": \"all\"}\n", + "embedder_params = {\"provider\": \"database\", \"model\": \"demo_model\"}\n", + "\n", + "# instantiate loader, summary, splitter, and embedder\n", + "loader = OracleDocLoader(conn=conn, params=loader_params)\n", + "summary = OracleSummary(conn=conn, params=summary_params)\n", + "splitter = OracleTextSplitter(conn=conn, params=splitter_params)\n", + "embedder = OracleEmbeddings(conn=conn, params=embedder_params)\n", + "\n", + "# load the documents so this cell is self-contained\n", + "docs = loader.load()\n", + "\n", + "# process the documents\n", + "chunks_with_mdata = []\n", + "for doc_id, doc in enumerate(docs, start=1):\n", + " summ = summary.get_summary(doc.page_content)\n", + " chunks = splitter.split_text(doc.page_content)\n", + " for ic, chunk in enumerate(chunks, start=1):\n", + " chunk_metadata = doc.metadata.copy()\n", + " chunk_metadata[\"id\"] = chunk_metadata[\"_oid\"] + \"$\" + str(doc_id) + \"$\" + str(ic)\n", + " chunk_metadata[\"document_id\"] = str(doc_id)\n", + " chunk_metadata[\"document_summary\"] = str(summ[0])\n", + " chunks_with_mdata.append(\n", + " Document(page_content=str(chunk), metadata=chunk_metadata)\n", + " )\n", + "\n", + "\"\"\" verify \"\"\"\n", + "print(f\"Number of total chunks with metadata: {len(chunks_with_mdata)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "At this point, we have processed the documents and generated chunks with metadata. 
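Before writing these chunks to the vector store, a quick spot check of one chunk can catch metadata mistakes early — a small sketch, assuming the `chunks_with_mdata` list built above is non-empty:

```python
# Inspect the first processed chunk (assumes chunks_with_mdata from above).
sample = chunks_with_mdata[0]
print(sample.page_content[:80])               # first 80 chars of the chunk text
print(sample.metadata["id"])                  # composite id: <oid>$<doc>$<chunk>
print(sample.metadata["document_summary"][:80])
```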
Next, we will create Oracle AI Vector Store with those chunks.\n", + "\n", + "Here is the sample code showing how to do that:" + ] + }, + { + "cell_type": "code", + "execution_count": 55, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Vector Store Table: oravs\n" + ] + } + ], + "source": [ + "# create Oracle AI Vector Store\n", + "vectorstore = OracleVS.from_documents(\n", + " chunks_with_mdata,\n", + " embedder,\n", + " client=conn,\n", + " table_name=\"oravs\",\n", + " distance_strategy=DistanceStrategy.DOT_PRODUCT,\n", + ")\n", + "\n", + "\"\"\" verify \"\"\"\n", + "print(f\"Vector Store Table: {vectorstore.table_name}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The above example creates a vector store with the DOT_PRODUCT distance strategy. \n", + "\n", + "However, Oracle AI Vector Store supports several other distance strategies. Please see the [comprehensive guide](/docs/integrations/vectorstores/oracle) for more information." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that we have embeddings stored in the vector store, let's create an index on them for better semantic search performance at query time.\n", + "\n", + "***Note*** If you get an insufficient-memory error, please increase ***vector_memory_size*** in your database.\n", + "\n", + "Here is the sample code to create an index:" + ] + }, + { + "cell_type": "code", + "execution_count": 56, + "metadata": {}, + "outputs": [], + "source": [ + "oraclevs.create_index(\n", + " conn, vectorstore, params={\"idx_name\": \"hnsw_oravs\", \"idx_type\": \"HNSW\"}\n", + ")\n", + "\n", + "print(\"Index created.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The above example creates a default HNSW index on the embeddings stored in the 'oravs' table. The users can set different parameters as per their requirements. Please refer to the Oracle AI Vector Search Guide for complete information about these parameters.\n", + "\n", + "Also, there are different types of vector indices that the users can create. Please see the [comprehensive guide](/docs/integrations/vectorstores/oracle) for more information.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Perform Semantic Search\n", + "All set!\n", + "\n", + "We have processed the documents, stored them in the vector store, and created an index for better query performance. Now let's do some semantic searches.\n", + "\n", + "Here is the sample code for this:" + ] + }, + { + "cell_type": "code", + "execution_count": 58, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[Document(page_content='The database stores LOBs differently from other data types. Creating a LOB column implicitly creates a LOB segment and a LOB index. The tablespace containing the LOB segment and LOB index, which are always stored together, may be different from the tablespace containing the table. 
Sometimes the database can store small amounts of LOB data in the table itself rather than in a separate LOB segment.', metadata={'_oid': '662f2f257677f3c2311a8ff999fd34e5', '_rowid': 'AAAR/xAAEAAAAAnAAC', 'id': '662f2f257677f3c2311a8ff999fd34e5$3$1', 'document_id': '3', 'document_summary': 'Sometimes the database can store small amounts of LOB data in the table itself rather than in a separate LOB segment.\\n\\n'})]\n", + "[]\n", + "[(Document(page_content='The database stores LOBs differently from other data types. Creating a LOB column implicitly creates a LOB segment and a LOB index. The tablespace containing the LOB segment and LOB index, which are always stored together, may be different from the tablespace containing the table. Sometimes the database can store small amounts of LOB data in the table itself rather than in a separate LOB segment.', metadata={'_oid': '662f2f257677f3c2311a8ff999fd34e5', '_rowid': 'AAAR/xAAEAAAAAnAAC', 'id': '662f2f257677f3c2311a8ff999fd34e5$3$1', 'document_id': '3', 'document_summary': 'Sometimes the database can store small amounts of LOB data in the table itself rather than in a separate LOB segment.\\n\\n'}), 0.055675752460956573)]\n", + "[]\n", + "[Document(page_content='If the answer to any preceding questions is yes, then the database stops the search and allocates space from the specified tablespace; otherwise, space is allocated from the database default shared temporary tablespace.', metadata={'_oid': '662f2f253acf96b33b430b88699490a2', '_rowid': 'AAAR/xAAEAAAAAnAAA', 'id': '662f2f253acf96b33b430b88699490a2$1$1', 'document_id': '1', 'document_summary': 'If the answer to any preceding questions is yes, then the database stops the search and allocates space from the specified tablespace; otherwise, space is allocated from the database default shared temporary tablespace.\\n\\n'})]\n", + "[Document(page_content='If the answer to any preceding questions is yes, then the database stops the search and allocates space from the specified tablespace; otherwise, space is allocated from the database default shared temporary tablespace.', metadata={'_oid': '662f2f253acf96b33b430b88699490a2', '_rowid': 'AAAR/xAAEAAAAAnAAA', 'id': '662f2f253acf96b33b430b88699490a2$1$1', 'document_id': '1', 'document_summary': 'If the answer to any preceding questions is yes, then the database stops the search and allocates space from the specified tablespace; otherwise, space is allocated from the database default shared temporary tablespace.\\n\\n'})]\n" + ] + } + ], + "source": [ + "query = \"What is Oracle AI Vector Store?\"\n", + "filter = {\"document_id\": [\"1\"]}\n", + "\n", + "# Similarity search without a filter\n", + "print(vectorstore.similarity_search(query, 1))\n", + "\n", + "# Similarity search with a filter\n", + "print(vectorstore.similarity_search(query, 1, filter=filter))\n", + "\n", + "# Similarity search with relevance score\n", + "print(vectorstore.similarity_search_with_score(query, 1))\n", + "\n", + "# Similarity search with relevance score with filter\n", + "print(vectorstore.similarity_search_with_score(query, 1, filter=filter))\n", + "\n", + "# Max marginal relevance search\n", + "print(vectorstore.max_marginal_relevance_search(query, 1, fetch_k=20, lambda_mult=0.5))\n", + "\n", + "# Max marginal relevance search with filter\n", + "print(\n", + " vectorstore.max_marginal_relevance_search(\n", + " query, 1, fetch_k=20, lambda_mult=0.5, filter=filter\n", + " )\n", + ")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 
(ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/docs/docs/integrations/document_loaders/oracleai.ipynb b/docs/docs/integrations/document_loaders/oracleai.ipynb new file mode 100644 index 0000000000..1c096d38bf --- /dev/null +++ b/docs/docs/integrations/document_loaders/oracleai.ipynb @@ -0,0 +1,236 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Oracle AI Vector Search: Document Processing\n", + "Oracle AI Vector Search is designed for Artificial Intelligence (AI) workloads that allows you to query data based on semantics, rather than keywords. One of the biggest benefit of Oracle AI Vector Search is that semantic search on unstructured data can be combined with relational search on business data in one single system. This is not only powerful but also significantly more effective because you don't need to add a specialized vector database, eliminating the pain of data fragmentation between multiple systems.\n", + "\n", + "The guide demonstrates how to use Document Processing Capabilities within Oracle AI Vector Search to load and chunk documents using OracleDocLoader and OracleTextSplitter respectively." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Prerequisites\n", + "\n", + "Please install Oracle Python Client driver to use Langchain with Oracle AI Vector Search. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# pip install oracledb" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Connect to Oracle Database\n", + "The following sample code will show how to connect to Oracle Database. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import sys\n", + "\n", + "import oracledb\n", + "\n", + "# please update with your username, password, hostname and service_name\n", + "username = \"\"\n", + "password = \"\"\n", + "dsn = \"/\"\n", + "\n", + "try:\n", + " conn = oracledb.connect(user=username, password=password, dsn=dsn)\n", + " print(\"Connection successful!\")\n", + "except Exception as e:\n", + " print(\"Connection failed!\")\n", + " sys.exit(1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now let's create a table and insert some sample docs to test." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "try:\n", + " cursor = conn.cursor()\n", + "\n", + " drop_table_sql = \"\"\"drop table if exists demo_tab\"\"\"\n", + " cursor.execute(drop_table_sql)\n", + "\n", + " create_table_sql = \"\"\"create table demo_tab (id number, data clob)\"\"\"\n", + " cursor.execute(create_table_sql)\n", + "\n", + " insert_row_sql = \"\"\"insert into demo_tab values (:1, :2)\"\"\"\n", + " rows_to_insert = [\n", + " (\n", + " 1,\n", + " \"If the answer to any preceding questions is yes, then the database stops the search and allocates space from the specified tablespace; otherwise, space is allocated from the database default shared temporary tablespace.\",\n", + " ),\n", + " (\n", + " 2,\n", + " \"A tablespace can be online (accessible) or offline (not accessible) whenever the database is open.\\nA tablespace is usually online so that its data is available to users. The SYSTEM tablespace and temporary tablespaces cannot be taken offline.\",\n", + " ),\n", + " (\n", + " 3,\n", + " \"The database stores LOBs differently from other data types. Creating a LOB column implicitly creates a LOB segment and a LOB index. The tablespace containing the LOB segment and LOB index, which are always stored together, may be different from the tablespace containing the table.\\nSometimes the database can store small amounts of LOB data in the table itself rather than in a separate LOB segment.\",\n", + " ),\n", + " ]\n", + " cursor.executemany(insert_row_sql, rows_to_insert)\n", + "\n", + " conn.commit()\n", + "\n", + " print(\"Table created and populated.\")\n", + " cursor.close()\n", + "except Exception as e:\n", + " # cursor may not exist here if conn.cursor() itself failed\n", + " print(\"Table creation failed.\")\n", + " sys.exit(1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Load Documents\n", + "The users can load the documents from Oracle Database or a file system or both. They just need to set the loader parameters accordingly. Please refer to the Oracle AI Vector Search Guide for complete information about these parameters.\n", + "\n", + "The main benefit of using OracleDocLoader is that it can handle 150+ different file formats. You don't need to use different types of loaders for different file formats. 
Here is the list of the formats that we support: [Oracle Text Supported Document Formats](https://docs.oracle.com/en/database/oracle/oracle-database/23/ccref/oracle-text-supported-document-formats.html)\n", + "\n", + "The following sample code will show how to do that:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_community.document_loaders.oracleai import OracleDocLoader\n", + "from langchain_core.documents import Document\n", + "\n", + "\"\"\"\n", + "# loading a local file\n", + "loader_params = {}\n", + "loader_params[\"file\"] = \"\"\n", + "\n", + "# loading from a local directory\n", + "loader_params = {}\n", + "loader_params[\"dir\"] = \"\"\n", + "\"\"\"\n", + "\n", + "# loading from Oracle Database table\n", + "loader_params = {\n", + " \"owner\": \"\",\n", + " \"tablename\": \"demo_tab\",\n", + " \"colname\": \"data\",\n", + "}\n", + "\n", + "\"\"\" load the docs \"\"\"\n", + "loader = OracleDocLoader(conn=conn, params=loader_params)\n", + "docs = loader.load()\n", + "\n", + "\"\"\" verify \"\"\"\n", + "print(f\"Number of docs loaded: {len(docs)}\")\n", + "# print(f\"Document-0: {docs[0].page_content}\") # content" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Split Documents\n", + "The documents can come in different sizes: small, medium, large, or very large. The users typically want to split/chunk their documents into smaller pieces to generate embeddings. There are many different splitting customizations the users can apply. Please refer to the Oracle AI Vector Search Guide for complete information about these parameters.\n", + "\n", + "The following sample code will show how to do that:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_community.document_loaders.oracleai import OracleTextSplitter\n", + "from langchain_core.documents import Document\n", + "\n", + "\"\"\"\n", + "# Some examples\n", + "# split by chars, max 500 chars\n", + "splitter_params = {\"split\": \"chars\", \"max\": 500, \"normalize\": \"all\"}\n", + "\n", + "# split by words, max 100 words\n", + "splitter_params = {\"split\": \"words\", \"max\": 100, \"normalize\": \"all\"}\n", + "\n", + "# split by sentence, max 20 sentences\n", + "splitter_params = {\"split\": \"sentence\", \"max\": 20, \"normalize\": \"all\"}\n", + "\"\"\"\n", + "\n", + "# split by default parameters\n", + "splitter_params = {\"normalize\": \"all\"}\n", + "\n", + "# get the splitter instance\n", + "splitter = OracleTextSplitter(conn=conn, params=splitter_params)\n", + "\n", + "list_chunks = []\n", + "for doc in docs:\n", + " chunks = splitter.split_text(doc.page_content)\n", + " list_chunks.extend(chunks)\n", + "\n", + "\"\"\" verify \"\"\"\n", + "print(f\"Number of Chunks: {len(list_chunks)}\")\n", + "# print(f\"Chunk-0: {list_chunks[0]}\") # content" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### End to End Demo\n", + "Please refer to our complete demo guide [Oracle AI Vector Search End-to-End Demo Guide](https://github.com/langchain-ai/langchain/tree/master/cookbook/oracleai_demo.ipynb) to build an end-to-end RAG pipeline with the help of Oracle AI Vector Search.\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + 
"mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/docs/docs/integrations/providers/oracleai.mdx b/docs/docs/integrations/providers/oracleai.mdx new file mode 100644 index 0000000000..3560ebdbf5 --- /dev/null +++ b/docs/docs/integrations/providers/oracleai.mdx @@ -0,0 +1,65 @@ +# OracleAI Vector Search +Oracle AI Vector Search is designed for Artificial Intelligence (AI) workloads that allows you to query data based on semantics, rather than keywords. One of the biggest benefit of Oracle AI Vector Search is that semantic search on unstructured data can be combined with relational search on business data in one single system. +This is not only powerful but also significantly more effective because you dont need to add a specialized vector database, eliminating the pain of data fragmentation between multiple systems. + +In addition, because Oracle has been building database technologies for so long, your vectors can benefit from all of Oracle Database's most powerful features, like the following: + + * Partitioning Support + * Real Application Clusters scalability + * Exadata smart scans + * Shard processing across geographically distributed databases + * Transactions + * Parallel SQL + * Disaster recovery + * Security + * Oracle Machine Learning + * Oracle Graph Database + * Oracle Spatial and Graph + * Oracle Blockchain + * JSON + + +## Document Loaders + +Please check the [usage example](/docs/integrations/document_loaders/oracleai). + +```python +from langchain_community.document_loaders.oracleai import OracleDocLoader +``` + +## Text Splitter + +Please check the [usage example](/docs/integrations/document_loaders/oracleai). + +```python +from langchain_community.document_loaders.oracleai import OracleTextSplitter +``` + +## Embeddings + +Please check the [usage example](/docs/integrations/text_embedding/oracleai). + +```python +from langchain_community.embeddings.oracleai import OracleEmbeddings +``` + +## Summary + +Please check the [usage example](/docs/integrations/tools/oracleai). + +```python +from langchain_community.utilities.oracleai import OracleSummary +``` + +## Vector Store + +Please check the [usage example](/docs/integrations/vectorstores/oracle). + +```python +from langchain_community.vectorstores.oraclevs import OracleVS +``` + +## End to End Demo + +Please check the [Oracle AI Vector Search End-to-End Demo Guide](https://github.com/langchain-ai/langchain/blob/master/cookbook/oracleai_demo). + diff --git a/docs/docs/integrations/text_embedding/oracleai.ipynb b/docs/docs/integrations/text_embedding/oracleai.ipynb new file mode 100644 index 0000000000..a44aab85f9 --- /dev/null +++ b/docs/docs/integrations/text_embedding/oracleai.ipynb @@ -0,0 +1,262 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Oracle AI Vector Search: Generate Embeddings\n", + "Oracle AI Vector Search is designed for Artificial Intelligence (AI) workloads that allows you to query data based on semantics, rather than keywords. One of the biggest benefit of Oracle AI Vector Search is that semantic search on unstructured data can be combined with relational search on business data in one single system. 
This is not only powerful but also significantly more effective because you don't need to add a specialized vector database, eliminating the pain of data fragmentation between multiple systems.\n", + "\n", + "The guide demonstrates how to use Embedding Capabilities within Oracle AI Vector Search to generate embeddings for your documents using OracleEmbeddings." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Prerequisites\n", + "\n", + "Please install the Oracle Python client driver to use LangChain with Oracle AI Vector Search. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# pip install oracledb" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Connect to Oracle Database\n", + "The following sample code will show how to connect to Oracle Database. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import sys\n", + "\n", + "import oracledb\n", + "\n", + "# please update with your username, password, hostname and service_name\n", + "username = \"\"\n", + "password = \"\"\n", + "dsn = \"<hostname>/<service_name>\"\n", + "\n", + "try:\n", + " conn = oracledb.connect(user=username, password=password, dsn=dsn)\n", + " print(\"Connection successful!\")\n", + "except Exception as e:\n", + " print(\"Connection failed!\")\n", + " sys.exit(1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For embeddings, we have a few provider options that the users can choose from, such as 'database' or third-party providers like ocigenai, huggingface, openai, etc. If the users choose a third-party provider, they need to create a credential with the corresponding authentication information. On the other hand, if the users choose 'database' as the provider, they need to load an ONNX model into Oracle Database for embeddings." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Load ONNX Model\n", + "\n", + "To generate embeddings, Oracle provides a few provider options for users to choose from. The users can choose the 'database' provider or third-party providers like OCIGENAI, HuggingFace, etc.\n", + "\n", + "***Note*** If the users choose the 'database' option, they need to load an ONNX model into Oracle Database. They do not need to load an ONNX model if they choose a third-party provider to generate embeddings.\n", + "\n", + "One of the core benefits of using an ONNX model is that the users do not need to transfer their data to a third party to generate embeddings. 
Also, since it does not involve any network or REST API calls, it may provide better performance.\n", + "\n", + "Here is the sample code to load an ONNX model to Oracle Database:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_community.embeddings.oracleai import OracleEmbeddings\n", + "\n", + "# please update with your related information\n", + "# make sure that you have onnx file in the system\n", + "onnx_dir = \"DEMO_DIR\"\n", + "onnx_file = \"tinybert.onnx\"\n", + "model_name = \"demo_model\"\n", + "\n", + "try:\n", + " OracleEmbeddings.load_onnx_model(conn, onnx_dir, onnx_file, model_name)\n", + " print(\"ONNX model loaded.\")\n", + "except Exception as e:\n", + " print(\"ONNX model loading failed!\")\n", + " sys.exit(1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Create Credential\n", + "\n", + "On the other hand, if the users choose a third-party provider to generate embeddings, they need to create a credential to access that provider's endpoints.\n", + "\n", + "***Note:*** The users do not need to create any credential if they choose the 'database' provider to generate embeddings. Should the users choose a third-party provider, they need to create a credential for the provider they want to use. \n", + "\n", + "Here is a sample:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# any error here will simply propagate; a try/except that only re-raises is unnecessary\n", + "cursor = conn.cursor()\n", + "cursor.execute(\n", + " \"\"\"\n", + " declare\n", + " jo json_object_t;\n", + " begin\n", + " -- HuggingFace\n", + " dbms_vector_chain.drop_credential(credential_name => 'HF_CRED');\n", + " jo := json_object_t();\n", + " jo.put('access_token', '');\n", + " dbms_vector_chain.create_credential(\n", + " credential_name => 'HF_CRED',\n", + " params => json(jo.to_string));\n", + "\n", + " -- OCIGENAI\n", + " dbms_vector_chain.drop_credential(credential_name => 'OCI_CRED');\n", + " jo := json_object_t();\n", + " jo.put('user_ocid','');\n", + " jo.put('tenancy_ocid','');\n", + " jo.put('compartment_ocid','');\n", + " jo.put('private_key','');\n", + " jo.put('fingerprint','');\n", + " dbms_vector_chain.create_credential(\n", + " credential_name => 'OCI_CRED',\n", + " params => json(jo.to_string));\n", + " end;\n", + " \"\"\"\n", + ")\n", + "cursor.close()\n", + "print(\"Credentials created.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Generate Embeddings\n", + "Oracle AI Vector Search provides a number of ways to generate embeddings. The users can load an ONNX embedding model to Oracle Database and use it to generate embeddings, or use third-party API endpoints to generate embeddings. Please refer to the Oracle AI Vector Search Guide for complete information about these parameters." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "***Note:*** The users may need to set a proxy if they want to use a third-party embedding generation provider other than the 'database' provider (i.e., an ONNX model)."
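If a proxy is required, its value is typically a plain host:port string; the example below is purely hypothetical:

```python
# Hypothetical proxy value; replace with your own proxy host and port,
# or omit the proxy argument entirely if you don't need one.
proxy = "www-proxy.example.com:80"
```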
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# proxy to be used when we instantiate summary and embedder object\n", + "proxy = \"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following sample code will show how to generate embeddings:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_community.embeddings.oracleai import OracleEmbeddings\n", + "from langchain_core.documents import Document\n", + "\n", + "\"\"\"\n", + "# using ocigenai\n", + "embedder_params = {\n", + " \"provider\": \"ocigenai\",\n", + " \"credential_name\": \"OCI_CRED\",\n", + " \"url\": \"https://inference.generativeai.us-chicago-1.oci.oraclecloud.com/20231130/actions/embedText\",\n", + " \"model\": \"cohere.embed-english-light-v3.0\",\n", + "}\n", + "\n", + "# using huggingface\n", + "embedder_params = {\n", + " \"provider\": \"huggingface\",\n", + " \"credential_name\": \"HF_CRED\",\n", + " \"url\": \"https://api-inference.huggingface.co/pipeline/feature-extraction/\",\n", + " \"model\": \"sentence-transformers/all-MiniLM-L6-v2\",\n", + " \"wait_for_model\": \"true\"\n", + "}\n", + "\"\"\"\n", + "\n", + "# using ONNX model loaded to Oracle Database\n", + "embedder_params = {\"provider\": \"database\", \"model\": \"demo_model\"}\n", + "\n", + "# Remove proxy if not required\n", + "embedder = OracleEmbeddings(conn=conn, params=embedder_params, proxy=proxy)\n", + "embed = embedder.embed_query(\"Hello World!\")\n", + "\n", + "\"\"\" verify \"\"\"\n", + "print(f\"Embedding generated by OracleEmbeddings: {embed}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### End to End Demo\n", + "Please refer to our complete demo guide [Oracle AI Vector Search End-to-End Demo Guide](https://github.com/langchain-ai/langchain/tree/master/cookbook/oracleai_demo.ipynb) to build an end-to-end RAG pipeline with the help of Oracle AI Vector Search.\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/docs/docs/integrations/tools/oracleai.ipynb b/docs/docs/integrations/tools/oracleai.ipynb new file mode 100644 index 0000000000..a19bfb29f0 --- /dev/null +++ b/docs/docs/integrations/tools/oracleai.ipynb @@ -0,0 +1,174 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Oracle AI Vector Search: Generate Summary\n", + "Oracle AI Vector Search is designed for Artificial Intelligence (AI) workloads and allows you to query data based on semantics rather than keywords. One of the biggest benefits of Oracle AI Vector Search is that semantic search on unstructured data can be combined with relational search on business data in one single system. This is not only powerful but also significantly more effective because you don't need to add a specialized vector database, eliminating the pain of data fragmentation between multiple systems.\n", + "\n", + "The guide demonstrates how to use Summary Capabilities within Oracle AI Vector Search to generate summaries for your documents using OracleSummary."
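As a quick preview of where this guide ends up, minimal usage looks roughly like the sketch below (it assumes a live connection `conn`, which is created in the next section, and the in-database 'database' provider, so no credential or proxy is needed):

```python
from langchain_community.utilities.oracleai import OracleSummary

# Minimal sketch: in-database summarization with the 'database' provider.
summary_params = {
    "provider": "database",
    "glevel": "S",
    "numParagraphs": 1,
    "language": "english",
}
summ = OracleSummary(conn=conn, params=summary_params)
print(summ.get_summary("Some long passage of text to summarize ..."))
```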
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Prerequisites\n", + "\n", + "Please install the Oracle Python client driver to use Langchain with Oracle AI Vector Search. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# pip install oracledb" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Connect to Oracle Database\n", + "The following sample code shows how to connect to Oracle Database. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import sys\n", + "\n", + "import oracledb\n", + "\n", + "# please update with your username, password, hostname and service_name\n", + "username = \"\"\n", + "password = \"\"\n", + "dsn = \"/\"\n", + "\n", + "try:\n", + " conn = oracledb.connect(user=username, password=password, dsn=dsn)\n", + " print(\"Connection successful!\")\n", + "except Exception as e:\n", + " print(\"Connection failed!\")\n", + " sys.exit(1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Generate Summary\n", + "The Oracle AI Vector Search Langchain library provides APIs to generate summaries of documents. There are a few summary generation providers to choose from, including Database, OCIGENAI, and HuggingFace. Users can pick their preferred provider by setting the summary parameters accordingly. Please refer to the Oracle AI Vector Search Guide for complete information about these parameters." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "***Note:*** Users may need to set a proxy if they want to use a 3rd party summary generation provider other than Oracle's in-house and default provider: 'database'. If you don't have a proxy, please remove the proxy parameter when you instantiate OracleSummary."
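As a minimal sketch of that note, and assuming the same `conn` and `summary_params` objects constructed later in this guide, an instantiation without a proxy simply omits the argument:

```python
# Sketch: no proxy in the environment, so the proxy argument is omitted.
summ = OracleSummary(conn=conn, params=summary_params)
```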
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# proxy to be used when we instantiate the summary object\n", + "proxy = \"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following sample code shows how to generate a summary:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_community.utilities.oracleai import OracleSummary\n", + "from langchain_core.documents import Document\n", + "\n", + "\"\"\"\n", + "# using 'ocigenai' provider\n", + "summary_params = {\n", + " \"provider\": \"ocigenai\",\n", + " \"credential_name\": \"OCI_CRED\",\n", + " \"url\": \"https://inference.generativeai.us-chicago-1.oci.oraclecloud.com/20231130/actions/summarizeText\",\n", + " \"model\": \"cohere.command\",\n", + "}\n", + "\n", + "# using 'huggingface' provider\n", + "summary_params = {\n", + " \"provider\": \"huggingface\",\n", + " \"credential_name\": \"HF_CRED\",\n", + " \"url\": \"https://api-inference.huggingface.co/models/\",\n", + " \"model\": \"facebook/bart-large-cnn\",\n", + " \"wait_for_model\": \"true\"\n", + "}\n", + "\"\"\"\n", + "\n", + "# using 'database' provider\n", + "summary_params = {\n", + " \"provider\": \"database\",\n", + " \"glevel\": \"S\",\n", + " \"numParagraphs\": 1,\n", + " \"language\": \"english\",\n", + "}\n", + "\n", + "# get the summary instance\n", + "# Remove proxy if not required\n", + "summ = OracleSummary(conn=conn, params=summary_params, proxy=proxy)\n", + "summary = summ.get_summary(\n", + " \"In the heart of the forest, \"\n", + " + \"a lone fox ventured out at dusk, seeking a lost treasure. \"\n", + " + \"With each step, memories flooded back, guiding its path. 
\"\n", + " + \"As the moon rose high, illuminating the night, the fox unearthed \"\n", + " + \"not gold, but a forgotten friendship, worth more than any riches.\"\n", + ")\n", + "\n", + "print(f\"Summary generated by OracleSummary: {summary}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### End to End Demo\n", + "Please refer to our complete demo guide [Oracle AI Vector Search End-to-End Demo Guide](https://github.com/langchain-ai/langchain/tree/master/cookbook/oracleai_demo.ipynb) to build an end to end RAG pipeline with the help of Oracle AI Vector Search.\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/docs/docs/integrations/vectorstores/oracle.ipynb b/docs/docs/integrations/vectorstores/oracle.ipynb new file mode 100644 index 0000000000..63b9153c4a --- /dev/null +++ b/docs/docs/integrations/vectorstores/oracle.ipynb @@ -0,0 +1,469 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "dd33e9d5-9dba-4aac-9f7f-4cf9e6686593", + "metadata": {}, + "source": [ + "# Oracle AI Vector Search: Vector Store\n", + "\n", + "Oracle AI Vector Search is designed for Artificial Intelligence (AI) workloads that allows you to query data based on semantics, rather than keywords.\n", + "One of the biggest benefit of Oracle AI Vector Search is that semantic search on unstructured data can be combined with relational search on business data in one single system.\n", + "This is not only powerful but also significantly more effective because you dont need to add a specialized vector database, eliminating the pain of data fragmentation between multiple systems.\n", + "\n", + "In addition, because Oracle has been building database technologies for so long, your vectors can benefit from all of Oracle Database's most powerful features, like the following:\n", + "\n", + " * Partitioning Support\n", + " * Real Application Clusters scalability\n", + " * Exadata smart scans\n", + " * Shard processing across geographically distributed databases\n", + " * Transactions\n", + " * Parallel SQL\n", + " * Disaster recovery\n", + " * Security\n", + " * Oracle Machine Learning\n", + " * Oracle Graph Database\n", + " * Oracle Spatial and Graph\n", + " * Oracle Blockchain\n", + " * JSON" + ] + }, + { + "cell_type": "markdown", + "id": "7bd80054-c803-47e1-a259-c40ed073c37d", + "metadata": {}, + "source": [ + "### Prerequisites for using Langchain with Oracle AI Vector Search\n", + "\n", + "Please install Oracle Python Client driver to use Langchain with Oracle AI Vector Search. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2bbb989d-c6fb-4ab9-bafd-a95fd48538d0", + "metadata": {}, + "outputs": [], + "source": [ + "# pip install oracledb" + ] + }, + { + "cell_type": "markdown", + "id": "0fceaa5a-95da-4ebd-8b8d-5e73bb653172", + "metadata": {}, + "source": [ + "### Connect to Oracle AI Vector Search" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4421e4b7-2c7e-4bcd-82b3-9576595edd0f", + "metadata": {}, + "outputs": [], + "source": [ + "import oracledb\n", + "\n", + "username = \"username\"\n", + "password = \"password\"\n", + "dsn = \"ipaddress:port/orclpdb1\"\n", + "\n", + "try:\n", + " connection = oracledb.connect(user=username, password=password, dsn=dsn)\n", + " print(\"Connection successful!\")\n", + "except Exception as e:\n", + " print(\"Connection failed!\")" + ] + }, + { + "cell_type": "markdown", + "id": "b11cf362-01b0-485d-8527-31b0fbb5028e", + "metadata": {}, + "source": [ + "### Import the required dependencies to play with Oracle AI Vector Search" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "43ea59e3-2910-45a6-b195-5f06094bb7c9", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_community.embeddings import HuggingFaceEmbeddings\n", + "from langchain_community.vectorstores import oraclevs\n", + "from langchain_community.vectorstores.oraclevs import OracleVS\n", + "from langchain_community.vectorstores.utils import DistanceStrategy\n", + "from langchain_core.documents import Document" + ] + }, + { + "cell_type": "markdown", + "id": "0aac10dc-a9cc-4fdb-901c-1b7a4bbbe5a7", + "metadata": {}, + "source": [ + "### Load Documents" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "70ac6982-b13a-4e8c-9c47-57c6d136ac60", + "metadata": {}, + "outputs": [], + "source": [ + "# Define a list of documents (These dummy examples are 5 random documents from Oracle Concepts Manual )\n", + "\n", + "documents_json_list = [\n", + " {\n", + " \"id\": \"cncpt_15.5.3.2.2_P4\",\n", + " \"text\": \"If the answer to any preceding questions is yes, then the database stops the search and allocates space from the specified tablespace; otherwise, space is allocated from the database default shared temporary tablespace.\",\n", + " \"link\": \"https://docs.oracle.com/en/database/oracle/oracle-database/23/cncpt/logical-storage-structures.html#GUID-5387D7B2-C0CA-4C1E-811B-C7EB9B636442\",\n", + " },\n", + " {\n", + " \"id\": \"cncpt_15.5.5_P1\",\n", + " \"text\": \"A tablespace can be online (accessible) or offline (not accessible) whenever the database is open.\\nA tablespace is usually online so that its data is available to users. The SYSTEM tablespace and temporary tablespaces cannot be taken offline.\",\n", + " \"link\": \"https://docs.oracle.com/en/database/oracle/oracle-database/23/cncpt/logical-storage-structures.html#GUID-D02B2220-E6F5-40D9-AFB5-BC69BCEF6CD4\",\n", + " },\n", + " {\n", + " \"id\": \"cncpt_22.3.4.3.1_P2\",\n", + " \"text\": \"The database stores LOBs differently from other data types. Creating a LOB column implicitly creates a LOB segment and a LOB index. 
The tablespace containing the LOB segment and LOB index, which are always stored together, may be different from the tablespace containing the table.\nSometimes the database can store small amounts of LOB data in the table itself rather than in a separate LOB segment.\",\n", + " \"link\": \"https://docs.oracle.com/en/database/oracle/oracle-database/23/cncpt/concepts-for-database-developers.html#GUID-3C50EAB8-FC39-4BB3-B680-4EACCE49E866\",\n", + " },\n", + " {\n", + " \"id\": \"cncpt_22.3.4.3.1_P3\",\n", + " \"text\": \"The LOB segment stores data in pieces called chunks. A chunk is a logically contiguous set of data blocks and is the smallest unit of allocation for a LOB. A row in the table stores a pointer called a LOB locator, which points to the LOB index. When the table is queried, the database uses the LOB index to quickly locate the LOB chunks.\",\n", + " \"link\": \"https://docs.oracle.com/en/database/oracle/oracle-database/23/cncpt/concepts-for-database-developers.html#GUID-3C50EAB8-FC39-4BB3-B680-4EACCE49E866\",\n", + " },\n", + "]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eaa942d6-5954-4898-8c32-3627b923a3a5", + "metadata": {}, + "outputs": [], + "source": [ + "# Create Langchain Documents\n", + "\n", + "documents_langchain = []\n", + "\n", + "for doc in documents_json_list:\n", + " metadata = {\"id\": doc[\"id\"], \"link\": doc[\"link\"]}\n", + " doc_langchain = Document(page_content=doc[\"text\"], metadata=metadata)\n", + " documents_langchain.append(doc_langchain)" + ] + }, + { + "cell_type": "markdown", + "id": "6823f5e6-997c-4f15-927b-bd44c61f105f", + "metadata": {}, + "source": [ + "### Using AI Vector Search, create a bunch of vector stores with different distance strategies\n", + "\n", + "First, we will create three vector stores, each with a different distance function. Since we have not created indices in them yet, they will just create tables for now. Later we will use these vector stores to create HNSW indices.\n", + "\n", + "If you manually connect to the Oracle Database, you will see three tables: \n", + "Documents_DOT, Documents_COSINE and Documents_EUCLIDEAN (a quick way to check this is sketched below). \n", + "\n", + "We will then create three additional tables, Documents_DOT_IVF, Documents_COSINE_IVF and Documents_EUCLIDEAN_IVF, which will be used\n", + "to create IVF indices on the tables instead of HNSW indices. 
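As referenced above, here is a minimal sketch for checking the backing tables. It assumes the `connection` object created earlier in this notebook and that the ingestion cell below has been run; the `DOCUMENTS%` name pattern is an assumption based on the table names used in this guide (Oracle stores unquoted identifiers upper-cased in the data dictionary):

```python
# Sketch: list the tables backing the vector stores created in this guide.
with connection.cursor() as cursor:
    cursor.execute(
        "select table_name from user_tables where table_name like 'DOCUMENTS%'"
    )
    for (table_name,) in cursor:
        print(table_name)
```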
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ed1b253e-5f5c-4a81-983c-74645213a170", + "metadata": {}, + "outputs": [], + "source": [ + "# Ingest documents into Oracle Vector Store using different distance strategies\n", + "\n", + "model = HuggingFaceEmbeddings(model_name=\"sentence-transformers/all-mpnet-base-v2\")\n", + "\n", + "vector_store_dot = OracleVS.from_documents(\n", + " documents_langchain,\n", + " model,\n", + " client=connection,\n", + " table_name=\"Documents_DOT\",\n", + " distance_strategy=DistanceStrategy.DOT_PRODUCT,\n", + ")\n", + "vector_store_max = OracleVS.from_documents(\n", + " documents_langchain,\n", + " model,\n", + " client=connection,\n", + " table_name=\"Documents_COSINE\",\n", + " distance_strategy=DistanceStrategy.COSINE,\n", + ")\n", + "vector_store_euclidean = OracleVS.from_documents(\n", + " documents_langchain,\n", + " model,\n", + " client=connection,\n", + " table_name=\"Documents_EUCLIDEAN\",\n", + " distance_strategy=DistanceStrategy.EUCLIDEAN_DISTANCE,\n", + ")\n", + "\n", + "# Ingest documents into Oracle Vector Store using different distance strategies\n", + "vector_store_dot_ivf = OracleVS.from_documents(\n", + " documents_langchain,\n", + " model,\n", + " client=connection,\n", + " table_name=\"Documents_DOT_IVF\",\n", + " distance_strategy=DistanceStrategy.DOT_PRODUCT,\n", + ")\n", + "vector_store_max_ivf = OracleVS.from_documents(\n", + " documents_langchain,\n", + " model,\n", + " client=connection,\n", + " table_name=\"Documents_COSINE_IVF\",\n", + " distance_strategy=DistanceStrategy.COSINE,\n", + ")\n", + "vector_store_euclidean_ivf = OracleVS.from_documents(\n", + " documents_langchain,\n", + " model,\n", + " client=connection,\n", + " table_name=\"Documents_EUCLIDEAN_IVF\",\n", + " distance_strategy=DistanceStrategy.EUCLIDEAN_DISTANCE,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "77c29505-8688-4b87-9a99-e648fbb2d425", + "metadata": {}, + "source": [ + "### Demonstrating add, delete operations for texts, and basic similarity search\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "306563ae-577b-4bc7-8a92-3dd6a59310f5", + "metadata": {}, + "outputs": [], + "source": [ + "def manage_texts(vector_stores):\n", + " \"\"\"\n", + " Adds texts to each vector store, demonstrates error handling for duplicate additions,\n", + " and performs deletion of texts. 
Showcases similarity searches and index creation for each vector store.\n", + "\n", + " Args:\n", + " - vector_stores (list): A list of OracleVS instances.\n", + " \"\"\"\n", + " texts = [\"Rohan\", \"Shailendra\"]\n", + " metadata = [\n", + " {\"id\": \"100\", \"link\": \"Document Example Test 1\"},\n", + " {\"id\": \"101\", \"link\": \"Document Example Test 2\"},\n", + " ]\n", + "\n", + " for i, vs in enumerate(vector_stores, start=1):\n", + " # Adding texts\n", + " try:\n", + " vs.add_texts(texts, metadata)\n", + " print(f\"\\n\\n\\nAdd texts complete for vector store {i}\\n\\n\\n\")\n", + " except Exception as ex:\n", + " print(f\"\\n\\n\\nExpected error on duplicate add for vector store {i}\\n\\n\\n\")\n", + "\n", + " # Deleting texts using the value of 'id'\n", + " vs.delete([metadata[0][\"id\"]])\n", + " print(f\"\\n\\n\\nDelete texts complete for vector store {i}\\n\\n\\n\")\n", + "\n", + " # Similarity search\n", + " results = vs.similarity_search(\"How are LOBS stored in Oracle Database\", 2)\n", + " print(f\"\\n\\n\\nSimilarity search results for vector store {i}: {results}\\n\\n\\n\")\n", + "\n", + "\n", + "vector_store_list = [\n", + " vector_store_dot,\n", + " vector_store_max,\n", + " vector_store_euclidean,\n", + " vector_store_dot_ivf,\n", + " vector_store_max_ivf,\n", + " vector_store_euclidean_ivf,\n", + "]\n", + "manage_texts(vector_store_list)" + ] + }, + { + "cell_type": "markdown", + "id": "0980cb33-69cf-4547-842a-afdc4d6fa7d3", + "metadata": {}, + "source": [ + "### Demonstrating index creation with specific parameters for each distance strategy\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "46298a27-e309-456e-b2b8-771d9cb3be29", + "metadata": {}, + "outputs": [], + "source": [ + "def create_search_indices(connection):\n", + " \"\"\"\n", + " Creates search indices for the vector stores, each with specific parameters tailored to their distance strategy.\n", + " \"\"\"\n", + " # Index for DOT_PRODUCT strategy\n", + " # Notice we are creating a HNSW index with default parameters\n", + " # This will default to creating a HNSW index with 8 Parallel Workers and use the Default Accuracy used by Oracle AI Vector Search\n", + " oraclevs.create_index(\n", + " connection,\n", + " vector_store_dot,\n", + " params={\"idx_name\": \"hnsw_idx1\", \"idx_type\": \"HNSW\"},\n", + " )\n", + "\n", + " # Index for COSINE strategy with specific parameters\n", + " # Notice we are creating a HNSW index with parallel 16 and Target Accuracy Specification as 97 percent\n", + " oraclevs.create_index(\n", + " connection,\n", + " vector_store_max,\n", + " params={\n", + " \"idx_name\": \"hnsw_idx2\",\n", + " \"idx_type\": \"HNSW\",\n", + " \"accuracy\": 97,\n", + " \"parallel\": 16,\n", + " },\n", + " )\n", + "\n", + " # Index for EUCLIDEAN_DISTANCE strategy with specific parameters\n", + " # Notice we are creating a HNSW index by specifying Power User Parameters which are neighbors = 64 and efConstruction = 100\n", + " oraclevs.create_index(\n", + " connection,\n", + " vector_store_euclidean,\n", + " params={\n", + " \"idx_name\": \"hnsw_idx3\",\n", + " \"idx_type\": \"HNSW\",\n", + " \"neighbors\": 64,\n", + " \"efConstruction\": 100,\n", + " },\n", + " )\n", + "\n", + " # Index for DOT_PRODUCT strategy with specific parameters\n", + " # Notice we are creating an IVF index with default parameters\n", + " # This will default to creating an IVF index with 8 Parallel Workers and use the Default Accuracy used by Oracle AI Vector Search\n", + " oraclevs.create_index(\n", 
+ " connection,\n", + " vector_store_dot_ivf,\n", + " params={\n", + " \"idx_name\": \"ivf_idx1\",\n", + " \"idx_type\": \"IVF\",\n", + " },\n", + " )\n", + "\n", + " # Index for COSINE strategy with specific parameters\n", + " # Notice we are creating an IVF index with parallel 32 and Target Accuracy Specification as 90 percent\n", + " oraclevs.create_index(\n", + " connection,\n", + " vector_store_max_ivf,\n", + " params={\n", + " \"idx_name\": \"ivf_idx2\",\n", + " \"idx_type\": \"IVF\",\n", + " \"accuracy\": 90,\n", + " \"parallel\": 32,\n", + " },\n", + " )\n", + "\n", + " # Index for EUCLIDEAN_DISTANCE strategy with specific parameters\n", + " # Notice we are creating an IVF index by specifying Power User Parameters which is neighbor_part = 64\n", + " oraclevs.create_index(\n", + " connection,\n", + " vector_store_euclidean_ivf,\n", + " params={\"idx_name\": \"ivf_idx3\", \"idx_type\": \"IVF\", \"neighbor_part\": 64},\n", + " )\n", + "\n", + " print(\"Index creation complete.\")\n", + "\n", + "\n", + "create_search_indices(connection)" + ] + }, + { + "cell_type": "markdown", + "id": "7223d048-5c0b-4e91-a91b-a7daa9f86758", + "metadata": {}, + "source": [ + "### Now we will conduct a bunch of advanced searches on all six vector stores. Each of these three searches have a with and without filter version. The filter only selects the document with id 101 out and filters out everything else" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "37ca2e7d-9803-4260-95e7-62776d4fb820", + "metadata": {}, + "outputs": [], + "source": [ + "# Conduct advanced searches after creating the indices\n", + "def conduct_advanced_searches(vector_stores):\n", + " query = \"How are LOBS stored in Oracle Database\"\n", + " # Constructing a filter for direct comparison against document metadata\n", + " # This filter aims to include documents whose metadata 'id' is exactly '2'\n", + " filter_criteria = {\"id\": [\"101\"]} # Direct comparison filter\n", + "\n", + " for i, vs in enumerate(vector_stores, start=1):\n", + " print(f\"\\n--- Vector Store {i} Advanced Searches ---\")\n", + " # Similarity search without a filter\n", + " print(\"\\nSimilarity search results without filter:\")\n", + " print(vs.similarity_search(query, 2))\n", + "\n", + " # Similarity search with a filter\n", + " print(\"\\nSimilarity search results with filter:\")\n", + " print(vs.similarity_search(query, 2, filter=filter_criteria))\n", + "\n", + " # Similarity search with relevance score\n", + " print(\"\\nSimilarity search with relevance score:\")\n", + " print(vs.similarity_search_with_score(query, 2))\n", + "\n", + " # Similarity search with relevance score with filter\n", + " print(\"\\nSimilarity search with relevance score with filter:\")\n", + " print(vs.similarity_search_with_score(query, 2, filter=filter_criteria))\n", + "\n", + " # Max marginal relevance search\n", + " print(\"\\nMax marginal relevance search results:\")\n", + " print(vs.max_marginal_relevance_search(query, 2, fetch_k=20, lambda_mult=0.5))\n", + "\n", + " # Max marginal relevance search with filter\n", + " print(\"\\nMax marginal relevance search results with filter:\")\n", + " print(\n", + " vs.max_marginal_relevance_search(\n", + " query, 2, fetch_k=20, lambda_mult=0.5, filter=filter_criteria\n", + " )\n", + " )\n", + "\n", + "\n", + "conduct_advanced_searches(vector_store_list)" + ] + }, + { + "cell_type": "markdown", + "id": "0da8c7e2-0db0-4363-b31b-a7a5e3f83717", + "metadata": {}, + "source": [ + "### End to End Demo\n", + "Please refer 
to our complete demo guide [Oracle AI Vector Search End-to-End Demo Guide](https://github.com/langchain-ai/langchain/tree/master/cookbook/oracleai_demo.ipynb) to build an end to end RAG pipeline with the help of Oracle AI Vector Search.\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/libs/community/langchain_community/document_loaders/__init__.py b/libs/community/langchain_community/document_loaders/__init__.py index cf6d249e7f..5a5f1e4c34 100644 --- a/libs/community/langchain_community/document_loaders/__init__.py +++ b/libs/community/langchain_community/document_loaders/__init__.py @@ -331,6 +331,10 @@ if TYPE_CHECKING: from langchain_community.document_loaders.oracleadb_loader import ( OracleAutonomousDatabaseLoader, ) + from langchain_community.document_loaders.oracleai import ( + OracleDocLoader, # noqa: F401 + OracleTextSplitter, # noqa: F401 + ) from langchain_community.document_loaders.org_mode import ( UnstructuredOrgModeLoader, ) @@ -624,6 +628,8 @@ _module_lookup = { "OnlinePDFLoader": "langchain_community.document_loaders.pdf", "OpenCityDataLoader": "langchain_community.document_loaders.open_city_data", "OracleAutonomousDatabaseLoader": "langchain_community.document_loaders.oracleadb_loader", # noqa: E501 + "OracleDocLoader": "langchain_community.document_loaders.oracleai", + "OracleTextSplitter": "langchain_community.document_loaders.oracleai", "OutlookMessageLoader": "langchain_community.document_loaders.email", "PDFMinerLoader": "langchain_community.document_loaders.pdf", "PDFMinerPDFasHTMLLoader": "langchain_community.document_loaders.pdf", @@ -822,6 +828,8 @@ __all__ = [ "OnlinePDFLoader", "OpenCityDataLoader", "OracleAutonomousDatabaseLoader", + "OracleDocLoader", + "OracleTextSplitter", "OutlookMessageLoader", "PDFMinerLoader", "PDFMinerPDFasHTMLLoader", diff --git a/libs/community/langchain_community/document_loaders/oracleai.py b/libs/community/langchain_community/document_loaders/oracleai.py new file mode 100644 index 0000000000..0637227bf7 --- /dev/null +++ b/libs/community/langchain_community/document_loaders/oracleai.py @@ -0,0 +1,447 @@ +# Authors: +# Harichandan Roy (hroy) +# David Jiang (ddjiang) +# +# ----------------------------------------------------------------------------- +# oracleai.py +# ----------------------------------------------------------------------------- + +from __future__ import annotations + +import hashlib +import json +import logging +import os +import random +import struct +import time +import traceback +from html.parser import HTMLParser +from typing import TYPE_CHECKING, Any, Dict, List, Optional, Tuple, Union + +from langchain_core.document_loaders import BaseLoader +from langchain_core.documents import Document +from langchain_text_splitters import TextSplitter + +if TYPE_CHECKING: + from oracledb import Connection + +logger = logging.getLogger(__name__) + +"""ParseOracleDocMetadata class""" + + +class ParseOracleDocMetadata(HTMLParser): + """Parse Oracle doc metadata...""" + + def __init__(self) -> None: + super().__init__() + self.reset() + self.match = False + self.metadata: Dict[str, Any] = {} + + def handle_starttag(self, tag: 
str, attrs: List[Tuple[str, Optional[str]]]) -> None: + if tag == "meta": + entry: Optional[str] = "" + for name, value in attrs: + if name == "name": + entry = value + if name == "content": + if entry: + self.metadata[entry] = value + elif tag == "title": + self.match = True + + def handle_data(self, data: str) -> None: + if self.match: + self.metadata["title"] = data + self.match = False + + def get_metadata(self) -> Dict[str, Any]: + return self.metadata + + +"""OracleDocReader class""" + + +class OracleDocReader: + """Read a file""" + + @staticmethod + def generate_object_id(input_string: Union[str, None] = None) -> str: + out_length = 32 # output length + hash_len = 8 # hash value length + + if input_string is None: + input_string = "".join( + random.choices( + "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", + k=16, + ) + ) + + # timestamp + timestamp = int(time.time()) + timestamp_bin = struct.pack(">I", timestamp) # 4 bytes + + # hash_value + hashval_bin = hashlib.sha256(input_string.encode()).digest() + hashval_bin = hashval_bin[:hash_len] # 8 bytes + + # counter + counter_bin = struct.pack(">I", random.getrandbits(32)) # 4 bytes + + # binary object id + object_id = timestamp_bin + hashval_bin + counter_bin # 16 bytes + object_id_hex = object_id.hex() # 32 bytes + object_id_hex = object_id_hex.zfill( + out_length + ) # fill with zeros if less than 32 bytes + + object_id_hex = object_id_hex[:out_length] + + return object_id_hex + + @staticmethod + def read_file( + conn: Connection, file_path: str, params: dict + ) -> Union[Document, None]: + """Read a file using OracleDocReader + Args: + conn: Oracle Connection, + file_path: path of the file to read, + params: parameters for document parsing. + Returns: + Plain text and metadata as a Langchain Document. + """ + + metadata: Dict[str, Any] = {} + try: + import oracledb + except ImportError as e: + raise ImportError( + "Unable to import oracledb, please install with " + "`pip install -U oracledb`." + ) from e + try: + oracledb.defaults.fetch_lobs = False + cursor = conn.cursor() + + with open(file_path, "rb") as f: + data = f.read() + + if data is None: + return Document(page_content="", metadata=metadata) + + mdata = cursor.var(oracledb.DB_TYPE_CLOB) + text = cursor.var(oracledb.DB_TYPE_CLOB) + cursor.execute( + """ + declare + input blob; + begin + input := :blob; + :mdata := dbms_vector_chain.utl_to_text(input, json(:pref)); + :text := dbms_vector_chain.utl_to_text(input); + end;""", + blob=data, + pref=json.dumps(params), + mdata=mdata, + text=text, + ) + cursor.close() + + if mdata is None: + metadata = {} + else: + doc_data = str(mdata.getvalue()) + if doc_data.startswith("<!DOCTYPE html") or doc_data.startswith("<HTML>"): + p = ParseOracleDocMetadata() + p.feed(doc_data) + metadata = p.get_metadata() + + doc_id = OracleDocReader.generate_object_id(conn.username + "$" + file_path) + metadata["_oid"] = doc_id + metadata["_file"] = file_path + + if text is None: + return Document(page_content="", metadata=metadata) + else: + return Document(page_content=str(text.getvalue()), metadata=metadata) + + except Exception as ex: + logger.info(f"An exception occurred :: {ex}") + logger.info(f"Skip processing {file_path}") + cursor.close() + return None + + +"""OracleDocLoader class""" + + +class OracleDocLoader(BaseLoader): + """Read documents using OracleDocLoader + Args: + conn: Oracle Connection, + params: Loader parameters. 
+ """ + + def __init__(self, conn: Connection, params: Dict[str, Any], **kwargs: Any): + self.conn = conn + self.params = json.loads(json.dumps(params)) + super().__init__(**kwargs) + + def load(self) -> List[Document]: + """Load data into LangChain Document objects...""" + try: + import oracledb + except ImportError as e: + raise ImportError( + "Unable to import oracledb, please install with " + "`pip install -U oracledb`." + ) from e + + ncols = 0 + results: List[Document] = [] + metadata: Dict[str, Any] = {} + m_params = {"plaintext": "false"} + try: + # extract the parameters + if self.params is not None: + self.file = self.params.get("file") + self.dir = self.params.get("dir") + self.owner = self.params.get("owner") + self.tablename = self.params.get("tablename") + self.colname = self.params.get("colname") + else: + raise Exception("Missing loader parameters") + + oracledb.defaults.fetch_lobs = False + + if self.file: + doc = OracleDocReader.read_file(self.conn, self.file, m_params) + + if doc is None: + return results + + results.append(doc) + + if self.dir: + skip_count = 0 + for file_name in os.listdir(self.dir): + file_path = os.path.join(self.dir, file_name) + if os.path.isfile(file_path): + doc = OracleDocReader.read_file(self.conn, file_path, m_params) + + if doc is None: + skip_count = skip_count + 1 + logger.info(f"Total skipped: {skip_count}\n") + else: + results.append(doc) + + if self.tablename: + try: + if self.owner is None or self.colname is None: + raise Exception("Missing owner or column name or both.") + + cursor = self.conn.cursor() + self.mdata_cols = self.params.get("mdata_cols") + if self.mdata_cols is not None: + if len(self.mdata_cols) > 3: + raise Exception( + "Exceeds the max number of columns " + + "you can request for metadata." + ) + + # execute a query to get column data types + sql = ( + "select column_name, data_type from all_tab_columns " + + "where owner = :ownername and " + + "table_name = :tablename" + ) + cursor.execute( + sql, + ownername=self.owner.upper(), + tablename=self.tablename.upper(), + ) + + # cursor.execute(sql) + rows = cursor.fetchall() + for row in rows: + if row[0] in self.mdata_cols: + if row[1] not in [ + "NUMBER", + "BINARY_DOUBLE", + "BINARY_FLOAT", + "LONG", + "DATE", + "TIMESTAMP", + "VARCHAR2", + ]: + raise Exception( + "The datatype for the column requested " + + "for metadata is not supported." + ) + + self.mdata_cols_sql = ", rowid" + if self.mdata_cols is not None: + for col in self.mdata_cols: + self.mdata_cols_sql = self.mdata_cols_sql + ", " + col + + # [TODO] use bind variables + sql = ( + "select dbms_vector_chain.utl_to_text(t." + + self.colname + + ", json('" + + json.dumps(m_params) + + "')) mdata, dbms_vector_chain.utl_to_text(t." + + self.colname + + ") text" + + self.mdata_cols_sql + + " from " + + self.owner + + "." 
+ + self.tablename + + " t" + ) + + cursor.execute(sql) + for row in cursor: + metadata = {} + + if row is None: + doc_id = OracleDocReader.generate_object_id( + self.conn.username + + "$" + + self.owner + + "$" + + self.tablename + + "$" + + self.colname + ) + metadata["_oid"] = doc_id + results.append(Document(page_content="", metadata=metadata)) + else: + if row[0] is not None: + data = str(row[0]) + if data.startswith("<!DOCTYPE html") or data.startswith("<HTML>"): + p = ParseOracleDocMetadata() + p.feed(data) + metadata = p.get_metadata() + + doc_id = OracleDocReader.generate_object_id( + self.conn.username + + "$" + + self.owner + + "$" + + self.tablename + + "$" + + self.colname + + "$" + + str(row[2]) + ) + metadata["_oid"] = doc_id + metadata["_rowid"] = row[2] + + # process projected metadata cols + # (row layout: mdata, text, rowid, then the projected columns) + if self.mdata_cols is not None: + ncols = len(self.mdata_cols) + + for i in range(0, ncols): + metadata[self.mdata_cols[i]] = row[i + 3] + + if row[1] is None: + results.append( + Document(page_content="", metadata=metadata) + ) + else: + results.append( + Document( + page_content=str(row[1]), metadata=metadata + ) + ) + except Exception as ex: + logger.info(f"An exception occurred :: {ex}") + traceback.print_exc() + cursor.close() + raise + + return results + except Exception as ex: + logger.info(f"An exception occurred :: {ex}") + traceback.print_exc() + raise + + +class OracleTextSplitter(TextSplitter): + """Splitting text using Oracle chunker.""" + + def __init__(self, conn: Connection, params: Dict[str, Any], **kwargs: Any) -> None: + """Initialize.""" + self.conn = conn + self.params = params + super().__init__(**kwargs) + try: + import json + + try: + import oracledb + except ImportError as e: + raise ImportError( + "Unable to import oracledb, please install with " + "`pip install -U oracledb`." + ) from e + + self._oracledb = oracledb + self._json = json + except ImportError: + raise ImportError( + "oracledb or json or both are not installed. " + + "Please install them. " + + "Recommendations: `pip install oracledb`. " + ) + + def split_text(self, text: str) -> List[str]: + """Split incoming text and return chunks.""" + + try: + import oracledb + except ImportError as e: + raise ImportError( + "Unable to import oracledb, please install with " + "`pip install -U oracledb`." 
+ ) from e + + splits = [] + + try: + # returns strings or bytes instead of a locator + self._oracledb.defaults.fetch_lobs = False + + cursor = self.conn.cursor() + + cursor.setinputsizes(content=oracledb.CLOB) + cursor.execute( + "select t.column_value from " + + "dbms_vector_chain.utl_to_chunks(:content, json(:params)) t", + content=text, + params=self._json.dumps(self.params), + ) + + while True: + row = cursor.fetchone() + if row is None: + break + d = self._json.loads(row[0]) + splits.append(d["chunk_data"]) + + return splits + + except Exception as ex: + logger.info(f"An exception occurred :: {ex}") + traceback.print_exc() + raise diff --git a/libs/community/langchain_community/embeddings/__init__.py b/libs/community/langchain_community/embeddings/__init__.py index 4ccb4421f9..88e3777801 100644 --- a/libs/community/langchain_community/embeddings/__init__.py +++ b/libs/community/langchain_community/embeddings/__init__.py @@ -169,6 +169,9 @@ if TYPE_CHECKING: from langchain_community.embeddings.optimum_intel import ( QuantizedBiEncoderEmbeddings, ) + from langchain_community.embeddings.oracleai import ( + OracleEmbeddings, # noqa: F401 + ) from langchain_community.embeddings.premai import ( PremAIEmbeddings, ) @@ -267,6 +270,7 @@ __all__ = [ "OpenAIEmbeddings", "OpenVINOBgeEmbeddings", "OpenVINOEmbeddings", + "OracleEmbeddings", "PremAIEmbeddings", "QianfanEmbeddingsEndpoint", "QuantizedBgeEmbeddings", @@ -344,6 +348,7 @@ _module_lookup = { "QianfanEmbeddingsEndpoint": "langchain_community.embeddings.baidu_qianfan_endpoint", # noqa: E501 "QuantizedBgeEmbeddings": "langchain_community.embeddings.itrex", "QuantizedBiEncoderEmbeddings": "langchain_community.embeddings.optimum_intel", + "OracleEmbeddings": "langchain_community.embeddings.oracleai", "SagemakerEndpointEmbeddings": "langchain_community.embeddings.sagemaker_endpoint", "SelfHostedEmbeddings": "langchain_community.embeddings.self_hosted", "SelfHostedHuggingFaceEmbeddings": "langchain_community.embeddings.self_hosted_hugging_face", # noqa: E501 diff --git a/libs/community/langchain_community/embeddings/oracleai.py b/libs/community/langchain_community/embeddings/oracleai.py new file mode 100644 index 0000000000..ca2dc7f5b7 --- /dev/null +++ b/libs/community/langchain_community/embeddings/oracleai.py @@ -0,0 +1,182 @@ +# Authors: +# Harichandan Roy (hroy) +# David Jiang (ddjiang) +# +# ----------------------------------------------------------------------------- +# oracleai.py +# ----------------------------------------------------------------------------- + +from __future__ import annotations + +import json +import logging +import traceback +from typing import TYPE_CHECKING, Any, Dict, List, Optional + +from langchain_core.embeddings import Embeddings +from langchain_core.pydantic_v1 import BaseModel, Extra + +if TYPE_CHECKING: + from oracledb import Connection + +logger = logging.getLogger(__name__) + +"""OracleEmbeddings class""" + + +class OracleEmbeddings(BaseModel, Embeddings): + """Get Embeddings""" + + """Oracle Connection""" + conn: Any + """Embedding Parameters""" + params: Dict[str, Any] + """Proxy""" + proxy: Optional[str] = None + + def __init__(self, **kwargs: Any): + super().__init__(**kwargs) + + class Config: + """Configuration for this pydantic object.""" + + extra = Extra.forbid + + """ + 1 - user needs to have create procedure, + create mining model, create any directory privilege. 
+ 2 - grant create procedure, create mining model, + create any directory to ; + """ + + @staticmethod + def load_onnx_model( + conn: Connection, dir: str, onnx_file: str, model_name: str + ) -> None: + """Load an ONNX model to Oracle Database. + Args: + conn: Oracle Connection, + dir: Oracle Directory, + onnx_file: ONNX file name, + model_name: Name of the model. + """ + + try: + if conn is None or dir is None or onnx_file is None or model_name is None: + raise Exception("Invalid input") + + cursor = conn.cursor() + cursor.execute( + """ + begin + dbms_data_mining.drop_model(model_name => :model, force => true); + SYS.DBMS_VECTOR.load_onnx_model(:path, :filename, :model, + json('{"function" : "embedding", + "embeddingOutput" : "embedding", + "input": {"input": ["DATA"]}}')); + end;""", + path=dir, + filename=onnx_file, + model=model_name, + ) + + cursor.close() + + except Exception as ex: + logger.info(f"An exception occurred :: {ex}") + traceback.print_exc() + cursor.close() + raise + + def embed_documents(self, texts: List[str]) -> List[List[float]]: + """Compute doc embeddings using an OracleEmbeddings. + Args: + texts: The list of texts to embed. + Returns: + List of embeddings, one for each input text. + """ + + try: + import oracledb + except ImportError as e: + raise ImportError( + "Unable to import oracledb, please install with " + "`pip install -U oracledb`." + ) from e + + if texts is None: + return None + + embeddings: List[List[float]] = [] + try: + # returns strings or bytes instead of a locator + oracledb.defaults.fetch_lobs = False + cursor = self.conn.cursor() + + if self.proxy: + cursor.execute( + "begin utl_http.set_proxy(:proxy); end;", proxy=self.proxy + ) + + for text in texts: + cursor.execute( + "select t.* " + + "from dbms_vector_chain.utl_to_embeddings(:content, " + + "json(:params)) t", + content=text, + params=json.dumps(self.params), + ) + + for row in cursor: + if row is None: + embeddings.append([]) + else: + rdata = json.loads(row[0]) + # dereference string as array + vec = json.loads(rdata["embed_vector"]) + embeddings.append(vec) + + cursor.close() + return embeddings + except Exception as ex: + logger.info(f"An exception occurred :: {ex}") + traceback.print_exc() + cursor.close() + raise + + def embed_query(self, text: str) -> List[float]: + """Compute query embedding using an OracleEmbeddings. + Args: + text: The text to embed. + Returns: + Embedding for the text. + """ + return self.embed_documents([text])[0] + + +# uncomment the following code block to run the test + +""" +# A sample unit test. 
+ +''' get the Oracle connection ''' +conn = oracledb.connect( + user="", + password="", + dsn="") +print("Oracle connection is established...") + +''' params ''' +embedder_params = {"provider":"database", "model":"demo_model"} +proxy = "" + +''' instance ''' +embedder = OracleEmbeddings(conn=conn, params=embedder_params, proxy=proxy) + +embed = embedder.embed_query("Hello World!") +print(f"Embedding generated by OracleEmbeddings: {embed}") + +conn.close() +print("Connection is closed.") + +""" diff --git a/libs/community/langchain_community/utilities/__init__.py b/libs/community/langchain_community/utilities/__init__.py index 6e93776003..6b818d3865 100644 --- a/libs/community/langchain_community/utilities/__init__.py +++ b/libs/community/langchain_community/utilities/__init__.py @@ -99,6 +99,9 @@ if TYPE_CHECKING: from langchain_community.utilities.openweathermap import ( OpenWeatherMapAPIWrapper, ) + from langchain_community.utilities.oracleai import ( + OracleSummary, # noqa: F401 + ) from langchain_community.utilities.outline import ( OutlineAPIWrapper, ) @@ -199,6 +202,7 @@ __all__ = [ "NasaAPIWrapper", "NutritionAIAPI", "OpenWeatherMapAPIWrapper", + "OracleSummary", "OutlineAPIWrapper", "Portkey", "PowerBIDataset", @@ -260,6 +264,7 @@ _module_lookup = { "NasaAPIWrapper": "langchain_community.utilities.nasa", "NutritionAIAPI": "langchain_community.utilities.passio_nutrition_ai", "OpenWeatherMapAPIWrapper": "langchain_community.utilities.openweathermap", + "OracleSummary": "langchain_community.utilities.oracleai", "OutlineAPIWrapper": "langchain_community.utilities.outline", "Portkey": "langchain_community.utilities.portkey", "PowerBIDataset": "langchain_community.utilities.powerbi", diff --git a/libs/community/langchain_community/utilities/oracleai.py b/libs/community/langchain_community/utilities/oracleai.py new file mode 100644 index 0000000000..f67d04066d --- /dev/null +++ b/libs/community/langchain_community/utilities/oracleai.py @@ -0,0 +1,201 @@ +# Authors: +# Harichandan Roy (hroy) +# David Jiang (ddjiang) +# +# ----------------------------------------------------------------------------- +# oracleai.py +# ----------------------------------------------------------------------------- + +from __future__ import annotations + +import json +import logging +import traceback +from typing import TYPE_CHECKING, Any, Dict, List, Optional + +from langchain_core.documents import Document + +if TYPE_CHECKING: + from oracledb import Connection + +logger = logging.getLogger(__name__) + +"""OracleSummary class""" + + +class OracleSummary: + """Get Summary + Args: + conn: Oracle Connection, + params: Summary parameters, + proxy: Proxy + """ + + def __init__( + self, conn: Connection, params: Dict[str, Any], proxy: Optional[str] = None + ): + self.conn = conn + self.proxy = proxy + self.summary_params = params + + def get_summary(self, docs: Any) -> List[str]: + """Get the summary of the input docs. + Args: + docs: The documents to generate summary for. + Allowed input types: str, Document, List[str], List[Document] + Returns: + List of summary text, one for each input doc. + """ + + try: + import oracledb + except ImportError as e: + raise ImportError( + "Unable to import oracledb, please install with " + "`pip install -U oracledb`." 
+ ) from e + + if docs is None: + return [] + + results: List[str] = [] + try: + oracledb.defaults.fetch_lobs = False + cursor = self.conn.cursor() + + if self.proxy: + cursor.execute( + "begin utl_http.set_proxy(:proxy); end;", proxy=self.proxy + ) + + if isinstance(docs, str): + results = [] + + summary = cursor.var(oracledb.DB_TYPE_CLOB) + cursor.execute( + """ + declare + input clob; + begin + input := :data; + :summ := dbms_vector_chain.utl_to_summary(input, json(:params)); + end;""", + data=docs, + params=json.dumps(self.summary_params), + summ=summary, + ) + + if summary is None: + results.append("") + else: + results.append(str(summary.getvalue())) + + elif isinstance(docs, Document): + results = [] + + summary = cursor.var(oracledb.DB_TYPE_CLOB) + cursor.execute( + """ + declare + input clob; + begin + input := :data; + :summ := dbms_vector_chain.utl_to_summary(input, json(:params)); + end;""", + data=docs.page_content, + params=json.dumps(self.summary_params), + summ=summary, + ) + + if summary is None: + results.append("") + else: + results.append(str(summary.getvalue())) + + elif isinstance(docs, List): + results = [] + + for doc in docs: + summary = cursor.var(oracledb.DB_TYPE_CLOB) + if isinstance(doc, str): + cursor.execute( + """ + declare + input clob; + begin + input := :data; + :summ := dbms_vector_chain.utl_to_summary(input, + json(:params)); + end;""", + data=doc, + params=json.dumps(self.summary_params), + summ=summary, + ) + + elif isinstance(doc, Document): + cursor.execute( + """ + declare + input clob; + begin + input := :data; + :summ := dbms_vector_chain.utl_to_summary(input, + json(:params)); + end;""", + data=doc.page_content, + params=json.dumps(self.summary_params), + summ=summary, + ) + + else: + raise Exception("Invalid input type") + + if summary is None: + results.append("") + else: + results.append(str(summary.getvalue())) + + else: + raise Exception("Invalid input type") + + cursor.close() + return results + + except Exception as ex: + logger.info(f"An exception occurred :: {ex}") + traceback.print_exc() + cursor.close() + raise + + +# uncomment the following code block to run the test + +""" +# A sample unit test. + +''' get the Oracle connection ''' +conn = oracledb.connect( + user="", + password="", + dsn="") +print("Oracle connection is established...") + +''' params ''' +summary_params = {"provider": "database","glevel": "S", + "numParagraphs": 1,"language": "english"} +proxy = "" + +''' instance ''' +summ = OracleSummary(conn=conn, params=summary_params, proxy=proxy) + +summary = summ.get_summary("In the heart of the forest, " + + "a lone fox ventured out at dusk, seeking a lost treasure. " + + "With each step, memories flooded back, guiding its path. 
" + + "As the moon rose high, illuminating the night, the fox unearthed " + + "not gold, but a forgotten friendship, worth more than any riches.") +print(f"Summary generated by OracleSummary: {summary}") + +conn.close() +print("Connection is closed.") + +""" diff --git a/libs/community/langchain_community/vectorstores/__init__.py b/libs/community/langchain_community/vectorstores/__init__.py index 8c577b1be5..fe04af06db 100644 --- a/libs/community/langchain_community/vectorstores/__init__.py +++ b/libs/community/langchain_community/vectorstores/__init__.py @@ -178,6 +178,9 @@ if TYPE_CHECKING: from langchain_community.vectorstores.opensearch_vector_search import ( OpenSearchVectorSearch, ) + from langchain_community.vectorstores.oraclevs import ( + OracleVS, # noqa: F401 + ) from langchain_community.vectorstores.pathway import ( PathwayVectorClient, ) @@ -343,6 +346,7 @@ __all__ = [ "MyScaleSettings", "Neo4jVector", "NeuralDBVectorStore", + "OracleVS", "OpenSearchVectorSearch", "PGEmbedding", "PGVector", @@ -439,6 +443,7 @@ _module_lookup = { "Neo4jVector": "langchain_community.vectorstores.neo4j_vector", "NeuralDBVectorStore": "langchain_community.vectorstores.thirdai_neuraldb", "OpenSearchVectorSearch": "langchain_community.vectorstores.opensearch_vector_search", # noqa: E501 + "OracleVS": "langchain_community.vectorstores.oraclevs", "PathwayVectorClient": "langchain_community.vectorstores.pathway", "PGEmbedding": "langchain_community.vectorstores.pgembedding", "PGVector": "langchain_community.vectorstores.pgvector", diff --git a/libs/community/langchain_community/vectorstores/oraclevs.py b/libs/community/langchain_community/vectorstores/oraclevs.py new file mode 100644 index 0000000000..3c338c4786 --- /dev/null +++ b/libs/community/langchain_community/vectorstores/oraclevs.py @@ -0,0 +1,930 @@ +from __future__ import annotations + +import array +import functools +import hashlib +import json +import logging +import os +import uuid +from typing import ( + TYPE_CHECKING, + Any, + Callable, + Dict, + Iterable, + List, + Optional, + Tuple, + Type, + TypeVar, + Union, + cast, +) + +if TYPE_CHECKING: + from oracledb import Connection + +import numpy as np +from langchain_core.documents import Document +from langchain_core.embeddings import Embeddings +from langchain_core.vectorstores import VectorStore + +from langchain_community.vectorstores.utils import ( + DistanceStrategy, + maximal_marginal_relevance, +) + +logger = logging.getLogger(__name__) +log_level = os.getenv("LOG_LEVEL", "ERROR").upper() +logging.basicConfig( + level=getattr(logging, log_level), + format="%(asctime)s - %(levelname)s - %(message)s", +) + + +# Define a type variable that can be any kind of function +T = TypeVar("T", bound=Callable[..., Any]) + + +def _handle_exceptions(func: T) -> T: + @functools.wraps(func) + def wrapper(*args: Any, **kwargs: Any) -> Any: + try: + return func(*args, **kwargs) + except RuntimeError as db_err: + # Handle a known type of error (e.g., DB-related) specifically + logger.exception("DB-related error occurred.") + raise RuntimeError( + "Failed due to a DB issue: {}".format(db_err) + ) from db_err + except ValueError as val_err: + # Handle another known type of error specifically + logger.exception("Validation error.") + raise ValueError("Validation failed: {}".format(val_err)) from val_err + except Exception as e: + # Generic handler for all other exceptions + logger.exception("An unexpected error occurred: {}".format(e)) + raise RuntimeError("Unexpected error: {}".format(e)) from e + + 
return cast(T, wrapper) + + +def _table_exists(client: Connection, table_name: str) -> bool: + try: + import oracledb + except ImportError as e: + raise ImportError( + "Unable to import oracledb, please install with " + "`pip install -U oracledb`." + ) from e + + try: + with client.cursor() as cursor: + cursor.execute(f"SELECT COUNT(*) FROM {table_name}") + return True + except oracledb.DatabaseError as ex: + err_obj = ex.args + if err_obj[0].code == 942: + return False + raise + + +@_handle_exceptions +def _index_exists(client: Connection, index_name: str) -> bool: + # Check if the index exists + query = """ + SELECT index_name + FROM all_indexes + WHERE upper(index_name) = upper(:idx_name) + """ + + with client.cursor() as cursor: + # Execute the query + cursor.execute(query, idx_name=index_name.upper()) + result = cursor.fetchone() + + # Check if the index exists + return result is not None + + +def _get_distance_function(distance_strategy: DistanceStrategy) -> str: + # Dictionary to map distance strategies to their corresponding function + # names + distance_strategy2function = { + DistanceStrategy.EUCLIDEAN_DISTANCE: "EUCLIDEAN", + DistanceStrategy.DOT_PRODUCT: "DOT", + DistanceStrategy.COSINE: "COSINE", + } + + # Attempt to return the corresponding distance function + if distance_strategy in distance_strategy2function: + return distance_strategy2function[distance_strategy] + + # If it's an unsupported distance strategy, raise an error + raise ValueError(f"Unsupported distance strategy: {distance_strategy}") + + +def _get_index_name(base_name: str) -> str: + unique_id = str(uuid.uuid4()).replace("-", "") + return f"{base_name}_{unique_id}" + + +@_handle_exceptions +def _create_table(client: Connection, table_name: str, embedding_dim: int) -> None: + cols_dict = { + "id": "RAW(16) DEFAULT SYS_GUID() PRIMARY KEY", + "text": "CLOB", + "metadata": "CLOB", + "embedding": f"vector({embedding_dim}, FLOAT32)", + } + + if not _table_exists(client, table_name): + with client.cursor() as cursor: + ddl_body = ", ".join( + f"{col_name} {col_type}" for col_name, col_type in cols_dict.items() + ) + ddl = f"CREATE TABLE {table_name} ({ddl_body})" + cursor.execute(ddl) + logger.info("Table created successfully...") + else: + logger.info("Table already exists...") + + +@_handle_exceptions +def create_index( + client: Connection, + vector_store: OracleVS, + params: Optional[dict[str, Any]] = None, +) -> None: + if params: + if params["idx_type"] == "HNSW": + _create_hnsw_index( + client, vector_store.table_name, vector_store.distance_strategy, params + ) + elif params["idx_type"] == "IVF": + _create_ivf_index( + client, vector_store.table_name, vector_store.distance_strategy, params + ) + else: + _create_hnsw_index( + client, vector_store.table_name, vector_store.distance_strategy, params + ) + else: + _create_hnsw_index( + client, vector_store.table_name, vector_store.distance_strategy, params + ) + return + + +@_handle_exceptions +def _create_hnsw_index( + client: Connection, + table_name: str, + distance_strategy: DistanceStrategy, + params: Optional[dict[str, Any]] = None, +) -> None: + defaults = { + "idx_name": "HNSW", + "idx_type": "HNSW", + "neighbors": 32, + "efConstruction": 200, + "accuracy": 90, + "parallel": 8, + } + + if params: + config = params.copy() + # Ensure compulsory parts are included + for compulsory_key in ["idx_name", "parallel"]: + if compulsory_key not in config: + if compulsory_key == "idx_name": + config[compulsory_key] = _get_index_name( + str(defaults[compulsory_key]) + ) 
+ else: + config[compulsory_key] = defaults[compulsory_key] + + # Validate keys in config against defaults + for key in config: + if key not in defaults: + raise ValueError(f"Invalid parameter: {key}") + else: + config = defaults + + # Base SQL statement + idx_name = config["idx_name"] + base_sql = ( + f"create vector index {idx_name} on {table_name}(embedding) " + f"ORGANIZATION INMEMORY NEIGHBOR GRAPH" + ) + + # Optional parts depending on parameters + accuracy_part = " WITH TARGET ACCURACY {accuracy}" if ("accuracy" in config) else "" + distance_part = f" DISTANCE {_get_distance_function(distance_strategy)}" + + parameters_part = "" + if "neighbors" in config and "efConstruction" in config: + parameters_part = ( + " parameters (type {idx_type}, neighbors {" + "neighbors}, efConstruction {efConstruction})" + ) + elif "neighbors" in config and "efConstruction" not in config: + config["efConstruction"] = defaults["efConstruction"] + parameters_part = ( + " parameters (type {idx_type}, neighbors {" + "neighbors}, efConstruction {efConstruction})" + ) + elif "neighbors" not in config and "efConstruction" in config: + config["neighbors"] = defaults["neighbors"] + parameters_part = ( + " parameters (type {idx_type}, neighbors {" + "neighbors}, efConstruction {efConstruction})" + ) + + # Always included part for parallel + parallel_part = " parallel {parallel}" + + # Combine all parts + ddl_assembly = ( + base_sql + accuracy_part + distance_part + parameters_part + parallel_part + ) + # Format the SQL with values from the params dictionary + ddl = ddl_assembly.format(**config) + + # Check if the index exists + if not _index_exists(client, config["idx_name"]): + with client.cursor() as cursor: + cursor.execute(ddl) + logger.info("Index created successfully...") + else: + logger.info("Index already exists...") + + +@_handle_exceptions +def _create_ivf_index( + client: Connection, + table_name: str, + distance_strategy: DistanceStrategy, + params: Optional[dict[str, Any]] = None, +) -> None: + # Default configuration + defaults = { + "idx_name": "IVF", + "idx_type": "IVF", + "neighbor_part": 32, + "accuracy": 90, + "parallel": 8, + } + + if params: + config = params.copy() + # Ensure compulsory parts are included + for compulsory_key in ["idx_name", "parallel"]: + if compulsory_key not in config: + if compulsory_key == "idx_name": + config[compulsory_key] = _get_index_name( + str(defaults[compulsory_key]) + ) + else: + config[compulsory_key] = defaults[compulsory_key] + + # Validate keys in config against defaults + for key in config: + if key not in defaults: + raise ValueError(f"Invalid parameter: {key}") + else: + config = defaults + + # Base SQL statement + idx_name = config["idx_name"] + base_sql = ( + f"CREATE VECTOR INDEX {idx_name} ON {table_name}(embedding) " + f"ORGANIZATION NEIGHBOR PARTITIONS" + ) + + # Optional parts depending on parameters + accuracy_part = " WITH TARGET ACCURACY {accuracy}" if ("accuracy" in config) else "" + distance_part = f" DISTANCE {_get_distance_function(distance_strategy)}" + + parameters_part = "" + if "idx_type" in config and "neighbor_part" in config: + parameters_part = ( + f" PARAMETERS (type {config['idx_type']}, neighbor" + f" partitions {config['neighbor_part']})" + ) + + # Always included part for parallel + parallel_part = f" PARALLEL {config['parallel']}" + + # Combine all parts + ddl_assembly = ( + base_sql + accuracy_part + distance_part + parameters_part + parallel_part + ) + # Format the SQL with values from the params dictionary + ddl = 
ddl_assembly.format(**config) + + # Check if the index exists + if not _index_exists(client, config["idx_name"]): + with client.cursor() as cursor: + cursor.execute(ddl) + logger.info("Index created successfully...") + else: + logger.info("Index already exists...") + + +@_handle_exceptions +def drop_table_purge(client: Connection, table_name: str) -> None: + if _table_exists(client, table_name): + cursor = client.cursor() + with cursor: + ddl = f"DROP TABLE {table_name} PURGE" + cursor.execute(ddl) + logger.info("Table dropped successfully...") + else: + logger.info("Table not found...") + return + + +@_handle_exceptions +def drop_index_if_exists(client: Connection, index_name: str) -> None: + if _index_exists(client, index_name): + drop_query = f"DROP INDEX {index_name}" + with client.cursor() as cursor: + cursor.execute(drop_query) + logger.info(f"Index {index_name} has been dropped.") + else: + logger.info(f"Index {index_name} does not exist.") + return + + +class OracleVS(VectorStore): + """`OracleVS` vector store. + + To use, you should have both: + - the ``oracledb`` python package installed + - a connection to an Oracle Database that supports AI Vector Search + + Example: + .. code-block:: python + + from langchain.vectorstores import OracleVS + from langchain.embeddings.openai import OpenAIEmbeddings + import oracledb + + with oracledb.connect(user=user, password=pwd, dsn=dsn) as + connection: + print("Database version:", connection.version) + embeddings = OpenAIEmbeddings() + query = "" + vectors = OracleVS(connection, table_name, embeddings, query) + """ + + def __init__( + self, + client: Connection, + embedding_function: Union[ + Callable[[str], List[float]], + Embeddings, + ], + table_name: str, + distance_strategy: DistanceStrategy = DistanceStrategy.EUCLIDEAN_DISTANCE, + query: Optional[str] = "What is an Oracle database", + params: Optional[Dict[str, Any]] = None, + ): + try: + import oracledb + except ImportError as e: + raise ImportError( + "Unable to import oracledb, please install with " + "`pip install -U oracledb`." + ) from e + + try: + """Initialize with oracledb client.""" + self.client = client + """Initialize with necessary components.""" + if not isinstance(embedding_function, Embeddings): + logger.warning( + "`embedding_function` is expected to be an Embeddings " + "object, support " + "for passing in a function will soon be removed." + ) + self.embedding_function = embedding_function + self.query = query + embedding_dim = self.get_embedding_dimension() + + self.table_name = table_name + self.distance_strategy = distance_strategy + self.params = params + + _create_table(client, table_name, embedding_dim) + except oracledb.DatabaseError as db_err: + logger.exception(f"Database error occurred while creating table: {db_err}") + raise RuntimeError( + "Failed to create table due to a database error." + ) from db_err + except ValueError as val_err: + logger.exception(f"Validation error: {val_err}") + raise RuntimeError( + "Failed to create table due to a validation error." + ) from val_err + except Exception as ex: + logger.exception("An unexpected error occurred while creating the table.") + raise RuntimeError( + "Failed to create table due to an unexpected error." + ) from ex + + @property + def embeddings(self) -> Optional[Embeddings]: + """ + A property that returns an Embeddings instance if embedding_function + is an instance of Embeddings, otherwise returns None. 
+class OracleVS(VectorStore):
+    """`OracleVS` vector store.
+
+    To use, you should have both:
+    - the ``oracledb`` python package installed
+    - a connection to an Oracle Database with AI Vector Search support
+
+    Example:
+        .. code-block:: python
+
+            from langchain.vectorstores import OracleVS
+            from langchain.embeddings.openai import OpenAIEmbeddings
+            import oracledb
+
+            with oracledb.connect(user=user, password=pwd, dsn=dsn) as connection:
+                print("Database version:", connection.version)
+                embeddings = OpenAIEmbeddings()
+                query = ""
+                vectors = OracleVS(connection, embeddings, table_name, query=query)
+    """
+
+    def __init__(
+        self,
+        client: Connection,
+        embedding_function: Union[
+            Callable[[str], List[float]],
+            Embeddings,
+        ],
+        table_name: str,
+        distance_strategy: DistanceStrategy = DistanceStrategy.EUCLIDEAN_DISTANCE,
+        query: Optional[str] = "What is an Oracle database",
+        params: Optional[Dict[str, Any]] = None,
+    ):
+        try:
+            import oracledb
+        except ImportError as e:
+            raise ImportError(
+                "Unable to import oracledb, please install with "
+                "`pip install -U oracledb`."
+            ) from e
+
+        try:
+            # Initialize with oracledb client and necessary components.
+            self.client = client
+            if not isinstance(embedding_function, Embeddings):
+                logger.warning(
+                    "`embedding_function` is expected to be an Embeddings "
+                    "object; support "
+                    "for passing in a function will soon be removed."
+                )
+            self.embedding_function = embedding_function
+            self.query = query
+            embedding_dim = self.get_embedding_dimension()
+
+            self.table_name = table_name
+            self.distance_strategy = distance_strategy
+            self.params = params
+
+            _create_table(client, table_name, embedding_dim)
+        except oracledb.DatabaseError as db_err:
+            logger.exception(f"Database error occurred while creating table: {db_err}")
+            raise RuntimeError(
+                "Failed to create table due to a database error."
+            ) from db_err
+        except ValueError as val_err:
+            logger.exception(f"Validation error: {val_err}")
+            raise RuntimeError(
+                "Failed to create table due to a validation error."
+            ) from val_err
+        except Exception as ex:
+            logger.exception("An unexpected error occurred while creating the table.")
+            raise RuntimeError(
+                "Failed to create table due to an unexpected error."
+            ) from ex
+
+    @property
+    def embeddings(self) -> Optional[Embeddings]:
+        """
+        A property that returns an Embeddings instance if embedding_function
+        is an instance of Embeddings, otherwise returns None.
+
+        Returns:
+            Optional[Embeddings]: The embedding function if it's an instance of
+            Embeddings, otherwise None.
+        """
+        return (
+            self.embedding_function
+            if isinstance(self.embedding_function, Embeddings)
+            else None
+        )
+
+    def get_embedding_dimension(self) -> int:
+        # Embed the single document by wrapping it in a list
+        embedded_document = self._embed_documents(
+            [self.query if self.query is not None else ""]
+        )
+
+        # Get the first (and only) embedding's dimension
+        return len(embedded_document[0])
+
+    def _embed_documents(self, texts: List[str]) -> List[List[float]]:
+        if isinstance(self.embedding_function, Embeddings):
+            return self.embedding_function.embed_documents(texts)
+        elif callable(self.embedding_function):
+            return [self.embedding_function(text) for text in texts]
+        else:
+            raise TypeError(
+                "The embedding_function is neither Embeddings nor callable."
+            )
+
+    def _embed_query(self, text: str) -> List[float]:
+        if isinstance(self.embedding_function, Embeddings):
+            return self.embedding_function.embed_query(text)
+        else:
+            return self.embedding_function(text)
+
+    @_handle_exceptions
+    def add_texts(
+        self,
+        texts: Iterable[str],
+        metadatas: Optional[List[Dict[Any, Any]]] = None,
+        ids: Optional[List[str]] = None,
+        **kwargs: Any,
+    ) -> List[str]:
+        """Add more texts to the vectorstore index.
+        Args:
+            texts: Iterable of strings to add to the vectorstore.
+            metadatas: Optional list of metadatas associated with the texts.
+            ids: Optional list of ids for the texts that are being added to
+            the vector store.
+            kwargs: vectorstore specific parameters
+        """
+
+        texts = list(texts)
+        if ids:
+            # If ids are provided, hash them to maintain consistency
+            processed_ids = [
+                hashlib.sha256(_id.encode()).hexdigest()[:16].upper() for _id in ids
+            ]
+        elif metadatas and all("id" in metadata for metadata in metadatas):
+            # If no ids are provided but metadatas with ids are, generate
+            # ids from metadatas
+            processed_ids = [
+                hashlib.sha256(metadata["id"].encode()).hexdigest()[:16].upper()
+                for metadata in metadatas
+            ]
+        else:
+            # Generate new ids if none are provided
+            generated_ids = [
+                str(uuid.uuid4()) for _ in texts
+            ]  # uuid4 is more standard for random UUIDs
+            processed_ids = [
+                hashlib.sha256(_id.encode()).hexdigest()[:16].upper()
+                for _id in generated_ids
+            ]
+
+        embeddings = self._embed_documents(texts)
+        if not metadatas:
+            metadatas = [{} for _ in texts]
+        docs = [
+            (id_, text, json.dumps(metadata), array.array("f", embedding))
+            for id_, text, metadata, embedding in zip(
+                processed_ids, texts, metadatas, embeddings
+            )
+        ]
+
+        with self.client.cursor() as cursor:
+            cursor.executemany(
+                f"INSERT INTO {self.table_name} (id, text, metadata, "
+                f"embedding) VALUES (:1, :2, :3, :4)",
+                docs,
+            )
+            self.client.commit()
+        return processed_ids
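+
+    # Illustrative usage sketch (assumes an OracleVS instance `vs`). Ids are
+    # not stored verbatim: each supplied id (or metadata "id", or a generated
+    # UUID) is reduced to an upper-cased 16-hex-char SHA-256 prefix, and
+    # `delete` applies the same hashing, so it should be given the original
+    # ids:
+    #
+    #     returned_ids = vs.add_texts(
+    #         ["first doc", "second doc"],
+    #         metadatas=[{"id": "d1"}, {"id": "d2"}],
+    #     )
+    #     vs.delete(ids=["d1"])  # hashes "d1" the same way before deleting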
+    def similarity_search(
+        self,
+        query: str,
+        k: int = 4,
+        filter: Optional[Dict[str, Any]] = None,
+        **kwargs: Any,
+    ) -> List[Document]:
+        """Return docs most similar to query."""
+        embedding = self._embed_query(query)
+        documents = self.similarity_search_by_vector(
+            embedding=embedding, k=k, filter=filter, **kwargs
+        )
+        return documents
+
+    def similarity_search_by_vector(
+        self,
+        embedding: List[float],
+        k: int = 4,
+        filter: Optional[dict[str, Any]] = None,
+        **kwargs: Any,
+    ) -> List[Document]:
+        docs_and_scores = self.similarity_search_by_vector_with_relevance_scores(
+            embedding=embedding, k=k, filter=filter, **kwargs
+        )
+        return [doc for doc, _ in docs_and_scores]
+
+    def similarity_search_with_score(
+        self,
+        query: str,
+        k: int = 4,
+        filter: Optional[dict[str, Any]] = None,
+        **kwargs: Any,
+    ) -> List[Tuple[Document, float]]:
+        """Return docs most similar to query, with distance scores."""
+        embedding = self._embed_query(query)
+        docs_and_scores = self.similarity_search_by_vector_with_relevance_scores(
+            embedding=embedding, k=k, filter=filter, **kwargs
+        )
+        return docs_and_scores
+
+    @_handle_exceptions
+    def _get_clob_value(self, result: Any) -> str:
+        try:
+            import oracledb
+        except ImportError as e:
+            raise ImportError(
+                "Unable to import oracledb, please install with "
+                "`pip install -U oracledb`."
+            ) from e
+
+        clob_value = ""
+        if result:
+            if isinstance(result, oracledb.LOB):
+                raw_data = result.read()
+                if isinstance(raw_data, bytes):
+                    clob_value = raw_data.decode(
+                        "utf-8"
+                    )  # Specify the correct encoding
+                else:
+                    clob_value = raw_data
+            elif isinstance(result, str):
+                clob_value = result
+            else:
+                raise TypeError(f"Unexpected type: {type(result)}")
+        return clob_value
+
+    @_handle_exceptions
+    def similarity_search_by_vector_with_relevance_scores(
+        self,
+        embedding: List[float],
+        k: int = 4,
+        filter: Optional[dict[str, Any]] = None,
+        **kwargs: Any,
+    ) -> List[Tuple[Document, float]]:
+        docs_and_scores = []
+        embedding_arr = array.array("f", embedding)
+
+        query = f"""
+        SELECT id,
+          text,
+          metadata,
+          vector_distance(embedding, :embedding,
+          {_get_distance_function(self.distance_strategy)}) as distance
+        FROM {self.table_name}
+        ORDER BY distance
+        FETCH APPROX FIRST :k ROWS ONLY
+        """
+        # Execute the query
+        with self.client.cursor() as cursor:
+            cursor.execute(query, embedding=embedding_arr, k=k)
+            results = cursor.fetchall()
+
+        # Filter results if filter is provided
+        for result in results:
+            metadata = json.loads(
+                self._get_clob_value(result[2]) if result[2] is not None else "{}"
+            )
+
+            # Apply filtering based on the 'filter' dictionary
+            if filter:
+                if all(metadata.get(key) in value for key, value in filter.items()):
+                    doc = Document(
+                        page_content=(
+                            self._get_clob_value(result[1])
+                            if result[1] is not None
+                            else ""
+                        ),
+                        metadata=metadata,
+                    )
+                    distance = result[3]
+                    docs_and_scores.append((doc, distance))
+            else:
+                doc = Document(
+                    page_content=(
+                        self._get_clob_value(result[1])
+                        if result[1] is not None
+                        else ""
+                    ),
+                    metadata=metadata,
+                )
+                distance = result[3]
+                docs_and_scores.append((doc, distance))
+
+        return docs_and_scores
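+
+    # Filter semantics (sketch): each value in `filter` is expected to be a
+    # collection, and a row is kept only when metadata[key] is contained in
+    # that collection for every key. Filtering happens after the top-k
+    # fetch, so fewer than k results may come back, e.g.:
+    #
+    #     vs.similarity_search_by_vector_with_relevance_scores(
+    #         embedding, k=4, filter={"id": ["101", "102"]}
+    #     )
+    #
+    # (assumes an OracleVS instance `vs` and a query vector `embedding`)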
+    @_handle_exceptions
+    def similarity_search_by_vector_returning_embeddings(
+        self,
+        embedding: List[float],
+        k: int,
+        filter: Optional[Dict[str, Any]] = None,
+        **kwargs: Any,
+    ) -> List[Tuple[Document, float, np.ndarray[np.float32, Any]]]:
+        documents = []
+        embedding_arr = array.array("f", embedding)
+
+        query = f"""
+        SELECT id,
+          text,
+          metadata,
+          vector_distance(embedding, :embedding, {_get_distance_function(
+            self.distance_strategy)}) as distance,
+          embedding
+        FROM {self.table_name}
+        ORDER BY distance
+        FETCH APPROX FIRST :k ROWS ONLY
+        """
+
+        # Execute the query
+        with self.client.cursor() as cursor:
+            cursor.execute(query, embedding=embedding_arr, k=k)
+            results = cursor.fetchall()
+
+        for result in results:
+            page_content_str = self._get_clob_value(result[1])
+            metadata_str = self._get_clob_value(result[2])
+            metadata = json.loads(metadata_str)
+
+            # Apply filter if provided and matches; otherwise, add all
+            # documents
+            if not filter or all(
+                metadata.get(key) in value for key, value in filter.items()
+            ):
+                document = Document(
+                    page_content=page_content_str, metadata=metadata
+                )
+                distance = result[3]
+                # Assuming result[4] is already in the correct format;
+                # adjust if necessary
+                current_embedding = (
+                    np.array(result[4], dtype=np.float32)
+                    if result[4]
+                    else np.empty(0, dtype=np.float32)
+                )
+                documents.append((document, distance, current_embedding))
+        return documents  # type: ignore
+
+    @_handle_exceptions
+    def max_marginal_relevance_search_with_score_by_vector(
+        self,
+        embedding: List[float],
+        *,
+        k: int = 4,
+        fetch_k: int = 20,
+        lambda_mult: float = 0.5,
+        filter: Optional[Dict[str, Any]] = None,
+    ) -> List[Tuple[Document, float]]:
+        """Return docs and their similarity scores selected using the
+        maximal marginal relevance.
+
+        Maximal marginal relevance optimizes for similarity to query AND
+        diversity among selected documents.
+
+        Args:
+            self: An instance of the class
+            embedding: Embedding to look up documents similar to.
+            k: Number of Documents to return. Defaults to 4.
+            fetch_k: Number of Documents to fetch before filtering to
+                     pass to MMR algorithm.
+            filter: (Optional[Dict[str, str]]): Filter by metadata. Defaults
+                    to None.
+            lambda_mult: Number between 0 and 1 that determines the degree
+                         of diversity among the results with 0 corresponding
+                         to maximum diversity and 1 to minimum diversity.
+                         Defaults to 0.5.
+        Returns:
+            List of Documents and similarity scores selected by maximal
+            marginal relevance.
+        """
+
+        # Fetch documents along with their scores and embeddings
+        docs_scores_embeddings = self.similarity_search_by_vector_returning_embeddings(
+            embedding, fetch_k, filter=filter
+        )
+
+        # Split documents, scores and embeddings for the MMR calculation
+        documents, scores, embeddings = (
+            zip(*docs_scores_embeddings) if docs_scores_embeddings else ([], [], [])
+        )
+
+        # maximal_marginal_relevance takes the query embedding and the
+        # candidate embeddings, and returns indices of selected docs
+        mmr_selected_indices = maximal_marginal_relevance(
+            np.array(embedding, dtype=np.float32),
+            list(embeddings),
+            k=k,
+            lambda_mult=lambda_mult,
+        )
+
+        # Filter documents based on MMR-selected indices and map scores
+        mmr_selected_documents_with_scores = [
+            (documents[i], scores[i]) for i in mmr_selected_indices
+        ]
+
+        return mmr_selected_documents_with_scores
+
+    @_handle_exceptions
+    def max_marginal_relevance_search_by_vector(
+        self,
+        embedding: List[float],
+        k: int = 4,
+        fetch_k: int = 20,
+        lambda_mult: float = 0.5,
+        filter: Optional[Dict[str, Any]] = None,
+        **kwargs: Any,
+    ) -> List[Document]:
+        """Return docs selected using the maximal marginal relevance.
+
+        Maximal marginal relevance optimizes for similarity to query AND
+        diversity among selected documents.
+
+        Args:
+            self: An instance of the class
+            embedding: Embedding to look up documents similar to.
+            k: Number of Documents to return. Defaults to 4.
+            fetch_k: Number of Documents to fetch to pass to MMR algorithm.
+            lambda_mult: Number between 0 and 1 that determines the degree
+                         of diversity among the results with 0 corresponding
+                         to maximum diversity and 1 to minimum diversity.
+                         Defaults to 0.5.
+            filter: Optional[Dict[str, Any]]
+            **kwargs: Any
+        Returns:
+            List of Documents selected by maximal marginal relevance.
+        """
+        docs_and_scores = self.max_marginal_relevance_search_with_score_by_vector(
+            embedding, k=k, fetch_k=fetch_k, lambda_mult=lambda_mult, filter=filter
+        )
+        return [doc for doc, _ in docs_and_scores]
+
+    @_handle_exceptions
+    def max_marginal_relevance_search(
+        self,
+        query: str,
+        k: int = 4,
+        fetch_k: int = 20,
+        lambda_mult: float = 0.5,
+        filter: Optional[Dict[str, Any]] = None,
+        **kwargs: Any,
+    ) -> List[Document]:
+        """Return docs selected using the maximal marginal relevance.
+
+        Maximal marginal relevance optimizes for similarity to query AND
+        diversity among selected documents.
+
+        Args:
+            self: An instance of the class
+            query: Text to look up documents similar to.
+            k: Number of Documents to return. Defaults to 4.
+            fetch_k: Number of Documents to fetch to pass to MMR algorithm.
+            lambda_mult: Number between 0 and 1 that determines the degree
+                         of diversity among the results with 0 corresponding
+                         to maximum diversity and 1 to minimum diversity.
+                         Defaults to 0.5.
+            filter: Optional[Dict[str, Any]]
+            **kwargs
+        Returns:
+            List of Documents selected by maximal marginal relevance.
+
+        `max_marginal_relevance_search` requires that the underlying search
+        return the matched embeddings alongside the matched documents.
+        """
+        embedding = self._embed_query(query)
+        documents = self.max_marginal_relevance_search_by_vector(
+            embedding,
+            k=k,
+            fetch_k=fetch_k,
+            lambda_mult=lambda_mult,
+            filter=filter,
+            **kwargs,
+        )
+        return documents
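+
+    # Illustrative usage sketch (assumes an OracleVS instance `vs`): pull
+    # the 20 nearest rows, then re-rank down to 4 balancing relevance
+    # against diversity:
+    #
+    #     docs = vs.max_marginal_relevance_search(
+    #         "What is Oracle AI Vector Search?", k=4, fetch_k=20,
+    #         lambda_mult=0.5,
+    #     )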
+ """ + docs_and_scores = self.max_marginal_relevance_search_with_score_by_vector( + embedding, k=k, fetch_k=fetch_k, lambda_mult=lambda_mult, filter=filter + ) + return [doc for doc, _ in docs_and_scores] + + @_handle_exceptions + def max_marginal_relevance_search( + self, + query: str, + k: int = 4, + fetch_k: int = 20, + lambda_mult: float = 0.5, + filter: Optional[Dict[str, Any]] = None, + **kwargs: Any, + ) -> List[Document]: + """Return docs selected using the maximal marginal relevance. + + Maximal marginal relevance optimizes for similarity to query AND + diversity + among selected documents. + + Args: + self: An instance of the class + query: Text to look up documents similar to. + k: Number of Documents to return. Defaults to 4. + fetch_k: Number of Documents to fetch to pass to MMR algorithm. + lambda_mult: Number between 0 and 1 that determines the degree + of diversity among the results with 0 corresponding + to maximum diversity and 1 to minimum diversity. + Defaults to 0.5. + filter: Optional[Dict[str, Any]] + **kwargs + Returns: + List of Documents selected by maximal marginal relevance. + + `max_marginal_relevance_search` requires that `query` returns matched + embeddings alongside the match documents. + """ + embedding = self._embed_query(query) + documents = self.max_marginal_relevance_search_by_vector( + embedding, + k=k, + fetch_k=fetch_k, + lambda_mult=lambda_mult, + filter=filter, + **kwargs, + ) + return documents + + @_handle_exceptions + def delete(self, ids: Optional[List[str]] = None, **kwargs: Any) -> None: + """Delete by vector IDs. + Args: + self: An instance of the class + ids: List of ids to delete. + **kwargs + """ + + if ids is None: + raise ValueError("No ids provided to delete.") + + # Compute SHA-256 hashes of the ids and truncate them + hashed_ids = [ + hashlib.sha256(_id.encode()).hexdigest()[:16].upper() for _id in ids + ] + + # Constructing the SQL statement with individual placeholders + placeholders = ", ".join([":id" + str(i + 1) for i in range(len(hashed_ids))]) + + ddl = f"DELETE FROM {self.table_name} WHERE id IN ({placeholders})" + + # Preparing bind variables + bind_vars = { + f"id{i}": hashed_id for i, hashed_id in enumerate(hashed_ids, start=1) + } + + with self.client.cursor() as cursor: + cursor.execute(ddl, bind_vars) + self.client.commit() + + @classmethod + @_handle_exceptions + def from_texts( + cls: Type[OracleVS], + texts: Iterable[str], + embedding: Embeddings, + metadatas: Optional[List[dict]] = None, + **kwargs: Any, + ) -> OracleVS: + """Return VectorStore initialized from texts and embeddings.""" + client = kwargs.get("client") + if client is None: + raise ValueError("client parameter is required...") + params = kwargs.get("params", {}) + + table_name = str(kwargs.get("table_name", "langchain")) + + distance_strategy = cast( + DistanceStrategy, kwargs.get("distance_strategy", None) + ) + if not isinstance(distance_strategy, DistanceStrategy): + raise TypeError( + f"Expected DistanceStrategy got " f"{type(distance_strategy).__name__} " + ) + + query = kwargs.get("query", "What is a Oracle database") + + drop_table_purge(client, table_name) + + vss = cls( + client=client, + embedding_function=embedding, + table_name=table_name, + distance_strategy=distance_strategy, + query=query, + params=params, + ) + vss.add_texts(texts=list(texts), metadatas=metadatas) + return vss diff --git a/libs/community/poetry.lock b/libs/community/poetry.lock index aa5dfd315f..321aad0338 100644 --- a/libs/community/poetry.lock +++ 
b/libs/community/poetry.lock @@ -1,4 +1,4 @@ -# This file is automatically @generated by Poetry 1.7.1 and should not be changed by hand. +# This file is automatically @generated by Poetry 1.8.2 and should not be changed by hand. [[package]] name = "aenum" @@ -5442,6 +5442,49 @@ text = ["spacy", "wordcloud (>=1.8.1)"] torch = ["oracle_ads[viz]", "torch", "torchvision"] viz = ["bokeh (>=3.0.0,<3.2.0)", "folium (>=0.12.1)", "graphviz (<0.17)", "scipy (>=1.5.4)", "seaborn (>=0.11.0)"] +[[package]] +name = "oracledb" +version = "2.2.0" +description = "Python interface to Oracle Database" +optional = true +python-versions = ">=3.7" +files = [ + {file = "oracledb-2.2.0-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:253a85eef53d97815b4d838e5275d0a99e33ec340eb4b945cd2371e2bcede46b"}, + {file = "oracledb-2.2.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:fa5c2982076366f59dade28b554b43a257ad426e55359124bc37f191f51c2d46"}, + {file = "oracledb-2.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:19408844bd4af5b4d40f06c3e5b88c6bfce4a749f61ab766f41b22c4070c5c15"}, + {file = "oracledb-2.2.0-cp310-cp310-win32.whl", hash = "sha256:c2e2e3f00d7eb7f4dabfa8996dc70db03bd7dbe474d2d1dc381daeff54cfdeff"}, + {file = "oracledb-2.2.0-cp310-cp310-win_amd64.whl", hash = "sha256:efed536635b0fec5c1484eda55fad4affa57672b87596ec6273123a3133ba5b6"}, + {file = "oracledb-2.2.0-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:c4b7e14b04dc2af4697ca561f9bcac110a67a7be2ccf868d789e92771017feca"}, + {file = "oracledb-2.2.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:61bbf9cd64a2f3b65a12550329b2f0caed7d9aa5e892c0ce69d9ea7b3cb3cb8e"}, + {file = "oracledb-2.2.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:4e461d1c7ef4d3f03d84595a13754390a62300976782d7c29efc07fcc915e1b3"}, + {file = "oracledb-2.2.0-cp311-cp311-win32.whl", hash = "sha256:6c7da69d18cf02e469e15215af9c6f219256972a172c0e544a2ecc2a5cab9aa5"}, + {file = "oracledb-2.2.0-cp311-cp311-win_amd64.whl", hash = "sha256:d0245f677e27ee0990eb0213485031dacdc837a89569563f1594b82ccb362255"}, + {file = "oracledb-2.2.0-cp312-cp312-macosx_10_9_universal2.whl", hash = "sha256:10d2cd354a15e2b7e191256a0179874068fc64fa6543b2e20c9c1c38f0dd0839"}, + {file = "oracledb-2.2.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:fbf07e0e88c9ff1555c9301d95c69e0d48263cf7df63172043fe0a042539e687"}, + {file = "oracledb-2.2.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:c6a1365d3e05ca73b638ef939f9a609fed0ae5da75d13b2cfb75601ab8b85fce"}, + {file = "oracledb-2.2.0-cp312-cp312-win32.whl", hash = "sha256:3fe57091a1463efac692b352e99f9daeab5ab375bab2060c5caba9a3a7743c15"}, + {file = "oracledb-2.2.0-cp312-cp312-win_amd64.whl", hash = "sha256:e5ca9c050e18b2b1005b40d44a2098155445836071253ee5d547c7f285fc7729"}, + {file = "oracledb-2.2.0-cp37-cp37m-macosx_10_9_x86_64.whl", hash = "sha256:b5ad105aabc8ff32e3d3a343a92cf84976cf2454b6a6ff02065383fc3863e68d"}, + {file = "oracledb-2.2.0-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:14a7f2572c358604186d857c80f384ad03226e372731770911856541a06bdd34"}, + {file = "oracledb-2.2.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:aa1fe78ed0cbf98593c1f3f620f751b725b189f8c845577e39a372f44b2bf384"}, + {file = "oracledb-2.2.0-cp37-cp37m-win32.whl", hash = "sha256:bcef115bd147d6f267e3b09cbc3fc04189bff69e94d05c1e266c698668061e8d"}, + 
{file = "oracledb-2.2.0-cp37-cp37m-win_amd64.whl", hash = "sha256:1272bf562bcd6ff5e23b1e1fe8c3363d7a66fe8f48b1e00c4fb081d5436e1df5"}, + {file = "oracledb-2.2.0-cp38-cp38-macosx_11_0_universal2.whl", hash = "sha256:e0010aee0ed0a57964ce9f6cb0e2315a4ffce947121e0bb1c618e5091e64bab4"}, + {file = "oracledb-2.2.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:437d7c5a36f7e72ca36e1ac3f1a7c087bffa1cd0ba3a84471e54506c8572a5ad"}, + {file = "oracledb-2.2.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:581b7067283910a53b1ac1a50c0046058a21bd5c073d529bf695113db6d25f62"}, + {file = "oracledb-2.2.0-cp38-cp38-win32.whl", hash = "sha256:97fdc27a15f6441434a7ef563f522c8ceac19c2933f2da1082125670a2e2fc6b"}, + {file = "oracledb-2.2.0-cp38-cp38-win_amd64.whl", hash = "sha256:c22a2052997a01e59a4c9c33c9c0593eebcb1d893addeda9cd57003c2e088a85"}, + {file = "oracledb-2.2.0-cp39-cp39-macosx_10_9_universal2.whl", hash = "sha256:b924ee3e7d41edb367e5bb4cbb30990ad447fedda9ef0fe29b691d36a8d338c2"}, + {file = "oracledb-2.2.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:de3f9fa10b5f5c5dbe80dc7bdea5e5746abd411217e812fae66cc61c68f3f8f6"}, + {file = "oracledb-2.2.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:ba96a450275bceb5e0928e0dc01b5fb200e81ba04e99499d4930ccba681fd88a"}, + {file = "oracledb-2.2.0-cp39-cp39-win32.whl", hash = "sha256:35b6524b57979dbe8463af06648ad9972bce06e014a292ad96fec34c62665a8b"}, + {file = "oracledb-2.2.0-cp39-cp39-win_amd64.whl", hash = "sha256:0b4968f39871d501ab16a2fe05b5b4ae954e338e6b9dcefeb9bced998ddd4c4b"}, + {file = "oracledb-2.2.0.tar.gz", hash = "sha256:f52c7df38b13243b5ce583457b80748a34682b9bb8370da2497868b71976798b"}, +] + +[package.dependencies] +cryptography = ">=3.2.1" + [[package]] name = "orjson" version = "3.9.15" @@ -6608,26 +6651,31 @@ python-versions = ">=3.8" files = [ {file = "PyMuPDF-1.23.26-cp310-none-macosx_10_9_x86_64.whl", hash = "sha256:645a05321aecc8c45739f71f0eb574ce33138d19189582ffa5241fea3a8e2549"}, {file = "PyMuPDF-1.23.26-cp310-none-macosx_11_0_arm64.whl", hash = "sha256:2dfc9e010669ae92fade6fb72aaea49ebe3b8dcd7ee4dcbbe50115abcaa4d3fe"}, + {file = "PyMuPDF-1.23.26-cp310-none-manylinux2014_aarch64.whl", hash = "sha256:734ee380b3abd038602be79114194a3cb74ac102b7c943bcb333104575922c50"}, {file = "PyMuPDF-1.23.26-cp310-none-manylinux2014_x86_64.whl", hash = "sha256:b22f8d854f8196ad5b20308c1cebad3d5189ed9f0988acbafa043947ea7e6c55"}, {file = "PyMuPDF-1.23.26-cp310-none-win32.whl", hash = "sha256:cc0f794e3466bc96b5bf79d42fbc1551428751e3fef38ebc10ac70396b676144"}, {file = "PyMuPDF-1.23.26-cp310-none-win_amd64.whl", hash = "sha256:2eb701247d8e685a24e45899d1175f01a3ce5fc792a4431c91fbb68633b29298"}, {file = "PyMuPDF-1.23.26-cp311-none-macosx_10_9_x86_64.whl", hash = "sha256:e2804a64bb57da414781e312fb0561f6be67658ad57ed4a73dce008b23fc70a6"}, {file = "PyMuPDF-1.23.26-cp311-none-macosx_11_0_arm64.whl", hash = "sha256:97b40bb22e3056874634617a90e0ed24a5172cf71791b9e25d1d91c6743bc567"}, + {file = "PyMuPDF-1.23.26-cp311-none-manylinux2014_aarch64.whl", hash = "sha256:fab8833559bc47ab26ce736f915b8fc1dd37c108049b90396f7cd5e1004d7593"}, {file = "PyMuPDF-1.23.26-cp311-none-manylinux2014_x86_64.whl", hash = "sha256:f25aafd3e7fb9d7761a22acf2b67d704f04cc36d4dc33a3773f0eb3f4ec3606f"}, {file = "PyMuPDF-1.23.26-cp311-none-win32.whl", hash = "sha256:05e672ed3e82caca7ef02a88ace30130b1dd392a1190f03b2b58ffe7aa331400"}, {file = "PyMuPDF-1.23.26-cp311-none-win_amd64.whl", hash = 
"sha256:92b3c4dd4d0491d495f333be2d41f4e1c155a409bc9d04b5ff29655dccbf4655"}, {file = "PyMuPDF-1.23.26-cp312-none-macosx_10_9_x86_64.whl", hash = "sha256:a217689ede18cc6991b4e6a78afee8a440b3075d53b9dec4ba5ef7487d4547e9"}, {file = "PyMuPDF-1.23.26-cp312-none-macosx_11_0_arm64.whl", hash = "sha256:42ad2b819b90ce1947e11b90ec5085889df0a2e3aa0207bc97ecacfc6157cabc"}, + {file = "PyMuPDF-1.23.26-cp312-none-manylinux2014_aarch64.whl", hash = "sha256:99607649f89a02bba7d8ebe96e2410664316adc95e9337f7dfeff6a154f93049"}, {file = "PyMuPDF-1.23.26-cp312-none-manylinux2014_x86_64.whl", hash = "sha256:bb42d4b8407b4de7cb58c28f01449f16f32a6daed88afb41108f1aeb3552bdd4"}, {file = "PyMuPDF-1.23.26-cp312-none-win32.whl", hash = "sha256:c40d044411615e6f0baa7d3d933b3032cf97e168c7fa77d1be8a46008c109aee"}, {file = "PyMuPDF-1.23.26-cp312-none-win_amd64.whl", hash = "sha256:3f876533aa7f9a94bcd9a0225ce72571b7808260903fec1d95c120bc842fb52d"}, {file = "PyMuPDF-1.23.26-cp38-none-macosx_10_9_x86_64.whl", hash = "sha256:52df831d46beb9ff494f5fba3e5d069af6d81f49abf6b6e799ee01f4f8fa6799"}, {file = "PyMuPDF-1.23.26-cp38-none-macosx_11_0_arm64.whl", hash = "sha256:0bbb0cf6593e53524f3fc26fb5e6ead17c02c64791caec7c4afe61b677dedf80"}, + {file = "PyMuPDF-1.23.26-cp38-none-manylinux2014_aarch64.whl", hash = "sha256:5ef4360f20015673c20cf59b7e19afc97168795188c584254ed3778cde43ce77"}, {file = "PyMuPDF-1.23.26-cp38-none-manylinux2014_x86_64.whl", hash = "sha256:d7cd88842b2e7f4c71eef4d87c98c35646b80b60e6375392d7ce40e519261f59"}, {file = "PyMuPDF-1.23.26-cp38-none-win32.whl", hash = "sha256:6577e2f473625e2d0df5f5a3bf1e4519e94ae749733cc9937994d1b256687bfa"}, {file = "PyMuPDF-1.23.26-cp38-none-win_amd64.whl", hash = "sha256:fbe1a3255b2cd0d769b2da2c4efdd0c0f30d4961a1aac02c0f75cf951b337aa4"}, {file = "PyMuPDF-1.23.26-cp39-none-macosx_10_9_x86_64.whl", hash = "sha256:73fce034f2afea886a59ead2d0caedf27e2b2a8558b5da16d0286882e0b1eb82"}, {file = "PyMuPDF-1.23.26-cp39-none-macosx_11_0_arm64.whl", hash = "sha256:b3de8618b7cb5b36db611083840b3bcf09b11a893e2d8262f4e042102c7e65de"}, + {file = "PyMuPDF-1.23.26-cp39-none-manylinux2014_aarch64.whl", hash = "sha256:879e7f5ad35709d8760ab6103c3d5dac8ab8043a856ab3653fd324af7358ee87"}, {file = "PyMuPDF-1.23.26-cp39-none-manylinux2014_x86_64.whl", hash = "sha256:deee96c2fd415ded7b5070d8d5b2c60679aee6ed0e28ac0d2cb998060d835c2c"}, {file = "PyMuPDF-1.23.26-cp39-none-win32.whl", hash = "sha256:9f7f4ef99dd8ac97fb0b852efa3dcbee515798078b6c79a6a13c7b1e7c5d41a4"}, {file = "PyMuPDF-1.23.26-cp39-none-win_amd64.whl", hash = "sha256:ba9a54552c7afb9ec85432c765e2fa9a81413acfaa7d70db7c9b528297749e5b"}, @@ -9996,9 +10044,9 @@ testing = ["big-O", "jaraco.functools", "jaraco.itertools", "more-itertools", "p [extras] cli = ["typer"] -extended-testing = ["aiosqlite", "aleph-alpha-client", "anthropic", "arxiv", "assemblyai", "atlassian-python-api", "azure-ai-documentintelligence", "azure-identity", "azure-search-documents", "beautifulsoup4", "bibtexparser", "cassio", "chardet", "cloudpickle", "cloudpickle", "cohere", "databricks-vectorsearch", "datasets", "dgml-utils", "elasticsearch", "esprima", "faiss-cpu", "feedparser", "fireworks-ai", "friendli-client", "geopandas", "gitpython", "google-cloud-documentai", "gql", "gradientai", "hdbcli", "hologres-vector", "html2text", "httpx", "httpx-sse", "javelin-sdk", "jinja2", "jq", "jsonschema", "lxml", "markdownify", "motor", "msal", "mwparserfromhell", "mwxml", "newspaper3k", "numexpr", "nvidia-riva-client", "oci", "openai", "openapi-pydantic", "oracle-ads", "pandas", "pdfminer-six", 
"pgvector", "praw", "premai", "psychicapi", "py-trello", "pyjwt", "pymupdf", "pypdf", "pypdfium2", "pyspark", "rank-bm25", "rapidfuzz", "rapidocr-onnxruntime", "rdflib", "requests-toolbelt", "rspace_client", "scikit-learn", "sqlite-vss", "streamlit", "sympy", "telethon", "tidb-vector", "timescale-vector", "tqdm", "tree-sitter", "tree-sitter-languages", "upstash-redis", "vdms", "xata", "xmltodict"] +extended-testing = ["aiosqlite", "aleph-alpha-client", "anthropic", "arxiv", "assemblyai", "atlassian-python-api", "azure-ai-documentintelligence", "azure-identity", "azure-search-documents", "beautifulsoup4", "bibtexparser", "cassio", "chardet", "cloudpickle", "cloudpickle", "cohere", "databricks-vectorsearch", "datasets", "dgml-utils", "elasticsearch", "esprima", "faiss-cpu", "feedparser", "fireworks-ai", "friendli-client", "geopandas", "gitpython", "google-cloud-documentai", "gql", "gradientai", "hdbcli", "hologres-vector", "html2text", "httpx", "httpx-sse", "javelin-sdk", "jinja2", "jq", "jsonschema", "lxml", "markdownify", "motor", "msal", "mwparserfromhell", "mwxml", "newspaper3k", "numexpr", "nvidia-riva-client", "oci", "openai", "openapi-pydantic", "oracle-ads", "oracledb", "pandas", "pdfminer-six", "pgvector", "praw", "premai", "psychicapi", "py-trello", "pyjwt", "pymupdf", "pypdf", "pypdfium2", "pyspark", "rank-bm25", "rapidfuzz", "rapidocr-onnxruntime", "rdflib", "requests-toolbelt", "rspace_client", "scikit-learn", "sqlite-vss", "streamlit", "sympy", "telethon", "tidb-vector", "timescale-vector", "tqdm", "tree-sitter", "tree-sitter-languages", "upstash-redis", "vdms", "xata", "xmltodict"] [metadata] lock-version = "2.0" python-versions = ">=3.8.1,<4.0" -content-hash = "df095fde2d52bf7dc2ebe6ddc54792cd2b0e189660a62af372480bb2cac20a3e" +content-hash = "7e34ee736dde96d95158527f55b184576ecef8bebcf85a00c3d4dc753d6ae28a" diff --git a/libs/community/pyproject.toml b/libs/community/pyproject.toml index 9b1aa3ef93..ec4e32e5ee 100644 --- a/libs/community/pyproject.toml +++ b/libs/community/pyproject.toml @@ -102,6 +102,7 @@ premai = {version = "^0.3.25", optional = true} vdms = {version = "^0.0.20", optional = true} httpx-sse = {version = "^0.4.0", optional = true} pyjwt = {version = "^2.8.0", optional = true} +oracledb = {version = "^2.2.0", optional = true} [tool.poetry.group.test] optional = true @@ -279,7 +280,8 @@ extended_testing = [ "premai", "vdms", "httpx-sse", - "pyjwt" + "pyjwt", + "oracledb" ] [tool.ruff] diff --git a/libs/community/tests/integration_tests/document_loaders/test_oracleds.py b/libs/community/tests/integration_tests/document_loaders/test_oracleds.py new file mode 100644 index 0000000000..498e4b6925 --- /dev/null +++ b/libs/community/tests/integration_tests/document_loaders/test_oracleds.py @@ -0,0 +1,447 @@ +# Authors: +# Sudhir Kumar (sudhirkk) +# +# ----------------------------------------------------------------------------- +# test_oracleds.py +# ----------------------------------------------------------------------------- +import sys + +from langchain_community.document_loaders.oracleai import ( + OracleDocLoader, + OracleTextSplitter, +) +from langchain_community.utilities.oracleai import OracleSummary +from langchain_community.vectorstores.oraclevs import ( + _table_exists, + drop_table_purge, +) + +uname = "hr" +passwd = "hr" +# uname = "LANGCHAINUSER" +# passwd = "langchainuser" +v_dsn = "100.70.107.245:1521/cdb1_pdb1.regress.rdbms.dev.us.oracle.com" + + +### Test loader ##### +def test_loader_test() -> None: + try: + import oracledb + except ImportError: + 
return + + try: + # oracle connection + connection = oracledb.connect(user=uname, password=passwd, dsn=v_dsn) + cursor = connection.cursor() + + if _table_exists(connection, "LANGCHAIN_DEMO"): + drop_table_purge(connection, "LANGCHAIN_DEMO") + + cursor.execute("CREATE TABLE langchain_demo(id number, text varchar2(25))") + + rows = [ + (1, "First"), + (2, "Second"), + (3, "Third"), + (4, "Fourth"), + (5, "Fifth"), + (6, "Sixth"), + (7, "Seventh"), + ] + + cursor.executemany("insert into LANGCHAIN_DEMO(id, text) values (:1, :2)", rows) + + connection.commit() + + # local file, local directory, database column + loader_params = { + "owner": uname, + "tablename": "LANGCHAIN_DEMO", + "colname": "TEXT", + } + + # instantiate + loader = OracleDocLoader(conn=connection, params=loader_params) + + # load + docs = loader.load() + + # verify + if len(docs) == 0: + sys.exit(1) + + if _table_exists(connection, "LANGCHAIN_DEMO"): + drop_table_purge(connection, "LANGCHAIN_DEMO") + + except Exception: + sys.exit(1) + + try: + # expectation : ORA-00942 + loader_params = { + "owner": uname, + "tablename": "COUNTRIES1", + "colname": "COUNTRY_NAME", + } + + # instantiate + loader = OracleDocLoader(conn=connection, params=loader_params) + + # load + docs = loader.load() + if len(docs) == 0: + pass + + except Exception: + pass + + try: + # expectation : file "SUDHIR" doesn't exist. + loader_params = {"file": "SUDHIR"} + + # instantiate + loader = OracleDocLoader(conn=connection, params=loader_params) + + # load + docs = loader.load() + if len(docs) == 0: + pass + + except Exception: + pass + + try: + # expectation : path "SUDHIR" doesn't exist. + loader_params = {"dir": "SUDHIR"} + + # instantiate + loader = OracleDocLoader(conn=connection, params=loader_params) + + # load + docs = loader.load() + if len(docs) == 0: + pass + + except Exception: + pass + + +### Test splitter #### +def test_splitter_test() -> None: + try: + import oracledb + except ImportError: + return + + try: + # oracle connection + connection = oracledb.connect(user=uname, password=passwd, dsn=v_dsn) + doc = """Langchain is a wonderful framework to load, split, chunk + and embed your data!!""" + + # by words , max = 1000 + splitter_params = { + "by": "words", + "max": "1000", + "overlap": "200", + "split": "custom", + "custom_list": [","], + "extended": "true", + "normalize": "all", + } + + # instantiate + splitter = OracleTextSplitter(conn=connection, params=splitter_params) + + # generate chunks + chunks = splitter.split_text(doc) + + # verify + if len(chunks) == 0: + sys.exit(1) + + # by chars , max = 4000 + splitter_params = { + "by": "chars", + "max": "4000", + "overlap": "800", + "split": "NEWLINE", + "normalize": "all", + } + + # instantiate + splitter = OracleTextSplitter(conn=connection, params=splitter_params) + + # generate chunks + chunks = splitter.split_text(doc) + + # verify + if len(chunks) == 0: + sys.exit(1) + + # by words , max = 10 + splitter_params = { + "by": "words", + "max": "10", + "overlap": "2", + "split": "SENTENCE", + } + + # instantiate + splitter = OracleTextSplitter(conn=connection, params=splitter_params) + + # generate chunks + chunks = splitter.split_text(doc) + + # verify + if len(chunks) == 0: + sys.exit(1) + + # by chars , max = 50 + splitter_params = { + "by": "chars", + "max": "50", + "overlap": "10", + "split": "SPACE", + "normalize": "all", + } + + # instantiate + splitter = OracleTextSplitter(conn=connection, params=splitter_params) + + # generate chunks + chunks = splitter.split_text(doc) + + # 
verify + if len(chunks) == 0: + sys.exit(1) + + except Exception: + sys.exit(1) + + try: + # ORA-20003: invalid value xyz for BY parameter + splitter_params = {"by": "xyz"} + + # instantiate + splitter = OracleTextSplitter(conn=connection, params=splitter_params) + + # generate chunks + chunks = splitter.split_text(doc) + + # verify + if len(chunks) == 0: + pass + + except Exception: + pass + + try: + # Expectation: ORA-30584: invalid text chunking MAXIMUM - '10' + splitter_params = { + "by": "chars", + "max": "10", + "overlap": "2", + "split": "SPACE", + "normalize": "all", + } + + # instantiate + splitter = OracleTextSplitter(conn=connection, params=splitter_params) + + # generate chunks + chunks = splitter.split_text(doc) + + # verify + if len(chunks) == 0: + pass + + except Exception: + pass + + try: + # Expectation: ORA-30584: invalid text chunking MAXIMUM - '5' + splitter_params = { + "by": "words", + "max": "5", + "overlap": "2", + "split": "SPACE", + "normalize": "all", + } + + # instantiate + splitter = OracleTextSplitter(conn=connection, params=splitter_params) + + # generate chunks + chunks = splitter.split_text(doc) + + # verify + if len(chunks) == 0: + pass + + except Exception: + pass + + try: + # Expectation: ORA-30586: invalid text chunking SPLIT BY - SENTENCE + splitter_params = { + "by": "words", + "max": "50", + "overlap": "2", + "split": "SENTENCE", + "normalize": "all", + } + + # instantiate + splitter = OracleTextSplitter(conn=connection, params=splitter_params) + + # generate chunks + chunks = splitter.split_text(doc) + + # verify + if len(chunks) == 0: + pass + + except Exception: + pass + + +#### Test summary #### +def test_summary_test() -> None: + try: + import oracledb + except ImportError: + return + + try: + # oracle connection + connection = oracledb.connect(user=uname, password=passwd, dsn=v_dsn) + + # provider : Database, glevel : Paragraph + summary_params = { + "provider": "database", + "glevel": "paragraph", + "numParagraphs": 2, + "language": "english", + } + + # summary + summary = OracleSummary(conn=connection, params=summary_params) + + doc = """It was 7 minutes after midnight. The dog was lying on the grass in + of the lawn in front of Mrs Shears house. Its eyes were closed. It + was running on its side, the way dogs run when they think they are + cat in a dream. But the dog was not running or asleep. The dog was dead. + was a garden fork sticking out of the dog. The points of the fork must + gone all the way through the dog and into the ground because the fork + not fallen over. I decided that the dog was probably killed with the + because I could not see any other wounds in the dog and I do not think + would stick a garden fork into a dog after it had died for some other + like cancer for example, or a road accident. 
But I could not be certain""" + + summaries = summary.get_summary(doc) + + # verify + if len(summaries) == 0: + sys.exit(1) + + # provider : Database, glevel : Sentence + summary_params = {"provider": "database", "glevel": "Sentence"} + + # summary + summary = OracleSummary(conn=connection, params=summary_params) + summaries = summary.get_summary(doc) + + # verify + if len(summaries) == 0: + sys.exit(1) + + # provider : Database, glevel : P + summary_params = {"provider": "database", "glevel": "P"} + + # summary + summary = OracleSummary(conn=connection, params=summary_params) + summaries = summary.get_summary(doc) + + # verify + if len(summaries) == 0: + sys.exit(1) + + # provider : Database, glevel : S + summary_params = { + "provider": "database", + "glevel": "S", + "numParagraphs": 16, + "language": "english", + } + + # summary + summary = OracleSummary(conn=connection, params=summary_params) + summaries = summary.get_summary(doc) + + # verify + if len(summaries) == 0: + sys.exit(1) + + # provider : Database, glevel : S, doc = ' ' + summary_params = {"provider": "database", "glevel": "S", "numParagraphs": 2} + + # summary + summary = OracleSummary(conn=connection, params=summary_params) + + doc = " " + summaries = summary.get_summary(doc) + + # verify + if len(summaries) == 0: + sys.exit(1) + + except Exception: + sys.exit(1) + + try: + # Expectation : DRG-11002: missing value for PROVIDER + summary_params = {"provider": "database1", "glevel": "S"} + + # summary + summary = OracleSummary(conn=connection, params=summary_params) + summaries = summary.get_summary(doc) + + # verify + if len(summaries) == 0: + pass + + except Exception: + pass + + try: + # Expectation : DRG-11425: gist level SUDHIR is invalid, + # DRG-11427: valid gist level values are S, P + summary_params = {"provider": "database", "glevel": "SUDHIR"} + + # summary + summary = OracleSummary(conn=connection, params=summary_params) + summaries = summary.get_summary(doc) + + # verify + if len(summaries) == 0: + pass + + except Exception: + pass + + try: + # Expectation : DRG-11441: gist numParagraphs -2 is invalid + summary_params = {"provider": "database", "glevel": "S", "numParagraphs": -2} + + # summary + summary = OracleSummary(conn=connection, params=summary_params) + summaries = summary.get_summary(doc) + + # verify + if len(summaries) == 0: + pass + + except Exception: + pass diff --git a/libs/community/tests/integration_tests/vectorstores/test_oraclevs.py b/libs/community/tests/integration_tests/vectorstores/test_oraclevs.py new file mode 100644 index 0000000000..f0ea54fb58 --- /dev/null +++ b/libs/community/tests/integration_tests/vectorstores/test_oraclevs.py @@ -0,0 +1,955 @@ +"""Test Oracle AI Vector Search functionality.""" + +# import required modules +import sys +import threading + +from langchain_community.embeddings import HuggingFaceEmbeddings +from langchain_community.vectorstores.oraclevs import ( + OracleVS, + _create_table, + _index_exists, + _table_exists, + create_index, + drop_index_if_exists, + drop_table_purge, +) +from langchain_community.vectorstores.utils import DistanceStrategy + +username = "" +password = "" +dsn = "" + + +############################ +####### table_exists ####### +############################ +def test_table_exists_test() -> None: + try: + import oracledb + except ImportError: + return + + try: + connection = oracledb.connect(user=username, password=password, dsn=dsn) + except Exception: + sys.exit(1) + # 1. 
Existing Table:(all capital letters) + # expectation:True + _table_exists(connection, "V$TRANSACTION") + + # 2. Existing Table:(all small letters) + # expectation:True + _table_exists(connection, "v$transaction") + + # 3. Non-Existing Table + # expectation:false + _table_exists(connection, "Hello") + + # 4. Invalid Table Name + # Expectation:ORA-00903: invalid table name + try: + _table_exists(connection, "123") + except Exception: + pass + + # 5. Empty String + # Expectation:ORA-00903: invalid table name + try: + _table_exists(connection, "") + except Exception: + pass + + # 6. Special Character + # Expectation:ORA-00911: #: invalid character after FROM + try: + _table_exists(connection, "##4") + except Exception: + pass + + # 7. Table name length > 128 + # Expectation:ORA-00972: The identifier XXXXXXXXXX...XXXXXXXXXX... + # exceeds the maximum length of 128 bytes. + try: + _table_exists(connection, "x" * 129) + except Exception: + pass + + # 8. + # Expectation:True + _create_table(connection, "TB1", 65535) + + # 9. Toggle Case (like TaBlE) + # Expectation:True + _table_exists(connection, "Tb1") + drop_table_purge(connection, "TB1") + + # 10. Table_Name→ "हिन्दी" + # Expectation:True + _create_table(connection, '"हिन्दी"', 545) + _table_exists(connection, '"हिन्दी"') + drop_table_purge(connection, '"हिन्दी"') + + +############################ +####### create_table ####### +############################ + + +def test_create_table_test() -> None: + try: + import oracledb + except ImportError: + return + + try: + connection = oracledb.connect(user=username, password=password, dsn=dsn) + except Exception: + sys.exit(1) + + # 1. New table - HELLO + # Dimension - 100 + # Expectation:table is created + _create_table(connection, "HELLO", 100) + + # 2. Existing table name + # HELLO + # Dimension - 110 + # Expectation:Nothing happens + _create_table(connection, "HELLO", 110) + drop_table_purge(connection, "HELLO") + + # 3. New Table - 123 + # Dimension - 100 + # Expectation:ORA-00903: invalid table name + try: + _create_table(connection, "123", 100) + drop_table_purge(connection, "123") + except Exception: + pass + + # 4. New Table - Hello123 + # Dimension - 65535 + # Expectation:table is created + _create_table(connection, "Hello123", 65535) + drop_table_purge(connection, "Hello123") + + # 5. New Table - T1 + # Dimension - 65536 + # Expectation:ORA-51801: VECTOR column type specification + # has an unsupported dimension count ('65536'). + try: + _create_table(connection, "T1", 65536) + drop_table_purge(connection, "T1") + except Exception: + pass + + # 6. New Table - T1 + # Dimension - 0 + # Expectation:ORA-51801: VECTOR column type specification has + # an unsupported dimension count (0). + try: + _create_table(connection, "T1", 0) + drop_table_purge(connection, "T1") + except Exception: + pass + + # 7. New Table - T1 + # Dimension - -1 + # Expectation:ORA-51801: VECTOR column type specification has + # an unsupported dimension count ('-'). + try: + _create_table(connection, "T1", -1) + drop_table_purge(connection, "T1") + except Exception: + pass + + # 8. New Table - T2 + # Dimension - '1000' + # Expectation:table is created + _create_table(connection, "T2", int("1000")) + drop_table_purge(connection, "T2") + + # 9. New Table - T3 + # Dimension - 100 passed as a variable + # Expectation:table is created + val = 100 + _create_table(connection, "T3", val) + drop_table_purge(connection, "T3") + + # 10. 
+ # Expectation:ORA-00922: missing or invalid option + val2 = """H + ello""" + try: + _create_table(connection, val2, 545) + drop_table_purge(connection, val2) + except Exception: + pass + + # 11. New Table - हिन्दी + # Dimension - 545 + # Expectation:table is created + _create_table(connection, '"हिन्दी"', 545) + drop_table_purge(connection, '"हिन्दी"') + + # 12. + # Expectation:failure - user does not exist + try: + _create_table(connection, "U1.TB4", 128) + drop_table_purge(connection, "U1.TB4") + except Exception: + pass + + # 13. + # Expectation:table is created + _create_table(connection, '"T5"', 128) + drop_table_purge(connection, '"T5"') + + # 14. Toggle Case + # Expectation:table creation fails + try: + _create_table(connection, "TaBlE", 128) + drop_table_purge(connection, "TaBlE") + except Exception: + pass + + # 15. table_name as empty_string + # Expectation: ORA-00903: invalid table name + try: + _create_table(connection, "", 128) + drop_table_purge(connection, "") + _create_table(connection, '""', 128) + drop_table_purge(connection, '""') + except Exception: + pass + + # 16. Arithmetic Operations in dimension parameter + # Expectation:table is created + n = 1 + _create_table(connection, "T10", n + 500) + drop_table_purge(connection, "T10") + + # 17. String Operations in table_name&dimension parameter + # Expectation:table is created + _create_table(connection, "YaSh".replace("aS", "ok"), 500) + drop_table_purge(connection, "YaSh".replace("aS", "ok")) + + +################################## +####### create_hnsw_index ####### +################################## + + +def test_create_hnsw_index_test() -> None: + try: + import oracledb + except ImportError: + return + + try: + connection = oracledb.connect(user=username, password=password, dsn=dsn) + except Exception: + sys.exit(1) + # 1. Table_name - TB1 + # New Index + # distance_strategy - DistanceStrategy.Dot_product + # Expectation:Index created + model1 = HuggingFaceEmbeddings( + model_name="sentence-transformers/paraphrase-mpnet-base-v2" + ) + vs = OracleVS(connection, model1, "TB1", DistanceStrategy.EUCLIDEAN_DISTANCE) + create_index(connection, vs) + + # 2. Creating same index again + # Table_name - TB1 + # Expectation:Nothing happens + try: + create_index(connection, vs) + drop_index_if_exists(connection, "HNSW") + except Exception: + pass + drop_table_purge(connection, "TB1") + + # 3. Create index with following parameters: + # idx_name - hnsw_idx2 + # idx_type - HNSW + # Expectation:Index created + vs = OracleVS(connection, model1, "TB2", DistanceStrategy.EUCLIDEAN_DISTANCE) + create_index(connection, vs, params={"idx_name": "hnsw_idx2", "idx_type": "HNSW"}) + drop_index_if_exists(connection, "hnsw_idx2") + drop_table_purge(connection, "TB2") + + # 4. Table Name - TB1 + # idx_name - "हिन्दी" + # idx_type - HNSW + # Expectation:Index created + try: + vs = OracleVS(connection, model1, "TB3", DistanceStrategy.EUCLIDEAN_DISTANCE) + create_index(connection, vs, params={"idx_name": '"हिन्दी"', "idx_type": "HNSW"}) + drop_index_if_exists(connection, '"हिन्दी"') + except Exception: + pass + drop_table_purge(connection, "TB3") + + # 5. idx_name passed empty + # Expectation:ORA-01741: illegal zero-length identifier + try: + vs = OracleVS(connection, model1, "TB4", DistanceStrategy.EUCLIDEAN_DISTANCE) + create_index(connection, vs, params={"idx_name": '""', "idx_type": "HNSW"}) + drop_index_if_exists(connection, '""') + except Exception: + pass + drop_table_purge(connection, "TB4") + + # 6. 
idx_type left empty + # Expectation:Index created + try: + vs = OracleVS(connection, model1, "TB5", DistanceStrategy.EUCLIDEAN_DISTANCE) + create_index(connection, vs, params={"idx_name": "Hello", "idx_type": ""}) + drop_index_if_exists(connection, "Hello") + except Exception: + pass + drop_table_purge(connection, "TB5") + + # 7. efconstruction passed as parameter but not neighbours + # Expectation:Index created + vs = OracleVS(connection, model1, "TB7", DistanceStrategy.EUCLIDEAN_DISTANCE) + create_index( + connection, + vs, + params={"idx_name": "idx11", "efConstruction": 100, "idx_type": "HNSW"}, + ) + drop_index_if_exists(connection, "idx11") + drop_table_purge(connection, "TB7") + + # 8. efconstruction passed as parameter as well as neighbours + # (for this idx_type parameter is also necessary) + # Expectation:Index created + vs = OracleVS(connection, model1, "TB8", DistanceStrategy.EUCLIDEAN_DISTANCE) + create_index( + connection, + vs, + params={ + "idx_name": "idx11", + "efConstruction": 100, + "neighbors": 80, + "idx_type": "HNSW", + }, + ) + drop_index_if_exists(connection, "idx11") + drop_table_purge(connection, "TB8") + + # 9. Limit of Values for(integer values): + # parallel + # efConstruction + # Neighbors + # Accuracy + # 0 + # Expectation:Index created + vs = OracleVS(connection, model1, "TB15", DistanceStrategy.EUCLIDEAN_DISTANCE) + create_index( + connection, + vs, + params={ + "idx_name": "idx11", + "efConstruction": 200, + "neighbors": 100, + "idx_type": "HNSW", + "parallel": 8, + "accuracy": 10, + }, + ) + drop_index_if_exists(connection, "idx11") + drop_table_purge(connection, "TB15") + + # 11. index_name as + # Expectation:U1 not present + try: + vs = OracleVS( + connection, model1, "U1.TB16", DistanceStrategy.EUCLIDEAN_DISTANCE + ) + create_index( + connection, + vs, + params={ + "idx_name": "U1.idx11", + "efConstruction": 200, + "neighbors": 100, + "idx_type": "HNSW", + "parallel": 8, + "accuracy": 10, + }, + ) + drop_index_if_exists(connection, "U1.idx11") + drop_table_purge(connection, "TB16") + except Exception: + pass + + # 12. Index_name size >129 + # Expectation:Index not created + try: + vs = OracleVS(connection, model1, "TB17", DistanceStrategy.EUCLIDEAN_DISTANCE) + create_index(connection, vs, params={"idx_name": "x" * 129, "idx_type": "HNSW"}) + drop_index_if_exists(connection, "x" * 129) + except Exception: + pass + drop_table_purge(connection, "TB17") + + # 13. Index_name size 128 + # Expectation:Index created + vs = OracleVS(connection, model1, "TB18", DistanceStrategy.EUCLIDEAN_DISTANCE) + create_index(connection, vs, params={"idx_name": "x" * 128, "idx_type": "HNSW"}) + drop_index_if_exists(connection, "x" * 128) + drop_table_purge(connection, "TB18") + + +################################## +####### index_exists ############# +################################## + + +def test_index_exists_test() -> None: + try: + import oracledb + except ImportError: + return + + try: + connection = oracledb.connect(user=username, password=password, dsn=dsn) + except Exception: + sys.exit(1) + model1 = HuggingFaceEmbeddings( + model_name="sentence-transformers/paraphrase-mpnet-base-v2" + ) + # 1. Existing Index:(all capital letters) + # Expectation:true + vs = OracleVS(connection, model1, "TB1", DistanceStrategy.EUCLIDEAN_DISTANCE) + create_index(connection, vs, params={"idx_name": "idx11", "idx_type": "HNSW"}) + _index_exists(connection, "IDX11") + + # 2. Existing Table:(all small letters) + # Expectation:true + _index_exists(connection, "idx11") + + # 3. 
Non-Existing Index + # Expectation:False + _index_exists(connection, "Hello") + + # 4. Invalid Index Name + # Expectation:Error + try: + _index_exists(connection, "123") + except Exception: + pass + + # 5. Empty String + # Expectation:Error + try: + _index_exists(connection, "") + except Exception: + pass + try: + _index_exists(connection, "") + except Exception: + pass + + # 6. Special Character + # Expectation:Error + try: + _index_exists(connection, "##4") + except Exception: + pass + + # 7. Index name length > 128 + # Expectation:Error + try: + _index_exists(connection, "x" * 129) + except Exception: + pass + + # 8. + # Expectation:true + _index_exists(connection, "U1.IDX11") + + # 9. Toggle Case (like iDx11) + # Expectation:true + _index_exists(connection, "IdX11") + + # 10. Index_Name→ "हिन्दी" + # Expectation:true + drop_index_if_exists(connection, "idx11") + try: + create_index(connection, vs, params={"idx_name": '"हिन्दी"', "idx_type": "HNSW"}) + _index_exists(connection, '"हिन्दी"') + except Exception: + pass + drop_table_purge(connection, "TB1") + + +################################## +####### add_texts ################ +################################## + + +def test_add_texts_test() -> None: + try: + import oracledb + except ImportError: + return + + try: + connection = oracledb.connect(user=username, password=password, dsn=dsn) + except Exception: + sys.exit(1) + # 1. Add 2 records to table + # Expectation:Successful + texts = ["Rohan", "Shailendra"] + metadata = [ + {"id": "100", "link": "Document Example Test 1"}, + {"id": "101", "link": "Document Example Test 2"}, + ] + model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2") + vs_obj = OracleVS(connection, model, "TB1", DistanceStrategy.EUCLIDEAN_DISTANCE) + vs_obj.add_texts(texts, metadata) + drop_table_purge(connection, "TB1") + + # 2. Add record but metadata is not there + # Expectation:An exception occurred :: Either specify an 'ids' list or + # 'metadatas' with an 'id' attribute for each element. + model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2") + vs_obj = OracleVS(connection, model, "TB2", DistanceStrategy.EUCLIDEAN_DISTANCE) + texts2 = ["Sri Ram", "Krishna"] + vs_obj.add_texts(texts2) + drop_table_purge(connection, "TB2") + + # 3. Add record with ids option + # ids are passed as string + # ids are passed as empty string + # ids are passed as multi-line string + # ids are passed as "" + # Expectations: + # Successful + # Successful + # Successful + # Successful + + vs_obj = OracleVS(connection, model, "TB4", DistanceStrategy.EUCLIDEAN_DISTANCE) + ids3 = ["114", "124"] + vs_obj.add_texts(texts2, ids=ids3) + drop_table_purge(connection, "TB4") + + vs_obj = OracleVS(connection, model, "TB5", DistanceStrategy.EUCLIDEAN_DISTANCE) + ids4 = ["", "134"] + vs_obj.add_texts(texts2, ids=ids4) + drop_table_purge(connection, "TB5") + + vs_obj = OracleVS(connection, model, "TB6", DistanceStrategy.EUCLIDEAN_DISTANCE) + ids5 = [ + """Good afternoon + my friends""", + "India", + ] + vs_obj.add_texts(texts2, ids=ids5) + drop_table_purge(connection, "TB6") + + vs_obj = OracleVS(connection, model, "TB7", DistanceStrategy.EUCLIDEAN_DISTANCE) + ids6 = ['"Good afternoon"', '"India"'] + vs_obj.add_texts(texts2, ids=ids6) + drop_table_purge(connection, "TB7") + + # 4. 
Add records with ids and metadatas
+    # Expectation:Successful
+    vs_obj = OracleVS(connection, model, "TB8", DistanceStrategy.EUCLIDEAN_DISTANCE)
+    texts3 = ["Sri Ram 6", "Krishna 6"]
+    ids7 = ["1", "2"]
+    metadata = [
+        {"id": "102", "link": "Document Example", "stream": "Science"},
+        {"id": "104", "link": "Document Example 45"},
+    ]
+    vs_obj.add_texts(texts3, metadata, ids=ids7)
+    drop_table_purge(connection, "TB8")
+
+    # 5. Add 10000 records
+    # Expectation:Successful
+    vs_obj = OracleVS(connection, model, "TB9", DistanceStrategy.EUCLIDEAN_DISTANCE)
+    texts4 = ["Sri Ram{0}".format(i) for i in range(1, 10000)]
+    ids8 = ["Hello{0}".format(i) for i in range(1, 10000)]
+    vs_obj.add_texts(texts4, ids=ids8)
+    drop_table_purge(connection, "TB9")
+
+    # 6. Add 2 different records concurrently
+    # Expectation:Successful
+    def add(val: str) -> None:
+        model = HuggingFaceEmbeddings(
+            model_name="sentence-transformers/all-mpnet-base-v2"
+        )
+        vs_obj = OracleVS(
+            connection, model, "TB10", DistanceStrategy.EUCLIDEAN_DISTANCE
+        )
+        texts5 = [val]
+        ids9 = texts5
+        vs_obj.add_texts(texts5, ids=ids9)
+
+    thread_1 = threading.Thread(target=add, args=("Sri Ram",))
+    thread_2 = threading.Thread(target=add, args=("Sri Krishna",))
+    thread_1.start()
+    thread_2.start()
+    thread_1.join()
+    thread_2.join()
+    drop_table_purge(connection, "TB10")
+
+    # 7. Add 2 same records concurrently
+    # Expectation:Successful; one of the inserts gets a primary key
+    # violation error
+    def add1(val: str) -> None:
+        model = HuggingFaceEmbeddings(
+            model_name="sentence-transformers/all-mpnet-base-v2"
+        )
+        vs_obj = OracleVS(
+            connection, model, "TB11", DistanceStrategy.EUCLIDEAN_DISTANCE
+        )
+        texts = [val]
+        ids10 = texts
+        vs_obj.add_texts(texts, ids=ids10)
+
+    try:
+        thread_1 = threading.Thread(target=add1, args=("Sri Ram",))
+        thread_2 = threading.Thread(target=add1, args=("Sri Ram",))
+        thread_1.start()
+        thread_2.start()
+        thread_1.join()
+        thread_2.join()
+    except Exception:
+        pass
+    drop_table_purge(connection, "TB11")
+
+    # 8. create object with table name of the form <schema>.<table>
+    # Expectation:U1 does not exist
+    try:
+        vs_obj = OracleVS(connection, model, "U1.TB14", DistanceStrategy.DOT_PRODUCT)
+        for i in range(1, 10):
+            texts7 = ["Yash{0}".format(i)]
+            ids13 = ["1234{0}".format(i)]
+            vs_obj.add_texts(texts7, ids=ids13)
+        drop_table_purge(connection, "TB14")
+    except Exception:
+        pass
+
+
+##################################
+####### embed_documents(text) ####
+##################################
+def test_embed_documents_test() -> None:
+    try:
+        import oracledb
+    except ImportError:
+        return
+
+    try:
+        connection = oracledb.connect(user=username, password=password, dsn=dsn)
+    except Exception:
+        sys.exit(1)
+    # 1. String Example-'Sri Ram'
+    # Expectation:Vector Printed
+    model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
+    vs_obj = OracleVS(connection, model, "TB7", DistanceStrategy.EUCLIDEAN_DISTANCE)
+
+    # 4. List
+    # Expectation:Vector Printed
+    vs_obj._embed_documents(["hello", "yash"])
+    drop_table_purge(connection, "TB7")
+
+
+##################################
+####### embed_query(text) ########
+##################################
+def test_embed_query_test() -> None:
+    try:
+        import oracledb
+    except ImportError:
+        return
+
+    try:
+        connection = oracledb.connect(user=username, password=password, dsn=dsn)
+    except Exception:
+        sys.exit(1)
+    # 1.
String + # Expectation:Vector printed + model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2") + vs_obj = OracleVS(connection, model, "TB8", DistanceStrategy.EUCLIDEAN_DISTANCE) + vs_obj._embed_query("Sri Ram") + drop_table_purge(connection, "TB8") + + # 3. Empty string + # Expectation:[] + vs_obj._embed_query("") + + +################################## +####### create_index ############# +################################## +def test_create_index_test() -> None: + try: + import oracledb + except ImportError: + return + + try: + connection = oracledb.connect(user=username, password=password, dsn=dsn) + except Exception: + sys.exit(1) + # 1. No optional parameters passed + # Expectation:Successful + model1 = HuggingFaceEmbeddings( + model_name="sentence-transformers/paraphrase-mpnet-base-v2" + ) + vs = OracleVS(connection, model1, "TB1", DistanceStrategy.EUCLIDEAN_DISTANCE) + create_index(connection, vs) + drop_index_if_exists(connection, "HNSW") + drop_table_purge(connection, "TB1") + + # 2. ivf index + # Expectation:Successful + vs = OracleVS(connection, model1, "TB2", DistanceStrategy.EUCLIDEAN_DISTANCE) + create_index(connection, vs, {"idx_type": "IVF", "idx_name": "IVF"}) + drop_index_if_exists(connection, "IVF") + drop_table_purge(connection, "TB2") + + # 3. ivf index with neighbour_part passed as parameter + # Expectation:Successful + vs = OracleVS(connection, model1, "TB3", DistanceStrategy.EUCLIDEAN_DISTANCE) + create_index(connection, vs, {"idx_type": "IVF", "neighbor_part": 10}) + drop_index_if_exists(connection, "IVF") + drop_table_purge(connection, "TB3") + + # 4. ivf index with neighbour_part and accuracy passed as parameter + # Expectation:Successful + vs = OracleVS(connection, model1, "TB4", DistanceStrategy.EUCLIDEAN_DISTANCE) + create_index( + connection, vs, {"idx_type": "IVF", "neighbor_part": 10, "accuracy": 90} + ) + drop_index_if_exists(connection, "IVF") + drop_table_purge(connection, "TB4") + + # 5. ivf index with neighbour_part and parallel passed as parameter + # Expectation:Successful + vs = OracleVS(connection, model1, "TB5", DistanceStrategy.EUCLIDEAN_DISTANCE) + create_index( + connection, vs, {"idx_type": "IVF", "neighbor_part": 10, "parallel": 90} + ) + drop_index_if_exists(connection, "IVF") + drop_table_purge(connection, "TB5") + + # 6. ivf index and then perform dml(insert) + # Expectation:Successful + vs = OracleVS(connection, model1, "TB6", DistanceStrategy.EUCLIDEAN_DISTANCE) + create_index(connection, vs, {"idx_type": "IVF", "idx_name": "IVF"}) + texts = ["Sri Ram", "Krishna"] + vs.add_texts(texts) + # perform delete + vs.delete(["hello"]) + drop_index_if_exists(connection, "IVF") + drop_table_purge(connection, "TB6") + + # 7. 
+
+
+##################################
+####### perform_search ###########
+##################################
+def test_perform_search_test() -> None:
+    try:
+        import oracledb
+    except ImportError:
+        return
+
+    try:
+        connection = oracledb.connect(user=username, password=password, dsn=dsn)
+    except Exception:
+        sys.exit(1)
+    model1 = HuggingFaceEmbeddings(
+        model_name="sentence-transformers/paraphrase-mpnet-base-v2"
+    )
+    vs_1 = OracleVS(connection, model1, "TB10", DistanceStrategy.EUCLIDEAN_DISTANCE)
+    vs_2 = OracleVS(connection, model1, "TB11", DistanceStrategy.DOT_PRODUCT)
+    vs_3 = OracleVS(connection, model1, "TB12", DistanceStrategy.COSINE)
+    vs_4 = OracleVS(connection, model1, "TB13", DistanceStrategy.EUCLIDEAN_DISTANCE)
+    vs_5 = OracleVS(connection, model1, "TB14", DistanceStrategy.DOT_PRODUCT)
+    vs_6 = OracleVS(connection, model1, "TB15", DistanceStrategy.COSINE)
+
+    # vector store list:
+    vs_list = [vs_1, vs_2, vs_3, vs_4, vs_5, vs_6]
+
+    for i, vs in enumerate(vs_list, start=1):
+        # insert data
+        texts = ["Yash", "Varanasi", "Yashaswi", "Mumbai", "BengaluruYash"]
+        metadatas = [
+            {"id": "hello"},
+            {"id": "105"},
+            {"id": "106"},
+            {"id": "yash"},
+            {"id": "108"},
+        ]
+
+        vs.add_texts(texts, metadatas)
+
+        # create index: HNSW for the first three stores, IVF for the rest
+        if i <= 3:
+            create_index(connection, vs, {"idx_type": "HNSW", "idx_name": f"IDX1{i}"})
+        else:
+            create_index(connection, vs, {"idx_type": "IVF", "idx_name": f"IDX1{i}"})
+
+        # perform search
+        query = "YashB"
+
+        filter = {"id": ["106", "108", "yash"]}
+
+        # similarity_search without filter
+        vs.similarity_search(query, 2)
+
+        # similarity_search with filter
+        vs.similarity_search(query, 2, filter=filter)
+
+        # similarity search with relevance score
+        vs.similarity_search_with_score(query, 2)
+
+        # similarity search with relevance score, with filter
+        vs.similarity_search_with_score(query, 2, filter=filter)
+
+        # max marginal relevance search
+        vs.max_marginal_relevance_search(query, 2, fetch_k=20, lambda_mult=0.5)
+
+        # max marginal relevance search with filter
+        vs.max_marginal_relevance_search(
+            query, 2, fetch_k=20, lambda_mult=0.5, filter=filter
+        )
+
+    drop_table_purge(connection, "TB10")
+    drop_table_purge(connection, "TB11")
+    drop_table_purge(connection, "TB12")
+    drop_table_purge(connection, "TB13")
+    drop_table_purge(connection, "TB14")
+    drop_table_purge(connection, "TB15")
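+
+
+# Closing sketch (not invoked by the suite): the loop above exercises every
+# query style against each distance strategy. The same store also composes
+# into a LangChain retriever for RAG pipelines; "k" and the "TB_RETR" table
+# name are illustrative.
+def _retriever_sketch() -> None:
+    import oracledb
+
+    connection = oracledb.connect(user=username, password=password, dsn=dsn)
+    model = HuggingFaceEmbeddings(
+        model_name="sentence-transformers/paraphrase-mpnet-base-v2"
+    )
+    vs = OracleVS(connection, model, "TB_RETR", DistanceStrategy.COSINE)
+    vs.add_texts(["Yash", "Varanasi", "Mumbai"])
+    retriever = vs.as_retriever(search_kwargs={"k": 2})
+    retriever.invoke("Yash")  # returns the 2 nearest documents
+    drop_table_purge(connection, "TB_RETR")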
diff --git a/libs/community/tests/unit_tests/document_loaders/test_imports.py b/libs/community/tests/unit_tests/document_loaders/test_imports.py
index cc4ad1a6f9..a8890aabe0 100644
--- a/libs/community/tests/unit_tests/document_loaders/test_imports.py
+++ b/libs/community/tests/unit_tests/document_loaders/test_imports.py
@@ -113,6 +113,8 @@ EXPECTED_ALL = [
     "OnlinePDFLoader",
     "OpenCityDataLoader",
     "OracleAutonomousDatabaseLoader",
+    "OracleDocLoader",
+    "OracleTextSplitter",
     "OutlookMessageLoader",
     "PDFMinerLoader",
     "PDFMinerPDFasHTMLLoader",
diff --git a/libs/community/tests/unit_tests/embeddings/test_imports.py b/libs/community/tests/unit_tests/embeddings/test_imports.py
index 25a823afe9..61fb228bd7 100644
--- a/libs/community/tests/unit_tests/embeddings/test_imports.py
+++ b/libs/community/tests/unit_tests/embeddings/test_imports.py
@@ -57,6 +57,7 @@ EXPECTED_ALL = [
     "ErnieEmbeddings",
     "JavelinAIGatewayEmbeddings",
     "OllamaEmbeddings",
+    "OracleEmbeddings",
     "QianfanEmbeddingsEndpoint",
     "JohnSnowLabsEmbeddings",
     "VoyageEmbeddings",
diff --git a/libs/community/tests/unit_tests/utilities/test_imports.py b/libs/community/tests/unit_tests/utilities/test_imports.py
index f0cf53c19c..c06fafda6b 100644
--- a/libs/community/tests/unit_tests/utilities/test_imports.py
+++ b/libs/community/tests/unit_tests/utilities/test_imports.py
@@ -34,6 +34,7 @@ EXPECTED_ALL = [
     "NVIDIARivaTTS",
     "NVIDIARivaStream",
     "OpenWeatherMapAPIWrapper",
+    "OracleSummary",
     "OutlineAPIWrapper",
     "NutritionAIAPI",
     "Portkey",
diff --git a/libs/community/tests/unit_tests/vectorstores/test_imports.py b/libs/community/tests/unit_tests/vectorstores/test_imports.py
index 77cd646ca8..6ce51c1669 100644
--- a/libs/community/tests/unit_tests/vectorstores/test_imports.py
+++ b/libs/community/tests/unit_tests/vectorstores/test_imports.py
@@ -60,6 +60,7 @@ EXPECTED_ALL = [
     "Neo4jVector",
     "NeuralDBVectorStore",
     "OpenSearchVectorSearch",
+    "OracleVS",
     "PGEmbedding",
     "PGVector",
     "PathwayVectorClient",
diff --git a/libs/community/tests/unit_tests/vectorstores/test_indexing_docs.py b/libs/community/tests/unit_tests/vectorstores/test_indexing_docs.py
index f85f3d345c..3df9d17bcc 100644
--- a/libs/community/tests/unit_tests/vectorstores/test_indexing_docs.py
+++ b/libs/community/tests/unit_tests/vectorstores/test_indexing_docs.py
@@ -73,6 +73,7 @@ def test_compatible_vectorstore_documentation() -> None:
         "MomentoVectorIndex",
         "MyScale",
         "OpenSearchVectorSearch",
+        "OracleVS",
         "PGVector",
         "Pinecone",
         "Qdrant",
diff --git a/libs/community/tests/unit_tests/vectorstores/test_public_api.py b/libs/community/tests/unit_tests/vectorstores/test_public_api.py
index 54e296dc84..0c2dbe6673 100644
--- a/libs/community/tests/unit_tests/vectorstores/test_public_api.py
+++ b/libs/community/tests/unit_tests/vectorstores/test_public_api.py
@@ -55,6 +55,7 @@ _EXPECTED = [
     "MyScaleSettings",
     "Neo4jVector",
     "OpenSearchVectorSearch",
+    "OracleVS",
     "PGEmbedding",
     "PGVector",
     "PathwayVectorClient",