docs: table legend updated (#21351)

Compacted the table column legends. Added links. Similar to #21259
pull/21390/head
Leonid Ganeline 2 weeks ago committed by GitHub
parent d5bde4fa91
commit 7cbf1c31aa
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194

@@ -25,27 +25,28 @@ That means there are two different axes along which you can customize your text
## Types of Text Splitters
LangChain offers many different types of text splitters. These all live in the `langchain-text-splitters` package. Below is a table listing all of them, along with a few characteristics:
**Name**: Name of the text splitter
**Splits On**: How this text splitter splits text
**Adds Metadata**: Whether or not this text splitter adds metadata about where each chunk came from.
**Description**: Description of the splitter, including recommendation on when to use it.
| Name | Splits On | Adds Metadata | Description |
|-----------|---------------------------------------|---------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Recursive | A list of user defined characters | | Recursively splits text. Splitting text recursively serves the purpose of trying to keep related pieces of text next to each other. This is the recommended way to start splitting text. |
| HTML | HTML specific characters | ✅ | Splits text based on HTML-specific characters. Notably, this adds in relevant information about where that chunk came from (based on the HTML) |
| Markdown | Markdown specific characters | ✅ | Splits text based on Markdown-specific characters. Notably, this adds in relevant information about where that chunk came from (based on the Markdown) |
| Code | Code (Python, JS) specific characters | | Splits text based on characters specific to coding languages. 15 different languages are available to choose from. |
| Token | Tokens | | Splits text on tokens. There exist a few different ways to measure tokens. |
| Character | A user defined character | | Splits text based on a user defined character. One of the simpler methods. |
| [Experimental] Semantic Chunker | Sentences | | First splits on sentences. Then combines ones next to each other if they are semantically similar enough. Taken from [Greg Kamradt](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb) |
| [AI21 Semantic Text Splitter](/docs/integrations/document_transformers/ai21_semantic_text_splitter) | Semantics | ✅ | Identifies distinct topics that form coherent pieces of text and splits along those. |
LangChain offers many different types of `text splitters`.
These all live in the `langchain-text-splitters` package.
Table columns:
- **Name**: Name of the text splitter
- **Classes**: Classes that implement this text splitter
- **Splits On**: How this text splitter splits text
- **Adds Metadata**: Whether or not this text splitter adds metadata about where each chunk came from.
- **Description**: Description of the splitter, including recommendation on when to use it.
| Name | Classes | Splits On | Adds Metadata | Description |
|----------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------|---------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Recursive | [RecursiveCharacterTextSplitter](/docs/modules/data_connection/document_transformers/recursive_text_splitter), [RecursiveJsonSplitter](/docs/modules/data_connection/document_transformers/recursive_json_splitter) | A list of user defined characters | | Recursively splits text. It tries to keep related pieces of text next to each other. This is the `recommended way` to start splitting text. |
| HTML | [HTMLHeaderTextSplitter](/docs/modules/data_connection/document_transformers/HTML_header_metadata), [HTMLSectionSplitter](/docs/modules/data_connection/document_transformers/HTML_section_aware_splitter) | HTML specific characters | ✅ | Splits text based on HTML-specific characters. Notably, this adds in relevant information about where that chunk came from (based on the HTML) |
| Markdown | [MarkdownHeaderTextSplitter](/docs/modules/data_connection/document_transformers/markdown_header_metadata) | Markdown specific characters | ✅ | Splits text based on Markdown-specific characters. Notably, this adds in relevant information about where that chunk came from (based on the Markdown) |
| Code | [many languages](/docs/modules/data_connection/document_transformers/code_splitter) | Code (Python, JS) specific characters | | Splits text based on characters specific to coding languages. 15 different languages are available to choose from. |
| Token | [many classes](/docs/modules/data_connection/document_transformers/split_by_token) | Tokens | | Splits text on tokens. There exist a few different ways to measure tokens. |
| Character | [CharacterTextSplitter](/docs/modules/data_connection/document_transformers/character_text_splitter) | A user defined character | | Splits text based on a user defined character. One of the simpler methods. |
| [Experimental] Semantic Chunker | [SemanticChunker](/docs/modules/data_connection/document_transformers/semantic-chunker) | Sentences | | First splits on sentences. Then combines ones next to each other if they are semantically similar enough. Taken from [Greg Kamradt](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb) |
| AI21 Semantic Text Splitter | [AI21SemanticTextSplitter](/docs/integrations/document_transformers/ai21_semantic_text_splitter) | Semantics | ✅ | Identifies distinct topics that form coherent pieces of text and splits along those. |
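
To make the "Recursive" row more concrete, here is a minimal pure-Python sketch of the idea: try the coarsest separator first and fall back to finer ones only for pieces that are still too long. This is an illustration under simplifying assumptions, not the `RecursiveCharacterTextSplitter` implementation; the function name `recursive_split` and its parameters are hypothetical.

```python
from typing import List


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split text into chunks of at most chunk_size characters (hypothetical helper).

    Separators are tried in order, so related text stays together for as long
    as it fits within chunk_size.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep) if sep else list(text)

    chunks: List[str] = []
    current = ""
    for piece in pieces:
        # Greedily merge neighbouring pieces while they still fit.
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with the finer separators.
            chunks.extend(recursive_split(piece, finer, chunk_size))
            current = ""
    if current:
        chunks.append(current)
    return chunks


# Paragraph breaks are tried first, then single spaces.
print(recursive_split("aaa bbb\n\nccc ddd eee", ["\n\n", " "], chunk_size=8))
# → ['aaa bbb', 'ccc ddd', 'eee']
```

The ordering of `separators` is the main design choice: paragraph-level separators first, word-level ones last, so a chunk is only broken at a finer boundary when nothing coarser fits.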
## Evaluate text splitters

@@ -10,32 +10,28 @@ A retriever is an interface that returns documents given an unstructured query.
A retriever does not need to be able to store documents, only to return (or retrieve) them. Vector stores can be used
as the backbone of a retriever, but there are other types of retrievers as well.
Retrievers accept a string query as input and return a list of `Document` objects as output.
Retrievers accept a string `query` as input and return a list of `Document` objects as output.
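
The contract described above (string query in, list of documents out, no storage requirement) can be sketched without any vector store. The `Document` dataclass and `KeywordRetriever` class below are hypothetical stand-ins written for illustration; they are not LangChain classes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Document:
    # Hypothetical stand-in for a document: text plus optional metadata.
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


class KeywordRetriever:
    """Toy retriever that ranks documents by how many query words they contain."""

    def __init__(self, docs: List[Document], k: int = 2) -> None:
        self.docs = docs
        self.k = k

    def retrieve(self, query: str) -> List[Document]:
        words = set(query.lower().split())
        scored = [
            (len(words & set(doc.page_content.lower().split())), doc)
            for doc in self.docs
        ]
        # Drop documents with no overlap, then sort by score (stable for ties).
        hits = [(score, doc) for score, doc in scored if score > 0]
        hits.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in hits[: self.k]]


docs = [
    Document("cats purr loudly"),
    Document("dogs bark"),
    Document("cats and dogs play"),
]
retriever = KeywordRetriever(docs, k=2)
print([d.page_content for d in retriever.retrieve("cats dogs")])
# → ['cats and dogs play', 'cats purr loudly']
```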
## Advanced Retrieval Types
LangChain provides several advanced retrieval types. A full list is below, along with the following information:
**Name**: Name of the retrieval algorithm.
**Index Type**: Which index type (if any) this relies on.
**Uses an LLM**: Whether this retrieval method uses an LLM.
**When to Use**: Our commentary on when you should consider using this retrieval method.
**Description**: Description of what this retrieval algorithm is doing.
| Name | Index Type | Uses an LLM | When to Use | Description |
|---------------------------|------------------------------|---------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [Vectorstore](./vectorstore) | Vectorstore | No | If you are just getting started and looking for something quick and easy. | This is the simplest method and the one that is easiest to get started with. It involves creating embeddings for each piece of text. |
| [ParentDocument](./parent_document_retriever) | Vectorstore + Document Store | No | If your pages have lots of smaller pieces of distinct information that are best indexed by themselves, but best retrieved all together. | This involves indexing multiple chunks for each document. Then you find the chunks that are most similar in embedding space, but you retrieve the whole parent document and return that (rather than individual chunks). |
| [Multi Vector](multi_vector) | Vectorstore + Document Store | Sometimes during indexing | If you are able to extract information from documents that you think is more relevant to index than the text itself. | This involves creating multiple vectors for each document. Each vector could be created in a myriad of ways - examples include summaries of the text and hypothetical questions. |
| [Self Query](./self_query) | Vectorstore | Yes | If users are asking questions that are better answered by fetching documents based on metadata rather than similarity with the text. | This uses an LLM to transform user input into two things: (1) a string to look up semantically, (2) a metadata filter to go along with it. This is useful because oftentimes questions are about the METADATA of documents (not the content itself). |
| [Contextual Compression](./contextual_compression) | Any | Sometimes | If you are finding that your retrieved documents contain too much irrelevant information and are distracting the LLM. | This puts a post-processing step on top of another retriever and extracts only the most relevant information from retrieved documents. This can be done with embeddings or an LLM. |
| [Time-Weighted Vectorstore](./time_weighted_vectorstore) | Vectorstore | No | If you have timestamps associated with your documents, and you want to retrieve the most recent ones | This fetches documents based on a combination of semantic similarity (as in normal vector retrieval) and recency (looking at timestamps of indexed documents) |
| [Multi-Query Retriever](./MultiQueryRetriever) | Any | Yes | If users are asking questions that are complex and require multiple pieces of distinct information to respond | This uses an LLM to generate multiple queries from the original one. This is useful when the original query needs pieces of information about multiple topics to be properly answered. By generating multiple queries, we can then fetch documents for each of them. |
| [Ensemble](./ensemble) | Any | No | If you have multiple retrieval methods and want to try combining them. | This fetches documents from multiple retrievers and then combines them. |
Table columns:
- **Name**: Name of the retrieval algorithm.
- **Index Type**: Which index type (if any) this relies on.
- **Uses an LLM**: Whether this retrieval method uses an LLM.
- **When to Use**: Our commentary on when you should consider using this retrieval method.
- **Description**: Description of what this retrieval algorithm is doing.
| Name | Index Type | Uses an LLM | When to Use | Description |
|---------------------------|------------------------------|---------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [Vectorstore](./vectorstore) | Vectorstore | No | If you are just getting started and looking for something quick and easy. | This is the simplest method and the one that is easiest to get started with. It creates embeddings for each piece of text. |
| [ParentDocument](./parent_document_retriever) | Vectorstore + Document Store | No | If your pages have lots of smaller pieces of distinct information that are best indexed by themselves, but best retrieved all together. | This indexes multiple chunks for each document. Then you find the chunks that are most similar in embedding space, but you retrieve the whole parent document and return that (rather than individual chunks). |
| [Multi Vector](multi_vector) | Vectorstore + Document Store | Sometimes during indexing | If you are able to extract information from documents that you think is more relevant to index than the text itself. | This creates multiple vectors for each document. Each vector could be created in a myriad of ways - examples include summaries of the text and hypothetical questions. |
| [Self Query](./self_query) | Vectorstore | Yes | If users are asking questions that are better answered by fetching documents based on metadata rather than similarity with the text. | This uses an LLM to transform user input into two things: (1) a string to look up semantically, (2) a metadata filter to go along with it. This is useful because oftentimes questions are about the METADATA of documents (not the content itself). |
| [Contextual Compression](./contextual_compression) | Any | Sometimes | If you are finding that your retrieved documents contain too much irrelevant information and are distracting the LLM. | This puts a post-processing step on top of another retriever and extracts only the most relevant information from retrieved documents. This can be done with embeddings or an LLM. |
| [Time-Weighted Vectorstore](./time_weighted_vectorstore) | Vectorstore | No | If you have timestamps associated with your documents, and you want to retrieve the most recent ones | This fetches documents based on a combination of semantic similarity (as in normal vector retrieval) and recency (looking at timestamps of indexed documents) |
| [Multi-Query Retriever](./MultiQueryRetriever) | Any | Yes | If users are asking questions that are complex and require multiple pieces of distinct information to respond | This uses an LLM to generate multiple queries from the original one. This is useful when the original query needs pieces of information about multiple topics to be properly answered. By generating multiple queries, we can then fetch documents for each of them. |
| [Ensemble](./ensemble) | Any | No | If you have multiple retrieval methods and want to try combining them. | This fetches documents from multiple retrievers and then combines them. |
| [Long-Context Reorder](./long_context_reorder) | Any | No | If you are working with a long-context model and noticing that it's not paying attention to information in the middle of retrieved documents. | This fetches documents from an underlying retriever, and then reorders them so that the most similar are near the beginning and end. This is useful because it's been shown that for longer context models they sometimes don't pay attention to information in the middle of the context window. |
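
The "Long-Context Reorder" row describes placing the most similar documents near the beginning and end of the context. One way to sketch that, assuming the input list is already sorted from most to least relevant, is to alternate documents between the front and the back. The function name `reorder_for_long_context` is hypothetical, and the actual transformer may order items differently.

```python
from typing import List, TypeVar

T = TypeVar("T")


def reorder_for_long_context(docs_by_relevance: List[T]) -> List[T]:
    """Push the least relevant items toward the middle of the list.

    Assumes docs_by_relevance is sorted from most to least relevant.
    """
    front: List[T] = []
    back: List[T] = []
    for rank, doc in enumerate(docs_by_relevance):
        # Even ranks fill the front, odd ranks fill the back.
        if rank % 2 == 0:
            front.append(doc)
        else:
            back.append(doc)
    # Reverse the back half so the second-best document ends up last.
    return front + back[::-1]


print(reorder_for_long_context(["d1", "d2", "d3", "d4", "d5"]))
# → ['d1', 'd3', 'd5', 'd4', 'd2']
```

Here `d1` and `d2`, the two most relevant documents, sit at the two ends, while `d5`, the least relevant, lands in the middle.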

@@ -15,34 +15,30 @@ See [this quick-start guide](./quick_start) for an introduction to output parser
## Output Parser Types
LangChain has lots of different types of output parsers. This is a list of output parsers LangChain supports. The table below has various pieces of information:
LangChain has lots of different types of `output parsers`.
**Name**: The name of the output parser
Table columns:
**Supports Streaming**: Whether the output parser supports streaming.
**Has Format Instructions**: Whether the output parser has format instructions. This is generally available except when (a) the desired schema is not specified in the prompt but rather in other parameters (like OpenAI function calling), or (b) when the OutputParser wraps another OutputParser.
**Calls LLM**: Whether this output parser itself calls an LLM. This is usually only done by output parsers that attempt to correct misformatted output.
**Input Type**: Expected input type. Most output parsers work on both strings and messages, but some (like OpenAI Functions) need a message with specific kwargs.
**Output Type**: The output type of the object returned by the parser.
**Description**: Our commentary on this output parser and when to use it.
- **Name**: The name of the output parser
- **Supports Streaming**: Whether the output parser supports streaming.
- **Has Format Instructions**: Whether the output parser has format instructions. This is generally available except when (a) the desired schema is not specified in the prompt but rather in other parameters (like OpenAI function calling), or (b) when the OutputParser wraps another OutputParser.
- **Calls LLM**: Whether this output parser itself calls an LLM. This is usually only done by output parsers that attempt to correct misformatted output.
- **Input Type**: Expected input type. Most output parsers work on both strings and messages, but some (like OpenAI Functions) need a message with specific kwargs.
- **Output Type**: The output type of the object returned by the parser.
- **Description**: Our commentary on this output parser and when to use it.
| Name | Supports Streaming | Has Format Instructions | Calls LLM | Input Type | Output Type | Description |
|-----------------|--------------------|-------------------------------|-----------|----------------------------------|----------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [OpenAITools](./types/openai_tools) | | (Passes `tools` to model) | | `Message` (with `tool_choice`) | JSON object | Uses latest OpenAI function calling args `tools` and `tool_choice` to structure the return output. If you are using a model that supports function calling, this is generally the most reliable method. |
| [OpenAIFunctions](./types/openai_functions) | ✅ | (Passes `functions` to model) | | `Message` (with `function_call`) | JSON object | Uses legacy OpenAI function calling args `functions` and `function_call` to structure the return output. |
| [JSON](./types/json) | ✅ | ✅ | | `str` \| `Message` | JSON object | Returns a JSON object as specified. You can specify a Pydantic model and it will return JSON for that model. Probably the most reliable output parser for getting structured data that does NOT use function calling. |
| [JSON](./types/json) | ✅ | ✅ | | `str` \| `Message` | JSON object | Returns a JSON object as specified. You specify a Pydantic model and it will return JSON for that model. Probably the most reliable output parser for getting structured data that does NOT use function calling. |
| [XML](./types/xml) | ✅ | ✅ | | `str` \| `Message` | `dict` | Returns a dictionary of tags. Use when XML output is needed. Use with models that are good at writing XML (like Anthropic's). |
| [CSV](./types/csv) | ✅ | ✅ | | `str` \| `Message` | `List[str]` | Returns a list of comma separated values. |
| [OutputFixing](./types/output_fixing) | | | ✅ | `str` \| `Message` | | Wraps another output parser. If that output parser errors, then this will pass the error message and the bad output to an LLM and ask it to fix the output. |
| [RetryWithError](./types/retry) | | | ✅ | `str` \| `Message` | | Wraps another output parser. If that output parser errors, then this will pass the original inputs, the bad output, and the error message to an LLM and ask it to fix it. Compared to OutputFixingParser, this one also sends the original instructions. |
| [RetryWithError](./types/retry) | | | ✅ | `str` \| `Message` | | Wraps another output parser. If that output parser errors, then this will pass the original inputs, the bad output, and the error message to an LLM and ask it to fix it. Compared to `OutputFixingParser`, this one also sends the original instructions. |
| [Pydantic](./types/pydantic) | | ✅ | | `str` \| `Message` | `pydantic.BaseModel` | Takes a user defined Pydantic model and returns data in that format. |
| [YAML](./types/yaml) | | ✅ | | `str` \| `Message` | `pydantic.BaseModel` | Takes a user defined Pydantic model and returns data in that format. Uses YAML to encode it. |
| [PandasDataFrame](./types/pandas_dataframe) | | ✅ | | `str` \| `Message` | `dict` | Useful for doing operations with pandas DataFrames. |
| [Enum](./types/enum) | | ✅ | | `str` \| `Message` | `Enum` | Parses response into one of the provided enum values. |
| [Datetime](./types/datetime) | | ✅ | | `str` \| `Message` | `datetime.datetime` | Parses response into a `datetime.datetime` object. |
| [Structured](./types/structured) | | ✅ | | `str` \| `Message` | `Dict[str, str]` | An output parser that returns structured information. It is less powerful than other output parsers since it only allows for fields to be strings. This can be useful when you are working with smaller LLMs. |
| [Structured](./types/structured) | | ✅ | | `str` \| `Message` | `Dict[str, str]` | An output parser that returns structured information. It is less powerful than other output parsers since it only allows for fields to be strings. This is useful when you are working with smaller LLMs. |
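
The wrap-and-fix pattern described in the "OutputFixing" row can be sketched in plain Python: call the inner parser, and on failure hand the bad output and the error message to a repair step, then try again. Everything below is an assumption made for illustration; `make_fixing_parser` is a hypothetical helper, and `fake_llm_fix` is a stub standing in for a real LLM call.

```python
import json
from typing import Any, Callable


def parse_json(text: str) -> Any:
    # Inner parser: raises json.JSONDecodeError (a ValueError) on bad input.
    return json.loads(text)


def make_fixing_parser(
    parser: Callable[[str], Any],
    fix_with_llm: Callable[[str, str], str],
    max_retries: int = 1,
) -> Callable[[str], Any]:
    """Wrap a parser so that failures are sent to a repair function and retried."""

    def parse(text: str) -> Any:
        for attempt in range(max_retries + 1):
            try:
                return parser(text)
            except ValueError as err:
                if attempt == max_retries:
                    raise
                # Pass the bad output and the error message to the repair step.
                text = fix_with_llm(text, str(err))

    return parse


def fake_llm_fix(bad_output: str, error: str) -> str:
    # Stub for an LLM call: swaps single quotes for double quotes.
    return bad_output.replace("'", '"')


parse = make_fixing_parser(parse_json, fake_llm_fix)
print(parse("{'a': 1}"))
# → {'a': 1}
```

A retry-with-error variant would follow the same shape but also pass the original prompt to the repair function, which matches the difference between the two rows in the table.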
