This feature is in public preview.
Multimodal context enables an assistant to use the visual content of PDF files, not just their text. This is useful for:

- Analyzing charts, graphs, and diagrams in financial reports
- Understanding infographics and visual data in research papers
- Interpreting visual layouts in technical documentation
How it works
When you enable multimodal context for a PDF:

- Pinecone extracts text and images (raster or vector) from the file and analyzes their contents. For each image, the assistant generates a descriptive caption and a set of keywords. Additionally, when it makes sense, the assistant captures data points found in the image (for example, values from a table or chart).
- During chat or context queries, the assistant searches for relevant text and image context it captured when analyzing the PDF. Image context can include the original image data (base64-encoded).
- The assistant passes this context to the LLM, which uses it to generate responses.
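Purely as an illustration of the information described above, the context captured for a single image could be pictured as the following dictionary. The field names are hypothetical, not the actual API schema:

```python
# Hypothetical shape of the context captured for one image; field names and
# values are illustrative only, not the actual Pinecone API schema.
image_context = {
    "caption": "Bar chart of quarterly revenue for 2024, in millions USD",
    "keywords": ["revenue", "bar chart", "quarterly", "2024"],
    "data_points": {"Q1": 12.4, "Q2": 13.1, "Q3": 14.0, "Q4": 15.6},
    "binary_content": "<base64-encoded image data>",  # included only when requested
}
```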
For an overview of how Pinecone Assistant works, see Pinecone Assistant architecture.
Try it out
The following steps demonstrate how to create an assistant, provide it with a PDF that contains images, and then query that assistant using the chat and context APIs.

All versions of Pinecone’s Assistant API allow you to upload multimodal PDFs.
1. Create an assistant
First, if you don’t have one, create an assistant. The sketch below shows one way to do this with the Python SDK.

You don’t need to create a new assistant to use multimodal context. Existing assistants can enable multimodal context for newly uploaded PDFs, as described in the next section.
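A minimal sketch, assuming the Pinecone Python SDK with the assistant plugin installed (`pip install pinecone pinecone-plugin-assistant`); method names may differ slightly between SDK versions:

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")

# Create an assistant; skip this step if you already have one.
assistant = pc.assistant.create_assistant(
    assistant_name="example-assistant",
    instructions="Answer questions using the uploaded PDF reports.",
)
```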
2. Upload a multimodal PDF
To enable multimodal context for a PDF, set the `multimodal` URL parameter to true when uploading the file (it defaults to false), as shown in the sketch below.
- The `multimodal` parameter is only available for PDF files.
- To check the status of a file, use the describe a file upload endpoint.
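A minimal sketch of the upload request using Python's requests library. The data-plane host shown here is an assumption (confirm the upload endpoint for your project); the `multimodal` query parameter is the one described above:

```python
import requests

API_KEY = "YOUR_API_KEY"
ASSISTANT_NAME = "example-assistant"
# The assistant data-plane host below is an assumption; confirm it for your project.
UPLOAD_URL = f"https://prod-1-data.ke.pinecone.io/assistant/files/{ASSISTANT_NAME}"

with open("financial-report.pdf", "rb") as pdf:
    resp = requests.post(
        UPLOAD_URL,
        headers={"Api-Key": API_KEY},
        params={"multimodal": "true"},  # enable multimodal context for this PDF
        files={"file": ("financial-report.pdf", pdf, "application/pdf")},
    )

print(resp.json())  # includes the file id and its processing status
```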
3. Chat with the assistant
Now, chat with your assistant. To tell the assistant to provide image-related context to the LLM:

- Set the `multimodal` request parameter to true (the default) in the `context_options` object. Setting `multimodal` to false means the LLM only receives text snippets.
- When `multimodal` is true, use `include_binary_content` to specify what image context the LLM should receive: base64 image data and captions (true) or captions only (false).
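A sketch of a chat request with these options, assuming the Python SDK's `chat` method accepts a `context_options` dictionary (the exact SDK surface may differ; the REST chat endpoint takes the same fields):

```python
from pinecone import Pinecone
from pinecone_plugins.assistant.models.chat import Message

pc = Pinecone(api_key="YOUR_API_KEY")
assistant = pc.assistant.Assistant(assistant_name="example-assistant")

msg = Message(role="user", content="What trend does the revenue chart on page 3 show?")

# Passing context_options through the SDK is an assumption about your SDK version;
# the keys mirror the context_options object in the chat request body.
resp = assistant.chat(
    messages=[msg],
    context_options={
        "multimodal": True,             # include image-related context (the default)
        "include_binary_content": True, # send base64 image data as well as captions
    },
)

print(resp.message.content)  # the assistant's reply
```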
Sending image-related context to the LLM (whether captions, base64 data, or both) increases token usage. Learn about monitoring spend and usage.
If your assistant uses multimodal context snippets to generate a response, no highlights are returned, even when `include_highlights` is true.

4. Query for context
To query context for a custom RAG workflow, you can retrieve context snippets directly and then pass these snippets to an LLM as context. To fetch image-related context snippets (as well as text snippets), set the `multimodal` request parameter to true (the default). When `multimodal` is true, use `include_binary_content` to specify what image context you’d like to receive: base64 image data and captions (true) or captions only (false).
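A sketch of a context query, assuming the Python SDK's `context` method forwards `multimodal` and `include_binary_content` to the context API (passing them through the SDK method is an assumption about your SDK version):

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
assistant = pc.assistant.Assistant(assistant_name="example-assistant")

# Passing the multimodal options through the SDK's context method is an assumption;
# the parameters themselves are the ones described above.
ctx = assistant.context(
    query="Summarize the quarterly revenue chart.",
    multimodal=True,               # return image-related snippets as well as text
    include_binary_content=False,  # captions only, no base64 image data
)

for snippet in ctx.snippets:
    print(snippet)  # pass these snippets to your own LLM as context
```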
If you set `multimodal` to true and `include_binary_content` to false, image objects are not returned in the snippets. If you set `multimodal` to false, only text snippets are returned.

Snippets are returned based on their semantic relevance to the provided query. When you set `multimodal` to true, you’ll receive the most relevant snippets, regardless of the types of content they contain. You can receive text snippets, multimodal snippets, or both.

Limits
Multimodal context for assistants is only available for PDF files. Additionally, the following limits apply:

| Metric | Starter plan | Standard plan | Enterprise plan |
|---|---|---|---|
| Max file size | 10 MB | 50 MB | 50 MB |
| Page limit | 100 | 100 | 100 |
| Multimodal PDFs per assistant | 1 | 20 | 20 |