1. Upsert
- Generate a hierarchical naming convention for vector IDs.
  - One recommended pattern is `parentId-chunkId`, where `parentId` is the ID of the document and `chunkId` is an integer that starts at 0 and increments for each chunk in the document.
- While capturing embeddings and preparing upserts for Pinecone, capture the total number of chunks for each `parentId`.
- Append the `chunkCount` to the metadata field of the `parentId-0` vector, or to all chunks if desired. This should be an integer, and its cardinality will naturally be low.
- Upsert the vectors with `parentId-chunkId` as the ID.
- Reverse lookups can be created where you find a chunk and want to find the parent document or sibling chunks.
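
The upsert preparation above can be sketched in Python. This is a minimal illustration, not an official implementation: the helper names `build_chunk_records` and `parent_of` are hypothetical, and the record shape (`id`, `values`, `metadata`) is assumed to match what your Pinecone client accepts for upserts.

```python
from typing import Dict, List, Sequence


def build_chunk_records(parent_id: str,
                        embeddings: Sequence[Sequence[float]]) -> List[Dict]:
    """Build one record per chunk with an ID of the form parentId-chunkId.

    Hypothetical helper: `embeddings` holds one vector per chunk, in order.
    """
    chunk_count = len(embeddings)
    records = []
    for chunk_id, values in enumerate(embeddings):
        record = {"id": f"{parent_id}-{chunk_id}", "values": list(values)}
        if chunk_id == 0:
            # Store the total on the first chunk so it can be fetched later
            # when the document has to be deleted by ID.
            record["metadata"] = {"chunkCount": chunk_count}
        records.append(record)
    return records


def parent_of(vector_id: str) -> str:
    """Reverse lookup: recover the parentId from a chunk's vector ID.

    Splits on the last hyphen, so a parentId may itself contain hyphens.
    """
    return vector_id.rsplit("-", 1)[0]
```

With the Pinecone Python client, the resulting list would typically be passed to something like `index.upsert(vectors=records)`, in batches if the document is large. Check the client version you use for the exact call and request-size limits.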
2. Delete by ID (to avoid delete by metadata filter)
- Identify the `parentId`.
  - This could be an internal process to identify documents that have been modified or deleted.
  - Or, this could be an end-user initiated process to delete a document based on a query that finds a sibling chunk or `parentId`.
- Once the `parentId` is identified, use the `fetch` endpoint to retrieve the `chunkCount` from the metadata field by sending the `parentId-0` vector ID.
- Build a list of IDs using the pattern of `parentId` and `chunkCount`.
- Batch these together and send them to the `delete` endpoint using the IDs of the vectors.
- You may then upsert the new version of the document with the new vectors and metadata, or, if it is a delete-only process, you are finished.
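
A rough sketch of this delete flow follows. It assumes an `index` object that behaves like a Pinecone Python client index: `fetch(ids=[...])` returns a response whose `vectors` attribute maps IDs to objects with a `metadata` dict, and `delete(ids=[...])` removes vectors by ID. The helper names and the default `batch_size` are assumptions for illustration, so confirm the per-request ID limit for your client and plan.

```python
from typing import Iterator, List


def chunk_ids(parent_id: str, chunk_count: int) -> List[str]:
    """Rebuild every vector ID for a document from parentId and chunkCount."""
    return [f"{parent_id}-{i}" for i in range(chunk_count)]


def batched(ids: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size IDs."""
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


def delete_document(index, parent_id: str, batch_size: int = 1000) -> int:
    """Delete all chunks of a document by ID; return how many IDs were sent."""
    first_id = f"{parent_id}-0"
    response = index.fetch(ids=[first_id])
    vector = response.vectors.get(first_id)
    if vector is None:
        # Nothing stored under this parentId, so there is nothing to delete.
        return 0
    # Numeric metadata may come back as a float, so normalize to int.
    chunk_count = int(vector.metadata["chunkCount"])
    ids = chunk_ids(parent_id, chunk_count)
    for batch in batched(ids, batch_size):
        index.delete(ids=batch)
    return len(ids)
```

Because the IDs are reconstructed from the naming convention, no metadata filter is involved in the delete itself. The only read is the single `fetch` of the `parentId-0` vector.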
3. Updates
- Updates are intended to apply small changes to a record whether that means updating the vector, or more commonly, the metadata.
- In cases where you are chunking data, you will more likely need to delete and re-upsert using the steps above.
- If you are only performing very small changes to a small number of vectors, the update process is ideal.
- If you are updating a large number of vectors, you may want to consider batching and slowing down the updates to avoid rate limiting or affecting query latency and response times.
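
One way to batch and slow down a large set of metadata updates is sketched below. It assumes an `index` object exposing an `update(id=..., set_metadata=...)` method, as the Pinecone Python client does. The function name, `batch_size`, and `pause_seconds` are illustrative assumptions rather than recommended values, so tune them against your own rate limits and query latency.

```python
import time
from typing import Callable, Dict


def update_metadata_throttled(index,
                              changes: Dict[str, dict],
                              batch_size: int = 50,
                              pause_seconds: float = 1.0,
                              sleep: Callable[[float], None] = time.sleep) -> int:
    """Apply metadata updates one vector at a time, pausing between batches.

    `changes` maps a vector ID to the metadata fields to set on it.
    `sleep` is injectable so the pacing can be tested without waiting.
    """
    total = len(changes)
    done = 0
    for vector_id, new_metadata in changes.items():
        index.update(id=vector_id, set_metadata=new_metadata)
        done += 1
        # Pause after each full batch, but not after the final update.
        if done % batch_size == 0 and done < total:
            sleep(pause_seconds)
    return done
```

For example, `update_metadata_throttled(index, {"doc1-0": {"status": "reviewed"}})` would issue a single update with no pause. Here the `status` field is a made-up metadata key.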