This guide walks you through testing Pinecone at scale with a real-world dataset and search workload. You’ll learn how to:
  • Gather requirements and define success criteria for your use case
  • Set up and configure a production-scale test environment
  • Import and index millions of vectors efficiently using bulk import
  • Run comprehensive benchmarks with Vector Search Bench (VSB)
  • Analyze performance results and validate against success criteria
  • Test production readiness with monitoring and backup procedures
While this guide uses a specific product search example, the testing methodology and framework can be adapted to evaluate Pinecone for any vector search application.
If you’re new to Pinecone Vector Database, complete the quickstart to learn basic concepts and operations.

1. Gather requirements

Start by clearly defining your use case, success criteria, data characteristics, and performance expectations. This ensures your test accurately reflects your production needs and provides meaningful results for decision-making.
1

Identify use case

Clearly define the search problem you need to solve. Your use case influences how you define success criteria and how you configure Pinecone.
Use case for this test:
  • A product search application using semantic search to find similar products based on descriptions and features.
  • Users search for products using natural language queries, expecting relevant results ranked by similarity.
2

Determine data requirements

Analyze your dataset characteristics, including volume, dimensions, embedding strategy, and expected costs. Understanding your data helps you choose appropriate index configurations and plan for storage and compute requirements.
Dataset for this test:
  • Volume: 110 million product records
  • Vector dimensions: 1024-dimensional dense embeddings
  • Embedding model: llama-text-embed-v2 with cosine similarity
  • Architecture pattern: “Head and tail” search pattern with two namespaces:
    • “head” namespace: 10 million vectors for frequently accessed recent products
    • “tail” namespace: 100 million vectors for complete historical dataset
3

Define workload requirements

Specify the expected query load, traffic patterns, and concurrent usage that your system must handle. Performance requirements guide your testing methodology and help validate whether your configuration can handle production workloads.
Workload requirements for this test:
  • Concurrent users: Test with 1-20 concurrent clients
  • Query patterns: 80% queries to head namespace, 20% to tail namespace
  • Peak load: Sustained high query volume during testing periods
4

Define success criteria

Establish measurable performance targets that define what constitutes a successful test. Success criteria should align with your production requirements and user expectations, covering latency, throughput, accuracy, and data freshness.
Success criteria for this test:
  • Query latency: < 100ms (p95) for frequently accessed data, < 200ms (p95) for comprehensive searches
  • Throughput: > 1000 QPS for high-frequency queries, > 500 QPS for comprehensive dataset queries
  • Search accuracy: > 95% recall for frequent queries, > 90% recall for comprehensive searches
  • Data freshness: New products searchable within 5 minutes for frequent data, 15 minutes for comprehensive dataset
Note: Meeting these criteria at sustained load will likely require a dedicated index; otherwise, read rate limits can become a bottleneck.
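If you plan to script the analysis in step 6, it helps to capture these targets as data up front. The following sketch is only a convenience for this guide; the dictionary structure and key names are arbitrary and not part of Pinecone or VSB:
Python
# Success criteria from step 1, captured as data so later analysis scripts
# can compare benchmark results against them. Key names are illustrative.
SUCCESS_CRITERIA = {
    "head": {"p95_latency_ms": 100, "min_throughput_rps": 1000, "min_recall": 0.95},
    "tail": {"p95_latency_ms": 200, "min_throughput_rps": 500, "min_recall": 0.90},
}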

2. Set up your environment

Proper environment setup is critical for large-scale testing success. You’ll configure your Pinecone account with appropriate plan limits, install the necessary tools for benchmarking, and establish secure authentication. This foundation ensures you have the infrastructure needed to conduct comprehensive performance tests.
1

Set up your Pinecone account

This test requires a Pinecone account on the Standard or Enterprise plan. The scope of the test exceeds the limits of the Starter plan.
2

Install the Python SDK

This test uses the Python SDK. Install the SDK:
Terminal
# Install Pinecone Python SDK
pip install "pinecone[grpc]"
3

Create and export your API key

You need an API key to authenticate your requests to Pinecone. Create a new API key in the Pinecone console and then export it as an environment variable in your terminal:
Terminal
export PINECONE_API_KEY="YOUR_API_KEY"

3. Create your index

Index configuration is foundational to achieving your performance targets. The choices you make for embedding models, dimensions, similarity metrics, and deployment regions directly impact search accuracy, query latency, and operational costs. For this test, create a dense vector index optimized for the product search use case:
Python
from pinecone import Pinecone

# Initialize Pinecone client
pc = Pinecone(api_key="YOUR_API_KEY")

# Create index with integrated embedding
index_name = "search-test-at-scale"
if not pc.has_index(index_name):
    index = pc.create_index_for_model(
        name=index_name,
        cloud="aws",
        region="us-east-1",  # Choose region closest to your users
        embed={
            "model": "llama-text-embed-v2",
            "dimension": 1024,
            "metric": "cosine",
            "field_map": {"text": "chunk_text"}
        }
    )
    print(f"Created index: {index}")
else:
    print(f"Index {index_name} already exists")

4. Import the dataset

Large-scale data import requires efficient bulk loading strategies to minimize time and cost while ensuring data integrity. Pinecone's import feature enables you to load millions of vectors from object storage in parallel, which is significantly faster and more cost-effective than individual upserts. This phase tests your data pipeline's ability to handle production volumes. For this test, use the import feature to load 110 million product records into two distinct namespaces within your index.
1

Start bulk import

Start the import process for each namespace:
Python
import os
from pinecone import Pinecone, ImportErrorMode

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("search-test-at-scale")

# 2 namespaces, 100M and 10M records
root = "s3://fe-customer-pocs/at-scale-pocs/review_100000000_20250620_203728/search/dense"
import_job = index.start_import(
    uri=root,
    error_mode=ImportErrorMode.CONTINUE  # or ImportErrorMode.ABORT
)
print(f"Import started: {import_job['id']}")
2

Monitor import progress

The amount of time required for an import depends on various factors, including dimensionality and metadata complexity. For this test, the import should take around 3-4 hours.
To track progress, check the status bar in the Pinecone console or use the describe import operation with the import ID:
Python
import os
import time
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("search-test-at-scale")

while True:
    status = index.describe_import(id="<IMPORT_ID>")
    print(f"Status: {status['status']}, Progress: {status['percent_complete']:.1f}%")
    if status['status'] == "Completed":
        print("Import completed successfully!")
        break
    elif status['status'] == "Failed":
        print("Import failed. Check error details.")
        break
    time.sleep(300)  # Check every 5 minutes
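If you didn't record the import ID, you can list recent imports on the index. A short sketch, assuming the index handle from the previous snippet:
Python
# List imports to recover an import ID; list_imports returns a generator.
for imp in index.list_imports():
    print(f"id={imp['id']}, status={imp['status']}, progress={imp['percent_complete']}%")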

5. Run benchmarks

Systematic benchmarking measures your index's performance under realistic conditions and generates the quantitative data needed to validate your system's readiness. You'll use the Vector Search Bench (VSB) tool to simulate production query patterns, measure latency, throughput, and recall, and test different load scenarios, configuring synthetic workloads that match your data characteristics and expected usage patterns.
Note: VSB doesn't currently support running against a specific namespace; it runs against the default namespace only.
1

Install Vector Search Bench (VSB)

Clone the VSB repository and use Poetry to install the dependencies:
Terminal
git clone https://github.com/pinecone-io/VSB.git
cd VSB
pip install poetry && poetry install
eval $(poetry env activate)
2

Test high-frequency queries

Simulate high-frequency queries against your head namespace using a synthetic workload that matches your 1024-dimensional vectors:
Terminal
vsb \
    --database="pinecone" \
    --workload=synthetic-proportional \
    --pinecone_api_key="$PINECONE_API_KEY" \
    --pinecone_index_name="search-test-at-scale" \
    --synthetic_dimensions=1024 \
    --synthetic_metric=cosine \
    --users=10 \
    --requests_per_sec=100 \
    --synthetic_requests=10000 \
    --skip_populate
This test simulates 10 concurrent users querying your head namespace at 100 requests per second, measuring how well your index handles frequent queries.
3

Test comprehensive dataset queries

Test performance against your full tail namespace dataset with lower concurrency but higher query volume:
Terminal
vsb \
    --database=pinecone \
    --workload=synthetic-proportional \
    --pinecone_api_key="$PINECONE_API_KEY" \
    --pinecone_index_name="search-test-at-scale" \
    --pinecone_namespace="tail" \
    --synthetic_records=100000000 \
    --synthetic_dimensions=1024 \
    --synthetic_metric=cosine \
    --users=5 \
    --requests_per_sec=50 \
    --synthetic_requests=5000
This test evaluates performance against your complete dataset with realistic concurrent load.
4

Run mixed workload simulation

Simulate realistic production traffic patterns that query both namespaces. Since VSB doesn't natively support cross-namespace testing, run separate tests and combine the results, as shown in the sketch after these commands:
Terminal
# 80% of queries to head namespace
vsb \
    --database=pinecone \
    --workload=synthetic-proportional \
    --pinecone_api_key="$PINECONE_API_KEY" \
    --pinecone_index_name="search-test-at-scale" \
    --synthetic_records=10000000 \
    --synthetic_dimensions=1024 \
    --synthetic_metric=cosine \
    --users=8 \
    --requests_per_sec=80 \
    --synthetic_requests=12000

# 20% of queries to tail namespace
vsb \
    --database=pinecone \
    --workload=synthetic-proportional \
    --pinecone_api_key="$PINECONE_API_KEY" \
    --pinecone_index_name="search-test-at-scale" \
    --synthetic_records=100000000 \
    --synthetic_dimensions=1024 \
    --synthetic_metric=cosine \
    --users=2 \
    --requests_per_sec=20 \
    --synthetic_requests=3000
This approach simulates your expected traffic distribution across both namespaces.
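One way to combine the two runs is to weight each run's metrics by its share of traffic. The sketch below assumes you renamed each run's stats.json output (head_stats.json and tail_stats.json are hypothetical names) and that the field names match those used in the analysis step later in this guide; adjust them to whatever your VSB version actually emits. Note that a traffic-weighted p95 is only an approximation of the true mixed-workload p95:
Python
import json

# Hypothetical file names; rename each run's stats.json after it finishes.
FILES_AND_WEIGHTS = [("head_stats.json", 0.8), ("tail_stats.json", 0.2)]

combined = {"throughput_rps": 0.0, "p95_latency_ms": 0.0}
for path, weight in FILES_AND_WEIGHTS:
    with open(path) as f:
        stats = json.load(f)
    # Field names are assumptions; check your stats.json for the real keys.
    combined["throughput_rps"] += weight * stats.get("throughput_rps", 0)
    combined["p95_latency_ms"] += weight * stats.get("p95_latency_ms", 0)

print(f"Blended mixed-workload estimate: {combined}")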
5

Test peak load scenarios

Validate performance under peak load conditions by increasing concurrent users and request rates:
Terminal
vsb \
    --database=pinecone \
    --workload=synthetic-proportional \
    --pinecone_api_key="$PINECONE_API_KEY" \
    --pinecone_index_name="search-test-at-scale" \
    --pinecone_namespace="head" \
    --synthetic_records=10000000 \
    --synthetic_dimensions=1024 \
    --synthetic_metric=cosine \
    --users=20 \
    --requests_per_sec=200 \
    --synthetic_requests=5000
This stress test validates whether your configuration can handle peak traffic scenarios.

6. Analyze performance

Now analyze VSB results, understand performance characteristics, and validate whether your system meets the success criteria defined in step 1.
1

Review benchmark results

VSB provides two types of output: real-time metrics during test execution and detailed results saved to stats.json files. This allows you to monitor progress and perform detailed analysis.
Real-time output during testing: VSB displays live metrics during execution, updating every 10 seconds with current performance data. Here's an example of what you'll see:
Vector Search Bench - Pinecone Test
Database: pinecone | Workload: synthetic-proportional | Users: 10

Live metrics (last 10 seconds):
Requests/sec: 98.4
Latency percentiles (ms):
  p50: 42.1
  p95: 78.5
  p99: 124.2

Progress: 8,750 / 10,000 requests (87.5%)
Elapsed: 01:28 | Estimated remaining: 00:13

Total requests issued: 8,750
Overall throughput: 99.2 RPS
Post-test analysis: After each test completes, VSB saves detailed results to stats.json. Analyze these results to compare performance across different scenarios:
Python
import json
import pandas as pd

def analyze_vsb_results(stats_file):
    """Analyze VSB stats.json output file"""
    with open(stats_file, 'r') as f:
        results = json.load(f)

    # Extract key performance metrics (field names may vary by VSB version; check your stats.json)
    metrics = {
        'avg_latency_ms': results.get('avg_latency_ms', 0),
        'p95_latency_ms': results.get('p95_latency_ms', 0),
        'p99_latency_ms': results.get('p99_latency_ms', 0),
        'throughput_rps': results.get('throughput_rps', 0),
        'recall': results.get('recall', 0),
        'total_requests': results.get('total_requests', 0)
    }

    return metrics

# Analyze results from each test scenario
head_metrics = analyze_vsb_results('head_namespace_stats.json')
tail_metrics = analyze_vsb_results('tail_namespace_stats.json')
peak_metrics = analyze_vsb_results('peak_load_stats.json')

# Create performance comparison
comparison_df = pd.DataFrame({
    'Head namespace': head_metrics,
    'Tail namespace': tail_metrics,
    'Peak load': peak_metrics
})

print("Performance comparison:")
print(comparison_df)
Expected performance characteristics:
  1. Head namespace (10M vectors, high-frequency queries):
    • Latency: p95 latency around 50-80ms for 10 concurrent users at 100 RPS
    • Throughput: Should sustain 100+ RPS with 10 concurrent users
    • Recall: High recall (>0.95) due to smaller dataset and frequent access patterns
  2. Tail namespace (100M vectors, comprehensive queries):
    • Latency: p95 latency around 100-150ms for 5 concurrent users at 50 RPS
    • Throughput: Should sustain 50+ RPS with lower concurrency
    • Recall: Slightly lower recall (>0.90) due to larger dataset
  3. Peak load scenarios:
    • Latency degradation: Expect 2-3x higher latencies under peak load (20 users, 200 RPS)
    • Rate limiting: May hit plan limits at sustained high throughput
2

Validate against success criteria

Compare your benchmark results against the success criteria defined in step 1:
Test scenario          | Target p95 latency | Target throughput | Target recall | Your results
Head namespace queries | < 100ms            | > 100 RPS         | > 0.95        | Enter results
Tail namespace queries | < 200ms            | > 50 RPS          | > 0.90        | Enter results
Mixed workload         | < 150ms            | > 75 RPS          | > 0.92        | Enter results
Peak load              | < 300ms            | > 150 RPS         | > 0.90        | Enter results
Success evaluation:
  • ✅ Pass: Results meet or exceed targets
  • ⚠️ Marginal: Results within 20% of targets; consider optimizations
  • ❌ Fail: Results significantly below targets; requires optimization or architecture changes
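If you captured the targets as data (as in the sketch from step 1) and extracted metrics with analyze_vsb_results, the evaluation can be scripted. The helper below is a sketch using this guide's illustrative key names, not a VSB feature:
Python
def evaluate(metrics, targets):
    """Return Pass, Marginal, or Fail for one test scenario."""
    meets = [
        metrics["p95_latency_ms"] <= targets["p95_latency_ms"],
        metrics["throughput_rps"] >= targets["min_throughput_rps"],
        metrics["recall"] >= targets["min_recall"],
    ]
    if all(meets):
        return "Pass"
    # Treat results within 20% of every target as marginal.
    near = [
        metrics["p95_latency_ms"] <= targets["p95_latency_ms"] * 1.2,
        metrics["throughput_rps"] >= targets["min_throughput_rps"] * 0.8,
        metrics["recall"] >= targets["min_recall"] * 0.8,
    ]
    return "Marginal" if all(near) else "Fail"

# Example usage with metrics from the analysis step and the step 1 criteria:
print("Head namespace:", evaluate(head_metrics, SUCCESS_CRITERIA["head"]))
print("Tail namespace:", evaluate(tail_metrics, SUCCESS_CRITERIA["tail"]))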

7. Test production readiness

Production readiness testing goes beyond basic performance metrics to evaluate operational aspects like monitoring, backup procedures, and system reliability. This phase ensures your system can handle real-world operational demands and provides the observability needed for production deployment.
1

Set up monitoring and observability

Implement comprehensive monitoring to track index performance, query patterns, and system health. Pinecone provides multiple monitoring options for production deployments.
Pinecone console monitoring: Monitor basic metrics through the Pinecone console, including index statistics, query volume, and performance trends.
Prometheus integration: For production environments, integrate with Prometheus for advanced monitoring and alerting:
Python
from pinecone import Pinecone
import time
import logging
from datetime import datetime

def setup_performance_monitoring(index, duration_minutes=30):
    """
    Monitor index performance with metrics suitable for Prometheus export
    """
    start_time = time.time()
    end_time = start_time + (duration_minutes * 60)

    metrics_log = []

    while time.time() < end_time:
        # Collect index statistics for monitoring
        stats = index.describe_index_stats()

        # Prepare metrics for export to monitoring systems
        monitoring_metrics = {
            'timestamp': datetime.now().isoformat(),
            'total_vector_count': stats['total_vector_count'],
            'head_namespace_count': stats['namespaces']['head']['vector_count'],
            'tail_namespace_count': stats['namespaces']['tail']['vector_count'],
            'index_fullness': stats['index_fullness'],
        }

        metrics_log.append(monitoring_metrics)

        # Log metrics for Prometheus scraping or direct export
        logging.info(f"index_total_vectors{{namespace=\"all\"}} {monitoring_metrics['total_vector_count']}")
        logging.info(f"index_vectors{{namespace=\"head\"}} {monitoring_metrics['head_namespace_count']}")
        logging.info(f"index_vectors{{namespace=\"tail\"}} {monitoring_metrics['tail_namespace_count']}")
        logging.info(f"index_fullness_ratio {monitoring_metrics['index_fullness']}")

        time.sleep(60)  # Check every minute

    return metrics_log

# Initialize monitoring
pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("search-test-at-scale")
monitor_results = setup_performance_monitoring(index, duration_minutes=30)
Application-level monitoring: Implement query-level monitoring to track latency, error rates, and throughput in your application.
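As an illustration of application-level monitoring, the following sketch wraps index.query calls to log latency and errors; the wrapper and metric names are conventions invented for this guide, not a Pinecone feature:
Python
import time
import logging

def monitored_query(index, namespace, **query_kwargs):
    """Wrap index.query to log latency and errors per namespace."""
    start = time.perf_counter()
    try:
        response = index.query(namespace=namespace, **query_kwargs)
        elapsed_ms = (time.perf_counter() - start) * 1000
        logging.info(f'query_latency_ms{{namespace="{namespace}"}} {elapsed_ms:.1f}')
        return response
    except Exception:
        logging.exception(f'query_error{{namespace="{namespace}"}} 1')
        raise

# Example usage with a 1024-dimensional query vector:
# results = monitored_query(index, "head", vector=[0.0] * 1024, top_k=10)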
2

Test backup and recovery procedures

Validate backup and restore functionality to ensure data protection and disaster recovery capabilities.
Python
from pinecone import Pinecone
from datetime import datetime

pc = Pinecone(api_key="YOUR_API_KEY")
index_name = "search-test-at-scale"

# Create a backup (backups are managed through the client, not the index handle)
backup = pc.create_backup(
    index_name=index_name,
    backup_name="search-test-backup-" + datetime.now().strftime("%Y%m%d-%H%M%S")
)
print(f"Backup created: {backup}")

# List available backups for the index
backups = pc.list_backups(index_name=index_name)
print(f"Available backups: {backups}")

# Validate backup completion
backup_status = pc.describe_backup(backup_id=backup.backup_id)
print(f"Backup status: {backup_status.status}")

# Test the restore process by creating a new index from the backup (use with caution in testing environments)
# restored = pc.create_index_from_backup(name="search-test-restored", backup_id=backup.backup_id)
Important considerations:
  • Test backup and restore procedures in a non-production environment
  • Document recovery time objectives (RTO) and recovery point objectives (RPO)
  • Validate that restored data maintains integrity and search functionality
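To validate a restore end to end, one approach is to restore into a separate index and compare record counts against the source. A rough sketch, assuming the pc client and backup object from the previous snippet, and intended for non-production environments only:
Python
import time

# Restore into a temporary index from the backup created above.
restored_name = "search-test-restored"
pc.create_index_from_backup(name=restored_name, backup_id=backup.backup_id)

# Wait for the restored index to become ready.
while not pc.describe_index(restored_name).status["ready"]:
    time.sleep(30)

# Compare record counts between the source and restored indexes.
source_stats = pc.Index("search-test-at-scale").describe_index_stats()
restored_stats = pc.Index(restored_name).describe_index_stats()
print(f"Source vectors: {source_stats['total_vector_count']}")
print(f"Restored vectors: {restored_stats['total_vector_count']}")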

8. Review costs

Cost analysis provides critical data for production planning and budgeting decisions. Understanding the total cost of ownership helps you optimize your architecture, plan capacity scaling, and make informed decisions about production deployment strategies. You’ll calculate costs for data import, storage, queries, and operational overhead. This analysis ensures your testing results translate into sustainable production economics.
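A simple way to structure the cost review is to plug your measured volumes into a back-of-the-envelope calculation. The sketch below is illustrative only: the unit prices are placeholders to fill in from Pinecone's pricing page for your plan, and the read-unit and traffic figures are assumptions to replace with your own measurements:
Python
# Placeholder unit prices; replace with current values from Pinecone's pricing page.
PRICE_PER_GB_MONTH = 0.0        # storage, $ per GB-month
PRICE_PER_MILLION_RU = 0.0      # read units, $ per 1M RUs

# Illustrative estimates; substitute measurements from your own test.
storage_gb = 110_000_000 * 1024 * 4 / 1e9   # ~450 GB of raw float32 vectors, metadata excluded
monthly_queries = 1000 * 3600 * 24 * 30     # assumes sustained 1000 QPS around the clock
read_units_per_query = 1                    # placeholder; depends on namespace size and top_k

monthly_cost = (
    storage_gb * PRICE_PER_GB_MONTH
    + monthly_queries * read_units_per_query / 1e6 * PRICE_PER_MILLION_RU
)
# Add import (write-unit) and any dedicated-capacity costs the same way.
print(f"Estimated storage: {storage_gb:.0f} GB; estimated monthly cost: ${monthly_cost:,.2f}")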

Next steps
