RAG Storage Cost Calculator
Estimate the one-time embedding generation cost and monthly vector database hosting cost for your RAG (Retrieval-Augmented Generation) pipeline. Compare Pinecone, Weaviate, Qdrant, and Supabase pgvector side by side.
Embedding models: OpenAI · Cohere · Voyage AI. Vector DBs: Pinecone · Weaviate · Qdrant · Supabase pgvector. Prices from official provider pages, May 2025.
Assumes k=5 retrieved chunks × 512 tokens plus 50 query tokens of input, and ~400 output tokens per response
How to Use This Calculator
1. Choose your data input method
Select 'Number of PDFs' if you know your document count, or 'GB of Text Data' if you have a raw data size. PDF mode assumes 400 tokens per page; GB mode assumes 200M tokens per gigabyte of plain text (see the conversion sketch after these steps).
2. Select your embedding model
Pick the model you plan to use to generate embeddings. OpenAI text-embedding-3-small is the most cost-effective general-purpose option. text-embedding-3-large produces higher-quality vectors for demanding retrieval tasks. Cohere and Voyage are competitive alternatives.
3. Select your vector database
Choose the vector DB provider you plan to use. Each has different storage pricing models — Pinecone charges per GB, Weaviate per million vector-dimensions, Qdrant per GB, and Supabase charges a $25 base plus per-GB storage.
4. Enter your daily query volume
Enter how many RAG queries your application will make per day. This drives the monthly retrieval and LLM generation cost estimates.
5. Choose your LLM for generation
Select the language model that will generate answers using the retrieved context. GPT-4o mini is the most cost-effective option for standard Q&A. GPT-4o or Claude Sonnet are better for complex reasoning over retrieved content.
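As a rough illustration of how the two input modes in step 1 translate into token counts, here is a minimal Python sketch using the calculator's stated assumptions (400 tokens per PDF page, 200M tokens per GB of plain text). The function names and the 20-pages-per-PDF default are illustrative, not taken from the calculator itself.

```python
TOKENS_PER_PDF_PAGE = 400          # calculator assumption for business PDFs
TOKENS_PER_GB_TEXT = 200_000_000   # calculator assumption for plain text

def tokens_from_pdfs(num_pdfs: int, avg_pages_per_pdf: int = 20) -> int:
    """Estimate total tokens from a PDF count (the page count is an assumption)."""
    return num_pdfs * avg_pages_per_pdf * TOKENS_PER_PDF_PAGE

def tokens_from_gb(gb_of_text: float) -> int:
    """Estimate total tokens from raw plain-text size in GB."""
    return int(gb_of_text * TOKENS_PER_GB_TEXT)

print(tokens_from_pdfs(50_000))  # 400,000,000 tokens for 50k twenty-page PDFs
print(tokens_from_gb(2.0))       # 400,000,000 tokens for 2 GB of text
```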
Understanding RAG Infrastructure Costs
RAG pipelines have a two-phase cost structure that catches many teams off guard. The setup phase — converting documents into embeddings — is a one-time cost that scales with your data volume. The operational phase — hosting vectors and answering queries — is a recurring monthly cost that scales with both data size and query volume.
Most developers focus only on the LLM API cost (which is visible in their billing dashboard) and underestimate the vector storage and embedding costs. For a mid-size enterprise knowledge base of 50,000 documents averaging 20 pages each (roughly 400M tokens at 400 tokens per page), the one-time embedding cost runs from about $8 with text-embedding-3-small to over $50 with text-embedding-3-large, and hosting the resulting ~1.6M vectors adds a recurring monthly bill, anywhere from a few dollars on Pinecone or Qdrant to over $100 on Weaviate, before any queries are made.
Phase 1: Embedding Generation (One-Time)
Before you can search your documents, you must convert every chunk into a vector embedding using an embedding model API. The cost is: total tokens × price per million tokens. A 400-page business document contains roughly 160,000 tokens when chunked at 512 tokens with 50% overlap, producing about 625 vectors. At $0.020/M tokens (text-embedding-3-small), that single document costs about $0.003 to embed — cheap in isolation, but a 10,000-document corpus at the same density costs $30 to embed.
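A minimal sketch of that arithmetic, using the document's chunking defaults (512-token chunks, 256-token stride). The cost figure follows the formula above and bills the raw token count; if your provider bills every token actually sent, 50% overlap roughly doubles it.

```python
import math

def embedding_setup_cost(total_tokens: int,
                         price_per_m_tokens: float = 0.020,  # text-embedding-3-small
                         stride: int = 256) -> tuple[int, float]:
    """Return (vector_count, one_time_embedding_cost_usd) for a corpus."""
    vectors = math.ceil(total_tokens / stride)            # chunks at 50% overlap
    cost = total_tokens / 1_000_000 * price_per_m_tokens  # raw-token billing
    return vectors, cost

print(embedding_setup_cost(160_000))           # (625, 0.0032): one 400-page document
print(embedding_setup_cost(1_600_000_000)[1])  # ~$32 (the ~$30 above rounds per document first)
```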
Once embedded, you typically don't re-embed unless the document content changes. This makes embedding a capital expenditure rather than an operating cost. The key decision is choosing between a cheap model (lower upfront cost, fewer storage bytes, slightly lower retrieval quality) and a premium model (higher upfront cost, more storage, better semantic matching).
Phase 2: Vector Database Hosting (Monthly)
Every embedding must be stored in a vector database so it can be retrieved at query time using approximate nearest neighbour (ANN) search. Storage costs depend on vector count, embedding dimensions, and the provider's pricing model (the four are compared in the sketch after this list):
- Pinecone Serverless: charges $0.033/GB-month for stored vectors and a small per-query fee for ANN searches. Simple to use, no cluster management required.
- Weaviate Cloud: charges per million vector-dimensions per month ($0.05/1M), making it more cost-effective for lower-dimensional embeddings like Cohere or Voyage (1,024 dims vs. OpenAI's 1,536–3,072 dims).
- Qdrant Cloud: charges ~$0.040/GB-month for stored vectors, with a generous free tier (1 GB RAM). Strong performance for large datasets due to its Rust-based implementation.
- Supabase pgvector: storage is billed as standard PostgreSQL storage ($0.125/GB-month) plus a $25/month Pro plan base. The most economical choice if you already use Supabase — the marginal cost of adding pgvector is just the storage increment.
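To make the four pricing models concrete, here is a minimal sketch comparing estimated monthly storage bills for the same corpus. The per-unit prices are the May 2025 figures quoted above, and the raw-storage estimate (4 bytes per float32 dimension) ignores index and metadata overhead, so real bills will be somewhat higher.

```python
def monthly_storage_usd(vectors: int, dims: int) -> dict[str, float]:
    """Estimated monthly hosting cost per provider (storage only, May 2025 rates)."""
    gb = vectors * dims * 4 / 1e9            # 4 bytes per float32 dimension
    million_dims = vectors * dims / 1e6
    return {
        "Pinecone Serverless": gb * 0.033,
        "Weaviate Cloud": million_dims * 0.05,
        "Qdrant Cloud": gb * 0.040,
        "Supabase pgvector": 25.0 + gb * 0.125,  # $25 Pro base + storage
    }

# ~1.56M vectors at 1,536 dims (e.g. 50k twenty-page PDFs, text-embedding-3-small)
for provider, usd in monthly_storage_usd(1_562_500, 1536).items():
    print(f"{provider}: ${usd:,.2f}/month")
```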
Phase 3: Query Costs (Per Request)
Every user query triggers two billable events: a vector retrieval (small fee to the vector DB) and an LLM generation call (larger fee to the language model). The retrieval cost is typically negligible — Pinecone serverless charges roughly $0.0001 per query; Weaviate and Qdrant even less. The dominant cost is almost always the LLM generation step.
With k=5 retrieved chunks at 512 tokens each, plus a 50-token user query, each generation call requires about 2,610 input tokens. At GPT-4o pricing ($2.50/M input, $10.00/M output), a 400-token response costs about $0.0105 per query — $315/month for 1,000 daily queries. Switching to GPT-4o mini ($0.15/M input, $0.60/M output) reduces that to about $0.0006 per query — $18/month. For most Q&A and document retrieval tasks, GPT-4o mini delivers comparable quality at a fraction of the cost.
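A minimal sketch of that per-query arithmetic, assuming the section's defaults (k=5 chunks of 512 tokens, a 50-token query, a 400-token answer) and the May 2025 prices quoted above:

```python
def monthly_generation_usd(daily_queries: int,
                           in_price: float, out_price: float,  # USD per 1M tokens
                           k: int = 5, chunk_tokens: int = 512,
                           query_tokens: int = 50, output_tokens: int = 400) -> float:
    """Estimated monthly LLM generation cost for a RAG workload."""
    input_tokens = k * chunk_tokens + query_tokens   # 2,610 with the defaults
    per_query = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    return per_query * daily_queries * 30

print(monthly_generation_usd(1_000, 2.50, 10.00))  # GPT-4o: ~$315.75/month
print(monthly_generation_usd(1_000, 0.15, 0.60))   # GPT-4o mini: ~$18.95/month
```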
Optimising Your RAG Budget
The highest-leverage cost optimisation for most RAG deployments is choosing the right LLM for generation. A 10× cheaper model (GPT-4o mini vs. GPT-4o) typically reduces monthly costs by 80–90% with minimal quality degradation for standard document Q&A. Embedding model choice has a smaller but meaningful impact: text-embedding-3-small produces 1,536-dim vectors at $0.020/M tokens; text-embedding-3-large produces 3,072-dim vectors at $0.130/M. The storage cost difference is proportional to dimensions — text-embedding-3-large uses twice the storage at 6.5× the embedding cost.
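One way to see the trade-off is to compare a first-year total (one-time embedding plus twelve months of storage) for the two OpenAI models on the same corpus. A hedged sketch, assuming Pinecone's $0.033/GB-month rate and raw float32 storage only:

```python
def first_year_usd(total_tokens: int, price_per_m: float, dims: int,
                   stride: int = 256, storage_per_gb_month: float = 0.033) -> float:
    """One-time embedding cost plus 12 months of raw vector storage."""
    embed = total_tokens / 1e6 * price_per_m
    gb = (total_tokens // stride) * dims * 4 / 1e9
    return embed + 12 * gb * storage_per_gb_month

corpus = 400_000_000  # e.g. 50k twenty-page business PDFs
print(first_year_usd(corpus, 0.020, 1536))  # text-embedding-3-small: ~$11.80
print(first_year_usd(corpus, 0.130, 3072))  # text-embedding-3-large: ~$59.60
```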
For most teams starting a new RAG project, the recommended baseline is: text-embedding-3-small for embeddings, Pinecone or Supabase for vector storage (depending on existing infrastructure), and GPT-4o mini or Claude Haiku for generation. Upgrade to larger models only after benchmarking retrieval quality on your specific domain.
Frequently Asked Questions
What is RAG and why does it have storage costs?
Retrieval-Augmented Generation (RAG) is a technique where you store your documents in a vector database, then retrieve the most relevant chunks at query time to provide as context to an LLM. Storage costs arise because you must: (1) convert every document chunk into a dense vector (an array of floating-point numbers called an embedding) using an embedding model API, and (2) host all those vectors in a vector database so they can be searched in milliseconds. The embedding step is a one-time cost per document set; the hosting is an ongoing monthly cost.
What is a vector embedding and how big is it?
A vector embedding is a fixed-length array of floating-point numbers that represents the semantic meaning of a text chunk. OpenAI's text-embedding-3-small produces 1,536-dimensional vectors; text-embedding-3-large produces 3,072 dimensions. Each dimension is stored as a 32-bit float (4 bytes), so a single 1,536-dim embedding takes 6,144 bytes (about 6 KB). For 100,000 document chunks, that's roughly 600 MB of raw vector data before metadata overhead — which is why vector database storage costs can add up for large document sets.
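The storage arithmetic is simple enough to check directly; a minimal sketch:

```python
def raw_vector_bytes(vectors: int, dims: int, bytes_per_dim: int = 4) -> int:
    """Raw float32 storage for a set of embeddings, before index/metadata overhead."""
    return vectors * dims * bytes_per_dim

print(raw_vector_bytes(1, 1536))        # 6,144 bytes per vector (~6 KB)
print(raw_vector_bytes(100_000, 1536))  # 614,400,000 bytes (~600 MB)
```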
What is chunking and how does it affect costs?
Chunking is the process of splitting long documents into smaller, overlapping segments before embedding them. A typical configuration uses 512-token chunks with a 256-token stride (50% overlap). The overlap ensures that context spanning two adjacent chunks is captured in at least one vector. More chunks = more embedding API calls (higher one-time cost) and more vectors to store (higher ongoing hosting cost). This calculator uses 512-token chunks with 256-token stride as defaults — standard practice for most RAG use cases.
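For illustration, a minimal token-level chunker under those defaults. Real pipelines usually operate on a tokenizer's output (e.g. tiktoken) and split on sentence or section boundaries, so treat this as a sketch of the counting logic only:

```python
def chunk_tokens(tokens: list, chunk_size: int = 512, stride: int = 256) -> list:
    """Split a token sequence into overlapping chunks (50% overlap by default).
    The final chunk may be shorter than chunk_size."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), stride)]

doc = list(range(160_000))     # stand-in for a 400-page document's tokens
print(len(chunk_tokens(doc)))  # 625 chunks -> 625 vectors to embed
```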
Which embedding model should I choose?
For most use cases, OpenAI text-embedding-3-small is the best starting point: it's inexpensive ($0.020/M tokens), widely supported, and performs well on English enterprise content. text-embedding-3-large produces higher-quality embeddings (especially for nuanced semantic matching) at 6.5× the cost and twice the storage. Cohere embed-v3 and Voyage AI voyage-3 are strong alternatives — particularly Voyage, which often outperforms OpenAI models on domain-specific retrieval benchmarks at a competitive price. If you're building a production system handling specialised content (legal, medical, technical), benchmark all four on a sample of your actual queries.
What is the difference between Pinecone, Weaviate, Qdrant, and Supabase pgvector?
Pinecone is a managed, fully-serverless vector database with minimal setup — ideal for teams who want zero infrastructure management. Weaviate is an open-source vector DB with a managed cloud offering; it supports multi-modal search and hybrid (vector + keyword) search natively. Qdrant is a high-performance open-source vector DB written in Rust, available both self-hosted and as a managed cloud service; it excels at large-scale deployments requiring low latency. Supabase pgvector uses PostgreSQL's pgvector extension — ideal if you already use Supabase and want to avoid a separate service; performance is lower than dedicated vector DBs at very large scale but adequate for most use cases under 5M vectors.
What does 'cost per query' include?
Each RAG query involves two steps: (1) vector retrieval — searching the database for the k most similar chunks (k=5 by default); and (2) LLM generation — passing the retrieved context plus the user's question to an LLM to generate an answer. The retrieval cost is charged by the vector DB provider (typically very small for Weaviate, Qdrant, and Supabase; slightly higher for Pinecone serverless). The generation cost is charged by the LLM provider based on the total input tokens (retrieved chunks + query) and output tokens (the generated answer). This calculator estimates both and shows the combined per-query cost.
How accurate are these estimates?
The estimates are designed for ballpark planning and provider comparison, not exact billing. Key assumptions include: 400 tokens per PDF page (reasonable for business documents; dense technical PDFs may be higher), 200M tokens per GB of plain text, 512-token chunks with 50% overlap, and k=5 retrieved chunks per query. Actual costs will vary based on your document types, chunk size settings, query complexity, and provider plan tier. Always check the latest pricing on each provider's website before making procurement decisions.
How do I reduce RAG storage costs?
The four main levers are: (1) Use a smaller embedding model — switching from text-embedding-3-large to text-embedding-3-small cuts both embedding cost (6.5×) and storage (50%). (2) Reduce chunk overlap — cutting overlap from 50% to 25% reduces chunk count by roughly 33%, with modest impact on retrieval quality. (3) Use Supabase pgvector if you already pay for a Supabase Pro plan — pgvector storage is included in your database allocation. (4) Use a cheaper LLM for generation — GPT-4o mini costs 94% less than GPT-4o per query with comparable quality for most Q&A tasks.
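Lever (2) is easy to sanity-check: chunk count scales with 1/stride, so moving from a 256-token stride (50% overlap of a 512-token chunk) to a 384-token stride (25% overlap) cuts the vector count by a third:

```python
import math

def chunk_count(total_tokens: int, stride: int) -> int:
    return math.ceil(total_tokens / stride)

tokens = 400_000_000
at_50 = chunk_count(tokens, 256)   # 50% overlap of a 512-token chunk
at_25 = chunk_count(tokens, 384)   # 25% overlap
print(at_50, at_25, round(1 - at_25 / at_50, 2))  # ~33% fewer vectors to store
```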
Does the free tier cover my use case?
Pinecone's free tier includes 2 GB of serverless storage and is suitable for prototypes under ~300K document chunks (1,536-dim). Weaviate's sandbox is free but resource-limited; suitable for development only. Qdrant's free cloud tier provides 1 GB of RAM — approximately 100K–200K vectors depending on dimensions. Supabase's free tier includes 500 MB of database storage shared across all tables, including pgvector. For most production RAG deployments processing thousands of business documents, you will need a paid tier. This calculator helps you estimate whether the cost is justified by the query volume.
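To check whether a free tier fits your corpus, divide the tier's storage budget by the per-vector footprint. A minimal sketch, counting raw float32 storage only (real capacity is lower once index and metadata overhead are included):

```python
def vectors_that_fit(storage_gb: float, dims: int) -> int:
    """Rough vector capacity for a storage budget (raw float32 only)."""
    return int(storage_gb * 1e9 / (dims * 4))

print(vectors_that_fit(2.0, 1536))  # Pinecone free tier: ~325k vectors raw
print(vectors_that_fit(0.5, 1536))  # Supabase free tier: ~81k vectors raw
```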
Related Calculators
AI Model Router Savings Calculator
See how much you save routing easy queries to cheaper LLM models.
Prompt Caching Discount Estimator
Calculate savings from caching system prompts with Claude, GPT-4o, or Gemini.
Multimodal Payload Estimator
Estimate token costs for sending images, video, or audio to vision models.
Marketing ROI Calculator
Calculate the return on your marketing investment.