CalcFuel

Multimodal Payload Estimator

Estimate the token count and API cost of sending images, video, or audio to GPT-4o, Claude 3.5 Sonnet, or Gemini 1.5 Pro. Each provider tokenises media differently — see the exact breakdown side by side.

OpenAI uses tile-based counting · Anthropic uses a pixel-division formula · Google uses 768×768 tiles at 258 tokens each. Prices from official provider pages, May 2025.



How to Use This Calculator

  1. Select your media type

    Choose Image, Video, or Audio depending on what you're sending to the model. Video and Audio are supported natively only by Gemini 1.5 Pro.

  2. Set the resolution or duration

    For images, pick a preset resolution (Thumbnail, SD, HD, Full HD, 4K) or enter custom dimensions. For video or audio, enter the duration in seconds.

  3. Set the item count

    Enter how many images, videos, or audio files you're sending in a single batch. Costs scale linearly per item.

  4. Review the comparison table

    The estimator shows tokens per item, total tokens, cost per item, and total cost for GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro side by side.

Understanding Multimodal Tokenisation

When you send an image, video clip, or audio file to a vision or multimodal model via API, the provider converts that media into tokens before processing. Unlike text — where 1 token ≈ 4 characters — media tokenisation depends on resolution, aspect ratio, duration, and the provider's internal encoding strategy. The same 1920×1080 image can cost anywhere from 85 tokens (GPT-4o low detail) to 1,844 tokens (Claude 3.5 Sonnet), depending on which model and settings you use.

Understanding this is critical for budgeting AI applications. A product catalogue analysis pipeline processing 500 product images per day generates very different monthly bills on GPT-4o versus Gemini — often a 3–5× difference — before you even consider the output token costs.

GPT-4o: Tile-Based Image Tokenisation

OpenAI's GPT-4o uses a two-mode system. In low detail mode, every image costs a flat 85 tokens regardless of resolution — useful for applications where fine visual detail is not needed. In high detail mode, the image goes through a three-step process: first scaled to fit within a 2048×2048 bounding box, then rescaled so the shortest side is 768 pixels, then divided into 512×512 tiles. Each tile costs 170 tokens, plus a fixed 85-token base. A typical 1280×720 HD image in high-detail mode scales to 1365×768, which divides into a 3×2 grid of tiles: (6 × 170) + 85 = 1,105 tokens.
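The three-step process above can be sketched in a few lines of Python. This is a minimal illustration of the rules as described here, not an official OpenAI helper — `gpt4o_image_tokens` is a hypothetical name:

```python
import math

def gpt4o_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate GPT-4o image tokens using the tile rules described above."""
    if detail == "low":
        return 85  # flat fee regardless of resolution
    # Step 1: scale to fit within a 2048x2048 bounding box.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: rescale so the shortest side is 768 px.
    scale = 768 / min(w, h)
    w, h = w * scale, h * scale
    # Step 3: count 512x512 tiles at 170 tokens each, plus an 85-token base.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return tiles * 170 + 85

print(gpt4o_image_tokens(1280, 720))           # 1105 — the HD example above
print(gpt4o_image_tokens(1920, 1080, "low"))   # 85
```

Note that because everything larger than HD is scaled down before tiling, a 4K input produces the same 1,105-token result as 1080p in this model.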

For applications sending many moderate-resolution images where exact visual detail matters (e.g. document extraction, UI screenshot analysis), high detail is appropriate. For thumbnail-level classification tasks (e.g. "is there a person in this photo?"), low detail at 85 tokens flat is dramatically cheaper.

Claude 3.5 Sonnet: Pixel-Division Formula

Anthropic uses a simpler but equally effective approach. The image is resized so its longest dimension is at most 1568 pixels, preserving the aspect ratio. The token count is then the ceiling of (width × height) ÷ 750. This linear formula means token costs scale predictably with image area. A 1568×1568 square at maximum size costs ⌈2,458,624 ÷ 750⌉ = 3,279 tokens — but most real-world images at HD or lower are well under 2,000 tokens each.

Claude's formula is transparent and easy to reason about. Unlike tile-based approaches, there's no non-linear step change when crossing tile boundaries. The cap at 1568 pixels also means that sending a 4K image costs the same as sending a 1568px equivalent — Anthropic's preprocessing erases the resolution difference before billing.
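That transparency shows in code — the whole formula fits in two expressions. An illustrative sketch (`claude_image_tokens` is not an Anthropic SDK function):

```python
import math

def claude_image_tokens(width: int, height: int) -> int:
    """Estimate Claude 3.5 Sonnet image tokens via the pixel-division formula."""
    # Resize so the longest edge is at most 1568 px, preserving aspect ratio.
    scale = min(1.0, 1568 / max(width, height))
    w, h = width * scale, height * scale
    # Token count is the ceiling of (width x height) / 750.
    return math.ceil(w * h / 750)

print(claude_image_tokens(1280, 720))   # 1229 — no resize needed
print(claude_image_tokens(3840, 2160))  # 1844 — downscaled to 1568x882 first
```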

Gemini 1.5 Pro: 258-Token Tiles for Images, Video, and Audio

Google's Gemini 1.5 Pro applies a unified 258-token-per-tile model across all media types. For images, the image is divided into 768×768 tiles without prior resizing: tile count = ⌈width ÷ 768⌉ × ⌈height ÷ 768⌉, with a minimum of one tile. This means a 3840×2160 4K image uses ⌈3840÷768⌉ × ⌈2160÷768⌉ = 5 × 3 = 15 tiles = 3,870 tokens — the only model where 4K images incur a meaningfully higher token count than 1080p.
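Because there is no resizing step, the tile count reduces to two ceilings. A sketch of the formula as described above (the helper name is illustrative, not part of the Google SDK):

```python
import math

def gemini_image_tokens(width: int, height: int) -> int:
    """Estimate Gemini 1.5 Pro image tokens: 258 per 768x768 tile, no resizing."""
    tiles = max(1, math.ceil(width / 768) * math.ceil(height / 768))
    return tiles * 258

print(gemini_image_tokens(3840, 2160))  # 15 tiles -> 3870 tokens
print(gemini_image_tokens(320, 240))    # below one tile still bills 258
```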

For video, Gemini samples at 1 frame per second (regardless of the original frame rate) and encodes each frame as a single tile at 258 tokens, then adds audio at 32 tokens per second. This makes Gemini the only model of the three with native video file support. A 3-minute product demo video costs (180 × 258) + (180 × 32) = 46,440 + 5,760 = 52,200 tokens, or about $0.065 at $1.25/M — inexpensive enough for high-volume video analysis pipelines.

Audio is encoded at a flat 32 tokens/second. This covers speech, music, ambient sound, and any audio format supported by the Gemini API (MP3, WAV, AAC, FLAC, and others).
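For video and audio, the arithmetic reduces to per-second multipliers. A sketch under the rates quoted above (hypothetical helper names, not Gemini API calls):

```python
def gemini_video_tokens(duration_s: int, include_audio: bool = True) -> int:
    """Gemini 1.5 Pro video: 1 sampled frame/s at 258 tokens, plus 32/s of audio."""
    tokens = duration_s * 258          # one 258-token frame per second
    if include_audio:
        tokens += duration_s * 32      # audio track billed separately
    return tokens

def gemini_audio_tokens(duration_s: int) -> int:
    """Audio-only input: flat 32 tokens per second."""
    return duration_s * 32

print(gemini_video_tokens(180))  # 3-minute demo -> 52200 tokens
print(gemini_audio_tokens(300))  # 5-minute file -> 9600 tokens
```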

Practical Cost Comparison

For a typical HD image (1280×720) sent 10,000 times per month: GPT-4o high detail costs about $28/month, Claude 3.5 Sonnet about $37/month, and Gemini 1.5 Pro about $6.45/month. For video or audio workloads, Gemini is the clear choice — it's the only model with native support, and its pricing is competitive even for high-volume use.
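Under the per-image token counts and May 2025 input prices quoted in the sections above, the monthly figures work out as follows (assumes every GPT-4o request uses high detail; media tokens only — text prompts are billed on top):

```python
# Input price per million tokens (May 2025) and tokens per 1280x720 image,
# both taken from the figures stated earlier on this page.
PRICE_PER_M = {"gpt-4o": 2.50, "claude-3.5-sonnet": 3.00, "gemini-1.5-pro": 1.25}
TOKENS_HD = {"gpt-4o": 1105, "claude-3.5-sonnet": 1229, "gemini-1.5-pro": 516}

def monthly_image_cost(model: str, images_per_month: int = 10_000) -> float:
    """Media-only monthly cost; text prompt tokens are billed separately."""
    total_tokens = TOKENS_HD[model] * images_per_month
    return total_tokens / 1_000_000 * PRICE_PER_M[model]

for model in PRICE_PER_M:
    print(f"{model}: ${monthly_image_cost(model):.2f}/month")
```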

However, cost alone shouldn't drive model selection. GPT-4o and Claude 3.5 Sonnet often produce higher-quality vision outputs for tasks requiring fine-grained reasoning about image content. For high-stakes document extraction, medical imaging analysis, or UI parsing, the marginal cost difference may be well worth the quality uplift.

Frequently Asked Questions

How does GPT-4o count image tokens?

GPT-4o uses a tile-based approach for high-detail images. The image is first scaled to fit within a 2048×2048 bounding box, then scaled so the shortest side is 768px. The resulting image is divided into 512×512 tiles, and each tile costs 170 tokens, plus a flat 85-token base fee. A 1920×1080 image in high-detail mode becomes 1365×768 after scaling, yielding a 3×2 tile grid: (6 × 170) + 85 = 1,105 tokens. For low-detail mode, the cost is always 85 tokens regardless of resolution.

How does Anthropic Claude count image tokens?

Claude 3.5 Sonnet uses a pixel-division formula. The image is resized so its longest edge is at most 1568 pixels (preserving aspect ratio). The token count is then calculated as the ceiling of (width × height) ÷ 750. A 1280×720 image stays at its native size (longest edge 1280 < 1568) and costs ⌈921,600 ÷ 750⌉ = 1,229 tokens. A 3840×2160 image is scaled to 1568×882 before the formula is applied, costing ⌈1,382,976 ÷ 750⌉ = 1,844 tokens.

How does Gemini 1.5 Pro count image tokens?

Gemini 1.5 Pro divides images into 768×768 tiles and charges 258 tokens per tile. The tile count is ⌈width ÷ 768⌉ × ⌈height ÷ 768⌉ with a minimum of one tile. A 1280×720 image uses ⌈1280÷768⌉ × ⌈720÷768⌉ = 2 × 1 = 2 tiles = 516 tokens. A 1920×1080 image uses 3 × 2 = 6 tiles = 1,548 tokens. Unlike OpenAI and Anthropic, Gemini does not resize the image before tiling — the original dimensions determine the tile count.

How does Gemini tokenise video?

Gemini 1.5 Pro samples video at 1 frame per second and encodes each frame as 258 tokens (equivalent to a 768×768 image tile). The audio track is encoded separately at 32 tokens per second. A 60-second video therefore costs (60 × 258) + (60 × 32) = 15,480 + 1,920 = 17,400 tokens. For very short clips or high-frame-rate content, note that Gemini always samples at 1 fps regardless of the original frame rate.

Can GPT-4o and Claude process video files?

Neither GPT-4o nor Claude 3.5 Sonnet accepts native video file uploads via their standard vision APIs. To process video with these models, you must extract individual frames and send each frame as a separate image in the messages array. Token costs are then calculated per frame using the standard image tokenisation formulas. This makes Gemini 1.5 Pro the most practical choice for native video understanding without frame extraction preprocessing.
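As a rough sizing aid for that frame-extraction path, the cost can be estimated like this. A sketch under stated assumptions — the sampling rate and per-frame token count are choices you make; the 85-token default assumes GPT-4o low-detail frames, as described earlier:

```python
import math

def frame_batch_tokens(duration_s: float, sampled_fps: float = 1.0,
                       tokens_per_frame: int = 85) -> int:
    """Token cost of sending extracted video frames as individual images.

    Defaults assume 1 frame/s sampling and GPT-4o low detail (85 tokens/frame);
    substitute a high-detail per-frame count for detail-sensitive tasks.
    """
    frames = math.ceil(duration_s * sampled_fps)
    return frames * tokens_per_frame

print(frame_batch_tokens(60))            # 60 low-detail frames -> 5100 tokens
print(frame_batch_tokens(60, 1, 1105))   # high-detail HD frames cost far more
```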

How does Gemini tokenise audio?

Gemini 1.5 Pro tokenises audio at a flat rate of 32 tokens per second. A 5-minute (300-second) audio file costs 9,600 tokens, which at Gemini's $1.25 per million input tokens costs approximately $0.012. This makes audio understanding relatively inexpensive compared to sending equivalent information as transcribed text for long files. Claude 3.5 Sonnet does not currently support audio input; GPT-4o audio requires the separate gpt-4o-audio-preview endpoint.

Do these token counts include text prompt tokens?

No. The estimator shows only the tokens consumed by the media payload itself. Your text system prompt, user message, and any tool definitions are charged separately at the model's standard input token rate. To calculate total API cost, add your text prompt token count to the media token count shown here, then multiply the combined total by the model's input price per million tokens.
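The addition described above is straightforward to express (illustrative helper, not a provider API):

```python
def total_input_cost(media_tokens: int, text_tokens: int,
                     price_per_million: float) -> float:
    """Combined input cost: media payload plus text prompt at the input rate."""
    return (media_tokens + text_tokens) / 1_000_000 * price_per_million

# e.g. one 1280x720 image on Gemini (516 tokens) with a 400-token text prompt
print(f"${total_input_cost(516, 400, 1.25):.6f}")  # $0.001145
```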

Which model is cheapest for image processing?

At current pricing (May 2025), Gemini 1.5 Pro is cheapest per token at $1.25/M input tokens, followed by GPT-4o at $2.50/M and Claude 3.5 Sonnet at $3.00/M. However, the number of tokens per image also varies significantly by model and resolution. For a 1280×720 HD image: GPT-4o high detail uses 1,105 tokens ($0.0028), Claude uses 1,229 tokens ($0.0037), and Gemini uses 516 tokens ($0.0006). Gemini is typically both cheapest per token and uses the fewest tokens at moderate resolutions.
