[Bug/Question] Significant Image Token Count Discrepancy between Vertex AI and Gemini API backends #1907

@dev7608

Description


I have noticed a significant discrepancy in token usage calculation for the same image when using the google-genai SDK, depending on whether the backend is Vertex AI or the Gemini API (AI Studio).

Using the same model (gemini-2.5-flash) and the same input image (700×1003 pixels), the token counts differ drastically:

  • Gemini API: ~258 tokens.
  • Vertex AI: ~1806 tokens.

Environment details

  • SDK version: google-genai 1.12.1
  • Python version: 3.10.16

Reproduction Code

import json
import os
import rich
from google import genai
from google.oauth2 import service_account
from PIL import Image

def get_gemini_client(vertexai: bool = False):
    if vertexai:
        # Assumes a service-account credentials.json is present
        with open("credentials.json") as f:
            credentials = json.load(f)
        return genai.Client(
            vertexai=True,
            project=os.getenv("GOOGLE_CLOUD_PROJECT"),
            location="us-central1",  # or "global"
            credentials=service_account.Credentials.from_service_account_info(
                credentials,
                scopes=["https://www.googleapis.com/auth/cloud-platform"],
            ),
        )
    else:
        return genai.Client(api_key=os.getenv("GEMINI_API_KEY"))

if __name__ == "__main__":
    # Test image size: (700, 1003)
    image = Image.open("test.png") 
    print(f"Image size: {image.size}")

    model = "gemini-2.5-flash"

    print("-" * 100)
    print("Vertex AI:")
    vertex_client = get_gemini_client(vertexai=True)
    vertex_count = vertex_client.models.count_tokens(model=model, contents=[image])
    print(f"Token count (Vertex): {vertex_count}")
    
    vertex_response = vertex_client.models.generate_content(
        model=model,
        contents=[image, "Describe the image in detail."],
    )
    print("Vertex AI usage metadata:")
    rich.print(vertex_response.usage_metadata.prompt_tokens_details)

    print("-" * 100)
    print("Gemini API:")
    gemini_client = get_gemini_client(vertexai=False)
    gemini_count = gemini_client.models.count_tokens(model=model, contents=[image])
    print(f"Token count (Gemini API): {gemini_count}")

    gemini_response = gemini_client.models.generate_content(
        model=model,
        contents=[image, "Describe the image in detail."],
    )
    print("Gemini API usage metadata:")
    rich.print(gemini_response.usage_metadata.prompt_tokens_details)
    print("-" * 100)

Logs / Output

Image size: (700, 1003)
----------------------------------------------------------------------------------------------------
Vertex AI:
Token count (Vertex): total_tokens=1806 cached_content_token_count=None
Vertex AI usage metadata:
[
    ModalityTokenCount(modality=<MediaModality.IMAGE: 'IMAGE'>, token_count=1806), 
    ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=6)
]
----------------------------------------------------------------------------------------------------
Gemini API:
Token count (Gemini API): total_tokens=259 cached_content_token_count=None
Gemini API usage metadata:
[
    ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=7), 
    ModalityTokenCount(modality=<MediaModality.IMAGE: 'IMAGE'>, token_count=258)
]
----------------------------------------------------------------------------------------------------

Analysis

According to the official documentation on token calculation for Gemini 2.0/2.5, an image costs a flat 258 tokens if both dimensions are <= 384 pixels; larger images are cropped into 768×768-pixel tiles, each costing 258 tokens (a rough sketch of this rule follows the list below).

For an image of size 700×1003:

  • Vertex AI returns 1806 tokens. Since 1806 / 258 = 7, Vertex appears to be breaking the image into 7 tiles.
  • Gemini API returns 258 tokens, which implies it is treating the image as a single tile, possibly after heavy downscaling or via a different calculation.
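For reference, here is a minimal sketch of the documented rule as I read it. The ceil-based tile count is my assumption; the documentation does not spell out any scaling or cropping step, so this may not match either backend exactly.

import math

def estimate_image_tokens(width: int, height: int, tokens_per_tile: int = 258) -> int:
    # Documented rule: images with both dimensions <= 384 px cost a flat 258 tokens.
    if width <= 384 and height <= 384:
        return tokens_per_tile
    # Assumption: larger images are cropped into 768x768 tiles, 258 tokens each.
    tiles = math.ceil(width / 768) * math.ceil(height / 768)
    return tiles * tokens_per_tile

print(estimate_image_tokens(700, 1003))  # 1 x 2 = 2 tiles -> 516 tokens

Notably, this naive reading predicts 516 tokens, matching neither backend, so each one presumably applies its own scaling or cropping before tiling.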

Expected Behavior: I expect consistency between the two backends for the same model and image. If the tiling logic is standard, both should return similar token counts. Currently, using Vertex AI is ~7x more expensive for the exact same inference request.

Could you clarify if this is intended behavior, a bug in the Vertex implementation of the tokenizer, or an issue with how the SDK handles images for different clients?
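One way to narrow this down (a sketch reusing get_gemini_client from the reproduction code above): bypass the SDK's PIL handling by passing the raw image bytes as an explicit Part. If the counts still differ, the discrepancy is server-side rather than in the SDK's image conversion.

from google.genai import types

with open("test.png", "rb") as f:
    png_bytes = f.read()

# Send identical bytes to both backends so SDK-side image conversion is ruled out.
image_part = types.Part.from_bytes(data=png_bytes, mime_type="image/png")
for name, vertexai in [("Vertex AI", True), ("Gemini API", False)]:
    client = get_gemini_client(vertexai=vertexai)
    count = client.models.count_tokens(model="gemini-2.5-flash", contents=[image_part])
    print(f"{name}: {count.total_tokens}")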

Metadata

Labels: api: gemini-api, api: vertex-ai, priority: p2, type: bug
