[Bug/Question] Significant Image Token Count Discrepancy between Vertex AI and Gemini API backends #1907

@dev7608

Description


I have noticed a significant discrepancy in token usage calculation for the same image when using the google-genai SDK, depending on whether the backend is Vertex AI or the Gemini API (AI Studio).

Using the same model (gemini-2.5-flash) and the same input image (700×1003 pixels), the token counts differ drastically:

  • Gemini API: ~258 tokens.
  • Vertex AI: ~1806 tokens.

Environment details

  • SDK version: google-genai 1.12.1
  • Python version: 3.10.16

Reproduction Code

import json
import os
import rich
from google import genai
from google.oauth2 import service_account
from PIL import Image

def get_gemini_client(vertexai: bool = False):
    if vertexai:
        # Assumes a service-account credentials.json is present
        with open("credentials.json") as f:
            credentials = json.load(f)
        return genai.Client(
            vertexai=True,
            project=os.getenv("GOOGLE_CLOUD_PROJECT"),
            location="us-central1",  # or "global"
            credentials=service_account.Credentials.from_service_account_info(
                credentials,
                scopes=["https://www.googleapis.com/auth/cloud-platform"],
            ),
        )
    else:
        return genai.Client(api_key=os.getenv("GEMINI_API_KEY"))

if __name__ == "__main__":
    # Test image size: (700, 1003)
    image = Image.open("test.png") 
    print(f"Image size: {image.size}")

    model = "gemini-2.5-flash"

    print("-" * 100)
    print("Vertex AI:")
    vertex_client = get_gemini_client(vertexai=True)
    vertex_count = vertex_client.models.count_tokens(model=model, contents=[image])
    print(f"Token count (Vertex): {vertex_count}")
    
    vertex_response = vertex_client.models.generate_content(
        model=model,
        contents=[image, "Describe the image in detail."],
    )
    print("Vertex AI usage metadata:")
    rich.print(vertex_response.usage_metadata.prompt_tokens_details)

    print("-" * 100)
    print("Gemini API:")
    gemini_client = get_gemini_client(vertexai=False)
    gemini_count = gemini_client.models.count_tokens(model=model, contents=[image])
    print(f"Token count (Gemini API): {gemini_count}")

    gemini_response = gemini_client.models.generate_content(
        model=model,
        contents=[image, "Describe the image in detail."],
    )
    print("Gemini API usage metadata:")
    rich.print(gemini_response.usage_metadata.prompt_tokens_details)
    print("-" * 100)

Logs / Output

Image size: (700, 1003)
----------------------------------------------------------------------------------------------------
Vertex AI:
Token count (Vertex): total_tokens=1806 cached_content_token_count=None
Vertex AI usage metadata:
[
    ModalityTokenCount(modality=<MediaModality.IMAGE: 'IMAGE'>, token_count=1806), 
    ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=6)
]
----------------------------------------------------------------------------------------------------
Gemini API:
Token count (Gemini API): total_tokens=259 cached_content_token_count=None
Gemini API usage metadata:
[
    ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=7), 
    ModalityTokenCount(modality=<MediaModality.IMAGE: 'IMAGE'>, token_count=258)
]
----------------------------------------------------------------------------------------------------

Analysis

According to the official documentation on token calculation for Gemini 2.0/2.5, an image costs a flat 258 tokens if both dimensions are <= 384 pixels; larger images are cropped into 768×768-pixel tiles, each costing 258 tokens (a rough sketch of this rule follows the list below).

For an image of size 700×1003:

  • Vertex AI returns 1806 tokens. Since 1806 / 258 = 7, Vertex appears to be breaking the image into 7 tiles.
  • Gemini API returns 258 tokens, which implies it is treating the image as a single tile, possibly after heavy downscaling or via a different calculation.
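For reference, here is a minimal sketch of the documented rule as I read it. The ceil-based tile count is my assumption; the documentation does not spell out any scaling or cropping step, so this may not match either backend exactly.

import math

def estimate_image_tokens(width: int, height: int, tokens_per_tile: int = 258) -> int:
    # Documented rule: images with both dimensions <= 384 px cost a flat 258 tokens.
    if width <= 384 and height <= 384:
        return tokens_per_tile
    # Assumption: larger images are cropped into 768x768 tiles, 258 tokens each.
    tiles = math.ceil(width / 768) * math.ceil(height / 768)
    return tiles * tokens_per_tile

print(estimate_image_tokens(700, 1003))  # 1 x 2 = 2 tiles -> 516 tokens

Notably, this naive reading predicts 516 tokens, matching neither backend, so each one presumably applies its own scaling or cropping before tiling.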

Expected Behavior: I expect consistency between the two backends for the same model and image. If the tiling logic is standard, both should return similar token counts. Currently, using Vertex AI is ~7x more expensive for the exact same inference request.

Could you clarify if this is intended behavior, a bug in the Vertex implementation of the tokenizer, or an issue with how the SDK handles images for different clients?
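One way to narrow this down (a sketch reusing get_gemini_client from the reproduction code above): bypass the SDK's PIL handling by passing the raw image bytes as an explicit Part. If the counts still differ, the discrepancy is server-side rather than in the SDK's image conversion.

from google.genai import types

with open("test.png", "rb") as f:
    png_bytes = f.read()

# Send identical bytes to both backends so SDK-side image conversion is ruled out.
image_part = types.Part.from_bytes(data=png_bytes, mime_type="image/png")
for name, vertexai in [("Vertex AI", True), ("Gemini API", False)]:
    client = get_gemini_client(vertexai=vertexai)
    count = client.models.count_tokens(model="gemini-2.5-flash", contents=[image_part])
    print(f"{name}: {count.total_tokens}")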

Metadata

Labels: api: gemini-api, api: vertex-ai, priority: p2, type: bug
