Description
What I am trying to do
I am building a voice-based conversational agent using the Gemini Developer Live API (ai.google.dev) with the Python SDK (googleapis/python-genai).
The agent streams microphone audio, enables input/output transcription, and tracks token usage via usage_metadata.
I am trying to understand why promptTokenCount is very high (300–500 tokens) even when the user input is only a simple greeting such as “hello”.
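For reference, here is a trimmed-down sketch of my setup (the model name and audio parameters are placeholders; the full code is in the attached files):

```python
import asyncio

from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

MODEL = "gemini-2.0-flash-live-001"  # placeholder; real model name is in the attached files

CONFIG = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    # Transcribe both the user's microphone audio and the model's spoken reply.
    input_audio_transcription=types.AudioTranscriptionConfig(),
    output_audio_transcription=types.AudioTranscriptionConfig(),
    # No system_instruction is set (empty system prompt, as noted below).
)


async def main() -> None:
    async with client.aio.live.connect(model=MODEL, config=CONFIG) as session:
        # Stand-in for a real microphone chunk: 100 ms of 16-bit PCM silence.
        pcm_chunk = b"\x00" * 3200
        await session.send_realtime_input(
            audio=types.Blob(data=pcm_chunk, mime_type="audio/pcm;rate=16000")
        )
        async for message in session.receive():
            # usage_metadata arrives on server messages at turn boundaries.
            if message.usage_metadata:
                u = message.usage_metadata
                print("promptTokenCount:", u.prompt_token_count)
                print("responseTokenCount:", u.response_token_count)
                print("totalTokenCount:", u.total_token_count)


asyncio.run(main())
```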
What I expected
For a very short user input (e.g., “hello”), I expected:
- promptTokenCount to be relatively small
- Growth across turns to roughly correlate with the visible conversation history
What actually happens
Even with a minimal input like “hello”, promptTokenCount is already several hundred tokens on the first turn and continues to grow across turns (the growth may be due to history, but why does a single “hello” cost 334 tokens?).
Example output from my session:
┌──────────────────────────────────────────────────────────────┐
│ 📊 Turn # 1 │
├──────────────────────────────────────────────────────────────┤
│ API UsageMetadata (raw values): │
│ promptTokenCount: 334 │
│ responseTokenCount: 56 │
│ thoughtsTokenCount: 44 │
│ totalTokenCount: 390 │
└──────────────────────────────────────────────────────────────┘
🎤 You: Hello
🤖 Gemini: Hello how are you?
Later in the same session, I said "My name is Vedant" and promptTokenCount rose to 432. Part of the increase is history, but 432 tokens for this single short utterance still seems high. Is audio also counted here? The modality breakdown shows it is almost entirely text:
prompt_token_count=432
response_token_count=78
thoughts_token_count=47
total_token_count=510
prompt_tokens_details=[
TEXT: 428 tokens
AUDIO: 4 tokens
]
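(The breakdown above is printed by iterating usage_metadata.prompt_tokens_details; a minimal sketch, assuming the ModalityTokenCount fields from google.genai.types:)

```python
def print_prompt_breakdown(usage) -> None:
    """usage is a types.UsageMetadata from a LiveServerMessage."""
    print(f"prompt_token_count={usage.prompt_token_count}")
    for detail in usage.prompt_tokens_details or []:
        # Each entry is a types.ModalityTokenCount with .modality and .token_count
        print(f"  {detail.modality}: {detail.token_count} tokens")
```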
And the next turn:
┌──────────────────────────────────────────────────────────────┐
│ 📊 Turn # 2 │
├──────────────────────────────────────────────────────────────┤
│ API UsageMetadata (raw values): │
│ promptTokenCount: 432 │
│ responseTokenCount: 78 │
│ thoughtsTokenCount: 47 │
│ totalTokenCount: 510 │
└──────────────────────────────────────────────────────────────┘
This happens even though the user-visible input is just a short greeting.
NOTE: I passed an empty system prompt; nothing else was supplied.
What I understand so far
- promptTokenCount is per request / per turn, not cumulative (I verify this with the delta logging sketched below).
- In Live API sessions, the prompt appears to include:
  - Prior conversation context
  - Internal session/state wrappers
  - Role formatting and safety framing
  - Audio/transcription-related metadata
- These internal components are not visible, but still count toward promptTokenCount. Is this true?
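To sanity-check the first point, I log the turn-over-turn growth rather than the raw count; a minimal sketch (the helper class is mine, not from the SDK):

```python
class TokenDeltaTracker:
    """Hypothetical helper: track how much promptTokenCount grows each turn.

    If the count were cumulative, the delta would roughly re-add the whole
    history every turn; if it is per-request, the delta should track only
    the new content added that turn.
    """

    def __init__(self) -> None:
        self.last_prompt_tokens = 0

    def record(self, usage) -> int:
        """usage is a types.UsageMetadata; returns growth since the last turn."""
        current = usage.prompt_token_count or 0
        delta = current - self.last_prompt_tokens
        self.last_prompt_tokens = current
        return delta
```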
Questions
- Is promptTokenCount a cumulative count over the whole previous conversation, or does it cover only the single current turn?
- Is this level of prompt overhead expected behavior for Live (bidiGenerateContent) sessions?
- Is there any way to inspect or estimate what contributes to the non-user-visible prompt tokens? (My rough attempt is sketched below.)
- Are there recommended configurations to reduce prompt token usage for simple conversational turns (e.g., greetings)?
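For comparison, I tokenized just the user-visible transcript offline with count_tokens (a non-Live endpoint, so this is only a rough lower bound; the model name here is a placeholder):

```python
from google import genai

client = genai.Client()

# Rough lower bound: count tokens for only the user-visible transcript text.
# The Live session's real prompt evidently contains much more than this.
transcript = "Hello\nHello how are you?\nMy name is Vedant"
resp = client.models.count_tokens(
    model="gemini-2.0-flash",  # placeholder text model used just for estimation
    contents=transcript,
)
print(resp.total_tokens)  # a handful of tokens, vs. promptTokenCount=432
```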
I have also attached the code files below: