How LLMs work (context, tokens, and cost)

Large Language Models (LLMs) can feel like they “remember” you. In reality, most LLMs are stateless: every time you press Send, the model treats it like a brand-new request.
The only reason a chat feels continuous is that the app copies some (or all) of the previous messages and includes them again as input.
This matters for two practical reasons:
- Cost: more text included in the request means more tokens billed.
- Quality: irrelevant old context can confuse the model (“context bleed”).
The key idea: every message is a fresh request
An LLM does not carry a persistent memory of your previous turns. Instead, the app sends a prompt that typically includes:
- instructions (sometimes hidden “system” or “developer” text)
- your current message
- parts of earlier messages so the model can stay consistent
From the model’s perspective, it’s always the first time seeing that combined text.
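A minimal sketch of what one such request can look like, assuming a generic chat-style API that accepts a list of role/content messages (the exact field names and schema vary by provider):

```python
# Rough sketch of the payload a chat app might assemble for one turn.
# The "role"/"content" fields follow a common chat-message convention,
# but the exact schema depends on the provider you use.
request_payload = {
    "model": "some-chat-model",  # placeholder model name
    "messages": [
        # Hidden instructions the app always prepends.
        {"role": "system", "content": "You are a careful translation assistant."},
        # Earlier turns, copied back in so the model can stay consistent.
        {"role": "user", "content": "Translate this paragraph into German: ..."},
        {"role": "assistant", "content": "Hier ist die Übersetzung: ..."},
        # The message you just typed.
        {"role": "user", "content": "Now make the tone more formal."},
    ],
}
# From the model's point of view, this whole list is one fresh input.
```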
So why does it feel like memory?
Because the chat app is doing the remembering for you.
- When you continue a conversation, the app often sends the whole conversation so far (or a large chunk of it) every time.
- If the conversation is long, you’re repeatedly paying for the same text.
This is not a “scam”; it’s simply how LLM APIs work. The model needs the relevant context included in the prompt to respond appropriately.
Tokens: the unit that drives cost and limits
Models don’t read text as characters or words. They read tokens (roughly: word pieces). Both your input and the model’s output consume tokens.
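To get a feel for tokens, you can run a tokenizer locally. The sketch below uses the open-source `tiktoken` library (which implements encodings used by some OpenAI models); other providers use different tokenizers, so counts vary by model:

```python
import tiktoken  # pip install tiktoken

# cl100k_base is one widely used encoding; your model may use another.
enc = tiktoken.get_encoding("cl100k_base")

text = "Large Language Models read tokens, not words."
token_ids = enc.encode(text)

print(len(text.split()), "words")  # 7 words
print(len(token_ids), "tokens")    # usually a few more tokens than words
```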
In simplified form, the tokens you pay for per request are roughly: input tokens + output tokens.
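That formula looks harmless until you remember that each new turn resends the history. A back-of-the-envelope sketch (the per-turn numbers are made up purely for illustration):

```python
# Illustrative only: assume every user message is ~100 tokens, every reply
# is ~200 tokens, and the app resends the full history on each turn.
system_prompt = 150          # hypothetical instruction block, sent every time
user_msg, reply = 100, 200

history = 0
for turn in range(1, 6):
    input_tokens = system_prompt + history + user_msg
    output_tokens = reply
    print(f"turn {turn}: input={input_tokens}, output={output_tokens}")
    history += user_msg + reply  # this turn becomes part of the next turn's input

# Output stays flat at 200 per turn, but input grows with the whole history:
# turn 1: input=250 ... turn 5: input=1450 (the same old text, billed again).
```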
The context window (why long chats eventually hit a wall)
Every model has a maximum context window: the largest number of tokens (input and generated output combined) it can handle in a single request.
If a chat becomes too long:
- the app must truncate older messages
- or it must summarize them
Either way, older details may disappear from the prompt, which is why the model can “forget” something you discussed earlier.
Image placeholder (medium): Diagram — “Context window: older messages fall off the left side as tokens grow”.
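One common way apps handle this is to keep the system instructions and drop the oldest turns until the prompt fits. A minimal sketch, assuming a crude "about four characters per token" estimate (a real app would use the model's actual tokenizer):

```python
def rough_token_count(text: str) -> int:
    # Very rough heuristic for English text; real apps count with the model's tokenizer.
    return max(1, len(text) // 4)

def fit_to_window(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep the first (system) message and drop the oldest turns until the prompt fits."""
    system, history = messages[0], messages[1:]
    while history and sum(rough_token_count(m["content"]) for m in [system, *history]) > max_tokens:
        history.pop(0)  # the oldest turn "falls off the left side"
    return [system, *history]
```

Summarizing works the same way, except the dropped turns are replaced by one much shorter message instead of disappearing entirely.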
When to start a fresh conversation (and why it helps)
Starting a new chat is often the best move when:
- you switch to a new topic (client A → client B, legal → marketing, EN→DE → EN→FR)
- you want a different role or constraint (proofreader → terminologist → strict JSON output)
- your current conversation contains lots of dead ends, experiments, or conflicting instructions
Benefits of starting fresh:
- Cleaner context → fewer contradictions and less “bleed” from old instructions
- Lower cost → fewer repeated tokens sent on every message
- More predictable behavior → the model has less irrelevant baggage
When continuing a conversation is the right choice
You don’t need to fear long conversations. Continuing is great when:
- you’re iterating on the same text (draft → revise → shorten → adapt tone)
- you need consistency across multiple steps
- you’re building on decisions made earlier (glossary, style choices, constraints)
Think of it like working in the same document: keep the conversation if the context is still useful.
How to keep context clean without restarting
A few habits keep quality high and costs reasonable:
- Use a “state recap” message: periodically summarize the current goal, constraints, and key decisions in a few bullet points.
- Pin the essentials: keep stable requirements short and consistent (tone, glossary, do/don’t rules).
- Don’t paste the same giant reference text repeatedly: include it once, then refer back to it.
- Prune intentionally: if you see the model getting confused, restate only the relevant information and drop the rest (see the sketch after the tip below).
TIP
If the model starts making mistakes that look like it’s mixing tasks, that’s a hint your prompt contains competing instructions. Either recap cleanly or start a new chat.
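A minimal sketch of the "recap plus prune" habit, assuming the same message-list format as above; the recap text and the number of messages kept are just illustrative choices:

```python
def recap_and_prune(messages: list[dict], recap: str, keep_last: int = 4) -> list[dict]:
    """Keep the system message, replace old turns with a short recap,
    and carry over only the most recent messages verbatim."""
    system, history = messages[0], messages[1:]
    recent = history[-keep_last:]  # the part you are still iterating on
    recap_msg = {"role": "user", "content": f"Recap of the task so far:\n{recap}"}
    return [system, recap_msg, *recent]

# Example recap text (write it yourself; a few bullets are enough):
recap = (
    "- Goal: formal DE translation of the product page\n"
    "- Glossary: 'dashboard' stays 'Dashboard' (do not translate)\n"
    "- Decision: address the reader as 'Sie'"
)
```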
“Prompt caching” (why repeated text can be cheaper)
Many LLM providers support prompt caching: if the beginning of your prompt is identical to a previous request, the provider may reuse cached computation for that prefix.
In practice, this means:
- repeating the same long, unchanged instructions may be faster and/or cheaper than sending entirely new text
- changing even small parts near the start of the prompt can reduce cache hits
Image placeholder (medium): Diagram — “Cached prefix vs new suffix tokens”.
Notes:
- caching is provider- and plan-dependent
- “cheaper” usually applies to cached input tokens, not output tokens
- even with caching, a very long prompt still increases the risk of irrelevant context
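A sketch of how to structure requests so the stable part stays byte-identical across turns; whether and how much of it gets cached is entirely up to the provider:

```python
# Keep long, unchanging material at the front and never edit it between turns,
# so consecutive requests share the same prefix. Only the tail changes.
STABLE_INSTRUCTIONS = "You are a strict proofreader. Follow the style guide below.\n..."
STABLE_REFERENCE = "STYLE GUIDE:\n..."  # paste large reference text once, verbatim

def build_messages(history: list[dict], new_user_message: str) -> list[dict]:
    return [
        {"role": "system", "content": STABLE_INSTRUCTIONS + "\n\n" + STABLE_REFERENCE},
        *history,  # grows at the end, so the prefix stays identical
        {"role": "user", "content": new_user_message},
    ]
```

The point is only the ordering: anything that changes on every turn (the newest message, per-request details) should come after the stable block, not before it.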
Quick rules of thumb
- If you changed the topic: new chat.
- If you’re still iterating on the same task: continue, but recap occasionally.
- If you care about cost: keep the prompt short and stable, avoid repeating large texts, and watch how much history you carry forward.