Your First Production Bedrock Feature Is Five Decisions, Not Five Steps
The gap between “I called an LLM in a notebook” and “this runs in production” is wider than the tutorials admit. The notebook version is three lines and works on the first try. The production version looks like the same recipe with a few more lines. It’s actually a sequence of decisions, and that’s what trips people.
Five of them: region, model, API, IAM scope, and maxTokens. Each has a default you
can defend. Most of the walls people hit on a first Bedrock call are just a default someone
else chose — the tutorial, or a field left unset — that’s wrong for your case.
Get the five right and the rest is typing.
One fact sits under all of them: on Bedrock, a model call is an AWS API call. There is no separate API key and no vendor dashboard. The model runs in your account, in your region, signed with your AWS credentials and metered against your AWS quotas. Every decision below is an AWS decision underneath, which is why the failures don’t look like LLM problems.
This post walks all five against a live account. Code is TypeScript
(@aws-sdk/client-bedrock-runtime), with the Python equivalent where it differs. Every output
and error message shown here came from a real call.
Decision 1: which region?
Default: the region the rest of your stack already lives in. This post uses us-east-1.
Bedrock is multi-region, but the model catalog is not uniform across regions, and enabling a
model in us-east-1 does nothing for us-west-2. The trap is that a missing model doesn’t
announce itself — it shows up as an empty list or a ResourceNotFound, not a “switch regions”
hint. Pick one region, put your model and your app there, and only split later when you have a
concrete latency or data-residency reason. For a first feature, picking is the whole decision.
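One way to make "pick one region" stick is to resolve the region in exactly one place. The following is a minimal sketch, not a prescribed pattern: `resolveBedrockRegion` and the `BEDROCK_REGION` override are hypothetical names invented for illustration, `AWS_REGION` is the SDK's standard variable, and the `us-east-1` fallback is simply this post's default.

```typescript
// Hypothetical helper: resolve the Bedrock region once, in one place.
// BEDROCK_REGION is an invented override name; AWS_REGION is the standard SDK variable.
type EnvVars = Record<string, string | undefined>;

function resolveBedrockRegion(envVars: EnvVars, fallback: string = 'us-east-1'): string {
  const candidate = envVars.BEDROCK_REGION ?? envVars.AWS_REGION ?? fallback;
  const region = candidate.trim();
  // Fail fast on a malformed value instead of at the first model call.
  if (!/^[a-z]{2}(-[a-z]+)+-\d+$/.test(region)) {
    throw new Error(`Not a valid AWS region name: "${region}"`);
  }
  return region;
}
```

Passing the result to every client constructor keeps the model and the app in the same region until there is a concrete reason to split them.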
Decision 2: which model?
Default: Claude Haiku 4.5, addressed through its us. inference profile.
For a first feature you want fast, cheap, and current — not the biggest model on the menu. Haiku is cheap enough to not think about while you’re developing. Current is the load-bearing word. Two traps hide here, and both look like bugs unrelated to which model you picked.
Trap one: the model ID isn’t the model ID. Newer Claude models cannot be invoked on-demand
with their bare model ID. You have to use a cross-region inference profile, whose ID is the
model ID with a geography prefix (us., eu., …). Use the bare ID and Bedrock hands you this:

    ValidationException: Invocation of model ID anthropic.claude-haiku-4-5-20251001-v1:0 with
    on-demand throughput isn't supported. Retry your request with the ID or ARN of an
    inference profile that contains this model.

It’s a ValidationException, not an access error, so people burn an afternoon inspecting
their request payload instead of their model ID. The profile lets Bedrock route your request
across regions for capacity; for current Claude models on-demand, it’s the only address
that works.
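Since the profile ID is the bare model ID with a geography prefix, the mapping can be written down once. A minimal sketch, assuming only the two prefixes named above (other geographies exist, so the list is deliberately incomplete); `toInferenceProfileId` and `hasGeoPrefix` are hypothetical helper names, and whether a given profile actually exists in your account still has to be checked against Bedrock.

```typescript
// Geography prefixes named in the text; treat this list as incomplete.
const GEO_PREFIXES = ['us', 'eu'] as const;
type GeoPrefix = (typeof GEO_PREFIXES)[number];

// True when the ID already starts with a known geography prefix such as "us.".
function hasGeoPrefix(modelId: string): boolean {
  return GEO_PREFIXES.some((prefix) => modelId.startsWith(`${prefix}.`));
}

// Hypothetical helper: turn a bare model ID into its inference-profile ID.
// An ID that already carries a prefix is returned unchanged.
function toInferenceProfileId(modelId: string, geo: GeoPrefix): string {
  return hasGeoPrefix(modelId) ? modelId : `${geo}.${modelId}`;
}

const haikuProfileId = toInferenceProfileId('anthropic.claude-haiku-4-5-20251001-v1:0', 'us');
console.log(haikuProfileId); // us.anthropic.claude-haiku-4-5-20251001-v1:0
```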
Trap two: the legacy wall. Say you learn the inference-profile trick from trap one and
apply it to an older, cheaper-looking model — us.anthropic.claude-3-5-haiku-20241022-v1:0.
The profile ID is right this time; the problem is the model behind it has been retired:

    ResourceNotFoundException: ... Model is marked by provider as Legacy and you have not been
    actively using the model in the last 30 days. Please upgrade to an active model.

Both traps point the same direction: start on a current model, addressed through its inference profile. That’s what makes “which model” a real decision.
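Because neither error names its real cause, it can help to translate them at the call site. The sketch below is one hedged way to do that: `modelErrorHint` is a hypothetical helper, and it matches on substrings taken from the two messages quoted above, which is brittle if AWS rewords them.

```typescript
// Hypothetical helper: map the two model-addressing errors quoted above to a next step.
// Substring matching is brittle; the fragments come from the messages shown in this post.
function modelErrorHint(errorName: string, errorMessage: string): string | undefined {
  if (errorName === 'ValidationException' && errorMessage.includes('inference profile')) {
    return 'Use the inference profile ID (for example the us. prefix), not the bare model ID.';
  }
  if (errorName === 'ResourceNotFoundException' && errorMessage.includes('Legacy')) {
    return 'The model is legacy; switch to a current model and its inference profile.';
  }
  return undefined; // Not one of the two traps; inspect the request itself.
}
```

In a `catch` block this would typically be called as `modelErrorHint(err.name, err.message)` once `err` has been narrowed to an `Error`, on the assumption that the SDK sets `name` to the exception name.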
Decision 3: Converse or InvokeModel?
Default: Converse. It’s Bedrock’s unified, model-agnostic API — the request and response shape is identical whichever provider you call, so switching models later is a one-line change. Here’s the whole first call:

    import {
      BedrockRuntimeClient,
      ConverseCommand,
    } from '@aws-sdk/client-bedrock-runtime';

    // Newer Claude models are invoked through a cross-region *inference profile*,
    // not the bare model ID. The `us.` prefix is the US inference profile.
    const MODEL_ID = 'us.anthropic.claude-haiku-4-5-20251001-v1:0';

    const client = new BedrockRuntimeClient({ region: 'us-east-1' });

    const response = await client.send(
      new ConverseCommand({
        modelId: MODEL_ID,
        system: [
          { text: 'You are a support triage assistant. Reply with exactly one word.' },
        ],
        messages: [
          {
            role: 'user',
            content: [
              { text: "Classify the sentiment of this message: 'My invoice is wrong again and no one has replied.'" },
            ],
          },
        ],
        inferenceConfig: { maxTokens: 100, temperature: 0.2 },
      }),
    );

    console.log('reply:', response.output?.message?.content?.[0]?.text);
    console.log('stopReason:', response.stopReason);
    console.log('usage:', JSON.stringify(response.usage));

Running it:

    reply: Negative
    stopReason: end_turn
    usage: {"inputTokens":42,"outputTokens":5,"totalTokens":47,"cacheReadInputTokens":0,"cacheWriteInputTokens":0}

That’s 47 tokens total. At current Claude Haiku 4.5 pricing on Bedrock, that is under $0.0001 per call. You can run this thousands of times before the cost is worth a thought.
The Python version is the same call through boto3:

    import json
    import boto3

    MODEL_ID = "us.anthropic.claude-haiku-4-5-20251001-v1:0"
    client = boto3.client("bedrock-runtime", region_name="us-east-1")

    response = client.converse(
        modelId=MODEL_ID,
        system=[{"text": "You are a support triage assistant. Reply with exactly one word."}],
        messages=[
            {
                "role": "user",
                "content": [{"text": "Classify the sentiment of this message: "
                                     "'My invoice is wrong again and no one has replied.'"}],
            }
        ],
        inferenceConfig={"maxTokens": 100, "temperature": 0.2},
    )

    print(response["output"]["message"]["content"][0]["text"])
    print(json.dumps(response["usage"]))

One boto3 gotcha that wastes an hour: inference uses the bedrock-runtime client, while
listing and managing models uses the plain bedrock client. Call converse on the wrong one
and the error won’t tell you that’s what you did.
So when do you reach for InvokeModel? The difference is the envelope. Converse gives you one
schema for every provider, plus multi-turn and tool use. InvokeModel makes you hand-write the
model’s native request body — for Claude, the Anthropic message format including the required
anthropic_version: "bedrock-2023-05-31" field. The breaking point is specificity: drop to
InvokeModel only when you need a provider-only parameter, a model that doesn’t yet speak the
Converse schema, or payloads pushing against the Bedrock service quotas. For a first feature,
that’s never. Start on Converse.
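To make the envelope difference concrete, here is roughly what hand-writing Claude's native body involves. This is a sketch under stated assumptions: `buildClaudeInvokeBody` and the `ClaudeNativeBody` type are hypothetical names, the body covers only a single text turn, and the `anthropic_version` value is the one quoted above.

```typescript
// Shape of a minimal Claude request body in the Anthropic message format.
interface ClaudeNativeBody {
  anthropic_version: 'bedrock-2023-05-31';
  max_tokens: number;
  temperature?: number;
  system?: string;
  messages: { role: 'user' | 'assistant'; content: { type: 'text'; text: string }[] }[];
}

// Hypothetical helper: build the JSON string that InvokeModel takes as its body.
function buildClaudeInvokeBody(systemPrompt: string, userText: string, maxTokens: number): string {
  const body: ClaudeNativeBody = {
    anthropic_version: 'bedrock-2023-05-31', // required by Claude on Bedrock
    max_tokens: maxTokens,
    temperature: 0.2,
    system: systemPrompt,
    messages: [{ role: 'user', content: [{ type: 'text', text: userText }] }],
  };
  return JSON.stringify(body);
}
```

Everything in this body is specific to one provider, which is the cost Converse removes: a different model family would need a different builder.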
Decision 4: how wide is the IAM?
Default: one action, two ARNs. Develop with admin if you must, but ship a policy that allows
exactly bedrock:InvokeModel on exactly the model you call. Because there’s no API key to scope
— the credentials are the scope — this is the only thing standing between your feature and
every model in the account. An inference profile needs permission on both the profile and the
foundation models it routes to:

    {
      "Version": "2012-10-17",
      "Statement": [{
        "Effect": "Allow",
        "Action": "bedrock:InvokeModel",
        "Resource": [
          "arn:aws:bedrock:*:ACCOUNT_ID:inference-profile/us.anthropic.claude-haiku-4-5-20251001-v1:0",
          "arn:aws:bedrock:*::foundation-model/anthropic.claude-haiku-4-5-20251001-v1:0"
        ]
      }]
    }

“My role has AdministratorAccess and the call still throws AccessDeniedException — how is
this an IAM problem?” It usually isn’t. Third-party models go through an AWS Marketplace
subscription: your first call to an Anthropic model triggers a background subscription, and
Anthropic also requires a one-time use-case form per account. Until that clears, the call fails
with an AccessDeniedException mentioning aws-marketplace:Subscribe, which reads like an IAM
bug even when your IAM is spotless. Wait it out once per account; it isn’t your policy.
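Returning to the policy itself: its two ARNs are both derived from the same model ID, so generating them together keeps them from drifting apart. A minimal sketch, assuming the ARN layout shown in the policy above; `buildInvokePolicy` is a hypothetical helper and `123456789012` in the usage line is a placeholder account ID.

```typescript
interface PolicyStatement {
  Effect: 'Allow';
  Action: string;
  Resource: string[];
}

interface PolicyDocument {
  Version: '2012-10-17';
  Statement: PolicyStatement[];
}

// Hypothetical helper: one action, two ARNs, both built from the same bare model ID.
function buildInvokePolicy(accountId: string, geo: string, bareModelId: string): PolicyDocument {
  if (!/^\d{12}$/.test(accountId)) {
    throw new Error('AWS account IDs are 12 digits');
  }
  return {
    Version: '2012-10-17',
    Statement: [
      {
        Effect: 'Allow',
        Action: 'bedrock:InvokeModel',
        Resource: [
          // The inference profile lives in your account.
          `arn:aws:bedrock:*:${accountId}:inference-profile/${geo}.${bareModelId}`,
          // The foundation model it routes to has no account ID in its ARN.
          `arn:aws:bedrock:*::foundation-model/${bareModelId}`,
        ],
      },
    ],
  };
}

const haikuPolicy = buildInvokePolicy('123456789012', 'us', 'anthropic.claude-haiku-4-5-20251001-v1:0');
console.log(JSON.stringify(haikuPolicy, null, 2));
```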
Decision 5: what do you set maxTokens to?
Default: the smallest number your feature actually needs. Never leave it unset.
This is the one the demo won’t show you — it only bites in production. Leave maxTokens unset and it
defaults to the model’s maximum — and Bedrock reserves input + maxTokens against your
token-per-minute quota at the start of every request, before a single token is generated.
Do the multiplication. A classifier whose answer is one word needs maybe 5 output tokens. Leave
maxTokens at a model default of 8192 and every call reserves 42 + 8192 = 8234 quota tokens
to produce 5, roughly 175 times the 47 tokens the call actually uses, all reserved up front. A
handful of those in flight starve the whole account into throttling while doing almost nothing.
Worse, Claude 3.7 and later burn roughly five quota tokens per output token, so your effective
throughput is lower than the raw quota number suggests. Cap maxTokens at what the feature needs
and the reservation shrinks with it.
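The reservation arithmetic is simple enough to keep in a helper. This is a rough model of the behavior described above, not an official formula: both function names are hypothetical, and the burndown rate of 5 is the figure quoted in the text, treated here as an assumption.

```typescript
// Tokens reserved against the tokens-per-minute quota when a request starts:
// input tokens plus the maxTokens ceiling, as described above.
function reservedQuotaTokens(inputTokens: number, maxTokens: number): number {
  return inputTokens + maxTokens;
}

// Rough estimate of quota consumed once the response is complete.
// burndownRate = 5 is the figure quoted for Claude 3.7 and later (an assumption here).
function settledQuotaTokens(inputTokens: number, outputTokens: number, burndownRate: number = 5): number {
  return inputTokens + outputTokens * burndownRate;
}

const uncappedReservation = reservedQuotaTokens(42, 8192); // 8234
const cappedReservation = reservedQuotaTokens(42, 100);    // 142
console.log(uncappedReservation, cappedReservation);       // 8234 142
console.log(settledQuotaTokens(42, 5));                    // 67
```

With the same 42-token prompt, capping maxTokens at 100 reserves 142 quota tokens per request instead of 8234, so roughly 58 capped requests fit in the room one uncapped request holds.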
Throttling here is routine: ThrottlingException (HTTP 429) happens
because Bedrock quotas are per-account, per-region, and shared across every app in that
account. The SDKs already retry throttled calls with exponential backoff (the JavaScript SDK
in standard mode by default, boto3 in legacy mode unless you configure otherwise). Adaptive
mode adds client-side rate limiting on top and handles bursty quota windows better, so turn
it on explicitly:

    const client = new BedrockRuntimeClient({
      region: 'us-east-1',
      retryMode: 'adaptive',
      maxAttempts: 5,
    });

The boto3 equivalent:

    import boto3
    from botocore.config import Config

    client = boto3.client(
        "bedrock-runtime",
        region_name="us-east-1",
        config=Config(retries={"mode": "adaptive", "max_attempts": 5}),
    )

After the five: the hygiene that keeps it shipped
The five decisions get you a call that works and won’t take the account down. Three habits keep it that way, and none is more than a few lines.
Structure the prompt — don’t concatenate it. Converse takes two distinct inputs: system
(who the model is and the rules it follows, stable across requests) and messages (the actual
conversation and the variable input). It’s tempting to mash everything into one user string.
Don’t — keeping the rules in system and the input in messages makes behavior more
predictable, stops user-supplied text from quietly overriding your instructions, and sets you up
for prompt caching and multi-turn later without a rewrite. The temperature knob lives in
inferenceConfig; keep it low (~0.2) for classification and extraction.
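The split between rules and input can be enforced by a small builder so that user text has only one place to go. A minimal sketch: `buildTriageRequest`, `TRIAGE_RULES`, and the local types are hypothetical names that mirror only the Converse fields used in this post, not the SDK's full input type.

```typescript
// Minimal local types mirroring the Converse fields used in this post.
interface PromptTextBlock {
  text: string;
}

interface TriageConverseRequest {
  modelId: string;
  system: PromptTextBlock[];
  messages: { role: 'user' | 'assistant'; content: PromptTextBlock[] }[];
  inferenceConfig: { maxTokens: number; temperature: number };
}

// Rules are fixed in code and never mixed with user input.
const TRIAGE_RULES = 'You are a support triage assistant. Reply with exactly one word.';

// Hypothetical helper: user-supplied text only ever lands in `messages`.
function buildTriageRequest(modelId: string, customerMessage: string): TriageConverseRequest {
  return {
    modelId,
    system: [{ text: TRIAGE_RULES }],
    messages: [
      {
        role: 'user',
        content: [{ text: `Classify the sentiment of this message: '${customerMessage}'` }],
      },
    ],
    inferenceConfig: { maxTokens: 100, temperature: 0.2 },
  };
}
```

The returned object follows the same shape as the ConverseCommand input in the first call, so a customer message that says "ignore your instructions" stays in the user turn and the system text is untouched.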
Log usage, not payloads. Emit response.usage — the token counts — to your metrics so cost
and latency are visible from day one. Logging full prompts and responses by default is both a
privacy liability and, at volume, a surprising storage bill.
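A structured log line built from the usage block alone might look like the following. This is a sketch, not a logging standard: `usageLogLine` and the `feature` field are hypothetical, and the usage fields are the ones shown in the sample output earlier in this post.

```typescript
// Subset of the usage block returned by Converse; fields are optional in case one is absent.
interface TokenUsage {
  inputTokens?: number;
  outputTokens?: number;
  totalTokens?: number;
}

// Hypothetical helper: one JSON log line per call, token counts and latency only.
// Prompts and responses are deliberately not accepted as arguments.
function usageLogLine(
  feature: string,
  modelId: string,
  usage: TokenUsage | undefined,
  latencyMs: number,
): string {
  return JSON.stringify({
    feature,
    modelId,
    latencyMs,
    inputTokens: usage?.inputTokens ?? 0,
    outputTokens: usage?.outputTokens ?? 0,
    totalTokens: usage?.totalTokens ?? 0,
  });
}
```

Calling `console.log(usageLogLine('triage', MODEL_ID, response.usage, elapsedMs))` after each request would give a metrics pipeline everything it needs for cost and latency, where `elapsedMs` is whatever timing the caller already measures.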
Watch the bill from day one. Output tokens cost several times more than input tokens, so a chatty feature costs more than its request count implies. Right-size the model — Haiku over the biggest model until you’ve proven you need more — before launch, not after the first invoice.
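A per-call estimate makes the input/output asymmetry visible early. The sketch below is only arithmetic: `estimateCallCostUsd` is a hypothetical helper, and the prices in `ASSUMED_PRICES` are placeholders chosen for illustration, so substitute the current figures from the Bedrock pricing page before relying on the result.

```typescript
// Prices in USD per million tokens.
interface TokenPrices {
  inputPerMillion: number;
  outputPerMillion: number;
}

// Hypothetical helper: cost of one call from its token counts.
function estimateCallCostUsd(inputTokens: number, outputTokens: number, prices: TokenPrices): number {
  return (inputTokens * prices.inputPerMillion + outputTokens * prices.outputPerMillion) / 1e6;
}

// Placeholder prices for illustration only; not quoted from a price list.
const ASSUMED_PRICES: TokenPrices = { inputPerMillion: 1, outputPerMillion: 5 };

const terseCallCost = estimateCallCostUsd(42, 5, ASSUMED_PRICES);
const chattyCallCost = estimateCallCostUsd(42, 500, ASSUMED_PRICES);
console.log(terseCallCost.toFixed(6), chattyCallCost.toFixed(6));
```

Under these assumed prices the same 42-token prompt costs far more when the reply runs to 500 tokens than when it is one word, which is the point about chatty features.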
Where to go next
That’s the whole spine of a first production Bedrock feature: five deliberate decisions, plus the hygiene that keeps them honest. The rest of this series builds on the call you just shipped:
- Retrieval-Augmented Generation with Knowledge Bases — ground the model in your own documents instead of relying on what it memorized.
- Bedrock Agents and action groups — let the model call your APIs to take real actions.
- Cost controls and Guardrails — the safety rails and budget limits you want before this is in front of users, and a deeper look at the token-quota math above.
- Streaming responses through Lambda + API Gateway — return tokens as they’re generated for a responsive UX.
A first Bedrock feature isn’t five steps you follow. It’s five defaults you can defend — choose them on purpose, on the same account your bill arrives for.
Further reading
- Optimize your applications for scale and reliability on Amazon Bedrock: the AWS deep-dive on throttling, retries, and maxTokens.
- Simplified model access in Amazon Bedrock: what the 2025 IAM-driven access change does and doesn’t cover.
- Increase throughput with cross-Region inference: why the us. inference-profile ID is required.