Your First Production Bedrock Feature Is Five Decisions, Not Five Steps


The gap between “I called an LLM in a notebook” and “this runs in production” is wider than the tutorials admit. The notebook version is three lines and works on the first try. The production version looks like the same recipe with a few more lines. It’s actually a sequence of decisions, and that’s what trips people up.

Five of them: region, model, API, IAM scope, and maxTokens. Each has a default you can defend. Most of the walls people hit on a first Bedrock call are just a default someone else chose — the tutorial, or a field left unset — that’s wrong for your case. Get the five right and the rest is typing.

One fact sits under all of them: on Bedrock, a model call is an AWS API call. There is no separate API key and no vendor dashboard. The model runs in your account, in your region, signed with your AWS credentials and metered against your AWS quotas. Every decision below is an AWS decision underneath, which is why the failures don’t look like LLM problems.

This post walks all five against a live account. Code is TypeScript (@aws-sdk/client-bedrock-runtime), with the Python equivalent where it differs. Every output and error message shown here came from a real call.

Decision 1: which region?

Default: the region the rest of your stack already lives in. This post uses us-east-1.

Bedrock is multi-region, but the model catalog is not uniform across regions, and enabling a model in us-east-1 does nothing for us-west-2. The trap is that a missing model doesn’t announce itself — it shows up as an empty list or a ResourceNotFound, not a “switch regions” hint. Pick one region, put your model and your app there, and only split later when you have a concrete latency or data-residency reason. For a first feature, picking is the whole decision.

Decision 2: which model?

Default: Claude Haiku 4.5, addressed through its us. inference profile.

For a first feature you want fast, cheap, and current — not the biggest model on the menu. Haiku is cheap enough to not think about while you’re developing. Current is the load-bearing word. Two traps hide here, and both look like bugs unrelated to which model you picked.

Trap one: the model ID isn’t the model ID. Newer Claude models cannot be invoked on-demand with their bare model ID. You have to use a cross-region inference profile, whose ID is the model ID with a geography prefix (us., eu., …). Use the bare ID and Bedrock hands you this:

ValidationException
ValidationException: Invocation of model ID anthropic.claude-haiku-4-5-20251001-v1:0 with
on-demand throughput isn't supported. Retry your request with the ID or ARN of an
inference profile that contains this model.

It’s a ValidationException, not an access error — so people burn an afternoon inspecting their request payload instead of their model ID. The profile lets Bedrock route your request across regions for capacity; for current Claude models on-demand, it’s the only address that works.

Trap two: the legacy wall. Say you learn the inference-profile trick from trap one and apply it to an older, cheaper-looking model: us.anthropic.claude-3-5-haiku-20241022-v1:0. The profile ID is right this time; the problem is that the provider has marked the model behind it as Legacy, and accounts that haven’t used it in the last 30 days are turned away:

ResourceNotFoundException
ResourceNotFoundException: ... Model is marked by provider as Legacy and you have not been
actively using the model in the last 30 days. Please upgrade to an active model.

Both traps point the same direction: start on a current model, addressed through its inference profile. That’s what makes “which model” a real decision.
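
If you keep bare model IDs in config, the prefix rule from trap one can live in one small helper. This is a minimal sketch, not part of the SDK: toInferenceProfileId and the GEO_PREFIXES list are hypothetical names, and the list only covers the two prefixes named above. It builds the string and nothing more, so it won’t save you from trap two.

```typescript
// Geography prefixes mentioned in the post. Illustrative, not exhaustive.
const GEO_PREFIXES = ['us', 'eu'] as const;
type Geo = (typeof GEO_PREFIXES)[number];

// True when the ID already carries one of the known geography prefixes.
function hasGeoPrefix(modelId: string): boolean {
  return GEO_PREFIXES.some((geo) => modelId.startsWith(`${geo}.`));
}

// Hypothetical helper: build an inference profile ID from a bare model ID.
// It only builds the string; it cannot tell you whether that profile exists
// or whether the model behind it is still active.
function toInferenceProfileId(modelId: string, geo: Geo = 'us'): string {
  return hasGeoPrefix(modelId) ? modelId : `${geo}.${modelId}`;
}

console.log(toInferenceProfileId('anthropic.claude-haiku-4-5-20251001-v1:0'));
```

Passing an ID that is already prefixed returns it unchanged, so the helper is safe to call on values from either kind of config.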

Decision 3: Converse or InvokeModel?

Default: Converse. It’s Bedrock’s unified, model-agnostic API — the request and response shape is identical whichever provider you call, so switching models later is a one-line change. Here’s the whole first call:

first-call.mjs
import {
  BedrockRuntimeClient,
  ConverseCommand,
} from '@aws-sdk/client-bedrock-runtime';

// Newer Claude models are invoked through a cross-region *inference profile*,
// not the bare model ID. The `us.` prefix is the US inference profile.
const MODEL_ID = 'us.anthropic.claude-haiku-4-5-20251001-v1:0';

const client = new BedrockRuntimeClient({ region: 'us-east-1' });

const response = await client.send(
  new ConverseCommand({
    modelId: MODEL_ID,
    system: [
      { text: 'You are a support triage assistant. Reply with exactly one word.' },
    ],
    messages: [
      {
        role: 'user',
        content: [
          { text: "Classify the sentiment of this message: 'My invoice is wrong again and no one has replied.'" },
        ],
      },
    ],
    inferenceConfig: { maxTokens: 100, temperature: 0.2 },
  }),
);

console.log('reply:', response.output?.message?.content?.[0]?.text);
console.log('stopReason:', response.stopReason);
console.log('usage:', JSON.stringify(response.usage));

Running it:

output
reply: Negative
stopReason: end_turn
usage: {"inputTokens":42,"outputTokens":5,"totalTokens":47,"cacheReadInputTokens":0,"cacheWriteInputTokens":0}

That’s 47 tokens total — at current Claude Haiku 4.5 pricing on Bedrock, under $0.0001 per call. You can run this thousands of times before the cost is worth a thought.
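
To check that arithmetic yourself, multiply the usage block by the per-token prices. The sketch below assumes $1 per million input tokens and $5 per million output tokens; those figures are an assumption for illustration, not something returned by the API, so confirm them against the Bedrock pricing page for your region. estimateCostUsd is a hypothetical helper name.

```typescript
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

interface PricePerMillion {
  input: number;  // USD per one million input tokens
  output: number; // USD per one million output tokens
}

// ASSUMPTION: illustrative prices only. Check the current Bedrock pricing page.
const ASSUMED_PRICE: PricePerMillion = { input: 1.0, output: 5.0 };

// Estimate the cost of one call from the usage block of its response.
function estimateCostUsd(usage: TokenUsage, price: PricePerMillion = ASSUMED_PRICE): number {
  return (usage.inputTokens * price.input + usage.outputTokens * price.output) / 1_000_000;
}

// The usage from the run above: 42 input tokens, 5 output tokens.
console.log(estimateCostUsd({ inputTokens: 42, outputTokens: 5 }).toFixed(6));
```

Under those assumed prices the call lands comfortably under the $0.0001 mark, with the input tokens making up most of the total.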

The Python version is the same call through boto3:

first_call.py
import json

import boto3

MODEL_ID = "us.anthropic.claude-haiku-4-5-20251001-v1:0"

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId=MODEL_ID,
    system=[{"text": "You are a support triage assistant. Reply with exactly one word."}],
    messages=[
        {
            "role": "user",
            "content": [{"text": "Classify the sentiment of this message: "
                                 "'My invoice is wrong again and no one has replied.'"}],
        }
    ],
    inferenceConfig={"maxTokens": 100, "temperature": 0.2},
)

print(response["output"]["message"]["content"][0]["text"])
print(json.dumps(response["usage"]))

One boto3 gotcha that wastes an hour: inference uses the bedrock-runtime client, while listing and managing models uses the plain bedrock client. Call converse on the wrong one and the error won’t tell you that’s what you did.

So when do you reach for InvokeModel? The difference is the envelope. Converse gives you one schema for every provider, plus multi-turn and tool use. InvokeModel makes you hand-write the model’s native request body — for Claude, the Anthropic message format including the required anthropic_version: "bedrock-2023-05-31" field. The breaking point is specificity: drop to InvokeModel only when you need a provider-only parameter, a model that doesn’t yet speak the Converse schema, or payloads pushing against the Bedrock service quotas. For a first feature, that’s never. Start on Converse.
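
To make the envelope difference concrete, here is roughly what you would hand-write for the same classification call if you dropped to InvokeModel. It’s a sketch of the Anthropic message format as Bedrock accepts it; buildClaudeInvokeBody and the ClaudeNativeBody type are hypothetical names, and only the fields this example needs are included.

```typescript
// A subset of the Claude native request body. Field names follow the Anthropic
// Messages format; this type is a local illustration, not an SDK type.
interface ClaudeNativeBody {
  anthropic_version: 'bedrock-2023-05-31';
  max_tokens: number;
  system: string;
  messages: { role: 'user' | 'assistant'; content: { type: 'text'; text: string }[] }[];
  temperature?: number;
}

// Hypothetical helper: serialize the body you would pass to InvokeModel.
function buildClaudeInvokeBody(system: string, userText: string, maxTokens: number): string {
  const body: ClaudeNativeBody = {
    anthropic_version: 'bedrock-2023-05-31', // required on every request
    max_tokens: maxTokens,                   // snake_case here, camelCase in Converse
    system,
    messages: [{ role: 'user', content: [{ type: 'text', text: userText }] }],
    temperature: 0.2,
  };
  return JSON.stringify(body);
}

console.log(
  buildClaudeInvokeBody(
    'You are a support triage assistant. Reply with exactly one word.',
    "Classify the sentiment of this message: 'My invoice is wrong again and no one has replied.'",
    100,
  ),
);
```

The resulting string is what goes in the body field of an InvokeModel request, and the response comes back in the provider’s own shape too. Every one of those names changes if you switch providers, which is the argument for staying on Converse.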

Decision 4: how wide is the IAM?

Default: one action, two ARNs. Develop with admin if you must, but ship a policy that allows exactly bedrock:InvokeModel on exactly the model you call. Because there’s no API key to scope — the credentials are the scope — this is the only thing standing between your feature and every model in the account. An inference profile needs permission on both the profile and the foundation models it routes to:

bedrock-invoke-policy.json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": "bedrock:InvokeModel",
    "Resource": [
      "arn:aws:bedrock:*:ACCOUNT_ID:inference-profile/us.anthropic.claude-haiku-4-5-20251001-v1:0",
      "arn:aws:bedrock:*::foundation-model/anthropic.claude-haiku-4-5-20251001-v1:0"
    ]
  }]
}
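
If you generate policies in code rather than pasting JSON, the two ARNs can be derived from the same bare model ID so they never drift apart. This is a sketch that mirrors the policy above; buildInvokePolicy is a hypothetical helper, and the account ID in the usage line is a placeholder.

```typescript
interface PolicyStatement {
  Effect: 'Allow';
  Action: string;
  Resource: string[];
}

interface PolicyDocument {
  Version: '2012-10-17';
  Statement: PolicyStatement[];
}

// Hypothetical helper: one action, two ARNs, both derived from the bare model ID.
function buildInvokePolicy(accountId: string, geo: string, bareModelId: string): PolicyDocument {
  return {
    Version: '2012-10-17',
    Statement: [
      {
        Effect: 'Allow',
        Action: 'bedrock:InvokeModel',
        Resource: [
          // The inference profile lives in your account.
          `arn:aws:bedrock:*:${accountId}:inference-profile/${geo}.${bareModelId}`,
          // The foundation model it routes to has no account ID in its ARN.
          `arn:aws:bedrock:*::foundation-model/${bareModelId}`,
        ],
      },
    ],
  };
}

// '123456789012' is a placeholder account ID.
const generatedPolicy = buildInvokePolicy('123456789012', 'us', 'anthropic.claude-haiku-4-5-20251001-v1:0');
console.log(JSON.stringify(generatedPolicy, null, 2));
```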

“My role has AdministratorAccess and the call still throws AccessDeniedException — how is this an IAM problem?” It usually isn’t. Third-party models go through an AWS Marketplace subscription: your first call to an Anthropic model triggers a background subscription, and Anthropic also requires a one-time use-case form per account. Until that clears, the call fails with an AccessDeniedException mentioning aws-marketplace:Subscribe — which reads like an IAM bug even when your IAM is spotless. Wait it out once per account; it isn’t your policy.
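
Three of the errors so far share a pattern: the exception name points you at the wrong layer. One way to save the next person an afternoon is a small triage function that maps each one to what it actually means. This is a sketch; triageBedrockError and its return labels are hypothetical, and it matches on substrings of the messages quoted in this post, which is brittle if the wording changes.

```typescript
// The two fields we rely on. AWS SDK v3 errors expose both.
interface BedrockErrorLike {
  name: string;
  message: string;
}

type Triage =
  | 'use-inference-profile'  // bare model ID used for on-demand invocation
  | 'upgrade-model'          // model marked Legacy by the provider
  | 'wait-for-subscription'  // Marketplace subscription or use-case form pending
  | 'unknown';

// Hypothetical helper: translate a misleading error into the real cause.
// ASSUMPTION: substring checks mirror the messages quoted in this post.
function triageBedrockError(err: BedrockErrorLike): Triage {
  if (err.name === 'ValidationException' && err.message.includes('on-demand throughput')) {
    return 'use-inference-profile';
  }
  if (err.name === 'ResourceNotFoundException' && err.message.includes('Legacy')) {
    return 'upgrade-model';
  }
  if (err.name === 'AccessDeniedException' && err.message.includes('aws-marketplace')) {
    return 'wait-for-subscription';
  }
  return 'unknown';
}
```

In a catch block you would log the triage label next to the raw error, so the on-call engineer sees “wait-for-subscription” before they start editing IAM policies.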

Decision 5: what do you set maxTokens to?

Default: the smallest number your feature actually needs. Never leave it unset.

This is the one the demo won’t show you, because it only bites in production. Leave maxTokens unset and it defaults to the model’s maximum, and Bedrock reserves input + maxTokens against your token-per-minute quota at the start of every request, before a single token is generated.

Do the multiplication. A classifier whose answer is one word needs maybe 5 output tokens. Leave maxTokens at a model default of 8192 and every call reserves 42 + 8192 = 8,234 quota tokens to produce 5. That is roughly 175× the 47 tokens the call actually uses, reserved up front. A handful of those in flight starve the whole account into throttling while doing almost nothing. Worse, Claude 3.7 and later burn roughly five quota tokens per output token, so your effective throughput is lower than the raw quota number suggests. Cap maxTokens at what the feature needs and the reservation shrinks with it.
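
The same arithmetic as code, so you can plug in your own numbers. This is a sketch under two assumptions: the up-front reservation is input + maxTokens as described above, and the 5× output burndown is the “roughly five” figure from this section, applied once the response is complete. The function names are hypothetical.

```typescript
// Tokens reserved against the quota when the request starts.
function reservedTokens(inputTokens: number, maxTokens: number): number {
  return inputTokens + maxTokens;
}

// ASSUMPTION: the "roughly five quota tokens per output token" figure above.
const ASSUMED_OUTPUT_BURNDOWN = 5;

// Tokens counted against the quota once the response is complete.
function settledTokens(
  inputTokens: number,
  outputTokens: number,
  burndown: number = ASSUMED_OUTPUT_BURNDOWN,
): number {
  return inputTokens + outputTokens * burndown;
}

console.log(reservedTokens(42, 8192)); // 8234: maxTokens left at 8192
console.log(reservedTokens(42, 100));  // 142: maxTokens capped at 100
console.log(settledTokens(42, 5));     // 67: the finished call, with burndown
```

Capping at 100 still leaves generous headroom over a one-word answer, and it cuts the reservation from 8,234 tokens to 142.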

Throttling here is routine: ThrottlingException (HTTP 429) happens because Bedrock quotas are per-account, per-region, and shared across every app in that account. The JavaScript SDK defaults to standard retry mode and boto3 to legacy mode; both retry throttling errors with exponential backoff. Adaptive mode adds client-side rate limiting on top and handles bursty quota windows better, so turn it on explicitly:

client_with_retries.mjs
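import { BedrockRuntimeClient } from '@aws-sdk/client-bedrock-runtime';

// Adaptive mode: standard retries plus client-side rate limiting.
const client = new BedrockRuntimeClient({
  region: 'us-east-1',
  retryMode: 'adaptive',
  maxAttempts: 5,
});

client_with_retries.py
import boto3
from botocore.config import Config

# Adaptive mode: standard retries plus client-side rate limiting.
client = boto3.client(
    "bedrock-runtime",
    region_name="us-east-1",
    config=Config(retries={"mode": "adaptive", "max_attempts": 5}),
)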

After the five: the hygiene that keeps it shipped

The five decisions get you a call that works and won’t take the account down. Three habits keep it that way, and none is more than a few lines.

Structure the prompt — don’t concatenate it. Converse takes two distinct inputs: system (who the model is and the rules it follows, stable across requests) and messages (the actual conversation and the variable input). It’s tempting to mash everything into one user string. Don’t — keeping the rules in system and the input in messages makes behavior more predictable, stops user-supplied text from quietly overriding your instructions, and sets you up for prompt caching and multi-turn later without a rewrite. The temperature knob lives in inferenceConfig; keep it low (~0.2) for classification and extraction.
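
As a sketch of that split, here is a request builder that keeps the rules in system and puts only the customer’s text in messages. The types are local stand-ins for the Converse request shape used in first-call.mjs, and buildTriageRequest is a hypothetical name. The returned object is what you would pass to ConverseCommand.

```typescript
interface TextBlock {
  text: string;
}

interface Message {
  role: 'user' | 'assistant';
  content: TextBlock[];
}

// Local stand-in for the Converse request fields used in this post.
interface ConverseRequest {
  modelId: string;
  system: TextBlock[];
  messages: Message[];
  inferenceConfig: { maxTokens: number; temperature: number };
}

// Stable rules: identical on every request, never mixed with user input.
const SYSTEM_RULES = 'You are a support triage assistant. Reply with exactly one word.';

// Hypothetical helper: only the variable input goes into messages.
function buildTriageRequest(modelId: string, customerMessage: string): ConverseRequest {
  return {
    modelId,
    system: [{ text: SYSTEM_RULES }],
    messages: [
      {
        role: 'user',
        content: [{ text: `Classify the sentiment of this message: '${customerMessage}'` }],
      },
    ],
    inferenceConfig: { maxTokens: 100, temperature: 0.2 },
  };
}

const triageRequest = buildTriageRequest(
  'us.anthropic.claude-haiku-4-5-20251001-v1:0',
  'My invoice is wrong again and no one has replied.',
);
console.log(JSON.stringify(triageRequest, null, 2));
```

Because the system block never changes, it is also the part you would later mark for prompt caching.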

Log usage, not payloads. Emit response.usage — the token counts — to your metrics so cost and latency are visible from day one. Logging full prompts and responses by default is both a privacy liability and, at volume, a surprising storage bill.
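
A minimal sketch of what that metric record might look like. It copies the token counts and stopReason from the response and deliberately never touches the output field. toUsageMetric and the truncated flag are hypothetical additions; the flag relies on max_tokens being the stopReason value when a reply is cut off by the cap.

```typescript
// The response fields this helper reads. `output` is declared but never used.
interface ConverseResponseLike {
  stopReason?: string;
  usage?: { inputTokens?: number; outputTokens?: number; totalTokens?: number };
  output?: unknown;
}

interface UsageMetric {
  inputTokens: number;
  outputTokens: number;
  totalTokens: number;
  stopReason: string;
  truncated: boolean; // true when the reply hit the maxTokens cap
}

// Hypothetical helper: build a log-safe record with counts only, no text.
function toUsageMetric(response: ConverseResponseLike): UsageMetric {
  const stopReason = response.stopReason ?? 'unknown';
  return {
    inputTokens: response.usage?.inputTokens ?? 0,
    outputTokens: response.usage?.outputTokens ?? 0,
    totalTokens: response.usage?.totalTokens ?? 0,
    stopReason,
    truncated: stopReason === 'max_tokens',
  };
}

// The response from the first call, reduced to the fields that matter here.
console.log(
  JSON.stringify(
    toUsageMetric({
      stopReason: 'end_turn',
      usage: { inputTokens: 42, outputTokens: 5, totalTokens: 47 },
    }),
  ),
);
```

A rising truncated count is also the early signal that the maxTokens cap from Decision 5 is set too tight.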

Watch the bill from day one. Output tokens cost several times more than input tokens, so a chatty feature costs more than its request count implies. Right-size the model — Haiku over the biggest model until you’ve proven you need more — before launch, not after the first invoice.

Where to go next

That’s the whole spine of a first production Bedrock feature: five deliberate decisions, plus the hygiene that keeps them honest. The rest of this series builds on the call you just shipped:

  • Retrieval-Augmented Generation with Knowledge Bases — ground the model in your own documents instead of relying on what it memorized.
  • Bedrock Agents and action groups — let the model call your APIs to take real actions.
  • Cost controls and Guardrails — the safety rails and budget limits you want before this is in front of users, and a deeper look at the token-quota math above.
  • Streaming responses through Lambda + API Gateway — return tokens as they’re generated for a responsive UX.

A first Bedrock feature isn’t five steps you follow. It’s five defaults you can defend — choose them on purpose, on the same account your bill arrives for.
