Tutorial · 15 min read · April 30, 2026

Claude Opus 4.7 and GPT-5.5: The 2026 Prompting Guide

Anthropic and OpenAI both shipped new prompting guides this spring. Old prompts produce worse results on the new models. Here is what changed, what stays, and a 10-minute migration checklist.



TL;DR

| Dimension | Claude Opus 4.7 | GPT-5.5 |
|---|---|---|
| Direction of change | More literal | More autonomous |
| What it stopped doing | Inferring your intent | Following your steps |
| Old prompts that break | Vague, hedge-word, "try to" | Step-by-step, "ALWAYS/NEVER", padded |
| Fix | Be specific about scope, format, output | Describe outcome and success criteria, drop the process |
| Default effort | xhigh for coding, high for knowledge | medium, raise only with eval data |
| Anti-pattern | Negative rules ("don't do X") | Process micromanagement |
| Token cost vs prior generation | 1.0x to 1.35x stated, up to 1.47x measured | Roughly equivalent |
Token cost vs Claude Opus 4.6 baseline (same prompt, same task; higher = more tokens billed):

| Model | Relative token cost |
|---|---|
| Claude Opus 4.6 | 1.00x (baseline) |
| GPT-5 | 1.00x |
| GPT-5.5 | ~1.02x |
| Claude 4.7 (prose) | 1.15x |
| Claude 4.7 (code, JSON) | 1.35x (Anthropic stated max) |
| Claude 4.7 (real CLAUDE.md, measured) | 1.47x |

Sources: Anthropic 4.7 release notes (Apr 16, 2026); independent measurements via claudecodecamp.com. Cache reads are still discounted up to 90%; caching is the highest-leverage way to offset the new tokenizer.

Anthropic publicly states the new tokenizer adds "up to 35%" tokens versus 4.6. Independent measurements on code-heavy and structured-data prompts reached 1.47x. Plan for the higher number on agentic workloads.
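The highest-leverage offset is caching the long static prefix, as the chart notes. A minimal sketch using the Anthropic SDK's existing `cache_control` breakpoint; the model ID is this article's assumption, not a confirmed identifier:

```python
import anthropic

client = anthropic.Anthropic()

# Mark the long static prefix (rules file, tool schemas) as cacheable.
# Cache hits are billed at the discounted read rate, which offsets most
# of the new tokenizer's 1.15x-1.47x overhead on repeat calls.
static_rules = open("CLAUDE.md").read()

response = client.messages.create(
    model="claude-opus-4-7",  # assumption: illustrative model ID
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": static_rules,
            "cache_control": {"type": "ephemeral"},  # cache breakpoint
        }
    ],
    messages=[{"role": "user", "content": "Summarize today's changelog."}],
)
```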


Claude Opus 4.7 in literal mode

Change 1: Hedge words now weaken instructions

In 4.6, "try to keep it under 200 words" meant "aim for 200, allow some flex." In 4.7, the model reads "try to" as permission to ignore the constraint.

| Prompt | Claude 4.6 result | Claude 4.7 result |
|---|---|---|
| Summarize this article. Try to keep it under 200 words. | 190 words, on target | 350+ words, "try" treated as optional |
| Summarize this article in 180 to 200 words. Hard cap at 200. | 190 words | 190 words, cap respected |

Rule: delete every "try to," "if possible," "you might want to," "ideally." Replace with explicit constraints or remove the sentence.

Change 2: Scope must be stated explicitly

| Prompt | 4.6 behavior | 4.7 behavior |
|---|---|---|
| Add JSDoc comments. Use the @param format shown in the first function. | Applies to all functions | Applies only to the first function |
| Add JSDoc comments to every function in this file. Use the @param format from the first function as the template. | Applies to all | Applies to all |

Rule: if you want a rule applied broadly, write "every," "all," or "across the entire X." 4.7 will not generalize silently.

Change 3: Positive examples beat negative rules

Both labs confirm this. Anthropic states it directly: tell Claude what to do instead of what not to do.

| Anti-pattern (negative rules) | Better (positive example) |
|---|---|
| Don't use markdown. Don't use bullets. Don't use bold. Don't use headers. Don't write fragments. | Write in flowing prose paragraphs. Example: "The acquisition closed on Tuesday. By Wednesday morning, three executives had resigned, and the engineering team learned about the change from a press release." |
| Never use ellipses | Your response will be read aloud by TTS, so write all pauses as commas or periods. Never use "..." |

Change 4: Effort levels are strict

| Symptom | Old fix (4.6 era) | New fix (4.7) |
|---|---|---|
| Shallow reasoning | Add "think step by step" | Raise effort to high or xhigh |
| Missing edge cases | Add "consider all edge cases" | Raise effort, then add specific cases as examples |
| Sloppy multi-step work | Add scaffolding ("first plan, then execute") | Raise effort, remove the scaffolding |

Anthropic's guidance: if you observe shallow reasoning, raise effort rather than prompting around it.

Change 5: Remove progress-update scaffolding

| Old prompt (4.6) | New prompt (4.7) |
|---|---|
| After every 3 tool calls, summarize what you've done so far. | (delete entirely; 4.7 self-paces user updates) |
| Before answering, write your reasoning in `<thinking>` tags. | (delete; adaptive thinking handles this) |

Change 6: Code review harnesses need recall language

If your bug-finding harness regressed after the 4.7 upgrade, this is almost certainly the cause. 4.7 obeys "be conservative, only high-severity" too well.

| Prompt | 4.6 finding count | 4.7 finding count |
|---|---|---|
| Review this PR. Only report high-severity issues. Don't nitpick. | 8 findings | 3 findings (4.7 actually filters) |
| Report every issue, including low-severity and uncertain ones. Tag each with confidence and severity. A downstream filter will rank. | 8 findings | 12 findings (recall restored) |

⚠️ If you upgrade to Opus 4.7 without changing your API calls, requests will start returning 400 errors. The four parameter changes below are hard-rejected in their old shape. Audit before you deploy.

Change 7: API breakages

Verified against the Anthropic 4.7 migration guide (April 16, 2026). All four below return a 400 if you keep the old shape.

| Old (4.6) | New (4.7) |
|---|---|
| `temperature`, `top_p`, `top_k` set to any non-default value | Omit entirely. Use prompting to control behavior. |
| `thinking: {type: "enabled", budget_tokens: 8000}` | `thinking: {type: "adaptive"}` plus an effort level |
| Effort levels: low, medium, high, max | Now: low, medium, high, xhigh, max |
| Adaptive thinking on by default | Off by default on 4.7. Opt in explicitly. |

New on 4.7: task budgets (beta header task-budgets-2026-03-13). Set a token ceiling for an entire agentic loop (thinking, tool calls, results, and final output all count against it). Useful when you want a hard cost cap on a long-running agent.
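Putting the migration together, a sketch of a 4.7-shaped request. The call is the standard Anthropic messages API; the effort field name, the model ID, and the exact beta-header mechanics are assumptions based on the migration notes above, not confirmed SDK surface:

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-7",  # assumption: illustrative model ID
    max_tokens=4096,
    # Old 4.6 shape, now rejected with a 400:
    #   temperature=0.3,
    #   thinking={"type": "enabled", "budget_tokens": 8000},
    thinking={"type": "adaptive"},  # opt in explicitly: off by default on 4.7
    # Fields the typed SDK may not know yet go through extra_body;
    # the "effort" name is an assumption based on the migration notes.
    extra_body={"effort": "xhigh"},
    # Task budgets beta: caps the whole agentic loop (thinking, tool
    # calls, results, final output all count against the ceiling).
    extra_headers={"anthropic-beta": "task-budgets-2026-03-13"},
    messages=[{"role": "user", "content": "Refactor every handler in src/."}],
)
```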


GPT-5.5 in autonomous mode

Change 1: Outcome over process

The biggest shift. Old prompts walked the model through steps. New prompts define the destination.

Anti-pattern (legacy):

```
You are a customer support agent. Follow these steps:
1. Read the customer's message
2. Look up their account
3. Check their subscription status
4. Check their billing history
5. If they're past due, mention this
6. If they're current, check open tickets
7. Compare each ticket to the new issue
8. Decide which tool to call
9. Call the tool
10. Format the response politely
11. Double-check your work
```

Outcome-first (GPT-5.5):

```
Resolve the customer's issue end to end.

Success means:
- the eligibility decision is made from policy and account data
- any allowed action is completed before responding
- the final answer includes: resolution summary, next steps, ticket ID
- tone is warm but direct

Stop when you have minimum sufficient evidence to answer correctly.
```

Expected result delta:

| Metric | Legacy on GPT-5 | Legacy on GPT-5.5 | Outcome-first on GPT-5.5 |
|---|---|---|---|
| Tool calls | 5 to 7 | 8 to 11 (overthinks the script) | 3 to 5 |
| Output naturalness | Mechanical | More mechanical | Conversational |
| Latency | Baseline | +40% (process loops) | -20% |
| Correctness | Baseline | Same | Same or better |

Change 2: Replace ALWAYS/NEVER with decision rules

| Anti-pattern | Better (decision rule and stopping condition) |
|---|---|
| ALWAYS verify identity. NEVER respond without checking the database. | Verify identity when the request involves account changes or PII. Use cached data unless the request requires real-time accuracy. |
| ALWAYS double-check your work | Before responding, run the most relevant validation: unit tests for code, schema validation for JSON, fact-checking against retrieved sources for claims. |
| NEVER make more than 5 tool calls | Make another tool call only when: (a) top results don't answer the core question, (b) a required fact is missing, or (c) the user asked for exhaustive coverage. |

Change 3: Effort tuning, calmly

OpenAI states it plainly: reasoning effort is a last-mile knob, not the primary way to improve quality.

| Task | Right effort | Wrong move |
|---|---|---|
| Extract emails from a doc | none or low | high (no reasoning needed, wastes tokens) |
| Summarize a meeting transcript | low | high (mechanical task, overthinking adds noise) |
| Classify customer intent | medium | xhigh (cost scales faster than quality) |
| Multi-file refactor | high | low (real reasoning required) |
| Strategy from ambiguous data | xhigh | medium (the lift is in reasoning) |
| xhigh as your default | (it should not be) | The single most common mistake |

OpenAI's caution: if the task has conflicting instructions, weak stopping criteria, or open-ended tool access, higher effort can lead to overthinking. Higher effort with a sloppy prompt makes things worse, not better.
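In API terms, effort is a single request field, which makes it cheap to A/B against an eval set. A sketch using the OpenAI Responses API; `reasoning.effort` is the existing parameter, while the GPT-5.5 model ID is this article's assumption:

```python
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5.5",  # assumption: illustrative model ID
    reasoning={"effort": "medium"},  # start here; raise only with eval data
    input=(
        "Resolve the customer's issue end to end.\n\n"
        "Success means:\n"
        "- the eligibility decision is made from policy and account data\n"
        "- any allowed action is completed before responding\n"
        "- the final answer includes: resolution summary, next steps, ticket ID\n"
        "- tone is warm but direct\n\n"
        "Stop when you have minimum sufficient evidence to answer correctly."
    ),
)
print(response.output_text)
```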

Change 4: Strip the legacy scaffolding

| Old GPT-4/5 boilerplate | GPT-5.5 treatment |
|---|---|
| You are a helpful assistant. Today's date is 2026-04-30. | Delete the date; the model already knows it (UTC) |
| Respond in JSON with this schema: {...} | Move to the Structured Outputs API |
| Let's think step by step. | Delete; the model self-paces with reasoning effort |
| Take a deep breath and work carefully. | Delete; no measurable effect on 5.5 |
| Make sure to double-check your work. | Replace with concrete validation: "run pytest before responding" |
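For the schema row above: a sketch of moving an in-prompt JSON instruction into Structured Outputs. The `json_schema` format under `text.format` is the existing Responses API mechanism; the model ID and the schema itself are illustrative:

```python
from openai import OpenAI

client = OpenAI()

# Instead of "Respond in JSON with this schema: {...}" in the prompt,
# the schema is enforced by the API itself.
response = client.responses.create(
    model="gpt-5.5",  # assumption: illustrative model ID
    input="Summarize this spec for an exec audience.",
    text={
        "format": {
            "type": "json_schema",
            "name": "spec_summary",
            "schema": {
                "type": "object",
                "properties": {
                    "summary": {"type": "string"},
                    "risks": {"type": "array", "items": {"type": "string"}},
                },
                "required": ["summary", "risks"],
                "additionalProperties": False,
            },
            "strict": True,
        }
    },
)
```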

Change 5: Personality and behavior, briefly

GPT-5.5 is efficient, direct, and task-oriented by default. If you want warmth, prompt for it.

| Use case | Personality block |
|---|---|
| Production API for technical users | (omit, defaults are good) |
| Consumer chatbot | You are approachable and warm. Use conversational sentences, brief acknowledgements, avoid corporate phrasing. |
| Internal copilot for engineers | You are a capable peer: assume the user is competent. Skip preamble. Get to the answer. Flag risks once. |

Convergent rules: what works on both models

Positive examples beat negative rules

| Negative-rule prompt | Positive-example prompt |
|---|---|
| Don't write robotic AI text. Don't use phrases like "I'd be happy to help" or "Certainly!" | Match this voice: "The migration ran clean. Three rows in the events table got truncated, IDs 4421, 4422, 4423. I'll re-ingest them tonight." |
| Don't use jargon. Don't be verbose. | Write like Hemingway: short sentences, concrete nouns, no adjectives unless they earn their place. Example: "The build failed. The lock file was stale. I regenerated it." |

Match effort to task

| Task type | Claude effort | GPT-5.5 effort |
|---|---|---|
| Lookup or extraction | low | none or low |
| Summarization | medium | low |
| Classification | medium | medium |
| Single-file code edit | high | medium |
| Multi-file refactor or agent loop | xhigh | high |
| Strategy or open-ended analysis | xhigh or max | high or xhigh |
| Default for "I'm not sure" | high | medium |

Define success criteria, always

The block that works on both:

```
Goal: <one sentence>

Success means:
- <required output element 1>
- <required output element 2>
- <constraint, e.g. tone, length, format>

Stop when: <explicit stopping condition>
```

This block is the single highest-leverage change you can make. Drop it at the top of any production prompt and the same prompt will produce shorter, more consistent output on both Claude 4.7 and GPT-5.5.
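If your prompts are assembled in code, the block is easy to template so it never gets dropped. A minimal sketch; the helper name and shape are this article's, not from either vendor's SDK:

```python
def success_block(goal: str, criteria: list[str], stop_when: str) -> str:
    """Render the goal / success-criteria / stopping-condition block."""
    lines = [f"Goal: {goal}", "", "Success means:"]
    lines += [f"- {c}" for c in criteria]
    lines += ["", f"Stop when: {stop_when}"]
    return "\n".join(lines)

spec_text = open("spec.md").read()  # the document being summarized
prompt = success_block(
    goal="Summarize this product spec for an exec audience.",
    criteria=[
        "3 sentences on what the product does and who it's for",
        "total length: 250 to 350 words, hard cap at 350",
        "tone: direct, no marketing language",
    ],
    stop_when="every criterion above is satisfied",
) + "\n\n" + spec_text
```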


Same task, three prompts, three results

Task: summarize a 30-page product spec.

Prompt A (legacy, vague)

```
Read this document and give me a summary. Make it good.
```

| Model | Result |
|---|---|
| Claude 4.6 | 600 words, decent structure, fills gaps with reasonable inferences |
| Claude 4.7 | 250 words, generic, takes "good" literally as "competent and short" |
| GPT-5 | 500 words, slightly mechanical |
| GPT-5.5 | 400 words, conversational, no specific shape |

Prompt B (over-engineered)

```
You are an expert technical writer. Follow these steps:
1. Read the entire document carefully
2. Identify the key sections
3. For each section, extract the main points
4. Combine into a coherent summary
5. Make sure to double-check your work
6. Format as bullet points
7. Try to keep under 500 words
8. ALWAYS include the product name in the title
9. NEVER use jargon
Take a deep breath and work step by step.
```
| Model | Result |
|---|---|
| Claude 4.6 | Decent, model interprets noise charitably |
| Claude 4.7 | Mechanical bullets, "try to" ignored, length drifts to 700+ words |
| GPT-5 | Decent, mechanical |
| GPT-5.5 | Worst result. Over-specified process produces stilted output, reasoning effort wasted on following the script |

Prompt C (outcome-first with success criteria)

```
Summarize this product spec for an exec audience.

Success means:
- 3 sentences on what the product does and who it's for
- bullet list of the 5 most important features (one line each)
- bullet list of the 3 biggest risks or open questions
- total length: 250 to 350 words, hard cap at 350
- tone: direct, no marketing language

If anything in the spec is ambiguous, flag it under "open questions"
rather than guessing.
```
| Model | Result |
|---|---|
| Claude 4.6 | On target |
| Claude 4.7 | On target, slightly tighter prose |
| GPT-5 | On target |
| GPT-5.5 | On target, most natural prose of the four |

The same prompt that works best on 4.7 is the prompt that works best on GPT-5.5. The convergence is real.


A 10-minute migration checklist

Run this on your top 3 prompts before doing anything else.

1. Hedge audit (Claude 4.7). Search for "try to," "if possible," "ideally," "you might," "consider." Decide for each: do I mean it, or is it noise? Delete or harden. (A grep-style sketch follows this list.)
2. Scope audit (Claude 4.7). Find every instruction that implicitly applies to a class of items. Add explicit "every," "all," or "across N."
3. Process strip (GPT-5.5). Delete every "step 1, step 2, step 3" sequence unless the order is genuinely required.
4. Negative-rule replacement (both). For each "don't / never / avoid" rule, write one positive example and delete the rule.
5. Effort calibration (both). Default Claude to xhigh for coding, high for knowledge. Default GPT-5.5 to medium. Raise only with eval data.
6. API hygiene (Claude 4.7). Strip temperature, top_p, top_k. Replace thinking: {enabled} with adaptive plus effort.
7. Re-eval. Run the new prompt against the same eval set as the old one. If you do not have an eval set, build one (a minimal harness sketch appears below). Both labs are explicit that prompt decisions need empirical grounding now.
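For steps 1 through 4, a minimal grep-style audit sketch in plain Python, no SDK required. The pattern lists are this article's starting points, not either lab's official lists:

```python
import re
import sys

# Hedge words Claude 4.7 treats as permission to ignore (step 1),
# negative-rule markers to replace with positive examples (step 4),
# and numbered step-by-step scripts to strip for GPT-5.5 (step 3).
PATTERNS = {
    "hedge": r"\b(try to|if possible|ideally|you might|consider)\b",
    "negative": r"\b(don't|do not|never|avoid)\b",
    "process": r"^\s*(step\s+)?\d+[.)]\s",
}

for path in sys.argv[1:]:
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            for label, pattern in PATTERNS.items():
                if re.search(pattern, line, re.IGNORECASE):
                    print(f"{path}:{lineno}: {label}: {line.strip()}")
```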

ℹ️ Eval-driven iteration is not optional anymore. Anthropic and OpenAI both call this out in their 2026 prompting guides. "It feels better" is no longer a valid migration signal. Without an eval set, you cannot tell whether your changes helped or just felt fashionable.
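A minimal harness is enough to start. A sketch under stated assumptions: `call_model` stands in for whichever SDK you use, and `passes` encodes your own success criteria (the word cap here mirrors Prompt C above):

```python
# Minimal A/B eval: run old and new prompts over the same fixed inputs
# and count how often each output passes the success criteria.

OLD_PROMPT = "Read this document and give me a summary. Make it good."
NEW_PROMPT = "Summarize this product spec for an exec audience."  # Prompt C above

def call_model(prompt: str, document: str) -> str:
    """Stand-in for your actual SDK call (Anthropic or OpenAI)."""
    raise NotImplementedError

def passes(output: str) -> bool:
    # Encode success criteria as checks, e.g. Prompt C's 350-word hard cap.
    return len(output.split()) <= 350

def score(prompt: str, docs: list[str]) -> float:
    return sum(passes(call_model(prompt, d)) for d in docs) / len(docs)

docs = [open(p).read() for p in ["spec_a.md", "spec_b.md", "spec_c.md"]]  # your eval set
print("old prompt:", score(OLD_PROMPT, docs))
print("new prompt:", score(NEW_PROMPT, docs))
```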


What still works (don't throw the baby out)

Plenty of habits survived the reset. If you already do these, keep doing them.

| Habit | Status on 4.7 / 5.5 |
|---|---|
| System prompts for persona and constraints | Still the right place for global rules |
| Few-shot examples for tricky formats | Stronger than ever, especially for output shape |
| Structured Outputs / JSON schema (OpenAI) | Use the API, not prompt instructions |
| Anthropic XML tags for input boundaries | Still recommended |
| Eval-driven iteration | Now mandatory, not optional |
| Caching long static prefixes | The single best mitigation for the new tokenizer |
| Tool use with clear tool descriptions | Same playbook |
| Asking the model to flag ambiguity rather than guess | Works on both, especially helpful on 4.7 |
