I feel like this is so core to any LLM automation that it's crazy Anthropic is only adding it now.
I built a customized deep research tool internally earlier this year that is made up of multiple "agentic" steps, each focusing on specific information to find. The outputs of those steps are always JSON and then become the input for the next step. Sure, you can work your way around failures by doing retries, but it's just one less thing to think about if you can guarantee that the random LLM output adheres at least to some sort of structure.
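A rough sketch of that kind of pipeline, assuming Pydantic schemas and a hypothetical call_llm_structured helper (prompt plus schema class in, validated instance out — not a real SDK call):

    from pydantic import BaseModel

    class SearchPlan(BaseModel):
        queries: list[str]

    class Findings(BaseModel):
        sources: list[str]
        summary: str

    def run_step_chain(call_llm_structured) -> Findings:
        # call_llm_structured is hypothetical: (prompt, schema class) -> validated instance
        plan: SearchPlan = call_llm_structured("Plan the research queries for the topic.", SearchPlan)
        # Step 1's guaranteed-valid JSON becomes step 2's input; no retry logic needed.
        return call_llm_structured(
            "Research these queries and summarize: " + plan.model_dump_json(),
            Findings,
        )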
Prior to this it was possible to get the same effect by defining a tool with the schema that you wanted and then telling the Anthropic API to always use that tool.
I implemented structured outputs for Claude that way here: https://github.com/simonw/llm-anthropic/blob/500d277e9b4bec6...
We've been running structured outputs via Claude on Bedrock in production for a year now and it works great. Give it a JSON schema, inject a '{', and sometimes do a bit of custom parsing on the response. GG
Nice to see them support it officially; however, OpenAI has officially supported this for a while but, at least historically, I have been unable to use it because it adds deterministic validation that errors on certain standard JSON Schema elements that we used. The lack of "official" support is the feature that pushed us to use Claude in the first place.
It's unclear to me that we will need "modes" for these features.
Another example: I used to think that I couldn't live without Claude Code "plan mode". Then I used Codex and asked it to write a markdown file with a todo list. A bit more typing, but it works well, and it's nice to be able to edit the plan directly in the editor.
Agree or Disagree?
> Give it a JSON schema, inject a '{', and sometimes do a bit of custom parsing on the response
I would hope that this is not what OpenAI/Anthropic do under the hood, because otherwise, what if one of the strings needs a lot of \escapes? Is it also supposed to never write actual newlines in strings? It's awkward.
The ideal solution would be to have some special tokens like [object_start] [object_end] and [string_start] [string_end].
Before Claude Code shipped with plan mode, the workflow for using most coding agents was to have it create a `PLAN.md` and update/execute that plan. Planning mode was just a first class version of what users were already doing.
Claude Code keeps coming out with a lot of really nice tools that others haven't started to emulate from what I've seen.
My favorite one is going through the plan interactively. It turns it into a multiple choice / option TUI, and the last choice is always to reprompt that section of the plan.
I had to switch back to Codex recently, and not being able to do my planning solely in the CLI feels like the early 1900s.
To trigger the interactive mode, do something like:
Plan a fix for:
<Problem statement>
Please walk me through any options or questions you might have interactively.
I don't think the tool input schema thing does that inference-time trick. I think it just dumps the JSON schema into the context, and tells the model to conform to that schema.
It's not 100% successful; I've had responses that didn't match my schema.
I think the new feature goes further and limits which tokens can be output, which brings a guarantee, whereas with tools the schema is just a suggestion.
Same, but it’s a PITA when you also want to support tool calling at the same time. Had to do a double call: call and check if it will use tools. If not, call again and force the use of the (now injected) return schema tool.
It's nice but I don't know how necessary it is.
You could get this working very consistently with GPT-4 in mid 2023. The version before June, iirc. No JSON output, no tool calling fine tuning... just half a page of instructions and some string matching code. (Built a little AI code editing tool along these lines.)
With the tool calling RL and structured outputs, I think the main benefit is peace of mind. You know you're going down the happy path, so there's one less thing to worry about.
Reliability is the final frontier!
Having used structured outputs pretty extensively for a while now, my impression is that the newer models take less of a quality hit while conforming to a specific schema. Just giving instructions and output examples totally worked, but it came at a considerable cost to output quality. My impression is that this effect has diminished over time with models that have been more explicitly trained to produce structured output.
So, so much this.
Structured outputs are the most underappreciated LLM feature. If you're building anything except a chatbot, it's definitely worth familiarizing yourself with them.
They're not that easy to use well, and there aren't many resources on the internet explaining how to get the most out of them.
In Python, they're very easy to use. Define your schema with Pydantic and pass the class to your client calls. There are some details to know (eg field order can affect performance), but it's very easy overall. Other languages probably have something similar.
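For reference, a minimal sketch of that pattern with the OpenAI Python SDK (the model name is a placeholder; on older SDK versions this lives under client.beta.chat.completions.parse):

    from openai import OpenAI
    from pydantic import BaseModel

    class Person(BaseModel):
        name: str
        age: int

    client = OpenAI()
    completion = client.chat.completions.parse(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": "Alice turned 30 last week."}],
        response_format=Person,  # pass the Pydantic class directly
    )
    person = completion.choices[0].message.parsed  # a validated Person instance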
I have had fairly bad luck specifying the JSONSchema for my structured outputs with Gemini. It seems like describing the schema with natural language descriptions works much better, though I do admit to needing that retry hack at times. Do you have any tips on getting the most out of a schema definition?
Always have a top level object for one.
But also, Gemini supports constrained generation, which can't fail to match a schema, so why not use that instead of prompting?
Constrained generation makes models somewhat less intelligent. Although it shouldn't be an issue in thinking mode, since it can prepare an unconstrained response and then fix it up.
Not true and citation needed. Whatever you cite there are competing papers claiming that structured and constrained generation does zero harm to output diversity/creativity (within a schema).
I mean that's too reductionist if you're being exact and not a worry if you're not.
Even asking for JSON (without constrained sampling) sometimes degrades output, but also even the name and order of keys can affect performance or even act as structured thinking.
At the end of the day current models have enough problems with generalization that they should establish a baseline and move from there.
Agree, it feels so fundamental. Any idea why? Gemini has also had it for a long time
The way you get structured output with Claude prior to this is via tool use.
IMO this was the more elegant design if you think about it: tool calling is really just structured output and structured output is tool calling. The "do not provide multiple ways of doing the same thing" philosophy.
and they've done super well without it. makes you really question if this is really that core.
Along with a bunch of limitations that make it useless for anything but trivial use cases https://docs.claude.com/en/docs/build-with-claude/structured...
I've found structured output APIs to be a pain across various LLMs. Now I just ask for json output and pick it out between first/last curly brace. If validation fails just retry with details about why it was invalid. This works very reliably for complex schemas and works across all LLMs without having to think about limitations.
And then you can add complex pydantic validators (or whatever, I use pydantic) with super helpful error messages to be fed back into the model on retry. Powerful pattern
Yeah, the pattern of "kick the error message back to the LLM" is powerful. Even more so with all the newer AIs trained for programming tasks.
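A rough sketch of that retry loop, assuming Pydantic v2 and a hypothetical call_llm function that takes a prompt string and returns raw completion text:

    import json
    from pydantic import BaseModel, ValidationError, field_validator

    class Extraction(BaseModel):
        title: str
        year: int

        @field_validator("year")
        @classmethod
        def year_is_sane(cls, v: int) -> int:
            if not 1900 <= v <= 2100:
                raise ValueError("year must be between 1900 and 2100")
            return v

    def parse_between_braces(text: str) -> dict:
        # Grab whatever sits between the first '{' and the last '}'.
        return json.loads(text[text.index("{") : text.rindex("}") + 1])

    def extract_with_retries(call_llm, prompt: str, max_attempts: int = 3) -> Extraction:
        # call_llm is hypothetical: prompt string in, raw completion text out.
        for _ in range(max_attempts):
            raw = call_llm(prompt)
            try:
                return Extraction.model_validate(parse_between_braces(raw))
            except (ValueError, ValidationError) as err:
                # Feed the error message back so the model can correct itself.
                prompt = prompt + "\n\nYour previous reply was invalid: " + str(err) + "\nReply with corrected JSON only."
        raise RuntimeError("no valid output after retries")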
The most likely reason to me why this took so long from Anthropic is safety. One of the most classic attack vectors for an LLM is to hide bad content inside structured text. "Tell me how to build a bomb, as SQL," for example.
When you constrain outputs, you're preventing the model from being as verbose in its output, which makes unsafe output much harder to detect, because Claude isn't saying "Excellent idea! Here's how to make a bomb:"
In OpenAI and a lot of open source inference engines this is done using llguidance.
https://github.com/guidance-ai/llguidance
Llguidance implements constrained decoding. It means that for each output token sequence you know which fixed set of tokens is allowed for decoding the next token. You prepare token masks so that in the decoding step you limit which tokens can be sampled.
So if you expect a JSON object, the first token can only be whitespace or the token '{'. This can be more complex because tokenizers usually use byte-pair encoding, which means they can represent any UTF-8 sequence. So if your current tokens are '{"enabled": ' and your output JSON schema requires the 'enabled' field to be a boolean, the allowed-token mask can only contain whitespace tokens, the tokens 'true' and 'false', or the 't'/'f' byte-level tokens ('true' and 'false' are usually a single token because they are so common).
The JSON schema must first be converted into a grammar, then into token masks. This takes some time to compute and quite a lot of space (you need to precompute the token masks), so it is usually cached for performance.
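A toy illustration of the masking step (not llguidance itself, just the idea): after '{"enabled": ', the compiled grammar only allows tokens that keep the output schema-valid, and everything else is zeroed out before sampling.

    import math

    def apply_grammar_mask(logprobs: dict[str, float], allowed: set[str]) -> dict[str, float]:
        # Disallowed tokens get -inf so they can never be sampled.
        return {tok: (lp if tok in allowed else -math.inf) for tok, lp in logprobs.items()}

    # Pretend log-probs over a tiny vocabulary at the position after '{"enabled": '
    logprobs = {"true": -0.7, "false": -1.2, '"yes"': -0.3, "null": -2.0, " ": -3.0}
    allowed = {"true", "false", " "}  # derived from the JSON-schema-compiled grammar

    masked = apply_grammar_mask(logprobs, allowed)
    print(max(masked, key=masked.get))  # 'true': the highest-scoring token that stays schema-valid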
Shout-out to BAML [1], which flies under the radar and imo is underrated for getting structured output out of any LLM.
JSON schema is okay so long as it's generated for you, but I'd rather write something human readable and debuggable.
1. https://github.com/BoundaryML/baml
Shocked this wasn't already a feature. Bummed they only seem to have JSON Schema and not something more flexible like BNF grammars, which llama.cpp has had for a long time: https://github.com/ggml-org/llama.cpp/blob/master/grammars/R...
If they ever gave really fine-grained constraints, you could constrain to subsets of tokens and extract the logits a lot cheaper than by random sampling limited to a few top choices, and distill Claude at a much deeper level. I wonder if that plays into some of the restrictions.
That makes sense, and if that's the reason it's another vote for open models
I remember using Claude and including the start of the expected JSON output in the request to get the remainder in the response. I couldn't believe that was an actual recommendation from the company to get structured responses.
Like, you'd end your prompt like this: 'Provide the response in JSON: {"data":'
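With the Anthropic Python SDK that prefill trick looks roughly like this (a sketch; the model name is a placeholder): the trailing assistant message is treated as the start of Claude's reply, and the continuation comes back without the prefill, so you stitch it back on before parsing.

    import anthropic

    client = anthropic.Anthropic()
    prefill = '{"data":'

    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model name
        max_tokens=512,
        messages=[
            {"role": "user", "content": "Extract the key facts from this text as JSON: ..."},
            # Prefill: the trailing assistant turn becomes the start of the reply.
            {"role": "assistant", "content": prefill},
        ],
    )

    raw_json = prefill + response.content[0].text  # stitch back together before json.loads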
That's what I thought when starting, and it functions so poorly that I think they should remove it from their docs. You can enforce a schema by creating a tool definition with a JSON schema in the exact shape you want the output, then setting "tool_choice" to "any". They have a picture that helps.
https://docs.claude.com/en/docs/agents-and-tools/tool-use/im...
Unfortunately it doesn't support the full JSON Schema spec. You can't union or do other things you would expect. It's manageable, since you can just create another tool for it to choose from that fits another case.
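For anyone who hasn't seen it, the forced-tool-use version described above looks roughly like this (the tool name and schema are made up; in the Python SDK the "any" setting is expressed as tool_choice={"type": "any"}):

    import anthropic

    client = anthropic.Anthropic()

    record_tool = {
        "name": "record_summary",  # made-up tool; its input schema is the output shape we want
        "description": "Record a structured summary of the document.",
        "input_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "topics": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["title", "topics"],
        },
    }

    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model name
        max_tokens=1024,
        tools=[record_tool],
        tool_choice={"type": "any"},  # force Claude to call some tool
        messages=[{"role": "user", "content": "Summarize this document: ..."}],
    )

    # The tool call's arguments are the structured output.
    tool_use = next(block for block in response.content if block.type == "tool_use")
    print(tool_use.input)  # dict matching input_schema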
Curious if they're planning to support more complicated schemas. They claim to support JSON Schema, but I found it only accepts flat schemas and not, for example, unions or discriminated unions. I've had to flatten some of my schemas to be able to define a tool for them.
So cool to see Anthropic support this feature.
I'm a heavy user of the OpenAI version, however they seem to have a bug where frequently the model will return a string that is not syntactically valid json, leading the OpenAI client to raise a ValidationError when trying to construct the pydantic model.
Curious if anyone else here has experienced this?
I would have expected the implementation to prevent this, maybe using a state machine to only allow the model to pick syntactically valid tokens. Hopefully Anthropic took a different approach that doesn't have this issue.
Brian on the OpenAI API team here. I would love to help you get to the bottom of the structured outputs issues you're seeing. Mind sending me some more details about your schema / prompt or any request IDs you might have to by[at]openai.com?
Thanks so much for reaching out, sent an email :).
https://github.com/pydantic/pydantic-ai/issues/582 https://github.com/pydantic/pydantic-ai/issues/2405
yeah I have, but I think only when it gets stuck in a loop and outputs a (for example) array that goes on forever. a truncated array is obviously not valid JSON. but it'd be hard to miss that if you're looking at the outputs.
I always wondered how they achieved this - is it just retries while generating tokens, where as soon as they find a mismatch they retry? Or is the model itself trained extremely well in this version of 4.5?
They're using the same trick OpenAI have been using for a while: they compile a grammar and then have that running as part of token inference, such that only tokens that fit the grammar are selected as the next-token.
This trick has also been in llama.cpp for a couple of years: https://til.simonwillison.net/llms/llama-cpp-python-grammars
More info on Claude's grammar compiling: https://docs.claude.com/en/docs/build-with-claude/structured...
Yea, and now there are mature OSS solutions like Outlines and XGrammar, which makes it even weirder that Anthropic is only supporting this now.
I reaaaaally wish we could provide an EBNF grammar like llama.cpp. JSON Schema has much fewer use cases for me.
What are some examples that you can’t express in json schema?
Anything not JSON
This makes me wonder if there are cases where one would want the LLM to generate a syntactically invalid response (which could be identified as such) rather than guarantee syntactic validity at the potential cost of semantic accuracy.
How sure are you that OpenAI is using that?
I would have suspected it too, but I’ve been struggling with OpenAI returning syntactically invalid JSON when provided with a simple pydantic class (a list of strings), which shouldn’t be possible unless they have a glaring error in their grammar.
You might be using JSON mode, which doesn’t guarantee a schema will be followed, or structured outputs not in strict mode. It is possible to get the property that the response is either a valid instance of the schema or an error (eg for refusal)
How do you activate strict mode when using pydantic schemas? It doesn't look like that is a valid parameter to me.
No, I don't get refusals, I see literally invalid json, like: `{"field": ["value...}`
https://github.com/guidance-ai/llguidance
> 2025-05-20 LLGuidance shipped in OpenAI for JSON Schema
OpenAI is using [0] LLGuidance [1]. You need to set strict:true in your request for schema validation to kick in though.
[0] https://platform.openai.com/docs/guides/function-calling#lar... [1] https://github.com/guidance-ai/llguidance
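In raw-request terms, the opt-in looks something like this sketch (fields as described in the OpenAI structured-outputs docs; the model name is a placeholder):

    from openai import OpenAI

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": "List three colors."}],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "colors",
                "strict": True,  # without this, the schema is only a hint
                "schema": {
                    "type": "object",
                    "properties": {
                        "colors": {"type": "array", "items": {"type": "string"}}
                    },
                    "required": ["colors"],
                    "additionalProperties": False,
                },
            },
        },
    )
    print(response.choices[0].message.content)  # JSON string matching the schema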
I don't think that parameter is an option when using pydantic schemas.
    class FooBar(BaseModel):
        foo: list[str]
        bar: list[int]

    prompt = """#Task
    Your job is to reply with Foo Bar, a json object with foo, a list of strings, and bar, a list of ints
    """

    response = openai_client.chat.completions.parse(
        model="gpt-5-nano-2025-08-07",
        messages=[{"role": "system", "content": prompt}],
        max_completion_tokens=4096,
        seed=123,
        response_format=FooBar,
        strict=True,
    )
TypeError: Completions.parse() got an unexpected keyword argument 'strict'
You have to explicitly opt into it by passing strict=True https://platform.openai.com/docs/guides/structured-outputs/s...
Are you able to use `strict=True` when using pydantic models? It doesn't seem to be valid for me. I think that only works for json schemas.
    class FooBar(BaseModel):
        foo: list[str]
        bar: list[int]

    prompt = """#Task
    Your job is to reply with Foo Bar, a json object with foo, a list of strings, and bar, a list of ints
    """

    response = openai_client.chat.completions.parse(
        model="gpt-5-nano-2025-08-07",
        messages=[{"role": "system", "content": prompt}],
        max_completion_tokens=4096,
        seed=123,
        response_format=FooBar,
        strict=True,
    )
> TypeError: Completions.parse() got an unexpected keyword argument 'strict'
The inference doesn't return a single token, but the probabilities for all tokens. You just select the token that is allowed according to the compiled grammar.
Hmm, wouldn't it sacrifice a better answer in some cases (not sure how many though)?
I'd be surprised if they hadn't specifically trained for structured "correct" output for this, in addition to picking the next token following the structure.
In my experience (I've put hundreds of billions of tokens through structured outputs over the last 18 months), I think the answer is yes, but only in edge cases.
It generally happens when the grammar is highly constrained, for example if a boolean is expected next.
If the model assigns a low probability to both true and false coming next, then the sampling strategy will pick whichever one happens to score highest. Most tokens have very similar probabilities close to 0 most of the time, and if you're picking between two of these then the result will often feel random.
It's always the result of a bad prompt, though; if you improve the prompt so that the model understands the task better, there will be a clear difference in the scores the tokens get, and so it seems less random.
It's not just the prompt that matters, it's also field order (and a bunch of other things).
Imagine you're asking your model to give you a list of tasks mentioned in a meeting, along with a boolean indicating whether the task is done. If you put the boolean first, the model must decide both what the task is and whether it is done at the same time. If you put the task description first, the model can separate that work into two distinct steps.
There are more tricks like this. It's really worth thinking about which calculations you delegate to the model and which you do in code, and how you integrate the two.
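A small example of the field-order point, using Pydantic schemas (names here are illustrative, not from the original comment):

    from pydantic import BaseModel

    class TaskBooleanFirst(BaseModel):
        done: bool         # model must commit to done/not-done before the task is even written out
        description: str

    class TaskDescriptionFirst(BaseModel):
        description: str   # the task is written first...
        done: bool         # ...so the done judgment can condition on it

    class MeetingTasks(BaseModel):
        tasks: list[TaskDescriptionFirst]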
Grammars work best when aligned with the prompt. That is, if your prompt gives you the right format of answer 80% of the time, the grammar will take you to 100%. If it gives you the right answer 1% of the time, the grammar will give you syntactically correct garbage.
Sampling is already constrained with temperature, top_k, top_p, top_a, typical_p, min_p, entropy_penalty, smoothing, etc. Filtering tokens to valid ones according to a grammar is just one more knob. It does make sense and can be used for producing programming-language output as well: what's the point in generating, or bothering with, output you know up front is invalid? Better to filter it out and allow only valid completions.
The "better answer" wouldnt had respected the schema in this case.
Curious if they've built their own library for this or if they're using the same one as OpenAI[0].
A quick look at the llguidance repo doesn't show any signs of Anthropic contributors, but I do see some from OpenAI and ByteDance Seed.
[0]https://github.com/guidance-ai/llguidance
I switched from structured outputs on OpenAI apis to unstructured on Claude (haiku 4.5) and haven't had any issues (yet). But guarantees are always nice.
One reason I haven't used Haiku in production at Socratify is the lack of structured output, so I hope they'll add it to Haiku 4.5 soon.
It's a bit weird it took Anthropic so long, considering it's been ages since OpenAI and Google did it. I know you could do it through tool calling, but that always just seemed like a bit of a hack to me.
Seems almost quaint in late 2025 to object to a workable technique because it "seemed like a bit of a hack"!
Fair enough! I was getting a failure rate on that which was worse than OpenAI and Google, and the developer semantics didn't really work for me.
Whoa, I always thought that tool use was Anthropic's way of doing structured outputs. Can't believe they're only supporting this now.
Doesn’t seem to be available in the Agent SDK yet
About time, how did it take them so long?
My playing around with structured output on OpenAI leads me to believe that hardly anyone is using this, or the documentation was horrible. Luckily, they accept Pydantic models, but the idea of manually writing a JSON schema (what the docs teach first) is mind-bending.
Anthropic seems to be following suit.
(I'm probably just bitter because they owe me $50K+ for stealing my books).
For the TS devs, Zod introduced a new toJSONSchema() method in v4 that makes this very easy.
https://zod.dev/json-schema
it's also really slow to use structured outputs. mainly makes sense for offline use cases
Does it even help? Get the name of some person => {"name":"Here is the name. Einstein." }
Seems like anthropics API products are always about 2-3 months behind OpenAI. Which is fine.
makes sense
The Google ADK framework has already supported schema output with Gemini for a while.