@alexdias @Keweiqu There might be a different approach to building a synthetic benchmark that evaluates how well a model discovers and uses the relevant GraphQL schema elements.
- **Start with a medium-sized schema**
  - Define a realistic GraphQL schema (e.g., a few dozen types); a toy example is sketched below.
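To make the later steps concrete, here is a minimal sketch of such a starting schema as an SDL string, checked with graphql-core's `build_schema`; the type names are made up purely for illustration.

```python
# Toy starting schema (illustrative type names only), validated with graphql-core.
from graphql import build_schema

SCHEMA_SDL = """
type User {
  id: ID!
  name: String!
  orders: [Order!]!
}

type Order {
  id: ID!
  total: Float!
  user: User!
}

type Query {
  users: [User!]!
  orders(userId: ID!): [Order!]!
}
"""

schema = build_schema(SCHEMA_SDL)  # raises GraphQLError if the SDL is malformed
```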
- **Generate structured outputs for Q&A**
  - Use structured JSON outputs; loop k times and let the LLM formulate:
    - A user query
    - The corresponding GraphQL query
    - The required types and their fields
  - The LLM outputs an array like:

    ```json
    [
      {
        "id": 0,
        "question": "Can you list all the users available?",
        "model": "gpt-4.1-mini",
        "assistant_parsed": {
          "graphql_query": "query {\n users {\n id\n name\n }\n}",
          "required_types": [
            {
              "type": "User",
              "fields": ["id", "name"]
            }
          ]
        }
      },
      ...
    ]
    ```
  - Execute the query against the schema to check that it is valid (a generation-and-validation sketch follows this step).
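A minimal sketch of what the generation loop could look like, assuming the OpenAI Python SDK's structured-output parsing helper (`client.beta.chat.completions.parse` with a Pydantic model) and graphql-core for the validity check. The prompt wording, `k`, and the helper names here are placeholders, not a final setup.

```python
from pydantic import BaseModel
from graphql import build_schema, parse, validate
from openai import OpenAI


class RequiredType(BaseModel):
    type: str
    fields: list[str]


class QAItem(BaseModel):
    question: str
    graphql_query: str
    required_types: list[RequiredType]


def generate_gold_items(sdl: str, k: int, model: str = "gpt-4.1-mini") -> list[dict]:
    """Ask the LLM k times for (question, query, required types); keep only valid queries."""
    client = OpenAI()
    schema = build_schema(sdl)
    items = []
    for i in range(k):
        completion = client.beta.chat.completions.parse(
            model=model,
            messages=[
                {
                    "role": "system",
                    "content": "Given this GraphQL schema, invent one realistic user question, "
                               "the GraphQL query answering it, and the required types/fields.\n\n" + sdl,
                },
                {"role": "user", "content": f"Generate item {i}."},
            ],
            response_format=QAItem,
        )
        item = completion.choices[0].message.parsed
        if item is None:  # model refused or parsing failed
            continue
        # "Execute" here means: parse and validate the query against the schema.
        if not validate(schema, parse(item.graphql_query)):
            items.append({"id": i, "model": model, "assistant_parsed": item.model_dump()})
    return items
```

Duplicate questions across the k items would still need de-duplication, and running each query against a mock server would be a stricter check than `validate` alone.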
- **Only then, blow up the schema with noise or adversarial types**
  - Iteratively augment the schema with (see the sketch below):
    - Random types and fields
    - Adversarially similar types/fields (also LLM-generated)
  - Keep the original questions and structured outputs fixed as the “gold” reference.
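One possible inflation step, as a sketch: random noise types are appended to the SDL as plain text, and the adversarially similar, LLM-generated types could be concatenated the same way. The naming scheme is illustrative only.

```python
import random
import string


def random_noise_type(name: str, n_fields: int = 5) -> str:
    """Build a syntactically valid but meaningless GraphQL type definition."""
    scalars = ["ID", "String", "Int", "Float", "Boolean"]
    fields = "\n".join(
        f"  {''.join(random.choices(string.ascii_lowercase, k=8))}: {random.choice(scalars)}"
        for _ in range(n_fields)
    )
    return f"type {name} {{\n{fields}\n}}"


def inflate_schema(sdl: str, n_noise_types: int) -> str:
    """Append noise types to the original SDL; re-validate with build_schema afterwards."""
    noise = "\n\n".join(random_noise_type(f"Noise{i}") for i in range(n_noise_types))
    return sdl + "\n\n" + noise
```

For the adversarial variant, `random_noise_type` could be swapped for an LLM call that clones and slightly mutates existing types (e.g., near-duplicates of `User` with overlapping field names).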
- **Evaluate schema discovery & robustness**
  - For each enlarged schema version:
    - Re-run the same questions with the augmented schema.
    - Compare the new structured outputs to the gold outputs.
  - Metrics (a scoring sketch follows this list):
    - Correct identification of required types and fields.
    - Semantic equivalence of the generated GraphQL query.
    - Degradation as schema size and noise increase, or as context limits are approached/exceeded.
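A sketch of the scoring, assuming the gold and re-run outputs keep the structure shown above; query "semantic equivalence" is approximated here by comparing the queries after an AST round-trip with graphql-core, which a stricter executor-based check could replace.

```python
from graphql import parse, print_ast


def type_field_scores(gold_required: list[dict], pred_required: list[dict]) -> tuple[float, float]:
    """Precision/recall over the (type, field) pairs listed under required_types."""
    def pairs(required):
        return {(t["type"], f) for t in required for f in t["fields"]}

    gold_pairs, pred_pairs = pairs(gold_required), pairs(pred_required)
    tp = len(gold_pairs & pred_pairs)
    precision = tp / len(pred_pairs) if pred_pairs else 0.0
    recall = tp / len(gold_pairs) if gold_pairs else 0.0
    return precision, recall


def queries_equivalent(gold_query: str, pred_query: str) -> bool:
    """Cheap proxy for semantic equivalence: identical text after parse + print_ast."""
    try:
        return print_ast(parse(gold_query)) == print_ast(parse(pred_query))
    except Exception:
        return False
```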
- **Benchmark goal**
  - Measure the robustness of:
    - Schema discovery: finding the correct types/fields in a large, noisy schema.
    - Query generation: producing the same correct GraphQL query under schema inflation.
This test would benchmark both the vanilla GraphQL query formulation capability of LLMs and schema discovery, i.e. the recall of retrieved types, with minimal manual oversight.
I tested the structured output array generation on small schemas, and it already works quite well. So we might now think about a good adversarial strategy to blow up the schema and generate a difficult benchmark.