@alexdias @Keweiqu There might be a different approach to building a synthetic benchmark that evaluates how well a model discovers and uses the relevant GraphQL schema elements.
- **Start with a medium-sized schema**
  - Define a realistic GraphQL schema (e.g., a few dozen types); a toy example is sketched below.
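To make the later steps concrete, here is a minimal sketch of such a starting schema as an SDL string, checked with graphql-core's `build_schema`; the type names are made up purely for illustration.

```python
# Toy starting schema (illustrative type names only), validated with graphql-core.
from graphql import build_schema

SCHEMA_SDL = """
type User {
  id: ID!
  name: String!
  orders: [Order!]!
}

type Order {
  id: ID!
  total: Float!
  user: User!
}

type Query {
  users: [User!]!
  orders(userId: ID!): [Order!]!
}
"""

schema = build_schema(SCHEMA_SDL)  # raises GraphQLError if the SDL is malformed
```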
- **Generate structured outputs for Q&A**
  - Use structured JSON outputs; loop k times and let the LLM formulate:
    - A user query
    - The corresponding GraphQL query
    - The required types and their fields
  - The LLM outputs an array like:

    ```json
    [
      {
        "id": 0,
        "question": "Can you list all the users available?",
        "model": "gpt-4.1-mini",
        "assistant_parsed": {
          "graphql_query": "query {\n users {\n id\n name\n }\n}",
          "required_types": [
            {
              "type": "User",
              "fields": ["id", "name"]
            }
          ]
        }
      },
      ...
    ]
    ```
  - Execute the query against the schema to check that it is valid (a generation-and-validation sketch follows this step).
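A minimal sketch of what the generation loop could look like, assuming the OpenAI Python SDK's structured-output parsing helper (`client.beta.chat.completions.parse` with a Pydantic model) and graphql-core for the validity check. The prompt wording, `k`, and the helper names here are placeholders, not a final setup.

```python
from pydantic import BaseModel
from graphql import build_schema, parse, validate
from openai import OpenAI


class RequiredType(BaseModel):
    type: str
    fields: list[str]


class QAItem(BaseModel):
    question: str
    graphql_query: str
    required_types: list[RequiredType]


def generate_gold_items(sdl: str, k: int, model: str = "gpt-4.1-mini") -> list[dict]:
    """Ask the LLM k times for (question, query, required types); keep only valid queries."""
    client = OpenAI()
    schema = build_schema(sdl)
    items = []
    for i in range(k):
        completion = client.beta.chat.completions.parse(
            model=model,
            messages=[
                {
                    "role": "system",
                    "content": "Given this GraphQL schema, invent one realistic user question, "
                               "the GraphQL query answering it, and the required types/fields.\n\n" + sdl,
                },
                {"role": "user", "content": f"Generate item {i}."},
            ],
            response_format=QAItem,
        )
        item = completion.choices[0].message.parsed
        if item is None:  # model refused or parsing failed
            continue
        # "Execute" here means: parse and validate the query against the schema.
        if not validate(schema, parse(item.graphql_query)):
            items.append({"id": i, "model": model, "assistant_parsed": item.model_dump()})
    return items
```

Duplicate questions across the k items would still need de-duplication, and running each query against a mock server would be a stricter check than `validate` alone.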
- **Only then, blow up the schema with noise or adversarial types**
  - Iteratively augment the schema with (see the sketch below):
    - Random types and fields
    - Adversarially similar types/fields (also LLM-generated)
  - Keep the original questions and structured outputs fixed as the “gold” reference.
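One possible inflation step, as a sketch: random noise types are appended to the SDL as plain text, and the adversarially similar, LLM-generated types could be concatenated the same way. The naming scheme is illustrative only.

```python
import random
import string


def random_noise_type(name: str, n_fields: int = 5) -> str:
    """Build a syntactically valid but meaningless GraphQL type definition."""
    scalars = ["ID", "String", "Int", "Float", "Boolean"]
    fields = "\n".join(
        f"  {''.join(random.choices(string.ascii_lowercase, k=8))}: {random.choice(scalars)}"
        for _ in range(n_fields)
    )
    return f"type {name} {{\n{fields}\n}}"


def inflate_schema(sdl: str, n_noise_types: int) -> str:
    """Append noise types to the original SDL; re-validate with build_schema afterwards."""
    noise = "\n\n".join(random_noise_type(f"Noise{i}") for i in range(n_noise_types))
    return sdl + "\n\n" + noise
```

For the adversarial variant, `random_noise_type` could be swapped for an LLM call that clones and slightly mutates existing types (e.g., near-duplicates of `User` with overlapping field names).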
- **Evaluate schema discovery & robustness**
  - For each enlarged schema version:
    - Re-run the same questions with the augmented schema.
    - Compare the new structured outputs to the gold outputs.
  - Metrics (a scoring sketch follows this list):
    - Correct identification of required types and fields.
    - Semantic equivalence of the generated GraphQL query.
    - Degradation as schema size and noise increase, or as context limits are approached/exceeded.
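A sketch of the scoring, assuming the gold and re-run outputs keep the structure shown above; query "semantic equivalence" is approximated here by comparing the queries after an AST round-trip with graphql-core, which a stricter executor-based check could replace.

```python
from graphql import parse, print_ast


def type_field_scores(gold_required: list[dict], pred_required: list[dict]) -> tuple[float, float]:
    """Precision/recall over the (type, field) pairs listed under required_types."""
    def pairs(required):
        return {(t["type"], f) for t in required for f in t["fields"]}

    gold_pairs, pred_pairs = pairs(gold_required), pairs(pred_required)
    tp = len(gold_pairs & pred_pairs)
    precision = tp / len(pred_pairs) if pred_pairs else 0.0
    recall = tp / len(gold_pairs) if gold_pairs else 0.0
    return precision, recall


def queries_equivalent(gold_query: str, pred_query: str) -> bool:
    """Cheap proxy for semantic equivalence: identical text after parse + print_ast."""
    try:
        return print_ast(parse(gold_query)) == print_ast(parse(pred_query))
    except Exception:
        return False
```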
- **Benchmark goal**
  - Measure the robustness of:
    - Schema discovery: finding the correct types/fields in a large, noisy schema.
    - Query generation: producing the same correct GraphQL query under schema inflation.
This test would benchmark both the vanilla GraphQL query formulation capability of LLMs and schema discovery, i.e. the recall of retrieved types, with minimal manual oversight.
I tested the structured output array generation on small schemas, and it already works quite well. So we might now think about a good adversarial strategy to blow up the schema and generate a difficult benchmark.