Pre-registered Evaluation Repository

This repository holds a public record of pre-registered, cryptographically locked evaluations of probabilistic models. Each evaluation specifies its methodology and decision criteria in writing before any prediction is computed, commits the methodology and analysis code to git at a fixed hash, executes once under a frozen pipeline, and publishes the result irrespective of outcome.

The repository functions as a working artifact: anyone may clone it, check out any locked commit, and reproduce the evaluation end-to-end from the same data, code, and parameters that produced the published result.

Evaluations on record

Identifier	Domain	Pre-registration	Result	Outcome
`pre-registration-v1`	Severe weather forecasting (calibrator-v1.0)	docs/PRE_REGISTRATION_v1.md	—	No predictive skill; static-input limitation disclosed up front
`pre-registration-v2`	Severe weather forecasting (calibrator-v2.0, atmospheric inputs)	docs/PRE_REGISTRATION_v2.md	—	No predictive skill at 50 km radius
`verification-v3a-spatial-100km`	Severe weather forecasting (re-scored at 100 km)	docs/PRE_REGISTRATION_v3a.md	docs/RESULT_v3a_2026-05-04.md	Hypothesis falsified; BSS = −0.31
`pre-registration-v3`	Severe weather forecasting (calibrator-v3.0, richer inputs)	docs/PRE_REGISTRATION_v3.md	—	Forward-only verification: 36 predictions accumulated 2026-05-04 18:57Z to 23:05Z, then operational dormancy until 2026-05-09. Continuous accumulation resumed; first meaningful verification window approximately 2026-05-26. Gap will be documented in the eventual result document.
`spinphase-gw-blind-v1`	Gravitational-wave detection (GW150914 blind ranking)	docs/PRE_REGISTRATION_GW_v1.md	docs/RESULT_GW_v1_2026-05-04.md	R = 52 of 100; no detection capability under this test
`spinphase-gw-quietwell-v1`	Gravitational-wave detection (within-segment differential)	docs/PRE_REGISTRATION_GW_QUIETWELL_v1.md	docs/RESULT_GW_QUIETWELL_v1_2026-05-04.md	R_D = 50 of 100; no local emergence under this test
`distilbert-sst2-calibration-v1`	AI/ML model calibration (DistilBERT SST-2)	case_studies/distilbert_sst2/PRE_REGISTRATION_DISTILBERT_SST2_v1.md	docs/RESULT_DISTILBERT_SST2_v1_2026-05-06.md	Not calibrated; bimodal overconfidence at the extremes
`toxic-bert-calibration-v1`	AI/ML model calibration (Unitary Toxic-BERT, Civil Comments)	case_studies/toxic_bert/PRE_REGISTRATION_TOXIC_BERT_v1.md	docs/RESULT_TOXIC_BERT_v1_2026-05-06.md	Calibration drift detected; structured overconfidence in middle-to-high probability range

Seven evaluations have been executed under the protocol, each with a locked outcome; one further evaluation (pre-registration-v3) is accumulating data for future verification. Every result document is bound at the commit recorded in its provenance section and is not subject to revision after lock.

Methodology

Every evaluation in this repository is conducted under the following protocol:

Pre-registration. A document specifies the model or system under test, the data, the pipeline, the metrics, the decision criteria, and the conditions under which each possible outcome would be recorded. The document is signed and committed to git before any prediction or score is computed.
Cryptographic lock. The pre-registration, the analysis code, and (where applicable) the test data are committed to git in a single lock commit. SHA-256 hashes are computed for any held-out artifacts (sealed keys, blinded populations) and committed alongside the methodology so that post-hoc tampering is detectable. The lock commit hash is recorded in the result document.
Frozen execution. The analysis code is not modified between the lock commit and the recording of all outputs. Hashes are verified at runtime; the script refuses to execute against modified data. Re-runs for reproducibility are permitted; revisions to outputs are not.
Publication of outcome. The result document records the outcome under the pre-registered classification, including the cases where the outcome is unfavorable to the system being evaluated. Result documents are bound at commit time and may not be revised. The canonical structure of result documents is documented in docs/RESULT_TEMPLATE.md; the canonical structure of pre-registrations for AI/ML calibration audits is documented in docs/PRE_REGISTRATION_AI_CALIBRATION_TEMPLATE.md.
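
The runtime hash check described under "Cryptographic lock" and "Frozen execution" can be sketched as follows. This is a minimal illustration, not the repository's actual code: the function names `sha256_of_file` and `require_locked` are hypothetical, and the expected digest is assumed to be read from wherever the lock commit recorded it.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def require_locked(path, expected_hex):
    """Raise if the file's digest differs from the digest recorded at lock time.

    Hypothetical helper: a frozen pipeline would call this before any scoring
    so that modified data stops the run instead of producing a result.
    """
    actual = sha256_of_file(path)
    if actual != expected_hex.lower():
        raise RuntimeError(
            f"{path} does not match the locked hash: "
            f"expected {expected_hex}, got {actual}"
        )
```

Calling `require_locked` on every held-out artifact before inference is one way to make the "refuses to execute against modified data" property explicit in code.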

The protocol is portable across domains: the eight evaluations on record include severe weather forecasting verification against ground-truth storm reports, gravitational-wave detection on LIGO open data, and probabilistic model calibration on public natural-language classifiers. The same discipline applies in each case.

Most recent case study

distilbert-sst2-calibration-v1 evaluated distilbert-base-uncased-finetuned-sst-2-english (HuggingFace revision 714eb0fa…) for calibration on the SST-2 validation split (872 examples). The pre-registered claim was that across 10 reliability bins, the model's predicted probabilities would fall within Wilson 95% confidence intervals of observed positive-class frequencies for at least 8 of 10 bins, alongside a positive Brier skill score against base-rate climatology.
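
As a rough illustration of the per-bin criterion, the sketch below computes a Wilson score interval and checks whether each bin's mean predicted probability falls inside it. It makes several assumptions not confirmed by the text above: equal-width bins on [0, 1], the bin mean as the value compared against the interval, and the hypothetical names `wilson_interval` and `bin_report`. The pre-registration document is the authority on the exact procedure.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (z ~ 1.96 for 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return centre - half, centre + half


def bin_report(probs, labels, n_bins=10, min_n=30):
    """Check each sufficiently populated bin against its Wilson interval.

    Assumption: equal-width bins, and a bin passes when its mean predicted
    probability lies inside the interval around the observed positive rate.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    report = []
    for idx, members in enumerate(bins):
        n = len(members)
        if n < min_n:
            continue  # bins below the sample-size requirement are not scored
        mean_p = sum(p for p, _ in members) / n
        positives = sum(y for _, y in members)
        lo, hi = wilson_interval(positives, n)
        report.append({
            "bin": idx,
            "n": n,
            "mean_p": mean_p,
            "observed": positives / n,
            "wilson": (lo, hi),
            "passes": lo <= mean_p <= hi,
        })
    return report
```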

Outcome: not calibrated. The model's accuracy on the SST-2 validation split is 91.06%, matching the model card, and the Brier skill score is +0.667. However, only 2 of the 10 reliability bins met the n ≥ 30 sample-size requirement, and neither passed the Wilson criterion. When the model predicts 99.6% confidence, the observed positive-class frequency at that confidence level is 90.9%, an overconfidence of 8.7 percentage points. When the model predicts a probability of 0.65%, the observed positive rate is 6.1%.
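
For readers unfamiliar with the Brier skill score, a minimal sketch follows. It assumes the climatology reference is a constant forecast equal to the positive-class base rate of the same sample, which may differ from the reference defined in the pre-registration; the function names are illustrative.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    if len(probs) != len(labels) or not labels:
        raise ValueError("probs and labels must be non-empty and equal in length")
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def brier_skill_score(probs, labels):
    """BSS = 1 - BS / BS_ref, with BS_ref from a constant base-rate forecast.

    Assumption: the base rate is taken from the evaluated sample itself.
    """
    base_rate = sum(labels) / len(labels)
    reference = brier_score([base_rate] * len(labels), labels)
    if reference == 0.0:
        raise ValueError("reference score is zero: all labels are identical")
    return 1.0 - brier_score(probs, labels) / reference
```

A score of 1 is a perfect forecast, 0 matches the base-rate forecast, and a negative value is worse than it. Because the score averages over all examples, a model can earn a clearly positive value while still failing a per-bin reliability check.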

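The model's predicted probabilities are bimodal, and the model is systematically overconfident at both extremes. A user who thresholds on confidence ≥ 0.95 and expects 95% precision will instead see something close to 91%. The accuracy claim holds while the calibration claim fails, and the two matter for different downstream uses: accuracy when only the top label is consumed, calibration when the score is treated as a probability.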
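
The thresholding scenario can be checked on any set of scored examples with a few lines of code. The sketch below assumes "precision" means the fraction of true positives among examples whose predicted positive-class probability is at or above the threshold; the name `precision_above` is hypothetical, and the result document may define the quantity differently.

```python
def precision_above(probs, labels, threshold=0.95):
    """Fraction of true positives among predictions with P(positive) >= threshold."""
    selected = [y for p, y in zip(probs, labels) if p >= threshold]
    if not selected:
        raise ValueError("no predictions at or above the threshold")
    return sum(selected) / len(selected)
```

For a well-calibrated model, this value should be at least the threshold; a value noticeably below it indicates overconfidence in the high-probability range.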

The full result, including the per-bin reliability table, decision-criterion application, and an itemized list of what the test does and does not falsify, is at docs/RESULT_DISTILBERT_SST2_v1_2026-05-06.md.

Reproducing a result

Each evaluation is reproducible end-to-end from the locked commit. To reproduce distilbert-sst2-calibration-v1:

git clone https://github.com/earldixon310-max/dynametrix.git
cd dynametrix
git checkout dffd06a    # the lock commit for the DistilBERT result
cd case_studies/distilbert_sst2
python -m pip install -r requirements.txt
python distilbert_sst2_calibration.py run

The script verifies the data SHA-256, downloads the pinned HuggingFace model revision, runs inference on the locked test set, computes the pre-registered calibration metrics, and writes outputs under the same names as those committed at 2f8bb9e. A successful run produces calibration values numerically equivalent to the published result within float32 precision.
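
One way to check a reproduction against the published values is a tolerance comparison on the metric values. The sketch below is an assumption-laden illustration: the metric names, the dictionary layout, and the tolerance of a few float32 machine epsilons are all hypothetical, and the script's actual output format is not described here.

```python
import math

FLOAT32_EPS = 2.0 ** -23  # machine epsilon for IEEE 754 single precision


def metrics_match(published, reproduced, rel_tol=8 * FLOAT32_EPS):
    """Return the names of metrics that are missing or differ beyond tolerance.

    Both arguments are assumed to be dicts mapping metric name to float.
    An empty list means the reproduction agrees within the chosen tolerance.
    """
    mismatches = []
    for name, expected in published.items():
        actual = reproduced.get(name)
        if actual is None or not math.isclose(
            actual, expected, rel_tol=rel_tol, abs_tol=rel_tol
        ):
            mismatches.append(name)
    return mismatches
```

The tolerance is a judgment call; a stricter or looser bound may be appropriate depending on hardware and library versions.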

The same procedure applies to the other evaluations in this repository, with the analysis path adjusted to each evaluation's directory and lock commit.

Requesting an evaluation

The methodology is offered as an independent service for organizations that have a probabilistic model and a need to defend specific claims about it — calibration, generalization, subgroup behavior, robustness. Each engagement begins with a written pre-registration agreed by both parties, followed by a frozen execution and a published outcome under the same protocol used in the evaluations above.

A first engagement is typically scoped to one model, one specific claim, and a held-out test dataset, with a 2–4 week timeline from agreement to delivery. Engagement details and pricing for a pilot are available on request.

To inquire: earl_dixon@hsagconsortium.com

Provenance

This repository is maintained by Earl Dixon. The methodology has been applied across three scientific and applied domains over the period April–May 2026. All evaluations recorded above were executed under the protocol described in the Methodology section, with no exceptions, and all outcomes were locked at commit time.

Inquiries regarding methodology, reproduction failures, or proposed evaluations are welcome.
