Tier 0 A/B Comparison Results

Purpose: Compare Cortex-enabled sessions vs. baseline sessions to measure improvement.

Period: 2026-02-11 to 2026-02-13 Sessions recorded: 18 Method: Automated extraction (scripts/testing/run_phase4.py) + manual input

Data Collection Method

Cold start time, re-exploration count, briefing token count, and event count: Auto-extracted
Decision regression count and continuity score: Recorded manually by the developer after each session

Cortex-Enabled Session Data

Session 1 (with Cortex)

Field	Value
Date	2026-02-11
Task	Research and Adding New Feature to App
Cold start time	0.1 minutes
Decision regression count	0
Re-exploration count	0
Continuity score	5
Briefing token count	0
Event count	32
Notes	VSCode crashed and this session started after it
Session details	159.0 min, 204 tool calls, 19 files explored, 21 files modified

Session 2 (with Cortex)

Field	Value
Date	2026-02-11
Task	adding new features
Cold start time	3.3 minutes
Decision regression count	1
Re-exploration count	6
Continuity score	4
Briefing token count	0
Event count	63
Notes	(none)
Session details	130.2 min, 290 tool calls, 32 files explored, 26 files modified

Session 3 (with Cortex)

Field	Value
Date	2026-02-11
Task	more refactoring - adding tests
Cold start time	1.6 minutes
Decision regression count	0
Re-exploration count	10
Continuity score	4
Briefing token count	1016
Event count	117
Notes	(none)
Session details	62.3 min, 176 tool calls, 25 files explored, 13 files modified

Session 4 (with Cortex)

Field	Value
Date	2026-02-11
Task	add more tests
Cold start time	0.1 minutes
Decision regression count	0
Re-exploration count	15
Continuity score	5
Briefing token count	1016
Event count	117
Notes	(none)
Session details	87.2 min, 184 tool calls, 26 files explored, 10 files modified

Session 5 (with Cortex)

Field	Value
Date	2026-02-11
Task	file housekeeping and refactor planning
Cold start time	0.2 minutes
Decision regression count	0
Re-exploration count	0
Continuity score	5
Briefing token count	0
Event count	0
Notes	(none)
Session details	125.8 min, 70 tool calls, 6 files explored, 3 files modified

Session 6 (with Cortex)

Field	Value
Date	2026-02-11
Task	refactor
Cold start time	0.3 minutes
Decision regression count	0
Re-exploration count	0
Continuity score	5
Briefing token count	0
Event count	0
Notes	(none)
Session details	338.5 min, 360 tool calls, 12 files explored, 45 files modified

Session 7 (with Cortex)

Field	Value
Date	2026-02-12
Task	refactoring
Cold start time	0.1 minutes
Decision regression count	1
Re-exploration count	3
Continuity score	3
Briefing token count	0
Event count	0
Notes	I asked a lot of questions about app architecture.That may have contributed to pushing out some memory awareness
Session details	80.5 min, 196 tool calls, 11 files explored, 17 files modified

Session 8 (with Cortex)

Field	Value
Date	2026-02-12
Task	refactor
Cold start time	0.7 minutes
Decision regression count	0
Re-exploration count	1
Continuity score	5
Briefing token count	0
Event count	0
Notes	(none)
Session details	192.4 min, 394 tool calls, 64 files explored, 42 files modified

Session 9 (with Cortex)

Field	Value
Date	2026-02-12
Task	refactor
Cold start time	1.0 minutes
Decision regression count	0
Re-exploration count	16
Continuity score	4
Briefing token count	0
Event count	0
Notes	(none)
Session details	88.5 min, 361 tool calls, 57 files explored, 55 files modified

Session 10 (with Cortex)

Field	Value
Date	2026-02-13
Task	long refactor session
Cold start time	1.7 minutes
Decision regression count	0
Re-exploration count	15
Continuity score	5
Briefing token count	0
Event count	0
Notes	long period of inactiity
Session details	793.5 min, 372 tool calls, 32 files explored, 23 files modified

A/B Comparison Results

Metric	Baseline Avg	Cortex Avg	Improvement	Target	Met?
Cold start time (min)	9.0	1.4	84% reduction	80%+	Yes
Decision regression	0.5	0.1	80% reduction	Near-zero	Yes
Re-exploration count	5.2	6.5	-25% reduction	Significant	No
Continuity score (1-5)	2.9	4.7	+1.8 points	Improvement	Yes
Token overhead	N/A	321 tokens	0.2% of context	<=15%	Yes

Success Criteria Evaluation

Criterion	Target	Result	Status
Cold start time reduction	80%+	84% reduction	Pass
Decision regression	Near-zero	0.1 avg	Pass
Plan continuity	Improvement	2.9 -> 4.7	Pass
Token overhead	<=15%	0.2%	Pass
Extraction accuracy	>90% recall	(manual evaluation)	(pending)
User maintenance effort	Near-zero	(manual evaluation)	(pending)

Observations

Briefing Size Analysis

Average briefing size: 321 tokens (0.2% of context window)

Files Most Frequently Explored

Files explored in multiple comparison sessions:

/Users/jamescruce/Projects/substack-scheduler/substack_poster.py — 8 sessions
/Users/jamescruce/Projects/substack-scheduler — 4 sessions
/Users/jamescruce/Projects/substack-scheduler/docs/ROADMAP.md — 4 sessions
/Users/jamescruce/Projects/jdex-premium/CLAUDE.md — 4 sessions
/Users/jamescruce/Projects/substack-scheduler/api/routes/notes.py — 3 sessions
/Users/jamescruce/Projects/substack-scheduler/tests/integration/test_api_notes.py — 3 sessions
/Users/jamescruce/Projects/substack-scheduler/.github/workflows/ci.yml — 3 sessions
/Users/jamescruce/Projects/substack-scheduler/api/dependencies.py — 3 sessions
/Users/jamescruce/Projects/substack-scheduler/tests — 3 sessions
/Users/jamescruce/Projects/substack-scheduler/tests/test_api_dependencies.py — 3 sessions
/Users/jamescruce/Projects/substack-scheduler/tests/test_timezone.py — 3 sessions
/Users/jamescruce/Projects/substack-scheduler/tests/integration/test_api_daemon.py — 3 sessions
/Users/jamescruce/Projects/jdex-premium/app/src/db.js — 3 sessions
/Users/jamescruce/Projects/jdex-premium/app/src/db/core/database.js — 3 sessions
/Users/jamescruce/Projects/jdex-premium/app/src/db/repositories/index.js — 3 sessions

Cold Start Analysis

Average cold start with Cortex: 1.4 minutes

Qualitative Observations

Briefing Quality

Briefings were very useful — they consistently contained relevant context from prior sessions. The information included was accurate and helped Claude understand the project state without lengthy re-explanation.

Context Preservation

Claude remembered decisions, plans, and prior work consistently across sessions. Prior decisions were referenced naturally without re-debating them, which was one of the primary goals of Cortex.

Pain Points

No major issues encountered. The hooks ran reliably, events were captured correctly, and no incorrect or stale information appeared in briefings during the testing period.

Unexpected Benefits

Faster onboarding than expected. Claude got productive more quickly than anticipated — the 84% reduction in cold start time translated to noticeably smoother session starts.

Conclusion

Overall Assessment: [x] Tier 0 meets success criteria

Auto-evaluated criteria: 4/4 passed + qualitative feedback positive

Recommendation:

Proceed to Tier 1 implementation

Key Learnings:

Cold start improvement is dramatic. The 84% reduction (9.0 → 1.4 min) exceeded the 80% target and represents a significant productivity gain.
Decision preservation works. Only 2 decision regressions across 18 sessions confirms that immortal events for decisions/rejections are effective.
Briefing size is negligible. At 0.2% of context window, there's plenty of headroom for richer briefings in Tier 1.
Re-exploration variance is high. The -25% regression in re-exploration count warrants investigation — some sessions had 0 while others had 15-16. This may be task-dependent rather than a Cortex issue.
Qualitative feedback is strongly positive. "Very useful" briefings, consistent memory, no major issues, and faster-than-expected onboarding validate the core design.

Next milestone: Tier 1 implementation (SQLite + FTS5 + salience scoring)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Tier 0 A/B Comparison Results

Data Collection Method

Cortex-Enabled Session Data

Session 1 (with Cortex)

Session 2 (with Cortex)

Session 3 (with Cortex)

Session 4 (with Cortex)

Session 5 (with Cortex)

Session 6 (with Cortex)

Session 7 (with Cortex)

Session 8 (with Cortex)

Session 9 (with Cortex)

Session 10 (with Cortex)

A/B Comparison Results

Success Criteria Evaluation

Observations

Briefing Size Analysis

Files Most Frequently Explored

Cold Start Analysis

Qualitative Observations

Briefing Quality

Context Preservation

Pain Points

Unexpected Benefits

Conclusion

Uh oh!

FilesExpand file tree

AB-COMPARISON-RESULTS.md

Latest commit

History

AB-COMPARISON-RESULTS.md

File metadata and controls

Tier 0 A/B Comparison Results

Data Collection Method

Cortex-Enabled Session Data

Session 1 (with Cortex)

Session 2 (with Cortex)

Session 3 (with Cortex)

Session 4 (with Cortex)

Session 5 (with Cortex)

Session 6 (with Cortex)

Session 7 (with Cortex)

Session 8 (with Cortex)

Session 9 (with Cortex)

Session 10 (with Cortex)

A/B Comparison Results

Success Criteria Evaluation

Observations

Briefing Size Analysis

Files Most Frequently Explored

Cold Start Analysis

Qualitative Observations

Briefing Quality

Context Preservation

Pain Points

Unexpected Benefits

Conclusion