Purpose: Compare Cortex-enabled sessions vs. baseline sessions to measure improvement.
Period: 2026-02-11 to 2026-02-13 Sessions recorded: 18 Method: Automated extraction (scripts/testing/run_phase4.py) + manual input
- Cold start time, re-exploration count, briefing token count, and event count: Auto-extracted
- Decision regression count and continuity score: Recorded manually by the developer after each session
| Field | Value |
|---|---|
| Date | 2026-02-11 |
| Task | Research and Adding New Feature to App |
| Cold start time | 0.1 minutes |
| Decision regression count | 0 |
| Re-exploration count | 0 |
| Continuity score | 5 |
| Briefing token count | 0 |
| Event count | 32 |
| Notes | VSCode crashed and this session started after it |
| Session details | 159.0 min, 204 tool calls, 19 files explored, 21 files modified |
| Field | Value |
|---|---|
| Date | 2026-02-11 |
| Task | adding new features |
| Cold start time | 3.3 minutes |
| Decision regression count | 1 |
| Re-exploration count | 6 |
| Continuity score | 4 |
| Briefing token count | 0 |
| Event count | 63 |
| Notes | (none) |
| Session details | 130.2 min, 290 tool calls, 32 files explored, 26 files modified |
| Field | Value |
|---|---|
| Date | 2026-02-11 |
| Task | more refactoring - adding tests |
| Cold start time | 1.6 minutes |
| Decision regression count | 0 |
| Re-exploration count | 10 |
| Continuity score | 4 |
| Briefing token count | 1016 |
| Event count | 117 |
| Notes | (none) |
| Session details | 62.3 min, 176 tool calls, 25 files explored, 13 files modified |
| Field | Value |
|---|---|
| Date | 2026-02-11 |
| Task | add more tests |
| Cold start time | 0.1 minutes |
| Decision regression count | 0 |
| Re-exploration count | 15 |
| Continuity score | 5 |
| Briefing token count | 1016 |
| Event count | 117 |
| Notes | (none) |
| Session details | 87.2 min, 184 tool calls, 26 files explored, 10 files modified |
| Field | Value |
|---|---|
| Date | 2026-02-11 |
| Task | file housekeeping and refactor planning |
| Cold start time | 0.2 minutes |
| Decision regression count | 0 |
| Re-exploration count | 0 |
| Continuity score | 5 |
| Briefing token count | 0 |
| Event count | 0 |
| Notes | (none) |
| Session details | 125.8 min, 70 tool calls, 6 files explored, 3 files modified |
| Field | Value |
|---|---|
| Date | 2026-02-11 |
| Task | refactor |
| Cold start time | 0.3 minutes |
| Decision regression count | 0 |
| Re-exploration count | 0 |
| Continuity score | 5 |
| Briefing token count | 0 |
| Event count | 0 |
| Notes | (none) |
| Session details | 338.5 min, 360 tool calls, 12 files explored, 45 files modified |
| Field | Value |
|---|---|
| Date | 2026-02-12 |
| Task | refactoring |
| Cold start time | 0.1 minutes |
| Decision regression count | 1 |
| Re-exploration count | 3 |
| Continuity score | 3 |
| Briefing token count | 0 |
| Event count | 0 |
| Notes | I asked a lot of questions about app architecture.That may have contributed to pushing out some memory awareness |
| Session details | 80.5 min, 196 tool calls, 11 files explored, 17 files modified |
| Field | Value |
|---|---|
| Date | 2026-02-12 |
| Task | refactor |
| Cold start time | 0.7 minutes |
| Decision regression count | 0 |
| Re-exploration count | 1 |
| Continuity score | 5 |
| Briefing token count | 0 |
| Event count | 0 |
| Notes | (none) |
| Session details | 192.4 min, 394 tool calls, 64 files explored, 42 files modified |
| Field | Value |
|---|---|
| Date | 2026-02-12 |
| Task | refactor |
| Cold start time | 1.0 minutes |
| Decision regression count | 0 |
| Re-exploration count | 16 |
| Continuity score | 4 |
| Briefing token count | 0 |
| Event count | 0 |
| Notes | (none) |
| Session details | 88.5 min, 361 tool calls, 57 files explored, 55 files modified |
| Field | Value |
|---|---|
| Date | 2026-02-13 |
| Task | long refactor session |
| Cold start time | 1.7 minutes |
| Decision regression count | 0 |
| Re-exploration count | 15 |
| Continuity score | 5 |
| Briefing token count | 0 |
| Event count | 0 |
| Notes | long period of inactiity |
| Session details | 793.5 min, 372 tool calls, 32 files explored, 23 files modified |
| Metric | Baseline Avg | Cortex Avg | Improvement | Target | Met? |
|---|---|---|---|---|---|
| Cold start time (min) | 9.0 | 1.4 | 84% reduction | 80%+ | Yes |
| Decision regression | 0.5 | 0.1 | 80% reduction | Near-zero | Yes |
| Re-exploration count | 5.2 | 6.5 | -25% reduction | Significant | No |
| Continuity score (1-5) | 2.9 | 4.7 | +1.8 points | Improvement | Yes |
| Token overhead | N/A | 321 tokens | 0.2% of context | <=15% | Yes |
| Criterion | Target | Result | Status |
|---|---|---|---|
| Cold start time reduction | 80%+ | 84% reduction | Pass |
| Decision regression | Near-zero | 0.1 avg | Pass |
| Plan continuity | Improvement | 2.9 -> 4.7 | Pass |
| Token overhead | <=15% | 0.2% | Pass |
| Extraction accuracy | >90% recall | (manual evaluation) | (pending) |
| User maintenance effort | Near-zero | (manual evaluation) | (pending) |
Average briefing size: 321 tokens (0.2% of context window)
Files explored in multiple comparison sessions:
/Users/jamescruce/Projects/substack-scheduler/substack_poster.py— 8 sessions/Users/jamescruce/Projects/substack-scheduler— 4 sessions/Users/jamescruce/Projects/substack-scheduler/docs/ROADMAP.md— 4 sessions/Users/jamescruce/Projects/jdex-premium/CLAUDE.md— 4 sessions/Users/jamescruce/Projects/substack-scheduler/api/routes/notes.py— 3 sessions/Users/jamescruce/Projects/substack-scheduler/tests/integration/test_api_notes.py— 3 sessions/Users/jamescruce/Projects/substack-scheduler/.github/workflows/ci.yml— 3 sessions/Users/jamescruce/Projects/substack-scheduler/api/dependencies.py— 3 sessions/Users/jamescruce/Projects/substack-scheduler/tests— 3 sessions/Users/jamescruce/Projects/substack-scheduler/tests/test_api_dependencies.py— 3 sessions/Users/jamescruce/Projects/substack-scheduler/tests/test_timezone.py— 3 sessions/Users/jamescruce/Projects/substack-scheduler/tests/integration/test_api_daemon.py— 3 sessions/Users/jamescruce/Projects/jdex-premium/app/src/db.js— 3 sessions/Users/jamescruce/Projects/jdex-premium/app/src/db/core/database.js— 3 sessions/Users/jamescruce/Projects/jdex-premium/app/src/db/repositories/index.js— 3 sessions
Average cold start with Cortex: 1.4 minutes
Briefings were very useful — they consistently contained relevant context from prior sessions. The information included was accurate and helped Claude understand the project state without lengthy re-explanation.
Claude remembered decisions, plans, and prior work consistently across sessions. Prior decisions were referenced naturally without re-debating them, which was one of the primary goals of Cortex.
No major issues encountered. The hooks ran reliably, events were captured correctly, and no incorrect or stale information appeared in briefings during the testing period.
Faster onboarding than expected. Claude got productive more quickly than anticipated — the 84% reduction in cold start time translated to noticeably smoother session starts.
Overall Assessment: [x] Tier 0 meets success criteria
Auto-evaluated criteria: 4/4 passed + qualitative feedback positive
Recommendation:
- Proceed to Tier 1 implementation
Key Learnings:
-
Cold start improvement is dramatic. The 84% reduction (9.0 → 1.4 min) exceeded the 80% target and represents a significant productivity gain.
-
Decision preservation works. Only 2 decision regressions across 18 sessions confirms that immortal events for decisions/rejections are effective.
-
Briefing size is negligible. At 0.2% of context window, there's plenty of headroom for richer briefings in Tier 1.
-
Re-exploration variance is high. The -25% regression in re-exploration count warrants investigation — some sessions had 0 while others had 15-16. This may be task-dependent rather than a Cortex issue.
-
Qualitative feedback is strongly positive. "Very useful" briefings, consistent memory, no major issues, and faster-than-expected onboarding validate the core design.
Next milestone: Tier 1 implementation (SQLite + FTS5 + salience scoring)