55[![docs](https://img.shields.io/badge/docs-sunnyadn.github.io%2Fcomprisk-blue)](https://sunnyadn.github.io/comprisk/)
66[![DOI](https://img.shields.io/badge/DOI-10.5281%2Fzenodo.19876282-blue)](https://doi.org/10.5281/zenodo.19876282)
77
8- ** comprisk** — a Python toolkit for competing risks. Ships a scalable,
9- scikit-learn-compatible competing-risks random survival forest plus the
10- three canonical regression / non-parametric methods clinical researchers
11- actually need: Fine-Gray subdistribution-hazard regression, a stand-alone
12- Aalen-Johansen cumulative-incidence estimator with cmprsk-parity
13- variance, and cause-specific Cox PH (see [ Roadmap] ( #roadmap ) ). Designed
14- to remove the Python → R workflow split that applied researchers
15- currently endure for competing-risks survival analysis.
16-
17- > ** Status: alpha.** API and internals may change before v1.0.
18- > ** Renamed from ` crforest ` in 0.3.1** — ` pip install comprisk ` ,
19- > ` from comprisk import CompetingRiskForest ` .
8+ A Python toolkit for ** competing-risks** survival analysis: a scalable,
9+ scikit-learn-compatible competing-risks random survival forest plus the canonical
10+ regression / non-parametric methods — Fine-Gray, Aalen-Johansen CIF, cause-specific
11+ Cox — so applied researchers can drop the Python → R round-trip.
2012
21- ## Highlights
22-
23- - ** The four canonical CR methods, native Python** — Fine-Gray (+ penalized),
24- cause-specific Cox, Aalen-Johansen CIF, and Gray's test, each validated to
25- floating-point tolerances against ` cmprsk ` / ` crrp ` / ` survival ` (parity
26- table [ below] ( #regression-and-non-parametric-models ) ).
27- - ** The only native-Python competing-risks RSF** — cause-specific & composite
28- CR log-rank splitting, Aalen-Johansen CIF, Nelson-Aalen CHF, Wolbers + Uno
29- IPCW concordance, OOB Breiman VIMP, Ishwaran minimal depth, exact TreeSHAP.
30- - ** CR-aware model evaluation** — ` score_cr ` (IPCW time-dependent AUC/Brier,
31- integrated iAUC/IBS with bootstrap CIs) and ` calibration_cr ` replace the
32- CR-mode ` riskRegression::Score() ` / ` plotCalibration() ` blocks in one call.
33- - ** 10–22× faster than [ randomForestSRC] ( https://cran.r-project.org/package=randomForestSRC ) **
34- on real EHR data and ** 16.6–544× faster than [ scikit-survival] ( https://scikit-survival.readthedocs.io/ ) **
35- (n = 5k → 50k), at matched C ≈ 0.85 and without disabling CIF/CHF
36- outputs ([ benchmarks] ( docs/benchmarks.md ) ).
37- - ** Bit-identical to randomForestSRC** with ` equivalence="rfsrc" ` — reproduces
38- the per-tree mtry/nsplit RNG stream for paper-grade reproducibility and
39- rfSRC-baseline migrations.
13+ > ** Status: alpha** — API may change before v1.0. Renamed from ` crforest ` in 0.3.1
14+ > (` pip install comprisk ` ; ` from comprisk import CompetingRiskForest ` ).
4015
41- ## comprisk vs alternatives
16+ ## Highlights
4217
43- | | comprisk | randomForestSRC | scikit-survival |
44- | ------------------------------------------| :------------------------------:| :----------------------------------:| :------------------------:|
45- | Language | Python | R | Python |
46- | Native competing risks | ✓ | ✓ | ✗ (single-event only) |
47- | Aalen–Johansen CIF output | ✓ | ✓ | n/a |
48- | Cumulative hazard at scale | ✓ | ✓ | ✗¹ |
49- | OOB permutation VIMP | ✓ | ✓ | ✗ |
50- | Bit-identical reproducibility mode | ✓ (` equivalence="rfsrc" ` ) | — | n/a |
51- | Scales to n = 10⁶ | ✓ (63 s on i7) | memory-bound past n ≈ 500 000 on consumer hardware | ✗¹ / OOM² |
52- | Default parallelism | ✓ (` n_jobs=-1 ` ) | OpenMP (build-dependent; macOS Apple clang lacks it) | ✓ |
53- | GPU preview | ✓ (CUDA 12) | ✗ | ✗ |
54-
55- ¹ sksurv ` RandomSurvivalForest(low_memory=True) ` is the only mode that
56- scales beyond ~ 10k samples, but it disables ` predict_cumulative_hazard_function `
57- and ` predict_survival_function ` (raises ` NotImplementedError ` ).
58- ² sksurv ` low_memory=False ` exposes CHF / survival outputs but stores per-leaf
59- full CHF arrays; peak RSS reaches 16.8 GB at n = 5k on synthetic, OOMs
60- (> 21.5 GB) at n = 10k on a 24 GB host.
18+ - ** Four canonical CR methods, native Python** — Fine-Gray (+ penalized),
19+ cause-specific Cox, Aalen-Johansen CIF, Gray's test — each validated to
20+ floating-point tolerance against ` cmprsk ` / ` crrp ` / ` survival ` .
21+ - ** The only native-Python CR forest** — composite & cause-specific CR log-rank
22+ splitting, AJ CIF, Nelson-Aalen CHF, Wolbers + Uno IPCW concordance, OOB
23+ Breiman VIMP, Ishwaran minimal depth, exact TreeSHAP.
24+ - ** CR-aware evaluation** — ` score_cr ` (IPCW time-dependent AUC/Brier + bootstrap
25+ CIs) and ` calibration_cr ` , replacing the CR-mode ` riskRegression::Score() ` block.
26+ - **Fast**: 10–22× faster than randomForestSRC on real EHR data and 16.6–544× faster than
27+ scikit-survival (n = 5k → 50k), at matched C ≈ 0.85; n = 10⁶ fits in 63 s. [Benchmarks →](docs/benchmarks.md)
28+ - ** Reproducible** — ` equivalence="rfsrc" ` reproduces rfSRC's per-tree mtry/nsplit
29+ RNG stream bit-for-bit. [ Methodology →] ( docs/equivalence-vs-rfsrc.md )
6130
6231## Install
6332
6433``` bash
6534pip install comprisk # or: uv add comprisk
66- pip install "comprisk[gpu]" # or: uv add 'comprisk[gpu]'
35+ pip install "comprisk[gpu]" # CUDA 12 preview (faster only at low p today)
6736```
6837
69- Requires Python ≥ 3.10. Core dependencies: numpy, scipy, pandas, joblib,
70- numba, scikit-learn. GPU extra adds cupy + CUDA 12 runtime libs (preview;
71- faster only at low feature count today, full rewrite scheduled for v1.1).
38+ Python ≥ 3.10. Core deps: numpy, scipy, pandas, joblib, numba, scikit-learn.
7239
7340## Quickstart
7441
7542``` python
76- import numpy as np
7743from comprisk import CompetingRiskForest
7844
79- # Toy competing-risks data *with signal*: cause-1 risk rises with the first
80- # two features, cause 2 competes, and some subjects are censored.
81- rng = np.random.default_rng(42 )
82- n = 1000
83- X = rng.normal(size = (n, 6 ))
84- lp = X[:, 0 ] + 0.5 * X[:, 1 ]
85- t1 = rng.exponential(np.exp(- lp)) # cause 1 fires sooner when lp is high
86- t2 = rng.exponential(2.0 , size = n) # cause 2 (competing)
87- tc = rng.exponential(4.0 , size = n) # censoring
88- time = np.minimum.reduce([t1, t2, tc])
89- event = np.where((t1 <= t2) & (t1 <= tc), 1 , np.where(t2 <= tc, 2 , 0 )) # 0 = censored
90-
91- # Fit. Defaults: n_estimators=100, max_features="sqrt", logrankCR, n_jobs=-1.
45+ # event: 0 = censored, k≥1 = cause-k event. Defaults: 100 trees, logrankCR, n_jobs=-1.
9246forest = CompetingRiskForest(n_estimators=200, random_state=42).fit(X, time, event)
9347
94- # Aalen-Johansen cumulative incidence over the forest's chosen time grid.
95- cif = forest.predict_cif(X[:5 ]) # (5, n_causes, n_times)
96-
97- # Out-of-bag cause-specific Wolbers concordance — honest (out-of-sample),
98- # no held-out split needed. (forest.score(X, ...) would report the optimistic
99- # in-sample value.)
100- print (" OOB C-index, cause 1:" , forest.oob_score(cause = 1 ))
101- ```
102-
103- ### Explainability and feature selection
104-
105- ``` python
106- # OOB permutation importance (Uno IPCW-scored).
107- vimp = forest.compute_importance(random_state = 42 )
108-
109- # Ishwaran minimal-depth variable selection.
110- selected = forest.minimal_depth().query(" selected" )[" feature" ].tolist()
111-
112- # Exact TreeSHAP attributions (Lundberg 2018, Algorithm 2).
113- shap, base = forest.shap_values(X[:10 ]) # (n, p, n_times, n_causes)
48+ cif = forest.predict_cif(X[:5]) # Aalen-Johansen CIF, shape (5, n_causes, n_times)
49+ print(forest.oob_score(cause=1)) # honest out-of-bag C-index (no holdout split)
50+ shap, base = forest.shap_values(X[:10]) # exact TreeSHAP (n, p, n_times, n_causes)
11451```
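
The shortened quickstart assumes `X`, `time`, and `event` already exist. A minimal NumPy-only sketch of toy inputs, adapted from the data-generation lines this change removes, might look like the following. The sample size, rates, and seed are illustrative, and nothing here calls comprisk itself.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
X = rng.normal(size=(n, 6))

# Cause-1 risk rises with the first two features: a higher linear
# predictor gives a smaller mean latent time, so cause 1 fires sooner.
lp = X[:, 0] + 0.5 * X[:, 1]
t1 = rng.exponential(np.exp(-lp))      # latent time to cause 1
t2 = rng.exponential(2.0, size=n)      # latent time to cause 2 (competing)
tc = rng.exponential(4.0, size=n)      # latent censoring time

# Observed follow-up is the earliest of the three latent times.
time = np.minimum.reduce([t1, t2, tc])

# Event code: 0 = censored, 1 = cause 1, 2 = cause 2.
event = np.where((t1 <= t2) & (t1 <= tc), 1, np.where(t2 <= tc, 2, 0))
```

These arrays have the shapes the quickstart call expects: `X` is `(n, 6)`, while `time` and `event` are length-`n` vectors.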
11552
116- SHAP additivity, per-cause global importance, and per-subject attribution over
117- the time grid are explored interactively (with sliders) in
118- [ ` examples/shap_explain.py ` ] ( examples/shap_explain.py ) (marimo).
53+ Prediction shapes, scoring, cross-validation, VIMP, minimal depth, GPU, and rfSRC
54+ migration — all with runnable code — are in the
55+ ** [ quickstart] ( docs/quickstart.md ) ** . ` CompetingRiskForest ` is a real sklearn
56+ estimator (` cross_val_score ` / ` Pipeline ` work without a wrapper).
11957
120- ### Regression and non-parametric models
121-
122- Beyond the forest, comprisk ships the classical competing-risks toolkit — each
123- validated to floating-point tolerances against its reference R package:
58+ ### Regression & non-parametric models
12459
12560``` python
12661from comprisk import FineGrayRegression
12762
12863fg = FineGrayRegression(cause=1, robust_se=True).fit(X, time=time, event=event)
129- print (fg.coef_, fg.se_) # log subdistribution-HRs
64+ print (fg.coef_, fg.se_) # log subdistribution-HRs
13065```
13166
13267| Estimator | Estimates | R parity |
@@ -137,133 +72,88 @@ print(fg.coef_, fg.se_) # log subdistribution-HRs
13772| ` CumulativeIncidence ` | non-parametric Aalen-Johansen CIF | ` cmprsk::cuminc() ` |
13873| ` gray_test ` | K-sample test for equal CIFs | ` cmprsk::cuminc()$Tests ` to 1e-14 |
13974
140- Worked code for every row — coefficient tables, CIF-by-group plots, the LASSO
141- path — is in [ ` examples/02_regression_models.ipynb ` ] ( examples/02_regression_models.ipynb ) ;
142- data format, prediction shapes, cross-validation, GPU, and rfSRC migration are
143- in [ docs/quickstart.md] ( docs/quickstart.md ) .
144-
145- > ** scikit-learn drop-in.** ` CompetingRiskForest ` is a real sklearn
146- > estimator (` BaseEstimator ` , ` clone() ` -friendly, picklable).
147- > ` cross_val_score ` , ` KFold ` , ` Pipeline ` work without a wrapper — pass
148- > ` Surv.from_arrays(event, time) ` as the ` y ` argument, or use the legacy
149- > 3-arg ` fit(X, time, event) ` form. Full example in
150- > [ docs/quickstart.md § Cross-validation] ( docs/quickstart.md#cross-validation ) .
75+ Worked code for every row is in
76+ [ ` examples/02_regression_models.ipynb ` ] ( examples/02_regression_models.ipynb ) .
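
For orientation, the Aalen-Johansen estimate named in the table can be written in a few lines of NumPy. This is a textbook sketch, not comprisk's implementation: `aalen_johansen_cif` is a hypothetical helper, it omits the variance, and it assumes `event` uses 0 for censoring and k ≥ 1 for cause k. At each distinct event time the cause-k incidence grows by the all-cause survival just before that time multiplied by the cause-k hazard increment.

```python
import numpy as np


def aalen_johansen_cif(time, event, cause):
    """Non-parametric cumulative incidence for one cause (0 = censored)."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event)
    grid = np.unique(time[event > 0])   # sorted distinct event times
    surv = 1.0                          # all-cause Kaplan-Meier S(t-)
    cif = 0.0
    out = np.empty(grid.size)
    for i, t in enumerate(grid):
        at_risk = np.sum(time >= t)                    # still under observation
        d_any = np.sum((time == t) & (event > 0))      # events of any cause at t
        d_k = np.sum((time == t) & (event == cause))   # events of this cause at t
        cif += surv * d_k / at_risk     # increment uses survival just before t
        surv *= 1.0 - d_any / at_risk   # then update all-cause survival
        out[i] = cif
    return grid, out


# Tiny hand-checkable example: four subjects, one censored at t = 3.
toy_time = np.array([1.0, 2.0, 3.0, 4.0])
toy_event = np.array([1, 2, 0, 1])
grid, cif_cause1 = aalen_johansen_cif(toy_time, toy_event, cause=1)
```

Summing the per-cause curves gives one minus the all-cause Kaplan-Meier survival, which is a quick sanity check on any implementation.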
15177
152- ## Roadmap
78+ ## comprisk vs alternatives
15379
154- comprisk is intentionally CR-focused. For non-CR survival methods
155- (general Cox PH, AFT, parametric, deep-survival, Kaplan-Meier as a
156- standalone API), use [ lifelines] ( https://lifelines.readthedocs.io/ ) or
157- [ scikit-survival] ( https://scikit-survival.readthedocs.io/ ) .
80+ | | comprisk | randomForestSRC | scikit-survival |
81+ | ------------------------------------| :-------------------------:| :---------------:| :---------------------:|
82+ | Language | Python | R | Python |
83+ | Native competing risks | ✓ | ✓ | ✗ (single-event) |
84+ | Aalen–Johansen CIF output | ✓ | ✓ | n/a |
85+ | Cumulative hazard at scale | ✓ | ✓ | ✗ (disabled by `low_memory=True`, the only mode that scales) |
86+ | OOB permutation VIMP | ✓ | ✓ | ✗ |
87+ | Bit-identical reproducibility mode | ✓ (` equivalence="rfsrc" ` ) | — | n/a |
88+ | Scales to n = 10⁶ | ✓ (63 s on i7) | memory-bound | ✗ / OOM |
89+ | GPU preview | ✓ (CUDA 12) | ✗ | ✗ |
15890
159- | Version | Module | Status |
160- | ----------| -------------------------------------------------------| ----------------------|
161- | v0.3 | ` CompetingRiskForest ` (CR-RSF) | Shipped |
162- | ** v0.4** | ` FineGrayRegression ` (subdistribution hazard) | Shipped |
163- | ** v0.4** | ` CumulativeIncidence ` (stand-alone Aalen-Johansen) | Shipped |
164- | ** v0.4** | ` gray_test ` (Gray's K-sample log-rank) | Shipped |
165- | ** v0.4** | ` CauseSpecificCox ` (CR-aware censoring) | Shipped |
166- | ** v0.4** | ` score_cr ` / ` calibration_cr ` (CR-aware evaluation) | Shipped |
167- | ** v0.5** | ` PenalizedFineGrayRegression ` (LASSO/ridge/elastic-net/MCP/SCAD) | Shipped |
168- | v1.0 | API freeze + JMLR MLOSS submission | Planned |
169- | v1.1 | Full GPU rewrite | Planned |
91+ scikit-survival's CHF/survival outputs and scaling caveats are detailed in the
92+ [ benchmarks] ( docs/benchmarks.md#vs-scikit-survival-paired-same-machine ) .
17093
17194## Benchmarks
17295
173- Headline numbers — full tables, methodology, and reproducibility scripts
174- in [ docs/benchmarks.md] ( docs/benchmarks.md ) .
96+ Matched-pair, real EHR data (full tables + methodology in [ docs/benchmarks.md] ( docs/benchmarks.md ) ):
17597
176- ** vs randomForestSRC, matched-pair on real EHR data:**
98+ | Cohort | n × p | comprisk | rfSRC (OMP-on) | Speedup |
99+ | ---| ---| ---| ---| ---|
100+ | CHF (cardio) | 75k × 58 | 5.6–9.4 s | 84.8–207.3 s | ** 14–22×** |
101+ | SEER breast | 238k × 17 | 7.0 s | 81.6 s | ** 11.6×** |
177102
178- | Cohort | n × p | Hardware | comprisk | rfSRC OMP-on | Speedup |
179- | ---| ---| ---| ---| ---| ---|
180- | CHF (cardio) | 75k × 58 | Apple M4 / i7-14700K / HPC | 5.6–9.4 s | 84.8–207.3 s | ** 14–22×** |
181- | SEER breast (oncology) | 238k × 17 | HPC Xeon Gold 6148 | 7.0 s | 81.6 s | ** 11.6×** |
103+ Both libraries fit similarly well (C ≈ 0.85); the speedup range tracks feature count, since
104+ rfSRC's per-split scan scales with p. comprisk is also 16.6–544× faster than scikit-survival
104+ (n = 5k → 50k) and fits n = 10⁶ in 63 s on a consumer i7.
182105
183- Both libraries fit similarly well (C ≈ 0.85); the cross-dataset band tracks
184- feature count (rfSRC's per-split scan scales with p). ~ 95× vs rfSRC built
185- without OpenMP (the default R-on-macOS install).
186-
187- ** vs scikit-survival, paired on i7-14700K** — synthetic 2-cause Weibull,
188- p = 58, both libraries at their best config:
106+ ## Roadmap
189107
190- | n | sksurv ` low_memory=True ` | comprisk | speedup |
191- | ---| ---| ---| ---|
192- | 5 000 | 18.2 s | 1.10 s | ** 16.6×** |
193- | 50 000 | 2935 s (49 min) | 5.40 s | ** 544×** |
108+ comprisk is intentionally CR-focused — for non-CR survival (general Cox, AFT,
109+ deep-survival), use [ lifelines] ( https://lifelines.readthedocs.io/ ) or
110+ [ scikit-survival] ( https://scikit-survival.readthedocs.io/ ) .
194111
195- The gap widens super-linearly (sksurv ≈ n^2.2; comprisk ≈ n^0.7), and comprisk
196- still returns the Aalen-Johansen CIF + Nelson-Aalen CHF that sksurv
197- ` low_memory=True ` cannot.
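
The scaling exponents quoted above can be reproduced from the two rows of the table with a two-point log-log slope. The sketch below uses only the timings shown there; `loglog_slope` is an illustrative helper, and a two-point fit is a rough estimate rather than a full scaling study.

```python
import math


def loglog_slope(n_small, t_small, n_large, t_large):
    """Exponent b in t ~ n**b implied by two (n, seconds) measurements."""
    return math.log(t_large / t_small) / math.log(n_large / n_small)


# Timings from the table: n = 5 000 and n = 50 000.
sksurv_exp = loglog_slope(5_000, 18.2, 50_000, 2935.0)
comprisk_exp = loglog_slope(5_000, 1.10, 50_000, 5.40)

print(f"sksurv ~ n^{sksurv_exp:.2f}, comprisk ~ n^{comprisk_exp:.2f}")
```

Rounded to one decimal place, these slopes match the n^2.2 and n^0.7 figures in the paragraph above.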
112+ - ** Shipped (v0.3–0.6):** CR forest, Fine-Gray (+ penalized), cause-specific Cox,
113+ Aalen-Johansen CIF, Gray's test, ` score_cr ` / ` calibration_cr ` .
114+ - ** v1.0 (planned):** API freeze + JMLR MLOSS submission.
115+ - ** v1.1 (planned):** full GPU rewrite.
198116
199- ** Scaling on a consumer desktop:** n = 10⁶ in ** 63 s** on i7-14700K,
200- 14.5 GB RSS. Reproducible via
201- [ ` validation/spikes/lambda/exp5_paper_scale_bench.py ` ] ( validation/spikes/lambda/exp5_paper_scale_bench.py ) .
117+ ## Documentation
202118
203- ## API
119+ 📖 **[Full documentation site](https://sunnyadn.github.io/comprisk/)**: searchable, with an autogenerated API reference.
204120
205- Full parameter lists in the
206- [ API reference] ( https://sunnyadn.github.io/comprisk/reference/ ) ; usage by task
207- in [ docs/quickstart.md ] ( docs/quickstart .md ) . Two forest splitrules are
208- available: ` logrankCR ` (composite competing-risks log-rank, default) and
209- ` logrank ` (cause-specific).
121+ - [Quickstart](docs/quickstart.md): common tasks with runnable code
122+ - [API reference](https://sunnyadn.github.io/comprisk/reference/): full parameter lists
123+ - [Benchmarks](docs/benchmarks.md): full tables, methodology, reproduction scripts
124+ - [Equivalence vs rfSRC](docs/equivalence-vs-rfsrc.md): cross-library validation
125+ - [References](docs/REFERENCES.md): algorithmic provenance
210126
211127## Examples
212128
213- Runnable notebooks in [ ` examples/ ` ] ( examples ) (rendered with output on GitHub —
214- click to view, or open in Colab to run):
129+ Runnable notebooks in [ ` examples/ ` ] ( examples ) (rendered on GitHub; open in Colab to run):
215130
216- - [ ` 01_forest_quickstart.ipynb ` ] ( examples/01_forest_quickstart.ipynb )
217- [ ![ Colab] ( https://colab.research.google.com/assets/colab-badge.svg )] ( https://colab.research.google.com/github/sunnyadn/comprisk/blob/main/examples/01_forest_quickstart.ipynb )
218- — fit → predict CIF → out-of-bag scoring → VIMP → minimal-depth selection
219- - [ ` 02_regression_models.ipynb ` ] ( examples/02_regression_models.ipynb )
220- [ ![ Colab] ( https://colab.research.google.com/assets/colab-badge.svg )] ( https://colab.research.google.com/github/sunnyadn/comprisk/blob/main/examples/02_regression_models.ipynb )
221- — Fine-Gray, cause-specific Cox, Aalen-Johansen by group, Gray's test, penalized Fine-Gray
222- - [ ` shap_explain.py ` ] ( examples/shap_explain.py ) — interactive
223- [ marimo] ( https://marimo.io ) app for TreeSHAP attributions (sliders for forest
224- size and subject); ` uv run --extra examples marimo edit examples/shap_explain.py `
225-
226- ## Documentation
227-
228- 📖 ** [ Full documentation site] ( https://sunnyadn.github.io/comprisk/ ) ** — searchable, with autogenerated API reference.
229-
230- - [ Quickstart] ( docs/quickstart.md ) — common tasks with runnable code
231- - [ API reference] ( https://sunnyadn.github.io/comprisk/reference/ ) — full parameter lists from the docstrings
232- - [ Equivalence vs rfSRC] ( docs/equivalence-vs-rfsrc.md ) — cross-library validation methodology
233- - [ References] ( docs/REFERENCES.md ) — algorithmic provenance (Park-Miller, Bays-Durham, Wolbers 2009, Uno 2011, Cole & Hernán 2008, Breiman 2001, Ishwaran 2008/2014, etc.)
131+ - [ ` 01_forest_quickstart.ipynb ` ] ( examples/01_forest_quickstart.ipynb ) — fit → CIF → OOB scoring → VIMP → minimal-depth selection
132+ - [ ` 02_regression_models.ipynb ` ] ( examples/02_regression_models.ipynb ) — Fine-Gray, cause-specific Cox, AJ by group, Gray's test, penalized FG
133+ - [ ` shap_explain.py ` ] ( examples/shap_explain.py ) — interactive [ marimo] ( https://marimo.io ) TreeSHAP app
234134
235135## Development
236136
237137Requires [ ` uv ` ] ( https://docs.astral.sh/uv/ ) .
238138
239139``` bash
240- uv venv
241- uv pip install -e ".[dev]"
140+ uv venv && uv pip install -e ".[dev]"
242141uv run pre-commit install
243- uv run pytest
244- uv run ruff check .
245- uv run ruff format --check .
142+ uv run pytest && uv run ruff check .
246143```
247144
248- ## License
249-
250- Apache-2.0. See [ LICENSE] ( LICENSE ) and [ NOTICE] ( NOTICE ) .
145+ ## License & citation
251146
252- ## Citation
147+ Apache-2.0 ([ LICENSE] ( LICENSE ) , [ NOTICE] ( NOTICE ) ). Cite via the DOI below (concept-level,
148+ resolves to latest) or GitHub's "Cite this repository" button ([ ` CITATION.cff ` ] ( CITATION.cff ) ):
253149
254150``` bibtex
255151@software{yang_comprisk_2026,
256152 author = {Yang, Sunny and Zhao, Wanqi},
257153 title = {{comprisk: a Python toolkit for competing risks}},
258154 year = {2026},
259155 publisher = {Zenodo},
260- version = {0.3.1},
261156 doi = {10.5281/zenodo.19876282},
262157 url = {https://doi.org/10.5281/zenodo.19876282},
263158}
264159```
265-
266- DOI is concept-level (always resolves to the latest version). GitHub's
267- "Cite this repository" button generates a version-specific record from
268- [ ` CITATION.cff ` ] ( CITATION.cff ) . Algorithmic references in
269- [ ` docs/REFERENCES.md ` ] ( docs/REFERENCES.md ) .