Commit 25eddea

docs(readme): slim to a scannable landing page
269->159 lines. Cut prose + content duplicated in docs/: full benchmark tables/caveats -> docs/benchmarks.md, quickstart subsections + toy data-gen -> docs/quickstart.md, comparison footnotes -> benchmarks link, roadmap table -> 3 bullets. Kept the scannable signals (highlights, comparison table, regression parity table, one matched-pair benchmark, one quickstart snippet); everything else links out.
1 parent 3f19381 commit 25eddea

1 file changed

Lines changed: 77 additions & 187 deletions

README.md

@@ -5,128 +5,63 @@
55
[![docs](https://img.shields.io/badge/docs-sunnyadn.github.io%2Fcomprisk-blue)](https://sunnyadn.github.io/comprisk/)
66
[![DOI](https://img.shields.io/badge/DOI-10.5281%2Fzenodo.19876282-blue)](https://doi.org/10.5281/zenodo.19876282)
77

8-
**comprisk** — a Python toolkit for competing risks. Ships a scalable,
9-
scikit-learn-compatible competing-risks random survival forest plus the
10-
three canonical regression / non-parametric methods clinical researchers
11-
actually need: Fine-Gray subdistribution-hazard regression, a stand-alone
12-
Aalen-Johansen cumulative-incidence estimator with cmprsk-parity
13-
variance, and cause-specific Cox PH (see [Roadmap](#roadmap)). Designed
14-
to remove the Python → R workflow split that applied researchers
15-
currently endure for competing-risks survival analysis.
16-
17-
> **Status: alpha.** API and internals may change before v1.0.
18-
> **Renamed from `crforest` in 0.3.1:** `pip install comprisk`,
19-
> `from comprisk import CompetingRiskForest`.
8+
A Python toolkit for **competing-risks** survival analysis: a scalable,
9+
scikit-learn-compatible competing-risks random survival forest plus the canonical
10+
regression / non-parametric methods — Fine-Gray, Aalen-Johansen CIF, cause-specific
11+
Cox — so applied researchers can drop the Python → R round-trip.
2012

21-
## Highlights
22-
23-
- **The four canonical CR methods, native Python** — Fine-Gray (+ penalized),
24-
cause-specific Cox, Aalen-Johansen CIF, and Gray's test, each validated to
25-
floating-point tolerances against `cmprsk` / `crrp` / `survival` (parity
26-
table [below](#regression-and-non-parametric-models)).
27-
- **The only native-Python competing-risks RSF** — cause-specific & composite
28-
CR log-rank splitting, Aalen-Johansen CIF, Nelson-Aalen CHF, Wolbers + Uno
29-
IPCW concordance, OOB Breiman VIMP, Ishwaran minimal depth, exact TreeSHAP.
30-
- **CR-aware model evaluation:** `score_cr` (IPCW time-dependent AUC/Brier,
31-
integrated iAUC/IBS with bootstrap CIs) and `calibration_cr` replace the
32-
CR-mode `riskRegression::Score()` / `plotCalibration()` blocks in one call.
33-
- **10–22× faster than [randomForestSRC](https://cran.r-project.org/package=randomForestSRC)**
34-
on real EHR data and **16.6–544× faster than [scikit-survival](https://scikit-survival.readthedocs.io/)**
35-
(n = 5k → 50k), at matched C ≈ 0.85 and without disabling CIF/CHF
36-
outputs ([benchmarks](docs/benchmarks.md)).
37-
- **Bit-identical to randomForestSRC** with `equivalence="rfsrc"` — reproduces
38-
the per-tree mtry/nsplit RNG stream for paper-grade reproducibility and
39-
rfSRC-baseline migrations.
13+
> **Status: alpha** — API may change before v1.0. Renamed from `crforest` in 0.3.1
14+
> (`pip install comprisk`; `from comprisk import CompetingRiskForest`).
4015
41-
## comprisk vs alternatives
16+
## Highlights
4217

43-
| | comprisk | randomForestSRC | scikit-survival |
44-
|------------------------------------------|:------------------------------:|:----------------------------------:|:------------------------:|
45-
| Language | Python | R | Python |
46-
| Native competing risks ||| ✗ (single-event only) |
47-
| Aalen–Johansen CIF output ||| n/a |
48-
| Cumulative hazard at scale ||| ✗¹ |
49-
| OOB permutation VIMP ||||
50-
| Bit-identical reproducibility mode | ✓ (`equivalence="rfsrc"`) || n/a |
51-
| Scales to n = 10⁶ | ✓ (63 s on i7) | memory-bound past n ≈ 500 000 on consumer hardware | ✗¹ / OOM² |
52-
| Default parallelism | ✓ (`n_jobs=-1`) | OpenMP (build-dependent; macOS Apple clang lacks it) ||
53-
| GPU preview | ✓ (CUDA 12) |||
54-
55-
¹ sksurv `RandomSurvivalForest(low_memory=True)` is the only mode that
56-
scales beyond ~10k samples, but it disables `predict_cumulative_hazard_function`
57-
and `predict_survival_function` (raises `NotImplementedError`).
58-
² sksurv `low_memory=False` exposes CHF / survival outputs but stores per-leaf
59-
full CHF arrays; peak RSS reaches 16.8 GB at n = 5k on synthetic, OOMs
60-
(> 21.5 GB) at n = 10k on a 24 GB host.
18+
- **Four canonical CR methods, native Python** — Fine-Gray (+ penalized),
19+
cause-specific Cox, Aalen-Johansen CIF, Gray's test — each validated to
20+
floating-point tolerance against `cmprsk` / `crrp` / `survival`.
21+
- **The only native-Python CR forest** — composite & cause-specific CR log-rank
22+
splitting, AJ CIF, Nelson-Aalen CHF, Wolbers + Uno IPCW concordance, OOB
23+
Breiman VIMP, Ishwaran minimal depth, exact TreeSHAP.
24+
- **CR-aware evaluation:** `score_cr` (IPCW time-dependent AUC/Brier + bootstrap
25+
CIs) and `calibration_cr`, replacing the CR-mode `riskRegression::Score()` block.
26+
- **Fast** — 10–22× vs randomForestSRC on real EHR, 16.6–544× vs scikit-survival
27+
(n = 5k → 50k), n = 10⁶ in 63 s — at matched C ≈ 0.85. [Benchmarks →](docs/benchmarks.md)
28+
- **Reproducible:** `equivalence="rfsrc"` reproduces rfSRC's per-tree mtry/nsplit
29+
RNG stream bit-for-bit. [Methodology →](docs/equivalence-vs-rfsrc.md)
6130

6231
## Install
6332

6433
```bash
6534
pip install comprisk # or: uv add comprisk
66-
pip install "comprisk[gpu]" # or: uv add 'comprisk[gpu]'
35+
pip install "comprisk[gpu]" # CUDA 12 preview (faster only at low p today)
6736
```
6837

69-
Requires Python ≥ 3.10. Core dependencies: numpy, scipy, pandas, joblib,
70-
numba, scikit-learn. GPU extra adds cupy + CUDA 12 runtime libs (preview;
71-
faster only at low feature count today, full rewrite scheduled for v1.1).
38+
Python ≥ 3.10. Core deps: numpy, scipy, pandas, joblib, numba, scikit-learn.
7239

7340
## Quickstart
7441

7542
```python
76-
import numpy as np
7743
from comprisk import CompetingRiskForest
7844

79-
# Toy competing-risks data *with signal*: cause-1 risk rises with the first
80-
# two features, cause 2 competes, and some subjects are censored.
81-
rng = np.random.default_rng(42)
82-
n = 1000
83-
X = rng.normal(size=(n, 6))
84-
lp = X[:, 0] + 0.5 * X[:, 1]
85-
t1 = rng.exponential(np.exp(-lp)) # cause 1 fires sooner when lp is high
86-
t2 = rng.exponential(2.0, size=n) # cause 2 (competing)
87-
tc = rng.exponential(4.0, size=n) # censoring
88-
time = np.minimum.reduce([t1, t2, tc])
89-
event = np.where((t1 <= t2) & (t1 <= tc), 1, np.where(t2 <= tc, 2, 0)) # 0 = censored
90-
91-
# Fit. Defaults: n_estimators=100, max_features="sqrt", logrankCR, n_jobs=-1.
45+
# event: 0 = censored, k≥1 = cause-k event. Defaults: 100 trees, logrankCR, n_jobs=-1.
9246
forest = CompetingRiskForest(n_estimators=200, random_state=42).fit(X, time, event)
9347

94-
# Aalen-Johansen cumulative incidence over the forest's chosen time grid.
95-
cif = forest.predict_cif(X[:5]) # (5, n_causes, n_times)
96-
97-
# Out-of-bag cause-specific Wolbers concordance — honest (out-of-sample),
98-
# no held-out split needed. (forest.score(X, ...) would report the optimistic
99-
# in-sample value.)
100-
print("OOB C-index, cause 1:", forest.oob_score(cause=1))
101-
```
102-
103-
### Explainability and feature selection
104-
105-
```python
106-
# OOB permutation importance (Uno IPCW-scored).
107-
vimp = forest.compute_importance(random_state=42)
108-
109-
# Ishwaran minimal-depth variable selection.
110-
selected = forest.minimal_depth().query("selected")["feature"].tolist()
111-
112-
# Exact TreeSHAP attributions (Lundberg 2018, Algorithm 2).
113-
shap, base = forest.shap_values(X[:10]) # (n, p, n_times, n_causes)
48+
cif = forest.predict_cif(X[:5]) # (5, n_causes, n_times) — Aalen-Johansen
49+
print(forest.oob_score(cause=1)) # honest out-of-bag C-index (no holdout split)
50+
shap, base = forest.shap_values(X[:10]) # exact TreeSHAP (n, p, n_times, n_causes)
11451
```
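The slimmed snippet uses `X`, `time`, and `event` without defining them. A minimal sketch of toy inputs in that `event` coding (0 = censored, 1 and 2 = causes), adapted from the data-generation code this commit moves out of the README; the rates, sample size, and seed are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
X = rng.normal(size=(n, 6))

# Cause-1 hazard rises with the first two features (linear predictor lp).
lp = X[:, 0] + 0.5 * X[:, 1]
t1 = rng.exponential(np.exp(-lp))   # latent cause-1 time, shorter when lp is high
t2 = rng.exponential(2.0, size=n)   # latent cause-2 (competing) time
tc = rng.exponential(4.0, size=n)   # latent censoring time

# Observed time is the earliest of the three; event records which one came first.
time = np.minimum.reduce([t1, t2, tc])
event = np.where((t1 <= t2) & (t1 <= tc), 1, np.where(t2 <= tc, 2, 0))

# Counts of censored, cause-1, and cause-2 subjects.
print(X.shape, time.shape, np.bincount(event, minlength=3))
```

These arrays can then be passed to `fit(X, time, event)` as in the quickstart.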
11552

116-
SHAP additivity, per-cause global importance, and per-subject attribution over
117-
the time grid are explored interactively (with sliders) in
118-
[`examples/shap_explain.py`](examples/shap_explain.py) (marimo).
53+
Prediction shapes, scoring, cross-validation, VIMP, minimal depth, GPU, and rfSRC
54+
migration — all with runnable code — are in the
55+
**[quickstart](docs/quickstart.md)**. `CompetingRiskForest` is a real sklearn
56+
estimator (`cross_val_score` / `Pipeline` work without a wrapper).
11957

120-
### Regression and non-parametric models
121-
122-
Beyond the forest, comprisk ships the classical competing-risks toolkit — each
123-
validated to floating-point tolerances against its reference R package:
58+
### Regression & non-parametric models
12459

12560
```python
12661
from comprisk import FineGrayRegression
12762

12863
fg = FineGrayRegression(cause=1, robust_se=True).fit(X, time=time, event=event)
129-
print(fg.coef_, fg.se_) # log subdistribution-HRs
64+
print(fg.coef_, fg.se_) # log subdistribution-HRs
13065
```
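Because `coef_` is on the log scale, subdistribution hazard ratios and Wald intervals follow by exponentiating. A minimal numpy sketch: the `coef` and `se` arrays are made-up stand-ins for `fg.coef_` and `fg.se_`, and the 95% normal-approximation interval is an assumption about how one might summarize them, not something the library is confirmed to report:

```python
import numpy as np

coef = np.array([0.70, -0.25])  # hypothetical log subdistribution-HRs
se = np.array([0.10, 0.12])     # hypothetical standard errors

z = 1.959964  # two-sided 95% normal quantile
shr = np.exp(coef)             # subdistribution hazard ratios
lower = np.exp(coef - z * se)  # Wald interval, built on the log scale
upper = np.exp(coef + z * se)

for j in range(len(coef)):
    print(f"x{j}: sHR={shr[j]:.2f} (95% CI {lower[j]:.2f} to {upper[j]:.2f})")
```

Building the interval on the log scale and then exponentiating keeps both bounds positive.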
13166

13267
| Estimator | Estimates | R parity |
@@ -137,133 +72,88 @@ print(fg.coef_, fg.se_) # log subdistribution-HRs
13772
| `CumulativeIncidence` | non-parametric Aalen-Johansen CIF | `cmprsk::cuminc()` |
13873
| `gray_test` | K-sample test for equal CIFs | `cmprsk::cuminc()$Tests` to 1e-14 |
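
For readers unfamiliar with what the `CumulativeIncidence` row estimates, the textbook Aalen-Johansen cumulative incidence can be sketched in plain numpy. This is an illustrative reimplementation under the 0 = censored, k ≥ 1 = cause-k coding, not comprisk's code; the function name `aalen_johansen_cif` is hypothetical and no variance is computed:

```python
import numpy as np

def aalen_johansen_cif(time, event, cause):
    """Return (event_times, CIF for `cause`) via the Aalen-Johansen estimator."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event)
    event_times = np.unique(time[event > 0])  # sorted distinct times with any event
    surv_prev = 1.0  # all-cause Kaplan-Meier survival just before the current time
    cif = 0.0
    out = np.empty(len(event_times))
    for i, t in enumerate(event_times):
        at_risk = np.sum(time >= t)
        d_all = np.sum((time == t) & (event > 0))
        d_k = np.sum((time == t) & (event == cause))
        cif += surv_prev * d_k / at_risk    # P(event-free so far) * cause-k hazard
        surv_prev *= 1.0 - d_all / at_risk  # update all-cause survival
        out[i] = cif
    return event_times, out

time = [1, 2, 3, 4, 5]
event = [1, 2, 0, 1, 2]  # 0 = censored
t, cif1 = aalen_johansen_cif(time, event, cause=1)
_, cif2 = aalen_johansen_cif(time, event, cause=2)
print(t, np.round(cif1, 3), np.round(cif2, 3))
```

On this five-subject example both cause-specific CIFs reach 0.5 at the last event time. At every event time the CIFs sum to one minus the all-cause Kaplan-Meier survival, which is a quick sanity check for any implementation.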
13974

140-
Worked code for every row — coefficient tables, CIF-by-group plots, the LASSO
141-
path — is in [`examples/02_regression_models.ipynb`](examples/02_regression_models.ipynb);
142-
data format, prediction shapes, cross-validation, GPU, and rfSRC migration are
143-
in [docs/quickstart.md](docs/quickstart.md).
144-
145-
> **scikit-learn drop-in.** `CompetingRiskForest` is a real sklearn
146-
> estimator (`BaseEstimator`, `clone()`-friendly, picklable).
147-
> `cross_val_score`, `KFold`, `Pipeline` work without a wrapper — pass
148-
> `Surv.from_arrays(event, time)` as the `y` argument, or use the legacy
149-
> 3-arg `fit(X, time, event)` form. Full example in
150-
> [docs/quickstart.md § Cross-validation](docs/quickstart.md#cross-validation).
75+
Worked code for every row is in
76+
[`examples/02_regression_models.ipynb`](examples/02_regression_models.ipynb).
15177

152-
## Roadmap
78+
## comprisk vs alternatives
15379

154-
comprisk is intentionally CR-focused. For non-CR survival methods
155-
(general Cox PH, AFT, parametric, deep-survival, Kaplan-Meier as a
156-
standalone API), use [lifelines](https://lifelines.readthedocs.io/) or
157-
[scikit-survival](https://scikit-survival.readthedocs.io/).
80+
| | comprisk | randomForestSRC | scikit-survival |
81+
|------------------------------------|:-------------------------:|:---------------:|:---------------------:|
82+
| Language | Python | R | Python |
83+
| Native competing risks ||| ✗ (single-event) |
84+
| Aalen–Johansen CIF output ||| n/a |
85+
| Cumulative hazard at scale ||| ✗ (low-memory only) |
86+
| OOB permutation VIMP ||||
87+
| Bit-identical reproducibility mode | ✓ (`equivalence="rfsrc"`) || n/a |
88+
| Scales to n = 10⁶ | ✓ (63 s on i7) | memory-bound | ✗ / OOM |
89+
| GPU preview | ✓ (CUDA 12) |||
15890

159-
| Version | Module | Status |
160-
|----------|-------------------------------------------------------|----------------------|
161-
| v0.3 | `CompetingRiskForest` (CR-RSF) | Shipped |
162-
| **v0.4** | `FineGrayRegression` (subdistribution hazard) | Shipped |
163-
| **v0.4** | `CumulativeIncidence` (stand-alone Aalen-Johansen) | Shipped |
164-
| **v0.4** | `gray_test` (Gray's K-sample log-rank) | Shipped |
165-
| **v0.4** | `CauseSpecificCox` (CR-aware censoring) | Shipped |
166-
| **v0.4** | `score_cr` / `calibration_cr` (CR-aware evaluation) | Shipped |
167-
| **v0.5** | `PenalizedFineGrayRegression` (LASSO/ridge/elastic-net/MCP/SCAD) | Shipped |
168-
| v1.0 | API freeze + JMLR MLOSS submission | Planned |
169-
| v1.1 | Full GPU rewrite | Planned |
91+
scikit-survival's CHF/survival outputs and scaling caveats are detailed in the
92+
[benchmarks](docs/benchmarks.md#vs-scikit-survival-paired-same-machine).
17093

17194
## Benchmarks
17295

173-
Headline numbers — full tables, methodology, and reproducibility scripts
174-
in [docs/benchmarks.md](docs/benchmarks.md).
96+
Matched-pair, real EHR data (full tables + methodology in [docs/benchmarks.md](docs/benchmarks.md)):
17597

176-
**vs randomForestSRC, matched-pair on real EHR data:**
98+
| Cohort | n × p | comprisk | rfSRC (OMP-on) | Speedup |
99+
|---|---|---|---|---|
100+
| CHF (cardio) | 75k × 58 | 5.6–9.4 s | 84.8–207.3 s | **14–22×** |
101+
| SEER breast | 238k × 17 | 7.0 s | 81.6 s | **11.6×** |
177102

178-
| Cohort | n × p | Hardware | comprisk | rfSRC OMP-on | Speedup |
179-
|---|---|---|---|---|---|
180-
| CHF (cardio) | 75k × 58 | Apple M4 / i7-14700K / HPC | 5.6–9.4 s | 84.8–207.3 s | **14–22×** |
181-
| SEER breast (oncology) | 238k × 17 | HPC Xeon Gold 6148 | 7.0 s | 81.6 s | **11.6×** |
103+
Both fit similarly well (C ≈ 0.85); the speedup range tracks feature count, since rfSRC's per-split scan scales with p. Also 16.6–544×
104+
vs scikit-survival (n = 5k → 50k) and n = 10⁶ in 63 s on a consumer i7.
182105

183-
Both libraries fit similarly well (C ≈ 0.85); the cross-dataset band tracks
184-
feature count (rfSRC's per-split scan scales with p). ~95× vs rfSRC built
185-
without OpenMP (the default R-on-macOS install).
186-
187-
**vs scikit-survival, paired on i7-14700K** — synthetic 2-cause Weibull,
188-
p = 58, both libraries at their best config:
106+
## Roadmap
189107

190-
| n | sksurv `low_memory=True` | comprisk | speedup |
191-
|---|---|---|---|
192-
| 5 000 | 18.2 s | 1.10 s | **16.6×** |
193-
| 50 000 | 2935 s (49 min) | 5.40 s | **544×** |
108+
comprisk is intentionally CR-focused — for non-CR survival (general Cox, AFT,
109+
deep-survival), use [lifelines](https://lifelines.readthedocs.io/) or
110+
[scikit-survival](https://scikit-survival.readthedocs.io/).
194111

195-
The gap widens super-linearly (sksurv ≈ n^2.2; comprisk ≈ n^0.7), and comprisk
196-
still returns the Aalen-Johansen CIF + Nelson-Aalen CHF that sksurv
197-
`low_memory=True` cannot.
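
The quoted exponents can be checked from the two table rows. A small sketch, assuming a simple two-point log-log slope; the original figures may have been fit on more points, and the helper name `scaling_exponent` is hypothetical:

```python
import math

def scaling_exponent(n_small, t_small, n_large, t_large):
    """Slope of log(time) against log(n) between two benchmark points."""
    return math.log(t_large / t_small) / math.log(n_large / n_small)

# Fit times in seconds from the table: n = 5 000 and n = 50 000.
sksurv_exp = scaling_exponent(5_000, 18.2, 50_000, 2935.0)
comprisk_exp = scaling_exponent(5_000, 1.10, 50_000, 5.40)

print(f"sksurv ~ n^{sksurv_exp:.1f}, comprisk ~ n^{comprisk_exp:.1f}")
# prints: sksurv ~ n^2.2, comprisk ~ n^0.7
```

An exponent above 1 means fit time grows faster than the sample size, which is why the speedup widens from 16.6× to 544× over one decade of n.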
112+
- **Shipped (v0.3–0.6):** CR forest, Fine-Gray (+ penalized), cause-specific Cox,
113+
Aalen-Johansen CIF, Gray's test, `score_cr` / `calibration_cr`.
114+
- **v1.0 (planned):** API freeze + JMLR MLOSS submission.
115+
- **v1.1 (planned):** full GPU rewrite.
198116

199-
**Scaling on a consumer desktop:** n = 10⁶ in **63 s** on i7-14700K,
200-
14.5 GB RSS. Reproducible via
201-
[`validation/spikes/lambda/exp5_paper_scale_bench.py`](validation/spikes/lambda/exp5_paper_scale_bench.py).
117+
## Documentation
202118

203-
## API
119+
📖 **[Full documentation site](https://sunnyadn.github.io/comprisk/)** — searchable, autogenerated API reference.
204120

205-
Full parameter lists in the
206-
[API reference](https://sunnyadn.github.io/comprisk/reference/); usage by task
207-
in [docs/quickstart.md](docs/quickstart.md). Two forest splitrules are
208-
available: `logrankCR` (composite competing-risks log-rank, default) and
209-
`logrank` (cause-specific).
121+
- [Quickstart](docs/quickstart.md) — common tasks with runnable code
122+
- [API reference](https://sunnyadn.github.io/comprisk/reference/) — full parameter lists
123+
- [Benchmarks](docs/benchmarks.md) — full tables, methodology, reproduction scripts
124+
- [Equivalence vs rfSRC](docs/equivalence-vs-rfsrc.md) — cross-library validation
125+
- [References](docs/REFERENCES.md) — algorithmic provenance
210126

211127
## Examples
212128

213-
Runnable notebooks in [`examples/`](examples) (rendered with output on GitHub —
214-
click to view, or open in Colab to run):
129+
Runnable notebooks in [`examples/`](examples) (rendered on GitHub; open in Colab to run):
215130

216-
- [`01_forest_quickstart.ipynb`](examples/01_forest_quickstart.ipynb)
217-
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sunnyadn/comprisk/blob/main/examples/01_forest_quickstart.ipynb)
218-
— fit → predict CIF → out-of-bag scoring → VIMP → minimal-depth selection
219-
- [`02_regression_models.ipynb`](examples/02_regression_models.ipynb)
220-
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sunnyadn/comprisk/blob/main/examples/02_regression_models.ipynb)
221-
— Fine-Gray, cause-specific Cox, Aalen-Johansen by group, Gray's test, penalized Fine-Gray
222-
- [`shap_explain.py`](examples/shap_explain.py) — interactive
223-
[marimo](https://marimo.io) app for TreeSHAP attributions (sliders for forest
224-
size and subject); `uv run --extra examples marimo edit examples/shap_explain.py`
225-
226-
## Documentation
227-
228-
📖 **[Full documentation site](https://sunnyadn.github.io/comprisk/)** — searchable, with autogenerated API reference.
229-
230-
- [Quickstart](docs/quickstart.md) — common tasks with runnable code
231-
- [API reference](https://sunnyadn.github.io/comprisk/reference/) — full parameter lists from the docstrings
232-
- [Equivalence vs rfSRC](docs/equivalence-vs-rfsrc.md) — cross-library validation methodology
233-
- [References](docs/REFERENCES.md) — algorithmic provenance (Park-Miller, Bays-Durham, Wolbers 2009, Uno 2011, Cole & Hernán 2008, Breiman 2001, Ishwaran 2008/2014, etc.)
131+
- [`01_forest_quickstart.ipynb`](examples/01_forest_quickstart.ipynb) — fit → CIF → OOB scoring → VIMP → minimal-depth selection
132+
- [`02_regression_models.ipynb`](examples/02_regression_models.ipynb) — Fine-Gray, cause-specific Cox, AJ by group, Gray's test, penalized FG
133+
- [`shap_explain.py`](examples/shap_explain.py) — interactive [marimo](https://marimo.io) TreeSHAP app
234134

235135
## Development
236136

237137
Requires [`uv`](https://docs.astral.sh/uv/).
238138

239139
```bash
240-
uv venv
241-
uv pip install -e ".[dev]"
140+
uv venv && uv pip install -e ".[dev]"
242141
uv run pre-commit install
243-
uv run pytest
244-
uv run ruff check .
245-
uv run ruff format --check .
142+
uv run pytest && uv run ruff check .
246143
```
247144

248-
## License
249-
250-
Apache-2.0. See [LICENSE](LICENSE) and [NOTICE](NOTICE).
145+
## License & citation
251146

252-
## Citation
147+
Apache-2.0 ([LICENSE](LICENSE), [NOTICE](NOTICE)). Cite via the DOI below (concept-level,
148+
resolves to latest) or GitHub's "Cite this repository" button ([`CITATION.cff`](CITATION.cff)):
253149

254150
```bibtex
255151
@software{yang_comprisk_2026,
256152
author = {Yang, Sunny and Zhao, Wanqi},
257153
title = {{comprisk: a Python toolkit for competing risks}},
258154
year = {2026},
259155
publisher = {Zenodo},
260-
version = {0.3.1},
261156
doi = {10.5281/zenodo.19876282},
262157
url = {https://doi.org/10.5281/zenodo.19876282},
263158
}
264159
```
265-
266-
DOI is concept-level (always resolves to the latest version). GitHub's
267-
"Cite this repository" button generates a version-specific record from
268-
[`CITATION.cff`](CITATION.cff). Algorithmic references in
269-
[`docs/REFERENCES.md`](docs/REFERENCES.md).
