Description
When running trafilatura.extract() concurrently (e.g., via concurrent.futures.ThreadPoolExecutor), we intermittently hit segmentation faults that crash the Python process.
Environment
- trafilatura 2.0.0
- Python 3.13
- Windows 11 and Ubuntu Linux (deployed on both)
- Concurrent extraction of 50-100+ articles per batch
Reproduction
The crash is intermittent and occurs only when trafilatura.extract() is called from multiple threads at the same time. It does not occur with sequential processing. The process is killed by a signal (SIGSEGV or SIGABRT), so there is no Python traceback.
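A minimal sketch of the kind of harness that exercises this pattern. The helper name `extract_concurrently` and the worker count are illustrative assumptions, not part of our pipeline; the extraction callable is passed in so the harness itself has no dependencies beyond the standard library.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_concurrently(extract, documents, max_workers=16):
    """Call extract(document) for every document from a pool of threads.

    `extract` is any one-argument callable, e.g. trafilatura.extract.
    Results are returned in the same order as `documents`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order and re-raises worker exceptions here.
        return list(pool.map(extract, documents))
```

To try it against trafilatura, something like `extract_concurrently(trafilatura.extract, [html] * 100)` run in a loop should approximate our batch sizes. Because the crash is intermittent, a single pass may well complete without error.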
Workaround
We moved the trafilatura calls from a ThreadPoolExecutor to a ProcessPoolExecutor, so a crash kills a worker process rather than the main pipeline. This has been reliable for us, but it adds process start-up and serialization overhead.
from concurrent.futures import ProcessPoolExecutor

# Run extraction in worker processes so that a SEGV/SIGABRT (suspected to come
# from lxml's non-thread-safe C extensions) kills only a worker, not the pipeline.
with ProcessPoolExecutor(max_workers=workers) as executor:
    futures = {executor.submit(_fetch_and_extract, url, ...): url for url in urls}
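One caveat with the process-pool approach: when a worker dies from a signal, `concurrent.futures` marks the whole pool as broken, and the affected futures raise `BrokenProcessPool` instead of returning. The sketch below shows one way a caller might record those failures rather than let them propagate. The names `run_isolated`, `task`, and `executor_cls` are hypothetical and chosen for illustration only.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed
from concurrent.futures.process import BrokenProcessPool


def run_isolated(task, urls, workers=4, executor_cls=ProcessPoolExecutor):
    """Run task(url) for each URL in a pool and separate results from failures.

    Returns (results, failed): results maps url -> return value,
    failed maps url -> the exception raised for that URL.
    """
    results, failed = {}, {}
    with executor_cls(max_workers=workers) as executor:
        futures = {executor.submit(task, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except BrokenProcessPool as exc:
                # A worker was killed (e.g. by SIGSEGV). The pool is now
                # unusable, so the remaining futures will fail the same way.
                failed[url] = exc
            except Exception as exc:
                # Ordinary Python exception raised inside the task.
                failed[url] = exc
    return results, failed
```

URLs that end up in `failed` with a `BrokenProcessPool` error could then be retried in a fresh pool, since only one of them is likely to have triggered the crash.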
Suspected cause
We suspect that lxml's C extensions are not fully thread-safe: when multiple threads call trafilatura.extract() simultaneously, the underlying lxml parsing may corrupt shared state and segfault.
Related: #202 (signal handling in non-main threads)
Request
Would it be possible to document the thread-safety limitations, or add internal locking around the lxml calls? Happy to help test any fix.
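For illustration, the locking idea can be approximated on the caller side with a small wrapper. This is a sketch under assumptions: `serialized` is a hypothetical helper, not something trafilatura provides, and we have not verified that it avoids the crash. It also serializes every call, so it gives up the parallelism that threads were meant to provide.

```python
import threading
from functools import wraps


def serialized(func):
    """Wrap func so that only one thread at a time can execute it."""
    lock = threading.Lock()

    @wraps(func)
    def wrapper(*args, **kwargs):
        # Other threads block here until the current call has returned.
        with lock:
            return func(*args, **kwargs)

    return wrapper
```

Usage would look like `safe_extract = serialized(trafilatura.extract)`, with `safe_extract` then passed to the thread pool in place of the original function.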