Intermittent SEGV during concurrent extract() calls #840

@ducroq

Description

When running trafilatura.extract() concurrently (e.g., via concurrent.futures.ThreadPoolExecutor), we intermittently hit segmentation faults that crash the Python process.

Environment

  • trafilatura 2.0.0
  • Python 3.13
  • Windows 11 / Linux (Ubuntu, deployed on both)
  • Concurrent extraction of 50-100+ articles per batch

Reproduction

The crash is intermittent and occurs only when trafilatura.extract() is called from multiple threads at once; it does not occur with sequential processing. The process is killed by SIGSEGV or SIGABRT, so there is no Python traceback.
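Because the process dies without a Python traceback, the standard library's faulthandler module may help narrow down where the crash happens: it dumps the Python stack of each thread when a fatal signal such as SIGSEGV or SIGABRT arrives. Below is a minimal sketch of a reproduction harness. `run_concurrently` and its `extract_fn` parameter are hypothetical names, and the placeholder callable in the `__main__` block would need to be swapped for trafilatura.extract and real HTML input to exercise the actual bug.

```python
import faulthandler
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(extract_fn, documents, workers=8):
    """Call extract_fn on every document from a thread pool.

    Hypothetical helper: extract_fn stands in for trafilatura.extract.
    """
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(extract_fn, documents))


if __name__ == "__main__":
    # Dump the Python stack of all threads to stderr on a fatal signal.
    faulthandler.enable()
    # Assumption: replace str.strip with trafilatura.extract and the
    # strings with downloaded HTML to attempt a reproduction.
    print(run_concurrently(str.strip, [" a ", " b "]))
```

The same effect is available without code changes by running the interpreter with `python -X faulthandler`.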

Workaround

We isolated trafilatura in a ProcessPoolExecutor instead of a ThreadPoolExecutor, so a crash kills the worker process rather than the main pipeline. This has been reliable for us, but it adds process startup and serialization overhead.

from concurrent.futures import ProcessPoolExecutor

# ProcessPoolExecutor confines a trafilatura SIGSEGV/SIGABRT to a worker process
# (suspected cause: lxml's C extensions not being fully thread-safe)
with ProcessPoolExecutor(max_workers=workers) as executor:
    futures = {executor.submit(_fetch_and_extract, url, ...): url for url in urls}
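One detail worth handling with this workaround: when a worker process dies abruptly, concurrent.futures marks the whole pool as broken and every unfinished future raises BrokenProcessPool. The sketch below shows one way to collect results and record which URLs failed. `extract_isolated`, `worker_fn`, and the `executor_cls` parameter are hypothetical names introduced for illustration; `worker_fn` stands in for the `_fetch_and_extract` function above and must be picklable (defined at module level).

```python
from concurrent.futures import ProcessPoolExecutor, as_completed
from concurrent.futures.process import BrokenProcessPool


def extract_isolated(worker_fn, urls, workers=4, executor_cls=ProcessPoolExecutor):
    """Run worker_fn(url) in a pool and return (results, failed_urls).

    Hypothetical helper; executor_cls is injectable so the logic can be
    exercised without spawning processes.
    """
    results, failed = {}, []
    with executor_cls(max_workers=workers) as executor:
        futures = {executor.submit(worker_fn, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except BrokenProcessPool:
                # A worker was killed (e.g. by SIGSEGV). The pool is now
                # unusable and all unfinished futures fail the same way.
                failed.append(url)
            except Exception:
                # Ordinary Python exception raised inside worker_fn.
                failed.append(url)
    return results, failed
```

URLs in the failed list could then be retried in a fresh pool, possibly with a smaller batch, so that a single crash does not discard the rest of the batch. A fuller version would also guard the submit() calls, which raise BrokenProcessPool once the pool is already broken.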

Suspected cause

lxml's C extensions are not fully thread-safe. When multiple threads call trafilatura.extract() simultaneously, the underlying lxml parsing can corrupt shared state and segfault.

Related: #202 (signal handling in non-main threads)

Request

Would it be possible to document the thread-safety limitations, or add internal locking around the lxml calls? Happy to help test any fix.
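To make the locking idea concrete, here is a minimal sketch of what caller-side serialization could look like in the meantime. `serialized` and `_extract_lock` are hypothetical names, not part of trafilatura, and whether a single global lock actually avoids the crash is an untested assumption.

```python
import functools
import threading

# Hypothetical module-level lock shared by all wrapped calls.
_extract_lock = threading.Lock()


def serialized(func):
    """Wrap func so that only one thread executes it at a time."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with _extract_lock:
            return func(*args, **kwargs)
    return wrapper


# Assumed usage (not run here):
#   import trafilatura
#   safe_extract = serialized(trafilatura.extract)
#   text = safe_extract(html)
```

With this wrapper the extraction step itself no longer runs in parallel, so threads would only help with overlapping network I/O; CPU-bound extraction would still need separate processes to scale.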
