The PAISDB pipeline is designed to construct a comprehensive database of Post Acute Infection Syndromes (PAIS) by systematically collecting, processing, and analyzing biomedical literature. The pipeline integrates data from multiple sources, extracts pathogen-disease associations, and leverages large language models (LLMs) for relationship validation.
- Data Collection: Automated retrieval of abstracts from PubMed using disease-pathogen queries.
- Data Integration: Merging and deduplication of pathogen and disease lists from various databases (e.g., Disbiome, PathoPhenoDB, Wikipedia, DO, GCP).
- Relation Extraction: Use of LLMs (e.g., GPT-4, Llama-2, Mixtral) to classify the strength of evidence for pathogen-disease relationships in abstracts.
- Postprocessing: Full-text mining and scoring of articles for evidence strength.
- Knowledge Display: Database and webserver for querying and visualization.
- `workflow/rules/`: Snakemake rules for each pipeline stage.
- `scripts/`: Python scripts for data collection, processing, and analysis.
- `envs/`: Conda environment files for reproducibility.
- `notebooks/`: Jupyter notebooks for exploratory analysis.
- `tests/`: Test scripts for pipeline validation.
- `src/`: Source data files (disease lists, pathogen lists, etc.).
- `config/`: Configuration files (e.g., `config.yml`).
- `results/`: Output data and intermediate results.
- Configure the Pipeline
  - Edit `config/config.yml` to set paths, API keys, and parameters.
- Set Up Environments
  - Create conda environments as specified in `workflow/envs/`.
- Run with Snakemake
  - From the `workflow/` directory, execute: `snakemake --cores <N>`
  - This will execute all rules defined in the `Snakefile` and included rule files.
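
The run step above can also be scripted. The following is a minimal sketch, not part of the pipeline itself: the helper names `build_snakemake_cmd` and `run_pipeline` are hypothetical, and whether this project should be run with `--use-conda` is an assumption, so that flag is off by default. It previews the job plan with Snakemake's `--dry-run` option before starting the real run.

```python
import shlex
import subprocess
from typing import List


def build_snakemake_cmd(cores: int, dry_run: bool = False,
                        use_conda: bool = False) -> List[str]:
    """Assemble a Snakemake command line as an argument list."""
    if cores < 1:
        raise ValueError("cores must be a positive integer")
    cmd = ["snakemake", "--cores", str(cores)]
    if use_conda:
        # Assumption: only useful if the rules declare conda environments.
        cmd.append("--use-conda")
    if dry_run:
        # Print the planned jobs without executing them.
        cmd.append("--dry-run")
    return cmd


def run_pipeline(cores: int, workdir: str = "workflow") -> None:
    """Preview the plan with a dry run, then execute the pipeline."""
    for dry_run in (True, False):
        cmd = build_snakemake_cmd(cores, dry_run=dry_run)
        print("Running:", shlex.join(cmd))
        # check=True stops here if the dry run (or the real run) fails.
        subprocess.run(cmd, cwd=workdir, check=True)
```

Calling `run_pipeline(4)` from the repository root would run both steps inside `workflow/`; a failed dry run raises `subprocess.CalledProcessError` before any job starts.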
- Abstract Retrieval: Queries are generated by combining disease and pathogen terms (including synonyms) and submitted to PubMed.
- Sources: Disease and pathogen lists are compiled from Disbiome, PathoPhenoDB, Wikipedia, DO, GCP, and in-house lists.
- Deduplication: Disease and pathogen terms are unified and deduplicated using a priority order.
- Relation Extraction: Known causative relationships are excluded from novel association mining.
- Benchmarking: Abstracts are classified for evidence of pathogen-disease relationships using LLMs.
- Models Used: GPT-3.5, GPT-4, Llama-2-70b, Mixtral-8x7B, and others as defined in `workflow/rules/relation_extraction.smk`.
- Full Article Mining: For selected abstracts, full articles are retrieved and mined for additional evidence.
- Scoring: Articles are scored based on the strength of evidence.
- Database: The final database is available for download and exploration on the PAISDB Webserver.
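
The abstract-retrieval and deduplication steps above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual code: the function names are hypothetical, restricting the search to PubMed's `[Title/Abstract]` field is an assumption, and the real priority order of sources is not given here, so the caller supplies one.

```python
from itertools import product
from typing import Dict, Iterable, List, Tuple


def or_group(terms: Iterable[str]) -> str:
    """Join a term and its synonyms into one quoted, OR-ed PubMed clause."""
    # dict.fromkeys drops repeated synonyms while keeping their order.
    quoted = [f'"{t}"[Title/Abstract]' for t in dict.fromkeys(terms)]
    return "(" + " OR ".join(quoted) + ")"


def build_queries(diseases: Dict[str, List[str]],
                  pathogens: Dict[str, List[str]]) -> Dict[Tuple[str, str], str]:
    """Build one query per (disease, pathogen) pair, including synonyms."""
    queries = {}
    for (disease, d_syn), (pathogen, p_syn) in product(diseases.items(),
                                                       pathogens.items()):
        queries[(disease, pathogen)] = (
            or_group([disease, *d_syn]) + " AND " + or_group([pathogen, *p_syn])
        )
    return queries


def deduplicate(terms_by_source: Dict[str, List[str]],
                priority: List[str]) -> Dict[str, str]:
    """Keep one entry per normalized term, credited to the first source in `priority`."""
    kept: Dict[str, str] = {}
    for source in priority:  # earlier sources win ties
        for term in terms_by_source.get(source, []):
            key = " ".join(term.lower().split())  # case- and whitespace-insensitive
            kept.setdefault(key, source)
    return kept
```

Normalizing only case and whitespace is a deliberately conservative choice for the sketch; merging true synonyms across databases would need identifier mapping that is outside its scope.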
- Adding New Data Sources: Update the relevant scripts and configuration in `config/config.yml`.
- Changing LLMs or Prompts: Modify the scripts in `workflow/scripts/relation_extraction/` and update Snakemake rules as needed.
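
When changing prompts, it can help to keep the prompt template and the parsing of the model's reply side by side. The sketch below is purely illustrative: the label set, the template wording, and the function names are all hypothetical and are not taken from the scripts in the repository. It makes no LLM call; it only builds the prompt text and maps a free-text reply back to a label.

```python
from typing import Optional, Tuple

# Hypothetical label set; the categories used by the pipeline are not stated here.
LABELS: Tuple[str, ...] = ("strong", "moderate", "weak", "none")

# Hypothetical prompt wording, shown only to illustrate the structure.
PROMPT_TEMPLATE = (
    "You are reviewing a biomedical abstract.\n"
    "Classify the strength of evidence that {pathogen} is associated with {disease}.\n"
    "Answer with exactly one word from: {labels}.\n\n"
    "Abstract:\n{abstract}\n"
)


def build_prompt(pathogen: str, disease: str, abstract: str) -> str:
    """Fill the template for one pathogen-disease pair and one abstract."""
    return PROMPT_TEMPLATE.format(
        pathogen=pathogen,
        disease=disease,
        labels=", ".join(LABELS),
        abstract=abstract.strip(),
    )


def parse_label(response: str) -> Optional[str]:
    """Return the first known label that appears as a word in the reply, else None."""
    for raw in response.split():
        word = raw.strip(".,;:!\"'()").lower()
        if word in LABELS:
            return word
    return None  # caller decides how to handle unparseable replies
```

Returning `None` for an unparseable reply, rather than guessing a label, keeps such cases visible so they can be retried or reviewed by hand.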
- For more details on the methodology and data sources, see the README.md.
- For technical details on each rule, see the corresponding `.smk` files in `workflow/rules/`.
For questions or contributions, please contact the maintainers listed in the repository.