PAISDB Pipeline Documentation

Overview

The PAISDB pipeline builds a database of Post-Acute Infection Syndromes (PAIS) by systematically collecting, processing, and analyzing biomedical literature. It integrates data from multiple sources, extracts pathogen-disease associations, and uses large language models (LLMs) to validate the extracted relationships.

Pipeline Structure

Data Collection: Automated retrieval of abstracts from PubMed using disease-pathogen queries.
Data Integration: Merging and deduplication of pathogen and disease lists from various databases (e.g., Disbiome, PathoPhenoDB, Wikipedia, DO, GCP).
Relation Extraction: Use of LLMs (e.g., GPT-4, Llama-2, Mixtral) to classify the strength of evidence for pathogen-disease relationships in abstracts.
Postprocessing: Full-text mining and scoring of articles for evidence strength.
Knowledge Display: Database and webserver for querying and visualization.

Directory Layout

workflow/
- rules/: Snakemake rules for each pipeline stage.
- scripts/: Python scripts for data collection, processing, and analysis.
- envs/: Conda environment files for reproducibility.
- notebooks/: Jupyter notebooks for exploratory analysis.
- tests/: Test scripts for pipeline validation.
src/: Source data files (disease lists, pathogen lists, etc.).
config/: Configuration files (e.g., config.yml).
results/: Output data and intermediate results.

Running the Pipeline

1. Configure the Pipeline
- Edit config/config.yml to set paths, API keys, and parameters.
2. Set Up Environments
- Create conda environments as specified in workflow/envs/.
3. Run with Snakemake
- From the workflow/ directory, execute:
```
snakemake --cores <N>
```
- Replace <N> with the number of CPU cores to use. This executes all rules defined in the Snakefile and the rule files it includes.

Key Pipeline Steps

1. Data Collection

Abstract Retrieval: Queries are generated by combining disease and pathogen terms (including synonyms) and submitted to PubMed.
Sources: Disease and pathogen lists are compiled from Disbiome, PathoPhenoDB, Wikipedia, DO, GCP, and in-house lists.
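
The query generation described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual script: the function names, the `[Title/Abstract]` field restriction, and the example terms are assumptions. Only the NCBI E-utilities `esearch` endpoint and its `db`, `term`, `retmax`, and `retmode` parameters are standard. No request is sent; the code only builds the query string and URL.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint for PubMed.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def or_group(terms):
    """Join synonyms into one parenthesised OR group.

    Restricting to title/abstract is an assumption made for this sketch.
    """
    quoted = [f'"{term}"[Title/Abstract]' for term in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(disease_terms, pathogen_terms):
    """Combine a disease synonym group and a pathogen synonym group with AND."""
    return f"{or_group(disease_terms)} AND {or_group(pathogen_terms)}"


def build_search_url(query, retmax=100):
    """Return the esearch URL for a query; nothing is fetched here."""
    params = {"db": "pubmed", "term": query, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Hypothetical example terms, not taken from the PAISDB source lists.
query = build_query(
    ["chronic fatigue syndrome", "myalgic encephalomyelitis"],
    ["Epstein-Barr virus", "EBV"],
)
url = build_search_url(query, retmax=50)
print(query)
print(url)
```

Grouping synonyms with OR inside each group and joining the two groups with AND keeps one query per disease-pathogen pair, so each retrieved abstract can be traced back to the pair that produced it.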

2. Data Integration

Deduplication: Disease and pathogen terms are unified and deduplicated using a priority order.
Known Relationships: Known causative relationships are excluded from novel association mining.
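
One way to picture the two integration steps above is the sketch below. It is an illustration under stated assumptions: the priority order in `SOURCE_PRIORITY`, the normalization rule, and all function names are hypothetical, since this document does not specify them.

```python
# Hypothetical priority order: earlier sources win when a term appears twice.
SOURCE_PRIORITY = ["in-house", "DO", "Disbiome", "PathoPhenoDB", "Wikipedia", "GCP"]


def normalize(term):
    """Case-fold and collapse whitespace so trivial variants compare equal."""
    return " ".join(term.casefold().split())


def deduplicate(records, priority=SOURCE_PRIORITY):
    """Keep one (term, source) record per normalized term.

    When the same term comes from several sources, the record from the
    highest-priority source is kept. Unknown sources rank last.
    """
    rank = {source: i for i, source in enumerate(priority)}
    last = len(rank)
    best = {}
    for term, source in records:
        key = normalize(term)
        current = best.get(key)
        if current is None or rank.get(source, last) < rank.get(current[1], last):
            best[key] = (term, source)
    return sorted(best.values(), key=lambda record: normalize(record[0]))


def exclude_known(candidate_pairs, known_pairs):
    """Drop (pathogen, disease) pairs already listed as known causative."""
    known = {(normalize(p), normalize(d)) for p, d in known_pairs}
    return [
        (p, d)
        for p, d in candidate_pairs
        if (normalize(p), normalize(d)) not in known
    ]


# Hypothetical example records.
records = [
    ("Lyme disease", "Wikipedia"),
    ("lyme  disease", "DO"),
    ("Q fever", "Disbiome"),
]
unique_terms = deduplicate(records)
novel_pairs = exclude_known(
    [("Borrelia burgdorferi", "Lyme disease"), ("Epstein-Barr virus", "multiple sclerosis")],
    [("borrelia burgdorferi", "lyme disease")],
)
print(unique_terms)
print(novel_pairs)
```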

3. Relation Extraction with LLMs

Benchmarking: Abstracts are classified for evidence of pathogen-disease relationships using LLMs.
Models Used: GPT-3.5, GPT-4, Llama-2-70b, Mixtral-8x7B, and others as defined in workflow/rules/relation_extraction.smk.
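
The classification step can be sketched independently of any particular model. In the example below the label set, the prompt wording, and the function names are all hypothetical; the pipeline's actual prompts and categories live in its own scripts and rules. The model is passed in as a plain callable so the sketch runs without an API key.

```python
import re

# Hypothetical label set; the real categories are defined by the pipeline's prompts.
LABELS = ("strong", "moderate", "weak", "none")

PROMPT_TEMPLATE = (
    "Classify the strength of evidence that the pathogen '{pathogen}' is "
    "associated with the disease '{disease}' in the abstract below.\n"
    "Answer with exactly one of: {labels}.\n\n"
    "Abstract:\n{abstract}"
)


def build_prompt(pathogen, disease, abstract):
    """Fill the prompt template for one pathogen-disease pair and abstract."""
    return PROMPT_TEMPLATE.format(
        pathogen=pathogen,
        disease=disease,
        abstract=abstract,
        labels=", ".join(LABELS),
    )


def parse_label(reply):
    """Return the first known label found in the reply, or None if there is none."""
    for word in re.findall(r"[a-z]+", reply.casefold()):
        if word in LABELS:
            return word
    return None


def classify(pathogen, disease, abstract, model):
    """Classify one abstract.

    `model` is any callable that takes a prompt string and returns reply text,
    for example a thin wrapper around an LLM client.
    """
    return parse_label(model(build_prompt(pathogen, disease, abstract)))


def fake_model(prompt):
    """Stand-in for a real LLM call, used only to make the sketch runnable."""
    return "Moderate."


label = classify(
    "Epstein-Barr virus",
    "multiple sclerosis",
    "Hypothetical abstract text.",
    fake_model,
)
print(label)
```

Keeping the model behind a callable makes it straightforward to run the same prompt and parser across several models when benchmarking.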

4. Postprocessing

Full Article Mining: For selected abstracts, full articles are retrieved and mined for additional evidence.
Scoring: Articles are scored based on the strength of evidence.
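
This document does not state the scoring formula, so the following is only a sketch of one possible scheme, assuming each mined passage of an article has already been given an evidence label. The labels, the weights, and the choice of taking the maximum are all assumptions made for illustration.

```python
# Hypothetical weights for hypothetical evidence labels.
LABEL_WEIGHTS = {"strong": 3, "moderate": 2, "weak": 1, "none": 0}


def score_article(passage_labels):
    """Score an article by its strongest passage; unknown labels count as 0."""
    return max((LABEL_WEIGHTS.get(label, 0) for label in passage_labels), default=0)


def rank_articles(articles):
    """Return (article_id, score) pairs, highest score first, ties by id.

    `articles` maps an article id to the list of labels of its passages.
    """
    scored = [(article_id, score_article(labels)) for article_id, labels in articles.items()]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical article ids and labels.
ranking = rank_articles(
    {
        "A": ["weak", "none"],
        "B": ["strong", "weak"],
        "C": [],
    }
)
print(ranking)
```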

5. Output

Database: The final database is available for download and exploration at the PAISDB Webserver.

Customization

Adding New Data Sources: Update the relevant scripts in workflow/scripts/ and the configuration in config/config.yml.
Changing LLMs or Prompts: Modify the scripts in workflow/scripts/relation_extraction/ and update Snakemake rules as needed.

References

For more details on the methodology and data sources, see the README.md.
For technical details on each rule, see the corresponding .smk files in workflow/rules/.

Contact

For questions or contributions, please contact the maintainers listed in the repository.
