CCB-SB/paisdb

PAISDB Pipeline Documentation

Overview

The PAISDB pipeline builds a database of Post-Acute Infection Syndromes (PAIS) by systematically collecting, processing, and analyzing biomedical literature. It integrates data from multiple sources, extracts pathogen-disease associations, and uses large language models (LLMs) to validate the extracted relationships.


Pipeline Structure

  • Data Collection: Automated retrieval of abstracts from PubMed using disease-pathogen queries.
  • Data Integration: Merging and deduplication of pathogen and disease lists from various databases (e.g., Disbiome, PathoPhenoDB, Wikipedia, DO, GCP).
  • Relation Extraction: Use of LLMs (e.g., GPT-4, Llama-2, Mixtral) to classify the strength of evidence for pathogen-disease relationships in abstracts.
  • Postprocessing: Full-text mining and scoring of articles for evidence strength.
  • Knowledge Display: Database and webserver for querying and visualization.

Directory Layout

  • workflow/
    • rules/: Snakemake rules for each pipeline stage.
    • scripts/: Python scripts for data collection, processing, and analysis.
    • envs/: Conda environment files for reproducibility.
    • notebooks/: Jupyter notebooks for exploratory analysis.
    • tests/: Test scripts for pipeline validation.
  • src/: Source data files (disease lists, pathogen lists, etc.).
  • config/: Configuration files (e.g., config.yml).
  • results/: Output data and intermediate results.

Running the Pipeline

  1. Configure the Pipeline

    • Set pipeline parameters in config/config.yml.
  2. Set Up Environments

    • Create conda environments as specified in workflow/envs/.
  3. Run with Snakemake

    • From the workflow/ directory, execute:
      snakemake --cores <N>
    • Snakemake builds the default target (the first rule in the Snakefile) and runs every rule, including those from the included rule files, that is needed to produce it.
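When the run is launched from a script rather than typed by hand, the invocation above can be assembled as an argument list. The following is a minimal sketch, not part of the repository: the helper name `build_snakemake_command` is hypothetical, and whether `--use-conda` applies depends on the rules declaring conda environments, which is an assumption here. `--cores`, `--dry-run`, and `--use-conda` are standard Snakemake command-line options.

```python
import shlex


def build_snakemake_command(cores, dry_run=False, use_conda=False):
    """Return the Snakemake invocation as an argument list (hypothetical helper)."""
    if cores < 1:
        raise ValueError("cores must be a positive integer")
    cmd = ["snakemake", "--cores", str(cores)]
    if dry_run:
        cmd.append("--dry-run")    # list the planned jobs without running them
    if use_conda:
        cmd.append("--use-conda")  # let Snakemake create the per-rule environments
    return cmd


# Show the command that a dry run with four cores would use.
print(shlex.join(build_snakemake_command(4, dry_run=True)))
```

A dry run is a cheap way to check that the configuration resolves to the expected jobs before committing compute time. The list can be passed to `subprocess.run(cmd, cwd="workflow", check=True)` to execute it.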

Key Pipeline Steps

1. Data Collection

  • Abstract Retrieval: Queries are generated by combining disease and pathogen terms (including synonyms) and submitted to PubMed.
  • Sources: Disease and pathogen lists are compiled from Disbiome, PathoPhenoDB, Wikipedia, DO, GCP, and in-house lists.
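The query generation described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's own code: the function names, the input layout (preferred name mapped to a list of synonyms), and the choice of the PubMed `[Title/Abstract]` field tag are all hypothetical, and the example terms are illustrative rather than taken from the project's lists.

```python
from itertools import product


def or_group(terms, field="Title/Abstract"):
    """Join a term and its synonyms into one parenthesised OR group."""
    # dict.fromkeys removes duplicates while preserving order.
    unique = list(dict.fromkeys(t.strip() for t in terms if t.strip()))
    if not unique:
        raise ValueError("at least one non-empty term is required")
    return "(" + " OR ".join(f'"{t}"[{field}]' for t in unique) + ")"


def build_queries(diseases, pathogens):
    """Yield (disease, pathogen, query) for every disease-pathogen pair.

    Both arguments map a preferred name to a list of synonyms.
    """
    for (d, d_syn), (p, p_syn) in product(diseases.items(), pathogens.items()):
        yield d, p, f"{or_group([d, *d_syn])} AND {or_group([p, *p_syn])}"


# Illustrative inputs only.
diseases = {"chronic fatigue syndrome": ["myalgic encephalomyelitis"]}
pathogens = {"Epstein-Barr virus": ["EBV"]}
for disease, pathogen, query in build_queries(diseases, pathogens):
    print(query)
```

Grouping synonyms with OR inside each side and joining the two sides with AND keeps one query per disease-pathogen pair, so every retrieved abstract can be traced back to the pair that produced it.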

2. Data Integration

  • Deduplication: Disease and pathogen terms are unified and deduplicated using a priority order.
  • Known-Relationship Filtering: Pathogen-disease pairs with an already established causative relationship are excluded from novel association mining.
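A minimal sketch of the two integration steps above, assuming terms are matched after case-folding and whitespace normalisation. The actual matching rules and priority order used by the pipeline are not stated here, so the function names, the normalisation, and the source order in the example are all hypothetical.

```python
def normalize(term):
    """Case-fold and collapse whitespace so trivial variants compare equal."""
    return " ".join(term.casefold().split())


def deduplicate(terms_by_source, priority):
    """Merge term lists, keeping each term from its highest-priority source.

    Returns {normalized term: (original spelling, source)}.
    """
    merged = {}
    for source in priority:  # sources listed earlier in `priority` win
        for term in terms_by_source.get(source, []):
            merged.setdefault(normalize(term), (term, source))
    return merged


def drop_known_pairs(candidates, known):
    """Remove (pathogen, disease) pairs already known to be causative."""
    known_keys = {(normalize(p), normalize(d)) for p, d in known}
    return [(p, d) for p, d in candidates
            if (normalize(p), normalize(d)) not in known_keys]


# Illustrative inputs; the priority order shown is an assumption.
sources = {
    "Wikipedia": ["Influenza A virus", "EBV"],
    "Disbiome": ["influenza a  virus"],
}
merged = deduplicate(sources, priority=["Disbiome", "Wikipedia"])
print(merged)
```

Iterating over sources in priority order and using `setdefault` means a term is recorded once, with the spelling and provenance of the most trusted source that contains it.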

3. Relation Extraction with LLMs

  • Benchmarking: Abstracts are classified for evidence of pathogen-disease relationships using LLMs.
  • Models Used: GPT-3.5, GPT-4, Llama-2-70b, Mixtral-8x7B, and others as defined in workflow/rules/relation_extraction.smk.
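The benchmarking step can be pictured as a loop that prompts a model for each abstract, parses its reply into a label, and compares that label with a reference annotation. The sketch below is an assumption-laden illustration: the label set, the prompt wording, and the `ask_model` callable (any function that takes a prompt string and returns the model's reply) are hypothetical and are not the prompts or labels defined in the repository.

```python
LABELS = ("strong", "weak", "none")  # hypothetical label set

PROMPT = (
    "Classify the strength of evidence that {pathogen} is associated with "
    "{disease} in the abstract below. Answer with one word: "
    "strong, weak, or none.\n\nAbstract: {abstract}"
)


def parse_label(reply):
    """Map a free-text model reply to a label; return None if none is found."""
    words = reply.casefold().replace(".", " ").replace(",", " ").split()
    for word in words:
        if word in LABELS:
            return word
    return None


def benchmark(examples, ask_model):
    """Return the fraction of examples whose parsed label matches the gold label.

    Each example is {"inputs": {"pathogen", "disease", "abstract"}, "gold": label}.
    """
    if not examples:
        raise ValueError("no examples to benchmark")
    correct = 0
    for ex in examples:
        prompt = PROMPT.format(**ex["inputs"])
        if parse_label(ask_model(prompt)) == ex["gold"]:
            correct += 1
    return correct / len(examples)
```

Keeping the model behind a plain callable lets the same loop compare several models: each one only needs a thin wrapper that sends the prompt and returns the text of the reply.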

4. Postprocessing

  • Full Article Mining: For selected abstracts, full articles are retrieved and mined for additional evidence.
  • Scoring: Articles are scored based on the strength of evidence.
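One way to turn evidence labels into article scores is sketched below. The scoring scheme is an assumption made for illustration: the numeric weights, the rule that an article is scored by the strongest evidence found in any of its passages, and the function names are hypothetical and do not describe the pipeline's actual scoring.

```python
EVIDENCE_WEIGHTS = {"strong": 2, "weak": 1, "none": 0}  # hypothetical weights


def score_article(passage_labels):
    """Score one article as the strongest evidence found in any passage."""
    return max((EVIDENCE_WEIGHTS[label] for label in passage_labels), default=0)


def rank_articles(labels_by_article):
    """Return (article_id, score) pairs, highest score first, ties by id."""
    scored = {aid: score_article(labels)
              for aid, labels in labels_by_article.items()}
    return sorted(scored.items(), key=lambda item: (-item[1], item[0]))


# Illustrative article identifiers and labels.
print(rank_articles({"a": ["none", "weak"], "b": ["strong"], "c": []}))
```

Taking the maximum rather than the sum keeps long articles from outranking short ones simply because they contain more passages; a sum or mean would be an equally valid choice under different assumptions.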

5. Output

  • Database: The final database can be downloaded and explored through the PAISDB webserver.

Customization

  • Adding New Data Sources: Update the relevant scripts and configuration in config/config.yml.
  • Changing LLMs or Prompts: Modify the scripts in workflow/scripts/relation_extraction/ and update Snakemake rules as needed.

References

  • For more details on the methodology and data sources, see the README.md.
  • For technical details on each rule, see the corresponding .smk files in workflow/rules/.

Contact

For questions or contributions, please contact the maintainers listed in the repository.
