Commit 7f8397e

Merge pull request #49 from ncbo/feat/run_umls_pipeline
Automate the full umls2rdf pipeline
2 parents daf298f + 210691e commit 7f8397e

8 files changed

Lines changed: 576 additions & 18 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -1,6 +1,7 @@
 conf.py
 *.pyc
 output/*
+data/*
 
 *.swp
 sync*

README.md

Lines changed: 65 additions & 10 deletions
@@ -1,11 +1,18 @@
 This project takes a MySQL Unified Medical Language System (UMLS) database and converts the ontologies to RDF using OWL and SKOS as the main schemas.
 
-Virtual Appliance users can review the [documentation in the OntoPortal Administration Guide}(https://ontoportal.github.io/documentation/administration/ontologies/handling_umls).
+Virtual Appliance users can review the [documentation in the OntoPortal Administration Guide](https://ontoportal.github.io/documentation/administration/ontologies/handling_umls).
 
-To use it:
+Recommended workflow:
 
-* Specify your database connection conf.py
-* Specify the SAB ontologies to export in umls.conf
+* Install Python dependencies with <code>pip install -r requirements.txt</code>
+* Configure <code>conf.py</code>
+* Specify the SAB ontologies to export in <code>umls.conf</code>
+* Run the full resumable import/export pipeline with <code>python run_umls_pipeline.py</code>
+
+Generated TTL files are written under a versioned output directory based on
+<code>OUTPUT_FOLDER</code> from <code>conf.py</code>. A common pattern is
+<code>OUTPUT_FOLDER = "output/%s" % UMLS_VERSION.upper()</code>, which writes to
+<code>output/2025AB</code>.
 
 The umls.conf configuration file must contain one ontology per line. The lines are comma separated tuples where the elements are:
 
@@ -23,11 +30,59 @@ umls2rdf.py is designed to be an offline, run-once process.
 It's memory intensive and exports all of the default ontologies in umls.conf in 3h 30min.
 The ontologies listed in umls.conf are the UMLS ontologies accessible in [BioPortal](https://bioportal.bioontology.org/).
 
-If you get an error when installing the MySQL-python python library, https://stackoverflow.com/questions/12218229/my-config-h-file-not-found-when-intall-mysql-python-on-osx-10-8 may be of help.
+To download the full UMLS release archive outside the full pipeline, run:
+
+<pre>
+python download_umls.py
+</pre>
+
+The downloader returns the local path to the downloaded archive. This step only
+fetches and extracts the pre-built UMLS release; you still need to load the
+UMLS tables into MySQL before running <code>umls2rdf.py</code>. The script uses
+<code>UMLS_VERSION</code> and <code>UMLS_API_KEY</code> from <code>conf.py</code>.
+If <code>UMLS_DOWNLOAD_DIR</code> is set, the zip archive is stored under that
+directory. If it is not set, the library default <code>~/.data/bio/umls</code>
+is used. By default, the archive is extracted into an
+<code>extracted</code> subdirectory next to the downloaded zip. You can override
+that location with <code>UMLS_EXTRACT_DIR</code>.
+
+To create the target MySQL database with explicit UTF-8 settings outside the
+full pipeline, run:
+
+<pre>
+python create_mysql_db.py
+</pre>
+
+The script creates or updates <code>DB_NAME</code> from <code>conf.py</code>
+with <code>utf8mb4</code> character set and
+<code>utf8mb4_unicode_ci</code> collation.
+
+To run the full UMLS pipeline end-to-end, use:
+
+<pre>
+python run_umls_pipeline.py
+</pre>
+
+The pipeline performs these stages:
+
+* Download the configured UMLS full release archive
+* Extract the release only when the extracted <code>META</code> directory is not already present
+* Recreate the configured <code>DB_NAME</code> and load it with the extracted <code>META/populate_mysql_db.sh</code> script
+* Run <code>umls2rdf.py</code>
 
-If running a Windows 10 OS with MySQL, the following tips may be of help.
+The pipeline patches loader settings from <code>conf.py</code> into a generated
+copy of <code>populate_mysql_db.sh</code>, and it patches
+<code>META/mysql_tables.sql</code> in place to replace
+<code>@LINE_TERMINATION@</code>. Pipeline state is stored under
+<code>PIPELINE_WORK_DIR</code> (default:
+<code>data/pipeline/&lt;UMLS_VERSION&gt;</code>) and reruns skip completed steps
+after validating the extracted files, MySQL tables, and RDF output. Add
+<code>MYSQL_HOME</code> to <code>conf.py</code>; if your MySQL client is at
+<code>/usr/bin/mysql</code>, set <code>MYSQL_HOME = "/usr"</code>. Pipeline
+stdout and stderr are appended to <code>PIPELINE_LOG_FILE</code> when set, or
+to <code>data/pipeline/&lt;UMLS_VERSION&gt;/pipeline.log</code> by default.
 
-- Install [MySQL 5.5](https://dev.mysql.com/downloads/mysql/5.5.html#downloads) to avoid the InnoDB space [disclaimer](https://www.nlm.nih.gov/research/umls/implementation_resources/scripts/README_RRF_MySQL_Output_Stream.html) by NLM.
-- [Python 2.7.x](https://www.python.org/downloads/) should be used to avoid syntax errors on 'raise Attribute'
-- For installtion of the MySQLdb module <pre>python -m pip install MySQLdb</pre> is error prone. Install with executable [MySQL-python-1.2.3.win-amd64-py2.7](http://www.codegood.com/archives/129) (last known location).
-- Create your RRF subset(s) using mmsys with the MySQL load option, load your database, edit conf.py and umls.py to specifications, run umsl2rdf.py
+If <code>PROCESS_ONLY_CURRENT_UMLS_VERSION</code> is set to <code>True</code>,
+the exporter only processes ontologies whose <code>MRSAB.IMETA</code> exactly
+matches <code>UMLS_VERSION</code>. Ontologies with a different value are skipped
+and logged.
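
Note on the `PROCESS_ONLY_CURRENT_UMLS_VERSION` paragraph in the README diff: the exact-match rule can be pictured as a simple filter. This is a minimal sketch and not the exporter's actual code. The function name `select_ontologies` and the `(SAB, IMETA)` row shape are assumptions made for illustration, and the sample values are made up.

```python
def select_ontologies(rows, umls_version, only_current):
    """Split (sab, imeta) rows into rows to export and rows to skip.

    `rows` is a hypothetical list of (SAB, IMETA) pairs, e.g. as they might
    be read from the MRSAB table; the real exporter may use another shape.
    """
    if not only_current:
        # Flag disabled: every configured ontology is exported.
        return list(rows), []
    selected, skipped = [], []
    for sab, imeta in rows:
        # Exact string comparison, as the README describes.
        if imeta == umls_version:
            selected.append((sab, imeta))
        else:
            skipped.append((sab, imeta))
    return selected, skipped


if __name__ == "__main__":
    # Illustrative values only, not taken from a real UMLS release.
    rows = [("SAB_ONE", "2025AB"), ("SAB_TWO", "2010AA")]
    selected, skipped = select_ontologies(rows, "2025AB", only_current=True)
    for sab, imeta in skipped:
        print("skipping %s (IMETA=%s)" % (sab, imeta))
```

Because the comparison is exact, a case difference between `UMLS_VERSION` and the stored `IMETA` value would cause an ontology to be skipped under this sketch.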

conf_sample.py

Lines changed: 23 additions & 7 deletions
@@ -1,19 +1,35 @@
-#Folder to dump the RDF files.
-OUTPUT_FOLDER = "output"
-
 UMLS_VERSION = "2025AB"
 
-#DB Config
+# Folder to dump the RDF files.
+OUTPUT_FOLDER = "output/%s" % UMLS_VERSION.upper()
+
+# DB Config
 DB_HOST = "localhost"
 DB_NAME = "umls%s" % UMLS_VERSION.lower()
 DB_USER = "umls"
 DB_PASS = "umls"
-
-# Define the base URI used to generate the concepts URI
-UMLS_BASE_URI = "http://purl.bioontology.org/ontology/"
+MYSQL_HOME = "/usr"
 
 # Include the semantic type concepts for each Ontology file generated
 INCLUDE_SEMANTIC_TYPES = True
 
+# Define the base URI used to generate the concepts URI
+UMLS_BASE_URI = "http://purl.bioontology.org/ontology/"
+
 # Only process ontologies updated in this UMLS release (MRSAB.IMETA == UMLS_VERSION)
 PROCESS_ONLY_CURRENT_UMLS_VERSION = False
+
+# Pipeline config
+UMLS_API_KEY = "your umls api key"
+
+# Optional: final umls download directory, e.g. data/umls
+# UMLS_DOWNLOAD_DIR = "data/umls"
+
+# Optional: directory for extracted full UMLS contents
+# UMLS_EXTRACT_DIR = "data/umls-extracted"
+
+# Optional: working directory for pipeline state and patched loader script
+# PIPELINE_WORK_DIR = "data/pipeline"
+
+# Optional: pipeline log file path
+# PIPELINE_LOG_FILE = "data/pipeline/pipeline.log"
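
Note on the optional pipeline settings: `run_umls_pipeline.py` is not shown in this part of the diff, so the following is only a sketch of how the documented defaults and the "reruns skip completed steps" behaviour might fit together. The helper names `resolve_pipeline_paths` and `run_step`, and the JSON state file, are hypothetical and not taken from the commit.

```python
import json
from pathlib import Path
from types import SimpleNamespace


def resolve_pipeline_paths(conf):
    """Return (work_dir, log_file), falling back to the README defaults."""
    default_dir = Path("data/pipeline") / conf.UMLS_VERSION
    work_dir = Path(getattr(conf, "PIPELINE_WORK_DIR", None) or default_dir)
    log_file = Path(
        getattr(conf, "PIPELINE_LOG_FILE", None) or default_dir / "pipeline.log"
    )
    return work_dir, log_file


def run_step(state_file, name, action, is_valid):
    """Run `action` unless `name` is recorded as done and still validates.

    Returns True when the step ran and False when it was skipped.
    """
    state = json.loads(state_file.read_text()) if state_file.exists() else {}
    if state.get(name) == "done" and is_valid():
        return False
    action()
    state[name] = "done"
    state_file.parent.mkdir(parents=True, exist_ok=True)
    state_file.write_text(json.dumps(state, indent=2))
    return True


# A stand-in for the conf module with only the required setting defined.
example_conf = SimpleNamespace(UMLS_VERSION="2025AB")
```

In this sketch a step is re-run whenever its validation callback fails, even if the state file marks it as done, which mirrors the README's statement that completed steps are skipped only after validation.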

create_mysql_db.py

Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
+#!/usr/bin/env python3
+
+import argparse
+
+import pymysql
+
+import conf
+
+
+def connect_server():
+    return pymysql.connect(
+        host=conf.DB_HOST,
+        user=conf.DB_USER,
+        passwd=conf.DB_PASS,
+        charset="utf8mb4",
+        autocommit=True,
+    )
+
+
+def database_exists(connection, db_name):
+    with connection.cursor() as cursor:
+        cursor.execute("SHOW DATABASES LIKE %s", (db_name,))
+        return cursor.fetchone() is not None
+
+
+def create_database(connection, db_name):
+    with connection.cursor() as cursor:
+        cursor.execute(
+            """
+            CREATE DATABASE IF NOT EXISTS `{db_name}`
+            CHARACTER SET utf8mb4
+            COLLATE utf8mb4_unicode_ci
+            """.format(db_name=db_name)
+        )
+        cursor.execute(
+            "ALTER DATABASE `{db_name}` CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci".format(
+                db_name=db_name
+            )
+        )
+
+
+def drop_database(connection, db_name):
+    with connection.cursor() as cursor:
+        cursor.execute("DROP DATABASE IF EXISTS `{db_name}`".format(db_name=db_name))
+
+
+def ensure_database(db_name, recreate=False):
+    connection = connect_server()
+    try:
+        if recreate:
+            drop_database(connection, db_name)
+        create_database(connection, db_name)
+    finally:
+        connection.close()
+
+
+def parse_args():
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "--recreate",
+        action="store_true",
+        help="Drop the configured database before creating it.",
+    )
+    return parser.parse_args()
+
+
+def main():
+    args = parse_args()
+    ensure_database(conf.DB_NAME, recreate=args.recreate)
+    print(conf.DB_NAME)
+
+
+if __name__ == "__main__":
+    main()
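
Note on `create_database` and `drop_database`: both format `db_name` directly into a backtick-quoted identifier, since SQL identifiers cannot be passed as query parameters. `DB_NAME` comes from a local `conf.py`, so this is probably fine in practice. If a guard were wanted, one option is to validate the name before interpolation. The helper below is a hypothetical sketch and is not part of the commit; the allowed character set is a deliberately conservative assumption.

```python
import re

# Conservative whitelist: ASCII letters, digits, and underscore only.
_SAFE_NAME = re.compile(r"[A-Za-z0-9_]+")


def quote_identifier(name):
    """Return `name` wrapped in backticks, or raise if it looks unsafe."""
    if not _SAFE_NAME.fullmatch(name):
        raise ValueError("unsafe database name: %r" % (name,))
    return "`%s`" % name


if __name__ == "__main__":
    # "umls2025ab" is what DB_NAME evaluates to in conf_sample.py.
    print("DROP DATABASE IF EXISTS %s" % quote_identifier("umls2025ab"))
```

`re.fullmatch` is used instead of `^...$` so that a trailing newline in the name is rejected as well.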

download_umls.py

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+#!/usr/bin/env python3
+
+import os
+from pathlib import Path
+import zipfile
+
+import conf
+
+
+def get_extract_dir(zip_path):
+    extract_dir = getattr(conf, "UMLS_EXTRACT_DIR", None)
+    if extract_dir:
+        return Path(extract_dir).expanduser().resolve()
+    return zip_path.parent / "extracted"
+
+
+def main():
+    from umls_downloader import download_umls_full
+
+    download_dir = getattr(conf, "UMLS_DOWNLOAD_DIR", None)
+    if download_dir:
+        download_dir = Path(download_dir).expanduser().resolve()
+        os.environ["BIO_HOME"] = download_dir.parent.as_posix()
+
+    path = download_umls_full(
+        version=conf.UMLS_VERSION.upper(),
+        api_key=conf.UMLS_API_KEY,
+    )
+    extract_dir = get_extract_dir(path)
+    extract_dir.mkdir(parents=True, exist_ok=True)
+
+    with zipfile.ZipFile(path) as zip_file:
+        zip_file.extractall(extract_dir)
+
+    print(extract_dir)
+
+
+if __name__ == "__main__":
+    main()
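
Note on extraction: `download_umls.py` always calls `extractall`, while the README says the full pipeline extracts only when the `META` directory is not already present. A minimal sketch of that check follows. The function name `extract_if_needed` is hypothetical, and it assumes `META` sits directly under the extraction directory; the real archive may nest it under a version folder, in which case the check would need to search deeper.

```python
import zipfile
from pathlib import Path


def extract_if_needed(zip_path, extract_dir):
    """Extract `zip_path` into `extract_dir` unless META already exists.

    Returns True when extraction ran and False when it was skipped.
    Assumes META is a direct child of `extract_dir` (an assumption of this
    sketch, not something the commit confirms).
    """
    extract_dir = Path(extract_dir)
    if (extract_dir / "META").is_dir():
        return False
    extract_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zip_file:
        zip_file.extractall(extract_dir)
    return True
```

A presence check like this is cheap but does not detect a partially extracted archive, which is presumably why the README mentions that the pipeline also validates the extracted files before skipping a step.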

requirements.txt

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+pymysql
+umls_downloader