Add mash 2.3 container with refreshed RefSeq prokaryotic sketch (v235) #1717

Merged
erinyoung merged 3 commits into StaPH-B:master from arodzh-sudo:mash-2.3-RefSeqProkv235
Jun 26, 2026

@arodzh-sudo

Contributor

Description

Adds a new Mash container variant, staphb/mash:2.3-RefSeqProkv235, that embeds a refreshed RefSeq prokaryotic reference sketch (v235) in place of the 2019 RefSeqSketchesDefaults.msh included in staphb/mash:2.3.

Why: the 2019 sketch under-represents recent type strains and contributes to Enterobacter cloacae complex (ECC) mis-calls. The v235 sketch is regenerated from RefSeq representative bacterial genomes (Mash 2.3, k=21, s=1000) and refreshes the candidate pool.

The sketch is downloaded from Zenodo and MD5-verified at build time, then stored at /db/RefSeqProkSketchesv235.msh:

  • v235, published 2026-05-19 — Zenodo record 20293962 — DOI 10.5281/zenodo.20293962
  • RefSeqSketches_235.msh.gz, MD5 2885e98f2d985b2b4f8f8ce9978040d2
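The build-time check described above can also be reproduced by hand on a downloaded copy of the sketch. The following is a minimal sketch rather than the Dockerfile's exact code: `verify_md5` is a hypothetical helper name, and it assumes an `md5sum` that supports `-c` (as in GNU coreutils).

```shell
# verify_md5 FILE EXPECTED_MD5
# Succeeds (exit 0) only if FILE's MD5 digest equals EXPECTED_MD5.
# verify_md5 is a hypothetical helper, not part of the PR.
verify_md5() {
  file="$1"
  expected="$2"
  # md5sum -c expects "<digest>  <path>" (two spaces) on each input line.
  printf '%s  %s\n' "$expected" "$file" | md5sum -c - >/dev/null 2>&1
}

# Example (commented out because it needs the file from Zenodo record 20293962):
# verify_md5 RefSeqSketches_235.msh.gz 2885e98f2d985b2b4f8f8ce9978040d2 \
#   && gunzip RefSeqSketches_235.msh.gz
```

Checking the digest before decompressing means a truncated or altered download fails early instead of producing a corrupt `.msh` file.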

The reference sketch is generated and published by Erin Young's update_mash_dist pipeline.

Testing (via the "Manual test" Action, build to the test target):

  • MD5 verification of the downloaded sketch passes
  • mash info confirms k=21 and 22,274 sketches
  • mash screen of E. hormaechei FDAARGOS_1433 returns it at identity 1.0, with the remaining ECC species ranked directly below
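To turn the last check into a pass/fail assertion, one could extract the best hit from the `mash screen` output. This is a minimal sketch, assuming the standard tab-separated `mash screen` columns (identity, shared-hashes, median-multiplicity, p-value, query-ID, query-comment). `top_hit` is a hypothetical helper and is not part of the PR's test stage; the two sample rows are abbreviated from the test log posted in this PR.

```shell
# top_hit: read mash screen output on stdin and print
# "<identity><TAB><query-ID>" for the row with the highest identity.
# Hypothetical helper, assuming the six-column mash screen layout.
top_hit() {
  awk -F '\t' '
    NR == 1 || $1 + 0 > best { best = $1 + 0; ident = $1; id = $5 }
    END { if (NR) print ident "\t" id }
  '
}

# Two rows abbreviated from the PR test log, deliberately out of order.
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  0.94428 300/1000 1 0 Enterobacter_intestinihominis_GCF_048568405.1 comment \
  1 1000/1000 1 0 Enterobacter_hormaechei_GCF_019048245.1 comment \
  | top_hit
```

Comparing the printed query-ID against the expected accession would make a regression in the sketch fail the build rather than only show up in the log.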

Pull Request (PR) checklist:

  • Include a description of what is in this pull request in this message.
  • The dockerfile successfully builds to a test target for the user creating the PR. (docker build --tag mash:2.3-RefSeqProkv235test --target test build-files/mash/2.3-RefSeqProkv235) — verified via the "Manual test" GitHub Action.
  • Directory structure as name of the tool in lower case with special characters removed with a subdirectory of the version number (build-files/mash/2.3-RefSeqProkv235/Dockerfile).
    • (optional) Test files: functional tests are inline in the Dockerfile test stage (no separate test.sh).
  • Create a simple container-specific README.md in the same directory as the Dockerfile.
    • The README is ~37 lines; the additional lines document the sketch provenance (Zenodo DOI, version, MD5) for traceability.
  • Dockerfile includes the recommended LABELS.
  • Main README.md has been updated to include the tool and version of the dockerfile in this PR.
  • Program_Licenses.md contains the tool(s) used in this PR — Mash already listed; no change needed.

Add build-files/mash/2.3-RefSeqProkv235 — Mash 2.3 bundled with the v235
RefSeq prokaryotic reference sketch (Zenodo 10.5281/zenodo.20293962), replacing
the 2019 RefSeqSketchesDefaults.msh.

The reference sketch is generated and published by Erin Young's update_mash_dist
pipeline (https://github.com/erinyoung/update_mash_dist), which regenerates the
RefSeq sketch after each RefSeq release.

Includes app/test stages; updates the README image table.
@Kincekara

Collaborator

@arodzh-sudo Thank you for your PR.
Could you change the base image to ubuntu:noble?

@arodzh-sudo

Contributor Author

Thanks @Kincekara. Switched the base image to ubuntu:noble in a39dfe9.

@erinyoung

Contributor

@staphb-dockerbuilds-diff mash 2.3 2.3-RefSeqProkv235

@github-actions


Dockerfile Diff: mash

Comparing: 2.3 -> 2.3-RefSeqProkv235

--- build-files/mash/2.3/Dockerfile	2026-06-26 18:43:07.707090277 +0000
+++ build-files/mash/2.3-RefSeqProkv235/Dockerfile	2026-06-26 18:43:07.706975884 +0000
@@ -1,38 +1,55 @@
-FROM ubuntu:xenial
+FROM ubuntu:noble AS app
 
-# for easy upgrade later. ARG variables only persist during image build time
-ARG MASH_VER="2.3"
+ARG MASH_VER="v2.3"
+ARG SKETCH_VER="235"
 
-LABEL base.image="ubuntu:xenial"
+LABEL base.image="ubuntu:noble"
 LABEL dockerfile.version="1"
 LABEL software="Mash"
 LABEL software.version="2.3"
-LABEL description="Fast genome and metagenome distance estimation using MinHash"
-LABEL website="https://mash.readthedocs.io/en/latest/index.html"
+LABEL description="Fast genome/metagenome distance estimation (MinHash), bundled with a refreshed RefSeq prokaryotic reference sketch (v235, Zenodo 10.5281/zenodo.20293962)"
+LABEL website="https://github.com/marbl/Mash"
 LABEL license="https://github.com/marbl/Mash/blob/master/LICENSE.txt"
-LABEL maintainer="Curtis Kapsak"
-LABEL maintainer.email="pjx8@cdc.gov"
+LABEL maintainer="Arnold Rodriguez"
+LABEL maintainer.email="arnold.rodriguezhilario@flhealth.gov"
 
-# install dependencies
-RUN apt-get update && apt-get -y install \ 
- wget && \
- apt-get autoclean && \
- rm -rf /var/lib/apt/lists/*
-
-# download mash binary; make /data
-RUN wget https://github.com/marbl/Mash/releases/download/v${MASH_VER}/mash-Linux64-v${MASH_VER}.tar && \
- tar -xvf mash-Linux64-v${MASH_VER}.tar && \
- rm -rf mash-Linux64-v${MASH_VER}.tar && \
- mkdir /data
-
-# add mash to path, and set perl locale settings
-ENV PATH="${PATH}:/mash-Linux64-v${MASH_VER}" \
- LC_ALL=C
+RUN apt-get update && apt-get install --no-install-recommends -y \
+    wget \
+    ca-certificates \
+    procps && \
+    apt-get autoclean && rm -rf /var/lib/apt/lists/*
+
+# install mash 2.3 precompiled linux64 binary
+RUN wget https://github.com/marbl/Mash/releases/download/${MASH_VER}/mash-Linux64-${MASH_VER}.tar && \
+    tar -xvf mash-Linux64-${MASH_VER}.tar --no-same-owner && \
+    rm mash-Linux64-${MASH_VER}.tar && \
+    mv /mash-Linux64-${MASH_VER}/mash /usr/local/bin/ && \
+    mkdir /data
+
+# refreshed RefSeq prokaryotic sketch, embedded at build time (verify MD5 before gunzip)
+RUN mkdir /db && cd /db && \
+    wget -O RefSeqProkSketchesv${SKETCH_VER}.msh.gz \
+      "https://zenodo.org/records/20293962/files/RefSeqSketches_${SKETCH_VER}.msh.gz?download=1" && \
+    echo "2885e98f2d985b2b4f8f8ce9978040d2  RefSeqProkSketchesv${SKETCH_VER}.msh.gz" | md5sum -c - && \
+    gunzip RefSeqProkSketchesv${SKETCH_VER}.msh.gz
+
+ENV LC_ALL=C
+
+CMD ["/bin/bash", "-c", "mash -h && echo '** RefSeq prokaryotic reference sketch (v235) is located at /db/RefSeqProkSketchesv235.msh'"]
 
 WORKDIR /data
 
-# make db dir. Store db there. Better to have db's added in the last layers
-RUN mkdir /db && \
- cd /db && \
- wget https://gembox.cbcb.umd.edu/mash/RefSeqSketchesDefaults.msh.gz && \
- gunzip RefSeqSketchesDefaults.msh.gz
+FROM app AS test
+
+ARG SKETCH_VER="235"
+
+# verify tool and reference sketch integrity
+WORKDIR /db
+RUN mash --version && mash -h && \
+    mash info RefSeqProkSketchesv${SKETCH_VER}.msh | head -n 10
+
+# functional test: Enterobacter hormaechei FDAARGOS_1433 (ECC) screened against the refreshed sketch
+WORKDIR /test
+RUN wget -q https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/019/048/245/GCF_019048245.1_ASM1904824v1/GCF_019048245.1_ASM1904824v1_genomic.fna.gz && \
+    gunzip GCF_019048245.1_ASM1904824v1_genomic.fna.gz && \
+    mash screen -p 4 /db/RefSeqProkSketchesv${SKETCH_VER}.msh GCF_019048245.1_ASM1904824v1_genomic.fna | sort -gr | head

@erinyoung

Contributor

Test worked

#14 [test 4/4] RUN wget -q https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/019/048/245/GCF_019048245.1_ASM1904824v1/GCF_019048245.1_ASM1904824v1_genomic.fna.gz &&     gunzip GCF_019048245.1_ASM1904824v1_genomic.fna.gz &&     mash screen -p 4 /db/RefSeqProkSketchesv235.msh GCF_019048245.1_ASM1904824v1_genomic.fna | sort -gr | head
#14 1.447 Loading /db/RefSeqProkSketchesv235.msh...
#14 13.37    15767473 distinct hashes.
#14 13.37 Streaming from GCF_019048245.1_ASM1904824v1_genomic.fna...
#14 13.91    Estimated distinct k-mers in mixture: 4900608
#14 13.91 Summing shared...
#14 14.13 Computing coverage medians...
#14 14.13 Writing output...
#14 21.25 1	1000/1000	1	0	Enterobacter_hormaechei_GCF_019048245.1	[3 seqs] NZ_CP077308.1 Enterobacter hormaechei strain FDAARGOS 1433 chromosome, complete genome [...]
#14 21.25 0.94428	300/1000	1	0	Enterobacter_intestinihominis_GCF_048568405.1	[3 seqs] NZ_CP183772.1 Enterobacter intestinihominis strain JNQH618 chromosome, complete genome [...]
#14 21.25 0.939555	270/1000	1	0	Enterobacter_quasihormaechei_GCF_004331385.1	[51 seqs] NZ_SJON01000001.1 Enterobacter quasihormaechei strain WCHEQ120003 1, whole genome shotgun sequence [...]
#14 21.25 0.920851	177/1000	1	0	Enterobacter_pasteurii_GCF_014930725.1	[48 seqs] NZ_JADBRO010000001.1 Enterobacter pasteurii strain P40RS contig_0001, whole genome shotgun sequence [...]
#14 21.25 0.908727	134/1000	1	0	Enterobacter_chuandaensis_GCF_039718975.1	[2 seqs] NZ_CP097173.1 Enterobacter chuandaensis strain E20191216 chromosome, complete genome [...]
#14 21.25 0.906747	128/1000	1	0	Enterobacter_quasimori_GCF_018597345.1	[30 seqs] NZ_JAHEVU010000001.1 Enterobacter quasimori strain 120130 contig00001, whole genome shotgun sequence [...]
#14 21.25 0.906409	127/1000	1	0	Enterobacter_nematophilus_GCF_026344075.1	[19 seqs] NZ_JAPKNE010000001.1 Enterobacter nematophilus strain E-TC7 Strain6_1a_1, whole genome shotgun sequence [...]
#14 21.25 0.906067	126/1000	1	0	Enterobacter_bugandensis_GCF_900324475.1	NZ_LT992502.1 Enterobacter bugandensis isolate EB-247 chromosome I
#14 21.25 0.903965	120/1000	1	0	Enterobacter_oligotrophicus_GCF_009176645.1	NZ_AP019007.1 Enterobacter oligotrophicus strain CCA6 chromosome, complete genome
#14 21.25 0.901759	114/1000	1	0	Enterobacter_vonholyi_GCF_040834095.1	[2 seqs] NZ_CP162152.1 Enterobacter vonholyi strain STK80-C chromosome, complete genome [...]
#14 DONE 21.3s
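The identity column in the log above is consistent with the containment-based estimate used by `mash screen`, identity = (shared / total)^(1/k), with k=21 for this sketch. The following is a small worked check, not code from the PR; `screen_identity` is a hypothetical helper name.

```shell
# screen_identity SHARED TOTAL K
# Prints (SHARED/TOTAL)^(1/K) to five decimal places.
# Hypothetical helper for checking the logged values by hand.
screen_identity() {
  awk -v s="$1" -v t="$2" -v k="$3" \
    'BEGIN { printf "%.5f\n", (s / t) ^ (1 / k) }'
}

screen_identity 300 1000 21    # E. intestinihominis row: prints 0.94428
screen_identity 1000 1000 21   # E. hormaechei row: prints 1.00000
```

Because of the 1/k exponent, a hit sharing only 300 of 1000 hashes still scores about 0.944, which is why the shared-hash column separates the ECC species more visibly than the identity column does.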

Comment on lines +20 to +24
The sketch replaces the 2019 `RefSeqSketchesDefaults.msh` distributed with `staphb/mash:2.3`. It is regenerated from RefSeq representative bacterial genomes by Erin Young's [`update_mash_dist`](https://github.com/erinyoung/update_mash_dist) pipeline (Mash 2.3, default k=21 / s=1000) and published to Zenodo:
- Version: v235 (published 2026-05-19)
- Zenodo record: 20293962
- DOI: [10.5281/zenodo.20293962](https://doi.org/10.5281/zenodo.20293962)
- Source file: `RefSeqSketches_235.msh.gz` (MD5 `2885e98f2d985b2b4f8f8ce9978040d2`)

Contributor

Thank you for including this!

@erinyoung erinyoung left a comment

Contributor

I have no changes to recommend.

I am going to approve this PR and get it deployed to Docker Hub and Quay under the tag '2.3-RefSeqProkv235'. I will not be overwriting the 'latest' tag for staphb/mash.
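Since 'latest' stays unchanged, anyone who wants the v235 sketch has to request this tag explicitly. The following is a minimal sketch of how such a run might be assembled, assuming the image is pulled from Docker Hub and the input sits in the current directory. `screen_cmd` and the file name `reads.fastq.gz` are hypothetical; the function only prints the command so it can be inspected before running.

```shell
# Image tag and sketch path as stated in this PR.
IMAGE="staphb/mash:2.3-RefSeqProkv235"
SKETCH="/db/RefSeqProkSketchesv235.msh"

# screen_cmd INPUT
# Prints a docker command that mounts the current directory at /data
# (the image's WORKDIR) and screens INPUT against the bundled sketch.
# Hypothetical helper; it does not run docker itself.
screen_cmd() {
  printf 'docker run --rm -v "%s":/data %s mash screen %s /data/%s\n' \
    "$PWD" "$IMAGE" "$SKETCH" "$1"
}

screen_cmd reads.fastq.gz
```

The printed line can be reviewed and then pasted into a shell to run it.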

@erinyoung erinyoung merged commit 762fcaf into StaPH-B:master Jun 26, 2026
2 checks passed
@erinyoung

Contributor

Thank you for putting this together!

You can check the status of the deploy here: https://github.com/StaPH-B/docker-builds/actions/runs/28258766636


3 participants