Add mash 2.3 container with refreshed RefSeq prokaryotic sketch (v235) #1717
Merged
Add build-files/mash/2.3-RefSeqProkv235 — Mash 2.3 bundled with the v235 RefSeq prokaryotic reference sketch (Zenodo 10.5281/zenodo.20293962), replacing the 2019 RefSeqSketchesDefaults.msh. The reference sketch is generated and published by Erin Young's update_mash_dist pipeline (https://github.com/erinyoung/update_mash_dist), which regenerates the RefSeq sketch after each RefSeq release. Includes app/test stages; updates the README image table.
Collaborator
@arodzh-sudo Thank you for your PR.
Contributor
Author
Thanks @Kincekara. Switched the base image to ubuntu:noble in a39dfe9.
Contributor
@staphb-dockerbuilds-diff mash 2.3 2.3-RefSeqProkv235
Dockerfile Diff: mash
Comparing: 2.3 -> 2.3-RefSeqProkv235
--- build-files/mash/2.3/Dockerfile 2026-06-26 18:43:07.707090277 +0000
+++ build-files/mash/2.3-RefSeqProkv235/Dockerfile 2026-06-26 18:43:07.706975884 +0000
@@ -1,38 +1,55 @@
-FROM ubuntu:xenial
+FROM ubuntu:noble AS app
-# for easy upgrade later. ARG variables only persist during image build time
-ARG MASH_VER="2.3"
+ARG MASH_VER="v2.3"
+ARG SKETCH_VER="235"
-LABEL base.image="ubuntu:xenial"
+LABEL base.image="ubuntu:noble"
LABEL dockerfile.version="1"
LABEL software="Mash"
LABEL software.version="2.3"
-LABEL description="Fast genome and metagenome distance estimation using MinHash"
-LABEL website="https://mash.readthedocs.io/en/latest/index.html"
+LABEL description="Fast genome/metagenome distance estimation (MinHash), bundled with a refreshed RefSeq prokaryotic reference sketch (v235, Zenodo 10.5281/zenodo.20293962)"
+LABEL website="https://github.com/marbl/Mash"
LABEL license="https://github.com/marbl/Mash/blob/master/LICENSE.txt"
-LABEL maintainer="Curtis Kapsak"
-LABEL maintainer.email="pjx8@cdc.gov"
+LABEL maintainer="Arnold Rodriguez"
+LABEL maintainer.email="arnold.rodriguezhilario@flhealth.gov"
-# install dependencies
-RUN apt-get update && apt-get -y install \
- wget && \
- apt-get autoclean && \
- rm -rf /var/lib/apt/lists/*
-
-# download mash binary; make /data
-RUN wget https://github.com/marbl/Mash/releases/download/v${MASH_VER}/mash-Linux64-v${MASH_VER}.tar && \
- tar -xvf mash-Linux64-v${MASH_VER}.tar && \
- rm -rf mash-Linux64-v${MASH_VER}.tar && \
- mkdir /data
-
-# add mash to path, and set perl locale settings
-ENV PATH="${PATH}:/mash-Linux64-v${MASH_VER}" \
- LC_ALL=C
+RUN apt-get update && apt-get install --no-install-recommends -y \
+ wget \
+ ca-certificates \
+ procps && \
+ apt-get autoclean && rm -rf /var/lib/apt/lists/*
+
+# install mash 2.3 precompiled linux64 binary
+RUN wget https://github.com/marbl/Mash/releases/download/${MASH_VER}/mash-Linux64-${MASH_VER}.tar && \
+ tar -xvf mash-Linux64-${MASH_VER}.tar --no-same-owner && \
+ rm mash-Linux64-${MASH_VER}.tar && \
+ mv /mash-Linux64-${MASH_VER}/mash /usr/local/bin/ && \
+ mkdir /data
+
+# refreshed RefSeq prokaryotic sketch, embedded at build time (verify MD5 before gunzip)
+RUN mkdir /db && cd /db && \
+ wget -O RefSeqProkSketchesv${SKETCH_VER}.msh.gz \
+ "https://zenodo.org/records/20293962/files/RefSeqSketches_${SKETCH_VER}.msh.gz?download=1" && \
+ echo "2885e98f2d985b2b4f8f8ce9978040d2 RefSeqProkSketchesv${SKETCH_VER}.msh.gz" | md5sum -c - && \
+ gunzip RefSeqProkSketchesv${SKETCH_VER}.msh.gz
+
+ENV LC_ALL=C
+
+CMD ["/bin/bash", "-c", "mash -h && echo '** RefSeq prokaryotic reference sketch (v235) is located at /db/RefSeqProkSketchesv235.msh'"]
WORKDIR /data
-# make db dir. Store db there. Better to have db's added in the last layers
-RUN mkdir /db && \
- cd /db && \
- wget https://gembox.cbcb.umd.edu/mash/RefSeqSketchesDefaults.msh.gz && \
- gunzip RefSeqSketchesDefaults.msh.gz
+FROM app AS test
+
+ARG SKETCH_VER="235"
+
+# verify tool and reference sketch integrity
+WORKDIR /db
+RUN mash --version && mash -h && \
+ mash info RefSeqProkSketchesv${SKETCH_VER}.msh | head -n 10
+
+# functional test: Enterobacter hormaechei FDAARGOS_1433 (ECC) screened against the refreshed sketch
+WORKDIR /test
+RUN wget -q https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/019/048/245/GCF_019048245.1_ASM1904824v1/GCF_019048245.1_ASM1904824v1_genomic.fna.gz && \
+ gunzip GCF_019048245.1_ASM1904824v1_genomic.fna.gz && \
+ mash screen -p 4 /db/RefSeqProkSketchesv${SKETCH_VER}.msh GCF_019048245.1_ASM1904824v1_genomic.fna | sort -gr | head
Contributor
Test worked
erinyoung
reviewed
Jun 26, 2026
Comment on lines +20 to +24
The sketch replaces the 2019 `RefSeqSketchesDefaults.msh` distributed with `staphb/mash:2.3`. It is regenerated from RefSeq representative bacterial genomes by Erin Young's [`update_mash_dist`](https://github.com/erinyoung/update_mash_dist) pipeline (Mash 2.3, default k=21 / s=1000) and published to Zenodo:
- Version: v235 (published 2026-05-19)
- Zenodo record: 20293962
- DOI: [10.5281/zenodo.20293962](https://doi.org/10.5281/zenodo.20293962)
- Source file: `RefSeqSketches_235.msh.gz` (MD5 `2885e98f2d985b2b4f8f8ce9978040d2`)
Contributor
Thank you for including this!
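The MD5 listed for the source file can also be checked by hand before decompressing a downloaded copy. The following is a minimal sketch, assuming GNU `md5sum` is available; `verify_md5` is a hypothetical helper name used only for illustration and is not part of this PR.

```shell
#!/usr/bin/env bash
# Sketch: verify a file against an expected MD5 digest before using it.
# verify_md5 is a hypothetical helper, not something defined in the PR.
verify_md5() {
  local file="$1" expected="$2"
  # md5sum -c reads "<digest>  <filename>" lines from stdin and
  # exits non-zero when the computed digest does not match.
  echo "${expected}  ${file}" | md5sum -c -
}

# Possible use with the values quoted above (assumes the file is already downloaded):
# verify_md5 RefSeqSketches_235.msh.gz 2885e98f2d985b2b4f8f8ce9978040d2 \
#   && gunzip RefSeqSketches_235.msh.gz
```

Chaining `gunzip` with `&&` means the archive is only decompressed when the digest matches, which mirrors the order used in the Dockerfile.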
erinyoung
approved these changes
Jun 26, 2026
erinyoung
left a comment
Contributor
I have no changes to recommend.
I am going to approve this PR and get this deployed to dockerhub and quay using the tag '2.3-RefSeqProkv235'. I will not be overwriting the 'latest' tag for staphb/mash.
Contributor
Thank you for putting this together! You can check the status of the deploy here: https://github.com/StaPH-B/docker-builds/actions/runs/28258766636
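Once the tag is deployed, screening a local assembly against the bundled sketch might look like the sketch below. It only prints the `docker` command rather than running it. The image tag and sketch path come from this PR; `screen_cmd`, the bind mount of the current directory to `/data`, and the file name `sample.fasta` are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch: print (not run) a docker command that screens a FASTA in the
# current directory against the sketch bundled in the image.
IMAGE="staphb/mash:2.3-RefSeqProkv235"   # tag named in the approval comment
SKETCH="/db/RefSeqProkSketchesv235.msh"  # path set in the Dockerfile

# screen_cmd is a hypothetical helper; the mount to /data is an assumption
# that relies on the image's WORKDIR being /data.
screen_cmd() {
  local fasta="$1" threads="${2:-4}"
  printf 'docker run --rm -v "%s":/data %s mash screen -p %s %s %s\n' \
    "$(pwd)" "$IMAGE" "$threads" "$SKETCH" "$fasta"
}

screen_cmd sample.fasta
```

Printing the command first makes it easy to inspect the mount and paths before executing it.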
Description
Adds a new Mash container variant, `staphb/mash:2.3-RefSeqProkv235`, that embeds a refreshed RefSeq prokaryotic reference sketch (v235) in place of the 2019 `RefSeqSketchesDefaults.msh` included in `staphb/mash:2.3`.

Why: the 2019 sketch under-represents recent type strains and contributes to Enterobacter cloacae complex (ECC) mis-calls. The v235 sketch is regenerated from RefSeq representative bacterial genomes (Mash 2.3, k=21, s=1000) and refreshes the candidate pool.

The sketch is downloaded from Zenodo and MD5-verified at build time, then stored at `/db/RefSeqProkSketchesv235.msh`:
- `RefSeqSketches_235.msh.gz`, MD5 `2885e98f2d985b2b4f8f8ce9978040d2`

The reference sketch is generated and published by Erin Young's `update_mash_dist` pipeline.

Testing (via the "Manual test" Action, build to the `test` target):
- `mash info` confirms k=21 and 22,274 sketches
- `mash screen` of E. hormaechei FDAARGOS_1433 returns it at identity 1.0, with the remaining ECC species ranked directly below

Pull Request (PR) checklist:
- `docker build --tag mash:2.3-RefSeqProkv235test --target test build-files/mash/2.3-RefSeqProkv235`: verified via the "Manual test" GitHub Action.
- `build-files/mash/2.3-RefSeqProkv235/Dockerfile`
- `test` stage (no separate `test.sh`).
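The `mash screen` check described under Testing can be scripted instead of read by eye. Below is a minimal sketch that picks the highest-identity row from `mash screen` output, which is tab-separated with the columns identity, shared-hashes, median-multiplicity, p-value, query-ID, and query-comment. `top_hit` is a hypothetical helper, and the two input rows are made-up placeholders rather than results from this PR.

```shell
#!/usr/bin/env bash
# Sketch: report identity and query-ID of the best hit from `mash screen`
# output supplied on stdin. top_hit is a hypothetical helper name.
top_hit() {
  # Sort numerically on column 1 (identity), highest first, keep the first
  # row, and print only identity and query-ID (column 5).
  LC_ALL=C sort -t "$(printf '\t')" -k1,1gr \
    | head -n 1 \
    | awk -F '\t' '{ print $1 "\t" $5 }'
}

# Hypothetical rows for illustration only; not output from the PR's test stage.
printf '0.95\t350/1000\t1\t0\trefB.fna.gz\tcomment B\n1\t1000/1000\t1\t0\trefA.fna.gz\tcomment A\n' \
  | top_hit
```

In a real run the function would be fed from `mash screen -p 4 /db/RefSeqProkSketchesv235.msh <assembly>`, and the printed query-ID could then be compared against the expected accession.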