Skip to content

qntx/h2m

H2M

Crates.io Docs.rs CI License Rust

Fast, extensible HTML-to-Markdown converter with optional web search — CommonMark + GFM, plugin architecture.

H2M converts HTML into clean Markdown with full CommonMark compliance and GitHub Flavored Markdown extensions. It uses a plugin-based rule system, supports reference-style links, relative URL resolution, and ships with an async CLI that can also search the web and pipe results through the same conversion pipeline. Search works zero-config out of the box via DuckDuckGo and Wikipedia, and also integrates with SearXNG, Brave Search, and Tavily.

H2M CLI Demo

Quick Start

Install the CLI

Shell (macOS / Linux):

curl -fsSL https://sh.qntx.fun/h2m | sh

PowerShell (Windows):

irm https://sh.qntx.fun/h2m/ps | iex

Or via Cargo:

cargo install h2m-cli

CLI Structure

H2M uses a subcommand tree:

h2m <COMMAND> [OPTIONS] ...

Commands:
  convert  Convert HTML to Markdown (URL, file, stdin)
  search   Search the web and optionally scrape each hit to Markdown

convert — HTML → Markdown

h2m convert https://example.com
h2m convert page.html
curl -s https://example.com | h2m convert
echo '<h1>Hi</h1>' | h2m convert

Content extraction:

h2m convert -r https://blog.example.com/post          # smart readable
h2m convert -s article https://blog.example.com/post  # CSS selector
h2m convert -s '#content' https://example.com         # by ID

JSON output (agents / programmatic use):

h2m convert --json https://example.com                # pretty JSON
h2m convert --json --extract-links https://example.com
h2m convert --json url1 url2 url3                     # NDJSON streaming
h2m convert --json --urls urls.txt -j 8 --delay 100

Formatting:

h2m convert --gfm https://example.com                 # tables, strikethrough, task lists
h2m convert --link-style referenced page.html         # reference-style links
h2m convert --heading-style setext page.html          # === / --- underlines
h2m convert --user-agent "MyBot/1.0" https://example.com
h2m convert -o output.md https://example.com

search — Web search

H2M ships with five search providers. The default is duckduckgo, which requires no API key, no registration, and no environment variables:

Provider Requires Free tier Notes
DuckDuckGo - (zero-config) unlimited* Default. HTML scraping + lite fallback
Wikipedia - (zero-config) unlimited Official MediaWiki API, 300+ languages
SearXNG H2M_SEARXNG_URL yes (self-host) Open-source meta-search
Brave BRAVE_API_KEY $5/month credit Independent index, transparent pagination
Tavily TAVILY_API_KEY 1000 req/month AI-tuned snippets + LLM answers

* DuckDuckGo uses unauthenticated HTML endpoints. Aggressive or datacenter traffic may trigger anti-bot challenges; the provider auto-falls back to lite.duckduckgo.com and emits a structured "kind":"captchaDetected" error so you can automate provider switching. Wikipedia is the recommended fallback.

Zero-config usage (nothing to configure, runs immediately):

h2m search "rust async trait"                       # DuckDuckGo (default)
h2m search "Turing machine" --provider wikipedia    # official MediaWiki API
h2m search "图灵机" --provider wikipedia --wikipedia-lang zh

All the usual flags work uniformly across providers:

h2m search "rust" --limit 5 --time-range week       # last-week results
h2m search "rust" --sources web,news --country us
h2m search "rust" --language en --safesearch strict
h2m search "rust" --json                             # NDJSON (one hit per line)

Provider-specific keys (opt-in, via env vars or flags):

export BRAVE_API_KEY=...    ; h2m search "rust" --provider brave
export TAVILY_API_KEY=...   ; h2m search "rust" --provider tavily --include-answer
export H2M_SEARXNG_URL=...  ; h2m search "rust" --provider searxng

Tips:

  • CAPTCHA handling — when DuckDuckGo returns "kind":"captchaDetected" or "authFailed", switch to --provider wikipedia or a keyed provider. The error JSON always carries kind / provider / status fields.
  • Windows + system proxy — if your system proxy intercepts localhost requests (Clash/V2Ray/etc), set NO_PROXY=127.0.0.1,localhost before pointing h2m at a self-hosted SearXNG instance.
  • Brave pagination--limit up to 200 is supported (Brave caps at 20 per page; h2m paginates transparently via offset).

Search + scrape (runs every hit through the full convert pipeline, streams NDJSON ScrapeResults):

h2m search "rust async" --scrape                 # raw markdown per hit
h2m search "rust async" --scrape --gfm --readable
h2m search "rust async" --scrape --selector article
h2m search "rust" --scrape -j 8 --timeout 20     # parallel scrape

A ready-made end-to-end smoke test lives at scripts/live_search_e2e.ps1 (Windows PowerShell) — it exercises DuckDuckGo and Wikipedia across English / Chinese / Japanese and prints a classified summary table.

JSON Output

convert single URL (pretty JSON):

{
  "markdown": "# Example Domain\n\n...",
  "metadata": {
    "title": "Example Domain",
    "description": "This domain is for use in illustrative examples.",
    "language": "en",
    "ogImage": "https://example.com/og.png",
    "sourceUrl": "https://example.com",
    "url": "https://example.com/",
    "statusCode": 200,
    "contentType": "text/html; charset=UTF-8",
    "elapsedMs": 234
  },
  "links": ["https://example.com/about"]
}

search response:

{
  "query": "rust async",
  "provider": "tavily",
  "answer": "Rust's async trait support stabilized in 1.75 ...",
  "web": [
    {
      "title": "Rust", "url": "https://rust-lang.org",
      "description": "...", "engine": "duckduckgo", "score": 0.92
    }
  ],
  "news": [],
  "images": [],
  "elapsedMs": 312
}
  • answer — LLM-generated summary (Tavily --include-answer flag, opt-in).
  • score — relevance in [0, 1] (Tavily only; other providers omit it).
  • engine — upstream backend name (SearXNG only; aggregators omit it).

Fields marked Option are dropped from the JSON when absent, keeping output lean.

Multiple inputs (convert batch, or search --scrape) stream NDJSON — one JSON object per line.

Library Usage

// One-liner with CommonMark defaults
let md = h2m::convert("<h1>Hello</h1><p>World</p>");
assert_eq!(md, "# Hello\n\nWorld");
// Full control with the builder
use h2m::{Converter, Options};
use h2m::plugins::Gfm;
use h2m::rules::CommonMark;

let converter = Converter::builder()
    .options(Options::default())
    .use_plugin(&CommonMark)
    .use_plugin(&Gfm)
    .domain("example.com")
    .build();

let md = converter.convert(r#"<a href="/about">About</a>"#);
assert_eq!(md, "[About](https://example.com/about)");

Async Scraping

Enable the scrape feature for async HTTP scraping with built-in concurrency control, rate limiting, and streaming output:

use h2m::scrape::Scraper;

let scraper = Scraper::builder()
    .concurrency(8)
    .gfm(true)
    .extract_links(true)
    .build()?;

let result = scraper.scrape("https://example.com").await?;
println!("{}", result.markdown);

let urls = vec!["https://a.com".into(), "https://b.com".into()];
scraper.scrape_many_streaming(&urls, |result| {
    match result {
        Ok(r) => println!("{}", r.markdown),
        Err(e) => eprintln!("error: {e}"),
    }
}).await;

Web Search

The h2m-search crate exposes the same provider abstraction the CLI uses. The zero-config default is DuckDuckGo; no builder configuration required:

use h2m_search::{SearchClient, SearchQuery};

// Zero-config: uses DuckDuckGo (no API key, no env vars).
let client = SearchClient::builder().build()?;

let response = client
    .search(&SearchQuery::new("rust async").with_limit(5))
    .await?;

for hit in &response.web {
    println!("{} — {}", hit.title, hit.url);
}
# Ok::<_, Box<dyn std::error::Error>>(())

Design

  • CommonMark + GFM — full spec compliance with tables, strikethrough, task lists, reference-style links
  • Plugin architecture — extend with custom rules via the Rule trait
  • Async batch pipelinetokio + reqwest, semaphore concurrency, streaming NDJSON (scrape feature)
  • Multi-provider searchSearchClient enum with static dispatch, one Cargo feature per provider
  • Search + scrape compositionsearch --scrape funnels hits through the same Scraper pipeline, reusing all formatting / extraction flags
  • JSON output — nested camelCase metadata aligned with Firecrawl conventions
  • Smart readable extraction — two-phase content detection: semantic selectors → noise stripping
  • Zero-copy fast pathsCow<str> escaping, zero unsafe, Send + Sync

Supported HTML Elements

CommonMark (built-in)

Element Markdown Output
<h1>-<h6> # Heading (ATX) or underline (Setext)
<p>, <div>, <section>, <article> Block paragraph
<strong>, <b> **bold**
<em>, <i> *italic*
<code>, <kbd>, <samp>, <tt> `inline code`
<pre><code> Fenced code block with language detection
<a href="..."> [text](url) or reference-style
<img src="..." alt="..."> ![alt](src "title")
<ul>, <ol>, <li> Bullet/numbered lists with nesting
<blockquote> > quoted text
<hr> ---
<br> Hard line break
<iframe> [iframe](url)

GFM Extensions (with --gfm)

Element Markdown Output
<table> GFM pipe table with alignment
<del>, <s>, <strike> ~~strikethrough~~
<input type="checkbox"> [x] or [ ] (task list)

Auto-removed

Element Behavior
<script> Removed (content stripped)
<style> Removed (content stripped)
<noscript> Removed (content stripped)

Custom Rules

Extend the converter with your own rules by implementing the Rule trait:

use h2m::{Converter, Rule, Action, Context};
use h2m::rules::CommonMark;
use scraper::ElementRef;

#[derive(Debug)]
struct HighlightRule;
impl Rule for HighlightRule {
    fn tags(&self) -> &'static [&'static str] { &["mark"] }

    fn apply(&self, content: &str, _el: &ElementRef<'_>, _ctx: &mut Context<'_>) -> Action {
        Action::Replace(format!("=={content}=="))
    }
}

let mut builder = Converter::builder()
    .use_plugin(CommonMark);
builder.add_rule(HighlightRule);
let converter = builder.build();

let md = converter.convert("<p>This is <mark>important</mark></p>");
assert!(md.contains("==important=="));

License

Licensed under either of:

at your option.

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this project shall be dual-licensed as above, without any additional terms or conditions.


A QNTX open-source project.

QNTX

Code is law. We write both.

About

A blazing-fast HTML to Markdown converter, written in Rust.

Topics

Resources

License

Apache-2.0, MIT licenses found

Licenses found

Apache-2.0
LICENSE-APACHE
MIT
LICENSE-MIT

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Contributors

Languages