H2M

Fast, extensible HTML-to-Markdown converter with optional web search — CommonMark + GFM, plugin architecture.

H2M converts HTML into clean Markdown with full CommonMark compliance and GitHub Flavored Markdown extensions. It uses a plugin-based rule system, supports reference-style links and relative URL resolution, and ships with an async CLI that can also search the web and pipe results through the same conversion pipeline. Search works with zero configuration via DuckDuckGo and Wikipedia, and also integrates with SearXNG, Brave Search, and Tavily.

Quick Start

Install the CLI

Shell (macOS / Linux):

curl -fsSL https://sh.qntx.fun/h2m | sh

PowerShell (Windows):

irm https://sh.qntx.fun/h2m/ps | iex

Or via Cargo:

cargo install h2m-cli

CLI Structure

H2M uses a subcommand tree:

h2m <COMMAND> [OPTIONS] ...

Commands:
  convert  Convert HTML to Markdown (URL, file, stdin)
  search   Search the web and optionally scrape each hit to Markdown

`convert` — HTML → Markdown

h2m convert https://example.com
h2m convert page.html
curl -s https://example.com | h2m convert
echo '<h1>Hi</h1>' | h2m convert

Content extraction:

h2m convert -r https://blog.example.com/post          # smart readable
h2m convert -s article https://blog.example.com/post  # CSS selector
h2m convert -s '#content' https://example.com         # by ID

JSON output (agents / programmatic use):

h2m convert --json https://example.com                # pretty JSON
h2m convert --json --extract-links https://example.com
h2m convert --json url1 url2 url3                     # NDJSON streaming
h2m convert --json --urls urls.txt -j 8 --delay 100

Formatting:

h2m convert --gfm https://example.com                 # tables, strikethrough, task lists
h2m convert --link-style referenced page.html         # reference-style links
h2m convert --heading-style setext page.html          # === / --- underlines
h2m convert --user-agent "MyBot/1.0" https://example.com
h2m convert -o output.md https://example.com
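
To make the difference between the two heading styles concrete, here is a minimal standalone sketch in plain Rust. It does not use the h2m crate; `HeadingStyle` and `render_heading` are hypothetical names for illustration. CommonMark only defines Setext underlines for levels 1 and 2, so the sketch assumes deeper levels fall back to ATX; whether h2m does the same is not stated here.

```rust
/// Heading styles, mirroring the two `--heading-style` variants (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum HeadingStyle {
    Atx,
    Setext,
}

/// Underline length: at least three characters, purely for readability.
fn underline_len(text: &str) -> usize {
    text.chars().count().max(3)
}

/// Render one heading. Setext only exists for levels 1 and 2 in CommonMark,
/// so levels 3 to 6 fall back to ATX in this sketch (an assumption).
pub fn render_heading(level: u8, text: &str, style: HeadingStyle) -> String {
    let level = level.clamp(1, 6);
    match (style, level) {
        (HeadingStyle::Setext, 1) => format!("{text}\n{}", "=".repeat(underline_len(text))),
        (HeadingStyle::Setext, 2) => format!("{text}\n{}", "-".repeat(underline_len(text))),
        _ => format!("{} {text}", "#".repeat(level as usize)),
    }
}
```

Calling `render_heading(2, "Title", HeadingStyle::Setext)` yields the text followed by a dashed underline, while the same call with `HeadingStyle::Atx` yields `## Title`.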

`search` — Web search

H2M ships with five search providers. The default is duckduckgo, which requires no API key, no registration, and no environment variables:

Provider	Requires	Free tier	Notes
DuckDuckGo	- (zero-config)	unlimited*	Default. HTML scraping + lite fallback
Wikipedia	- (zero-config)	unlimited	Official MediaWiki API, 300+ languages
SearXNG	`H2M_SEARXNG_URL`	yes (self-host)	Open-source meta-search
Brave	`BRAVE_API_KEY`	$5/month credit	Independent index, transparent pagination
Tavily	`TAVILY_API_KEY`	1000 req/month	AI-tuned snippets + LLM answers

* DuckDuckGo uses unauthenticated HTML endpoints. Aggressive or datacenter traffic may trigger anti-bot challenges; the provider auto-falls back to lite.duckduckgo.com and emits a structured "kind":"captchaDetected" error so you can automate provider switching. Wikipedia is the recommended fallback.
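
The provider switching mentioned above could be automated along these lines. This is a rough sketch, not part of h2m: `extract_kind` and `next_provider` are hypothetical helpers, and the string-based lookup assumes the error is compact JSON with a `"kind":"..."` pair and no whitespace around the colon. A real tool would use a proper JSON parser.

```rust
/// Pull the value of the `kind` field out of a compact JSON error line.
/// Naive on purpose: assumes `"kind":"value"` with no whitespace or escapes.
pub fn extract_kind(error_json: &str) -> Option<&str> {
    let marker = "\"kind\":\"";
    let start = error_json.find(marker)? + marker.len();
    let len = error_json[start..].find('"')?;
    Some(&error_json[start..start + len])
}

/// Pick the next provider to try after a failure.
/// The fallback order (DuckDuckGo -> Wikipedia) follows the recommendation
/// in the text; everything else is treated as non-recoverable here.
pub fn next_provider(current: &str, kind: &str) -> Option<&'static str> {
    match (current, kind) {
        ("duckduckgo", "captchaDetected") | ("duckduckgo", "authFailed") => Some("wikipedia"),
        _ => None,
    }
}
```

A wrapper script could feed the error output of a failed `h2m search` call to `extract_kind`, then rerun the query with `--provider` set to whatever `next_provider` returns.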

Zero-config usage (nothing to configure, runs immediately):

h2m search "rust async trait"                       # DuckDuckGo (default)
h2m search "Turing machine" --provider wikipedia    # official MediaWiki API
h2m search "图灵机" --provider wikipedia --wikipedia-lang zh

The common query flags work the same way across all providers:

h2m search "rust" --limit 5 --time-range week       # last-week results
h2m search "rust" --sources web,news --country us
h2m search "rust" --language en --safesearch strict
h2m search "rust" --json                             # NDJSON (one hit per line)

Provider-specific keys (opt-in, via env vars or flags):

export BRAVE_API_KEY=...    ; h2m search "rust" --provider brave
export TAVILY_API_KEY=...   ; h2m search "rust" --provider tavily --include-answer
export H2M_SEARXNG_URL=...  ; h2m search "rust" --provider searxng

Tips:

CAPTCHA handling — when DuckDuckGo returns "kind":"captchaDetected" or "authFailed", switch to --provider wikipedia or a keyed provider. The error JSON always carries kind / provider / status fields.
Windows + system proxy — if your system proxy intercepts localhost requests (Clash, V2Ray, etc.), set NO_PROXY=127.0.0.1,localhost before pointing h2m at a self-hosted SearXNG instance.
Brave pagination — --limit up to 200 is supported (Brave caps at 20 per page; h2m paginates transparently via offset).
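
The Brave pagination tip boils down to simple arithmetic, sketched below in standalone Rust. The names `page_offsets` and `collect_hits` are hypothetical, and the sketch assumes the offset is a zero-based page index, so every request asks for a full page and the final list is truncated. How h2m implements this internally is not described here.

```rust
/// Results per page (the per-page cap mentioned in the tip).
pub const PAGE_SIZE: usize = 20;
/// Largest supported `--limit` for this provider.
pub const MAX_LIMIT: usize = 200;

/// Page offsets needed to satisfy `limit` results.
/// Assumes the offset counts pages (0, 1, 2, ...), not individual results.
pub fn page_offsets(limit: usize) -> Vec<usize> {
    let limit = limit.min(MAX_LIMIT);
    let pages = (limit + PAGE_SIZE - 1) / PAGE_SIZE; // ceiling division
    (0..pages).collect()
}

/// Flatten fetched pages in order and cut the result down to `limit` hits.
pub fn collect_hits<T>(pages: impl IntoIterator<Item = Vec<T>>, limit: usize) -> Vec<T> {
    pages
        .into_iter()
        .flatten()
        .take(limit.min(MAX_LIMIT))
        .collect()
}
```

For example, a limit of 45 needs pages 0, 1, and 2; the third page is fetched in full and the combined list is trimmed to 45 entries.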

Search + scrape (runs every hit through the full convert pipeline, streams NDJSON ScrapeResults):

h2m search "rust async" --scrape                 # raw markdown per hit
h2m search "rust async" --scrape --gfm --readable
h2m search "rust async" --scrape --selector article
h2m search "rust" --scrape -j 8 --timeout 20     # parallel scrape

A ready-made end-to-end smoke test lives at scripts/live_search_e2e.ps1 (Windows PowerShell) — it exercises DuckDuckGo and Wikipedia across English / Chinese / Japanese and prints a classified summary table.

JSON Output

convert single URL (pretty JSON):

{
  "markdown": "# Example Domain\n\n...",
  "metadata": {
    "title": "Example Domain",
    "description": "This domain is for use in illustrative examples.",
    "language": "en",
    "ogImage": "https://example.com/og.png",
    "sourceUrl": "https://example.com",
    "url": "https://example.com/",
    "statusCode": 200,
    "contentType": "text/html; charset=UTF-8",
    "elapsedMs": 234
  },
  "links": ["https://example.com/about"]
}

search response:

{
  "query": "rust async",
  "provider": "tavily",
  "answer": "Rust's async trait support stabilized in 1.75 ...",
  "web": [
    {
      "title": "Rust", "url": "https://rust-lang.org",
      "description": "...", "engine": "duckduckgo", "score": 0.92
    }
  ],
  "news": [],
  "images": [],
  "elapsedMs": 312
}

answer — LLM-generated summary (Tavily --include-answer flag, opt-in).
score — relevance in [0, 1] (Tavily only; other providers omit it).
engine — upstream backend name (SearXNG only; aggregators omit it).

Optional fields (those typed as Option in the Rust structs, such as answer, score, and engine) are omitted from the JSON when absent, keeping output lean.

Multiple inputs (convert batch, or search --scrape) stream NDJSON — one JSON object per line.
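
The two output conventions above, omitting absent optional fields and emitting one object per line, can be illustrated with a small hand-rolled serializer. This is a standalone sketch with a hypothetical `Hit` struct, not h2m's own types. In a serde-based codebase the same effect is usually achieved with `#[serde(skip_serializing_if = "Option::is_none")]`; whether h2m does exactly that is an assumption.

```rust
/// Hypothetical search hit with two optional fields.
pub struct Hit {
    pub title: String,
    pub url: String,
    pub score: Option<f64>,
    pub engine: Option<String>,
}

/// Minimal JSON string escaping (quotes, backslashes, control characters).
fn json_string(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

impl Hit {
    /// Serialize to a single-line JSON object, skipping `None` fields.
    /// Assumes `score` is finite (NaN or infinity are not valid JSON).
    pub fn to_json_line(&self) -> String {
        let mut fields = vec![
            format!("\"title\":{}", json_string(&self.title)),
            format!("\"url\":{}", json_string(&self.url)),
        ];
        if let Some(score) = self.score {
            fields.push(format!("\"score\":{score}"));
        }
        if let Some(engine) = &self.engine {
            fields.push(format!("\"engine\":{}", json_string(engine)));
        }
        format!("{{{}}}", fields.join(","))
    }
}

/// NDJSON: one JSON object per line, each terminated by a newline.
pub fn to_ndjson(hits: &[Hit]) -> String {
    hits.iter().map(|h| h.to_json_line() + "\n").collect()
}
```

Because each record is a complete object on its own line, a consumer can process results as they arrive instead of waiting for a closing bracket.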

Library Usage

// One-liner with CommonMark defaults
let md = h2m::convert("<h1>Hello</h1><p>World</p>");
assert_eq!(md, "# Hello\n\nWorld");

// Full control with the builder
use h2m::{Converter, Options};
use h2m::plugins::Gfm;
use h2m::rules::CommonMark;

let converter = Converter::builder()
    .options(Options::default())
    .use_plugin(&CommonMark)
    .use_plugin(&Gfm)
    .domain("example.com")
    .build();

let md = converter.convert(r#"<a href="/about">About</a>"#);
assert_eq!(md, "[About](https://example.com/about)");
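
The relative URL resolution in the last assertion can be pictured with the following standalone sketch. It is a simplification, not h2m's implementation: `resolve_href` is a hypothetical helper, it assumes the `https` scheme, and because only a domain is known it treats every relative path as root-relative.

```rust
/// Resolve `href` against a bare domain such as "example.com".
pub fn resolve_href(domain: &str, href: &str) -> String {
    // Absolute URLs, mail links, and in-page anchors pass through untouched.
    if href.contains("://") || href.starts_with("mailto:") || href.starts_with('#') {
        return href.to_string();
    }
    // Protocol-relative URLs only need a scheme (https is an assumption).
    if let Some(rest) = href.strip_prefix("//") {
        return format!("https://{rest}");
    }
    // Everything else is joined to the domain root.
    let path = href.strip_prefix('/').unwrap_or(href);
    format!("https://{domain}/{path}")
}
```

With this logic, `resolve_href("example.com", "/about")` gives `https://example.com/about`, matching the converter output shown above.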

Async Scraping

Enable the scrape feature for async HTTP scraping with built-in concurrency control, rate limiting, and streaming output:

use h2m::scrape::Scraper;

let scraper = Scraper::builder()
    .concurrency(8)
    .gfm(true)
    .extract_links(true)
    .build()?;

let result = scraper.scrape("https://example.com").await?;
println!("{}", result.markdown);

let urls = vec!["https://a.com".into(), "https://b.com".into()];
scraper.scrape_many_streaming(&urls, |result| {
    match result {
        Ok(r) => println!("{}", r.markdown),
        Err(e) => eprintln!("error: {e}"),
    }
}).await;
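
For readers unfamiliar with bounded concurrency, the idea behind `concurrency(8)` can be sketched with the standard library alone. The version below swaps in OS threads and a shared queue for the tokio-based pipeline, so it is a stand-in for the concept rather than how the `Scraper` works; `run_bounded` is a hypothetical name.

```rust
use std::collections::VecDeque;
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Run `work` over `jobs` with at most `workers` jobs in flight at once.
/// Results are returned in completion order, not input order.
pub fn run_bounded<T, R, F>(jobs: Vec<T>, workers: usize, work: F) -> Vec<R>
where
    T: Send + 'static,
    R: Send + 'static,
    F: Fn(T) -> R + Send + Sync + 'static,
{
    let queue = Arc::new(Mutex::new(jobs.into_iter().collect::<VecDeque<_>>()));
    let work = Arc::new(work);
    let (tx, rx) = mpsc::channel();

    let handles: Vec<_> = (0..workers.max(1))
        .map(|_| {
            let queue = Arc::clone(&queue);
            let work = Arc::clone(&work);
            let tx = tx.clone();
            thread::spawn(move || loop {
                // Hold the lock only long enough to take one job.
                let job = queue.lock().unwrap().pop_front();
                match job {
                    Some(job) => {
                        if tx.send(work(job)).is_err() {
                            break; // receiver is gone, stop early
                        }
                    }
                    None => break, // queue drained
                }
            })
        })
        .collect();

    // Drop the original sender so the receiver ends once all workers finish.
    drop(tx);
    let results: Vec<R> = rx.iter().collect();
    for handle in handles {
        handle.join().unwrap();
    }
    results
}
```

The worker count plays the role of the concurrency limit: no matter how many URLs are queued, only that many are processed at the same time.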

Web Search

The h2m-search crate exposes the same provider abstraction the CLI uses. The zero-config default is DuckDuckGo; no builder configuration required:

use h2m_search::{SearchClient, SearchQuery};

// Zero-config: uses DuckDuckGo (no API key, no env vars).
let client = SearchClient::builder().build()?;

let response = client
    .search(&SearchQuery::new("rust async").with_limit(5))
    .await?;

for hit in &response.web {
    println!("{} — {}", hit.title, hit.url);
}
# Ok::<_, Box<dyn std::error::Error>>(())

Design

CommonMark + GFM — full spec compliance with tables, strikethrough, task lists, reference-style links
Plugin architecture — extend with custom rules via the Rule trait
Async batch pipeline — tokio + reqwest, semaphore concurrency, streaming NDJSON (scrape feature)
Multi-provider search — SearchClient enum with static dispatch, one Cargo feature per provider
Search + scrape composition — search --scrape funnels hits through the same Scraper pipeline, reusing all formatting / extraction flags
JSON output — nested camelCase metadata aligned with Firecrawl conventions
Smart readable extraction — two-phase content detection: semantic selectors → noise stripping
Zero-copy fast paths — Cow<str> escaping, zero unsafe, Send + Sync
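
The Cow<str> fast path named in the last bullet follows a common Rust pattern, sketched here in isolation. The function `escape_markdown` and its small escape set are illustrative assumptions, not h2m's actual escaping rules: when the input contains nothing to escape, the original slice is returned without allocating.

```rust
use std::borrow::Cow;

/// Characters escaped in this sketch (an illustrative subset, not a full list).
fn needs_escape(c: char) -> bool {
    matches!(c, '\\' | '*' | '_' | '`' | '[' | ']')
}

/// Escape Markdown-significant characters, borrowing when nothing changes.
pub fn escape_markdown(text: &str) -> Cow<'_, str> {
    // Fast path: no special character found, so return the input as-is.
    let Some(first) = text.find(needs_escape) else {
        return Cow::Borrowed(text);
    };

    // Slow path: copy the clean prefix, then escape the remainder.
    let mut out = String::with_capacity(text.len() + 8);
    out.push_str(&text[..first]);
    for c in text[first..].chars() {
        if needs_escape(c) {
            out.push('\\');
        }
        out.push(c);
    }
    Cow::Owned(out)
}
```

Most text nodes in a typical page contain no special characters, so the borrowed branch is the common case and the allocation is paid only when escaping is actually needed.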

Supported HTML Elements

CommonMark (built-in)

Element	Markdown Output
`<h1>`-`<h6>`	`# Heading` (ATX) or underline (Setext)
`<p>`, `<div>`, `<section>`, `<article>`	Block paragraph
`<strong>`, `<b>`	`**bold**`
`<em>`, `<i>`	`*italic*`
`<code>`, `<kbd>`, `<samp>`, `<tt>`	`` `inline code` ``
`<pre><code>`	Fenced code block with language detection
`<a href="...">`	`[text](url)` or reference-style
`<img src="..." alt="...">`	`![alt](src "title")`
`<ul>`, `<ol>`, `<li>`	Bullet/numbered lists with nesting
`<blockquote>`	`> quoted text`
`<hr>`	`---`
`<br>`	Hard line break
`<iframe>`	`[iframe](url)`
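
The "reference-style" alternative for links in the table above separates the link text from its URL. A minimal standalone sketch of the bookkeeping involved follows; `LinkRefs` is a hypothetical type, and the numbered `[text][n]` form is an assumption, since CommonMark also allows collapsed and shortcut references and the variant h2m emits is not stated here.

```rust
/// Collects link targets and hands out numeric reference labels.
#[derive(Default)]
pub struct LinkRefs {
    urls: Vec<String>,
}

impl LinkRefs {
    pub fn new() -> Self {
        Self::default()
    }

    /// Return the inline part, e.g. `[About][1]`, reusing the number
    /// when the same URL has been seen before.
    pub fn link(&mut self, text: &str, url: &str) -> String {
        let n = match self.urls.iter().position(|u| u == url) {
            Some(i) => i + 1,
            None => {
                self.urls.push(url.to_string());
                self.urls.len()
            }
        };
        format!("[{text}][{n}]")
    }

    /// Return the definition list to append at the end of the document.
    pub fn definitions(&self) -> String {
        self.urls
            .iter()
            .enumerate()
            .map(|(i, url)| format!("[{}]: {url}", i + 1))
            .collect::<Vec<_>>()
            .join("\n")
    }
}
```

The body text then stays readable with short `[text][n]` markers, and all URLs are gathered in one block of definitions at the bottom.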

GFM Extensions (with `--gfm`)

Element	Markdown Output
`<table>`	GFM pipe table with alignment
`<del>`, `<s>`, `<strike>`	`~~strikethrough~~`
`<input type="checkbox">`	`[x]` or `[ ]` (task list)
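
As an illustration of what "GFM pipe table with alignment" means in practice, the standalone sketch below renders a header row, a delimiter row carrying the alignment markers, and the body rows. `Align` and `render_table` are hypothetical names; the code follows the GFM table syntax but is not taken from h2m.

```rust
/// Column alignment, as encoded in the GFM delimiter row.
#[derive(Clone, Copy)]
pub enum Align {
    Unset,
    Left,
    Center,
    Right,
}

/// Pipes would split a cell and newlines would end the row, so neutralise both.
fn escape_cell(cell: &str) -> String {
    cell.replace('|', "\\|").replace('\n', " ")
}

fn row(cells: Vec<String>) -> String {
    format!("| {} |", cells.join(" | "))
}

/// Render a GFM pipe table. Rows are padded or truncated to the header width.
pub fn render_table(headers: &[&str], aligns: &[Align], rows: &[Vec<&str>]) -> String {
    let mut lines = vec![row(headers.iter().map(|h| escape_cell(h)).collect())];

    // Delimiter row: colons mark the alignment of each column.
    let delimiters: Vec<String> = (0..headers.len())
        .map(|i| {
            let marker = match aligns.get(i).copied().unwrap_or(Align::Unset) {
                Align::Unset => "---",
                Align::Left => ":---",
                Align::Center => ":---:",
                Align::Right => "---:",
            };
            marker.to_string()
        })
        .collect();
    lines.push(row(delimiters));

    for cells in rows {
        let padded: Vec<String> = (0..headers.len())
            .map(|i| escape_cell(cells.get(i).copied().unwrap_or("")))
            .collect();
        lines.push(row(padded));
    }
    lines.join("\n")
}
```

An HTML table whose cells carry left, center, or right alignment would map onto the corresponding `Align` values column by column.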

Auto-removed

Element	Behavior
`<script>`	Removed (content stripped)
`<style>`	Removed (content stripped)
`<noscript>`	Removed (content stripped)

Custom Rules

Extend the converter with your own rules by implementing the Rule trait:

use h2m::{Converter, Rule, Action, Context};
use h2m::rules::CommonMark;
use scraper::ElementRef;

#[derive(Debug)]
struct HighlightRule;
impl Rule for HighlightRule {
    fn tags(&self) -> &'static [&'static str] { &["mark"] }

    fn apply(&self, content: &str, _el: &ElementRef<'_>, _ctx: &mut Context<'_>) -> Action {
        Action::Replace(format!("=={content}=="))
    }
}

let mut builder = Converter::builder()
    .use_plugin(CommonMark);
builder.add_rule(HighlightRule);
let converter = builder.build();

let md = converter.convert("<p>This is <mark>important</mark></p>");
assert!(md.contains("==important=="));

License

Licensed under either of:

Apache License, Version 2.0 (LICENSE-APACHE or https://www.apache.org/licenses/LICENSE-2.0)
MIT License (LICENSE-MIT or https://opensource.org/licenses/MIT)

at your option.

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this project shall be dual-licensed as above, without any additional terms or conditions.

A QNTX open-source project.

Code is law. We write both.
