Fast, extensible HTML-to-Markdown converter with optional web search — CommonMark + GFM, plugin architecture.
H2M converts HTML into clean Markdown with full CommonMark compliance and GitHub Flavored Markdown extensions. It uses a plugin-based rule system, supports reference-style links, relative URL resolution, and ships with an async CLI that can also search the web and pipe results through the same conversion pipeline. Search works zero-config out of the box via DuckDuckGo and Wikipedia, and also integrates with SearXNG, Brave Search, and Tavily.
Shell (macOS / Linux):

```sh
curl -fsSL https://sh.qntx.fun/h2m | sh
```

PowerShell (Windows):

```powershell
irm https://sh.qntx.fun/h2m/ps | iex
```

Or via Cargo:

```sh
cargo install h2m-cli
```

H2M uses a subcommand tree:
```text
h2m <COMMAND> [OPTIONS] ...

Commands:
  convert   Convert HTML to Markdown (URL, file, stdin)
  search    Search the web and optionally scrape each hit to Markdown
```
```sh
h2m convert https://example.com
h2m convert page.html
curl -s https://example.com | h2m convert
echo '<h1>Hi</h1>' | h2m convert
```

Content extraction:

```sh
h2m convert -r https://blog.example.com/post          # smart readable
h2m convert -s article https://blog.example.com/post  # CSS selector
h2m convert -s '#content' https://example.com         # by ID
```

JSON output (agents / programmatic use):

```sh
h2m convert --json https://example.com                  # pretty JSON
h2m convert --json --extract-links https://example.com
h2m convert --json url1 url2 url3                       # NDJSON streaming
h2m convert --json --urls urls.txt -j 8 --delay 100
```

Formatting:

```sh
h2m convert --gfm https://example.com          # tables, strikethrough, task lists
h2m convert --link-style referenced page.html  # reference-style links
h2m convert --heading-style setext page.html   # === / --- underlines
h2m convert --user-agent "MyBot/1.0" https://example.com
h2m convert -o output.md https://example.com
```

H2M ships with five search providers. The default is `duckduckgo`, which
requires no API key, no registration, and no environment variables:
| Provider | Requires | Free tier | Notes |
|---|---|---|---|
| DuckDuckGo | - (zero-config) | unlimited* | Default. HTML scraping + lite fallback |
| Wikipedia | - (zero-config) | unlimited | Official MediaWiki API, 300+ languages |
| SearXNG | `H2M_SEARXNG_URL` | yes (self-host) | Open-source meta-search |
| Brave | `BRAVE_API_KEY` | $5/month credit | Independent index, transparent pagination |
| Tavily | `TAVILY_API_KEY` | 1000 req/month | AI-tuned snippets + LLM answers |
\* DuckDuckGo uses unauthenticated HTML endpoints. Aggressive or datacenter
traffic may trigger anti-bot challenges; the provider automatically falls back to
lite.duckduckgo.com and emits a structured `"kind":"captchaDetected"` error
so you can automate provider switching. Wikipedia is the recommended fallback.
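The structured error makes that switch scriptable. A minimal sketch, with a fabricated error line standing in for real h2m output (the field shape follows the description above):

```shell
# Fabricated error line, shaped like the structured error described above.
err='{"kind":"captchaDetected","provider":"duckduckgo","status":403}'

# Pick a fallback provider when DuckDuckGo is blocked.
case "$err" in
  *'"kind":"captchaDetected"'* | *'"kind":"authFailed"'*) provider=wikipedia ;;
  *) provider=duckduckgo ;;
esac
echo "$provider"
```

A real wrapper would capture the error JSON emitted by h2m into `err` and re-run the same query with `--provider "$provider"`.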
Zero-config usage (nothing to configure, runs immediately):

```sh
h2m search "rust async trait"                     # DuckDuckGo (default)
h2m search "Turing machine" --provider wikipedia  # official MediaWiki API
h2m search "图灵机" --provider wikipedia --wikipedia-lang zh
```

All the usual flags work uniformly across providers:
```sh
h2m search "rust" --limit 5 --time-range week   # last-week results
h2m search "rust" --sources web,news --country us
h2m search "rust" --language en --safesearch strict
h2m search "rust" --json                        # NDJSON (one hit per line)
```

Provider-specific keys (opt-in, via env vars or flags):

```sh
export BRAVE_API_KEY=... ; h2m search "rust" --provider brave
export TAVILY_API_KEY=... ; h2m search "rust" --provider tavily --include-answer
export H2M_SEARXNG_URL=... ; h2m search "rust" --provider searxng
```

Tips:
- CAPTCHA handling — when DuckDuckGo returns `"kind":"captchaDetected"` or `"authFailed"`, switch to `--provider wikipedia` or a keyed provider. The error JSON always carries `kind` / `provider` / `status` fields.
- Windows + system proxy — if your system proxy intercepts localhost requests (Clash/V2Ray/etc.), set `NO_PROXY=127.0.0.1,localhost` before pointing `h2m` at a self-hosted SearXNG instance.
- Brave pagination — `--limit` up to 200 is supported (Brave caps at 20 per page; `h2m` paginates transparently via `offset`).
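The proxy tip above, as a concrete setup sketch (the SearXNG URL and port are placeholders for your own instance):

```shell
# Keep loopback traffic off the system proxy (Clash/V2Ray commonly intercept it).
export NO_PROXY=127.0.0.1,localhost
# Placeholder: point H2M at your self-hosted SearXNG instance.
export H2M_SEARXNG_URL=http://127.0.0.1:8080
```

With both variables set, `h2m search "rust" --provider searxng` reaches the local instance directly.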
Search + scrape (runs every hit through the full convert pipeline,
streams NDJSON `ScrapeResult`s):

```sh
h2m search "rust async" --scrape                 # raw markdown per hit
h2m search "rust async" --scrape --gfm --readable
h2m search "rust async" --scrape --selector article
h2m search "rust" --scrape -j 8 --timeout 20     # parallel scrape
```

A ready-made end-to-end smoke test lives at scripts/live_search_e2e.ps1
(Windows PowerShell) — it exercises DuckDuckGo and Wikipedia across English /
Chinese / Japanese and prints a classified summary table.
convert single URL (pretty JSON):

```json
{
  "markdown": "# Example Domain\n\n...",
  "metadata": {
    "title": "Example Domain",
    "description": "This domain is for use in illustrative examples.",
    "language": "en",
    "ogImage": "https://example.com/og.png",
    "sourceUrl": "https://example.com",
    "url": "https://example.com/",
    "statusCode": 200,
    "contentType": "text/html; charset=UTF-8",
    "elapsedMs": 234
  },
  "links": ["https://example.com/about"]
}
```

search response:
```json
{
  "query": "rust async",
  "provider": "tavily",
  "answer": "Rust's async trait support stabilized in 1.75 ...",
  "web": [
    {
      "title": "Rust", "url": "https://rust-lang.org",
      "description": "...", "engine": "duckduckgo", "score": 0.92
    }
  ],
  "news": [],
  "images": [],
  "elapsedMs": 312
}
```

- `answer` — LLM-generated summary (Tavily `--include-answer` flag, opt-in).
- `score` — relevance in `[0, 1]` (Tavily only; other providers omit it).
- `engine` — upstream backend name (SearXNG only; aggregators omit it).
Fields marked `Option` are dropped from the JSON when absent, keeping output lean.
Multiple inputs (`convert` batch, or `search --scrape`) stream NDJSON — one JSON object per line.
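Since each NDJSON line is a complete JSON object, a consumer can process results as they stream in. A minimal sketch with fabricated sample lines (a real consumer would read from an h2m pipe and use jq or a proper JSON parser instead of sed):

```shell
# Two fabricated result lines standing in for "h2m convert --json url1 url2".
printf '%s\n' \
  '{"markdown":"# A","metadata":{"url":"https://a.com","statusCode":200}}' \
  '{"markdown":"# B","metadata":{"url":"https://b.com","statusCode":404}}' |
while IFS= read -r line; do
  # Extract the url field from each object as it arrives.
  printf '%s\n' "$line" | sed -n 's/.*"url":"\([^"]*\)".*/\1/p'
done
```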
```rust
// One-liner with CommonMark defaults
let md = h2m::convert("<h1>Hello</h1><p>World</p>");
assert_eq!(md, "# Hello\n\nWorld");
```

```rust
// Full control with the builder
use h2m::{Converter, Options};
use h2m::plugins::Gfm;
use h2m::rules::CommonMark;

let converter = Converter::builder()
    .options(Options::default())
    .use_plugin(&CommonMark)
    .use_plugin(&Gfm)
    .domain("example.com")
    .build();

let md = converter.convert(r#"<a href="/about">About</a>"#);
assert_eq!(md, "[About](https://example.com/about)");
```

Enable the `scrape` feature for async HTTP scraping with built-in concurrency control, rate limiting, and streaming output:
```rust
use h2m::scrape::Scraper;

let scraper = Scraper::builder()
    .concurrency(8)
    .gfm(true)
    .extract_links(true)
    .build()?;

let result = scraper.scrape("https://example.com").await?;
println!("{}", result.markdown);

let urls = vec!["https://a.com".into(), "https://b.com".into()];
scraper.scrape_many_streaming(&urls, |result| {
    match result {
        Ok(r) => println!("{}", r.markdown),
        Err(e) => eprintln!("error: {e}"),
    }
}).await;
```

The h2m-search crate exposes the same provider abstraction the CLI uses.
The zero-config default is DuckDuckGo; no builder configuration required:
```rust
use h2m_search::{SearchClient, SearchQuery};

// Zero-config: uses DuckDuckGo (no API key, no env vars).
let client = SearchClient::builder().build()?;
let response = client
    .search(&SearchQuery::new("rust async").with_limit(5))
    .await?;
for hit in &response.web {
    println!("{} — {}", hit.title, hit.url);
}
# Ok::<_, Box<dyn std::error::Error>>(())
```

- CommonMark + GFM — full spec compliance with tables, strikethrough, task lists, reference-style links
- Plugin architecture — extend with custom rules via the `Rule` trait
- Async batch pipeline — `tokio` + `reqwest`, semaphore concurrency, streaming NDJSON (`scrape` feature)
- Multi-provider search — `SearchClient` enum with static dispatch, one Cargo feature per provider
- Search + scrape composition — `search --scrape` funnels hits through the same `Scraper` pipeline, reusing all formatting / extraction flags
- JSON output — nested camelCase metadata aligned with Firecrawl conventions
- Smart readable extraction — two-phase content detection: semantic selectors → noise stripping
- Zero-copy fast paths — `Cow<str>` escaping, zero `unsafe`, `Send + Sync`
| Element | Markdown Output |
|---|---|
| `<h1>`-`<h6>` | `# Heading` (ATX) or underline (Setext) |
| `<p>`, `<div>`, `<section>`, `<article>` | Block paragraph |
| `<strong>`, `<b>` | `**bold**` |
| `<em>`, `<i>` | `*italic*` |
| `<code>`, `<kbd>`, `<samp>`, `<tt>` | `` `inline code` `` |
| `<pre><code>` | Fenced code block with language detection |
| `<a href="...">` | `[text](url)` or reference-style |
| `<img src="..." alt="...">` | `![alt](src)` |
| `<ul>`, `<ol>`, `<li>` | Bullet/numbered lists with nesting |
| `<blockquote>` | `> quoted text` |
| `<hr>` | `---` |
| `<br>` | Hard line break |
| `<iframe>` | `[iframe](url)` |
| Element | Markdown Output |
|---|---|
| `<table>` | GFM pipe table with alignment |
| `<del>`, `<s>`, `<strike>` | `~~strikethrough~~` |
| `<input type="checkbox">` | `[x]` or `[ ]` (task list) |
| Element | Behavior |
|---|---|
| `<script>` | Removed (content stripped) |
| `<style>` | Removed (content stripped) |
| `<noscript>` | Removed (content stripped) |
Extend the converter with your own rules by implementing the `Rule` trait:

```rust
use h2m::{Converter, Rule, Action, Context};
use h2m::rules::CommonMark;
use scraper::ElementRef;

#[derive(Debug)]
struct HighlightRule;

impl Rule for HighlightRule {
    fn tags(&self) -> &'static [&'static str] { &["mark"] }
    fn apply(&self, content: &str, _el: &ElementRef<'_>, _ctx: &mut Context<'_>) -> Action {
        Action::Replace(format!("=={content}=="))
    }
}

let mut builder = Converter::builder()
    .use_plugin(CommonMark);
builder.add_rule(HighlightRule);
let converter = builder.build();

let md = converter.convert("<p>This is <mark>important</mark></p>");
assert!(md.contains("==important=="));
```

Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE or https://www.apache.org/licenses/LICENSE-2.0)
- MIT License (LICENSE-MIT or https://opensource.org/licenses/MIT)
at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this project shall be dual-licensed as above, without any additional terms or conditions.
