fix: markdown conversion preprocessing by RohitR311 · Pull Request #1045 · getmaxun/maxun

RohitR311 · 2026-04-23T08:35:26Z

What this PR does?

Improves preprocessing of raw data for markdown conversion.

Summary by CodeRabbit

Bug Fixes
- Improved superscript conversion in markdown; empty results are omitted.
- Tighter link handling: avoids unsafe links, omits default/empty link labels, preserves heading-styled links and discards empty headings.
- Preserves fenced/tilde code blocks during post-processing to prevent unwanted escaping.
- More targeted HTML cleanup and UI-artifact pruning scoped to the main content for cleaner extraction.

coderabbitai · 2026-04-23T08:35:38Z

Walkthrough

Refines markdown conversion: adds <sup> → ^...^ handling, tightens inline link extraction (skip javascript:, return empty when no text, convert heading-like link text), scopes HTML noise removal to selected content, and preserves fenced code blocks during link-fixing. (50 words)

Changes

Cohort / File(s)	Summary
Turndown, Link & Heading Rules `server/src/markdownify/markdown.ts`	Adds `<sup>` → `^...^` rule; `inlineLink` no longer emits a default label, returns `""` when text can't be derived, normalizes/blocks `javascript:` hrefs, removes same-domain suppression, converts heading-pattern link text into heading-styled links, drops `cleanUrl` usage.
HTML Tidying & Noise Removal `server/src/markdownify/markdown.ts`	Replaces broad DOM pruning heuristics with targeted top-level/ARIA removals and an `innerNoiseSelectors` list applied only inside the chosen `$content`; UI artifact (leaf-node) pruning retained but scoped to `$content`.
Markdown Post-processing / Link Fixes `server/src/markdownify/markdown.ts`	`fixBrokenLinks` extracts fenced/tilde code blocks first to avoid escaping inside them; adjusts newline/bracket escaping logic and removes the whitespace-before-newline normalization pass.

Sequence Diagram(s)

(omitted)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

fix: improve markdown conversion #1037 — Modifies the same server/src/markdownify/markdown.ts areas (turndown link/heading handling and HTML tidying), overlapping code-level changes.

Suggested reviewers

amhsirak

Poem

🐰 I nibble tags and hop on carets bright,
Superscripts spring into the light.
Links that whisper "javascript" I skip,
Headings and code stay safe in their crypt.
Carrots crunch — neat Markdown takes flight!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'fix: markdown conversion preprocessing' accurately reflects the main changes, which involve preprocessing improvements for markdown conversion including Turndown parsing updates, HTML tidying refinements, and markdown post-processing fixes.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch md-fix

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@server/src/markdownify/markdown.ts`:
- Around line 71-72: Normalize the href before the javascript check: after
retrieving href via node.getAttribute("href") and trimming, strip
control/whitespace characters and convert to lowercase, then test
startsWith("javascript:"); update the href handling (the variable named href) so
mixed-case or control-character-padded values like "JavaScript:" or
"java\nscript:" are normalized and blocked.
- Around line 221-236: The fixBrokenLinks function fails to recognize GFM fenced
code blocks with 3+ backticks or tilde fences; replace the simple
/(```[\s\S]*?```)/g split with a more robust fencedCodeBlock regex that matches
(?:^|\n)[ \t]{0,3}(`{3,}|~{3,})...closing backreference so it captures 3+
backticks or tildes and surrounding indentation, then split md with that
fencedCodeBlock variable and adjust the parity check from i % 2 === 1 to i % 3
=== 1 to account for the extra capture groups returned by split; keep the rest
of fixBrokenLinks logic unchanged (depth tracking and newline escaping inside
bracketed text).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bb5e94d0-ed94-4204-84fc-d529db1dfedd

📥 Commits

Reviewing files that changed from the base of the PR and between 1f0da54 and eb4377d.

📒 Files selected for processing (1)

server/src/markdownify/markdown.ts

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

server/src/markdownify/markdown.ts (1)

222-238: ⚠️ Potential issue | 🟠 Major

Handle indented GFM fences before mutating link text.

The new splitter still misses valid fenced blocks indented by up to three spaces, so fixBrokenLinks can continue escaping newlines inside code. This is the same fenced-code preservation gap previously raised, just partially addressed.

Proposed fix

 function fixBrokenLinks(md: string): string {
-  const parts = md.split(/((?:^|\n)(`{3,}|~{3,})[\s\S]*?\n\2(?:\n|$))/g);
+  const fencedCodeBlock = /((?:^|\n)[ \t]{0,3}(`{3,}|~{3,})[^\n]*\n[\s\S]*?\n[ \t]{0,3}\2[ \t]*(?=\n|$))/g;
+  const parts = md.split(fencedCodeBlock);
   return parts.map((part, i) => {
     if (i % 3 === 1) return part;
     if (i % 3 === 2) return "";

Verification script:

#!/bin/bash
# Description: Demonstrate that the current splitter misses valid indented fenced code blocks.

node - <<'NODE'
const current = /((?:^|\n)(`{3,}|~{3,})[\s\S]*?\n\2(?:\n|$))/g;
const proposed = /((?:^|\n)[ \t]{0,3}(`{3,}|~{3,})[^\n]*\n[\s\S]*?\n[ \t]{0,3}\2[ \t]*(?=\n|$))/g;

const md = 'Before\n\n   ```js\nconst value = "[broken\\nlink";\n   ```\n\nAfter';

console.log({
  currentMatchesIndentedFence: current.test(md),
  proposedMatchesIndentedFence: proposed.test(md),
});
NODE

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@server/src/markdownify/markdown.ts` around lines 222 - 238, fixBrokenLinks
currently uses a splitter regex that fails to match fenced code blocks indented
up to three spaces, so newlines inside those indented fences are still being
mutated; update the splitter to a GFM-compatible regex that allows up to three
leading spaces before opening and closing fences and accepts an optional
info-string (for example replace the existing
/((?:^|\n)(`{3,}|~{3,})[\s\S]*?\n\2(?:\n|$))/g with a pattern that matches
(?:^|\n)[ \t]{0,3} followed by the opening fence and optional info, then the
fence contents, and a closing fence optionally preceded by up to three spaces
and followed by newline or end-of-file), so that the part-splitting in
fixBrokenLinks correctly preserves indented fenced code blocks before escaping
link text.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@server/src/markdownify/markdown.ts`:
- Around line 212-216: The code inserts raw title text into HTML (variable title
-> resultHtml) without escaping, so markup in titles becomes active; fix by
HTML-escaping the title before interpolation (or create the H1 via DOM/text APIs
instead of string interpolation) when building resultHtml in the markdown
conversion code that uses title, resultHtml and $content (e.g., replace
`<h1>${title}</h1>\n${resultHtml}` with an escaped-title variant or construct a
h1 element and use its safe .text() output).

---

Duplicate comments:
In `@server/src/markdownify/markdown.ts`:
- Around line 222-238: fixBrokenLinks currently uses a splitter regex that fails
to match fenced code blocks indented up to three spaces, so newlines inside
those indented fences are still being mutated; update the splitter to a
GFM-compatible regex that allows up to three leading spaces before opening and
closing fences and accepts an optional info-string (for example replace the
existing /((?:^|\n)(`{3,}|~{3,})[\s\S]*?\n\2(?:\n|$))/g with a pattern that
matches (?:^|\n)[ \t]{0,3} followed by the opening fence and optional info, then
the fence contents, and a closing fence optionally preceded by up to three
spaces and followed by newline or end-of-file), so that the part-splitting in
fixBrokenLinks correctly preserves indented fenced code blocks before escaping
link text.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 55b4b8b5-01ff-47dc-8b23-a83b01636dd7

📥 Commits

Reviewing files that changed from the base of the PR and between 399aac5 and 53207b6.

📒 Files selected for processing (1)

server/src/markdownify/markdown.ts

coderabbitai · 2026-04-23T09:16:20Z

  const title = $("title").text().trim() || $("h1").first().text().trim();
-  let resultHtml = contentToProcess.html() || "";
+  let resultHtml = $content.html() || "";

  if (title && !resultHtml.includes(title)) {
    resultHtml = `<h1>${title}</h1>\n${resultHtml}`;


⚠️ Potential issue | 🟡 Minor

Escape the title before injecting it as HTML.

title is text, but Line 216 interpolates it as markup. A page title like <b>Sale</b> will be converted as an actual <b> element instead of literal title text.

Proposed fix

const title = $("title").text().trim() || $("h1").first().text().trim(); let resultHtml = $content.html() || ""; if (title && !resultHtml.includes(title)) { - resultHtml = `<h1>${title}</h1>\n${resultHtml}`; + const escapedTitle = $("<div>").text(title).html() || ""; + resultHtml = `<h1>${escapedTitle}</h1>\n${resultHtml}`; }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@server/src/markdownify/markdown.ts` around lines 212 - 216, The code inserts raw title text into HTML (variable title -> resultHtml) without escaping, so markup in titles becomes active; fix by HTML-escaping the title before interpolation (or create the H1 via DOM/text APIs instead of string interpolation) when building resultHtml in the markdown conversion code that uses title, resultHtml and $content (e.g., replace `<h1>${title}</h1>\n${resultHtml}` with an escaped-title variant or construct a h1 element and use its safe .text() output).

fix: link structure and whitespace cleanup

eb4377d

coderabbitai Bot reviewed Apr 23, 2026

View reviewed changes

Comment thread server/src/markdownify/markdown.ts Outdated

Comment thread server/src/markdownify/markdown.ts

RohitR311 added 2 commits April 23, 2026 14:32

fix: normalize href cleanup

399aac5

fix: fence detection code blocks

53207b6

coderabbitai Bot reviewed Apr 23, 2026

View reviewed changes

RohitR311 added Type: Enhancement Improvements to existing features Scope: Ext Issues/PRs related to core extraction labels Apr 23, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: markdown conversion preprocessing#1045

fix: markdown conversion preprocessing#1045
RohitR311 wants to merge 3 commits intodevelopfrom
md-fix

RohitR311 commented Apr 23, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 23, 2026 •

edited

Loading

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Apr 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

RohitR311 commented Apr 23, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

RohitR311 commented Apr 23, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 23, 2026 •

edited

Loading