Skip to content

fix: markdown conversion preprocessing#1045

Open
RohitR311 wants to merge 3 commits intodevelopfrom
md-fix
Open

fix: markdown conversion preprocessing#1045
RohitR311 wants to merge 3 commits intodevelopfrom
md-fix

Conversation

@RohitR311
Copy link
Copy Markdown
Collaborator

@RohitR311 RohitR311 commented Apr 23, 2026

What this PR does?

Improves preprocessing of raw data for markdown conversion.

Summary by CodeRabbit

  • Bug Fixes
    • Improved superscript conversion in markdown; empty results are omitted.
    • Tighter link handling: avoids unsafe links, omits default/empty link labels, preserves heading-styled links and discards empty headings.
    • Preserves fenced/tilde code blocks during post-processing to prevent unwanted escaping.
    • More targeted HTML cleanup and UI-artifact pruning scoped to the main content for cleaner extraction.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 23, 2026

Walkthrough

Refines markdown conversion: adds <sup>^...^ handling, tightens inline link extraction (skip javascript:, return empty when no text, convert heading-like link text), scopes HTML noise removal to selected content, and preserves fenced code blocks during link-fixing. (50 words)

Changes

Cohort / File(s) Summary
Turndown, Link & Heading Rules
server/src/markdownify/markdown.ts
Adds <sup>^...^ rule; inlineLink no longer emits a default label, returns "" when text can't be derived, normalizes/blocks javascript: hrefs, removes same-domain suppression, converts heading-pattern link text into heading-styled links, drops cleanUrl usage.
HTML Tidying & Noise Removal
server/src/markdownify/markdown.ts
Replaces broad DOM pruning heuristics with targeted top-level/ARIA removals and an innerNoiseSelectors list applied only inside the chosen $content; UI artifact (leaf-node) pruning retained but scoped to $content.
Markdown Post-processing / Link Fixes
server/src/markdownify/markdown.ts
fixBrokenLinks extracts fenced/tilde code blocks first to avoid escaping inside them; adjusts newline/bracket escaping logic and removes the whitespace-before-newline normalization pass.

Sequence Diagram(s)

(omitted)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • amhsirak

Poem

🐰 I nibble tags and hop on carets bright,
Superscripts spring into the light.
Links that whisper "javascript" I skip,
Headings and code stay safe in their crypt.
Carrots crunch — neat Markdown takes flight!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix: markdown conversion preprocessing' accurately reflects the main changes, which involve preprocessing improvements for markdown conversion including Turndown parsing updates, HTML tidying refinements, and markdown post-processing fixes.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch md-fix

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@server/src/markdownify/markdown.ts`:
- Around line 71-72: Normalize the href before the javascript check: after
retrieving href via node.getAttribute("href") and trimming, strip
control/whitespace characters and convert to lowercase, then test
startsWith("javascript:"); update the href handling (the variable named href) so
mixed-case or control-character-padded values like "JavaScript:" or
"java\nscript:" are normalized and blocked.
- Around line 221-236: The fixBrokenLinks function fails to recognize GFM fenced
code blocks with 3+ backticks or tilde fences; replace the simple
/(```[\s\S]*?```)/g split with a more robust fencedCodeBlock regex that matches
(?:^|\n)[ \t]{0,3}(`{3,}|~{3,})...closing backreference so it captures 3+
backticks or tildes and surrounding indentation, then split md with that
fencedCodeBlock variable and adjust the parity check from i % 2 === 1 to i % 3
=== 1 to account for the extra capture groups returned by split; keep the rest
of fixBrokenLinks logic unchanged (depth tracking and newline escaping inside
bracketed text).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bb5e94d0-ed94-4204-84fc-d529db1dfedd

📥 Commits

Reviewing files that changed from the base of the PR and between 1f0da54 and eb4377d.

📒 Files selected for processing (1)
  • server/src/markdownify/markdown.ts

Comment thread server/src/markdownify/markdown.ts Outdated
Comment thread server/src/markdownify/markdown.ts
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
server/src/markdownify/markdown.ts (1)

222-238: ⚠️ Potential issue | 🟠 Major

Handle indented GFM fences before mutating link text.

The new splitter still misses valid fenced blocks indented by up to three spaces, so fixBrokenLinks can continue escaping newlines inside code. This is the same fenced-code preservation gap previously raised, just partially addressed.

Proposed fix
 function fixBrokenLinks(md: string): string {
-  const parts = md.split(/((?:^|\n)(`{3,}|~{3,})[\s\S]*?\n\2(?:\n|$))/g);
+  const fencedCodeBlock = /((?:^|\n)[ \t]{0,3}(`{3,}|~{3,})[^\n]*\n[\s\S]*?\n[ \t]{0,3}\2[ \t]*(?=\n|$))/g;
+  const parts = md.split(fencedCodeBlock);
   return parts.map((part, i) => {
     if (i % 3 === 1) return part;
     if (i % 3 === 2) return "";

Verification script:

#!/bin/bash
# Description: Demonstrate that the current splitter misses valid indented fenced code blocks.

node - <<'NODE'
const current = /((?:^|\n)(`{3,}|~{3,})[\s\S]*?\n\2(?:\n|$))/g;
const proposed = /((?:^|\n)[ \t]{0,3}(`{3,}|~{3,})[^\n]*\n[\s\S]*?\n[ \t]{0,3}\2[ \t]*(?=\n|$))/g;

const md = 'Before\n\n   ```js\nconst value = "[broken\\nlink";\n   ```\n\nAfter';

console.log({
  currentMatchesIndentedFence: current.test(md),
  proposedMatchesIndentedFence: proposed.test(md),
});
NODE
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/src/markdownify/markdown.ts` around lines 222 - 238, fixBrokenLinks
currently uses a splitter regex that fails to match fenced code blocks indented
up to three spaces, so newlines inside those indented fences are still being
mutated; update the splitter to a GFM-compatible regex that allows up to three
leading spaces before opening and closing fences and accepts an optional
info-string (for example replace the existing
/((?:^|\n)(`{3,}|~{3,})[\s\S]*?\n\2(?:\n|$))/g with a pattern that matches
(?:^|\n)[ \t]{0,3} followed by the opening fence and optional info, then the
fence contents, and a closing fence optionally preceded by up to three spaces
and followed by newline or end-of-file), so that the part-splitting in
fixBrokenLinks correctly preserves indented fenced code blocks before escaping
link text.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@server/src/markdownify/markdown.ts`:
- Around line 212-216: The code inserts raw title text into HTML (variable title
-> resultHtml) without escaping, so markup in titles becomes active; fix by
HTML-escaping the title before interpolation (or create the H1 via DOM/text APIs
instead of string interpolation) when building resultHtml in the markdown
conversion code that uses title, resultHtml and $content (e.g., replace
`<h1>${title}</h1>\n${resultHtml}` with an escaped-title variant or construct a
h1 element and use its safe .text() output).

---

Duplicate comments:
In `@server/src/markdownify/markdown.ts`:
- Around line 222-238: fixBrokenLinks currently uses a splitter regex that fails
to match fenced code blocks indented up to three spaces, so newlines inside
those indented fences are still being mutated; update the splitter to a
GFM-compatible regex that allows up to three leading spaces before opening and
closing fences and accepts an optional info-string (for example replace the
existing /((?:^|\n)(`{3,}|~{3,})[\s\S]*?\n\2(?:\n|$))/g with a pattern that
matches (?:^|\n)[ \t]{0,3} followed by the opening fence and optional info, then
the fence contents, and a closing fence optionally preceded by up to three
spaces and followed by newline or end-of-file), so that the part-splitting in
fixBrokenLinks correctly preserves indented fenced code blocks before escaping
link text.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 55b4b8b5-01ff-47dc-8b23-a83b01636dd7

📥 Commits

Reviewing files that changed from the base of the PR and between 399aac5 and 53207b6.

📒 Files selected for processing (1)
  • server/src/markdownify/markdown.ts

Comment on lines 212 to 216
const title = $("title").text().trim() || $("h1").first().text().trim();
let resultHtml = contentToProcess.html() || "";
let resultHtml = $content.html() || "";

if (title && !resultHtml.includes(title)) {
resultHtml = `<h1>${title}</h1>\n${resultHtml}`;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Escape the title before injecting it as HTML.

title is text, but Line 216 interpolates it as markup. A page title like &lt;b&gt;Sale&lt;/b&gt; will be converted as an actual <b> element instead of literal title text.

Proposed fix
   const title = $("title").text().trim() || $("h1").first().text().trim();
   let resultHtml = $content.html() || "";

   if (title && !resultHtml.includes(title)) {
-    resultHtml = `<h1>${title}</h1>\n${resultHtml}`;
+    const escapedTitle = $("<div>").text(title).html() || "";
+    resultHtml = `<h1>${escapedTitle}</h1>\n${resultHtml}`;
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/src/markdownify/markdown.ts` around lines 212 - 216, The code inserts
raw title text into HTML (variable title -> resultHtml) without escaping, so
markup in titles becomes active; fix by HTML-escaping the title before
interpolation (or create the H1 via DOM/text APIs instead of string
interpolation) when building resultHtml in the markdown conversion code that
uses title, resultHtml and $content (e.g., replace
`<h1>${title}</h1>\n${resultHtml}` with an escaped-title variant or construct a
h1 element and use its safe .text() output).

@RohitR311 RohitR311 added Type: Enhancement Improvements to existing features Scope: Ext Issues/PRs related to core extraction labels Apr 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Scope: Ext Issues/PRs related to core extraction Type: Enhancement Improvements to existing features

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant