perf: move html entity decoding to server side#1561
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
2 Skipped Deployments
|
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 📝 WalkthroughWalkthroughHTML entity decoding was moved out of Possibly related PRs
🚥 Pre-merge checks | ✅ 1✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
server/utils/readme.ts (1)
376-383:⚠️ Potential issue | 🟠 MajorDecoded
plainTextinserted into heading HTML template causes content loss for headings containing<or>.The order of operations at line 377 is:
- Strip HTML tags:
text.replace(/<[^>]*>/g, '')- Decode entities:
decodeHtmlEntities(...)— turns<→<,&→&The decoded
plainText(which may now contain raw<and&) is then string-interpolated directly into the heading HTML at line 383:return `<h${semanticLevel} ...><a href="#${uniqueSlug}">${plainText}</a></h${semanticLevel}>\n`
sanitize-html's underlyinghtmlparser2is fairly tolerant, but when a bare<appears in text content it begins tag parsing — confirmed reports show the string is truncated to the text before the<symbol, so the remainder of the text is discarded. For example, a heading## Types <T>would loseT>in the final rendered README HTML.Before this PR the entity-encoded
<was inserted into the template (valid HTML); the client decoded it for display. ThedecodeHtmlEntitiescall should only apply to the TOC entry, not to the value embedded in the HTML template.The CodeQL "Incomplete multi-character sanitization" warning at line 377 tracks exactly this dataflow.
🐛 Proposed fix — separate decoded text (TOC) from entity-safe text (HTML template)
- // Collect TOC item with plain text (HTML stripped, entities decoded) - const plainText = decodeHtmlEntities(text.replace(/<[^>]*>/g, '').trim()) + // Strip HTML tags for heading template (keep entities so the HTML stays valid) + const strippedText = text.replace(/<[^>]*>/g, '').trim() + // Decode entities only for the plain-text TOC entry + const plainText = decodeHtmlEntities(strippedText) if (plainText) { toc.push({ text: plainText, id, depth }) } /** The link href uses the unique slug WITHOUT the 'user-content-' prefix, because that will later be added for all links. */ - return `<h${semanticLevel} id="${id}" data-level="${depth}"><a href="#${uniqueSlug}">${plainText}</a></h${semanticLevel}>\n` + return `<h${semanticLevel} id="${id}" data-level="${depth}"><a href="#${uniqueSlug}">${strippedText}</a></h${semanticLevel}>\n`
🧹 Nitpick comments (1)
shared/utils/html.ts (1)
1-13: Minimal but fit-for-purpose utility.The regex alternatives and the
htmlEntitiesmap are in perfect correspondence, so the|| matchfallback is technically unreachable at runtime (every match the regex captures has a corresponding map entry). It serves as a compile-time hint whennoUncheckedIndexedAccessis enabled, which is fine; just worth noting it cannot fire under normal operation.
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
server/utils/readme.ts (1)
388-395:⚠️ Potential issue | 🟠 MajorDecoded
plainTextused in the HTML template causes content loss viasanitizeHtml.After decoding,
plainTextmay contain raw<and>characters. When these are interpolated into the heading HTML template at line 395 and then processed bysanitizeHtml, the resulting markup is treated as unknown HTML tags and stripped — not preserved as text.Concrete regression trace for a heading such as
## `Promise<T>`:
text(frommarked) ="<code>Promise<T></code>"stripHtmlTags(text)→"Promise<T>"decodeHtmlEntities(...)→"Promise<T>"—plainTextnow contains raw angle brackets- Line 395 template:
<a href="#...">Promise<T></a>sanitizeHtmlsees<T>as an unknown tag (not inALLOWED_TAGS), strips it- Rendered anchor text:
"Promise"—<T>is silently lostBefore this PR,
plainTextwas not decoded, soPromise<T>was embedded as HTML entities, whichsanitizeHtmlpreserves and the browser renders correctly asPromise<T>.The fix is to separate the stripped-but-not-decoded text (safe for HTML templates) from the fully decoded text (correct for TOC display, where Vue renders it as plain text via interpolation):
🐛 Proposed fix
- // Collect TOC item with plain text (HTML stripped, entities decoded) - const plainText = decodeHtmlEntities(stripHtmlTags(text).trim()) + // Collect TOC item with plain text (HTML stripped, entities decoded) + const strippedText = stripHtmlTags(text).trim() + const plainText = decodeHtmlEntities(strippedText) if (plainText) { toc.push({ text: plainText, id, depth }) } /** The link href uses the unique slug WITHOUT the 'user-content-' prefix, because that will later be added for all links. */ - return `<h${semanticLevel} id="${id}" data-level="${depth}"><a href="#${uniqueSlug}">${plainText}</a></h${semanticLevel}>\n` + return `<h${semanticLevel} id="${id}" data-level="${depth}"><a href="#${uniqueSlug}">${strippedText}</a></h${semanticLevel}>\n`
| * Strip all HTML tags from a string, looping until stable to prevent | ||
| * incomplete sanitization from nested/interleaved tags | ||
| * (e.g. `<scr<script>ipt>` → `<script>` after one pass). | ||
| */ |
There was a problem hiding this comment.
Comment example does not match the actual regex behaviour.
The doc comment illustrates <scr<script>ipt> → <script> after one pass, which is the failure mode of the narrower regex /<[^<>]*>/g (which excludes both < and >). The actual regex used here is /<[^>]*>/g, whose greedy [^>]* also matches <, so the first pass strips <scr<script> (from the opening < to the first >), leaving ipt> — not <script>.
📝 Proposed fix for the comment
/**
* Strip all HTML tags from a string, looping until stable to prevent
* incomplete sanitization from nested/interleaved tags
- * (e.g. `<scr<script>ipt>` → `<script>` after one pass).
+ * (e.g. `<<script>>` → `<script>` after one pass, then `''` after the second).
*/There was a problem hiding this comment.
🧹 Nitpick comments (2)
shared/utils/html.ts (2)
1-13: Derive the regex fromhtmlEntitieskeys to eliminate desync risk.The regex alternation
/&(?:amp|lt|gt|quot|apos|nbsp|#39);/gis a manual duplicate of the map's keys. Adding a new entity to the map without updating the regex will silently no-op.♻️ Proposed refactor — derive regex from map keys
const htmlEntities: Record<string, string> = { '&': '&', '<': '<', '>': '>', '"': '"', ''': "'", ''': "'", ' ': '\u00A0', } +const htmlEntityRegex = new RegExp( + Object.keys(htmlEntities) + .map(k => k.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')) + .join('|'), + 'g', +) + export function decodeHtmlEntities(text: string): string { - return text.replace(/&(?:amp|lt|gt|quot|apos|nbsp|#39);/g, match => htmlEntities[match] || match) + return text.replace(htmlEntityRegex, match => htmlEntities[match] || match) }
1-9: Consider extending entity coverage for common typographic characters.README headings and package labels frequently contain entities outside this set —
—(—),–(–),…(…),©(©),®(®) — that will now appear as literal entity strings in the server-rendered output. Worth auditing whether the TOC/playground label paths encounter these in practice.
follow on from #1550 and #591 - we should be doing this on server-side