FIX: server-side HtmlToMarkdown improvements (#9586)

TL;DR: this commit vastly improves how whitespace is handled when converting from HTML to Markdown. It also adds support for converting HTML <table> elements to Markdown tables.

The previous 'remove_whitespaces!' method traversed the whole HTML tree and used a heuristic to remove leading and trailing whitespace wherever it was appropriate (i.e. mostly before and after HTML block elements).

It was a good idea, but it was very limited and led to bad conversions when, for example, the HTML had leading whitespace on several lines. One such example can be found [here](https://meta.discourse.org/t/86782).

For various reasons, most of the whitespace in an HTML file is ignored when the page is displayed in a browser. The rules that browsers follow are the [CSS White Space Processing Rules](https://www.w3.org/TR/css-text-3/#white-space-rules).

They can be quite complicated once you take into account RTL languages and various other tidbits, but they boil down to the following:

- Collapse whitespace down to one space (0x20) inside an inline context (i.e. nodes/tags that are displayed on the same line)
- Remove any leading/trailing whitespace inside an inline context

One quick & dirty way of getting this 90% solved would be to do 'HTML.gsub!(/[[:space:]]+/, " ")'. We would also need to hoist <pre> elements in order to not mess with their whitespace.

Unfortunately, this solution lets some whitespace creep in around HTML tags, which leads to more '.strip!' calls than I can bear.

I decided to "emulate" the browser's handling of whitespace and came up with a solution in 4 parts:

1. remove_not_allowed!

The HtmlToMarkdown library recursively "visits" all the nodes in the HTML in order to convert them to Markdown. All the nodes that aren't handled by the library (e.g. <script>, <style> or any non-textual HTML tags) are "swallowed".

In order to reduce the number of nodes visited, the 'remove_not_allowed!' method automatically deletes all the nodes that have no "visitor" (i.e. no 'visit_<tag>' method) defined.

2. remove_hidden!

This serves a similar purpose to the previous method (i.e. reducing the number of nodes visited): there's no point trying to convert something that is hidden. The 'remove_hidden!' method removes any node that was hidden using the "hidden" HTML attribute, some CSS, or a width or height equal to 0.

3. hoist_line_breaks!

The 'hoist_line_breaks!' method is there to handle <br> tags. I know those tiny <br> don't do much, but they can be quite annoying.

The <br> tags are inline elements, but visually they work like a block element (i.e. they create a new line). If you have the following HTML "<i>Foo<br>Bar</i>", it ends up visually similar to "<i>Foo</i><br><i>Bar</i>".

The latter is much easier to process than the former, so that's the transformation this method performs. 'hoist_line_breaks!' hoists <br> tags out of inline tags until their parent is a block element.

4. remove_whitespaces!

The 'remove_whitespaces!' method is where all the whitespace removal happens. It's broken down into 4 methods as well:

- remove_whitespaces!
- is_inline?
- collapse_spaces!
- remove_trailing_space!

The 'remove_whitespaces!' method recursively walks the HTML tree (skipping <pre> tags). If a node has any children, they are chunked into groups of inline elements vs block elements. For each chunk of inline elements, it calls the 'collapse_spaces!' and 'remove_trailing_space!' methods. For each chunk of block elements, it calls 'remove_whitespaces!' to keep walking the HTML tree recursively.

The 'is_inline?' method determines whether a node is part of an inline context. A node is inline if and only if it's a text node or an inline tag other than <br>, and all its children are also inline.

The 'collapse_spaces!' method collapses any kind of (white) space into a single space (" ") character, even across tags. For example, if we have " Foo \n<i> Bar </i>\t42", it will return "Foo <i>Bar </i>42".

Finally, the 'remove_trailing_space!' method is there to remove any trailing space that might creep in at the end of the inline chunk.

This solution is not 100% bullet-proof. It does not support RTL languages at all and has some caveats that I felt were not worth the work to get properly fixed.

FIX: better detection of hidden elements when converting HTML to Markdown
FIX: take into account the 'allowed_href_schemes' site setting when converting HTML <a> to Markdown
FIX: added support for 'mailto:' scheme when converting <a> from HTML to Markdown
FIX: added support for <img> dimensions when converting from HTML to Markdown
FIX: added support for <dl>, <dd> and <dt> when converting from HTML to Markdown
FIX: added support for multiline emphases, strongs and strikes when converting from HTML to Markdown
FIX: added support for <acronym> when converting from HTML to Markdown
DEV: remove unused 'sanitize' gem

Wow, did you just read all that?! Congratz, here's a cookie: 🍪.
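
The collapsing rule described above (one space per run of whitespace, even across tags, with no leading or trailing space in the inline chunk) can be illustrated with a small sketch. This is not the code from this commit: 'collapse_inline' is a hypothetical helper that works on a flat string, splits out tags with a regex instead of walking the parsed HTML tree, and assumes the whole input is a single inline chunk (no <pre>, no block elements).

```ruby
# Hypothetical illustration only; the real library operates on parsed nodes.
# Assumes `html` is one inline chunk: text plus inline tags, no <pre>/blocks.
def collapse_inline(html)
  # Split into text and tag tokens; the capture group keeps the tags.
  tokens = html.split(/(<[^>]+>)/)

  # Treat the start of the chunk as "just after a space" so that
  # leading whitespace is dropped.
  after_space = true

  out = tokens.map do |tok|
    next tok if tok.start_with?("<") # tags pass through untouched

    text = tok.gsub(/[[:space:]]+/, " ")        # collapse runs to one space
    text = text.sub(/\A /, "") if after_space   # no double space across tags
    after_space = text.end_with?(" ") unless text.empty?
    text
  end

  # Remove the trailing space from the last non-empty text token, if any.
  last = out.rindex { |t| !t.start_with?("<") && !t.empty? }
  out[last] = out[last].sub(/ \z/, "") if last

  out.join
end

puts collapse_inline(" Foo \n<i> Bar </i>\t42") # prints: Foo <i>Bar </i>42
```

Tracking whether the previous text ended with a space is what lets the collapse work across tag boundaries, which a plain 'gsub' followed by 'strip' on each text node cannot do.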
2025-07-02 09:44:30 +08:00 · 2020-04-30 12:21:25 +02:00
parent fe51f7a863
commit 501b19b6e0
4 changed files with 450 additions and 235 deletions
--- a/spec/components/html_to_markdown_spec.rb
+++ b/spec/components/html_to_markdown_spec.rb
@@ -10,7 +10,7 @@ describe HtmlToMarkdown do
  end

  it "remove whitespaces" do
-    expect(html_to_markdown(<<-HTML
+    html = <<-HTML
      <div dir="auto">Hello,
        <div dir="auto"><br></div>
        <div dir="auto">&nbsp; &nbsp; This is the 1st paragraph.&nbsp; &nbsp; </div>
@@ -20,11 +20,54 @@ describe HtmlToMarkdown do
        </div>
      </div>
    HTML
-    )).to eq("Hello,\n\nThis is the 1st paragraph.\n\nThis is another paragraph")
+
+    expect(html_to_markdown(html)).to eq("Hello,\n\nThis is the 1st paragraph.\n\nThis is another paragraph")
+
+    html = <<~HTML
+      <body text="#000000" bgcolor="#FFFFFF">
+          <p>Let me see if it happens by answering your message through
+            Thunderbird.</p>
+          <p>Long sentence 1 Long sentence 1 Long sentence 1 Long sentence 1
+            Long sentence 1 Long sentence 1 Long sentence 1 Long sentence 1
+            Long sentence 1 Long sentence 1 Long sentence 1 Long sentence 1
+            Long sentence 1 Long sentence 1 Long sentence 1 Long sentence 1
+            Long sentence 1 Long sentence 1 Long sentence 1 Long sentence 1
+            Long sentence 1
+          </p>
+      </body>
+    HTML
+
+    markdown = <<~MD
+       Let me see if it happens by answering your message through Thunderbird.
+
+       Long sentence 1 Long sentence 1 Long sentence 1 Long sentence 1 Long sentence 1 Long sentence 1 Long sentence 1 Long sentence 1 Long sentence 1 Long sentence 1 Long sentence 1 Long sentence 1 Long sentence 1 Long sentence 1 Long sentence 1 Long sentence 1 Long sentence 1 Long sentence 1 Long sentence 1 Long sentence 1 Long sentence 1
+    MD
+
+    expect(html_to_markdown(html)).to eq(markdown.strip)
+
+    html = <<~HTML
+      <p>    This     post
+            has             lots<br>           of
+                  space
+      </p>
+      <pre>    This     space    was   left untouched     !</pre>
+    HTML
+
+    markdown = <<~MD
+      This post has lots
+      of space
+
+      ```
+          This     space    was   left untouched     !
+      ```
+    MD
+
+    expect(html_to_markdown(html)).to eq(markdown.strip)
  end

  it "skips hidden tags" do
    expect(html_to_markdown(%Q{<p>Hello <span style="display: none">cruel </span>World!</p>})).to eq("Hello World!")
+    expect(html_to_markdown(%Q{<p>Hello <span hidden>cruel </span>World!</p>})).to eq("Hello World!")
  end

  it "converts <strong>" do
@@ -37,13 +80,15 @@ describe HtmlToMarkdown do
    expect(html_to_markdown("<b>B*ld</b>")).to eq("__B*ld__")

    html = <<~HTML
+      Before
      <p><b>Bold
      <br>
      <br>
      </b>
      </p>
+      After
    HTML
-    expect(html_to_markdown(html)).to eq("**Bold**")
+    expect(html_to_markdown(html)).to eq("Before\n\n**Bold**\n\nAfter")
  end

  it "converts <em>" do
@@ -60,6 +105,11 @@ describe HtmlToMarkdown do
    expect(html_to_markdown(%Q{<a href="https://www.discourse.org">Discourse</a>})).to eq("[Discourse](https://www.discourse.org)")
  end

+  it "supports SiteSetting.allowed_href_schemes" do
+    SiteSetting.allowed_href_schemes = "tel|steam"
+    expect(html_to_markdown(%Q{<a href="steam://store/48000">LIMBO</a>})).to eq("[LIMBO](steam://store/48000)")
+  end
+
  it "removes empty & invalid <a>" do
    expect(html_to_markdown(%Q{<a>Discourse</a>})).to eq("Discourse")
    expect(html_to_markdown(%Q{<a href="">Discourse</a>})).to eq("Discourse")
@@ -67,7 +117,7 @@ describe HtmlToMarkdown do
  end

  HTML_WITH_IMG     ||= %Q{<img src="https://www.discourse.org/logo.svg" alt="Discourse Logo">}
-  HTML_WITH_CID_IMG ||= %Q{<img src="cid:ii_1525434659ddb4cb" alt="Discourse Logo">}
+  HTML_WITH_CID_IMG ||= %Q{<img src="cid:ii_1525434659ddb4cb" title="Discourse Logo">}

  it "converts <img>" do
    expect(html_to_markdown(HTML_WITH_IMG)).to eq("![Discourse Logo](https://www.discourse.org/logo.svg)")
@@ -84,8 +134,7 @@ describe HtmlToMarkdown do
  end

  it "keeps <img> with src='cid:' whith 'keep_cid_imgs'" do
-    expect(html_to_markdown(HTML_WITH_CID_IMG, keep_cid_imgs: true)).to eq("![Discourse Logo](cid:ii_1525434659ddb4cb)")
-    expect(html_to_markdown(HTML_WITH_CID_IMG, keep_img_tags: true, keep_cid_imgs: true)).to eq("<img src=\"cid:ii_1525434659ddb4cb\" alt=\"Discourse Logo\">")
+    expect(html_to_markdown(HTML_WITH_CID_IMG, keep_cid_imgs: true)).to eq(HTML_WITH_CID_IMG)
  end

  it "skips hidden <img>" do
@@ -95,6 +144,12 @@ describe HtmlToMarkdown do
    expect(html_to_markdown(%Q{<img src="https://www.discourse.org/logo.svg" style="height:0px">})).to eq("")
  end

+  it "supports width/height on <img>" do
+    expect(html_to_markdown(%Q{<img src="https://www.discourse.org/logo.svg" height=100>})).to eq("![](https://www.discourse.org/logo.svg)")
+    expect(html_to_markdown(%Q{<img src="https://www.discourse.org/logo.svg" width=200>})).to eq("![](https://www.discourse.org/logo.svg)")
+    expect(html_to_markdown(%Q{<img src="https://www.discourse.org/logo.svg" height=100 width=200>})).to eq("![|200x100](https://www.discourse.org/logo.svg)")
+  end
+
  (1..6).each do |n|
    it "converts <h#{n}>" do
      expect(html_to_markdown("<h#{n}>Header #{n}</h#{n}>")).to eq("#" * n + " Header #{n}")
@@ -150,11 +205,11 @@ describe HtmlToMarkdown do
  end

  it "supports <s>" do
-    expect(html_to_markdown("<s>Strike Through</s>")).to eq("<s>Strike Through</s>")
+    expect(html_to_markdown("<s>Strike Through</s>")).to eq("~~Strike Through~~")
  end

  it "supports <strike>" do
-    expect(html_to_markdown("<strike>Strike Through</strike>")).to eq("<strike>Strike Through</strike>")
+    expect(html_to_markdown("<strike>Strike Through</strike>")).to eq("~~Strike Through~~")
  end

  it "supports <blockquote>" do
@@ -221,11 +276,11 @@ describe HtmlToMarkdown do

  it "handles <p>" do
    expect(html_to_markdown("<p>1st paragraph</p><p>2nd paragraph</p>")).to eq("1st paragraph\n\n2nd paragraph")
-    expect(html_to_markdown("<body><p>1st paragraph</p>\n    <p>    2nd paragraph\n    2nd paragraph</p>\n<p>3rd paragraph</p></body>")).to eq("1st paragraph\n\n2nd paragraph\n2nd paragraph\n\n3rd paragraph")
+    expect(html_to_markdown("<body><p>1st paragraph</p>\n    <p>    2nd paragraph\n    2nd paragraph</p>\n<p>3rd paragraph</p></body>")).to eq("1st paragraph\n\n2nd paragraph 2nd paragraph\n\n3rd paragraph")
  end

  it "handles <div>" do
-    expect(html_to_markdown("<div>1st div</div><div>2nd div</div>")).to eq("1st div\n\n2nd div")
+    expect(html_to_markdown("<div>1st div</div><div>2nd div</div>")).to eq("1st div\n2nd div")
  end

  it "swallows <span>" do
@@ -257,15 +312,19 @@ describe HtmlToMarkdown do
  context "with an oddly placed <br>" do

    it "handles <strong>" do
-      expect(html_to_markdown("<strong><br>Bold</strong>")).to eq("**Bold**")
-      expect(html_to_markdown("<strong>Bold<br></strong>")).to eq("**Bold**")
-      expect(html_to_markdown("<strong>Bold<br>text</strong>")).to eq("**Bold\ntext**")
+      expect(html_to_markdown("Hello <strong><br>Bold</strong> World")).to eq("Hello\n**Bold** World")
+      expect(html_to_markdown("Hello <strong>Bold<br></strong> World")).to eq("Hello **Bold**\nWorld")
+      expect(html_to_markdown("Hello <strong>Bold<br>text</strong> World")).to eq("Hello **Bold**\n**text** World")
    end

    it "handles <em>" do
-      expect(html_to_markdown("<em><br>Italic</em>")).to eq("*Italic*")
-      expect(html_to_markdown("<em>Italic<br></em>")).to eq("*Italic*")
-      expect(html_to_markdown("<em>Italic<br>text</em>")).to eq("*Italic\ntext*")
+      expect(html_to_markdown("Hello <em><br>Italic</em> World")).to eq("Hello\n*Italic* World")
+      expect(html_to_markdown("Hello <em>Italic<br></em> World")).to eq("Hello *Italic*\nWorld")
+      expect(html_to_markdown("Hello <em>Italic<br>text</em> World")).to eq("Hello *Italic*\n*text* World")
+    end
+
+    it "works" do
+      expect(html_to_markdown("<div>A <b> B <i> C <br> D </i> E <br> F </b> G</div>")).to eq("A __B *C*__\n__*D* E__\n**F** G")
    end

  end
@@ -314,4 +373,64 @@ describe HtmlToMarkdown do

  end

+  it "supoorts <table>" do
+    html = <<~HTML
+      <table>
+        <thead>
+          <tr>
+            <th>This</th>
+            <th>is</th>
+            <th>the</th>
+            <th><i>headers</i></th>
+          </tr>
+        </thead>
+        <tbody>
+          <tr>
+            <td>I am</td>
+            <td>the</td>
+            <td><b>first</b></td>
+            <td>row</td>
+          </tr>
+          <tr>
+            <td>And this</td>
+            <td>is the</td>
+            <td>2<sup>nd</sup></td>
+            <td>line</td>
+          </tr>
+        </tbody>
+      </table>
+    HTML
+
+    markdown = <<~MD
+      | This | is | the | *headers* |
+      | - | - | - | - |
+      | I am | the | **first** | row |
+      | And this | is the | 2<sup>nd</sup> | line |
+    MD
+
+    expect(html_to_markdown(html)).to eq(markdown.strip)
+
+    expect(html_to_markdown("<table><tr><td>Hello</td><td>World</td></tr></table>")).to eq("| Hello | World |\n| - | - |")
+  end
+
+  it "doesn't swallow badly formatted <table>" do
+    html = <<~HTML
+      <table>
+        <tr>
+          <th>1</th>
+          <th>2</th>
+          <th>3</th>
+          <th>4</th>
+        </tr>
+        <tr>
+          <td>One</td>
+          <td>Two</td>
+          <td>Three</td>
+        </tr>
+      </table>
+    HTML
+
+    expect(html_to_markdown(html)).to eq("1 2 3 4 \nOne Two Three")
+  end
+
 end