From 501b19b6e005f00ffbaa126965b80bad3eaca47b Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?R=C3=A9gis=20Hanol?= Date: Thu, 30 Apr 2020 12:21:25 +0200 Subject: [PATCH] FIX: server-side HtmlToMarkdown improvements (#9586) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit TLDR; this commit vastly improves how whitespaces are handled when converting from HTML to Markdown. It also adds support for converting HTML to markdown tables. The previous 'remove_whitespaces!' method was traversing the whole HTML tree and used a heuristic to remove leading and trailing whitespaces whenever it was appropriate (ie. mostly before and after HTML block elements) It was a good idea, but it was very limited and leaded to bad conversion when the html had leading whitespaces on several lines for example. One such example can be found [here](https://meta.discourse.org/t/86782). For various reasons, most of the whitespaces in a HTML file is ignored when the page is being displayed in a browser. The rules that the browsers follow are the [CSS' White Space Processing Rules](https://www.w3.org/TR/css-text-3/#white-space-rules). They can be quite complicated when you take into account RTL languages and other various tidbits but they boils down to the following: - Collapse whitespaces down to one space (0x20) inside an inline context (ie. nodes/tags that are being displaying on the same line) - Remove any leading/trailing whitespaces inside an inline context One quick & dirty way of getting this 90% solved would be to do 'HTML.gsub!(/[[:space:]]+/, " ")'. We would also need to hoist
 elements in order to not mess with their whitespaces.
Unfortunately, this solution let some whitespaces creep around HTML tags which leads to more '.strip!' calls than I can bear.

I decided to "emulate" the browser's handling of whitespaces and came up with a solution in 4 parts

1. remove_not_allowed!

The HtmlToMarkdown library is recursively "visiting" all the nodes in the HTML in order to convert them to Markdown.
All the nodes that aren't handled by the library (eg.