So I've just spent a couple of hours tracking down why this happens (Google's not much help because it just interprets &#13; as a carriage return... smart Google). I've found what causes it, but not why it happens. In short, when you save any content that is of type HTMLText, it gets examined for link integrity. To help with the messy XML work needed, SilverStripe passes the content into a DOMDocument, does the DOM manipulations needed, then gets the content back out. What's happening though is that carriage return characters (\r) are being converted into their numeric character references at some point in the process, coming back as the dreaded &#13;.
So why does this happen? As you might know, *nix systems represent their newlines as "\n", whereas Windows systems use "\r\n". So DOMDocument processing on your Linux server expects newlines to be "\n" delimited, and converts the leftover "\r" to its entity equivalent. Why it happens on some content areas and not others though is a complete mystery to me at the moment. Content fields in the backend work without a problem, because for whatever reason they return "\n" newline characters, but switching to frontend content fields (e.g. in the blog module) starts sending back "\r\n" characters when posting/editing via the frontend.
Even stranger, I'm actually on a Linux system, so in theory I should always be sending "\n" as my newlines; however, in both Firefox and Chrome, the browser is sending "\r\n". I'm stumped at the moment as to what's introducing the "\r" in there.
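If you want to see the conversion in isolation, outside SilverStripe, something like the following rough sketch should do it. The sample string is made up, and I'm assuming the content is serialized back out with saveXML(); whether the "\r" survives parsing may also depend on your libxml version, so the script just reports what it finds rather than promising a result.

```php
<?php
// Minimal reproduction sketch; $input is made-up sample content.
$input = "<p>line one\r\nline two</p>";

$doc = new DOMDocument();
// Same wrapper markup as setContent(); @ suppresses libxml parse warnings.
@$doc->loadHTML(
    '<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head>' .
    "<body>$input</body></html>"
);

// Serialize only the <body> node back to a string (assumed to mirror the real round trip).
$body = $doc->getElementsByTagName('body')->item(0);
$output = $doc->saveXML($body);

// Report whether the carriage return came back as a character reference.
echo strpos($output, '&#13;') !== false ? "found &#13;\n" : "no &#13;\n";
```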
Anyway - as a quick fix while I continue trying to track down the real problem, you can try adding something like the following to sapphire/integration/HTMLValue.php at ~line 48 (the setContent method):
    public function setContent($content) {
        // Where the native newline is "\n", convert Windows-style "\r\n" before parsing
        if (PHP_EOL == "\n") {
            $content = str_replace("\r\n", PHP_EOL, $content);
        }
        return @$this->getDocument()->loadHTML(
            '<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head>' .
            "<body>$content</body></html>"
        );
    }
This basically checks whether we're running on a system that uses "\n" as its newline, and if so forcibly replaces every "\r\n" sequence with "\n" before the content reaches DOMDocument.
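A slightly broader variant, if you'd rather not touch core or also want to catch a lone "\r", might look like this. Note that normalizeNewlines is a made-up helper name, not anything in SilverStripe, and unlike the fix above it converts unconditionally rather than checking PHP_EOL first.

```php
<?php
// Hypothetical helper, not part of SilverStripe.
function normalizeNewlines($content) {
    // Both "\r\n" and a lone "\r" become "\n"; existing "\n" is left alone.
    return preg_replace('/\r\n?/', "\n", $content);
}

// json_encode makes the remaining newline characters visible when printed.
echo json_encode(normalizeNewlines("a\r\nb\rc\nd")), "\n";
```

You'd call it on the submitted content before it gets handed to setContent.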
Hope that helps a bit - it'd be great if someone knew WHY "\r\n" was forcing its way in though.