Is it possible to make Diff::compareHTML() ignore HTML attributes?


Jare

Community Member, 39 Posts

20 January 2017 at 8:22am

First of all, I'm not sure if this is the correct forum section for this question or not, or if I should actually start a discussion about this in GitHub.

We have a website that uses SilverStripe's Diff class to show changes between content versions in the frontend, for users who need to see the history of the content provided by the website. However, sometimes there are meaningless changes between the content versions.

For example, a paragraph might have a class called "left" in the old version, but in the new version the class name could have been replaced with "center". The content of the paragraph stays the same. So, <p class="left">Hello World!</p> becomes <p class="center">Hello World!</p>. Literally speaking, the paragraph _has_ changed, and the Diff class notices the change, but on the other hand there are no real changes in the _content_. Metadata changes are not important for the users of this website.

In fact, we have already modified the class a little to change a few things, for example to ignore certain tags that are only used for styling (e.g. <strong> and <em>), so adding or removing those tags in the content does not trigger a detectable change. We have also made it ignore whitespace changes.
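To make the three kinds of "meaningless" change above concrete, here is a minimal sketch of a pre-check that normalises both versions before any diffing happens. It is not the customised class mentioned in this post and not part of SilverStripe: the function names normalizeHtml() and hasMeaningfulChange() and the default list of styling tags are hypothetical, chosen only for illustration.

```php
<?php
// Hypothetical helpers for illustration; not part of SilverStripe's Diff class.

/**
 * Reduce an HTML fragment to the parts that count as "content":
 * styling-only tags are dropped (their text is kept), attributes are
 * removed from the remaining tags, and whitespace runs are collapsed.
 */
function normalizeHtml(string $html, array $stylingTags = ['strong', 'em']): string
{
    // Remove opening and closing styling tags such as <strong> and </em>.
    $names = implode('|', array_map('preg_quote', $stylingTags));
    $html = preg_replace('/<\/?(?:' . $names . ')\b[^>]*>/i', '', $html);

    // Strip all attributes from opening tags: <p class="left"> becomes <p>.
    $html = preg_replace('/<([a-z][a-z0-9]*)[^>]*?(\/?)>/i', '<$1$2>', $html);

    // Collapse any run of whitespace into a single space.
    return trim(preg_replace('/\s+/', ' ', $html));
}

/** True only if the two versions differ after normalisation. */
function hasMeaningfulChange(string $from, string $to): bool
{
    return normalizeHtml($from) !== normalizeHtml($to);
}

var_dump(hasMeaningfulChange(
    '<p class="left">Hello World!</p>',
    '<p class="center">Hello World!</p>'
));
```

A check like this only answers "did anything that matters change?". It does not produce the <ins>/<del> markup, so it would sit in front of the real diff rather than replace it.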

We have also tried to make it ignore attributes, but I now see that this doesn't always work. I'm not even 100% sure that our other modifications work every time, but at least they usually do. Our customised class might be a bit hard to read, as it's not the cleanest code, so I would first like to ask for some general advice on how you would do it. I don't think I need precise step-by-step instructions. I don't know how the diff algorithm works deep inside.

Of course, I can show the customised code if needed. And if it's useful (or becomes useful after more development), perhaps it could be merged into the core framework or converted to a module. I understand that the current Diff works as it should, and if I ever suggest changes to it, those shouldn't affect the default behaviour, but rather be something that developers can toggle on when they use the Diff class for their own purposes.

Thank you for your support!

P.S. Oh, and please fix the forum session timeout issue. It's still logging me out when writing a long post like this. One solution would be to raise a JavaScript popup which would ask if the user still wants to keep the session alive after a certain amount of time. If the user hits yes, JavaScript would send an AJAX ping request that would just update the session's timestamp.

Jare

Community Member, 39 Posts

15 June 2017 at 12:00am

Bump, anyone? :) And the answer doesn't have to be about SilverStripe's current Diff class. If you happen to have good experiences with some other diff library, please share your thoughts and I will try it out. Thanks!

martimiz

Forum Moderator, 1391 Posts

22 June 2017 at 3:06am

Hi, I'm sorry - I have no direct answer. In fact I didn't even know that class exists :( I'm kind of intrigued and would like to help but at this point I can't even fathom how this class would be used :(

Anyway, I'm afraid the forum might not be the best place for this question, as it's not overly active and this is a very specific question. The folks over at StackOverflow are more active - then again, I'm not quite sure if this is a StackOverflow kind of question...

Anyway, just wanted to let you know that someone did hear you :)

Jare

Community Member, 39 Posts

23 June 2017 at 2:18am

Thank you martimiz for your reply and for the tip about StackOverflow. You might be right that this question doesn't really suit the style of StackOverflow, but it's good to know that StackOverflow generally has more active people answering complicated questions :).

SilverStripe uses the Diff class in the "compare" view between two versions of a single Page (this can be found in the History tab in the CMS). I guess it's not the most used feature in the backend, so perhaps that's why the class was not developed to generate the most polished output.

martimiz

Forum Moderator, 1391 Posts

23 June 2017 at 8:52am

Edited: 23/06/2017 8:54am

OK, maybe this can be of help to you:

I made some changes to Diff::compareHTML(). Normally it will create two arrays, $from and $to, and feed them to a new Diff(). But for this purpose the MappedDiff class may be the better choice: it takes four arrays, the original $from and $to plus a second pair of arrays that contain the values you actually want to compare. (Note that this is SilverStripe 3.6.)

So I created the second set by looping through the original arrays and removing all attributes using a regexp I found here: https://stackoverflow.com/questions/3026096/remove-all-attributes-from-an-html-tag (there may/will be better ways, but hey :) )

This is the code I altered (Diff.php #700):

	public static function compareHTML($from, $to, $escape = false) {
		// First split up the content into words and tags
		// Martimiz: renamed the variables to make things more readable
		$arrFrom = self::getHTMLChunks($from);
		$arrTo   = self::getHTMLChunks($to);

		// Martimiz: next create the mapped versions, with all attributes removed
		$mappedFrom = array();
		foreach ($arrFrom as $item) {

			if (isset($item[0]) && $item[0] == '<') {
				$mappedItem = preg_replace("/<([a-z][a-z0-9]*)[^>]*?(\/?)>/i",'<$1$2>', $item);
			} else {
				$mappedItem = $item;
			}
			$mappedFrom[] = $mappedItem;
		}

		$mappedTo = array();
		foreach ($arrTo as $item) {

			if (isset($item[0]) && $item[0] == '<') {
				$mappedItem = preg_replace("/<([a-z][a-z0-9]*)[^>]*?(\/?)>/i",'<$1$2>', $item);
			} else {
				$mappedItem = $item;
			}
			$mappedTo[] = $mappedItem;
		}

		// Diff that
		// Martimiz:: use renamed variables
		//$diff = new Diff($arrFrom, $arrTo);

		// Martimiz: use MappedDiff
		$diff = new MappedDiff($arrFrom, $arrTo, $mappedFrom, $mappedTo);

		...

Tested it with:

        $from = '<p class="left">test</p><p>old</p>';
        $to   = '<p class="right">test</p><p>new</p>';

        $diff = Diff::compareHTML($from, $to);

        var_export($diff);

The original function gave me this:

<ins><p class="right"> test </p></ins>  <del><p class="left"> test </p></del>  <p>  <ins>new</ins>  <del>old</del>  </p>

The new version gave me this:

<p class="left"> test </p> <p>  <ins>new</ins>  <del>old</del>  </p>

Note that <p class="left"> is now considered unchanged, so the output still displays the old version of the tag. But that would be logical... Nice puzzle :)
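The two near-identical loops in the altered compareHTML() could be folded into one helper. The following is only a sketch: the function name stripAttributesFromChunks() is hypothetical, and the sample array merely assumes that getHTMLChunks() returns a flat list of tag and word strings, which is not confirmed here.

```php
<?php
// Hypothetical helper for illustration; not part of SilverStripe.

/**
 * Return a copy of the chunk list in which every chunk that starts with
 * "<" has had its attributes removed. Text chunks and closing tags pass
 * through unchanged, so the result has the same length and order as the
 * input, which keeps the original and mapped lists aligned.
 */
function stripAttributesFromChunks(array $chunks): array
{
    return array_map(function ($chunk) {
        if (isset($chunk[0]) && $chunk[0] === '<') {
            // Same regular expression as in the altered compareHTML() above.
            return preg_replace('/<([a-z][a-z0-9]*)[^>]*?(\/?)>/i', '<$1$2>', $chunk);
        }
        return $chunk;
    }, $chunks);
}

// Assumed chunk shape, for demonstration only.
$arrFrom    = ['<p class="left">', 'test', '</p>'];
$mappedFrom = stripAttributesFromChunks($arrFrom);

var_export($mappedFrom);
```

With such a helper, $mappedFrom and $mappedTo would each be a single call. One caveat about the regular expression itself: it stops at the first ">" it meets, so an attribute value that contains ">" would be cut short; an HTML parser would be more robust if that matters for your content.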