<div class="snippet-clipboard-content notranslate position-relative overflow-auto" data-snippet-clip

Improved treatment of the visual document structure (Submitted by Carsten Pfeiffer) about daisydiff HOT 19 CLOSED

roycezhc commented on June 8, 2024

Improved treatment of the visual document structure (Submitted by Carsten Pfeiffer)

from daisydiff.

Comments (19)

GoogleCodeExporter commented on June 8, 2024

Hi Carsten,

I've applied the patch but cannot see any improvements over the pre-patching 
output. Perhaps I had the wrong expectations about the outcome can you please 
advice? I've attached the two sources i'm using for comparison and the 
resulting output.

Original comment by [email protected] on 19 Nov 2010 at 10:02

Attachments:

from daisydiff.

GoogleCodeExporter commented on June 8, 2024

Just to clarify the problem with the output on the previous comment. The diff 
for the Manufacturer and Description sections are scrambled together.

Original comment by [email protected] on 19 Nov 2010 at 10:14

from daisydiff.

GoogleCodeExporter commented on June 8, 2024

Hi Michael,

sorry for coming back to you so late. I just checked your example and you are 
right, this scenario is not covered by my patch. Daisydiff doesn't know or care 
about sections or chapters like e.g. "Manufacturers" from your snippets. 
Daisydiff only sees two long list of text elements and tries to find and report 
as few differences in them as possible.

Daisydiff would need to learn about those chapters (h1..h6 for a start) and 
perform the differencing separately for the text nodes of those chapters.

I.e. DomTreeBuilder should not return a flat list of elements but some kind of 
tree of chapters, each with their own list of text nodes.

Ideally you would then create one TextNodeComparator per chapter and perform 
the diffing chapter-wise. But this needs some work, because this would lead to 
multiple outputs (you want to have just one single result of course).

Cheers,
Carsten

Original comment by [email protected] on 14 Mar 2011 at 4:14

from daisydiff.

GoogleCodeExporter commented on June 8, 2024

[deleted comment]

from daisydiff.

GoogleCodeExporter commented on June 8, 2024

Carsten,

Can you post an example or two, where one can see clearly what has changed 
before and after applying this patch? This should help people understand the 
changes and decided if the patch should be accepted or not.

Original comment by [email protected] on 23 May 2011 at 2:40

from daisydiff.

GoogleCodeExporter commented on June 8, 2024

I applied the patch (which was tricky because it clashed with the performance 
fixes in 1.2) and I note that it fixes 
http://code.google.com/p/daisydiff/issues/detail?id=26 for me, without breaking 
anything else I could think of.  Awesome work Carsten.

Original comment by [email protected] on 15 Jun 2011 at 5:29

from daisydiff.

GoogleCodeExporter commented on June 8, 2024

Can you create then a new patch (that is based on 1.2) and attach it here?

Then I will apply this to trunk if you think that it should be part of 
DaisyDiff.

Original comment by [email protected] on 15 Jun 2011 at 6:41

from daisydiff.

GoogleCodeExporter commented on June 8, 2024

Yes.  Will do.  I have a little more testing to do first.

Original comment by [email protected] on 15 Jun 2011 at 6:54

from daisydiff.

GoogleCodeExporter commented on June 8, 2024

Here's a simple testcase that my patch fixes. Note that the second table cell 
in the old.html file ("1XXX") must start with the same character with which the 
previous (first) cell ends ("1"). 

In two-way diff mode, daisydiff creates a new bogus column in the table.

In three-way diff mode it even puts the red removal marker into the first 
column, AFAIR.

Original comment by [email protected] on 16 Jun 2011 at 10:49

Attachments:

from daisydiff.

GoogleCodeExporter commented on June 8, 2024

I will have no time for this in the coming weeks, so I took the time today, 
rebased the patch to trunk and committed it. I'm using it internally for almost 
a year and had no problems with it.

(There are more problems left, but unrelated to this issue).

Original comment by [email protected] on 17 Jun 2011 at 3:57

from daisydiff.

GoogleCodeExporter commented on June 8, 2024

The SeparatingNode is missing a hashCode method.  Patch provided.

Original comment by [email protected] on 24 Jun 2011 at 7:22

Attachments:

hashCode_for_SeparatingNode.patch

from daisydiff.

GoogleCodeExporter commented on June 8, 2024

While the SeparatingNode seems to really improve the handling of tables, I'm 
not convinced that the check for paretn position in TagNode.isEquals is a good 
change.  It leads to a lot of cells claiming to be moved when really they've 
just had a row inserted before them.

Eg I find that removing that section improves things.  I'm attaching a patch 
that comments out the part I'm talking about so that people can discuss whether 
they like it or not.

Original comment by [email protected] on 24 Jun 2011 at 7:34

Attachments:

comment_out_position_check.patch

from daisydiff.

GoogleCodeExporter commented on June 8, 2024

Thanks Don. I committed your patch until I find a better solution. The problem 
(i.e. the reason for the position-check) manifests itself mostly in three-way 
diff, as I found out.

Original comment by [email protected] on 4 Jul 2011 at 12:52

from daisydiff.

GoogleCodeExporter commented on June 8, 2024

Don,

after commenting out the "parent index" thing in TagNode, the testcases 
"broke", as expected. When I had a look at fixing them, it became obvious that 
it is currently too time consuming to validate them.

So here's my proposal to fix that:
- rename the file expected.xml to expected.html
- make them html markup instead of xml
- add an html header including a reference to the diff.css

That way, one can easily see the expected output (and also validate if the 
expected output make sense at all).

I will then regenerate all expected.html files with the current version. If 
you're fine with that, I will commit that tomorrow.

Original comment by [email protected] on 4 Jul 2011 at 4:13

from daisydiff.

GoogleCodeExporter commented on June 8, 2024

[deleted comment]

from daisydiff.

GoogleCodeExporter commented on June 8, 2024

Hi Carsten,

switching to html files sounds great to me.

Original comment by [email protected] on 5 Jul 2011 at 9:35

from daisydiff.

GoogleCodeExporter commented on June 8, 2024

Committed in r155. Please re-open if you encounter any issues.

Original comment by [email protected] on 13 Sep 2011 at 9:05

Changed state: Fixed

from daisydiff.

GoogleCodeExporter commented on June 8, 2024

Hi,

I've may have found a fix for Carsten's position checking (which brings too 
many drawbacks with it). I simply added more separators by commenting out a 
couple of lines of code - a simple patch and I'm attaching it to this issue. I 
ran the new code against the file-based unit tests and while the code does 
perform worse in a few cases it is a huge improvement overall. I believe it 
also solves issue 44.

I went over all unit tests which have 'failed' after adding the new code and my 
tally goes like this:

Big improvements in Lists 22, 23, 35; General 5, 6, 35
Small improvements in General 9

Big regression in General 38
Slight regressions in Lists 6, 19 (bad HTML!), 20 (bad HTML!), 31; Paragraphs 
43; General 39 (though Lists 19 and 20 contain bad HTML, UL elements cannot be 
nested in other UL elements unless they are wrapped in a LI first - if I fix 
this the result with new code isn't bad)

Couldn't decide: Lists 4, 5, 17, 18; General 42

I would appreciate if somebody could take the time to review these tests. To do 
this, apply the patch, run 'ant test' and go to the 'testdata' directory to 
review the test results: failed tests will have the '_actual_result.html' file 
present in their directory (not that failure here simply means a different 
output, not a failure per se).

Original comment by [email protected] on 3 Sep 2012 at 8:33

Attachments:

more-separators.diff

from daisydiff.

GoogleCodeExporter commented on June 8, 2024

Hi Gregor,

thanks a bunch for your efforts. The improvements of your patch do indeed look 
good to me. The regressions in lists however are a bad. It would be awesome if 
we could find a way to get the improvements without the regressions :-)

Can you tell me how you fixed the badly nested ULs so that the results are OK? 
Adding a <li> in front of the nested <ul> changes the behavior (you get another 
bullet point). Even though it's not correct html to leave out the <li>, we have 
to support that, because it is quite common. For example the Mozilla HTML 
Commposer (the rich text editor in Thunderbird for example) creates such HTML.

Regarding issue 44 -- I couldn't reproduce that problem at all (i.e. plain 
1.2). The result was the same (perfect indentation) with and without your patch.

Thanks,
Carsten

Original comment by [email protected] on 19 Sep 2012 at 8:58

from daisydiff.

Improved treatment of the visual document structure (Submitted by Carsten Pfeiffer) about daisydiff HOT 19 CLOSED

Comments (19)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent