Love this package; it's a super neat and clean interface for extracting text.

WordExtractor get_body() doesn't appear to retrieve all text content from .doc file (node-word-extractor, CLOSED)

morungos commented on May 30, 2024

WordExtractor get_body() doesn't appear to retrieve all text content from .doc file

from node-word-extractor.

Comments (8)

cakemountain commented on May 30, 2024 1

Nice! @morungos I'm testing this now and I'll post back here with results. Thanks again for your quick response and work here.

morungos commented on May 30, 2024

Thank you so much for this. I am surprised how much these files are missing. I'll certainly dig into this.

morungos commented on May 30, 2024

There's definitely an issue here, and it differs between .docx and .doc files, which means there are at least two bugs, so I'll open a second issue on .docx when we've done some more investigation. On .doc, the piece (there's only one) does seem to contain all the correct text, but somehow we're not assembling the final content correctly. There are hyperlinks and friends in there, but at first glance they shouldn't be responsible. Instead, it seems to be the cleaning process that is at fault.

Most of the issue is that we are incorrectly assessing a huge amount of the text as deleted when it isn't. It's the sprmCFRMarkDel test that is the problem. Note that this is a toggle operand, not a clear value, and our test does not seem to respect that, so we might need to actually determine the current style to know whether or not text is deleted. Since the code doesn't yet look into the style tables, that is a likely origin. In other words, the style might be specifying that text is deleted by default, and the actual text says it isn't, so we have everything inverted. This is the first time we have seen this in the wild, which is why the problem has only now become apparent.

Whatever is happening in .docx does not appear to involve the same logic at all, so let's make a second issue for that.

morungos commented on May 30, 2024

The main issue here was pretty simple: we weren't looking at the argument to the sprm at all, so we assumed that the mere existence of the sprm was sufficient to classify a run as deleted. This meant that we deleted virtually all of the document. I have no idea how these sprms got there, but my suspicion would be: these text runs were deleted, and then the deletions were rejected, so rather than removing the sprms, the application toggled the value.

I've continued to ignore the style info, because having a base style with "deleted" as a default value makes no sense whatsoever, and we won't see it in a real world file. So they say.

That handles the situation for the OLE case.
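
As an illustration of the difference between "the sprm exists" and "the sprm's argument says deleted", here is a minimal sketch. It is not the node-word-extractor source. The function name `isRunDeleted`, the opcode constant `0x0800` for sprmCFRMarkDel, and the operand-size table are assumptions based on the published .doc binary format, and special-case sprms with unusual length encodings are ignored. In line with the comment above, the style is assumed never to mark text as deleted.

```javascript
// Hypothetical sketch; names and constants are assumptions, not the library's code.
const SPRM_CF_RMARK_DEL = 0x0800; // assumed opcode for sprmCFRMarkDel

// Operand size in bytes selected by the top three bits of a sprm opcode.
// -1 marks a variable-length operand whose first byte gives its length.
const OPERAND_SIZES = [1, 1, 2, 4, 2, 2, -1, 3];

// Returns true when the property list (grpprl) of a run marks it as deleted.
function isRunDeleted(grpprl) {
  let deleted = false;
  let offset = 0;
  while (offset + 2 <= grpprl.length) {
    const opcode = grpprl.readUInt16LE(offset);
    offset += 2;
    let size = OPERAND_SIZES[opcode >> 13];
    if (size === -1) {
      if (offset >= grpprl.length) break;
      size = grpprl.readUInt8(offset);
      offset += 1;
    }
    if (offset + size > grpprl.length) break;
    if (opcode === SPRM_CF_RMARK_DEL) {
      const operand = grpprl.readUInt8(offset);
      // 0x00 = off, 0x01 = on, 0x80 = same as style, 0x81 = opposite of style.
      // With the style assumed to be "not deleted", only 0x01 and 0x81 delete.
      deleted = operand === 0x01 || operand === 0x81;
    }
    offset += size;
  }
  return deleted;
}
```

With this check, a run carrying the sprm with an argument of `0x00` (for example `Buffer.from([0x00, 0x08, 0x00])`) is kept, whereas a presence-only test would have dropped it.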

cakemountain commented on May 30, 2024

@morungos thanks for the speedy response here! Everything you said makes sense. Let me know if I can be of additional assistance. I have yet to find other .doc files that this happens on but I can keep you updated if and when I do find more examples.

morungos commented on May 30, 2024

@cakemountain I've not yet pushed a new release. The .docx issue is slightly more complex, as it is down in the order of entries in the zip file layer. I'm working on it now, so should have something pushed later today or tomorrow. But yes... any more files that don't work as expected, send them my way 😀
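
The thread does not show the .docx fix itself. As a general illustration of how a reader can avoid depending on the order of entries in the zip archive, the sketch below buffers every entry by name and then processes the parts in a fixed order. The function name `orderDocxParts`, the `{ name, content }` entry shape, and the list of part names are assumptions for the example, not the library's code.

```javascript
// Hypothetical sketch; the entry shape and the part list are illustrative assumptions.
const PART_ORDER = [
  "word/_rels/document.xml.rels",
  "word/document.xml",
  "word/footnotes.xml",
  "word/endnotes.xml",
];

// Takes entries in whatever order the archive yields them and returns the
// parts of interest in a fixed processing order, skipping any that are absent.
function orderDocxParts(entries) {
  const byName = new Map();
  for (const entry of entries) {
    byName.set(entry.name, entry.content);
  }
  return PART_ORDER
    .filter((name) => byName.has(name))
    .map((name) => ({ name, content: byName.get(name) }));
}
```

Because the lookup is by name, an archive that stores `word/document.xml` before its relationships file yields the same processing order as one that stores them the other way round.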

morungos commented on May 30, 2024

@cakemountain OK, I have just now pushed version 1.0.4 to GitHub and published it to npmjs. So hopefully this is a help to you.

On digging, I am guessing these were LibreOffice files. I haven't tested anywhere near as much against LibreOffice as I have against Word itself, and both these small differences fit with that origin. Any files that come from other applications (I'm using a Mac, so I should definitely try with some Pages files, for example) are worth cross-checking to make sure they are consistent.

cakemountain commented on May 30, 2024

💥 much better.

Results after the update on the same doc:

Lorem ipsum 

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc ac faucibus odio. 

Vestibulum neque massa, scelerisque sit amet ligula eu, congue molestie mi. Praesent ut varius sem. Nullam at porttitor arcu, nec lacinia nisi. Ut ac dolor vitae odio interdum condimentum. Vivamus dapibus sodales ex, vitae malesuada ipsum cursus convallis. Maecenas sed egestas nulla, ac condimentum orci. Mauris diam felis, vulputate ac suscipit et, iaculis non est. Curabitur semper arcu ac ligula semper, nec luctus nisl blandit. Integer lacinia ante ac libero lobortis imperdiet. Nullam mollis convallis ipsum, ac accumsan nunc vehicula vitae. Nulla eget justo in felis tristique fringilla. Morbi sit amet tortor quis risus auctor condimentum. Morbi in ullamcorper elit. Nulla iaculis tellus sit amet mauris tempus fringilla.
Maecenas mauris lectus, lobortis et purus mattis, blandit dictum tellus.
Maecenas non lorem quis tellus placerat varius. 
Nulla facilisi. 
Aenean congue fringilla justo ut aliquam. 
Mauris id ex erat. Nunc vulputate neque vitae justo facilisis, non condimentum ante sagittis. 
Morbi viverra semper lorem nec molestie. 
Maecenas tincidunt est efficitur ligula euismod, sit amet ornare est vulputate.









In non mauris justo. Duis vehicula mi vel mi pretium, a viverra erat efficitur. Cras aliquam est ac eros varius, id iaculis dui auctor. Duis pretium neque ligula, et pulvinar mi placerat et. Nulla nec nunc sit amet nunc posuere vestibulum. Ut id neque eget tortor mattis tristique. Donec ante est, blandit sit amet tristique vel, lacinia pulvinar arcu. Pellentesque scelerisque fermentum erat, id posuere justo pulvinar ut. Cras id eros sed enim aliquam lobortis. Sed lobortis nisl ut eros efficitur tincidunt. Cras justo mi, porttitor quis mattis vel, ultricies ut purus. Ut facilisis et lacus eu cursus.
In eleifend velit vitae libero sollicitudin euismod. Fusce vitae vestibulum velit. Pellentesque vulputate lectus quis pellentesque commodo. Aliquam erat volutpat. Vestibulum in egestas velit. Pellentesque fermentum nisl vitae fringilla venenatis. Etiam id mauris vitae orci maximus ultricies. 

Cras fringilla ipsum magna, in fringilla dui commodo a.

        Lorem ipsum     Lorem ipsum     Lorem ipsum
1       In eleifend velit vitae libero sollicitudin euismod.    Lorem
2       Cras fringilla ipsum magna, in fringilla dui commodo a. Ipsum
3       Aliquam erat volutpat.  Lorem
4       Fusce vitae vestibulum velit.   Lorem
5       Etiam vehicula luctus fermentum.        Ipsum

Etiam vehicula luctus fermentum. In vel metus congue, pulvinar lectus vel, fermentum dui. Maecenas ante orci, egestas ut aliquet sit amet, sagittis a magna. Aliquam ante quam, pellentesque ut dignissim quis, laoreet eget est. Aliquam erat volutpat. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Ut ullamcorper justo sapien, in cursus libero viverra eget. Vivamus auctor imperdiet urna, at pulvinar leo posuere laoreet. Suspendisse neque nisl, fringilla at iaculis scelerisque, ornare vel dolor. Ut et pulvinar nunc. Pellentesque fringilla mollis efficitur. Nullam venenatis commodo imperdiet. Morbi velit neque, semper quis lorem quis, efficitur dignissim ipsum. Ut ac lorem sed turpis imperdiet eleifend sit amet id sapien.


Lorem ipsum dolor sit amet, consectetur adipiscing elit. 

Nunc ac faucibus odio. Vestibulum neque massa, scelerisque sit amet ligula eu, congue molestie mi. Praesent ut varius sem. Nullam at porttitor arcu, nec lacinia nisi. Ut ac dolor vitae odio interdum condimentum. Vivamus dapibus sodales ex, vitae malesuada ipsum cursus convallis. Maecenas sed egestas nulla, ac condimentum orci. Mauris diam felis, vulputate ac suscipit et, iaculis non est. Curabitur semper arcu ac ligula semper, nec luctus nisl blandit. Integer lacinia ante ac libero lobortis imperdiet. Nullam mollis convallis ipsum, ac accumsan nunc vehicula vitae. Nulla eget justo in felis tristique fringilla. Morbi sit amet tortor quis risus auctor condimentum. Morbi in ullamcorper elit. Nulla iaculis tellus sit amet mauris tempus fringilla.
Maecenas mauris lectus, lobortis et purus mattis, blandit dictum tellus. 
Maecenas non lorem quis tellus placerat varius. Nulla facilisi. Aenean congue fringilla justo ut aliquam. Mauris id ex erat. Nunc vulputate neque vitae justo facilisis, non condimentum ante sagittis. Morbi viverra semper lorem nec molestie. Maecenas tincidunt est efficitur ligula euismod, sit amet ornare est vulputate.
In non mauris justo. Duis vehicula mi vel mi pretium, a viverra erat efficitur. Cras aliquam est ac eros varius, id iaculis dui auctor. Duis pretium neque ligula, et pulvinar mi placerat et. Nulla nec nunc sit amet nunc posuere vestibulum. Ut id neque eget tortor mattis tristique. Donec ante est, blandit sit amet tristique vel, lacinia pulvinar arcu. Pellentesque scelerisque fermentum erat, id posuere justo pulvinar ut. Cras id eros sed enim aliquam lobortis. Sed lobortis nisl ut eros efficitur tincidunt. Cras justo mi, porttitor quis mattis vel, ultricies ut purus. Ut facilisis et lacus eu cursus.
In eleifend velit vitae libero sollicitudin euismod. 
Fusce vitae vestibulum velit. Pellentesque vulputate lectus quis pellentesque commodo. Aliquam erat volutpat. Vestibulum in egestas velit. Pellentesque fermentum nisl vitae fringilla venenatis. Etiam id mauris vitae orci maximus ultricies. Cras fringilla ipsum magna, in fringilla dui commodo a.
Etiam vehicula luctus fermentum. In vel metus congue, pulvinar lectus vel, fermentum dui. Maecenas ante orci, egestas ut aliquet sit amet, sagittis a magna. Aliquam ante quam, pellentesque ut dignissim quis, laoreet eget est. Aliquam erat volutpat. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Ut ullamcorper justo sapien, in cursus libero viverra eget. Vivamus auctor imperdiet urna, at pulvinar leo posuere laoreet. Suspendisse neque nisl, fringilla at iaculis scelerisque, ornare vel dolor. Ut et pulvinar nunc. Pellentesque fringilla mollis efficitur. Nullam venenatis commodo imperdiet. Morbi velit neque, semper quis lorem quis, efficitur dignissim ipsum. Ut ac lorem sed turpis imperdiet eleifend sit amet id sapien.
