Comments (8)
Nice! @morungos I'm testing this now and I'll post back here with results. Thanks again for your quick response and work here.
from node-word-extractor.
Thank you so much for this. I am surprised how much these files are missing. I'll certainly dig into this.
There's definitely an issue here, and it behaves differently for .docx and .doc files, which means there are at least two bugs, so I'll open a second issue on .docx when we've done some more investigation. On .doc, the piece (there's only one) does seem to contain all the correct text, but somehow we're not getting the final content assembled correctly. There are hyperlinks and friends in there, but at first glance they shouldn't be responsible. However, it does seem to be the cleaning process that is responsible.
Most of the issue is that we are incorrectly assessing a huge amount of the text as deleted when it isn't. It's the sprmCFRMarkDel test that is the problem. Note that this is a toggle operand, not a plain boolean value, and our test does not seem to respect that, so we might need to actually determine the current style to know whether or not text is deleted. Since the code doesn't yet poke in the style tables, that is a likely origin. In other words, the style might be specifying that by default text is deleted, and the actual text says it isn't, so we have everything inverted. This is the only time we have seen this in the wild, which is why it has only now become apparent.
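To make the toggle point concrete, here is a minimal sketch of how a toggle operand resolves against a style value, following the four operand values the MS-DOC specification lists for ToggleOperand. The function name `resolveToggle` and its parameters are hypothetical and not taken from the library's code.

```javascript
// Resolve a one-byte toggle operand against the value inherited from the style.
// Operand meanings follow the MS-DOC ToggleOperand description:
//   0x00 -> property off, 0x01 -> property on,
//   0x80 -> same as the style's value, 0x81 -> opposite of the style's value.
// `resolveToggle` and `styleValue` are illustrative names, not library API.
function resolveToggle(operand, styleValue) {
  switch (operand) {
    case 0x00:
      return false;
    case 0x01:
      return true;
    case 0x80:
      return styleValue;
    case 0x81:
      return !styleValue;
    default:
      throw new RangeError("Unexpected toggle operand: " + operand);
  }
}

// A style that (oddly) marks text as deleted, with a run that inverts it.
console.log(resolveToggle(0x81, true)); // false
```

Under this reading, the run is only deleted if the resolved value is true, which is why knowing the style value could matter in principle.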
Whatever is happening in .docx does not appear to involve the same logic at all, so let's make a second issue for that.
The main issue here was pretty simple: we weren't looking at the argument to the sprm at all, so we assumed that the mere existence of the sprm was sufficient to classify a run as deleted. This meant that we deleted virtually all of the document. I have no idea how these got there, but my suspicion would be: these text runs were deleted, and then the deletions were rejected, so rather than removing the sprms, the value was toggled.
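The difference between the two checks can be sketched roughly as follows. This is not the library's actual code: the `{ opcode, operand }` shape of a parsed sprm and both function names are assumptions for illustration. The opcode 0x0800 is the value the MS-DOC specification gives for sprmCFRMarkDel.

```javascript
// Opcode of sprmCFRMarkDel as listed in the MS-DOC specification.
const SPRM_CF_RMARK_DEL = 0x0800;

// The faulty approach described above: presence alone marks the run as deleted.
function isDeletedByPresence(sprms) {
  return sprms.some((sprm) => sprm.opcode === SPRM_CF_RMARK_DEL);
}

// Operand-aware approach: the operand of the last matching sprm decides.
// Assumption: the style default for "deleted" is false, so 0x80 (match style)
// means not deleted and 0x81 (invert style) means deleted. Under that
// assumption only the low bit of the operand matters.
function isDeletedByOperand(sprms) {
  let deleted = false;
  for (const sprm of sprms) {
    if (sprm.opcode === SPRM_CF_RMARK_DEL) {
      deleted = (sprm.operand & 0x01) === 1;
    }
  }
  return deleted;
}

// A run whose deletion was rejected: the sprm is still there, operand is 0.
const rejectedDeletion = [{ opcode: SPRM_CF_RMARK_DEL, operand: 0x00 }];
console.log(isDeletedByPresence(rejectedDeletion)); // true
console.log(isDeletedByOperand(rejectedDeletion)); // false
```

The sketch deliberately ignores style information, matching the decision described in this comment.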
I've continued to ignore the style info, because having a base style with "deleted" as a default value makes no sense whatsoever, and we won't see it in a real world file. So they say.
That handles the situation for the OLE case.
@morungos thanks for the speedy response here! Everything you said makes sense. Let me know if I can be of additional assistance. I have yet to find other .doc files that this happens on but I can keep you updated if and when I do find more examples.
@cakemountain I've not yet pushed a new release. The .docx issue is slightly more complex, as it comes down to the order of entries in the zip file layer. I'm working on it now, so I should have something pushed later today or tomorrow. But yes... any more files that don't work as expected, send them my way 😀
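As a generic illustration of this class of problem (not the actual fix), one way to avoid depending on the order in which a writer stored the zip entries is to collect them by name first and then process them in a fixed order. The part names below are standard .docx parts, but which parts the library reads, and in what order, is an assumption here, as are the names `PROCESSING_ORDER` and `orderEntries`.

```javascript
// Hypothetical fixed processing order. Which parts actually matter to the
// library is an assumption made for this sketch.
const PROCESSING_ORDER = [
  "word/_rels/document.xml.rels",
  "word/styles.xml",
  "word/document.xml",
];

// Take entries in whatever order the zip file stored them and return the
// known parts in the fixed processing order, skipping any that are absent.
function orderEntries(entries) {
  const byName = new Map(entries.map((entry) => [entry.name, entry]));
  return PROCESSING_ORDER
    .filter((name) => byName.has(name))
    .map((name) => byName.get(name));
}

// Two writers storing the same parts in different orders yield the same result.
const writerA = [{ name: "word/document.xml" }, { name: "word/styles.xml" }];
const writerB = [{ name: "word/styles.xml" }, { name: "word/document.xml" }];
console.log(orderEntries(writerA).map((e) => e.name));
console.log(orderEntries(writerB).map((e) => e.name));
```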
@cakemountain OK, I have just now pushed version 1.0.4 to github and published to npmjs. So hopefully this is a help to you.
On digging, I am guessing these were LibreOffice files. I haven't tested anywhere near as much against LibreOffice as I have against Word itself, and both these small differences fit with that origin. Any files that come from other applications (I'm using a Mac, so I should definitely try with some Pages files, for example) are worth cross-checking to make sure they are consistent.
💥 much better.
Results after the update on the same doc:
Lorem ipsum
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc ac faucibus odio.
Vestibulum neque massa, scelerisque sit amet ligula eu, congue molestie mi. Praesent ut varius sem. Nullam at porttitor arcu, nec lacinia nisi. Ut ac dolor vitae odio interdum condimentum. Vivamus dapibus sodales ex, vitae malesuada ipsum cursus convallis. Maecenas sed egestas nulla, ac condimentum orci. Mauris diam felis, vulputate ac suscipit et, iaculis non est. Curabitur semper arcu ac ligula semper, nec luctus nisl blandit. Integer lacinia ante ac libero lobortis imperdiet. Nullam mollis convallis ipsum, ac accumsan nunc vehicula vitae. Nulla eget justo in felis tristique fringilla. Morbi sit amet tortor quis risus auctor condimentum. Morbi in ullamcorper elit. Nulla iaculis tellus sit amet mauris tempus fringilla.
Maecenas mauris lectus, lobortis et purus mattis, blandit dictum tellus.
Maecenas non lorem quis tellus placerat varius.
Nulla facilisi.
Aenean congue fringilla justo ut aliquam.
Mauris id ex erat. Nunc vulputate neque vitae justo facilisis, non condimentum ante sagittis.
Morbi viverra semper lorem nec molestie.
Maecenas tincidunt est efficitur ligula euismod, sit amet ornare est vulputate.
In non mauris justo. Duis vehicula mi vel mi pretium, a viverra erat efficitur. Cras aliquam est ac eros varius, id iaculis dui auctor. Duis pretium neque ligula, et pulvinar mi placerat et. Nulla nec nunc sit amet nunc posuere vestibulum. Ut id neque eget tortor mattis tristique. Donec ante est, blandit sit amet tristique vel, lacinia pulvinar arcu. Pellentesque scelerisque fermentum erat, id posuere justo pulvinar ut. Cras id eros sed enim aliquam lobortis. Sed lobortis nisl ut eros efficitur tincidunt. Cras justo mi, porttitor quis mattis vel, ultricies ut purus. Ut facilisis et lacus eu cursus.
In eleifend velit vitae libero sollicitudin euismod. Fusce vitae vestibulum velit. Pellentesque vulputate lectus quis pellentesque commodo. Aliquam erat volutpat. Vestibulum in egestas velit. Pellentesque fermentum nisl vitae fringilla venenatis. Etiam id mauris vitae orci maximus ultricies.
Cras fringilla ipsum magna, in fringilla dui commodo a.
Lorem ipsum Lorem ipsum Lorem ipsum
1 In eleifend velit vitae libero sollicitudin euismod. Lorem
2 Cras fringilla ipsum magna, in fringilla dui commodo a. Ipsum
3 Aliquam erat volutpat. Lorem
4 Fusce vitae vestibulum velit. Lorem
5 Etiam vehicula luctus fermentum. Ipsum
Etiam vehicula luctus fermentum. In vel metus congue, pulvinar lectus vel, fermentum dui. Maecenas ante orci, egestas ut aliquet sit amet, sagittis a magna. Aliquam ante quam, pellentesque ut dignissim quis, laoreet eget est. Aliquam erat volutpat. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Ut ullamcorper justo sapien, in cursus libero viverra eget. Vivamus auctor imperdiet urna, at pulvinar leo posuere laoreet. Suspendisse neque nisl, fringilla at iaculis scelerisque, ornare vel dolor. Ut et pulvinar nunc. Pellentesque fringilla mollis efficitur. Nullam venenatis commodo imperdiet. Morbi velit neque, semper quis lorem quis, efficitur dignissim ipsum. Ut ac lorem sed turpis imperdiet eleifend sit amet id sapien.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc ac faucibus odio. Vestibulum neque massa, scelerisque sit amet ligula eu, congue molestie mi. Praesent ut varius sem. Nullam at porttitor arcu, nec lacinia nisi. Ut ac dolor vitae odio interdum condimentum. Vivamus dapibus sodales ex, vitae malesuada ipsum cursus convallis. Maecenas sed egestas nulla, ac condimentum orci. Mauris diam felis, vulputate ac suscipit et, iaculis non est. Curabitur semper arcu ac ligula semper, nec luctus nisl blandit. Integer lacinia ante ac libero lobortis imperdiet. Nullam mollis convallis ipsum, ac accumsan nunc vehicula vitae. Nulla eget justo in felis tristique fringilla. Morbi sit amet tortor quis risus auctor condimentum. Morbi in ullamcorper elit. Nulla iaculis tellus sit amet mauris tempus fringilla.
Maecenas mauris lectus, lobortis et purus mattis, blandit dictum tellus.
Maecenas non lorem quis tellus placerat varius. Nulla facilisi. Aenean congue fringilla justo ut aliquam. Mauris id ex erat. Nunc vulputate neque vitae justo facilisis, non condimentum ante sagittis. Morbi viverra semper lorem nec molestie. Maecenas tincidunt est efficitur ligula euismod, sit amet ornare est vulputate.
In non mauris justo. Duis vehicula mi vel mi pretium, a viverra erat efficitur. Cras aliquam est ac eros varius, id iaculis dui auctor. Duis pretium neque ligula, et pulvinar mi placerat et. Nulla nec nunc sit amet nunc posuere vestibulum. Ut id neque eget tortor mattis tristique. Donec ante est, blandit sit amet tristique vel, lacinia pulvinar arcu. Pellentesque scelerisque fermentum erat, id posuere justo pulvinar ut. Cras id eros sed enim aliquam lobortis. Sed lobortis nisl ut eros efficitur tincidunt. Cras justo mi, porttitor quis mattis vel, ultricies ut purus. Ut facilisis et lacus eu cursus.
In eleifend velit vitae libero sollicitudin euismod.
Fusce vitae vestibulum velit. Pellentesque vulputate lectus quis pellentesque commodo. Aliquam erat volutpat. Vestibulum in egestas velit. Pellentesque fermentum nisl vitae fringilla venenatis. Etiam id mauris vitae orci maximus ultricies. Cras fringilla ipsum magna, in fringilla dui commodo a.
Etiam vehicula luctus fermentum. In vel metus congue, pulvinar lectus vel, fermentum dui. Maecenas ante orci, egestas ut aliquet sit amet, sagittis a magna. Aliquam ante quam, pellentesque ut dignissim quis, laoreet eget est. Aliquam erat volutpat. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Ut ullamcorper justo sapien, in cursus libero viverra eget. Vivamus auctor imperdiet urna, at pulvinar leo posuere laoreet. Suspendisse neque nisl, fringilla at iaculis scelerisque, ornare vel dolor. Ut et pulvinar nunc. Pellentesque fringilla mollis efficitur. Nullam venenatis commodo imperdiet. Morbi velit neque, semper quis lorem quis, efficitur dignissim ipsum. Ut ac lorem sed turpis imperdiet eleifend sit amet id sapien.