Comments (9)
Worth looking at https://www.npmjs.com/package/is-docx as a way to sniff for these files when the extension is misnamed.
Other than that, you'll need to extract the text from the XML in the traditional manner. I'm going to close this, as it's partly a duplicate of #1, although it's probably worth adding a special case to detect the zip header and generate a better error message.
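As a rough illustration of that special case, here is a minimal sketch that checks the leading bytes of a file before parsing it. The function names `sniffWordContainer` and `assertOleDocument` are hypothetical and are not part of word-extractor's API. The signatures are the standard zip local file header (50 4B 03 04) and the OLE compound document header (D0 CF 11 E0 A1 B1 1A E1).

```javascript
// Standard file signatures: zip local file header (used by .docx) and the
// OLE compound document header (used by legacy .doc).
const ZIP_MAGIC = Buffer.from([0x50, 0x4b, 0x03, 0x04]);
const OLE_MAGIC = Buffer.from([0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1]);

// Hypothetical helper (not word-extractor API): classify a buffer by its
// leading bytes as "zip", "ole", or "unknown".
function sniffWordContainer(buffer) {
  const startsWith = (magic) =>
    buffer.length >= magic.length &&
    buffer.subarray(0, magic.length).equals(magic);
  if (startsWith(ZIP_MAGIC)) return "zip";
  if (startsWith(OLE_MAGIC)) return "ole";
  return "unknown";
}

// Hypothetical guard: throw a more specific error when a zip-based file
// is handed to an OLE-only parser.
function assertOleDocument(buffer, filename) {
  const kind = sniffWordContainer(buffer);
  if (kind === "zip") {
    throw new Error(
      `${filename} starts with a zip header (50 4B 03 04); ` +
        "it is probably a .docx file saved with a .doc extension"
    );
  }
  if (kind !== "ole") {
    throw new Error(`${filename}: Not a valid compound document`);
  }
}
```

A caller could read the file with `fs.readFileSync(path)` and pass the resulting buffer to `assertOleDocument` before attempting extraction. A zip header only shows that the file is a zip archive, not that it is a well-formed .docx, so the message is worded as "probably".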
from node-word-extractor.
I can certainly replicate this issue. It seems to be coming from inside our lifted OleCompoundDoc code, probably in the magic number used to identify the document type.
Yes, it's easy: it's actually a .docx file, not a .doc file. The giveaway is that it uses a zip header; the first bytes are 50 4B 03 04. I don't handle that with this module yet (I could, but I'd need someone to cover the costs), but it's technically not hard. .docx files can be unzipped, and you'll find all the content inside as XML, especially in word/document.xml inside the zip archive.
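To make the XML step concrete, here is a minimal sketch that pulls the visible text out of the contents of word/document.xml once the archive has been unzipped with a zip library of your choice (not shown). The helper name `textFromDocumentXml` is hypothetical. It is regex-based rather than a real XML parser, so it assumes the usual `w:` namespace prefix and ignores tabs, line breaks, fields, deleted text, and nested paragraphs such as text boxes.

```javascript
// Hypothetical helper: extract plain text from the XML string found in
// word/document.xml. Text lives in <w:t> elements inside <w:p> paragraphs.
function textFromDocumentXml(xml) {
  const entities = { amp: "&", lt: "<", gt: ">", quot: '"', apos: "'" };
  // Decode the five predefined XML entities in a single pass.
  const decode = (s) =>
    s.replace(/&(amp|lt|gt|quot|apos);/g, (_, name) => entities[name]);

  // Match <w:p> or <w:p ...>, but not <w:pPr> and similar elements.
  const paragraphRe = /<w:p[\s>][\s\S]*?<\/w:p>/g;
  // Match <w:t> or <w:t xml:space="preserve">, but not <w:tab/> or <w:tbl>.
  const runRe = /<w:t(?:\s[^>]*)?>([\s\S]*?)<\/w:t>/g;

  const paragraphs = [];
  for (const [paragraph] of xml.matchAll(paragraphRe)) {
    let text = "";
    for (const [, run] of paragraph.matchAll(runRe)) {
      text += decode(run);
    }
    paragraphs.push(text);
  }
  // One output line per paragraph.
  return paragraphs.join("\n");
}
```

For anything beyond a quick check, a streaming XML parser would be a safer choice than regular expressions, since real documents contain many constructs this sketch skips.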
Oh, my bad. All along I thought it was a .doc file. This is embarrassing, sorry for the trouble.
You're very welcome, and I'll make the error messages clearer for these files, as it's a common case. Or I might just implement it and finally resolve #1.
Hi @morungos, I tried with a .doc file, but I got the same problem: "Not a valid compound document". I also checked my .doc file with https://www.npmjs.com/package/is-docx and the result is false.