
node-word-extractor's Introduction

word-extractor


Read data from a Word document (.doc or .docx) using Node.js

Why use this module?

There are a fair number of npm packages that can extract text from Word .doc files, but most appear to require an external helper program, and involve either spawning a process or communicating with a persistent one. That adds an installation and deployment burden as well as a runtime one.

This module is intended to provide a much faster way of reading the text from a Word file, without leaving the Node.js environment.

This means you do not need to install Word, Office, or anything else, and the module will work on all platforms, without any native binary code requirements.

As of version 1.0, this module supports both traditional OLE-based Word files (usually .doc) and modern Open Office-style ECMA-376 Word files (usually .docx). It can be used both with files and with file contents in a Node.js Buffer.

How do I install this module?

yarn add word-extractor

# Or using npm... 
npm install word-extractor

How do I use this module?

const WordExtractor = require("word-extractor"); 
const extractor = new WordExtractor();
const extracted = extractor.extract("file.doc");

extracted.then(function(doc) { console.log(doc.getBody()); });

The object returned from the extract() method is a promise that resolves to a document object, which then provides several views onto different parts of the document contents.
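Because extract() returns a promise, the same example can also be written with async/await. A minimal sketch:

const WordExtractor = require("word-extractor");

async function main() {
  const extractor = new WordExtractor();
  // Wait for the promise to resolve to a Document before using it
  const doc = await extractor.extract("file.doc");
  console.log(doc.getBody());
}

main();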

Methods

WordExtractor#extract(<filename> | <Buffer>)

Main method to open a Word file and retrieve the data. Returns a promise which resolves to a Document. If a Buffer is passed instead of a filename, then the buffer is used directly, instead of reading the file from the file system.
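For example, reading the file into memory first and passing the Buffer directly; a sketch using Node's built-in fs module:

const fs = require("fs");
const WordExtractor = require("word-extractor");

// Read the document into a Buffer, then hand the Buffer to extract()
const buffer = fs.readFileSync("file.docx");

const extractor = new WordExtractor();
extractor.extract(buffer).then((doc) => console.log(doc.getBody()));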

Document#getBody()

Retrieves the content text from a Word document. This will handle UNICODE characters correctly, so if there are accented or non-Latin-1 characters present in the document, they'll show as is in the returned string.

Document#getFootnotes()

Retrieves the footnote text from a Word document. This will handle UNICODE characters correctly, so if there are accented or non-Latin-1 characters present in the document, they'll show as is in the returned string.

Document#getEndnotes()

Retrieves the endnote text from a Word document. This will handle UNICODE characters correctly, so if there are accented or non-Latin-1 characters present in the document, they'll show as is in the returned string.

Document#getHeaders(options?)

Retrieves the header and footer text from a Word document. This will handle UNICODE characters correctly, so if there are accented or non-Latin-1 characters present in the document, they'll show as is in the returned string.

Note that by default, getHeaders() returns one string containing all headers and footers. This is compatible with previous versions. If you want to separate headers and footers, use getHeaders({includeFooters: false}) to return only the headers, and the new method getFooters() (from version 1.0.1) to return the footers separately.
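For example (a sketch, where doc is a Document resolved from extract()):

// All headers and footers in one string (the default, backwards-compatible behaviour)
const combined = doc.getHeaders();

// Headers only
const headers = doc.getHeaders({ includeFooters: false });

// Footers only, from version 1.0.1
const footers = doc.getFooters();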

Document#getFooters()

From version 1.0.1. Retrieves the footer text from a Word document. This will handle UNICODE characters correctly, so if there are accented or non-Latin-1 characters present in the document, they'll show as is in the returned string.

Document#getAnnotations()

Retrieves the comment bubble text from a Word document. This will handle UNICODE characters correctly, so if there are accented or non-Latin-1 characters present in the document, they'll show as is in the returned string.

Document#getTextboxes(options?)

Retrieves the textbox content text from a Word document. This will handle UNICODE characters correctly, so if there are accented or non-Latin-1 characters present in the document, they'll show as is in the returned string.

Note that by default, getTextboxes() returns one string containing all textbox content from both the main document and the headers and footers. You can control what gets included using the options includeHeadersAndFooters (which defaults to true) and includeBody (which also defaults to true). So, as an example, if you only want the body textbox content, use: doc.getTextboxes({includeHeadersAndFooters: false}).
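For example (a sketch, where doc is a Document resolved from extract()):

// Textbox text from the document body only
const bodyBoxes = doc.getTextboxes({ includeHeadersAndFooters: false });

// Textbox text from headers and footers only
const headerFooterBoxes = doc.getTextboxes({ includeBody: false });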

License

Copyright (c) 2016-2021. Stuart Watt.

Licensed under the MIT License.

node-word-extractor's People

Contributors

dependabot[bot], dmitri-gb, ksfreitas, morungos, njlr, nmn


node-word-extractor's Issues

Error: Max buffer length exceeded: attribValue

I ran into the error below while trying to extract text with v1.0.2 from a DOCX file that is 396 KB (405,504 bytes).

I can't seem to get the whole call stack to print out for this, but here is what it gave me:

Exception has occurred: Error: Max buffer length exceeded: attribValue
Line: 1
Column: 97393
Char: 
    at error (/harvester/node_modules/sax/lib/sax.js:651:10)
    at checkBufferLength (/harvester/node_modules/sax/lib/sax.js:125:13)
    at SAXParser.write (/harvester/node_modules/sax/lib/sax.js:1505:7)
    at SAXStream.write (/harvester/node_modules/sax/lib/sax.js:239:18)
    at AssertByteCountStream.ondata (internal/streams/readable.js:745:22)
    at AssertByteCountStream.emit (events.js:376:20)
    at addChunk (internal/streams/readable.js:309:12)
    at readableAddChunk (internal/streams/readable.js:284:9)
    at AssertByteCountStream.Readable.push (internal/streams/readable.js:223:10)
    at AssertByteCountStream.Transform.push (internal/streams/transform.js:166:32)

Looking further up the call stack in the VS Code inspector, it seems to originate from ./lib/word.js:48.
So, I think the error is occurring due to this line in ./lib/word.js:45:

const buffer = Buffer.alloc(512);

Potential fixes:

  • Increase the default buffer size
  • Allow the buffer size to be configurable.

I'm afraid I cannot share the document in question in a public forum like this, but if you'd like to connect, I can prepare an anonymised version and show it to you on a screen share. I can tell you that it has an embedded image that appears to be high-resolution.
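A possible interim workaround, assuming the limit being hit is sax's module-level MAX_BUFFER_LENGTH setting (which sax documents as settable), would be to raise it before extracting. A sketch, not a tested fix:

// Sketch: raise sax's buffer limit before running the extraction
const sax = require("sax");
sax.MAX_BUFFER_LENGTH = 10 * 1024 * 1024; // 10 MB, up from the 64 KB default

const WordExtractor = require("word-extractor");
const extractor = new WordExtractor();
extractor.extract("large-image.docx").then((doc) => console.log(doc.getBody()));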

Add method to read annotation comments

A big part of the original Text::Extract::Word was the support for pulling out the annotation balloon comments, which we needed for Open Mentor among other things. This is still in the Perl code, but not yet ported to this component.

Fix the docs so that returned doc is actually a promise

Hi, I'm using this package for a test, and I ran into this problem:

F:\parseWord\app.js:4
var body = doc.getBody();
^

TypeError: doc.getBody is not a function
    at Object. (F:\parseWord\app.js:4:16)
    at Module._compile (module.js:409:26)
    at Object.Module._extensions..js (module.js:416:10)
    at Module.load (module.js:343:32)
    at Function.Module._load (module.js:300:12)
    at Function.Module.runMain (module.js:441:10)
    at startup (node.js:139:18)
    at node.js:968:3

code:

var WordExtractor = require("word-extractor");
var extractor = new WordExtractor();
var doc = extractor.extract("./testDocccccc.doc");
var body = doc.getBody();

console.log(body);
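The fix is to wait for the promise returned by extract() before calling getBody():

var WordExtractor = require("word-extractor");
var extractor = new WordExtractor();

// extract() returns a promise, so resolve it before using the document
extractor.extract("./testDocccccc.doc").then(function (doc) {
  console.log(doc.getBody());
});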

Switch the SAX layer to saxes

For OpenOffice formats and XML, let's switch to saxes.

In fact, the exact issue in #37 is discussed in the saxes FAQ, as part of the reason saxes was written in the first place.

Not a valid compound Document

Hey,
This module works great, but I'm having trouble converting the file that I have attached.
This is the error that I'm getting:

Unhandled rejection Error: Not a valid compound document.
    at Object.ensureErrorObject (C:\Users\Suchith\Desktop\node.js\inter backup\internsh\node_modules\bluebird\js\main\util.js:261:20)
    at Promise._rejectCallback (C:\Users\Suchith\Desktop\node.js\inter backup\internsh\node_modules\bluebird\js\main\promise.js:469:22)
    at C:\Users\Suchith\Desktop\node.js\inter backup\internsh\node_modules\bluebird\js\main\promise.js:486:17
    at OleCompoundDoc. (C:\Users\Suchith\Desktop\node.js\inter backup\internsh\node_modules\word-extractor\lib\word.js:42:18)
    at emitOne (events.js:96:13)
    at OleCompoundDoc.emit (events.js:188:7)
    at C:\Users\Suchith\Desktop\node.js\inter backup\internsh\node_modules\word-extractor\lib\ole-doc.js:344:15
    at FSReqWrap.wrapper [as oncomplete] (fs.js:603:17)

Please help me extract the text out of this .doc file.

file.zip

Move the codebase to ES6

Most of the benefits of CoffeeScript are past their use-by date, and we can get away with ES6 just fine now, with its promises and better iteration.

This task is to decaffeinate the code.

error

const word = require("word-extractor");
const docx = new word();
docx.extract("${__dirname}/"+filename).then(function(doc) { console.log(doc.getBody()); });
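The likely problem here is that ${__dirname} appears inside double quotes, so it is passed as literal text rather than interpolated. A sketch of a corrected call, using Node's path module instead:

const path = require("path");
const WordExtractor = require("word-extractor");

const docx = new WordExtractor();

// Template literals need backticks: `${__dirname}/${filename}`;
// path.join also handles path separators portably
docx.extract(path.join(__dirname, filename)).then(function (doc) {
  console.log(doc.getBody());
});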

Add a way to detect Numbering indicator/ Bullet point

Thanks for making such a great lib. I just wonder, is there a way we can know that some text is prefixed with a numbering indicator or bullet point?
For example, I have a piece of a Word file like this (see the attached screenshot):

The text extracted:
Câu 20: Để phát hiện một người có nhiễm HIV hay không người ta làm gì?
Xét nghiệm máu
Xét nghiệm đường hô hấp
Xét nghiệm đường tiêu hoá
Xét nghiệm da

As you can see, the A, B, C, D indicators prefixed to the last four lines are missing.

JS heap errors observed in real Word files

We found this with a few files in our big collection. Word opens them okay, but it seems that a bad sector identifier chain can trip up JS fatally. The issue is caused by free sectors (id = -1) breaking the load.

The issue is in the loop:

while (secId > AllocationTable.SecIdFree) {
  secIds.push(secId);
  secId = this._table[secId];
}

When secId is AllocationTable.SecIdFree, we end up pulling undefined values out and then indexing the table by undefined, which kills Node fatally. At the least, we should not do this.

I can't easily create a test, because the only files I have contain PII, and if I attempt to redact them, Word fixes the bad sector chain. However, the problem is real, and we might be able to cover it with a unit test.
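A minimal guard, as a sketch: treat an undefined table entry as the end of the chain instead of following it:

while (secId > AllocationTable.SecIdFree) {
  secIds.push(secId);
  const next = this._table[secId];
  // A free or out-of-range entry yields undefined; treat it as end-of-chain
  if (next === undefined) break;
  secId = next;
}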

Incorrect character filtering for Word

When extracting data from a Word file, we don't always handle the character encoding right. For example:

- In the latest version, there are a few issues with BlueDragon. First – Lyla CAPTCHA is not supported. You must disable CAPTHA support or the blog will not load in BlueDragon. (A future release will automatically disable it.) I already mentioned that print support is disabled, but this is done automatically by the blog engine. 
+ In the latest version, there are a few issues with BlueDragon. First � Lyla CAPTCHA is not supported. You must disable CAPTHA support or the blog will not load in BlueDragon. (A future release will automatically disable it.) I already mentioned that print support is disabled, but this is done automatically by the blog engine. 

These are the places where we probably need character translations.

Extract from a buffer

I am using Node.js and downloading .doc files using superagent. This gives me a buffer object that I would like to parse and extract text from. However, word-extractor only seems to support files.

How do I extract the text from a .doc in memory, not in a file?

Table rows are not terminated correctly in OLE files

Deep down in the binary format for OLE-based Word files is a small gem. The cell delimiter and the row delimiter are both ASCII 7. It turns out you need to look in the paragraph tables to find out which is which. So, we need to add some code to read and map out the paragraph tables, so we can make the last entry in a row properly end with \n.

Move to use Jest for testing

Essentially, it's much nicer than mocha, and there are a ton of vulnerabilities in mocha as well. This will also mean we can drop the extra verbose text display during the testing process.

WordExtractor get_body() doesn't appear to retrieve all text content from .doc file

Love this package; it's a super neat and clean interface for extracting text.

However, in testing I found some cases where not all text was extracted. You can find the examples here.

The sample 100kB DOC file, for instance, has lots of text in it. Here's the TypeScript code I executed:

import request from 'request';
import WordExtractor from 'word-extractor';

const fileUrl = 'https://file-examples-com.github.io/uploads/2017/02/file-sample_100kB.doc';

request.get({ url: fileUrl, encoding: null }, (err, res, body) => {
  const extractor = new WordExtractor();
  const extracted = extractor.extract(body);
  extracted.then(function (doc) {
    console.log(doc.getBody());
  });
});

This seems to only extract the following, even though there's much, much more content in that DOC file.

Lorem ipsum Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc ac faucibus odio. 

Cras fringilla ipsum magna, in fringilla dui commodo a. Lorem ipsum     Lorem ipsum
1       Lorem
2       Ipsum
3       Lorem
4       Lorem
5       Ipsum
Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
Maecenas mauris lectus, lobortis et purus mattis, blandit dictum tellus. 
In eleifend velit vitae libero sollicitudin euismod. 

Similarly, I've tried getFootnotes, getEndnotes, getHeaders, getFooters, and getAnnotations. All of those return empty content for the document I linked above.

getTextboxes returns the same content as getBody.

Is this useful information/is there something I might be doing wrong with configuration options? It seems like this is a bug but I don't have much experience with extracting text from .doc files.

Add method to read form data

A second common use case, added by request in Text::Extract::Word, was to read form data from protected Word files. Again, the code for this still exists in the Perl component, but we should port it here.

Extract multiple footnotes

Hi there,

Great library - I've been looking for something similar for quite a while!

I tried to extract text from a file with multiple footnotes. Unfortunately, the getFootnotes method only returns the first one :(

Any ideas how to work around that?

Get body function getting logged

var WordExtractor = require("word-extractor");
var extractor = new WordExtractor();
var extracted = extractor.extract("worldd.doc");
extracted.then(function (doc) {
  console.log(doc.getBody);
});

This code prints the getBody function itself as output, instead of the file body:

ƒ getBody(shouldFilter) {
  if (shouldFilter == null) {
    shouldFilter = true;
  }
  const start = 0;
  const string = this.getTextRange(start, start + this.boundaries.ccpText);
  return filter(string, shouldFilter);
}
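The method is being passed to console.log rather than called; adding the parentheses fixes it:

extracted.then(function (doc) {
  // getBody is a method, so it must be invoked
  console.log(doc.getBody());
});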

Incorrect text when extracting fields

When extracting the body of a Word file, fields are not being handled properly. For example, you might get text as follows from test01.doc:

If you find any bugs, or have any suggestions, please email me at [email protected]. You can also go to the BlogCFC Forums at HYPERLINK "http://ray.camdenfamilywww.coldfusionjedi.com/forums/forums.cfm?conferenceid=CBD210FD-AB88-8875-EBDE545BF7B67269" http://ray.camdenfamilywww.coldfusionjedi.com/forums/forums.cfm?conferenceid=CBD210FD-AB88-8875-EBDE545BF7B67269. You may also go to the BlogCFC Project page at HYPERLINK "http://ray.camdenfamily.com/projects/blogcfc" http://ray.camdenfamily.com/projects/blogcfc.riaforge.org. Lastly – you can read news about BlogCFC at http://www.blogcfc.com.

In practice, the original text is as shown in the attached screenshot. The differences are mainly in fields.

Error: Cannot find module 'word-extractor'

Hi,

I tried running the example given and it throws an error:

Error: Cannot find module 'word-extractor'

var WordExtractor = require("word-extractor");
var extractor = new WordExtractor();
var doc = extractor.extract("file.doc");
var body = doc.getBody();

Error: Cannot find module 'word-extractor'
at Function.Module._resolveFilename (module.js:440:15)
at Function.Module._load (module.js:388:25)
at Module.require (module.js:468:17)
at require (internal/module.js:20:19)
at Object. (C:\Users\test\Desktop\text\3.js:1:83)
at Module._compile (module.js:541:32)
at Object.Module._extensions..js (module.js:550:10)
at Module.load (module.js:458:32)
at tryModuleLoad (module.js:417:12)
at Function.Module._load (module.js:409:3)

Any help please?

add method to read text boxes

Feature request:

Document#getTextBoxes()

it must:

  • Get all text from text boxes in the document

it should:

  • Return the text boxes in the same order as they appear in the document

it could:

  • Somehow try to reproduce the document text order combining the content from getBody() with the text box content. (maybe as a separate Document#getBody({includeTextBoxes: true}). I imagine this is insanely difficult, but just putting it out there :-)

Handle field displayed text right in OLE Word

There are some nasty issues involved in fields in OLE Word. test01.doc is a good example: it has nested HYPERLINK fields, but try as I might, I can't see any way for the fields to be handled by character markers in a way that is consistent with what we see in Word. We might need to use some of the additional Word table data to get the offsets right.

There are several points here that don't work, but most of the non-nested fields look correct to me. It is mainly this nested one. It may be that it is sufficiently garbage that we could map it to an empty string, but even that does not really work, as it'd leave a space on either side, and yet only one space is visible in Word itself.

Correctly remove deleted text

In the current implementation, we don't handle character properties. This is why we are including deleted text in the final output. This is, itself, why we are struggling with the nastiness of test01.doc, because a bunch of the fields in there are flagged as deleted, and therefore throw off the positioning and results.

The only resolution is that we need to process character properties properly.

Is this cross-platform?

I see in ticket #11 a mention of the OLE implementation. Is this package actually using Word itself to parse the documents?

Basically, my question is: is this package really pure JS, with no binary dependencies, and cross-platform?

Errors thrown by the XML parser cannot be caught

See #37 for more information, and an account of why this matters.

To test, we should construct some mechanism to allow us to throw an error from one of the sax handling routines, and establish that we can handle the extraction process in a normal, catchable way.

Separate header and footer

I've been using your library today. It seems to be the light at the end of the tunnel for some problems we have been having with docx extraction when docs contain headers, footers, endnotes and footnotes. Almost everything else we've tried has been incapable of parsing these properly when they co-exist in a document. And no system dependencies... I salute you!

One issue I have found is that both the header and footer content gets bundled into data._headers. Is there much of a change to make to separate the footer into a new data._footers property?


Add support for .docx files

.docx files are actually much easier to handle than .doc files, but we might get either of them thrown at us. So we should provide a transparent interface as far as we can.

Error: EMFILE: too many open files

Node 9 - 12
Windows 10

Thanks for the wonderful library :)

File descriptors do not close.
I even downloaded Process Explorer to make sure of this :(
It turns out to open somewhere between 8,000 and 8,200 files before falling over with Error: EMFILE: too many open files.

const fs = require('fs')
const path = require('path')
const WordExtractor = require('word-extractor')

let extractor = new WordExtractor()

const foo = async (pathDir) => {
  let arrFileName = fs.readdirSync(pathDir)
  let count = 0

  for (let fileName of arrFileName) {
    let pathFile = path.join(pathDir, fileName)
    let res = await extractor.extract(pathFile)
    let doc = res.getBody()
    count = count + 1
    console.log(count)
  }
}
foo('d:/Dir')

Help me, please.

Add way to iterate, fetch, and count pages

Hello! I was wondering if it would be possible to add some paging functionality. This issue could serve as three related requests:

  1. A way to iterate through pages
  2. A method to get the text for a specific page
  3. A method to get the total page count for the document

I know the library currently replaces page breaks with new lines, but it would be great to be able to break them up.

Use word-extractor in typescript

Good afternoon,
I'd like to extract text from a buffer containing a .doc or .docx file.

But I don't know how to import the library into my TypeScript file.

I used
import * as WORDEXTRACTOR from 'word-extractor';

and implemented this code:

WORDEXTRACTOR.fromBuffer(originalFile).then((doc: any) => {
  // eslint-disable-next-line no-console
  console.log(doc.getBody());
});

But at compile time I get an error:

Error ole....

Can you help me, please?

I hope that this library is ok for my goal.

Thanks, everybody.
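For what it's worth, the documented API has no fromBuffer export; extract() accepts a Buffer directly. A sketch of the TypeScript usage, assuming esModuleInterop is enabled in tsconfig.json:

import WordExtractor from 'word-extractor';

const extractor = new WordExtractor();

// extract() accepts either a filename or a Buffer
extractor.extract(originalFile).then((doc: any) => {
  console.log(doc.getBody());
});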

Broken multi-byte letters at the borders of 4096-byte chunks

The handleEntry() function in the open-office-extractor.js file reads streams in 4096-byte chunks:
const chunk = readStream.read(0x1000);

If the text in the *.docx file is not in a Latin alphabet, most characters in the strings inside <w:t> tags will be multi-byte. It's only a matter of time before a chunk boundary breaks a character in two, which results in two Unicode replacement characters (U+FFFD, the byte sequence EF BF BD) in the reconstructed text content.

Perhaps it would be reasonable to get the encoding from the first tag in the *.xml files (document.xml and the like), and then treat the file as text in that encoding, not just as a stream of bytes: read in chunks of characters, not chunks of bytes.

For example, this Lorem Ipsum text uses Cyrillic letters, and the second appearance of the words сед ут амет риденс номинави gets turned into сед ут амет ри��енс номинави:
Lorem_Ipsum__CYR.docx

This behaviour poses a problem for documents using the Greek, Cyrillic, Arabic, Hebrew, Hindi, Chinese, Korean, and Japanese (and a few more) writing systems.
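One possible fix, as a sketch: Node's built-in string_decoder module buffers an incomplete trailing multi-byte sequence and emits it with the next chunk, so decoding each chunk through a StringDecoder (assuming UTF-8, which ECMA-376 XML parts effectively always use; readStream here is the entry stream handleEntry() is reading) would avoid the replacement characters:

const { StringDecoder } = require("string_decoder");

const decoder = new StringDecoder("utf8");
let text = "";

// Sketch: decoder.write() holds back a trailing partial character and
// prepends it to the next chunk, so no character is split across reads
let chunk;
while ((chunk = readStream.read(0x1000)) !== null) {
  text += decoder.write(chunk);
}
text += decoder.end(); // flush any bytes still buffered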
