Convert tables inside PDFs to CSV. Node wrapper for tabula-java.
This is a maintained fork of the tabula-js package.
- Java Development Kit (JDK) with java available on the command line
- Node.js/npm
To install as a dependency via npm:
$ npm install --save fresh-tabula-js
Simply import the module:
const tabula = require('fresh-tabula-js');
const table = tabula('data/foobar.pdf');
table.extractCsv((err, data) => {
  if (err) {
    console.error(err);
    return;
  }
  console.log(data);
});
Not all tabula-java options are exposed. In particular, writing to a file is not supported, but any extracted data is available through a callback or a stream.
Here are the options (for options with no value, pass true as the value):
Options:
  area <AREA>          Portion of the page to analyze (top,left,bottom,right).
                       Example: "269.875,12.75,790.5,561". Default is entire page.
                       If there are multiple areas to analyze:
                       Example: ["269.875,12.75,790.5,561", "132.45,23.2,256.3,534"]
  columns <COLUMNS>    X coordinates of column boundaries.
                       Example: "10.1,20.2,30.3"
  debug                Print detected table areas instead of processing.
  guess                Guess the portion of the page to analyze per page.
  silent               Suppress all stderr output.
  noSpreadsheet        Force PDF not to be extracted using spreadsheet-style extraction
                       (if there are no ruling lines separating each cell).
  pages <PAGES>        Comma-separated list of ranges, or all.
                       Examples: pages: "1-3,5-7", pages: "3" or pages: "all".
                       Default is pages: "1".
  spreadsheet          Force PDF to be extracted using spreadsheet-style extraction
                       (if there are ruling lines separating each cell, as in a PDF
                       of an Excel spreadsheet).
  password <PASSWORD>  Password to decrypt document. Default is empty.
  useLineReturns       Use embedded line returns in cells. (Only in spreadsheet mode.)
This is the simplest use case. It uses a classic Node-style callback, (err, data). The extracted CSV is an array of all rows found in the data table, including any headers.
const tabula = require('fresh-tabula-js');
const t = tabula('source.pdf');
t.extractCsv((err, data) => console.log(data));
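If promises are preferred over callbacks, the (err, data) callback can be wrapped with Node's util.promisify. This is only a sketch: extractCsvAsync is a hypothetical helper and not part of this package, and it assumes the object returned by tabula() exposes extractCsv(callback) as shown above.

```javascript
const { promisify } = require('util');

// Hypothetical helper (not part of fresh-tabula-js): wraps the Node-style
// (err, data) callback of extractCsv in a Promise.
// `table` is assumed to be the object returned by tabula(path, options).
function extractCsvAsync(table) {
  // bind() keeps `this` pointing at the table when promisify calls the method
  return promisify(table.extractCsv.bind(table))();
}

// Usage, assuming fresh-tabula-js is installed and source.pdf exists:
// const tabula = require('fresh-tabula-js');
// extractCsvAsync(tabula('source.pdf'))
//   .then((rows) => console.log(rows))
//   .catch((err) => console.error(err));
```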
Here we use the area option to zero in on the data.
const tabula = require('fresh-tabula-js');
const t = tabula('source.pdf', {area: "269.875,150,690,545"});
t.extractCsv((err, data) => console.log(data));
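Because the area value is a plain "top,left,bottom,right" string, it is easy to swap two coordinates by accident. One possible guard is to build the string from named fields. areaString below is a hypothetical helper, not something the package provides, and the sanity checks are an assumption about what counts as a sensible rectangle.

```javascript
// Hypothetical helper (not part of fresh-tabula-js): builds the
// "top,left,bottom,right" string expected by the area option.
function areaString({ top, left, bottom, right }) {
  const parts = [top, left, bottom, right];
  if (!parts.every(Number.isFinite)) {
    throw new TypeError('area coordinates must be finite numbers');
  }
  // Assumption: a usable area has bottom below top and right beyond left.
  if (bottom <= top || right <= left) {
    throw new RangeError('expected top < bottom and left < right');
  }
  return parts.join(',');
}

const options = {
  area: areaString({ top: 269.875, left: 150, bottom: 690, right: 545 }),
  pages: 'all',
};
console.log(options.area); // "269.875,150,690,545"

// Usage, assuming fresh-tabula-js is installed and source.pdf exists:
// const tabula = require('fresh-tabula-js');
// tabula('source.pdf', options).extractCsv((err, data) => console.log(data));
```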
This is similar to the callback version, but with the data extracted as a stream.
const tabula = require('fresh-tabula-js');
const stream = tabula('source.pdf').streamCsv();
stream.pipe(process.stdout);
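Since the wrapper does not expose tabula-java's file output, piping the stream into a file is one way to get a CSV on disk. The sketch below uses a hypothetical saveCsv helper (not part of the package) and assumes only that the stream supports pipe() and ends the destination when it finishes. Errors raised by the source stream are not forwarded by pipe() and would need separate handling.

```javascript
const fs = require('fs');

// Hypothetical helper (not part of fresh-tabula-js): pipes a readable CSV
// stream into a file and resolves once the file is fully written.
function saveCsv(csvStream, outPath) {
  return new Promise((resolve, reject) => {
    const file = fs.createWriteStream(outPath);
    file.on('finish', () => resolve(outPath));
    file.on('error', reject); // only covers errors from the file side
    csvStream.pipe(file);
  });
}

// Usage, assuming fresh-tabula-js is installed and source.pdf exists:
// const tabula = require('fresh-tabula-js');
// saveCsv(tabula('source.pdf').streamCsv(), 'out.csv')
//   .then((path) => console.log(`wrote ${path}`));
```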
In reality, the library is built on the notion of streams all the way down. Highland.js is used to make this a breeze.
This also means the returned stream can readily perform Highland.js-style transformations and operations.
const tabula = require('fresh-tabula-js');
const stream = tabula('source.pdf').streamCsv();
stream
  .split()
  .doto(console.log)
  .done(() => console.log('ALL DONE!'));
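A common next step after split() is to turn each CSV line into an array of fields. The function below is a minimal sketch of a quote-aware line parser. parseCsvLine is a hypothetical helper, not part of this package or of Highland.js, and it assumes each row fits on one line, so cells containing embedded line returns would not be handled.

```javascript
// Hypothetical helper: splits one CSV line into fields, honouring
// double-quoted fields and "" as an escaped quote inside them.
function parseCsvLine(line) {
  const fields = [];
  let field = '';
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') {
        field += '"'; // escaped quote
        i++;
      } else if (ch === '"') {
        inQuotes = false; // closing quote
      } else {
        field += ch;
      }
    } else if (ch === '"') {
      inQuotes = true; // opening quote
    } else if (ch === ',') {
      fields.push(field);
      field = '';
    } else {
      field += ch;
    }
  }
  fields.push(field);
  return fields;
}

console.log(parseCsvLine('a,"b,c",d')); // [ 'a', 'b,c', 'd' ]

// Usage with the stream, assuming fresh-tabula-js is installed:
// const tabula = require('fresh-tabula-js');
// tabula('source.pdf').streamCsv()
//   .split()
//   .compact()            // drop empty lines
//   .map(parseCsvLine)
//   .each((fields) => console.log(fields));
```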