documentcloud / docsplit

Break Apart Documents into Images, Text, Pages and PDFs

Home Page: http://documentcloud.github.com/docsplit/

License: Other

Ruby 71.12% HTML 28.88%

docsplit's Introduction

==
         __                      ___ __ 
    ____/ /___  ______________  / (_) /_
   / __  / __ \/ ___/ ___/ __ \/ / / __/
  / /_/ / /_/ / /__(__  ) /_/ / / / /_  
  \____/\____/\___/____/ .___/_/_/\__/  
                      /_/
                      
  Docsplit is a command-line utility and Ruby library for splitting apart
  documents into their component parts: searchable UTF-8 plain text, page 
  images or thumbnails in any format, PDFs, single pages, and document 
  metadata (title, author, number of pages...)
  
  Installation:
  gem install docsplit
  
  For documentation, usage, and examples, see:
  https://documentcloud.github.io/docsplit/
  
  To suggest a feature or report a bug: 
  http://github.com/documentcloud/docsplit/issues/

docsplit's Issues

Need extract_html

I need some API to convert from PDF to HTML, i.e extract_html method...

OWNER PASSWORD REQUIRED ERROR

Anything I can do to blow through this?

$ docsplit pages fileName.pdf
Error: Failed to open PDF file: 
   fileName.pdf
   OWNER PASSWORD REQUIRED, but not given (or incorrect)
Errors encountered.  No output created.
Done.  Input errors, so no output created.

Shell escaping needed, e.g. for filenames

Example:

Docsplit.extract_text('/tmp/file with spaces (and probably other chars that need to be escaped).pdf', :ocr => false, :output => dir)

=> fails
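One possible way to build such a command line is sketched below with Ruby's standard Shellwords module. The helper name pdftotext_command is hypothetical and is not part of docsplit; the sketch only shows how each argument could be escaped before it reaches a POSIX shell.

```ruby
require 'shellwords'

# Hypothetical helper (not docsplit's own code): joins the arguments into one
# shell command string, escaping each argument so that spaces and parentheses
# in a path are passed through the shell unchanged.
def pdftotext_command(pdf_path, text_path)
  ['pdftotext', '-enc', 'UTF-8', pdf_path, text_path].shelljoin
end

puts pdftotext_command('/tmp/file with spaces (draft).pdf', '/tmp/out.txt')
```

Shellwords follows Bourne shell rules, so this approach is a sketch for Unix-like systems only.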

Problem with OpenOffice, JOD Converter, Docsplit on CentOS

I've followed the directions to install Docsplit from here (accounting for differences for CentOS of course): http://documentcloud.github.com/docsplit

I think I've got all necessary packages installed, but when I run "docsplit pdf <some.doc.file>" to convert a Word Doc to PDF, I get this error:

Exception in thread "main" org.artofsolving.jodconverter.office.OfficeException: failed to start and connect
 at org.artofsolving.jodconverter.office.ManagedOfficeProcess.startAndWait(ManagedOfficeProcess.java:61)
 at org.artofsolving.jodconverter.office.PooledOfficeManager.start(PooledOfficeManager.java:102)
 at org.artofsolving.jodconverter.office.ProcessPoolOfficeManager.start(ProcessPoolOfficeManager.java:59)
 at org.artofsolving.jodconverter.cli.Convert.main(Convert.java:98)
Caused by: java.util.concurrent.ExecutionException: org.artofsolving.jodconverter.office.OfficeException: could not establish connection
 at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:252)
 at java.util.concurrent.FutureTask.get(FutureTask.java:111)
 at org.artofsolving.jodconverter.office.ManagedOfficeProcess.startAndWait(ManagedOfficeProcess.java:59)
 ... 3 more
Caused by: org.artofsolving.jodconverter.office.OfficeException: could not establish connection
 at org.artofsolving.jodconverter.office.ManagedOfficeProcess.doStartProcessAndConnect(ManagedOfficeProcess.java:123)
 at org.artofsolving.jodconverter.office.ManagedOfficeProcess.access$000(ManagedOfficeProcess.java:31)
 at org.artofsolving.jodconverter.office.ManagedOfficeProcess$1.run(ManagedOfficeProcess.java:55)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
 at java.util.concurrent.FutureTask.run(FutureTask.java:166)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:636)
Caused by: org.artofsolving.jodconverter.office.RetryTimeoutException: java.net.ConnectException: connection failed: 'socket,host=127.0.0.1,port=2002,tcpNoDelay=1'; java.net.ConnectException: Connection refused
 at org.artofsolving.jodconverter.office.Retryable.execute(Retryable.java:48)
 at org.artofsolving.jodconverter.office.Retryable.execute(Retryable.java:31)
 at org.artofsolving.jodconverter.office.ManagedOfficeProcess.doStartProcessAndConnect(ManagedOfficeProcess.java:113)
 ... 8 more
Caused by: java.net.ConnectException: connection failed: 'socket,host=127.0.0.1,port=2002,tcpNoDelay=1'; java.net.ConnectException: Connection refused
 at org.artofsolving.jodconverter.office.OfficeConnection.connect(OfficeConnection.java:101)
 at org.artofsolving.jodconverter.office.ManagedOfficeProcess$6.attempt(ManagedOfficeProcess.java:116)
 at org.artofsolving.jodconverter.office.Retryable.execute(Retryable.java:41)
 ... 10 more

For what it's worth, I had all of this working perfectly both in OSX and in an Ubuntu VM I set up. But CentOS is giving me a lot of grief with this stuff.

Any ideas?

Command-line docsplit displays pdftotext usage when given a PDF filename that has spaces

So this is a pretty minor issue, but when I run something like this on the command line:

docsplit text ZUJI\ Hong\ Kong\:\ Your\ Online\ Travel\ Guru.pdf

It's going to spit out:

pdftotext version 0.16.7
Copyright 2005-2011 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2004 Glyph & Cog, LLC
Usage: pdftotext [options] <PDF-file> [<text-file>]
  -f <int>          : first page to convert
  -l <int>          : last page to convert
  -r <fp>           : resolution, in DPI (default is 72)
  -x <int>          : x-coordinate of the crop area top left corner
  -y <int>          : y-coordinate of the crop area top left corner
  -W <int>          : width of crop area in pixels (default is 0)
  -H <int>          : height of crop area in pixels (default is 0)
  -layout           : maintain original physical layout
  -raw              : keep strings in content stream order
  -htmlmeta         : generate a simple HTML file, including the meta information
  -enc <string>     : output text encoding name
  -listenc          : list available encodings
  -eol <string>     : output end-of-line convention (unix, dos, or mac)
  -nopgbrk          : don't insert page breaks between pages
  -bbox             : output bounding box for each word and page size to html.  Sets -htmlmeta
  -opw <string>     : owner password (for encrypted files)
  -upw <string>     : user password (for encrypted files)
  -q                : don't print any messages or errors
  -v                : print copyright and version info
  -h                : print usage information
  -help             : print usage information
  --help            : print usage information
  -?                : print usage information

As I said, this is a minor issue, but it appears that any PDF with spaces in its filename (ebooks, for example) will need to be renamed before running this command in the terminal.

language parameter is invalid

I can't use the language option. I get this message:

invalid option: --language

White border in output when the input are images

When I submit images to be converted to other images with a width of 800px, the output gets white borders...

Docsplit.extract_images(doc_path, :size => '800x', :format => :png, :output => pages_path)

Install docs incorrect for Tesseract on Debian/Ubuntu

The "Installation & Dependencies" section of the docs show the package name to be installed for Tesseract as "tesseract". This is incorrect for Debian/Ubuntu -- in those distros the correct package name is "tesseract-ocr" (Debian package, Ubuntu package).

Not a huge deal for anyone who knows how to do an aptitude search, but might confuse some newbies.

Image -> OCR -> PDF

Hi,

I'm wondering if there is an example where you convert a scanned document (tif) to a PDF while running OCR so that the PDF becomes searchable. Some of the document's formatting needs to be kept. Is this doable using docsplit? docsplit text removes all formatting from the original document. Any hints on how to achieve this are appreciated.

Couldn't open file '/tmp/docsplit/filename.pdf': No such file or directory.

I updated to the new docsplit (0.7.2) and now get this "No such file or directory" error.

I checked the /tmp folder for filename.pdf, and it is there.

Can someone help me understand what is going on?

When I run docsplit on an EC2 instance, it writes to the root "/tmp" folder, and I guess it does not have permission to read from it for some reason, because it works on localhost.

Thanks!

How can I run multiple PDF extraction processes concurrently?

I'd like to be able to extract PDFs concurrently, but this does not seem possible with the docsplit gem.
I tried to convert 2 ppt files to PDF at the same time, and the gem fails to process them.
The code is below; please replace path_to_docsplit.rb, path_to_test_file1.ppt, and path_to_test_file2.ppt.

I'm looking forward to your answer.
Thank you,
Quyen

#!/usr/bin/ruby

require 'path_to_docsplit.rb'

def extraction(path_to_file)
  Docsplit.extract_pdf(path_to_file)
end

puts('start extraction')
t1 = Thread.new { extraction('path_to_test_file1.ppt') }
t2 = Thread.new { extraction('path_to_test_file2.ppt') }
t1.join
t2.join
puts('end extraction')
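If the failure comes from two office-backed conversions starting at the same moment (an assumption; the "jodconverter --external" report further down this page describes that symptom), one workaround is to keep the threads but let only one conversion run at a time. The sketch below is not docsplit code: convert_serially and fake_converter are hypothetical names, and the converter is passed in as a callable so the example runs without docsplit installed.

```ruby
# One process-wide lock so only a single conversion runs at a time.
CONVERSION_LOCK = Mutex.new

# Starts one thread per path, but serializes the actual conversion call.
# `converter` is any callable; in real use it might wrap Docsplit.extract_pdf
# (assumption: that call is the step that must not overlap).
def convert_serially(paths, converter)
  results = {}
  threads = paths.map do |path|
    Thread.new do
      CONVERSION_LOCK.synchronize { results[path] = converter.call(path) }
    end
  end
  threads.each(&:join)
  results
end

# Stand-in converter so the sketch runs without docsplit or an office install.
fake_converter = ->(path) { path.sub(/\.ppt\z/, '.pdf') }
p convert_serially(%w[path_to_test_file1.ppt path_to_test_file2.ppt], fake_converter)
```

A Mutex only coordinates threads inside one Ruby process; separate worker processes would need some other form of locking.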

Distorted PNG image when using docsplit images

I am having trouble rendering large pdf's into png's. The png output is all distorted and glitched out.

If I run:

$ docsplit images large_pdf_test.pdf -d 300 -f png

I get this image:

Any help would be appreciated!

I have a feeling it might be running out of memory, since it works on smaller PDFs that are produced the same way.

thanks,
Nick

clean_ocr method removes accents

The clean_ocr method replaces accented characters with ?.
This is a problem for French text.
If I use the --no-clean parameter, the accents are kept.

Option for binary blobs/file handles?

Hello everyone, I was wondering whether anyone is thinking about adding support for passing a File object (or maybe a relative thereof) to the docsplit ruby API instead of just pathnames?

The particular use case I have in mind is extracting text directly from a pdf in Mongodb's GridFS (yields a file-like IO object), but I think it should apply to anyone wanting to read a pdf from a binary stream. Writing the stream contents into a temp file in a real FS so a pathname can be supplied to docsplit in a runtime context feels like an artificial step, and on Heroku it becomes a bit of a headache :)

I don't know whether this is obviously impractical due to the underlying library APIs, but I thought it was worth the suggestion. Thanks for a great library!
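Until such an API exists, the temp-file step described above can at least be wrapped in one place. The sketch below uses only Ruby's standard library; with_temp_path is a hypothetical helper name, and the block is where a path-based call such as Docsplit.extract_text would go.

```ruby
require 'tempfile'
require 'stringio'

# Copies any readable IO-like object into a temporary file, yields the file's
# path, and deletes the file when the block returns. Returns the block's value.
def with_temp_path(io, extension = '.pdf')
  Tempfile.create(['docsplit-input', extension]) do |file|
    file.binmode
    IO.copy_stream(io, file)
    file.flush
    yield file.path
  end
end

# StringIO stands in for a GridFS stream here.
size = with_temp_path(StringIO.new('%PDF-1.4 placeholder bytes')) { |path| File.size(path) }
puts size
```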

Confusing file location when using extract_images on a PDF

Hi!
I'm using Docsplit to extract images from a PDF.
My problem is when I do this:

Docsplit.extract_images("./public/" + doc, :size => '100x', :format => [:png], :pages=>1)
=> ["./public//uploads/test/document/document/4fb4530d1a88789d180c6eaa/doc.pdf"]

The file is saved correctly, but in the root of my project, not in ./public//uploads/test/document/docu ... doc.pdf

Is this normal behavior? Can I set the output path somewhere?

By the way, I'm using JRuby 1.6.7 --1.9

Thanks for this project.

HTML/CSS to PDF

Can the library convert a full-fledged HTML/CSS web page to a PDF document?

Would be nice to specify a sub-directory for the temporary files

There can be issues on some systems with the permissions of /tmp at file cleanup time -- it would be nice to have an option to create tmp files in a sub-directory on /tmp.

Docsplit on fedora x86_64

I installed docsplit on Fedora following the instructions on the documentation page http://documentcloud.github.com/docsplit .
I created a symbolic link:

ln -s /usr/lib64/openoffice.org3 /usr/lib/openoffice

After that, docsplit works well except when I want to use the OpenOffice functionality, for example docsplit text my_file.odt . Here is the error message:

Exception in thread "main" org.artofsolving.jodconverter.office.OfficeException: failed to start and connect
at org.artofsolving.jodconverter.office.ManagedOfficeProcess.startAndWait(ManagedOfficeProcess.java:61)
at org.artofsolving.jodconverter.office.PooledOfficeManager.start(PooledOfficeManager.java:102)
at org.artofsolving.jodconverter.office.ProcessPoolOfficeManager.start(ProcessPoolOfficeManager.java:59)
at org.artofsolving.jodconverter.cli.Convert.main(Convert.java:98)
Caused by: java.util.concurrent.ExecutionException: org.artofsolving.jodconverter.office.OfficeException: could not establish connection
at java.util.concurrent.FutureTask$Sync.innerGet(Unknown Source)
at java.util.concurrent.FutureTask.get(Unknown Source)
at org.artofsolving.jodconverter.office.ManagedOfficeProcess.startAndWait(ManagedOfficeProcess.java:59)
... 3 more
Caused by: org.artofsolving.jodconverter.office.OfficeException: could not establish connection
at org.artofsolving.jodconverter.office.ManagedOfficeProcess.doStartProcessAndConnect(ManagedOfficeProcess.java:123)
at org.artofsolving.jodconverter.office.ManagedOfficeProcess.access$000(ManagedOfficeProcess.java:31)
at org.artofsolving.jodconverter.office.ManagedOfficeProcess$1.run(ManagedOfficeProcess.java:55)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: org.artofsolving.jodconverter.office.RetryTimeoutException: java.net.ConnectException: connection failed: 'socket,host=127.0.0.1,port=2002,tcpNoDelay=1'; java.net.ConnectException: Connection refused
at org.artofsolving.jodconverter.office.Retryable.execute(Retryable.java:48)
at org.artofsolving.jodconverter.office.Retryable.execute(Retryable.java:31)
at org.artofsolving.jodconverter.office.ManagedOfficeProcess.doStartProcessAndConnect(ManagedOfficeProcess.java:113)
... 8 more
Caused by: java.net.ConnectException: connection failed: 'socket,host=127.0.0.1,port=2002,tcpNoDelay=1';  java.net.ConnectException: Connection refused
at org.artofsolving.jodconverter.office.OfficeConnection.connect(OfficeConnection.java:101)
at org.artofsolving.jodconverter.office.ManagedOfficeProcess$6.attempt(ManagedOfficeProcess.java:116)
at org.artofsolving.jodconverter.office.Retryable.execute(Retryable.java:41)
... 10 more

I have tested the installation on Ubuntu 10.10 and everything works.
On Fedora there is no package/RPM for openoffice.org-java-common, so I unpacked an openoffice.org-java-common.deb and added the missing files to the Fedora system, but I still have the same issue.
Furthermore, the version of OpenOffice is 3.2 on Ubuntu and 3.3 on Fedora. Is the problem related to the OpenOffice version?

Please don't tell me to only use ubuntu.

Thanks (Sorry for my english)

docsplit doesn't work with LibreOffice

I tried to run docsplit on a workstation running Ubuntu 10.10 (Maverick Meerkat), and when I invoked it via the command-line interface it threw the following error:

Exception in thread "main" java.lang.IllegalStateException: invalid officeHome: it doesn't contain soffice.bin: /usr/lib/openoffice 
at org.artofsolving.jodconverter.office.DefaultOfficeManagerConfiguration.buildOfficeManager(DefaultOfficeManagerConfiguration.java:119)
at org.artofsolving.jodconverter.cli.Convert.main(Convert.java:97)

I eventually tracked the issue down to the fact that this particular workstation had replaced Ubuntu's standard OpenOffice.org with the new LibreOffice fork from the Document Foundation, as distributed in this PPA. Removing LibreOffice and re-installing the stock OpenOffice.org from the Ubuntu repositories fixed the problem.

How to use with paperclip?

Hi,
First, Docsplit is awesome. Thank you for sharing that.
I would like my users to upload a file (PDF) and then have it available as images.
I am trying to combine paperclip with docsplit for that, with no real success (Rails newbie).

Any tips or directions on how I can accomplish that? How did you do it on DocumentCloud?

Thank you so much!
S.

Expose density arg in ImageExtractor

I'd like to be able to set a higher density when extracting images from a PDF.

Cheers,
Zach

docsplit has trouble with uppercase file extensions

[mike 1:12:20 ~/docsplit-test]% docsplit text HB00300S.PDF 
Exception in thread "main" java.lang.IllegalArgumentException: unsupported input format: Portable Document Format
    at com.artofsolving.jodconverter.openoffice.converter.AbstractOpenOfficeDocumentConverter.convert(AbstractOpenOfficeDocumentConverter.java:99)
    at com.artofsolving.jodconverter.openoffice.converter.AbstractOpenOfficeDocumentConverter.convert(AbstractOpenOfficeDocumentConverter.java:74)
    at com.artofsolving.jodconverter.openoffice.converter.AbstractOpenOfficeDocumentConverter.convert(AbstractOpenOfficeDocumentConverter.java:70)
    at com.artofsolving.jodconverter.cli.ConvertDocument.convertOne(ConvertDocument.java:154)
    at com.artofsolving.jodconverter.cli.ConvertDocument.main(ConvertDocument.java:133)
[mike 1:12:30 ~/docsplit-test]% mv HB00300S.PDF HB00300S.pdf 
[mike 1:12:34 ~/docsplit-test]% docsplit text HB00300S.pdf 
[mike 1:12:43 ~/docsplit-test]%
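The transcript above suggests the file type is decided from the extension in a case-sensitive way (an assumption; the actual check inside docsplit is not shown here). A case-insensitive check could look like the following sketch, where pdf_extension? is a hypothetical helper name.

```ruby
# Hypothetical helper: true when the path ends in .pdf, regardless of case.
def pdf_extension?(path)
  File.extname(path).downcase == '.pdf'
end

puts pdf_extension?('HB00300S.PDF')
puts pdf_extension?('notes.doc')
```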

jodconverter --external

When running two "docsplit images foo.doc" commands at the same time, one will crash when trying to connect to office.

Docsplit java calls look like:

java -Djava.awt.headless=true -Djava.util.logging.config.file=/home/[..bla...]/docsplit-0.5.2/vendor/logging.properties -Doffice.home=/usr/lib/openoffice -cp /home/..bla...]/docsplit-0.5.2/vendor/'*' -jar /home/..bla...]/jodconverter-core-3.0.jar -r /home/[..bla...]/vendor/conf/document-formats.js "/home/greg/Downloads/foobar.docx" "/tmp/docsplit/foobar.pdf"

Exception in thread "main" org.artofsolving.jodconverter.office.OfficeException: failed to start and connect
...
Caused by: java.lang.IllegalStateException: a process with acceptString 'socket,host=127.0.0.1,port=2002' is already running; pid 19490

If we launch soffice in headless mode as stated in the docs:
soffice -headless -accept="socket,host=127.0.0.1,port=8100;urp;" -nofirststartwizard

How can we then tell docsplit to make jodconverter use port 8100 instead of port 2002, which it currently uses?

Having one headless soffice daemon running would probably solve our OfficeException when running concurrent docsplit workers.

tesseract 3.0

Hello,
I have a few questions about the Docsplit project. We are building a product
that uses docsplit. In this product we need OCR support for Slovak.
As you may know, a new version of tesseract has been released that makes this
possible. When do you plan to add support for tesseract 3.0 to docsplit?
Have a nice day. Matej Marus

Unable to use LibreOffice on version 0.7.2 -> Could not find or load main class .usr.lib.libreoffice

Development has been done on a Mac; upgrading to docsplit 0.7.2 and installing LibreOffice was a breeze, and everything worked perfectly the first time around.

However, moving it to our staging environment (AWS EC2 running Ubuntu) has proven to be a challenge.

The Gemfile was updated and LibreOffice was installed (sudo apt-get install libreoffice), but docsplit seems to have trouble running LibreOffice:

$ bundle exec docsplit pdf test.txt
Error: Could not find or load main class .usr.lib.libreoffice

However, LibreOffice seems to be installed properly:

$ soffice --headless --convert-to txt:text test_fake.doc
convert /home/app/helpers-staging/releases/20130328152857/test.doc -> /home/app/helpers-staging/releases/20130328152857/test.txt using text

Use the extract_text data in Ruby rather than a file

Can I just say first.. Docsplit is amazing!

I have this in my code...
Docsplit.extract_text(source_path, :output => destination_path)

Is there a way, however, to "get" the text in Ruby directly?
With the above, I end up with a lovely file that contains text, but if I am to use it, I need to reopen it to get its contents.

As far as I know, setting:

something = Docsplit.extract_text(...)

..would just give me the source filename in "something".
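Reading the file back can be kept to a small helper. The sketch below assumes the text lands in the output directory as "<basename>.txt", which is an assumption about the naming convention and should be checked against the installed docsplit version; both helper names are hypothetical.

```ruby
require 'tmpdir'

# Assumption: the extractor writes "<basename>.txt" into the output directory.
def extracted_text_path(source_path, output_dir)
  basename = File.basename(source_path, File.extname(source_path))
  File.join(output_dir, "#{basename}.txt")
end

# Returns the extracted text as a UTF-8 string.
def read_extracted_text(source_path, output_dir)
  File.read(extracted_text_path(source_path, output_dir), encoding: 'UTF-8')
end

# Demonstration with a hand-written file standing in for docsplit's output.
Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'report.txt'), "Nothing to be done.\n")
  puts read_extracted_text('/uploads/report.pdf', dir)
end
```

In real use the call to Docsplit.extract_text(source_path, :output => destination_path) would come first, followed by read_extracted_text(source_path, destination_path).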

Docsplit and CarrierWave

Hey,
I have open this issue: carrierwaveuploader/carrierwave#502
on CarrierWave - but maybe you could help me ?

Thanks

extract_text ignores new lines

I'm having an issue using extract_text on a .docx or .pdf file. It looks like the parser removes the new lines when reading in the document. Is there any setting to ensure these are put into the new txt file? I've tried :clean => false with no luck.

Example:

ESTRAGON:
(giving up again). Nothing to be done.
VLADIMIR:
(advancing with short, stiff strides, legs wide apart).

Converts to:
ESTRAGON: (giving up again). Nothing to be done. VLADIMIR: (advancing with short, stiff strides, legs wide apart).

Expected Result:
There should be \n where the line breaks are.

Notifications on error

Is there any way to track errors while converting a file?
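Assuming conversion failures surface as ordinary Ruby exceptions (a stack trace in another report on this page shows a Docsplit::ExtractionFailed error being raised), they can be collected with a plain rescue. The helper name convert_each below is hypothetical, and the block stands in for the real conversion call.

```ruby
# Runs the block once per path and collects a message for every path whose
# conversion raised. Returns a hash of path => "ErrorClass: message".
def convert_each(paths)
  failures = {}
  paths.each do |path|
    begin
      yield path
    rescue StandardError => e
      failures[path] = "#{e.class}: #{e.message}"
    end
  end
  failures
end

# Stand-in conversion that fails for one input.
failures = convert_each(%w[good.doc bad.doc]) do |path|
  raise ArgumentError, 'unsupported input' if path.start_with?('bad')
end
p failures
```

The returned hash could then be handed to whatever notification mechanism the application already uses.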

TextCleaner garbles German umlauts in recognized text

If I run tesseract over a scanned PDF, the text is correctly recognized with all its German umlauts and special characters. When text cleaning is enabled after recognition, the German umlauts get garbled.

Gemüse => Gem"use

This is due to the use of Iconv.iconv('ascii//translit//ignore', 'utf-8', text).first. As Iconv is also deprecated, it would make sense to remove the Iconv part. The output produced by TextCleaner (with Iconv disabled) is valid UTF-8 and my umlauts are preserved.

I suggest removing these two lines:

require 'iconv' unless defined?(Iconv)
text = Iconv.iconv('ascii//translit//ignore', 'utf-8', text).first

Thanks and cheers,
Marc
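For comparison, a cleaning step that stays in UTF-8 instead of transliterating to ASCII can be sketched with String#scrub from Ruby's core library. This is not TextCleaner's code; keep_utf8 is a hypothetical helper that only drops byte sequences that are invalid in UTF-8 and leaves umlauts and accents alone.

```ruby
# Hypothetical helper: treats the text as UTF-8 and removes only invalid byte
# sequences, so valid non-ASCII characters such as umlauts are preserved.
def keep_utf8(text)
  text.dup.force_encoding('UTF-8').scrub('')
end

puts keep_utf8('Gemüse')
puts keep_utf8("Gem\xFFse")
```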

How can I convert an input image to another image without converting it to PDF first?

Is there a way to do that in docsplit, or do I have to add this functionality in my own code?

No error output

I can't convert anything to pdf. I tried html, doc and docx

Example: docsplit pdf index.html

It doesn't give me any error or output

Docsplit version 0.7.2
LibreOffice 3.4 340m1(Build:502)

Passing options to GraphicsMagick

When converting PDFs to images with docsplit I found it added a lot of whitespace to some pages. Cutting a long story short, it's fixed by adding '-define pdf:use-cropbox=true' to the graphicsmagick call. I've forked docsplit to patch this in but would like to see it as an option I could pass in to the standard gem. I'm happy to do the work to implement any custom option passing, but before I started I wanted to know:

a) is this something you'd be willing to integrate?
b) if so, how would you want the options passed?

I see it either being a specific param to trigger the pdf cropping behaviour or the ability to pass in an arbitrary string to interpolate into the executed command.

Can't run the tests

$ ruby -v
ruby 1.9.3p125 (2012-02-16 revision 34643) [x86_64-darwin11.2.0]
$ git clone git@github.com:documentcloud/docsplit.git
Cloning into 'docsplit'...
remote: Counting objects: 667, done.
remote: Compressing objects: 100% (314/314), done.
remote: Total 667 (delta 384), reused 617 (delta 341)
Receiving objects: 100% (667/667), 8.67 MiB | 535 KiB/s, done.
Resolving deltas: 100% (384/384), done.
$ cd docsplit
$ rake -T
rake gem:install    # Build and install the docsplit gem
rake gem:uninstall  # Uninstall the docsplit gem
rake test           # Run all tests
$ rake test
NOTE: Gem.available? is deprecated, use Specification::find_by_name. It will be removed on or after 2011-11-01.
Gem.available? called from /Users/dentarg/src/docsplit/Rakefile:7.
rake aborted!
cannot load such file -- test/unit/test_convert_to_pdf.rb

Tasks: TOP => test
(See full trace by running task with --trace)

Bug where strange text is being overlaid to extracted image (pptx to png)

I am having an issue with docsplit adding strange text when converting some PPTX files to PNGs.

It appears that some text is being superimposed on top of some images inside the slides (always at the top left corner).

Using LibreOffice 3.5.

Really appreciate any help on this!

Is there a way to get rid of the dependency warnings?

I'm using DocSplit in an app that doesn't require the dependencies - so the dependency warnings are always showing up in my dev environment, cluttering up my console (crying wolf, if you will). It would be nice if there were some config setting to ignore the dependencies. Is there any way to do that currently?

Not saving Unicode (UTF8) characters (accents in other languages)

Basically, I am trying to recognize the text of the attached image. When I use tesseract directly on the image, it works:

tesseract p1.jpg p1 -l spa

For example, "Pero acá tenés una y está en tus manos."

However when I try to use docsplit directly, the accents are not saved correctly.

docsplit text p1.pdf --pages all -l spa

And the same line becomes, "Pero ac? tenes una est? en tus manos."

PDF to SVG

Is there a way to convert PDFs and Office documents into SVG?

PDFs with rotated pages are clipped

When image splitting a PDF with one or several rotated pages, the results are cropped and bottom-aligned to fit in the standard portrait orientation.

The command:

docsplit images example.pdf

using this pdf produces

Error: /undefined in x when trying to convert PDF to images

When I try to convert certain PDFs to images with the Docsplit.extract_images method, I get the following error:

Error: /undefined in x
Operand stack:

Execution stack:
%interp_exit .runexec2 --nostringval-- --nostringval-- --nostringval-- 2 %stopped_push --nostringval-- --nostringval-- --nostringval-- false 1 %stopped_push 1862 1 3 %oparray_pop 1861 1 3 %oparray_pop 1845 1 3 %oparray_pop 1739 1 3 %oparray_pop --nostringval-- %errorexec_pop .runexec2 --nostringval-- --nostringval-- --nostringval-- 2 %stopped_push --nostringval--
Dictionary stack:
--dict:1158/1684(ro)(G)-- --dict:0/20(G)-- --dict:71/200(L)--
Current allocation mode is local
Current file position is 1
GPL Ghostscript 8.70: Unrecoverable error, exit code 1

DocSplit fails to extract text on Windows when filenames have spaces

Docsplit invokes pdftotext to extract text, escaping spaces in the filename with \ to construct a command line. On Windows, \ does not escape spaces.

In my instrumented test, docsplit attempts to execute the following command:

pdftotext -enc UTF-8 test-docs/Ideology\ and\ Climate\ Change.pdf extracted-text/Ideology\ and\ Climate\ Change.txt 2>&1

This fails with the error message below. The following command works:

pdftotext -enc UTF-8 "test-docs/Ideology and Climate Change.pdf" "extracted-text/Ideology and Climate Change.txt" 2>&1

You will need poppler on Windows to reproduce, which is available here: http://www.compgeom.com/~piyush/scripts/scripts.html

Full error message follows:

C:/Ruby193/lib/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/text_extractor.rb:99:in `run': pdftotext version 0.16.6 (Docsplit::ExtractionFailed)
Copyright 2005-2011 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2004 Glyph & Cog, LLC
Usage: pdftotext [options] <PDF-file> [<text-file>]
  -f <int>          : first page to convert
  -l <int>          : last page to convert
  -r <fp>           : resolution, in DPI (default is 72)
  -x <int>          : x-coordinate of the crop area top left corner
  -y <int>          : y-coordinate of the crop area top left corner
  -W <int>          : width of crop area in pixels (default is 0)
  -H <int>          : height of crop area in pixels (default is 0)
  -layout           : maintain original physical layout
  -raw              : keep strings in content stream order
  -htmlmeta         : generate a simple HTML file, including the meta information
  -enc <string>     : output text encoding name
  -listenc          : list available encodings
  -eol <string>     : output end-of-line convention (unix, dos, or mac)
  -nopgbrk          : don't insert page breaks between pages
  -bbox             : output bounding box for each word and page size to html.  Sets -htmlmeta
  -opw <string>     : owner password (for encrypted files)
  -upw <string>     : user password (for encrypted files)
  -q                : don't print any messages or errors
  -v                : print copyright and version info
  -h                : print usage information
  -help             : print usage information
  --help            : print usage information
  -?                : print usage information
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/text_extractor.rb:106:in `extract_full'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/text_extractor.rb:54:in `extract_from_pdf'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/text_extractor.rb:38:in `block in extract'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/text_extractor.rb:32:in `each'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit/text_extractor.rb:32:in `extract'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/docsplit-0.6.3/lib/docsplit.rb:51:in `extract_text'
        from overview-prototype/docloader/docloader.rb:75:in `processFile'
        from overview-prototype/docloader/docloader.rb:150:in `block in <main>'
        from overview-prototype/docloader/docloader.rb:50:in `call'
        from overview-prototype/docloader/docloader.rb:50:in `block in scanDir'
        from overview-prototype/docloader/docloader.rb:42:in `foreach'
        from overview-prototype/docloader/docloader.rb:42:in `scanDir'
        from overview-prototype/docloader/docloader.rb:150:in
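One way to avoid platform-specific quoting altogether is to skip the shell and pass each argument separately. The sketch below uses Open3.capture2e from Ruby's standard library; run_without_shell is a hypothetical helper, not docsplit's implementation, and the Ruby interpreter stands in for pdftotext so the example runs anywhere.

```ruby
require 'open3'
require 'rbconfig'

# Runs a program with its arguments passed as separate strings, so no shell
# parses the command line and spaces in paths need no escaping.
# Expects the program plus at least one argument.
def run_without_shell(*argv)
  output, status = Open3.capture2e(*argv)
  raise "command failed: #{argv.first}\n#{output}" unless status.success?
  output
end

# The Ruby interpreter echoes its first argument back, standing in for pdftotext.
puts run_without_shell(RbConfig.ruby, '-e', 'print ARGV.first', 'Ideology and Climate Change.pdf')
```

With poppler installed, the equivalent call would be run_without_shell('pdftotext', '-enc', 'UTF-8', pdf_path, text_path), where pdf_path and text_path are placeholder variables.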

[docsplit image] Horizontal image gets thumbnailed at the bottom of an A4 page.

Here is an example:

Input:

Thumbnail:

Java chokes on paths with spaces

When docsplit is installed in a directory which contains spaces, Java chokes and dies.

undefined method normalize_range for Docsplit:Module

When specifying :pages => Range.new(1,5) I get an "undefined method normalize_range for Docsplit:Module" error. Looking at the source, the method doesn't seem to be defined.

https://github.com/documentcloud/docsplit/blob/master/lib/docsplit.rb#L112-113

options = { :format => :jpg, :output => path_to_output_directory, :pages => Range.new(1, 5) }
Docsplit.extract_images(path_to_file, options)
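As a workaround on the caller's side, the page selection can be flattened into a plain array of integers before it is handed to the library. The helper normalize_pages below is hypothetical; it is not the missing normalize_range method, and whether a given docsplit version accepts an array for :pages is an assumption that should be checked.

```ruby
# Hypothetical helper: turns a Range, Integer, Array, or a string such as
# "1-3,7" into a sorted array of unique page numbers.
def normalize_pages(value)
  case value
  when Range   then value.to_a
  when Integer then [value]
  when Array   then value.flat_map { |item| normalize_pages(item) }.uniq.sort
  when String
    value.split(',').flat_map do |part|
      first, last = part.strip.split('-', 2).map { |number| Integer(number) }
      (first..(last || first)).to_a
    end.uniq.sort
  else
    raise ArgumentError, "unsupported pages value: #{value.inspect}"
  end
end

p normalize_pages(Range.new(1, 5))
p normalize_pages('1-3,7')
```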

Notifications on error

Is there any way to receive notifications about errors in extract_pdf?

Extracting text from PDFs

When using docsplit to extract text from some PDFs, I found that some of the text comes out mixed up. I understand that docsplit is a thin layer over other tools (in fact, pdftotext is the one to blame for mixing the text), but I was wondering if you had some examples of how to use docsplit to minimize these effects (maybe using OCR instead of pdftotext?).

Also, I couldn't find any suggestions for stripping headers and page numbers from the output text. I wrote some code, but I guess you had the same problem and maybe came up with something better? :)

Thanks!

basic usage of docsplit

Hi, I installed the Docsplit gem yesterday (together with all the dependencies) and wanted to test it quickly, so I tried one of the examples from your documentation (on the command line):

docsplit images example.pdf

and this was the error output:

execvp failed, errno = 2 (No such file or directory)
gm convert: "gs" "-q" "-dBATCH" "-dMaxBitmap=50000000" "-dNOPAUSE" "-sDEVICE=ppmraw" "-dTextAlphaBits=4" "-dGraphicsAlphaBits=4" "-r150x150" "-dFirstPage=1" "-dLastPage=1" "-sOutputFile=/var/folders/um/umOJP4yeEoG4UihNlcD7ME+++TM/-Tmp-/d20110325-6084-j35i1w/gmrpht13" "--" "/var/folders/um/umOJP4yeEoG4UihNlcD7ME+++TM/-Tmp-/d20110325-6084-j35i1w/gm04N0rO" "-c" "quit".
gm convert: Postscript delegate failed (example.pdf).

I'm not sure why it says No such file or directory because I'm absolutely sure the file exists.

Also I'm trying out the method in a ruby script (usually I only use gems in a Ruby on Rails project, so this might be a stupid error)

require 'rubygems'
require 'docsplit'

CUR_DIR = Dir.getwd
DOCS_DIR = "#{CUR_DIR}/docs"
THUMB_DIR = "#{CUR_DIR}/thumbnails"

Dir.mkdir DOCS_DIR unless File.directory? DOCS_DIR
Dir.mkdir THUMB_DIR unless File.directory? THUMB_DIR

Dir.chdir(DOCS_DIR)
Dir["*"].each do |filename|
  # skip directories
  next if File.directory? filename

  puts "processing #{filename}"  
  Docsplit.extract_images(filename, :size => '920x', :format => [:png, :jpg])
end

NameError: uninitialized constant Docsplit

Note I'm using docsplit (0.5.0) and ruby 1.8.7 (2011-02-18 patchlevel 334) [i686-darwin10]

Would you happen to know what's causing this problem and what might fix it?

Won't work if docsplit is used by multiple unix users

Hello,

On this line, the directory name should be computed from Time.now or something similar. Hardcoded as it is, it will produce write-permission errors if the gem is used by more than one Unix user (the directory will belong to the first user who runs docsplit, and all others will get write errors).
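Ruby's standard library already offers a way to get a uniquely named, per-user directory. The sketch below shows the idea with Dir.mktmpdir; with_private_workdir is a hypothetical helper and does not reflect how docsplit currently creates its temp directory.

```ruby
require 'tmpdir'

# Yields a freshly created, uniquely named working directory and removes it
# afterwards. Dir.mktmpdir creates the directory with 0700 permissions, so it
# belongs only to the user who is running the conversion.
def with_private_workdir(prefix = 'docsplit')
  Dir.mktmpdir(prefix) { |dir| yield dir }
end

with_private_workdir do |dir|
  puts "working in #{dir}"
end
```

Because each call gets its own directory, two users (or two concurrent runs by the same user) no longer share a hardcoded path.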

extract_pages does not use page range "pages" parameter

extract_options doesn't extract the parameter, the script never uses it, and in fact pdftk, the underlying utility for page extraction, does not support partial page ranges.

Determining the paths of images created as a result of Docsplit.extract_images

It would be nice if there was an option one could pass which would determine whether the images created were returned, or the path to the PDF that was created as an intermediary. I'm thinking something like :and_return => :images

or :and_return => :intermediate

With the default case (:and_return == nil) being the current behavior. I believe this wouldn't break backwards compatibility, and would increase the utility of this application from within Ruby.
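In the meantime, the image paths can be recovered from the output directory after the call returns. The sketch below assumes the images are named "<basename>_<page>.<format>", which is an assumption about the naming convention; extracted_images is a hypothetical helper and :and_return is not an existing option.

```ruby
require 'tmpdir'

# Lists image files in output_dir (and its subdirectories) that look like
# pages of the given source document. Assumes names of the form
# "<basename>_<page>.<format>" and a basename without glob metacharacters.
def extracted_images(source_path, output_dir, formats = %w[png jpg gif tif])
  basename = File.basename(source_path, File.extname(source_path))
  pattern = File.join(output_dir, '**', "#{basename}_*.{#{formats.join(',')}}")
  Dir.glob(pattern).sort
end

# Demonstration with empty files standing in for extracted pages.
Dir.mktmpdir do |dir|
  %w[doc_1.png doc_2.png notes.txt].each { |name| File.write(File.join(dir, name), '') }
  puts extracted_images('/uploads/doc.pdf', dir).map { |path| File.basename(path) }
end
```

The sort is lexicographic, so page 10 sorts before page 2; callers that need page order would have to sort on the numeric suffix.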