manuscript-dl's Introduction

manuscript-dl

Collection of scripts to download digitized manuscripts from different online libraries.

Some online libraries provide a convenient way to download a complete manuscript as a PDF file. Some don't. Mad scripting skills to the rescue!

Supported libraries

To download a book from nb.no (the National Library of Norway):

  1. Find out its ID, e.g. URN:NBN:no-nb_digibok_2008091504048 in https://www.nb.no/items/URN:NBN:no-nb_digibok_2008091504048?page=1
  2. (optional) Copy the request as a curl command from the browser so that you preserve cookies, and adjust it as needed.
  3. Run:
$ python ./nb.no.py -H 'cookie: something' URN:NBN:no-nb_digibok_2008091504048
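
Step 1 can be sketched in Python as follows. This is only an illustration, assuming the ID is always the last path segment of the item URL, as in the example above; `nb_item_id` is a hypothetical helper and not part of nb.no.py.

```python
from urllib.parse import urlparse


def nb_item_id(url):
    """Return the last path segment of an nb.no item URL.

    Hypothetical helper: assumes the ID is the final path segment,
    which holds for the example URL in the steps above.
    """
    path = urlparse(url).path  # the query string ("?page=1") is not part of the path
    return path.rstrip("/").rsplit("/", 1)[-1]


print(nb_item_id("https://www.nb.no/items/URN:NBN:no-nb_digibok_2008091504048?page=1"))
```

The printed value is what gets passed to nb.no.py as the last argument.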

To download a book from e-codices:

  1. Go to the book description page, e.g.: http://www.e-codices.unifr.ch/en/list/one/csg/0369
  2. Right-click the "IIIF Manifest URL" link and save it to a file, e.g. manifest.json
  3. Run:
$ e-codices.sh manifest.json [size]

size is an optional argument. The original size of manuscripts on e-codices is usually far too big and needs to be reduced.
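
To show what a manifest contains and where a size would fit, here is a minimal Python sketch. It is not how e-codices.sh works internally (that is not described here). It assumes a IIIF Presentation API 2.x manifest layout and the IIIF Image API 2.x URL pattern `{service}/{region}/{size}/{rotation}/{quality}.{format}`; the function name and the example.org URL are made up for illustration.

```python
import json

# Tiny hand-written manifest in the IIIF Presentation 2.x shape (hypothetical data).
SAMPLE_MANIFEST = json.loads("""
{
  "sequences": [
    {"canvases": [
      {"images": [
        {"resource": {"service": {"@id": "https://example.org/iiif/page-1"}}}
      ]}
    ]}
  ]
}
""")


def iiif_image_urls(manifest, size="full"):
    """Build one image URL per page image found in the manifest.

    size uses IIIF Image API syntax, e.g. "full", "1000," (width 1000)
    or "pct:50" (half size).
    """
    urls = []
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            for image in canvas.get("images", []):
                service = image["resource"]["service"]["@id"].rstrip("/")
                urls.append(f"{service}/full/{size}/0/default.jpg")
    return urls


print(iiif_image_urls(SAMPLE_MANIFEST, "1000,"))
```

With a real file, the manifest would be loaded with `json.load(open("manifest.json"))` instead of the inline sample.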

This downloader uses the montage program (from the ImageMagick suite) to convert images to PDFs and pdftk to concatenate the PDFs. Both pdftk and montage must be installed on your system.

Ubuntu:

sudo apt-get install pdftk imagemagick

To download a book from the British Library (bl.uk), you need to find out its short name:

  1. Open the manuscript description, e.g.: http://www.bl.uk/manuscripts/FullDisplay.aspx?ref=Add_MS_24686
  2. In this case the name is "add_ms_24686" (note the lower case). To double-check, click any of the pictures below, which opens a new page: http://www.bl.uk/manuscripts/Viewer.aspx?ref=add_ms_24686_f002r
  3. Here, add_ms_24686_f002r is the manuscript name plus the page name. You only need the manuscript name.
  4. Run bl.uk.py with the manuscript name:
$ python3 bl.uk.py add_ms_24686 --resolution 12

This will grab all available pages at resolution 12. If you want specific pages, set a page range with the --pages A:B argument.
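
The short-name rule from steps 1 to 3 can be sketched in Python. This is a rough illustration only: `bl_manuscript_name` is a hypothetical helper, and the page-name pattern (`_f`, digits, then `r` or `v`) is inferred from the two example URLs, so other page names may not match it.

```python
import re
from urllib.parse import urlparse, parse_qs

# Page suffix as seen in the example: "_f002r" (assumed pattern).
_PAGE_SUFFIX = re.compile(r"_f\d+[rv]$")


def bl_manuscript_name(url):
    """Extract the lower-case manuscript name from a bl.uk manuscripts URL."""
    ref = parse_qs(urlparse(url).query)["ref"][0]
    return _PAGE_SUFFIX.sub("", ref.lower())


print(bl_manuscript_name("http://www.bl.uk/manuscripts/FullDisplay.aspx?ref=Add_MS_24686"))
print(bl_manuscript_name("http://www.bl.uk/manuscripts/Viewer.aspx?ref=add_ms_24686_f002r"))
```

Both calls give the same name, which is the value to pass to bl.uk.py.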

At some point the Library started replying with HTTP 429 (Too Many Requests). Faking the user agent helped. If the default user agent does not work for you, replace it with the --user-agent option:

python3 bl.uk.py add_ms_24686 --user-agent 'Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36'
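
For reference, this is roughly what overriding the user agent means at the HTTP level. The sketch uses only the standard library and sends nothing over the network; `build_request` and `EXAMPLE_UA` are illustrative names, and bl.uk.py may set the header differently.

```python
import urllib.request

# The user agent string from the command above, used here as an example value.
EXAMPLE_UA = (
    "Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"
)


def build_request(url, user_agent=EXAMPLE_UA):
    """Create a request object that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request("http://www.bl.uk/manuscripts/Viewer.aspx?ref=add_ms_24686_f002r")
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```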

Author

(c) 2015-2018 Yuri Bochkarev

manuscript-dl's People

Contributors

balta2ar

manuscript-dl's Issues

Some clarifications on the bl.uk.py script

I ran into a few issues that stumped me for a bit but that I eventually figured out. I'm sharing them here in case someone else runs into them, or I do again after I forget.

  1. If a manuscript does not exist in a certain resolution (e.g. no resolution 14), the output looks something like this:
python3 bl.uk.py or_12988 --resolution 14 --user-agent 'Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36' --pages 206:206
Downloading manuscript or_12988 resolution 14
335 pages found
1 pages downloading (range 206:206)
Downloading page or_12988_f103r (1/1)

End of the page
Page or_12988_f103r has size row x column = 0 x 0
Concatenating page or_12988_f103r (1/1)
montage-im6.q16: missing an image filename `pics/14/or_12988/or_12988_f103r/row_0.jpg' @ error/montage.c/MontageImageCommand/1804.
.montage-im6.q16: missing an image filename `pics/14/or_12988/or_12988_f103r.jpg' @ error/montage.c/MontageImageCommand/1804.

Converting manuscript or_12988 into PDF
Converting page or_12988_f103r (1/1)
convert-im6.q16: unable to open image `pics/14/or_12988/or_12988_f103r.jpg': No such file or directory @ error/blob.c/OpenBlob/2924.
convert-im6.q16: no images defined `pics/14/or_12988/or_12988_f103r.pdf' @ error/convert.c/ConvertImageCommand/3229.
Folding page or_12988_f103r.pdf (1/1)
Traceback (most recent call last):
  File "/home/mo/manuscript-dl/bl.uk.py", line 423, in <module>
    sys.exit(main(args))
  File "/home/mo/manuscript-dl/bl.uk.py", line 401, in main
    download_manuscript(args.pages,
  File "/home/mo/manuscript-dl/bl.uk.py", line 396, in download_manuscript
    convert_manuscript(resolution, base_dir, manuscript, pages)
  File "/home/mo/manuscript-dl/bl.uk.py", line 368, in convert_manuscript
    fold_pages(base_dir, manuscript, pages, output_name)
  File "/home/mo/manuscript-dl/bl.uk.py", line 345, in fold_pages
    shutil.copy2(pdf_name, output_name)
  File "/usr/lib/python3.10/shutil.py", line 434, in copy2
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/usr/lib/python3.10/shutil.py", line 254, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: 'pics/14/or_12988/or_12988_f103r.pdf'
  2. The --pages argument requires integers. The folios on the manuscript page are labeled with r and v suffixes, for example f103r and f103v. To download only f103r, use 206 (103*2). For f103v, use 207 (103*2 + 1).
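
The arithmetic in item 2 can be written as a small Python helper. It is a sketch of the rule as observed for or_12988 in this issue; `folio_to_page_index` is a hypothetical name, and manuscripts with covers or extra leaves may be offset differently.

```python
import re


def folio_to_page_index(label):
    """Convert a folio label such as "f103r" to the integer used by --pages.

    Rule taken from the issue text: recto is folio * 2, verso is folio * 2 + 1.
    """
    match = re.fullmatch(r"f(\d+)([rv])", label)
    if match is None:
        raise ValueError(f"unrecognized folio label: {label!r}")
    number = int(match.group(1))
    side = match.group(2)
    return number * 2 + (1 if side == "v" else 0)


index = folio_to_page_index("f103r")
print(f"--pages {index}:{index}")
```

For f103r this produces the same 206:206 range that appears in the command in item 1.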

The BL downloader gets stuck in a redirect loop

I believe the loop is caused by the lack of a particular session header. I didn't investigate in detail, but it's repeatedly redirecting to http://www.bl.uk/manuscripts/SetupViewerHandler.ashx?[ms identifier]. This page sets up a cookie for ASP.NET_SessionId. I suspect that is needed, but I didn't investigate further.

I was lazy and was able to bypass this by copying the headers from a session with the Chrome developer tools and adding them to the request, after which it worked fine.
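
If the missing session cookie really is the cause (which is only a suspicion here, not something that was verified), one way to test it is to let the HTTP client keep cookies between redirects. The sketch below uses the standard library only and does not contact the server; `make_session_opener` is an illustrative name and is not part of bl.uk.py.

```python
import http.cookiejar
import urllib.request


def make_session_opener(user_agent=None):
    """Return (opener, jar); the opener stores and resends cookies it receives."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    if user_agent:
        opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


opener, jar = make_session_opener()
# opener.open(url) would now keep any cookie the server sets
# (such as ASP.NET_SessionId) and send it back on the redirected request.
print(len(jar))
```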

Nice script. I had been pulling them via a shell script and using pnmtile to put them back together.

Error when downloading images from the British Library

Traceback (most recent call last):
  File "C:\Users\Menna\Downloads\bl.uk.py", line 423, in <module>
    sys.exit(main(args))
  File "C:\Users\Menna\Downloads\bl.uk.py", line 401, in main
    download_manuscript(args.pages,
  File "C:\Users\Menna\Downloads\bl.uk.py", line 392, in download_manuscript
    download_pages(resolution, base_dir, manuscript, pages)
  File "C:\Users\Menna\Downloads\bl.uk.py", line 358, in download_pages
    concatenate_page(base_dir, manuscript, page, columns, rows)
  File "C:\Users\Menna\Downloads\bl.uk.py", line 299, in concatenate_page
    call(cmd)
  File "C:\Users\Menna\AppData\Local\Programs\Python\Python39\lib\subprocess.py", line 349, in call
    with Popen(*popenargs, **kwargs) as p:
  File "C:\Users\Menna\AppData\Local\Programs\Python\Python39\lib\subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\Users\Menna\AppData\Local\Programs\Python\Python39\lib\subprocess.py", line 1420, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified
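
The issue gives no diagnosis, but the error is raised while concatenate_page starts an external command, so a plausible cause is that the program it calls is not on PATH. A quick way to check is sketched below. The tool list is an assumption based on the programs named in the README and in the log output of the other issue (montage, convert, pdftk); `missing_tools` is a hypothetical helper.

```python
import shutil

# Assumed list of external programs; adjust to what the script actually calls.
REQUIRED_TOOLS = ("montage", "convert", "pdftk")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required tools were found.")
```

On Windows, note that a system utility named convert also exists, so finding `convert` on PATH does not prove that it is the ImageMagick one.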

Download too slow.

I need at least 2 days to download a manuscript (400 pages) at resolution 13.

Any idea how to improve this code?
My internet speed is 100 Mbps.
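
One general idea, offered as a sketch rather than a patch for bl.uk.py (whose download loop is not shown here), is to fetch several items at once with a small thread pool. `fetch_all` and its `fetch` callback are hypothetical. Since the Library has been seen replying with HTTP 429, the pool size should probably stay small.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(items, fetch, max_workers=4):
    """Call fetch(item) for every item using a small thread pool.

    Results are returned in the same order as items. fetch is any
    callable that downloads one item (for example, one page or tile).
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, items))


if __name__ == "__main__":
    # Stand-in for a real download function, so the sketch runs offline.
    def fake_fetch(name):
        return f"downloaded {name}"

    print(fetch_all(["page_1", "page_2", "page_3"], fake_fetch))
```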
