balta2ar / manuscript-dl


Collection of scripts to download digitized manuscripts from various online libraries


manuscript-dl

Collection of scripts to download digitized manuscripts from different online libraries.

Some online libraries provide a convenient way to download a complete manuscript as a PDF file. Some don't. Mad scripting skills to the rescue!

Supported libraries

Nasjonalbiblioteket

To download a book:

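1. Find its ID in the item URL, e.g.: https://www.nb.no/items/URN:NBN:no-nb_digibok_2008091504048?page=1 (here the ID is URN:NBN:no-nb_digibok_2008091504048).
2. (Optional) Copy the request as a curl command from the browser so that you preserve cookies, and adjust it as needed.
3. Run: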

$ python ./nb.no.py -H 'cookie: something' URN:NBN:no-nb_digibok_2008091504048

e-codices - Virtual Manuscript Library of Switzerland

To download a book:

1. Go to the book description page, e.g.: http://www.e-codices.unifr.ch/en/list/one/csg/0369
2. Right-click the "IIIF Manifest URL" link and save it to a file, e.g. manifest.json
3. Run:

$ e-codices.sh manifest.json [size]

size is an optional argument. The original images on e-codices are usually far too large and need to be reduced.
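As a rough illustration of what a saved manifest contains, the sketch below lists one image URL per page. It is not taken from e-codices.sh: it assumes the manifest follows the IIIF Presentation API 2.x layout (sequences, canvases, images, resource, service) and that the image service accepts IIIF Image API 2.x size strings such as `full` or `1000,` (width of 1000 pixels). The function name `iiif_image_urls` is hypothetical.

```python
def iiif_image_urls(manifest, size="full"):
    """Return one IIIF Image API URL per image found in a parsed manifest.

    Assumes a IIIF Presentation 2.x manifest (already loaded with json.load).
    `size` is an IIIF Image API 2.x size string, e.g. "full" or "1000,".
    """
    urls = []
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            for image in canvas.get("images", []):
                # The image service base URL identifies the page image.
                service = image["resource"]["service"]["@id"].rstrip("/")
                # Pattern: {service}/{region}/{size}/{rotation}/{quality}.{format}
                urls.append(f"{service}/full/{size}/0/default.jpg")
    return urls
```

To try it, load manifest.json with `json.load` and pass the resulting dictionary to `iiif_image_urls`, optionally with a smaller `size`.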

British Library Digitised Manuscripts

This downloader uses the montage program (part of the ImageMagick suite) to convert images to PDFs and pdftk to concatenate the PDFs. You need to have pdftk and montage installed on your system.

Ubuntu:

sudo apt-get install pdftk imagemagick
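Before a long download, it can help to confirm that the external programs are actually on your PATH. The check below is a minimal sketch and is not part of bl.uk.py; the helper name `missing_tools` is hypothetical, and the tool list assumes the script calls `montage`, `convert` and `pdftk` by those names.

```python
import shutil

# Assumed executable names; adjust them if your installation differs.
REQUIRED_TOOLS = ("montage", "convert", "pdftk")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]
```

An empty list from `missing_tools()` means every tool was found. Otherwise, install the listed programs first.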

To download a book you need to find out its short name:

1. Open the manuscript description, e.g.: http://www.bl.uk/manuscripts/FullDisplay.aspx?ref=Add_MS_24686
2. In this case the name is "add_ms_24686" (note the lower case). To double-check, click any of the pictures below the description to open a viewer page: http://www.bl.uk/manuscripts/Viewer.aspx?ref=add_ms_24686_f002r
3. Here, add_ms_24686_f002r is the manuscript name plus the page name. You only need the manuscript name.
4. Run bl.uk.py with the manuscript name:

$ python3 bl.uk.py add_ms_24686 --resolution 12

This will grab all available pages at resolution 12. If you want specific pages, set a page range with the --pages A:B argument.
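The manuscript short name can also be derived from either URL mentioned in the steps above. This is only a sketch: the function names are hypothetical, and it assumes the page name is always the last underscore-separated part of the viewer `ref` value, as in add_ms_24686_f002r.

```python
from urllib.parse import parse_qs, urlparse


def ref_from_url(url):
    """Return the lower-cased value of the `ref` query parameter."""
    query = parse_qs(urlparse(url).query)
    return query["ref"][0].lower()


def manuscript_from_description_url(url):
    """FullDisplay.aspx URLs carry the manuscript name directly."""
    return ref_from_url(url)


def manuscript_from_viewer_url(url):
    """Viewer.aspx URLs carry name + page; drop the trailing page part."""
    return ref_from_url(url).rsplit("_", 1)[0]
```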

At some point the Library started replying with HTTP 429 (Too Many Requests). Faking the user agent helped. If the default user agent does not work for you, you can replace it with the --user-agent option like this:

$ python3 bl.uk.py add_ms_24686 --user-agent 'Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36'
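If you are writing your own fetcher, the same idea can be sketched with the standard library: send a custom User-Agent header and back off when the server answers 429. This is an illustration and not how bl.uk.py is implemented. The function name `fetch`, the retry count and the delays are assumptions.

```python
import time
import urllib.error
import urllib.request


def fetch(url, user_agent, retries=3, delay=1.0,
          opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch `url` with a custom User-Agent, retrying on HTTP 429.

    `opener` and `sleep` are injectable so the logic can be tested offline.
    """
    for attempt in range(retries):
        request = urllib.request.Request(url, headers={"User-Agent": user_agent})
        try:
            with opener(request) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Give up on other errors, or when the retries are exhausted.
            if err.code != 429 or attempt == retries - 1:
                raise
            # Exponential backoff: delay, 2 * delay, 4 * delay, ...
            sleep(delay * 2 ** attempt)
```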


manuscript-dl's Issues

Some clarifications on the bl.uk.py script

I ran into a few issues that stumped me for a bit, but I eventually figured them out. I figured I'd share them here in case someone else runs into them, or I run into them again after I forget.

If a manuscript does not exist in a certain resolution (e.g. no resolution 14), the error output looks something like this:

python3 bl.uk.py or_12988 --resolution 14 --user-agent 'Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36' --pages 206:206
Downloading manuscript or_12988 resolution 14
335 pages found
1 pages downloading (range 206:206)
Downloading page or_12988_f103r (1/1)

End of the page
Page or_12988_f103r has size row x column = 0 x 0
Concatenating page or_12988_f103r (1/1)
montage-im6.q16: missing an image filename `pics/14/or_12988/or_12988_f103r/row_0.jpg' @ error/montage.c/MontageImageCommand/1804.
.montage-im6.q16: missing an image filename `pics/14/or_12988/or_12988_f103r.jpg' @ error/montage.c/MontageImageCommand/1804.

Converting manuscript or_12988 into PDF
Converting page or_12988_f103r (1/1)
convert-im6.q16: unable to open image `pics/14/or_12988/or_12988_f103r.jpg': No such file or directory @ error/blob.c/OpenBlob/2924.
convert-im6.q16: no images defined `pics/14/or_12988/or_12988_f103r.pdf' @ error/convert.c/ConvertImageCommand/3229.
Folding page or_12988_f103r.pdf (1/1)
Traceback (most recent call last):
  File "/home/mo/manuscript-dl/bl.uk.py", line 423, in <module>
    sys.exit(main(args))
  File "/home/mo/manuscript-dl/bl.uk.py", line 401, in main
    download_manuscript(args.pages,
  File "/home/mo/manuscript-dl/bl.uk.py", line 396, in download_manuscript
    convert_manuscript(resolution, base_dir, manuscript, pages)
  File "/home/mo/manuscript-dl/bl.uk.py", line 368, in convert_manuscript
    fold_pages(base_dir, manuscript, pages, output_name)
  File "/home/mo/manuscript-dl/bl.uk.py", line 345, in fold_pages
    shutil.copy2(pdf_name, output_name)
  File "/usr/lib/python3.10/shutil.py", line 434, in copy2
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/usr/lib/python3.10/shutil.py", line 254, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: 'pics/14/or_12988/or_12988_f103r.pdf'

The --pages argument requires integers, but the page labels on the manuscript site carry r and v suffixes, for example f103r and f103v. If you only want to download page f103r, the page number is 206 (103*2). For f103v it is 207 (103*2 + 1).
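The arithmetic above can be wrapped in a small helper. This is a sketch based only on the f103r/f103v example in this issue; the function names are hypothetical, and the mapping is an assumption that may not hold for manuscripts whose page list does not follow plain recto/verso foliation.

```python
import re


def folio_to_page(label):
    """Map a folio label such as 'f103r' to the integer used by --pages.

    Recto (r) maps to number * 2, verso (v) to number * 2 + 1.
    """
    match = re.fullmatch(r"f(\d+)([rv])", label.lower())
    if match is None:
        raise ValueError(f"unrecognised folio label: {label!r}")
    number = int(match.group(1))
    return number * 2 + (1 if match.group(2) == "v" else 0)


def pages_argument(first, last):
    """Build the A:B value for --pages from two folio labels."""
    return f"{folio_to_page(first)}:{folio_to_page(last)}"
```

For instance, `pages_argument("f103r", "f103r")` gives the `206:206` range used in the log above.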

The BL downloader gets stuck in a redirect loop

I believe the loop is caused by a missing session header. The script is repeatedly redirected to http://www.bl.uk/manuscripts/SetupViewerHandler.ashx?[ms identifier]. This page sets a cookie for ASP.NET_SessionId. I suspect that cookie is needed, but I didn't investigate further.

I was lazy and was able to bypass this by copying the headers from a session with the Chrome developer tools and adding them to the request, after which it worked fine.
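For anyone repeating that workaround, here is a minimal sketch of turning header lines copied from the browser's developer tools into a request. It does not show how bl.uk.py sends its requests; the function names are hypothetical, and the header text you pass in has to come from your own browser session.

```python
import urllib.request


def parse_header_lines(raw):
    """Parse 'Name: value' lines (as copied from dev tools) into a dict."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blanks and HTTP/2 pseudo-headers such as ':authority:'.
        if not line or line.startswith(":"):
            continue
        name, separator, value = line.partition(":")
        if not separator:
            continue
        headers[name.strip()] = value.strip()
    return headers


def request_with_headers(url, raw_headers):
    """Build a urllib request that carries the copied headers."""
    return urllib.request.Request(url, headers=parse_header_lines(raw_headers))
```

The returned request can be passed to `urllib.request.urlopen`. Whether the session cookie alone is enough to stop the redirect loop was not verified in this thread.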

Nice script. I had been pulling them via a shell script and using pnmtile to put them back together.

Error when downloading images from the British Library

Traceback (most recent call last):
  File "C:\Users\Menna\Downloads\bl.uk.py", line 423, in <module>
    sys.exit(main(args))
  File "C:\Users\Menna\Downloads\bl.uk.py", line 401, in main
    download_manuscript(args.pages,
  File "C:\Users\Menna\Downloads\bl.uk.py", line 392, in download_manuscript
    download_pages(resolution, base_dir, manuscript, pages)
  File "C:\Users\Menna\Downloads\bl.uk.py", line 358, in download_pages
    concatenate_page(base_dir, manuscript, page, columns, rows)
  File "C:\Users\Menna\Downloads\bl.uk.py", line 299, in concatenate_page
    call(cmd)
  File "C:\Users\Menna\AppData\Local\Programs\Python\Python39\lib\subprocess.py", line 349, in call
    with Popen(*popenargs, **kwargs) as p:
  File "C:\Users\Menna\AppData\Local\Programs\Python\Python39\lib\subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\Users\Menna\AppData\Local\Programs\Python\Python39\lib\subprocess.py", line 1420, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified