The whiiif from mbennett-uoe

Add more logging

Search requests
Solr responses
Errors accessing resource documents

Add scaling to Collection search

I suspect that it doesn't work properly, because we never had a test case that needed it.

Integration tests / Deployment tests

We should also have a bunch of tests for use when a Solr instance is running and can be used for testing the full workflow (Process XML -> Ingest -> Query).

These can be used both for integration testing new versions of the Solr plugin, to make sure they don't break anything, and for pre-production "does your deployment work?" testing by end users.

Make Collection endpoint more flexible

At the minute, there are a bunch of assumptions about the manifest:

Canvases are called either page_<x> or just <x> (where <x> is the index)
Pages in the ALTO file are called page_<x> (Tesseract default - maybe other software too)
Therefore we can link ALTO page to the Canvas by reading the manifest, making a list of Canvases and using index referencing
Canvas URIs have a static relationship to the Manifest URL - see #16

If any of these are not true, Collection search won't work. Whatever solution to #16 we use, it should make this issue relatively easy to solve. We probably just need to change the Solr response -> Canvas lookup to be dict-based instead of list-based.
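A minimal sketch of what a dict-based lookup could look like. It assumes an IIIF Presentation 2.x manifest (canvases under `sequences[0].canvases`, each with an `@id`) and keys each canvas by the last path segment of its URI. The function names `build_canvas_lookup` and `canvas_for_page` are hypothetical and do not exist in Whiiif.

```python
def build_canvas_lookup(manifest):
    """Map the last path segment of each canvas URI to the full canvas URI.

    Assumes a Presentation 2.x manifest layout (an assumption, not
    something the current code is known to guarantee).
    """
    lookup = {}
    for canvas in manifest["sequences"][0]["canvases"]:
        uri = canvas["@id"]
        key = uri.rstrip("/").rsplit("/", 1)[-1]
        lookup[key] = uri
    return lookup


def canvas_for_page(lookup, alto_page_id):
    """Resolve an ALTO page ID to a canvas URI, or None if there is no match.

    Tries the ID as-is first, then with a leading "page_" stripped, which
    covers the two naming schemes listed above.
    """
    if alto_page_id in lookup:
        return lookup[alto_page_id]
    if alto_page_id.startswith("page_"):
        return lookup.get(alto_page_id[len("page_"):])
    return None
```

Because the lookup is keyed by identifier rather than by position, it no longer matters what order the canvases appear in or whether any are missing from the manifest.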

Work with canvases not at <manifest_url>/canvas/<canvas_id>

Currently, we generate the URLs for the `on` property of IIIF Search API responses by assuming that all manifests reference their canvases in the same uniform way:

whiiif/whiiif/views.py, line 122 in a6e3bd8:

    "on": "{}/canvas/{}#xywh={}".format(result_part["manifest_url"], result_part["canvas_id"],

However, this format is not required by the IIIF Presentation spec, so we should try to support other formats.
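One possible direction, sketched under assumptions: if the real canvas URIs were stored at ingest time (here as a hypothetical `canvas_uris` dict keyed by canvas ID), the `on` value could be built from them, with the current `<manifest_url>/canvas/<canvas_id>` pattern kept only as a fallback. The helper names are illustrative; only the `manifest_url` and `canvas_id` keys come from the existing code.

```python
def resolve_canvas_uri(result_part, canvas_uris=None):
    """Return the canvas URI for a search result.

    canvas_uris is a hypothetical mapping of canvas ID -> full canvas URI
    recorded from the manifest. If it has no entry, fall back to the
    current assumption about the URI layout.
    """
    canvas_id = result_part["canvas_id"]
    if canvas_uris and canvas_id in canvas_uris:
        return canvas_uris[canvas_id]
    return "{}/canvas/{}".format(result_part["manifest_url"], canvas_id)


def build_on_uri(canvas_uri, x, y, w, h):
    """Append an xywh media fragment to a canvas URI."""
    return "{}#xywh={},{},{},{}".format(canvas_uri, x, y, w, h)
```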

Update code to use v0.3 of the Solr plugin

Investigate breaking changes
Refactor code
Update solrconfig dir

Add paging to all search types

Currently we just return up to a limit for each search type, but it would be better to implement paging so all rows could be returned.

IIIF Search API:

Add config settings
Add paging to Solr queries using highlights - Investigate if we can page these or if we have to return all of them and do the paging ourselves
Add logic to calculate page counts
Add first and last page info to the within block in the response
Add next & prev to the main response block

Collection Search:

Add config settings
Add paging to Solr queries using rows
Add logic to calculate page counts
Refactor JSON response to include count and page info (BREAKING CHANGE)

Snippet search:

Add config settings
Add paging to Solr queries using highlights - Investigate if we can page these or if we have to return all of them and do the paging ourselves
Add logic to calculate page counts
Add count, next & prev to the main response block
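The page-count logic shared by all three search types could look roughly like the sketch below. It assumes zero-based page numbers and that the total hit count is already known; `page_info` and the keys it returns are hypothetical names. The `start` and `rows` values are meant to map onto Solr's standard `start`/`rows` parameters, which only helps for row-based paging; whether highlights can be paged the same way is still the open question noted above.

```python
import math


def page_info(total, page_size, page):
    """Compute paging metadata for a result set.

    total:     total number of hits
    page_size: hits per page (must be >= 1)
    page:      requested zero-based page number, clamped into range
    """
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    # Index of the last page; an empty result set still has page 0.
    last = max(math.ceil(total / page_size) - 1, 0)
    page = min(max(page, 0), last)
    return {
        "total": total,
        "page": page,
        "first": 0,
        "last": last,
        "prev": page - 1 if page > 0 else None,
        "next": page + 1 if page < last else None,
        "start": page * page_size,  # offset of the first hit on this page
        "rows": page_size,
    }
```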

Fix error handling

Don't fall over if q is missing from the querystring
Fail gracefully if Solr doesn't respond
Fail gracefully if a resource document can't be found
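A rough sketch of the first two checks, using only the standard library. `SearchError`, `require_query` and `parse_solr_response` are hypothetical names, and the choice of 400 and 502 as status codes is an assumption rather than anything Whiiif currently does. The idea is to check the Solr status code before attempting to decode the body, so a non-JSON error page does not surface as a bare JSON decoding error.

```python
import json


class SearchError(Exception):
    """Carries an HTTP status and message to send back to the client."""

    def __init__(self, status, message):
        super().__init__(message)
        self.status = status
        self.message = message


def require_query(args):
    """Return the q parameter, or raise a 400 if it is missing or blank."""
    q = (args.get("q") or "").strip()
    if not q:
        raise SearchError(400, "Missing required parameter: q")
    return q


def parse_solr_response(status_code, body):
    """Decode a Solr response body, failing cleanly on errors."""
    if status_code != 200:
        raise SearchError(502, "Solr returned HTTP {}".format(status_code))
    try:
        return json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        raise SearchError(502, "Solr returned a non-JSON response")
```

A view could catch `SearchError` in one place and turn it into a response with the matching status code instead of returning 200 with an empty result.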

Add support for non-integer co-ordinates in ALTO

Though the IIIF spec (and thus Whiiif's output) requires integer co-ordinates, ALTO can support more precise, float-based co-ordinates, most likely when non-pixel measurements are used (see #13). We could perhaps use these to make more accurate calculations.
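As a sketch of one way to do this: keep the float values from ALTO's HPOS, VPOS, WIDTH and HEIGHT attributes through any scaling, and only convert to integers at the very end. Rounding the top-left corner down and the bottom-right corner up is a design choice made here (an assumption, not existing behaviour) so that the integer box always covers the original region. `to_int_box` is a hypothetical helper.

```python
import math


def to_int_box(hpos, vpos, width, height, scale=1.0):
    """Convert a float ALTO box to an integer (x, y, w, h) box.

    Scaling is applied while the values are still floats; the result is
    the smallest integer box that fully contains the scaled region.
    """
    left = math.floor(hpos * scale)
    top = math.floor(vpos * scale)
    right = math.ceil((hpos + width) * scale)
    bottom = math.ceil((vpos + height) * scale)
    return left, top, right - left, bottom - top
```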

Investigate non-pixel measurements in ALTO

Everything is currently built with the assumption that the co-ordinates in ALTO files are always in pixels, but the standard allows for others.
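For reference, the ALTO schema's MeasurementUnit element allows `pixel`, `mm10` (tenths of a millimetre) and `inch1200` (1/1200 of an inch). Converting the latter two to pixels needs the scan resolution, so the sketch below takes a `dpi` argument; where that value would come from is an open question, and `alto_to_pixels` is a hypothetical helper.

```python
def alto_to_pixels(value, unit, dpi=None):
    """Convert an ALTO measurement to pixels.

    unit is the ALTO MeasurementUnit value. dpi (dots per inch) is
    required for the non-pixel units and is an assumed input here.
    """
    if unit == "pixel":
        return float(value)
    if dpi is None:
        raise ValueError("dpi is required to convert '{}'".format(unit))
    if unit == "inch1200":
        return value * dpi / 1200.0
    if unit == "mm10":
        # 1 inch = 25.4 mm = 254 tenths of a millimetre
        return value * dpi / 254.0
    raise ValueError("Unknown MeasurementUnit: {}".format(unit))
```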

Get rid of hardcoded URLs and make the Flask `url_for` call work properly from behind a proxy

Recheck IIIF Search compliance

Find a nicer solution to the Scaling problem

Currently, one scale value is stored per manifest and used to transform all co-ordinate values.

However, if different canvases have different scale ratios between the full-size image and the size indexed in the ALTO doc, the transformed co-ordinates will be wrong.

In the majority of cases, there will probably be very little variance between scaling values for different canvases in the same manifest, and given the large size of most IIIF canvases, a difference of less than ~10px may not even be noticeable to an end user.

Possible solutions:

Calculate and store scale for each canvas - see #16 point 2
Calculate scale for each canvas and modify ALTO file with scaled values

Potential trip-up: this assumes that the Image resource on the Canvas shows the full size. We need to weigh up the cost/benefit of scraping and parsing info.json for every image.
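The first option could be sketched as below. It assumes the scale for a canvas is the canvas width divided by the page width recorded in the ALTO file, and that both widths are available keyed by the same canvas ID; `canvas_scales` and `scale_box` are hypothetical names. It does not address the trip-up above, since it takes the canvas width as given.

```python
def canvas_scales(canvas_widths, alto_widths):
    """Compute one scale factor per canvas.

    canvas_widths: canvas ID -> canvas width in pixels
    alto_widths:   canvas ID -> page width as recorded in the ALTO file
    Canvases with no usable ALTO width are left out of the result.
    """
    scales = {}
    for canvas_id, canvas_width in canvas_widths.items():
        alto_width = alto_widths.get(canvas_id)
        if alto_width:
            scales[canvas_id] = canvas_width / alto_width
    return scales


def scale_box(box, scale):
    """Scale an (x, y, w, h) box and round each value to an integer."""
    return tuple(int(round(v * scale)) for v in box)
```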

Documentation!

Tidy the Solr stuff

Split into "here's what you need to add" and "here's a full config you can drag/drop"
Test how using a separate endpoint works and consider implementing

Allow as much as possible to be modified from config files

Search parameters
Field names
~~Solr endpoint~~ **Moved to be part of #10**

Unit tests

java NoSuchMethodError

Hello,

First of all, thank you so much for sharing Whiiif.

I'm running into some trouble getting the software up and running. Hopefully I am just missing something simple. I'm using Debian 10.4 in a single Docker container for testing, Solr 8.6.3, and I'm running Whiiif in development mode.

I created a Solr core called "whiiif" and I was able to successfully add a document to Solr using the index_with_plugin.py script in utils. However, when I try to run a simple search (e.g. http://localhost:5000/search/b29b1tv7xn7g?q=Chicago) Whiiif doesn't return any hits, although that string does occur in the OCR.

Flask outputs the following errors when I try that search:

[2020-10-15 20:29:38,250] INFO in views: Processing IIIF Search request for document b29b1tv7xn7g
[2020-10-15 20:29:38,251] DEBUG in views: Request original q: Chicago
[2020-10-15 20:29:38,251] DEBUG in views: Request bleached q: Chicago
[2020-10-15 20:29:38,251] DEBUG in views: Regexed manifest ID: b29b1tv7xn7g
[2020-10-15 20:29:38,251] INFO in views: Built query: http://localhost:8983/solr/whiiif/select?hl=on&hl.ocr.absoluteHighlights=true&hl.weightMatches=true&hl.ocr.limitBlock=page&hl.ocr.contextSize=1&hl.ocr.contextBlock=word&df=ocr_text&hl.ocr.fl=ocr_text&hl.snippets=4096&fq=id:b29b1tv7xn7g&q=Chicago
[2020-10-15 20:29:38,257] DEBUG in views: Solr request response code: 500
[2020-10-15 20:29:38,257] ERROR in views: Error occurred with SOLR query: <class 'json.decoder.JSONDecodeError'>
[2020-10-15 20:29:38,258] ERROR in views: Error message: Expecting value: line 1 column 1 (char 0)
172.17.0.1 - - [15/Oct/2020 20:29:38] "GET /search/b29b1tv7xn7g?q=Chicago HTTP/1.1" 200 -

Then, if I try to run the built query on its own, I get the following error from Solr, followed by a stack trace:

HTTP ERROR 500 java.lang.NoSuchMethodError: 'org.apache.lucene.util.automaton.CharacterRunAutomaton[] de.digitalcollections.solrocr.lucene.OcrHighlighter.getAutomata(java.lang.String, org.apache.lucene.search.Query, java.util.Set)'

I dropped the Solr OCR highlighter .jar file into a "server/solr/whiiif/lib" directory, but to me the error above seems to say that Solr isn't able to find the OCR highlighter at runtime. Can you tell what's going on?

Please let me know if I can provide any additional information, and thank you so much for any help or tips you can provide!
