mbennett-uoe / whiiif
Simple IIIF Search service for OCRed texts
License: Other
I have a suspicion that it doesn't work properly, because we never had a test case that needed it
We should also have a bunch of tests for use when a Solr instance is running and can be used for testing the full workflow (Process XML -> Ingest -> Query).
These can be used both for integration testing new versions of the Solr plugin, to make sure they don't break anything, and for pre-production "does your deployment work?" testing by end-users.
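One way such Solr-dependent tests could be gated is sketched below. It uses only the standard library; the core URL, the use of the core's admin/ping handler, and the test class name are assumptions for illustration, not part of Whiiif.

```python
import unittest
import urllib.error
import urllib.request

# Assumed location of a test core; adjust to the deployment under test.
SOLR_CORE_URL = "http://localhost:8983/solr/whiiif"


def solr_available(core_url, timeout=2.0):
    """Return True if the core's ping handler answers with HTTP 200."""
    try:
        with urllib.request.urlopen(core_url + "/admin/ping", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error status or malformed URL.
        return False


class FullWorkflowTest(unittest.TestCase):
    """Hypothetical container for Process XML -> Ingest -> Query tests."""

    @classmethod
    def setUpClass(cls):
        # Skip the whole class when no Solr instance can be reached.
        if not solr_available(SOLR_CORE_URL):
            raise unittest.SkipTest("no Solr instance reachable")

    def test_process_ingest_query(self):
        # Placeholder only: the real steps depend on the project's utilities.
        self.skipTest("workflow steps not sketched here")
```

With this shape, the same test module can run in CI without Solr (everything is skipped) and against a live deployment as a smoke test.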
At the minute, there are a bunch of assumptions about the manifest:
- page_<x> or just <x> (where <x> is the index)
- page_<x> (tesseract default - maybe other software too)
If any of these are not true, Collection search won't work. Whatever solution to #16 we use, it should make this issue relatively easy to solve. Probably just need to change the SOLR response->Canvas lookup to be dict based instead of list based.
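A dict-based lookup of the kind suggested above might look like the following sketch. It assumes a IIIF Presentation 2.x manifest (canvases under sequences[0]) and that the OCR page IDs are available in the same order as the canvases; build_canvas_lookup and the sample IDs are hypothetical, not existing Whiiif code.

```python
def build_canvas_lookup(manifest, page_ids):
    """Map each OCR page ID to the @id of the canvas at the same position.

    Assumes the pages and canvases are listed in the same order.
    """
    canvases = manifest["sequences"][0]["canvases"]
    if len(canvases) != len(page_ids):
        raise ValueError("page count does not match canvas count")
    return {page_id: canvas["@id"] for page_id, canvas in zip(page_ids, canvases)}


# Canvas IDs here deliberately do not follow the page_<x> pattern.
manifest = {
    "sequences": [
        {
            "canvases": [
                {"@id": "https://example.org/iiif/book1/canvas/front-cover"},
                {"@id": "https://example.org/iiif/book1/canvas/f001r"},
            ]
        }
    ]
}
lookup = build_canvas_lookup(manifest, ["page_0", "page_1"])
print(lookup["page_1"])
```

Because the lookup is keyed on whatever ID comes back from Solr, it no longer matters how the manifest names its canvases.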
Currently, we generate the URLs for the IIIF Search API "on" property by assuming that manifests use a uniform way to reference canvases inside them:
Line 122 in a6e3bd8
However, this format is not required by the IIIF Presentation spec, so we should try and support other formats.
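As a sketch of the alternative, the "on" value can be built from whatever canvas URI the manifest actually declares, with the region appended as an xywh fragment. The function name and sample URI are hypothetical.

```python
def annotation_target(canvas_id, x, y, w, h):
    """Build an "on" value: the canvas URI plus an xywh region fragment."""
    return f"{canvas_id}#xywh={x},{y},{w},{h}"


# The canvas URI is taken verbatim from the manifest, not constructed.
target = annotation_target(
    "https://example.org/iiif/book1/canvas/f001r", 120, 340, 200, 48
)
print(target)
```

This keeps the URL format entirely in the manifest author's hands, which is what the Presentation spec allows.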
Currently we just return up to a limit for each search type, but it would be better to implement paging so all rows could be returned.
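A minimal sketch of the paging arithmetic, assuming 1-based page numbers on the Whiiif side mapped onto Solr's start and rows query parameters. The function names are hypothetical.

```python
import math


def solr_paging_params(page, page_size):
    """Translate a 1-based page number into Solr's start/rows parameters."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return {"start": (page - 1) * page_size, "rows": page_size}


def page_count(total_hits, page_size):
    """Number of pages needed to return every hit."""
    return math.ceil(total_hits / page_size)


print(solr_paging_params(3, 50))
print(page_count(101, 50))
```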
IIIF Search API:
- within block in the response
Collection Search:
- rows
Snippet search:
- q is missing from the querystring
Though the IIIF spec (and thus Whiiif's outputs) requires integer co-ordinates, ALTO can support more accurate, float-based co-ordinates (most likely when non-pixel measurements are used (see #13)) and we could maybe use these to make more accurate calculations.
Everything is currently built with the assumption that the co-ordinates in ALTO files are always in pixels, but the standard allows for others.
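For illustration, a conversion from ALTO's MeasurementUnit values (pixel, mm10, inch1200) to pixels could be sketched as follows. The resolution has to come from somewhere outside the co-ordinates themselves, so dpi is an assumed input here, and to_pixels is a hypothetical helper, not existing Whiiif code.

```python
def to_pixels(value, unit, dpi):
    """Convert an ALTO co-ordinate to (float) pixels at the given resolution."""
    if unit == "pixel":
        return float(value)
    if unit == "mm10":
        # Tenths of a millimetre: 254 units per inch.
        return value / 254.0 * dpi
    if unit == "inch1200":
        # 1/1200ths of an inch.
        return value / 1200.0 * dpi
    raise ValueError(f"unsupported MeasurementUnit: {unit}")


# Keep floats through the calculation and round only for the final output.
print(round(to_pixels(254, "mm10", 300)))
print(round(to_pixels(1200, "inch1200", 300)))
```

Deferring the rounding to the last step is what lets the float-based co-ordinates improve accuracy while the output still satisfies the integer requirement.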
Currently, one scale value is stored per manifest and used to transform all co-ordinate values.
However, if different canvases have different scale ratios between fullsize and the size indexed in the ALTO doc, these will be wrong.
For the majority of cases, there will probably be very little variance between scaling values for different canvases in the same manifest, and given the large size of most IIIF canvases, a difference of ~<10px may not even really be noticeable to an end user.
Possible solutions:
Potential tripup: Assuming that the Image resource on the Canvas shows the full size. Need to weigh up cost/benefit of scraping and parsing info.json for every image.
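A per-canvas scale could be sketched as below. It assumes the canvas width and the width recorded for the matching ALTO page are both known at ingest time; the function names are hypothetical, and whether the canvas width reflects the full-size image is exactly the tripup noted above.

```python
def canvas_scale(canvas_width, alto_page_width):
    """Ratio between the canvas size and the size indexed in the ALTO doc."""
    if alto_page_width <= 0:
        raise ValueError("ALTO page width must be positive")
    return canvas_width / alto_page_width


def scale_box(box, scale):
    """Scale an (x, y, w, h) box, rounding to integer co-ordinates last."""
    return tuple(round(value * scale) for value in box)


# One scale per canvas rather than one per manifest.
scale = canvas_scale(4000, 2000)
print(scale_box((10, 20, 30, 40), scale))
```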
- @id generated
- total value and also list lengths
- resources block returned
- hits block returned
- resources block returned
- hits block returned
- scale in Solr response
- ignored parameter populated correctly
- manifest_url correct
- total_results correct
- canvas_id returned
- region returned
- coords correct
- coords blocks correct
- scale in Solr response
- manifest_url correct
- total_results correct
- canvas_id returned
- region returned
- coords correct
- coords blocks correct
- scale in Solr response
- snips parameter working correctly
Hello,
First of all, thank you so much for sharing Whiiif.
I'm running into some trouble getting the software up and running- hopefully I am just missing something simple. I'm using Debian 10.4 in a single docker container for testing, Solr 8.6.3, and I'm running Whiiif in development mode.
I created a Solr core called "whiiif" and I was able to successfully add a document to Solr using the index_with_plugin.py script in utils. However, when I try to run a simple search (e.g. http://localhost:5000/search/b29b1tv7xn7g?q=Chicago) Whiiif doesn't return any hits, although that string does occur in the OCR.
Flask outputs the following errors when I try that search:
[2020-10-15 20:29:38,250] INFO in views: Processing IIIF Search request for document b29b1tv7xn7g
[2020-10-15 20:29:38,251] DEBUG in views: Request original q: Chicago
[2020-10-15 20:29:38,251] DEBUG in views: Request bleached q: Chicago
[2020-10-15 20:29:38,251] DEBUG in views: Regexed manifest ID: b29b1tv7xn7g
[2020-10-15 20:29:38,251] INFO in views: Built query: http://localhost:8983/solr/whiiif/select?hl=on&hl.ocr.absoluteHighlights=true&hl.weightMatches=true&hl.ocr.limitBlock=page&hl.ocr.contextSize=1&hl.ocr.contextBlock=word&df=ocr_text&hl.ocr.fl=ocr_text&hl.snippets=4096&fq=id:b29b1tv7xn7g&q=Chicago
[2020-10-15 20:29:38,257] DEBUG in views: Solr request response code: 500
[2020-10-15 20:29:38,257] ERROR in views: Error occurred with SOLR query: <class 'json.decoder.JSONDecodeError'>
[2020-10-15 20:29:38,258] ERROR in views: Error message: Expecting value: line 1 column 1 (char 0)
172.17.0.1 - - [15/Oct/2020 20:29:38] "GET /search/b29b1tv7xn7g?q=Chicago HTTP/1.1" 200 -
Then, if I try to run the built query on its own, I get the following error from Solr, followed by a stack trace:
HTTP ERROR 500 java.lang.NoSuchMethodError: 'org.apache.lucene.util.automaton.CharacterRunAutomaton[] de.digitalcollections.solrocr.lucene.OcrHighlighter.getAutomata(java.lang.String, org.apache.lucene.search.Query, java.util.Set)'
I dropped the solr ocr highlighter .jar file into a "server/solr/whiiif/lib" directory- but to me the error above seems to say that Solr isn't able to find the OCR highlighter at runtime. Can you tell what's going on?
Please let me know if I can provide any additional information- and thank you so much for any help or tips you can provide!