whiiif's People

Contributors

mbennett-uoe

Forkers

ttwz-dz

whiiif's Issues

Add more logging

  • Search requests
  • Solr responses
  • Errors accessing resource documents
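
Whiiif already logs some of this from the views (see the log excerpt in the NoSuchMethodError issue below), so this is mostly about extending the existing pattern. A minimal sketch of the shape it could take, using Flask's standard logger; the URL, route and field names here are placeholders rather than Whiiif's actual ones:

    import logging

    import requests
    from flask import Flask, request

    app = Flask(__name__)
    app.logger.setLevel(logging.DEBUG)

    SOLR_URL = "http://localhost:8983/solr/whiiif/select"  # placeholder

    @app.route("/search/<doc_id>")
    def search(doc_id):
        # Log the incoming search request (target document plus querystring)
        app.logger.info("Search request for %s with q=%s", doc_id, request.args.get("q"))

        resp = requests.get(SOLR_URL, params={"q": request.args.get("q", "")})
        # Log the raw Solr response at DEBUG so it only shows up in development
        app.logger.debug("Solr responded %s: %s", resp.status_code, resp.text[:500])

        try:
            with open("resources/{}.json".format(doc_id)) as fh:
                body = fh.read()
        except OSError as err:
            # Log errors accessing resource documents instead of falling over
            app.logger.error("Could not read resource document for %s: %s", doc_id, err)
            body = "{}"
        return body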

Integration tests / Deployment tests

We should also have a bunch of tests for use when a Solr instance is running and can be used for testing the full workflow (Process XML -> Ingest -> Query).

These can be used both for integration testing new versions of the Solr plugin to make sure they don't break stuff and also for pre-production "does your deployment work?" testing for end-users
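
A sketch of what one such end-to-end test could look like with pytest; the fixture, helper and fixture-file names are assumptions, not the real module layout:

    import pytest
    import requests

    SOLR_URL = "http://localhost:8983/solr/whiiif"  # assumed test core


    def solr_is_up():
        """True when the test Solr core answers a ping request."""
        try:
            return requests.get(SOLR_URL + "/admin/ping", timeout=2).ok
        except requests.RequestException:
            return False


    @pytest.mark.skipif(not solr_is_up(), reason="needs a running Solr instance")
    def test_process_ingest_query(client):
        # 1. Process an ALTO XML fixture and ingest it (hypothetical helper name)
        from utils import index_with_plugin
        index_with_plugin.index_document("tests/fixtures/example_alto.xml")

        # 2. Query it back through the IIIF Search endpoint via the Flask test client
        resp = client.get("/search/example?q=word-known-to-be-in-the-fixture")
        assert resp.status_code == 200
        assert resp.get_json()["within"]["total"] > 0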

Make Collection endpoint more flexible

At the minute, there are a bunch of assumptions about the manifest:

  • Canvases are called either page_<x> or just <x> (where <x> is the index)
  • Pages in the ALTO file are called page_<x> (tesseract default - maybe other software too)
  • Therefore we can link each ALTO page to its Canvas by reading the manifest, making a list of Canvases and referencing by index
  • Canvas URIs have a static relationship to the Manifest URL - see #16

If any of these are not true, Collection search won't work. Whatever solution we adopt for #16 should make this relatively easy to fix: probably we just need to change the Solr response -> Canvas lookup to be dict-based instead of list-based, as in the sketch below.
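
A minimal sketch of that dict-based lookup, keyed on the canvas @id segments actually present in the manifest rather than on list position (the function and variable names are illustrative):

    def build_canvas_lookup(manifest):
        """Map the final @id path segment of each canvas to its canvas dict.

        Assumes IIIF Presentation 2.x structure (sequences -> canvases); the key
        just needs to be something we can also derive from the ALTO page ID.
        """
        lookup = {}
        for canvas in manifest["sequences"][0]["canvases"]:
            key = canvas["@id"].rstrip("/").rsplit("/", 1)[-1]
            lookup[key] = canvas
        return lookup

    # e.g. canvas = build_canvas_lookup(manifest).get("page_3")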

Work with canvases not at <manifest_url>/canvas/<canvas_id>

Currently, we generate the URLs for the on property of IIIF Search API results by assuming that manifests use a uniform scheme for referencing the canvases inside them:

"on": "{}/canvas/{}#xywh={}".format(result_part["manifest_url"], result_part["canvas_id"],

However, this format is not required by the IIIF Presentation spec, so we should try and support other formats.
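
One option (a sketch only, not the current implementation) is to record each canvas's real @id at index time and build the on value from that, instead of deriving it from the manifest URL:

    def annotation_target(canvas_uri, x, y, w, h):
        """Build the 'on' value for a search hit from the canvas @id stored at
        index time (a hypothetical Solr field), rather than assuming the
        <manifest_url>/canvas/<canvas_id> pattern."""
        return "{}#xywh={},{},{},{}".format(canvas_uri, x, y, w, h)

    # e.g. annotation_target(result_part["canvas_uri"], 100, 200, 350, 40)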

Add paging to all search types

Currently we just return results up to a fixed limit for each search type, but it would be better to implement paging so that all rows can be returned (see the sketch after the lists below).

IIIF Search API:

  • Add config settings
  • Add paging to Solr queries using highlights - Investigate if we can page these or if we have to return all of them and do the paging ourselves
  • Add logic to calculate page counts
  • Add first and last page info to the within block in the response
  • Add next & prev to the main response block

Collection Search:

  • Add config settings
  • Add paging to Solr queries using rows
  • Add logic to calculate page counts
  • Refactor JSON response to include count and page info BREAKING CHANGE

Snippet search:

  • Add config settings
  • Add paging to Solr queries using highlights - Investigate if we can page these or if we have to return all of them and do the paging ourselves
  • Add logic to calculate page counts
  • Add count, next & prev to the main response block
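
For the IIIF Search API case, a sketch of how the calculated paging info could be attached to the response; the page parameter name and URLs are illustrative, but the next/prev/startIndex and within first/last properties are the ones the Content Search spec defines:

    def add_paging(response, query_url, page, page_count, page_size):
        """Attach paging properties to an annotation list response (sketch only)."""
        def page_url(n):
            return "{}&page={}".format(query_url, n)

        # First and last page info goes in the within block
        response["within"]["first"] = page_url(1)
        response["within"]["last"] = page_url(page_count)

        # next, prev and startIndex sit on the main response block
        response["startIndex"] = (page - 1) * page_size
        if page < page_count:
            response["next"] = page_url(page + 1)
        if page > 1:
            response["prev"] = page_url(page - 1)
        return response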

Fix error handling

  • Don't fall over if q is missing from the querystring
  • Fail gracefully if Solr doesn't respond
  • Fail gracefully if a resource document can't be found
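
A sketch of the shape these guards could take around the Solr call (simplified; the real views build the query differently):

    import requests
    from flask import jsonify, request

    def guarded_search(solr_url, solr_params):
        # Don't fall over if q is missing from the querystring
        q = request.args.get("q")
        if not q:
            return jsonify({"error": "Missing required parameter: q"}), 400

        # Fail gracefully if Solr doesn't respond, or responds with a non-JSON error page
        try:
            resp = requests.get(solr_url, params=dict(solr_params, q=q), timeout=10)
            results = resp.json()
        except (requests.RequestException, ValueError) as err:
            return jsonify({"error": "Search backend unavailable", "detail": str(err)}), 503

        # Missing resource documents would get the same treatment: catch the
        # failure, log it, and return an empty result set instead of a 500.
        return jsonify(results)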

Add support for non-integer co-ordinates in ALTO

Though the IIIF spec (and thus Whiiif's output) requires integer co-ordinates, ALTO supports more precise, float-based co-ordinates (most likely when non-pixel measurement units are used; see #13), and we could use these to make more accurate calculations.
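
Even if the output stays integer, parsing the ALTO values as floats and only rounding after scaling would keep that extra precision through the calculation. A tiny sketch:

    def to_iiif_coord(alto_value, scale=1.0):
        """Parse an ALTO co-ordinate (possibly a float), apply the scale factor,
        and round to the integer the IIIF spec requires."""
        return int(round(float(alto_value) * scale))

    # e.g. to_iiif_coord("123.75", scale=2.0) -> 248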

Find a nicer solution to the Scaling problem

Currently, one scale value is stored per manifest and used to transform all co-ordinate values.

However, if different canvases have different scale ratios between the fullsize image and the size indexed in the ALTO doc, the transformed co-ordinates for those canvases will be wrong.

For the majority of cases there will probably be very little variance between scaling values for different canvases in the same manifest, and given the large size of most IIIF canvases, a difference of under ~10px may not even be noticeable to an end user.

Possible solutions:

  • Calculate and store scale for each canvas - see #16 point 2
  • Calculate scale for each canvas and modify ALTO file with scaled values

Potential trip-up: this assumes that the Image resource on the Canvas is displayed at full size. Need to weigh up the cost/benefit of fetching and parsing info.json for every image.
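
A sketch of the per-canvas calculation, under the assumption flagged above that the canvas width in the manifest represents the full size (so no info.json fetch is needed):

    def canvas_scale(canvas, alto_page_width):
        """Ratio between a canvas's width (from the manifest) and the width
        recorded on the ALTO Page element for that canvas. Sketch only."""
        return float(canvas["width"]) / float(alto_page_width)

    # Stored per canvas (e.g. alongside the OCR in Solr) and applied when
    # converting that canvas's ALTO co-ordinates.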

Tidy the Solr stuff

  • Split into "here's what you need to add" and "here's a full config you can drag/drop"
  • Test how using a separate endpoint works and consider implementing it

Unit tests

  • Minimal set without Solr
    • Does Flask start
    • Do dirs referenced in config exist - This is really an integration test (see #18)
    • Can test files be read / parsed - Now part of first Solr set test
  • Set with Solr
    • Index a document - This is really an integration test (see #18)
    • IIIF search endpoint
      • Correct Solr query generated
      • Response well formatted (context block)
      • Correct @id generated
      • Correct number of results returned - in the total value and also list lengths
      • Single part correct resources block returned
      • Single part correct hits block returned
      • Multi-part response correct resources block returned
      • Multi-part response correct hits block returned
      • Scaling if scale in Solr response
      • ignored parameter populated correctly
      • "Graceful" failure if anything goes wrong with Solr query
    • Collection search endpoint
      • Correct Solr query generated
      • Correct number of results returned
      • Result manifest_url correct
      • Result total_results correct
      • Correct canvas_id returned
      • Correct region returned
      • Single highlight coords correct
      • Multi-highlight coords blocks correct
      • Snippet scaling correct (downsizing all results by fixed ratio)
      • Scaling if scale in Solr response - Need to fix #12 first!!
      • "Graceful" failure if anything goes wrong with Solr query
    • Snippet search endpoint
      • Correct Solr query generated
      • Correct number of results returned
      • Result manifest_url correct
      • Result total_results correct
      • Correct canvas_id returned
      • Correct region returned
      • Single highlight coords correct
      • Multi-highlight coords blocks correct
      • Scaling if scale in Solr response
      • "Graceful" failure if anything goes wrong with Solr query
      • snips parameter working correctly
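
As a starting point for the minimal no-Solr set, a sketch; the import assumes the Flask app is exposed as whiiif.app, which may need adjusting to the real package layout:

    import pytest


    @pytest.fixture
    def client():
        # Adjust this import if the app object / factory lives elsewhere
        from whiiif import app
        app.config["TESTING"] = True
        return app.test_client()


    def test_flask_starts(client):
        # "Does Flask start": any well-formed response (even a 404) proves the app boots
        resp = client.get("/")
        assert resp.status_code in (200, 404)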

java NoSuchMethodError

Hello,

First of all, thank you so much for sharing Whiiif.

I'm running into some trouble getting the software up and running; hopefully I am just missing something simple. I'm using Debian 10.4 in a single Docker container for testing, Solr 8.6.3, and I'm running Whiiif in development mode.

I created a Solr core called "whiiif" and I was able to successfully add a document to Solr using the index_with_plugin.py script in utils. However, when I try to run a simple search (e.g. http://localhost:5000/search/b29b1tv7xn7g?q=Chicago) Whiiif doesn't return any hits, although that string does occur in the OCR.

Flask outputs the following errors when I try that search:

[2020-10-15 20:29:38,250] INFO in views: Processing IIIF Search request for document b29b1tv7xn7g
[2020-10-15 20:29:38,251] DEBUG in views: Request original q: Chicago
[2020-10-15 20:29:38,251] DEBUG in views: Request bleached q: Chicago
[2020-10-15 20:29:38,251] DEBUG in views: Regexed manifest ID: b29b1tv7xn7g
[2020-10-15 20:29:38,251] INFO in views: Built query: http://localhost:8983/solr/whiiif/select?hl=on&hl.ocr.absoluteHighlights=true&hl.weightMatches=true&hl.ocr.limitBlock=page&hl.ocr.contextSize=1&hl.ocr.contextBlock=word&df=ocr_text&hl.ocr.fl=ocr_text&hl.snippets=4096&fq=id:b29b1tv7xn7g&q=Chicago
[2020-10-15 20:29:38,257] DEBUG in views: Solr request response code: 500
[2020-10-15 20:29:38,257] ERROR in views: Error occurred with SOLR query: <class 'json.decoder.JSONDecodeError'>
[2020-10-15 20:29:38,258] ERROR in views: Error message: Expecting value: line 1 column 1 (char 0)
172.17.0.1 - - [15/Oct/2020 20:29:38] "GET /search/b29b1tv7xn7g?q=Chicago HTTP/1.1" 200 -

Then, if I try to run the built query on its own, I get the following error from Solr, followed by a stack trace:

HTTP ERROR 500 java.lang.NoSuchMethodError: 'org.apache.lucene.util.automaton.CharacterRunAutomaton[] de.digitalcollections.solrocr.lucene.OcrHighlighter.getAutomata(java.lang.String, org.apache.lucene.search.Query, java.util.Set)'

I dropped the Solr OCR highlighter .jar file into a "server/solr/whiiif/lib" directory, but to me the error above seems to say that Solr isn't able to find the OCR highlighter at runtime. Can you tell what's going on?

Please let me know if I can provide any additional information, and thank you so much for any help or tips you can provide!
