go-toolkit's People

Contributors

banux, chocolatkey, dependabot[bot], jccr, mickael-menu

go-toolkit's Issues

LCP JSON mapping incomplete

https://github.com/readium/r2-streamer-go/blob/master/parser/epub/lcp.go#L33

	Rights struct {
		Print int        `json:"print"`
		Copy  int        `json:"copy"`
		Start *time.Time `json:"start"`
		End   *time.Time `json:"end"`
	}
	User struct {
		ID        string   `json:"id"`
		Email     string   `json:"email"`
		Name      string   `json:"name"`
		Encrypted []string `json:"encrypted"`
	}
	Signature struct {
		Algorithm   string `json:"algorithm"`
		Certificate string `json:"certificate"`
		Value       string `json:"value"`
	}

Missing: 'json:"rights"' 'json:"user"' and 'json:"signature"'

(pub.Publication).Get when href contains anchors

Calling Get when a manifest.Link Href contains anchors returns resource: error 404: file does not exist

an example Manifest.TableOfContents

     manifest.Link{
        Href:       "/OEBPS/Text/appendice1.xhtml",
        Type:       "",
        Templated:  false,
        Title:      "APPENDICE A ANNALI DEI RE E DEI GOVERNATORI",
        Rels:       manifest.Strings{},
        Properties: manifest.Properties{},
        Height:     0x0,
        Width:      0x0,
        Bitrate:    0.000000,
        Duration:   0.000000,
        Languages:  manifest.Strings{},
        Alternates: manifest.LinkList{},
        Children:   manifest.LinkList{
          manifest.Link{
            Href:       "/OEBPS/Text/appendice1.xhtml#sec1",
            Type:       "",
            Templated:  false,
            Title:      "I. I re Númenóreani",
            Rels:       manifest.Strings{},
            Properties: manifest.Properties{},
            Height:     0x0,
            Width:      0x0,
            Bitrate:    0.000000,
            Duration:   0.000000,
            Languages:  manifest.Strings{},
            Alternates: manifest.LinkList{},
            Children:   manifest.LinkList{},
          },
          manifest.Link{
            Href:       "/OEBPS/Text/appendice1.xhtml#sec2",
            Type:       "",
            Templated:  false,
            Title:      "II. La casa di Eorl",
            Rels:       manifest.Strings{},
            Properties: manifest.Properties{},
            Height:     0x0,
            Width:      0x0,
            Bitrate:    0.000000,
            Duration:   0.000000,
            Languages:  manifest.Strings{},
            Alternates: manifest.LinkList{},
            Children:   manifest.LinkList{},
          },

A check on anchors could be added to Get?
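One possible shape for such a check, as a sketch only: strip the fragment before looking the resource up. `resourcePath` is a hypothetical helper, not part of the toolkit's API. It relies on `net/url`, so it also drops any query string and percent-decodes the path.

```go
package main

import (
	"fmt"
	"net/url"
)

// resourcePath returns the path component of href, without the
// fragment ("#sec1") or query string. Hypothetical helper: Get could
// call something like this before resolving the resource.
func resourcePath(href string) (string, error) {
	u, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return u.Path, nil
}

func main() {
	p, err := resourcePath("/OEBPS/Text/appendice1.xhtml#sec1")
	if err != nil {
		panic(err)
	}
	fmt.Println(p) // /OEBPS/Text/appendice1.xhtml
}
```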

Incorrect TOC parsing, ignores empty hrefs with title, whitespaces in string titles

"Children's Literature"

Go streamer:
https://readium2.feedbooks.net/Ym9va3MvY2hpbGRyZW5zLWxpdGVyYXR1cmUuZXB1Yg==/manifest.json

NodeJS streamer:
https://readium2.herokuapp.com/pub/L2FwcC9taXNjL2VwdWJzL2NoaWxkcmVucy1saXRlcmF0dXJlLmVwdWI%3D/manifest.json

Using JSON comparison tools (links below) to help pin-point discrepancies in the "webpub manifest" streamer output, I found the following errors (ignoring issues related to "base URL" used to explicitly resolve absolute links, and to the lack of consistent normalization for union-type values such as string vs. array-of-strings such as @context and rel):

  • The Go Navigation Document parser discards tree nodes with empty href even when there is a valid title, resulting in missing data in the generated toc JSON: "Abram S. Isaacs", "Samuel Taylor Coleridge", "Hans Christian Andersen", "Frances Browne", "Oscar Wilde", "Raymond MacDonald Alden", "Jean Ingelow", "Frank R. Stockton", "John Ruskin".
  • Some TOC titles contain whitespace characters, for example: \n\t\t\t\t\t\t\t\t\t\t\t\tI. The Rabbi and the Diadem\n\t\t\t\t\t\t\t\t\t\t\t vs. I. The Rabbi and the Diadem.
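Both points can be sketched in a few lines of Go. `TOCNode` is a hypothetical, simplified type used for illustration only; the real parser works on its own link structures.

```go
package main

import (
	"fmt"
	"strings"
)

// TOCNode is a hypothetical, simplified navigation entry.
type TOCNode struct {
	Title    string
	Href     string
	Children []TOCNode
}

// normalizeTitle trims the title and collapses every run of
// whitespace (spaces, tabs, newlines) into a single space.
func normalizeTitle(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

// clean normalizes titles recursively. A node with an empty href is
// kept as long as it has a title or children; only fully empty nodes
// are dropped.
func clean(nodes []TOCNode) []TOCNode {
	var out []TOCNode
	for _, n := range nodes {
		n.Title = normalizeTitle(n.Title)
		n.Children = clean(n.Children)
		if n.Title == "" && n.Href == "" && len(n.Children) == 0 {
			continue
		}
		out = append(out, n)
	}
	return out
}

func main() {
	toc := clean([]TOCNode{{
		Title: "Abram S. Isaacs", // no href, but a valid title
		Children: []TOCNode{{
			Title: "\n\t\t\tI. The Rabbi and the Diadem\n\t\t",
			Href:  "s04.xhtml",
		}},
	}})
	fmt.Printf("%q / %q\n", toc[0].Title, toc[0].Children[0].Title)
}
```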

JSON comparison tools:

Reference non-linear resources outside of the spine

In EPUB, various spine items can be declared as non-linear.

As a concept, linearity is very vague and can be handled in a number of ways by reading systems, but most of them simply skip non-linear items.

To avoid this pitfall, the Web Publication Manifest and the in-memory model will not consider non-linear resources to be part of the spine.

This means that the parser should verify each <itemref> in the spine:
<itemref idref="c1-answerkey" linear="no"/>

If an <itemref> includes a linear attribute set to "no", the resource should be added to resources. Otherwise, the resource should be added to spine.
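The rule above can be sketched as follows. `ItemRef` and `splitByLinearity` are hypothetical names, and the sketch returns idrefs rather than full Link objects to stay short.

```go
package main

import "fmt"

// ItemRef is a hypothetical, simplified view of a spine <itemref>.
type ItemRef struct {
	IDRef  string
	Linear string // value of the linear attribute, empty when absent
}

// splitByLinearity sends items marked linear="no" to resources and
// everything else (linear="yes" or no attribute) to the spine.
func splitByLinearity(refs []ItemRef) (spine, resources []string) {
	for _, r := range refs {
		if r.Linear == "no" {
			resources = append(resources, r.IDRef)
		} else {
			spine = append(spine, r.IDRef)
		}
	}
	return spine, resources
}

func main() {
	spine, resources := splitByLinearity([]ItemRef{
		{IDRef: "c1"},
		{IDRef: "c1-answerkey", Linear: "no"},
		{IDRef: "c2", Linear: "yes"},
	})
	fmt.Println(spine, resources)
}
```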

cc @danielweck

Proposal to Add -v or --version Parameter to CLI Interface

I would like to propose the addition of a new parameter, either -v or --version, to the command-line interface (CLI) of our tool. This parameter would allow users to quickly retrieve information about the version of the tool they are using.

Currently, when working with the CLI, it can be cumbersome to find the version of the tool.
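As an illustration only, here is how such a flag could be wired with the standard `flag` package. The `version` variable, the `tool` flag-set name and the `wantsVersion` helper are assumptions, and the actual CLI may be built on a different framework.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"os"
)

// version is a placeholder; a build could override it with
// -ldflags "-X main.version=...".
var version = "dev"

// wantsVersion reports whether -v or --version was passed.
// The flag package accepts both one and two leading dashes.
func wantsVersion(args []string) (bool, error) {
	fs := flag.NewFlagSet("tool", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep parse errors out of stderr
	var show bool
	fs.BoolVar(&show, "v", false, "print version and exit")
	fs.BoolVar(&show, "version", false, "print version and exit")
	if err := fs.Parse(args); err != nil {
		return false, err
	}
	return show, nil
}

func main() {
	show, err := wantsVersion(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	if show {
		fmt.Println(version)
	}
}
```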

webpub manifest JSON `spine` and `resource` lists of links should be mutually-exclusive / non-overlapping?

https://readium2.feedbooks.net/Ym9va3MvY2hpbGRyZW5zLWxpdGVyYXR1cmUuZXB1Yg==/manifest.json

If I am not mistaken, cover.xhtml, nav.xhtml and s04.xhtml should only appear in the spine collection, not resources:

...
, 
"spine": [
  {
   "href": "https://aldiko-stats.feedbooks.net/Ym9va3MvY2hpbGRyZW5zLWxpdGVyYXR1cmUuZXB1Yg==/EPUB/cover.xhtml",
   "type": "application/xhtml+xml"
  },
  {
   "href": "https://aldiko-stats.feedbooks.net/Ym9va3MvY2hpbGRyZW5zLWxpdGVyYXR1cmUuZXB1Yg==/EPUB/nav.xhtml",
   "type": "application/xhtml+xml",
   "rel": [
    "contents"
   ],
   "properties": {
    "contains": [
     "js"
    ]
   }
  },
  {
   "href": "https://aldiko-stats.feedbooks.net/Ym9va3MvY2hpbGRyZW5zLWxpdGVyYXR1cmUuZXB1Yg==/EPUB/s04.xhtml",
   "type": "application/xhtml+xml"
  }
 ],

 "resources": [
...
{
   "href": "https://aldiko-stats.feedbooks.net/Ym9va3MvY2hpbGRyZW5zLWxpdGVyYXR1cmUuZXB1Yg==/EPUB/cover.xhtml",
   "type": "application/xhtml+xml"
  },
  {
   "href": "https://aldiko-stats.feedbooks.net/Ym9va3MvY2hpbGRyZW5zLWxpdGVyYXR1cmUuZXB1Yg==/EPUB/s04.xhtml",
   "type": "application/xhtml+xml"
  },
  {
   "href": "https://aldiko-stats.feedbooks.net/Ym9va3MvY2hpbGRyZW5zLWxpdGVyYXR1cmUuZXB1Yg==/EPUB/nav.xhtml",
   "type": "application/xhtml+xml",
   "rel": [
    "contents"
   ],
   "properties": {
    "contains": [
     "js"
    ]
   }
  }
...
 ],
...
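If the two lists are meant to be disjoint, one way to enforce it is to filter resources against the spine by href. This is a sketch with a hypothetical, simplified `Link` type, not the streamer's actual model.

```go
package main

import "fmt"

// Link is a hypothetical, simplified Link Object.
type Link struct {
	Href string
	Type string
}

// withoutSpine returns the resources whose href does not already
// appear in the spine, preserving their original order.
func withoutSpine(spine, resources []Link) []Link {
	inSpine := make(map[string]struct{}, len(spine))
	for _, l := range spine {
		inSpine[l.Href] = struct{}{}
	}
	out := make([]Link, 0, len(resources))
	for _, l := range resources {
		if _, dup := inSpine[l.Href]; !dup {
			out = append(out, l)
		}
	}
	return out
}

func main() {
	spine := []Link{{Href: "EPUB/cover.xhtml"}, {Href: "EPUB/nav.xhtml"}}
	resources := []Link{{Href: "EPUB/cover.xhtml"}, {Href: "EPUB/style.css"}}
	fmt.Println(withoutSpine(spine, resources))
}
```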

Experiment with support for Shared JS/Readium-1

The streamer could also output a manifest in a format compatible with Readium-1, which means that Shared JS could be re-used without any modification as a navigator.

We're a little fuzzy on the details, so far we have:

Is there anything else that we need to know? Are there well-known URIs that we need to know about? Specific expectations regarding content or HTTP requests?

Could you provide us with these details @danielweck?

No support for alt-rep/alt-script

The current model does not support alt-rep/alt-script to provide a representation of a title or a contributor in another language/script.

This is mostly tied to the fact that we're directly de-serializing the structure to JSON:
https://github.com/Feedbooks/webpub-streamer/blob/master/models/metadata.go#L6
https://github.com/Feedbooks/webpub-streamer/blob/master/models/metadata.go#L44

The Web Publication Manifest has extensive support for alternative representations of a string, more so than EPUB 3.1:

"title": {
  "fr": "Vingt mille lieues sous les mers",
  "en": "Twenty Thousand Leagues Under the Sea",
  "ja": "海底二万里"
}
"author": {
  "name": {
    "ru": "Михаил Афанасьевич Булгаков",
    "en": "Mikhail Bulgakov",
    "fr": "Mikhaïl Boulgakov"
  }
}

This is one area where streamers in other languages SHOULD NOT copy the current Go project; they should instead make sure that their model supports the full extent of what the Web Publication Manifest can do.

Is there an easy way to deal with the serialization issue in Go without making everything else far more complex? cc @jpbougie @banux

Add support for subjects

Subjects prior to EPUB 3.1 are basically a list of tags, that may or may not be concatenated in a single field.

The parser should not attempt to separate concatenated subjects.

In EPUB 3.1, subjects behave much more like Web Publications and the format should be very close to what we have in memory.

Add support for properties of a Link Object

Currently, the properties of either a spine or a manifest item are ignored during the parsing.

Instead of this behavior we'd like to:

  • turn a number of these properties into rel values, for instance for the cover or the ToC
  • express the rest of them in a more consistent way, for example properties such as mathml or svg could instead be expressed as "contains": ["mathml", "svg"]
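A rough sketch of that split. The `contents` rel and the `contains` key come from the manifest samples in this tracker; the `cover` rel value and the `mapProperties` name are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// mapProperties splits the space-separated properties attribute of a
// manifest item into rel values and "contains" values. Properties not
// covered by this sketch are ignored.
func mapProperties(props string) (rels, contains []string) {
	for _, p := range strings.Fields(props) {
		switch p {
		case "cover-image":
			rels = append(rels, "cover") // assumed rel value
		case "nav":
			rels = append(rels, "contents")
		case "mathml", "svg":
			contains = append(contains, p)
		}
	}
	return rels, contains
}

func main() {
	rels, contains := mapProperties("nav svg mathml")
	fmt.Println(rels, contains)
}
```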

Additional work on the Web Publication Manifest will have to be done in parallel in order to improve how properties are expressed in our publication model.

Media Overlays parsing: incorrect handling of body/seq epub:textref (IMO)

In fillMediaOverlay ( https://github.com/readium/r2-streamer-go/blob/master/parser/epub.go#L695 ), mo.Text = smil.Body.TextRef can actually be undefined (such as Moby Dick's first two MO chapters https://github.com/IDPF/epub3-samples/blob/master/30/moby-dick-mo/OPS/chapter_001_overlay.smil#L2 ). The parsing algorithm in the recursive function addSeqToMediaOverlay ( https://github.com/readium/r2-streamer-go/blob/master/parser/epub.go#L737 ) attempts to "split" the Media Overlays based on the nested seqs' epub:textref values (see baseHref == baseHrefParent in the linked Go code), but this may actually result in skipped SMIL fragments and inconsistent JSON output.

I appreciate the optimisation effort (as we discussed during the Readium-2 conference calls, i.e. the edge case when single SMIL files reference multiple HTML documents), but in the NodeJS implementation I personally decided to replicate the Readium-1 parsing behaviour, which consists in parsing each individual SMIL file in full, and then attaching each resulting root Media Overlay to every HTML spine item that references said SMIL (i.e. OPF manifest items' media-overlay IDREF). In essence, I removed the special treatment of mo.Text (which is duplicated in both fillMediaOverlay and addSeqToMediaOverlay), and I added a single preliminary parsing step to construct the mapping: https://github.com/edrlab/r2-streamer-js/blob/develop/src/parser/epub.ts#L378-L421
This way, at worst there will be redundant Media Overlays timing data for a given HTML document (in the edge case where the SMIL file references multiple spine items), but in most cases the attached MO will contain exactly what an HTML document needs.

Add support for publication date

In EPUB 2.0 this is handled by dc:date as referenced in http://www.idpf.org/epub/20/spec/OPF_2.0.1_draft.htm#Section2.2.7

We will have to be careful in 2.x about the event attribute, since it is largely undefined.

In 3.x this is more clearly established (http://www.idpf.org/epub/301/spec/epub-publications.html#sec-opf-dcmes-optional):

  • only one dc:date element
  • this element contains the publication date, other events have to use dcterms:date

The main issue is that ISO 8601 is only a recommendation, not a requirement.

Publications cannot be Unmarshalled

The following code fails on the final line:

epubData, err := parser.Parse("book.epub")
Expect(err).To(BeNil())

jsonBytes, err := json.Marshal(epubData) // Marshal works fine.
Expect(err).To(BeNil())

foo := &models.Publication{}
err = json.Unmarshal(jsonBytes, foo)
Expect(err).To(BeNil()) // json: cannot unmarshal string into Go struct field Contributor.metadata.author.name of type models.MultiLanguage

I think this happens because of how models.MultiLanguage overrides json.Marshal - it turns MultiLanguages into either a string or an object, but a models.Metadata doesn't expect either - it expects a MultiLanguage.

I can't understand how this is supposed to work. I'm trying to get the metadata for an epub, store it, and get it again later, but when I get it again later I don't just want to pass it around as bytes, I want to interact with the metadata, so I need to Unmarshal it. It seems unlikely that this wouldn't be supported, but I couldn't find another way after reading the docs and looking through the code. Sorry in advance if I missed something.

Return pageNavigation instead of printPageNumbers when inferring metadata

printPageNumbers has been gradually deprecated in favor of two different values:

  • pageBreakMarkers which is meant to indicate that the text of the publication contains page break markers using ARIA
  • and pageNavigation which is meant to indicate that the publication contains a list of pages, usually based on HTML IDs

The current inference technique is based on the presence of pageList in the RWPM output which can either come from:

  • a Navigation Document in EPUB 3.x
  • or an NCX in EPUB 2.x

Given the nature of what is inferred here, the toolkit should return pageNavigation instead of printPageNumbers.

Improve support for EPUB 2.x and 3.x metadata

The current prototype has limited support for metadata:

  • support for contributors is limited to author/contributor and not the other roles
  • the first title is used
  • the first identifier, and not the one identified as the main one, is used

This should be extended to support all the metadata currently available in models, both in EPUB 2.x and EPUB 3.x.

Use an array for roles in contributor

As it's been pointed out in the EPUB 3 maintenance group, EPUB 3.1 doesn't allow content producers to indicate more than a single role for a contributor.

In EPUB 3.0.x it was possible to indicate as many roles as you wanted:

<dc:contributor id="Olaf">Dr. Olaf Hoffmann</dc:contributor>
<meta refines="#Olaf" property="file-as">Dr. Hoffmann, Olaf</meta>
<meta refines="#Olaf" property="role" scheme="marc:relators">mrk</meta>
<meta refines="#Olaf" property="role" scheme="marc:relators">art</meta>
<meta refines="#Olaf" property="role" scheme="marc:relators">ill</meta>
<meta refines="#Olaf" property="role" scheme="marc:relators">aui</meta>
<meta refines="#Olaf" property="role" scheme="marc:relators">pfr</meta>

The Readium Web Publication Manifest is a direct descendent of the BFF project and was designed with 3.1 round-trippability in mind.

But in the context of Readium-2, we want to maximize compatibility with EPUB 2.0.1 and any 3.x revision, which means that instead of having a single string allowed for the role in a contributor element, we need to move to an array of strings.

Output an OPDS 2.0 feed listing publications

While the current version of the Go streamer is focused on parsing and serving a single publication, the use case for both the Go and node.js/Typescript versions of the streamer might be primarily on the server side.

To better adapt to such use cases, we should do the following:

  • by default, parse and keep in memory all publications stored in publications/
  • provide an OPDS 2.0 feed for these publications at /publications.json

OPDS 2.0 is not truly a thing yet, aside from a few experiments on Gist.

But to reach a point where we're comfortable writing a specification for OPDS 2.0, we need to experiment and this is the perfect opportunity to do it.

Here are a few ground rules:

  • OPDS 2.0 will be based on the same abstract model as the Readium Web Publication Manifest, which means collections, Link Object, links and metadata
  • the media type for OPDS 2.0 is application/opds+json
  • there won't be a difference between acquisition and navigation feeds in OPDS 2.0, all feeds contain a number of collections
  • there will be three core collection roles: publications (equivalent of an acquisition feed), navigation and groups (to replace rel="collection" and aggregate publications together in a single feed)
  • the output that the streamer will provide is a very basic OPDS 2.0 feed where all publications are listed in a publications collection
  • this collection won't contain the full content of a Readium Web Publication (we'll only include metadata, links and a new images collection that contains one or more different covers)

Here's a very basic example of what the output will look like:

{
  "@context": "http://opds-spec.org/opds.jsonld",

  "metadata": {
    "@type": "http://schema.org/DataFeed",
    "title": "All Publications",
    "numberOfItems": 1
  },

  "links": [
    {"rel": "self", "href": "http://example.org/publications.json", "type": "application/opds+json"}
  ],

  "publications": [
    {
      "metadata": {
        "@type": "http://schema.org/Book",
        "title": "Moby-Dick",
        "author": "Herman Melville",
        "identifier": "urn:isbn:978031600000X",
        "language": "en",
        "modified": "2015-09-29T17:00:00Z"
      },
      "links": [
        {"rel": "self", "href": "http://example.org/manifest.json", "type": "application/webpub+json"}
      ],
      "images": [
        {"href": "http://example.org/cover.jpg", "type": "image/jpeg", "height": 600, "width": 400}
      ]
    }
  ]
}

Add support for subtitles

In EPUB 3.0.x it's possible to indicate that a title is a subtitle:

<dc:title id="title_2">All About EPUB 3.1</dc:title>
<meta refines="#title_2" property="title-type">subtitle</meta>
<meta refines="#title_2" property="display-seq">2</meta>

A recent revision to the Readium Web Publication Manifest also added a subtitle element to play the same role.

We need to add support for subtitles in the streamer, using the same multi-lingual model as the title element.

Add support for NCX and Navigation Document parsing

Both the NCX and the Navigation Document are currently ignored. This should be modified to extract:

  • toc and/or page-list for the NCX
  • toc, page-list, landmarks, loi, loa, lov and lot for the Navigation Document

In addition to these two documents, we might also treat the EPUB 2.x guide element as the equivalent of landmarks.

The Navigation Document should always take precedence over the NCX when both of them are available.

The Navigation Document itself should also be marked as such in the spine/resources collection of a publication (in our in-memory model), using rel="contents".

HTTP CORS

Just a heads-up: although your test server does not emit HTTP CORS headers (which would make sense, as a reading system app would most likely be hosted on a different domain, distinct from the content server's origin), you can use a proxy such as https://crossorigin.me , for example:

https://proto.myopds.com/manifest/mobydick.epub/manifest.json

content-type →text/plain; charset=utf-8
date →Wed, 12 Oct 2016 14:37:50 GMT
server →Caddy
status →200
vary →Origin

vs.

https://crossorigin.me/https://proto.myopds.com/manifest/mobydick.epub/manifest.json

access-control-allow-credentials →false
access-control-allow-headers →Content-Type, X-Requested-With
access-control-allow-origin →*
cf-ray →2f0b508a2d2d360e-LHR
content-encoding →gzip
content-type →text/plain; charset=utf-8
date →Wed, 12 Oct 2016 14:41:47 GMT
expires →Thu, 13 Oct 2016 14:41:46 GMT
server →cloudflare-nginx
status →200

Multilingual string (metadata), lower case

https://github.com/readium/r2-streamer-go/blob/master/parser/epub.go#L214

			contributor.Name.MultiString = make(map[string]string)
			contributor.Name.MultiString[publication.Metadata.Language[0]] = cont.Data

			for _, m := range metaAlt {
				contributor.Name.MultiString[m.Lang] = m.Data
			}

...should lower-case the language code, to be consistent with:

https://github.com/readium/r2-streamer-go/blob/master/parser/epub.go#L280

			publication.Metadata.Title.MultiString = make(map[string]string)
			publication.Metadata.Title.MultiString[strings.ToLower(mainTitle.Lang)] = mainTitle.Data

			for _, m := range metaAlt {
				publication.Metadata.Title.MultiString[strings.ToLower(m.Lang)] = m.Data
			}

HTTP Optimizations

In order to improve performance, a number of optimizations can also be handled at the HTTP level:

  • manifest served with an ETag
  • all resources from the publication served with Cache-Control header and a long expiration date

Add support for dc:type

In EPUB 3.0.1, dc:type is tied to an EPUB controlled vocabulary that we should attempt to store in our in memory model and in the Web Publication Manifest.

Since prior to EPUB 3.0.1 other values could be used, it's probably best to filter this and only support controlled vocabularies.

Support encrypted Media Overlay documents

@HadrienGardeur commented on Mon Mar 13 2017

With the current code, media overlays are not parsed when they're encrypted.

We need to add support for this feature by handling the following behavior:

  • once the proper keys are provided for a DRM, trigger the goroutine that parses SMIL files
  • make sure that the links for media overlay (in links or properties) are present, even if we can't decrypt the SMIL files yet
  • return an HTTP error for the media overlay service when SMIL files haven't been decrypted yet

Add ability to read raw compressed archive entries

When streaming an EPUB's content to web browsers using the go-toolkit, it is possible for the end user's device to be the only one that performs decompression of the file's contents. This would further decrease the CPU usage of any software using the go-toolkit, as decompression would not have to occur within the code, and additional resource usage may be avoided because the resource no longer needs to be recompressed further up the chain, e.g. in nginx or Apache.
How does this work in practice? Here's an example of the accept-encoding header of a modern browser (Chrome): Accept-Encoding: gzip, deflate, br, zstd. The deflate encoding is the same compression scheme used inside of ZIP files. This means that if you were to provide a browser that supports this encoding the contents of a compressed zip.File using the OpenRaw() method, and return content-encoding: deflate, the browser would be able to decode it directly.

Implement APIs for LCP support

In order to support LCP, in addition to handling decryption, the streamer also needs to add two new interactions:

  • a link to the LCP license
  • an LCP license handler, to communicate User Key/Passphrase to the streamer

For the LCP license, the fetcher will serve directly META-INF/license.lcpl through a link at the publication level:

{
  "href": "license.lcpl",
  "rel": "license",
  "type": "application/vnd.readium.lcp.license-1.0+json"
}

The LCP license handler will have the following interactions possible:

  • Getting the handler document (GET)
  • Sending the User Key/Passphrase (POST)

The following link is added to the publication's manifest:

{
  "href": "license-handler.json",
  "rel": "http://readium.org/lcp/handler",
  "type": "application/json"
}

The two interactions possible respond with the following documents:

GET license-handler.json
{
  "identifier": "62b2dfcb-48f0-4e1b-b2b0-8e3444960f13",
  "profile": "http://readium.org/lcp/basic-profile",
  "key": {
    "ready": false, 
    "check": "jJEjUDipHK3OjGt6kFq7dcOLZuicQFUYwQ+TYkAIWKm6Xv6kpHFhF7LOkUK/Owww"
  },
  "hint": {
    "text": "Enter your library card PIN",
    "url": "http://www.example.com/passphraseHint?user_id=1234"
  },
  "support": {
    "mail": "[email protected]",
    "url": "http://www.example.com/support",
    "tel": "1800836482"
  }
}
POST license-handler.json
{
  "key": {
    "hash": "9728be1c6737759dcba331ebe78276d8c83999b02d410aa2662c763915229a79"
  }
}

The POST request returns the full handler document, with the following HTTP status codes:

  • 200 OK, when the key/passphrase is valid
  • 401 Unauthorized, when the key/passphrase is invalid
