pwpub's Introduction

Packaged Web Publications

This is the repository of the W3C’s specification on Packaged Web Publications, developed by the Publishing Working Group.

This work has been superseded by the Lightweight Packaging Format (LPF), released as a W3C Working Group Note.

The original editors’ draft of the Packaged Web Publications specification is still available for the record.

Contributing to the Repository

Use the standard fork, branch, and pull request workflow to propose changes to the specification. Please make branch names informative, for example by including the issue or bug number.

Editorial changes that improve the readability of the spec or correct spelling or grammatical mistakes are welcome.

Please read CONTRIBUTING.md for details about licensing contributions.

pwpub's People

Contributors

iherman, llemeurfr, mattgarrish, plehegar, prototypo, tzviyasiegman

pwpub's Issues

Should the mimetype file be kept in the package?

The current proposal (PR #30) removes the mimetype file (a.k.a signature file) from the package.
In #30 (comment) it is suggested to keep it.

As a reminder, this file comes from the ODF specification (OCF is heavily inspired by ODF) and is used to provide so-called magic numbers to operating systems (as an alternative to file extensions).
A web search did not turn up any mention of an OS using the OCF magic number. On the other hand, many current EPUB reading systems stop reading an EPUB file if the mimetype file is absent from the package.
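For reference, the check that such reading systems perform can be sketched roughly as follows. This is a minimal illustration, not any reading system's actual code: OCF requires the mimetype file to be the first entry in the ZIP and to be stored uncompressed, and the function name `has_mimetype_signature` and the default expected value are assumptions made for this example.

```python
import io
import zipfile


def has_mimetype_signature(package_bytes: bytes,
                           expected: bytes = b"application/epub+zip") -> bool:
    """Return True if the package starts with an uncompressed 'mimetype' entry.

    Hypothetical helper: it mirrors the OCF rule that the signature file is
    the first entry in the archive and is stored without compression.
    """
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as zf:
        entries = zf.infolist()
        if not entries:
            return False
        # The physically first entry is the one whose local header is at offset 0.
        first = min(entries, key=lambda e: e.header_offset)
        return (
            first.header_offset == 0
            and first.filename == "mimetype"
            and first.compress_type == zipfile.ZIP_STORED
            and zf.read(first) == expected
        )
```

A package processor that wants to stay lenient could call this only to emit a warning, rather than refusing to open the package.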

Let's discuss the pros of having this mimetype file in the new package for package processors (B2B processing nodes and reading systems).

What is the `origin` of a packaged publication?

A precise answer to this question should (probably) be included in the document. This origin affects the way relative URI-s in the manifest are turned into absolute ones, it affects behaviors of scripts, etc.
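To illustrate why the origin matters for relative references, here is a small sketch using Python's standard `urljoin`. The two base URLs are invented for the example; the point is only that the same relative references in a manifest resolve to different absolute URLs depending on where the publication is considered to live.

```python
from typing import List
from urllib.parse import urljoin


def absolutize(manifest_url: str, references: List[str]) -> List[str]:
    """Resolve relative references against the URL of the manifest."""
    return [urljoin(manifest_url, ref) for ref in references]


refs = ["chapter1.html", "../css/style.css"]

# Same manifest content, two hypothetical locations.
print(absolutize("https://publisher.example/book/manifest.json", refs))
print(absolutize("https://cdn.example/cache/book/manifest.json", refs))
```

For an unpacked package on a file system there is no such base URL at all, which is why the document probably needs to say which origin applies.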

Do we need a dedicated media type and file extension for PWP?

Based on recent discussions with @iherman and @lrosenthol it seems likely that PWP will include a default option for packaging.

In order to identify such packages on the Web and on a file system, we'll need to roll out our own:

  • media type: application/pwpub
  • file extension: .pwpub

At this point this is strictly an early proposal for these values.

Based on what we adopt as our default packaging format, we can easily customize the media type a little more (for example application/pwpub+zip or application/pwpub+cbor).
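As a small illustration of what registering these values would mean for tooling, the sketch below teaches Python's standard `mimetypes` module about the proposed pair. The values are the early proposal from this issue, not registered ones, and the file name is made up.

```python
import mimetypes

# Proposed (unregistered) values from this issue.
mimetypes.add_type("application/pwpub", ".pwpub")

print(mimetypes.guess_type("moby-dick.pwpub"))
print(mimetypes.guess_extension("application/pwpub"))
```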

Should the manifest be optional in the package?

In the current proposal #30, the manifest file is required.

It has been argued in #30 (comment) that "This appears to disallow embedded manifests, and thus restricts what types of web publications can be packaged."

In #30 (comment) I put forward that "We may consider that the container contains a canonical representation of the WP, where the manifest is external..."

Hoping we can solve this issue here.

Better define LPF's UA conformance

(This is a follow-up issue of the discussion we had at the Boston f2f. /cc @llemeurfr).

The UA Conformance requirements are currently underspecified, for instance:

It is able to import the Package and fulfill the same requirements as a user agent processing the equivalent Web Publication, as defined in [wpub-ucr] ;

Beyond the reference to a Note (which I guess is OK, LPF being a Note itself!), I believe we should define more precisely which requirements are meant here, and how a UA is expected to process the LPF to meet them.

It is able to expose the Package as a Web Publication, as defined in [wpub];

We don't define "expose something as a Web Publication" in the Web Publication spec, so we need to disambiguate what is meant here.

It is able to convert a Package to an alternative format suitable for electronic distribution.

Needs to be disambiguated as well. What does "suitable for electronic distribution" mean?

Is "signing as an origin" sufficient?

From #23:

o  Signing as an origin: So that readers can be sure their copy is
   authentic and so that copying the package preserves the URLs of
   the content inside it.

This enables cross-origin/domain distribution (which is what AMP wants this for) -- i.e. Chrome (in the future) will display the URL from the package, not the CDN (aka Google's AMP cache).

Signatures are based upon origin certification. This means that if your origin (i.e. your domain) expires and/or your domain certificate is expired or otherwise no-longer valid there may be implications to the future use of the contents (essentially a bundle of "old" HTTP requests). This concern is being explored in the context of archiving a WebPackage.

Choose a media-type and file extension for Web Publications packaged in "OCF lite"

The use of a simplified version of OCF is discussed by the Audio TF, as a way to be able to package audio based Web Publications without waiting for future W3C Web Packaging technologies. The upcoming BD-Comics-Manga CG will certainly also want to package contents using the same technique. This makes this simplified version of OCF more than a pure "OCF for Audio" format; rather a generic packaging mechanism for Web Publications, with known constraints (discussed elsewhere).

Such a Web Publication container will essentially be used:

  • to exchange in-progress packaged Web Publications between different individuals and/or different organizations;
  • to provide packaged Web Publications from a publisher or conversion house to the distribution or sales channel; and
  • to deliver packaged Web Publications to Reading Systems or users.

What should be the mime-type for this file format?
What should be the preferred file extension for this file format?

Note: the first draft boldly uses the following values:

  • mime-type: application/wpub+zip (= a zipped web publication)
  • file extension: pwp (a packaged web publication)

Packages vs canonical id

ref https://www.w3.org/TR/wpub/#canonical-identifier

By definition, "A Web Publication's canonical identifier is a unique identifier that resolves to the preferred version of the Web Publication. It is expressed using the id property." A Package usually contains content that has been composed "out of the web". There is therefore no "preferred version of the Web Publication" involved.

Creators of Packages have some solutions:

1/ No canonical id.
This is OK from the WP spec's point of view, as indicated in the sentence "If a URL is not provided in the manifest, or the value is an invalid URL, the Web Publication does not have a canonical identifier." A canonical id may be added to the manifest by a processor which makes a Web Publication from the Package.

2/ A DOI as canonical id.
In such a case, the DOI will redirect to a Web Publication, when it is put online. But in this case, what if the Package becomes SEVERAL Web Publications at different URLs?

3/ a URN as canonical id.
URN (Uniform Resource Name) is used here in the sense of a Name but not Locator. The advantage is that there is no attempt to use it as a Locator. But URNs are not liked much these days in Web circles.

Any other solutions? Which one should be recommended (or required)?
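The rule quoted under option 1 can be sketched as a small helper. This is only an illustration: the function name is hypothetical, and testing for a non-empty URL scheme is a deliberately crude stand-in for real URL validation. It does show that a DOI URL and a URN both pass, while a missing or non-URL value yields no canonical identifier.

```python
from typing import Optional
from urllib.parse import urlsplit


def canonical_identifier(manifest: dict) -> Optional[str]:
    """Return the manifest's id if it looks like a URL, else None."""
    value = manifest.get("id")
    if not isinstance(value, str):
        return None
    try:
        scheme = urlsplit(value).scheme
    except ValueError:
        return None
    # Crude validity test (an assumption of this sketch): a URL needs a scheme.
    return value if scheme else None
```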

Allow signing the components of a package

The new packaging format should not mix xml with json. Therefore the reuse of the signature.xml file is not the way to go and we have to find a "json" way to handle the signature of WP resources.

(Admin) Reorg repository

The current repo still has the ocf-lite document as a 'side' document, as if it was a secondary note. We should reorganize to reflect the reality. This may mean:

  1. We set up a separate repository and we move this draft there. The current repository should be marked (e.g., in the README) as, say, 'postponed', or something similar.
  2. We keep the current repository, in which case:
    1. the current index.html file should be moved into some other subdirectory ('attic'), duly marked as postponed;
    2. the current spec/ocf-lite.html should become the index.html file, i.e., the main entry to the repository;
    3. the README file should be updated;
    4. because this document would not replace the current pwp document, it would have to have its own short name and should go through the FPWD publishing phase, which should also mean some admin to do vis-à-vis ECHIDNA.

These steps also depend on the outcome of issue #39

Cc: @mattgarrish @wareid @TzviyaSiegman @GarthConboy @llemeurfr

entry.html

Why are we using entry.html as the well-known location instead of the much more common index.html?

Should the entry page be optional in the package?

The current consensus is that a WP MUST have a Primary Entry Page, i.e. an HTML page.

It has been argued by the Audio TF that audio books don't have HTML pages and adding a dummy page to a package would be a burden. Note that in this situation (no HTML), a JSON ToC should exist in the manifest or the ToC must be inferred from the track listing (i.e. the reading order)... this is discussed in w3c/wpub#369.

On the other hand, I was made aware of a particular situation (Audiolib in France) where a book had a beautiful graphical ToC and the audio publisher integrated an image of this illustration in the audiobook as supplementary content; it would have been great to be able to use it as a real ToC. In such a situation, having an HTML entry page makes great sense.

When a package is exposed as a Web Publication, it is easy for a processor to create on the fly an entry page if there is none in the package. This HTML page will have a link to the manifest (or will embed the manifest) and its content will be created from the metadata found in the manifest. This is a tiny development.
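A rough sketch of what such a processor could do is shown below. It assumes the manifest has already been parsed into a dictionary, that the title is in `name`, that the track list is in `readingOrder`, and that the page links to the manifest with `<link rel="publication">`; the function name and the default manifest file name are choices made for this example only.

```python
import html


def build_entry_page(manifest: dict, manifest_href: str = "manifest.jsonld") -> str:
    """Build a minimal HTML entry page from manifest metadata."""
    title = html.escape(str(manifest.get("name", "Untitled publication")))
    items = []
    for entry in manifest.get("readingOrder", []):
        # Entries may be plain URL strings or objects with "url" and optional "name".
        if isinstance(entry, str):
            url, label = entry, entry
        else:
            url = entry.get("url", "")
            label = entry.get("name", url)
        items.append(
            f'      <li><a href="{html.escape(url, quote=True)}">'
            f"{html.escape(str(label))}</a></li>"
        )
    toc = "\n".join(items)
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        "  <head>\n"
        '    <meta charset="utf-8">\n'
        f"    <title>{title}</title>\n"
        f'    <link rel="publication" href="{html.escape(manifest_href, quote=True)}">\n'
        "  </head>\n"
        "  <body>\n"
        f"    <h1>{title}</h1>\n"
        "    <ol>\n"
        f"{toc}\n"
        "    </ol>\n"
        "  </body>\n"
        "</html>\n"
    )
```

The generated list doubles as a basic ToC inferred from the reading order, which is the fallback discussed above for audiobooks without an HTML page.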

We could therefore choose to have the HTML entry page optional in the package. It would offer simplicity for basic use cases and guarantee great results for advanced use cases.

Note: The other solution is to conclude that a JSON ToC is not an option and that an HTML ToC is imposed on audiobook publishers.

Renamed the required `manifest.jsonld` to `publication.json`

Since this spec will force the naming of files, we should consider changing manifest.jsonld to publication.jsonld to avoid confusion with the manifest.json file name frequently used by Web Application Manifest users.

Potentially, we may also want to consider using .json vs. .jsonld as development tools are more likely to "naturally" open .json files and may not have .jsonld file extensions mapped (yet).

So, the suggestion is to rename manifest.jsonld to publication.json.

Is "Indexed by URL" sufficient?

In #23 one of the "associated requirements" listed within the PWP section of the Web Packaging spec is...

Indexed by URL: Resources on the web are addressed by URL.

Practically, this means that:

  1. everything in the package is also independently retrievable via its URL
  2. there is no additional mechanism for pointing at things within the package (i.e. no fragment identifier)
  3. there cannot be (consequently) any "package-only" or "interior" resources

There may be other consequences (positive and negative) of this design which we should consider.

First attempt at listing requirements

A Packaged Web Publication:

  • MUST contain a WP manifest at a well-known location (e.g. manifest.json)
  • MUST contain all resources that are part of the publication (reading order + secondary resources)
  • MAY contain additional resources that are referenced by the publication (for example a metadata record in a different format)
  • SHOULD contain the request & response HTTP payloads for each resource
  • MAY contain a signature for the whole package or individual resource
  • MUST NOT use the same media type and file extension if any resource contained in the package is protected by a DRM
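The first two requirements in the list above lend themselves to a mechanical check. The following is a minimal sketch assuming a ZIP-based package, a manifest named manifest.json in the root, and manifest entries listed under `readingOrder` and `resources` either as URL strings or as objects with a `url` member. The function name is hypothetical.

```python
import io
import json
import zipfile
from typing import List


def missing_resources(package_bytes: bytes,
                      manifest_name: str = "manifest.json") -> List[str]:
    """List manifest resources that are not present in the package.

    Raises ValueError if the manifest itself is absent.
    """
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as zf:
        names = set(zf.namelist())
        if manifest_name not in names:
            raise ValueError(f"package has no {manifest_name}")
        manifest = json.loads(zf.read(manifest_name))

    required = []
    for key in ("readingOrder", "resources"):
        for entry in manifest.get(key, []):
            url = entry if isinstance(entry, str) else entry.get("url")
            if url:
                required.append(url)
    return [url for url in required if url not in names]
```

An empty result means the package satisfies the two MUSTs covered here; the remaining items (HTTP payloads, signatures, DRM) are out of scope for this sketch.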

How to Define Conformance with the PWP Specification?

What will constitute conformance with the PWP specification? May an implementation conform partially to the specification for some allowed values of "partially"?

For example, might an implementation complying with the Web Publication specification and also complying with the "descriptive properties" section of the PWP specification, but not the packaging section, be considered a legitimate, if partial, profile of PWP?

Relative IRI-s section in the LPF document

I wonder whether the references and section should not rely on the WhatWG spec rather than the RFC-s. That is what most of the W3C specs do these days. Maybe more importantly, this is what is done in the WPUB §3.1.7; we should be consistent within a family of specifications.

I actually wonder whether we should not mostly refer to §3.1.7 above. We would then have one place in our specs where this could be amended if needed. The LPF spec would then slightly extend that section saying that the rules described there are for the relative URL-s for URL-s appearing in the manifest, but that same rules should apply to references among constituent files.

Note that the section refers to the base URI of the manifest; I think this is unnecessary. The fact that the manifest.jsonld file, if present, must appear in the Root Directory (which is already the case), plus the rules defined in the WPUB document covers that case. On the other hand, if there is an index.html file the author may decide to place the manifest file somewhere else (or embed it), in which case this paragraph is not relevant.

Ensure text equivalent for accessibility

One of the main principles of PWG charter is:
"Recommendation-track deliverables will contain mechanisms to make Web Publications accessible to a broad range of readers with different needs and capabilities. This includes general Web Content Accessibility Guidelines (WCAG) and Web Accessibility Initiative (WAI) requirements of the W3C as well as requirements for international readers using different scripts and document formats. Profiles of Web Publications may be defined with more stringent accessibility requirements."

In addition, the accessibility guidelines defined by WCAG 2.0/2.1 mandate accessibility for a broader range of disabilities.
http://www.w3.org/WAI/standards-guidelines/wcag/glance/

It is obvious that audio-only publications are accessible to people with low vision or visual impairments. But we also need to ensure that it is possible to produce audio publications that are accessible to people with other kinds of disabilities, for example people with hearing loss, slow learners, etc.
Therefore we need to ensure that the specifications support text equivalents.

Some thoughts are:

  • Can we leverage existing web specifications to add transcriptions and captions to audio track in the audio publications?
  • Is the audio profile designed in a way to integrate well with text of the publication so as to provide audio sync with text?

These are basic design questions that we need to keep in mind as the audio profile is developed.

Should the packaging spec explicitly disallow file name characters?

The OCF specification lists a series of characters which cannot be used in file paths and file names.

The ISO 21320 specification, on which the WP packaging format will be based, does not formally prohibit any character for use in file paths / names. Instead, in an informative annex, it expresses that [The ZIP] "Appnote specifies few restrictions for filenames in the archive. For compatibility, this part of ISO/IEC 21320 does not require additional restrictions on filenames which are valid according to Appnote."
Then, in a comparative table of "known restrictions", it shows what JAR, Widget Zip, OOXML, OCF and Adobe UCF prohibit.

The ZIP Appnote has a field reserved for "file name" (4.4.17). It states that "The path stored MUST NOT contain a drive or device letter, or a leading slash. All slashes MUST be forward slashes '/' as opposed to backwards slashes '\' for compatibility with Amiga and UNIX file systems etc.".
I also found Appendix D.2, which states that "the ZIP format has historically supported only the original IBM PC character encoding set, commonly referred to as IBM Code Page 437". But there is also a way to store a Unicode Path in UTF-8 in either an "extra field" (4.6.9) or the original field.

Conclusion: in this Appnote there is no mention of disallowed characters in file paths / names.
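To make the point concrete, here is a sketch of a checker that enforces only the path rules quoted from 4.4.17. The function name is hypothetical, and treating "a drive or device letter" as a leading `X:` prefix is a simplifying assumption. Under these rules, an exotic but legal name passes without complaint.

```python
import re
from typing import List

# Assumption of this sketch: a drive or device letter looks like "C:" at the start.
_DRIVE_PREFIX = re.compile(r"^[A-Za-z]:")


def appnote_path_problems(path: str) -> List[str]:
    """Return the Appnote 4.4.17 rules that a stored path violates."""
    problems = []
    if _DRIVE_PREFIX.match(path):
        problems.append("drive or device letter")
    if path.startswith("/"):
        problems.append("leading slash")
    if "\\" in path:
        problems.append("backslash")
    return problems


print(appnote_path_problems("chapters/ch1.html"))
print(appnote_path_problems("C:\\books\\ch1.html"))
```

Anything stricter, such as the OCF list of forbidden characters, would have to be added on top by our own specification.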

Therefore, we can't rely on ISO 21320 to limit the characters allowed in file paths / names. Either we keep (extend?) the OCF constraints, or we rely on authors (and operating systems) to avoid file path / names which would break interoperability between systems. After all this is what we do when choosing a file name; I can name a file [{"'!§$€%.x on MacOS: would I try to move it to a Windows or Linux machine?

Proposed changes to 2.2

I have tried to reformulate section 2.2 to make the conformance part more explicit (for my taste). Also, I believe the current formulation is a bit too restrictive: if the manifest file is obtained from the PEP but is stored as a separate file, is there any reason to require a fixed name and position in the file system?

Here is what I would propose as a (slight) reformulation. (Slight, because the main idea remains the same.)

(Note that, in this text, I also propose to keep to index.html as the predefined name. index.html is the natural setup on Web Servers for the entry page in a directory, and probably most of the users/developers would instinctively keep to that pattern. As an LPF might result in a simple zipping of a directory in a file system, it would be more natural to keep to this pattern.)

The Package MUST include at least one of the following files in its Root Directory:

The contents of both files are specified in [[wpub]].

The User Agent MUST obtain the Web Publication Manifest for the publication included in the Package through the following steps:

  1. If the Package contains an index.html file, the Web Publication Manifest is obtained following the rules described in the relevant section of the Web Publication specification [[wpub]].
  2. Otherwise, the manifest.jsonld file is used as the Web Publication Manifest.

If both index.html and manifest.jsonld are present in the Package, then the former SHOULD contain a reference to the latter, following the rules described by the definition of the PEP.

All other files within the Package MAY be in any location descendant from the Root Directory.

Note that the index.html page may contain an embedded manifest, i.e., the Web Publication Manifest may not be explicitly present in the Package.
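The lookup order proposed above can be sketched as follows. This is only an outline: the function name is hypothetical, it works on the list of file names in the Root Directory, and it deliberately stops short of extracting the manifest from index.html, since that step is defined by the Web Publication specification rather than here.

```python
from typing import Iterable, Tuple


def manifest_starting_point(root_files: Iterable[str]) -> Tuple[str, str]:
    """Decide where a User Agent starts when looking for the manifest.

    Returns ("pep", "index.html") when the manifest must be obtained via the
    Primary Entry Page (linked or embedded), or ("file", "manifest.jsonld")
    when the manifest file is used directly.
    """
    files = set(root_files)
    if "index.html" in files:
        # Step 1: follow the WPUB rules starting from the entry page.
        return ("pep", "index.html")
    if "manifest.jsonld" in files:
        # Step 2: fall back to the manifest file in the Root Directory.
        return ("file", "manifest.jsonld")
    raise ValueError("Package contains neither index.html nor manifest.jsonld")
```

Note that when both files are present the entry page wins, which is why the text says index.html SHOULD then reference manifest.jsonld.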

Font obfuscation in PWP

Is there still a requirement to obfuscate embedded fonts in a PWP container?
If yes, should it be treated at the container level (e.g. OCF-lite) or at the WP level?

References:

Remarks:

  • There is no such thing as obfuscated fonts on the Web.
  • Many people mix font obfuscation with DRM and DRM is evil at W3C.
  • Let's remember that for specific use cases not treated by PWP, EPUB 3 will be there ... forever.
  • There are two methods for obfuscating fonts in EPUB, from Adobe and IDPF (and treating them is quite a pain for developers of packaging tools and Reading Systems).
  • The IDPF method relies on a unique identifier for the publication. The unique_identifier in the EPUB .opf must be replaced by the identifier found in the WP manifest, therefore there is a close relationship between WP parsing and font obfuscation.

In the first draft of "OCF-lite" the section about font obfuscation is removed. This thread should help decide what is to be done on this front.
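To show how tightly the IDPF method is coupled to the publication identifier, here is a sketch of that algorithm as I understand it from the EPUB OCF specification: the key is the SHA-1 digest of the unique identifier with whitespace removed, and the first 1040 bytes of the font are XORed with that key. The function name is hypothetical, and using the WP manifest identifier as input is exactly the open question raised above.

```python
import hashlib


def idpf_obfuscate(font_bytes: bytes, unique_identifier: str) -> bytes:
    """Apply (or reverse) IDPF-style font obfuscation.

    The operation is its own inverse because it is a plain XOR.
    """
    # Whitespace (space, tab, CR, LF) is stripped from the identifier before hashing.
    cleaned = "".join(ch for ch in unique_identifier if ch not in " \t\r\n")
    key = hashlib.sha1(cleaned.encode("utf-8")).digest()  # 20-byte key

    # Only the first 1040 bytes (52 repetitions of the key) are modified.
    head = bytes(b ^ key[i % len(key)] for i, b in enumerate(font_bytes[:1040]))
    return head + font_bytes[1040:]
```

If the identifier changes when a package is converted to or from EPUB, every obfuscated font has to be de-obfuscated with the old identifier and re-obfuscated with the new one.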

Packages vs WP Address

The WP address is defined in https://w3c.github.io/wpub/#address.
Most packages created by publishers will have no address until they are "exposed" on the web.

The difficulty is that the WP address is required in a WP Manifest and references the PEP, but an index.html file (i.e. the PEP file) is not mandatory in a Package.

A pragmatic solution is to require that the address value MUST be "index.html" inside the manifest even if the corresponding file is not present in the Package. The URL will be updated to the final WP address by a processor which transforms the Package to a Web Publication.

First-class citizenship of resources in PWPs

For the symmetry of PWPs and WPs, all resources in PWPs should be first-class citizens of the web. Thus, any such resource-in-PWP should be addressable by URIs.

Moreover, fragment identifiers should not be used. Use of fragment identifiers implies second-class citizens (in the terminology of RFC 3986, "secondary resource").

Update authority

If you send me a .epub file (or any other downloadable file), I have it. You can't update it without sending me another .epub--which I can choose to replace the old one, or I can use as a separate one, or I can ignore entirely.

This (somewhat) relates to this quote from #23:

o  Downgrade prevention: An early version of a publication might
   contain incorrect content, and a publisher should be able to
   update that without worrying that an attacker can still show the
   old content to users.

An attacker, in this scenario, is considered someone besides the publisher, but in the eyes of the reader (who has potentially paid for a publication) the publisher and the "attacker" may be the same--i.e. Amazon removing copies of 1984 (etc).

Given that a single publication is currently identified by its publication "address" (a URL) and (if we use WebPackage) will be signed by a single origin's certificate (i.e. rented authority mapped into that URL), what other facilities must we provide (on behalf of the reader) to prevent "overwriting" by either an attacker or even a publisher (however well intentioned)?

How do we enable the reader to keep a publication--defined as part of the Web--if/when the underlying technology (domain, URL, certificate, etc) changes under their feet?

See also #25.

Review WebPackage PWP use cases

Copied from https://tools.ietf.org/html/draft-yasskin-webpackage-use-cases-00#section-2.2.1

2.2.  Nice-to-have

2.2.1.  Packaged Web Publications

   The W3C's Publishing Working Group [7], merged from the International
   Digital Publishing Forum (IDPF) and in charge of EPUB maintenance,
   wants to be able to create publications on the web and then let them
   be copied to different servers or to other users via arbitrary
   protocols.  See their Packaged Web Publications use cases [8] for
   more details.

   Associated requirements:

   o  Indexed by URL: Resources on the web are addressed by URL.

   o  Signing as an origin: So that readers can be sure their copy is
      authentic and so that copying the package preserves the URLs of
      the content inside it.

   o  Downgrade prevention: An early version of a publication might
      contain incorrect content, and a publisher should be able to
      update that without worrying that an attacker can still show the
      old content to users.

   o  Metadata: A publication can have copyright and licensing concerns;
      a title, author, and cover image; an ISBN or DOI name; etc.; which
      should be included when that publication is packaged.

   Other requirements are similar to those from Offline installation:

   o  Random access: To avoid needing a long linear scan before using
      the content.

   o  Compress stored packages: So that more content can fit on the same
      storage device.

   o  Request headers: If different users' browsers have different
      capabilities or preferences, the "accept*" headers are important
      for selecting which resource to use at each URL.

   o  Response headers: The meaning of a resource is heavily influenced
      by its HTTP response headers.

   o  Signing uses existing TLS certificates: So a publisher doesn't
      have to spend lots of money buying a specialized certificate.

   o  Cryptographic agility: Today's algorithms will eventually be
      obsolete and will need to be replaced.

   o  Certificate revocation: The publisher's certificate might be
      compromised or mis-issued, and an attacker shouldn't then get an
      infinite ability to mint packages.

If you feel one of these warrants its own (possibly lengthy) discussion, please create a new issue for it.

Additionally, there may be other use cases in this document that relate to PWP. Please surface those as/if/when you find them.

Thanks!
🎩

What Packaging Format/Style Should a PWP Use?

PWP will require the selection of some sort of packaging format in order to be a Packaged Web Publication.

Some options currently under consideration include, but are not limited to:

All of these have pros and cons. For example, Web Packaging is not finalised, the CBOR specification precludes inclusion of a general compression scheme (although one could add one on top of CBOR), and SQLite is not a standard of a recognised body.

ESCAPE Workshop

The aim of the ESCAPE workshop is to "deep dive on the ramifications of Web Packaging for the web". We are asked to express our "group's perspective and needs in case they can inform the debate".
We can also submit a position paper.

Ace cannot unzip some EPUB

Ace crashes when processing some EPUBs with the following messages:

HL018125:Accessibilité laudrain$ ./ace-check.sh NouvellesOrdinaires3décembre1678
verbose: ACE cwd=/Users/laudrain/Documents/Fichiers/eBook/Accessibilité, outdir=NouvellesOrdinaires3décembre1678-ace, tmpdir=NouvellesOrdinaires3décembre1678-ace, verbose=true, silent=false, jobId=
info: Processing NouvellesOrdinaires3décembre1678.epub
verbose: Extracting EPUB
error: Failed to unzip EPUB
error: Unexpected error: invalid comment length. expected: 7. found: 0
debug: Error: invalid comment length. expected: 7. found: 0
at /usr/local/lib/node_modules/ace-core/node_modules/yauzl/index.js:116:25
at /usr/local/lib/node_modules/ace-core/node_modules/yauzl/index.js:473:5
at /usr/local/lib/node_modules/ace-core/node_modules/fd-slicer/index.js:32:7
at FSReqWrap.wrapper [as oncomplete] (fs.js:665:17)
info: Closing logs.

This epub is available here:
http://www.lectura.plus/Block/download/?id=10659

It is one of several produced by a French association working on old press issues:
http://www.lectura.plus/848-presse-ancienne-accessible.html

These EPUBs can be unzipped and also read in Readium...
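The error message suggests a mismatch in the ZIP end of central directory record: the comment length field appears to say 0 while 7 bytes follow the record. That reading of the yauzl message is my assumption, not something confirmed by the log. A small script like the one below, with a hypothetical function name, can check a file for this condition without any third-party library.

```python
import struct
from typing import Tuple

EOCD_SIGNATURE = b"PK\x05\x06"  # end of central directory record signature
EOCD_FIXED_SIZE = 22            # size of the record without its comment


def eocd_comment_lengths(data: bytes) -> Tuple[int, int]:
    """Return (declared comment length, bytes actually following the record)."""
    pos = data.rfind(EOCD_SIGNATURE)
    if pos < 0 or pos + EOCD_FIXED_SIZE > len(data):
        raise ValueError("no end of central directory record found")
    # The 2-byte little-endian comment length sits at offset 20 of the record.
    (declared,) = struct.unpack_from("<H", data, pos + 20)
    actual = len(data) - (pos + EOCD_FIXED_SIZE)
    return declared, actual
```

Usage would be along the lines of `eocd_comment_lengths(open("book.epub", "rb").read())`, where the file name is a placeholder. If the two numbers differ, the file has trailing bytes that lenient unzip tools ignore but a strict reader rejects, which would be consistent with Readium opening the file while Ace fails.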
