w3c / wpub

W3C Web Publications

Home Page: https://w3c.github.io/wpub/

License: Other

HTML 86.85% JavaScript 12.61% CSS 0.49% Makefile 0.06%

digital-publishing publishing publ-wg

wpub's Issues

Minimum Viable Manifest

Ignoring issues such as location, serialization, etc., what is the minimum viable manifest?

I have extracted requirements from #6

identifier (tbd what id is) (required)
Identification as WP (required)
list of resources/at least one resource (required)
required order (required)
metadata (there are some discussions about what should be included/required in metadata, but that is a separate issue)
navigation/toc (optional? required?)
language (is this metadata?)
title (is this metadata?)
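As a rough illustration of the required items in the list above, a check could look like the following. The field names (`identifier`, `type`, `resources`, `reading_order`) are assumptions made for this sketch; the issue deliberately leaves serialization and naming open.

```python
# Hypothetical field names; nothing in this thread fixes a serialization.
REQUIRED_FIELDS = ("identifier", "type", "resources", "reading_order")


def missing_requirements(manifest: dict) -> list:
    """Return the required items that are absent or empty in the manifest."""
    return [name for name in REQUIRED_FIELDS if not manifest.get(name)]


candidate = {
    "identifier": "https://example.org/pub/",
    "type": "WebPublication",            # "identification as WP"
    "resources": ["ch1.html", "ch2.html", "style.css"],
    "reading_order": ["ch1.html", "ch2.html"],
}
```

Metadata, navigation, language, and title are left out of the check because their status is still an open question here.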

For more detail see
#6 (comment)
#6 (comment)
#6 (comment)
#6 (comment)

better explain lack of exhaustive resource list

Another side issue from #55 is to better explain why the resource list does not have to be exhaustive.

This used to make more sense when a secondary resource was defined as a subresource of a primary one, but as those definitions and terms are no longer in use what remains is a bit ambiguous.

I'm wondering if we can use a term like HTML5's critical subresources to help explain this, as well as call out the problems with scripting.

One issue that has stuck with me since the NYC meeting and that I was discussing with @dauwhe is whether json is really a vast improvement over xml for the average book/publication developer. If we consider the complexity of the epub package document a failure point, should we consider the merit of allowing a simpler alternative representation so that we don't switch from complaints about the idiosyncrasies of xml to those of json? (e.g., quoting, escaping, objects, arrays)

In particular, I'm wondering if some kind of markdown-like syntax wouldn't greatly improve life for the average publication author. The user agent could translate up to json.

The one concern is that such a representation may not be robust enough to express everything, but perhaps that's the dividing line between using a simplified syntax or the formal one.

Just food for thought, as this could also be done independently of the spec to simplify authoring.

Navigating a web publication

From @dauwhe on June 27, 2017 18:0

Readers need to easily navigate through web publications. Such navigation could be complicated by the reality that many web publications consist of many resources, likely in a defined order. EPUB reading systems typically provide UI to access a (required) table of contents from anywhere in the publication, and further provide simple UI gestures to move from document to document.

Should web publications require a comprehensive navigation document? How can we make it easy for users to get from document to document?

Copied from original issue: w3c/publ-wg#14

manifest: title

On the 2017-08-07 call, there was discussion about the minimum manifest, based on #15. Is a title required for the minimum manifest?

Root of locators for identifying/retrieving content within a Web Publication

The DPub IG identified use cases for being able to on-the-fly mint a locator that points to arbitrary segments of content (smaller granularity than a Primary / Secondary Resource) within a Web Publication, e.g., [1] [2] [3]. There is competing prior art.

First question (this issue): Should we take the EPUB CFI approach and mint locators that require parsing the Web Publication Manifest in order to resolve? Or should we take the Web Annotation approach of minting locators based on the URLs of the Primary / Secondary Resource containing the target? Or is there another approach we should consider?

CFI Approach: EPUB 3.1 uses Canonical Fragment Identifiers (CFI) as a way to help support use cases requiring a way to address arbitrary content within a publication. The EPUB Canonical Fragment Identifiers 1.1 [4] specifies that, "The process of resolving an EPUB CFI to a location within an EPUB Publication begins with the root package element of the Package Document." When used as a fragment identifier, CFIs are appended to the EPUB URL (e.g., book.epub#epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/3:10)), which is not necessarily the URL of the containing Primary or Secondary Resource. There are trade-offs to tying all such locators to the publication (rather than constituent resources) and relying on having to parse the manifest in order to resolve the CFI locator.

Web Anno Approach: An alternative approach [5] [6] is to model the location of arbitrary content within a Web Publication by relying on the URL of the Primary or Secondary Resource containing the content. This of course assumes that every Primary / Secondary Resource listed in a Web Publication manifest has a URL (a requirement that is not yet expressed in the most recent draft of our spec, but seems to be the direction we're heading - e.g., see issue #5 from last month). This has the immediate advantage (?) that the locator can be used independently of, and potentially persist beyond the life of, the Web Publication, assuming of course that the Primary/Secondary Resource involved persists. Note, the Web Annotation data model provides an optional way (scope attribute) that we might be able to use to also include the URL of the Web Publication involved, in cases where this might be useful.
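To make the contrast concrete, here is a minimal sketch of the two locator shapes. The function names are invented for illustration. The selector type and the `source`, `selector`, and `scope` keys follow the Web Annotation data model, but the quoted text and example URLs are placeholders.

```python
def cfi_locator(publication_url: str, cfi_path: str) -> str:
    """CFI style: a fragment on the publication URL, resolved via the manifest."""
    return f"{publication_url}#epubcfi({cfi_path})"


def web_anno_locator(resource_url: str, quote: str, publication_url=None) -> dict:
    """Web Annotation style: the resource's own URL plus a selector."""
    target = {
        "type": "SpecificResource",
        "source": resource_url,
        "selector": {"type": "TextQuoteSelector", "exact": quote},
    }
    if publication_url is not None:
        # Optional: records which publication the resource was read in.
        target["scope"] = publication_url
    return target
```

Note that only the first form requires the publication to be parsed before the target can be found. The second stays resolvable as long as the resource URL does.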

Neither the CFI approach nor the Web Annotation approach alone covers all use cases, but both models are extensible and there is actually significant overlap and some similarities between the two approaches. We probably could extend either approach to meet our needs. I'm working on a table to compare the approaches by generic use case supported.

So, in terms of minting locators for content within Web Publications, which approach should we use?

Is there already consensus in the Working Group that resolving locators for Web Publication content based on parsing the Manifest is the wrong strategy for Web Publications? (Full disclosure, I'm partial to the Web Anno approach, in part because I suspect it requires less work to adapt for our use.)

[1] - https://www.w3.org/TR/pwp-ucr/#identify_const_resources
[2] - https://www.w3.org/TR/pwp-ucr/#random-access
[3] - https://www.w3.org/TR/dpub-annotation-uc/
[4] - http://www.idpf.org/epub/linking/cfi/epub-cfi.html
[5] - https://www.w3.org/TR/selectors-states/
[6] - https://www.w3.org/TR/annotation-model/

avoiding resource declaration duplication

An unresolved issue from #55 that we'll need to return to in the future is resource declaration. We have a resource list and a default reading order, where the latter is a subset of the former.

Some choices include:

having two separate lists with no duplication, as user agents will be aware those in the default reading order are necessary
requiring the user agent add the default reading order resources to the infoset list of resources
requiring the publisher duplicate the resources across lists in the manifest
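The second choice above (the user agent folds the default reading order into the infoset's resource list) might be sketched as follows. The function name and the plain-URL representation of resources are assumptions for illustration only.

```python
def infoset_resources(resource_list, reading_order):
    """Merge both manifest lists into one duplicate-free infoset list.

    Reading-order entries come first so their relative order is preserved.
    """
    seen = set()
    merged = []
    for url in [*reading_order, *resource_list]:
        if url not in seen:
            seen.add(url)
            merged.append(url)
    return merged
```

With this behaviour a publisher could list each resource once, in whichever list fits, and the merged result would be the same as if they had duplicated entries.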

is a canonical identifier necessary

This issue was raised in #56 by @rdeltour at #56 (comment). Splitting out here for future resolution.

I'm still failing to see why we need a "canonical identifier" in the first place, even a loosely-defined one as proposed by @iherman and prosed by @mattgarrish.

It seems to me that at the minimum we will have:

the publication’s URL (however that will be defined)

ways to express "identifier" properties (e.g. a dc:identifier property in metadata, wherever that will be allowed).

ways to declare links with specific semantics (link@rel), to express for instance a "canonical link" as defined in RFC6596.

That some systems use this or that as a "canonical" identifier, or require some uniqueness in some specific context, is totally up to implementers. In other words, I don't think the spec even needs to include the term "canonical identifier". At all.

Web Publication affordances examination

So far we've mostly been focused on what we'll be encoding into a Web Publication.

The Web Publications Use Cases and Requirements document (i.e. "where this all started") mentions several Web Publication affordances.

If you're new to the word "affordances" (as apparently Firefox's spell check is...was! 😉) here's a quick definition from https://en.wikipedia.org/wiki/Affordance

An affordance is the possibility of an action on an object or environment.

Here's a list of the affordances that I gleaned from the UCR document above:

WP affords non-textual experiences

2.1.3 Alternative modalities
- The notion of a Web Publication should enable specific publications like audio books, graphics books, and mixed media.
2.1.4 Time-based Media and Text
- A Web Publication needs to support both time-based media and text.
2.1.5 Inclusion of Data
- Web Publications should be able to include data as resources, just as it does with text, images, etc.

WP affords offline access/readability

2.1.6 Going Offline
- A Web Publication should also be available offline.

WP affords containment (?)

2.1.7 Single Unit
- User agents must treat a Web Publication as a single logical resource with its own URL, beyond the references to individual, constituent resources.

WP affords "paging" through a publication

2.1.10 Pagination
- It should be possible to see the Web Publication in a “paginated” view.

WP affords personalization of experience

2.1.11 Personalization
- The user must have the possibility to personalize his or her reading experience.

WP affords guided navigation

2.2.1 Default Reading Order
- There should be a means to indicate the author’s preferred navigation structure among the resources of a Web Publication.

WP affords filtered navigation/reading experience

2.2.2 Random Access to Content
- Authors of a Web Publication should be able to provide the user agent with information to access random parts of the publication.
- ("random" here has nothing to do with randomness...just author-stated sub-set of content)

WP affords restricted access

2.2.4 Access Control and Write Protections
- A Web Publication should be able to express the access control and write protections of the publication.

WP affords informed action

2.2.5 Metadata and Resources
- Web Publications should include technical and descriptive metadata as well as any additional characteristics of the constituent resources.

There are likely more, or perhaps I've overstated the "action-able-ness" of some of these existing use cases--and while still things a WP must needs do, they may not result in an action being taken.

Thinking about affordances may help us as we consider our relationship to user agents, UX, and reading experiences.

Cheers!
🎩

Do we need definitions for identifier and locator?

Based on Monday's call and several issue threads (notably /wpub #11 and /publ-wg #21) we seem to have less than consensus about the difference between an identifier and a locator, and if there is a distinction, about whether the nature of this difference matters or is useful to highlight in the context of Web Publications (WP). Personally I think we need working definitions of identifier and of locator in the WP terminology section. Below are my draft working definitions. But since it's unclear if the WG feels it necessary to distinguish between locator and identifier, before moving forward with a Pull Request, I would appreciate feedback about whether we need these definitions (as well as on the proposed wording). As a way to elaborate on the practical meaning of these definitions and to highlight the distinctions, I include comments below the proposed definitions. These comments are not intended for the terminology section of the spec (and might still need discussion), but may be helpful in drafting other sections.

Identifier
An identifier persistently identifies a WP or WP resource. A WP or WP resource MAY be identified by more than one identifier, but an identifier MUST NOT identify more than one WP or WP resource. Different versions or editions of a WP { SHOULD? MAY? } have distinct identifiers. Some identifiers MAY also serve as locators. URIs [[rfc 3986]], IRIs [[rfc 3987]], URNs [[rfc 8141]], DOIs, ISBNs, and PURLs are all examples of identifiers frequently used in publishing.

Locator
A locator is a URL that can be used to locate and retrieve a WP or WP resource, subject to authentication / authorization and similar access limits mediated by HTTP or other protocols. Locators MAY rely for retrieval functionality on HTTP redirection and/or intermediate pages, e.g., a "landing page." Unlike identifiers, the locators associated with a WP or WP resource MAY change. Locators that persistently and unambiguously identify a single WP or WP resource are also identifiers.

Comments about identifiers: Identifiers are not required, but generally { SHOULD? MAY} be assigned to a WP. An identifier once assigned may not be reused in the future to identify a different WP. A single identifier cannot identify both a part of a WP and the WP as a whole; it must identify only one or the other. Some WPs and WP resources MAY be available in multiple representations (e.g., multiple serializations); an identifier identifies independent of representation. Some identifiers are also locators, but not all. Though URIs and URLs now share a common syntax, a distinction still made in this context (and in some other contexts) is that URIs identify while URLs locate. Other kinds of identifiers (e.g., DOIs) can be mapped (usually through services) to locators. Other kinds of identifiers (e.g., an ISBN) identify only, i.e., are not immediately useful for locating an online representation of a WP or WP component.

Comments about locators: Locators are required. The serialization, formatting and even byte-level content of a representation retrieved when dereferencing a URL may be fixed or may depend on content negotiation, the client device used to resolve the URL, the client IP address and/or other HTTP headers accompanying the Request, e.g., an editor using a browser or other tool that provides a HTTP Authorization header MAY see a different view than an anonymous reader. In some serializations, the resource (e.g., manifest) requested using a locator MAY be embedded within and conflated with another resource, e.g., a manifest conflated within the HTML representation of a WP's primary resource.

Machine-Readable navigation

@HadrienGardeur wrote in #6

Right, but I think a lot of the arguments in favour of including all machine-readable navigation in HTML are misguided:

I agree with you - though for different reasons. I believe (and correct me if I am wrong) that you actually want machine readable navigation - something akin to the NavDoc in EPUB. I, on the other hand, don't want it at all. If an author wishes to provide navigation in their document - they can build it using the same tools they build all other content. The UA doesn't need to know anything about it - except perhaps where it lives (done via something like dpub-aria TOC role)

Default natural language

At present, the default is English in the WP draft. But it should be unknown, as in HTML5.

manifest: requirements for offline

from #15 (comment) by @bduga

There is no way to crawl a script and find all the resources it might use or cause to be used. If all secondary resources are not listed, it is not possible to cache a WP offline. I think we either list all resources, or abandon deterministic offline caching.

To be discussed: which resources should be listed in manifest

Should some additional metadata items move to core infoset requirements

Section 3.10 of the current draft lists a number of additional metadata items as "should"-s. These may have to be considered as core and listed among the core infoset requirements.

Multiple-origin web publications

Issue #5 included lots of discussion of whether multiple-origin web publications should be allowed, and what some of the problems might be. Let's have that particular discussion here.

I believe there is wide agreement that we shouldn't restrict things like web fonts or scripts to be same-origin as the WP, given such things are so common on the web. So the issue is more about what old epub folk might call "spine items".

What makes primary resources "primary"

Given the new definitions for primary and secondary resources, what gives primary resources any primacy?

Being in the default reading order doesn't make them materially any different from top-level content that the user can reach in other ways.

Something like EPUB's distinction of "spine resources" might be less confusing for times when we need to speak specifically about resources in the default reading order.

separate implicit information from failure handling

I find the current approach of making publications valid by having user agents fill in holes worrisome. It will make validation for authors that much harder, because there will be no indication that information is missing, or clarity of what it is.

Can we start being more specific about what information an author can intentionally omit, in other words?

Reliance on implicit computation will never be perfect, but as an example here's how the title section could read:

The Web Publication's infoset requires a title.

An author MAY omit the title from the manifest only when the first resource in the default reading order contains a non-empty title element (e.g., an HTML or SVG document). In this case, the user agent MUST use this title as the title of the Web Publication.

In the case of an invalid Web Publication that does not contain a non-empty title, the user agent MUST provide one. This specification does not mandate how the title is computed. The user agent might:

use the non-empty title of a subsequent primary resource in the default reading order;

provide a language-specific placeholder title (e.g., 'Untitled Publication');

use the URL of the manifest;

calculate a title using its own algorithms.
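The proposed wording could be read as the following procedure. This is only a sketch: the function name, the return shape, and the particular fallback order for invalid publications are assumptions, since the text explicitly does not mandate how the fallback title is computed.

```python
def publication_title(manifest_title, reading_order_titles, manifest_url):
    """Return (title, is_valid) for a Web Publication.

    reading_order_titles holds the <title> text of each resource in the
    default reading order, or None where a resource has no title.
    """
    if manifest_title and manifest_title.strip():
        return manifest_title.strip(), True
    # Valid omission: the first resource supplies a non-empty title.
    if reading_order_titles and (reading_order_titles[0] or "").strip():
        return reading_order_titles[0].strip(), True
    # Invalid publication: the user agent still has to provide something.
    for title in reading_order_titles[1:]:
        if title and title.strip():
            return title.strip(), False
    return manifest_url or "Untitled Publication", False
```

Returning the validity flag alongside the title keeps the two concerns apart: a validator can report the missing title even though the user agent has something to display.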

(Personally, I'd rather a title only be implied if there is exactly one resource in the default reading order, but can live with taking the title from the first resource.)

Picking a language

The current fallback algorithm states that if the language cannot be determined, set it to English.

I don't like this approach, as it's a random choice. As stated in BCP 47, if there has to be a language and one cannot be determined, use "und" (undetermined).

If we're picking up language codes from WP resources, it would also be good to be clear that a conforming language code is expected, not just a language designation like "English".
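A minimal sketch of that fallback, assuming a deliberately simplified well-formedness check: the pattern below only accepts a 2-3 letter primary subtag followed by optional subtags, which is narrower than the full BCP 47 grammar. The function name is hypothetical.

```python
import re

# Simplified shape check, NOT the complete BCP 47 grammar.
_SIMPLE_TAG = re.compile(r"^[A-Za-z]{2,3}(-[A-Za-z0-9]{1,8})*$")


def publication_language(candidate):
    """Return the candidate if it looks like a language tag, else 'und'."""
    if candidate and _SIMPLE_TAG.match(candidate.strip()):
        return candidate.strip()
    return "und"
```

Under this check a designation such as "English" is rejected and replaced with "und", while "en" or "en-US" passes through unchanged.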

Language of web publication v. language of manifest/resources

According to the infoset, a language should be specified, but it's not entirely clear where the distinction is between the language of the manifest and the language of the publication.

If they are the same thing, it's not possible to express multilingual publications.

If they are not, how is each defined?

For context, see the discussion in PR #51 starting here.

For manifest in FPWD: Should Natural Language be Required per WCAG 2

https://rawgit.com/w3c/wpub/manifest-consensus-proposal/index-manifest-proposal.html#abstract-versus-concrete-manifest

What does "Reading Order" mean in the context of a Web Publication?

Is it merely the action of some supposed "next" and "previous" interface elements?

Is it required to be publication-wide? Or can it be contextual (choose-your-own-adventure)?

If it's contextual, must there only be one "next"?

This relates to musings on #36, but I think has value all on its own.

I'm also very keen to get input from a wide(r) range of people including those who use Assistive Technology (AT) when reading/listening.

manifest embedded, linked, both?

Should the manifest be in an external file, embedded in a specified manner, or should either option be allowed?

First-class citizenship of resources in PWPs

For the symmetry of PWPs and WPs, all resources in PWPs should be first-class citizens of the web. Thus, any such resource-in-PWP should be addressable by URIs.

Moreover, fragment identifiers should not be used. Use of fragment identifiers implies second-class citizens (in the terminology of RFC 3986, "secondary resource").

Who is responsible for a publication's user interface?

From @dauwhe on June 12, 2017 16:36

The web is a free-for-all, with design and interaction nearly entirely controlled by the site author.

But ebook reading systems typically control how the user interacts with a publication, and sometimes even change the appearance of publication content.

Will web publications need to provide their own user interface? Will they need to be designed to support both paradigms? What might this mean?

Copied from original issue: w3c/publ-wg#2

Do all documents in the reading order have to be reachable from the ToC

edited to better describe the scope

There is a consensus that a Web publication must have a reading order (a list of primary resources) and must/should have a table of contents (ToC) (the main navigation entry point).

Scope

The goal of this issue is to debate whether accessibility best practices require that all the primary resources should be reachable from the ToC, or not. This is about conceptual relationship between the ToC and reading order.

In other words, the objective is to provide answers to the following questions (the list can be edited as the discussion goes on):

are there legitimate use cases where a primary resource would not be listed in the ToC?
if a primary resource is not listed in the ToC, does it violate WCAG SC 2.4.5?
if not, and if the ToC is incomplete, is the list of primary resources equivalent to a "site map" (i.e. does it satisfy WCAG technique G63)?
ultimately, how does that translate into the spec (must, should, note, a11y guidelines, etc.)?
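Whatever the answers turn out to be, the relationship under debate is easy to test mechanically. The sketch below assumes the reading order and the ToC links are both available as lists of URLs; the function name is hypothetical.

```python
from urllib.parse import urldefrag


def unreachable_from_toc(reading_order, toc_links):
    """List primary resources that no ToC entry points to.

    Fragments are ignored, so a ToC link to ch1.html#s1 counts as ch1.html.
    """
    linked = {urldefrag(href).url for href in toc_links}
    return [url for url in reading_order if url not in linked]
```

A non-empty result is exactly the case the questions above ask about: primary resources that exist in the reading order but cannot be reached from the ToC.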

Out of scope

This issue is not about discussing:

the format details
how to infer one structure from another (as in issue #36)

As @baldurbjarnason pointed out:

we need to start off by specifying a format that supports these as independent structures before we specify how you'd use one type of structure to generate another. And we should be explicit in the specification that these conversions can fail for a variety of reasons.

References

Issue #38 "What does “Reading Order” mean"
Issue #36 "Is the ToC sufficient to provide reading order?"
Issue #26 "Should the manifest be an implicit ToC?"
WCAG SC 2.4.5 "Multiple ways"
Understanding WCAG SC 2.4.5

For manifest in FPWD: Should manifest TITLE be Required per WCAG 2?

https://rawgit.com/w3c/wpub/manifest-consensus-proposal/index-manifest-proposal.html#abstract-versus-concrete-manifest

Naming the media overlay work

The WG discussed, and plans to discuss further, work on media overlays. On the WG call of 2017.09.18 the term SMIL was used to denote the work. We may want to choose another term...

manifest: metadata

Per issue #15, there is discussion about whether the minimum manifest must include any metadata.

I am of the opinion that a small subset of metadata is either a MUST or can be implicit. See also #15 (comment)

Should linking from manifests support URI templates?

URI templates (RFC 6570) may be a useful feature to make external references easier to maintain.
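For a sense of what this would involve, here is a minimal sketch of simple `{var}` expansion only. It does not cover the operators or modifiers that URI templates also define, and the function name and example URL are placeholders.

```python
import re
from urllib.parse import quote


def expand_simple(template: str, variables: dict) -> str:
    """Expand {name} expressions, percent-encoding each substituted value."""
    def substitute(match):
        value = variables.get(match.group(1))
        # Undefined variables expand to the empty string.
        return "" if value is None else quote(str(value), safe="")

    return re.sub(r"\{([A-Za-z0-9_]+)\}", substitute, template)
```

A manifest could then carry one template such as `https://example.org/pub/{id}/chapter/{n}.html` instead of a separate absolute URL for every external reference.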

Should the manifest be an implicit TOC?

Should the TOC be a separate HTML file or is the listing of primary resources in the manifest an implicit TOC?
See #2

Replace the "infoset" term?

The term can be confusing: it can't be looked up in the dictionary, and can be confused with the XML Infoset.

Can we use something else?

MUST the manifest include information about secondary resources or not?

This has been discussed in different other issues, and it is probably better to separate this as an explicit question to be followed.

Remixing Content

@HadrienGardeur wrote on #6

Also, I'd like to have the ability to remix content on the Web. If I have zero write-access to the content that I'd like to remix within a Web Publication, there's no way I'll be able to include such a link in HTML or HTTP.

Remixing isn't one of the original use cases (though collation/combining was). But assuming we do want to address it, it's going to be a huge set of hurdles to cross with respect to security and other considerations. And that's just in the WP world - PWP just ups the ante...

non-linear resources - primary, secondary or something else?

An issue I'm sure no one is looking forward to, but where do "non-linear" resources fall into the publication hierarchy?

For reference, non-linear is what epub calls documents that contain supplementary information not part of the primary narrative. In EPUB, they were marked as non-linear and required to be in the spine, but whether they were suppressed when reading the book or not was reading system dependent.

Is WP going to define something similar, or can the reading order be defined without including supplementary documents in the reading order?

If they can be excluded, these will be neither primary nor secondary resources, but something else.

If they can be excluded, does the author have to provide previous/next page navigation?

Document Collection Interface

From @dauwhe on June 27, 2017 17:12

Our White Paper talks about the need for an interface and API for a collection of web resources. So really all we have to do is figure this out:

interface Collection : Node {
  [SameObject] readonly attribute DOMImplementation implementation;
  readonly attribute USVString URL;
  readonly attribute USVString collectionURI;
  readonly attribute USVString origin;
  readonly attribute DOMString compatMode;
  readonly attribute DOMString characterSet;
  // …a few missing details :)
};

Perhaps something like this, more than anything else we are discussing, is what it means for publications to be a first-class citizen of the web.

Copied from original issue: w3c/publ-wg#13

Relationships to the Web App Manifest specification.

In case the (concrete) manifest is expressed in JSON (see issue #7), should it be defined “on top” (i.e., as some form of an extension) of the Web Application Manifest specification, or should it be a fully separate specification?

previous and next via link relations

As we talk about the reading order of a web publication, we should at least acknowledge that HTML already has link relationships to describe the previous and next documents in a sequence.

A reader could even navigate to the next document by pressing the space bar in pre-Blink Opera. I believe Firefox had UI as well, but it was removed. Various browser extensions still provide this functionality.
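As a sketch of how little a user agent or extension needs in order to pick these relations up, the following uses Python's standard `html.parser`. The class and function names are invented for illustration, and only the first `prev` and `next` found in a document are kept.

```python
from html.parser import HTMLParser


class SequenceLinks(HTMLParser):
    """Collect the first rel=prev and rel=next targets in a document."""

    def __init__(self):
        super().__init__()
        self.links = {}

    def handle_starttag(self, tag, attrs):
        if tag not in ("link", "a"):
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        # rel is a space-separated list of link types.
        for rel in (attributes.get("rel") or "").lower().split():
            if rel in ("prev", "next") and href and rel not in self.links:
                self.links[rel] = href


def sequence_links(html: str) -> dict:
    parser = SequenceLinks()
    parser.feed(html)
    return parser.links
```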

Define relationship with WCAG

Several issues refer to WCAG requirements. What is the relationship of Web Publications to WCAG? In particular, is a Web Publication something more than a "set of web pages"?

See:

#39 (comment)
#39 (comment)
next/previous links (technique G125)
a table of contents (technique G64)
an index of all the pages ("site map") (technique G63)
a search function (technique G161)

(thanks to @rdeltour for creating the list for me) cc/@Ryladog

i18n of metadata values, in case JSON is chosen as a serialization format

A question was raised during the first F2F meeting in NYC, about the proper internationalization of UTF-8 metadata values (eg. the book title).

I'll quote Ivan, from the minutes: "On the i18n side, we will need to be careful about ids, uris, iris, etc. w/respect to i18n char-sets. Another area we need to be careful about is metadata, which also have issues with the char-sets for the actual text content. One example is mixing bidi text in the metadata content."

Depending on the serialization format used for expressing publishing metadata, the issues we face may be different. But if a JSON (JSON-LD?) format is chosen, what issues might we face, and what i18n solutions do other W3C WGs recommend?

Is WP enhanced web content or a special implementation of the web?

This question keeps bothering me as I try to understand whether WP is plain old web content that some user agents can enhance, or a special kind of web content that needs a certain class of user agent to consume.

Unless there is an expectation that all user agents (including browsers) will support the WP enhancements how do we deal with usability/accessibility? Does every publication have to be designed like a regular web site if the author wants it to be usable/accessible to anyone accessing it? Will it have links to the previous and next page if it consists of more than one document, links to the table of contents on every page, etc. in order for users to be able to read it?

If we do expect reasonably universal support, then what are the additions we need to avoid a web interface? (automatic next/previous page navigation, access to a table of contents)

It's hard to know what usability/accessibility look like at the WP level without having a clear idea which scenario is expected.

Obtaining language from http headers

A side-issue raised in #42 is whether http headers can be a fallback for obtaining the language of the publication.

Leaving for future discussion.

Information content of the abstract manifest

From @dauwhe on June 27, 2017 14:33

What information is required for an abstract manifest? [edited to add items from comments]

An identifier for the web publication, which should be a URL
Some way of saying that this URL represents a web publication.
Some way of identifying the constituent resources of the web publication.
Some way of providing a preferred order of (some of) the constituent resources in case there is more than one
Some way of being able to add more complex metadata to a publication. (Not clear to my mind whether we would define a minimally required set of metadata, but the slot should be there.)
Locating table of contents or other navigation structure

What else? I think we should distinguish required information from "nice to have" information.

Copied from original issue: w3c/publ-wg#12

Associating a manifest with publication resources

This has come up in several different threads, so it seems worthwhile to give it its own issue.

If we have a collection of information about a web publication as a whole ("manifest") that exists separately from most of the publication's resources, we need to find a way to associate the manifest with the other publication resources. This is necessary because we envision the manifest affecting the rendering and behavior of said web resources.

As usual, the web has already faced this problem. For example, we often use external stylesheets to control the rendering of content documents, and it's quite convenient to not have to duplicate large quantities of CSS in each HTML resource. So we use link rel="stylesheet" to associate a CSS file with an HTML file.

Web App Manifests uses exactly this mechanism to associate a manifest with a resource. But Web App Manifests also has the idea of applying a manifest to a web app:

A manifest is applied to a top-level browsing context, meaning that the members of the manifest are affecting the presentation or behavior of a browsing context.

What's interesting is that, once the manifest is associated with a particular content document, and applied to a top-level browsing context, that manifest can then affect other content documents within the scope of the manifest.

I think we have some tough choices here. Is this seemingly reasonable desideratum even possible?

If I navigate to a component of a WP, I should be able to discover that it's a WP and find the manifest.

Is the ToC sufficient to provide reading order?

Relates to both #26 "Should the manifest be an implicit TOC?" and the more recent #35 "Proposal: an HTML-first Table of Contents approach to Web Publication."

Proposal: an HTML-first Table of Contents approach to Web Publication

@dauwhe and I have been working on a proposal to use HTML's <nav> element as a web publication manifest: https://github.com/dauwhe/html-first

TL;DR: define the primary resources of a WP to be the files referenced in the first <nav> element of an "index" file. This file would also host WP metadata.

We feel this approach has many benefits:

Human-focused. User agents need a list of primary resources and their default ordering, but so do actual users. Most web publications would benefit from a human-readable table of contents. TOCs are crucial for accessibility.
Simplicity. Given the broad need for a TOC, using that as manifest is a straightforward way to avoid duplication (as in EPUB's nav/manifest/spine/ncx). And we've discovered a huge benefit, as we don't need a list of secondary resources to facilitate offline caching via service workers (see the demo books)!
Ubiquity. Everyone in the web space is already familiar with HTML, and there is a large and mature ecosystem around authoring, rendering, and validating HTML.
Expressiveness. HTML's language and styling support allows for a richer experience for humans.
Progressive enhancement. Existing web user agents know what to do with HTML.
A Path to the future. Every EPUB3 has a nav document. Many "web books" already use such a design pattern.
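
To make the idea concrete, here is a minimal Python sketch of how a user agent might derive the ordered list of primary resources from the first <nav> element of an index file. It is not taken from the html-first repository; the names FirstNavLinks and primary_resources are hypothetical, and dropping fragments and duplicate URLs is an assumption about how the default reading order would be built.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class FirstNavLinks(HTMLParser):
    """Collects href values of <a> elements inside the first <nav> only."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # nesting depth inside the first <nav>
        self.done = False  # True once the first <nav> has been closed
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "nav":
            self.depth += 1
        elif tag == "a" and self.depth > 0:
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "nav" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True


def primary_resources(index_html, index_url):
    """Return absolute URLs of primary resources in document order."""
    parser = FirstNavLinks()
    parser.feed(index_html)
    ordered = []
    for href in parser.hrefs:
        # Assumption: links into the same file count as one primary resource
        url, _fragment = urldefrag(urljoin(index_url, href))
        if url not in ordered:
            ordered.append(url)
    return ordered
```

Any later <nav> elements in the index file are ignored, so a second navigation aid (a list of figures, say) would not change the reading order under this sketch.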

Note we've created a couple of demo books that work offline, based on the HTML manifest.

Thanks,
🎩 and 🐳

WP-dependent URIs of resources

We might want to give a resource one URI in the context of one WP and a different URI in the context of another WP. Such URIs would allow the "next primary resource" to differ depending on the WP in which the resource is being read.

Requirements for a WP title

Issue #20 addresses the question of whether a minimal viable manifest requires a title. This issue separates that out from questions about the manifest: regardless of whether the title is encoded in a manifest or not, what are the minimum requirements for a WP's title?

This is not necessarily a blocker for first public working draft.

Proposal

A WP requires a title.
Sufficient: a title is defined somewhere in the WP's metadata.
Fallback: a WP contains a single primary resource with a title element appropriate for its type (e.g. SVG, HTML titles), and that becomes the WP's title.
Fallback: a WP contains multiple primary resources, each with a title element appropriate for its type (e.g. SVG, HTML titles). The first one becomes the WP's title.
- This one is imperfect from a usability and accessibility standpoint, but is an adequate fallback.
A WP contains no title in its metadata and no primary resources with a title element appropriate for their type (e.g. SVG, HTML titles). This is non-conformant.
- A URL is not a fallback title
- A filename is not a fallback title
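
The fallback chain in this proposal can be sketched as a small Python function. This is only an illustration under assumptions: the function name resolve_title is hypothetical, the caller is assumed to have already extracted the per-resource titles in reading order, and skipping primary resources that lack a title is one possible reading of "the first one".

```python
def resolve_title(metadata_title, primary_resource_titles):
    """Pick a WP title using the proposed fallback order.

    metadata_title: title from the WP's metadata, or None.
    primary_resource_titles: titles of primary resources in reading
        order, with None for resources that have no title element.
    """
    candidates = [metadata_title] + list(primary_resource_titles)
    for candidate in candidates:
        # Empty or whitespace-only strings do not count as titles
        if candidate and candidate.strip():
            return candidate.strip()
    # Neither URLs nor filenames are accepted as fallbacks
    raise ValueError("non-conformant: the WP has no title")
```

For example, resolve_title(None, [None, "Loomings"]) returns "Loomings", while a WP with no usable title anywhere raises an error instead of falling back to a URL or filename.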

Rationale

Discovery / Organization
Accessibility: EPUB Accessibility uses title in its example of how accessibility needs to apply to a publication as a whole, not merely the component documents:

Consequently, when evaluating the accessibility of an EPUB Publication, individual pages — or Content Documents, as they are known in EPUB nomenclature — cannot be reviewed in isolation. Rather, their overall accessibility as parts of a larger work also has to be evaluated.

For example, it is not sufficient for individual Content Documents to have a logical reading order if the publication presents them in the wrong order. Likewise, including a title for every Content Document is complementary to providing a title for the publication: the overall accessibility is affected if either is missing.

Authoring

An authoring tool can create a title, when one isn't provided, however it chooses. Just as Word used to make the first string of a text document into the doc title, and some blogging platforms used to make the filename into the alt text, an authoring tool can choose to enforce useful titles or put in less useful fallbacks.

Therefore what @dauwhe called 'documents' made in WP tools, and what @lrosenthol called 'ad hoc publications', need not impose onerous work on the user, provided the authoring tool supplies the title for them.

The canonical-ness of identification needs clarification

The current spec text states:

If assigned, this canonical identifier MUST be unique to the Web Publication.

Given that definition, I asked in this comment about what the "canonical identifier" would be for https://www.w3.org/TR/html/

Here's the list of identifiers in the currently published TR for HTML:

https://www.w3.org/TR/2016/REC-html51-20161101/ 
https://www.w3.org/TR/html51/ 
https://www.w3.org/TR/html/ 
https://w3c.github.io/html/

To my original comment, @iherman responded that:

The W3C considers https://www.w3.org/TR/html/ as THE identifier for the HTML standard

Which is correct if "Web Publication" in our current definition above refers to the conceptual thing called "the HTML standard."

However, if "Web Publication" refers to the thing that is currently published (i.e. the publication of "the HTML standard" on the web at the time of my retrieving it), then the canonical identifier would be https://www.w3.org/TR/2016/REC-html51-20161101/ which uniquely references the Web Publication I have in my browser right now.

The note about rel="canonical" doesn't really clear things up (sadly). RFC 6596 defines that link relationship as...

Designat[ing] the preferred version of a resource (the IRI and its contents).

Given that this is an author/publisher defined relationship, the "preferred version" is likely the latest one. But if that's the case, then what is the unique, authoritative identifier for the resource that I just got in my browser? Do we have a name for that yet (assuming that's not "canonical" in the current parlance)?

How do we identify a web publication and its components?

From @dauwhe on June 26, 2017 22:17

A Web Publication (WP) is a collection of one or more constituent resources, organized together in a uniquely identifiable grouping that may be presented using standard Open Web Platform technologies. A Web Publication is not just a collection of links— the act of publishing involves obtaining resources and organizing them into a publication, which must be “manifested” (in the FRBR sense) by having the resources available on a Web server. Thus the publisher provides an origin for the WP, and a URL that can uniquely identify that manifestation.

Perhaps the simplest possible answer to these questions is just a URL: https://www.example.com/MobyDick/ would both identify the publication and mean that everything whose URL starts with this is part of the publication.

So I guess that I’m looking for reasons to make this more complicated :)
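
For reference, the "just a URL" rule is simple enough to sketch in a few lines of Python. The function name in_publication is hypothetical, and the sketch assumes that scope means same scheme and host plus a path prefix ending at a "/" boundary; it ignores default-port normalization, query strings, and fragments.

```python
from urllib.parse import urlsplit


def in_publication(resource_url, publication_url):
    """Return True if resource_url falls under the publication's URL prefix."""
    pub = urlsplit(publication_url)
    res = urlsplit(resource_url)
    # Simplified origin check: scheme and host must match
    same_origin = (pub.scheme, pub.netloc.lower()) == (res.scheme, res.netloc.lower())
    # Compare on a "/" boundary so /MobyDick/ does not match /MobyDickens/
    base = pub.path if pub.path.endswith("/") else pub.path + "/"
    return same_origin and (res.path.startswith(base) or res.path + "/" == base)
```

Under this rule https://www.example.com/MobyDick/chapter1.html belongs to the publication, while a resource on another origin (a shared font or stylesheet, for instance) does not, which is one of the places where the simple answer may need complicating.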

Copied from original issue: w3c/publ-wg#10

The URI, URL, and URN of a web publication

This is another issue being extracted from #5, with the hope that we will focus on the matter at hand.

How do we identify and locate a web publication? We do not seem to have an agreement yet on whether a WP should be located by the URL of the manifest, or the URL of the first content document. I have a strong preference for the latter.