wicg / content-index

Explainer and spec for the Content Indexing proposal

Home Page: https://wicg.github.io/content-index/spec/

License: Other

content-index's Issues

Content Indexing & Media Feeds

There's some overlap between the Media Feeds API and Content Indexing.

The APIs seem to be solving different problems though. For one thing Content Indexing is more geared towards offline pages where the type can be hinted to the browsers, whereas Media Feeds is targeting video media with a more rigid type breakdown that varies from type-to-type.

I looked a bit into potentially merging the APIs, but there doesn't seem to be a nice way of doing that while still achieving both APIs' goals. It also seems to me that this is a bit like CacheStorage and IndexedDB, which are somewhat similar but can co-exist because they are different tools for different nails.

I'd like to get some further thoughts from people involved, since this is something that will likely come up in the standardization track.

@beccahughes - who's working on the Media Feeds API
@jeffposnick - who's aware of both APIs

Why have both id and launchUrl?

What's the point of the id field? Why isn't (the absolutised value of) launchUrl a primary key? If it's not and I'm misunderstanding something, maybe an example should be shown where there are non-unique launchUrl values.

Overlap with Web Manifests

What's the relation between this spec and Web Manifests?

There seems to be an awful lot of overlap, with the major differences being:

Web Manifests collect many resources together (e.g., an entire newspaper website), while Content Index applies to individual resources (e.g., single newspaper articles).
Web Manifests are discovered declaratively, while Content Index is used via scripting. However, both still need scripting to actually install the service worker for offline use. This allows Content Index to be kept in sync with the state of the offline cache, so that only items that are actually available (and useful) offline are shown to the user.
Web Manifests don't have a way to explicitly indicate that they're useful offline; while many will be, there will inevitably be some that strictly need the network (e.g., real-time multiplayer games).

Some sort of compare-and-contrast probably belongs in the spec, as well as notes on why extending Web Manifests (e.g., with an offline boolean field) wouldn't be as good.

I also think that it could lift some design directly from the Web Manifest spec, which has basically the right approach to responsive icons (#2) and, less perfectly but still better, to categories (#7) and i18n (#6).

Should access to the index be restricted to top-level contexts?

In other words, should an iframe have the ability to register content?

Verify that content is really available offline

What about sites that register things that are not actually available offline?

Why only one icon type/size? (look at web app manifest)

Potential as a spam vector

This is a potential spam vector. Browsers should be prepared to deal with malicious websites registering a whole pile of bogus resources to fill up the offline content discovery interface. This isn't a problem with the spec per se since this is in the realm of implementation decisions.

I think something like being able to manually hide all registrations from a particular registrable domain will be needed, as well as whatever automatic anti-spam mechanisms get put in place - probably central registrable domain nolists.

I think this will warrant a note in the security section, if only so that it's on implementers' radars.

Websites can give usefulness hints

The website may have a very good idea of what resources the user will want to access - e.g., the next chapter of the book they're reading, rather than some randomly selected one according to general-purpose browser heuristics.

Giving the website a way to indicate likely usefulness for registered resources, and suggesting (but of course not requiring) that the browser use that as an input, would likely be a good idea. Note that this should only affect the ordering of resources shown within a single registrable domain.

Concretely, this might look like a real-valued weight field being added.

This is also distinctly optional and not needed for an MVP, but I do think it could be quite useful.
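
As a rough illustration of the idea, here is a minimal sketch of how a browser might order one site's entries by such a hint. The `weight` field is hypothetical (it is the suggestion made in this issue, not part of the spec), and `orderByWeight` and the sample entries are made up for the example.

```javascript
// Hypothetical: `weight` is the field proposed above, not a spec'd member.
function orderByWeight(entries) {
  // Copy first so the caller's array is not mutated. Entries without a
  // weight are treated as 0 and keep their relative order (sort is stable).
  return [...entries].sort((a, b) => (b.weight ?? 0) - (a.weight ?? 0));
}

const chapters = [
  { id: 'ch-1', title: 'Chapter 1', weight: 0.1 },
  { id: 'ch-4', title: 'Chapter 4', weight: 0.9 },
  { id: 'ch-3', title: 'Chapter 3' },
];

console.log(orderByWeight(chapters).map(entry => entry.id).join(', '));
// ch-4, ch-1, ch-3
```

A browser would presumably treat this as one input among several and only apply it within a single registrable domain, as noted above.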

Privacy implications

It bears explicitly noting that offline content registrations are like history in how they make revealing browsing history easy, rather than like cookies/etc which require work to leak browsing history. These should be cleared whenever history is cleared, not just when a total site data purge happens.

While this isn't exactly new with this spec - websites can track your browsing history and re-display it - it does make this sort of leakage a lot more likely, and in particular it's likely to be shown by the browser, rather than the website itself.

Review feedback

They would have to know which websites work offline or install a PWA to be able to browse through content while offline.

It isn't clear to me where the "or" operates here. Is it: They would have to know which websites

work offline
install a PWA to be able to browse through content while offline

If so, the second point doesn't seem right. Websites can't install themselves onto the homescreen.

there no entry points

Isn't the homescreen icon the entry point?

deleteArticleResources in example 1 references payload.id, but it should just be id.

Could just be my preference, but I find async functions easier to read than promise chains. Eg, for the push listener:

self.addEventListener('push', event => {
  const payload = event.data.json();

  // Fetch & store the article, then register it.
  event.waitUntil(async function() {
    const articlesCache = await caches.open('offline-articles');
    await articlesCache.add(`/article/${payload.id}`);
    await self.registration.index.add({
      id: payload.id,
      title: payload.title,
      description: payload.description,
      category: 'article',
      icons: payload.icons,
      launchUrl: `/article/${payload.id}`,
    });
    // Show a notification if urgent.
  }());
});

I'm finding "display" a little confusing. Since it requires an environment, it feels like this UI can't be shown any time other than during the initial add() call.

If something is displayed for each item added, will this lead to the user being bombarded with UI? Eg, if I download "today's news stories", which comprises 20 stories, will the user suddenly get 20 notification-like things?

It’s RECOMMENDED that the user agent fetch the icons when the content is being registered, and stored to be accessed when needed.

I would make this specific. Spec when the icon should be downloaded, and what should happen if that download fails.

The whole icon-downloading bit could be behind a 'may', but it should be clear how the UA should behave if it chooses to get the icon.

The UI SHOULD provide a way for the user to delete the underlying content exposed by the UI

This kinda sounds like we can guarantee that the underlying content will be deleted. From a UI point of view, a button may be provided which, when activated, runs delete a content index entry for entry.

If either of description’s id, title, description, or launchUrl

Nit: 'any' rather than 'either'.

The content categories seem a bit limiting, especially as they're required. Eg, a book isn't really a "homepage" or an "article". Is a photo gallery an "article"? What about a daily crossword? Etc etc.

What's the reasoning behind requiring a category?

add and getDescriptions don't seem to mirror each other in terms of naming. I'd expect addDescription and getDescriptions, or add and getAll.

Let launchUrl be the result of parsing description’s launchUrl

"parsing" is linking to the wrong thing.

As you mentioned, a new registration may be introduced with a narrower scope, that would receive navigations for content items 'owned' by another registration.

Calling add twice with a ContentDescription with the same ID is racy. You could solve this with some sort of queue, eg https://html.spec.whatwg.org/multipage/infrastructure.html#parallel-queue. Dunno if it matters.
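
To make the queue idea concrete, here is a small JavaScript analogy. It is not the spec's parallel queue, just a promise chain that runs one task at a time. `createSerialQueue`, the `entries` map and this `add` are stand-ins invented for the sketch, and the delay only simulates work such as icon fetching.

```javascript
// Minimal serial queue: each task starts only after the previous one settles.
function createSerialQueue() {
  let tail = Promise.resolve();
  return function enqueue(task) {
    const result = tail.then(() => task());
    // Keep the chain usable even if a task rejects.
    tail = result.catch(() => {});
    return result;
  };
}

// Stand-in for the index: a Map keyed by id.
const entries = new Map();
const enqueue = createSerialQueue();

// Stand-in for add(): the delay simulates slow work before the entry is stored.
function add(description, delayMs) {
  return enqueue(async () => {
    await new Promise(resolve => setTimeout(resolve, delayMs));
    entries.set(description.id, description);
  });
}
```

With this in place, calling `add({ id: 'foo', title: 'first' }, 20)` followed immediately by `add({ id: 'foo', title: 'second' }, 0)` always leaves 'second' stored, because the second call cannot start until the first has finished. Without the queue, the slower first call would finish last and win.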

Right now, the promise returned by add is delayed by calling "display entry", which includes fetching icons. Is that deliberate?

For activating the content index entry, look at https://w3c.github.io/ServiceWorker/#clients-openwindow - it shows how to create a top level browsing context and navigate it.

The delete event provides the ID of the resource, but should it provide the whole content description?

Due to race conditions, it's possible for:

User clicks delete on entry with ID 'foo'.
contentdelete event for 'foo' queued.
New item with ID 'foo' added.
contentdelete event for 'foo' fires.
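
One possible way for a site to guard against this today is to re-check the index before cleaning up. The following is only a sketch: `handleContentDelete` is a made-up helper, `index` and `cache` are passed in so it can run outside a service worker, and the `/article/${id}` URL scheme is borrowed from the push example earlier in this issue.

```javascript
// `index` is expected to look like registration.index (getAll()),
// `cache` like a Cache object (delete(url)).
async function handleContentDelete(id, index, cache) {
  const current = await index.getAll();
  // If the id was re-added after the delete was queued, keep its resources.
  if (current.some(entry => entry.id === id)) {
    return false;
  }
  await cache.delete(`/article/${id}`);
  return true;
}
```

In a service worker this would be called from the delete event listener, wrapped in `event.waitUntil()`. It narrows the window rather than closing it, since an add can still land between `getAll()` and `cache.delete()`, which is part of why passing the whole description in the event might be preferable.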

Activating a Content Index entry

The spec draft implies that a new tab should be opened when the UI is activated. Is that the right thing to do? Should this be more flexible to allow for browser-specific implementations?

launchUrl is incorrectly capitalized

It should be launchURL, per https://w3ctag.github.io/design-principles/#casing-rules and https://url.spec.whatwg.org/#url-apis-elsewhere

Need more content metadata to be able to sort or filter the list

Talking about content, it would be great to have more content metadata so that it can be shown properly in the list. For example, created and updated times would be useful for sorting the content after we pull it from the list.

The WordPress Post object would be a good reference for available content metadata.

Renaming the spec

Some people have raised concerns that the name of the API is confusing. Is there a better name for this?

Offline Metadata has been thrown around.

Error handling for delete()

Currently there is no feedback if the id passed into the delete method doesn't exist in the Content Index.

I'm unsure of what type of exception this should throw, but doing something like:

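try {
	// delete() returns a promise, so it must be awaited (inside an async
	// function) for the catch block to see a rejection.
	await registration.index.delete("something");
} catch (e) {
	console.log('Failed to remove content', e.message);
}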

would be good programming™️
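
Until something like that exists, a page could approximate the behaviour itself. This is a sketch under assumptions: `deleteExisting` is a made-up wrapper, `index` stands in for `registration.index`, and the generic `Error` is a placeholder because the exception type is exactly what this issue leaves open.

```javascript
// Rejects if no entry with the given id is currently registered.
async function deleteExisting(index, id) {
  const entries = await index.getAll();
  if (!entries.some(entry => entry.id === id)) {
    throw new Error(`No content index entry with id "${id}"`);
  }
  await index.delete(id);
}
```

The check and the delete are two separate steps, so an entry could still disappear in between. A rejection built into `delete()` itself would not have that gap.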

How to index a Web Packaging package?

One of the items mentioned in the WICG proposal post is the interaction with Web Packaging. I'm very interested to know more about what this interaction would look like.

Let's start with this scenario:

I am a comic book site comic.yoyodyne.example. I have a PWA main site and distribute Web Package bundles with free comics.
My customers Jane and Karen are offline on a subway ride. Both have the PWA.
Karen sends Jane the latest Fantastic Overthrust Oscillators (FOO) bundle
How does Jane's PWA find out about the bundle & its content?

At the moment, I have only one path I can think of for this to work. My thought is that the FOO Bundle could have its own index.html that has embedded within it an explicit, hardcoded list of all the other files also in the Web Bundle. FOO Bundle would step through this hardcoded list of content, & index.add() each item. Once done, it could redirect to the main PWA url.

The constraint is that currently Karen or Jane would have to know about the Bundle's unique index.html file, & know how to navigate there to initialize this process. It's also a kind of gross solution anyhow, because the index.html file has to have some JS with the hardcoded list of content that's in the bundle.

What I would love to see would be a way for content to more easily declare itself as indexed. As a secondary objective, HTTP Push has almost the same problem, where the page/sw have no way to know about PUSHed content. There, a similar approach is also hacked together: use SSE or use WebSockets or some such to tell the browser about the content you have just PUSHed to it, so you can fetch() then cache that content. That issue is whatwg/fetch#65.

It would be really lovely to have a way to get content into the content index effectively. I would love for my Comic web app to be able to find out about the Comics it is being sent. Content-Index seems like it could be a breakthrough in enabling that, but there's still an outstanding question to me of how to get Web Package bundles into the content-index.

Review feedback

Thanks for publishing this! Some suggestions.

Intro

"This is not a great user experience." Why not? (Undiscoverability.) What's the impact to the developer? It'd be great to highlight why this is interesting for developers.
"...developers can expose fresh content..." - what does this mean?
"This allows the user to browser..." - no, it doesn't. It allows the browser to provide UI that then allows the user to browse.
"...and potentially search on-device for a specific article." - This sounds too hypothetical. I'd focus on mentioning some examples of where this data could go and what it enables: help users discover available content whilst offline, add extra details to browsing history, participate in on-device search, etc..

Why

This very much focuses on the offline case. I would also detail other discoverability reasons: particularly rich highlights might be worth mentioning.

Combined with other APIs

While the example with the Periodic Sync API is apt, by using it you're proposing something on top of a proposal, which is perceived as a risk. I would suggest using the Push API instead w/ pre-caching content on a breaking news notification.
Rather than using await in a loop, which makes for a sequential operation, store the promises in an array and then await for Promise.all(), making it a parallel operation.
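
For reference, the two shapes being compared look roughly like this. It is a sketch only: `index` stands in for `registration.index`, and both function names are invented for the comparison.

```javascript
// Sequential: each add() waits for the previous one to finish.
async function registerSequentially(index, articles) {
  for (const article of articles) {
    await index.add(article);
  }
}

// Parallel: start every add() first, then wait for all of them together.
async function registerInParallel(index, articles) {
  await Promise.all(articles.map(article => index.add(article)));
}
```

One trade-off worth noting: `Promise.all()` rejects as soon as any single `add()` rejects, while the other calls keep running in the background. `Promise.allSettled()` is an option if each outcome needs to be inspected.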

There also are a few other things I think would be good to mention:

Privacy and security. While obvious, still good practice to state it.
Quality enforcement. Since this isn't providing a storage layer, how does it solve the problem of making content available offline? (This proposal addresses the developer incentive.) What if a developer puts a million items in their cache?
Related: why does this interface live off the ServiceWorkerRegistration?
Reasoning as to why there are different categories of data.
Alternatives. Many browsers already suggest content to users, why is that not sufficient? Where do other hot projects like Web Packaging come in?

I'm sure you've seen the following document, but just in case:
https://github.com/w3ctag/w3ctag.github.io/blob/master/explainers.md

Fire a contentdeleted SW event

If a user decides to delete the content, the browser should fire a contentdeleted event with the ID so developers can clean up the underlying content.

Some thought should be put into preventing malicious websites from re-adding the same content with a new ID within that event.

"app" content category

Would it make sense to have an "app" content category to indicate an offline-enabled web app?

Internationalisation of text fields

The text fields (title and description) are monolingual and in an unspecified language. Even if multiple languages + browser selection of which one to display would be overkill, the language should be indicable to allow proper display (as matters for, e.g., CJK).

Use the newly defined ImageResource

The ImageResource definition moved to its own spec. This spec should point to that definition rather than the one in the Manifest spec.

Categories, meaning & extensibility

I'm a bit concerned around the extensibility situation for categories.

First, in the spec itself, a comprehensive listing of categories that are guaranteed to be understood as a baseline would be a good idea.

The list that is there (in the IDL) is undescribed and opaque; what's the meaning of the different values, and when should they be used? E.G., I see audio and article are separate categories; which one does a spoken rendition of a magazine article fall into? That also seems quite different to a piece of music, but both could fall under audio. Maybe this should be using something like schema.org's ontology, taking a combination of all the (understood) categories.

Categories are inextensible without some way to indicate less-preferred-but-more-understood fallbacks. ARIA and some CSS properties have a first-understood-value-wins rule; alternatively, a combination of all the applicable understood values could be used, like RDFa and microformats.
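
A first-understood-value-wins rule could be sketched as below. Everything here is illustrative: the idea of passing a list of categories is the suggestion above rather than anything in the spec, and the `UNDERSTOOD` set, `resolveCategory` and the 'audiobook' value are assumptions made for the example.

```javascript
// Assumed baseline of categories a given browser understands.
const UNDERSTOOD = new Set(['homepage', 'article', 'video', 'audio']);

// Returns the first candidate the browser understands, or null if none match.
function resolveCategory(candidates, understood) {
  for (const candidate of candidates) {
    if (understood.has(candidate)) {
      return candidate;
    }
  }
  return null;
}

// A site lists its preferred (possibly newer) value first, then fallbacks.
console.log(resolveCategory(['audiobook', 'audio', 'article'], UNDERSTOOD));
// audio
```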

(This point really is a quibble.) Using un-namespaced tokens for category means that extension is risky, as someone else might be using the values you add. If the values were URLs, anyone could mint new values without risking collisions.

Available from main thread without SW registration

Why not directly available from the main thread without having a SW registered (like the Cache API)?

oncontentadded event

Hello. At the moment, if I have a complex service worker that wants to take notice of content being added, I can either

do that additional work every time I call index.add()
use setInterval() or some such & call index.getAll() & diff, checking for new content

Since we have this index, & this index already can tell us when content goes away, it would also be nice to know when content is added. This would be a more normalized path than (1) or (2).
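
For context, the diffing step in option (2) amounts to something like the following sketch. `findAddedEntries` is a made-up helper. In practice `currentEntries` would come from `index.getAll()`, and `knownIds` would be a set the service worker has to persist somewhere itself.

```javascript
// Returns the entries whose ids were not present in the previous snapshot.
function findAddedEntries(knownIds, currentEntries) {
  return currentEntries.filter(entry => !knownIds.has(entry.id));
}

const knownIds = new Set(['a', 'b']);
const currentEntries = [{ id: 'a' }, { id: 'b' }, { id: 'c' }];

console.log(findAddedEntries(knownIds, currentEntries).map(entry => entry.id).join(', '));
// c
```

Polling like this only notices additions at the next tick and needs its own bookkeeping, which is the overhead an added event would remove.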

Should developers provide a size for their registered content?

It would give users a better idea of how much space they would get back upon deleting an entry, and makes for better UI.

Broken references in Content Index

While crawling Content Index, the following links to other specifications were detected as pointing to non-existing anchors, which should be fixed:

https://dom.spec.whatwg.org/#context-object

This issue was detected and reported semi-automatically by Strudy based on data collected in webref.

Does icon validity check trigger a SW's fetch handler?

This might be less of a bug against the spec and more of an implementation detail related to Chrome 80's current behavior.

registration.index.add() currently rejects if you pass in an icons[].src value that isn't a valid URL. I'm curious about how this determination is made, as it's making it difficult for me to accomplish something while trying out the Content Indexing API. (Sample code.)

I've got a PWA that handles incoming media sharing requests using the Web Share Target API on Android.

If a user shares an image to my PWA, that image gets saved locally using the Cache Storage API with a cache key URL that doesn't exist on a remote server—requests for that URL will only succeed if intercepted by my service worker's fetch event handler, which will bypass the network and return the cached media resource.

I am calling registration.index.add() inside of the same fetch handler that is responsible for handling the incoming POST request from the Web Share Target API. If I pass in an icons[].src value corresponding to a generic icon URL that exists on the remote server, everything works as expected. However, if I pass in an icons[].src value that refers to the newly-cached image (which, again, is only valid when intercepted by the service worker, and doesn't exist on the remote server), the add() call rejects due to an invalid icon.

I can probably refactor things so that the call to registration.index.add() happens outside of a fetch handler, if that's what's causing the failure. But my bigger question is whether the validity checks for icons are supposed to trigger a service worker's fetch handler at all—because if they don't, I've got a bigger issue to solve.