As currently spec'd, the sec-bikeshed-available-dictionary:
request header is a structured field dictionary that includes the hash type and base-64 encoded hash of the dictionary file.
i.e. sec-bikeshed-available-dictionary: sha-256=:d435Qo+nKZ+gLcUHn7GQtQ72hiBVAgqoLsZnZPiTGPk=:
On the server side, it would be extremely easy to check for and serve delta-encoded resources if the hash was part of the file name, i.e. /app/main.js.sbr.<hash>.
Extracting the hash from the SF value and mapping it to a hex string or other path-safe string can be done, but is maybe a bit more complicated than it needs to be.
Since the length of the hash string varies with the hash type, we can send the hash without having to send the algorithm (we just need to make sure all supported algorithms generate different hash lengths). Additionally, Base64 includes / as one of the characters used when encoding, so it may be cleaner to just use hex encoding. Other higher-but-safe bases could be selected as well, but may complicate tooling.
If we change it to use the base-16 encoded hash and send the raw hash as the value then the server or middle boxes can construct the file name directly by appending the header value to the end of the file path (though some care should be taken to make sure it isn't abused for a path attack and that the value appended only contains valid characters).
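To make the suggestion concrete, here is a rough sketch (with hypothetical names, not spec text) of how a server could map the raw hex hash from the header onto a delta file name while guarding against path abuse:

```python
import re

# Hypothetical server-side helper: map a hex-encoded dictionary hash from
# the request header onto a delta file name, e.g. /app/main.js.sbr.<hash>.
HEX_SHA256 = re.compile(r"[0-9a-f]{64}")  # lowercase hex, fixed length

def delta_path(base_path, hash_value):
    # Accept only a bare hex digest; this rejects '../' traversal and the
    # '/' and '+' characters that raw base64 could inject into a path.
    if HEX_SHA256.fullmatch(hash_value) is None:
        return None
    return f"{base_path}.sbr.{hash_value}"
```

The fixed-length, restricted-alphabet check is what makes the direct append safe.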
Allow for matching against fetch destination (Sec-Fetch-Dest) in addition to the URL pattern.
Maybe an optional match-dest dictionary entry on the Use-As-Dictionary response, and require it be a full match against the specified fetch destinations.
It might well make sense that this is layered on top of the HTTP cache as @pmeenan suggests, but not all implementations have a triple-keyed HTTP cache at this point. Is that a pre-requisite for this feature?
See whatwg/fetch#1035 for some background and further pointers on triple-keyed HTTP cache.
Hey!
Imagine an Edge based deployment of compression dictionaries, where the resources themselves are in a cloud-based storage.
Every time the CI runs, it adds a new resource to the pile, and calculates the diffs between it and N previous versions of that same resource. All of these diffs are stored in the same bucket in the cloud.
Now, whenever a resource is served, it uses a use-as-dictionary value that matches the various resource versions.
What happens when that same resource gets reloaded?
Its matches value definitely matches itself, so the server gets a request whose sec-available-dictionary header carries the resource's own SHA-256 hash. That kind of zero-sized diff does not exist in the cloud storage, because the CI didn't create diffs from the resource to itself. That means the request either fails, or is retried without the dictionary (adding delay).
What's the right way to tackle such a scenario?
I'd love thoughts on the right thing here for the protocol (and developer advice that will be derived from it).
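One possible answer, sketched below with hypothetical names: the server can special-case the self-match (or any missing diff) and fall back to a plain response rather than failing:

```python
# Sketch of server-side logic for the scenario above: if the advertised
# dictionary hash equals the hash of the version being served, or no
# precomputed diff exists for that pair, serve a plain response instead
# of failing. The diff store layout is an assumption for illustration.

def choose_encoding(resource_hash, advertised_hash, diff_store):
    if advertised_hash is None:
        return "identity", None
    if advertised_hash == resource_hash:
        # Client already has this exact version; don't look up a diff
        # the CI never generated.
        return "identity", None
    diff = diff_store.get((advertised_hash, resource_hash))
    if diff is None:
        return "identity", None   # no precomputed delta: serve plain
    return "sbr", diff            # serve the delta-compressed body
```

This avoids both the failed request and the retry round trip, at the cost of one extra hash comparison per request.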
Why is it p and not path? Same for the other fields.
cc @mnot
Zstandard can use both structured and raw content dictionaries (RFC8878 sec. 5). When a buffer is presented to Zstandard to be used as a dictionary it must be instructed how to interpret it. (If a properly formatted dictionary is used as a structured dictionary by the compressor and as a raw content dictionary by the decompressor, or vice versa, the reconstructed output will likely differ from the original content.)
The three options provided by zstd are:
One option is to use the MIME type of the resource being used as a dictionary (as discussed in #44) to signal how it should be interpreted. But simpler might just be to use the auto-interpretation mechanism.
Whatever we choose, the description of the zstd-d content-encoding should be updated to be explicit about this.
For the case of dynamic HTML resources, I can see sites with a low number of returning visitors where it can be beneficial to e.g. reuse the HTML delivered as part of the current page for future versions of the same page, or for similar pages (e.g. reuse the HTML from one product page for another).
But very often, such HTML pages (especially with publishers and e-commerce) are served with very low caching freshness lifetime (if any), to ensure that typos or page errors won't live on in the browser's cache.
At the same time, it'd be great to be able to use these pages as a dictionary for a long while.
So it'd be great to be able to define both Cache-Control max-age and a dictionary TTL, have the browser cache keep the resource around for the duration of the longest amongst the two, but only use it for the case for which it is still fresh.
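A minimal sketch of that dual-freshness idea, assuming a cache entry that tracks both lifetimes separately (all names hypothetical):

```python
# Sketch only: a cache entry that carries both a Cache-Control max-age and
# a separate dictionary TTL. The entry is kept for the longer of the two
# lifetimes, but each use checks its own lifetime independently.

class CachedEntry:
    def __init__(self, stored_at, max_age, dictionary_ttl):
        self.stored_at = stored_at          # seconds since epoch
        self.max_age = max_age              # Cache-Control freshness (s)
        self.dictionary_ttl = dictionary_ttl  # dictionary lifetime (s)

    def _age(self, now):
        return now - self.stored_at

    def fresh_for_reuse(self, now):
        # May be served from cache as a regular response.
        return self._age(now) < self.max_age

    def fresh_as_dictionary(self, now):
        # May be advertised/used as a compression dictionary.
        return self._age(now) < self.dictionary_ttl

    def evictable(self, now):
        # Evict only once BOTH lifetimes have expired.
        return self._age(now) >= max(self.max_age, self.dictionary_ttl)
```

With a short max-age and a long dictionary_ttl, a stale page would no longer be served directly, yet would still be usable as a dictionary.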
The explainer describes that the client and server generate SHA-256 hashes and then use those to coordinate. Is there a specific reason why algorithm agility is not built into the protocol? In simple terms, the ability to migrate to other algorithms as the security environment evolves.
The more I look at this aspect, the more it gets me thinking about whether the design has some overlap with the HTTP digests specification https://httpwg.org/http-extensions/draft-ietf-httpbis-digest-headers.html
The explainer hints at wanting to constrain the size of the sec-bikeshed-available-dictionary field value via:
SHA-256 hashes are long. Their hex representation would be 64 bytes, and we can base64 them to be ~42 (I think). We can't afford to send many hashes for both performance and privacy reasons.
but I wonder how much this really matters in practice.
If we adopted a similar approach that digests use, you could make sec-bikeshed-available-dictionary
be a Structured Fields dictionary that can convey 1 or more hash values alongside their indicated algorithm e.g.
sec-bikeshed-available-dictionary:
sha-256=:d435Qo+nKZ+gLcUHn7GQtQ72hiBVAgqoLsZnZPiTGPk=:,
sha-512=:YMAam51Jz/jOATT6/zvHrLVgOYTGFy1d6GJiOHTohq4yP+pgk4vf2aCs
yRZOtw8MjkM7iw7yZ/WkppmM44T3qg==:
Even if you restrict it to only one hash, you can still benefit from agility by sending the algorithm alongside the value.
This somewhat falls under "open question 2" in the explainer, but I thought it's worth opening an issue to discuss this specific aspect.
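For illustration, a minimal sketch of reading such a multi-hash header into a map of algorithm to raw bytes. This is deliberately simplistic and NOT a full RFC 8941 Structured Fields parser:

```python
import base64

# Toy parser for values like:
#   sha-256=:d435Qo+...=:, sha-512=:YMAa...==:
# Real implementations should use a proper Structured Fields parser.

def parse_available_dictionary(value):
    hashes = {}
    for member in value.split(","):
        name, _, item = member.strip().partition("=")
        # Byte sequences in SF are delimited by colons around base64.
        if item.startswith(":") and item.endswith(":"):
            hashes[name] = base64.b64decode(item[1:-1])
    return hashes
```

The decoded length (32 bytes for sha-256, 64 for sha-512) gives the server a second signal for the algorithm, even before looking at the key name.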
The problem
It is very common for asset paths to include a hash of the asset's content.
cdn.mysite.com/assets/myscript_HASH.js
cdn.mysite.com/assets/HASH/myscript.js
The benefit: assets with no changes between version n and n+1 keep the same hash, and load from the HTTP cache.
The proposed path/scoping rules mentioned in the explainer do not support this type of versioning.
I think this is bad: if only myscript.js/{version}-esque scoping rules are supported, then there's a mutually exclusive choice between cache-friendly hash-based versioning and support for delta dictionaries.
Of course, there is no clear-cut "one is better than the other". Factors such as code-splitting granularity, deployment cadence, and user demographic/perf distribution come to mind.
I think Chromium might not be able to accurately measure the net impact even with an open A/B test origin trial due to selection bias of those who opt-in for the trial.
For a non-trial, there are many other factors that could affect performance over time. Looking at improved CWV for a short window of "before/after" might tell a lie in the long run and we'd never know.
Solution thoughts
I'm only here to complain..
Adding a wildcard anywhere in the path is a problem for the proposed scoping/pathing rules.
Is it ~better if it's only allowed for the slug (last segment)?
Maybe the slug can be a prefix by definition? i.e. /myscript.js implicitly matches both /myscript.js.hash1 and /myscript.js.hash2.
I'm wondering whether we should expose the storage usage for the dictionaries.
Currently Storage API is providing a way to get the storage usage.
For example in Chromium,
JSON.stringify(await navigator.storage.estimate(), null, 2);
returns
{
"quota": 296630877388,
"usage": 75823910,
"usageDetails": {
"caches": 72813056,
"indexedDB": 2877379,
"serviceWorkerRegistrations": 133475
}
}
Note: usageDetails was launched in Chromium. But it is still under spec discussion.
I have two questions:

1. Should we include the dictionary storage in usage?
2. Should we add a dictionaries entry in usageDetails?

All dictionary resources should be readable from the page, so I don't think there is any risk of exposing them. But I'd love to hear other opinions.
To add another layer of defense against cross-origin timing attacks, we should add language along the lines of:
When the server receives a sec-bikeshed-dictionary-available: sha256=:<hash>: request that includes an authority or origin as well as a referer request header, and where the referer is cross-origin, the dictionary may only be used for compression if the response headers include an Access-Control-Allow-Origin: that includes the origin from the referer header.
It could be tweaked to use different sec-* headers to detect the cross-origin nature of the request, but the requirement is to prevent servers from even sending responses using dictionary compression that should be opaque (and opening up the possibility of a timing attack).
Can we make the browser automatically retry the request without the sec-bikeshed-available-dictionary:
header when it failed to read the cached dictionary?
The current explainer says:
In case the browser advertized a dictionary but then fails to successfuly fetch it from its cache and the dictionary was used by the server, the resource request should be terminated
So the browser must check the existence of the cached dictionary on the disk before sending the request to reduce the risk of such failure.
If the automatic retry is allowed, the browser can speculatively send the request with sec-bikeshed-available-dictionary:
header without checking the cached dictionary. I think this is very important for performance.
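The speculative flow being proposed could look roughly like this (all function and field names are hypothetical):

```python
# Sketch of the proposed speculative fetch: send the request with the
# dictionary advertisement immediately, and if the cached dictionary turns
# out to be unreadable after the server already chose dictionary encoding,
# retry once without the header.

def fetch_with_dictionary(send, read_dictionary, url, dict_hash):
    response = send(url, available_dictionary=dict_hash)
    if response["content_encoding"] != "sbr":
        return response                 # dictionary not used; nothing to do
    dictionary = read_dictionary(dict_hash)
    if dictionary is not None:
        return response                 # decode body with the dictionary
    # Cache read failed after the server committed to the dictionary:
    # this is the proposed automatic retry, without the advertisement.
    return send(url, available_dictionary=None)
```

The win is that the disk read for the dictionary can overlap the network round trip instead of blocking the request.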
There is no way to delete registered dictionaries.
I think we should support it using Clear-Site-Data. The Clear-Site-Data spec defines the following types:
"cache"
"cookies"
"storage"
"executionContexts"
"*"
I think Web developers will want to delete dictionaries without deleting other types ("cache", "cookies", "storage"). So we should introduce a new type "dictionaries".
Clear-Site-Data: "dictionaries"
For dictionaries loaded from a Link: header, it could be useful for the request that triggers the dictionary fetch to either specify the scope of the dictionary, or for the allowable path for the dictionary to include the path from the original request; and for a document <link> tag to also provide other path options.
The path restrictions for dictionary use as they are currently written are for providing some level of ownership proof when setting the scope. The request that triggers the dictionary fetch and the document itself are also proof points and could allow for serving the dictionary from a different directory than the resources it is intended to be used with (still needs to be same-origin as the resources).
Brotli and Zstandard both support raw byte streams as well as "optimized" dictionaries. Most of the work to this point has assumed raw byte streams but it would be beneficial to spec what the negotiation for a custom dictionary payload would look like so that backward-compatibility doesn't become a problem.
i.e. If a browser ships without support for extended brotli dictionaries or index-based Zstandard dictionaries and support for both is added at a later time, we need to make sure that older clients will not break by trying to use the new dictionary as a raw byte stream.
This could be done with different content-encodings for the different types of dictionaries but it would be better to not explode the set of encodings if it isn't necessary.
One possibility that comes to mind:

- Define distinct content types for the different dictionary formats: dictionary/raw, dictionary/brotli, etc.
- For the link rel=dictionary mechanism, advertise the supported dictionary types in the Accept: header.
- For the use-as-dictionary response header, add an optional type= field for the type of dictionary that defaults to type=raw.
- Serve dictionaries with the matching content-type response header.
- If the client does not support the type specified in the use-as-dictionary response header then it should not store the dictionary (independent of how it was fetched).
- If the content-type response header is not a recognized dictionary type then it should not be stored as a dictionary.

Since custom dictionaries will only ever make sense to be fetched as stand-alone dictionaries, this should allow for backward-compatibility as new dictionary formats are created.
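A simplified sketch of the storage rules above, collapsing the two checks into one decision for a hypothetical client that only supports raw dictionaries (the recognized-type set here is illustrative, not final):

```python
# Sketch: decide whether a fetched resource may be stored as a dictionary.
# Both sets below are assumptions for illustration.
RECOGNIZED_CONTENT_TYPES = {"dictionary/raw", "dictionary/brotli"}
SUPPORTED_TYPES = {"raw"}  # hypothetical client: raw byte streams only

def should_store(use_as_dictionary_type, content_type):
    if use_as_dictionary_type not in SUPPORTED_TYPES:
        return False  # client can't interpret this dictionary type
    if content_type not in RECOGNIZED_CONTENT_TYPES:
        return False  # response isn't a recognized dictionary payload
    return True
```

An older client with SUPPORTED_TYPES = {"raw"} simply refuses newer formats instead of misinterpreting them as raw byte streams, which is the backward-compatibility property the proposal is after.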
It seems that the proposal allows any subresource to essentially claim dictionary authority for /.
The README.md says:
On a future visit to the site after the application code has changed:

1. The page includes <script src="//static.example.com/app/main.js/125">.
2. The browser matches the /app/main.js/125 request with the /app/main.js path of the previous response that is in cache and requests https://static.example.com/app/main.js/123 with Accept-Encoding: br, gzip, sbr, sec-fetch-mode: cors and sec-bikeshed-available-dictionary: <SHA-256 HASH>.
3. The server responds with Content-Encoding: sbr, Access-Control-Allow-Origin: https://www.example.com, Vary: Accept-Encoding,sec-bikeshed-available-dictionary.
.I believe it should say:
The browser matches the /app/main.js/125 request with the /app/main.js path of the previous response that is in cache and requests https://static.example.com/app/main.js/125 with Accept-Encoding: br, gzip, sbr, sec-fetch-mode: cors and sec-bikeshed-available-dictionary: <SHA-256 HASH>.
If dictionaries end up scoped to a path and use some form of precedence, what are the mechanics for expiring a dictionary with more specificity for a less-specific one?
i.e., assuming dictionaries that cover 2 paths:
A - http://example.com/web/products/
B - http://example.com/
If a client has both dictionaries but a site decides to unify on a single global dictionary (B), how is dictionary A replaced? Some possibilities come to mind:
Content-encoding is the most natural fit for the actual compression but it is likely to also cause adoption problems, at least in the short term.
It's not unusual for the serving path to consider content-encoding to be per-hop instead of end-to-end from the browser to the origin and unless the delta-encoding is being done by the leaf serving node, the sbr
encoding is likely to be stripped out.
```mermaid
sequenceDiagram
    Browser->>CDN: Accept-Encoding: sbr, br, gzip
    CDN->>Origin: Accept-Encoding: gzip
    Origin->>CDN: Content-Encoding: gzip
    CDN->>Browser: Content-Encoding: br
```
If the actual encoding is done using other headers for negotiation but the content-type remains the same, then the compressed resources will be binary data and may cause other issues for middleboxes (i.e. with something like edge workers, they will be expecting to be processing text HTML, CSS or Javascript payloads). That could be workable for a given origin as long as they control the processing along their serving path.
One deployment model where it could work, but which requires explicit support from both origins and CDNs, is:

1. The origin marks dictionary resources with a bikeshed-use-as-dictionary: <path> response header.
2. The browser advertises sec-bikeshed-available-dictionary: and Accept-Encoding: sbr, br, gzip on matching requests.
3. The delta-encoding node responds with Content-Encoding: sbr.
When a browser fetches a cross-origin script (eg: <script src='https://static.example.com/script.js'>
in https://www.example.com/index.html) , it sends a request with the mode set to no-cors
and the credentials set to include
.
The current explainer allows this type of request for both registering as a dictionary and using a registered dictionary for its decompression, as long as the response header contains a valid Access-Control-Allow-Origin
header (*
or https://www.example.com
).
However, if we follow the CORS check step in the Fetch spec, the response must also contain the Access-Control-Allow-Credentials: true header, and the Access-Control-Allow-Origin header must be https://www.example.com. This means that the server must know the origin of the request, even though the request does not include an Origin header. (It may include a Referer header, but the Origin header and Referer header are conceptually different.)
For this reason, I now think supporting no-cors mode requests is problematic.
Maybe we should support only navigate, same-origin, and cors mode requests?
@pmeenan @yoavweiss
Do you have any thoughts?
Hey folks, tracking this proposal as it seems like a huge way to cut down on our CDN traffic and get people loading updated JS bundles faster.
From an implementation perspective, is there a recommended / expected location for the implementation in standard flows? I can see two places to do this:
There are pros and cons to both approaches but wondering if from a spec perspective there is an "ideal" approach here or specifically what was envisaged while writing the spec.
For full context we ship like ~15-20 builds a day which means that solving the "how do we generate these files" is not a trivial problem to solve in either case (dynamic vs build time). But looking to go down the path most trodden.
A lot of build systems produce static resources that are prefixed by a build number and that doesn't work well with a prefix-only match. i.e. /app/123/main.js
We could allow for more flexible path matching with some form of wildcard support but that will complicate the "most-specific" matching logic and the ownership protections.
Using a # for a wildcard (since it is already reserved as a client-side separator), we could allow for exact matching by default, prefix matching with a # at the end, or wildcard matching.
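A quick sketch of what '#'-based matching could look like (exact by default, prefix via a trailing '#', wildcard anywhere else). This is an illustration of the idea, not proposed spec text:

```python
import re

# Sketch: '#' as the wildcard character. Splitting on '#' and joining the
# escaped literal parts with '.*' gives exact matching when there is no
# '#', prefix matching when it is trailing, and wildcard matching elsewhere.

def path_matches(pattern, path):
    regex = ".*".join(re.escape(part) for part in pattern.split("#"))
    return re.fullmatch(regex, path) is not None
```

For example, /app/#/main.js would cover build-number-prefixed paths like /app/123/main.js, which is exactly the case prefix-only matching can't express.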
Some open questions:
Colleagues and I were curious if Chromium had any plans it could share around whatwg/urlpattern#191, as running JS regular expressions in networking doesn't seem like it will fly.
Is the idea to have some kind of safe subset?
Hi folks working on this! If you haven't already, could you please file standards positions for WebKit and Mozilla? It would be great to get a few more eyes on this from potential implementers:
Thanks in advance!!!
One of the things that came up during Chrome's origin trial is that A/B testing the effectiveness of compression dictionaries is difficult (and will become more difficult when it is no longer an origin trial).
There are 2 points in the serving flow where dictionary decisions need to be made:

1. When the use-as-dictionary response header is sent to mark a response as an available dictionary.
2. When a request carries available-dictionary and the server decides if it is going to serve a dictionary-compressed response.

In the case of the origin trial, there is a third gate, which is the setting of the origin trial token that enables the feature (without which the use-as-dictionary response header will be ignored). Outside of the origin trial there is no page-level gate for enabling it, and in both cases, once enabled, there is no way to turn it off for individual users.
For the dynamic use case where the server is running application logic anyway and the response is not coming from a cache, it is possible to use a cookie or some other mechanism to decide if dictionaries should be used, both on the initial request and subsequent requests where the available-dictionary
request can just be ignored.
In the static file use case where resources are served from an edge cache and the cache keys the resources by URL, accept-encoding and available-dictionary, there is no granular way to control user populations. All clients for a resource will get the use-as-dictionary
response header and all clients that advertise a given dictionary would get the dictionary-compressed response. The page does have SOME level of control but it would require using different URLs for the resources for the different populations.
While it would be useful for sites to be able to have granular control over the feature for measuring the effectiveness during roll-out, that level of control is not usually exposed for transport-level features.
available-dictionary request headers).

In Chromium, we are using the MatchPattern() method to process the URL matching.
The MatchPattern() method supports both ? and *. (? matches 0 or 1 character, and * matches 0 or more characters.) Also, the backslash character (\) can be used as an escape character for * and ?.
The current proposal's dictionary URL matching doesn't support \. Also it doesn't support ?.
I think ? is useful, but ? is used in URLs before the URL query string. So I think we should support both ? and \.
Websockets themselves would fail a same-origin check for a dictionary delivered over HTTPS.
Would it be valuable (and safe) to allow for the path matching URL in the dictionary response to specify a wss:// scheme along with a match path (and explicitly restrict dictionaries to https, not just same-origin)? Then the dictionary-setting part of the spec could require that the match path be same-origin (and https) or the equivalent origin if wss was used as a scheme in the match path.
Something like:
1. Only process use-as-dictionary: response headers for requests with a https scheme.
2. Parse the path (or match if we change it) param as a URL. This only matters for wss (and doesn't hurt otherwise, allowing regular URL parsing and classes to be used).
3. Only allow https and wss schemes if the URL is fully-qualified.
4. If the match URL uses a wss scheme, replace it with https when doing the origin comparison.

AFAIK, the actual compression should work fine for data delivered over a websocket as long as the encoding supports streamed compression (which is usually a requirement before adopting a new compression algorithm anyway).
Apologies if there are some specifics that I missed around this, but I'm curious how service workers will interact with this solution. It's clearly at a lower layer with no API for SW, but is it expected that, when using SW to make fetch requests, this process still happens, or should it be skipped? At the moment, typical browser caching layers are skipped with SW networking; for example, responses sending an etag header will not automatically go through the process by which the request gets an if-none-match header, so the SW needs to incorporate that.
Is this defined somewhere?
This is the new foundation we're using for URL matching across the web platform. https://github.com/WICG/urlpattern
Introducing a new type of pattern is counterproductive to our efforts. (I can't find the details in the explainer, but it says "This is parsed as a URL that relative or absolute URLs as well as * wildcard expansion.", and then #42 is also open, I guess.)
It's probably worth calling out that all caches in the serving path will need to support Vary: sec-bikeshed-available-dictionary
so that the cache for a given URL doesn't get polluted with delta-compressed artifacts using different dictionaries.
Full Vary support for arbitrary headers isn't necessarily needed but it will be required for whatever the dictionary request header ends up being.
Not sure if it needs specific mentioning, but this is for the CDNs, Load balancers and web servers at a minimum, depending on what caches are in the path for a given origin.
I'm assuming it also needs to be limited to HTTPS (and maybe only HTTP/2 and 3) to reduce the risk of forward proxies or intercepting man-in-the-middle proxies from causing cache issues.
Hello. Could/should this specification be generalized to support its application to other compression schemes? Currently the README seems exclusively focused on Brotli, but having wider support could help other standards.
For example there is the zstd compression scheme, here:
https://docs.google.com/document/d/1aDyUw4mAzRdLyZyXpVgWvO-eLpc4ERz7I_7VDIPo9Hc/edit . This too could use Dictionary support.
Concern over handling (or lack thereof) of dictionaries was one of the primary concerns cited in mozilla/standards-positions#105 for the defer status against the zstd compression scheme proposal. If this proposal could be generalized a bit, zstd and potentially other compression schemes would have a better chance of moving forward, helping users save CPU and bandwidth on the web.
(i.e. as main.js..sbr)
Did you mean to use e.g. here?
I found that there is no definition of a MIME type for the dictionary itself in the demo.
Maybe application/compression-dictionary?
https://github.com/WICG/compression-dictionary-transport/blob/main/README.md?plain=1#L137
expires - Expiration time in seconds for the dictionary.
In Cookie, Cache-Control, etc., expires is a date format and max-age is a time in seconds, so it seems a bit strange to me to have a time in seconds for expires.
How about making it max-age?
In the current explainer, when the browser detects a link element <link rel=bikeshed-dictionary as=document href="/product/dictionary_v1.dat">
, it fetches the dictionary with sec-fetch-dest: document
header.
However, when the server receives the request, it may be confused whether this is an actual document request for navigation or a dictionary fetch request.
Therefore, I want to recommend introducing an appropriate sec-fetch-dest value to indicate that the request is for a dictionary fetch.
Two possible ideas are:

1. sec-fetch-dest: dictionary plus sec-fetch-dict-dest: document
2. sec-fetch-dest: dictionary-for-document
In Chromium implementation, the "document" destination type is used to detect the main resource request. Therefore, introducing a new destination type is also convenient for Chromium developers.
We should make sure the correct thing is done here, to avoid confused deputy attacks.
(This came up during TPAC 2023 and nobody present was immediately clear on whether this was handled correctly.)