wicg / sanitizer-api Goto Github PK

View Code? Open in Web Editor NEW

227.0 227.0 31.0 720 KB

Home Page: https://wicg.github.io/sanitizer-api/

License: Other

Bikeshed 100.00%

sanitizer-api's People

Contributors

Stargazers

Watchers

sanitizer-api's Issues

Semantics of allow vs block vs defaults

@iVanlIsh has thankfully been looking at implementing this, and we noticed that we've never bothered to define (or agree on) how allow-lists & block-lists & the default config would play together. My current idea would be something like so:

For each element in the source fragment, we determine an action keep | block | drop.
if keep, we keep the element.
if block, we remove the element and replace it with its children, which we'll process in turn
if drop, we remove the element and all its children from the tree.

Decision making: If the current element is:

in the built-in default drop list: drop.
in the built-in default block list: block.
in the config's drop list: drop.
in the config's block list: block.
not in the config's allow list and the allow list is non-empty: block
otherwise: keep

This is a bit different (& more explicit) than the current proposals:

Instead of a duality allow/block, this now has three behaviours: allow/block/drop.
In this definition, drop overrides block, and block overrides allow.
In this definition, the built-in lists override the config. There is no way to override the built-ins.

I think this definition could be analogously extended for atributes, or namespaced elements.

There are several related issues (#29 #17 #18 #26), but none seem to quite answer this question.

Generally, I don't feel super strongly. This is more a request to settle on some specific behaviour than a proposal for a specific one. I'd be particularly interested in feedback from practicioners on whether this matches their needs. @mozfreddyb @shhnjk @koto.

Naming of the innerHTML setter (e.g., `setInnerHTML`.)

I don't like the naming of setInnerHTML. In the security industry, we said so much about avoid using innerHTML, that library like React use the name called dangerouslySetInnerHTML. It's super hard to tell people now that "Hey! setInnerHTML is safe!". At the very least, it's confusing for developers. setHTML (or setSanitizedHTML) sounds better.

Originally posted by @shhnjk in #42 (comment)

Still active?

HI @mozfreddyb, just checking if this is still something you want to keep pursuing? It hasn't moved for about a year so wondering if we should archive it?

Consider splitting the API into lower-level primitives

Existing client-side HTML sanitizers (I'll focus on DOMPurify and Closure HTML sanitizer as I know them the most) deal with a couple of tasks to prevent XSS:

Creating an inert HTML document to put the dirty HTML into.
Traversing the DOM safely (using a couple of DOM APIs with some hardening against DOM clobbering).
Applying a "baseline" whitelist of elements / attributes to remove JS vectors (roughly - making sure that the sanitized document behaves like <iframe sandbox> without allow-scripts).
Sanitizing the values of the attributes / text content - e.g. URIs, id/name attributes for DOM clobbering in output, limited script gadget vectors sanitization (see e.g. MUSTACHE_EXPR,ERB_EXPR).
Special processing for non-HTML e.g. SVG.
Tree conversion and canonicalization with special additional rules to prevent mXSS.

Most of the codebase of the sanitizer deals with the above, with special processing to account for browser bugs, or browser API incompatibilities.

Some of the tasks above require userland configuration or hooking for the sanitizer to be useful for its callers. For example:

In-place sanitization allows the caller of the sanitizer to skip the inert DOM creation for performance.
Whitelists can be modified by callers to allow application-specific policies (e.g. "we don't want SVGs nor MathML", "we want some custom tags / attributes allowed", "only text-formatting tags", "some scripts are ok").
Attribute / content sanitization is usually configurable, either declaratively, or in the form of hooks - this is to allow for custom URL policies ("all network requests should be proxied", "apply a hostname whitelist", "apply a scheme whitelist"), ID/name policies ("we don't care about DOM clobbering", "add a prefix to all IDs", etc.). It gets even more hairy with the gadgets ("remove AngularJS expressions, but we use a different expression separator").
Output is sometimes needed as a DOMFragment, sometimes as a string etc. This is controlled by the caller.

The users want a rich customization - that suggests it might be difficult to agree on a "one true HTML sanitizer". Even if we focus on XSS prevention, there are differences between the sanitizers as to whether gadget triggers are "in-scope" (and to what extent), are data- attributes OK, which hooks wires to expose, to what extent should a library be concerned with browser bugs. These concerns are usually easier to satisfy in userland code, than natively.

I think a good starting point would be to expose lower level primitives. I can identify the following:

a proper inert DOM creation API
- it should not trigger network request or JS execution (now it happens that there are embarrassing browser bugs)
- it should create the same DOM (now there are some slight differences across browsers)
- it should solve the Web Components issue
non-clobberable DOM traversing API
authoritative, browser-mandated elem + attr list
- at minimum, specifying which element + attribute can result in a network request or a JS execution
- perhaps exposing it as a map of elem + attribute => Trusted Type? Types already carry that information (TrustedScript is a literal JS code, TrustedScriptURL will execute JS when loaded, TrustedURL may result in a network request etc.)
- something, something to annotate an attribute can trigger DOM clobbering
- even if there are browser differences between the lists, it shouldn't matter, as each browser is only required to maintain the list to match its behavior
document serialization that's not prone to mXSS, because, seriously, we should not have mXSS anymore.

An XSS-focused HTML sanitizer then becomes a lean composition of the above + code to apply the caller-specified customization. That wrapping, customizable sanitizer could be implemented either in userland (where I believe it would be simply easier to maintain), or natively.

Exposing such primitives has additional advantages:

they are easier to polyfill (in fact, that's what the existing sanitizers are doing already)
they can be used outside of the HTML sanitization realm. In fact the elem + attr list alone would be very useful for Trusted Type policies, templating systems, or WYSIWYG editors.

ci is broken

e.g., https://travis-ci.org/github/WICG/sanitizer-api/builds/771537899

I will find out!

Bypass using noscript and a closing noscript tag in attributes

Hiya

Cool api :) I tested the Chrome canary version and found you could inject a noscript tag and a closing noscript tag inside an attribute. Although it doesn't execute in the demo (probably because innerHTML isn't used) the sanitised output is vulnerable if used elsewhere.

<noscript><img title="</noscript><iframe onload=alert(1)>"></noscript>

Demo

Should allow_attributes support element wildcard?

This came up during a discussion of the Sanitizer, and I'm not quite sure what to make of it.

Currently, in both attribute allow and block lists, you can specify attribute/element combos, or a wildcard for elements. E.g.:

allow_attributes: {
  "src": ["img"], /* src= attribute, but only on <img> */,
  "onclick": ["*"],  /* onclick= attribute, blocked everywhere. Is actually covered by baseline; it's just an example. */
},

Should we also allow

allow_attributes: {
  "*": ["div", "span"] /* allow any attributes - unless blocked by baseline - on <div> and <span> */
},

or:

block_attributes: { "*", {"b", "em"]}, /* block all attribute on formatting elements <b> or <em>. */

What should the output for creationOptions getter be?

Break down to following questions:

If the user inputs contains invalid configs, should our output remains them? ( Personally consider a NO)
Should we format the cases for node/elements name string or not and if so, to upper or lower case? (Personally prefer a Yes but not sure to which case)

WDYT, @otherdaniel @mozfreddyb @shhnjk @koto.

proposed api: innerHTML setter that takes template strings

(...)

given that we cannot distinguish that type from a string it's still unclear to me what that means. And even once we can distinguish it, what it would mean then.

To give more high-level description, Web Platform currently doesn't have a way to say developers are assigning static string or dynamic string.
That is,

elem.innerHTML = "<b>hello</b>";

AND

elem.innerHTML = "<b>" + location.hash.slice(1) + "</b>";

Are both string assignment (even though one is static and one is dynamic). What @koto suggested is to give a primitive to developers, where the platform can understand the notion of static string assignment. This primitive is important in the platform, because if we know something is static, it is safe to append without sanitization.

BTW, elem.setHTML.withLiterals`this_is_not_injected_for_sure` sounds better (remember, this is the thread to decide namings 😋)

Originally posted by @shhnjk in #100 (comment)

travis ci / deploy scripts need updating

It's all python 2.x :|

Limitations in the protection against arbitrary code execution

I've invented and implemented the first version of the sanitizer in Mozilla, in 2002.

I had created the sanitizer as an additional (!) protection on top of <browser type="content">, to isolate malicious HTML from privileged JS with system principal. It is explicitly not capable to be the only protection from malicious HTML in privileged JS contexts that can run arbitrary code on the local machine, for example in Electron. The attack surface of HTML is just too large to be 100% sure to cover everything.

Rather, it was intended as additional protection for cases where the browser has a security hole. Browsers are not secure enough, given that they have several security holes every single week of the year. In some contexts, this is not acceptable. That's what the sanitizer was for, to given protection where the browser alone is too insecure.

It's one thing to use a sanitizer as XSS protection. Worst case, your account on that site gets hacked. It's a completely different thing as protection against arbitrary code execution. Worst case, all your accounts get hacked, all passwords leak, all your documents leak, and your computer is used to attack the rest of the local network.

It seems you are aware that the protection is not complete. The document has limitations like "browser has a fairly good idea", "the library... will be updated ... as bugs or new attack vectors are found", which is fair. The entire problem is that the browser has too many security bugs, and the sanitizer was intended as additional protection, exactly to stop them and put an end to the endless browser bugs.

And it did. The sanitizer together with the iframe protection only ever had 2-3 security bugs that I know about, within ~10 years. This is in contrast with the browser, which has 2-3 security bugs per week. This is unacceptable in many contexts.

My advise: The sanitizer is great for low security contexts, against low risk XSS, e.g. for user comments on a blog. For higher security contexts, e.g. apps with system rights with arbitrary code execution rights on the local machine, I advise to explicitly state that this sanitizer is a good complement, but alone is not sufficient. Additional protections are needed.

I don't want the sanitizer to be the cause for the next broken pipeline, poisioned tap water, or shut down hospital.

Consider respecting baseline settings for all sanitizers

Problem: Some applications are more sensitive w.r.t to what elements/attributes should the sanitizer output.

For example, some applications are built on frameworks that recognize custom XSS sinks (i.e. that have script gadgets), or e.g. wish that all images were stripped, or wish that iframes are not present, or only some tags related to text formatting are supported.

The sanitizer API solves that problem via sanitizer config, that may be arbitrarily strict, but it doesn't solve the following: How to guarantee that all sanitizers adhere to those rules? E.g. How to prevent arbitrary 3rd party dependency, that uses Sanitizer, from inserting DOM nodes violating my custom restrictions?

There's a couple of ways how this problem could be addressed. For example, some integration with Trusted Types which already deals with guarding certain sinks could be possible (#55 explores that). But there are also some ways this could be addressed without TT:

A. Event that can modify / reject config at creation time

Similar to w3c/trusted-types#270, there could be an event dispatched that allows for inspecting the configuration of a sanitizer when it's created.

Sanitizer.addEventListener("beforecreate", (event) => {
  try {
    event.config = massage(event.config);
  } catch (e) {
    console.error("config so bad I can't recover");
    e.preventDefault(); // Sanitizer creation throws.
  }
});

B. Baseline configuration

Sanitizer.setBaseline(baselineCfg);

This would assert that restictions in the baseline is respected for every Sanitizer construction afterwards.
The effective "merging" of configurations TBD, but it might be some form of: effectiveCfg = Object.assign(givenCfg, baselineCfg) (it gets a bit more complex with allowlists / blocklists).

There are also ways to implement that in userland code:

C. Create a Proxy early (userland)

Sanitizer = new Proxy(Sanitizer, {
  construct(target, args) {
    console.assert(respectMyRules(args), "invalid sanitizer config");
    // or change the rules to respect the application-specific baseline.
    return new target(...args);
  }
});

const s = new Sanitizer(configRespectingTheRules); // yay!

D. Remove the Sanitizer (userland)

// Run me early
export const myOneTrueSanitizer = new Sanitizer(customXssCfg);
delete Sanitizer; // 😱

// cc @mikewest @otherdaniel @shhnjk @mozfreddyb

Invalid Web IDL: Sanitizer.config attribute type cannot be dictionary

A recent commit switched config to be an attribute getter. However, that is forbidden in Web IDL: "the type of the attribute [...] must not be a nullable or non-nullable version of [...] a dictionary type".

(I suspect one reason for the rule is that devs expect attributes to return the same object, as in obj.config === obj.config, and that cannot easily be enforced when a readonly attribute returns a dictionary)

Revisit config and defaultConfig as getters.

.config() and .defaultConfig() are currently methods, why they should be properties, or at least, use property syntax. (.config). A getter would be a way to accomplish this.

Per https://github.com/WICG/sanitizer-api/pull/80/files#r616408446
(via whatwg/webidl#981)
(via #77)
(via w3ctag/design-reviews#619)

How to identify namespaces / SVG + MathML

The current spec uses a tag's local name, meaning it cannot distinguish between SVG, MathML, and HTML element names. Clearly, that won't do long-term.

traverse and sanitize within template element?

Imagine a fragment like

<template><img src="x" onerror="alert(1)"></template>

Do we expect the sanitizer to special-case the <template> element and traverse into it's content attribute?
Firefox currently doesn't support the element, but I'd like to support it and traverse.
From looking at this testcase DOMPurify does so too.

From a spec perspective, I think we could check if the element has a content attribute of type DocumentFragment (or such), which might also handle Declarative Shadow DOM.

Example usage doesn't make sense

The "example usage" in the explainer does not actually affect the security properties of the code in question.

Your example actually encourages random censorship of HTML strings for no reason instead of rendering the HTML as an inert string, which is what the textContent setter would have done.

Was innerHTML meant to be used instead in this example?

document.getElementById("...").textContent =
  sanitizers.html.toString(user_supplied_value);

Allow/block list for attributes global, or per element?

It seems most sanitizers specify attributes to be alloed/blocked globally (e.g., always remove onclick=...). Others have suggested that this is ineffective (or leads to overblocking), and that attributes need to be specifies relative to the elements that hold them.

Cross-cutting concern: Ensure forward/backward compatibilty.

I suspect there will be a compability risk (across versions; not browsers), particularly for the pre-defined filter lists. When a browser adds e.g. a new event type & associated event handler, then the filter lists need to be updated as well. This might have surprising consequences for app developers.

Provide presets for both "script-free" and "fetch free"

Currently the design of the sanitizer API seems to only be concerned with XSS, this means blocking all forms of scripting (e.g. onevent, <script>, etc).

However a weaker class of attacks still exist that may be used to exfiltrate data about users, and that is the ability to trigger any form of request to a third party (or sometimes even first party) resource.

For example, this image tag could be used to exfiltrate how many people view a particular piece of user generated content:

<img src="https://evil.bad/honeypot.png">

Many other examples exist however, such as <a ping="...">, <link rel="stylesheet">, etc etc

As such I'd like to suggest having a second preset that will strip all elements/attributes that could trigger a fetch.

toString is reserved identifier.

toString is a reserved identifier in WebIDL. (See sentence starting with "The identifier of any".) We need to pick a different name.

Change default branch name to "main"

I plan to change the default branch name to main, my understanding is that GitHub's support should cause little interruptions for PRs that are already inflight, but will require local branches that are checked out on master to switch.
I must admit that my knowledge of unintended side-effects are a bit limited, so don't take my word for it ;-)

I don't have a specific date in mind, but noticed that official-W3C repos have already switched.

@otherdaniel, @iVanlIsh Please let me know if there's anything in progress which would be interrupted by this, otherwise I will make the change at the end of this week.

Consider handling of javascript: urls in navigation

Consider how to handle javascript: urls, e.g. in <a href="...">. Likely, these should be dropped unconditionally.

Error handling for string functions when parsing context element would be blocked by config.

I'm presently mucking around with an implementation of the API along the lines of #42 (comment) , to see whether it "feels" right. It mostly does, except for error handling: Since we now have an explicit context element... should the config apply to the context or not?

The issue comes up since several fairly simple test cases now break: There's a new test cases along the lines of:

  new Sanitizer({allowElements: ["p"]}).sanitize("<p>bla<em>bla</p>")  // "<p>blabla</p>

For re-writing the tests, I now need to pick a context (regardless of whether directly in the API, or indirectly by parsing separately). Most test don't use anything fancy, so I thought I'd just pick a <div> or <template>. In my current implementation, I'm also checking the context against the configuration (built-in & supplied), which means that neither context would work, since I'm only allowing <p>. That semms logical, but... fairly annoying. In particular, it means that you e.g. cannot sanitize into a <template> element without also allowing <template> in the tree.

On the other hand, if we would not check the context element against the configuration, then you could do something like .sanitizeFor("script", "alert(1)"), which seems like even worse.

  // Should this be: <template><p>blabla</p></template>, or null?
  new Sanitizer({allowElements: ["p"]}).sanitizeFor("template", "<p>bla<em>bla</p>")

  // Is this different from:
  ... .sanitizeFor("script", ...)
  some_script_element.SetInnerHTML("alert(1)")

What I have presently implemented:

The context element gets checked against the configuration. (And thus also against the baseline & defaults).
.sanitizeFor returns null if the context would have been removed.
.SetSanitizedHTML / SetHTML throws if the context element would have been removed (since it can't fullfil the contract.)

Solutions I'm seeing:

Check the context, as described above.
Never check the context.
Check the context, but with a more restricted check. (E.g. only against the baseline.)
???

How to build derived configurations?

In the current drafts, there is no way to build a configuration based on another one. In order to allow e.g. a single additional custom element on top of the defaults, the developer would need to copy the default config from the spec, add the one lement, and paste that into their app. That seems clumsy.

(And also goes a little against the "ever-green" aspect, in that this might cause outdated defaults to be persisted.)

It might be better to have a way a convenient way to build one config on top of another; in particular on top of the defaults.

Declarative Shadow DOM should be explicitly considered

The declarative Shadow DOM proposal adds a declarative way to add Shadow DOM to HTML. This opens a potential sanitizer bypass, through a closed shadow root. Given that this proposed sanitizer API lives in the browser, and can therefore "see" even closed shadow roots, it can still provide good filtering. But the current spec algorithm does not explicitly traverse into shadow roots. It should likely contain a new step #7, something like:

If node is a shadow host, run the sanitize a document fragment algorithm on node's shadow root DocumentFragment.

Or something similar.

Sanitizer vs Trusted Types

The resulting APIs should be integratable with Trusted Types.

prototype API

For the sake of prototyping, I would like to suggest an API that looks like this (but obviously subject to change):

You construct a Sanitizer object by supplying options (or accepting the default): new Sanitizer(options)
The sanitizer object has various functions depending on the return type required:
** sanitize() returning a DocumentFragment
** sanitizeToString() returning a DOMString
All of those functions should accept documentfragments, documents, strings

WebIDL, scrambled together and likely not entirely correct:

typedef (DOMString or DocumentFragment or Document) SanitizerInput;

[Exposed=Window]
interface Sanitizer {
  constructor(optional SanitizerOptions options); // optionality still discussed below
  DocumentFragment sanitize(optional SanitizerInput input);
  DOMString sanitizeToString(optional SanitizerInput input);
};

// unimplemented during prototyping
interface SanitizerOptions {}

validation of per-element / per-attribute contents

@craigfrancis wrote in #18.

I'd prefer the per-element custom config, rather than separate lists for tags and attributes.

But could it go a little further, so the attributes can specify the allowed content?

Short example:
{
  'div': {
      'class': 'css-classes'
    },
  'a': {
      'href': ['url', 'mailto'] // Not inline javascript
    },
  '*': {
      'title': 'string'
    }
}
Why add this extra complexity?

Some attributes can take safe and unsafe values, e.g. the a[href] accepting "javascript:"

You could give hashes for acceptable inline JS/CSS - personally I wouldn't allow 'unsafe-inline' code, but I have to accept that many websites do.

If unsafe attributes get added to the spec in the future, it would be nice for the developer to have clearly stated if they intended it to be a custom attribute (not realising that data-* attributes exist), or if they intended to use the newly introduced attribute. For example, they might have <input ontap="myclick" /> for their JS to read and work with, and a theoretical future browser could see that as some JS to run.

This wouldn't be used by most systems, but for a theoretical complicated example:
{
  'a': {
      'href': [
          'url',
          'mailto',
          'sha256-SOMTvqVfOViiCfDUw29p76/OVddUs0V7HyE0bATK3K8=' // For: javascript:alert()
        ],
      'onclick': [
          'sha256-ZlNiuEKoYsR3vT5/phF5QQzmxOHlto0Qb7NgKx0WwV8=' // For: window.open(this.href); return false;
        ],
    },
  'input': {
      'ontap': 'string', // Arbitrary string, if this becomes an unsafe attribute, the value is dropped.
      'onfocus': 'unsafe-inline', // Arbitrary JS, danger, danger, avoid.
      'onclick': [
          'sha256-biFQTroSCI3Z5BmsMGyEE2jFZdwjjG1Oe7JLytgH6jM=', // For: this.select()
          'sha256-hphOYdb9WX9dW4pYcQdXa8E450mGtzl7k4kSIg1GOIo='  // For: this.value=''
        ],
      'style': 'unsafe-inline', // Might need some more thoughts on this one.
      'size': 'number'
  }
}
Where 'string' and 'unsafe-js' are effectively the same, but 'unsafe-js' will always be let though (the developer has explicitly stated they are happy with this risk, and it's easy for auditors to find); whereas 'string', will only be let though if the browser knows the attribute can accept this as a safe value (on the basis that the attribute may change in the future, or be added to the spec).

What about DOM clobbering?

Do we want to include heuristics as to whether disallow e.g., id, name attributes or disallow generally?

rename to HTMLSanitizer

The sanitizer-api in this proposal only targets HTML-syntax.

Maybe sometime we want to habe a e.g. CSSSanitizer:

s = new CSSSanitizer({
    blockUrls: 'extern',
    allowVars: false,
})

Or a UnicodeSanitizer:

s = new UnicodeSanitizer({
    blockInvisibleChars: true,
    replaceDiacritics: true,
})

Add redirect for old URL

Consider adding the purification URL to https://github.com/WICG/wicg.github.io so it remains working.

collect and link to existing algorithms in specs that we integrate with

e.g., https://w3c.github.io/DOM-Parsing/#dfn-fragment-parsing-algorithm

Sanitizer API creating mock context-element can cause XSS when used in different context

The fragment parsing algorithm takes two arguments, a context element and a string. This API only takes a string. I guess that means it makes up its own context element, but this might not always be correct and I have a slight worry that the difference could be exploited at worst and at best lead to confusion.

(Related to #37.)

`javascript:` bypass via `<svg>` and `use`.

A friendly, clever person submitted the following as a bypass of the javascript: URL handling in https://wicg.github.io/sanitizer-api/#handle-funky-elements:

<div id=div></div>
<script>
div.replaceChildren(
new Sanitizer({
    "allowElements":["svg","use"],
    "allowAttributes":{"xlink:href":["use"]}}
).sanitize(`<svg>
    <use xlink:href='data:image/svg+xml,<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" id="x" viewBox="0 0 100 50" width="100%" height="100%"><a href="javascript:alert(1)"><circle r="100" /></a></svg>#x'/>
</svg>`));
</script>

bikeshed error: [Constructor] extended attribute is deprecated

FATAL ERROR: The [Constructor] extended attribute (on Sanitizer) is deprecated, please switch to a constructor() method.

Consider referencing element classes in the HTML specification

As part of the spec, large numbers of elements are referenced in the specification that are generated with a script.

I might make sense even in a non normative sense to reference which elements / attributes are excluded / included from the default / baseline.

If this can be done at least partially by referencing element kinds and deprecated features in the HTML specification this would make reading the specification intent easier.

can we rename this to sanitizer?

Preset for navigation-less

Similar to #57, provide a preset for "navigation-less" HTML. (I.e., block <a> & all of its friends).

How to identify elements to allow/block?

The current draft identifies elements and attributes by name + namespace. But Custom Elements allow to customize built-in elements with a different name.

(In Trusted Types, we decided to go by the constructor name of the base element.)

Bypass via `SVGAnimationElement.onend`.

A friendly, clever person notes that the SVGAnimationElement.onend attribute can be allowed via allowElements/allowAttributes, for example:

<div id=div></div>
<script>
div.replaceChildren(
new Sanitizer({
    "allowElements":["svg","set"],
    "allowAttributes":{
        "onend": ["set"],
        "dur":["set"]
    }
}).sanitize(`<svg><set onend="alert(1)" dur="1"/></svg>`));
</script>

It's not clear to me whether this is supposed to be allowed per spec (perhaps related to #72?), so this might just be a Chromium issue.

Proposed API: MathML, SVG support, plus localname case-handling.

Seperate issue, because this tries to pull together several existing issues in one.

SVG + MathML + namespaces. (#72 Also #84 + #85).
lowercasing (PR #97, and specifially Andrew's comments there)
normalization in config() getters (can't find a reference; I think this was discussed mainly in person.)

The current problems are:

SVG + MathML are unspecified. That's a bit of an embarrasment.
Some spec items (like lower-casing) interact with that in odd ways.
The config getters were originally specified as returning a copy of the config as it was originally passed in, while feedback suggested we should return a normalized copy that mirrors what exactly the API will actually handle.

This mainly picks up SecurityMB's comment in #72, and tries to extend it to cover anything.

In all places where we accept element names -- allow lists or block lists, or elements in attribute allow/block lists -- we will accept pseudo-namespaced element names "html:name", "svg:name", "math:name", or "*:name". The latter can also be written as "name". That is, any namespace is the default.
The prefixes are fixed strings, and will not attempt to process XML Namespaces-style namespace declarations, or anything similar.
The *: variant matches any element with the given localName, the html:, svg:, math: variants match against the localname, iff the parser has placed the elements into the corresponding namespace. (Since now all methods that (implicitly) parse have an explicit context, this ought to work with a rather low surprise factor.)
The html: and *: variants match case in-sensitively. The config returns them lower-cased. The other two match case-sensitively and the config getter returns their names as-is.
The config getters will return all *: names without the prefix. (I.e., "*.script" becomes "script".)
The config getters will drop malformed strings. (Like: "wondernamespace:script".)

WDYT?

I find the case-handling based on the namespace to be slightly surprising, but this matches what HTML parser / DOM does. If we embrace that, I think we the rest of the API is fairly straight-forward.

What should be in the default list?

Here are the parts I thought of:
default dropElements: "SCRIPT", "ANNOTATION-XML", "AUDIO", "COLGROUP", "DESC", "FOREIGNOBJECT", "HEAD", "IFRAME", "MATH", "MI", "MN", "MO", "MS", "MTEXT", "NOEMBED", "NOFRAMES", "PLAINTEXT", "STYLE", "SVG", "TEMPLATE", "THEAD", "TITLE", "VIDEO", "XMP" (copied from FORBID_CONTENTS)
default blockElements:
default dropAttributes: "on*", "src": ["EMBED"], "codeBase": ["OBJECT"], "data": ["OBJECT"] (partially from Trusted Types list)

Any opinions on this? @otherdaniel @koto @shhnjk @mozfreddyb

Context and meaning of "sanitized"

Can we consider more concretely what the context and meaning of "sanitized" looks like in this and referenced efforts.

For sanity to hold, branding something "safe" should do so in ways that align with the semantics of the branded entities — ie makeSafe('<link href=x/>') === '<link href=x/>' then it is only natural to expect isSafe(makeSafe('<link href=x/>')) === isSafe('<link href=x/>').
For safety to hold, branding something "safe" does not mean that it cannot be unsafe in the wrong capacity — ie injected in the wrong place, or at the very least being exposed to a masquerading API/sink.

Those are some aspects I know I am hoping to preserve when branding something safe.

I am sure a lot of consideration has gone into this already, and maybe I missed some aspects in the current draft, so is there a good way to address this dimension more visibly?

What about data URLs?

DOMPurify allows for audio,video,img,source,image,track only. Browsers are blocking top-level navigations by now..Is that enough?

Rethink how we make sanitizeToString an mXSS-safe method?

Interesting blog post came out today about recent DOMPurify bypass:
https://research.securitum.com/mutation-xss-via-mathml-mutation-dompurify-2-0-17-bypass/

The blog post ends with interesting conclusion:

The bypass also made me realize that the pattern of
div.innerHTML = DOMPurify.sanitize(html)
is prone to mutation XSS-es by design and it’s just a matter of time to find another instances. I strongly suggest that you pass RETURN_DOM or RETURN_DOM_FRAGMENT options to DOMPurify, so that the serialize-parse roundtrip is not executed.

This is equivalent to our sanitizeToString. Maybe we should drop the support of sanitizeToString all together?
I think sanitize should work for most of cases, and developers can create sanitizeToString behavior by themselves if they really need to, but we can choose to not recommend this for mXSS purpose.

WDYT?

Consider whether DOM serialization is sufficient

This is related to, and maybe duplicate of, #37 and whatwg/html#6235.

In the (very nice) example given in whatwg/html#6235, the key seems to be that parsing will treat the <p> element differently, but serialization will do the standard in-order serialization of the tree, resulting in a serialization that is arguably broken. For html, that is unfortunate, but I think for the Sanitizer API this is deeply broken.

Since the parsing rules are known, we should consider to either adjust the serialization to match, or to alternatively fail serialization is certain conditinos are met. (Such as, a <p> inside an element that the HTML parser will refuse to put there.)

Sanitizer config introspection.

This came up during TAG review (w3ctag/design-reviews#619): "Exposing the config on Sanitizer instances (possibly read-only) could provide an easy solution for this use case. Exposing the default config as a static property on Sanitizer might be helpful too."

A similar thought came up in #64.
A similar mechanism existed in very early versions of our (Chrome's) implementation (.createOptions property)
TAG considers whether this should be elevated to a design principle: w3ctag/design-principles#300

I think the specific question raised by TAG (how to block a specific element) can be resolved elsehow (namely by blockElements:), but I think the general question remains: Should we support query-ing any sanitizer instance's config, and/or the default config, with the primary use case being building a derived configuration.

Handling of Defaults

How should defaults be handled?

Had a brief discussion with @iVanlIsh about this, and there seem to be two trains of thought:

Just have a single, well-chosen default, so that developers can simply call sanitize(...) and get good results.
There isn't really a single, obvious default, but rather a handful of base use-cases. E.g., plain text-only; simple formatting; anything-goes-except-JS. There should be default-configs for each of those, instead of a single global default.

I think the main consideration here is developer ergonomics, on both sides: The first is really, really simple. The second takes the point of view that this is arguably too simple, and that set-of-defaults is better suited,

The current version of the explainer reflects no. 2 above, which happens to be my opinion. However, we don't have consensus yet.

Do we want to keep the content of known-OK tags?

DOMPurify has a concept of KEEP_CONTENT where it does not remove the element and all its child nodes, but copies the content out of the node (via something like node.insertAdjacentHTML('afterend', node.innerHTML)), removes the element and continues iterating over the tree. This only works for elements that aren't in an explicit FORBID list