I think there is some misunderstanding about "absolute" and "relative" with respect to URLs. These two terms do not refer to the URL path, but rather to the existence (or non-existence) of a scheme (e.g., "http"). Although parsing the URL forms mentioned above is possible without a base URL, reference resolution (RFC 3986 sec. 5) requires a base URL if the path itself is not absolute. For most headers, the base is the URL used to retrieve the headers (although the Link header allows the base to be overridden using "anchor"--see RFC 5988). This library currently combines parsing and resolution. Perhaps separation of parsing and reference resolution should be considered.
from rust-url.
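The parse/resolve separation suggested above could be sketched roughly as follows. All type and function names here are invented for illustration, not rust-url's actual API, and the resolution arm is grossly simplified compared to the full algorithm in RFC 3986 sec. 5.3:

```rust
// Hypothetical sketch: parsing a reference is kept separate from
// resolving it against a base (RFC 3986 sec. 5). Names are invented.

/// A parsed reference: either absolute (has a scheme) or relative.
#[derive(Debug, PartialEq)]
enum Reference {
    Absolute { scheme: String, rest: String },
    Relative(String),
}

/// Step 1: parse without needing a base URL.
fn parse_reference(input: &str) -> Reference {
    // A scheme is ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) then ":".
    if let Some(colon) = input.find(':') {
        let scheme = &input[..colon];
        let valid = scheme
            .chars()
            .next()
            .map_or(false, |c| c.is_ascii_alphabetic())
            && scheme
                .chars()
                .all(|c| c.is_ascii_alphanumeric() || "+-.".contains(c));
        if valid {
            return Reference::Absolute {
                scheme: scheme.to_ascii_lowercase(),
                rest: input[colon + 1..].to_string(),
            };
        }
    }
    Reference::Relative(input.to_string())
}

/// Step 2: resolution happens only when the caller supplies a base.
fn resolve(base: &str, reference: &Reference) -> String {
    match reference {
        Reference::Absolute { scheme, rest } => format!("{}:{}", scheme, rest),
        // Grossly simplified path merge; the real algorithm also
        // removes dot segments and handles query/fragment.
        Reference::Relative(path) => {
            let dir = base.rfind('/').map_or("", |i| &base[..i + 1]);
            format!("{}{}", dir, path)
        }
    }
}

fn main() {
    let r = parse_reference("foo");
    // Resolution is a distinct step with an explicit base.
    assert_eq!(resolve("http://example.com/a/b", &r), "http://example.com/a/foo");
    println!("{:?}", parse_reference("http://example.com/"));
}
```

With this split, a header parser that only ever sees absolute URLs never needs to mention a base at all, while Location handling can resolve in a second step.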
[...] we would prefer not to have to specify a base URL. The algorithm specified by http://url.spec.whatwg.org/ seems to work this way.
I’m not sure what you mean by this. The spec algorithm specifically aborts on relative URLs without a base:
http://url.spec.whatwg.org/#no-scheme-state
no scheme state
If base is null, or base's scheme is not a relative scheme, parse error, return failure.
But maybe there are still use cases for this that are not considered in the spec. I think the first thing to do is determine the data structures involved.
How do you handle Request-URI? For example, I don’t think GET ../foo HTTP/1.1 is valid.
from rust-url.
My point with Request-URI is that there are different kinds of relative URLs. The spec calls them scheme-relative, absolute-path-relative, and path-relative.
from rust-url.
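Those three kinds could be distinguished with a small wrapper enum. This classifier is a hypothetical sketch, not anything from rust-url or the spec:

```rust
// Sketch of the three relative-reference kinds named in the spec.
// The enum and classifier are illustrative only.

#[derive(Debug, PartialEq)]
enum RelativeKind {
    /// "//host/path" — takes only the scheme from the base.
    SchemeRelative,
    /// "/path" — takes the scheme and authority from the base.
    AbsolutePathRelative,
    /// "path", "../path" — merged with the base's path.
    PathRelative,
}

fn classify_relative(input: &str) -> RelativeKind {
    if input.starts_with("//") {
        RelativeKind::SchemeRelative
    } else if input.starts_with('/') {
        RelativeKind::AbsolutePathRelative
    } else {
        RelativeKind::PathRelative
    }
}

fn main() {
    assert_eq!(classify_relative("//example.com/x"), RelativeKind::SchemeRelative);
    assert_eq!(classify_relative("/x/y"), RelativeKind::AbsolutePathRelative);
    assert_eq!(classify_relative("../foo"), RelativeKind::PathRelative);
    println!("ok");
}
```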
Ah, I misread that part. That is unfortunate. Did I say already that I don't like that spec?
At present Teepee doesn't handle the RequestURI at all; it's pretty early yet. We would of course want to be able to parse exactly the pieces mentioned in the spec, namely absolute URLs, absolute paths, and authorities, plus whatever HTTP 0.9, HTTP 1.0 and HTTP 2.0 say :).
I like the BNF-style specifications over the more modern WhatWG spec because I can make a function that parses each named piece (even if I'm not using parser combinators). Perhaps rust-url needs two interfaces: one that exposes the WhatWG parser (or one with the same effect) and another that exposes more of the individual pieces. The former can call the latter, of course. Or perhaps it should just ignore that aspect of the WhatWG spec and just produce a Uri instance with pieces missing.
from rust-url.
Yeah, I agree that the current state-machine-style parser in the spec is terrible. I’ve convinced @annevk that a recursive-descent-ish style (like CSS Syntax) would be better, but he’s waiting on me to finish up rust-url to do that.
rust-url is already written as a bunch of functions that call each other, and I think several of them will end up in the public API.
The spec was also (I believe) written with browsers in mind, and may not consider some things that the server-side needs. Again, once we determine what exactly is needed and in what use cases, I think we can convince @annevk to add stuff to the spec.
As to BNF grammars, the problem is that they often do not define error handling. If you just reject anything that doesn’t match the grammar, that’s fine. But it’s not good enough if you want to do any more subtle error recovery.
from rust-url.
The CSS Syntax spec is indeed better (and I like state machines), but still not really to my liking. I feel like they should just be written in a programming language if they want that degree of control over the implementations.
As to BNF grammars, the problem is that they often do not define error handling. If you just reject anything that doesn’t match the grammar, that’s fine. But it’s not good enough if you want to do any more subtle error recovery.
Fair enough.
from rust-url.
I’m happy to debate the merits of different spec writing styles, but this issue is probably not the space to do it :) Feel free to email me or ping me in some other forum.
from rust-url.
Yes, it's certainly off-topic.
I think that about covers it, though; we need to be able to parse specific syntax constructs, preferably with different types for each or a wrapper enum distinguishing them. A function to call for each syntax construct that fails if it finds others is fine (i.e., a function that parses absolute URLs and fails when given a relative URL, provided there's another function that parses relative URLs and fails when given an absolute URL), or a function that parses anything it's given but distinguishes them on return is also fine.
from rust-url.
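The two acceptable API shapes described above (per-construct parsers that fail on anything else, and a single parser that reports which construct it found) might look like this. Everything here is an invented sketch, not rust-url's API:

```rust
// Sketch of the two API shapes from the comment above. Names invented.

#[derive(Debug, PartialEq)]
enum Parsed {
    AbsoluteUrl(String),
    RelativeRef(String),
}

/// Per-construct parser: fails unless the input has a scheme.
fn parse_absolute(input: &str) -> Result<String, &'static str> {
    if has_scheme(input) { Ok(input.to_string()) } else { Err("not absolute") }
}

/// Per-construct parser: fails unless the input is a relative reference.
fn parse_relative(input: &str) -> Result<String, &'static str> {
    if has_scheme(input) { Err("not relative") } else { Ok(input.to_string()) }
}

/// Single parser that accepts either and tells the caller which it found.
fn parse_any(input: &str) -> Parsed {
    if has_scheme(input) {
        Parsed::AbsoluteUrl(input.to_string())
    } else {
        Parsed::RelativeRef(input.to_string())
    }
}

fn has_scheme(input: &str) -> bool {
    match input.find(':') {
        Some(i) => {
            let s = &input[..i];
            s.chars().next().map_or(false, |c| c.is_ascii_alphabetic())
                && s.chars().all(|c| c.is_ascii_alphanumeric() || "+-.".contains(c))
        }
        None => false,
    }
}

fn main() {
    assert!(parse_absolute("http://e.com/").is_ok());
    assert!(parse_absolute("../x").is_err());
    assert_eq!(parse_any("../x"), Parsed::RelativeRef("../x".into()));
    println!("ok");
}
```

The first shape suits grammars like Request-URI, where each alternate is tried in turn; the second suits callers that accept anything but need to know what they got.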
@db48x you need to read the HTTP errata. Location needs to be parsed as a relative URL. See also http://fetch.spec.whatwg.org/ for how to deal with multiple Location headers and such (HTTP does not cover that).
For Host there is actually a distinct host parser defined by the specification.
from rust-url.
(Note that with service workers browsers effectively have a proxy server.)
from rust-url.
Fair enough. Doesn't change the requirements though.
from rust-url.
Something else to determine: are these headers in a specific encoding so we can decode to Unicode before parsing, or should we add support for parsing URLs from bytes?
from rust-url.
URLs are ASCII, and Teepee is using Vec<u8>.
from rust-url.
Vec<u8>? What happens if I send a request with non-ASCII bytes?
from rust-url.
HTTP headers are all ASCII. Non-ASCII bytes are either in the body or in the URL. If they're in the URL, then they've been percent-encoded or run through IDNA.
from rust-url.
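As a rough illustration of why a URL on the wire ends up ASCII-only, here is a simplified percent-encoder. Real encode sets escape more ASCII characters than this (and which set applies depends on the URL component), so treat it as a sketch:

```rust
// Sketch: any non-ASCII (or non-printable, or '%') byte in a URL
// component is replaced by %XX before transmission, so the wire form
// is pure ASCII. Real encode sets escape additional ASCII characters.

fn percent_encode_non_ascii(input: &[u8]) -> String {
    let mut out = String::new();
    for &b in input {
        if b.is_ascii_graphic() && b != b'%' {
            out.push(b as char);
        } else {
            out.push_str(&format!("%{:02X}", b));
        }
    }
    out
}

fn main() {
    // "café" in UTF-8 is the bytes 63 61 66 C3 A9.
    assert_eq!(percent_encode_non_ascii("café".as_bytes()), "caf%C3%A9");
    println!("ok");
}
```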
HTTP headers are all ASCII.
That sounds surprising to me, but even supposing that all headers are indeed supposed to be ASCII, nothing prevents me from sending an invalid request with non-ASCII bytes. And I’m sure that some broken client somewhere does so. What should the server do in this case?
from rust-url.
For the moment I'm only concerned with doing what the spec says. Most of the headers don't have any useful way to put in arbitrary text, so it won't be any different than any other type of malformed header that you can simply ignore. Because we have a byte vector we can always turn individual values into strings later if it turns out to be useful.
from rust-url.
Header values are not ASCII. Far from it. The ones taking URLs should probably be ISO-8859-1 decoded (the original, not windows-1252).
from rust-url.
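Decoding as original ISO-8859-1 is trivial because it is a one-to-one mapping from byte values to the first 256 Unicode code points, which is exactly where it differs from windows-1252 (that encoding remaps 0x80–0x9F to printable characters). A minimal sketch:

```rust
// Original ISO-8859-1: byte value n decodes to code point U+00n.
// windows-1252 instead maps 0x80–0x9F to characters like "€".

fn latin1_decode(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

fn main() {
    assert_eq!(latin1_decode(&[0x63, 0x61, 0x66, 0xE9]), "café");
    // 0x80 stays the control character U+0080, not windows-1252's "€".
    assert_eq!(latin1_decode(&[0x80]), "\u{80}");
    println!("ok");
}
```

Because this mapping is total, decoding a header value this way can never fail, whatever bytes a broken client sends.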
@db48x I don't see how it does not change the requirements. Also note that if you want to parse an absolute URL, you can simply not pass in a base URL. We do that for ws/wss URLs in HTML.
from rust-url.
@db48x I don't see how it does not change the requirements. Also note that if you want to parse an absolute URL, you can simply not pass in a base URL. We do that for ws/wss URLs in HTML.
It doesn't change the requirements because we still need to be able to parse absolute/relative/abs-path+query/host/etc. individually, or to have a single parse function that parses any of those and tells you which it found, both without having to supply any additional information such as base URLs.
rust-url already provides a host parser, and Teepee already uses it. Likewise, when it wants to parse an absolute URL and nothing else, it passes in None for the base URL. That still leaves the other bits, though.
Header values are not ASCII. Far from it. The ones taking URLs should probably be ISO-8859-1 decoded (the original, not windows-1252).
Sure, I'm not suggesting we error out if it's not strictly 7-bit ASCII. All of the important characters in an HTTP header or URL are taken from ASCII, though. Ideally we would never scan over the URL to convert it to UTF-8 and then scan it again for parsing; we can just parse it while it's still a byte vector. A UTF-8 continuation byte always has its high bit set, and thus will never match one of the characters we care about, so we'll never accidentally break up a multibyte UTF-8 character.
from rust-url.
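The point about byte-wise parsing can be demonstrated with a small sketch: every URL delimiter is an ASCII byte with the high bit clear, while every byte of a multi-byte UTF-8 sequence has the high bit set, so scanning raw bytes never splits a character:

```rust
// Sketch: splitting a URL on '?' directly over bytes. Since b'?' is
// 0x3F (high bit clear) and every byte of a multi-byte UTF-8 sequence
// is >= 0x80, the scan can never land inside such a sequence.

fn split_on_query(input: &[u8]) -> (&[u8], Option<&[u8]>) {
    match input.iter().position(|&b| b == b'?') {
        Some(i) => (&input[..i], Some(&input[i + 1..])),
        None => (input, None),
    }
}

fn main() {
    // 'é' is the bytes C3 A9 — neither byte can equal b'?' (0x3F).
    let (path, query) = split_on_query("/café?x=1".as_bytes());
    assert_eq!(path, "/café".as_bytes());
    assert_eq!(query, Some("x=1".as_bytes()));
    println!("ok");
}
```

The same argument applies to every other delimiter (`/`, `#`, `:`, `@`), which is what makes a single-pass byte-level parser safe.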
As far as HTTP header encoding is concerned, the story is significantly more complex than ISO-8859-1 or Windows-1252; in many places arbitrary encodings are permitted, encoded according to the rules in RFC 2047. The specs are not clear about where these sorts of things can happen, unfortunately…
from rust-url.
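For reference, an RFC 2047 encoded-word has the shape `=?charset?encoding?encoded-text?=`. This sketch only splits a well-formed encoded-word into its parts; actual B/Q decoding and charset conversion are omitted:

```rust
// Sketch: recognizing the RFC 2047 encoded-word shape
// =?charset?encoding?encoded-text?= and splitting out its parts.
// Real decoding of the B/Q encodings is deliberately left out.

fn split_encoded_word(input: &str) -> Option<(&str, &str, &str)> {
    let inner = input.strip_prefix("=?")?.strip_suffix("?=")?;
    let mut parts = inner.splitn(3, '?');
    let charset = parts.next()?;
    let encoding = parts.next()?;
    let text = parts.next()?;
    Some((charset, encoding, text))
}

fn main() {
    assert_eq!(
        split_encoded_word("=?ISO-8859-1?Q?caf=E9?="),
        Some(("ISO-8859-1", "Q", "caf=E9"))
    );
    assert_eq!(split_encoded_word("plain text"), None);
    println!("ok");
}
```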
I didn't know that one applied to HTTP…
from rust-url.
Yeah, mostly for *TEXT, so it theoretically shouldn’t be very widespread. The spec is very poor about specifying where it’s really allowed, so it’ll be one of these matters of discovering what everyone else does and more or less ignoring the specs.
But my recollection for URLs is that they should all be in canonical form which is a strict subset of ASCII. Anyway, Teepee’s raw headers are defined in terms of the raw bytes.
from rust-url.
Why do you need that? You haven't justified your requirements.
As for URLs in HTTP: URLs should be in ASCII over HTTP, but they are not always. And if you want interoperability, you need original ISO-8859-1 handling. Just test in browsers.
from rust-url.
Why do you need that? You haven't justified your requirements.
Why do we need what?
from rust-url.
we still need to be able to parse absolute/relative/abs-path+query/host/etc individually
from rust-url.
Ah; I thought my initial comment made that fairly clear.
For example, the HTTP Errata http://skrb.org/ietf/http_errata.html says this:
The definition of Request-URI should be:
Request-URI = "*" | absoluteURI | abs_path [ "?" query ] | authority
So when I write a parser for this header I need to call a function in rust-url that either parses an absoluteURI and fails on a relativeURI, or parses either one but identifies which was parsed, so that I can reject relative URIs and go on to the other alternates. Naturally, a function which parses any of these and tells me which it was would also work. I'll leave that up to Simon.
(Obviously Teepee would handle the asterisk in any case; that's specific to HTTP rather than part of a URI.)
from rust-url.
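Such a classifier for the errata grammar might be sketched as follows. Note that the authority form is ambiguous with absoluteURI on grammar alone (e.g. e.com:443 also parses as a scheme followed by a path), so this sketch disambiguates by request method, as HTTP does in practice (only CONNECT uses the authority form). All names are invented:

```rust
// Sketch of the Request-URI alternates from the errata grammar:
//   Request-URI = "*" | absoluteURI | abs_path [ "?" query ] | authority
// Classification only; each arm would hand off to a real parser.

#[derive(Debug, PartialEq)]
enum RequestUri<'a> {
    Asterisk,
    AbsoluteUri(&'a str),
    AbsPath(&'a str),
    Authority(&'a str),
}

fn classify_request_uri<'a>(method: &str, input: &'a str) -> RequestUri<'a> {
    if input == "*" {
        RequestUri::Asterisk
    } else if method == "CONNECT" {
        // host:port — grammatically this could also be an absoluteURI
        // ("e.com" is a valid scheme), so the method decides.
        RequestUri::Authority(input)
    } else if input.starts_with('/') {
        RequestUri::AbsPath(input)
    } else {
        RequestUri::AbsoluteUri(input)
    }
}

fn main() {
    assert_eq!(classify_request_uri("OPTIONS", "*"), RequestUri::Asterisk);
    assert_eq!(classify_request_uri("GET", "/a?b=c"), RequestUri::AbsPath("/a?b=c"));
    assert_eq!(
        classify_request_uri("GET", "http://e.com/"),
        RequestUri::AbsoluteUri("http://e.com/")
    );
    assert_eq!(
        classify_request_uri("CONNECT", "e.com:443"),
        RequestUri::Authority("e.com:443")
    );
    println!("ok");
}
```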
The new RFCs might be relevant. (And obviously for the rest of Teepee too.)
from rust-url.
Closing in favor of https://docs.rs/http/0.1.13/http/uri/struct.Uri.html, which is designed specifically for representing the request-target on an HTTP request.
from rust-url.