I think there is some misunderstanding about "absolute" and "relative" with respect to URLs. These two terms do not refer to the URL path, but rather to the existence (or non-existence) of a scheme (e.g., "http"). Although parsing the URL forms mentioned above is possible without a base URL, reference resolution (RFC 3986 sec. 5) requires a base URL if the path itself is not absolute. For most headers, the base is the URL used to retrieve the headers (although the Link header allows the base to be overridden using "anchor"--see RFC 5988). This library currently combines parsing and resolution. Perhaps separation of parsing and reference resolution should be considered.
from rust-url.
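The parse/resolve separation suggested above could be sketched roughly as follows. All type and function names here are invented for illustration, not rust-url's actual API, and the resolution arm is grossly simplified compared to the full algorithm in RFC 3986 sec. 5.3:

```rust
// Hypothetical sketch: parsing a reference is kept separate from
// resolving it against a base (RFC 3986 sec. 5). Names are invented.

/// A parsed reference: either absolute (has a scheme) or relative.
#[derive(Debug, PartialEq)]
enum Reference {
    Absolute { scheme: String, rest: String },
    Relative(String),
}

/// Step 1: parse without needing a base URL.
fn parse_reference(input: &str) -> Reference {
    // A scheme is ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) then ":".
    if let Some(colon) = input.find(':') {
        let scheme = &input[..colon];
        let valid = scheme
            .chars()
            .next()
            .map_or(false, |c| c.is_ascii_alphabetic())
            && scheme
                .chars()
                .all(|c| c.is_ascii_alphanumeric() || "+-.".contains(c));
        if valid {
            return Reference::Absolute {
                scheme: scheme.to_ascii_lowercase(),
                rest: input[colon + 1..].to_string(),
            };
        }
    }
    Reference::Relative(input.to_string())
}

/// Step 2: resolution happens only when the caller supplies a base.
fn resolve(base: &str, reference: &Reference) -> String {
    match reference {
        Reference::Absolute { scheme, rest } => format!("{}:{}", scheme, rest),
        // Grossly simplified path merge; the real algorithm also
        // removes dot segments and handles query/fragment.
        Reference::Relative(path) => {
            let dir = base.rfind('/').map_or("", |i| &base[..i + 1]);
            format!("{}{}", dir, path)
        }
    }
}

fn main() {
    let r = parse_reference("foo");
    // Resolution is a distinct step with an explicit base.
    assert_eq!(resolve("http://example.com/a/b", &r), "http://example.com/a/foo");
    println!("{:?}", parse_reference("http://example.com/"));
}
```

With this split, a header parser that only ever sees absolute URLs never needs to mention a base at all, while Location handling can resolve in a second step.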
[...] we would prefer not to have to specify a base URL. The algorithm specified by http://url.spec.whatwg.org/ seems to work this way.
I’m not sure what you mean by this. The spec algorithm specifically aborts on relative URLs without a base:
http://url.spec.whatwg.org/#no-scheme-state
no scheme state
If base is null, or base's scheme is not a relative scheme, parse error, return failure.
But maybe there are still use cases for this that are not considered in the spec. I think the first thing to do is determine the data structures involved.
How do you handle Request-URI? For example, I don’t think GET ../foo HTTP/1.1 is valid.
from rust-url.
My point with Request-URI is that there are different kinds of relative URLs. The spec calls them scheme-relative, absolute-path-relative, and path-relative.
from rust-url.
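Those three kinds could be distinguished with a small wrapper enum. This classifier is a hypothetical sketch, not anything from rust-url or the spec:

```rust
// Sketch of the three relative-reference kinds named in the spec.
// The enum and classifier are illustrative only.

#[derive(Debug, PartialEq)]
enum RelativeKind {
    /// "//host/path" — takes only the scheme from the base.
    SchemeRelative,
    /// "/path" — takes the scheme and authority from the base.
    AbsolutePathRelative,
    /// "path", "../path" — merged with the base's path.
    PathRelative,
}

fn classify_relative(input: &str) -> RelativeKind {
    if input.starts_with("//") {
        RelativeKind::SchemeRelative
    } else if input.starts_with('/') {
        RelativeKind::AbsolutePathRelative
    } else {
        RelativeKind::PathRelative
    }
}

fn main() {
    assert_eq!(classify_relative("//example.com/x"), RelativeKind::SchemeRelative);
    assert_eq!(classify_relative("/x/y"), RelativeKind::AbsolutePathRelative);
    assert_eq!(classify_relative("../foo"), RelativeKind::PathRelative);
    println!("ok");
}
```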
Ah, I misread that part. That is unfortunate. Did I say already that I don't like that spec?
At present Teepee doesn't handle the RequestURI at all; it's pretty early yet. We would of course want to be able to parse exactly the pieces mentioned in the spec, namely absolute URLs, absolute paths, and authorities, plus whatever HTTP 0.9, HTTP 1.0 and HTTP 2.0 say :).
I like the BNF-style specifications over the more modern WhatWG spec because I can make a function that parses each named piece (even if I'm not using parser combinators). Perhaps rust-url needs two interfaces: one that exposes the WhatWG parser (or one with the same effect) and another that exposes more of the individual pieces. The former can call the latter, of course. Or perhaps it should just ignore that aspect of the WhatWG spec and just produce a Uri instance with pieces missing.
from rust-url.
Yeah, I agree that the current state-machine-style parser in the spec is terrible. I’ve convinced @annevk that a recursive-descent-ish style (like CSS Syntax) would be better, but he’s waiting on me to finish up rust-url to do that.
rust-url is already written as a bunch of functions that call each other, and I think several of them will end up in the public API.
The spec was also (I believe) written with browsers in mind, and may not consider some things that the server-side needs. Again, once we determine what exactly is needed and in what use cases, I think we can convince @annevk to add stuff to the spec.
As to BNF grammars, the problem is that they often do not define error handling. If you just reject anything that doesn’t match the grammar, that’s fine. But it’s not good enough if you want to do any more subtle error recovery.
from rust-url.
The CSS Syntax spec is indeed better (and I like state machines), but still not really to my liking. I feel like they should just be written in a programming language if they want that degree of control over the implementations.
As to BNF grammars, the problem is that they often do not define error handling. If you just reject anything that doesn’t match the grammar, that’s fine. But it’s not good enough if you want to do any more subtle error recovery.
Fair enough.
from rust-url.
I’m happy to debate the merits of different spec writing styles, but this issue is probably not the space to do it :) Feel free to email me or ping me in some other forum.
from rust-url.
Yes, it's certainly off-topic.
I think that about covers it, though; we need to be able to parse specific syntax constructs, preferably with different types for each or a wrapper enum distinguishing them. A function to call for each syntax construct that fails if it finds others is fine (i.e., a function that parses absolute URLs and fails when given a relative URL, provided there's another function that parses relative URLs and fails when given an absolute URL), or a function that parses anything it's given but distinguishes them on return is also fine.
from rust-url.
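The two acceptable API shapes described above (per-construct parsers that fail on anything else, and a single parser that reports which construct it found) might look like this. Everything here is an invented sketch, not rust-url's API:

```rust
// Sketch of the two API shapes from the comment above. Names invented.

#[derive(Debug, PartialEq)]
enum Parsed {
    AbsoluteUrl(String),
    RelativeRef(String),
}

/// Per-construct parser: fails unless the input has a scheme.
fn parse_absolute(input: &str) -> Result<String, &'static str> {
    if has_scheme(input) { Ok(input.to_string()) } else { Err("not absolute") }
}

/// Per-construct parser: fails unless the input is a relative reference.
fn parse_relative(input: &str) -> Result<String, &'static str> {
    if has_scheme(input) { Err("not relative") } else { Ok(input.to_string()) }
}

/// Single parser that accepts either and tells the caller which it found.
fn parse_any(input: &str) -> Parsed {
    if has_scheme(input) {
        Parsed::AbsoluteUrl(input.to_string())
    } else {
        Parsed::RelativeRef(input.to_string())
    }
}

fn has_scheme(input: &str) -> bool {
    match input.find(':') {
        Some(i) => {
            let s = &input[..i];
            s.chars().next().map_or(false, |c| c.is_ascii_alphabetic())
                && s.chars().all(|c| c.is_ascii_alphanumeric() || "+-.".contains(c))
        }
        None => false,
    }
}

fn main() {
    assert!(parse_absolute("http://e.com/").is_ok());
    assert!(parse_absolute("../x").is_err());
    assert_eq!(parse_any("../x"), Parsed::RelativeRef("../x".into()));
    println!("ok");
}
```

The first shape suits grammars like Request-URI, where each alternate is tried in turn; the second suits callers that accept anything but need to know what they got.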
@db48x you need to read the HTTP errata. Location needs to be parsed as a relative URL. See also http://fetch.spec.whatwg.org/ for how to deal with multiple Location headers and such (HTTP does not cover that).
For Host there is actually a distinct host parser defined by the specification.
from rust-url.
(Note that with service workers browsers effectively have a proxy server.)
from rust-url.
Fair enough. Doesn't change the requirements though.
from rust-url.
Something else to determine: are these headers in a specific encoding so we can decode to Unicode before parsing, or should we add support for parsing URLs from bytes?
from rust-url.
URLs are ASCII, and Teepee is using Vec<u8>.
from rust-url.
Vec<u8>? What happens if I send a request with non-ASCII bytes?
from rust-url.
HTTP headers are all ASCII. Non-ASCII bytes are either in the body or in the URL. If they're in the URL, then they've been percent-encoded or run through IDNA.
from rust-url.
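As a rough illustration of why a URL on the wire ends up ASCII-only, here is a simplified percent-encoder. Real encode sets escape more ASCII characters than this (and which set applies depends on the URL component), so treat it as a sketch:

```rust
// Sketch: any non-ASCII (or non-printable, or '%') byte in a URL
// component is replaced by %XX before transmission, so the wire form
// is pure ASCII. Real encode sets escape additional ASCII characters.

fn percent_encode_non_ascii(input: &[u8]) -> String {
    let mut out = String::new();
    for &b in input {
        if b.is_ascii_graphic() && b != b'%' {
            out.push(b as char);
        } else {
            out.push_str(&format!("%{:02X}", b));
        }
    }
    out
}

fn main() {
    // "café" in UTF-8 is the bytes 63 61 66 C3 A9.
    assert_eq!(percent_encode_non_ascii("café".as_bytes()), "caf%C3%A9");
    println!("ok");
}
```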
HTTP headers are all ASCII.
That sounds surprising to me, but even supposing that all headers are indeed supposed to be ASCII, nothing prevents me from sending an invalid request with non-ASCII bytes. And I’m sure that some broken client somewhere does so. What should the server do in this case?
from rust-url.
For the moment I'm only concerned with doing what the spec says. Most of the headers don't have any useful way to put in arbitrary text, so it won't be any different than any other type of malformed header that you can simply ignore. Because we have a byte vector we can always turn individual values into strings later if it turns out to be useful.
from rust-url.
Header values are not ASCII. Far from it. The ones taking URLs should probably be ISO-8859-1 decoded (the original, not windows-1252).
from rust-url.
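Decoding as original ISO-8859-1 is trivial because it is a one-to-one mapping from byte values to the first 256 Unicode code points, which is exactly where it differs from windows-1252 (that encoding remaps 0x80–0x9F to printable characters). A minimal sketch:

```rust
// Original ISO-8859-1: byte value n decodes to code point U+00n.
// windows-1252 instead maps 0x80–0x9F to characters like "€".

fn latin1_decode(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

fn main() {
    assert_eq!(latin1_decode(&[0x63, 0x61, 0x66, 0xE9]), "café");
    // 0x80 stays the control character U+0080, not windows-1252's "€".
    assert_eq!(latin1_decode(&[0x80]), "\u{80}");
    println!("ok");
}
```

Because this mapping is total, decoding a header value this way can never fail, whatever bytes a broken client sends.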
@db48x I don't see how it does not change the requirements. Also note that if you want to parse an absolute URL, you can simply not pass in a base URL. We do that for ws/wss URLs in HTML.
from rust-url.
@db48x I don't see how it does not change the requirements. Also note that if you want to parse an absolute URL, you can simply not pass in a base URL. We do that for ws/wss URLs in HTML.
It doesn't change the requirements because we still need to be able to parse absolute/relative/abs-path+query/host/etc. individually, or to have a single parse function that parses any of those and tells you which it found, both without having to supply any additional information such as base URLs.
rust-url already provides a host parser, and Teepee already uses it. Likewise, when it wants to parse an absolute URL and nothing else, it passes in None for the base URL. That still leaves the other bits, though.
Header values are not ASCII. Far from it. The ones taking URLs should probably be ISO-8859-1 decoded (the original, not windows-1252).
Sure, I'm not suggesting we error out if it's not strictly 7-bit ASCII. All of the important characters in an HTTP header or URL are taken from ASCII, though. Ideally we would never scan over the URL to convert it to UTF-8 and then scan it again for parsing; we can just parse it while it's still a byte vector. A UTF-8 continuation byte always has its high bit set, and thus will never match one of the characters we care about, so we'll never accidentally break up a multibyte UTF-8 character.
from rust-url.
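The point about byte-wise parsing can be demonstrated with a small sketch: every URL delimiter is an ASCII byte with the high bit clear, while every byte of a multi-byte UTF-8 sequence has the high bit set, so scanning raw bytes never splits a character:

```rust
// Sketch: splitting a URL on '?' directly over bytes. Since b'?' is
// 0x3F (high bit clear) and every byte of a multi-byte UTF-8 sequence
// is >= 0x80, the scan can never land inside such a sequence.

fn split_on_query(input: &[u8]) -> (&[u8], Option<&[u8]>) {
    match input.iter().position(|&b| b == b'?') {
        Some(i) => (&input[..i], Some(&input[i + 1..])),
        None => (input, None),
    }
}

fn main() {
    // 'é' is the bytes C3 A9 — neither byte can equal b'?' (0x3F).
    let (path, query) = split_on_query("/café?x=1".as_bytes());
    assert_eq!(path, "/café".as_bytes());
    assert_eq!(query, Some("x=1".as_bytes()));
    println!("ok");
}
```

The same argument applies to every other delimiter (`/`, `#`, `:`, `@`), which is what makes a single-pass byte-level parser safe.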
As far as HTTP header encoding is concerned, the story is significantly more complex than ISO-8859-1 or Windows-1252; in many places arbitrary encodings are permitted, encoded according to the rules in RFC 2047. The specs are not clear about where these sorts of things can happen, unfortunately…
from rust-url.
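For reference, an RFC 2047 encoded-word has the shape `=?charset?encoding?encoded-text?=`. This sketch only splits a well-formed encoded-word into its parts; actual B/Q decoding and charset conversion are omitted:

```rust
// Sketch: recognizing the RFC 2047 encoded-word shape
// =?charset?encoding?encoded-text?= and splitting out its parts.
// Real decoding of the B/Q encodings is deliberately left out.

fn split_encoded_word(input: &str) -> Option<(&str, &str, &str)> {
    let inner = input.strip_prefix("=?")?.strip_suffix("?=")?;
    let mut parts = inner.splitn(3, '?');
    let charset = parts.next()?;
    let encoding = parts.next()?;
    let text = parts.next()?;
    Some((charset, encoding, text))
}

fn main() {
    assert_eq!(
        split_encoded_word("=?ISO-8859-1?Q?caf=E9?="),
        Some(("ISO-8859-1", "Q", "caf=E9"))
    );
    assert_eq!(split_encoded_word("plain text"), None);
    println!("ok");
}
```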
I didn't know that one applied to HTTP…
from rust-url.
Yeah, mostly for *TEXT, so it theoretically shouldn’t be very widespread. The spec is very poor about specifying where it’s really allowed, so it’ll be one of these matters of discovering what everyone else does and more or less ignoring the specs.
But my recollection for URLs is that they should all be in canonical form which is a strict subset of ASCII. Anyway, Teepee’s raw headers are defined in terms of the raw bytes.
from rust-url.
Why do you need that? You haven't justified your requirements.
As for URLs in HTTP: URLs should be in ASCII over HTTP, but they are not always. And if you want interoperability, you need original ISO-8859-1 handling. Just test in browsers.
from rust-url.
Why do you need that? You haven't justified your requirements.
Why do we need what?
from rust-url.
we still need to be able to parse absolute/relative/abs-path+query/host/etc individually
from rust-url.
Ah; I thought my initial comment made that fairly clear.
For example, the HTTP Errata http://skrb.org/ietf/http_errata.html says this:
The definition of Request-URI should be:
Request-URI = "*" | absoluteURI | abs_path [ "?" query ] | authority
So when I write a parser for this header I need to call a function in rust-url that either parses an absoluteURI and fails on a relativeURI, or parses either one but identifies which was parsed, so that I can reject relative URIs and go on to the other alternates. Naturally, a function which parses any of these and tells me which it was would also work. I'll leave that up to Simon.
(Obviously Teepee would handle the asterisk in any case; that's specific to HTTP rather than part of a URI.)
from rust-url.
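Such a classifier for the errata grammar might be sketched as follows. Note that the authority form is ambiguous with absoluteURI on grammar alone (e.g. e.com:443 also parses as a scheme followed by a path), so this sketch disambiguates by request method, as HTTP does in practice (only CONNECT uses the authority form). All names are invented:

```rust
// Sketch of the Request-URI alternates from the errata grammar:
//   Request-URI = "*" | absoluteURI | abs_path [ "?" query ] | authority
// Classification only; each arm would hand off to a real parser.

#[derive(Debug, PartialEq)]
enum RequestUri<'a> {
    Asterisk,
    AbsoluteUri(&'a str),
    AbsPath(&'a str),
    Authority(&'a str),
}

fn classify_request_uri<'a>(method: &str, input: &'a str) -> RequestUri<'a> {
    if input == "*" {
        RequestUri::Asterisk
    } else if method == "CONNECT" {
        // host:port — grammatically this could also be an absoluteURI
        // ("e.com" is a valid scheme), so the method decides.
        RequestUri::Authority(input)
    } else if input.starts_with('/') {
        RequestUri::AbsPath(input)
    } else {
        RequestUri::AbsoluteUri(input)
    }
}

fn main() {
    assert_eq!(classify_request_uri("OPTIONS", "*"), RequestUri::Asterisk);
    assert_eq!(classify_request_uri("GET", "/a?b=c"), RequestUri::AbsPath("/a?b=c"));
    assert_eq!(
        classify_request_uri("GET", "http://e.com/"),
        RequestUri::AbsoluteUri("http://e.com/")
    );
    assert_eq!(
        classify_request_uri("CONNECT", "e.com:443"),
        RequestUri::Authority("e.com:443")
    );
    println!("ok");
}
```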
The new RFCs might be relevant. (And obviously for the rest of Teepee too.)
from rust-url.
Closing in favor of https://docs.rs/http/0.1.13/http/uri/struct.Uri.html, which is designed specifically for representing the request-target on an HTTP request.
from rust-url.