uri

Package uri is meant to be an RFC 3986 compliant URI builder, parser and validator for golang.

It supports strict RFC validation for URIs and URI relative references.

This allows for stricter conformance than the net/url package in the Go standard library, which provides a workable but loose implementation of the RFC for URLs.

Requires go1.19.

What's new?

V1.2 announcement

To do before I cut a v1.2.0:

  • [] handle empty fragment, empty query. Ex: https://host? is not equivalent to https://host. Similarly https://host# is not equivalent to https://host.
  • [] IRI UCS charset compliance
  • [] URI normalization (like PuerkitoBio/purell)
  • [] more explicit errors, with context

See also TODOs.
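
To illustrate the empty-query distinction from the list above, here is a minimal sketch based on the standard library only. It does not use this package: it relies on the ForceQuery field that net/url sets when a URL ends with a bare "?". The helper name hasEmptyQuery is hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
)

// hasEmptyQuery reports whether raw carries a "?" with nothing after it,
// as opposed to having no query component at all.
// Hypothetical helper, built on net/url rather than on this package.
func hasEmptyQuery(raw string) (bool, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return false, err
	}
	// net/url sets ForceQuery when the URL ends with a bare "?".
	return u.ForceQuery && u.RawQuery == "", nil
}

func main() {
	for _, raw := range []string{"https://host?", "https://host"} {
		empty, err := hasEmptyQuery(raw)
		fmt.Println(raw, empty, err)
	}
}
```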

V2 announcement

V2 is getting closer to completion. It comes with:

  • very significant performance improvement (x 1.5). Eventually uri gets significantly faster than net/url (-50% ns/op)
  • a simplified API: no interface, no Validate(), no Builder()
  • options for tuning validation strictness
  • exported package level variables disappear

Current master (unreleased)

Fixes

  • stricter scheme validation (no longer support non-ASCII letters). Ex: Smørrebrød:// is not a valid scheme.
  • stricter IP validation (do not support escaping in addresses, except for IPv6 zones)
  • stricter percent-escape validation: an escaped character MUST decode to a valid UTF8 code point (1). Ex: %C3 is an incomplete escaped UTF8 sequence. It should be %C3%B8 to escape the full UTF8 rune.
  • stricter port validation. A port is an integer less than or equal to 65535.

(1) Go natively manipulates UTF8 strings only. Even though the standards are not strict about the character encoding of escaped sequences, it seems natural to prevent invalid UTF8 from propagating via percent escaping. Notice that this approach is not the one followed by net/url.PathUnescape(), which leaves invalid runes.
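
The difference described in note (1) can be observed with the standard library alone. The sketch below is not this package's implementation: decodesToValidUTF8 is a hypothetical helper that unescapes with net/url.PathUnescape and then checks the result with utf8.ValidString.

```go
package main

import (
	"fmt"
	"net/url"
	"unicode/utf8"
)

// decodesToValidUTF8 reports whether s unescapes without error AND the
// decoded string is valid UTF-8. Hypothetical helper for illustration.
func decodesToValidUTF8(s string) bool {
	decoded, err := url.PathUnescape(s)
	if err != nil {
		return false // malformed escape such as "%zz"
	}
	// PathUnescape accepts "%C3" and returns the lone byte 0xC3,
	// so the UTF-8 check has to be done separately.
	return utf8.ValidString(decoded)
}

func main() {
	fmt.Println(decodesToValidUTF8("%C3"))    // incomplete sequence
	fmt.Println(decodesToValidUTF8("%C3%B8")) // full rune "ø"
}
```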

Features

  • feat: added IsIP() bool and IPAddr() netip.Addr methods
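
As a rough idea of what an IP check on a host can look like, here is a sketch using net/netip from the standard library. It is an assumption about the general approach, not the code behind IsIP() or IPAddr(); the helper hostIP is hypothetical and is more lenient about brackets than a strict validator would be.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// hostIP tries to interpret a URI host as an IP address.
// Hypothetical helper: brackets are simply trimmed, with no check that
// they are balanced or that only IPv6 literals are bracketed.
func hostIP(host string) (netip.Addr, bool) {
	host = strings.TrimSuffix(strings.TrimPrefix(host, "["), "]")
	addr, err := netip.ParseAddr(host)
	if err != nil {
		return netip.Addr{}, false // the zero netip.Addr is not a valid address
	}
	return addr, true
}

func main() {
	for _, host := range []string{"192.168.0.1", "[::1]", "example.com"} {
		addr, ok := hostIP(host)
		fmt.Println(host, ok, addr.IsValid())
	}
}
```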

Performances

  • perf: slight improvement. Now only 8-25% slower than net/url.Parse, depending on the workload

Older releases

Usage

Parsing

	u, err := Parse("https://example.com:8080/path")
	if err != nil {
		fmt.Printf("Invalid URI")
	} else {
		fmt.Printf("%s", u.Scheme())
	}
	// Output: https

	// A URI reference may omit the scheme.
	ref, err := ParseReference("//example.com/path")
	if err != nil {
		fmt.Printf("Invalid URI reference")
	} else {
		fmt.Printf("%s", ref.Authority().Path())
	}
	// Output: /path

Validating

    isValid := IsURI("urn://example.com?query=x#fragment/path") // true

    isValid = IsURI("//example.com?query=x#fragment/path") // false

    isValid = IsURIReference("//example.com?query=x#fragment/path") // true

Caveats

  • Registered name vs DNS name: RFC3986 defines a super-permissive "registered name" for hosts, for URIs not specifically related to an Internet name. Our validation performs a stricter host validation according to DNS rules whenever the scheme is a well-known IANA-registered scheme (the function UsesDNSHostValidation(string) bool is customizable).

Examples: ftp://host, http://host default to validating a proper DNS hostname.

  • IPv6 validation relies on IP parsing from the standard library. It is not super strict regarding the full-fledged IPv6 specification, but abides by the URI RFC's.

  • URI vs URL: every URL should be a URI, but the converse does not always hold. This module intends to perform stricter validation than the pragmatic standard library net/url, which currently remains about 30% faster.

  • URI vs IRI: at this moment, this module checks for URI, while supporting unicode letters as ALPHA tokens. This is not strictly compliant with the IRI specification (see known issues).

Building

The exposed type URI can be transformed into a fluent Builder to set the parts of a URI.

	aURI, _ := Parse("mailto://[email protected]")
	newURI := aURI.Builder().SetUserInfo("test.name").SetHost("newdomain.com").SetScheme("http").SetPort("443")

Canonicalization

Not supported for now (contemplated as a topic for V2).

For URL normalization, see PuerkitoBio/purell.

Reference specifications

The librarian's corner (still WIP).

| Title | Reference | Notes |
|-------|-----------|-------|
| Uniform Resource Identifier (URI) | RFC3986 | (1)(2) |
| Uniform Resource Locator (URL) | RFC1738 | |
| Relative URL | RFC1808 | |
| Internationalized Resource Identifier (IRI) | RFC3987 | (1) |
| Practical standardization guidelines | URL WhatWG Living Standard | (4) |
| Domain names implementation | RFC1035 | |
| IPv6 | | |
| Representing IPv6 Zone Identifiers | RFC6874 | |
| IPv6 Addressing architecture | RFC3513 | |
| URI Schemes | | |
| IANA registered URI schemes | IANA | (5) |
| Port numbers | | |
| IANA port assignments by service | IANA | |
| Well-known TCP and UDP port numbers | Wikipedia (https://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers) | |

Notes

(1) Deviations from the RFC:

  • Tokens: ALPHAs are tolerated to be Unicode Letter codepoints. Schemes remain constrained to ASCII letters ([a-z]|[A-Z])
  • DIGITs are ASCII digits as required by the RFC. Unicode Digit codepoints are rejected (ex: ६ (6), ① , 六 (6), Ⅶ (7) are not considered legit URI DIGITS).

Some improvements are still needed to abide more strictly by the IRI provisions for internationalization. Working on it...

(2) Percent-escape:

  • Escape sequences, e.g. %hh must decode to valid UTF8 runes (standard says should).

(3) IP addresses:

  • As per RFC3986, [hh::...] literals must be IPv6 and ddd.ddd.ddd.ddd literals must be IPv4.
  • As per RFC3986, notice that [] is illegal, although the golang IP parser translates this to [::] (zero value IP). In go, the zero value for netip.Addr is invalid just as well.
  • IPv6 zones are supported, with the '%' escaped as '%25' to denote an IPv6 zoneID (RFC6874)
  • IPvFuture addresses are supported, with escape sequences (which are not part of RFC3986, but natural since IPv6 does support escaping)
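
The rules on bracketed literals and escaped zones could be sketched as follows with net/netip. This is a simplified illustration under stated assumptions, not the package's parser: parseIPv6Literal is a hypothetical helper, IPvFuture is not handled, and percent-escapes inside the zone are not decoded.

```go
package main

import (
	"errors"
	"fmt"
	"net/netip"
	"strings"
)

// parseIPv6Literal parses a bracketed host such as "[fe80::1%25eth0]".
// Hypothetical helper: the zone separator must be escaped as "%25".
func parseIPv6Literal(literal string) (netip.Addr, error) {
	if len(literal) < 2 || literal[0] != '[' || literal[len(literal)-1] != ']' {
		return netip.Addr{}, errors.New("IP literal must be enclosed in brackets")
	}
	inner := literal[1 : len(literal)-1]
	if inner == "" {
		return netip.Addr{}, errors.New("empty IPv6 literal")
	}
	ip, zone, hasZone := strings.Cut(inner, "%25")
	if hasZone && zone == "" {
		return netip.Addr{}, errors.New("empty IPv6 zone")
	}
	if strings.Contains(ip, "%") {
		return netip.Addr{}, errors.New("zone separator must be escaped as %25")
	}
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return netip.Addr{}, err
	}
	if !addr.Is6() {
		return netip.Addr{}, errors.New("bracketed literal must be IPv6")
	}
	return addr.WithZone(zone), nil
}

func main() {
	for _, lit := range []string{"[fe80::1%25eth0]", "[]", "[192.168.0.1]"} {
		addr, err := parseIPv6Literal(lit)
		fmt.Println(lit, addr.IsValid(), err)
	}
}
```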

(4) Deviations from the WhatWG recommendation

  • [] IPv6 address is invalid
  • invalid percent-encoded characters trigger an error rather than being ignored

(5) Most permanently registered schemes have been accounted for when checking whether Domain Names apply for hosts rather than the "registered name" from RFC3986. Quite a few commonly used schemes, either unregistered or with a provisional status, have been added as well. Feel free to create an issue or contribute a change to enrich this list of well-known URI schemes.

FAQ

Benchmarks

Credits

  • Tests have been aggregated from the test suites of URI validators from other languages (Perl, Python, Scala, .NET) and from the Go net/url standard library.

  • This package was initially based on the work from ttacon/uri (credits: Trey Tacon).

Extra features like MySQL URIs present in the original repo have been removed.

  • A lot of improvements and suggestions have been brought by the incredible guys at fyne-io. Thanks all.

Release notes

v1.1.0

Build

  • requires go1.19

Features

  • Typed errors: parsing and validation now return errors of type uri.Error, with a more accurate pinpointing of the error provided by the value. Errors support the go1.20 addition to standard errors with Join() and Cause(). For go1.19, backward compatibility is ensured (errors.Join() is emulated).
  • DNS schemes can be overridden at the package level

Performances

  • Significantly improved parsing speed by dropping usage of regular expressions and reducing allocations (~ x20 faster).

Fixes

  • stricter compliance regarding paths beginning with a double '/'
  • stricter compliance regarding the length of DNS names and their segments
  • stricter compliance regarding IPv6 addresses with an empty zone
  • stricter compliance regarding IPv6 vs IPv4 literals
  • an empty IPv6 literal [] is invalid

Known open issues

  • IRI validation lacks strictness
  • IPv6 validation relies on the standard library and lacks strictness

Other

Major refactoring to enhance code readability, esp. for testing code.

  • Refactored validations
  • Refactored test suite
  • Added support for fuzzing, dependabots & codeQL scans

uri's Issues

Make test suite readable

After a couple of years now, maintaining the test suite appears daunting...

Time to refactor tests and organize them along the lines of what is being tested rather than the original compilation of many test points gathered from what others did in other languages.

Make the URI type a concrete type, remove interface

Prepare a v2 with a breaking change:

  • URI is no longer an interface but a concrete type

And yes, unfortunately, abstraction comes at a cost in go... Let's follow the way url.URL works, and let's remove this pointless interface. That was something I had kept from the original forked repo, but I no longer see the purpose in exposing an interface.

@Jacalz I would assume that this would break your use case, right?

`file:///...` URIs place the path in the host field

Minimal test case to reproduce:

package main

import "fmt"
import "github.com/fredbi/uri"

func main() {
        // example taken from IETF RFC3986, §1.1, p. 5
        u, err := uri.Parse("file:///etc/hosts")
        fmt.Printf("err=%v\n", err)
        fmt.Printf("u=%v\n", u)
        fmt.Printf("u.Authority().Path()=%s\n", u.Authority().Path())
        fmt.Printf("u.Authority().Host()=%s\n", u.Authority().Host())
}

Output:

err=invalid host in URI
u=file:///etc/hosts
u.Authority().Path()=
u.Authority().Host()=/etc/hosts

System information:

$ go version
go version go1.15.6 linux/amd64

It would seem that in cases such as file:///some/path, the path references a location on the local system. In this case, the path string should be retrieved by .Path(), not by .Host(). This is supported by Section 1.1 of RFC3986:

URIs have a global scope and are interpreted consistently regardless
of context, though the result of that interpretation may be in
relation to the end-user's context. For example, "http://localhost/"
has the same interpretation for every user of that reference, even
though the network interface corresponding to "localhost" may be
different for each end-user: interpretation is independent of access.
However, an action made on the basis of that reference will take
place in relation to the end-user's context, which implies that an
action intended to refer to a globally unique thing must use a URI
that distinguishes that resource from all other things. URIs that
identify in relation to the end-user's local context should only be
used when the context itself is a defining aspect of the resource,
such as when an on-line help manual refers to a file on the end-
user's file system (e.g., "file:///etc/hosts").
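
For comparison, the standard library's net/url splits the same input into an empty host and a non-empty path. The sketch below only shows that reference behaviour; splitHostPath is a hypothetical helper and says nothing about how this package should implement the fix.

```go
package main

import (
	"fmt"
	"net/url"
)

// splitHostPath returns the host and path as seen by net/url.
// Hypothetical helper, used here only as a point of comparison.
func splitHostPath(raw string) (host, path string, err error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", "", err
	}
	return u.Host, u.Path, nil
}

func main() {
	host, path, err := splitHostPath("file:///etc/hosts")
	fmt.Printf("host=%q path=%q err=%v\n", host, path, err)
}
```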

Try to rework or remove creation of regular expressions

There are a lot of very complicated regular expressions that are not only very hard to understand: compiling all of these rules is also slow and allocates quite a lot of memory. None of these are explicitly tested; only the functions making use of the expressions are. Can we get rid of any of these? Can we simplify them? Can we lazy-load some of them?

var (
	rexScheme   = regexp.MustCompile(`^[\p{L}][\p{L}\d\+-\.]+$`)
	rexFragment = regexp.MustCompile(`^([\p{L}\d\-\._~\:@!\$\&'\(\)\*\+,;=\?/]|(%[[:xdigit:]]{2})+)+$`)
	rexQuery    = rexFragment
	rexSegment  = regexp.MustCompile(`^([\p{L}\d\-\._~\:@!\$\&'\(\)\*\+,;=]|(%[[:xdigit:]]{2})+)+$`)
	rexHostname = regexp.MustCompile(`^[a-zA-Z0-9\p{L}]((-?[a-zA-Z0-9\p{L}]+)?|(([a-zA-Z0-9-\p{L}]{0,63})(\.)){1,6}([a-zA-Z\p{L}]){2,})$`)

	// unreserved | pct-encoded | sub-delims.
	rexRegname = regexp.MustCompile(`^([\p{L}\d\-\._~!\$\&'\(\)\*\+,;=]|(%[[:xdigit:]]{2})+)+$`)
	// unreserved | pct-encoded | sub-delims | ":".
	rexUserInfo = regexp.MustCompile(`^([\p{L}\d\-\._~\:!\$\&'\(\)\*\+,;=\?/]|(%[[:xdigit:]]{2})+)+$`)

	rexIPv6Zone = regexp.MustCompile(`:[^%:]+%25(([\p{L}\d\-\._~\:@!\$\&'\(\)\*\+,;=]|(%[[:xdigit:]]{2}))+)?$`)
)

Maybe the new [net/netip] package can be used to validate IPv6 instead of the regex?
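
As one possible direction, a rule such as the scheme check can be written as a plain loop with no regular expression and no allocation. The sketch below follows the RFC3986 grammar (ALPHA followed by ALPHA, DIGIT, "+", "-" or "."), restricted to ASCII letters. It is a hypothetical replacement and not equivalent to rexScheme above, which accepts Unicode letters and requires at least two characters.

```go
package main

import "fmt"

// isValidScheme checks: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ).
// Hypothetical, ASCII-only sketch; it compiles nothing and allocates nothing.
func isValidScheme(s string) bool {
	if s == "" {
		return false
	}
	for i := 0; i < len(s); i++ {
		c := s[i]
		isAlpha := (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
		isDigit := c >= '0' && c <= '9'
		switch {
		case isAlpha:
			// letters are allowed at any position
		case i > 0 && (isDigit || c == '+' || c == '-' || c == '.'):
			// digits and "+", "-", "." are allowed after the first character
		default:
			return false
		}
	}
	return true
}

func main() {
	for _, scheme := range []string{"https", "git+ssh", "1http", "Smørrebrød", ""} {
		fmt.Printf("%q %t\n", scheme, isValidScheme(scheme))
	}
}
```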
