
hyperlink

A command-line tool to find broken links in your static site.

  • Fast. docs.sentry.io produces 1.1 GB of HTML files. hyperlink handles this amount of data in 4 seconds on a MacBook Pro 2018. See Alternatives for a performance comparison.

  • Pay for what you need. By default, hyperlink checks for hard 404s in internal links only. Anything beyond that is opt-in. See Options for a list of features to enable.

  • Maps errors back to source files. If your static site was created from Markdown files, hyperlink can try to find the original broken link by fuzzy-matching the content around it. See the --sources option.

  • Supports traversing file-system paths only, no arbitrary URLs.

    • No support for the <base> tag.

    • No support for external links. It does not know how to speak HTTP.

    • Even if you don't have a static site, you can put hyperlink to work by first downloading the entire website using e.g. suckit. In certain cases this is faster than other tools too.

  • Does not honor robots.txt. A broken link is still broken for users even if not indexed by Google.

  • Does not parse CSS files, as broken links in CSS have not been a practical concern for us. We are concerned about broken links in the page content, not the chrome around it.

  • Only supports UTF-8 encoded HTML files.

Installation and Usage

Download the latest binary and:

# Check a folder of HTML
./hyperlink public/

# Also validate anchors
./hyperlink public/ --check-anchors

# src/ is a folder of Markdown. Show original Markdown file paths in errors
./hyperlink public/ --sources src/

GitHub action

- uses: untitaker/[email protected]
  with:
    args: public/ --sources src/
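
For context, a complete workflow using the action might look like the following; the workflow name, trigger, and the build step are placeholders for your own setup:

```yaml
name: Check links
on: [pull_request]

jobs:
  linkcheck:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Build your site into public/ here, e.g. with hugo or jekyll.
      - uses: untitaker/[email protected]
        with:
          args: public/ --sources src/
```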

NPM

npm install -g @untitaker/hyperlink
hyperlink public/ --sources src/

Docker

docker run -v $PWD:/check ghcr.io/untitaker/hyperlink:0.1.32 /check/public/ --sources /check/src/

# specific commit
docker run -v $PWD:/check ghcr.io/untitaker/hyperlink:sha-82ca78c /check/public/ --sources /check/src

See all available tags

From source

cargo install --locked hyperlink  # latest stable release
cargo install --locked --git https://github.com/untitaker/hyperlink  # latest git SHA

Options

When invoked without options, hyperlink only checks for 404s of internal links. However, it can do more.

  • -j/--jobs: How many threads to spawn for parsing HTML. By default hyperlink will attempt to saturate your CPU.

  • --check-anchors: Opt-in, check for validity of anchors on pages. Broken anchors are considered warnings, meaning that hyperlink will exit 2 if there are only broken anchors but no hard 404s.

  • --sources: A folder of Markdown files that were the input for the HTML hyperlink has to check. This is used to provide better error messages that point at the actual file to edit. hyperlink does very simple content-based matching to figure out which Markdown files may have been involved in the creation of an HTML file.

    Why not just crawl and validate links in Markdown at this point? Answer:

    • There are countless proprietary extensions to Markdown out there for creating intra-page links, and they are generally not supported by link-checking tools.

    • The structure of your markdown content does not necessarily match the structure of your HTML (i.e. what the user actually sees). With this setup, hyperlink does not have to assume anything about your build pipeline.

  • --github-actions: Emit GitHub actions errors, i.e. add error messages in-line to PR diffs. This is only useful with --sources set.

    If you are using hyperlink through the GitHub action this option is already set. It is only useful if you are downloading/building and running hyperlink yourself in CI.

Exit codes

  • exit 1: There have been errors (hard 404s)
  • exit 2: There have been only warnings (broken anchors)
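
In a CI script you can branch on these codes, for example to tolerate warnings while failing on hard 404s. A sketch; the helper function name is illustrative, and in a real pipeline you would feed it `$?` from an actual hyperlink run:

```shell
# Map hyperlink's exit status to a human-readable verdict.
interpret_status() {
  case "$1" in
    0) echo "ok" ;;
    1) echo "errors: hard 404s found" ;;
    2) echo "warnings only: broken anchors" ;;
    *) echo "unexpected failure" ;;
  esac
}

# Example: fail the build only on hard 404s, tolerating exit code 2.
# ./hyperlink public/ --check-anchors || [ $? -eq 2 ]
```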

Alternatives

(Roughly ranked by performance, as determined by an informal benchmark. This section contains partially dated measurements and is not continuously updated with regard to either performance or feature set.)

None of the listed alternatives have an equivalent to hyperlink's --sources and --github-actions features.

  • lychee, like hyperlink, is a great choice for obscenely large static sites. Additionally it can check external/outbound links. An invocation of lychee --offline public/ is more or less equivalent to hyperlink public/.

  • liche seems to be fairly fast, but is unmaintained.

  • htmltest seems to be fairly fast as well, and is more of a general-purpose HTML linting tool.

  • muffet seems to have performance similar to htmltest. We tested muffet with http-server and webfsd without noticing a change in timings.

  • linkcheck is faster than linkchecker but still quite slow on large sites.

    We tried linkcheck together with http-server on localhost, although that does not seem to be the bottleneck at all.

  • wummel/linkchecker seems to be fairly feature-rich, but was a non-starter due to performance. The same applies to countless other link checkers we tried that are not mentioned here.

Testimonials

We use Hyperlink to check for dead links on Graphviz's static-site user documentation, because:

  • Hyperlink is blazingly fast, checking 700 HTML pages in 220ms (default) and 850ms (with --check-anchors).
  • Hyperlink's single-binary release, with no library dependencies, was trivial to integrate into our continuous integration tests.
  • High coverage: Hyperlink immediately spotted over a thousand broken page links within both <a> tags and HTML redirects, and a further 62 broken anchor-links with --check-anchors.
  • Hyperlink's design decision to crawl only static files (avoiding HTTP) eliminates test flakiness from network requests, allowing me to confidently block merging if Hyperlink reports an error.

In conclusion, Hyperlink fills the "static site continuous testing" niche really nicely.

-- Mark Hansen, Graphviz documentation maintainer

License

Licensed under the MIT license; see ./LICENSE.


hyperlink's Issues

Fails with urlencoded anchors

Our documentation contains some pages in English and some in Norwegian. We use hugo to generate static pages. Norwegian is mostly ASCII, but we have some special characters (æ, ø and å, along with their capitalized variants Æ, Ø and Å). When we create a link pointing to an anchor on a heading with one of these special characters, hugo uses the special character in the heading's id attribute (as is supported in HTML5), but links pointing to it will be urlencoded.

As an example, this page currently includes a link to the documentation page for how to deploy in the test environment (expand 'Deploy application' and click the first link under 'Useful documentation'). The link for this comes out to:

<li><a href="/app/testing/deploy/#deploy-av-app-til-testmilj%C3%B8">Deploy app to test environment</a></li>

%C3%B8 is the percent-encoded UTF-8 for 'ø'.

In the target file, the anchor is defined as:

<h2 id="deploy-av-app-til-testmiljø">Deploy av app til testmiljø</h2>

hyperlink thinks the link is broken, as it's not equal to the value in the header, but the link works fine in a browser.

Could this possibly be solved simply by url-decoding in try_normalize_href_value?
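
The suggested fix could be sketched as a small percent-decoding step applied to the href fragment before comparing it against element ids. This is an illustration, not hyperlink's actual code; try_normalize_href_value is the real function mentioned above, while the helper names below are hypothetical:

```rust
// Percent-decode an href fragment (e.g. "%C3%B8" -> "ø") so it can be
// compared against a raw UTF-8 id attribute.
fn percent_decode(input: &str) -> String {
    let bytes = input.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'%' && i + 2 < bytes.len() {
            // The two bytes after '%' must be a valid ASCII hex pair.
            if let Ok(hex) = std::str::from_utf8(&bytes[i + 1..i + 3]) {
                if let Ok(b) = u8::from_str_radix(hex, 16) {
                    out.push(b);
                    i += 3;
                    continue;
                }
            }
        }
        out.push(bytes[i]);
        i += 1;
    }
    String::from_utf8_lossy(&out).into_owned()
}

// Compare a link's #fragment with an element id after decoding.
fn anchors_match(href_fragment: &str, id: &str) -> bool {
    percent_decode(href_fragment) == id
}
```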

Unescape HTML properly when matching source files

For backreferencing markdown files we attempt to unescape the HTML content. This sometimes fails because quick-xml does not support &nbsp;, for example (0b58b23)

Also we possibly need to run the same unescape logic over markdown if pulldown-cmark does not do that already.

hyperlink is slower than liche on folders with many small files

Hyperlink may be faster on docs.sentry.io, but is actually slower on the following synthetic test:

seq 1000000 | xargs -P4 -n1 -I{} sh -c 'echo "<a href=$(({} - 1)).html>Hey</a>" > {}.html'

Not overly concerned about this one in particular since it is not close to the real world, but we should figure out how to do parallel directory walking.
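
One shape such parallel walking could take is a shared stack of directories drained by worker threads. This is only a sketch under the stated assumption that a simple shared stack is acceptable; production implementations (e.g. the jwalk crate) are considerably more refined:

```rust
use std::fs;
use std::path::PathBuf;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

// `pending` counts directories that are queued or being processed, so idle
// workers know when the walk is truly finished rather than merely stalled.
fn walk_parallel(root: PathBuf, threads: usize) -> Vec<PathBuf> {
    let stack = Arc::new(Mutex::new(vec![root]));
    let pending = Arc::new(AtomicUsize::new(1));
    let files = Arc::new(Mutex::new(Vec::new()));

    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let (stack, pending, files) = (stack.clone(), pending.clone(), files.clone());
            thread::spawn(move || loop {
                let dir = stack.lock().unwrap().pop();
                match dir {
                    Some(dir) => {
                        for entry in fs::read_dir(&dir).into_iter().flatten().flatten() {
                            let path = entry.path();
                            if path.is_dir() {
                                pending.fetch_add(1, Ordering::SeqCst);
                                stack.lock().unwrap().push(path);
                            } else {
                                files.lock().unwrap().push(path);
                            }
                        }
                        pending.fetch_sub(1, Ordering::SeqCst);
                    }
                    None if pending.load(Ordering::SeqCst) == 0 => break,
                    None => thread::sleep(Duration::from_micros(50)),
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    Arc::try_unwrap(files).unwrap().into_inner().unwrap()
}
```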

Consider building the github action on top of the docker image

Now that we have the docker image as of #113, it should be possible to build the github action around it

Pending question is whether we get any perf degradation because of that. Does a docker action build from the existing Dockerfile, or can it use a built image? Is there significant overhead by docker on windows and macos runners?

Url escapes are not normalized for local paths

When a path has URL-unsafe characters, the URL to it gets escaped in the document but during link checking hyperlink does not unescape them to match the actual path, leading to errors like this:

Error: bad links:
  _next/static/chunks/pages/article/%5Bslug%5D-f92160effe6eedb195dc.js

The file _next/static/chunks/pages/article/[slug]-f92160effe6eedb195dc.js indeed exists on the disk.

Publish Docker Image

It will be great to have a docker image that one can run locally or in CI instead of needing to install/download a binary.

Dependabot can't resolve your Rust dependency files

Dependabot can't resolve your Rust dependency files.

As a result, Dependabot couldn't update your dependencies.

The error Dependabot encountered was:

Updating git repository `https://github.com/tafia/quick-xml`
error: failed to select a version for the requirement `quick-xml = "^0.20.0"`
candidate versions found which didn't match: 0.22.0
location searched: Git repository https://github.com/tafia/quick-xml
required by package `hyperlink v0.1.15 (/home/dependabot/dependabot-updater/dependabot_tmp_dir)`

If you think the above is an error on Dependabot's side please don't hesitate to get in touch - we'll do whatever we can to fix it.

View the update logs.

Ignore links inside `<script>` tags

The script in here should be ignored; it's not actual HTML.

<script id="algolia__template" type="text/template">
{% raw %}
  <div class="algolia__result">
    <a class="algolia__result-link" href="{{ full_url }}#algolia:{{ css_selector }}">{{{ _highlightResult.title.value }}}</a>
    {{#posted_at}}
    <div class="algolia__result-date">{{ posted_at_readable }}</div>
    {{/posted_at}}
    <div class="algolia__result-text">{{{ _highlightResult.text.value }}}</div>
  </div>
{% endraw %}
</script>

Make a GitHub action

GitHub action that downloads hyperlink, builds it and runs it with certain args.

Original intent was to just install rust, compile the stuff and put it into GHA cache. Unfortunately you can't use actions inside actions: actions/runner#646, and docker builds are also not cached (see #6)

Perhaps we need proper releases after all.

Hyperlink uses a lot of memory

Hyperlink does its best to use CPU efficiently at the expense of memory usage. Some of this is intentional:

  • we use arenas, where we mostly allocate and rarely deallocate
  • we don't share data structures across threads; instead each thread keeps its own copy that is later merged into one result, just to avoid some locks

Even so, some of the measured memory usage is shocking.

We could expose some knobs that would help with memory usage at the expense of longer running times. -j1 is already one such knob, but there could be many more.


Add option "case insensitive"

Is it possible to check anchors case-insensitively, or to add such an option?
E.g. when checking the HTML below, I get the error error: bad link test.html#Some-Section.
However, both links "#some-section" and "#Some-Section" should be valid in my context.
Thank you.

<h1>Test</h1>

<a href="#some-section">link</a>
<a href="#Some-Section">link</a>

<h1 id="some-section">Some section</h1>
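
The request amounts to making the anchor comparison optionally case-insensitive. A minimal sketch; the idea of an `ignore_case` toggle is the reporter's wish, not an existing hyperlink option, and the function name is hypothetical:

```rust
// Compare a link's #fragment with an element id, optionally ignoring
// ASCII case. Unicode case folding would need more care than this.
fn anchors_match(href_fragment: &str, id: &str, ignore_case: bool) -> bool {
    if ignore_case {
        href_fragment.eq_ignore_ascii_case(id)
    } else {
        href_fragment == id
    }
}
```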

Feature wish: exclude path(s) from link check

hyperlink is great, thanks for providing this useful tool!
Is there any way to exclude certain paths from the link check? I envision something like an IgnoreDirs setting where you can specify an array of regexes of directories to ignore when scanning for HTML files.

`mailto`, `tel`, and data URIs considered broken

It seems that hrefs beginning with mailto:, tel:, and data: are flagged as broken, but only in documents from subdirectories.

./index.html

<a href="mailto:[email protected]">Email me</a>
<a href="tel:+1-1234-1234">telephone</a>
<a href="data:text/plain;base64,SGVsbG8sIFdvcmxkIQ==">hello world</a>
<a href="/badlink">regular bad link</a>

./subdir/index.html

same content as ./index.html

Hyperlink reports /badlink as broken in both files, as expected. But in subdir/index.html it also reports the mailto:, tel:, and data: links as broken.

$ hyperlink .                                                          
Reading files                                                          
Checking 5 links from 2 files (2 documents)                            
./index.html                                                           
  error: bad link /badlink                                             
                                                                       
./subdir/index.html                                                    
  error: bad link /badlink                                             
  error: bad link /subdir/data:text/plain;base64,SGVsbG8sIFdvcmxkIQ==  
  error: bad link /subdir/mailto:[email protected]                         
  error: bad link /subdir/tel:+1-1234-1234                             
                                                                       
Found 5 bad links                                                      

Edit: also fax: and modem: from https://www.ietf.org/rfc/rfc2806
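
A plausible fix is to skip any href that carries a URI scheme before resolving it relative to the document's directory. This is a sketch, not hyperlink's actual code; the scheme test follows RFC 3986's `ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )` grammar, so it covers mailto:, tel:, data:, fax:, and modem: without a hardcoded list:

```rust
// True if the href starts with a URI scheme and therefore should not be
// treated as a local file-system path.
fn has_scheme(href: &str) -> bool {
    match href.split_once(':') {
        Some((scheme, _)) => {
            let mut chars = scheme.chars();
            // First char must be a letter, the rest alphanumeric or +-.
            matches!(chars.next(), Some(c) if c.is_ascii_alphabetic())
                && chars.all(|c| c.is_ascii_alphanumeric() || "+-.".contains(c))
        }
        None => false,
    }
}
```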

rust v1.73.0 error on install

error[E0186]: method should_emit_errors has a &mut self declaration in the trait, but not in the impl
--> C:\Users\jax.cargo\registry\src\index.crates.io-6f17d22bba15001f\hyperlink-0.1.31\src\html\parser.rs:264:5
|
264 | fn should_emit_errors() -> bool {
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ expected &mut self in impl
|
= note: should_emit_errors from trait: fn(&mut Self) -> bool

For more information about this error, try rustc --explain E0186.
error: could not compile hyperlink (bin "hyperlink") due to previous error
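
E0186 means an impl omitted the `&mut self` receiver that its trait declares. A minimal reproduction of the mismatch, unrelated to hyperlink's real parser code:

```rust
trait Parser {
    fn should_emit_errors(&mut self) -> bool;
}

struct Check;

impl Parser for Check {
    // Writing `fn should_emit_errors() -> bool` here (no receiver) triggers
    // E0186; adding `&mut self` to match the trait resolves it.
    fn should_emit_errors(&mut self) -> bool {
        true
    }
}
```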

External links support

External link support was not built because fetching remote content is slow and flaky. Ideas:

  • Support sitemap.xml only or at least attempt to use it as fastpath.
  • Maybe make user cache/store sitemaps for all external domains so flakiness can be kept in check
  • Add subcommand to generate sitemap.xml for own static site

Why do it this way? Because our actual use case is only checking links from docs.sentry.io to sentry.io. Both are static sites we control, so we could make sure everything has sitemaps and still get away with very fast builds. sentry.io already has a sitemap.

However, for a general-purpose external link checker we probably really need to support real HTTP, and maybe build a local cache file. Also, sitemap.xml doesn't work for anchor checking.


Swap out html parser

quick-xml is, indeed, quick, but unsustainable: it does more than we need and at the same time chokes on things we don't care about:

<script>
...
 * @typedef {{
 *     name: string,
 *     id: string,
 *     score: number,
 *     description: string,
 *     audits: !Array<!ReportRenderer.AuditJSON>
 * }}
...
</script>

Investigate alternatives.

I have tried:

  • razrfalcon/xmlparser: as fast as quick-xml but much stricter with no way to hack around things. probably very clean and a good place to start forking, if we had to
  • https://github.com/cloudflare/lol-html: no idea about performance but claims to have it, probably battle-tested. does more than we need.
  • html5ever: awfully slow, functionally correct

Ideas to try out:

  • Write regex or custom parser. For link extraction this could be easy enough, but for paragraph hashing probably not.
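
For link extraction, the custom-parser idea could start as simply as scanning for `href="` occurrences. A deliberately naive sketch (not hyperlink's code) that ignores single quotes, unquoted attribute values, and `<script>` content, which is exactly where it would need hardening:

```rust
// Collect every double-quoted href attribute value from raw HTML.
fn extract_hrefs(html: &str) -> Vec<&str> {
    let mut out = Vec::new();
    let mut rest = html;
    while let Some(pos) = rest.find("href=\"") {
        let after = &rest[pos + 6..];
        match after.find('"') {
            Some(end) => {
                out.push(&after[..end]);
                rest = &after[end + 1..];
            }
            None => break, // unterminated attribute; stop scanning
        }
    }
    out
}
```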

Link fix/redirect suggestions

Hyperlink has internal tables to figure out which markdown files contribute to which html files. If we store that extra state between builds, hyperlink could see how content moved from URL A to URL B.

This can then be used to suggest added redirect rules (in a file like nginx.conf) and maybe fixing the original markdown link.


v0.1.23 broke github action support

Thanks for a fantastic tool! We immediately found 96 bad links and 61 bad anchors. Really looking forward to getting this integrated as a github action.

I followed the examples and added 0.1.23 to a workflow in Altinn/altinn-studio-docs/pull/609, but it sadly failed:

Run untitaker/[email protected]
Run set -x
cd /home/runner/work/_actions/untitaker/hyperlink/0.1.23
sh scripts/install.sh
downloading hyperlink 0.1.23 for Linux
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Dload  Upload   Total   Spent    Left  Speed
0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 6593k  100 6593k    0     0  13.3M      0 --:--:-- --:--:-- --:--:-- 13.3M
Run /home/runner/work/_actions/untitaker/hyperlink/0.1.23/hyperlink public/ --check-anchors --sources content/ --github-actions
/home/runner/work/_temp/a5658146-4326-4fa6-8886-b6970313b1d9.sh: line 1: /home/runner/work/_actions/untitaker/hyperlink/0.1.23/hyperlink: No such file or directory
Error: Process completed with exit code 127.

Seems to me like this broke when the install script got extracted, which now downloads the binary as hyperlink-bin, although the action still tries to run hyperlink.

prefix configuration

When deploying to some hosts, for example github pages, the site contents are required to live under a path prefix. This means that something like <link href="/project-name/style.css" /> would fail to check unless an artificial directory is added before checking. Would you consider adding this as a configuration option?
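
Until such an option exists, the artificial-directory workaround described above can be scripted before checking; `public` and `project-name` are placeholders for your own build output and prefix:

```shell
# Recreate the deployed layout: nest the build output under the prefix,
# then check the nested tree instead of public/ directly.
mkdir -p public
printf '<a href="/project-name/index.html">home</a>' > public/index.html

mkdir -p _linkcheck/project-name
cp -R public/. _linkcheck/project-name/

# ./hyperlink _linkcheck/
```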

Rename?

I found this program name impossible to google. I wonder if there's a better name

Support for links without .html extension

👋 Hyperlink is really an awesome tool, so first off, thanks for putting this out there!

I wondered if it would be possible to support opting out of the strictness regarding .html extensions on links; these are essentially false positives in my case.

Example: /about is an error but /about.html does exist (checking against Next.js build dir)

I'm not sure how common clean internal URL links are among static site generators but my guess is it might be a useful option
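
Such an option could resolve extension-less links by trying a few candidate paths in order, mirroring how most static hosts serve clean URLs. A sketch with hypothetical names, not hyperlink's actual resolution logic:

```rust
use std::path::{Path, PathBuf};

// Resolve a "clean" URL like /about by trying the literal path, then
// `about.html`, then `about/index.html`, relative to the site root.
fn resolve_clean_url(root: &Path, href: &str) -> Option<PathBuf> {
    let rel = href.trim_start_matches('/');
    let candidates = [
        root.join(rel),
        root.join(format!("{rel}.html")),
        root.join(rel).join("index.html"),
    ];
    candidates.into_iter().find(|c| c.is_file())
}
```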

Optionally check `file://` links

I have a repository containing a crate and an mdbook. The doc comments in the crate link pages from the mdbook. I have a simple declarative macro that lets me define these links like:

#[doc = refBook!("bar", "/foo/bar.html")]
Bar,

The macro enables these links to either point to the .html files on the local file system (via file:/// and env!("CARGO_MANIFEST_DIR")) or to the live website.

I'd now like to use hyperlink in the CI to check that these links aren't broken but hyperlink doesn't appear to check (or support to check) file:// links.

Bad link to `/` has confusing error message

hyperlink apparently strips leading slashes, which results in a confusing error message when an index.html file at the base path is missing:

% mkdir /tmp/test
% echo '<a href=/>test</a>' > /tmp/test/foo.html                   
% hyperlink /tmp/test                           
Reading files
Checking 1 links from 1 files (1 documents)
/tmp/test/foo.html
  error: bad link 

Found 1 bad links

Notice that it doesn't say what the bad link is.
