gjtorikian / html-proofer
Test your rendered HTML files to make sure they're accurate.
License: MIT License
The following HTML produces validation errors:
<a href="git://github.com/mono/mono">Git</a>
<a href="ftp://ftp.example.com">FTP</a>
<a href="irc://irc.gimp.org/mono">IRC</a>
<a href="svn://svn.example.com">SVN</a>
../test.html: internally linking to git://github.com/mono/mono, which does not exist
../test.html: internally linking to ftp://ftp.example.com, which does not exist
../test.html: internally linking to irc://irc.gimp.org/mono, which does not exist
../test.html: internally linking to svn://svn.example.com, which does not exist
I tried passing in --href_ignore git, but that changed nothing. I know that the proofer can't really validate the links, but shouldn't it just ignore them then, like it does with mailto:?
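One way the requested behavior could look, as a rough sketch: only http and https links are sent over the network, links without a scheme are resolved locally, and every other scheme is skipped the way mailto: is. The names `link_scheme` and `classify` are made up for this illustration and are not part of html-proofer.

```ruby
# Hypothetical helpers, not html-proofer API.
CHECKABLE_SCHEMES = %w[http https].freeze

# Returns the lowercase scheme of an href, or nil when it has none
# (relative paths and "#hash" links).
def link_scheme(href)
  match = href.to_s.match(/\A([a-z][a-z0-9+.\-]*):/i)
  match && match[1].downcase
end

# :external -> worth an HTTP request
# :internal -> resolve against the local file tree
# :skip     -> some other scheme (git, ftp, irc, svn, mailto, tel, ...)
def classify(href)
  href = href.to_s
  return :external if href.start_with?("//") # protocol-relative link
  scheme = link_scheme(href)
  return :internal if scheme.nil?
  CHECKABLE_SCHEMES.include?(scheme) ? :external : :skip
end
```

With this split, `classify("git://github.com/mono/mono")` lands in the skip bucket instead of being treated as a missing internal file.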
I can install html-proofer.
$ gem install html-proofer
Successfully installed html-proofer-0.6.0
How can I use it from the command line? html-proofer or htmlproof don’t work here.
Sooooo I have a thought. What if we used Process.fork or Thread.new to allow for concurrent link proofs? Which is a better approach? Does Typhoeus do this already and I just don't know it?
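For the Thread.new route, a minimal sketch of a fixed-size worker pool might look like the following. Link checking is mostly waiting on the network, which is the kind of work Ruby threads handle reasonably well, whereas Process.fork would give each worker its own memory and make collecting results more involved. `check_concurrently` and its `checker` block are hypothetical; the block stands in for the real HTTP request. Typhoeus also ships a Hydra interface for running requests in parallel, so that may be worth a look before hand-rolling anything.

```ruby
# Sketch: check many links concurrently with a fixed pool of threads.
# `checker` is any block that takes a URL and returns a result;
# a real run would perform the HTTP request there.
def check_concurrently(urls, workers: 4, &checker)
  queue = Queue.new
  urls.each { |url| queue << url }
  queue.close # lets idle workers see "no more work" instead of blocking

  results = {}
  lock = Mutex.new

  threads = Array.new(workers) do
    Thread.new do
      # pop returns nil once the queue is closed and empty
      while (url = queue.pop)
        outcome = checker.call(url)
        lock.synchronize { results[url] = outcome }
      end
    end
  end
  threads.each(&:join)
  results
end
```

Usage would be along the lines of `check_concurrently(urls, workers: 8) { |url| some_http_check(url) }`, where `some_http_check` is whatever request code the proofer already has.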
It appears that, at least with 0.27.0, internal links that reference an element id are not found. E.g. the two skip links in the example below give an error:
...internally linking to #mainMenu, which does not exist
...internally linking to #mainContent, which does not exist
while both obviously do exist, are valid hash-name references and are focusable elements, the former being the first anchor in the sitenav <nav> element; the latter being the <main> element.
<!DOCTYPE html>
<html lang="nl-NL" class="no-js" id="document">
<head prefix="og: http://ogp.me/ns# article: http://ogp.me/ns/article#">
<meta charset="utf-8">
<title>Homepage</title>
</head>
<body>
<div id="skipLinks" class="skiplinks">
<a href="#mainMenu" class="skiplink">Naar hoofdnavigatiemenu</a>
<a href="#mainContent" class="skiplink">Naar hoofdinhoud</a>
</div>
<div class="site">
<header id="siteHeader" class="siteHeader" role="banner">
<h1 class="title">GeoDienstenCentrum</h1>
<p class="subtitle">Toegankelijke ruimtelijke informatievoorziening</p>
<nav id="sitenav" class="site-nav" role="navigation">
<ul>
<li>
<a href="/" id="mainMenu">
<span aria-hidden="true" data-icon=""></span>
<span>home</span>
</a>
</li>
<li>
<a href="/over.html">
<span aria-hidden="true" data-icon=""></span>
<span>over</span>
</a>
</li>
</ul>
</nav>
</header>
<main id="mainContent" tabindex="-1" role="main">
<p>Voor advies over en implementatie van toegankelijke ruimtelijke informatie met een
"privacy first" insteek, bij voorkeur op basis van open standaarden, open source software
en open data.<p>
</main>
</div>
<div class="site-footer">
<span class="rss">
<a href="/atom.xml" class="">
<span aria-hidden="true" data-icon=""></span>
<span class="visually-hidden">Atom feed voor deze site</span>
</a>
</span>
</div>
<script src="/js/script.js" charset="utf-8"></script>
</body>
</html>
Full pages and traces are on Travis-ci: https://travis-ci.org/GeoDienstenCentrum/geodienstencentrum.github.io/builds/33530564
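The expected behavior can be sketched as: collect every id in the document, then flag only those href="#..." fragments that have no matching id. The regex scanning below is a simplification for illustration (a real checker would use an HTML parser), and `missing_fragments` is a hypothetical name.

```ruby
# Sketch: verify that every href="#fragment" has a matching id in the
# same document. Regex scanning is a simplification for illustration.
def missing_fragments(html)
  ids = html.scan(/\bid\s*=\s*["']([^"']+)["']/i).flatten
  fragments = html.scan(/\bhref\s*=\s*["']#([^"']+)["']/i).flatten
  fragments.uniq - ids
end
```

For the page above, both #mainMenu and #mainContent have matching ids, so a check along these lines would be expected to return an empty list.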
Follow-up to #71 I propose to make the output more readable by using indentation (inspired by NPM). Note the issue count at the end of some lines if the issue appears more than once. Examples:
./_site/blog/a-whisper/index.html (4)
├── image /assets/body/black_down_arrow-c5df63bbaa0639b1295aa92bf32fe9ff.png does not have an alt attribute
├── image /assets/body/rss-c859bf63379b25bc6e44eba6f7a8b5ed.png does not have an alt attribute (2)
└── image /blog/images/waterfall.jpg does not have an alt attribute
./_site/blog/advanced-ratcheting/index.html (3)
├── image /assets/body/black_down_arrow-c5df63bbaa0639b1295aa92bf32fe9ff.png does not have an alt attribute
└── image /assets/body/rss-c859bf63379b25bc6e44eba6f7a8b5ed.png does not have an alt attribute (2)
image /assets/body/black_down_arrow-c5df63bbaa0639b1295aa92bf32fe9ff.png does not have an alt attribute (2)
├── ./_site/blog/a-whisper/index.html
└── ./_site/blog/advanced-ratcheting/index.html
image /assets/body/rss-c859bf63379b25bc6e44eba6f7a8b5ed.png does not have an alt attribute (4)
├── ./_site/blog/a-whisper/index.html (2)
└── ./_site/blog/advanced-ratcheting/index.html (2)
image /blog/images/waterfall.jpg does not have an alt attribute
└── ./_site/blog/a-whisper/index.html
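A rough sketch of how the first (file-grouped) layout could be produced, assuming the issues are already collected as a hash from file path to a list of messages. `render_tree` is a hypothetical helper, not existing html-proofer code.

```ruby
# Sketch: render issues grouped by file as an indented tree.
# `issues_by_file` maps a file path to the list of issue messages found
# in it; repeated messages collapse into one line with a count.
def render_tree(issues_by_file)
  lines = []
  issues_by_file.each do |file, messages|
    counts = messages.each_with_object(Hash.new(0)) { |m, h| h[m] += 1 }
    lines << (messages.size > 1 ? "#{file} (#{messages.size})" : file)
    counts.each_with_index do |(message, count), index|
      branch = index == counts.size - 1 ? "└── " : "├── "
      suffix = count > 1 ? " (#{count})" : ""
      lines << "#{branch}#{message}#{suffix}"
    end
  end
  lines.join("\n")
end
```

The issue-grouped layout would be the same loop over an inverted hash (message to list of files).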
git clone git@github.com:hafniatimes/hafniatimes.github.io.git reproduction
cd reproduction
bundle exec jekyll build
gem install html-proofer
html-proof ./_site
Returns
$ htmlproof ./_site
Running [Links, Images] checks on ./_site on *.html...
Checking 8 external links...
Ran on 6 files!
./_site/404/index.html: internally linking to //twitter.com/hafniatimes, which does not exist
./_site/articles/2014/06/06/intro.html: internally linking to //twitter.com/hafniatimes, which does not exist
./_site/contact/index.html: internally linking to //twitter.com/hafniatimes, which does not exist
./_site/contact/index.html: mailto: is an invalid URL
./_site/contact/index.html: tel: is an invalid URL
./_site/da/articles/2014/06/06/intro.html: internally linking to //twitter.com/hafniatimes, which does not exist
./_site/da/index.html: internally linking to //twitter.com/hafniatimes, which does not exist
./_site/index.html: internally linking to //twitter.com/hafniatimes, which does not exist
htmlproof 0.7.1 | Error: HTML-Proofer found 8 failures!
Are these href types supposed to fail?
href="//..."
href="mailto:..."
href="tel:..."
Just wondering. :)
Would be awesome if you could add a ci-ignore class or something similar to a link for it to be ignored by HTML Proofer.
The biggest use case would be hashes that are handled by JavaScript (e.g., backbone fragments), but also for URLs generated dynamically that wouldn't be practical to add to href_ignore.
I'd imagine it'd be something like:
<a href="#print" class="ci-ignore">Print</a>
Glad to take a pass at it, if there's interest.
Let’s list 3-4 real life usage examples in the readme. I propose to mention only business-backed cases. My favorites:
I'm no pro at Ruby development but I think this would be really useful as a command-line executable tool.
Proofer should emulate content negotiation for html files. We could try to do it like Apache’s MultiViews:
The effect of MultiViews is as follows: if the server receives a request for /some/dir/foo, if /some/dir has MultiViews enabled, and /some/dir/foo does not exist, then the server reads the directory looking for files named foo.*, and effectively fakes up a type map which names all those files, assigning them the same media types and content-encodings it would have if the client had asked for one of them by name. It then chooses the best match to the client's requirements.
That said, I think it’s time for a dedicated Internal class that fakes all the server things we support: DirectoryIndex, MultiViews, followlocation, hashes, etc.
# Proofer::Internal is the proposed class; it does not exist yet.
issues = []
uri = Proofer::Internal.new("path/to/internal/resource", {})
if uri.invalid?
  if uri.hash?
    issues << "Hash not found"
  elsif uri.empty?
    issues << "Uri empty"
  elsif uri.ugly?
    issues << "Uri ugly"
  end
end
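To make the idea a bit more concrete, here is a minimal sketch of the lookup order such a class might use: the exact file first, then a DirectoryIndex, then a MultiViews-style match on foo.*. Everything here is an assumption for illustration, including the name `resolve_internal` and the choice of index.html as the only index file.

```ruby
# Sketch of server-like resolution for internal links. All names are
# hypothetical; this is not html-proofer's implementation.
# Tries, in order: the exact file, a DirectoryIndex (index.html inside a
# directory), then a MultiViews-style lookup for "<path>.*".
def resolve_internal(root, link_path)
  candidate = File.join(root, link_path)
  return candidate if File.file?(candidate)

  if File.directory?(candidate)
    index = File.join(candidate, "index.html")
    return index if File.file?(index)
  end

  matches = Dir.glob("#{candidate.chomp('/')}.*").select { |f| File.file?(f) }.sort
  matches.first # nil when nothing matches
end
```

A nil result would then map to the "does not exist" issue, and the other checks (hash, empty, ugly) could run against whatever file was resolved.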
Hi, I was attempting to follow the doc here (http://jekyllrb.com/docs/continuous-integration/) and found an interesting error when running bundle exec htmlproof ./_site.
/Users/me/.rbenv/versions/1.9.3-p547/lib/ruby/gems/1.9.1/gems/html-proofer-1.1.5/bin/htmlproof:11:in `block in <top (required)>': undefined method `version' for nil:NilClass (NoMethodError)
from /Users/me/.rbenv/versions/1.9.3-p547/lib/ruby/gems/1.9.1/gems/mercenary-0.3.4/lib/mercenary.rb:21:in `program'
from /Users/me/.rbenv/versions/1.9.3-p547/lib/ruby/gems/1.9.1/gems/html-proofer-1.1.5/bin/htmlproof:10:in `<top (required)>'
from /Users/me/.rbenv/versions/1.9.3-p547/bin/htmlproof:23:in `load'
from /Users/me/.rbenv/versions/1.9.3-p547/bin/htmlproof:23:in `<main>'
This happens to me on both CI (Shippable, Ubuntu 12.04) as well as a local environment (OSX 10.9.2, ruby 1.9.3p547). I installed the gem through bundle install.
A quick fix is commenting out line 11 in bin/htmlproof:
p.version Gem::Specification::load(File.join(File.dirname(__FILE__), "..", "html-proofer.gemspec")).version
This suggests that the "html-proofer.gemspec" file is missing, and indeed it is:
$ cd /Users/me/.rbenv/versions/1.9.3-p547/lib/ruby/gems/1.9.1/gems/html-proofer-1.1.5/
$ ls -l
drwxr-xr-x 3 me staff 102 Aug 7 22:38 bin
drwxr-xr-x 3 me staff 102 Aug 7 22:23 lib
$ curl -O https://raw.githubusercontent.com/gjtorikian/html-proofer/master/html-proofer.gemspec
I checked and the issue is not there in version 1.1.4, which seems to download all the files in the repository, not just the bin and lib directories. As a result, the issue may stem from 88f6572.
Thank you for writing the gem - it has helped me find many interesting link errors.
I would like to check the content of elements except code and pre against an arbitrary array of strings. Two use cases come to mind.
Typography. You will never see characters like ", --, !! in a book written in a European language. People who care about this could proof their texts:
HTML::Proofer.new("./_book", {:prose => ["\"", "--", "!!"]}).run
Censorship. Some words can not be published:
HTML::Proofer.new("./_vegan", {:prose => ["meat", "fish", "egg"]}).run
The typography use case is more important, because if you use pre-processors like Markdown, you write -- and the renderer converts it to an en dash –. Proofer could check whether it renders as expected.
What do you think about it?
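A rough sketch of what such a check could do under the hood: drop code and pre elements, strip the remaining tags, and report which of the given strings still occur. The regex-based stripping is only an approximation (it ignores entities such as &quot; and nested edge cases), and `prose_violations` is a hypothetical name, not an existing option.

```ruby
# Sketch: report which forbidden strings occur in the text of an HTML
# document, ignoring <code> and <pre> elements. Regex stripping is a
# simplification; a real check would walk a parsed DOM.
def prose_violations(html, forbidden)
  without_code = html.gsub(%r{<(code|pre)\b[^>]*>.*?</\1>}mi, " ")
  text = without_code.gsub(/<[^>]+>/, " ") # drop remaining tags
  forbidden.select { |needle| text.include?(needle) }
end
```

Because tags are removed before matching, quotes inside attributes such as class="..." would not be reported, only those in visible text.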
Using such anchors is quite a common practice (e.g. by Bootstrap Dropdowns) and generates the following error:
index.html: linking to internal hash # that does not exist
I think they shouldn't trigger an error as they are just used as a placeholder and not for linking to a specific part of the document.
I'm seeing some errors appear for links that don't actually exist in the HTML files specified. This is due to a link in the HTML redirecting to another URL and that end URL being reported in the log rather than the original URL linked. Made it a bit tough to find the broken link in a page with a pile of links.
Travis has a blog. In https://github.com/travis-ci/blog-travis-ci-com/pull/21 somebody finds a broken link and tries to fix it. A Travis guy answers:
Here's a general question: Is the blog meant to be a document to reflect how things are now, or a historical document that announces what was new then?
It makes sense to make corrections to errors for a short while after the article's publication, but at some point we should stop updating them.
The Travis guy did not merge the fixed link and left the issue open. It has now been open for two months.
It would be helpful to provide motivation in such situations. I am thinking of images like this:
Doesn't look like this is a feature yet, but it would be very nice to have.
Just wanted to say thank you for this great tool! 😍
./out/ssl-configuration.html: External link https://www.openssl.org/ failed: 0 Peer certificate cannot be authenticated with given CA certificates
./out/what-is-my-disk-quota.html: External link https://www.npmjs.org/ failed: 0 Peer certificate cannot be authenticated with given CA certificates
./out/what-are-other-good-resources-for-learning-git-and-github.html: External link https://www.codeschool.com/courses/git-real failed: 0 Peer certificate cannot be authenticated with given CA certificates
./out/what-are-other-good-resources-for-learning-git-and-github.html: External link https://www.codeschool.com/ failed: 0 Peer certificate cannot be authenticated with given CA certificates
_site/index.html: javascript:if(typeof WZXYxe58==typeof alert)WZXYxe58();(function(){var s=document.createElement('link');s.setAttribute('href','/static/css/dyslexia.css');s.setAttribute('rel','stylesheet');s.setAttribute('type','text/css');document.getElementsByTagName('head')[0].appendChild(s);})(); is an invalid URL
— https://travis-ci.org/hafniatimes/hafniatimes.github.io/builds/31079849#L319
Not sure whether there’s a specific part that’s failing, or if the script just doesn’t fancy in-href JS, so I defer to you in this matter. 'href','/static/css/dyslexia.css' works just fine, if you click Dyslexia here. :)
It’s probably not the place for html-proofer to inspect a JavaScript href, but it currently seems to be broken either way.
I just discovered that if I create a folder that ends in .html, like Test.html, the proofer crashes:
htmlproof 1.3.1 | Error: Is a directory @ io_fread - _site/Test.html
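A possible guard, sketched under the assumption that the file list comes from a glob: keep only entries that are regular files, so a directory named Test.html is never opened for reading. `html_files` is a hypothetical helper.

```ruby
# Sketch: collect files to check, skipping directories whose names happen
# to end in the target extension (e.g. a folder called "Test.html").
def html_files(root, ext = ".html")
  Dir.glob(File.join(root, "**", "*#{ext}")).select { |path| File.file?(path) }.sort
end
```

Files inside such a directory are still picked up, since the recursive glob descends into it.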
I have the following snippet in one of my html files:
<link rel="prefetch" href="data/c%23.csv">
On disk, the linked file is named c#.csv so I am url encoding the number sign character in html.
html-proofer reports the following error when encountering this:
index.html: internally linking to data/c%23.csv, which does not exist
This can't be right. I'm referencing a few more files from the same path and this is the only one that produces an error, so the issue is definitely related to the encoded character.
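The likely fix is to percent-decode the link before looking it up on disk. A minimal sketch, with hypothetical helper names and a hand-rolled decoder so that "+" is left alone (form-style decoders would turn it into a space):

```ruby
# Sketch: percent-decode a link before looking it up on disk, so that
# href="data/c%23.csv" finds the file named "c#.csv".
def percent_decode(path)
  decoded = path.b.gsub(/%([0-9a-fA-F]{2})/) { Regexp.last_match(1).hex.chr }
  decoded.force_encoding(Encoding::UTF_8)
end

# Hypothetical existence check built on top of the decoder.
def internal_target_exists?(root, href)
  File.file?(File.join(root, percent_decode(href)))
end
```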
Maybe it’s just me, but \n\n is overkill in the log:
I think it’s acceptable that there’s one newline, in the cases where the line is longer than the terminal width, though:
But the current set-up makes it really hard to read the log in one window.
I have a page that links to the URL http://dotgov-browser.herokuapp.com/domains?cms=drupal, about which HTML proofer complains: ./_site/2014/07/07/analysis-of-federal-executive-domains-part-deux/index.html: (http://dotgov-browser.herokuapp.com/domains?cms=drupal) is an invalid URL.
The URL returns a 50x (my fault), but should still be seen as a valid URL.
@lurch reported the issue in raspberrypi/documentation#104 (comment). Looks similar to #102.
I can reproduce the issue on my clone https://github.com/penibelst/documentation/commit/6787d65531891c4c231502c6d0482eab1134acaf
Just heard about hwacha, which could improve times due to its ability to run checks in parallel. Thoughts on using it?
Ran into a very curious issue just now, running the tests locally:
~/code/stuff$ be rake cibuild
bundle exec jekyll build --destination _site
Configuration file: /Users/parkermoore/code/stuff/_config.yml
Source: /Users/parkermoore/code/stuff
Destination: _site
Generating... done.
Running [Links, Images] checks on /Users/parkermoore/code/stuff/_site...
rake aborted!
Too many open files - /Users/parkermoore/code/stuff/_site/mirrors/world.html
_tasks/cibuild.rake:5:in `block in <top (required)>'
Tasks: TOP => cibuild
(See full trace by running task with --trace)
The error appears to be Ruby reaching its file descriptor limit. Any way I can limit html-proofer to a certain number of files at a time?
Scenario:
some-hash
html-proofer fails, though. It does not follow the redirect in step two; instead, it tries to look for the hash on Page B and complains.
I'm trying to skip the alt tag check. When I use alt_ignore: [/.*/] in the options, all links are ignored rather than just ignoring the alt tag check.
Running html-proofer on my personal site, which isn't huge, can take ~10 minutes. I wonder whether, every time I link to / or /about in the header, it makes an HTTP call for each page. If we have already checked a URL within one run and got a 200 status code, we could cache that result so that we don't keep rechecking the same URLs and can complete tests in a reasonable time.
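The caching idea could be sketched like this: remember the status per URL for the duration of one run, and only call out to the network on a miss. `StatusCache` and the `fetch_status` block are hypothetical; the block stands in for the real HTTP request.

```ruby
# Sketch: remember the result for each URL so it is requested at most once
# per run. `fetch_status` stands in for the real HTTP call.
class StatusCache
  def initialize(&fetch_status)
    @fetch_status = fetch_status
    @statuses = {}
  end

  # Returns the cached status, fetching it only on the first request.
  def status_for(url)
    @statuses.fetch(url) { @statuses[url] = @fetch_status.call(url) }
  end
end
```

A header link to /about that appears on every page would then cost one request in total rather than one per page.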
E.g., try linking to http://mediadecoder.blogs.nytimes.com/2010/11/29/netflix-partner-says-comcast-toll-threatens-online-video-delivery/, which due to the 303 (!) paywall redirect, fails via Proofer.
Occasionally, at GitHub, we'll link to sites within github.com that are behind auth. For example: [check out this discussion](https://www.github.com/github/secret-internal-repo/issues/23).
We've had to exclude these links by writing them out as HTML and adding data-proofer-ignore. Blah.
I think instead there should be a new config option hash that takes a domain as a key, and an OAuth token as the value, so that these sorts of links can be checked. For example, you'd pass in :domain_auth => { "github.com" => ENV['MACHINE_USER_TOKEN'] }. When HTML::Proofer hits a 404, it'd look the domain up, and try to use the provided token to recheck the link.
/cc @penibelst @parkr Y'all think this makes sense?
Images can have a srcset attribute:
When authors adapt their sites for high-resolution displays, they often need to be able to use different assets representing the same image. We address this need for adaptive, bitmapped content images by adding a srcset attribute to the img element.
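Supporting this would mostly mean pulling the candidate URLs out of the attribute and checking each like a normal src. A minimal sketch, assuming the common "url descriptor, url descriptor" form with a space after each comma; `srcset_urls` is a hypothetical helper and does not cover every syntax the specification allows.

```ruby
# Sketch: extract the candidate URLs from a srcset value. Splits on commas
# followed by whitespace, which covers the common form only.
def srcset_urls(srcset)
  srcset.to_s.split(/,\s+/).map { |candidate| candidate.strip.split(/\s+/).first }.compact
end
```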
Just testing out using html-proofer for a random site of "stuff" I have built and the Nokogiri dependency slows everything down a lot, as installation is incredibly slow.
What is the "best practice" around this? Build locally or store vendored versions of gems?
As noted in #113 (comment), Typhoeus is real picky about what it takes in. I'll need to make a breaking release in order to namespace Typhoeus (and other libs!) options. So rather than
HTML::Proofer.new("out/", {:ext => ".htm", :verbose => true, :ssl_verifyhost => 2 })
It should be
HTML::Proofer.new("out/", {:ext => ".htm", :verbose => true, :typhoeus => {:ssl_verifyhost => 2 }})
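Internally, the namespaced form makes it straightforward to hand Typhoeus only what belongs to it. A sketch of the split, with `split_options` as a hypothetical helper:

```ruby
# Sketch: separate the proofer's own options from the ones destined for
# Typhoeus, which live under the :typhoeus key in the proposed layout.
def split_options(options)
  typhoeus = options.fetch(:typhoeus, {})
  proofer = options.reject { |key, _| key == :typhoeus }
  [proofer, typhoeus]
end
```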
If you have a <link> field in an RSS feed, with the tag body being the link, HTML proofer raises an anchor has no href attribute error.
e.g. <link>http://ben.balter.com/2014/10/08/why-government-contractors-should-%3C3-open-source/</link>
Not sure how to do it (maybe count \n's?), but would be extremely helpful to know the line number of errors when they're outputted to the console, e.g.:
_site/foo.html: internally linking to _site/bar.html on line 7 which doesn't exist
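Counting \n's does work if the character offset of the match is known. A minimal sketch, using a regex to locate hrefs purely for illustration; both helper names are hypothetical. Nokogiri nodes also expose a line method, which may be the more direct route inside the proofer.

```ruby
# Sketch: derive a 1-based line number from a character offset by counting
# the newlines that precede it.
def line_number_at(source, offset)
  source[0...offset].count("\n") + 1
end

# Collects [href, line] pairs; regex matching stands in for a real parser.
def hrefs_with_lines(source)
  results = []
  source.scan(/href\s*=\s*["']([^"']*)["']/i) do |(href)|
    results << [href, line_number_at(source, Regexp.last_match.begin(0))]
  end
  results
end
```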
The Open Graph protocol requires two properties for every page we could check:
og:image - An image URL which should represent your object within the graph.
og:url - The canonical URL of your object that will be used as its permanent ID in the graph, e.g., "http://www.imdb.com/title/tt0117500/".
Example:
<meta property="og:url" content="http://www.example.com/" />
<meta property="og:image" content="http://www.example.com/image.png" />
Over time, external links get moved permanently, because nice guys don’t break the web. In most cases that means “the old url is deprecated, use the new url”. Can we have an option to output a warning on those links?
There is another case: automatic server-side redirection, when people forget to add a trailing slash in their internal links. An example is Bootstrap’s main menu: they just list the lazy /components instead of the right /components/. This costs an extra request.
All HTTP links return failed: 301 SSL connect error
Example: https://travis-ci.org/benbalter/benbalter.github.com/builds/16423365
Proofer crashes on a Travis test with:
htmlproof 0.6.7 | Error: No such file or directory @ rb_sysopen - <file path>
The PR is IIIF/api#105. The traced Travis build is https://travis-ci.org/IIIF/iiif.io/builds/25216529
I can reproduce the issue on my local machine with Ruby 1.9.3.
Looks like something changed on the referenced URL, so the test fails now:
Failures:
1) Links test fails on redirects if not following
Failure/Error: output.should match /External link https:\/\/help.github.com\/changing-author-info\/ failed: 301 No error/
expected "spec/html/proofer/fixtures/links/linkWithRedirect.html: External link http://timclem.wordpress.com/2012/03/01/mind-the-end-of-your-line/ failed: 301 No error\n" to match /External link https:\/\/help.github.com\/changing-author-info\/ failed: 301 No error/
# ./spec/html/proofer/links_spec.rb:45:in `block (2 levels) in <top (required)>'
Inspired by the discussion with @parkr about privacy I want to propose an optional checking for external embedded resources: images, styles, scripts, … External embeds lack many things.
High quality websites only serve from own hosts. If I migrate an old website, first thing I do is to collect all embedded resources. Then I can really control what happens on my website.
The option also would help big teams with many authors to take care, because lazy authors sometimes embed images from Tumblr instead of uploading to own server.
Scenario:
www.example.com
assets.example.com
The option must check all external resources (e.g. http://code.jquery.com/jquery-latest.min.js) except your own external server.
How would you design such an option?
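One possible design, sketched: pass an allowlist of your own hosts and flag any embedded resource whose host is not on it. Relative URLs have no host and are treated as your own. The name `foreign_embeds` and the shape of the allowlist are assumptions for illustration.

```ruby
require "uri"

# Sketch: flag embedded resources served from hosts you do not control.
# `own_hosts` is the allowlist from the scenario (site host plus asset host).
def foreign_embeds(urls, own_hosts)
  urls.select do |url|
    host = begin
      URI.parse(url).host
    rescue URI::InvalidURIError
      nil # unparseable URLs are left to other checks
    end
    host && !own_hosts.include?(host.downcase)
  end
end
```

For the scenario above the call would look like `foreign_embeds(embed_urls, ["www.example.com", "assets.example.com"])`, where `embed_urls` is the list of src and href values gathered from images, scripts and stylesheets.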
Upgraded from 1.1.3 to 1.3.0, and it seems the ssl_verifypeer option is no longer supported.
require 'html/proofer'
task :test do
HTML::Proofer.new("./_site", href_ignore: ['#'], ssl_verifypeer: false).run
end
After upgrading I have a few new failures of the form.
External link https://blog.patternsinthevoid.net/ failed: 0 Peer certificate cannot be authenticated with given CA certificates
Alt tags may be left empty for decorative images: http://dev.w3.org/html5/alt-techniques/#secm3
It would be good to have a whitelist, similar to the one for the href check, to exclude known valid images like "logo.png" where the content is purely decorative. In the decorative case, an empty alt tag prevents screen readers from reading the file name as a fallback.
Hey,
The most recent version 1.3.3 is hanging. It says it's running, but I've waited a few minutes and nothing happens.
> htmlproof _site
> Running ["Images", "Links", "Scripts"] checks on _site on *.html...
Thanks for any help.
It could just be me, but I've noticed that HTML::Proofer will (usually?) treat redirects as failures. Is this behavior intentional and, if not, would it be reasonably easy to fix?
I just started testing one of my sites and it turns out there's an issue with a link having no href attribute. Or does that mean it's an a tag that has no href attribute?
It's a confusing error – could it be elaborated upon?
When using a data uri in an img tag
<img src="data:image/png;base64, blah">
I get
bad URI(is not URI?): data:image/png;base64, blah
This should pass validation