
html-proofer's Issues

Error when linking to git://, ftp:// etc. URIs

The following HTML produces validation errors:

<a href="git://github.com/mono/mono">Git</a>
<a href="ftp://ftp.example.com">FTP</a>
<a href="irc://irc.gimp.org/mono">IRC</a>
<a href="svn://svn.example.com">SVN</a>

../test.html: internally linking to git://github.com/mono/mono, which does not exist
../test.html: internally linking to ftp://ftp.example.com, which does not exist
../test.html: internally linking to irc://irc.gimp.org/mono, which does not exist
../test.html: internally linking to svn://svn.example.com, which does not exist

I tried passing in --href_ignore git, but that changed nothing. I know that the proofer can't really validate these links, but shouldn't it just ignore them, like it does with mailto:?
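
A possible workaround sketch, assuming :href_ignore accepts regular expressions as well as plain strings (the CLI flag may only do exact matching, which would explain why --href_ignore git had no effect):

HTML::Proofer.new("./_site", :href_ignore => [%r{\A(git|ftp|irc|svn)://}]).run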

How to use from command line?

I can install html-proofer.

$ gem install html-proofer
Successfully installed html-proofer-0.6.0

How can I use it from the command line? Neither html-proofer nor htmlproof works here.

Threads/Processes?

So, I have a thought: what if we used Process.fork or Thread.new to allow for concurrent link proofing? Which is the better approach? Does Typhoeus do this already and I just don't know it?
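
For reference, Typhoeus ships a concurrency primitive, Hydra, which queues requests and runs them in parallel. A rough, untested sketch of what using it for the external link checks might look like (external_urls and issues are assumed to exist):

require "typhoeus"

hydra = Typhoeus::Hydra.new(max_concurrency: 20)
external_urls.each do |url|
  request = Typhoeus::Request.new(url, method: :head, followlocation: true)
  request.on_complete do |response|
    issues << "#{url} failed: #{response.code}" unless response.success?
  end
  hydra.queue(request)
end
hydra.run  # blocks until every queued request has completed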

internal links to element id's are not found

It appears that, at least with 0.27.0, internal links that reference an element id are not found. For example, the two skip links in the example below give these errors:

...internally linking to #mainMenu, which does not exist
...internally linking to #mainContent, which does not exist

Both obviously do exist, are valid hash-name references, and are focusable elements: the former is the first anchor in the sitenav <nav> element; the latter is the <main> element.

<!DOCTYPE html>
<html lang="nl-NL" class="no-js" id="document">
    <head prefix="og: http://ogp.me/ns# article: http://ogp.me/ns/article#">
        <meta charset="utf-8">
        <title>Homepage</title>
    </head>
    <body>

        <div id="skipLinks" class="skiplinks">
            <a href="#mainMenu" class="skiplink">Naar hoofdnavigatiemenu</a>
            <a href="#mainContent" class="skiplink">Naar hoofdinhoud</a>
        </div>

        <div class="site">
          <header id="siteHeader" class="siteHeader" role="banner">
            <h1 class="title">GeoDienstenCentrum</h1>
            <p class="subtitle">Toegankelijke ruimtelijke informatievoorziening</p>

          <nav id="sitenav" class="site-nav" role="navigation">
              <ul>
                  <li>
                      <a href="/" id="mainMenu">
                          <span aria-hidden="true" data-icon="&#xe60d;"></span>
                          <span>home</span>
                      </a>
                  </li>
                  <li>
                      <a href="/over.html">
                          <span aria-hidden="true" data-icon="&#xe60e;"></span>
                          <span>over</span>
                      </a>
                  </li>
              </ul>
          </nav>
          </header>

          <main id="mainContent" tabindex="-1" role="main">

          <p>Voor advies over en implementatie van toegankelijke ruimtelijke informatie met een 
"privacy first" insteek, bij voorkeur op basis van open standaarden, open source software 
en open data.<p>

          </main>

        </div>

        <div class="site-footer">
              <span class="rss">
                  <a href="/atom.xml" class="">
                    <span aria-hidden="true" data-icon="&#xe608;"></span>
                    <span class="visually-hidden">Atom feed voor deze site</span>
                  </a>
              </span>
        </div>
        <script src="/js/script.js" charset="utf-8"></script>
    </body>
</html>

Full pages and traces are on Travis-ci: https://travis-ci.org/GeoDienstenCentrum/geodienstencentrum.github.io/builds/33530564

Make the output more readable

Follow-up to #71. I propose making the output more readable by using indentation (inspired by npm). Note the issue count at the end of some lines when an issue appears more than once. Examples:

Sorted by path

./_site/blog/a-whisper/index.html (4)
├── image /assets/body/black_down_arrow-c5df63bbaa0639b1295aa92bf32fe9ff.png does not have an alt attribute
├── image /assets/body/rss-c859bf63379b25bc6e44eba6f7a8b5ed.png does not have an alt attribute (2)
└── image /blog/images/waterfall.jpg does not have an alt attribute
./_site/blog/advanced-ratcheting/index.html (3)
├── image /assets/body/black_down_arrow-c5df63bbaa0639b1295aa92bf32fe9ff.png does not have an alt attribute
└── image /assets/body/rss-c859bf63379b25bc6e44eba6f7a8b5ed.png does not have an alt attribute (2)

Sorted by issue

image /assets/body/black_down_arrow-c5df63bbaa0639b1295aa92bf32fe9ff.png does not have an alt attribute (2)
├── ./_site/blog/a-whisper/index.html
└── ./_site/blog/advanced-ratcheting/index.html
image /assets/body/rss-c859bf63379b25bc6e44eba6f7a8b5ed.png does not have an alt attribute (4)
├── ./_site/blog/a-whisper/index.html (2)
└── ./_site/blog/advanced-ratcheting/index.html (2)
image /blog/images/waterfall.jpg does not have an alt attribute
└── ./_site/blog/a-whisper/index.html

Getting errors for "//", "mailto:" and "tel:" URLs

git clone git@github.com:hafniatimes/hafniatimes.github.io.git reproduction
cd reproduction
bundle exec jekyll build
gem install html-proofer
htmlproof ./_site

Returns

$ htmlproof ./_site
Running [Links, Images] checks on ./_site on *.html... 

Checking 8 external links...
Ran on 6 files!

./_site/404/index.html: internally linking to //twitter.com/hafniatimes, which does not exist

./_site/articles/2014/06/06/intro.html: internally linking to //twitter.com/hafniatimes, which does not exist

./_site/contact/index.html: internally linking to //twitter.com/hafniatimes, which does not exist

./_site/contact/index.html: mailto: is an invalid URL

./_site/contact/index.html: tel: is an invalid URL

./_site/da/articles/2014/06/06/intro.html: internally linking to //twitter.com/hafniatimes, which does not exist

./_site/da/index.html: internally linking to //twitter.com/hafniatimes, which does not exist

./_site/index.html: internally linking to //twitter.com/hafniatimes, which does not exist

htmlproof 0.7.1 | Error:  HTML-Proofer found 8 failures!

Are these href types supposed to fail?

  • href="//..."
  • href="mailto:..."
  • href="tel:..."

Just wondering. :)

Allow URLs to be ignored by attribute

Would be awesome if you could add a ci-ignore class or something similar to a link for it to be ignored by HTML Proofer.

The biggest use case would be hashes that are handled by JavaScript (e.g., Backbone fragments), but also dynamically generated URLs that wouldn't be practical to add to href_ignore.

I'd imagine it'd be something like:

<a href="#print" class="ci-ignore">Print</a>

Glad to take a pass at it, if there's interest.

Use Commander?

I'm no pro at Ruby development but I think this would be really useful as a command-line executable tool.

Content negotiation

Proofer should emulate content negotiation for HTML files. We could try to do it like Apache's MultiViews:

The effect of MultiViews is as follows: if the server receives a request for /some/dir/foo, if /some/dir has MultiViews enabled, and /some/dir/foo does not exist, then the server reads the directory looking for files named foo.*, and effectively fakes up a type map which names all those files, assigning them the same media types and content-encodings it would have if the client had asked for one of them by name. It then chooses the best match to the client's requirements.

With that in mind, I think it's time for a dedicated Internal class that fakes all the server behaviors we support: DirectoryIndex, MultiViews, followlocation, hashes, etc.

uri = Proofer::Internal.new("path/to/internal/resource", options = {})

if uri.invalid?
  if uri.hash?
    issues << "Hash not found"
  elsif uri.empty?
    issues << "URI empty"
  elsif uri.ugly?
    issues << "URI ugly"
  end
end

undefined method `version` error

Hi, I was attempting to follow the doc here (http://jekyllrb.com/docs/continuous-integration/) and found an interesting error when running bundle exec htmlproof ./_site.

/Users/me/.rbenv/versions/1.9.3-p547/lib/ruby/gems/1.9.1/gems/html-proofer-1.1.5/bin/htmlproof:11:in `block in <top (required)>': undefined method `version' for nil:NilClass (NoMethodError)
    from /Users/me/.rbenv/versions/1.9.3-p547/lib/ruby/gems/1.9.1/gems/mercenary-0.3.4/lib/mercenary.rb:21:in `program'
    from /Users/me/.rbenv/versions/1.9.3-p547/lib/ruby/gems/1.9.1/gems/html-proofer-1.1.5/bin/htmlproof:10:in `<top (required)>'
    from /Users/me/.rbenv/versions/1.9.3-p547/bin/htmlproof:23:in `load'
    from /Users/me/.rbenv/versions/1.9.3-p547/bin/htmlproof:23:in `<main>'

This happens to me on both CI (Shippable, Ubuntu 12.04) as well as a local environment (OSX 10.9.2, ruby 1.9.3p547). I installed the gem through bundle install.

A quick fix is commenting out line 11 in bin/htmlproof:

p.version Gem::Specification::load(File.join(File.dirname(__FILE__), "..", "html-proofer.gemspec")).version

This suggests that the "html-proofer.gemspec" file is missing, and indeed it is:

$ cd /Users/me/.rbenv/versions/1.9.3-p547/lib/ruby/gems/1.9.1/gems/html-proofer-1.1.5/
$ ls -l
drwxr-xr-x  3 me  staff   102 Aug  7 22:38 bin
drwxr-xr-x  3 me  staff   102 Aug  7 22:23 lib
$ curl -O https://raw.githubusercontent.com/gjtorikian/html-proofer/master/html-proofer.gemspec

I checked and the issue is not there in version 1.1.4, which seems to download all the files in the repository, not just the bin and lib directories. As a result, the issue may stem from 88f6572.

Thank you for writing the gem - it has helped me find many interesting link errors.

Prose checking

I would like to check the content of elements, except code and pre, against an arbitrary array of strings. Two use cases come to mind.

  1. Typography. You will never see characters like ", --, !! in a book written in a European language. People who care about typography could proofread their texts:

    HTML::Proofer.new("./_book", {:prose => ["\"", "--", "!!"]}).run
  2. Censorship. Some words can not be published:

    HTML::Proofer.new("./_vegan", {:prose => ["meat", "fish", "egg"]}).run

The typography use case is more important, because if you use pre-processors like Markdown, you write -- and the renderer converts it to an en dash (–). Proofer could check whether it renders as expected.

What do you think about it?

Ignore <a href="#"> when checking internal links

Using such anchors is quite a common practice (e.g. by Bootstrap Dropdowns) and generates the following error:

index.html: linking to internal hash # that does not exist

I think they shouldn't trigger an error as they are just used as a placeholder and not for linking to a specific part of the document.
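
A workaround in the meantime might be to ignore the bare hash explicitly; the same option shows up in a later issue below:

HTML::Proofer.new("./_site", href_ignore: ['#']).run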

Redirected links don't report original href in log

I'm seeing some errors appear for links that don't actually exist in the HTML files specified. This is because a link in the HTML redirects to another URL, and that end URL is reported in the log rather than the original URL that was linked. This made it a bit tough to find the broken link in a page with a pile of links.

Why stop updating broken links?

Travis has a blog. In https://github.com/travis-ci/blog-travis-ci-com/pull/21 somebody finds a broken link and tries to fix it. A Travis guy answers:

Here's a general question: Is the blog meant to be a document to reflect how things are now, or a historical document that announces what was new then?

It makes sense to make corrections to errors for a short while after the article's publication, but at some point we should stop updating them.

The Travis person does not merge the fixed link and leaves the issue open. It has now been open for two months.

It would be helpful to provide motivation in such situations. I'm thinking of images like this one:

[image: broken-link-seal]

Test for valid HTML

Doesn't look like this is a feature yet, but it would be very nice to have.

Awesome

Just wanted to say thank you for this great tool! 😍

Issues with SSL checks

./out/ssl-configuration.html: External link https://www.openssl.org/ failed: 0 Peer certificate cannot be authenticated with given CA certificates

./out/what-is-my-disk-quota.html: External link https://www.npmjs.org/ failed: 0 Peer certificate cannot be authenticated with given CA certificates

./out/what-are-other-good-resources-for-learning-git-and-github.html: External link https://www.codeschool.com/courses/git-real failed: 0 Peer certificate cannot be authenticated with given CA certificates

./out/what-are-other-good-resources-for-learning-git-and-github.html: External link https://www.codeschool.com/ failed: 0 Peer certificate cannot be authenticated with given CA certificates

in-href JS returns error

_site/index.html: javascript:if(typeof WZXYxe58==typeof alert)WZXYxe58();(function(){var s=document.createElement('link');s.setAttribute('href','/static/css/dyslexia.css');s.setAttribute('rel','stylesheet');s.setAttribute('type','text/css');document.getElementsByTagName('head')[0].appendChild(s);})(); is an invalid URL

https://travis-ci.org/hafniatimes/hafniatimes.github.io/builds/31079849#L319

Not sure whether there’s a specific part that’s failing, or if the script just doesn’t fancy in-href JS, so I defer to you in this matter. 'href','/static/css/dyslexia.css' works just fine, if you click Dyslexia here. :)

It’s probably not the place for html-proofer to inspect a JavaScript href, but it currently seems to be broken either way.

Crash when folder named *.html

I just discovered that if I create a folder that ends in .html, like Test.html, the proofer crashes:

htmlproof 1.3.1 | Error: Is a directory @ io_fread - _site/Test.html

False positive on link with url encoded character

I have the following snippet in one of my html files:

<link rel="prefetch" href="data/c%23.csv">

On disk, the linked file is named c#.csv so I am url encoding the number sign character in html.

html-proofer reports the following error when encountering this:

index.html: internally linking to data/c%23.csv, which does not exist

This can't be right. I'm referencing a few more files from the same path and this is the only one that produces an error, so the issue is definitely related to the encoded character.
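
A minimal sketch of the step that appears to be missing, assuming the internal check ultimately compares the href against the filesystem; the path would need to be percent-decoded first (site_root is hypothetical):

require "uri"

href    = "data/c%23.csv"
decoded = URI.decode_www_form_component(href)  # => "data/c#.csv"
# note: this decoder also turns '+' into a space; a stricter path decoder may be preferable
File.exist?(File.join(site_root, decoded))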

Remove double \n in output

Maybe it’s just me, but \n\n is overkill in the log:

[screenshot: 2014-07-15 10:51:09]

[screenshot: 2014-07-15 10:54:47]

I think it’s acceptable that there’s one newline, in the cases where the line is longer than the terminal width, though:

[screenshot: 2014-07-15 10:50:49]

But the current set-up makes it really hard to read the log in one window.

URLs with parameters deemed invalid

I have a page linking to the URL http://dotgov-browser.herokuapp.com/domains?cms=drupal, about which HTML Proofer complains: ./_site/2014/07/07/analysis-of-federal-executive-domains-part-deux/index.html: (http://dotgov-browser.herokuapp.com/domains?cms=drupal) is an invalid URL.

The URL returns a 50x (my fault), but should still be seen as a valid URL.
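
For what it's worth, the query string alone shouldn't make the URL invalid; Ruby's own URI parser accepts it without complaint:

require "uri"

uri = URI.parse("http://dotgov-browser.herokuapp.com/domains?cms=drupal")
uri.class  # => URI::HTTP
uri.query  # => "cms=drupal"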

"Too many open files" error

Ran into a very curious issue just now, running the tests locally:

~/code/stuff$ be rake cibuild
bundle exec jekyll build --destination _site
Configuration file: /Users/parkermoore/code/stuff/_config.yml
            Source: /Users/parkermoore/code/stuff
       Destination: _site
      Generating... done.
Running [Links, Images] checks on /Users/parkermoore/code/stuff/_site...

rake aborted!
Too many open files - /Users/parkermoore/code/stuff/_site/mirrors/world.html
_tasks/cibuild.rake:5:in `block in <top (required)>'
Tasks: TOP => cibuild
(See full trace by running task with --trace)

The error appears to be Ruby reaching its file descriptor limit. Any way I can limit html-proofer to a certain number of files at a time?
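
Until there's an option for that, one local workaround (not an html-proofer feature, just a sketch) is to raise the process's open-file limit before the rake task runs the checks:

soft, hard = Process.getrlimit(:NOFILE)
Process.setrlimit(:NOFILE, [4096, hard].min, hard) if soft < 4096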

Problems with 301s and hash tag refs

Scenario:

  1. Page A links to Page B#some-hash
  2. Page B 301s to another page, Page C
  3. Page C does actually contain some-hash

html-proofer fails, though. It does not follow the redirect in step two; instead, it tries to look for the hash on Page B and complains.

Internally cache status of known URLs

Running html-proofer on my personal site, which isn't huge, can take ~10 minutes. I wonder whether, for example, every time I link to / or /about in the header, it's making an HTTP call for each page. If within one run we've already checked a URL and got a 200 status code, cache it so that we don't keep rechecking the same URLs and can complete tests in a reasonable time.
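
A rough sketch of per-run memoization; check_url is a hypothetical method standing in for whatever actually performs the HTTP request:

def cached_status(url)
  @url_cache ||= {}
  return @url_cache[url] if @url_cache.key?(url)
  @url_cache[url] = check_url(url)  # hit the network only once per unique URL
end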

Support links behind auth

Occasionally, at GitHub, we'll link to sites within github.com that are behind auth. For example: [check out this discussion](https://www.github.com/github/secret-internal-repo/issues/23).

We've had to exclude these links by writing them out as HTML and adding data-proofer-ignore. Blah.

I think instead there should be a new config option hash that takes a domain as a key, and an OAuth token as the value, so that these sorts of links can be checked. For example, you'd pass in :domain_auth => { "github.com" => ENV['MACHINE_USER_TOKEN'] }. When HTML::Proofer hits a 404, it'd look the domain up, and try to use the provided token to recheck the link.

/cc @penibelst @parkr Y'all think this makes sense?
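
A rough sketch of the proposed recheck, assuming the :domain_auth option described above and that Typhoeus performs the external request (url and options come from the surrounding check):

require "uri"

token = (options[:domain_auth] || {})[URI.parse(url).host]
if token
  response = Typhoeus.get(url, followlocation: true,
                          headers: { "Authorization" => "token #{token}" })
  recheck_passed = response.success?
end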

Checking the srcset attribute

Images can have a srcset attribute:

When authors adapt their sites for high-resolution displays, they often need to be able to use different assets representing the same image. We address this need for adaptive, bitmapped content images by adding a srcset attribute to the img element.
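
A rough sketch of collecting srcset candidates so each URL could be checked like a src, assuming candidates are comma-separated "url [descriptor]" pairs and doc is the existing Nokogiri document:

doc.css("img[srcset]").flat_map do |img|
  img["srcset"].split(",").map { |candidate| candidate.strip.split(/\s+/).first }
end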

Nokogiri dependency brings CI builds to a crawl

Just testing out using html-proofer for a random site of "stuff" I have built and the Nokogiri dependency slows everything down a lot, as installation is incredibly slow.

What is the "best practice" around this? Build locally or store vendored versions of gems?

Namespace Typhoeus options

As noted in #113 (comment), Typhoeus is really picky about what it takes in. I'll need to make a breaking release in order to namespace Typhoeus's (and other libs'!) options. So rather than

HTML::Proofer.new("out/", {:ext => ".htm", :verbose => true, :ssl_verifyhost => 2 })

It should be

HTML::Proofer.new("out/", {:ext => ".htm", :verbose => true, :typhoeus => {:ssl_verifyhost => 2 }})

Proofer raises false positive on RSS feeds

If you have a <link> element in an RSS feed, with the tag body being the link, HTML Proofer raises an "anchor has no href attribute" error.

e.g. <link>http://ben.balter.com/2014/10/08/why-government-contractors-should-%3C3-open-source/</link>

Expose line number in errors

Not sure how to do it (maybe count \n's?), but it would be extremely helpful to know the line number of errors when they're output to the console, e.g.:

_site/foo.html: internally linking to _site/bar.html on line 7 which doesn't exist
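
Counting \n's may not be necessary: Nokogiri nodes expose a line method, so a sketch of the report could look like this (doc and path are assumed to exist):

doc.css("a[href]").each do |a|
  puts "#{path}: internally linking to #{a['href']} on line #{a.line}"
end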

Support Open Graph

The Open Graph protocol requires two properties for every page we could check:

  • og:image - An image URL which should represent your object within the graph.
  • og:url - The canonical URL of your object that will be used as its permanent ID in the graph, e.g., "http://www.imdb.com/title/tt0117500/".

Example:

<meta property="og:url" content="http://www.example.com/" />
<meta property="og:image" content="http://www.example.com/image.png" />

Warn on permanent redirects (301)

Over time, external links get moved permanently, because nice guys don't break the web. In most cases that means "the old URL is deprecated, use the new URL". Can we have an option to output a warning on those links?

There is another case: automatic server-side redirection when people forget the trailing slash in their internal links. An example is Bootstrap's main menu: it lists the lazy /components instead of the correct /components/. This wastes performance on an extra redirect round trip.

Failing test: "Links test: fails on redirects if not following"

Looks like something changed on the referenced URL, so the test fails now:

Failures:

  1) Links test fails on redirects if not following
     Failure/Error: output.should match /External link https:\/\/help.github.com\/changing-author-info\/ failed: 301 No error/
       expected "spec/html/proofer/fixtures/links/linkWithRedirect.html: External link http://timclem.wordpress.com/2012/03/01/mind-the-end-of-your-line/ failed: 301 No error\n" to match /External link https:\/\/help.github.com\/changing-author-info\/ failed: 301 No error/
     # ./spec/html/proofer/links_spec.rb:45:in `block (2 levels) in <top (required)>'

External embeds

Inspired by the discussion with @parkr about privacy, I want to propose an optional check for external embedded resources: images, styles, scripts, etc. External embeds have many drawbacks.

  1. Reliability — External servers come and go. The best current example was published recently: Don't Use jquery-latest.js.
  2. Speed — Every external resource means opening a new, time-consuming connection.
  3. Privacy — If you respect your visitors, you don't let them be tracked.

High-quality websites serve only from their own hosts. If I migrate an old website, the first thing I do is collect all embedded resources. Then I can really control what happens on my website.

The option would also help big teams with many authors keep an eye on this, because lazy authors sometimes embed images from Tumblr instead of uploading them to their own server.

Scenario:

  • We serve from www.example.com
  • Our assets are assets.example.com

The option must check all external resources (e.g. http://code.jquery.com/jquery-latest.min.js), except those on your own external server.

How would you design such an option?
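
One possible shape, sketched with a hypothetical allowed_hosts list covering the scenario above (doc and issues are assumed to exist):

require "uri"

allowed_hosts = %w[www.example.com assets.example.com]
doc.css("img[src], script[src], link[href]").each do |node|
  url  = node["src"] || node["href"]
  host = URI.parse(url).host rescue nil
  issues << "external embed: #{url}" if host && !allowed_hosts.include?(host)
end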

allow links to sites with self-signed certs?

Upgraded from 1.1.3 to 1.3.0, and it seems the ssl_verifypeer option is no longer supported.

require 'html/proofer'                                                       

task :test do                                                                
  HTML::Proofer.new("./_site", href_ignore: ['#'], ssl_verifypeer: false).run
end                                                                          

After upgrading I have a few new failures of the form:

External link https://blog.patternsinthevoid.net/ failed: 0 Peer certificate cannot be authenticated with given CA certificates
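
If this is fallout from the Typhoeus option namespacing discussed above, the equivalent call would presumably become (untested sketch):

HTML::Proofer.new("./_site", href_ignore: ['#'],
                  typhoeus: { ssl_verifypeer: false }).run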

Hanging on run

Hey,

The most recent version 1.3.3 is hanging. It says it's running, but I've waited a few minutes and nothing happens.

> htmlproof _site
> Running ["Images", "Links", "Scripts"] checks on _site on *.html... 


Thanks for any help.

Redirects don't appear to be handled properly

It could just be me, but I've noticed that HTML::Proofer will (usually?) treat redirects as failures. Is this behavior intentional and, if not, would it be reasonably easy to fix?

data uris in img tags fails to validate

When using a data URI in an img tag:

<img src="data:image/png;base64, blah">

I get:

bad URI(is not URI?): data:image/png;base64, blah

This should pass validation.
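
A guard along these lines before the URI check would let data URIs through; check_image is hypothetical and stands in for whatever validation runs today:

doc.css("img[src]").each do |img|
  next if img["src"].to_s.start_with?("data:")  # data URIs need no resolution
  check_image(img)
end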
