
html-proofer's Issues

Error when linking to git://, ftp:// etc. URIs

The following HTML produces validation errors:

<a href="git://github.com/mono/mono">Git</a>
<a href="ftp://ftp.example.com">FTP</a>
<a href="irc://irc.gimp.org/mono">IRC</a>
<a href="svn://svn.example.com">SVN</a>

../test.html: internally linking to git://github.com/mono/mono, which does not exist
../test.html: internally linking to ftp://ftp.example.com, which does not exist
../test.html: internally linking to irc://irc.gimp.org/mono, which does not exist
../test.html: internally linking to svn://svn.example.com, which does not exist

I tried passing in --href_ignore git, but that changed nothing. I know that the proofer can't really validate these links, but shouldn't it just ignore them, like it does with mailto:?
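
A possible workaround sketch, assuming :href_ignore accepts regular expressions as well as plain strings (the CLI flag may only do exact matching, which would explain why --href_ignore git had no effect):

HTML::Proofer.new("./_site", :href_ignore => [%r{\A(git|ftp|irc|svn)://}]).run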

How to use from command line?

I can install html-proofer.

$ gem install html-proofer
Successfully installed html-proofer-0.6.0

How can I use it from the command line? Neither html-proofer nor htmlproof works here.

Threads/Processes?

So, I have a thought: what if we used Process.fork or Thread.new to allow for concurrent link proofing? Which is the better approach? Does Typhoeus do this already and I just don't know it?
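
For reference, Typhoeus ships a concurrency primitive, Hydra, which queues requests and runs them in parallel. A rough, untested sketch of what using it for the external link checks might look like (external_urls and issues are assumed to exist):

require "typhoeus"

hydra = Typhoeus::Hydra.new(max_concurrency: 20)
external_urls.each do |url|
  request = Typhoeus::Request.new(url, method: :head, followlocation: true)
  request.on_complete do |response|
    issues << "#{url} failed: #{response.code}" unless response.success?
  end
  hydra.queue(request)
end
hydra.run  # blocks until every queued request has completed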

internal links to element id's are not found

It appears that, at least with 0.27.0, internal links that reference an element id are not found. For example, the two skip links in the example below give these errors:

...internally linking to #mainMenu, which does not exist
...internally linking to #mainContent, which does not exist

Both obviously do exist, are valid hash-name references, and are focusable elements: the former is the first anchor in the sitenav <nav> element; the latter is the <main> element.

<!DOCTYPE html>
<html lang="nl-NL" class="no-js" id="document">
    <head prefix="og: http://ogp.me/ns# article: http://ogp.me/ns/article#">
        <meta charset="utf-8">
        <title>Homepage</title>
    </head>
    <body>

        <div id="skipLinks" class="skiplinks">
            <a href="#mainMenu" class="skiplink">Naar hoofdnavigatiemenu</a>
            <a href="#mainContent" class="skiplink">Naar hoofdinhoud</a>
        </div>

        <div class="site">
          <header id="siteHeader" class="siteHeader" role="banner">
            <h1 class="title">GeoDienstenCentrum</h1>
            <p class="subtitle">Toegankelijke ruimtelijke informatievoorziening</p>

          <nav id="sitenav" class="site-nav" role="navigation">
              <ul>
                  <li>
                      <a href="/" id="mainMenu">
                          <span aria-hidden="true" data-icon="&#xe60d;"></span>
                          <span>home</span>
                      </a>
                  </li>
                  <li>
                      <a href="/over.html">
                          <span aria-hidden="true" data-icon="&#xe60e;"></span>
                          <span>over</span>
                      </a>
                  </li>
              </ul>
          </nav>
          </header>

          <main id="mainContent" tabindex="-1" role="main">

          <p>Voor advies over en implementatie van toegankelijke ruimtelijke informatie met een 
"privacy first" insteek, bij voorkeur op basis van open standaarden, open source software 
en open data.<p>

          </main>

        </div>

        <div class="site-footer">
              <span class="rss">
                  <a href="/atom.xml" class="">
                    <span aria-hidden="true" data-icon="&#xe608;"></span>
                    <span class="visually-hidden">Atom feed voor deze site</span>
                  </a>
              </span>
        </div>
        <script src="/js/script.js" charset="utf-8"></script>
    </body>
</html>

Full pages and traces are on Travis-ci: https://travis-ci.org/GeoDienstenCentrum/geodienstencentrum.github.io/builds/33530564

Make the output more readable

Follow-up to #71. I propose making the output more readable by using indentation (inspired by npm). Note the issue count at the end of some lines when an issue appears more than once. Examples:

Sorted by path

./_site/blog/a-whisper/index.html (4)
├── image /assets/body/black_down_arrow-c5df63bbaa0639b1295aa92bf32fe9ff.png does not have an alt attribute
├── image /assets/body/rss-c859bf63379b25bc6e44eba6f7a8b5ed.png does not have an alt attribute (2)
└── image /blog/images/waterfall.jpg does not have an alt attribute
./_site/blog/advanced-ratcheting/index.html (3)
├── image /assets/body/black_down_arrow-c5df63bbaa0639b1295aa92bf32fe9ff.png does not have an alt attribute
└── image /assets/body/rss-c859bf63379b25bc6e44eba6f7a8b5ed.png does not have an alt attribute (2)

Sorted by issue

image /assets/body/black_down_arrow-c5df63bbaa0639b1295aa92bf32fe9ff.png does not have an alt attribute (2)
├── ./_site/blog/a-whisper/index.html
└── ./_site/blog/advanced-ratcheting/index.html
image /assets/body/rss-c859bf63379b25bc6e44eba6f7a8b5ed.png does not have an alt attribute (4)
├── ./_site/blog/a-whisper/index.html (2)
└── ./_site/blog/advanced-ratcheting/index.html (2)
image /blog/images/waterfall.jpg does not have an alt attribute
└── ./_site/blog/a-whisper/index.html

Getting errors for "//", "mailto:" and "tel:" URLs

git clone git@github.com:hafniatimes/hafniatimes.github.io.git reproduction
cd reproduction
bundle exec jekyll build
gem install html-proofer
htmlproof ./_site

Returns

$ htmlproof ./_site
Running [Links, Images] checks on ./_site on *.html... 

Checking 8 external links...
Ran on 6 files!

./_site/404/index.html: internally linking to //twitter.com/hafniatimes, which does not exist

./_site/articles/2014/06/06/intro.html: internally linking to //twitter.com/hafniatimes, which does not exist

./_site/contact/index.html: internally linking to //twitter.com/hafniatimes, which does not exist

./_site/contact/index.html: mailto: is an invalid URL

./_site/contact/index.html: tel: is an invalid URL

./_site/da/articles/2014/06/06/intro.html: internally linking to //twitter.com/hafniatimes, which does not exist

./_site/da/index.html: internally linking to //twitter.com/hafniatimes, which does not exist

./_site/index.html: internally linking to //twitter.com/hafniatimes, which does not exist

htmlproof 0.7.1 | Error:  HTML-Proofer found 8 failures!

Are these href types supposed to fail?

  • href="//..."
  • href="mailto:..."
  • href="tel:..."

Just wondering. :)

Allow URLs to be ignored by attribute

Would be awesome if you could add a ci-ignore class or something similar to a link for it to be ignored by HTML Proofer.

The biggest use case would be hashes that are handled by JavaScript (e.g., Backbone fragments), but also dynamically generated URLs that wouldn't be practical to add to href_ignore.

I'd imagine it'd be something like:

<a href="#print" class="ci-ignore">Print</a>

Glad to take a pass at it, if there's interest.

Use Commander?

I'm no pro at Ruby development but I think this would be really useful as a command-line executable tool.

Content negotiation

Proofer should emulate content negotiation for HTML files. We could try to do it like Apache's MultiViews:

The effect of MultiViews is as follows: if the server receives a request for /some/dir/foo, if /some/dir has MultiViews enabled, and /some/dir/foo does not exist, then the server reads the directory looking for files named foo.*, and effectively fakes up a type map which names all those files, assigning them the same media types and content-encodings it would have if the client had asked for one of them by name. It then chooses the best match to the client's requirements.

With that in mind, I think it's time for a dedicated Internal class that fakes all the server behaviors we support: DirectoryIndex, MultiViews, followlocation, hashes, etc.

uri = Proofer::Internal.new("path/to/internal/resource", options = {})

if uri.invalid?
  if uri.hash?
    issues << "Hash not found"
  elsif uri.empty?
    issues << "URI empty"
  elsif uri.ugly?
    issues << "URI ugly"
  end
end

undefined method `version` error

Hi, I was attempting to follow the doc here (http://jekyllrb.com/docs/continuous-integration/) and found an interesting error when running bundle exec htmlproof ./_site.

/Users/me/.rbenv/versions/1.9.3-p547/lib/ruby/gems/1.9.1/gems/html-proofer-1.1.5/bin/htmlproof:11:in `block in <top (required)>': undefined method `version' for nil:NilClass (NoMethodError)
    from /Users/me/.rbenv/versions/1.9.3-p547/lib/ruby/gems/1.9.1/gems/mercenary-0.3.4/lib/mercenary.rb:21:in `program'
    from /Users/me/.rbenv/versions/1.9.3-p547/lib/ruby/gems/1.9.1/gems/html-proofer-1.1.5/bin/htmlproof:10:in `<top (required)>'
    from /Users/me/.rbenv/versions/1.9.3-p547/bin/htmlproof:23:in `load'
    from /Users/me/.rbenv/versions/1.9.3-p547/bin/htmlproof:23:in `<main>'

This happens to me on both CI (Shippable, Ubuntu 12.04) as well as a local environment (OSX 10.9.2, ruby 1.9.3p547). I installed the gem through bundle install.

A quick fix is commenting out line 11 in bin/htmlproof:

p.version Gem::Specification::load(File.join(File.dirname(__FILE__), "..", "html-proofer.gemspec")).version

This suggests that the "html-proofer.gemspec" file is missing, and indeed it is:

$ cd /Users/me/.rbenv/versions/1.9.3-p547/lib/ruby/gems/1.9.1/gems/html-proofer-1.1.5/
$ ls -l
drwxr-xr-x  3 me  staff   102 Aug  7 22:38 bin
drwxr-xr-x  3 me  staff   102 Aug  7 22:23 lib
$ curl -O https://raw.githubusercontent.com/gjtorikian/html-proofer/master/html-proofer.gemspec

I checked and the issue is not there in version 1.1.4, which seems to download all the files in the repository, not just the bin and lib directories. As a result, the issue may stem from 88f6572.

Thank you for writing the gem - it has helped me find many interesting link errors.

Prose checking

I would like to check the content of elements, except code and pre, against an arbitrary array of strings. Two use cases come to mind.

  1. Typography. You will never see characters like ", --, !! in a book written in a European language. People who care about typography could proofread their texts:

    HTML::Proofer.new("./_book", {:prose => ["\"", "--", "!!"]}).run
  2. Censorship. Some words can not be published:

    HTML::Proofer.new("./_vegan", {:prose => ["meat", "fish", "egg"]}).run

The typography use case is more important, because if you use pre-processors like Markdown, you write -- and the renderer converts it to an en dash (–). Proofer could check whether it renders as expected.

What do you think about it?

Ignore <a href="#"> when checking internal links

Using such anchors is quite a common practice (e.g. by Bootstrap Dropdowns) and generates the following error:

index.html: linking to internal hash # that does not exist

I think they shouldn't trigger an error as they are just used as a placeholder and not for linking to a specific part of the document.
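
A workaround in the meantime might be to ignore the bare hash explicitly; the same option shows up in a later issue below:

HTML::Proofer.new("./_site", href_ignore: ['#']).run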

Redirected links don't report original href in log

I'm seeing some errors appear for links that don't actually exist in the HTML files specified. This is because a link in the HTML redirects to another URL, and that end URL is reported in the log rather than the original URL that was linked. This made it a bit tough to find the broken link in a page with a pile of links.

Why stop updating broken links?

Travis has a blog. In https://github.com/travis-ci/blog-travis-ci-com/pull/21 somebody finds a broken link and tries to fix it. A Travis guy answers:

Here's a general question: Is the blog meant to be a document to reflect how things are now, or a historical document that announces what was new then?

It makes sense to make corrections to errors for a short while after the article's publication, but at some point we should stop updating them.

The Travis person does not merge the fixed link and leaves the issue open. It has now been open for two months.

It would be helpful to provide motivation in such situations. I'm thinking of images like this one:

[image: broken-link-seal]

Test for valid HTML

Doesn't look like this is a feature yet, but it would be very nice to have.

Awesome

Just wanted to say thank you for this great tool! 😍

Issues with SSL checks

./out/ssl-configuration.html: External link https://www.openssl.org/ failed: 0 Peer certificate cannot be authenticated with given CA certificates

./out/what-is-my-disk-quota.html: External link https://www.npmjs.org/ failed: 0 Peer certificate cannot be authenticated with given CA certificates

./out/what-are-other-good-resources-for-learning-git-and-github.html: External link https://www.codeschool.com/courses/git-real failed: 0 Peer certificate cannot be authenticated with given CA certificates

./out/what-are-other-good-resources-for-learning-git-and-github.html: External link https://www.codeschool.com/ failed: 0 Peer certificate cannot be authenticated with given CA certificates

in-href JS returns error

_site/index.html: javascript:if(typeof WZXYxe58==typeof alert)WZXYxe58();(function(){var s=document.createElement('link');s.setAttribute('href','/static/css/dyslexia.css');s.setAttribute('rel','stylesheet');s.setAttribute('type','text/css');document.getElementsByTagName('head')[0].appendChild(s);})(); is an invalid URL

https://travis-ci.org/hafniatimes/hafniatimes.github.io/builds/31079849#L319

Not sure whether there’s a specific part that’s failing, or if the script just doesn’t fancy in-href JS, so I defer to you in this matter. 'href','/static/css/dyslexia.css' works just fine, if you click Dyslexia here. :)

It’s probably not the place for html-proofer to inspect a JavaScript href, but it currently seems to be broken either way.

Crash when folder named *.html

I just discovered that if I create a folder that ends in .html, like Test.html, the proofer crashes:

htmlproof 1.3.1 | Error: Is a directory @ io_fread - _site/Test.html

False positive on link with url encoded character

I have the following snippet in one of my html files:

<link rel="prefetch" href="data/c%23.csv">

On disk, the linked file is named c#.csv so I am url encoding the number sign character in html.

html-proofer reports the following error when encountering this:

index.html: internally linking to data/c%23.csv, which does not exist

This can't be right. I'm referencing a few more files from the same path and this is the only one that produces an error, so the issue is definitely related to the encoded character.
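
A minimal sketch of the step that appears to be missing, assuming the internal check ultimately compares the href against the filesystem; the path would need to be percent-decoded first (site_root is hypothetical):

require "uri"

href    = "data/c%23.csv"
decoded = URI.decode_www_form_component(href)  # => "data/c#.csv"
# note: this decoder also turns '+' into a space; a stricter path decoder may be preferable
File.exist?(File.join(site_root, decoded))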

Remove double \n in output

Maybe it’s just me, but \n\n is overkill in the log:

[screenshot: 2014-07-15 10:51:09]

[screenshot: 2014-07-15 10:54:47]

I think it’s acceptable that there’s one newline, in the cases where the line is longer than the terminal width, though:

[screenshot: 2014-07-15 10:50:49]

But the current set-up makes it really hard to read the log in one window.

URLs with parameters deemed invalid

I have a page linking to the URL http://dotgov-browser.herokuapp.com/domains?cms=drupal, about which HTML Proofer complains: ./_site/2014/07/07/analysis-of-federal-executive-domains-part-deux/index.html: (http://dotgov-browser.herokuapp.com/domains?cms=drupal) is an invalid URL.

The URL returns a 50x (my fault), but should still be seen as a valid URL.
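
For what it's worth, the query string alone shouldn't make the URL invalid; Ruby's own URI parser accepts it without complaint:

require "uri"

uri = URI.parse("http://dotgov-browser.herokuapp.com/domains?cms=drupal")
uri.class  # => URI::HTTP
uri.query  # => "cms=drupal"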

"Too many open files" error

Ran into a very curious issue just now, running the tests locally:

~/code/stuff$ be rake cibuild
bundle exec jekyll build --destination _site
Configuration file: /Users/parkermoore/code/stuff/_config.yml
            Source: /Users/parkermoore/code/stuff
       Destination: _site
      Generating... done.
Running [Links, Images] checks on /Users/parkermoore/code/stuff/_site...

rake aborted!
Too many open files - /Users/parkermoore/code/stuff/_site/mirrors/world.html
_tasks/cibuild.rake:5:in `block in <top (required)>'
Tasks: TOP => cibuild
(See full trace by running task with --trace)

The error appears to be Ruby reaching its file descriptor limit. Any way I can limit html-proofer to a certain number of files at a time?
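
Until there's an option for that, one local workaround (not an html-proofer feature, just a sketch) is to raise the process's open-file limit before the rake task runs the checks:

soft, hard = Process.getrlimit(:NOFILE)
Process.setrlimit(:NOFILE, [4096, hard].min, hard) if soft < 4096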

Problems with 301s and hash tag refs

Scenario:

  1. Page A links to Page B#some-hash
  2. Page B 301s to another page, Page C
  3. Page C does actually contain some-hash

html-proofer fails, though. It does not follow the redirect in step two; instead, it tries to look for the hash on Page B and complains.

Internally cache status of known URLs

Running html-proofer on my personal site, which isn't huge, can take ~10 minutes. I wonder whether, for example, every time I link to / or /about in the header, it's making an HTTP call for each page. If within one run we've already checked a URL and got a 200 status code, cache it so that we don't keep rechecking the same URLs and can complete tests in a reasonable time.
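
A rough sketch of per-run memoization; check_url is a hypothetical method standing in for whatever actually performs the HTTP request:

def cached_status(url)
  @url_cache ||= {}
  return @url_cache[url] if @url_cache.key?(url)
  @url_cache[url] = check_url(url)  # hit the network only once per unique URL
end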

Support links behind auth

Occasionally, at GitHub, we'll link to sites within github.com that are behind auth. For example: [check out this discussion](https://www.github.com/github/secret-internal-repo/issues/23).

We've had to exclude these links by writing them out as HTML and adding data-proofer-ignore. Blah.

I think instead there should be a new config option hash that takes a domain as a key, and an OAuth token as the value, so that these sorts of links can be checked. For example, you'd pass in :domain_auth => { "github.com" => ENV['MACHINE_USER_TOKEN'] }. When HTML::Proofer hits a 404, it'd look the domain up, and try to use the provided token to recheck the link.

/cc @penibelst @parkr Y'all think this makes sense?
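
A rough sketch of the proposed recheck, assuming the :domain_auth option described above and that Typhoeus performs the external request (url and options come from the surrounding check):

require "uri"

token = (options[:domain_auth] || {})[URI.parse(url).host]
if token
  response = Typhoeus.get(url, followlocation: true,
                          headers: { "Authorization" => "token #{token}" })
  recheck_passed = response.success?
end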

Checking the srcset attribute

Images can have a srcset attribute:

When authors adapt their sites for high-resolution displays, they often need to be able to use different assets representing the same image. We address this need for adaptive, bitmapped content images by adding a srcset attribute to the img element.
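
A rough sketch of collecting srcset candidates so each URL could be checked like a src, assuming candidates are comma-separated "url [descriptor]" pairs and doc is the existing Nokogiri document:

doc.css("img[srcset]").flat_map do |img|
  img["srcset"].split(",").map { |candidate| candidate.strip.split(/\s+/).first }
end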

Nokogiri dependency brings CI builds to a crawl

Just testing out using html-proofer for a random site of "stuff" I have built and the Nokogiri dependency slows everything down a lot, as installation is incredibly slow.

What is the "best practice" around this? Build locally or store vendored versions of gems?

Namespace Typhoeus options

As noted in #113 (comment), Typhoeus is really picky about what it takes in. I'll need to make a breaking release in order to namespace Typhoeus's (and other libs'!) options. So rather than

HTML::Proofer.new("out/", {:ext => ".htm", :verbose => true, :ssl_verifyhost => 2 })

It should be

HTML::Proofer.new("out/", {:ext => ".htm", :verbose => true, :typhoeus => {:ssl_verifyhost => 2 }})

Proofer raises false positive on RSS feeds

If you have a <link> element in an RSS feed, with the tag body being the link, HTML Proofer raises an "anchor has no href attribute" error.

e.g. <link>http://ben.balter.com/2014/10/08/why-government-contractors-should-%3C3-open-source/</link>

Expose line number in errors

Not sure how to do it (maybe count \n's?), but it would be extremely helpful to know the line number of errors when they're output to the console, e.g.:

_site/foo.html: internally linking to _site/bar.html on line 7 which doesn't exist
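
Counting \n's may not be necessary: Nokogiri nodes expose a line method, so a sketch of the report could look like this (doc and path are assumed to exist):

doc.css("a[href]").each do |a|
  puts "#{path}: internally linking to #{a['href']} on line #{a.line}"
end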

Support Open Graph

The Open Graph protocol requires two properties for every page we could check:

  • og:image - An image URL which should represent your object within the graph.
  • og:url - The canonical URL of your object that will be used as its permanent ID in the graph, e.g., "http://www.imdb.com/title/tt0117500/".

Example:

<meta property="og:url" content="http://www.example.com/" />
<meta property="og:image" content="http://www.example.com/image.png" />

Warn on permanent redirects (301)

Over time, external links get moved permanently, because nice guys don't break the web. In most cases that means "the old URL is deprecated, use the new URL". Can we have an option to output a warning on those links?

There is another case: automatic server-side redirection when people forget the trailing slash in their internal links. An example is Bootstrap's main menu: it lists the lazy /components instead of the correct /components/. This wastes performance on an extra redirect round trip.

Failing test: "Links test: fails on redirects if not following"

Looks like something changed on the referenced URL, so the test fails now:

Failures:

  1) Links test fails on redirects if not following
     Failure/Error: output.should match /External link https:\/\/help.github.com\/changing-author-info\/ failed: 301 No error/
       expected "spec/html/proofer/fixtures/links/linkWithRedirect.html: External link http://timclem.wordpress.com/2012/03/01/mind-the-end-of-your-line/ failed: 301 No error\n" to match /External link https:\/\/help.github.com\/changing-author-info\/ failed: 301 No error/
     # ./spec/html/proofer/links_spec.rb:45:in `block (2 levels) in <top (required)>'

External embeds

Inspired by the discussion with @parkr about privacy, I want to propose an optional check for external embedded resources: images, styles, scripts, etc. External embeds have many drawbacks.

  1. Reliability — External servers come and go. The best current example was published recently: Don't Use jquery-latest.js.
  2. Speed — Every external resource means opening a new, time-consuming connection.
  3. Privacy — If you respect your visitors, you don't let them be tracked.

High-quality websites serve only from their own hosts. If I migrate an old website, the first thing I do is collect all embedded resources. Then I can really control what happens on my website.

The option would also help big teams with many authors keep an eye on this, because lazy authors sometimes embed images from Tumblr instead of uploading them to their own server.

Scenario:

  • We serve from www.example.com
  • Our assets are assets.example.com

The option must check all external resources (e.g. http://code.jquery.com/jquery-latest.min.js), except those on your own external server.

How would you design such an option?
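
One possible shape, sketched with a hypothetical allowed_hosts list covering the scenario above (doc and issues are assumed to exist):

require "uri"

allowed_hosts = %w[www.example.com assets.example.com]
doc.css("img[src], script[src], link[href]").each do |node|
  url  = node["src"] || node["href"]
  host = URI.parse(url).host rescue nil
  issues << "external embed: #{url}" if host && !allowed_hosts.include?(host)
end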

allow links to sites with self-signed certs?

Upgraded from 1.1.3 to 1.3.0, and it seems the ssl_verifypeer option is no longer supported.

require 'html/proofer'                                                       

task :test do                                                                
  HTML::Proofer.new("./_site", href_ignore: ['#'], ssl_verifypeer: false).run
end                                                                          

After upgrading I have a few new failures of the form:

External link https://blog.patternsinthevoid.net/ failed: 0 Peer certificate cannot be authenticated with given CA certificates
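
If this is fallout from the Typhoeus option namespacing discussed above, the equivalent call would presumably become (untested sketch):

HTML::Proofer.new("./_site", href_ignore: ['#'],
                  typhoeus: { ssl_verifypeer: false }).run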

Hanging on run

Hey,

The most recent version 1.3.3 is hanging. It says it's running, but I've waited a few minutes and nothing happens.

> htmlproof _site
> Running ["Images", "Links", "Scripts"] checks on _site on *.html... 


Thanks for any help.

Redirects don't appear to be handled properly

It could just be me, but I've noticed that HTML::Proofer will (usually?) treat redirects as failures. Is this behavior intentional and, if not, would it be reasonably easy to fix?

data uris in img tags fails to validate

When using a data URI in an img tag:

<img src="data:image/png;base64, blah">

I get:

bad URI(is not URI?): data:image/png;base64, blah

This should pass validation.
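
A guard along these lines before the URI check would let data URIs through; check_image is hypothetical and stands in for whatever validation runs today:

doc.css("img[src]").each do |img|
  next if img["src"].to_s.start_with?("data:")  # data URIs need no resolution
  check_image(img)
end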
