htmlentities's Introduction

HTMLEntities

The canonical source for this project can be found at GitHub: threedaymonk/htmlentities.

HTML entity encoding and decoding for Ruby

HTMLEntities is a simple library to facilitate encoding and decoding of named (&yacute; and so on) or numerical (&#123; or &#x12a;) entities in HTML and XHTML documents.

Usage

HTMLEntities works with UTF-8 (or ASCII) strings only.

Please ensure that your system is set to display UTF-8 before running these examples. In Ruby 1.8, you'll need to set $KCODE = "u".

Decoding

require 'htmlentities'
coder = HTMLEntities.new
string = "&eacute;lan"
coder.decode(string) # => "élan"

Encoding

This is slightly more complicated, due to the various options. The encode method takes a variable number of parameters, which tell it which instructions to carry out.

require 'htmlentities'
coder = HTMLEntities.new
string = "<élan>"

Escape unsafe codepoints only:

coder.encode(string) # => "&lt;élan&gt;"

Or:

coder.encode(string, :basic) # => "&lt;élan&gt;"

Escape all entities that have names:

coder.encode(string, :named) # => "&lt;&eacute;lan&gt;"

Escape all non-ASCII/non-safe codepoints using decimal entities:

coder.encode(string, :decimal) # => "&#60;&#233;lan&#62;"

As above, using hexadecimal entities:

coder.encode(string, :hexadecimal) # => "&#x3c;&#xe9;lan&#x3e;"

You can also use several options, e.g. use named entities for unsafe codepoints, then decimal for all other non-ASCII:

coder.encode(string, :basic, :decimal) # => "&lt;&#233;lan&gt;"

Flavours

HTMLEntities knows about three different sets of entities:

  • :xhtml1 – Entities from the XHTML1 doctype
  • :html4 – Entities from the HTML4 doctype. Differs from :xhtml1 only by the absence of &apos;
  • :expanded – Entities from a variety of SGML sets

The default is :xhtml1, but you can override this:

coder = HTMLEntities.new(:expanded)

Licence

This code is free to use under the terms of the MIT licence. See the file COPYING.txt for more details.

Contact

Send email to [email protected].

htmlentities's People

Contributors

champierre, janne, merrells, threedaymonk, tricknotes

htmlentities's Issues

Add License information to gemfile

This will make it show up on rubygems.org. I'm doing due diligence on our gems and need to find out the licenses for all the gems. Having it show up on rubygems.org cuts out the step of having to go to the github repo.

htmlentities gem appearing as corrupt to Bundler

I'm attempting to install the htmlentities gem onto a new system via Bundler.

When running bundle install I keep hitting this error:

bundle install --deployment --local
/opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/site_ruby/1.8/rubygems/package/tar_input.rb:111:in `initialize': No metadata found! (Gem::Package::FormatError)
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/site_ruby/1.8/rubygems/package/tar_input.rb:17:in `new'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/site_ruby/1.8/rubygems/package/tar_input.rb:17:in `open'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/site_ruby/1.8/rubygems/package.rb:58:in `open'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/site_ruby/1.8/rubygems/format.rb:63:in `from_io'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/site_ruby/1.8/rubygems/format.rb:51:in `from_file_by_path'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/1.8/open-uri.rb:32:in `open_uri_original_open'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/1.8/open-uri.rb:32:in `open'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/site_ruby/1.8/rubygems/format.rb:50:in `from_file_by_path'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/source.rb:197:in `cached_specs'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/source.rb:195:in `each'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/source.rb:195:in `cached_specs'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/source.rb:194:in `each'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/source.rb:194:in `cached_specs'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/source.rb:157:in `fetch_specs'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/index.rb:7:in `build'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/source.rb:155:in `fetch_specs'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/source.rb:70:in `specs'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/lazy_specification.rb:48:in `__materialize__'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/spec_set.rb:83:in `materialize'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/spec_set.rb:81:in `map!'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/spec_set.rb:81:in `materialize'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/definition.rb:93:in `specs'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/definition.rb:81:in `resolve_with_cache!'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/installer.rb:34:in `run'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/installer.rb:8:in `install'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/cli.rb:217:in `install'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/vendor/thor/task.rb:22:in `send'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/vendor/thor/task.rb:22:in `run'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/vendor/thor/invocation.rb:118:in `invoke_task'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/vendor/thor.rb:246:in `dispatch'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/lib/bundler/vendor/thor/base.rb:389:in `start'
from /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/bundler-1.0.2/bin/bundle:13
from /opt/REE/bin/bundle:19:in `load'
from /opt/REE/bin/bundle:19

Based on the suggestion at this page - http://codebeef.com/bundler-no-metadata-found-problem - we have patched tar_input.rb to provide further details.

This yields the following:

/opt/www/appname/vendor/cache/htmlentities-4.2.1.gem may be corrupt! Delete it and retry the operation

Based on this error message, I had Bundler recache the gem, but hit the same issue again. I don't believe htmlentities to actually be corrupt, especially as I've run into this issue with a fresh cache of the gem file. Why would it appear as corrupt to Bundler?

Add support for incorrect numerical entity format

Add an option that allows users to decode (invalid) HTML entities that omit the # sign, such as &1234; instead of &#1234;.

(I'll open a PR for this soon, I just need the link to this issue for now).
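One way such an option could behave is sketched below in plain Ruby, without the gem. The method name decode_lenient_numeric and the MAX_CODEPOINT constant are hypothetical, and this is not how HTMLEntities itself decodes; the sketch only shows the # being treated as optional.

```ruby
# Hypothetical sketch: accept decimal entities with or without the "#".
MAX_CODEPOINT = 0x10FFFF

def decode_lenient_numeric(text)
  text.gsub(/&#?([0-9]{1,7});/) do |match|
    codepoint = $1.to_i
    # Leave the text untouched when the number is not a valid codepoint.
    codepoint <= MAX_CODEPOINT ? [codepoint].pack("U") : match
  end
end

puts decode_lenient_numeric("&1234; and &#1234;")
```

A lenient mode like this would probably need to stay opt-in, since text such as "AT&1234;" is otherwise left alone by a strict decoder.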

Add support for case-insensitive decoding

Add an option that allows users to decode (invalid) HTML entities with incorrect casing, such as &Amp; instead of &amp;.

(I'll open a PR for this soon, I just need the link to this issue for now).

Does not decode, when using regex.

I thought the entities would get decoded in the third block of code. Am I missing something?

>> require 'htmlentities'
>> @coder = HTMLEntities.new

>> poi_csv = ["space&nbsp;", "dots&hellip;", "arrow&raquo;"]

>> poi_csv.collect { |column| column.gsub( /(&.*?;)/, @coder.decode('\1')) }
=> ["space&nbsp;", "dots&hellip;", "arrow&raquo;"]

>> @coder.decode("&hellip;")
=> "…"

I am using Ruby 1.8.7 and Rails 2.3.17.
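A likely explanation is that Ruby evaluates the argument @coder.decode('\1') once, before gsub runs, so the decoder only ever sees the literal string "\1". The sketch below reproduces this with a stand-in decoder (ENTITY_TABLE and decode_entity are hypothetical, used so the snippet runs without the gem) and contrasts it with the block form of gsub, which receives each match.

```ruby
# Stand-in for HTMLEntities#decode so the sketch runs without the gem.
# With the gem installed, HTMLEntities.new.decode(text) would take its place.
ENTITY_TABLE = {
  "&nbsp;"   => "\u00A0",
  "&hellip;" => "\u2026",
  "&raquo;"  => "\u00BB"
}.freeze

def decode_entity(text)
  ENTITY_TABLE.fetch(text, text)
end

poi_csv = ["space&nbsp;", "dots&hellip;", "arrow&raquo;"]

# Argument form: decode_entity('\1') is called once, on the literal "\1".
# It returns "\1" unchanged, so gsub replaces every match with itself.
unchanged = poi_csv.map { |column| column.gsub(/(&.*?;)/, decode_entity('\1')) }

# Block form: the block runs once per match and is given the matched text.
decoded = poi_csv.map { |column| column.gsub(/&.*?;/) { |match| decode_entity(match) } }

p unchanged
p decoded
```

With the real library the gsub is probably unnecessary, since decode already scans the whole string for entities.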

Verify HTML entity names

While looking at that duplicate inodot key that was fixed in #17, I noticed the capital letter below is called Iodot. That seemed inconsistent with other capital letters, so I looked it up.

What I found was &Idot; without the o. That's a bit strange too.

I'm curious where this list of entities came from and if there is any way to verify that they are all correct?

Missing helper file helpers/htmlentities.rb

Using Rails 3.2.11.

The gem installs successfully, but I can't use it. I'm not using its functionality in either of the following two files, which confuses me.

Trace:

app/controllers/application_controller.rb:1:in `<top (required)>'
app/controllers/users_controller.rb:1:in `<top (required)>'

This error occurred while loading the following files:
   htmlentities

Why do ldquo and rdquo appear differently?

Can anyone explain why the left quote displays differently than the right quote? Is this normal?

coder = HTMLEntities.new
string = "&ldquo;These pretzels are making me thirsty&hellip;&rdquo;"
coder.decode(string) => "“These pretzels are making me thirsty…\342\200\235"

I'm using ruby enterprise 1.8.7.

Encoding failure on well formed utf-8 (mdash)

Hello,

I'm using htmlentities in one of my projects. Recently users have been complaining about a failure in the text encoding process. I finally have a minimal example that exposes the issue (using Ruby 1.8.7):

require "htmlentities"
dec = HTMLEntities.new
dec.encode(dec.decode("&mdash;"), :decimal)

throws:

/htmlentities-4.2.0/lib/htmlentities/encoder.rb:85:in `unpack': malformed UTF-8 character (expected 3 bytes, given 1 bytes) (ArgumentError)

git bisect points to commit e7f336b as the one that introduced the bug. Looking at the regexp, I think a + sign is missing after the new pattern.

PS: Version 4.0.0 works just fine!

Using this with Controller

I am new to Ruby. I tried implementing the gem in my controller but it is not working.

class LinksController < ApplicationController
  require 'erb'
  include ERB::Util
  require 'open-uri'
  require 'htmlentities'

  def index
    coder = HTMLEntities.new
    @test = coder.encode('hdhdhd-shgssg- shsah', :basic, :decimal)
  end
end

When I look at the source code, it is not being converted.

Expanded encoder doesn't encode colon character

Not sure if this is by design, but the expanded encoder (which includes a mapping for the colon character) doesn't convert colons to their HTML entity form. The use case is encoding title text for use in a YAML front matter block (and subsequently embedding in an HTML page).

My code:

require 'htmlentities'
title = "Foo: Bar"
coder = HTMLEntities.new(:expanded)
coder.encode(title, :hexadecimal)

Expected: "Foo: Bar"
Got: "Foo: Bar"

Also tried coder.encode(title, :decimal), coder.encode(title, :named, :decimal), and coder.encode(title). Tested with IRB to make sure the problem isn't coming from somewhere else.

NameError: uninitialized constant HTMLEntities::Encoder::Encoding

Hi,
I'm getting this error:

>> require 'htmlentities'
=> []
>> coder = HTMLEntities.new
=> #<HTMLEntities:0x10902c6f0 @flavor="xhtml1">
>> string = "<élan>"
=> "<élan>"
>> coder.encode(string) # => "&lt;élan&gt;"
NameError: uninitialized constant HTMLEntities::Encoder::Encoding
    from /Users/xx/.rbenv/versions/1.8.7-p371/lib/ruby/gems/1.8/gems/activesupport-2.3.17/lib/active_support/dependencies.rb:131:in `const_missing'
    from /Users/xx/.rbenv/versions/1.8.7-p371/lib/ruby/gems/1.8/gems/htmlentities-4.3.2/lib/htmlentities/encoder.rb:25:in `prepare'
    from /Users/xx/.rbenv/versions/1.8.7-p371/lib/ruby/gems/1.8/gems/htmlentities-4.3.2/lib/htmlentities/encoder.rb:19:in `encode'
    from /Users/xx/.rbenv/versions/1.8.7-p371/lib/ruby/gems/1.8/gems/htmlentities-4.3.2/lib/htmlentities.rb:73:in `encode'
    from (irb):4

Using Ruby 1.8.7 + Rails 2.3.17
Any help please?

Serious decoding bug with masked entities

Hi there,

consider the following input string:

&amp;#3346;

When calling decode() on this string, it will get decoded to the unicode character referenced by #3346. I think this happens because you first decode the &amp; and then decode the generated &#3346; in the string. A valid solution would be to decode &amp; last.

Greetings,
CK
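The mechanism suspected above can be illustrated in plain Ruby, independent of the gem. This is only a sketch of the general two-pass versus single-pass difference, not the library's actual decoder; the variable names are made up for the example.

```ruby
input = "&amp;#3346;"

# Two-pass decoding: the "&" produced by the first pass joins "#3346;"
# and the second pass decodes the result a second time.
two_pass = input.gsub("&amp;", "&").gsub(/&#(\d+);/) { [$1.to_i].pack("U") }

# Single-pass decoding: each entity is matched against the original text
# exactly once, so the output of one replacement is never rescanned.
single_pass = input.gsub(/&(?:(amp)|#(\d+));/) { $1 ? "&" : [$2.to_i].pack("U") }

p two_pass    # the character U+0D12
p single_pass # "&#3346;"
```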

encode(string, :named) is re-encoding valid entries and breaking the HTML

Hi,
First of all, I would like to thank you for this awesome gem. However, I found a bug while trying to sanitize a string that has both valid and invalid characters. I explain the problem below:

coder = HTMLEntities.new
string = "> Car &amp; Bike <"
new = coder.encode(string)  # BUG =>  "&gt; Car &amp;amp; Bike &lt;" 
worst_then_new = coder.encode(new) # BUG => "&amp;gt; Car &amp;amp;amp; Bike &amp;lt;" 

A workaround for this problem would be to "decode" before "encode", but this hack is too slow...
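For reference, the decode-before-encode workaround can be sketched with Ruby's standard CGI module standing in for HTMLEntities. That substitution is an assumption made so the snippet runs without the gem, and CGI only understands a handful of named entities. The helper name normalize_entities is hypothetical.

```ruby
require 'cgi'

# Decode first, then encode, so text that is already escaped is not
# escaped a second time.
def normalize_entities(text)
  CGI.escapeHTML(CGI.unescapeHTML(text))
end

once  = normalize_entities("> Car &amp; Bike <")
twice = normalize_entities(once)

p once  # "&gt; Car &amp; Bike &lt;"
p twice # same as once
```

The trade-off is that input which really means the literal text "&amp;" is treated as if it were already escaped, so this only suits data whose escaping state is unknown.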

Feature request: only encode HTML special characters

I'm not sure how to do this with your library, but I would like the ability to encode only special characters. For example, I have a block of HTML which has UTF-8 characters, but I don't want to encode the HTML tags. Is there a way to pass an option, or configure the encoder, to skip the < and > characters which make up an HTML tag?
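A minimal plain-Ruby sketch of the requested behaviour follows: it turns non-ASCII characters into decimal entities and leaves <, > and & untouched. The method name encode_non_ascii is hypothetical and this is not an HTMLEntities API.

```ruby
# Encode only non-ASCII characters as decimal entities; markup characters
# such as "<", ">" and "&" pass through unchanged.
def encode_non_ascii(html)
  html.gsub(/[^[:ascii:]]/) { |char| format("&#%d;", char.ord) }
end

puts encode_non_ascii("<p>\u00E9lan</p>")
```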

Decoding removes &pound; entity when given a SafeBuffer

Use Case

The Rails number_to_currency helper returns currency symbols as HTML entities. When exporting an HTML report to CSV, we would like to use the UTF-8 symbol instead.

Versions

  • ruby 2.0.0p247 (2013-06-27 revision 41674) [x86_64-darwin12.5.0]
  • Rails 4.0.0 (same issue with 3.x)
  • htmlentities (4.3.1)

Steps to Reproduce

New rails app:

rails new testapp
cd testapp
echo "gem 'htmlentities'" >> Gemfile
bundle
rails c

In console:

HTMLEntities.new.decode("&pound;12.34") # => "£12.34"
ActiveSupport::SafeBuffer.new("&pound;12.34") # => "&pound;12.34"
HTMLEntities.new.decode(ActiveSupport::SafeBuffer.new("&pound;12.34")) # => "12.34"

It decodes the way I would want when not given a SafeBuffer, but removes the entity entirely when given a SafeBuffer.

Work around

Instead of using a nice generic solution like HTMLEntities, I end up using a few gsubs to get the job done:

str.to_s.gsub("&pound;", "\u00a3").gsub("&#8364;", "\u20ac").gsub("&#269;", "\u010d")

(which works, but then this list needs to be maintained).

Improperly decoding apostrophe

When using an apostrophe encoded to &#39; from Rails, this is being improperly decoded to an empty string.

To reproduce this problem:
HTMLEntities gem version: 4.3.4
Ruby 2.2.2
Rails 4.2.7

$> HTMLEntities.new.decode(HTMLEntities.new.encode("'", :decimal))
 => "'" 
$> HTMLEntities.new.encode("'", :decimal)
 => "&#39;" 
$> ERB::Util.h("'")
 => "&#39;" 
$> ERB::Util.h("'") == HTMLEntities.new.encode("'", :decimal)
 => true 
$> ERB::Util.h("'") === HTMLEntities.new.encode("'", :decimal)
 => true 
$> HTMLEntities.new.decode(ERB::Util.h("'"))
 => "" 

I was able to get around this behavior by monkey-patching prepare:

class HTMLEntities
  class Decoder
    private

    def prepare(string)
-      string.to_s.encode(Encoding::UTF_8)
+      string.to_s.encode(Encoding::UTF_8).unicode_normalize
    end
  end
end

Encode Registered Trademark (®)

See the following in my Rails console:

[11] pry(#<DesignsController>)> HTMLEntities.new.encode("®")
=> "®"
[12] pry(#<DesignsController>)> HTMLEntities.new.encode("&")
=> "&amp;"

How can I encode the Registered Trademark symbol?

Cannot Decode &#44; HTML to Comma

htmlentities seems to work great for everything except &#44; which should decode to a comma

Example Code - Long_Description contains text with entities such as &#44; which should decode to a comma but does not.

require 'htmlentities'
coder = HTMLEntities.new
self.Long_Description = coder.decode(self.Long_Description)

Any ideas?

Side Note: Even here on Github if you don't enclose &#44; into code tags it decodes it to the comma character ( , )

doesn't decode &Amp; - purposeful?

I'm scraping so I can't really control the HTML entity itself. I don't know whether &Amp; is a valid html entity (as opposed to &amp;), tbh I don't really care, I just need to decode it to &.

The regex to match an entity is case insensitive, but the map (even the expanded flavor) doesn't include the capitalized version. I figure this may have been done on purpose to match the HTML specs.

I'm happy to submit a PR to handle this if anyone is interested, but short of that I'm curious what the advised path is.

My first thought (as seen in other issues) would be to define a custom mapping that includes this value. Being wary of what other entities might be out there that are mis-capitalized, another thought is to downcase the match before checking the map.
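The downcasing idea could look roughly like the sketch below. ENTITY_MAP is a tiny hypothetical table and decode_lenient_case a made-up name; neither comes from the gem. An exact-case lookup is tried first so that pairs which differ only by case, such as Eacute and eacute, keep their meaning.

```ruby
# Hypothetical miniature entity table, for illustration only.
ENTITY_MAP = {
  "amp"    => "&",
  "Eacute" => "\u00C9",
  "eacute" => "\u00E9"
}.freeze

def decode_lenient_case(text)
  text.gsub(/&([a-z][a-z0-9]*);/i) do |match|
    name = $1
    # Exact name first, then the downcased name, else leave the text as-is.
    ENTITY_MAP[name] || ENTITY_MAP[name.downcase] || match
  end
end

puts decode_lenient_case("Fish &Amp; Chips")
puts decode_lenient_case("&Eacute;t&eacute; &bogus;")
```

One caveat with this fallback is that an all-caps name like &EACUTE; would resolve to the lowercase letter, which may or may not be what the author of the page intended.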

decoding failure for &Ccedil;

I seem to be having a problem with &Ccedil;. For example: 'FRAN&Ccedil;OIS' is being decoded as 'FRANÇOIS'. However, 'Fran&ccedil;ois' is correctly handled as 'François'. I thought that it might be a case of the input string being latin1, but I'm pretty sure that's not the case and your documentation seems to imply that it won't decode things that it doesn't understand.
