html2text's Introduction

html2text

A simple Golang package to convert HTML to plain text (without non-standard dependencies).

It strips HTML tags and decodes HTML entities into the characters they represent. The <head> section of the HTML document and most other tags are stripped out, while links are converted to the value of their href attribute.

It can be used for converting HTML emails into text.

Tests are included. The package uses semantic versioning, and no breaking changes are planned.

Feel free to open a pull request if you have suggestions for improvement, but please note that the library can now be considered feature-complete and API-stable. If you need more than this basic conversion, please use one of the alternatives mentioned at the bottom.

Install

go get github.com/k3a/html2text

Usage

package main

import (
	"fmt"
	"github.com/k3a/html2text"
)

func main() {
	html := `<html><head><title>Good</title></head><body><strong>clean</strong> text</body>`

	plain := html2text.HTML2Text(html)

	fmt.Println(plain)
}

/*	Outputs:

	clean text
*/

To see all features, please look into html2text_test.go.
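As a rough illustration of the entity decoding mentioned in the introduction, the sketch below uses only the Go standard library. It is not html2text's own implementation: decodeEntities is a hypothetical helper that simply wraps html.UnescapeString, shown here to make the "entities into characters" step concrete.

```go
package main

import (
	"fmt"
	"html"
)

// decodeEntities is a hypothetical helper (not part of html2text).
// It relies on the standard library's html.UnescapeString to turn
// named and numeric entities into the characters they represent.
func decodeEntities(s string) string {
	return html.UnescapeString(s)
}

func main() {
	// Named entities (&amp;, &lt;) and a numeric entity (&#169;).
	fmt.Println(decodeEntities("Fish &amp; chips &lt;3 &#169; 2020"))
}
```

html2text may cover a different set of entities than the standard library does, so html2text_test.go remains the reference for the package's actual behavior.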

Alternatives

License

MIT

html2text's People

Contributors

devstein, k3a, kkeybbs, merlincox, st3fan, xstrom

html2text's Issues

Link element returns empty string

I'm trying to parse HTML from Wiktionary. Words appearing in <a> tags are not added to the final result. For example:

HTML:

<span class="Latn" lang="en" about="#mwt127" typeof="mw:Transclusion"><a rel="mw:WikiLink" href="/wiki/yet#English" title="yet">yet</a>, <a rel="mw:WikiLink" href="/wiki/not_yet#English" title="not yet">not yet</a></span>

Expected output: yet, not yet
Actual output: ,

Advice on Modifying html2text Behavior to Preserve Link Text

Hi 👋

Thanks for making and maintaining this package! It's almost exactly what I was looking for. The only different behavior I am looking for is to preserve the link text and append the link as a suffix to the text in the conversion.

For example, the behavior I want is

So(HTML2Text(`click <a href="javascript:void(0)">here</a>`), ShouldEqual, "click here")
So(HTML2Text(`click <a href="test"><span>here</span> or here</a>`), ShouldEqual, "click test <test>")
So(HTML2Text(`click <a href="http://bit.ly/2n4wXRs">news</a>`), ShouldEqual, "click news <http://bit.ly/2n4wXRs>")

I understand a change like this would be breaking for users and would change the interface, so I am planning on forking the repo to make this change.

Do you have advice on how to make this change? It's clear that changes would need to be made here. Any advice is appreciated!
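One possible shape for such a fork, written against the standard library only, is sketched below. It is not how html2text handles links internally: linkWithSuffix, anchorRe and tagRe are hypothetical names, the regular expression only handles double-quoted href values, and nested or malformed anchors are out of scope. The sketch keeps the visible link text and appends the href in angle brackets, skipping javascript: links.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// anchorRe matches one <a ... href="..."> ... </a> element.
	// Assumption: href values are double-quoted.
	anchorRe = regexp.MustCompile(`(?is)<a\s[^>]*?href\s*=\s*"([^"]*)"[^>]*>(.*?)</a>`)
	// tagRe matches any remaining tag inside the link text.
	tagRe = regexp.MustCompile(`<[^>]*>`)
)

// linkWithSuffix (hypothetical) rewrites every anchor as
// "text <href>", or just "text" for javascript: links.
func linkWithSuffix(input string) string {
	return anchorRe.ReplaceAllStringFunc(input, func(m string) string {
		parts := anchorRe.FindStringSubmatch(m)
		href := parts[1]
		text := tagRe.ReplaceAllString(parts[2], "")
		if strings.HasPrefix(strings.ToLower(strings.TrimSpace(href)), "javascript:") {
			return text
		}
		return text + " <" + href + ">"
	})
}

func main() {
	fmt.Println(linkWithSuffix(`click <a href="javascript:void(0)">here</a>`))
	fmt.Println(linkWithSuffix(`click <a href="http://bit.ly/2n4wXRs">news</a>`))
}
```

A regular expression keeps the sketch short, but a fork of a tokenizing converter would more likely record the href when the opening tag is seen and emit the suffix when the closing tag is reached.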

Converts URLs to lower-case, which is not equivalent

Consider this piece of code:

package main

import (
	"fmt"

	"github.com/k3a/html2text"
)

func main() {
	result := html2text.HTML2Text(`Something <a href="http://bit.ly/2n4wXRs">http://bit.ly/2n4wXRs</a>`)
	fmt.Println(result)
	fmt.Printf("Correct? %v", result == `Something http://bit.ly/2n4wXRs`)
}

Which you can run here: https://play.golang.org/p/KpgQrOSkoyZ

This outputs the following:

Something http://bit.ly/2n4wxrs
Correct? false

For some reason the URL gets lower-cased, which is definitely not equivalent: the path part is case-sensitive.
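One plausible cause (an assumption, not confirmed here) is that the tag is lower-cased as a whole so that names such as A and HREF can be matched regardless of case, which also lower-cases the attribute value. The sketch below shows a way to match the attribute name case-insensitively while returning the value exactly as written. hrefValue and hrefRe are hypothetical names, and only double-quoted values are handled.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// hrefRe matches the href attribute name case-insensitively and
// captures the double-quoted value without altering it.
var hrefRe = regexp.MustCompile(`(?i)\bhref\s*=\s*"([^"]*)"`)

// hrefValue (hypothetical) returns the href value of a tag with its
// original case, or "" if no double-quoted href is present.
func hrefValue(tag string) string {
	m := hrefRe.FindStringSubmatch(tag)
	if m == nil {
		return ""
	}
	return m[1]
}

func main() {
	tag := `<A HREF="http://bit.ly/2n4wXRs">`
	// Lower-casing the whole tag damages the case-sensitive path.
	fmt.Println(strings.ToLower(tag))
	// Case-insensitive matching on the name keeps the value intact.
	fmt.Println(hrefValue(tag))
}
```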

Return output for empty html

jaytaylor/html2text returns an empty string for the following HTML. I think an empty string is the right answer for this input.
This library returns output for it:

<!doctype html><title></title><script src="https://www.google.com/adsense/domains/caf.js" type="text/javascript"></script>

For full functionality of this site it is necessary to enable JavaScript. Here are the instructions how to enable JavaScript in your web browser.
<script type="application/javascript">window.LANDER_SYSTEM="PW"</script>
<script>!function(e){function r(r){for(var n,l,a=r[0],i=r[1],p=r[2],c=0,s=[];c<a.length;c++)l=a[c],Object.prototype.hasOwnProperty.call(o,l)&&o[l]&&s.push(o[l][0]),o[l]=0;for(n in i)Object.prototype.hasOwnProperty.call(i,n)&&(e[n]=i[n]);for(f&&f(r);s.length;)s.shift()();return u.push.apply(u,p||[]),t()}function t(){for(var e,r=0;r<u.length;r++){for(var t=u[r],n=!0,a=1;a<t.length;a++){var i=t[a];0!==o[i]&&(n=!1)}n&&(u.splice(r--,1),e=l(l.s=t[0]))}return e}var n={},o={1:0},u=[];function l(r){if(n[r])return n[r].exports;var t=n[r]={i:r,l:!1,exports:{}};return e[r].call(t.exports,t,t.exports,l),t.l=!0,t.exports}l.m=e,l.c=n,l.d=function(e,r,t){l.o(e,r)||Object.defineProperty(e,r,{enumerable:!0,get:t})},l.r=function(e){"undefined"!=typeof Symbol&&Symbol.toStringTag&&Object.defineProperty(e,Symbol.toStringTag,{value:"Module"}),Object.defineProperty(e,"__esModule",{value:!0})},l.t=function(e,r){if(1&r&&(e=l(e)),8&r)return e;if(4&r&&"object"==typeof e&&e&&e.__esModule)return e;var t=Object.create(null);if(l.r(t),Object.defineProperty(t,"default",{enumerable:!0,value:e}),2&r&&"string"!=typeof e)for(var n in e)l.d(t,n,function(r){return e[r]}.bind(null,n));return t},l.n=function(e){var r=e&&e.__esModule?function(){return e.default}:function(){return e};return l.d(r,"a",r),r},l.o=function(e,r){return Object.prototype.hasOwnProperty.call(e,r)},l.p="https://d1hi41nc56pmug.cloudfront.net/";var a=this["webpackJsonpparking-lander"]=this["webpackJsonpparking-lander"]||[],i=a.push.bind(a);a.push=r,a=a.slice();for(var p=0;p<a.length;p++)r(a[p]);var f=i;t()}([])</script><script src="https://d1hi41nc56pmug.cloudfront.net/static/js/2.9c709ad7.chunk.js"></script><script src="https://d1hi41nc56pmug.cloudfront.net/static/js/main.f7615e4a.chunk.js"></script></html

Replace new line in tag text with space in result

Currently, if a tag contains text with newlines, those newlines are removed and the lines are concatenated. To me it would be clearer to put a space between those parts of the text. Or, if possible, I would suggest keeping the newlines in the resulting text.

Example

Input:

<body>
a \n b \n c \n
</body>

Expected output:

a b c

Current output:

abc
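The requested behavior can be sketched with the standard library alone. collapseWhitespace below is a hypothetical helper, not part of html2text: it treats every run of spaces, tabs and newlines as a single separator and joins the remaining words with one space.

```go
package main

import (
	"fmt"
	"strings"
)

// collapseWhitespace (hypothetical) replaces every run of
// whitespace, including newlines, with a single space and trims
// leading and trailing whitespace.
func collapseWhitespace(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

func main() {
	fmt.Println(collapseWhitespace("a \n b \n c \n"))
}
```

Keeping the newlines instead, as the issue also suggests, would need a different rule, for example collapsing only spaces and tabs within each line.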

How does this package differ from jaytaylor/html2text?

jaytaylor/html2text package is mentioned in the README as an alternative to this package.

jaytaylor/html2text appears to be more popular than this one, but I guess there were good reasons why you created this package instead of using jaytaylor/html2text.

Could you tell me, and ideally add to the README, the advantages and trade-offs of using this package rather than jaytaylor/html2text?

Many thanks.
