mtibben / html2text

PHP library to convert HTML to formatted plain text

html2text's Introduction

Html2Text

A PHP library for converting HTML to formatted plain text.

Installing

composer require html2text/html2text

Basic Usage

$html = new \Html2Text\Html2Text('Hello, &quot;<b>world</b>&quot;');

echo $html->getText();  // Hello, "WORLD"

History

This library started life on the blog of Jon Abernathy http://www.chuggnutt.com/html2text

A number of projects picked up the library and started using it - among those was RoundCube mail. They made a number of updates to it over time to suit their webmail client.

Now it has been extracted as a standalone library. Hopefully it can be of use to others.

html2text's Issues

'allowed_tags' property

Hi,

I am coming from a very old version of this class, which had the ability to set an allowed_tags property via a public set_allowed_tags method. Is there any equivalent in the current version?

Just to let you know I'm using your class...

The original html2text package your class was based on by Chuggnutt has been bundled with PHPMailer for years, and I contributed lots of fixes for it a long time ago, however, I wasn't actually using it in PHPMailer. When I did just recently, I found that the old version broke my build in PHP 5.5 because it uses the deprecated /e modifier - then I found your fork and used it, and now it's passing again, so thanks!

There was one small issue I fixed: $start and $taglen are not defined before they are used in _convert_blockquotes(). In my copy I also removed the namespace declaration since it needs to work in PHP < 5.3 and lower-cased the name to retain backward compatibility.

Full plain text

Hi, this class is great, thanks. Currently it converts text like this:

Hi,

This is a text.

Cheers!

But maybe it would be interesting if it can do this:

Hi, This is a text. Cheers!

Regards.

How do I output completely plain text?

From what I understand this library may convert several elements to a markdown-style syntax.
I would like to only output the textual content of the HTML string.
Any pointers? I saw there are 'options' for the class constructor but I found no documentation about them.

Allow to choose input string encoding

Encoding is forced to UTF-8, but it would be useful to be able to convert HTML strings in other encodings.

I will submit a pull request for this issue.

Support multi-level <ul> lists

It appears if you do this:

<ul>
  <li>Coffee</li>
  <li>Tea
    <ul>
      <li>Black tea</li>
      <li>Green tea</li>
    </ul>
  </li>
  <li>Milk</li>
</ul>

Html2Text will flatten the list.
Just a request to support multiple level lists by adding additional tabs in front of each item per level.
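The requested behaviour can be sketched outside the library. The following is a minimal illustration, not Html2Text's implementation: `nestedListToText()` is a hypothetical helper that tokenises the markup with a regular expression and adds one tab per nesting level. It assumes well-formed `<ul>`/`<li>` markup and that each item's text comes before any nested list.

```php
<?php
// Hypothetical helper (not part of Html2Text): renders nested <ul> lists
// as text, indenting each level with one extra tab.
function nestedListToText(string $html): string
{
    // Split on <ul>, </ul>, <li>, </li> while keeping the tags as tokens.
    $tokens = preg_split(
        '/(<\/?(?:ul|li)\b[^>]*>)/i',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    $depth = 0;
    $lines = [];
    foreach ($tokens as $token) {
        if (preg_match('/^<ul\b/i', $token)) {
            $depth++;            // entering a list level
            continue;
        }
        if (preg_match('/^<\/ul/i', $token)) {
            $depth--;            // leaving a list level
            continue;
        }
        if (preg_match('/^<\/?li\b/i', $token)) {
            continue;            // item boundaries carry no text
        }
        // Anything else is item text: collapse whitespace and emit a bullet.
        $text = trim(preg_replace('/\s+/', ' ', strip_tags($token)));
        if ($text !== '') {
            $lines[] = str_repeat("\t", max(0, $depth - 1)) . '* ' . $text;
        }
    }

    return implode("\n", $lines);
}
```

With the list from this issue, "Black tea" and "Green tea" come out one tab deeper than "Coffee", "Tea" and "Milk".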

This vs https://github.com/soundasleep/html2text

What are the differences between this repo and https://github.com/soundasleep/html2text?

tag regex not restrictive enough

See #58

Allowed tags

How can I allow certain tags? I want to show the IMG tag in the filtered content.

Option convert clean text

Please add an option for a clean conversion, without decoration.

For example, <hr /> currently becomes --------------------- when it should simply become \n,
and other tags should not be wrapped in underscores, as in _b_.

thanks

Use in codeigniter ?

Hi,
Can this be used with CodeIgniter or not?
If so, what needs to be installed?

Doctrine Annotation Exception with "@type" on using html2text class

Hi,

I use html2text in a TYPO3 9.5 instance with PHP 7.2.21. When using the Html2Text class I get this exception in the frontend:

(1/1) Doctrine\Common\Annotations\AnnotationException
[Semantical Error] The annotation "@type" in property Html2Text\Html2Text::$html was never imported. Did you maybe forget to add a "use" statement for this annotation?

TYPO3 is installed with composer. I'm using the "doctrine/annotations" v1.8.0 package.

Changing all "@type" to "@var" in "html2text/html2text/src/Html2Text.php" annotations solves the problem and the frontend shows me the plain text result.

It would be great if this could be fixed.

links don't work without quotes

A link whose href value is not quoted (link text "Example") just results in the text Example.

Putting quotes around the href value produces both the text and the very important link.
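One way to accept both forms is a pattern with three alternatives for the attribute value: double-quoted, single-quoted, and unquoted. This is only a sketch; `extractLinks()` is a hypothetical helper, the pattern is not the one Html2Text uses, and the example.com URL in the usage note is a placeholder.

```php
<?php
// Hypothetical helper (not part of Html2Text): finds links whose href
// is double-quoted, single-quoted, or not quoted at all.
function extractLinks(string $html): array
{
    $pattern = '/<a\s[^>]*?href\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s>]+))[^>]*>(.*?)<\/a>/is';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $links = [];
    foreach ($matches as $m) {
        // Exactly one of the three value groups is non-empty.
        $url = $m[1] !== '' ? $m[1] : ($m[2] !== '' ? $m[2] : $m[3]);
        $links[] = ['url' => $url, 'text' => strip_tags($m[4])];
    }

    return $links;
}
```

For instance, `extractLinks('<a href=http://example.com>Example</a>')` and the quoted equivalent both yield the URL together with the text "Example".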

Multibyte strings do not play nicely with blockquotes

A string such as:

“Hello”

<blockquote>goodbye</blockquote>

Currently gets converted to:

“Hello”

This is because mb_substr in convertBlockquotes is truncating in the wrong place, which results in potentially incomplete blockquote tags, which strip_tags will remove.

I've created pull request #56 to address this.
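Truncation of this kind typically appears when a byte offset is passed to a function that expects a character offset. The sketch below only illustrates that class of bug; it assumes the mbstring extension is available, `cutAtBlockquote()` is a hypothetical helper, and it is not a copy of convertBlockquotes.

```php
<?php
// Hypothetical helper: shows what happens when a byte offset from strpos()
// is passed to mb_substr(), which counts characters.
function cutAtBlockquote(string $html): array
{
    $bytePos = strpos($html, '<blockquote');                 // offset in bytes
    $charPos = mb_strpos($html, '<blockquote', 0, 'UTF-8');  // offset in characters
    if ($bytePos === false || $charPos === false) {
        return ['mixed' => $html, 'consistent' => $html];
    }

    return [
        // Wrong: byte offset used as a character offset, so the cut lands too far right.
        'mixed'      => mb_substr($html, $bytePos, null, 'UTF-8'),
        // Right: both the search and the cut use character offsets.
        'consistent' => mb_substr($html, $charPos, null, 'UTF-8'),
    ];
}
```

With the curly quotes from this issue (three bytes each in UTF-8), the 'mixed' result starts in the middle of the opening tag, leaving an incomplete tag behind.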

Replacement for newlines and tabs

Hi,

Is " " correct as the replacement? Newlines and tabs don't do much in HTML.

The html I have:

 <p>
    This is some text<br/>
    with a break in the middle
 </p>

Results in:

This is some text
 with a break in the middle

The " " before "with" is not correct there, but I can't tell whether changing it would have other side effects.

[PHP7] preg_replace does not work

After moving everything to PHP 7 I fixed many things in my scripts (for example, replacing the old mysql extension with mysqli), but this converter is completely broken because of the preg_replace function. I use it to turn news posts from my official site into a readable form for posting in a chat room, and now I receive an empty string. How about upgrading to preg_replace_callback?
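The /e modifier was removed from preg_replace in PHP 7, so replacements that used to evaluate code need a callback instead. A minimal sketch of the migration follows; the `<h1>` rule and the `upperCaseHeadings()` name are illustrative assumptions, not the library's actual rules.

```php
<?php
// Old style, no longer supported in PHP 7:
//   preg_replace('/<h1[^>]*>(.*?)<\/h1>/ie', 'strtoupper("\\1")', $html);

// Hypothetical helper showing the callback-based equivalent.
function upperCaseHeadings(string $html): string
{
    $result = preg_replace_callback(
        '/<h1[^>]*>(.*?)<\/h1>/is',
        static function (array $m): string {
            return strtoupper($m[1]);   // $m[1] is the heading content
        },
        $html
    );

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $html;
}
```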

base URLs in version 2.0 are ignored

Base URLs should be honored if set in the HTML code.

Single quotes are removed

The single quote ' is removed but, imho, should not be.

I'll provide a PR to fix it, with a test

URLs are uppercased inside of b tags.

URLs get uppercased when they are encapsulated by a bold tag:

<b><a href="https://www.tave.com/Test/LowerAndUpperCase?Signature=518d4bf6872e0deaf0eeb23b19c724514ec7c69a">Test</a></b>

becomes:

TEST
[HTTPS://WWW.TAVE.COM/TEST/LOWERANDUPPERCASE?SIGNATURE=518D4BF6872E0DEAF0EEB23B19C724514EC7C69A]

In the call to _preg_callback, for the b and strong tags, we call _toupper of $matches[3], but it doesn't look like _toupper handles when the link has already been parsed. See test case below:

$contents = '

<b><a href="https://www.tave.com/Test/LowerAndUpperCase?Signature=518d4bf6872e0deaf0eeb23b19c724514ec7c69a">Test</a></b>

<a href="https://www.tave.com/Test/LowerAndUpperCase?Signature=518d4bf6872e0deaf0eeb23b19c724514ec7c69a"><b>Test</b></a>

';

try {
  $textContents = new \Html2Text\Html2Text($contents);
  $textContents = $textContents->get_text();
}
catch (Exception $e) {
    return $e;
}

echo $textContents;

/* Result */
/*

TEST
[HTTPS://WWW.TAVE.COM/TEST/LOWERANDUPPERCASE?SIGNATURE=518D4BF6872E0DEAF0EEB23B19C724514EC7C69A]
TEST
[https://www.tave.com/Test/LowerAndUpperCase?Signature=518d4bf6872e0deaf0eeb23b19c724514ec7c69a]';

*/

$matches[3] will contain:

Test [https://www.tave.com/Test/LowerAndUpperCase?Signature=518d4bf6872e0deaf0eeb23b19c724514ec7c69a]

so _toupper doesn't have any tags to split on and blindly uppercases the whole string.

I'll look into doing a pull request for this, but I am in a bit of a time crunch right now, so I figured I would report it and see whomever got to it first.

Cheers
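One possible approach is to split the already-converted string on bracketed URLs and uppercase only the pieces in between. This is a sketch under that assumption, not the fix that was eventually merged; `upperExceptBracketedUrls()` is a hypothetical name and the shortened URL in the usage note is illustrative.

```php
<?php
// Hypothetical helper: uppercases text but leaves "[scheme://...]" segments untouched.
function upperExceptBracketedUrls(string $text): string
{
    // With PREG_SPLIT_DELIM_CAPTURE and a single group, odd indexes hold the
    // bracketed URLs and even indexes hold the surrounding text.
    $parts = preg_split(
        '/(\[[a-z][a-z0-9+.\-]*:\/\/[^\]\s]*\])/i',
        $text,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            $parts[$i] = strtoupper($part);
        }
    }

    return implode('', $parts);
}
```

For example, `upperExceptBracketedUrls('Test [https://www.tave.com/Test/LowerAndUpperCase]')` uppercases "Test" and keeps the URL's case.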

New release

Hi,

Any chance of a new release for the library? The last release was 3.0.0 in October, and since then there have been a range of bug fixes and new features, including:
#44 PHP7 support in unit tests
#37/#45 blockquote parsing fix
#7 Treat all paragraph content equally
#47 bbcode support

We (moodle/moodle) can pull from master, but generally we prefer to pick known releases.

Thanks in advance,

Andrew

Use associative array instead of double-arrays

Two arrays that must maintain an explicit 1:1 matching of elements is hard to maintain and visualize. This section: https://github.com/mtibben/html2text/blob/master/lib/Html2Text/Html2Text.php#L61-L115
could be much better represented as an associative key:value array.
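A sketch of what the suggestion could look like: each pattern sits next to its replacement, and the two lists passed to preg_replace are derived from one array. The three rules and the `applyRules()` name are illustrative assumptions, not the library's actual search and replace lists.

```php
<?php
// Hypothetical helper: pattern => replacement pairs kept in one associative array.
function applyRules(string $html): string
{
    $rules = [
        '/<br\b[^>]*>/i' => "\n",
        '/<hr\b[^>]*>/i' => "\n-------------------------\n",
        '/&nbsp;/i'      => ' ',
    ];

    // array_keys() and array_values() preserve order, so the pairs stay matched.
    $result = preg_replace(array_keys($rules), array_values($rules), $html);

    return $result ?? $html;
}
```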

Use numbers instead of bullets for ordered lists

Currently ordered lists are converted into unordered lists in text; using an asterisk instead of numbers. Would be great to use numbers for these cases.

Example

(new \Html2Text\Html2Text('<ol><li>Item 1</li><li>Item 2</li></ol>'))->getText()

Result

* Item 1
* Item 2

Expected

1. Item 1
2. Item 2
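The expected output could be produced along these lines. This is a minimal sketch with a hypothetical `orderedListToText()` helper; it handles only flat `<ol>` lists and is not how Html2Text processes lists.

```php
<?php
// Hypothetical helper: converts each <ol> into numbered text lines.
function orderedListToText(string $html): string
{
    $result = preg_replace_callback(
        '/<ol\b[^>]*>(.*?)<\/ol>/is',
        static function (array $list): string {
            $n = 0;   // the counter restarts for every <ol>
            return (string) preg_replace_callback(
                '/<li\b[^>]*>(.*?)<\/li>/is',
                static function (array $item) use (&$n): string {
                    $n++;
                    return $n . '. ' . trim(strip_tags($item[1])) . "\n";
                },
                $list[1]
            );
        },
        $html
    );

    return $result ?? $html;
}
```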

Bug?

Hi!

It seems I have found a bug, I'm not sure about the reason, I tried replacing <blockquote> and </blockquote> (using, for example, <div> and </div>) and that fixed this particular case but after doing that I realized it's about the string length, I think the problem could be a "wrong" str_pos to replace something (???). I don't really know... Thoughts?

TEST CASE (I'm working with Tumblr API, this is just some text from a random post):

$test_str = "Highlights from today’s Newlyhired Game:<blockquote>Sean: What came first, Blake’s first Chief Architect position or Blake’s first girlfriend? </blockquote> <blockquote> Sean: Devin, Bryan spent almost five years of his life slaving away for this vampire squid wrapped around the face of humanity… Devin: Goldman Sachs? Sean: Correct! </blockquote> <blockquote> Sean: What was the name of the girl Zhu took to prom three months ago? John: What? Derek (from the audience): Destiny! Zhu: Her name is Jolene. She’s nice. I like her.</blockquote>I think the audience is winning.  - Derek";

$html2text = new \Html2Text\Html2Text($test_str);
echo $html2text->getText();

Thanks for your help!

Cheers,
Sebatian C.

&laquo and &raquo

These entities are not converted to text (« and ») although they should be. They are often used in French as quotation marks.
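PHP's built-in html_entity_decode() already knows these entities when they are written with a trailing semicolon. A small sketch with a hypothetical wrapper name:

```php
<?php
// Hypothetical wrapper: decodes named and numeric entities to UTF-8 text.
function decodeEntities(string $text): string
{
    return html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}
```

For example, `decodeEntities('&laquo;Bonjour&raquo;')` returns the word wrapped in the two guillemet characters.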

HeaderTest.php does not comply with psr-4 autoloading standard

Because the class name and file name differ, we see the following warning when running composer dump-autoload:

vagrant@homestead:~/code/mrm$ composer dump
Generating optimized autoload files
Deprecation Notice: Class Html2Text\StrToUpperTest located in ./vendor/html2text/html2text/test/HeaderTest.php does not comply with psr-4 autoloading standard. It will not autoload anymore in Composer v2.0. in phar:///usr/local/bin/composer/src/Composer/Autoload/ClassMapGenerator.php:201

Add License

You note in your README that a number of projects have found this useful and state "Now it has been extracted as a standalone library. Hopefully it can be of use to others.".

We're currently testing this out in a small, commercial software product we develop as well, and it's always nice to be sure that the libraries we are using support the right type of license so we don't cross any lines we shouldn't be crossing.

Would you be comfortable adding an open source license, like the MIT license or something (https://choosealicense.com), so it's clear how you allow others to use the codebase?

Links format 'inline' and 'nextline'

If the HTML contains, among other things, links whose text is the URL itself, such as
<a href="http://example.com/en/content/5/Some-Site.html">http://example.com/en/content/5/Some-Site.html</a>
these links are output as
http://example.com/en/content/5/Some-Site.html [http://example.com/en/content/5/Some-Site.html]
which is completely superfluous.

It was like this in Jon Abernathy's version, and it is the same here.

There is of course the 'none' flag, but with that flag all links are output as just $display (buildlinkList(), around line 398).

I fixed it by changing
return $display . "\n[" . $url . ']'; -> return $display == $url ? $display : $display . "\n[" . $url . ']';
and
return $display . '[' . $url . ']'; -> return $display == $url ? $display : $display . '[' . $url . ']';
but I have to override the entire buildlinkList() in a child class, which is not optimal.

Could you perhaps add flags like 'inline_auto' and 'nextline_auto' that take such situations into account, or handle them in some other way?
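The proposed behaviour can be sketched as a standalone function. `formatLinkAuto()` is a hypothetical helper and the exact spacing of the bracketed URL is an assumption; only the mode names 'inline' and 'nextline' come from this issue.

```php
<?php
// Hypothetical helper: omits the bracketed URL when the link text already is the URL.
function formatLinkAuto(string $display, string $url, string $mode = 'inline'): string
{
    if (trim($display) === trim($url)) {
        return $display;                       // nothing to add
    }
    if ($mode === 'nextline') {
        return $display . "\n[" . $url . ']';  // URL on its own line
    }

    return $display . ' [' . $url . ']';       // URL on the same line
}
```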

The text layout changes

Hi.

I ran the following simple HTML buffer through the library:

<div dir="ltr">I received an e-mail from one of your colleagues a short while back regarding an invoice i received</div>

and to my surprise, I got back:

I received an e-mail from one of your colleagues a short while back
regarding an invoice i received

Why are you adding the extra newline between "back" and "regarding"??
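The output above is consistent with word wrapping at about 70 columns: the first line is 67 characters long and the next word would not fit. Assuming that is the cause, the effect can be reproduced with PHP's wordwrap(); `wrapText()` is a hypothetical helper, and treating a width of 0 as "no wrapping" is this sketch's own convention.

```php
<?php
// Hypothetical helper: wraps text at $width columns, or leaves it alone if $width <= 0.
function wrapText(string $text, int $width = 70): string
{
    if ($width <= 0) {
        return $text;
    }

    // false = never split a word that is longer than the width.
    return wordwrap($text, $width, "\n", false);
}
```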

Inconsistent output with tag within <pre> tag

Consider this HTML:

<pre>
    <span>
void FillMeUp(char* in_string) {<br />  int i = 0;<br />  while (in_string[i] != \'\0\') {<br />    in_string[i] = \'X\';<br />    i++;<br />  }<br />}
    </span>
</pre>

In version 3 it was rendered like this:

void FillMeUp(char* in_string) {
  int i = 0;
  while (in_string[i] != \'\') {
    in_string[i] = \'X\';
    i++;
  }
}

But now in version 4 we get this:

void FillMeUp(char* in_string) {
int i = 0;
while (in_string[i] != \'\') {
in_string[i] = \'X\';
i++;
}
}

As best I can tell this has something to do with the changes to the callbackSearch array - specifically the
addition.

Job › Overview spews notices for day list.

Wrong repo!

Does not work well for me

This does not work well for me; I use the strip_tags() function instead.

Why not display the mailto links?

Why are mailto links not displayed?

getText strips non-HTML usage of gt/lt

Is there a reason that non html uses of less than / greater than get stripped?

        $text = 'over 95% and very few financial penalties (<2%) lorem ipsum (KPI > 95%), major changes applied';
        $helper = new Html2Text($text);
        $this->assertContains('lorem ipsum', $helper->getText());
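A possible workaround is to escape every "<" that cannot start a tag before stripping tags. The sketch below uses a hypothetical helper and is not the library's behaviour; note that the final decode step also decodes any entities that were already present in the input.

```php
<?php
// Hypothetical helper: strips tags but keeps "<" used as a comparison sign.
function stripTagsKeepingComparisons(string $text): string
{
    // A "<" not followed by a letter, "/", "!" or "?" cannot open a tag,
    // so turn it into an entity that strip_tags() will ignore.
    $escaped = preg_replace('/<(?![a-zA-Z\/!?])/', '&lt;', $text) ?? $text;

    return html_entity_decode(strip_tags($escaped), ENT_QUOTES | ENT_HTML5, 'UTF-8');
}
```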

'&nbsp;' convert

$htmlToText = new \Html2Text\Html2Text('&nbsp;');
var_dump(trim($htmlToText->getText())); //string(2) " "

As I understand it, the result should be string(0) "".

best regards, Eduard
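The two bytes are the UTF-8 encoding of the non-breaking space (U+00A0), which trim() does not treat as whitespace by default. A caller who wants it removed can trim it explicitly; `trimWithNbsp()` is a hypothetical helper, not something the library provides.

```php
<?php
// Hypothetical helper: trims ASCII whitespace and non-breaking spaces from both ends.
function trimWithNbsp(string $text): string
{
    // The /u modifier makes \x{00A0} match the UTF-8 encoded code point.
    return preg_replace('/^[\s\x{00A0}]+|[\s\x{00A0}]+$/u', '', $text) ?? $text;
}
```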

Use in Drupal?

The swiftmailer contrib module has been using this for a while now, and I opened an issue to remove our own custom html 2 text conversion in favor of this: https://www.drupal.org/node/2830384

Could be a bit tricky as our implementation apparently has different opinions on everything.

Even if we end up not doing that, I imagine that the issue might be useful for you to follow. We do have a decent amount of tests, quite possible that we not only see a difference of opinion in the failing tests but actual bugs in this library?

This vs soundasleep/html2text?

https://github.com/soundasleep/html2text

Cannot find comprehensive list of constructor options

I cannot find a comprehensive list of constructor options.
Is there any place where I can look?

Many thanks

MIT or LGPL license

Any chance of relicensing to MIT or LGPL? I have an existing project licensed under MIT and I would like to use the library without having to use the GPL for my project.

<p> and <br> are treated equally

@voku commented on the code change from #7 that there should be a difference between how <p> and <br> are displayed.

At the moment, the following text will be rendered:

<p>Some content</p><p>Here<br>And there</p>

As:


Some content

Here
And there

\nSome content\n\nHere\nAnd there\n

@voku is suggesting changing <p> tags to render as "\n\n" . $content . "\n\n"
The above example then becomes:



Some content



Here
And there

\n\nSome content\n\n\n\nHere\nAnd there\n\n

Which, after normalisation of the newlines becomes:



Some content

Here
And there

\n\nSome content\n\nHere\nAnd there\n\n

The net result is the same in many situations, but will differ where the next element is not a paragraph (e.g. an H3, or a table).

Alt for img is added to link

I have html like this:

<header>
    <img src="/img/background.jpg" alt="Computer Keyboard - Głównie JavaScript"/>
    <h1><a href="/">Głównie JavaScript</a></h1>

which is converted to:

[http://jcubic.plComputer Keyboard - Głównie JavaScript]

PHP8.2 trim(): Passing null to parameter #1 ($string) of type string is deprecated at line 354 in /var/www/default/Private/Vendor/html2text/html2text/src/Html2Text.php

Fatal error with PHP 8.2

trim(): Passing null to parameter #1 ($string) of type string is deprecated at line 354 in /var/www/default/Private/Vendor/html2text/html2text/src/Html2Text.php
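Since PHP 8.1, passing null to a non-nullable parameter of a built-in function such as trim() raises a deprecation notice. Code that may receive null can coalesce it to an empty string first. A minimal sketch with a hypothetical helper name, not the library's actual fix:

```php
<?php
// Hypothetical helper: trims a value that may be null without triggering the deprecation.
function safeTrim(?string $value): string
{
    return trim($value ?? '');
}
```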

Mailto encoded with HTML entities gets a slash at the beginning

On a GitHub profile there are URLs like this:

&#109;&#97;&#105;&#108;&#116;&#111;&#58;%6a%63%75%62%69%63@%6a%63%75%62%69%63.%70%6c

which is mailto:email, and html2text displays that URL as (if decoded):

[/mailto:%6a%63%75%62%69%63@%6a%63%75%62%69%63.%70%6c]

It adds a superfluous slash at the beginning.
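A plausible explanation, stated here as an assumption, is that the scheme check runs before the entities are decoded, so the href is treated as a relative path. Decoding first avoids that. `resolveHref()` and its base-URL handling are hypothetical and not taken from the library.

```php
<?php
// Hypothetical helper: decodes entities before deciding whether an href is absolute.
function resolveHref(string $href, string $baseUrl = ''): string
{
    $decoded = html_entity_decode($href, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Anything starting with "scheme:" (mailto:, http:, tel:, ...) is left untouched.
    if (preg_match('/^[a-z][a-z0-9+.\-]*:/i', $decoded)) {
        return $decoded;
    }

    // Otherwise treat it as a path relative to the base URL.
    return rtrim($baseUrl, '/') . '/' . ltrim($decoded, '/');
}
```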

License change

Please consider changing GPL licence to LGPL (or some other - http://opensource.org/licenses/category), which is more suitable for libraries. GPL prevents using the code in non-GPL projects (and html2text claims "Hopefully it can be of use to others.").

Autoloader pollution when optimizing

The composer file is configured to always autoload all tests, even when installed as a dependency of projects (where these tests aren't used).

This results in the following classmap when optimizing the autoloader:

return array(
    'Html2Text\\BasicTest' => $baseDir . '/test/BasicTest.php',
    'Html2Text\\BlockquoteTest' => $baseDir . '/test/BlockquoteTest.php',
    'Html2Text\\ConstructorTest' => $baseDir . '/test/ConstructorTest.php',
    'Html2Text\\DefinitionListTest' => $baseDir . '/test/DefinitionListTest.php',
    'Html2Text\\DelTest' => $baseDir . '/test/DelTest.php',
    'Html2Text\\Html2Text' => $baseDir . '/src/Html2Text.php',
    'Html2Text\\HtmlCharsTest' => $baseDir . '/test/HtmlCharsTest.php',
    'Html2Text\\ImageTest' => $baseDir . '/test/ImageTest.php',
    'Html2Text\\InsTest' => $baseDir . '/test/InsTest.php',
    'Html2Text\\LinkTest' => $baseDir . '/test/LinkTest.php',
    'Html2Text\\ListTest' => $baseDir . '/test/ListTest.php',
    'Html2Text\\PreTest' => $baseDir . '/test/PreTest.php',
    'Html2Text\\PrintTest' => $baseDir . '/test/PrintTest.php',
    'Html2Text\\SearchReplaceTest' => $baseDir . '/test/SearchReplaceTest.php',
    'Html2Text\\SpanTest' => $baseDir . '/test/SpanTest.php',
    'Html2Text\\StrToUpperTest' => $baseDir . '/test/StrToUpperTest.php',
    'Html2Text\\TableTest' => $baseDir . '/test/TableTest.php',
);

When we split the tests into autoload-dev and the actual code file in autoload, we cut out 16 of the 17 classes:

return array(
    'Html2Text\\Html2Text' => $baseDir . '/src/Html2Text.php',
);

The difference in code changes is small:

     "autoload": {
-        "psr-4": { "Html2Text\\": ["src/", "test/"] }
+        "psr-4": { "Html2Text\\": "src/" }
+    },
+    "autoload-dev": {
+        "psr-4": { "Html2Text\\": "test/" }
     },

func_get_args(), no longer report the original value as passed to a parameter

html2text/html2text/src/Html2Text.php

FOUND 0 ERRORS AND 1 WARNING AFFECTING 1 LINE

241 | WARNING | Since PHP 7.0, functions inspecting arguments, like func_get_args(), no longer
| | report the original value as passed to a parameter, but will instead provide the
| | current value. The parameter "$options" was used, and possibly changed (by
| | reference), on line 240.
| | (PHPCompatibility.FunctionUse.ArgumentFunctionsReportCurrentValue.NeedsInspection)

<br> within <strong> prevents <strong> from being converted

We have HTML (created by WYSIWYG editors) that contains <strong> tags with <br> tags inside. Because the <br> tags are converted to new lines before <strong> is converted to uppercase, and the regexp doesn't match new lines, the <strong> content is never converted to uppercase.

Example:
This would not be converted.But this would, though

Because not all <strong> tags contain a <br>, this is confusing for our users. The same could of course apply to other tags (<a>, ...).

There could be 3 solutions:

Adding the s pattern modifier to the regexp (http://php.net/manual/en/reference.pcre.pattern.modifiers.php). This would cause all dots to match line breaks.

'/<(strong)( [^>]*)?>(.*?)<\/strong>/si',                 // <strong>

Using character classes. Use [\s\S] instead of . (dot). This matches all characters that are whitespace and all characters that are not, in other words any character, including line breaks. This still leaves the . (dot) available for its actual purpose. Not necessary in this case, but giving the option anyway :)

'/<(strong)( [^>]*)?>([\s\S]*?)<\/strong>/i',                 // <strong>

Moving the regexp for <br> to the end of the array, so <br> would be converted last.
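The first two options can be checked with a <strong> element whose content already contains a newline. The sketch below applies the s modifier from the first option; `upperStrong()` is a hypothetical helper, not the library's code.

```php
<?php
// Hypothetical helper: uppercases <strong> content even when it spans several lines.
function upperStrong(string $html): string
{
    $result = preg_replace_callback(
        '/<(strong)( [^>]*)?>(.*?)<\/strong>/si',   // "s" lets the dot match newlines
        static function (array $m): string {
            return strtoupper($m[3]);
        },
        $html
    );

    return $result ?? $html;
}
```

Without the s modifier the same pattern does not match "<strong>a\nb</strong>" at all, whereas the [\s\S] variant matches it with or without the modifier.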