html2text's Introduction

Html2Text

A PHP library for converting HTML to formatted plain text.

Installing

composer require html2text/html2text

Basic Usage

$html = new \Html2Text\Html2Text('Hello, &quot;<b>world</b>&quot;');

echo $html->getText();  // Hello, "WORLD"

History

This library started life on the blog of Jon Abernathy http://www.chuggnutt.com/html2text

A number of projects picked up the library and started using it - among those was RoundCube mail. They made a number of updates to it over time to suit their webmail client.

Now it has been extracted as a standalone library. Hopefully it can be of use to others.

html2text's People

Contributors

ajouve, andrewnicols, askumbhani66, cameorn1730, dsas, dvdoug, ianhk, jcubic, jimjag, kasperg, maratth, mario-kinesissurvey-com, mtibben, nyholm, on2, orlitzky, pascalbaljet, pstast, samuelfa, scaytrase, sgrodzicki, sylry, synchro, vargaandrea, wolfwolker

html2text's Issues

'allowed_tags' property

Hi,

I am coming from a very old version of this class, which had the ability to set an allowed_tags property via a public set_allowed_tags method. Is there any equivalent in the current version?

Just to let you know I'm using your class...

The original html2text package your class was based on by Chuggnutt has been bundled with PHPMailer for years, and I contributed lots of fixes for it a long time ago, however, I wasn't actually using it in PHPMailer. When I did just recently, I found that the old version broke my build in PHP 5.5 because it uses the deprecated /e modifier - then I found your fork and used it, and now it's passing again, so thanks!

There was one small issue I fixed: $start and $taglen are not defined before they are used in _convert_blockquotes(). In my copy I also removed the namespace declaration since it needs to work in PHP < 5.3 and lower-cased the name to retain backward compatibility.

Full plain text

Hi, this class is great, thanks. Currently it converts text like this:

Hi,

This is a text.

Cheers!

But maybe it would be interesting if it could also do this:

Hi, This is a text. Cheers!

Regards.

How do I output completely plain text?

From what I understand this library may convert several elements to a markdown-style syntax.
I would like to only output the textual content of the HTML string.
Any pointers? I saw there are 'options' for the class constructor, but I found no documentation about them.

Allow to choose input string encoding

Encoding is forced to UTF-8, but it would be useful to be able to convert HTML strings in other encodings.

I will submit a pull request for this issue.

Support multi-level <ul> lists

It appears if you do this:

<ul>
  <li>Coffee</li>
  <li>Tea
    <ul>
      <li>Black tea</li>
      <li>Green tea</li>
    </ul>
  </li>
  <li>Milk</li>
</ul>

Html2Text will flatten the list.
Just a request to support multiple level lists by adding additional tabs in front of each item per level.
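
For illustration, here is a minimal sketch of the requested output. It uses PHP's DOM extension, which is a swapped-in technique rather than the regular expressions the library relies on. The helper name renderList is hypothetical, and it only handles <ul> and <li>, adding one tab per nesting level.

```php
<?php
// Hypothetical helper, not part of Html2Text: renders a <ul> node as text,
// indenting each nested level with one additional tab.
function renderList(DOMNode $list, int $depth = 0): string
{
    $out = '';
    foreach ($list->childNodes as $li) {
        if ($li->nodeName !== 'li') {
            continue; // skip whitespace text nodes between items
        }
        $label = '';
        $nested = '';
        foreach ($li->childNodes as $child) {
            if ($child->nodeName === 'ul') {
                $nested .= renderList($child, $depth + 1); // recurse one level deeper
            } else {
                $label .= $child->textContent;
            }
        }
        $out .= str_repeat("\t", $depth) . '* ' . trim($label) . "\n" . $nested;
    }
    return $out;
}

$doc = new DOMDocument();
$doc->loadHTML('<ul><li>Coffee</li><li>Tea<ul><li>Black tea</li><li>Green tea</li></ul></li><li>Milk</li></ul>');
$text = renderList($doc->getElementsByTagName('ul')->item(0));
echo $text;
```

The nested items "Black tea" and "Green tea" come out with one leading tab, while the top-level items keep none.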

Allowed tags

How can I allow certain tags? I want to show the IMG tag in the filtered content.

Option convert clean text

An option for a clean conversion would be useful, without decorations.

For example, <hr /> is converted to --------------------- when it should simply be \n,
and tags such as <b></b> should not be wrapped with _ (as in _b_).

thanks

Doctrine Annotation Exception with "@type" on using html2text class

Hi,

I use html2text in a TYPO3 9.5 instance with PHP 7.2.21. When using the Html2Text class I get this exception in the frontend:

(1/1) Doctrine\Common\Annotations\AnnotationException
[Semantical Error] The annotation "@type" in property Html2Text\Html2Text::$html was never imported. Did you maybe forget to add a "use" statement for this annotation?

TYPO3 is installed with composer. I'm using the "doctrine/annotations" v1.8.0 package.

Changing all "@type" to "@var" in "html2text/html2text/src/Html2Text.php" annotations solves the problem and the frontend shows me the plain text result.

It would be great if this could be fixed.

Multibyte strings do not play nicely with blockquotes

A string such as:

“Hello”

<blockquote>goodbye</blockquote>

Currently gets converted to:

“Hello”

This is because mb_substr in convertBlockquotes is truncating in the wrong place, which results in potentially incomplete blockquote tags, which strip_tags will remove.

I've created pull request #56 to address this.
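
A small sketch of how such a truncation can arise, assuming the cause is a byte offset being passed to a character-based function. That assumption is mine; the actual code in convertBlockquotes may differ. It needs the mbstring extension and a UTF-8 source file.

```php
<?php
// The curly quotes are 3 bytes each in UTF-8, so byte and character
// offsets of the tag differ by 4.
$html = '“Hello” <blockquote>goodbye</blockquote>';
$tag = '<blockquote>';

$byteOffset = strpos($html, $tag);                 // counts bytes: 12
$charOffset = mb_strpos($html, $tag, 0, 'UTF-8');  // counts characters: 8

// Feeding a byte offset to mb_substr() starts in the middle of the tag.
$wrong = mb_substr($html, $byteOffset, mb_strlen($tag, 'UTF-8'), 'UTF-8');
$right = mb_substr($html, $charOffset, mb_strlen($tag, 'UTF-8'), 'UTF-8');

var_dump($wrong === $tag, $right === $tag); // bool(false), bool(true)
```

Using either the byte-based functions or the mb_* functions consistently avoids the mismatch.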

Replacement for newlines and tabs

Hi,

Is the " " as replacement correct? Newlines and tabs don't do much in HTML.

The html I have:

 <p>
    This is some text<br/>
    with a break in the middle
 </p>

Results to:

This is some text
 with a break in the middle

The " " before "with" is not correct there, but I can't tell whether removing it would have other side effects.

[PHP7] preg_replace does not work

When I moved everything to PHP 7, I fixed many things in my scripts (where I used the old mysql extension and switched to mysqli), but this converter is completely broken because of the preg_replace function. I use it to turn news posts from my official site into a readable form for posting to the chat room, and I now receive an empty string. How about upgrading to use preg_replace_callback?
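
A sketch of the suggested migration, assuming the failure comes from the /e modifier, which was removed in PHP 7.0 (preg_replace then returns NULL, which would explain the empty string). The pattern below is a simplified illustration, not the library's actual search array.

```php
<?php
// PHP 5 style, no longer works on PHP 7 (the /e modifier was removed):
//   preg_replace('/<b[^>]*>(.*?)<\/b>/ie', 'strtoupper("\\1")', $html);

$html = 'Hello, <b>world</b>';

// The replacement logic moves from an evaluated string into a callback.
$text = preg_replace_callback(
    '/<b[^>]*>(.*?)<\/b>/i',
    function (array $m) {
        return strtoupper($m[1]);
    },
    $html
);

echo $text; // Hello, WORLD
```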

URLs are uppercased inside of b tags.

URLs get uppercased when they are encapsulated by a bold tag:

<b><a href="https://www.tave.com/Test/LowerAndUpperCase?Signature=518d4bf6872e0deaf0eeb23b19c724514ec7c69a">Test</a></b>

becomes:

TEST
[HTTPS://WWW.TAVE.COM/TEST/LOWERANDUPPERCASE?SIGNATURE=518D4BF6872E0DEAF0EEB23B19C724514EC7C69A]

In the call to _preg_callback, for the b and strong tags, we call _toupper of $matches[3], but it doesn't look like _toupper handles when the link has already been parsed. See test case below:

$contents = '

<b><a href="https://www.tave.com/Test/LowerAndUpperCase?Signature=518d4bf6872e0deaf0eeb23b19c724514ec7c69a">Test</a></b>

<a href="https://www.tave.com/Test/LowerAndUpperCase?Signature=518d4bf6872e0deaf0eeb23b19c724514ec7c69a"><b>Test</b></a>

';

try {
  $textContents = new \Html2Text\Html2Text($contents);
  $textContents = $textContents->get_text();
}
catch (Exception $e) {
    return $e;
}

echo $textContents;

/* Result */
/*

TEST
[HTTPS://WWW.TAVE.COM/TEST/LOWERANDUPPERCASE?SIGNATURE=518D4BF6872E0DEAF0EEB23B19C724514EC7C69A]
TEST
[https://www.tave.com/Test/LowerAndUpperCase?Signature=518d4bf6872e0deaf0eeb23b19c724514ec7c69a]';

*/

$matches[3] will contain:

Test [https://www.tave.com/Test/LowerAndUpperCase?Signature=518d4bf6872e0deaf0eeb23b19c724514ec7c69a]

so _toupper doesn't have any tags to split on and blindly uppercases the whole string.

I'll look into doing a pull request for this, but I am in a bit of a time crunch right now, so I figured I would report it and see who gets to it first.

Cheers
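
One possible direction, sketched under the assumption that already-converted links always appear as a bracketed URL such as [https://...]. The helper upperExceptBracketedUrls is hypothetical and is not the library's _toupper.

```php
<?php
// Hypothetical helper: uppercases text but leaves bracketed URLs such as
// [https://example.com/Path] untouched.
function upperExceptBracketedUrls(string $text): string
{
    // Split on "[scheme:...]" and keep the delimiters in the result.
    $parts = preg_split(
        '/(\[[a-z][a-z0-9+.\-]*:[^\]\s]*\])/i',
        $text,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) { // even indexes are plain text, odd ones are URLs
            $parts[$i] = strtoupper($part);
        }
    }
    return implode('', $parts);
}

$result = upperExceptBracketedUrls('Test [https://example.com/Test/LowerAndUpperCase]');
echo $result; // TEST [https://example.com/Test/LowerAndUpperCase]
```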

New release

Hi,

Any chance of a new release for the library? The last release was 3.0.0 in October, and since then there have been a range of bug fixes and new features, including:
#44 PHP7 support in unit tests
#37/#45 blockquote parsing fix
#7 Treat all paragraph content equally
#47 bbcode support

We (moodle/moodle) can pull from master, but generally we prefer to pick known releases.

Thanks in advance,

Andrew

Use numbers instead of bullets for ordered lists

Currently ordered lists are converted into unordered lists in text; using an asterisk instead of numbers. Would be great to use numbers for these cases.

Example

(new \Html2Text\Html2Text('<ol><li>Item 1</li><li>Item 2</li></ol>'))->getText()

Result

* Item 1
* Item 2

Expected

1. Item 1
2. Item 2
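
A rough sketch of the expected behaviour, for flat (non-nested) ordered lists only. The function olToText is hypothetical and works on its own, outside the library's conversion pipeline.

```php
<?php
// Hypothetical helper: replaces each <ol> with numbered lines.
function olToText(string $html): string
{
    return preg_replace_callback('/<ol[^>]*>(.*?)<\/ol>/is', function (array $ol): string {
        $n = 0; // the counter restarts for every <ol>
        return preg_replace_callback('/<li[^>]*>(.*?)<\/li>/is', function (array $li) use (&$n): string {
            $n++;
            return $n . '. ' . trim(strip_tags($li[1])) . "\n";
        }, $ol[1]);
    }, $html);
}

$text = olToText('<ol><li>Item 1</li><li>Item 2</li></ol>');
echo $text;
```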

Bug?

Hi!

It seems I have found a bug, I'm not sure about the reason, I tried replacing <blockquote> and </blockquote> (using, for example, <div> and </div>) and that fixed this particular case but after doing that I realized it's about the string length, I think the problem could be a "wrong" str_pos to replace something (???). I don't really know... Thoughts?

TEST CASE (I'm working with Tumblr API, this is just some text from a random post):

$test_str = "<p>Highlights from today&rsquo;s <strong>Newlyhired Game</strong>:</p><blockquote><p><strong>Sean:</strong> What came first, Blake&rsquo;s first <em>Chief Architect position</em> or Blake&rsquo;s first <em>girlfriend</em>?</p> </blockquote> <blockquote> <p><strong>Sean:</strong> Devin, Bryan spent almost five years of his life slaving away for this vampire squid wrapped around the face of humanity&hellip;<br/><strong>Devin:</strong> Goldman Sachs?<br/><strong>Sean:</strong> Correct!</p> </blockquote> <blockquote> <p><strong>Sean:</strong> What was the name of the girl Zhu took to prom three months ago?<br/><strong>John:</strong> What?<br/><strong>Derek (from the audience):</strong> Destiny!<br/><strong>Zhu:</strong> Her name is Jolene. She&rsquo;s nice. I like her.</p></blockquote><p>I think the audience is winning.&nbsp; - Derek</p>";

$html2text = new \Html2Text\Html2Text($test_str);
echo $html2text->getText();

Thanks for your help!

Cheers,
Sebatian C.

&laquo and &raquo

These entities are not converted into text (« and ») although they should be. They are often used in French as quotation marks.

HeaderTest.php does not comply with psr-4 autoloading standard

Because the class name and file name differ, we see the following warning when running composer dump-autoload:

vagrant@homestead:~/code/mrm$ composer dump
Generating optimized autoload files
Deprecation Notice: Class Html2Text\StrToUpperTest located in ./vendor/html2text/html2text/test/HeaderTest.php does not comply with psr-4 autoloading standard. It will not autoload anymore in Composer v2.0. in phar:///usr/local/bin/composer/src/Composer/Autoload/ClassMapGenerator.php:201

Add License

You note in your README that a number of projects have found this useful and state "Now it has been extracted as a standalone library. Hopefully it can be of use to others.".

We're currently testing this out in a small, commercial software product we develop as well, and it's always nice to be sure that the libraries we are using support the right type of license so we don't cross any lines we shouldn't be crossing.

Would you be comfortable adding an open source license, like the MIT license or something (https://choosealicense.com), so it's clear how you allow others to use the codebase?

Links format 'inline' and 'nextline'

If your HTML contains, among other things, links like this
<a href="http://example.com/en/content/5/Some-Site.html">http://example.com/en/content/5/Some-Site.html</a>
they are output as
http://example.com/en/content/5/Some-Site.html [http://example.com/en/content/5/Some-Site.html]
which is completely superfluous.

This was already the case in Jon Abernathy's version, and it is still the case here.

Of course there is the 'none' flag, but with that flag all links are output as just $display (buildlinkList(), around line 398).

I fixed it by changing
return $display . "\n[" . $url . ']'; -> return $display == $url ? $display : $display . "\n[" . $url . ']';
and
return $display . ' [' . $url . ']'; -> return $display == $url ? $display : $display . ' [' . $url . ']';
but for that I have to override the entire buildlinkList() in a child class, which is not ideal.

Could you perhaps add flags like 'inline_auto' and 'nextline_auto' that take such situations into account, or handle them in some other way?
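
The proposed behaviour can be sketched as a standalone function. The name formatLink and its parameters are hypothetical; only the 'inline' and 'nextline' formats from this issue are covered.

```php
<?php
// Hypothetical helper: formats a link, but skips the bracketed URL when
// the link text already is the URL.
function formatLink(string $display, string $url, string $method = 'inline'): string
{
    if ($display === $url) {
        return $display;
    }
    if ($method === 'nextline') {
        return $display . "\n[" . $url . ']';
    }
    return $display . ' [' . $url . ']';
}

echo formatLink('http://example.com/', 'http://example.com/'), "\n";
echo formatLink('Example', 'http://example.com/'), "\n";
```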

The text layout changes

Hi.

I ran the following simple HTML buffer through the library:

<div dir="ltr">I received an e-mail from one of your colleagues a short while back regarding an invoice i received</div>

and to my surprise, I got back:

I received an e-mail from one of your colleagues a short while back
regarding an invoice i received

Why are you adding the extra newline between "back" and "regarding"??
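
The break position is consistent with word wrapping at 70 columns, which PHP's wordwrap() reproduces below. That the library wraps this way, through a width option that defaults to 70 and can be set to 0 to disable wrapping, is my understanding and should be treated as an assumption.

```php
<?php
$line = 'I received an e-mail from one of your colleagues a short while back regarding an invoice i received';

// wordwrap() breaks at the last space that keeps the line within 70 columns.
$wrapped = wordwrap($line, 70, "\n", false);
echo $wrapped;
```

The first line ends after "back" (67 characters), because adding " regarding" would exceed 70.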

Inconsistent output with <br> tag within <pre> tag

Consider this HTML:

<pre>
    <span>
void FillMeUp(char* in_string) {<br />  int i = 0;<br />  while (in_string[i] != \'\0\') {<br />    in_string[i] = \'X\';<br />    i++;<br />  }<br />}
    </span>
</pre>

In version 3 it was rendered like this:

void FillMeUp(char* in_string) {
  int i = 0;
  while (in_string[i] != \'\') {
    in_string[i] = \'X\';
    i++;
  }
}

But now in version 4 we get this:

void FillMeUp(char* in_string) {
int i = 0;
while (in_string[i] != \'\') {
in_string[i] = \'X\';
i++;
}
}

As best I can tell this has something to do with the changes to the callbackSearch array - specifically the
addition.

bad use for me

It does not work well for me; I use the strip_tags() function instead.

getText strips non-HTML usage of gt/lt

Is there a reason that non-HTML uses of less than / greater than get stripped?

        $text = 'over 95% and very few financial penalties (<2%) lorem ipsum (KPI > 95%), major changes applied';
        $helper = new Html2Text($text);
        $this->assertContains('lorem ipsum', $helper->getText());
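
A possible workaround, sketched on the assumption that the text between "<2%" and "> 95%" is being removed as if it were a tag (PHP's strip_tags() may do this when "<" is directly followed by a non-space character). The idea is to escape any "<" that cannot start a tag before stripping.

```php
<?php
$text = 'over 95% and very few financial penalties (<2%) lorem ipsum (KPI > 95%), major changes applied';

// Escape every "<" that is not followed by a letter, "/" or "!",
// i.e. one that cannot start an HTML tag, comment or doctype.
$escaped = preg_replace('/<(?![a-z\/!])/i', '&lt;', $text);

// Strip real tags, then turn the escaped characters back into text.
$plain = html_entity_decode(strip_tags($escaped), ENT_QUOTES, 'UTF-8');

var_dump($plain === $text); // bool(true)
```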

'&nbsp;' convert

$htmlToText = new \Html2Text\Html2Text('&nbsp;');
var_dump(trim($htmlToText->getText())); //string(2) " "

as I understand it should be string(0) ""

best regards, Eduard
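
The two bytes are probably the UTF-8 encoding of a no-break space (U+00A0), which trim() does not remove by default. A sketch of what happens and of a Unicode-aware trim, independent of the library:

```php
<?php
$nbsp = html_entity_decode('&nbsp;', ENT_QUOTES, 'UTF-8'); // U+00A0, bytes C2 A0

var_dump(strlen($nbsp));      // int(2)
var_dump(trim($nbsp) === ''); // bool(false): trim() ignores U+00A0

// Unicode-aware trim that also removes no-break spaces.
$trimmed = preg_replace('/^[\s\x{00A0}]+|[\s\x{00A0}]+$/u', '', $nbsp);
var_dump($trimmed === '');    // bool(true)
```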

Use in Drupal?

Hi

The swiftmailer contrib module has been using this for a while now, and I opened an issue to remove our own custom html 2 text conversion in favor of this: https://www.drupal.org/node/2830384

Could be a bit tricky as our implementation apparently has different opinions on everything.

Even if we end up not doing that, I imagine that the issue might be useful for you to follow. We do have a decent amount of tests, quite possible that we not only see a difference of opinion in the failing tests but actual bugs in this library?

MIT or LGPL license

Any chance of relicensing to MIT or LGPL? I have an existing project licensed under MIT, and I would like to use the library without having to adopt the GPL for my project.

<p> and <br> are treated equally

@voku commented in the code change from #7 that there should be a difference between how <p> and <br> are displayed.

At the moment, the following text will be rendered:

<p>Some content</p><p>Here<br>And there</p>

As:


Some content

Here
And there

\nSome content\n\nHere\nAnd there\n

@voku is suggesting change <p> tags to render as "\n\n" . $content . "\n\n"
The above example then becomes:



Some content



Here
And there

\n\nSome content\n\n\n\nHere\nAnd there\n\n

Which, after normalisation of the newlines becomes:



Some content

Here
And there

\n\nSome content\n\nHere\nAnd there\n\n

The net result is the same in many situations, but will be different where the next element is not a paragraph (e.g. an H3, or a table).
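
The normalisation step above can be sketched as collapsing runs of three or more newlines into two. This is an illustration of the idea, not necessarily the exact expression the library uses.

```php
<?php
// Output of the suggested <p> rendering, before normalisation.
$rendered = "\n\nSome content\n\n\n\nHere\nAnd there\n\n";

// Collapse any run of 3+ newlines into exactly two.
$normalised = preg_replace("/\n{3,}/", "\n\n", $rendered);

var_dump($normalised === "\n\nSome content\n\nHere\nAnd there\n\n"); // bool(true)
```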

Alt for img is added to link

I have html like this:

<header>
    <img src="/img/background.jpg" alt="Computer Keyboard - Głównie JavaScript"/>
    <h1><a href="/">Głównie JavaScript</a></h1>

which is converted to:

[http://jcubic.plComputer Keyboard - Głównie JavaScript] 

Mailto encoded with HTML entities gets a slash at the beginning

GitHub profiles contain URLs like this:

&#109;&#97;&#105;&#108;&#116;&#111;&#58;%6a%63%75%62%69%63@%6a%63%75%62%69%63.%70%6c

which is mailto:email, and html2text displays that URL as (if decoded):

[/mailto:%6a%63%75%62%69%63@%6a%63%75%62%69%63.%70%6c]

It adds a superfluous slash at the beginning.

License change

Please consider changing GPL licence to LGPL (or some other - http://opensource.org/licenses/category), which is more suitable for libraries. GPL prevents using the code in non-GPL projects (and html2text claims "Hopefully it can be of use to others.").

Autoloader pollution when optimizing

The composer file is configured to always autoload all tests, even when installed as a dependency of projects (where these tests aren't used).

This results in the following classmap when optimizing the autoloader:

return array(
    'Html2Text\\BasicTest' => $baseDir . '/test/BasicTest.php',
    'Html2Text\\BlockquoteTest' => $baseDir . '/test/BlockquoteTest.php',
    'Html2Text\\ConstructorTest' => $baseDir . '/test/ConstructorTest.php',
    'Html2Text\\DefinitionListTest' => $baseDir . '/test/DefinitionListTest.php',
    'Html2Text\\DelTest' => $baseDir . '/test/DelTest.php',
    'Html2Text\\Html2Text' => $baseDir . '/src/Html2Text.php',
    'Html2Text\\HtmlCharsTest' => $baseDir . '/test/HtmlCharsTest.php',
    'Html2Text\\ImageTest' => $baseDir . '/test/ImageTest.php',
    'Html2Text\\InsTest' => $baseDir . '/test/InsTest.php',
    'Html2Text\\LinkTest' => $baseDir . '/test/LinkTest.php',
    'Html2Text\\ListTest' => $baseDir . '/test/ListTest.php',
    'Html2Text\\PreTest' => $baseDir . '/test/PreTest.php',
    'Html2Text\\PrintTest' => $baseDir . '/test/PrintTest.php',
    'Html2Text\\SearchReplaceTest' => $baseDir . '/test/SearchReplaceTest.php',
    'Html2Text\\SpanTest' => $baseDir . '/test/SpanTest.php',
    'Html2Text\\StrToUpperTest' => $baseDir . '/test/StrToUpperTest.php',
    'Html2Text\\TableTest' => $baseDir . '/test/TableTest.php',
);

When we split the tests into autoload-dev and the actual code file in autoload, we cut out 16 of the 17 classes:

return array(
    'Html2Text\\Html2Text' => $baseDir . '/src/Html2Text.php',
);

The difference in code changes is small:

     "autoload": {
-        "psr-4": { "Html2Text\\": ["src/", "test/"] }
+        "psr-4": { "Html2Text\\": "src/" }
+    },
+    "autoload-dev": {
+        "psr-4": { "Html2Text\\": "test/" }
     },

func_get_args(), no longer report the original value as passed to a parameter

html2text/html2text/src/Html2Text.php

FOUND 0 ERRORS AND 1 WARNING AFFECTING 1 LINE

241 | WARNING | Since PHP 7.0, functions inspecting arguments, like func_get_args(), no longer
| | report the original value as passed to a parameter, but will instead provide the
| | current value. The parameter "$options" was used, and possibly changed (by
| | reference), on line 240.
| | (PHPCompatibility.FunctionUse.ArgumentFunctionsReportCurrentValue.NeedsInspection)

<br /> within <strong> prevents <strong> from being converted

We have HTML (created by WYSIWYG editors) that contain <strong> tags with <br /> tags inside. Because the <br /> tags are converted to new lines before <strong> is being converted to uppercase, and the regexp doesn't match new lines, it prevents the <strong> from being converted to uppercase.

Example:
<strong>This would<br />not be converted.</strong><strong>But this would, though</strong>

Because not all <strong> tags have a <br />, it's kind of confusing for our users. This could also be the case for other tags, of course (<b>, <a>, ...).

There could be 3 solutions:

  • Using the s modifier, so that the . (dot) also matches line breaks.
'/<(strong)( [^>]*)?>(.*?)<\/strong>/si',                 // <strong>
  • Using character classes. Use [\s\S] instead of . (dot). This matches all whitespace characters and all non-whitespace characters; in other words, any character, including line breaks. This still lets you use the . (dot) for its actual purpose. Not necessary in this case, but giving the option anyway :)
'/<(strong)( [^>]*)?>([\s\S]*?)<\/strong>/i',                 // <strong>
  • Moving the regexp for <br /> to the end of the array, so <br /> would be converted last.
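
A short demonstration of the difference the s modifier makes, using the pattern from this issue on a string where <br /> has already become a newline. The callback is a simplified stand-in for the library's uppercase handling.

```php
<?php
$text = "<strong>This would\nnot be converted.</strong>";

// Group 3 holds the tag content.
$upper = function (array $m) {
    return strtoupper($m[3]);
};

// Without "s", the dot stops at the newline and nothing matches.
$withoutS = preg_replace_callback('/<(strong)( [^>]*)?>(.*?)<\/strong>/i', $upper, $text);
// With "s", the dot also matches the newline.
$withS = preg_replace_callback('/<(strong)( [^>]*)?>(.*?)<\/strong>/si', $upper, $text);

var_dump($withoutS === $text);                          // bool(true)
var_dump($withS === "THIS WOULD\nNOT BE CONVERTED.");   // bool(true)
```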
