reversemarkdown-net's Introduction

Meet ReverseMarkdown

ReverseMarkdown is an HTML-to-Markdown converter library written in C#. Conversion is reliable because the HtmlAgilityPack (HAP) library is used to traverse the HTML DOM.

If you have used and benefited from this library, please feel free to buy me a coffee via GitHub Sponsors!

Usage

Install the package from NuGet using Install-Package ReverseMarkdown, or clone the repository and build it yourself.

var converter = new ReverseMarkdown.Converter();

string html = "This a sample <strong>paragraph</strong> from <a href=\"http://test.com\">my site</a>";

string result = converter.Convert(html);

Will result in:

This a sample **paragraph** from [my site](http://test.com)

The conversion can be customized:

var config = new ReverseMarkdown.Config
{
    // Include the unknown tag completely in the result (default as well)
    UnknownTags = Config.UnknownTagsOption.PassThrough,
    // generate GitHub flavoured markdown, supported for BR, PRE and table tags
    GithubFlavored = true,
    // will ignore all comments
    RemoveComments = true,
    // remove markdown output for links where appropriate
    SmartHrefHandling = true
};

var converter = new ReverseMarkdown.Converter(config);

Configuration options

  • DefaultCodeBlockLanguage - Sets the default code block language for GitHub-style Markdown when class-based language markers are not available

  • GithubFlavored - GitHub-style Markdown for br, pre and table. Default is false

  • SuppressDivNewlines - Removes prefixed newlines from div tags. Default is false

  • ListBulletChar - Changes the bullet character. The default value is -. Some systems expect the bullet character to be * rather than -; this option lets you change it.

  • RemoveComments - Remove comment tags with text. Default is false

  • SmartHrefHandling - how to handle the href attribute of <a> tags

    • false - Outputs [{name}]({href}{title}) even if name and href are identical. This is the default option.

    • true - If name and href are equal, outputs just the name. Note that if the URI is not well formed as per Uri.IsWellFormedUriString (i.e. the string is not correctly escaped, like http://example.com/path/file name.docx), Markdown syntax will be used anyway.

      If href contains the http/https protocol and name doesn't, but they are otherwise the same, outputs href only.

      If href uses the tel: or mailto: scheme and the remainder is identical to name, outputs name only.

  • UnknownTags - handle unknown tags.

    • UnknownTagsOption.PassThrough - Include the unknown tag completely into the result. That is, the tag along with the text will be left in output. This is the default
    • UnknownTagsOption.Drop - Drop the unknown tag and its content
    • UnknownTagsOption.Bypass - Ignore the unknown tag but try to convert its content
    • UnknownTagsOption.Raise - Raise an error to let you know
  • PassThroughTags - Pass a list of tags to pass through as-is without any processing.

  • WhitelistUriSchemes - Specifies which schemes (without the trailing colon) are allowed for <a> and <img> tags. Others will be bypassed (output text or nothing). By default, everything is allowed.

    If string.Empty is provided, a URL whose href or src scheme couldn't be determined is whitelisted.

    The scheme is determined by the Uri class, except when the URL begins with / (file scheme) or // (http scheme).

  • TableWithoutHeaderRowHandling - handle table without header rows

    • TableWithoutHeaderRowHandlingOption.Default - First row will be used as header row (default)
    • TableWithoutHeaderRowHandlingOption.EmptyRow - An empty row will be added as the header row

Note that UnknownTags config has been changed to an enumeration in v2.0.0 (breaking change)
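
As a rough illustration of how several of these options combine, the sketch below sets a few of them together. The property types shown here (a char for ListBulletChar, an enum nested in Config for TableWithoutHeaderRowHandling) are assumptions inferred from the descriptions above, so check them against the version you install.

```csharp
using ReverseMarkdown;

var config = new Config
{
    GithubFlavored = true,
    // Assumed to be a char; the documented default is '-'
    ListBulletChar = '*',
    // Assumed to be nested in Config, like UnknownTagsOption
    TableWithoutHeaderRowHandling = Config.TableWithoutHeaderRowHandlingOption.EmptyRow,
    // Ignore unknown tags but still convert their content
    UnknownTags = Config.UnknownTagsOption.Bypass,
    SmartHrefHandling = true
};

var converter = new Converter(config);
string markdown = converter.Convert("<ul><li>one</li><li>two</li></ul>");
```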

Features

  • Supports all the established html tags like h1, h2, h3, h4, h5, h6, p, em, strong, i, b, blockquote, code, img, a, hr, li, ol, ul, table, tr, th, td, br
  • Can deal with nested lists
  • GitHub Flavored Markdown conversion is supported for br, pre and table. Use var config = new ReverseMarkdown.Config { GithubFlavored = true };. By default, tables are always converted to GitHub Flavored Markdown regardless of this flag.

Acknowledgements

This library's initial implementation ideas were from the Ruby based Html to Markdown converter xijo/reverse_markdown.

Copyright

Copyright © Babu Annamalai

License

ReverseMarkdown is licensed under MIT. Refer to License file for more information.

reversemarkdown-net's People

Contributors

actions-user, dependabot[bot], doggy8088, dstj, francescolf, ian-craig, janis-veinbergs, jeremy-jameson, laim, mysticmind, natelowry, promofaux, rickstrahl, rosskyl, simoncropp, soyvolon, stah, thepbjainatmicrosoft, wghilliard, zelloxy

reversemarkdown-net's Issues

Convert inline style "font-weight: bold" to **bold**

I'd like to request a new feature to support converting inline font-weight: bold style to markdown **bold**.

        [TestMethod]
        public void ConvertInlineStyleFontWeightBold()
        {
            string html = @"
            <html>
                <body>
                    <p style=""font-weight: bold;"">Hello World</p>
                </body>
            </html>";

            var config = new Config();
            Converter converter = new Converter(config);
            string markdown = converter.Convert(html).Trim();

            Assert.AreEqual("**Hello World**", markdown);
        }

Expected

I expected <p style="font-weight: bold;">Hello World</p> to be converted to **Hello World**.

Actual

<p style="font-weight: bold;">Hello World</p> is converted to Hello World.

Thank you.
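
One way the requested behaviour might be approached is to inspect the style attribute before emitting the paragraph text. The sketch below is purely illustrative: InlineStyleSketch, IsBold and WrapIfBold are hypothetical names, not part of ReverseMarkdown, and treating weights 600-900 as bold is an assumption.

```csharp
using System.Text.RegularExpressions;

public static class InlineStyleSketch
{
    // Matches "font-weight: bold", "bolder", or numeric weights 600-900
    // as a standalone declaration inside a style attribute.
    private static readonly Regex BoldWeight = new Regex(
        @"(^|;)\s*font-weight\s*:\s*(bold|bolder|[6-9]00)\s*(;|$)",
        RegexOptions.IgnoreCase);

    public static bool IsBold(string styleAttribute)
    {
        return !string.IsNullOrWhiteSpace(styleAttribute)
            && BoldWeight.IsMatch(styleAttribute);
    }

    // Wraps the text in ** markers only when the style declares a bold weight.
    public static string WrapIfBold(string text, string styleAttribute)
    {
        return IsBold(styleAttribute) ? $"**{text.Trim()}**" : text;
    }
}
```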

Code updates and migration to .NET Standard

I've cleaned up the code and refactored it slightly to improve performance. I would like to upload my changes from my local branch.

I am also in the process of migrating this code so that it works with .NET Standard. I wondered if you might be interested in having me upload those changes as well.

Smart handling of href tags and whitelisted schemes for a and img tags.

@mysticmind, would you be interested in a pull request with commit janis-veinbergs/reversemarkdown-net@7368073?

  • HrefHandling - how to handle the href attribute of <a> tags

    • None - Outputs [{name}]({href}{title}) even if name and href are identical. This is the default option.

    • Smart - If name and href are equal, outputs just the name instead of [{name}]({href}{title}). Note that if the URI is not well formed (the string is not correctly escaped, like http://example.com/path/file name.docx), Markdown syntax will be used anyway.

      If href contains the http/https protocol and name doesn't, but they are otherwise the same, outputs href only.

      If href uses the tel: or mailto: scheme and the remainder is identical to name, outputs name only.

  • WhitelistUriSchemes - Specifies which schemes (without the trailing colon) are allowed for <a> and <img> tags. Others will be bypassed (output text or nothing). By default, everything is allowed.

    If string.Empty is provided, a URL whose href scheme couldn't be determined is whitelisted.

The defaults for these options are chosen so that existing conversions are not broken.
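
The smart href rules described above can be read as a small decision function. The following is a minimal sketch of that reading, not the code from the commit: SmartHrefSketch and RenderLink are hypothetical names, and applying the well-formedness check before all three rules is an assumption.

```csharp
using System;

public static class SmartHrefSketch
{
    public static string RenderLink(string name, string href, string title = "")
    {
        string markdown = $"[{name}]({href}{title})";

        // Poorly formed URIs (e.g. an unescaped space) keep the full Markdown syntax.
        if (!Uri.IsWellFormedUriString(href, UriKind.RelativeOrAbsolute))
            return markdown;

        // Rule 1: name and href are equal, so the name alone is enough.
        if (name == href)
            return name;

        // Rule 2: href has an http/https prefix that name lacks; output href only.
        foreach (var prefix in new[] { "http://", "https://" })
        {
            if (href.StartsWith(prefix, StringComparison.OrdinalIgnoreCase)
                && href.Substring(prefix.Length) == name)
                return href;
        }

        // Rule 3: tel:/mailto: followed by exactly the name; output name only.
        foreach (var prefix in new[] { "tel:", "mailto:" })
        {
            if (href.StartsWith(prefix, StringComparison.OrdinalIgnoreCase)
                && href.Substring(prefix.Length) == name)
                return name;
        }

        return markdown;
    }
}
```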

<br> in <table> breaks formatting

Example HTML:

<table><tr><th>col1</th><th>col2</th></tr><tr><td>line 1<br>line 2</td><td>c2</td></tr></table>

Current Markdown:

| col1 | col2 |
| --- | --- |
| line1
line2 | c2 |

Expected Markdown:

| col1 | col2 |
| --- | --- |
| line1<br>line2 | c2 |

Github seems to do a decent job at guessing how to render this. Other renderers just break.
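
A possible direction, sketched under the assumption that cell content is available as a plain string before the row is assembled, is to replace line breaks inside a cell with a literal <br> tag. TableCellSketch and its members are hypothetical names used only for this illustration.

```csharp
using System.Linq;

public static class TableCellSketch
{
    // Keeps a cell on one physical line by turning line breaks into <br>.
    public static string ToSingleLine(string cellContent)
    {
        return cellContent
            .Trim()
            .Replace("\r\n", "\n")
            .Replace("\n", "<br>");
    }

    // Joins already-converted cell contents into one Markdown table row.
    public static string FormatRow(params string[] cells)
    {
        return "| " + string.Join(" | ", cells.Select(c => ToSingleLine(c))) + " |";
    }
}
```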

Extra Line in Fenced Code Blocks (Github Flavored Markdown)

There's an extra line when importing code blocks using the common <pre><code> format (i.e. GitHub-flavored fenced code output).

Here's a test that demonstrates:

[Fact]
public void When_FencedCodeBlocks_Shouldnt_Have_Trailing_Line()
{

    var html = @"<pre><code class=""language-xml hljs""><span class=""hljs-tag"">&lt;<span class=""hljs-name"">AspNetCoreHostingModel</span>&gt;</span>InProcess<span class=""hljs-tag"">&lt;/<span class=""hljs-name"">AspNetCoreHostingModel</span>&gt;</span>
</code></pre>";
    var expected = $@"{Environment.NewLine}```
<AspNetCoreHostingModel>InProcess</AspNetCoreHostingModel>
```{Environment.NewLine}";

    var config = new ReverseMarkdown.Config
    {
        GithubFlavored = true,
    };
    var converter = new Converter(config);
    var result = converter.Convert(html);

    Assert.Equal(expected, result, StringComparer.OrdinalIgnoreCase);
}

Result:

Expected: 
```
<AspNetCoreHostingModel>InProcess</AspNetCoreHostingModel>
```

Actual:   
```
<AspNetCoreHostingModel>InProcess</AspNetCoreHostingModel>

```

Note the extra blank line before the closing fence.

It also looks like, although the code looks for a language, it picks it up from the <pre> element rather than the <code> element. The HTML above uses the common highlight.js convention (i.e. language-name as the language id).
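
Both points could be handled roughly as sketched below: trim trailing line breaks from the code text before closing the fence, and read the language from a language-* class token. FencedCodeSketch is a hypothetical helper, not the library's implementation, and it always emits \n rather than Environment.NewLine to keep the example simple.

```csharp
using System;
using System.Linq;

public static class FencedCodeSketch
{
    // Returns "xml" for a class attribute such as "language-xml hljs".
    public static string ExtractLanguage(string classAttribute)
    {
        if (string.IsNullOrWhiteSpace(classAttribute)) return string.Empty;

        const string prefix = "language-";
        string token = classAttribute
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .FirstOrDefault(t => t.StartsWith(prefix, StringComparison.OrdinalIgnoreCase));

        return token == null ? string.Empty : token.Substring(prefix.Length);
    }

    // Wraps code in a fence, dropping trailing line breaks so that no
    // blank line appears before the closing fence.
    public static string Fence(string code, string language)
    {
        string fence = new string('`', 3);
        string body = code.TrimEnd('\r', '\n');
        return fence + language + "\n" + body + "\n" + fence;
    }
}
```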

Blank lines are inserted in nested lists

HTML

<ul>
    <li>OuterItem1
        <ol>
            <li>InnerItem1</li>
        </ol>
    </li>
    <li>Item2</li>
    <ol>
        <li>InnerItem2</li>
    </ol>
    <li>Item3</li>
</ul>

Current Markdown:

- OuterItem1

    1. InnerItem1
- Item2

    1. InnerItem2

- Item3

Expected Markdown:

- OuterItem1
    1. InnerItem1
- Item2
    1. InnerItem2
- Item3

While this renders mostly correctly as-is, the blank lines can cause problems with some renderers. Inner lists can be rendered as indented code/quote blocks, and numbering is restarted.

Table separator between header and body has too many columns

When converting a table, the separator between header and body (| --- | ...) now seems to have too many columns.

From the tests I've done, for n columns of content there are 2*n + 1 separator columns

Expected (and actual in v3.0.0)

| Fruit  | Quantity |
| --- | --- |
| Apples | 100 |

Actual in v3.4.0

| Fruit  | Quantity |
| --- | --- | --- | --- | --- |
| Apples | 100 |
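
For reference, the separator row needs exactly one --- cell per content column. A minimal sketch of that rule follows; TableSeparatorSketch is a hypothetical helper, not the library's code.

```csharp
using System.Linq;

public static class TableSeparatorSketch
{
    // Builds the header/body separator with one "---" cell per column.
    public static string Build(int columnCount)
    {
        return "| " + string.Join(" | ", Enumerable.Repeat("---", columnCount)) + " |";
    }
}
```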

Nested lists do not indent correctly

Example can be found in existing test WhenThereIsOrderedListWithNestedUnorderedList_ThenConvertToMarkdownListWithNestedList

Current Markdown (2 space indent)

This text has ordered list.
1. OuterItem1
  - InnerItem1
  - InnerItem2
2. Item2

Expected Markdown (4 space indent)

This text has ordered list.
1. OuterItem1
    - InnerItem1
    - InnerItem2
2. Item2
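
The expected output corresponds to indenting each nesting level by four spaces. A minimal sketch of that rule, using a hypothetical ListIndentSketch helper and assuming the nesting depth is already known:

```csharp
public static class ListIndentSketch
{
    // Four spaces per nesting level, as in the expected Markdown above.
    private const int IndentWidth = 4;

    // depth 0 is a top-level item; marker is e.g. "1." or "-".
    public static string FormatItem(int depth, string marker, string text)
    {
        return new string(' ', depth * IndentWidth) + marker + " " + text;
    }
}
```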

Improvement request: ability to change the unordered list bullet character to use

Hi,

Just found out about this Converter and I'm quite pleased so far!

The system to which I'm pushing the converted Markdown only seems to accept * as the unordered list (<ul><li>) bullet lead-in character, not the - used.

So, I'd like to suggest an improvement to add a setting to pick the bullet character to use. It could be - by default, but others are equally valid it appears. The markdown guide says:

Unordered Lists
To create an unordered list, add dashes (-), asterisks (*), or plus signs (+) in front of line items. Indent one or more items to create a nested list.

That would resolve the issue with my "non-fully-markdown-compliant" other system...

I'm guessing it would only require modifying this line of code right here:

        private string PrefixFor(HtmlNode node)
        {
            if (node.ParentNode != null && node.ParentNode.Name == "ol")
            {
                // index are zero based hence add one
                var index = node.ParentNode.SelectNodes("./li").IndexOf(node) + 1;
                return $"{index}. ";
            }
            else
            {
                return "- ";
            }
        }
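
A variant of the method above with the bullet character passed in could look like the following sketch. To keep it self-contained it takes the parent type and index as plain arguments instead of an HtmlNode; ListPrefixSketch and the bulletChar parameter are hypothetical.

```csharp
public static class ListPrefixSketch
{
    // Ordered lists get "1. ", "2. ", ...; unordered lists get the chosen bullet.
    public static string PrefixFor(bool parentIsOrderedList, int zeroBasedIndex, char bulletChar = '-')
    {
        // Indexes are zero-based, hence add one.
        return parentIsOrderedList
            ? $"{zeroBasedIndex + 1}. "
            : $"{bulletChar} ";
    }
}
```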

Table cells with newlines are not correctly converted

I had a couple of tables that looked like this:

<table>
<tr>
<td>
<ul>
<li>bla bla
<li>bla bla
</ul>
</td>
</tr>
</table>

They are converted as:

| * bla bla
* bla bla |

And that doesn't render as a table. To fix it I had to change it to:

| * bla bla </br>* bla bla |

Handling of newlines when table elements contain elements like: p and div

My goal is to have more human-readable tables. Is this actually the goal of the ReverseMarkdown library? Am I using it wrong?

Consider two HTML inputs that produce different outputs but should produce identical ones:

<html><body><table><tbody><tr><td><p>col1</p></td><td><p>col2</p></td></tr><tr><td><p>data1</p></td><td><p>data2</p></td></tr></tbody></table></body></html>

Out:

| col1<br> | col2<br> |
| --- | --- |
| data1<br> | data2<br> |

<html><body><table><tbody><tr><td><p>
col1</p></td><td><p>col2</p></td></tr><tr><td>
<p>
data1
</p>
</td>
<td><p>data2</p></td></tr></tbody></table></body></html>

Out:

| col1<br> | col2<br> |
| --- | --- |
| <br>data1<br> | data2<br> |

I would expect output in both cases to be this:

| col1 | col2 |
| --- | --- |
| data1 | data2 |

There are 2 issues

The p converter always appends a newline at the end, no matter what:

return $"{indentation}{TreatChildren(node).Trim()}{Environment.NewLine}";

A browser doesn't, and Outlook generates tables exactly like that: td > p > #text

An incomplete fix would be to check whether p (perhaps any flow content) is the last element within td and, if so, not add the trailing newline.

So considering some scenarios:

  1. <td><p>data1</p></td>: the browser renders no newlines. ReverseMarkdown adds an excess trailing br: | data1<br> |
  2. <td>data1<p>p</p></td>: the browser renders a newline before p. ReverseMarkdown adds an excess trailing br: | data1<br>p<br> |
  3. <td><div><p>data1</p></div></td>: the browser renders no newlines. ReverseMarkdown adds excess leading and trailing br: | <br>data1<br> |

I don't know what would be the best way to handle these newlines. Because when I convert real-life html that comes from outlook, it is just overwhelmed with newlines.

Cases are many. Fixing those I brought up is possible. I should probably do it?

line break after starting tag and before ending tag should be ignored

The following comes from a non-standard document, but browsers currently behave this way:

A line break occurring immediately following a start tag must be ignored, as must a line break occurring immediately before an end tag. This applies to all HTML elements without exceptions. In addition, for all elements except PRE, leading white space characters, such as spaces, horizontal tabs, form feeds and line breaks, following the start tag must be ignored, and any subsequent sequence of contiguous white space characters must be replaced by a single word space.
The following three examples must be rendered identically:

<P>Thomas is watching TV.</P>
<P>
Thomas is watching TV.
</P>
<P>
   Thomas is watching TV.
</P>

w3.org HTML Text element
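
The quoted rule amounts to trimming whitespace at the edges of an element's text and collapsing inner runs of whitespace into a single space (for everything except PRE). A minimal sketch, using a hypothetical WhitespaceSketch helper on plain strings:

```csharp
using System.Text.RegularExpressions;

public static class WhitespaceSketch
{
    // Drops leading/trailing whitespace and collapses inner runs into one space.
    // Not suitable for PRE content, where whitespace must be preserved.
    public static string Normalize(string text)
    {
        return Regex.Replace(text.Trim(), @"\s+", " ");
    }
}
```

With this, the text content of all three <P> examples above normalizes to the same string.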

Nested list render issue

I tried the following html

<ul>
    <li>Apple</li>
    <li>Microsoft</li>
    <ul>
        <li>IBM</li>
        <li>Cisco</li>
    </ul>
</ul>

Expected behavior

  • Apple
  • Microsoft
    • IBM
    • Cisco

Actual Output

the converted output is in the link below and it has 2 issues

https://jbt.github.io/markdown-editor/#49JVcCwoyEnl0lXwzUwuyi/OTyvh4lLQVfB08gVRzpnFyflcXFwA

  1. the nested list is not indented correctly. It is actually missing 1 space to make it correct
  2. notice the empty lines before and after the converted markdown result, those should be trimmed

Steps to reproduce

I used the latest nuget package, and the code to reproduce

        var html = @"
    <ul>
        <li>Apple</li>
        <li>Microsoft</li>
        <ul>
            <li>IBM</li>
            <li>Cisco</li>
        </ul>
    </ul>
    ";

        var converterReverseMarkdown = new ReverseMarkdown.Converter();
        var mdReverseMarkdown = converterReverseMarkdown.Convert(html);

New Release soon?

Do you know when this merge will be added to the main nuget package? I would love for the latest commits to be released.

Imported Markdown Spacing Issues

Nice work on the update for ReverseMarkdown! This looks very nice and much more usable than the previous version. Congratulations... I'm checking out reverse Markdown and it does an excellent job overall with capturing the most common things and dealing with unhandled HTML well which is awesome.

However I am running into some issues with White Space generation in the generated Markdown output.

Inlines strip Whitespace and Run into previous text

I notice that inlines that are imported are not spacing out correctly:

image

Any blocks imported are imported with 3 empty lines instead of 1

Here's what imports from a Github Readme look like:

image

I suspect this has to do with the spacing inside of the block tags that is preserved, but probably should be trimmed and then separated with a single empty line.

I realize this latter issue is legal and of course renders fine, but it wastes a bunch of space and likely requires manual fixup of the Markdown text when editing later.

Enhance Markdown style

After conversion from HTML to Markdown a lot of markdown rules are not respected.

image

In VSCode, when I open the result, the markdownlint extension marks a lot of MD files with warnings.

image

While it's not really a problem, at some point you could maybe look at "optimizing" returned content to be compliant with MD RFC.

Of course not sure it all makes sense as some people may not want that - just a thought.

When setting config unknownTagsConverter to "bypass" an exception is thrown when converting

Exception Message: An item with the same key has already been added.
Stacktrace:

   at System.ThrowHelper.ThrowArgumentException(ExceptionResource resource)
   at System.Collections.Generic.Dictionary`2.Insert(TKey key, TValue value, Boolean add)
   at System.Collections.Generic.Dictionary`2.Add(TKey key, TValue value)
   at ReverseMarkdown.Converters.ByPass..ctor(Converter converter)
   at ReverseMarkdown.Converter.GetDefaultConverter(String tagName)
   at ReverseMarkdown.Converters.ConverterBase.TreatChildren(HtmlNode node)
   at ReverseMarkdown.Converters.Div.Convert(HtmlNode node)
   at ReverseMarkdown.Converters.ConverterBase.TreatChildren(HtmlNode node)
   at ReverseMarkdown.Converters.Div.Convert(HtmlNode node)
   at ReverseMarkdown.Converters.ConverterBase.TreatChildren(HtmlNode node)
   at ReverseMarkdown.Converters.Div.Convert(HtmlNode node)
   at ReverseMarkdown.Converters.ConverterBase.TreatChildren(HtmlNode node)
   at ReverseMarkdown.Converters.Div.Convert(HtmlNode node)
   at ReverseMarkdown.Converters.ConverterBase.TreatChildren(HtmlNode node)
   at ReverseMarkdown.Converters.Div.Convert(HtmlNode node)
   at ReverseMarkdown.Converters.ConverterBase.TreatChildren(HtmlNode node)
   at ReverseMarkdown.Converters.ByPass.Convert(HtmlNode node)
   at ReverseMarkdown.Converter.Convert(String html)
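
The trace shows the exception surfacing from Dictionary.Add inside the ByPass constructor, which suggests the same key is being added a second time. The actual cause in the library isn't visible here; as a general illustration only, the sketch below shows a registry (a hypothetical ConverterRegistrySketch) that tolerates repeated registration by assigning through the indexer instead of calling Add.

```csharp
using System;
using System.Collections.Generic;

public sealed class ConverterRegistrySketch
{
    private readonly Dictionary<string, Func<string, string>> _converters =
        new Dictionary<string, Func<string, string>>(StringComparer.OrdinalIgnoreCase);

    public int Count => _converters.Count;

    public void Register(string tagName, Func<string, string> convert)
    {
        // Indexer assignment overwrites an existing entry, whereas
        // Dictionary.Add throws ArgumentException on a duplicate key.
        _converters[tagName] = convert;
    }
}
```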

Table thead with td instead of th

This happens when a table is copied out of MS Word and converted to HTML.
The converter treats the first row within thead as the header row, which is good, but it also treats the first tbody row as a header row.

Given html:

<table><thead><tr><td>col1</td><td>col2</td></tr></thead><tbody><tr><td>data1</td><td>data2</td></tr><tbody></table>

Expected:

| col1 | col2 |
| --- | --- |
| data1 | data2 |

Actual:

| col1 | col2 |
| --- | --- |
| data1 | data2 |
| --- | --- |

Link and image text and href are not escaped correctly

In a Markdown link [text](href) or image ![text](href), the text should not contain [ or ] and the href should not contain ), ( or a space, because these would terminate the section.

Text is also allowed to have newlines, but not multiple adjacent newlines.

All of the above are allowed in HTML, so we need to escape them when converting to Markdown.
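
A minimal sketch of the escaping described above, assuming backslash escapes for brackets in the text and percent-encoding for spaces and parentheses in the href. LinkEscapeSketch and its methods are hypothetical names, and line breaks are normalized to \n for simplicity.

```csharp
using System.Text.RegularExpressions;

public static class LinkEscapeSketch
{
    // Escapes [ and ] and collapses runs of two or more line breaks into one.
    public static string EscapeText(string text)
    {
        string escaped = text.Replace("[", @"\[").Replace("]", @"\]");
        return Regex.Replace(escaped, @"(\r?\n){2,}", "\n");
    }

    // Percent-encodes the characters that would end the (href) section early.
    public static string EscapeHref(string href)
    {
        return href.Replace(" ", "%20").Replace("(", "%28").Replace(")", "%29");
    }

    public static string Image(string alt, string src)
    {
        return $"![{EscapeText(alt)}]({EscapeHref(src)})";
    }
}
```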

HTML:

<img alt="foo ]

bar" src="https://avatars3.githubusercontent.com/u/2031632?s=88&v=4&desc=Foo Bar&code=asd)asd" />

Current Markdown:

![foo ]

bar](https://avatars3.githubusercontent.com/u/2031632?s=88&v=4&desc=Foo Bar&code=asd)asd)

Expected Markdown:

![foo \]
bar](https://avatars3.githubusercontent.com/u/2031632?s=88&v=4&desc=Foo%20Bar&code=asd%29asd)

New bug for 3.0.0

I tested the following html. The result is not good.

<div>
  <table>
    <tbody>
      <tr>
        <td>aaa</td>
        <td>2</td>
        <td>3</td>
        <td>4</td>
        <td>5</td>
      </tr>
      <tr>
        <td>6</td>
        <td>7</td>
        <td>8</td>
        <td>9</td>
        <td>9</td>
      </tr>
      <tr>
        <td>1</td>
        <td>2</td>
        <td>3</td>
        <td>4</td>
        <td>5</td>
      </tr>
      <tr>
        <td>7</td>
        <td>7</td>
        <td>8</td>
        <td>9</td>
        <td>0</td>
      </tr>
    </tbody>
  </table>
</div>

Actual result:

|  |  |  |  |  |
| --- | --- | --- | --- | --- |
| aaa | 2 | 3 | 4 | 5 |
| 6 | 7 | 8 | 9 | 9 |
| 1 | 2 | 3 | 4 | 5 |
| 7 | 7 | 8 | 9 | 0 |

Expected result:

| aaa | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- |
| 6 | 7 | 8 | 9 | 9 |
| 1 | 2 | 3 | 4 | 5 |
| 7 | 7 | 8 | 9 | 0 |

Incorrect paragraph spacing in ordered lists

It looks like there is logic to deal with paragraph indenting within lists, but this logic at one point checks for "li".

Example HTML:

<ol>
  <li>Item1</li>
  <p>Item 1 additional info</p>
  <li>Item2</li>
</ol>

Current Markdown:

1. Item1

Item 1 additional info
2. Item2

I should note also that this spacing breaks the number continuity in some renderers.

Expected Markdown:

1. Item1
  Item 1 additional info
2. Item2

Html from Outlook 2016 not being converted to Markdown

I've got HTML created by Outlook 2016 with the body text "Test content", and I get wrong output from ReverseMarkdown. I've managed to strip the HTML down to a minimal repro:

<html>
<head>
<style><!-- some comment --></style><!-- another comment -->
</head>
<body>
<div>
<p>Test content</p>
</div>
</body>
</html>

  • If config RemoveComments = true and UnknownTags = Config.UnknownTagsOption.Drop, then output is empty string, but I would expect Test content
  • If config RemoveComments = true and UnknownTags = Config.UnknownTagsOption.Bypass, then output is empty string, but I would expect Test content
  • If config RemoveComments = false and UnknownTags = Config.UnknownTagsOption.Bypass, then output is missing another comment:
<!-- some comment -->
Test content

If I remove <!-- another comment -->, then my tests pass.

The HtmlAgilityPack DOM seems OK to me: DOM Head node and DOM P Node

I'm currently investigating the issue deeper.

Raw html from Outlook:

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
	{font-family:"Cambria Math";
	panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
	{font-family:Calibri;
	panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
	{margin:0cm;
	margin-bottom:.0001pt;
	font-size:11.0pt;
	font-family:"Calibri",sans-serif;
	mso-fareast-language:EN-US;}
a:link, span.MsoHyperlink
	{mso-style-priority:99;
	color:#0563C1;
	text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
	{mso-style-priority:99;
	color:#954F72;
	text-decoration:underline;}
span.EmailStyle17
	{mso-style-type:personal-compose;
	font-family:"Calibri",sans-serif;
	color:windowtext;}
.MsoChpDefault
	{mso-style-type:export-only;
	font-family:"Calibri",sans-serif;
	mso-fareast-language:EN-US;}
@page WordSection1
	{size:612.0pt 792.0pt;
	margin:72.0pt 90.0pt 72.0pt 90.0pt;}
div.WordSection1
	{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="LV" link="#0563C1" vlink="#954F72">
<div class="WordSection1">
<p class="MsoNormal">Test content<o:p></o:p></p>
</div>
</body>
</html>

Test cases used for debugging

        [Fact]
        public void WhenOutlook2016Html_WithRemoveCommentsAndDropUnknownTags_ThenConvertToMarkdown()
        {
            // note that the string also has a tab space
            string html = @"<html>
<head>
<style><!-- some comment --></style><!-- another comment -->
</head>
<body>
<div>
<p>Test content</p>
</div>
</body>
</html>";
            string expected = $"{Environment.NewLine}Test content{Environment.NewLine}";

            CheckConversion(html, expected, new Config() {
                RemoveComments = true,
                UnknownTags = Config.UnknownTagsOption.Drop,
            });
        }


        [Fact]
        public void WhenOutlook2016Html_WithRemoveCommentsAndBypassUnknownTags_ThenConvertToMarkdown() {
            // note that the string also has a tab space
            string html = @"<html>
<head>
<style><!-- some comment --></style><!-- another comment -->
</head>
<body>
<div>
<p>Test content</p>
</div>
</body>
</html>";
            string expected = $"{Environment.NewLine}Test content{Environment.NewLine}";

            CheckConversion(html, expected, new Config() {
                RemoveComments = true,
                UnknownTags = Config.UnknownTagsOption.Bypass,
            });
        }

        [Fact]
        public void WhenOutlook2016Html_WithDoNotRemoveCommentsAndBypassUnknownTags_ThenConvertToMarkdown() {
            // note that the string also has a tab space
            string html = @"<html>
<head>
<style><!-- some comment --></style><!-- another comment -->
</head>
<body>
<div>
<p>Test content</p>
</div>
</body>
</html>";
            string expected = $"<!-- some comment -->{Environment.NewLine}Test content{Environment.NewLine}";

            CheckConversion(html, expected, new Config() {
                RemoveComments = false,
                UnknownTags = Config.UnknownTagsOption.Bypass,
            });
        }

Images within links have escape characters added

Firstly, thank you for your work on this project. I've just looked at integrating it with something I'm working on and noticed an issue.

If an img tag is contained within an a tag, the resulting Markdown has escape characters added.

Code to reproduce:

var converter = new ReverseMarkdown.Converter();
var result = converter.Convert("<a href=\"https://www.example.com\"><img src=\"https://example.com/image.jpg\"/></a>");

Expected Result:

[![](https://example.com/image.jpg)](https://www.example.com)

Actual Result:

[!\[\](https://example.com/image.jpg)](https://www.example.com)

I'm not familiar with the code, but from a quick look it's possibly being escaped in the EscapeLinkText method here:

return useHrefWithHttpWhenNameHasNoScheme ? href : $"[{StringUtils.EscapeLinkText(name)}]({href}{title})";

wrong syntax when converting HTML tables

I try to convert the following HTML into Markdown.

<div>
  <table>
    <tr>
      <td>aaa</td>
      <td>2</td>
      <td>3</td>
      <td>4</td>
      <td>5</td>
    </tr>
    <tr>
      <td>6</td>
      <td>7</td>
      <td>8</td>
      <td>9</td>
      <td>9</td>
    </tr>
    <tr>
      <td>1</td>
      <td>2</td>
      <td>3</td>
      <td>4</td>
      <td>5</td>
    </tr>
    <tr>
      <td>7</td>
      <td>7</td>
      <td>8</td>
      <td>9</td>
      <td>0</td>
    </tr>
  </table>
</div>

Use the following code snippet:

bool githubFlavored = true;
bool removeComments = true;
var config = new ReverseMarkdown.Config(ReverseMarkdown.Config.UnknownTagsOption.PassThrough, githubFlavored: githubFlavored, removeComments : removeComments);
var converter = new ReverseMarkdown.Converter(config);
var markdown = converter.Convert(html);

The result is:

| aaa | 2 | 3 | 4 | 5 |
| 6 | 7 | 8 | 9 | 9 |
| 1 | 2 | 3 | 4 | 5 |
| 7 | 7 | 8 | 9 | 0 |

The expected result should be:

| aaa | 2 | 3 | 4 | 5 |
|-----|---|---|---|---|
| 6 | 7 | 8 | 9 | 9 |
| 1 | 2 | 3 | 4 | 5 |
| 7 | 7 | 8 | 9 | 0 |

Use Nuget MysticMind.HtmlAgilityPack 1.4.9.4 for release version

The original HtmlAgilityPack project on CodePlex did not have a .NET Standard version, so I created a fork with .NET Standard support, which has been maintained for some time now.

The original HtmlAgilityPack project now has a new maintainer, and the repo has been moved from CodePlex to GitHub: https://github.com/zzzprojects/html-agility-pack. The new maintainer contacted me and used the changes from my fork in the original project to release a beta (pre-release) version targeting .NET Standard: https://github.com/zzzprojects/html-agility-pack/releases/tag/v1.5.0-beta2.

I will use my forked version to publish a release version of this project until the original project publishes a release version on NuGet.

Preserving whitespace (newlines) between PRE tags.

I'm trying to parse the HTML of a simple page and convert it to Markdown. Is there a way to preserve whitespace within pre tags? At the moment, pre tags are converted to one big line, which is not at all ideal.

I tried to modify your code a bit by adding:

            var doc = new HtmlDocument();
            doc.OptionWriteEmptyNodes = true;
            doc.LoadHtml(html);

but it didn't help. Here's a snippet:

<p>Hi there</p>
<p>Sometimes we got simply calls to list the following data from ADDS - Active Directory Domain Users:</p>
<p>Total Users in ADDS forest / domain</p>
<p>Total Disabled Users in&nbsp;ADDS forest / domain</p>
<p>Total Enabled Users in&nbsp;ADDS forest / domain</p>
<p>In general you can use PowerShell remote to do so or you can connect direct using RDP (remote desktop services) into a specific DC (Domain Controller) to pull out this data request.</p>
<p>If your environment has <a href="https://docs.microsoft.com/en-us/powershell/scripting/learn/remoting/jea/overview?view=powershell-7">
JEA - Just Enough Administration </a>be aware of limitations on powershell commands against specified Servers' roles and features.</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<pre class="hidden">###############################################################################################################################################################################################################################################
# Author Thiago Beier [email protected]   
# Version: 1.0 - 2020-03-09  
# List sum of total users, total enabled accounts, total disabled accounts in ADDS (you can restrict the search to a specific OU - line 12 and 13
# Toronto, CANADA   
# Email: [email protected] 
# https://www.linkedin.com/in/tbeier/ 
# https://twitter.com/thiagobeier
# thiagobeier.wordpress.com
###############################################################################################################################################################################################################################################  

#Get-ADUser -Filter * -SearchBase &quot;OU=Field,OU=Users,OU=Toronto,DC=canada,DC=local&quot; |fl name | measure
$userstotal = Get-ADUser -Filter * -SearchBase &quot;DC=canada,DC=local&quot; |fl name | measure
$usersenabled = Get-ADUser -Filter {Enabled -eq $true} | fl name | measure
$usersdisabled = Get-ADUser -Filter {Enabled -eq $false} | fl name | measure
write-host -ForegroundColor Magenta &quot;########################## ADDS - TGAM - Users Report ###########################&quot;
Write-Host -ForegroundColor Yellow &quot;Total Users&quot; $userstotal.Count
Write-Host -ForegroundColor Green &quot;Enabled Users&quot; $usersenabled.Count
Write-Host -ForegroundColor Cyan &quot;Disabled Users&quot; $usersdisabled.Count
write-host -ForegroundColor Magenta &quot;#################################################################################&quot;</pre>

Here's full page:
PageProjectDescription.zip

Any way to get this working?

Unexpected markdown when converting HTML with <br/> inside <b>

I receive unexpected markdown when converting HTML with <br/> inside <b>.

Steps to reproduce

        [TestMethod]
        public void test()
        {
            string html = "test<b><br/>test</b>";

            Converter converter = new Converter();

            string markdown = converter.Convert(html);

            Assert.AreEqual("test**  \r\ntest**", markdown);
        }

Expected

I expected test<b><br/>test</b> to be converted to the following markdown:

test**  
test**

Actual

test<b><br/>test</b> is converted to test**test**.

I'm using v3.11.0.

Thank you.

Headings inside tables break Markdown tables

Example HTML:

<table>
    <tr><th><h2>Heading</h2></th><tr>
    <tr><td>Content</td><tr>
</table>

Current Markdown:

| 
## Heading

|
| --- |
| Content |

One possible expected Markdown:

| **Heading** |
| --- |
| Content |

@mysticmind This formatting bug is a bit more of a design decision, so I don't want to rush to sending a PR for a solution.

Converting to bold would be an easy fix, but obviously the resulting text isn't quite as emphasised as a heading.

Alternatively, sometimes headings have inline style: <h2 style="font-size: .... ;"> ... which we could convert to <span style="font-size: .... ;"> ... to preserve more of the original formatting. If this formatting doesn't exist we could make one up, but this is more of a special case and seems more risky.

I'm happy to send a PR for the bold fix if this is a direction you're happy with.

Regression in HTML decoding

v3.7.0 has a regression where HTML escaped characters aren't decoded.

Caused by a bug in my commit cab9985

html

<p>cat&#39;s</p>

Expected markdown

cat's

Current markdown

cat&#39;s

Although it still renders correctly, it's less readable in the Markdown

cat's
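
As a minimal sketch of the expected behaviour (not the library's actual code), the standard `System.Net.WebUtility.HtmlDecode` call turns character references such as `&#39;` back into literal characters. The class and method names below are hypothetical and only serve the illustration.

```csharp
using System;
using System.Net;

public static class HtmlDecodeSketch
{
    // Hypothetical helper: decodes HTML character references in text content.
    // This illustrates the expected result, not how ReverseMarkdown does it.
    public static string DecodeText(string content) => WebUtility.HtmlDecode(content);

    public static void Main()
    {
        // The paragraph content from the report above.
        Console.WriteLine(DecodeText("cat&#39;s")); // prints: cat's
    }
}
```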

Table cells and pre not converted correctly

Hi,
First: nice work!
I think this bug is related to carriage returns, which seem to be added or are already in the code.
So it may be an idea to check which kind of CR/LF you have in your HTML.
A solution might be to just delete all CRs first, or to have a switch to decide which kind of line ending you want, Unix or Windows. I would also like to see switches to disable the conversion of individual tags.
A nice place to test your resulting code might be http://demo.showdownjs.com/
I currently feed AutoCAD help files to your program (after running them through htmltidy), so the HTML is clean and valid. If you want a nice real-world test case, download an AutoCAD offline help file. It contains nearly 10,000 HTML pages you could test your code with. Testing only on self-written test cases is not always a good idea. You might have overlooked what users can do with it ;)
Being able to convert dl tags would also be nice. At the moment I just temporarily replace them with table tags.

Best regards and have a nice day :)

Thomas

Allow to choose attributes for parsing PRE tag

Here's my use case:

<pre class="vb">Const NO_VALUE = Empty

Set WshShell = WScript.CreateObject(&quot;WScript.Shell&quot;)
WshShell.RegWrite _
    &quot;HKLM\System\CurrentControlSet\Services\EventLog\Scripts\&quot;, NO_VALUE
</pre>

As you can see, the class attribute is used for syntax highlighting. Could you add support for reading the language from the class or, if that is not possible automatically, the ability to configure which attribute is used?

Some information on possible options: https://stackoverflow.com/questions/5134242/semantics-standards-and-using-the-lang-attribute-for-source-code-in-markup
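
One way the request could work, sketched under assumptions: read the first class name on the `<pre>` node with HtmlAgilityPack and use it as the language marker. `GetLanguage` and its `fallback` parameter are hypothetical names for illustration; this is not how the library is implemented.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class PreLanguageSketch
{
    // Hypothetical helper: returns the first class name on the node,
    // or the fallback when the node has no class attribute.
    public static string GetLanguage(HtmlNode pre, string fallback = "")
    {
        var classes = pre.GetAttributeValue("class", string.Empty)
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        return classes.FirstOrDefault() ?? fallback;
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<pre class=\"vb\">Const NO_VALUE = Empty</pre>");

        var pre = doc.DocumentNode.SelectSingleNode("//pre");
        Console.WriteLine(GetLanguage(pre)); // prints: vb
    }
}
```

A configurable attribute name could replace the hard-coded `"class"` if other markup conventions need to be supported.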

Inline Code should not be encoded

If inline code blocks in single ticks contain special HTML characters the inline code is not HTML decoded properly.

[Fact]
public void When_InlineCode_Shouldnt_Contain_Encoded_Chars()
{
    var html = @"This is inline code: <code>&lt;AspNetCoreHostingModel&gt;</code>.";
    var expected = @"This is inline code: `<AspNetCoreHostingModel>`.";

    var converter = new Converter();
    var result = converter.Convert(html);
    Assert.Equal(expected, result, StringComparer.OrdinalIgnoreCase);
}

This is inline code: <code>&lt;AspNetCoreHostingModel&gt;</code>.

renders as:

This is inline code: `&lt;AspNetCoreHostingModel&gt;`.

but should render as:

This is inline code: `<AspNetCoreHostingModel>`.

<p> inside <table> breaks table formatting

Example HTML:

<table><tr><th>col1</th></tr><tr><td><p>line1</p><p>line2</p></td></tr></table>

Current Markdown:

| col1 |
| --- |
| 
line1

line2
 |

Expected Markdown:

| col1 |
| --- |
| line1<br><br>line2<br> |

Edit: Prefer double newline between what were <p> tags. This renders a bit more like the HTML, and also allows reusing the existing paragraph conversion logic.

Nested Lists and CommonMark

I've used this wonderful converter to migrate my blog from Blogger to Ghost and I ran into a few small issues:

  1. nested lists don't work after migration. Ghost uses CommonMark which requires deeper indentation. I think this would be easy to fix, especially if the indentation could be made configurable right here: https://github.com/mysticmind/reversemarkdown-net/blob/master/src/ReverseMarkdown/Converters/Li.cs Or at least by making the method that gets the indentation protected so I can easily change the behaviour.
  2. code blocks. CommonMark uses the GitHub style code blocks, but the standard <pre> converter didn't do the trick for me, it stripped all the required newlines. I ended up writing a very simple custom converter that seems to work better: https://github.com/jessehouwing/Blogger2Ghost/blob/master/Blogger2Ghost/Commands/ConvertCommand.cs#L418-L430

Empty table rows cause conversion to throw

The following HTML causes Convert to throw.

<table><tr></tr></table>

While this HTML isn't particularly useful, we still should degrade gracefully and drop the table row or render an empty row.

UnknownTagsOption issue

There appears to be an issue with handling unknown tags. No matter which option is chosen, the unknown tag seems to be processed as PassThrough.

Example:
This text has a<sup>"sup"</sup> tag.

When it's processed, the tag remains regardless of whether I pass Drop or Bypass. Exceptions do not get thrown when Raise is chosen.

Am I missing something?
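
A small repro harness may help narrow this down. The sketch below only uses the `Config`, `Config.UnknownTagsOption` and `Converter` types shown in the README; the class name `UnknownTagsRepro` is made up, and no particular output is assumed for any option.

```csharp
using System;
using ReverseMarkdown;

public static class UnknownTagsRepro
{
    public static void Main()
    {
        const string html = "This text has a<sup>\"sup\"</sup> tag.";

        // Run the same input through every UnknownTagsOption value
        // and print whatever each one produces.
        foreach (Config.UnknownTagsOption option in Enum.GetValues(typeof(Config.UnknownTagsOption)))
        {
            var converter = new Converter(new Config { UnknownTags = option });
            try
            {
                Console.WriteLine($"{option}: {converter.Convert(html)}");
            }
            catch (Exception ex)
            {
                // Raise is expected to end up here if the tag is treated as unknown.
                Console.WriteLine($"{option}: threw {ex.GetType().Name}");
            }
        }
    }
}
```

If every option prints the same markup, it is worth checking whether the library treats the tag as unknown in the first place.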

Ability to return an object rather than Markdown directly

I have a scenario where I know very little HTML. I can read it, but I can't really use HtmlAgilityPack to prepare the document before processing it with ReverseMarkdown.

I was thinking about two things:

  • ability to provide a map of what I need from the HTML. I imagine there would be different cases here:
    • choose a tag
    • choose a tag but only if it has a class of X
    • ...
  • ability to get the output as an array of objects - this would allow me to "pick" the objects I want and skip the ones I don't.

Both scenarios would be useful for me. For example, I know the HTML has two pre tags, 7 p tags, some duplicated content with different classes, and so on. This would allow me to filter out some content before it reaches the Markdown output.

<br> tags inserted into markdown output where line breaks should be removed

The following two lines in Converters\Text.cs insert <br> tags into the markdown output wherever there are newlines in a contiguous block of inner text.

content = content.Replace("\r\n", "<br>");
content = content.Replace("\n", "<br>");

Example, the following HTML snippet:

<P>This service will be
temporarily unavailable due to planned maintenance
from 02:00-04:00 on 30/01/2020</P>

Will be converted to

This service will be<br>temporarily unavailable due to planned maintenance<br>from 02:00-04:00 on 30/01/2020

The correct output would be:

This service will be temporarily unavailable due to planned maintenance from 02:00-04:00 on 30/01/2020
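
A minimal sketch of the behaviour requested here, assuming that collapsing every run of whitespace into a single space is acceptable for ordinary text nodes. `CollapseWhitespace` is a hypothetical helper, not the code in Converters\Text.cs.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceSketch
{
    // Hypothetical helper: replaces each run of whitespace
    // (including \r\n and \n) with a single space.
    public static string CollapseWhitespace(string content) =>
        Regex.Replace(content, @"\s+", " ");

    public static void Main()
    {
        var text = "This service will be\r\n" +
                   "temporarily unavailable due to planned maintenance\n" +
                   "from 02:00-04:00 on 30/01/2020";

        // Prints the sentence on one line with single spaces.
        Console.WriteLine(CollapseWhitespace(text));
    }
}
```

Content inside pre tags would need to bypass such a step, since whitespace is significant there.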
