onizet / html2openxml

Html2OpenXml is a small .NET library that converts simple or advanced HTML to plain OpenXml components. The project started in 2009, initially to convert user comments from SharePoint to Word.

License: MIT License

C# 100.00%

docx openxml openxml-sdk dotnet-core

html2openxml's Introduction

What is Html2OpenXml?

Html2OpenXml is a small .NET library that converts simple or advanced HTML to plain OpenXml components. The project started in 2009, initially to convert user comments from SharePoint to Word.

This library supports .NET Framework 4.6.2, .NET Standard 2.0 and .NET 8, which are all LTS.

Depends on DocumentFormat.OpenXml.

Supported Html tags

Refer to w3schools’ tag list to see their meaning

<a>
<h1-h6>
<abbr> and <acronym>
<b>, <i>, <u>, <s>, <del>, <ins>, <em>, <strike>, <strong>
<br> and <hr>
<img>, <figcaption>
<table>, <td>, <tr>, <th>, <tbody>, <thead>, <tfoot> and <caption>
<cite>
<div>, <span>, <font> and <p>
<pre>
<sub> and <sup>
<ul>, <ol> and <li>
<dd> and <dt>
<q> and <blockquote> (since 1.5)
<article>, <aside>, <section> are considered like <div>

JavaScript (<script>), CSS (<style>), <meta> and other unsupported tags do not generate an error; they are ignored.

Tolerance for badly formed HTML

The parsing of the Html is done using a custom Regex-based enumerator. These are supported:

Tolerance	Samples
Ignore case	<span>Some text<SPAN>
Missing closing tag or invalid tag position	<i>Here<b> is </i> some</b> bad formed html.
No need to be XHTML compliant	Both <br> and <br/> are valid
Color	red, #ff0000, #f00, rgb(255,0,0,.5), hsl(0, 100%, 50%) are all the red color
Attributes	<table id=table1> or <table id="table1">

Acknowledgements

Thank you to all contributors that share their bug fixes: scwebgroup, ddforge, daviderapicavoli, worstenbrood, jodybullen, BenBurns, OleK, scarhand, imagremlin, antgraf, mdeclercq, pauldbentley, xjpmauricio, jairoXXX, giorand, bostjanKlemenc, AaronLS, taishmanov. And thanks to David Podhola for the Nuget package.

Logo provided with the permission of Enhanced Labs Design Studio.

Support

This project is open source and I do my best to support it in my spare time. I'm always happy to receive pull requests and grateful for the time you have taken. If you have questions, don't hesitate to get in touch with me!

html2openxml's Issues

Import html images into a docx created from template not working

Hi,
We recently updated from 2.0.1 to 2.0.2, and after the update the import of images from an HTML file into a docx no longer works when the document is created from a template. The document is created from a blank template.

        var templatePath = "Empty.dotx";
        var html = File.ReadAllText(path + "\\ImagePage.html");
        var doc = WordprocessingDocument.CreateFromTemplate(templatePath);
        var mainPart = doc.MainDocumentPart;
        HtmlConverter converter = new HtmlConverter(mainPart)
        {
            ImageProcessing = ImageProcessing.ManualProvisioning,
            BaseImageUrl = new Uri(path)
        };
        converter.ProvisionImage += OnProvisionImage;
        converter.ParseHtml(html);
        doc.SaveAs(@"D:\temp\doc.docx");
        doc.Close();

Thanks

Insert html to specific content control / document child / OpenXmlElement

Hello, I wanted to know whether it is possible to use this library to insert HTML into a specific content control.

bordercolor attribute not being parsed

Within a table tag, the bordercolor attribute is not parsed:

Inline css does not work either : (border: 1px solid black)

Will there be any support for this?

Can't open generated document with image

Good day,
I have some trouble and need assistance.
I have this test HTML markup:

<h2>Sample</h2>
<p><span class=""text-big"" style=""font-family:'Courier New', Courier, monospace; "">row </span><span class=""text-tiny"">with </span><span class=""text-big"" style=""color: hsl(60, 75 %, 60 %); "">different </span><span class=""text-big"">text </span><span class=""text-big"" style=""background-color:hsl(120, 75 %, 60 %); "">formats</span></p>
<p>This is an instance of the <a href=""https://ckeditor.com/docs/ckeditor5/latest/builds/guides/overview.html#classic-editor"">classic editor build</a>.</p>
<figure class=""image""><img src=""https://image.shutterstock.com/image-photo/beautiful-water-drop-on-dandelion-260nw-789676552.jpg"" alt=""Autumn fields""></figure>
<p>You can use this sample to validate whether your <a href=""https://ckeditor.com/docs/ckeditor5/latest/builds/guides/development/custom-builds.html"">custom build</a> works fine.</p>

I try to insert the converted paragraphs into my document:

var converter = new HtmlConverter(document.MainDocumentPart);
converter.ImageProcessing = ImageProcessing.AutomaticDownload;
var parsedData = converter.Parse(content);
foreach (var itm in parsedData)
	position.AppendChild(itm);

But when I try to open the document with Word (2016) I receive the error: unspecified error: part: /word/document.xml, Line: 0, Column: 0

I tried skipping the paragraph with the image and the document opens correctly. I also looked at document.xml inside the docx and there is no other difference between the document with the image and the one without it...

<p> 
	<r>
		<drawing>
			<inline distT="0" distB="0" distL="0" distR="0">
				<extent cx="3962399" cy="2667000" />
				<effectExtent l="19050" t="0" r="0" b="0" />
				<docPr id="5" name="https://image.shutterstock.com/image-photo/beautiful-water-drop-on-dandelion-260nw-789676552.jpg" descr="" />
				<cNvGraphicFramePr>
					<graphicFrameLocks noChangeAspect="1" />
				</cNvGraphicFramePr>
				<graphic>
					<graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
						<pic>
			        			<nvPicPr>
								<cNvPr id="2" name="https://image.shutterstock.com/image-photo/beautiful-water-drop-on-dandelion-260nw-789676552.jpg" descr="Autumn fields" />
								<cNvPicPr>
									<picLocks noChangeAspect="1" noChangeArrowheads="1" />
								</cNvPicPr>
							</nvPicPr>
				        		<blipFill>
								<blip r:embed="R44d8b6de1eae414a" />
								<srcRect />
								<stretch>
									<fillRect />
								</stretch>
							</blipFill>
							<spPr bwMode="auto">
								<xfrm>
									<off x="0" y="0" />
									<ext cx="3962399" cy="2667000" />
								</xfrm>
								<prstGeom prst="rect">
									<avLst />
								</prstGeom>
							</spPr>
						</pic>
					</graphicData>
				</graphic>
			</inline>
		</drawing>
	</r>
</p>

html2openxml version: 2.0.3
platform: .net core 2.1

save open xml tags in database

Is it possible to obtain XML similar to this?
<w:p> <w:r> <w:t>My paragraph</w:t> </w:r> </w:p>

I need to save that in the database

Wrong GoogleDisk table preview

Wrong GoogleDisk table preview.
Print Screen

NET Core 2.0

Do you plan to port this package to the new framework?
Thanks

Table Cell Shading Applies to Subsequent Cells

When HtmlConverter.ProcessClosingTableColumn calls TableStyleCollection.ApplyTags, it copies TableCellProperties from preceding cells into the one currently being processed. For the table below, this logic results in every cell having a grey (d5d5d5) background

<table style="width: 100%;">
<tbody>
<tr>
<th style="background-color: #d5d5d5;">Component</th>
<th style="background-color: #d5d5d5;">Area m²</th>
<th style="background-color: #d5d5d5; text-align: right;">%</th>
</tr>
<tr>
<td style="width: 40%; text-align: left;"> </td>
<td style="width: 30%; text-align: right;"> </td>
<td style="width: 30%; text-align: right;"> </td>
</tr>
<tr>
<td style="width: 40%; text-align: left;">Total</td>
<td style="width: 30%; text-align: right;"> </td>
<td style="width: 30%; text-align: right;">100.0</td>
</tr>
</tbody>
</table>

Donation?

How can I give you a small donation, put a link in readme? Really helpful library that saved me LOTS of time. Maybe use https://www.buymeacoffee.com/

How to insert image

Hello

I have some tables in an HTML page and want some images "Anchored" to special cells, overlay on page (not inside of cell and not changing cell size). Image data is in format of base64 included in image tag.
How can I write the HTML for that purpose?

Thanks

Table element not detected within a html string

If a table element is stripped of all its attributes and passed through the ParseHtml method, should the HtmlToOpenXml.TableContext property "tables" be populated with the table element and its contents, or is this property used for something other than detecting a table in an HTML string? When I pass a table element through the parser, it is not recognized as a table; instead it is treated as a paragraph within the HTML string.

Nested List Numbering Not Working

[copied from codeplex]

I am currently trying to convert an HTML with ordered lists that is like this:

<ol>
    <li>&nbsp;Prior to Visit:
        <ul>
            <li>Site Personal Protective Equipment Requirements</li>
            <li>Review Network Diagram</li>
            <li>Review Asset Inventory</li>
        </ul>
    </li>
    <li>&nbsp;On-Site
        <ul>
            <li>Physically Inspect Computer Systems
                <ol>
                    <li>Understand Network Connection</li>
                    <li>Removable Media Controls</li>
                </ol>
            </li>
            <li>Request the following information, per device
                <ol>
                    <li>Operating System or Firmware Version
                        <ul style="list-style-type: square;">
                            <li>Patching Cadence</li>
                        </ul>
                    </li>
                    <li>Primary ICS Applications &amp; Versions</li>
                    <li>Host Based Protection &amp; Version
                        <ul style="list-style-type: square;">
                            <li>Last Signature Update Date</li>
                        </ul>
                    </li>
                    <li>Services &amp; Open Ports</li>
                    <li>User Accounts
                        <ul style="list-style-type: square;">
                            <li>Last Log-on</li>
                            <li>Last Password Change</li>
                        </ul>
                    </li>
                    <li>Firewall Configuration</li>
                    <li>Log Management Configuration</li>
                </ol>
            </li>
            <li>Passive Wireless Scanning Site Perimeter</li>
            <li>Understand Any Additional Controls or Technology</li>
        </ul>
    </li>
</ol>

In the browser, the ordered list under "Request the following information, per device" is numbered correctly. However, when converted, all child items of the ordered list become plain bullets. I have removed the sub-child items that had the unique styling, and this had no effect. Has anyone else seen this? Is there a way to get ordered lists to use numbering when converted inside a nested list?

Thanks
James

Number List continues to increment when parsing multiple chunks of separate HTML.

Hi,
I have a web application that uses TinyMCE Editor where users can add Rich Text content which is saved as HTML. The web form has multiple Rich Text fields throughout the form. I'm generating a Word.docx from all content entered by the users from the web form. The document is laid out with different sections where the field content is injected. I'm using the Html2OpenXml library to inject the HTML into parts of the Word.docx file. The issue I'm running into is that the Number List continues to increment (1, 2, 3, 4, 5, 6, 7, 8, 9) even when I have created new sections in the document which contain a new Number List HTML chunk. I'm expecting output like (1, 2, 3; 1, 2, 3; 1, 2, 3), restarting for each HTML chunk.

My output in Word document comes out like this. I'm expecting the numbers to start over in each HTML chunk. Any help would be greatly appreciated.

HTML Chunk 1
1.) Number Parent Item 1
2.) Number Parent Item 2

HTML Chunk 2
3.) Number Parent Item 1
4.) Number Parent Item 2

Here is the HTML sample that I'm pasting in each TinyMCE Rich Text editor.

<div>
    <ol>
        <li>Number Parent Item 1</li>
        <li>Number Parent Item 2</li>
    </ol>
</div>

Here is the code I'm using to create the HTML parts. This method gets called for each section that I'm rendering in the Word document.

		private List<Paragraph> ConvertHtmlToOpenXML(string htmlText)
		{
			// Must return at least 1 paragraph
			if (string.IsNullOrEmpty(htmlText))
			{
				List<Paragraph> paragraphs = new List<Paragraph>();
				paragraphs.Add(new Paragraph());
				return paragraphs;
			}

			// Temporarily create new document for HTML conversion and then retrieve the generated paragraphs and 
			// append to original document.
			using (var tmpGeneratedDocument = new System.IO.MemoryStream())
			{
				var tmpPackage = WordprocessingDocument.Create(tmpGeneratedDocument, WordprocessingDocumentType.Document);

				var tmpMainDocumentPart1 = tmpPackage.MainDocumentPart;
				if (tmpMainDocumentPart1 == null)
				{
					tmpMainDocumentPart1 = tmpPackage.AddMainDocumentPart();
					new Document(new Body()).Save(tmpMainDocumentPart1);
				}

				var htmlConverter = new HtmlConverter(tmpMainDocumentPart1);

				// ParseHtml will automatically append to temp document
				htmlConverter.ParseHtml(htmlText);
				tmpMainDocumentPart1.Document.Save();

				tmpPackage.Close();
				tmpGeneratedDocument.Close();

				// Return parsed HTML paragraphs
				return tmpMainDocumentPart1.Document.Body.Descendants<Paragraph>().ToList();
			}
		}

Table column widths as a percentage

Given <td style="width: 100px;"> the OpenXml generated is <w:tcW w:w="1500" w:type="dxa"/> which works perfectly.

However, given <td style="width: 25%;"> the OpenXml generated is <w:tcW w:w="0" w:type="pct"/>, so the percentage is lost.
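
For reference, a minimal sketch of the unit arithmetic behind both cases. It assumes 96 dpi (1 px = 0.75 pt = 15 twentieths of a point) and the WordprocessingML convention that "pct" widths are stored in fiftieths of a percent. The class and method names are hypothetical and are not part of html2openxml.

```csharp
using System;

public static class TableWidthUnits
{
    // Assumption: 96 dpi, so 1 CSS pixel = 0.75 pt = 15 dxa (twentieths of a point).
    public static int PixelsToDxa(double pixels) => (int)Math.Round(pixels * 15);

    // WordprocessingML "pct" widths are expressed in fiftieths of a percent.
    public static int PercentToPct(double percent) => (int)Math.Round(percent * 50);

    public static void Main()
    {
        Console.WriteLine(PixelsToDxa(100));  // 1500, matching the dxa value above
        Console.WriteLine(PercentToPct(25));  // 1250
    }
}
```

Under these assumptions, a 25% column would be written as w:w="1250" w:type="pct" rather than w:w="0".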

Double character "t" in Html2OpenXml in the repository's introduction

Any way to avoid conversion errors?

[copied from codeplex]

I'm trying to convert slabs of HTML, some of which originated in Word and were copied into a large CMS.
With really simple, clean HTML, html2openxml works fine, but as soon as I try some of these more complex examples I get exceptions thrown.
At first it didn't like a style tag with "margin: 0 0 0 .00001pt" and failed with an Int32 conversion failure. I've fixed that, but now it's falling over on a style tag containing " border: currentColor;", saying it's not a valid integer value. I'm not fussed if I lose some of the formatting for these tags (because they shouldn't be there anyway), but currently the parser just crashes out, and correcting all the HTML isn't feasible either.

Is there any way I can tell html2openxml to ignore conversion exceptions in style tags?
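
One way such failures can be avoided in general is to parse style values with the TryParse pattern and skip anything that does not convert. The following is only a sketch of that idea using the two values quoted above. `TryParsePoints` is a hypothetical helper, not a function of html2openxml.

```csharp
using System;
using System.Globalization;

public static class LenientCss
{
    // Hypothetical helper: returns false instead of throwing when a value
    // such as "currentColor" cannot be read as a length in points.
    public static bool TryParsePoints(string value, out double points)
    {
        points = 0;
        if (string.IsNullOrWhiteSpace(value)) return false;

        string text = value.Trim();
        if (text.EndsWith("pt", StringComparison.OrdinalIgnoreCase))
            text = text.Substring(0, text.Length - 2);

        // Invariant culture so that ".00001" parses regardless of the machine locale.
        return double.TryParse(text, NumberStyles.Float, CultureInfo.InvariantCulture, out points);
    }

    public static void Main()
    {
        foreach (string raw in new[] { ".00001pt", "0", "currentColor" })
        {
            bool ok = TryParsePoints(raw, out double pt);
            Console.WriteLine(ok ? $"{raw} -> {pt}" : $"{raw} -> ignored");
        }
    }
}
```

A caller can then drop the offending declaration and keep converting the rest of the element.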

Table Background Color missing

[copied from codeplex]

I am using the SDK 2.5/.NET 4.0 set. Consider the following HTML table:

<tr class="TableHeaderCells">
     <th style="background-color:#FFFF99;" colspan="4"></th>
     <th style="background-color:#FFFF99;" colspan="7"></th>
     <th style="background-color:#FFFF99;" colspan="5">[Header Text]</th>
</tr>
<tr class="TableHeaderCells">
     <th style="background-color:#FFFF99;" colspan="4"></th>
     <th style="background-color:#FFFF99;" colspan="3">[More Header Text]</th>
     <th style="background-color:#FFFF99;" colspan="4">[Even More Header Text]</th>
     <th style="background-color:#FFFF99;" colspan="3">[Still Even More Header Text]</th>
     <th style="background-color:#FFFF99;" colspan="2">[So Much Header Text]</th>
</tr>

When I use HTMLtoOpenXML to convert this table, I lose the "background-color" style and I haven't been able to pinpoint why yet. I think it might be here in

TableStyleCollection.ProcessCommonAttributes:
var colorValue = en.StyleAttributes.GetAsColor("background-color");

// "background-color" is also handled by RunStyleCollection which duplicate this attribute (bug #13212). Let's ignore it
if (!colorValue.IsEmpty && en.CurrentTag.Equals("<td>", StringComparison.InvariantCultureIgnoreCase)) colorValue = System.Drawing.Color.Empty;

Has anybody else seen this happen to them? If so, how did you address the issue?

Outputs to a content control, instead of the whole document

Hello,

In order to create templated word documents, it would be nice if the library allows generating content in a named content control, instead of the full document.

I don't think it is possible yet, is it?

If so, this would be a great enhancement.

thanks,
steve

When converting an embedded image, if the src attribute has a newline character, the image is ignored

If an embedded image has \n in its src attribute, the image is not converted. I solved it in my application by removing \r\n and \n from the src attribute.
It would be nice if this were built in.
example.txt
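
A minimal sketch of the workaround described above, applied to the HTML before it is handed to the converter. The class name and the regular expression are assumptions for illustration; only quoted src values are handled.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ImgSrcCleaner
{
    // Matches src="..." or src='...', including values that span several lines.
    private static readonly Regex SrcAttribute = new Regex(
        "(?<prefix>\\bsrc\\s*=\\s*)(?<quote>[\"'])(?<value>.*?)\\k<quote>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Removes carriage returns and line feeds inside each src value only.
    public static string RemoveLineBreaksInSrc(string html)
    {
        return SrcAttribute.Replace(html, m =>
            m.Groups["prefix"].Value
            + m.Groups["quote"].Value
            + m.Groups["value"].Value.Replace("\r", "").Replace("\n", "")
            + m.Groups["quote"].Value);
    }

    public static void Main()
    {
        string html = "<img src=\"data:image/png;base64,AAAA\r\nBBBB\nCCCC\" alt=\"x\">";
        Console.WriteLine(RemoveLineBreaksInSrc(html));
    }
}
```

Line breaks elsewhere in the document are left untouched, so text inside <pre> is not affected.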

Wrong table creation

Hello,
I've got an issue with HTML tables and docx creation. I've attached a demo project that shows the behavior (I was unable to upload it to git since I got an error from this machine):

https://drive.google.com/open?id=1dTsMoPqBHzgrbbZbvLRp_KMnojAHNeik

The content inside content.html is fine (in the real case this data is taken from a DB): the HTML is correct and it displays properly.

After I add it to the document I got this result

instead of

What am I doing wrong?
Thanks

P.S. I'm using your latest version

Symbols in HTML cannot convert correctly

My original HTML contains symbols such as ◎▕ . These symbols become blank boxes in the docx after conversion.

Is there any way to process these symbols in the code or library?

Some formats are missing

Hello

Please find a link to an HTML file and the result of its conversion to docx. As you can see, most of the formatting, such as text formatting and borders, is missing. What can I do to make this formatting take effect?

Thanks

Relicense/ Dual-License

Thanks for writing this!
Would you consider relicensing/dual licensing to a MIT/BSD/Apache license? We would like to use it for a project at work (and I'd be glad to contribute any fixes back), but there was some concern about the MS-PL license.

Thanks again!

Support multiple page orientations

Support multiple sections with different page orientations:

<div style="page-orientation: landscape">
page 1 </div>

<div style="page-orientation: portrait">
page 2 </div>

Can't convert rgba color

Hi, I have exception when I call ParseHtml method.
The ConvertToForeColor method can't convert an rgba color, because the fourth component of rgba is an alpha with a value between 0 and 1 (a double), and this component can't be converted to Int32.

Could you make a change to solve my problem? Thank you
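
To illustrate the point, here is a sketch that reads the three channels as integers and the optional alpha as a double. It is not the library's ConvertToForeColor; `RgbaParser` and its regular expression are hypothetical.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class RgbaParser
{
    // rgb(r, g, b) or rgba(r, g, b, a); the alpha may be written as "0.5" or ".5".
    private static readonly Regex Rgba = new Regex(
        @"^rgba?\(\s*(\d{1,3})\s*,\s*(\d{1,3})\s*,\s*(\d{1,3})\s*(?:,\s*([0-9]*\.?[0-9]+)\s*)?\)$",
        RegexOptions.IgnoreCase);

    public static bool TryParse(string css, out (byte R, byte G, byte B, double A) color)
    {
        color = default;
        Match m = Rgba.Match((css ?? "").Trim());
        if (!m.Success) return false;

        int r = int.Parse(m.Groups[1].Value, CultureInfo.InvariantCulture);
        int g = int.Parse(m.Groups[2].Value, CultureInfo.InvariantCulture);
        int b = int.Parse(m.Groups[3].Value, CultureInfo.InvariantCulture);
        if (r > 255 || g > 255 || b > 255) return false;

        // The alpha is a fraction, so it is parsed as a double, never as an Int32.
        double a = m.Groups[4].Success
            ? double.Parse(m.Groups[4].Value, CultureInfo.InvariantCulture)
            : 1.0;

        color = ((byte)r, (byte)g, (byte)b, Math.Min(a, 1.0));
        return true;
    }

    public static void Main()
    {
        if (TryParse("rgba(255, 0, 0, 0.5)", out var c))
            Console.WriteLine($"{c.R},{c.G},{c.B} alpha={c.A.ToString(CultureInfo.InvariantCulture)}");
    }
}
```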

Is it possible to use this code to generate PowerPoint slides?

I need to process some html and generate a powerpoint slide. I'm fairly new to OpenXML. It seems like this code is focused on Documents, but is it possible to use it for Presentations?

<blockquote> prevents header styles to be applied

After a blockquote, subsequent headers do not get styled

<blockquote>
  "The blockquote"
</blockquote>
<h1>This doesn't get the Heading style applied</h1>

The XML generated has the "IntenseQuote" style property applied on the heading paragraphs.

<w:body>
  <w:p>
    <w:pPr>
      <w:ind w:left="708" />
    </w:pPr>
    <w:r>
      <w:t xml:space="preserve">"The blockquote"</w:t>
    </w:r>
  </w:p>
  <w:p>
    <w:pPr>
      <w:pStyle w:val="IntenseQuote" />
    </w:pPr>
    <w:r>
      <w:t xml:space="preserve">This doesn't get the Heading style applied</w:t>
    </w:r>
  </w:p>
</w:body>

Unordered list is converted into an ordered list

It is all in the title.

When my HTML contains an unordered list, this list is converted to an ordered list.

using (WordprocessingDocument package = WordprocessingDocument.Create(generatedDocument, WordprocessingDocumentType.Document))
{
    MainDocumentPart mainPart = package.MainDocumentPart;
    if (mainPart == null)
    {
        mainPart = package.AddMainDocumentPart();
        new Document(new Body()).Save(mainPart);
    }

    HtmlConverter converter = new HtmlConverter(mainPart);
    converter.ParseHtml(html);

    mainPart.Document.Save();
    docMainPart = mainPart;
}

Ordered list with paragraphs not converted correctly

Hi there,

I'm trying to convert an ordered list. Some items have paragraphs inside:

<h2>Test</h2>
<ol>
    <li>Number 1</li>
    <li>Number 2</li>
    <li>
        <p>Number 3</p>
    </li>
    <li>Number 4</li>
</ol>

For these items, no valid list item is generated and following items are put in a new list:

Best regards

"Strong name signature could not be verified" for version 2.0.2

I'm getting the following error with version 2.0.2:

[FileLoadException: Could not load file or assembly 'HtmlToOpenXml, Version=2.0.0.0, Culture=neutral, PublicKeyToken=6ad79d83e2b60e63' or one of its dependencies. Strong name signature could not be verified. The assembly may have been tampered with, or it was delay signed but not fully signed with the correct private key. (Exception from HRESULT: 0x80131045)]

I used NuGet to get the package and the build target is .Net Framework 4.7. Version 2.0.1 works without issue.

Manual image provisioning stopped working

Hi, just thought I'd report my finding...
I've just tried v2.0 coming from v1.5. Everything works the same for my purposes except for manual image provisioning. In v1.5 I can set the image data in my ProvisionImage event handler as follows:
e.Data = File.ReadAllBytes(filepath);
In v2.0 the syntax has changed, so it now looks like this:
e.Provision(File.ReadAllBytes(filepath));
The old version works but the new version does not with the same HTML source and image URL and data. If I disable the manual provisioning then the automatic HTTP retrieval works fine in either version.
This isn't a big problem for me - I could just use the automatic HTTP download or stick with v1.5.

Double quotes convert to single quotes

outdated nuget

Hi Olivier,

Do you plan to release an updated nuget package? The latest package (from mid 2016) references documentformat.openxml 2.5, while in my project I need to use 2.7.2. I see that you have updated this reference in the source but haven't released it as a package. And, what would be even better, perhaps you can remove the dependency on a specific version of documentformat.openxml?

Thanks for the great stuff,

Regards,
Dmitri

Correct spacing between block tags and flow tags

[Copied from codeplex]

In the constructor of the HtmlEnumerator class, this line:

html = Regex.Replace(html, @"(\s*)(</?(p |div|br|body)[^>]*/?>)(\s*)", "$2", RegexOptions.Multiline| RegexOptions.IgnoreCase);

must be changed to:

html = Regex.Replace(html, @"(\s*)(</?(\bp\b|\bdiv\b|\bbr\b|\bbody\b)[^>]*/?>)(\s*)", "$2", RegexOptions.Multiline | RegexOptions.IgnoreCase);

This way the regex matches whole words rather than character sequences
(example: the word 'p' and not the word 'pre')
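
The effect of the word boundaries can be checked in isolation. This sketch assumes a variant of the original pattern in which the first alternative is a bare p (no trailing space), so the exact pattern used by the library at the time may differ.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BlockTagWhitespace
{
    // Assumed variant without word boundaries: "p" also matches the start of "pre".
    public const string Loose = @"(\s*)(</?(p|div|br|body)[^>]*/?>)(\s*)";

    // Pattern from the proposed fix: \b stops "p" from matching inside "pre".
    public const string Strict = @"(\s*)(</?(\bp\b|\bdiv\b|\bbr\b|\bbody\b)[^>]*/?>)(\s*)";

    // Strips the whitespace around every tag the pattern recognises.
    public static string Collapse(string html, string pattern) =>
        Regex.Replace(html, pattern, "$2", RegexOptions.Multiline | RegexOptions.IgnoreCase);

    public static void Main()
    {
        string pre = "<pre> \nHello\n  world!!!  </pre>";
        Console.WriteLine(Collapse(pre, Loose) == pre);   // False: whitespace inside <pre> is stripped
        Console.WriteLine(Collapse(pre, Strict) == pre);  // True: <pre> is left alone
    }
}
```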

Another problem: the spaces after a flow tag (example: <b> or <i>) are deleted.
To retain these spaces, you can modify this line of code in the MoveUntilMatch function of the HtmlEnumerator class:

while ((success = en.MoveNext()) && (current = en.Current.Trim('\n', '\r')).Length == 0) ;

modified to:

while ((success = en.MoveNext()) && (current = en.Current.Trim('\r')).Length == 0) ;

This is an HTML example to parse:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta content="text/html; charset=ISO-8859-1"
 http-equiv="content-type" />
  <title></title>
</head>
<body>
<p>
Hello <b>beautiful</b>
world!!!</p>
<p>
Hello <b>beautiful</b>
<i>world!!!</i></p>
  <p>Lorem Ipsum</p>
 <pre> 
Hello
  world!!!  </pre>
</body>
</html>

This is the behaviour now:

This is the new behaviour with the modified code:

onizet wrote Dec 17, 2015 at 2:47 PM

I did test your first fix, but I don't have much time for extensive testing.
I'm glad you came back with this troubleshooting, because I found the same bug but didn't make the link with the regex changes.
So thanks, you made my day :-)

onizet wrote Dec 17, 2015 at 3:50 PM

If I'm not mistaken, I only need \bp\b, because the other tags are very different from the other HTML tags.
So I can keep only:
html = Regex.Replace(html, @"(\s*)(</?(\bp\b|div|br|body)[^>]*/?>)(\s*)", "$2", RegexOptions.Multiline| RegexOptions.IgnoreCase);

giorand wrote Dec 17, 2015 at 4:32 PM

You're right.
Keeping the other \b would only be for the sake of consistent logic.

onizet wrote Dec 17, 2015 at 5:29 PM

about your statement:
Another problem: the spaces after a flow tag (example: <b> or <i>) are deleted
If you paste your HTML in a browser, you will see they will be deleted.

Associated with changeset 90889: This is a major commit about RowSpan bug (#13058, #12781, #13689). Also, include the fix from giorand about spaces.

giorand wrote Dec 18, 2015 at 7:46 AM

You're right, in the browser there is a space between the words 'beautiful' and 'world'.
But if you parse with the current dll, the result in Word 2013 is 'beautifulworld' without a space (as you can see in the first image).

onizet wrote Jan 12, 2016 at 9:25 PM

Just to notify you that I'm still working on this issue, which I consider major.

Table with set width and rowspan not respecting width

Hi!

I just ran across an issue where if you have something like this:

<table border='1' style='width:100%;line-height: 100%'>
    <thead>
        <tr>
            <th colspan='2' style='background-color: #0071ce;text-align: left;height:15px;'>
                <font style='color: white;font-weight: bold;' face='Arial'>
                    LANGUAGES SKILLS
                </font>
                <br>
                <font style='color: white;font-weight: bold;' face='Arial'>
                    Alot of text Alot of text Alot of text Alot of text Alot of text Alot of text Alot of text Alot of
                    text Alot of text Alot of text
                    Alot of text Alot of text Alot of text Alot of text Alot of text Alot of text Alot of text Alot of
                    text Alot of text Alot of text Alot of text Alot of text Alot of text
                </font>
            </th>
        </tr>
    </thead>
    <tbody>
        <tr rowspan='2' style='background-color: white;height:15px;'>
            <td style='width:150px;'>
                <font style='font-weight: bold;' face='Calibri'>Text</font>
            </td>
            <td>
                <font face='Calibri'>
                    Text
                </font>
            </td>
        </tr>
    </tbody>
</table>

The cells with rowspan don't apply the width:150px; the text of the header pushes the cell's width.

I was able to replicate this in Word using a generated document, BUT if you create a table in Word manually with the same structure this doesn't happen.

Any insights on this?

Thank you
Luís Almeida

<strong>, <i>, <u> with custom style

<strong>,<i>, <u> tags can also have 'style' attribute with custom colors and other formatting. At this moment your library ignores custom styles in 'strong' and other tags.
Example html:

  <p>
    <span style="color:rgb(255, 0, 0)">test span</span>
    <strong style="color:rgb(255, 0, 0)">test strong</strong>
  </p>

After converting to Wml the text inside 'span' tag will have red color as specified in html but text inside 'strong' tag won't. I took the example html from real world. It was user input from the SharePoint rich text field.

Table with multiple colspans and rowspans

Given the following HTML table:

<table border="1">
  <thead>
    <tr>
      <th colspan="2" rowspan="2">Header 1</th>
      <th colspan="2">Header 2</th>
      <th colspan="2" rowspan="2">Header 3</th>
    </tr>
    <tr>
      <th>Sub-header 2.1</th>
      <th>Sub-header 2.2</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td rowspan="2">Data 1.1</td>
      <td rowspan="2">Data 1.2</td>
      <td rowspan="2">Data 2.1</td>
      <td rowspan="2">Data 2.2</td>
      <td>Data 3.1.1</td>
      <td>Data 3.1.2</td>
    </tr>
    <tr>
      <td>Data 3.2.1</td>
      <td>Data 3.2.2</td>
    </tr>
  </tbody>
</table>

The HTML output renders as:

However the Word output is:

I have highlighted in red an empty cell which is added in the wrong location.

Number List with nested Bullet List rendering nested elements with Numbers instead of Bullets

Hi,
I have a web application that uses TinyMCE Editor where users can add Rich Text content which is saved as HTML. I'm generating a Word.docx from all content entered by the users from the web form. Only some of the content is HTML/Rich Text. I'm using the Html2OpenXml library to inject the HTML into parts of the Word.docx file. The issue I'm running into is that a basic ordered Number List with a nested Bulleted List outputs only incremented numbers, not nested bullets. I've tested with this basic HTML.

The output comes out like this. FYI, the sample I'm showing comes out nested with correct spacing; it just doesn't show bullets.

1.) Number Parent Item 1
1.) Bullet Sub Item 1-1
2.) Bullet Sub Item 1-2
2.) Number Parent Item 2
1.) Bullet Sub Item 2-1
2.) Bullet Sub Item 2-2

<div>
    <ol>
        <li>Number Parent Item 1
            <ul>
                <li>Bullet Sub Item 1-1</li>
                <li>Bullet Sub Item 1-2</li>
            </ul>
        </li>
        <li>Number Parent Item 2
            <ul>
                <li>Bullet Sub Item 2-1</li>
                <li>Bullet Sub Item 2-2</li>
            </ul>
        </li>
    </ol>
</div>

Here is the code I'm using.

		private List<Paragraph> ConvertHtmlToOpenXML(string htmlText)
		{
			// Must return at least 1 paragraph
			if (string.IsNullOrEmpty(htmlText))
			{
				List<Paragraph> paragraphs = new List<Paragraph>();
				paragraphs.Add(new Paragraph());
				return paragraphs;
			}

			// Temporarily create new document for HTML conversion and then retrieve the generated paragraphs and 
			// append to original document.
			using (var tmpGeneratedDocument = new System.IO.MemoryStream())
			{
				var tmpPackage = WordprocessingDocument.Create(tmpGeneratedDocument, WordprocessingDocumentType.Document);

				var tmpMainDocumentPart1 = tmpPackage.MainDocumentPart;
				if (tmpMainDocumentPart1 == null)
				{
					tmpMainDocumentPart1 = tmpPackage.AddMainDocumentPart();
					new Document(new Body()).Save(tmpMainDocumentPart1);
				}

				var htmlConverter = new HtmlConverter(tmpMainDocumentPart1);

				// ParseHtml will automatically append to temp document
				htmlConverter.ParseHtml(htmlText);
				tmpMainDocumentPart1.Document.Save();

				tmpPackage.Close();
				tmpGeneratedDocument.Close();

				// Return parsed HTML paragraphs
				return tmpMainDocumentPart1.Document.Body.Descendants<Paragraph>().ToList();
			}
		}

Emf image not showing

Other types of image are converted as expected.

Auto Spacing Before Paragraphs

It looks like the HTML parser fails to identify the property margin-top:auto on an element, for example.

Other values are parsed correctly, but auto results in a 0 in the docx document.

Creating too many NumberingInstances for unordered lists

I'm composing a large OpenXML document from many HTML fragments that contain lots of unordered lists.
It seems html2openxml appends a new NumberingInstance for each opening <ul> tag, which may be correct but is more than really necessary.
I'm facing an issue with Microsoft Word when trying to open the document: it cannot cope with the thousands of NumberingInstances and shows an "out of memory" error.
I already tried to tweak HtmlToOpenXml.NumberingListStyleCollection.BeginList() to create only one NumberingInstance per nesting level, but ended up messing up the numbered lists.

Nofixfor_NumberingListStyleCollectionl.txt
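
One possible direction, sketched here outside the library, is to share a single numbering instance across all bulleted lists (bullets carry no counter, so nothing needs to restart) while still allocating a fresh instance for every ordered list so that its numbering restarts. The class and member names below are hypothetical and are not part of html2openxml; how the ids are wired into NumberingListStyleCollection is left open.

```csharp
using System;

// Hypothetical helper, not part of html2openxml: hands out numbering instance ids.
public sealed class NumberingIdAllocator
{
    private int nextId = 1;
    private int? sharedBulletId;

    // Ordered lists always get a new id so their counters restart at 1.
    // Unordered lists reuse one shared id, created on first use.
    public int Allocate(bool ordered)
    {
        if (ordered) return nextId++;
        if (sharedBulletId == null) sharedBulletId = nextId++;
        return sharedBulletId.Value;
    }

    // Number of distinct instances handed out so far.
    public int InstanceCount => nextId - 1;
}
```

With this scheme a document containing thousands of <ul> elements and a handful of <ol> elements would need only a handful of instances plus one. Whether the nesting level can be expressed purely through the paragraph's level index in the library's current design is an assumption that would need checking.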

<blockquote> Style lookup not working for "IntenseQuote"

In the default template for the latest version of Word (O365) it seems that the name for the intense quote is now "Intense Quote" with a space. The current functionality of ProcessBlockQuote should be updated to check for both "IntenseQuote" and "Intense Quote" and then apply when one is found. Otherwise, could this be something configurable where the block quote style is optionally passed into the conversion process?
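
The lookup suggested above could be sketched as follows. This is only an illustration of the idea, assuming the set of style names available in the target document has already been collected; QuoteStyleLookup and Resolve are hypothetical names and not part of the library.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of html2openxml.
public static class QuoteStyleLookup
{
    // Candidate names taken from the report above; the order is an assumption.
    private static readonly string[] Candidates = { "IntenseQuote", "Intense Quote" };

    // Returns the caller-supplied override if the document contains it,
    // otherwise the first candidate found, otherwise null.
    public static string Resolve(IEnumerable<string> documentStyleNames, string overrideName = null)
    {
        var known = new HashSet<string>(documentStyleNames, StringComparer.OrdinalIgnoreCase);
        if (!string.IsNullOrEmpty(overrideName))
            return known.Contains(overrideName) ? overrideName : null;

        foreach (var candidate in Candidates)
        {
            if (known.Contains(candidate)) return candidate;
        }
        return null;
    }
}
```

Passing an override covers the "configurable" variant of the request; leaving it out covers the "check both names" variant.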

Wrong table cell width creation

My test HTML sample with tables converts incorrectly.

<html><head><meta charset="UTF-8"/></head><body><p style="margin-top:20px;font-family:'Times New Roman', serif;font-size:14px;line-height:150%;margin-bottom:0"></p><table style="cell-spacing:0;cell-padding:0;border-collapse:collapse;width:100%;mso-table-layout-alt:fixed"><tbody columnSizes="310,310"><tr><td style="width:310px;height:40px;vertical-align:top;border-width:1px;border-style:solid;border-color:black"><p style="margin-top:10px;font-family:'Times New Roman', serif;font-size:14px;line-height:150%;margin-bottom:0"></p></td><td style="width:310px;height:40px;vertical-align:top;border-width:1px;border-style:solid;border-color:black"><p style="margin-top:10px;font-family:'Times New Roman', serif;font-size:14px;line-height:150%;margin-bottom:0"></p></td></tr><tr><td style="width:310px;height:40px;vertical-align:top;border-width:1px;border-style:solid;border-color:black"><p style="margin-top:10px;font-family:'Times New Roman', serif;font-size:14px;line-height:150%;margin-bottom:0"></p></td><td style="width:310px;height:40px;vertical-align:top;border-width:1px;border-style:solid;border-color:black"><p style="margin-top:10px;font-family:'Times New Roman', serif;font-size:14px;line-height:150%;margin-bottom:0"></p></td></tr></tbody></table><p style="margin-top:20px;font-family:'Times New Roman', serif;font-size:14px;line-height:150%;margin-bottom:0"></p><p style="margin-top:20px;font-family:'Times New Roman', serif;font-size:14px;line-height:150%;margin-bottom:0"></p><table style="cell-spacing:0;cell-padding:0;border-collapse:collapse;width:100%;mso-table-layout-alt:fixed"><tbody columnSizes="310,310"><tr><td style="width:310px;height:40px;vertical-align:top;border-width:1px;border-style:solid;border-color:black"><p style="margin-top:10px;font-family:'Times New Roman', serif;font-size:14px;line-height:150%;margin-bottom:0"></p></td><td style="width:310px;height:40px;vertical-align:top;border-width:1px;border-style:solid;border-color:black"><p 
style="margin-top:10px;font-family:'Times New Roman', serif;font-size:14px;line-height:150%;margin-bottom:0"></p></td></tr><tr><td style="width:310px;height:40px;vertical-align:top;border-width:1px;border-style:solid;border-color:black"><p style="margin-top:10px;font-family:'Times New Roman', serif;font-size:14px;line-height:150%;margin-bottom:0">иыавпывап</p></td><td style="width:310px;height:40px;vertical-align:top;border-width:1px;border-style:solid;border-color:black"><p style="margin-top:10px;font-family:'Times New Roman', serif;font-size:14px;line-height:150%;margin-bottom:0">ывапывапывап</p></td></tr></tbody></table><p style="margin-top:20px;font-family:'Times New Roman', serif;font-size:14px;line-height:150%;margin-bottom:0"></p><p style="margin-top:20px;font-family:'Times New Roman', serif;font-size:14px;line-height:150%;margin-bottom:0">апапвапр</p><table style="cell-spacing:0;cell-padding:0;border-collapse:collapse;width:100%;mso-table-layout-alt:fixed"><tbody columnSizes="310,310"><tr><td style="width:310px;height:40px;vertical-align:top;border-width:1px;border-style:solid;border-color:black"><p style="margin-top:10px;font-family:'Times New Roman', serif;font-size:14px;line-height:150%;margin-bottom:0">ввапрварп</p></td><td style="width:310px;height:40px;vertical-align:top;border-width:1px;border-style:solid;border-color:black"><p style="margin-top:10px;font-family:'Times New Roman', serif;font-size:14px;line-height:150%;margin-bottom:0"><span xml:space="preserve"> </span></p></td></tr><tr><td style="width:310px;height:40px;vertical-align:top;border-width:1px;border-style:solid;border-color:black"><p style="margin-top:10px;font-family:'Times New Roman', serif;font-size:14px;line-height:150%;margin-bottom:0">вапвапрвапукецукееееее цукецукецукецуке</p></td><td style="width:310px;height:40px;vertical-align:top;border-width:1px;border-style:solid;border-color:black"><p style="margin-top:10px;font-family:'Times New Roman', 
serif;font-size:14px;line-height:150%;margin-bottom:0"><span xml:space="preserve"> </span></p></td></tr></tbody></table><p style="margin-top:20px;font-family:'Times New Roman', serif;font-size:14px;line-height:150%;margin-bottom:0"></p><p style="margin-top:20px;font-family:'Times New Roman', serif;font-size:14px;line-height:150%;margin-bottom:0"></p><table style="cell-spacing:0;cell-padding:0;border-collapse:collapse;width:100%;mso-table-layout-alt:fixed"><tbody columnSizes="310,310"><tr><td style="width:310px;height:40px;vertical-align:top;border-width:1px;border-style:solid;border-color:black"><p style="margin-top:10px;font-family:'Times New Roman', serif;font-size:14px;line-height:150%;margin-bottom:0"></p></td><td style="width:310px;height:40px;vertical-align:top;border-width:1px;border-style:solid;border-color:black"><p style="margin-top:10px;font-family:'Times New Roman', serif;font-size:14px;line-height:150%;margin-bottom:0"></p></td></tr><tr><td style="width:310px;height:40px;vertical-align:top;border-width:1px;border-style:solid;border-color:black"><p style="margin-top:10px;font-family:'Times New Roman', serif;font-size:14px;line-height:150%;margin-bottom:0"></p></td><td style="width:310px;height:40px;vertical-align:top;border-width:1px;border-style:solid;border-color:black"><p style="margin-top:10px;font-family:'Times New Roman', serif;font-size:14px;line-height:150%;margin-bottom:0"></p></td></tr></tbody></table><p style="margin-top:20px;font-family:'Times New Roman', serif;font-size:14px;line-height:150%;margin-bottom:0"></p></body></html>

Error with class attribute handling

I'm trying to convert HTML with some formatting (applied through class attributes) to word content.

The html content is:

<div>
    <strong>Bold<span class="red-characters">red bold</span></strong>
    <span class="red-characters">red</span><span>No formatting</span>
</div>

The html result is:

However, after conversion, the result is:

In order to have the red formatting available in Word, I created a new character style named red-characters, and changed its color.

A fully reproducible sample is attached: TestHtmlToWordMl.zip

How should this scenario be handled properly?

Thanks

Trim trailing <br />

Could there be an option in the converter to trim trailing <br />s inside a paragraph, to skip unnecessary line breaks?
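
Until such an option exists, one workaround is to strip the trailing breaks from the HTML string before handing it to the converter. The sketch below is a pre-processing step outside the library; TrailingBreakTrimmer is a hypothetical name, and the pattern only handles breaks that sit directly before a closing </p> tag.

```csharp
using System.Text.RegularExpressions;

// Hypothetical pre-processing helper, not part of html2openxml.
public static class TrailingBreakTrimmer
{
    // One or more <br>, <br/> or <br /> tags (any case, optional whitespace
    // around them) immediately followed by a closing </p> tag.
    private static readonly Regex TrailingBreaks = new Regex(
        @"(?:\s*<br\s*/?>)+\s*(?=</p\s*>)",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    public static string Trim(string html)
    {
        return string.IsNullOrEmpty(html) ? html : TrailingBreaks.Replace(html, string.Empty);
    }
}
```

Breaks in the middle of a paragraph are left alone because the lookahead requires the closing tag to follow. Breaks wrapped in other inline elements (for example inside a trailing <span>) are not covered by this pattern.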

property line-height

Hi everyone.
I've tried to use the line-height property, but it doesn't work.
Can this property be used?

Thanks, all!

UPDATE: I did this to solve my problem.

I added this in Converter.cs:

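public static SpacingBetweenLines ToSpacingBetweenLines(string html)
{
    if (html == null) return null;

    // Only emit a spacing element when the CSS value parses to a positive length.
    Unit unit = Unit.Parse(html);
    if (!unit.IsValid || unit.Value <= 0) return null;

    SpacingBetweenLines spacingBetweenLines = new SpacingBetweenLines();
    spacingBetweenLines.LineRule = LineSpacingRuleValues.Exact;
    // With an exact line rule, Line is expressed in twentieths of a point
    // (this assumes unit.Value is in points).
    spacingBetweenLines.Line = Math.Round(unit.Value * 20.0).ToString(System.Globalization.CultureInfo.InvariantCulture);

    //spacingBetweenLines.Before = "0";
    //spacingBetweenLines.After = "0";

    return spacingBetweenLines;
}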

Edit 2: ParagraphStyleCollection.cs

attrValue = en.StyleAttributes["line-height"];
if (attrValue != null && en.CurrentTag != "")
{
    var spacing = Converter.ToSpacingBetweenLines(attrValue);
    if (spacing != null)
    {
        containerStyleAttributes.Add(spacing);
    }
}
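
The workaround above treats every value as an exact length. CSS line-height can also be a percentage or a unitless multiplier, and WordprocessingML expresses those with the "auto" line rule, where the line value is in 240ths of a line (240 means single spacing); with the "exact" rule the value is in twentieths of a point. The arithmetic can be sketched on its own, without the library's Unit type. LineHeightConverter is a hypothetical name, and the px conversion assumes 96 dpi (1px = 0.75pt).

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of html2openxml.
public static class LineHeightConverter
{
    // On success, 'line' is the w:spacing line value and 'proportional' tells
    // whether it is meant for lineRule="auto" (240 = single spacing) or for an
    // absolute rule (twentieths of a point).
    public static bool TryConvert(string css, out int line, out bool proportional)
    {
        line = 0;
        proportional = false;
        if (string.IsNullOrWhiteSpace(css)) return false;
        css = css.Trim().ToLowerInvariant();

        string number = css;
        double factor;   // multiplier that turns the parsed number into the target unit
        bool isProportional = false;
        if (css.EndsWith("%", StringComparison.Ordinal))
        {
            number = css.Substring(0, css.Length - 1); factor = 2.4; isProportional = true;
        }
        else if (css.EndsWith("pt", StringComparison.Ordinal))
        {
            number = css.Substring(0, css.Length - 2); factor = 20;
        }
        else if (css.EndsWith("px", StringComparison.Ordinal))
        {
            number = css.Substring(0, css.Length - 2); factor = 15;   // 0.75pt * 20
        }
        else
        {
            factor = 240; isProportional = true;   // unitless multiplier such as "1.5"
        }

        if (!double.TryParse(number, NumberStyles.Float, CultureInfo.InvariantCulture, out double value) || value <= 0)
            return false;

        line = (int)Math.Round(value * factor);
        proportional = isProportional;
        return true;
    }
}
```

Keywords such as "normal" and other units (em, cm, and so on) are rejected by this sketch rather than guessed at.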

Style table using HtmlConverter in c#

A predefined string containing the style <table border="1" is stored in the database. Can I use the HtmlConverter class to modify the style within the table tag?

Base64 Image Processing Problem

It doesn't seem to work for me. I was thinking of troubleshooting with manual provisioning, using the sample code below found at https://github.com/onizet/html2openxml/wiki/ImageProcessing

private void converter_ProvisionImage(object sender, ProvisionImageEventArgs e)
{
    // Read the image from the file system:
    e.Data = File.ReadAllBytes(@"c:\inetpub\wwwroot\mysite\images\" + e.ImageUrl);
}

but e.Data is undefined. Did something change? I can't see any Data property in the ProvisionImageEventArgs class.
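
As a first troubleshooting step that does not depend on the library at all, it may help to confirm that the base64 image source itself decodes cleanly. The sketch below splits a data: URI and decodes its payload; DataUri and TryDecodeBase64 are hypothetical names, and the parsing is deliberately simple (any extra parameters before ";base64," end up in the reported MIME type).

```csharp
using System;

// Hypothetical diagnostic helper, not part of html2openxml.
public static class DataUri
{
    // Decodes "data:<mime>;base64,<payload>" into its MIME type and raw bytes.
    // Returns false if the string is not a base64 data URI or the payload is malformed.
    public static bool TryDecodeBase64(string uri, out string mimeType, out byte[] data)
    {
        mimeType = null;
        data = null;
        const string scheme = "data:";
        const string marker = ";base64,";
        if (uri == null || !uri.StartsWith(scheme, StringComparison.OrdinalIgnoreCase)) return false;

        int markerIndex = uri.IndexOf(marker, StringComparison.OrdinalIgnoreCase);
        if (markerIndex < 0) return false;

        try
        {
            data = Convert.FromBase64String(uri.Substring(markerIndex + marker.Length));
        }
        catch (FormatException)
        {
            return false;   // payload is not valid base64
        }

        mimeType = uri.Substring(scheme.Length, markerIndex - scheme.Length);
        return true;
    }
}
```

If the payload decodes and the first bytes match the expected image signature, the problem is more likely in how the image is handed to the converter than in the data itself.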

Table with ColSpan, empty cell inserted after wrong cell

My table:

smrw			zz1	zz4
			zz2
			zz3
sm1	sm2	sm3	true
sm4				true

After parsing the table, an empty cell appears in the 4th column of the 2nd row.
In the method 'ProcessClosingTableRow', I found commented-out code at line 1368 that may be causing this problem. Should I call 'NextSibling'?
Related issue: #25

Word experienced an error trying to open the file

I have tried to convert HTML to OpenXML. When I open the generated file, I get this error:

This is my code:

                string background = ((dynamic)obj).Entity.Background as string;                    
                HtmlConverter converter = new HtmlConverter(mainDocumentPart);
                IList<OpenXmlCompositeElement> convertedToHTML = converter.Parse(background);
                List<OpenXmlElement> openXmlElementList = new List<OpenXmlElement>();

                convertedToHTML.ToList().ForEach(q =>
                {
                    var convertedOpenXmlElement = q as OpenXmlElement;
                    openXmlElementList.Add(convertedOpenXmlElement);
                });

                addAction(new Run(convertedToHTML.ToList()));
