
openscraping-lib-csharp's Introduction

OpenScraping HTML Structured Data Extraction
C# Library


Turn unstructured HTML pages into structured data. The OpenScraping library extracts information from HTML pages using a JSON config file with XPath rules. It can scrape even multi-level, complex objects such as tables and forum posts.

This library is used in production to scrape thousands of pages.

The latest NuGet package is .NET Standard 2.0, which means it can be used both in .NET Core 2.0+ and .NET Framework 4.6.1+ projects.

Self-contained example

Create a new console C# project, then install the OpenScraping NuGet package by using the GUI or by using this command in the Package Manager Console:

Install-Package OpenScraping
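
If you use the .NET CLI instead of the Package Manager Console, the equivalent install command is:

```
dotnet add package OpenScraping
```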

Then paste and run the following code:

namespace OpenScrapingTest
{
    using System;
    using Newtonsoft.Json;
    using OpenScraping;
    using OpenScraping.Config;

    class Program
    {
        static void Main(string[] args)
        {
            var configJson = @"
            {
                'title': '//h1',
                'body': '//div[contains(@class, \'article\')]'
            }
            ";

            var config = StructuredDataConfig.ParseJsonString(configJson);

            var html = "<html><body><h1>Article title</h1><div class='article'>Article contents</div></body></html>";

            var openScraping = new StructuredDataExtractor(config);
            var scrapingResults = openScraping.Extract(html);

            Console.WriteLine(scrapingResults["title"]);
            Console.WriteLine("----------------------------");
            Console.WriteLine(JsonConvert.SerializeObject(scrapingResults, Formatting.Indented));
            Console.ReadKey();
        }
    }
}

The output looks like this:

Article title
----------------------------
{
  "title": "Article title",
  "body": "Article contents"
}

Example: Extracting an article from bbc.com

Below is a simple configuration file that extracts an article from a www.bbc.com page.

{
  "title": "//div[contains(@class, 'story-body')]//h1",
  "dateTime": "//div[contains(@class, 'story-body')]//div[contains(@class, 'date')]",
  "body": "//div[@property='articleBody']"
}

Here is how to call the library:

// www.bbc.com.json contains the JSON configuration file pasted above
var jsonConfig = File.ReadAllText(@"www.bbc.com.json");
var config = StructuredDataConfig.ParseJsonString(jsonConfig);

var html = File.ReadAllText(@"www.bbc.com.html", Encoding.UTF8);

var openScraping = new StructuredDataExtractor(config);
var scrapingResults = openScraping.Extract(html);

Console.WriteLine(JsonConvert.SerializeObject(scrapingResults, Formatting.Indented));

And here is the result for a BBC news article:

{
  "title": "Robert Downey Jr pardoned for 20-year-old drug conviction",
  "dateTime": "24 December 2015",
  "body": "Body of the article is shown here"
}

Here is how the www.bbc.com page looked on the day we saved the HTML for this sample:

BBC News example page

Example: Extracting a list of products from Ikea

The sample configuration below is more complex, as it demonstrates support for extracting multiple items at the same time and running transformations on them. For this example we use a products page from ikea.com.

{
  "products": 
  {
    "_xpath": "//div[@id='productLists']//div[starts-with(@id, 'item_')]",
    "title": ".//div[contains(@class, 'productTitle')]",
    "description": ".//div[contains(@class, 'productDesp')]",
    "price": 
    {
      "_xpath": ".//div[contains(@class, 'price')]/text()[1]",
      "_transformations": [
        "TrimTransformation"
      ]
    }
  }
}

Here is a snippet of the result:

{
  "products": [{
    "title": "HEMNES",
    "description": "coffee table",
    "price": "$139.00"
  },
...
  {
    "title": "NORDEN",
    "description": "sideboard",
    "price": "$149.00"
  },
  {
    "title": "SANDHAUG",
    "description": "tray table",
    "price": "$79.99"
  }]
}
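
Since Extract returns a Newtonsoft.Json JContainer, just as in the self-contained example at the top of this page, the extracted items can be iterated with a sketch like the following (assuming config and html were loaded the same way as in the BBC example):

```csharp
var openScraping = new StructuredDataExtractor(config);
var results = openScraping.Extract(html);

// When more than one node matches the "_xpath" rule, "products" is a JSON array.
foreach (var product in results["products"])
{
    Console.WriteLine($"{product["title"]}: {product["price"]}");
}
```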

Here is how the www.ikea.com page looked on the day we saved the HTML for this sample:

Ikea example page

You can find more complex examples in the unit tests.

Transformations

In the Ikea example above we used a transformation called TrimTransformation. Transformations modify the raw extracted HTML nodes in some way. For instance, TrimTransformation just runs String.Trim() on the extracted text before it gets written to the JSON output.

Below are a few of the built-in transformations. To see how their example rules are tested, check the code of the test class as well as the HTML files in the TestData folder.

AbbreviatedIntegerTranformation: Converts strings like "9k" to the integer 9,000, "2m" to 2,000,000, or "5B" to 5,000,000,000.
CastToIntegerTransformation: Converts a string to the corresponding integer. For example, converts "12450" to the integer 12,450.
ExtractIntegerTransformation: Tries to find an integer in the middle of a string. For instance, for the string "Popularity: 1000 views" it extracts the integer 1,000. Note that if the string had a comma after the 1, it would extract only the integer 1.
ListTitleTransformation: Tries to find the "title" of the current unordered or ordered HTML list by looking for text just above the list.
RemoveExtraWhitespaceTransformation: Replaces consecutive spaces with a single space. For the string "hello    world" it returns "hello world".
SplitTransformation: Splits the string into an array based on a separator.
TotalTextLengthAboveListTransformation: Tries to determine the length of the text above an unordered or ordered HTML list.
TrimTransformation: Runs String.Trim() on the extracted text before it is written to the JSON output.
RegexTransformation: Matches text with a regular expression.
ParseDateTransformation: Converts text to a date.
HtmlDecode: Decodes HTML with WebUtility.HtmlDecode.
HtmlEncode: Encodes HTML with WebUtility.HtmlEncode.
UrlDecode: Decodes text with WebUtility.UrlDecode.
UrlEncode: Encodes text with WebUtility.UrlEncode.
ExtractTextTransformation: A better way to extract text when you want to preserve whitespace between adjacent text nodes. For example, if a text node immediately follows a link, this transformation outputs the extracted text with a space between the anchor text and the adjacent text. Useful when extracting large text articles.
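
A transformation without parameters is referenced by name alone, as in the Ikea example; a transformation that takes parameters is written as an object with a _type field. An illustrative config (the XPath rule and regex here are made up for the example):

```json
{
  "id": {
    "_xpath": "//link[@rel='canonical']/@href",
    "_transformations": [
      {
        "_type": "RegexTransformation",
        "_regex": "[0-9]+"
      }
    ]
  }
}
```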

Writing custom transformations

You can implement custom transformations in your own code; the library picks them up through reflection. There are two kinds of transformations: ones that act on the incoming HTML (the first transformation in the chain), and ones that act on the output of previous transformations. The first kind implements ITransformationFromHtml and the second kind implements ITransformationFromObject. A single transformation can implement both interfaces, as ParseDateTransformation does.
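
As a minimal sketch, a custom transformation acting on the output of previous transformations could look like the following. The Transform(Dictionary&lt;string, object&gt; settings, object input) signature is assumed from ITransformationFromObject; the class name and namespace are illustrative, not part of the library:

```csharp
using System.Collections.Generic;
using OpenScraping.Transformations; // assumed namespace for the transformation interfaces

// Hypothetical transformation that upper-cases previously extracted text.
public class UpperCaseTransformation : ITransformationFromObject
{
    public object Transform(Dictionary<string, object> settings, object input)
    {
        // Pass nulls through so the rest of the chain can decide what to do.
        return input?.ToString().ToUpperInvariant();
    }
}
```

Once the class is compiled into your project, listing "UpperCaseTransformation" by name in a _transformations array should let reflection pick it up.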

Remove unwanted HTML tags and XPath nodes before extracting content

Let's say you want to extract a news article, but before the actual extraction you would like to remove some HTML nodes. You can do that in two ways. The first (deprecated) way is the _removeTags setting, which lists the names of HTML tags to remove before the XPath rules are processed. The second (better) way is the _removeXPaths setting, which lists XPath rules that locate the nodes to remove BEFORE the normal _xpath extraction rules run.

Example HTML:

<!DOCTYPE html>

<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
    <meta charset="utf-8" />
    <title>Page title</title>
</head>
<body>
    <h1>  Article  title    </h1>
    <div id="primary">
        <script>alert('test');</script>

        <p>Para1     content</p>
        <p>Para2   content</p>

        <div id="comments">
            <p>Comment1</p>
            <p>Comment2</p>
        </div>
    </div>
</body>
</html>

JSON config:

{
  "_removeTags": [
    "script"
  ],
  "title": {
    "_xpath": "//h1",
    "_transformations": [
      "TrimTransformation"
    ]
  },
  "body": {
    "_removeXPaths": [
      "./div[@id='comments']"
    ],
    "_xpath": "//div[@id='primary']",
    "_transformations": [
      "RemoveExtraWhitespaceTransformation"
    ]
  }
}

Result:

{
  "title": "Article  title",
  "body": "Para1 content Para2 content"
}

MultiExtractor: Load multiple XPath rule config files and match on URL

You can use MultiExtractor to load multiple XPath rule JSON config files for different websites, then let the code pick the correct rules depending on the URL you provide. This is useful, for example, if you are parsing a network of websites with similar HTML design but different URLs. For example, check these two JSON config files: stackexchange.com.json and answers.microsoft.com.json. The first config file defines multiple URL patterns that can match it:

{
    "_urlPatterns": [
        "^https?:\/\/.+\\.stackexchange\\.com\/questions\/[0-9]+\/.+$",
        "^https?:\/\/stackoverflow\\.com\/questions\/[0-9]+\/.+$",
        "^https?:\/\/serverfault\\.com\/questions\/[0-9]+\/.+$",
        "^https?:\/\/superuser\\.com\/questions\/[0-9]+\/.+$",
        "^https?:\/\/askubuntu\\.com\/questions\/[0-9]+\/.+$"
    ],
...
}

We can load several of these config files into a MultiExtractor, then pass in an HTML file and its corresponding URL. MultiExtractor goes over the _urlPatterns in each config file, picks the config file that matches the URL, then applies the corresponding rules.

var multiExtractor = new MultiExtractor(configRootFolder: "TestData", configFilesPattern: "*.json");
var json = multiExtractor.ParsePage(
	url: "http://answers.microsoft.com/en-us/windows/forum/windows_10-win_upgrade/i-want-to-reserve-my-free-copy-of-windows-10-but-i/9c3f7f56-3da8-4b40-a30f-e33772439ee1", 
	html: File.ReadAllText(Path.Combine("TestData", "answers.microsoft.com.html")));

To see a full example, search for the MultiWebsiteExtractionTest() function in the test class.


openscraping-lib-csharp's Issues

Incompatibility with .NET Standard 2.0

When trying to add this package to a project targeting .NET Standard 2.0, an error is thrown because OpenScraping v1.0.1 only supports netcoreapp2.0.

Here is the output when running dotnet add OpenScraping:

info :   GET https://api.nuget.org/v3-flatcontainer/openscraping/index.json
info :   OK https://api.nuget.org/v3-flatcontainer/openscraping/index.json 380ms
error: Package OpenScraping 1.0.1 is not compatible with netstandard2.0 (.NETStandard,Version=v2.0). Package OpenScraping 1.0.1 supports: netcoreapp2.0 (.NETCoreApp,Version=v2.0)
error: Package 'OpenScraping' is incompatible with 'all' frameworks in project '<REDACTED>/Project.csproj'.

Regexp Transformation

While using OpenScraping, I faced a quite common scenario where a specific subset (a word) has to be selected from an element containing plain text.

Sample: <div class="info">Contact information. Phone: 111-111-111, Address: str.Street 1/1, City. 2017</div>

It would be really useful to have a built-in RegexpTransformation which can take a custom regexp expression as an input param '_regexp', something like '_separator' in SplitTransformation.

Missing extra spaces in RemoveExtraWhitespaceTransformation example

In the documentation for Transformations, the RemoveExtraWhitespaceTransformation example doesn't display the extra spaces in the first occurrence of "hello world". I think you'll need to use non-breaking spaces there or, if that doesn't work, something else with the same effect as non-breaking spaces.

It looks like this:

Replaces consecutive spaces with a single space. For the string "hello world" it would return "hello world".

But it should look like this:

Replaces consecutive spaces with a single space. For the string "hello     world" it would return "hello world".

Get meta data?

I have a need to get the content of meta data - how is this possible?
<meta content="Tim Fischer" name="author">

Thanks

Regex Transformation not returning first match

Hello, I am using this config

{
  "id": {
    "_xpath": "//link[@rel='canonical']/@href",
    "_transformations": [
      {
        "_type": "RegexTransformation",
        "_regex": "[0-9]{7}$"
      }
    ]
  }
}

to extract from this document

<!DOCTYPE html>

<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
   <meta charset="utf-8" />
   <link rel="canonical" href="http://example.com/test-xyz-1234567" />
    <title>Test</title>
</head>
<body>
    Test
</body>
</html>

My goal is to extract "1234567" from the link-tag.

The resulting output is empty.
Reason: In RegexTransformation.cs the first match is ignored (line 77).

I am not sure why the first match is ignored and how to work around this in my example.
Could someone please provide some advice? Thank you very much.

How to return the href of a hyperlink?

I have tried "link":"//a[selector]/@href".
The parsed result is the text of the link instead of the href url.

What is the correct way to retrieve the link?

Oh gosh, this is a duplicate of the previous attribute-selection issue.

Enable CastToIntegerTransformation to transform from container too

My use case:

Given this URL: https://dev.test/index.php?PHPSESSID=a&action=profile;u=99 I wanted to extract the user ID 99 from the end of the string. My solution was to use a simple regex and convert the result to an integer:

"_transformations": [
  {
    "_type": "RegexTransformation",
    "_regex": "u=(\\d+)"
  },
  "CastToIntegerTransformation"
]

But then I got

Transformation chain broken at transformation type CastToIntegerTransformation

so I started to debug the library and realized that CastToIntegerTransformation does not implement ITransformationFromObject, which means it cannot be used later in the parsing pipeline.

Yes, this problem can easily be fixed with inheritance, but I thought I would mention it here.

Here is my extended CastToIntegerTransformation class implementation:

/// <summary>
/// Class to cast selected XPath value to <see cref="int"/>.
/// </summary>
public class CastToIntegerTransformation : ITransformationFromHtml, ITransformationFromObject
{
    public object Transform(Dictionary<string, object> settings, HtmlNodeNavigator nodeNavigator, List<HtmlAgilityPack.HtmlNode> logicalParents)
    {
        var text = nodeNavigator?.Value ?? nodeNavigator?.CurrentNode?.InnerText;

        if (text != null)
        {
            int intVal;

            if (int.TryParse(text, out intVal))
            {
                return intVal;
            }
        }

        return null;
    }

    /// <summary>
    /// Transforms the input to a valid <see cref="int"/>.
    /// </summary>
    /// <param name="settings"><seealso cref="Config.TransformationConfig.ConfigAttributes"/>.</param>
    /// <param name="input">Parsed XPath value.</param>
    /// <returns><see cref="int"/>.</returns>
    /// <exception cref="FormatException">Occurs when the <paramref name="input" /> parameter
    /// is not a valid integer.</exception>
    public object Transform(Dictionary<string, object> settings, object input)
    {
        if (int.TryParse(input.ToString(), out int number))
        {
            return number;
        }

        throw new FormatException($"Input parameter {input} is not a valid integer!");
    }
}

Thank You for this great library!

Using "_xpath" on a table with one row does not create a JSON array

The _xpath feature seems to behave differently if there is only one result from the _xpath. It will generate:

"rows": { "col1": "val1", "col2": "val2" }

Instead of:

"rows": [{ "col1": "val1", "col2": "val2" }]

This makes it difficult to iterate over the array of rows, as you end up iterating over the columns instead. Thoughts? Is there a way to force an array regardless of there being a single row? Thank you.
