Code Monkey home page Code Monkey logo

scrape.net's Introduction

Scrape.NET

A flexible .NET scraping library, with a powerful exceptions system.

Quick start

This library uses extensions method extensively, do not forget to using Scrape.NET;

using System.Net.Http;
using AngleSharp.Html.Dom;
using Scrape.NET; // <- make sure to use the library

HttpClient httpClient = new HttpClient();

// send a [GET] request and retrive an IHtmlDocument
IHtmlDocument html = await httpClient.GetAsync<IHtmlDocument>("https://...");

// css selector, or fail with NodeNotFoundException
html.CssOrFail("title");

More flexible example

using System.Net.Http;
using AngleSharp.Html.Dom;
using Scrape.NET; // <- make sure to use the library

// create an http client
HttpClient httpClient = new HttpClient();

// create a request
HttpRequestMessage requestMessage = new HttpRequestMessage(HttpMethod.Get, "https://...");

// send the request but do not actually consume the content, read only the headers
HttpResponseMessage responseMessage = await httpClient.SendAsync(requestMessage, HttpCompletionOption.ResponseHeadersRead, cancellationToken);

// ensure the request is any 2xx status
// without ResponseHeadersRead, we read more that what whe actually need
responseMessage.EnsureSuccessStatusCode();

// we can now consume the rest of the response
IHtmlDocument html = await responseMessage.ReadAsHtmlAsync(cancellationToken);

// as an alternative of the line above
{
    // create the AngleSharp IResponse
    IResponse documentResponse = await DocumentFactory.CreateResponse(responseMessage, cancellationToken);

    // use AngleSharp to create the actual document
    IDocument document = await BrowsingContext.New().OpenAsync(documentResponse, cancellationToken);

    IHtmlDocument html = (IHtmlDocument)document;
}

// css selector, or fail with NodeNotFoundException
html.CssOrFail("title");

Table of Contents

Query Selector Utilities

All the available extensions methods source code: NodeExtensions.cs

All the utilities automatically decoded, unescape and trim the result, a &lt; hey is trimmed and converted to < hey.

Css and XPath selector

// return the first .container, or null, similar to document.querySelector()
INode node = html.Css(".container");

// throw it a .container does not exists
INode node = html.CssOrFail(".container");

// select all the .container
INodeList nodes = html.CssAll(".container");

// same as before
html.XPath("/container/a[2]");
html.XPathOrFail("/container/a[2]");
html.XPathAll("/container/a");

// or something more complex
html.XPath("/html/body/div[7]/div[1]/div[1]/h2[8]");

Sub-select

You can select from any node, therefore:

html.Css("body").CssAll(".container").CssAll("a");

// or
html.Css("body").Css(".container").Css("a");

You can also use a css selector:

html.Css("body .container > a");

Select an attribute

When selecting an attribute, text content, or html content, the result string will be decoded and unescaped automatically, therefore &lt; will become <.

// <div class=".container" name="The container name">...

html.CssOrFail(".container").Attr("name"); // -> The container name

html.CssOrFail(".container").Attr("id"); // -> null

// with a default value
html.CssOrFail(".container").Attr("style", "Invalid Style"); // -> Invalid Style

If you want to use use AngleSharp, you need to cast the result manually:

 ((IHtmlElement)html.CssOrFail(".container")).GetAttribute("name");

Get the normalized href attribute

string href = html.CssOrFail("a").Href();
Uri href = html.CssOrFail("a").Href<Uri>();

// with a default value
string href = html.CssOrFail("a").Href("https://...");
Uri href = html.CssOrFail("a").Href<Uri>("https://...");

FileStreamContent

Sometimes you may need to download a file while still conserving a reference to that operation in the Http library.

// send a request
HttpResponseMessage responseMessage = await httpClient.SendAsync(..., HttpCompletionOption.ResponseHeadersRead);

// make sure we have a 2xx status code
responseMessage.EnsureSuccessStatusCode();

// create a temp file
string tempFile = Path.GetTempFileName();

using (HttpContent content = responseMessage.Content)
{
    // download the file
    using (FileStream fs = File.OpenWrite(tempFile))
    {
        await responseMessage.Content.CopyToAsync(fs);
    }

    // replace the consumed content with a new HttpContent that points to the file on the disk
    responseMessage.Content = new FileStreamContent(tempFile, deleteOnDispose: false);

    // copy the headers to the new content also
    foreach (KeyValuePair<string, IEnumerable<string>> header in content.Headers)
    {
        responseMessage.Content.Headers.Add(header.Key, header.Value);
    }
}

// we can still use responseMessage.Content for other reasons, for example, down-scaling an image
responseMessage.Content;

// deleteOnDispose is false in this example, so calling dispose will not delete the file
responseMessage.Content.Dispose();

File.Exists(tempFile); // is true

HttpResponseMessage Serialization to JSON

The response content stream will not be serialized to JSON!!!

You can serialize an instance of HttpResponseMessage in JSON with:

HttpResponseMessage responseMessage = await httpClient.GetAsync("...");

string json = JsonSerializer.Serialize(responseMessage, new JsonSerializerOptions
{
    WriteIndented = true,
    Converters =
    {
        new HttpResponseMessageJsonConverter(),
    }
});

// HttpResponseMessageJsonConverter works on Deserialize also!

HttpResponseMessageJsonConverter serialize and deserialize the HttpResponseMessage.

The result JSON string:

{
  "Version": "1.1",
  "StatusCode": 200,
  "ReasonPhrase": "OK",
  "ContentHeaders": [
    {
      "Key": "Content-Type",
      "Value": [
        "text/html; charset=utf-8"
      ]
    },
    {
      "Key": "Expires",
      "Value": [
        "Mon, 01 Jan 1990 00:00:00 GMT"
      ]
    }
  ],
  "Headers": [
    {
      "Key": "Cache-Control",
      "Value": [
        "no-store, must-revalidate, no-cache, max-age=0"
      ]
    },
  ],
  "TrailingHeaders": [],
  "RequestMessage": {
    "Method": "GET",
    "RequestUri": "https://...",
    "Version": "1.1",
    "VersionPolicy": 0,
    "ContentHeaders": null,
    "Headers": []
  }
}

You can also serialize/deserialize a request with HttpRequestMessageJsonConverter.

The library do not hide the implementation details from you:

You can manually serialize a request with:

var serializable = new SerializableHttpRequestMessage(requestMessage);

string json = JsonSerializer.Serialize(serializable);

// get an instance of HttpRequestMessage back
HttpRequestMessage requestMessage = serializable.ToHttpRequestMessage();

It is very useful when you wish to store an HttpRequestMessage in a format that is not JSON, as it basically clone the request into a class that can be easily serialized.

Uber-fast concurrent Console.Write over the current line

Sometimes you may wish to rewrite very fast over the current line, in a multithreaded highly concurrent operation, in order to do so, you can use:

ConsoleEx.ReWrite("Hello World!");

ConsoleEx.ReWrite normalize the string, prevents it from overflowing the console width, clean the old line properly, and its extraordinarily fast!

It uses Interlocked.Exchange behind the scenes so it drops overlapping calls, make sure at the end of your program to use Console.WriteLine() to create a clean new line in which you can write.

Writing to the console can be very slow, ConsoleEx.ReWrite helps tremendously but is still limited, if you need to rely on the console output for communicating with other programs, its best if you stick with the traditional console, or if you use a logging library.

Prevent the blinking of the console

ConsoleEx.ReWrite does prevent the console from flashing when used exclusively, this is a way you can test your specific system:

Stopwatch stopwatch = Stopwatch.StartNew();
for (int i = 0; i < 100_000; i++)
{
    ConsoleEx.ReWrite($"{i}");
}
stopwatch.Stop();

Console.WriteLine(); // start from a new line
Console.WriteLine("Time elapsed: {0}", stopwatch.Elapsed);

99999 Time elapsed: 00:00:12.6130886

The speed of the operation depends mostly on your hardware.

Hide the cursor

It can be useful to hide the cursor:

using (ConsoleEx.HideCursor())
{
    ExecuteSomeCode();
}

Turn off Quick Edit on Windows

Quick Edit is a feature of the windows console to suspend the program execution when the user click on the terminal.

// disable Quick Edit
ConsoleEx.EnableQuickEdit(false);

The code does nothing on a non Windows platform, however it still return false. To check if the current system is Windows, use:

bool isWindows = RuntimeInformation.IsOSPlatform(OSPlatform.Windows);

Example

For now, the only available example is Scrape.NET.Example

Caveats

If you plan to use the default System.Net.Http.HttpClient, be aware that the HttpRequestMessage.RequestUri will change when the http client is configured to follow redirects.

Make sure you keep a copy of the original HttpRequestMessage.RequestUri if you need it later, as an example, for a caching system.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.