A Scala library for scraping content from HTML pages

License: MIT License

Scala Scraper

A library providing a DSL for loading and extracting content from HTML pages.

Take a look at Examples.scala and at the unit specs for usage examples or keep reading for more thorough documentation. Feel free to use GitHub Issues for submitting any bug or feature request and Gitter to ask questions.

This README contains the following sections:

Quick Start
Core Model
Browsers
Content Extraction
Content Validation
Other DSL Features
Using Browser-Specific Features
Working Behind an HTTP/HTTPS Proxy
Integration with Typesafe Config
New Features and Migration Guide
Copyright

Quick Start

To use Scala Scraper in an existing SBT project with Scala 2.11 or newer, add the following dependency to your build.sbt:

libraryDependencies += "net.ruippeixotog" %% "scala-scraper" % "3.1.1"

If you are using an older version of this library, see this document for the version you're using: 1.x, 0.1.2, 0.1.1, 0.1.

An implementation of the Browser trait, such as JsoupBrowser, can be used to fetch HTML from the web or to parse a local HTML file or string:

import net.ruippeixotog.scalascraper.browser.JsoupBrowser

val browser = JsoupBrowser()
val doc = browser.parseFile("core/src/test/resources/example.html")
val doc2 = browser.get("http://example.com")

The returned object is a Document, which already provides several methods for manipulating and querying HTML elements. For simple use cases, that can be enough. For others, this library improves the content extraction process by providing a powerful DSL.

You can open the example.html file loaded above to follow the examples throughout the README.

First of all, the DSL methods and conversions must be imported:

import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._
import net.ruippeixotog.scalascraper.dsl.DSL.Parse._

Content can then be extracted using the >> extraction operator and CSS queries:

import net.ruippeixotog.scalascraper.model._

// Extract the text inside the element with id "header"
doc >> text("#header")
// res0: String = "Test page h1"

// Extract the <span> elements inside #menu
val items = doc >> elementList("#menu span")
// items: List[Element] = List(
//   JsoupElement(<span><a href="#home">Home</a></span>),
//   JsoupElement(<span><a href="#section1">Section 1</a></span>),
//   JsoupElement(<span class="active">Section 2</span>),
//   JsoupElement(<span><a href="#section3">Section 3</a></span>)
// )

// From each item, extract all the text inside their <a> elements
items.map(_ >> allText("a"))
// res1: List[String] = List("Home", "Section 1", "", "Section 3")

// From the meta element with "viewport" as its attribute name, extract the
// text in the content attribute
doc >> attr("content")("meta[name=viewport]")
// res2: String = "width=device-width, initial-scale=1"

If the element may or may not be in the page, the >?> operator tries to extract the content and returns it wrapped in an Option:

// Extract the element with id "footer" if it exists, return `None` if it
// doesn't:
doc >?> element("#footer")
// res3: Option[Element] = Some(
//   JsoupElement(
//     <div id="footer">
//  <span>No copyright 2014</span>
// </div>
//   )
// )
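
The Option returned by >?> can be handled with the usual Option combinators or with pattern matching. The following is a minimal sketch, assuming a small inline HTML string (made up for illustration) rather than example.html; the names sampleBrowser, sampleDoc, footerText and headerSummary are hypothetical:

```scala
import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._

// Hypothetical inline page: it has a header but no footer
val sampleBrowser = JsoupBrowser()
val sampleDoc = sampleBrowser.parseString(
  """<html><body><div id="header"><h1>Hello</h1></div></body></html>""")

// Fall back to a default value when the element is absent
val footerText: String =
  (sampleDoc >?> text("#footer")).getOrElse("(no footer)")

// Or branch explicitly on whether the extraction succeeded
val headerResult = sampleDoc >?> text("#header")
val headerSummary: String = headerResult match {
  case Some(t) => s"header: $t"
  case None => "header missing"
}
```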

With only these two operators, some useful things can already be achieved:

// Go to a news website and extract the hyperlink inside the h1 element if it
// exists. Follow that link and print both the article title and its short
// description (inside ".lead")
for {
  headline <- browser.get("http://observador.pt") >?> element("h1 a")
  headlineDesc = browser.get(headline.attr("href")) >> text(".lead")
} println("== " + headline.text + " ==\n" + headlineDesc)

In the next two sections the core classes used by this library are presented. They are followed by a description of the full capabilities of the DSL, including the ability to parse content after extracting, validating the contents of a page and defining custom extractors or validators.

Core Model

The library represents HTML documents and their elements by Document and Element objects, simple interfaces containing methods for retrieving information and navigating through the DOM.

Browser implementations are the entrypoints for obtaining Document instances. Most notably, they implement get, post, parseFile and parseString methods for retrieving documents from different sources. Depending on the browser used, Document and Element instances may have different semantics, mainly on their immutability guarantees.
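
As a rough illustration of these interfaces, the sketch below parses an inline HTML string (made up for this example) and navigates it using only Document and Element methods, without the DSL. The value names are hypothetical:

```scala
import net.ruippeixotog.scalascraper.browser.JsoupBrowser

// Hypothetical inline page used only for this illustration
val modelBrowser = JsoupBrowser()
val modelDoc = modelBrowser.parseString(
  """<html><head><title>Demo</title></head>
    |<body><a id="home" href="/home">Home</a></body></html>""".stripMargin)

// Document-level information
val pageTitle: String = modelDoc.title

// Navigate from the body to the first "a" element and inspect it
val link = modelDoc.body.select("a").head
val linkTag: String = link.tagName
val linkTarget: String = link.attr("href")
val linkText: String = link.text
```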

Browsers

The library currently provides two built-in implementations of Browser:

JsoupBrowser is backed by jsoup, a Java HTML parser library. JsoupBrowser provides powerful and efficient document querying, but it doesn't run JavaScript in the pages. As such, it is limited to working strictly with the HTML sent in the page source;
HtmlUnitBrowser is based on HtmlUnit, a GUI-less browser for Java programs. HtmlUnitBrowser thoroughly simulates a web browser, executing JavaScript code in the pages in addition to parsing HTML. It supports several compatibility modes, allowing it to emulate browsers such as Internet Explorer.

Due to its speed and maturity, JsoupBrowser is the recommended browser to use when JavaScript execution is not needed. More information about each browser and its semantics can be obtained in the Scaladoc of each implementation.
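
One way to apply this recommendation is to choose the backend in a single place. The helper below is a hypothetical sketch (browserFor is not part of the library); it assumes both companion objects can be called with no arguments, as done elsewhere in this README for JsoupBrowser:

```scala
import net.ruippeixotog.scalascraper.browser.{Browser, HtmlUnitBrowser, JsoupBrowser}

// Hypothetical helper: use HtmlUnit only when the page needs JavaScript
def browserFor(needsJavaScript: Boolean): Browser =
  if (needsJavaScript) HtmlUnitBrowser() else JsoupBrowser()

// A static page can be handled by the faster jsoup backend
val staticBrowser = browserFor(needsJavaScript = false)
val staticDoc = staticBrowser.parseString(
  "<html><head><title>Static</title></head><body></body></html>")
val staticTitle: String = staticDoc.title
```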

Content Extraction

The >> and >?> operators shown above accept an HtmlExtractor as their right argument, a trait with a very simple interface:

trait HtmlExtractor[-E <: Element, +A] {
  def extract(doc: ElementQuery[E]): A
}

One can always create a custom extractor by implementing HtmlExtractor. However, the DSL provides several ways to create HtmlExtractor instances, which should be enough in most situations. In general, you can use the extractor factory method:

doc >> extractor(<cssQuery>, <contentExtractor>, <contentParser>)

Where the arguments are:

cssQuery: the CSS query used to select the elements to be processed;
contentExtractor: the content to be extracted from the selected elements, e.g. the element objects themselves, their text, a specific attribute, form data;
contentParser: an optional parser for the data extracted in the step above, such as parsing numbers and dates or using regexes.

The DSL provides several contentExtractor and contentParser instances, which were imported before with DSL.Extract._ and DSL.Parse._. The full list can be seen in ContentExtractors.scala and ContentParsers.scala.

Some usage examples:

// Extract the date from the "#date" element
doc >> extractor("#date", text, asLocalDate("yyyy-MM-dd"))
// res5: org.joda.time.LocalDate = 2014-10-26

// Extract the text of all "#mytable td" elements and parse each of them as a number
doc >> extractor("#mytable td", texts, seq(asDouble))
// res6: TraversableOnce[Double] = non-empty iterator

// Extract an element "h1" and do no parsing (the default parsing behavior)
doc >> extractor("h1", element, asIs[Element])
// res7: Element = JsoupElement(<h1>Test page h1</h1>)

With the help of the implicit conversions provided by the DSL, the most common extraction cases can be written more succinctly:

<cssQuery> is taken as extractor(<cssQuery>, elements, asIs) (by an implicit conversion);
<contentExtractor> is taken as extractor(":root", <contentExtractor>, asIs) (content extractors are also HtmlExtractor instances by themselves);
<contentExtractor>(<cssQuery>) is taken as extractor(<cssQuery>, <contentExtractor>, asIs) (by an implicit conversion).

Because of that, one can write the expressions in the Quick Start section, as well as:

// Extract all the "h3" elements (as a lazy iterable)
doc >> "h3"
// res8: ElementQuery[Element] = LazyElementQuery(
//   JsoupElement(<h3>Section 1 h3</h3>),
//   JsoupElement(<h3>Section 2 h3</h3>),
//   JsoupElement(<h3>Section 3 h3</h3>)
// )

// Extract all text inside this document
doc >> allText
// res9: String = "Test page Test page h1 Home Section 1 Section 2 Section 3 Test page h2 2014-10-26 2014-10-26T12:30:05Z 4.5 2 Section 1 h3 Some text for testing More text for testing Section 2 h3 My Form Add field Section 3 h3 3 15 15 1 No copyright 2014"

// Extract the elements with class "active"
doc >> elementList(".active")
// res10: List[Element] = List(
//   JsoupElement(<span class="active">Section 2</span>)
// )

// Extract the text inside each "p" element
doc >> texts("p")
// res11: Iterable[String] = List(
//   "Some text for testing",
//   "More text for testing"
// )
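
As mentioned at the start of this section, a custom extractor can also be written by implementing HtmlExtractor directly. The following is a minimal sketch, not taken from the library: linkCount is a hypothetical extractor that counts the "a" elements found under the query it receives, and the HTML string is made up for illustration:

```scala
import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.model.{Element, ElementQuery}
import net.ruippeixotog.scalascraper.scraper.HtmlExtractor

// Hypothetical custom extractor: counts the "a" elements below the query root
val linkCount: HtmlExtractor[Element, Int] = new HtmlExtractor[Element, Int] {
  def extract(doc: ElementQuery[Element]): Int = doc.select("a").size
}

// Made-up page with two links
val customBrowser = JsoupBrowser()
val customDoc = customBrowser.parseString(
  """<html><body><a href="#1">one</a><p><a href="#2">two</a></p></body></html>""")

// Custom extractors are applied with the same operators as built-in ones
val numLinks: Int = customDoc >> linkCount
```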

Content Validation

While scraping web pages, it is common to validate whether a page actually has the expected structure. This library provides special support for creating and applying validations.

An HtmlValidator has the following signature:

trait HtmlValidator[-E <: Element, +R] {
  def matches(doc: ElementQuery[E]): Boolean
  def result: Option[R]
}

As with extractors, the DSL provides the validator constructor and the >/~ operator for applying a validation to a document:

doc >/~ validator(<extractor>)(<matcher>)

Where the arguments are:

extractor: an extractor as defined in the previous section;
matcher: a function mapping the extracted content to a boolean indicating if the document is valid.

The result of a validation is an Either[R, A] instance, where A is the type of the document and R is the result type of the validation (which will be explained later).

Some validation examples:

// Check if the title of the page is "Test page"
doc >/~ validator(text("title"))(_ == "Test page")
// res12: Either[Unit, browser.DocumentType] = Right(
//   JsoupDocument(
//     <!doctype html>
// <html lang="en">
//  <head>
//   <meta charset="utf-8">
//   <meta name="viewport" content="width=device-width, initial-scale=1">
//   <title>Test page</title>
//  </head>
//  <body>
//   <div id="wrapper">
//    <div id="header">
//     <h1>Test page h1</h1>
//    </div>
//    <div id="menu">
//     <span><a href="#home">Home</a></span> <span><a href="#section1">Section 1</a></span> <span class="active">Section 2</span> <span><a href="#section3">Section 3</a></span>
//    </div>
//    <div id="content">
//     <h2>Test page h2</h2><span id="date">2014-10-26</span> <span id="datefull">2014-10-26T12:30:05Z</span> <span id="rating">4.5</span> <span id="pages">2</span>
//     <section>
//      <h3>Section 1 h3</h3>
//      <p>Some text for testing</p>
//      <p>More text for testing</p>
//     </section>
//     <section>
//      <h3>Section 2 h3</h3><span>My Form</span>
//      <form id="myform" action="submit.html">
//       <input type="text" name="name" value="John"> <input type="text" name="address"> <input type="submit" value="Submit"> <span><a href="#">Add field</a></span>
//      </form>
//     </section>
//     <section>
//      <h3>Section 3 h3</h3>
//      <table id="mytable">
//       <tbody>
//        <tr>
//         <td>3</td>
//         <td>15</td>
//         <td>15</td>
//         <td>1</td>
//        </tr>
//       </tbody>
//      </table>
//     </section>
// ...

// Check if there are at least 3 ".active" elements
doc >/~ validator(".active")(_.size >= 3)
// res13: Either[Unit, browser.DocumentType] = Left(())

// Check if the text in "#mytable" contains the word "blue"
doc >/~ validator(allText("#mytable"))(_.contains("blue"))
// res14: Either[Unit, browser.DocumentType] = Left(())

When a document fails a validation, it may be useful to identify the problem by pattern-matching it against common scraping pitfalls, such as a login page that appears unexpectedly because of an expired cookie, dynamic content that disappeared or server-side errors. If we define validators for both the success case and error cases:

val succ = validator(text("title"))(_ == "My Page")

val errors = Seq(
  validator(allText(".msg"), "Not logged in")(_.contains("sign in")),
  validator(".item", "Too few items")(_.size < 3),
  validator(text("h1"), "Internal Server Error")(_.contains("500")))

They can be used in combination to create more informative validations:

doc >/~ (succ, errors)
// res15: Either[String, browser.DocumentType] = Left("Too few items")

Validators matching errors were constructed above using an additional result parameter after the extractor. That value is returned wrapped in a Left if that particular error occurs during a validation.
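
Since the result is a plain Either, it can be consumed with pattern matching. The sketch below assumes a made-up inline page that looks like a login screen; the names loginDoc, expected, knownErrors and outcome are hypothetical:

```scala
import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._

// Hypothetical page that shows a login prompt instead of the expected content
val loginBrowser = JsoupBrowser()
val loginDoc = loginBrowser.parseString(
  """<html><head><title>Login</title></head>
    |<body><p class="msg">Please sign in to continue</p></body></html>""".stripMargin)

// Success validator and a list of known failure modes
val expected = validator(text("title"))(_ == "My Page")
val knownErrors = Seq(
  validator(allText(".msg"), "Not logged in")(_.contains("sign in")),
  validator(text("title"), "Internal Server Error")(_.contains("500")))

// Right holds the document; Left holds the result of the first matching error
val checked = loginDoc >/~ (expected, knownErrors)
val outcome: String = checked match {
  case Right(_) => "page is valid"
  case Left(reason) => s"validation failed: $reason"
}
```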

Other DSL Features

As shown before in the Quick Start section, one can test whether an extractor works on a page and obtain the extracted content wrapped in an Option:

// Try to extract an element with id "optional", return `None` if none exist
doc >?> element("#optional")
// res16: Option[Element] = None

Note that when using >?> with content extractors that return sequences, such as texts and elements, None will never be returned (Some(Seq()) will be returned instead).
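
The difference can be seen side by side in the following sketch, which assumes a made-up inline page with no element matching ".missing"; the value names are hypothetical:

```scala
import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._

// Hypothetical page without any ".missing" element
val seqBrowser = JsoupBrowser()
val seqDoc = seqBrowser.parseString(
  "<html><body><p>only a paragraph</p></body></html>")

// Single-value extractor: nothing to extract, so the result is None
val single = seqDoc >?> text(".missing")

// Sequence extractor: the result is a Some wrapping an empty collection
val many = seqDoc >?> texts(".missing")

// An emptiness check is therefore needed to detect "no matches" in this case
val noMatches: Boolean = many.forall(_.isEmpty)
```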

If you want to use multiple extractors in a single document or element, you can pass pairs or triples of extractors to >>:

// Extract the text of the title element and all inputs of #myform
doc >> (text("title"), elementList("#myform input"))
// res17: (String, List[Element]) = (
//   "Test page",
//   List(
//     JsoupElement(<input type="text" name="name" value="John">),
//     JsoupElement(<input type="text" name="address">),
//     JsoupElement(<input type="submit" value="Submit">)
//   )
// )

The extraction operators work on List, Option, Either and other instances for which a Scalaz Functor instance exists. The extraction occurs by mapping over the functors:

// Extract the titles of all documents in the list
List(doc, doc) >> text("title")
// res18: List[String] = List("Test page", "Test page")

// Extract the title if the document is a `Some`
Option(doc) >> text("title")
// res19: Option[String] = Some("Test page")

You can apply other extractors and validators to the result of an extraction, which is particularly powerful combined with the feature shown above:

// From the "#menu" element, extract the text in the ".active" element inside
doc >> element("#menu") >> text(".active")
// res20: String = "Section 2"

// Same as above, but in a scenario where "#menu" can be absent
doc >?> element("#menu") >> text(".active")
// res21: Option[String] = Some("Section 2")

// Same as above, but check if the "#menu" has any "span" element before
// extracting the text
doc >?> element("#menu") >/~ validator("span")(_.nonEmpty) >> text(".active")
// res22: Option[Either[Unit, String]] = Some(Right("Section 2"))

// Extract the links inside all the "#menu > span" elements
doc >> elementList("#menu > span") >?> attr("href")("a")
// res23: List[Option[String]] = List(
//   Some("#home"),
//   Some("#section1"),
//   None,
//   Some("#section3")
// )

This library also provides a Functor for HtmlExtractor, making it possible to map over extractors and create chained extractors that can be passed around and stored like objects. For example, new extractors can be defined like this:

import net.ruippeixotog.scalascraper.scraper.HtmlExtractor

// An extractor for a list with the first link found in each "span" element
val spanLinks: HtmlExtractor[Element, List[Option[String]]] =
  elementList("span") >?> attr("href")("a")

// An extractor for the number of "span" elements that actually have links
val spanLinksCount: HtmlExtractor[Element, Int] =
  spanLinks.map(_.flatten.length)

You can also "prepend" a query to any existing extractor by using its mapQuery method:

// An extractor for `spanLinks` that are inside "#menu"
val menuLinks: HtmlExtractor[Element, List[Option[String]]] =
  spanLinks.mapQuery("#menu")

And they can be used just as extractors created using other means provided by the DSL:

doc >> spanLinks
// res24: List[Option[String]] = List(
//   Some("#home"),
//   Some("#section1"),
//   None,
//   Some("#section3"),
//   None,
//   None,
//   None,
//   None,
//   None,
//   Some("#"),
//   None
// )

doc >> spanLinksCount
// res25: Int = 4

doc >> menuLinks
// res26: List[Option[String]] = List(
//   Some("#home"),
//   Some("#section1"),
//   None,
//   Some("#section3")
// )

Just remember that you can only apply extraction operators >> and >?> to documents, elements or functors "containing" them, which means that the following is a compile-time error:

// The `texts` extractor extracts a list of strings and extractors cannot be
// applied to strings
doc >> texts("#menu > span") >> "a"
// error: value >> is not a member of Iterable[String]
// doc >> texts("#menu > span") >> "a"
// ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Finally, if you prefer not using operators for the sake of code legibility, you can use alternative methods:

// `extract` is the same as `>>`
doc extract text("title")
// res28: String = "Test page"

// `tryExtract` is the same as `>?>`
doc tryExtract element("#optional")
// res29: Option[Element] = None

// `validateWith` is the same as `>/~`
doc validateWith (succ, errors)
// res30: Either[String, browser.DocumentType] = Left("Too few items")

Using Browser-Specific Features

NOTE: this feature is in a beta stage. Please expect API changes in future releases.

At this moment, Scala Scraper is focused on providing a DSL for querying documents efficiently and elegantly. Therefore, it doesn't support directly modifying the DOM or executing actions such as clicking an element. However, since version 2.0.0 a new typed element API allows users to interact directly with the data structures of the underlying Browser implementation.

First of all, make sure your Browser instance has a concrete type, like HtmlUnitBrowser:

import net.ruippeixotog.scalascraper.browser.HtmlUnitBrowser
import net.ruippeixotog.scalascraper.browser.HtmlUnitBrowser._

// the `typed` method on the companion object of a `Browser` returns instances
// with their concrete type
val typedBrowser: HtmlUnitBrowser = HtmlUnitBrowser.typed()

val typedDoc: HtmlUnitDocument = typedBrowser.parseFile("core/src/test/resources/example.html")

Note that the val declarations are explicitly typed for explanation purposes only; the methods work just as well when types are inferred.

The content extractors pElement, pElements and pElementList are a special kind of extractor: they are polymorphic. They work just like their non-polymorphic counterparts element, elements and elementList, but they propagate the concrete types of the elements if the document or element being extracted also has a concrete type. For example:

// extract the "a" inside the second child of "#menu"
val aElem = typedDoc >> pElement("#menu span:nth-child(2) a")
// aElem: HtmlUnitElement = HtmlUnitElement(HtmlAnchor[<a href="#section1_2">])

Note that extracting using CSS queries also keeps the concrete types of the elements:

// same thing as above
typedDoc >> "#menu" >> "span:nth-child(2)" >> "a" >> pElement
// res31: pElement.Out[HtmlUnitElement] = HtmlUnitElement(
//   HtmlAnchor[<a href="#section1_2">]
// )

Concrete element types, like HtmlUnitElement, expose a public underlying field with the underlying element object used by the browser backend. In the case of HtmlUnit, that would be a DomElement, which exposes a whole new range of operations:

// extract the current "href" this "a" element points to
aElem >> attr("href")
// res32: String = "#section1"

// use `underlying` to update the "href" attribute
aElem.underlying.setAttribute("href", "#section1_2")

// verify that "href" was updated
aElem >> attr("href")
// res34: String = "#section1_2"

// get the location of the document (without the host and the full path parts)
typedDoc.location.split("/").last
// res35: String = "example.html"

def click(elem: HtmlUnitElement): Unit = {
  // the type param may be needed, as the original API uses Java wildcards
  elem.underlying.click[org.htmlunit.Page]()
}

// simulate a click on our recently modified element
click(aElem)

// check the new location
typedDoc.location.split("/").last
// res37: String = "example.html#section1_2"

Using the typed element API provides much more flexibility when more than querying elements is required. However, one should avoid using it unless strictly necessary, as:

It binds code to specific Browser implementations, making it more difficult to change implementations later;
The code becomes subject to changes in the API of the underlying library;
It's heavier on the Scala type system and it is not as mature, leading to possible unexpected compilation errors. If that happens, please file an issue!

Working Behind an HTTP/HTTPS Proxy

If you are behind an HTTP or SOCKS proxy, you can configure Browser implementations to make connections through it by either using the browser's appropriate constructor (implementation-dependent) or by calling withProxy on any browser instance:

import net.ruippeixotog.scalascraper.browser.Proxy

val browser2 = JsoupBrowser().withProxy(Proxy("example.com", 7000, Proxy.SOCKS))
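
Because withProxy is called on a browser instance, one possible way to use different proxies for different requests is to keep one browser per proxy and rotate between them. This is only a sketch: the proxy hosts are placeholders, and proxiedBrowsers and rotation are hypothetical names, not library features:

```scala
import net.ruippeixotog.scalascraper.browser.{JsoupBrowser, Proxy}

// Placeholder proxy endpoints; replace with real hosts and ports
val proxies = List(
  Proxy("proxy-a.example.com", 7000, Proxy.SOCKS),
  Proxy("proxy-b.example.com", 7000, Proxy.SOCKS))

// One browser instance per proxy
val proxiedBrowsers = proxies.map(p => JsoupBrowser().withProxy(p))

// Endless round-robin over the browsers, e.g. taking one per request
val rotation = Iterator.continually(proxiedBrowsers).flatten
val firstBrowser = rotation.next()
val secondBrowser = rotation.next()
val thirdBrowser = rotation.next() // wraps around to the first browser again
```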

Integration with Typesafe Config

The Scala Scraper Config module can be used to load extractors and validators from config files.

New Features and Migration Guide

The CHANGELOG is kept updated with the bug fixes and new features of each version. When there are breaking changes, they are listed there together with suggestions for migrating old code.

Copyright

scala-scraper's Issues

Problem with parsing html snippets.

I'm continuing my VK parsing journey and this time I stumbled upon the following problem:

scala> import net.ruippeixotog.scalascraper.browser.Browser
import net.ruippeixotog.scalascraper.browser.Browser

scala> val browser = new Browser("Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:27.0) Gecko/20100101 Firefox/27.0")
browser: net.ruippeixotog.scalascraper.browser.Browser = net.ruippeixotog.scalascraper.browser.Browser@49f6eca4

scala> val doc = browser.post("http://vk.com/show_links", Map("al"->"1", "oid" -> "-73830198"))
doc: org.jsoup.nodes.Document =
<!--18641<!>groups.css,page.css,groups.js,page.js<!>0<!>6675<!>0<!>Открытая группа<!><div id="reg_bar" class="top_info_wrap fixed">
  <div class="scroll_fix">
    <div id="reg_bar_content">
      Присоединяйтесь, чтобы всегда оставаться в контакте с друзьями и близкими
      <div class="button_blue" id="reg_bar_button"><a class="button_link" href="/join" onclick="return !showBox('join.php', {act: 'box', from: nav.strLoc}, {}, event)"><button id="reg_bar_btn"><span id="reg_bar_with_arr">Зарегистрироваться</span></button></a></div>
    </div>
  </div>
</div>
<div><div class="scroll_fix">
  <div id="page_layout" style="width: 791px;">
    <div id="page_header" class="p_head p_head_l0">
      <div class="back"></div>
      <div class="left"></div>
      <div ...
scala> doc.select("div")
res2: org.jsoup.select.Elements =

scala> import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL._

scala> import net.ruippeixotog.scalascraper.dsl.DSL.Extract._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._

scala> import net.ruippeixotog.scalascraper.dsl.DSL.Parse._
import net.ruippeixotog.scalascraper.dsl.DSL.Parse._

scala> doc >> element("div")
java.util.NoSuchElementException
  at java.util.ArrayList$Itr.next(ArrayList.java:834)
  at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
  at scala.collection.IterableLike$class.head(IterableLike.scala:107)
  at scala.collection.AbstractIterable.head(Iterable.scala:54)
  at net.ruippeixotog.scalascraper.scraper.ContentExtractors$$anonfun$1.apply(HtmlExtractor.scala:87)
  at net.ruippeixotog.scalascraper.scraper.ContentExtractors$$anonfun$1.apply(HtmlExtractor.scala:87)
  at net.ruippeixotog.scalascraper.scraper.SimpleExtractor.extract(HtmlExtractor.scala:65)
  at net.ruippeixotog.scalascraper.dsl.ScrapingOps$ElementsScrapingOps$$anonfun$extract$1.apply(ScrapingOps.scala:16)
  at scalaz.Monad$$anonfun$map$1$$anonfun$apply$2.apply(Monad.scala:14)
  at scalaz.IdInstances$$anon$1.point(Id.scala:18)
  at scalaz.Monad$$anonfun$map$1.apply(Monad.scala:14)
  at scalaz.IdInstances$$anon$1.bind(Id.scala:20)
  at scalaz.Monad$class.map(Monad.scala:14)
  at scalaz.IdInstances$$anon$1.map(Id.scala:17)
  at scalaz.syntax.FunctorOps.map(FunctorSyntax.scala:9)
  at net.ruippeixotog.scalascraper.dsl.ScrapingOps$ElementsScrapingOps.extract(ScrapingOps.scala:16)
  at net.ruippeixotog.scalascraper.dsl.ScrapingOps$ElementsScrapingOps.$greater$greater(ScrapingOps.scala:20)
  ... 43 elided

It's a weird piece of HTML torn out of context, but it's what the server returns. I guess they inject it deep inside the main document. Can we parse it somehow?

Problems with injecting HtmlUnitBrowser via Guice

Long time user of your lib here. I'm using Play Framework 2.5 and trying to follow the DI patterns they use to avoid global state.

I'm passing HtmlUnitBrowser like this

class YelpService @Inject() (actorSystem: ActorSystem)(browser: HtmlUnitBrowser)

and getting

ProvisionException: Unable to provision, see the following errors:

Could not find a suitable constructor in net.ruippeixotog.scalascraper.browser.HtmlUnitBrowser. Classes must have either one (and only one) constructor annotated with @Inject or a zero-argument constructor that is not private.
at net.ruippeixotog.scalascraper.browser.HtmlUnitBrowser.class(HtmlUnitBrowser.scala:33)

There doesn't appear to be a zero-argument constructor afaik but it DOES have a default value. Perhaps I should be asking the Guice guys about this, but do you see a way of fixing this?

Per connection proxy?

I noticed the README and comments on existing issues suggest setting up a global JVM proxy for the use case of scraping behind a proxy. However, in many distributed use cases, it is necessary to switch proxy servers at high frequency.

As I was searching for answers, I noticed Jsoup supports per-connection proxy since 1.9 (https://stackoverflow.com/questions/13288471/jsoup-over-vpn-proxy). Similar support also exists in HtmlUnit: https://stackoverflow.com/questions/36398670/using-htmlunit-behind-proxy .

Based on these, it would seem straightforward to add per-connection proxy functionality to scala-scraper. I want to confirm this is a missing feature and not part of the next release before I work on it and start a pull request.

Extracting all Hn tag values in order of appearance

Is there a way to extract all the Hn tag values, in order? For example, given:

<h1>Head 1</h1>
<p>blah</p>
<h2>Head 2a</h2>
<p>asdf</p>
<h3>Head 3</h3>
<h2>Head 2b</h2>

The desired response is:

Head 1
Head 2a
Head 3
Head 2b

Here is how I would do it with Jsoup:

val html = scala.io.Source.fromFile("/path/to/index.html").mkString
val doc = org.jsoup.Jsoup.parse(html)
doc.select("h1, h2, h3, h4, h5, h6, h7").eachText()
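
A possible scala-scraper equivalent, as a sketch only: it assumes the texts content extractor from DSL.Extract and a comma-separated CSS query, and it parses the snippet above from an inline string instead of a file. The names headingsDoc and headings are hypothetical:

```scala
import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._

// The sample HTML from this issue, inlined for illustration
val headingsBrowser = JsoupBrowser()
val headingsDoc = headingsBrowser.parseString(
  """<h1>Head 1</h1>
    |<p>blah</p>
    |<h2>Head 2a</h2>
    |<p>asdf</p>
    |<h3>Head 3</h3>
    |<h2>Head 2b</h2>""".stripMargin)

// Select every heading level in one query and take the text of each match
val headings: List[String] =
  (headingsDoc >> texts("h1, h2, h3, h4, h5, h6")).toList

headings.foreach(println)
```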

Missing resources in project

I get the following when compiling the project:

Error:(14, 22) No resource found that matches the given name (at 'ettIcon' with value '@drawable/ic_edit_grey_900_24dp').
Error:(15, 32) No resource found that matches the given name (at 'ettIconInEditMode' with value '@drawable/ic_done_grey_900_24dp').
Error:(79, 26) No resource found that matches the given name (at 'src' with value '@drawable/ic_play_arrow_black_36dp').
Error:(88, 26) No resource found that matches the given name (at 'src' with value '@drawable/ic_pause_black_36dp').
Error:(98, 26) No resource found that matches the given name (at 'src' with value '@drawable/ic_favorite_black_36dp').
Error:(107, 26) No resource found that matches the given name (at 'src' with value '@drawable/ic_favorite_outline_black_36dp').
Error:(117, 26) No resource found that matches the given name (at 'src' with value '@drawable/ic_share_grey_900_36dp').
Error:(126, 26) No resource found that matches the given name (at 'src' with value '@drawable/ic_delete_black_36dp').

Inconsistent behaviour of text("#xxx")

I'm trying to scrape an Amazon page, e.g. https://www.amazon.com/dp/B0756FN69M, but I can't extract the price, even though it's in the source code:

 <span id="priceblock_ourprice" class="a-size-medium a-color-price">$15.99</span>

Here's the code:

import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.scraper.ContentExtractors.text

val browser = new JsoupBrowser
val url = "https://www.amazon.com/dp/B0756FN69M"
val doc = browser get url

println(doc >?> text("#productTitle"))
println(doc >?> text("#priceblock_ourprice"))

Output:

Some(TaoTronics Bluetooth Receiver, Portable Wireless Car Aux Adapter 3.5mm Stereo Car Kits, Bluetooth 4.0 Hands-free Bluetooth Audio Adapter for Home /Car Stereo Music Streaming Sound System)  
None

The "title" is extracted just fine.

Add support for custom locales in date parsers

When parsing pages in a foreign language, a common use case for this library, it is sometimes necessary to parse dates formatted in another locale (e.g. with different month and weekday names). We should create a built-in parser and scala-scraper-config support for those cases.

Duplicate text from sibling <td></td> elements is lost

Unless I'm misunderstanding something, there appears to be a bug when selecting text from columns in a table. When two sibling columns have duplicate text, one is not returned.

I have added a test which demonstrates the issue I have encountered via d98f01a

Make query results Iterable

I've given scala-scraper a try and noticed that when a query matches more than one element in the document, I get a String that is the concatenation of all matching elements.

Is there a way to get an iterable object out of the query results?

There is a related point: some extractions cannot be expressed using the jQuery-like CSS selector support offered by jsoup.

For example, in jsoup, getting only one occurrence out of all matches is possible only by accessing a specific element in the collection returned by the Java API (e.g. results.first()), whereas XPath lets you do it in the query definition (//span[18] takes the 18th match only).

This feature would be very useful when queries need to be defined in a configuration file.

If you are not planning to add support for XPath selectors, an alternative would be to enhance the extractor definitions so that a specific match in a result list can be accessed.

Unresolved Dependency on Import in Build.SBT (Scala/Play 2.7)

Added "net.ruippeixotog" %% "scala-scraper" % "2.1.0", in my libraryDependencies and I get:

sbt.librarymanagement.ResolveException: unresolved dependency: net.ruippeixotog#scala-scraper_2.13;2.1.0: not found

So I guess this doesn't work/isn't maintained any more?

Introduce ignoreContentType for JsoupBrowser

Are there any plans to allow setting JsoupBrowser::ignoreContentType(true)?

Currently it fails to load e.g. Mimetype=application/pdf with org.jsoup.UnsupportedMimeTypeException, and I do not see how I could obtain the content of the link.

Any plan for WebDriver?

As you probably know, Chrome now supports headless (https://developers.google.com/web/updates/2017/04/headless-chrome), and one way to call it is through WebDriver. Any plan for scala-scraper to support headless Chrome?

Exception while parsing Vkontakte page

Hello!
I'm trying to use the library to parse a Vkontakte (Russian social network) community page. At first the grabber was redirected to the mobile version, but after I implemented the trick described in ticket #5 I got the following behavior:

[info] Running grabber.Runner 
[error] (run-main-0) java.util.NoSuchElementException: head of empty list
java.util.NoSuchElementException: head of empty list
    at scala.collection.immutable.Nil$.head(List.scala:420)
    at scala.collection.immutable.Nil$.head(List.scala:417)
    at net.ruippeixotog.scalascraper.browser.Browser$$anonfun$4.apply(Browser.scala:50)
    at net.ruippeixotog.scalascraper.browser.Browser$$anonfun$4.apply(Browser.scala:49)
    at scala.Option.map(Option.scala:146)
    at net.ruippeixotog.scalascraper.browser.Browser.net$ruippeixotog$scalascraper$browser$Browser$$process(Browser.scala:49)
    at net.ruippeixotog.scalascraper.browser.Browser$$anonfun$3.apply(Browser.scala:39)
    at net.ruippeixotog.scalascraper.browser.Browser$$anonfun$3.apply(Browser.scala:39)
    at scala.Function1$$anonfun$andThen$1.apply(Function1.scala:52)
    at net.ruippeixotog.scalascraper.browser.Browser.get(Browser.scala:17)
    at grabber.Runner$.delayedEndpoint$grabber$Runner$1(grabber.scala:11)
    at grabber.Runner$delayedInit$body.apply(grabber.scala:6)
    at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
    at scala.App$class.main(App.scala:76)
    at grabber.Runner$.main(grabber.scala:6)
    at grabber.Runner.main(grabber.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)

Code snippet I used:

package grabber

import net.ruippeixotog.scalascraper.browser.Browser
import org.jsoup.{ Connection, Jsoup }

object Runner extends App {
  val browser = new Browser {
    override def requestSettings(conn: Connection) = conn.userAgent("Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:27.0) Gecko/20100101 Firefox/27.0")
  }

  val doc = browser.get("http://vk.com/scala_lang")
  print(doc)
}

build.sbt:

name := "vkgazer"
version := "0.1"
scalaVersion := "2.11.7"
libraryDependencies += "net.ruippeixotog" %% "scala-scraper" % "0.1.1"
libraryDependencies += "org.scalatest" %% "scalatest" % "2.2.4" % "test"

Potential Consulting Engagement

Rui, I contacted you early last week to inquire if you were interested in a remote engagement with my company (Solebrity). I sent 2 emails (1 from gmail & another from my company). If interested please respond. John

build for scala 2.12

Hi! Do you consider supporting scala 2.12 any time in the future?

See https://github.com/scala/make-release-notes/blob/2.12.x/projects-2.12.md

Table data extractor

Hi,
Do you have code for extracting data from tables like you have for extracting data from forms? Because of the use of rowspan and colspan attributes, it gets difficult to parse a table from the raw html. Is there an easy way to do this from the in-memory rendering of the browsers?
Regards.

.sibling{Element,Node}s method for Elements

this came up just now, not hard to work around, but getting siblings would be nice

<div>
<a href="/page-1">1</a>
<a href="/second-page">2</a>
<a href="/3rd-pg">3</a>
<a id=nextButton href="/3rd-pg">Next</a>
</div>

(doc >> element("#nextButton")).siblings.foreach(sibling => {
// #nextButton wouldn't show up as a sibling here like it would in `.parent.get.children`
val href = sibling.attr("href")
// grab stuff from each page
})

>> and >?> operators produce empty string in case of a missing attribute

Hi, thanks for your work on this project.
The title says it all. I'm scraping web pages to get information like:
doc >?> attr("content")("meta[property=og:description]")
If the meta[property=og:description] is missing I get Some("") instead of None, or simply "" with the >> operator.
I didn't really dig deep into the code as a simple wrapper around >> was enough to get the expected result. But the current behavior is a bit deceptive.
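The wrapper mentioned above could look roughly like the following. This is only a sketch that operates on plain strings; `EmptyToNone` and its methods are hypothetical names, not part of scala-scraper, and it assumes that a blank value should be treated the same as a missing one.

```scala
object EmptyToNone {
  // Turns null, empty, or whitespace-only strings into None.
  def nonEmpty(value: String): Option[String] =
    Option(value).filter(_.trim.nonEmpty)

  // Same idea for results that are already optional, e.g. Some("").
  def nonEmptyOpt(value: Option[String]): Option[String] =
    value.flatMap(nonEmpty)
}
```

The result of an extraction can then be passed through `EmptyToNone.nonEmptyOpt` before it is used.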

A build for Scala 2.10 in maven central

Is it possible to upload a build for Scala 2.10 in maven central repository?

XPath Support

is xpath support planned / already usable? couldn't find it in the docs

best regards

Handling of missing tags

How can I properly handle missing tags? I am running into quite a few problems, outlined at http://stackoverflow.com/questions/43187383/spark-dataframe-handle-option-some-none with a minimal sample at https://gist.github.com/geoHeil/bfb01427b88cf58ea755f912ce539712: sometimes I receive characters, sometimes Strings.

ignore some js when load a page with Htmlunit

I have to load a page that has a lot of JavaScript files, most of which I want to ignore, loading only a few of them. I cannot find any function for doing that.

Also, after updating the version to 1.1.0 in my SBT build script, I looked for the close all function, but it does not exist in the HtmlUnit class.

scala-scraper's implementation of HtmlUnit doesn't have .close()

I have an akka Actor that starts up every 3 hours to scrape some content using HtmlUnitBrowser (have to use this because of JS execution). Everything works fine except memory usage jumps every time it starts and stays constant at that new level. So eventually I run out of memory. I'm not 100% sure HtmlUnit is the issue but they do have a FAQ question about it specifically:

http://htmlunit.sourceforge.net/faq.html#MemoryLeak

As such, I'd like to test closing the browser after it's used in the akka actor every time. However, I don't see a .close() method on HtmlUnitBrowser.

Please advise.

Any plan for selenium wrap the headless chrome

just like https://github.com/SeleniumHQ/selenium/wiki/ChromeDriver, we can use a real browser.

Too many redirects occurred trying to load URL

I'm trying to parse url https://shop.rivegauche.ru/newstore/ru/%D0%91%D1%80%D0%B5%D0%BD%D0%B4%D1%8B/%D0%92%D1%81%D0%B5-%D0%91%D1%80%D0%B5%D0%BD%D0%B4%D1%8B/GUERLAIN/Guerlain-Meteorites-Gold-Light-The-Christmas-Collection/p/873725
But I got java.io.IOException: Too many redirects occurred trying to load URL. I also tried to parse the same URL using the jsoup (1.11.2) lib for Java and it was successful.
I guess the problem is in the method org.jsoup.helper.HttpConnection.Response#createConnection, because it has some differences compared with the Java method.

Use Either instead of VSuccess/VFailure

Would it be possible to use Either? As far as I know it would cover all the functionality of VSuccess and VFailure while reusing existing knowledge. It becomes especially interesting when using version 2.12, as Either becomes right biased. This allows you to use it in for comprehensions and handle errors more smoothly.
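To illustrate the point about for comprehensions, here is a minimal sketch using only the standard library on Scala 2.12 or newer. The validation functions and the `String` error type are made up for the example and are not scala-scraper's validation API.

```scala
object EitherValidation {
  // Hypothetical check: a title must not be blank.
  def nonEmptyTitle(title: String): Either[String, String] =
    if (title.trim.nonEmpty) Right(title.trim) else Left("missing title")

  // Hypothetical check: a count must parse as an integer.
  def parseCount(raw: String): Either[String, Int] =
    try Right(raw.trim.toInt)
    catch { case _: NumberFormatException => Left(s"not a number: $raw") }

  // Right-biased Either stops at the first Left and returns it.
  def validate(title: String, count: String): Either[String, (String, Int)] =
    for {
      t <- nonEmptyTitle(title)
      c <- parseCount(count)
    } yield (t, c)
}
```

Because the first `Left` short-circuits the comprehension, the caller gets either the combined result or the first error, without nested pattern matching.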

Why is my output rendered with the wrong encoding?

import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._
import net.ruippeixotog.scalascraper.dsl.DSL.Parse._

object Scraper {
  val browser = JsoupBrowser()

  val doc = browser.get("http://camhr.com")

  def main(args: Array[String]): Unit = {
    // Extract the element with id "footer", if present
    val items = doc >?> element("#footer")
    print(items)
  }
}

The website is displayed in English in my browser, but when I run this code the output is in Chinese.

The version of typesafe config used requires jdk 8+

Scala targets jdk 6+, so relying on jdk8 makes it tough for us to deploy this library (we use 7). Would downgrading the config library be out of the question?

Wrong types in >?>

You have an error here https://github.com/ruippeixotog/scala-scraper/blob/master/src/main/scala/net/ruippeixotog/scalascraper/dsl/ScrapingOps.scala#L38 - [B, C, D] in signature vs [B,C,C] in params.

set browser cookies

I'm fetching from a site that requires users to log in. This works fine using this lib.
However, if I restart the program, it will have lost the cookies, and needs to redo the login procedure. I'd like to be able to persist the cookies to resolve this.
Currently cookieMap is private however, preventing me from being able to load in persisted cookies. It would be nice if there were a public setter for it.

sbt dependency

The sbt dependency doesn't work when copy-pasted. The following fixes it:
libraryDependencies += "net.ruippeixotog" % "scala-scraper_2.11" % "0.1.2"

Chaining

I want to get the src of a <img class="large"> nested deep inside a div that does not have class ignored. This div is inside <div id="foo">.

In jQuery, this is how I do it:

$('#foo > div').not('.ignored').find('img.large').attr('src')

How do I do it with scala-scraper?

Proxy support

Does this support Proxy?

ContentExtractors.table throws StackOverflowError

I tried to parse a big table element with ContentExtractors.table, but the buildRow and buildTable methods are not tail recursive.

As a result, the ContentExtractors.table function threw a StackOverflowError.

The URL that failed to parse: http://www.tipness.co.jp/schedule/SHP063/month
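For reference, this is the general shape of a tail-recursive rewrite, shown on a simplified problem. It is only an illustration and not the library's actual buildRow/buildTable code: the hypothetical `buildRows` below groups a flat list of cells into rows of a fixed width, using an accumulator so the recursive call is the last operation.

```scala
import scala.annotation.tailrec

object TailRecRows {
  // Hypothetical helper: groups a flat list of cells into rows of `width` cells.
  def buildRows[A](cells: List[A], width: Int): List[List[A]] = {
    require(width > 0, "width must be positive")

    // @tailrec makes the compiler reject the method if it is not tail recursive,
    // so the loop runs in constant stack space regardless of table size.
    @tailrec
    def loop(remaining: List[A], acc: List[List[A]]): List[List[A]] =
      if (remaining.isEmpty) acc.reverse
      else {
        val (row, rest) = remaining.splitAt(width)
        loop(rest, row :: acc)
      }

    loop(cells, Nil)
  }
}
```

Rows are prepended to the accumulator and reversed once at the end, which keeps each step cheap on an immutable List.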

Build for scala 2.13.x

2.13.0 has been released. Lots of dependencies are not ready for a build, however.

Cool library. But how do I scrape from sites that load with AJAX?

Extracting text from a series of sibling elements

Hi,

This is not an issue report per se, but more like a support request. I am wondering if/how I can scrape the text/html data in the following case.

Let's assume that the HTML looks like this:

<div id="mw-content-text" class="mw-content-ltr" lang="en" dir="ltr">
    <table class="infobox vevent" style="width:22em;font-size:90%;">...</table>
    <p>...</p>
    <p>...</p>
    <p>...</p>
    <h2>
        <span id="Plot" class="mw-headline">Plot</span>
        <span class="mw-editsection">
    </h2>
    <p>...</p>
    <p>...</p>
    <h2>
        <span id="Cast" class="mw-headline">Cast</span>
        <span class="mw-editsection">
    </h2>
    ...
</div>

If you are wondering, this is what the movie pages in Wikipedia look like. I am trying to extract the plots of a bunch of movies. So essentially, what I have to do is get the element corresponding to the "Plot" header (h2), and keep grabbing the text from the [p] tags following that element until I hit the next header (h2) tag.

Now, one way I can think of is to get all the elements in the top-level div, and then do this. It would work, but I am wondering if there is a smarter way to extract this content. I can easily get to the "Plot" header (h2) element, but I couldn't figure out how to get from it to the next few sibling elements, up to the next header (h2) tag.

Thanks.
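The "get all the children of the top-level div" approach described above can be sketched with plain collection operations. `Node` here is a simplified stand-in (tag name plus text) for whatever element type the parser returns, and `sectionParagraphs` is a hypothetical helper, so this shows the traversal logic only and not a scala-scraper API.

```scala
object PlotSection {
  // Simplified stand-in for a parsed element: its tag name and its text.
  final case class Node(tag: String, text: String)

  // Collects the text of the <p> elements between the given <h2> heading
  // and the next <h2>. Returns an empty Seq if the heading is not found.
  def sectionParagraphs(children: Seq[Node], heading: String): Seq[String] =
    children
      .dropWhile(n => !(n.tag == "h2" && n.text == heading)) // skip to the heading
      .drop(1)                                               // skip the heading itself
      .takeWhile(_.tag != "h2")                              // stop at the next heading
      .filter(_.tag == "p")                                  // keep paragraphs only
      .map(_.text)
}
```

With real elements, the same dropWhile/takeWhile chain would be applied to the children of the `#mw-content-text` div.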

HTML returned for mobile version

Hi,
just wanna start by saying this is a great tool :)

So, I was trying to scrape the following URL: http://guia.uol.com.br/sao-paulo/bares/detalhes.htm?ponto=academia-da-gula_3415299801

Had some weird behaviour, so I dumped the Element object:

val browser = new Browser
val teste = browser.get("http://guia.uol.com.br/sao-paulo/bares/detalhes.htm?ponto=academia-da-gula_3415299801")
println(teste)

This was the result:

<!DOCTYPE html>
<html lang="en">
 <head> 
  <meta charset="UTF-8"> 
  <title>Guia UOL</title> 
  <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=no"> 
  <link rel="stylesheet" href="http://jsuol.com.br/c/_template/v1/_geral/css/estrutura/webfonts-v2.css"> 
  <style>body{font-family:'UOLText';background-color:#ff9833;text-align:center;margin:0}img{width:100%;max-width:319px}#topo{width:100%;text-align:center;color:#fff;font-weight:normal;margin-top:30px;margin-bottom:14px}#topo>*{line-height:55px;height:55px;vertical-align:middle}#logo{background-image:url('http://h.imguol.com/2013/sprites/selos.v4.png');-webkit-background-size:340px 204px;background-size:340px 204px;width:55px;height:55px;display:inline-block}#instalar{background-color:#fff;color:#000;text-decoration:none;font-family:arial;padding:10px;display:block;font-weight:bold;margin:23px 25px 50px 25px;font-size:14px;line-height:20px;max-width:300px}@media screen and (min-width:350px){#instalar{margin:23px auto 50px auto}}#uol{font-size:86px;vertical-align:center;color:#fff;text-decoration:none}h2{color:#fff;font-size:19px;margin-top:4px;margin-bottom:5px}.font-logo{font-family:'UOLLogo'}#info{margin-top:5px;font-size:14px;padding:0 15px;line-height:19px}footer *{margin:0;padding:5px;font-size:13px;color:#fff;text-decoration:none;background-color:#1c1c1c}#copyright{padding:0}#copyright>p{background-color:#262626}</style> 
 </head> 
 <body> 
  <h1 id="topo"> <a href="http://www.uol.com.br" id="logo"></a> <span class="">Guia</span> <a href="http://www.uol.com.br" id="uol" class="font-logo ">a</a> </h1> 
  <img src="http://imguol.com/c/guia/mapa-porteira-guia-uol-baixe-o-app.jpg" alt="mapa"> 
  <h2 id="title">Aplicativo do Guia UOL</h2> 
  <p id="info"> Parar acessar o conteúdo, baixe o Guia UOL.<br> São mais de 5.000 lugares para você passear e se divertir. </p> 
  <a href="#" id="instalar">INSTALAR AGORA</a> 
  <footer> 
   <div class="site-versao">
     Ver Guia em: 
    <a id="versao-web" href="#versao-web">Web</a> 
   </div> 
   <div id="copyright" class="bgcolor5 bg-color3"> 
    <p class="centraliza">© UOL 1996-2015</p> 
   </div> 
  </footer> 
  <script type="text/javascript">var GuiaStore=(function(){var urls={android:"https://play.google.com/store/apps/details?id=br.org.eldorado.guiauol&feature=search_result#?t=W251bGwsMSwyLDEsImJyLm9yZy5lbGRvcmFkby5ndWlhdW9sIl0",windows:false,ios:{phone:"https://itunes.apple.com/br/app/guia-uol-para-iphone/id523324609?mt=8&ls=1",tablet:"https://itunes.apple.com/br/app/guia-uol/id525174384?mt=8&uo=4"}};var platform=false;var iDevice=false;var fromUrl=decodeURIComponent(location.search.substr(1).split("=")[1]);var detectDevice=function(){var UA=navigator.userAgent;if(UA.match(/iPhone|iPod|iPad/i)!=null&&UA.match(/Safari/i)!=null){iDevice=UA.match(/iPad/)?"tablet":"phone";platform="ios";}else{if(UA.match(/Android/i)!=null){platform="android";}else{if(UA.match(/Windows NT 6.2/i)!=null&&UA.match(/Touch/i)!==null){platform="windows";}}}};var setStoreUrl=function(){if(platform==false){showUnavailableContent();}else{var url=(platform=="ios")?urls["ios"][iDevice]:urls[platform];if(url==false){showUnavailableContent();}else{changeLinkButton(url);}}};var showUnavailableContent=function(){var title=document.getElementById("title");var info=document.getElementById("info");var btn=document.getElementById("instalar");document.body.removeChild(title);info.innerHTML="O conteúdo não tem versão mobile";btn.innerHTML="VER GUIA EM VERSÃO WEB";btn.setAttribute("href",fromUrl);btn.addEventListener("click",setCookieVersion);};var changeLinkButton=function(url){var btn=document.getElementById("instalar");btn.setAttribute("href",url);};var setCookieVersion=function(){setCookie("x-user-agent-class","WEB",3000,".uol.com.br");};var setCookie=function(nome,valor,duracao,domain){var de=new Date();if(duracao){de.setDate(de.getDate()+duracao);}document.cookie=nome+"="+escape(valor)+(duracao?"; expires="+de.toUTCString():"")+"; path=/;"+((domain)?"domain="+domain:"");};var bindWebVersion=function(){var 
webBtn=document.getElementById("versao-web");webBtn.setAttribute("href",fromUrl);webBtn.addEventListener("click",setCookieVersion);};return{init:function(){detectDevice();setStoreUrl();bindWebVersion();}};})();GuiaStore.init();</script> 
  <script type="text/javascript">window.Config={"estacaoId":"guia","estacao":"guia","canal":"","Conteudo":{"tipo":"home-estacao","titulo":"Guia UOL"},"Metricas":{"estacao":"guia","subcanal":"","tipocanal":"","tagpage":""}};</script> 
  <script type="text/javascript" charset="iso-8859-1" src="http://simg.uol.com/nocache/omtr/guiauolmobile.js"></script> 
  <script type="text/javascript">var s_code=uol_sc.t();if(s_code){document.write(s_code);}</script>  
 </body>
</html>

Which is pretty different from the actual HTML contents. I used curl to fetch the content, and the headers and returned data seem normal:

airton@Airtons-MacBook-Pro ~/dev curl -i -XGET http://guia.uol.com.br/sao-paulo/bares/detalhes.htm\?ponto\=academia-da-gula_3415299801
HTTP/1.1 200 OK
Date: Fri, 03 Jul 2015 16:37:50 GMT
Server: marrakesh 1.9.8
Last-Modified: Fri, 03 Jul 2015 16:30:20 GMT
Content-Type: text/html;charset=UTF-8
Cache-Control: max-age=60
ETag: f7b039de3c3cf411ff1654f12f706b01
Expires: Fri, 03 Jul 2015 16:38:50 GMT
Vary: Accept-Encoding,User-Agent
Content-Length: 79585
Cache-Control: private, proxy-revalidate
Connection: close

Any thoughts??

browser.get() not working for some URLs

Hi,

I'm trying to scrape some public LinkedIn URLs (I'll use mine as an example) using:

val browser = new Browser
browser.get("https://www.linkedin.com/in/piercelamb")

This code is returning:

"org.jsoup.HttpStatusException: HTTP error fetching URL"

And in some cases, returning a "status=999" which doesn't make sense. I can parse that link using pure Jsoup code, i.e.

Jsoup.connect("https://www.linkedin.com/in/piercelamb").get() just fine.

I tried updating from the 0.1.1 version of your artifact to 0.1.2. I also dug into the code somewhat and saw that you seem to have separated out a trait for Browser and 2 classes, though this doesn't appear to be reflected in the artifact. The only difference from base Jsoup code appears to be your executePipeline(...) function, which I'm trying to understand since it is likely causing the problem. Perhaps it's the default cookies you pass in defaultRequestSettings?

How to change connection timeout?

Is there a way to change the request timeout of JsoupBrowser?

IntelliJ

Hi, thanks for the library,

I don't know why IntelliJ is not happy with the >> elements("something") syntax and marks it red, although it compiles.

How to keep http session?

First of all, thank you for this amazing lib, it really makes thing easier to do some web-scraping in Scala. Really, thank you.

I'm working on a side project that requires to keep a HTTP session alive, so I could log-in and navigate using that session. Is there an easy way of doing it? Right now, I see no way of doing that apart from making some refactoring in Browser and all its implementations.

Thank you

No remove method?

Hello,
I am trying to remove some elements.
How can I achieve that, the way JSoup lets you remove elements?

Asynchronous browser#get?

It would be nice if the browsers had an asynchronous version of get, so that several page loads could run at once. As a workaround, can I use a custom loader? What format does a Document need to be in to work with scala-scraper?
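Until something like this exists, one possible workaround is to wrap the blocking call in standard-library Futures. The sketch below is an assumption-laden illustration: `fetchAll` is a hypothetical helper, and `fetch` stands for any blocking page load (for example a function calling `browser.get`); whether a single browser instance can safely be shared across threads is not verified here.

```scala
import scala.concurrent.{ExecutionContext, Future, blocking}

object AsyncFetch {
  // Starts one Future per URL and collects the results in input order.
  // `blocking` hints to the execution context that the call may block,
  // so it can compensate with extra threads where supported.
  def fetchAll[D](urls: Seq[String])(fetch: String => D)(
      implicit ec: ExecutionContext): Future[Seq[D]] =
    Future.traverse(urls)(url => Future(blocking(fetch(url))))
}
```

The returned Future fails if any single load throws, so per-URL error handling (for example wrapping `fetch` in `scala.util.Try`) may be worth adding.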

scalaz 7.0.6

Play 2.3 only supports scalaz 7.0.6. Can you please release a version that uses only Scalaz 7.0.6?

I got error with this string

val org_description: Option[String] = doc >?> text(".description-ellipsis.normal p")

java.lang.NoClassDefFoundError: scalaz/std/MapSubMap

I found out that MapSubMap only exists in version 7.1.1.

get empty data return

import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._
import net.ruippeixotog.scalascraper.dsl.DSL.Parse._
import org.jsoup.select.Elements

object Main {
  def main(args: Array[String]): Unit = {
    println("We're running scala..")

    val browser = JsoupBrowser()
    val doc = browser.get("https://www.bongthom.com/job_category.html")
    val elements = doc >?> element("#footer")
    println(elements)
  }
}

But I only get the following. Why don't I see the elements?

We're running scala..
Some(JsoupElement(<footer id="footer">
 &nbsp;
</footer>))

How to check for status code?

Hi,

Is there any way to check for status code when making a HTTP request? For instance, when executing

browser.post(loginURL, Map(
        "email" -> email,
        "password" -> password,
        "op" -> "Login",
        "form_build_id" -> getLoginFormID,
        "form_id" -> "packt_user_login_form"
    ))

how can one check whether the status code is 200 (successful login) or not? One way would be to write a custom HtmlValidator to validate the returned Document, but can I do it explicitly by checking the status code?

support socks proxy in proxyutils

pretty please

Examples in the readme

Are you sure the examples in the readme match your latest version? I have problems with them. Like this one:

val items: Seq[Element] = doc >> elements(".item")

Browser.post() doesn't accept form params with the same name

post() accepts Map[String, String]

A form data set is a sequence of control-name/current-value pairs constructed from successful controls
(https://www.w3.org/TR/html401/interact/forms.html#h-17.13.3.2)

Some web services use this feature to accept array values.

For example, I've encountered something like this:

<form action="gestion.php" method=post>
<input type="text" size="20" maxlength="20" name="pNomPartie" value="">
<input type="text" size="10" maxlength="20" name="pInvite[]" value="">
<input type="text" size="10" maxlength="20" name="pInvite[]" value="">
</form>

With the current implementation, sending this form is impossible.
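The underlying limitation is easy to reproduce with the standard library alone: a Map keeps one value per key, while a sequence of pairs preserves repeated names. The sketch below uses made-up field values and a hypothetical `encode` helper; it does not call scala-scraper.

```scala
import java.net.URLEncoder

object FormParams {
  // Form fields as an ordered sequence of name/value pairs (values are invented).
  val fields: Seq[(String, String)] = Seq(
    "pNomPartie" -> "game",
    "pInvite[]" -> "alice",
    "pInvite[]" -> "bob"
  )

  // Converting to a Map keeps only the last value for a repeated name.
  val asMap: Map[String, String] = fields.toMap

  private def enc(s: String): String = URLEncoder.encode(s, "UTF-8")

  // Builds an application/x-www-form-urlencoded body, keeping every pair.
  def encode(params: Seq[(String, String)]): String =
    params.map { case (k, v) => s"${enc(k)}=${enc(v)}" }.mkString("&")
}
```

An overload of post() taking a sequence of pairs instead of a Map would avoid the loss, at the cost of a slightly less convenient call site.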