
scala-scraper's Issues

Use Either instead of VSuccess/VFailure

Would it be possible to use Either? As far as I know it would cover all the functionality of VSuccess and VFailure while reusing existing knowledge. It becomes especially interesting when using version 2.12, as Either becomes right biased. This allows you to use it in for comprehensions and handle errors more smoothly.
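
A rough illustration of what I mean, in plain Scala and not tied to the library's current API (extractTitle and extractPrice are hypothetical extraction steps): with a right-biased Either, validation results compose in a for comprehension.

def extractTitle(html: String): Either[String, String] =
  if (html.contains("<title>")) Right("parsed title") else Left("no <title> found")

def extractPrice(html: String): Either[String, BigDecimal] =
  Right(BigDecimal("15.99"))

// Failures short-circuit: the first Left encountered is returned as-is.
def extractPage(html: String): Either[String, (String, BigDecimal)] =
  for {
    title <- extractTitle(html)
    price <- extractPrice(html)
  } yield (title, price)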

Why is my output rendered with the wrong encoding?

import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._
import net.ruippeixotog.scalascraper.dsl.DSL.Parse._

object Scraper {
  val browser = JsoupBrowser()

  val doc = browser.get("http://camhr.com")

  def main(args: Array[String]): Unit = {
    // Extract the #footer element
    val items = doc >?> element("#footer")
    print(items)
  }
}

What I see on the website is in English, but when I run this code the output is in Chinese.

IntelliJ

Hi, thanks for the library,

I don't know why IntelliJ is not happy with the >> elements("something") syntax and marks it in red, even though it compiles.

>> and >?> operators produce empty string in case of a missing attribute

Hi, thanks for your work on this project.
The title says it all. I'm scraping web pages to get information such as:
doc >?> attr("content")("meta[property=og:description]")
If meta[property=og:description] is missing, I get Some("") instead of None, or simply "" with the >> operator.
I didn't dig deep into the code, as a simple wrapper around >> was enough to get the expected result, but the current behavior is a bit misleading.
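
The kind of wrapper I mean, as a sketch rather than a proposal for the library's API: treat an empty attribute value as a missing one.

import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._

val doc = JsoupBrowser().get("https://example.com/")  // any page

// Some("") collapses to None; a non-empty attribute passes through unchanged.
val description: Option[String] =
  (doc >?> attr("content")("meta[property=og:description]")).filter(_.nonEmpty)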

Inconsistent behaviour of text("#xxx")

I'm trying to scrape an Amazon page, e.g. https://www.amazon.com/dp/B0756FN69M, but I can't extract the price, even though it's in the source code:

 <span id="priceblock_ourprice" class="a-size-medium a-color-price">$15.99</span>

Here's the code:

import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.scraper.ContentExtractors.text

val browser = new JsoupBrowser
val url = "https://www.amazon.com/dp/B0756FN69M"
val doc = browser get url

println(doc >?> text("#productTitle"))
println(doc >?> text("#priceblock_ourprice"))

Output:

Some(TaoTronics Bluetooth Receiver, Portable Wireless Car Aux Adapter 3.5mm Stereo Car Kits, Bluetooth 4.0 Hands-free Bluetooth Audio Adapter for Home /Car Stereo Music Streaming Sound System)  
None

The "title" is extracted just fine.

Introduce ignoreContentType for JsoupBrowser

Are there any plans to allow setting JsoupBrowser::ignoreContentType(true)?

Currently it fails to load e.g. a URL with MIME type application/pdf, throwing org.jsoup.UnsupportedMimeTypeException, and I do not see how I could obtain the content of the link.
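
A workaround sketch in the meantime, going through Jsoup directly (its Connection supports ignoreContentType and exposes the raw body):

import org.jsoup.Jsoup

// Fetch a non-HTML resource, e.g. a PDF, as raw bytes instead of parsing it as a document.
val bytes: Array[Byte] = Jsoup.connect("https://example.com/file.pdf")
  .ignoreContentType(true)
  .execute()
  .bodyAsBytes()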

sbt dependency

The sbt dependency doesn't work when copy-pasted. The following fixes it:
libraryDependencies += "net.ruippeixotog" % "scala-scraper_2.11" % "0.1.2"

Add support for custom locales in date parsers

When parsing pages in a foreign language - a common use case for this library - it is sometimes necessary to parse dates formatted in another locale (e.g. different month and weekday names). We should add a built-in parser and scala-scraper-config support for those cases.
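
A minimal sketch of the parsing side, independent of the library's parser/config API: the Locale given to the formatter decides which month and weekday names are accepted.

import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale

// "juillet" only parses because the formatter is built with the French locale.
val fmt = DateTimeFormatter.ofPattern("d MMMM yyyy", Locale.FRENCH)
val date = LocalDate.parse("3 juillet 2015", fmt)  // 2015-07-03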

Examples in the readme

Are you sure the examples in the readme correspond to your latest version? I have problems with them, like this one:

val items: Seq[Element] = doc >> elements(".item")

Chaining

I want to get the src of an <img class="large"> nested deep inside a div that does not have the class ignored. This div is inside <div id="foo">.

In jQuery, this is how I do it:

$('#foo > div').not('.ignored').find('img.large').attr('src')

How do I do it with scala-scraper?
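
A possible translation, assuming the CSS :not() pseudo-selector that Jsoup supports: the whole jQuery chain collapses into one selector passed to the attr extractor.

import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._

val html = """<div id="foo">
  <div class="ignored"><img class="large" src="/skip.png"></div>
  <div><img class="large" src="/want.png"></div>
</div>"""

val doc = JsoupBrowser().parseString(html)
// Some("/want.png"): the div with class "ignored" is excluded by :not(.ignored).
val src: Option[String] = doc >?> attr("src")("#foo > div:not(.ignored) img.large")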

Asynchronous browser#get?

It would be nice if the browsers had an asynchronous version of get; that way you could do several page loads at once. As a workaround, can I use a custom loader? What format does a Document need to be in to work with scala-scraper?
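
As a stopgap, a simple sketch that wraps the blocking call in a Future (no change to the library assumed):

import scala.concurrent.{ExecutionContext, Future}
import net.ruippeixotog.scalascraper.browser.JsoupBrowser

val browser = JsoupBrowser()

// Runs the blocking get on the given execution context, so several loads can be in flight at once.
def getAsync(url: String)(implicit ec: ExecutionContext) = Future(browser.get(url))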

Extracting text from a series of sibling elements

Hi,

This is not an issue report per se, but more like a support request. I am wondering if/how I can scrape the text/html data in the following case.

Let's assume that the HTML looks like this:

<div id="mw-content-text" class="mw-content-ltr" lang="en" dir="ltr">
    <table class="infobox vevent" style="width:22em;font-size:90%;">...</table>
    <p>...</p>
    <p>...</p>
    <p>...</p>
    <h2>
        <span id="Plot" class="mw-headline">Plot</span>
        <span class="mw-editsection">
    </h2>
    <p>...</p>
    <p>...</p>
    <h2>
        <span id="Cast" class="mw-headline">Cast</span>
        <span class="mw-editsection">
    </h2>
    ...
</div>

If you are wondering, this is how movie pages on Wikipedia look. I am trying to extract the plots of a bunch of movies. So essentially, what I have to do is get the element corresponding to the "Plot" header (h2) and keep grabbing the text from the <p> tags following it, until I hit the next header (h2) tag.

Now, one way I can think of is to get all the elements in the top-level div and then do this. It would work, but I am wondering if there is a smarter way to extract this content. I can easily get to the "Plot" header (h2) element, but I couldn't figure out how to get the next few sibling elements from it until the next header (h2) tag.

Thanks.
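
One way to express "everything between the Plot heading and the next h2", assuming a version where the elementList extractor is available (the URL below is just a placeholder): walk the div's direct children as a list.

import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._

val doc = JsoupBrowser().get("https://en.wikipedia.org/wiki/Some_Movie")  // placeholder URL

val children = doc >> elementList("#mw-content-text > *")
val plot = children
  .dropWhile(e => !(e.tagName == "h2" && (e >?> element("#Plot")).isDefined))
  .drop(1)                          // skip the Plot heading itself
  .takeWhile(_.tagName != "h2")     // stop at the next section heading
  .collect { case p if p.tagName == "p" => p.text }
  .mkString("\n")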

Table data extractor

Hi,
Do you have code for extracting data from tables, like you have for extracting data from forms? Because of the use of rowspan and colspan attributes, it gets difficult to parse a table from the raw HTML. Is there an easy way to do this from the in-memory rendering of the browsers?
Regards.

No remove method?

Hello,
I am trying to remove some elements.
How can I achieve that, like Jsoup's remove on elements?
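
One workaround sketch, assuming JsoupBrowser is in use and that its documents expose the underlying Jsoup node: mutate the wrapped Jsoup document, since scala-scraper's own model is read-only.

import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.browser.JsoupBrowser.JsoupDocument

val doc = JsoupBrowser().parseString("""<div><span class="ad">x</span><p>keep me</p></div>""")

doc match {
  case JsoupDocument(underlying) => underlying.select(".ad").remove()  // Jsoup's own remove
  case _                         => // other browser implementations would need their own approach
}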

Problems with injecting HtmlUnitBrowser via Guice

Long-time user of your lib here. I'm using Play Framework 2.5 and trying to follow the DI patterns they use to avoid global state.

I'm passing HtmlUnitBrowser like this

class YelpService @Inject() (actorSystem: ActorSystem)(browser: HtmlUnitBrowser)

and getting

ProvisionException: Unable to provision, see the following errors:

  1. Could not find a suitable constructor in net.ruippeixotog.scalascraper.browser.HtmlUnitBrowser. Classes must have either one (and only one) constructor annotated with @Inject or a zero-argument constructor that is not private.
    at net.ruippeixotog.scalascraper.browser.HtmlUnitBrowser.class(HtmlUnitBrowser.scala:33)

There doesn't appear to be a zero-argument constructor AFAIK, but the existing constructor DOES have default values. Perhaps I should be asking the Guice folks about this, but do you see a way of fixing this?
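
A sketch of the usual Guice escape hatch: instead of asking Guice to pick a constructor, construct the browser yourself in a module with a @Provides method (and enable the module in Play via play.modules.enabled).

import com.google.inject.{AbstractModule, Provides, Singleton}
import net.ruippeixotog.scalascraper.browser.HtmlUnitBrowser

class ScraperModule extends AbstractModule {
  override def configure(): Unit = ()

  // Guice never inspects HtmlUnitBrowser's constructors; it just uses this instance.
  @Provides @Singleton
  def browser(): HtmlUnitBrowser = new HtmlUnitBrowser()
}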

Too many redirects occurred trying to load URL

I'm trying to parse url https://shop.rivegauche.ru/newstore/ru/%D0%91%D1%80%D0%B5%D0%BD%D0%B4%D1%8B/%D0%92%D1%81%D0%B5-%D0%91%D1%80%D0%B5%D0%BD%D0%B4%D1%8B/GUERLAIN/Guerlain-Meteorites-Gold-Light-The-Christmas-Collection/p/873725
But I got java.io.IOException: Too many redirects occurred trying to load URL. I also tried to parse the same URL using the jsoup (1.11.2) library for Java directly, and it succeeded.
I guess the problem is in org.jsoup.helper.HttpConnection.Response#createConnection, because the way it is used here has some differences compared with the plain Java usage.

set browser cookies

I'm fetching from a site that requires users to log in. This works fine using this lib.
However, if I restart the program, it will have lost the cookies, and needs to redo the login procedure. I'd like to be able to persist the cookies to resolve this.
Currently cookieMap is private however, preventing me from being able to load in persisted cookies. It would be nice if there were a public setter for it.

Unresolved Dependency on Import in Build.SBT (Scala/Play 2.7)

Added "net.ruippeixotog" %% "scala-scraper" % "2.1.0", in my libraryDependencies and I get:

sbt.librarymanagement.ResolveException: unresolved dependency: net.ruippeixotog#scala-scraper_2.13;2.1.0: not found

So I guess this doesn't work/isn't maintained any more?
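
A possible workaround, assuming the 2.1.0 artifact was only published for earlier Scala versions at the time: build against a Scala version for which the artifact actually exists.

// build.sbt
scalaVersion := "2.12.8"
libraryDependencies += "net.ruippeixotog" %% "scala-scraper" % "2.1.0"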

Empty data returned

import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._
import net.ruippeixotog.scalascraper.dsl.DSL.Parse._

object Main {
  def main(args: Array[String]): Unit = {
    println("We're running scala..")

    val browser = JsoupBrowser()
    val doc = browser.get("https://www.bongthom.com/job_category.html")
    val elements = doc >?> element("#footer")
    println(elements)
  }
}

But this is all I get; why can't I see the elements inside the footer?

We're running scala..
Some(JsoupElement(<footer id="footer">
 &nbsp;
</footer>))

Exception while parsing Vkontakte page

Hello!
I'm trying to use the library to parse a Vkontakte (Russian social network) community page. At first the grabber was redirected to the mobile version, but after I implemented the trick described in ticket #5 I got the following behavior:

[info] Running grabber.Runner 
[error] (run-main-0) java.util.NoSuchElementException: head of empty list
java.util.NoSuchElementException: head of empty list
    at scala.collection.immutable.Nil$.head(List.scala:420)
    at scala.collection.immutable.Nil$.head(List.scala:417)
    at net.ruippeixotog.scalascraper.browser.Browser$$anonfun$4.apply(Browser.scala:50)
    at net.ruippeixotog.scalascraper.browser.Browser$$anonfun$4.apply(Browser.scala:49)
    at scala.Option.map(Option.scala:146)
    at net.ruippeixotog.scalascraper.browser.Browser.net$ruippeixotog$scalascraper$browser$Browser$$process(Browser.scala:49)
    at net.ruippeixotog.scalascraper.browser.Browser$$anonfun$3.apply(Browser.scala:39)
    at net.ruippeixotog.scalascraper.browser.Browser$$anonfun$3.apply(Browser.scala:39)
    at scala.Function1$$anonfun$andThen$1.apply(Function1.scala:52)
    at net.ruippeixotog.scalascraper.browser.Browser.get(Browser.scala:17)
    at grabber.Runner$.delayedEndpoint$grabber$Runner$1(grabber.scala:11)
    at grabber.Runner$delayedInit$body.apply(grabber.scala:6)
    at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
    at scala.App$class.main(App.scala:76)
    at grabber.Runner$.main(grabber.scala:6)
    at grabber.Runner.main(grabber.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)

Code snippet I used:

package grabber

import net.ruippeixotog.scalascraper.browser.Browser
import org.jsoup.{ Connection, Jsoup }

object Runner extends App {
  val browser = new Browser {
    override def requestSettings(conn: Connection) = conn.userAgent("Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:27.0) Gecko/20100101 Firefox/27.0")
  }

  val doc = browser.get("http://vk.com/scala_lang")
  print(doc)
}

build.sbt:

name := "vkgazer"
version := "0.1"
scalaVersion := "2.11.7"
libraryDependencies += "net.ruippeixotog" %% "scala-scraper" % "0.1.1"
libraryDependencies += "org.scalatest" %% "scalatest" % "2.2.4" % "test"

Problem with parsing html snippets.

I'm continuing my VK parsing journey and this time I stumbled upon the following problem:

scala> import net.ruippeixotog.scalascraper.browser.Browser
import net.ruippeixotog.scalascraper.browser.Browser

scala> val browser = new Browser("Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:27.0) Gecko/20100101 Firefox/27.0")
browser: net.ruippeixotog.scalascraper.browser.Browser = net.ruippeixotog.scalascraper.browser.Browser@49f6eca4

scala> val doc = browser.post("http://vk.com/show_links", Map("al"->"1", "oid" -> "-73830198"))
doc: org.jsoup.nodes.Document =
<!--18641<!>groups.css,page.css,groups.js,page.js<!>0<!>6675<!>0<!>Открытая группа<!><div id="reg_bar" class="top_info_wrap fixed">
  <div class="scroll_fix">
    <div id="reg_bar_content">
      Присоединяйтесь, чтобы всегда оставаться в контакте с друзьями и близкими
      <div class="button_blue" id="reg_bar_button"><a class="button_link" href="/join" onclick="return !showBox('join.php', {act: 'box', from: nav.strLoc}, {}, event)"><button id="reg_bar_btn"><span id="reg_bar_with_arr">Зарегистрироваться</span></button></a></div>
    </div>
  </div>
</div>
<div><div class="scroll_fix">
  <div id="page_layout" style="width: 791px;">
    <div id="page_header" class="p_head p_head_l0">
      <div class="back"></div>
      <div class="left"></div>
      <div ...
scala> doc.select("div")
res2: org.jsoup.select.Elements =

scala> import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL._

scala> import net.ruippeixotog.scalascraper.dsl.DSL.Extract._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._

scala> import net.ruippeixotog.scalascraper.dsl.DSL.Parse._
import net.ruippeixotog.scalascraper.dsl.DSL.Parse._

scala> doc >> element("div")
java.util.NoSuchElementException
  at java.util.ArrayList$Itr.next(ArrayList.java:834)
  at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
  at scala.collection.IterableLike$class.head(IterableLike.scala:107)
  at scala.collection.AbstractIterable.head(Iterable.scala:54)
  at net.ruippeixotog.scalascraper.scraper.ContentExtractors$$anonfun$1.apply(HtmlExtractor.scala:87)
  at net.ruippeixotog.scalascraper.scraper.ContentExtractors$$anonfun$1.apply(HtmlExtractor.scala:87)
  at net.ruippeixotog.scalascraper.scraper.SimpleExtractor.extract(HtmlExtractor.scala:65)
  at net.ruippeixotog.scalascraper.dsl.ScrapingOps$ElementsScrapingOps$$anonfun$extract$1.apply(ScrapingOps.scala:16)
  at scalaz.Monad$$anonfun$map$1$$anonfun$apply$2.apply(Monad.scala:14)
  at scalaz.IdInstances$$anon$1.point(Id.scala:18)
  at scalaz.Monad$$anonfun$map$1.apply(Monad.scala:14)
  at scalaz.IdInstances$$anon$1.bind(Id.scala:20)
  at scalaz.Monad$class.map(Monad.scala:14)
  at scalaz.IdInstances$$anon$1.map(Id.scala:17)
  at scalaz.syntax.FunctorOps.map(FunctorSyntax.scala:9)
  at net.ruippeixotog.scalascraper.dsl.ScrapingOps$ElementsScrapingOps.extract(ScrapingOps.scala:16)
  at net.ruippeixotog.scalascraper.dsl.ScrapingOps$ElementsScrapingOps.$greater$greater(ScrapingOps.scala:20)
  ... 43 elided

It's a weird piece of HTML torn out of context, but it's the server that returns it. I guess they inject it deep inside the main document. Can we parse it somehow?

Per connection proxy?

I noticed the README and comments on existing issues suggest setting up a global JVM proxy for the use case of scraping behind a proxy. However, in many distributed use cases it is necessary to switch proxy servers at high frequency.

As I was searching for answers, I noticed Jsoup supports per-connection proxy since 1.9 (https://stackoverflow.com/questions/13288471/jsoup-over-vpn-proxy). Similar support also exists in HtmlUnit: https://stackoverflow.com/questions/36398670/using-htmlunit-behind-proxy .

Based on these, it seems straightforward to add per-connection proxy functionality to scala-scraper. I want to confirm this is a missing feature and not part of the next release before I work on it and open a pull request.

scalaz 7.0.6

Play 2.3 only supports scalaz 7.0.6. Can you please release a version that uses only Scalaz 7.0.6?

I got an error with this line:

val org_description: Option[String] = doc >?> text(".description-ellipsis.normal p")

java.lang.NoClassDefFoundError: scalaz/std/MapSubMap

I found out that MapSubMap only exists in version 7.1.1.

scala-scraper's implementation of HtmlUnit doesn't have .close()

I have an akka Actor that starts up every 3 hours to scrape some content using HtmlUnitBrowser (have to use this because of JS execution). Everything works fine except memory usage jumps every time it starts and stays constant at that new level. So eventually I run out of memory. I'm not 100% sure HtmlUnit is the issue but they do have a FAQ question about it specifically:

http://htmlunit.sourceforge.net/faq.html#MemoryLeak

As such, I'd like to try closing the browser after it's used in the Akka actor every time. However, I don't see a .close() method on HtmlUnitBrowser.

Please advise.

Extracting all Hn tag values in order of appearance

Is there a way to extract all the Hn tag values, in order? For example, given:

<h1>Head 1</h1>
<p>blah</p>
<h2>Head 2a</h2>
<p>asdf</p>
<h3>Head 3</h3>
<h2>Head 2b</h2>

The desired response is:

Head 1
Head 2a
Head 3
Head 2b

Here is how I would do it with Jsoup:

val html = scala.io.Source.fromFile("/path/to/index.html").mkString
val doc = org.jsoup.Jsoup.parse(html)
doc.select("h1, h2, h3, h4, h5, h6, h7").eachText()

XPath Support

Is XPath support planned or already usable? I couldn't find it in the docs.

best regards

Browser.post() doesn't accept form params with the same name

post() accepts Map[String, String]

A form data set is a sequence of control-name/current-value pairs constructed from successful controls
(https://www.w3.org/TR/html401/interact/forms.html#h-17.13.3.2)

Some web services use this feature to accept array values.

For example, I've encountered something like this:

<form action="gestion.php" method=post>
<input type="text" size="20" maxlength="20" name="pNomPartie" value="">
<input type="text" size="10" maxlength="20" name="pInvite[]" value="">
<input type="text" size="10" maxlength="20" name="pInvite[]" value="">
</form>

With the current implementation, sending such a form is impossible.
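
A workaround sketch in the meantime, going through Jsoup directly: Connection.data can be called repeatedly with the same name, so duplicate keys survive.

import org.jsoup.Jsoup

val response = Jsoup.connect("http://example.com/gestion.php")  // placeholder host
  .data("pNomPartie", "my game")
  .data("pInvite[]", "alice")
  .data("pInvite[]", "bob")   // same name sent twice, as the form expects
  .post()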

How to keep http session?

First of all, thank you for this amazing lib; it really makes it easier to do web scraping in Scala. Really, thank you.

I'm working on a side project that requires keeping an HTTP session alive, so I can log in and navigate using that session. Is there an easy way of doing it? Right now, I see no way of doing that apart from making some refactoring in Browser and all its implementations.

Thank you
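
A sketch of one approach, assuming the Jsoup-backed browser keeps its cookies between requests within the same instance (as other issues here suggest), so the session lasts as long as the instance does:

import net.ruippeixotog.scalascraper.browser.JsoupBrowser

val browser = JsoupBrowser()

// Log in once; the session cookie is stored inside this browser instance.
browser.post("https://example.com/login", Map("user" -> "me", "password" -> "secret"))

// Subsequent requests from the same instance are made within that session.
val dashboard = browser.get("https://example.com/dashboard")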

Potential Consulting Engagement

Rui, I contacted you early last week to inquire whether you were interested in a remote engagement with my company (Solebrity). I sent 2 emails (1 from Gmail & another from my company). If interested, please respond. John

HTML returned for mobile version

Hi,
just wanna start by saying this is a great tool :)

So, I was trying to scrape the following URL: http://guia.uol.com.br/sao-paulo/bares/detalhes.htm?ponto=academia-da-gula_3415299801

Had some weird behaviour, so I dumped the Element object:

val browser = new Browser
val teste = browser.get("http://guia.uol.com.br/sao-paulo/bares/detalhes.htm?ponto=academia-da-gula_3415299801")
println(teste)

This was the result:

<!DOCTYPE html>
<html lang="en">
 <head> 
  <meta charset="UTF-8"> 
  <title>Guia UOL</title> 
  <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=no"> 
  <link rel="stylesheet" href="http://jsuol.com.br/c/_template/v1/_geral/css/estrutura/webfonts-v2.css"> 
  <style>body{font-family:'UOLText';background-color:#ff9833;text-align:center;margin:0}img{width:100%;max-width:319px}#topo{width:100%;text-align:center;color:#fff;font-weight:normal;margin-top:30px;margin-bottom:14px}#topo>*{line-height:55px;height:55px;vertical-align:middle}#logo{background-image:url('http://h.imguol.com/2013/sprites/selos.v4.png');-webkit-background-size:340px 204px;background-size:340px 204px;width:55px;height:55px;display:inline-block}#instalar{background-color:#fff;color:#000;text-decoration:none;font-family:arial;padding:10px;display:block;font-weight:bold;margin:23px 25px 50px 25px;font-size:14px;line-height:20px;max-width:300px}@media screen and (min-width:350px){#instalar{margin:23px auto 50px auto}}#uol{font-size:86px;vertical-align:center;color:#fff;text-decoration:none}h2{color:#fff;font-size:19px;margin-top:4px;margin-bottom:5px}.font-logo{font-family:'UOLLogo'}#info{margin-top:5px;font-size:14px;padding:0 15px;line-height:19px}footer *{margin:0;padding:5px;font-size:13px;color:#fff;text-decoration:none;background-color:#1c1c1c}#copyright{padding:0}#copyright>p{background-color:#262626}</style> 
 </head> 
 <body> 
  <h1 id="topo"> <a href="http://www.uol.com.br" id="logo"></a> <span class="">Guia</span> <a href="http://www.uol.com.br" id="uol" class="font-logo ">a</a> </h1> 
  <img src="http://imguol.com/c/guia/mapa-porteira-guia-uol-baixe-o-app.jpg" alt="mapa"> 
  <h2 id="title">Aplicativo do Guia UOL</h2> 
  <p id="info"> Parar acessar o conteúdo, baixe o Guia UOL.<br> São mais de 5.000 lugares para você passear e se divertir. </p> 
  <a href="#" id="instalar">INSTALAR AGORA</a> 
  <footer> 
   <div class="site-versao">
     Ver Guia em: 
    <a id="versao-web" href="#versao-web">Web</a> 
   </div> 
   <div id="copyright" class="bgcolor5 bg-color3"> 
    <p class="centraliza">© UOL 1996-2015</p> 
   </div> 
  </footer> 
  <script type="text/javascript">var GuiaStore=(function(){var urls={android:"https://play.google.com/store/apps/details?id=br.org.eldorado.guiauol&feature=search_result#?t=W251bGwsMSwyLDEsImJyLm9yZy5lbGRvcmFkby5ndWlhdW9sIl0",windows:false,ios:{phone:"https://itunes.apple.com/br/app/guia-uol-para-iphone/id523324609?mt=8&ls=1",tablet:"https://itunes.apple.com/br/app/guia-uol/id525174384?mt=8&uo=4"}};var platform=false;var iDevice=false;var fromUrl=decodeURIComponent(location.search.substr(1).split("=")[1]);var detectDevice=function(){var UA=navigator.userAgent;if(UA.match(/iPhone|iPod|iPad/i)!=null&&UA.match(/Safari/i)!=null){iDevice=UA.match(/iPad/)?"tablet":"phone";platform="ios";}else{if(UA.match(/Android/i)!=null){platform="android";}else{if(UA.match(/Windows NT 6.2/i)!=null&&UA.match(/Touch/i)!==null){platform="windows";}}}};var setStoreUrl=function(){if(platform==false){showUnavailableContent();}else{var url=(platform=="ios")?urls["ios"][iDevice]:urls[platform];if(url==false){showUnavailableContent();}else{changeLinkButton(url);}}};var showUnavailableContent=function(){var title=document.getElementById("title");var info=document.getElementById("info");var btn=document.getElementById("instalar");document.body.removeChild(title);info.innerHTML="O conteúdo não tem versão mobile";btn.innerHTML="VER GUIA EM VERSÃO WEB";btn.setAttribute("href",fromUrl);btn.addEventListener("click",setCookieVersion);};var changeLinkButton=function(url){var btn=document.getElementById("instalar");btn.setAttribute("href",url);};var setCookieVersion=function(){setCookie("x-user-agent-class","WEB",3000,".uol.com.br");};var setCookie=function(nome,valor,duracao,domain){var de=new Date();if(duracao){de.setDate(de.getDate()+duracao);}document.cookie=nome+"="+escape(valor)+(duracao?"; expires="+de.toUTCString():"")+"; path=/;"+((domain)?"domain="+domain:"");};var bindWebVersion=function(){var webBtn=document.getElementById("versao-web");webBtn.setAttribute("href",fromUrl);webBtn.addEventListener("click",setCookieVersion);};return{init:function(){detectDevice();setStoreUrl();bindWebVersion();}};})();GuiaStore.init();</script> 
  <script type="text/javascript">window.Config={"estacaoId":"guia","estacao":"guia","canal":"","Conteudo":{"tipo":"home-estacao","titulo":"Guia UOL"},"Metricas":{"estacao":"guia","subcanal":"","tipocanal":"","tagpage":""}};</script> 
  <script type="text/javascript" charset="iso-8859-1" src="http://simg.uol.com/nocache/omtr/guiauolmobile.js"></script> 
  <script type="text/javascript">var s_code=uol_sc.t();if(s_code){document.write(s_code);}</script>  
 </body>
</html>

Which is pretty different from the actual HTML contents. I used curl to fetch the same URL; the headers and returned data seem normal:

airton@Airtons-MacBook-Pro ~/dev curl -i -XGET http://guia.uol.com.br/sao-paulo/bares/detalhes.htm\?ponto\=academia-da-gula_3415299801
HTTP/1.1 200 OK
Date: Fri, 03 Jul 2015 16:37:50 GMT
Server: marrakesh 1.9.8
Last-Modified: Fri, 03 Jul 2015 16:30:20 GMT
Content-Type: text/html;charset=UTF-8
Cache-Control: max-age=60
ETag: f7b039de3c3cf411ff1654f12f706b01
Expires: Fri, 03 Jul 2015 16:38:50 GMT
Vary: Accept-Encoding,User-Agent
Content-Length: 79585
Cache-Control: private, proxy-revalidate
Connection: close

Any thoughts??

ignore some js when load a page with Htmlunit

I have to load a page that includes a lot of JavaScript files; I want to ignore most of them and only run a few of them. I cannot find any function for doing that.

Also, when I updated the version to 1.1.0 in my sbt build script, I searched for the close-all function, but it does not seem to exist on the HtmlUnit class.

Make query results Iterable

I've given scala-scraper a try and I noticed that when a query matches more than one element in the document, I get a String that is the concatenation of all matching elements.

Is there a way to get an iterable object out of the query results?

There's more to my point: there are some extractions I cannot express using the "jQuery-like" CSS selector support offered by Jsoup.

For example, getting only one occurrence out of all matches is possible in Jsoup only by accessing a specific element in the collection returned via the Java API (e.g. results.first()), whereas XPath lets you do it in the query definition (//span[18] takes the 18th match only).

This feature would be very nice when queries need to be defined in a configuration file.

In case you are not planning to add support for XPath selectors, an alternative could be enhancing the extractor definitions to allow accessing a specific match in a result list.
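
For what it's worth, a sketch assuming a version where the elementList extractor is available: it yields a List[Element] instead of concatenated text, so picking a specific match is plain collection code.

import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._
import net.ruippeixotog.scalascraper.model.Element

val doc = JsoupBrowser().parseString("<p><span>a</span><span>b</span><span>c</span></p>")

val spans: List[Element] = doc >> elementList("span")
val spanTexts: List[String] = spans.map(_.text)   // List("a", "b", "c"), one entry per match
val eighteenth: Option[Element] = spans.lift(17)  // roughly //span[18] in XPath terms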

browser.get() not working for some URLs

Hi,

I'm trying to scrape some public LinkedIn URLs (I'll use mine as an example) using:

val browser = new Browser
browser.get("https://www.linkedin.com/in/piercelamb").

This code is returning:

"org.jsoup.HttpStatusException: HTTP error fetching URL"

And in some cases it returns "status=999", which doesn't make sense. I can parse that link using pure Jsoup code, i.e.

Jsoup.connect("https://www.linkedin.com/in/piercelamb").get() just fine.

I tried updating from the 0.1.1 version of your artifact to 0.1.2. I also dug into the code somewhat and saw that you seem to have separated out a trait for Browser and 2 classes, though this doesn't appear to be reflected in the artifact. The only difference from base Jsoup code appears to be your executePipeline(...) function, which I'm trying to understand since it is likely causing the problem. Perhaps it's the default cookies you pass in defaultRequestSettings?

.sibling{Element,Node}s method for Elements

this came up just now, not hard to work around, but getting siblings would be nice

<div>
<a href="/page-1">1</a>
<a href="/second-page">2</a>
<a href="/3rd-pg">3</a>
<a id=nextButton href="/3rd-pg">Next</a>
</div>
(doc >> element("#nextButton")).siblings.foreach(sibling => {
// #nextButton wouldn't show up as a sibling here like it would in `.parent.get.children`
val href = sibling.attr("href")
// grab stuff from each page
})
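
A workaround sketch until a dedicated siblings method exists: go through the parent and filter the element itself out of its children.

import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._

val html = """<div>
  <a href="/page-1">1</a>
  <a href="/second-page">2</a>
  <a href="/3rd-pg">3</a>
  <a id="nextButton" href="/3rd-pg">Next</a>
</div>"""

val doc = JsoupBrowser().parseString(html)
val next = doc >> element("#nextButton")

// Unlike .parent.get.children, this drops #nextButton itself from the result.
val siblings = next.parent.toSeq.flatMap(_.children).filterNot(_ == next)
siblings.foreach { sibling =>
  val href = sibling.attr("href")
  // grab stuff from each page
}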

Missing resources in project

I get the following when compiling the project:

Error:(14, 22) No resource found that matches the given name (at 'ettIcon' with value '@drawable/ic_edit_grey_900_24dp').
Error:(15, 32) No resource found that matches the given name (at 'ettIconInEditMode' with value '@drawable/ic_done_grey_900_24dp').
Error:(79, 26) No resource found that matches the given name (at 'src' with value '@drawable/ic_play_arrow_black_36dp').
Error:(88, 26) No resource found that matches the given name (at 'src' with value '@drawable/ic_pause_black_36dp').
Error:(98, 26) No resource found that matches the given name (at 'src' with value '@drawable/ic_favorite_black_36dp').
Error:(107, 26) No resource found that matches the given name (at 'src' with value '@drawable/ic_favorite_outline_black_36dp').
Error:(117, 26) No resource found that matches the given name (at 'src' with value '@drawable/ic_share_grey_900_36dp').
Error:(126, 26) No resource found that matches the given name (at 'src' with value '@drawable/ic_delete_black_36dp').

How to check for status code?

Hi,

Is there any way to check the status code when making an HTTP request? For instance, when executing

browser.post(loginURL, Map(
        "email" -> email,
        "password" -> password,
        "op" -> "Login",
        "form_build_id" -> getLoginFormID,
        "form_id" -> "packt_user_login_form"
    ))

how can one check whether the status code is 200 (successful login) or not? One way would be to write a custom HtmlValidator to validate the returned Document, but can I do it explicitly by checking the status code?
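
A workaround sketch going through Jsoup directly, which does expose the HTTP status code (loginURL, email, password and getLoginFormID are the same values as in the snippet above):

import org.jsoup.Connection.Method
import org.jsoup.Jsoup

val response = Jsoup.connect(loginURL)
  .data("email", email)
  .data("password", password)
  .data("op", "Login")
  .data("form_build_id", getLoginFormID)
  .data("form_id", "packt_user_login_form")
  .method(Method.POST)
  .execute()

// 200 means the POST itself succeeded; whether the login worked may still need an HtmlValidator.
val loggedIn = response.statusCode() == 200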

Duplicate text from sibling <td></td> elements is lost

Unless I'm misunderstanding something, there appears to be a bug when selecting text from columns in a table. When two sibling columns have duplicate text, one is not returned.

I have added a test which demonstrates the issue I have encountered via d98f01a
