
Comments (5)

ruippeixotog commented on May 23, 2024

Unfortunately, I can't seem to produce a document that has an HTML element with ID #priceblock_ourprice. On my computer, in the Amazon page downloaded by JsoupBrowser, the price is in this element:

scala> println(doc >?> text("#color_name_1_price"))
Some($15.99)

This may be a difference in localization or user-agent. It may help to change the user agent used by JsoupBrowser (by passing it in its constructor) to make it match your browser.
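As a minimal sketch of that suggestion, the snippet below builds a JsoupBrowser with a custom user agent passed to its constructor. The user-agent string and the `browserWithUserAgent` helper are illustrative assumptions, not values from this thread; copy the real string from your own browser. The URL is read from the command line so that nothing is fetched unless one is supplied.

```scala
import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._

object UserAgentExample {
  // Assumption: an illustrative desktop user agent; replace it with your browser's own.
  val desktopUserAgent: String =
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"

  // Hypothetical helper: wraps the constructor call mentioned above.
  def browserWithUserAgent(userAgent: String): JsoupBrowser =
    new JsoupBrowser(userAgent)

  def main(args: Array[String]): Unit = {
    val browser = browserWithUserAgent(desktopUserAgent)

    // Only hit the network when a URL is passed as the first argument.
    args.headOption.foreach { url =>
      val doc = browser.get(url)
      println(doc >?> text("#priceblock_ourprice"))
    }
  }
}
```

Even with a matching user agent, the server may still return different markup for other reasons, such as the requester's location.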

If you are sure that an HTML document has a valid element which scala-scraper can't find by its id, please provide a static HTML page with which I can reproduce the problem (Document#toHtml can be used for that in scala-scraper).
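One possible way to produce such a static page is sketched below: it writes the output of Document#toHtml to a file and parses it back from disk. The `saveHtml` helper, the temporary file, and the in-memory page are assumptions made for illustration; in practice the document would come from `browser.get(...)`.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._
import net.ruippeixotog.scalascraper.model.Document

object SaveStaticPage {
  // Hypothetical helper: writes the document's HTML to `target`, encoded as UTF-8.
  def saveHtml(doc: Document, target: Path): Path =
    Files.write(target, doc.toHtml.getBytes(StandardCharsets.UTF_8))

  def main(args: Array[String]): Unit = {
    val browser = new JsoupBrowser

    // Assumption: a tiny in-memory page standing in for the downloaded one.
    val doc = browser.parseString(
      """<html><body><span id="priceblock_ourprice">$15.99</span></body></html>""")

    val saved = saveHtml(doc, Files.createTempFile("page", ".html"))

    // Reload the saved file to check that the element is still found.
    val reloaded = browser.parseFile(saved.toString)
    println(reloaded >?> text("#priceblock_ourprice"))
  }
}
```

The saved file can then be attached to the issue so that the lookup can be reproduced without any network access.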

from scala-scraper.

wiradikusuma commented on May 23, 2024

I checked in both Chrome and Jsoup; both contain #priceblock_ourprice. Please see the attachment.

attachment.zip

Could it be A/B testing from Amazon? (same URL, but we receive different content)


ruippeixotog commented on May 23, 2024

Oh, I see; it happens because that item does not ship to Portugal, and so I am not shown the bigger priceblock_ourprice price label.

However, I'm able to successfully extract the price both with from_chrome.html and with from_jsoup.html:

scala> val browser = new JsoupBrowser
browser: net.ruippeixotog.scalascraper.browser.JsoupBrowser = net.ruippeixotog.scalascraper.browser.JsoupBrowser@44de88e4

scala> val doc = browser parseFile "from_chrome.html"
doc: browser.DocumentType = (...)

scala> println(doc >?> text("#priceblock_ourprice"))
Some($15.99)

scala> val doc = browser parseFile "from_jsoup.html"
doc: browser.DocumentType = (...)

scala> println(doc >?> text("#priceblock_ourprice"))
Some($15.99)

Does this also happen to you when you load the HTML files from disk?


wiradikusuma commented on May 23, 2024

I've found the culprit: encoding. I need to explicitly tell Jsoup to use UTF-8. This works:

val doc = browser.parseInputStream(new URL(url).openStream, "UTF-8")

Reading from my HTML files works because I saved them as UTF-8.
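For reference, here is a small self-contained sketch of the same call that also closes the stream after parsing. The `priceFrom` helper and the in-memory page are assumptions made for illustration, so the example runs without a network connection.

```scala
import java.io.{ByteArrayInputStream, InputStream}
import java.nio.charset.StandardCharsets

import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._

object ExplicitCharset {
  private val browser = new JsoupBrowser

  // Hypothetical helper: parses the stream as UTF-8 and looks up the price element.
  // The stream is closed once parsing and extraction are done.
  def priceFrom(in: InputStream): Option[String] =
    try {
      val doc = browser.parseInputStream(in, "UTF-8")
      doc >?> text("#priceblock_ourprice")
    } finally in.close()

  def main(args: Array[String]): Unit = {
    // Assumption: an in-memory page standing in for the HTTP response body.
    val html = """<html><body><span id="priceblock_ourprice">$15.99</span></body></html>"""
    val bytes = html.getBytes(StandardCharsets.UTF_8)

    println(priceFrom(new ByteArrayInputStream(bytes)))
  }
}
```

For a live page, the same helper could be called as `priceFrom(new URL(url).openStream)`, with `java.net.URL` imported.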

Thanks for taking the time to investigate my issue. If you know a better way, feel free to add it; otherwise, just close this. Thanks!


ruippeixotog commented on May 23, 2024

I'm glad you found the problem :) It's a strange issue nonetheless, since JsoupBrowser explicitly requests UTF-8 in HTTP requests. This may be a problem with how jsoup handles content encodings, or with headers missing from the request or response; it's hard to tell.

I'll close this for now, but please update this if you find out anything else about it.

