Comments (5)
Unfortunately, I can't seem to produce a document that has an HTML element with ID #priceblock_ourprice. On my computer, in the Amazon page downloaded by JsoupBrowser, the price is in this element:
scala> println(doc >?> text("#color_name_1_price"))
Some($15.99)
This may be a difference in localization or user-agent. It may help to change the user agent used by JsoupBrowser (by passing it in its constructor) to make it match your browser.
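As a minimal sketch of the user-agent suggestion, assuming the JsoupBrowser constructor accepts a userAgent argument as described: the user-agent string below is a made-up placeholder and should be replaced with the one your own browser sends, and the object and method names are illustrative only.

```scala
import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._

object UserAgentExample {
  // Placeholder value (an assumption): copy the real string from your browser.
  val browserUserAgent: String =
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0 Safari/537.36"

  // Pass the user agent to the constructor so requests resemble the browser's.
  val browser = new JsoupBrowser(userAgent = browserUserAgent)

  // Fetches the page and returns the price text if the element is present.
  def price(url: String): Option[String] =
    browser.get(url) >?> text("#priceblock_ourprice")
}
```

Matching the user agent does not guarantee identical content, since the server may also vary the page by location or other request headers.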
If you are sure that an HTML document has a valid element which scala-scraper can't find by its id, please provide a static HTML page with which I can reproduce the problem (Document#toHtml can be used for that in scala-scraper).
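One way to produce such a static page might look like the sketch below. It relies only on Document#toHtml mentioned above; the SavePage object, the dump helper and the choice of UTF-8 for the output file are assumptions made for illustration.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

import net.ruippeixotog.scalascraper.model.Document

object SavePage {
  // Writes the document's HTML, as scala-scraper sees it, to the target file.
  // UTF-8 is an arbitrary choice for this sketch.
  def dump(doc: Document, target: Path): Path =
    Files.write(target, doc.toHtml.getBytes(StandardCharsets.UTF_8))
}
```

For example, SavePage.dump(browser.get(url), Paths.get("from_jsoup.html")) would save the page as fetched by the browser instance, so it can be attached to the issue.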
from scala-scraper.
I checked in both Chrome and Jsoup; both contain #priceblock_ourprice. Please see the attachment.
Could it be A/B testing from Amazon? (same URL, but we receive different content)
Oh, I see; it happens because that item does not ship to Portugal, and so I am not shown the bigger priceblock_ourprice price label.
However, I'm able to successfully extract the price both with from_chrome.html and with from_jsoup.html:
scala> val browser = new JsoupBrowser
browser: net.ruippeixotog.scalascraper.browser.JsoupBrowser = net.ruippeixotog.scalascraper.browser.JsoupBrowser@44de88e4
scala> val doc = browser parseFile "from_chrome.html"
doc: browser.DocumentType = (...)
scala> println(doc >?> text("#priceblock_ourprice"))
Some($15.99)
scala> val doc = browser parseFile "from_jsoup.html"
doc: browser.DocumentType = (...)
scala> println(doc >?> text("#priceblock_ourprice"))
Some($15.99)
Does this also happen to you when you load the HTML files from disk?
I've found the culprit: encoding. I need to explicitly tell Jsoup to use UTF-8. This works:
val doc = browser.parseInputStream(new URL(url).openStream, "UTF-8")
Reading from my HTML files works because I saved them as UTF-8.
Thanks for taking the time to investigate my issue. If you know a better way, feel free to add, otherwise just close this. Thanks!
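The explicit-charset workaround could be wrapped roughly as follows. This is a sketch rather than a recommended fix: it uses the same parseInputStream call as above, while the PriceExtractor object and its method names are hypothetical, and it does not explain why the charset had to be forced in the first place.

```scala
import java.io.InputStream
import java.net.URL
import java.nio.charset.StandardCharsets

import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._

object PriceExtractor {
  private val browser = new JsoupBrowser

  // Parses the stream as UTF-8 and closes it afterwards, even on failure.
  def priceFrom(in: InputStream): Option[String] =
    try {
      val doc = browser.parseInputStream(in, StandardCharsets.UTF_8.name)
      doc >?> text("#priceblock_ourprice")
    } finally in.close()

  // Opens the URL directly, bypassing the browser's own HTTP handling.
  def priceFromUrl(url: String): Option[String] =
    priceFrom(new URL(url).openStream())
}
```

Note that opening the stream with java.net.URL sends Java's default request headers, so the user agent configured on the browser instance is not applied on this path.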
I'm glad that you found the problem :) It's a strange issue nonetheless, since JsoupBrowser explicitly requests UTF-8 in HTTP requests. This may be a problem with how jsoup handles content encodings or with some missing headers in the request or response; it's hard to tell.
I'll close this for now, but please update this if you find out anything else about it.