Comments (12)
It would be awesome if you shared your approach here. Even if there were no interest now (and it seems there is), it would be a good resource for anyone dealing with this issue :) If the algorithm and data structure you chose to parse the table are general enough, I can surely help you implement them in scala-scraper.
from scala-scraper.
Hi! Currently there is no specialized extractor for HTML tables. It would be a nice addition to have one, but it does come with its challenges. For example, what data structure did you have in mind? As you mentioned in your example, the ability of cells to cover an arbitrary number of rows and columns can make the organization rather messy...
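To make the rowspan/colspan problem concrete, here is a minimal sketch of one possible data structure: a rectangular grid in which a spanning cell's text is repeated in every slot it covers. The names Cell, TableGrid and expand are hypothetical, chosen only for illustration. This is not the code attached to this thread and not the extractor that ended up in scala-scraper, and it assumes the cells have already been read out of the HTML.

```scala
import scala.collection.mutable

// Hypothetical model (not part of scala-scraper): a cell's text plus
// the number of rows and columns it spans.
final case class Cell(text: String, rowspan: Int = 1, colspan: Int = 1)

object TableGrid {
  /** Expands rows of spanning cells into a rectangular grid.
    * A spanning cell's text is repeated in every slot it covers;
    * slots covered by no cell are None.
    */
  def expand(rows: Seq[Seq[Cell]]): Vector[Vector[Option[String]]] = {
    val filled = mutable.Map.empty[(Int, Int), String]
    for ((row, r) <- rows.zipWithIndex) {
      var c = 0
      for (cell <- row) {
        // Skip slots already taken by a cell spanning down from an earlier row.
        while (filled.contains((r, c))) c += 1
        // Clamp rowspan so a cell never extends past the last row.
        val rowsCovered = math.min(cell.rowspan, rows.size - r)
        for (dr <- 0 until rowsCovered; dc <- 0 until cell.colspan)
          filled((r + dr, c + dc)) = cell.text
        c += cell.colspan
      }
    }
    if (filled.isEmpty) Vector.empty
    else {
      val height = filled.keys.map(_._1).max + 1
      val width = filled.keys.map(_._2).max + 1
      Vector.tabulate(height, width)((r, c) => filled.get((r, c)))
    }
  }
}

// Usage: "A" spans two rows and "B" spans two columns.
val rows = Seq(
  Seq(Cell("A", rowspan = 2), Cell("B", colspan = 2)),
  Seq(Cell("C"), Cell("D"))
)
val grid = TableGrid.expand(rows)
```

Repeating the text keeps every row the same width, which makes column-wise access simple, but it loses the information that two slots belong to the same cell. A grid of references to shared cell objects would preserve that. Overlapping spans (malformed tables) are not handled here; a later cell simply overwrites an earlier one.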
I have something in mind - let me see if I can open a pull request in a few weeks. If I do end up coding something, where do you think it would show up in the codebase? In scalascraper/scraper/HtmlExtractor.scala?
Yeah, I would expect it to be a new extractor in ContentExtractors. Looking forward to your pull request then!
Hi, any update on this?
Hello samikrc,
Thanks a lot for the quick response. Yes, I would be happy to check it, so please do publish it.
Guys,
Sorry for the delay. Attaching two files, one containing the source code and the other containing some test code. Note that the test code is not automated - just some prints for manually checking if things look OK.
@ruippeixotog Saw your other email about the exciting features in the next version, including the "Content Extractors". This is probably too late to get included in that, but that is probably where this stuff can be integrated.
Ready to answer questions :-)
Thanks.
-Samik
TableExtractor.scala.txt
TableExtractorTester.scala.txt
Also note that some of the methods are just stubs, but they are easy to implement. The important methods are already implemented.
Hi, any update on this? Did the code get used somewhere?
Hi @samikrc, I ended up not using it anywhere for now, mostly due to my lack of time lately. I took a look at your code before and it seemed like a good implementation; it just needs to be converted to a more idiomatic extractor, like the regex extractors. I'll try to work on it in the next two weeks :)
I have just added a new table content extractor to scala-scraper (e7d3fe6). I ended up writing the extractor from scratch, as it seemed easier for me to integrate it with the style of the other extractors this way.
Closing this now. If you find any bugs in the implementation, feel free to open another issue!