Comments (12)

ruippeixotog commented on May 23, 2024

It would be awesome if you shared your approach here. Even if there were no interest right now (and it seems there is), it would be a good resource for anyone dealing with this issue :) If the algorithm and data structure you chose to parse the table are general enough, I can surely help you implement it in scala-scraper.

from scala-scraper.

ruippeixotog commented on May 23, 2024

Hi! Currently there is no specialized extractor for HTML tables. It would be a nice addition to have one, but it does come with its challenges. For example, what data structure did you have in mind? As you mentioned in your example, cells can span an arbitrary number of rows and columns, which can make the organization rather messy...
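
To make the data structure question concrete, here is a minimal sketch of one possible representation. It is not the code attached later in this thread, nor scala-scraper's API: `Cell`, `TableGrid` and `expand` are hypothetical names, and the sketch assumes the cells have already been read from the HTML along with their `rowspan` and `colspan` values. Each spanning cell is copied into every grid position it covers, so the result can be indexed by `(row, column)`.

```scala
// Hypothetical model for illustration only; not part of scala-scraper.
final case class Cell(text: String, rowspan: Int = 1, colspan: Int = 1)

object TableGrid {
  /** Expands rows of possibly-spanning cells into a map from (row, col) to text.
    * A cell that spans several rows or columns appears at every position it covers.
    */
  def expand(rows: Seq[Seq[Cell]]): Map[(Int, Int), String] =
    rows.zipWithIndex.foldLeft(Map.empty[(Int, Int), String]) {
      case (grid, (row, r)) =>
        val (filled, _) = row.foldLeft((grid, 0)) { case ((g, startCol), cell) =>
          // Skip columns already claimed by a rowspan from an earlier row.
          val c = Iterator.from(startCol).find(col => !g.contains((r, col))).get
          val covered = for {
            dr <- 0 until cell.rowspan
            dc <- 0 until cell.colspan
          } yield (r + dr, c + dc) -> cell.text
          (g ++ covered, c + cell.colspan)
        }
        filled
    }
}

object Demo {
  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Seq(Cell("A", rowspan = 2), Cell("B", colspan = 2)),
      Seq(Cell("C"), Cell("D"))
    )
    val grid = TableGrid.expand(rows)
    assert(grid((1, 0)) == "A") // "A" also fills the row below it
    assert(grid((0, 2)) == "B") // "B" also fills the column to its right
    assert(grid((1, 1)) == "C") // "C" is pushed past the column taken by "A"
  }
}
```

A map keyed by position keeps the sketch short; a `Vector[Vector[...]]` would work just as well if the table is known to be rectangular. Overlapping spans, which malformed HTML can produce, are not handled here.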

samikrc commented on May 23, 2024

I have something in mind - let me see if I can open a pull request in a few weeks. If I do end up coding something, where do you think it would show up in the codebase? In scalascraper/scraper/HtmlExtractor.scala?

ruippeixotog commented on May 23, 2024

Yeah, I would expect it to be a new extractor in ContentExtractors. Looking forward to your pull request then!

kumarivin commented on May 23, 2024

Hi, any update on this?

samikrc commented on May 23, 2024

kumarivin commented on May 23, 2024

Hello samikrc,
Thanks a lot for the quick response. Yes, I would be happy to check it, so please do publish.

samikrc commented on May 23, 2024

Guys,

Sorry for the delay. Attaching two files, one containing the source code and the other containing some test code. Note that the test code is not automated - just some prints for manually checking if things look OK.

@ruippeixotog Saw your other email about the exciting features in the next version, including the "Content Extractors". This is probably too late to get included in that, but that is probably where this stuff can be integrated.

Ready to answer questions :-)

Thanks.
-Samik

TableExtractor.scala.txt
TableExtractorTester.scala.txt

samikrc commented on May 23, 2024

Also note that some of the methods are just stubs, but they are easy to implement. The important methods are already implemented.

samikrc commented on May 23, 2024

Hi, any update on this? Did the code get used somewhere?

ruippeixotog commented on May 23, 2024

Hi @samikrc, I ended up not using it anywhere for now, mostly due to my lack of time lately. I took a look at your code before and it seemed like a good implementation. It just needs to be converted to a more idiomatic extractor, like the regex extractors. I'll try to work on it in the next two weeks :)

ruippeixotog commented on May 23, 2024

I have just added a new table content extractor to scala-scraper (e7d3fe6). I ended up writing the extractor from scratch, as it was easier for me to match the style of the other extractors that way.

Closing this now. If you find any bugs in the implementation, feel free to open another issue!
