Comments (12)

morungos avatar morungos commented on May 29, 2024 2

I don't have this immediately on my current plan, sorry. These files store the text in one place, and then a bunch of pointers into complex structures where the style is held. So working out the styling is not something that happens along the way to getting the text out. (That's for .doc; with .docx, something like this is likely to be easier.)

I won't close the issue, so at least it remains open for now, because it would be nice to have this.

from node-word-extractor.

thegoatherder avatar thegoatherder commented on May 29, 2024

@morungos we have a use case for list numbering and bullet point extraction too - mostly in docx... even if it stubbed in an asterisk that would be really helpful. Do you have any appetite to look into this?

We are seeing some mixed results in tests - numbered lists and bullet points seem to parse correctly in the majority of .doc files. In .docx files they don't seem to parse. Also, if we take a docx file and save it as .doc, the numbers don't get parsed out. We haven't been able to spot a pattern in the .doc files which indicates whether the list is likely to parse properly or not. We would be happy to share some test documents with you if it helps to understand the problem.
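To make the "stub in an asterisk" idea concrete for .docx, here is a rough sketch, not how node-word-extractor works internally. It assumes you already have the raw XML string of `word/document.xml` from the unzipped file, and it relies on the fact that WordprocessingML marks list paragraphs with a `<w:numPr>` element whose `<w:ilvl>` child holds the nesting level. The function name `extractParagraphsWithListMarkers` is hypothetical, and a regex scan like this is only an approximation of real XML parsing.

```javascript
// Hypothetical helper, NOT part of node-word-extractor's API.
// Input: the raw XML string of word/document.xml from an unzipped .docx.
// Limitations of this sketch: XML entities such as &amp; are not decoded,
// nested paragraphs (e.g. inside text boxes) are not handled, and real
// numbering ("1.", "a)") is not resolved; every list item gets an asterisk.
function extractParagraphsWithListMarkers(documentXml) {
  const lines = [];
  // Match <w:p> or <w:p ...attrs>, but not <w:pPr> and not self-closing <w:p .../>.
  const paragraphPattern = /<w:p(?:\s[^>]*)?(?<!\/)>([\s\S]*?)<\/w:p>/g;
  for (const [, body] of documentXml.matchAll(paragraphPattern)) {
    // Concatenate the text runs (<w:t> or <w:t xml:space="preserve">).
    const text = [...body.matchAll(/<w:t(?:\s[^>]*)?>([^<]*)<\/w:t>/g)]
      .map((match) => match[1])
      .join("");
    // A paragraph that belongs to a list carries <w:numPr> in its properties.
    const isListItem = /<w:numPr>/.test(body);
    // Nesting depth is stored in <w:ilvl w:val="N"/>; default to level 0.
    const levelMatch = body.match(/<w:ilvl\s+w:val="(\d+)"/);
    const level = levelMatch ? Number(levelMatch[1]) : 0;
    // Indent two spaces per level and stub in an asterisk for list items.
    lines.push(isListItem ? `${"  ".repeat(level)}* ${text}` : text);
  }
  return lines.join("\n");
}
```

Calling it on the document XML returns plain text with one line per paragraph, where list items are prefixed with `* ` and indented by nesting level. Telling bullets from numbers would need a lookup of the paragraph's `<w:numId>` in `word/numbering.xml`, which this sketch skips.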

morungos avatar morungos commented on May 29, 2024

Please send through your test files, I'd be happy to take a look. It doesn't sound like it's too hard an issue if I can replicate it easily. (.doc files are much much worse, and I'd be crying if they were the ones you needed).

yoy0lol avatar yoy0lol commented on May 29, 2024

Would love to have an update on this if possible.

thegoatherder avatar thegoatherder commented on May 29, 2024

@yoy0lol this is my fault as I never sent across the sample files. I’ll try to sort something this week. Although any docx with bullets and numbering within it would probably make a basic test case. I imagine things could get more difficult with nested levels…

morungos avatar morungos commented on May 29, 2024

Files would be great! I've been distracted by other things, but I will likely have time in the next couple of weeks. So if you can drop me some test files by early next week, that would be great!!

Fdawgs avatar Fdawgs commented on May 29, 2024

@morungos Were you ever provided these documents? Happy to send over a few examples if not!

morungos avatar morungos commented on May 29, 2024

@Fdawgs Please do. I have some available time at the moment.

Fdawgs avatar Fdawgs commented on May 29, 2024

@Fdawgs Please do. I have some available time at the moment.

Brilliant, what's the best way to get them to you?

morungos avatar morungos commented on May 29, 2024

Ideally, just drop them into this issue. Ideally they'd be small and public, so I can make them part of the test suite during development.

Fdawgs avatar Fdawgs commented on May 29, 2024

Ah, didn't realise you could drop files into issue comments now!

Find below a handful:
test_file_1.docx
test_file_2.docx
test_file_3.docx

Fdawgs avatar Fdawgs commented on May 29, 2024

Did you manage to find some time to look at this @morungos?
