Comments (12)
I don't have this immediately on my current plan, sorry. The way these files store the text is: there is the text, and then there is a bunch of pointers into complex structures where the style is held. So working out the styling is not something that happens along the way to getting the text out. (That's for .doc; with .docx, something like this is likely to be easier.)
I won't close the issue, so at least it remains open for now, because it would be nice to have this.
from node-word-extractor.
@morungos we have a use case for list numbering and bullet point extraction too, mostly in docx. Even if it just stubbed in an asterisk, that would be really helpful. Do you have any appetite to look into this?
We are seeing some mixed results in tests - numbered lists and bullet points seem to parse correctly in the majority of .doc files. In .docx files they don't seem to parse. Also, if we take a docx file and save it as .doc, the numbers don't get parsed out. We haven't been able to spot a pattern in the .doc files which indicates whether the list is likely to parse properly or not. We would be happy to share some test documents with you if it helps to understand the problem.
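To make the "stub in an asterisk" idea concrete for .docx, here is a minimal sketch. It relies on the fact that WordprocessingML marks a list paragraph with a `w:numPr` element (holding `w:ilvl` and `w:numId`) inside the paragraph's `w:pPr`. The function name `stubListMarkers` is hypothetical and is not part of word-extractor. It also assumes the `word/document.xml` string has already been read out of the .docx archive, and it uses regular expressions rather than a real XML parser, so it ignores self-closing paragraphs, XML entities, and numbering inherited from paragraph styles.

```typescript
// Hypothetical helper, NOT part of word-extractor: returns one string per
// paragraph, prefixing list paragraphs with "* " and indenting by list level.
function stubListMarkers(documentXml: string): string[] {
  const lines: string[] = [];
  // Matches <w:p> or <w:p attr="...">, but not <w:pPr> or other w:p* tags.
  const paragraphPattern = /<w:p(?:\s[^>]*)?>([\s\S]*?)<\/w:p>/g;
  let paragraph: RegExpExecArray | null;
  while ((paragraph = paragraphPattern.exec(documentXml)) !== null) {
    const inner = paragraph[1];

    // Concatenate the text of every <w:t> run in this paragraph.
    const textPattern = /<w:t(?:\s[^>]*)?>([^<]*)<\/w:t>/g;
    let text = "";
    let run: RegExpExecArray | null;
    while ((run = textPattern.exec(inner)) !== null) {
      text += run[1];
    }

    // A paragraph belongs to a list when its properties contain <w:numPr>.
    const numPr = /<w:numPr>([\s\S]*?)<\/w:numPr>/.exec(inner);
    if (numPr === null) {
      lines.push(text);
      continue;
    }

    // <w:ilvl w:val="N"/> gives the zero-based nesting level; default to 0.
    const level = /<w:ilvl\s+w:val="(\d+)"/.exec(numPr[1]);
    const depth = level === null ? 0 : parseInt(level[1], 10);
    const indent = new Array(depth + 1).join("  ");
    lines.push(indent + "* " + text);
  }
  return lines;
}
```

Joining the result with `"\n"` gives a plain-text body in which every list item carries a marker. Recovering the actual numbers or bullet glyphs would additionally require resolving each `w:numId` against `word/numbering.xml`, which this sketch does not attempt.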
Please send through your test files, I'd be happy to take a look. It doesn't sound like it's too hard an issue if I can replicate it easily. (.doc files are much much worse, and I'd be crying if they were the ones you needed).
Would love to have an update on this if possible.
@yoy0lol this is my fault as I never sent across the sample files. I’ll try to sort something this week. Although any docx with bullets and numbering within it would probably make a basic test case. I imagine things could get more difficult with nested levels…
Files would be great! I've been distracted by other things, but I will likely have time in the next couple of weeks. So if you can drop me some test files by early next week, that would be great!!
@morungos Were you ever provided these documents? Happy to send over a few examples if not!
@Fdawgs Please do. I have some available time at the moment.
Brilliant, what's the best way to get them to you?
Ideally, just drop them into this issue. Ideally they'd be small and public, so I can make them part of the test suite during development.
Ah, didn't realise you could drop files into issue comments now!
Find below a handful:
test_file_1.docx
test_file_2.docx
test_file_3.docx
Did you manage to find some time to look at this @morungos?