I've been using your library today. It seems to be the light at the end of the tunnel


Separate header and footer about node-word-extractor HOT 8 CLOSED

morungos commented on May 30, 2024

Separate header and footer

from node-word-extractor.

Comments (8)

thegoatherder commented on May 30, 2024

I’m looking for all visible text... but I don’t think it’s a huge deal if we do get some hidden text included in the extracts. I’d say go the easy route!

On Mon 24 May 2021 at 01:21, Stuart Watt wrote: Quick question @thegoatherder -- I assume you don't worry about whether a header/footer is shown? One of the sneaky things I found is that, e.g., someone can have a "first page only" header that is switched off, so it's never seen. Can I assume you don't care whether the header/footer is actually rendered? It's very hard to tell whether any particular headers/footers are actually shown, because that can depend on everything down to the pagination, and therefore font size, etc.


morungos commented on May 30, 2024

Thank you :-) -- it was a good feeling to get this finally out there.

To answer, no, the basic distinction isn't too hard. I've tried to keep the API backwards compatible, but when I dug into the headers/footers and annotations, in particular, it'd be good to expose this extra data. Separating headers and footers isn't too bad, it gets hard if you need to know which of the twelve (!) different headers or footers is which.

The question is: how best to expose this without changing the API. I used to have a hidden parameter which did some filtering, but I never documented it, so it has silently faded away. So we could either (a) pass some options to getHeaders() to allow it to return richer data on demand, or (b) add more methods. Do you have any preferences or opinions?

In the meantime, I'll start on the underlying splitting logic. Should be able to do an update for this in the next few days.


thegoatherder commented on May 30, 2024

Hi Stuart - thanks for the thoughtful reply. As a new user, I don’t have any strong opinions on the API semantics and breaking changes... but I understand where you’re coming from.

If it were me, I would probably add a new defaulted parameter, `getHeaders(includeFooters=true)`, and add a getFooters() for good measure. And I’d plan to move the default to `includeFooters=false` for the next major release. I guess things get complicated when we start to look at global headers vs. section headers!

For my use case specifically, headers are rare, and when they do occur they tend to be a single header for the entire document - around 99.4%+ of the time. I am outputting text from your API with a call like:

```js
return ''.concat(
  data.getHeaders(),
  data.getBody(),
  data.getFootnotes(),
  data.getEndnotes()
);
```

so in my case, if I could break out a `data.getFooters()` on the bottom of that then the job is done.

If you'd like some help with any of this let me know and I can try to contribute and do a PR. Cheers!


On Fri 21 May 2021 at 19:12, Stuart Watt wrote: Thank you :-) -- it was a good feeling to get this finally out there. To answer, no, the basic distinction isn't too hard. I've tried to keep the API backwards compatible, but when I dug into the headers/footers and annotations, in particular, it'd be good to expose this extra data. Separating headers and footers isn't too bad, it gets hard if you need to know which of the twelve (!) different headers or footers is which. The question is: how best to expose this without changing the API. I used to have a hidden parameter which did some filtering, but I never documented it, so it has silently faded away. So we could either (a) pass some options to getHeaders() to allow it to return richer data on demand, or (b) add more methods. Do you have any preferences or opinions? In the meantime, I'll start on the underlying splitting logic. Should be able to do an update for this in the next few days.
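The defaulted-parameter idea discussed in this exchange can be sketched roughly as follows. This is only an illustration of the backward-compatible shape: `SketchDocument` and its internal arrays are hypothetical stand-ins, not the library's actual class, and the options-object form of `includeFooters` is an assumption.

```js
// Hypothetical stand-in for a parsed document; NOT the library's real class.
class SketchDocument {
  constructor(headers, footers) {
    this._headers = headers; // array of header strings
    this._footers = footers; // array of footer strings
  }

  // The defaulted option keeps old callers working: with no arguments,
  // headers and footers still come back together in one string.
  getHeaders({ includeFooters = true } = {}) {
    const parts = includeFooters
      ? [...this._headers, ...this._footers]
      : this._headers;
    return parts.join("\n");
  }

  // New method for callers that want footers on their own.
  getFooters() {
    return this._footers.join("\n");
  }
}

const doc = new SketchDocument(["Header A"], ["Footer A"]);
console.log(doc.getHeaders());                          // legacy behaviour: both
console.log(doc.getHeaders({ includeFooters: false })); // headers only
console.log(doc.getFooters());                          // footers only
```

Flipping the default to `false` in a later major release, as suggested, would then be a one-character change in the method signature.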


morungos commented on May 30, 2024

Notes to self, mainly.

Headers and footers are pretty easy to separate, especially in .docx. To be honest, if I'd thought of it, I'd have liked to provide two methods for separate headers and footers, but in OLE versions of Word, distinguishing them is pretty awful. My annoyance is that I should probably have called it getHeadersAndFootersAsString() from the start. Anyway, I truly want to keep the old API viable, so I propose the following.

In .doc, it's a bit more complicated. The good news is that the headers and footers do show up; the bad news is we have limited ability to tell which is which, because the documentation is somewhat incomplete. The plcfhdd table associates headers and footers with text ranges, but the standard suggests there are only twelve fields, where in fact, each section can introduce up to three headers and three footers. It might well be that there's data about document sections that we need to cross-check here. That would be in grpfihdt, introduced by section SPRMs -- except that the documentation suggests this is no longer used (although we do know that something is). There is another SPRM for a section's title page (sprmSFTitlePage).

But, then, the documentation at: https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-doc/8465bee7-6c79-45a9-812e-58b0c5fd6cdc is much more consistent. There, it suggests the first six stories are for footnote and endnote separators and continuations, and everything after that is a section header/footer run. After a bit of testing, it seems like there are always six stories per section, and they always follow this pattern:

Even page header
Odd page header
Even page footer
Odd page footer
First page header
First page footer
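Read literally, that pattern gives a simple mapping from a story's position to its section and role. The sketch below assumes zero-based story indexes, with the first six stories reserved for the footnote/endnote separators as described above. `classifyStory` and the role names are hypothetical, chosen for illustration; this is not the library's actual parsing code.

```js
// Order of the six header/footer stories within each section, as observed above.
const ROLES = [
  "evenHeader",
  "oddHeader",
  "evenFooter",
  "oddFooter",
  "firstHeader",
  "firstFooter",
];

// The first six stories hold footnote/endnote separators and continuations.
const SEPARATOR_STORIES = 6;

// Map a zero-based story index to its kind, section number and role.
function classifyStory(index) {
  if (!Number.isInteger(index) || index < 0) {
    throw new RangeError("story index must be a non-negative integer");
  }
  if (index < SEPARATOR_STORIES) {
    return { kind: "separator" };
  }
  const offset = index - SEPARATOR_STORIES;
  const role = ROLES[offset % ROLES.length];
  return {
    kind: role.endsWith("Header") ? "header" : "footer",
    section: Math.floor(offset / ROLES.length), // zero-based section number
    role,
  };
}

console.log(classifyStory(6));  // first story of the first section
console.log(classifyStory(17)); // last story of the second section
```

Separating headers from footers then only needs the `kind` field, while the harder question of which one is actually displayed stays out of scope.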


morungos commented on May 30, 2024

Quick question @thegoatherder -- I assume you don't worry about whether a header/footer is shown? One of the sneaky things I found is that, e.g., someone can have a "first page only" header that is switched off, so it's never seen. Can I assume you don't care whether the header/footer is actually rendered?

It's very hard to tell whether any particular headers/footers are actually shown, because that can depend on everything down to the pagination, and therefore font size, etc.


morungos commented on May 30, 2024

Okay, just published version 1.0.1 to npm. It should contain the header/footer separation you need, and I tested it with both OLE and .docx file formats, so we're all compatible. There's a new option to getHeaders() that omits the default inclusion of footer text, and a new method getFooters() that selects footers, which means old code won't break but new code should be able to use this functionality. Hope that works for you :-)
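On the calling side, the concatenation posted earlier in the thread might then look something like the sketch below. The option shape `{ includeFooters: false }` is assumed from the suggestion made earlier in the thread and should be checked against the README. `extractAllText` and `fakeDoc` are hypothetical names, and `fakeDoc` is a stub so the snippet runs without a Word file.

```js
// Concatenate all text, with headers first and footers last.
// Assumes the option is passed as { includeFooters: false }.
function extractAllText(doc) {
  return "".concat(
    doc.getHeaders({ includeFooters: false }),
    doc.getBody(),
    doc.getFootnotes(),
    doc.getEndnotes(),
    doc.getFooters()
  );
}

// Stub with the same method names, for illustration only.
const fakeDoc = {
  getHeaders: ({ includeFooters = true } = {}) =>
    includeFooters ? "[header][footer]" : "[header]",
  getBody: () => "[body]",
  getFootnotes: () => "[footnotes]",
  getEndnotes: () => "[endnotes]",
  getFooters: () => "[footer]",
};

console.log(extractAllText(fakeDoc));
// prints "[header][body][footnotes][endnotes][footer]"
```

Older code that calls `getHeaders()` with no arguments keeps receiving headers and footers together, which is the backward compatibility described above.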


thegoatherder commented on May 30, 2024

Thank you very much indeed! I will check it out tomorrow and let you know how it goes with our test data set.

On Mon 24 May 2021 at 14:02, Stuart Watt wrote: Closed #34.


thegoatherder commented on May 30, 2024

@morungos thank you very much for your time on this, it's been a massive help. I can confirm getFooters() is working as expected on our test data.
