
Comments (8)

thegoatherder commented on May 30, 2024

from node-word-extractor.

morungos commented on May 30, 2024

Thank you :-) -- it was a good feeling to get this finally out there.

To answer: no, the basic distinction isn't too hard. I've tried to keep the API backwards compatible, but having dug into the headers/footers and annotations in particular, I think it'd be good to expose this extra data. Separating headers and footers isn't too bad; it gets hard if you need to know which of the twelve (!) different headers or footers is which.

The question is how best to expose this without changing the API. I used to have a hidden parameter which did some filtering, but I never documented it, so it has silently faded away. So we could either (a) pass some options to getHeaders() so that it returns richer data on demand, or (b) add more methods. Do you have any preferences or opinions?

In the meantime, I'll start on the underlying splitting logic. Should be able to do an update for this in the next few days.


thegoatherder commented on May 30, 2024


morungos commented on May 30, 2024

Notes to self, mainly.

Headers and footers are pretty easy to separate, especially in .docx. To be honest, if I'd thought of it, I'd have liked to provide separate methods for headers and footers, but in OLE versions of Word, distinguishing them is pretty awful. My annoyance is that I should probably have called it getHeadersAndFootersAsString() from the start. Anyway, I truly want to keep the old API viable, so I propose the following.

In .doc, it's a bit more complicated. The good news is that the headers and footers do show up; the bad news is that we have limited ability to tell which is which, because the documentation is somewhat incomplete. The plcfhdd table associates headers and footers with text ranges, but the standard suggests there are only twelve fields, whereas in fact each section can introduce up to three headers and three footers. It might well be that there's data about document sections that we need to cross-check here. That would be in grpfihdt, introduced by section SPRMs -- except that the documentation suggests this is no longer used (although we do know that something is). There is another SPRM for a section's title page (sprmSFTitlePage).

But then, the documentation at https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-doc/8465bee7-6c79-45a9-812e-58b0c5fd6cdc is much more consistent. There, it suggests the first six stories are for footnote and endnote separators and continuations, and everything after that is a section header/footer run. After a bit of testing, it seems there are always six stories per section, and they always follow this pattern:

  • Even page header
  • Odd page header
  • Even page footer
  • Odd page footer
  • First page header
  • First page footer
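
The pattern above lends itself to a simple index calculation. The following is only a rough sketch of that idea, not the library's actual code: `classifyStory`, the constant names, and the use of a zero-based story index are all assumptions made for illustration.

```javascript
// Per-section story order, as listed in the comment above.
const SECTION_STORY_KINDS = [
  { type: "header", page: "even" },
  { type: "header", page: "odd" },
  { type: "footer", page: "even" },
  { type: "footer", page: "odd" },
  { type: "header", page: "first" },
  { type: "footer", page: "first" },
];

// The first six stories are footnote/endnote separators and continuations.
const SEPARATOR_STORY_COUNT = 6;

// Hypothetical helper: classify a story by its zero-based position in the
// header table (zero-based indexing is an assumption of this sketch).
function classifyStory(index) {
  if (!Number.isInteger(index) || index < 0) {
    throw new RangeError(`story index must be a non-negative integer, got ${index}`);
  }
  if (index < SEPARATOR_STORY_COUNT) {
    return { type: "separator", section: null, page: null };
  }
  const offset = index - SEPARATOR_STORY_COUNT;
  const kind = SECTION_STORY_KINDS[offset % SECTION_STORY_KINDS.length];
  return {
    type: kind.type,
    section: Math.floor(offset / SECTION_STORY_KINDS.length), // zero-based section
    page: kind.page,
  };
}
```

Under these assumptions, story 6 would be the even page header of the first section, and story 12 the even page header of the second.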


morungos commented on May 30, 2024

Quick question @thegoatherder -- I assume you don't worry about whether a header/footer is shown? One of the sneaky things I found is that, e.g., someone can have a "first page only" header that is switched off, so it's never seen. Can I assume you don't care whether the header/footer is actually rendered?

It's very hard to tell whether any particular headers/footers are actually shown, because that can depend on everything down to the pagination, and therefore font size, etc.


morungos commented on May 30, 2024

Okay, just published version 1.0.1 to npm. It should contain the header/footer separation you need, and I tested it with both OLE and .docx file formats, so we're all compatible. There's a new option to getHeaders() that omits the default inclusion of footer text, and a new method getFooters() that selects footers, which means old code won't break but new code should be able to use this functionality. Hope that works for you :-)
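
For anyone reading along, here is a minimal sketch of how the two calls might be used side by side. The option name `includeFooters` is an assumption (the comment above only says there is "a new option"), and `splitHeadersAndFooters` is a hypothetical helper, so check the package README before relying on either.

```javascript
// Hypothetical helper: pull headers and footers out of an extracted document.
// `doc` is expected to expose getHeaders() and getFooters() as described above.
function splitHeadersAndFooters(doc) {
  return {
    // Assumed option name; it is meant to switch off the default footer text.
    headers: doc.getHeaders({ includeFooters: false }),
    // New method that selects footers only.
    footers: doc.getFooters(),
    // No arguments: the old, backwards-compatible behaviour (headers plus footers).
    combined: doc.getHeaders(),
  };
}

// Usage with the real package (requires `npm install word-extractor`;
// "report.docx" is a placeholder path):
//   const WordExtractor = require("word-extractor");
//   new WordExtractor().extract("report.docx").then((doc) => {
//     console.log(splitHeadersAndFooters(doc));
//   });
```

Keeping the argument-free `getHeaders()` call unchanged is what lets old code keep working while new code opts in to the split.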


thegoatherder commented on May 30, 2024


thegoatherder commented on May 30, 2024

@morungos thank you very much for your time on this, it's been a massive help. I can confirm getFooters() is working as expected on our test data.
