
Comments (3)

l0o0 commented on May 18, 2024

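Sure. At the moment I don't know how to handle this many rules correctly. You could also post screenshots of the naming rules you currently use, so that I can refer to them later.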

from jasminum.

crliu95 commented on May 18, 2024

Thanks to @l0o0 for the quick feedback!

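The image below shows one way my collaborators usually name Chinese-language literature PDFs when maintaining them by hand, for your reference:
(image: PDF naming example)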

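I strongly agree with your point that "automatically recognizing many different rules could be troublesome and not cost-effective." So my initial hope is that, when the user supplies a naming template, Jasminum could parse PDF file names more flexibly, especially when the template already contains clear separators.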

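One implementation strategy I can think of:
(1) Following the user's template, replace every separator in the PDF file name (the parts of the template that are not in {%X} form, i.e. outside the braces) with a single default separator, such as an underscore.
(2) Split the result on the default separator to get a list of field values.
(3) Parse the template to get a list of field codes (the {%X} placeholders).
(4) For common naming schemes the two lists should have exactly the same length, so they can be matched one-to-one in order.
(5) Pick the relevant information (title and author names) and search for it with the CNKI engine.
(6) As a fallback, take the longest field from the split file name as the title, which helps when recognition goes wrong.
(7) Possible further improvements: make more deliberate use of the differences between separators in the user's template (for instance, my example uses & to join multiple authors), and strip filler such as "等" and "et al." when preprocessing the PDF file name so that it does not pollute the key information.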
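
Steps (1) to (4), (6), and the filler stripping from (7) could be sketched roughly as follows. This is only an illustration in Python, not Jasminum's actual code: the function name `parse_filename`, the field codes `y`, `g`, and `t`, the choice of `t` as the title code, and the sample file names are all assumptions made for the example. The CNKI search in step (5) is left out.

```python
import re

PLACEHOLDER = re.compile(r"\{%\w\}")  # a template field such as {%t}
TRAILING_FILLER = re.compile(r"\s*(?:等|et al\.?)$")  # filler at the end of a field
DEFAULT_SEP = "_"


def parse_filename(filename, template, title_key="t"):
    """Map a PDF file name onto the field codes of a naming template.

    title_key is the (assumed) code that stands for the title.
    """
    stem = re.sub(r"\.pdf$", "", filename, flags=re.IGNORECASE)

    # Step 3: field codes in template order, e.g. ["y", "g", "t"].
    keys = [m[2:-1] for m in PLACEHOLDER.findall(template)]

    # Step 1: text outside the braces is the set of separators;
    # replace each one (longest first) with the default separator.
    separators = {s for s in PLACEHOLDER.split(template) if s}
    for sep in sorted(separators, key=len, reverse=True):
        stem = stem.replace(sep, DEFAULT_SEP)

    # Step 2: split on the default separator.
    # Step 7 (partly): drop trailing filler from each field.
    fields = [TRAILING_FILLER.sub("", f.strip()) for f in stem.split(DEFAULT_SEP)]
    fields = [f for f in fields if f]

    # Step 4: equal lengths mean fields and codes line up in order.
    if keys and len(fields) == len(keys):
        return dict(zip(keys, fields))

    # Step 6: fallback, treat the longest field as the title.
    return {title_key: max(fields, key=len)} if fields else {}


if __name__ == "__main__":
    print(parse_filename("2020-张三&李四等_基于深度学习的文本分类研究.pdf", "{%y}-{%g}_{%t}"))
    print(parse_filename("张三_知网_文献元数据抓取方法研究.pdf", "{%g}_{%t}"))
```

In the second call the file name splits into three fields while the template has only two codes, so the sketch falls back to the longest field as the title. Splitting the author field on & is not handled here, and a title that itself contains a separator character will also trigger the fallback.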

Thanks again to the author!


l0o0 commented on May 18, 2024

Thank you very much for the suggestions. I will keep working on improving this.

