Hi all, I am trying to use BlingFire for sentence splitting in Greek


How to add sentence splitting rules?

microsoft commented on August 27, 2024

How to add sentence splitting rules?

from blingfire.

Comments (3)

SergeiAlonichau commented on August 27, 2024

Hi,

The source code for sentence breaking is in BlingFire/ldbsrc/sbd. The specific file is wbd.lex.utf8.

The documentation describing the syntax of the rules is in BlingFire/doc/lex.htm.

To recompile it, please follow these steps: https://github.com/microsoft/BlingFire/wiki/How-to-change-linguistic-resources


lfoppiano commented on August 27, 2024

Hi all, @SergeiAlonichau

I have some questions. I followed the instructions from @SergeiAlonichau, modified the file sbd/wbd.lex.utf8, and recompiled using the command make -f Makefile.gnu lang=sbd all.
Then, from dist-pypi, I rebuilt the Python package using python setup.py sdist, but I have the feeling my changes were not taken into consideration.

The file ldbsrc/ldb/sbd.bin, which is the output of the make command, doesn't seem to be included by setup.py.
What did I miss? I would like to have a new version of blingfire packaged as a Python package with the new segmentation rules.

Thank you in advance

I deleted and re-wrote this comment as I got more information


SergeiAlonichau commented on August 27, 2024

Hi,

Thank you for your effort and sorry for the late reply!

text_to_sentences uses a built-in model. The generated bin file is converted to a C++ array (see https://raw.githubusercontent.com/microsoft/BlingFire/master/ldbsrc/sbd/BlingFireTokLibSbdData.cxx, which has the command line in its comments), and that array is used instead of a model loaded from a file.

However, you can always load your own model if you use text_to_sentences_with_model, so that should work. In fact, I suggest running the two models side by side and comparing the differences this way.

If you wish you can also ship it this way, call it sbd2 or anything else you like.
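For illustration, the side-by-side comparison could look roughly like the sketch below. It assumes the blingfire Python package exposes load_model, free_model, text_to_sentences and text_to_sentences_with_model, and that both splitting functions return a single string with sentences separated by "\n". The helper names diff_splitters and compare_with_blingfire, and the model path passed in, are hypothetical and only serve the example.

```python
from typing import Callable, Iterable, List, Tuple


def diff_splitters(
    texts: Iterable[str],
    split_default: Callable[[str], str],
    split_custom: Callable[[str], str],
) -> List[Tuple[str, List[str], List[str]]]:
    """Return (text, default_sentences, custom_sentences) for every text
    on which the two splitters disagree."""
    differences = []
    for text in texts:
        # Each splitter is expected to return sentences joined by "\n".
        default_sents = split_default(text).split("\n")
        custom_sents = split_custom(text).split("\n")
        if default_sents != custom_sents:
            differences.append((text, default_sents, custom_sents))
    return differences


def compare_with_blingfire(model_path: str, texts: Iterable[str]):
    """Compare the built-in model with a model loaded from model_path
    (for example, a locally compiled sbd.bin; the path is an assumption)."""
    # Imported lazily so diff_splitters works without blingfire installed.
    from blingfire import (
        free_model,
        load_model,
        text_to_sentences,
        text_to_sentences_with_model,
    )

    handle = load_model(model_path)
    try:
        return diff_splitters(
            texts,
            text_to_sentences,
            lambda t: text_to_sentences_with_model(handle, t),
        )
    finally:
        # Release the model handle even if splitting raises.
        free_model(handle)
```

Running compare_with_blingfire over a sample of your own texts lists only the inputs where the two models split differently, which keeps the manual review short.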

If you want to replace the default model, we need to make sure it is better than the default model so some good amount of testing is required.

Once you are convinced it is better, just generate a CXX file from your binary and replace the existing *.cxx file with yours. Then text_to_sentences will use your model.

I am assuming you installed your local package correctly, of course, but it may be worth double-checking.
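One quick way to double-check the installation is to ask Python where it would import the package from. This is a minimal sketch using only the standard library; the helper name package_location is hypothetical.

```python
import importlib.util
from typing import Optional


def package_location(name: str) -> Optional[str]:
    """Return the file path Python would import `name` from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


# Prints the path of the blingfire package that would be imported,
# or None if no such package is installed in this environment.
print(package_location("blingfire"))
```

If the printed path points at a site-packages copy installed from PyPI rather than at your locally built package, the rebuilt package is not the one being used.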

