Comments (3)

SergeiAlonichau commented on August 27, 2024

Hi,

The source code for sentence breaking is here: BlingFire/ldbsrc/sbd. The specific file is wbd.lex.utf8.

The documentation file describing the syntax of the rules is here: BlingFire/doc/lex.htm.

To have it recompiled, please follow these steps: https://github.com/microsoft/BlingFire/wiki/How-to-change-linguistic-resources

from blingfire.

lfoppiano commented on August 27, 2024

Hi all, @SergeiAlonichau

I have some questions. I followed the instructions from @SergeiAlonichau, modified the file sbd/wbd.lex.utf8, and recompiled using the command make -f Makefile.gnu lang=sbd all.
Then, from dist-pypi, I rebuilt the Python package using python setup.py sdist, but I have the feeling my changes were not taken into account.

The file ldbsrc/ldb/sbd.bin, which is the output of the make command, doesn't seem to be included in setup.py.
What did I miss? I would like to have a new version of blingfire packaged as a Python package with the new segmentation rules.

Thank you in advance

(I deleted and rewrote this comment after getting more information.)


SergeiAlonichau commented on August 27, 2024

Hi,

Thank you for your effort and sorry for the late reply!

text_to_sentences uses a built-in model. The generated bin is converted to a C++ array (see https://raw.githubusercontent.com/microsoft/BlingFire/master/ldbsrc/sbd/BlingFireTokLibSbdData.cxx; the command line that produces it is in the comments), and that array is used instead of a model loaded from a file.

However, you can always load your own model if you use text_to_sentences_with_model, so that should work. In fact, I suggest running the two models side by side and comparing the differences this way.
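
The side-by-side comparison could be sketched roughly as follows. This is a minimal illustration, not the maintainers' procedure: compare_segmenters and blingfire_segmenters are hypothetical helper names, model_path stands for wherever your recompiled sbd.bin ends up, and the sketch assumes both blingfire calls return sentences separated by newlines.

```python
def compare_segmenters(texts, default_seg, custom_seg):
    """Return (text, default_sentences, custom_sentences) for each text
    on which the two segmenters disagree."""
    diffs = []
    for text in texts:
        a = default_seg(text)
        b = custom_seg(text)
        if a != b:
            diffs.append((text, a, b))
    return diffs


def blingfire_segmenters(model_path):
    """Build the two segmenters from blingfire (imported lazily).

    model_path is a hypothetical path to your recompiled sbd.bin.
    Assumes the returned strings hold one sentence per line.
    """
    from blingfire import (
        load_model,
        text_to_sentences,
        text_to_sentences_with_model,
    )

    handle = load_model(model_path)

    def default_seg(text):
        # Built-in model compiled into the library.
        return text_to_sentences(text).split("\n")

    def custom_seg(text):
        # Model loaded from your own .bin file.
        return text_to_sentences_with_model(handle, text).split("\n")

    return default_seg, custom_seg, handle
```

A typical use would be to call blingfire_segmenters with your .bin path, pass the two returned functions to compare_segmenters along with a sample of your texts, inspect the differences, and release the handle with blingfire's free_model when done.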

If you wish, you can also ship it this way; call it sbd2 or anything else you like.

If you want to replace the default model, we need to make sure it is better than the default model, so a good amount of testing is required.

Once you are convinced it is better, just generate a CXX file from your binary and replace the existing *cxx file with yours. Then text_to_sentences will use your model.

I am assuming you installed your local package correctly, of course, but it may be worth double-checking.
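
One quick way to double-check which copy of the package Python actually imports is to look at the module's file location. The helper below is a small sketch using only the standard library; module_origin is a hypothetical name introduced for illustration.

```python
import importlib.util


def module_origin(name):
    """Return the file path a top-level module would be imported from,
    or None if the module cannot be found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # Prints the location of the blingfire package that would be imported.
    print(module_origin("blingfire"))
```

If the printed path points at an older install rather than your locally built package, the rebuilt package is not the one being used.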

