niutrans / niutrans.smt

NiuTrans.SMT is an open-source statistical machine translation system developed by a joint team from the NLP Lab at Northeastern University and the NiuTrans Team. The system is written entirely in C++, so it runs fast and has a small memory footprint. It currently supports phrase-based, hierarchical phrase-based, and syntax-based (string-to-tree, tree-to-string, and tree-to-tree) models for research-oriented studies.

License: GNU General Public License v2.0

Perl 1.94% Makefile 0.01% C++ 97.86% C 0.15% Shell 0.01% Python 0.01% Prolog 0.02% Batchfile 0.01% Raku 0.01%
machine-translation statistical-machine-translation decoder phrase-based-translation parsing

niutrans.smt's Introduction

NiuTrans.SMT: A Statistical Machine Translation System

  • NiuTrans.SMT is an open-source statistical machine translation system developed by a joint team from the Natural Language Processing Lab at Northeastern University and YaTrans Co., Ltd. The system is written entirely in C++, so it runs fast and has a small memory footprint. It currently supports phrase-based, hierarchical phrase-based, and syntax-based (string-to-tree, tree-to-string, and tree-to-tree) models for research-oriented studies.

Features

  1. Written in C++, so it runs fast
  2. Multi-threading supported
  3. Easy-to-use APIs for feature engineering
  4. Competitive performance on translation tasks
  5. A compact but efficient n-gram language model is embedded, so no external software (such as SRILM) is needed
  6. Supports multiple SMT models
    • Phrase-based model
    • Hierarchical phrase-based model
    • Syntax-based (string-to-tree, tree-to-string and tree-to-tree) models

Requirements

  • For Windows users, Visual Studio 2008, Cygwin, and Perl (version 5.10.0 or higher) are required. Installing Cygwin under its default path "C:" is suggested.

  • For Linux users, gcc (version 4.1.2 or higher), g++ (version 4.1.2 or higher), GNU Make (version 3.81 or higher), and Perl (version 5.8.8 or higher) are required.

NOTE: 2GB of memory and 10GB of disk space are the minimum requirements for running the system. More memory and disk space help when the system is trained on a large-scale corpus. To support large data/models (such as an n-gram LM), a 64bit OS is recommended.

Installation

For Windows users

- open "NiuTrans.sln" in "NiuTrans\src\"
- set configuration mode to "Release"
- set platform mode to "Win32" (for 32bit OS) or "x64" (for 64bit OS)
- build the whole solution
 You will then find that all binaries are generated in "NiuTrans\bin\".

For Linux users

- cd NiuTrans/src/
- chmod a+x install.sh 
- ./install.sh -m32 (for 32bit OS) or ./install.sh (for 64bit OS)
- source ~/.bashrc
 You will then find that all binaries are generated in "NiuTrans/bin/".

Manual

The package also includes a manual that describes the system in more detail, along with various tricks for building better MT engines with NiuTrans. Click here to download the manual in PDF.

NiuTrans Team

  • Jingbo Zhu (Co-PI)
  • Tong Xiao (Co-PI)
  • Yinqiao Li
  • Quan Du
  • Qiang Wang
  • Yufan Jiang
  • Ye Lin
  • Yuhao Zhang

Acknowledgements: While implementing this project we received support from previous graduates: Qiang Li (phrase extraction and many scripts), Hao Zhang (decoder, ME-reordering model), Rushan Chen (language model), Shujie Yao (data selection and data preprocessing), Ji Ma (language model and CWMT2013 baseline systems), Kunjie Sun (CWMT2013 Chinese-English baseline system), and Zhuo Liu (CWMT2013 English-Chinese baseline system).

How To Cite NiuTrans

If you use NiuTrans.SMT in your research and would like to acknowledge this project, please cite the following paper:

Tong Xiao, Jingbo Zhu, Hao Zhang and Qiang Li. 2012. NiuTrans: An Open Source Toolkit for Phrase-based and Syntax-based Machine Translation. In Proc. of ACL, demonstration session.

Get Support

For any questions about NiuTrans, please e-mail us directly ([email protected]).

History

NiuTrans version 1.4.1 Beta - June 1, 2023 (bug fixes)
NiuTrans version 1.4.0 Beta - May 12, 2018 (bug fixes)
NiuTrans version 1.3.1 Beta - December 1, 2014 (bug fixes for the t2s/t2t decoder and syntactic rule extraction module)
NiuTrans version 1.3.0 Beta - July 17, 2013 (bug fixes, decoder updates, data preprocessing system updates, new scripts for CWMT2013)
NiuTrans version 1.2.0 Beta - January 31, 2013 (bug fixes, decoder updates, add preprocessing system, word-alignment tool and recasing module)
NiuTrans version 1.1.0 Beta - August 1, 2012 (bug fixes)
NiuTrans version 1.0.0 Beta - July 7, 2012 (three syntax-based models are supported)
NiuTrans version 0.3.0 - April 27, 2012 (hierarchical phrase-based model is supported)
NiuTrans version 0.2.0 - October 29, 2011 (bug fixes, 32bit OS supported)
NiuTrans version 0.1.0 - July 5, 2011 (first version)

Acknowledgements

This project is supported in part by the National Science Foundation of China, Specialized Research Fund for the Doctoral Program of Higher Education, and the Fundamental Research Funds for the Central Universities.

niutrans.smt's People

Contributors

liyinqiao2012, xiaotong

niutrans.smt's Issues

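How should emoji and special symbols be handled in machine translation?

I am working on neural machine translation and have already collected a fair amount of data. Chinese-to-English accuracy is acceptable, but things always go wrong when I set up the parallel sentences. During training, conversion between Chinese and Japanese or Korean is also inaccurate, since Korean and Japanese grammar differs from Chinese grammar in many ways, so I would like to ask whether anyone has good suggestions. In addition, if a sentence contains emoji during training, language identification goes wrong and the emoji themselves get swallowed. Are there good solutions to these problems?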

Multiple compile issues on Linux

My C++ is very weak but doing a fresh pull and make has multiple errors when attempting to build from master on Ubuntu 18.04. It appears that there are many issues, the first of which is

OurTree.cpp: In member function ‘bool smt::Tree::CreateForest(const char*)’:
OurTree.cpp:377:23: error: ISO C++ forbids comparison between pointer and integer [-fpermissive]
         while(ibeg != '\0'){
                       ^~~~
Makefile:13: recipe for target 'OurTree.o' failed
make[1]: *** [OurTree.o] Error 1
make[1]: Leaving directory '/home/a.melser/dev/NiuTrans.SMT/src/NiuTrans.Decoder'
Makefile:12: recipe for target 'all' failed
make: *** [all] Error 2

But there seem to be many others, like missing variables (src/NiuTrans.PhraseExtractor/dispatcher.cpp, options.sort_phrase_table), missing methods:

ruletable_scorer.cpp: In member function ‘bool ruletable_scorer::PhraseTable::generatePhraseTable(ruletable_scorer::PhraseAlignment&, bool&, std::ofstream&, bool&, ruletable_scorer::OptionsOfScore&, ruletable_scorer::ScoreClassifyNum&)’:
ruletable_scorer.cpp:280:80: error: no matching function for call to ‘ruletable_scorer::PhraseTable::output(std::ofstream&, bool&, ruletable_scorer::OptionsOfScore&, ruletable_scorer::ScoreClassifyNum&, double&)’
         output( outfile, inverseFlag, options, scoreClassifyNum ,totalFrequency);

And maybe more. Is there something I am missing or has this version not been tested on Linux? If you have a version that has definitely been compiled on Linux I can compare with then I can help get this working!

FYI, none of the links to download packages on http://www.nlplab.com/NiuPlan/NiuTrans.html or http://www.niutrans.com/niutrans/NiuTrans.html are still working.
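
The first error above most likely comes from comparing the pointer itself with a character literal. Below is a minimal sketch of the pattern and one possible fix. It assumes `ibeg` is a `const char*` walking a NUL-terminated string; `CountChars` is a hypothetical stand-in written for illustration, not the actual code in OurTree.cpp.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the loop reported at OurTree.cpp:377.
// Counts the characters that precede the terminating '\0'.
std::size_t CountChars(const char* text) {
    assert(text != nullptr);
    const char* ibeg = text;
    std::size_t n = 0;
    // Reported form:  while (ibeg != '\0')
    // That compares the pointer with a char, which recent GCC versions reject.
    // Assumed intent: stop at the end of the string, so test the character.
    while (*ibeg != '\0') {
        ++ibeg;
        ++n;
    }
    return n;
}
```

With this change `CountChars("abc")` returns 3. If the original loop was really meant as a null-pointer check (older language rules accepted `'\0'` as a null pointer constant), `ibeg != nullptr` would keep the old behavior instead; only the surrounding code can settle which one is intended.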

Segmentation fault when using NiuTrans.Decoder

Hi, I am using the latest version of NiuTrans.SMT, and while following the Quick Walkthrough of the user manual, I encountered the following issue:
image
I am running this on CentOS Stream 8. Please advise. Thank you!

Error about "NiuTrans-running-segmenter"

An error occurred when I ran this script:
perl NiuTrans-running-segmenter.pl \ # Chinese preprocessing -lang ch \ -input ../work/preprocessing/chinese.clean.txt \ -output ../work/preprocessing/chinese.clean.txt.prepro \ -method 01

The error output is as follows:

`########### SCRIPT ########### SCRIPT ############ SCRIPT ##########

NiuTrans Running NiuSeg (version 1.2.0 Beta) --www.nlplab.com

########### SCRIPT ########### SCRIPT ############ SCRIPT ##########
Running: ../bin/NiuSegmenter_CN_x64 ../config/NiuTrans.NiuSeg.ch.config tmp.config.chi 11101
--- Initialize Chinese program ...
--- Chinese_Wrapper : Load configure file.
--- Chinese_Wrapper : Configure file load finished.
--- Chinese_Wrapper : Initialize segmentation ...
Reading keys from ../resource/Dict0920/len2.lex...
Sorting keys...
Analyzing ...
keys wcstok failed

Error ##### chi_LM-Based_word_breaker reports lex:../resource/Dict0920/len2.lex||||||loc:../resource/Dict0920/len2.loc||||||org:../resource/bi.org.dict||||||psn:../resource/Dict0920/len2.psn not found or can't open!

--- Chinese_Wrapper : Segmentations initialize finished.
--- Chinese_Wrapper : Initialize preprocessor ...
--- all_PreProcessing_FullToHalf stand ready.
--- Chinese_Wrapper : PreProcessors initialize finished.
--- Chinese_Wrapper : Initialize prev-recognizers ...
--- all_PrevRecognition_RegexRecognizer stand ready.
--- Chinese_Wrapper : Prev-recognizers initialize finished.
--- Chinese_Wrapper : Initialize post-recognizers ...
--- chi_All_Post_Details stand ready.
--- all_PostRecognition_MergeAtomToCompose stand ready.
--- Chinese_Wrapper : Post-recognizers initialize finished.
--- Chinese_Wrapper : Initialize translators ...
--- chi_Translation_ChinumToArabicnum stand ready.
--- chi_Translation_ArabicNumToEngTranslate stand ready.
--- chi_Translation_BilingualDictionary stand ready.
--- chi_Translation_NumberTranslator stand ready.
Error: Execution of: ../bin/NiuSegmenter_CN_x64 ../config/NiuTrans.NiuSeg.ch.config tmp.config.chi 11101
die with signal 11, with coredump
`
Environment
Linux version 4.4.0-62-generic (buildd@lcy01-30) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) )

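Domain-specific translation: roughly how much data does a statistical translation model need to produce reasonable results?

First of all, thank you for this project. Without knowing any Perl, I managed to run the whole pipeline on my own corpus. (I hit only one error, caused by the "#" character.)

My data set currently has only a few thousand entries. Without any data processing, my experiment gives a BLEU of 0.76 on the training set and 0.26 on the test set.
The model used is the hierarchical phrase-based model.

Besides the question in the title, I would also like to know whether switching to another open-source translation model would improve translation quality.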

perl NiuTrans-running-segmenter.pl -lang ch -input ../sample-data/sample-submission-version/Test-set/Niu.test.txt -output ./sample-data/sample-submission-version/Test-set/pred -method 11

########### SCRIPT ########### SCRIPT ############ SCRIPT ##########

NiuTrans Running NiuSeg (version 1.2.0 Beta) --www.nlplab.com

########### SCRIPT ########### SCRIPT ############ SCRIPT ##########
Running: ../bin/NiuSegmenter_CN_x64 ../config/NiuTrans.NiuSeg.ch.config tmp.config.chi 11111
--- Initialize Chinese program ...
--- Chinese_Wrapper : Load configure file.
--- Chinese_Wrapper : Configure file load finished.
--- Chinese_Wrapper : Initialize segmentation ...
Reading keys from ../resource/Dict0920/len2.lex...
Sorting keys...
Analyzing ...
keys wcstok failed

Error ##### chi_LM-Based_word_breaker reports lex:../resource/Dict0920/len2.lex||||||loc:../resource/Dict0920/len2.loc||||||org:../resource/bi.org.dict||||||psn:../resource/Dict0920/len2.psn; not found or can't open!

--- Chinese_Wrapper : Segmentations initialize finished.
--- Chinese_Wrapper : Initialize preprocessor ...
--- all_PreProcessing_FullToHalf stand ready.
--- Chinese_Wrapper : PreProcessors initialize finished.
--- Chinese_Wrapper : Initialize prev-recognizers ...
--- all_PrevRecognition_RegexRecognizer stand ready.
--- Chinese_Wrapper : Prev-recognizers initialize finished.
--- Chinese_Wrapper : Initialize post-recognizers ...
--- chi_All_Post_Details stand ready.
--- all_PostRecognition_MergeAtomToCompose stand ready.
--- Chinese_Wrapper : Post-recognizers initialize finished.
--- Chinese_Wrapper : Initialize translators ...
--- chi_Translation_ChinumToArabicnum stand ready.
--- chi_Translation_ArabicNumToEngTranslate stand ready.
--- chi_Translation_BilingualDictionary stand ready.
--- chi_Translation_NumberTranslator stand ready.
Error: Execution of: ../bin/NiuSegmenter_CN_x64 ../config/NiuTrans.NiuSeg.ch.config tmp.config.chi 11111
die with signal 11, with coredump
zyyt@ubuntu:~/liuqingmin/enkk_wmt/tools/NiuTrans.SMT/scripts$ vi ../sample-data/sample-submission-version/Test-set/Niu.test.txt

Corpus alignment question

image
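The example uses a Chinese-to-English system, where src is set to the Chinese corpus path and tgt to the English corpus path.
If I want to train an English-to-Chinese system, should src still be the Chinese corpus and tgt still be the English corpus?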
