
ltp's Introduction


Packages: Python (LTP, LTP-Core, LTP-Extension) and Rust (LTP)

LTP 4

LTP (Language Technology Platform) provides a suite of Chinese natural language processing tools that users can apply to Chinese text for tasks such as word segmentation, part-of-speech tagging, and syntactic parsing.

Citation

If you use LTP in your work, please cite this paper:

@inproceedings{che-etal-2021-n,
    title = "N-{LTP}: An Open-source Neural Language Technology Platform for {C}hinese",
    author = "Che, Wanxiang  and
      Feng, Yunlong  and
      Qin, Libo  and
      Liu, Ting",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-demo.6",
    doi = "10.18653/v1/2021.emnlp-demo.6",
    pages = "42--49",
    abstract = "We introduce N-LTP, an open-source neural language technology platform supporting six fundamental Chinese NLP tasks: lexical analysis (Chinese word segmentation, part-of-speech tagging, and named entity recognition), syntactic parsing (dependency parsing), and semantic parsing (semantic dependency parsing and semantic role labeling). Unlike the existing state-of-the-art toolkits, such as Stanza, that adopt an independent model for each task, N-LTP adopts the multi-task framework by using a shared pre-trained model, which has the advantage of capturing the shared knowledge across relevant Chinese tasks. In addition, a knowledge distillation method (Clark et al., 2019) where the single-task model teaches the multi-task model is further introduced to encourage the multi-task model to surpass its single-task teacher. Finally, we provide a collection of easy-to-use APIs and a visualization tool to make users to use and view the processing results more easily and directly. To the best of our knowledge, this is the first toolkit to support six Chinese NLP fundamental tasks. Source code, documentation, and pre-trained models are available at https://github.com/HIT-SCIR/ltp.",
}

Reference book: Natural Language Processing: A Pre-trained Model Approach (《自然语言处理:基于预训练模型的方法》; authors: Che Wanxiang, Guo Jiang, Cui Yiming; reviewing editor: Liu Ting), co-authored by researchers at the Research Center for Social Computing and Information Retrieval at Harbin Institute of Technology (HIT-SCIR), has now been published. The book focuses on the new generation of pre-trained-model-based NLP techniques in three parts (fundamentals, pre-trained word vectors, and pre-trained models) and is a useful reference for LTP users.

Release Notes

  • 4.2.0
    • [Structural change] LTP is split into two parts, making maintenance and training easier and the structure clearer
      • [Legacy models] To meet the widespread demand for inference speed, the perceptron-based algorithms were rewritten in Rust. Accuracy is on par with LTP 3, while speed is 3.55x that of LTP v3, rising to 17.17x with multithreading enabled; for now only the three tasks of word segmentation, POS tagging, and named entity recognition are supported
      • [Deep learning models] The PyTorch-based deep learning models, supporting all 6 tasks (word segmentation / POS / NER / semantic role labeling / dependency parsing / semantic dependency parsing)
    • [Other improvements] Improved model training
      • [Both] Training scripts and training examples are provided, so users can more easily train customized models on their own private data
      • [Deep learning models] Training is configured with hydra, making it easy to adjust training parameters and to extend LTP (for example, using Modules from other packages)
    • [Other changes] The decoding algorithms for word segmentation, dependency parsing (Eisner), and semantic dependency parsing (Eisner) are implemented in Rust for extra speed
    • [New feature] Models are uploaded to the Huggingface Hub with automatic, faster downloads; users can also upload models they trained themselves for LTP to use at inference time
    • [Breaking change] Inference now goes through the Pipeline API, which enables deeper performance optimization later (e.g. SDP and SDPG overlap substantially, and reuse can speed up inference); see the Quick Start section on Github for usage
  • 4.1.0
    • Added custom word segmentation and other features
    • Fixed several bugs
  • 4.0.0
    • Built on PyTorch with a native Python interface
    • Models with different speed/accuracy trade-offs can be chosen as needed
    • Six tasks: word segmentation, POS tagging, NER, dependency parsing, semantic role labeling, semantic dependency parsing

Quick Start

# Option 1: install LTP from the Tsinghua (TUNA) mirror
# 1. Install the PyTorch and Transformers dependencies
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple torch transformers
# 2. Install LTP
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple ltp ltp-core ltp-extension

# Option 2: switch the global index to the mirror first, then install LTP
# 1. Set the TUNA mirror globally
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
# 2. Install the PyTorch and Transformers dependencies
pip install torch transformers
# 3. Install LTP
pip install ltp ltp-core ltp-extension

Note: if you hit any error, first try reinstalling ltp with the commands above; if it still fails, please report it in the Github issues.
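As a quick sanity check after installation, the three packages should import cleanly (a minimal sketch, assuming nothing beyond the packages installed above):

# Minimal post-install sanity check
import torch
import transformers
import ltp

print(torch.__version__)  # the installed PyTorch version
print(ltp.__file__)       # where the ltp package was installed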

import torch
from ltp import LTP

# Downloads from huggingface by default; a proxy may be needed

ltp = LTP("LTP/small")  # loads the Small model by default
                        # a local model path also works: ltp = LTP("/path/to/your/model")
                        # /path/to/your/model must contain config.json and the other model files

# Move the model to the GPU
if torch.cuda.is_available():
    # ltp.cuda()
    ltp.to("cuda")

# Custom lexicon
ltp.add_word("汤姆去", freq=2)
ltp.add_words(["外套", "外衣"], freq=2)

# Tasks: cws = word segmentation, pos = POS tagging, ner = named entity recognition,
# srl = semantic role labeling, dep = dependency parsing, sdp = semantic dependency tree,
# sdpg = semantic dependency graph
output = ltp.pipeline(["他叫汤姆去拿外衣。"], tasks=["cws", "pos", "ner", "srl", "dep", "sdp", "sdpg"])
# Results come back in dict format
print(output.cws)  # index access also works: print(output[0]) / print(output['cws'])
print(output.pos)
print(output.sdp)

# Perceptron-based word segmentation, POS tagging, and NER: faster, slightly less accurate
ltp = LTP("LTP/legacy")
# cws, pos, ner = ltp.pipeline(["他叫汤姆去拿外衣。"], tasks=["cws", "ner"]).to_tuple() # error: NER needs the POS tagging results
cws, pos, ner = ltp.pipeline(["他叫汤姆去拿外衣。"], tasks=["cws", "pos", "ner"]).to_tuple()  # to_tuple() converts the result to a tuple
# Results as a tuple
print(cws, pos, ner)
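Since pipeline() takes a list of sentences, batching works in a single call. The sketch below assumes the dict-format fields (output.cws, output.pos) hold one list per input sentence, consistent with the single-sentence example above:

from ltp import LTP

ltp = LTP("LTP/small")
sentences = ["他叫汤姆去拿外衣。", "汤姆穿上了外套。"]
output = ltp.pipeline(sentences, tasks=["cws", "pos"])
# Assumption: output.cws and output.pos each hold one entry per input sentence
for words, tags in zip(output.cws, output.pos):
    print(list(zip(words, tags)))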

Detailed Usage

The legacy (perceptron) models can also be used directly from Rust:

use std::fs::File;
use itertools::multizip;
use ltp::{CWSModel, POSModel, NERModel, ModelSerde, Format, Codec};

fn main() -> Result<(), Box<dyn std::error::Error>> {
  // Load the three legacy models (AVRO format, Deflate codec)
  let file = File::open("data/legacy-models/cws_model.bin")?;
  let cws: CWSModel = ModelSerde::load(file, Format::AVRO(Codec::Deflate))?;
  let file = File::open("data/legacy-models/pos_model.bin")?;
  let pos: POSModel = ModelSerde::load(file, Format::AVRO(Codec::Deflate))?;
  let file = File::open("data/legacy-models/ner_model.bin")?;
  let ner: NERModel = ModelSerde::load(file, Format::AVRO(Codec::Deflate))?;

  // Run segmentation, then POS tagging, then NER on the results
  let words = cws.predict("他叫汤姆去拿外衣。")?;
  let pos = pos.predict(&words)?;
  let ner = ner.predict((&words, &pos))?;

  for (w, p, n) in multizip((words, pos, ner)) {
    println!("{}/{}/{}", w, p, n);
  }

  Ok(())
}

Model Performance and Download Links

Deep learning models (🤗HF / 🗜archive)   CWS     POS     NER     SRL     DEP     SDP     Speed (sents/s)
🤗Base 🗜Base                             98.7    98.5    95.4    80.6    89.5    75.2    39.12
🤗Base1 🗜Base1                           99.22   98.73   96.39   79.28   89.57   76.57   --.--
🤗Base2 🗜Base2                           99.18   98.69   95.97   79.49   90.19   76.62   --.--
🤗Small 🗜Small                           98.4    98.2    94.3    78.4    88.3    74.7    43.13
🤗Tiny 🗜Tiny                             96.8    97.1    91.6    70.9    83.8    70.1    53.22

Perceptron models (🤗HF / 🗜archive)      CWS     POS     NER     Speed (sents/s)   Notes
🤗Legacy 🗜Legacy                         97.93   98.41   94.28   21581.48          performance details

Note: the perceptron speed was measured with 16 threads enabled.
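Read the table as a speed/accuracy trade-off: switching models is just a matter of the identifier passed to the constructor. Only LTP/small, LTP/legacy, and LTP/base are confirmed elsewhere on this page; the other Hub names are inferred from the table and worth double-checking:

from ltp import LTP

# ltp = LTP("LTP/base")    # most accurate deep-learning model, slowest
ltp = LTP("LTP/small")     # the balanced default
# ltp = LTP("LTP/tiny")    # fastest deep-learning model (name inferred from the table)
# ltp = LTP("LTP/legacy")  # perceptron models: cws/pos/ner only, by far the fastest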

How to Download the Models

# Download via HTTP
# Make sure git-lfs is installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/LTP/base

# Download via ssh
# Make sure git-lfs is installed (https://git-lfs.com)
git lfs install
git clone [email protected]:LTP/base

# Download the archive
wget http://39.96.43.154/ltp/v4/base.tgz
mkdir -p base && tar -zxvf base.tgz -C base

How to Use a Downloaded Model

from ltp import LTP

# Point LTP at the downloaded or extracted model directory
# e.g. the folder path for the base model is "path/to/base"
#      "path/to/base" must contain "config.json"
ltp = LTP("path/to/base")

Building the Wheel Package

make bdist

Bindings for Other Languages

Perceptron algorithm

Deep learning models

Author Information

Open Source License

  1. The Language Technology Platform is free and open-source for universities at home and abroad, institutes of the Chinese Academy of Sciences, and individual researchers; if such institutions or individuals use the platform for commercial purposes (e.g. corporate cooperation projects), a fee is required.
  2. Enterprises and institutions other than the above must pay to use the platform.
  3. For any payment matters, please email [email protected] to discuss.
  4. If you publish papers or obtain research results based on LTP, please state "the Language Technology Platform (LTP) developed by the Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology, was used" when publishing or filing the results, and email [email protected] with the title and venue of the paper or result.

ltp's People

Contributors

alongwy · dependabot[bot] · weidongkl


ltp's Issues

Is the ltp-data-v3.1.0.zip model on Baidu Netdisk broken?

I tried both ltp-3.1.0 and the latest checkout from github; both show this problem

$ bin/ltp_test ltp_data/ltp.cnf dp 1.txt 
[TRACE] 2014/04/03 12:02:18 Loading segmentor model from "ltp_data/cws.model" ...
[TRACE] 2014/04/03 12:02:18 segmentor model is loaded.
[WARNING] 2014/04/03 12:02:18 No "postagger-model" config is found
[TRACE] 2014/04/03 12:02:18 Loading parser resource from "ltp_data/parser.model"
1792
-1
[ERROR] 2014/04/03 12:02:18 /home/feng/fun/ltp/src/__ltp_dll/LTPResource.cpp: line 183: LoadParserResource(): Failed to create parser
[ERROR] 2014/04/03 12:02:18 /home/feng/fun/ltp/src/__ltp_dll/Ltp.cpp: line 128: ReadConfFile(): in LTP::parser, failed to load parser resource
Failed to load LTP
[TRACE] 2014/04/03 12:02:18 segmentor model is released.

Stepping through with a debugger shows that feat_opt.use_sibling is set, yet parser.model contains only a single collections entry

$ xxd parser.model |grep collections
0000700: 636f 6c6c 6563 7469 6f6e 7300 0000 0000  collections.....

What is going on here: Symbol not found: __Z26postagger_create_postaggerPKc

I wrote a nodejs extension in cpp, and it compiles fine, but this shows up at call time.
Could you help me figure out the cause?
THX

node t.js
[TRACE] 2013/09/24 17:35:03 Loading segmentor model from "ltp_data/cws.model" ...
[TRACE] 2013/09/24 17:35:03 segmentor model is loaded.
[TRACE] 2013/09/24 17:35:03 Loading postagger model from "ltp_data/pos.model" ...
dyld: lazy symbol binding failed: Symbol not found: __Z26postagger_create_postaggerPKc
Referenced from: /Users/iceet/Mine/ltp/src/node/build/Release/hello.node
Expected in: dynamic lookup

dyld: Symbol not found: __Z26postagger_create_postaggerPKc
Referenced from: /Users/iceet/Mine/ltp/src/node/build/Release/hello.node
Expected in: dynamic lookup

Word segmentation C++ code (3.0.0alpha)

@Oneplus please take a look
I copied your code and named it seg.cc on linux

#include <iostream>
#include <string>
#include <vector>
#include "segment_dll.h"

int main(int argc, char * argv[]) {
    if (argc < 2) {
        std::cerr << "cws [model path]" << std::endl;
        return 1;
    }

    void * engine = segmentor_create_segmentor(argv[1]);
    if (!engine) {
        return -1;
    }
    std::vector<std::string> words;
    int len = segmentor_segment(engine, 
            "爱上一匹野马,可我的家里没有草原。", words);
    for (int i = 0; i < len; ++ i) {
        std::cout << words[i] << "|";
    }
    std::cout << std::endl;
    segmentor_release_segmentor(engine);
    return 0;
}

Then I ran
g++ seg.cc segmentor.a -o seg
and it does not seem to compile (ps: I am not familiar with C++ programming on linux)

segmentor.cpp:(.text._ZN5boost9re_detail12perl_matcherIN9__gnu_cxx17__normal_iteratorIPKcSsEESaINS_9sub_matchIS6_EEENS_12regex_traitsIcNS_16cpp_regex_traitsIcEEEEE8find_impEv[boost::re_detail::perl_matcher<__gnu_cxx::__normal_iterator<char const*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<boost::sub_match<__gnu_cxx::__normal_iterator<char const*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, boost::regex_traits<char, boost::cpp_regex_traits<char> > >::find_imp()]+0x18f): undefined reference to `boost::re_detail::put_mem_block(void*)'
segmentor.cpp:(.text._ZN5boost9re_detail12perl_matcherIN9__gnu_cxx17__normal_iteratorIPKcSsEESaINS_9sub_matchIS6_EEENS_12regex_traitsIcNS_16cpp_regex_traitsIcEEEEE8find_impEv[boost::re_detail::perl_matcher<__gnu_cxx::__normal_iterator<char const*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<boost::sub_match<__gnu_cxx::__normal_iterator<char const*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, boost::regex_traits<char, boost::cpp_regex_traits<char> > >::find_imp()]+0x361): undefined reference to `boost::re_detail::put_mem_block(void*)'
collect2: ld returned 1 exit status

Compiling Errors on Mac OS X after deleting Logger.h and Logger.cpp from CMakeLists.txt

Scanning dependencies of target postagger
[ 40%] Building CXX object src/_svmtagger/CMakeFiles/postagger.dir/dict.cpp.o
In file included from /usr/include/sys/signal.h:148,
from /usr/include/sys/wait.h:116,
from /usr/include/stdlib.h:65,
from /usr/include/c++/4.2.1/cstdlib:72,
from /usr/include/c++/4.2.1/bits/stl_algobase.h:68,
from /usr/include/c++/4.2.1/bits/char_traits.h:46,
from /usr/include/c++/4.2.1/string:47,
from /usr/local/include/boost/regex/v4/cregex.hpp:207,
from /usr/local/include/boost/cregex.hpp:27,
from /Users/wanxiang/Documents/workspace/ltp/src/_svmtagger/er.h:24,
from /Users/wanxiang/Documents/workspace/ltp/src/_svmtagger/dict.cpp:27:
/usr/include/sys/_structs.h:218: error: conflicting declaration ‘typedef struct __darwin_sigaltstack stack_t’
/Users/wanxiang/Documents/workspace/ltp/src/svmtagger/swindow.h:44: error: ‘struct stack_t’ has a previous declaration as ‘struct stack_t’
/Users/wanxiang/Documents/workspace/ltp/src/svmtagger/dict.cpp: In member function ‘void dictionary::dictWrite(char)’:
/Users/wanxiang/Documents/workspace/ltp/src/svmtagger/dict.cpp:136: warning: deprecated conversion from string constant to ‘char
/Users/wanxiang/Documents/workspace/ltp/src/svmtagger/dict.cpp: In member function ‘void dictionary::dictCreate(FILE, int, int)’:
/Users/wanxiang/Documents/workspace/ltp/src/svmtagger/dict.cpp:250: warning: deprecated conversion from string constant to ‘char
/Users/wanxiang/Documents/workspace/ltp/src/svmtagger/dict.cpp: In member function ‘void dictionary::dictRepairFromFile(char)’:
/Users/wanxiang/Documents/workspace/ltp/src/svmtagger/dict.cpp:325: warning: deprecated conversion from string constant to ‘char
/Users/wanxiang/Documents/workspace/ltp/src/svmtagger/dict.cpp: In member function ‘void dictionary::dictAddBackup(char)’:
/Users/wanxiang/Documents/workspace/ltp/src/svmtagger/dict.cpp:462: warning: deprecated conversion from string constant to ‘char
/Users/wanxiang/Documents/workspace/ltp/src/svmtagger/dict.cpp: In constructor ‘dictionary::dictionary(char, char
)’:
/Users/wanxiang/Documents/workspace/ltp/src/svmtagger/dict.cpp:582: warning: deprecated conversion from string constant to ‘char
/Users/wanxiang/Documents/workspace/ltp/src/svmtagger/dict.cpp:586: warning: deprecated conversion from string constant to ‘char
/Users/wanxiang/Documents/workspace/ltp/src/svmtagger/dict.cpp: In constructor ‘dictionary::dictionary(char)’:
/Users/wanxiang/Documents/workspace/ltp/src/svmtagger/dict.cpp:593: warning: deprecated conversion from string constant to ‘char
/Users/wanxiang/Documents/workspace/ltp/src/svmtagger/dict.cpp: In constructor ‘dictionary::dictionary(char, int, int)’:
/Users/wanxiang/Documents/workspace/ltp/src/svmtagger/dict.cpp:601: warning: deprecated conversion from string constant to ‘char
make[3]: *** [src/_svmtagger/CMakeFiles/postagger.dir/dict.cpp.o] Error 1
make[2]: *** [src/_svmtagger/CMakeFiles/postagger.dir/all] Error 2
make[1]: *** [all] Error 2
make: *** [all] Error 2

mingw compiling error

As far as I know, the main reason for the compiling error is the wrongly defined macro

#ifdef __WIN32__

This kind of macro is designed for MSVC; however, mingw under windows also triggers it.

Another issue is that the usage of hashmap under mingw is still a mystery to me.

can't find LTP model

Hello, the models mentioned in the documentation have no download link; could you give one?
Thanks!

template.hpp implementation efficiency

ltp::utility::Template is the most heavily used basic data structure in ltp; it instantiates feature strings from feature templates. For example, the feature template T=3={w0}-{p0}, with w0=am and p0=v, is instantiated as 3=am-v.

The implementation splits a feature template into tokens (3=, w0, -, p0) and stores them in the Template_Token_Cache singleton. The Template is then converted into the tokens' indices; in this example, template T is represented by the index list (0,1,2,3).

Before each instantiation, a Template::Data object is created to hold each instantiated token. Template::Data keeps a copy of the Token_Cache, and calls to the set method instantiate the corresponding key to its value. Rendering the template then amounts to concatenating the tokens that the template's index list points to.

The old implementation performed a lookup every time a key was set. Since the number of tokens was assumed to be small, this used a linear scan. A recent unittest showed that replacing the scan with a hashmap is faster.

This change directly affects the speed of every module (training and decoding). Preliminary experiments show roughly a 40% speedup on the second-order sibling model, raising parsing speed from 7.35 sents/s to 12.7 sents/s (varies by test machine). Further testing is needed.
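The gist of the change, sketched in Python rather than the actual C++ (names are hypothetical, for illustration only): the per-template token list stays fixed, and set() moves from a linear scan over the tokens to a hash lookup.

# Template "3={w0}-{p0}" split into tokens, as described above
tokens = ["3=", "w0", "-", "p0"]

def set_linear(cache, key, value):
    # old behaviour: linear scan to find the key's index
    for i, tok in enumerate(tokens):
        if tok == key:
            cache[i] = value
            return

index = {tok: i for i, tok in enumerate(tokens)}  # built once per template

def set_hashed(cache, key, value):
    # new behaviour: O(1) hashmap lookup replaces the scan
    cache[index[key]] = value

def render(cache):
    # rendering concatenates tokens, substituting any set values
    return "".join(cache.get(i, tok) for i, tok in enumerate(tokens))

cache = {}
set_hashed(cache, "w0", "am")
set_hashed(cache, "p0", "v")
print(render(cache))  # prints: 3=am-v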

optimize server code

The current performance of ltp-server is listed below.

  • average response time: 1.25 (secs)
  • throughput: 7.806 (req/sec)

Here are some problems:

  • cpu usage is low during the test (about 60% of a single thread); maybe too much time is spent on web I/O?
  • why not implement a multithreaded server to improve performance? (see the sketch below)
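A minimal sketch of the second suggestion, in Python standard library only (the real server is C++, and analyze() is a placeholder for the actual pipeline call):

from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

def analyze(text: str) -> str:
    return text  # placeholder: the real server would run the LTP pipeline here

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        result = analyze(self.rfile.read(length).decode("utf-8")).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Length", str(len(result)))
        self.end_headers()
        self.wfile.write(result)

# ThreadingHTTPServer gives each request its own thread, so slow web I/O
# no longer serializes the whole service behind a single connection.
ThreadingHTTPServer(("0.0.0.0", 8080), Handler).serve_forever()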

Is there a bug at line 62 of parser_dll.cpp?

I tried to parse the sentence "我们都是中国人。" with the following code

    // dependency parsing interface
    // parse
    void * engine3 = parser_create_parser("/Users/iceet/Mine/Bill/tools/xls/ltp_data/parser.model");

    vector<int>            heads;
    vector<std::string>    deprels;
    vector< pair<int, string> > parse;

    parser_parse(engine3, words, postags, heads, deprels);
    for (int i = 0; i < heads.size(); ++ i) {
        //std::cout << words[i] << "\t" << tags[i] << "\t" 
        //    << heads[i] << "\t" << deprels[i] << std::endl;
        //parser[i].first = heads[i];
       // heads[i] = heads[i];
        cout << heads[i] << deprels[i] << endl;
//         int parentIdx = atoi( heads[i].c_str() );

        //parser[i].second = deprels[i];
        parse.push_back(make_pair(heads[i],deprels[i]));
    }

Expected result A:

我们/r    2 SBV
都/d   2 ADV
是/v   -1 HED
中国/ns   4 ATT
人/n   2 VOB
。/wp  2 WP

Actual result B:

我们/r    3 SBV
都/d   3 ADV
是/v   0 HED
中国/ns   5 ATT
人/n   3 VOB
。/wp  3 WP

Then I found this at line 62 of parser_dll.cpp

 int len = inst->size();
        heads.resize(len - 1);
        deprels.resize(len - 1);
        for (int i = 1; i < len; ++ i) {
            heads[i - 1] = inst->predicted_heads[i]; // should this subtract 1?
            deprels[i - 1] = ltp::parser::Parser::model->deprels.at(
                    inst->predicted_deprelsidx[i]);
        }
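The two outputs differ by a fixed offset: result A indexes heads from 0 with -1 for the root, while result B indexes from 1 with 0 for the root, which is exactly what the missing "- 1" in the loop above would explain. An illustrative conversion (Python, not the actual fix):

heads_b = [3, 3, 0, 5, 3, 3]        # result B: 1-based heads, root = 0
heads_a = [h - 1 for h in heads_b]  # result A: 0-based heads, root = -1
print(heads_a)                      # [2, 2, -1, 4, 2, 2]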

SRL and maxent updated

Hi all,

The srl module has been updated, as well as a new maxent package. Some details are shown below:

  • A predicate recognition (PRG) module is added, displacing the previous POS-based recognizer.
  • A new maxent package is added, placed in the thirdparty/ folder. Three solvers are provided: L1-OWLQN, L1-SGD and L2-LBFGS.
  • The latest SRL module by default uses the L1-OWLQN solver, in order to obtain a sparse and thus much smaller model.
  • An integral training and testing suite for PRG/SRL is provided (see src/_srl/lgsrl.cpp), displacing the previous mass of EXEs. Besides, several configuration files for srl training are updated (see tools/train/assets/)
  • I did not reorganize the source architecture, such as renaming _srl to srl. Yijia please help do it.

static boost::regex causes a destruction conflict under multithreading

Running the multithreaded ltp_server normally shows no problem, since in theory the server should never shut down. But with multi_cws_cmdline, the two regular expressions defined in segmentor::rulebase dump core when they are destructed.

How do I use ltp_test after a successful build?

Building on mac and linux produced ltp_test and ltp_test2, plus ltp_test_xml. How are these three executables used? Also, is there a python interface on linux, or do I have to write my own program against the shared library functions?

Mixed letters and digits are segmented incorrectly

When letters and digits are mixed, they are wrongly split apart; for example, in "惠普d2015打印机", "d" and "2015" become two separate tokens.
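Until the segmenter handles such tokens, the custom lexicon of the current Python API may serve as a workaround (a sketch reusing the add_word call from the Quick Start; whether it overrides this particular error is untested):

from ltp import LTP

ltp = LTP("LTP/small")
ltp.add_word("d2015", freq=2)  # keep the mixed letter-digit token whole
print(ltp.pipeline(["惠普d2015打印机"], tasks=["cws"]).cws)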

Dead code cleanup

The files listed below are no longer used by LTP

__util/conversion_utf.h
__util/decode_gbk.h
__util/EncodeUtil.cpp
__util/EncodeUtil.h
__util/gbk_u16.h
__util/IniReader.cpp
__util/IniReader.h
__util/Logger.cpp
__util/Logger.h
__util/md5.cpp
__util/md5.h
__util/SBC2DBC.cpp
__util/SBC2DBC.h
__util/TextProcess.cpp
__util/TextProcess.h
__util/Timer.h

Compiling Error on Mac OS X 10.7.5

Wanxiangs-MacBook-Pro:ltp wanxiang$ make
[ 10%] Built target crfpp
[ 17%] Built target maxent
[ 21%] Built target tinyxml
[ 22%] Building CXX object src/__util/CMakeFiles/util.dir/Logger.cpp.o
In file included from /Users/wanxiang/Documents/workspace/ltp/src/__util/Logger.cpp:10:
/Users/wanxiang/Documents/workspace/ltp/src/__util/Logger.h:71: error: ‘MAX_PATH’ was not declared in this scope
/Users/wanxiang/Documents/workspace/ltp/src/__util/Logger.h:76: error: ‘semaphore’ does not name a type
/Users/wanxiang/Documents/workspace/ltp/src/__util/Logger.h:77: error: ‘semaphore’ has not been declared
/Users/wanxiang/Documents/workspace/ltp/src/__util/Logger.h:78: error: ‘semaphore’ has not been declared
/Users/wanxiang/Documents/workspace/ltp/src/__util/Logger.h:79: error: ‘semaphore’ has not been declared
/Users/wanxiang/Documents/workspace/ltp/src/_util/Logger.cpp: In constructor ‘CLogger::CLogger(const char, int)’:
/Users/wanxiang/Documents/workspace/ltp/src/__util/Logger.cpp:16: error: ‘m_csLogger’ was not declared in this scope
/Users/wanxiang/Documents/workspace/ltp/src/__util/Logger.cpp: At global scope:
/Users/wanxiang/Documents/workspace/ltp/src/__util/Logger.cpp:32: error: variable or field ‘InitializeCriticalSection’ declared void
/Users/wanxiang/Documents/workspace/ltp/src/__util/Logger.cpp:32: error: ‘semaphore’ was not declared in this scope
/Users/wanxiang/Documents/workspace/ltp/src/__util/Logger.cpp:32: error: ‘s’ was not declared in this scope
/Users/wanxiang/Documents/workspace/ltp/src/__util/Logger.cpp:42: error: variable or field ‘EnterCriticalSection’ declared void
/Users/wanxiang/Documents/workspace/ltp/src/__util/Logger.cpp:42: error: ‘semaphore’ was not declared in this scope
/Users/wanxiang/Documents/workspace/ltp/src/__util/Logger.cpp:42: error: ‘s’ was not declared in this scope
/Users/wanxiang/Documents/workspace/ltp/src/__util/Logger.cpp:66: error: variable or field ‘LeaveCriticalSection’ declared void
/Users/wanxiang/Documents/workspace/ltp/src/__util/Logger.cpp:66: error: ‘semaphore’ was not declared in this scope
/Users/wanxiang/Documents/workspace/ltp/src/__util/Logger.cpp:66: error: ‘s’ was not declared in this scope
/Users/wanxiang/Documents/workspace/ltp/src/_util/Logger.cpp: In member function ‘void CLogger::Log(int, const char, _va_list_tag)’:
/Users/wanxiang/Documents/workspace/ltp/src/__util/Logger.cpp:99: error: ‘m_csLogger’ was not declared in this scope
/Users/wanxiang/Documents/workspace/ltp/src/__util/Logger.cpp:108: error: ‘m_OutputBuf’ was not declared in this scope
/Users/wanxiang/Documents/workspace/ltp/src/__util/Logger.cpp:141: error: ‘m_OutputBuf’ was not declared in this scope
/Users/wanxiang/Documents/workspace/ltp/src/__util/Logger.cpp:158: error: ‘m_OutputBuf’ was not declared in this scope
/Users/wanxiang/Documents/workspace/ltp/src/__util/Logger.cpp:176: error: ‘m_OutputBuf’ was not declared in this scope
/Users/wanxiang/Documents/workspace/ltp/src/__util/Logger.cpp:190: error: ‘m_OutputBuf’ was not declared in this scope
/Users/wanxiang/Documents/workspace/ltp/src/__util/Logger.h: At global scope:
/Users/wanxiang/Documents/workspace/ltp/src/_util/Logger.h:77: warning: inline function ‘static void CLogger::InitializeCriticalSection(int)’ used but never defined
/Users/wanxiang/Documents/workspace/ltp/src/_util/Logger.h:78: warning: inline function ‘static void CLogger::EnterCriticalSection(int)’ used but never defined
/Users/wanxiang/Documents/workspace/ltp/src/_util/Logger.h:79: warning: inline function ‘static void CLogger::LeaveCriticalSection(int)’ used but never defined
make[3]: *** [src/__util/CMakeFiles/util.dir/Logger.cpp.o] Error 1
make[2]: *** [src/__util/CMakeFiles/util.dir/all] Error 2
make[1]: *** [all] Error 2
make: *** [all] Error 2

3.0.0 alpha

Where can I download the latest model files matching 3.0.0 alpha?

win64 build

The top-level CMakeLists.txt excludes unittest and ltp_server from win32 builds, but under win64 these two modules are still built.

Is there an example of calling srl? I wrote a simple one and it fails to return results

My code is below.
Could the author provide an srl usage example?

/*
 * NodeJS LTP extension
 * @author 蜗眼
 * First time writing C/C++, go easy on me...
 */
#include <v8.h>
#include "Xml4nlp.h"
#include "Ltp.h"
#include "segment_dll.h"
#include "postag_dll.h"
#include "ner_dll.h"
#include "parser_dll.h"
#include "SRL_DLL.h"
#include <iostream>
#include <string>
#include <node.h>
using namespace node;
using namespace v8;
using namespace std;
//using namespace ltp::strutils::codecs;
#include "DepSRL.h"
static DepSRL g_depSRL;
Handle<Value> Method(const Arguments& args) {
   HandleScope scope;
   const char* path = "/Users/iceet/Mine/Bill/tools/xls/ltp_data/cws.model";
   const char* selfs = "/Users/iceet/Mine/Bill/src/BillNLP/self.dic";
   void* engine = segmentor_create_segmentor(path, selfs);
   vector<string> words;
   // word segmentation interface
   int len = segmentor_segment(engine, "我们都是中国人。", words);
   for (int i = 0; i < len; ++i) {
      std::cout << words[i] << "|";
   }
   std::cout << std::endl;
   segmentor_release_segmentor(engine);
   // POS tagging
   void* engine1 = postagger_create_postagger("/Users/iceet/Mine/Bill/tools/xls/ltp_data/pos.model");
   std::vector<std::string> tags;
   postagger_postag(engine1, words, tags);
   for (int i = 0; i < tags.size(); ++i) {
      std::cout << words[i] << "/" << tags[i];
      if (i == tags.size() - 1) std::cout << std::endl;
      else std::cout << " ";
   }
   postagger_release_postagger(engine1);
   // named entity recognition interface
   void* engin2 = ner_create_recognizer("/Users/iceet/Mine/Bill/tools/xls/ltp_data/ner.model");
   int ret;
   std::vector<string> vec;
   ret = ner_recognize(engin2, words, tags, vec);
   //std::cout << vec.size() << std::endl;
   for (int i = 0; i < vec.size(); ++i) {
      std::cout << vec[i] << "<<" << i;
      if (i == vec.size() - 1) std::cout << std::endl;
      else std::cout << " ";
   }
   // dependency parsing interface
   void* engine3 = parser_create_parser("/Users/iceet/Mine/Bill/tools/xls/ltp_data/parser.model");
   vector<int> heads;
   vector<std::string> deprels;
   vector< pair<int, string> > parser;
   parser_parse(engine3, words, tags, heads, deprels);
   for (int i = 0; i < heads.size(); ++i) {
      std::cout << words[i] << "\t" << tags[i] << "\t" << heads[i] << "\t" << deprels[i] << std::endl;
      //parser[i].first = heads[i];
      cout << heads[i] << endl;
      // int parentIdx = atoi( heads[i].c_str() );
      //parser[i].second = deprels[i];
      parser.push_back(make_pair(static_cast<int>(heads[i]), deprels[i]));
   }
   SRL_LoadResource("/Users/iceet/Mine/Bill/tools/xls/ltp_data/srl/");
   vector< pair<int, vector< pair<const char*, pair<int, int> > > > > vecSRLResult;
   SRL(words, tags, vec, parser, vecSRLResult);
   // srl results: why is the size 0 here?
   cout << vecSRLResult.size() << endl;
   int j = 0;
   for (; j < vecSRLResult.size(); ++j) {
      vector<string> vecType;
      vector< pair<int, int> > vecBegEnd;
      int k = 0;
      for (; k < vecSRLResult[j].second.size(); ++k) {
         // vecType.push_back(vecSRLResult[j].second[k].first);
         //vecBegEnd.push_back(vecSRLResult[j].second[k].second);
         // std::cout << vecSRLResult[j].second[k].first[0] << "\t" << endl;
         std::cout << k << endl;
      }
      cout << "--" << endl;
   }
   return scope.Close(Undefined());
}

void init(Handle<Object> exports) {
   exports->Set(String::NewSymbol("analyze"),
      FunctionTemplate::New(Method)->GetFunction());
}

NODE_MODULE(BillNLP, init)

Improve PoSTagging performance

【视频日媒称越南因南海争端停播中国央视节目越南停播央视日媒新浪视频】

The trailing 】 gets tagged incorrectly.

Ⅰ、大豆及其制品。

The Ⅰ gets tagged incorrectly.

A feature for recognizing special characters should be added.

Which gcc version is recommended?

After a successful build, running the test:
./bin/ltp_test "ws" "test_data/test_gb.txt"

produces an error:
terminate called after throwing an instance of 'std::string'

ltp_server errors when called after startup; what is going on?

The system is ubuntu 12.04

Input sentence is: **你好?
-->>debug make xml
[XML4NLP ERROR REPORT]
description : Error document empty.
location :
row : 0
col : 0

===================

[ERROR] 2013/09/02 08:51:15 /home/parallels/ltp/src/__ltp_dll/Ltp.cpp: line 147: splitSentence_dummy(): in LTP::splitsent, There is no paragraph in doc,
[ERROR] 2013/09/02 08:51:15 /home/parallels/ltp/src/__ltp_dll/Ltp.cpp: line 148: splitSentence_dummy(): you may have loaded a blank file or have not loaded a file yet.
[ERROR] 2013/09/02 08:51:15 /home/parallels/ltp/src/__ltp_dll/Ltp.cpp: line 182: wordseg(): in LTP::wordseg, failed to perform split sentence preprocess.
[ERROR] 2013/09/02 08:51:15 /home/parallels/ltp/src/__ltp_dll/Ltp.cpp: line 233: postag(): in LTP::postag, failed to perform word segment preprocess
[ERROR] 2013/09/02 08:51:15 /home/parallels/ltp/src/__ltp_dll/Ltp.cpp: line 284: ner(): in LTP::ner, failed to perform postag preprocess
[ERROR] 2013/09/02 08:51:15 /home/parallels/ltp/src/__ltp_dll/Ltp.cpp: line 397: srl(): in LTP::srl, failed to perform ner preprocess
Result is: HTTP/1.1 200 OK

cmake error on windows 7: getopt.c not found

System: windows 7,
visual studio 2008 (full install),
cmake 2.8.11.2,
the error is as follows:

CMake Error at thirdparty/maxent/CMakeLists.txt:80 (add_executable):
  Cannot find source file:

    getopt.c

  Tried extensions .c .C .c++ .cc .cpp .cxx .m .M .mm .h .hh .h++ .hm .hpp
  .hxx .in .txx

词性"z"

在PoSTagging的结果中含有“z”词性(依照北大标注规范),但是在Parser的训练数据中没有"z"词性。

一个解决方法是用自动词性+god dep-relation训练一个parser model。

另外,是否需要z词性还是需要再讨论。

Redo the documentation

Binary doc-format files cannot be change-tracked and are unsuitable for github and open-source projects; they should be redone in TeX, Markdown, or similar.

ltp_test's odd model loading

The current logic loads the models for all modules at once.

This design is not very elegant: some users only want word segmentation, yet the parser model gets loaded too.

The models to load and the tasks to run should be specified in the configuration file.

src/srl/SRLBaselineExt.cpp throws a segmentation fault

At line 83 inside SRLBaselineExt::ExtractPrgFeatures(vector< vector >& vecPrgFeatures).

My call is as follows:

SRL_LoadResource("/Users/iceet/Mine/Bill/tools/xls/ltp_data/srl/");

    vector< pair< int, vector< pair<const char *, pair< int, int > > > > > vecSRLResult;

    std::cout << "==ing===" << endl;

    SRL(words, tags, vec, parser, vecSRLResult); // the error occurs when this executes

    std::cout << "=======e" << std::endl;

I eventually traced it to this code in SRLBaselineExt.cpp

    for (size_t row = 1; row <= row_count; ++row)
    {
        vector<string> instance;
        for (size_t i = 0; i < m_prgFeatureNumbers.size(); ++i)
        {
            string feature = m_prgFeaturePrefixes[i] + "@"
                + vec_feature_values[i][row]; // the problem is here
            instance.push_back(feature);
        }
        vecPrgFeatures.push_back(instance);
    }

The new code fails to compile on MAC: 'tr1/unordered_map' file not found

I downloaded the new version of the source and compiled it:

[ 34%] Building CXX object src/segmentor/CMakeFiles/otcws.dir/otcws.cpp.o
In file included from 。。。/LTP/src/segmentor/otcws.cpp:2:
。。。/LTP/src/utils/cfgparser.hpp:11:10: fatal error:
'tr1/unordered_map' file not found

#include <tr1/unordered_map>

     ^

1 error generated.
make[3]: *** [src/segmentor/CMakeFiles/otcws.dir/otcws.cpp.o] Error 1
make[2]: *** [src/segmentor/CMakeFiles/otcws.dir/all] Error 2
make[1]: *** [all] Error 2
make: *** [all] Error 2
