Sbert-ChineseExample

A Chinese information retrieval example built on Sentence-Transformers


About The Project

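Sentence Transformers is a multilingual, multimodal sentence-embedding framework. Built on Huggingface Transformers, it makes it simple to generate distributed vector representations of sentences and paragraphs.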

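The goal of this project is to train a bi_encoder and a cross_encoder to perform information retrieval on a Chinese dataset, similar to the MS MARCO task. It is paired with a customized pandas-style elasticsearch interface so that the outputs (text and vectors) can be serialized conveniently.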
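
The two-stage idea described above (a bi_encoder retrieves candidates quickly, then a cross_encoder rescores them) can be sketched roughly as follows. This is a minimal, dependency-free illustration and not this project's code: `toy_embed` and `toy_pair_score` are hypothetical stand-ins for a trained bi_encoder and cross_encoder, and the corpus is made up.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for a bi_encoder: a bag of character bigrams."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_pair_score(query, doc):
    """Hypothetical stand-in for a cross_encoder: scores the (query, doc) pair jointly."""
    q_chars, d_chars = set(query), set(doc)
    union = q_chars | d_chars
    return len(q_chars & d_chars) / len(union) if union else 0.0


def retrieve_then_rerank(query, corpus, top_k=3):
    # Stage 1: embed query and documents independently, keep the top_k by cosine similarity.
    q_vec = toy_embed(query)
    candidates = sorted(corpus, key=lambda d: cosine(q_vec, toy_embed(d)), reverse=True)[:top_k]
    # Stage 2: rescore only those candidates with the slower pairwise scorer.
    return sorted(candidates, key=lambda d: toy_pair_score(query, d), reverse=True)


corpus = ["北京是中国的首都", "上海是一座港口城市", "中国的首都在哪里", "今天天气很好"]
results = retrieve_then_rerank("中国的首都", corpus, top_k=2)
```

In a real setup the first stage would typically run against precomputed document vectors (for example stored in elasticsearch), so only the query needs to be embedded at search time.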

Built With

Getting Started

Installation

  • pip
pip install -r requirements.txt
  • Install Elasticsearch and start the service
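
Before running the later steps it can help to confirm that the Elasticsearch service is reachable. The following is a small sketch using only the Python standard library; it assumes Elasticsearch listens on its default HTTP port 9200 on localhost, and `elasticsearch_is_up` is a hypothetical helper, not part of this project.

```python
import urllib.error
import urllib.request


def elasticsearch_is_up(url="http://localhost:9200", timeout=2.0):
    """Return True if an HTTP server answers at `url`, False if it cannot be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered with an error status (for example 401 when security is on).
        return True
    except OSError:
        # Connection refused, timeout, DNS failure, and similar (URLError is an OSError).
        return False


if __name__ == "__main__":
    print("Elasticsearch reachable:", elasticsearch_is_up())
```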

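Usage

1. Download the dataset from Google Drive

2. Prepare the bi_encoder data

3. Train the bi_encoder

4. Prepare the cross_encoder training data

5. Prepare the cross_encoder evaluation data

6. Train the cross_encoder

7. Demonstrate inference with the bi_encoder and cross_encoder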
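
The data preparation steps above produce differently shaped inputs for the two models. As a rough illustration under assumptions (the project's actual scripts and file formats are not shown here): a bi_encoder is commonly trained on (query, positive passage) pairs, while a cross_encoder is trained on (query, passage, label) triples that also include negatives. The helper name `build_cross_encoder_examples` and the sample data are hypothetical.

```python
import random


def build_cross_encoder_examples(pairs, num_negatives=1, seed=0):
    """Turn (query, positive) pairs into (query, passage, label) triples.

    Negatives are drawn at random from passages that belong to other queries.
    """
    rng = random.Random(seed)
    all_passages = {passage for _, passage in pairs}
    examples = []
    for query, positive in pairs:
        examples.append((query, positive, 1))
        # Passages that are positives for this same query must not become negatives.
        own_positives = {p for q, p in pairs if q == query}
        pool = sorted(all_passages - own_positives)
        for negative in rng.sample(pool, min(num_negatives, len(pool))):
            examples.append((query, negative, 0))
    return examples


# Made-up sample data in the (query, positive passage) shape.
bi_encoder_pairs = [
    ("中国的首都", "北京是中国的首都"),
    ("最长的河流", "尼罗河是世界上最长的河流"),
    ("最高的山峰", "珠穆朗玛峰是世界最高峰"),
]
cross_encoder_examples = build_cross_encoder_examples(bi_encoder_pairs, num_negatives=2)
```

Random negatives are the simplest choice; they tend to be easy for the model, which is why harder negatives are often mixed in as well.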

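Roadmap


* 1 This project uses a customized, overloaded es-pandas interface (with support for vector storage) to perform simple elasticsearch operations through pandas.
* 2 The hard-sample extraction in try_sbert_neg_sampler.py (hard samples are those the model finds difficult to classify) comes from https://guzpenha.github.io/transformer_rankers/. Hard samples can also be generated with elasticsearch; that functionality is defined in valid_cross_encoder_on_bi_encoder.py.
* 3 Training the cross_encoder as described above requires checking the degree of semantic difference between sentences beforehand; grouping samples with similar semantics helps model training.
* 4 Added some tools for comparing multi-class results from Sentence-Transformers.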
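
The hard samples mentioned in the notes above are passages that a model ranks highly even though they are not relevant. A minimal sketch of that selection step follows, assuming candidate scores are already available (for example from a bi_encoder or an elasticsearch query). `mine_hard_negatives` is a hypothetical helper and is not taken from try_sbert_neg_sampler.py or transformer_rankers; the scores are invented.

```python
def mine_hard_negatives(scored_candidates, positives, num_negatives=2):
    """Return the highest-scoring candidates that are not known positives.

    scored_candidates: list of (passage, score) tuples for one query.
    positives: set of passages labeled relevant for that query.
    """
    ranked = sorted(scored_candidates, key=lambda item: item[1], reverse=True)
    return [passage for passage, _ in ranked if passage not in positives][:num_negatives]


# Invented retrieval scores for the query "中国的首都".
scored = [
    ("北京是中国的首都", 0.92),
    ("南京曾经是中国的首都", 0.81),
    ("中国有很多城市", 0.55),
    ("今天天气很好", 0.03),
]
hard_negatives = mine_hard_negatives(scored, positives={"北京是中国的首都"})
```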

Contributing

License

Distributed under the MIT License. See LICENSE for more information.

Contact

svjack

Project Link: https://github.com/svjack/Sbert-ChineseExample

Acknowledgements
