scel2txt

Converts Sogou cell dictionaries (*.scel) into dictionaries for Rime (Squirrel, 鼠须管). Both a Python 3 and a Go implementation are provided.

Usage

Put the *.scel files downloaded from Sogou's official dictionary site into the scel folder, then run one of the following.

Python

python3 scel2txt.py

Or download the prebuilt binary scel2txt-darwin-amd64-0.0.1.gz

gunzip scel2txt-darwin-amd64-0.0.1.gz
chmod +x scel2txt-darwin-amd64-0.0.1
./scel2txt-darwin-amd64-0.0.1

Generated files

  • One dictionary file per input, with the same name and a .txt suffix
  • All *.txt files are automatically merged into luna_pinyin.sogou.dict.yaml
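
The merge step can be pictured roughly as follows. This is only a sketch, not the code shipped in this repository: it assumes each line of the *.txt files is already a Rime entry (word, pinyin and an optional weight separated by tabs), and the helper name `merge_txt_files`, the header values and the duplicate-line filtering are all illustrative choices.

```python
from pathlib import Path

# Header of a Rime dictionary file; the version and options are assumed values.
HEADER = """# Rime dictionary
---
name: luna_pinyin.sogou
version: "0.1"
sort: by_weight
use_preset_vocabulary: true
...
"""


def merge_txt_files(txt_dir, out_file="luna_pinyin.sogou.dict.yaml"):
    """Merge every *.txt file in txt_dir into one Rime dictionary file.

    Hypothetical helper: lines are assumed to be ready-made Rime entries.
    Returns the number of entries written.
    """
    seen = set()
    entries = []
    for path in sorted(Path(txt_dir).glob("*.txt")):
        for line in path.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            # Skip blank lines and entries already taken from another file.
            if line and line not in seen:
                seen.add(line)
                entries.append(line)
    body = "\n".join(entries)
    Path(out_file).write_text(HEADER + "\n" + body + "\n", encoding="utf-8")
    return len(entries)
```

Calling `merge_txt_files("scel")` would then write luna_pinyin.sogou.dict.yaml to the current directory, assuming the .txt files sit in the scel folder.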

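Sogou cell dictionary (scel file) format

A scel file is Unicode-encoded text stored in a fixed layout, in which every two bytes represent one character (a Chinese character or an English letter).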
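
As a small illustration of the two-bytes-per-character layout, the sketch below decodes a byte string with Python's `utf-16-le` codec. Little-endian byte order is an assumption here (the description only says two bytes per character), and `decode_scel_text` is a hypothetical helper name.

```python
def decode_scel_text(raw: bytes) -> str:
    """Decode a run of two-byte characters (assumed UTF-16 little-endian)."""
    if len(raw) % 2:
        raise ValueError("expected an even number of bytes")
    return raw.decode("utf-16-le")


sample = "词库ab".encode("utf-16-le")
print(len(sample))               # 8 (four characters, two bytes each)
print(decode_scel_text(sample))  # 词库ab
```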

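The file consists of two main parts:

  1. The global pinyin table, at file offset 0x1540+4, made of records of the form (py_idx, py_len, py_str)

    • py_idx: two-byte integer, the index of this pinyin
    • py_len: two-byte integer, the length of the pinyin in bytes
    • py_str: the pinyin itself, two bytes per character, py_len bytes in total
  2. The Chinese word table, at file offset 0x2628 or 0x26c4, made of records of the form (word_count, py_idx_count, py_idx_data, (word_len, word_str, ext_len, ext){word_count}), where the group (word_len, word_str, ext_len, ext) repeats word_count times, once for each word that shares the same pinyin

    • word_count: two-byte integer, the number of words with this pinyin (homophones)
    • py_idx_count: two-byte integer, the number of pinyin indexes
    • py_idx_data: the pinyin indexes, stored as a sequence of two-byte integers, one per pinyin
    • word_len: two-byte integer, the length of the Chinese word in bytes
    • word_str: the Chinese word, two bytes per Chinese character, word_len bytes in total
    • ext_len: two-byte integer, probably the length of the extension data; it appears to always be 10
    • ext: extension data, 10 bytes in total; the first two bytes are an integer (possibly the word frequency, unconfirmed) and the remaining eight bytes are all 0, so ext_len and ext together take 12 bytes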
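
The two record layouts above can be sketched with Python's `struct` module. This is a hedged illustration rather than this repository's implementation: the function names are hypothetical, integers are assumed to be unsigned little-endian, and text is assumed to be UTF-16LE. One more assumption deserves attention: the list above describes py_idx_count as the number of pinyin indexes, whereas the sketch defaults to treating it as the byte length of py_idx_data (index count = py_idx_count / 2), as parsers of this format often do. Pass `py_count_is_bytes=False` for the literal reading.

```python
import struct


def u16(buf, pos):
    """Read one unsigned 16-bit little-endian integer at pos."""
    return struct.unpack_from("<H", buf, pos)[0]


def parse_pinyin_table(buf):
    """Parse consecutive (py_idx, py_len, py_str) records into {py_idx: py_str}."""
    table, pos = {}, 0
    while pos + 4 <= len(buf):
        py_idx, py_len = u16(buf, pos), u16(buf, pos + 2)
        pos += 4
        table[py_idx] = buf[pos:pos + py_len].decode("utf-16-le")
        pos += py_len
    return table


def parse_word_table(buf, table, py_count_is_bytes=True):
    """Parse word records into a list of (word, pinyin, ext_value) tuples."""
    entries, pos = [], 0
    while pos + 4 <= len(buf):
        word_count, py_idx_count = u16(buf, pos), u16(buf, pos + 2)
        pos += 4
        # Assumption: py_idx_count is a byte length unless told otherwise.
        n_idx = py_idx_count // 2 if py_count_is_bytes else py_idx_count
        indexes = struct.unpack_from(f"<{n_idx}H", buf, pos)
        pos += 2 * n_idx
        pinyin = " ".join(table[i] for i in indexes)
        for _ in range(word_count):
            word_len = u16(buf, pos)
            pos += 2
            word = buf[pos:pos + word_len].decode("utf-16-le")
            pos += word_len
            ext_len = u16(buf, pos)
            pos += 2
            ext_value = u16(buf, pos)  # first two bytes of ext (maybe a frequency)
            pos += ext_len
            entries.append((word, pinyin, ext_value))
    return entries


def pack_text(text):
    """Test helper: a two-byte byte length followed by UTF-16LE text."""
    raw = text.encode("utf-16-le")
    return struct.pack("<H", len(raw)) + raw


# Synthetic data only; these bytes are not taken from a real .scel file.
py_bytes = (struct.pack("<H", 0) + pack_text("ni")
            + struct.pack("<H", 1) + pack_text("hao"))
word_bytes = (struct.pack("<HHHH", 1, 4, 0, 1)   # 1 word, 4 index bytes, indexes 0 and 1
              + pack_text("你好")
              + struct.pack("<HH", 10, 7) + bytes(8))  # ext_len=10, ext=7 plus 8 zero bytes

pinyin_table = parse_pinyin_table(py_bytes)
print(parse_word_table(word_bytes, pinyin_table))  # [('你好', 'ni hao', 7)]
```

For a real file, the pinyin table would be read from the slice starting at 0x1540+4 and the word table from 0x2628 or 0x26c4. Choosing between the two offsets, and stopping at the true end of each table, is not covered by this sketch.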

Dictionaries tested so far

References

  1. scel2mmseg
  2. scel-to-txt

Contributors

asc8384, lewangdev
