zinc-analysis-gse

This analyzer is already built into Zinc:

https://github.com/prabhatsharma/zinc/tree/main/pkg/bluge/analysis/lang/chs

It is a Zinc plugin that adds Chinese analyzer support.

Analyzer: gse_standard, gse_search

Tokenizer: gse_standard, gse_search

TokenFilter: gse_stop

The build embeds the dictionaries zh/s_1.txt and zh/stop_tokens.txt.

You can find them at: https://github.com/go-ego/gse/tree/master/data/dict

You can also customize the dictionary; see the "custom user dictionary" section.

After customizing, you need to restart Zinc.

gse

https://github.com/go-ego/gse

gse is an efficient Go library for multilingual NLP and text segmentation; it supports English, Chinese, Japanese, and other languages.

Environment

You need to set environment variables to enable gse support:

ZINC_PLUGIN_GSE_ENABLE true or false, default is false

ZINC_PLUGIN_GSE_DICT_EMBED small or big, default is small; selects which size of embedded dictionary is loaded when gse is enabled

ZINC_PLUGIN_GSE_DICT_PATH custom dictionary path, default is ./plugins/gse/dict
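
As a minimal sketch of how these three variables fit together, the Python helper below builds an environment for a Zinc process. The helper name `gse_env` and the idea of launching Zinc from Python are assumptions for illustration; only the variable names, allowed values, and defaults come from the list above.

```python
import os


def gse_env(enable=False, dict_embed="small", dict_path="./plugins/gse/dict", base=None):
    """Return a copy of the environment with the gse plugin variables set.

    gse_env is a hypothetical helper; only the variable names, allowed
    values, and defaults are taken from this README.
    """
    if dict_embed not in ("small", "big"):
        raise ValueError("ZINC_PLUGIN_GSE_DICT_EMBED must be 'small' or 'big'")
    # Start from the given mapping, or from the current process environment.
    env = dict(os.environ if base is None else base)
    env["ZINC_PLUGIN_GSE_ENABLE"] = "true" if enable else "false"
    env["ZINC_PLUGIN_GSE_DICT_EMBED"] = dict_embed
    env["ZINC_PLUGIN_GSE_DICT_PATH"] = dict_path
    return env
```

The returned mapping could be passed as `env=` to `subprocess.run` when starting the Zinc binary, whose path depends on your installation.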

API example

POST http://localhost:4080/es/_analyze

{
  "analyzer": "gse_standard",
  "text": "《复仇者联盟3:无限战争》是全片使用IMAX摄影机拍摄制作的的科幻片."
}

POST http://localhost:4080/es/_analyze

{
  "analyzer": "gse_search",
  "text": "《复仇者联盟3:无限战争》是全片使用IMAX摄影机拍摄制作的的科幻片."
}
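
The two analyze calls above can be sketched from Python's standard library as follows. The function name `analyze_request` is hypothetical, and the request is only built, not sent; whether your Zinc deployment also needs authentication headers is not covered here and is left as an assumption to check.

```python
import json
import urllib.request


def analyze_request(analyzer, text, base_url="http://localhost:4080"):
    """Build (but do not send) a POST request for the /es/_analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False)
    return urllib.request.Request(
        url=f"{base_url}/es/_analyze",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = analyze_request("gse_search", "复仇者联盟")
# To send it against a running Zinc instance:
#     with urllib.request.urlopen(req) as resp:
#         print(json.load(resp))
```

Swapping `"gse_search"` for `"gse_standard"` gives the other request, which makes it easy to compare the tokens each analyzer returns for the same text.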

PUT http://localhost:4080/api/index

{
	"name": "my-index-chs",
	"mappings": {
		"properties": {
			"title": {
				"type": "text",
				"index": true,
				"highlightable": true,
				"analyzer": "gse_search",
				"search_analyzer": "gse_standard"
			},
			"author": {
				"type": "keyword",
				"index": true,
				"store": false
			},
			"create_time": {
				"type": "time"
			}
		}
	}
}

POST http://localhost:4080/api/my-index-chs/document

{
	"title": "《复仇者联盟3:无限战争》是全片使用IMAX摄影机拍摄制作的科幻片",
	"author": "灭霸",
	"create_time": "2022-03-05T18:18:18+08:00"
}

POST http://localhost:4080/es/my-index-chs/_search

{
	"query": {
		"match": {
			"title": "复仇者联盟"
		}
	}
}
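
The create-index, add-document, and search steps above have to run in that order. A rough sketch of the sequence in Python, using only the standard library, might look like the following. `BASE_URL`, `STEPS`, and `to_request` are names chosen for this illustration, the requests are built but not sent, and any authentication your Zinc instance requires is not shown.

```python
import json
import urllib.request

BASE_URL = "http://localhost:4080"
INDEX = "my-index-chs"

# (method, path, JSON payload) for each step, in the order they must run.
STEPS = [
    ("PUT", "/api/index", {
        "name": INDEX,
        "mappings": {"properties": {
            "title": {"type": "text", "index": True, "highlightable": True,
                      "analyzer": "gse_search", "search_analyzer": "gse_standard"},
            "author": {"type": "keyword", "index": True, "store": False},
            "create_time": {"type": "time"},
        }},
    }),
    ("POST", f"/api/{INDEX}/document", {
        "title": "《复仇者联盟3:无限战争》是全片使用IMAX摄影机拍摄制作的科幻片",
        "author": "灭霸",
        "create_time": "2022-03-05T18:18:18+08:00",
    }),
    ("POST", f"/es/{INDEX}/_search", {
        "query": {"match": {"title": "复仇者联盟"}},
    }),
]


def to_request(method, path, payload):
    """Turn one step into an unsent urllib request with a UTF-8 JSON body."""
    data = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(BASE_URL + path, data=data, method=method,
                                  headers={"Content-Type": "application/json"})


requests = [to_request(*step) for step in STEPS]
# Send each one in order with urllib.request.urlopen(r) against a running Zinc.
```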

custom user dictionary

Append your words to the file ${ZINC_PLUGIN_GSE_DICT_PATH}/user.txt, one entry per line.

Format:

word    frequency   part-of-speech

For example:

复仇者联盟 100 n
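
If you maintain the dictionary from a script, a small helper along these lines could append entries in the format shown above. `add_user_word` is a hypothetical name, and rejecting words that contain whitespace is an assumption based on the fields being separated by spaces. The demo writes to a temporary directory so no real dictionary is touched.

```python
import os
import tempfile


def add_user_word(dict_dir, word, frequency, pos):
    """Append one 'word frequency pos' line to <dict_dir>/user.txt."""
    if not word or any(ch.isspace() for ch in word):
        raise ValueError("word must be non-empty and contain no whitespace")
    os.makedirs(dict_dir, exist_ok=True)
    path = os.path.join(dict_dir, "user.txt")
    # If the file exists and lacks a trailing newline, add one first so
    # the new entry does not get glued onto the last existing line.
    prefix = ""
    if os.path.exists(path):
        with open(path, "rb") as f:
            content = f.read()
        if content and not content.endswith(b"\n"):
            prefix = "\n"
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"{prefix}{word} {int(frequency)} {pos}\n")
    return path


# Demo in a throwaway directory; in practice dict_dir would be the value
# of ZINC_PLUGIN_GSE_DICT_PATH.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = add_user_word(demo_dir, "复仇者联盟", 100, "n")
    with open(demo_path, encoding="utf-8") as f:
        lines = f.read().splitlines()
```

Zinc still needs a restart afterwards for the new entries to be loaded.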

custom stop tokens

Append your words to the file ${ZINC_PLUGIN_GSE_DICT_PATH}/stop.txt, one word per line.

Format:

word

For example:

哈哈
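
When stop tokens are added repeatedly, the file can collect duplicates. One possible way to avoid that is sketched below: it keeps the existing entries in order and appends only tokens that are not already present. `add_stop_tokens` is a hypothetical helper, and it rewrites the file with one token per line, which is an assumption about what is acceptable for your setup. The demo uses a temporary directory.

```python
import os
import tempfile


def add_stop_tokens(dict_dir, tokens):
    """Add tokens missing from <dict_dir>/stop.txt; return the ones added."""
    path = os.path.join(dict_dir, "stop.txt")
    existing = []
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            existing = [line.strip() for line in f if line.strip()]
    seen = set(existing)
    added = []
    for token in tokens:
        token = token.strip()
        if token and token not in seen:
            seen.add(token)
            added.append(token)
    os.makedirs(dict_dir, exist_ok=True)
    # Rewrite the file: existing entries first, new ones after them.
    with open(path, "w", encoding="utf-8") as f:
        f.write("".join(t + "\n" for t in existing + added))
    return added


with tempfile.TemporaryDirectory() as demo_dir:
    first = add_stop_tokens(demo_dir, ["哈哈", "呵呵"])
    second = add_stop_tokens(demo_dir, ["哈哈", "嗯"])
    with open(os.path.join(demo_dir, "stop.txt"), encoding="utf-8") as f:
        final_tokens = f.read().splitlines()
```

As with the user dictionary, restart Zinc after changing the file.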

Credit

Contributors

hengfeiyang
