juntaosun / LangSegment

A multi-lingual (97 languages) automatic text recognition and segmentation tool; a powerful tokenizer for mixed-language TTS text content.
File "c:\Users\14404\Project\GPT-SoVITS-beta0306fix2\runtime\lib\site-packages\LangSegment\LangSegment.py", line 676, in getTexts
return LangSegment.getTexts(text)
File "c:\Users\14404\Project\GPT-SoVITS-beta0306fix2\runtime\lib\site-packages\LangSegment\LangSegment.py", line 572, in getTexts
text = LangSegment._parse_symbols(text)
File "c:\Users\14404\Project\GPT-SoVITS-beta0306fix2\runtime\lib\site-packages\LangSegment\LangSegment.py", line 508, in _parse_symbols
cur_word = LangSegment._process_tags([] , text , True)
File "c:\Users\14404\Project\GPT-SoVITS-beta0306fix2\runtime\lib\site-packages\LangSegment\LangSegment.py", line 461, in _process_tags
LangSegment._parse_language(words , text)
File "c:\Users\14404\Project\GPT-SoVITS-beta0306fix2\runtime\lib\site-packages\LangSegment\LangSegment.py", line 330, in _parse_language
LangSegment._addwords(words,language,text,score)
File "c:\Users\14404\Project\GPT-SoVITS-beta0306fix2\runtime\lib\site-packages\LangSegment\LangSegment.py", line 250, in _addwords
else:LangSegment._saveData(words,language,text,score)
File "c:\Users\14404\Project\GPT-SoVITS-beta0306fix2\runtime\lib\site-packages\LangSegment\LangSegment.py", line 228, in _saveData
LangSegment._statistics(data["lang"],data["text"])
File "c:\Users\14404\Project\GPT-SoVITS-beta0306fix2\runtime\lib\site-packages\LangSegment\LangSegment.py", line 191, in _statistics
if not "|" in language:lang_count[language] += int(len(text)*2) if language == "zh" else len(text)
TypeError: list indices must be integers or slices, not str
Traceback (most recent call last):
File "e:\AItools\GPT-SoVITS-Inference\runtime\lib\site-packages\werkzeug\serving.py", line 362, in run_wsgi
execute(self.server.app)
File "e:\AItools\GPT-SoVITS-Inference\runtime\lib\site-packages\werkzeug\serving.py", line 325, in execute
for data in application_iter:
File "e:\AItools\GPT-SoVITS-Inference\runtime\lib\site-packages\werkzeug\wsgi.py", line 256, in __next__
return self._next()
File "e:\AItools\GPT-SoVITS-Inference\runtime\lib\site-packages\werkzeug\wrappers\response.py", line 32, in _iter_encoded
for item in iterable:
File "e:\AItools\GPT-SoVITS-Inference\runtime\lib\site-packages\flask\helpers.py", line 113, in generator
yield from gen
File "E:\AItools\GPT-SoVITS-Inference\Inference\src\inference_core.py", line 140, in get_streaming_tts_wav
for sr, chunk in chunks:
File "E:\AItools\GPT-SoVITS-Inference\Inference\src\inference_core.py", line 112, in inference
yield next(tts_pipline.run(inputs))
File "E:\AItools\GPT-SoVITS-Inference\GPT_SoVITS\TTS_infer_pack\TTS.py", line 531, in run
data = self.text_preprocessor.preprocess(text, text_lang, text_split_method)
File "E:\AItools\GPT-SoVITS-Inference\GPT_SoVITS\TTS_infer_pack\TextPreprocessor.py", line 53, in preprocess
phones, bert_features, norm_text = self.segment_and_extract_feature_for_text(text, lang)
File "E:\AItools\GPT-SoVITS-Inference\GPT_SoVITS\TTS_infer_pack\TextPreprocessor.py", line 87, in segment_and_extract_feature_for_text
textlist, langlist = self.seg_text(texts, language)
File "E:\AItools\GPT-SoVITS-Inference\GPT_SoVITS\TTS_infer_pack\TextPreprocessor.py", line 99, in seg_text
for tmp in LangSegment.getTexts(text):
File "e:\AItools\GPT-SoVITS-Inference\runtime\lib\site-packages\LangSegment\LangSegment.py", line 676, in getTexts
return LangSegment.getTexts(text)
File "e:\AItools\GPT-SoVITS-Inference\runtime\lib\site-packages\LangSegment\LangSegment.py", line 572, in getTexts
text = LangSegment._parse_symbols(text)
File "e:\AItools\GPT-SoVITS-Inference\runtime\lib\site-packages\LangSegment\LangSegment.py", line 508, in _parse_symbols
cur_word = LangSegment._process_tags([] , text , True)
File "e:\AItools\GPT-SoVITS-Inference\runtime\lib\site-packages\LangSegment\LangSegment.py", line 461, in _process_tags
LangSegment._parse_language(words , text)
File "e:\AItools\GPT-SoVITS-Inference\runtime\lib\site-packages\LangSegment\LangSegment.py", line 330, in _parse_language
LangSegment._addwords(words,language,text,score)
File "e:\AItools\GPT-SoVITS-Inference\runtime\lib\site-packages\LangSegment\LangSegment.py", line 250, in _addwords
else:LangSegment._saveData(words,language,text,score)
File "e:\AItools\GPT-SoVITS-Inference\runtime\lib\site-packages\LangSegment\LangSegment.py", line 212, in _saveData
LangSegment._statistics(preData["lang"],text)
File "e:\AItools\GPT-SoVITS-Inference\runtime\lib\site-packages\LangSegment\LangSegment.py", line 191, in _statistics
if not "|" in language:lang_count[language] += int(len(text)*2) if language == "zh" else len(text)
TypeError: list indices must be integers or slices, not str
LangSegment.py, around line 186 — it sometimes raises the error above.
After adding some robustness checks, the error no longer occurs.
Fixed code:
from collections import defaultdict

@staticmethod
def _statistics(language, text):
    # Guard against _lang_count being corrupted (e.g. a list), which caused
    # the "list indices must be integers or slices, not str" error above.
    if LangSegment._lang_count is None or not isinstance(LangSegment._lang_count, defaultdict):
        LangSegment._lang_count = defaultdict(int)
    lang_count = LangSegment._lang_count
    if "|" not in language:
        # Chinese characters are weighted double in the count
        lang_count[language] += int(len(text) * 2) if language == "zh" else len(text)
    LangSegment._lang_count = lang_count
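The guard can be illustrated in isolation. Below is a standalone sketch of the same counting logic (a hypothetical helper, not part of LangSegment's API), showing how the `isinstance` check recovers from a corrupted counter that would otherwise trigger the TypeError in the tracebacks:

```python
from collections import defaultdict

def statistics(lang_count, language, text):
    """Standalone sketch of the guarded counter (hypothetical helper)."""
    # If the counter was corrupted into a list, indexing it with a string
    # raises: TypeError: list indices must be integers or slices, not str
    if not isinstance(lang_count, defaultdict):
        lang_count = defaultdict(int)
    if "|" not in language:
        # Chinese text is weighted double, matching the library's heuristic
        lang_count[language] += len(text) * 2 if language == "zh" else len(text)
    return lang_count

counts = statistics([], "zh", "你好")      # corrupted state: a list, not a dict
counts = statistics(counts, "en", "hello")
print(dict(counts))  # {'zh': 4, 'en': 5}
```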
import LangSegment

textlist = ["【齐鲁艺票通】恭喜您购票成功!", "订单编号:", "669775550013131,", "取票码:", "93342253;", "请凭取票码在演出开始前30分钟到指定地点换取纸质票。"]
for text in textlist:
    print(LangSegment.getTexts(text))
# [{'lang': 'zh', 'text': '【齐鲁艺票通】恭喜您购票成功!'}]
# [{'lang': 'zh', 'text': '订单编号:'}]
# [{'lang': 'en', 'text': '669775550013131, '}]
# [{'lang': 'zh', 'text': '取票码:'}]
# [{'lang': 'en', 'text': '93342253; '}]
# [{'lang': 'zh', 'text': '请凭取票码在演出开始前30分钟到指定地点换取纸质票。'}]
When the whole sentence is passed in, detection is correct; but when the user chooses to split by punctuation, pure-digit segments like the order number above are classified as English. Could they instead inherit the language of the surrounding text, following the whole-sentence logic?
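Until the library handles this, one possible post-processing workaround is to reassign digit-only segments to a neighboring segment's language. The sketch below is an assumption about a reasonable fix, not LangSegment's actual behavior; `merge_digit_segments` is a hypothetical helper operating on the `{'lang': ..., 'text': ...}` dicts that `getTexts` returns:

```python
def merge_digit_segments(segments, default_lang="zh"):
    """Hypothetical post-processor: reassign digit-only segments
    (after stripping trailing punctuation) to a neighbor's language."""
    out = []
    for i, seg in enumerate(segments):
        if seg["text"].strip().strip(",;:。,;:").isdigit():
            if out:
                lang = out[-1]["lang"]          # prefer the previous segment
            elif i + 1 < len(segments):
                lang = segments[i + 1]["lang"]  # else look ahead
            else:
                lang = default_lang
            seg = {**seg, "lang": lang}
        out.append(seg)
    return out

segments = [{'lang': 'zh', 'text': '订单编号:'},
            {'lang': 'en', 'text': '669775550013131, '}]
print(merge_digit_segments(segments))
# [{'lang': 'zh', 'text': '订单编号:'}, {'lang': 'zh', 'text': '669775550013131, '}]
```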
The input 「熘南贝炒南贝」 (a Chinese dish name) is recognized as Japanese.
Using the full ALL model for detection:
日本 chatgpt 萬 chatgpt
{'lang': 'ja', 'text': '日本', 'score': 0.72695434}
{'lang': 'en', 'text': 'chatgpt ', 'score': 0.75385803}
{'lang': 'ja', 'text': '萬 ', 'score': 0.84992695}
{'lang': 'en', 'text': 'chatgpt ', 'score': 0.75385803}
Japanese words such as 日本 and 日本語, and numerals (一 through 十, 百, 千, 萬, etc.):
日本 or 日本語 typed alone is misclassified;
a single numeral sandwiched between English words is also misclassified (switching to the Chinese/English/Japanese model avoids the error).
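Part of why these cases are hard: Han ideographs are shared between Chinese and Japanese, so a token like 日本 or 萬 carries no script-level signal on its own; only kana (or surrounding context) makes Japanese unambiguous. A minimal illustrative check (my own sketch, not how LangSegment decides):

```python
def has_kana(text):
    """Illustrative script check: Hiragana (U+3040-U+309F) and Katakana
    (U+30A0-U+30FF) occur only in Japanese, while Han ideographs like
    日本 or 萬 are shared with Chinese and are therefore ambiguous."""
    return any('\u3040' <= ch <= '\u30ff' for ch in text)

print(has_kana("日本"))        # False: Han only, ambiguous between zh and ja
print(has_kana("日本語です"))  # True: です is hiragana, so clearly Japanese
```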
import LangSegment

text = "English中文"
LangSegment.setLangfilters(["en"])
print(LangSegment.getTexts(text))
LangSegment.setLangfilters(["zh"])
print(LangSegment.getTexts(text))

Output:
[{'lang': 'en', 'text': 'English '}]
[{'lang': 'en', 'text': 'English '}]

Setting the filter to ["zh"] should return the 中文 segment, but the first filter's result is returned again.
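A possible workaround while the stale-filter state persists: call `getTexts` without a language filter and filter the returned segments yourself. `filter_langs` below is a hypothetical helper, not part of LangSegment's API:

```python
def filter_langs(segments, allowed):
    """Hypothetical workaround: filter getTexts() results manually
    instead of relying on setLangfilters state."""
    allowed = set(allowed)
    return [seg for seg in segments if seg["lang"] in allowed]

segments = [{'lang': 'en', 'text': 'English '}, {'lang': 'zh', 'text': '中文'}]
print(filter_langs(segments, ["zh"]))  # [{'lang': 'zh', 'text': '中文'}]
```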
import LangSegment
LangSegment.getTexts("《冰雪女王5:融冰之战》,选择挺多的。")
Hello juntaosun, I'm a product manager; my newsletter is here: https://produck.zhubai.love/
I'd like to get in touch to discuss the implementation of an audio/video project; any advice would be appreciated.
My contact info: de_base64("d2VjaGF0OiBtYWRsaWZlcjEzMzcgLyBtYWlsOiBtYWRsaWZlckBsaXZlLmNvbQ==")
Input:
那我们拆分来看一下:Part A(传统阅读理解)
"Part A" is recognized as Japanese.
langid.py classifies "hello china" as it (Italian), while LangSegment correctly classifies it as en. How did you solve this?
Input:
每题0.5分,共10分,第二部分
Output:
每题0.5第二部分
The characters "分,共10分" are lost. If I replace 分 with another character, such as 元, no characters are dropped.
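Dropped characters like these can be detected automatically by comparing the concatenated segments against the input. The sketch below is a debugging helper I am assuming for illustration, not part of LangSegment:

```python
from collections import Counter

def find_dropped(text, segments):
    """Return the characters of `text` that are missing from the
    concatenated segment texts. Whitespace is ignored, since
    segmenters may legitimately adjust spacing."""
    merged = "".join(seg["text"] for seg in segments)
    diff = Counter(text.replace(" ", "")) - Counter(merged.replace(" ", ""))
    return "".join(sorted(diff.elements()))

segments = [{'lang': 'zh', 'text': '每题0.5'}, {'lang': 'zh', 'text': '第二部分'}]
dropped = find_dropped("每题0.5分,共10分,第二部分", segments)
print(dropped)  # contains 分 twice plus the other lost characters
```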