Comments (4)
Here is my code:
from underthesea import word_tokenize

def underthesea_tokenize(txt):
    # format="text" returns the tokenized text as a single string instead of a list
    return word_tokenize(txt, format="text")
...
# Call underthesea_tokenize for each line of text read from the file
line = underthesea_tokenize(line)
...
Do you have any solution for this issue?
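Until a fixed release is installed, one way to keep a batch job running is to catch the exception per line and record which lines failed. The sketch below is only an illustration: `tokenize_lines` is a hypothetical helper, not part of underthesea, and it takes the tokenizer as a parameter so it can wrap `underthesea_tokenize` or any other callable.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple


def tokenize_lines(
    lines: Iterable[str],
    tokenize: Callable[[str], object],
) -> Iterator[Tuple[int, object, Optional[Exception]]]:
    """Yield (line_number, result, error) for each input line.

    When the tokenizer raises, the original line is passed through
    unchanged and the exception is returned so the caller can log it.
    """
    for lineno, line in enumerate(lines, start=1):
        try:
            yield lineno, tokenize(line), None
        except Exception as exc:  # broad on purpose: the failure type is unknown
            yield lineno, line, exc


# Example wiring (assumes underthesea is installed):
#   for lineno, result, err in tokenize_lines(open(path, encoding="utf-8"),
#                                             underthesea_tokenize):
#       if err is not None:
#           print(f"line {lineno} could not be tokenized: {err!r}")
```

This only works around the crash; lines that fail are left untokenized, so the real fix is still needed.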
This error was caused by an incorrect URL pattern in the regex_tokenize
module. It has already been fixed.
You can retest by updating to the latest version of underthesea:
$ pip uninstall underthesea
$ pip install underthesea==1.1.8a0
$ python
>>> from underthesea import word_tokenize
>>> text = "https://www.facebook.com/photo.php?fbid=1627680357512432&set=a.1406713109609159.1073741826.100008114498358&type=1 mình muốn chia sẻ bài viết của một bác nói về thực trạng của bộ giáo dục bây giờ! mọi người vào đọc và chia sẻ để Phạm Vũ Luận BIẾT!"
>>> word_tokenize(text)
['https://www.facebook.com/photo.php?fbid=1627680357512432&set=a.1406713109609159.1073741826.100008114498358&type=1', 'mình', 'muốn', 'chia sẻ', 'bài', 'viết', 'của', 'một', 'bác', 'nói', 'về', 'thực trạng', 'của', 'bộ', 'giáo dục', 'bây giờ', '!', 'mọi', 'người', 'vào', 'đọc', 'và', 'chia sẻ', 'để', 'Phạm', 'Vũ', 'Luận', 'BIẾT', '!']
Closing this issue. Feel free to reopen it if you find the error is not fixed.
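To illustrate the expected behavior (the whole URL, including the dots in its query string, stays one token), here is a minimal sketch. It is not the pattern used inside underthesea: `URL_PATTERN` and `split_keep_urls` are hypothetical names, the pattern is deliberately simple, and the remaining text is split on whitespace only rather than with real Vietnamese word segmentation.

```python
import re

# Hypothetical, deliberately permissive pattern: a scheme followed by any
# run of non-whitespace characters, so dots, '?', '&' and '=' are all kept.
URL_PATTERN = re.compile(r"https?://\S+")


def split_keep_urls(text):
    """Split text on whitespace while keeping each URL as a single token."""
    tokens = []
    pos = 0
    for match in URL_PATTERN.finditer(text):
        # Plain text before the URL is split naively on whitespace.
        tokens.extend(text[pos:match.start()].split())
        tokens.append(match.group())
        pos = match.end()
    tokens.extend(text[pos:].split())
    return tokens
```

A pattern this loose also swallows punctuation that directly follows a URL, so a real tokenizer needs something stricter.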
@Hieunv1996 what is your code?
I think this error is caused by some special characters in the text.
@Hieunv1996
Okay. I can reproduce this error with the following code:
from underthesea import word_tokenize
text = "https://www.facebook.com/photo.php?fbid=1627680357512432&set=a.1406713109609159.1073741826.100008114498358&type=1 mình muốn chia sẻ bài viết của một bác nói về thực trạng của bộ giáo dục bây giờ! mọi người vào đọc và chia sẻ để Phạm Vũ Luận BIẾT!"
word_tokenize(text)
There are dot characters in the URL, which is odd. I will debug and fix it soon.
Thanks for reporting this.