
Comments (4)

YontiLevin commented on June 1, 2024

Please give me more information, as I can't reproduce this on Python 3.8.12.

from hebrew-tokenizer.

rubmz commented on June 1, 2024

It seems this is not a hebrew_tokenizer bug after all, but an odd and not very stable implementation of beautifulsoup4's find_all(), which emits a string (a not very stable pointer?) whose value even changes between debug runs (!)
So, no bug here, unless you feel the urge to support other people's unstable code :-)

rubmz commented on June 1, 2024

How to reproduce this issue, just so it's clear why this bug was opened in the first place:

from bs4 import BeautifulSoup
from hebrew_tokenizer import tokenize

sent = 'לצדה פועלות חברות כמו  <a href="%D0%9C%D0%A2%D0%A1">МТС</a> <a href="Beeline">Beeline</a>'

# GOOD - this goes pretty well:
one_word = list(tokenize('לצדה פועלות חברות כמו  '))
print(one_word)
one_word = list(tokenize(sent))
print(one_word)
one_word = list(tokenize('MTC'))
print(one_word)

# BAD - this does NOT go well at all:
soup = BeautifulSoup(sent, features="html.parser")
tags = soup.find_all('a')
for t in tags:
    # Tokenize the text of each <a> tag and print that result
    al_arr = list(tokenize(t.text))
    print(al_arr)


rubmz commented on June 1, 2024

I am opening a new issue instead of that one...

After I opened a bug in the BeautifulSoup4 repository, the good people maintaining it explained to me what went wrong. It seems that the 'MTC' letters above AREN'T ENGLISH. They are Cyrillic letters, which make hebrew-tokenizer return weird results (internally, the regex simply does not know about any Cyrillic).
I will put my suggested resolution for that in the new bug description.
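
For anyone hitting the same confusion, here is a minimal sketch of how look-alike letters can be told apart using only Python's standard unicodedata module. The helper names describe_chars and scripts_used are made up for this illustration and are not part of hebrew_tokenizer or BeautifulSoup; the script guess is a rough heuristic that takes the first word of each letter's Unicode name.

```python
import unicodedata


def describe_chars(text):
    """Return (char, code point, Unicode name) for each character in text."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
    ]


def scripts_used(text):
    """Rough script guess: the first word of each letter's Unicode name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    }


# The link text from the example above, written with escapes so the
# script is unambiguous: Cyrillic EM, TE, ES (renders like "MTC").
cyrillic = "\u041c\u0422\u0421"
# Plain ASCII letters that look the same on screen.
latin = "MTC"

for label, value in (("cyrillic", cyrillic), ("latin", latin)):
    print(label, scripts_used(value))
    for ch, code_point, name in describe_chars(value):
        print("   ", ch, code_point, name)
```

Running a check like this on t.text before tokenizing shows which script each character belongs to, which is usually quicker than comparing the strings by eye.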

