Homoglyphs

Homoglyphs lives! This is a maintained fork of the original homoglyphs project by orsinium, a Python library for working with homoglyphs.

Homoglyphs is a Python library for getting homoglyphs and converting them to ASCII.

Features

It's a smarter version of confusable_homoglyphs:

  • Autodetect or manually choose a category (aliases from ISO 15924).
  • Automatically or manually load only the needed alphabets into memory.
  • Converting to ASCII.
  • More configurable.
  • More stable.

Installation

pip install homoglyphs_fork

Usage

The best way to explain something is to show how it works, so let's look at some real usage.

Importing:

import homoglyphs_fork as hg

Languages

# detect
hg.Languages.detect('w')
# {'pl', 'da', 'nl', 'fi', 'cz', 'sr', 'pt', 'it', 'en', 'es', 'sk', 'de', 'fr', 'ro'}
hg.Languages.detect('т')
# {'mk', 'ru', 'be', 'bg', 'sr'}
hg.Languages.detect('.')
# set()

# get alphabet for languages
hg.Languages.get_alphabet(['ru'])
# {'в', 'Ё', 'К', 'Т', ..., 'Р', 'З', 'Э'}

# get all languages
hg.Languages.get_all()
# {'nl', 'lt', ..., 'de', 'mk'}
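
Conceptually, the detection shown above behaves like a membership lookup: a character belongs to every language whose alphabet contains it. The following is only a toy sketch of that idea, not the library's implementation. TOY_ALPHABETS and detect_languages are hypothetical names, and the three alphabets are small lowercase-only tables written for this example.

```python
# Hypothetical lowercase-only alphabets; the real library ships its own data.
TOY_ALPHABETS = {
    "en": set("abcdefghijklmnopqrstuvwxyz"),
    "ru": set("абвгдеёжзийклмнопрстуфхцчшщъыьэюя"),
    "bg": set("абвгдежзийклмнопрстуфхцчшщъьюя"),
}


def detect_languages(char):
    """Return the set of toy languages whose alphabet contains `char`."""
    return {lang for lang, alphabet in TOY_ALPHABETS.items() if char in alphabet}


print(detect_languages("w"))  # only the English toy alphabet contains it
print(detect_languages("т"))  # shared by both Cyrillic toy alphabets
print(detect_languages("."))  # punctuation belongs to no alphabet
```

A character shared by several alphabets yields several languages, which is why the real output for 'т' above lists more than one language code.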

Categories

Categories are aliases from ISO 15924.

# detect
hg.Categories.detect('w')
# 'LATIN'
hg.Categories.detect('т')
# 'CYRILLIC'
hg.Categories.detect('.')
# 'COMMON'

# get alphabet for categories
hg.Categories.get_alphabet(['CYRILLIC'])
# {'ӗ', 'Ԍ', 'Ґ', 'Я', ..., 'Э', 'ԕ', 'ӻ'}

# get all categories
hg.Categories.get_all()
# {'RUNIC', 'DESERET', ..., 'SOGDIAN', 'TAI_LE'}

Homoglyphs

Get homoglyphs:

# get homoglyphs (latin alphabet initialized by default)
hg.Homoglyphs().get_combinations('q')
# ['q', '𝐪', '𝑞', '𝒒', '𝓆', '𝓺', '𝔮', '𝕢', '𝖖', '𝗊', '𝗾', '𝘲', '𝙦', '𝚚']

Alphabet loading:

# load alphabet on init by categories
homoglyphs = hg.Homoglyphs(categories=('LATIN', 'COMMON', 'CYRILLIC'))  # alphabet loaded here
homoglyphs.get_combinations('гы')
# ['rы', 'гы', 'ꭇы', 'ꭈы', '𝐫ы', '𝑟ы', '𝒓ы', '𝓇ы', '𝓻ы', '𝔯ы', '𝕣ы', '𝖗ы', '𝗋ы', '𝗿ы', '𝘳ы', '𝙧ы', '𝚛ы']

# load alphabet on init by languages
homoglyphs = hg.Homoglyphs(languages={'ru', 'en'})  # alphabet will be loaded here
homoglyphs.get_combinations('гы')
# ['rы', 'гы']

# manually set the alphabet on init (English "abc", then Russian "абс")
homoglyphs = hg.Homoglyphs(alphabet='abc абс')
homoglyphs.get_combinations('с')
# ['c', 'с']

# load alphabet on demand
homoglyphs = hg.Homoglyphs(languages={'en'}, strategy=hg.STRATEGY_LOAD)
# ^ alphabet will be loaded here for "en" language
homoglyphs.get_combinations('гы')
# ^ alphabet will be loaded here for "ru" language
# ['rы', 'гы']

You can combine categories, languages, alphabet and any strategies as you want. The strategies specify how to handle any characters not already loaded:

  • STRATEGY_LOAD: load category for this character
  • STRATEGY_IGNORE: add character to result
  • STRATEGY_REMOVE: remove character from result
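
To make the difference between the three strategies concrete, here is a toy model written with the standard library only. It is a sketch under stated assumptions, not the library's code: LOAD, IGNORE, REMOVE, TOY_CATEGORIES and apply_strategy are hypothetical names, and the category table holds just a few characters.

```python
# Hypothetical stand-ins for hg.STRATEGY_LOAD, hg.STRATEGY_IGNORE and
# hg.STRATEGY_REMOVE; the library's real constant values are not assumed here.
LOAD, IGNORE, REMOVE = "load", "ignore", "remove"

# Toy category table; the real library ships full alphabets per category.
TOY_CATEGORIES = {
    "LATIN": set("abc"),
    "CYRILLIC": set("абс"),
}


def apply_strategy(text, alphabet, strategy):
    """Return (kept_text, alphabet) after handling chars outside `alphabet`."""
    alphabet = set(alphabet)  # work on a copy so the caller's set is untouched
    kept = []
    for char in text:
        if char not in alphabet:
            if strategy == LOAD:
                # Pull in every toy category that contains this character.
                for chars in TOY_CATEGORIES.values():
                    if char in chars:
                        alphabet |= chars
            elif strategy == REMOVE:
                continue  # drop the character from the result
            # IGNORE falls through: the character is kept as-is
        kept.append(char)
    return "".join(kept), alphabet


text = "aб!"  # Latin "a", Cyrillic "б", and a character in no toy category
for strategy in (LOAD, IGNORE, REMOVE):
    print(strategy, apply_strategy(text, "abc", strategy))
```

In this model only LOAD grows the alphabet, IGNORE keeps unknown characters untouched, and REMOVE drops them.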

Converting glyphs to ASCII chars

homoglyphs = hg.Homoglyphs(languages={'en'}, strategy=hg.STRATEGY_LOAD)

# convert
homoglyphs.to_ascii('ТЕСТ')
# ['TECT']
homoglyphs.to_ascii('ХР123.')  # this is cyrillic "х" and "р"
# ['XP123.', 'XPI23.', 'XPl23.']

# by default, a string containing chars that can't be converted yields no result
homoglyphs.to_ascii('лол')
# []

# you can set a strategy that removes unconverted non-ASCII chars from the result
homoglyphs = hg.Homoglyphs(
    languages={'en'},
    strategy=hg.STRATEGY_LOAD,
    ascii_strategy=hg.STRATEGY_REMOVE,
)
homoglyphs.to_ascii('лол')
# ['o']

# you can also set the range of allowed char codes for ASCII (0-128 by default):
homoglyphs = hg.Homoglyphs(
    languages={'en'},
    strategy=hg.STRATEGY_LOAD,
    ascii_strategy=hg.STRATEGY_REMOVE,
    ascii_range=range(ord('a'), ord('z') + 1),  # range() excludes its end, so add 1 to include 'z'
)
homoglyphs.to_ascii('ХР123.')
# ['l']
homoglyphs.to_ascii('хр123.')
# ['xpl']
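
The multiple results above can be read as a Cartesian product: each character contributes its ASCII look-alikes, and every combination becomes one candidate string. The sketch below illustrates that reading with the standard library only. It is an assumption about the behaviour shown in the examples, not the library's implementation. TOY_CONFUSABLES, ascii_variants, to_ascii_sketch and the remove_unconverted flag are hypothetical, and the confusables table is deliberately tiny.

```python
from itertools import product

# Hypothetical confusables groups; the real library ships a far larger table.
TOY_CONFUSABLES = [
    {"X", "Х"},       # Latin X, Cyrillic Ha
    {"P", "Р"},       # Latin P, Cyrillic Er
    {"1", "I", "l"},  # digit one, capital i, small L
    {"o", "о"},       # Latin o, Cyrillic o
]


def ascii_variants(char, ascii_range):
    """Look-alikes of `char` (itself included) whose code is in `ascii_range`."""
    variants = {char}
    for group in TOY_CONFUSABLES:
        if char in group:
            variants |= group
    return sorted(c for c in variants if ord(c) in ascii_range)


def to_ascii_sketch(text, ascii_range=range(128), remove_unconverted=False):
    """Return every ASCII spelling of `text` under the toy table, sorted."""
    per_char = []
    for char in text:
        variants = ascii_variants(char, ascii_range)
        if not variants:
            if remove_unconverted:
                continue   # drop the char, like ascii_strategy=REMOVE above
            return []      # one unconvertible char rules out the whole string
        per_char.append(variants)
    return sorted({"".join(combo) for combo in product(*per_char)})


print(to_ascii_sketch("ХР123."))
print(to_ascii_sketch("лол"))
print(to_ascii_sketch("лол", remove_unconverted=True))
```

Because the digit 1 has two ASCII look-alikes in the toy table, the first call produces three spellings, mirroring the three results shown for 'ХР123.' above.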

The Fork

To help with the transition I have:

  • Moved the main branch
  • Enabled Issues

I am looking to:

  • Switch to using GitHub Actions
  • Add this fork to PyPI
  • Update orsinium's page to say it's maintained

Contributors

ariutta, inokenty90, jordiae, orsinium, porfanid, tapplencourt, typerslow, vadimych, yamatt

homoglyphs's Issues

Some Latin characters cause to_ascii to return an empty result.

It's my understanding that STRATEGY_IGNORE should "add characters to result", which to me sounds like it should retain the character in the output if it isn't matched.

However, I cannot seem to retain my complete original input:

import homoglyphs_fork as hgf
hg = hgf.Homoglyphs(strategy=hgf.STRATEGY_IGNORE)

'ß' in hgf.Categories.get_alphabet(['LATIN'])
>>> True

hg.to_ascii('ß')
>>> []

This is an issue because there are characters that, while not true homoglyphs, can still be used as them. Consider the German eszett, ß, which is a common stand-in for 'B' online.

Because of this, I'm unable to properly detect (as an example) the string 'Сaptchaß𝗈t' -- Cyrillic ES (homoglyph of latin C), German Eszett (leet-speak for latin B), and Mathematical o (normalized to latin o). The best I've been able to achieve is Captchaot with strategy LOAD and ascii_strategy REMOVE.

Is there a way to have homoglyphs simply pass-through any character that isn't matched?

Character \x00 in to_ascii() raises an exception

import homoglyphs as hg
hg.Homoglyphs(languages={'en'}, strategy=hg.STRATEGY_LOAD).to_ascii('\x00')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "homoglyphs/core.py", line 240, in to_ascii
    return self.uniq_and_sort(self._to_ascii(text))
  File "homoglyphs/core.py", line 169, in uniq_and_sort
    result = list(set(data))
  File "homoglyphs/core.py", line 235, in _to_ascii
    for variant in self._get_combinations(text, ascii=True):
  File "homoglyphs/core.py", line 218, in _get_combinations
    alt_chars = self._get_char_variants(char)
  File "homoglyphs/core.py", line 195, in _get_char_variants
    if not self._update_alphabet(char):
  File "homoglyphs/core.py", line 182, in _update_alphabet
    category = Categories.detect(char)
  File "homoglyphs/core.py", line 66, in detect
    category = unicodedata.name(char).split()[0]
ValueError: no such name

I guess it should rather return [].
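
The last frame of the traceback calls unicodedata.name(char) without a fallback, and control characters such as '\x00' have no Unicode name. Below is a minimal sketch of a guarded lookup using only the standard library. name_prefix is a hypothetical helper, not part of homoglyphs, and what the library should do with such characters afterwards (for example, return []) is left open.

```python
import unicodedata


def name_prefix(char, default=None):
    """First word of the character's Unicode name, or `default` if it has none."""
    # unicodedata.name() raises ValueError for unnamed characters such as
    # NUL unless a fallback value is passed as the second argument.
    name = unicodedata.name(char, "")
    return name.split()[0] if name else default


print(name_prefix("w"))     # first word of "LATIN SMALL LETTER W"
print(name_prefix("\x00"))  # no Unicode name, so the default is returned
```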

(BTW, is this fork still maintained?)

PyPI page still shows readme of old version

Hi @yamatt,

Happy new year!
Thanks for maintaining this fork.

However, the PyPI page at https://pypi.org/project/homoglyphs_fork/ still seems to show the readme of the original version as the project description, rather than the README.md contents specified in pyproject.toml.

I'm not sure what's causing that, but maybe setup.py needs to be updated?
I'm not familiar with the pyproject.toml specification yet. Something seems to be causing PyPI to load the description from the readme of the original version, even though the other pieces are correct.

P.S. Do you think it would make sense to turn this project into an org and move it to homoglyphs/homoglyphs?
That way several maintainers could be added to the project.

There is also a fork at AnatolyTimakov@a70aa9f that added some content.

Thanks,
