
Comments (8)

hholzgra commented on August 13, 2024

I can make this work for strings that start with a number by extending the regular expressions that check for digits to accept multiple unicode digits:

diff --git a/natsort/utils.py b/natsort/utils.py
index b6484b0..9055da8 100644
--- a/natsort/utils.py
+++ b/natsort/utils.py
@@ -96,9 +96,9 @@ _float_nosign_noexp_re = _float_nosign_noexp_re.format(_num, numeric)
 _float_nosign_noexp_re = re.compile(_float_nosign_noexp_re, flags=re.U)
 
 # Integer regexes - include Unicode digits.
-_int_nosign_re = r'([0-9]+|[{0}])'.format(digits)
+_int_nosign_re = r'([0-9]+|[{0}]+)'.format(digits)
 _int_nosign_re = re.compile(_int_nosign_re, flags=re.U)
-_int_sign_re = r'([-+]?[0-9]+|[{0}])'.format(digits)
+_int_sign_re = r'([-+]?[0-9]+|[{0}]+)'.format(digits)
 _int_sign_re = re.compile(_int_sign_re, flags=re.U)
 
 # This dict will help select the correct regex and number conversion function.
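The effect of adding `+` to the Unicode character class can be checked in isolation. The following is a minimal sketch, not natsort's actual module: it assumes that `digits` can be approximated by every non-ASCII code point in Unicode category `Nd`, and the names `old_int_nosign_re` and `new_int_nosign_re` are made up for the comparison.

```python
import re
import sys
import unicodedata

# Assumption: approximate natsort's `digits` string with all non-ASCII
# decimal digits (Unicode category Nd). The real list may differ.
digits = ''.join(
    chr(cp)
    for cp in range(0x80, sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == 'Nd'
)

# Pattern before the patch: a Unicode digit only ever matches one character.
old_int_nosign_re = re.compile(r'([0-9]+|[{0}])'.format(digits), flags=re.U)
# Pattern after the patch: a run of Unicode digits matches as one chunk.
new_int_nosign_re = re.compile(r'([0-9]+|[{0}]+)'.format(digits), flags=re.U)

# "street" followed by EXTENDED ARABIC-INDIC DIGIT ONE and DIGIT TWO.
s = 'street \u06f1\u06f2'

old_parts = old_int_nosign_re.split(s)  # ['street ', '۱', '', '۲', '']
new_parts = new_int_nosign_re.split(s)  # ['street ', '۱۲', '']
```

With the old pattern the two digits end up as separate chunks, so they would be ranked as 1 followed by 2 rather than as 12.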

With numbers at the end of the strings, though, things fail when there is more than one Unicode digit, as in "street ۱۲". This again causes the same "TypeError: unorderable types: float() < str()" as in issue #7.

It turns out that the fake_fastnumbers implementation also identifies only single-digit Unicode numbers as "int" and returns "string" otherwise. Once the fastnumbers version check is fixed (issue #51) so that the real fastnumbers functions are used, things work as expected after all.
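To illustrate why a conversion function that leaves multi-digit Unicode numbers as strings leads to a TypeError, here is a small sketch. `fast_int_sketch` is a hypothetical stand-in written for this example; it is neither the fake_fastnumbers code nor the fastnumbers API. It relies only on the built-in `int()`, which accepts Unicode decimal digits. The error here is between `int` and `str` rather than `float` and `str`, but the mechanism is the same.

```python
def fast_int_sketch(x):
    """Hypothetical helper: return int(x) if x parses as an integer, else x."""
    try:
        return int(x)
    except ValueError:
        return x


# A multi-digit Unicode number is converted, plain text is passed through.
converted = [fast_int_sketch(c) for c in ('street ', '\u06f1\u06f2')]

# If one key holds a number and another holds an unconverted string in the
# same position, Python 3 cannot order them.
key_converted = ('street ', 12)
key_unconverted = ('street ', '\u06f1\u06f2')
try:
    sorted([key_converted, key_unconverted])
    mixed_keys_sortable = True
except TypeError:
    mixed_keys_sortable = False
```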

from natsort.

hholzgra commented on August 13, 2024

My regex change makes one test fail when testing against Python 3.5. With Python 2.7 all tests pass:


=================================================================================== FAILURES ====================================================================================
_________________________________________________________ test_parse_string_factory_only_parses_digits_with_nosign_int __________________________________________________________

    @given(lists(elements=floats() | text().filter(whitespace_check) | integers(), min_size=1, max_size=10))
>   @example([10000000000000000000000000000000000000000000000000000000000000000000000000,
              100000000000000000000000000000000000000000000000000000000000000000000000000,
              100000000000000000000000000000000000000000000000000000000000000000000000000])
    def test_parse_string_factory_only_parses_digits_with_nosign_int(x):

test_natsort/test_parse_string_function.py:82: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
.tox/py35/lib/python3.5/site-packages/hypothesis/core.py:581: in execute
    result = self.test_runner(data, run)
.tox/py35/lib/python3.5/site-packages/hypothesis/executors.py:58: in default_new_style_executor
    return function(data)
.tox/py35/lib/python3.5/site-packages/hypothesis/core.py:573: in run
    return test(*args, **kwargs)
test_natsort/test_parse_string_function.py:82: in test_parse_string_factory_only_parses_digits_with_nosign_int
    @example([10000000000000000000000000000000000000000000000000000000000000000000000000,
.tox/py35/lib/python3.5/site-packages/hypothesis/core.py:520: in test
    result = self.test(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

x = ['᧐᧐']

    @given(lists(elements=floats() | text().filter(whitespace_check) | integers(), min_size=1, max_size=10))
    @example([10000000000000000000000000000000000000000000000000000000000000000000000000,
              100000000000000000000000000000000000000000000000000000000000000000000000000,
              100000000000000000000000000000000000000000000000000000000000000000000000000])
    def test_parse_string_factory_only_parses_digits_with_nosign_int(x):
        s = ''.join(repr(y) if type(y) in (float, long, int) else y for y in x)
>       assert _parse_string_factory(0, '', _int_nosign_re.split, no_op, fast_int, tuple2)(s) == int_splitter(s, False, '')
E       AssertionError: assert ('᧐᧐',) == ('', 0, '', 0)
E         At index 0 diff: '᧐᧐' != ''
E         Right contains more items, first extra item: 0
E         Use -v to get the full diff

test_natsort/test_parse_string_function.py:87: AssertionError
---------------------------------------------------------------------------------- Hypothesis -----------------------------------------------------------------------------------
Falsifying example: test_parse_string_factory_only_parses_digits_with_nosign_int(x=['᧐᧐'])



SethMMorton commented on August 13, 2024

The reason I chose to not attempt to combine non-ASCII digits into numbers as is done for ASCII numbers is that it was not clear to me if they could be treated the same way 100% of the time. For example, I know that ⅐ would need special care, but this is classified as a number in unicode and not a digit so perhaps I should not worry.

Should it be safe to treat any unicode digit as its ASCII equivalent when converting to integers? For example, I am assuming that ۱۲ should be converted to 12? What about if there are decimal points - would ۱۲.۱۲ be the float 12.12? Would this be true for other character sets?
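The distinction between a Unicode "number" and a Unicode "digit" mentioned above can be inspected with the standard `unicodedata` module. This is only an illustrative sketch; the helper `describe` is a name made up for this example.

```python
import unicodedata

seventh = '\u2150'      # VULGAR FRACTION ONE SEVENTH
persian_one = '\u06f1'  # EXTENDED ARABIC-INDIC DIGIT ONE


def describe(ch):
    """Collect the numeric properties Unicode assigns to a single character."""
    return {
        'category': unicodedata.category(ch),
        'decimal': unicodedata.decimal(ch, None),
        'digit': unicodedata.digit(ch, None),
        'numeric': unicodedata.numeric(ch, None),
    }


fraction_info = describe(seventh)   # category 'No', no decimal or digit value
digit_info = describe(persian_one)  # category 'Nd', decimal and digit value 1
```

The fraction has only a numeric value (roughly 0.142857), so it cannot take part in positional notation, while the Persian digit has a decimal value and behaves like ASCII `1`.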


hholzgra commented on August 13, 2024

I only care about integers at this point, not floats, the background being that I was looking for a more natural sorting of street names in the street indexes generated by MapOSMatic.

This looks good now for my two original test cases, New York City:

https://maposmatic.osm-baustelle.de/maps/16968

and a town in Iran:

https://maposmatic.osm-baustelle.de/maps/16961

> For example, I know that ⅐ would need special care, but this is classified as a number in unicode and not a digit so perhaps I should not worry.

I now checked for cities using ½ in their street numbering scheme in the OpenStreetMap database and found

https://maposmatic.osm-baustelle.de/maps/1699

where I can see things like:

 79th Street
 80th Avenue West
 80th Street
 8½th Avenue
 8½th Street
 8½th Street Court
 8½th Street West
 81st Avenue West
 82nd Avenue West

Seeing 8½th between 80th and 81st is a bit odd, but as we only have some ~1300 roads worldwide that use ½ (and none using the other fractionals like 1/3 or 1/4) I can live with that for now.

> Should it be safe to treat any unicode digit as its ASCII equivalent when converting to integers? For example, I am assuming that ۱۲ should be converted to 12?

That should be safe. I could only check with native speakers from Iran (for Persian/Farsi) and Syria (for Arabic) so far, but I also have contacts from Malaysia, Israel, South Korea, and maybe Japan and China that I can ask to verify that street lists come out correctly (if there are examples of numbered streets in those countries at all).

> What about if there are decimal points - would ۱۲.۱۲ be the float 12.12? Would this be true for other character sets?

That seems to be way more tricky, at least from looking at https://en.wikipedia.org/wiki/Decimal_separator ...


SethMMorton commented on August 13, 2024

It sounds like you are validating the change to natsort's heuristic for identifying numbers only within the context of street names. My concern is that natsort is intended to be general for all domains, so I want to make sure that any changes made will not break some other assumption in another domain.

Having said that, I do agree some update should be done, especially since, as you point out, fastnumbers handles this properly. I am actually surprised by that, because I did not code it to handle numbers like ۱۲, so I'll need to look into that.

I think that perhaps just updating the regex and also fake_fastnumbers should be sufficient, but I am a bit over-careful about things like this so I will do more investigation.


SethMMorton commented on August 13, 2024

I had forgotten that I added code to fastnumbers that translates any Unicode decimal digit to its ASCII equivalent. Under the hood, this is what Python does for int or float.

Thanks for reporting this.
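The translation described above can be sketched in pure Python. This is not the fastnumbers code, only an illustration of the idea; `to_ascii_decimal` is a hypothetical helper that relies on `unicodedata.decimal`, and the last two lines use the built-in `int` and `float`, which accept Unicode decimal digits directly.

```python
import unicodedata


def to_ascii_decimal(text):
    """Replace every Unicode decimal digit in text with its ASCII equivalent."""
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)
        # Characters without a decimal value (letters, '.', fractions) pass through.
        out.append(ch if value is None else str(value))
    return ''.join(out)


persian = '\u06f1\u06f2.\u06f1\u06f2'  # the "12.12" example from this thread
translated = to_ascii_decimal(persian)

# The built-ins perform an equivalent translation on their own.
as_int = int('\u06f1\u06f2')
as_float = float(persian)
```

Whether a period is the right decimal separator for a given script is a separate question that this sketch does not address.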


SethMMorton commented on August 13, 2024

@hholzgra Can you review the changes I made in PR #54 and see if those will support your use case?


SethMMorton commented on August 13, 2024

@hholzgra I think I am going to go forward with this release in the next few days.
