Comments (27)

SethMMorton avatar SethMMorton commented on August 13, 2024 3

OK - I have finally decided how I want to implement this.

  1. I am going to introduce the functions winsorted and winsort_key into the API
  2. These functions will use the Windows API directly in order to perform their actions, similar to this answer
  3. Because the Windows API is used, this will only be available on Windows.
  4. Because the Windows API handles everything in a black-box fashion, there will be no alg option for customization of the results (though I will still provide key)
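As a rough illustration of points 2 and 3, the sketch below wraps the Windows shell comparison function StrCmpLogicalW (exported by shlwapi.dll) with ctypes and turns it into a sort key. The helper name windows_explorer_key is hypothetical, and this is not the natsort implementation; it is only one guess at what "using the Windows API directly" could look like.

```python
import ctypes
import sys
from functools import cmp_to_key


def windows_explorer_key():
    """Return a sort key backed by StrCmpLogicalW (hypothetical helper, Windows only)."""
    if sys.platform != "win32":
        # The comparison lives in a Windows DLL, so there is nothing to call elsewhere.
        raise OSError("StrCmpLogicalW is only available on Windows")
    compare = ctypes.windll.shlwapi.StrCmpLogicalW
    compare.argtypes = [ctypes.c_wchar_p, ctypes.c_wchar_p]
    compare.restype = ctypes.c_int
    # StrCmpLogicalW returns a negative, zero, or positive int, which is what cmp_to_key expects.
    return cmp_to_key(compare)


if sys.platform == "win32":
    print(sorted(["1", "+1", "a", "+a"], key=windows_explorer_key()))
```

Because the comparison itself is a black box, the only customization left in such a design is a key function applied before the strings reach the comparator, which matches point 4.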

from natsort.

SethMMorton avatar SethMMorton commented on August 13, 2024 2

It does seem that Microsoft has a custom sorting order for characters (at least for Excel) as can be seen in the table given here. In this table the _ character appears before numbers, which is not how it appears in the ASCII table. This is why Python's sorted (and thus natsorted) places the _ character after and not before number characters.

Having said this, it's not clear to me what the request is here. As filed, the issue simply states that natsort is "not well done in windows" and points out the behavior of natsorted compared to Windows Explorer. Is this

  • a question on how to make natsort return the same results as Windows Explorer?
  • an enhancement request to allow the user to modify the sorting table to match Windows Explorer?

SethMMorton avatar SethMMorton commented on August 13, 2024 1

Well, my original plan was to not even export winsorted on a non-Windows machine. But after I posted my plan, I realized a better solution was to instead name the function os_sorted (or something similar) and have it behave along the lines of your suggestion, where it sorts according to how the OS's file manager would sort.

SethMMorton avatar SethMMorton commented on August 13, 2024 1

@ganego It's out - natsort 7.1.0.

SethMMorton avatar SethMMorton commented on August 13, 2024

Please give specific examples of the input you are passing, the output you are getting, and the desired output.

SethMMorton avatar SethMMorton commented on August 13, 2024

Just a guess, but have you tried using alg=ns.PATH as suggested in the natsort examples section? Here is an explanation of why this might be needed.

earonesty avatar earonesty commented on August 13, 2024

Allowing the user to modify the sorting table to match NTFS is a good optional feature. Note: on Windows, an alphabetical sorting order is baked into NTFS (see: https://docs.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-findfirstfileexa).

The API also reports underscores after numbers when using an NTFS file system.

Output from "dir" request from cmd.exe:

01/21/2019  08:08 AM                 0 foo0
01/21/2019  08:08 AM                 0 foo_0

Output from bash "ls" request on the same filesystem:

-rw-r--r-- 1 erik 197609           0 Jan 21 08:08  foo_0
-rw-r--r-- 1 erik 197609           0 Jan 21 08:08  foo0

Conceptually, if the user requests ns.PATH semantics and is on a Windows system, then Windows path ordering should be chosen. When developing UI code, you typically want native OS semantics. If someone wants an OS-insensitive sort order, the built-in sorted() function is sufficient. I would consider this option in effect only when the combination ns.PATH | ns.LOCALE is chosen, simply because LOCALE is intended to modify lexical order and PATH is intended to concern itself only with slashes.

SethMMorton avatar SethMMorton commented on August 13, 2024

Do you have a suggestion for how this might be implemented? At the end of the day, the sorting order of non-alphanumeric characters is arbitrary, so unlike a modification such as case insensitivity (which just uses x.lower()), changing the sorting order to match a different table can't be achieved with a single, simple rule.

SethMMorton avatar SethMMorton commented on August 13, 2024

To follow up on my comment above: I would welcome any PR that attempts to solve this issue.

SethMMorton avatar SethMMorton commented on August 13, 2024

@earonesty Do you envision the user being able to customize the translation table, or would there be pre-defined translation tables?

I think this could work by using str.translate() to pre-process the text before running it through the locale filter (or in place of the locale filter).
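A minimal sketch of the str.translate() idea, assuming a small hand-written ordering: each listed character is mapped to consecutive private-use code points so that plain string comparison follows the custom table. CUSTOM_ORDER and table_key are hypothetical names, and the ordering is deliberately incomplete; it is not the real NTFS or Explorer table.

```python
import string

# Hypothetical, incomplete ordering: underscore first, then digits, then letters.
CUSTOM_ORDER = "_" + string.digits + string.ascii_lowercase

# Map each listed character to a code point whose numeric order mirrors CUSTOM_ORDER.
TABLE = str.maketrans({ch: chr(0xE000 + i) for i, ch in enumerate(CUSTOM_ORDER)})


def table_key(text):
    # Lowercase first so the table only needs one entry per letter.
    return text.lower().translate(TABLE)


names = ["foo0", "foo_0"]
print(sorted(names))                 # ['foo0', 'foo_0'] (code-point order)
print(sorted(names, key=table_key))  # ['foo_0', 'foo0'] (custom table order)
```

Characters missing from the table keep their original code points, which is exactly why an incomplete table gives unreliable results outside the listed set.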

SethMMorton avatar SethMMorton commented on August 13, 2024

@earonesty I went to implement this today, and I realized that the table in the link I gave above is incomplete, so I was unable to implement this solution. Do you know where a full table of ASCII to NTFS equivalence exists?

Alternatively, is there an existing module or library that provides a collation function that makes strings sort as they do on NTFS?


TO ANYONE INTERESTED IN HELPING.

I would like to implement this with a new enum (ns.WINDOWS/ns.W/ns.NTFS)? Whatever business logic gets implemented, the entry point would be part of the string_component_transform_factory hook (in utils.py). If the logic is added there (as manipulators on the string values after numbers are extracted), it should "just work".

ganego avatar ganego commented on August 13, 2024

I have the same problem when sorting folder names: they now appear in a different order when used on Windows.
The problem is with folder names that start with 'special' characters like ! _ + - and so on, which I expect to be sorted before the 'normal' characters, as Windows Explorer does. For example, the order should be +1, 1 instead of 1, +1 (which is the current sort result).

The library fails when sorting numbers with leading 'special' characters:

from natsort import natsorted, ns

a = ["1", "+1", "a", "+a"]
print( natsorted(a, alg=ns.IGNORECASE) )

Results in: 1, +1, +a, a
Windows Explorer: +1, 1, +a, a


Regarding your question about a complete ASCII table, I could not find any.
Some characters are forbidden, so they can be ignored, see: https://docs.microsoft.com/en-us/windows/win32/fileio/naming-a-file#naming-conventions

I put together a list of the characters found on a 'normal' western keyboard, so basically all the ASCII chars (and some from extended ASCII) in the correct order: https://i.postimg.cc/cL5hNSnd/image.png

That is: ' - ! # $ % & ( ) , ; @ [ ] ^ _ ` { } ~ ´ € + = § ° µ 11111 aaaaa foo_0 foo0
This is what Windows Explorer shows, which is identical to dir /on (dir, sort by names) on german W7.

Windows Explorer sorting seems to be different from NTFS sorting: https://devblogs.microsoft.com/oldnewthing/20050617-10/?p=35293.
If dir gives NTFS order (I'm not sure), then it's different:
That is: ! # $ % & ' ( ) + , - 11111 ; = @ aaaaa foo0 foo_0 [ ] ^ _ ` { } ~ § ° ´ µ €

I think Explorer sorting should be used as this is what people usually see.

Hope that might be helpful.
Thank you.

earonesty avatar earonesty commented on August 13, 2024

I would imagine alg=ns.NTFS or alg=ns.WINEXPLORE, etc. that use optional, predefined tables is the correct choice. Arbitrary sort tables require users to develop and maintain a lot of code.... better for them to live in a public repo with an enum driving use. With a plugin model, contributed sorting tables will improve over time.

SethMMorton avatar SethMMorton commented on August 13, 2024

@ganego One of the articles you linked indicated that Windows Explorer uses the locale to sort... can you see if using natsort with the locale setting will replicate the results of your experiment?

ganego avatar ganego commented on August 13, 2024

I think the locale for Windows Explorer sorting only plays a role when going to extended-ASCII with all the special letters like äöüØ... See: https://www.ascii-code.com/ The basic-ASCII should be locale-independent.
Sorting strings with natsorted the way they show up in Windows Explorer is already an issue with basic ASCII characters (+ - ! ...). What I mean is: as a first step, we could ignore the locale and just set the sort order for basic ASCII manually, if that is possible(?).

Results (W7, German):

dir /on (dir, sorted by name):

' - ! # $ % & ( ) , ; @ [ ] ^ _ ` { } ~ ´ € + +11111 +aaaaa = § ° µ 11111 aaaaa foo_0 foo0

dir (NTFS sorting?):
! # $ % & ' ( ) + +11111 +aaaaa , - 11111 ; = @ aaaaa foo0 foo_0 [ ] ^ _ ` { } ~ § ° ´ µ €


natsorted with Locale to German:
German_Germany.1252
11111 ' - ! # $ % & ( ) , ; @ [ ] ^ _ ` { } ~ ´ € + +11111 +aaaaa = § ° aaaaa foo0 foo_0 µ

Without Locale:
11111 ! # $ % & ' ( ) + +11111 +aaaaa , - ; = @ [ ] ^ _ ` aaaaa foo0 foo_0 { } ~ § ° ´ µ €

Code used:

import locale
import os
from natsort import natsorted, ns

target_dir = "."  # placeholder: the folder whose sub-folder names are sorted
a = [f.name for f in os.scandir(target_dir) if f.is_dir()]
print(locale.setlocale(locale.LC_ALL, locale="German"))
print(natsorted(a, alg=ns.IGNORECASE | ns.LOCALE))

Besides the obvious differences, the reason I stumbled upon this issue was the difference in how <specialChar&Letter> and <specialChar&Number> are handled. See the ['1', '+1', '+a', 'a'] example, where +1 should actually be sorted in front of 1 (as is the case for +a vs a). The locale makes no difference in this example.

EDIT: Here is a string that can serve as a Python list for testing, with all the folder names from above: ['11111', '!', '#', '$', '%', '&', "'", '(', ')', '+', '+11111', '+aaaaa', ',', '-', ';', '=', '@', '[', ']', '^', '_', '`', 'aaaaa', 'foo0', 'foo_0', '{', '}', '~', '§', '°', '´', 'µ', '€']

SethMMorton avatar SethMMorton commented on August 13, 2024

I am all-in on adding this type of functionality. Having said that, I have reservations on how successful it can be:

  • How could this be tested, rigorously and exhaustively?
    • I can imagine during testing writing a C extension to access StrCmpLogicalW as validation, but of course this could only work on Windows.
  • Is it expected that this would work on Windows only, or cross platform?
  • How would this be implemented? Remember that just taking care of a subset of characters, as in the suggested list above, is not enough. It should really cover the whole Unicode range (assuming that is what Windows does as well).

I do not use Windows so I do not think I am in a position to implement this. But I would gladly accept a PR from the brave soul who wants to take a stab at this.

ganego avatar ganego commented on August 13, 2024

Well, best case: We ask MS to open-source StrCmpLogicalW...

I found this: https://gist.github.com/mcmarcu/7899295 which uses the correct DLL on Windows and some other function on Linux. Of course, you will then only get "proper" Windows sorting on Windows. The question is whether Linux users even need this. Maybe not. (To be honest, I don't understand why a Linux user would be OK with their system sorting 1 before !1, since that makes no sense when !a is sorted before a. EDIT: This is what Thunar does; ls | sort sorts it properly (even without -n), so there is some code somewhere that does it on Linux.)

I then found this: https://psycodedeveloper.wordpress.com/2013/04/12/c-numeric-sorting-revisited/
It's C# and I cannot compile it, but it claims to do it like the real Windows function. I wrote a comment asking about the cases above several days ago, but it seems the blog is dead - no answer.
In case anyone could compile it and test it, that would be nice.

Sorting right now mostly works, except for one big issue with <specialChar&Letter> vs <specialChar&Number>. It would be great if that could somehow be fixed for the numbers that are currently sorted incorrectly.

SethMMorton avatar SethMMorton commented on August 13, 2024

"Sorting right now mostly works, except for one big issue with <specialChar&Letter> vs <specialChar&Number>."

Have you tried sorting with ns.SIGNED? Does that get closer to the correct results?


With respect to the WordPress article: it looks like they are using a comparator called StringComparison.InvariantCultureIgnoreCase, which I'm pretty sure ends up using the same code internally as StrCmpLogicalW (https://docs.microsoft.com/en-us/dotnet/standard/globalization-localization/performing-culture-insensitive-string-comparisons). So I don't think it is actually cross-platform.

ganego avatar ganego commented on August 13, 2024

"Have you tried sorting with ns.SIGNED? Does that get closer to the correct results?"

That won't help.

In the meantime, while looking to solve some other unrelated problem I found this:
https://docs.python.org/3/library/pathlib.html#general-properties

"Paths of a same flavour are comparable and orderable"

Amazing, Python comes with this built-in, and it works:

from pathlib import PureWindowsPath
print(PureWindowsPath('C:\\!1') < PureWindowsPath('C:\\1'))

True

But wait, it also has a bug:

from pathlib import PureWindowsPath
print(PureWindowsPath('C:\\15') < PureWindowsPath('C:\\100'))

False

So while it handles the problem mentioned before perfectly, it fails at natural sorting :(
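Both results are consistent with pathlib comparing the path components as ordinary (case-insensitive) strings rather than as numbers. The sketch below puts the plain string comparisons next to the path comparisons; that this is the whole explanation is an assumption based on the observed output, not on anything the pathlib documentation promises about Explorer ordering.

```python
from pathlib import PureWindowsPath

# '!' has a lower code point than '1', so both comparisons agree.
print("!1" < "1")                                            # True
print(PureWindowsPath("C:\\!1") < PureWindowsPath("C:\\1"))  # True

# Character by character, '5' is greater than '0', so "15" sorts after "100".
print("15" < "100")                                            # False
print(PureWindowsPath("C:\\15") < PureWindowsPath("C:\\100"))  # False
```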

I will open a bug report, since this function clearly does not produce the correct order for Windows despite the name.

SethMMorton avatar SethMMorton commented on August 13, 2024

"I will open a bug report, since this function clearly does not produce the correct order for Windows despite the name."

It's your decision to file a bug report on this if you want (it's unrelated to natsort), but I'm not so sure it's a bug. The documentation never states that when sorted it will match Windows Explorer behavior. If I were the Python devs and had this labeled as a "bug" and not as a "feature request" I would be a bit annoyed since it is never advertised to work that way.

ganego avatar ganego commented on August 13, 2024

Good to hear.
How will you handle errors?
Either introduce a new error that is raised on the wrong OS, or simply catch the error internally in winsorted and use the "old" way on Linux. I think the second option is better, since it would then sort just fine on Linux and would not crash code that someone wrote for Windows but ran on Linux. People using the library would also not have to write their own exception handling all the time or switch on the OS themselves.

ganego avatar ganego commented on August 13, 2024

That's an even more elegant solution. I like it.

earonesty avatar earonesty commented on August 13, 2024

seems like one lambda is all you need: os_sorted ... None == don't use it. but yeah, a key/callback type thing makes sense

SethMMorton avatar SethMMorton commented on August 13, 2024

@ganego @earonesty Check out #123, which is the PR for this feature. I plan to merge sometime tomorrow. I would be open to some feedback, especially the documentation or the tests.

ganego avatar ganego commented on August 13, 2024

Just looked over the code but did not run any tests. Seems you used my list for the test, so if it gives the correct results I guess it works. Doc also looks ok. When will you publish a new version?

SethMMorton avatar SethMMorton commented on August 13, 2024

I am trying to see if I can solve #122. If I cannot by tonight, I think I will release tonight or tomorrow.

nelsonjchen avatar nelsonjchen commented on August 13, 2024

StrCmpLogicalW

I wonder if WINE's version is accurate.

https://github.com/wine-mirror/wine/blob/e909986e6ea5ecd49b2b847f321ad89b2ae4f6f1/dlls/kernelbase/string.c#L1302
