Comments (27)
OK - I have finally decided how I want to implement this.
- I am going to introduce the functions winsorted and winsort_key into the API.
- These functions will use the Windows API directly in order to perform their actions, similar to this answer.
- Because the Windows API is used, this will only be available on Windows.
- Because the Windows API handles everything in a black-box fashion, there will be no alg option for customization of the results (though I will still provide key).
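As a rough illustration of the plan above, here is a minimal sketch of calling StrCmpLogicalW from Python through ctypes. The names load_strcmplogicalw, winsort_key_sketch and winsorted_sketch are hypothetical and are not the natsort API; the sketch assumes that every item (or the result of key) can be converted to a string.

```python
import ctypes
import functools
import sys


def load_strcmplogicalw():
    """Return the StrCmpLogicalW comparator, or None when not on Windows."""
    if sys.platform != "win32":
        return None
    compare = ctypes.windll.shlwapi.StrCmpLogicalW
    compare.argtypes = [ctypes.c_wchar_p, ctypes.c_wchar_p]
    compare.restype = ctypes.c_int
    return compare


def winsort_key_sketch(key=None):
    """Build a sort key that defers every comparison to the Windows API."""
    compare = load_strcmplogicalw()
    if compare is None:
        raise OSError("StrCmpLogicalW is only available on Windows")
    wrap = functools.cmp_to_key(compare)
    if key is None:
        return lambda item: wrap(str(item))
    # Apply the user's key first, then let Windows compare the results.
    return lambda item: wrap(str(key(item)))


def winsorted_sketch(items, key=None, reverse=False):
    """Hypothetical stand-in for winsorted: there is no alg option, only key."""
    return sorted(items, key=winsort_key_sketch(key), reverse=reverse)
```

Because the comparison happens inside the Windows API, the only customization point left in this sketch is key, which matches the trade-off described in the list above.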
from natsort.
It does seem that Microsoft has a custom sorting order for characters (at least for Excel), as can be seen in the table given here. In this table the _ character appears before numbers, which is not how it appears in the ASCII table. This is why Python's sorted (and thus natsorted) places the _ character after and not before number characters.
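The code-point ordering described above can be checked with nothing but the standard library. This is only an illustration of the built-in sorted; it does not call natsort, and the file names are the ones used elsewhere in this thread.

```python
# Code points decide the order used by the built-in sorted().
underscore = ord("_")  # 95
zero = ord("0")        # 48
nine = ord("9")        # 57

# "_" comes after every digit in ASCII, so "foo0" sorts before "foo_0".
names = ["foo_0", "foo0"]
result = sorted(names)
print(result)  # ['foo0', 'foo_0']
```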
Having said this, it's not clear to me what the request is here. As filed, the issue simply states that natsort is "not well done in windows" and points out the behavior of natsorted compared to Windows Explorer. Is this
- a question on how to make natsort return the same results as Windows Explorer?
- an enhancement request to allow the user to modify the sorting table to match Windows Explorer?
from natsort.
Well, my original plan was to not even export winsorted on a non-Windows machine. But after I posted my plan, I realized a better solution was to instead name the function os_sorted (or something similar) and have it behave similarly to what you suggested, where it sorts according to how the OS's file manager would sort.
from natsort.
@ganego It's out - natsort 7.1.0.
from natsort.
Please give specific examples of the input you are passing, the output you are getting, and the desired output.
from natsort.
Just a guess, but have you tried using alg=ns.PATH as suggested in the natsort examples section? Here is an explanation of why this might be needed.
from natsort.
Allowing the user to modify the sorting table to match NTFS is a good optional feature. Note: on windows an alphabetical sorting order is baked into NTFS (see: https://docs.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-findfirstfileexa).
The API also reports underscores after numbers when using an NTFS file system.
Output from "dir" request from cmd.exe:
01/21/2019 08:08 AM 0 foo0
01/21/2019 08:08 AM 0 foo_0
Output from bash "ls" request on the same filesystem:
-rw-r--r-- 1 erik 197609 0 Jan 21 08:08 foo_0
-rw-r--r-- 1 erik 197609 0 Jan 21 08:08 foo0
Conceptually, if the user requests ns.PATH semantics and is on a Windows system, then Windows path ordering should be chosen. When developing UI stuff, you typically want native OS semantics. If someone wants an OS-insensitive sort order, the built-in sorted() command is sufficient. I would consider this option in effect only when a combination of ns.PATH | ns.LOCALE is chosen, simply because LOCALE is intended to modify lexical order and PATH is intended to concern itself only with slashes.
from natsort.
Do you have a suggestion for how this might be implemented? At the end of the day, the sorting order of non-alphanumeric characters is arbitrary, so unlike a modification such as being case insensitive (which just uses x.lower()), changing the sorting order to match a different table can't be achieved with a single, simple rule.
from natsort.
To follow up with my above comment - I would welcome any PR that attempts to solve this issue.
from natsort.
@earonesty Do you envision the user being able to customize the translation table, or would there be pre-defined translation tables?
I think this could work by using str.translate() to pre-process the text before running it through the locale filter (or in place of the locale filter).
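To make the str.translate() idea concrete, here is a minimal sketch with a deliberately tiny, hypothetical table that only moves the underscore below the digits. A real table would have to cover far more characters, and the sketch does no natural (numeric) sorting at all.

```python
# Hypothetical one-entry table: map "_" to a control character (code point 31)
# so that it compares lower than every digit.
UNDERSCORE_FIRST = str.maketrans({"_": "\x1f"})


def translated_key(text):
    """Return the text as it should be compared, not as it is displayed."""
    return text.translate(UNDERSCORE_FIRST)


names = ["foo0", "foo_0"]
result = sorted(names, key=translated_key)
print(result)  # ['foo_0', 'foo0']
```

The translated strings are only used as sort keys, so the original names come back unchanged.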
from natsort.
@earonesty I went to implement this today, and I realized that the table in the link I gave above is incomplete, so I was unable to implement this solution. Do you know where a full table of ASCII to NTFS equivalence exists?
Alternatively, is there an existing module or library that already exists that provides a collation function that makes strings sort like they are on NTFS?
TO ANYONE INTERESTED IN HELPING.
I would like to implement this with a new enum (ns.WINDOWS/ns.W/ns.NTFS). Whatever business logic gets implemented, the entry point would be part of the string_component_transform_factory hook (in utils.py). If the logic is added here (manipulators on the string values after numbers are extracted), it should "just work".
from natsort.
I got the same problem when sorting folder names, that now appear in a different order when used on Windows.
The problem I had was with folder names that have leading 'special' characters like ! _ + - and so on, which I expect to be sorted before the 'normal' characters, as Windows Explorer does. So, for example, the order should be +1, 1 instead of 1, +1 (which is what is given as a sort result right now).
The library fails when sorting numbers with leading 'special' characters:
from natsort import natsorted, ns
a = ["1", "+1", "a", "+a"]
print( natsorted(a, alg=ns.IGNORECASE) )
Results in: 1, +1, +a, a
Windows Explorer: +1, 1, +a, a
Regarding your question about a complete ASCII table, I could not find any.
Some characters are forbidden, so they can be ignored, see: https://docs.microsoft.com/en-us/windows/win32/fileio/naming-a-file#naming-conventions
I put together a list of the characters found on a 'normal' western keyboard, so basically all the ASCII chars (and some from extended ASCII) in the correct order: https://i.postimg.cc/cL5hNSnd/image.png
That is: ' - ! # $ % & ( ) , ; @ [ ] ^ _ ` { } ~ ´ € + = § ° µ 11111 aaaaa foo_0 foo0
This is what Windows Explorer shows, which is identical to dir /on (dir, sort by names) on German Windows 7.
Windows Explorer sorting seems to be different from NTFS sorting: https://devblogs.microsoft.com/oldnewthing/20050617-10/?p=35293.
If dir gives NTFS order (I'm not sure), then it's different:
That is: ! # $ % & ' ( ) + , - 11111 ; = @ aaaaa foo0 foo_0 [ ] ^ _ ` { } ~ § ° ´ µ €
I think Explorer sorting should be used as this is what people usually see.
Hope that might be helpful.
Thank you.
from natsort.
I would imagine alg=ns.NTFS or alg=ns.WINEXPLORE, etc., that use optional, predefined tables is the correct choice. Arbitrary sort tables require users to develop and maintain a lot of code... better for them to live in a public repo with an enum driving use. With a plugin model, contributed sorting tables will improve over time.
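A minimal sketch of what "predefined tables driven by an enum" could look like follows. SortTable and table_key are hypothetical names, not part of natsort, and the two orderings are just the ASCII part of the experimental lists posted earlier in this thread (German Windows 7), not official Microsoft tables. The key ranks single characters only and does no number handling.

```python
import enum


class SortTable(enum.Enum):
    """Hypothetical predefined character orders (ASCII subset only)."""

    # Order observed in Windows Explorer / "dir /on" in this thread.
    WINEXPLORE = "'-!#$%&(),;@[]^_`{}~+=0123456789abcdefghijklmnopqrstuvwxyz"
    # Order observed from plain "dir" in this thread (possibly NTFS order).
    NTFS = "!#$%&'()+,-0123456789;=@abcdefghijklmnopqrstuvwxyz[]^_`{}~"


def table_key(table):
    """Build a key function that ranks characters by their table position."""
    rank = {ch: i for i, ch in enumerate(table.value)}

    def key(text):
        # Characters missing from the table sort after all listed ones.
        return [rank.get(ch, len(rank) + ord(ch)) for ch in text.lower()]

    return key


names = ["foo0", "foo_0"]
explorer_order = sorted(names, key=table_key(SortTable.WINEXPLORE))
ntfs_order = sorted(names, key=table_key(SortTable.NTFS))
print(explorer_order)  # ['foo_0', 'foo0']
print(ntfs_order)      # ['foo0', 'foo_0']
```

Keeping each table as data behind an enum member means a contributed table can be corrected without touching the key logic.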
from natsort.
@ganego One of the articles you linked indicated that Windows Explorer uses the locale to sort... can you see if using natsort with the locale setting will replicate the results of your experiment?
from natsort.
I think the locale for Windows Explorer sorting only plays a role when going to extended-ASCII with all the special letters like äöüØ... See: https://www.ascii-code.com/ The basic-ASCII should be locale-independent.
Sorting strings with natsorted the way they would show up in Windows Explorer becomes an issue even with basic-ASCII signs (+ - ! ...). What I mean is: as a first step, we could ignore locale and just set the sorting for basic ASCII manually, if that is possible(?).
Results (W7, German):
dir /on (dir, sorted by name):
' - ! # $ % & ( ) , ; @ [ ] ^ _ ` { } ~ ´ € + +11111 +aaaaa = § ° µ 11111 aaaaa foo_0 foo0
dir (NTFS sorting?):
! # $ % & ' ( ) + +11111 +aaaaa , - 11111 ; = @ aaaaa foo0 foo_0 [ ] ^ _ ` { } ~ § ° ´ µ €
natsorted with locale set to German (German_Germany.1252):
11111 ' - ! # $ % & ( ) , ; @ [ ] ^ _ ` { } ~ ´ € + +11111 +aaaaa = § ° aaaaa foo0 foo_0 µ
Without locale:
11111 ! # $ % & ' ( ) + +11111 +aaaaa , - ; = @ [ ] ^ _ ` aaaaa foo0 foo_0 { } ~ § ° ´ µ €
Code used:
import locale
import os
from natsort import natsorted, ns
folder = "."  # assumption: the directory holding the test folders
a = [f.name for f in os.scandir(folder) if f.is_dir()]
print(locale.setlocale(locale.LC_ALL, locale="German"))
print(natsorted(a, alg=ns.IGNORECASE | ns.LOCALE))
Besides the obvious differences, the reason I stumbled upon this issue was the difference in handling of <specialChar&Letter> vs <specialChar&Number>; see the ['1', '+1', '+a', 'a'] example, where +1 should actually be sorted in front of 1 (which would be the case for +a vs a). Locale makes no difference in this example.
EDIT: Here is a string that can serve as Python list for testing with all the folder names from above: ['11111', '!', '#', '$', '%', '&', "'", '(', ')', '+', '+11111', '+aaaaa', ',', '-', ';', '=', '@', '[', ']', '^', '_', '`', 'aaaaa', 'foo0', 'foo_0', '{', '}', '~', '§', '°', '´', 'µ', '€']
from natsort.
I am all-in on adding this type of functionality. Having said that, I have reservations on how successful it can be:
- How could this be tested, rigorously and exhaustively?
- I can imagine during testing writing a C extension to access StrCmpLogicalW as validation, but of course this could only work on Windows.
- Is it expected that this would work on Windows only, or cross platform?
- How would this be implemented? Remember that just taking care of a subset of characters as in the suggested list above is not enough - it should really cover the whole unicode range (assuming that that is what Windows does as well)
I do not use Windows so I do not think I am in a position to implement this. But I would gladly accept a PR from the brave soul who wants to take a stab at this.
from natsort.
Well, best case: We ask MS to open-source StrCmpLogicalW...
I found this: https://gist.github.com/mcmarcu/7899295 which either uses the correct dll on Windows or on Linux some other function. Then of course you will only get "proper" Windows sorting on Windows - the question would be, do Linux users even need this? Maybe not. (But to be honest, I don't understand why a Linux user would be ok with their system sorting 1 before !1, as that makes absolutely no sense when !a is sorted before a, but ok - EDIT: This is what Thunar does, ls | sort (even without -n properly sorts it), so there is some code somewhere that does it on Linux..).
I then found this: https://psycodedeveloper.wordpress.com/2013/04/12/c-numeric-sorting-revisited/
It's C# and I cannot compile it, but it claims to do it like the real Windows function. I wrote a comment asking about the cases above several days ago, but it seems the blog is dead - no answer.
In case anyone could compile it and test it, that would be nice.
Sorting right now mostly works, except for one big issue with <specialChar&Letter> vs <specialChar&Number>. It would be great if that could somehow be fixed for the numbers that are currently sorted incorrectly.
from natsort.
Sorting right now mostly works, except for one big issue with <specialChar&Letter> vs <specialChar&Number>.
Have you tried sorting with ns.SIGNED? Does that get closer to the correct results?
With respect to the Wordpress article: it looks like they are using a comparator called StringComparison.InvariantCultureIgnoreCase which I'm pretty sure ends up using the same code internally as StrCmpLogicalW (https://docs.microsoft.com/en-us/dotnet/standard/globalization-localization/performing-culture-insensitive-string-comparisons). So I don't think it is actually cross-platform.
from natsort.
Have you tried sorting with ns.SIGNED? Does that get closer to the correct results?
That won't help.
In the meantime, while looking to solve some other unrelated problem I found this:
https://docs.python.org/3/library/pathlib.html#general-properties
Paths of a same flavour are comparable and orderable
Amazing, Python comes with this built-in, and it works:
from pathlib import PureWindowsPath
print( PureWindowsPath('C:\\!1') < PureWindowsPath('C:\\1') )
True
But wait! It also has a problem:
from pathlib import PureWindowsPath
print( PureWindowsPath('C:\\15') < PureWindowsPath('C:\\100') )
False
So while it actually perfectly sorts stuff like the problem mentioned before, it now fails at natural sorting :(
I will open a bug report, since this function clearly does not produce the correct order for Windows despite the name.
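The two results above are consistent with PureWindowsPath comparing path components as case-folded strings, character by character, with no number handling. The following minimal sketch shows that behaviour next to a simple natural key; natural_key is a hypothetical helper written for this illustration, not a pathlib or natsort function.

```python
import re
from pathlib import PureWindowsPath


def natural_key(name):
    """Split a name into text and integer chunks so 15 sorts before 100."""
    chunks = re.split(r"(\d+)", name)
    # With one capture group, odd positions hold the digit runs.
    return [int(c) if i % 2 else c.lower() for i, c in enumerate(chunks)]


paths = [PureWindowsPath("C:\\100"), PureWindowsPath("C:\\15")]
plain = sorted(paths)
natural = sorted(paths, key=lambda p: natural_key(p.name))
print([p.name for p in plain])    # ['100', '15']
print([p.name for p in natural])  # ['15', '100']
```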
from natsort.
I will open a bug report, since this function clearly does not produce the correct order for Windows despite the name.
It's your decision to file a bug report on this if you want (it's unrelated to natsort), but I'm not so sure it's a bug. The documentation never states that when sorted it will match Windows Explorer behavior. If I were the Python devs and had this labeled as a "bug" and not as a "feature request" I would be a bit annoyed since it is never advertised to work that way.
from natsort.
Good to hear.
How will you handle errors?
Either introduce a new error that is raised on the wrong OS, or simply catch the error internally in winsorted and use the "old" way on Linux. I think the second option is better, since sorting will then work fine on Linux and won't crash code that someone wrote for Windows but ran on Linux. The people using the library also would not have to write their own exception handling all the time or switch on the OS themselves.
from natsort.
That's an even more elegant solution. I like it.
from natsort.
seems like one lambda is all you need: os_sorted ... None == don't use it. but yeah, a key/callback type thing makes sense
from natsort.
@ganego @earonesty Check out #123, which is the PR for this feature. I plan to merge sometime tomorrow. I would be open to some feedback, especially the documentation or the tests.
from natsort.
Just looked over the code but did not run any tests. Seems you used my list for the test, so if it gives the correct results I guess it works. Doc also looks ok. When will you publish a new version?
from natsort.
I am trying to see if I can solve #122. If I cannot by tonight, I think I will release tonight or tomorrow.
from natsort.
StrCmpLogicalW
I wonder if WINE's version is accurate.
from natsort.