Re.: commit <a class="commit-link" data-hovercard-type="commit" data-hovercard-url="ht

Can not reproduce? <div class="snippet-clipboard-content notranslate position-rela

Perhaps rline should not call setlocale()</code

I have made some progress towards implementing a proper in-house <code class="notransl

Float parsing uses `strtod`, but rline calls `setlocale`,about kuroko-lang/kuroko

Comments (26)

klange commented on May 25, 2024 1

Can not reproduce?

>>> math.floor(0.1) == math.ceil(0.1)
 => False
>>> math.floor(0.9) == math.ceil(0.9)
 => False
>>> math.floor(0.1)
 => 0
>>> math.ceil(0.1)
 => 1
>>> math.ceil(0.9)
 => 1
>>> math.floor(0.9)
 => 0

These functions are simple wrappers around libm math functions and appear to be working as intended on the platforms I have tested on.

from kuroko.

klange commented on May 25, 2024 1

Using Kuroko with locales other than C is currently unsupported - strtod is used to parse floats.

Please confirm that you get the misparse of 0.5 as 0.0 in the repl.

from kuroko.

klange commented on May 25, 2024 1

Perhaps rline should not call setlocale() when it is not supported?

rline needs to run setlocale or wcwidth will return nonsensical values for the widths of characters, and it becomes difficult to type, eg., Japanese. I may modify rline to go through the steps of saving the current locale before it calls setlocale, and restoring it afterwards, which may resolve this.

There's something I don't understand: When rline.c explicitly sets $LC_ALL to C:

rline only does that on Windows. On other platforms, it calls setlocale with "", as specified in the manpage:

If locale is an empty string, "", each part of the locale that should be modified is set according to the environment variables.

rline, of course, does not care about things like numeric formatting, and was originally written for a platform that had no locale support...

from kuroko.

klange commented on May 25, 2024 1

Changing the title of this issue to reflect the underlying problem, so it can be more directly addressed.

Given that I want Kuroko to be usable in an embedded environment where someone else may have already been using setlocale, fixing rline to deal with locales in a re-entrant manner is only half of a fix... ideally, the float parsing needs to be in-housed. There's also the issue of printf formatting. In both cases, there is a brute-force option available: determine the radix symbol and replace as needed. This still leaves many edge cases in both directions, though.

I have a locale-oblivious strtod here: https://github.com/klange/toaruos/blob/master/libc/stdlib/strtod.c - unfortunately, this implementation is inaccurate, and crafting a good one is tricky due to floating point precision loss.

I don't have a complete in-house float printer - the one in ToaruOS is very incomplete.

from kuroko.

klange commented on May 25, 2024 1

Possibly.

Checking the radix symbol in a portable manner is slightly tricky, but a quick hack might be to do if (strtod("0.5",NULL) != 0.5) setlocale(LC_ALL, "C.UTF-8");, just in rline, and see where that gets us.

from kuroko.

klange commented on May 25, 2024 1

Not going to print a warning for now, especially since rline does its locale setting every time a new prompt is shown, but see if this helps for you: 0c8cb94

from kuroko.

klange commented on May 25, 2024 1

This misunderstands the relationship between rline and Kuroko. rline is not part of Kuroko, and it must independently ensure that it is operating in a viable locale to perform its function as an interactive line editor. What rline should do is restore the original locale from before it was called when resuming to the caller.

Meanwhile, as long as Kuroko relies on the functionality of strtod and other locale-dependent standard library functions, it should enforce the C locale. Ideally, Kuroko should not rely on these things and use its own implementations. However, as the main roadblock here is strtod and that is not a simple thing to implement well, this is not something that can be done in the short term.

from kuroko.

klange commented on May 25, 2024 1

My current thought for an improved kludge is to have rline only set the LC_CTYPE locale setting, and only if it determines the current locale does not think (wchar_t)0x3024 (あ) is 2 cells wide. Kuroko does not make use of functions that rely on the LC_CTYPE locale, only LC_NUMERIC.

Once Kuroko is free of locale-dependent functions on its own, the C locale functionality can be exposed which will allow the locale to be set when calling, eg., time.strftime (which currently suffers from a similar issue where rline's use of setlocale(LC_ALL, ...) means the repl operates differently).

Looking through the codebase, functions which may be locale-dependent include:

strtod, used in the compiler and string.__float__, as discussed here.
Calls to printf-family functions with %.16g to format floats (generally only in float.__repr__, though also in the debug output printing to "safely" print float objects, but that does not matter).
Possibly calls to atoi (used in the interactive debugger, and once to parse a command-line argument) and strtoul (used once in the compiler to parse hex escapes in strings, and once in str.format to parse argument position numbers in format strings), which may erroneously accept digit separators in some locales, though I don't think any of these cases is problematic and all locales are required to handle ASCII western Arabic numerals in the expected manner.

from kuroko.

klange commented on May 25, 2024 1

This is a complicated question. In order for any program to correctly handle this, its own wcwidth results must match the terminal's wcwidth results (or whatever the terminal uses to determine the widths of displayed text - which might not even be wcwidth!). If the terminal and the program disagree on locales, these results might differ. Worse yet, if the terminal and the program are running on completely different systems (eg., the program is running over ssh), then they may have completely different implementations of wcwidth even with the same locale, using different databases. Emoji are a common stumbling point here: earlier databases will claim their width is 1, while newer ones may say 2, and if the two sides disagree then things can get weird.

from kuroko.

klange commented on May 25, 2024 1

I have made some progress towards implementing a proper in-house strtod by making use of Kuroko's existing bigints: taking the decimal representation of the significant digits and dividing them by the appropriate power of 10, extracting enough bits to fill in a double. This work will likely make its way into the implementation of long.__truediv__, which currently cheats and may produce incorrect results as it converts both operands to doubles before dividing. It will take some more work to turn this into a strtod implementation suitable for Kuroko's use cases, but the end result should be a robust and accurate replacement that does not depend on locale settings.

from kuroko.

klange commented on May 25, 2024 1

The new long.__truediv__ implementation has been pushed, so at least now you can accurately write floats as the division of two large integers and that should always work regardless of any locale interventions. I'm not 100% certain the rounding behavior is correct, but it worked for some edge case examples, and supports subnormals. With this, a pure-Kuroko implementation of float parsing should be possible as well (though you'd need to pull in math to return appropriate nan and inf values), and the final implementations of the compiler's parser and str.__float__ will be in C anyway).

I also deployed the described change to rline, so the time module should no longer behave differently in the repl and will always produce C/English output (until we have a locale module to manipulate the locale settings).

from kuroko.

klange commented on May 25, 2024 1

I am pleased to note that in the in-progress in-house-float-conversions branch (https://github.com/kuroko-lang/kuroko/tree/in-house-float-conversions), we now have:

Accurately rounded long.__truediv__, which means double-precision floats can be accurately represented as fractions or other integer-based expressions.
A new float printer, with support for various formatting options, with no locale dependencies.
A new float parser, which makes use of the long.__truediv__ support, and has no locale dependencies.
Support for e exponents in float literals in the compiler.
Much faster implementations of long.__rshift__ and long.__lshift__.

There is still some work to do to clean up and verify all of this before merging it.

from kuroko.

klange commented on May 25, 2024 1

That branch has been merged, and Kuroko no longer uses strtod to parse floats, or printf to convert them to strings.

from kuroko.

kseistrup commented on May 25, 2024

I believe I have found the general area where the culprit is lurking, although I don't know what is going on, or how to fix it.

I have focused solely on ceil(0.5), which should always return 1.

If I launch kuroko with the environment variable $LC_ALL set to one of C, POSIX, en_US.UTF-8, en_GB.UTF-8, or en_CA.UTF-8 (i.e., env LC_ALL=x_Y.UTF-8 kuroko), ceil(0.5) will give the right result: 1.

If I launch kuroko with any of en_DK.UTF-8 (the default here), da_DK.UTF-8, nb_NO.UTF-8, nn_NO.UTF-8, or sv_SE.UTF-8, ceil(0.5) will give the wrong result: 0.

(These were the available locales on my boxen.)

My Python interpreter, as well as the S-Lang shell slsh, both of which are linked to libm, give the right result under all locales.

I'm gonna go out on a limb here, although it could be totally wrong:

At least one of the characteristics of the locales that give the right result is that they use . (FULL STOP) as decimal separator, whereas all the locales that give with the wrong result are using , (COMMA) as decimal separator.

Now, irregardless of the locale, kuroko does not accept ceil(0,5) as valid input (ArgumentError: ceil() expects one argument), but could some other part of kuroko become confused if the decimal separator is something else than a FULL STOP?

from kuroko.

klange commented on May 25, 2024

This should not normally be an issue when running scripts, but rline calls setlocale: https://github.com/kuroko-lang/kuroko/blob/master/src/vendor/rline.c#L2433

If you wish to use the repl with a locale other than C, please disable rline with the the -r option.

from kuroko.

kseistrup commented on May 25, 2024

Using Kuroko with locales other than C is currently unsupported

Well, that was an unfortunate turn of events…

If you wish to use the repl with a locale other than C, please disable rline with the the -r option.

Perhaps rline should not call setlocale() when it is not supported?

There's something I don't understand: When rline.c explicitly sets $LC_ALL to C:

    setlocale(LC_ALL, "C.UTF-8");

why is the locale of the environment a problem at all?

from kuroko.

kseistrup commented on May 25, 2024

PS: Oh, and yes: the REPL does interpret 0.5 as 0.0.

from kuroko.

kseistrup commented on May 25, 2024

rline only does that on Windows

Ah yes, I misread the #ifndef as #ifdef. My fault.

I'm not sure I'm prepared to give up rline, so I have mitigated the problem by aliasing kuroko to env LC_ALL=C.utf8 /usr/bin/kuroko $argv in fish (my interactive shell).

from kuroko.

kseistrup commented on May 25, 2024

I wonder if the issue could be mitigated for “everyone” in this way [until a more solid fix is in place]:

Do the Windows/non-Windows setlocale()-thing as it is now, then (in some kind of pseudo-C):

if (get_decimal_separator() != '.') {
  setlocale(LC_ALL, "C.UTF-8");
  warn_user_that_locale_has_been_changed();
}

from kuroko.

kseistrup commented on May 25, 2024

Thanks, that quick hack works great for me. 👍

Is it really necessary to set the locale for each and every time rline() has been called? Could “we” just set it upon first invocation and flag that the locale has been set? That way it's just a matter of checking a boolean variable.

Checking the radix symbol in a portable manner is slightly tricky

Could something like this be called portable (disclaimer, I don't speak C):

#include <locale.h>
#include <stdio.h>

int main() {
  (void) setlocale(LC_ALL, "");
  struct lconv *lc = localeconv();
  (void) printf("Decimal separator: %s\n", lc->decimal_point);
  return 0;
}

📎 radix.c

from kuroko.

kseistrup commented on May 25, 2024

E.g.:

#ifndef _WIN32
   static int locale_has_been_initialized = 0;
   if (locale_has_been_initialized == 0) {
      (void) setlocale(LC_ALL, "");
      struct lconv *lc = localeconv();
      if (strcmp(lc->decimal_point, ".") != 0) {
         (void) setlocale(LC_ALL, "C.UTF-8");
      }
      locale_has_been_initialized = 1;
   }
#else

from kuroko.

kseistrup commented on May 25, 2024

Come to think of it: kuroko should not behave differently in a script and in the REPL. The path of least surprise is to setlocale() at the top the main() function, before arguments are parsed and other routines are called.

Let me see if I can provide a sensible pull-request…

from kuroko.

kseistrup commented on May 25, 2024

Thanks, I get your point. Makes sense.

Meanwhile you could consider if the call to localeconv() wouldn't be a more portable, and certainly less kludgy, way of make sure that the decimal point is a full stop.

from kuroko.

kseistrup commented on May 25, 2024

You know your code infinitely better than I do, so let me ask you this naive question:

Would such a locale exist that considers HIRAGANA LETTER A (あ) “2 cells wide”, that would miscalculate e.g. LOTUS (🪷)?

I recently had a problem with the editor JED not being able to correctly edit a line containing the LOTUS character:

jedsoft/jed#9

Perhaps HIRAGANA LETTER A will be enough to asses a locale, but I thought I'd mention it anyway.

from kuroko.

kseistrup commented on May 25, 2024

Thank you for your patience, it's highly appreciated. 🙏

from kuroko.

kseistrup commented on May 25, 2024

Great news, thanks! 🙏

from kuroko.

Float parsing uses `strtod`, but rline calls `setlocale` about kuroko HOT 26 CLOSED

Comments (26)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent