Code Monkey home page Code Monkey logo

Comments (9)

McDutchie avatar McDutchie commented on August 18, 2024 1

I've localised the problem to this part of sh_fmtq():

for(;c;c= mbchar(cp))
{
#if SHOPT_MULTIBYTE
if(c=='\'' || c>=128 || c<0 || !iswprint(c))
#else
if(c=='\'' || !isprint(c))
#endif /* SHOPT_MULTIBYTE */
state = 2;
else if(c==']' || c=='=' || (c!=':' && c<=0x7f && (c=sh_lexstates[ST_NORM][c]) && c!=S_EPAT))
state |=1;
}
if(state<2)
{
if(state==1)
stakputc('\'');
if(c = --cp - string)
stakwrite(string,c);
if(state==1)
stakputc('\'');
}
else
{

(note: in the following else clause it does $'...' quoting which does not have a bug).

If we insert the following debug command on line 373 and recompile:

errormsg(SH_DICT,ERROR_warn(0),"[DEBUG] length == %d", cp - string - 1);

then the output of the test script above becomes:

test.ksh[2]: printf: warning: [DEBUG] length == 25
'Is shell-quoting garbled?'
test.ksh[4]: printf: warning: [DEBUG] length == 26
'Is shell-quoting garbled?

Note that the garbled string has its length counted wrong: one character more, not less. This means that stakwrite(string,c); on line 376 writes one byte too many, which is the C string's terminating zero. This explains why the final quote doesn't appear; it is dutifully added, but after the erroneous zero.

(later edit: if a longer UTF-8 character is used, e.g. the four-byte '🇪🇺', then the wrong length in the second debug output correspondingly increases from 26 to 28 -- three bytes too many.)

So there is some problem with multibyte character processing. My working hypothesis is that the read -n1 g command in the test script, which triggers the bug and which reads one byte (not one character), leaves some kind of inconsistent multibyte character processing state, as reading one byte is reading half of that multibyte character.

First suspect is the mbchar(cp) call on line 360. Evidently this returns the current character and moves the character pointer (cp) to the next character in a multibyte-compatible way.

mbchar() looks like a function, but is a macro. This macro and its friends are defined here:

#define mbmax() (ast.mb_cur_max)
#define mberr() (ast.tmp_int<0)
#define mbcoll() (ast.mb_xfrm!=0)
#define mbwide() (mbmax()>1)
#define mb2wc(w,p,n) (*ast.mb_towc)(&w,(char*)p,n)
#define mbchar(p) (mbwide()?((ast.tmp_int=(*ast.mb_towc)(&ast.tmp_wchar,(char*)(p),mbmax()))>0?((p+=ast.tmp_int),ast.tmp_wchar):(p+=ast.mb_sync+1,ast.tmp_int)):(*(unsigned char*)(p++)))
#define mbnchar(p,n) (mbwide()?((ast.tmp_int=(*ast.mb_towc)(&ast.tmp_wchar,(char*)(p),n))>0?((p+=ast.tmp_int),ast.tmp_wchar):(p+=ast.mb_sync+1,ast.tmp_int)):(*(unsigned char*)(p++)))
#define mbinit() (mbwide()?(*ast.mb_towc)((wchar_t*)0,(char*)0,mbmax()):0)
#define mbsize(p) (mbwide()?(*ast.mb_len)((char*)(p),mbmax()):((p),1))
#define mbnsize(p,n) (mbwide()?(*ast.mb_len)((char*)(p),n):((p),1))
#define mbconv(s,w) (ast.mb_conv?(*ast.mb_conv)(s,w):((*(s)=(w)),1))
#define mbwidth(w) (ast.mb_width?(*ast.mb_width)(w):1)
#define mbxfrm(t,f,n) (mbcoll()?(*ast.mb_xfrm)((char*)(t),(char*)(f),n):0)
#define mbalpha(w) (ast.mb_alpha?(*ast.mb_alpha)(w):isalpha((w)&0xff))

It looks like the behaviour of mbchar() and others depends on the result of mbwide(), another macro which is defined as (mbmax()>1), and mbmax() is another macro, defined as (ast.mb_cur_max). Aha! That's a global state variable. mb_cur_max, could that mean "multibyte current maximum"? Could this be what keeps track of the length of the current multibyte character and is left nonzero by read -n1?

To put this to the test, insert ast.mb_cur_max = 0 on line 337 in string.c - right at the beginning of the sh_fmtq() function. Sure enough, after doing that and recompiling, the bug vanishes.

This is a hack, though, not a proper fix. The real bug is in the multibyte macros, not in the shellquoting algorithm. More digging to follow…

from ksh.

McDutchie avatar McDutchie commented on August 18, 2024 1

One of those ast.h macros that is catching my attention is the one on line 190:

#define mbinit()	(mbwide()?(*ast.mb_towc)((wchar_t*)0,(char*)0,mbmax()):0) 

This macro's name suggests that is is meant to initialise the state of multibyte character processing. It is called in various places where multibyte characters are processed, but not in sh_fmtq().

However, unfortunately, putting a mbinit() instead of ast.mb_sync = 0 on line 337 in sh/string.c makes the bug reappear. But what good is that whole mbinit() thing then?

If we're currently in a multibyte locale (i.e.: mbwide() is true a.k.a. ast.mb_cur_max > 1), then mbinit() calls the function pointed to by ast.mb_towc with the first two parameters being null pointers and the third the current value of ast.mb_cur_max. But what function is it calling?

The answer is in the set_ctype discipline function quoted above, on line 2244 to be precise: in a UTF-8 locale, ast.mb_towc = utf8_mbtowc. (Not sure what wc is an acronym of... wordcode? wide characters?) Anyway, here is that utf8_mbtowc() function:

static int
utf8_mbtowc(wchar_t* wp, const char* str, size_t n)
{
register unsigned char* sp = (unsigned char*)str;
register int m;
register int i;
register int c;
register wchar_t w = 0;
if (!sp || !n)
return 0;
if ((m = utf8tab[*sp]) > 0)
{
if (m > n)
return -1;
if (wp)
{
if (m == 1)
{
*wp = *sp;
return 1;
}
w = *sp & ((1<<(8-m))-1);
for (i = m - 1; i > 0; i--)
{
c = *++sp;
if ((c&0xc0) != 0x80)
goto invalid;
w = (w<<6) | (c&0x3f);
}
if (!(utf8mask[m] & w) || w >= 0xd800 && (w <= 0xdfff || w >= 0xfffe && w <= 0xffff))
goto invalid;
*wp = w;
}
return m;
}
if (!*sp)
return 0;
invalid:
#ifdef EILSEQ
errno = EILSEQ;
#endif
ast.mb_sync = (const char*)sp - str;
return -1;
}

So what does this function actually do if it's called by mbinit(), with the first two pointer parameters being null? The answer is in lines 597, 603 and 604: if sp, being equal to str, being the second parameter, is null, then it simply does nothing at all and returns zero.

Which means that, in UTF-8 locales, invoking mbinit() does nothing. It's a no-op.

Maybe this is the actual bug. This init routine should be resetting ast.mb_sync to zero, as set_ctype() does. If ast.mb_sync is erroneously non-zero, then you can sort of see in the mbchar() macro definition (which I still find very hard to understand) that too much gets added to the length. This is starting to make some kind of sense now.

from ksh.

hyenias avatar hyenias commented on August 18, 2024 1

Understood. Thank you for all the hard work on ksh.

from ksh.

McDutchie avatar McDutchie commented on August 18, 2024

My line 337 ast.mb_cur_max = 0 hack in the previous message is wrong. It introduces another bug:

$ cat > test2.ksh
LANG=C.UTF-8
var=múltîbytè
printf 'length of var: %d\n' ${#var}
printf '%q\n' 'Is shell-quoting garbled?'
printf 'length of var: %d\n' ${#var}
print -nr $'\303\274' | read -n1 g
printf '%q\n' 'Is shell-quoting garbled?'
$ arch/*/bin/ksh test2.ksh
length of var: 9
'Is shell-quoting garbled?'
length of var: 12
'Is shell-quoting garbled?'

Of course the length of the variable should be 9 in both cases. This means that ast.mb_cur_max does not store any kind of temporary state and that setting it to zero actually disables multibyte processing altogether, which of course masks the shellquoting bug.

Grepping the code to find where ast.mb_cur_max is set points us to this set_ctype() function:

/*
* called when LC_CTYPE initialized or changes
*/
static int
set_ctype(Lc_category_t* cp)
{
ast.mb_sync = 0;
ast.mb_alpha = (Isw_f)iswalpha;
#if AHA
if ((ast.locale.set & (AST_LC_debug|AST_LC_setlocale)) && !(ast.locale.set & AST_LC_internal))
sfprintf(sfstderr, "locale setf %17s %16s\n", cp->name, locales[cp->internal]->name);
#endif
if (locales[cp->internal]->flags & LC_debug)
{
ast.mb_cur_max = DEBUG_MB_CUR_MAX;
ast.mb_len = debug_mblen;
ast.mb_towc = debug_mbtowc;
ast.mb_width = debug_wcwidth;
ast.mb_conv = debug_wctomb;
ast.mb_alpha = debug_alpha;
}
else if ((locales[cp->internal]->flags & LC_utf8) && !(ast.locale.set & AST_LC_test))
{
ast.mb_cur_max = 6;
ast.mb_len = utf8_mblen;
ast.mb_towc = utf8_mbtowc;
if ((locales[cp->internal]->flags & LC_local) || !(ast.mb_width = wcwidth))
ast.mb_width = utf8_wcwidth;
ast.mb_conv = utf8_wctomb;
ast.mb_alpha = utf8_alpha;
}
else if ((locales[cp->internal]->flags & LC_default) || (ast.mb_cur_max = MB_CUR_MAX) <= 1 || !(ast.mb_len = mblen) || !(ast.mb_towc = mbtowc))
{
ast.mb_cur_max = 1;
ast.mb_len = 0;
ast.mb_towc = 0;
ast.mb_width = default_wcwidth;
ast.mb_conv = 0;
}
else
{
if (!(ast.mb_width = wcwidth))
ast.mb_width = default_wcwidth;
ast.mb_conv = wctomb;
#ifdef mb_state
{
/*
* check for sjis that translates unshifted 7 bit ascii!
*/
char* s;
char buf[2];
mbinit();
buf[1] = 0;
*(s = buf) = '\\';
if (mbchar(s) != buf[0])
{
memcpy(mb_state, mb_state_zero, sizeof(mbstate_t));
ast.mb_towc = sjis_mbtowc;
}
}
#endif
}
return 0;
}

Interesting: this is the discipline function for the LC_CTYPE variable. So this initialises the character processing state depending on the locale that is being set. The first thing it does is set ast.mb_sync to 0. Could that be what we need? — Sure enough, setting ast.mb_sync = 0 instead on line 337 in sh/string.c appears to fix the shellquoting bug without disabling multibyte character processing:

$ arch/*/bin/ksh test2.ksh
length of var: 9
'Is shell-quoting garbled?'
length of var: 9
'Is shell-quoting garbled?'

This also makes some sense of how those mb* macros are defined, if you read them carefully (good luck).

So this may be a better hack, but it's still a hack. Stay tuned.

from ksh.

McDutchie avatar McDutchie commented on August 18, 2024

So, it looks like the following patch makes the bug reliably go away without causing regressions when running bin/shtests --utf8:

diff --git a/src/lib/libast/comp/setlocale.c b/src/lib/libast/comp/setlocale.c
index ba31fe9..65553f8 100644
--- a/src/lib/libast/comp/setlocale.c
+++ b/src/lib/libast/comp/setlocale.c
@@ -600,6 +600,8 @@ utf8_mbtowc(wchar_t* wp, const char* str, size_t n)
        register int            c;
        register wchar_t        w = 0;

+       if (!wp && !sp)
+               ast.mb_sync = 0;  /* assume call from mbinit() macro: reset global multibyte sync state */
        if (!sp || !n)
                return 0;
        if ((m = utf8tab[*sp]) > 0)

from ksh.

McDutchie avatar McDutchie commented on August 18, 2024

With the patch in the previous comment, it looks like sh_fmtq() doesn't even need a new mbinit() call/expansion for the bug to stay gone. So, apparently, this patch fixes one or more problems elsewhere in the code that left the global multibyte sync state inconsistent due to no-op mbinit() calls, which is a good sign.

I think I'll add that mbinit() call to sh_fmtq() anwyay, though: it will probably make it more bug-proof even if I can't trigger any bugs now. (Compare sh_substitute() in string.c lines 190-192, which does the same.)

from ksh.

hyenias avatar hyenias commented on August 18, 2024

FYI: I do not experience the missing closing single quote using ksh93v-.

$ echo $KSH_VERSION
Version ABIJM 93v- 2014-12-24
$ LANG=C.UTF-8
$ printf '%q\n' 'Is shell-quoting garbled?'
'Is shell-quoting garbled?'
$ print -nr $'\303\274' | read -n1 g
$ printf '%q\n' 'Is shell-quoting garbled?'
'Is shell-quoting garbled?'
$

from ksh.

McDutchie avatar McDutchie commented on August 18, 2024

Thanks @hyenias. The 93v- beta completely overhauled the whole multibyte processing system in a new version of the libast API, so it wasn't possible to backport a fix, nor was it feasible to backport that entire overhauled system as it's tied in with lots of other changes in the code.

from ksh.

McDutchie avatar McDutchie commented on August 18, 2024

Just to add to the record, I'd forgotten I'd found another trigger of this bug back in 2017: att#13 (comment)

$ ksh -c 'i=$IFS; IFS=é; set : :; echo "$*"; IFS=$i; trap "echo end" EXIT; trap'
:?:
trap -- 'echo end EXIT
end

Note the missing quote after trap -- 'echo end. You get the same with the output of export -p, alias, etc.

I can happily confirm that this was fixed by 300cd19 as well (but of course IFS still doesn't actually handle multibyte characters).

from ksh.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.