<div class="highlight highlight-source-shell notranslate position-relative overflow-auto" dir="auto"

Thanks <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-u

Shell-quoting garbled after parsing UTF-8 characters,about ksh93/ksh

McDutchie commented on August 18, 2024 1

I've localised the problem to this part of sh_fmtq():

Lines 360 to 381 in 1ca9286

    
           	for(;c;c= mbchar(cp)) 
        
           	{ 
        
           #if SHOPT_MULTIBYTE 
        
           		if(c=='\'' || c>=128 || c<0 || !iswprint(c))  
        
           #else 
        
           		if(c=='\'' || !isprint(c)) 
        
           #endif /* SHOPT_MULTIBYTE */ 
        
           			state = 2; 
        
           		else if(c==']' || c=='=' || (c!=':' && c<=0x7f && (c=sh_lexstates[ST_NORM][c]) && c!=S_EPAT)) 
        
           			state |=1; 
        
           	} 
        
           	if(state<2) 
        
           	{ 
        
           		if(state==1) 
        
           			stakputc('\''); 
        
           		if(c = --cp - string) 
        
           			stakwrite(string,c); 
        
           		if(state==1) 
        
           			stakputc('\''); 
        
           	} 
        
           	else 
        
           	{

(note: in the following else clause it does $'...' quoting which does not have a bug).

If we insert the following debug command on line 373 and recompile:

errormsg(SH_DICT,ERROR_warn(0),"[DEBUG] length == %d", cp - string - 1);

then the output of the test script above becomes:

test.ksh[2]: printf: warning: [DEBUG] length == 25
'Is shell-quoting garbled?'
test.ksh[4]: printf: warning: [DEBUG] length == 26
'Is shell-quoting garbled?

Note that the garbled string has its length counted wrong: one character more, not less. This means that stakwrite(string,c); on line 376 writes one byte too many, which is the C string's terminating zero. This explains why the final quote doesn't appear; it is dutifully added, but after the erroneous zero.

(later edit: if a longer UTF-8 character is used, e.g. the four-byte '🇪🇺', then the wrong length in the second debug output correspondingly increases from 26 to 28 -- three bytes too many.)

So there is some problem with multibyte character processing. My working hypothesis is that the read -n1 g command in the test script, which triggers the bug and which reads one byte (not one character), leaves some kind of inconsistent multibyte character processing state, as reading one byte is reading half of that multibyte character.

First suspect is the mbchar(cp) call on line 360. Evidently this returns the current character and moves the character pointer (cp) to the next character in a multibyte-compatible way.

mbchar() looks like a function, but is a macro. This macro and its friends are defined here:

ksh/src/lib/libast/include/ast.h

Lines 181 to 196 in 1ca9286

    
           #define mbmax()		(ast.mb_cur_max) 
        
           #define mberr()		(ast.tmp_int<0) 
        
           #define mbcoll()	(ast.mb_xfrm!=0) 
        
           #define mbwide()	(mbmax()>1) 
        
           #define mb2wc(w,p,n)	(*ast.mb_towc)(&w,(char*)p,n) 
        
           #define mbchar(p)	(mbwide()?((ast.tmp_int=(*ast.mb_towc)(&ast.tmp_wchar,(char*)(p),mbmax()))>0?((p+=ast.tmp_int),ast.tmp_wchar):(p+=ast.mb_sync+1,ast.tmp_int)):(*(unsigned char*)(p++))) 
        
           #define mbnchar(p,n)	(mbwide()?((ast.tmp_int=(*ast.mb_towc)(&ast.tmp_wchar,(char*)(p),n))>0?((p+=ast.tmp_int),ast.tmp_wchar):(p+=ast.mb_sync+1,ast.tmp_int)):(*(unsigned char*)(p++))) 
        
           #define mbinit()	(mbwide()?(*ast.mb_towc)((wchar_t*)0,(char*)0,mbmax()):0) 
        
           #define mbsize(p)	(mbwide()?(*ast.mb_len)((char*)(p),mbmax()):((p),1)) 
        
           #define mbnsize(p,n)	(mbwide()?(*ast.mb_len)((char*)(p),n):((p),1)) 
        
           #define mbconv(s,w)	(ast.mb_conv?(*ast.mb_conv)(s,w):((*(s)=(w)),1)) 
        
           #define mbwidth(w)	(ast.mb_width?(*ast.mb_width)(w):1) 
        
           #define mbxfrm(t,f,n)	(mbcoll()?(*ast.mb_xfrm)((char*)(t),(char*)(f),n):0) 
        
           #define mbalpha(w)	(ast.mb_alpha?(*ast.mb_alpha)(w):isalpha((w)&0xff))

It looks like the behaviour of mbchar() and others depends on the result of mbwide(), another macro which is defined as (mbmax()>1), and mbmax() is another macro, defined as (ast.mb_cur_max). Aha! That's a global state variable. mb_cur_max, could that mean "multibyte current maximum"? Could this be what keeps track of the length of the current multibyte character and is left nonzero by read -n1?

To put this to the test, insert ast.mb_cur_max = 0 on line 337 in string.c - right at the beginning of the sh_fmtq() function. Sure enough, after doing that and recompiling, the bug vanishes.

This is a hack, though, not a proper fix. The real bug is in the multibyte macros, not in the shellquoting algorithm. More digging to follow…

from ksh.

McDutchie commented on August 18, 2024 1

One of those ast.h macros that is catching my attention is the one on line 190:

#define mbinit()	(mbwide()?(*ast.mb_towc)((wchar_t*)0,(char*)0,mbmax()):0)

This macro's name suggests that is is meant to initialise the state of multibyte character processing. It is called in various places where multibyte characters are processed, but not in sh_fmtq().

However, unfortunately, putting a mbinit() instead of ast.mb_sync = 0 on line 337 in sh/string.c makes the bug reappear. But what good is that whole mbinit() thing then?

If we're currently in a multibyte locale (i.e.: mbwide() is true a.k.a. ast.mb_cur_max > 1), then mbinit() calls the function pointed to by ast.mb_towc with the first two parameters being null pointers and the third the current value of ast.mb_cur_max. But what function is it calling?

The answer is in the set_ctype discipline function quoted above, on line 2244 to be precise: in a UTF-8 locale, ast.mb_towc = utf8_mbtowc. (Not sure what wc is an acronym of... wordcode? wide characters?) Anyway, here is that utf8_mbtowc() function:

ksh/src/lib/libast/comp/setlocale.c

Lines 594 to 638 in 2624b29

    
           static int 
        
           utf8_mbtowc(wchar_t* wp, const char* str, size_t n) 
        
           { 
        
           	register unsigned char*	sp = (unsigned char*)str; 
        
           	register int		m; 
        
           	register int		i; 
        
           	register int		c; 
        
           	register wchar_t	w = 0; 
        
           	if (!sp || !n) 
        
           		return 0; 
        
           	if ((m = utf8tab[*sp]) > 0) 
        
           	{ 
        
           		if (m > n) 
        
           			return -1; 
        
           		if (wp) 
        
           		{ 
        
           			if (m == 1) 
        
           			{ 
        
           				*wp = *sp; 
        
           				return 1; 
        
           			} 
        
           			w = *sp & ((1<<(8-m))-1); 
        
           			for (i = m - 1; i > 0; i--) 
        
           			{ 
        
           				c = *++sp; 
        
           				if ((c&0xc0) != 0x80) 
        
           					goto invalid; 
        
           				w = (w<<6) | (c&0x3f); 
        
           			} 
        
           			if (!(utf8mask[m] & w) || w >= 0xd800 && (w <= 0xdfff || w >= 0xfffe && w <= 0xffff)) 
        
           				goto invalid; 
        
           			*wp = w; 
        
           		} 
        
           		return m; 
        
           	} 
        
           	if (!*sp) 
        
           		return 0; 
        
            invalid: 
        
           #ifdef EILSEQ 
        
           	errno = EILSEQ; 
        
           #endif 
        
           	ast.mb_sync = (const char*)sp - str; 
        
           	return -1; 
        
           }

So what does this function actually do if it's called by mbinit(), with the first two pointer parameters being null? The answer is in lines 597, 603 and 604: if sp, being equal to str, being the second parameter, is null, then it simply does nothing at all and returns zero.

Which means that, in UTF-8 locales, invoking mbinit() does nothing. It's a no-op.

Maybe this is the actual bug. This init routine should be resetting ast.mb_sync to zero, as set_ctype() does. If ast.mb_sync is erroneously non-zero, then you can sort of see in the mbchar() macro definition (which I still find very hard to understand) that too much gets added to the length. This is starting to make some kind of sense now.

from ksh.

hyenias commented on August 18, 2024 1

Understood. Thank you for all the hard work on ksh.

from ksh.

McDutchie commented on August 18, 2024

My line 337 ast.mb_cur_max = 0 hack in the previous message is wrong. It introduces another bug:

$ cat > test2.ksh
LANG=C.UTF-8
var=múltîbytè
printf 'length of var: %d\n' ${#var}
printf '%q\n' 'Is shell-quoting garbled?'
printf 'length of var: %d\n' ${#var}
print -nr $'\303\274' | read -n1 g
printf '%q\n' 'Is shell-quoting garbled?'
$ arch/*/bin/ksh test2.ksh
length of var: 9
'Is shell-quoting garbled?'
length of var: 12
'Is shell-quoting garbled?'

Of course the length of the variable should be 9 in both cases. This means that ast.mb_cur_max does not store any kind of temporary state and that setting it to zero actually disables multibyte processing altogether, which of course masks the shellquoting bug.

Grepping the code to find where ast.mb_cur_max is set points us to this set_ctype() function:

ksh/src/lib/libast/comp/setlocale.c

Lines 2218 to 2284 in 1ca9286

    
           /* 
        
            * called when LC_CTYPE initialized or changes 
        
            */ 
        
           static int 
        
           set_ctype(Lc_category_t* cp) 
        
           { 
        
           	ast.mb_sync = 0; 
        
           	ast.mb_alpha = (Isw_f)iswalpha; 
        
           #if AHA 
        
           	if ((ast.locale.set & (AST_LC_debug|AST_LC_setlocale)) && !(ast.locale.set & AST_LC_internal)) 
        
           		sfprintf(sfstderr, "locale setf %17s %16s\n", cp->name, locales[cp->internal]->name); 
        
           #endif 
        
           	if (locales[cp->internal]->flags & LC_debug) 
        
           	{ 
        
           		ast.mb_cur_max = DEBUG_MB_CUR_MAX; 
        
           		ast.mb_len = debug_mblen; 
        
           		ast.mb_towc = debug_mbtowc; 
        
           		ast.mb_width = debug_wcwidth; 
        
           		ast.mb_conv = debug_wctomb; 
        
           		ast.mb_alpha = debug_alpha; 
        
           	} 
        
           	else if ((locales[cp->internal]->flags & LC_utf8) && !(ast.locale.set & AST_LC_test)) 
        
           	{ 
        
           		ast.mb_cur_max = 6; 
        
           		ast.mb_len = utf8_mblen; 
        
           		ast.mb_towc = utf8_mbtowc; 
        
           		if ((locales[cp->internal]->flags & LC_local) || !(ast.mb_width = wcwidth)) 
        
           			ast.mb_width = utf8_wcwidth; 
        
           		ast.mb_conv = utf8_wctomb; 
        
           		ast.mb_alpha = utf8_alpha; 
        
           	} 
        
           	else if ((locales[cp->internal]->flags & LC_default) || (ast.mb_cur_max = MB_CUR_MAX) <= 1 || !(ast.mb_len = mblen) || !(ast.mb_towc = mbtowc)) 
        
           	{ 
        
           		ast.mb_cur_max = 1; 
        
           		ast.mb_len = 0; 
        
           		ast.mb_towc = 0; 
        
           		ast.mb_width = default_wcwidth; 
        
           		ast.mb_conv = 0; 
        
           	} 
        
           	else 
        
           	{ 
        
           		if (!(ast.mb_width = wcwidth)) 
        
           			ast.mb_width = default_wcwidth; 
        
           		ast.mb_conv = wctomb; 
        
           #ifdef mb_state 
        
           		{ 
        
           			/* 
        
           			 * check for sjis that translates unshifted 7 bit ascii! 
        
           			 */ 
        
           			char*	s; 
        
           			char	buf[2]; 
        
           			mbinit(); 
        
           			buf[1] = 0; 
        
           			*(s = buf) = '\\'; 
        
           			if (mbchar(s) != buf[0]) 
        
           			{ 
        
           				memcpy(mb_state, mb_state_zero, sizeof(mbstate_t)); 
        
           				ast.mb_towc = sjis_mbtowc; 
        
           			} 
        
           		} 
        
           #endif 
        
           	} 
        
           	return 0; 
        
           }

Interesting: this is the discipline function for the LC_CTYPE variable. So this initialises the character processing state depending on the locale that is being set. The first thing it does is set ast.mb_sync to 0. Could that be what we need? — Sure enough, setting ast.mb_sync = 0 instead on line 337 in sh/string.c appears to fix the shellquoting bug without disabling multibyte character processing:

$ arch/*/bin/ksh test2.ksh
length of var: 9
'Is shell-quoting garbled?'
length of var: 9
'Is shell-quoting garbled?'

This also makes some sense of how those mb* macros are defined, if you read them carefully (good luck).

So this may be a better hack, but it's still a hack. Stay tuned.

from ksh.

McDutchie commented on August 18, 2024

So, it looks like the following patch makes the bug reliably go away without causing regressions when running bin/shtests --utf8:

diff --git a/src/lib/libast/comp/setlocale.c b/src/lib/libast/comp/setlocale.c
index ba31fe9..65553f8 100644
--- a/src/lib/libast/comp/setlocale.c
+++ b/src/lib/libast/comp/setlocale.c
@@ -600,6 +600,8 @@ utf8_mbtowc(wchar_t* wp, const char* str, size_t n)
        register int            c;
        register wchar_t        w = 0;

+       if (!wp && !sp)
+               ast.mb_sync = 0;  /* assume call from mbinit() macro: reset global multibyte sync state */
        if (!sp || !n)
                return 0;
        if ((m = utf8tab[*sp]) > 0)

from ksh.

McDutchie commented on August 18, 2024

With the patch in the previous comment, it looks like sh_fmtq() doesn't even need a new mbinit() call/expansion for the bug to stay gone. So, apparently, this patch fixes one or more problems elsewhere in the code that left the global multibyte sync state inconsistent due to no-op mbinit() calls, which is a good sign.

I think I'll add that mbinit() call to sh_fmtq() anwyay, though: it will probably make it more bug-proof even if I can't trigger any bugs now. (Compare sh_substitute() in string.c lines 190-192, which does the same.)

from ksh.

hyenias commented on August 18, 2024

FYI: I do not experience the missing closing single quote using ksh93v-.

$ echo $KSH_VERSION
Version ABIJM 93v- 2014-12-24
$ LANG=C.UTF-8
$ printf '%q\n' 'Is shell-quoting garbled?'
'Is shell-quoting garbled?'
$ print -nr $'\303\274' | read -n1 g
$ printf '%q\n' 'Is shell-quoting garbled?'
'Is shell-quoting garbled?'
$

from ksh.

McDutchie commented on August 18, 2024

Thanks @hyenias. The 93v- beta completely overhauled the whole multibyte processing system in a new version of the libast API, so it wasn't possible to backport a fix, nor was it feasible to backport that entire overhauled system as it's tied in with lots of other changes in the code.

from ksh.

McDutchie commented on August 18, 2024

Just to add to the record, I'd forgotten I'd found another trigger of this bug back in 2017: att#13 (comment)

$ ksh -c 'i=$IFS; IFS=é; set : :; echo "$*"; IFS=$i; trap "echo end" EXIT; trap'
:?:
trap -- 'echo end EXIT
end
Note the missing quote after trap -- 'echo end. You get the same with the output of export -p, alias, etc.

I can happily confirm that this was fixed by 300cd19 as well (but of course IFS still doesn't actually handle multibyte characters).

from ksh.

Shell-quoting garbled after parsing UTF-8 characters about ksh HOT 9 CLOSED

Comments (9)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

	for(;c;c= mbchar(cp))
	{
	#if SHOPT_MULTIBYTE
	if(c=='\'' \|\| c>=128 \|\| c<0 \|\| !iswprint(c))
	#else
	if(c=='\'' \|\| !isprint(c))
	#endif /* SHOPT_MULTIBYTE */
	state = 2;
	else if(c==']' \|\| c=='=' \|\| (c!=':' && c<=0x7f && (c=sh_lexstates[ST_NORM][c]) && c!=S_EPAT))
	state \|=1;
	}
	if(state<2)
	{
	if(state==1)
	stakputc('\'');
	if(c = --cp - string)
	stakwrite(string,c);
	if(state==1)
	stakputc('\'');
	}
	else
	{

	#define mbmax() (ast.mb_cur_max)
	#define mberr() (ast.tmp_int<0)

	#define mbcoll() (ast.mb_xfrm!=0)
	#define mbwide() (mbmax()>1)

	#define mb2wc(w,p,n) (ast.mb_towc)(&w,(char)p,n)
	#define mbchar(p) (mbwide()?((ast.tmp_int=(ast.mb_towc)(&ast.tmp_wchar,(char)(p),mbmax()))>0?((p+=ast.tmp_int),ast.tmp_wchar):(p+=ast.mb_sync+1,ast.tmp_int)):((unsigned char)(p++)))
	#define mbnchar(p,n) (mbwide()?((ast.tmp_int=(ast.mb_towc)(&ast.tmp_wchar,(char)(p),n))>0?((p+=ast.tmp_int),ast.tmp_wchar):(p+=ast.mb_sync+1,ast.tmp_int)):((unsigned char)(p++)))
	#define mbinit() (mbwide()?(ast.mb_towc)((wchar_t)0,(char*)0,mbmax()):0)
	#define mbsize(p) (mbwide()?(ast.mb_len)((char)(p),mbmax()):((p),1))
	#define mbnsize(p,n) (mbwide()?(ast.mb_len)((char)(p),n):((p),1))
	#define mbconv(s,w) (ast.mb_conv?(ast.mb_conv)(s,w):(((s)=(w)),1))
	#define mbwidth(w) (ast.mb_width?(*ast.mb_width)(w):1)
	#define mbxfrm(t,f,n) (mbcoll()?(ast.mb_xfrm)((char)(t),(char*)(f),n):0)
	#define mbalpha(w) (ast.mb_alpha?(*ast.mb_alpha)(w):isalpha((w)&0xff))

	static int
	utf8_mbtowc(wchar_t* wp, const char* str, size_t n)
	{
	register unsigned char* sp = (unsigned char*)str;
	register int m;
	register int i;
	register int c;
	register wchar_t w = 0;

	if (!sp \|\| !n)
	return 0;
	if ((m = utf8tab[*sp]) > 0)
	{
	if (m > n)
	return -1;
	if (wp)
	{
	if (m == 1)
	{
	wp = sp;
	return 1;
	}
	w = *sp & ((1<<(8-m))-1);
	for (i = m - 1; i > 0; i--)
	{
	c = *++sp;
	if ((c&0xc0) != 0x80)
	goto invalid;
	w = (w<<6) \| (c&0x3f);
	}
	if (!(utf8mask[m] & w) \|\| w >= 0xd800 && (w <= 0xdfff \|\| w >= 0xfffe && w <= 0xffff))
	goto invalid;
	*wp = w;
	}
	return m;
	}
	if (!*sp)
	return 0;
	invalid:
	#ifdef EILSEQ
	errno = EILSEQ;
	#endif
	ast.mb_sync = (const char*)sp - str;
	return -1;
	}

	/*
	* called when LC_CTYPE initialized or changes
	*/

	static int
	set_ctype(Lc_category_t* cp)
	{
	ast.mb_sync = 0;
	ast.mb_alpha = (Isw_f)iswalpha;
	#if AHA
	if ((ast.locale.set & (AST_LC_debug\|AST_LC_setlocale)) && !(ast.locale.set & AST_LC_internal))
	sfprintf(sfstderr, "locale setf %17s %16s\n", cp->name, locales[cp->internal]->name);
	#endif
	if (locales[cp->internal]->flags & LC_debug)
	{
	ast.mb_cur_max = DEBUG_MB_CUR_MAX;
	ast.mb_len = debug_mblen;
	ast.mb_towc = debug_mbtowc;
	ast.mb_width = debug_wcwidth;
	ast.mb_conv = debug_wctomb;
	ast.mb_alpha = debug_alpha;
	}
	else if ((locales[cp->internal]->flags & LC_utf8) && !(ast.locale.set & AST_LC_test))
	{
	ast.mb_cur_max = 6;
	ast.mb_len = utf8_mblen;
	ast.mb_towc = utf8_mbtowc;
	if ((locales[cp->internal]->flags & LC_local) \|\| !(ast.mb_width = wcwidth))
	ast.mb_width = utf8_wcwidth;
	ast.mb_conv = utf8_wctomb;
	ast.mb_alpha = utf8_alpha;
	}
	else if ((locales[cp->internal]->flags & LC_default) \|\| (ast.mb_cur_max = MB_CUR_MAX) <= 1 \|\| !(ast.mb_len = mblen) \|\| !(ast.mb_towc = mbtowc))
	{
	ast.mb_cur_max = 1;
	ast.mb_len = 0;
	ast.mb_towc = 0;
	ast.mb_width = default_wcwidth;
	ast.mb_conv = 0;
	}
	else
	{
	if (!(ast.mb_width = wcwidth))
	ast.mb_width = default_wcwidth;
	ast.mb_conv = wctomb;
	#ifdef mb_state
	{
	/*
	* check for sjis that translates unshifted 7 bit ascii!
	*/

	char* s;
	char buf[2];

	mbinit();
	buf[1] = 0;
	*(s = buf) = '\\';
	if (mbchar(s) != buf[0])
	{
	memcpy(mb_state, mb_state_zero, sizeof(mbstate_t));
	ast.mb_towc = sjis_mbtowc;
	}
	}
	#endif
	}
	return 0;
	}