
vslavik commented on June 26, 2024

Just got hit by this bug too, when storing & analyzing a field with the value “As multiple of 𝜋” — that last character is U+1D70B MATHEMATICAL ITALIC SMALL PI. The noteworthy thing about that character — and the cause of the crash — is that it cannot be represented in UCS-2: it lies outside the BMP and can only be represented with a surrogate pair in UTF-16.

The code crashes on this line in StandardTokenizerImpl::getNextToken():

int32_t zzNext = zzTransL[zzRowMapL[zzState] + zzCMapL[zzInput]];

The value of zzInput at the time of the crash is 120587 == 0x1D70B, and it crashes, predictably enough, because the ZZ_CMAP array is only 65536 entries long.

I think it’s clear enough how this bug happened: the code was converted from the Java original. Java uses UTF-16 for its strings, so this situation can never happen there. Likewise, it cannot happen on Windows, where sizeof(wchar_t) == 2. On Unix and OS X, though, wchar_t is 32-bit and wstring is UTF-32/UCS-4, so what would only be representable with a surrogate pair in Java is a single wchar_t value, and the crash ensues.

As to how to fix this, I unfortunately don’t have a clue. The safest fix would be to simply use basic_string<char16_t> all over the place (breaking compatibility with existing code), or to use it in just this method — that would bring Lucene++ closer to the Java original (I’m not familiar enough with Lucene++ to understand what else might break with non-UTF-16 inputs).

Probably better would be to just patch up this method, perhaps using a fixed “character class” value for anything over 0xFFFF (the analysis won’t be particularly good then — but hey, so what). The trouble is that all of this is completely undocumented as far as I can tell, and I don’t know what the values in _ZZ_CMAP even mean.

Playing with it in a debugger, it seems that the value of zzCMapL[some_english_character] is 10. Perhaps it would be acceptable, at least as an interim solution, to just use that value for all non-BMP characters? After all, Lucene 3.0’s StandardTokenizer is explicitly for European languages… (So porting 3.1’s Unicode tokenizer would be another solution.)

I’m more than willing to work on fixing it (for Poedit, handling Unicode data is kind of important), but I could really use some pointers in the right direction(s)...

from luceneplusplus.
