Comments (16)

rjshae commented on June 27, 2024

Would iconv be useful for this? At least for building a conversion dictionary.

from xoreos-tools.

DrMcCoy commented on June 27, 2024

We're already using iconv in https://github.com/xoreos/xoreos/blob/master/src/common/encoding.h / https://github.com/xoreos/xoreos/blob/master/src/common/encoding.cpp .

Converting isn't the problem; the issue is identification, which is not really 100% possible. Then there's the double-UTF-8. And if there are strings with multiple encodings, that's even more trouble.
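To make the double-UTF-8 case concrete, here is a minimal Python sketch (not xoreos code). It assumes the double encoding came from UTF-8 bytes being misread as Windows-1252 and then encoded as UTF-8 a second time; the thread does not say which intermediate encoding was involved, so that choice and the function name `undo_double_utf8` are assumptions.

```python
def undo_double_utf8(raw: bytes) -> bytes:
    """Try to reverse one round of double UTF-8 encoding.

    Assumption: the damage was "UTF-8 bytes read as CP1252, then
    re-encoded as UTF-8". If the bytes do not fit that pattern,
    they are returned unchanged.
    """
    try:
        text = raw.decode("utf-8")        # outer layer must be valid UTF-8
        inner = text.encode("cp1252")     # map each character back to one byte
        inner.decode("utf-8")             # inner layer must be valid UTF-8 too
    except (UnicodeDecodeError, UnicodeEncodeError):
        return raw                        # not double-encoded (as far as we can tell)
    return inner


# Build a double-encoded sample: "é" -> C3 A9 -> "Ã©" -> C3 83 C2 A9
good = "fiancé".encode("utf-8")
doubled = good.decode("cp1252").encode("utf-8")
repaired = undo_double_utf8(doubled)
```

Note that this cannot be fully reliable: a string that legitimately contains the text "Ã©" is indistinguishable from a double-encoded "é", which is exactly the identification problem described above.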

Also, how should we handle it in the xoreos-tools? Silently convert everything to UTF-8? Will that work for the original game?

rjshae commented on June 27, 2024

Yes, it doesn't sound like a completely automatable solution is available. iconv will at least output a 0xFF character if it doesn't match the encoding set, so that will catch some. Maybe the rest can be eye-balled and put in a conversion array/file, one per language, at least until a better solution is found? shrug

rjshae commented on June 27, 2024

There's a perl module that guesses at the encoding of a text string:
https://metacpan.org/source/DANKOGAI/Encode-2.98/lib/Encode/Guess.pm
https://perldoc.perl.org/Encode/Guess.html
Perhaps that could be useful for deriving some coding logic?

rjshae commented on June 27, 2024

This looks interesting:
https://github.com/neitanod/forceutf8/blob/master/src/ForceUTF8/Encoding.php

Perhaps it can be used as an identification tool?

rjshae commented on June 27, 2024

A thought occurred: it may be possible that neither NWN nor NWN2 is using those 3K fields. I searched the 2da files in NWN2 for a sample but found no matches. Ideally one could extract all the StrRef entries in both games and do a compare.

DrMcCoy commented on June 27, 2024

Yes, I'm pretty sure the game is not using some of the really broken ones.

This does help us in xoreos proper, but how are we going to handle it in our tlk2xml tool?

rjshae commented on June 27, 2024

"Converting isn't the problem; the issue is identification, which is not really 100% possible. Then there's the double-UTF-8. And if there are strings with multiple encodings, that's even more trouble."

If you believe that there are other instances of encoding issues, then perhaps a probability-based approach will work? Write a utility that can build a frequency count table of byte pairs. Encoding issues will presumably be outliers, so pass an argument specifying a cut-off count, with table entries at or below this argument being output as a conversion file. There's a lot of sample data to work with, so most of the remaining bad encoding patterns that haven't already been caught should be the (relatively) rare exceptions.

This tool makes a first pass through the file, building a count array of all double-byte patterns while ignoring whitespace and punctuation. The second pass can then build a draft conversion file, listing a hex data array of the 2+ byte combinations followed by the containing word as a comment. Thus:

0xC3, 0xA9, // fiancé (1 instance)

If the file isn't too noisy with false positives, we can manually peruse the exception list and throw out the ones that look okay. Hopefully the hand-massaged data file can then be used as the front end of a conversion table.
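A rough Python sketch of the two-pass idea, under a few assumptions of mine: the input is a list of raw byte strings already extracted from the TLK, only pairs containing at least one byte >= 0x80 are reported (pure-ASCII pairs can't be encoding problems), and the helper names are made up for illustration.

```python
import string
from collections import Counter

# Translation table that turns ASCII punctuation into spaces.
_PUNCT = string.punctuation.encode("ascii")
_TO_SPACE = bytes.maketrans(_PUNCT, b" " * len(_PUNCT))


def words(raw):
    """Split a byte string on ASCII whitespace and punctuation."""
    return raw.translate(_TO_SPACE).split()


def pairs(word):
    """Yield every overlapping two-byte window in a word."""
    return (word[i:i + 2] for i in range(len(word) - 1))


def count_pairs(strings):
    """First pass: frequency table of all byte pairs."""
    counts = Counter()
    for raw in strings:
        for word in words(raw):
            counts.update(pairs(word))
    return counts


def draft_conversion(strings, cutoff):
    """Second pass: list rare non-ASCII pairs with their containing word."""
    counts = count_pairs(strings)
    lines, seen = [], set()
    for raw in strings:
        for word in words(raw):
            for pair in pairs(word):
                rare = counts[pair] <= cutoff
                high = any(b >= 0x80 for b in pair)  # assumption: skip pure ASCII
                if rare and high and pair not in seen:
                    seen.add(pair)
                    hexes = ", ".join(f"0x{b:02X}" for b in pair)
                    shown = word.decode("utf-8", errors="replace")
                    n = counts[pair]
                    plural = "" if n == 1 else "s"
                    lines.append(f"{hexes}, // {shown} ({n} instance{plural})")
    return lines


sample = ["fiancé".encode("utf-8"), b"hello, world"]
draft = draft_conversion(sample, cutoff=1)
```

With a real TLK the cutoff would need tuning by hand, since legitimately rare accented characters will show up alongside the genuinely broken sequences.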

So... is this something like you have in mind? Or am I just plain misunderstanding?

"Also, how should we handle it in the xoreos-tools? Silently convert everything to UTF-8? Will that work for the original game?"

Are you contemplating building a modified TLK of all UTF-8 characters that can be used in the original game? We could try the output in the original game and see if the non-verbal text is intelligible. It should show up in the character build stage, such as when you read the class descriptions.

DrMcCoy commented on June 27, 2024

"Are you contemplating building a modified TLK of all UTF-8 characters that can be used in the original game?"

No.

What I'm saying is this: right now, we have two tools: tlk2xml, which converts a whole TLK into a user-editable XML, and xml2tlk, which takes such an XML and converts it back into a TLK. The catch is that the first tool has, of course, no knowledge of which strings the game does and doesn't use. It simply converts all of the strings. And it breaks on these broken strings.

How are we going to handle the NWN2 TLK, with those broken strings, in these tools, so that a modder can use them to take the NWN2 TLK, modify some strings, and recreate a working TLK out of it again?

Because right now, that use-case is broken. tlk2xml will take the NWN2 TLK and produce an XML containing garbage in some strings. After going through xml2tlk, you'll have a TLK file containing broken strings, and that's bad and dangerous.

rjshae commented on June 27, 2024

Okay. Well, I find all the file conversion and manipulation content you've developed here really cool, but unless some character combinations are going to break the engine, I'm not sure it's worth breaking a sweat over this little detail. All the gamer is going to (potentially) see are some bad strings in the game. If the mod builder cares, they'll edit their TLK. Otherwise the gamer will just keep on playing. :-)

You've listed a finite set of character issues to address. Why not just deal with those, both for the conversion and the reverse, then worry about future exceptions down the road? That'll limit the scope and make it do-able in the near term.

If you are worried about game crashes from text string combinations, then it seems like some testing is needed. shrug

rjshae commented on June 27, 2024

During testing of a new Journal class, I ran into what appears to be an error with strref #180942. This value is retrieved for the "construct" quest in the OC module.JRL file. The Journal routine used the getString call from the GFF3Struct class to get the "Text" field for the first entry for the quest, which is the above strref.

On running the game it generates the error: "WARNING: iconv() failed: Illegal byte sequence!" and returns the string "[!?!]". I checked the dialog.TLK row and it looked like ordinary text, at least in TLK EDIT.

Not much I can do about it at the moment.

Ed.: Now I see you have it listed above.

Mingun commented on June 27, 2024

"How are we going to handle the NWN2 TLK, with those broken strings, in these tools, so that a modder can use them to take the NWN2 TLK, modify some strings, and recreate a working TLK out of it again?"

tlk2xml can output invalid strings to XML in hex/base64 and mark it with some attribute, for example broken="true". xml2tlk must just interpret such strings as raw byte arrays (in hex/base64 due to XML limitations) and write it as is.
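A small Python sketch of how such a round trip could look. The `<string>` element name, the `id` attribute, and the choice of base64 over hex are placeholders for illustration and are not taken from the actual tlk2xml output format; "valid" here simply means "decodes as UTF-8", which is also an assumption.

```python
import base64
import xml.etree.ElementTree as ET


def string_to_element(strref, raw):
    """tlk2xml direction: emit text if it decodes, else base64 + broken="true"."""
    elem = ET.Element("string", id=str(strref))
    try:
        elem.text = raw.decode("utf-8")
    except UnicodeDecodeError:
        elem.set("broken", "true")
        elem.text = base64.b64encode(raw).decode("ascii")
    return elem


def element_to_bytes(elem):
    """xml2tlk direction: write broken strings back byte for byte."""
    text = elem.text or ""
    if elem.get("broken") == "true":
        return base64.b64decode(text)
    return text.encode("utf-8")


# Round trip a string with a stray 0xE2 through serialized XML.
raw = b"Wait\xe2 for it"
xml_bytes = ET.tostring(string_to_element(180942, raw))
restored = element_to_bytes(ET.fromstring(xml_bytes))
```

The point of the marker attribute is that untouched broken strings survive the round trip unchanged, while a modder who edits one can drop the attribute and supply clean text instead.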

DrMcCoy commented on June 27, 2024

"tlk2xml can output invalid strings to XML in hex/base64 and mark it with some attribute, for example broken="true"."

Yeah, that seems to be the solution that breaks the fewest things.

For Phaethon, if that ever gets a TLK editor, we can probably add a drop-down box and let the user override a misidentification.

"During testing of a new Journal class, I ran into what appears to be an error with strref #180942. This value is retrieved for the "construct" quest in the OC module.JRL file. The Journal routine used the getString call from the GFF3Struct class to get the "Text" field for the first entry for the quest, which is the above strref."

Hmm, so those strings are actually used in the game. I had hoped they weren't. :P

How are we going to handle that in xoreos, then, though?

How does the original game handle that in the first place? The string itself is probably just read and treated as a raw byte array. Only when the text is displayed does it select the correct character. But how does it know that the 0xE2 here is the start of a UTF-8 ellipsis and not an â?
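One possible per-string heuristic, sketched in Python. This is a guess at something that might work, not a description of what the original game does, and using Windows-1252 as the fallback is an assumption. It relies on the fact that a UTF-8 lead byte 0xE2 must be followed by two continuation bytes in the range 0x80 to 0xBF, whereas an "â" in ordinary Western text is usually followed by a plain ASCII letter.

```python
from typing import Tuple


def decode_guess(raw: bytes) -> Tuple[str, str]:
    """Decode as strict UTF-8 if possible, else fall back to CP1252.

    Returns the decoded text and the name of the encoding that was used.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # errors="replace" keeps going on the few bytes CP1252 leaves undefined
        return raw.decode("cp1252", errors="replace"), "cp1252"


ellipsis_case = decode_guess(b"Wait\xe2\x80\xa6")   # 0xE2 starts a 3-byte sequence
circumflex_case = decode_guess(b"ch\xe2teau")       # 0xE2 followed by ASCII 't'
```

This still fails on a string that mixes both encodings, since a single invalid byte pushes the whole string into the fallback.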

rjshae commented on June 27, 2024

Can we catch such iconv errors in the GFF3Struct::getString call so we can at least get a report on the ResRef value where it failed?

DrMcCoy commented on June 27, 2024

...There was a reason I made a failed encoding conversion not throw... but I can't remember anymore what that reason was :/

rjshae commented on June 27, 2024

A kludgy work-around then is to check the converted string within the getString call and print an informative warning message if it matches the error pattern?
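Roughly what I have in mind, as a language-neutral sketch in Python rather than actual xoreos C++. The "[!?!]" placeholder is the string reported earlier in this thread; the wrapper name, its parameters, and the warning text are all hypothetical.

```python
ERROR_PLACEHOLDER = "[!?!]"  # what a failed conversion returned in the report above


def get_string_checked(lookup, field, source, warn=print):
    """Call the real string lookup, and warn if the result looks like a failure.

    lookup: callable taking a field name and returning the converted string
            (stands in for something like GFF3Struct::getString)
    source: human-readable description of where the field lives
    """
    text = lookup(field)
    if ERROR_PLACEHOLDER in text:
        warn(f"Encoding conversion failed for field '{field}' in {source}")
    return text


# Example with a fake lookup that always fails.
messages = []
result = get_string_checked(lambda f: "[!?!]", "Text", "module.jrl", messages.append)
```

It is kludgy because a string that legitimately contains "[!?!]" would trigger a false warning, but it would at least point to the field where the conversion failed.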
