Comments (21)

Jehan avatar Jehan commented on July 21, 2024

Oh there is actually another solution which I am planning to work on at some point: language hints.
I want to provide ways to hint a detector towards a list of languages: either a hard hint (the file owner says "that's definitely Italian", which makes encoding detection much easier, even for short texts), or soft hints (for instance, software could keep a list of languages commonly read by the user and give higher weight to these languages; this won't prevent detection of other languages and encodings, yet gives better confidence for the user's preferred languages, which are statistically more likely to appear again).
But that's all future wishes. I don't know when I'll be able to make the time for language hinting.
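
To make the soft-hint idea concrete, here is a minimal sketch in C. It is not uchardet code and not a planned API: `candidate_t`, `is_preferred`, `pick_with_soft_hints` and the `boost` factor are all hypothetical names invented for illustration. It assumes a detector that yields several (charset, language, confidence) candidates, and it simply multiplies the confidence of preferred languages before picking the best one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical detection candidate; not a uchardet type. */
typedef struct {
    const char *charset;    /* e.g. "UTF-8" */
    const char *language;   /* e.g. "it" */
    double confidence;      /* raw score in [0, 1] */
} candidate_t;

/* Returns 1 if lang appears in the list of preferred languages. */
int is_preferred(const char *lang, const char *const *preferred,
                 size_t n_preferred)
{
    for (size_t i = 0; i < n_preferred; i++) {
        if (strcmp(lang, preferred[i]) == 0)
            return 1;
    }
    return 0;
}

/* Soft hint: scale the confidence of preferred languages by boost (>= 1),
 * then return the index of the best candidate, or -1 if there is none.
 * Other languages are never excluded, only weighted lower. */
int pick_with_soft_hints(const candidate_t *cands, size_t n,
                         const char *const *preferred, size_t n_preferred,
                         double boost)
{
    int best = -1;
    double best_score = 0.0;

    assert(boost >= 1.0);
    for (size_t i = 0; i < n; i++) {
        double score = cands[i].confidence;

        if (is_preferred(cands[i].language, preferred, n_preferred))
            score *= boost;
        if (best < 0 || score > best_score) {
            best = (int)i;
            best_score = score;
        }
    }
    return best;
}
```

With two candidates, ISO-8859-1/"da" at 0.50 and UTF-8/"it" at 0.45, a preferred list containing "it" and a boost of 1.5 would select the UTF-8 candidate (0.675 against 0.50), while a strong enough result for a non-preferred language would still win.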

from uchardet.

Jehan avatar Jehan commented on July 21, 2024

Hi @cxjwin ! Thanks for the report.

As explained on the README, this repository is not the upstream anymore (only kept for historical reasons, I'd say). We are now hosted at Freedesktop. Could you post any further bug report at: https://bugs.freedesktop.org/enter_bug.cgi?product=uchardet ?

This being said, I'm not really sure what I am looking at. All the language models have been moved to src/LangModels/. What do these red files mean?
Or is it something related to the macOS build (I assume that's what is under build-mac/) and an output of whatever development GUI you use on this platform? If so, I know nothing about the platform, and I don't have a macOS machine or tools. But I will gladly accept any patch fixing whatever needs to be fixed there. :-)

If you have any patch to provide, please do so on the new bug tracker for the project at Freedesktop's.
Thanks!

marinofaggiana avatar marinofaggiana commented on July 21, 2024

Hi @Jehan, I have a problem compiling uchardet for iOS, can you help me?

Jehan avatar Jehan commented on July 21, 2024

Hello @marinofaggiana,

Maybe I can but I have no iOS machine so any help on my side can only be generic. I see a bunch of files under build-mac/ in our repository, but have no idea what they are and how it works (which is why I never touched these, therefore I am not surprised if something is broken).
So if you explain the problem to me in detail, with the error messages and any hints you have, maybe we can fix this together.
Ideally if you are able to fix and provide a patch, it is even better. ;-)

Finally, as explained in my previous comment, this is not the upstream repository anymore. It means that this is not the latest code of uchardet, and also that this is not the place to deal with bugs. I am answering here as an exception, but I won't do it every time.
Uchardet is now a Freedesktop project.

marinofaggiana avatar marinofaggiana commented on July 21, 2024

Thanks @Jehan, my issue is not with the build (I think): there are no errors, but build-mac/ is old, since the language models are now in a new directory (LangModels). The issue is with detection. For example, I have this text file with Italian words:

utf8.txt

I installed uchardet on my Mac OS X with brew and tested the file:

`
MacBook-Pro:000 marinofaggiana$ uchardet -v

uchardet Command Line Tool
Version 0.0.6

Authors: BYVoid, Jehan
Bug Report: https://bugs.freedesktop.org/enter_bug.cgi?product=uchardet

MacBook-Pro:000 marinofaggiana$
MacBook-Pro:000 marinofaggiana$
MacBook-Pro:000 marinofaggiana$
MacBook-Pro:000 marinofaggiana$ uchardet utf8.txt
UTF-8
MacBook-Pro:000 marinofaggiana$
`
Response: UTF-8, which is correct.

The issue is that on iOS, detection returns:
encodingName __NSCFString * @"ISO-8859-1" 0x00006080006209e0

... this is the issue ...

Jehan avatar Jehan commented on July 21, 2024

> The issue is that on iOS, detection returns:
> encodingName __NSCFString * @"ISO-8859-1" 0x00006080006209e0

I don't understand. Where does this come from? Are you saying that comes from uchardet too? Is "detect" a command of iOS maybe?
If this is the former, you'll have to tell me more (what is the difference between the 2 calls?). If this is the latter, then… well that's why uchardet exists (because most other tools make a lot of detection errors).

You'll have to give me a bunch more details for me to understand. :-)

marinofaggiana avatar marinofaggiana commented on July 21, 2024

OK, no, I used an Objective-C wrapper around the static library (.a):

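// Header names below are assumed; adjust them to the project layout.
#import <Foundation/Foundation.h>
#import "NCUchardet.h"
#import "uchardet.h"

@interface NCUchardet ()
{
   uchardet_t _detector;
}
@end

@implementation NCUchardet

// Shared instance: the same detector is reused for every detection.
+ (NCUchardet *)sharedNUCharDet {
    static NCUchardet *nuCharDet;
    @synchronized(self) {
        if (!nuCharDet) {
            nuCharDet = [NCUchardet new];
        }
        return nuCharDet;
    }
}

- (id)init
{
    self = [super init];
    
    if (self) {
        _detector = uchardet_new();
    }
    
    return self;
}

- (void)dealloc
{
    uchardet_delete(_detector);
}

// Feeds the raw bytes to uchardet and returns the detected charset name.
- (NSString *)encodingStringDetectWithData:(NSData *)data
{
    uchardet_handle_data(_detector, (const char *)[data bytes], [data length]);
    uchardet_data_end(_detector);
    
    // The charset name is owned by the detector, so copy it before resetting.
    const char *charset = uchardet_get_charset(_detector);
    NSString *encoding = [NSString stringWithCString:charset encoding:NSASCIIStringEncoding];
    
    // Clear the detector state so that the next call starts fresh.
    uchardet_reset(_detector);
    
    return encoding;
}

@end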

Jehan avatar Jehan commented on July 21, 2024

I am not an Objective-C expert (to say the least) but from what I read, it looks like it should work. I can think of 2 things. First, are you sure Objective-C does not re-encode the data before it reaches uchardet by any chance? I would try to dump the data and make sure it is byte for byte the same as it is in the file.

The second thing is that I see you reuse the same detector by keeping it around and running uchardet_reset(). Most use cases I have seen create a new detector every time, so who knows, maybe the barely used uchardet_reset() is broken. If that is the case and you detected the encoding of various files before, it may have interfered. Could you try to delete and recreate a new detector after every detection and see if it helps?
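
As a rough illustration of the "dump the data" check, here is a small sketch in plain C. The helper names `hex_dump` and `same_bytes` are made up for this example and are not part of uchardet; the idea is only to print the bytes that actually reach the detector and to compare them with the bytes read straight from the file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Writes len bytes as space-separated lowercase hex pairs into out.
 * Returns the number of characters written (excluding the NUL),
 * or -1 if out is too small. */
int hex_dump(const void *data, size_t len, char *out, size_t out_size)
{
    const unsigned char *bytes = data;
    size_t pos = 0;

    assert(out != NULL && out_size > 0);
    out[0] = '\0';
    for (size_t i = 0; i < len; i++) {
        /* Worst case per byte: separator, two hex digits and the NUL. */
        if (out_size - pos < 4)
            return -1;
        if (i > 0)
            out[pos++] = ' ';
        pos += (size_t)snprintf(out + pos, out_size - pos, "%02x", bytes[i]);
    }
    return (int)pos;
}

/* Returns 1 if both buffers hold exactly the same bytes, 0 otherwise. */
int same_bytes(const void *a, size_t a_len, const void *b, size_t b_len)
{
    return a_len == b_len && (a_len == 0 || memcmp(a, b, a_len) == 0);
}
```

For instance, the Italian word "città" encoded as UTF-8 dumps as `63 69 74 74 c3 a0`, whereas the same word in ISO-8859-1 would end with the single byte `e0`. If the dump taken inside the app differs from the dump of the file, something re-encoded the data on the way.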

marinofaggiana avatar marinofaggiana commented on July 21, 2024

I have removed the singleton, but this is not the issue.

First dump:

[Screenshot: schermata 2017-08-18 alle 12 11 12]

Jehan avatar Jehan commented on July 21, 2024

Could we see the data in bytes mode to make sure that's UTF-8? :-)
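
One way to check this programmatically, rather than by eye, is a well-formedness test on the raw bytes. The function below is a sketch written for this thread (the name `is_valid_utf8` is an assumption, not a uchardet function). Note that passing the check does not prove the text is UTF-8, since the same bytes can also be read as a single-byte encoding such as ISO-8859-1; it only rules out data that cannot be UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if data[0..len) is well-formed UTF-8, 0 otherwise.
 * Rejects truncated sequences, overlong forms, surrogates and
 * code points above U+10FFFF. */
int is_valid_utf8(const unsigned char *data, size_t len)
{
    size_t i = 0;

    assert(data != NULL || len == 0);
    while (i < len) {
        unsigned char b = data[i];
        size_t extra;       /* number of continuation bytes expected */
        unsigned long cp;   /* decoded code point */
        unsigned long min;  /* smallest code point allowed for this length */

        if (b < 0x80) {
            i++;            /* plain ASCII byte */
            continue;
        } else if ((b & 0xE0) == 0xC0) {
            extra = 1; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            extra = 2; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            extra = 3; cp = b & 0x07; min = 0x10000;
        } else {
            return 0;       /* stray continuation byte or invalid lead byte */
        }

        if (len - i <= extra)
            return 0;       /* sequence is cut off at the end of the data */
        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = data[i + k];

            if ((c & 0xC0) != 0x80)
                return 0;   /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
        i += extra + 1;
    }
    return 1;
}
```

The bytes `63 69 74 74 c3 a0` ("città" in UTF-8) pass this check, while `63 69 74 74 e0` (the same word in ISO-8859-1) fail it because `e0` announces a three-byte sequence that never arrives.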

marinofaggiana avatar marinofaggiana commented on July 21, 2024

[Screenshot: schermata 2017-08-18 alle 12 33 34]

marinofaggiana avatar marinofaggiana commented on July 21, 2024

[Screenshot: schermata 2017-08-18 alle 12 38 56]

Jehan avatar Jehan commented on July 21, 2024

OK, that looks like valid UTF-8. But I just understood the problem. It's not your program.

Actually I realize that the development code of uchardet detects this data as ISO-8859-1, which is wrong. Are you using the latest git code for your development while using the stable 0.0.6 release for the uchardet command-line tool, by any chance? :-)

marinofaggiana avatar marinofaggiana commented on July 21, 2024

Good question... I copied the code from:

https://cgit.freedesktop.org/uchardet/uchardet/

Is that 0.0.6?

Jehan avatar Jehan commented on July 21, 2024

Well, by default master is the development code. Keep the same code, but check out the commit for the v0.0.6 release:

git checkout v0.0.6

Then you'll have the code used for 0.0.6. Alternatively, use the snapshot in a tarball: https://www.freedesktop.org/software/uchardet/releases/ (but that should be the same code if you check out the right tag).

marinofaggiana avatar marinofaggiana commented on July 21, 2024

OK @Jehan, with the release from https://www.freedesktop.org/software/uchardet/releases/ the detection correctly returns UTF-8, thanks. In the future, if you want something tested on iOS, we are here with our project:

https://github.com/nextcloud/ios

Jehan avatar Jehan commented on July 21, 2024

Nice to know that you use uchardet in Nextcloud (only the iOS app?). I have a lot of stuff I'd like to discuss for Nextcloud (not related to character detection). Probably some time later. :-)

I opened a bug report related to the file you gave which is now detected as ISO-8859-1 because of a new language support. Though I'm not sure I have much of a solution for now. There are 2 problems here:

1/ With very short texts (like here, just 2 words), a system based on language statistics will be a lot less accurate. For longer texts (even just a few more words forming a complete sentence), the encoding detection becomes a lot more accurate (in particular, any slight confidence which currently makes the system believe the text may be in another language would likely disappear with more words).

2/ UTF-8 detection is not language aware currently. If it were and knew of Italian letter-usage statistics, this should definitely raise the confidence for UTF-8.

This second point is something I plan to work on someday. The first point is inherent to uchardet's algorithm (the smaller the input data, the harder it is to map results to generic language statistics).

Bug report: https://bugs.freedesktop.org/show_bug.cgi?id=102292

P.S.: the text was Italian, right?

marinofaggiana avatar marinofaggiana commented on July 21, 2024

> Nice to know that you use uchardet in Nextcloud (only the iOS app?). I have a lot of stuff I'd like to discuss for Nextcloud (not related to character detection). Probably some time later. :-)

Android too. We can discuss whenever you want :-)

> 1/ With very short texts (like here, just 2 words) ....

Yes, of course!

> P.S.: the text was Italian, right?

Yes, Italian.

marinofaggiana avatar marinofaggiana commented on July 21, 2024

A question @Jehan: in

const char *charset = uchardet_get_charset(_detector);

can uchardet_get_charset() return NULL, "" or nil?

Jehan avatar Jehan commented on July 21, 2024

It can return "" when no charset was found with high enough confidence. Otherwise it returns a charset name. It won't ever return NULL.
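
In practice this means a caller should treat the empty string as "unknown" before using the result. A minimal sketch, assuming the caller has some default encoding to fall back on (`charset_or_fallback` is a hypothetical helper name, not part of the uchardet API):

```c
#include <assert.h>
#include <stddef.h>

/* Returns detected if it names a charset, otherwise fallback.
 * detected is meant to be the string from uchardet_get_charset(),
 * which is "" when nothing was detected with enough confidence.
 * NULL is also treated as "unknown", purely as a defensive measure. */
const char *charset_or_fallback(const char *detected, const char *fallback)
{
    assert(fallback != NULL);
    if (detected == NULL || detected[0] == '\0')
        return fallback;
    return detected;
}
```

A caller would pass the detector result together with whatever default suits the application, for example `charset_or_fallback(charset, "UTF-8")`; the choice of default is application-specific and only an example here.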

marinofaggiana avatar marinofaggiana commented on July 21, 2024

Very well, thanks for your help !
