juniversalchardet's Issues

build.xml javac source and target attrs unset

The distributed jar appears to be built with Java 1.6 (class file version
50.0). It would be convenient for a wider range of users (e.g. both 1.5 and
1.6 users) if the javac task in build.xml set source="1.5" and target="1.5",
which would make the jar usable by more people out of the box.

As distributed, the jar causes a compile failure on JDK 1.5.0_12:

> javac -cp .:juniversalchardet-1.0.2.jar TestDetector.java
TestDetector.java:1: cannot access org.mozilla.universalchardet.UniversalDetector
bad class file: juniversalchardet-1.0.2.jar(org/mozilla/universalchardet/UniversalDetector.class)
class file has wrong version 50.0, should be 49.0
Please remove or make sure it appears in the correct subdirectory of the classpath.
import org.mozilla.universalchardet.UniversalDetector;
                                    ^
1 error
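
For anyone who wants to confirm which Java release a given .class file targets, the version numbers in the compiler message (50.0 and 49.0) come straight from the class file header: a 4-byte magic number, then a 2-byte minor and a 2-byte major version. The following is a minimal sketch of reading that header. The class name ClassFileVersion and its methods are made up for illustration and are not part of juniversalchardet.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, not part of juniversalchardet.
final class ClassFileVersion {

    /** Returns the major version stored in the first 8 bytes of a class file. */
    static int majorVersion(byte[] header) {
        if (header.length < 8) {
            throw new IllegalArgumentException("need at least 8 header bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(header); // big-endian by default
        if (buf.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("not a class file");
        }
        buf.getShort();                 // minor version, ignored here
        return buf.getShort() & 0xFFFF; // major version as an unsigned value
    }

    /** Maps the two major versions mentioned in this issue to a Java release. */
    static String describe(int major) {
        switch (major) {
            case 49: return "Java 1.5";
            case 50: return "Java 1.6";
            default: return "major version " + major;
        }
    }
}
```

Passing the first 8 bytes of UniversalDetector.class from the 1.0.2 jar should, going by the error message above, give major version 50.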

Patch attached.

Original issue reported on code.google.com by [email protected] on 4 Feb 2008 at 11:55

Attachments:

Having this functionality in a stream could be useful

We store some metadata for the stream contents (like hashes), and we wanted to 
determine the encoding with it as well. I have therefore wrapped the 
UniversalDetector inside a stream to be able to do several actions in one step 
using nested streams.

Maybe it is useful to others:

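import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.mozilla.universalchardet.UniversalDetector;

/** Feeds every chunk read through the stream to a UniversalDetector. */
public class EncodingDetectorInputStream extends BufferedInputStream {

    private final UniversalDetector detector = new UniversalDetector(null);

    public EncodingDetectorInputStream(InputStream in) {
        super(in);
    }

    /** Returns the detected charset name, or null if none was detected. */
    public String getDetectedCharset() {
        return detector.getDetectedCharset();
    }

    // Only bulk reads are intercepted; single-byte read() calls bypass the detector.
    @Override
    public synchronized int read(byte[] b, int off, int len) throws IOException {
        final int nrOfBytesRead = super.read(b, off, len);
        if (!detector.isDone() && nrOfBytesRead > 0) {
            // The freshly read data starts at 'off', not at index 0.
            detector.handleData(b, off, nrOfBytesRead);
        }
        if (nrOfBytesRead == -1) {
            detector.dataEnd();
        }
        return nrOfBytesRead;
    }

}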

Original issue reported on code.google.com by [email protected] on 23 Jul 2013 at 3:32

Need to know the size of BOM

When a file starts with a Byte Order Mark, there needs to be a way to discard 
those bytes. The detected charset is not enough information, because the file 
may include a BOM or not.

The easy way would be a method indicating the number of bytes to skip.

What steps will reproduce the problem?
1. Run the universal detector on a file with a BOM, such as UTF-16LE
2. Open a reader using the detected charset
3. Observe the spurious first character
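
Until the detector exposes something like this, a caller can work out the number of bytes to skip by inspecting the start of the file. The sketch below is a workaround under stated assumptions: BomSniffer and bomLength are hypothetical names, not juniversalchardet API, and only the common UTF-8, UTF-16 and UTF-32 marks are checked.

```java
// Hypothetical helper, not part of juniversalchardet.
final class BomSniffer {

    /** Returns the length in bytes of a leading BOM, or 0 if there is none. */
    static int bomLength(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return 3;       // UTF-8
        // UTF-32 must be tested before UTF-16: FF FE 00 00 also starts with FF FE.
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return 4; // UTF-32LE
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return 4; // UTF-32BE
        if (startsWith(head, 0xFF, 0xFE)) return 2;             // UTF-16LE
        if (startsWith(head, 0xFE, 0xFF)) return 2;             // UTF-16BE
        return 0;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

One caveat: a UTF-16LE file whose first character after the BOM is U+0000 is indistinguishable from UTF-32LE by this check, so the detected charset should still be taken into account.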

Original issue reported on code.google.com by marcus.downing on 29 Apr 2011 at 12:08

would you like to use Git?

In case you'd like to switch the repo to Git, I made a git-svn clone available 
at:

https://github.com/thkoch2001/juniversalchardet

You can just clone it and, for example, upload it to Google Code.

Regards, Thomas Koch

Original issue reported on code.google.com by [email protected] on 26 Aug 2012 at 6:11

GB18030 and Big5 probers have a bug

The GB2312, GB18030 and Big5 charsets are detected incorrectly.

So I looked at the Mozilla universalchardet code and found that
BIG5Prober.handleData has an error.

Please change "this.distributionAnalyzer.handleChar(buf, i - 1, charLen);"
to "this.distributionAnalyzer.handleOneChar(buf, i - 1, charLen);"

GB18030Prober.handleData has the same bug, and the same change fixes it.

Original issue reported on code.google.com by [email protected] on 9 Jul 2008 at 3:24

juniversalchardet-1.0.3_binary-dist-2008-06-23 always returns 'null' charset on some undetermined systems/jvm combinations, but works after rebuild

What steps will reproduce the problem?
    1. Try charset detection of a file, using the sample code from the homepage.

What is the expected output? What do you see instead?
    Expected: the detected charset (WINDOWS-1252, UTF-8)
    Instead: null

What version of the product are you using? On what operating system?
    Using the Jul 23, 2008 binary of juniversalchardet 1.0.3, SHA1=591d72211acc0b909b79c840e0b3ed9a0982d807
    Problem appeared on:
        a. A x64 Windows Server 2008 R2 server with Java 1.6.0_43
        b. A x64 Windows 7 workstation with Java 1.6.0_43
    Problem did not appear (detection worked flawlessly) on:
        c. Another x64 Windows 7 workstation with Java 1.6.0_43

Please provide any additional information below.
    In order to understand the issue, I ended up rebuilding the .jar with the debug=true <javac> option. As expected, that let me debug properly, but it also solved my problem: detection now worked on machines a and b! That seemed strange, so I rolled back my changes to build.xml, re-ran the compile and dist Ant tasks, and it still works.

--> On some system/JVM combinations, it seems the binary built on Jul 23,
2008 doesn't work and always returns null.

--> Being just a user of the library who barely understands the flow of the 
detection, I failed to understand what went wrong and where and cannot be more 
precise. Feel free to ask for trace information.

--> Maybe publishing a re-compiled version on the website would be a good idea? 
Mine is attached, compiled with Java 1.6.0_43 and Ant 1.9.0 on my machine 'a' 
(x64 Windows Server 2008 R2 server with Java 1.6.0_43).

Original issue reported on code.google.com by [email protected] on 15 Mar 2013 at 6:28

Attachments:

Fails to Detect UTF-8 without BOM

What steps will reproduce the problem?
1. Save a file in UTF-8 without BOM
2. Try to detect Character Encoding.

What is the expected output? What do you see instead?
I expect to see UTF-8 from the #getDetectedCharset() method. Instead I get null.

What version of the product are you using? On what operating system?
I am using juniversalchardet-1.0.3.jar on a Windows 7 System.


Please provide any additional information below.
When I use UTF-8 with a BOM, the file is detected just fine, but Java's UTF-8
decoder does not strip the BOM, so I get an unwanted character at the
beginning of the file. Therefore I have been using UTF-8 without a BOM.

Perhaps I am not feeding the detector enough data from the file I am reading?
I don't think that is the case, though, because I extended the file to 171390
characters and it made no difference.
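
As a possible fallback when the detector returns null, the bytes can be checked for strict UTF-8 validity with the JDK alone. This is a sketch of a workaround, not of how juniversalchardet works internally. Utf8Check and isValidUtf8 are hypothetical names. Note that pure ASCII input also passes, since ASCII is a subset of UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of juniversalchardet.
final class Utf8Check {

    /** Returns true if the whole array decodes as UTF-8 without any error. */
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes the decoder throw instead of substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

A caller could try the detector first and treat a null result plus isValidUtf8(...) == true as "probably UTF-8". That policy is a design choice on the caller's side, not something the library prescribes.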

Original issue reported on code.google.com by [email protected] on 30 Sep 2011 at 10:20

Attachments:

Universal detector fails to detect some charsets

What steps will reproduce the problem?
1. For each case, read the first line.
2. If it is an encoded word, decode it using either Base64 or quoted-printable.
3. Convert it to UTF-8.
4. Compare it with the second line of the case, which is the expected result.

What is the expected output? What do you see instead?
Please see the second line of each case in the attached file.

What version of the product are you using? On what operating system?
Red Hat Enterprise Linux 4

Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 18 Jun 2010 at 6:14

Attachments:

getConfidence()

In C++:
Source: CharDistribution.cpp
Method: float CharDistributionAnalysis::GetConfidence()
float r = mFreqChars / ((mTotalChars - mFreqChars) * mTypicalDistributionRatio);

In Java:
Source: CharDistributionAnalysis.java
Method: public float getConfidence()
float r = this.freqChars / (this.totalChars - this.freqChars) * this.typicalDistributionRatio;

The Java version is missing a pair of parentheses, so the quotient is
multiplied by the typical distribution ratio instead of being divided by it.
This may be a porting mistake.
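
To make the difference concrete, here is a small stand-alone sketch that evaluates both groupings on the same inputs. ConfidenceRatio and its method names are invented for this illustration. The parameters are plain floats, which is an assumption: if the real fields are integers, the Java expression would additionally perform integer division.

```java
// Illustration only; not the library's class.
final class ConfidenceRatio {

    /** Grouping as in the C++ line: freq / ((total - freq) * ratio). */
    static float grouped(float freqChars, float totalChars, float ratio) {
        return freqChars / ((totalChars - freqChars) * ratio);
    }

    /** Grouping as in the Java line: (freq / (total - freq)) * ratio. */
    static float ungrouped(float freqChars, float totalChars, float ratio) {
        return freqChars / (totalChars - freqChars) * ratio;
    }

    public static void main(String[] args) {
        // 30 frequent characters out of 40, typical distribution ratio 0.75
        System.out.println(grouped(30f, 40f, 0.75f));   // 30 / (10 * 0.75) = 4.0
        System.out.println(ungrouped(30f, 40f, 0.75f)); // (30 / 10) * 0.75 = 2.25
    }
}
```

The two expressions agree only when the ratio is 1, so any other ratio gives the Java port a different value of r than the C++ original.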

Original issue reported on code.google.com by [email protected] on 12 Sep 2012 at 10:21

IBM866 detected instead of utf-8

What steps will reproduce the problem?
1. Get Bangla characters encoded in UTF-8.
2. Try to detect the charset.
3. IBM866 is detected.

What is the expected output?
Bangla text

What do you see instead?
Garbled text that looks like Russian

What version of the product are you using? On what operating system?
juniversalchardet-1.0.3.jar on Red Hat

Please provide any additional information below.
I sent an email composed of the text below, omitting the charset.

Ovi মেইল ওয়েব 
ভাষা সমর�থন 
বিষয়ে 
বিজ�ঞপ�তি

প�রিয় Bretajohnson,

আগামী কয়েক 
সপ�তাহের 
ভিতরে Nokia তাদের Ovi 
মেইল 
ওয়েবসাইটের 
নত�ন দর�শন ও 
অভিজ�ঞতা 
সংযোজন করতে 
চলেছেন, কারণ 
সেটি Yahoo! পরিসেবা 
দ�বারা প�ষ�ট Ovi 
মেইল-� 
পরিবর�তিত হতে 
চলেছে।  �ই 
পরিবর�তনের 
ফলে সংয�ক�ত 
তাত�ক�ষণিক 
বার�তা 
(ইন�সট�যান�ট 
মেসেজিং, IM) সহ 
স�ট�রিমলাইন 
করা ওয়েব 
অভিজ�ঞতা ও 
অতিরিক�ত 
বৈশিষ�ট�য 
আপনি আপনার Ovi 
মেইল 
অ�যাকাউন�টে 
পেয়ে যাবেন।

আমরা যখন Ovi মেইল 
পরিসেবার 
সর�বমোট 
উন�নতিসাধন 
ঘটাব, আমরা তখন 
সেই নত�ন ওয়েব 
অভিজ�ঞতা শ�র� 
করার সময়ে Bengali 
ভাষায় ওয়েব 
সমর�থন করতে 
পারব না। �ই 
কার�য 
চলাকালীন 
আপনারা Ovi মেইল 
ওয়েবে ইংরেজী 
ভাষায় 
অ�যাক�সেস 
করতে পারবেন, 
যাতে আপনারা Ovi 
মেইল পরিসেবার 
নত�ন ক�ষমতা 
সমূহ ব�যবহার 
করতে পারেন। 
আমরা �ই 
পরিস�থিতির 
জন�য 
আন�তরিকভাবে 
দ�ঃখিত, আর আমরা 
আগামী কয়েক 
মাসে Bengali ভাষার 
সমর�থন 
প�নর�চাল� 
করার জনà§�à
 ¦¯ নিরনà§�তর কাজ করে চলেছি।

অন�গ�রহ করে 
মনে রাখবেন যে 
�ই ভাষা 
সমর�থনে 
পরিবর�তন 
কেবলমাত�র mail.ovi.com-� 
ওয়েবকেই 
প�রভাবিত করবে, 
আপনি যদি Nokia ফোন 
থেকে Ovi মেইল 
অ�যাক�সেস 
করেন তবে সেই 
প�রভাব �খানে 
কার�যকারী হবে 
না।

Ovi মেইল ব�যবহার 
করার জন�য আমরা 
আপনাদের 
আন�তরিক 
ধন�যবাদ জানাই!

ধন�যবাদান�তে,
Ovi by Nokia

আপনি আপনার 
ইনবক�সে 
আমাদের ই-মেইল 
গ�রহণ করে 
যাওয়া নিশ�চিত 
করতে (জাঙ�ক বা 
বাল�ক 
ফোল�ডারে নয়) 
আপনার 
যোগাযোগের 
তালিকা বা 
নিরাপদ 
প�রাপকের 
তালিকায় 
অন�গ�রহ করে  
[email protected]  যà§�কà§�ত 
কর�ন।

কপিরাইট 2011 Nokia. সব 
স�বত�ব 
সংরক�ষিত।  Nokia Inc, 102 
Corporate Park Drive, White Plains, NY 10604 Ovi 
http://ct.nokia.com/?593603168&FGOI0 | 
ব�যবহারের 
শর�তাবলি 
http://ct.nokia.com/?593603168&FGOI6 | 
গোপনীয়তার 
নীতি http://ct.nokia.com/?593603168&FGOI3


Original issue reported on code.google.com by [email protected] on 3 Aug 2011 at 4:28

Can't copy+paste and compile the example TestDetector

Can't copy+paste and compile the example TestDetector class on the project
home page because:
1. uses single quotes instead of double quotes in Strings
2. main() not declared to throw java.io.IOException (or other apropos
try/catches)
Might also be nice if it took the file to test with as a command line arg.
Also would be nice to have a link to download the source file, all just for
convenience.
Modified source attached.

Original issue reported on code.google.com by [email protected] on 5 Feb 2008 at 12:02

Attachments:

Detect from FileItem not working

What steps will reproduce the problem?
1. If I use a FileInputStream, detection works fine; if I use a FileItem, it
always detects MacCyrillic.
2. See the attached example file.
3. Here is the piece of code:
BufferedWriter clsWriter = new BufferedWriter ( new OutputStreamWriter ( 
clsFile.getOutputStream () ) );

        clsWriter
                .write ( "ÄÜÖßäöü,Name1ÄÜÖßäöü,Name2ÄÜÖßäöü,Name3ÄÜÖßäöü,StreetÄÜÖßäöü,MÄÜÖßäöü,DE,80080,München,ContactÄÜÖßäöü,+49(0)ÄÜÖßäöü,ÄÜÖßäöü@gls-itservices.com,CommentÄÜÖßäöü,+49,(0)98,765,432,BlÄÜÖßäöü" );

        clsWriter.close ();

        InputStream clsInput = clsFile.getInputStream ();
        byte[] buffer = new byte[ 1024 ];

        while ( true )
        {
            int n = clsInput.read ( buffer );

            if ( n <= 0 )
            {
                break;
            }

            detector.handleData ( buffer, 0, n );

        }

        detector.dataEnd ();

        clsInput.close ();

        String strEncoding = detector.getDetectedCharset ();

        System.out.println ( "encoding: " + strEncoding );


What is the expected output? What do you see instead?
I expect latin-1
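
One thing that may be worth ruling out, as a guess rather than a confirmed diagnosis: the snippet above creates its OutputStreamWriter without naming a charset, so the bytes that end up in the FileItem depend on the platform default encoding and are not necessarily Latin-1. The sketch below uses only the JDK to show how the same text produces different bytes depending on the charset passed to the writer. ExplicitCharsetWrite is a made-up name.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

// Illustration only; class and method names are hypothetical.
final class ExplicitCharsetWrite {

    /** Encodes text through an OutputStreamWriter that uses the given charset. */
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Naming the charset makes the output independent of the platform default.
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new IllegalStateException(e); // not expected for an in-memory stream
        }
        return out.toByteArray();
    }
}
```

For the four characters ÄÜÖß, ISO-8859-1 yields 4 bytes and UTF-8 yields 8, so the detector sees quite different input depending on which charset the writer happened to use.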

What version of the product are you using? On what operating system?
juniversalchardet-1.0.3.jar windowsxp

Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 23 Jul 2014 at 3:31

Attachments:

GB18030 false positive with WINDOWS-1252 data set

What steps will reproduce the problem?
1. Pass UniversalDetector a WINDOWS-1252 byte buffer containing a series of
degree symbols interleaved with letters and digits,
 e.g. {91, -80, 52, -80, 48, -80, 84, -80, 67, -80, 67, -80, 48, -80, 67, -80, 84}
2. Call UniversalDetector#getDetectedCharset(). It should return WINDOWS-1252,
but instead returns GB18030.
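
For readers who want to see what the byte buffer from step 1 looks like as text, the following sketch decodes it with the JDK's own charset support. It does not involve juniversalchardet, and the class name DegreeSample is invented here.

```java
import java.nio.charset.Charset;

// Illustration only; not the attached unit test.
final class DegreeSample {

    // The bytes quoted in step 1; -80 is 0xB0, the degree sign in WINDOWS-1252.
    static final byte[] SAMPLE = {
        91, -80, 52, -80, 48, -80, 84, -80, 67, -80, 67, -80, 48, -80, 67, -80, 84
    };

    /** Decodes the sample bytes with the named charset. */
    static String decode(String charsetName) {
        return new String(SAMPLE, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        System.out.println(decode("windows-1252")); // [°4°0°T°C°C°0°C°T
    }
}
```

Every second byte is 0xB0, so the buffer alternates between a high byte and an ASCII byte, which superficially resembles a stream of two-byte characters.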

See attached unit test for minimal reproduction test case.

What is the expected output? What do you see instead?
Expected output from UniversalDetector#getDetectedCharset() is "WINDOWS-1252",
but it returns "GB18030" instead.

What version of the product are you using? On what operating system?
I'm using version 1.0.3 on 64-bit Ubuntu 11.04 (Natty) with the default kernel 2.6.38-10-generic. The JDK I'm currently running is 1.6.0_23-x64.

Original issue reported on code.google.com by [email protected] on 13 Jul 2011 at 4:34

Request for enhancement: IBM850 detection

IBM850/437 codepages are not detected for German (and possibly other
Western European languages). Greek - IBM737/851/869 and Arabic 864 would be
nice too. 
Reason: no code. 


Original issue reported on code.google.com by [email protected] on 14 Sep 2007 at 11:09

Enhancement request: Allow for usage under the terms of the EPL

Your licensing terms read "The library is subject to the Mozilla Public License 
Version 1.1. Alternatively, the library may be used under the terms of either 
the GNU General Public License Version 2 or later, or the GNU Lesser General 
Public License 2.1 or later."

Please add the possible use of the library under the terms of the EPL 
(http://www.eclipse.org/legal/epl-v10.html, for use of juniversalchardet in 
Eclipse-based applications (RCP)), or state in some way that it is okay to 
distribute juniversalchardet under the LGPL together with components under the 
EPL, e.g., as a separate Plug-in.

Thanks a lot in advance!


Original issue reported on code.google.com by [email protected] on 27 Feb 2013 at 4:30

Incorrect encoding when the line contains two £ symbols followed by numbers

What steps will reproduce the problem?
1. Create a file with the following line:
Wykamol,£588.95,0.18,0.12,testingSpecialised Products for DIY and Professionals£12
(Any text containing two pound signs followed by numbers will do, like
Wykamol,£588.95£12)
2. Save the file as ANSI.

What is the expected output? What do you see instead?
Expected: Western European (Windows) or something similar. Instead, GB18030
is detected.

What version of the product are you using? On what operating system?
1.0.3

Please provide any additional information below.
I am not sure how the API is supposed to be used. I tried a simple file with a
few ANSI characters like "Find Encoding", and the API returned null as the
encoding.




Original issue reported on code.google.com by [email protected] on 12 Apr 2011 at 11:26
