inquisitor's Introduction

Inquisitor

Encoding/end-of-line detection and external-format abstraction for Common Lisp.

The Library is a sphere whose exact center is any one of its hexagons and whose circumference is inaccessible. -- "The Library of Babel" by Jorge Luis Borges

Features

  • encoding/end-of-line name abstraction
  • encoding/end-of-line detection
  • external-format abstraction
    • make external-format for each implementation
    • make external-format from byte-array, stream and pathname (with auto-detection)
    • [TODO] abstract external-format of babel and flexi-streams
  • supports many implementations
    • GNU CLISP
    • Embeddable Common Lisp
    • Steel Bank Common Lisp
    • Clozure Common Lisp
    • Armed Bear Common Lisp
    • Allegro Common Lisp
    • LispWorks

Installation

Get and install via quicklisp:

CL-USER> (ql:quickload :inquisitor)

Usage

Encoding detection

To detect the encoding of a stream, use (inq:detect-encoding stream scheme). It returns an implementation-independent encoding name. For scheme, see Encoding scheme.

For example:

CL-USER> (with-open-file (in #P"t/data/unicode/utf-8.txt"
                          :direction :input
                          :element-type '(unsigned-byte 8))
           (inq:detect-encoding in :jp))
:UTF-8

You can see the list of available encodings:

CL-USER> inq:+available-encodings+
(:UTF-8 :UCS-2LE :UCS-2BE :UTF-16 :ISO-2022-JP :EUC-JP :CP932 :BIG5 :ISO-2022-TW
 :GB2312 :GB18030 :ISO-2022-CN :EUC-KR :JOHAB :ISO-2022-KR :ISO-8859-6 :CP1256
 :ISO-8859-7 :CP1253 :ISO-8859-8 :CP1255 :ISO-8859-9 :CP1254 :ISO-8859-5
 :KOI8-R :KOI8-U :CP866 :CP1251 :ISO-8859-2 :CP1250 :ISO-8859-13 :CP1257)

Encoding scheme

An encoding scheme is a hint that narrows down encoding detection.

It is practically impossible to detect an encoding universally, because different encodings can use the same byte sequence to represent different characters. Limiting the candidate encodings therefore makes detection more reliable.

Inquisitor uses languages to limit the candidates. A language here means, roughly, a writing system used somewhere in the world. Fixing the language fixes the set of possible characters, which makes encoding detection somewhat easier.
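
To make the ambiguity concrete, here is a minimal sketch in plain Common Lisp. It is not inquisitor's detection code: the two checkers below are hypothetical helpers that use simplified byte ranges for Shift_JIS and EUC-JP. They show that the same two octets are structurally valid in both encodings, so byte patterns alone cannot decide between them.

```lisp
;; Hypothetical helpers with simplified byte ranges; not part of inquisitor.

(defun valid-shift-jis-p (octets)
  "Return true if OCTETS is structurally valid Shift_JIS (simplified check)."
  (let ((i 0)
        (n (length octets)))
    (loop while (< i n)
          do (let ((b (aref octets i)))
               (cond
                 ;; ASCII, or a single-byte character in the half-width katakana range
                 ((or (<= b #x7F) (<= #xA1 b #xDF))
                  (incf i))
                 ;; lead byte followed by a valid trail byte
                 ((and (or (<= #x81 b #x9F) (<= #xE0 b #xFC))
                       (< (1+ i) n)
                       (let ((trail (aref octets (1+ i))))
                         (or (<= #x40 trail #x7E) (<= #x80 trail #xFC))))
                  (incf i 2))
                 (t (return-from valid-shift-jis-p nil)))))
    t))

(defun valid-euc-jp-p (octets)
  "Return true if OCTETS is structurally valid EUC-JP (simplified check)."
  (let ((i 0)
        (n (length octets)))
    (flet ((in-range-p (k lo hi)
             (and (< k n) (<= lo (aref octets k) hi))))
      (loop while (< i n)
            do (let ((b (aref octets i)))
                 (cond
                   ;; ASCII
                   ((<= b #x7F) (incf i))
                   ;; two-byte JIS X 0208 character
                   ((and (<= #xA1 b #xFE) (in-range-p (+ i 1) #xA1 #xFE))
                    (incf i 2))
                   ;; #x8E prefix: half-width katakana
                   ((and (= b #x8E) (in-range-p (+ i 1) #xA1 #xDF))
                    (incf i 2))
                   ;; #x8F prefix: three-byte JIS X 0212 character
                   ((and (= b #x8F)
                         (in-range-p (+ i 1) #xA1 #xFE)
                         (in-range-p (+ i 2) #xA1 #xFE))
                    (incf i 3))
                   (t (return-from valid-euc-jp-p nil)))))
      t)))

;; The octets A4 A2 pass both checks: one two-byte character in EUC-JP,
;; two single-byte characters in Shift_JIS.
(let ((octets #(#xA4 #xA2)))
  (list (valid-shift-jis-p octets) (valid-euc-jp-p octets)))
;; => (T T)
```

A scheme such as :jp does not remove this overlap, but it keeps the detector from also having to weigh Korean, Chinese or Cyrillic encodings whose byte ranges overlap in the same way.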

The supported schemes (languages) are as follows:

  • jp: Japanese
  • tw: Taiwanese
  • cn: Chinese
  • kr: Korean
  • ru: Russian (ISO-8859-5)
  • ar: Arabic (ISO-8859-6)
  • tr: Turkish (ISO-8859-9)
  • gr: Greek (ISO-8859-7)
  • hw: Hebrew (ISO-8859-8)
  • pl: Polish (ISO-8859-2)
  • bl: Baltic (ISO-8859-13)

End-of-line type detection

To find out the end-of-line (line break) type, use (inq:detect-end-of-line stream). It returns an implementation-independent end-of-line name.

CL-USER> (with-open-file (in "t/data/ascii/ascii-crlf.txt"
                             :direction :input
                             :element-type '(unsigned-byte 8))
           (inquisitor:detect-end-of-line in))

:CRLF
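
As a rough illustration of what end-of-line detection on raw octets involves, here is a minimal sketch in plain Common Lisp. The function guess-eol is a hypothetical name, not inquisitor's implementation, and it assumes an ASCII-compatible encoding in which CR is octet 13 and LF is octet 10. It reports the first line break it finds.

```lisp
;; Hypothetical sketch; assumes CR = 13 and LF = 10 as single octets.
(defun guess-eol (octets)
  "Return :CRLF, :CR or :LF for the first line break in OCTETS, or NIL if none."
  (loop for i from 0 below (length octets)
        for b = (aref octets i)
        do (cond ((= b 10) (return :lf))
                 ((= b 13)
                  ;; a CR immediately followed by LF is one CRLF break
                  (return (if (and (< (1+ i) (length octets))
                                   (= (aref octets (1+ i)) 10))
                              :crlf
                              :cr))))
        finally (return nil)))

(guess-eol #(97 98 13 10 99))
;; => :CRLF
```

This byte-level scan would not be valid for encodings such as UTF-16, where 10 and 13 can appear as half of another character.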

Implementation dependent/independent names

To get the implementation-dependent name of an encoding or eol type, use (inq:dependent-name independent-name). The returned value can be used as an external-format, or as part of one.

CL-USER> (inq:dependent-name :cp932)
#<ENCODING "CP932" :UNIX>  ; on CLISP
:WINDOWS-CP932  ; on ECL
:SHIFT_JIS  ; on SBCL
:WINDOWS-31J  ; on CCL
:|X-MS932_0213|  ; on ABCL

To get the implementation-independent name of an encoding or eol type, use (inq:independent-name dependent-name).
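
The idea of a two-way name mapping can be sketched with a small lookup table. This is only an illustration in plain Common Lisp: the table layout and the functions lookup-dependent and lookup-independent are hypothetical, and the entries are limited to the keyword names shown in the example above.

```lisp
;; Hypothetical table: independent name -> alist of (implementation . dependent name).
(defparameter *encoding-names*
  '((:cp932 (:sbcl . :shift_jis)
            (:ecl . :windows-cp932)
            (:ccl . :windows-31j))))

(defun lookup-dependent (independent implementation)
  "Return the name IMPLEMENTATION uses for INDEPENDENT, or NIL if unknown."
  (cdr (assoc implementation
              (cdr (assoc independent *encoding-names*)))))

(defun lookup-independent (dependent implementation)
  "Return the independent name that IMPLEMENTATION calls DEPENDENT, or NIL."
  (car (find-if (lambda (entry)
                  (eq (cdr (assoc implementation (cdr entry))) dependent))
                *encoding-names*)))

(lookup-dependent :cp932 :ccl)          ; => :WINDOWS-31J
(lookup-independent :shift_jis :sbcl)   ; => :CP932
```

Keeping both directions in one table means a single entry per encoding is enough to answer either question.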

Eol

To check whether eol is available on your implementation, use (inq:eol-available-p).

CL-USER> (inq:eol-available-p)
NIL  ; on SBCL

Make external-format

To make an external-format from implementation-independent names, use (inq:make-external-format enc eol).

The same code returns different values on SBCL and CCL.

On SBCL:

CL-USER> (let* ((file #P"t/data/ja/sjis.txt")
                (enc (inq:detect-encoding file :jp))
                (eol (inq:detect-end-of-line file)))
           (inq:make-external-format enc eol))
:SHIFT_JIS

On CCL:

CL-USER> (let* ((file #P"t/data/ja/sjis.txt")
                (enc (inq:detect-encoding file :jp))
                (eol (inq:detect-end-of-line file)))
           (inq:make-external-format enc eol))
#<EXTERNAL-FORMAT :WINDOWS-31J/:UNIX #x302001C574CD>

External-format detection

Inquisitor provides external-format detection. It detects the encoding and eol style, then makes an external-format from them. It works with vectors, byte streams and pathnames.

The examples below use CCL.

From vector:

CL-USER> (inq:detect-external-format
          (encode-string-to-octets "公的な捜索係、調査官がいる。
わたしは彼らが任務を遂行しているところを見た。")
          :jp)
#<EXTERNAL-FORMAT :UTF-8/:UNIX #x30200046719D>

From stream:

CL-USER> (with-open-file (in "t/data/unicode/utf-8.txt"
                             :direction :input
                             :element-type '(unsigned-byte 8))
           (inq:detect-external-format in :jp))
#<EXTERNAL-FORMAT :UTF-8/:UNIX #x30200046719D>

From pathname:

CL-USER> (inq:detect-external-format #P"t/data/unicode/utf-8.txt" :jp)
#<EXTERNAL-FORMAT :UTF-8/:UNIX #x30200046719D>

Author

Copyright

Copyright (c) 2000-2007 Shiro Kawai
Copyright (c) 2007 Masayuki Onjo
Copyright (c) 2011 zqwell
Copyright (c) 2015 Shinichi Tanaka

License

Licensed under the MIT License.

inquisitor's People

Contributors

cxxxr, defunkydrummer, snmsts, t-sin

inquisitor's Issues

Licensing

inquisitor is currently licensed under the MIT license, but is that correct? To settle this I need to do two things: survey and decide.

TODO

Broken on CCL

An error occurs when quickloading inquisitor.

CL-USER> (ql:quickload :inquisitor)
To load "inquisitor":
  Load 1 ASDF system:
    inquisitor
; Loading "inquisitor"
[package inquisitor.names]

Reader error on #<BASIC-FILE-CHARACTER-INPUT-STREAM ("/path/to/inquisitor/src/names.lisp"/8 UTF-8) #x3020013DCBED>, near position 7051, within "works) :cr)
    ((:c":
Dot context error.
   [Condition of type CCL::SIMPLE-READER-ERROR]

UTF-8 name

On master, UTF-8 is currently named :utf8 in inquisitor, but it seems it should be named :utf-8.

Reasons:

  • every implementation names UTF-8 :utf-8
  • the other Unicode character encodings are named like <name>-<suffix>

Impl dependent name => impl independent name

The problem is that inquisitor (like other libraries) has no way to get the implementation-independent name (an encoding name or eol style name, e.g. UTF-8 for an encoding, LF for an eol) from an implementation-dependent name.

For instance, this is needed when I want to know the encoding of *standard-input* and write a file with the same encoding.

guess-jp causes error

This code causes an error at inquisitor.encoding.guess::guess-jp:

(defparameter x
 #(140 246 147 73 130 200 145 123 141 245 140 87 129 65 146 178 141 184 138 175
    130 170 130 162 130 233 129 66 10 130 237 130 189 130 181 130 205 148 222 130
   231 130 170 148 67 150 177 130 240 144 139 141 115 130 181 130 196 130 162
   130 233 130 198 130 177 130 235 130 240 140 169 130 189 129 66))

(inquisitor:detect-external-format x :jp)

Shorter nicknames

I think inquisitor's package names are too long. Function and method names are long too, so code that uses inquisitor ends up wrapped (e.g. inquisitor:detect-external-format).

Providing shorter nicknames would make inquisitor more convenient.

  • inquisitor => inq
  • inquisitor.eol => inq.eol
    • or export the public APIs only from inq.

Encoding specifications

Specification documents are:

External-format abstraction

I think inquisitor must abstract over the external-formats of several libraries, e.g. flexi-streams and babel.

libraries

  • babel
  • flexi-streams

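End-of-line detection

  • a predicate that returns whether end-of-line codes are available
  • functions that return the keyword for each end-of-line code (like guess)
  • the detection procedure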

Providing roswell script

For convenience, provide some Roswell scripts.

  • inq.ros
    • inq.ros [scheme] [path] --- detect encoding from path
    • inq.ros [scheme] --- detect encoding from standard input

or

  • inq-[scheme].ros
    • command line params
      • inq-[scheme].ros [path] --- detect encoding from path
      • inq-[scheme].ros --- detect encoding from standard input
    • example: inq-jp.ros /path/to/file.txt => utf-8
  • inq-eol.ros [path] --- detect eol-style

Is detection from standard input needed?

Providing encoding/eol name?

Should inquisitor provide implementation-independent encoding/end-of-line names?
When you just want to know the encoding/eol name, getting the implementation's own name back is inconvenient.

Tests for names

On the refine-name-mapping branch, I changed the name mapping structure. Oh, it's a large change!

So it should be tested. The tests should cover:

  • +available-encodings+, +available-eols+
  • independent-name per implementation
  • dependent-name per implementation
  • unicode-p

Docstring

Docstrings bring some convenience. For example, the API reference on quickdocs includes them as documentation!

I think the right time to add docstrings is after rearranging the API (#22).

detect-end-of-line always returns nil

With the Common Lisp loop macro,

(loop for i from 1 to 3
      with j = i
      collect (list i j))

with j = i is evaluated only once at the start, not updated on each iteration,
so the result is ((1 1) (2 1) (3 1)) rather than ((1 1) (2 2) (3 3)).

Looking at the loop in detect-end-of-line:

(loop for n = (read-sequence vec stream) ; the buffer variable vec is updated here
      with eol = (eol-guess-from-vector vec) ; this sees the empty buffer, not the vec from the previous line, so eol is always nil
      until (or (zerop n)
                (not (null eol))) ; eol is always nil, so (not (null eol)) is always nil when the loop exits
      finally (return eol))

As a result, detect-end-of-line always returns nil.
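
The for form of the same pattern can be checked in isolation. The sketch below is plain Common Lisp and does not use inquisitor: first-eol and contains-lf are hypothetical stand-ins, and a list of octet vectors stands in for successive reads from a stream. Because eol is bound with for, it is re-evaluated on every iteration and the loop stops at the first non-nil guess.

```lisp
;; FOR re-evaluates J on every iteration.
(loop for i from 1 to 3
      for j = i
      collect (list i j))
;; => ((1 1) (2 2) (3 3))

;; Hypothetical stand-in for a guessing function such as eol-guess-from-vector.
(defun contains-lf (chunk)
  "Return :LF if the octet 10 occurs in CHUNK, otherwise NIL."
  (when (find 10 chunk) :lf))

;; Hypothetical stand-in for the detection loop; CHUNKS plays the role of
;; the buffers that successive READ-SEQUENCE calls would fill.
(defun first-eol (chunks guess)
  "Return the first non-NIL result of calling GUESS on each element of CHUNKS."
  (loop for chunk in chunks
        for eol = (funcall guess chunk) ; re-evaluated for each chunk
        until eol
        finally (return eol)))

(first-eol (list #(97 98) #(99 10)) #'contains-lf)
;; => :LF
```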

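Changing this loop's with to for should make it work correctly, but then the tests do not pass.
Looking at the part of the tests that fails, I see
× :LF is expected to be NIL
However, :LF should be the correct answer here.

Elsewhere there are also
:UCS-2LE is expected to be :UTF-16BE and
:BIG5 is expected to be :ISO-2022-TW,
and these also seem to be carried over from the detect-end-of-line result.

It looks as if the tests were changed to pass while ignoring the bug that always returns nil.
Was this feature inconvenient, so that it was deliberately disabled for the time being?

I would appreciate it if you could fix this when you have time.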

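With crlf line endings, a cr is left at the end of read-line's result on sbcl

Code that uses the library has to distinguish sbcl from the other implementations
and, on sbcl, strip the trailing character from the string returned by read-line.
I think this would be better handled on the inquisitor side.

The first approach I considered is to have inquisitor provide its own with-open-file and use gray-streams.
This would make everything work, but doing it just for crlf on sbcl is overkill and tedious to implement.

The second approach is to have inquisitor provide do-read-line together with with-open-file.

(with-input-file ((in external-format) path ...)
  (do-read-line ((string eof-p) in external-format)
    ...))

This hides the removal of the trailing cr behind the scenes.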
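
A minimal sketch of the second approach, in plain Common Lisp, might look like the following. It is not the proposed API: strip-trailing-cr and do-stripped-lines are hypothetical names, the macro takes only a line variable and a stream, and it ignores external-format entirely.

```lisp
;; Hypothetical helper: drop one trailing CR left over from a CRLF line ending.
(defun strip-trailing-cr (line)
  "Return LINE without its final character if that character is a carriage return."
  (let ((end (length line)))
    (if (and (plusp end)
             (char= (char line (1- end)) #\Return))
        (subseq line 0 (1- end))
        line)))

;; Hypothetical macro: iterate over the lines of STREAM, hiding the CR removal.
(defmacro do-stripped-lines ((line stream) &body body)
  "Run BODY once per line read from STREAM, with LINE bound to the cleaned line."
  (let ((raw (gensym "RAW"))
        (s (gensym "STREAM")))
    `(let ((,s ,stream))
       (loop for ,raw = (read-line ,s nil nil)
             while ,raw
             do (let ((,line (strip-trailing-cr ,raw)))
                  ,@body)))))

;; Usage: two CRLF-terminated lines read from a string stream.
(with-input-from-string (in (format nil "a~C~%b~C~%" #\Return #\Return))
  (let ((lines '()))
    (do-stripped-lines (line in)
      (push line lines))
    (nreverse lines)))
;; => ("a" "b")
```

Stripping unconditionally is harmless where the implementation already removes the cr, so the caller would not need to branch on sbcl.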

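The variable external-format is written explicitly because
code that uses the library needs it to preserve the character encoding when saving the file.
On sbcl the end-of-line code also has to be taken into account, so the second approach would need a do-print-line,
and the first approach would need the encoding and end-of-line stored in slots of the stream class it defines.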

Choosing a license

Decide on a license. Choose one that does not conflict with the license of the original Gauche code.

Rearranging the API

  • remove detecting-buffer-size (#19)
  • make detect-encoding and detect-end-of-line methods
    • so they can handle vectors, streams and pathnames, the same as detect-external-format
  • allow the buffer size to be supplied as &optional to detect-encoding and detect-end-of-line
    • do to detect- things
    • to check the whole file, supply :full
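
The method-based design in this list could be sketched roughly as follows. This is plain Common Lisp under stated assumptions, not inquisitor's API: detect-eol* and scan-eol are hypothetical names, the default buffer size of 4096 is arbitrary, the :full option is left out, and a CR that falls on the last octet of one buffer with its LF in the next would be reported as :CR.

```lisp
;; Hypothetical scanner: first line break in OCTETS before index END.
(defun scan-eol (octets end)
  "Return :CRLF, :CR, :LF or NIL for the first line break in OCTETS below END."
  (let ((pos (position-if (lambda (b) (or (= b 10) (= b 13)))
                          octets :end end)))
    (cond ((null pos) nil)
          ((= (aref octets pos) 10) :lf)
          ((and (< (1+ pos) end)
                (= (aref octets (1+ pos)) 10))
           :crlf)
          (t :cr))))

;; Hypothetical unified entry point dispatching on the input type.
(defgeneric detect-eol* (input &optional buffer-size)
  (:documentation "Guess the end-of-line type of INPUT."))

(defmethod detect-eol* ((input vector) &optional buffer-size)
  ;; a vector of octets is already fully in memory, so no buffer is needed
  (declare (ignore buffer-size))
  (scan-eol input (length input)))

(defmethod detect-eol* ((input stream) &optional (buffer-size 4096))
  ;; INPUT is assumed to be a binary stream of (unsigned-byte 8)
  (let ((buffer (make-array buffer-size :element-type '(unsigned-byte 8))))
    (loop for n = (read-sequence buffer input)
          for eol = (scan-eol buffer n)
          until (or eol (zerop n))
          finally (return eol))))

(defmethod detect-eol* ((input pathname) &optional (buffer-size 4096))
  ;; open the file as octets and delegate to the stream method
  (with-open-file (in input :element-type '(unsigned-byte 8))
    (detect-eol* in buffer-size)))

(detect-eol* #(97 98 13 10))
;; => :CRLF
```

With this shape, callers pass whatever they have (octets, an open byte stream, or a pathname) and the optional argument only matters for the cases that actually read through a buffer.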

ABCL's external-format

The documentation is not thorough at all.

system:available-encodings returns the list of available encodings.
I also found the definition of external-format in the source (definition).

Compile error on CLISP

It complains that only charset:iso8859-5 is missing. Does that mean the others are OK? Check the manual.

[package inquisitor.encoding.keyword]
*** - READ from
#<INPUT BUFFERED FILE-STREAM CHARACTER
#P"/home/travis/build/t-sin/inquisitor/src/encoding/keyword.lisp"
@307>
: # has no external symbol with name "ISO8859-5"

[https://travis-ci.org/t-sin/inquisitor/jobs/76869696]
