juliastrings / stringencodings.jl
String encoding conversion in Julia using iconv
License: Other
I'm having trouble getting the package to work on the latest master branch. I had WinRPM troubles, those just got resolved, so WinRPM should be working correctly.
Pkg.build("StringEncodings")
# a million deprecation warnings...
# ...
# ...
INFO: Packages to install: win_iconv-dll
WARNING: Base.Void is deprecated, use Nothing instead.
likely near C:\Users\tbeason\.julia\v0.7\StringEncodings\deps\build.jl:976
WARNING: Base.Void is deprecated, use Nothing instead.
likely near C:\Users\tbeason\.julia\v0.7\StringEncodings\deps\build.jl:976
WARNING: Base.Void is deprecated, use Nothing instead.
likely near C:\Users\tbeason\.julia\v0.7\StringEncodings\deps\build.jl:976
WARNING: Base.Void is deprecated, use Nothing instead.
likely near C:\Users\tbeason\.julia\v0.7\StringEncodings\deps\build.jl:976
┌ Warning: `info()` is deprecated, use `@info` instead.
│ caller = do_install(::WinRPM.Package) at WinRPM.jl:454
└ @ WinRPM WinRPM.jl:454
INFO: Downloading: win_iconv-dll
┌ Warning: `info()` is deprecated, use `@info` instead.
│ caller = do_install(::WinRPM.Package) at WinRPM.jl:465
└ @ WinRPM WinRPM.jl:465
INFO: Extracting: win_iconv-dll
┌ Warning: `open(cmd)` now returns only a Process<:IO object.
│ caller = next at deprecated.jl:200 [inlined]
└ @ Core deprecated.jl:200
ERROR: The system cannot find the file specified.
C:\Users\tbeason\.julia\v0.7\WinRPM\cache\2\noarch%2Fmingw64-win_iconv-dll-0.0.8-3.15.noarch.cpio
System ERROR:
The system cannot find the file specified.
7-Zip [64] 16.04 : Copyright (c) 1999-2016 Igor Pavlov : 2016-10-04
Scanning the drive for archives:
┌ Error: ------------------------------------------------------------
│ # Build failed for StringEncodings
│ exception =
│ LoadError: failed process: Process(`'C:\Users\tbeason\AppData\Local\Julia-0.7.0-DEV\bin\7z.exe' x -y 'C:\Users\tbeason\.julia\v0.7\WinRPM\cache\2\noarch%2Fmingw64-win_iconv-dll-0.0.8-3.15.noarch.cpio' '-oC:\Users\tbeason\.julia\v0.7\WinRPM\deps'`, ProcessExited(2)) [2]
│ Stacktrace:
│ [1] error(::String, ::Base.Process, ::String, ::Int64, ::String) at .\error.jl:42
│ [2] pipeline_error(::Base.Process) at .\process.jl:698
│ [3] do_install(::WinRPM.Package) at C:\Users\tbeason\.julia\v0.7\WinRPM\src\WinRPM.jl:483
│ [4] do_install at C:\Users\tbeason\.julia\v0.7\WinRPM\src\WinRPM.jl:445 [inlined]
│ [5] #install#21(::Bool, ::Function, ::WinRPM.Package) at C:\Users\tbeason\.julia\v0.7\WinRPM\src\WinRPM.jl:392
│ [6] #install at .\<missing>:0 [inlined]
│ [7] #install#19 at C:\Users\tbeason\.julia\v0.7\WinRPM\src\WinRPM.jl:361 [inlined]
│ [8] #install at .\<missing>:0 [inlined] (repeats 2 times)
│ [9] (::getfield(WinRPM, Symbol("##36#37")){WinRPM.RPM})() at C:\Users\tbeason\.julia\v0.7\WinRPM\src\winrpm_bindeps.jl:42
│ [10] run(::getfield(WinRPM, Symbol("##36#37")){WinRPM.RPM}) at C:\Users\tbeason\.julia\v0.7\BinDeps\src\BinDeps.jl:484
│ [11] run(::BinDeps.SynchronousStepCollection) at C:\Users\tbeason\.julia\v0.7\BinDeps\src\BinDeps.jl:527
│ [12] satisfy!(::BinDeps.LibraryDependency, ::Array{DataType,1}) at C:\Users\tbeason\.julia\v0.7\BinDeps\src\dependencies.jl:943
│ [13] satisfy!(::BinDeps.LibraryDependency) at C:\Users\tbeason\.julia\v0.7\BinDeps\src\dependencies.jl:921
│ [14] top-level scope at C:\Users\tbeason\.julia\v0.7\BinDeps\src\dependencies.jl:976
│ [15] include(::Module, ::String) at .\boot.jl:292
│ [16] include_relative(::Module, ::String) at .\loading.jl:1012
│ [17] include at .\sysimg.jl:26 [inlined]
│ [18] include(::String) at .\loading.jl:1046
│ [19] top-level scope
│ [20] eval at .\boot.jl:295 [inlined]
│ [21] eval at .\sysimg.jl:71 [inlined]
│ [22] evalfile(::String, ::Array{String,1}) at .\loading.jl:1041 (repeats 2 times)
│ [23] #2 at .\none:15 [inlined]
│ [24] cd(::getfield(, Symbol("##2#5")){String}, ::String) at .\file.jl:59
│ [25] (::getfield(, Symbol("##1#3")))(::IOStream) at .\none:14
│ [26] #open#318(::Base.Iterators.IndexValue{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::getfield(, Symbol("##1#3")), ::String, ::Vararg{String,N} where N) at .\iostream.jl:369
│ [27] open(::Function, ::String, ::String) at .\iostream.jl:367
│ [28] top-level scope
│ [29] eval at .\boot.jl:295 [inlined]
│ [30] eval(::Module, ::Expr) at .\sysimg.jl:71
│ [31] exec_options(::Base.JLOptions) at .\client.jl:309
│ [32] _start() at .\client.jl:447
│ in expression starting at C:\Users\tbeason\.julia\v0.7\StringEncodings\deps\build.jl:976
└ @ Main none:18
┌ Warning: ------------------------------------------------------------
│ # Build error summary
│
│ StringEncodings had build errors.
│
│ - packages with build errors remain installed in C:\Users\tbeason\.julia\v0.7
│ - build the package(s) and all dependencies with `Pkg.build("StringEncodings")`
│ - build a single package by running its `deps/build.jl` script
└ @ Pkg.Entry entry.jl:651
Looking at the directory it is searching:
λ ls -1 C:\Users\tbeason\.julia\v0.7\WinRPM\cache\2
mingw64-win_iconv-dll-0.0.8-3.15.noarch.cpio
noarch%2Fmingw64-win_iconv-dll-0.0.8-3.15.noarch.rpm
repodata%2F58a3da7b6a7a7cf1c71a355252e0d9db1aab60162e9017b2235f5fa7a118660f-primary.xml
repodata%2Frepomd.xml
The comment says that `flush` returns the number of bytes written to the output buffer, but it returns the encoder. Is the comment outdated?
StringEncodings.jl/src/StringEncodings.jl
Lines 219 to 234 in 5ad92b7
Hi,
We are working a lot with @view/SubArray on top of a larger buffer. Currently the decode() function requires a Vector{UInt8} argument. For us, this means we need to copy the data in an inner, performance-optimized loop. If there is no specific reason for this restriction, would it be possible to change the interface to support AbstractVector{UInt8}?
thanks a lot
Warning in Julia v0.6.1:
julia> Pkg.build("StringEncodings")
INFO: Building StringEncodings
WARNING: BinDeps.shlib_ext is deprecated.
likely near /home/ec2-user/.julia/v0.6/StringEncodings/deps/build.jl:47
julia> versioninfo()
Julia Version 0.6.1
Commit 0d7248e2ff (2017-10-24 22:15 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.9.1 (ORCJIT, haswell)
If you plan on registering this. Otherwise will sort at the very end. Only other non capitalized package name is kNN, and that is deprecated and doesn't have any tags.
In #40, a crash wasn't caught by the tests, probably because none of them covers strings longer than the buffer. It's essential to test that case as well. It's possible that BUFSIZE was increased after the tests were written.
julia> using StringEncodings
[ Info: Precompiling StringEncodings [69024149-9ee7-55f6-a4c4-859efe599b68]
ERROR: LoadError: iconv not installed properly, run Pkg.build("StringEncodings"), restart Julia and try again
Stacktrace:
[1] error(::String) at .\error.jl:33
[2] top-level scope at C:\Users\JohnDoe\.julia\packages\StringEncodings\B9gIH\src\StringEncodings.jl:10
[3] include(::Function, ::Module, ::String) at .\Base.jl:380
[4] include(::Module, ::String) at .\Base.jl:368
[5] top-level scope at none:2
[6] eval at .\boot.jl:331 [inlined]
[7] eval(::Expr) at .\client.jl:467
[8] top-level scope at .\none:3
in expression starting at C:\Users\JohnDoe\.julia\packages\StringEncodings\B9gIH\src\StringEncodings.jl:9
ERROR: Failed to precompile StringEncodings [69024149-9ee7-55f6-a4c4-859efe599b68] to C:\Users\JohnDoe\.julia\compiled\v1.5\StringEncodings\ACjY3_cqGsg.ji.
Stacktrace:
[1] error(::String) at .\error.jl:33
[2] compilecache(::Base.PkgId, ::String) at .\loading.jl:1290
[3] _require(::Base.PkgId) at .\loading.jl:1030
[4] require(::Base.PkgId) at .\loading.jl:928
[5] require(::Module, ::Symbol) at .\loading.jl:923
`Pkg.build("StringEncodings")` didn't help. Julia versions prior to Julia 1.5.0 on this machine did successfully install and use StringEncodings. Removing and reinstalling StringEncodings didn't work either.
julia> versioninfo()
Julia Version 1.5.0
Commit 96786e22cc (2020-08-01 23:44 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
Environment:
JULIA_NUM_THREADS = 7
The URL of this package does not match that stored in METADATA.jl.
cc: @nalimilan
Is it possible to write UTF-8 data to a file in an ANSI encoding such as ISO-8859-2?
I need to write an array to a text file encoded as ISO-8859-2.
In Julia:
my_array=["żyć" "błąd";1 5;3 6]
How can I save my_array as file.txt encoded in ISO-8859-2?
Paul
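One possible approach, sketched here under assumptions rather than taken from the thread: render the array to text in memory, convert that text with `encode`, and write the resulting bytes unchanged. The tab delimiter and the temporary output directory are assumptions made for the example.

```julia
using StringEncodings  # provides encode() and decode()
using DelimitedFiles   # standard library, provides writedlm()

my_array = ["żyć" "błąd"; 1 5; 3 6]

# Render the array as tab-separated UTF-8 text in memory first.
buf = IOBuffer()
writedlm(buf, my_array)
text = String(take!(buf))

# Convert the text to ISO-8859-2 bytes and write them out as-is.
bytes = encode(text, "ISO-8859-2")
path = joinpath(mktempdir(), "file.txt")  # assumed location for the example
write(path, bytes)
```

All four accented letters in this array exist in ISO-8859-2, so the conversion has nothing to reject; text with characters outside that set would make `encode` raise an error.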
I have to convert text files from a UTF encoding to ISO-8859-1. In this case, not all characters can be converted to the target encoding, so whenever `encode` encounters a non-convertible character within a string, it raises an exception.
Raising an exception in this situation is not really a good solution for my use case.
Is there a way to "test" a string in advance for non-convertible characters, and also to identify which characters within the string are non-convertible?
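For the ISO-8859-1 target specifically, a pre-check can be written without calling the converter at all, because that encoding covers exactly the code points U+0000 to U+00FF. The sketch below relies on that fact; the helper name `unencodable_latin1` is hypothetical and not part of StringEncodings, and the check does not generalize to other target encodings.

```julia
# Hypothetical helper, not part of StringEncodings: list the characters of `s`
# that have no ISO-8859-1 representation, together with their string indices.
function unencodable_latin1(s::AbstractString)
    # ISO-8859-1 maps bytes 0x00-0xFF one-to-one onto U+0000-U+00FF,
    # so any character above U+00FF cannot be converted.
    return [(i, c) for (i, c) in pairs(s) if c > '\u00ff']
end

bad = unencodable_latin1("café → 5 €")
# `bad` holds (index, character) tuples for the arrow and the euro sign.
```

An empty result means the string should convert cleanly; otherwise the indices can be used to replace or drop the offending characters before encoding.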
The latest version of `Libiconv_jll` is not built for macOS aarch64, but a previous version was. So, I wonder whether this should be fixed.
I've written some simple functions that create tables using iconv.jl, and then do the conversions in pure Julia code instead of calling iconv, as well as comparing the performance of
I've made a Gist with benchmark results (using https://github.com/johnmyleswhite/Benchmarks.jl)
along with the code and benchmarking code, at:
https://gist.github.com/ScottPJones/fcd12f675edb3d79b5ce.
The tables created are also very small, at most a couple hundred bytes per character set (the maximum is 256 bytes if the character set is ASCII compatible, 192 bytes if it is an ANSI character set, and only 64 bytes for CP1252, which would probably be the most used conversion).
Should we move towards using this approach, at least for the 8-bit character set conversions?
It would also make it easy to add all of the options that Python 3 has for handling invalid characters (error, remove, replace with a fixed replacement character (default 0xfffd) or string, insert a quoted XML escape sequence, or insert quoted as \uxxxx or \u{xxxx}).
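To illustrate the general idea of table-driven decoding for 8-bit character sets, here is a minimal sketch. It is not the code from the Gist: the function names are hypothetical, the table is a plain 256-entry `Vector{Char}` rather than the compact layout described above, and only three CP1252 entries are filled in as a sample.

```julia
# Build a 256-entry decoding table, starting from the Latin-1 identity mapping
# and applying per-byte overrides for the target character set.
function make_table(overrides::Dict{UInt8,Char})
    table = [Char(b) for b in 0:255]
    for (byte, ch) in overrides
        table[Int(byte) + 1] = ch   # Julia arrays are 1-based
    end
    return table
end

# Decode by table lookup: exactly one Char per input byte.
decode_table(bytes::AbstractVector{UInt8}, table::Vector{Char}) =
    String([table[Int(b) + 1] for b in bytes])

# Three CP1252 entries as a sample; a complete table would cover 0x80-0x9F.
cp1252_sample = make_table(Dict(0x80 => '€', 0x85 => '…', 0x99 => '™'))
decoded = decode_table([0x63, 0x61, 0x66, 0xe9, 0x20, 0x80], cp1252_sample)
```

Because the lookup is a single array index per byte, policies for invalid bytes (raise, skip, substitute) could be added inside `decode_table` without touching the table itself.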
I'm trying to read a file whose content starts with "\ufeff" when it is read with enc"UTF-8". In Python, the solution was to use "utf-8-sig".
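One workaround, sketched here as an assumption rather than a feature of the package, is to drop the leading byte order mark after decoding. The helper name `strip_bom` is hypothetical.

```julia
# Hypothetical helper: drop one leading U+FEFF (byte order mark), if present.
strip_bom(s::AbstractString) =
    startswith(s, "\ufeff") ? SubString(s, nextind(s, firstindex(s))) : s

# Stand-in for text that was decoded from a file starting with a BOM.
text = strip_bom("\ufeffid,name\n1,Ann\n")
```

Strings without a leading BOM are returned unchanged, so the helper can be applied unconditionally.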
#38 implements an optimized `readbytes!` method. The same approach could be used to make `read!` and `write` more efficient with arrays, probably by overloading `Base.unsafe_read` and `Base.unsafe_write`.
Hello,
I want to search some text for a keyword and extract the text from the keyword to the end.
I have the following simplified example:
(path, io) = mktemp();
write(path, "Hello World!") # dummy content
readuntil(io, "World") # search for keyword
skip(io, -5) # go back to include keyword in result
DesiredOutput = readuntil(io, "\n") # get rest of the file
However, this is not possible when working with an encoding:
io2 = open(path, enc"MS-ANSI") # open with encoding
readuntil(io2 , "World") # search for keyword
skip(io2 , -5) # go back to include keyword in result <-- errors
DesiredOutput = readuntil(io2 , "\n") # get rest of the file
Should skip be supported, or is there a better solution for this?
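If the file is small enough to decode in full, one alternative that avoids seeking on the decoding stream is to search the decoded string instead. The sketch below assumes the text has already been decoded into a Julia string (for example by reading the whole stream); the helper name `from_keyword` is hypothetical.

```julia
# Stand-in for text that has already been decoded into a Julia String.
text = "Hello World!"

# Hypothetical helper: return the part of `text` that starts at the first
# occurrence of `keyword`, or `nothing` when the keyword is absent.
function from_keyword(text::AbstractString, keyword::AbstractString)
    r = findfirst(keyword, text)   # index range of the match, or nothing
    return r === nothing ? nothing : SubString(text, first(r))
end

result = from_keyword(text, "World")
```

Searching the string keeps the keyword in the result without needing to step backwards in the stream.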
After JuliaLang/julia#19449, soon to be merged for Julia 0.6, you will no longer be able to access the raw bytes of a string via `string.data`; instead, do `Vector{UInt8}(string)`. For example, this affects:
julia> readuntil(IOBuffer("noël"), enc"UTF-8", "ë")
"no"
julia> readuntil(IOBuffer("noël"), enc"UTF-8", 'ë')
ERROR: MethodError: no method matching position(::StringDecoder{Encoding{Symbol("UTF-8")},Encoding{Symbol("UTF-8")},Base.GenericIOBuffer{Array{UInt8,1}}})
Closest candidates are:
position(::Base.Filesystem.File) at filesystem.jl:225
position(::Base.Libc.FILE) at libc.jl:92
position(::IOStream) at iostream.jl:188
...
Stacktrace:
[1] mark(::StringDecoder{Encoding{Symbol("UTF-8")},Encoding{Symbol("UTF-8")},Base.GenericIOBuffer{Array{UInt8,1}}}) at ./io.jl:915
[2] peek at ./iostream.jl:525 [inlined]
[3] read(::StringDecoder{Encoding{Symbol("UTF-8")},Encoding{Symbol("UTF-8")},Base.GenericIOBuffer{Array{UInt8,1}}}, ::Type{Char}) at ./io.jl:625
[4] #readuntil#283(::Bool, ::Function, ::StringDecoder{Encoding{Symbol("UTF-8")},Encoding{Symbol("UTF-8")},Base.GenericIOBuffer{Array{UInt8,1}}}, ::Char) at ./io.jl:646
[5] #readuntil#11 at ./none:0 [inlined]
[6] readuntil(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Encoding{Symbol("UTF-8")}, ::Char) at /home/milan/.julia/StringEncodings/src/StringEncodings.jl:443
[7] top-level scope at none:0
Since libiconv is already available in BB, it should be straightforward to use it.
julia> encode("ク", "shift_jisx0213")
UInt8[]
This should have provided the same result as "sjis" encoding.
julia> encode("ク", "sjis")
2-element Vector{UInt8}:
0x83
0x4e
In version 0.3.2, unsafe_wrap is used on Julia 1.4.1:
@static if VERSION >= v"1.6.0-DEV.438"
inbuf_view = view(s.inbuf, Int(s.inbytesleft[]+1):BUFSIZE)
else
inbuf_view = unsafe_wrap(Array, pointer(s.inbuf, s.inbytesleft[]+1), BUFSIZE)
end
However, a BoundsError or a dead kernel occurs while decoding files, although encoding and decoding strings works fine.
I wonder whether unsafe_wrap may be the cause.
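To make the suspicion concrete: `unsafe_wrap(Array, ptr, n)` wraps `n` elements starting at `ptr`, so wrapping `BUFSIZE` elements from an offset pointer would extend past the end of the buffer, whereas the `view` branch stops at `BUFSIZE`. The sketch below uses hypothetical stand-in values (a tiny `BUFSIZE`, a plain integer for `inbytesleft`) and only wraps the in-bounds length; whether this matches the package's intent is an assumption.

```julia
const BUFSIZE = 8                    # hypothetical small size for illustration
inbuf = collect(UInt8, 1:BUFSIZE)    # stand-in for s.inbuf
inbytesleft = 3                      # stand-in for s.inbytesleft[]

# Bounds-checked view over the unused tail of the buffer.
safe_view = view(inbuf, inbytesleft+1:BUFSIZE)

# Number of elements that actually remain after the offset.
remaining = BUFSIZE - inbytesleft

# Wrapping only `remaining` elements stays inside the buffer.
same_tail = GC.@preserve inbuf begin
    wrapped = unsafe_wrap(Array, pointer(inbuf, inbytesleft + 1), remaining)
    wrapped == safe_view
end
```

With these values the view has 5 elements, while a wrap of length `BUFSIZE` from the same pointer would cover 3 elements beyond the buffer.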
~/.julia/v0.6/StringEncodings/deps/src/libiconv-1.14$ uname -a
Linux 4.10.0-30-generic #34-Ubuntu SMP Mon Jul 31 19:38:17 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Independently, building libiconv-1.14 also reports the exact same error.
make[3]: Leaving directory '/home/sambit/.julia/v0.6/StringEncodings/deps/src/libiconv-1.14'
gcc -DHAVE_CONFIG_H -DEXEEXT=\"\" -I. -I.. -I../lib -I../intl -DDEPENDS_ON_LIBICONV=1 -DDEPENDS_ON_LIBINTL=1 -g -O2 -c progname.c
In file included from progname.c:26:0:
./stdio.h:1010:1: error: ‘gets’ undeclared here (not in a function)
 _GL_WARN_ON_USE (gets, "gets is a security hole - use fgets instead");
 ^
Makefile:914: recipe for target 'progname.o' failed
make[2]: *** [progname.o] Error 1
libiconv-1.15 does not have these errors and builds properly.
This issue is used to trigger TagBot; feel free to unsubscribe.
If you haven't already, you should update your `TagBot.yml` to include issue comment triggers.
Please see this post on Discourse for instructions and more details.
If you'd like for me to do this for you, comment `TagBot fix` on this issue.
I'll open a PR within a few hours, please be patient!
Version 0.2.2 downloads libiconv-1.15. However, the fix in build.jl needs to reach METADATA so that iconv is picked up from libc.
GNU libc has iconv built in. If it is available on the platform, it should be used instead of installing a local libiconv. Some of the discussion is available in issue #16.
On installation it gives the following warning. Not harmful but I just want to report the issue anyways.
┌ Warning: Could not extract the platform key of https://github.com/JuliaStrings/IConvBuilder/releases/download/v1.15+build.3/IConv.x86_64-linux-gnu.tar.gz; continuing...
└ @ BinaryProvider ~/.julia/packages/BinaryProvider/UTYxu/src/Prefix.jl:224