
libunicode's Introduction


Modern C++20 Unicode Library

The goal of this library is to bring painless Unicode support to C++ with simple and easy-to-understand APIs.

The API naming conventions are chosen to look familiar to those using the C++ standard library.

Feature Overview

  • API for accessing UCD properties
  • UTF-8 <-> UTF-32 conversion
  • wcwidth equivalent (int unicode::width(char32_t))
  • grapheme segmentation (UTS #29 algorithm)
  • symbol/emoji segmentation (UTS #51 algorithm)
  • script segmentation (UTS #24)
  • unit tests for most parts (wcwidth / segmentation)
  • generic text run segmentation (top-level segmentation API suitable for text-shaping implementations)
  • word segmentation (UTS #29 algorithm)
  • CLI tool: uc-inspect for inspecting input files by codepoint properties, grapheme cluster, word, script, ...
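For illustration, the UTF-8 -> UTF-32 direction of the conversion feature can be sketched as a minimal, self-contained decoder. This is not libunicode's actual API; `decode_one` is a hypothetical name:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical sketch (not libunicode's API): decode the first UTF-8
// sequence of `in` into a char32_t. Returns std::nullopt on malformed
// input; 0xC0/0xC1 lead bytes and surrogate codepoints are rejected
// (full overlong validation is omitted for brevity).
inline std::optional<char32_t> decode_one(std::string_view in)
{
    if (in.empty())
        return std::nullopt;
    auto const b0 = static_cast<unsigned char>(in[0]);
    size_t len = 0;
    char32_t cp = 0;
    if (b0 < 0x80)      { len = 1; cp = b0; }
    else if (b0 < 0xC2) return std::nullopt;      // continuation byte or overlong lead
    else if (b0 < 0xE0) { len = 2; cp = b0 & 0x1F; }
    else if (b0 < 0xF0) { len = 3; cp = b0 & 0x0F; }
    else if (b0 < 0xF5) { len = 4; cp = b0 & 0x07; }
    else                return std::nullopt;      // beyond U+10FFFF
    if (in.size() < len)
        return std::nullopt;
    for (size_t i = 1; i < len; ++i)
    {
        auto const b = static_cast<unsigned char>(in[i]);
        if ((b & 0xC0) != 0x80)                   // not a continuation byte
            return std::nullopt;
        cp = (cp << 6) | (b & 0x3F);
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)             // UTF-16 surrogate range
        return std::nullopt;
    return cp;
}
```

For example, decode_one("\xE2\x82\xAC") yields U+20AC (the Euro sign).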

Unicode Technical Specifications

  • UTS 11 - character width
  • UTS 24 - script property
  • UTS 29 - text segmentation (grapheme cluster, word boundary)
  • UTS 51 - Emoji

Integrate with your CMake project

git submodule add --name libunicode https://github.com/contour-terminal/libunicode 3rdparty/libunicode
add_subdirectory(3rdparty/libunicode)

add_executable(your_tool your_tool.cpp)
target_link_libraries(your_tool PRIVATE unicode::unicode)

Contributing

Users of this library

Disclaimer

In terms of features, this library is by no means competitive with the ICU library, but it attempts to provide a clean and intuitive modern C++ API for those who do not want to fight legacy-style C APIs.

I hope that over time we can add more and more features to this library until it eventually conforms to the Unicode specification, and I welcome everyone to contribute by forking the library, creating pull requests, or simply providing constructive feedback.

License

libunicode - a modern C++20 unicode library
-------------------------------------------

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

libunicode's People

Contributors

christianparpart, data-man, topazus, yaraslaut, yrashk


libunicode's Issues

add a Unicode query CLI tool

Given a codepoint as a command-line parameter,
it should print out at least the following information:

  • assigned name
  • the Unicode version that introduced this codepoint
  • block
  • plane
  • script
  • category
  • escaped string in UTF-8 and UTF-32 (for UTF-16 I don't really see the reason)
  • the codepoint itself in the heading, in rendered form (i.e. actually displayed)
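Of the properties above, the plane at least is trivially derivable from the codepoint value itself. A sketch (the function name is hypothetical):

```cpp
#include <cassert>

// Sketch: a codepoint's plane is its value divided by 0x10000 (planes
// 0..16); block and script require UCD data (Blocks.txt, Scripts.txt)
// and are not shown here.
constexpr int plane_of(char32_t cp) noexcept
{
    return static_cast<int>(cp >> 16);
}
```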

Additionally, I think the following would be useful too:

  • emoji default presentation
  • east asian width (and terminal display width, which isn't always the same)

Also

I think it might make sense to take the idea further and provide CLI access to (grapheme cluster) segmentation rules, in a way that lets the user easily grasp the information while keeping the output trivially consumable by other scripts.

Most of the info can already be retrieved. Add the missing pieces.

How to deal with UCD file when packaging libunicode

The unicode-ucd package in Fedora provides the same content as the UCD download from https://www.unicode.org/Public/15.0.0/ucd/UCD.zip. The reviewer asked me for some explanation about the UCD in the libunicode review request on the Fedora Bugzilla.

[ruby@fedora ~]$ dnf se unicode-ucd
Last metadata expiration check: 6 days, 0:55:44 ago on Sat 18 Feb 2023 08:05:11 AM EST.
=========================================== Name Exactly Matched: unicode-ucd ===========================================
unicode-ucd.noarch : Unicode Character Database
=============================================== Name Matched: unicode-ucd ===============================================
perl-Unicode-UCD.noarch : Unicode character database
unicode-ucd-unihan.noarch : Unicode Han Database
[ruby@fedora ~]$ ls /usr/share/unicode/ucd/
ArabicShaping.txt          DerivedNormalizationProps.txt   NameAliases.txt               ScriptExtensions.txt
auxiliary                  EastAsianWidth.txt              NamedSequencesProv.txt        Scripts.txt
BidiBrackets.txt           emoji                           NamedSequences.txt            SpecialCasing.txt
BidiCharacterTest.txt      EmojiSources.txt                NamesList.html                StandardizedVariants.txt
BidiMirroring.txt          EquivalentUnifiedIdeograph.txt  NamesList.txt                 TangutSources.txt
BidiTest.txt               extracted                       NormalizationCorrections.txt  UnicodeData.txt
Blocks.txt                 HangulSyllableType.txt          NormalizationTest.txt         USourceData.txt
CaseFolding.txt            Index.txt                       NushuSources.txt              USourceGlyphs.pdf
CJKRadicals.txt            IndicPositionalCategory.txt     PropertyAliases.txt           USourceRSChart.pdf
CompositionExclusions.txt  IndicSyllabicCategory.txt       PropertyValueAliases.txt      VerticalOrientation.txt
DerivedAge.txt             Jamo.txt                        PropList.txt
DerivedCoreProperties.txt  LineBreak.txt                   ReadMe.txt

set(LIBUNICODE_UCD_DIR "${LIBUNICODE_UCD_BASE_DIR}/ucd-${LIBUNICODE_UCD_VERSION}")

I have not found a good way to deal with this. Changing ${LIBUNICODE_UCD_BASE_DIR}/ucd-${LIBUNICODE_UCD_VERSION} to /usr/share/unicode/ucd, where the UCD files provided by the unicode-ucd package live, does work. Any better suggestions? @christianparpart

related issue: #56

create release (0.1.0)

It is used by contour, which is already used by quite a few people, and is stress-tested via the notcurses demo and other tests. It may be time for a release now.

Checklist

  • the installed package must expose its version number; apps using this library must have a way to require that specific version number (or greater)
  • create Changelog.md for future releases (containing state at first release)
  • maybe reuse release CI script from contour to autogen releases and release pages?
  • create a small blog post about it on my tiny dev.to :)

CI build test for Arch Linux

As it just broke on Arch (when compiling as C++20), it makes sense to test there, too.

I think we could probably copy and paste most of it from contour's CI.

Refactor implementation to load UCD at runtime

Goals

  • reduce binary size
  • reduce maintenance overhead on already-installed systems

Drawbacks

This of course does not work for all UCD contents. All the tables can easily be loaded at runtime, but anything that is currently translated into an enum class remains compile-time and obviously cannot be loaded at runtime.

Checklist

  • write C++ UCD codepoint_properties loader to populate UCD tables at runtime
  • install UCD data to /usr/share/libunicode/ucd (or similar)
  • change implementation to make use of the new tables. Also, ucd.h should then most likely remain static and version controlled. (Can it still be auto-generated or should it be hand-maintained?)

Implementation

I think the best approach would be a double-layer principle: we still use mktables.py to create the .cpp and .h files, but flag some of the table names so that they are not populated at generation time and instead use another API to access the runtime-loaded tables.

encode Unicode version into namespace

And have the latest Unicode version be imported into the main namespace.

  • unicode::v13

Or should there be a way to change the Unicode version at runtime (via a parameter), per function?
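The namespace idea can be sketched with C++ inline namespaces: the latest version is inline, so unqualified names resolve to it, while older versions stay explicitly addressable (the constants below are placeholders):

```cpp
#include <cassert>

// Sketch: version-tagged namespaces. unicode:: picks up the inline
// (latest) version; unicode::v12:: remains reachable explicitly.
namespace unicode
{
    namespace v12 { constexpr int ucd_major = 12; }
    inline namespace v13 { constexpr int ucd_major = 13; }
}
```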

optimize binary search via L1 dcache prefetching using `__builtin_prefetch`

auto a = size_t{0};
auto b = static_cast<size_t>(_ranges.size()) - 1;
while (a < b)
{
    auto const i = ((b + a) / 2);
    auto const& I = _ranges[i];
    if (I.to < _codepoint)
        a = i + 1;
    else if (I.from > _codepoint)
    {
        if (i == 0)
            return false;
        b = i - 1;
    }
    else
        return true;
}
return a == b && _ranges[a].from <= _codepoint && _codepoint <= _ranges[a].to;

See: https://stackoverflow.com/a/31688096/386670
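A sketch of what that could look like (a half-open interval variant, not the library's actual code): while the current comparison is in flight, prefetch the two possible midpoints of the next iteration into L1 dcache.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct crange { char32_t from; char32_t to; };

// Sketch: binary search over sorted, non-overlapping ranges, with
// __builtin_prefetch (a GCC/Clang builtin) hinting the two candidate
// midpoints of the next iteration.
inline bool contains(std::vector<crange> const& ranges, char32_t cp) noexcept
{
    auto const* p = ranges.data();
    auto a = size_t { 0 };
    auto b = ranges.size();
    while (a < b)
    {
        auto const i = a + (b - a) / 2;
#if defined(__GNUC__) || defined(__clang__)
        __builtin_prefetch(p + a + (i - a) / 2);         // midpoint if we go left
        __builtin_prefetch(p + i + 1 + (b - i - 1) / 2); // midpoint if we go right
#endif
        if (p[i].to < cp)
            a = i + 1;
        else if (p[i].from > cp)
            b = i;
        else
            return true;
    }
    return false;
}

inline bool contains_demo()
{
    auto const rs = std::vector<crange> { { 0x30, 0x39 }, { 0x41, 0x5A }, { 0x61, 0x7A } };
    return contains(rs, U'5') && contains(rs, U'z') && !contains(rs, U'\x20');
}
```

Prefetching one-past-the-end is harmless: __builtin_prefetch never faults, and the addresses formed here never exceed data() + size().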

Optimize performance for grapheme cluster break lookup (and other tables)

Checklist

  • implement table lookup based on https://www.strchr.com/multi-stage_tables for all tables
  • Evaluate the possibility to join commonly looked up attributes into a single table (grapheme break, script, width, emoji default presentation, ...)
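The multi-stage idea from the first checklist item can be sketched as follows (a toy example, not the real table layout): stage 1 maps each 256-codepoint block to a block index; stage 2 stores one 256-entry run of property values per distinct block, so identical blocks are shared, which is what shrinks the table.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr size_t BlockSize = 256;

// Toy two-stage table (cf. strchr.com/multi-stage_tables): only two
// distinct stage-2 blocks exist here, an all-zero one and an all-one one.
struct two_stage_table
{
    std::array<uint16_t, 0x110000 / BlockSize> stage1 {}; // block index per 256 codepoints
    std::array<uint8_t, 2 * BlockSize> stage2 {};         // the distinct blocks themselves

    constexpr uint8_t get(char32_t cp) const noexcept
    {
        return stage2[stage1[cp / BlockSize] * BlockSize + cp % BlockSize];
    }
};

// Demo property: 1 for every codepoint in plane 1, 0 elsewhere.
inline two_stage_table make_demo_table()
{
    two_stage_table t {};
    for (size_t block = 0x10000 / BlockSize; block < 0x20000 / BlockSize; ++block)
        t.stage1[block] = 1;
    for (size_t i = 0; i < BlockSize; ++i)
        t.stage2[BlockSize + i] = 1;
    return t;
}
```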

Future investigation

There is very good research by the utf8proc team: https://halt.software/optimizing-unicodes-grapheme-cluster-break-algorithm/

We could see whether we can implement it like that too, document it, and reference their great work.

  • implement the break algorithm based on the above web link
  • document the idea behind that algorithm so that one can understand it without consulting external (possibly future-deleted) web articles
  • perf-test against the naive implementation (probably simply by doing it as part of contour-terminal/contour#692, which desperately needs improved performance for the break algorithm)

Build System Downloads Files at the Configuration Stage

Unfortunately, when I was updating my FreeBSD Ports overlay for contour and pushed it through the automatic package builder poudriere, I got build failures because CMake downloads files (the UCD zip) at configuration time, where network access is completely disabled.

I believe I pointed this out before in tickets in contour.

My suggestion would be to either:

  • Vendor the UCD files or ship them in the distribution tarball or
  • Let the build system check for the files' existence and error out if they are missing. Additionally provide a script to download them for the user. This way package/port maintainers can add them to the distfile list.

Related to #42

Incompatible with Android Build Chain

Currently, because of the Catch2 and fmt dependencies, it is incompatible with Android: these two packages are not correctly fetched and built when using the Android Gradle build system.

Adding libunicode to an AndroidStudio CMakeLists file does not currently work correctly.

Error:
Could not find a package configuration file provided by "fmt" with any of
the following names:
fmtConfig.cmake
fmt-config.cmake

Could not find a package configuration file provided by "Catch2" with any
of the following names:
Catch2Config.cmake
catch2-config.cmake

Refactor grapheme cluster segmentation to properly act on clusters with more than 2 codepoints

https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules

Specifically I am interested in correctly segmenting a consecutive list of country flags (RI regional indicators).

Also, to make the future implementation (but also the current one) very fast, we should add the grapheme break classes (CR, LF, L, V, LV, LVT, Extend, ZWJ, Control, SpacingMark, Prepend, Extended_Pictographic, RI) as a field to the new codepoint_properties table so that grapheme segmentation is as efficient as possible.
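For the regional-indicator part specifically, rules GB12/GB13 of UTS #29 forbid a break between two RIs only when an odd number of consecutive RIs precedes the candidate position, which is what makes flags pair up. A self-contained sketch of just that rule (not the library's segmenter):

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

constexpr bool is_regional_indicator(char32_t cp) noexcept
{
    return cp >= 0x1F1E6 && cp <= 0x1F1FF;
}

// GB12/GB13 sketch: a break before s[i] is forbidden iff s[i] and s[i-1]
// are both RIs and an odd number of consecutive RIs precedes position i.
inline bool ri_break_allowed(std::u32string_view s, size_t i)
{
    if (i == 0 || !is_regional_indicator(s[i]) || !is_regional_indicator(s[i - 1]))
        return true;
    size_t preceding = 0;
    while (preceding < i && is_regional_indicator(s[i - 1 - preceding]))
        ++preceding;
    return preceding % 2 == 0;
}

inline bool ri_demo()
{
    // U+1F1E9 U+1F1EA U+1F1EB U+1F1F7: the flags DE and FR. The only
    // valid break inside the RI run is between the two flags (position 2).
    auto const s = std::u32string_view { U"\U0001F1E9\U0001F1EA\U0001F1EB\U0001F1F7" };
    return !ri_break_allowed(s, 1) && ri_break_allowed(s, 2) && !ri_break_allowed(s, 3);
}
```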

Building stops on mktables.py

Trying to build the library as a part of contour, and the build stops with the error message:

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\contour\\3rdparty\\libunicode\\src\\unicode/../../docs/ucd/PropertyValueAliases.txt'

The whole docs folder does not seem to be part of the repo; I guess it was not added?

Optimize table generation for names

This process is currently very slow because every non-empty string is almost certainly a miss when searched for. Try to find a way to avoid that so that name table generation becomes fast.

`script_extensions` should return an `optional<span<Script>>`


This requires C++20, though. Maybe auto-detect whether that is available, and if not, provide a custom span<T> type.

Also make use of that in contour once implemented.

So the signature for the additional function would be:

std::optional<std::span<Script>> script_extensions(char32_t _codepoint) noexcept;
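The fallback span<T> mentioned above could be as small as this (an illustrative sketch, not libunicode's implementation; the name basic_span is hypothetical):

```cpp
#include <cassert>
#include <cstddef>

// Minimal read-only span fallback for pre-C++20 toolchains (sketch).
template <typename T>
class basic_span
{
  public:
    constexpr basic_span() noexcept = default;
    constexpr basic_span(T const* data, size_t count) noexcept: data_ { data }, size_ { count } {}

    constexpr T const* begin() const noexcept { return data_; }
    constexpr T const* end() const noexcept { return data_ + size_; }
    constexpr size_t size() const noexcept { return size_; }
    constexpr T const& operator[](size_t i) const noexcept { return data_[i]; }

  private:
    T const* data_ = nullptr;
    size_t size_ = 0;
};

inline bool span_demo()
{
    static constexpr int scripts[] = { 1, 2, 3 }; // stand-in for Script values
    auto const s = basic_span<int> { scripts, 3 };
    size_t sum = 0;
    for (auto v: s)
        sum += static_cast<size_t>(v);
    return s.size() == 3 && sum == 6 && s[2] == 3;
}
```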

unicode/utf8.h conflicts with ICU

Trying to install libunicode as a system library in /usr results in utf8.h living in /usr/include/unicode/utf8.h -- but that file is also owned by the system library installation of ICU.
This makes it impossible to install libunicode and ICU in the same location.

codepoint mapping to their names

Quite a few codepoints have names that describe them.

It would be nice to have a getter from char32_t to string_view.

Maybe also a sub-namespace containing all of these names, nicely capitalized, as constexpr inline char32_t That_Name_Here = 0x1234;

namespace maybe: unicode::names
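A sketch of what that namespace could look like (the constants below carry real Unicode assignments, but the API shape itself is hypothetical):

```cpp
#include <cassert>

// Hypothetical unicode::names sub-namespace with capitalized constants.
namespace unicode::names
{
    constexpr inline char32_t Bullet = 0x2022;  // BULLET
    constexpr inline char32_t Em_Dash = 0x2014; // EM DASH
    constexpr inline char32_t Snowman = 0x2603; // SNOWMAN
}
```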

questions about packaging libunicode for Fedora

I did a trial build of libunicode from the latest commit tag on Fedora, installed the built libunicode rpm package, and a file conflicted with libicu-devel. Maybe installing the header files to another directory would solve it?

install(TARGETS ${INSTALL_TARGETS}
    EXPORT ${TARGETS_EXPORT_NAME}
    LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR}
    ARCHIVE DESTINATION ${CMAKE_INSTALL_LIBDIR}
    PUBLIC_HEADER DESTINATION "${CMAKE_INSTALL_PREFIX}/include/unicode"
    FRAMEWORK DESTINATION "."
    RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR})
install(
    FILES
        ucd.h
        ucd_enums.h
        ucd_fmt.h
        ucd_ostream.h
    DESTINATION "${CMAKE_INSTALL_INCLUDEDIR}/unicode"
)

[ruby@fedora x86_64]$ rpm -qlp ./libunicode-static-20230219.5987666-1.fc38.x86_64.rpm 
/usr/include/unicode/capi.h
/usr/include/unicode/codepoint_properties.h
/usr/include/unicode/convert.h
/usr/include/unicode/emoji_segmenter.h
/usr/include/unicode/grapheme_segmenter.h
/usr/include/unicode/intrinsics.h
/usr/include/unicode/run_segmenter.h
/usr/include/unicode/scan.h
/usr/include/unicode/script_segmenter.h
/usr/include/unicode/ucd.h
/usr/include/unicode/ucd_enums.h
/usr/include/unicode/ucd_fmt.h
/usr/include/unicode/ucd_ostream.h
/usr/include/unicode/utf8.h
/usr/include/unicode/utf8_grapheme_segmenter.h
/usr/include/unicode/width.h
/usr/include/unicode/word_segmenter.h
/usr/lib64/cmake/libunicode/libunicode-config-version.cmake
/usr/lib64/cmake/libunicode/libunicode-config.cmake
/usr/lib64/cmake/libunicode/unicode-targets-noconfig.cmake
/usr/lib64/cmake/libunicode/unicode-targets.cmake
/usr/lib64/libunicode.a
/usr/lib64/libunicode_loader.a
/usr/lib64/libunicode_ucd.a
[ruby@fedora x86_64]$ sudo dnf in ./libunicode-static-20230219.5987666-1.fc38.x86_64.rpm ./libunicode-20230219.5987666-1.fc38.x86_64.rpm 
Last metadata expiration check: 3:10:44 ago on Wed 22 Feb 2023 07:57:54 AM CST.
Dependencies resolved.
==============================================================================================================
 Package                     Architecture     Version                            Repository              Size
==============================================================================================================
Installing:
 libunicode                  x86_64           20230219.5987666-1.fc38            @commandline           1.1 M
 libunicode-static           x86_64           20230219.5987666-1.fc38            @commandline           506 k

Transaction Summary
==============================================================================================================
Install  2 Packages

Total size: 1.6 M
Installed size: 8.5 M
Is this ok [y/N]: y
Downloading Packages:
Running transaction check
Transaction check succeeded.
Running transaction test
Error: Transaction test error:
  file /usr/include/unicode/utf8.h from install of libunicode-static-20230219.5987666-1.fc38.x86_64 conflicts with file from package libicu-devel-72.1-2.fc38.x86_64

Add UTF-16 conversions

It shouldn't just be possible to convert between UTF-8 <-> UTF-32, but also:

  • UTF-16 <-> UTF-32

Maybe also generically between any two of the UTF-8/16/32 encodings?
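The UTF-32 -> UTF-16 direction is small enough to sketch here (a hypothetical function, not the library's API): codepoints above U+FFFF are split into a surrogate pair.

```cpp
#include <cassert>
#include <cstddef>

// Sketch: encode one codepoint as UTF-16; returns the number of code
// units written (1, or 2 for a surrogate pair). Assumes cp is a valid
// scalar value (<= U+10FFFF and not itself a surrogate).
inline size_t encode_utf16(char32_t cp, char16_t out[2]) noexcept
{
    if (cp < 0x10000)
    {
        out[0] = static_cast<char16_t>(cp);
        return 1;
    }
    auto const v = cp - 0x10000;
    out[0] = static_cast<char16_t>(0xD800 + (v >> 10));   // high surrogate
    out[1] = static_cast<char16_t>(0xDC00 + (v & 0x3FF)); // low surrogate
    return 2;
}

inline bool utf16_demo()
{
    char16_t buf[2] {};
    bool ok = encode_utf16(U'A', buf) == 1 && buf[0] == u'A';
    ok = ok && encode_utf16(0x1F600, buf) == 2 // U+1F600 GRINNING FACE
            && buf[0] == char16_t { 0xD83D } && buf[1] == char16_t { 0xDE00 };
    return ok;
}
```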

Hangul Jamo vowels and trailing consonants should probably be 0 width

U+1160..U+11FF and U+D7B0..U+D7FF should have 0 width.

Korean Hangul is a writing system which uses syllable blocks consisting of alphabetic components. A syllable consists of one or more Leading Consonants, one or more Vowels, and zero or more trailing consonants.

Unicode has 11,172 precomposed syllable blocks at U+AC00..U+D7A3.

There are also component Jamos:

  • Hangul Jamo (U+1100..U+11FF).
    • U+1100..U+115F Choseong (initial, Leading Consonants) have East_Asian_Width=Wide and Hangul_Syllable_Type=Leading_Jamo
    • U+1160..U+11A7 Jungseong (medial, Vowels) have East_Asian_Width=Neutral and Hangul_Syllable_Type=Vowel_Jamo
    • U+11A8..U+11FF Jongseong (final, Trailing consonants) have East_Asian_Width=Neutral and Hangul_Syllable_Type=Trailing_Jamo
  • U+A960..U+A97F Hangul Jamo Extended-A (choseong) have East_Asian_Width=Wide
  • U+D7B0..U+D7FF Hangul Jamo Extended-B (jungseong and jongseong) have East_Asian_Width=Neutral
  • U+3130..U+318F Hangul Compatibility Jamo have no conjoining behavior
  • U+FFA0..U+FFDF half-width forms have no conjoining behavior.

U+1100..U+11FF, U+A960..U+A97F, U+D7B0..U+D7FF have conjoining behavior, a sequence of L+V+T* gets rendered as a syllable block. wcwidth() implementations tend to give U+1100..U+115F width 2, and U+1160..U+11FF width 0, so the resulting syllable block has the correct total width.

U+D7B0..U+D7FF should also have width 0.

glibc gave width 0 to conjoining jungseong and jongseong at:

 commit 7a79e321c6f85b204036c33d85f6b2aa794e7c76
Author: Thorsten Glaser <[email protected]>
Date:   Fri Jul 14 14:02:50 2017 +0200

    Refresh generated charmap data and ChangeLog

            [BZ #21750]
            * charmaps/UTF-8: Refresh.

diff --git a/localedata/ChangeLog b/localedata/ChangeLog
index 04ef5ad071..9e05b4a652 100644
--- a/localedata/ChangeLog
+++ b/localedata/ChangeLog
@@ -1,3 +1,17 @@
+2017-07-14  Thorsten Glaser  <[email protected]>
+
+       [BZ #21750]
+       * charmaps/UTF-8: Refresh.
+       * unicode-gen/utf8_gen.py (U+00AD): Set width to 1.
+       * unicode-gen/utf8_gen.py (U+1160..U+11FF): Set width to 0.
+       * unicode-gen/utf8_gen.py (U+3248..U+324F): Set width to 2.
+       * unicode-gen/utf8_gen.py (U+4DC0..U+4DFF): Likewise.
+       * unicode-gen/utf8_gen.py: Treat category Me and Mn as combining.
+       [BZ #19852]
+       * unicode-gen/utf8_gen.py: Process EastAsianWidth lines before
+       UnicodeData lines so the latter have precedence; remove hack
+       to group output by EastAsianWidth ranges.
+

[ ... snip ...]

commit 6e540caa21616d5ec5511fafb22819204525138e
Author: Mike FABIAN <[email protected]>
Date:   Tue Jun 16 08:29:40 2020 +0200

    Set width of JUNGSEONG/JONGSEONG characters from UD7B0 to UD7FB to 0 [BZ #26120]
Reviewed-by: Carlos O'Donell <[email protected]>

diff --git a/localedata/charmaps/UTF-8 b/localedata/charmaps/UTF-8
index 14c5d4fa33..8cce47cd97 100644
--- a/localedata/charmaps/UTF-8
+++ b/localedata/charmaps/UTF-8
@@ -48920,6 +48920,8 @@ WIDTH
 <UABE8>        0
 <UABED>        0
 <UAC00>...<UD7A3>      2
+<UD7B0>...<UD7C6>      0
+<UD7CB>...<UD7FB>      0
 <UF900>...<UFA6D>      2
 <UFA70>...<UFAD9>      2
 <UFB1E>        0
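The rule proposed above can be condensed into a sketch (illustrative only, not libunicode's width() implementation): conjoining Jungseong/Jongseong get width 0 so that an L+V+T sequence renders as a single width-2 syllable block.

```cpp
#include <cassert>

// Sketch of the proposed Hangul Jamo widths; anything else would fall
// through to the regular width logic (simplified to 1 here).
constexpr int hangul_jamo_width(char32_t cp) noexcept
{
    if (cp >= 0x1100 && cp <= 0x115F) return 2; // Choseong (leading), Wide
    if (cp >= 0x1160 && cp <= 0x11FF) return 0; // Jungseong/Jongseong
    if (cp >= 0xA960 && cp <= 0xA97F) return 2; // Jamo Extended-A (choseong), Wide
    if (cp >= 0xD7B0 && cp <= 0xD7FF) return 0; // Jamo Extended-B
    return 1;
}
```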
