
libunicode's Introduction


Modern C++20 Unicode Library

The goal of this library is to bring painless Unicode support to C++ with simple and easy-to-understand APIs.

The API naming conventions are chosen to look familiar to those using the C++ standard library.

Feature Overview

  • API for accessing UCD properties
  • UTF-8 <-> UTF-32 conversion
  • wcwidth equivalent (int unicode::width(char32_t))
  • grapheme segmentation (UTS #29 algorithm)
  • symbol/emoji segmentation (UTS #51 algorithm)
  • script segmentation (UTS #24)
  • unit tests for most parts (wcwidth / segmentation)
  • generic text run segmentation (top-level segmentation API suitable for text-shaping implementations)
  • word segmentation (UTS #29 algorithm)
  • CLI tool: uc-inspect for inspecting input files by codepoint properties, grapheme cluster, word, script, ...
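For illustration, the UTF-8 -> UTF-32 direction of the conversion feature can be sketched as a minimal, self-contained decoder. This is not libunicode's actual API; `decode_one` is a hypothetical name:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical sketch (not libunicode's API): decode the first UTF-8
// sequence of `in` into a char32_t. Returns std::nullopt on malformed
// input; 0xC0/0xC1 lead bytes and surrogate codepoints are rejected
// (full overlong validation is omitted for brevity).
inline std::optional<char32_t> decode_one(std::string_view in)
{
    if (in.empty())
        return std::nullopt;
    auto const b0 = static_cast<unsigned char>(in[0]);
    size_t len = 0;
    char32_t cp = 0;
    if (b0 < 0x80)      { len = 1; cp = b0; }
    else if (b0 < 0xC2) return std::nullopt;      // continuation byte or overlong lead
    else if (b0 < 0xE0) { len = 2; cp = b0 & 0x1F; }
    else if (b0 < 0xF0) { len = 3; cp = b0 & 0x0F; }
    else if (b0 < 0xF5) { len = 4; cp = b0 & 0x07; }
    else                return std::nullopt;      // beyond U+10FFFF
    if (in.size() < len)
        return std::nullopt;
    for (size_t i = 1; i < len; ++i)
    {
        auto const b = static_cast<unsigned char>(in[i]);
        if ((b & 0xC0) != 0x80)                   // not a continuation byte
            return std::nullopt;
        cp = (cp << 6) | (b & 0x3F);
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)             // UTF-16 surrogate range
        return std::nullopt;
    return cp;
}
```

For example, decode_one("\xE2\x82\xAC") yields U+20AC (the Euro sign).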

Unicode Technical Specifications

  • UTS 11 - character width
  • UTS 24 - script property
  • UTS 29 - text segmentation (grapheme cluster, word boundary)
  • UTS 51 - Emoji

Integrate with your CMake project

git submodule add --name libunicode https://github.com/contour-terminal/libunicode 3rdparty/libunicode
add_subdirectory(3rdparty/libunicode)

add_executable(your_tool your_tool.cpp)
target_link_libraries(your_tool PRIVATE unicode::unicode)

Contributing

Users of this library

Disclaimer

In terms of features, this library is by no means competitive with the ICU library, but it attempts to provide a clean and intuitive modern C++ API for those who do not want to fight legacy-style C APIs.

I hope that over time we can add more and more features to this library until it eventually conforms to the Unicode specification, and I welcome everyone to contribute by forking the library, creating pull requests, or simply providing constructive feedback.

License

libunicode - a modern C++20 unicode library
-------------------------------------------

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

libunicode's People

Contributors

christianparpart, data-man, topazus, yaraslaut, yrashk


libunicode's Issues

add a Unicode query CLI tool

Given a codepoint as a command-line parameter,
it should print out at least the following information:

  • assigned name
  • the Unicode version that introduced this codepoint
  • block
  • plane
  • script
  • category
  • escaped string in UTF-8 and UTF-32 (for UTF-16 I don't really see the reason)
  • the codepoint itself in the heading, in rendered form (i.e. actually displayed)
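Of the properties above, the plane at least is trivially derivable from the codepoint value itself. A sketch (the function name is hypothetical):

```cpp
#include <cassert>

// Sketch: a codepoint's plane is its value divided by 0x10000 (planes
// 0..16); block and script require UCD data (Blocks.txt, Scripts.txt)
// and are not shown here.
constexpr int plane_of(char32_t cp) noexcept
{
    return static_cast<int>(cp >> 16);
}
```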

Additionally, I think the following would be useful too:

  • emoji default presentation
  • east asian width (and terminal display width, which isn't always the same)

Also

I think it might make sense to take the idea further and provide CLI access to (grapheme cluster) segmentation rules, in a way that lets the user easily grasp the information while keeping the output trivially consumable by other scripts.

Most of the info can already be retrieved. Add the missing pieces.

How to deal with UCD file when packaging libunicode

The unicode-ucd package in Fedora provides the same content as the UCD download from https://www.unicode.org/Public/15.0.0/ucd/UCD.zip. The reviewer asked me for some explanation about the UCD in the libunicode review request on the Fedora Bugzilla.

[ruby@fedora ~]$ dnf se unicode-ucd
Last metadata expiration check: 6 days, 0:55:44 ago on Sat 18 Feb 2023 08:05:11 AM EST.
=========================================== Name Exactly Matched: unicode-ucd ===========================================
unicode-ucd.noarch : Unicode Character Database
=============================================== Name Matched: unicode-ucd ===============================================
perl-Unicode-UCD.noarch : Unicode character database
unicode-ucd-unihan.noarch : Unicode Han Database
[ruby@fedora ~]$ ls /usr/share/unicode/ucd/
ArabicShaping.txt          DerivedNormalizationProps.txt   NameAliases.txt               ScriptExtensions.txt
auxiliary                  EastAsianWidth.txt              NamedSequencesProv.txt        Scripts.txt
BidiBrackets.txt           emoji                           NamedSequences.txt            SpecialCasing.txt
BidiCharacterTest.txt      EmojiSources.txt                NamesList.html                StandardizedVariants.txt
BidiMirroring.txt          EquivalentUnifiedIdeograph.txt  NamesList.txt                 TangutSources.txt
BidiTest.txt               extracted                       NormalizationCorrections.txt  UnicodeData.txt
Blocks.txt                 HangulSyllableType.txt          NormalizationTest.txt         USourceData.txt
CaseFolding.txt            Index.txt                       NushuSources.txt              USourceGlyphs.pdf
CJKRadicals.txt            IndicPositionalCategory.txt     PropertyAliases.txt           USourceRSChart.pdf
CompositionExclusions.txt  IndicSyllabicCategory.txt       PropertyValueAliases.txt      VerticalOrientation.txt
DerivedAge.txt             Jamo.txt                        PropList.txt
DerivedCoreProperties.txt  LineBreak.txt                   ReadMe.txt

set(LIBUNICODE_UCD_DIR "${LIBUNICODE_UCD_BASE_DIR}/ucd-${LIBUNICODE_UCD_VERSION}")

I have not found a good way to deal with this. Changing ${LIBUNICODE_UCD_BASE_DIR}/ucd-${LIBUNICODE_UCD_VERSION} to /usr/share/unicode/ucd, where the UCD files provided by the unicode-ucd package live, does work. Any better suggestions? @christianparpart

related issue: #56

create release (0.1.0)

It is used by contour, which is already used by quite a few people, and is stress-tested via the notcurses demo and other tests. It may be time for a release now.

Checklist

  • the installed package must expose its version number; apps using this library must have a way to require that specific version number (or greater)
  • create Changelog.md for future releases (containing state at first release)
  • maybe reuse release CI script from contour to autogen releases and release pages?
  • create a small blog post about it on my tiny dev.to :)

CI build test for Arch Linux

As it just broke on Arch (when compiling as C++20), it makes sense to test there, too.

I think we could probably copy and paste most of it from contour's CI.

Refactor implementation to load UCD at runtime

Goals

  • reduce binary size
  • reduce maintenance overhead on already-installed systems

Drawbacks

This of course does not work for all UCD contents. All the tables can easily be loaded at runtime, but anything that is currently translated into an enum class remains compile-time and obviously cannot be loaded at runtime.

Checklist

  • write C++ UCD codepoint_properties loader to populate UCD tables at runtime
  • install UCD data to /usr/share/libunicode/ucd (or similar)
  • change implementation to make use of the new tables. Also, ucd.h should then most likely remain static and version controlled. (Can it still be auto-generated or should it be hand-maintained?)

Implementation

I think the best approach would be a double-layer principle: we still use mktables.py to create the .cpp and .h files, but flag some of the table names so that they are not populated at generation time and instead use another API to access the runtime-loaded tables.

encode Unicode version into namespace

And have the latest Unicode version be imported into the main namespace.

  • unicode::v13

Or should there be a way to change the Unicode version at runtime (via a parameter), per function?
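The namespace idea can be sketched with C++ inline namespaces: the latest version is inline, so unqualified names resolve to it, while older versions stay explicitly addressable (the constants below are placeholders):

```cpp
#include <cassert>

// Sketch: version-tagged namespaces. unicode:: picks up the inline
// (latest) version; unicode::v12:: remains reachable explicitly.
namespace unicode
{
    namespace v12 { constexpr int ucd_major = 12; }
    inline namespace v13 { constexpr int ucd_major = 13; }
}
```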

optimize binary search via L1 dcache prefetching using `__builtin_prefetch`

auto a = size_t{0};
auto b = static_cast<size_t>(_ranges.size()) - 1;
while (a < b)
{
    auto const i = ((b + a) / 2);
    auto const& I = _ranges[i];
    if (I.to < _codepoint)
        a = i + 1;
    else if (I.from > _codepoint)
    {
        if (i == 0)
            return false;
        b = i - 1;
    }
    else
        return true;
}
return a == b && _ranges[a].from <= _codepoint && _codepoint <= _ranges[a].to;

See: https://stackoverflow.com/a/31688096/386670
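A sketch of what that could look like (a half-open interval variant, not the library's actual code): while the current comparison is in flight, prefetch the two possible midpoints of the next iteration into L1 dcache.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct crange { char32_t from; char32_t to; };

// Sketch: binary search over sorted, non-overlapping ranges, with
// __builtin_prefetch (a GCC/Clang builtin) hinting the two candidate
// midpoints of the next iteration.
inline bool contains(std::vector<crange> const& ranges, char32_t cp) noexcept
{
    auto const* p = ranges.data();
    auto a = size_t { 0 };
    auto b = ranges.size();
    while (a < b)
    {
        auto const i = a + (b - a) / 2;
#if defined(__GNUC__) || defined(__clang__)
        __builtin_prefetch(p + a + (i - a) / 2);         // midpoint if we go left
        __builtin_prefetch(p + i + 1 + (b - i - 1) / 2); // midpoint if we go right
#endif
        if (p[i].to < cp)
            a = i + 1;
        else if (p[i].from > cp)
            b = i;
        else
            return true;
    }
    return false;
}

inline bool contains_demo()
{
    auto const rs = std::vector<crange> { { 0x30, 0x39 }, { 0x41, 0x5A }, { 0x61, 0x7A } };
    return contains(rs, U'5') && contains(rs, U'z') && !contains(rs, U'\x20');
}
```

Prefetching one-past-the-end is harmless: __builtin_prefetch never faults, and the addresses formed here never exceed data() + size().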

Optimize performance for grapheme cluster break lookup (and other tables)

Checklist

  • implement table lookup based on https://www.strchr.com/multi-stage_tables for all tables
  • Evaluate the possibility to join commonly looked up attributes into a single table (grapheme break, script, width, emoji default presentation, ...)
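The multi-stage idea from the first checklist item can be sketched as follows (a toy example, not the real table layout): stage 1 maps each 256-codepoint block to a block index; stage 2 stores one 256-entry run of property values per distinct block, so identical blocks are shared, which is what shrinks the table.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr size_t BlockSize = 256;

// Toy two-stage table (cf. strchr.com/multi-stage_tables): only two
// distinct stage-2 blocks exist here, an all-zero one and an all-one one.
struct two_stage_table
{
    std::array<uint16_t, 0x110000 / BlockSize> stage1 {}; // block index per 256 codepoints
    std::array<uint8_t, 2 * BlockSize> stage2 {};         // the distinct blocks themselves

    constexpr uint8_t get(char32_t cp) const noexcept
    {
        return stage2[stage1[cp / BlockSize] * BlockSize + cp % BlockSize];
    }
};

// Demo property: 1 for every codepoint in plane 1, 0 elsewhere.
inline two_stage_table make_demo_table()
{
    two_stage_table t {};
    for (size_t block = 0x10000 / BlockSize; block < 0x20000 / BlockSize; ++block)
        t.stage1[block] = 1;
    for (size_t i = 0; i < BlockSize; ++i)
        t.stage2[BlockSize + i] = 1;
    return t;
}
```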

Future investigation

There is very good research by the utf8proc team: https://halt.software/optimizing-unicodes-grapheme-cluster-break-algorithm/

We could see whether we can implement it like that too, document it, and reference their great work.

  • implement the break algorithm based on the above web link
  • document the idea behind that algorithm so that one can understand it without consulting external (possibly future-deleted) web articles
  • perf-test against the naive implementation (probably simply by doing it as part of contour-terminal/contour#692, which desperately needs improved performance for the break algorithm)

Build System Downloads Files at the Configuration Stage

Unfortunately, when I was updating my FreeBSD Ports overlay for contour and pushed it through the automatic package builder poudriere, I got build failures because CMake downloads files (the UCD zip) at configuration time, where network access is completely disabled.

I believe I pointed this out before in tickets in contour.

My suggestion would be to either:

  • Vendor the UCD files or ship them in the distribution tarball or
  • Let the build system check for the files' existence and error out if they are missing. Additionally provide a script to download them for the user. This way package/port maintainers can add them to the distfile list.

Related to #42

Incompatible with Android Build Chain

Currently, because of the Catch2 and fmt dependencies, it is incompatible with Android: these two packages are not correctly fetched and built when using the Android Gradle build system.

Adding libunicode to an AndroidStudio CMakeLists file does not currently work correctly.

Error:
Could not find a package configuration file provided by "fmt" with any of
the following names:
fmtConfig.cmake
fmt-config.cmake

Could not find a package configuration file provided by "Catch2" with any
of the following names:
Catch2Config.cmake
catch2-config.cmake

Refactor grapheme cluster segmentation to properly act on clusters with more than 2 codepoints

https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules

Specifically I am interested in correctly segmenting a consecutive list of country flags (RI regional indicators).

Also, to make the future implementation (but also the current one) very fast, we should add the grapheme break classes (CR, LF, L, V, LV, LVT, Extend, ZWJ, Control, SpacingMark, Prepend, Extended_Pictographic, RI) as a field to the new codepoint_properties table so that grapheme segmentation is as efficient as possible.
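For the regional-indicator part specifically, rules GB12/GB13 of UTS #29 forbid a break between two RIs only when an odd number of consecutive RIs precedes the candidate position, which is what makes flags pair up. A self-contained sketch of just that rule (not the library's segmenter):

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

constexpr bool is_regional_indicator(char32_t cp) noexcept
{
    return cp >= 0x1F1E6 && cp <= 0x1F1FF;
}

// GB12/GB13 sketch: a break before s[i] is forbidden iff s[i] and s[i-1]
// are both RIs and an odd number of consecutive RIs precedes position i.
inline bool ri_break_allowed(std::u32string_view s, size_t i)
{
    if (i == 0 || !is_regional_indicator(s[i]) || !is_regional_indicator(s[i - 1]))
        return true;
    size_t preceding = 0;
    while (preceding < i && is_regional_indicator(s[i - 1 - preceding]))
        ++preceding;
    return preceding % 2 == 0;
}

inline bool ri_demo()
{
    // U+1F1E9 U+1F1EA U+1F1EB U+1F1F7: the flags DE and FR. The only
    // valid break inside the RI run is between the two flags (position 2).
    auto const s = std::u32string_view { U"\U0001F1E9\U0001F1EA\U0001F1EB\U0001F1F7" };
    return !ri_break_allowed(s, 1) && ri_break_allowed(s, 2) && !ri_break_allowed(s, 3);
}
```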

Building stops on mktables.py

Trying to build the library as a part of contour, and the build stops with the error message:

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\contour\\3rdparty\\libunicode\\src\\unicode/../../docs/ucd/PropertyValueAliases.txt'

The whole docs folder does not seem to be part of the repo; I guess it was not added?

Optimize table generation for names

This process is currently very slow because every non-empty string is almost certainly a miss when searched for. Try to find a way to avoid that so that name table generation becomes fast.

`script_extensions` should return an `optional<span<Script>>`


This requires C++20, though. Maybe auto-detect whether that is available, and if not, provide a custom span<T> type.

Also make use of that in contour once implemented.

So the signature for the additional function would be:

std::optional<std::span<Script>> script_extensions(char32_t _codepoint) noexcept;
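The fallback span<T> mentioned above could be as small as this (an illustrative sketch, not libunicode's implementation; the name basic_span is hypothetical):

```cpp
#include <cassert>
#include <cstddef>

// Minimal read-only span fallback for pre-C++20 toolchains (sketch).
template <typename T>
class basic_span
{
  public:
    constexpr basic_span() noexcept = default;
    constexpr basic_span(T const* data, size_t count) noexcept: data_ { data }, size_ { count } {}

    constexpr T const* begin() const noexcept { return data_; }
    constexpr T const* end() const noexcept { return data_ + size_; }
    constexpr size_t size() const noexcept { return size_; }
    constexpr T const& operator[](size_t i) const noexcept { return data_[i]; }

  private:
    T const* data_ = nullptr;
    size_t size_ = 0;
};

inline bool span_demo()
{
    static constexpr int scripts[] = { 1, 2, 3 }; // stand-in for Script values
    auto const s = basic_span<int> { scripts, 3 };
    size_t sum = 0;
    for (auto v: s)
        sum += static_cast<size_t>(v);
    return s.size() == 3 && sum == 6 && s[2] == 3;
}
```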

unicode/utf8.h conflicts with ICU

Trying to install libunicode as a system library in /usr results in utf8.h living in /usr/include/unicode/utf8.h -- but that file is also owned by the system library installation of ICU.
This makes it impossible to install libunicode and ICU in the same location.

codepoint mapping to their names

Quite a few codepoints have names that describe them.

It would be nice to have a getter from char32_t to string_view.

Maybe also a sub-namespace containing all of these names, nicely capitalized, as constexpr inline char32_t That_Name_Here = 0x1234;

namespace maybe: unicode::names
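A sketch of what that namespace could look like (the constants below carry real Unicode assignments, but the API shape itself is hypothetical):

```cpp
#include <cassert>

// Hypothetical unicode::names sub-namespace with capitalized constants.
namespace unicode::names
{
    constexpr inline char32_t Bullet = 0x2022;  // BULLET
    constexpr inline char32_t Em_Dash = 0x2014; // EM DASH
    constexpr inline char32_t Snowman = 0x2603; // SNOWMAN
}
```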

questions about packaging libunicode for Fedora

I did a trial build of libunicode from the latest commit tag on Fedora, installed the built libunicode rpm package, and a file conflicted with libicu-devel. Maybe installing the header files to another directory would solve it?

install(TARGETS ${INSTALL_TARGETS}
    EXPORT ${TARGETS_EXPORT_NAME}
    LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR}
    ARCHIVE DESTINATION ${CMAKE_INSTALL_LIBDIR}
    PUBLIC_HEADER DESTINATION "${CMAKE_INSTALL_PREFIX}/include/unicode"
    FRAMEWORK DESTINATION "."
    RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR})
install(
    FILES
        ucd.h
        ucd_enums.h
        ucd_fmt.h
        ucd_ostream.h
    DESTINATION "${CMAKE_INSTALL_INCLUDEDIR}/unicode"
)

[ruby@fedora x86_64]$ rpm -qlp ./libunicode-static-20230219.5987666-1.fc38.x86_64.rpm 
/usr/include/unicode/capi.h
/usr/include/unicode/codepoint_properties.h
/usr/include/unicode/convert.h
/usr/include/unicode/emoji_segmenter.h
/usr/include/unicode/grapheme_segmenter.h
/usr/include/unicode/intrinsics.h
/usr/include/unicode/run_segmenter.h
/usr/include/unicode/scan.h
/usr/include/unicode/script_segmenter.h
/usr/include/unicode/ucd.h
/usr/include/unicode/ucd_enums.h
/usr/include/unicode/ucd_fmt.h
/usr/include/unicode/ucd_ostream.h
/usr/include/unicode/utf8.h
/usr/include/unicode/utf8_grapheme_segmenter.h
/usr/include/unicode/width.h
/usr/include/unicode/word_segmenter.h
/usr/lib64/cmake/libunicode/libunicode-config-version.cmake
/usr/lib64/cmake/libunicode/libunicode-config.cmake
/usr/lib64/cmake/libunicode/unicode-targets-noconfig.cmake
/usr/lib64/cmake/libunicode/unicode-targets.cmake
/usr/lib64/libunicode.a
/usr/lib64/libunicode_loader.a
/usr/lib64/libunicode_ucd.a
[ruby@fedora x86_64]$ sudo dnf in ./libunicode-static-20230219.5987666-1.fc38.x86_64.rpm ./libunicode-20230219.5987666-1.fc38.x86_64.rpm 
Last metadata expiration check: 3:10:44 ago on Wed 22 Feb 2023 07:57:54 AM CST.
Dependencies resolved.
==============================================================================================================
 Package                     Architecture     Version                            Repository              Size
==============================================================================================================
Installing:
 libunicode                  x86_64           20230219.5987666-1.fc38            @commandline           1.1 M
 libunicode-static           x86_64           20230219.5987666-1.fc38            @commandline           506 k

Transaction Summary
==============================================================================================================
Install  2 Packages

Total size: 1.6 M
Installed size: 8.5 M
Is this ok [y/N]: y
Downloading Packages:
Running transaction check
Transaction check succeeded.
Running transaction test
Error: Transaction test error:
  file /usr/include/unicode/utf8.h from install of libunicode-static-20230219.5987666-1.fc38.x86_64 conflicts with file from package libicu-devel-72.1-2.fc38.x86_64

Add UTF-16 conversions

It shouldn't just be possible to convert between UTF-8 <-> UTF-32, but also:

  • UTF-16 <-> UTF-32

Maybe also generically between any two of the UTF-8/16/32 encodings?
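The UTF-32 -> UTF-16 direction is small enough to sketch here (a hypothetical function, not the library's API): codepoints above U+FFFF are split into a surrogate pair.

```cpp
#include <cassert>
#include <cstddef>

// Sketch: encode one codepoint as UTF-16; returns the number of code
// units written (1, or 2 for a surrogate pair). Assumes cp is a valid
// scalar value (<= U+10FFFF and not itself a surrogate).
inline size_t encode_utf16(char32_t cp, char16_t out[2]) noexcept
{
    if (cp < 0x10000)
    {
        out[0] = static_cast<char16_t>(cp);
        return 1;
    }
    auto const v = cp - 0x10000;
    out[0] = static_cast<char16_t>(0xD800 + (v >> 10));   // high surrogate
    out[1] = static_cast<char16_t>(0xDC00 + (v & 0x3FF)); // low surrogate
    return 2;
}

inline bool utf16_demo()
{
    char16_t buf[2] {};
    bool ok = encode_utf16(U'A', buf) == 1 && buf[0] == u'A';
    ok = ok && encode_utf16(0x1F600, buf) == 2 // U+1F600 GRINNING FACE
            && buf[0] == char16_t { 0xD83D } && buf[1] == char16_t { 0xDE00 };
    return ok;
}
```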

Hangul Jamo vowels and trailing consonants should probably be 0 width

U+1160..U+11FF and U+D7B0..U+D7FF should have 0 width.

Korean Hangul is a writing system which uses syllable blocks consisting of alphabetic components. A syllable consists of one or more Leading Consonants, one or more Vowels, and zero or more trailing consonants.

Unicode has 11,172 precomposed syllable blocks at U+AC00..U+D7A3.

There are also component Jamos:

  • Hangul Jamo (U+1100..U+11FF).
    • U+1100..U+115F Choseong (initial, Leading Consonants) have East_Asian_Width=Wide and Hangul_Syllable_Type=Leading_Jamo
    • U+1160..U+11A7 Jungseong (medial, Vowels) have East_Asian_Width=Neutral and Hangul_Syllable_Type=Vowel_Jamo
    • U+11A8..U+11FF Jongseong (final, Trailing consonants) have East_Asian_Width=Neutral and Hangul_Syllable_Type=Trailing_Jamo
  • U+A960..U+A97F Hangul Jamo Extended-A (choseong) have East_Asian_Width=Wide
  • U+D7B0..U+D7FF Hangul Jamo Extended-B (jungseong and jongseong) have East_Asian_Width=Neutral
  • U+3130..U+318F Hangul Compatibility Jamo have no conjoining behavior
  • U+FFA0..U+FFDF half-width forms have no conjoining behavior.

U+1100..U+11FF, U+A960..U+A97F, U+D7B0..U+D7FF have conjoining behavior, a sequence of L+V+T* gets rendered as a syllable block. wcwidth() implementations tend to give U+1100..U+115F width 2, and U+1160..U+11FF width 0, so the resulting syllable block has the correct total width.

U+D7B0..U+D7FF should also have width 0.

glibc gave width 0 to conjoining jungseong and jongseong at:

 commit 7a79e321c6f85b204036c33d85f6b2aa794e7c76
Author: Thorsten Glaser <[email protected]>
Date:   Fri Jul 14 14:02:50 2017 +0200

    Refresh generated charmap data and ChangeLog

            [BZ #21750]
            * charmaps/UTF-8: Refresh.

diff --git a/localedata/ChangeLog b/localedata/ChangeLog
index 04ef5ad071..9e05b4a652 100644
--- a/localedata/ChangeLog
+++ b/localedata/ChangeLog
@@ -1,3 +1,17 @@
+2017-07-14  Thorsten Glaser  <[email protected]>
+
+       [BZ #21750]
+       * charmaps/UTF-8: Refresh.
+       * unicode-gen/utf8_gen.py (U+00AD): Set width to 1.
+       * unicode-gen/utf8_gen.py (U+1160..U+11FF): Set width to 0.
+       * unicode-gen/utf8_gen.py (U+3248..U+324F): Set width to 2.
+       * unicode-gen/utf8_gen.py (U+4DC0..U+4DFF): Likewise.
+       * unicode-gen/utf8_gen.py: Treat category Me and Mn as combining.
+       [BZ #19852]
+       * unicode-gen/utf8_gen.py: Process EastAsianWidth lines before
+       UnicodeData lines so the latter have precedence; remove hack
+       to group output by EastAsianWidth ranges.
+

[ ... snip ...]

commit 6e540caa21616d5ec5511fafb22819204525138e
Author: Mike FABIAN <[email protected]>
Date:   Tue Jun 16 08:29:40 2020 +0200

    Set width of JUNGSEONG/JONGSEONG characters from UD7B0 to UD7FB to 0 [BZ #26120]
Reviewed-by: Carlos O'Donell <[email protected]>

diff --git a/localedata/charmaps/UTF-8 b/localedata/charmaps/UTF-8
index 14c5d4fa33..8cce47cd97 100644
--- a/localedata/charmaps/UTF-8
+++ b/localedata/charmaps/UTF-8
@@ -48920,6 +48920,8 @@ WIDTH
 <UABE8>        0
 <UABED>        0
 <UAC00>...<UD7A3>      2
+<UD7B0>...<UD7C6>      0
+<UD7CB>...<UD7FB>      0
 <UF900>...<UFA6D>      2
 <UFA70>...<UFAD9>      2
 <UFB1E>        0
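The rule proposed above can be condensed into a sketch (illustrative only, not libunicode's width() implementation): conjoining Jungseong/Jongseong get width 0 so that an L+V+T sequence renders as a single width-2 syllable block.

```cpp
#include <cassert>

// Sketch of the proposed Hangul Jamo widths; anything else would fall
// through to the regular width logic (simplified to 1 here).
constexpr int hangul_jamo_width(char32_t cp) noexcept
{
    if (cp >= 0x1100 && cp <= 0x115F) return 2; // Choseong (leading), Wide
    if (cp >= 0x1160 && cp <= 0x11FF) return 0; // Jungseong/Jongseong
    if (cp >= 0xA960 && cp <= 0xA97F) return 2; // Jamo Extended-A (choseong), Wide
    if (cp >= 0xD7B0 && cp <= 0xD7FF) return 0; // Jamo Extended-B
    return 1;
}
```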
