utf8_string's Introduction

UTF-8 string

This is a simple implementation of UTF-8 strings in C++.

Implementation

UTF8string is based on std::string provided by the standard C++ library but has been implemented to support UTF-8 encoded strings.

Some functions have been adapted for UTF-8 strings:

  • utf8_length : get number of characters in a string (number of codepoints).
  • utf8_size : get the memory size of the string (in bytes).
  • utf8_find : find a utf8 substring in the current string.
  • utf8_substr : get a utf8 substring of the current string.
  • utf8_at : get the codepoint at a specified position.
  • utf8_pop : remove the last codepoint of the string.
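
The difference between utf8_length and utf8_size can be illustrated with a minimal sketch. It assumes the input is already valid UTF-8; count_codepoints is a hypothetical helper written for this illustration and is not the library's implementation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of the library: counts the code points in a
// string that is assumed to already be valid UTF-8.
std::size_t count_codepoints(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char byte : s)
    {
        // Continuation bytes have the bit pattern 10xxxxxx. Every other byte
        // starts a new code point.
        if ((byte & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For a string made of one three-byte character followed by one ASCII letter, s.size() is 4 while count_codepoints(s) is 2.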

Usage

You just need to add all of the .hpp and .cpp files from src/ to your project. In each file that uses UTF8string, add this include:

#include "utf8_string.hpp"

Code example

UTF8string u8("がんばつて Gumichan");
UTF8string sub = u8.utf8_substr(0,5);
size_t pos = u8.utf8_find(UTF8string("chan"));
size_t sz  = u8.utf8_size();
size_t l   = u8.utf8_length();

std::cout << "u8 string: " << u8 << "\n";
std::cout << "utf8 substring from 0 to 5: " << sub << "\n";
std::cout << "utf8 codepoint at 2: " << u8.utf8_at(2) << "\n";
std::cout << "utf8 string \"chan\" at " << pos << "\n";
std::cout << "u8 string - memory size: " << sz << "; length: " << l << "\n\n";

for (auto s: sub)    // or for (const std::string& s: u8)
{
    std::cout << "-> " << s << "\n";
}

Output:

u8 string: がんばつて Gumichan
utf8 substring from 0 to 5: がんばつて
utf8 codepoint at 2: ば
utf8 string "chan" at 10
u8 string - memory size: 24; length: 14

-> が
-> ん
-> ば
-> つ
-> て

Project that uses UTF8string

License

This library is under the MIT License.

utf8_string's People

Contributors

gumichan01

utf8_string's Issues

new function: utf8_assign

Specification

Replace the contents of the string.

Signature

UTF8string& utf8_assign( const char * str )
UTF8string& utf8_assign( const u8string& str )
UTF8string& utf8_assign( const u8string& str, size_t pos, size_t count = npos )
UTF8string& utf8_assign( UTF8string&& u8str )

Internal structure of UTF8string

The problem

UTF8string internally uses std::string, which is an alias of std::basic_string<char>.

This is a problem because the bytes of a UTF-8 sequence are values between 0 and 255 (0xFF), that is, "unsigned char" values, whereas the signedness of plain char is implementation-defined. So, when I check whether a string is valid UTF-8, I inspect it as an std::string (a sequence of char) instead of as a sequence of unsigned char ("real bytes"). Consequently, I have some trouble with byte comparisons (unwanted implicit conversions (>.<)).
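
To make the pitfall concrete: on a platform where plain char is signed and 8 bits wide, an expression such as s[i] >= 0x80 is always false, because s[i] is promoted to a negative int. A minimal sketch of one way around this, without changing the underlying string type, is to convert each byte to unsigned char before comparing. The names byte_at and is_continuation are hypothetical and are not part of UTF8string.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: reads a byte of the string as a value in [0, 255],
// regardless of whether plain char is signed on this platform.
inline unsigned char byte_at(const std::string& s, std::size_t i)
{
    return static_cast<unsigned char>(s[i]);
}

// Hypothetical helper: true if the byte at index i is a UTF-8 continuation
// byte, i.e. in the range 0x80..0xBF.
inline bool is_continuation(const std::string& s, std::size_t i)
{
    const unsigned char b = byte_at(s, i);
    return b >= 0x80 && b <= 0xBF;
}
```

With the cast in place, comparisons against hexadecimal constants behave the same way on every platform.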

Possible solution

I think it would be very helpful to use std::basic_string<unsigned char> instead of std::string in order to handle the internal string properly. Fortunately, the interface does not need to be changed.

4-byte UTF-8 character validation

The way you validate 4-byte UTF-8 characters is questionable.
The first byte of a 4-byte character must be between 0xF0 and 0xF4, which you forget to test.

// If the first byte of the sequence is 0xF0
// then the first continuation byte must be between 0x90 and 0xBF
// otherwise, if the byte is 0xF4
// then the first continuation byte must be between 0x80 and 0x8F
if(*it == '\xF0')
{
    if(*(it + 1) < '\x90' || *(it + 1) > '\xBF')
        return false;
}
else if(*it == '\xF4')
{
    if(*(it + 1) > '\x8F')
        return false;
}
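
For comparison, here is a minimal sketch of a check that also tests the lead byte. It is not the library's code: is_valid_4byte_sequence is a hypothetical free function, and the byte ranges are taken from RFC 3629 (0xF0 followed by 0x90..0xBF, 0xF1..0xF3 followed by 0x80..0xBF, 0xF4 followed by 0x80..0x8F, then two continuation bytes).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: checks whether a well-formed 4-byte UTF-8 sequence
// starts at index i of s. Byte ranges follow RFC 3629.
bool is_valid_4byte_sequence(const std::string& s, std::size_t i)
{
    // The sequence needs four bytes starting at i.
    if (i > s.size() || s.size() - i < 4)
        return false;

    // Work on unsigned values so that comparisons do not depend on the
    // signedness of plain char.
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    const unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
    const unsigned char b2 = static_cast<unsigned char>(s[i + 2]);
    const unsigned char b3 = static_cast<unsigned char>(s[i + 3]);

    // The lead byte of a 4-byte sequence must be in 0xF0..0xF4.
    if (b0 < 0xF0 || b0 > 0xF4)
        return false;

    // The allowed range of the second byte depends on the lead byte.
    unsigned char lo = 0x80;
    unsigned char hi = 0xBF;
    if (b0 == 0xF0)
        lo = 0x90;  // rejects overlong encodings
    else if (b0 == 0xF4)
        hi = 0x8F;  // rejects code points above U+10FFFF
    if (b1 < lo || b1 > hi)
        return false;

    // The last two bytes must be ordinary continuation bytes.
    return b2 >= 0x80 && b2 <= 0xBF && b3 >= 0x80 && b3 <= 0xBF;
}
```

With this check, a sequence starting with 0xF5 is rejected, while F0 9F 98 80 (U+1F600) and F4 8F BF BF (U+10FFFF) are accepted.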
