is_utf8

Most strings online are Unicode text in the UTF-8 encoding. Validating these strings quickly before accepting them is important.

How to use is_utf8

This is a simple single-source-file library that validates UTF-8 strings at high speed using SIMD instructions. It works on all platforms (ARM, x64).

Build and link is_utf8.cpp with your project. Code usage:

  #include "is_utf8.h"

  char * mystring = ...
  bool is_it_valid = is_utf8(mystring, thestringlength);

It should be able to validate strings using less than 1 cycle per input byte.
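
To make concrete what "valid UTF-8" means here, the following is a plain scalar sketch of the well-formedness rules (correct continuation bytes, no overlong encodings, no surrogates, nothing above U+10FFFF). It is an illustration only: the name is_utf8_scalar is hypothetical, and this byte-at-a-time loop is not the library's SIMD implementation and will be much slower.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical reference function (not part of is_utf8): checks the UTF-8
// well-formedness rules one byte at a time.
bool is_utf8_scalar(const char *src, std::size_t len) {
  const unsigned char *p = reinterpret_cast<const unsigned char *>(src);
  std::size_t i = 0;
  while (i < len) {
    unsigned char b = p[i];
    if (b < 0x80) { i++; continue; }  // ASCII byte
    std::size_t n;      // number of continuation bytes expected
    std::uint32_t cp;   // code point being decoded
    std::uint32_t min;  // smallest code point allowed for this length
    if ((b & 0xE0) == 0xC0) { n = 1; cp = b & 0x1F; min = 0x80; }
    else if ((b & 0xF0) == 0xE0) { n = 2; cp = b & 0x0F; min = 0x800; }
    else if ((b & 0xF8) == 0xF0) { n = 3; cp = b & 0x07; min = 0x10000; }
    else return false;  // stray continuation byte or invalid leading byte
    if (len - i <= n) return false;  // sequence is truncated
    for (std::size_t k = 1; k <= n; k++) {
      unsigned char c = p[i + k];
      if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
      cp = (cp << 6) | (c & 0x3F);
    }
    if (cp < min) return false;                      // overlong encoding
    if (cp > 0x10FFFF) return false;                 // beyond Unicode range
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
    i += n + 1;
  }
  return true;
}
```

A scalar function like this can serve as a cross-check in tests: for any input, it should agree with the result of is_utf8.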

Requirements

  • C++11 compatible compiler. We support LLVM clang, GCC, Visual Studio. (Our optional benchmark tool requires C++17.)
  • For high speed, you should have a recent 64-bit system (e.g., ARM or x64).
  • If you rely on CMake, you should use a recent CMake (at least 3.15).
  • AVX-512 support requires a processor with AVX512-VBMI2 (Ice Lake or better) and a recent compiler (GCC 8 or better, Visual Studio 2019 or better, LLVM clang 6 or better). You need a correspondingly recent assembler such as gas (2.30+) or nasm (2.14+); recent compilers usually come with recent assemblers. If you mix a recent compiler with an incompatible or old assembler (e.g., when using a recent compiler with an old Linux distribution), you may get build-time errors because the compiler produces instructions that the assembler does not recognize. In that case, update your assembler to match your compiler (e.g., upgrade binutils to version 2.30 or better under Linux) or use an older compiler matching the capabilities of your assembler.

Build with CMake

cmake -B build
cmake --build build
cd build
ctest .

Visual Studio users must specify whether they want to build the Release or Debug version.

To run benchmarks, build and execute the bench command.

cmake -B build
cmake --build build
./build/benchmarks/bench

Instructions are similar for Visual Studio users.
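
If you would rather time a validator from inside your own program, a rough sketch follows. It is not the project's bench tool: measure_gbps and cycles_per_byte are hypothetical helper names, and the clock frequency passed to cycles_per_byte is something you must supply yourself. Any callable taking (const char *, size_t) can be passed, including is_utf8.

```cpp
#include <chrono>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of is_utf8): runs `validate(data, length)`
// `rounds` times over `input` and returns the throughput in GB/s.
template <typename Validator>
double measure_gbps(Validator validate, const std::string &input, int rounds) {
  volatile bool sink = false;  // discourages the compiler from dropping calls
  auto start = std::chrono::steady_clock::now();
  for (int r = 0; r < rounds; r++) {
    sink = validate(input.data(), input.size());
  }
  auto stop = std::chrono::steady_clock::now();
  (void)sink;
  double seconds = std::chrono::duration<double>(stop - start).count();
  if (seconds <= 0.0) return 0.0;  // timer too coarse for this workload
  return static_cast<double>(input.size()) * rounds / seconds / 1e9;
}

// Converts throughput to cycles per byte for a given clock frequency.
// Example: 6 GB/s on a 3 GHz core is 0.5 cycles per byte.
double cycles_per_byte(double clock_ghz, double gbps) {
  return clock_ghz / gbps;
}
```

Under these assumptions, staying below 1 cycle per input byte on a 3 GHz core corresponds to a throughput above 3 GB/s. Use a large input and many rounds, since short runs are dominated by timer noise.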

Real-world usage

This C++ library is part of the JavaScript package utf-8-validate. The utf-8-validate package is routinely downloaded more than a million times per week.

If you are using Node.js (19.4.0 or better), you already have access to this function as buffer.isUtf8(input).

Reference

Want more?

If you want a wide range of fast Unicode functions for production use, you can rely on the simdutf library. It is as simple as the following:

#include <cstdlib>   // EXIT_FAILURE
#include <iostream>  // std::cout, std::cerr

#include "simdutf.cpp"
#include "simdutf.h"

int main(int argc, char *argv[]) {
  const char *source = "1234";
  // 4 == strlen(source)
  bool validutf8 = simdutf::validate_utf8(source, 4);
  if (validutf8) {
    std::cout << "valid UTF-8" << std::endl;
  } else {
    std::cerr << "invalid UTF-8" << std::endl;
    return EXIT_FAILURE;
  }
}

See https://github.com/simdutf/

License

This library is distributed under the terms of any of the following licenses, at your option:
