rime

rime is an extension library for the C++ standard library's <regex>.

Overview

  • Compile-time regex pattern check
  • Some utilities for regex
  • Header-only
  • Requires C++20 or later
    • GCC 11.1 or later
    • MSVC 2019 (latest Preview) or later
      • MSVC currently has a problem with consteval constructors, so some tests are disabled there

Facilities

Compile-time regex pattern check

Checks a regular expression string at compile time and raises a compile error if the pattern is invalid.

If the pattern is valid, the compiler reports nothing.

Note: currently, only the ECMAScript grammar is supported.

UDL

Pass a pattern string to std::regex through the user-defined literal ""_re.

#include <iostream>
#include "rime.hpp"

using namespace rime::literals;

int main() {
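  // The quantifier brace is never closed, so compilation fails here.
  std::regex re{R"_(\d{1,)_"_re};
}

GCC rejects this with a diagnostic like the following:

In file included from prog.cc:6: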
rime.hpp: In function 'int main()':
prog.cc:11:17:   in 'constexpr' expansion of 'rime::literals::operator""_re(((const char*)"\\d{1,"), 5)'
rime.hpp:724:32:   in 'constexpr' expansion of 'rime::patern_check<char>::start(std::basic_string_view<char>(str, len))'
rime.hpp:135:18:   in 'constexpr' expansion of 'rime::patern_check<char>::disjunction(it, ((const char*)fin))'
rime.hpp:144:18:   in 'constexpr' expansion of 'rime::patern_check<char>::alternative((* & it), ((rime::patern_check<char>::S)fin))'
rime.hpp:164:13:   in 'constexpr' expansion of 'rime::patern_check<char>::term((* & it), ((rime::patern_check<char>::S)fin))'
rime.hpp:175:21:   in 'constexpr' expansion of 'rime::patern_check<char>::quantifier((* & it), ((rime::patern_check<char>::S)fin))'
rime.hpp:210:24:   in 'constexpr' expansion of 'rime::patern_check<char>::quantifier_prefix((* & it), ((rime::patern_check<char>::S)fin))'
rime.hpp:280:33: error: call to non-'constexpr' function 'void rime::REGEX_PATERN_ERROR(const char*)'
  280 |               REGEX_PATERN_ERROR(R"_(Quantifiers braces are not closed. [Example: `\d{0, 10` ] )_");
      |               ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This can be used for wchar_t as well.

std::regex factory function

Use the factory function rime::regex() to create a std::regex from a pattern string.

rime::regex(regex_str) checks the pattern at compile time and returns a std::regex.

#include <iostream>
#include "rime.hpp"

using namespace rime::literals;

int main() {
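  // The quantifier brace is never closed, so compilation fails here.
  auto re = rime::regex(R"_(\d{1,)_");
}

GCC rejects this with a diagnostic like the following:

In file included from prog.cc:3: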
rime.hpp: In function 'int main()':
prog.cc:8:24:   in 'constexpr' expansion of 'rime::detail::regex_patern_str<char>("\\d{1,")'
rime.hpp:745:35:   in 'constexpr' expansion of 'rime::patern_check<char>::start(((rime::detail::regex_patern_str<char>*)this)->rime::detail::regex_patern_str<char>::str)'
rime.hpp:135:18:   in 'constexpr' expansion of 'rime::patern_check<char>::disjunction(it, ((const char*)fin))'
rime.hpp:144:18:   in 'constexpr' expansion of 'rime::patern_check<char>::alternative((* & it), ((rime::patern_check<char>::S)fin))'
rime.hpp:164:13:   in 'constexpr' expansion of 'rime::patern_check<char>::term((* & it), ((rime::patern_check<char>::S)fin))'
rime.hpp:175:21:   in 'constexpr' expansion of 'rime::patern_check<char>::quantifier((* & it), ((rime::patern_check<char>::S)fin))'
rime.hpp:210:24:   in 'constexpr' expansion of 'rime::patern_check<char>::quantifier_prefix((* & it), ((rime::patern_check<char>::S)fin))'
rime.hpp:280:33: error: call to non-'constexpr' function 'void rime::REGEX_PATERN_ERROR(const char*)'
  280 |               REGEX_PATERN_ERROR(R"_(Quantifiers braces are not closed. [Example: `\d{0, 10` ] )_");
      |               ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This can be used for wchar_t as well.

rime::regex_searches()

rime::regex_searches() searches the input string for every substring that matches the regular expression pattern.

It returns a range object that represents the entire search result.

If R is the return type of rime::regex_searches(str, regex), then R models forward_range and viewable_range.

#include <iostream>
#include "rime.hpp"

using namespace rime::literals;

int main() {
  const auto regex = rime::regex(R"(\d+)");

  for (const auto &m : rime::regex_searches("1421, 34353, 7685, 12765, 976754", regex)) {
    std::cout << m.str() << ' ';
  }
}

// Output
// 1421 34353 7685 12765 976754

This is a wrapper for std::regex_iterator, which applies std::regex_search repeatedly.

Appendix : ECMAScript RegExp Patterns

Pattern ::
    Disjunction

Disjunction ::
    Alternative
    Alternative | Disjunction

Alternative ::
    [empty]
    Alternative Term

Term ::
    Assertion
    Atom
    Atom Quantifier

Assertion :: 
    ^
    $
    \b
    \B

Quantifier ::
    QuantifierPrefix
    QuantifierPrefix ?

QuantifierPrefix ::
    *
    +
    ?
    { DecimalDigits }
    { DecimalDigits , }
    { DecimalDigits , DecimalDigits }

Atom ::
    PatternCharacter
    .
    \ AtomEscape
    CharacterClass
    ( Disjunction )
    ( ? : Disjunction )
    ( ? = Disjunction )
    ( ? ! Disjunction )

PatternCharacter :: SourceCharacter but not any of: 
    ^ $ \ . * + ? ( ) [ ] { } |

AtomEscape ::
    DecimalEscape
    CharacterEscape
    CharacterClassEscape

CharacterEscape ::
    ControlEscape
    c ControlLetter
    HexEscapeSequence
    UnicodeEscapeSequence
    IdentityEscape

ControlEscape :: one of
    f n r t v

ControlLetter :: one of
    a b c d e f g h i j k l m n o p q r s t u v w x y z
    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

IdentityEscape ::
    SourceCharacter but not IdentifierPart

DecimalEscape ::
    DecimalIntegerLiteral [lookahead ∉ DecimalDigit]

CharacterClassEscape :: one of
    d D s S w W

CharacterClass ::
    [ [lookahead ∉ {^}] ClassRanges ]
    [ ^ ClassRanges ]

ClassRanges ::
    [empty]
    NonemptyClassRanges

NonemptyClassRanges ::
    ClassAtom
    ClassAtom NonemptyClassRangesNoDash
    ClassAtom - ClassAtom ClassRanges

NonemptyClassRangesNoDash ::
    ClassAtom
    ClassAtomNoDash NonemptyClassRangesNoDash
    ClassAtomNoDash - ClassAtom ClassRanges

ClassAtom ::
    -
    ClassAtomNoDash

ClassAtomNoDash ::
    SourceCharacter but not one of \ ] -
    \ ClassEscape

ClassEscape ::
    DecimalEscape
    b
    CharacterEscape
    CharacterClassEscape

HexEscapeSequence ::
    x HexDigit HexDigit

UnicodeEscapeSequence ::
    u HexDigit HexDigit HexDigit HexDigit

HexDigit :: one of
    0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F

IdentifierStart ::
    UnicodeLetter
    $
    _
    \ UnicodeEscapeSequence

IdentifierPart ::
    IdentifierStart
    UnicodeCombiningMark
    UnicodeDigit
    UnicodeConnectorPunctuation
    \ UnicodeEscapeSequence

UnicodeLetter ::
    any character in the Unicode categories “Uppercase letter (Lu)”, “Lowercase letter (Ll)”, “Titlecase letter (Lt)”, “Modifier letter (Lm)”, “Other letter (Lo)”, or “Letter number (Nl)”.

UnicodeCombiningMark ::
    any character in the Unicode categories “Non-spacing mark (Mn)” or “Combining spacing mark (Mc)”

UnicodeDigit ::
    any character in the Unicode category “Decimal number (Nd)”

UnicodeConnectorPunctuation ::
    any character in the Unicode category “Connector punctuation (Pc)”

C++ Modified ECMAScript regular expression grammar

ClassAtom ::
    -
    ClassAtomNoDash
    ClassAtomExClass
    ClassAtomCollatingElement
    ClassAtomEquivalence

IdentityEscape ::
    SourceCharacter but not c

ClassAtomExClass ::
    [: ClassName :]

ClassAtomCollatingElement ::
    [. ClassName .]

ClassAtomEquivalence ::
    [= ClassName =]

ClassName ::
    ClassNameCharacter
    ClassNameCharacter ClassName

ClassNameCharacter ::
    SourceCharacter but not one of . or = or :

Further modifications in this library

In rime, the grammar is further modified to match the behavior of actual implementations (GCC/Clang).

ClassAtom ::
    -
    ClassAtomNoDash

ClassAtomNoDash ::
    SourceCharacter but not one of \ ] -
    \ ClassEscape
    ClassAtomExClass
    ClassAtomCollatingElement
    ClassAtomEquivalence

This allows patterns like [abcd[:digit:]efgh].
(Under the previous definition, [abcd[:digit:]] and [[:digit:]abcd] were valid, but [abcd[:digit:]efgh] was not.)
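
The runtime behavior this modification mirrors can be observed with plain std::regex. The sketch below assumes a standard library that accepts a character class name in the middle of a bracket expression, as the text says GCC and Clang do. `sketch::in_mixed_class` is a hypothetical helper name.

```cpp
#include <regex>
#include <string>

namespace sketch {

// Returns true when `text` is exactly one character accepted by the bracket
// expression [abcd[:digit:]efgh]: one of a-h or a decimal digit.
inline bool in_mixed_class(const std::string& text) {
  static const std::regex re("[abcd[:digit:]efgh]");
  return std::regex_match(text, re);
}

}  // namespace sketch
```

Under that assumption, `sketch::in_mixed_class("7")` and `sketch::in_mixed_class("h")` hold, while `sketch::in_mixed_class("z")` does not.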

rime's People

Contributors

kariya-mitsuru, onihusube

rime's Issues

Implement parsing of IdentityEscape

Not implemented yet because the rule is not well understood.

IdentityEscape ::
    SourceCharacter but not IdentifierPart

IdentifierStart :: 
    UnicodeLetter
    $
    _
    \ UnicodeEscapeSequence

IdentifierPart :: 
    IdentifierStart
    UnicodeCombiningMark
    UnicodeDigit
    UnicodeConnectorPunctuation
    \ UnicodeEscapeSequence

UnicodeLetter ::
    any character in the Unicode categories “Uppercase letter (Lu)”, “Lowercase letter (Ll)”, “Titlecase letter (Lt)”, “Modifier letter (Lm)”, “Other letter (Lo)”, or “Letter number (Nl)”.

UnicodeCombiningMark ::
    any character in the Unicode categories “Non-spacing mark (Mn)” or “Combining spacing mark (Mc)”

UnicodeDigit ::
    any character in the Unicode category “Decimal number (Nd)”

UnicodeConnectorPunctuation ::
    any character in the Unicode category “Connector punctuation (Pc)”

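Mismatched opening and closing characters in ClassAtomExClass and similar productions are not reported as errors

For ClassAtomExClass and similar productions, a pattern such as [[:upper.]] appears to be an error in GCC and Clang, but rime lets it through without an error, so the failure surfaces as a runtime error instead.

As an aside, I personally think that [:upper. matches none of the productions ClassAtomExClass, ClassAtomCollatingElement, or ClassAtomEquivalence at that point, so every character should be interpreted as an independent ClassAtomNoDash. In practice, however, implementations do not seem to interpret it that way.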
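
The missing check could be sketched as a constexpr predicate. This is a simplified illustration, assuming that the opening delimiter alone determines the required closing pair. `sketch::closes_with_same_delimiter` is a hypothetical name and not part of rime.

```cpp
#include <cstddef>
#include <string_view>

namespace sketch {

// `open` is the index of a '[' that starts "[:", "[." or "[=" inside a
// bracket expression. Returns true only if the same delimiter character,
// immediately followed by ']', appears later in the pattern.
constexpr bool closes_with_same_delimiter(std::string_view p, std::size_t open) {
  if (open + 1 >= p.size() || p[open] != '[') return false;
  const char delim = p[open + 1];
  if (delim != ':' && delim != '.' && delim != '=') return false;
  const std::size_t close = p.find(delim, open + 2);
  return close != std::string_view::npos && close + 1 < p.size() &&
         p[close + 1] == ']';
}

}  // namespace sketch
```

For "[[:upper:]]" with `open == 1` the predicate holds. For "[[:upper.]]" it does not, because no later ':' is followed by ']'.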

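DecimalEscape with three or more digits is reported as an error

A DecimalEscape may have three or more digits; it consumes all of the digits that follow.
Example: https://wandbox.org/permlink/D4GXKWjaDVhvpsSG

The only exception is a DecimalEscape that starts with 0: only in that case must no digit follow.
(Admittedly, 100 or more capture groups is not something you normally see...)

Incidentally, the number denoted by a DecimalEscape must not exceed the number of capture groups that appear before it (to its left in the string); otherwise a runtime exception is thrown. It would be even better if that could be checked as well.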
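
Both rules described in this issue can be expressed as a small constexpr check. The sketch below is an illustration under stated assumptions and not rime's parser. Character classes are not treated specially, groups that are still open count as "to the left", and the names `sketch::groups_before` and `sketch::decimal_escapes_ok` are hypothetical.

```cpp
#include <cstddef>
#include <string_view>

namespace sketch {

constexpr bool is_digit(char c) { return '0' <= c && c <= '9'; }

// Counts capturing groups opened in p[0, end): an unescaped '(' that is not
// followed by '?'. Character classes are ignored for brevity (assumption).
constexpr std::size_t groups_before(std::string_view p, std::size_t end) {
  std::size_t n = 0;
  for (std::size_t i = 0; i < end && i < p.size(); ++i) {
    if (p[i] == '\\') { ++i; continue; }  // skip the escaped character
    if (p[i] == '(' && !(i + 1 < p.size() && p[i + 1] == '?')) ++n;
  }
  return n;
}

// Validates every DecimalEscape in p:
//  - "\0" must not be followed by another digit;
//  - otherwise all following digits are consumed, and the value must not
//    exceed the number of capture groups to its left.
constexpr bool decimal_escapes_ok(std::string_view p) {
  for (std::size_t i = 0; i + 1 < p.size(); ++i) {
    if (p[i] != '\\') continue;
    std::size_t j = i + 1;
    if (!is_digit(p[j])) { i = j; continue; }  // some other escape, skip it
    if (p[j] == '0') {
      if (j + 1 < p.size() && is_digit(p[j + 1])) return false;
      i = j;
      continue;
    }
    std::size_t value = 0;
    while (j < p.size() && is_digit(p[j])) {
      value = value * 10 + static_cast<std::size_t>(p[j++] - '0');
      if (value > p.size()) return false;  // cannot have that many groups
    }
    if (value > groups_before(p, i)) return false;
    i = j - 1;  // continue after the last consumed digit
  }
  return true;
}

}  // namespace sketch
```

In this sketch `(a)\1` is accepted, while `\1(a)`, `(a)\100` and `\01` are rejected.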
