rdf4cpp's Introduction

⚠️ This repo is work-in-progress! Before v0.1.0 all APIs are considered unstable and might be subject to change. ⚠️

⚠️ Conan 2 only works when consuming the rdf4cpp conan package. Other use cases (e.g. development) still require Conan 1. ⚠️

rdf4cpp

rdf4cpp aims to be a stable, modern RDF library for C++.

Current documentation: https://rdf4cpp.readthedocs.io/en/latest/

Usage

Check out the examples directory.

As Conan Package

Until its first stable release, rdf4cpp will not be available via Conan Center. Instead, it is available via the artifactory of the DICE Research Group.

You need the package manager Conan installed and set up. You can add the DICE artifactory with:

conan remote add dice-group https://conan.dice-research.org/artifactory/api/conan/tentris

To use rdf4cpp, add it to your conanfile.txt:

[requires]
rdf4cpp/0.0.30

Note:

If you want to include rdf4cpp without using conan, make sure you also include its dependencies exposed via the rdf4cpp API.

With FetchContent

Use

include(FetchContent)
FetchContent_Declare(
        rdf4cpp
        GIT_REPOSITORY "${CMAKE_CURRENT_SOURCE_DIR}/../"
        GIT_TAG v0.0.30
        GIT_SHALLOW TRUE
)
FetchContent_MakeAvailable(rdf4cpp)

to make the library target rdf4cpp::rdf4cpp available.

Beware: Conan will not be used for dependency retrieval if you include rdf4cpp via FetchContent. It is your responsibility to make sure that all dependencies are available beforehand.

Build

Requirements

Currently, rdf4cpp builds only on Linux with a C++20-compatible compiler. CI builds and tests rdf4cpp with gcc-{13}, clang-{15,16} with libstdc++-13 on Ubuntu 22.04.

Dependencies

It is recommended to include build dependencies via Conan version 1. Set up Conan as follows on Ubuntu 22.04+:

sudo apt install python3-pip
pip3 install --user "conan<2"
conan user
conan profile new --detect default
conan profile update settings.compiler.libcxx=libstdc++11 default
conan remote add dice-group https://conan.dice-research.org/artifactory/api/conan/tentris

Compile

rdf4cpp uses CMake. To build it, run:

cmake -B build_dir # configure and generate
cmake --build build_dir # compile

To install it to your system, run afterwards:

sudo cmake --install build_dir

Additional CMake config options:

-DBUILD_EXAMPLES=ON/OFF [default: OFF]: Build the examples.

-DBUILD_TESTING=ON/OFF [default: OFF]: Build the tests.

-DBUILD_SHARED_LIBS=ON/OFF [default: OFF]: Build a shared library instead of a static one.

-DUSE_CONAN=ON/OFF [default: ON]: If available, use Conan to retrieve dependencies.

rdf4cpp's People

Contributors

bigerl, clueliss, kaimal11, konradhoeffner, lukaskerk, mcb5637, nkaralis

rdf4cpp's Issues

C-style casts should be replaced

C-style casts have a great potential for bugs and undefined behaviour, because
they can perform any kind of c++-style cast to make the conversion work (including reinterpret_cast and const_cast).
In almost all cases you don't actually want to reinterpret memory or strip away const, but a C-style cast will happily do
both and give you no warning it did so.

In my opinion they don't belong in a modern C++ codebase and should all be replaced by the appropriate c++-style casts.
We should also enable -Wold-style-cast and -Wcast-qual.

Just look at this:

#include <iostream>
#include <string>

struct S {
    void *ignore; // the allocator occupies this space
    size_t len;

    void some_func() {
        std::cout << "len is: " << len << '\n';
    }
};

int main() {
    std::string const s = "123";
    S &b = (S &)s; // no longer const; and the conversion just doesn't make any sense

    b.some_func();
}

Yes, this compiles without any warning by default and prints "len is: 3", see: https://godbolt.org/z/oe1nx93aj
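For contrast, here is a minimal sketch of what the named casts look like in practice. The helper names ratio and length_of are made up for illustration and are not part of rdf4cpp.

```cpp
#include <cstddef>
#include <string>

// static_cast only permits conversions the type system considers
// meaningful, here integer to floating point.
inline double ratio(std::size_t part, std::size_t total) {
    return static_cast<double>(part) / static_cast<double>(total);
}

// Reads the length through the public API, so const stays intact.
inline std::size_t length_of(std::string const &s) {
    // static_cast<S &>(s) would be rejected by the compiler: the types are
    // unrelated and the cast would drop const. Expressing the conversion
    // from the example above would take an explicit reinterpret_cast plus
    // a const_cast, which makes the danger visible in code review.
    return s.size();
}
```

With -Wold-style-cast enabled, any remaining C-style cast is reported by the compiler, so the replacement can be done incrementally.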

Future improvement: rdf:langString (lang tag) inlining

It could be possible to inline the language tag for some rdf:langStrings by sacrificing some bits in
the LiteralID to encode a fixed set of lang tags and using only the remaining bits to refer to the backend.
This only works for sufficiently small backend ids.
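A rough sketch of the bit layout this idea implies. The 42-bit id width, the 2 tag bits, the chosen tags and all function names are assumptions for illustration only and do not reflect the actual LiteralID layout.

```cpp
#include <array>
#include <cstdint>
#include <optional>
#include <string_view>
#include <utility>

// Hypothetical layout: a 42-bit literal id whose top 2 bits select a tag.
inline constexpr unsigned id_bits = 42;
inline constexpr unsigned tag_bits = 2;
inline constexpr unsigned payload_bits = id_bits - tag_bits;
inline constexpr std::uint64_t payload_mask = (std::uint64_t{1} << payload_bits) - 1;

// Fixed set of inlineable language tags (illustrative selection).
inline constexpr std::array<std::string_view, 1u << tag_bits> inlined_tags{"de", "en", "fr", "es"};

// Returns the packed id, or nullopt if the tag is not in the fixed set
// or the backend id does not fit into the remaining bits.
inline std::optional<std::uint64_t> try_inline_lang(std::uint64_t backend_id, std::string_view tag) {
    if (backend_id > payload_mask)
        return std::nullopt;
    for (std::uint64_t i = 0; i < inlined_tags.size(); ++i)
        if (inlined_tags[i] == tag)
            return (i << payload_bits) | backend_id;
    return std::nullopt;
}

// Splits a packed id (as produced by try_inline_lang) into tag and backend id.
inline std::pair<std::string_view, std::uint64_t> split_inlined_lang(std::uint64_t packed) {
    return {inlined_tags[packed >> payload_bits], packed & payload_mask};
}
```

Literals for which try_inline_lang returns nullopt would keep the existing, non-inlined representation.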

Feature: Cast function for Literal

It would be handy to be able to cast Literals to different types e.g. cast xsd:int to an xsd:long.
Implementation can use existing type hierarchy.
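As a sketch of the two directions such a cast has to handle, using plain integers as stand-ins for the Literal values: widening along the hierarchy always succeeds, while narrowing needs a range check. The function names are hypothetical.

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// xsd:int -> xsd:long: every int32_t value is representable as int64_t.
inline std::int64_t widen_int_to_long(std::int32_t v) noexcept {
    return v;
}

// xsd:long -> xsd:int: only succeeds if the value fits into int32_t.
inline std::optional<std::int32_t> narrow_long_to_int(std::int64_t v) noexcept {
    if (v < std::numeric_limits<std::int32_t>::min() || v > std::numeric_limits<std::int32_t>::max())
        return std::nullopt;
    return static_cast<std::int32_t>(v);
}
```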

Single Iterable for Parsing RDF file

With #76, an iterable parser for RDF is introduced that is very efficient and flexible but needs several lines of code to use. Additionally, it would be nice to have an iterable for the default case where one wants to extract triples or quads from a file:

for (auto &entry : RDFFileParser{"myrdffile.ttl"})
    if (entry.has_value())
        std::cout << entry.value() << '\n';
    else
        std::cerr << entry.error() << '\n';

An (unfinished) draft could look like this:

class RDFFileParser {
    std::string file_path_;

public:
    explicit RDFFileParser(std::string file_path, ParsingFlags flags = ParsingFlags::none(), storage::node::NodeStorage node_storage = storage::node::NodeStorage::default_instance())
        : file_path_(std::move(file_path)) {}

    class iterator {
        std::unique_ptr<std::ifstream> ifstream_;
        std::unique_ptr<IStreamQuadIterator> iter_;

    public:
        using value_type = IStreamQuadIterator::value_type;
        using reference = IStreamQuadIterator::reference;
        using pointer = IStreamQuadIterator::pointer;
        using difference_type = IStreamQuadIterator::difference_type;
        using iterator_category = IStreamQuadIterator::iterator_category;
        using istream_type = IStreamQuadIterator::istream_type;

        iterator() = default;

        explicit iterator(std::string const &file_path)
            : ifstream_([&file_path]() -> std::unique_ptr<std::ifstream> {
                  std::ifstream ifs{file_path};
                  if (ifs.is_open())
                      return std::make_unique<std::ifstream>(std::move(ifs));
                  else
                      return {};
              }()),
              iter_([this]() -> std::unique_ptr<IStreamQuadIterator> {
                  if (this->ifstream_)
                      return std::make_unique<IStreamQuadIterator>(*this->ifstream_);
                  else
                      return {};
              }()) {}

    protected:
        [[nodiscard]] bool is_at_end() const noexcept {
            return not ifstream_ or iter_->is_at_end();
        }

    public:
        reference operator*() const noexcept { return (*iter_).operator*(); }
        pointer operator->() const noexcept { return (*iter_).operator->(); }
        iterator &operator++() {
            ++(*iter_);
            return *this;
        }

        bool operator==(bool other) const noexcept {
            return not is_at_end() == other;
        }
        bool operator==(iterator const &other) const noexcept { return (*iter_) == (*other.iter_); }
        bool operator!=(iterator const &other) const noexcept { return (*iter_) != (*other.iter_); }
    };

    [[nodiscard]] iterator begin() const {
        return iterator{file_path_};
    }

    [[nodiscard]] bool end() const {
        return false;
    }
};

fix return types of Literal logical operators

Currently operator &&, || and ! of Literal return another Literal containing the result of applying
the respective operator to the effective boolean value of the arguments.
This is inefficient as it requires calls to the node storage and registry to actually do anything useful with the returned
values. Operator chaining is especially expensive as each intermediate value requires a call to the node storage and registry.

Proposed fix:

  1. implement #57
class Literal {
    ...
    - Literal logical_and(Literal const &other, NodeStorage &node_storage = NodeStorage::default_instance()) const;
    - Literal operator&&(Literal const &other) const;

    - Literal logical_or(Literal const &other, NodeStorage &node_storage = NodeStorage::default_instance()) const;
    - Literal operator||(Literal const &other) const;

    - Literal logical_not(NodeStorage &node_storage = NodeStorage::default_instance()) const;
    - Literal operator!() const;
    
    + TriBool logical_and(Literal const &other) const;
    + TriBool operator&&(Literal const &other) const;

    + TriBool logical_or(Literal const &other) const;
    + TriBool operator||(Literal const &other) const;

    + TriBool logical_not() const; 
    + TriBool operator!() const;
   ...
};

Note: The proposed solution does not break operator chaining, see: https://godbolt.org/z/GY3x7s51K
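To illustrate why chaining keeps working, here is a small stand-in for TriBool with three-valued logic in the style of SPARQL. The enum and its operators are a sketch written for this note, not the TriBool from the linked example.

```cpp
// Hypothetical stand-in for the TriBool proposed above.
enum class TriBool { Err, False, True };

// An error is absorbed whenever the other operand already decides the result.
constexpr TriBool operator&&(TriBool a, TriBool b) noexcept {
    if (a == TriBool::False || b == TriBool::False)
        return TriBool::False;
    if (a == TriBool::Err || b == TriBool::Err)
        return TriBool::Err;
    return TriBool::True;
}

constexpr TriBool operator||(TriBool a, TriBool b) noexcept {
    if (a == TriBool::True || b == TriBool::True)
        return TriBool::True;
    if (a == TriBool::Err || b == TriBool::Err)
        return TriBool::Err;
    return TriBool::False;
}

constexpr TriBool operator!(TriBool a) noexcept {
    switch (a) {
        case TriBool::True:
            return TriBool::False;
        case TriBool::False:
            return TriBool::True;
        default:
            return TriBool::Err;
    }
}
```

Because every operator takes and returns TriBool, intermediate results never touch the node storage. Note that overloaded && and || do not short-circuit: both operands are always evaluated.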

add operator TriBool() for Literal

Literal would be more convenient to use if it had an operator TriBool() that returns the effective boolean value (EBV).

Proposed change:

- Literal Literal::effective_boolean_value() const;
+ TriBool Literal::effective_boolean_value() const;
+ Literal::operator TriBool() const;
+ Literal Literal::into_effective_boolean_value() const;

Fixed IDs for LiteralDatatypes

Literal datatypes should have fixed IDs. This enables faster processing of type hierarchies. The datatype <-> ID mapping must be available at compile time. It would be beneficial if it depended on neither the DatatypeRegistry nor the NodeStorage.

  • Every NodeStorage must use these fixed IDs for the IRIs of datatypes.
  • These IDs should be used in Literal datatype handling instead of the datatype IRI.
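One way such a compile-time mapping could look, as a sketch: a constexpr table that needs neither a registry nor a node storage. The struct, the function names and the concrete ID values are assumptions for illustration.

```cpp
#include <array>
#include <cstdint>
#include <optional>
#include <string_view>

struct FixedDatatype {
    std::string_view iri;
    std::uint16_t id;
};

// Illustrative table; the real selection of datatypes and IDs is undecided.
inline constexpr std::array<FixedDatatype, 3> fixed_datatypes{{
        {"http://www.w3.org/2001/XMLSchema#string", 1},
        {"http://www.w3.org/2001/XMLSchema#int", 2},
        {"http://www.w3.org/2001/XMLSchema#long", 3},
}};

// IRI -> fixed ID, usable in constant expressions.
constexpr std::optional<std::uint16_t> fixed_id_of(std::string_view iri) noexcept {
    for (auto const &entry : fixed_datatypes)
        if (entry.iri == iri)
            return entry.id;
    return std::nullopt;
}

// Fixed ID -> IRI, usable in constant expressions.
constexpr std::optional<std::string_view> iri_of(std::uint16_t id) noexcept {
    for (auto const &entry : fixed_datatypes)
        if (entry.id == id)
            return entry.iri;
    return std::nullopt;
}

// The mapping is checked by the compiler, not at runtime.
static_assert(fixed_id_of("http://www.w3.org/2001/XMLSchema#int") == 2);
```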

QuadPattern string not correct

Describe the bug
The QuadPattern string representation is not well-formed, e.g.: ?g ?s ?p?o .

Expected behavior
?g ?s ?p ?o .

Add rdf:langString datatype to make "graphs_and_datasets" example work

Currently rdf:langString does not exist as a LiteralDatatype, but it is used in the mentioned example as though it is one.
This provokes a bug related to #55 once #56 is merged: during comparisons, the type must exist in the registry (which it does not) so that the function performing the actual comparison can be fetched.

To Fix:

  • "Just" define it as a LiteralDatatype
  • Extend #56 to fall back to lexical form comparisons in the extended (order by) comparison operators (which are used for Node comparisons)

Binary Types

  • base64Binary
  • hexBinary

Both should be represented with a non-owning view on an immutable std::byte sequence. Access via operator[n] retrieves the n-th std::byte.
An additional function hextet(n) retrieves a copy of the n-th hextet. The hextet type could also be std::byte.
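A minimal sketch of such a view. It assumes that "hextet" means the n-th 6-bit group of the byte sequence (most significant bit first, as in Base64) and returns it in the low six bits of a std::byte. The class name BinaryView is hypothetical.

```cpp
#include <cstddef>

class BinaryView {
    std::byte const *data_;
    std::size_t size_;

public:
    // Non-owning: the caller keeps the byte sequence alive.
    constexpr BinaryView(std::byte const *data, std::size_t size) noexcept
        : data_(data), size_(size) {}

    constexpr std::size_t size() const noexcept { return size_; }

    // n-th byte of the sequence; n must be smaller than size().
    constexpr std::byte operator[](std::size_t n) const noexcept { return data_[n]; }

    // n-th 6-bit group; bits past the end of the sequence read as zero.
    // The group must start inside the sequence.
    std::byte hextet(std::size_t n) const noexcept {
        std::size_t const bit = n * 6;
        std::size_t const i = bit / 8;
        unsigned const off = static_cast<unsigned>(bit % 8);
        unsigned const hi = std::to_integer<unsigned>(data_[i]);
        unsigned const lo = i + 1 < size_ ? std::to_integer<unsigned>(data_[i + 1]) : 0u;
        unsigned const window = (hi << 8) | lo; // 16-bit window starting at byte i
        return static_cast<std::byte>((window >> (10 - off)) & 0x3Fu);
    }
};
```

For the bytes of "Man" (0x4D 0x61 0x6E) the four hextets are 19, 22, 5 and 46, which correspond to the Base64 characters "TWFu". In a C++20 codebase, std::span<std::byte const> would be a natural replacement for the pointer and size pair.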

Fix decimal overflow behaviour

Decimal currently does not behave as per spec on overflow and underflow.
Also: re-enable the commented-out tests in tests_NumOpsResults.

Make rdf4cpp portable

Currently there are a few places in rdf4cpp that are not portable and only compile on gcc and clang. This issue will collect these defects.

  • packed attributes in node storage
  • SIMD intrinsics

Validation in Literal constructors

Currently, literals are validated only in the from_string() functions of each datatype. The validation should also take place in the literal constructors to ensure that the constructors and make() functions work in the same manner.

Improvement: native-type backend Storage

Currently, all datatype values are either stored encoded as canonical strings in the backend or inlined into the NodeBackendHandle.
While working with inlined values is almost for free, working with values in the backend is expensive. They need to be parsed from string to their native type every time they are used.

The proposed solution is to customize the backend storage and its access in such a way that it stores the values instead of the string representation for certain types in the backend. This could be a feature of a type definition, e.g. like inlineable.

Non-Canonical Literals

Issue:

If ?x = "2.0"^^xsd:double and ?y="2.00"^^xsd:double the following filter should remove the binding:

filter(str(?x) = str(?y))

This will not happen when the bindings for ?x and ?y are canonicalized to "2.0"^^xsd:double.

Open questions:

  • When is it allowed in triple stores to drop non-canonical literals in favor of the canonical form?

Comparison of literals with language tags

Currently, the expression "DEF"@en-US == "DEF"@en-us (with both operands being literals) evaluates to false.
RFC 4646 states in section 2.1 (page 5):

The tags and their subtags, including private use and extensions, are to be treated as case insensitive: there exist conventions for the capitalization of some of the subtags, but these MUST NOT be taken to carry meaning.

I believe that the above expression should evaluate to true.
This is also the case in the following test case:

Note that the query uses "en-US", whereas in the results "en-us" is used.
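A sketch of a comparison that follows the quoted rule. Language tags are restricted to ASCII letters, digits and hyphens, so ASCII-only case folding is enough. The function names are made up for illustration.

```cpp
#include <algorithm>
#include <string_view>

// ASCII-only lowercase conversion; independent of the current locale.
constexpr char ascii_lower(char c) noexcept {
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Case-insensitive equality of two language tags.
inline bool lang_tags_equal(std::string_view a, std::string_view b) noexcept {
    return std::equal(a.begin(), a.end(), b.begin(), b.end(),
                      [](char x, char y) { return ascii_lower(x) == ascii_lower(y); });
}
```

An alternative would be to normalize tags to lowercase once when a literal is created, so that comparisons stay a plain equality check.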

Figure out a way to mangle serd symbols

Currently rdf4cpp includes a custom, patched version of serd which defines the same symbol names as serd itself. This causes linking to fail (due to ODR violations) when linking against rdf4cpp and serd.

dynamic Literal constructors behave badly

Many of the Literal constructors behave badly: they allow the user to create a literal whose type is not registered,
which triggers asserts in comparisons, EBV and numeric ops.

How to reproduce (used branch feature/literal-ops-comparisons for this example, but can be triggered in develop with num ops):

  1. never mention rdf4cpp::rdf::datatypes::xsd::String anywhere in your code (because this would trigger the template instantiation and register the type, thus preventing this bug).

Literal lit1 = Literal{"test1"};
Literal lit2 = Literal{"test2"};
bool res = lit1 < lit2; // assertion triggered here

Constructors that have this behaviour:

  • Literal::Literal(std::string_view lexical_form, Node::NodeStorage &node_storage)
  • Literal::Literal(std::string_view lexical_form, const IRI &datatype, Node::NodeStorage &node_storage)
  • Literal::Literal(std::string_view lexical_form, std::string_view lang, Node::NodeStorage &node_storage)
  • Literal Literal::make(std::string_view lexical_form, const IRI &datatype, Node::NodeStorage &node_storage)

Review Ownership Model of NodeStorage and Backends

To make rdf4cpp completely thread-safe, node storage handles are necessary.

Reminder, something like this:

struct NodeStorageHandle {
     NodeStorageId index;
     uint32_t generation;
};

See also PR Review of #77 for details. To implement those however, a review and overhaul of the ownership model are needed.
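To show how the generation field detects stale handles, here is a single-threaded sketch of a slot table. StorageRegistry, the 16-bit NodeStorageId and the use of std::string as a stand-in for a real node storage are all assumptions. A thread-safe version would also need synchronization.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

using NodeStorageId = std::uint16_t; // assumed width

struct NodeStorageHandle {
    NodeStorageId index;
    std::uint32_t generation;
};

class StorageRegistry {
    struct Slot {
        std::uint32_t generation = 0;
        std::optional<std::string> storage; // stand-in for a node storage
    };
    std::vector<Slot> slots_;

public:
    // Reuses the first free slot, otherwise appends a new one.
    NodeStorageHandle create(std::string storage) {
        for (std::size_t i = 0; i < slots_.size(); ++i) {
            if (!slots_[i].storage) {
                slots_[i].storage = std::move(storage);
                return {static_cast<NodeStorageId>(i), slots_[i].generation};
            }
        }
        slots_.emplace_back();
        slots_.back().storage = std::move(storage);
        return {static_cast<NodeStorageId>(slots_.size() - 1), slots_.back().generation};
    }

    // Returns nullptr for handles whose slot was freed or reused.
    std::string const *lookup(NodeStorageHandle h) const noexcept {
        if (h.index >= slots_.size())
            return nullptr;
        Slot const &slot = slots_[h.index];
        if (!slot.storage || slot.generation != h.generation)
            return nullptr;
        return &*slot.storage;
    }

    // Frees the slot and bumps its generation so old handles become stale.
    void destroy(NodeStorageHandle h) noexcept {
        if (lookup(h) == nullptr)
            return;
        slots_[h.index].storage.reset();
        ++slots_[h.index].generation;
    }
};
```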

Feature: Blank Node Management

Blank nodes are not stable references in RDF, see https://www.w3.org/TR/rdf11-concepts/#section-blank-nodes

rdf4cpp is not trying to prevent blank node ID collisions. Scopes allow the user to correctly use blank nodes within files and graph stores/datasets. Comparison between blank nodes from different scopes is technically possible within rdf4cpp but not allowed by RDF.

  • By default, we should not keep the identifier of a blank node loaded from file but assign new ones
  • There should be a generator that you can ask for a fresh, so far unused identifier
    • replaceable
    • Variants:
      • Random IDs (could be UUIDs)
      • increasing IDs
        • should be as short as allowed in turtle
        • optional user-defined pre- or suffix
  • There should be a BlankNodeMapper that keeps track of what blank node identifiers from a file are represented by what blank node identifiers in a NodeStorage. (Scopes)
  • There should be subscopes. Subscopes know all blank nodes from their parent scope.
    • Subscopes are initialized with all blank nodes from their parent scope. If a blank node is added to a parent scope, it is forwarded to the subscope.
  • A scope is not bound to a NodeStorage. Each method has a node_storage argument with a default value
  • Add support for Skolem IRIs

Note: SPARQL variables are not handled as blank nodes but as anonymous variables. Thus, name clashes between variables and blank nodes are no concern.
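A sketch of how a generator and a per-file scope could interact. The class names, the "b" prefix and the string-based identifiers are assumptions. Subscopes, node storages and Skolem IRIs are left out.

```cpp
#include <cstdint>
#include <string>
#include <string_view>
#include <unordered_map>
#include <utility>

// Generator variant "increasing IDs" with an optional user-defined prefix.
class IncreasingIdGenerator {
    std::string prefix_;
    std::uint64_t next_ = 0;

public:
    explicit IncreasingIdGenerator(std::string prefix = "b") : prefix_(std::move(prefix)) {}

    // Returns an identifier this generator has not handed out before.
    std::string generate() { return prefix_ + std::to_string(next_++); }
};

// Maps the labels used inside one file to fresh identifiers.
class BlankNodeScope {
    IncreasingIdGenerator *generator_;
    std::unordered_map<std::string, std::string> mapping_;

public:
    explicit BlankNodeScope(IncreasingIdGenerator &generator) : generator_(&generator) {}

    // The same label always yields the same identifier within this scope.
    std::string const &get_or_generate(std::string_view file_label) {
        auto [it, inserted] = mapping_.try_emplace(std::string{file_label});
        if (inserted)
            it->second = generator_->generate();
        return it->second;
    }
};
```

Two scopes sharing one generator map the same file label to different identifiers, which is the behaviour wanted when loading two files that both use, say, _:x.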

Future improvement: xsd:decimal inlining

xsd:decimal is currently not inlineable because it is difficult to truncate the underlying type to 42 bits without loss of precision.

Explored solutions:

  • converting boost decimal float -> string -> native float => slow and does not work for many values
  • gcc's std::decimal => incomplete (e.g. no string conversion support)
  • https://github.com/GaryHughes/stddecimal => promising but either:
    1. would need to find a fast way to convert to/from boost decimal float
    2. would need to alter the inlining API to be able to perform operations on a separate "inlined-datatype", e.g. decimal32 for xsd:decimal if the precision allows it

Another possibility: explore boost decimal float allocator support => if it properly supports fancy pointers it would be possible to store the value directly in the backend (not the string repr) (#110)

Exploration: boost::multiprecision type size control

Explore whether it is possible to control the size of boost::multiprecision types separately from their supported range
(smaller internal buffer, move to allocator earlier),

or change the backend implementation to something else.

Undefined behaviour in various from_string functions

The from_string implementations of

  • Decimal
  • Integer
  • String
  • Float

rely on std::string_view being null-terminated by using the .data() member in combination with functions expecting null-terminated strings. std::string_view does not guarantee null termination; if the view is not null-terminated, this results in a buffer over-read.

broken code example: https://godbolt.org/z/qsjKehjh7

To fix:

  • use std::from_chars instead of std::strtod, std::stof, std::strtol, as it accepts an end pointer
  • use std::string_view constructor of std::string instead of the char const * constructor
  • for std::regex_match there is no easy fix except converting to std::string
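A sketch of the first fix for the integer case. The function name parse_integer is hypothetical, and the real from_string functions additionally have to follow the XSD lexical rules (std::from_chars, for instance, rejects a leading '+').

```cpp
#include <charconv>
#include <cstdint>
#include <optional>
#include <string_view>
#include <system_error>

// Parses the whole view as a decimal integer without reading past its end.
inline std::optional<std::int64_t> parse_integer(std::string_view s) noexcept {
    std::int64_t value = 0;
    char const *const first = s.data();
    char const *const last = s.data() + s.size();
    auto const [ptr, ec] = std::from_chars(first, last, value);
    // Reject parse errors, out-of-range values and trailing characters.
    if (ec != std::errc{} || ptr != last)
        return std::nullopt;
    return value;
}
```

Since the end pointer is passed explicitly, the function also works on views into the middle of a larger buffer.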

IRI Parsing

The code for PlainIRI.cpp specifies future validation of IRI strings.

The following pseudo-code provides for IRI validation based on the IRI specification. The numerous string constructions model the specification to produce the final string, strRE_IRI_COMPLETE, and regular expression, reIRI_COMPLETE_i. A few other helpers are also shown.

u8string strRE_IPRIVATE = u8"\\u{E000}-\\u{F8FF}\\u{F0000}-\\u{FFFFD}\\u{100000}-\\u{10FFFD}";

u8string strRE_UCSCHAR =
	u8"\\u{000A0}-\\u{0D7FF}\\u{0F900}-\\u{0FDCF}\\u{0FDF0}-\\u{0FFEF}" +
	u8"\\u{10000}-\\u{1FFFD}\\u{20000}-\\u{2FFFD}\\u{30000}-\\u{3FFFD}" +
	u8"\\u{40000}-\\u{4FFFD}\\u{50000}-\\u{5FFFD}\\u{60000}-\\u{6FFFD}" +
	u8"\\u{70000}-\\u{7FFFD}\\u{80000}-\\u{8FFFD}\\u{90000}-\\u{9FFFD}" +
	u8"\\u{A0000}-\\u{AFFFD}\\u{B0000}-\\u{BFFFD}\\u{C0000}-\\u{CFFFD}" +
	u8"\\u{D0000}-\\u{DFFFD}\\u{E1000}-\\u{EFFFD}";

u8string strRE_SUB_DELIMS = u8"!\\$&'\\(\\)\\*\\+,;=";
u8string strRE_SUB_DELIMS_GRP = u8"[" + strRE_SUB_DELIMS + u8"]";

u8string strRE_GEN_DELIMS = u8":\\/\\?\\#\\[\\]@";

u8string strRE_RESERVED = strRE_SUB_DELIMS + strRE_GEN_DELIMS;

u8string strRE_ALPHA = u8"a-z";
u8string strRE_ALPHA_GRP = u8"[" + strRE_ALPHA + u8"]";
u8string strRE_DIGIT = u8"\\d";
u8string strRE_ALPHA_DIGIT = strRE_ALPHA + strRE_DIGIT;

u8string strRE_HEX = u8"[\\da-f]";

u8string strRE_PCT_ENCODED = u8"%" + strRE_HEX + u8"{2}";
u8string strRE_PCT_ENCODED_GRP = u8"(?:" + strRE_PCT_ENCODED + u8")";

u8string strRE_UNRESERVED = u8"-" + strRE_ALPHA_DIGIT + u8"\\._~";
u8string strRE_UNRESERVED_GRP = u8"(?:[" + strRE_UNRESERVED + u8"])";

u8string strRE_IUNRESERVED = strRE_UNRESERVED + strRE_UCSCHAR;
u8string strRE_IUNRESERVED_GRP = u8"(?:[" + strRE_IUNRESERVED + u8"])";

u8string strRE_SCHEME =
	strRE_ALPHA_GRP +
	u8"(?:[-" + strRE_ALPHA_DIGIT + u8"\\+\\.])*";

u8string strRE_IUNSUBS = strRE_IUNRESERVED + strRE_SUB_DELIMS;

u8string strRE_IREG_NAME_SUBGRP =
	u8"(?:[" +
		strRE_IUNSUBS +
	u8"])|" + strRE_PCT_ENCODED_GRP;
u8string strRE_IREG_NAME = u8"(?:" + strRE_IREG_NAME_SUBGRP + u8")*";

u8string strRE_ISEGMENT_NC_BASE = strRE_IUNSUBS + u8"@";
u8string strRE_ISEGMENT_NC =
	u8"(?:[" +
		strRE_ISEGMENT_NC_BASE +
	u8"])|" + strRE_PCT_ENCODED_GRP;

u8string strRE_IPCHAR_BASE = strRE_ISEGMENT_NC_BASE + u8":";
u8string strRE_IPCHAR =
	u8"(?:[" +
		strRE_IPCHAR_BASE +
	u8"])|" + strRE_PCT_ENCODED_GRP;

u8string strRE_ISEGMENT_BASE = u8"(?:" + strRE_IPCHAR + u8")";
u8string strRE_ISEGMENT = strRE_ISEGMENT_BASE + u8"*";
u8string strRE_ISEGMENT_NZ = strRE_ISEGMENT_BASE + u8"+";
u8string strRE_ISEGMENT_NZ_NC = u8"(?:" + strRE_ISEGMENT_NC + u8")+";

u8string strRE_IPATH_ABEMPTY = u8"(?:\\/" + strRE_ISEGMENT + u8")*";
u8string strRE_IPATH_ROOTLESS =
	strRE_ISEGMENT_NZ +
	strRE_IPATH_ABEMPTY;
u8string strRE_IPATH_ABSOLUTE = u8"\\/" + u8"(?:" + strRE_IPATH_ROOTLESS + u8")?";
u8string strRE_IPATH_EMPTY = u8"";

u8string strRE_DEC_OCTET = u8"(?:0{0,2}\\d|0{0,1}[1-9]\\d|1\\d\\d|2[0-4]\\d|25[0-5])";
u8string strRE_IPV4 = strRE_DEC_OCTET + u8"(?:\\." + strRE_DEC_OCTET + u8"){3}";

u8string strRE_H16 = strRE_HEX + u8"{1,4}";
u8string strRE_LS32 =
	u8"(?:" +
		strRE_H16 + u8":" + strRE_H16 + u8"|" +
		strRE_IPV4 +
	u8")";
u8string strRE_IPV6 =
				u8"(?:" + strRE_H16 + u8":){6}" +	strRE_LS32 + u8"|" +
				u8"::(?:" + strRE_H16 + u8":){5}" +	strRE_LS32 + u8"|" +
	u8"(?:" +														strRE_H16 + u8")?" +
				u8"::(?:" + strRE_H16 + u8":){4}" +	strRE_LS32 + u8"|" +
	u8"(?:" + u8"(?:" + strRE_H16 + u8":){0,1}" +	strRE_H16 + u8")?" +
				u8"::(?:" + strRE_H16 + u8":){3}" +	strRE_LS32 + u8"|" +
	u8"(?:" + u8"(?:" + strRE_H16 + u8":){0,2}" +	strRE_H16 + u8")?" +
				u8"::(?:" + strRE_H16 + u8":){2}" +	strRE_LS32 + u8"|" +
	u8"(?:" + u8"(?:" + strRE_H16 + u8":){0,3}" +	strRE_H16 + u8")?" +
				u8"::"	+ strRE_H16 + u8":" +		strRE_LS32 + u8"|" +
	u8"(?:" + u8"(?:" + strRE_H16 + u8":){0,4}" +	strRE_H16 + u8")?" +
				u8"::" +												strRE_LS32 + u8"|" +
	u8"(?:" + u8"(?:" + strRE_H16 + u8":){0,5}" +	strRE_H16 + u8")?" +
				u8"::" +												strRE_H16 + u8"|" +
	u8"(?:" + u8"(?:" + strRE_H16 + u8":){0,6}" +	strRE_H16 + u8")?" +
				u8"::";

u8string strRE_IPVFUTURE =
	u8"v" + strRE_HEX + u8"+\\." +
	u8"(?:[" +
		strRE_UNRESERVED +
		strRE_SUB_DELIMS +
		u8":" +
	u8"])+";

u8string strRE_IP_LITERAL =
	u8"\\[(?:" + // group the alternation so both brackets apply to either branch
		u8"(?:" + strRE_IPV6 + u8")" + u8"|" +
		u8"(?:" + strRE_IPVFUTURE + u8")" +
	u8")\\]";

u8string strRE_IUSERINFO = u8"(?:" + strRE_IREG_NAME_SUBGRP + u8"|" + u8":" + u8")*";
u8string strRE_IUSERINFO_OPT = u8"(?:" + strRE_IUSERINFO + u8"@)?";

u8string strRE_IHOST =
	u8"(?:" +
		strRE_IP_LITERAL + u8"|" +
		strRE_IPV4 + u8"|" +
		strRE_IREG_NAME +
	u8")";

u8string strRE_PORT = u8"\\d*";
u8string strRE_PORT_OPT = u8"(?::" + strRE_PORT + u8")?"; // the port is introduced by ":"

u8string strRE_IAUTHORITY =
	strRE_IUSERINFO_OPT +
	strRE_IHOST +
	strRE_PORT_OPT;

u8string strRE_IHIER_PART =
	u8"(?:" +
		u8"(?:\\/\\/" +
			strRE_IAUTHORITY +
			strRE_IPATH_ABEMPTY +
		u8")" +
		u8"|" +
		strRE_IPATH_ABSOLUTE + u8"|" +
		strRE_IPATH_ROOTLESS + u8"|" +
		//RDFTransformCommon.strRE_IPATH_EMPTY +
	u8")+"; // ... +, since IPATH_EMPTY == u8"", therefore {0,1}

u8string strRE_IQ_IF_BASE_CHARS = strRE_IPCHAR_BASE + u8"\\/\\?";

u8string strRE_IQUERY_CHARS =
	u8"(?:[" +
		strRE_IQ_IF_BASE_CHARS +
		strRE_IPRIVATE +
	u8"])|" + strRE_PCT_ENCODED_GRP;
u8string strRE_IQUERY = u8"(?:" + strRE_IQUERY_CHARS + u8")*";
u8string strRE_IQUERY_OPT = u8"(?:\\?" + strRE_IQUERY + u8")?";

u8string strRE_IFRAGMENT_CHARS =
	u8"(?:[" +
		strRE_IQ_IF_BASE_CHARS +
	u8"])|" + strRE_PCT_ENCODED_GRP;
u8string strRE_IFRAGMENT = u8"(?:" + strRE_IFRAGMENT_CHARS + u8")*";
u8string strRE_IFRAGMENT_OPT = u8"(?:#" + strRE_IFRAGMENT + u8")?";

u8string strRE_IRI =
	strRE_SCHEME + u8":" +
	strRE_IHIER_PART +
	strRE_IQUERY_OPT +
	strRE_IFRAGMENT_OPT;

u8string strRE_IRI_EACH = u8"(?:" + strRE_IRI + u8")";
u8string strRE_IRI_COMPLETE = u8"^" + strRE_IRI_EACH + u8"$";

// IRI RegExp match entire string...
u8regex reIRI_COMPLETE_i = new u8regex(strRE_IRI_COMPLETE, regex::icase);

// IRI RegExp match on each occurrence in string...
u8regex reIRI_EACH_i = new u8regex(strRE_IRI_EACH, regex::icase);

u8string strRE_LINE_TERMINALS = u8"\\r?\\n|\\r|\\p{Zl}|\\p{Zp}";
// Line Terminals RegExp match on each occurrence in multiline string...
u8regex reLINE_TERMINALS_m = new u8regex(strRE_LINE_TERMINALS, regex::multiline);

/*
 * Method validateIRI(strText)
 *
 *	Test that ALL of the given u8string is a single IRI
 */
bool validateIRI(u8string strIRI) {
	return regex_search(strIRI, reIRI_COMPLETE_i);
}

Since the regex support in the C++20 standard library is out of date and does not properly support UTF-8, something like SRELL is suggested for implementation. Other researched methods are unreliable or lacking in their implementation details.

See the IRI specification rfc3987 to compare the provided code.

Numeric Type Implementation

Implemented Types:

  • xsd:decimal -> 128 bit decimal double
    • xsd:integer -> int128_t
      • xsd:long -> int64_t
        • xsd:int -> int32_t
          • xsd:short -> int16_t
            • xsd:byte -> int8_t
      • xsd:nonNegativeInteger -> uint128_t
        • xsd:positiveInteger -> uint128_t (or wrapper)
        • xsd:unsignedLong -> uint64_t
          • xsd:unsignedInt -> uint32_t
            • xsd:unsignedShort -> uint16_t
              • xsd:unsignedByte -> uint8_t
      • xsd:nonPositiveInteger -> int128_t (or wrapper)
        • xsd:negativeInteger -> int128_t (or wrapper)
  • xsd:float -> float
  • xsd:double -> double
  • owl:rational -> ?
  • owl:real -> ?

Ideal would be a library with:

  • runtime arbitrary precision
  • support for basic operations +-*/...
  • support for Integers and Decimals
  • support for nonNegative = unsigned integer
  • explicit support for positiveInteger, nonPositiveInteger, negativeInteger is not required (use Integer or write wrapper)
  • support for fancy pointers and custom allocators

Options for arbitrary precision types:

Options for arbitrary precision floating decimals:

Starting Points for own implementations could be:

The signature could be something like:

template<class Allocator = std::allocator<uint32_t>>
class Integer {
    using ptr_type = typename std::allocator_traits<Allocator>::pointer;
    std::size_t count_;
    union {
        size_t small;
        ptr_type big;
    };
   Allocator allocator_;

public:
    consteval Integer() : count_(0), small(0) {}

    Integer(std::string_view str_repr, Allocator const &alloc = Allocator()) : allocator_(alloc) {
        std::vector<uint32_t> x{alloc};
        uint32_t buffer[32] = {};
        count_ = parse(str_repr, buffer);
        switch (count_) {
            case 0UL:
                small = {};
                break;
            case 1UL:
                small = buffer[0];
                break;
            case 2UL:
                small = buffer[0] | (uint64_t(buffer[1]) << 32);
                break;
            default: {
                big = std::allocator_traits<Allocator>::allocate(allocator_, count_);
                std::memcpy(std::to_address(big), buffer, sizeof(uint32_t) * count_);
            }
        }
    }
};

This needs more research

Casting operations

There are some issues with casting operations

Add error support to numeric literal ops

The SPARQL standard requires raising "dynamic errors" under certain conditions in numeric operations, which is currently not possible.
The problem is that the functions are marked noexcept and the return type does not allow returning an error value.

There are two solutions:

  1. remove the noexcept specification and just throw exceptions in the appropriate cases
    • trivial to implement
    • possibly bad performance
  2. change the return type to be something like std::expected<op_result_type, op_error>
    • requires either: a. finding a library implementing something like it; or b. implementing our own until C++23 arrives

A decision needs to be made before merging some parts of the code for #45 (mainly related to one-side-bounded ap integer types, as there is no clear alternative to erroring on bound violation).
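To make the second option more concrete, here is a sketch that uses std::variant as a stand-in for an expected-like type. OpError, OpResult and the checked_* functions are hypothetical names and only cover 32-bit integers.

```cpp
#include <cstdint>
#include <limits>
#include <variant>

enum class OpError { Overflow, DivisionByZero };

// Either the result value or an error, without throwing.
template<class T>
using OpResult = std::variant<T, OpError>;

// Addition that reports overflow instead of invoking undefined behaviour.
inline OpResult<std::int32_t> checked_add(std::int32_t a, std::int32_t b) noexcept {
    std::int64_t const wide = std::int64_t{a} + std::int64_t{b};
    if (wide < std::numeric_limits<std::int32_t>::min() || wide > std::numeric_limits<std::int32_t>::max())
        return OpError::Overflow;
    return static_cast<std::int32_t>(wide);
}

// Division that reports division by zero and the single overflowing case.
inline OpResult<std::int32_t> checked_div(std::int32_t a, std::int32_t b) noexcept {
    if (b == 0)
        return OpError::DivisionByZero;
    if (a == std::numeric_limits<std::int32_t>::min() && b == -1)
        return OpError::Overflow;
    return a / b;
}
```

The functions can stay noexcept, and callers check the alternative with std::holds_alternative or std::get_if before using the value.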
