oshun's Introduction

Not So Quick Start

Here's the full spec. It's very long and much of the current discussion is in Discussions.

The primary change from the spec might be to use the established my TYPE VARLIST syntax from perldoc -f my instead of attributes.

This is Oshun

Oshun is a Nigerian Yoruba river deity. She is a protector, a savior.

She will protect your data.

In software terms:

sub fibonacci :returns(UINT) ($nth :of(PositiveInt)) {
    ...
}

Note: do not worry about that syntax. It's not real. It's just a placeholder for whatever will be agreed upon. We'll get to that later.

Oshun is not a module to be installed (though there is code and almost 200K tests). Instead, it's intended to be a specification like Corinna, with the goal of seeing if we can get it into the Perl core.

History

In December of 2022, I again wrote about a type system for Perl. I've done this before and the discussion is usually positive, though given that we're a community, there are those who disagree with the need for one.

Shortly thereafter, Damian Conway and I started talking and he shared a private gist with me. It was an incredibly detailed plan for runtime data checks. (The term "type" is avoided because of the baggage it carries). A few others were quietly invited to the conversation.

We spent a few months discussing this and he wrote a prototype, which is the code in this repository. The prototype is ALPHA code and absolutely should not be used in production. Instead, it's a proof of concept to explore the problem space. It's also a way to get feedback from the community, which is why this repository is here.

After a few months of discussion, Damian rewrote that gist. It covers the full spec, but it's long and daunting. I'll just touch on key points here.

Note: Damian regrets that, for personal reasons, he is not able to continue working on Oshun at this time. He might answer questions, but he does not have much free time available right now.

Why "data checks"?

We're not using the word "type" because:

  1. Computer scientists have reasonable differences about what they want from a type system
  2. Computer programmers have screaming matches

We'd like to avoid screaming matches.

What I want out of a "type system" is probably not feasible in Perl and certainly won't match everyone's expectations. So we've taken a look at what Perl developers currently do. Type::Tiny and Moose (and Moo) are heavy inspirations for this work. We also looked at Dios, Zydeco, Raku, and other languages for inspiration, but mostly this matches what Perl is doing today, keeping in mind that popular systems are working within the limitations of Perl. Just as Corinna is better because Sawyer X told me to design something great and not worry about Perl's current limitations, so are data checks designed to give us what we want without worrying about Perl's limitations.

What we need to design

There are two aspects of data checks: syntax and semantics. Obviously these are tightly coupled, but we can discuss them separately. If we can get basic agreement on the syntax and core semantics (there will always be edge cases), then we can move forward on writing up the full specification.

Syntax

The syntax is probably the hard part. The initial design was made on the possibly unfounded assumption that P5P would reject any syntax which might impact existing code. This is why we have the :returns and :of keywords. perldoc -f my has the following:

    my VARLIST
    my TYPE VARLIST
    my VARLIST : ATTRS
    my TYPE VARLIST : ATTRS

We very much want that TYPE syntax. Fortunately, because data checks are lexically scoped and not global, it turns out that we probably can have the my TYPE VARLIST syntax, but we need to be careful.

So we have the following syntax:

sub fibonacci :returns(UINT) ($nth :of(PositiveInt)) {
    ...
}

But let's dig in. We'll consider naming and declaration separately.

Naming

We have two kinds of data check declarations: built-in and user-defined. Built-in checks were defined as all uppercase: INT, ARRAY, HASH, etc. The reason is this problem:

sub f_to_c :returns(NUM) ($f :of(NUM)) {...}

f_to_c(32) returns 0. f_to_c(-1000) returns -573.333333333333.
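For concreteness, the elided body would presumably just be the usual conversion formula (a sketch in the same placeholder syntax):

sub f_to_c :returns(NUM) ($f :of(NUM)) { ($f - 32) * 5 / 9 }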

However, that doesn't really make sense, since that's below absolute zero. So we have user-defined checks:

check Celsius    :isa(NUM[-273.15..inf]);
check Fahrenheit :isa(NUM[-459.67..inf]);

Which gives us a much safer, and self-documenting signature:

sub f_to_c :returns(Celsius) ($f :of(Fahrenheit)) {...}

Now, if you discover that your user-defined check is wrong, you can fix it and it will be globally applied. This is a huge win (er, except when your code doesn't really match the check).

UPPER-CASE CHECKS were designed to be a subtle disaffordance to encourage people to write custom checks that more accurately reflect their intent.

They also clearly distinguished between built-in and user-defined checks.

However, the SHOUTY checks were a touch controversial in some earlier private discussions. You don't always need a user-defined check; sometimes it's just burdensome, and if you find that a built-in check is wrong, you can upgrade it to a user-defined check later.

Another benefit of data checks is that tests are easier. I used to program in Java and I wrote tests using JUnit. You know what I didn't test? I didn't have to test what would happen if the type I passed in was not an expected type. I didn't have to write tests to verify that the structure I got back was correct. The compiler caught that for me. I could focus on testing the actual functional bits of my code, not the infrastructure. In a sense, Java tests could be more compact and reliable than Perl tests.

But getting back to the shouty checks ...

Or we could look at how other languages deal with this. For Java, primitive types are lower-case and include things like int, char, double, and so on. These map directly to what the underlying hardware supports. These correspond to the "built-in" checks for Oshun, with the caveat that we focus on types that map naturally to what perl supports, not what the underlying hardware expects. For example, we have a GLOB type:

my    $my_scalar    :of(GLOB) = *STDOUT;
our   $our_scalar   :of(GLOB) = *STDERR;
state $state_scalar :of(GLOB) = *STDIN;

Java also has non-primitive types, which are defined by the programmer. These correspond to our user-defined checks. In Java, these are defined using class names. In Oshun, we use check and the name of a check is an unqualified Perl identifier, which must contain at least one upper-case character and at least one lower-case character.

# Newly assigned values must never decrease...
check Monotonic :isa(NUM) ($value, %value) { $value >= $value{old} }

However, since SHOUTY checks are controversial, we could use all lower-case for the built-in checks and require user-defined checks to start with an upper-case letter.

# @data must be an array of hashes, where the hash keys must be integers
# and values must be arrayrefs of Account objects.
my @data :of(hash[int => array[obj[Account]]]);

Declaration

That brings us to the next contentious issue: how do we declare data checks?

We used attributes because they were not likely to conflict with existing code. Further, they correspond to the KIM syntax which Corinna now uses:

# KEYWORD IDENTIFIER MODIFIERS                   SETUP
  sub     f_to_c     :returns(NUM) ($f :of(NUM)) {...}

# KEYWORD IDENTIFIER   MODIFIERS SETUP
  my      $fahrenheit :of(NUM)   = 32;

Personally, I would like most new Perl features to use KIM. It's very consistent and avoids the issue of adding a ton of new keywords to the language. More consistency in Perl is a good thing, but many prefer the quirky nature of our beloved language.

Responses to the proposed syntax were mixed. Many people prefer a syntax like this:

my hash[int => array[obj[Account]]] @data;
my uint $count = 1;

Or this:

my @hash hash[int => array[obj[Account]]];
my $count uint = 1;

Still others were happy with the KIM syntax, but wanted to use anything other than :of; :is, :check, and :contract were all suggested. I won't take a position here, other than to say that whatever syntax we choose should only be cumbersome for things we want to actively discourage.

Return Values From Subroutines

If we don't use the :returns(...) syntax for specifying the checks on values subs/methods return, what then? Looking at how Raku handles this:

sub foo(--> Int)      {}; say &foo.returns; # OUTPUT: «(Int)␤»
sub foo() returns Int {}; say &foo.returns; # OUTPUT: «(Int)␤»
sub foo() of Int      {}; say &foo.returns; # OUTPUT: «(Int)␤»
my Int sub foo()      {}; say &foo.returns; # OUTPUT: «(Int)␤» 

I don't know the design discussions which led Raku to that place, but I don't think it's controversial to suggest that Perl is not Raku and we probably don't want that many different ways of declaring the return check. But if we don't use :returns(...), what then?

sub int num (str $name) {...}

What's the name of that subroutine? I think it's num, but it might look like int to someone else. Who knows? We could do this:

int sub num (str $name) {...}

That's much clearer, but if &int is a function in this namespace, I imagine that's going to create all sorts of parsing problems (not to mention that this will likely conflict with existing code). We could do this:

sub num (str $name) returns int {...}

I think that's the clearest, but it's also the most verbose. I have no strong preference here, so long as whatever we do doesn't conflict with existing code and is easy to use.

Semantics

For a full discussion of the semantics, check Damian's gist. Here's the short version, including the rather controversial final point.

Checks are on the variable, not the data

my $foo :of(INT) = 4;
$foo = 'hello'; # fatal

However:

my $foo :of(INT) = 4;
my $bar = $foo;
$bar = 'hello'; # legal

This is because we don't want checks to have "infectious" side effects that might surprise you. The developer should have full control over the data checks.

No type inference

No surprises. The developer should have full control over the data checks.

I can no longer find the article, but I read a long post from a company explaining why they had abandoned their use of type inference.

They absolutely loved it, but they spent so much time trying to patch third-party modules that they gave up. They were spending as much time fixing others' code as writing their own.

This is a danger of retrofitting a system like "data checks" onto an existing language. Thus, we're being extremely conservative.

Signature checks

We need to work out the syntax, but the current plan is something like this:

sub count_valid :returns(UINT) (@customers :of(OBJ[Customer])) {
	...
}

The @customers variable should maintain the check in the body of the sub, but the return check is applied once and only once on the data returned at the time that it's returned.
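To illustrate both halves of that, here's a sketch in the same placeholder syntax (is_valid is a hypothetical Customer method):

sub count_valid :returns(UINT) (@customers :of(OBJ[Customer])) {
    # push @customers, 'oops';   # would be fatal: @customers keeps its OBJ[Customer] check inside the body
    return scalar grep { $_->is_valid } @customers;   # the UINT check runs once, on this value, as it's returned
}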

Scalars require valid assignments

my $total :of(NUM); # fatal, because undef fails the check

This is per previous discussions. Many languages allow this:

int foo;

But as soon as you assign something to foo, it's fatal if it's not an integer. For Perl, that's a bit tricky as there's no difference between uninitialized and undefined. While using that variable prior to assignment is fatal in many languages, that would be more difficult in Perl. Thus, we require a valid assignment.

As a workaround, this is bad, but valid:

my $total :of(INT|UNDEF);

This restriction doesn't apply to arrays or hashes because being empty trivially passes the check.
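Putting those rules together (a sketch in the placeholder syntax, assuming :of on an array or hash constrains its values):

my @scores :of(INT);        # fine: an empty array trivially passes its check
my %ages   :of(INT);        # fine: so does an empty hash
my $total  :of(INT);        # fatal: undef fails the INT check
my $count  :of(INT) = 0;    # fine: a valid value is assigned at declaration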

Fatal

By default, a failed check is fatal. We have provisions to downgrade them to warnings or disable them completely.

Internal representation

use Devel::Peek;

my $foo :of(INT) = "0";
Dump($foo);

The string "0" naturally coerces to an integer, so that's allowed. However, we don't plan (for the MVP) to guarantee that Dump shows an IV instead of a PV. We're hoping that can be addressed post-MVP.

User-defined checks

Users should be able to define their own checks:

check LongStr :params($N :of(PosInt)) :isa(STR) ($n) { length $n >= $N }

The above would allow this:

my $name :of(LongStr[10]) = get_name(); # must be at least 10 characters

The body of a check definition should return a true or false value, or die/croak with a more useful message (exceptions would be strongly preferred), but that's a battle for another day. (A Loki project, perhaps?)

A user-defined check is not allowed to change the value of the variable passed in. Otherwise, we could not safely disable checks on demand (coercions are not planned for the MVP, but we have them specced and they use a separate syntax).
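For example, a check may inspect the value it's given, but it must never rewrite it. A sketch, using a made-up Trimmed check:

# OK: only tests the value
check Trimmed :isa(STR) ($value) { $value !~ /\A\s|\s\z/ }

# NOT OK: a check that "helpfully" strips the whitespace itself.
# If checks could rewrite data, disabling them would silently change behaviour.
# check Trimmed :isa(STR) ($value) { $value =~ s/\A\s+|\s+\z//g; 1 }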

User-defined checks could be post-MVP, but it's unclear to me how useful checks would be without them.

Checks are on assignment to the variable

This is probably the most problematic bit.

A check applied to a variable is not an invariant on that variable. It's a prerequisite for assignment to that variable.

An invariant on the variable would guarantee that the contents of the variable must always meet a given constraint; a "prerequisite for assignment" only guarantees that each element must be assigned values that meet the constraint at the moment they are assigned.

So an array such as my @data :of(HASH[INT]) only requires that each element of @data must be assigned a hashref whose values are integers. If you were to subsequently modify an element like so (with the caveat that the two lines aren't exactly equivalent):

$data[$idx]       = { $key => 'not an integer' }; # fatal
$data[$idx]{$key} = 'not an integer';             # not fatal!

The second assignment is not modifying @data directly, only retrieving a value from it and modifying the contents of an entirely different variable through the retrieved reference value.
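The same action at a distance shows up when the nested data is still reachable through another reference (again in the placeholder syntax):

my @data :of(HASH[INT]);
my $row = { count => 1 };
$data[0] = $row;                    # checked: a hashref whose values are integers, so it passes
$row->{count} = 'not an integer';   # not checked: @data itself was never assigned to,
                                    # yet $data[0]{count} no longer satisfies HASH[INT]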

We could specify that checks are invariants, instead of prerequisites, but that would require that any reference value stored within a checked arrayref or hashref would have to have checks automatically and recursively applied to them as well, which would greatly increase the cost of checking, and might also lead to unexpected action-at-a-distance, when the now-checked references are modified through some other access mechanism.

Moreover, we would have to ensure that such auto-subchecked references were appropriately “de-checked” if they are ever removed from the checked container. And how would we manage any conflict if the nested referents happened to have their own (possibly inconsistent) checks?

So the checks are simply assertions on direct assignments, rather than invariants over a variable’s entire nested data structure.

This is unsatisfying, but we're playing with the matches we have, not the flamethrower we want.

Coercions

Many people want coercions. We have a plan for them, but they're not part of the MVP. However, we're trying to make sure that they can be added later if necessary. Currently, there are some significant limitations to them. First, if we downgrade checks to warnings or disable them, we can't do that with coercions, because the code expects the coerced value. Any user-defined check built on top of a coercion would automatically be upgraded to a coercion itself.

Second, coercions are action at a distance. Thus, if you're trying to debug why a method failed, you might not realize that the method was passed a UUID instead of a Customer object.

We're not ruling out coercions, but they introduce new problems we'd rather not have in the MVP.

Compile-time checks

We're not planning on compile-time checks. We're not ruling them out, but they're not for the MVP. However, we can envision a future where this is a compile-time failure:

my $foo :of(INT) = "bar"

It would also be nice to see this as a compile-time failure:

sub find_customer :returns(OBJ[Customer]) ($self, $id :of(UUID)) {
    ...
}

# in other code:
my $customer :of(HASH) = $object->find_customer($UUID);

However, due to the extreme late binding in Perl, that's likely to be impossible, so we're simply not worrying about it.

It's only mentioned now because people have asked about this a few times.

About the Data::Checks module

The Data::Checks module is a proof-of-concept implementation of the above. However, due to current limitations of Perl, it's a scary combination of PPR, Filter::Simple, Variable::Magic, and tied variables. It's not pretty, but it works. That said, Damian's very clear that this is an unholy abomination (my words, not his, but he was very clear that his code is a proof-of-concept and not something he'd ever release).

Amongst other issues that had to be dealt with:

  • Variable::Magic has significant limitations with arrayrefs and hashrefs
  • Attributes are not allowed inside subroutine signatures
  • Fast parsing and rewriting of Perl documents is hard

After he turned it over to me, I fixed a bug and rewrote part of it to make some expectations clearer. In particular, use Data::Checks; is equivalent to:

use strict;
use warnings;
use v5.22;
use experimental 'signatures';
use Data::Checks::Parser;

The Data::Checks::Parser module is the core of the distribution and was originally named Data::Checks (to be fair, Data::Checks::Parser is a terrible name because it's rewriting your code, not just parsing it).

oshun's People

Contributors

ovid, rwp0, zmughal, happy-barney, esabol, druud, gregoa

oshun's Issues

Subroutine data check not persistent

If I run this:

my $size :of(UINT) = 3;
$size = -2.43;

That fails with Can't assign -2.43 to $size: failed UINT check at ...

However, if I run this:

sub foo ( $max_size :of(UINT) ) {
    say "Max size is $max_size";
    $max_size = -2.43;
    say "Max size is $max_size";
}

foo(3);
foo(-42);

That prints:

Max size is 3
Max size is -2.43
Can't pass -42 to parameter $max_size in call to foo() at ...

So the check works when you make the sub call, but no longer applies within the body of the subroutine.

This is for version 0.00001.

Contemplating on String

Motivation / Goal

Nowadays, a common use of Perl programs is as some kind of backend. There you usually need three types of checks (a better word here would be contracts):

  • description of the API I/O (mostly checks corresponding to JSON Schema (OpenAPI) or XML Schema (XML over HTTP, XML-RPC, SOAP))
  • description of internal representation
  • description of storage representation (usually SQL)

It would be nice to have Perl checks/contracts specified in such a way that the external descriptions can be generated directly from the Perl definitions.

Example (the syntax is symbolic):

# declare Bar => String [ min_length => 3, max_length => 16 ];
sub operation_handler :returns (Bar) { ... }

say Bar->to_openapi;
# - <...>
#   - type: string
#   - max-length: 16
#   - min-length: 3

say Bar->to_xsd;
# <xs:simpleType>
#   <xs:restriction base="xs:string">
#    <xs:maxLength value="16"/>
#    <xs:minLength value="3"/>
#   </xs:restriction>
# </xs:simpleType>

String variants

restrictions

Typical String restrictions (I prefer XML Schema's word, facet) are:

  • min-length
  • max-length
  • pattern

These restrictions are supported by both JSON Schema and XML Schema (though neither supports Perl regexes).

It would be nice to support named restrictions, e.g.:

Str [ min_length => 10 ];
Str [ min_length (10) ];
Str :min_length (10);

binary vs text

It would be nice to be able to declare whether a value is a generic binary string or a text string, e.g.:

  • Str - a string treated as UTF-8
  • Binary - a generic binary string

XML schema

  • supported by dedicated type base64Binary

JSON schema

  • supported by string type property contentEncoding: base64
  • supports also content-type

It would be nice to be able to specify the content encoding and the related implicit coercions to/from the internal encoding:

Binary :encoding (base64);
Binary :encoding (uuencode);
Binary :encoding (deflate);
Str :encoding (Latin-2);

documentation

It would be nice to be able to specify some description of a check, e.g.:

Str :abstract (This is abstract);
Str :abstract_uri (https://...)

common derived checks (subtypes)

URI

XML schema

  • built-in type anyURI

JSON schema

  • string with format, one of
    • uri
    • uri-reference
    • iri
    • iri-reference

Although it is easy to write a subcheck using a pattern restriction, it would IMHO be handy to provide these as built-in checks:

  • URI
    • URL
    • URN

Date / time

XML schema

  • date
  • dateTime
  • duration
  • gDay
  • gMonth
  • gMonthDay
  • gYear
  • gYearMonth
  • time

JSON schema

  • date-time
  • date
  • time
  • duration

It would be nice to also provide date/time-related checks, with possible encodings:

  • strict ISO 8601
  • relaxed variant allowing space as date-time separator (default?)
  • misc national format

Values represented by these checks may be dual-valued, once there is a good enough implementation of a datetime object.

other useful checks

  • UUID (JSON schema: uuid)
  • Identifier (XML schema: token / ID / Name)

Would it be a mistake to allow an INT to be both "0" and 0?

my $age :of(INT) = 42; # ok
my $age :of(INT) = "42"; # not cool

I haven't poked inside the compiler, but I was wondering if Perl could get optimized in the future if an SV was known to be an IV or PV at compile time, and not changeable.

I also think it would be nice if a function that converts from Perl to JSON knew for sure what something was, since it was defined by the programmer and not by accident.
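To illustrate the JSON point with plain Perl today (no Oshun involved), JSON encoders already pick the output type from the scalar's internal representation:

use v5.10;
use JSON::PP;

my $json = JSON::PP->new;
say $json->encode( { age => 42 } );     # {"age":42}
say $json->encode( { age => "42" } );   # {"age":"42"}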

BOOL should accept undef

Unlike the other primitives, BOOL should accept undef as well.

There is a ton of existing code returning undef as false.

IMHO it's not a good idea to force users to change their existing codebases in order to use the newer syntax and its features.

Why use attributes?

Why my $dog :isa(Dog) and sub foo($dog :of(Dog)) { ... } and not my Dog $dog = ... and my sub foo (Dog $dog)?

Perl has allowed my Dog $dog = ... forever (although it does nothing), and neither approach is supported in signatures right now, so whatever we do there we have to do from scratch anyway.

I'm just wondering about the thinking process that got us to the current spec.

Tooling?

I love the idea of doing something like this, but I wonder if we've fully identified the value, or agreed on WHY we are doing this. I do a lot of programming bouncing between Perl and Golang, which is strongly typed, and the thing I miss the most about Go when I'm back in Perl is how useful the type information is for tooling and debugging. I can hover over a method and get details about its required signature and its return value, whereas in Perl I often end up adding a lot of Data::Dumper statements to figure out what something is doing. This is more valuable with complex, large codebases. But it seems like we are focusing on runtime checking, is that correct? What is the barrier to having the type information be introspectable via the compiler so that we can support it in a Language::Server?
