Let's start with the problem that I'm having (and attempting to solve). While trying to figure out some memory-related details in our runtime, I realized that some COBOL features are impossible to implement in standard-conforming C without invoking undefined behavior. More specifically, according to the rules in the C standard, it's not possible to implement COBOL's REDEFINES (unions) in C without invoking undefined behavior, and by extension most kinds of type punning are also prohibited.
I'll clarify what I mean by this, as this might also come as a surprise to other C developers. Both the C99 and C11 standards contain a footnote saying the following:
If the member used to access the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type as described in 6.2.6 (a process sometimes called "type punning"). This might be a trap representation.
At first, this makes it seem like C allows the kind of type punning we need to make COBOL's REDEFINES work. However, footnotes are specified in the foreword as non-normative:
In accordance with Part 3 of the ISO/IEC Directives, this foreword, the introduction, notes, footnotes, and examples are also for information only.
This means that footnotes can't define normative behavior and should only clarify the existing normative text with additional information. No normative text exists in the C standard that specifies the kind of type punning described in the footnote. In fact, we have sections in the standard that contradict what is said in the footnote:
The value of at most one of the members can be stored in a union object at any time. . . .
If the value of only one member can be stored in a union, then the value of the other members is non-existent (and reading from them would be UB). Nothing stops Clang and GCC from optimizing (currently or in the future) based on the assumption that union members, other than the one last written to, are in fact non-existent, so we risk undefined behavior here.
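For reference, this is the access pattern under discussion, as a minimal sketch. The names `float_bits` and `float_bits_via_union` are made up for illustration, and the code assumes C11 or later and a 32-bit IEEE 754 `float`. Whether the read in the last line is defined by the normative text is exactly the question raised above; GCC documents that it supports punning through a union as long as the access goes through the union type.

```c
#include <assert.h>  /* static_assert (C11) */
#include <stdint.h>

/* Assumption: float is 32 bits wide on the target. */
static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits wide");

/* Hypothetical type: two views of one storage area, similar in spirit
   to two COBOL data description entries linked by REDEFINES. */
union float_bits {
    float    as_float;
    uint32_t as_u32;
};

/* Stores through one member, then reads through a different one. */
uint32_t float_bits_via_union(float value)
{
    union float_bits u;
    u.as_float = value;  /* member last used to store a value */
    return u.as_u32;     /* read through a member that was not stored */
}
```

On a compiler that supports this pattern and an IEEE 754 target, `float_bits_via_union(1.0f)` yields `0x3F800000`.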
The other issue is that even if we consider the footnote as normative (and resolve the conflicts), C's strict aliasing rules (in 6.5) would still prohibit us from accessing the same memory location as objects of different types:
An object shall have its stored value accessed only by an lvalue expression that has one of the following types:
— a type compatible with the effective type of the object,
— a qualified version of a type compatible with the effective type of the object,
— a type that is the signed or unsigned type corresponding to the effective type of the object,
— a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
— an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
— a character type.
So we have to assume that accessing (reading or writing) the same memory location using different types is at best implementation-defined (-fno-strict-aliasing), and at worst undefined behavior. I'm not the first one to have issues with how the standard defines these rules: as it turns out, the Linux kernel is compiled with -fno-strict-aliasing for this same reason (and generates a ton of type punning warnings when the flag is turned off).
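To make the list above concrete, here is a small sketch contrasting an access that the rules permit (through a character type) with one they do not. The function name `count_nonzero_bytes` is hypothetical, and the disallowed access is left in a comment so that nothing in the block relies on it.

```c
#include <stddef.h>
#include <stdint.h>

/* Permitted: the last item in the list lets a character type access
   the stored value of any object, here the bytes of a uint32_t. */
size_t count_nonzero_bytes(const uint32_t *word)
{
    const unsigned char *bytes = (const unsigned char *)word;
    size_t count = 0;
    for (size_t i = 0; i < sizeof *word; i++) {
        if (bytes[i] != 0) {
            count++;
        }
    }
    return count;
}

/* Not covered by any item in the list: reading a float object through
   an lvalue of type uint32_t. Shown only for contrast, not compiled.
 *
 *     float f = 1.0f;
 *     uint32_t bits = *(uint32_t *)&f;
 */
```

The character-type exception is what allows byte-wise inspection of a storage area, but it only goes one way: it does not make it valid to read a `float` object as a `uint32_t`.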
This strict aliasing issue is, of course, not only quite awful for any low-level systems programming, but also really bad for us, because Standard COBOL does not have the same strict aliasing restrictions. In fact, the ability to access the same memory location as different types is required in order to make REDEFINES (unions) work as specified by the standard.
As described in section 13.18.44 of the COBOL23 standard:
The REDEFINES clause allows the same computer storage area to be described by different data description entries.
And again a little further down, in the general rules for the clause:
When the same storage area is defined by more than one data description entry, the data-name associated with any of those data description entries may be used to reference that storage area.
Being able to access the same memory location as different types is a requirement for Standard COBOL, and is at the same time a violation of Standard C's strict aliasing rules. So we can conclude that C, as it currently is, cannot be used to implement COBOL's REDEFINES without invoking undefined or implementation-defined behavior (and by extension, C# unions have the same issue).
Some people have suggested using memcpy as a workaround to make type punning work in C, but this has several issues. One is that, for this to work, we need the compiler to recognize the use of memcpy for type punning and optimize it out (not always possible, and risks more UB). If for some reason the call to memcpy is not optimized away and is actually executed, then we'd be calling memcpy on overlapping memory, which is undefined behavior.
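For context, this is roughly what the suggested memcpy workaround looks like, as a sketch only. The names `float_bits_via_memcpy`, `load_u32`, and `store_u32` are hypothetical, and the code assumes C11 or later and a 32-bit `float`. Every copy here goes between the shared storage and a separate local object, so the source and destination never overlap; the sketch does not cover copying from one part of a shared storage area to another part of the same area, which is the overlapping case described above (the standard library specifies `memmove`, not `memcpy`, for overlapping regions).

```c
#include <assert.h>  /* static_assert (C11) */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: float is 32 bits wide on the target. */
static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits wide");

/* Copies the bytes of a float into a separate uint32_t object. */
uint32_t float_bits_via_memcpy(float value)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);  /* distinct objects, no overlap */
    return bits;
}

/* Hypothetical accessors for a byte-addressed storage area that several
   data description entries might share. The caller must ensure that
   offset + sizeof(uint32_t) stays within the storage area. */
uint32_t load_u32(const unsigned char *storage, size_t offset)
{
    uint32_t value;
    memcpy(&value, storage + offset, sizeof value);
    return value;
}

void store_u32(unsigned char *storage, size_t offset, uint32_t value)
{
    memcpy(storage + offset, &value, sizeof value);
}
```

Whether a given compiler turns these calls into plain loads and stores is not guaranteed by the standard, which is part of the concern raised above.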
Also, using a memcpy-like function for type punning can cause undefined behavior in COBOL. As described in section 14.6.10:
When the data items referenced by a sending and a receiving operand in any statement are identified as sharing either a part of or all of their storage areas, and the rules for the statement do not provide for a specific result in the following circumstances, then:
- When the data items are not described by the same data description entry, the result of the statement is undefined.
I'd rather not have to deal with conflicting UB on both sides and strict aliasing weirdness, so I'm proposing a subset of Standard COBOL that I'll be calling "Embedded COBOL", which consists only of language features that map as closely to one-to-one as possible onto freestanding standard C.
Please note that this won't be another dialect that Otterkit will directly support in addition to Standard COBOL, but rather a subset of Standard COBOL features that we'll be prioritizing (implementing first) in order to replace our existing C code as soon as possible.
I'll update this issue soon to start adding the C to COBOL feature mappings. Let me know if you have any feature suggestions from C (that map to COBOL) that I should add to the list.
Tagging both @gabrielesilinic and @TriAttack238. I need your feedback on this.
Also tagging @GitMensch. I don't think COBOL has any restrictions that would make this impossible, but let me know if I'm wrong on this. Also, feel free to add any C features I should include that map directly to COBOL.