Should we rethink sumtypes to allow for user-defined types? For inst

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

check for void calls: <a class="user-mention notranslate" data-hovercard-

ir: rethink sumtypes to allow for user-definable types,about llir/llvm

Comments (19)

pwaller commented on September 15, 2024 1

Well the immediate thing that strikes me is that wherever we do a type switch maybe that's a hint that you could perhaps make it an interface instead (or type switching over interfaces), perhaps, for example on types that have an "Elem()" method.

I think in general if the user wants to implement their own objects they have to accept that things might break. At least this kind of breakage will tend to be fairly obvious!

from llvm.

dannypsnl commented on September 15, 2024 1

I start to analyze this issue, in my view, there have two problems have to solve:

which structures should be replaced with interfaces

I think this is quite hard to answer. In my opinion, every instruction should be able to be customized, even those primitive(int, float), for consistently sum-type.
For every instruction would have Is<InstructionName>, e.g. Int has IsInt(), Struct has IsStruct.

where is typecasting

To answer the second, run ag 'case \*' --ignore "*_test.go" under directory llvm/ir can see the result as following shows(ignore *_test.go).

constant/expr_memory.go:149:	case *Int:
constant/expr_memory.go:152:	case *ZeroInitializer:
constant/expr_memory.go:154:	case *Vector:
constant/expr_memory.go:170:			case *Int:
constant/expr_memory.go:193:	case *Undef:
constant/expr_conversion.go:148:	case *Int:
constant/expr_other.go:47:		case *types.IntType, *types.PointerType:
constant/expr_other.go:49:		case *types.VectorType:
constant/expr_other.go:107:		case *types.FloatType:
constant/expr_other.go:109:		case *types.VectorType:
constant/expr_aggregate.go:143:	case *types.ArrayType:
constant/expr_aggregate.go:145:	case *types.StructType:
inst_memory.go:551:	case *constant.Int:
inst_memory.go:554:	case *constant.ZeroInitializer:
inst_memory.go:556:	case *constant.Vector:
inst_memory.go:572:			case *constant.Int:
inst_memory.go:595:	case *constant.Undef:
func.go:312:	case *InstCall, *TermInvoke, *TermCallBr:
inst_other.go:54:		case *types.IntType, *types.PointerType:
inst_other.go:56:		case *types.VectorType:
inst_other.go:119:		case *types.FloatType:
inst_other.go:121:		case *types.VectorType:
inst_aggregate.go:145:	case *types.ArrayType:
inst_aggregate.go:147:	case *types.StructType:
inst_aggregate.go:149:	case *types.PointerType:

Once we know the decision of the first question we know how to update these codes.

p.s. Does sumgen can handle the multiple levels of algebra data type? For example Instruction = Struct | Int, Struct = Struct | StructWithComment.

from llvm.

mewmew commented on September 15, 2024 1

Hi @dannypsnl,

Thanks for starting to take a look at this! I agree, we have to figure out what kind of interfaces we need to define.

I think the list of usages you produced is very interesting, as from there we should be able to infer common behaviour that needs an interface definition. In the case of ir/types I think @pwaller pointed out that we need something like:

type Elemer interface {
   Elem() types.Type
}

This would handle *types.VectorType, *types.PointerType and *types.ArrayType.

Note that most of the code is used to handle getelementptr.

Used by getelementptr instructions and expressions (to compute type).

inst_memory.go:551:	case *constant.Int:
inst_memory.go:554:	case *constant.ZeroInitializer:
inst_memory.go:556:	case *constant.Vector:
inst_memory.go:572:			case *constant.Int:
inst_memory.go:595:	case *constant.Undef:

constant/expr_memory.go:149:	case *Int:
constant/expr_memory.go:152:	case *ZeroInitializer:
constant/expr_memory.go:154:	case *Vector:
constant/expr_memory.go:170:			case *Int:
constant/expr_memory.go:193:	case *Undef:

Used by extractvalue and insertvalue instructions and expressions (to compute type):

inst_aggregate.go:145:	case *types.ArrayType:
inst_aggregate.go:147:	case *types.StructType:
inst_aggregate.go:149:	case *types.PointerType:

constant/expr_aggregate.go:143:	case *types.ArrayType:
constant/expr_aggregate.go:145:	case *types.StructType:

Used by icmp and fcmp instructions and expressions (to compute type).

inst_other.go:54:		case *types.IntType, *types.PointerType:
inst_other.go:56:		case *types.VectorType:
inst_other.go:119:		case *types.FloatType:
inst_other.go:121:		case *types.VectorType:

constant/expr_other.go:47:		case *types.IntType, *types.PointerType:
constant/expr_other.go:49:		case *types.VectorType:
constant/expr_other.go:107:		case *types.FloatType:
constant/expr_other.go:109:		case *types.VectorType:

Used when simplifying constant expressions:

constant/expr_conversion.go:148:	case *Int:

Used by Func.AssignIDs to check for void calls.

func.go:312:	case *InstCall, *TermInvoke, *TermCallBr:

from llvm.

mewmew commented on September 15, 2024 1

check for void calls: @mewmew Why this case didn't just use return n.Type().Equal(types.Void)? Could there any instruction with type void an not a call, invoke or terminator?

Fair point, this would work.

I think the reason is historical, as before the type of a call instruction could be either the return type of the callee or the function type of the callee. This was needed to handle variadic functions, but as of rev c56773e this logic was simplified.

As such, we can definitely use your suggested approach instead; that is to just use n.Type().Equal(types.Void) directly.

from llvm.

mewmew commented on September 15, 2024 1

I understand your approach and do think it solves a lot of problems, but we still have some remaining issues unless we further refine how we handle sumtypes internally within the llir/llvm/ir code base.

For instance, lets consider the user-defined EStruct type. It's prefect use case of how to use struct embedding to extend the *types.StructType to also support field names.

The problem is that it will not work for some parts of our type checking code.

Take for instance *types.StructType.Equal, which is implemented as follows:

func (t *StructType) Equal(u Type) bool {
	if u, ok := u.(*StructType); ok {
		if len(t.TypeName) > 0 || len(u.TypeName) > 0 {
			// Identified struct types are uniqued by type names, not by structural
			// identity.
			//
			// t or u is an identified struct type.
			return t.TypeName == u.TypeName
		}
		// Literal struct types are uniqued by structural identity.
		if t.Packed != u.Packed {
			return false
		}
		if len(t.Fields) != len(u.Fields) {
			return false
		}
		for i := range t.Fields {
			if !t.Fields[i].Equal(u.Fields[i]) {
				return false
			}
		}
		return true
	}
	return false
}

Notice on line 788 the runtime type assertion:

if u, ok := u.(*StructType); ok {

If we pass a *EStructType to *ir.StructType.Equal, then the check would always return false. e.g.:

structType := types.NewStruct(types.I32, types.Float)
structTypeWithNames := &EStruct{
   StructType: structType,
   FieldNames: []string{"x", "y"},
}
structTypeWithNames.Equal(structType) // return true.

structType.Equal(structTypeWithNames) // return false. <<< NOTE: unexpected result.

structType.Equal(structTypeWithNames.StructType) // return true

So, for these reasons, we still have to think about relaxing the internal code base of llir/llvm/ir to handle interfaces rather than using type assertions, to make it easier to create user-defined types (and instructions), such as the extended types you provided.

Cheers,
Robin

from llvm.

mewmew commented on September 15, 2024

While here, we should also add remaining test cases to ensure that sumtypes are implemented and enforced.

e.g.

// Assert that each constant implements the constant.Constant interface.
var _ constant.Constant = (*Global)(nil)

from llvm.

pwaller commented on September 15, 2024

I don't have a strong opinion on this. I can see the benefit of "locking down an API to only hold objects you exactly know what they are". But the alternative view in my mind is that you don't care what the object is, but only the interface it satisfies. If someone is willing to pass an object that satisfies the interface, I don't immediately see a strong reason to disallow it.

Someone also recently pointed out that you can actually defeat this by embedding anyway!

https://news.ycombinator.com/item?id=19347281

The context they brought it up was that this technique is actually used in go SSA.

from llvm.

mewmew commented on September 15, 2024

I can see the benefit of "locking down an API to only hold objects you exactly know what they are".

The benefit is unquestionably stability. However, with stability comes limitation, just like in life :p

If someone is willing to pass an object that satisfies the interface, I don't immediately see a strong reason to disallow it.

The main reason I can think of is that there are definitely parts of the llir/llvm/ir code base that would start to break down, as they perform type switches on known concrete types.

For instance, let say someone defines their own llir/llvm/ir/types.Type. This would break when used with the getelementptr instruction, as we use the concrete types.FooType types to determine the element type of the gep instruction.

llvm/ir/inst_memory.go

Line 560 in fe7863f

case *types.StructType:

However, we also want to allow such use cases as they are very powerful. For instance, it is easy to envision a user-defined LLVM IR type for structures with named fields.

package foolang

type StructType struct {
   FieldNames []string
   *types.StructType
}

I really wish we can find a way to fix such fragile code, so that we may allow users to define their own LLVM IR types, values, etc. I think it is good to enable using the IR package in ways not envisioned by the authors of the IR package. Essentially following the principles of Unix.

Someone also recently pointed out that you can actually defeat this by embedding anyway!

That's a fair point.

from llvm.

mewmew commented on September 15, 2024

I stumbled upon sumgen, and it may be worth investigating whether we should use such a tool to generate the dummy method definitions for sumtypes. Perhaps it's worth it, perhaps not. Regardless, it would be interesting to explore.

from llvm.

mewmew commented on September 15, 2024

Having worked a bit on getting Geode up to speed with using v0.3.0 (in geode-lang/geode#27), I'm quite convinced that we should work on getting sumtypes right for the v0.4.0 release, to allow for user-defined instructions (e.g. line comments), types (e.g. structs with field names), constants, etc.

Just putting this here, so we can prioritize working on this issue early on as it could have quite invasive API changes.

from llvm.

dannypsnl commented on September 15, 2024

When I implementing getelementptr I found using Elemer would be a huge change. I haven't dig into how gep work in codebase but can see there would be a big impact. Currently it actually wants gep.Index rather than get element directly. So I create a Indexer as following code changes. I don't feel index at here is a good abstraction, but if we keep this I think I would like to use the same implementation rather than current way, that means have to use llir/llvm ast rather than the llir/ll ast. But first focus on how to implement gep use Elemer.

// NOTE: keep getIndex in sync with getIndex in:
//
//    * ast/inst_memory.go
//    * ir/inst_memory.go
//    * ir/constant/expr_memory.go
//
// The reference point and source of truth is in ir/constant/expr_memory.go.

code changes

diff --git a/ir/constant/const_indexer.go b/ir/constant/const_indexer.go
new file mode 100644
index 0000000..0f54aee
--- /dev/null
+++ b/ir/constant/const_indexer.go
@@ -0,0 +1,64 @@
+package constant
+
+import (
+	"fmt"
+
+	"github.com/llir/llvm/internal/gep"
+)
+
+type Indexer interface {
+	GetIndex() gep.Index
+}
+
+func (i *Int) GetIndex() gep.Index {
+	return gep.NewIndex(i.X.Int64())
+}
+
+func (z *ZeroInitializer) GetIndex() gep.Index {
+	return gep.NewIndex(0)
+}
+
+func (u *Undef) GetIndex() gep.Index {
+	return gep.Index{HasVal: false}
+}
+
+func (index *Vector) GetIndex() gep.Index {
+	// ref: https://llvm.org/docs/LangRef.html#getelementptr-instruction
+	//
+	// > The getelementptr returns a vector of pointers, instead of a single
+	// > address, when one or more of its arguments is a vector. In such
+	// > cases, all vector arguments should have the same number of elements,
+	// > and every scalar argument will be effectively broadcast into a vector
+	// > during address calculation.
+	if len(index.Elems) == 0 {
+		return gep.Index{HasVal: false}
+	}
+	// Sanity check. All vector elements must be integers, and must have the
+	// same value.
+	var val int64
+	for i, elem := range index.Elems {
+		switch elem := elem.(type) {
+		case *Int:
+			x := elem.X.Int64()
+			if i == 0 {
+				val = x
+			} else if x != val {
+				// since all elements were not identical, we must conclude that
+				// the index vector does not have a concrete value.
+				return gep.Index{
+					HasVal:    false,
+					VectorLen: uint64(len(index.Elems)),
+				}
+			}
+		default:
+			// TODO: remove debug output.
+			panic(fmt.Errorf("support for gep index vector element type %T not yet implemented", elem))
+			//return gep.Index{HasVal: false}
+		}
+	}
+	return gep.Index{
+		HasVal:    true,
+		Val:       val,
+		VectorLen: uint64(len(index.Elems)),
+	}
+}
diff --git a/ir/constant/expr_memory.go b/ir/constant/expr_memory.go
index 392c5f8..be2b73d 100644
--- a/ir/constant/expr_memory.go
+++ b/ir/constant/expr_memory.go
@@ -116,7 +116,7 @@ func (index *Index) String() string {
 func gepExprType(elemType, src types.Type, indices []Constant) types.Type {
 	var idxs []gep.Index
 	for _, index := range indices {
-		idx := getIndex(index)
+		idx := GetIndex(index)
 		// Check if index is of vector type.
 		if indexType, ok := index.Type().(*types.VectorType); ok {
 			idx.VectorLen = indexType.Len
@@ -126,7 +126,7 @@ func gepExprType(elemType, src types.Type, indices []Constant) types.Type {
 	return gep.ResultType(elemType, src, idxs)
 }
 
-// NOTE: keep getIndex in sync with getIndex in:
+// NOTE: keep GetIndex in sync with GetIndex in:
 //
 //    * ast/inst_memory.go
 //    * ir/inst_memory.go
@@ -134,8 +134,8 @@ func gepExprType(elemType, src types.Type, indices []Constant) types.Type {
 //
 // The reference point and source of truth is in ir/constant/expr_memory.go.
 
-// getIndex returns the gep index corresponding to the given constant index.
-func getIndex(index Constant) gep.Index {
+// GetIndex returns the gep index corresponding to the given constant index.
+func GetIndex(index Constant) gep.Index {
 	// unpack inrange indices.
 	if idx, ok := index.(*Index); ok {
 		index = idx.Constant
@@ -146,52 +146,8 @@ func getIndex(index Constant) gep.Index {
 		index = idx.Simplify()
 	}
 	switch index := index.(type) {
-	case *Int:
-		val := index.X.Int64()
-		return gep.NewIndex(val)
-	case *ZeroInitializer:
-		return gep.NewIndex(0)
-	case *Vector:
-		// ref: https://llvm.org/docs/LangRef.html#getelementptr-instruction
-		//
-		// > The getelementptr returns a vector of pointers, instead of a single
-		// > address, when one or more of its arguments is a vector. In such
-		// > cases, all vector arguments should have the same number of elements,
-		// > and every scalar argument will be effectively broadcast into a vector
-		// > during address calculation.
-		if len(index.Elems) == 0 {
-			return gep.Index{HasVal: false}
-		}
-		// Sanity check. All vector elements must be integers, and must have the
-		// same value.
-		var val int64
-		for i, elem := range index.Elems {
-			switch elem := elem.(type) {
-			case *Int:
-				x := elem.X.Int64()
-				if i == 0 {
-					val = x
-				} else if x != val {
-					// since all elements were not identical, we must conclude that
-					// the index vector does not have a concrete value.
-					return gep.Index{
-						HasVal:    false,
-						VectorLen: uint64(len(index.Elems)),
-					}
-				}
-			default:
-				// TODO: remove debug output.
-				panic(fmt.Errorf("support for gep index vector element type %T not yet implemented", elem))
-				//return gep.Index{HasVal: false}
-			}
-		}
-		return gep.Index{
-			HasVal:    true,
-			Val:       val,
-			VectorLen: uint64(len(index.Elems)),
-		}
-	case *Undef:
-		return gep.Index{HasVal: false}
+	case Indexer:
+		return index.GetIndex()
 	case Expression:
 		// should already have been simplified to a form we can handle.
 		return gep.Index{HasVal: false}
diff --git a/ir/inst_memory.go b/ir/inst_memory.go
index f0aef35..0a1e90a 100644
--- a/ir/inst_memory.go
+++ b/ir/inst_memory.go
@@ -538,69 +538,5 @@ func gepInstType(elemType, src types.Type, indices []value.Value) types.Type {
 
 // getIndex returns the gep index corresponding to the given constant index.
 func getIndex(index constant.Constant) gep.Index {
-	// unpack inrange indices.
-	if idx, ok := index.(*constant.Index); ok {
-		index = idx.Constant
-	}
-	// Use index.Simplify() to simplify the constant expression to a concrete
-	// integer constant or vector of integers constant.
-	if idx, ok := index.(constant.Expression); ok {
-		index = idx.Simplify()
-	}
-	switch index := index.(type) {
-	case *constant.Int:
-		val := index.X.Int64()
-		return gep.NewIndex(val)
-	case *constant.ZeroInitializer:
-		return gep.NewIndex(0)
-	case *constant.Vector:
-		// ref: https://llvm.org/docs/LangRef.html#getelementptr-instruction
-		//
-		// > The getelementptr returns a vector of pointers, instead of a single
-		// > address, when one or more of its arguments is a vector. In such
-		// > cases, all vector arguments should have the same number of elements,
-		// > and every scalar argument will be effectively broadcast into a vector
-		// > during address calculation.
-		if len(index.Elems) == 0 {
-			return gep.Index{HasVal: false}
-		}
-		// Sanity check. All vector elements must be integers, and must have the
-		// same value.
-		var val int64
-		for i, elem := range index.Elems {
-			switch elem := elem.(type) {
-			case *constant.Int:
-				x := elem.X.Int64()
-				if i == 0 {
-					val = x
-				} else if x != val {
-					// since all elements were not identical, we must conclude that
-					// the index vector does not have a concrete value.
-					return gep.Index{
-						HasVal:    false,
-						VectorLen: uint64(len(index.Elems)),
-					}
-				}
-			default:
-				// TODO: remove debug output.
-				panic(fmt.Errorf("support for gep index vector element type %T not yet implemented", elem))
-				//return gep.Index{HasVal: false}
-			}
-		}
-		return gep.Index{
-			HasVal:    true,
-			Val:       val,
-			VectorLen: uint64(len(index.Elems)),
-		}
-	case *constant.Undef:
-		return gep.Index{HasVal: false}
-	case constant.Expression:
-		// should already have been simplified to a form we can handle.
-		return gep.Index{HasVal: false}
-	default:
-		// TODO: add support for more constant expressions.
-		// TODO: remove debug output.
-		panic(fmt.Errorf("support for gep index type %T not yet implemented", index))
-		//return gep.Index{HasVal: false}
-	}
+	return constant.GetIndex(index)
 }

from llvm.

dannypsnl commented on September 15, 2024

extractvalue and insertvalue : aggregateElemType in ir and ir/constant has different behavior. One allows PointerType and one not, if this is not expected then we could use one sumtype else has to create two.
icmp and fcmp: They actually are type, I think they are not the target of this issue. But we could have another issue for user-extended type, it could be useful, e.g. simulate class from a high perspective.
simplifying: Only *Int for now, I'm not sure would users need to extend Int?
~~4. check for void calls: @mewmew Why this case didn't just use return n.Type().Equal(types.Void)? Could there any instruction with type void an not a call, invoke or terminator?~~

BTW, in document mentions that pointer could be an integer in some cases. This probably a good reason to unify them by an interface.

If the operands are pointer typed, the pointer values are compared as if they were integers.

from llvm.

dannypsnl commented on September 15, 2024

I have another idea: What if we make every IR structure became an interface. Therefore Function for example, would be:

type Function interface {
    // ...
}
var _ Function = &function{}
type function struct {
    // ...
}

Anyone want to extend it would be able to:

type Class struct {
    ir.Function
    ir.Struct
}

func NewClass() *Class {
    return &Class {
        Function: ir.NewFunction(...),
        Struct: ir.NewStruct(...),
    }
}

The benefit I can see at here, is users don't need to learn advanced concept like Elemer. People can understand Elemer, but which can be an Elemer is not clean as extend ir.Vec, ir.Array and ir.Pointer.

BTW, this kind of implementation also using in Kubernetes.

from llvm.

mewmew commented on September 15, 2024

I have another idea: What if we make every IR structure became an interface. Therefore Function for example, would be:

I like the idea! We should definitely create an experimental branch of llir/llvm/ir where we try this out!

The main downside I see is that the fields would no longer be exported, so we would have to create a lot of getters and setters, and this would make the API quite bloated.

Besides that, the extensibility of the sumtypes would be hugely improved. So there are definitely benefits and drawbacks. So, to get a better understanding of how to balance this, we should implement it in an experimental branch and get more experience.

Cheers,
Robin

from llvm.

dannypsnl commented on September 15, 2024

@mewmew You can get the experimental implementation at this commit, it only changes ir.Func BTW.

from llvm.

mewmew commented on September 15, 2024

@mewmew You can get the experimental implementation at this commit, it only changes ir.Func BTW.

Thanks for starting to experiment with this!

As you already pointed out, there are benefits and drawbacks with the interface-everything apprach. The benefits of such extensive use of interfaces is that we would enable users to extend the IR in almost every way. Whether we need this full power of extensibility is yet to be determined, thus great to have more concrete experiments to base such decisions on.

For now, lets consider some of the drawbacks with the current experimental implementation. One of the drawbacks to the API seems to be the handling of slices (as you already pointed out in https://github.com/dannypsnl/llvm/commit/ac2baf5820d9c13291f0be568546aba18fceac9e#r37512892). E.e.

f.SetFuncAttrs(append(f.GetFuncAttrs(), funcAttr))

The other issue is that type assertions on interfaces may not be enough to fully and uniquely discriminate them. The basic issue is that an interface with just getters/setters for its fields may not be enough to fully determine a unique concrete underlying type, e.g. add and sub instructions have the same fields (see https://github.com/dannypsnl/llvm/commit/ac2baf5820d9c13291f0be568546aba18fceac9e#r37514418). For what it's worth, the fix should be quite simple, we just add a IsAddInst method or something along those lines to differentiate e.g. add instructions from sub instructions.

Cheers,
Robin

Edit: one possible outcome that we see from this experiment is that some IR concepts may be turned into interfaces to allow user-defined types, but others may be kept as core IR concepts with concrete struct types. Rather than going for a purist approach of having everything be an interface, it may turn out more pragmatic to have the concepts of the IR which are most likely for a user to modify and extend be turned into interfaces. For this, instructions is the obvious contender. And also the LLVM IR defined types. One such example I've wanted personally is an LLVM IR struct type which also tracks the names of its fields. When it comes to instructions, the pseudo-instruction used for comments is also a common example.

from llvm.

dannypsnl commented on September 15, 2024

I suddenly realize we can embed ir.Module as a solution for use-cases mentioned here, should I close the issue?

from llvm.

mewmew commented on September 15, 2024

I suddenly realize we can embed ir.Module as a solution for use-cases mentioned here, should I close the issue?

Can you provide a quick example code which uses embedding of ir.Module?

from llvm.

dannypsnl commented on September 15, 2024

For example, we want to generate a class handle, in the meanwhile affect our declaration~~~

type EModule struct {
    *ir.Module
}

func (m *EModule) NewClass() *Class {
    return &Class {
        Function: m.NewFunction(...),
        Struct: m.NewStruct(...),
    }
}

type EModule struct {
    *ir.Module
}

type EStruct struct {
   FieldNames []string
   *types.StructType
}

func (m *EModule) NewStruct(fieldNames []string, fieldTypes []types.Type) *EStruct {
    s := &EStruct {
        StructType: m.NewStruct(...),
    }
    s.FieldNames = fieldNames
    return s
}

from llvm.

ir: rethink sumtypes to allow for user-definable types about llvm HOT 19 OPEN

Comments (19)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent