Code Monkey home page Code Monkey logo

www's Introduction

CMSC 430: Design and Implementation of Programming Languages

This repository contains the course material for CMSC 430: Design and Implementation of Programming Languages (aka Compilers), taught at the University of Maryland, College Park.

The current instance of this course is:

Copyright © David Van Horn and José Manuel Calderón Trilla and Leonidas Lampropoulos

Licensed under the Academic Free License version 3.0

www's People

Contributors

abrassel avatar amozolin avatar arjunnaha avatar astronautlevel2 avatar bquiring avatar danielhuang avatar dmanikt avatar dvanhorn avatar ivan98q avatar jmct avatar kihtrakraknas avatar lemonidas avatar pdarragh avatar tasnimkabir avatar temurson avatar vyasgupta avatar wchung43 avatar williamchung211 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

www's Issues

`read_byte` should wrap an unsigned char

Currently read_byte returns val_wrap_int(c) where c is the char that was read, but this will encode the byte as a signed integer, whereas a byte should be in (0,256]. The fix is to add a new function to the run-time:

val_t val_wrap_byte(unsigned char b)

and have read_byte return val_wrap_byte(c). The definition is the same as val_wrap_int.

Spelling Error

In www/notes/a86.scrbl on line 623 and 644 in the comment. It should say calling instead of "callling".

Embrace `#lang`

A few perennial issues in the class:

  • people don't know what the behavior of a given program should be. We often say, "what does Racket say?" and this works 99% of the time, but we're really talking about a subset of Racket and there are subtle issues here.
  • we assume expressions are "well-formed" but never fully spell out what this means.
  • students don't seem to internalize they are building a PL in which they can write and run programs.

To improve this, I propose, we ship #lang language for each of the languages plus all of the assignment languages. The implementation will be by expansion to Racket, so you'll get usual IDE tools, plus static checking for conformance to the language. This makes it definitive to say "run Check Syntax" and if it's OK, your compiler better be able to compile it. It also gives students a reference implementation for the interpreter without leaking any details since the implementation via expansion doesn't shed light on how to write the interpreter/compiler. Being able to run programs in DrRacket may help reinforce the ideas that you can write programs in these languages.

Check `system-type` in `asm-interp`

Similar to how we check for nasm and provide a helpful error message, we should check that (system-type 'arch) gives you 'x86_64 and if not say that you've installed the wrong version of Racket and need to install the Intel 64-bit version.

Possible issue in check:target implementation

The a86 ast.rkt defines a struct guard used by Call and all the jump varieties. On the third line of this function, the comment diverges significantly from the code:

(define check:target
  (λ (a x n)
    (unless (or (symbol? x) (offset? x)); either register or label
      (error n "expects symbol; given ~v" x))
    (values a x)))

From reading on the call and jump instructions in x86, I think check:target could hypothetically allow labels, registers, and offsets, but the architecture also supports some kind of relative addressing (with which I am not familiar) that uses some alternate syntax we don't support in the pretty-printer as yet. We could update this guard to allow registers directly as well as the other two forms.

That said, I think the only way we use these instructions in 430 is with labels. Would it maybe be better to just limit them to labels only?

Only one global label

Currently printer.rkt only makes the first label in a program global, so you can't call any other racket-compiled functions from c.

notes 22 source archive 404 & `racketblock` failed to render

link for reference.

i couldn’t find the source for this page on the repository so if this is still in-progress please close this issue, sorry for the bother.

other unsolicited feedback: this looks like a really nice course, thanks for putting so much work into making the content easily digestible. just reading the notes tonight has been very pleasant.

Rethink lambdas, datums, etc

Starting with Loot the compiler should either return multiple values or a more structured result than just an Asm sequence so that compilation can be a single pass. The code would be more efficient and clearer and eliminate the functions that scan over a program and produce certain subexpressions (e.g. lambdas, datums).

Symbols with bad chars (for nasm)

Symbols with characters that nasm can't handle result in nasm errors.

(asm-interp
   (prog (Label 'foo-x)
         (Mov rax 42)
         (Ret)))

Results in:

assembly error: make sure to use `prog` to construct an assembly program
if you did and still get this error; please share with course staff.

macOS PATH instructions have outdated version number

Course suggests Racket 8.7 and it's what you'll get if you go to the Racket website, but macOS instructions for adding to PATH still say 8.6. See:

@verbatim|{ PATH=$PATH:"/Applications/Racket v8.6/bin"}|

It's likely not a huge deal to update this when necessary, but maybe there's a decent permanent solution?

  • Maybe Scribble has support for variables, and there could be a version variable used when discussing the version used by the class.
  • Perhaps a path of /Applications/Racket\ v*.*/bin
    • At first glance it feels wrong to me to have this path end up in ~/.zprofile, so maybe something like this would work?
      • echo "PATH=\$PATH:'$(echo /Applications/Racket\ v*.*/bin)'" >> ~/.zprofile && source ~/.zprofile
        • Note: The output of echo "PATH=\$PATH:'$(echo /Applications/Racket\ v*.*/bin)'" is PATH=$PATH:'/Applications/Racket v8.7/bin'
      • This assumes you want the instructions to have a command that automatically modifies ~/.zprofile. This also assumes that ~/.zprofile is the correct file as it is mentioned in the documentation; it doesn't work on my Linux system, but maybe it does on macOS.

Do we need a `#lang`?

Here are some loose thoughts on a potential redesign of the 430 materials to deal with some long standing issues I see.

The bigger issues:

  • Racket is a big language and folks can go down a rabbit hole learning parts of Racket that aren't really relevant for the class, e.g. for, quasiquote, fancy match patterns, stuff in libraries, etc. Not necessarily bad, but I'd prefer students worked within the small subset that's needed and try to really master that subset. This isn't a class on Racket.
  • The code people write is atrocious. This is largely a product of UMD's autograder über alles. Perfect solution would be human code review and feedback. But also some language support could help. Maybe we require adherence to a style guide programmatically: require purpose statements, type signatures, consistent formatting, etc.
  • The lack of type enforcement leads to students spending way too much time debugging programs that have type errors, or worse, coding around those errors.

Some smaller issues:

  • Since we're committed to programming in Racket, we get the warts too. Transparent structs aren't the right default for us, so we have to explain #:prefab etc.
  • We also get tied to the Racket semantics, e.g. truish, etc. Would be nice if we could make our own design choices.

There are other issues (the source language and meta language being the same causes confusion, etc.), but a way of dealing with the above issues is creating our own #lang language. That gives us control over much of this stuff.

Loot compiler doesn't handle returning procedures

When looking over the Loot compiler, I noticed that the runtime environment and compile.rkt all look like they can and should handle returning a procedure, e.g. (lambda (x) x) -> "#", but the unload-bits-asm.rkt does not like this -- I am getting "; match: no matching clause for ...". I feel like it is as simple as matching (? proc-bits? i) but I thought this was worth mentioning.

long label names bug

According to the NASM docs: Maximum length of an identifier is 4095 characters.

What they mean is "only the first 4095 characters are used in an identifier," so you don't get an assembly-time error if you have a label name that exceeds this bound, it just truncates the label to the first 4095 characters.

This is problematic for our compiler starting with Iniquity since labels are derived from user-provided function names. If you define two very long function names that are only distinguished from each other after the character limit, then you'll get a "inconsistently redefined" error from NASM.

This may seem unlikely for function names and could potentially be considered a static error or something, but it's more problematic for string literals, which also derive label names.

The solution seems to be to change symbol->label and symbol->data-label to truncate the string before appending the unique eq hash code on the end.

Evildoer Parser Typo

I believe the case that is "[(list 'parse) (Prim0 'parse)]" should be "[(list 'void) (Prim0 'void)]".

Recursive data structures are broken

Related to #152.

The ziggy bits->value functions recurse without pause. Consider the case for vectors in Hoax:

www/ziggy/src/types.rkt

Lines 44 to 50 in d6fbb58

{:> H1}
[(vect-bits? b)
(if (zero? (untag b))
(vector)
(build-vector (heap-ref b)
(lambda (j)
(bits->value (heap-ref (+ b (* 8 (add1 j))))))))]

I think bits->value could be reworked to include a map from addresses to reconstituted values so we can reconstruct cyclical heap-allocated data structures.

Depending on how complex the final implementation is, it might give us an opportunity to talk about cyclical data structures explicitly in the course.

Array data structures should keep their lengths encoded

Our current layout for strings and vectors is to store their unencoded length in the first word, but there's really no good reason doing it this way it and it means you have to re-encode it when you access the length. I think this was originally done to make bounds checking simpler to understand, but it's basically the same encoded or not. Changing this would simplify the code a bit.

Arguably the same should be done for the character data in strings. We have the bits available, so using codepoints doesn't buy you anything except the work of de/re-coding.

It's technically possible to pack 3 codepoints per words (but not chars), but we don't do that.

`prog` should check that entry label is `Global`

asm-interp calls the first label in the given instructions, but if that label is not declared as Global you get a bad error message from the ffi:

#lang racket
(require a86)
(asm-interp
 (prog
  (Mov 'rax 3)
  (Call 'done)
  (Ret)
  (Label 'done)
  (Mov 'rax 0)
  (Ret)))
ffi-obj: could not find export from foreign library
name: done
library: /var/tmp/nasm16878285301687828530715.so
system error: /var/tmp/nasm16878285301687828530715.so: undefined symbol: done 

This is usually the result of forgetting to have an entry point.

The prog function should check that the first label is declared as Global and provide some guidance if not.

a86: Check immediate ranges

Instructions that take immediate values should check that values are in range. Some instructions, e.g. Cmp can only take 32-bit immediate, and will silently do something unexpected if you use a 64-bit immediate.

Bit-widths in (some?) operations must be compared

A student attempted to compile this code:

;; 2^32 - 16
(seq
  (Mov 'rdx 4294967280)
  (And 'eax 'rdx))

and they got this error:

assembly error: make sure to use `prog` to construct an assembly program
if you did and still get this error; please share with course staff.

/var/tmp/nasm17077944821707794482656.s:8: error: invalid combination of opcode and operands

Apparently, the specification of and cares about the order of the operands: if the left operand is only 32 bits wide (e.g., eax), then the second operand can only be 32 bits wide. However, if the first operand is 64 bits wide (e.g., rax), then the second operand can be of any lesser width (e.g., eax) and that second operand will be sign-extended.

So we'll want to update the checks in the a86 instructions to look for this, and then update the course notes accordingly. I assume and is not the only instruction featuring this behavior, so we'll have to look through the others to see what else ought to be changed.

Rework literals

Should adopt the Mountebank AST approach early on and have (Quote <datum>) where <datum> is just integer to begin, then add boolean, char, etc. Mountebank would then just add pairs, vectors, etc.

Rework pattern matching to eliminate a bunch of pattern forms and pun expression forms, eg. (PVar x) is just (Var x) and (PLit d) is just (Quote d). Have to revisit after Mountebank.

Redex font issue

Somewhere in the move to building using GH actions instead of Travis CI or the move from Racket 7.4 to 7.9, the rendering of the fancier unicode characters in the Redex semantics has broken.

So instead of:
image

You get:
image

Con is a reserved word in Windows

Per MSDN (emphasis mine, extra text omitted):

Do not use the following reserved names for the name of a file:

CON, [...]. Also avoid these names followed immediately by an extension[...].

I suggest renaming the con language to something like connive. A bit longer, but still alphabetically appropriate and will not cause issues for students trying to interact with the languages on Windows.

Prevent arbitrary x86 insertion

The implementation of comment rendering in the printer (below) allows the insertion of arbitrary x86 instructions, which we don't want. I think this can be fixed by replacing the existing implementation with one that replaces newlines \n with commented newlines \n;; (or something prettier).

www/langs/a86/printer.rkt

Lines 241 to 245 in 82864f8

(define (comment->string c)
(match c
[(% s) (string-append (make-string 32 #\space) "; " s)]
[(%% s) (string-append tab ";; " s)]
[(%%% s) (string-append ";;; " s)]))

Improve converted names for arithmetic operators

If symbol->label is used on names like +, you end up with label___12345, which is very unhelpful. Maybe we could incorporate special-case names for certain common symbols, depending on what the nasm specification allows.

www/langs/a86/ast.rkt

Lines 400 to 421 in 7f55e4f

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Symbol to Label
;; Symbol -> Label
;; Produce a symbol that is a valid Nasm label
;; Guarantees that (eq? s1 s2) <=> (eq? (symbol->label s1) (symbol->label s1))
(provide symbol->label)
(define (symbol->label s)
(string->symbol
(string-append
"label_"
(list->string
(map (λ (c)
(if (or (char<=? #\a c #\z)
(char<=? #\A c #\Z)
(char<=? #\0 c #\9)
(memq c '(#\_ #\$ #\# #\@ #\~ #\. #\?)))
c
#\_))
(string->list (symbol->string s))))
"_"
(number->string (eq-hash-code s) 16))))

Incorrect semantics regarding `lookup`

The lookup function introduced in Fraud is given semantics that suggest a failed lookup should result in a dynamic error, but we treat this case as a static error. The semantics should be updated to remove a definition for a failed lookup call, resulting in undefined behavior.

Simplify unload-bits-asm and asm-interp when there's a heap

Currently asm-interp changes it's return type if the runtime uses a heap and produces a pair of the result and a pointer to the heap. The latter is only used to free the memory after the value has been reconstructed. You don't need the heap pointer to reconstruct the value.

It seems to make more sense to just not free the memory and return the bits in rax, from which the value can be reconstructed. This keeps the concept of asm-interp simple and consistent from Abscond through to the end.

There's some worry about leaking memory, but one solution is to have asm-interp re-use the heap memory on each call (it may already do this, I'm not sure). This means you have to be done using the memory before another call to asm-interp happens, but that seems reasonable.

Detect bad label names in `prog`

Label names that happen to coincide with x86 instruction names that are not supported in a86 cause issues, but these errors are very opaque to students — especially when only certain versions of nasm report them. Add tests to prog (or label?) to check for these sorts of issues.

Example program:

        default rel
        section .text
        global entry
entry:
        mov rax, 0
        mov rcx, rax
        mov rax, 0
        cmp rcx, 0
        je end
        jne rep
rep:
        pop r8
        add rax, r8
        sub rcx, 1
        cmp rcx, 0
        jne rep
end:
        ret

Error:

~ $ nasm test.s
test.s:10: error: expression syntax error
test.s:11: error: parser: instruction expected
test.s:16: error: expression syntax error

Duplicate bindings not discovered at parse-time

In Racket, duplicate bindings in a let are a syntax error:

> (let ([x 1] [x 2]) x)
string:1:13: let: duplicate identifier
  at: x
  in: (let ((x 1) (x 2)) x)
 [,bt for context]

We don't perform these checks during parsing, though, and instead wait until interpretation or compilation to discover the issue. Historically this has been fine because we simply don't test syntactically invalid inputs.

However, as we continue to implement competitive elimination assignments with reference machines, it becomes important to improve the underlying implementations. I think it would be best if we incorporate checks for this in the parser going forward.

Typo in notes

www/notes/con.scrbl line 68/69

"is the meaning of e1 if the meaning of e1 if the meaning of e0 is 0"

Sentence Error

In www/notes/abscond.scrbl on lines 276-277 the sentence does not make sense, it says:
@item{it is a low-level language, so compiling to a high-level
language to x86-64 will require building everything from scratch,}

It should say:
@item{it is a low-level language, so compiling from a high-level
language to x86-64 will require building everything from scratch,}

Recursive printing is broken

When we have self-referential heap structures, the output loops infinitely and must be interrupted:

> (let ([v (make-vector 1 #t)]) (begin (vector-set! v 0 v) v))
#0='#(#0#)
> (run (compile (parse '(let ([v (make-vector 1 #t)]) (begin (vector-set! v 0 v) v)))))
^Cuser break [,bt for context]

It'd be nice to implement a (simple) check for these cases and indicate the recursion somehow.

Better error messages when linking fails in asm-interp

There should be better error messages when asm-interp fails due to linking. A very common case is that the user forgot to set current-objs to include the run-time system for the language they're working in. So a message could say something like: "a linking error occurred, this is probably because your code depends on object files that are not included in current-objs; try (current-objs '("runtime.o")) for example, to link the language run-time to asm-interp" and then dump the link error which will include useful info such as the undefined symbol or inability to reposition code.

The notes could be better about this by explicitly setting current-objs. I believe setting this parameter may be hidden currently.

Finally, it might be good to have a language-specific asm-interp function that is set-up appropriately for each language; it would only be a couple lines for each language.

a86: unsized moves slip past error checking

This program results in a nasm error; the Mov constructor needs to check that you're not moving a literal into a memory offset (error message should recommend going through a register).

(Alternatively we could size all the moves so this is allowed...)

#lang racket
(require a86)
(asm-interp
 (prog (Label 'f)
       (Mov (Offset 'rax 0) 0)
       (Ret)))

Results in:

assembly error: make sure to use `prog` to construct an assembly program
if you did and still get this error; please share with course staff.

/var/folders/ml/jqmf1yqj5n9f721tqmkz8tm40000gn/T/nasm16130216891613021689814.s:5: error: operation size not specified

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.