
Interpreter for a Simple Pascal-like Programming Language

Author : Jack Robbins

Executive Summary

This project is an interpreter for a simple Pascal-like programming language. Interpreted languages, such as the language in this project and more well-known ones like Python and R, are executed line by line, as opposed to compiled languages, where programs are translated into machine code all at once before execution. The interpreter for this language is made up of two main components: a tokenizer/lexical analyzer and a recursive-descent parser/interpreter. To show how this programming language is supposed to look and function, numerous example programs are provided in the tests folder. If you are interested, I would greatly encourage you to read the formal write-up below, download the code, and write some programs yourself in this programming language.

In short, with this project we can take a program like this, written in our own custom programming language:

program circle;
var
	{Clean program testing nested if statement}
	r, a, p, b : real; 
	flag : boolean := true;
	i, j : integer := 0;
	str : string := 'End of Program';	
begin
	r := 6;
	p := 0;
	a := 5;
	
	if (a > 0 and a < 100) then
	begin
		b := 25;
		if (i = 0) then j := r + 1;
	    p := 2 * 3.14 * r 
	end
	else
	begin
	    p := -1;
	    flag := false
	end;
	{Display the results}
	writeln ( 'The result of a = ' , a, ', ', b);
	writeln('The result of p = ' , p);
	writeln(str)
end.

And run it, giving us an output that looks like this:

The result of a = 5.00, 25.00
The result of p = 37.68
End of Program

Note

There is adequate documentation within the source code itself, but for a full theoretical understanding of this project, I strongly recommend reading the full write-up that follows this executive summary.

Extended Backus-Naur Form (EBNF) Meta-syntax

All programming languages must have a grammar: a ruleset that defines which statements are valid for the language. These rulesets are called context-free grammars, because each rule can be applied without regard to the surrounding context. EBNF is a notation for formally describing context-free grammars, which are the theoretical foundation of all programming languages. The formal language generated by a context-free grammar is the set of terminal strings that can be produced by its rules.

Mathematical Foundations

Mathematically, any context-free grammar G is defined as the 4-tuple $G = (V, \Sigma, R, S)$, such that:

  1. $V$ is the finite set of non-terminal symbols. Each non-terminal $v \in V$ is a symbol that can be replaced (rewritten) using the rules of the grammar.
  2. $\Sigma$ is the finite set of terminal symbols, with $\Sigma \cap V = \varnothing$. Terminals cannot be rewritten by any rule; they form the alphabet of the language and are defined by the language itself.
  3. $R$ is a finite set of production rules, i.e. a finite subset of the relation $V \times (V \cup \Sigma)^*$. Each rule rewrites a non-terminal as a string $\beta \in (V \cup \Sigma)^*$ of terminals and non-terminals.
  4. $S$ is the start/entry symbol of the grammar. It is necessary that $S \in V$.

Together, these four components allow the grammar to generate every valid "sentence", or in our case every valid programmatic statement, of the language.
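
For a small illustrative example (a toy fragment, not the full grammar of this language), consider a grammar that can only generate assignments of one identifier to another: $V = \{AssignStmt, Var, Expr\}$, $\Sigma = \{IDENT, :=\}$, $S = AssignStmt$, and $R = \{AssignStmt \rightarrow Var := Expr,\ Var \rightarrow IDENT,\ Expr \rightarrow IDENT\}$. Starting from $S$ and applying the rules, the only terminal string that can ever be produced is $IDENT := IDENT$, which is therefore the entire formal language generated by this toy grammar.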

EBNF Notation Rules

We have now explored the mathematical foundation for context-free grammars. As stated above, EBNF meta-syntax is used to describe context-free grammars, which are the foundation for all programming languages, including the language implemented in this project. The table below is a handy "cheat sheet" that contains everything you will need to know to understand the EBNF rules for this language.

Important

Listed below is the notation that will be used in the formal EBNF definition of this programming language in the next section. It is important that this notation is understood before reading the rules for this language.

| Usage | Notation | Description |
|-------|----------|-------------|
| Language Terminals | Any UPPERCASE text | Terminals are the lexical foundation of the language. They are symbols that cannot be changed using the rules of the grammar. The final output after applying all grammar rules will be a string of terminals. The terminals in this language are further defined below |
| Language Nonterminals | Any bolded text | Nonterminals, or syntactic variables, are symbols that can be replaced using the rules of the grammar |
| Optional Repetition | { . . . } | Indicates that the sequence of terminals and nonterminals inside may be repeated zero or more times |
| Optional Inclusion | [ . . . ] | Indicates that the sequence of terminals and nonterminals inside may be used zero times or exactly once |
| Grouping | ( . . . ) | Indicates that the sequence of terminals and nonterminals inside is grouped together. Usually used together with the \| operator |
| Concatenation | , | Indicates concatenation of items |
| Alternation | \| | Indicates that either the item on the left or the item on the right is used |
| Definition | ::= | The non-terminal on the left of the definition operator is defined as the sequence of terminals and nonterminals on the right of the operator |
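
As a quick illustration of this notation (this rule is not part of the actual language, just an example), a signed whole number could be described as:

SignedNumber ::= [ ( + | - ) ] Digit { Digit }
Digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Here the square brackets make the sign optional, the parentheses group the two sign alternatives separated by |, and the curly braces allow additional digits to be repeated zero or more times.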

Defining the Programming Language in This Project

With this brief introduction in mind, let's take a look at the EBNF ruleset for the language implemented in this project.

EBNF Description of Grammar Rules

  1. Prog ::= PROGRAM IDENT ; DeclPart CompoundStmt
  2. DeclPart ::= VAR DeclStmt { ; DeclStmt }
  3. DeclStmt ::= IDENT {, IDENT } : Type [:= Expr]
  4. Type ::= INTEGER | REAL | BOOLEAN | STRING
  5. Stmt ::= SimpleStmt | StructuredStmt
  6. SimpleStmt ::= AssignStmt | WriteLnStmt | WriteStmt
  7. StructuredStmt ::= IfStmt | CompoundStmt
  8. CompoundStmt ::= BEGIN Stmt {; Stmt } END
  9. WriteLnStmt ::= WRITELN (ExprList)
  10. WriteStmt ::= WRITE (ExprList)
  11. IfStmt ::= IF Expr THEN Stmt [ ELSE Stmt ]
  12. AssignStmt ::= Var := Expr
  13. Var ::= IDENT
  14. ExprList ::= Expr { , Expr }
  15. Expr ::= LogOrExpr ::= LogAndExpr { OR LogAndExpr }
  16. LogAndExpr ::= RelExpr {AND RelExpr }
  17. RelExpr ::= SimpleExpr [ ( = | < | > ) SimpleExpr ]
  18. SimpleExpr ::= Term { ( + | - ) Term }
  19. Term ::= SFactor { ( * | / | DIV | MOD ) SFactor }
  20. SFactor ::= [( - | + | NOT )] Factor
  21. Factor ::= IDENT | ICONST | RCONST | SCONST | BCONST | (Expr)
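To see how these rules generate a concrete statement, take the assignment j := r + 1 from the example program at the top of this page. A slightly abbreviated derivation, with the rule numbers used at each step, is:

AssignStmt
→ Var := Expr                    (rule 12)
→ IDENT := Expr                  (rule 13)
→ IDENT := SimpleExpr            (rules 15-17, since no or, and, or relational operators appear)
→ IDENT := Term + Term           (rule 18)
→ IDENT := SFactor + SFactor     (rule 19)
→ IDENT := IDENT + ICONST        (rules 20 and 21)
→ j := r + 1

Every non-terminal is eventually rewritten into terminals, which is exactly what the mathematical definition of a context-free grammar describes.
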

With this description of our language in mind, let's explore the project's structure and function.

Project Structure/Design Philosophy

This project is made up of two main components that work in tandem to achieve the goal of running a file that the user gives us in our coding language. Additionally, there are several less important components. Those will be covered throughout the explanation, but have not been given their own individual sections.

The two main components are the tokenizer/lexical analyzer, in lex.cpp, and the parser/interpreter, in parserInterp.cpp. Let's explore each program individually.

Tokenizing/Lexical Analysis in lex.cpp

Tokenizing/lexical analysis refers to translating a program file that a user has written into a stream of tokens that we can use to run the program. lex.cpp is responsible for exactly this task.

This is accomplished through the main function of lex.cpp:

LexItem getNextToken(istream& in, int& linenum)

This function takes a reference to an istream object and a reference to an integer holding the current line number, and acts as a state machine driven by the characters it reads. For example, if it observes that the next character is a letter, it enters the INID (inside of identifier) state and processes the following characters as part of an identifier. It does the same for integer, real, string, and boolean constants. If at any point the lexical analyzer runs into a lexeme (word) that is not a recognized part of the language, it returns the ERR token. It is important to note that this program only tokenizes and analyzes the lexemes of the language; it does not check for syntactic or logical correctness. That is all handled by the parser/interpreter.
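
To make the state-machine idea concrete, here is a heavily simplified sketch of what such a loop might look like. The state names, the LexItem constructor, and the DONE return at end of file mirror the description in this write-up, but the details are assumptions; the real lex.cpp handles many more states and edge cases:

// Simplified sketch of a tokenizer state machine (not the full lex.cpp)
LexItem getNextToken(istream& in, int& linenum) {
    enum TokState { START, INID, ININT, INSTR, INCOMMENT } state = START;
    string lexeme;
    char ch;
    while (in.get(ch)) {
        switch (state) {
        case START:
            if (ch == '\n') linenum++;
            else if (isspace(ch)) { /* skip whitespace */ }
            else if (isalpha(ch)) { lexeme = ch; state = INID; }
            else if (isdigit(ch)) { lexeme = ch; state = ININT; }
            else if (ch == '{')   { state = INCOMMENT; }
            // ... operators, delimiters, string constants, etc. handled here ...
            break;
        case INID:
            if (isalnum(ch) || ch == '_' || ch == '$') lexeme += ch;
            else {
                in.putback(ch);                   // this character starts the next token
                return id_or_kw(lexeme, linenum); // identifier or reserved word?
            }
            break;
        // ... ININT, INSTR, INCOMMENT cases omitted ...
        }
    }
    return LexItem(DONE, "", linenum); // reached end of file
}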

There is a special function that is used when dealing with keywords/reserved words in our language. This function is:

LexItem id_or_kw(string& lexeme, int linenum)

This function is called by getNextToken because there is a need to differentiate between identifiers (variable names) and reserved words. For example, if the program observes the lexeme "writeln", it has to have a way of determining whether writeln is a variable name or a reserved word in our language. For this reason, when in the INID state, getNextToken calls id_or_kw(), which compares the lexeme against a map of all of our reserved words to determine whether it is an identifier or a keyword.
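
A minimal sketch of that lookup might look like the following; the exact contents of the keyword-to-token map and the LexItem constructor are assumptions about lex.cpp's internals:

// Simplified sketch of keyword-vs-identifier resolution
static const map<string, Token> reservedWords = {
    {"program", PROGRAM}, {"var", VAR}, {"begin", BEGIN}, {"end", END},
    {"if", IF}, {"else", ELSE}, {"writeln", WRITELN}, {"write", WRITE},
    {"integer", INTEGER}, {"real", REAL}, {"boolean", BOOLEAN}, {"string", STRING},
    {"and", AND}, {"or", OR}, {"not", NOT}
    // ... remaining reserved words ...
};

LexItem id_or_kw(string& lexeme, int linenum) {
    auto it = reservedWords.find(lexeme);
    if (it != reservedWords.end())
        return LexItem(it->second, lexeme, linenum); // reserved word
    return LexItem(IDENT, lexeme, linenum);          // ordinary identifier
}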

Tokens for our programming language

Once a token is returned by lex.cpp, that token is processed by parserInterp.cpp. We will discuss parserInterp.cpp soon, but before that, provided below is a comprehensive list of all of the tokens that can be generated by the tokenizer.

Identifiers and constants

Identifiers and constants follow certain rules, and will only be tokenized if those rules are obeyed. These rules and some examples are shown in the table below.

| Token | Description | Regular Expression Definition | Valid Examples | Invalid Examples |
|-------|-------------|-------------------------------|----------------|------------------|
| IDENT | An identifier is a user-defined program variable | IDENT := Letter {( Letter \| Digit \| _ \| $ )}<br>Letter := [a-z A-Z]<br>Digit := [0-9] | hello$, myVar, first_name | $hello, 1st_name, _name |
| SCONST | A string constant is a sequence of characters enclosed in single quotes | SCONST := 'Any character string' | 'Hello $5 #., s9 my name is', 'hello' | "This is an invalid string due to the double quotes" |
| ICONST | An integer constant is any valid integer | ICONST := [0-9]+ | 2, 200, 3444 | -68, 9.56 |
| RCONST | A real constant is any valid real number | RCONST := ([0-9]+)\.([0-9]*) | 9.01, 0.2, 2. | .2, 2.3.4 |
| BCONST | A boolean constant is either true or false | BCONST := (true \| false) | true, false | tRuE, FALSE |

Reserved Words

Furthermore, our language has certain reserved words. A reserved word is a word that may not be used as a variable name by the programmer, as the word has some syntactic function in the language. For the sake of brevity, the tokens that these reserved words are tokenized into are also included here.

| Reserved Word | Token |
|---------------|-------|
| and | AND |
| begin | BEGIN |
| boolean | BOOLEAN |
| div | DIV |
| end | END |
| else | ELSE |
| false | FALSE |
| if | IF |
| integer | INTEGER |
| mod | MOD |
| not | NOT |
| or | OR |
| program | PROGRAM |
| real | REAL |
| string | STRING |
| write | WRITE |
| writeln | WRITELN |
| var | VAR |
| true | TRUE |

Operator Symbols/Keywords

Our language also has various arithmetic/logical operators that can be used. The table below shows the operators and their functions. Just like with the reserved words, the tokens for each operator/keyword are included in this table.

| Operator Symbol/Keyword | Token | Description |
|-------------------------|-------|-------------|
| + | PLUS | Arithmetic addition or concatenation |
| - | MINUS | Arithmetic subtraction |
| * | MULT | Multiplication |
| / | DIV | Division |
| := | ASSOP | Assignment operator |
| = | EQ | Equality |
| < | LTHAN | Less-than operator |
| > | GTHAN | Greater-than operator |
| and | AND | Logical conjunction |
| or | OR | Logical disjunction |
| not | NOT | Logical complement |
| div | IDIV | Integer division |
| mod | MOD | Modulus operator |

Delimiters

Additionally, there are several delimiters that the language makes syntactic use of. These are shown in the table below.

| Delimiter | Token |
|-----------|-------|
| , | COMMA |
| ; | SEMICOL |
| : | COLON |
| . | DOT |
| ( | LPAREN |
| ) | RPAREN |

Miscellaneous Tokens

Finally, there are a few miscellaneous tokens/rules that you should be aware of:

  1. Anything enclosed in {} is ignored by the tokenizer as a comment
  2. Any word that is not a valid lexeme in the language is returned as an ERR token
  3. The DONE token is returned when the tokenizer reaches the end-of-file (EOF) character
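
Putting these token categories together, a line such as p := 2 * 3.14 * r from the example program would be tokenized into roughly the following stream (token names as defined in the tables above):

IDENT("p")  ASSOP  ICONST("2")  MULT  RCONST("3.14")  MULT  IDENT("r")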

With that, we have covered everything in lex.cpp. As stated before, once a token is generated by the lexical analyzer, it is passed on to the parser/interpreter in parserInterp.cpp, which is what we will explore next.

Syntax Analysis/Interpretation in parserInterp.cpp

parserInterp.cpp is responsible for performing syntactic analysis and execution of our program. Syntax analysis refers to checking that each statement is valid under our grammar rules, as well as catching certain other cases of invalid statements. To do this, we should first see a full description of the language.

Syntactic Description of the Language

  • There are four basic types: INTEGER, REAL, STRING and BOOLEAN
  • There are operator precedence and associativity rules (see the table below)
  • An if-statement evaluates a logical expression and executes its statement if the expression is true, and does not if it is false. An else-clause is optional, and is only evaluated if the logical expression is false
  • It is an error to use a variable before it has been defined
  • WRITELN and WRITE statements print out their comma-separated arguments from left to right
  • The ASSOP ( := ) operator in the AssignStmt assigns a value to a variable. It evaluates the Expr on the right-hand side and saves its value in the memory location associated with the left-hand side variable (Var). A left-hand side variable of a numeric type must be assigned a numeric value; a type conversion is automatically applied if the numeric type of the evaluated right-hand side does not match the numeric type of the left-hand side variable. A left-hand side variable of string or boolean type, on the other hand, must be assigned a value of that same type (a sketch of this conversion rule appears after this list)
  • Binary operations with numeric operators may only be performed on numeric operands
  • Binary operations with boolean operators may only be performed on boolean operands
  • Relational operators (=, <, >) operate only on two compatible types
  • Unary sign operators (+/-) operate only on numeric types, whereas the unary NOT operator operates only on booleans
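
Here is a rough sketch of the numeric conversion rule for assignments mentioned above. The helper name and the Value constructors and accessors it uses (GetType(), GetInt(), GetReal()) are hypothetical, not the confirmed interface of val.h:

// Hypothetical helper illustrating the numeric type-conversion rule for AssignStmt
Value coerceForAssignment(Token varType, const Value& rhs) {
    if (varType == INTEGER && rhs.GetType() == VREAL)
        return Value((int) rhs.GetReal());    // real value assigned to an integer variable
    if (varType == REAL && rhs.GetType() == VINT)
        return Value((double) rhs.GetInt());  // integer value assigned to a real variable
    return rhs;  // types already match; string and boolean must match exactly
}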

Operator Precedence and Associativity

Note

Operators with a lower precedence level are evaluated first

| Precedence Level | Operator | Description | Associativity |
|------------------|----------|-------------|---------------|
| 1 | unary +, -, not | Operates on a single numeric or boolean operand | Right-to-left |
| 2 | *, /, div, mod | Operate on two numeric operands | Left-to-right |
| 3 | +, - | Arithmetic addition and subtraction | Left-to-right |
| 4 | <, >, = | Relational comparison | No cascading |
| 5 | and | Logical conjunction | Left-to-right |
| 6 | or | Logical disjunction | Left-to-right |
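
For example, in the condition a > 0 and a < 100 from the sample program, the two relational comparisons (level 4) are evaluated first, and only then are their boolean results combined by and (level 5). Likewise, in p := 2 * 3.14 * r, the two multiplications (level 2) associate left-to-right, so the expression is evaluated as (2 * 3.14) * r.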

parserInterp.cpp is also designed to check for valid syntax according to these rules and to obey this operator precedence. It achieves the latter through its design as a recursive-descent parser.

Design/Functionality of parserInterp.cpp

The program parserInterp.cpp is designed as an implicit recursive-descent parse tree. Direct recursion is not part of its overall design; instead, recursion arises from each function calling the functions below it in the grammar. There is no explicit parse tree either, only an implicit one formed by how the methods call each other.

Recall our EBNF Ruleset from above:

  1. Prog ::= PROGRAM IDENT ; DeclPart CompoundStmt
  2. DeclPart ::= VAR DeclStmt { ; DeclStmt }
  3. DeclStmt ::= IDENT {, IDENT } : Type [:= Expr]
  4. Type ::= INTEGER | REAL | BOOLEAN | STRING
  5. Stmt ::= SimpleStmt | StructuredStmt
  6. SimpleStmt ::= AssignStmt | WriteLnStmt | WriteStmt
  7. StructuredStmt ::= IfStmt | CompoundStmt
  8. CompoundStmt ::= BEGIN Stmt {; Stmt } END
  9. WriteLnStmt ::= WRITELN (ExprList)
  10. WriteStmt ::= WRITE (ExprList)
  11. IfStmt ::= IF Expr THEN Stmt [ ELSE Stmt ]
  12. AssignStmt ::= Var := Expr
  13. Var ::= IDENT
  14. ExprList ::= Expr { , Expr }
  15. Expr ::= LogOrExpr ::= LogAndExpr { OR LogAndExpr }
  16. LogAndExpr ::= RelExpr {AND RelExpr }
  17. RelExpr ::= SimpleExpr [ ( = | < | > ) SimpleExpr ]
  18. SimpleExpr ::= Term { ( + | - ) Term }
  19. Term ::= SFactor { ( * | / | DIV | MOD ) SFactor }
  20. SFactor ::= [( - | + | NOT )] Factor
  21. Factor ::= IDENT | ICONST | RCONST | SCONST | BCONST | (Expr)

Every non-terminal in the grammar has its own method defined in parserInterp.cpp, as shown in these function signatures from the header file parserInterp.h:

extern bool Prog(istream& in, int& line);
extern bool DeclPart(istream& in, int& line);
extern bool DeclStmt(istream& in, int& line);
extern bool Stmt(istream& in, int& line);
extern bool StructuredStmt(istream& in, int& line);
extern bool CompoundStmt(istream& in, int& line);
extern bool SimpleStmt(istream& in, int& line);
extern bool WriteLnStmt(istream& in, int& line);
extern bool WriteStmt(istream& in, int& line);
extern bool IfStmt(istream& in, int& line);
extern bool AssignStmt(istream& in, int& line);
extern bool Var(istream& in, int& line, LexItem & idtok);
extern bool ExprList(istream& in, int& line);
extern bool Expr(istream& in, int& line, Value & retVal);
extern bool LogANDExpr(istream& in, int& line, Value & retVal);
extern bool RelExpr(istream& in, int& line, Value & retVal);
extern bool SimpleExpr(istream& in, int& line, Value & retVal);
extern bool Term(istream& in, int& line, Value & retVal);
extern bool SFactor(istream& in, int& line, Value & retVal);
extern bool Factor(istream& in, int& line, Value & retVal, int sign);

Through this, we achieve a recursive-descent parse tree.

Here is an example to make things clear: Stmt calls StructuredStmt, which calls CompoundStmt, which in turn calls Stmt again.

As you can see, Stmt does not directly call itself, but it triggers a series of other function calls that lead to it being called again. Through this indirect recursion, we generate a parse tree that analyzes the syntax of each component of the language and evaluates it accordingly.
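
As a concrete (and heavily simplified) sketch of what one of these methods might look like, consider CompoundStmt, the link in the recursion chain above. The Parser::GetNextToken helper used here, and the ability to compare a LexItem directly against a Token, are assumptions about the project's internals rather than its confirmed interface:

// Simplified sketch of rule 8: CompoundStmt ::= BEGIN Stmt {; Stmt } END
bool CompoundStmt(istream& in, int& line) {
    LexItem tok = Parser::GetNextToken(in, line);
    if (tok != BEGIN) return false;           // a compound statement must open with BEGIN

    if (!Stmt(in, line)) return false;        // at least one statement (indirect recursion)

    tok = Parser::GetNextToken(in, line);
    while (tok == SEMICOL) {                  // the {; Stmt } repetition
        if (!Stmt(in, line)) return false;
        tok = Parser::GetNextToken(in, line);
    }
    return tok == END;                        // and it must close with END
}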

Speaking of evaluation and values, there is a third component that we have not yet discussed: val.cpp. val.cpp and its header file val.h contain the definition of the Value class, which is used to store all of the values produced while running a program.

The definition of the Value class is as follows:

enum ValType { VINT, VREAL, VSTRING, VBOOL, VERR };

class Value {
    ValType T;       // type tag for this value
    bool    Btemp;   // storage used when T == VBOOL
    int     Itemp;   // storage used when T == VINT
    double  Rtemp;   // storage used when T == VREAL
    string  Stemp;   // storage used when T == VSTRING

    // ... constructors, accessors, and the overloaded operators listed below ...
};

Additionally, val.cpp contains overloaded operators so that we can do operations between two objects of the Value class. Their signatures are as follows:

    // numeric overloaded add this to op
    Value operator+(const Value& op) const;
    
    // numeric overloaded subtract op from this
    Value operator-(const Value& op) const;
    
    // numeric overloaded multiply this by op
    Value operator*(const Value& op) const;
    
    // numeric overloaded divide this by oper
    Value operator/(const Value& op) const;
    
    // numeric overloaded modulus of this by oper
    Value operator%(const Value& oper) const;
    
    //numeric integer division this by oper
    Value div(const Value& oper) const;
    
    
    //overloaded equality operator of this with op
    Value operator==(const Value& op) const;
    //overloaded greater than operator of this with op
    Value operator>(const Value& op) const;
    //overloaded less than operator of this with op
    Value operator<(const Value& op) const;

    //integer divide operator of this by op
    Value idiv(const Value& op) const;
	
    //Logic operations  
    Value operator&&(const Value& oper) const;//Overloaded Anding operator
    Value operator||(const Value& oper) const;//Overloaded Oring operator
    Value operator!() const;//Overloaded Not/complement operator

Every time we evaluate an expression and need to save the result, we wrap the result in an instance of the Value class, so that we can store both the type of the expression's result and the actual value in one convenient object. This also helps with type checking.
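
As a rough usage sketch (the exact constructor names here are assumptions about val.h), combining two numeric results might look like:

Value lhs(2);          // assumed integer constructor, producing a VINT value
Value rhs(3.14);       // assumed real constructor, producing a VREAL value
Value sum = lhs + rhs; // overloaded operator+ from val.cpp
// sum now holds a VREAL result; an illegal combination, such as adding a
// boolean to a number, would instead produce a Value of type VERR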

On the topic of storage, there are several global containers that are important to parserInterp.cpp's operation:

// defVar keeps track of all variables that have been defined in the program thus far
map<string, bool> defVar;
// SymTable keeps track of the type for all of our variables
map<string, Token> SymTable;
//Container of temporary locations of Value objects for results of expressions, variables values and constants 
//Key is a variable name, value is its Value. Holds all variables defined by assign statements
map<string, Value> TempsResults;
//declare a pointer variable to a queue of Value objects
queue <Value> * ValQue;

The comments are pretty detailed, but one thing worth mentioning is that ValQue is only used by the write and writeln statements; it is a queue of all the values to be printed to the console.
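
For example, a simplified sketch of how WriteLnStmt might use ValQue, omitting the parsing of the WRITELN keyword and the parentheses, and assuming that Value has an overloaded stream-insertion (<<) operator, looks like this:

// Simplified sketch of WriteLnStmt draining ValQue
bool WriteLnStmt(istream& in, int& line) {
    ValQue = new queue<Value>;        // fresh queue for this statement
    if (!ExprList(in, line)) {        // ExprList evaluates each Expr and pushes its Value
        delete ValQue;
        return false;
    }
    while (!ValQue->empty()) {        // print the values from left to right
        cout << ValQue->front();      // assumes operator<< is overloaded for Value
        ValQue->pop();
    }
    cout << endl;
    delete ValQue;
    return true;
}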

All of this comes together for the full functionality of our interpreter. The entry point to the program, and the top of our parse tree, is the Prog method. The driver program, prog3.cpp, only needs to call Prog and pass in a reference to an istream object to run the entire program contained in the file that the stream refers to. Once Prog is called, it begins the execution of the parse tree. Every method in the parse tree calls getNextToken from lex.cpp, checking for syntactic and lexical errors along the way, until the entire program file has been read through and executed, or until a syntactic or lexical error is reached, in which case the program exits.
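
A minimal driver along the lines of prog3.cpp might therefore be as short as the sketch below; the exact command-line handling and messages of the real driver are assumptions here:

#include <fstream>
#include <iostream>
#include "parserInterp.h"   // declares bool Prog(istream& in, int& line)
using namespace std;

int main(int argc, char* argv[]) {
    if (argc != 2) {
        cerr << "usage: prog3 <program file>" << endl;
        return 1;
    }
    ifstream in(argv[1]);   // the program written in our Pascal-like language
    if (!in) {
        cerr << "cannot open " << argv[1] << endl;
        return 1;
    }
    int line = 1;
    if (Prog(in, line))     // top of the parse tree: run the whole program
        cout << "Successful interpretation" << endl;
    else
        cout << "Unsuccessful interpretation" << endl;
    return 0;
}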

Final Summary

Now that you have a theoretical understanding of how these programs work, I would greatly encourage anyone interested to download the source code and try writing and running your own programs in this programming language. There are various example programs in the tests folder here on GitHub. Happy coding!
