Code Monkey home page Code Monkey logo

recursive_descent_parser's Introduction

Recursive Descent Parsing

Purpose

  The goal of this work was to learn more about how compilers work and parse input from a programming language to an internal representation and machine code. Within this repository are two implementations for a recursive descent parser: one written in Python and one written in C. Both define a simple programming language based on arithmetic expressions with a scanner separate from the parser.

Background

  Language implementation systems must analyze source code, regardless of the specific implementation approach. Nearly all syntax analysis is based on a formal description of the syntax of the source language (Backus-Naur form, or BNF). The syntax analysis portion of a language processor nearly always consists of two parts:

  • A low-level part called a lexical analyzer (mathematically, a finite automaton based on a regular grammar)
  • A high-level part called a syntax analyzer, or parser (mathematically, a push-down automaton based on a context-free grammar, or BNF)

  Lexical and syntax analysis are typically separated because of simplicity (less complex approaches can be used for lexical analysis; separating them simplifies the parser), efficiency (separation allows optimization of the lexical analyzer) and portability (parts of the lexical analyzer may not be portable, but the parser is always portable).

  A lexixal analyzer is a pattern matcher for character strings (i.e. a front-end for the parser). It identifies substrings of the source program that belong together - lexemes. Lexemes match a character pattern, which is associated with a lexical category called a token. The lexical analyzer is usually a function that is called by the parser when it needs the next token.

  The goals of the parser, given an input program, are to find all syntax errors (for each, produce an appropriate diagnostic message and recover quicky) and produce the parse tree, or at least a trace of the parse tree, for the program. There are two categories of parsers:

  1. Top-down: produces the parse tree beginning at the root
  • Order is that of a leftmost derivation
  • Traces or builds the parse tree in preorder
  1. Bottom-up: produces the parse tree beginning at the leaves
  • Order is that of the reverse of a rightmost derivation
  • Useful parsers look only one token ahead in the input

The work in this repository focuses on top-down parsers, where given a sentential form, xAα, the parser must choose the correct A-rule to get the next sentential form in the leftmost derivation, using only the first token produced by A. The most common top-down parsing algorithms are recursive-descent (a coded implementation, used in this repo) and LL parsers (table-driven implementations).

  For recursive-descent parsers, there is a subprogram for each nonterminal in the grammar which can parse sentences that can be generated by that nonterminal. Extended Backus-Naur form (EBNF) is ideally suited for being the basis of a recursive-descent parser, because EBNF minimizes the number of nonterminals. The following shows an example of a grammar for simple expressions:

<expr> → <term> {(+ | -) <term>}
<term> → <factor> {(* | /) <factor>}
<factor> → id | int_constant | ( <expr> )

  When there is only one right-hand side (RHS):

  • For each terminal symbol in the RHS, compare it with the next input token; if they match, continue, else there is an error
  • For each nonterminal symbol in the RHS, call its associated parsing subprogram

  A nonterminal that has more than one RHS requires an initial process to determine which RHS it is to parse:

  • The correct RHS is chosen on the basis of the next token of input (the lookahead)
  • The next token is compared with the first token that can be generated by each RHS until a match is found
  • If no match is found, it is a syntax error

  The Left Recursion Problem: If a grammar has left recursion, either direct or indirect, it cannot be the basis for a top-down parser. A grammar can be modified to remove direct left recursion as follows:
  For each non-terminal, A,

  1. Group the A-rules as
A → Aα1 | … | Aαm | β1 | β2 | … | βn

  where none of the β‘s begins with A 2. Replace the original A-rules with

A → β1A’ | β2A’ | … | βnA’ 
A’ → α1A’ | α2A’ | … | αmA’ | ε

  The other characteristic of grammars that disallows top-down parsing is the lack of pairwise disjointness, the inability to determine the correct RHS on the basis of one token of lookahead.

Def: FIRST(α) = {a | α =>* aβ }
(If α =>* ε, ε is in FIRST(α)

Pairwise Disjointness Test:
  For each nonterminal, A, in the grammar that has more than one RHS, for each pair of rules, A → αi and A → αj, it must be true that FIRST(αi) ⋂ FIRST(αj) = Φ. For example:
A → a | bB | cAb
A → a | aB
  Left factoring can resolve pairwise disjointness! Replace
<variable> → identifier | identifier [<expression>]
with

<variable> → identifier <new>
<new> → ε | [<expression>]

or
<variable> → identifier [[<expression>]]
(the outer brackets are metasymbols of EBNF)

Author

Michael Balas

License

GNU General Public License

recursive_descent_parser's People

Contributors

michaelbalas avatar

Stargazers

 avatar

Watchers

 avatar

Forkers

jahgath buke2016

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.