scanlex - a simple lexical scanner.

The Problem of Input

Writing things out is easier than reading them in, because more can go wrong on input: the read may fail, the text may not be valid UTF-8, and a number may be malformed or simply out of range.
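
As a rough illustration of those failure modes, here is a minimal sketch that uses only the Rust standard library. It is not part of scanlex, and `read_u8` is a hypothetical helper name chosen for this example.

```rust
use std::num::IntErrorKind;

/// Hypothetical helper (not a scanlex function): decode bytes as text,
/// then parse the text as a u8, reporting which stage failed.
fn read_u8(bytes: &[u8]) -> Result<u8, String> {
    // Stage 1: the bytes may not be valid UTF-8.
    let text = std::str::from_utf8(bytes)
        .map_err(|e| format!("not UTF-8: {}", e))?
        .trim();
    // Stage 2: the text may be malformed, or well-formed but out of range.
    text.parse::<u8>().map_err(|e| match e.kind() {
        IntErrorKind::PosOverflow | IntErrorKind::NegOverflow => {
            format!("out of range: {:?}", text)
        }
        _ => format!("malformed number: {:?}", text),
    })
}

fn main() {
    let inputs: [&[u8]; 4] = [b"42", b"4x", b"300", &[0xFF]];
    for input in inputs {
        println!("{:?}", read_u8(input));
    }
}
```

Only the first input parses; each of the other three fails at a different stage, which is the kind of bookkeeping a scanner takes off your hands.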

Lexical Scanners

Lexical scanners split a stream of characters into tokens. Tokens are returned by repeatedly calling the get method of Scanner (which returns Token::End when no tokens are left) or by iterating over the scanner. They represent numbers, characters, identifiers, or single- or double-quoted strings. Token::Error indicates a badly formed token.

This lexical scanner makes some assumptions; for example, a number may not be directly followed by a letter. This version makes no attempt to decode C-style escape codes in strings. All whitespace is ignored. It is intended for processing generic structured data rather than code.

For example, the string "hello 'dolly' * 42" will be broken into four tokens:

  • an identifier 'hello'
  • a quoted string 'dolly'
  • a character '*'
  • and a number 42

extern crate scanlex;
use scanlex::{Scanner,Token};

let mut scan = Scanner::new("hello 'dolly' * 42");
assert_eq!(scan.get(),Token::Iden("hello".into()));
assert_eq!(scan.get(),Token::Str("dolly".into()));
assert_eq!(scan.get(),Token::Char('*'));
assert_eq!(scan.get(),Token::Int(42));
assert_eq!(scan.get(),Token::End);
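
To make the splitting step concrete, the following is a simplified, std-only sketch of how a scanner of this kind might classify characters. It is not scanlex's implementation: `Tok` and `tokenize` are hypothetical names, the sketch handles integers only, and it does not enforce the rule that a number may not be directly followed by a letter.

```rust
/// Hypothetical token type for this sketch (scanlex's own type is `Token`).
#[derive(Debug, PartialEq)]
enum Tok {
    Iden(String),
    Str(String),
    Int(i64),
    Char(char),
    Error(String),
}

/// Split `text` into tokens by looking at the first character of each token.
fn tokenize(text: &str) -> Vec<Tok> {
    let mut out = Vec::new();
    let mut chars = text.chars().peekable();
    while let Some(&c) = chars.peek() {
        if c.is_whitespace() {
            chars.next(); // whitespace only separates tokens
        } else if c.is_alphabetic() || c == '_' {
            // identifier: letters, digits and underscores
            let mut s = String::new();
            while let Some(&d) = chars.peek() {
                if d.is_alphanumeric() || d == '_' {
                    s.push(d);
                    chars.next();
                } else {
                    break;
                }
            }
            out.push(Tok::Iden(s));
        } else if c.is_ascii_digit() {
            // integer: a run of digits; overflow becomes an error token
            let mut s = String::new();
            while let Some(&d) = chars.peek() {
                if d.is_ascii_digit() {
                    s.push(d);
                    chars.next();
                } else {
                    break;
                }
            }
            out.push(match s.parse::<i64>() {
                Ok(n) => Tok::Int(n),
                Err(e) => Tok::Error(e.to_string()),
            });
        } else if c == '\'' || c == '"' {
            // quoted string: everything up to the matching quote, no escapes
            chars.next();
            let mut s = String::new();
            let mut closed = false;
            for d in chars.by_ref() {
                if d == c {
                    closed = true;
                    break;
                }
                s.push(d);
            }
            out.push(if closed {
                Tok::Str(s)
            } else {
                Tok::Error("unterminated string".into())
            });
        } else {
            // anything else is a single-character token
            chars.next();
            out.push(Tok::Char(c));
        }
    }
    out
}

fn main() {
    println!("{:?}", tokenize("hello 'dolly' * 42"));
}
```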

To extract the values, use code like this:

let greeting = scan.get_iden()?;
let person = scan.get_string()?;
let op = scan.get_char()?;
let answer = scan.get_integer()?; // i64

Scanner implements Iterator. If you just want to extract the words from a string, filtering with as_iden will do the trick, since it returns Option<String>.

let s = Scanner::new("bonzo 42 dog (cat)");
let v: Vec<_> = s.filter_map(|t| t.as_iden()).collect();
assert_eq!(v,&["bonzo","dog","cat"]);

The same strategy with as_number extracts all the numbers from a document, ignoring all other structure. The scan.rs example shows the tokens generated for a string given on the command line.

This iterator stops only at Token::End; it does not stop at Token::Error, which you can handle yourself.

Usually it's important not to ignore structure. Say we have input strings that look like this "(WORD) = NUMBER":

	scan.skip_chars("(")?;
	let word = scan.get_iden()?;
	scan.skip_chars(")=")?;
	let num = scan.get_number()?;

Any of these calls may fail!
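
For comparison, here is roughly the same extraction written with only standard-library string methods. It is a sketch under stated assumptions, not a replacement for the scanner calls above: `parse_assignment` is a hypothetical helper, and its identifier check (letters, digits, underscores) is an assumption made for this example. Every step can fail, so every step is followed by `?` or an explicit check.

```rust
/// Hypothetical helper: parse a line shaped like "(WORD) = NUMBER".
fn parse_assignment(line: &str) -> Result<(String, f64), String> {
    // skip the opening parenthesis
    let rest = line.trim_start().strip_prefix('(').ok_or("expected '('")?;
    // everything up to ')' is the word
    let (word, rest) = rest.split_once(')').ok_or("expected ')'")?;
    let word = word.trim();
    if word.is_empty() || !word.chars().all(|c| c.is_alphanumeric() || c == '_') {
        return Err(format!("bad identifier {:?}", word));
    }
    // skip the equals sign, then parse what is left as a number
    let rest = rest.trim_start().strip_prefix('=').ok_or("expected '='")?;
    let num = rest.trim().parse::<f64>().map_err(|e| e.to_string())?;
    Ok((word.to_string(), num))
}

fn main() {
    println!("{:?}", parse_assignment("(width) = 42"));
    println!("{:?}", parse_assignment("width = 42"));
}
```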

It is a common pattern to create a scanner for each line of text read from a readable source. The scanline.rs example shows how to use ScanLines to accomplish this.

    let f = File::open("scanline.rs").expect("cannot open scanline.rs");
    let mut iter = ScanLines::new(&f);
    while let Some(s) = iter.next() {
        let mut s = s.expect("cannot read line");
        // show the first token of each line
        println!("{:?}",s.get());
    }
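
The per-line pattern itself can be sketched without scanlex. The example below assumes nothing about how ScanLines is implemented; it only shows the shape of the loop using `BufRead::lines` from the standard library, with `first_words` as a hypothetical helper that stands in for "show the first token of each line".

```rust
use std::io::{BufRead, BufReader, Read};

/// Hypothetical helper: the first whitespace-separated word of each line,
/// or None for a blank line.
fn first_words<R: Read>(source: R) -> std::io::Result<Vec<Option<String>>> {
    let mut out = Vec::new();
    for line in BufReader::new(source).lines() {
        // a failed read or invalid UTF-8 surfaces here, once per line
        let line = line?;
        out.push(line.split_whitespace().next().map(str::to_string));
    }
    Ok(out)
}

fn main() {
    // a byte slice implements Read, so it can stand in for a file
    let text = "alpha 1\n\nbeta 2\n";
    println!("{:?}", first_words(text.as_bytes()));
}
```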

A more serious example (taken from the tests) is parsing JSON:

use std::collections::HashMap;
use scanlex::{Scanner,Token,ScanError};

type JsonArray = Vec<Box<Value>>;
type JsonObject = HashMap<String,Box<Value>>;

#[derive(Debug, Clone, PartialEq)]
pub enum Value {
   Str(String),
   Num(f64),
   Bool(bool),
   Arr(JsonArray),
   Obj(JsonObject),
   Null
}

fn scan_json(scan: &mut Scanner) -> Result<Value,ScanError> {
    use Value::*;
    match scan.get() {
        Token::Str(s) => Ok(Str(s)),
        Token::Num(x) => Ok(Num(x)),
        Token::Int(n) => Ok(Num(n as f64)),
        Token::End => Err(scan.scan_error("unexpected end of input",None)),
        Token::Error(e) => Err(e),
        Token::Iden(s) =>
            if s == "null"    {Ok(Null)}
            else if s == "true" {Ok(Bool(true))}
            else if s == "false" {Ok(Bool(false))}
            else {Err(scan.scan_error(&format!("unknown identifier '{}'",s),None))},
        Token::Char(c) =>
            if c == '[' {
                let mut ja = Vec::new();
                let mut ch = c;
                while ch != ']' {
                    let o = scan_json(scan)?;
                    ch = scan.get_ch_matching(&[',',']'])?;
                    ja.push(Box::new(o));
                }
                Ok(Arr(ja))
            } else
            if c == '{' {
                let mut jo = HashMap::new();
                let mut ch = c;
                while ch != '}' {
                    let key = scan.get_string()?;
                    scan.get_ch_matching(&[':'])?;
                    let o = scan_json(scan)?;
                    ch = scan.get_ch_matching(&[',','}'])?;
                    jo.insert(key,Box::new(o));
                }
                Ok(Obj(jo))
            } else {
                Err(scan.scan_error(&format!("bad char '{}'",c),None))
            }
    }
}

(This is of course an Illustrative Example. JSON is a solved problem.)

Options

With no_float you get a barebones parser that does not recognize floats, just integers, strings, chars and identifiers. This is useful if the default rules are too strict: for example, "2d" is fine in no_float mode but an error in the default mode. chrono-english uses this mode to parse date expressions.

With line_comment you provide a character; after this character, the rest of the current line will be ignored.

Contributors

stevedonovan
