8086-emulator's Introduction

8086 Emulator

This is an Intel 8086 emulator / VM. It can run most of the 8086 instruction set and provides an interactive interpreter for running a program line by line. This repository contains the core library (the preprocessor, data parser and interpreter) as well as a command line driver, which provides a command line interface for running programs. For the syntax, see syntax.md.

This has also been compiled to WASM and is available as a web version: https://github.com/YJDoc2/8086-emulator-web

Note

This is an Intel 8086 emulator, providing a way to run programs written for the 8086 assembly instruction set. Internally it stores data in an emulated "memory" of 1 MB, but the code is not compiled to binary or stored in that memory. Assembly statements are executed by an interpreter, which operates on the memory and architecture (registers, flags etc.) to emulate execution of the program.

As the code is not stored in a 'true' memory, this does not allow jumps to memory positions and does not support ISRs, since an ISR requires the code to be stored in memory as well.

This also does not emulate external devices such as storage or co-processors, but it allows almost all instructions that the 8086 supports.

Most of the assembly syntax is the same as Intel assembly syntax, with a few minor changes, which are documented under the respective instructions in syntax.md.

Installation

As this is not on crates.io, installing is done using this repository itself.

cargo install --git https://github.com/YJDoc2/8086-Emulator.git

This should install the binary under the name 'emulator_8086' in a folder that is on your PATH, so you should be able to run it directly.

To Use As Dependency

As this is not on crates.io, dependency is specified using this repository itself.

[dependencies]
emulator_8086 = { git = "https://github.com/YJDoc2/8086-Emulator" }

This lets you import and use the core library in your project:

use emulator_8086_lib;

If you don't want to use the long name, you can either rename the import:

use emulator_8086_lib as lib;

or depend on it under a different name:

...
[dependencies]
a_different_name = { git = "https://github.com/YJDoc2/8086-Emulator", package="emulator_8086"}
...

Commandline usage

USAGE:
emulator_8086 [FLAGS] [file_path]

FLAGS:
-h, --help Prints help information
-i, --interpreted To run in interpreted mode
-V, --version Prints version information

ARGS:
<file_path> Input assembly file path
  • file_path is a required argument: the assembly file to run.
  • The interpreted flag (-i) enables a user prompt before the execution of every instruction. The prompt currently allows printing flags, registers and memory, and can be used for debugging.
  • Note: if you don't want to check after every instruction, but only before/after a particular one, use int 3 instead; the syntax is explained in syntax.md.

The user prompt supports the following commands:

  • n/next : continues the execution of instructions.
  • q/quit : exits the program.
  • print statements : print flags, registers and memory locations. The syntax is the same as for print statements in the assembly file, explained in syntax.md.

Another way to get the user prompt is to set the trap flag, in which case the prompt is displayed before the execution of each instruction for as long as the trap flag is set.

Core Library

File structure

The complete project has the following file structure:

.
├── examples              ->  examples of 8086 assembly programs
├── src                   ->  the code
    ├── driver            ->  code for the commandline driver
    ├── lib               ->  code of core library
    |   ├── data_parser   ->  code and test for the data parser, which
    |                         interprets data commands and fill data in memory
    |   ├── instructions  -> contains functions which are used to run some opcode instructions,
    |                         which are not directly coded in the interpreter
    |   ├── interpreter   -> code and tests for the interpreter
    |   ├── preprocessor  -> code and test for the preprocessor
    |   ├── util          -> utility and helper functions / structures
    |   ├── arch.rs       -> definition of 8086 architecture struct
    |   ├── vm.rs         -> definition of vm struct
    |   └── lib.rs        -> lib file which re-exports various structs and function
    └── bin.rs            -> main file for the binary
├── build.rs              -> the build code required to generate parsers from lalrpop files
├── Cargo.toml            -> Cargo TOML file
├── README.md             -> This file
├── LICENSE-APACHE        -> Licence file
├── LICENCE-MIT           -> Licence file
├── syntax.md             -> file containing syntax for the assembler
├── flowcharts.md         -> Markdown file containing flowcharts for various parts of emulator
└── .gitignore            -> gitignore file for the repository

Driver

The driver is the program which uses the core library to run the VM and interpret the code, hence the name 'driver'. As the core library contains the functionality related to preprocessing/syntax checking, feeding data into the VM, and running the code via the interpreter, it is purposefully kept as platform independent as possible. The driver takes care of reading the input (from a file or similar), removing comments from it, and feeding it to the preprocessor (which converts everything to uniform lower case, does syntax checking etc.). It then takes the preprocessor's output, runs the data parser to store the data of DB/DW commands into the VM memory, runs the interpreter, and provides the user prompt. The driver is kept separate this way so that the core library can also be used without any changes in the web version.
The print parser in the driver interprets the print commands and displays the output. This can be used to interactively check the state of the VM, as well as for debugging purposes.

Interpreting Flow

The flow of the complete process is :

  • The driver takes the program as input, removes comments from it, and feeds it to the preprocessor.

  • The preprocessor checks the program for correct syntax, separates the code and data instructions, maps labels and functions to the appropriate output instructions, and returns:

    • A list containing the code instructions and another list containing the data instructions. It also makes them uniform by converting to lower case, and converts numbers from supported formats such as hexadecimal and binary to decimal.
    • Maps from output code list elements to input code positions and from code labels to output code elements, plus the stored offsets of data elements.
  • The driver checks for errors, for the start label, and whether all labels that are used before declaration (in jump commands) are actually declared. It then gives the data list from the preprocessor output to the data parser, which interprets the data commands and stores the data into the VM's memory.

  • The driver checks for errors again and runs an unconditional loop, in which it gives the code instructions to the interpreter, checks its return state, and gives the next input accordingly. If it is run in interpreted mode, encounters int 3, or the trap flag is set, it also provides the user prompt and interprets and runs print commands. The driver is also responsible for handling DOS and BIOS interrupts.
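As an illustration of the first step in this flow, here is a minimal sketch of comment stripping for a single line. The function name `strip_comment` is hypothetical and is not taken from this repository, and the handling of `;` inside double-quoted strings is an assumption about what a driver would want, not a description of what the actual driver does.

```rust
/// Hypothetical helper (not the repository's API): returns the part of an
/// assembly line before the first `;` that is not inside a double-quoted
/// string, with trailing whitespace removed.
pub fn strip_comment(line: &str) -> &str {
    let mut in_string = false;
    for (i, c) in line.char_indices() {
        match c {
            // Toggle string state so a `;` inside DB "..." text is kept.
            '"' => in_string = !in_string,
            // First `;` outside a string starts the comment.
            ';' if !in_string => return line[..i].trim_end(),
            _ => {}
        }
    }
    line.trim_end()
}
```

For example, `strip_comment("MOV AX, 20 ; load")` yields `MOV AX, 20`, while a `;` inside a quoted DB string is left alone.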

Flowcharts for various parts of emulator are in the flowcharts.md file.


License

Licensed under either of

  • Apache License, Version 2.0 (LICENSE-APACHE)
  • MIT license (LICENCE-MIT)

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

8086-emulator's Issues

Overhaul the lexer parse and interpreter architecture to a better scheme

Currently the overall architecture is acceptable, but pretty hideous. The choices were made for various (not necessarily correct) reasons at the time this was written, but it now needs fixing with some better choices.

  • There is one lalrpop file which generates the "initial" parser, whose job is to take the raw text input, make sure the syntax is correct, and generate the label maps and the lists of data and code instructions (as strings).
  • Next is the interpreter, which again takes the instructions as strings, parses them and runs them.
  • Finally there is a print parser, dedicated to parsing and running print instructions.

This is horrible for many reasons:

  • Three parsers means it takes a lot of time to generate each from its lalrpop file, which is pretty irritating during development.
  • The print parser does not need to exist; its syntax is simple enough to lex and parse manually.
  • The initial parser uses the default lexer from lalrpop, which causes two problems:
    • It does not report \n, so we have to go through the input before everything else to build the newline mapping.
    • Because of the way we define tokens using regexps, there are conflicts when we define two tokens which overlap, e.g. see issue #5 (comment). There, the regexp for db text gobbles up everything up to the last quote it can find, so it includes the next 'db' and its text in the string. If we try to fix that regexp, it collides with the string regexp. We cannot stop the db string at EOL, as lalrpop does not give access to \n.
  • We really don't need to parse the instructions from text again for interpreting. We can just use an enum to represent instructions, store the related params in it, and match on it, which will be a much better scheme overall.

Currently the two strategies are:

  • Make a custom lexer which will be used with lalrpop as the parser to do the initial parsing. The custom lexer will take care of newline mapping, as well as handling capital/small letters.
  • Make a custom lexer + (recursive descent?) parser, and remove the lalrpop dependency completely.

Even though the second option is desirable, it is also trickier, so first shifting to a custom lexer and then to a custom parser separately might be a better way.

Either way, we should make the initial parser generate an enum instead of text, remove the "interpreter parser", and remove the print parser as well.
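A minimal sketch of what the enum-based scheme could look like. All names here (`Reg16`, `Instr`, `Regs`, `step`, `run`) are hypothetical and only a handful of instructions are modelled; flags, memory operands and jumps are deliberately left out, so this shows the shape of the idea rather than a proposed design.

```rust
/// Hypothetical subset of 16-bit registers.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Reg16 { Ax, Bx, Cx, Dx }

/// Hypothetical instruction enum the initial parser could emit instead of text.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Instr {
    MovImm(Reg16, u16),
    AddImm(Reg16, u16),
    Inc(Reg16),
    Hlt,
}

#[derive(Debug, Default)]
pub struct Regs { pub ax: u16, pub bx: u16, pub cx: u16, pub dx: u16 }

impl Regs {
    fn get_mut(&mut self, r: Reg16) -> &mut u16 {
        match r {
            Reg16::Ax => &mut self.ax,
            Reg16::Bx => &mut self.bx,
            Reg16::Cx => &mut self.cx,
            Reg16::Dx => &mut self.dx,
        }
    }
}

/// Executes one instruction by matching on the enum (no re-parsing of text).
/// Returns false when execution should stop. Flags are not updated here.
pub fn step(regs: &mut Regs, instr: Instr) -> bool {
    match instr {
        Instr::MovImm(r, v) => *regs.get_mut(r) = v,
        Instr::AddImm(r, v) => {
            let t = regs.get_mut(r);
            *t = t.wrapping_add(v);
        }
        Instr::Inc(r) => {
            let t = regs.get_mut(r);
            *t = t.wrapping_add(1);
        }
        Instr::Hlt => return false,
    }
    true
}

/// Runs a straight-line program until the end or the first Hlt.
pub fn run(program: &[Instr]) -> Regs {
    let mut regs = Regs::default();
    for &instr in program {
        if !step(&mut regs, instr) {
            break;
        }
    }
    regs
}
```

With this shape the interpreter becomes one `match`, and adding an instruction means adding a variant and an arm, which the compiler checks for exhaustiveness.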

Tracking:

  • Replace the print parser with a custom lexer + parser
  • Add a lexer for "normal" asm, i.e. the main lexer (possibly integrate with the print parser somehow?)
  • Integrate this lexer's tokens into lalrpop using the custom token support which lalrpop provides, so that at least the issue mentioned above can be mitigated in the short term
  • Define an enum for asm opcodes, so the original / lalrpop parser can (eventually) emit this instead of text
  • Port the "initial" parser from emitting text instructions to emitting the enum defined above, and simultaneously port the "interpreter" from lalrpop to a giant match statement on the enum values
  • Add a custom (recursive descent) parser for the "initial" parser, so that the lalrpop dependency is completely removed. This is still up for discussion; we need to see if it will actually provide any benefit, since with custom tokens the lalrpop parser file will be much simpler anyway.

Just noticed that the 8086 manual also includes hex codes for instructions. If we can use them, we can actually store instructions in memory and remove that barrier.

JL, JLE, JG, JGE, ... don't work properly

start:
;JG, JGE Bug:
MOV AX, 20
MOV BX, -10
CMP AX, BX
JG AX_GREATER_THAN_BX ;JG, JGE is for signed numbers
MOV AX, 0 ;But no jump made
AX_GREATER_THAN_BX: ;Must jump here
MOV AX, 0

;JL, JLE Bug
MOV AX, 20
MOV BX, -10
CMP AX, BX
JL AX_LESSER_THAN_BX ;JL, JLE is for signed numbers
MOV AX, 0 ;Must not jump
AX_LESSER_THAN_BX: ;Jumps here
MOV AX, 0
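For reference, the signed jump conditions on the 8086 are defined in terms of ZF, SF and OF after the compare. The sketch below is not the emulator's code; `CmpFlags`, `cmp16` and the `jg`/`jge`/`jl`/`jle` helpers are hypothetical names used only to show the expected outcome for the program above.

```rust
/// Flags relevant to signed comparisons after `CMP a, b` (computes a - b).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct CmpFlags { pub zf: bool, pub sf: bool, pub of: bool }

/// Hypothetical 16-bit compare that only derives ZF, SF and OF.
pub fn cmp16(a: u16, b: u16) -> CmpFlags {
    let res = a.wrapping_sub(b);
    // Signed overflow on subtraction: the operands have different signs
    // and the sign of the result differs from the sign of `a`.
    let of = ((a ^ b) & (a ^ res) & 0x8000) != 0;
    CmpFlags { zf: res == 0, sf: (res & 0x8000) != 0, of }
}

// Standard 8086 conditions for the signed jumps.
pub fn jg(f: CmpFlags) -> bool { !f.zf && f.sf == f.of }
pub fn jge(f: CmpFlags) -> bool { f.sf == f.of }
pub fn jl(f: CmpFlags) -> bool { f.sf != f.of }
pub fn jle(f: CmpFlags) -> bool { f.zf || f.sf != f.of }
```

With AX = 20 and BX = -10 (0xFFF6), the compare leaves ZF = 0, SF = 0, OF = 0, so JG and JGE should jump and JL and JLE should not.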

IMUL does not sign extend correctly

From an email received:

start:
MOV AL, 0xF9 ; -7
MOV BL, 0x02 ; 2
IMUL BL

This gives the result 01F2, which is the result of unsigned multiplication (MUL).
It is expected to give FFF2 (-14).

This is because of a bug in the imul implementation, which does the sign extension + expansion incorrectly.
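A minimal sketch of the intended arithmetic, assuming the byte form (AX = AL * operand). The function names `imul8` and `mul8` are hypothetical; the point is only that each operand must be sign-extended from 8 to 16 bits before multiplying.

```rust
/// Signed 8-bit multiply: sign-extend both operands to 16 bits first.
/// The product of two i8 values always fits in an i16.
pub fn imul8(al: u8, src: u8) -> u16 {
    let a = al as i8 as i16; // reinterpret as signed, then widen
    let b = src as i8 as i16;
    (a * b) as u16
}

/// Unsigned 8-bit multiply, shown for contrast: operands are zero-extended.
pub fn mul8(al: u8, src: u8) -> u16 {
    (al as u16) * (src as u16)
}
```

Here `imul8(0xF9, 0x02)` gives 0xFFF2, while `mul8(0xF9, 0x02)` gives 0x01F2, the value reported in the mail.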

next doesn't print some instructions correctly

Normally next will print the instruction about to be executed, but for some instructions (maybe the "singleton data" family?) it prints the following one instead. (It does seem to execute the correct one.)

Given the following input file:

start:  pushf
        add bx, 1
        pushf
        add bx, 2
        pushf
        add bx, 3
        popf
        add bx, 4
        lahf
        add bx, 5
        sahf
        add bx, 6
        xlat
        add bx, 7

Running with -i and entering n every time (e.g. by piping in yes n), the output is:

About to execute line 2 :         add bx, 1
>>> About to execute line 2 :         add bx, 1
>>> About to execute line 4 :         add bx, 2
>>> About to execute line 4 :         add bx, 2
>>> About to execute line 6 :         add bx, 3
>>> About to execute line 6 :         add bx, 3
>>> About to execute line 8 :         add bx, 4
>>> About to execute line 8 :         add bx, 4
>>> About to execute line 10 :         add bx, 5
>>> About to execute line 10 :         add bx, 5
>>> About to execute line 12 :         add bx, 6
>>> About to execute line 12 :         add bx, 6
>>> About to execute line 14 :         add bx, 7
>>> About to execute line 14 :         add bx, 7
>>> 

LEA instruction not working

I have this string declared like the helloworld example:
hello: DB "Hello World" ; store string

Now I have just one instruction:
LEA DX, hello

It should load the address of the hello string into DX, but it doesn't compile.

Errors:
Syntax Error at line 5:16 : LEA DX, hello : Unexpected Token : hello

Thank you!

PS:
Could you please implement service 0x09 (MOV AH, 0x09) with INT 0x21? This way I could easily print strings like with EMU8086.

CALL / RET ... is that implemented ?

Hi,

I am trying to compile perfectly valid code with a procedure call, but I get this weird message:

Syntax Error at 27:0 : call f1 : 'call' can be only used with procedures, f1 is not a procedure

I wonder how the emulator determines that 'f1' is not a procedure, when it is a valid label in my code:

f1: 
push bp
mov  bp, sp
mov ax, ax
pop bp
ret

Moreover, I note that in the web app bp is not syntax-highlighted, which is strange and makes me think something is missing in the implementation.

Any clarifications are highly appreciated.

blind student

Hi, can you help make the software's web interface usable for my blind student, so that they can program for the 8086?

Miguel

DIV/IDIV should trap on all overflow, not just divide by zero

If IDIV would produce a quotient of 0x8000 (word) or 0x80 (byte), the real 8086 traps (INT 0, divide overflow). 8086-Emulator yields 0x8000 and continues. (Note this behavior changed between 8086 and 80386, not sure what the 80186/80286 did.)
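A sketch of the word form with the overflow check described above. The names `DivFault` and `idiv16` are hypothetical, and the accepted quotient range (-0x7FFF to 0x7FFF) is taken from this issue's description of the original 8086 rather than from the emulator's code.

```rust
/// Hypothetical fault type standing in for INT 0 (divide error).
#[derive(Debug, PartialEq)]
pub enum DivFault {
    DivideError,
}

/// Signed divide of DX:AX by a 16-bit divisor.
/// Returns (quotient for AX, remainder for DX) or a divide error.
pub fn idiv16(dx: u16, ax: u16, divisor: u16) -> Result<(u16, u16), DivFault> {
    // Work in i64 so that no intermediate step can overflow.
    let dividend = ((((dx as u32) << 16) | (ax as u32)) as i32) as i64;
    let d = (divisor as i16) as i64;
    if d == 0 {
        return Err(DivFault::DivideError);
    }
    let q = dividend / d; // truncates toward zero, like IDIV
    let r = dividend % d; // remainder takes the sign of the dividend
    // Assumption from the issue: on the 8086 a quotient of 0x8000 also traps,
    // so only -0x7FFF..=0x7FFF is accepted.
    if q < -0x7FFF || q > 0x7FFF {
        return Err(DivFault::DivideError);
    }
    Ok((q as u16, r as u16))
}
```

The byte form would follow the same pattern with AX as the dividend and a quotient range of -0x7F to 0x7F.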

Memory offset wraparound is not correct

With ds = 0, mov word [0xffff], 0xbeef should write bytes 0x0ffff and 0x00000: the offset should wrap around but not propagate its carry. Instead it writes 0x0ffff and 0x10000.

To fix this, I think one needs to change inc_addr to take a segment and offset, instead of just a linear address.
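A minimal sketch of that idea, with hypothetical names (`linear`, `write_word`) and a plain byte slice standing in for the VM memory; it is not the emulator's `inc_addr`. The offset is incremented as a 16-bit value before the segment is applied, so the carry never reaches the segment part.

```rust
/// 20-bit linear address from segment and offset.
pub fn linear(segment: u16, offset: u16) -> usize {
    (((segment as usize) << 4) + (offset as usize)) & 0xF_FFFF
}

/// Writes a little-endian word; the second byte uses offset + 1 with
/// 16-bit wraparound, so 0xFFFF + 1 becomes offset 0 in the same segment.
/// `mem` is assumed to cover the full 1 MB address space.
pub fn write_word(mem: &mut [u8], segment: u16, offset: u16, value: u16) {
    let lo = linear(segment, offset);
    let hi = linear(segment, offset.wrapping_add(1));
    mem[lo] = value as u8;
    mem[hi] = (value >> 8) as u8;
}
```

With ds = 0, writing 0xbeef at offset 0xffff then touches 0x0ffff and 0x00000, and leaves 0x10000 untouched.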

INT 0x21, AH = 0x0a counts one character fewer than it should

inputBufferSize: DB 64
inputBufferAddr: DW 10

start:
;Input string
MOV DI, word inputBufferAddr ; DI ← 10
MOV BL, byte inputBufferSize
MOV byte [DI], BL;offset inputBufferSize ;[10] ← 64

MOV DX, DI

MOV AH, 0x0a
INT 0x21

;Output string
MOV DI, word inputBufferAddr
INC DI
INC byte [DI] ; Bug. Need this correction to workaround (counter + 1)

mov CH, 0
mov CL, byte [DI]
INC DI
MOV BP, DI

MOV AH, 0x13
MOV AL, 0x01
MOV BX, 0x00
MOV DL, 0
INT 0x10

AAA Does not clear upper nibble

The algorithm performed by the ASCII Adjust After Addition instruction is documented as always clearing the upper nibble of AL. However, the current implementation does not do so.

From https://www.felixcloutier.com/x86/aaa:

IF 64-Bit Mode
    THEN
        #UD;
    ELSE
        IF ((AL AND 0FH) > 9) or (AF = 1)
            THEN
                AX := AX + 106H;
                AF := 1;
                CF := 1;
            ELSE
                AF := 0;
                CF := 0;
        FI;
        AL := AL AND 0FH;
FI;

This can be fixed by adding the bitwise AND to the aaa implementation.
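A direct transcription of the non-64-bit branch of the pseudocode above, as a sketch. The function name `aaa` and its tuple return are hypothetical and do not mirror the emulator's signature; the returned flag is the value for both AF and CF.

```rust
/// ASCII Adjust After Addition, following the quoted pseudocode.
/// Takes AX and the current AF; returns (new AX, new AF/CF value).
pub fn aaa(ax: u16, af: bool) -> (u16, bool) {
    let adjust = (ax & 0x000F) > 9 || af;
    let ax = if adjust { ax.wrapping_add(0x106) } else { ax };
    // AL := AL AND 0FH, applied in both branches; AH is left as is.
    (ax & 0xFF0F, adjust)
}
```

For example, after adding ASCII '5' and '7' (AL = 0x6C), `aaa(0x006C, false)` gives AX = 0x0102, i.e. the unpacked BCD digits 1 and 2.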

PUSH/POP shift ss too much

In the push/pop instructions, the value of ss is multiplied by 0x10 before passing to Address::calculate_from_offset:

let ss = vm.arch.ss as usize*0x10;
let sp = vm.arch.sp as usize;
let base = Address::calculate_from_offset(ss,sp);

But the latter multiplies by 0x10 again:

make_valid_address(base.into() * 0x10 + offset.into())

So ss effectively gets shifted 8 bits instead of 4.

You can reproduce with

mov ax, 0x1000
mov ss, ax
mov sp, 0x20
mov bx, 0xdead
push bx

and note that address 0x0001e gets written instead of 0x1001e as it should.
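A sketch of the intended address calculation, where the segment is shifted exactly once. The names `physical` and `push16` are hypothetical, a byte slice stands in for the VM memory, and masking to 20 bits is an assumption about what `make_valid_address` does.

```rust
/// Physical address: segment * 0x10 + offset, kept within 20 bits.
pub fn physical(segment: u16, offset: u16) -> usize {
    ((segment as usize) * 0x10 + (offset as usize)) & 0xF_FFFF
}

/// Pushes a word: SP is decremented by 2, then the word is stored
/// little-endian at SS:SP. `mem` is assumed to cover the full 1 MB.
pub fn push16(mem: &mut [u8], ss: u16, sp: &mut u16, value: u16) {
    *sp = (*sp).wrapping_sub(2);
    let lo = physical(ss, *sp);
    let hi = physical(ss, (*sp).wrapping_add(1));
    mem[lo] = value as u8;
    mem[hi] = (value >> 8) as u8;
}
```

With ss = 0x1000 and sp = 0x20, pushing 0xdead lands at 0x1001e. Shifting the segment twice would give 0x10001e, which under the 20-bit assumption truncates to the 0x0001e seen in the report.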

MUL/IMUL flag setting is incorrect

mul bl currently sets the carry and overflow flags if ah was nonzero on input. It should set them if ah is nonzero on output. I think somebody misread the manual.

The same goes for the other multiply instructions. Additionally, the flag set logic in IMUL is incorrect. It clears/sets them according to whether AH is equal to 0xff. Instead they should be cleared if AH equals the sign bit of AL (0xff or 0x00 as appropriate), otherwise set.
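The rule for the byte forms can be sketched as follows. The function names are hypothetical, and each returns the new AX together with the single value that CF and OF should both take.

```rust
/// MUL r/m8: CF = OF = 1 when the upper half of the result (AH) is nonzero.
pub fn mul8_flags(al: u8, src: u8) -> (u16, bool) {
    let ax = (al as u16) * (src as u16);
    let carry = (ax >> 8) != 0; // checked on the output, not the input
    (ax, carry)
}

/// IMUL r/m8: CF = OF = 0 only when AH is the sign extension of AL,
/// i.e. the signed result still fits in 8 bits.
pub fn imul8_flags(al: u8, src: u8) -> (u16, bool) {
    let ax = ((al as i8 as i16) * (src as i8 as i16)) as u16;
    let ah = (ax >> 8) as u8;
    let al_out = ax as u8;
    let sign_ext: u8 = if (al_out & 0x80) != 0 { 0xFF } else { 0x00 };
    (ax, ah != sign_ext)
}
```

So `imul8_flags(0xF9, 0x02)` yields 0xFFF2 with the flags clear (-14 fits in a byte), while `imul8_flags(0x40, 0x02)` yields 0x0080 with the flags set, because +128 does not fit in a signed byte.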

XLAT should support segment override

The 8086 XLAT instruction can take a segment override prefix. Currently the emulator has no way to express this.

One option that would be reasonably consistent with 8086 assembly would be xlat byte es [bx]. Of course no other addressing mode besides [bx] would be valid here. (Actually, I believe the original assembler would let you write some other addressing mode but just ignore it.)

To avoid the parser awkwardness, they used xlat for this "one operand form" and xlatb for the zero-operand form where DS is assumed.
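On the execution side, supporting an override mostly means letting the lookup take the segment as a parameter. A minimal sketch, with the hypothetical name `xlat` and a byte slice standing in for the VM memory: the caller passes DS by default, or ES (or another segment register) when an override is present.

```rust
/// XLAT lookup: AL := [segment : BX + AL].
/// The table offset wraps within 16 bits; `mem` is assumed to cover 1 MB.
pub fn xlat(mem: &[u8], segment: u16, bx: u16, al: u8) -> u8 {
    let offset = bx.wrapping_add(al as u16);
    let addr = (((segment as usize) << 4) + (offset as usize)) & 0xF_FFFF;
    mem[addr]
}
```

How the override is written in the assembly syntax is a separate question from this lookup, and is the part discussed above.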
