
dessertlab / shellcode_ia32


Shellcode_IA32 is a dataset consisting of challenging but common assembly instructions, collected from real shellcodes, with their natural language descriptions. The dataset can be used for neural machine translation tasks to automatically generate software exploits from natural language.

License: GNU General Public License v3.0

shellcode ia32 nmt shellcode-development ia32-assembly exploit-development dataset linux nasm nasm-assembly

shellcode_ia32's Introduction

Shellcode_IA32

Shellcode_IA32 is a dataset of shellcodes spanning more than 20 years, drawn from a variety of sources, and is the largest collection of shellcodes in assembly available to date. We are currently extending the dataset. So far, we have released three versions.

Version 1: Shellcode_IA32

Shellcode_IA32 was first presented in the paper Shellcode_IA32: A Dataset for Automatic Shellcode Generation, accepted to the 1st Workshop on Natural Language Processing for Programming (NLP4Prog 2021). This version consists of 3,200 examples of assembly language instructions for IA-32 (the 32-bit version of the x86 Intel Architecture) taken from publicly available security exploits. We collected assembly programs used to generate shellcode from exploit-db and shell-storm, and enriched the dataset with examples of IA-32 assembly programs from popular tutorials and books. This allowed us to understand how different authors and assembly experts comment their code and, thus, how to deal with the ambiguity of natural language in this specific context. In total, 10% of the instructions in the dataset come from books and guidelines, and the rest from real shellcodes.

Our focus is on Linux, the most common OS for security-critical network services. Accordingly, we added assembly instructions written for the Netwide Assembler (NASM) on Linux.

Each line of the Shellcode_IA32 dataset represents a snippet-intent pair. The snippet is a single line, or a combination of multiple lines, of assembly code written in NASM syntax. The intent is a comment in English.
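As a rough illustration of the snippet-intent structure, the following Python sketch models one pair as a small record. The on-disk layout of the dataset is not described here, so the `Pair` class, the `parse_pair` helper, and the example pair itself are hypothetical and only show the shape of the data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    """One dataset entry: an English intent and its NASM snippet."""
    intent: str   # natural language description (English comment)
    snippet: str  # one or more NASM instructions for IA-32


def parse_pair(intent_line: str, snippet_line: str) -> Pair:
    # Assumption: intent and snippet arrive as two aligned text lines.
    # Only surrounding whitespace is stripped; the content is left untouched.
    return Pair(intent=intent_line.strip(), snippet=snippet_line.strip())


# Illustrative pair, not taken from the dataset.
example = parse_pair("zero out the eax register\n", "xor eax, eax\n")
print(example.intent, "->", example.snippet)
```

Keeping the record immutable (`frozen=True`) is a design choice for this sketch: pairs can then be used as dictionary keys or set members, which is convenient when checking for duplicates.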

We conducted an extensive experimental evaluation using the Shellcode_IA32 dataset in the paper Can we generate shellcodes via natural language? An empirical study, published in the Automated Software Engineering (AUSE) journal. The paper also contains further statistics on the dataset.

Version 2: Decoder Dataset

The Shellcode_IA32 dataset has been extended to build the Decoder Dataset (an assembly dataset used for decoding encoded shellcodes), presented in the paper EVIL: Exploiting Software via Natural Language. The extended dataset and the code to reproduce the experiments of the paper can be found on this GitHub repository. This second version of the dataset includes 3,715 assembly code snippets, each with a description in English.

Version 3: Extended_Shellcode_IA32

We further enriched the dataset (Extended_Shellcode_IA32) with additional samples of shellcodes collected from publicly available security exploits, reaching 5,900 unique pairs of assembly code snippets/English intents.

Our dataset also includes 1,374 intents (~23% of the dataset) that generate multiple lines of assembly code, separated by the newline character \n. These multi-line snippets contain many different assembly instructions (e.g., whole functions).
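A minimal sketch of how multi-line snippets might be handled in Python follows. It assumes the separator appears either as a real newline or as the literal two-character sequence `\n` inside a single line; which of the two the files use is not stated here. The helper names and the example snippet are hypothetical.

```python
import re
from typing import List


def split_snippet(snippet: str) -> List[str]:
    """Split a snippet into individual instructions.

    Assumption: instructions are separated by a real newline or by the
    literal characters backslash + 'n'. Empty pieces are dropped.
    """
    parts = re.split(r"\\n|\n", snippet)
    return [p.strip() for p in parts if p.strip()]


def is_multiline(snippet: str) -> bool:
    # A snippet counts as multi-line if it holds more than one instruction.
    return len(split_snippet(snippet)) > 1


def share_percent(part: int, total: int) -> float:
    # Percentage of `part` in `total`, rounded to one decimal place.
    return round(100.0 * part / total, 1)


# Illustrative snippet, not taken from the dataset.
print(split_snippet(r"push ebp \n mov ebp, esp"))
# Share of multi-line intents, using the counts given above.
print(share_percent(1374, 5900))
```

Splitting on both forms of the separator keeps the helper usable whether the file is read raw or after escape sequences have been decoded.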

shellcode_ia32's People

Contributors

piliguori

shellcode_ia32's Issues

Could you share how you split the dataset?

Hello, in [Can we generate shellcodes via natural language? An empirical study], the test data is clear, but how the training and validation sets were split is not. Could you show me?
