
os_fingerprinting's Introduction

Machine Learning Model for Passive OS Fingerprinting

OS fingerprinting is the process of detecting a remote server's OS (and version) by communicating with it and analyzing its responses. This process is important for security experts (and attackers), since knowing a server's OS narrows down the security vulnerabilities it is likely to have.

The most common tools for fingerprinting (Nmap, NetworkMiner, Satori, p0f) rely on a database of "network signatures" (a signature can be thought of as the 'accent' or 'body language' of an OS). The database is maintained manually by security experts, and has not been updated in a long time (most tools rely on the database of p0f).

This project is an attempt to create an ML model for OS fingerprinting.

Background on OS Fingerprinting

There are 2 types of fingerprinting:

  • Active fingerprinting sends specially crafted probes to the server and analyzes how it responds. It takes advantage of known differences in behavior between versions: if version X of the Linux kernel answers a probe one way, and version Y answers it another way, then the response helps us determine the server's kernel version. Nmap is a common tool for active fingerprinting.

  • Passive fingerprinting only analyzes packets of 'typical/legitimate' communication (mainly the TCP/IP headers). p0f is a common tool for passive fingerprinting.

The trade-off between the two methods: the active method has better accuracy, but its 'aggressive' nature makes it much easier for firewalls to detect.

In this project my models perform the passive version. To be precise, they only look at the server's TCP SYN-ACK message, which makes the process extremely stealthy and fast.

Related Work: I found a paper written by IEEE researchers about a similar project:
      A Machine Learning-based Tool for Passive OS Fingerprinting with TCP Flavor as a Novel Feature

Data Generation

I collected data on ~1,000,000 servers (chosen from a list of popular websites).

Establishing Ground Truth

Since I don't have a datacenter's-worth of my own servers, finding labeled servers felt like a 'chicken and egg' problem. I decided to use Nmap's analysis as my ground truth: it may not be 100% accurate, but it does harness the precision of active fingerprinting, and it's an industry standard.

Nmap usually reports its guesses with 85%-90% certainty, and returns them as a list in descending order of certainty. For this reason I aimed for 85%-90% accuracy with my models, and decided that the most relevant accuracy metric would be top-2 accuracy.

Feature Selection

I chose the features by reading p0f's documentation, the paper mentioned above, and the RFC on TCP/IP headers.
Some of the most helpful fields are IP's "Don't Fragment" flag, IP's TTL value, TCP's MSS value, and TCP's options.
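
As an illustration of where these fields live in a packet, here is a minimal sketch that reads them from the raw bytes of an IPv4/TCP packet using only the standard library. It is not the project's extraction code (the project uses Scapy); the function names and the hand-built sample packet are assumptions made for this example.

```python
import struct

# A few well-known TCP option kinds; other kinds are reported as "kind-N".
OPTION_NAMES = {2: "MSS", 3: "WScale", 4: "SACK-Permitted", 8: "Timestamp"}


def parse_tcp_options(raw):
    """Return (option names in wire order, MSS value or None)."""
    names, mss, i = [], None, 0
    while i < len(raw):
        kind = raw[i]
        if kind == 0:                      # End of Option List
            names.append("EOL")
            break
        if kind == 1:                      # NOP is a single padding byte
            names.append("NOP")
            i += 1
            continue
        if i + 1 >= len(raw):              # truncated option, stop parsing
            break
        length = raw[i + 1]
        if length < 2 or i + length > len(raw):
            break                          # malformed length, stop parsing
        if kind == 2 and length == 4:
            mss = struct.unpack("!H", raw[i + 2:i + 4])[0]
        names.append(OPTION_NAMES.get(kind, f"kind-{kind}"))
        i += length
    return names, mss


def extract_features(packet):
    """Extract DF flag, TTL, MSS and option order from raw IPv4+TCP bytes."""
    ihl = (packet[0] & 0x0F) * 4                       # IP header length in bytes
    flags_and_offset = struct.unpack("!H", packet[6:8])[0]
    tcp = packet[ihl:]
    data_offset = (tcp[12] >> 4) * 4                   # TCP header length in bytes
    options, mss = parse_tcp_options(tcp[20:data_offset])
    return {
        "df": bool(flags_and_offset & 0x4000),         # "Don't Fragment" bit
        "ttl": packet[8],
        "mss": mss,
        "options": options,
    }


# Hand-built SYN-ACK for illustration only (not captured from a real server).
SAMPLE_OPTIONS = bytes([2, 4, 0x05, 0xB4, 1, 1, 4, 2, 1, 3, 3, 7])
SAMPLE_IP = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 52, 0, 0x4000, 57, 6, 0,
                        bytes([192, 0, 2, 10]), bytes([192, 0, 2, 20]))
SAMPLE_TCP = struct.pack("!HHIIBBHHH", 80, 12345, 0, 1, 0x80, 0x12,
                         65535, 0, 0) + SAMPLE_OPTIONS
sample_features = extract_features(SAMPLE_IP + SAMPLE_TCP)
```

The order of the TCP options is kept as a list because signature-based tools such as p0f treat the option layout itself as part of the signature.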

Data Collection

The process of retrieving labels and the process of retrieving features were run separately using different tools.

Label retrieval: Python has a wrapper for Nmap, so automating the scan was relatively straightforward. Another advantage of Nmap is its built-in ability to scan multiple hosts concurrently.
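
For readers who want to see what the labels look like without a wrapper, here is a minimal sketch that reads OS guesses from Nmap's XML output (as produced by running Nmap with `-O -oX`). This is an alternative to the wrapper, not the code used in this project. The sample XML below is trimmed and made up for illustration, and `os_guesses` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed, invented sample in the shape of Nmap's XML output (illustration only).
SAMPLE_XML = """<nmaprun>
  <host>
    <address addr="192.0.2.10" addrtype="ipv4"/>
    <os>
      <osmatch name="Linux 3.10 - 4.11" accuracy="87"/>
      <osmatch name="Linux 4.15 - 5.6" accuracy="90"/>
    </os>
  </host>
</nmaprun>"""


def os_guesses(xml_text):
    """Map each host address to its OS guesses, most certain first."""
    root = ET.fromstring(xml_text)
    result = {}
    for host in root.iter("host"):
        address = host.find("address").get("addr")
        guesses = [(match.get("name"), int(match.get("accuracy")))
                   for match in host.iter("osmatch")]
        # Sort explicitly so the result does not depend on the order in the file.
        guesses.sort(key=lambda guess: guess[1], reverse=True)
        result[address] = guesses
    return result


labels = os_guesses(SAMPLE_XML)
```

Keeping the whole ranked list, rather than only the first guess, is what makes a top-2 metric possible later on.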

Feature retrieval: to analyze a server's SYN-ACK message, I sent an HTTP request while sniffing the communication with Scapy (a sniffer & packet manipulation tool). I used multithreading to probe multiple hosts simultaneously.
(Initially I only sent a TCP SYN message, as it's simpler & faster than sending a full HTTP request. I noticed there was almost no variety in the response's TCP options, and suspected it may be due to the 'synthetic' nature of the probe. Switching to a full HTTP request resulted in the variety I was hoping for.)
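
The multithreaded probing can be sketched with a thread pool. This is a minimal sketch, not the project's code: `probe` is a hypothetical placeholder for the "send an HTTP request while sniffing the SYN-ACK" step, which is not shown here because it needs Scapy and raw-socket privileges.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def probe_all(hosts, probe, max_workers=32):
    """Run probe(host) for every host in parallel and collect the results.

    `probe` is a caller-supplied function (hypothetical here) that returns
    the features extracted from one server's SYN-ACK.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(probe, host): host for host in hosts}
        for future in as_completed(futures):
            host = futures[future]
            try:
                results[host] = future.result()
            except Exception:
                # Unreachable or misbehaving hosts are recorded as None
                # instead of aborting the whole scan.
                results[host] = None
    return results


def fake_probe(host):
    """Stand-in for the real probe, used only to demonstrate the pool."""
    if host == "down.example":
        raise TimeoutError("no SYN-ACK received")
    return {"host": host, "ttl": 64}


demo = probe_all(["a.example", "b.example", "down.example"], fake_probe)
```

Threads suit this workload because each probe spends most of its time waiting on the network rather than using the CPU.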

My scan found the following operating systems:

OS               # Samples       OS               # Samples
Linux 5.X        12392           OpenBSD 4.X      7041
Linux 4.X        110824          FreeBSD 6.X      72072
Linux 3.X        88485           embedded         76809
Linux 2.6.X      50978           Windows 2016     6224
Linux (Other)    5634            Windows 2012     9014


Model Comparison

The Models:

  • SVM: in some of the features, different operating systems result in different value ranges (for example, Windows systems tend to have an initial TTL of 128, while Linux systems tend to have an initial TTL of 64). I believed this property might call for a linear classifier.

  • Gradient Boosting: this is simply a typical choice for tabular data.

  • Neural Network: adding this model was mostly for my own curiosity. The network has 4 fully-connected layers.
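
A note on the TTL example above: the TTL seen in a captured packet is lower than the initial value, because every router on the path decrements it. A common heuristic is to round the observed TTL up to the nearest widely used default. The sketch below shows that heuristic; whether this project's models use the raw or the rounded TTL is not stated here, so treat it as an assumption.

```python
# Widely used default initial TTL values, in ascending order.
COMMON_INITIAL_TTLS = (32, 64, 128, 255)


def guess_initial_ttl(observed_ttl):
    """Round an observed TTL up to the nearest common initial value."""
    if not 0 <= observed_ttl <= 255:
        raise ValueError("TTL is an 8-bit field and must be in 0..255")
    for initial in COMMON_INITIAL_TTLS:
        if observed_ttl <= initial:
            return initial


# A packet that arrives with TTL 113 most likely started at 128.
windows_like = guess_initial_ttl(113)
# A packet that arrives with TTL 57 most likely started at 64.
linux_like = guess_initial_ttl(57)
```

The heuristic can be wrong when a path is unusually long, for example a packet that started at 128 and crossed more than 64 hops.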

The Metric:
    As I wrote under Establishing Ground Truth, the metric that fits my data is top-2 accuracy: a prediction counts as correct if the true label is among the model's 2 most likely guesses.
    Note that it does not hinder user experience too much: receiving 2 guesses isn't so bad when looking for exploits.
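
    Top-2 accuracy can be computed from per-class scores as in the minimal sketch below. The function name, class names and scores are made up for illustration and are not taken from the training notebook.

```python
def top_k_accuracy(true_labels, scores, classes, k=2):
    """Fraction of samples whose true label is among the k highest-scored classes.

    scores[i][j] is the model's score for sample i belonging to classes[j].
    """
    hits = 0
    for label, row in zip(true_labels, scores):
        ranked = sorted(range(len(classes)), key=lambda j: row[j], reverse=True)
        top_k = {classes[j] for j in ranked[:k]}
        if label in top_k:
            hits += 1
    return hits / len(true_labels)


# Illustrative data only.
classes = ["Linux 4.X", "Windows 2016", "FreeBSD 6.X"]
true_labels = ["Linux 4.X", "Windows 2016", "FreeBSD 6.X", "Linux 4.X"]
scores = [
    [0.6, 0.3, 0.1],   # true label ranked 1st: hit
    [0.5, 0.4, 0.1],   # true label ranked 2nd: hit
    [0.5, 0.4, 0.1],   # true label ranked 3rd: miss
    [0.1, 0.2, 0.7],   # true label ranked 3rd: miss
]
accuracy = top_k_accuracy(true_labels, scores, classes, k=2)
```

    With `k=1` the same function gives ordinary accuracy, which makes it easy to compare the two metrics on the same predictions.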

The Results:
    All 3 models reached a top-2 accuracy of around 85%. Graphs are available in the Model Training Notebook.

Contributors

oopir
