Mondeo - Multistage Botnet Detection

Introduction

The number of mobile devices is increasing exponentially, allowing users to access services through dedicated applications. This mass adoption brings security threats that can affect systems across the Internet, because devices can act in a coordinated fashion through botnets, for example as part of distributed denial-of-service attacks. FluBot is one of the most recent malware threats that uses the DNS protocol to interact with Command and Control (C2) servers, with an impact on the end device as well as on operator networks. In this paper, we propose MONDEO, a multistage mechanism with a flexible pipeline for the detection of FluBot and other C2-based malware that relies on DNS. MONDEO does not require deploying any software, agents, or configuration on mobile devices, which allows network operators to collect and process data efficiently while integrating with core Internet service platforms such as DNS. MONDEO contains diverse pipeline phases, each tailored to detecting a specific botnet pattern such as DGA use, query rate, or suspicious domain names. It can also process the stream of packets with high efficiency, providing a classification (infected or not) as the final result. The evaluation results demonstrate the suitability of MONDEO for deployment in operator network infrastructures with minimal overhead and high precision.

Disclaimer

This application presents a proof of concept for detecting botnet activity in networks. It should NOT be used in a real environment, for both security and scalability reasons. Furthermore, it is provided as is, under the GPL 2.0 license (check the license file).

Installation

MONDEO runs in a Docker container for ease of installation and use. First, install Docker on your system; please refer to the Docker installation page. After that, installation follows the regular Docker container workflow:

docker-compose -f docker-compose.yml up --build

Dependencies

Python version : 3.9

Flask==2.2.2
Jinja2==3.1.2
tld==0.12.6
tensorflow
tensorflow-io
matplotlib==3.5.2
numpy==1.21.5
pandas==1.4.2
dgaintel==2.3
scikit-learn==1.1.3

Documentation

At the core of this application lie two main functions:

  • Analysing DNS Traffic;
  • Analysing HTTP Traffic;

DNS Traffic Analysis

The analysis of this traffic enables the detection of infected devices. For this to work, the packets under consideration must meet the following criteria:

  • Be DNS;
  • Be a query packet;

From this packet we extract several features:

  • Source Host;
  • Destination;
  • Frame length;
  • DNS Flags;
  • DNS count queries;
  • DNS Query Type;
  • DNS Query Name;
  • Timestamp;

These features are used later in different stages of the pipeline.

Pipeline-Analysis

Note: This is an abridged explanation of how the pipeline works; for a comprehensive description, please consult the paper (see Papers).

The pipeline contains 4 core steps, designed to remove packets from the pipeline as early as possible, which keeps the pipeline efficient and scalable. They are listed below:

  • Blacklist/Whitelist analysis;
  • Query rate analysis;
  • DGA probability analysis;
  • ML Evaluation;

Note: The steps were decided after an extensive analysis of the behaviour of the malware.
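The early-exit idea behind these steps can be sketched as follows. This is a minimal illustration in Python, not MONDEO's actual code: the names `run_pipeline`, `list_stage`, and `final_stage`, the verdict strings, and the example domain lists are all hypothetical, and the final stage is only a stub standing in for the ML evaluation.

```python
from typing import Callable, Optional

# A stage returns "flagged" or "passed" when it can decide,
# or None to hand the packet to the next stage.
Stage = Callable[[dict], Optional[str]]

# Hypothetical operator-curated lists.
WHITELIST = {"example.org"}
BLACKLIST = {"bad.example"}


def list_stage(packet: dict) -> Optional[str]:
    """Blacklist/whitelist step: decide immediately for known domains."""
    domain = packet["domain"]
    if domain in WHITELIST:
        return "passed"
    if domain in BLACKLIST:
        return "flagged"
    return None


def final_stage(packet: dict) -> Optional[str]:
    """Stub for the last step, which must always reach a verdict."""
    return "passed"


def run_pipeline(packet: dict, stages: list[Stage]) -> str:
    """Return the verdict of the first stage that decides."""
    for stage in stages:
        verdict = stage(packet)
        if verdict is not None:
            return verdict  # later, costlier stages are skipped
    raise ValueError("no stage produced a verdict")
```

Placing the cheap list lookup first means that known domains never reach the more expensive stages, which is the property the step ordering is meant to provide.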

Blacklist/Whitelist Analysis

As simple as the name implies. The idea here is to allow platform adopters to curate their own lists, tailored to their needs. A whitelist automatically allows packets into the network, while a blacklist automatically blocks packets from the network.

Feedback system

Using this mechanism, it is possible to feed both lists with the evaluations from other steps, though automatically feeding a whitelist is not recommended. If there is a problem with the blacklist, a simple customer call will suffice to correct it. On the other hand, if malware manages to get whitelisted, it will stay under the radar forever.

Query Rate Analysis

A common behaviour that this malware seems to use and abuse is a high volume of DNS queries. This is mainly due to the properties of DGA algorithms (explained below). For this step it is not relevant which domains are being queried, only how often queries are made. A regular "human" user will not generate dozens of queries per second, amounting to thousands in a few minutes. Such a rate indicates the behaviour of a bot, which we aim to catch. To prevent too many false positives, 2 metrics are configurable:

  • The number of packets before a warning is fired;
  • The smallest time $\Delta$ before a packet is considered to be out of the norm.
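One possible reading of these two settings is sketched below. It is an interpretation, not MONDEO's implementation: the class `QueryRateMonitor`, its parameter names, and the default values are hypothetical. A query counts as "fast" when it arrives less than `min_delta` seconds after the previous query from the same source, and a warning is raised once a source has accumulated `max_fast_queries` such queries.

```python
from collections import defaultdict


class QueryRateMonitor:
    """Tracks, per source, how many queries arrived too close together."""

    def __init__(self, max_fast_queries: int = 20, min_delta: float = 0.5):
        self.max_fast_queries = max_fast_queries  # packets before a warning
        self.min_delta = min_delta                # smallest "normal" gap, in seconds
        self._last_seen = {}
        self._fast_count = defaultdict(int)

    def observe(self, source: int, timestamp: float) -> bool:
        """Record one DNS query; return True once the source should be warned about."""
        previous = self._last_seen.get(source)
        self._last_seen[source] = timestamp
        if previous is not None and timestamp - previous < self.min_delta:
            self._fast_count[source] += 1
        return self._fast_count[source] >= self.max_fast_queries
```

Raising either value makes the step more tolerant, which trades fewer false positives for slower detection.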

DGA probability Analysis

Following this pattern of high query rate, we reached the conclusion that the malware is using a DGA (domain generation algorithm). Such an algorithm generates random-looking but deterministic domains which, because they are deterministic, can be registered or deleted quickly to prevent the botnet from going down. To detect whether a domain is DGA-generated, the Python package dgaintel is used, which produces a floating-point score (0..1) corresponding to the certainty of the evaluation.
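Turning such a score into a decision amounts to comparing it against a configured threshold. The sketch below assumes this and nothing more: `flag_dga` is a hypothetical helper, the scorer is passed in as a function so that the example does not depend on any particular package API, and `fake_score` returns made-up values purely for illustration.

```python
from typing import Callable


def flag_dga(domain: str, score_fn: Callable[[str], float], threshold: float) -> bool:
    """Return True when the scorer's certainty reaches the threshold."""
    score = score_fn(domain)
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score {score!r} is outside the 0..1 range")
    return score >= threshold


# Made-up scores standing in for a real DGA classifier.
_FAKE_SCORES = {"xkqjzvbwpt.example": 0.97, "example.org": 0.02}


def fake_score(domain: str) -> float:
    """Hypothetical scorer used only to exercise flag_dga."""
    return _FAKE_SCORES.get(domain, 0.0)
```

In a deployment, `score_fn` would wrap the real classifier and `threshold` would come from the system configuration.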

ML Evaluation

The final step aims to evaluate the hardest packets. According to our evaluation, less than 1% of the packets should reach this stage. Here a machine learning algorithm uses the properties of the packet to decide whether it is infected or not.

HTTP Traffic Analysis

The second core function available is the HTTP analysis. This analysis, paired with the DNS analysis, enables the discovery of the botnet's command-and-control (C2) server.

At its core, this analysis is much simpler than the DNS analysis. Here we focus on packets that meet the following criteria:

  • Be HTTP;
  • Target port 80;
  • Use the POST request method.

From this packet we extract several features:

  • Source Host;
  • Destination;
  • Domain;
  • Timestamp;

Pipeline Analysis

There are 3 core assertions that must hold before an IP is confirmed as a possible C2 server:

  • The source (device) must be in the infected list (which it enters by having any DNS query previously flagged);
  • The query must be within a time $\Delta$ (defined in config);
  • The query URI must be considered DGA-generated (the DGA evaluation must surpass the default config value).

If all conditions are met, the IP is considered to be infected. Otherwise, the response is negative.
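The three checks can be sketched as a single function. This is an illustration under stated assumptions rather than the project's code: `is_possible_c2` and its parameters are hypothetical, the infected list is modelled as a mapping from source to the timestamp of its flagged DNS query, and the time $\Delta$ is assumed to be measured against that timestamp.

```python
from typing import Callable, Mapping


def is_possible_c2(
    request: dict,
    infected: Mapping[int, int],
    max_delta: int,
    score_fn: Callable[[str], float],
    dga_threshold: float,
) -> bool:
    """Apply the three assertions in order, stopping at the first failure."""
    # 1. The source must already be in the infected list.
    flagged_at = infected.get(request["source"])
    if flagged_at is None:
        return False
    # 2. The request must fall within the configured time delta
    #    (assumed here to be relative to the flagged DNS query).
    if abs(request["timestamp"] - flagged_at) > max_delta:
        return False
    # 3. The queried domain must score above the DGA threshold.
    return score_fn(request["domain"]) > dga_threshold
```

The `score_fn` argument stands in for whatever DGA scorer the deployment uses; any function returning a value between 0 and 1 fits.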

WebUI

Main Menu

Main page for the application. Available functions:

  • Access the Download Page - Loads the download page
  • Check Current Stats - Loads the statistics page
  • Reset - Resets the current statistics
  • Save - Saves the current statistics
  • Upload a File - Loads the upload file page

filelist

Lists all files available for download (from the WebUI).

check_stats

Presents an abridged version of the available statistics.

Upload

Uploads a file into the system; this file will be used to reset and load stats. The file must be compliant with the system (for example, downloaded from the Download Page).

Endpoints

save_stats

Method : GET
Description

Saves the stats into a file

Response
Success
{
 "code": "200",
 "success": True, 
 "filename": filename
 }
Failure
{
 "code": "500",
 "success": False, 
 "filename": None
 }

toggle_retroactive

Method : GET
Description

Change the retroactive property in the system configs

Response
Success
{
 "code": "200",
 "current_retroactive_value": Boolean, 
 }

stats_time

Method : GET

Retrieves the time stats in json format

Return
{
    "time_flagged_by_blacklist": {
        "average": NaN,
        "std": NaN,
        "total": 0
    },
    "time_flagged_by_dga_prob": {
        "average": NaN,
        "std": NaN,
        "total": 0
    },
    "time_flagged_by_eval_http": {
        "average": NaN,
        "std": NaN,
        "total": 0
    },
    "time_flagged_by_ml": {
        "average": NaN,
        "std": NaN,
        "total": 0
    },
    "time_flagged_by_query_rate": {
        "average": NaN,
        "std": NaN,
        "total": 0
    },
    "time_of_flagged_packets_dns": 0.0,
    "time_of_flagged_packets_http": 0.0,
    "time_of_passed_packets_dns": 0.0,
    "time_of_passed_packets_http": 0.0,
    "time_passed_by_dga_prob": {
        "average": NaN,
        "std": NaN,
        "total": 0
    },
    "time_passed_by_eval_http": {
        "average": NaN,
        "std": NaN,
        "total": 0
    },
    "time_passed_by_ml": {
        "average": NaN,
        "std": NaN,
        "total": 0
    },
    "time_passed_by_whitelist": {
        "average": NaN,
        "std": NaN,
        "total": 0
    },
    "total_time_packets": 0.0,
    "total_time_packets_dns": 0.0,
    "total_time_packets_http": 0.0
}
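Note that `NaN` is not a valid token in strict JSON, so some parsers reject this response. Python's standard `json` module accepts it by default and yields a float NaN. The sketch below shows one way a client might normalise the per-stage averages; the function name `summarise_stage_times` is hypothetical.

```python
import json
import math


def summarise_stage_times(raw_body: str) -> dict:
    """Map each stage to its average time, or None when the average is NaN."""
    stats = json.loads(raw_body)  # json.loads accepts the NaN literal by default
    summary = {}
    for name, value in stats.items():
        if isinstance(value, dict):  # skip the flat totals such as total_time_packets
            average = value["average"]
            summary[name] = None if math.isnan(average) else average
    return summary
```

Replacing NaN with `None` makes the result safe to re-serialise as strict JSON.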

stats_eval

Method : GET

Retrieves the evaluation stats in json format

Return
{
    "flagged_by_blacklist": 0,
    "flagged_by_dga_calc": 0,
    "flagged_by_http_eval": 0,
    "flagged_by_ml": 0,
    "flagged_by_query_rate": 0,
    "flagged_packets_dns": 0,
    "flagged_packets_http": 0,
    "passed_by_dga_calc": 0,
    "passed_by_http_eval": 0,
    "passed_by_ml": 0,
    "passed_by_whitelist": 0,
    "passed_packets_dns": 0,
    "passed_packets_http": 0,
    "total_packets": 0,
    "total_packets_dns": 0,
    "total_packets_http": 0
}

stats_domain

Method : GET

Retrieves the domain stats in json format

Return
{
    "ai_flagged_domains": [],
    "ai_passed_domains": [],
    "blacklist_domains": [],
    "dga_flagged_domains": [],
    "dga_passed_domains": [],
    "http_flagged_domains": [],
    "http_passed_domains": [],
    "query_rate_domains": [],
    "whitelist_domains": []
}

all_stats

Method : GET

Retrieves all stats (eval + domain + time) in json format. WARNING : May result in a large response

Return
{
    "ai_flagged_domains": [],
    "ai_passed_domains": [],
    "blacklist_domains": [],
    "dga_flagged_domains": [],
    "dga_passed_domains": [],
    "flagged_by_blacklist": 0,
    "flagged_by_dga_calc": 0,
    "flagged_by_http_eval": 0,
    "flagged_by_ml": 0,
    "flagged_by_query_rate": 0,
    "flagged_packets_dns": 0,
    "flagged_packets_http": 0,
    "http_flagged_domains": [],
    "http_passed_domains": [],
    "passed_by_dga_calc": 0,
    "passed_by_http_eval": 0,
    "passed_by_ml": 0,
    "passed_by_whitelist": 0,
    "passed_packets_dns": 0,
    "passed_packets_http": 0,
    "query_rate_domains": [],
    "time_flagged_by_blacklist": {
        "average": NaN,
        "std": NaN,
        "total": 0
    },
    "time_flagged_by_dga_prob": {
        "average": NaN,
        "std": NaN,
        "total": 0
    },
    "time_flagged_by_eval_http": {
        "average": NaN,
        "std": NaN,
        "total": 0
    },
    "time_flagged_by_ml": {
        "average": NaN,
        "std": NaN,
        "total": 0
    },
    "time_flagged_by_query_rate": {
        "average": NaN,
        "std": NaN,
        "total": 0
    },
    "time_of_flagged_packets_dns": 0.0,
    "time_of_flagged_packets_http": 0.0,
    "time_of_passed_packets_dns": 0.0,
    "time_of_passed_packets_http": 0.0,
    "time_passed_by_dga_prob": {
        "average": NaN,
        "std": NaN,
        "total": 0
    },
    "time_passed_by_eval_http": {
        "average": NaN,
        "std": NaN,
        "total": 0
    },
    "time_passed_by_ml": {
        "average": NaN,
        "std": NaN,
        "total": 0
    },
    "time_passed_by_whitelist": {
        "average": NaN,
        "std": NaN,
        "total": 0
    },
    "total_packets": 0,
    "total_packets_dns": 0,
    "total_packets_http": 0,
    "total_time_packets": 0.0,
    "total_time_packets_dns": 0.0,
    "total_time_packets_http": 0.0,
    "whitelist_domains": []
}

analyze_dns

Method : POST
Input (JSON format)
  • source - int (ipv4 to int conversion)
  • destination - int (ipv4 to int conversion)
  • length - int
  • dns_flag - int
  • nr_of_requests - int
  • question_type - int
  • queries_null - int
  • timestamp - int
  • domain - str
Example
{
    "source": 39028245,
    "destination": 3247033209,
    "length": 78,
    "dns_flag": 0,
    "nr_of_requests": 1,
    "question_type": 28,
    "queries_null": 0,
    "timestamp": 1657704769,
    "domain": "octocloud.net"
}
Response
Success
  • code - int
  • prediction - str
  • domain - str
  • source - str
Example
{
    "code": 200,
    "domain": "octocloud.net",
    "prediction": "0.85",
    "source": "39028245"
}
Failure
  • code - int
  • error_msg - str
Example
{
    "code": 400,
    "message": "Bad format for http request (check Documentation)"
}
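A request body for this endpoint could be assembled as sketched below. The helper names `ipv4_to_int` and `build_dns_payload` are hypothetical, and the conversion assumes the usual big-endian (network order) reading of an IPv4 address; under that assumption the integers in the example above correspond to 2.83.134.21 and 193.137.203.121.

```python
import ipaddress
import json


def ipv4_to_int(address: str) -> int:
    """Convert a dotted-quad IPv4 address to its big-endian integer value."""
    return int(ipaddress.IPv4Address(address))


def build_dns_payload(
    source_ip: str,
    destination_ip: str,
    domain: str,
    timestamp: int,
    length: int,
    dns_flag: int = 0,
    nr_of_requests: int = 1,
    question_type: int = 28,  # 28 is the AAAA record type
    queries_null: int = 0,
) -> str:
    """Serialise one DNS query into the JSON body listed above."""
    payload = {
        "source": ipv4_to_int(source_ip),
        "destination": ipv4_to_int(destination_ip),
        "length": length,
        "dns_flag": dns_flag,
        "nr_of_requests": nr_of_requests,
        "question_type": question_type,
        "queries_null": queries_null,
        "timestamp": timestamp,
        "domain": domain,
    }
    return json.dumps(payload)
```

The resulting string can be sent as the body of a POST request with a JSON content type, using whichever HTTP client is preferred.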

analyze_http

Method : POST
Input (JSON format)
  • source - int (ipv4 to int conversion)
  • destination - int (ipv4 to int conversion)
  • timestamp - int
  • domain - str
Example
{
    "source": 39028245,
    "destination": 3247033209,
    "timestamp": 1657704769,
    "domain": "octocloud.net" 
}
Response
Success
  • code - int
  • prediction - str
  • domain - str
  • source - str
Example
{
    "code": 200,
    "domain": "octocloud.net",
    "prediction": "0",
    "source": "3247033209"
}
Failure
  • code - int
  • error_msg - str
Example
{
    "code": 400,
    "message": "Bad format for http request (check Documentation)"
}

Papers (Will be updated accordingly)

  • Paper 1 - To be published
  • Paper 2 - To be published

Contributors

  • tldart
