BalCCon2k25

Beyond Signatures: Training ML Models to Hunt Ransomware
2025-09-21, Tesla

This talk grew out of a free-time project to truly understand how malware works. It presents a personal journey into malware analysis and demonstrates how we can move beyond traditional signatures to try to stop modern threats.

During the talk we will try out a practical methodology for detecting ransomware by applying machine learning instead of relying only on traditional signature-based detection. I'll also show how I built a classifier that can distinguish ransomware from other malware families. The idea behind this talk is to give you insight into data-driven feature engineering: extracting critical static indicators from PE headers and dynamic behavioral clues, such as MITRE ATT&CK TTPs and registry modifications, from analysis logs created in a Cuckoo3 sandbox environment. I will walk through the process of training and evaluating several powerful classification models: Random Forest, XGBoost, CatBoost and LightGBM.

Technical Level: Beginner/Intermediate.

Prerequisites: Familiarity with terms like "sandbox" and "static/dynamic analysis" will help you question my work. No specific preparation is required!


Part 1: The Problem (5-10 mins)

1.1. Introduction to the Modern Malware Landscape:

  • The operational bottleneck: Manual analysis is slow and expensive, doesn't scale, and can't keep up with the rapid growth in malware types.
  • The core idea is to move from a reactive, signature-based posture to proactive, behavior-based detection and classification in order to shorten the response gap.

1.2. The Goal: Automated Triage for Rapid Response:

Define the objective: To build an automated system that can accurately classify a given malware sample, with a specific focus on identifying high-impact threats like ransomware.

Briefly outline the talk's structure: Data Collection -> Feature Engineering -> Model Training & Evaluation -> Conclusion.

Part 2: Data Collection and Feature Engineering (20-25 mins)

2.1. Acquiring the Raw Materials (Data Acquisition):

  • Sourcing malware samples from Malware Bazaar.
  • Demonstration of a programmatic approach using the VirusTotal API to check malware classification.
  • Walkthrough of the vt_check.py script to fetch analysis data for a list of malware hashes. This illustrates a repeatable process for fetching file reports from VirusTotal utilizing their API.
  • A quick overview of how I set up the Cuckoo3 sandbox, and a list of some of the issues in the current public GitHub repo.
  • A walkthrough of the Python scripts that helped me triage malware samples and automatically submit them to the Cuckoo sandbox for analysis.
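
As a rough illustration of the hash lookup step above, here is a minimal sketch that queries the VirusTotal API v3 file endpoint and reduces the report to a few triage fields. It is not the vt_check.py script from the talk. The function names and the choice of fields are assumptions made for this example, and `threat_label` may be missing for samples that VirusTotal has not classified.

```python
import json
import urllib.request

# VirusTotal API v3 file report endpoint; the hash can be MD5, SHA-1 or SHA-256.
VT_FILE_URL = "https://www.virustotal.com/api/v3/files/{file_hash}"


def fetch_vt_report(file_hash: str, api_key: str) -> dict:
    """Fetch the raw JSON file report for one hash (needs network and an API key)."""
    request = urllib.request.Request(
        VT_FILE_URL.format(file_hash=file_hash),
        headers={"x-apikey": api_key},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def summarize_vt_report(report: dict) -> dict:
    """Keep only the fields used for triage (hypothetical selection)."""
    attributes = report.get("data", {}).get("attributes", {})
    stats = attributes.get("last_analysis_stats", {})
    classification = attributes.get("popular_threat_classification", {})
    return {
        "malicious": stats.get("malicious", 0),
        "suspicious": stats.get("suspicious", 0),
        "threat_label": classification.get("suggested_threat_label"),
    }
```

Keeping the network call separate from the parsing makes it possible to cache raw reports on disk and rerun the summary step without spending API quota again.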

2.2. Feature Engineering:

This is the core of the methodology. I will show how to transform raw sandbox and analysis data into meaningful features for ML models.
This process is built upon established malware analysis principles. I acknowledge the foundational work in malware feature extraction from sandbox reports, citing the general approach from academic research used in systems like Cuckoo Sandbox.
- A deep dive into the feature extraction script.
- Static Features: Extracted without running the code. This includes:
  - PE Header data
  - Imported Functions
  - File metadata
- Dynamic & Behavioral Features: Extracted by observing the malware in a sandbox. This includes:
  - File System Operations
  - Process Activity
  - Registry Modifications
  - Mapping to MITRE ATT&CK techniques such as T1486 (Data Encrypted for Impact), T1490 (Inhibit System Recovery), and T1070.004 (File Deletion)
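
The two feature groups above could be flattened into one feature row roughly as follows. This is a minimal standard-library sketch, not the feature extraction script from the talk. The `static_pe_*` names and the layout of the `report` dictionary are hypothetical and do not reflect the actual Cuckoo3 report schema. Only the `dynamic_reg_any_written_count` name and the underscore style for TTP columns follow names that appear later in this outline.

```python
import struct

IMAGE_FILE_DLL = 0x2000  # COFF Characteristics flag set for DLLs


def static_pe_features(data: bytes) -> dict:
    """Read a few COFF header fields from raw PE bytes without executing anything."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("not a PE file: missing MZ magic")
    # The 4-byte field at offset 0x3C (e_lfanew) points to the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("not a PE file: missing PE signature")
    # The 20-byte COFF file header follows the signature.
    machine, n_sections, timestamp, _, _, _, characteristics = struct.unpack_from(
        "<HHIIIHH", data, pe_offset + 4
    )
    return {
        "static_pe_machine": machine,
        "static_pe_number_of_sections": n_sections,
        "static_pe_timestamp": timestamp,
        "static_pe_is_dll": int(bool(characteristics & IMAGE_FILE_DLL)),
    }


def dynamic_features(report: dict, known_ttps: list) -> dict:
    """Turn a simplified sandbox report (hypothetical layout) into numeric features."""
    features = {
        "dynamic_reg_any_written_count": len(report.get("registry", {}).get("written", []))
    }
    # One 0/1 column per known technique; "T1070.004" becomes "T1070_004".
    seen = {ttp.replace(".", "_") for ttp in report.get("ttps", [])}
    for ttp in known_ttps:
        name = ttp.replace(".", "_")
        features[name] = int(name in seen)
    return features
```

Merging both dictionaries per sample (for example `{**static_pe_features(data), **dynamic_features(report, ttps)}`) yields one row of the training table.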

Part 3: Training and Evaluating the Models (15-20 mins)

3.1. Model Selection:

  • Brief introduction to the selected ensemble models, chosen for their high performance and interpretability:
    • Random Forest
    • XGBoost
    • LightGBM (LGBM)
    • CatBoost
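
To make the training step concrete, here is a minimal sketch for one of the listed models, assuming scikit-learn is installed. The data is synthetic and only stands in for the real feature table, and the hyperparameters are arbitrary choices for illustration, not the settings used in the talk.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix: label 1 = ransomware, 0 = other malware.
# weights=[0.8, 0.2] mimics a dataset where ransomware is the minority class.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5,
    weights=[0.8, 0.2], random_state=0,
)

# A stratified split keeps the class ratio the same in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# class_weight="balanced" compensates for the minority class during training.
model = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
importances = model.feature_importances_  # one value per feature, summing to 1
```

XGBoost, LightGBM and CatBoost each ship a scikit-learn style classifier (`XGBClassifier`, `LGBMClassifier`, `CatBoostClassifier`), so the same split and the same `fit`/`predict` calls can usually be reused when comparing models.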

3.2. Model Performance:

  • The Test: I have trained these models to perform binary classification (Ransomware vs. Other Malware).
  • Visualizing Success with Confusion Matrices: I will present and interpret the confusion matrices for each model.
  • Conclusion from Matrices: This validates the effectiveness of our feature set.
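
For readers new to confusion matrices, the sketch below shows how the four cells are counted for the ransomware vs. other malware task and how precision and recall follow from them. It uses only the standard library. The label strings and function names are assumptions for this example.

```python
def confusion_matrix_binary(y_true, y_pred, positive="ransomware"):
    """Count true/false positives and negatives, treating `positive` as the target class."""
    cells = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            cells["tp" if truth == positive else "fp"] += 1
        else:
            cells["fn" if truth == positive else "tn"] += 1
    return cells


def precision_recall(cells):
    """Precision: how many flagged samples were ransomware. Recall: how much ransomware was caught."""
    flagged = cells["tp"] + cells["fp"]
    actual = cells["tp"] + cells["fn"]
    precision = cells["tp"] / flagged if flagged else 0.0
    recall = cells["tp"] / actual if actual else 0.0
    return precision, recall
```

For triage, the false negative cell (ransomware labeled as other malware) is usually the most costly one, so recall on the ransomware class deserves as much attention as overall accuracy.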

3.3. Understanding the Feature Importance:

I will show the "Top 20 Features" graphs for each model to build trust in the models, and I will share my thoughts on the results.
For example, the Random Forest feature importance highlights TTPs like T1135 (Network Share Discovery) and T1053_005 (Scheduled Task), alongside dynamic features like dynamic_reg_any_written_count.
Key Insight: By comparing these graphs, it is possible to identify a consistent set of highly predictive features, demonstrating that the models' decisions are based on tangible malware characteristics.
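
The comparison across graphs can also be done programmatically. The sketch below assumes each model's importances are available as a feature name to score mapping, and returns the features that appear in the top k of at least `min_models` models (all of them by default). The function names and the `min_models` parameter are assumptions for this example.

```python
from collections import Counter


def top_k(importances: dict, k: int = 20) -> list:
    """Return the k feature names with the highest importance score."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def consensus_features(per_model: dict, k: int = 20, min_models=None) -> list:
    """Features that rank in the top k for at least `min_models` models."""
    counts = Counter()
    for importances in per_model.values():
        counts.update(top_k(importances, k))
    needed = len(per_model) if min_models is None else min_models
    return sorted(name for name, count in counts.items() if count >= needed)
```

Importance scores from different libraries are not on the same scale, so this sketch compares ranks within each model rather than the raw values.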

Part 4: Conclusion (5 mins)

4.2. Summary and Future Work:

Recap: I’ll demonstrate a complete, effective pipeline for ML-based malware classification.
Limitations: The model is only as good as its training data. Continuous retraining is needed to combat concept drift. Also, as malware evolves, it is hard to keep the analysis approach itself up to date. This is a rabbit hole.
Future Directions: Expanding the classification to a multi-class problem (e.g., identifying spyware, trojans, worms), and exploring more advanced deep learning models.

4.3. Q&A and Resource Sharing:
- Open the floor for questions.
- Provide a link to a public repository containing the presentation slides and the referenced Python scripts for attendees to explore further.

I'm Dusan, and I'm excited to be presenting at a conference for the first time! My professional journey led me to the role of a system engineer and people manager in the gaming industry. My passion for cybersecurity fuels my insatiable curiosity and lifelong commitment to learning.