BalCCon2k24

The Future of Threat Mitigation: AI in the Battle Against Security Vulnerabilities
2024-09-20, 18:45–19:30 (Europe/Belgrade), Tesla

Every day, hundreds of new data sources on security vulnerabilities (CVEs) appear on the web: articles, vulnerability databases, code repositories, forums, and chats, each containing only a handful of details. Security operators have to invest a lot of effort to answer the following triage questions (sketched as a structured record right after the list):
- Is the published information new or already known?
- What is the applicability? Does the attack target a specific consumer device (e.g. a printer) or a specific OS (e.g. Windows)? Is it a local or a remote attack?
- What details are provided in the description? Is it a “news-type” article providing essentially a headline, or a “blog-type” article providing technical details that can be used to reconstruct the attack and protect against it?
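
For concreteness, the outcome of this triage can be captured in a small structured record. The field names below are illustrative assumptions of mine, not something defined in the talk; a minimal sketch in Python:

    from dataclasses import dataclass
    from enum import Enum

    class DetailLevel(Enum):
        NEWS = "news-type"   # essentially a headline
        BLOG = "blog-type"   # technical details sufficient to reconstruct the attack

    @dataclass
    class TriageResult:
        """Hypothetical record of the answers an operator (or a model) gives per source."""
        source_url: str
        is_new: bool               # new information vs. an already-known description
        target: str                # e.g. "printer", "Windows", "web application"
        remote_attack: bool        # local vs. remote attack
        detail_level: DetailLevel  # is there enough detail to act on?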

After sorting out the above questions, and provided we have a new and informative description of the vulnerability, the security operator can finally work on protection measures. In the context of a Web Application Firewall (WAF), this means crafting a dedicated rule that detects and potentially blocks the malicious traffic without affecting the benign one.
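
To make the target artifact concrete, the sketch below shows what such a rule could look like in ModSecurity-style syntax; the rule id, the pattern, and the path-traversal scenario are purely illustrative assumptions (the talk does not fix a particular rule language here), and the rule is held in a Python string only to keep a single language across the examples:

    # Illustrative only: a ModSecurity-style rule that blocks a path-traversal
    # payload in request arguments while leaving ordinary requests untouched.
    EXAMPLE_WAF_RULE = r"""
    SecRule ARGS "@rx \.\./\.\./" \
        "id:900001,phase:2,deny,status:403,\
        msg:'Path traversal attempt (illustrative rule)',severity:CRITICAL"
    """

The hard part is exactly the trade-off stated above: the pattern must catch the malicious traffic while keeping false positives on benign traffic close to zero.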

In this talk we present a machine learning pipeline that uses state-of-the-art Large Language Models (LLMs) to automate the above tasks. This makes it possible to:
- Reduce time-to-mitigation
- Reduce human costs by saving time required from highly skilled individuals

Our pipeline consists of several building blocks:
- Text extraction (including image-to-text and video-to-text capabilities)
- Classification tasks:
-- Is the article informative?
-- Does the article describe a web attack?
- Generation tasks: given a detailed description of an attack, transform it into a WAF rule that conforms to a given syntax (see the pipeline sketch after this list)
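
As a rough illustration of how these blocks could be wired together (a sketch under my own assumptions, not the talk's actual implementation), the snippet below assumes a hypothetical call_llm() helper that sends a prompt to whatever LLM backend is in use and returns its text answer; the prompts and the ModSecurity default are placeholders:

    def call_llm(prompt: str) -> str:
        """Hypothetical wrapper around an LLM API; returns the model's text answer."""
        raise NotImplementedError("plug in your LLM backend here")

    def is_informative(article_text: str) -> bool:
        # Classification task 1: does the article carry actionable technical detail?
        answer = call_llm(
            "Answer YES or NO: does the following article contain enough technical "
            "detail to reconstruct the described attack?\n\n" + article_text
        )
        return answer.strip().upper().startswith("YES")

    def is_web_attack(article_text: str) -> bool:
        # Classification task 2: does the article describe a web attack at all?
        answer = call_llm(
            "Answer YES or NO: does the following article describe an attack against "
            "a web application?\n\n" + article_text
        )
        return answer.strip().upper().startswith("YES")

    def generate_waf_rule(article_text: str, rule_syntax: str = "ModSecurity") -> str:
        # Generation task: turn a detailed attack description into a WAF rule
        # that conforms to the requested syntax.
        return call_llm(
            f"Write a {rule_syntax} rule that detects the attack described below "
            "without blocking benign traffic. Output only the rule.\n\n" + article_text
        )

    def process_source(article_text: str) -> str | None:
        # Orchestration: text extraction is assumed to have happened upstream.
        if not is_informative(article_text) or not is_web_attack(article_text):
            return None  # "not enough information" or "out of scope"
        return generate_waf_rule(article_text)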

In this talk we describe the challenges of this exciting problem and show a stack of solutions that can be applied to a wide range of products on the market.


Here is the outline of the presentation:
- What is the problem? Central tasks are:
-- Main challenge: creating a WAF rule given the relevant information
-- “Not-enough-information” classification
-- “Out-of-scope” classification
- How is this problem solved currently?
-- Some components are automated (like scraping)
-- The main parts are still done manually by security experts
- Preliminary stages for the solution: building datasets
-- Scoping the problem and dividing it into tasks
-- Choosing clean data
-- Creating a dataset per task
- Using AI to solve the problem automatically, task by task
-- Scraping the web for new information
-- Extracting text from webpages, images and videos
-- Creating WAF rules using LLMs
-- Automatic rule verification (a minimal verification sketch follows this outline)
-- Automatic rule deployment in WAF security engines
- Wrapping the solution in an automatic service
-- Matching the human UX: making AI another “team member”
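
For the automatic rule verification step, one simple approach (my assumption, not necessarily what the pipeline actually does) is to replay the generated rule's detection pattern against known malicious payloads and a sample of benign traffic before deployment; the helper below and its payloads are made up for illustration:

    import re

    def verify_rule_pattern(pattern: str,
                            malicious_samples: list[str],
                            benign_samples: list[str]) -> bool:
        """Accept a candidate pattern only if it matches every known malicious
        payload and none of the benign traffic samples."""
        rx = re.compile(pattern)
        catches_attacks = all(rx.search(s) for s in malicious_samples)
        spares_benign = not any(rx.search(s) for s in benign_samples)
        return catches_attacks and spares_benign

    # Illustrative usage with made-up payloads:
    ok = verify_rule_pattern(
        r"\.\./\.\./",
        malicious_samples=["GET /download?file=../../../etc/passwd"],
        benign_samples=["GET /download?file=report.pdf"],
    )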

Data Science is about magic. Before the Generative AI revolution that began with the explosion of ChatGPT, only data scientists believed it; now everybody thinks so. I have been lucky to work in the industry for the past 20+ years, first as a developer and researcher, then as a manager, and more recently as a data scientist.
As a data scientist at Imperva, I apply magic to cybersecurity challenges. We use Large Language Models, logistic regression, clustering, and whatever else it takes to protect the good guys from the bad ones. The former happen to be our customers.
I hold an M.Sc. in Computer Science from the Weizmann Institute in Israel.