Building a Machine Learning Threat Detection System

Overview

In this project, I built a threat detection system that applies an Isolation Forest, an unsupervised anomaly detection model, to SSH authentication logs.

1. Data Preprocessing

Raw log example:

Dec 15 00:12:10 server sshd[12345]: Failed password for invalid user admin from 192.168.1.2 port 22
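
The fields in a line like this can be pulled out with a regular expression. The following is a minimal sketch, not the project's actual parser: the pattern name FAILED_LOGIN and the group names are hypothetical, and the pattern only covers "Failed password" lines shaped like the example above.

```python
import re

# Hypothetical pattern for sshd "Failed password" lines shaped like the example;
# other sshd message types would need their own patterns.
FAILED_LOGIN = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) sshd\[(?P<pid>\d+)\]: "
    r"Failed password for (?:invalid user )?(?P<user>\S+) "
    r"from (?P<ip>\d+\.\d+\.\d+\.\d+) port (?P<port>\d+)"
)

line = ("Dec 15 00:12:10 server sshd[12345]: Failed password for "
        "invalid user admin from 192.168.1.2 port 22")

# match() returns None for lines that do not fit the pattern
match = FAILED_LOGIN.match(line)
fields = match.groupdict() if match else None
print(fields)
```

All captured values are strings, so numeric fields such as the port need an explicit int() conversion before they are used as model features.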

Python preprocessing:

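import pandas as pd

# auth.log is plain text, not CSV, so load one log line per row into a "raw" column
with open("auth.log") as f:
    logs = pd.DataFrame({"raw": [line.rstrip("\n") for line in f]})

# Source IP of the connection attempt (NaN when the line has no "from <ip>")
logs["IP"] = logs["raw"].str.extract(r"from (\d+\.\d+\.\d+\.\d+)", expand=False)
# Any line that does not contain "Failed" is labeled "Success"
logs["Status"] = logs["raw"].str.contains("Failed").map({True: "Failed", False: "Success"})

print(logs.head())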
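
The model in the next section uses a numeric Timestamp column, which the preprocessing above does not create. One way to derive it is sketched below. The helper name syslog_to_epoch is hypothetical, and two assumptions are baked in because the log line itself carries neither piece of information: the year (2023 here) and the timezone (UTC here).

```python
from datetime import datetime, timezone

def syslog_to_epoch(line: str, year: int = 2023) -> int:
    """Convert the leading syslog timestamp of a log line to epoch seconds.

    Assumptions: the year is supplied by the caller (syslog omits it),
    and the timestamp is interpreted as UTC.
    """
    # split() tolerates the double space syslog uses for single-digit days
    month, day, clock = line.split()[:3]
    parsed = datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())

line = ("Dec 15 00:12:10 server sshd[12345]: Failed password for "
        "invalid user admin from 192.168.1.2 port 22")
print(syslog_to_epoch(line))
```

Under these assumptions the example line maps to 1702599130. The Results table lists 1702591930, exactly two hours earlier, which would be consistent with the log time being treated as UTC+2 in the original run. That is only a guess, so adjust the timezone handling to match your own logs. With pandas, the helper could be applied as logs["raw"].map(syslog_to_epoch).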

2. Isolation Forest Model

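from sklearn.ensemble import IsolationForest

# Assumes logs already has a numeric "Timestamp" column (epoch seconds, as in Results)
# .copy() avoids pandas' SettingWithCopyWarning when the IP column is overwritten
features = logs[["IP", "Timestamp"]].copy()
# Encode each distinct IP as an integer code so the model receives numeric input
features["IP"] = features["IP"].astype("category").cat.codes

# contamination=0.01 flags roughly 1% of rows; random_state makes runs repeatable
model = IsolationForest(contamination=0.01, random_state=42)
# fit_predict returns -1 for anomalies and 1 for normal rows
logs["Anomaly"] = model.fit_predict(features)

print(logs[logs["Anomaly"] == -1])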
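
To see how the contamination setting and the -1/1 labels behave, the model can be run on synthetic data. This is an illustrative sketch only: the two-column array X is a made-up stand-in, not the project's log features, and random_state is fixed just to make the run repeatable.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in for per-event features (hypothetical data):
# 990 points near the origin plus 10 points placed far away from them.
normal = rng.normal(loc=0.0, scale=1.0, size=(990, 2))
far_away = rng.uniform(low=8.0, high=10.0, size=(10, 2))
X = np.vstack([normal, far_away])

model = IsolationForest(contamination=0.01, random_state=0)
labels = model.fit_predict(X)        # -1 = anomaly, 1 = normal
scores = model.decision_function(X)  # negative scores correspond to label -1

n_flagged = int((labels == -1).sum())
print("flagged:", n_flagged, "of", len(X))
print("lowest score:", scores.min())
```

Because contamination sets the score threshold at roughly the 1st percentile of the training scores, about 1% of rows are flagged regardless of whether they are real threats. Sorting by decision_function gives a ranking that can be reviewed instead of relying on the fixed cutoff.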

Results

Timestamp	IP	Status	Anomaly
1702591930	192.168.1.2	Failed	-1

GitHub Repository

github.com/FrancescoCitti/ml_threat_detection