🚀 Overview

In this project, I built a threat detection system that applies machine learning (an Isolation Forest) to SSH authentication logs to flag anomalous login activity.


1. Data Preprocessing

Here's an example of raw logs and how I processed them:

Raw Logs:

Dec 15 00:12:10 server sshd[12345]: Failed password for invalid user admin from 192.168.1.2 port 22

Python Code for Preprocessing:

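import pandas as pd

# Load log file (auth.log is plain text, not CSV, so read one line per row)
with open("auth.log") as f:
    logs = pd.DataFrame({"raw": f.read().splitlines()})

# Extract IP and Status
# expand=False returns a Series; lines without an IP get NaN
logs["IP"] = logs["raw"].str.extract(r"from (\d+\.\d+\.\d+\.\d+)", expand=False)
# Any line that does not contain "Failed" is labeled "Success"
logs["Status"] = logs["raw"].str.contains("Failed").map({True: "Failed", False: "Success"})

print(logs.head())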
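
The model in the next section reads a numeric Timestamp column, which the preprocessing code does not create. Here is a minimal sketch of one way to derive it with the standard library. The helper name `syslog_to_epoch`, the year 2023, and the UTC+2 offset are assumptions for illustration: syslog lines carry neither a year nor a time zone, and these two values were picked only because they reproduce the Timestamp shown in the Results table.

```python
from datetime import datetime, timedelta, timezone


def syslog_to_epoch(line, year=2023, utc_offset_hours=2):
    """Convert the leading 'Mon DD HH:MM:SS' of a syslog line to epoch seconds.

    year and utc_offset_hours are assumed values: the log line has neither.
    """
    stamp = line[:15]  # e.g. "Dec 15 00:12:10"
    local = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    tz = timezone(timedelta(hours=utc_offset_hours))
    return int(local.replace(tzinfo=tz).timestamp())


line = ("Dec 15 00:12:10 server sshd[12345]: Failed password for "
        "invalid user admin from 192.168.1.2 port 22")
print(syslog_to_epoch(line))  # 1702591930
```

Under those assumptions, the column could be filled with logs["Timestamp"] = logs["raw"].map(syslog_to_epoch).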

2. Building the Isolation Forest Model

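from sklearn.ensemble import IsolationForest

# Feature engineering
# Assumes logs already has a numeric "Timestamp" column (e.g. epoch seconds)
# .copy() avoids modifying a view of logs (SettingWithCopyWarning)
features = logs[["IP", "Timestamp"]].copy()
features["IP"] = features["IP"].astype("category").cat.codes

# Train Isolation Forest
# contamination=0.01 flags roughly 1% of rows; random_state makes runs repeatable
model = IsolationForest(contamination=0.01, random_state=42)
logs["Anomaly"] = model.fit_predict(features)

# View anomalies (-1 = anomaly, 1 = normal)
print(logs[logs["Anomaly"] == -1])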
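
To show why a row ends up labeled -1, here is a toy sketch of the idea behind Isolation Forest, written in plain Python instead of scikit-learn: points that random splits isolate in only a few steps are treated as anomalies. The function names, the sample features, and the parameters are all made up for illustration, and this is not how scikit-learn implements it internally.

```python
import random


def path_length(point, data, rng, depth=0, max_depth=20):
    """Count the random splits needed to isolate `point` from `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    # Collect the features that still have more than one distinct value
    splittable = []
    for f in range(len(point)):
        values = [row[f] for row in data]
        if min(values) < max(values):
            splittable.append((f, min(values), max(values)))
    if not splittable:
        return depth
    # Pick a random feature and a random split value inside its range
    f, lo, hi = rng.choice(splittable)
    split = rng.uniform(lo, hi)
    # Keep only the rows on the same side of the split as the point
    same_side = [row for row in data
                 if (row[f] < split) == (point[f] < split)]
    return path_length(point, same_side, rng, depth + 1, max_depth)


def average_path_length(point, data, trees=200, seed=0):
    """Average the isolation depth over many random trees."""
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(trees)) / trees


# Hypothetical features: (IP code, seconds since the first event)
normal = [(i % 5, i * 10) for i in range(50)]
outlier = (40, 5000)
data = normal + [outlier]

inlier_score = average_path_length(normal[25], data)
outlier_score = average_path_length(outlier, data)
print(inlier_score, outlier_score)
```

The outlier needs far fewer splits on average than the typical point, and that shorter path is what the model turns into an anomaly label.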

Results

Timestamp    IP            Status   Anomaly
1702591930   192.168.1.2   Failed   -1

🔗 GitHub Repository

Check the full code here.