A PySpark-based big data pipeline that processes large HDFS logs, detects anomalous behavior, and generates operational risk insights with visual outputs.
Distributed systems generate huge volumes of logs that are difficult to inspect manually. This project automates:
- log parsing and cleaning,
- time-window feature engineering,
- anomaly detection with Isolation Forest,
- risk analysis for components and time periods.
The dataset included in this repo is large for local analysis:
- ~1.58 GB raw log file
- ~11.17 million log lines
Raw logs -> Parse -> Clean -> Aggregate features -> Train model -> Predict anomalies -> Risk analysis -> Plots
- Top components by log volume
- Failure risk by hour (LOW/MEDIUM/HIGH)
- Component risk classification
- Anomaly predictions and anomaly scores
- Saved visualizations in `outputs/`
├── main.py
├── src/
│ ├── components/
│ │ ├── data_ingestion.py
│ │ ├── data_preprocessing.py
│ │ ├── data_transformation.py
│ │ ├── feature_builder.py
│ │ ├── model_training.py
│ │ ├── analysis.py
│ │ └── pipeline.py
│ └── utils/
│ ├── logger.py
│ └── exception.py
├── data/
│ └── raw/HDFS.log
├── outputs/
├── logs/
├── requirements.txt
└── project_overview.md
- Starts Spark session
- Reads raw text logs using `spark.read.text`
- Regex-based parsing of date/time/process/log level/component/message
- Data cleaning, timestamp creation, deduplication
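
A minimal sketch of how these ingestion, parsing, and cleaning steps might look in PySpark. The regex and column names below follow the standard HDFS log layout and are assumptions for illustration, not necessarily the exact pattern used in `data_transformation.py`:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("hdfs-log-parsing").getOrCreate()

# Each raw line becomes a single 'value' column.
raw = spark.read.text("data/raw/HDFS.log")

# Assumed layout: "081109 203615 148 INFO dfs.DataNode$PacketResponder: <message>"
pattern = r"^(\d{6}) (\d{6}) (\d+) (\w+) ([\w\.\$]+): (.*)$"

parsed = raw.select(
    F.regexp_extract("value", pattern, 1).alias("date"),
    F.regexp_extract("value", pattern, 2).alias("time"),
    F.regexp_extract("value", pattern, 3).alias("pid"),
    F.regexp_extract("value", pattern, 4).alias("level"),
    F.regexp_extract("value", pattern, 5).alias("component"),
    F.regexp_extract("value", pattern, 6).alias("message"),
)

cleaned = (
    parsed
    .filter(F.col("level") != "")  # drop lines the regex did not match
    .withColumn(
        "timestamp",
        F.to_timestamp(F.concat_ws(" ", "date", "time"), "yyMMdd HHmmss"),
    )
    .dropDuplicates()
)
```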
- 1-minute window feature generation (`event_count`, `warn_ratio`, `event_delta`, etc.)
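
The windowed feature generation could be expressed along these lines, continuing from the `cleaned` DataFrame in the previous sketch. The feature names match the bullet above, but the aggregation details are an assumption:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Aggregate events into fixed 1-minute windows.
windowed = (
    cleaned
    .groupBy(F.window("timestamp", "1 minute").alias("w"))
    .agg(
        F.count("*").alias("event_count"),
        F.avg(F.when(F.col("level") == "WARN", 1).otherwise(0)).alias("warn_ratio"),
    )
    .withColumn("window_start", F.col("w.start"))
    .drop("w")
)

# event_delta: change in event volume relative to the previous window.
w = Window.orderBy("window_start")
features = windowed.withColumn(
    "event_delta",
    F.col("event_count") - F.lag("event_count", 1).over(w),
)
```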
- Time-based train/test split
- Feature scaling with `StandardScaler`
- Unsupervised anomaly detection using `IsolationForest`
- Rule-assisted anomaly flagging (`warn_ratio` threshold)
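
Because Spark ML has no built-in Isolation Forest, a common approach (and presumably what `model_training.py` does, given the class names above) is to collect the windowed features into pandas and model them with scikit-learn. The split ratio, `contamination` value, and `warn_ratio` threshold below are illustrative assumptions:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest

# Assume `features` (previous sketch) is small enough to collect to pandas.
pdf = features.toPandas().sort_values("window_start").fillna(0)

feature_cols = ["event_count", "warn_ratio", "event_delta"]
split = int(len(pdf) * 0.7)                 # time-based split, not random
train, test = pdf.iloc[:split], pdf.iloc[split:].copy()

scaler = StandardScaler()
X_train = scaler.fit_transform(train[feature_cols])
X_test = scaler.transform(test[feature_cols])

model = IsolationForest(contamination=0.1, random_state=42)
model.fit(X_train)

test["anomaly_score"] = model.decision_function(X_test)  # lower = more anomalous
test["is_anomaly"] = model.predict(X_test) == -1          # model-based flag

# Rule-assisted flag: a very high warning ratio is anomalous regardless of the model.
WARN_RATIO_THRESHOLD = 0.5                                 # illustrative value
test["is_anomaly"] |= test["warn_ratio"] > WARN_RATIO_THRESHOLD
```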
- Aggregated risk tables
- Visual charts saved to disk
- Console summaries for quick inspection
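
The analysis stage described above likely amounts to grouping the scored windows, bucketing counts into LOW/MEDIUM/HIGH, and saving charts to `outputs/`. The cut-offs and chart below are assumptions for illustration, not the repo's exact logic:

```python
import os
import matplotlib
matplotlib.use("Agg")            # headless rendering so charts can be saved to disk
import matplotlib.pyplot as plt
import pandas as pd

os.makedirs("outputs", exist_ok=True)

# Continuing from the scored `test` frame: anomalous windows per hour of day.
test["hour"] = pd.to_datetime(test["window_start"]).dt.hour
hourly = test.groupby("hour")["is_anomaly"].sum().reset_index(name="anomalies")

# Bucket hours into risk levels -- thresholds are illustrative.
hourly["risk"] = pd.cut(
    hourly["anomalies"], bins=[-1, 1, 5, float("inf")],
    labels=["LOW", "MEDIUM", "HIGH"],
)

plt.figure(figsize=(10, 4))
plt.bar(hourly["hour"], hourly["anomalies"])
plt.xlabel("Hour of day")
plt.ylabel("Anomalous windows")
plt.title("Failure risk by hour")
plt.tight_layout()
plt.savefig("outputs/failure_trend.png")
plt.close()
```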
- Pipeline stage logs in `logs/`
- Centralized custom exception wrapper
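
A centralized exception wrapper of the kind `src/utils/exception.py` suggests typically records where a stage failed before re-raising. This is a generic sketch of that pattern, not the repo's exact implementation:

```python
import sys

class PipelineException(Exception):
    """Wrap a stage failure with the file and line where it occurred."""

    def __init__(self, message: str):
        _, _, tb = sys.exc_info()
        if tb is not None:
            detail = f"{tb.tb_frame.f_code.co_filename}:{tb.tb_lineno}"
            message = f"{message} (raised at {detail})"
        super().__init__(message)


# Usage inside a pipeline stage:
try:
    raise ValueError("bad record")
except Exception as exc:
    raise PipelineException(f"feature building failed: {exc}") from exc
```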
- Python 3.10+
- Java (required by Spark)
- Enough memory for large log processing
Install dependencies and run the pipeline:

`pip install -r requirements.txt`

`python main.py`

After execution, the project writes:
- Visuals in `outputs/`: `anomaly_event_count.png`, `anomaly_score_trend.png`, `feature_contribution.png`, `top_components.png`, `hourly_distribution.png`, `warning_components.png`, `failure_trend.png`
- Logs in `logs/`
From the current implementation and dataset:
- Feature windows: 2322
- Train/Test split: 1625 / 697
- Detected anomalies: 70 (10.04% of test windows)
- Full local runtime: ~18 minutes (environment-dependent)
- PySpark over pandas-only for scalable processing.
- Batch pipeline over streaming for simplicity and reproducibility.
- Isolation Forest because labeled anomaly data is not required.
- 1-minute windows for finer anomaly granularity.
- Model + rule hybrid for practical anomaly coverage.
- Batch-only (no real-time streaming).
- No scheduler/orchestrator like Airflow.
- Thresholds are static and may need tuning by environment.
- Spark performance can degrade at higher scale without tuning.
- Security hardening (encryption/IAM/auditing) is not production-complete in this repo.
- Add streaming ingestion (Kafka + Structured Streaming).
- Add orchestration and retries.
- Add model drift and adaptive thresholding.
- Add API/dashboard serving layer.
- Improve cost and performance tuning for larger workloads.
- Deep technical walkthrough: `project_overview.md`
If you want to productionize this pipeline, start with scheduler integration, Spark resource tuning, and security controls.
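
As a starting point for the resource-tuning item, these are the kinds of Spark settings worth revisiting first. The values are placeholders to tune per environment, not recommendations:

```python
from pyspark.sql import SparkSession

# Illustrative knobs only -- sensible values depend on data volume and hardware.
spark = (
    SparkSession.builder
    .appName("hdfs-log-anomaly")
    .config("spark.sql.shuffle.partitions", "200")  # shuffle parallelism for groupBy/window aggregations
    .config("spark.sql.adaptive.enabled", "true")   # adaptive query execution
    .config("spark.driver.memory", "8g")            # only takes effect if set before the JVM starts
    .getOrCreate()
)
```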