aksingh4545/streamlit_s3_pipeline

📄 Resume NER Data Pipeline (End-to-End)

An automated, production-style data engineering pipeline that extracts structured information from resumes and makes it analytics-ready.


📌 Overview

This project is an end-to-end resume processing pipeline that automatically extracts key information from resumes and makes it available for analytics and reporting.

Resume NER Pipeline UI

It covers the complete data journey:

Resume → NER → MySQL → CSV → S3 → Snowflake → Power BI

The system supports real-world resumes (PDF, DOCX, TXT), handles noisy formats, and follows industry-grade data engineering practices.


βš™οΈ What This Project Does

  • User uploads a resume via a Streamlit UI
  • Text is extracted from PDF / DOCX / TXT files
  • NER (Named Entity Recognition) extracts:
    • Name
    • Email
    • Mobile Number
    • Date of Birth
    • Gender
  • Extracted data is inserted into MySQL
  • The inserted record is converted into a CSV file
  • CSV is uploaded to Amazon S3
  • Snowpipe automatically ingests data into Snowflake
  • Power BI connects to Snowflake for dashboards

All steps are fully automated after resume upload.
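The text-extraction step above can be sketched as a small dispatcher that routes each upload to the right parser. This is an illustrative sketch (the function name and structure are assumptions, not the repo's actual code), using pdfplumber and python-docx as listed in the tech stack:

```python
from pathlib import Path

def extract_text(path: str) -> str:
    """Route a resume file to the right extractor based on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        import pdfplumber  # declared in requirements.txt
        with pdfplumber.open(path) as pdf:
            # extract_text() returns None for image-only pages
            return "\n".join(page.extract_text() or "" for page in pdf.pages)
    if suffix == ".docx":
        import docx  # python-docx
        return "\n".join(p.text for p in docx.Document(path).paragraphs)
    if suffix == ".txt":
        return Path(path).read_text(encoding="utf-8", errors="ignore")
    raise ValueError(f"Unsupported file type: {suffix}")
```

Deferring the pdfplumber/docx imports into their branches keeps the TXT path dependency-free.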



🧰 Tech Stack Used

Frontend

  • Streamlit

Backend / Processing

  • Python
  • spaCy (NER)
  • Regex
  • pdfplumber
  • python-docx

Databases

  • MySQL (managed via MySQL Workbench)
  • Snowflake

Cloud & Data Engineering

  • Amazon S3
  • Snowpipe (Auto ingestion)
  • AWS SQS (Event notifications)

Analytics

  • Power BI

πŸ“ Project Structure

resume_pipeline/
├── screenshots/               # Pipeline screenshots
├── streamlit_app.py           # UI for resume upload
├── resume_ner_to_mysql.py     # Core pipeline logic
├── .env                       # Environment variables
├── .env.example               # Sample env file
├── requirements.txt           # Python dependencies
├── README.md                  # Project documentation
│
└── venv/                      # Virtual environment

πŸ” Extracted Fields

  • Name (handles ALL CAPS, initials, headers)
  • Email
  • Mobile Number
  • Date of Birth
    Supports formats like: 19/11/2004, 19 Nov 2004, Date of Birth: 01 Jan 2006
  • Gender
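Fields like email, mobile number, and date of birth lend themselves to regex extraction alongside spaCy. The patterns below are illustrative only (the repo's actual regexes may differ; the mobile pattern assumes Indian numbers):

```python
import re

EMAIL_RE  = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
MOBILE_RE = re.compile(r"(?:\+91[-\s]?)?[6-9]\d{9}")   # assumes Indian mobiles
DOB_RE    = re.compile(
    r"\b(\d{1,2}[/-]\d{1,2}[/-]\d{2,4}"                # 19/11/2004
    r"|\d{1,2}\s+[A-Za-z]{3,9}\s+\d{4})\b"             # 19 Nov 2004
)

def extract_fields(text: str) -> dict:
    """Pull email, mobile, and DOB out of raw resume text."""
    email, mobile, dob = (p.search(text) for p in (EMAIL_RE, MOBILE_RE, DOB_RE))
    return {
        "email":  email.group() if email else None,
        "mobile": mobile.group() if mobile else None,
        "dob":    dob.group() if dob else None,
    }
```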

πŸ—„οΈ MySQL Table Structure

CREATE TABLE resume_entities (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255),
    email VARCHAR(255) UNIQUE,
    mobile VARCHAR(50),
    dob VARCHAR(50),
    gender VARCHAR(20)
);
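Because email carries a UNIQUE key, re-uploading the same resume can update the existing row with INSERT ... ON DUPLICATE KEY UPDATE. A hypothetical sketch (the repo's actual insert logic may differ; connection setup via mysql-connector-python is elided):

```python
UPSERT_SQL = """
INSERT INTO resume_entities (name, email, mobile, dob, gender)
VALUES (%s, %s, %s, %s, %s)
ON DUPLICATE KEY UPDATE
    name   = VALUES(name),
    mobile = VALUES(mobile),
    dob    = VALUES(dob),
    gender = VALUES(gender)
"""

def save_record(conn, record):
    """Insert a resume record; the UNIQUE email key turns re-uploads into updates."""
    params = (record["name"], record["email"], record["mobile"],
              record["dob"], record["gender"])
    cur = conn.cursor()
    try:
        cur.execute(UPSERT_SQL, params)
        conn.commit()
    finally:
        cur.close()  # safe cursor handling, as noted under error handling
```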

❄️ Snowflake Table Structure

CREATE TABLE RESUME_ENTITIES (
    ID INTEGER AUTOINCREMENT,
    NAME STRING,
    EMAIL STRING,
    MOBILE STRING,
    DOB STRING,
    GENDER STRING
);

🚀 Snowflake Ingestion (Snowpipe)

  • CSV files uploaded to:
s3://resume-input-pdfs/processed/
  • Snowpipe listens for .csv files
  • S3 event notifications trigger ingestion automatically
  • No manual COPY commands required

πŸ” Environment Variables (.env)

# MySQL
MYSQL_HOST=your-rds-endpoint
MYSQL_USER=your-user
MYSQL_PASSWORD=your-password
MYSQL_DB=resume_db

# AWS
AWS_ACCESS_KEY_ID=xxxx
AWS_SECRET_ACCESS_KEY=xxxx
AWS_DEFAULT_REGION=ap-south-1
S3_BUCKET_NAME=resume-input-pdfs

▶️ How to Run the Project

1️⃣ Create Virtual Environment

python -m venv venv
source venv/bin/activate
# Windows: venv\Scripts\activate

2️⃣ Install Dependencies

pip install -r requirements.txt
python -m spacy download en_core_web_sm

3️⃣ Run Streamlit App

streamlit run streamlit_app.py

4️⃣ Upload Resume

  • Upload PDF / DOCX / TXT
  • Pipeline runs automatically
  • Data appears in:
    • MySQL
    • Snowflake (via S3 + Snowpipe)
    • Power BI dashboard (after refresh)

πŸ›‘οΈ Error Handling & Edge Cases

  • Scanned PDFs detected and rejected gracefully
  • Duplicate resumes (same email) update existing records
  • Safe MySQL cursor handling
  • Snowpipe ignores old files automatically
  • Clean separation of UI and backend logic
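One plausible way to detect scanned (image-only) PDFs — an assumption about the repo's approach, not confirmed by it — is to check whether any meaningful text survives extraction:

```python
from typing import Optional

def looks_scanned(extracted_text: Optional[str], min_chars: int = 50) -> bool:
    """Heuristic: an image-only (scanned) PDF yields little or no extractable text."""
    if not extracted_text:
        return True
    return len(extracted_text.strip()) < min_chars
```

The `min_chars` threshold is a tunable guess; any resume with real content comfortably exceeds it.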

💡 Why This Project Is Valuable

  • Demonstrates real-world data engineering workflows
  • Covers ingestion, transformation, storage, and analytics
  • Implements event-driven architecture
  • Scalable and production-ready
  • Reflects real HR / ATS resume processing systems

🧭 End-to-End Workflow (With Screenshots)

This section shows how a resume moves through the system, from upload to analytics, using real screenshots, AWS configuration, and Snowflake SQL.

0️⃣ Resume Upload & NER Extraction

The user uploads a resume using the Streamlit UI. The backend extracts text and applies spaCy + Regex to identify structured entities.


Streamlit UI – Resume Upload & NER Output

1️⃣ CSV Upload to Amazon S3

After successful extraction and insertion into MySQL, the record is converted into a CSV file and uploaded to the S3 bucket:

s3://resume-input-pdfs/processed/


S3 Bucket – Processed CSV Files

import boto3

s3 = boto3.client("s3")

s3.upload_file(
    Filename="resume_data.csv",
    Bucket="resume-input-pdfs",
    Key="processed/resume_data.csv"
)

2️⃣ Snowflake Pipe & Notification Channel

Snowpipe is configured to automatically ingest CSV files from S3. The pipe is linked to an Amazon SQS notification channel.


Snowflake Notification Channel

-- For S3 auto-ingest, Snowpipe provisions its own SQS queue.
-- Its ARN appears as notification_channel in the pipe description,
-- and is what the S3 bucket's event notifications must target:
DESC PIPE resume_pipe;

3️⃣ Configure S3 Event Notifications

The S3 bucket is configured to notify Snowflake via SQS whenever a new CSV file is uploaded to the processed/ folder.


S3 Event Notification Configuration

4️⃣ Attach SQS Queue to S3 Bucket

The S3 bucket sends event notifications to the SQS queue used by Snowflake Snowpipe.


S3 → SQS Integration

Event type      : PUT
Prefix filter   : processed/
Destination     : SQS Queue
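Equivalently, this S3 → SQS wiring can be expressed as a bucket notification configuration and applied with boto3's `put_bucket_notification_configuration`. A sketch (the queue ARN is a placeholder, as in the SQL above):

```python
# Event type PUT, prefix processed/, destination SQS — as configured above.
notification_config = {
    "QueueConfigurations": [
        {
            "QueueArn": "arn:aws:sqs:ap-south-1:xxxx:resume-sqs-queue",  # placeholder
            "Events": ["s3:ObjectCreated:Put"],
            "Filter": {
                "Key": {
                    "FilterRules": [
                        {"Name": "prefix", "Value": "processed/"},
                        {"Name": "suffix", "Value": ".csv"},
                    ]
                }
            },
        }
    ]
}

# Applying it requires AWS credentials:
# import boto3
# boto3.client("s3").put_bucket_notification_configuration(
#     Bucket="resume-input-pdfs",
#     NotificationConfiguration=notification_config,
# )
```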

5️⃣ Snowpipe Auto-Ingest into Snowflake

Once the CSV arrives in S3:

  • S3 sends event to SQS
  • Snowpipe detects the event
  • CSV is automatically loaded into Snowflake

No manual COPY commands are required.

CREATE OR REPLACE PIPE resume_pipe
AUTO_INGEST = TRUE
AS
COPY INTO RESUME_ENTITIES
FROM @resume_stage
FILE_FORMAT = (TYPE = 'CSV' FIELD_OPTIONALLY_ENCLOSED_BY = '"')
ON_ERROR = 'CONTINUE';

6️⃣ Power BI Analytics

Power BI connects directly to Snowflake and visualizes the ingested resume data for reporting and insights.

This completes the fully automated, event-driven data pipeline.

Streamlit UI
   ↓
NER Extraction
   ↓
MySQL
   ↓
CSV
   ↓
S3 (processed/)
   ↓
SQS
   ↓
Snowpipe
   ↓
Snowflake
   ↓
Power BI
