An automated, production-style data engineering pipeline that extracts structured information from resumes and makes it analytics-ready.
This project is an end-to-end resume processing pipeline that automatically extracts key information from resumes and makes it available for analytics and reporting.
It covers the complete data journey:
Resume → NER → MySQL (Workbench) → CSV → S3 → Snowflake → Power BI
The system supports real-world resumes (PDF, DOCX, TXT), handles noisy formats, and follows industry-grade data engineering practices.
- User uploads a resume via a Streamlit UI
- Text is extracted from PDF / DOCX / TXT files
- NER (Named Entity Recognition) extracts:
- Name
- Mobile Number
- Date of Birth
- Gender
- Extracted data is inserted into MySQL
- The inserted record is converted into a CSV file
- CSV is uploaded to Amazon S3
- Snowpipe automatically ingests data into Snowflake
- Power BI connects to Snowflake for dashboards
All steps are fully automated after resume upload.
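The text-extraction step above can be sketched as a small dispatcher on file extension (the function name and lazy imports are illustrative, not the project's exact code):

```python
from pathlib import Path

def extract_text(path: str) -> str:
    """Extract raw text from a PDF, DOCX, or TXT resume.

    Dependencies are imported lazily so each format only requires
    its own library to be installed.
    """
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        import pdfplumber  # pip install pdfplumber
        with pdfplumber.open(path) as pdf:
            return "\n".join(page.extract_text() or "" for page in pdf.pages)
    if suffix == ".docx":
        import docx  # pip install python-docx
        return "\n".join(p.text for p in docx.Document(path).paragraphs)
    if suffix == ".txt":
        return Path(path).read_text(encoding="utf-8", errors="ignore")
    raise ValueError(f"Unsupported file type: {suffix}")
```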
- Streamlit
- Python
- spaCy (NER)
- Regex
- pdfplumber
- python-docx
- MySQL (workbench)
- Snowflake
- Amazon S3
- Snowpipe (Auto ingestion)
- AWS SQS (Event notifications)
- Power BI
resume_pipeline/
├── screenshots/            # Pipeline screenshots
├── streamlit_app.py        # UI for resume upload
├── resume_ner_to_mysql.py  # Core pipeline logic
├── .env                    # Environment variables
├── .env.example            # Sample env file
├── requirements.txt        # Python dependencies
├── README.md               # Project documentation
└── venv/                   # Virtual environment
- Name (handles ALL CAPS, initials, headers)
- Mobile Number
- Date of Birth
  Supports formats like: 19/11/2004, 19 Nov 2004, Date of Birth: 01 Jan 2006
- Gender
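The date-of-birth and mobile patterns above can be approximated with regexes like these (hypothetical patterns; the project's actual spaCy + Regex logic may differ):

```python
import re

# Covers the formats listed above: 19/11/2004 and 19 Nov 2004.
DOB_PATTERNS = [
    r"\b(\d{1,2}[/-]\d{1,2}[/-]\d{4})\b",
    r"\b(\d{1,2}\s+(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\s+\d{4})\b",
]
# Optional +91 country code followed by a 10-digit number.
MOBILE_PATTERN = r"(?:\+91[\s-]?)?\b\d{10}\b"

def extract_dob(text):
    """Return the first date-of-birth-like match, or None."""
    for pat in DOB_PATTERNS:
        m = re.search(pat, text, flags=re.IGNORECASE)
        if m:
            return m.group(1)
    return None

def extract_mobile(text):
    """Return the first mobile-number-like match, or None."""
    m = re.search(MOBILE_PATTERN, text)
    return m.group(0) if m else None
```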
-- MySQL table
CREATE TABLE resume_entities (
id INT AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(255),
email VARCHAR(255) UNIQUE,
mobile VARCHAR(50),
dob VARCHAR(50),
gender VARCHAR(20)
);
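Because `email` is UNIQUE, re-uploading the same resume can update the existing row instead of creating a duplicate. A minimal sketch using MySQL's upsert syntax (the `insert_resume` helper is illustrative; it accepts any DB-API-style connection):

```python
UPSERT_SQL = """
INSERT INTO resume_entities (name, email, mobile, dob, gender)
VALUES (%s, %s, %s, %s, %s)
ON DUPLICATE KEY UPDATE
    name = VALUES(name), mobile = VALUES(mobile),
    dob = VALUES(dob), gender = VALUES(gender)
"""

def insert_resume(conn, record):
    """Insert one extracted record; the UNIQUE(email) constraint
    turns a re-upload of the same resume into an update."""
    cur = conn.cursor()
    try:
        cur.execute(UPSERT_SQL, (record["name"], record["email"],
                                 record["mobile"], record["dob"],
                                 record["gender"]))
        conn.commit()
    finally:
        cur.close()  # cursor is closed even if execute() fails
```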
-- Snowflake table
CREATE TABLE RESUME_ENTITIES (
ID INTEGER AUTOINCREMENT,
NAME STRING,
EMAIL STRING,
MOBILE STRING,
DOB STRING,
GENDER STRING
);
- CSV files uploaded to: s3://resume-input-pdfs/processed/
- Snowpipe listens for .csv files
- S3 event notifications trigger ingestion automatically
- No manual COPY commands required
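The CSV step that feeds this bucket can be sketched as follows, writing one row per resume in the same column order as the table definitions above (the helper name is hypothetical; the real logic lives in resume_ner_to_mysql.py):

```python
import csv

def record_to_csv(record, out_path):
    """Write one extracted record as a single-row CSV whose column
    order matches the RESUME_ENTITIES table that Snowpipe loads."""
    fields = ["name", "email", "mobile", "dob", "gender"]
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        # Quote every field: names and addresses may contain commas.
        writer = csv.writer(f, quoting=csv.QUOTE_ALL)
        writer.writerow([record.get(k, "") for k in fields])
    return out_path
```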
# MySQL
MYSQL_HOST=your-rds-endpoint
MYSQL_USER=your-user
MYSQL_PASSWORD=your-password
MYSQL_DB=resume_db

# AWS
AWS_ACCESS_KEY_ID=xxxx
AWS_SECRET_ACCESS_KEY=xxxx
AWS_DEFAULT_REGION=ap-south-1
S3_BUCKET_NAME=resume-input-pdfs
python -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate

pip install -r requirements.txt
python -m spacy download en_core_web_sm
streamlit run streamlit_app.py
- Upload PDF / DOCX / TXT
- Pipeline runs automatically
- Data appears in:
- MySQL
- Snowflake (via S3 + Snowpipe)
- Power BI dashboard (after refresh)
- Scanned PDFs detected and rejected gracefully
- Duplicate resumes (same email) update existing records
- Safe MySQL cursor handling
- Snowpipe ignores old files automatically
- Clean separation of UI and backend logic
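The scanned-PDF check above can be as simple as a text-length heuristic applied after extraction: an image-only PDF yields almost no extractable text. The helper name and threshold here are assumptions, not the project's exact rule:

```python
def looks_scanned(extracted_text, min_chars=20):
    """Heuristic: treat a resume as a scanned (image-only) PDF when
    extraction produced fewer than min_chars of real text, so the
    UI can reject it gracefully instead of inserting empty fields."""
    return len(extracted_text.strip()) < min_chars
```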
- Demonstrates real-world data engineering workflows
- Covers ingestion, transformation, storage, and analytics
- Implements event-driven architecture
- Scalable, production-style design
- Reflects real HR / ATS resume processing systems
This section visually explains how a resume moves through the system, from upload to analytics, using real screenshots, AWS configuration, and Snowflake SQL.
The user uploads a resume using the Streamlit UI. The backend extracts text and applies spaCy + Regex to identify structured entities.
Streamlit UI: Resume Upload & NER Output
After successful extraction and insertion into MySQL, the record is converted into a CSV file and uploaded to the S3 bucket:
s3://resume-input-pdfs/processed/
S3 Bucket: Processed CSV Files
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="resume_data.csv",
    Bucket="resume-input-pdfs",
    Key="processed/resume_data.csv",
)
Snowpipe is configured to automatically ingest CSV files from S3. The pipe is linked to an Amazon SQS notification channel.
Snowflake Notification Channel
CREATE OR REPLACE NOTIFICATION INTEGRATION resume_s3_notification
  TYPE = QUEUE
  NOTIFICATION_PROVIDER = AWS_SQS
  ENABLED = TRUE
  AWS_SQS_ARN = 'arn:aws:sqs:ap-south-1:xxxx:resume-sqs-queue';
DESC PIPE resume_pipe;
The S3 bucket is configured to notify Snowflake via SQS whenever
a new CSV file is uploaded to the processed/ folder.
S3 Event Notification Configuration
The S3 bucket sends event notifications to the SQS queue used by Snowflake Snowpipe.
- Event type: PUT
- Prefix filter: processed/
- Destination: SQS Queue
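The same bucket notification can also be applied programmatically with boto3. In this sketch the queue ARN is a placeholder (use the channel shown by DESC PIPE), and the .csv suffix filter is an assumption based on the Snowpipe setup described earlier:

```python
# Placeholder ARN: substitute the notification channel from DESC PIPE.
NOTIFICATION_CONFIG = {
    "QueueConfigurations": [{
        "QueueArn": "arn:aws:sqs:ap-south-1:xxxx:resume-sqs-queue",
        "Events": ["s3:ObjectCreated:Put"],
        "Filter": {"Key": {"FilterRules": [
            {"Name": "prefix", "Value": "processed/"},
            {"Name": "suffix", "Value": ".csv"},   # assumed filter
        ]}},
    }]
}

def apply_notification(bucket="resume-input-pdfs"):
    """Point S3 PUT events under processed/ at the Snowpipe SQS queue."""
    import boto3  # imported lazily; needs AWS credentials configured
    boto3.client("s3").put_bucket_notification_configuration(
        Bucket=bucket, NotificationConfiguration=NOTIFICATION_CONFIG)
```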
Once the CSV arrives in S3:
- S3 sends event to SQS
- Snowpipe detects the event
- CSV is automatically loaded into Snowflake
No manual COPY commands are required.
CREATE OR REPLACE PIPE resume_pipe
  AUTO_INGEST = TRUE
  INTEGRATION = resume_s3_notification
AS
  COPY INTO RESUME_ENTITIES
  FROM @resume_stage
  FILE_FORMAT = (TYPE = 'CSV' FIELD_OPTIONALLY_ENCLOSED_BY = '"')
  ON_ERROR = 'CONTINUE';
Power BI connects directly to Snowflake and visualizes the ingested resume data for reporting and insights.
This completes the fully automated, event-driven data pipeline.
Streamlit UI → NER Extraction → MySQL → CSV → S3 (processed/) → SQS → Snowpipe → Snowflake → Power BI
