aksingh4545/streamlit_s3_pipeline

📄 Resume NER Data Pipeline (End-to-End)

An automated, production-style data engineering pipeline that extracts structured information from resumes and makes it analytics-ready.


📌 Overview

This project is an end-to-end resume processing pipeline that automatically extracts key information from resumes and makes it available for analytics and reporting.

Resume NER Pipeline UI

It covers the complete data journey:

Resume → NER → MySQL → CSV → S3 → Snowflake → Power BI

The system supports real-world resumes (PDF, DOCX, TXT), handles noisy formats, and follows industry-grade data engineering practices.


βš™οΈ What This Project Does

  • User uploads a resume via a Streamlit UI
  • Text is extracted from PDF / DOCX / TXT files
  • NER (Named Entity Recognition) extracts:
    • Name
    • Email
    • Mobile Number
    • Date of Birth
    • Gender
  • Extracted data is inserted into MySQL
  • The inserted record is converted into a CSV file
  • CSV is uploaded to Amazon S3
  • Snowpipe automatically ingests data into Snowflake
  • Power BI connects to Snowflake for dashboards

All steps are fully automated after resume upload.
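The text-extraction step above can be sketched as a small dispatcher that routes each upload to the right parser. This is an illustrative sketch (the function name and structure are assumptions, not the repo's actual code), using pdfplumber and python-docx as listed in the tech stack:

```python
from pathlib import Path

def extract_text(path: str) -> str:
    """Route a resume file to the right extractor based on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        import pdfplumber  # declared in requirements.txt
        with pdfplumber.open(path) as pdf:
            # extract_text() returns None for image-only pages
            return "\n".join(page.extract_text() or "" for page in pdf.pages)
    if suffix == ".docx":
        import docx  # python-docx
        return "\n".join(p.text for p in docx.Document(path).paragraphs)
    if suffix == ".txt":
        return Path(path).read_text(encoding="utf-8", errors="ignore")
    raise ValueError(f"Unsupported file type: {suffix}")
```

Deferring the pdfplumber/docx imports into their branches keeps the TXT path dependency-free.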



🧰 Tech Stack Used

Frontend

  • Streamlit

Backend / Processing

  • Python
  • spaCy (NER)
  • Regex
  • pdfplumber
  • python-docx

Databases

  • MySQL (managed via MySQL Workbench)
  • Snowflake

Cloud & Data Engineering

  • Amazon S3
  • Snowpipe (Auto ingestion)
  • AWS SQS (Event notifications)

Analytics

  • Power BI

πŸ“ Project Structure

resume_pipeline/
├── screenshots/               # Pipeline screenshots
├── streamlit_app.py           # UI for resume upload
├── resume_ner_to_mysql.py     # Core pipeline logic
├── .env                       # Environment variables
├── .env.example               # Sample env file
├── requirements.txt           # Python dependencies
├── README.md                  # Project documentation
│
└── venv/                      # Virtual environment

πŸ” Extracted Fields

  • Name (handles ALL CAPS, initials, headers)
  • Email
  • Mobile Number
  • Date of Birth
    Supports formats like: 19/11/2004, 19 Nov 2004, Date of Birth: 01 Jan 2006
  • Gender
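Fields like email, mobile number, and date of birth lend themselves to regex extraction alongside spaCy. The patterns below are illustrative only (the repo's actual regexes may differ; the mobile pattern assumes Indian numbers):

```python
import re

EMAIL_RE  = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
MOBILE_RE = re.compile(r"(?:\+91[-\s]?)?[6-9]\d{9}")   # assumes Indian mobiles
DOB_RE    = re.compile(
    r"\b(\d{1,2}[/-]\d{1,2}[/-]\d{2,4}"                # 19/11/2004
    r"|\d{1,2}\s+[A-Za-z]{3,9}\s+\d{4})\b"             # 19 Nov 2004
)

def extract_fields(text: str) -> dict:
    """Pull email, mobile, and DOB out of raw resume text."""
    email, mobile, dob = (p.search(text) for p in (EMAIL_RE, MOBILE_RE, DOB_RE))
    return {
        "email":  email.group() if email else None,
        "mobile": mobile.group() if mobile else None,
        "dob":    dob.group() if dob else None,
    }
```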

πŸ—„οΈ MySQL Table Structure

CREATE TABLE resume_entities (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255),
    email VARCHAR(255) UNIQUE,
    mobile VARCHAR(50),
    dob VARCHAR(50),
    gender VARCHAR(20)
);
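Because email carries a UNIQUE key, re-uploading the same resume can update the existing row with INSERT ... ON DUPLICATE KEY UPDATE. A hypothetical sketch (the repo's actual insert logic may differ; connection setup via mysql-connector-python is elided):

```python
UPSERT_SQL = """
INSERT INTO resume_entities (name, email, mobile, dob, gender)
VALUES (%s, %s, %s, %s, %s)
ON DUPLICATE KEY UPDATE
    name   = VALUES(name),
    mobile = VALUES(mobile),
    dob    = VALUES(dob),
    gender = VALUES(gender)
"""

def save_record(conn, record):
    """Insert a resume record; the UNIQUE email key turns re-uploads into updates."""
    params = (record["name"], record["email"], record["mobile"],
              record["dob"], record["gender"])
    cur = conn.cursor()
    try:
        cur.execute(UPSERT_SQL, params)
        conn.commit()
    finally:
        cur.close()  # safe cursor handling, as noted under error handling
```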

❄️ Snowflake Table Structure

CREATE TABLE RESUME_ENTITIES (
    ID INTEGER AUTOINCREMENT,
    NAME STRING,
    EMAIL STRING,
    MOBILE STRING,
    DOB STRING,
    GENDER STRING
);

🚀 Snowflake Ingestion (Snowpipe)

  • CSV files uploaded to:
s3://resume-input-pdfs/processed/
  • Snowpipe listens for .csv files
  • S3 event notifications trigger ingestion automatically
  • No manual COPY commands required

πŸ” Environment Variables (.env)

# MySQL
MYSQL_HOST=your-rds-endpoint
MYSQL_USER=your-user
MYSQL_PASSWORD=your-password
MYSQL_DB=resume_db

# AWS
AWS_ACCESS_KEY_ID=xxxx
AWS_SECRET_ACCESS_KEY=xxxx
AWS_DEFAULT_REGION=ap-south-1
S3_BUCKET_NAME=resume-input-pdfs

▶️ How to Run the Project

1️⃣ Create Virtual Environment

python -m venv venv
source venv/bin/activate
# Windows: venv\Scripts\activate

2️⃣ Install Dependencies

pip install -r requirements.txt
python -m spacy download en_core_web_sm

3️⃣ Run Streamlit App

streamlit run streamlit_app.py

4️⃣ Upload Resume

  • Upload PDF / DOCX / TXT
  • Pipeline runs automatically
  • Data appears in:
    • MySQL
    • Snowflake (via S3 + Snowpipe)
    • Power BI dashboard (after refresh)

πŸ›‘οΈ Error Handling & Edge Cases

  • Scanned PDFs detected and rejected gracefully
  • Duplicate resumes (same email) update existing records
  • Safe MySQL cursor handling
  • Snowpipe ignores old files automatically
  • Clean separation of UI and backend logic
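One plausible way to detect scanned (image-only) PDFs — an assumption about the repo's approach, not confirmed by it — is to check whether any meaningful text survives extraction:

```python
from typing import Optional

def looks_scanned(extracted_text: Optional[str], min_chars: int = 50) -> bool:
    """Heuristic: an image-only (scanned) PDF yields little or no extractable text."""
    if not extracted_text:
        return True
    return len(extracted_text.strip()) < min_chars
```

The `min_chars` threshold is a tunable guess; any resume with real content comfortably exceeds it.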

💡 Why This Project Is Valuable

  • Demonstrates real-world data engineering workflows
  • Covers ingestion, transformation, storage, and analytics
  • Implements event-driven architecture
  • Scalable and production-ready
  • Reflects real HR / ATS resume processing systems

🧭 End-to-End Workflow (With Screenshots)

This section shows how a resume moves through the system, from upload to analytics, using real screenshots, AWS configuration, and Snowflake SQL.

0️⃣ Resume Upload & NER Extraction

The user uploads a resume using the Streamlit UI. The backend extracts text and applies spaCy + Regex to identify structured entities.


Streamlit UI – Resume Upload & NER Output

1️⃣ CSV Upload to Amazon S3

After successful extraction and insertion into MySQL, the record is converted into a CSV file and uploaded to the S3 bucket:

s3://resume-input-pdfs/processed/


S3 Bucket – Processed CSV Files

import boto3

s3 = boto3.client("s3")

s3.upload_file(
    Filename="resume_data.csv",
    Bucket="resume-input-pdfs",
    Key="processed/resume_data.csv"
)

2️⃣ Snowflake Pipe & Notification Channel

Snowpipe is configured to automatically ingest CSV files from S3. The pipe is linked to an Amazon SQS notification channel.


Snowflake Notification Channel

-- For S3 auto-ingest, Snowpipe provisions its own SQS queue.
-- Its ARN appears as notification_channel in the pipe description,
-- and is what the S3 bucket's event notifications must target:
DESC PIPE resume_pipe;

3️⃣ Configure S3 Event Notifications

The S3 bucket is configured to notify Snowflake via SQS whenever a new CSV file is uploaded to the processed/ folder.


S3 Event Notification Configuration

4️⃣ Attach SQS Queue to S3 Bucket

The S3 bucket sends event notifications to the SQS queue used by Snowflake Snowpipe.


S3 → SQS Integration

Event type      : PUT
Prefix filter   : processed/
Destination     : SQS Queue
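Equivalently, this S3 → SQS wiring can be expressed as a bucket notification configuration and applied with boto3's `put_bucket_notification_configuration`. A sketch (the queue ARN is a placeholder, as in the SQL above):

```python
# Event type PUT, prefix processed/, destination SQS — as configured above.
notification_config = {
    "QueueConfigurations": [
        {
            "QueueArn": "arn:aws:sqs:ap-south-1:xxxx:resume-sqs-queue",  # placeholder
            "Events": ["s3:ObjectCreated:Put"],
            "Filter": {
                "Key": {
                    "FilterRules": [
                        {"Name": "prefix", "Value": "processed/"},
                        {"Name": "suffix", "Value": ".csv"},
                    ]
                }
            },
        }
    ]
}

# Applying it requires AWS credentials:
# import boto3
# boto3.client("s3").put_bucket_notification_configuration(
#     Bucket="resume-input-pdfs",
#     NotificationConfiguration=notification_config,
# )
```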

5️⃣ Snowpipe Auto-Ingest into Snowflake

Once the CSV arrives in S3:

  • S3 sends event to SQS
  • Snowpipe detects the event
  • CSV is automatically loaded into Snowflake

No manual COPY commands are required.

CREATE OR REPLACE PIPE resume_pipe
AUTO_INGEST = TRUE
AS
COPY INTO RESUME_ENTITIES
FROM @resume_stage
FILE_FORMAT = (TYPE = 'CSV' FIELD_OPTIONALLY_ENCLOSED_BY = '"')
ON_ERROR = 'CONTINUE';

6️⃣ Power BI Analytics

Power BI connects directly to Snowflake and visualizes the ingested resume data for reporting and insights.

This completes the fully automated, event-driven data pipeline.

Streamlit UI
   ↓
NER Extraction
   ↓
MySQL
   ↓
CSV
   ↓
S3 (processed/)
   ↓
SQS
   ↓
Snowpipe
   ↓
Snowflake
   ↓
Power BI
