dataset-iq

Overview

Dataset-IQ is a structured system for organizing machine learning datasets with automated statistics generation, validation, and standardized metadata.

It transforms raw datasets into self-contained units with computed analysis, enabling consistency, reproducibility, and easier dataset comparison.

Demo

demo.mp4

Core Features

Automatic dataset statistics generation via core/stats.py
Standardized metadata schema per dataset
Machine-readable dataset descriptions
Support for CSV and Excel datasets
Flask-based web UI with dataset browsing and detail views
Data quality scoring with issue detection (missing values, duplicates, high correlation)
GitHub Actions workflow for auto-generating stats on push

Structure

Datasets are stored flat inside data/ml/:

data/ml/
├── <dataset_name>.csv / .xlsx
└── <dataset_name>.stats.json
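Because the layout is flat, each dataset can be paired with its stats file by name alone. The following is a minimal sketch under that naming convention; `find_datasets` is a hypothetical helper for illustration and is not part of the repository.

```python
from pathlib import Path

# File extensions treated as datasets (CSV and Excel, per the feature list).
DATASET_SUFFIXES = {".csv", ".xlsx"}


def find_datasets(root="data/ml"):
    """Map each dataset file name to its stats file name, or None if missing.

    Hypothetical helper: assumes the <dataset_name>.stats.json convention.
    """
    pairs = {}
    for path in sorted(Path(root).iterdir()):
        if path.suffix.lower() in DATASET_SUFFIXES:
            stats_path = path.with_name(path.stem + ".stats.json")
            pairs[path.name] = stats_path.name if stats_path.exists() else None
    return pairs
```

Calling `find_datasets()` from the repository root returns a dictionary whose `None` values mark datasets that still need stats generated.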

Each .stats.json file is auto-generated and contains:

{
  "summary": { "rows": ..., "columns": ..., "data_quality_score": ..., "problem_type": ..., "target": ... },
  "issues": { "missing_values_total": ..., "duplicate_rows": ..., "highly_correlated_columns": [...], "columns_with_high_missing": [...] },
  "schema": [ { "name": ..., "type": ..., "missing_pct": ..., "unique_values": ..., "stats": { "min": ..., "max": ..., "mean": ... } } ]
}
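Since the stats file is plain JSON, it can be consumed with the standard library. This sketch assumes only the keys shown above; `load_stats` and `summarize` are illustrative names, not functions from the repository.

```python
import json


def load_stats(path):
    """Parse a .stats.json file into a dictionary."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def summarize(stats):
    """Build a one-line description from the "summary" and "issues" sections."""
    summary = stats["summary"]
    issues = stats["issues"]
    return (
        f"{summary['rows']} rows x {summary['columns']} columns, "
        f"quality {summary['data_quality_score']}, "
        f"{issues['duplicate_rows']} duplicate rows, "
        f"{issues['missing_values_total']} missing values"
    )
```

For example, `summarize(load_stats("data/ml/<dataset_name>.stats.json"))` gives a compact line suitable for logs or a quick comparison between datasets.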

Repo Structure

dataset-iq/
├── .github/
│   └── workflows/
│       └── stats.yml             # GitHub Actions — auto-runs stats on push
├── core/
│   └── stats.py                  # Stats generation logic for any dataset file
├── data/
│   └── ml/
│       ├── <dataset_name>.csv    # Raw dataset file (CSV or Excel)
│       └── <dataset_name>.stats.json  # Auto-generated stats for that dataset
├── static/
│   └── style.css                 # Global stylesheet for all pages
├── templates/
│   ├── index.html                # Homepage — dataset grid with filters
│   ├── dataset.html              # Detail page — schema, issues, data preview
│   └── 404.html                  # 404 error page
├── .gitignore                    # .gitignore file
├── app.py                        # Flask app — routes and API endpoints
├── demo.mp4                      # Demo video of the web app
├── LICENSE                       # License file
├── README.md                     # Main markdown file
└── requirements.txt              # Python dependencies

Statistics Generated

Each dataset includes:

Number of rows and columns
Data quality score (0–100)
Problem type detection (classification / regression)
Target column identification
Missing values count and percentage per column
Duplicate records count
Highly correlated column pairs (threshold > 0.90)
Columns exceeding 30% missing values
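The issue checks listed above can be approximated with pandas. This is a sketch, not the logic in core/stats.py: `detect_issues` is a hypothetical function, and using absolute Pearson correlation on numeric columns is an assumption. The data quality score and problem type detection are left out because their rules are not described here.

```python
import pandas as pd


def detect_issues(df, corr_threshold=0.90, missing_threshold=30.0):
    """Compute the "issues" fields for a DataFrame (illustrative only)."""
    # Percentage of missing values per column.
    missing_pct = df.isna().mean() * 100

    # Assumption: absolute Pearson correlation between numeric columns.
    corr = df.select_dtypes(include="number").corr().abs()
    cols = list(corr.columns)
    pairs = [
        [a, b]
        for i, a in enumerate(cols)
        for b in cols[i + 1:]
        if corr.loc[a, b] > corr_threshold
    ]

    return {
        "missing_values_total": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "highly_correlated_columns": pairs,
        "columns_with_high_missing": [
            col for col, pct in missing_pct.items() if pct > missing_threshold
        ],
    }
```

A dataset loaded with `pd.read_csv` or `pd.read_excel` can be passed straight to `detect_issues`; the thresholds default to the 0.90 and 30% values stated above.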

Usage

Run the web app locally:

pip install -r requirements.txt
python app.py

Then open http://localhost:5000.

Generate stats manually:

python -c "from core.stats import run_all; run_all()"

Stats are also auto-generated via GitHub Actions on every push that modifies files under data/ml/.

Goal

To create a unified, reproducible, and machine-readable dataset registry for machine learning workflows.

Contributing

Fork the repository, add your dataset (CSV or Excel) into data/ml/, then open a Pull Request.

Once merged, stats are generated automatically.

This keeps datasets versioned, traceable, and safe to integrate while allowing contributors without write access to still submit work.

Every dataset added improves the registry and makes it easier to reuse structured ML data instead of rebuilding it from scratch.

License

MIT. See the LICENSE file for details.
