
Mordekai66/dataset-iq


dataset-iq



Overview

Dataset-IQ is a structured system for organizing machine learning datasets with automated statistics generation, validation, and standardized metadata.

It transforms raw datasets into self-contained units with computed analysis, enabling consistency, reproducibility, and easier dataset comparison.


Demo

demo.mp4

Core Features

  • Automatic dataset statistics generation via core/stats.py
  • Standardized metadata schema per dataset
  • Machine-readable dataset descriptions
  • Support for CSV and Excel datasets
  • Flask-based web UI with dataset browsing and detail views
  • Data quality scoring with issue detection (missing values, duplicates, high correlation)
  • GitHub Actions workflow for auto-generating stats on push

Structure

Datasets are stored flat inside data/ml/:

data/ml/
├── <dataset_name>.csv / .xlsx
└── <dataset_name>.stats.json
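
The flat layout makes it easy to check which datasets still lack a stats file. The following is a minimal sketch using only the standard library; the helper name `find_datasets_missing_stats` is hypothetical and is not part of core/stats.py.

```python
from pathlib import Path

# Extensions treated as dataset files (the README mentions CSV and Excel).
DATASET_SUFFIXES = {".csv", ".xlsx"}


def find_datasets_missing_stats(data_dir):
    """Return dataset files in data_dir that have no sibling <name>.stats.json.

    Hypothetical helper for illustration only.
    """
    data_dir = Path(data_dir)
    missing = []
    for path in sorted(data_dir.iterdir()):
        # Skip anything that is not a dataset, including the .stats.json files.
        if path.suffix.lower() not in DATASET_SUFFIXES:
            continue
        stats_path = path.with_name(path.stem + ".stats.json")
        if not stats_path.exists():
            missing.append(path)
    return missing
```

Calling `find_datasets_missing_stats("data/ml")` from the repository root would list the dataset files that have not been processed yet.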

Each .stats.json file is auto-generated and contains:

{
  "summary": { "rows": ..., "columns": ..., "data_quality_score": ..., "problem_type": ..., "target": ... },
  "issues": { "missing_values_total": ..., "duplicate_rows": ..., "highly_correlated_columns": [...], "columns_with_high_missing": [...] },
  "schema": [ { "name": ..., "type": ..., "missing_pct": ..., "unique_values": ..., "stats": { "min": ..., "max": ..., "mean": ... } } ]
}
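
Because the file is plain JSON, downstream tools can read it without any project-specific code. The sketch below assumes only the key names shown above; every value in the sample document is made up for illustration, and the helper names are hypothetical.

```python
import json


def summarize_stats(stats):
    """Build a one-line summary from a parsed .stats.json dictionary."""
    summary = stats["summary"]
    issues = stats["issues"]
    return (
        f"{summary['rows']} rows x {summary['columns']} columns, "
        f"quality {summary['data_quality_score']}, "
        f"{issues['duplicate_rows']} duplicate rows, "
        f"{issues['missing_values_total']} missing values"
    )


def columns_by_missing(stats):
    """Return (column name, missing_pct) pairs, worst column first."""
    pairs = [(col["name"], col["missing_pct"]) for col in stats["schema"]]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample document; the values are not taken from a real dataset.
example = json.loads("""
{
  "summary": {"rows": 150, "columns": 2, "data_quality_score": 92.5,
              "problem_type": "regression", "target": "income"},
  "issues": {"missing_values_total": 3, "duplicate_rows": 0,
             "highly_correlated_columns": [], "columns_with_high_missing": []},
  "schema": [
    {"name": "age", "type": "int64", "missing_pct": 0.0, "unique_values": 40,
     "stats": {"min": 18, "max": 70, "mean": 41.2}},
    {"name": "income", "type": "float64", "missing_pct": 2.0, "unique_values": 140,
     "stats": {"min": 12000.0, "max": 98000.0, "mean": 45100.0}}
  ]
}
""")

print(summarize_stats(example))
print(columns_by_missing(example))
```

For a real file, replace the inline sample with `json.load()` on the `.stats.json` file opened from data/ml/.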

Repo Structure

dataset-iq/
├── .github/
│   └── workflows/
│       └── stats.yml             # GitHub Actions — auto-runs stats on push
├── core/
│   └── stats.py                  # Stats generation logic for any dataset file
├── data/
│   └── ml/
│       ├── <dataset_name>.csv    # Raw dataset file (CSV or Excel)
│       └── <dataset_name>.stats.json  # Auto-generated stats for that dataset
├── static/
│   └── style.css                 # Global stylesheet for all pages
├── templates/
│   ├── index.html                # Homepage — dataset grid with filters
│   ├── dataset.html              # Detail page — schema, issues, data preview
│   └── 404.html                  # 404 error page
├── .gitignore                    # .gitignore file
├── app.py                        # Flask app — routes and API endpoints
├── demo.mp4                      # Demonstration web app video
├── LICENSE                       # License file
├── README.md                     # Main markdown file
└── requirements.txt              # Python dependencies


Statistics Generated

Each dataset includes:

  • Number of rows and columns
  • Data quality score (0–100)
  • Problem type detection (classification / regression)
  • Target column identification
  • Missing values count and percentage per column
  • Duplicate records count
  • Highly correlated column pairs (threshold > 0.90)
  • Columns exceeding 30% missing values
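
The issue checks in this list can be sketched with the standard library alone. This is an illustration of the checks, not the code in core/stats.py: the function names are hypothetical, empty cells are assumed to mean "missing", and comparing the absolute Pearson correlation against 0.90 is an assumption.

```python
import csv
import io
import math
from itertools import combinations

MISSING_THRESHOLD_PCT = 30.0   # from the README: columns exceeding 30% missing
CORRELATION_THRESHOLD = 0.90   # from the README: threshold > 0.90


def _to_float(value):
    """Parse a cell as float; return None for empty or non-numeric cells."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def _pearson(xs, ys):
    """Pearson correlation, or None when it is undefined."""
    n = len(xs)
    if n < 2:
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # a constant column has no defined correlation
    return sxy / math.sqrt(sxx * syy)


def detect_issues(csv_text):
    """Compute the issue fields for CSV text (illustrative sketch)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = reader.fieldnames or []
    n = len(rows)

    def is_missing(value):
        return value in ("", None)  # assumption: empty cell means missing

    missing = {c: sum(is_missing(r[c]) for r in rows) for c in columns}
    high_missing = [c for c in columns
                    if n and 100.0 * missing[c] / n > MISSING_THRESHOLD_PCT]

    # Rows that repeat an earlier row exactly.
    duplicate_rows = n - len({tuple(r[c] for c in columns) for r in rows})

    # A column is numeric if it has values and every present value parses.
    numeric = [c for c in columns
               if missing[c] < n
               and all(is_missing(r[c]) or _to_float(r[c]) is not None for r in rows)]

    correlated = []
    for a, b in combinations(numeric, 2):
        # Use only rows where both cells are present.
        pairs = [(_to_float(r[a]), _to_float(r[b])) for r in rows
                 if not is_missing(r[a]) and not is_missing(r[b])]
        r_ab = _pearson([p[0] for p in pairs], [p[1] for p in pairs])
        if r_ab is not None and abs(r_ab) > CORRELATION_THRESHOLD:
            correlated.append((a, b))

    return {
        "missing_values_total": sum(missing.values()),
        "duplicate_rows": duplicate_rows,
        "highly_correlated_columns": correlated,
        "columns_with_high_missing": high_missing,
    }
```

The returned dictionary mirrors the key names of the `issues` object in the .stats.json file, though the element format of `highly_correlated_columns` in the real files is not specified here and the tuples are a guess.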

Usage

Run the web app locally:

pip install -r requirements.txt
python app.py

Then open http://localhost:5000.

Generate stats manually:

python -c "from core.stats import run_all; run_all()"

Stats are also auto-generated via GitHub Actions on every push that modifies files under data/ml/.


Goal

To create a unified, reproducible, and machine-readable dataset registry for machine learning workflows.


Contributing

Fork the repository, add your dataset (CSV or Excel) into data/ml/, then open a Pull Request.

Once merged, stats are generated automatically.

This keeps datasets versioned, traceable, and safe to integrate while allowing contributors without write access to still submit work.

Every dataset added improves the registry and makes it easier to reuse structured ML data instead of rebuilding it from scratch.


License

MIT. See the LICENSE file.

About

Dataset-IQ is an automated, structured registry of machine learning datasets with unified metadata, continuous validation, and queryable access for dataset discovery, comparison, and selection.
