Last updated: November 10, 2025
Author: Paul Namalomba
- SESKA Computational Engineer
- Software Developer
- PhD Candidate (Civil Engineering, Spec. Computational and Applied Mechanics)
Version: 0.3.3 (10 April 2026)
Contact: kabwenzenamalomba@gmail.com
datashadric provides a collection of well-organized modules for common data science tasks, from data cleaning and exploration to machine learning model building, supervised and unsupervised classification, and statistical analysis and testing. The package is designed with readability and ease of use in mind, making complex data science workflows more accessible and easier to write for end-use analysts.
- Machine Learning: Model training, data ensembling (sampling), model evaluation, and prediction tools.
- Regression Analysis: Linear and Logistic regression modeling with diagnostic checks.
- Data Manipulation: Pandas-based utilities for cleaning and transforming data and computing descriptive statistics.
- Statistical Analysis: Hypothesis testing, confidence intervals, normal/Gaussian and Bayesian distribution checks, and sampling utilities.
- Visualization: Plotting functions for data exploration, visualization and presentation.
- Multiple Imputation: MICE (PMM, norm, logistic regression), Random Forest, and KNN imputation for handling missing data.
From PyPI:

```bash
pip install datashadric
```

From source:

```bash
git clone https://github.com/paulnamalomba/datashadric.git
cd datashadric
pip install .
```

For a development (editable) install with the dev extras:

```bash
git clone https://github.com/paulnamalomba/datashadric.git
cd datashadric
pip install -e ".[dev]"
```

A quick tour of the main modules:

```python
import pandas as pd
from datashadric.mlearning import ml_naive_bayes_model
from datashadric.regression import lr_ols_model
from datashadric.dataframing import df_check_na_values
from datashadric.stochastics import df_gaussian_checks
from datashadric.plotters import df_boxplotter
from datashadric.aiagents import ai_analyze_plot_data_with_vision
from datashadric.aiagents import ai_data_insights_summary
from datashadric.imputation import df_mice_impute_pmm, df_impute_knn

# load your data
df = pd.read_csv('your_data.csv')

# check for missing values
na_summary = df_check_na_values(df)

# test for normality
normality_results = df_gaussian_checks(df, 'your_column')

# create visualizations
df_boxplotter(df, 'category_col', 'numeric_col', type_plot=0)

# build machine learning models
model, metrics = ml_naive_bayes_model(df, 'target_column', test_size=0.2)

# perform regression analysis
ols_results = lr_ols_model(df, 'dependent_var', ['independent_var1', 'independent_var2'])
```

- `ml_naive_bayes_model()`: Train and evaluate Naive Bayes classifiers
- `ml_naive_bayes_metrics()`: Calculate detailed model performance metrics
- `logr_predictor()`: Logistic regression modeling and prediction
- `confusion_matrix_from_predictions()`: Generate confusion matrices
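Exact signatures aside, the train/split/evaluate pattern these helpers presumably wrap can be sketched directly with scikit-learn (the column names and toy data below are invented for illustration):

```python
# Hypothetical sketch of what ml_naive_bayes_model does internally:
# split the data, fit a Naive Bayes classifier, and report metrics.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix

df = pd.DataFrame({
    "feat1": [1.0, 2.1, 0.9, 3.2, 2.8, 1.1, 3.0, 0.8],
    "feat2": [0.5, 1.9, 0.4, 2.5, 2.2, 0.6, 2.4, 0.3],
    "target": [0, 1, 0, 1, 1, 0, 1, 0],
})

X = df.drop(columns="target")
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = GaussianNB().fit(X_train, y_train)
y_pred = model.predict(X_test)
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "confusion_matrix": confusion_matrix(y_test, y_pred),
}
```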
- `lr_ols_model()`: Ordinary Least Squares regression modeling
- `lr_check_homoscedasticity()`: Test regression assumptions
- `lr_check_normality()`: Check residual normality
- `lr_post_hoc_test()`: Post-hoc regression diagnostics
- `df_check_na_values()`: Comprehensive missing value analysis
- `df_drop_dupes()`: Remove duplicate rows with reporting
- `df_one_hot_encoding()`: Convert categorical variables to dummy variables
- `df_check_correlation()`: Correlation analysis and visualization
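The categorical-to-dummy conversion that `df_one_hot_encoding()` presumably wraps can be sketched with plain pandas (column names and data here are invented):

```python
# One-hot encode a categorical column and summarize missing values,
# a minimal sketch of df_one_hot_encoding / df_check_na_values.
import pandas as pd

df = pd.DataFrame({
    "city": ["Lusaka", "Kitwe", "Lusaka", "Ndola"],
    "sales": [120, 95, 130, 88],
})

# expand "city" into one indicator column per category
encoded = pd.get_dummies(df, columns=["city"], prefix="city")

# per-column missing-value counts
na_counts = df.isna().sum()
```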
- `df_gaussian_checks()`: Test data normality with Shapiro-Wilk and Q-Q plots
- `df_calc_conf_interval()`: Calculate confidence intervals
- `df_calc_moe()`: Compute margin of error
- `df_calc_zscore()`: Z-score calculations
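As a sketch of the techniques behind `df_gaussian_checks()` and `df_calc_conf_interval()`, here is a Shapiro-Wilk test and a t-based confidence interval done with scipy on synthetic data:

```python
# Normality test and 95% confidence interval for the mean.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=10.0, scale=2.0, size=200)

# Shapiro-Wilk: a large p-value means no evidence against normality
w_stat, p_value = stats.shapiro(sample)

# t-distribution confidence interval for the sample mean
mean = sample.mean()
sem = stats.sem(sample)
low, high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
```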
- `df_boxplotter()`: Box plots for outlier detection
- `df_histplotter()`: Histogram creation with customization
- `df_scatterplotter()`: Scatter plot generation
- `df_pairplot()`: Comprehensive pairwise plotting
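A grouped box plot of the kind `df_boxplotter()` presumably produces can be sketched with matplotlib directly (invented data; the Agg backend keeps it headless):

```python
# Draw a box plot of a numeric column grouped by a categorical column
# and save it to file, mirroring the plotters module's save option.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window needed
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "category_col": ["A", "A", "B", "B", "B", "A"],
    "numeric_col": [1.0, 2.0, 5.0, 6.0, 5.5, 1.5],
})

fig, ax = plt.subplots()
# one array of values per category (groupby sorts keys by default)
groups = [g["numeric_col"].values for _, g in df.groupby("category_col")]
ax.boxplot(groups)
ax.set_xticklabels(sorted(df["category_col"].unique()))
ax.set_xlabel("category_col")
ax.set_ylabel("numeric_col")
fig.savefig("boxplot_example.png")
```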
- `df_mice_impute_pmm()`: MICE with Predictive Mean Matching — imputes from observed donor values
- `df_mice_impute_norm()`: MICE with Bayesian Linear Regression (norm) — smooth posterior-predictive draws
- `df_mice_impute_logistic()`: MICE with Logistic Regression for binary/categorical columns
- `df_impute_random_forest()`: Iterative Random Forest imputation (missForest-style)
- `df_impute_knn()`: K-Nearest Neighbours imputation
- `df_impute_summary()`: Before/after comparison of NaN counts and descriptive statistics
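The KNN technique behind `df_impute_knn()` can be sketched with scikit-learn's `KNNImputer` (toy data and column names are invented; the datashadric wrapper may expose different options):

```python
# Fill NaNs from the nearest complete rows, then compare NaN counts
# before and after, in the spirit of df_impute_summary.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, np.nan, 10.0],
})

imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

summary = pd.DataFrame({"before": df.isna().sum(), "after": filled.isna().sum()})
```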
- pandas >= 1.3.0
- numpy >= 1.20.0
- scikit-learn >= 1.0.0
- matplotlib >= 3.4.0
- seaborn >= 0.11.0
- scipy >= 1.7.0
- statsmodels >= 0.12.0
- plotly >= 5.0.0
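To check that an environment meets these minimums, a small stdlib-only script can help (the package subset and the naive dotted-version comparison below are illustrative; pre-release suffixes like `1.3.0rc1` are not handled):

```python
# Compare installed package versions against the README's minimums
# using only the standard library.
from importlib.metadata import version, PackageNotFoundError

MINIMUMS = {"pandas": "1.3.0", "numpy": "1.20.0", "scipy": "1.7.0"}

def meets_minimum(installed: str, required: str) -> bool:
    """Naive comparison of plain X.Y.Z version strings."""
    to_tuple = lambda v: tuple(int(p) for p in v.split(".")[:3])
    return to_tuple(installed) >= to_tuple(required)

for pkg, minimum in MINIMUMS.items():
    try:
        ok = meets_minimum(version(pkg), minimum)
        print(pkg, version(pkg), "ok" if ok else "too old")
    except PackageNotFoundError:
        print(pkg, "not installed")
```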
You can simply do:
```bash
pip install -r requirements/requirements-core.txt
```

For running tests, you'll need to install additional packages:

```bash
pip install pytest pytest-cov
```

To run the test suite:

```bash
# Install testing dependencies first
pip install pytest pytest-cov

# Run all tests
python -m pytest tests/ -v

# Run tests with coverage report
python -m pytest tests/ --cov=datashadric --cov-report=html --cov-report=term-missing
```

```python
from datashadric.dataframing import df_check_na_values, df_drop_dupes
from datashadric.plotters import df_histplotter

# check data quality
na_report = df_check_na_values(df)
df_clean = df_drop_dupes(df)

# visualize distributions
df_histplotter(df_clean, 'numeric_column', type_plot=0, bins=30)
```

```python
from datashadric.stochastics import df_gaussian_checks, df_calc_conf_interval

# test normality
normality_test = df_gaussian_checks(df, 'measurement_column')

# calculate confidence intervals
ci = df_calc_conf_interval(df['measurement_column'], confidence=0.95)
```

```python
from datashadric.mlearning import ml_naive_bayes_model, ml_naive_bayes_metrics

# train model
model, initial_metrics = ml_naive_bayes_model(df, 'target', test_size=0.3)

# detailed evaluation
detailed_metrics = ml_naive_bayes_metrics(model, X_test, y_test)
```

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
The author retains all rights to the code and documentation in this repository. You are free to use, modify, and distribute the code as long as you comply with the terms of the MIT License.
If you encounter any problems or have questions, please file an issue on the datashadric GitHub repository - Issues Page.
The full build-to-publish workflow is captured in datashadric-build-test-upload_instructions.ps1 (PowerShell)
and datashadric-build-test-upload_instructions.bat (CMD). The steps below can be run manually in order.
```bash
# Remove old distributions and egg-info
rm -rf dist/ build/ src/*.egg-info

# Build source and wheel distributions
python -m build
```

This produces `.tar.gz` and `.whl` files in the `dist/` directory.

```bash
twine check dist/*
```

Ensure the output reports no errors or warnings.

```python
import datashadric
print(datashadric.__version__)  # should print 0.3.3 as of 13 March 2026
```

```bash
# Run the test suite with coverage
python -m pytest tests/ -v --cov=datashadric --cov-report=term-missing

# Upload to TestPyPI and verify the install
twine upload --repository testpypi dist/*
pip install --index-url https://test.pypi.org/simple/ datashadric==0.3.3

# Upload to PyPI
twine upload --repository pypi dist/*

# Reinstall locally in editable mode
pip install -e .

# Commit, tag, and push the release
git add .
git commit -m "Release v0.3.3 — multiple imputation methods"
git tag -a v0.3.3 -m "v0.3.3"
git push origin main --tags
```

Note: If you use the `Manage-GitHub` PowerShell function, you can replace steps 8-9 with: `Manage-GitHub -commitMessage "Release v0.3.3" -TagName v0.3.3 -TagMessage "v0.3.3"`
Iterative releases are usually the same release re-bundled with minor improvements, so they are also grouped below.
- New module: `imputation` — comprehensive multiple imputation methods for handling missing data
  - MICE with Predictive Mean Matching (PMM)
  - MICE with Bayesian Linear Regression (norm)
  - MICE with Logistic Regression for binary/categorical columns
  - Iterative Random Forest imputation (missForest-style, supports numeric and categorical)
  - K-Nearest Neighbours (KNN) imputation
  - Imputation summary utility for before/after comparison
- Added `MODULE_NOTES.md` in `src/datashadric/` documenting every module and function
- Added build, release, and deploy instructions to README
- Version bump to 0.3.3 for minor fixes and documentation updates
- Fixed README formatting and typos
- Fixed broken ANOVA function in `stochastics` module (was using wrong statsmodels submodules)
- Fixed VIF calculation function in `stochastics` module to ensure it works correctly with pandas DataFrames and handles the constant term properly
- Fixed broken OLS regression function in `regression` module (was using wrong statsmodels submodules)
- Updated documentation in `MODULE_NOTES.md` for all modules, especially the new `imputation` module
- Added image annotation when detecting outliers using AI-assisted bounding box generation
- Enhanced outlier detection and removal functions in data-processor module
- Added use of AI agents to assist with data analysis and visualization tasks (needs user to store their API keys in system environment variables)
- Added Apache Superset as an additional visualization dependency
- Minor bug fixes and enhancements in dataframing and plotters modules
- Updated documentation
- Minor bug fixes
- Minor enhancements to user optionality in many functions for mlearning, stochastics and dataframing modules
- Added user optionality for saving plots to files in plotters module
- Updated documentation
- Minor bug fixes
- Added print statements for better process tracking in data processing functions
- Added stochastic and machine learning based outlier detection and removal
- Updated documentation
- Enhanced dataframe utilities
- New functions for index and column name retrieval
- Improved documentation and examples
- Supplemental release
- Additional functions for outlier detection
- Additional functions for plotting (LOWESS meanline plotter)
- Additional functions for data clustering based on k-means
- Initial release
- Core modules: mlearning, regression, dataframing, stochastics, plotters
- Comprehensive documentation and examples
- Minimal test coverage