Commit dcdd987

Merge pull request #1 from RichardScottOZ/copilot/create-python-package
Create geoscience_data_quality Python package from notebook code
2 parents eee2ec8 + e92b5e6 commit dcdd987

14 files changed: 1,426 additions and 4 deletions

README.md (98 additions, 4 deletions)

````diff
@@ -1,15 +1,108 @@
 # Geoscience-Data-Quality-for-Machine-Learning
 
+A Python package for assessing geoscience data quality for machine learning.
+
 A problem exists when building broad scale models, for example, Australia.
+Disparate datasets from many domains need to be assessed for quality before
+being combined into machine learning pipelines. This package provides tools
+to quantify and map data quality across geoscience datasets.
+
+## Installation
+
+```bash
+pip install -e .
+```
+
+With optional dependencies:
+
+```bash
+# For Excel file support
+pip install -e ".[excel]"
+
+# For gravity point-density analysis (verde, xarray, pooch)
+pip install -e ".[gravity]"
+
+# For visualization (matplotlib)
+pip install -e ".[viz]"
+
+# Everything
+pip install -e ".[all]"
+
+# Development (includes tests)
+pip install -e ".[dev]"
+```
+
+## Package Modules
+
+### `geoscience_data_quality.quality_model`
+Quality scoring model for geoscience datasets. Load quality models from
+CSV/Excel, compute resolution scores, final quality scores, and filter
+by domain or sub-domain.
+
+```python
+from geoscience_data_quality import load_quality_model, compute_final_score, compute_resolution_score
+
+model = load_quality_model("DataQuality_Models.csv")
+res_score = compute_resolution_score(90.0)  # finer resolution → higher score
+final = compute_final_score(score=3, presence=1.0, resolution_score=res_score)
+```
 
-## Disparate datasets, breaking them down into broad domains:
+### `geoscience_data_quality.vector`
+Analyze quality fields (confidence, observation method, positional accuracy,
+metadata) in geological vector datasets.
+
+```python
+from geoscience_data_quality import analyze_quality_fields, get_quality_summary
+
+results = analyze_quality_fields(geology_gdf, fields=["confidence", "obsmethod"])
+summary = get_quality_summary(geology_gdf)
+```
+
+### `geoscience_data_quality.survey`
+Fetch, filter, and fix geophysical survey metadata from WFS services such
+as Geoscience Australia's GADDS.
+
+```python
+from geoscience_data_quality import fetch_ga_survey_metadata, filter_surveys, fix_survey_geometry
+
+surveys = fetch_ga_survey_metadata()
+mag_line = filter_surveys(surveys, measure_type="magnetic", dataset_type="line")
+gdf = fix_survey_geometry(mag_line, swap_coordinates=True)
+```
+
+### `geoscience_data_quality.rasterize`
+Rasterize vector quality attributes onto reference grids or new grids
+defined by bounds and resolution.
+
+```python
+from geoscience_data_quality import rasterize_vector_attribute
+
+array = rasterize_vector_attribute(
+    gdf, column="max_line_spacing_m",
+    reference_raster="model_raster.tif",
+    output_path="survey_quality.tif",
+    sort_ascending=False,  # smallest (best) value wins in overlaps
+)
+```
+
+### `geoscience_data_quality.point_density`
+Compute observation point density for datasets like gravity stations
+(requires the `gravity` optional dependencies).
+
+```python
+from geoscience_data_quality.point_density import compute_point_density
+
+coords, counts = compute_point_density((longitude, latitude), spacing=0.1)
+```
+
+## Disparate datasets, breaking them down into broad domains
 
 - Geophysics (Gravity, Magnetics, Radiometrics, Seismic, Electromagnetic, Induced Polarisation, Magnetotelluric...)
 - Geology (Lithology, Stratigraphy, Structure, Hydro..)
 - Remote Sensing (Landsat, ASTER, Sentinel...)
 - Geochemistry (Rock, Soil, Water, Assay techniques...)
 
-## Variety of data layers:
+## Variety of data layers
 
 - Direct observations
 - Gridded Data
@@ -57,8 +150,9 @@ How, thinking in a raster fashion, to get a combined per-pixel Data Quality rati
 - Simple qualitative (3/2/1, Good/Average/Bad, High/Medium/Low or other ordinals).
 - Exists / Missing
 
-# Reference
-- [https://www.researchgate.net/profile/Alan_Aitken/publication/326193704/figure/fig1/AS:646297606443016@1531100765653/](https://www.researchgate.net/publication/326193704_A_role_for_data_richness_mapping_in_exploration_decision_making)
+## Reference
+
+- [A role for data richness mapping in exploration decision making (Aitken et al)](https://www.researchgate.net/publication/326193704_A_role_for_data_richness_mapping_in_exploration_decision_making)
 
 ![sample map output](https://github.com/RichardScottOZ/Geoscience-Data-Quality-for-Machine-Learning/blob/main/reliability_index.png "Sample Quality Map - derived from Leonardo Uieda's Australia Gravity Data repository work")
````
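The `quality_model` example in the new README combines an ordinal score, a presence fraction, and a resolution score into one number. The diff does not show the package's actual formulas, so the multiplicative combination below is only a plausible sketch; `resolution_score`, `final_score`, and the 10 m reference resolution are illustrative names and values, not the package API:

```python
def resolution_score(resolution_m: float, best_m: float = 10.0) -> float:
    """Hypothetical score in (0, 1]: finer (smaller) resolution scores higher."""
    return min(1.0, best_m / resolution_m)


def final_score(score: int, presence: float, resolution_score: float) -> float:
    """Hypothetical final quality: ordinal score scaled by presence and resolution."""
    return score * presence * resolution_score


res = resolution_score(90.0)  # a 90 m grid scores 10/90, roughly 0.11
final = final_score(score=3, presence=1.0, resolution_score=res)
```

Any monotone mapping from resolution to (0, 1] would serve the same role; the point is that a fully present, coarsely resolved layer ends up with a lower final score than the same layer at fine resolution.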

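The `sort_ascending=False` comment in the rasterize example describes a painter's-algorithm overlap rule: features burned later overwrite earlier ones, so sorting so that the best (smallest) value is burned last makes it win where surveys overlap. A toy 1-D sketch of that rule using only numpy (`burn_last_wins` and the sample data are illustrative, not the package implementation):

```python
import numpy as np


def burn_last_wins(values, masks, size):
    """Burn each feature's value into a grid; later burns overwrite earlier ones."""
    grid = np.full(size, np.nan)
    for value, mask in zip(values, masks):
        grid[mask] = value
    return grid


# Two overlapping "surveys": line spacings 400 m and 200 m (smaller is better).
masks = [np.arange(10) < 6, np.arange(10) >= 4]  # they overlap at cells 4-5
values = [400.0, 200.0]

# Sort descending so the smallest spacing is burned last and wins the overlap.
order = np.argsort(values)[::-1]
grid = burn_last_wins([values[i] for i in order], [masks[i] for i in order], 10)
```

After burning, cells 0-3 hold 400, while the overlap cells 4-5 (and 6-9) hold the better 200 m spacing.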
pyproject.toml (new file, 65 additions)

```toml
[build-system]
requires = ["setuptools>=64", "setuptools-scm>=8"]
build-backend = "setuptools.build_meta"

[project]
name = "geoscience-data-quality"
version = "0.1.0"
description = "Tools for assessing geoscience data quality for machine learning"
readme = "README.md"
license = {text = "MIT"}
requires-python = ">=3.9"
authors = [
    {name = "Richard Scott"},
]
classifiers = [
    "Development Status :: 3 - Alpha",
    "Intended Audience :: Science/Research",
    "License :: OSI Approved :: MIT License",
    "Programming Language :: Python :: 3",
    "Programming Language :: Python :: 3.9",
    "Programming Language :: Python :: 3.10",
    "Programming Language :: Python :: 3.11",
    "Programming Language :: Python :: 3.12",
    "Topic :: Scientific/Engineering :: GIS",
    "Topic :: Scientific/Engineering",
]
dependencies = [
    "geopandas>=0.12",
    "numpy>=1.22",
    "pandas>=1.4",
    "rasterio>=1.3",
    "shapely>=2.0",
]

[project.optional-dependencies]
excel = [
    "openpyxl>=3.0",
]
gravity = [
    "verde>=1.7",
    "xarray>=2022.3",
    "pooch>=1.6",
]
viz = [
    "matplotlib>=3.5",
]
all = [
    "geoscience-data-quality[excel,gravity,viz]",
]
dev = [
    "geoscience-data-quality[all]",
    "pytest>=7.0",
    "pytest-cov>=4.0",
]

[project.urls]
Homepage = "https://github.com/RichardScottOZ/Geoscience-Data-Quality-for-Machine-Learning"
Repository = "https://github.com/RichardScottOZ/Geoscience-Data-Quality-for-Machine-Learning"
Issues = "https://github.com/RichardScottOZ/Geoscience-Data-Quality-for-Machine-Learning/issues"

[tool.setuptools.packages.find]
where = ["src"]

[tool.pytest.ini_options]
testpaths = ["tests"]
```
src/geoscience_data_quality/__init__.py (new file, 32 additions)

```python
"""Tools for assessing geoscience data quality for machine learning."""

__version__ = "0.1.0"

from geoscience_data_quality.quality_model import (
    compute_final_score,
    compute_resolution_score,
    load_quality_model,
)
from geoscience_data_quality.rasterize import rasterize_vector_attribute
from geoscience_data_quality.survey import (
    fetch_ga_survey_metadata,
    filter_surveys,
    fix_survey_geometry,
)
from geoscience_data_quality.vector import (
    analyze_quality_fields,
    get_quality_summary,
)

__all__ = [
    "__version__",
    "analyze_quality_fields",
    "compute_final_score",
    "compute_resolution_score",
    "fetch_ga_survey_metadata",
    "filter_surveys",
    "fix_survey_geometry",
    "get_quality_summary",
    "load_quality_model",
    "rasterize_vector_attribute",
]
```
src/geoscience_data_quality/point_density.py (new file, 74 additions)

```python
"""Point density analysis for geophysical observations.

Functions for computing observation density from point data such as
gravity station locations, providing a spatial measure of data quality.

Based on the Gravity-Survey-Quality notebook, which uses verde for block
reduction to compute points-per-pixel.

These functions require the optional ``verde`` dependency. Install it
with::

    pip install geoscience-data-quality[gravity]
"""

from __future__ import annotations

import numpy as np


def compute_point_density(
    coordinates: tuple[np.ndarray, np.ndarray],
    spacing: float = 0.1,
    center_coordinates: bool = True,
) -> tuple[tuple[np.ndarray, np.ndarray], np.ndarray]:
    """Compute point density using block reduction.

    Divides the area into blocks of the given *spacing* and counts the
    number of points in each block.

    Parameters
    ----------
    coordinates : tuple of ndarray
        ``(longitude, latitude)`` arrays of observation locations.
    spacing : float
        Block size in degrees. Default ``0.1``.
    center_coordinates : bool
        If ``True``, return the centre of each block as the
        coordinates. Default ``True``.

    Returns
    -------
    coords : tuple of ndarray
        ``(longitude, latitude)`` of block centres.
    counts : ndarray
        Number of points in each block.

    Raises
    ------
    ImportError
        If ``verde`` is not installed.
    """
    try:
        import verde as vd
    except ImportError as exc:
        raise ImportError(
            "The 'verde' package is required for point density analysis. "
            "Install it with: pip install geoscience-data-quality[gravity]"
        ) from exc

    def _count(array: np.ndarray) -> int:
        return array.size

    # Create dummy data matching the coordinate arrays; BlockReduce counts them.
    dummy_data = np.ones(coordinates[0].shape)

    coords, counts = vd.BlockReduce(
        _count,
        center_coordinates=center_coordinates,
        spacing=spacing,
    ).filter(coordinates, data=dummy_data)

    return coords, counts
```

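The `BlockReduce` call in `point_density.py` is essentially a 2-D histogram of station locations. A dependency-free sketch of the same points-per-block count using only numpy (the `block_counts` name, bin-edge construction, and sample coordinates are illustrative, not the package API):

```python
import numpy as np


def block_counts(longitude, latitude, spacing=0.1):
    """Count points per spacing-by-spacing block, like BlockReduce with a count."""
    lon = np.asarray(longitude)
    lat = np.asarray(latitude)
    # Bin edges covering the data extent in steps of the block spacing.
    lon_edges = np.arange(lon.min(), lon.max() + spacing, spacing)
    lat_edges = np.arange(lat.min(), lat.max() + spacing, spacing)
    counts, _, _ = np.histogram2d(lon, lat, bins=[lon_edges, lat_edges])
    return counts  # shape: (n_lon_blocks, n_lat_blocks)


# Synthetic "stations" over a 1 x 1 degree area.
rng = np.random.default_rng(0)
lon = rng.uniform(130.0, 131.0, 500)
lat = rng.uniform(-26.0, -25.0, 500)
counts = block_counts(lon, lat, spacing=0.1)
```

Unlike `BlockReduce`, this sketch returns a dense grid rather than coordinates of occupied blocks only, but every station lands in exactly one block, so the counts sum to the number of input points.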