Moves data from UMN to Pure (Experts@Minnesota), and vice versa.
This is an Extract-Transform-Load (ETL) system that integrates other major systems both inside and outside UMN. The most important of these are the OIT Legacy Data Warehouse and the Pure systems hosted by Elsevier.
To provision remote environments, to install and deploy Experts ETL in those environments, and for general documentation about our deployment of Experts ETL there, see experts-ansible on UMN GitHub.
While it should be possible to run the Experts ETL system on a local development machine, we advise against it, because it relies on integration with external systems, whose data it writes as well as reads. Instead, most of this document is about running unit and integration tests in a local development environment.
Experts ETL requires a relatively recent version of Python 3. See the pyproject.toml project config file for supported versions.
Both the OIT Legacy Data Warehouse and the Experts Data Warehouse are Oracle databases. See experts_dw on GitHub for supported versions of the required Oracle Instant Client library.
To install and manage Python versions we use pyenv, and to manage dependencies we use poetry.
One way to set up all these tools to work together, for a new project, is to
follow the workflow below. Note that we prefer to put virtual environments
inside the project directory. Note also that we use the built-in venv module
to create virtual environments, and we name their directories .venv, because
that is what poetry does and expects.
- Install pyenv.
- pyenv install $python_version
- mkdir $project_dir; cd $project_dir
- Create a .python-version file, containing $python_version.
- pip install poetry
- poetry config virtualenvs.in-project true
- python -m venv ./.venv/
- source ./.venv/bin/activate
Now running commands like poetry install or poetry update should install
packages into the virtual environment in ./.venv.
So to install the python package dependencies for Experts ETL, run poetry install.
Don't forget to deactivate the virtual environment when finished using it. If
the project virtual environment is not activated, poetry run and poetry shell will activate it. When using poetry shell, exit the shell to
deactivate the virtual environment.
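To confirm whether the interpreter is currently running inside a virtual environment such as ./.venv, a small check along the following lines may help. This is a minimal sketch, not part of Experts ETL: the function name in_virtualenv is hypothetical, and the check relies on the standard library behavior that sys.prefix differs from sys.base_prefix inside an environment created by the venv module.

```python
import sys


def in_virtualenv():
    """Return True if the interpreter is running inside a venv-style virtual environment.

    Hypothetical helper for illustration; not part of Experts ETL.
    """
    # Inside a virtual environment created by the venv module, sys.prefix points
    # at the environment directory, while sys.base_prefix points at the base
    # Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter prefix:", sys.prefix)
```

Running this with the project environment activated, or via poetry run python, should report a prefix ending in .venv if the setup above worked as intended.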
Experts ETL connects to several external services, for which it requires configuration via environment variables:
- Pure web services API
  - PURE_API_DOMAIN
  - PURE_API_VERSION
  - PURE_API_KEY
- Experts@Minnesota Data Warehouse and the UMN OIT Data Warehouse
  - EXPERTS_DB_USER
  - EXPERTS_DB_PASS
  - EXPERTS_DB_HOSTNAME
  - EXPERTS_DB_PORT
  - EXPERTS_DB_SERVICE_NAME
Some tests are integration tests that connect to these external services, so
these variables must be set for testing. One option is to set these
environment variables in a .env file. See env.dist for an example.
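Because the integration tests fail in confusing ways when a variable is missing, it can be useful to check the environment before running them. The following is a minimal sketch, assuming only the variable names listed above; the helper missing_env_vars is hypothetical and not part of Experts ETL, and it does not read a .env file itself.

```python
import os

# Variable names are taken from the configuration list in this document.
# The tuple and the helper below are hypothetical, for illustration only.
REQUIRED_ENV_VARS = (
    "PURE_API_DOMAIN",
    "PURE_API_VERSION",
    "PURE_API_KEY",
    "EXPERTS_DB_USER",
    "EXPERTS_DB_PASS",
    "EXPERTS_DB_HOSTNAME",
    "EXPERTS_DB_PORT",
    "EXPERTS_DB_SERVICE_NAME",
)


def missing_env_vars(environ=None):
    """Return the required variable names that are unset or empty."""
    if environ is None:
        environ = os.environ
    return [name for name in REQUIRED_ENV_VARS if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

The function accepts any mapping, so it can be exercised with a plain dict instead of the real environment.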
Run the following, either as arguments
to poetry run, or after running poetry shell:
pytest tests/test_affiliate_job.py
pytest tests/test_employee_job.py
...
Or to run all tests: pytest
Automated, scheduled ETL processes start by executing runner.py. To follow the flow of execution from the beginning, start with that file.
Experts ETL gets data from Pure via its web services API. UMN Libraries maintains a separate project for that API, pureapi, which is a dependency of Experts ETL. Any contributions that use the Pure API should be made there.
Python package managers, including poetry, will be unable to install a VCS-based
package without a setup.py file in the project root. To generate setup.py:
poetry build
tar -zxf dist/experts_etl-0.0.0.tar.gz experts_etl-0.0.0/setup.py --strip-components 1
Because Experts ETL is an application, please commit pyproject.lock so that we
can reproduce builds with exactly the same set of packages.