A web application for storing and retrieving immutable data, organized into folders.
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
See deployment for notes on how to deploy the project on a live system (Coming soon).
- Python 3.10 (required; the project pins `>=3.10,<3.11`)
- Poetry (dependency manager)
- Node.js and Yarn (for the React frontend)
- Redis (Celery broker and result backend)
- Docker (optional — only needed if you want to test file uploads via MiniStack)
| Layer | Technologies |
|---|---|
| Database | SQLite (local dev), PostgreSQL (production), SQLAlchemy |
| Backend/API | Python 3.10, Flask, Connexion (Swagger/OpenAPI), Celery/Redis |
| Object storage | AWS S3 (production), MiniStack (optional, local dev) |
| Frontend | React, TypeScript, Webpack, Yarn |
- Install Python dependencies: `poetry install`
- Install frontend dependencies: `cd react_frontend && yarn install && cd ..`
- Copy the sample settings file (if you don't already have one): `cp settings.cfg.sample settings.cfg`
- Create the dev database: `poetry run bash -c 'source setup_env.sh && flask recreate-dev-db'`
```
./dev.sh
```

This single script handles setup and launches all services via mprocs:

- Starts Redis (if not already running)
- Starts the MiniStack Docker container for local S3 (if `settings.cfg` is configured for it)
- Creates the S3 bucket and dev database if they don't exist
- Launches the Webpack dev server, Flask app, and Celery worker via mprocs

mprocs gives you a TUI where you can switch between process outputs with `j`/`k`, restart individual processes with `r`, and quit everything with `q`.

Prerequisite: install mprocs with `brew install mprocs`.
Open your browser to: http://127.0.0.1:5000/taiga/
```
# Terminal 1: Redis (skip if already running; check with redis-cli ping)
redis-server

# Terminal 2: Webpack dev server (frontend hot reload)
poetry run bash -c 'source setup_env.sh && flask webpack'

# Terminal 3: Flask app server
poetry run bash -c 'source setup_env.sh && flask run'

# Terminal 4: Celery worker (async file conversion tasks)
poetry run bash -c 'source setup_env.sh && flask run-worker'
```

Open your browser to: http://127.0.0.1:5000/taiga/
You are automatically logged in as the seeded admin user (admin@broadinstitute.org) via the DEFAULT_USER_EMAIL setting.
Without S3 configured, you can browse/search the seeded data, create folders, and work with the UI. File uploads require either MiniStack or real AWS credentials (see below).
MiniStack is a free, open-source AWS emulator that runs 33 AWS services (including S3 and STS) in a single Docker container. It lets you test the full upload pipeline locally without an AWS account.
- Start MiniStack: `docker run -d --name ministack -p 4566:4566 nahuelnucera/ministack`

- Create the local S3 bucket (run once):

  ```
  python -c "import boto3; boto3.client('s3', endpoint_url='http://localhost:4566', aws_access_key_id='test', aws_secret_access_key='test').create_bucket(Bucket='taiga-dev')"
  ```

- In `settings.cfg`, uncomment the MiniStack block (Option A) and comment out Option B:

  ```
  S3_ENDPOINT_URL = 'http://localhost:4566'
  AWS_ACCESS_KEY_ID = 'test'
  AWS_SECRET_ACCESS_KEY = 'test'
  S3_BUCKET = 'taiga-dev'
  ```

- Restart Flask and the Celery worker to pick up the new settings.
```
docker start ministack   # start (if previously stopped)
docker stop ministack    # stop
docker rm ministack      # remove entirely
```

Warning: File uploads may fail with MiniStack because it omits the `ETag` header that boto3's high-level `Bucket.copy()` uses for validation. When `S3_ENDPOINT_URL` is set in `settings.cfg`, `aws.copy_object()` (in `taiga2/third_party_clients/aws.py`) automatically uses the low-level client API, which does not require ETags. When `S3_ENDPOINT_URL` is empty or unset, the standard resource-level `Bucket.copy()` is used. If you see upload failures locally (errors mentioning ETag or copy validation), make sure `S3_ENDPOINT_URL` is set to your MiniStack endpoint (`http://localhost:4566`).

Note on tests: The test suite does not set `S3_ENDPOINT_URL`, so tests always exercise the `Bucket.copy()` path. If you add `S3_ENDPOINT_URL` to the test config, `imp_conv_test.py` will fail because `MockS3Client` does not implement `copy_object`. This is intentional: tests validate the production (real AWS) code path.
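The endpoint-dependent copy switch described above can be sketched as follows. This is an illustrative sketch only, not the project's actual code; the real logic lives in `aws.copy_object()` in `taiga2/third_party_clients/aws.py`, and the function and argument names here are assumptions:

```python
def copy_s3_object(bucket, src_key, dst_key, endpoint_url=None):
    """Copy an object within a bucket, choosing the API by endpoint.

    Hypothetical sketch of the switch described in the warning above.
    """
    import boto3  # imported lazily so the module loads without boto3 installed

    if endpoint_url:
        # MiniStack path: low-level client API, no ETag validation required
        client = boto3.client("s3", endpoint_url=endpoint_url)
        client.copy_object(
            Bucket=bucket,
            Key=dst_key,
            CopySource={"Bucket": bucket, "Key": src_key},
        )
    else:
        # Real AWS path: resource-level, multipart-aware managed copy
        s3 = boto3.resource("s3")
        s3.Bucket(bucket).copy({"Bucket": bucket, "Key": src_key}, dst_key)
```

The low-level `copy_object` call performs a single server-side copy and never compares ETags, which is why it tolerates MiniStack's missing header.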
Set `S3_ENDPOINT_URL = ''` and clear the AWS keys in `settings.cfg`. The app runs fine without S3; you just can't upload files.
We need two IAM users: the main user is used by the app in general to read from and write to S3. The second (uploader) has its rights delegated via STS on a short-term basis; this user should only have access to upload to a single location within S3.
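The short-term STS delegation described above might be sketched with boto3 as below. This is not the app's actual code: the helper names, the per-session `prefix` layout, and the 15-minute default are illustrative assumptions; only the bucket ARN and actions mirror the policies that follow.

```python
import json


def build_upload_policy(prefix: str) -> dict:
    """Scope-down policy permitting uploads to a single S3 prefix.

    Mirrors the "upload" user's policy; the per-session prefix is a
    hypothetical refinement for illustration.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:HeadObject"],
            "Resource": [f"arn:aws:s3:::taiga2/upload/{prefix}/*"],
        }],
    }


def temporary_upload_credentials(prefix: str, duration: int = 900) -> dict:
    import boto3  # imported lazily so build_upload_policy is testable offline

    sts = boto3.client("sts")  # authenticated as the main IAM user
    token = sts.get_federation_token(
        Name="taiga-uploader",
        Policy=json.dumps(build_upload_policy(prefix)),
        DurationSeconds=duration,
    )
    # AccessKeyId / SecretAccessKey / SessionToken handed to the uploader
    return token["Credentials"]
```

`get_federation_token` intersects the main user's permissions with the passed policy, so the resulting credentials can only upload under the given prefix.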
Permissions for the main user:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::taiga2",
        "arn:aws:s3:::taiga2/*"
      ]
    }
  ]
}
```

Permissions for the "upload" user:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1482441362000",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:HeadObject"
      ],
      "Resource": [
        "arn:aws:s3:::taiga2/upload/*"
      ]
    }
  ]
}
```

Because we are using S3 to store the files, we need to configure the S3 service and its buckets correctly.
Please follow this tutorial from Amazon on how to create a bucket.
We now need to be able to access this bucket programmatically, and through CORS (Cross-Origin Resource Sharing). For our case, the configuration is fairly simple:
- Select your bucket in your Amazon S3 console
- Click on `Properties`
- Click on the `Permissions` accordion
- Click on `Edit CORS Configuration`
- Paste the following configuration into the editor that appears (CORS Configuration Editor):
```json
[
  {
    "AllowedOrigins": ["*"],
    "AllowedMethods": ["GET", "POST", "PUT"],
    "ExposeHeaders": ["ETag"],
    "AllowedHeaders": ["*"]
  }
]
```

Warning: Be careful not to override your existing configuration!
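If you prefer to set the CORS rules from code rather than the console, a sketch using boto3's `put_bucket_cors` could look like the following (the bucket name is whatever you created above; this helper is not part of the project):

```python
def cors_configuration() -> dict:
    """The CORS rules above, in the shape boto3's put_bucket_cors expects."""
    return {
        "CORSRules": [{
            "AllowedOrigins": ["*"],
            "AllowedMethods": ["GET", "POST", "PUT"],
            "ExposeHeaders": ["ETag"],
            "AllowedHeaders": ["*"],
        }]
    }


def apply_cors(bucket: str) -> None:
    import boto3  # imported lazily so cors_configuration is testable offline

    boto3.client("s3").put_bucket_cors(
        Bucket=bucket, CORSConfiguration=cors_configuration()
    )
```

Note that `put_bucket_cors` replaces the bucket's entire CORS configuration, so the same warning about overriding existing rules applies.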
- Edit `settings.cfg` and set `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`
- Set `S3_BUCKET` to the bucket created above
- Remove `S3_ENDPOINT_URL` (or leave it empty) so the app connects to real AWS
```sql
INSERT INTO group_user_association (group_id, user_id)
SELECT 1, id FROM users WHERE name = 'pmontgom';
```

```
# Run all tests
poetry run pytest

# Run a specific test file
poetry run pytest taiga2/tests/datafile_test.py

# Run with verbose output (shows each test name)
poetry run pytest -v
```

Every push triggers the GitHub Actions workflow (`.github/workflows/build-docker.yaml`), which:
- Builds the Docker image from the root `Dockerfile`
- Runs `pytest` inside the image
- Pushes to `us.gcr.io/cds-docker-containers/taiga:ga-build-<run_number>`
- On `main`, also tags and pushes as `us.gcr.io/cds-docker-containers/taiga:latest`
Once the workflow completes, ssh into `ubuntu@cds.team`:

- Pull the GA build of the image that you want to deploy:

  ```
  GOOGLE_APPLICATION_CREDENTIALS=/etc/google/auth/docker-pull-creds.json docker pull us.gcr.io/cds-docker-containers/taiga:ga-build-68
  ```

- Tag the image with `taiga-prod` and `taiga-staging`. For example:

  ```
  GOOGLE_APPLICATION_CREDENTIALS=/etc/google/auth/docker-pull-creds.json docker tag us.gcr.io/cds-docker-containers/taiga:ga-build-68 us.gcr.io/cds-docker-containers/taiga:taiga-staging
  GOOGLE_APPLICATION_CREDENTIALS=/etc/google/auth/docker-pull-creds.json docker tag us.gcr.io/cds-docker-containers/taiga:ga-build-68 us.gcr.io/cds-docker-containers/taiga:taiga-prod
  ```

- Push the images with the new tags:

  ```
  GOOGLE_APPLICATION_CREDENTIALS=/etc/google/auth/docker-pull-creds.json docker push us.gcr.io/cds-docker-containers/taiga:taiga-staging
  GOOGLE_APPLICATION_CREDENTIALS=/etc/google/auth/docker-pull-creds.json docker push us.gcr.io/cds-docker-containers/taiga:taiga-prod
  ```

- Restart the service:

  ```
  sudo systemctl restart taiga
  ```
If there's any problem, you can look for information in the logs (stored at `/var/log/taiga`) or ask journald for the output from the service (`journalctl -u taiga`).
The production database is a PostgreSQL instance on Google Cloud SQL in the cds-servers project (instance: cds-servers:us-central1:migrate-taiga-prod). Migrations are run from your local machine using Alembic via Flask, connected through the Cloud SQL Auth Proxy.
- Cloud SQL Auth Proxy installed locally (`brew install cloud-sql-proxy`, or download from Google)
- `gcloud` CLI authenticated with access to the `cds-servers` project
- The `cds-ansible-configs` repo cloned locally (for retrieving the database credentials)
The production connection string is stored in the Ansible vault. From the `cds-ansible-configs` repo:

```
ansible-vault view --vault-id prod@secret-manager-client roles/taiga2/vars/taiga.yaml
```

Copy the `SQLALCHEMY_DATABASE_URI` value from the output.
In the taiga repo root, create or edit `prod_settings.cfg` (this file is in `.gitignore` and will not be committed):

```
SQLALCHEMY_DATABASE_URI = 'postgresql://USER:PASSWORD@127.0.0.1:5432/taiga'
```

Use `127.0.0.1` as the host; the Cloud SQL Auth Proxy will tunnel the connection locally.
In a separate terminal:
```
cloud-sql-proxy cds-servers:us-central1:migrate-taiga-prod --port 5432
```

Wait until it prints "Ready for new connections" before proceeding.
```
cd ~/Github/taiga
TAIGA_SETTINGS_FILE=prod_settings.cfg ./flask db current
```

This should show the current Alembic revision. Verify it matches the expected parent of your new migration.
Depending on the nature of your changes, choose online or offline:
- Online-safe changes (additive only — new tables, new nullable columns): can be applied while the service is running. Apply the migration first, then deploy the new code.
- Offline changes (data migrations, column renames, dropping columns): stop the service first, then apply, then deploy.
- Take a snapshot of the database in the GCP console (recommended)
- Apply the migration (with Cloud SQL Auth Proxy running): `TAIGA_SETTINGS_FILE=prod_settings.cfg ./flask db upgrade`
- Deploy the new code — see Deployment to Production above
- Take a snapshot of the database in the GCP console
- Stop the service: `ssh ubuntu@cds.team sudo systemctl stop taiga`
- Apply the migration (with Cloud SQL Auth Proxy running): `TAIGA_SETTINGS_FILE=prod_settings.cfg ./flask db upgrade`
- Deploy the new code and start the service — see Deployment to Production above
The safest rollback path is to restore from the database snapshot taken before the migration. This is slow but guaranteed to be correct.
Alembic's `flask db downgrade` can be used for simple, trivially reversible migrations, but this is strongly discouraged.
If you are confident the downgrade script is valid:
```
TAIGA_SETTINGS_FILE=prod_settings.cfg ./flask db downgrade <previous_revision_id>
```

When you change SQLAlchemy models and need to create a new migration file:

```
TAIGA_SETTINGS_FILE=prod_settings.cfg ./flask db migrate -m "description of change"
```

Review the generated file in `migrations/versions/`. You may need to re-order table creation statements to satisfy foreign key dependencies.
Users are able to delete datasets through the UI. We do not allow undeletion directly, but in some extreme cases there is a way of un-deleting: the API has a deprecation endpoint (`/datasetVersion/{datasetVersionId}/deprecate`) which can be used to turn a deleted dataset version into a deprecated one.

You can use a curl request, e.g.:

```
curl -d '{"deprecationReason":"notNeeded"}' -H "Content-Type: application/json" -X POST http://cds.team/taiga/api/datasetVersion/{datasetVersionId_here}/deprecate
```
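The same request can be made from Python with only the standard library. This is a sketch mirroring the curl example; note it sends no auth header, which your deployment may require (an assumption to verify):

```python
import json
from urllib import request


def build_deprecate_request(dataset_version_id: str, reason: str) -> request.Request:
    """Build the POST request equivalent to the curl command above."""
    url = ("http://cds.team/taiga/api/datasetVersion/"
           f"{dataset_version_id}/deprecate")
    body = json.dumps({"deprecationReason": reason}).encode()
    return request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": "application/json"},
    )


# Sending it performs a network call, so it is shown commented out:
# with request.urlopen(build_deprecate_request("some-id", "notNeeded")) as resp:
#     print(resp.status)
```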
Feel free to make a contribution and then submit a Pull Request!
We use Git for versioning! If you don't know how to use it, we strongly recommend doing this tutorial.
- Philip Montgomery - Initial work + advising on the current development + data processing
- Remi Marenco - Prototype + current development
- Cancer Data Science
- Broad Institute