Skip to content

Latest commit

 

History

History
70 lines (69 loc) · 5 KB

File metadata and controls

70 lines (69 loc) · 5 KB

This is aset of governance standards will provide a foundation for managing ML projects effectively and professionally. Remember to adapt and extend these standards as the project evolves. Having a well-defined governance framework is crucial for long-term success.

I. Versioning and Metadata Management:

  1. Project Versioning:
    • Semantic Versioning (SemVer) will be used for the overall project version (e.g., v1.2.3). This allows for clear communication of changes between releases.
    • The project version will be stored in the project_config.yaml file and updated with each release.
    • Git tags will be used to mark specific project versions in the repository.
  2. Component Versioning:
    • Versioning will be applied to individual components (data schemas, models, etc.) where applicable.
    • Component versions will also follow SemVer and be stored in the project_config.yaml file.
    • The project_metadata.json file will track the specific versions of each component used in a given project instance.
  3. Metadata Store:
    • A project_metadata.json file will serve as the central repository for all project metadata.
    • This file will include:
      • Project and component versions.
      • Data schema versions.
      • Model versions.
      • Docker image tags.
      • Creation timestamps.
      • Any other relevant metadata.
      • Metadata Updates:
    • Metadata will be updated programmatically using the update_metadata.py script (or a similar mechanism).
    • Updates will occur after key events, such as model training, data schema changes, or deployment.
    • This ensures that the metadata is always consistent and up-to-date.

II. Dependency Management:

  1. Dependency Declaration:
    • All project dependencies will be explicitly declared in the requirements.txt file.
    • This file will be generated and managed by the create_project.py script based on the dependencies specified in project_config.yaml.
  2. Virtual Environments:
    • Virtual environments will be used for all projects to isolate dependencies and prevent conflicts.
    • The setup script will automatically create a virtual environment if one doesn't exist.
  3. Dependency Updates:
    • Dependencies will be reviewed and updated periodically to ensure compatibility and access to the latest features and security patches.
    • Dependency updates will be tested thoroughly before being incorporated into the project.

III. Security:

  1. Service Accounts (GCP):
    • Separate service accounts will be used for different components (e.g., DataFlow, Cloud Functions) to follow the principle of least privilege.
    • The governance section of the project_config.yaml file will define the allowed service accounts.
    • Components will reference these governance-defined service accounts using variable substitution (e.g., $governance.service_account_dataflow).
  2. Secrets Management:
    • Sensitive information (API keys, passwords, etc.) will never be stored directly in the configuration files or code.
    • A secure secrets management solution such as Vault or Google Cloud Secret Manager) will be used to store and access secrets.
  3. Code Reviews:
    • All code changes will undergo a code review process before being merged into the main branch.
    • Code reviews will focus on security best practices, code quality, and adherence to the project's coding standards.

IV. Reproducibility:

  1. Configuration Management:
    • All project configurations will be stored in version-controlled configuration files (e.g., project_config.yaml).
    • This ensures that the project can be easily reproduced in different environments.
  2. Automated Build and Deployment:
    • Automated build and deployment pipelines will be used to minimize manual steps and ensure consistency.
    • These pipelines will use the configuration files and metadata to build and deploy the project.
  3. Data Versioning:
    • Data versioning will be implemented using appropriate tools (e.g., DVC, Git LFS) to track changes in datasets.
    • This allows for reproducibility of experiments and models.

V. Component Standards:

  1. Naming Conventions:
    • Components will follow a consistent naming convention to clearly indicate their purpose and version (e.g., dataflow_process_files_and_load_to_bq_v1_2).
  2. Code Style:
    • Code will adhere to a defined style guide (e.g., PEP 8 for Python) to maintain consistency and readability.
  3. Documentation:
    • All components will be well-documented, including their purpose, inputs, outputs, and dependencies.

VI. Governance Process:

  1. Governance Committee:
    • A governance committee will be responsible for defining, maintaining, and enforcing these governance standards.
  2. Standard Updates:
    • These standards will be reviewed and updated periodically to reflect best practices and evolving project needs.
  3. Enforcement:
    • The governance committee will ensure that all projects adhere to these standards.