Skip to content

Recommendations about chunk sizes #22

Description

@jeromekelleher

We currently say nothing at all about chunk sizes, but I think we will need to provide some rules/guidance in order to make processing arrays efficient. For example, it really does help a lot of call-level arrays all have the same chunking (in the variants and samples dimension) so that code can read in (say) genotypes and DP values chunk-by-chunk in the same loop.

Currently vcf2zarr enforces a uniform chunk size across dimensions, so that we have one variants_chunk_size. While this is a useful simplification, it does have some drawbacks, particularly when we want to read in all of a low-dimensional array at once (e.g., ``variant_position). See #21 for discussion and some benchmarks on this point.

This would need some feedback from a variety of implementations and use-cases, I think.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions