| 1 | +## [Main title](../README.md) |
| 2 | +### [Interview questions](full.md) |
| 3 | +# |
| 4 | +# Cost Optimization |
| 5 | ++ [Why is cost optimization a core data engineering skill?](#Why-is-cost-optimization-a-core-data-engineering-skill) |
| 6 | ++ [What are the main cost drivers in data platforms?](#What-are-the-main-cost-drivers-in-data-platforms) |
| 7 | ++ [How does partitioning affect cost and performance?](#How-does-partitioning-affect-cost-and-performance) |
| 8 | ++ [What is the small files problem and why does it increase cost?](#What-is-the-small-files-problem-and-why-does-it-increase-cost) |
| 9 | ++ [How do you choose a target file size for Parquet tables?](#How-do-you-choose-a-target-file-size-for-Parquet-tables) |
| 10 | ++ [How do you detect and fix data skew in distributed processing?](#How-do-you-detect-and-fix-data-skew-in-distributed-processing) |
| 11 | ++ [What is shuffle and how do you reduce it in Spark?](#What-is-shuffle-and-how-do-you-reduce-it-in-Spark) |
| 12 | ++ [When should you pre-aggregate or materialize tables?](#When-should-you-pre-aggregate-or-materialize-tables) |
| 13 | ++ [How do you prevent runaway queries and protect shared clusters?](#How-do-you-prevent-runaway-queries-and-protect-shared-clusters) |
| 14 | ++ [How do you optimize joins in large-scale analytics?](#How-do-you-optimize-joins-in-large-scale-analytics) |
| 15 | ++ [How do you plan and estimate the cost of a backfill?](#How-do-you-plan-and-estimate-the-cost-of-a-backfill) |
| 16 | ++ [What metrics would you track for FinOps in data engineering?](#What-metrics-would-you-track-for-FinOps-in-data-engineering) |
| 17 | + |
| 18 | +## Why is cost optimization a core data engineering skill? |
| 19 | +Costs on data platforms can grow linearly, or faster, with data volume and usage. Cost optimization keeps the platform sustainable while still meeting SLAs. It requires understanding storage layout, compute behavior, query patterns, and operational practices such as backfills and compaction.
| 20 | + |
| 21 | +[Table of Contents](#Cost-Optimization) |
| 22 | + |
| 23 | +## What are the main cost drivers in data platforms? |
| 24 | +Typical drivers include: |
| 25 | ++ compute time (clusters, warehouses, serverless slots) |
| 26 | ++ bytes scanned and shuffles |
| 27 | ++ storage growth (raw + derived + duplicates) |
| 28 | ++ data movement (egress, cross-region transfers) |
| 29 | ++ operational overhead (frequent backfills, retries) |
| 30 | + |
| 31 | +[Table of Contents](#Cost-Optimization) |
| 32 | + |
| 33 | +## How does partitioning affect cost and performance? |
| 34 | +Good partitioning reduces scanned data by enabling partition pruning. Bad partitioning creates too many small partitions, increasing metadata overhead and small files. Partitioning should match common query filters and data distribution, and be reevaluated as workloads evolve. |
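To make the effect of partition pruning concrete, here is a minimal sketch that compares bytes scanned with and without a date filter. The table layout (30 daily partitions of 10 GB each) and the helper name `bytes_scanned` are hypothetical, chosen only for illustration.

```python
from datetime import date

GB = 1024 ** 3


def bytes_scanned(partition_sizes, start=None, end=None):
    """Sum bytes over partitions whose date key falls in [start, end]."""
    return sum(
        size
        for day, size in partition_sizes.items()
        if (start is None or day >= start) and (end is None or day <= end)
    )


# Hypothetical table: 30 daily partitions of 10 GB each.
partition_sizes = {date(2024, 1, d): 10 * GB for d in range(1, 31)}

# No filter on the partition column: every partition is read.
full_scan = bytes_scanned(partition_sizes)

# Filter on the partition column: only the last 7 days are read.
pruned_scan = bytes_scanned(partition_sizes, date(2024, 1, 24), date(2024, 1, 30))
```

In this toy setup the filtered query reads 70 GB instead of 300 GB. The saving only materializes when the query filters on the partition column itself.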
| 35 | + |
| 36 | +[Table of Contents](#Cost-Optimization) |
| 37 | + |
| 38 | +## What is the small files problem and why does it increase cost? |
| 39 | +Small files increase scheduling and metadata overhead and reduce scan efficiency. Engines spend more time opening and planning files than processing data. Small files often appear from streaming writes or overly granular partitions and typically require compaction or clustering to fix. |
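A rough way to reason about the overhead is a simple cost model: a fixed per-file cost (open, metadata, task scheduling) plus the time to read the bytes. The model and its parameters below are assumptions for illustration, not measurements from any particular engine.

```python
import math

MB = 1024 ** 2


def estimated_scan_seconds(total_bytes, file_bytes, per_file_overhead_s, throughput_bps):
    """Toy model: fixed overhead per file plus sequential read time."""
    n_files = math.ceil(total_bytes / file_bytes)
    return n_files * per_file_overhead_s + total_bytes / throughput_bps


# Same 1 GiB of data, stored as 1 MiB files versus 256 MiB files.
# The 50 ms per-file overhead and 100 MiB/s throughput are assumed values.
small_files = estimated_scan_seconds(1024 * MB, 1 * MB, 0.05, 100 * MB)
large_files = estimated_scan_seconds(1024 * MB, 256 * MB, 0.05, 100 * MB)
```

Under these assumed numbers the small-file layout spends most of its time on per-file overhead rather than on reading data, which is the pattern compaction is meant to remove.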
| 40 | + |
| 41 | +[Table of Contents](#Cost-Optimization) |
| 42 | + |
| 43 | +## How do you choose a target file size for Parquet tables? |
| 44 | +You choose a size that balances parallelism and overhead. Too small increases file count and planning cost; too large reduces parallelism and can slow selective queries. Many teams target hundreds of MB per file, but the correct value depends on the engine, storage, and query patterns. |
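Once a target size is chosen, compaction amounts to grouping existing files into outputs near that size. The sketch below uses a simple greedy grouping (largest files first); real table formats ship their own compaction planners, so `plan_compaction` is a hypothetical helper that only illustrates the idea.

```python
def plan_compaction(file_sizes, target_bytes):
    """Greedily group files, largest first, into outputs of at most target_bytes.

    A single file that already exceeds target_bytes ends up in a group of its own.
    """
    groups = []
    current, current_size = [], 0
    for size in sorted(file_sizes, reverse=True):
        # Close the current group when adding this file would overshoot the target.
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Sizes in MB for readability; six input files and a 100 MB target.
groups = plan_compaction([10, 20, 30, 40, 50, 60], target_bytes=100)
```

Here six input files become three output files. Greedy grouping is not optimal bin packing, but it is usually close enough for estimating how many files a compaction run will produce.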
| 45 | + |
| 46 | +[Table of Contents](#Cost-Optimization) |
| 47 | + |
| 48 | +## How do you detect and fix data skew in distributed processing? |
| 49 | +Skew happens when some partitions have much more data than others, causing straggler tasks. Detect it via task duration distributions and partition size metrics. Fixes include salting keys, using skew-aware joins, repartitioning, or changing the join strategy (broadcast when possible). |
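A minimal sketch of both halves, detection and one possible fix, is shown here. The max-to-median ratio is one common heuristic among several, and `skew_ratio` and `salt_key` are hypothetical helper names.

```python
import random
from statistics import median


def skew_ratio(partition_sizes):
    """Ratio of the largest partition to the median partition size."""
    return max(partition_sizes) / median(partition_sizes)


def salt_key(key, num_salts, rng=random):
    """Append a random salt so one hot key spreads over num_salts sub-keys."""
    return f"{key}#{rng.randrange(num_salts)}"


# One partition is ten times larger than the typical one.
ratio = skew_ratio([10, 10, 10, 100])

# Salting a hot key spreads its rows over several sub-keys.
rng = random.Random(0)
salted = {salt_key("hot_customer", 8, rng) for _ in range(1000)}
```

Salting only works for joins if the other side is expanded to carry every salt value for the affected keys, so it trades extra data on the small side for better balance on the large side.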
| 50 | + |
| 51 | +[Table of Contents](#Cost-Optimization) |
| 52 | + |
| 53 | +## What is shuffle and how do you reduce it in Spark? |
| 54 | +Shuffle is the redistribution of data across executors, triggered by operations such as joins, group by, and distinct. It is expensive because of network IO and disk spills. You reduce shuffle by partitioning data on the keys you join or aggregate by, avoiding unnecessary wide transformations, filtering early, broadcasting small tables, and tuning the number of shuffle partitions (`spark.sql.shuffle.partitions`).
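For the tuning part, a common rule of thumb is to size shuffle partitions so each one holds a manageable amount of data. The sketch below assumes a 128 MB target per partition, which is an illustrative default rather than a recommendation from Spark itself; the result is the kind of value you might pass to `spark.sql.shuffle.partitions`.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3


def suggested_shuffle_partitions(shuffle_bytes, target_partition_bytes=128 * MB, min_partitions=1):
    """Number of partitions so each holds roughly target_partition_bytes."""
    return max(min_partitions, math.ceil(shuffle_bytes / target_partition_bytes))


# A stage that shuffles about 100 GB.
partitions = suggested_shuffle_partitions(100 * GB)
```

With adaptive query execution enabled, Spark can coalesce shuffle partitions at runtime, so a static estimate like this matters most when that feature is off or when its defaults do not fit the workload.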
| 55 | + |
| 56 | +[Table of Contents](#Cost-Optimization) |
| 57 | + |
| 58 | +## When should you pre-aggregate or materialize tables? |
| 59 | +Materialize when many downstream queries reuse expensive computations or when interactive BI requires low latency. Avoid over-materialization because it increases storage and refresh complexity. A good approach is to materialize stable, high-value marts and keep exploratory logic as views. |
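The trade-off can be framed as a break-even check: materialization pays off when the daily savings on queries exceed the daily refresh and storage overhead. The function and its cost inputs are hypothetical, and it ignores latency benefits, which often justify materialization on their own.

```python
def materialization_saves_money(
    queries_per_day,
    cost_per_adhoc_query,
    cost_per_mart_query,
    refresh_cost_per_day,
    storage_cost_per_day,
):
    """True when query savings outweigh refresh plus storage cost per day."""
    daily_savings = queries_per_day * (cost_per_adhoc_query - cost_per_mart_query)
    daily_overhead = refresh_cost_per_day + storage_cost_per_day
    return daily_savings > daily_overhead


# A heavily reused computation versus a rarely used one (assumed costs).
popular = materialization_saves_money(50, 2.0, 0.1, 20.0, 1.0)
rare = materialization_saves_money(2, 2.0, 0.1, 20.0, 1.0)
```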
| 60 | + |
| 61 | +[Table of Contents](#Cost-Optimization) |
| 62 | + |
| 63 | +## How do you prevent runaway queries and protect shared clusters? |
| 64 | +Use workload management: query timeouts, concurrency limits, resource quotas, and separate compute for heavy workloads. Enforce best practices with guardrails (linting, cost alerts) and educate users with profiling tools. Multi-tenant platforms typically need isolation to prevent one team from impacting others. |
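As an illustration of a guardrail, the sketch below checks a query's estimated scan size against a per-query limit and a per-team daily quota before admitting it. Real engines expose these controls through their own workload management settings; `check_query` and its parameters are hypothetical.

```python
def check_query(estimated_bytes, max_bytes_per_query, team_bytes_used, team_daily_quota):
    """Return (allowed, reason) for a query based on scan limits."""
    if estimated_bytes > max_bytes_per_query:
        return (False, "query exceeds per-query scan limit")
    if team_bytes_used + estimated_bytes > team_daily_quota:
        return (False, "team daily quota exceeded")
    return (True, "ok")


TB = 1024 ** 4

# Assumed limits: 1 TB per query, 10 TB per team per day.
too_big = check_query(2 * TB, 1 * TB, 0, 10 * TB)
over_quota = check_query(1 * TB, 1 * TB, int(9.5 * TB), 10 * TB)
allowed = check_query(1 * TB, 1 * TB, 2 * TB, 10 * TB)
```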
| 65 | + |
| 66 | +[Table of Contents](#Cost-Optimization) |
| 67 | + |
| 68 | +## How do you optimize joins in large-scale analytics? |
| 69 | +You optimize joins by ensuring join keys are clean and well-distributed, filtering before joins, and choosing appropriate join strategies (broadcast vs shuffle). You also reduce the size of join inputs (select only needed columns) and consider pre-joining into curated marts when joins are repeated. |
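The broadcast-versus-shuffle decision can be sketched as a size check on the smaller input. The 10 MB default below mirrors Spark's default for `spark.sql.autoBroadcastJoinThreshold`; the function `choose_join_strategy` is a hypothetical simplification, since a real optimizer also weighs statistics, join type, and hints.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def choose_join_strategy(left_bytes, right_bytes, broadcast_threshold_bytes=10 * MB):
    """Pick a broadcast join when the smaller side fits under the threshold."""
    smaller = min(left_bytes, right_bytes)
    if smaller <= broadcast_threshold_bytes:
        side = "left" if left_bytes <= right_bytes else "right"
        return f"broadcast_{side}"
    return "shuffle"


# Large fact table joined to a small dimension table.
fact_dim = choose_join_strategy(500 * GB, 5 * MB)

# Two large tables: neither side is small enough to broadcast.
fact_fact = choose_join_strategy(2 * GB, 3 * GB)
```

Selecting only the needed columns and filtering before the join shrinks the inputs, which can move a join from the shuffle case into the broadcast case.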
| 70 | + |
| 71 | +[Table of Contents](#Cost-Optimization) |
| 72 | + |
| 73 | +## How do you plan and estimate the cost of a backfill? |
| 74 | +Estimate based on data volume, compute requirements, and expected scan/shuffle behavior. Run on a small sample window to measure throughput and extrapolate. Backfills should be staged, monitored, and preferably run in off-peak windows, with a clear rollback plan. |
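The sample-and-extrapolate step can be sketched as follows. It assumes cost and runtime scale linearly with the number of days processed, which may not hold if data volume per day varies or if shuffle behavior changes at scale; the safety factor is an assumed buffer for that uncertainty.

```python
def estimate_backfill(sample_days, sample_runtime_hours, sample_cost, total_days, safety_factor=1.2):
    """Linearly extrapolate runtime and cost from a sample window."""
    scale = total_days / sample_days
    return {
        "runtime_hours": sample_runtime_hours * scale * safety_factor,
        "cost": sample_cost * scale * safety_factor,
    }


# Hypothetical sample: 5 days took 2 hours and cost 10 currency units.
estimate = estimate_backfill(
    sample_days=5,
    sample_runtime_hours=2.0,
    sample_cost=10.0,
    total_days=100,
    safety_factor=1.5,
)
```

Comparing the estimate against actuals after the first staged batch gives an early signal on whether the linear assumption holds before committing to the full run.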
| 75 | + |
| 76 | +[Table of Contents](#Cost-Optimization) |
| 77 | + |
| 78 | +## What metrics would you track for FinOps in data engineering? |
| 79 | +Common metrics include cost by team/project, cost per pipeline, bytes scanned per query, cluster utilization, storage growth by layer, and top expensive datasets/queries. You also track trend changes (week-over-week) and tie cost to value (usage, criticality). |
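Two of these metrics, cost by team and week-over-week change, can be computed from billing records with very little code. The record shape (`team`, `cost`) is a hypothetical schema used only for this sketch.

```python
from collections import defaultdict


def cost_by_team(records):
    """Total cost per team from a list of billing records."""
    totals = defaultdict(float)
    for record in records:
        totals[record["team"]] += record["cost"]
    return dict(totals)


def week_over_week_change(previous, current):
    """Relative change versus last week; None when there is no baseline."""
    if previous == 0:
        return None
    return (current - previous) / previous


# Hypothetical billing records.
records = [
    {"team": "ads", "cost": 100.0},
    {"team": "search", "cost": 50.0},
    {"team": "ads", "cost": 50.0},
]
totals = cost_by_team(records)
change = week_over_week_change(previous=100.0, current=125.0)
```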
| 80 | + |
| 81 | +[Table of Contents](#Cost-Optimization) |
| 82 | + |