Commit ffa29c0

Add Hudi, cost optimization, Python, and system design Q&A
Add four new topic pages (English Q&A) in the existing repo format and link them from README and the full index.
1 parent a1154d4 commit ffa29c0

6 files changed

Lines changed: 399 additions & 0 deletions

README.md

Lines changed: 25 additions & 0 deletions
@@ -107,6 +107,13 @@
 <th>Apache Iceberg is an open table format for huge analytic datasets.</th>
 <th><a href="https://iceberg.apache.org/docs/latest/">Iceberg docs</a></th>
 </tr>
+<tr>
+<th><a href="https://github.com/apache/hudi"><img style="vertical-align:middle" src="img/icon/github.ico" alt="Hudi"></a></th>
+<th><a href="https://hudi.apache.org/"><img style="vertical-align:middle" src="img/icon/fire.ico" alt="Hudi"></a></th>
+<th><a href="./content/hudi.md">Apache Hudi</a></th>
+<th>Apache Hudi brings upserts, deletes, and incremental processing to data lakes.</th>
+<th><a href="https://hudi.apache.org/docs/overview/">Hudi docs</a></th>
+</tr>
 <th colspan="5"><a></a></th>
 <tr>
 <th colspan="5">Big Data Frameworks</th>
@@ -251,6 +258,24 @@
 <th><a href="./content/data-governance.md">Data Governance</a></th>
 <th>Ownership, policies, privacy, and access controls for data platforms.</th>
 <th><a href="https://github.com/datahub-project/datahub">DataHub</a></th>
+</tr>
+<tr>
+<th colspan="2"><a href="./content/cost-optimization.md"><img style="vertical-align:middle" src="img/icon/fire.ico" alt="Cost Optimization"></a></th>
+<th><a href="./content/cost-optimization.md">Cost Optimization</a></th>
+<th>Practical techniques to reduce compute and storage costs while meeting SLAs.</th>
+<th><a href="https://spark.apache.org/docs/latest/tuning.html">Spark tuning</a></th>
+</tr>
+<tr>
+<th colspan="2"><a href="./content/python.md"><img style="vertical-align:middle" src="img/icon/fire.ico" alt="Python"></a></th>
+<th><a href="./content/python.md">Python for Data Engineering</a></th>
+<th>Python fundamentals for reliable, scalable data pipelines and tooling.</th>
+<th><a href="https://arrow.apache.org/docs/python/">PyArrow docs</a></th>
+</tr>
+<tr>
+<th colspan="2"><a href="./content/system-design.md"><img style="vertical-align:middle" src="img/icon/fire.ico" alt="System Design"></a></th>
+<th><a href="./content/system-design.md">Data System Design</a></th>
+<th>System design interview questions for batch/streaming data platforms.</th>
+<th><a href="https://martinfowler.com/articles/data-monolith-to-mesh.html">Data mesh overview</a></th>
 </tr>
 <tr>
 <th colspan="2"><a href="./content/data-structure.md"><img style="vertical-align:middle" src="img/icon/datastruct.ico" alt="Airflow"></a></th>

content/cost-optimization.md

Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
## [Main title](../README.md)
### [Interview questions](full.md)
#
# Cost Optimization
+ [Why is cost optimization a core data engineering skill?](#Why-is-cost-optimization-a-core-data-engineering-skill)
+ [What are the main cost drivers in data platforms?](#What-are-the-main-cost-drivers-in-data-platforms)
+ [How does partitioning affect cost and performance?](#How-does-partitioning-affect-cost-and-performance)
+ [What is the small files problem and why does it increase cost?](#What-is-the-small-files-problem-and-why-does-it-increase-cost)
+ [How do you choose a target file size for Parquet tables?](#How-do-you-choose-a-target-file-size-for-Parquet-tables)
+ [How do you detect and fix data skew in distributed processing?](#How-do-you-detect-and-fix-data-skew-in-distributed-processing)
+ [What is shuffle and how do you reduce it in Spark?](#What-is-shuffle-and-how-do-you-reduce-it-in-Spark)
+ [When should you pre-aggregate or materialize tables?](#When-should-you-pre-aggregate-or-materialize-tables)
+ [How do you prevent runaway queries and protect shared clusters?](#How-do-you-prevent-runaway-queries-and-protect-shared-clusters)
+ [How do you optimize joins in large-scale analytics?](#How-do-you-optimize-joins-in-large-scale-analytics)
+ [How do you plan and estimate the cost of a backfill?](#How-do-you-plan-and-estimate-the-cost-of-a-backfill)
+ [What metrics would you track for FinOps in data engineering?](#What-metrics-would-you-track-for-FinOps-in-data-engineering)

## Why is cost optimization a core data engineering skill?
Data platform costs can grow linearly, or faster, with data volume and usage. Cost optimization keeps the platform sustainable while still meeting SLAs. It requires understanding storage layout, compute behavior, query patterns, and operational practices such as backfills and compaction.

[Table of Contents](#Cost-Optimization)

## What are the main cost drivers in data platforms?
Typical drivers include:
+ compute time (clusters, warehouses, serverless slots)
+ bytes scanned and shuffled
+ storage growth (raw + derived + duplicates)
+ data movement (egress, cross-region transfers)
+ operational overhead (frequent backfills, retries)

[Table of Contents](#Cost-Optimization)

## How does partitioning affect cost and performance?
Good partitioning reduces the data scanned by enabling partition pruning: the engine skips partitions that cannot match the query filter. Over-partitioning, for example on a high-cardinality column, creates many tiny partitions, which inflates metadata overhead and produces small files. Partition on the columns that common query filters use, check that data is spread reasonably evenly across partitions, and reevaluate the scheme as workloads evolve.
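
As a rough illustration of partition pruning, the sketch below assumes Hive-style `dt=YYYY-MM-DD` directory names. The file paths, the sizes, and the `prune` helper are hypothetical and stand in for what a query engine does internally.

```python
# Hypothetical catalog: (path, size in bytes) for a table partitioned by date.
FILES = [
    ("events/dt=2024-01-01/part-0.parquet", 400),
    ("events/dt=2024-01-02/part-0.parquet", 350),
    ("events/dt=2024-01-03/part-0.parquet", 500),
]


def partition_value(path: str, column: str) -> str:
    """Extract the value of a Hive-style partition column from a path."""
    for segment in path.split("/"):
        if segment.startswith(column + "="):
            return segment.split("=", 1)[1]
    raise ValueError(f"{column} not found in {path}")


def prune(files, column, wanted):
    """Keep only files whose partition value is in `wanted`."""
    return [(p, s) for p, s in files if partition_value(p, column) in wanted]


full_scan = sum(size for _, size in FILES)
pruned_scan = sum(size for _, size in prune(FILES, "dt", {"2024-01-02"}))
print(full_scan, pruned_scan)  # 1250 350
```

A filter on `dt` lets the engine read one directory instead of three; a filter on a non-partition column would still touch every file.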

[Table of Contents](#Cost-Optimization)

## What is the small files problem and why does it increase cost?
Small files increase scheduling and metadata overhead and reduce scan efficiency: engines can spend more time listing, opening, and planning files than processing data. Small files often come from streaming writes or overly granular partitions, and fixing them typically requires compaction or clustering.
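
One way to picture compaction is as bin packing: small files are grouped until each group approaches a target size, and each group is rewritten as one file. The sketch below is a simplified greedy planner; `plan_compaction` and the sizes are assumptions for illustration, not how any particular table format implements it.

```python
def plan_compaction(file_sizes, target_size):
    """Greedily group small files (sizes in bytes) into bins of at most target_size.

    Files already at or above target_size are left alone.
    """
    small = sorted((s for s in file_sizes if s < target_size), reverse=True)
    groups = []
    for size in small:
        for group in groups:
            if sum(group) + size <= target_size:
                group.append(size)
                break
        else:
            # No existing group has room, so start a new one.
            groups.append([size])
    return groups


MB = 1024 * 1024
sizes = [5 * MB] * 100 + [256 * MB]  # assumed layout: 100 tiny files, 1 large file
groups = plan_compaction(sizes, target_size=128 * MB)
print(len(sizes), "files ->", len(groups) + 1, "files after compaction")
```

With these assumed numbers, 101 files collapse to 5: four rewritten groups plus the untouched large file.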

[Table of Contents](#Cost-Optimization)

## How do you choose a target file size for Parquet tables?
Choose a size that balances parallelism against per-file overhead. Files that are too small increase file count and planning cost; files that are too large reduce parallelism and can slow selective queries. Many teams target a few hundred MB per file, but the right value depends on the engine, the storage system, and query patterns.
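
The arithmetic behind a target file size is simple enough to sketch. The 10 GiB partition and 256 MiB target below are assumed numbers, and `output_file_count` is a hypothetical helper.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3


def output_file_count(partition_bytes: int, target_file_bytes: int) -> int:
    """Number of files to write so each is close to the target size."""
    return max(1, math.ceil(partition_bytes / target_file_bytes))


# Assumed numbers: a 10 GiB partition and a 256 MiB target.
n_files = output_file_count(10 * GB, 256 * MB)
print(n_files)  # 40
```

The result can feed whatever knob the writer exposes for output parallelism, for example the number of partitions a job repartitions to before writing.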

[Table of Contents](#Cost-Optimization)

## How do you detect and fix data skew in distributed processing?
Skew happens when some partitions hold much more data than others, so a few straggler tasks dominate the stage runtime. Detect it from task duration distributions and partition size metrics. Fixes include salting hot keys, using skew-aware joins, repartitioning, or changing the join strategy (broadcast when one side is small enough).
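
A minimal sketch of both halves, detection and salting, using plain Python counters in place of real partition metrics. The workload, the max-to-median ratio as a skew signal, and the `salted_key` helper are assumptions chosen for illustration.

```python
import random
from collections import Counter
from statistics import median


def skew_ratio(partition_sizes):
    """Largest partition divided by the median partition size."""
    return max(partition_sizes) / median(partition_sizes)


def salted_key(key, num_salts, rng):
    """Append a random salt so one hot key spreads over num_salts buckets."""
    return f"{key}#{rng.randrange(num_salts)}"


# Assumed workload: one hot key dominates.
keys = ["hot"] * 9_000 + ["a"] * 500 + ["b"] * 500
before = Counter(keys)

rng = random.Random(0)
after = Counter(salted_key(k, 8, rng) if k == "hot" else k for k in keys)

print(skew_ratio(list(before.values())))  # 18.0
print(max(after.values()) < max(before.values()))  # True
```

Salting has a cost: in a join, the other side must be replicated once per salt value so that every salted key still finds its match.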

[Table of Contents](#Cost-Optimization)

## What is shuffle and how do you reduce it in Spark?
Shuffle is the redistribution of data across executors, triggered by operations such as joins, `GROUP BY`, and `DISTINCT`. It is expensive because of network IO and disk spills. Reduce it by partitioning data appropriately, avoiding unnecessary wide transformations, filtering early, broadcasting small tables, and tuning the number of shuffle partitions (`spark.sql.shuffle.partitions` for Spark SQL).
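
To make "filtering early" concrete, the toy model below routes records to partitions by key hash, the way a shuffle does, and counts how many records move with and without an upstream filter. It is a plain-Python simulation with made-up data, not Spark code, and real optimizers often push such filters down on their own.

```python
from collections import defaultdict


def shuffle(records, num_partitions):
    """Route each (key, value) record to a partition by hashing its key."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


# Assumed input: (user_id, country) events; only "DE" rows are needed downstream.
events = [(user_id, "DE" if user_id % 10 == 0 else "US") for user_id in range(1_000)]

late = shuffle(events, num_partitions=4)  # the filter would run after the shuffle
early = shuffle([e for e in events if e[1] == "DE"], num_partitions=4)

moved_late = sum(len(p) for p in late.values())
moved_early = sum(len(p) for p in early.values())
print(moved_late, moved_early)  # 1000 100
```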

[Table of Contents](#Cost-Optimization)

## When should you pre-aggregate or materialize tables?
Materialize when many downstream queries reuse an expensive computation or when interactive BI needs low latency. Avoid over-materialization, because every extra table adds storage cost and refresh complexity. A good approach is to materialize stable, high-value marts and keep exploratory logic as views.
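
The trade-off can be framed as a break-even calculation. The sketch below uses a deliberately simple cost model (per-query compute, daily refresh, daily storage) with invented figures in cents; `daily_saving` is a hypothetical helper and real pricing will have more terms.

```python
def daily_saving(queries_per_day, cost_per_query_raw, cost_per_query_mart,
                 refresh_cost_per_day, storage_cost_per_day):
    """Positive result: materializing is cheaper under these assumptions."""
    on_the_fly = queries_per_day * cost_per_query_raw
    materialized = (queries_per_day * cost_per_query_mart
                    + refresh_cost_per_day + storage_cost_per_day)
    return on_the_fly - materialized


# Assumed figures in cents: 50 per raw query, 2 per mart query,
# 1200 per daily refresh, 100 per day of storage.
print(daily_saving(200, 50, 2, 1200, 100))  # 8300
print(daily_saving(5, 50, 2, 1200, 100))    # -1060
```

With 200 queries a day the mart pays for itself; with 5 it costs more than recomputing, which is the over-materialization case.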

[Table of Contents](#Cost-Optimization)

## How do you prevent runaway queries and protect shared clusters?
Use workload management: query timeouts, concurrency limits, resource quotas, and separate compute for heavy workloads. Enforce best practices with guardrails (query linting, cost alerts) and give users profiling tools so they can see what their queries cost. Multi-tenant platforms typically need isolation so that one team cannot degrade others.
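
A guardrail of this kind can be sketched as an admission check that runs before a query is submitted. Everything here is hypothetical: the `QueryLimits` fields, the `admit` function, and the quota values are illustrative, and real engines expose their own workload-management settings.

```python
from dataclasses import dataclass


@dataclass
class QueryLimits:
    max_scan_bytes: int
    max_runtime_s: int


class QueryRejected(Exception):
    """Raised when a query exceeds its workload's quota."""


def admit(estimated_scan_bytes: int, limits: QueryLimits) -> int:
    """Reject over-quota queries; otherwise return the timeout to enforce."""
    if estimated_scan_bytes > limits.max_scan_bytes:
        raise QueryRejected(
            f"estimated scan {estimated_scan_bytes} B exceeds quota {limits.max_scan_bytes} B"
        )
    return limits.max_runtime_s


GB = 1024 ** 3
TB = 1024 ** 4
adhoc = QueryLimits(max_scan_bytes=1 * TB, max_runtime_s=600)  # assumed quota

print(admit(200 * GB, adhoc))  # 600
try:
    admit(5 * TB, adhoc)
except QueryRejected as exc:
    print("rejected:", exc)
```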

[Table of Contents](#Cost-Optimization)

## How do you optimize joins in large-scale analytics?
Optimize joins by keeping join keys clean and well distributed, filtering before the join, and choosing an appropriate strategy (broadcast vs. shuffle). Also shrink the join inputs by selecting only the needed columns, and consider pre-joining into curated marts when the same join is repeated.
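
The idea behind a broadcast join can be shown in a few lines: build a hash table from the small side and stream the large side past it, so the large side never needs to be shuffled. The function name and the sample rows are hypothetical, and the sketch assumes unique keys on the small side.

```python
def broadcast_hash_join(large_rows, small_rows, key):
    """Inner join that builds an in-memory hash table from the small side.

    Only the small side is copied to every worker; the large side is
    streamed in place.
    """
    lookup = {row[key]: row for row in small_rows}  # assumes unique keys
    for row in large_rows:
        match = lookup.get(row[key])
        if match is not None:
            yield {**row, **match}


countries = [{"country_id": 1, "name": "DE"}, {"country_id": 2, "name": "US"}]
orders = [
    {"order_id": 10, "country_id": 1},
    {"order_id": 11, "country_id": 2},
    {"order_id": 12, "country_id": 3},  # no match, dropped by the inner join
]

joined = list(broadcast_hash_join(orders, countries, "country_id"))
print(joined)
```

This only works while the small side fits comfortably in memory on each worker; beyond that, a shuffle-based join is the safer choice.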

[Table of Contents](#Cost-Optimization)

## How do you plan and estimate the cost of a backfill?
Estimate from data volume, compute requirements, and expected scan/shuffle behavior. Run the job on a small sample window, measure throughput, and extrapolate. Stage and monitor the backfill, prefer off-peak windows, and have a clear rollback plan.
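
The sample-and-extrapolate step can be sketched as follows. It assumes runtime scales linearly with the number of days processed, which only holds if daily volume and cluster throughput are roughly stable; the function name and all figures are invented.

```python
def estimate_backfill(sample_days, sample_runtime_h, total_days, nodes, node_cost_per_h):
    """Linear extrapolation from a sample window; returns (hours, cost)."""
    hours_per_day_of_data = sample_runtime_h / sample_days
    total_hours = hours_per_day_of_data * total_days
    return total_hours, total_hours * nodes * node_cost_per_h


# Assumed measurement: 7 days of data took 3.5 hours on 10 nodes at 2.0 per node-hour.
hours, cost = estimate_backfill(
    sample_days=7, sample_runtime_h=3.5, total_days=365, nodes=10, node_cost_per_h=2.0
)
print(hours, cost)  # 182.5 3650.0
```

It is worth padding the estimate for retries and for older periods whose data volume or layout differs from the sample.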

[Table of Contents](#Cost-Optimization)

## What metrics would you track for FinOps in data engineering?
Common metrics include cost by team/project, cost per pipeline, bytes scanned per query, cluster utilization, storage growth by layer, and the most expensive datasets and queries. Also track trends (week-over-week change) and tie cost to value (usage, criticality).
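
Two of these metrics, cost by team and week-over-week change, can be sketched from tagged billing records. The record layout, team names, and amounts below are hypothetical.

```python
from collections import defaultdict

# Hypothetical billing records: (week, team, cost).
RECORDS = [
    ("2024-W01", "growth", 100), ("2024-W01", "ml", 300),
    ("2024-W02", "growth", 150), ("2024-W02", "ml", 270),
]


def cost_by_team(records, week):
    """Total cost per team for one week."""
    totals = defaultdict(int)
    for w, team, cost in records:
        if w == week:
            totals[team] += cost
    return dict(totals)


def week_over_week(records, prev_week, week):
    """Percent change per team between two weeks (teams with no prior cost are skipped)."""
    prev, cur = cost_by_team(records, prev_week), cost_by_team(records, week)
    return {t: 100 * (cur[t] - prev[t]) / prev[t] for t in cur if prev.get(t)}


print(week_over_week(RECORDS, "2024-W01", "2024-W02"))  # {'growth': 50.0, 'ml': -10.0}
```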

[Table of Contents](#Cost-Optimization)

content/full.md

Lines changed: 67 additions & 0 deletions
@@ -33,6 +33,10 @@
 + [Data Quality](#Data-Quality)
 + [Data Observability](#Data-Observability)
 + [Data Governance](#Data-Governance)
++ [Apache Hudi](#Apache-Hudi)
++ [Cost Optimization](#Cost-Optimization)
++ [Python for Data Engineering](#Python-for-Data-Engineering)
++ [Data System Design](#Data-System-Design)

 ## Apache Hadoop
 + [What are the main components of a Hadoop Application?](hadoop.md#What-are-the-main-components-of-a-Hadoop-Application)
@@ -369,6 +373,69 @@

 [Table of Contents](#Interview-questions-for-Data-Engineer)

+## Apache Hudi
++ [What is Apache Hudi?](hudi.md#What-is-Apache-Hudi)
++ [What problems does Hudi solve in a data lake?](hudi.md#What-problems-does-Hudi-solve-in-a-data-lake)
++ [What is the difference between Copy-on-Write (COW) and Merge-on-Read (MOR)?](hudi.md#What-is-the-difference-between-Copy-on-Write-(COW)-and-Merge-on-Read-(MOR))
++ [How do you choose between COW and MOR?](hudi.md#How-do-you-choose-between-COW-and-MOR)
++ [What is a record key, partition path, and precombine field?](hudi.md#What-is-a-record-key-partition-path-and-precombine-field)
++ [How does Hudi support upserts?](hudi.md#How-does-Hudi-support-upserts)
++ [What is compaction in Hudi?](hudi.md#What-is-compaction-in-Hudi)
++ [What is clustering in Hudi and when do you need it?](hudi.md#What-is-clustering-in-Hudi-and-when-do-you-need-it)
++ [Why does the small files problem happen and how do you mitigate it in Hudi?](hudi.md#Why-does-the-small-files-problem-happen-and-how-do-you-mitigate-it-in-Hudi)
++ [How do you handle CDC with Hudi?](hudi.md#How-do-you-handle-CDC-with-Hudi)
++ [What are common operational metrics for Hudi tables?](hudi.md#What-are-common-operational-metrics-for-Hudi-tables)
++ [When would you choose Hudi vs Iceberg vs Delta?](hudi.md#When-would-you-choose-Hudi-vs-Iceberg-vs-Delta)
+
+[Table of Contents](#Interview-questions-for-Data-Engineer)
+
+## Cost Optimization
++ [Why is cost optimization a core data engineering skill?](cost-optimization.md#Why-is-cost-optimization-a-core-data-engineering-skill)
++ [What are the main cost drivers in data platforms?](cost-optimization.md#What-are-the-main-cost-drivers-in-data-platforms)
++ [How does partitioning affect cost and performance?](cost-optimization.md#How-does-partitioning-affect-cost-and-performance)
++ [What is the small files problem and why does it increase cost?](cost-optimization.md#What-is-the-small-files-problem-and-why-does-it-increase-cost)
++ [How do you choose a target file size for Parquet tables?](cost-optimization.md#How-do-you-choose-a-target-file-size-for-Parquet-tables)
++ [How do you detect and fix data skew in distributed processing?](cost-optimization.md#How-do-you-detect-and-fix-data-skew-in-distributed-processing)
++ [What is shuffle and how do you reduce it in Spark?](cost-optimization.md#What-is-shuffle-and-how-do-you-reduce-it-in-Spark)
++ [When should you pre-aggregate or materialize tables?](cost-optimization.md#When-should-you-pre-aggregate-or-materialize-tables)
++ [How do you prevent runaway queries and protect shared clusters?](cost-optimization.md#How-do-you-prevent-runaway-queries-and-protect-shared-clusters)
++ [How do you optimize joins in large-scale analytics?](cost-optimization.md#How-do-you-optimize-joins-in-large-scale-analytics)
++ [How do you plan and estimate the cost of a backfill?](cost-optimization.md#How-do-you-plan-and-estimate-the-cost-of-a-backfill)
++ [What metrics would you track for FinOps in data engineering?](cost-optimization.md#What-metrics-would-you-track-for-FinOps-in-data-engineering)
+
+[Table of Contents](#Interview-questions-for-Data-Engineer)
+
+## Python for Data Engineering
++ [Why is Python widely used in data engineering?](python.md#Why-is-Python-widely-used-in-data-engineering)
++ [How do iterators and generators help with large data processing?](python.md#How-do-iterators-and-generators-help-with-large-data-processing)
++ [What is the difference between threads, multiprocessing, and async IO in Python?](python.md#What-is-the-difference-between-threads-multiprocessing-and-async-IO-in-Python)
++ [What is the GIL and why does it matter?](python.md#What-is-the-GIL-and-why-does-it-matter)
++ [How do you read and write Parquet efficiently in Python?](python.md#How-do-you-read-and-write-Parquet-efficiently-in-Python)
++ [How do you process large CSV files without running out of memory?](python.md#How-do-you-process-large-CSV-files-without-running-out-of-memory)
++ [How do you implement retries with exponential backoff?](python.md#How-do-you-implement-retries-with-exponential-backoff)
++ [What logging practices are important for data pipelines?](python.md#What-logging-practices-are-important-for-data-pipelines)
++ [How do you structure a Python project for data pipelines?](python.md#How-do-you-structure-a-Python-project-for-data-pipelines)
++ [How do you manage dependencies and reproducible environments?](python.md#How-do-you-manage-dependencies-and-reproducible-environments)
++ [How do you test data transformations in Python?](python.md#How-do-you-test-data-transformations-in-Python)
++ [How do you profile and optimize slow Python code?](python.md#How-do-you-profile-and-optimize-slow-Python-code)
+
+[Table of Contents](#Interview-questions-for-Data-Engineer)
+
+## Data System Design
++ [How would you design an end-to-end batch analytics pipeline?](system-design.md#How-would-you-design-an-end-to-end-batch-analytics-pipeline)
++ [How would you design a near-real-time ingestion pipeline?](system-design.md#How-would-you-design-a-near-real-time-ingestion-pipeline)
++ [How do you ensure idempotency in data pipelines?](system-design.md#How-do-you-ensure-idempotency-in-data-pipelines)
++ [How do you handle late arriving events and backfills?](system-design.md#How-do-you-handle-late-arriving-events-and-backfills)
++ [How do you choose between batch and streaming?](system-design.md#How-do-you-choose-between-batch-and-streaming)
++ [How do you model raw/silver/gold layers (bronze/silver/gold)?](system-design.md#How-do-you-model-raw/silver/gold-layers-(bronze/silver/gold))
++ [How do you design a data platform for multiple teams (multi-tenancy)?](system-design.md#How-do-you-design-a-data-platform-for-multiple-teams-(multi-tenancy))
++ [How do you design for schema evolution?](system-design.md#How-do-you-design-for-schema-evolution)
++ [What are the main reliability patterns for pipelines?](system-design.md#What-are-the-main-reliability-patterns-for-pipelines)
++ [How do you design observability for a data platform?](system-design.md#How-do-you-design-observability-for-a-data-platform)
++ [How do you manage cost while meeting SLAs?](system-design.md#How-do-you-manage-cost-while-meeting-SLAs)
+
+[Table of Contents](#Interview-questions-for-Data-Engineer)
+
 ## Apache Flume
 + [What is Flume?](flume.md#What-is-Flume)
 + [What is Apache Flume?](flume.md#What-is-Apache-Flume)
