Data Engineer Technical Interview Questions & Answers (2026)
Data engineer technical interviews test SQL proficiency, data pipeline design, ETL/ELT architecture, and knowledge of modern data stack tools. This guide covers common technical questions spanning SQL, pipeline design, and data modeling.
Overview
Data engineering technical interviews focus on practical skills: writing complex SQL, designing data pipelines, choosing appropriate storage formats, and handling data quality. Unlike software engineering interviews that focus on algorithms, data engineering interviews emphasize data modeling, pipeline architecture, and the ability to reason about data at scale.
Technical Interview Questions for Data Engineer Roles
Q1: Write SQL to find users who have been active for 3 or more consecutive days.
What they're really asking: This tests advanced SQL skills: window functions, date arithmetic, and the gaps-and-islands pattern. It's a common data engineering interview pattern that tests real analytical SQL ability.
How to answer: Use the gaps-and-islands technique: create row numbers, subtract from dates to find groups, then count consecutive days.
Example answer:
This is a classic gaps-and-islands problem. I'd use the technique of subtracting a row number from the date to identify consecutive groups:

```sql
WITH daily_activity AS (
  SELECT DISTINCT user_id, DATE(activity_timestamp) AS activity_date
  FROM user_events
),
consecutive_groups AS (
  SELECT user_id, activity_date,
         activity_date - INTERVAL '1 day' * ROW_NUMBER() OVER (
           PARTITION BY user_id ORDER BY activity_date
         ) AS group_id
  FROM daily_activity
)
SELECT user_id,
       MIN(activity_date) AS streak_start,
       MAX(activity_date) AS streak_end,
       COUNT(*)           AS consecutive_days
FROM consecutive_groups
GROUP BY user_id, group_id
HAVING COUNT(*) >= 3
ORDER BY consecutive_days DESC;
```

How it works: for consecutive dates, subtracting an incrementing row number produces the same value (the group_id). Jan 1 - 1 = Dec 31, Jan 2 - 2 = Dec 31, Jan 3 - 3 = Dec 31 — same group. A gap breaks the pattern: Jan 1 - 1 = Dec 31, Jan 3 - 2 = Jan 1 — different group. The first CTE deduplicates to one row per user per day, since a user might have multiple events in a day. I'd test with edge cases: a user active on exactly 3 days, a user with multiple separate streaks, and single-day activity.
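To sanity-check the pattern end to end, here's a runnable sketch of the same query against SQLite (the date arithmetic is adapted to SQLite's DATE() modifiers; the table name, columns, and sample rows are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_events (user_id INTEGER, activity_timestamp TEXT);
INSERT INTO user_events VALUES
  (1, '2025-01-01 09:00'), (1, '2025-01-02 10:00'), (1, '2025-01-02 18:00'),
  (1, '2025-01-03 11:00'),                           -- 3-day streak for user 1
  (2, '2025-01-01 08:00'), (2, '2025-01-03 08:00');  -- gap: no streak for user 2
""")

rows = conn.execute("""
WITH daily_activity AS (
  SELECT DISTINCT user_id, DATE(activity_timestamp) AS activity_date
  FROM user_events
),
consecutive_groups AS (
  SELECT user_id, activity_date,
         -- SQLite equivalent of date - INTERVAL '1 day' * row_number
         DATE(activity_date,
              '-' || ROW_NUMBER() OVER (
                PARTITION BY user_id ORDER BY activity_date) || ' days'
         ) AS group_id
  FROM daily_activity
)
SELECT user_id, MIN(activity_date), MAX(activity_date), COUNT(*) AS days
FROM consecutive_groups
GROUP BY user_id, group_id
HAVING COUNT(*) >= 3
""").fetchall()
print(rows)  # [(1, '2025-01-01', '2025-01-03', 3)] — only user 1 qualifies
```

User 1's duplicate events on Jan 2 collapse to one row in the first CTE, which is exactly the edge case the deduplication guards against.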
Q2: Design a data pipeline that ingests events from multiple sources, transforms them, and loads into a data warehouse.
What they're really asking: This tests end-to-end pipeline design thinking: source ingestion, data quality, transformation architecture, and warehouse loading strategy.
How to answer: Design the pipeline with clear stages: ingestion, staging, transformation, loading, and monitoring. Discuss trade-offs between batch and streaming.
Example answer:
I'd design an ELT pipeline with these stages:
- Ingestion: Kafka for real-time event streams (clickstream, transactions) with a schema registry for contract enforcement; Fivetran or Airbyte for database CDC (change data capture) from operational databases and SaaS API sources. All raw data lands in a cloud storage staging layer (S3/GCS) in Parquet format, partitioned by date.
- Staging: raw data loaded into the warehouse (Snowflake/BigQuery) as-is in a raw schema. This preserves the original data for debugging and replay.
- Transformation: dbt for SQL-based transformations organized in layers — staging models (rename, type-cast, deduplicate), intermediate models (business-logic joins), and marts (fact and dimension tables for analytics). dbt provides lineage tracking, testing, and documentation.
- Quality: Great Expectations or dbt tests at each layer checking row counts, null rates, value ranges, and referential integrity. A failed test blocks downstream models.
- Orchestration: an Airflow DAG with sensor → ingest → test → transform → test → publish stages. Each stage is idempotent for safe retries.
- Monitoring: a pipeline latency dashboard, data freshness SLOs (mart tables updated within 2 hours of source events), and PagerDuty alerts for pipeline failures.

I chose ELT over ETL because the warehouse handles transformation compute, and dbt provides a better developer experience than custom ETL code.
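The idempotency point deserves a concrete illustration. One common pattern is delete-then-insert by partition inside a single transaction, so an orchestrator retry simply replaces the partition rather than duplicating rows. A minimal sketch using SQLite (the `fact_events` table and its columns are illustrative, not from a real pipeline):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_events (event_date TEXT, user_id INTEGER, amount REAL)")

def load_partition(conn, event_date, rows):
    """Idempotently (re)load one date partition: safe to retry."""
    with conn:  # one transaction: either the partition swap completes or nothing changes
        conn.execute("DELETE FROM fact_events WHERE event_date = ?", (event_date,))
        conn.executemany(
            "INSERT INTO fact_events VALUES (?, ?, ?)",
            [(event_date, uid, amt) for uid, amt in rows],
        )

load_partition(conn, "2025-01-01", [(1, 9.99), (2, 4.50)])
load_partition(conn, "2025-01-01", [(1, 9.99), (2, 4.50)])  # retry: no duplicates
count = conn.execute("SELECT COUNT(*) FROM fact_events").fetchone()[0]
print(count)  # 2
```

In a warehouse, the same idea shows up as `INSERT OVERWRITE` on a partition or a `MERGE` keyed on the partition and natural key.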
Q3: Explain slowly changing dimensions (SCD) and how you would implement Type 2 SCD.
What they're really asking: This tests data modeling knowledge specific to data warehousing. SCDs are fundamental to maintaining historical data in analytical systems.
How to answer: Explain SCD types, implement Type 2 with valid_from/valid_to columns, and discuss the trade-offs and query patterns.
Example answer:
Slowly changing dimensions track how dimension attributes change over time. Type 1: overwrite the old value (lose history). Type 2: add a new row with versioning (preserve full history). Type 3: add columns for old/new values (limited history).

For a Type 2 implementation, each dimension row has: surrogate_key (auto-increment, used for fact table joins), natural_key (business identifier), all dimension attributes, valid_from (timestamp when this version became active), valid_to (timestamp when replaced, NULL for the current version), and is_current (boolean flag for easy current-record queries). When an attribute changes: UPDATE the current row, setting valid_to = NOW() and is_current = FALSE, then INSERT a new row with the updated attributes, valid_from = NOW(), valid_to = NULL, and is_current = TRUE. In dbt, I'd use the snapshot feature, which handles this automatically using either the 'check' strategy (compare specific columns) or the 'timestamp' strategy (use a last_modified column).

Query patterns: current state uses WHERE is_current = TRUE. Point-in-time analysis uses WHERE valid_from <= target_date AND (valid_to > target_date OR valid_to IS NULL). Fact tables join on surrogate_key to get the dimension values that were active when the fact event occurred. Trade-off: Type 2 grows the dimension table over time and complicates queries, but provides complete historical analysis capability.
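The expire-and-insert step above can be sketched end to end. This is a minimal illustration using SQLite, with a hypothetical `dim_customer` table tracking a single attribute (`city`); real implementations would hash or compare all tracked columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE dim_customer (
  surrogate_key INTEGER PRIMARY KEY AUTOINCREMENT,
  natural_key   TEXT,
  city          TEXT,
  valid_from    TEXT,
  valid_to      TEXT,     -- NULL for the current version
  is_current    INTEGER
)""")

def scd2_upsert(conn, natural_key, city, now):
    """Expire the current row if the tracked attribute changed, then insert a new version."""
    with conn:
        cur = conn.execute(
            "SELECT city FROM dim_customer WHERE natural_key = ? AND is_current = 1",
            (natural_key,)).fetchone()
        if cur is not None and cur[0] == city:
            return  # no change: nothing to do
        if cur is not None:  # close out the old version
            conn.execute(
                "UPDATE dim_customer SET valid_to = ?, is_current = 0 "
                "WHERE natural_key = ? AND is_current = 1", (now, natural_key))
        conn.execute(
            "INSERT INTO dim_customer (natural_key, city, valid_from, valid_to, is_current) "
            "VALUES (?, ?, ?, NULL, 1)", (natural_key, city, now))

scd2_upsert(conn, "C-1", "Berlin", "2025-01-01")
scd2_upsert(conn, "C-1", "Munich", "2025-06-01")  # attribute changed: new version
history = conn.execute(
    "SELECT city, valid_from, valid_to, is_current FROM dim_customer "
    "ORDER BY surrogate_key").fetchall()
print(history)  # Berlin expired on 2025-06-01; Munich is the current row
```

Note how the point-in-time query pattern falls out of this layout: a fact dated 2025-03-15 joins to the Berlin row, one dated 2025-07-01 to the Munich row.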
Q4: How would you handle data quality issues in a production pipeline?
What they're really asking: This evaluates practical data quality thinking: detection, prevention, and remediation strategies for real-world data problems.
How to answer: Discuss prevention (schema enforcement, contracts), detection (automated testing), and remediation (alerting, quarantine, backfill) strategies.
Example answer:
I approach data quality at three levels:
- Prevention: schema enforcement at ingestion (a Kafka schema registry rejects messages that don't match the expected schema), data contracts between teams defining expected formats and SLAs, and input validation in source applications. These stop bad data from entering the pipeline.
- Detection: automated tests at each pipeline layer: row count checks (alert if today's count deviates >20% from the 7-day average), null rate monitoring (alert if a required column exceeds 1% nulls), value range checks (negative ages, future dates), referential integrity tests (fact table foreign keys match dimension tables), and freshness checks (alert if source data is >2 hours stale). I'd use dbt tests for transformation-layer checks and Great Expectations for ingestion-layer checks.
- Remediation: when quality issues are detected, the pipeline should 1) quarantine bad records in a separate table for investigation (never silently drop data), 2) alert the on-call data engineer via PagerDuty with the specific test failure and sample bad records, 3) continue processing good records where possible (degrade gracefully rather than halt the entire pipeline), and 4) provide a backfill mechanism to reprocess data once the upstream issue is fixed.

I'd also maintain a data quality dashboard showing quality metrics over time, making it easy to spot degradation trends before they become incidents.
Q5: Compare batch processing and stream processing. When would you choose each?
What they're really asking: This tests your understanding of data processing paradigms and the engineering trade-offs between latency, throughput, and complexity.
How to answer: Compare both paradigms on key dimensions (latency, complexity, cost, use cases) and discuss hybrid approaches.
Example answer:
Batch processing handles bounded data sets on a schedule (hourly, daily). Tools: Spark, dbt, Airflow. Strengths: simpler to reason about (process all the data, then done), easier error handling (rerun the batch), cost efficiency (spot instances, off-peak processing), and a good fit for aggregations and complex joins across large datasets.

Stream processing handles unbounded data in real time. Tools: Kafka Streams, Flink, Spark Streaming. Strengths: low latency (seconds to minutes), event-driven architecture, and continuous processing. Weaknesses: harder to debug (state management, exactly-once semantics), more expensive (always-on compute), and complex windowing logic for aggregations.

Choose batch when a latency of hours is acceptable, you need complex transformations across large datasets, cost efficiency matters, or the downstream consumers (BI dashboards, reports) are refreshed periodically. Choose streaming when low latency is required (fraud detection, real-time personalization, alerting), events need immediate action, or the data naturally arrives as a stream.

Hybrid (Lambda/Kappa architecture): many modern pipelines use both. Stream processing handles real-time use cases with approximate results, while batch processing runs the same logic nightly to produce exact results that reconcile any streaming approximations. I prefer the Kappa approach (a single streaming pipeline with replay capability) when the team has streaming expertise, since maintaining one codebase is simpler.
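The "complex windowing logic" that makes streaming harder can be illustrated with its simplest case, a tumbling window. This sketch counts events per key in fixed 60-second windows; real engines (Flink, Kafka Streams) layer state stores, watermarks, and late-data handling on top of this core idea. Event timestamps and keys are made up:

```python
from collections import defaultdict

def tumbling_counts(events, window_seconds=60):
    """events: iterable of (epoch_seconds, key).
    Returns {(window_start, key): count} for fixed, non-overlapping windows."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = ts - (ts % window_seconds)  # floor to window boundary
        counts[(window_start, key)] += 1
    return dict(counts)

stream = [(0, "click"), (10, "click"), (59, "view"), (61, "click")]
result = tumbling_counts(stream)
print(result)  # {(0, 'click'): 2, (0, 'view'): 1, (60, 'click'): 1}
```

The batch equivalent is just a GROUP BY over a bounded table; the streaming version must decide when a window is "done" despite out-of-order and late events, which is where most of the added complexity lives.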
Preparation Tips
- Master advanced SQL: window functions, CTEs, recursive queries, and performance optimization (EXPLAIN ANALYZE)
- Know your data stack deeply: be ready to compare Snowflake vs Databricks vs BigQuery with specific trade-offs
- Practice designing data pipelines on a whiteboard: source to warehouse with quality checks at each stage
- Understand data modeling: star schema, snowflake schema, OBT (one big table), and when each is appropriate
- Be ready to discuss data quality philosophies and specific tools (dbt tests, Great Expectations)
- Know Python data libraries (pandas, PySpark) and be ready to write data transformation code
Common Mistakes to Avoid
- Writing SQL without considering performance: missing indexes, unnecessary JOINs, or scanning entire tables when partitioning could help
- Designing pipelines without quality checks — data quality should be tested at every stage, not just at the end
- Not discussing operational concerns: monitoring, alerting, backfill processes, and pipeline SLAs
- Over-engineering with streaming when batch processing would meet the latency requirements at lower complexity
- Not understanding the difference between ETL and ELT and when each is appropriate
- Ignoring data governance: PII handling, access control, and data lineage in pipeline design
Research Checklist
Before your technical interview:
- Research the company's data stack: warehouse, ETL/ELT tools, orchestration, and data quality tools
- Understand the company's data volume and velocity to calibrate your architecture answers
- Check if the company uses dbt, Airflow, or other specific tools mentioned in the job description
- Review the company's engineering blog for data engineering posts
- Understand the company's analytics use cases to inform your data modeling answers
- Practice SQL in the specific database dialect the company uses (Snowflake SQL, BigQuery, PostgreSQL)
Questions to Ask Your Interviewer
- What does your data platform architecture look like? What tools and technologies do you use?
- How does the data engineering team work with data scientists and analysts?
- What's your approach to data quality and testing?
- What are the biggest data engineering challenges the team is facing?
- How do you handle schema evolution and breaking changes from upstream sources?
- What does the on-call process look like for data pipeline issues?
How Your Resume Connects to the Interview
Data engineering resumes should list specific data tools, pipeline scale, and quality metrics. Ajusta ensures your resume includes the exact tool names (Spark, Airflow, dbt, Snowflake, Kafka), pipeline metrics, and data modeling terms that ATS systems at high-paying data engineering roles prioritize.