Real-Time Salary Stream: Scala, Spark Structured Streaming, Kafka, MySQL


Coursework streaming pipeline that ingests employee records over Kafka, classifies them by salary band in Spark Structured Streaming, and sinks the splits to dedicated MySQL tables and downstream Kafka topics. The whole thing runs in Docker Compose.

Scala 2.12 · Apache Spark Structured Streaming 3.5.0 · Apache Kafka (Confluent 7.4.0) · MySQL 8.0 · Apache Maven · ScalaTest · Docker Compose

The default mental model for ETL still tends to be: collect data overnight, run a batch job, see results the next morning. That model breaks the moment the business question is "what is happening right now." This project, built during the Data Collection and Curation course in my Big Data Analytics program at Georgian College, was the first time I built an ETL pipeline that worked the other way around: data arriving as a stream, transformations applied as it flowed past, results available the moment they were computed.

The use case was a salary band classifier on a continuous stream of employee records. The toolchain was Apache Kafka as the message broker, Apache Spark Structured Streaming for transformation, and MySQL for the persistent sink. Everything was written in Scala against the typed Spark DataFrame API and packaged with Maven. The whole stack runs in Docker Compose, which means a reviewer can clone the repo, run docker-compose up, and have the producer, broker, processor, and database sitting on their laptop in a few minutes.

The shift from batch to streaming

Batch processing makes sense when latency does not matter. Payroll runs once a fortnight. End-of-quarter financials wait for the books to close. Most analytical questions about historical data have always been answered this way and most still are.

Streaming changes the calculus when latency does matter. Fraud detection that catches a stolen card after the second transaction is meaningfully different from one that catches it at the end of the day. Inventory rebalancing that reacts to demand spikes within minutes is meaningfully different from rebalancing that runs nightly. Patient deterioration alerts that fire while the patient is still on the same shift are meaningfully different from alerts that come back the next morning.

The 2024 Confluent Data Streaming Report finds roughly 86 percent of surveyed IT leaders describing data streaming as a strategic priority, and the typical motivation is operational, not a matter of technical preference. Real-time decision-making is the use case. Streaming is the architecture that supports it.

What the pipeline does

The architecture is a textbook three-stage flow with a small twist on the sink side: the output is two parallel writes per stream, one to MySQL for analytical queries and one back to Kafka for downstream consumers.

Stage one: ingestion. An EmployeeDataProducer (a small Scala job) generates randomized employee records with four fields: Id, Name, Department, Salary. Each record is serialized to JSON and published to a Kafka topic at a configurable rate. Kafka's append-only log preserves the raw stream before any transformation, which is the streaming equivalent of keeping a copy of the source files in batch ETL. In production the producer would be replaced by a real HR system feed; the rest of the pipeline does not change.
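
The producer itself is little more than a loop. A minimal sketch of what it could look like, assuming the standard Kafka Java client and an employees topic; the department names, salary range, and one-record-per-second rate here are placeholders rather than the assignment's exact values:

import java.util.Properties
import scala.util.Random
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object EmployeeDataProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    val departments = Seq("Engineering", "Sales", "HR", "Finance")
    var id = 0
    while (true) {
      id += 1
      val dept   = departments(Random.nextInt(departments.length))
      val salary = 10000 + Random.nextInt(20000) // straddles the 20000 band boundary
      val json   = s"""{"Id":$id,"Name":"Employee$id","Department":"$dept","Salary":$salary}"""
      producer.send(new ProducerRecord[String, String]("employees", id.toString, json))
      Thread.sleep(1000) // one record per second; the real rate is configurable
    }
  }
}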

Stage two: transformation. A SalaryProcessor Spark Structured Streaming job (Spark 3.5.0) subscribes to the Kafka topic, parses each message against a typed schema (a StructType of IntegerType, StringType, StringType, IntegerType), and splits the stream in two: high earners with Salary >= 20000 and low earners with Salary < 20000. The split is two filtered DataFrames branching off a single source query.

Stage three: dual sinks. Each branch writes to two destinations in parallel:

  1. A dedicated MySQL table (high_salary_employees and low_salary_employees) via JDBC, for downstream analytical queries.
  2. A dedicated Kafka topic (high-salary and low-salary), for downstream services that want to react to the classification in real time.

This dual-sink pattern is the production-ready shape for a streaming classifier. Analytical consumers do not have to touch Kafka. Reactive services do not have to poll MySQL. Both sides see the same canonical event stream.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val spark = SparkSession.builder()
  .appName("SalaryProcessor")
  .master("local[*]")
  .getOrCreate()

val schema = new StructType()
  .add("Id", IntegerType)
  .add("Name", StringType)
  .add("Department", StringType)
  .add("Salary", IntegerType)

val source = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "employees")
  .load()
  .selectExpr("CAST(value AS STRING) as json")
  .select(from_json(col("json"), schema).as("data"))
  .select("data.*")

val high = source.filter(col("Salary") >= 20000)
val low  = source.filter(col("Salary") <  20000)

// Each branch writes to MySQL and a Kafka topic in parallel
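
What those parallel writes could look like, as a hedged sketch using foreachBatch; the JDBC URL, credentials, and checkpoint paths are placeholders, while the table and topic names are the ones described above:

import org.apache.spark.sql.DataFrame

// One streaming query per branch; each micro-batch is written to MySQL and republished to Kafka
def dualSink(branch: DataFrame, table: String, topic: String) = {
  val writeBatch: (DataFrame, Long) => Unit = (batch, _) => {
    // Sink 1: MySQL via JDBC, for analytical queries
    batch.write
      .format("jdbc")
      .option("url", "jdbc:mysql://mysql:3306/employees")
      .option("dbtable", table)
      .option("user", "root")
      .option("password", "example")
      .mode("append")
      .save()
    // Sink 2: Kafka topic, for reactive downstream consumers
    batch.selectExpr("CAST(Id AS STRING) AS key", "to_json(struct(*)) AS value")
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("topic", topic)
      .save()
  }
  branch.writeStream
    .outputMode("append")
    .option("checkpointLocation", s"/var/checkpoints/$topic") // durable, outside the repo
    .foreachBatch(writeBatch)
    .start()
}

dualSink(high, "high_salary_employees", "high-salary")
dualSink(low,  "low_salary_employees",  "low-salary")
spark.streams.awaitAnyTermination() // keep the driver alive while the queries run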

What Structured Streaming actually does

The interesting design choice in Structured Streaming is the abstraction. The framework treats a stream as an unbounded table that grows over time. A query against that table is logically the same as a query against a static DataFrame. The runtime, behind the scenes, breaks the query into micro-batches, applies the transformations to each batch, and updates the output continuously.

Two consequences follow.

First, the developer experience is mostly batch-shaped. You write df.filter(...) and df.groupBy(...) and the same code reads as if it were a one-shot query. The Spark planner decides how to make it incremental. That is what made the coursework project tractable. Learning the streaming runtime from scratch would have been a semester on its own; reusing the DataFrame API I already knew narrowed the work to the genuinely new concepts: triggers, watermarks, output modes, checkpoints.
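
Most of those knobs attach to the write side of the query (watermarks are the exception, set on the DataFrame itself via withWatermark). A generic sketch against the high branch above; the trigger interval, checkpoint path, and console sink are placeholders:

import org.apache.spark.sql.streaming.Trigger

val query = high.writeStream
  .trigger(Trigger.ProcessingTime("10 seconds"))          // micro-batch cadence
  .outputMode("append")                                    // how results are emitted each batch
  .option("checkpointLocation", "/var/checkpoints/high")   // offsets and state for recovery
  .format("console")                                       // placeholder sink for illustration
  .start()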

Second, the failure modes are different from batch. A batch job either runs or fails. A streaming job fails partway, restarts, and has to know where it was. Spark uses checkpointing for that: it persists query state and Kafka offsets so that on restart the runtime resumes exactly where it left off. That mechanism is the foundation of exactly-once processing semantics.

Kafka's role: more than a queue

Kafka's job in this architecture is not "fast queue." It is "durable, replayable, ordered log." That distinction matters because it is the property that lets the rest of the pipeline be honest about its semantics.

Because Kafka retains messages for a configurable window, a downstream consumer that fails can rewind and reprocess. Because partitions are ordered, downstream computations that depend on event order can rely on it within a partition. Because consumers track their own offsets, two different consumers can read the same topic for completely different purposes without coordinating. The data engineer's job becomes choosing partition keys carefully, since that decision determines what kinds of stateful computation are possible later.

Partitioning by Id (the natural row key) gives good load balance at the cost of forcing any per-department aggregation to shuffle. Partitioning by Department would localise per-department windowed aggregations at the cost of skew on large departments. There is no universally right answer. The right choice is the one that matches the queries you actually want to run.
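
On the producer side that decision is literally the key passed to each record. Continuing from the producer sketch above (same json value, same employees topic), the two options would look like:

// Keyed by Id: even load across partitions, but per-department aggregations must shuffle
producer.send(new ProducerRecord[String, String]("employees", id.toString, json))

// Keyed by Department: per-department ordering and locality, at the risk of skewed partitions
producer.send(new ProducerRecord[String, String]("employees", dept, json))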

The bug that taught me the most

The first version of my pipeline duplicated records on restart. I had pointed Spark's checkpoint location at a directory that lived inside the project repo, which I had set up to be cleaned between runs because the assignment grader expected a fresh state. Every time I restarted the streaming job, Spark could not find its previous offsets and would replay the Kafka topic from the configured starting position. The output MySQL tables grew duplicates that did not look obviously wrong until I started counting unique IDs.

The fix was to move the checkpoint directory outside the project, treat it as durable state, and never delete it as part of a re-run. The lesson generalises. Checkpoints are not cache. They are the source of truth for "where did this stream get to," and treating them as ephemeral is the streaming equivalent of dropping your database between deploys. In production the checkpoint directory lives on object storage with retention rules that match the topic's. In coursework I had been treating it like a temp file.

Operational realities I did not anticipate

Three things surprised me about running this pipeline beyond the assignment.

Backpressure is real. When a downstream stage is slower than an upstream one, queues fill up. Kafka's retention buffer absorbs the first wave gracefully, but if the consumer falls far enough behind, messages are deleted before they are read. Tuning consumer parallelism, batch sizes, and retention windows is the operational discipline that keeps a streaming system honest. Batch ETL never makes you think about this.
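
One concrete lever on the Spark side is capping how much the Kafka source reads per micro-batch, so a consumer restarting after downtime works through its backlog at a sustainable rate instead of trying to swallow it in one batch. A sketch; the cap value is arbitrary:

// Rate-limit the Kafka source: at most 10,000 offsets per micro-batch
val throttled = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "employees")
  .option("maxOffsetsPerTrigger", "10000")
  .load()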

Schema evolution breaks everything. A new field added by the producer that the parser does not expect can crash the streaming job, and a crashed job that restarts from a checkpoint will hit the same poisoned message and crash again. The fix is permissive parsing: read the JSON as a string, validate the fields you care about, route everything else to a dead-letter topic. This is unglamorous engineering, and it is most of what production data pipelines actually do. The pattern is the same as with any HL7 v2 message I deal with at work, where the right answer is always to validate what you need and ignore what you do not.
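
A sketch of what that could look like here, reusing the schema defined earlier: keep the raw payload, treat a record that fails to yield the one field the pipeline actually needs (Id) as invalid, and route it to a hypothetical dead-letter topic:

import org.apache.spark.sql.functions.{col, from_json}

// Parse defensively: keep the raw string alongside the parsed struct
val parsed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "employees")
  .load()
  .selectExpr("CAST(value AS STRING) AS raw")
  .withColumn("data", from_json(col("raw"), schema))

val valid   = parsed.filter(col("data.Id").isNotNull).select("data.*") // flows on to the classifier
val invalid = parsed.filter(col("data.Id").isNull)                     // unparseable or missing Id

// Dead-letter the original payload so nothing is silently dropped
invalid.selectExpr("raw AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "employees-dead-letter")               // hypothetical topic name
  .option("checkpointLocation", "/var/checkpoints/dlq")   // placeholder path
  .start()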

Observability matters more than for batch. A batch job that fails is obvious. A streaming job that quietly drifts behind real-time is not. The metrics that matter (input rate, processing rate, batch duration, watermark age) need their own dashboard. Setting that up was outside the coursework scope but is the first thing I would build if this pipeline went anywhere near production.
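
Spark does surface those numbers per micro-batch; the missing piece is pushing them somewhere visible. A minimal sketch using the built-in listener hook, printing to stdout as a stand-in for a real metrics backend:

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

// Emit throughput and latency figures for every micro-batch
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    println(s"batch=${p.batchId} inputRows/s=${p.inputRowsPerSecond} " +
      s"processedRows/s=${p.processedRowsPerSecond} durations=${p.durationMs}")
  }
})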

Engineering hygiene

The repo is small but production-shaped:

  • Scala 2.12 for the producer and the processor, against the typed Spark DataFrame API.
  • Apache Maven for build and dependency management.
  • ScalaTest 3.2 for unit tests.
  • Docker Compose brings up Kafka, ZooKeeper, and MySQL with a single command. The producer and processor are also containerizable.
  • Confluent 7.4.0 Kafka images. Standard, not bleeding-edge.

A reviewer who clones the repo and runs docker-compose up should see records flowing within a minute or two. That is the bar for a coursework streaming repo to be useful as a portfolio artifact. If it cannot be brought up locally, no recruiter is going to read the code.

How this informs my current work

The day-to-day work at metricHEALTH is closer to operational data integration than to high-volume streaming. Most of our pipelines are batch-shaped because clinical workflows are batch-shaped. But the streaming vocabulary, especially the discipline of thinking about exactly-once semantics, schema evolution, and operational observability, transfers cleanly.

When the question is "should this be a stream or a batch," the right answer follows from the use-case latency. When the answer is stream, the toolchain I learned in this coursework remains the default for general-purpose enterprise data pipelines: Kafka as the message log, Spark for transformation, a checkpointed sink for the output. Newer tools like Flink push the latency floor lower and Confluent's managed Kafka removes most of the operational overhead. The architecture is the architecture.

References

  • Spark: The Definitive Guide by Bill Chambers and Matei Zaharia, especially the chapters on Structured Streaming.
  • The Apache Kafka documentation on consumer groups, partitioning, and exactly-once semantics.
  • Confluent's 2024 Data Streaming Report for industry adoption context.

GitHub: https://github.com/TirtheshJani/Data_Collection_and_Curation