15-418 / 15-618 · Spring 2026 · CMU

GPU Parallelization of the
Goldstein Branch-Cut
Phase Unwrapping Algorithm

Furi Xiang  ·  Anurag Aryal

We propose to implement and evaluate a GPU-accelerated version of the Goldstein branch-cut phase unwrapping algorithm on NVIDIA GPUs using CUDA. Our work focuses on three stages — residue identification, branch-cut placement, and phase integration — and will compare multiple parallelization strategies, including tiled GPU kernels, residue-list-based cut placement, and multi-point BFS-style frontier integration.

Phase unwrapping arises in interferometric imaging applications such as InSAR, digital holography, and optical metrology, where the measured phase is known only modulo 2π, producing artificial 2π discontinuities. Goldstein's branch-cut algorithm is a classical path-following approach that first detects residues, then places branch cuts connecting opposite-polarity residues (or residues to the image boundary), and finally integrates the wrapped phase along paths that avoid those cuts.

01
Residue ID
For each 2×2 cell, compute wrapped circulation and mark positive/negative residues. Naturally data-parallel — one thread per pixel/cell.
02
Branch-Cut Placement
Connect opposite-polarity residues or residues to the image boundary. Traditionally greedy and serial; our main parallelism challenge.
03
Phase Integration
Starting from a valid seed pixel, propagate unwrapped phase to reachable neighbors without crossing branch cuts. Graph traversal over the phase domain.
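To make stage 1 concrete, the residue test can be sketched in serial form as below. This is our own minimal illustration, not the starter repo's code; `wrap` maps a phase difference into (−π, π], and a loop sum near ±2π marks a residue.

```cpp
#include <cmath>
#include <vector>

static const double kPi = 3.14159265358979323846;

// Wrap a phase difference into (-pi, pi].
static double wrap(double d) {
    while (d > kPi)   d -= 2.0 * kPi;
    while (d <= -kPi) d += 2.0 * kPi;
    return d;
}

// Stage 1: for each 2x2 cell, sum the wrapped differences around the
// loop. A sum near +2*pi (-2*pi) marks a positive (negative) residue.
// `phase` is row-major h x w; the result is (h-1) x (w-1) with -1/0/+1.
std::vector<int> findResidues(const std::vector<double>& phase, int h, int w) {
    std::vector<int> residue((h - 1) * (w - 1), 0);
    for (int r = 0; r < h - 1; ++r)
        for (int c = 0; c < w - 1; ++c) {
            double s = wrap(phase[r * w + c + 1]       - phase[r * w + c])
                     + wrap(phase[(r + 1) * w + c + 1] - phase[r * w + c + 1])
                     + wrap(phase[(r + 1) * w + c]     - phase[(r + 1) * w + c + 1])
                     + wrap(phase[r * w + c]           - phase[(r + 1) * w + c]);
            if (s >  kPi) residue[r * (w - 1) + c] = +1;
            if (s < -kPi) residue[r * (w - 1) + c] = -1;
        }
    return residue;
}
```

Because each cell reads only its four corners, the GPU version maps one thread per cell and can stage a tile of `phase` in shared memory.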

The first stage is a good fit for GPU: each residue test depends only on a small neighborhood, enabling tiled shared-memory kernels with strong locality. Stages 2 and 3 are harder. Branch-cut placement is traditionally greedy and serial. Integration is effectively a graph traversal with queue dependence, irregular control flow, and poor SIMD behavior. Recent work [2], [3] suggests tile/block-based GPU processing and BFS-style frontier traversal as the two main directions for exposing parallelism in these irregular stages.

This project is challenging because Goldstein's algorithm mixes a regular image-processing stage with two highly irregular, path-dependent stages. CUDA GPUs are most effective for workloads with regular memory access, abundant thread-level parallelism, and limited synchronization — and Goldstein's stages 2 and 3 violate all of these assumptions.

Residue ID
Regular memory access, strong spatial locality, little synchronization. Good fit for tiled CUDA kernels. Expected to scale well with image size.
Branch-Cut Placement
Serial dependence: once one residue is paired, future choices change. Sparse, data-dependent memory access. Expect warp divergence, atomic contention, and load imbalance across thread blocks.
Integration
Graph traversal with queue dependence and poor SIMD utilization. BFS-style frontier exposure introduces synchronization between levels and irregular frontier sizes. Tiled flood-fill may improve locality but requires repeated directional passes.
System Constraints
Avoiding warp divergence in residue matching and phase propagation; reducing atomic overhead during sparse residue compaction; maintaining coalesced memory access for sparse data structures; handling large differences in residue density across phase maps.
⚙️
Language / APIs: CUDA / C++, OpenCV (image I/O and pre/post-processing)
💻
Hardware: NVIDIA RTX 2080 GPUs on GHC cluster machines
Starter code: CPU serial reference implementation — github.com/williamdelacruz/parallelGoldstein
📄
Full proposal: final_proposal_15418.pdf — contains complete write-up with all references

We aim to match or exceed the ~61× speedup reported in the reference GPU phase unwrapping paper [3] (CPU: AMD Ryzen 5900H vs. GPU: NVIDIA RTX 3070 Laptop).

Plan to Achieve
CPU serial baseline: residue detection, greedy branch-cut placement, serial flood-fill, correctness testing on synthetic phase maps.
GPU residue-identification kernel: naive global-memory vs. shared-memory tiled comparison.
GPU branch-cut placement: residue compaction into +/− arrays, GPU residue matching, branch-cut construction. Compare ≥2 variants.
GPU integration: tiled flood fill / wavefront propagation vs. BFS-style frontier integration.
Experimental evaluation: RMS correctness, stage-by-stage runtime, end-to-end speedup, scaling with image size and residue density.
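The residue-compaction step in the plan above can be sketched in serial form (illustrative names, not the starter repo's API). On the GPU, the appends become an atomic-counter or prefix-sum stream compaction so that each thread writes to a unique slot.

```cpp
#include <utility>
#include <vector>

// Serial sketch of residue compaction: scatter the +1/-1 entries of
// the dense (h x w) residue map into separate positive/negative
// coordinate lists for the pairing stage.
void compactResidues(const std::vector<int>& residue, int h, int w,
                     std::vector<std::pair<int,int> >& pos,
                     std::vector<std::pair<int,int> >& neg) {
    for (int r = 0; r < h; ++r)
        for (int c = 0; c < w; ++c) {
            if (residue[r * w + c] > 0)      pos.push_back(std::make_pair(r, c));
            else if (residue[r * w + c] < 0) neg.push_back(std::make_pair(r, c));
        }
}
```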
Hope to Achieve
Multi-source BFS or connected-component-based integration after cuts.
Spatial binning to reduce branch-cut pairing search cost.
Batched phase unwrapping for improved throughput.
Outperform the reference GPU paper's reported speedup.
If progress is slower than expected: fully implement stage 2 on GPU, and offload stage 3 to the CPU for a hybrid CPU-GPU pipeline.
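To make the spatial-binning item concrete, here is a brute-force greedy pairing sketch whose O(P·N) inner search is exactly the cost binning would cut down. Note this is a simplification for illustration: Goldstein's full placement grows its search box incrementally and balances the net charge of each cut tree.

```cpp
#include <cstdlib>
#include <utility>
#include <vector>

// Greedy pairing sketch: each positive residue claims its nearest
// unmatched negative residue by Manhattan distance. Residues left
// unmatched would instead be cut to the image boundary.
std::vector<std::pair<int,int> > pairResidues(
        const std::vector<std::pair<int,int> >& pos,
        const std::vector<std::pair<int,int> >& neg) {
    std::vector<char> used(neg.size(), 0);
    std::vector<std::pair<int,int> > pairs;  // (pos index, neg index)
    for (int i = 0; i < (int)pos.size(); ++i) {
        int best = -1, bestDist = 1 << 30;
        for (int j = 0; j < (int)neg.size(); ++j) {
            if (used[j]) continue;
            int d = std::abs(pos[i].first  - neg[j].first)
                  + std::abs(pos[i].second - neg[j].second);
            if (d < bestDist) { bestDist = d; best = j; }
        }
        if (best >= 0) { used[best] = 1; pairs.push_back(std::make_pair(i, best)); }
    }
    return pairs;
}
```

With spatial binning, the inner loop would scan only the bins near `pos[i]`, expanding outward until an unmatched negative residue is found.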

We chose NVIDIA GPU and CUDA because at least part of Goldstein is clearly data-parallel, and the project's core interest is in how to map the irregular stages onto SIMD hardware.

The residue detection stage is a pixel-wise parallel image kernel with strong locality — shared memory tiling can directly accelerate it. More importantly, branch-cut placement and integration have poor native SIMD compatibility, making them ideal case studies for learning how to reformulate branch-heavy, sparse algorithms into GPU-friendly forms.

This project is not merely about speeding up a dense numerical kernel. It is about understanding how branch divergence, sparse matching, atomics, and frontier expansion behave on a real parallel system — all of which are central themes of 15-418/618.

This schedule will be updated each week based on actual progress.

W1
Mar 25
Read and summarize key references. Evaluate and re-implement the CPU serial baseline for Goldstein branch-cut from the reference GitHub repo. Build synthetic wrapped-phase test cases and find open-source test images.
W2
Apr 7
Complete naive CUDA residue detection and shared-memory tiled kernel. Implement residue compaction into positive/negative arrays and residue matching.
W3
Apr 14 ★
Milestone Report Due. Implement GPU integration with tiled flood fill or tiled multi-point wavefront propagation. Complete and submit milestone report.
W4
Apr 21
Finalize integration kernel. Tune all 3 stages to reduce divergence and improve memory locality. Benchmark to record progress.
W5
Apr 30 ★
Final Report Due. Complete final benchmarks, generate plots, write report, and prepare poster/demo materials for poster session (May 1st).
[1] R. M. Goldstein, H. A. Zebker, and C. L. Werner, "Satellite Radar Interferometry: Two-Dimensional Phase Unwrapping," Radio Science, 1988.
[2] G. López García, S. V. Veleva, and A. C. De La Campa, "A parallel path-following phase unwrapping algorithm based on a top-down breadth-first search approach," Optik, 2020.
[3] Y. Li, S. Han, X. Li, and C. Xu, "Efficient GPU acceleration for phase unwrapping algorithm," in Optical Metrology and Inspection for Industrial Applications X, vol. 12769, SPIE, 2023, Art. no. 1276917, doi: 10.1117/12.2686780.