15-418 / 15-618 · Spring 2026 · CMU

GPU Parallelization of the
Goldstein Branch-Cut
Phase Unwrapping Algorithm

Furi Xiang  ·  Anurag Aryal

We propose to implement and evaluate a GPU-accelerated version of the Goldstein branch-cut phase unwrapping algorithm on NVIDIA GPUs using CUDA. Our work focuses on three stages — residue identification, branch-cut placement, and phase integration — and will compare multiple parallelization strategies, including tiled GPU kernels, residue-list-based cut placement, and multi-point BFS-style frontier integration.

Phase unwrapping arises in interferometric imaging applications such as InSAR, digital holography, and optical metrology, where the measured phase is known only modulo 2π, producing artificial 2π discontinuities. Goldstein's branch-cut algorithm is a classical path-following approach that first detects residues, then places branch cuts connecting opposite-polarity residues (or residues to the image boundary), and finally integrates the wrapped phase along paths that avoid those cuts.

01
Residue ID
For each 2×2 cell, compute wrapped circulation and mark positive/negative residues. Naturally data-parallel — one thread per pixel/cell.
02
Branch-Cut Placement
Connect opposite-polarity residues or residues to the image boundary. Traditionally greedy and serial; our main parallelism challenge.
03
Phase Integration
Starting from a valid seed pixel, propagate unwrapped phase to reachable neighbors without crossing branch cuts. Graph traversal over the phase domain.
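To make stage 1 concrete, the residue test can be sketched in serial form as below. This is our own minimal illustration, not the starter repo's code; `wrap` maps a phase difference into (−π, π], and a loop sum near ±2π marks a residue.

```cpp
#include <cmath>
#include <vector>

static const double kPi = 3.14159265358979323846;

// Wrap a phase difference into (-pi, pi].
static double wrap(double d) {
    while (d > kPi)   d -= 2.0 * kPi;
    while (d <= -kPi) d += 2.0 * kPi;
    return d;
}

// Stage 1: for each 2x2 cell, sum the wrapped differences around the
// loop. A sum near +2*pi (-2*pi) marks a positive (negative) residue.
// `phase` is row-major h x w; the result is (h-1) x (w-1) with -1/0/+1.
std::vector<int> findResidues(const std::vector<double>& phase, int h, int w) {
    std::vector<int> residue((h - 1) * (w - 1), 0);
    for (int r = 0; r < h - 1; ++r)
        for (int c = 0; c < w - 1; ++c) {
            double s = wrap(phase[r * w + c + 1]       - phase[r * w + c])
                     + wrap(phase[(r + 1) * w + c + 1] - phase[r * w + c + 1])
                     + wrap(phase[(r + 1) * w + c]     - phase[(r + 1) * w + c + 1])
                     + wrap(phase[r * w + c]           - phase[(r + 1) * w + c]);
            if (s >  kPi) residue[r * (w - 1) + c] = +1;
            if (s < -kPi) residue[r * (w - 1) + c] = -1;
        }
    return residue;
}
```

Because each cell reads only its four corners, the GPU version maps one thread per cell and can stage a tile of `phase` in shared memory.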

The first stage is a good fit for GPU: each residue test depends only on a small neighborhood, enabling tiled shared-memory kernels with strong locality. Stages 2 and 3 are harder. Branch-cut placement is traditionally greedy and serial. Integration is effectively a graph traversal with queue dependence, irregular control flow, and poor SIMD behavior. Recent work [2], [3] suggests tile/block-based GPU processing and BFS-style frontier traversal as the two main directions for exposing parallelism in these irregular stages.

This project is challenging because Goldstein's algorithm mixes a regular image-processing stage with two highly irregular, path-dependent stages. CUDA GPUs are most effective for workloads with regular memory access, abundant thread-level parallelism, and limited synchronization — and Goldstein's stages 2 and 3 violate all of these assumptions.

Residue ID
Regular memory access, strong spatial locality, little synchronization. Good fit for tiled CUDA kernels. Expected to scale well with image size.
Branch-Cut Placement
Serial dependence: once one residue is paired, future choices change. Sparse, data-dependent memory access. Expect warp divergence, atomic contention, and load imbalance across thread blocks.
Integration
Graph traversal with queue dependence and poor SIMD utilization. BFS-style frontier exposure introduces synchronization between levels and irregular frontier sizes. Tiled flood-fill may improve locality but requires repeated directional passes.
System Constraints
Avoiding warp divergence in residue matching and phase propagation; reducing atomic overhead during sparse residue compaction; maintaining coalesced memory access for sparse data structures; handling large differences in residue density across phase maps.
⚙️
Language / APIs: CUDA / C++, OpenCV (image I/O and pre/post-processing)
💻
Hardware: NVIDIA RTX 2080 GPUs on GHC cluster machines
Starter code: CPU serial reference implementation — github.com/williamdelacruz/parallelGoldstein
📄
Full proposal: final_proposal_15418.pdf — contains complete write-up with all references

We aim to match or exceed the ~61× speedup reported in the reference GPU phase unwrapping paper [3] (CPU: AMD Ryzen 5900H vs. GPU: NVIDIA RTX 3070 Laptop).

Plan to Achieve
CPU serial baseline: residue detection, greedy branch-cut placement, serial flood-fill, correctness testing on synthetic phase maps.
GPU residue-identification kernel: naive global-memory vs. shared-memory tiled comparison.
GPU branch-cut placement: residue compaction into +/− arrays, GPU residue matching, branch-cut construction. Compare ≥2 variants.
GPU integration: tiled flood fill / wavefront propagation vs. BFS-style frontier integration.
Experimental evaluation: RMS correctness, stage-by-stage runtime, end-to-end speedup, scaling with image size and residue density.
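The residue-compaction step in the plan above can be sketched in serial form (illustrative names, not the starter repo's API). On the GPU, the appends become an atomic-counter or prefix-sum stream compaction so that each thread writes to a unique slot.

```cpp
#include <utility>
#include <vector>

// Serial sketch of residue compaction: scatter the +1/-1 entries of
// the dense (h x w) residue map into separate positive/negative
// coordinate lists for the pairing stage.
void compactResidues(const std::vector<int>& residue, int h, int w,
                     std::vector<std::pair<int,int> >& pos,
                     std::vector<std::pair<int,int> >& neg) {
    for (int r = 0; r < h; ++r)
        for (int c = 0; c < w; ++c) {
            if (residue[r * w + c] > 0)      pos.push_back(std::make_pair(r, c));
            else if (residue[r * w + c] < 0) neg.push_back(std::make_pair(r, c));
        }
}
```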
Hope to Achieve
Multi-source BFS or connected-component-based integration after cuts.
Spatial binning to reduce branch-cut pairing search cost.
Batched phase unwrapping for improved throughput.
Outperform the reference GPU paper's reported speedup.
If progress is slower than expected: fully implement stage 2 on GPU, and offload stage 3 to the CPU for a hybrid CPU-GPU pipeline.
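To make the spatial-binning item concrete, here is a brute-force greedy pairing sketch whose O(P·N) inner search is exactly the cost binning would cut down. Note this is a simplification for illustration: Goldstein's full placement grows its search box incrementally and balances the net charge of each cut tree.

```cpp
#include <cstdlib>
#include <utility>
#include <vector>

// Greedy pairing sketch: each positive residue claims its nearest
// unmatched negative residue by Manhattan distance. Residues left
// unmatched would instead be cut to the image boundary.
std::vector<std::pair<int,int> > pairResidues(
        const std::vector<std::pair<int,int> >& pos,
        const std::vector<std::pair<int,int> >& neg) {
    std::vector<char> used(neg.size(), 0);
    std::vector<std::pair<int,int> > pairs;  // (pos index, neg index)
    for (int i = 0; i < (int)pos.size(); ++i) {
        int best = -1, bestDist = 1 << 30;
        for (int j = 0; j < (int)neg.size(); ++j) {
            if (used[j]) continue;
            int d = std::abs(pos[i].first  - neg[j].first)
                  + std::abs(pos[i].second - neg[j].second);
            if (d < bestDist) { bestDist = d; best = j; }
        }
        if (best >= 0) { used[best] = 1; pairs.push_back(std::make_pair(i, best)); }
    }
    return pairs;
}
```

With spatial binning, the inner loop would scan only the bins near `pos[i]`, expanding outward until an unmatched negative residue is found.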

We chose NVIDIA GPU and CUDA because at least part of Goldstein is clearly data-parallel, and the project's core interest is in how to map the irregular stages onto SIMD hardware.

The residue detection stage is a pixel-wise parallel image kernel with strong locality — shared memory tiling can directly accelerate it. More importantly, branch-cut placement and integration have poor native SIMD compatibility, making them ideal case studies for learning how to reformulate branch-heavy, sparse algorithms into GPU-friendly forms.

This project is not merely about speeding up a dense numerical kernel. It is about understanding how branch divergence, sparse matching, atomics, and frontier expansion behave on a real parallel system — all of which are central themes of 15-418/618.

This schedule will be updated each week based on actual progress.

W1
Mar 25
Read and summarize key references. Evaluate and re-implement the CPU serial baseline for Goldstein branch-cut from the reference GitHub repo. Build synthetic wrapped-phase test cases and find open-source test images.
W2
Apr 7
Complete naive CUDA residue detection and shared-memory tiled kernel. Implement residue compaction into positive/negative arrays and residue matching.
W3
Apr 14 ★
Milestone Report Due. Implement GPU integration with tiled flood fill or tiled multi-point wavefront propagation. Complete and submit milestone report.
W4
Apr 21
Finalize integration kernel. Tune all 3 stages to reduce divergence and improve memory locality. Benchmark to record progress.
W5
Apr 30 ★
Final Report Due. Complete final benchmarks, generate plots, write report, and prepare poster/demo materials for poster session (May 1st).
[1] R. M. Goldstein, H. A. Zebker, and C. L. Werner, "Satellite Radar Interferometry: Two-Dimensional Phase Unwrapping," Radio Science, 1988.
[2] G. López García, S. V. Veleva, and A. C. De La Campa, "A parallel path-following phase unwrapping algorithm based on a top-down breadth-first search approach," Optik, 2020.
[3] Y. Li, S. Han, X. Li, and C. Xu, "Efficient GPU acceleration for phase unwrapping algorithm," in Optical Metrology and Inspection for Industrial Applications X, vol. 12769, SPIE, 2023, Art. no. 1276917, doi: 10.1117/12.2686780.