simulated-annealing
exploring parallelization models for simulated annealing
Parallelizing Simulated Annealing
How we accelerated global optimization by 11x with heterogeneous CPU+GPU computing, comparing cooperative and independent parallel algorithms
Global Optimization in Complex Landscapes
Traditional gradient-based optimization methods get trapped in local optima on complex, multi-modal landscapes. Simulated Annealing (SA) addresses this by probabilistically accepting worse solutions, which lets the search escape local traps and approach the global optimum.
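For concreteness, here is a minimal sketch of the Metropolis acceptance rule at the heart of SA (illustrative C++, not the project's code): a worse candidate is accepted with probability exp(-Δ/T), so higher temperatures tolerate larger uphill moves.

// Minimal sketch of the SA acceptance rule (illustrative, not the project's code).
#include <cmath>
#include <random>

bool accept_candidate(double current_cost, double candidate_cost,
                      double temperature, std::mt19937& rng) {
    double delta = candidate_cost - current_cost;
    if (delta <= 0.0) return true;                   // always accept improvements
    std::uniform_real_distribution<double> u(0.0, 1.0);
    return u(rng) < std::exp(-delta / temperature);  // occasionally accept worse moves
}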
But how do we make SA faster using modern parallel computing? We compared two fundamentally different approaches:
Multi-Start Simulated Annealing (MSA)
- Multiple SA processes run completely independently
- No communication until final result selection
- Highly parallelizable with no synchronization overhead
Coupled Simulated Annealing (CSA)
- Multiple SA processes share information during optimization
- Periodic synchronization and parameter sharing
- Collective intelligence guides exploration

The Schwefel Function
We tested on the notoriously difficult Schwefel function:
\[f(x) = 418.9829d - \sum_{i=1}^{d} x_i \sin(\sqrt{|x_i|}), \quad x \in [-500, 500]^d\]
Why Schwefel: Large search space, many local minima, a global minimum far from the center, and, crucially, linear separability, which allows dimension-wise GPU parallelization.
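As a reference point, a plain CPU evaluation makes that separability explicit: the objective is a sum of terms that each depend on a single coordinate. The sketch below is illustrative, not the benchmarked implementation.

// Reference sketch: Schwefel as a sum of independent per-dimension terms.
#include <cmath>

double schwefel(const double* x, int d) {
    double sum = 0.0;
    for (int i = 0; i < d; ++i) {
        sum += x[i] * std::sin(std::sqrt(std::fabs(x[i])));  // term depends only on x[i]
    }
    return 418.9829 * d - sum;
}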
Experimental Design
We systematically tested across four computational paradigms:
- Sequential: Single-threaded baseline
- OpenMP: Multi-threaded CPU parallelization
- CUDA: GPU acceleration for function evaluation
- Heterogeneous: OpenMP + CUDA combined
OpenMP Parallelization: Near-Linear Speedup

- 4x speedup with 4 threads (near-linear scaling; see the sketch after this list)
- CSA overhead minimal: parameter-sharing costs are negligible in the parallel setting
- Surprising insight: cooperation became essentially “free”
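The MSA side of this comparison parallelizes trivially. Below is a minimal sketch of its OpenMP structure, assuming a per-chain state type and driver (SAState and run_sa_to_convergence are placeholder names, not the project's API): each thread owns one independent chain, so nothing is synchronized until the final selection.

// MSA as an embarrassingly parallel loop: one independent SA chain per iteration.
struct SAState;                              // per-chain optimizer state (placeholder)
void run_sa_to_convergence(SAState* chain);  // one full, independent SA run (placeholder)

void run_msa(SAState* chains, int num_chains) {
    #pragma omp parallel for
    for (int i = 0; i < num_chains; ++i) {
        run_sa_to_convergence(&chains[i]);   // no communication between chains
    }
    // The best chain is selected in a single pass after the parallel region.
}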
Heterogeneous Computing: Game-Changing Performance
Runtime by configuration (lower is better):

| Configuration | MSA Single Thread | MSA Multi-Thread | CSA Single Thread | CSA Multi-Thread |
|---|---|---|---|---|
| CPU Only | 3,493,635 μs | 876,549 μs | 3,884,980 μs | 995,846 μs |
| GPU + CPU | 1,358,433 μs | 344,000 μs | 1,563,125 μs | 440,082 μs |
| GPU (Warped) + CPU | 1,208,433 μs | 321,389 μs | 1,413,214 μs | 400,981 μs |
Key Results:
- 3x improvement from heterogeneous computing alone
- Additional 20% boost from warp-level optimizations
- Total speedup: Up to 11x faster than the sequential baseline (3.49 s → 0.32 s for MSA; quick check below)
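As a quick check of the headline number, dividing the sequential MSA runtime by the warped multi-thread runtime from the table gives:
\[\text{speedup} = \frac{3{,}493{,}635\ \mu\text{s}}{321{,}389\ \mu\text{s}} \approx 10.9\times\]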
Memory Bottleneck Discovery

CUDA Profiling Revealed:
- 73% of GPU time spent on memory operations (copying, allocation, deallocation)
- Only 27% on actual computation
- Key insight: Memory management, not computation, determines performance
Viewed through the roofline model, this puts the kernel firmly in the memory-bound region, well below peak compute throughput; in other words, the hardware is not being utilized effectively.
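One concrete mitigation suggested by this profile is to allocate device buffers once and reuse them across evaluations, instead of paying for allocation and deallocation on every call. The sketch below is illustrative (the struct and member names are assumptions, not the project's code):

// Illustrative sketch: persistent device buffers to avoid per-call cudaMalloc/cudaFree.
#include <cuda_runtime.h>

struct DeviceBuffers {
    double* d_x = nullptr;      // candidate solution on the device
    double* d_terms = nullptr;  // per-dimension Schwefel terms
    double* d_sum = nullptr;    // reduced sum

    void init(int d) {          // allocate once, up front
        cudaMalloc(&d_x, d * sizeof(double));
        cudaMalloc(&d_terms, d * sizeof(double));
        cudaMalloc(&d_sum, sizeof(double));
    }
    void release() {            // free once, at shutdown
        cudaFree(d_x);
        cudaFree(d_terms);
        cudaFree(d_sum);
    }
};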
Implementation Highlights
Below are a few design details that show how we approached the problem.
CUDA Kernel Design
// Exploit linear separability - one thread per dimension
__global__ void evaluate_schwefel_kernel(double* x, double* results, int d) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < d) {
        results[tid] = x[tid] * sin(sqrt(fabs(x[tid])));
    }
}
// Warp-level reduction for efficient summation
__global__ void warped_reduction_kernel(double* partial, double* result, int d) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    double val = (tid < d) ? partial[tid] : 0.0;     // one per-dimension term per thread
    // Use shuffle operations to avoid shared-memory synchronization
    for (int offset = 16; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xffffffff, val, offset);
    }
    if ((threadIdx.x & 31) == 0) {
        atomicAdd(result, val);                      // lane 0 of each warp commits its partial sum
    }
}
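For context, here is a hedged host-side sketch of how these two kernels could be chained to evaluate f(x); it assumes the illustrative DeviceBuffers struct from the memory-bottleneck section above and is not the project's exact driver.

// Illustrative driver: copy x in, run both kernels, finish f(x) on the host.
void evaluate_on_gpu(const double* h_x, int d, double* h_fx, DeviceBuffers& buf) {
    cudaMemcpy(buf.d_x, h_x, d * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemset(buf.d_sum, 0, sizeof(double));
    int threads = 256;
    int blocks = (d + threads - 1) / threads;
    evaluate_schwefel_kernel<<<blocks, threads>>>(buf.d_x, buf.d_terms, d);
    warped_reduction_kernel<<<blocks, threads>>>(buf.d_terms, buf.d_sum, d);
    double sum = 0.0;
    cudaMemcpy(&sum, buf.d_sum, sizeof(double), cudaMemcpyDeviceToHost);  // implicit sync
    *h_fx = 418.9829 * d - sum;  // f(x) = 418.9829*d - sum of per-dimension terms
}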
OpenMP Coordination
// CSA: cooperative chains with careful synchronization (one OpenMP thread per chain)
#pragma omp parallel num_threads(num_processes)
{
    int i = omp_get_thread_num();
    while (!convergence) {
        cuda_sa_step(process[i]);                     // GPU-accelerated SA step
        #pragma omp barrier                           // wait for all chains to finish the step
        #pragma omp critical                          // serialize the small shared update
        { update_shared_parameters(process[i]); }
        #pragma omp barrier                           // everyone sees the updated parameters
    }
}
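The design choice here is to give each SA chain its own OpenMP thread and to confine cooperation to the short critical section: the barriers bound the cost of information sharing to one brief serialized update per step, which is why CSA's overhead stays small in the multi-threaded results above.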
Conclusion
Our study demonstrates that heterogeneous computing can achieve dramatic speedups (11x) for global optimization, but success requires:
- Algorithm-hardware co-design: Matching computational patterns to architecture strengths
- Memory-first optimization: Focusing on data movement over raw computation
- Cooperative algorithms: Information sharing provides benefits when overhead is controlled
- Problem structure awareness: Mathematical properties (like linear separability) determine parallelization strategies
Key Finding: The combination of algorithmic cooperation (CSA) and heterogeneous computing (CPU+GPU) provides the best performance, achieving near-optimal solutions 11x faster than sequential approaches.
This research was conducted at Brown University in collaboration with Mason Lee, exploring high-performance global optimization. Complete implementation details and experimental data demonstrate the effectiveness of cooperative parallel algorithms in heterogeneous computing environments.