B&B CPU determinism #798

Open
aliceb-nv wants to merge 429 commits into NVIDIA:release/26.02 from aliceb-nv:determinism

Conversation

Contributor

@aliceb-nv aliceb-nv commented Jan 26, 2026

This PR introduces a deterministic execution mode for the parallel branch-and-bound MIP solver.
When determinism_mode = CUOPT_MODE_DETERMINISTIC, the solver guarantees bitwise-identical results across runs regardless of thread-scheduling variations, as long as the runs are on the same platform with the same environment (same glibc, etc.).

The chosen approach is referred to in the literature as Bulk Synchronous Parallel (BSP): an execution model in which computation proceeds in discrete horizons of virtual time (work units) separated by synchronization barriers.
Wall-clock time is inherently nondeterministic due to factors such as cache state, OS scheduling, and CPU throttling. Progress is therefore measured not in time but in "work units", which approximate an algorithm's execution time from known, deterministic quantities such as memory operations, problem features, and sparsity.

Workers explore the tree independently and locally within a horizon, and collect events (branching decisions, solutions, pseudo-cost updates) during the process without exchanging them yet.
At each horizon boundary, events are sorted by work-unit timestamp with tie-breaking, history is replayed in deterministic order to update global state, and if need be nodes are redistributed across workers to balance load (which replaces the ramp-up mechanism).
Workers operate on local snapshots of shared state (pseudo-costs, upper bound, LP iteration count) taken at horizon start to avoid read races. This trades some accuracy in decisions (pruning may rely on a higher lower bound than in nondeterministic mode) for determinism.
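
In rough outline, the boundary step is a deterministic merge-sort-replay over per-worker event logs. The sketch below is illustrative only; the struct fields and tie-break keys are assumptions, not the PR's actual bb_event_t:

#include <algorithm>
#include <cstdint>
#include <vector>

struct event_t {
  double work_units;        // virtual timestamp of the event
  int worker_id;            // tie-break 1: stable worker index
  uint64_t event_sequence;  // tie-break 2: per-worker monotonic counter
  // ... payload: branching decision, solution, pseudo-cost update ...
};

void replay_horizon(std::vector<std::vector<event_t>>& per_worker_logs)
{
  // Merge all local logs collected during the horizon.
  std::vector<event_t> merged;
  for (auto& log : per_worker_logs) {
    merged.insert(merged.end(), log.begin(), log.end());
    log.clear();
  }
  // Total order independent of thread scheduling: timestamp, then worker,
  // then per-worker sequence number.
  std::sort(merged.begin(), merged.end(), [](const event_t& a, const event_t& b) {
    if (a.work_units != b.work_units) return a.work_units < b.work_units;
    if (a.worker_id != b.worker_id) return a.worker_id < b.worker_id;
    return a.event_sequence < b.event_sequence;
  });
  for (const event_t& e : merged) {
    (void)e;  // apply to global state: pseudo-costs, incumbent, open nodes...
  }
}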

Support for CPU start heuristics is present, based on a "producer" model: CPUFJ starts immediately and begins producing solutions stamped with work-unit timestamps, which the B&B thread consumes once it reaches the corresponding virtual time. A form of one-way synchronization prevents the CPUFJ thread from "falling behind" and producing solutions in the past from the B&B thread's perspective: CPUFJ is allowed to run ahead, and is biased to run faster than B&B to avoid unnecessary syncs.
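
One plausible shape for that one-way coupling, reduced to two virtual clocks (a sketch under assumed names; the actual producer_sync_ logic in the PR may differ):

#include <atomic>

struct producer_sync_t {
  std::atomic<double> bb_work_units{0.0};  // advanced by the B&B thread
  std::atomic<double> fj_work_units{0.0};  // advanced by the CPUFJ thread

  // Called by CPUFJ between work blocks: if the producer's clock has fallen
  // behind B&B's, fast-forward it so new solutions are never stamped in the
  // past from the B&B thread's perspective. Running ahead needs no action.
  void producer_catch_up()
  {
    double bb = bb_work_units.load(std::memory_order_acquire);
    double fj = fj_work_units.load(std::memory_order_relaxed);
    if (fj < bb) { fj_work_units.store(bb, std::memory_order_release); }
  }
};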

B&B BFS workers, diving workers, and a start CPUFJ thread are supported in deterministic mode. GPU heuristics are disabled in this mode and will be incorporated into the deterministic framework in the future.

Note: files in the utilities/models/ folder are auto-generated from the LightGBM models trained to predict work units from algorithm features.

The implementation is described in more detail in this document: B&B Determinism

Benchmark results

10-minute runs:

  • Nondeterministic B&B alone (default): 174 feasible, 38.3% primal gap
  • Deterministic B&B alone: 167 feasible, 40.3% primal gap
  • Deterministic B&B+CPUFJ: 184 feasible, 35.9% primal gap

Regression testing against main branch:
With changes:
Baseline:

Summary by CodeRabbit

  • New Features

    • Added work-limit control, a deterministic mode, and a configurable seed; new CLI flags to set work-limit and enable determinism.
  • Infrastructure

    • Expanded profiling, memory and timing instrumentation; added work-unit synchronization and scheduling for coordinated runs; deterministic execution scaffolding.
  • Tests

    • New deterministic reproducibility tests verifying consistent termination, objectives, and solutions across repeated runs.

@aliceb-nv aliceb-nv changed the title [Don't review yet, reworking] B&B CPU determinism B&B CPU determinism Feb 6, 2026
@aliceb-nv
Contributor Author

/ok to test ff47875


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 12

🤖 Fix all issues with AI agents
In `@cpp/src/dual_simplex/bb_event.hpp`:
- Around line 71-129: operator< uses event_sequence as a final tie-breaker but
none of the factory functions set it; update all factory methods (make_branched,
make_integer_solution, make_fathomed, make_infeasible, make_numerical) to accept
an extra event_sequence parameter (e.g., uint64_t event_sequence) and assign it
to e.event_sequence before returning, ensuring deterministic ordering;
alternatively, if caller-side sequence management isn't desired, have each
factory pull a monotonic counter (e.g., next_event_sequence()) and set
e.event_sequence from that source — but make the change consistently across all
listed factory functions so event_sequence is never left at its default.
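
A minimal sketch of the caller-managed variant, with a per-worker counter so ordering stays schedule-independent (names beyond those quoted above are assumptions):

#include <cstdint>

struct bb_event_t {
  int node_id             = -1;
  uint64_t event_sequence = 0;  // final tie-breaker in operator<

  static bb_event_t make_fathomed(int node_id, uint64_t event_sequence)
  {
    bb_event_t e;
    e.node_id        = node_id;
    e.event_sequence = event_sequence;  // never left at its default
    return e;
  }
  // make_branched, make_integer_solution, ... would take the same parameter.
};

// Usage: each worker owns a monotonic counter and stamps every event with it:
//   auto e = bb_event_t::make_fathomed(node, worker_event_counter_++);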

In `@cpp/src/dual_simplex/bounds_strengthening.cpp`:
- Around line 105-110: A_i is declared but never used while the code still
accesses A.i directly, causing an unused-variable warning and inconsistent nnz
accounting; fix this by replacing every occurrence of the instrumented access
A.i[p] with the cached reference A_i[p] (so A_i, A_x, Arow_j, Arow_x are used
consistently) in the column loop and other places where A.i is read, ensuring
nnz_processed remains the single source of nonzero counting; alternatively, if
you prefer to keep instrumented accesses, remove the unused A_i declaration
instead.

In `@cpp/src/dual_simplex/branch_and_bound.cpp`:
- Around line 666-675: The code unconditionally calls
settings_.heuristic_preemption_callback() and then calls it again in later
branches, causing duplicate preemption notifications; add a boolean guard (e.g.,
bool preemption_called = false) near the top of the function and replace direct
calls to settings_.heuristic_preemption_callback() with a conditional that
checks and sets preemption_called so the callback is invoked at most once
(update all sites that currently call settings_.heuristic_preemption_callback to
use this guard); reference the variables/members solver_status_,
mip_status_t::WORK_LIMIT, mip_status_t::NODE_LIMIT and
settings_.heuristic_preemption_callback when locating and updating the logic.
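
The guard pattern reduces to something like this standalone sketch (the settings/status types are stand-ins for the real branch_and_bound_t members):

#include <functional>

enum class mip_status_t { OPTIMAL, WORK_LIMIT, NODE_LIMIT };

struct settings_t {
  std::function<void()> heuristic_preemption_callback;
};

void notify_termination(settings_t& settings, mip_status_t status)
{
  bool preemption_called = false;
  auto notify_once       = [&] {
    if (!preemption_called && settings.heuristic_preemption_callback) {
      settings.heuristic_preemption_callback();
      preemption_called = true;
    }
  };
  if (status == mip_status_t::WORK_LIMIT) { notify_once(); }
  if (status == mip_status_t::NODE_LIMIT) { notify_once(); }  // guarded: no duplicate call
}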
- Around line 481-520: The call-site in local_search.cu should validate the work
unit timestamp before calling
branch_and_bound_t::queue_external_solution_deterministic to avoid silently
dropping old submissions; modify the producer/registration code around where
producer_sync_ and work_units are used to check work_units against
bb.get_work_unit_context().current_horizon() and only call
queue_external_solution_deterministic(h_vec, work_units) when work_units >=
current_horizon(), otherwise log a warning (or otherwise handle) indicating the
solution was from the past; reference the symbols producer_sync_, work_units,
bb.get_work_unit_context().current_horizon(), and
queue_external_solution_deterministic to locate where to insert the check and
warning.
- Around line 2363-2388: After the deterministic coordinator returns you must
propagate its termination code into the solver state: inside the
settings_.deterministic branch (after run_determinism_coordinator(Arow_)) assign
solver_status_ = determinism_global_termination_status_; so that
set_final_solution(...) sees TIME_LIMIT/WORK_LIMIT set by the coordinator;
update the same pattern in the other determinism-related exit paths referenced
(around lines 2668-2786) where the coordinator/sync callback can set
determinism_global_termination_status_.

In `@cpp/src/dual_simplex/phase2.cpp`:
- Line 3374: The local variable iters_elapsed (computed as "i_t iters_elapsed =
iter - last_feature_log_iter;") is never used and causes -Wunused-variable
warnings; either remove this declaration or restore its use (e.g., re-enable the
logging/feature code that consumed it) or explicitly mark it as used (cast to
void) to suppress the warning; update the code around the calculation in
phase2.cpp where iter and last_feature_log_iter are available to apply one of
these fixes.
- Around line 2658-2684: The code uses a magic divisor 1e8 when converting
aggregated byte loads/stores before calling work_unit_context->record_work
inside the cuopt::scope_guard lambda and again elsewhere; extract this value
into a named constant (e.g., constexpr double kWorkUnitDivisor = 1e8) near the
top of the translation unit, create a helper function
record_aggregated_work(WorkUnitContext*, std::pair<uint64_t,uint64_t>
loads_stores) that calls aggregator.collect_and_flush() (or accepts its result)
and performs the division + work_unit_context->record_work, then replace the
duplicate logic in the scope_guard and the periodic logging block with calls to
record_aggregated_work(work_unit_context, {total_loads,total_stores}) (or pass
the collected pair); also remove the "// TEMP;" and the large commented-out
feature-logging block or replace them with a single TODO referencing the
tracking issue to keep the diff clean.
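
The extraction might look like the following sketch (WorkUnitContext is reduced to a stand-in so the snippet is self-contained; the constant's value is the 1e8 quoted above):

#include <cstdint>
#include <utility>

struct WorkUnitContext {
  double total_work = 0.0;
  void record_work(double w) { total_work += w; }
};

// One work unit per 1e8 bytes of traffic; naming the divisor keeps the
// scope_guard and the periodic-logging call sites consistent.
constexpr double kWorkUnitDivisor = 1e8;

void record_aggregated_work(WorkUnitContext* ctx,
                            std::pair<uint64_t, uint64_t> loads_stores)
{
  if (ctx == nullptr) { return; }
  ctx->record_work(static_cast<double>(loads_stores.first + loads_stores.second) /
                   kWorkUnitDivisor);
}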

In `@cpp/src/mip/feasibility_jump/fj_cpu.cu`:
- Around line 270-281: The logging calculates hit percentages using divisions
like (double)fj_cpu.hit_count / (fj_cpu.hit_count + fj_cpu.miss_count) which can
divide by zero; update each CUOPT_LOG_TRACE call (the overall cache line and the
three candidate move lines referencing fj_cpu.hit_count, fj_cpu.miss_count,
fj_cpu.candidate_move_hits[i], and fj_cpu.candidate_move_misses[i]) to guard the
denominator (e.g., use a ternary or if to compute percentage = sum==0 ? 0.0 :
(double)hits / sum * 100.0) so that when hits+misses == 0 the log prints 0.00%
instead of NaN.
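
For instance, a guarded helper along these lines (a sketch, not the actual fj_cpu.cu code):

#include <cstdint>
#include <cstdio>

// Returns 0.0 instead of NaN when no samples have been recorded yet.
double hit_percentage(uint64_t hits, uint64_t misses)
{
  const uint64_t sum = hits + misses;
  return sum == 0 ? 0.0 : 100.0 * static_cast<double>(hits) / static_cast<double>(sum);
}

int main()
{
  std::printf("cache hits: %.2f%%\n", hit_percentage(0, 0));  // 0.00%, not nan
  std::printf("cache hits: %.2f%%\n", hit_percentage(3, 1));  // 75.00%
}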

In `@cpp/src/utilities/memory_instrumentation.hpp`:
- Line 37: The macro CUOPT_ENABLE_MEMORY_INSTRUMENTATION is hardcoded to 1
causing all ins_vector accesses to route through element_proxy_t and increment
byte_loads/byte_stores; change this to be opt-in by setting the default to 0 in
memory_instrumentation.hpp and wire a CMake option to control it (e.g., add a
CUOPT_ENABLE_MEMORY_INSTRUMENTATION cache option that defaults to OFF and sets
-DCUOPT_ENABLE_MEMORY_INSTRUMENTATION=1 for instrumented builds). Update the
header to `#ifndef` CUOPT_ENABLE_MEMORY_INSTRUMENTATION / `#define`
CUOPT_ENABLE_MEMORY_INSTRUMENTATION 0 / `#endif` so code uses the macro only when
explicitly enabled, and document how to enable it for deterministic work-unit
prediction (affecting element_proxy_t, ins_vector, and the
byte_loads/byte_stores counters).
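
The header-side change amounts to the usual opt-in guard (a sketch; the corresponding CMake option(...) wiring is assumed, not shown):

// memory_instrumentation.hpp: instrumentation is off unless the build
// explicitly passes -DCUOPT_ENABLE_MEMORY_INSTRUMENTATION=1.
#ifndef CUOPT_ENABLE_MEMORY_INSTRUMENTATION
#define CUOPT_ENABLE_MEMORY_INSTRUMENTATION 0
#endif

#if CUOPT_ENABLE_MEMORY_INSTRUMENTATION
// instrumented path: element_proxy_t counts byte_loads / byte_stores
#else
// default path: plain accessors, zero overhead
#endif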

In `@cpp/src/utilities/work_unit_scheduler.cpp`:
- Around line 67-71: The current loop where each thread independently calls
wait_at_sync_point while (total_work >= current_sync_target() && !is_shutdown())
can deadlock because threads may call a different number of barriers; change the
logic so all threads perform at most one barrier per sync generation and
re-evaluate together after the barrier: replace the inner while with a single
conditional check (if total_work >= current_sync_target() && !is_shutdown())
that calls wait_at_sync_point(ctx, current_sync_target()), and ensure
wait_at_sync_point (or the surrounding scheduler) advances a shared sync
generation/next_target after the barrier so every thread recomputes
current_sync_target() and can decide whether another sync is needed; reference
functions/variables: total_work, current_sync_target(), wait_at_sync_point,
is_shutdown to locate and update the logic.
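
The suggested shape, reduced to a self-contained stub (the barrier itself is elided; member names mirror the comment, the rest is assumed):

#include <atomic>

struct scheduler_stub_t {
  std::atomic<double> next_target{5.0};
  double sync_interval = 5.0;
  std::atomic<bool> shutdown{false};

  double current_sync_target() const { return next_target.load(); }
  bool is_shutdown() const { return shutdown.load(); }

  void wait_at_sync_point(double target)
  {
    // ... all threads block on a barrier here ...
    // Exactly one thread advances the shared generation target afterwards, so
    // every thread recomputes current_sync_target() on its next call.
    double expected = target;
    next_target.compare_exchange_strong(expected, target + sync_interval);
  }

  void on_work_recorded(double total_work)
  {
    // At most one barrier per sync generation: `if`, not `while`, so all
    // threads execute the same number of barrier waits and cannot deadlock.
    if (total_work >= current_sync_target() && !is_shutdown()) {
      wait_at_sync_point(current_sync_target());
    }
  }
};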
- Line 53: The setter set_sync_interval writes sync_interval_ unsafely while
current_sync_target (used by on_work_recorded across OpenMP threads) reads it,
causing a data race; make sync_interval_ a std::atomic<double> and update all
accesses to use .store() in set_sync_interval and .load() where
current_sync_target/on_work_recorded read it (or protect reads/writes with a
mutex), so concurrent reads/writes are synchronized; ensure the declaration of
sync_interval_ is changed to std::atomic<double> and replace direct reads/writes
with atomic load/store calls.
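
In isolation the fix is small (a sketch):

#include <atomic>

class work_unit_scheduler_stub_t {
  std::atomic<double> sync_interval_{5.0};  // was: double sync_interval_;

 public:
  void set_sync_interval(double v) { sync_interval_.store(v, std::memory_order_relaxed); }

  // Read by on_work_recorded across OpenMP threads without a data race.
  double current_sync_target(double completed_generations) const
  {
    return (completed_generations + 1.0) * sync_interval_.load(std::memory_order_relaxed);
  }
};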

In `@cpp/tests/mip/determinism_test.cu`:
- Around line 155-173: The test
DeterministicBBTest::reproducible_solution_vector fails to set settings.seed
(unlike other tests), so the RNG may differ across runs; update the
mip_solver_settings_t<int,double> settings instance in that test to set
settings.seed to a fixed value (the same constant used elsewhere in tests)
before calling solve_mip(&handle_, problem, settings) twice so both solves use
identical RNG initialization and produce comparable results.
🧹 Nitpick comments (20)
cpp/src/dual_simplex/sparse_vector.cpp (1)

105-134: Consider templating to_dense/scatter to avoid duplication.

The ins_vector overloads are exact copies of the std::vector versions (lines 94–124). A single templated overload would eliminate the duplication:

template <typename Container>
void to_dense(Container& x_dense) const { /* ... */ }

Not blocking since the wrapper is designed to be interchangeable, but worth considering if more container-type overloads are anticipated. As per coding guidelines, refactor code duplication in solver components into shared utilities.

cpp/src/dual_simplex/pseudo_costs.hpp (1)

25-33: The underlying_type trait is clever but somewhat obscure.

Extracting the inner type via decltype(std::declval<T&>()++) relies on the fact that omp_atomic_t<T>::operator++(int) returns T. A brief doc comment explaining why post-increment is used (to unwrap omp_atomic_t) would help future readers.

cpp/src/utilities/memory_instrumentation.hpp (3)

39-43: HDI macro name is short and collision-prone.

HDI is a very short, unprefixed macro name defined at global scope. If any other header (CUDA helpers, third-party libraries) also defines HDI, it will silently conflict. Consider a namespaced name like CUOPT_HDI or use #ifndef HDI guard.

♻️ Suggested fix
-#ifdef __NVCC__
-#define HDI inline __host__ __device__
-#else
-#define HDI inline
-#endif
+#ifndef CUOPT_HDI
+  #ifdef __NVCC__
+    #define CUOPT_HDI inline __host__ __device__
+  #else
+    #define CUOPT_HDI inline
+  #endif
+#endif

Then replace all uses of HDI with CUOPT_HDI in this file.


762-763: Implicit conversion to T& bypasses all instrumentation.

operator T&() and operator const T&() const allow any code accepting std::vector<T>& to silently bypass the instrumented accessors. This is convenient for backward compatibility but means instrumentation counters will undercount in any code path that passes an ins_vector to a function expecting std::vector&.

If this is intentional for the transition period, a brief comment would help future readers understand the trade-off.


282-283: standard_layout assertion may reject valid use cases.

static_assert(std::is_standard_layout_v<value_type>, ...) will reject ins_vector<std::string> or other non-trivial types. Currently the codebase only wraps numeric types and simple pairs, so this is fine, but the assertion message should clarify why standard layout is required (for sizeof-based byte counting).

cpp/src/dual_simplex/diving_heuristics.cpp (2)

74-81: Unused log parameter will cause compiler warnings.

After delegating to pseudocost_diving_from_arrays, the log parameter is no longer referenced. Same applies to guided_diving below (lines 91–98). Silence with (void)log; or mark [[maybe_unused]].

Suggested fix
 {
+  (void)log;
   return pseudocost_diving_from_arrays((const f_t*)pc.pseudo_cost_sum_down.data(),

91-98: Same unused log parameter issue.

Suggested fix
 {
+  (void)log;
   return guided_diving_from_arrays((const f_t*)pc.pseudo_cost_sum_down.data(),
cpp/tests/linear_programming/c_api_tests/c_api_test.c (1)

1512-1513: Consider a fixed seed instead of unseeded rand().

rand() without a prior srand() call uses the implementation-default seed (typically 1), so the "random" seed is actually the same constant on every test execution. Additionally, rand() is not thread-safe. Since any fixed value works for a determinism test, a literal constant (or a srand(time(NULL)) call if variety across runs is desired) would be clearer and more portable.

cpp/src/mip/solver_context.cuh (1)

62-63: Hardcoded 5.0-second sync interval.

The work_unit_scheduler_ interval is a compile-time constant. Consider making it configurable via mip_solver_settings_t if tuning is expected across different problem profiles. Fine for now if this is an internal tuning constant.

cpp/src/utilities/work_limit_context.hpp (2)

19-19: Unused #include <algorithm>.

This header doesn't use anything from <algorithm>. The static analysis tool also flagged this. Removing it will clean up the dependency.

Proposed fix
-#include <algorithm>
 #include <string>

36-36: Consider marking the constructor explicit.

Single-argument constructors should generally be explicit to prevent unintended implicit conversions from std::string.

Proposed fix
-  work_limit_context_t(const std::string& name) : name(name) {}
+  explicit work_limit_context_t(const std::string& name) : name(name) {}
cpp/src/mip/local_search/local_search.cu (1)

58-59: Static globals local_search_best_obj and pop_ptr leak state between test runs.

These file-scope statics persist across test cases, which can cause non-deterministic test behavior. local_search_best_obj is never reset, so its value from a prior test case carries over. pop_ptr is assigned but never appears to be read. Consider resetting local_search_best_obj at the start of each solve, or moving it to instance scope. As per coding guidelines, test isolation requires preventing global variables from leaking between test cases.

cpp/tests/mip/determinism_test.cu (1)

224-234: Comment on line 225 is outdated — missing work_limit field.

The comment says // Instance, threads, time_limit but the tuple actually has four fields including work_limit.

Proposed fix
-    // Instance, threads, time_limit
+    // Instance, threads, time_limit, work_limit
cpp/src/mip/problem/problem.cuh (1)

13-14: Duplicate #include "host_helper.cuh".

host_helper.cuh is included at both Line 13 and Line 22. The second include is redundant.

Proposed fix
-#include "host_helper.cuh"
-#include "problem_fixing.cuh"
+#include "problem_fixing.cuh"

Also applies to: 22-23

cpp/src/mip/feasibility_jump/fj_cpu.cu (1)

99-109: Dead variable new_val in get_mtm_for_constraint.

new_val is assigned at Line 102 (f_t new_val = old_val;) but is never read before the break at Line 105, and it is not part of the return tuple. This appears to be leftover from a refactor.

Proposed fix
     if (move_type == MTMMoveType::FJ_MTM_VIOLATED ? !violated : violated) continue;
 
-    f_t new_val = old_val;
-
     delta_ij = slack / (cstr_coeff * sign);
     break;
cpp/src/dual_simplex/deterministic_workers.hpp (1)

317-447: Skip enqueue when bounds strengthening reports infeasible.

enqueue_dive_node ignores the return value of bounds_strengthening, so infeasible nodes are still queued. Consider early‑returning to avoid wasted dives.

♻️ Suggested guard
-    this->node_presolver.bounds_strengthening(
-      settings, bounds_changed, entry.resolved_lower, entry.resolved_upper);
+    if (!this->node_presolver.bounds_strengthening(
+          settings, bounds_changed, entry.resolved_lower, entry.resolved_upper)) {
+      return;
+    }
cpp/src/dual_simplex/phase2.cpp (4)

26-35: NVTX ranges are unconditionally enabled in a hot path.

PHASE2_NVTX_RANGES is always #defined, so every simplex iteration incurs NVTX push/pop overhead. Consider gating this behind a CMake option or NDEBUG so that release builds avoid the cost.

-#define PHASE2_NVTX_RANGES
+// #define PHASE2_NVTX_RANGES  // Enable for profiling builds

As per coding guidelines, "Assess algorithmic complexity for large-scale problems … ensure O(n log n) or better complexity" — while individual NVTX calls are O(1), they add measurable overhead when called millions of times in the main simplex loop.


3371-3399: Redundant null check and dead code.

Line 3393 if (work_unit_context) is already guaranteed true by the outer condition on line 3373. The ~15 lines of commented-out feature logging should be removed or replaced with a TODO linking to a tracking issue.

Minimal cleanup
     if ((iter % FEATURE_LOG_INTERVAL) == 0 && work_unit_context) {
       i_t iters_elapsed = iter - last_feature_log_iter;
 
       auto [total_loads, total_stores] = aggregator.collect_and_flush();
-      // features.byte_loads              = total_loads;
-      // features.byte_stores             = total_stores;
-
-      // features.interval_runtime = now - interval_start_time;
-      // interval_start_time       = now;
-
-      // features.iteration             = iter;
-      // features.num_refactors         = num_refactors;
-      // features.num_basis_updates     = ft.num_updates();
-      // features.sparse_delta_z_count  = sparse_delta_z;
-      // features.dense_delta_z_count   = dense_delta_z;
-      // features.total_bound_flips     = total_bound_flips;
-      // features.num_infeasibilities   = infeasibility_indices.size();
-      // features.delta_y_nz_percentage = delta_y_nz_percentage;
-      // features.log_features(settings);
-
-      if (work_unit_context) {
-        // TEMP;
-        work_unit_context->record_work((total_loads + total_stores) / 1e8);
-      }
+      // TODO(aliceb-nv): Re-enable feature logging for regression training
+      work_unit_context->record_work((total_loads + total_stores) / BYTES_TO_WORK_UNITS);
 
       last_feature_log_iter = iter;
     }

3224-3248: Mixed ins_vector / std::vector& API surface.

update_primal_infeasibilities and update_single_primal_infeasibility still accept std::vector&, requiring .underlying() at every call site. This is fine as an incremental migration step, but consider migrating these two functions to ins_vector parameters in a follow-up to eliminate the .underlying() calls and get consistent memory tracking.


2631-2647: Consider a helper to register sparse vector components.

Each sparse_vector_t requires two aggregator.add() calls (.i and .x). With 8 sparse vectors, that's 16 lines of boilerplate. A small helper like aggregator.add_sparse("name", sv) that registers both .i and .x would halve the lines and reduce the risk of missing a component.

Contributor

@nguidotti nguidotti left a comment


I will approve it so my previous "request change" does not block your PR.

@aliceb-nv
Contributor Author

/ok to test 57129b7


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@cpp/src/dual_simplex/branch_and_bound.cpp`:
- Around line 2377-2389: The assignment to solver_status_ should only occur in
deterministic runs; update the branch in the code around
determinism_mode_enabled_ so that solver_status_ =
determinism_global_termination_status_ is executed only when
determinism_mode_enabled_ is true (leave solver_status_ unchanged in the
else/non-deterministic path so existing TIME_LIMIT/NODE_LIMIT statuses are
preserved); locate the block that computes lower_bound and the subsequent
solver_status_ assignment (references: determinism_mode_enabled_,
determinism_global_termination_status_, solver_status_, set_final_solution) and
wrap or move the assignment into the deterministic branch accordingly.

In `@cpp/src/dual_simplex/phase2.cpp`:
- Around line 2655-2656: The local variable interval_start_time (assigned via
toc(start_time)) is unused and will trigger -Wunused-variable; mark it with
[[maybe_unused]] or remove it to silence the warning. Update the declaration of
interval_start_time near last_feature_log_iter/iter to either add the attribute
(e.g., [[maybe_unused]] f_t interval_start_time = toc(start_time);) or delete
the variable if the commented-out uses will not be restored; ensure consistency
with the existing iters_elapsed [[maybe_unused]] style and keep
last_feature_log_iter and surrounding logic unchanged.
- Around line 2896-2913: The bound_flipping_ratio_test_t constructor call is
passing nonbasic_mark (an ins_vector<i_t>) directly while other ins_vector
arguments use .underlying(); change the argument to nonbasic_mark.underlying()
in the bound_flipping_ratio_test_t(...) invocation so the parameter type matches
the expected const std::vector<i_t>&; update the call site around the creation
of bfrt (bound_flipping_ratio_test_t<i_t,f_t> bfrt(...)) to pass
nonbasic_mark.underlying() instead of nonbasic_mark.
🧹 Nitpick comments (1)
cpp/src/dual_simplex/phase2.cpp (1)

3373-3399: Redundant null-check on work_unit_context at line 3393.

The outer condition at line 3373 already gates on work_unit_context being non-null, making the inner check at line 3393 redundant.

Suggested simplification
     if ((iter % FEATURE_LOG_INTERVAL) == 0 && work_unit_context) {
       [[maybe_unused]] i_t iters_elapsed = iter - last_feature_log_iter;
 
       auto [total_loads, total_stores] = aggregator.collect_and_flush();
       // ...
 
-      if (work_unit_context) {
-        // TEMP;
-        work_unit_context->record_work((total_loads + total_stores) / 1e8);
-      }
+      // TEMP;
+      work_unit_context->record_work((total_loads + total_stores) / 1e8);
 
       last_feature_log_iter = iter;
     }

@aliceb-nv
Contributor Author

/ok to test 1332fbd


Labels

improvement: Improves an existing functionality
non-breaking: Introduces a non-breaking change
