Multi-Filter GNN for CO (Wenkel et al. 2024) | Graph-structured input (e.g. adjacency of the problem graph) with minimal node features (constant or degree). Uses both low-pass and band-pass graph filters on nodes | Produces a Bernoulli probability for each node belonging to the solution; a discrete solution subset is obtained by thresholding these node probabilities | Approximate. Finds high-quality feasible solutions but no optimality guarantee (performance on par with Gurobi for Max-Cut, but not provably optimal) | Self-supervised (unsupervised) – trained with custom surrogate loss terms per problem (e.g. differentiable objectives/constraints for clique, dominating set, cut) instead of labeled solutions | No. Trained separately for each CO problem; no pre-training across tasks (focus is on a general GNN architecture rather than shared weights) | Tested on larger synthetic graphs than seen in training. E.g. a model trained on small Barabási–Albert graphs for Max-Cut generalizes well to larger BA graphs (outperforming specialized GNNs), while for Max Clique/MDS it remains competitive but not superior to task-specific methods | Leverages graph signal processing theory: combines low-pass smoothing and band-pass wavelet filters with localized attention to capture both “smooth” and “heterophilic” patterns in solutions. A theorem is provided showing that decoupling filter outputs via attention improves the ability to extract relevant high-frequency information | Synthetic graph benchmarks (“RB” graphs from the random-CSP RB model and Barabási–Albert “BA” graphs). Training on small graphs (e.g. 50–100 nodes) and evaluating on “-large” variants | Very deep GNN (up to 16 message-passing layers) with large hidden size (128–256) to capture complex patterns. Activations (ELU/GELU) and L2 norm were tuned per task. Learning rate ~1e-3–3e-3; ~200 epochs for convergence. |
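To make the surrogate-loss column above concrete, here is a minimal sketch of a differentiable expected-cut objective of the kind the row describes; it is not the authors' code, and `expected_cut_loss`, the toy triangle graph, and the stand-in logits are all illustrative assumptions.

```python
# Hedged sketch: a differentiable Max-Cut surrogate loss over per-node
# Bernoulli probabilities, the kind of unsupervised objective the row describes.
import torch

def expected_cut_loss(probs: torch.Tensor, edges: torch.Tensor) -> torch.Tensor:
    """Negative expected cut weight under independent Bernoulli node probabilities.

    probs: (n,) tensor of per-node inclusion probabilities in (0, 1).
    edges: (m, 2) long tensor of undirected edges.
    """
    p_u, p_v = probs[edges[:, 0]], probs[edges[:, 1]]
    # An edge is cut iff its endpoints land on different sides.
    expected_cut = (p_u * (1 - p_v) + p_v * (1 - p_u)).sum()
    return -expected_cut  # minimize the negative, i.e. maximize expected cut

# Toy usage: a triangle graph; the logits are a trainable stand-in for GNN output.
logits = torch.zeros(3, requires_grad=True)
edges = torch.tensor([[0, 1], [1, 2], [0, 2]])
loss = expected_cut_loss(torch.sigmoid(logits), edges)
loss.backward()  # gradients flow back into the (stand-in) GNN parameters
```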
Neural Algorithmic Reasoning for CO (Georgiev et al. 2023) | CO instance represented as a graph (e.g. fully connected graph with edge weights for TSP, standard graph for k-center), fed into a graph neural network reasoning model. No manual features beyond basic distances or costs | Problem-specific solution output. E.g. for TSP, the model produces a tour (sequence of nodes) – implemented via a recurrent pointer mechanism or iterative graph policy. For selection problems (center selection, etc.), outputs are node scores indicating chosen set | Approximate. Aims to outperform classical heuristics but does not guarantee optimality | Hybrid supervised + pre-training. First pre-trained on algorithmic tasks (e.g. learning to execute MST or shortest-path algorithms) to imbue the model with “algorithmic” biases. Then fine-tuned (supervised) on CO instances (using optimal solutions or high-quality solutions for training labels). The notable aspect is leveraging algorithmic pre-training to improve combinatorial generalization | Yes. Relies on transfer learning: knowledge from pre-training on classical graph algorithms (MST, BFS, etc.) is transferred to solving NP-hard CO problems. This algorithmic pre-training significantly improved performance on CO tasks | Trained on small graph instances (e.g. 8–20 nodes) and tested on significantly larger instances. Demonstrated the ability to extrapolate – e.g. models trained on TSP tours of ≤20 cities evaluated on tours with 40–1000 cities with improved generalization compared to models without pre-training | The GNN architecture is designed to emulate classical iterative algorithms (recurrent message passing akin to dynamic programming). The hypothesis is that a model infused with algorithmic reasoning (e.g. understanding of MST for TSP) will generalize more like an algorithm (size-invariant), addressing the theoretical limitation of standard neural nets on out-of-distribution sizes | Synthetic data generated for each task. E.g. TSP training instances from random points in [0,1]^2 (optimal tours via Concorde) and k-center instances from random graphs. Also used the DeepMind CLRS-30 algorithmic dataset for pre-training | Algorithmic reasoning model with recurrent message-passing steps (not fixed-depth) to allow extrapolation. Used ~16 message-passing iterations and moderate hidden dimensions (e.g. 128) so the model can simulate multi-step algorithms. Large training set sizes (e.g. 5×10^5 TSP instances across sizes) were used to ensure generality. |
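As a sketch of the pre-train-then-fine-tune recipe in this row, assuming a shared encoder with swappable heads; none of these modules or data shapes come from the paper, they are placeholders.

```python
# Hedged sketch: pre-train a shared encoder on an algorithmic target (a dummy
# regression standing in for, e.g., learning to execute shortest paths), then
# fine-tune with a fresh head on CO labels. All modules here are illustrative.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(8, 128), nn.ReLU(), nn.Linear(128, 128))
algo_head = nn.Linear(128, 1)   # predicts algorithmic quantities (pre-training)
co_head = nn.Linear(128, 1)     # scores nodes for the CO task (fine-tuning)

def train(head, x, y, steps=100, lr=1e-3):
    opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=lr)
    for _ in range(steps):
        loss = nn.functional.mse_loss(head(encoder(x)), y)
        opt.zero_grad(); loss.backward(); opt.step()

x_algo, y_algo = torch.randn(64, 8), torch.randn(64, 1)  # stand-in CLRS-style data
x_co, y_co = torch.randn(64, 8), torch.randn(64, 1)      # stand-in CO instances
train(algo_head, x_algo, y_algo)  # phase 1: algorithmic pre-training
train(co_head, x_co, y_co)        # phase 2: fine-tune encoder + new head on CO
```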
Are GNNs Optimal Approximation Algorithms? (Yau et al. 2024) | Problem instance cast as a graph or factor graph for Max-CSP (e.g. graph with weighted edges for Max-Cut/Min-Cover, hypergraph for SAT). The GNN (OptGNN) treats each variable as a node with learnable vector state, and uses the adjacency/constraint structure for message passing | Produces a fractional solution embedding (e.g. continuous vector for each node). A randomized rounding scheme is then applied to obtain a discrete feasible solution – e.g. cut assignment by hyperplane cut of node embeddings. The model can also output bound estimates from the embeddings | Approximate. Targets the best known approximation ratios (e.g. 0.878 for Max-Cut). Under theoretical assumptions (UGC), the message-passing algorithm achieves optimal approximation guarantees, but it does not guarantee exact optima | Unsupervised/self-supervised. Trained by minimizing a continuous relaxation objective. OptGNN’s loss is the penalized SDP objective (no ground-truth labels). Essentially it learns by gradient descent to minimize the gap between its output and the SDP optimum, with additional penalties to enforce constraints (unit-norm embeddings, etc.), rather than via any supervised solutions | No (single model per class). The model is general in design but trained separately for each problem type (Max-Cut vs. Max-3SAT, etc.) – there is no cross-problem fine-tuning in this work | Tested on both synthetic and real graph instances larger or different than those seen in training. E.g. model trained on small random graphs (50–100 nodes) was evaluated on much larger graphs (up to 1000 nodes) and different distributions (e.g. trained on Erdős–Rényi, tested on social networks), showing strong out-of-distribution performance. Also evaluated on standard benchmarks (Gset, etc.), often matching or beating solver or heuristic baselines in quality vs. runtime | The architecture directly arises from the theory of SDP relaxations. The message-passing network is proven to solve the SDP relaxation of Max-CSP to optimality under sufficient depth, essentially emulating the famous approximation algorithms. This provides a theoretical link: a polynomial-size GNN can match optimal SDP solution quality assuming the Unique Games Conjecture. It also yields provable bounds (using learned embeddings to produce dual bounds) | Both random synthetic graphs (Erdős–Rényi, Barabási–Albert, etc.) and real-world graph datasets (small networks from TU collections) were used. The training data for each problem is large (e.g. 100k+ random instances of varying sizes), and the model is then tested on a wide range of unseen graph instances, including established benchmark sets for each problem | OptGNN uses 64-dimensional node embeddings and a graph message-passing mechanism patterned on SDP solver iterations. The number of message-passing iterations is the key depth parameter – running more iterations at inference than were used in training can further tighten solutions. Overall, the network is of moderate size (a few MLP layers per message pass), but many message-passing iterations are employed to reach high-quality solutions. |
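The randomized rounding step in this row can be illustrated with a short sketch of hyperplane rounding over unit-norm node embeddings; the function name, trial count, and toy embeddings are assumptions for illustration, not OptGNN's code.

```python
# Hedged sketch: round learned node embeddings to a +/-1 cut by sampling random
# hyperplanes and keeping the best induced cut (Goemans-Williamson style).
import numpy as np

def hyperplane_round(embeddings: np.ndarray, edges, trials: int = 100, seed: int = 0):
    """Round d-dimensional node embeddings to a +/-1 cut assignment."""
    rng = np.random.default_rng(seed)
    best_cut, best_assign = -1, None
    for _ in range(trials):
        r = rng.normal(size=embeddings.shape[1])   # random hyperplane normal
        assign = np.sign(embeddings @ r)           # which side each node falls on
        assign[assign == 0] = 1
        cut = sum(assign[u] != assign[v] for u, v in edges)
        if cut > best_cut:
            best_cut, best_assign = cut, assign
    return best_assign, best_cut

emb = np.random.default_rng(1).normal(size=(4, 8))  # stand-in learned embeddings
emb /= np.linalg.norm(emb, axis=1, keepdims=True)   # project to the unit sphere
print(hyperplane_round(emb, edges=[(0, 1), (1, 2), (2, 3), (3, 0)]))
```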
Exact CO with GCNs – Branching (Gasse et al. 2019) | MILP instance encoded as a bipartite graph: variable nodes (features include objective coeff., bounds, etc.) and constraint nodes (features include coefficients’ stats, RHS, etc.), with edges linking a variable to the constraints it appears in (edge weight = coefficient) | Outputs a score for each candidate branching variable at a search tree node (via a final MLP on variable node embeddings). The highest-scoring variable is selected for branching. This learned policy is plugged into a standard Branch-and-Bound solver | Exact (solver-level). Using the model within B&B preserves solver optimality – it just chooses the branch order. The B&B with learned branching finds provably optimal solutions, but more efficiently (fewer nodes/time) | Supervised imitation learning. Trained to imitate the “strong branching” expert rule. They generate training data by recording strong-branching decisions at tree nodes and train the GCN to reproduce them; the loss is a cross-entropy imitation loss that encourages the model to select the same variable as the expert | No. The GCN policy is trained on a specific distribution of MILPs (e.g. set covering problems) and is not transferred to different MILP classes. (Generality comes from the bipartite representation, but weights are not fine-tuned across heterogeneous problems) | Demonstrated generalization to larger instances and different instances of the same class than seen in training. E.g. a policy trained on small set-cover instances was applied to much larger ones, and it outperformed conventional heuristics in branching efficiency. Also tested on a mix of hard MILPs, including some not explicitly in training, with success | The model architecture aligns with the variable–constraint structure of MILPs, which is theoretically natural for branching decisions. By using a GCN on the bipartite variable–constraint graph, the learned policy can approximate strong branching scores (which relate to dual bound improvements). In effect, the architecture can be seen as learning a fast approximation to the LP re-solve computations of strong branching | Used a variety of MILP benchmarks (set covering, combinatorial auctions, capacitated facility location, maximum independent set) to train and evaluate. Training data was generated by instrumenting a solver to collect strong-branching decisions on thousands of nodes across training instances | A 2-layer Graph Convolution Network on the bipartite graph, with ~64 hidden units, was used (as per open-source code). Optimized with Adam. Important hyperparameters included the number of strong-branching samples per instance (to limit data collection cost). The model was integrated into the solver (SCIP) at runtime to replace its default branching rule. |
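A minimal sketch of the bipartite variable-constraint encoding described here, assuming a MILP of the form min c^T x s.t. Ax <= b; the feature choices below are illustrative stand-ins, not Gasse et al.'s exact feature set.

```python
# Hedged sketch: build the bipartite variable-constraint graph from (A, b, c).
import numpy as np

def milp_to_bipartite(A: np.ndarray, b: np.ndarray, c: np.ndarray):
    var_feats = np.stack([c, np.count_nonzero(A, axis=0)], axis=1)   # per variable
    cons_feats = np.stack([b, np.count_nonzero(A, axis=1)], axis=1)  # per constraint
    rows, cols = np.nonzero(A)
    edges = list(zip(rows.tolist(), cols.tolist()))                  # (constraint, variable)
    edge_feats = A[rows, cols]                                       # coefficient as edge weight
    return var_feats, cons_feats, edges, edge_feats

A = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]])
print(milp_to_bipartite(A, b=np.array([4.0, 5.0]), c=np.array([1.0, -1.0, 2.0])))
```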
Learning CO Algorithms over Graphs – S2V-DQN (Dai et al. 2017) | Graph of the problem instance (nodes = elements, edges = interactions). The current partial solution is encoded by node tags (e.g. a binary feature indicating whether a vertex is already in the solution). A Structure2Vec graph embedding network computes state embeddings given the graph + partial solution | Outputs a Q-value for each possible action (adding a specific node to the solution, or terminating). The agent greedily selects the action with highest Q. The sequence of actions incrementally builds a solution, and when no action has positive Q (or a termination action is chosen), the algorithm outputs the current set as the final solution | Approximate. The learned greedy or local search heuristic yields high-quality solutions but not guaranteed optimal. (It achieved near-optimality on test graphs on average) | Reinforcement learning (Q-learning). Trained with n-step Q-learning and experience replay to maximize cumulative reward (the negative of the solution cost for MVC, or the cut weight for Max-Cut). Reward is given only when a solution is completed (or incremental rewards for intermediate improvements), encouraging the agent to construct better solutions. No supervised optimal solutions are needed for training | No. A separate model is learned for each problem type (they apply the same framework to MVC, Max-Cut, TSP, etc., but do not transfer parameters between them) | The learned policies were evaluated on graphs much larger than those seen in training. For example, a model trained on 50–100 node graphs was tested on graphs up to 1200 nodes, and maintained strong performance (only slight degradation with size for MVC/Max-Cut). This demonstrates the agent’s ability to generalize to significantly larger instances. It also generalized to different graph distributions (ER vs. BA) with robust performance | The architecture and greedy paradigm are motivated by the structure of efficient approximation algorithms. By using a graph embedding network (structure2vec) that resembles belief propagation, the agent can estimate solution value changes for adding a node, akin to classical greedy criteria. The ability to revise node embeddings after each addition gives it a theoretical advantage over irreversible one-shot constructions, capturing the local search nature of many combinatorial heuristics | Random graph instances were generated for training (Erdős–Rényi with edge probability 0.15, and Barabási–Albert with average degree 4). Optimal or near-optimal solutions were obtained via solvers for small graphs to evaluate performance (not for training labels). The agent’s performance was also compared on classic benchmark graphs (e.g. TSPLIB for TSP) to test generalization | Uses a graph embedding network (S2V) with 64-dimensional embeddings and 3–5 message-passing iterations per action. The DQN was trained with an ε-greedy exploration strategy and experience replay. Key hyperparameters included the discount factor (set to 1 since reward is final solution value) and the number of training episodes (~10⁴). A “budget” of actions per episode was set to prevent very long runtimes. The same network architecture (embedding layers and Q-value MLP) was kept across different graph sizes, enabling scalability. |
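The greedy construction loop this row describes can be sketched as follows; `q_values` is a stand-in for the trained S2V network, and the toy Q function at the end is purely illustrative.

```python
# Hedged sketch: greedily add the node with the highest Q until no action has
# positive Q, mirroring the construction policy the row describes.
def greedy_construct(n_nodes: int, q_values) -> set:
    solution = set()
    while True:
        candidates = [v for v in range(n_nodes) if v not in solution]
        if not candidates:
            break
        q = {v: q_values(solution, v) for v in candidates}
        best = max(q, key=q.get)
        if q[best] <= 0:          # termination: no action improves the estimate
            break
        solution.add(best)        # irreversible greedy insertion
    return solution

# Toy stand-in Q: diminishing returns as the solution grows.
print(greedy_construct(5, lambda sol, v: 1.0 - 0.3 * len(sol) + 0.01 * v))
```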
Diffusion Model for Unsupervised NCO – DiffUCO (Sanokowski et al. 2024) | CO problem formulated as an energy minimization (Ising-like). The model treats a candidate solution as a vector of binary variables (e.g. an indicator per node) and learns a score function over this 0/1 space. The problem’s graph structure (which variables interact) is provided implicitly via the energy function and through a graph neural network denoiser that runs on the factor graph (e.g. a GNN conveys which pairs of variables have an edge) | Generates solutions by reverse diffusion: starting from random noise and iteratively refining into a structured binary solution. The final output is a probability distribution or sample for each decision variable. A discrete solution is obtained by sampling or thresholding the final probabilities (often multiple samples are drawn and the best valid solution taken) | Approximate. It produces high-quality feasible solutions but not guaranteed optimal. (It often reaches or beats state-of-the-art heuristic quality, but optimality is not assured without exhaustive sampling) | Unsupervised – energy-based training. No optimal solutions needed. The model is trained by minimizing a divergence between the model’s distribution and the true problem’s Boltzmann distribution. In practice, a surrogate loss (upper bound on reverse KL) is used, which encourages the diffusion model to place high probability on low-energy (near-optimal) states. This effectively teaches the model to sample good solutions without explicit solution labels | No. Each problem type (Max-Cut, MIS, etc.) is learned separately. The framework is generic, but they did not pre-train on one problem and transfer to another in this work | The model was evaluated on a variety of unseen instances and larger sizes than in training. E.g. trained on small random graphs (≈50 nodes) and tested on much larger graphs (≥500 nodes). It showed strong out-of-distribution performance, often outperforming specialized methods. They also demonstrated generalization to different graph structures (ER vs. BA vs. real networks) – the learned model maintained solution quality across these | The architecture is rooted in diffusion probabilistic models and graphical model inference. By interpreting CO as sampling from a Boltzmann distribution, they marry neural networks with statistical physics samplers. The use of a GNN in each diffusion step aligns with the locality of constraint interactions. Theoretically, as the number of diffusion steps increases, the model’s sample distribution approaches the target distribution, and indeed they find solution quality improves with more denoising steps (trading off runtime) | No ground-truth solution data – training instances are generated on the fly or taken from standard benchmarks. They evaluate on common CO test sets (e.g. DIMACS Max-Cut instances, random graphs for MIS, etc.). The absence of labeled data means the training “dataset” is essentially the problem’s own energy landscape, sampled via the diffusion process itself. (They also compare against Generative Flow Networks and other unsupervised methods on those instances) | Number of diffusion timesteps is crucial – they train with T on the order of 1000, and at inference use ~3× more steps for better results. The denoiser is a GNN with ~4–6 message-passing layers and 64–128 hidden size, balancing model capacity and efficiency. An annealed noise schedule is used to gradually refine solutions. They also introduced a Conditional Expectation (CE) trick and tuned a token size for state updates to further improve sampler efficiency. |
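As an illustration of the Conditional Expectation (CE) style decoding mentioned in this row, here is a hedged sketch that derandomizes model marginals one bit at a time against a generic QUBO energy; the exact procedure in the paper may differ, and the toy QUBO is an assumption.

```python
# Hedged sketch: CE-style decoding of diffusion marginals against a QUBO.
import numpy as np

def expected_energy(p: np.ndarray, Q: np.ndarray) -> float:
    """Exact E[x^T Q x] for independent Bernoulli bits (uses E[x_i^2] = E[x_i])."""
    off = Q - np.diag(np.diag(Q))
    return p @ off @ p + np.diag(Q) @ p

def ce_decode(probs: np.ndarray, Q: np.ndarray) -> np.ndarray:
    p = probs.astype(float).copy()
    for i in range(len(p)):
        p[i] = 0.0
        e0 = expected_energy(p, Q)
        p[i] = 1.0
        e1 = expected_energy(p, Q)
        p[i] = 1.0 if e1 <= e0 else 0.0   # fix bit i to the non-increasing choice
    return p.astype(int)

Q = np.array([[-1.0, 2.0], [2.0, -1.0]])   # toy QUBO: penalizes picking both bits
print(ce_decode(np.array([0.6, 0.6]), Q))  # -> [0 1], an optimal assignment
```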
Exploratory CO with RL – ECO-DQN (Barrett et al. 2020) | Graph representation of the problem (as in S2V-DQN), with dynamic state: at any time, some vertices are in the current solution subset. The state is encoded by node features indicating membership (or solution mask), processed by a GNN to produce Q-values | The agent can add or remove a vertex. The Q-network assigns each vertex an action-value for adding (if out) or removing (if in). The agent flips the vertex with highest positive Q. This continues until no flip yields improvement (all Q<0), at which point the current set is output as the final solution. (Multiple random restarts can be run to escape local optima) | Approximate. It’s essentially a learned local search. It finds very good solutions (state-of-the-art RL results for Max-Cut) but not guaranteed optimal. The advantage is the ability to improve an initial solution by iterative exploration | Reinforcement learning (DQN). Trained to maximize the final solution value via Q-learning. Rewards are structured to encourage exploration: small negative step costs and a big positive reward at episode end for the final cut weight achieved. This setup drives the agent to climb in solution value during an episode. The training uses experience replay and target networks similar to DQN in games. No supervised solutions are given; the agent learns by trial-and-error improvements | No. Focused on a single problem (Max-Cut) in experiments, using the general method. There is no pre-training on one graph task to transfer to another in this work (though conceptually the framework applies to any graph CO) | Evaluated on large and diverse graphs: e.g. trained on random graphs up to 100 nodes and tested on instances with up to 2000 nodes (including Gset graphs). The learned algorithm maintained strong performance, outperforming previous learned heuristics and remaining competitive with problem-specific algorithms even on graphs an order of magnitude larger than training. It can also start from any initial solution (even one provided by another heuristic) and learn to improve it, illustrating robustness at test time | Key insight: allow the agent to revise earlier choices (add or remove nodes), addressing the “irrevocability” limitation of greedy methods. The GNN+Q-network architecture provides the agent with global feedback on how a local move changes solution quality (much like the flip algorithms in local search). Theoretically, this aligns with iterative improvement techniques that can escape local optima. The architecture must also encode the current solution state (they include that as input) so the agent’s decisions depend on context, a form of memory that improves search exploration | Training instances were random graphs (ER and BA) as in prior work. Additionally, they tested on known Max-Cut benchmark graphs (up to 2000 nodes from Gset) without further training. The model’s ability to start from arbitrary initial solutions means they sometimes initialize with random or greedy solutions and let the agent refine them, effectively combining search with learned guidance | Uses the same Structure2Vec graph embedding (embedding size ~64) but extends it to allow node removals. The Q-network was trained with a curriculum: starting from easier instances (smaller graphs or known good initial solutions) and gradually increasing difficulty, to stabilize learning. Discount factor γ≈1 and careful reward shaping (e.g. –1 per step, +Δ cut gain) were crucial. They also found that running multiple episodes with different random initial solutions (“exploratory starts”) and taking the best result improved final performance, effectively a hyperparameter of how many random restarts to try. |
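A minimal sketch of the learned flip-based local search with exploratory restarts described above; the perfect-critic `true_gain` at the end stands in for the trained Q-network and is purely illustrative.

```python
# Hedged sketch: flip the vertex with the highest positive Q (add if out,
# remove if in), with random restarts to escape local optima.
import random

def cut_value(in_set, edges):
    return sum((u in in_set) != (v in in_set) for u, v in edges)

def flip_search(n, edges, q_values, restarts=5, seed=0):
    rng = random.Random(seed)
    best_set, best_cut = set(), -1
    for _ in range(restarts):
        s = {v for v in range(n) if rng.random() < 0.5}   # exploratory random start
        while True:
            q = {v: q_values(s, v, edges) for v in range(n)}
            v = max(q, key=q.get)
            if q[v] <= 0:                                  # no flip predicted to help
                break
            s ^= {v}                                       # flip: add or remove v
        if cut_value(s, edges) > best_cut:
            best_set, best_cut = set(s), cut_value(s, edges)
    return best_set, best_cut

# Toy stand-in Q: the true one-step cut gain of flipping v (a perfect critic).
def true_gain(s, v, edges):
    return cut_value(s ^ {v}, edges) - cut_value(s, edges)

print(flip_search(4, [(0, 1), (1, 2), (2, 3), (3, 0)], true_gain))
```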
Scalable Discrete Diffusion Samplers – SDDS (Sanokowski et al. 2025) | Same discrete CO encoding as DiffUCO: problem defined by an energy function over binary variables (Ising model). Variable interactions (graph constraints) inform the structure of the denoising model – implemented via a graph-based network that processes variable nodes and their connections. Key difference: new training methods allow using many more diffusion steps, so the model can be run with very long chains | Solution sampling is identical in format to DiffUCO (reverse diffusion yielding a set of variable assignments). However, SDDS can produce unbiased samples from the Boltzmann distribution using adjustments: it introduces a self-normalized importance sampling scheme and a neural MCMC correction step so that outputs can be treated as approximately drawn from the true distribution. This means in principle it can sample not just one high-quality solution but a distribution of solutions with known statistics | Approximate (stochastic). It outputs very high-quality solutions (state-of-the-art on MIS, MaxClique, etc.) but, like DiffUCO, no guarantee of global optimality. It can provide statistically unbiased estimates for solution properties due to its importance-sampling mechanism, but obtaining the true optimum still relies on sampling luck | Unsupervised, memory-efficient training. Introduces two novel training objectives to avoid backpropagating through the entire diffusion chain: (1) a forward KL objective optimized via Self-Normalized Neural Importance Sampling (SN-NIS), which uses Monte Carlo samples of trajectories to estimate gradients without storing full chains, and (2) a reverse KL objective optimized via a policy-gradient RL approach (they apply the policy gradient theorem / PPO to treat the diffusion process as a sequential decision policy). Both approaches forgo the huge memory cost of backprop-through-time, enabling training with long diffusion horizons | No (task-general, not pre-trained). These techniques aim for a general sampler, but the model is still trained separately per problem type (or per distribution). There is no indication of a single model being fine-tuned across multiple CO problems here | Evaluated on significantly larger problem instances and an Ising model in physics. Thanks to memory-efficient training, they could use diffusion chains with T up to thousands. Empirically, models trained with more steps showed improved results on benchmarks (e.g. best MIS results on large graphs). They outperform prior neural solvers on popular benchmarks (Max-Cut, MIS on large graphs), and even approach the performance of specialized algorithms, demonstrating strong generalization to instance sizes far beyond training (some results reported up to 2000–3000 node graphs) | The design marries deep diffusion models with reinforcement learning and importance sampling theory. Theoretically, the SN-NIS training provides an unbiased gradient estimator of the forward KL divergence, avoiding the intractable likelihood computations by re-weighting sampled trajectories. The RL-based training reframes the reverse KL objective as a Markov decision process, where Q-values of diffusion transitions are used as rewards for policy updates. These innovations address two theoretical limitations: scaling to long sequences (by avoiding backprop through T steps) and enabling unbiased sampling (via corrected proposal distributions) | A large synthetic dataset of instances is not required (training is problem-specific). They validate on standard CO benchmarks (same as DiffUCO, plus additional Ising spin glass instances from statistical physics). The approach can be seen as a solver, so “dataset generation” is about picking representative instances to train on; for SDDS the focus is on graph problems and Ising models generated from known distributions | The crucial hyperparameter is the number of diffusion steps T – SDDS can use very large T (e.g. 3000 or more) during training because of memory-efficient objectives. The model architecture (denoiser) is similar to DiffUCO’s GNN (64-dim features, ~6 layers), but training requires tuning of the learning rate to stabilize the importance-weighted and RL gradient estimates. They also introduce a self-normalized IS threshold to reduce variance, and use PPO-specific hyperparameters (batch size, clipping) for the RL objective. At inference, they can combine diffusion with a short neural MCMC procedure – the length of this Markov chain is another parameter trading off thoroughness against time. |
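The self-normalized importance sampling idea in this row can be sketched at the sample level (the paper applies it to full diffusion trajectories); all quantities below, including the uniform stand-in proposal, are toy assumptions.

```python
# Hedged sketch: self-normalized IS estimate of E_p[energy] for the Boltzmann
# target p proportional to exp(-beta * energy), given model samples and log-probs.
import numpy as np

def snis_estimate(samples, log_q, energy, beta=1.0):
    """samples: (k, n) model draws; log_q: (k,) model log-probs; energy: callable."""
    energies = np.array([energy(x) for x in samples])
    log_w = -beta * energies - log_q              # log of unnormalized weights p*/q
    w = np.exp(log_w - log_w.max())               # subtract max for numerical stability
    w /= w.sum()                                  # self-normalization step
    ess = 1.0 / np.sum(w ** 2)                    # effective sample size diagnostic
    return float(w @ energies), float(ess)

rng = np.random.default_rng(0)
xs = rng.integers(0, 2, size=(256, 6))            # stand-in binary samples
log_q = np.full(256, 6 * np.log(0.5))             # uniform stand-in proposal
print(snis_estimate(xs, log_q, energy=lambda x: -x.sum()))
```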
“Learning to Branch” in MIP (Khalil et al. 2016) | No graph neural net – uses hand-crafted features for each candidate branching variable. At a node, a 72-dimensional feature vector is computed per variable (e.g. objective coefficient, number of constraints involved, pseudocost history, fractional LP value, etc.), including static problem features and dynamic solver state features. Second-order feature interactions are added (effectively using a quadratic kernel in feature space) | Outputs a ranking or scoring of variables to branch on. A linear model (or boosted model) uses the features to predict a score for each candidate, and the variable with highest score is chosen for branching. In practice they produce a two-level ranking (good vs bad) mimicking strong branching’s ranking | Exact (solver-level). As with other learned branching, the B&B search remains exact. This method yields a different search order that on average explores far fewer nodes than default, but it does not sacrifice optimality – the solver eventually proves the optimum if given time | Supervised (rank learning) on-the-fly. They observe strong branching decisions during the B&B process and learn a ranking model live on that data. Specifically, as the solver explores a few nodes with full strong branching, those decisions (good vs bad branch choice outcomes) form training examples. A learning-to-rank algorithm (SVM-rank with a polynomial kernel) is then trained to fit a surrogate scoring function. This learned model is thereafter used at subsequent tree nodes. The training happens per instance, rather than offline on a dataset | No. Each MIP instance is handled independently – the model is trained from scratch for that instance during its solve. There is no transfer of the learned model to other instances (no global model) | Evaluated on a broad set of MIPLIB benchmark instances. The method adapts to each instance, so generalization in the usual sense (one model to unseen instances) is not applicable – instead they show the approach robustly improves branching performance across many instances of varied types by learning anew for each. In experiments, the average tree size reduction was significant compared to human-designed heuristics, indicating the method’s general applicability to diverse MIPs | The approach is inspired by learning to rank in information retrieval. The theoretical justification is that a linear model on a carefully chosen feature map (including quadratic interactions) can approximate the strong branching scoring rule. Indeed, they note the final feature mapping corresponds to a degree-2 polynomial kernel, which in theory can capture pairwise interactions between branching metrics that might correlate with downstream node count reduction. This yields a simple, explainable model that nonetheless matches a complex expert strategy | Used the standard MIPLIB2010 “Benchmark” set of 87 diverse MILP instances for evaluation. No separate training set – each instance’s solve includes its own training phase (using a portion of the tree where strong branching is applied to collect data). Thus, the “training data” are strong branch outcomes on a subset of nodes within the instance’s search | Key hyperparameters include the number of initial nodes to sample with strong branching (they used a small cutoff so that training is fast) and the choice of ranking algorithm. They used a pairwise rank SVM with a polynomial kernel of degree 2 (implemented by explicit feature quadratic terms). Feature normalization was applied so that features of different scales (counts, fractional values) are comparable. The learned linear model’s weight vector effectively constitutes the learned parameters (no deep nets here), and they found that even training on as few as ~50 branch decisions was enough to outperform default heuristics. |
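A sketch of the explicit degree-2 feature map and one pairwise ranking term of the kind this row describes; the three raw features and the hinge form are illustrative assumptions, not Khalil et al.'s exact 72-feature setup or the SVM-rank implementation.

```python
# Hedged sketch: realize the quadratic kernel as explicit pairwise products,
# then score one "good vs bad" candidate pair with a pairwise hinge term.
import numpy as np

def quadratic_map(phi: np.ndarray) -> np.ndarray:
    """Append all pairwise products phi_i * phi_j (i <= j) to the raw features."""
    i, j = np.triu_indices(len(phi))
    return np.concatenate([phi, phi[i] * phi[j]])

def pairwise_hinge(w, phi_good, phi_bad, margin=1.0):
    """Zero loss once the 'good' branching candidate outscores the 'bad' one."""
    return max(0.0, margin - w @ (quadratic_map(phi_good) - quadratic_map(phi_bad)))

phi_a, phi_b = np.array([0.5, 2.0, 1.0]), np.array([0.1, 1.0, 3.0])
w = np.zeros(len(quadratic_map(phi_a)))
print(pairwise_hinge(w, phi_a, phi_b))   # untrained model: full margin violated
```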
Ising Formulations of NP Problems (Lucas 2014) | Each decision problem is mapped to an equivalent Ising spin system. That is, introduce a binary spin variable for each boolean decision (e.g. each vertex in a vertex cover) and formulate a quadratic energy function such that its minimum (ground state) corresponds to a valid optimal solution. This involves adding penalty terms for constraint violations and linear terms for the objective. The encoding is done manually for each NP problem but follows a common Ising/QUBO template | A solution is obtained by finding the spin configuration that minimizes the Ising Hamiltonian. Decoding is direct: for example, if a spin $s_i=+1$ represents “include element $i$ in the solution”, then the ground-state spins provide the chosen subset, ordering, etc. of the original problem. In other words, once the Ising model is solved (e.g. by an annealer or exhaustive search for small cases), the assignment of 0/1 to each variable yields the problem’s solution | Exact (in theory). The Ising formulation is constructed so that global minimum energy = optimal solution of the NP problem. In practice, heuristic solvers (quantum or digital annealers) are used to find low-energy states, which may or may not be optimal, but any ground state found is an optimal solution. The mapping itself introduces no approximation | Not a learned model – no training objective. This is a hand-crafted reduction from NP problems to an Ising model. (In use, one might “train” an annealer by adjusting parameters, but that’s outside the formulation scope.) The goal is to enable unsupervised optimization by physics-inspired solvers rather than to learn parameters | N/A. No machine learning involved – thus no transfer learning | The formulations are general and can be applied to any instance of the given NP problem. Lucas derives them analytically for a broad catalog of problems; in practice they apply up to the size limits of whatever Ising solver is used. There is no notion of generalization error since no model is trained – only correctness of the formulation. The key validation is showing that many different NP-complete problems (traveling salesman, graph coloring, etc.) can all be encoded in the same Ising format | The motivation is theoretical computer science: since finding the ground state of an Ising model is NP-hard, any NP problem can be reduced to it. The paper provides explicit reductions for many problems, thereby establishing a common “hardware-amenable” representation. This unified view lets one apply quantum annealing or other Ising solvers as a general approach to NP-hard problems. The architecture (Ising spins + quadratic couplers) is chosen because it’s the native language of quantum annealers and has known computational complexity ties to NP-hardness | No data needed – just problem definitions. The paper covers a suite of problems (covering, partitioning, scheduling, graph problems) and for each, constructs the Ising/QUBO formulation analytically. Each formulation’s correctness is argued by showing that its ground states coincide with optimal solutions. If implementing, one would generate QUBO coefficients from the given formulas for any specific instance (the “data” generation is trivial conversion of instance parameters into the Ising coefficients) | Important “hyperparameters” in this context are the penalty weights chosen in the Ising energy to enforce constraints. The formulation must set these large enough that any constraint violation increases energy more than the optimal objective difference. Selecting these weights is critical (too low and the ground state might violate a constraint, too high and it dominates optimization). The paper provides guidance for each problem on suitable weight magnitudes. Otherwise, there are no training hyperparameters – just the size of the Ising model (number of spins equals number of decision variables, which grows with problem size). |
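To illustrate the penalty-weight construction, here is a brute-force check of a QUBO of the template this row describes, instantiated for minimum vertex cover: H = A * sum over edges (u,v) of (1 - x_u)(1 - x_v) + B * sum over vertices of x_v, with A > B so that violating an edge always costs more than adding a vertex. The helper name and toy graph are illustrative.

```python
# Hedged sketch: penalty-based QUBO for minimum vertex cover, solved by
# brute force on a tiny instance to verify the ground state is the optimum.
from itertools import product

def ising_vertex_cover(n, edges, A=2.0, B=1.0):
    best_x, best_e = None, float("inf")
    for x in product((0, 1), repeat=n):
        penalty = sum((1 - x[u]) * (1 - x[v]) for u, v in edges)  # uncovered edges
        energy = A * penalty + B * sum(x)                          # A > B required
        if energy < best_e:
            best_x, best_e = x, energy
    return best_x, best_e

# Path graph 0-1-2: the optimal cover is {1}, so the ground state is (0, 1, 0).
print(ising_vertex_cover(3, [(0, 1), (1, 2)]))
```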
Generic Representation for Comb. Problems (Boisvert et al. 2024) | Unified graph encoding: Any CSP/MIP is parsed into a bipartite graph of variables and constraints, with fine-grained structure for each constraint. Each constraint is broken into an Abstract Syntax Tree of operations (e.g. sum, ≤, etc.), and this AST is represented as a small connected subgraph. Nodes represent variables, constants, and operation symbols, and edges represent the relationships (e.g. a variable appearing in an operation). The entire problem is thus one heterogeneous graph capturing variables and all constraints | A problem-independent GNN processes this graph to produce an output. In principle, the output can be anything – in experiments they feed the final variable node embeddings into a decoder that yields either a value assignment for each variable or a prediction relevant to the problem’s solution. For example, in graph coloring, the GNN could classify each variable node with a color index. The key is that the same architecture/decoder template is used for all problems, differing only in the interpretation of the output | Approximate. The learned model serves as a heuristic/solver for the given problems – it does not guarantee satisfying all constraints or optimality. In tests on four problem types, its solution quality or feasibility was comparable to specialized neural solvers, indicating it finds reasonably good solutions but not necessarily optimal | Supervised learning. They train the GNN on generated instances of each problem with labels from known solutions. For example, for a satisfaction problem, the label could be a satisfying assignment (and training loss measures constraint violations of the model’s predicted assignment). For an optimization problem, they likely train to predict the optimal solution’s variables (treated as classification labels or via a regression of objective value plus a feasibility loss). The important aspect is that the same network architecture is trained for each task, rather than designing a new one per task | Not in this work. While the aim is a “fully generic” representation, the trained model is still specific to a problem type (they train one model per problem in experiments). They demonstrate that the architecture can be reused without changes, but they did not perform transfer learning of weights from one problem to another (no joint multi-problem training) | Demonstrated on four different combinatorial problems (from the XCSP3 competition). The model achieved similar accuracy or solution quality to problem-tailored GNNs on each, showing that it didn’t lose performance despite generality. It generalizes to new instances of the same problem drawn from the same distribution (and in some cases, to larger instance sizes than seen in training) with performance close to specialized models. The general graph encoding did not hinder generalization within a problem, and provides a promising ability to apply the same model design across domains | The approach is motivated by the desire for a single neural architecture that can handle arbitrary constraints – akin to how a SAT solver can take any CNF. By using an AST-based graph, they ensure any constraint logic can be encoded (this is analogous to how compilers break expressions into syntax trees). The GNN architecture is then a generic message-passer over this graph. The authors introduce a custom GNN that respects the bipartite variable-constraint structure and the tree structure of each constraint, effectively performing a form of generalized constraint propagation in a learnable way. This offers a theoretically universal representation: any CSP can be encoded in this graph format, so in principle one neural model design can cover all, addressing the lack of generality in prior graph encodings | Instances are provided in the standard XCSP3 format (a CSP/MIP specification language). Their tool automatically converts the input file into the graph representation. They test on four problem classes from the 2023 XCSP3 competition (each with a set of instances). For each class, they likely use a train/validation/test split of instances. The graph conversion handles all constraint types featured in the competition (e.g. AllDifferent, linear sums, etc.), meaning the method was tested on a diverse set of constraints to ensure generality | The graph neural network architecture must handle heterogeneous node types (variable nodes vs. operation nodes). They design a GNN with specialized message functions for different edge types (e.g. variable→operator vs operator→variable). Number of GNN layers (~3 message-passing rounds) and latent dimension (maybe ~128) were chosen to be sufficient for all tested problems, and kept fixed to demonstrate generality. No problem-specific tuning of architecture was done – only training on each problem’s data. This indicates hyperparameters like layer count, hidden size, etc., were set at some default that worked across all four tasks, highlighting the “one-size-fits-all” intention. (The exact values aren’t given in the abstract, but the emphasis is that the same hyperparameter set was used for all problems.) |
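A toy sketch of the AST-to-graph conversion this row describes, using Python's `ast` module as a stand-in for the paper's XCSP3 parser; the node/edge schema below is an illustrative assumption, not the paper's exact encoding.

```python
# Hedged sketch: turn a constraint expression into (nodes, edges), with one
# node per variable/constant/operator and one edge per operator-operand link.
import ast

def constraint_to_graph(expr: str):
    """Return (nodes, edges) for a constraint like '2*x + 3*y <= 7'."""
    nodes, edges = [], []
    def add(kind, label):
        nodes.append((kind, label))
        return len(nodes) - 1
    def visit(node):
        if isinstance(node, ast.Name):
            return add("var", node.id)
        if isinstance(node, ast.Constant):
            return add("const", node.value)
        if isinstance(node, ast.BinOp):                    # arithmetic: +, *, ...
            op = add("op", type(node.op).__name__)
            edges.extend([(op, visit(node.left)), (op, visit(node.right))])
            return op
        if isinstance(node, ast.Compare):                  # relational: <=, ==, ...
            op = add("op", type(node.ops[0]).__name__)
            edges.extend([(op, visit(node.left)), (op, visit(node.comparators[0]))])
            return op
        raise ValueError(f"unsupported node: {ast.dump(node)}")
    visit(ast.parse(expr, mode="eval").body)
    return nodes, edges

print(constraint_to_graph("2*x + 3*y <= 7"))
```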
Foundation Models for MILP (Li et al. 2025) | MILPs are represented in a uniform graph format: a bipartite graph with variable nodes and constraint nodes (similar to branch-and-bound learning methods). Variable node features include objective coefficient, bounds, etc., and constraint node features capture coefficients and RHS. An edge links a variable to a constraint if it appears, with weight equal to its coefficient. Additionally, for the natural language alignment task, each MILP instance is paired with a textual description – the model processes that via a language encoder (e.g. a transformer) in parallel | The foundation model is multi-task, so outputs depend on the task: (1) For integrality gap prediction, it outputs a continuous value (regression) for the predicted LP vs. IP gap. (2) For branching, it outputs a branching policy – e.g. a score for each variable (the variable with highest score is chosen). (3) For MILP–NL alignment, it produces embeddings for the MILP and for the candidate natural language description and outputs a similarity score or classification indicating if they match. All these outputs stem from one shared model backbone applied to the MILP graph (and text, for task 3) | N/A (predictive tasks). The model is not directly finding an optimal solution; it is either predicting properties (gaps), making heuristic decisions (branching choices), or doing semantic matching. Thus there’s no “optimal vs approximate” in the usual sense. Its branching decisions aim to improve solver efficiency (not guaranteed optimal nodes but statistically good), and gap predictions are real-valued estimates | Multi-task supervised learning. Adopts a foundation model training paradigm: train a single model on a broad distribution of MILPs with multiple learning tasks. They generate a large synthetic dataset (MILP-Evolve) and supervise the model on: (a) integrality gaps computed by solvers, (b) expert branching decisions (from strong branching) for many instances, and (c) known MILP-description pairs. The model is optimized on all these objectives, either sequentially or jointly, to learn a general representation of MILPs. No reinforcement learning is used; labels are obtained from traditional solvers or the data generation process | Yes. This is the core idea – a single model that generalizes across different MILP problem classes. By training on a diverse set of MILPs (covering many structures), the model learns transferable knowledge. It is then tested zero-shot or fine-tuned on unseen MILP classes (e.g. benchmark MIPLIB instances) and yields strong results. This cross-problem generalization is analogous to how an NLP foundation model transfers to new language tasks | Evaluated on completely unseen MILP types, including the standard MIPLIB benchmarks. The model trained on MILP-Evolve’s synthetic variety achieved significant improvements on those unseen problems compared to models trained on narrow distributions. It was also tested on instances described in natural language, aligning them correctly. Essentially, the foundation model exhibits broad generalization: it handles MILPs from classes it never saw in training (a notable first in MILP learning) | Inspired by the success of foundation models in NLP/vision, the idea is that MILP solving can benefit from a pretrained universal model. The architecture (a GNN or possibly a graph-transformer) is chosen to be flexible enough for any MILP. The use of an LLM to generate diverse MILPs is a key innovation – theoretically, training on an “unlimited” distribution of MILPs should capture generic features (e.g. structures that make a MILP hard). They also introduce a multimodal aspect (text and MILP), anticipating that a truly generic model could understand MILP formulations in human terms. Overall, the motivation is to capture the common structure underlying all MILPs (variables, linear constraints) in one model, something previous task-specific models couldn’t do | A new dataset MILP-Evolve is created using an LLM-driven evolutionary algorithm. Starting from a few seed MILP classes, the LLM iteratively mutates and recombines problem descriptions to generate a large and diverse set of MILP classes, each with unlimited random instances. This provides training data covering a wide range of structures beyond existing libraries. The model is then evaluated on real MILP instances from MIPLIB to demonstrate real-world applicability. (For NL alignment, the LLM also provides a plain-language description for each generated MILP, giving ground-truth pairings) | Uses a graph neural network backbone to encode MILPs, with on the order of millions of parameters (sufficient to be “foundation”-scale for the relatively small graph sizes). The same network is used for all tasks, with task-specific output heads. Key hyperparameters include the number of synthetic MILP classes (~hundreds) and instances (virtually unlimited) used in training – diversity and volume here are as important as model size. The model is trained on a combination of objectives, so weighting between tasks is a hyperparameter (ensuring it learns general MILP features without neglecting any task). Additionally, when aligning with NL, a text encoder (possibly a transformer) is jointly trained or pretrained – its dimension must match the MILP encoder’s output for comparison. The training routine likely cycles through tasks or uses a multitask loss. Finally, fine-tuning on specific tasks (like branching on MIPLIB) has its own hyperparameters (learning rate, etc.), but the model showed strong zero-shot performance even without extensive fine-tuning. |
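A hedged sketch of the shared-backbone, multi-head layout this row describes; the MLP backbone, dimensions, and mean-pooled graph embedding are simplifying assumptions for illustration, not the paper's architecture.

```python
# Hedged sketch: one shared encoder, three task-specific heads (gap regression,
# per-variable branching scores, and an embedding for NL alignment).
import torch
import torch.nn as nn

class MILPFoundation(nn.Module):
    def __init__(self, d_in=16, d=128):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(d_in, d), nn.ReLU(), nn.Linear(d, d))
        self.gap_head = nn.Linear(d, 1)        # (a) integrality-gap regression
        self.branch_head = nn.Linear(d, 1)     # (b) per-variable branching score
        self.align_head = nn.Linear(d, d)      # (c) embedding for NL matching

    def forward(self, var_feats, text_emb):
        h = self.backbone(var_feats)                     # (n_vars, d) variable states
        g = h.mean(dim=0)                                # pooled instance embedding
        gap = self.gap_head(g)
        branch_scores = self.branch_head(h).squeeze(-1)  # one score per variable
        sim = torch.cosine_similarity(self.align_head(g), text_emb, dim=0)
        return gap, branch_scores, sim

model = MILPFoundation()
gap, scores, sim = model(torch.randn(10, 16), torch.randn(128))
print(gap.shape, scores.shape, sim.item())
```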