Kantorovich Initiative Event: A small workshop in Optimal Transport
Details
This event will take place on Friday, July 18th, from 9:00 am to 6:00 pm, in UBC Math 203.
Schedule
- 09:00am - 10:00am: Marco Cuturi
- 10:00am - 10:30am: coffee break
- 10:30am - 11:00am: Andrew Warren
- 11:00am - 11:30am: Rentian Yao
- 11:30am - 12:00pm: Jakwang Kim
- 12:00pm - 01:00pm: Lunch (sandwich)
- 01:00pm - 01:30pm: Sibylle Marcotte
- 01:30pm - 02:00pm: Clément Bonet
- 02:00pm - 02:30pm: coffee break
- 02:30pm - 03:00pm: Sharvaj Kubal
- 03:00pm - 03:30pm: Christophe Vauthier
- 03:30pm - 04:00pm: Wanxin Li
- 04:00pm - 04:30pm: Forest Kobayashi
- 04:30pm - 06:00pm: discussion
- 06:00pm - 08:00pm: dinner
Speakers
Clément Bonet (ENSAE-Paris)
Flowing Datasets with Wasserstein over Wasserstein Gradient Flows
Many applications in machine learning involve data represented as probability distributions. The emergence of such data requires radically novel techniques to design tractable gradient flows on probability distributions over these (infinite-dimensional) objects. For instance, being able to flow labeled datasets is a core task for applications ranging from domain adaptation to transfer learning or dataset distillation. In this setting, we propose to represent each class by the associated conditional distribution of features, and to model the dataset as a mixture distribution supported on these classes (which are themselves probability distributions), meaning that labeled datasets can be seen as probability distributions over probability distributions. We endow this space with a metric structure from optimal transport, namely the Wasserstein over Wasserstein (WoW) distance, derive a differential structure on this space, and define WoW gradient flows. The latter enable the design of dynamics on this space that decrease a given objective functional. We apply our framework to transfer learning and dataset distillation tasks, leveraging our gradient flow construction as well as novel tractable functionals that take the form of Maximum Mean Discrepancies with Sliced-Wasserstein-based kernels between probability distributions.
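As a rough illustration of the construction (schematic notation of my own, not taken verbatim from the talk; the uniform class weights, $\mu_c$, $\nu$, and $\sigma$ are illustrative choices), a labeled dataset with classes $c = 1, \dots, C$ can be encoded and flowed as
$$
\mu = \frac{1}{C} \sum_{c=1}^{C} \delta_{\mu_c}, \qquad \mathcal{F}(\mu) = \mathrm{MMD}_k^2(\mu, \nu), \qquad k(\alpha, \beta) = \exp\!\left(-\frac{\mathrm{SW}_2^2(\alpha, \beta)}{2\sigma^2}\right),
$$
where $\mu_c$ is the conditional distribution of features for class $c$, $\nu$ is a target dataset encoded in the same way, and the WoW gradient flow of $\mathcal{F}$ moves the class-conditional distributions so as to decrease the objective.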
Marco Cuturi (Apple)
Optimal Transport and Generative Modeling
After a brief introduction to computational OT, I will present in this talk a few recent works highlighting the relevance of OT solvers to generative problems. I will start with a recent result on a factorization of probability measures that builds upon the idea of moment measures introduced by Cordero-Erausquin and Klartag in 2015. I will show how this idea can be modified to better suit the generative paradigm, leading to conjugate moment measures. Next, I will show how simple univariate OT maps on activations can help control the generation of LLMs and text-to-image models, through a fine-tuning procedure that only uses a handful of intervention examples. Finally, I will investigate to what extent the use of large coupling solvers can improve the training of flow matching models, by pushing the applicability of Sinkhorn solvers to millions of examples using multi-node sharding.
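For readers new to computational OT, a minimal log-domain Sinkhorn iteration between two discrete measures looks roughly as follows. This is a generic textbook-style sketch (the histogram sizes, cost matrix, regularization eps, and iteration count are arbitrary choices), not the large-scale multi-node solver discussed in the talk.

    import numpy as np
    from scipy.special import logsumexp

    def sinkhorn_log(a, b, C, eps=0.05, n_iters=200):
        # Entropy-regularized OT between histograms a (n,) and b (m,)
        # with cost matrix C (n, m); log-domain updates for stability.
        f = np.zeros_like(a)  # dual potential for a
        g = np.zeros_like(b)  # dual potential for b
        for _ in range(n_iters):
            f = eps * (np.log(a) - logsumexp((g[None, :] - C) / eps, axis=1))
            g = eps * (np.log(b) - logsumexp((f[:, None] - C) / eps, axis=0))
        # Optimal coupling P_ij = exp((f_i + g_j - C_ij) / eps)
        return np.exp((f[:, None] + g[None, :] - C) / eps)

    # Toy usage: two random histograms on a 1D grid with squared-distance cost.
    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 50)
    a = rng.random(50); a /= a.sum()
    b = rng.random(50); b /= b.sum()
    P = sinkhorn_log(a, b, (x[:, None] - x[None, :]) ** 2)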
Jakwang Kim (UBC)
Adversarial training model via optimal transport
Despite the revolution in AI brought about by neural networks, much of their behavior is still not clearly understood. In particular, one of the most fundamental questions about neural networks is robustness. It has been observed that neural networks are highly sensitive to small perturbations, which is a major obstacle to deploying AI in real-world applications. Adversarial training is one way to surmount this issue: by introducing an adversary, classifiers are trained to be more robust against small perturbations. Despite its practical importance, however, its analytic structure has remained unclear because of its minimax form. In this talk, I will present recent progress on adversarial training via optimal transport. The adversarial training problem is equivalent to a multimarginal optimal transport (MOT) problem, and its minimax nature translates naturally into the MOT formulation: the primal corresponds to the adversary, and the dual to the learner. From this connection, I will also discuss how the well-definedness of the model is derived, as well as an interesting c-conjugacy relation between the dual potential and the optimal robust classifier. Based on joint work with Nicolás García Trillos, Matt Jacobs, and Matt Werenski.
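As background, one standard distributionally robust way to write the adversarial training problem (generic notation, not necessarily the exact formulation of the talk; $\ell$ is a loss, $\mathcal{F}$ a hypothesis class, and $\varepsilon$ the adversary's budget) is
$$
\inf_{f \in \mathcal{F}} \; \sup_{\tilde{\mu} \,:\, W_\infty(\tilde{\mu}, \mu) \le \varepsilon} \; \mathbb{E}_{(x, y) \sim \tilde{\mu}} \big[ \ell(f(x), y) \big],
$$
where $\mu$ is the data distribution; the talk explains how such minimax problems are recast as multimarginal optimal transport, with the primal side played by the adversary and the dual side by the learner.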
Forest Kobayashi (UBC)
How to approximate a blob with a curve
The following is joint work with Jonathan Hayase (UW) and Young-Heon Kim (UBC). Given an $n$-dimensional measure $\mu$, how can we "most efficiently" approximate $\mu$ via an $m$-dimensional set $\Sigma$? In this talk we discuss one possible formalization of this question in which one seeks an "optimal" balance between an OT cost (representing "goodness-of-fit") and the size of a certain Sobolev norm (loosely, a "regularization" term that keeps $\Sigma$ simple). Here, we have good news and bad news. The bad: finding precise solutions to this problem appears fundamentally difficult, since we are able to recover an NP-hard problem in a simple limiting case. The good: by (loosely speaking) approximating a certain "constrained" gradient flow, we obtain a surprisingly effective local algorithm. We give the high-level ideas of our method and, time permitting, discuss a novel interpretation of how regularization can improve training in certain generative learning problems.
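One schematic way to write an objective of the kind described above, for the curve case $m = 1$ (this is a hedged sketch of the general shape only, not the exact functional of the talk; $\lambda$ and $s$ are illustrative), is
$$
\min_{\gamma} \; \Big( \min_{\nu \in \mathcal{P}(\gamma([0,1]))} W_2^2(\mu, \nu) \Big) + \lambda \, \|\gamma\|_{H^s}^2,
$$
where the first term is the OT goodness-of-fit of the best measure supported on $\Sigma = \gamma([0,1])$ and the Sobolev penalty keeps the parametrization $\gamma$ simple.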
Sharvaj Kubal (UBC)
Optimal sequencing depth for single-cell RNA-sequencing in Wasserstein space
How many samples should one collect for an empirical distribution to be as close as possible to the true population? This question is not trivial in the context of single-cell RNA-sequencing. With limited sequencing depth, profiling more cells comes at the cost of fewer reads per cell. Therefore, one must strike a balance between the number of cells sampled and the accuracy of each measured gene expression profile. Here, we analyze an empirical distribution of cells and obtain upper and lower bounds on the Wasserstein distance to the true population. Our analysis holds for general, non-parametric distributions of cells, and is validated by simulation experiments on a real single-cell dataset.
(Joint work with Jakwang Kim and Geoff Schiebinger)
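A toy numerical illustration of the underlying sampling question (how fast an empirical distribution approaches the population in Wasserstein distance), not of the single-cell model itself; the 1D Gaussian population and sample sizes below are arbitrary choices.

    import numpy as np
    from scipy.stats import wasserstein_distance

    rng = np.random.default_rng(0)
    population = rng.normal(size=200_000)  # stand-in for the "true" population

    for n_cells in (100, 1_000, 10_000):
        sample = rng.choice(population, size=n_cells, replace=False)
        # 1D W1 distance between the empirical sample and the population
        print(n_cells, wasserstein_distance(sample, population))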
Wanxin Li (UBC)
Wasserstein projection distance for fairness testing of regression models
Fairness in machine learning is a critical concern, yet most research has focused on classification tasks, leaving regression models under-explored. This talk introduces a Wasserstein projection-based framework for fairness testing in regression models, focusing on expectation-based criteria. We propose a hypothesis-testing approach and an optimal data perturbation method to improve fairness while balancing accuracy. Theoretical results include computable test statistics and asymptotic bounds. Experiments on synthetic and real-world datasets demonstrate the method’s effectiveness in detecting and mitigating biases. Future work aims to extend the framework to broader fairness criteria and improve computational efficiency.
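Schematically (notation of my own, not necessarily the talk's), the Wasserstein projection distance measures how far the joint distribution $\mu$ of the sensitive attribute $A$ and the model output $h(X)$ sits from the set $\mathcal{C}$ of distributions meeting the expectation-based criterion:
$$
\mathcal{D}(\mu) = \inf_{\nu \in \mathcal{C}} W_2(\mu, \nu), \qquad \mathcal{C} = \big\{ \nu : \mathbb{E}_{\nu}[\, h(X) \mid A = 0 \,] = \mathbb{E}_{\nu}[\, h(X) \mid A = 1 \,] \big\},
$$
and the fairness test rejects when a computable statistic built from $\mathcal{D}$ of the empirical distribution exceeds a threshold calibrated by the asymptotic bounds.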
Sibylle Marcotte (ENS-Paris)
Conservation laws for ResNets and Transformers
Understanding the geometric properties of gradient descent dynamics is a key ingredient in deciphering the recent success of very large machine learning models. A striking observation is that trained over-parameterized models retain some properties of their initialization. This “implicit bias” is believed to be responsible for some favorable properties of the trained models and could explain their good generalization. In this talk, I will first introduce the notion of “conservation laws”: quantities that are exactly preserved during the gradient flow of a model (e.g., a ReLU network), regardless of the dataset. Using Lie algebra techniques, we determine the exact number of independent conservation laws, recovering all known laws in linear and ReLU networks and proving there are no others in the shallow case. We also fully determine all conservation laws for single attention layers and two-layer convolutional ReLU networks, and show that residual blocks inherit the same conservation laws as their underlying blocks without skip connections. We then introduce the notion of conservation laws that depend only on a subset of parameters (corresponding, e.g., to a residual block). We demonstrate that the characterization of such laws can be reduced exactly to the analysis of the corresponding building block in isolation. Finally, we examine how these newly discovered conservation principles, initially established in the continuous gradient flow regime, persist under SGD.
Joint work with Gabriel Peyré and Rémi Gribonval. Associated papers: https://arxiv.org/abs/2307.00144 and https://arxiv.org/abs/2506.06194.
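A minimal numerical check of one known conservation law, the classical balancedness quantity for a toy two-layer linear model under (discretized) gradient flow; the model, data, and step size here are illustrative choices of mine, not examples from the talk.

    import numpy as np

    rng = np.random.default_rng(0)
    x, y = rng.normal(size=20), rng.normal(size=20)  # toy 1D regression data
    a, b = 0.7, -1.3                                 # two-layer linear model f(x) = a*b*x

    lr = 1e-3
    for _ in range(20_000):                          # explicit Euler steps on the gradient flow
        r = a * b * x - y                            # residuals of the squared loss
        a, b = a - lr * np.mean(r * b * x), b - lr * np.mean(r * a * x)

    # a**2 - b**2 is exactly conserved by the continuous gradient flow;
    # with small steps it stays close to its initial value 0.7**2 - 1.3**2 = -1.2.
    print(a ** 2 - b ** 2)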
Christophe Vauthier (Université Paris-Saclay)
Critical points of the Sliced-Wasserstein distance
The Sliced-Wasserstein (SW) metric has gained significant interest in the optimal transport and machine learning literature due to its ability to capture intricate geometric properties of probability distributions while remaining computationally tractable, making it a valuable tool for various applications, including generative modeling and domain adaptation. Our study aims to provide a rigorous analysis of the critical points arising from the optimization of the SW objective. By computing explicit perturbations, we establish that stable critical points of SW cannot concentrate on segments. This stability analysis is crucial for understanding the behavior of optimization algorithms for models trained using the SW objective. Furthermore, we investigate the properties of the SW objective, shedding light on the existence and convergence behavior of critical points. We illustrate our theoretical results through numerical experiments.
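For reference, the SW distance between two empirical measures is commonly approximated by Monte Carlo over random projection directions; the sketch below is a generic estimator of this kind (dimension, sample counts, and number of projections are arbitrary), not code from the talk.

    import numpy as np

    def sliced_wasserstein2(X, Y, n_proj=200, seed=0):
        # Monte Carlo estimate of SW_2 between the empirical measures on the rows
        # of X and Y (same number of points), via sorted one-dimensional projections.
        rng = np.random.default_rng(seed)
        total = 0.0
        for _ in range(n_proj):
            theta = rng.normal(size=X.shape[1])
            theta /= np.linalg.norm(theta)            # uniform direction on the sphere
            x_proj, y_proj = np.sort(X @ theta), np.sort(Y @ theta)
            total += np.mean((x_proj - y_proj) ** 2)  # squared W_2 between 1D projections
        return np.sqrt(total / n_proj)

    # Toy usage: two Gaussian point clouds in the plane, one shifted.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 2))
    Y = rng.normal(size=(500, 2)) + np.array([2.0, 0.0])
    print(sliced_wasserstein2(X, Y))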
Andrew Warren (UBC)
Principal curves and principal trees
Given a data distribution which is concentrated around a one-dimensional structure, can we infer that structure? We consider versions of this problem where the distribution resides in a metric space and the 1d structure is assumed to either be the range of an absolutely continuous curve, or a connected set of finite 1d Hausdorff measure. In each of these cases, we relate the inference task to solving a variational problem where there is a tradeoff between data fidelity and simplicity of the inferred structure; the variational problems we consider are closely related to the so-called "principal curve" problem of Hastie and Stuetzle as well as the "average-distance problem" of Buttazzo, Oudet, and Stepanov. For each of the variational problems under consideration, we establish existence of minimizers, stability with respect to the data distribution, and consistency of a discretization scheme which is amenable to Lloyd-type numerical methods.
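For orientation, the penalized form of the average-distance problem alluded to above reads roughly as follows (schematic notation; the talk's precise constraints and exponents may differ):
$$
\min_{\Sigma \ \text{compact, connected}} \; \int \operatorname{dist}(x, \Sigma)^p \, d\mu(x) \; + \; \lambda \, \mathcal{H}^1(\Sigma),
$$
where the first term rewards fidelity to the data distribution $\mu$ and the length penalty $\mathcal{H}^1(\Sigma)$ keeps the inferred structure simple; in the principal-curve variant, $\Sigma$ is instead constrained to be the range of an absolutely continuous curve.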
Rentian Yao (UBC)
Wasserstein Barycenter via Nonconvex-Concave Minimax Optimization
Wasserstein barycenter is a fundamental notion of averaging that extends from the Euclidean space to the Wasserstein space of probability distributions. Computation of the unregularized barycenter for discretized probability distributions on point clouds is a challenging task when the domain dimension $d > 1$. Most practical algorithms for the barycenter problem are based on entropic regularization. In this paper, we introduce a nearly linear time $O(m \log{m})$ and linear space complexity $O(m)$ primal-dual algorithm, the Wasserstein-Descent $\dot{\mathbb{H}}^1$-Ascent (WDHA) algorithm, for computing the exact barycenter when the input probability density functions are discretized on an $m$-point grid. The key success of the WDHA algorithm hinges on alternating between two different yet closely related Wasserstein and Sobolev optimization geometries for the primal barycenter and dual Kantorovich potential subproblems. Under reasonable assumptions, we establish the convergence rate and iteration complexity of WDHA to its stationary point when the step size is appropriately chosen. Superior computational efficacy, scalability, and accuracy over the existing Sinkhorn-type algorithms are demonstrated on high-resolution (e.g., $1024 \times 1024$ images) 2D synthetic and real data.