Regularized gradient descent

I recently attended a talk on solving inverse problems with plug-and-play algorithms. During the talk, the speaker drew a quick and clear path from regularized gradient descent to plug-and-play approaches.

Here it is. When you want to constrain the solutions of an optimization problem, the classical approach is to add a penalty term, the so-called regularization term, to the loss function.

For example, consider a regularized least squares problem with a radial basis function (RBF) representation of the input. In the expression below, $y$ denotes the vector of all the outputs to be predicted and $w$ the amplitudes of the Gaussians of the RBF expansion. Without specifying it further, to fix ideas, $\Phi$ is the function whose output components are the individual RBFs evaluated at the data points $x$.

\[\min_w \|y - w^T \Phi(x)\|^2 + \lambda \|w\|_1\]

This is of the general form \(f(x) + g(x)\), where \(g\) contains the regularization term. Here, the regularization is the L1 norm, which promotes sparsity. With classical gradient descent, you compute the gradient of this compound cost and iterate with whichever algorithm you like (stochastic gradient descent, Adam, etc.).
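To make this concrete, here is a minimal numpy sketch of this setup. The toy data, the Gaussian centers and width, the step size, and the subgradient treatment of the L1 term (which is not differentiable at zero, precisely the issue addressed below) are all my own illustrative assumptions, not anything prescribed by the talk.

```python
import numpy as np

def rbf_features(x, centers, width=0.1):
    """Phi(x): one Gaussian bump per center, evaluated at each data point."""
    # shape (n_centers, n_points)
    return np.exp(-((x[None, :] - centers[:, None]) ** 2) / (2 * width ** 2))

def loss(w, Phi, y, lam):
    """Compound cost f(w) + g(w): squared error plus L1 penalty."""
    residual = y - w @ Phi
    return np.sum(residual ** 2) + lam * np.sum(np.abs(w))

# Toy data, for illustration only.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)
centers = np.linspace(0, 1, 20)
Phi = rbf_features(x, centers)

# Plain (sub)gradient descent on the compound cost.
w = np.zeros(centers.size)
lam, step = 0.1, 1e-3
for _ in range(2000):
    grad_f = -2 * Phi @ (y - w @ Phi)   # gradient of the quadratic term
    grad_g = lam * np.sign(w)           # a subgradient of the L1 term
    w -= step * (grad_f + grad_g)

print(loss(w, Phi, y, lam))
```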

Going a step further, you could alternate one step of minimizing the likelihood (minimizing the MSE can be seen as maximizing the likelihood of the data under a normal distribution whose mean is given by our linear model) and one step of minimizing the constraint. Proximal gradient descent pushes this further still, as it can handle functions \(g(x)\) that are not differentiable.

Proximal gradient descent (Combettes & Pesquet, 2010; Parikh & Boyd, 2014) iteratively updates the estimate of the solution by:

\[x_{k+1} = \text{prox}_\alpha(x_k - \alpha \nabla f(x_k))\]

with

\[\text{prox}_\alpha(z) = \text{argmin}_y (\frac{1}{2\alpha}\|y - z\|^2 + g(y))\]

The operator \(\text{prox}\) is called the proximity operator.
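For the L1 regularizer of the example above, \(g(y) = \lambda \|y\|_1\), the proximity operator has a well-known closed form, componentwise soft-thresholding:

\[\text{prox}_\alpha(z)_i = \text{sign}(z_i)\,\max(|z_i| - \alpha\lambda, 0)\]

A minimal numpy sketch of the resulting proximal gradient iteration (often called ISTA in this setting) could look like the following; the function names, step size, and iteration count are mine, for illustration only.

```python
import numpy as np

def soft_threshold(z, threshold):
    """Closed-form prox of threshold * ||.||_1 (componentwise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)

def proximal_gradient(Phi, y, lam, step, n_iter=500):
    """Minimize ||y - w @ Phi||^2 + lam * ||w||_1 by proximal gradient descent."""
    w = np.zeros(Phi.shape[0])
    for _ in range(n_iter):
        grad_f = -2 * Phi @ (y - w @ Phi)        # gradient step on the smooth part f
        w = soft_threshold(w - step * grad_f,    # prox step on the non-smooth part g
                           step * lam)
    return w

# Reusing Phi and y from the sketch above.
w_ista = proximal_gradient(Phi, y, lam=0.1, step=1e-3)
```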

Evaluating the proximity operator requires solving a minimization problem. In the plug-and-play approach (Venkatakrishnan et al., 2013), you replace that minimization with a denoising step, which can be implemented with a denoising neural network.
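To make the substitution concrete, here is the same loop with the prox call swapped for a generic `denoise` callable. This is only a rough sketch of the idea, not the exact algorithm from the references: in actual plug-and-play reconstruction the denoiser acts on the image estimate itself, whereas here I simply reuse the toy setup from above.

```python
import numpy as np

def plug_and_play(Phi, y, denoise, step, n_iter=500):
    """Same iteration as above, but the prox of g is replaced by a denoiser.

    `denoise` is a placeholder for any off-the-shelf denoiser (e.g. a trained
    denoising network); everything else is illustrative only.
    """
    w = np.zeros(Phi.shape[0])
    for _ in range(n_iter):
        grad_f = -2 * Phi @ (y - w @ Phi)   # gradient step on the data-fit term
        w = denoise(w - step * grad_f)      # the denoiser plays the role of the prox
    return w
```

Plugging the soft-thresholding from the previous sketch back in as `denoise` recovers plain proximal gradient descent; plugging in a learned denoiser gives the plug-and-play variant.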

In his work, Samuel Hurault (recipient of a best PhD award) was interested, among other things, in recovering the convergence guarantees of plug-and-play algorithms, which are lost if you blindly implement the regularization with a denoising network.

References

  1. Proximal Splitting Methods in Signal Processing
    Patrick L. Combettes and Jean-Christophe Pesquet
    2010
  2. Proximal Algorithms
    Neal Parikh and Stephen Boyd
    Foundations and Trends in Optimization, 2014
  3. Plug-and-Play Priors for Model Based Reconstruction
    Singanallur V. Venkatakrishnan, Charles A. Bouman, and Brendt Wohlberg
    In 2013 IEEE Global Conference on Signal and Information Processing, 2013


