Proximal Algorithms
Proximal operator
- proximal operator of $f : \mathbf{R}^n \to \mathbf{R} \cup \{+\infty\}$ is
  $$\operatorname{prox}_{\lambda f}(v) = \operatorname*{argmin}_x \left( f(x) + \tfrac{1}{2\lambda}\|x - v\|_2^2 \right)$$
  with parameter $\lambda > 0$
- $f$ may be nonsmooth, have embedded constraints, ...
- evaluating $\operatorname{prox}_{\lambda f}(v)$ involves solving a convex optimization problem
- can evaluate via standard methods like BFGS, but very often has an analytical solution or simple specialized linear-time algorithm
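A minimal sketch (not from the slides) of evaluating a prox numerically with a standard smooth solver; the smooth function $f$ below is an arbitrary illustrative choice, and NumPy/SciPy are assumed available.

```python
# Minimal sketch: evaluating prox_{lam f}(v) numerically with L-BFGS,
# for an illustrative smooth f (not from the slides).
import numpy as np
from scipy.optimize import minimize

def prox_numeric(f, v, lam=1.0):
    """Evaluate prox_{lam f}(v) = argmin_x f(x) + (1/(2*lam))||x - v||^2 via L-BFGS."""
    obj = lambda x: f(x) + np.sum((x - v) ** 2) / (2 * lam)
    res = minimize(obj, x0=v, method="L-BFGS-B")
    return res.x

# illustrative smooth convex f (my choice): f(x) = sum_i log(1 + exp(x_i))
f = lambda x: np.sum(np.log1p(np.exp(x)))
v = np.array([2.0, -1.0, 0.5])
print(prox_numeric(f, v, lam=0.5))
```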
Generalized projection
- proximal operator of the indicator function $I_C$ of a convex set $C$ is projection:
  $$\operatorname{prox}_{\lambda I_C}(v) = \operatorname*{argmin}_{x \in C} \|x - v\|_2 = \Pi_C(v)$$
  many properties carry over
- if $f(x) = (1/2)x^T A x + b^T x + c$ with $A \succeq 0$, then
  $$\operatorname{prox}_{\lambda f}(v) = (I + \lambda A)^{-1}(v - \lambda b)$$
- if $f$ is block separable, so $f(x) = \sum_i f_i(x_i)$, then
  $$(\operatorname{prox}_{\lambda f}(v))_i = \operatorname{prox}_{\lambda f_i}(v_i)$$
- example: if $f = \|\cdot\|_1$, then
  $$(\operatorname{prox}_{\lambda f}(v))_i = (v_i - \lambda)_+ - (-v_i - \lambda)_+$$
  a simple elementwise operation called soft thresholding (sketched below)
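A minimal sketch of the soft thresholding operation above, assuming NumPy.

```python
# Minimal sketch: elementwise soft thresholding, the prox of lam*||.||_1.
import numpy as np

def soft_threshold(v, lam):
    """prox_{lam ||.||_1}(v), applied elementwise."""
    return np.maximum(v - lam, 0.0) - np.maximum(-v - lam, 0.0)

print(soft_threshold(np.array([3.0, -0.2, 0.7]), 0.5))  # -> [2.5, 0., 0.2]
```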
Fixed points
- the point $x^\star$ minimizes $f$ if and only if it is a fixed point of the proximal operator:
  $$x^\star = \operatorname{prox}_{\lambda f}(x^\star)$$
- provides a link between proximal operators and fixed point theory
- many proximal algorithms can be viewed as methods for finding fixed points of appropriate operators
Moreau-Yosida regularization
- infimal convolution of $f$ and $g$, denoted $f \,\square\, g$, is defined as
  $$(f \,\square\, g)(v) = \inf_x \left( f(x) + g(v - x) \right)$$
- Moreau envelope or Moreau-Yosida regularization of $f$ is
  $$M_{\lambda f}(v) = \inf_x \left( f(x) + \tfrac{1}{2\lambda}\|x - v\|_2^2 \right)$$
- a smoothed or regularized form of $f$:
- always has full domain
- always continuously differentiable
- has the same minimizers as $f$
- can minimize $M_{\lambda f}$ instead of $f$, though $M_{\lambda f}$ could be hard to evaluate
- motivation: can show that
  $$M_{\lambda f} = \left( f^* + \tfrac{\lambda}{2}\|\cdot\|_2^2 \right)^*$$
- i.e., the Moreau envelope obtains a smooth approximation via:
- taking the conjugate
- regularizing to get a strongly convex function
- taking the conjugate again
- example: Moreau envelope of $|\cdot|$ is the Huber function
  $$M_{\lambda |\cdot|}(v) = \begin{cases} v^2/(2\lambda) & |v| \le \lambda \\ |v| - \lambda/2 & |v| > \lambda \end{cases}$$
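A small numerical check of the Huber example above; it uses the fact that $M_{\lambda f}(v) = f(\operatorname{prox}_{\lambda f}(v)) + \tfrac{1}{2\lambda}\|\operatorname{prox}_{\lambda f}(v) - v\|_2^2$, with the prox of $|\cdot|$ being soft thresholding. The value of $\lambda$ is an arbitrary illustrative choice.

```python
# Minimal sketch: check numerically that the Moreau envelope of f = |.|
# matches the Huber function above (scalar case, arbitrary lam).
import numpy as np

lam = 0.7
v = np.linspace(-3, 3, 13)

# prox of lam*|.| is soft thresholding; envelope = f(prox) + (1/(2 lam))(prox - v)^2
prox = np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)
envelope = np.abs(prox) + (prox - v) ** 2 / (2 * lam)

huber = np.where(np.abs(v) <= lam, v ** 2 / (2 * lam), np.abs(v) - lam / 2)
print(np.allclose(envelope, huber))  # True
```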
Modified gradient step
- many relationships between proximal operators and gradient steps
- proximal operator is a gradient step for the Moreau envelope:
  $$\operatorname{prox}_{\lambda f}(x) = x - \lambda \nabla M_{\lambda f}(x)$$
- for small $\lambda$, $\operatorname{prox}_{\lambda f}(x)$ converges to a gradient step in $f$:
  $$\operatorname{prox}_{\lambda f}(x) \approx x - \lambda \nabla f(x)$$
- parameter $\lambda$ can be interpreted as a step size, though proximal methods will generally work even for large step sizes, unlike the gradient method
Resolvent of subdifferential operator
- if $z = \operatorname{prox}_{\lambda f}(x)$, then
  $$0 \in \partial f(z) + \tfrac{1}{\lambda}(z - x)$$
- can rewrite as
  $$z = (I + \lambda \partial f)^{-1}(x)$$
- must be careful to interpret $\partial f$, and expressions using it, as relations
- the mapping $(I + \lambda \partial f)^{-1}$ is known as the resolvent of the operator $\partial f$
Moreau decomposition
- the following relation always holds:
  $$v = \operatorname{prox}_f(v) + \operatorname{prox}_{f^*}(v)$$
- main link between proximal operators and duality
- a generalization of the orthogonal decomposition induced by a subspace $L$:
  $$v = \Pi_L(v) + \Pi_{L^\perp}(v)$$
  follows from Moreau decomposition and $(I_L)^* = I_{L^\perp}$
Norms and norm balls
- in general: if $f = \|\cdot\|$ and $B$ is the unit ball of the dual norm, then
  $$\operatorname{prox}_{\lambda f}(v) = v - \lambda \Pi_B(v/\lambda)$$
  by Moreau decomposition
- if $f = \|\cdot\|_2$ and $B$ is the Euclidean unit ball, then
  $$\operatorname{prox}_{\lambda f}(v) = \begin{cases} (1 - \lambda/\|v\|_2)\, v & \|v\|_2 \ge \lambda \\ 0 & \|v\|_2 < \lambda \end{cases}$$
  sometimes called the 'block soft thresholding' operator
- if $f = \|\cdot\|_1$ and $B$ is the $\ell_\infty$ unit ball, this lets us derive (elementwise) soft thresholding:
  $$(\operatorname{prox}_{\lambda f}(v))_i = v_i - \lambda \Pi_{[-1,1]}(v_i/\lambda) = \begin{cases} v_i - \lambda & v_i > \lambda \\ 0 & |v_i| \le \lambda \\ v_i + \lambda & v_i < -\lambda \end{cases}$$
- if $f = \|\cdot\|_\infty$ and $B$ is the $\ell_1$ unit ball, simple algorithms are available (for projection onto the $\ell_1$ ball)
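A minimal sketch of the two prox formulas above, plus a check that the Moreau-decomposition route ($v - \lambda \Pi_B(v/\lambda)$, with $B$ the $\ell_\infty$ ball) reproduces soft thresholding; data values are arbitrary.

```python
# Minimal sketch: prox of the l2 norm (block soft thresholding), and a check
# that prox of lam*||.||_1 via Moreau decomposition matches soft thresholding.
import numpy as np

def prox_l2(v, lam):
    """Block soft thresholding: prox of lam*||.||_2."""
    nrm = np.linalg.norm(v)
    return np.zeros_like(v) if nrm <= lam else (1 - lam / nrm) * v

def prox_l1(v, lam):
    return np.maximum(v - lam, 0.0) - np.maximum(-v - lam, 0.0)

v, lam = np.array([1.5, -0.3, 0.8]), 0.5
# Moreau decomposition: prox_{lam f}(v) = v - lam * proj_B(v/lam), B the linf unit ball
via_moreau = v - lam * np.clip(v / lam, -1.0, 1.0)
print(prox_l2(v, lam))
print(np.allclose(via_moreau, prox_l1(v, lam)))  # True
```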
Proximal minimization
- the proximal minimization or proximal point algorithm is
  $$x^{k+1} := \operatorname{prox}_{\lambda f}(x^k)$$
- the simplest proximal method; can be interpreted as
- the gradient method applied to $M_{\lambda f}$ rather than $f$
- a simple iteration for finding a fixed point of $\operatorname{prox}_{\lambda f}$
- if $f(x) = (1/2)x^T A x - b^T x$ with $A \succeq 0$, reduces to iterative refinement for solving $Ax = b$
- works for any fixed $\lambda > 0$, or for varying $\lambda^k > 0$ with $\sum_k \lambda^k = \infty$ (non-summable)
Operator splitting
- the most useful proximal methods use the idea of operator splitting
- these algorithms minimize $f + g$ only using $\operatorname{prox}_f$ or $\operatorname{prox}_g$
- useful when $f$ and $g$ each have useful structure separately
- very simple historical example: alternating projections to find a point in $C \cap D$
Proximal gradient method
- problem:
  $$\text{minimize} \quad f(x) + g(x)$$
  where $f$ is smooth and $g$ is closed proper convex
- method:
  $$x^{k+1} := \operatorname{prox}_{\lambda^k g}\!\left(x^k - \lambda^k \nabla f(x^k)\right)$$
- converges with rate $O(1/k)$ when $\nabla f$ is Lipschitz continuous with constant $L$ and step sizes are $\lambda^k = \lambda \in (0, 1/L]$
- special case: projected gradient method (take $g = I_C$)
- if $L$ is not known (usually the case), can use the following line search:
  given $x^k$, $\lambda^{k-1}$, and parameter $\beta \in (0,1)$; let $\lambda := \lambda^{k-1}$; repeat:
  1. let $z := \operatorname{prox}_{\lambda g}(x^k - \lambda \nabla f(x^k))$
  2. break if $f(z) \le \hat f_\lambda(z, x^k)$
  3. update $\lambda := \beta \lambda$
  return $\lambda^k := \lambda$, $x^{k+1} := z$
  typical value of $\beta$ is 1/2, and
  $$\hat f_\lambda(z, x) = f(x) + \nabla f(x)^T (z - x) + \tfrac{1}{2\lambda}\|z - x\|_2^2$$
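A minimal sketch of the method with the backtracking line search above; `f`, `grad_f`, and `prox_g` are supplied by the caller, and the assumed signature `prox_g(v, lam)` is my convention, not from the slides.

```python
# Minimal sketch: proximal gradient method with the backtracking line search above.
import numpy as np

def prox_gradient(f, grad_f, prox_g, x0, lam0=1.0, beta=0.5, iters=100):
    x, lam = np.asarray(x0, dtype=float), lam0
    for _ in range(iters):
        g = grad_f(x)
        while True:
            z = prox_g(x - lam * g, lam)
            # f_hat(z, x): quadratic upper model of f around x
            f_hat = f(x) + g @ (z - x) + np.sum((z - x) ** 2) / (2 * lam)
            if f(z) <= f_hat:
                break
            lam *= beta
        x = z
    return x
```

The lasso sketch further below shows the same iteration specialized to a concrete problem.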
Interpretations
- $x^{k+1}$ is the solution to
  $$\text{minimize} \quad g(x) + f(x^k) + \nabla f(x^k)^T (x - x^k) + \tfrac{1}{2\lambda}\|x - x^k\|_2^2$$
  a trade-off between minimizing $g$ and staying close to the gradient step for $f$
- majorization-minimization method for $f + g$:
- keep minimizing a convex upper bound to the objective that is tight at the previous iterate
- here, use $\hat f_\lambda(x, x^k) + g(x)$ as the upper bound (valid when $\lambda \in (0, 1/L]$)
- $x^\star$ is optimal if and only if
  $$x^\star = \operatorname{prox}_{\lambda g}\!\left(x^\star - \lambda \nabla f(x^\star)\right)$$
  i.e., $x^\star$ is a fixed point of the forward-backward operator
Accelerated proximal gradient method
- problem:
  $$\text{minimize} \quad f(x) + g(x)$$
  where $f$ is smooth and $g$ is closed proper convex
- method:
  $$y^{k+1} := x^k + \omega^k (x^k - x^{k-1}), \qquad
    x^{k+1} := \operatorname{prox}_{\lambda^k g}\!\left(y^{k+1} - \lambda^k \nabla f(y^{k+1})\right)$$
  works for $\omega^k = k/(k+3)$ and a similar line search as before
- faster $O(1/k^2)$ convergence rate; originated with Nesterov (1983)
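A minimal sketch of the accelerated iteration above with a fixed step size (assumed at most $1/L$) and momentum $\omega^k = k/(k+3)$; the `prox_g(v, lam)` signature is my convention.

```python
# Minimal sketch: accelerated proximal gradient with fixed step size lam.
import numpy as np

def accel_prox_gradient(grad_f, prox_g, x0, lam, iters=200):
    x = x_prev = np.asarray(x0, dtype=float)
    for k in range(iters):
        w = k / (k + 3.0)                       # momentum weight from the slide
        y = x + w * (x - x_prev)
        x_prev, x = x, prox_g(y - lam * grad_f(y), lam)
    return x
```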
Applications
Lasso
- problem:
  $$\text{minimize} \quad \tfrac{1}{2}\|Ax - b\|_2^2 + \gamma \|x\|_1$$
  with $A \in \mathbf{R}^{m \times n}$ and $\gamma > 0$
- proximal gradient method (similar for accelerated):
  $$x^{k+1} := \operatorname{prox}_{\lambda^k \gamma \|\cdot\|_1}\!\left(x^k - \lambda^k A^T (Ax^k - b)\right)$$
  i.e., a gradient step on the quadratic term followed by soft thresholding (sketched below)
- faster implementations:
- parallel matrix-vector multiplication
- if $m \gg n$, precompute $A^T A$ and $A^T b$, then solve an equivalent lasso problem with $n \times n$ data; effort is then mostly computing $A^T A$
- compute $A^T A$ and $A^T b$ in parallel, with one block of $A$ in memory at a time
- compute the entire regularization path with warm starting
- can easily generalize to group lasso, elastic net, GLMs, ...
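A minimal sketch of the lasso iteration above on random data, with a fixed step size $1/L$, $L = \|A\|_2^2$; the problem sizes and $\gamma$ are arbitrary choices.

```python
# Minimal sketch: lasso via the proximal gradient iteration above (random data).
import numpy as np

np.random.seed(0)
m, n, gamma = 50, 200, 0.1
A, b = np.random.randn(m, n), np.random.randn(m)

L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
lam = 1.0 / L
x = np.zeros(n)
for _ in range(500):
    v = x - lam * A.T @ (A @ x - b)    # gradient step on the quadratic term
    x = np.maximum(v - lam * gamma, 0.0) - np.maximum(-v - lam * gamma, 0.0)

print("objective:", 0.5 * np.sum((A @ x - b) ** 2) + gamma * np.abs(x).sum())
print("nonzeros:", np.count_nonzero(x))
```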
Monotone Operators
Relations
- a relation $F$ on $\mathbf{R}^n$ is a subset of $\mathbf{R}^n \times \mathbf{R}^n$
- $\operatorname{dom} F = \{x \mid \exists y\ (x, y) \in F\}$
- overload $F(x)$ to mean the set $\{y \mid (x, y) \in F\}$
- can think of $F$ as a 'set-valued mapping', i.e., from $\operatorname{dom} F$ into the subsets of $\mathbf{R}^n$
- when $F(x)$ is always empty or a singleton, we say $F$ is a function
- any function (or operator) $f$ with $\operatorname{dom} f \subseteq \mathbf{R}^n$ is a relation; $f(x)$ is then ambiguous: it can mean the point $f(x)$ or the set $\{f(x)\}$
- Examples
- empty relation: $F = \emptyset$
- full relation: $F = \mathbf{R}^n \times \mathbf{R}^n$
- identity: $I = \{(x, x) \mid x \in \mathbf{R}^n\}$
- zero: $0 = \{(x, 0) \mid x \in \mathbf{R}^n\}$
- subdifferential relation: $\partial f = \{(x, g) \mid x \in \operatorname{dom} f,\ g \in \partial f(x)\}$
Operations on relations
- inverse (relation): $F^{-1} = \{(y, x) \mid (x, y) \in F\}$
- inverse exists for any relation
- coincides with the inverse function, when the inverse function exists
- composition: $FG = \{(x, z) \mid \exists y\ (x, y) \in G,\ (y, z) \in F\}$
- scalar multiplication: $\lambda F = \{(x, \lambda y) \mid (x, y) \in F\}$
- addition: $F + G = \{(x, y + z) \mid (x, y) \in F,\ (x, z) \in G\}$
- Example: resolvent of an operator
- for a relation $F$ and $\lambda \in \mathbf{R}$, the resolvent (much more on this later) is the relation $R = (I + \lambda F)^{-1}$
- explicitly,
  $$R = \{(x + \lambda y,\ x) \mid (x, y) \in F\}$$
Generalized equations
- goal: solve the generalized equation $0 \in F(x)$
- i.e., find $x$ with $(x, 0) \in F$
- solution set or zero set is $F^{-1}(0)$
- if $F = \partial f$ and $f$ is convex, then $0 \in \partial f(x)$ means $x$ minimizes $f$
Monotone operators
- a relation $F$ on $\mathbf{R}^n$ is monotone if
  $$(u - v)^T (x - y) \ge 0 \quad \text{for all } (x, u), (y, v) \in F$$
- $F$ is maximal monotone if there is no monotone operator that properly contains it
- on $\mathbf{R}$, $F$ is maximal monotone iff it is a connected curve with no endpoints, with nonnegative (or infinite) slope
- suppose $F$ and $G$ are monotone
- sum: $F + G$ is monotone
- nonnegative scaling: if $\lambda \ge 0$, then $\lambda F$ is monotone
- inverse: $F^{-1}$ is monotone
- congruence: $\tilde F(x) = A^T F(Ax)$ for $A \in \mathbf{R}^{n \times m}$ is monotone (on $\mathbf{R}^m$)
- zero set: $F^{-1}(0)$ is convex if $F$ is maximal monotone
- affine function $F(x) = Ax + b$ is monotone iff $A + A^T \succeq 0$
Subdifferential
$\partial f$ is monotone:
- suppose $u \in \partial f(x)$ and $v \in \partial f(y)$
- then
  $$f(y) \ge f(x) + u^T (y - x), \qquad f(x) \ge f(y) + v^T (x - y)$$
- add these and cancel to get
  $$(u - v)^T (x - y) \ge 0$$
if $f$ is convex closed proper (CCP), then $\partial f$ is maximal monotone
Nonexpansive and contractive operators
- $F$ has Lipschitz constant $L$ if
  $$\|F(x) - F(y)\|_2 \le L \|x - y\|_2 \quad \text{for all } x, y$$
- for $L = 1$, we say $F$ is nonexpansive
- for $L < 1$, we say $F$ is a contraction (with contraction factor $L$)
Properties
- if $F$ and $G$ have Lipschitz constant $L$, so does the weighted average
  $$\theta F + (1 - \theta) G, \qquad \theta \in [0, 1]$$
- composition of nonexpansive operators is nonexpansive
- composition of a nonexpansive operator and a contraction is a contraction
- fixed point set of a nonexpansive $F$ (with $\operatorname{dom} F = \mathbf{R}^n$),
  $$\{x \mid F(x) = x\},$$
  is convex (but can be empty)
- a contraction has a single fixed point
Resolvent and Cayley operator
- for $\lambda > 0$, the resolvent of a relation $F$ is
  $$R = (I + \lambda F)^{-1}$$
- when $F$ is monotone, $R$ is nonexpansive (thus a function)
- when $F$ is maximal monotone, $\operatorname{dom} R = \mathbf{R}^n$
- the Cayley operator of $F$ is
  $$C = 2R - I = 2(I + \lambda F)^{-1} - I$$
- when $F$ is monotone, $C$ is nonexpansive
- we write $R_\lambda$ and $C_\lambda$ to explicitly show the dependence on $\lambda$
Proof that the Cayley and resolvent operators are nonexpansive
assume $F$ is monotone and $\lambda > 0$
- suppose $(x, u) \in R$ and $(y, v) \in R$, i.e.,
  $$x \in u + \lambda F(u), \qquad y \in v + \lambda F(v)$$
- subtract to get $x - y \in (u - v) + \lambda\left(F(u) - F(v)\right)$
- multiply by $(u - v)^T$ and use monotonicity of $F$ to get
  $$(u - v)^T (x - y) \ge \|u - v\|_2^2$$
- so when $x = y$, we must have $u = v$ (i.e., $R$ is a function)
- now let's show $C$ is nonexpansive:
  $$\|C(x) - C(y)\|_2^2 = \|2(u - v) - (x - y)\|_2^2 = \|x - y\|_2^2 - 4(u - v)^T(x - y) + 4\|u - v\|_2^2 \le \|x - y\|_2^2$$
  using the inequality above
- $R$ is nonexpansive since it is the average of $I$ and $C$:
  $$R = \tfrac{1}{2}(I + C)$$
Example: Linear operators
- the linear mapping $F(x) = Ax$ is
- monotone iff $A + A^T \succeq 0$
- nonexpansive iff $\|A\|_2 \le 1$
- resolvent and Cayley operator: $R = (I + \lambda A)^{-1}$ and $C = 2(I + \lambda A)^{-1} - I$
- if $A$ is monotone, $I + \lambda A$ is nonsingular
- for the matrix case, we have an alternative formula for the Cayley operator:
  $$C = (I + \lambda A)^{-1}(I - \lambda A)$$
  cf. the bilinear function $(1 - \lambda s)/(1 + \lambda s)$, which maps
  $$\{s \mid \operatorname{Re} s \ge 0\} \quad \text{onto} \quad \{z \mid |z| \le 1\}$$
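A minimal numerical check of the claims above: for a randomly generated monotone linear operator, the matrix form of the Cayley operator has spectral norm at most 1. The construction of $A$ is an arbitrary illustrative choice.

```python
# Minimal sketch: for a monotone linear operator (A + A^T >= 0), check that the
# Cayley operator C = (I + lam A)^{-1}(I - lam A) is nonexpansive (||C||_2 <= 1).
import numpy as np

np.random.seed(1)
n, lam = 5, 0.8
M = np.random.randn(n, n)
A = M - M.T + 0.1 * np.eye(n)        # A + A^T = 0.2 I >= 0, so A is monotone

I = np.eye(n)
C = np.linalg.solve(I + lam * A, I - lam * A)
print(np.linalg.norm(C, 2))          # <= 1
```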
Resolvent of subdifferential: Proximal mapping
- suppose $z = (I + \lambda \partial f)^{-1}(x)$, with $f$ convex
- this means $x \in z + \lambda \partial f(z)$
- rewrite as
  $$0 \in \partial_z \left( f(z) + \tfrac{1}{2\lambda}\|z - x\|_2^2 \right)$$
  which is the same as
  $$z = \operatorname*{argmin}_u \left( f(u) + \tfrac{1}{2\lambda}\|u - x\|_2^2 \right)$$
- the RHS is called the proximal mapping of $f$, denoted $\operatorname{prox}_{\lambda f}(x)$
Example: Indicator function
- take $f = I_C$, the indicator function of a convex set $C$
- $\partial f$ is the normal cone operator
  $$N_C(x) = \begin{cases} \{g \mid g^T(y - x) \le 0 \ \text{for all } y \in C\} & x \in C \\ \emptyset & x \notin C \end{cases}$$
- the proximal operator of $I_C$ (i.e., the resolvent of $N_C$) is
  $$\operatorname{prox}_{\lambda I_C}(x) = \Pi_C(x)$$
  where $\Pi_C$ is (Euclidean) projection onto $C$
Resolvent of multiplier to residual map
- take $F$ to be the multiplier-to-residual mapping for the convex problem
  $$\text{minimize} \quad f(x) \quad \text{subject to} \quad Ax = b$$
- where $F(y) = \{b - A\tilde x \mid \tilde x \in \operatorname*{argmin}_x L(x, y)\}$, with Lagrangian $L(x, y) = f(x) + y^T(Ax - b)$
- $z = (I + \lambda F)^{-1}(y)$ means $y \in z + \lambda F(z)$
- i.e., $y = z + \lambda(b - A\tilde x)$ for some $\tilde x \in \operatorname*{argmin}_x L(x, z)$
- write this as
  $$0 \in \partial f(\tilde x) + A^T z, \qquad z = y + \lambda(A\tilde x - b)$$
- write the second term as $A^T\left(y + \lambda(A\tilde x - b)\right)$, so
  $$0 \in \partial_x \left( f(x) + y^T(Ax - b) + \tfrac{\lambda}{2}\|Ax - b\|_2^2 \right) \quad \text{at } x = \tilde x$$
- the function on the right side is the augmented Lagrangian for the problem
- so $z = (I + \lambda F)^{-1}(y)$ can be found as
  $$z = y + \lambda(A\tilde x - b), \qquad \tilde x = \operatorname*{argmin}_x \left( f(x) + y^T(Ax - b) + \tfrac{\lambda}{2}\|Ax - b\|_2^2 \right)$$
Fixed points of Cayley and resolvent operators
- assume $F$ is maximal monotone, $\lambda > 0$
- solutions of $0 \in F(x)$ are fixed points of the resolvent $R = (I + \lambda F)^{-1}$:
  $$0 \in F(x) \iff R(x) = x$$
- solutions of $0 \in F(x)$ are fixed points of the Cayley operator $C = 2R - I$:
  $$0 \in F(x) \iff C(x) = x$$
- key result: we can solve $0 \in F(x)$ by finding fixed points of $R$ or $C$
- next: how to actually find these fixed points
Contraction mapping theorem
- also known as the Banach fixed point theorem
- assume $F$ is a contraction, with Lipschitz constant $L < 1$ and $\operatorname{dom} F = \mathbf{R}^n$
- the iteration
  $$x^{k+1} := F(x^k)$$
  converges to the unique fixed point of $F$
- proof (sketch):
- the sequence is Cauchy: $\|x^{k+1} - x^k\|_2 \le L^k \|x^1 - x^0\|_2$
- hence $x^k$ converges to a point $x^\star$
- $x^\star$ must be (the) fixed point
Example: Gradient method
- assume $f$ is convex, with $mI \preceq \nabla^2 f(x) \preceq LI$ (i.e., $f$ strongly convex, $\nabla f$ Lipschitz)
- gradient method is
  $$x^{k+1} := x^k - \alpha \nabla f(x^k) = (I - \alpha \nabla f)(x^k)$$
  (fixed points are exactly solutions of $\nabla f(x) = 0$)
- $I - \alpha \nabla f$ is Lipschitz with parameter $\max\{|1 - \alpha m|, |1 - \alpha L|\}$
- $I - \alpha \nabla f$ is a contraction when $0 < \alpha < 2/L$
- so the gradient method converges (geometrically) when $0 < \alpha < 2/L$
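A minimal sketch illustrating the contraction view above on a strongly convex quadratic; the matrix and step size are arbitrary choices, and the final print compares the iterate norm with the contraction-factor bound (up to the scale of the starting point).

```python
# Minimal sketch: gradient method as a contraction on f(x) = (1/2) x^T P x.
import numpy as np

np.random.seed(2)
n = 20
Q = np.random.randn(n, n)
P = Q @ Q.T + np.eye(n)                    # m I <= P <= L I with m >= 1
m, L = np.linalg.eigvalsh(P)[[0, -1]]
alpha = 2.0 / (m + L)                      # a standard choice in (0, 2/L)

x = np.random.randn(n)
for _ in range(50):
    x = x - alpha * (P @ x)                # gradient step; fixed point is x = 0
print(np.linalg.norm(x), max(abs(1 - alpha * m), abs(1 - alpha * L)) ** 50)
```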
Damped iteration of a nonexpansive operator
- suppose $F$ is nonexpansive, $\operatorname{dom} F = \mathbf{R}^n$, with fixed point set $X = \{x \mid F(x) = x\}$
- can have $X = \emptyset$ (e.g., translation)
- simple iteration of $F$ need not converge, even when $X \neq \emptyset$ (e.g., rotation)
- damped iteration:
  $$x^{k+1} := (1 - \theta^k)\, x^k + \theta^k F(x^k), \qquad \theta^k \in (0, 1)$$
- important special case: $\theta^k = 1/2$, i.e., iterating the averaged operator $\tfrac{1}{2}(I + F)$
- another special case: $\theta^k = 1/(k+1)$, which gives simple averaging
  $$x^{k+1} = \frac{1}{k+1} \sum_{i=0}^{k} F(x^i)$$
Convergence results
- assume $F$ is nonexpansive, $X = \{x \mid F(x) = x\} \neq \emptyset$, and
  $$\sum_k \theta^k (1 - \theta^k) = \infty$$
  (which holds for the special cases above)
- then we have
  $$\min_{i \le k} \operatorname{dist}(x^i, X) \to 0,$$
  i.e., (some) iterates get arbitrarily close to the fixed point set, and
  $$\min_{i \le k} \|F(x^i) - x^i\|_2 \to 0,$$
  i.e., (some) iterates yield arbitrarily good 'almost fixed points'
- moreover, for any $x^\star \in X$,
  $$\|x^{k+1} - x^\star\|_2 \le \|x^k - x^\star\|_2$$
- $F(x^k)$ is no farther from $x^\star$ than $x^k$ is (by nonexpansivity)
- so $x^{k+1}$, a convex combination of $x^k$ and $F(x^k)$, is no farther from $x^\star$ than $x^k$ is
Proof
- start with the identity
  $$\|(1-\theta)a + \theta b\|_2^2 = (1-\theta)\|a\|_2^2 + \theta\|b\|_2^2 - \theta(1-\theta)\|a - b\|_2^2$$
- apply to $x^{k+1} - x^\star = (1-\theta^k)(x^k - x^\star) + \theta^k\left(F(x^k) - x^\star\right)$, for any $x^\star \in X$:
  $$\|x^{k+1} - x^\star\|_2^2 \le \|x^k - x^\star\|_2^2 - \theta^k(1-\theta^k)\|F(x^k) - x^k\|_2^2$$
  using $\|F(x^k) - x^\star\|_2 \le \|x^k - x^\star\|_2$
- iterate the inequality to get
  $$\sum_{i=0}^{k} \theta^i(1-\theta^i)\|F(x^i) - x^i\|_2^2 \le \|x^0 - x^\star\|_2^2$$
- if $\|F(x^i) - x^i\|_2 \ge \epsilon$ for $i = 0, \ldots, k$, then
  $$\epsilon^2 \le \frac{\|x^0 - x^\star\|_2^2}{\sum_{i=0}^{k} \theta^i(1-\theta^i)}$$
- RHS goes to zero as $k \to \infty$
Damped Cayley iteration
- want to solve $0 \in F(x)$ with $F$ maximal monotone
- damped Cayley iteration:
  $$x^{k+1} := (1 - \theta^k)\, x^k + \theta^k C(x^k)$$
  with $\theta^k \in (0,1)$ and $\sum_k \theta^k(1-\theta^k) = \infty$
- converges (assuming a solution exists) in the sense given above
- important: requires the ability to evaluate the resolvent map of $F$
Proximal point algorithm
- take $\theta^k = 1/2$ in the damped Cayley iteration
- gives the resolvent iteration or proximal point algorithm:
  $$x^{k+1} := R(x^k) = (I + \lambda F)^{-1}(x^k)$$
- if $F = \partial f$ with $f$ convex, yields the proximal minimization algorithm
  $$x^{k+1} := \operatorname*{argmin}_x \left( f(x) + \tfrac{1}{2\lambda}\|x - x^k\|_2^2 \right)$$
  can interpret as quadratic regularization that goes away in the limit
- many classical algorithms are just proximal point methods applied to the appropriate maximal monotone operator
Method of multipliers
- the proximal point algorithm applied to the multiplier-to-residual map above (a small sketch follows)
- like the dual method, but with the augmented Lagrangian and a specific step size
- converges under far more general conditions than dual subgradient:
- $f$ need not be strictly convex, or differentiable
- $f$ can take on the value $+\infty$
- but not amenable to decomposition
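A minimal sketch of the method of multipliers for the equality-constrained problem from the earlier slide, with the illustrative choice $f(x) = \tfrac{1}{2}\|x\|_2^2$ so the augmented Lagrangian minimizer is a linear solve; data and parameters are arbitrary.

```python
# Minimal sketch: method of multipliers for
#   minimize (1/2)||x||^2  subject to  Ax = b
import numpy as np

np.random.seed(3)
m, n, rho = 5, 10, 1.0
A, b = np.random.randn(m, n), np.random.randn(m)

x, y = np.zeros(n), np.zeros(m)
I = np.eye(n)
for _ in range(100):
    # x-update: minimize the augmented Lagrangian over x
    x = np.linalg.solve(I + rho * A.T @ A, A.T @ (rho * b - y))
    # y-update: multiplier step with step size rho
    y = y + rho * (A @ x - b)

print(np.linalg.norm(A @ x - b))   # primal residual -> ~0
```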
Monotone Operator Splitting Methods
Operator splitting
- want to solve $0 \in F(x)$ with $F$ maximal monotone
- main idea: write $F$ as $F = A + B$, with $A$ and $B$ maximal monotone
- called operator splitting
- solve using methods that require evaluation of the resolvents
  $$R_A = (I + \lambda A)^{-1}, \qquad R_B = (I + \lambda B)^{-1}$$
  (or the Cayley operators $C_A = 2R_A - I$ and $C_B = 2R_B - I$)
- useful when $R_A$ and $R_B$ can be evaluated more easily than $R_{A+B}$
main result
- $A$ and $B$ maximal monotone, so the Cayley operators $C_A$, $C_B$ are nonexpansive
- hence $C_A C_B$ is nonexpansive
- key result:
  $$0 \in A(x) + B(x) \iff C_A C_B(z) = z \ \text{ for some } z \text{ with } x = R_B(z)$$
- so solutions of $0 \in A(x) + B(x)$ can be found from fixed points of the nonexpansive operator $C_A C_B$
Proof of main result
- write $x = R_B(z)$, $\tilde x = R_A(C_B(z))$, and $C_A C_B(z) = z$ as
  $$x + \lambda b = z, \quad b \in B(x), \qquad \tilde x + \lambda a = 2x - z, \quad a \in A(\tilde x), \qquad 2\tilde x - (2x - z) = z$$
- the last equation gives $\tilde x = x$
- the second equation is then $x + \lambda a = 2x - z$, i.e., $\lambda a = x - z$
- now add $\lambda a = x - z$ and $\lambda b = z - x$ to get $a + b = 0$
- hence $0 \in A(x) + B(x)$
- the argument goes the other way as well (but we don't need it)
Peaceman-Rachford and Douglas-Rachford splitting
- Peaceman-Rachford splitting is the undamped iteration
  $$z^{k+1} := C_A C_B(z^k)$$
  doesn't converge in the general case; need $C_A$ or $C_B$ to be a contraction
- Douglas-Rachford splitting is the damped iteration
  $$z^{k+1} := \tfrac{1}{2}\left(I + C_A C_B\right)(z^k)$$
  always converges when $0 \in A(x) + B(x)$ has a solution
Douglas-Rachford splitting
write the D-R iteration as
  $$x^{k+1/2} := R_B(z^k), \qquad z^{k+1/2} := 2x^{k+1/2} - z^k,$$
  $$x^{k+1} := R_A(z^{k+1/2}), \qquad z^{k+1} := z^k + x^{k+1} - x^{k+1/2}$$
the last update follows from
  $$z^{k+1} = \tfrac{1}{2}\left(z^k + C_A C_B(z^k)\right) = \tfrac{1}{2}\left(z^k + 2x^{k+1} - z^{k+1/2}\right) = z^k + x^{k+1} - x^{k+1/2}$$
- can consider $x^{k+1} - x^{k+1/2}$ as a residual
- $z^k$ is a running sum of the residuals
Douglas-Rachford algorithm
- many ways to rewrite/rearrange the D-R algorithm
- equivalent to many other algorithms; often not obvious
- need very little: $A + B$ maximal monotone; a solution exists
- $A$ and $B$ are handled separately (via $R_A$ and $R_B$); they are 'uncoupled' (a generic sketch follows)
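A minimal sketch of the generic D-R iteration in the prox form above, applied to the lasso problem from earlier as an illustration: one resolvent is a linear solve (for the quadratic term), the other is soft thresholding (for $\gamma\|\cdot\|_1$); data, sizes, and parameters are arbitrary choices.

```python
# Minimal sketch: generic Douglas-Rachford iteration, illustrated on the lasso.
import numpy as np

def douglas_rachford(res_B, res_A, z0, iters=300):
    z = np.asarray(z0, dtype=float)
    for _ in range(iters):
        x_half = res_B(z)                 # x^{k+1/2} = R_B(z^k)
        x = res_A(2 * x_half - z)         # x^{k+1}   = R_A(2 x^{k+1/2} - z^k)
        z = z + x - x_half                # running sum of residuals
    return res_B(z)

np.random.seed(4)
m, n, gamma, lam = 30, 60, 0.1, 1.0
Amat, b = np.random.randn(m, n), np.random.randn(m)

# resolvent of the quadratic term (1/2)||Amat x - b||^2 is a linear solve
res_quad = lambda v: np.linalg.solve(np.eye(n) + lam * Amat.T @ Amat, v + lam * Amat.T @ b)
# resolvent of gamma*||.||_1 is soft thresholding
res_l1 = lambda v: np.maximum(v - lam * gamma, 0) - np.maximum(-v - lam * gamma, 0)

x = douglas_rachford(res_l1, res_quad, np.zeros(n))
print(0.5 * np.sum((Amat @ x - b) ** 2) + gamma * np.abs(x).sum())
```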
Alternating direction method of multipliers
to minimize $f(x) + g(x)$, we solve $0 \in \partial f(x) + \partial g(x)$
with $A = \partial f$ and $B = \partial g$, D-R is
  $$x^{k+1/2} := \operatorname{prox}_{\lambda g}(z^k), \qquad
    x^{k+1} := \operatorname{prox}_{\lambda f}(2x^{k+1/2} - z^k), \qquad
    z^{k+1} := z^k + x^{k+1} - x^{k+1/2}$$
a special case of the alternating direction method of multipliers (ADMM)
Constrained optimization
- constrained convex problem:
  $$\text{minimize} \quad f(x) \quad \text{subject to} \quad x \in C$$
- take $A = \partial f$ and $B = N_C$, the normal cone operator of $C$
- so $R_A = \operatorname{prox}_{\lambda f}$ and $R_B = \Pi_C$
- D-R is
  $$x^{k+1/2} := \Pi_C(z^k), \qquad
    x^{k+1} := \operatorname{prox}_{\lambda f}(2x^{k+1/2} - z^k), \qquad
    z^{k+1} := z^k + x^{k+1} - x^{k+1/2}$$
Dykstra's alternating projections
- find a point in the intersection of convex sets $C \cap D$, i.e., minimize $I_C(x) + I_D(x)$
- D-R gives the algorithm
  $$x^{k+1/2} := \Pi_D(z^k), \qquad
    x^{k+1} := \Pi_C(2x^{k+1/2} - z^k), \qquad
    z^{k+1} := z^k + x^{k+1} - x^{k+1/2}$$
- this is Dykstra's alternating projections algorithm
- much faster than classical alternating projections (e.g., for $C$, $D$ polyhedral, converges in a finite number of steps)
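A minimal sketch of the iteration above for two illustrative sets (a Euclidean ball and a halfspace, my choice); at convergence the iterate should lie in both.

```python
# Minimal sketch: the D-R projection iteration above for a point in C ∩ D,
# with C a Euclidean ball and D a halfspace (illustrative choices of sets).
import numpy as np

proj_C = lambda x: x / max(1.0, np.linalg.norm(x) / 2.0)          # ball ||x|| <= 2
a, t = np.array([1.0, 1.0]), 1.5
proj_D = lambda x: x - max(0.0, a @ x - t) / (a @ a) * a           # halfspace a^T x <= 1.5

z = np.array([5.0, -3.0])
for _ in range(100):
    x_half = proj_D(z)
    x = proj_C(2 * x_half - z)
    z = z + x - x_half

# x should (approximately) satisfy ||x|| <= 2 and a^T x <= 1.5
print(x, np.linalg.norm(x), a @ x)
```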
Consensus optimization
- want to minimize $\sum_{i=1}^N f_i(x)$
- rewrite as the consensus problem
  $$\text{minimize} \quad \sum_{i=1}^N f_i(x_i) \quad \text{subject to} \quad x_1 = x_2 = \cdots = x_N$$
- D-R consensus optimization: apply D-R with $B = \partial\left(\sum_i f_i(x_i)\right)$ (block separable) and $A = N_C$, where
  $$C = \{(x_1, \ldots, x_N) \mid x_1 = \cdots = x_N\}$$
Douglas-Rachford consensus
- $x^{k+1/2}$-update splits into $N$ separate (parallel) problems:
  $$x_i^{k+1/2} := \operatorname{prox}_{\lambda f_i}(z_i^k)$$
- $x^{k+1}$-update is averaging:
  $$x^{k+1} := 2\,\overline{x^{k+1/2}} - \bar z^k$$
- $z$-update becomes
  $$z_i^{k+1} := z_i^k + x^{k+1} - x_i^{k+1/2}$$
- taking the average of the last equation, we get $\bar z^{k+1} = \overline{x^{k+1/2}}$
- renaming $x_i^{k+1/2}$ as $x_i^{k+1}$ and $z_i^k - \bar z^k$ as $y_i^k$, D-R consensus becomes
  $$x_i^{k+1} := \operatorname{prox}_{\lambda f_i}(\bar x^k + y_i^k), \qquad
    y_i^{k+1} := y_i^k - (x_i^{k+1} - \bar x^{k+1})$$
- subsystem (local) state: $(x_i^k, y_i^k)$
- gather the $x_i^{k+1}$'s to compute $\bar x^{k+1}$, which is then scattered
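A minimal sketch of the consensus iteration above, with the illustrative local objectives $f_i(x) = \tfrac{1}{2}\|x - a_i\|_2^2$ (whose prox is $(v + \lambda a_i)/(1 + \lambda)$), so the consensus minimizer is the average of the $a_i$; sizes and data are arbitrary.

```python
# Minimal sketch: D-R consensus with quadratic local objectives f_i(x) = (1/2)||x - a_i||^2.
import numpy as np

np.random.seed(5)
N, n, lam = 5, 3, 1.0
a = np.random.randn(N, n)                 # local data a_i

x = np.zeros((N, n))
y = np.zeros((N, n))
for _ in range(200):
    xbar = x.mean(axis=0)                 # gather: compute the average
    x = (xbar + y + lam * a) / (1 + lam)  # parallel prox steps (scattered xbar)
    y = y - (x - x.mean(axis=0))          # local dual updates

print(np.allclose(x, a.mean(axis=0)))     # all local copies agree on the mean of a_i
```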