Eclats de vers : Matemat 10 : Optimisation - 9

Table des matières

1. Algorithmes d'optimisation contrainte
2. Réseaux de neurones

\( \newcommand{\parentheses}[1]{\left(#1\right)} \newcommand{\crochets}[1]{\left[#1\right]} \newcommand{\accolades}[1]{\left\{#1\right\}} \newcommand{\ensemble}[1]{\left\{#1\right\}} \newcommand{\identite}{\mathrm{Id}} \newcommand{\indicatrice}{\boldsymbol{\delta}} \newcommand{\dirac}{\delta} \newcommand{\moinsun}{{-1}} \newcommand{\inverse}{\ddagger} \newcommand{\pinverse}{\dagger} \newcommand{\topologie}{\mathfrak{T}} \newcommand{\ferme}{\mathfrak{F}} \newcommand{\img}{\mathbf{i}} \newcommand{\binome}[2]{ \left\{ \begin{array}{c} #1 \\ #2 \\ \end{array} \right\} } \newcommand{\canonique}{\mathfrak{c}} \newcommand{\tenseuridentite}{\boldsymbol{\mathcal{I}}} \newcommand{\permutation}{\boldsymbol{\epsilon}} \newcommand{\matriceZero}{\mathfrak{0}} \newcommand{\matriceUn}{\mathfrak{1}} \newcommand{\christoffel}[2]{ \left\{ \begin{array}{c} #1 \\ #2 \\ \end{array} \right\} } \newcommand{\lagrangien}{\mathfrak{L}} \newcommand{\sousens}{\mathfrak{P}} \newcommand{\partition}{\mathrm{Partition}} \newcommand{\tribu}{\mathrm{Tribu}} \newcommand{\topologies}{\mathrm{Topo}} \newcommand{\setB}{\mathbb{B}} \newcommand{\setN}{\mathbb{N}} \newcommand{\setZ}{\mathbb{Z}} \newcommand{\setQ}{\mathbb{Q}} \newcommand{\setR}{\mathbb{R}} \newcommand{\setC}{\mathbb{C}} \newcommand{\corps}{\mathbb{K}} \newcommand{\boule}{\mathfrak{B}} \newcommand{\intervalleouvert}[2]{\relax \ ] #1 , #2 [ \ \relax} \newcommand{\intervallesemiouvertgauche}[2]{\relax \ ] #1 , #2 ]} \newcommand{\intervallesemiouvertdroite}[2]{[ #1 , #2 [ \ \relax} \newcommand{\fonction}{\mathbb{F}} \newcommand{\bijection}{\mathrm{Bij}} \newcommand{\polynome}{\mathrm{Poly}} \newcommand{\lineaire}{\mathrm{Lin}} \newcommand{\continue}{\mathrm{Cont}} \newcommand{\homeomorphisme}{\mathrm{Hom}} \newcommand{\etagee}{\mathrm{Etagee}} \newcommand{\lebesgue}{\mathrm{Leb}} \newcommand{\lipschitz}{\mathrm{Lip}} \newcommand{\suitek}{\mathrm{Suite}} \newcommand{\matrice}{\mathbb{M}} \newcommand{\krylov}{\mathrm{Krylov}} \newcommand{\tenseur}{\mathbb{T}} \newcommand{\essentiel}{\mathfrak{E}} \newcommand{\relation}{\mathrm{Rel}} \newcommand{\strictinferieur}{\ < \ } \newcommand{\strictsuperieur}{\ > \ } \newcommand{\ensinferieur}{\eqslantless} \newcommand{\enssuperieur}{\eqslantgtr} \newcommand{\esssuperieur}{\gtrsim} \newcommand{\essinferieur}{\lesssim} \newcommand{\essegal}{\eqsim} \newcommand{\union}{\ \cup \ } \newcommand{\intersection}{\ \cap \ } \newcommand{\opera}{\divideontimes} \newcommand{\autreaddition}{\boxplus} \newcommand{\autremultiplication}{\circledast} \newcommand{\commutateur}[2]{\left[ #1 , #2 \right]} \newcommand{\convolution}{\circledcirc} \newcommand{\correlation}{\ \natural \ } \newcommand{\diventiere}{\div} \newcommand{\modulo}{\bmod} \newcommand{\pgcd}{pgcd} \newcommand{\ppcm}{ppcm} \newcommand{\produitscalaire}[2]{\left\langle #1 \left|\right\relax #2 \right\rangle} \newcommand{\scalaire}[2]{\left\langle #1 \| #2 \right\rangle} \newcommand{\braket}[3]{\left\langle #1 \right| #2 \left| #3 \right\rangle} \newcommand{\orthogonal}{\bot} \newcommand{\forme}[2]{\left\langle #1 , #2 \right\rangle} \newcommand{\biforme}[3]{\left\langle #1 , #2 , #3 \right\rangle} \newcommand{\contraction}[3]{\left\langle #1 \odot #3 \right\rangle_{#2}} \newcommand{\dblecont}[5]{\left\langle #1 \right| #3 \left| #5 \right\rangle_{#2,#4}} \newcommand{\major}{major} \newcommand{\minor}{minor} \newcommand{\maxim}{maxim} \newcommand{\minim}{minim} \newcommand{\argument}{arg} \newcommand{\argmin}{arg\ min} \newcommand{\argmax}{arg\ max} \newcommand{\supessentiel}{ess\ sup} \newcommand{\infessentiel}{ess\ inf} \newcommand{\dual}{\star} \newcommand{\distance}{\mathfrak{dist}} \newcommand{\norme}[1]{\left\| #1 \right\|} \newcommand{\normetrois}[1]{\left|\left\| #1 \right\|\right|} \newcommand{\adh}{adh} \newcommand{\interieur}{int} \newcommand{\frontiere}{\partial} \newcommand{\image}{im} \newcommand{\domaine}{dom} \newcommand{\noyau}{ker} \newcommand{\support}{supp} \newcommand{\signe}{sign} \newcommand{\abs}[1]{\left| #1 \right|} \newcommand{\unsur}[1]{\frac{1}{#1}} \newcommand{\arrondisup}[1]{\lceil #1 \rceil} \newcommand{\arrondiinf}[1]{\lfloor #1 \rfloor} \newcommand{\conjugue}{conj} \newcommand{\conjaccent}[1]{\overline{#1}} \newcommand{\division}{division} \newcommand{\difference}{\boldsymbol{\Delta}} \newcommand{\differentielle}[2]{\mathfrak{D}^{#1}_{#2}} \newcommand{\OD}[2]{\frac{d #1}{d #2}} \newcommand{\OOD}[2]{\frac{d^2 #1}{d #2^2}} \newcommand{\NOD}[3]{\frac{d^{#3} #1}{d #2^{#3}}} \newcommand{\deriveepartielle}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\PD}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\dblederiveepartielle}[2]{\frac{\partial^2 #1}{\partial #2 \partial #2}} \newcommand{\dfdxdy}[3]{\frac{\partial^2 #1}{\partial #2 \partial #3}} \newcommand{\dfdxdx}[2]{\frac{\partial^2 #1}{\partial #2^2}} \newcommand{\gradient}{\mathbf{\nabla}} \newcommand{\combilin}[1]{\mathrm{span}\{ #1 \}} \newcommand{\trace}{tr} \newcommand{\proba}{\mathbb{P}} \newcommand{\probaof}[1]{\mathbb{P}\left[#1\right]} \newcommand{\esperof}[1]{\mathbb{E}\left[#1\right]} \newcommand{\cov}[2]{\mathrm{cov} \left( #1 , #2 \right) } \newcommand{\var}[1]{\mathrm{var} \left( #1 \right) } \newcommand{\rand}{\mathrm{rand}} \newcommand{\variation}[1]{\left\langle #1 \right\rangle} \newcommand{\composante}{comp} \newcommand{\bloc}{bloc} \newcommand{\ligne}{ligne} \newcommand{\colonne}{colonne} \newcommand{\diagonale}{diag} \newcommand{\matelementaire}{\mathrm{Elem}} \newcommand{\matpermutation}{permut} \newcommand{\matunitaire}{\mathrm{Unitaire}} \newcommand{\gaussjordan}{\mathrm{GaussJordan}} \newcommand{\householder}{\mathrm{Householder}} \newcommand{\rang}{rang} \newcommand{\schur}{\mathrm{Schur}} \newcommand{\singuliere}{\mathrm{DVS}} \newcommand{\convexe}{\mathrm{Convexe}} \newcommand{\petito}[1]{o\left(#1\right)} \newcommand{\grando}[1]{O\left(#1\right)} \)

\label{chap:algocont}

1.1. Introduction

Nous allons présenter des algorithmes permettant de résoudre approximativement des problèmes de minimisation d'une fonction \(\varphi\) sur un ensemble \(\Omega \subseteq \setR^n\). Ces algorithmes partent d'un point initial \(x_0 \in \Omega\) et itèrent schématiquement comme suit :

\[x_{k + 1} = I(x_k) = x_k + p_k\]

pour un certain \(p_k \in \setR^n\). On espère bien entendu que la suite converge et que :

\[x_N \approx \lim_{k \to \infty} x_k = \arg\min_{x \in \Omega} \varphi(x)\]

pour \(N\) assez grand. Nous adoptons les notations :

\[J = \partial \varphi\]

pour le gradient, de taille \((n,1)\) et :

\[H = \partial^2 \varphi\]

pour la hessienne, de taille \((n,n)\). On note également :

\[\Phi_k = \Phi(x_k)\]

pour toute fonction \(\Phi\) (par exemple, \(\Phi \in \{\varphi,J,H\}\)).

1.2. Contraintes linéaires

Nous considérons le cas de contraintes linéaires :

\[\Omega = \{ x \in \setR^n : A \cdot x = b \}\]

où \(A\) est une matrice de taille \((m,n)\) et \(b\) un vecteur de taille \((m,1)\). On choisit généralement \(x_0 = 0\). L'itération \(k\) part d'un \(x_k\) satisfaisant les contraintes :

\[A \cdot x_k = b\]

Si on veut que \(x_{k + 1} = x_k + s_k \in \Omega\), il est nécessaire et suffisant que \(A \cdot (x_k + s_k) = b\), ou que :

\[A \cdot s_k = A \cdot (x_k + s_k) - A \cdot x_k = b - b = 0\]

On doit donc avoir \(s_k \in \noyau A\). Soit les \(u_i \in \setR^m\), \(v_i \in \setR^n\) et \(\sigma_i \in \setR\) constituant la décomposition en valeurs singulières de \(A\). On forme une matrice à partir des vecteurs de base de \(\noyau A\) :

\[V = [v_{r + 1} \ ... \ v_n]\]

On a alors :

\[s_k = \sum_{i = r + 1}^n p_{k,i} \cdot v_i = V \cdot p_k\]

où \(p_k = [p_{k, r + 1} ... p_{k,n}]^\dual\).

1.2.1. Newton projeté

Il s'agit de l'adaptation de la méthode de Newton :

\[x_{k + 1} = x_k + V \cdot p_k\]

On minimise le développement d'ordre deux :

\[\varphi_{k+1} \approx \varphi_k + J_k^\dual \cdot V \cdot p_k + \unsur{2} \cdot p_k^\dual \cdot V^\dual \cdot H_k \cdot V \cdot p_k\]

L'annulation du gradient par rapport à \(p_k\) nous donne :

\[V^\dual \cdot J_k + V^\dual \cdot H_k \cdot V \cdot p_k = 0\]

et donc :

\[p_k = - (V^\dual \cdot H_k \cdot V)^{-1} \cdot V^\dual \cdot J_k\]

1.3. Contraintes d'inégalité

Examinons à présent le cas où :

\[\Omega = \{ x \in \setR^n : \omega(x) \le 0 \}\]

Nous laissons à présent tomber le \(k\) des itérations, afin d'alléger les notations.

1.3.1. Méthodes de penalité

La méthode de pénalité consiste à ajouter une fonction positive à \(\varphi\) lorsque \(x\) sort de \(\Omega\). Si l'on augmente la valeur de la fonction pénalité, on rapproche alors le minimum global de \(\Omega\). La solution à notre problème contraint devrait donc être approchée en faisant tendre l'amplitude de la pénalité vers l'infini. Soit la fonction \(\varpi : \setR^n \mapsto \setR^m\) définie par :

\[\varpi_i(x) = \max\{\omega_i(x),0\}\]

On a par construction \(\varpi(x) = 0\) pour tout \(x \in \Omega\) et \(\varpi(x) \strictsuperieur 0\) pour tout \(x \notin \Omega\). On ajoute la somme des carrés \(\varpi(x)^\dual \cdot \varpi(x) = \sum_i \varpi_i(x)^2\) multipliée par un paramètre réel \(k \ge 0\) à la fonction objectif pour obtenir l'objectif modifié :

\[\psi_k(x) = \varphi(x) + k \cdot \varpi(x)^\dual \cdot \varpi(x)\]

On utilise ensuite un algorithme de minimisation libre pour évaluer le minimum global :

\[\mu(k) = \arg\min_{x \in \setR^n} \psi_k(x)\]

On espère que \(\mu\) converge à l'infini et que :

\[\lim_{k \to \infty} \mu(k) = \arg\min_{x \in \Omega} \varphi(x)\]

Il suffit dans ce cas de choisir \(k\) assez grand pour obtenir une estimation de la solution du problème contraint.

2. Réseaux de neurones

2.1. Définition
2.2. Perceptron à une couche
2.3. Entrainement

2.1. Définition

Commençons par la description d'un neurone \(i\) de fonction caractéristique \(\sigma\). La relation entre les entrées \(x_j\) et la sortie \(y_i\) s'écrit :

\[y_i = \sigma\left( \sum_j ( w_{ij} \ x_j ) + b_i \right)\]

Un réseau de neurones est composé de neurones reliés entres eux (la sortie d'un neurone peut servir d'entrée à un autre).

La fonction caractéristique est généralement l'une de celles décrites ci-dessous :

\( \sigma(x) = \mt{sign}(x) \\ \sigma(x) = \tanh(x) \\ \sigma(x) = \exp(-x^2) \\ \sigma(x) = x \exp(-x^2) \)

2.2. Perceptron à une couche

Le perceptron monocouche est composé d'une rangée de neurones reliant les entrées \(x_i\) aux sorties \(y_i\) (le perceptron multicouche est composé de monocouches assemblées l'une à la suite de l'autre). La relation entrées-sorties s'écrit :

\[y_i = P_{\theta}(x) = c + \sum_j v_j \ \sigma\left( \sum_k ( w_{jk} \ x_k ) + b_j \right)\]

On écrit \(y_i = P_{\theta}(x)\) pour mettre en évidence l'influence des paramètres du résau \(\theta = (v,w,b,c)\) sur la sortie \(y\).

2.3. Entrainement

Les réseaux de neurones sont principalement utilisés afin de calquer le comportement d'un système difficille à modéliser par d'autres méthodes. On dispose d'un certain nombre de vecteurs \(y^{(n)}\in\setR^M\), \(x^{(n)}\in\setR^{N}\), où \(n=1,...,K\). On aimerait bien trouver le vecteur des paramètres, \(\theta\), qui correspond le mieux à cette série d'entrées-sorties. On va alors entrainer le réseau de neurones défini par \(y=R_{\theta}(x)\) en utilisant une méthode d'optimisation non linéaire afin d'obtenir la solution de

\[\theta^* = \arg\min_{\theta} \sum_n \norme{ y^{(n)} - R_{\theta}\left( x^{(n)} \right) }^2\]