Large width limits of neural networks

Artificial neural networks are a class of models used in machine learning and inspired by biological neural networks. They are the core component of modern deep learning algorithms. Computation in artificial neural networks is usually organized into sequential layers of artificial neurons. The number of neurons in a layer is called the layer width. Theoretical analysis of artificial neural networks sometimes considers the limiting case in which layer width becomes large or infinite. This limit enables simple analytic statements to be made about neural network predictions, training dynamics, generalization, and loss surfaces. This wide-layer limit is also of practical interest, since finite-width neural networks often perform strictly better as layer width is increased.^[1]^[2]^[3]^[4]^[5]^[6]

Behavior of a neural network simplifies as it becomes infinitely wide. Left: a Bayesian neural network with two hidden layers, transforming a 3-dimensional input (bottom) into a two-dimensional output $(y_{1},y_{2})$ (top). Right: output probability density function $p(y_{1},y_{2})$ induced by the random weights of the network. Video: as the width of the network increases, the output distribution simplifies, ultimately converging to a neural network Gaussian process in the infinite width limit.
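
The convergence of the output distribution toward a Gaussian can be checked numerically. The following is a minimal sketch and is not taken from the cited works: it assumes a one-hidden-layer tanh network with Gaussian weights and a readout scaled as $\sim 1/{\sqrt {h}}$, and it uses the excess kurtosis of the output (which is zero for a Gaussian) as a crude measure of non-Gaussianity. The function names, the input `x`, and the sample counts are illustrative choices.

```python
import numpy as np

def sample_outputs(width, x, n_samples, rng):
    """Outputs at input x of n_samples independently initialized networks."""
    dim = x.shape[0]
    # Hidden weights with variance 1/dim (an assumed, common initialization).
    W = rng.normal(0.0, 1.0 / np.sqrt(dim), size=(n_samples, width, dim))
    # Readout weights with variance 1/width, i.e. the 1/sqrt(h) scaling.
    v = rng.normal(0.0, 1.0 / np.sqrt(width), size=(n_samples, width))
    hidden = np.tanh(W @ x)              # shape (n_samples, width)
    return np.sum(v * hidden, axis=1)    # one scalar output per network

def excess_kurtosis(y):
    """Fourth standardized moment minus 3; zero for a Gaussian."""
    y = y - y.mean()
    return np.mean(y**4) / np.mean(y**2) ** 2 - 3.0

rng = np.random.default_rng(0)
x = np.array([1.0, -0.5, 2.0])           # arbitrary 3-dimensional input

kurtosis_by_width = {}
for width in (1, 4, 16, 64):
    y = sample_outputs(width, x, n_samples=20000, rng=rng)
    kurtosis_by_width[width] = excess_kurtosis(y)
    print(width, kurtosis_by_width[width])
```

Under these assumptions the printed excess kurtosis should move toward zero as the width grows, up to Monte Carlo noise. This only probes the output at a single input; it does not test joint Gaussianity across inputs.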

Theoretical approaches based on a large width limit

The Neural Network Gaussian Process (NNGP) corresponds to the infinite width limit of Bayesian neural networks, and to the distribution over functions realized by non-Bayesian neural networks after random initialization.^[7]^[8]^[9]^[10]
The same underlying computations that are used to derive the NNGP kernel are also used in deep information propagation to characterize the propagation of information about gradients and inputs through a deep network.^[11] This characterization is used to predict how model trainability depends on architecture and initialization hyperparameters.
The Neural Tangent Kernel (NTK) describes the evolution of neural network predictions during gradient descent training. In the infinite width limit the NTK usually becomes constant, often allowing closed-form expressions for the function computed by a wide neural network throughout gradient descent training.^[12] The training dynamics essentially become linearized.^[13]
Mean-field limit analysis, when applied to neural networks with weight scaling of $\sim 1/h$ instead of $\sim 1/{\sqrt {h}}$ and large enough learning rates, predicts qualitatively distinct nonlinear training dynamics compared to the static linear behavior described by the fixed neural tangent kernel, suggesting alternative pathways for understanding infinite-width networks.^[14]^[15]
Catapult dynamics describe neural network training in the case where logits diverge to infinity as the layer width is taken to infinity, and capture qualitative properties of early training dynamics.^[16]
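
The statement that the neural tangent kernel becomes constant at large width can be illustrated on a toy problem. The sketch below is not from the cited works. It assumes a one-hidden-layer network $f(x)={\frac {1}{\sqrt {h}}}\sum _{j}v_{j}\tanh(w_{j}\cdot x)$ with unit-variance Gaussian weights, trains it by full-batch gradient descent on a squared loss, and measures how much the empirical kernel $JJ^{\top }$ (with $J$ the Jacobian of the outputs with respect to all parameters) moves during training. The data, learning rate, step count, and function names are hypothetical choices made for illustration.

```python
import numpy as np

def init_params(width, dim, rng):
    # Unit-variance weights; the 1/sqrt(width) factor lives in the forward pass.
    return rng.normal(size=(width, dim)), rng.normal(size=width)

def forward(params, X):
    W, v = params
    return np.tanh(X @ W.T) @ v / np.sqrt(W.shape[0])

def jacobian(params, X):
    """Rows: data points. Columns: all parameters (W flattened, then v)."""
    W, v = params
    h = W.shape[0]
    act = np.tanh(X @ W.T)                                     # (n, h)
    dact = 1.0 - act**2                                        # derivative of tanh
    dW = (dact * v)[:, :, None] * X[:, None, :] / np.sqrt(h)   # (n, h, dim)
    dv = act / np.sqrt(h)                                      # (n, h)
    return np.concatenate([dW.reshape(len(X), -1), dv], axis=1)

def empirical_ntk(params, X):
    J = jacobian(params, X)
    return J @ J.T

def train(params, X, y, lr, steps):
    """Full-batch gradient descent on 0.5 * sum((f(X) - y)**2)."""
    W, v = params
    for _ in range(steps):
        J = jacobian((W, v), X)
        grad = J.T @ (forward((W, v), X) - y)
        W = W - lr * grad[:W.size].reshape(W.shape)
        v = v - lr * grad[W.size:]
    return W, v

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))   # four toy inputs in three dimensions
y = rng.normal(size=4)        # arbitrary regression targets

ntk_change = {}
for width in (16, 256, 4096):
    params = init_params(width, X.shape[1], rng)
    ntk_before = empirical_ntk(params, X)
    params = train(params, X, y, lr=0.02, steps=200)
    ntk_after = empirical_ntk(params, X)
    # Relative Frobenius-norm change of the kernel over training.
    ntk_change[width] = np.linalg.norm(ntk_after - ntk_before) / np.linalg.norm(ntk_before)
    print(width, ntk_change[width])
```

With this $\sim 1/{\sqrt {h}}$ scaling, the relative change of the kernel should shrink as the width grows, which is the sense in which training becomes approximately linear. The sketch does not cover the $\sim 1/h$ mean-field scaling or the large learning rate regime of catapult dynamics.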

References

[1] Novak, Roman; Bahri, Yasaman; Abolafia, Daniel A.; Pennington, Jeffrey; Sohl-Dickstein, Jascha (2018-02-15). "Sensitivity and Generalization in Neural Networks: an Empirical Study". International Conference on Learning Representations. arXiv:1802.08760. Bibcode:2018arXiv180208760N.
[2] Canziani, Alfredo; Paszke, Adam; Culurciello, Eugenio (2016-11-04). "An Analysis of Deep Neural Network Models for Practical Applications". arXiv:1605.07678. Bibcode:2016arXiv160507678C.
[3] Novak, Roman; Xiao, Lechao; Lee, Jaehoon; Bahri, Yasaman; Yang, Greg; Abolafia, Dan; Pennington, Jeffrey; Sohl-Dickstein, Jascha (2018). "Bayesian Deep Convolutional Networks with Many Channels are Gaussian Processes". International Conference on Learning Representations. arXiv:1810.05148. Bibcode:2018arXiv181005148N.
[4] Neyshabur, Behnam; Li, Zhiyuan; Bhojanapalli, Srinadh; LeCun, Yann; Srebro, Nathan (2019). "Towards understanding the role of over-parametrization in generalization of neural networks". International Conference on Learning Representations. arXiv:1805.12076. Bibcode:2018arXiv180512076N.
[5] Lawrence, Steve; Giles, C. Lee; Tsoi, Ah Chung (1996). "What size neural network gives optimal generalization? convergence properties of backpropagation". CiteSeerX 10.1.1.125.6019.
[6] Bartlett, P.L. (1998). "The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network". IEEE Transactions on Information Theory. 44 (2): 525–536. doi:10.1109/18.661502. ISSN 1557-9654.
[7] Neal, Radford M. (1996), "Priors for Infinite Networks", Bayesian Learning for Neural Networks, Lecture Notes in Statistics, vol. 118, Springer New York, pp. 29–53, doi:10.1007/978-1-4612-0745-0_2, ISBN 978-0-387-94724-2
[8] Lee, Jaehoon; Bahri, Yasaman; Novak, Roman; Schoenholz, Samuel S.; Pennington, Jeffrey; Sohl-Dickstein, Jascha (2017). "Deep Neural Networks as Gaussian Processes". International Conference on Learning Representations. arXiv:1711.00165. Bibcode:2017arXiv171100165L.
[9] G. de G. Matthews, Alexander; Rowland, Mark; Hron, Jiri; Turner, Richard E.; Ghahramani, Zoubin (2017). "Gaussian Process Behaviour in Wide Deep Neural Networks". International Conference on Learning Representations. arXiv:1804.11271. Bibcode:2018arXiv180411271M.
[10] Hron, Jiri; Bahri, Yasaman; Novak, Roman; Pennington, Jeffrey; Sohl-Dickstein, Jascha (2020). "Exact posterior distributions of wide Bayesian neural networks". ICML 2020 Workshop on Uncertainty & Robustness in Deep Learning. arXiv:2006.10541.
[11] Schoenholz, Samuel S.; Gilmer, Justin; Ganguli, Surya; Sohl-Dickstein, Jascha (2016). "Deep information propagation". International Conference on Learning Representations. arXiv:1611.01232.
[12] Jacot, Arthur; Gabriel, Franck; Hongler, Clement (2018). "Neural tangent kernel: Convergence and generalization in neural networks". Advances in Neural Information Processing Systems. arXiv:1806.07572.
[13] Lee, Jaehoon; Xiao, Lechao; Schoenholz, Samuel S.; Bahri, Yasaman; Novak, Roman; Sohl-Dickstein, Jascha; Pennington, Jeffrey (2020). "Wide neural networks of any depth evolve as linear models under gradient descent". Journal of Statistical Mechanics: Theory and Experiment. 2020 (12): 124002. arXiv:1902.06720. Bibcode:2020JSMTE2020l4002L. doi:10.1088/1742-5468/abc62b. S2CID 62841516.
[14] Mei, Song; Montanari, Andrea; Nguyen, Phan-Minh (2018-04-18). A Mean Field View of the Landscape of Two-Layers Neural Networks. OCLC 1106295873.
[15] Nguyen, Phan-Minh; Pham, Huy Tuan (2020). "A Rigorous Framework for the Mean Field Limit of Multilayer Neural Networks". arXiv:2001.11443 [cs.LG].
[16] Lewkowycz, Aitor; Bahri, Yasaman; Dyer, Ethan; Sohl-Dickstein, Jascha; Gur-Ari, Guy (2020). "The large learning rate phase of deep learning: the catapult mechanism". arXiv:2003.02218 [stat.ML].
