Vishal Triloknath Jaiswar
Although gradient descent optimization algorithms have become immensely popular, they are often treated as black-box optimizers because practical explanations of their strengths and weaknesses are scarce. This article aims to give readers intuitive insight into the behavior of the most widely used algorithms so that these methods can be applied effectively. In this overview, we examine the main variants of gradient descent, discuss the challenges they pose, introduce the most prominent optimization algorithms, review parallel and distributed architectures for training, and survey additional strategies for optimizing gradient descent. Together, these sections demystify gradient descent and show how to unleash its full power.
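For readers new to the topic, the minimal sketch below illustrates the basic update rule that every method surveyed in this overview builds on: repeatedly moving the parameters against the gradient of the objective. The toy objective, variable names, and hyperparameter values are illustrative assumptions, not taken from the article.

# Minimal sketch: vanilla gradient descent on a toy quadratic objective,
# showing the update theta <- theta - lr * grad(theta).
def loss(theta):
    # J(theta) = (theta - 3)^2, a toy objective with its minimum at theta = 3
    return (theta - 3.0) ** 2

def grad(theta):
    # dJ/dtheta = 2 * (theta - 3)
    return 2.0 * (theta - 3.0)

theta = 0.0   # initial parameter value (assumed)
lr = 0.1      # learning rate / step size (assumed)

for step in range(100):
    theta -= lr * grad(theta)   # gradient descent update

print(f"theta = {theta:.4f}, loss = {loss(theta):.6f}")  # theta approaches 3

The variants discussed in the article differ mainly in how much data is used to estimate the gradient at each step and in how the learning rate is adapted over time and across parameters.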
Vishal Triloknath Jaiswar, “Unleashing the Power of Gradient Descent: Dive into the World of Optimization Algorithms”, REST Journal on Banking, Accounting and Business, 2(2), June 2023: 56-63. DOI: https://doi.org/10.46632/jbab/2/2/7