Gradient Descent is a Technique for Learning to Learn
DOI: https://doi.org/10.18034/ajhal.v5i2.578

Keywords: Gradient descent, Long Short-Term Memory (LSTM), Gauss-Newton matrix, Machine learning, Recurrent neural network (RNN)

Abstract
In machine learning, the move from hand-designed features to learned features has been a great success. Despite this, optimization algorithms are still designed by hand. In this study, we show how the design of an optimization algorithm can itself be cast as a learning problem, allowing the algorithm to automatically learn to exploit structure in the problems of interest. Our learned optimizers, implemented by LSTMs, outperform generic, hand-designed competitors on the tasks for which they are trained, and they also generalize well to new tasks with similar structure. We demonstrate this on a range of tasks, including simple convex problems, training neural networks, and styling images with neural art.
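To make the idea in the abstract concrete, the sketch below shows one way a learned optimizer of this kind can be set up: an LSTM reads an optimizee's gradients coordinate-wise and proposes parameter updates, so the update rule itself is something that gets trained. This is a minimal illustration in PyTorch under our own assumptions (the names LSTMOptimizer and hidden_size, the toy quadratic objective, and the coordinate-wise treatment are illustrative choices, not the authors' released code).

# Minimal sketch of a learned optimizer: an LSTM maps gradient coordinates to
# additive parameter updates. Illustrative only; not the paper's implementation.
import torch
import torch.nn as nn


class LSTMOptimizer(nn.Module):
    """Learned optimizer: a per-coordinate LSTM that proposes updates from gradients."""

    def __init__(self, hidden_size: int = 20):
        super().__init__()
        self.lstm = nn.LSTMCell(1, hidden_size)   # input: one gradient coordinate
        self.output = nn.Linear(hidden_size, 1)   # output: one update coordinate

    def forward(self, grad, state):
        # grad: (num_params, 1) column of gradient coordinates, treated coordinate-wise
        h, c = self.lstm(grad, state)
        update = self.output(h)                   # proposed additive update
        return update, (h, c)


# Usage sketch on a toy quadratic f(theta) = ||theta||^2.
opt_net = LSTMOptimizer()
theta = torch.randn(5, 1, requires_grad=True)
state = (torch.zeros(5, 20), torch.zeros(5, 20))

for _ in range(10):
    loss = (theta ** 2).sum()
    grad, = torch.autograd.grad(loss, theta)
    update, state = opt_net(grad, state)
    theta = (theta + update).detach().requires_grad_(True)

# In the setup described by the abstract, the weights of opt_net itself would be
# trained by backpropagating the optimizee's losses accumulated over such an
# unrolled trajectory, so the optimizer learns to exploit problem structure.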