Why More Is More (in Artificial Intelligence)

Deep neural networks (DNNs) have profoundly transformed the landscape of machine learning, to the point where they have become almost synonymous with artificial intelligence itself. Yet their rise would have been unimaginable without their partner in crime: stochastic gradient descent (SGD).

SGD, together with the optimizers derived from it, forms the core of most modern self-learning algorithms. At its heart, the idea is straightforward: compute the task's loss on the training data, determine the gradients of that loss with respect to the network's parameters, and then adjust the parameters in the direction that reduces the loss (a minimal code sketch of this loop follows below).

It sounds simple, but in practice it has proven immensely powerful: paired with a sufficiently expressive architecture, SGD can find solutions for all kinds of complex problems and training data. It is particularly good at finding parameter sets that make the network fit the training data perfectly, a situation known as the **interpolation regime**. But under which conditions are neural networks thought to **generalize well**, meaning that they perform well on unseen test data?

[Read More](https://towardsdatascience.com/why-more-is-more-in-deep-learning-b28d7cedc9f5)
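
To make the SGD update described above concrete, here is a minimal sketch of the loss–gradient–update loop. It uses a toy linear model and a mean-squared-error loss, both chosen purely for illustration; they are not part of the article above.

```python
import numpy as np

# Illustrative stochastic gradient descent on a toy problem:
# a linear model y_hat = X @ w trained with mean-squared-error loss.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 10))                # toy training inputs
true_w = rng.normal(size=10)
y = X @ true_w + 0.1 * rng.normal(size=256)   # noisy training targets

w = np.zeros(10)     # model parameters, updated in place
lr = 0.1             # learning rate (step size)
batch_size = 32

for step in range(1000):
    # Draw a random minibatch -- the "stochastic" part of SGD.
    idx = rng.integers(0, len(X), size=batch_size)
    Xb, yb = X[idx], y[idx]

    # 1) Compute the loss on the minibatch.
    preds = Xb @ w
    loss = np.mean((preds - yb) ** 2)

    # 2) Compute the gradient of the loss with respect to the parameters.
    grad = (2.0 / batch_size) * Xb.T @ (preds - yb)

    # 3) Step the parameters against the gradient to reduce the loss.
    w -= lr * grad
```

In a real deep-learning setting the same three steps apply; only the model, the loss, and the gradient computation are delegated to an automatic-differentiation framework instead of being written out by hand.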