Deep Learning Essential Terms

Essential terms for understanding deep learning research papers, tutorials and textbooks.

Term	Description
Jacobian matrix	The matrix containing all partial derivatives of a function whose input and output are both vectors
Hessian matrix	The matrix containing all second-order partial derivatives of a scalar-valued function; equivalently, the Jacobian of the gradient
First-order optimization algorithms	Optimization algorithms that use only the gradient, such as gradient descent
Second-order optimization algorithms	Optimization algorithms that also use the Hessian matrix, such as Newton’s method
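Constrained Optimization	Finding the maximal or minimal value of f(x) for values of x restricted to some set S, rather than over all possible values of x
Karush-Kuhn-Tucker (KKT) Approach	A general solution to constrained optimization that makes use of the generalized Lagrange function (generalized Lagrangian)
KKT Conditions	A simple set of properties that describe the optimal points of constrained optimization problems. They are necessary, but not always sufficient, conditions for a point to be optimal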
Hyperparameters	Settings of a machine learning algorithm that must be determined outside the learning algorithm itself
Accuracy	Proportion of examples for which the model produces the correct output
Error rate	Proportion of examples for which the model produces an incorrect output
Design matrix	Matrix containing a different example in each row
Underfitting	The model cannot obtain a sufficiently low error on the training set
Overfitting	The gap between training error and test error is too large
Capacity	The model’s ability to fit a wide variety of functions
Hypothesis space	The set of functions that the learning algorithm is allowed to select as being the solution
Representational capacity	The family of functions, specified by the model, that the learning algorithm can choose from when varying the parameters to reduce the training objective
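Occam’s razor	Among competing hypotheses that explain the known observations equally well, one should choose the simplest one
Vapnik-Chervonenkis (VC) dimension	Measures the capacity of a binary classifier: the largest number m for which there exists a set of m different points x that the classifier can label arbitrarily
Nearest neighbor Regression	Non-parametric model that predicts from the training point closest to the query point, with closeness typically measured by the L2 norm of their difference
Parametric Models	Models that learn a function described by a parameter vector whose size is finite and fixed before any data is observed, such as linear regression. A parametric model with less than optimal capacity will asymptote to an error value that exceeds the Bayes error
Non-parametric Models	Models with no such limitation on the parameters, such as nearest neighbour regression. More data yields better generalization until the best possible error is achieved
Nearest neighbour regression	It simply stores X and y from the training set. When given a test point x, it looks up the nearest entry in the training set and returns the associated target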
Bayes error	The error incurred by an oracle that knows the true probability distribution generating the data and makes predictions from that true distribution p(x, y)
Generalization error	The expected error on new, previously unseen inputs. The expected generalization error can never increase as the number of training examples increases
No Free Lunch Theorem	Averaged over all possible data generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points
Weight decay	A regularizer that penalizes large weights, with a strength controlled by a coefficient $\lambda$. A large $\lambda$ leads to underfitting, a medium $\lambda$ can be just right, and a small $\lambda$ leads to overfitting
Regularization	We can regularize a model by adding a penalty, called a regularizer, to the cost function. There are other ways too, and a more general definition is: regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error
Hyperparameter	A setting that we can use to control the behaviour of the learning algorithm. Sometimes a setting must be a hyperparameter because it is not appropriate to learn it on the training set. For example, a hyperparameter that controls model capacity, if learned on the training set, would always be chosen to maximize capacity, which results in overfitting
Validation set	Examples that the training algorithm does not observe. This is not the test set. It is used to guide the selection of our hyperparameters. Since it is used to “train” the hyperparameters, the validation set error will underestimate the generalization error though typically by a smaller amount than the training error
Test set	This is the set we use to estimate our generalization error after all our hyperparameter optimization is complete. If the test set is small, it can be problematic as this implies statistical uncertainty around the estimated test error
K-fold Cross Validation	We partition the data into k non-overlapping subsets. On trial i, the i-th subset is used as the test set and the rest of the data as the training set. The test error is then estimated by averaging the test error across the k trials. This is computationally expensive
Point estimator or statistic	Point estimation is the attempt to provide the single “best” prediction of some quantity of interest. A point estimator or statistic is any function of the data, which is assumed to be drawn i.i.d. Since the data is drawn from a random process, any function of the data is random, and therefore the point estimator is a random variable
Function estimator	The estimation of the relationship between input and output variables. It can be viewed as a point estimator in function space
Bias	$bias(\hat\theta_m) = \mathbf{E}(\hat\theta_m) - \theta$. The estimator $\hat\theta_m$ is unbiased if $bias(\hat\theta_m) = 0$
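
The bias definition in the last row can be checked numerically. The following is a minimal sketch, not a definitive recipe: it assumes Gaussian data with a known true variance of 1 and uses only Python's standard library. The helper name `estimate_bias`, the sample size `m = 5`, the number of trials and the seed are all illustrative assumptions that do not come from the table. It compares the variance estimator that divides by m (`statistics.pvariance`) with the one that divides by m - 1 (`statistics.variance`).

```python
import random
import statistics


def estimate_bias(estimator, true_value, m, trials, rng):
    """Monte Carlo approximation of E[estimator(sample)] - true_value.

    Each trial draws m i.i.d. samples from a standard normal distribution
    (mean 0, variance 1) and applies the estimator to them.
    """
    estimates = []
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(m)]
        estimates.append(estimator(sample))
    return statistics.fmean(estimates) - true_value


rng = random.Random(0)   # fixed seed so the run is repeatable
m = 5                    # examples per dataset (illustrative choice)
trials = 20000           # number of simulated datasets (illustrative choice)
true_variance = 1.0

# pvariance divides by m, variance divides by m - 1.
bias_biased = estimate_bias(statistics.pvariance, true_variance, m, trials, rng)
bias_unbiased = estimate_bias(statistics.variance, true_variance, m, trials, rng)

print("bias of divide-by-m estimator:      ", round(bias_biased, 3))
print("bias of divide-by-(m - 1) estimator:", round(bias_unbiased, 3))
```

In theory the divide-by-m estimator has bias $-\sigma^2/m$, which is $-0.2$ under these assumptions, while the divide-by-(m - 1) estimator is unbiased. The simulated values should land close to those numbers, with some Monte Carlo noise.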
