# Machine Learning Vocab

## Metrics

Term | Meaning |
--- | --- |
precision | Of the examples the model labels positive, the fraction that are actually positive: $\mathrm{TP} / (\mathrm{TP} + \mathrm{FP})$ (see the sketch after this table). |
recall | Of the examples that are actually positive, the fraction the model labels positive: $\mathrm{TP} / (\mathrm{TP} + \mathrm{FN})$. |
ROC curve | Receiver Operating Characteristic curve. A plot of the true positive rate (recall) against the false positive rate as the classification threshold is varied. |
AUC | Area under the ROC curve. The probability that the model will rank a random positive example higher than a random negative example. |
Precision-Recall Curve | A plot of precision against recall as the classification threshold is varied. Often more informative than the ROC curve when the classes are heavily imbalanced. |
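
The metric definitions above can be made concrete with a small sketch. This is an illustrative NumPy snippet (the function names are made up for this example, not taken from any library): precision and recall come straight from the TP/FP/FN counts, and AUC is computed directly from its ranking interpretation.

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = positive, 0 = negative)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return tp / (tp + fp), tp / (tp + fn)

def auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    correct = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return correct / (len(pos) * len(neg))

y_true = np.array([1, 0, 1, 1, 0])
scores = np.array([0.9, 0.4, 0.6, 0.3, 0.2])
print(precision_recall(y_true, (scores >= 0.5).astype(int)))  # (1.0, 0.666...)
print(auc_by_ranking(y_true, scores))                          # 0.8333...
```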

## Architecture

Term | Meaning |
--- | --- |
softmax | Sends each input through an exponential function and then normalizes the results to add up to one. In other words: $$ y_i = \frac{e^{\beta x_i}}{\sum_{j=1}^N e^{\beta x_j}} $$ where $\beta$ is a hyper-parameter (a numerically stable implementation is sketched after this table). This is typically used in the final layer of a neural network to convert vague "confidence" scores over mutually exclusive categories into probabilities. It can also be used for embeddings. |
autoencoder | A network trained to reconstruct its own input after squeezing it through a lower-dimensional bottleneck; the encoder maps the input to a compact latent code and the decoder maps the code back to the input space. |
kernel | In a convolutional layer, the small learned weight array (filter) that is slid across the input to produce one output channel. |
stride | The distance between the centers of two adjacent applications of a kernel; a stride of 2 roughly halves the spatial resolution of the output. |
dilation | The spacing between the elements of a kernel; a dilation of 2 leaves a one-element gap between kernel taps, enlarging the receptive field without adding parameters. |
width | The number of channels in a layer. |
depth | The number of layers in a network. |
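
To make the softmax formula concrete, here is a minimal NumPy sketch (it keeps the $\beta$ hyper-parameter from the definition above; subtracting the maximum before exponentiating is a standard numerical-stability trick that cancels out of the ratio, so it does not change the result).

```python
import numpy as np

def softmax(x, beta=1.0):
    """Softmax with an inverse-temperature hyper-parameter beta."""
    z = beta * np.asarray(x, dtype=float)
    z = z - z.max()            # numerical stability; cancels in the ratio
    e = np.exp(z)
    return e / e.sum()

print(softmax([2.0, 1.0, 0.1]))           # ~[0.659, 0.242, 0.099], sums to 1
print(softmax([2.0, 1.0, 0.1], beta=5.0)) # larger beta -> sharper distribution
```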

## Activation Functions

Term | Meaning |
--- | --- |
Sigmoid | $ y = \frac{e^x}{1+e^x} $ (this and the other activations below are sketched in code after this table) |
Softplus | $ y = \log(1+e^x) $ |
ReLU | Rectified linear units: $ y = \max(0, x) $ |
Noisy ReLU | A ReLU with zero-mean Gaussian noise added to its input during training: $ y = \max(0, x + \epsilon) $ with $ \epsilon \sim \mathcal{N}(0, \sigma^2) $. |
Leaky ReLU | $ y = \max(0.01x, x) $ |
ELU | Exponential linear units: $ y = \begin{cases} x & x \geq 0 \\ a(e^x - 1) & x \lt 0 \end{cases} $, where $a$ is a hyperparameter |
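
Below is a small NumPy sketch of the activation functions in this table (the parameter defaults are illustrative; the leaky slope of 0.01 and the ELU $a$ match the definitions above).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # equivalent to e^x / (1 + e^x)

def softplus(x):
    return np.log1p(np.exp(x))             # log(1 + e^x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    return np.maximum(slope * x, x)

def elu(x, a=1.0):
    return np.where(x >= 0, x, a * (np.exp(x) - 1.0))

x = np.linspace(-3.0, 3.0, 7)
for f in (sigmoid, softplus, relu, leaky_relu, elu):
    print(f.__name__, np.round(f(x), 3))
```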

## Data Augmentation

Term | Meaning |
--- | --- |
TODO | TODO |

## Training

Term | Meaning |
--- | --- |
momentum | While performing gradient descent, we can use "momentum" (an exponential moving average of past gradients, rather than a single batch's gradient) to help speed up convergence, since it makes the steps less noisy (see the update sketch after this table). |
Nesterov momentum | A variant of momentum that evaluates the gradient at the "look-ahead" point the momentum step is about to move the parameters to, rather than at the current parameters. |
dampening | A factor used in some momentum implementations that scales down the current gradient's contribution to the momentum buffer, e.g. $ v \leftarrow \mu v + (1 - \mathrm{dampening})\, g $. |
stochastic depth | Dropping layers (typically whole residual blocks) randomly during training and using the full network at test time. |
batch normalization | Normalizing each channel to a fixed mean (typically 0) and variance (typically 1). Ideally the statistics would be computed over the entire training set, but this is infeasible with stochastic gradient descent; instead each mini-batch's statistics are used during training and a running (moving-average) estimate is kept for use at test time. |
L2 Regularization | Summing the squares of all parameters and adding this to the loss. This incentivizes the network to keep its weights small, even if it means slightly increasing training loss. In practice it is often implemented directly in the parameter update as "weight decay", rather than by differentiating the penalty during backpropagation. |
L1 Regularization | Summing the absolute values of all parameters and adding this to the loss. This also incentivizes small weights, and tends to push many of them exactly to zero (sparsity). |
saturated | A unit whose input sits in a flat region of its activation function (e.g. a sigmoid far from zero), so its gradient is nearly zero and it learns very slowly. |
stochastic gradient descent | Updating the parameters using the gradient estimated from a single randomly chosen training example (or a very small sample) rather than the full training set. |
mini-batch gradient descent | Updating the parameters using the gradient averaged over a small random batch of examples; the usual compromise between full-batch gradient descent and single-example SGD. |
TODO | TODO |
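
The momentum and L2 (weight-decay) entries above combine into a single toy update rule. This is a minimal sketch, not any particular library's optimizer; the hyper-parameter values are illustrative.

```python
import numpy as np

def sgd_step(params, grads, velocity, lr=0.1, momentum=0.9, weight_decay=1e-4):
    """One SGD update with momentum and L2 regularization applied as weight decay.

    `velocity` accumulates an exponentially decaying sum of past gradients;
    adding weight_decay * params to the gradient is equivalent to an
    L2 penalty of (weight_decay / 2) * ||params||^2 in the loss.
    """
    grads = grads + weight_decay * params   # fold the L2 penalty into the gradient
    velocity = momentum * velocity + grads  # accumulate momentum
    params = params - lr * velocity         # take the step
    return params, velocity

# Toy problem: minimize f(w) = ||w||^2 / 2, whose gradient is simply w.
w = np.array([5.0, -3.0])
v = np.zeros_like(w)
for _ in range(100):
    w, v = sgd_step(w, grads=w, velocity=v)
print(w)  # approaches the minimum at the origin
```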