# Machine Learning Vocab

Term | Meaning |

Metrics | |

precision | TODO |

recall | TODO |

ROC curve | Receiver Operating Characteristic Curve [TODO] |

AUC | Area under the ROC curve. Equivalently, the probability that the model ranks a randomly chosen positive example higher than a randomly chosen negative example. |

Precision-Recall Curve | TODO |
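
The ranking interpretation of AUC given above can be checked directly on a small example. The following is a minimal sketch, not a reference implementation: the function name `auc_by_ranking`, the half-credit convention for tied scores, and the toy scores and labels are all assumptions made for illustration.

```python
import itertools


def auc_by_ranking(scores, labels):
    """Fraction of (positive, negative) pairs in which the positive scores higher.

    Tied scores count as half a win (an assumed, though common, convention).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p, n in itertools.product(positives, negatives):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up model scores and ground-truth labels (1 = positive, 0 = negative).
scores = [0.9, 0.8, 0.35, 0.6, 0.2]
labels = [1, 1, 1, 0, 0]

# 5 of the 6 (positive, negative) pairs are ranked correctly.
print(auc_by_ranking(scores, labels))
```

This pairwise loop is quadratic in the number of examples, so it is only meant to make the definition concrete.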

Architecture | |

softmax | Sends each input through an exponential function and then normalizes the results to add up to one. In other words: $$ y_i = \frac{e^{\beta x_i}}{\sum_{j=1}^N e^{\beta x_j}} $$ where $\beta$ is a hyperparameter. This is typically used as the final layer in a neural network to convert unnormalized "confidence" scores for each category into probabilities. |

autoencoder | TODO |

kernel | TODO |

stride | The distance between the centers of two adjacent applications of a kernel. |

dilation | TODO |

width | The number of channels in a layer. |

depth | The number of layers in a network. |
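
The softmax entry above can be sketched in a few lines of plain Python. This is a hedged illustration only: the function name, the default `beta=1.0`, and the assumption that `beta` is positive are choices made here, and subtracting the maximum input is a standard numerical-stability trick that leaves the result unchanged.

```python
import math


def softmax(xs, beta=1.0):
    """Exponentiate each input (scaled by beta) and normalize to sum to one."""
    # Shifting by the maximum avoids overflow in exp() for beta > 0;
    # the shift cancels between numerator and denominator.
    m = max(xs)
    exps = [math.exp(beta * (x - m)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical unnormalized scores for three categories.
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

A larger `beta` sharpens the distribution toward the largest input, while a `beta` near zero flattens it toward uniform.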

Activation Functions | |

Sigmoid | $ y = \frac{e^x}{1+e^x} $ |

Softplus | $ y = \log(1+e^x) $ |

ReLU | Rectified linear units: $ y = \max(0, x) $ |

Noisy ReLU | TODO |

Leaky ReLU | $ y = \max(0.01x, x) $ |

ELU | Exponential linear units: $ y = \begin{cases} x & x \geq 0 \\ a(e^x - 1) & x \lt 0 \end{cases} $, where $a$ is a hyperparameter |
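
The activation functions in this section can be written as scalar Python functions for quick experimentation. This is a rough sketch under stated assumptions: the function names and default parameter values are chosen here for illustration, ELU is written in its standard form $a(e^x - 1)$ for negative inputs, and the sigmoid and softplus are rearranged algebraically only to avoid overflow.

```python
import math


def sigmoid(x):
    """e^x / (1 + e^x), branched so exp() never sees a large positive argument."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def softplus(x):
    """log(1 + e^x), rewritten as max(x, 0) + log(1 + e^-|x|) for stability."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def relu(x):
    """max(0, x)."""
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    """max(slope * x, x); assumes 0 < slope < 1."""
    return max(slope * x, x)


def elu(x, a=1.0):
    """x for x >= 0, otherwise a * (e^x - 1)."""
    return x if x >= 0 else a * math.expm1(x)


for f in (sigmoid, softplus, relu, leaky_relu, elu):
    print(f.__name__, [round(f(x), 4) for x in (-2.0, 0.0, 2.0)])
```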

Data Augmentation | |

TODO | TODO |

Training | |

momentum | TODO |

Nesterov momentum | TODO |

dampening | TODO |

stochastic depth | Randomly dropping entire layers during training. |

batch normalization | Fixing the mean (typically 0) and variance (typically 1) of each layer's inputs. Ideally this would be done over the entire training set, but that is infeasible with stochastic gradient descent, so the statistics are instead computed for each mini-batch. [TODO: is this over all channels or per channel?] |

internal covariate shift | When parameter changes in the first layer alter the distribution of inputs to all subsequent layers (and likewise for every layer but the last). This slows down training, because the function each layer is optimizing keeps changing, making it hard to find "the" optimum. |

L2 Regularization | At each update, you multiply each weight by (1-$\beta$), where $\beta$ is a small hyperparameter (this is also called weight decay). This prevents overfitting by making each weight have to "work harder to justify its existence". |

L1 Regularization | You subtract a small hyperparameter from every positive weight and add it to every negative weight. If this changes the weight's sign, you set the weight to zero. Like L2 regularization, this prevents overfitting. It has the added benefit of zeroing out some "synapses", thereby simplifying the model. |

saturated | TODO |

stochastic gradient descent | TODO |

mini-batch gradient descent | TODO |
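
The two weight-shrinking rules described earlier in this section (scaling every weight by $1-\beta$, and moving every weight toward zero by a fixed amount) can be compared side by side. The sketch below shows only the shrinkage step, without the gradient update that would accompany it in real training; the function names `multiplicative_decay` and `soft_threshold`, the parameter name `step`, and the sample weights are assumptions made for illustration.

```python
def multiplicative_decay(weights, beta):
    """Scale every weight by (1 - beta); weights shrink but rarely hit exactly zero."""
    return [(1.0 - beta) * w for w in weights]


def soft_threshold(weights, step):
    """Move every weight toward zero by `step`; weights that would cross zero become zero."""
    shrunk = []
    for w in weights:
        if w > step:
            shrunk.append(w - step)
        elif w < -step:
            shrunk.append(w + step)
        else:
            # Subtracting (or adding) `step` would flip the sign, so clamp to zero.
            shrunk.append(0.0)
    return shrunk


# Hypothetical weights: two large, two small, one already zero.
weights = [0.5, -0.5, 0.05, -0.05, 0.0]
print(multiplicative_decay(weights, beta=0.1))
print(soft_threshold(weights, step=0.1))
```

The first rule shrinks large weights the most in absolute terms, while the second removes the same amount from every weight and therefore eliminates the small ones entirely.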

TODO |