Glossary – Data All The Way

Statistics and general machine learning

Term	Definition
ANOVA	A statistical test used to compare the means of three or more groups.
Bar chart	A graph showing the frequencies of different categories, with the horizontal axis representing the categories and the vertical axis representing the frequencies.
Beta distribution	A continuous probability distribution defined on the interval from 0 to 1, often used in Bayesian statistics.
Between-subjects design	A research design in which each subject is measured under only one condition, so that different conditions are compared using different groups of subjects.
Binomial distribution	A discrete probability distribution for the number of successes in a fixed number of independent trials, each with only two possible outcomes (such as heads or tails) and the same probability of success.
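Box plot	A graph showing the distribution of a set of data, with a box representing the middle 50% of the values and whiskers extending to the minimum and maximum values or, in a common variant, to the most extreme values within 1.5 times the interquartile range of the box, with points beyond that drawn individually as outliers.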
Central limit theorem	The statistical principle that states that the distribution of sample means will be approximately normal when the sample size is sufficiently large, regardless of the distribution of the population from which the samples are drawn, provided that the population has finite variance.
Chi-square test	A statistical test used to determine whether two categorical variables are related.
Cluster analysis	A statistical technique used to group data into clusters or groups based on similarity.
Cluster sampling	A sampling method in which the population is divided into groups or clusters, a random sample of clusters is selected, and all (or a random sample of) members of the selected clusters are studied.
Confidence interval	A range of values that is likely to contain the true value of a population parameter, with a certain level of confidence.
Confirmatory factor analysis	A statistical technique used to test whether a hypothesized factor structure, specified in advance, fits a set of observed variables.
Continuous variable	A variable that can take on any value within a given range.
Correlation	A statistical relationship between two variables, measured by the strength and direction of the linear relationship between them.
Cox proportional hazards model	A statistical model used to estimate the risk of an event occurring over time, taking into account the effects of multiple covariates.
Discrete variable	A variable that can only take on specific, distinct values.
Discriminant analysis	A statistical technique used to classify observations into different groups based on their characteristics.
Event history analysis	A statistical technique used to analyze data on the timing and occurrence of events, such as transitions between different states or stages.
Exponential distribution	A continuous probability distribution that represents the time between events occurring independently at a constant average rate.
F-distribution	A continuous probability distribution used in hypothesis testing to compare the variances of two or more groups.
Factor analysis	A statistical technique used to identify the underlying structure or patterns in a set of correlated variables.
Factorial ANOVA	A statistical test used to analyze the effects of two or more factors on a response variable, assuming that the data are normally distributed and the variances are equal.
Factorial design	A research design in which multiple treatment conditions are combined in a single study, allowing for the analysis of main effects and interactions.
Fixed effect	A variable in a statistical model that is considered to be a fixed part of the model, and is not allowed to vary across different levels or groups.
Frequency distribution	A tabular summary of the data showing the number of occurrences of each unique value or range of values.
Friedman test	A nonparametric statistical test, based on ranks, used to compare three or more related groups, where the same subjects are measured under multiple conditions or at multiple time points and the data are not normally distributed.
Generalizability	The extent to which the results of a study can be generalized to a larger population.
Generalized estimating equations	A statistical technique used to estimate the parameters of a statistical model when the data are correlated or unbalanced.
Generalized linear mixed model	A statistical model that extends the generalized linear model to allow for both fixed and random effects.
Generalized linear model	A statistical model that extends the linear regression model to allow for non-normal distributions of the response variable.
Heteroscedasticity	A violation of the assumption of homogeneity of variance, where the variance of the errors (residuals) is not constant across the values of the independent variables.
Histogram	A graph showing the frequency distribution of a set of data, with the horizontal axis representing the values and the vertical axis representing the frequencies.
Hyperparameter	A parameter that is set before training a machine learning model, influencing the model's behavior and performance.
Hyperparameter Search	The process of finding the optimal hyperparameter values for a machine learning model through methods like grid search, random search, or Bayesian optimization.
Hyperparameter Tuning	The process of finding the best hyperparameter values for a machine learning model, often done using techniques like grid search or random search.
Hypothesis testing	A statistical procedure used to evaluate the validity of a hypothesis or claim about a population, by comparing the observed data to what would be expected under the null hypothesis.
Imbalanced Class Handling	Techniques used to deal with imbalanced class distributions in classification tasks, such as class weighting or resampling.
Interquartile range	The difference between the upper and lower quartiles of a set of data.
Interval scale	A scale of measurement in which the categories have a numerical order and the intervals between the categories are equal, but there is no true zero point.
Item response theory	A statistical theory used to model the relationship between an individual's ability and their performance on a test or assessment.
Kruskal-Wallis test	A nonparametric statistical test used to compare the medians of three or more groups, when the data are not normally distributed or the variances are not equal.
Kurtosis	A measure of the peakedness or flatness of a distribution, indicating whether it has a heavy or light tail.
Latent class analysis	A statistical technique used to identify unobserved or latent classes or groups within a population based on observed characteristics.
Latent growth curve model	A statistical model that estimates individual differences in the rate and level of change over time.
Latent semantic analysis	A statistical technique used to analyze the relationships between words and documents in a text corpus.
Line chart	A graph showing the trend or pattern in a set of data over time, with the horizontal axis representing the time and the vertical axis representing the values.
Logistic regression	A statistical analysis used to predict the probability of a binary outcome, such as success or failure.
Longitudinal data analysis	A statistical technique used to analyze data that are collected at multiple time points from the same subjects.
MANOVA	A statistical test used to compare the means of two or more groups on multiple dependent variables simultaneously, assuming that the data are multivariate normal and the covariance matrices are equal across groups.
Maximum likelihood estimation	A statistical technique used to estimate the parameters of a statistical model by choosing the parameter values that maximize the likelihood of the observed data.
McNemar test	A statistical test used to compare the proportions of two groups on a dichotomous outcome, when the data are paired or matched.
Mean	The average of a set of numbers, calculated by adding all the numbers together and dividing by the number of items in the set.
Mean Absolute Error (MAE)	A loss function used in regression tasks, calculated as the average absolute difference between predicted and actual values.
Mean Squared Error (MSE)	A common loss function used in regression tasks, calculated as the average squared difference between predicted and actual values.
Mean Squared Logarithmic Error (MSLE)	A loss function used in regression tasks, calculated as the average squared difference between the logarithms of the predicted and actual values (commonly log(1 + value), so that zeros are allowed).
Median	The middle value in a set of numbers, where half the values are higher and half are lower.
Meta-analysis	A statistical technique used to synthesize and combine the results of multiple studies, in order to estimate the overall effect size and statistical significance of a research question.
Mixed ANOVA	A statistical test used to analyze data with a mixed design, which combines at least one between-subjects factor (different groups of subjects) with at least one within-subjects factor (the same subjects measured under multiple conditions or at multiple time points).
Mixed effects model	A statistical model that includes both fixed and random effects, allowing for the analysis of both within- and between-group variations.
Mode	The most frequently occurring value in a set of numbers.
Multilevel modeling	A statistical technique used to analyze data with a hierarchical or nested structure, such as data from individuals nested within groups.
Multinomial logistic regression	A statistical analysis used to predict the probability of a categorical outcome with more than two categories.
Multiple regression	A statistical analysis used to predict the value of a dependent variable based on the values of two or more independent variables.
Multivariate analysis	A statistical analysis that involves the simultaneous study of multiple variables.
Nominal scale	A scale of measurement in which the categories are mutually exclusive and do not have a numerical order.
Normality test	A statistical test used to determine whether a set of data follows a normal distribution.
One-way ANOVA	A statistical test used to compare the means of three or more groups, assuming that the data are normally distributed and the variances are equal.
Ordinal logistic regression	A statistical analysis used to predict the probability of an ordinal outcome, such as a rating scale.
Ordinal scale	A scale of measurement in which the categories have a meaningful order, but the intervals between the categories are not necessarily equal.
Outlier	A value that is significantly higher or lower than the other values in a set of data.
P-value	The probability of obtaining a result as extreme or more extreme than the observed data, if the null hypothesis is true.
Panel data analysis	A statistical technique used to analyze data that are collected from the same subjects over multiple time points.
Partial correlation coefficient	A statistical measure of the association between two variables, controlling for the effects of one or more other variables.
Pearson's correlation coefficient	A statistical measure of the linear association between two continuous variables, ranging from -1 to 1.
Percentile	The value below which a certain percentage of the data falls.
Point-biserial correlation coefficient	A statistical measure of the association between a continuous variable and a dichotomous variable.
Poisson distribution	A discrete probability distribution for the number of events occurring in a fixed interval of time or space, when events occur independently at a constant average rate.
Power	The probability of correctly rejecting the null hypothesis, given that it is false.
Principal component analysis	A statistical technique used to reduce the dimensionality of a data set by projecting the data onto a lower-dimensional space.
Probability	The likelihood or chance of an event occurring, expressed as a number between 0 and 1.
Quartile	One of the three points that divide a set of data into four equal parts.
Random effect	A variable in a statistical model that is allowed to vary across different levels or groups, but is not considered to be a fixed part of the model.
Random sampling	A sampling method in which each member of the population has an equal chance of being selected for the sample.
Range	The difference between the highest and lowest values in a set of numbers.
Rasch model	A statistical model used in item response theory to measure an individual's ability or trait level based on their responses to a series of items.
Ratio scale	A scale of measurement in which the categories have a numerical order, the intervals between the categories are equal, and there is a true zero point.
Regression	A statistical analysis used to predict the value of a dependent variable based on the value of one or more independent variables.
Repeated measures ANOVA	A statistical test used to compare the means of two or more conditions or time points, where the same subjects are measured under every condition or at every time point.
Repeated measures design	A research design in which the same subjects are measured under multiple conditions or at multiple time points.
Sampling	The process of selecting a subset of a population for study, in order to make inferences about the population as a whole.
Scatter plot	A graph showing the relationship between two numerical variables, with each data point represented by a dot plotted on the horizontal and vertical axes.
Skewness	A measure of the asymmetry of a distribution, indicating whether it is skewed to the left or right.
Spearman's rank correlation coefficient	A statistical measure of the monotonic association between two ordinal or continuous variables, ranging from -1 to 1.
Standard deviation	A measure of the dispersion or spread of a set of numbers, calculated as the square root of the variance.
Stratified sampling	A sampling method in which the population is divided into subgroups or strata, and a representative sample is selected from each stratum.
Structural equation modeling	A statistical technique used to test and estimate relationships between variables, both observed and latent.
Survival analysis	A statistical technique used to analyze data on the time it takes for an event of interest to occur, such as death or failure.
T-score	A standardized score used in hypothesis testing, calculated as the number of estimated standard errors a sample mean is from the hypothesized population mean.
t-test	A statistical test used to compare the means of two groups, assuming that the data are normally distributed and the variances are equal.
Time series analysis	A statistical technique used to analyze data that are collected at regular intervals over time.
Type I error	The error of rejecting the null hypothesis when it is true.
Type II error	The error of failing to reject the null hypothesis when it is false.
Variance	A measure of the dispersion or spread of a set of numbers, calculated as the average of the squared differences from the mean.
Weibull distribution	A continuous probability distribution that is often used to model failure times or lifespan data.
Wilcoxon rank-sum test	A nonparametric statistical test used to compare the medians of two groups, when the data are not normally distributed or the variances are not equal.
Z-score	The number of standard deviations a value is from the mean of a distribution.
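
Several of the descriptive statistics and regression loss functions defined above can be computed in a few lines. The following is a minimal sketch using only the Python standard library. The sample values and the helper names (z_score, mae, mse, msle) are illustrative assumptions, not part of the glossary. The variance shown is the population variance (dividing by the number of values), and msle assumes the common log(1 + value) convention.

```python
import math
import statistics

# Illustrative sample (an assumption for this sketch).
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)            # sum of the values / number of values
median = statistics.median(data)        # middle value of the sorted data
mode = statistics.mode(data)            # most frequently occurring value
value_range = max(data) - min(data)     # highest value minus lowest value
variance = statistics.pvariance(data)   # average squared difference from the mean
std_dev = statistics.pstdev(data)       # square root of the variance

# Quartiles: three cut points dividing the sorted data into four parts.
# statistics.quantiles uses its default "exclusive" method here; other
# conventions can give slightly different cut points.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                           # interquartile range


def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma


def mae(y_true, y_pred):
    """Mean Absolute Error: average absolute difference."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean Squared Error: average squared difference."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def msle(y_true, y_pred):
    """Mean Squared Logarithmic Error, using log(1 + value)."""
    return sum(
        (math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)
    ) / len(y_true)


print(mean, median, mode, value_range, variance, std_dev, iqr)
print(z_score(9, mean, std_dev))
print(mae([3, 5, 2], [2, 5, 4]), mse([3, 5, 2], [2, 5, 4]))
```

For sample data rather than a full population, statistics.variance and statistics.stdev (which divide by n - 1) would usually be the more appropriate choice.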

Artificial neural networks (ANN) and deep learning

Term	Definition
Activation Function	A function applied to the output of a neuron to introduce non-linearity in the network and determine the neuron's firing behavior.
Actor-Critic	A hybrid reinforcement learning approach that combines value-based (Critic) and policy-based (Actor) methods to improve stability and efficiency.
Artificial Neural Network (ANN)	A computational model inspired by the structure and function of biological neural networks, used for machine learning tasks.
Attention Mechanism	A mechanism used in deep learning models, especially in natural language processing tasks, to focus on relevant parts of the input sequence.
Attention Score	In the context of attention mechanisms, a score that reflects the relevance or importance of a particular part of the input data.
Autoencoder	A type of neural network used for unsupervised learning, trained to compress its input data into a lower-dimensional representation and then reconstruct the input from that representation.
Backpropagation	A learning algorithm used in training neural networks by adjusting the network's weights based on the error signal propagated backward from the output to the input layer.
Batch Normalization	A technique used to improve the training and generalization of neural networks by normalizing the inputs of each layer in a mini-batch.
Batch Size	The number of training samples processed together in one forward and backward pass, that is, in one iteration (weight update) within an epoch.
Convolutional Neural Network (CNN)	A type of neural network specifically designed for image recognition and processing tasks, utilizing convolutional layers to detect patterns and features in images.
Data Augmentation	A technique used to artificially increase the size of a dataset by applying transformations or modifications to the existing data.
Deep Q-Network (DQN)	A deep learning algorithm used for reinforcement learning, combining Q-learning with a deep neural network to approximate the Q-values.
Dropout	A regularization technique in deep learning where randomly selected neurons are ignored during training to reduce overfitting.
Early Stopping	A regularization technique used during training to stop the learning process when the performance on a validation set stops improving.
Encoder-Decoder Architecture	A neural network architecture where an encoder compresses the input data into a latent representation, and a decoder reconstructs the data from the latent space.
End-to-End Learning	A learning approach where a neural network is trained to directly map raw input data to output predictions without the need for intermediate feature engineering.
Epoch	A single pass of the entire training dataset through the neural network during the training process.
Exploding Gradient Problem	A problem occurring in deep neural networks during backpropagation, where the gradients of the loss function with respect to certain weights become excessively large, leading to unstable training.
Feedforward Neural Network	A type of artificial neural network where the flow of data is unidirectional, moving from input to output through hidden layers.
Gated Recurrent Unit (GRU)	A variant of the RNN that, like the LSTM, uses gates to control the flow of information, but with a simpler architecture and fewer parameters.
Generative Adversarial Network (GAN)	A type of neural network architecture consisting of a generator and a discriminator, competing in a game to produce realistic synthetic data.
Gradient Descent	An optimization algorithm used to find the optimal weights of a neural network by iteratively updating them in the direction of the steepest descent of the loss function.
Learning Rate	A hyperparameter that controls the step size in gradient descent, determining how much the weights are updated in each iteration.
Learning Rate Scheduler	A technique used to adjust the learning rate during training, allowing for faster convergence and better optimization.
Long Short-Term Memory (LSTM)	A variant of RNN designed to alleviate the vanishing gradient problem and better capture long-range dependencies in sequential data.
Loss Function	A measure used to evaluate how well the neural network's predictions match the actual targets during training.
Momentum	A hyperparameter used to speed up the convergence of gradient descent by adding a fraction of the previous weight update to the current update.
Neuron (Node)	A basic unit in an artificial neural network, responsible for receiving input, performing computations, and producing an output.
Overfitting	A phenomenon where a neural network performs well on the training data but poorly on unseen or test data due to excessive memorization of noise in the training set.
Policy Gradient	A method used in reinforcement learning to directly optimize the policy of an agent, often employed in situations with continuous action spaces.
Pooling Layer	A layer in a CNN used to reduce the spatial dimensions of the feature maps and retain the most important information.
Recurrent Neural Network (RNN)	A type of neural network that is capable of processing sequential data by maintaining hidden state information across time steps.
Reinforcement Learning	A type of machine learning paradigm where an agent interacts with an environment to learn how to take actions to maximize cumulative rewards.
Self-Supervised Learning	A training paradigm in which a model generates its own labels or supervision from the input data, often used to pre-train models before fine-tuning them on specific tasks.
Semi-Supervised Learning	A learning paradigm where a model is trained on both labeled and unlabeled data, leveraging the unlabeled data to improve performance.
Softmax	An activation function used in the output layer of a neural network for multi-class classification tasks, converting raw scores into probabilities.
Tensor	A multi-dimensional array used to represent data in deep learning frameworks.
Transfer Learning	A technique in deep learning where a pre-trained neural network is used as a starting point for a new task, often with some of its layers frozen to retain learned features.
Transformer	A neural network architecture based on self-attention mechanisms, widely used in natural language processing tasks.
Underfitting	A phenomenon where a neural network performs poorly on both the training and test data due to the model's simplicity and inability to capture the underlying patterns.
Unsupervised Learning	A machine learning paradigm where a model is trained on unlabeled data to find underlying patterns and structures without explicit supervision.
Vanishing Gradient Problem	A problem occurring in deep neural networks during backpropagation, where the gradients of the loss function with respect to certain weights become extremely small, hindering learning.
Word Embeddings	A representation of words in a continuous vector space, learned from large textual corpora, used to capture semantic relationships between words.
Word2Vec	A popular algorithm used to generate word embeddings from textual data, typically based on either Skip-gram or Continuous Bag of Words (CBOW) models.
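
The Softmax, Gradient Descent, Learning Rate, and Momentum entries above can be illustrated together. The following is a minimal sketch in plain Python, not a framework implementation. The function names, the one-dimensional loss L(w) = (w - 3)^2, and the chosen hyperparameter values are assumptions made for this example.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gradient_descent(grad, w0, learning_rate=0.1, momentum=0.0, steps=300):
    """Minimize a one-dimensional loss, given a function for its gradient."""
    w = w0
    velocity = 0.0
    for _ in range(steps):
        # Momentum adds a fraction of the previous update to the current one;
        # the learning rate scales the step taken along the negative gradient.
        velocity = momentum * velocity - learning_rate * grad(w)
        w += velocity
    return w


def loss_gradient(w):
    """Gradient of the hypothetical loss L(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


probs = softmax([2.0, 1.0, 0.1])
w_plain = gradient_descent(loss_gradient, w0=0.0)
w_momentum = gradient_descent(loss_gradient, w0=0.0, momentum=0.9)

print(probs, sum(probs))
print(w_plain, w_momentum)  # both should approach the minimum at w = 3
```

In a real network the weight is a tensor rather than a single number and the gradient comes from backpropagation, but the update rule has the same shape.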

Tree-based methods

Term	Definition
AdaBoost (Adaptive Boosting)	A boosting algorithm that assigns higher weights to misclassified samples in each iteration, emphasizing difficult-to-predict instances.
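Bagging	Short for bootstrap aggregating: the process of training multiple models independently on different bootstrap samples of the training data (random subsets drawn with replacement) and averaging their predictions or taking a majority vote, used in Random Forests.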
Boosting	An ensemble learning method that trains multiple weak learners sequentially, with each subsequent model focusing on correcting the errors of the previous ones.
CART (Classification and Regression Trees)	An algorithm for constructing decision trees that can be used for both classification and regression tasks.
CatBoost	A gradient boosting library that can handle categorical features directly, without the need for explicit encoding.
Categorical Feature Encoding	Techniques to convert categorical variables into numerical form for decision trees, such as one-hot encoding or label encoding.
Classification Tree	A decision tree used for classification tasks, where the output at each leaf node represents a class label.
Decision Boundary	The boundary or threshold that separates different classes in a decision tree or other classification model.
Decision Path	The sequence of feature-based decisions made from the root node to a specific leaf node in a decision tree, providing insights into how predictions are made.
Decision Stump	A decision tree with only one level of internal nodes, used as a weak learner in boosting algorithms.
Decision Tree	A hierarchical tree-like structure used for classification and regression tasks, where each internal node represents a decision based on a feature, and each leaf node represents the predicted output.
Early Stopping	A regularization technique used during training to stop the learning process when the performance on a validation set stops improving.
Ensemble Aggregation	The process of combining predictions from multiple decision trees or models to produce the final output.
Ensemble Learning	A machine learning technique that combines multiple models to make predictions, often leading to improved accuracy and robustness.
Entropy	A measure of the randomness or uncertainty in a dataset, used in information gain calculations.
Feature Importance	A measure of the importance of each feature in a decision tree or ensemble model, indicating how much each feature contributes to the prediction.
Feature Split Importance	A measure of how much a feature contributes to the decision-making process in a tree-based model, based on the number of times the feature is used for splitting and the resulting impurity reduction.
Gini Impurity	A measure of the impurity or uncertainty in a dataset, used as a splitting criterion for decision trees in classification tasks.
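Gradient Boosting	A boosting algorithm that builds decision trees sequentially, where each new tree is fit to the negative gradient of the loss function (for squared error, the residuals) of the current ensemble, so that it corrects the errors of the previous trees.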
Information Gain	A measure of the reduction in entropy achieved by a split, used as a splitting criterion for decision trees in classification tasks.
Interaction Depth	The depth or number of interactions allowed between features in a tree-based model, often set to control model complexity.
Internal Node	A node in a decision tree that contains a decision based on a feature, leading to further splits.
Leaf Node	A terminal node in a decision tree that provides the final prediction or decision.
Leaf Sample Size	The minimum number of samples required at a leaf node for the node to be considered valid and used for predictions.
LightGBM	An optimized gradient boosting implementation designed for speed and memory efficiency, using histogram-based split finding and leaf-wise tree growth.
Missing Values Handling	Strategies for dealing with missing values during the construction of decision trees, such as imputation or assigning them to the most common class.
Node Depth	The depth or level of a node in a decision tree, representing the number of splits needed to reach that node from the root.
Node Impurity	A measure of the impurity or homogeneity of the samples at a node, used to evaluate potential splits during decision tree construction.
Out-of-Bag Error	The error rate of a random forest model calculated using the samples that were not used in a particular tree's training (out-of-bag samples).
Overfitting	A phenomenon where a decision tree captures noise or irrelevant patterns in the training data, resulting in poor generalization to unseen data.
Pruned Tree	A decision tree that has undergone pruning to remove branches that do not significantly contribute to the model's performance.
Pruning	A technique used to reduce the size of a decision tree by removing branches that do not significantly contribute to the model's performance, helping to avoid overfitting.
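Random Forest	An ensemble learning method that builds multiple decision trees on bootstrap samples of the training data, typically considering a random subset of features at each split, and combines their predictions to improve accuracy and reduce overfitting.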
Regression Tree	A decision tree used for regression tasks, where the output at each leaf node is a continuous value.
Root Node	The topmost node of a decision tree, representing the initial split based on a feature.
Split	The process of dividing the dataset at a node based on a feature and its threshold value.
Splitting Criterion	A measure used to determine the best feature and threshold for a split, such as Gini impurity or information gain for classification, and mean squared error for regression.
Splitting Strategy	The method used to choose the feature and threshold for splitting a node in a decision tree, such as best split, random split, or greedy split.
Termination Criteria	Conditions used to stop the tree-building process, such as the maximum depth of the tree or the minimum number of samples required to split a node.
Tree Pruning	A process that reduces the size of a decision tree by removing branches that do not contribute significantly to the model's performance, helping to avoid overfitting.
Underfitting	A phenomenon where a decision tree is too simple to capture the underlying patterns in the training data, resulting in poor performance.
Variance Reduction	The process of reducing the variance of predictions by averaging the outputs of multiple models, such as in random forests.
XGBoost	An optimized implementation of gradient boosting, known for its efficiency and high performance.
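
The Gini Impurity, Entropy, and Information Gain entries above can be made concrete with a small calculation. The following is a minimal sketch using only the Python standard library. The function names and the toy labels are assumptions for illustration, and real libraries compute these quantities far more efficiently.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Fraction of the samples belonging to each class."""
    counts = Counter(labels)
    n = len(labels)
    return [c / n for c in counts.values()]


def gini_impurity(labels):
    """1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Uncertainty of the class labels, in bits."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Toy labels at a node (hypothetical data).
parent = ["yes", "yes", "no", "no"]

# A split that separates the classes perfectly, and one that does not help.
perfect_split = [["yes", "yes"], ["no", "no"]]
useless_split = [["yes", "no"], ["yes", "no"]]

print(gini_impurity(parent), entropy(parent))
print(information_gain(parent, perfect_split))
print(information_gain(parent, useless_split))
```

A splitting criterion based on information gain would prefer the first split, because it removes all of the uncertainty at the node, while the second leaves the class mix unchanged.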