Summary of Common Loss Functions
Mean Squared Error (MSE)
- Definition: $$ MSE = \frac{1}{N} \sum_{i=1}^{N}(y_i - \hat{y}_i)^2 $$
- Application Scenarios: Primarily used for regression problems. By minimizing MSE, the model tries to reduce the squared difference between predicted values and actual values.
- Advantages and Disadvantages:
- Advantages: More sensitive to large errors, significantly penalizing larger deviations.
- Disadvantages: Very sensitive to outliers, which may lead to unstable training.
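A minimal NumPy sketch of the formula above (the array names are illustrative):

```python
import numpy as np

def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean squared error over N samples."""
    return float(np.mean((y_true - y_pred) ** 2))

# Example: a single large error dominates the average because it is squared.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 2.1, 2.9, 10.0])
print(mse(y_true, y_pred))  # the 6.0 deviation contributes 36 before averaging
```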
Mean Absolute Error (MAE)
- Definition: $$ MAE = \frac{1}{N}\sum_{i=1}^{N} \lvert y_i - \hat{y}_i \rvert $$
- Application Scenarios: Also commonly used in regression problems.
- Advantages and Disadvantages:
- Advantages: Relatively less sensitive to outliers, offering better stability.
- Disadvantages: The loss is non-differentiable at zero error ($|x| = 0$) and its gradient does not shrink as the error approaches zero, so optimization near the optimum may require special handling (e.g., subgradients or a decaying learning rate).
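A matching NumPy sketch; compared with the MSE example above, the same outlier contributes only its absolute value:

```python
import numpy as np

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute error over N samples."""
    return float(np.mean(np.abs(y_true - y_pred)))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 2.1, 2.9, 10.0])
print(mae(y_true, y_pred))  # the 6.0 deviation contributes 6, not 36
```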
Cross Entropy
- Definition: Most commonly used for classification problems, measuring the difference between predicted distribution and actual distribution.
- Binary Cross-Entropy: $$ L_{BCE} = -\frac{1}{N}\sum_{i=1}^{N} \Bigl[y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)\Bigr] $$
- Categorical Cross-Entropy: $$ L_{CE} = -\sum_{c=1}^{C} y_c \log(\hat{y}_c) $$
- Application Scenarios: Common in image classification, natural language processing, and various classification tasks.
- Advantages and Disadvantages:
- Advantages: Well suited to probabilistic outputs; it penalizes confident but wrong predictions heavily, which generally gives informative gradients.
- Disadvantages: If predicted probabilities saturate at 0 or 1, the $\log$ term becomes numerically unstable and gradients can vanish or explode; in practice predictions are clipped or the loss is computed directly from logits.
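A NumPy sketch of both forms; the clipping constant `eps` is a common practical addition (not part of the formulas above) that keeps the log term finite when predictions saturate at 0 or 1:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """BCE with predictions clipped away from 0 and 1 for numerical stability."""
    p = np.clip(y_pred, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

def categorical_cross_entropy(y_true_onehot, y_pred_probs, eps=1e-7):
    """CE for one-hot labels and softmax outputs, averaged over the batch."""
    p = np.clip(y_pred_probs, eps, 1.0)
    return float(-np.mean(np.sum(y_true_onehot * np.log(p), axis=-1)))
```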
KL Divergence (Kullback–Leibler Divergence, KLD)
- Definition: Measures the difference between two probability distributions ($P$ and $Q$). $$ D_{KL}(P || Q) = \sum_{x} P(x) \log\frac{P(x)}{Q(x)} $$
- Application Scenarios: Common in generative models such as VAE (Variational Autoencoder).
- Advantages and Disadvantages:
- Advantages: Precisely quantifies how much one distribution diverges from another, with a clear information-theoretic interpretation.
- Disadvantages: Very sensitive to regions where the distributions barely overlap (it blows up where $Q(x) \to 0$ while $P(x) > 0$), and asymmetric ($D_{KL}(P || Q) \neq D_{KL}(Q || P)$).
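A NumPy sketch for discrete distributions; the small `eps` is an assumption added here to avoid log-of-zero, not part of the definition:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) for discrete distributions given as probability vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
print(kl_divergence(p, q), kl_divergence(q, p))  # the two values differ: KL is asymmetric
```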
Huber Loss
- Definition: Combines the characteristics of MSE and MAE, using a quadratic penalty for small errors and a linear penalty for large errors. $$ L_\delta(x) = \begin{cases} \frac{1}{2}x^2 & \text{if } |x| \le \delta, \\ \delta\left(|x| - \frac{\delta}{2}\right) & \text{if } |x| > \delta \end{cases} $$
- Application Scenarios: Often used when some robustness to outliers is desired: large errors should still matter, but without the excessive penalty MSE would impose.
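A NumPy sketch of the piecewise definition, with $x$ taken as the residual $y - \hat{y}$:

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Quadratic where |error| <= delta, linear (with matched value and slope) outside."""
    err = y_true - y_pred
    quadratic = 0.5 * err ** 2
    linear = delta * (np.abs(err) - 0.5 * delta)
    return float(np.mean(np.where(np.abs(err) <= delta, quadratic, linear)))
```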
Focal Loss
- Definition: Adds a modulating factor $(1 - p_i)^\gamma$ to the cross-entropy, where $p_i$ is the predicted probability of the true class, to rebalance hard-to-classify and easy-to-classify samples. $$ L_{Focal}(p_i) = -\alpha_i (1 - p_i)^{\gamma} \log(p_i) $$
- Application Scenarios: Commonly used in object detection, image segmentation, or rare event detection scenarios with severe class imbalance.
- Advantages: Can significantly reduce the weight of simple samples, focusing on learning difficult samples.
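A sketch of the binary case under the usual convention that $p_t$ is the predicted probability of the true class; the `alpha` and `gamma` defaults follow common practice and may need tuning:

```python
import numpy as np

def focal_loss(y_true, y_pred, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss; easy samples (p_t close to 1) are down-weighted by (1 - p_t)^gamma."""
    p = np.clip(y_pred, eps, 1.0 - eps)
    p_t = np.where(y_true == 1, p, 1.0 - p)              # probability of the true class
    alpha_t = np.where(y_true == 1, alpha, 1.0 - alpha)  # class-balancing weight
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))
```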
Dice Loss
- Definition: Mainly used for image segmentation, measuring the overlap between segmentation results and ground truth labels.
- Dice Coefficient definition: $$ Dice = \frac{2|A \cap B|}{|A| + |B|} $$
- Dice Loss definition: $$ L_{Dice} = 1 - Dice $$
- Application Scenarios: 2D or 3D image segmentation tasks, such as medical imaging and semantic segmentation.
- Advantages: More stable than pure cross-entropy on imbalanced data, especially when the foreground-to-background ratio is highly skewed.
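A sketch of the soft Dice loss on probability maps; the `smooth` constant is a common stabilizer against empty masks, not part of the formula above:

```python
import numpy as np

def dice_loss(y_true, y_pred, smooth=1.0):
    """1 - soft Dice coefficient, computed on flattened probability maps."""
    intersection = np.sum(y_true * y_pred)
    dice = (2.0 * intersection + smooth) / (np.sum(y_true) + np.sum(y_pred) + smooth)
    return float(1.0 - dice)
```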
IoU Loss (Intersection over Union Loss)
- Definition: Similar to Dice Loss, measuring the ratio of intersection between predicted area and true area. $$ IoU = \frac{|A \cap B|}{|A \cup B|} $$ $$ L_{IoU} = 1 - IoU $$
- Application Scenarios: Image segmentation and object detection. IoU also underlies many detection evaluation metrics (e.g., mAP).
- Advantages: Can directly reflect the degree of overlap between regions.
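A soft-IoU sketch in the same style as the Dice example; again, `smooth` is an added stabilizer:

```python
import numpy as np

def iou_loss(y_true, y_pred, smooth=1.0):
    """1 - soft IoU (Jaccard index) on flattened probability maps."""
    intersection = np.sum(y_true * y_pred)
    union = np.sum(y_true) + np.sum(y_pred) - intersection
    return float(1.0 - (intersection + smooth) / (union + smooth))
```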
Tversky Loss
- Definition: The Tversky index is a generalized form of Dice, allowing the weighting of false positives and false negatives to be adjusted separately. $$ Tversky(A, B; \alpha, \beta) = \frac{|A \cap B|}{|A \cap B| + \alpha|A \setminus B| + \beta|B \setminus A|} $$ $$ L_{Tversky} = 1 - Tversky(A, B; \alpha, \beta) $$
- Application Scenarios: Used for highly unbalanced segmentation tasks, allowing weight adjustment based on different error types (false positives or false negatives).
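A sketch taking the prediction as set $A$ and the ground truth as set $B$, so $|A \setminus B|$ counts false positives and $|B \setminus A|$ counts false negatives; the $\alpha$/$\beta$ values are illustrative (setting $\beta > \alpha$ penalizes false negatives more):

```python
import numpy as np

def tversky_loss(y_true, y_pred, alpha=0.3, beta=0.7, smooth=1.0):
    """1 - Tversky index; alpha weights false positives, beta weights false negatives."""
    tp = np.sum(y_true * y_pred)
    fp = np.sum((1.0 - y_true) * y_pred)
    fn = np.sum(y_true * (1.0 - y_pred))
    return float(1.0 - (tp + smooth) / (tp + alpha * fp + beta * fn + smooth))
```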
Wasserstein Loss (Earth Mover’s Distance)
- Definition: Measures the distance between two distributions, especially common in Generative Adversarial Networks (GANs).
- Taking WGAN as an example, the critic's (discriminator's) output is optimized so that it approximates this distance: $$ W(p_r, p_\theta) = \inf_{\gamma \in \Pi(p_r, p_\theta)} \mathbb{E}_{(x,y) \sim \gamma}\bigl[\|x - y\|\bigr] $$ where $\Pi(p_r, p_\theta)$ is the set of all joint distributions whose marginals are $p_r$ and $p_\theta$.
- Application Scenarios: WGAN, WGAN-GP, and other generative adversarial networks.
- Advantages: Provides more stable adversarial training, reducing mode collapse.
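The infimum over joint distributions is not computed directly in practice; WGAN optimizes the dual (critic-based) form instead. A sketch of the resulting training losses given critic scores for real and generated batches (the Lipschitz constraint, via weight clipping or gradient penalty, is assumed but not shown):

```python
import numpy as np

def wgan_losses(critic_real_scores, critic_fake_scores):
    """WGAN losses from critic outputs.

    The critic maximizes E[f(real)] - E[f(fake)] (written here as minimizing its
    negative); the generator minimizes -E[f(fake)].
    """
    critic_loss = -(np.mean(critic_real_scores) - np.mean(critic_fake_scores))
    generator_loss = -np.mean(critic_fake_scores)
    return critic_loss, generator_loss
```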
Cosine Similarity Loss
- Definition: Commonly used in metric learning or semantic similarity tasks, aiming to maximize the cosine similarity between predicted vectors and true vectors.
- Cosine similarity: $$ \cos(\theta) = \frac{\mathbf{u}\cdot\mathbf{v}}{\|\mathbf{u}\|\,\|\mathbf{v}\|} $$
- Cosine loss (commonly taken as $1 - \cos(\theta)$): $$ L_{cos} = 1 - \cos(\theta) $$
- Application Scenarios: Sentence similarity, recommendation systems, vector retrieval, etc.
- Advantages: Insensitive to vector length, focusing on directional consistency.
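A NumPy sketch; scaling either vector leaves the loss unchanged, illustrating the insensitivity to magnitude noted above:

```python
import numpy as np

def cosine_loss(u, v, eps=1e-12):
    """1 - cosine similarity between two vectors; eps guards against zero norms."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps)
    return float(1.0 - cos)

u = np.array([1.0, 2.0, 3.0])
print(cosine_loss(u, 10.0 * u))  # ~0.0: same direction, different length
```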
Triplet Loss
- Definition: Common in metric learning, training through the distance relationship between anchor, positive, and negative samples. $$ L_{Triplet} = \max\bigl(d(a, p) - d(a, n) + \alpha, 0\bigr) $$ where $d$ represents the distance function, and $\alpha$ is the margin hyperparameter.
- Application Scenarios: Face recognition (FaceNet), image retrieval, metric learning.
- Advantages: Effectively brings similar samples closer and pushes dissimilar samples apart.
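A sketch for a single triplet, using Euclidean distance for $d$:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(d(a, p) - d(a, n) + margin, 0) with Euclidean distance."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return float(max(d_ap - d_an + margin, 0.0))
```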
InfoNCE Loss
- Definition: Common in contrastive self-supervised learning, such as SimCLR and MoCo. $$ L_{InfoNCE} = -\log \frac{\exp(\mathbf{z}_i \cdot \mathbf{z}_j / \tau)}{\sum_{k=1}^{K}\exp(\mathbf{z}_i \cdot \mathbf{z}_k / \tau)} $$ where $\mathbf{z}_i$ and $\mathbf{z}_j$ form a positive pair, the sum over $k$ runs over the positive and the negative samples, and $\tau$ is a temperature hyperparameter.
- Application Scenarios: Contrastive learning, representation learning.
- Advantages: Able to learn discriminative feature representations without labels.
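A sketch for a single anchor, assuming all embeddings are L2-normalized so dot products act as cosine similarities, and following the common convention that the denominator includes the positive pair:

```python
import numpy as np

def info_nce(z_i, z_j, negatives, temperature=0.1):
    """InfoNCE for anchor z_i, positive z_j, and a (K, d) array of negatives."""
    pos = np.exp(np.dot(z_i, z_j) / temperature)
    neg = np.exp(negatives @ z_i / temperature)
    return float(-np.log(pos / (pos + np.sum(neg))))
```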
Perceptual Loss
- Definition: Extracts high-level features with a pretrained deep network (such as VGG), then compares generated results and real images in these feature spaces. $$ L_{perc} = \sum_{l} \frac{1}{N_l}\|\phi_l(x) - \phi_l(y)\|_2^2 $$ where $\phi_l(\cdot)$ denotes the features of network layer $l$, and $N_l$ is the number of elements in that layer's feature map (used for normalization).
- Application Scenarios: Image synthesis, style transfer, super-resolution, etc.
- Advantages: Measures differences at the perceptual level, more aligned with human visual perception.
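A sketch in which `feature_extractors` is a hypothetical list of callables, each returning one layer's feature map $\phi_l(\cdot)$ as an ndarray (in practice these would be intermediate activations of a pretrained network such as VGG):

```python
import numpy as np

def perceptual_loss(x, y, feature_extractors):
    """Sum over layers of the size-normalized squared feature distance."""
    total = 0.0
    for phi in feature_extractors:
        fx, fy = phi(x), phi(y)
        total += np.sum((fx - fy) ** 2) / fx.size
    return float(total)
```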
Mean Squared False Error (MSFE)
Concept Introduction: Mean Squared False Error (MSFE) is an error measure used for 3D voxel grids and other structured data that explicitly balances “false positive” and “false negative” errors. By squaring the two error terms and combining them (typically by summation), MSFE encourages accurate predictions for both occupied and unoccupied voxels in tasks such as shape reconstruction.
General Form: The exact formulation of MSFE varies across papers and application scenarios, but the core idea is to define separate error measures for false positives and false negatives, square them, and sum them. For example, in some papers MSFE is built from the following errors:
- False Positive Error (FP Error): The error for voxels that are actually unoccupied but predicted as occupied by the model.
- False Negative Error (FN Error): The error for voxels that are actually occupied but predicted as unoccupied by the model.
If FP and FN are calculated separately, a simplified form similar to the following may be used:
$$ MSFE = (FP)^2 + (FN)^2, $$
or more elaborate variants (e.g., averaging FP and FN before squaring, or a weighted sum), depending on task requirements.
Application Scenarios:
- 3D Shape Reconstruction: When predicting object shapes on voxel grids, it’s desirable to simultaneously reduce false positives and false negatives to obtain more precise geometry.
- Sparse Data Processing: When unoccupied (background) data dominates, simple MSE or cross-entropy may cause imbalance. MSFE emphasizes the squared errors of FP and FN, making the model pay more attention to these two types of errors.
Advantages:
- Balance: Squares the false-positive and false-negative error terms separately, so the model must reduce both rather than trading one type of error off against the other.
- Flexibility: If there are different emphases on false positives or false negatives in a task, weight adjustments can be made in the MSFE calculation.
Limitations:
- Non-standard: MSFE is not common in traditional 2D classification or regression tasks, mainly targeted at 3D voxels or special situations.
- Need for complementary metrics: In some cases, MSFE alone may not fully describe model performance and is often paired with other loss functions or evaluation metrics (such as IoU or Dice).
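Because the exact formulation varies across papers, the following is only one simplified reading of the $FP^2 + FN^2$ form above: the mean squared error is computed separately over unoccupied and occupied voxels, and the two results are squared and summed:

```python
import numpy as np

def msfe(y_true, y_pred):
    """Simplified MSFE: squared per-class mean errors, summed (one possible reading)."""
    neg = y_true == 0          # unoccupied voxels -> false-positive error
    pos = y_true == 1          # occupied voxels   -> false-negative error
    fp_error = np.mean((y_pred[neg] - y_true[neg]) ** 2) if np.any(neg) else 0.0
    fn_error = np.mean((y_pred[pos] - y_true[pos]) ** 2) if np.any(pos) else 0.0
    return float(fp_error ** 2 + fn_error ** 2)
```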
MSFCEL (Mean Squared False Cross-Entropy Loss)
MSFCEL builds on the concept of Mean Squared False Error (MSFE), combining a “False Positive Cross-Entropy” ($\text{FPCE}$) and a “False Negative Cross-Entropy” ($\text{FNCE}$) as a sum of squares, defined as follows:
$$ \text{MSFCEL} = \text{FPCE}^2 + \text{FNCE}^2. $$
False Positive Cross-Entropy FPCE
- FPCE (False Positive Cross-Entropy) is used to measure prediction errors for unoccupied voxels, i.e., “actually unoccupied but misclassified as occupied”.
- Let $N$ be the total number of unoccupied voxels, $V_n$ the true label of the $n$-th unoccupied voxel (which is 0), and $\hat{V}_n$ the predicted value for that voxel; then $$ \text{FPCE} = -\frac{1}{N} \sum_{n=1}^N \Bigl[\,V_n \log \hat{V}_n + (1 - V_n)\log\bigl(1 - \hat{V}_n\bigr)\Bigr]. $$
False Negative Cross-Entropy FNCE
- FNCE (False Negative Cross-Entropy) is used to measure prediction errors for occupied voxels, i.e., “actually occupied but misclassified as unoccupied”.
- Let $P$ be the total number of occupied voxels, $V_p$ the true label of the $p$-th occupied voxel (which is 1), and $\hat{V}_p$ the predicted value for that voxel; then $$ \text{FNCE} = -\frac{1}{P} \sum_{p=1}^P \Bigl[\,V_p \log \hat{V}_p + (1 - V_p)\log\bigl(1 - \hat{V}_p\bigr)\Bigr]. $$
Combining into MSFCEL
Summing the squares of these two errors gives $$ \text{MSFCEL} = \text{FPCE}^2 + \text{FNCE}^2. $$
Training therefore minimizes $\text{FPCE}$ and $\text{FNCE}$ simultaneously; because each term is squared, whichever error is larger is amplified, which helps balance and suppress false-positive and false-negative errors.
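A sketch that follows the FPCE/FNCE definitions above: binary cross-entropy is averaged separately over unoccupied and occupied voxels, then the two averages are squared and summed (the `eps` clipping is an added numerical guard):

```python
import numpy as np

def msfcel(v_true, v_pred, eps=1e-7):
    """MSFCEL = FPCE^2 + FNCE^2 with per-voxel binary cross-entropy."""
    p = np.clip(v_pred, eps, 1.0 - eps)
    bce = -(v_true * np.log(p) + (1.0 - v_true) * np.log(1.0 - p))
    unoccupied = v_true == 0
    occupied = v_true == 1
    fpce = np.mean(bce[unoccupied]) if np.any(unoccupied) else 0.0
    fnce = np.mean(bce[occupied]) if np.any(occupied) else 0.0
    return float(fpce ** 2 + fnce ** 2)
```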
Advantages and Potential Applications
Balanced prediction for occupied and unoccupied voxels
MSFCEL penalizes the squared errors of “false positives” and “false negatives” at the same time, ensuring the model does not overly focus on getting one class right while neglecting the accuracy of the other.
Improved voxel shape reconstruction quality
In 3D shape reconstruction, a high false-positive rate introduces unnecessary noise into the result (extra protrusions or spurious points), while a high false-negative rate leaves the reconstructed object incomplete. By suppressing both types of errors, MSFCEL better preserves the integrity and cleanness of object contours.
Compatibility with other techniques
MSFCEL can be integrated with sparse voxel processing, deep network design, or loss re-weighting techniques to further control model learning behavior. If needed, it can also be combined with strategies such as focal loss or Dice loss to further address sparsity issues.