7 最佳化 (Optimization)

學習目標

用導數找函數的極值點：令 $f'(x) = 0$
使用二階導數判斷極大或極小
理解凹向上 (concave up) 與凹向下 (concave down)
連結醫學統計：MLE、最小平方法、梯度下降法

7.1 為什麼需要最佳化？

統計學的核心問題之一就是找最佳值：

最大概似估計 (MLE)：找讓 likelihood 最大的參數值
最小平方法 (OLS)：找讓殘差平方和最小的迴歸係數
機器學習：找讓損失函數最小的模型參數

這些都是最佳化問題 (optimization problems)，而微分是解決這類問題的關鍵工具！

7.2 極值的基本概念

7.2.1 什麼是極值？

局部極大值 (local maximum)：在某個範圍內，函數值最大的點

局部極小值 (local minimum)：在某個範圍內，函數值最小的點

全域極值 (global extrema)：在整個定義域內的最大或最小值

Code

# 創建一個有多個極值的函數
f <- function(x) {
  -0.1*x^4 + x^3 - 2*x^2 + 1
}

x <- seq(-2, 5, by = 0.01)
y <- f(x)

df <- data.frame(x, y)

# 找極值點（數值近似）
# 局部極小值約在 x ≈ 0.4 和 x ≈ 3.3
# 局部極大值約在 x ≈ 2
extrema <- data.frame(
  x = c(0.4, 2, 3.3),
  y = f(c(0.4, 2, 3.3)),
  type = c("局部極小", "局部極大", "全域極小"),
  color = c("#E94F37", "#2E86AB", "#E94F37")
)

ggplot(df, aes(x, y)) +
  geom_line(color = "#2E86AB", linewidth = 1.5) +
  geom_hline(yintercept = 0, color = "gray70", linetype = "dashed") +
  geom_point(data = extrema, aes(x, y, color = type), size = 5) +
  geom_text(data = extrema, aes(x, y, label = type),
            vjust = -1.5, size = 4, fontface = "bold") +
  scale_color_manual(values = c("局部極大" = "#2E86AB",
                                "局部極小" = "#E94F37",
                                "全域極小" = "#E94F37")) +
  labs(
    title = "極值的類型",
    subtitle = "局部極值 vs. 全域極值",
    x = "x", y = "f(x)"
  ) +
  theme_minimal(base_size = 14) +
  theme(legend.position = "none")

7.3 一階導數檢定 (First Derivative Test)

7.3.1 核心定理

Fermat’s Theorem：如果 $f(x)$ 在 $x=c$ 有局部極值，且 $f'(c)$ 存在，則：

\[ f'(c) = 0 \]

白話文：極值點的切線斜率是水平的（斜率 = 0）。

注意

$f'(c) = 0$ 是極值的必要條件，但不是充分條件。

也就是說：

極值點 $\Rightarrow$ $f'(c) = 0$（一定成立）
$f'(c) = 0$ $\nRightarrow$ 極值點（不一定成立）

反例：$f(x) = x^3$ 在 $x=0$ 處，$f'(0) = 0$，但這不是極值點！

7.3.2 找極值點的步驟

計算導數 $f'(x)$
解方程式 $f'(x) = 0$，找出臨界點 (critical points)
判斷每個臨界點是極大、極小、還是反曲點

Code

x <- seq(-2, 2, by = 0.01)

# 三種情況
f1 <- -x^2 + 1       # 極大值
f2 <- x^2            # 極小值
f3 <- x^3            # 反曲點

df <- data.frame(x, f1, f2, f3)

p1 <- ggplot(df, aes(x, f1)) +
  geom_line(color = "#2E86AB", linewidth = 1.5) +
  geom_point(aes(x = 0, y = 1), color = "#E94F37", size = 5) +
  geom_hline(yintercept = 0, color = "gray70") +
  annotate("text", x = 0, y = 1.5, label = "極大值\nf'(0) = 0",
           hjust = 0.5, color = "#E94F37", fontface = "bold") +
  labs(title = expression(f(x) == -x^2 + 1), y = "f(x)") +
  theme_minimal(base_size = 11)

p2 <- ggplot(df, aes(x, f2)) +
  geom_line(color = "#2E86AB", linewidth = 1.5) +
  geom_point(aes(x = 0, y = 0), color = "#E94F37", size = 5) +
  geom_hline(yintercept = 0, color = "gray70") +
  annotate("text", x = 0, y = 1, label = "極小值\nf'(0) = 0",
           hjust = 0.5, color = "#E94F37", fontface = "bold") +
  labs(title = expression(f(x) == x^2), y = "f(x)") +
  theme_minimal(base_size = 11)

p3 <- ggplot(df, aes(x, f3)) +
  geom_line(color = "#2E86AB", linewidth = 1.5) +
  geom_point(aes(x = 0, y = 0), color = "#E94F37", size = 5) +
  geom_hline(yintercept = 0, color = "gray70") +
  annotate("text", x = 0, y = 1.5, label = "反曲點\nf'(0) = 0\n但不是極值",
           hjust = 0.5, color = "#E94F37", fontface = "bold") +
  labs(title = expression(f(x) == x^3), y = "f(x)") +
  theme_minimal(base_size = 11)

(p1 | p2 | p3) +
  plot_annotation(
    title = "f'(x) = 0 的三種可能",
    subtitle = "需要進一步檢驗才能確定是哪一種"
  )

7.4 二階導數檢定 (Second Derivative Test)

7.4.1 凹向上與凹向下

二階導數 $f''(x)$ 描述函數的彎曲方向：

$f''(x) > 0$：函數凹向上 (concave up)，像微笑 ∪
$f''(x) < 0$：函數凹向下 (concave down)，像皺眉 ∩

Code

x <- seq(-2, 2, by = 0.01)

f_up <- x^2           # f''(x) = 2 > 0
f_down <- -x^2        # f''(x) = -2 < 0

df <- data.frame(x, f_up, f_down)

p1 <- ggplot(df, aes(x, f_up)) +
  geom_line(color = "#2E86AB", linewidth = 1.5) +
  geom_hline(yintercept = 0, color = "gray70") +
  annotate("text", x = 0, y = 2.5, label = "凹向上 ∪\nf''(x) > 0",
           hjust = 0.5, size = 5, fontface = "bold") +
  labs(title = expression(f(x) == x^2), y = "f(x)") +
  ylim(-5, 5) +
  theme_minimal(base_size = 12)

p2 <- ggplot(df, aes(x, f_down)) +
  geom_line(color = "#E94F37", linewidth = 1.5) +
  geom_hline(yintercept = 0, color = "gray70") +
  annotate("text", x = 0, y = -2.5, label = "凹向下 ∩\nf''(x) < 0",
           hjust = 0.5, size = 5, fontface = "bold") +
  labs(title = expression(f(x) == -x^2), y = "f(x)") +
  ylim(-5, 5) +
  theme_minimal(base_size = 12)

p1 + p2 +
  plot_annotation(
    title = "二階導數與凹凸性",
    subtitle = "二階導數的符號決定曲線的彎曲方向"
  )

7.4.2 二階導數檢定

假設 $f'(c) = 0$（臨界點），則：

$f''(c)$	結論
$f''(c) > 0$	$c$ 是局部極小值
$f''(c) < 0$	$c$ 是局部極大值
$f''(c) = 0$	無法判斷，需要其他方法

直觀理解：

凹向上 ($f'' > 0$) + 水平切線 ($f' = 0$) = 谷底（極小值）
凹向下 ($f'' < 0$) + 水平切線 ($f' = 0$) = 山頂（極大值）

7.4.3 範例：找 $f(x) = x^3 - 3x$ 的極值

步驟 1：計算一階導數

\[ f'(x) = 3x^2 - 3 \]

步驟 2：找臨界點

\[ f'(x) = 0 \Rightarrow 3x^2 - 3 = 0 \Rightarrow x^2 = 1 \Rightarrow x = \pm 1 \]

步驟 3：計算二階導數

\[ f''(x) = 6x \]

步驟 4：判斷極值類型

在 $x = -1$：$f''(-1) = -6 < 0$ → 極大值，$f(-1) = 2$
在 $x = 1$：$f''(1) = 6 > 0$ → 極小值，$f(1) = -2$

Code

x <- seq(-2.5, 2.5, by = 0.01)

f <- x^3 - 3*x
f_prime <- 3*x^2 - 3
f_double_prime <- 6*x

df <- data.frame(x, f, f_prime, f_double_prime)

# 極值點
extrema <- data.frame(
  x = c(-1, 1),
  y = c(2, -2),
  type = c("極大值", "極小值")
)

p1 <- ggplot(df, aes(x, f)) +
  geom_line(color = "#2E86AB", linewidth = 1.5) +
  geom_hline(yintercept = 0, color = "gray70", linetype = "dashed") +
  geom_vline(xintercept = c(-1, 1), color = "#E94F37",
             linetype = "dotted", alpha = 0.5) +
  geom_point(data = extrema, aes(x, y), color = "#E94F37", size = 5) +
  geom_text(data = extrema, aes(x, y, label = type),
            vjust = c(1.5, -1.5), hjust = 0.5, color = "#E94F37",
            fontface = "bold") +
  labs(title = expression(f(x) == x^3 - 3*x), y = "f(x)") +
  theme_minimal(base_size = 12)

zeros_df <- data.frame(x = c(-1, 1), y = c(0, 0))

p2 <- ggplot(df, aes(x, f_prime)) +
  geom_line(color = "#E94F37", linewidth = 1.5) +
  geom_hline(yintercept = 0, color = "gray70", linetype = "dashed") +
  geom_vline(xintercept = c(-1, 1), color = "#E94F37",
             linetype = "dotted", alpha = 0.5) +
  geom_point(data = zeros_df, aes(x = x, y = y),
             color = "#E94F37", size = 5) +
  annotate("text", x = c(-1, 1), y = c(1, 1),
           label = "f'(x) = 0", hjust = 0.5, color = "#E94F37") +
  labs(title = expression(f*"'"*(x) == 3*x^2 - 3), y = "f'(x)") +
  theme_minimal(base_size = 12)

second_df <- data.frame(x = c(-1, 1), y = c(-6, 6))

p3 <- ggplot(df, aes(x, f_double_prime)) +
  geom_line(color = "#2E86AB", linewidth = 1.5) +
  geom_hline(yintercept = 0, color = "gray70", linetype = "dashed") +
  geom_vline(xintercept = c(-1, 1), color = "#E94F37",
             linetype = "dotted", alpha = 0.5) +
  geom_point(data = second_df, aes(x = x, y = y),
             color = "#E94F37", size = 5) +
  annotate("text", x = -1, y = -6, label = "f''(x) < 0\n極大",
           vjust = 1.5, hjust = 0.5, color = "#E94F37", size = 3) +
  annotate("text", x = 1, y = 6, label = "f''(x) > 0\n極小",
           vjust = -0.5, hjust = 0.5, color = "#E94F37", size = 3) +
  labs(title = expression(f*"''"*(x) == 6*x), y = "f''(x)") +
  theme_minimal(base_size = 12)

p1 / (p2 + p3) +
  plot_annotation(
    title = "最佳化分析：f(x) = x³ - 3x",
    subtitle = "使用一階與二階導數找極值"
  )

7.5 統計應用

7.5.1 1. 最大概似估計 (MLE)

核心想法：找讓 likelihood 最大的參數值。

步驟：

寫出 likelihood function：$L(\theta) = \prod_{i=1}^n f(x_i; \theta)$
取對數：$\ell(\theta) = \ln L(\theta) = \sum_{i=1}^n \ln f(x_i; \theta)$
對 $\theta$ 微分：$\frac{d\ell}{d\theta}$
令導數為 0：$\frac{d\ell}{d\theta} = 0$
解出 $\hat{\theta}_{\text{MLE}}$

範例：估計常態分布的平均數

假設 $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$，已知 $\sigma^2$，要估計 $\mu$。

\[ \begin{align} \ell(\mu) &= \sum_{i=1}^n \ln f(x_i; \mu) \\ &= \sum_{i=1}^n \ln \left[\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)\right] \\ &= -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2 \end{align} \]

微分：

\[ \frac{d\ell}{d\mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu) \]

令導數為 0：

\[ \sum_{i=1}^n (x_i - \mu) = 0 \Rightarrow \hat{\mu}_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x} \]

結論：樣本平均數就是 MLE！

Code

set.seed(123)
data <- rnorm(20, mean = 5, sd = 2)

# Log-likelihood 函數（固定 sigma = 2）
log_lik <- function(mu) {
  sum(dnorm(data, mean = mu, sd = 2, log = TRUE))
}

mu_range <- seq(2, 8, by = 0.01)
ll_values <- sapply(mu_range, log_lik)

# MLE
mu_mle <- mean(data)
ll_mle <- log_lik(mu_mle)

df <- data.frame(mu = mu_range, ll = ll_values)

ggplot(df, aes(mu, ll)) +
  geom_line(color = "#2E86AB", linewidth = 1.5) +
  geom_vline(xintercept = mu_mle, color = "#E94F37",
             linetype = "dashed", linewidth = 1) +
  geom_point(aes(x = mu_mle, y = ll_mle),
             color = "#E94F37", size = 5) +
  annotate("text", x = mu_mle + 0.3, y = ll_mle,
           label = paste0("MLE = ", round(mu_mle, 2), "\n= 樣本平均數"),
           hjust = 0, vjust = 0.5, color = "#E94F37",
           fontface = "bold", size = 4.5) +
  labs(
    title = "Maximum Likelihood Estimation",
    subtitle = "找讓 ℓ(μ) 最大的點 → 令 dℓ/dμ = 0",
    x = expression(mu),
    y = expression(ell(mu))
  ) +
  theme_minimal(base_size = 14)

Figure 7.5: MLE 視覺化：找讓 log-likelihood 最大的 μ

7.5.2 2. 最小平方法 (Ordinary Least Squares)

在簡單線性迴歸中：

\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]

目標：找 $\beta_0, \beta_1$ 使殘差平方和 (RSS) 最小：

\[ \text{RSS}(\beta_0, \beta_1) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 \]

最佳化條件：

\[ \frac{\partial \text{RSS}}{\partial \beta_0} = 0, \quad \frac{\partial \text{RSS}}{\partial \beta_1} = 0 \]

解出來就是最小平方估計式！

Code

set.seed(42)
n <- 30
x <- rnorm(n, mean = 5, sd = 2)
y <- 2 + 0.8*x + rnorm(n, sd = 1)

# 計算 OLS
fit <- lm(y ~ x)
beta0_hat <- coef(fit)[1]
beta1_hat <- coef(fit)[2]

df <- data.frame(x, y, y_pred = predict(fit))

ggplot(df, aes(x, y)) +
  geom_segment(aes(xend = x, yend = y_pred),
               color = "#E94F37", alpha = 0.3, linetype = "dashed") +
  geom_point(color = "#2E86AB", size = 3) +
  geom_abline(intercept = beta0_hat, slope = beta1_hat,
              color = "#E94F37", linewidth = 1.5) +
  annotate("text", x = max(x) - 1, y = min(y) + 1,
           label = paste0("ŷ = ", round(beta0_hat, 2), " + ",
                         round(beta1_hat, 2), "x"),
           hjust = 1, vjust = 0, color = "#E94F37",
           fontface = "bold", size = 5) +
  labs(
    title = "Ordinary Least Squares (OLS)",
    subtitle = "最小化殘差平方和（紅色虛線長度平方的總和）",
    x = "x", y = "y"
  ) +
  theme_minimal(base_size = 14)

7.5.3 3. 梯度下降法 (Gradient Descent)

在機器學習中，有些損失函數無法直接解析求解，需要用迭代法。

梯度下降法的核心想法：

從某個初始值 $\theta_0$ 開始
計算梯度（導數）：$g = \frac{d}{d\theta}L(\theta)$
朝著梯度的反方向移動：$\theta_{t+1} = \theta_t - \alpha \cdot g_t$
重複直到收斂

其中 $\alpha$ 叫做學習率 (learning rate)。

直觀理解：想像你在山上，想找到最低點。你看不到全貌，只能感受腳下的斜率。你每次都往下坡走一小步，最終會到達谷底。

Code

# 簡單的二次函數
f <- function(x) (x - 3)^2 + 1
f_prime <- function(x) 2*(x - 3)

# 梯度下降
gradient_descent <- function(x0, alpha, n_iter) {
  path <- numeric(n_iter + 1)
  path[1] <- x0

  for (i in 1:n_iter) {
    grad <- f_prime(path[i])
    path[i+1] <- path[i] - alpha * grad
  }

  data.frame(
    iter = 0:n_iter,
    x = path,
    y = f(path)
  )
}

# 執行梯度下降
path <- gradient_descent(x0 = 0, alpha = 0.1, n_iter = 15)

x <- seq(-1, 6, by = 0.01)

ggplot() +
  # 函數曲線
  geom_line(aes(x = x, y = f(x)), color = "#2E86AB", linewidth = 1.5) +
  # 梯度下降路徑
  geom_path(data = path, aes(x, y), color = "#E94F37",
            linewidth = 1, arrow = arrow(length = unit(0.3, "cm"))) +
  geom_point(data = path, aes(x, y), color = "#E94F37", size = 2) +
  # 起點和終點
  geom_point(data = path[1,], aes(x, y),
             color = "#E94F37", size = 5, shape = 21, fill = "white") +
  geom_point(data = path[nrow(path),], aes(x, y),
             color = "#E94F37", size = 5) +
  annotate("text", x = path$x[1], y = path$y[1] + 2,
           label = "起點", hjust = 0.5, color = "#E94F37", fontface = "bold") +
  annotate("text", x = path$x[nrow(path)], y = path$y[nrow(path)] - 0.5,
           label = "終點（極小值）", hjust = 0.5, vjust = 1,
           color = "#E94F37", fontface = "bold") +
  labs(
    title = "Gradient Descent 視覺化",
    subtitle = "每一步都朝著梯度的反方向移動",
    x = expression(theta), y = expression(L(theta))
  ) +
  theme_minimal(base_size = 14)

7.6 練習題

7.6.1 觀念題

用自己的話解釋：為什麼極值點的導數是 0？

參考答案

導數代表函數的瞬時變化率（切線斜率）。在極值點（山頂或谷底），函數暫時停止上升或下降，切線是水平的，所以斜率為 0。就像爬山到山頂時，那一瞬間既不上坡也不下坡。

如果 $f'(c) = 0$ 且 $f''(c) = 0$，我們能判斷 $c$ 是極大還是極小嗎？

參考答案

不能。二階導數檢定在 $f''(c) = 0$ 時失效，無法判斷。例如 $f(x) = x^4$ 在 $x=0$ 處是極小值，但 $f(x) = -x^4$ 在 $x=0$ 處是極大值，兩者的 $f'(0) = f''(0) = 0$ 都成立。此時需要用更高階的導數或其他方法判斷。

在 MLE 中，為什麼要「最大化」log-likelihood？在 OLS 中，為什麼要「最小化」RSS？

參考答案

MLE 要找「最能解釋觀察到的資料」的參數值，所以要讓 likelihood（資料出現的機率）越大越好。OLS 要找「預測誤差最小」的迴歸線，所以要讓殘差平方和越小越好。兩者都是最佳化問題，只是目標函數的方向不同：一個求最大值，一個求最小值。

7.6.2 計算題

找出 $f(x) = x^4 - 4x^3$ 的極值點，並判斷是極大還是極小。

參考答案

步驟：$f'(x) = 4x^3 - 12x^2 = 4x^2(x-3) = 0$ → $x = 0$ 或 $x = 3$。$f''(x) = 12x^2 - 24x$。在 $x=0$：$f''(0) = 0$（無法判斷）。在 $x=3$：$f''(3) = 108 - 72 = 36 > 0$ → 極小值，$f(3) = -27$。$x=0$ 實際上是反曲點，不是極值。

假設 $X_1, \ldots, X_n \sim \text{Exp}(\lambda)$（指數分布），推導 $\lambda$ 的 MLE。提示：$f(x; \lambda) = \lambda e^{-\lambda x}$

參考答案

Log-likelihood: $\ell(\lambda) = \sum_{i=1}^n \ln(\lambda e^{-\lambda x_i}) = n\ln\lambda - \lambda\sum_{i=1}^n x_i$。微分：$\frac{d\ell}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^n x_i$。令其為 0：$\frac{n}{\lambda} = \sum_{i=1}^n x_i$ → $\hat{\lambda}_{\text{MLE}} = \frac{n}{\sum_{i=1}^n x_i} = \frac{1}{\bar{x}}$。

使用二階導數檢定，驗證 $\hat{\mu}_{\text{MLE}} = \bar{x}$ 確實是最大值（不是最小值）。

參考答案

從本章推導：$\frac{d\ell}{d\mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu)$。二階導數：$\frac{d^2\ell}{d\mu^2} = -\frac{n}{\sigma^2} < 0$（恆負）。因為 $f''(\mu) < 0$，曲線凹向下，所以 $\hat{\mu} = \bar{x}$ 確實是極大值！

7.6.3 R 操作題

實作梯度下降法，找 $f(x) = x^2 - 4x + 5$ 的最小值。

Code

f <- function(x) x^2 - 4*x + 5
f_prime <- function(x) ___

# 梯度下降
x <- 0  # 起點
alpha <- ___  # 學習率
for (i in 1:___) {
  x <- x - alpha * f_prime(x)
  print(paste("Iteration", i, ": x =", x, ", f(x) =", f(x)))
}

參考答案

f <- function(x) x^2 - 4*x + 5
f_prime <- function(x) 2*x - 4

x <- 0
alpha <- 0.1
for (i in 1:20) {
  x <- x - alpha * f_prime(x)
  print(paste("Iteration", i, ": x =", round(x, 4), ", f(x) =", round(f(x), 4)))
}

理論上最小值在 $x=2$，$f(2)=1$。梯度下降會逐步收斂到這個值。

用視覺化比較不同學習率 $\alpha$ 對梯度下降收斂速度的影響。

參考答案

可以嘗試 $\alpha = 0.01, 0.1, 0.5, 0.9$ 等不同值。太小的學習率收斂很慢，太大的學習率可能震盪或發散。建議用 patchwork 將不同 $\alpha$ 的收斂路徑並排比較，觀察步數和穩定性的差異。最佳學習率通常需要實驗調整。

本章重點整理

核心概念

找極值的步驟：

計算 $f'(x)$，令 $f'(x) = 0$ 找臨界點
計算 $f''(x)$，判斷極值類型：
- $f''(x) > 0$ → 極小值
- $f''(x) < 0$ → 極大值
- $f''(x) = 0$ → 無法判斷

統計應用：

MLE：令 $\frac{d\ell}{d\theta} = 0$，找讓 log-likelihood 最大的參數
OLS：令 $\frac{\partial \text{RSS}}{\partial \beta} = 0$，找讓殘差平方和最小的係數
Gradient Descent：$\theta_{t+1} = \theta_t - \alpha \nabla L(\theta_t)$

Part II 總結：

我們已經學會微分的所有基本工具！接下來 Part III 會學習積分，它是微分的「逆運算」，在計算機率和期望值時不可或缺。

# 最佳化 (Optimization) ```{r} #| include: false source(here::here("R/_common.R")) ``` ## 學習目標 {.unnumbered} - 用導數找函數的極值點：令 $f'(x) = 0$ - 使用二階導數判斷極大或極小 - 理解凹向上 (concave up) 與凹向下 (concave down) - 連結醫學統計：MLE、最小平方法、梯度下降法 ## 為什麼需要最佳化？統計學的核心問題之一就是**找最佳值**： - **最大概似估計 (MLE)**：找讓 likelihood 最大的參數值 - **最小平方法 (OLS)**：找讓殘差平方和最小的迴歸係數 - **機器學習**：找讓損失函數最小的模型參數這些都是**最佳化問題 (optimization problems)**，而微分是解決這類問題的關鍵工具！ ## 極值的基本概念 ### 什麼是極值？ **局部極大值 (local maximum)**：在某個範圍內，函數值最大的點 **局部極小值 (local minimum)**：在某個範圍內，函數值最小的點 **全域極值 (global extrema)**：在整個定義域內的最大或最小值 ```{r} #| label: fig-extrema-concept #| fig-cap: "極值的視覺化：局部與全域" #| warning: false #| message: false # 創建一個有多個極值的函數 f <- function(x) { -0.1*x^4 + x^3 - 2*x^2 + 1 } x <- seq(-2, 5, by = 0.01) y <- f(x) df <- data.frame(x, y) # 找極值點（數值近似） # 局部極小值約在 x ≈ 0.4 和 x ≈ 3.3 # 局部極大值約在 x ≈ 2 extrema <- data.frame( x = c(0.4, 2, 3.3), y = f(c(0.4, 2, 3.3)), type = c("局部極小", "局部極大", "全域極小"), color = c("#E94F37", "#2E86AB", "#E94F37") ) ggplot(df, aes(x, y)) + geom_line(color = "#2E86AB", linewidth = 1.5) + geom_hline(yintercept = 0, color = "gray70", linetype = "dashed") + geom_point(data = extrema, aes(x, y, color = type), size = 5) + geom_text(data = extrema, aes(x, y, label = type), vjust = -1.5, size = 4, fontface = "bold") + scale_color_manual(values = c("局部極大" = "#2E86AB", "局部極小" = "#E94F37", "全域極小" = "#E94F37")) + labs( title = "極值的類型", subtitle = "局部極值 vs. 全域極值", x = "x", y = "f(x)" ) + theme_minimal(base_size = 14) + theme(legend.position = "none") ``` ## 一階導數檢定 (First Derivative Test) ### 核心定理 **Fermat's Theorem**：如果 $f(x)$ 在 $x=c$ 有局部極值，且 $f'(c)$ 存在，則： $$ f'(c) = 0 $$ **白話文**：極值點的切線斜率是水平的（斜率 = 0）。 :::{.callout-note} ## 注意 $f'(c) = 0$ 是極值的**必要條件**，但不是**充分條件**。也就是說： - 極值點 $\Rightarrow$ $f'(c) = 0$（一定成立） - $f'(c) = 0$ $\nRightarrow$ 極值點（不一定成立）反例：$f(x) = x^3$ 在 $x=0$ 處，$f'(0) = 0$，但這不是極值點！ ::: ### 找極值點的步驟 1. 計算導數 $f'(x)$ 2. 解方程式 $f'(x) = 0$，找出**臨界點 (critical points)** 3. 判斷每個臨界點是極大、極小、還是反曲點 ```{r} #| label: fig-critical-points #| fig-cap: "臨界點的三種類型" #| warning: false #| message: false x <- seq(-2, 2, by = 0.01) # 三種情況 f1 <- -x^2 + 1 # 極大值 f2 <- x^2 # 極小值 f3 <- x^3 # 反曲點 df <- data.frame(x, f1, f2, f3) p1 <- ggplot(df, aes(x, f1)) + geom_line(color = "#2E86AB", linewidth = 1.5) + geom_point(aes(x = 0, y = 1), color = "#E94F37", size = 5) + geom_hline(yintercept = 0, color = "gray70") + annotate("text", x = 0, y = 1.5, label = "極大值\nf'(0) = 0", hjust = 0.5, color = "#E94F37", fontface = "bold") + labs(title = expression(f(x) == -x^2 + 1), y = "f(x)") + theme_minimal(base_size = 11) p2 <- ggplot(df, aes(x, f2)) + geom_line(color = "#2E86AB", linewidth = 1.5) + geom_point(aes(x = 0, y = 0), color = "#E94F37", size = 5) + geom_hline(yintercept = 0, color = "gray70") + annotate("text", x = 0, y = 1, label = "極小值\nf'(0) = 0", hjust = 0.5, color = "#E94F37", fontface = "bold") + labs(title = expression(f(x) == x^2), y = "f(x)") + theme_minimal(base_size = 11) p3 <- ggplot(df, aes(x, f3)) + geom_line(color = "#2E86AB", linewidth = 1.5) + geom_point(aes(x = 0, y = 0), color = "#E94F37", size = 5) + geom_hline(yintercept = 0, color = "gray70") + annotate("text", x = 0, y = 1.5, label = "反曲點\nf'(0) = 0\n但不是極值", hjust = 0.5, color = "#E94F37", fontface = "bold") + labs(title = expression(f(x) == x^3), y = "f(x)") + theme_minimal(base_size = 11) (p1 | p2 | p3) + plot_annotation( title = "f'(x) = 0 的三種可能", subtitle = "需要進一步檢驗才能確定是哪一種" ) ``` ## 二階導數檢定 (Second Derivative Test) ### 凹向上與凹向下 **二階導數** $f''(x)$ 描述函數的**彎曲方向**： - $f''(x) > 0$：函數**凹向上 (concave up)**，像微笑 ∪ - $f''(x) < 0$：函數**凹向下 (concave down)**，像皺眉 ∩ ```{r} #| label: fig-concavity #| fig-cap: "凹向上 vs. 凹向下" #| warning: false #| message: false x <- seq(-2, 2, by = 0.01) f_up <- x^2 # f''(x) = 2 > 0 f_down <- -x^2 # f''(x) = -2 < 0 df <- data.frame(x, f_up, f_down) p1 <- ggplot(df, aes(x, f_up)) + geom_line(color = "#2E86AB", linewidth = 1.5) + geom_hline(yintercept = 0, color = "gray70") + annotate("text", x = 0, y = 2.5, label = "凹向上 ∪\nf''(x) > 0", hjust = 0.5, size = 5, fontface = "bold") + labs(title = expression(f(x) == x^2), y = "f(x)") + ylim(-5, 5) + theme_minimal(base_size = 12) p2 <- ggplot(df, aes(x, f_down)) + geom_line(color = "#E94F37", linewidth = 1.5) + geom_hline(yintercept = 0, color = "gray70") + annotate("text", x = 0, y = -2.5, label = "凹向下 ∩\nf''(x) < 0", hjust = 0.5, size = 5, fontface = "bold") + labs(title = expression(f(x) == -x^2), y = "f(x)") + ylim(-5, 5) + theme_minimal(base_size = 12) p1 + p2 + plot_annotation( title = "二階導數與凹凸性", subtitle = "二階導數的符號決定曲線的彎曲方向" ) ``` ### 二階導數檢定假設 $f'(c) = 0$（臨界點），則： | $f''(c)$ | 結論 | |----------|------| | $f''(c) > 0$ | $c$ 是**局部極小值** | | $f''(c) < 0$ | $c$ 是**局部極大值** | | $f''(c) = 0$ | **無法判斷**，需要其他方法 | **直觀理解**： - 凹向上 ($f'' > 0$) + 水平切線 ($f' = 0$) = 谷底（極小值） - 凹向下 ($f'' < 0$) + 水平切線 ($f' = 0$) = 山頂（極大值） ### 範例：找 $f(x) = x^3 - 3x$ 的極值 **步驟 1**：計算一階導數 $$ f'(x) = 3x^2 - 3 $$ **步驟 2**：找臨界點 $$ f'(x) = 0 \Rightarrow 3x^2 - 3 = 0 \Rightarrow x^2 = 1 \Rightarrow x = \pm 1 $$ **步驟 3**：計算二階導數 $$ f''(x) = 6x $$ **步驟 4**：判斷極值類型 - 在 $x = -1$：$f''(-1) = -6 < 0$ → **極大值**，$f(-1) = 2$ - 在 $x = 1$：$f''(1) = 6 > 0$ → **極小值**，$f(1) = -2$ ```{r} #| label: fig-optimization-example #| fig-cap: "完整的最佳化分析" #| warning: false #| message: false x <- seq(-2.5, 2.5, by = 0.01) f <- x^3 - 3*x f_prime <- 3*x^2 - 3 f_double_prime <- 6*x df <- data.frame(x, f, f_prime, f_double_prime) # 極值點 extrema <- data.frame( x = c(-1, 1), y = c(2, -2), type = c("極大值", "極小值") ) p1 <- ggplot(df, aes(x, f)) + geom_line(color = "#2E86AB", linewidth = 1.5) + geom_hline(yintercept = 0, color = "gray70", linetype = "dashed") + geom_vline(xintercept = c(-1, 1), color = "#E94F37", linetype = "dotted", alpha = 0.5) + geom_point(data = extrema, aes(x, y), color = "#E94F37", size = 5) + geom_text(data = extrema, aes(x, y, label = type), vjust = c(1.5, -1.5), hjust = 0.5, color = "#E94F37", fontface = "bold") + labs(title = expression(f(x) == x^3 - 3*x), y = "f(x)") + theme_minimal(base_size = 12) zeros_df <- data.frame(x = c(-1, 1), y = c(0, 0)) p2 <- ggplot(df, aes(x, f_prime)) + geom_line(color = "#E94F37", linewidth = 1.5) + geom_hline(yintercept = 0, color = "gray70", linetype = "dashed") + geom_vline(xintercept = c(-1, 1), color = "#E94F37", linetype = "dotted", alpha = 0.5) + geom_point(data = zeros_df, aes(x = x, y = y), color = "#E94F37", size = 5) + annotate("text", x = c(-1, 1), y = c(1, 1), label = "f'(x) = 0", hjust = 0.5, color = "#E94F37") + labs(title = expression(f*"'"*(x) == 3*x^2 - 3), y = "f'(x)") + theme_minimal(base_size = 12) second_df <- data.frame(x = c(-1, 1), y = c(-6, 6)) p3 <- ggplot(df, aes(x, f_double_prime)) + geom_line(color = "#2E86AB", linewidth = 1.5) + geom_hline(yintercept = 0, color = "gray70", linetype = "dashed") + geom_vline(xintercept = c(-1, 1), color = "#E94F37", linetype = "dotted", alpha = 0.5) + geom_point(data = second_df, aes(x = x, y = y), color = "#E94F37", size = 5) + annotate("text", x = -1, y = -6, label = "f''(x) < 0\n極大", vjust = 1.5, hjust = 0.5, color = "#E94F37", size = 3) + annotate("text", x = 1, y = 6, label = "f''(x) > 0\n極小", vjust = -0.5, hjust = 0.5, color = "#E94F37", size = 3) + labs(title = expression(f*"''"*(x) == 6*x), y = "f''(x)") + theme_minimal(base_size = 12) p1 / (p2 + p3) + plot_annotation( title = "最佳化分析：f(x) = x³ - 3x", subtitle = "使用一階與二階導數找極值" ) ``` ## 統計應用 ### 1. 最大概似估計 (MLE) **核心想法**：找讓 likelihood 最大的參數值。 **步驟**： 1. 寫出 likelihood function：$L(\theta) = \prod_{i=1}^n f(x_i; \theta)$ 2. 取對數：$\ell(\theta) = \ln L(\theta) = \sum_{i=1}^n \ln f(x_i; \theta)$ 3. 對 $\theta$ 微分：$\frac{d\ell}{d\theta}$ 4. 令導數為 0：$\frac{d\ell}{d\theta} = 0$ 5. 解出 $\hat{\theta}_{\text{MLE}}$ **範例：估計常態分布的平均數** 假設 $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$，已知 $\sigma^2$，要估計 $\mu$。 $$ \begin{align} \ell(\mu) &= \sum_{i=1}^n \ln f(x_i; \mu) \\ &= \sum_{i=1}^n \ln \left[\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)\right] \\ &= -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2 \end{align} $$ **微分**： $$ \frac{d\ell}{d\mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu) $$ **令導數為 0**： $$ \sum_{i=1}^n (x_i - \mu) = 0 \Rightarrow \hat{\mu}_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x} $$ **結論**：樣本平均數就是 MLE！ ```{r} #| label: fig-mle-normal #| fig-cap: "MLE 視覺化：找讓 log-likelihood 最大的 μ" #| warning: false #| message: false set.seed(123) data <- rnorm(20, mean = 5, sd = 2) # Log-likelihood 函數（固定 sigma = 2） log_lik <- function(mu) { sum(dnorm(data, mean = mu, sd = 2, log = TRUE)) } mu_range <- seq(2, 8, by = 0.01) ll_values <- sapply(mu_range, log_lik) # MLE mu_mle <- mean(data) ll_mle <- log_lik(mu_mle) df <- data.frame(mu = mu_range, ll = ll_values) ggplot(df, aes(mu, ll)) + geom_line(color = "#2E86AB", linewidth = 1.5) + geom_vline(xintercept = mu_mle, color = "#E94F37", linetype = "dashed", linewidth = 1) + geom_point(aes(x = mu_mle, y = ll_mle), color = "#E94F37", size = 5) + annotate("text", x = mu_mle + 0.3, y = ll_mle, label = paste0("MLE = ", round(mu_mle, 2), "\n= 樣本平均數"), hjust = 0, vjust = 0.5, color = "#E94F37", fontface = "bold", size = 4.5) + labs( title = "Maximum Likelihood Estimation", subtitle = "找讓 ℓ(μ) 最大的點 → 令 dℓ/dμ = 0", x = expression(mu), y = expression(ell(mu)) ) + theme_minimal(base_size = 14) ``` ### 2. 最小平方法 (Ordinary Least Squares) 在簡單線性迴歸中： $$ y_i = \beta_0 + \beta_1 x_i + \epsilon_i $$ **目標**：找 $\beta_0, \beta_1$ 使**殘差平方和 (RSS)** 最小： $$ \text{RSS}(\beta_0, \beta_1) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 $$ **最佳化條件**： $$ \frac{\partial \text{RSS}}{\partial \beta_0} = 0, \quad \frac{\partial \text{RSS}}{\partial \beta_1} = 0 $$ 解出來就是**最小平方估計式**！ ```{r} #| label: fig-ols-visualization #| fig-cap: "OLS：找讓 RSS 最小的迴歸線" #| warning: false #| message: false set.seed(42) n <- 30 x <- rnorm(n, mean = 5, sd = 2) y <- 2 + 0.8*x + rnorm(n, sd = 1) # 計算 OLS fit <- lm(y ~ x) beta0_hat <- coef(fit)[1] beta1_hat <- coef(fit)[2] df <- data.frame(x, y, y_pred = predict(fit)) ggplot(df, aes(x, y)) + geom_segment(aes(xend = x, yend = y_pred), color = "#E94F37", alpha = 0.3, linetype = "dashed") + geom_point(color = "#2E86AB", size = 3) + geom_abline(intercept = beta0_hat, slope = beta1_hat, color = "#E94F37", linewidth = 1.5) + annotate("text", x = max(x) - 1, y = min(y) + 1, label = paste0("ŷ = ", round(beta0_hat, 2), " + ", round(beta1_hat, 2), "x"), hjust = 1, vjust = 0, color = "#E94F37", fontface = "bold", size = 5) + labs( title = "Ordinary Least Squares (OLS)", subtitle = "最小化殘差平方和（紅色虛線長度平方的總和）", x = "x", y = "y" ) + theme_minimal(base_size = 14) ``` ### 3. 梯度下降法 (Gradient Descent) 在機器學習中，有些損失函數無法直接解析求解，需要用**迭代法**。 **梯度下降法**的核心想法： 1. 從某個初始值 $\theta_0$ 開始 2. 計算梯度（導數）：$g = \frac{d}{d\theta}L(\theta)$ 3. 朝著梯度的**反方向**移動：$\theta_{t+1} = \theta_t - \alpha \cdot g_t$ 4. 重複直到收斂其中 $\alpha$ 叫做**學習率 (learning rate)**。 **直觀理解**：想像你在山上，想找到最低點。你看不到全貌，只能感受腳下的斜率。你每次都往下坡走一小步，最終會到達谷底。 ```{r} #| label: fig-gradient-descent #| fig-cap: "梯度下降法的視覺化" #| warning: false #| message: false # 簡單的二次函數 f <- function(x) (x - 3)^2 + 1 f_prime <- function(x) 2*(x - 3) # 梯度下降 gradient_descent <- function(x0, alpha, n_iter) { path <- numeric(n_iter + 1) path[1] <- x0 for (i in 1:n_iter) { grad <- f_prime(path[i]) path[i+1] <- path[i] - alpha * grad } data.frame( iter = 0:n_iter, x = path, y = f(path) ) } # 執行梯度下降 path <- gradient_descent(x0 = 0, alpha = 0.1, n_iter = 15) x <- seq(-1, 6, by = 0.01) ggplot() + # 函數曲線 geom_line(aes(x = x, y = f(x)), color = "#2E86AB", linewidth = 1.5) + # 梯度下降路徑 geom_path(data = path, aes(x, y), color = "#E94F37", linewidth = 1, arrow = arrow(length = unit(0.3, "cm"))) + geom_point(data = path, aes(x, y), color = "#E94F37", size = 2) + # 起點和終點 geom_point(data = path[1,], aes(x, y), color = "#E94F37", size = 5, shape = 21, fill = "white") + geom_point(data = path[nrow(path),], aes(x, y), color = "#E94F37", size = 5) + annotate("text", x = path$x[1], y = path$y[1] + 2, label = "起點", hjust = 0.5, color = "#E94F37", fontface = "bold") + annotate("text", x = path$x[nrow(path)], y = path$y[nrow(path)] - 0.5, label = "終點（極小值）", hjust = 0.5, vjust = 1, color = "#E94F37", fontface = "bold") + labs( title = "Gradient Descent 視覺化", subtitle = "每一步都朝著梯度的反方向移動", x = expression(theta), y = expression(L(theta)) ) + theme_minimal(base_size = 14) ``` ## 練習題 ### 觀念題 1. 用自己的話解釋：為什麼極值點的導數是 0？ ::: {.callout-tip collapse="true" title="參考答案"} 導數代表函數的瞬時變化率（切線斜率）。在極值點（山頂或谷底），函數暫時停止上升或下降，切線是水平的，所以斜率為 0。就像爬山到山頂時，那一瞬間既不上坡也不下坡。 ::: 2. 如果 $f'(c) = 0$ 且 $f''(c) = 0$，我們能判斷 $c$ 是極大還是極小嗎？ ::: {.callout-tip collapse="true" title="參考答案"} 不能。二階導數檢定在 $f''(c) = 0$ 時失效，無法判斷。例如 $f(x) = x^4$ 在 $x=0$ 處是極小值，但 $f(x) = -x^4$ 在 $x=0$ 處是極大值，兩者的 $f'(0) = f''(0) = 0$ 都成立。此時需要用更高階的導數或其他方法判斷。 ::: 3. 在 MLE 中，為什麼要「最大化」log-likelihood？在 OLS 中，為什麼要「最小化」RSS？ ::: {.callout-tip collapse="true" title="參考答案"} MLE 要找「最能解釋觀察到的資料」的參數值，所以要讓 likelihood（資料出現的機率）越大越好。OLS 要找「預測誤差最小」的迴歸線，所以要讓殘差平方和越小越好。兩者都是最佳化問題，只是目標函數的方向不同：一個求最大值，一個求最小值。 ::: ### 計算題 4. 找出 $f(x) = x^4 - 4x^3$ 的極值點，並判斷是極大還是極小。 ::: {.callout-tip collapse="true" title="參考答案"} **步驟**：$f'(x) = 4x^3 - 12x^2 = 4x^2(x-3) = 0$ → $x = 0$ 或 $x = 3$。$f''(x) = 12x^2 - 24x$。在 $x=0$：$f''(0) = 0$（無法判斷）。在 $x=3$：$f''(3) = 108 - 72 = 36 > 0$ → **極小值**，$f(3) = -27$。$x=0$ 實際上是反曲點，不是極值。 ::: 5. 假設 $X_1, \ldots, X_n \sim \text{Exp}(\lambda)$（指數分布），推導 $\lambda$ 的 MLE。提示：$f(x; \lambda) = \lambda e^{-\lambda x}$ ::: {.callout-tip collapse="true" title="參考答案"} Log-likelihood: $\ell(\lambda) = \sum_{i=1}^n \ln(\lambda e^{-\lambda x_i}) = n\ln\lambda - \lambda\sum_{i=1}^n x_i$。微分：$\frac{d\ell}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^n x_i$。令其為 0：$\frac{n}{\lambda} = \sum_{i=1}^n x_i$ → $\hat{\lambda}_{\text{MLE}} = \frac{n}{\sum_{i=1}^n x_i} = \frac{1}{\bar{x}}$。 ::: 6. 使用二階導數檢定，驗證 $\hat{\mu}_{\text{MLE}} = \bar{x}$ 確實是最大值（不是最小值）。 ::: {.callout-tip collapse="true" title="參考答案"} 從本章推導：$\frac{d\ell}{d\mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu)$。二階導數：$\frac{d^2\ell}{d\mu^2} = -\frac{n}{\sigma^2} < 0$（恆負）。因為 $f''(\mu) < 0$，曲線凹向下，所以 $\hat{\mu} = \bar{x}$ 確實是極大值！ ::: ### R 操作題 7. 實作梯度下降法，找 $f(x) = x^2 - 4x + 5$ 的最小值。 ```{r} #| eval: false f <- function(x) x^2 - 4*x + 5 f_prime <- function(x) ___ # 梯度下降 x <- 0 # 起點 alpha <- ___ # 學習率 for (i in 1:___) { x <- x - alpha * f_prime(x) print(paste("Iteration", i, ": x =", x, ", f(x) =", f(x))) } ``` ::: {.callout-tip collapse="true" title="參考答案"} ```r f <- function(x) x^2 - 4*x + 5 f_prime <- function(x) 2*x - 4 x <- 0 alpha <- 0.1 for (i in 1:20) { x <- x - alpha * f_prime(x) print(paste("Iteration", i, ": x =", round(x, 4), ", f(x) =", round(f(x), 4))) } ``` 理論上最小值在 $x=2$，$f(2)=1$。梯度下降會逐步收斂到這個值。 ::: 8. 用視覺化比較不同學習率 $\alpha$ 對梯度下降收斂速度的影響。 ::: {.callout-tip collapse="true" title="參考答案"} 可以嘗試 $\alpha = 0.01, 0.1, 0.5, 0.9$ 等不同值。太小的學習率收斂很慢，太大的學習率可能震盪或發散。建議用 `patchwork` 將不同 $\alpha$ 的收斂路徑並排比較，觀察步數和穩定性的差異。最佳學習率通常需要實驗調整。 ::: ## 本章重點整理 {.unnumbered} :::{.callout-important} ## 核心概念 **找極值的步驟**： 1. 計算 $f'(x)$，令 $f'(x) = 0$ 找臨界點 2. 計算 $f''(x)$，判斷極值類型： - $f''(x) > 0$ → 極小值 - $f''(x) < 0$ → 極大值 - $f''(x) = 0$ → 無法判斷 **統計應用**： - **MLE**：令 $\frac{d\ell}{d\theta} = 0$，找讓 log-likelihood 最大的參數 - **OLS**：令 $\frac{\partial \text{RSS}}{\partial \beta} = 0$，找讓殘差平方和最小的係數 - **Gradient Descent**：$\theta_{t+1} = \theta_t - \alpha \nabla L(\theta_t)$ **Part II 總結**：我們已經學會微分的所有基本工具！接下來 Part III 會學習**積分**，它是微分的「逆運算」，在計算機率和期望值時不可或缺。 :::

\(f''(c)\)	結論
\(f''(c) > 0\)	\(c\) 是局部極小值
\(f''(c) < 0\)	\(c\) 是局部極大值
\(f''(c) = 0\)	無法判斷，需要其他方法

學習目標

7.1 為什麼需要最佳化？

7.2 極值的基本概念

7.2.1 什麼是極值？

7.3 一階導數檢定 (First Derivative Test)

7.3.1 核心定理

7.3.2 找極值點的步驟

7.4 二階導數檢定 (Second Derivative Test)

7.4.1 凹向上與凹向下

7.4.2 二階導數檢定

7.4.3 範例：找 \(f(x) = x^3 - 3x\) 的極值

7.5 統計應用

7.5.1 1. 最大概似估計 (MLE)

7.5.2 2. 最小平方法 (Ordinary Least Squares)

7.5.3 3. 梯度下降法 (Gradient Descent)

7.6 練習題

7.6.1 觀念題

7.6.2 計算題

7.6.3 R 操作題

本章重點整理