Hulu's Ops Log: Replacing Ingress with Higress on EKS in Practice

✸ ✸ ✸

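Replacing Ingress with Higress on EKS in Practice: A Complete Guide from Installation to Production

In late 2025 the Kubernetes community dropped a bombshell: ingress-nginx was officially being retired, with no security patches after March 2026. Its intended successor, InGate, was retired along with it. Overnight the whole community was asking the same question: what should I replace my Ingress controller with?

If your EKS cluster is still running nginx-ingress, it is time to act. The March 2026 maintenance deadline has arrived, which means that from now on newly discovered vulnerabilities will not be fixed. An unmaintained gateway component will sooner or later be the reason you get paged at 3 a.m.

This article looks at one option worth considering: Higress. It is a cloud-native API gateway open-sourced by Alibaba, built on Envoy and compatible with both the Kubernetes Ingress API and the Gateway API. Inside Alibaba it carries hundreds of billions of API calls per day.

We take the SRE's point of view and walk through the whole path: installation, authentication and authorization, traffic management, and security hardening, with ready-to-use configuration at each step.

1. Why Higress? A comparison first

Before choosing, get clear on the core differences between Higress and nginx-ingress: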

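Dimension	nginx-ingress	Higress
Underlying engine	Nginx (C)	Envoy (C++)
Config hot update	Requires a reload, brief interruption	Pushed over xDS, no interruption
Protocol support	HTTP/HTTPS/gRPC (extra configuration needed)	Native HTTP/2, gRPC, WebSocket
Plugin mechanism	Lua scripts / compiled C modules	Wasm plugins in Go/Rust/JS
Observability	Basic metrics + access log	Native Prometheus metrics + distributed tracing
Canary release	Limited annotation support	Multi-dimensional canary by header/cookie/weight
Gateway API	Not supported	Supported (v1.0.0)
Management console	None	Built-in web console
Community status	Maintenance ends March 2026	Actively developed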

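The most important point is hot configuration updates with zero interruption. nginx-ingress has to reload the Nginx process on every configuration change, which causes brief connection interruptions under high concurrency. Higress relies on Envoy's xDS protocol: the control plane pushes configuration changes to the data plane with no interruption at any point. When the Sealos team was managing gateway configuration for 20,000+ domains, migrating from nginx-ingress to Higress cut the time for a change to take effect from minutes to seconds.

2. Installing Higress on EKS

2.1 Prerequisites

Make sure you have a working EKS cluster with kubectl and Helm configured:

# Confirm the cluster connection works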
kubectl get nodes
# NAME                          STATUS   ROLES    AGE   VERSION
# ip-10-0-1-100.ec2.internal    Ready    <none>   30d   v1.32.0

# Confirm the Helm version is >= 3.x
helm version
# version.BuildInfo{Version:"v3.16.0"}

2.2 Installing Higress with Helm

# Add the Higress Helm repository
helm repo add higress.io https://higress.io/helm-charts
helm repo update

# Install Higress (standard cloud-environment settings)
helm install higress higress.io/higress \
  -n higress-system --create-namespace \
  --set global.local=false \
  --set global.ingressClass=higress \
  --set global.enableStatus=true \
  --set higress-core.gateway.replicas=2 \
  --set higress-console.replicas=1

# Wait for all Pods to become ready
kubectl wait --for=condition=Ready pod --all \
  -n higress-system --timeout=300s

# Check the result of the installation
kubectl get pods -n higress-system
# NAME                                    READY   STATUS    RESTARTS
# higress-controller-xxx                  1/1     Running   0
# higress-gateway-xxx                     1/1     Running   0
# higress-gateway-yyy                     1/1     Running   0
# higress-console-xxx                     1/1     Running   0

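A few key parameters:

global.local=false: tells Higress this is a cloud environment (not a local Kind/Minikube cluster), so it creates a Service of type LoadBalancer
global.ingressClass=higress: only handle Ingress resources whose ingressClassName is higress. To also pick up Ingresses written for nginx, set it to nginx
higress-core.gateway.replicas=2: run at least 2 data-plane replicas for high availability

2.3 Getting the gateway entry address

# Look up the LoadBalancer address of the Higress Gateway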
kubectl get svc higress-gateway -n higress-system
# NAME              TYPE           CLUSTER-IP     EXTERNAL-IP
# higress-gateway   LoadBalancer   172.20.x.x     a1b2c3-xxx.elb.amazonaws.com

# On AWS, EXTERNAL-IP is the DNS name of an ELB/NLB
# Point your domain at it with a CNAME record

# Test whether the gateway responds
curl -v http://$(kubectl get svc higress-gateway -n higress-system \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
# A 404 is expected (no routes are configured yet) and shows the gateway is already serving

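2.4 Accessing the Higress Console (optional but recommended)

Higress ships with a web management console for managing routes, plugins, certificates and more visually:

# Port-forward to reach the Console locally
kubectl port-forward svc/higress-console -n higress-system 8080:8080

# Open http://localhost:8080 in a browser
# Default username: admin
# Default password: admin (change it right after the first login)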

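The Console is not required; everything can be done with kubectl + YAML. For day-to-day operations, though, a UI that shows route topology, live traffic, and plugin status is a real convenience.

3. Migrating smoothly from nginx-ingress

If your cluster is already running nginx-ingress, don't panic: there is no need for a big-bang cutover. Higress supports incremental migration.

3.1 Running both gateways in parallel

The core idea: Higress and nginx-ingress run side by side, and ingressClassName decides which Ingress resources each one manages. What gets migrated is the Ingress routing configuration; the backend services themselves need no changes.

In practice:

New services: set ingressClassName: higress when creating the Ingress
Existing services: keep ingressClassName: nginx for now; they are switched to higress one at a time in section 3.2

# Existing service: stays on nginx-ingress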
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: legacy-app
spec:
  ingressClassName: nginx    # still handled by nginx
  rules:
    - host: legacy.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: legacy-service
                port:
                  number: 80

---
# New service: uses Higress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: new-app
spec:
  ingressClassName: higress   # handled by Higress
  rules:
    - host: new.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: new-service
                port:
                  number: 80

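3.2 Migrating Ingress configuration step by step

Once Higress has proven stable, change the ingressClassName of existing Ingresses from nginx to higress. The backend services themselves need no changes. Migrate one Ingress at a time and observe for 10-15 minutes after each:

# Before migrating: check that the modified manifest is still valid
# (client-side dry run only; it does not exercise Higress itself)
kubectl get ingress legacy-app -o yaml | \
  sed 's/ingressClassName: nginx/ingressClassName: higress/' | \
  kubectl apply --dry-run=client -f -

# Migrate: change ingressClassName
kubectl patch ingress legacy-app \
  -p '{"spec":{"ingressClassName":"higress"}}'

# Verify: check that the route is live
curl -H "Host: legacy.example.com" \
  http://$(kubectl get svc higress-gateway -n higress-system \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')

# DNS cutover: repoint the legacy.example.com CNAME from the nginx ELB to the Higress ELB
# After the patch nginx-ingress no longer serves this Ingress, so lower the DNS TTL
# beforehand and cut over as soon as the check above passes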

3.3 Annotation compatibility

The good news is that Higress is compatible with most nginx-ingress annotations. A mapping of the common ones:

nginx-ingress annotation	Higress annotation	Notes
nginx.ingress.kubernetes.io/rewrite-target	higress.io/rewrite-target	Path rewrite, same syntax
nginx.ingress.kubernetes.io/ssl-redirect	higress.io/ssl-redirect	Redirect HTTP to HTTPS
nginx.ingress.kubernetes.io/cors-*	higress.io/cors-*	CORS settings, same field names
nginx.ingress.kubernetes.io/canary-*	higress.io/canary-*	Canary release
nginx.ingress.kubernetes.io/proxy-body-size	higress.io/proxy-body-size	Request body size limit
nginx.ingress.kubernetes.io/backend-protocol	higress.io/backend-protocol	Backend protocol (HTTPS/GRPC)

When migrating you only need to change the annotation prefix from nginx.ingress.kubernetes.io to higress.io, and most cases switch over seamlessly. In fact, according to the official Higress documentation you can even keep using the nginx.ingress.kubernetes.io prefix, because Higress accepts both.
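
The prefix swap described above can be sketched as a small helper. This is only an illustration: the function name and the sample annotations are made up for the example, and it assumes each annotation you use has a higress.io counterpart, which should be checked against the Higress annotation reference before applying anything.

```python
NGINX_PREFIX = "nginx.ingress.kubernetes.io/"
HIGRESS_PREFIX = "higress.io/"


def rewrite_annotation_prefix(annotations):
    """Return a copy of an Ingress annotation map with the nginx prefix
    replaced by the higress.io prefix. Other annotations are kept as-is."""
    rewritten = {}
    for key, value in annotations.items():
        if key.startswith(NGINX_PREFIX):
            key = HIGRESS_PREFIX + key[len(NGINX_PREFIX):]
        rewritten[key] = value
    return rewritten


# Hypothetical annotations taken from an existing Ingress
sample = {
    "nginx.ingress.kubernetes.io/ssl-redirect": "true",
    "nginx.ingress.kubernetes.io/rewrite-target": "/$2",
    "example.com/owner": "sre-team",
}
for name, value in rewrite_annotation_prefix(sample).items():
    print(f"{name}: {value}")
```

Since Higress accepts both prefixes, a rewrite like this is optional and mostly useful for keeping manifests consistent after nginx-ingress is gone.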

3.4 After everything has been migrated

# Confirm no Ingress still uses nginx
kubectl get ingress --all-namespaces -o json | \
  jq '.items[] | select(.spec.ingressClassName=="nginx") | .metadata.name'
# Empty output means the migration is complete

# Uninstall nginx-ingress
helm uninstall nginx-ingress -n ingress-nginx
kubectl delete namespace ingress-nginx

# Cleanup done; travel light from here
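
As a complement to the jq check above, here is a minimal Python sketch that also catches Ingresses selecting nginx through the older kubernetes.io/ingress.class annotation, which spec.ingressClassName alone would miss. The function name and sample data are illustrative; the input is assumed to be the JSON produced by kubectl get ingress --all-namespaces -o json.

```python
import json

LEGACY_CLASS_ANNOTATION = "kubernetes.io/ingress.class"


def ingresses_still_on_nginx(ingress_list):
    """Return "namespace/name" for every Ingress that still selects nginx,
    via spec.ingressClassName or the legacy class annotation."""
    remaining = []
    for item in ingress_list.get("items", []):
        meta = item.get("metadata", {})
        class_name = item.get("spec", {}).get("ingressClassName")
        legacy = (meta.get("annotations") or {}).get(LEGACY_CLASS_ANNOTATION)
        if class_name == "nginx" or legacy == "nginx":
            remaining.append(f"{meta.get('namespace', 'default')}/{meta.get('name')}")
    return remaining


# Hypothetical kubectl output, trimmed to the fields the check reads
sample = json.loads("""
{"items": [
  {"metadata": {"name": "legacy-app", "namespace": "default"},
   "spec": {"ingressClassName": "nginx"}},
  {"metadata": {"name": "old-style", "namespace": "web",
                "annotations": {"kubernetes.io/ingress.class": "nginx"}},
   "spec": {}},
  {"metadata": {"name": "new-app", "namespace": "default"},
   "spec": {"ingressClassName": "higress"}}
]}
""")
print(ingresses_still_on_nginx(sample))
```

Only uninstall nginx-ingress once a check like this returns an empty list.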

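4. Authentication and authorization: guarding the front door

The gateway is the entry point for all traffic, so authentication and authorization are naturally the first line of defense. Higress has several authentication plugins built in, so there is no need to deploy an extra sidecar such as OAuth2 Proxy.

Note that Higress's authentication plugins (jwt-auth, key-auth, ext-auth and so on) are configured through the Higress Console or the WasmPlugin CRD, not through Ingress annotations. This differs from the annotation approach of nginx-ingress, but the upside is more flexible plugin configuration with fine-grained control at the global, domain, and route level.

4.1 JWT authentication

The most common scenario: the frontend calls backend APIs with a JWT, and the gateway validates the token and only lets valid requests through.

# Global jwt-auth plugin configuration (set through the Higress Console or the WasmPlugin CRD)
# Define the consumers and how each is verified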
global_auth: false
consumers:
  - name: mobile-app
    issuer: "https://auth.example.com"
    # Verify with JWKS (recommended, keys can be rotated)
    jwks: |
      {
        "keys": [
          {
            "kty": "RSA",
            "e": "AQAB",
            "use": "sig",
            "kid": "my-key-id",
            "alg": "RS256",
            "n": "your-rsa-public-key-n-value..."
          }
        ]
      }
    # Extract the token from Authorization: Bearer xxx
    from_headers:
      - name: Authorization
        value_prefix: "Bearer "
    # Claims in the token are passed through to the backend
    claims_to_headers:
      - claim: sub
        header: X-User-Id
      - claim: role
        header: X-User-Role
  - name: internal-service
    issuer: "https://internal.example.com"
    # Symmetric key (for calls between internal services)
    jwks: |
      {
        "keys": [
          {
            "kty": "oct",
            "kid": "internal-key",
            "k": "your-base64url-encoded-256-bit-secret",
            "alg": "HS256"
          }
        ]
      }
    from_headers:
      - name: X-Internal-Token
        value_prefix: ""

# Enable JWT authentication at the route level (only the listed consumers may access)
# Configure the jwt-auth plugin for route api-service through the Higress Console:
allow:
  - mobile-app
  - internal-service

# The corresponding Ingress route definition
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-service
spec:
  ingressClassName: higress
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /api/v1
            pathType: Prefix
            backend:
              service:
                name: api-service
                port:
                  number: 80

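With this in place, requests to api.example.com/api/v1/* must carry a valid JWT or they are rejected with a 401. The sub and role claims in the token are also passed through to the backend as request headers, so the backend service does not need to parse the token again.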
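
To exercise the internal-service consumer above, you need a token signed with the same symmetric key as the "k" value in its JWKS. The following is a minimal standard-library sketch of HS256 signing as defined by the JWT/JWS specifications; it is not Higress code. The helper names, the demo secret, and the claim values are assumptions for illustration, and which claims the plugin actually requires should be checked in the jwt-auth documentation.

```python
import base64
import hashlib
import hmac
import json
import time


def b64url_encode(data: bytes) -> str:
    """Base64url without padding, as used by JWT and by the JWK "k" field."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Inverse of b64url_encode; restores the stripped padding first."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_hs256(claims: dict, jwk_k: str, kid: str) -> str:
    """Build a compact JWT signed with HMAC-SHA256 using the JWK "k" value."""
    header = {"alg": "HS256", "typ": "JWT", "kid": kid}
    parts = [
        b64url_encode(json.dumps(header, separators=(",", ":")).encode("utf-8")),
        b64url_encode(json.dumps(claims, separators=(",", ":")).encode("utf-8")),
    ]
    signing_input = ".".join(parts)
    digest = hmac.new(b64url_decode(jwk_k), signing_input.encode("ascii"),
                      hashlib.sha256).digest()
    return signing_input + "." + b64url_encode(digest)


def verify_hs256(token: str, jwk_k: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    signing_input, _, signature = token.rpartition(".")
    expected = hmac.new(b64url_decode(jwk_k), signing_input.encode("ascii"),
                        hashlib.sha256).digest()
    return hmac.compare_digest(b64url_encode(expected), signature)


# Demo only: a fixed 256-bit secret; generate a random one for real use
demo_secret = b64url_encode(b"0" * 32)
demo_token = sign_hs256(
    {"iss": "https://internal.example.com", "sub": "billing-job",
     "exp": int(time.time()) + 300},
    demo_secret,
    "internal-key",
)
print(demo_token)
```

A token produced this way would be sent in the X-Internal-Token header with no prefix, matching the from_headers entry of the internal-service consumer.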

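4.2 API key authentication

Suited to third-party integrations, open APIs and similar cases. It is simpler than JWT and fits situations that do not need an elaborate permission model:

# Global configuration: define the API key consumers
# key-auth plugin configuration (set through the Higress Console or the WasmPlugin CRD)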
consumers:
  - name: partner-a
    credential: "ak-xxxx-partner-a-secret"
  - name: partner-b
    credential: "ak-yyyy-partner-b-secret"
# Where to read the API key from (header and query parameter are both supported)
keys:
  - X-API-Key
  - apikey
in_header: true
in_query: true

# Route level: only partner-a may access
# Configure the key-auth plugin for route partner-api through the Higress Console:
allow:
  - partner-a

# The corresponding Ingress route definition
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: partner-api
spec:
  ingressClassName: higress
  rules:
    - host: openapi.example.com
      http:
        paths:
          - path: /partner/v1
            pathType: Prefix
            backend:
              service:
                name: partner-service
                port:
                  number: 80

4.3 OAuth2 authentication

Higress can also act as an OAuth2 token endpoint and issue Access Tokens that conform to RFC 9068 directly:

# OAuth2 plugin configuration (set through the Higress Console or the WasmPlugin CRD)
consumers:
  - name: web-client
    client_id: "web-app-client-id"
    client_secret: "web-app-client-secret"
issuer: "https://gateway.example.com"
auth_path: "/oauth2/token"
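token_ttl: 3600  # token lifetime of 1 hour

This lets you drop a standalone OAuth2 server and have the gateway issue tokens itself. For complex SSO scenarios a dedicated IdP (Keycloak, for example) is still the better choice.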

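4.4 External authentication (ext-auth)

If you already run your own authentication service and do not want to move to the gateway's built-in plugins, use the ext-auth plugin. It forwards every request to your authentication service first and only lets it through once authentication succeeds:

# ext-auth plugin configuration (set through the Higress Console or the WasmPlugin CRD)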
http_service:
  endpoint_mode: envoy
  endpoint:
    # Address of your authentication service (FQDN form)
    service_name: auth-service.auth.svc.cluster.local
    service_port: 8080
    path_prefix: /verify
  timeout: 1000  # 1 second timeout (value in milliseconds)
  # Which request headers are forwarded to the authentication service
  authorization_request:
    allowed_headers:
      - exact: Authorization
      - exact: Cookie
      - exact: X-Forwarded-For
  # Which headers returned by the authentication service are passed on to the backend
  authorization_response:
    allowed_upstream_headers:
      - exact: X-User-Id
      - exact: X-User-Role
      - exact: X-Tenant-Id

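The advantage of this approach is that the authentication logic stays entirely under your control; the gateway only forwards and intercepts. It suits teams that already have a mature authentication system.

5. Traffic management: the SRE's core battleground

[Figure: the complete routing architecture Higress uses to process traffic]

5.1 Path rewrite and Host rewrite

The most common requirement: the frontend calls /api/v1/users, but the backend service serves /users, so the gateway has to strip the prefix.

# Strip the /api/v1 path prefix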
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-rewrite
  annotations:
    higress.io/rewrite-target: "/$2"
spec:
  ingressClassName: higress
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /api/v1(/|$)(.*)
            pathType: ImplementationSpecific
            backend:
              service:
                name: user-service
                port:
                  number: 80
# /api/v1/users  ->  /users
# /api/v1/orders ->  /orders

# Host rewrite: the external domain differs from the internal service domain
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: host-rewrite
  annotations:
    higress.io/upstream-vhost: "internal-api.default.svc.cluster.local"
spec:
  ingressClassName: higress
  rules:
    - host: public-api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: internal-api
                port:
                  number: 80

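5.2 Header pass-through and control

SREs often need to inject headers at the gateway, such as a trace ID, a canary marker, or the client's real IP:

# Add/update/remove request headers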
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: header-control
  annotations:
    # Add request headers (forwarded to the backend)
    higress.io/request-header-control-add: |
      X-Request-Source gateway
      X-Gateway-Version v2.0
    # Update a request header
    higress.io/request-header-control-update: |
      X-Forwarded-Proto https
    # Remove sensitive request headers (hidden from the backend)
    higress.io/request-header-control-remove: "X-Debug-Token,X-Internal-Secret"
    # Add response headers (returned to the client)
    higress.io/response-header-control-add: |
      X-Content-Type-Options nosniff
      X-Frame-Options DENY
      Strict-Transport-Security max-age=31536000
    # Remove response headers (hide backend details)
    higress.io/response-header-control-remove: "Server,X-Powered-By"
spec:
  ingressClassName: higress
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-service
                port:
                  number: 80

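Note the last annotation: it removes the Server and X-Powered-By response headers. This is basic security hardening that keeps attackers from fingerprinting your stack through response headers.

5.3 Canary release

Higress's canary capabilities go well beyond nginx-ingress: it supports three dimensions (header, cookie, weight), and several canary versions can exist at the same time.

# Header-based canary: for internal testing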
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-canary-v2
  annotations:
    higress.io/canary: "true"
    higress.io/canary-by-header: "X-Canary"
    higress.io/canary-by-header-value: "v2"
    # Tag canary traffic so the backend can tell it apart
    higress.io/request-header-control-add: "X-Traffic-Type canary-v2"
spec:
  ingressClassName: higress
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app-service-v2
                port:
                  number: 80
---
# Weight-based canary: ramp up gradually
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-canary-weight
  annotations:
    higress.io/canary: "true"
    higress.io/canary-weight: "10"  # 10% of traffic goes to the new version
spec:
  ingressClassName: higress
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app-service-v2
                port:
                  number: 80
---
# Stable version (fallback)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-stable
spec:
  ingressClassName: higress
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app-service-v1
                port:
                  number: 80

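A typical canary rollout: start with a header-based canary for internal testing -> once that looks good, switch to a 5% weight-based canary -> watch the metrics -> step up to 20%, 50%, 100%. Every step can be rolled back at any time by changing a single number.

5.4 Redirects

# Force HTTP to HTTPS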
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: force-https
  annotations:
    higress.io/ssl-redirect: "true"
spec:
  ingressClassName: higress
  rules:
    - host: www.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-service
                port:
                  number: 80
  tls:
    - hosts:
        - www.example.com
      secretName: example-tls

---
# Permanent redirect: old domain to new domain
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: domain-redirect
  annotations:
    higress.io/permanent-redirect: "https://new.example.com"
    higress.io/permanent-redirect-code: "301"
spec:
  ingressClassName: higress
  rules:
    - host: old.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: placeholder-service
                port:
                  number: 80

5.5 Timeouts and retries

# Fine-grained timeout and retry control
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: resilient-api
  annotations:
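    # Overall timeout of 10 seconds (no separate connect/read/write timeouts, which is more intuitive)
    higress.io/timeout: "10"
    # Retry at most 2 times
    higress.io/proxy-next-upstream-tries: "2"
    # Retry timeout of 5 seconds
    higress.io/proxy-next-upstream-timeout: "5"
    # Retry only on 502/503, and allow retries of non-idempotent requests
    higress.io/proxy-next-upstream: "http_502,http_503,non_idempotent"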
spec:
  ingressClassName: higress
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /api/payment
            pathType: Prefix
            backend:
              service:
                name: payment-service
                port:
                  number: 80

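Pay attention to the non_idempotent option. By default Higress does not retry non-idempotent requests such as POST/PATCH, because a retry could repeat the operation. Only turn it on for endpoints that guarantee idempotency, which matters all the more for a payment path like the one in this example.

6. Security: more than keeping the bad guys out

6.1 CORS configuration

A staple of architectures with separate frontend and backend. Handling CORS once at the gateway is much cleaner than having every backend service do it: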

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: cors-api
  annotations:
    higress.io/enable-cors: "true"
    # Only allow cross-origin access from specific origins
    higress.io/cors-allow-origin: "https://www.example.com,https://m.example.com"
    higress.io/cors-allow-methods: "GET,POST,PUT,DELETE,OPTIONS"
    higress.io/cors-allow-headers: "Authorization,Content-Type,X-Request-Id"
    # Expose custom response headers to the frontend
    higress.io/cors-expose-headers: "X-Total-Count,X-Page-Size"
    higress.io/cors-allow-credentials: "true"
    # Cache preflight responses for 24 hours
    higress.io/cors-max-age: "86400"
spec:
  ingressClassName: higress
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-service
                port:
                  number: 80

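A common pitfall: do not set cors-allow-origin to * and also enable cors-allow-credentials: true. Browsers reject that combination outright. Either list the allowed origins explicitly or turn credentials off.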
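
The wildcard-plus-credentials pitfall is easy to lint for before a manifest is applied. A minimal sketch, assuming the annotation names used in the example above; the function name is made up, and it only looks at values that are set explicitly rather than guessing the gateway's defaults.

```python
def cors_annotation_problems(annotations, prefix="higress.io/"):
    """Return a list of problems found in the CORS annotations of one Ingress.
    Only explicitly set values are inspected."""
    origins = [
        origin.strip()
        for origin in annotations.get(prefix + "cors-allow-origin", "").split(",")
        if origin.strip()
    ]
    credentials = annotations.get(prefix + "cors-allow-credentials", "")
    problems = []
    if "*" in origins and credentials.lower() == "true":
        problems.append(
            "cors-allow-origin '*' cannot be combined with cors-allow-credentials 'true'"
        )
    return problems


bad = {
    "higress.io/enable-cors": "true",
    "higress.io/cors-allow-origin": "*",
    "higress.io/cors-allow-credentials": "true",
}
good = {
    "higress.io/enable-cors": "true",
    "higress.io/cors-allow-origin": "https://www.example.com,https://m.example.com",
    "higress.io/cors-allow-credentials": "true",
}
print(cors_annotation_problems(bad))
print(cors_annotation_problems(good))
```

A check like this fits naturally into CI next to kubectl apply --dry-run.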

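6.2 IP allowlists and denylists

Higress's ip-restriction plugin supports access control by IP and by CIDR (configured through the Higress Console or the WasmPlugin CRD):

# Allowlist mode: only the office network and VPN may reach the admin backend
# ip-restriction plugin configuration (set through the Higress Console or the WasmPlugin CRD)
ip_source_type: header  # take the real IP from a request header (default is origin-source)
ip_header_name: x-forwarded-for
allow:
  - "10.0.0.0/8"        # internal network
  - "172.16.0.0/12"     # VPN
  - "203.0.113.50"      # office egress IP

# Denylist mode: block malicious IPs
deny:
  - "198.51.100.0/24"   # known attack source
  - "192.0.2.100"       # malicious crawler

On EKS the client's real IP usually arrives in the X-Forwarded-For header (after passing through the ELB). Remember to set ip_source_type: header; otherwise the default origin-source (the directly connected peer) is used, you get the ELB's internal IP, and the allowlist is useless. Only rely on the header when the load balancer in front sets or overwrites it, because a client can forge it otherwise.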
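
The allowlist logic can be sketched with Python's ipaddress module. This is an illustration of the idea, not the plugin's implementation: the function names are made up, and taking the left-most X-Forwarded-For entry as the client address is an assumption that only holds when the hop in front of the gateway is trusted.

```python
import ipaddress

ALLOW = ["10.0.0.0/8", "172.16.0.0/12", "203.0.113.50"]


def client_ip_from_xff(xff_value):
    """Take the left-most entry of X-Forwarded-For as the client address.
    Assumes the header was set by a trusted load balancer."""
    first = xff_value.split(",")[0].strip()
    return ipaddress.ip_address(first)


def is_allowed(ip, allow_cidrs):
    """True when the address falls inside any allowed IP or CIDR range.
    A bare IP such as 203.0.113.50 is treated as a /32 network."""
    return any(ip in ipaddress.ip_network(cidr, strict=False) for cidr in allow_cidrs)


for header in ("203.0.113.50, 10.0.3.7", "10.1.2.3", "198.51.100.9"):
    ip = client_ip_from_xff(header)
    print(header, "->", "allow" if is_allowed(ip, ALLOW) else "deny")
```

Running the same check against the ELB's own internal address shows why origin-source defeats the allowlist: every request would appear to come from inside 10.0.0.0/8.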

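6.3 Bot detection

Higress has a built-in bot-detect plugin that identifies and blocks common crawlers and scanners (configured through the Higress Console or the WasmPlugin CRD):

# bot-detect plugin configuration
# Identifies crawlers by User-Agent (the plugin ships with a default crawler rule set)
# The deny field adds extra blocking rules
deny:
  - "(sqlmap|nikto|nmap|masscan|zgrab)"
  - "(scrapy|python-requests)"
# The allow field lets legitimate crawlers through (overrides the default blocking rules)
allow:
  - "(Googlebot|Bingbot|baiduspider)"
blocked_code: 403

Note: do not blindly block every bot. Search engine crawlers (Googlebot, Bingbot) are essential for SEO and belong on the allowlist.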
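
The allow-overrides-deny behavior described above can be sketched as follows. It is a simplified model rather than the plugin's code: the built-in default rule set is left out, the function name is invented, and Python's re module may differ from the plugin's regex flavor in details such as case handling.

```python
import re

DENY = [r"(sqlmap|nikto|nmap|masscan|zgrab)", r"(scrapy|python-requests)"]
ALLOW = [r"(Googlebot|Bingbot|baiduspider)"]


def is_blocked(user_agent, allow=ALLOW, deny=DENY):
    """Allow rules win over deny rules; anything matching neither passes.
    The plugin's built-in default rules are not modeled here."""
    if any(re.search(pattern, user_agent) for pattern in allow):
        return False
    return any(re.search(pattern, user_agent) for pattern in deny)


samples = [
    "sqlmap/1.7.2#stable (https://sqlmap.org)",
    "python-requests/2.31.0",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (X11; Linux x86_64) Firefox/125.0",
]
for ua in samples:
    print("403" if is_blocked(ua) else "pass", ua)
```

Testing candidate patterns offline like this helps avoid blocking legitimate clients, for example internal tools that happen to use python-requests.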

6.4 Rate limiting

Rate limiting keeps traffic bursts from taking down backend services and is a basic SRE skill:

# Route-level rate limiting
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: rate-limited-api
  annotations:
    # At most 100 requests per second per gateway instance
    higress.io/route-limit-rps: "100"
    # Burst multiplier: allows 500 requests in an instant (100 * 5, default is 5)
    higress.io/route-limit-burst-multiplier: "5"
spec:
  ingressClassName: higress
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /api/public
            pathType: Prefix
            backend:
              service:
                name: public-api
                port:
                  number: 80

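Note that this is a per-instance limit: every Higress Gateway Pod counts independently. With 2 Gateway replicas the effective total limit is 200 rps. If you need a global limit, the commercial edition of Higress offers distributed rate limiting, and with the open-source edition you can build it on Redis (for example with the Redis-backed cluster-key-rate-limit plugin).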
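
The arithmetic behind the per-instance limit is worth making explicit when sizing the annotation values. A small sketch under the assumption that the load balancer spreads traffic evenly across gateway Pods; the function names are illustrative.

```python
def effective_limits(rps_per_pod, burst_multiplier, gateway_replicas):
    """Cluster-wide numbers that follow from a per-Pod limit,
    assuming traffic is spread evenly across gateway Pods."""
    return {
        "cluster_rps": rps_per_pod * gateway_replicas,
        "burst_per_pod": rps_per_pod * burst_multiplier,
        "cluster_burst": rps_per_pod * burst_multiplier * gateway_replicas,
    }


def per_pod_rps_for_target(target_cluster_rps, gateway_replicas):
    """Per-Pod value to configure so the cluster total stays at or below the target."""
    return max(1, target_cluster_rps // gateway_replicas)


print(effective_limits(rps_per_pod=100, burst_multiplier=5, gateway_replicas=2))
print(per_pod_rps_for_target(target_cluster_rps=100, gateway_replicas=2))
```

Keep in mind that the cluster total changes whenever the gateway Deployment is scaled, so the annotation needs revisiting if an autoscaler manages the replicas.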

6.5 TLS hardening

# Enforce TLS 1.3 and disable insecure older versions
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: secure-app
  annotations:
    higress.io/ssl-redirect: "true"
    higress.io/tls-min-protocol-version: "TLSv1.3"
    # Security response headers
    higress.io/response-header-control-add: |
      Strict-Transport-Security max-age=31536000;includeSubDomains
      X-Content-Type-Options nosniff
      X-Frame-Options DENY
      Content-Security-Policy default-src 'self'
spec:
  ingressClassName: higress
  rules:
    - host: secure.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: secure-service
                port:
                  number: 80
  tls:
    - hosts:
        - secure.example.com
      secretName: secure-tls

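6.6 mTLS (mutual TLS)

At the heart of a zero-trust architecture: not only does the server prove who it is, the client does too. It suits calls between internal services, partner API integrations and similar cases:

# mTLS between clients and the gateway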
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: mtls-api
  annotations:
    # CA certificate Secret name = TLS Secret name + "-cacert"
    higress.io/auth-tls-secret: "mtls-tls-cacert"
spec:
  ingressClassName: higress
  rules:
    - host: partner-api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: partner-service
                port:
                  number: 80
  tls:
    - hosts:
        - partner-api.example.com
      secretName: mtls-tls

---
# mTLS between the gateway and backend services
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: backend-mtls
  annotations:
    higress.io/backend-protocol: "HTTPS"
    higress.io/proxy-ssl-secret: "default/gateway-client-cert"
    higress.io/proxy-ssl-server-name: "on"
    higress.io/proxy-ssl-name: "backend.internal"
spec:
  ingressClassName: higress
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /secure-backend
            pathType: Prefix
            backend:
              service:
                name: secure-backend
                port:
                  number: 443

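7. Advanced traffic policies

7.1 Load balancing algorithms

The default round robin (round_robin) does not fit every scenario. Higress supports several load balancing policies:

# Least connections: for long-lived connections or services with uneven processing time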
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: least-conn-api
  annotations:
    higress.io/load-balance: "least_conn"
spec:
  ingressClassName: higress
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /api/compute
            pathType: Prefix
            backend:
              service:
                name: compute-service
                port:
                  number: 80

---
# Consistent hashing: requests from the same user always land on the same backend Pod
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: sticky-api
  annotations:
    # Hash on the user ID request header
    higress.io/upstream-hash-by: "$http_x-user-id"
spec:
  ingressClassName: higress
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /api/session
            pathType: Prefix
            backend:
              service:
                name: session-service
                port:
                  number: 80

7.2 Session affinity (cookie affinity)

# Cookie-based session affinity
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: session-sticky
  annotations:
    higress.io/affinity: "cookie"
    higress.io/session-cookie-name: "SERVERID"
    higress.io/session-cookie-path: "/"
    higress.io/session-cookie-max-age: "3600"  # expires after 1 hour
spec:
  ingressClassName: higress
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-app
                port:
                  number: 80

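On the first visit Higress sets a SERVERID cookie in the response, and later requests carrying that cookie are routed to the same backend Pod. This suits legacy applications that have not finished externalizing their sessions.

7.3 gRPC and WebSocket support

# gRPC service route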
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grpc-service
  annotations:
    higress.io/backend-protocol: "GRPC"
spec:
  ingressClassName: higress
  rules:
    - host: grpc.example.com
      http:
        paths:
          - path: /mypackage.MyService
            pathType: Prefix
            backend:
              service:
                name: grpc-backend
                port:
                  number: 50051

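Because Higress is built on Envoy, HTTP/2 and gRPC are supported natively. The single backend-protocol annotation above is all it takes, without the pile of extra annotations that usually accompany nginx.ingress.kubernetes.io/backend-protocol: "GRPC" on nginx-ingress. If the Port Name of your K8s Service is grpc, Higress even detects the protocol automatically and the annotation can be dropped.

8. Observability: you can only manage what you can see

8.1 Hooking up Prometheus metrics

Being Envoy-based, Higress exposes rich Prometheus metrics out of the box. Plugging them into your existing monitoring stack is straightforward:

# If you use the Prometheus Operator, create a ServiceMonitor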
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: higress-gateway-metrics
  namespace: higress-system
spec:
  selector:
    matchLabels:
      app: higress-gateway
  endpoints:
    - port: http-monitoring
      interval: 15s
      path: /stats/prometheus

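Key metrics at a glance:

Metric	Meaning	Suggested alert
envoy_http_downstream_rq_total	Total requests	Alert on sudden spikes/drops
envoy_http_downstream_rq_xx	Requests by status code class	Alert when the 5xx ratio exceeds 1%
envoy_http_downstream_rq_time	Request latency (histogram)	Alert when P99 exceeds 2s
envoy_cluster_upstream_cx_active	Active connections to backends	Alert when approaching the limit
envoy_server_memory_allocated	Gateway memory usage	Alert above 80% of the container memory limit

If you enabled the observability suite when installing Higress (global.o11y.enabled=true), Grafana + Prometheus + Loki are deployed automatically and work out of the box.

8.2 End-to-end tracing with OpenTelemetry

Prometheus metrics tell you where things are slow, but to answer why they are slow you need distributed tracing. With Envoy underneath, Higress supports the OpenTelemetry protocol natively, so gateway trace data plugs straight into your observability stack.

[Figure: overall tracing architecture]

The key point: as the traffic entry point, Higress produces the first span of a trace. It generates the trace ID and propagates it to backend services through request headers (W3C Trace Context or B3 format). Backend services only need to extract the trace context from the request headers and keep propagating it, and the whole chain links up.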
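
To make the propagation concrete, here is a minimal sketch of parsing the traceparent header in its version 00 layout from the W3C Trace Context specification (version, trace ID, parent span ID, flags). In practice the OpenTelemetry SDK does this for you; the class and function names below are illustrative only.

```python
import re
from typing import NamedTuple, Optional


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


# version(2 hex) - trace-id(32 hex) - parent-id(16 hex) - flags(2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(value: str) -> Optional[TraceParent]:
    """Parse a version-00 style traceparent header; return None when invalid."""
    match = _TRACEPARENT.match(value.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(parse_traceparent(header))
```

A backend that continues the trace keeps the trace_id, creates its own span ID, and sends that as the parent ID in the traceparent header of its outgoing calls.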

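Step 1: Deploy the OpenTelemetry Collector

The OTel Collector is the hub of the whole observability pipeline. It receives telemetry from Higress and the backend services, processes it in one place, and exports it to the storage backends.

# Add the OpenTelemetry Helm repository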
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update

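# otel-collector-values.yaml
mode: deployment
replicaCount: 2
# Pin the resource name so the Service is reachable as otel-collector.monitoring.svc,
# the address used in the rest of this guide
fullnameOverride: otel-collector
# Recent chart versions require an explicit image; the contrib build includes
# the prometheusremotewrite and loki exporters used below
image:
  repository: otel/opentelemetry-collector-contrib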

config:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318

  processors:
    batch:
      timeout: 5s
      send_batch_size: 1024
    memory_limiter:
      check_interval: 5s
      limit_mib: 512
      spike_limit_mib: 128
    # Tag all data with cluster labels
    resource:
      attributes:
        - key: k8s.cluster.name
          value: "my-eks-cluster"
          action: upsert
        - key: deployment.environment
          value: "production"
          action: upsert

  exporters:
    # Export traces to Tempo
    otlp/tempo:
      endpoint: "tempo.monitoring.svc.cluster.local:4317"
      tls:
        insecure: true
    # Export metrics to Prometheus via remote write
    # (Prometheus must run with --web.enable-remote-write-receiver)
    prometheusremotewrite:
      endpoint: "http://prometheus.monitoring.svc.cluster.local:9090/api/v1/write"
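    # Export logs to Loki (newer Collector releases deprecate this exporter
    # in favor of sending OTLP to Loki, so check your Collector version)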
    loki:
      endpoint: "http://loki.monitoring.svc.cluster.local:3100/loki/api/v1/push"
    # For debugging: print to stdout
    debug:
      verbosity: basic

  service:
    pipelines:
      traces:
        receivers: [otlp]
        processors: [memory_limiter, resource, batch]
        exporters: [otlp/tempo]
      metrics:
        receivers: [otlp]
        processors: [memory_limiter, resource, batch]
        exporters: [prometheusremotewrite]
      logs:
        receivers: [otlp]
        processors: [memory_limiter, resource, batch]
        exporters: [loki]

# Install the OTel Collector
helm install otel-collector open-telemetry/opentelemetry-collector \
  -n monitoring --create-namespace \
  -f otel-collector-values.yaml

# Verify the Collector is running
kubectl get pods -n monitoring -l app.kubernetes.io/name=opentelemetry-collector
# NAME                              READY   STATUS    RESTARTS
# otel-collector-xxx                1/1     Running   0
# otel-collector-yyy                1/1     Running   0

# Confirm the OTLP ports are exposed
kubectl get svc -n monitoring -l app.kubernetes.io/name=opentelemetry-collector
# NAME             TYPE        CLUSTER-IP     PORT(S)
# otel-collector   ClusterIP   172.20.x.x     4317/TCP,4318/TCP

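Step 2: Deploy Tempo (trace storage backend)

Grafana Tempo is one of the most lightweight distributed tracing backends available. It only needs object storage (S3) to run, with no heavyweight dependency such as Elasticsearch or Cassandra.

# Add the Grafana Helm repository
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# tempo-values.yaml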
tempo:
  storage:
    trace:
      backend: s3
      s3:
        bucket: my-eks-tempo-traces
        endpoint: s3.ap-southeast-1.amazonaws.com
        region: ap-southeast-1
        # Authenticate with IRSA (IAM Roles for Service Accounts)
        # Do not hard-code access keys
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
  # The single-binary chart takes one retention duration; keep traces for 7 days
  retention: 168h

# Use distributed mode for production
# Monolithic mode is enough for small clusters
replicas: 1

# Install Tempo
helm install tempo grafana/tempo \
  -n monitoring \
  -f tempo-values.yaml

# For large production environments, use distributed mode:
# helm install tempo grafana/tempo-distributed \
#   -n monitoring \
#   -f tempo-distributed-values.yaml

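Tempo stores trace data in S3, which keeps costs very low. For a cluster with moderate traffic (10 million spans a day), the monthly S3 bill is roughly 5-10 US dollars, about an order of magnitude cheaper than running Jaeger + Elasticsearch.

Step 3: Configure Higress to send trace data

Being Envoy-based, Higress has its tracing configured through EnvoyFilter or the Istio API. If you enabled Istio API support when installing Higress (global.enableIstioAPI=true), you can use Istio's Telemetry CRD:

# Make sure Istio API support is enabled in Higress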
helm upgrade higress higress.io/higress \
  -n higress-system \
  --set global.enableIstioAPI=true \
  --reuse-values

# Install the Istio CRDs (if not installed yet)
helm repo add istio https://istio-release.storage.googleapis.com/charts
helm install istio-base istio/base -n istio-system --create-namespace

# higress-tracing.yaml
# Configure Higress tracing through the Istio Telemetry API
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: higress-tracing
  namespace: higress-system
spec:
  tracing:
    - providers:
        - name: opentelemetry
      # Sampling rate: 1-10% is recommended for production, 100% is fine for testing
      randomSamplingPercentage: 5.0
      customTags:
        # Custom tags that make filtering in Grafana easier
        gateway.name:
          literal:
            value: "higress"
        k8s.cluster:
          literal:
            value: "my-eks-cluster"

---
# Configure the OTel provider
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: otel-tracing-provider
  namespace: higress-system
spec:
  configPatches:
    - applyTo: BOOTSTRAP
      patch:
        operation: MERGE
        value:
          tracing:
            http:
              name: envoy.tracers.opentelemetry
              typed_config:
                "@type": type.googleapis.com/envoy.config.trace.v3.OpenTelemetryConfig
                grpc_service:
                  envoy_grpc:
                    cluster_name: otel-collector
                  timeout: 1s
                service_name: higress-gateway
    - applyTo: CLUSTER
      patch:
        operation: ADD
        value:
          name: otel-collector
          type: STRICT_DNS
          connect_timeout: 1s
          lb_policy: ROUND_ROBIN
          typed_extension_protocol_options:
            envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
              "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
              explicit_http_config:
                http2_protocol_options: {}
          load_assignment:
            cluster_name: otel-collector
            endpoints:
              - lb_endpoints:
                  - endpoint:
                      address:
                        socket_address:
                          address: otel-collector.monitoring.svc.cluster.local
                          port_value: 4317

# Apply the configuration
kubectl apply -f higress-tracing.yaml

# Verify that Envoy has loaded the tracing configuration
kubectl exec -n higress-system deploy/higress-gateway -- \
  curl -s localhost:15000/config_dump | grep -i opentelemetry
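# The opentelemetry tracer configuration should show up in the output

Step 4: Configure Grafana to correlate the three signals

The ultimate goal of observability is to correlate the three signals: metrics, traces, and logs. Set up the data source links in Grafana:

# grafana-datasources.yaml (ConfigMap)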
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
data:
  datasources.yaml: |
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        # Explicit uid so the Tempo data source can reference it below
        uid: prometheus
        url: http://prometheus.monitoring.svc.cluster.local:9090
        isDefault: true
        jsonData:
          # Link to Tempo: jump from metrics to a trace
          exemplarTraceIdDestinations:
            - name: traceID
              datasourceUid: tempo
      - name: Tempo
        type: tempo
        uid: tempo
        url: http://tempo.monitoring.svc.cluster.local:3100
        jsonData:
          # Link to Loki: jump from a trace to logs
          tracesToLogs:
            datasourceUid: loki
            filterByTraceID: true
            filterBySpanID: true
          # Link to Prometheus: jump from a trace to metrics
          tracesToMetrics:
            datasourceUid: prometheus
            queries:
              - name: "Request Rate"
                query: "rate(envoy_http_downstream_rq_total{$__tags}[5m])"
          # Search for traces through Loki logs
          lokiSearch:
            datasourceUid: loki
      - name: Loki
        type: loki
        uid: loki
        url: http://loki.monitoring.svc.cluster.local:3100
        jsonData:
          derivedFields:
            # Extract the traceID from log lines and link it to Tempo
            - name: traceID
              matcherRegex: "traceID=(\\w+)"
              url: "$${__value.raw}"
              datasourceUid: tempo

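With this in place, a troubleshooting session in Grafana can look like this:

1. Prometheus alert: P99 latency for api.example.com spikes to 5s
   |
   v
2. Click the exemplar to jump to Tempo and open the trace of a specific slow request
   |
   v
3. The trace shows: Higress Gateway -> user-service (50ms) -> order-service (4.8s!)
   |
   v
4. Click the order-service span and jump to Loki for the logs in that time window
   |
   v
5. The logs show: the database connection pool is exhausted and requests are queuing for a connection
   |
   v
6. Root cause found; fix the database connection pool configuration

That is what correlating metrics -> traces -> logs buys you. Without tracing you only know that something is slow; with tracing you can pin down which operation in which service is slow, and why.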
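The jump from Loki logs back to Tempo depends entirely on the matcherRegex in the Loki data source matching your log format. The sketch below applies the same pattern in Python to two hypothetical log lines (the log format and the trace ID are made up for illustration), which is a quick way to check whether your services log the trace ID in a form the derived field will pick up.

```python
import re
from typing import Optional

# Same pattern as the matcherRegex in the Loki data source.
# In the YAML file the backslash is doubled ("\\w"); the regex itself is traceID=(\w+).
TRACE_ID_PATTERN = re.compile(r"traceID=(\w+)")


def extract_trace_id(log_line: str) -> Optional[str]:
    """Return the first trace ID found in a log line, or None if nothing matches."""
    match = TRACE_ID_PATTERN.search(log_line)
    return match.group(1) if match else None


# Hypothetical log lines, for illustration only.
MATCHING_LINE = 'level=error traceID=4bf92f3577b34da6a3ce929d0e0e4736 msg="pool exhausted"'
NON_MATCHING_LINE = 'level=error trace_id=4bf92f3577b34da6a3ce929d0e0e4736 msg="pool exhausted"'

print(extract_trace_id(MATCHING_LINE))
print(extract_trace_id(NON_MATCHING_LINE))
```

If your services log the ID under a different key (for example trace_id, as in the second line), either change the log format or adjust matcherRegex accordingly; otherwise the link in Grafana simply never appears.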

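Trace propagation in backend services

The Trace Context generated by Higress is passed to backend services through HTTP request headers. Each backend has to do two things: extract the Trace Context, and keep propagating it.

Higress uses the W3C Trace Context format by default (the traceparent and tracestate request headers). Backend services only need to adopt the OpenTelemetry SDK: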

# Python example (Flask + OpenTelemetry)
# pip install flask opentelemetry-api opentelemetry-sdk \
#   opentelemetry-instrumentation-flask \
#   opentelemetry-exporter-otlp

from flask import Flask
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.sdk.resources import Resource

# Configure the tracer
resource = Resource.create({"service.name": "user-service"})
provider = TracerProvider(resource=resource)
exporter = OTLPSpanExporter(
    endpoint="otel-collector.monitoring.svc.cluster.local:4317",
    insecure=True
)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

# Flask auto-instrumentation: extracts the Trace Context from incoming request
# headers and creates server spans under it
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)

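// Go example (gin + OpenTelemetry)
// go get go.opentelemetry.io/otel
// go get go.opentelemetry.io/contrib/instrumentation/github.com/gin-gonic/gin/otelgin

import (
    "context"
    "log"

    "github.com/gin-gonic/gin"
    "go.opentelemetry.io/contrib/instrumentation/github.com/gin-gonic/gin/otelgin"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/propagation"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

// initTracer sets up the global tracer provider and returns a shutdown function.
func initTracer(ctx context.Context) func() {
    exporter, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("otel-collector.monitoring.svc.cluster.local:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        log.Fatalf("create OTLP exporter: %v", err)
    }
    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceName("order-service"),
        )),
    )
    otel.SetTracerProvider(tp)
    // The Go SDK does not register a propagator by default. Set W3C Trace Context
    // explicitly, otherwise the traceparent header sent by Higress is ignored.
    otel.SetTextMapPropagator(propagation.TraceContext{})
    return func() { _ = tp.Shutdown(ctx) }
}

// In main(): the Gin middleware extracts the incoming Trace Context and starts a server span
r := gin.Default()
r.Use(otelgin.Middleware("order-service"))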

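Core principle: the gateway starts the trace, the backends carry it on. As long as every service is instrumented with the OTel SDK and forwards the traceparent header on its own outgoing calls, the whole call chain shows up end to end in Grafana.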
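When a trace breaks between the gateway and a backend, it helps to look at the raw traceparent header. The sketch below parses the version-00 layout defined by the W3C Trace Context specification (version, trace ID, parent span ID, flags). It is a debugging aid under that assumption, not a replacement for the SDK's propagator, and the header value used here is the example from the specification.

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex)
_TRACEPARENT_RE = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a version-00 traceparent header; return None if it is malformed."""
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The specification forbids version ff and all-zero IDs.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    # Bit 0 of trace-flags is the "sampled" flag.
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


EXAMPLE_HEADER = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(parse_traceparent(EXAMPLE_HEADER))
```

If every hop logs the same trace_id while parent_id changes from service to service, propagation is working. A service that logs a different trace_id has started a new trace instead of continuing the one from the gateway.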

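Sampling strategy recommendations

Sampling 100% of traffic in production is rarely practical: the data volume is too large and storage costs balloon. Recommended sampling rates:

Environment	Sampling rate	Notes
Dev/test	100%	Full sampling, convenient for debugging
Staging	20-50%	Load tests need enough samples
Production (low traffic)	10-20%	Fewer than 1 million requests per day
Production (high traffic)	1-5%	More than 10 million requests per day
Error requests	100%	Every 5xx response must be sampled

Tail sampling in the OTel Collector makes sure error requests are always kept. Because the decision is made after a trace has completed, the gateway and the services have to export all spans to the Collector and leave the dropping to it: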

# OTel Collector tail sampling configuration
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      # Always keep traces that contain an error
      - name: errors-policy
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Always keep traces that take longer than 2 seconds
      - name: latency-policy
        type: latency
        latency:
          threshold_ms: 2000
      # Sample everything else at 5%
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 5

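With this configuration, normal requests are sampled at 5%, while traces that contain an error or exceed the latency threshold are kept in full, provided they complete within decision_wait and fit in the num_traces buffer. When you troubleshoot, you should almost never find that the one request that went wrong happened not to be sampled.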
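The three policies are evaluated per trace, and a trace is kept as soon as any one of them says "sample". The sketch below models that decision in plain Python so you can estimate how much data a given policy set retains. It is a simplification under stated assumptions: the TraceSummary type is hypothetical, and the random draw stands in for the Collector's own probabilistic sampling, which works differently internally.

```python
import random
from dataclasses import dataclass


@dataclass
class TraceSummary:
    """Hypothetical summary of a finished trace, for illustration only."""
    has_error: bool
    duration_ms: float


def keep_trace(
    trace: TraceSummary,
    rng: random.Random,
    latency_threshold_ms: float = 2000,
    sampling_percentage: float = 5.0,
) -> bool:
    """Return True if any of the three policies decides to keep the trace."""
    if trace.has_error:                           # errors-policy
        return True
    if trace.duration_ms > latency_threshold_ms:  # latency-policy
        return True
    # probabilistic-policy: keep roughly sampling_percentage of the rest
    return rng.random() * 100 < sampling_percentage


rng = random.Random(42)
healthy = [TraceSummary(has_error=False, duration_ms=50) for _ in range(10_000)]
kept = sum(keep_trace(t, rng) for t in healthy)
print(f"kept {kept} of {len(healthy)} healthy traces")
```

Running the same function over a sample of your real traffic mix (error ratio, latency distribution) gives a rough idea of the storage volume before you change the Collector configuration.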

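Long-term metrics storage: Thanos

The setup above stores metrics in Prometheus, which has two built-in limitations: a single-cluster view and limited storage. If you run several EKS clusters, or need to keep more than 15 days of metrics history, you need Thanos.

Thanos is a high-availability and long-term storage extension for Prometheus; comparable options include Grafana Mimir and VictoriaMetrics. It runs a Sidecar next to each Prometheus instance, uploads the data to S3, and exposes a single cross-cluster PromQL endpoint through the Thanos Query component: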

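# thanos-sidecar configuration (added to the Prometheus StatefulSet)
# With kube-prometheus-stack, set this in the Helm values:
prometheus:
  prometheusSpec:
    thanos:
      # Enable the Thanos Sidecar
      image: quay.io/thanos/thanos:v0.36.1
      objectStorageConfig:
        existingSecret:
          name: thanos-objstore-config
          key: objstore.yml
    # Keep only a short local window (older data is served from S3 by Thanos).
    # Thanos recommends at least 6h (3x the 2h block duration) as a buffer for upload delays.
    retention: 6h
    # External labels that distinguish data from different clusters
    externalLabels:
      cluster: my-eks-cluster
      region: ap-southeast-1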

# thanos-objstore-config Secret
# S3 storage configuration (prefer IRSA for authentication; do not hardcode access keys)
apiVersion: v1
kind: Secret
metadata:
  name: thanos-objstore-config
  namespace: monitoring
stringData:
  objstore.yml: |
    type: S3
    config:
      bucket: my-eks-thanos-metrics
      endpoint: s3.ap-southeast-1.amazonaws.com
      region: ap-southeast-1

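# Deploy Thanos Query (the unified query endpoint)
# Assumes the bitnami Helm repo has already been added.
# Store Gateway and Compactor read the bucket config from the same Secret.
helm install thanos bitnami/thanos \
  -n monitoring \
  --set existingObjstoreSecret=thanos-objstore-config \
  --set query.enabled=true \
  --set "query.stores[0]=dnssrv+_grpc._tcp.prometheus-operated.monitoring.svc.cluster.local" \
  --set storegateway.enabled=true \
  --set compactor.enabled=true \
  --set compactor.retentionResolutionRaw=30d \
  --set compactor.retentionResolution5m=90d \
  --set compactor.retentionResolution1h=1y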

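Once this is in place, point the Grafana data source at the Thanos Query address instead of Prometheus. The query language stays the same (still PromQL), but you can now:

Query Higress gateway metrics across multiple EKS clusters
Look at data from 90 days or even a year ago (stored in S3 at very low cost)
Let the Compactor downsample automatically: older data goes from the raw scrape resolution to 5-minute and 1-hour resolution, which keeps long-range queries fast and, together with the shorter raw retention above, saves storage

If you run a single cluster and 15 days of metrics is enough, plain Prometheus is fine and you do not need Thanos. For multi-cluster setups, or compliance requirements that call for long retention, Thanos is one of the most mature options available.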
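To make the downsampling step more concrete, here is a minimal sketch of the general idea: raw samples are grouped into fixed 5-minute windows and each window keeps a few aggregates instead of every point. This illustrates the concept only; it is not Thanos's actual block format or algorithm, and the 15-second scrape interval is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def downsample(
    samples: Iterable[Tuple[int, float]], window_s: int = 300
) -> Dict[int, Dict[str, float]]:
    """Group (unix_timestamp, value) samples into windows and aggregate each one."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        window_start = timestamp - timestamp % window_s
        buckets[window_start].append(value)
    return {
        start: {
            "count": len(values),
            "sum": sum(values),
            "min": min(values),
            "max": max(values),
        }
        for start, values in sorted(buckets.items())
    }


# Ten minutes of a gauge scraped every 15 seconds (assumed interval): 40 raw samples.
RAW_SAMPLES = [(t, float(t)) for t in range(0, 600, 15)]
DOWNSAMPLED = downsample(RAW_SAMPLES)
print(len(RAW_SAMPLES), "raw samples ->", len(DOWNSAMPLED), "windows")
```

Keeping count, sum, min, and max per window is what lets a dashboard still answer "average" and "peak" questions over a year of data without reading every raw point.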

9. Production best practices

9.1 High-availability deployment

# Recommended production settings
helm upgrade higress higress.io/higress \
  -n higress-system \
  --set higress-core.gateway.replicas=3 \
  --set higress-core.controller.replicas=2 \
  --set higress-core.gateway.resources.requests.cpu=500m \
  --set higress-core.gateway.resources.requests.memory=512Mi \
  --set higress-core.gateway.resources.limits.cpu=2 \
  --set higress-core.gateway.resources.limits.memory=2Gi \
  --reuse-values

# Add a PodDisruptionBudget so that voluntary disruptions (node drains, cluster upgrades) never take down all gateway Pods at once
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: higress-gateway-pdb
  namespace: higress-system
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: higress-gateway

# Add TopologySpreadConstraints to spread the gateway across AZs
# Set this in the Helm values:
higress-core:
  gateway:
    podAnnotations: {}
    topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: higress-gateway
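To see what maxSkew: 1 with DoNotSchedule means in practice, the sketch below works out which zones a new gateway Pod may land in, given where the existing Pods run. It follows the rule from the Kubernetes documentation (Pods in the candidate zone plus the new one, minus the smallest per-zone count, must not exceed maxSkew) but ignores everything else the scheduler considers, such as node capacity and taints. The zone names are examples.

```python
from collections import Counter
from typing import Iterable, List


def allowed_zones(
    pod_zones: Iterable[str], all_zones: Iterable[str], max_skew: int = 1
) -> List[str]:
    """Return the zones where one more Pod can be placed without exceeding max_skew."""
    counts = Counter({zone: 0 for zone in all_zones})  # include empty zones
    counts.update(pod_zones)
    global_min = min(counts.values())
    return sorted(
        zone for zone, n in counts.items() if n + 1 - global_min <= max_skew
    )


ZONES = ["ap-southeast-1a", "ap-southeast-1b", "ap-southeast-1c"]

# Two gateway Pods already run in 1a and 1b: the third must go to 1c.
print(allowed_zones(["ap-southeast-1a", "ap-southeast-1b"], ZONES))
```

One practical consequence: with DoNotSchedule, if the only permitted zone has no schedulable node, the new Pod stays Pending. Use ScheduleAnyway instead if you prefer availability over strict balance.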

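9.2 Certificate management

For TLS certificates on EKS, cert-manager is the recommended way to issue and renew them automatically:

# Install cert-manager
helm repo add jetstack https://charts.jetstack.io
helm repo update
# crds.enabled replaces the deprecated installCRDs flag (cert-manager v1.15+);
# on older chart versions use --set installCRDs=true instead
helm install cert-manager jetstack/cert-manager \
  -n cert-manager --create-namespace \
  --set crds.enabled=true

# Create a Let's Encrypt issuer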
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
      - http01:
          ingress:
            class: higress  # let Higress serve the ACME challenge

---
# Ingress that requests a certificate automatically
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: auto-tls-app
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: higress
  tls:
    - hosts:
        - app.example.com
      secretName: app-tls-auto  # created automatically by cert-manager
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app-service
                port:
                  number: 80

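9.3 Integrating with AWS NLB

By default, a LoadBalancer Service on EKS creates a Classic Load Balancer. Switching to an NLB (Network Load Balancer) is recommended: it performs better and supports static IPs:

# Set the NLB annotations through Helm values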
helm upgrade higress higress.io/higress \
  -n higress-system \
  --set higress-core.gateway.service.annotations."service\.beta\.kubernetes\.io/aws-load-balancer-type"="nlb" \
  --set higress-core.gateway.service.annotations."service\.beta\.kubernetes\.io/aws-load-balancer-scheme"="internet-facing" \
  --set higress-core.gateway.service.annotations."service\.beta\.kubernetes\.io/aws-load-balancer-cross-zone-load-balancing-enabled"="true" \
  --reuse-values

If you have the AWS Load Balancer Controller installed, you can use more advanced NLB features:

# NLB managed by the AWS Load Balancer Controller
service:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "external"
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
    # Attach Elastic IPs (fixed inbound IPs that partners can add to their allowlists)
    service.beta.kubernetes.io/aws-load-balancer-eip-allocations: "eipalloc-xxx,eipalloc-yyy"

10. Pitfalls

To wrap up, here are a few pitfalls that are easy to hit during a real migration:

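Pitfall 1: IngressClass mismatch

With the installation above, Higress only handles Ingress resources with ingressClassName: higress. If your existing Ingress resources do not set ingressClassName (they relied on the default), change the setting:

# Let Higress pick up Ingress resources that do not specify ingressClassName
helm upgrade higress higress.io/higress \
  -n higress-system \
  --set global.ingressClass="" \
  --reuse-values
# Empty string = Higress handles ALL Ingress resources, including those that
# name another class. Avoid this while nginx-ingress still runs in parallel.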
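Before changing global.ingressClass, it is worth knowing exactly which Ingress resources rely on the default class. The sketch below filters the JSON produced by `kubectl get ingress -A -o json` for resources that set neither spec.ingressClassName nor the legacy kubernetes.io/ingress.class annotation. The sample data is hypothetical and only mirrors the shape of the kubectl output.

```python
from typing import Any, Dict, List

LEGACY_CLASS_ANNOTATION = "kubernetes.io/ingress.class"


def ingresses_without_class(ingress_list: Dict[str, Any]) -> List[str]:
    """Return "namespace/name" for every Ingress that does not declare a class."""
    result = []
    for item in ingress_list.get("items", []):
        metadata = item.get("metadata", {})
        annotations = metadata.get("annotations") or {}
        spec = item.get("spec", {})
        if spec.get("ingressClassName") or annotations.get(LEGACY_CLASS_ANNOTATION):
            continue
        result.append(f'{metadata.get("namespace", "default")}/{metadata.get("name")}')
    return result


# Hypothetical sample with the same shape as `kubectl get ingress -A -o json`.
SAMPLE = {
    "items": [
        {"metadata": {"name": "shop", "namespace": "prod"},
         "spec": {"ingressClassName": "nginx"}},
        {"metadata": {"name": "legacy-api", "namespace": "prod",
                      "annotations": {LEGACY_CLASS_ANNOTATION: "nginx"}},
         "spec": {}},
        {"metadata": {"name": "admin", "namespace": "ops"}, "spec": {}},
    ]
}

print(ingresses_without_class(SAMPLE))
```

To run it against a live cluster, save the kubectl output to a file, load it with json.load, and pass the result to the function. Adding an explicit ingressClassName to the resources it reports is usually safer than widening what Higress watches.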

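Pitfall 2: NLB health checks fail

AWS NLB health checks use TCP by default, but the port they need to reach on the Higress Gateway may not be the default 80/443:

# Check which ports the Higress Gateway Service exposes
kubectl get svc higress-gateway -n higress-system -o yaml | grep -A5 ports
# Make sure the NLB health check points at the correct port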

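Pitfall 3: X-Forwarded-For trust chain

An NLB works at layer 4 and does not add X-Forwarded-For itself. But when a CDN, an ALB, or another proxy sits in front of it (or a client sends the header on its own), X-Forwarded-For can carry several IPs by the time the request reaches Higress. The IP allow/deny list plugin has to be configured with the right way to extract the client IP:

# Read the client IP from X-Forwarded-For (intended: the first entry, the original client).
# Caveat: the leftmost entry is whatever the client sent, so it can be spoofed
# unless a trusted proxy in front rewrites the header.
ip_source_type: header
ip_header_name: X-Forwarded-For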
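A common way to avoid trusting a spoofed entry is to count from the right of X-Forwarded-For, skipping only the proxies you control. The sketch below shows that rule in isolation. It is not how the Higress plugin is implemented, and trusted_hops (the number of trusted proxies in front of the gateway that append to the header) is a parameter introduced here for illustration.

```python
import ipaddress
from typing import Optional


def client_ip_from_xff(xff: str, trusted_hops: int) -> Optional[str]:
    """Pick the client IP by counting trusted_hops entries from the right.

    Each trusted proxy appends the address it received the request from, so with
    one trusted proxy the rightmost entry is the real client as seen by that proxy.
    """
    entries = [part.strip() for part in xff.split(",") if part.strip()]
    if trusted_hops < 1 or len(entries) < trusted_hops:
        return None
    candidate = entries[-trusted_hops]
    try:
        return str(ipaddress.ip_address(candidate))
    except ValueError:
        return None  # not a valid IPv4/IPv6 address


# The client forged "6.6.6.6"; the trusted CDN appended the address it actually saw.
SPOOFED_HEADER = "6.6.6.6, 203.0.113.7"
print(client_ip_from_xff(SPOOFED_HEADER, trusted_hops=1))
```

Taking the leftmost entry would return the forged 6.6.6.6 here, which is exactly how IP allowlists get bypassed. If nothing in front of Higress appends to the header, use the connection's source address rather than any header value.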

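Pitfall 4: Wasm plugin memory limits

Higress Wasm plugins run in a sandbox with a fairly small default memory limit. A plugin with complex logic can run out of memory:

# Check the logs related to Wasm plugins
kubectl logs -n higress-system -l app=higress-gateway | grep -i wasm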

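Pitfall 5: Gateway API CRD version

To use Gateway API, install the matching CRD version first. According to the official Higress documentation, the highest supported Gateway API version is currently v1.0.0. Use --server-side=true when installing: the CRD manifests are large, and a client-side apply can fail because the whole manifest has to fit into the last-applied-configuration annotation:

# Install the Gateway API CRDs (v1.0.0 is the highest version Higress supports)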
kubectl apply --server-side=true -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.0.0/experimental-install.yaml

# Experimental features (TCPRoute, TLSRoute, and so on) are already included in experimental-install.yaml
# If you only need the standard features, you can use standard-install.yaml instead:
# kubectl apply --server-side=true -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.0.0/standard-install.yaml

# Enable Gateway API support in Higress
helm upgrade higress higress.io/higress \
  -n higress-system \
  --set global.enableGatewayAPI=true \
  --reuse-values

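Summary

A recap of what this article covered:

Module	Core capability	Key configuration
Installation	One-command Helm install, built-in Console	helm install + global parameters
Smooth migration	Two gateways in parallel, gradual cutover	Separation by ingressClassName
Authentication and authorization	JWT / API Key / OAuth2 / ext-auth	Plugin configuration + allow list
Traffic management	Rewrite / passthrough / canary / redirect / timeout and retry	Ingress annotations
Security	CORS / IP restriction / bot detection / rate limiting / TLS / mTLS	Plugins + annotations
Advanced policies	Load balancing / session affinity / gRPC	Annotation configuration
Observability	Prometheus metrics + Grafana + OTel tracing + Thanos long-term storage	ServiceMonitor + OTel Collector + Thanos Sidecar
Production hardening	High availability / automated certificates / NLB integration	PDB + cert-manager + NLB annotations

Higress is not the only option. Envoy Gateway, Traefik, and Kong are all solid alternatives. But if you are looking for a gateway that is compatible with the Ingress API, has a rich plugin ecosystem, and is backed by a Chinese-speaking community, Higress is worth a try.

Migration does not happen overnight. Start in a test environment, verify your existing Ingress configurations one by one, and only switch production once everything checks out. The gateway is the choke point for all your traffic, so stability comes first.

One last word for anyone still on the fence: the maintenance window for ingress-nginx has already closed (March 2026). If your cluster is still running an unmaintained ingress controller, now is the time to migrate. Every extra day of delay adds security risk.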