# Metrics and dimensions in Managed Service for Apache Flink
<a name="metrics-dimensions"></a>

When Managed Service for Apache Flink processes a data source, it reports the following metrics and dimensions to Amazon CloudWatch.

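**Flink 2.2 metric changes**  
Flink 2.2 introduces metric changes that can affect your monitoring and alarms. Review the following changes before you upgrade:
+ The `fullRestarts` metric has been removed. Use `numRestarts` instead.
+ The `uptime` and `downtime` metrics are deprecated and will be removed in a future release. Migrate to the new state-specific metrics.
+ The `bytesRequestedPerFetch` metric has been removed from the Kinesis Data Streams connector 6.0.0.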

## Application metrics
<a name="metrics-dimensions-jobs"></a>


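| Metric | Unit | Description | Level | Usage notes | 
| --- | --- | --- | --- | --- | 
| backPressuredTimeMsPerSecond\* | Milliseconds | The time (in milliseconds) this task or operator is back pressured per second. | Task, Operator, Parallelism | \*Available only for Managed Service for Apache Flink applications running Flink version 1.13. These metrics can be useful in identifying bottlenecks in an application. | 
| busyTimeMsPerSecond\* | Milliseconds | The time (in milliseconds) this task or operator is busy (neither idle nor back pressured) per second. Can be NaN if the value could not be calculated. | Task, Operator, Parallelism | \*Available only for Managed Service for Apache Flink applications running Flink version 1.13. These metrics can be useful in identifying bottlenecks in an application. | 
| cpuUtilization | Percentage | Overall percentage of CPU utilization across task managers. For example, if there are five task managers, Managed Service for Apache Flink publishes five samples of this metric per reporting interval. | Application | You can use this metric to monitor minimum, average, and maximum CPU utilization in your application. The `cpuUtilization` metric only accounts for CPU usage of the TaskManager JVM process running inside the container. | 
| containerCPUUtilization | Percentage | Overall percentage of CPU utilization across task manager containers in the Flink application cluster. For example, if there are five task managers, there are correspondingly five task manager containers, and Managed Service for Apache Flink publishes 2 \* 5 samples of this metric per 1-minute reporting interval. | Application | It is calculated per container as: *Total CPU time (in seconds) consumed by the container \* 100 / Container CPU limit (in CPUs/seconds)* The `cpuUtilization` metric only accounts for CPU usage of the TaskManager JVM process running inside the container. Other components run outside the JVM within the same container. The `containerCPUUtilization` metric gives a more complete picture: it covers every process that contributes to CPU exhaustion in the container, and the failures that result from it.  | 
| containerMemoryUtilization | Percentage | Overall percentage of memory utilization across task manager containers in the Flink application cluster. For example, if there are five task managers, there are correspondingly five task manager containers, and Managed Service for Apache Flink publishes 2 \* 5 samples of this metric per 1-minute reporting interval. | Application | It is calculated per container as: *Container memory usage (bytes) \* 100 / Container memory limit as per the pod deployment spec (bytes)* The `heapMemoryUtilization` and `managedMemoryUtilization` metrics only account for specific memory metrics, such as the heap memory usage of the TaskManager JVM or managed memory (memory usage outside the JVM for native processes such as the [RocksDB state backend](https://flink.apache.org/2021/01/18/rocksdb.html#:~:text=Conclusion-,The%20RocksDB%20state%20backend%20(i.e.%2C%20RocksDBStateBackend)%20is%20one%20of,with%20exactly%2Donce%20processing%20guarantees.)). The `containerMemoryUtilization` metric gives a more complete picture because it includes the working set memory, which is a better tracker of total memory exhaustion. When that memory is exhausted, the task manager pod fails with an `Out of Memory Error`.  | 
| containerDiskUtilization | Percentage | Overall percentage of disk utilization across task manager containers in the Flink application cluster. For example, if there are five task managers, there are correspondingly five task manager containers, and Managed Service for Apache Flink publishes 2 \* 5 samples of this metric per 1-minute reporting interval. | Application | It is calculated per container as: *Disk usage (bytes) \* 100 / Disk limit for the container (bytes)* For containers, it represents the utilization of the file system on which the root volume of the container is set up.  | 
| currentInputWatermark | Milliseconds | The last watermark this application/operator/task/thread has received. | Application, Operator, Task, Parallelism | This record is only emitted for dimensions with two inputs. This is the minimum value of the last received watermarks. | 
| currentOutputWatermark | Milliseconds | The last watermark this application/operator/task/thread has emitted. | Application, Operator, Task, Parallelism |  | 
| downtime (deprecated) | Milliseconds | For jobs currently in a failing or recovering situation, the time elapsed during this outage. | Application | This metric measures the time elapsed while a job is failing or recovering. This metric returns 0 for running jobs and -1 for completed jobs. If this metric is not 0 or -1, the Apache Flink job for the application failed to run. **Deprecated in Flink 2.2.** Use `restartingTime`, `cancellingTime`, and/or `failingTime` instead. | 
| failingTime | Milliseconds | The time (in milliseconds) that the application has spent in a failing state. Use this metric to monitor application failures and trigger alarms. | Application, Flow | Available from Flink 2.2. Replaces part of the deprecated `downtime` metric. | 
| heapMemoryUtilization | Percentage | Overall heap memory utilization across task managers. For example, if there are five task managers, Managed Service for Apache Flink publishes five samples of this metric per reporting interval. | Application | You can use this metric to monitor minimum, average, and maximum heap memory utilization in your application. The `heapMemoryUtilization` metric only accounts for specific memory metrics, such as the heap memory usage of the TaskManager JVM. | 
| idleTimeMsPerSecond\* | Milliseconds | The time (in milliseconds) this task or operator is idle (has no data to process) per second. Idle time excludes back pressured time, so a task that is back pressured is not idle. | Task, Operator, Parallelism | \*Available only for Managed Service for Apache Flink applications running Flink version 1.13. These metrics can be useful in identifying bottlenecks in an application. | 
| lastCheckpointSize | Bytes | The total size of the last checkpoint. | Application | You can use this metric to determine the storage utilization of a running application. If this metric is increasing in value, this may indicate an issue with your application, such as a memory leak or bottleneck. | 
| lastCheckpointDuration | Milliseconds | The time it took to complete the last checkpoint. | Application | This metric measures the time it took to complete the most recent checkpoint. If this metric is increasing in value, this may indicate an issue with your application, such as a memory leak or bottleneck. In some cases, you can troubleshoot this issue by disabling checkpointing. | 
| managedMemoryUsed\* | Bytes | The amount of managed memory currently in use. | Application, Operator, Task, Parallelism | \*Available only for Managed Service for Apache Flink applications running Flink version 1.13. This relates to memory managed by Flink outside the Java heap. It is used for the RocksDB state backend, and is also available to applications. | 
| managedMemoryTotal\* | Bytes | The total amount of managed memory. | Application, Operator, Task, Parallelism | \*Available only for Managed Service for Apache Flink applications running Flink version 1.13. This relates to memory managed by Flink outside the Java heap. It is used for the RocksDB state backend, and is also available to applications. The `managedMemoryUtilization` metric only accounts for specific memory metrics, such as managed memory (memory usage outside the JVM for native processes such as the [RocksDB state backend](https://flink.apache.org/2021/01/18/rocksdb.html#:~:text=Conclusion-,The%20RocksDB%20state%20backend%20(i.e.%2C%20RocksDBStateBackend)%20is%20one%20of,with%20exactly%2Donce%20processing%20guarantees.)). | 
| managedMemoryUtilization\* | Percentage | Derived from managedMemoryUsed/managedMemoryTotal. | Application, Operator, Task, Parallelism | \*Available only for Managed Service for Apache Flink applications running Flink version 1.13. This relates to memory managed by Flink outside the Java heap. It is used for the RocksDB state backend, and is also available to applications. | 
| numberOfFailedCheckpoints | Count | The number of times checkpointing has failed. | Application | You can use this metric to monitor application health and progress. Checkpoints may fail because of application problems, such as throughput or permissions issues. | 
| numRecordsIn\* | Count | The total number of records this application, operator, or task has received. | Application, Operator, Task, Parallelism | \*To apply the SUM statistic over a period of time (second/minute): [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/zh_tw/managed-flink/latest/java/metrics-dimensions.html) The metric's Level specifies whether this metric measures the total number of records that the entire application, a specific operator, or a specific task has received. | 
| numRecordsInPerSecond\* | Count/Second | The total number of records this application, operator, or task has received per second. | Application, Operator, Task, Parallelism | \*To apply the SUM statistic over a period of time (second/minute): [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/zh_tw/managed-flink/latest/java/metrics-dimensions.html) The metric's Level specifies whether this metric measures the total number of records that the entire application, a specific operator, or a specific task has received per second. | 
| numRecordsOut\* | Count | The total number of records this application, operator, or task has emitted. | Application, Operator, Task, Parallelism | \*To apply the SUM statistic over a period of time (second/minute): [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/zh_tw/managed-flink/latest/java/metrics-dimensions.html) The metric's Level specifies whether this metric measures the total number of records that the entire application, a specific operator, or a specific task has emitted. | 
| numLateRecordsDropped\* | Count | The number of records this operator or task has dropped because they arrived late. | Application, Operator, Task, Parallelism | \*To apply the SUM statistic over a period of time (second/minute): [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/zh_tw/managed-flink/latest/java/metrics-dimensions.html) | 
| numRecordsOutPerSecond\* | Count/Second | The total number of records this application, operator, or task has emitted per second. | Application, Operator, Task, Parallelism | \*To apply the SUM statistic over a period of time (second/minute): [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/zh_tw/managed-flink/latest/java/metrics-dimensions.html) The metric's Level specifies whether this metric measures the total number of records that the entire application, a specific operator, or a specific task has emitted per second. | 
| oldGenerationGCCount | Count | The total number of garbage collection operations that have occurred across all task managers. | Application |  | 
| oldGenerationGCTime | Milliseconds | The total time spent performing garbage collection operations. | Application | You can use this metric to monitor total, average, and maximum garbage collection time. | 
| threadsCount | Count | The total number of live threads used by the application. | Application | This metric measures the number of threads used by the application code. This is not the same as application parallelism. | 
| cancellingTime | Milliseconds | The time (in milliseconds) that the application has spent in a cancelling state. Use this metric to monitor application cancellation operations. | Application, Flow | Available from Flink 2.2. Replaces part of the deprecated `downtime` metric. | 
| restartingTime | Milliseconds | The time (in milliseconds) that the application has spent in a restarting state. Use this metric to monitor application restart behavior. | Application, Flow | Available from Flink 2.2. Replaces part of the deprecated `downtime` metric. | 
| runningTime | Milliseconds | The time (in milliseconds) that the application has been running without interruption. Replaces the deprecated `uptime` metric. | Application, Flow | Available from Flink 2.2. Use it as a direct replacement for the deprecated `uptime` metric. | 
| uptime (deprecated) | Milliseconds | The time that the job has been running without interruption. | Application | You can use this metric to determine whether a job is running successfully. This metric returns -1 for completed jobs. **Deprecated in Flink 2.2.** Use `runningTime` instead. | 
| jobmanagerFileDescriptorsMax | Count | The maximum number of file descriptors available to the JobManager. | Application, Flow, Host | Use this metric to monitor file descriptor capacity. | 
| jobmanagerFileDescriptorsOpen | Count | The number of file descriptors currently open in the JobManager. | Application, Flow, Host | Use this metric to monitor file descriptor usage and detect potential resource exhaustion. | 
| taskmanagerFileDescriptorsMax | Count | The maximum number of file descriptors available to each TaskManager. | Application, Flow, Host, tm_id | Use this metric to monitor file descriptor capacity. | 
| taskmanagerFileDescriptorsOpen | Count | The number of file descriptors currently open in each TaskManager. | Application, Flow, Host, tm_id | Use this metric to monitor file descriptor usage and detect potential resource exhaustion. | 
| KPUs\* | Count | The total number of KPUs used by the application. | Application | \*This metric receives one sample per billing period (one hour). To visualize the number of KPUs over time, use MAX or AVG over a period of at least one (1) hour. The KPU count includes the `orchestration` KPU. For more information, see [Managed Service for Apache Flink Pricing](https://aws.amazon.com/managed-service-apache-flink/pricing/). | 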
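
The per-container formulas in the usage notes above all share the shape *used \* 100 / limit*. The following is a minimal sketch of that arithmetic, not the service's implementation. The class and method names are hypothetical, and the CPU variant assumes that the limit is the number of CPU-seconds available in the reporting interval (CPU limit multiplied by interval length), which is one reading of the "CPUs/seconds" unit.

```java
/** Illustrative sketch of the per-container utilization formulas; all names are hypothetical. */
final class ContainerUtilization {
    private ContainerUtilization() {}

    /** Shared shape of the memory and disk formulas: used * 100 / limit. */
    static double percent(double used, double limit) {
        if (limit <= 0) {
            throw new IllegalArgumentException("limit must be positive");
        }
        return used * 100.0 / limit;
    }

    /**
     * CPU variant. Assumption: the limit is the CPU-seconds available in the
     * interval, that is cpuLimit (CPUs) * intervalSeconds.
     */
    static double cpuPercent(double cpuSecondsUsed, double cpuLimit, double intervalSeconds) {
        return percent(cpuSecondsUsed, cpuLimit * intervalSeconds);
    }

    public static void main(String[] args) {
        // 3 GiB used against a 4 GiB container memory limit.
        System.out.println(percent(3L << 30, 4L << 30));
        // 90 CPU-seconds consumed in a 60-second interval with a 2-CPU limit.
        System.out.println(cpuPercent(90, 2, 60));
    }
}
```

Because each task manager container reports its own samples, statistics such as Maximum or Average in CloudWatch then summarize these per-container percentages.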

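**Flink 2.2 metric migration guidance**  
**Migrating from fullRestarts:** The `fullRestarts` metric has been removed in Flink 2.2. Use the `numRestarts` metric instead. The `numRestarts` metric provides equivalent functionality and can directly replace `fullRestarts` in CloudWatch alarms without threshold adjustments.  
**Migrating from uptime:** The `uptime` metric is deprecated in Flink 2.2 and will be removed in a future release. Use the `runningTime` metric instead. The `runningTime` metric provides equivalent functionality and can directly replace `uptime` in CloudWatch alarms without threshold adjustments.  
**Migrating from downtime:** The `downtime` metric is deprecated in Flink 2.2 and will be removed in a future release. Depending on what you want to monitor, use one or more of the following metrics:
+ `restartingTime`: Monitors the time the application spends restarting.
+ `cancellingTime`: Monitors the time the application spends cancelling.
+ `failingTime`: Monitors the time the application spends in a failing state.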
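
If you keep alarm definitions in code, the migration guidance above can be captured as a small lookup table. The following sketch is illustrative only: the class and method names are hypothetical, and it encodes nothing beyond the three replacements listed above.

```java
import java.util.List;
import java.util.Map;

/** Illustrative lookup of Flink 2.2 metric replacements; the class name is hypothetical. */
final class Flink22MetricMigration {
    private Flink22MetricMigration() {}

    // Removed or deprecated metric name -> metrics to use instead.
    static final Map<String, List<String>> REPLACEMENTS = Map.of(
            "fullRestarts", List.of("numRestarts"),
            "uptime", List.of("runningTime"),
            "downtime", List.of("restartingTime", "cancellingTime", "failingTime"));

    /** Returns the replacement metrics, or the name itself if it is unaffected. */
    static List<String> replacementsFor(String metricName) {
        return REPLACEMENTS.getOrDefault(metricName, List.of(metricName));
    }

    public static void main(String[] args) {
        System.out.println(replacementsFor("downtime"));
        System.out.println(replacementsFor("cpuUtilization"));
    }
}
```

Note that `downtime` maps to several metrics, so an alarm on `downtime` may need to become one alarm per state that you care about.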

## Kinesis Data Streams connector metrics
<a name="metrics-dimensions-stream"></a>

AWS emits all records for Kinesis Data Streams in addition to the following:


| Metric | Unit | Description | Level | Usage notes | 
| --- | --- | --- | --- | --- | 
| millisbehindLatest | Milliseconds | The number of milliseconds the consumer is behind the head of the stream, indicating how far behind the current time the consumer is. | Application (for Stream), Parallelism (for ShardId) | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/zh_tw/managed-flink/latest/java/metrics-dimensions.html)  | 

**Note**  
The `bytesRequestedPerFetch` metric has been removed in Flink AWS connector version 6.0.0 (the only connector version compatible with Flink 2.2). The only Kinesis Data Streams connector metric available in Flink 2.2 is `millisBehindLatest`.

## Amazon MSK connector metrics
<a name="metrics-dimensions-msk"></a>

AWS emits all records for Amazon MSK in addition to the following:


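| Metric | Unit | Description | Level | Usage notes | 
| --- | --- | --- | --- | --- | 
| currentoffsets | N/A | The consumer's current read offset, for each partition. You can specify a particular partition's metric by topic name and partition ID. | Application (for Topic), Parallelism (for PartitionId) |  | 
| commitsFailed | N/A | The total number of offset commit failures to Kafka, if offset committing and checkpointing are enabled. | Application, Operator, Task, Parallelism | Committing offsets back to Kafka is only a means of exposing consumer progress, so a commit failure does not affect the integrity of Flink's checkpointed partition offsets. | 
| commitsSucceeded | N/A | The total number of successful offset commits to Kafka, if offset committing and checkpointing are enabled. | Application, Operator, Task, Parallelism |  | 
| committedoffsets | N/A | The last offsets successfully committed to Kafka, for each partition. You can specify a particular partition's metric by topic name and partition ID. | Application (for Topic), Parallelism (for PartitionId) |  | 
| records_lag_max | Count | The maximum lag, in number of records, for any partition in this window. | Application, Operator, Task, Parallelism |  | 
| bytes_consumed_rate | Bytes | The average number of bytes consumed per second for a topic. | Application, Operator, Task, Parallelism |  | 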

## Apache Zeppelin metrics
<a name="metrics-dimensions-zeppelin"></a>

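For Studio notebooks, AWS emits the following metrics at the application level: `KPUs`, `cpuUtilization`, `heapMemoryUtilization`, `oldGenerationGCTime`, `oldGenerationGCCount`, and `threadsCount`. In addition, it emits the metrics shown in the following table, also at the application level.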


| Metric | Unit | Description | Prometheus name | 
| --- | --- | --- | --- | 
| zeppelinCpuUtilization | Percentage | Overall percentage of CPU utilization in the Apache Zeppelin server. | process_cpu_usage | 
| zeppelinHeapMemoryUtilization | Percentage | Overall percentage of heap memory utilization for the Apache Zeppelin server. | jvm_memory_used_bytes | 
| zeppelinThreadCount | Count | The total number of live threads used by the Apache Zeppelin server. | jvm_threads_live_threads | 
| zeppelinWaitingJobs | Count | The number of queued Apache Zeppelin jobs waiting for a thread. | jetty_threads_jobs | 
| zeppelinServerUptime | Seconds | The total time that the server has been up and running. | process_uptime_seconds | 

# View CloudWatch metrics
<a name="metrics-dimensions-viewing"></a>

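You can view CloudWatch metrics for your application using the Amazon CloudWatch console or the AWS CLI.

**To view metrics using the CloudWatch console**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the navigation pane, choose **Metrics**.

1. In the **CloudWatch Metrics by Category** pane for Managed Service for Apache Flink, choose a metrics category.

1. In the upper pane, scroll to view the full list of metrics.

**To view metrics using the AWS CLI**
+ At a command prompt, use the following command.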

  ```
  aws cloudwatch list-metrics --namespace "AWS/KinesisAnalytics" --region region
  ```

# Set CloudWatch metrics reporting levels
<a name="cloudwatch-logs-levels"></a>

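You can control the level of application metrics that your application creates. Managed Service for Apache Flink supports the following metrics levels:
+ **Application:** The application reports only the highest level of metrics for each application. Managed Service for Apache Flink metrics are published at the Application level by default.
+ **Task:** The application reports task-specific metric dimensions for metrics defined with the Task metric reporting level, such as the number of records in and out of the application per second.
+ **Operator:** The application reports operator-specific metric dimensions for metrics defined with the Operator metric reporting level, such as metrics for each filter or map operation.
+ **Parallelism:** The application reports `Task` and `Operator` level metrics for each execution thread. Because of the cost, this reporting level is not recommended for applications with a Parallelism setting above 64.
**Note**  
Because of the amount of metric data that the service generates, use this metric level only for troubleshooting. You can set this metric level only by using the CLI. This metric level is not available in the console.

The default level is **Application**. The application reports metrics at the current level and all higher levels. For example, if the reporting level is set to **Operator**, the application reports **Application**, **Task**, and **Operator** metrics.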
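
The "current level and all higher levels" rule can be modeled as an ordered enum. This is a minimal sketch for illustration, not service code: the enum is hypothetical, although its constant names mirror the level names used by the `MetricsLevelUpdate` parameter.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative model of the reporting levels, ordered from highest (coarsest) to lowest. */
enum MetricsLevel {
    APPLICATION, TASK, OPERATOR, PARALLELISM;

    /** Levels whose metrics are reported when this level is configured. */
    List<MetricsLevel> reportedLevels() {
        List<MetricsLevel> levels = new ArrayList<>();
        for (MetricsLevel level : values()) {
            // Every level declared at or before the configured one is included.
            if (level.ordinal() <= ordinal()) {
                levels.add(level);
            }
        }
        return levels;
    }
}
```

For example, `MetricsLevel.OPERATOR.reportedLevels()` contains `APPLICATION`, `TASK`, and `OPERATOR`, which matches the example in the preceding paragraph.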

You set the CloudWatch metrics reporting level using the `MonitoringConfiguration` parameter of the [CreateApplication](https://docs.aws.amazon.com/managed-service-for-apache-flink/latest/apiv2/API_CreateApplication.html) action, or the `MonitoringConfigurationUpdate` parameter of the [UpdateApplication](https://docs.aws.amazon.com/managed-service-for-apache-flink/latest/apiv2/API_UpdateApplication.html) action. The following example request for the [UpdateApplication](https://docs.aws.amazon.com/managed-service-for-apache-flink/latest/apiv2/API_UpdateApplication.html) action sets the CloudWatch metrics reporting level to **Task**:

```
{
   "ApplicationName": "MyApplication",  
   "CurrentApplicationVersionId": 4,
   "ApplicationConfigurationUpdate": { 
      "FlinkApplicationConfigurationUpdate": { 
         "MonitoringConfigurationUpdate": { 
            "ConfigurationTypeUpdate": "CUSTOM",
            "MetricsLevelUpdate": "TASK"
         }
      }
   }
}
```

You can also configure the logging level using the `LogLevel` parameter of the [CreateApplication](https://docs.aws.amazon.com/managed-service-for-apache-flink/latest/apiv2/API_CreateApplication.html) action or the `LogLevelUpdate` parameter of the [UpdateApplication](https://docs.aws.amazon.com/managed-service-for-apache-flink/latest/apiv2/API_UpdateApplication.html) action. You can use the following log levels:
+ `ERROR`: Logs potentially recoverable error events.
+ `WARN`: Logs warning events that might lead to an error.
+ `INFO`: Logs informational events.
+ `DEBUG`: Logs general debugging events.

For more information about Log4j logging levels, see [Custom Log Levels](https://logging.apache.org/log4j/2.x/manual/customloglevels.html) in the [Apache Log4j](https://logging.apache.org/log4j/2.x/) documentation.

# Use custom metrics with Amazon Managed Service for Apache Flink
<a name="monitoring-metrics-custom"></a>

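Managed Service for Apache Flink exposes 19 metrics to CloudWatch, including metrics for resource usage and throughput. In addition, you can create your own metrics to track application-specific data, such as processing events or accessing external resources.

**Topics**
+ [How it works](#monitoring-metrics-custom-howitworks)
+ [View examples for creating a mapping class](#monitoring-metrics-custom-examples)
+ [View custom metrics](#monitoring-metrics-custom-examples-viewing)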

## How it works
<a name="monitoring-metrics-custom-howitworks"></a>

Custom metrics in Managed Service for Apache Flink use the Apache Flink metric system. Apache Flink metrics have the following attributes:
+ **Type:** A metric's type describes how it measures and reports data. Available Apache Flink metric types include Counter, Gauge, Histogram, and Meter. For more information about Apache Flink metric types, see [Metric Types](https://nightlies.apache.org/flink/flink-docs-release-1.15/monitoring/metrics.html#metric-types).
**Note**  
AWS CloudWatch Metrics does not support the Histogram Apache Flink metric type. CloudWatch can display only Apache Flink metrics of the Counter, Gauge, and Meter types.
+ **Scope:** A metric's scope consists of its identifier and a set of key-value pairs that indicate how the metric is reported to CloudWatch. A metric's identifier consists of the following:
  + A system scope, which indicates the level at which the metric is reported (for example, Operator).
  + A user scope, which defines attributes such as user variables or metric group names. These attributes are defined using [`MetricGroup.addGroup(key, value)`](https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/metrics/MetricGroup.html#addGroup-java.lang.String-java.lang.String-) or [`MetricGroup.addGroup(name)`](https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/metrics/MetricGroup.html#addGroup-java.lang.String-).

  For more information about metric scope, see [Scope](https://nightlies.apache.org/flink/flink-docs-release-1.15/monitoring/metrics.html#scope).

For more information about Apache Flink metrics, see [Metrics](https://nightlies.apache.org/flink/flink-docs-release-1.15/monitoring/metrics.html) in the [Apache Flink documentation](https://nightlies.apache.org/flink/flink-docs-release-1.15/).

To create a custom metric in Managed Service for Apache Flink, you can access the Apache Flink metric system from any user function that extends `RichFunction` by calling [`RuntimeContext.getMetricGroup()`](https://nightlies.apache.org/flink/flink-docs-release-1.15/api/java/org/apache/flink/api/common/functions/RuntimeContext.html#getMetricGroup--). This method returns a [MetricGroup](https://nightlies.apache.org/flink/flink-docs-release-1.15/api/java/org/apache/flink/metrics/MetricGroup.html) object that you can use to create and register custom metrics. Managed Service for Apache Flink reports all metrics created with the group key `KinesisAnalytics` to CloudWatch. Custom metrics that you define have the following characteristics:
+ Your custom metric has a metric name and a group name. These names must consist of alphanumeric characters according to [Prometheus naming rules](https://prometheus.io/docs/instrumenting/writing_exporters/#naming).
+ Attributes that you define in the user scope (except for the `KinesisAnalytics` metric group) are published as CloudWatch dimensions.
+ Custom metrics are published at the `Application` level by default.
+ Dimensions (Task/Operator/Parallelism) are added to the metric based on the application's monitoring level. You set the application's monitoring level using the [MonitoringConfiguration](https://docs.aws.amazon.com/managed-flink/latest/apiv2/API_MonitoringConfiguration.html) parameter of the [CreateApplication](https://docs.aws.amazon.com/managed-flink/latest/apiv2/API_CreateApplication.html) action, or the [MonitoringConfigurationUpdate](https://docs.aws.amazon.com/managed-flink/latest/apiv2/API_MonitoringConfigurationUpdate.html) parameter of the [UpdateApplication](https://docs.aws.amazon.com/managed-flink/latest/apiv2/API_UpdateApplication.html) action.
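
Because metric and group names must follow the naming rule above, it can help to validate them before you register a metric. The following is a conservative sketch under stated assumptions: the class is hypothetical, and the pattern (letters, digits, and underscores, not starting with a digit) is a deliberately strict approximation of Prometheus-style names. The exact rule that the service enforces is not stated here.

```java
import java.util.regex.Pattern;

/** Illustrative pre-check for custom metric and group names; not part of any AWS or Flink API. */
final class MetricNameCheck {
    private MetricNameCheck() {}

    // Assumption: letters, digits, and underscores only, and no leading digit.
    private static final Pattern NAME = Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*");

    static boolean isValid(String name) {
        return name != null && NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("FilteredRecords"));
        System.out.println(isValid("total-words"));
    }
}
```

A check like this is cheapest to run in the constructor of the function that registers the metric, so that an invalid name fails when the job is built rather than silently at reporting time.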

## View examples for creating a mapping class
<a name="monitoring-metrics-custom-examples"></a>

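The following code examples demonstrate how to create a mapping class that creates and increments a custom metric, and how to implement the mapping class in your application by adding it to a `DataStream` object.

### Record count custom metric
<a name="monitoring-metrics-custom-examples-recordcount"></a>

The following code example demonstrates how to create a mapping class that creates a metric that counts the records in a data stream (the same functionality as the `numRecordsIn` metric):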

```
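    // Imports for the enclosing application class:
    // import org.apache.flink.api.common.functions.RichMapFunction;
    // import org.apache.flink.configuration.Configuration;
    // import org.apache.flink.metrics.Gauge;

    private static class NoOpMapperFunction extends RichMapFunction<String, String> {
        // Number of records seen by this subtask; exposed through the gauge below.
        private transient int valueToExpose = 0;
        private final String customMetricName;
 
        public NoOpMapperFunction(final String customMetricName) {
            this.customMetricName = customMetricName;
        }
 
        @Override
        public void open(Configuration config) {
            // Metrics under the "KinesisAnalytics" group are reported to CloudWatch.
            // The ("Program", "RecordCountApplication") pair becomes a dimension.
            getRuntimeContext().getMetricGroup()
                    .addGroup("KinesisAnalytics")
                    .addGroup("Program", "RecordCountApplication")
                    .addGroup("NoOpMapperFunction")
                    .gauge(customMetricName, (Gauge<Integer>) () -> valueToExpose);
        }
 
        @Override
        public String map(String value) throws Exception {
            // Pass each record through unchanged and count it.
            valueToExpose++;
            return value;
        }
    }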
```

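In the preceding example, the `valueToExpose` variable is incremented for each record that the application processes.

After defining your mapping class, you then create an in-application stream that implements the map: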

```
DataStream<String> noopMapperFunctionAfterFilter =
    kinesisProcessed.map(new NoOpMapperFunction("FilteredRecords"));
```

For the complete code for this application, see [Record Count Custom Metric Application](https://github.com/aws-samples/amazon-managed-service-for-apache-flink-examples/tree/main/java/CustomMetrics/RecordCount).

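### Word count custom metric
<a name="monitoring-metrics-custom-examples-wordcount"></a>

The following code example demonstrates how to create a mapping class that creates a metric that counts the words in a data stream: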

```
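// Imports for the enclosing application class:
// import org.apache.flink.api.common.functions.RichFlatMapFunction;
// import org.apache.flink.api.java.tuple.Tuple2;
// import org.apache.flink.configuration.Configuration;
// import org.apache.flink.metrics.Counter;
// import org.apache.flink.util.Collector;

private static final class Tokenizer extends RichFlatMapFunction<String, Tuple2<String, Integer>> {

    private transient Counter counter;

    @Override
    public void open(Configuration config) {
        // Metrics under the "KinesisAnalytics" group are reported to CloudWatch.
        this.counter = getRuntimeContext().getMetricGroup()
                .addGroup("KinesisAnalytics")
                .addGroup("Service", "WordCountApplication")
                .addGroup("Tokenizer")
                .counter("TotalWords");
    }

    @Override
    public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
        // normalize and split the line
        String[] tokens = value.toLowerCase().split("\\W+");

        // emit the pairs
        for (String token : tokens) {
            if (token.length() > 0) {
                counter.inc();
                out.collect(new Tuple2<>(token, 1));
            }
        }
    }
}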
```

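In the preceding example, the `counter` variable is incremented for each word that the application processes.

After defining your mapping class, you then create an in-application stream that implements the map: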

```
// Split up the lines in pairs (2-tuples) containing: (word,1), and
// group by the tuple field "0" and sum up tuple field "1"
DataStream<Tuple2<String, Integer>> wordCountStream = input.flatMap(new Tokenizer()).keyBy(0).sum(1);
     
// Serialize the tuple to string format, and publish the output to kinesis sink
wordCountStream.map(tuple -> tuple.toString()).addSink(createSinkFromStaticConfig());
```

For the complete code for this application, see [Word Count Custom Metric Application](https://github.com/aws-samples/amazon-managed-service-for-apache-flink-examples/tree/main/java/CustomMetrics/WordCount).

## View custom metrics
<a name="monitoring-metrics-custom-examples-viewing"></a>

Custom metrics for your application appear in the CloudWatch Metrics console in the **AWS/KinesisAnalytics** dashboard, under the **Application** metric group.

# Use CloudWatch alarms with Amazon Managed Service for Apache Flink
<a name="monitoring-metrics-alarms"></a>

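Using Amazon CloudWatch metric alarms, you watch a CloudWatch metric over a time period that you specify. The alarm performs one or more actions based on the value of the metric or expression relative to a threshold over a number of time periods. An example of an action is sending a notification to an Amazon Simple Notification Service (Amazon SNS) topic.

For more information about CloudWatch alarms, see [Using Amazon CloudWatch Alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html).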

## Review recommended alarms
<a name="monitoring-metrics-alarms-recommended"></a>

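This section contains the recommended alarms for monitoring Managed Service for Apache Flink applications.

The table describes the recommended alarms and has the following columns:
+ **Metric expression:** The metric or metric expression to test against the threshold.
+ **Statistic:** The statistic used to check the metric, for example, **Average**.
+ **Threshold:** Using this alarm requires you to determine a threshold that defines the limit of expected application performance. You need to determine this threshold by monitoring your application under normal conditions.
+ **Description:** Causes that might trigger this alarm, and possible solutions for the condition.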


| Metric expression | Statistic | Threshold | Description | 
| --- |--- |--- |--- |
| downtime > 0 | Average | 0 | A downtime greater than zero indicates that the application has failed. If the value is larger than 0, the application is not processing any data. Recommended for all applications. The `downtime` metric measures the duration of an outage. For troubleshooting, see [Application is restarting](troubleshooting-rt-restarts.md). | 
| RATE (numberOfFailedCheckpoints) > 0 | Average | 0 | This metric counts the number of failed checkpoints since the application started. Depending on the application, it can be tolerable if checkpoints fail occasionally. But if checkpoints are regularly failing, the application is likely unhealthy and needs further attention. We recommend monitoring RATE(numberOfFailedCheckpoints) to alarm on the gradient and not on absolute values. Recommended for all applications. Use this metric to monitor application health and checkpointing progress. The application saves state data to checkpoints when it's healthy. Checkpointing can fail due to timeouts if the application isn't making progress in processing the input data. For troubleshooting, see [Checkpointing is timing out](troubleshooting-chk-timeout.md). | 
| Operator.numRecordsOutPerSecond < threshold | Average | The minimum number of records emitted from the application during normal conditions.  | Recommended for all applications. Falling below this threshold can indicate that the application isn't making expected progress on the input data. For troubleshooting, see [Throughput is too slow](troubleshooting-rt-throughput.md). | 
| records_lag_max\|millisbehindLatest > threshold | Maximum | The maximum expected latency during normal conditions. | If the application is consuming from Kinesis or Kafka, these metrics indicate if the application is falling behind and needs to be scaled in order to keep up with the current load. This is a good generic metric that is easy to track for all kinds of applications. But it can only be used for reactive scaling, that is, when the application has already fallen behind. Recommended for all applications. Use the records_lag_max metric for a Kafka source, or the millisbehindLatest metric for a Kinesis stream source. Rising above this threshold can indicate that the application isn't making expected progress on the input data. For troubleshooting, see [Throughput is too slow](troubleshooting-rt-throughput.md). | 
| lastCheckpointDuration > threshold | Maximum | The maximum expected checkpoint duration during normal conditions. | Monitors how much data is stored in state and how long it takes to take a checkpoint. If checkpoints grow or take a long time, the application is continuously spending time on checkpointing and has fewer cycles for actual processing. At some point, checkpoints may grow too large or take so long that they fail. In addition to monitoring absolute values, also consider monitoring the change rate with RATE(lastCheckpointSize) and RATE(lastCheckpointDuration). If lastCheckpointDuration continuously increases, rising above this threshold can indicate that the application isn't making expected progress on the input data, or that there are problems with application health such as backpressure. For troubleshooting, see [Unbounded state growth](troubleshooting-rt-stateleaks.md). | 
| lastCheckpointSize > threshold | Maximum | The maximum expected checkpoint size during normal conditions. | Monitors how much data is stored in state and how long it takes to take a checkpoint. If checkpoints grow or take a long time, the application is continuously spending time on checkpointing and has fewer cycles for actual processing. At some point, checkpoints may grow too large or take so long that they fail. In addition to monitoring absolute values, also consider monitoring the change rate with RATE(lastCheckpointSize) and RATE(lastCheckpointDuration). If lastCheckpointSize continuously increases, rising above this threshold can indicate that the application is accumulating state data. If the state data becomes too large, the application can run out of memory when recovering from a checkpoint, or recovering from a checkpoint might take too long. For troubleshooting, see [Unbounded state growth](troubleshooting-rt-stateleaks.md). | 
| heapMemoryUtilization > threshold | Maximum | The maximum expected heapMemoryUtilization during normal conditions, with a recommended value of 90 percent. | This gives a good indication of the overall resource utilization of the application and can be used for proactive scaling unless the application is I/O bound. You can use this metric to monitor the maximum memory utilization of task managers across the application. If the application reaches this threshold, you need to provision more resources. You do this by enabling automatic scaling or increasing the application parallelism. For more information about increasing resources, see [Implement application scaling](how-scaling.md). | 
| cpuUtilization > threshold | Maximum | The maximum expected cpuUtilization during normal conditions, with a recommended value of 80 percent. | This gives a good indication of the overall resource utilization of the application and can be used for proactive scaling unless the application is I/O bound. You can use this metric to monitor the maximum CPU utilization of task managers across the application. If the application reaches this threshold, you need to provision more resources. You do this by enabling automatic scaling or increasing the application parallelism. For more information about increasing resources, see [Implement application scaling](how-scaling.md). | 
| threadsCount > threshold | Maximum | The maximum expected threadsCount during normal conditions. | You can use this metric to watch for thread leaks in task managers across the application. If this metric reaches this threshold, check your application code for threads being created without being closed. | 
| (oldGenerationGCTime \* 100)/60_000 over a 1-minute period > threshold | Maximum | The maximum expected oldGenerationGCTime duration. We recommend setting a threshold such that typical garbage collection time is 60 percent of the specified threshold, but the correct threshold for your application will vary. | If this metric is continually increasing, this can indicate that there is a memory leak in task managers across the application. | 
| RATE(oldGenerationGCCount) > threshold | Maximum | The maximum expected oldGenerationGCCount under normal conditions. The correct threshold for your application will vary. | If this metric is continually increasing, this can indicate that there is a memory leak in task managers across the application. | 
| Operator.currentOutputWatermark - Operator.currentInputWatermark > threshold | Minimum | The minimum expected watermark increment under normal conditions. The correct threshold for your application will vary. | If this metric is continually increasing, this can indicate that either the application is processing increasingly older events, or that an upstream subtask has not sent a watermark in an increasingly long time. |
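
To make the metric expressions in the table concrete, the following sketch works through the arithmetic behind three of them. It is an illustration under stated assumptions, not how CloudWatch evaluates alarms. The class and method names are hypothetical, the garbage collection input is assumed to be the milliseconds of collection time observed within one minute, and `ratePerSecond` follows the CloudWatch metric math definition of RATE as the change between two consecutive data points divided by the seconds between them.

```java
/** Illustrative arithmetic for the alarm expressions; all names are hypothetical. */
final class AlarmExpressions {
    private AlarmExpressions() {}

    /** (gcMillis * 100) / 60_000: share of a one-minute period spent in garbage collection. */
    static double gcTimePercent(double gcMillisInLastMinute) {
        return gcMillisInLastMinute * 100.0 / 60_000.0;
    }

    /** RATE-style change per second between two consecutive data points. */
    static double ratePerSecond(double previous, double latest, double secondsBetween) {
        if (secondsBetween <= 0) {
            throw new IllegalArgumentException("secondsBetween must be positive");
        }
        return (latest - previous) / secondsBetween;
    }

    /** Output watermark minus input watermark, both in epoch milliseconds. */
    static long watermarkGapMillis(long outputWatermark, long inputWatermark) {
        return outputWatermark - inputWatermark;
    }

    public static void main(String[] args) {
        // 6 seconds of garbage collection within one minute.
        System.out.println(gcTimePercent(6_000));
        // Failed checkpoints went from 3 to 9 across a 60-second period.
        System.out.println(ratePerSecond(3, 9, 60));
    }
}
```

Alarming on the rate rather than the raw counter is what lets `RATE (numberOfFailedCheckpoints) > 0` fire only while new failures are occurring, instead of staying in alarm after a single historical failure.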