myn’s diary

Introduction

On AWS EC2, the default metrics cover data such as CPU utilization, disk I/O, and network I/O, but not memory usage or disk usage. To get these from CloudWatch, you register them as custom metrics.

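Custom metrics are registered with the PutMetricData API. If you collect and format the data and call this API periodically, memory usage and disk usage become available from CloudWatch. For example, you would do the following.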

Collect and format data from /proc/meminfo and df
Implement a shell script that runs the aws cli
Run it periodically with cron, or at intervals with a while loop and sleep
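
As a rough illustration of the manual approach in the steps above, here is a minimal Python sketch (standing in for the shell script). It parses /proc/meminfo-style text, derives a used-memory percentage, and builds a PutMetricData payload. The metric name MemoryUsedPercent, the namespace Custom/Example, the instance ID, the (MemTotal - MemAvailable) / MemTotal formula, and the use of boto3 are all assumptions made for this sketch.

```python
SAMPLE_MEMINFO = """\
MemTotal:        1000000 kB
MemFree:          200000 kB
MemAvailable:     600000 kB
"""


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: value_in_kB} dict."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_used_percent(meminfo):
    """Assumed definition: share of MemTotal that is not MemAvailable."""
    total = meminfo["MemTotal"]
    return 100.0 * (total - meminfo["MemAvailable"]) / total


def build_metric_data(instance_id, used_percent):
    """Build the MetricData list that PutMetricData expects."""
    return [
        {
            "MetricName": "MemoryUsedPercent",  # hypothetical metric name
            "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
            "Unit": "Percent",
            "Value": used_percent,
        }
    ]


def put_metric(metric_data, namespace="Custom/Example"):
    """Send the data to CloudWatch; needs boto3 and AWS credentials."""
    import boto3  # third-party; imported lazily so the rest runs without it

    boto3.client("cloudwatch").put_metric_data(
        Namespace=namespace, MetricData=metric_data
    )


percent = memory_used_percent(parse_meminfo(SAMPLE_MEMINFO))
print(build_metric_data("i-0123456789abcdef0", percent))  # hypothetical ID
```

On a real instance you would read /proc/meminfo instead of SAMPLE_MEMINFO and call put_metric from cron. Every extra metric adds more parsing code of this kind, which is where the maintenance burden comes from.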

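This approach works when you collect only a few data points, but it becomes hard to maintain as the amount of data grows. For such cases, the official documentation introduces two methods: the CloudWatch agent and the CloudWatch monitoring scripts.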

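The CloudWatch agent runs as a daemon on the instance, collects resource data such as memory usage and disk usage, and registers it as custom metrics. The CloudWatch monitoring scripts are implemented in Perl; you run them periodically with cron to collect data and register custom metrics. As of 2020/05/14, generating custom metrics with the CloudWatch agent is the recommended method. In addition, the CloudWatch agent can collect finer-grained resource data than the monitoring scripts, such as per-process CPU and memory usage. Unless you have a specific reason not to, you should use the CloudWatch agent.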

This article walks through the steps and settings for using the CloudWatch agent to register an EC2 instance's resource data as custom metrics and check it in CloudWatch Metrics.

Creating test resources

Create the AWS resources for testing: an EC2 instance plus the IAM Role and Instance Profile to attach to it. Registering custom metrics with the CloudWatch agent requires the CloudWatchAgentServerPolicy policy. Here I also attach the AmazonSSMManagedInstanceCore policy so that I can log in to the EC2 instance with Systems Manager Session Manager.

Example Terraform configuration

terraform {
  required_providers {
    aws = "~> 2.61"
  }
}

resource "aws_iam_instance_profile" "this" {
  role = aws_iam_role.this.name
}

resource "aws_iam_role" "this" {
  assume_role_policy = <<EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": "sts:AssumeRole",
            "Principal": {
               "Service": "ec2.amazonaws.com"
            },
            "Effect": "Allow"
        }
    ]
}
EOF
}

resource "aws_iam_role_policy_attachment" "AmazonSSMManagedInstanceCore" {
  role       = aws_iam_role.this.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}

resource "aws_iam_role_policy_attachment" "CloudWatchAgentServerPolicy" {
  role       = aws_iam_role.this.name
  policy_arn = "arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"
}

data "aws_ami" "amzn2" {
  most_recent = true

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-2.0.????????.?-x86_64-gp2"]
  }

  filter {
    name   = "state"
    values = ["available"]
  }

  owners = ["amazon"]
}

resource "aws_spot_instance_request" "this" {
  ami           = data.aws_ami.amzn2.id
  instance_type = "t3.micro"

  iam_instance_profile = aws_iam_instance_profile.this.id

  credit_specification {
    cpu_credits = "standard"
  }

  spot_price           = "0.01"
  spot_type            = "one-time"
  wait_for_fulfillment = true
}

output "instance_id" {
  value = aws_spot_instance_request.this.spot_instance_id
}

CloudWatch agent

The CloudWatch agent can do the following.

Collect resource data and register it as custom metrics
Read log files and send them to CloudWatch Logs

The sections below explain how to install the CloudWatch agent, write its configuration file, and apply the configuration.

Installation

Install the CloudWatch agent on the test EC2 instance created above. First, log in with Systems Manager Session Manager.

$ aws ssm start-session --target i-028f726cdabfa5c14

Starting session with SessionId: 2020-05-09-213857-0ef6fb6e30fcf5cf5
sh-4.2$ sudo su - ec2-user

Choose the CloudWatch agent package for your distribution from the download locations listed in Download and Configure the CloudWatch Agent Using the Command Line - Amazon CloudWatch. Here I install from the rpm package for Amazon Linux 2.

$ sudo yum install -y -q -e 0 https://s3.amazonaws.com/amazoncloudwatch-agent/amazon_linux/amd64/latest/amazon-cloudwatch-agent.rpm
create group cwagent, result: 0
create user cwagent, result: 0

The CloudWatch agent is installed in /opt/aws/amazon-cloudwatch-agent. This directory holds the executables, the configuration files, and the agent's own logs.

Use the amazon-cloudwatch-agent-ctl command to configure the CloudWatch agent. Run amazon-cloudwatch-agent-ctl -h to see the details of this command.

Example output of amazon-cloudwatch-agent-ctl -h

e.g.
1. apply a SSM parameter store config on EC2 instance and restart the agent afterwards:
amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c ssm:AmazonCloudWatch-Config.json -s
2. append a local json config file on onPremise host and restart the agent afterwards:
amazon-cloudwatch-agent-ctl -a append-config -m onPremise -c file:/tmp/config.json -s
3. query agent status:
amazon-cloudwatch-agent-ctl -a status

-a: action
stop: stop the agent process.
start: start the agent process.
status: get the status of the agent process.
fetch-config: use this json config as the agent's only configuration.
append-config: append json config with the existing json configs if any.
remove-config: remove json config based on the location (ssm parameter store name, file name)

-m: mode
ec2: indicate this is on ec2 host.
onPremise: indicate this is on onPremise host.
auto: use ec2 metadata to determine the environment, may not be accurate if ec2 metadata is not available for some reason on EC2.

-c: configuration
default: default configuration for quick trial.
ssm:<parameter-store-name>: ssm parameter store name
file:<file-path>: file path on the host

-s: optionally restart after configuring the agent configuration
this parameter is used for 'fetch-config', 'append-config', 'remove-config' action only.
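
To make the option rules in the help text concrete, here is a minimal Python sketch that assembles an amazon-cloudwatch-agent-ctl command line. It uses only the -a, -m, -c, and -s options listed above. The helper name build_ctl_command and its validation rules are hypothetical and only mirror what the help text states.

```python
CONFIG_ACTIONS = {"fetch-config", "append-config", "remove-config"}
ACTIONS = {"stop", "start", "status"} | CONFIG_ACTIONS
MODES = {"ec2", "onPremise", "auto"}


def build_ctl_command(action, mode=None, config=None, restart=False):
    """Return the argv list for one amazon-cloudwatch-agent-ctl call."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    if mode is not None and mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    if restart and action not in CONFIG_ACTIONS:
        # The help text says -s is used for the *-config actions only.
        raise ValueError("-s only applies to the *-config actions")
    argv = ["amazon-cloudwatch-agent-ctl", "-a", action]
    if mode is not None:
        argv += ["-m", mode]
    if config is not None:
        argv += ["-c", config]
    if restart:
        argv.append("-s")
    return argv


print(" ".join(build_ctl_command("fetch-config", "ec2", "file:/tmp/config.json", True)))
print(" ".join(build_ctl_command("status")))
```

The returned list can be handed to subprocess.run (with sudo prepended where needed) if you want to drive the agent from a provisioning script.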

Starting the agent

Right after installation, the CloudWatch agent is not yet running.

$ amazon-cloudwatch-agent-ctl -a status
{
  "status": "stopped",
  "starttime": "",
  "version": "1.237768.0"
}
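
Because the status action prints JSON, a script can inspect it before deciding whether to start the agent. A minimal Python sketch follows; it relies only on the "stopped" value shown above, and the helper name agent_is_stopped is hypothetical.

```python
import json

STATUS_OUTPUT = """
{
  "status": "stopped",
  "starttime": "",
  "version": "1.237768.0"
}
"""


def agent_is_stopped(status_json):
    """True when the status JSON reports the agent as stopped."""
    return json.loads(status_json)["status"] == "stopped"


print(agent_is_stopped(STATUS_OUTPUT))
```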

If you start the agent without preparing a configuration file, it applies a default configuration provided by AWS and then starts.

$ sudo amazon-cloudwatch-agent-ctl -a start
amazon-cloudwatch-agent is not configured. Applying default configuration before starting it.
/opt/aws/amazon-cloudwatch-agent/bin/config-downloader --output-dir /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d --download-source default --mode ec2 --config /opt/aws/amazon-cloudwatch-agent/etc/common-config.toml --multi-config default
Successfully fetched the config and saved in /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d/default.tmp
Start configuration validation...
/opt/aws/amazon-cloudwatch-agent/bin/config-translator --input /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json --input-dir /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d --output /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml --mode ec2 --config /opt/aws/amazon-cloudwatch-agent/etc/common-config.toml --multi-config default
2020/05/10 00:18:42 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d/default.tmp ...
Valid Json input schema.
I! Detecting runasuser...
No csm configuration found.
No log configuration found.
Configuration validation first phase succeeded
/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent -schematest -config /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml
Configuration validation second phase succeeded
Configuration validation succeeded
Created symlink from /etc/systemd/system/multi-user.target.wants/amazon-cloudwatch-agent.service to /etc/systemd/system/amazon-cloudwatch-agent.service.
Redirecting to /bin/systemctl restart amazon-cloudwatch-agent.service

The second line prints amazon-cloudwatch-agent is not configured. Applying default configuration before starting it., which shows that the default configuration was applied.

By default, the CloudWatch agent's logs are stored in /opt/aws/amazon-cloudwatch-agent/logs. The log also confirms that the agent started.

$ tail -f /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log
2020/05/10 05:18:10 I! Detected runAsUser: cwagent
2020/05/10 05:18:10 I! Change ownership to cwagent:cwagent
2020/05/10 05:18:10 I! Set HOME: /home/cwagent
2020-05-10T05:18:10Z I! cloudwatch: get unique roll up list []
2020-05-10T05:18:10Z I! cloudwatch: publish with ForceFlushInterval: 1m0s, Publish Jitter: 37s
2020-05-10T05:18:10Z I! Starting AmazonCloudWatchAgent (version 1.237768.0)
2020-05-10T05:18:10Z I! Loaded outputs: cloudwatch
2020-05-10T05:18:10Z I! Loaded inputs: disk mem
2020-05-10T05:18:10Z I! Tags enabled: host=ip-172-31-7-225.ap-northeast-1.compute.internal
2020-05-10T05:18:10Z I! Agent Config: Interval:1m0s, Quiet:false, Hostname:"ip-172-31-7-225.ap-northeast-1.compute.internal", Flush Interval:1s

Default configuration file

So what does the default configuration look like?

The configuration files are stored under etc/.

$ ls -Rl /opt/aws/amazon-cloudwatch-agent/etc/
/opt/aws/amazon-cloudwatch-agent/etc/:
total 8
drwxr-xr-x 2 cwagent cwagent   21 May 10 05:18 amazon-cloudwatch-agent.d
-rw-rw-r-- 1 cwagent cwagent 1098 May 10 05:18 amazon-cloudwatch-agent.toml
-rw-r--r-- 1 cwagent cwagent  925 Jan 22 17:04 common-config.toml

/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d:
total 4
-rwxr-xr-x 1 cwagent cwagent 462 May 10 05:18 default

/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d/default is the default configuration. It has no extension, but it is a JSON file.

{
  "agent": {
    "run_as_user": "cwagent"
  },
  "metrics": {
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent"]
      },
      "disk": {
        "measurement": ["used_percent"],
        "resources": ["*"]
      }
    },
    "append_dimensions": {
      "ImageId": "${aws:ImageId}",
      "InstanceId": "${aws:InstanceId}",
      "InstanceType": "${aws:InstanceType}",
      "AutoScalingGroupName": "${aws:AutoScalingGroupName}"
    }
  }
}

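On line 3, run_as_user sets the user that runs the CloudWatch agent to cwagent. The metrics_collected section on line 6 specifies which metrics to collect. mem_used_percent on line 8 selects the memory usage percentage, and used_percent on line 11 selects the disk space usage percentage per mount point. On line 12, "resources": ["*"] makes every mount point a collection target, so mount points such as /, /dev, and /run are collected.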
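
To get a feel for what a per-mount-point used_percent value represents, the following Python sketch computes a comparable figure with the standard library. The formula used / (used + free) is an assumption for this sketch and may not match the agent's calculation exactly, and the list of mount points is just an example.

```python
import shutil


def disk_used_percent(mount_point):
    """Used space as a percentage of used + free space on one mount point."""
    usage = shutil.disk_usage(mount_point)
    denominator = usage.used + usage.free
    if denominator == 0:
        return 0.0  # pseudo file systems can report zero size
    return 100.0 * usage.used / denominator


for mount_point in ["/"]:
    print(mount_point, round(disk_used_percent(mount_point), 1))
```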

At this point, you can already see the collected data in CloudWatch Metrics.

TOML configuration file

Besides the JSON file, the etc/ directory also contains a TOML file. Judging from how the configuration files behave, the configuration seems to reach the CloudWatch agent as follows.

The amazon-cloudwatch-agent-ctl command converts the JSON to TOML
The CloudWatch agent loads the TOML

The TOML can be generated either from a local JSON file or from Systems Manager Parameter Store. At first I assumed the agent read the JSON directly, so I wondered why the TOML exists and why the JSON needs to be converted at all. My guess is that it is because the CloudWatch agent is built on influxdata/telegraf. The THIRD-PARTY-LICENSES file shows that telegraf is used.

$ grep telegraf THIRD-PARTY-LICENSES
** influxdata/telegraf; version 1.3 -- https://github.com/influxdata/telegraf

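In addition, telegraf's configuration file is also TOML, and the structure of the configuration (inputs.cpu and so on) is similar. The official documentation says little about the TOML file, so the basic workflow should be to write the configuration in JSON and apply it with the amazon-cloudwatch-agent-ctl command. Editing the TOML directly and reloading might let you use telegraf-specific features, but that is a separate topic from the CloudWatch agent, so I do not cover it here.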

Wizard

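You can create the configuration file with a wizard. Running amazon-cloudwatch-agent-config-wizard, which ships with the CloudWatch agent, starts the wizard, and you can create a configuration file just by answering about 15 questions. The wizard is a good choice for the first start or when you are new to the agent.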

Example wizard run

$ sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard
=============================================================
= Welcome to the AWS CloudWatch Agent Configuration Manager =
=============================================================
On which OS are you planning to use the agent?
1. linux
2. windows
default choice: [1]:
1
Trying to fetch the default region based on ec2 metadata...
Are you using EC2 or On-Premises hosts?
1. EC2
2. On-Premises
default choice: [1]:
1
Which user are you planning to run the agent?
1. root
2. cwagent
3. others
default choice: [1]:
1
Do you want to turn on StatsD daemon?
1. yes
2. no
default choice: [1]:
2
Do you want to monitor metrics from CollectD?
1. yes
2. no
default choice: [1]:
2
Do you want to monitor any host metrics? e.g. CPU, memory, etc.
1. yes
2. no
default choice: [1]:
1
Do you want to monitor cpu metrics per core? Additional CloudWatch charges may apply.
1. yes
2. no
default choice: [1]:
1
Do you want to add ec2 dimensions (ImageId, InstanceId, InstanceType, AutoScalingGroupName) into all of your metrics if the info is available?
1. yes
2. no
default choice: [1]:
1
Would you like to collect your metrics at high resolution (sub-minute resolution)? This enables sub-minute resolution for all metrics, but you can customize for specific metrics in the output json file.
1. 1s
2. 10s
3. 30s
4. 60s
default choice: [4]:
2
Which default metrics config do you want?
1. Basic
2. Standard
3. Advanced
4. None
default choice: [1]:
3
Current config as follows:
{
        "agent": {
                "metrics_collection_interval": 10,
                "run_as_user": "root"
        },
        "metrics": {
                "append_dimensions": {
                        "AutoScalingGroupName": "${aws:AutoScalingGroupName}",
                        "ImageId": "${aws:ImageId}",
                        "InstanceId": "${aws:InstanceId}",
                        "InstanceType": "${aws:InstanceType}"
                },
                "metrics_collected": {
                        "cpu": {
                                "measurement": [
                                        "cpu_usage_idle",
                                        "cpu_usage_iowait",
                                        "cpu_usage_user",
                                        "cpu_usage_system"
                                ],
                                "metrics_collection_interval": 10,
                                "resources": [
                                        "*"
                                ],
                                "totalcpu": false
                        },
                        "disk": {
                                "measurement": [
                                        "used_percent",
                                        "inodes_free"
                                ],
                                "metrics_collection_interval": 10,
                                "resources": [
                                        "*"
                                ]
                        },
                        "diskio": {
                                "measurement": [
                                        "io_time",
                                        "write_bytes",
                                        "read_bytes",
                                        "writes",
                                        "reads"
                                ],
                                "metrics_collection_interval": 10,
                                "resources": [
                                        "*"
                                ]
                        },
                        "mem": {
                                "measurement": [
                                        "mem_used_percent"
                                ],
                                "metrics_collection_interval": 10
                        },
                        "netstat": {
                                "measurement": [
                                        "tcp_established",
                                        "tcp_time_wait"
                                ],
                                "metrics_collection_interval": 10
                        },
                        "swap": {
                                "measurement": [
                                        "swap_used_percent"
                                ],
                                "metrics_collection_interval": 10
                        }
                }
        }
}
Are you satisfied with the above config? Note: it can be manually customized after the wizard completes to add additional items.
1. yes
2. no
default choice: [1]:
1
Do you have any existing CloudWatch Log Agent (http://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AgentReference.html) configuration file to import for migration?
1. yes
2. no
default choice: [2]:
2
Do you want to monitor any log files?
1. yes
2. no
default choice: [1]:
2
Saved config file to /opt/aws/amazon-cloudwatch-agent/bin/config.json successfully.
Current config as follows:
{
        "agent": {
                "metrics_collection_interval": 10,
                "run_as_user": "root"
        },
        "metrics": {
                "append_dimensions": {
                        "AutoScalingGroupName": "${aws:AutoScalingGroupName}",
                        "ImageId": "${aws:ImageId}",
                        "InstanceId": "${aws:InstanceId}",
                        "InstanceType": "${aws:InstanceType}"
                },
                "metrics_collected": {
                        "cpu": {
                                "measurement": [
                                        "cpu_usage_idle",
                                        "cpu_usage_iowait",
                                        "cpu_usage_user",
                                        "cpu_usage_system"
                                ],
                                "metrics_collection_interval": 10,
                                "resources": [
                                        "*"
                                ],
                                "totalcpu": false
                        },
                        "disk": {
                                "measurement": [
                                        "used_percent",
                                        "inodes_free"
                                ],
                                "metrics_collection_interval": 10,
                                "resources": [
                                        "*"
                                ]
                        },
                        "diskio": {
                                "measurement": [
                                        "io_time",
                                        "write_bytes",
                                        "read_bytes",
                                        "writes",
                                        "reads"
                                ],
                                "metrics_collection_interval": 10,
                                "resources": [
                                        "*"
                                ]
                        },
                        "mem": {
                                "measurement": [
                                        "mem_used_percent"
                                ],
                                "metrics_collection_interval": 10
                        },
                        "netstat": {
                                "measurement": [
                                        "tcp_established",
                                        "tcp_time_wait"
                                ],
                                "metrics_collection_interval": 10
                        },
                        "swap": {
                                "measurement": [
                                        "swap_used_percent"
                                ],
                                "metrics_collection_interval": 10
                        }
                }
        }
}
Please check the above content of the config.
The config file is also located at /opt/aws/amazon-cloudwatch-agent/bin/config.json.
Edit it manually if needed.
Do you want to store the config in the SSM parameter store?
1. yes
2. no
default choice: [1]:
2
Program exits now.

Configuration file syntax

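If you want more customization, or want to use the procstat, StatsD, or collectd plugins, you need to write the configuration file yourself. Refer to the official documentation for the full syntax. Here I explain only the important parts.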

The configuration file consists of three sections: agent, metrics, and logs.

`agent` section

{
  "agent": {
    "metrics_collection_interval": 10,
    "run_as_user": "root",
    "debug": true
  },
...

This section configures the CloudWatch agent itself. If you omit it, default values are used.

metrics_collection_interval: interval in seconds; 1, 5, 10, 30, 60, or a multiple of 60 (default 60)
run_as_user: the user that runs the CloudWatch agent (default root)
debug: whether the CloudWatch agent writes detailed logs (default false)

Setting debug to true is useful when metrics are missing from CloudWatch Metrics and the CloudWatch agent does not seem to be working.
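
The allowed values for metrics_collection_interval listed above can be checked before a configuration is applied. A minimal Python sketch of that rule, with the hypothetical helper name is_valid_collection_interval:

```python
def is_valid_collection_interval(seconds):
    """True for 1, 5, 10, 30, 60, or any positive multiple of 60."""
    if isinstance(seconds, bool) or not isinstance(seconds, int):
        return False
    if seconds in (1, 5, 10, 30, 60):
        return True
    return seconds > 0 and seconds % 60 == 0


for candidate in (10, 7, 120, 90):
    print(candidate, is_valid_collection_interval(candidate))
```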

`metrics` section

{
  "metrics": {
    "append_dimensions": {
      "ImageId": "${aws:ImageId}",
      "InstanceId": "${aws:InstanceId}",
      "InstanceType": "${aws:InstanceType}",
      "AutoScalingGroupName": "${aws:AutoScalingGroupName}"
    },
    "aggregation_dimensions": [
      ["AutoScalingGroupName"],
      ["InstanceId", "InstanceType"],
      []
    ],
    "metrics_collected": {
      "cpu": {
        "resources": ["*"],
        "measurement": ["usage_system"]
      }
    }
  }
}

This section is the heart of the configuration. The metrics section specifies the custom metrics you want to collect.

metrics_collected: (required) the metrics to collect

cpu, disk, diskio, swap, mem, net, netstat, processes, and procstat work out of the box, but collectd and statsd require you to install and configure collectd and StatsD respectively.

append_dimensions: dimensions to add to the metrics

Dimensions work like filters when you search for metrics in CloudWatch Metrics.

aggregation_dimensions: the dimensions to aggregate on

For example, with [["AutoScalingGroupName"]], metrics are aggregated on the AutoScalingGroupName dimension and appear as a single metric per AutoScalingGroupName value in CloudWatch Metrics.
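
The idea behind aggregation_dimensions can be pictured with a small Python sketch that groups datapoints by each requested set of dimensions. This is a conceptual illustration only, not the agent's implementation; the function name aggregate, the sample values, and the instance and group names are hypothetical.

```python
from collections import defaultdict


def aggregate(datapoints, aggregation_dimensions):
    """Roll datapoints up per dimension set; returns {(dims...): [values]}."""
    rolled_up = defaultdict(list)
    for dims, value in datapoints:
        for wanted in aggregation_dimensions:
            if all(name in dims for name in wanted):
                key = tuple((name, dims[name]) for name in wanted)
                rolled_up[key].append(value)
    return dict(rolled_up)


datapoints = [
    ({"AutoScalingGroupName": "asg-a", "InstanceId": "i-1"}, 10.0),
    ({"AutoScalingGroupName": "asg-a", "InstanceId": "i-2"}, 30.0),
    ({"AutoScalingGroupName": "asg-b", "InstanceId": "i-3"}, 50.0),
]
# ["AutoScalingGroupName"] groups per group name; [] rolls everything into one.
result = aggregate(datapoints, [["AutoScalingGroupName"], []])
for key, values in result.items():
    print(key, values)
```

Each key in the result corresponds to one aggregated series: the two instances in asg-a collapse into a single series, and the empty dimension set collects every datapoint.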

For the list of metrics the CloudWatch agent can collect, see Metrics Collected by the CloudWatch Agent - Amazon CloudWatch.

Metrics listed in metrics_collected.*.measurement can be written either with their full name or with the resource type prefix omitted. For example, mem_used_percent in the mem field can be written as used_percent, and used_percent in the disk field can be written as disk_used_percent.
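
The short and full forms can be related with a simple prefix rule. The Python sketch below infers that rule from the examples in this article; the helper name full_measurement_name is hypothetical, and the agent's own normalization may handle cases this sketch does not.

```python
def full_measurement_name(section, measurement):
    """Return the measurement name with the section prefix, adding it if absent."""
    prefix = section + "_"
    if measurement.startswith(prefix):
        return measurement
    return prefix + measurement


examples = [
    ("mem", "used_percent"),
    ("mem", "mem_used_percent"),
    ("disk", "used_percent"),
    ("cpu", "usage_system"),
]
for section, measurement in examples:
    print(section, measurement, "->", full_measurement_name(section, measurement))
```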

`logs` section

{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/syslog",
            "log_group_name": "/my/log/group/syslog"
          }
        ]
      }
    }
  }
}

This section collects logs and sends them to CloudWatch Logs. The official documentation says logs.log_stream_name is required, but it worked fine without it.

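If the specified log group does not exist, it is created automatically. To set retention, you need to either create the log group in advance or call the PutRetentionPolicy API after it is created automatically.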
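
Creating the log group in advance and setting its retention could look like the following Python sketch. It assumes a boto3 CloudWatch Logs client is passed in; the helper name ensure_log_group_with_retention and the 30-day value in the usage comment are hypothetical.

```python
def ensure_log_group_with_retention(logs_client, log_group_name, retention_days):
    """Create the log group if it is missing, then set its retention period."""
    try:
        logs_client.create_log_group(logGroupName=log_group_name)
    except logs_client.exceptions.ResourceAlreadyExistsException:
        pass  # the agent (or someone else) already created the group
    logs_client.put_retention_policy(
        logGroupName=log_group_name, retentionInDays=retention_days
    )


# Usage (needs boto3 and AWS credentials):
#   import boto3
#   ensure_log_group_with_retention(
#       boto3.client("logs"), "/my/log/group/syslog", 30
#   )
```

Taking the client as a parameter keeps the function easy to exercise with a stub, and the same call works whether the group was created beforehand or by the agent.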

procstat plugin

The CloudWatch agent has procstat, collectd, and StatsD plugins.

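collectd and StatsD are daemons that, like the CloudWatch agent, collect metrics such as system resource data. If there is data the CloudWatch agent alone cannot collect, or if you already use collectd or StatsD and want to register their data as custom metrics, these plugins may be a good fit.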

The procstat plugin works out of the box with nothing extra to install. It collects per-process data, and you select the processes to collect with pid_file, exe, or pattern.

`pid_file`

Specify the path of a file that contains a process ID. The following collects the memory usage of the Nginx process.

{
  "metrics": {
    "metrics_collected": {
      "procstat": [
        {
          "pid_file": "/var/run/nginx.pid",
          "measurement": ["memory_rss"]
        }
      ]
    }
  }
}

This can monitor only one process, so if the target runs as multiple processes (Apache HTTP Server MPM prefork/worker, Puma clustered mode, uWSGI with multiple workers, and so on), exe or pattern below is the better choice.

`exe`

Specify the process name as a regular expression. Processes that match pgrep <pattern> are targeted. The following collects the process ID and the process count of processes that match pgrep nginx.

{
  "metrics": {
    "metrics_collected": {
      "procstat": [
        {
          "exe": "nginx",
          "measurement": ["pid", "pid_count"]
        }
      ]
    }
  }
}

`pattern`

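Specify the full command line of the process as a regular expression. Processes that match pgrep -f <pattern> are targeted. As the pgrep manual (man pgrep) explains, the difference from exe is that the search includes the options the process was started with.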

The following collects the percentage of CPU time used by processes that match pgrep -f '/var/lib/libvirt/dnsmasq/default.conf'. For example, a process started as /usr/sbin/dnsmasq --conf-file=/var/lib/libvirt/dnsmasq/default.conf is collected.

{
  "metrics": {
    "metrics_collected": {
      "procstat": [
        {
          "pattern": "/var/lib/libvirt/dnsmasq/default.conf",
          "measurement": ["cpu_usage"]
        }
      ]
    }
  }
}

Incidentally, I found the difference between exe and pattern easier to understand from telegraf/plugins/inputs/procstat than from the official AWS documentation.
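
The difference between exe and pattern can be shown with a small Python sketch that mimics pgrep and pgrep -f over a hand-written process table. The process list, PIDs, and function names are hypothetical; the sketch only illustrates which string each option is matched against.

```python
import re

PROCESSES = [
    {"pid": 101, "name": "nginx",
     "cmdline": "nginx: master process /usr/sbin/nginx"},
    {"pid": 202, "name": "dnsmasq",
     "cmdline": "/usr/sbin/dnsmasq --conf-file=/var/lib/libvirt/dnsmasq/default.conf"},
]


def match_exe(regex, processes):
    """Like pgrep <pattern>: match against the process name only."""
    return [p["pid"] for p in processes if re.search(regex, p["name"])]


def match_pattern(regex, processes):
    """Like pgrep -f <pattern>: match against the full command line."""
    return [p["pid"] for p in processes if re.search(regex, p["cmdline"])]


conf = "/var/lib/libvirt/dnsmasq/default.conf"
print(match_exe("nginx", PROCESSES))
print(match_exe(conf, PROCESSES))
print(match_pattern(conf, PROCESSES))
```

The configuration file path appears only in the command line, so exe finds nothing for it while pattern finds the dnsmasq process.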

Applying the configuration file

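Once the configuration file is ready, apply it. The JSON file has to be converted to TOML and the CloudWatch agent restarted. If you find JSON files tedious, placing a TOML file directly and restarting may also work, although I have not verified this.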

The following converts the configuration file at /tmp/config.json to TOML and then restarts the agent (the -s option).

Example

/tmp/config.json

{
  "metrics": {
    "metrics_collected": {
      "procstat": [
        {
          "exe": "nginx",
          "measurement": ["cpu_usage"]
        }
      ]
    }
  }
}

$ sudo amazon-cloudwatch-agent-ctl -a fetch-config -c file:/tmp/config.json -s
/opt/aws/amazon-cloudwatch-agent/bin/config-downloader --output-dir /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d --download-source file:/tmp/config.json --mode ec2 --config /opt/aws/amazon-cloudwatch-agent/etc/common-config.toml --multi-config default
Successfully fetched the config and saved in /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d/file_config.json.tmp
Start configuration validation...
/opt/aws/amazon-cloudwatch-agent/bin/config-translator --input /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json --input-dir /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d --output /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml --mode ec2 --config /opt/aws/amazon-cloudwatch-agent/etc/common-config.toml --multi-config default
2020/05/14 07:06:49 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d/file_config.json.tmp ...
Valid Json input schema.
I! Detecting runasuser...
No csm configuration found.
No log configuration found.
Configuration validation first phase succeeded
/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent -schematest -config /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml
Configuration validation second phase succeeded
Configuration validation succeeded
Redirecting to /bin/systemctl stop amazon-cloudwatch-agent.service
Redirecting to /bin/systemctl restart amazon-cloudwatch-agent.service

If you have multiple configuration files, apply the first one with fetch-config and the second and later ones with append-config so that earlier files are not overwritten.

Example

/tmp/base.json

{
  "agent": {
    "metrics_collection_interval": 10,
    "run_as_user": "root"
  },
  "metrics": {
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}"
    },
    "metrics_collected": {
      "cpu": {
        "resources": ["*"],
        "measurement": ["usage_system", "usage_user"]
      }
    }
  }
}

/tmp/svc03.json

{
  "metrics": {
    "metrics_collected": {
      "procstat": [
        {
          "pattern": "nginx",
          "measurement": ["memory_rss"]
        }
      ]
    }
  }
}

$ sudo amazon-cloudwatch-agent-ctl -a fetch-config -c file:/tmp/base.json
/opt/aws/amazon-cloudwatch-agent/bin/config-downloader --output-dir /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d --download-source file:/tmp/base.json --mode ec2 --config /opt/aws/amazon-cloudwatch-agent/etc/common-config.toml --multi-config default
Successfully fetched the config and saved in /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d/file_base.json.tmp
Start configuration validation...
/opt/aws/amazon-cloudwatch-agent/bin/config-translator --input /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json --input-dir /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d --output /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml --mode ec2 --config /opt/aws/amazon-cloudwatch-agent/etc/common-config.toml --multi-config default
2020/05/14 07:19:54 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d/file_base.json.tmp ...
Valid Json input schema.
I! Detecting runasuser...
No csm configuration found.
No log configuration found.
Configuration validation first phase succeeded
/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent -schematest -config /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml
Configuration validation second phase succeeded
Configuration validation succeeded
$ sudo amazon-cloudwatch-agent-ctl -a append-config -c file:/tmp/svc03.json -s
/opt/aws/amazon-cloudwatch-agent/bin/config-downloader --output-dir /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d --download-source file:/tmp/svc03.json --mode ec2 --config /opt/aws/amazon-cloudwatch-agent/etc/common-config.toml --multi-config append
Successfully fetched the config and saved in /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d/file_svc03.json.tmp
Start configuration validation...
/opt/aws/amazon-cloudwatch-agent/bin/config-translator --input /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json --input-dir /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d --output /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml --mode ec2 --config /opt/aws/amazon-cloudwatch-agent/etc/common-config.toml --multi-config append
2020/05/14 07:31:05 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json ...
/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json does not exist or cannot read. Skipping it.
2020/05/14 07:31:05 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d/file_base.json ...
2020/05/14 07:31:05 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d/file_svc03.json.tmp ...
Valid Json input schema.
I! Detecting runasuser...
No csm configuration found.
No log configuration found.
Configuration validation first phase succeeded
/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent -schematest -config /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml
Configuration validation second phase succeeded
Configuration validation succeeded
Redirecting to /bin/systemctl stop amazon-cloudwatch-agent.service
Redirecting to /bin/systemctl restart amazon-cloudwatch-agent.service
$ cat /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml
[agent]
  collection_jitter = "0s"
  debug = false
  flush_interval = "1s"
  flush_jitter = "0s"
  hostname = ""
  interval = "10s"
  logfile = "/opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log"
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  omit_hostname = false
  precision = ""
  quiet = false
  round_interval = false

[inputs]

  [[inputs.cpu]]
    fieldpass = ["usage_system", "usage_user"]
    percpu = true
    totalcpu = true
    [inputs.cpu.tags]
      "aws:StorageResolution" = "true"
      metricPath = "metrics"

  [[inputs.procstat]]
    fieldpass = ["memory_rss"]
    pattern = "nginx"
    pid_finder = "native"
    [inputs.procstat.tags]
      "aws:StorageResolution" = "true"
      metricPath = "metrics"

[outputs]

  [[outputs.cloudwatch]]
    force_flush_interval = "60s"
    namespace = "CWAgent"
    region = "ap-northeast-1"
    tagexclude = ["metricPath"]
    [outputs.cloudwatch.tagpass]
      metricPath = ["metrics"]
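
The TOML above contains both inputs.cpu and inputs.procstat, which suggests that the two JSON files were combined. One way to picture this is a recursive merge, sketched below in Python. The merge rules here (dicts merge, lists concatenate, scalars are overwritten) are assumptions for illustration and are not taken from the agent's implementation.

```python
import copy


def merge_configs(base, extra):
    """Recursively merge extra into a copy of base."""
    merged = copy.deepcopy(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_configs(merged[key], value)
        elif isinstance(value, list) and isinstance(merged.get(key), list):
            merged[key] = merged[key] + copy.deepcopy(value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


base = {
    "agent": {"metrics_collection_interval": 10, "run_as_user": "root"},
    "metrics": {
        "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
        "metrics_collected": {
            "cpu": {"resources": ["*"], "measurement": ["usage_system", "usage_user"]}
        },
    },
}
svc03 = {
    "metrics": {
        "metrics_collected": {
            "procstat": [{"pattern": "nginx", "measurement": ["memory_rss"]}]
        }
    }
}

merged = merge_configs(base, svc03)
print(sorted(merged["metrics"]["metrics_collected"]))
```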

CloudWatch Metrics

Let's run the CloudWatch agent configured above and check the data registered as custom metrics in CloudWatch Metrics.

Assume the following configuration file.

/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml

[agent]
  collection_jitter = "0s"
  debug = false
  flush_interval = "1s"
  flush_jitter = "0s"
  hostname = ""
  interval = "10s"
  logfile = "/opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log"
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  omit_hostname = false
  precision = ""
  quiet = false
  round_interval = false

[inputs]

  [[inputs.cpu]]
    fieldpass = ["usage_system", "usage_user"]
    percpu = true
    totalcpu = true
    [inputs.cpu.tags]
      "aws:StorageResolution" = "true"
      metricPath = "metrics"

  [[inputs.procstat]]
    fieldpass = ["memory_rss"]
    pattern = "nginx"
    pid_finder = "native"
    [inputs.procstat.tags]
      "aws:StorageResolution" = "true"
      metricPath = "metrics"

[outputs]

  [[outputs.cloudwatch]]
    force_flush_interval = "60s"
    namespace = "CWAgent"
    region = "ap-northeast-1"
    tagexclude = ["host", "metricPath"]
    [outputs.cloudwatch.tagpass]
      metricPath = ["metrics"]

[processors]

  [[processors.ec2tagger]]
    ec2_metadata_tags = ["InstanceId"]
    refresh_interval_seconds = "2147483647s"
    [processors.ec2tagger.tagpass]
      metricPath = ["metrics"]

You can see that CPU utilization is collected as the cpu_usage_system and cpu_usage_user metrics. To confirm that CPU utilization is captured correctly, I applied load with the stress-ng command between 16:55 and 16:58.
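
Besides the console, the registered metrics can be listed programmatically. The Python sketch below assumes a boto3 CloudWatch client is passed in and uses the CWAgent namespace from the TOML above; the helper name list_agent_metrics is hypothetical.

```python
def list_agent_metrics(cloudwatch_client, namespace="CWAgent", metric_name=None):
    """Return every metric in the namespace, following NextToken pagination."""
    kwargs = {"Namespace": namespace}
    if metric_name is not None:
        kwargs["MetricName"] = metric_name
    metrics = []
    while True:
        page = cloudwatch_client.list_metrics(**kwargs)
        metrics.extend(page.get("Metrics", []))
        token = page.get("NextToken")
        if not token:
            return metrics
        kwargs["NextToken"] = token


# Usage (needs boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("cloudwatch")
#   for metric in list_agent_metrics(client, metric_name="cpu_usage_system"):
#       print(metric["MetricName"], metric["Dimensions"])
```

Listing the metrics first is a convenient way to see which dimensions each one carries before querying its datapoints.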

Usage example

Here is an example CloudWatch agent configuration for an EC2 instance that works as a server.

Nginx

Nginx uses its initial configuration
Collect the CPU and memory usage of the Nginx processes and register them in CloudWatch as custom metrics
Send the Nginx logs and the CloudWatch agent log to CloudWatch Logs

{
  "agent": {
    "metrics_collection_interval": 10,
    "logfile": "/var/log/amazon-cloudwatch-agent/amazon-cloudwatch-agent.log"
  },
  "metrics": {
    "namespace": "/prod/svc01",
    "metrics_collected": {
      "procstat": [
        {
          "pattern": "nginx",
          "measurement": ["cpu_usage", "memory_rss"],
          "metrics_collection_interval": 10
        }
      ]
    },
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}"
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/nginx/access.log",
            "log_group_name": "/prod/svc01/nginx",
            "log_stream_name": "{instance_id}",
            "timestamp_format": "%d/%b/%Y:%H:%M:%S %z",
            "multi_line_start_pattern": "{timestamp_format}",
            "auto_removal": true
          },
          {
            "file_path": "/var/log/nginx/error.log",
            "log_group_name": "/prod/svc01/nginx",
            "log_stream_name": "{instance_id}",
            "timestamp_format": "%Y/%m/%d %H:%M:%S",
            "multi_line_start_pattern": "{timestamp_format}",
            "auto_removal": true
          },
          {
            "file_path": "/var/log/amazon-cloudwatch-agent/amazon-cloudwatch-agent.log",
            "log_group_name": "/prod/svc01/amazon-cloudwatch-agent",
            "log_stream_name": "/prod/svc01/amazon-cloudwatch-agent",
            "timestamp_format": "%Y-%m-%dT%H:%M:%S",
            "multi_line_start_pattern": "{timestamp_format}",
            "auto_removal": true
          }
        ]
      }
    }
  }
}

The CloudWatch agent's log is written to /var/log/amazon-cloudwatch-agent/amazon-cloudwatch-agent.log instead of the default /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log. The default log file also receives log lines produced when amazon-cloudwatch-agent-ctl operates the CloudWatch agent, and those lines use a different timestamp format, so no single timestamp_format and multi_line_start_pattern would be correct for that file.

/opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log

2020/05/15 02:22:38 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d/file_config.json ...
2020/05/15 02:22:38 I! Detected runAsUser: root
2020/05/15 02:22:38 I! Change ownership to root:root
2020-05-15T02:22:38Z I! cloudwatch: get unique roll up list []
2020-05-15T02:22:38Z I! cloudwatch: publish with ForceFlushInterval: 1m0s, Publish Jitter: 37s
2020-05-15T02:22:38Z I! Starting AmazonCloudWatchAgent (version 1.237768.0)
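
The mismatch between the two timestamp styles in the log above can be checked with a short Python sketch. It uses Python's datetime.strptime as a stand-in for the agent's timestamp_format handling, which is an assumption; the directives used here (%Y, %m, %d, %H, %M, %S) are read the same way by Python. The helper name starts_with_timestamp is hypothetical.

```python
from datetime import datetime

CTL_LINE = "2020/05/15 02:22:38 I! Detected runAsUser: root"
AGENT_LINE = (
    "2020-05-15T02:22:38Z I! Starting AmazonCloudWatchAgent (version 1.237768.0)"
)


def starts_with_timestamp(line, timestamp_format, width=19):
    """True if the first `width` characters parse with timestamp_format."""
    try:
        datetime.strptime(line[:width], timestamp_format)
    except ValueError:
        return False
    return True


for fmt in ("%Y/%m/%d %H:%M:%S", "%Y-%m-%dT%H:%M:%S"):
    print(fmt, starts_with_timestamp(CTL_LINE, fmt), starts_with_timestamp(AGENT_LINE, fmt))
```

Each format matches only one of the two line styles, which is why a file that mixes them cannot be described by a single timestamp_format.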

Conclusion

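This article showed how to install the CloudWatch agent, register an EC2 instance's resource data as custom metrics, and check it in CloudWatch Metrics. This time I configured the agent from a JSON file, but if you store the JSON configuration in Systems Manager Parameter Store and use Systems Manager Run Command, you do not need to log in to the instance, and configuring multiple instances should be easy as well.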

If you build collection, monitoring, and notification with AWS services, you will also need CloudWatch Alarms and SNS. Why not start by introducing the CloudWatch agent for just the collection part?
