swiftlm.access.host.operation.get.bytes |
|
This metric is the number of bytes read from objects in GET requests
processed by this host during the last minute. Only successful GET
requests to objects are counted. GET requests to the account or
container is not included.
|
swiftlm.access.host.operation.ops |
|
This metric is a count of the all the API requests made to Swift that
were processed by this host during the last minute.
|
swiftlm.access.host.operation.project.get.bytes | | |
swiftlm.access.host.operation.project.ops | | |
swiftlm.access.host.operation.project.put.bytes | | |
swiftlm.access.host.operation.put.bytes |
|
This metric is the number of bytes written to objects in PUT or POST
requests processed by this host during the last minute. Only successful
requests to objects are counted. Requests to the account or container is
not included.
|
swiftlm.access.host.operation.status | | |
swiftlm.access.project.operation.status |
|
This metric reports whether the swiftlm-access-log-tailer program is
running normally.
|
swiftlm.access.project.operation.ops |
tenant_id
service=object-storage
|
This metric is a count of the all the API requests made to Swift that
were processed by this host during the last minute to a given project
id.
|
swiftlm.access.project.operation.get.bytes |
tenant_id
service=object-storage
|
This metric is the number of bytes read from objects in GET requests
processed by this host for a given project during the last minute. Only
successful GET requests to objects are counted. GET requests to the
account or container is not included.
|
swiftlm.access.project.operation.put.bytes |
tenant_id
service=object-storage
|
This metric is the number of bytes written to objects in PUT or POST
requests processed by this host for a given project during the last
minute. Only successful requests to objects are counted. Requests to the
account or container is not included.
|
swiftlm.async_pending.cp.total.queue_length |
observer_host
service=object-storage
|
This metric reports the total length of all async pending queues in the
system.
When a container update fails, the update is placed on the async pending
queue. An update may fail becuase the container server is too busy or
because the server is down or failed. Later the system will “replay”
updates from the queue – so eventually, the container listings will
show all objects known to the system.
If you know that container servers are down, it is normal to see the
value of async pending increase. Once the server is restored, the value
should return to zero.
A non-zero value may also indicate that containers are too large. Look
for “lock timeout” messages in /var/log/swift/swift.log. If you find
such messages consider reducing the container size or enable rate
limiting.
|
swiftlm.check.failure |
check
error
component
service=object-storage
|
The total exception string is truncated if longer than 1919 characters
and an ellipsis is prepended in the first three characters of the
message. If there is more than one error reported, the list of errors is
paired to the last reported error and the operator is expected to
resolve failures until no more are reported. Where there are no further
reported errors, the Value Class is emitted as ‘Ok’.
|
swiftlm.diskusage.cp.avg.usage |
observer_host
service=object-storage
|
Is the average utilization of all drives in the system. The value is a
percentage (example: 30.0 means 30% of the total space is used).
|
swiftlm.diskusage.cp.max.usage |
observer_host
service=object-storage
|
Is the highest utilization of all drives in the system. The value is a
percentage (example: 80.0 means at least one drive is 80% utilized). The
value is just as important as swiftlm.diskusage.usage.avg. For example,
if swiftlm.diskusage.usage.avg is 70% you might think that there is
plenty of space available. However, if swiftlm.diskusage.usage.max is
100%, this means that some objects cannot be stored on that drive. Swift
will store replicas on other drives. However, this will create extra
overhead.
|
swiftlm.diskusage.cp.min.usage |
observer_host
service=object-storage
|
Is the lowest utilization of all drives in the system. The value is a
percentage (example: 10.0 means at least one drive is 10% utilized)
|
swiftlm.diskusage.cp.total.avail |
observer_host
service=object-storage
|
Is the size in bytes of available (unused) space of all drives in the
system. Only drives used by Swift are included.
|
swiftlm.diskusage.cp.total.size |
observer_host
service=object-storage
|
Is the size in bytes of raw size of all drives in the system.
|
swiftlm.diskusage.cp.total.used |
observer_host
service=object-storage
|
Is the size in bytes of used space of all drives in the system. Only
drives used by Swift are included.
|
swiftlm.diskusage.host.avg.usage |
hostname
service=object-storage
|
This metric reports the average percent usage of all Swift filesystems
on a host.
|
swiftlm.diskusage.host.max.usage |
hostname
service=object-storage
|
This metric reports the percent usage of a Swift filesystem that is most
used (full) on a host. The value is the max of the percentage used of
all Swift filesystems.
|
swiftlm.diskusage.host.min.usage |
hostname
service=object-storage
|
This metric reports the percent usage of a Swift filesystem that is
least used (has free space) on a host. The value is the min of the
percentage used of all Swift filesystems.
|
swiftlm.diskusage.host.val.avail |
hostname
service=object-storage
mount
device
label
|
This metric reports the number of bytes available (free) in a Swift
filesystem. The value is an integer (units: Bytes)
|
swiftlm.diskusage.host.val.size |
hostname
service=object-storage
mount
device
label
|
This metric reports the size in bytes of a Swift filesystem. The
value is an integer (units: Bytes)
|
swiftlm.diskusage.host.val.usage |
hostname
service=object-storage
mount
device
label
|
This metric reports the percent usage of a Swift filesystem. The value
is a floating point number in range 0.0 to 100.0
|
swiftlm.diskusage.host.val.used |
hostname
service=object-storage
mount
device
label
|
This metric reports the number of used bytes in a Swift filesystem.
The value is an integer (units: Bytes)
|
swiftlm.load.cp.avg.five |
observer_host
service=object-storage
|
This is the averaged value of the five minutes system load average of
all nodes in the Swift system.
|
swiftlm.load.cp.max.five |
observer_host
service=object-storage
|
This is the five minute load average of the busiest host in the Swift
system.
|
swiftlm.load.cp.min.five |
observer_host
service=object-storage
|
This is the five minute load average of the least loaded host in the
Swift system.
|
swiftlm.load.host.val.five |
hostname
service=object-storage
|
This metric reports the 5 minute load average of a host. The value is
derived from /proc/loadavg .
|
swiftlm.md5sum.cp.check.ring_checksums |
observer_host
service=object-storage
|
If you are in the middle of deploying new rings, it is normal for this
to be in the failed state.
However, if you are not in the middle of a deployment, you need to
investigate the cause. Use “swift-recon –md5 -v” to identify the
problem hosts.
|
swiftlm.replication.cp.avg.account_duration |
observer_host
service=object-storage
|
This is the average across all servers for the account replicator to
complete a cycle. As the system becomes busy, the time to complete a
cycle increases. The value is in seconds.
|
swiftlm.replication.cp.avg.container_duration |
observer_host
service=object-storage
|
This is the average across all servers for the container replicator to
complete a cycle. As the system becomes busy, the time to complete a
cycle increases. The value is in seconds.
|
swiftlm.replication.cp.avg.object_duration |
observer_host
service=object-storage
|
This is the average across all servers for the object replicator to
complete a cycle. As the system becomes busy, the time to complete a
cycle increases. The value is in seconds.
|
swiftlm.replication.cp.max.account_last |
hostname
path
service=object-storage
|
This is the number of seconds since the account replicator last
completed a scan on the host that has the oldest completion time.
Normally the replicators runs periodically and hence this value will
decrease whenever a replicator completes. However, if a replicator is
not completing a cycle, this value increases (by one second for each
second that the replicator is not completing). If the value remains high
and increasing for a long period of time, it indicates that one of the
hosts is not completing the replication cycle.
|
swiftlm.replication.cp.max.container_last |
hostname
path
service=object-storage
|
This is the number of seconds since the container replicator last
completed a scan on the host that has the oldest completion time.
Normally the replicators runs periodically and hence this value will
decrease whenever a replicator completes. However, if a replicator is
not completing a cycle, this value increases (by one second for each
second that the replicator is not completing). If the value remains high
and increasing for a long period of time, it indicates that one of the
hosts is not completing the replication cycle.
|
swiftlm.replication.cp.max.object_last |
hostname
path
service=object-storage
|
This is the number of seconds since the object replicator last completed
a scan on the host that has the oldest completion time. Normally the
replicators runs periodically and hence this value will decrease
whenever a replicator completes. However, if a replicator is not
completing a cycle, this value increases (by one second for each second
that the replicator is not completing). If the value remains high and
increasing for a long period of time, it indicates that one of the hosts
is not completing the replication cycle.
|
swiftlm.swift.drive_audit |
hostname
service=object-storage
mount_point
kernel_device
|
If an unrecoverable read error (URE) occurs on a filesystem, the error
is logged in the kernel log. The swift-drive-audit program scans the
kernel log looking for patterns indicating possible UREs.
To get more information, log onto the node in question and run:
sudoswift-drive-audit/etc/swift/drive-audit.conf
UREs are common on large disk drives. They do not necessarily indicate
that the drive is failed. You can use the xfs_repair command to attempt
to repair the filesystem. Failing this, you may need to wipe the
filesystem.
If UREs occur very often on a specific drive, this may indicate that the
drive is about to fail and should be replaced.
|
swiftlm.swift.file_ownership.config |
|
This metric reports if a directory or file has the appropriate owner.
The check looks at Swift configuration directories and files. It also
looks at the top-level directories of mounted file systems (for example,
/srv/node/disk0 and /srv/node/disk0/objects).
|
swiftlm.swift.file_ownership.data |
|
This metric reports if a directory or file has the appropriate owner.
The check looks at Swift configuration directories and files. It also
looks at the top-level directories of mounted file systems (for example,
/srv/node/disk0 and /srv/node/disk0/objects).
|
swiftlm.swiftlm_check |
hostname
service=object-storage
|
This indicates of the Swiftlm Monasca Agent Plug-in is running normally.
If the status is failed, it probable that some or all metrics are no
longer being reported.
|
swiftlm.swift.replication.account.last_replication |
hostname
service=object-storage
|
This reports how long (in seconds) since the replicator process last
finished a replication run. If the replicator is stuck, the time will
keep increasing forever. The time a replicator normally takes depends on
disk sizes and how much data needs to be replicated. However, a value
over 24 hours is generally bad.
|
swiftlm.swift.replication.container.last_replication |
hostname
service=object-storage
|
This reports how long (in seconds) since the replicator process last
finished a replication run. If the replicator is stuck, the time will
keep increasing forever. The time a replicator normally takes depends on
disk sizes and how much data needs to be replicated. However, a value
over 24 hours is generally bad.
|
swiftlm.swift.replication.object.last_replication |
hostname
service=object-storage
|
This reports how long (in seconds) since the replicator process last
finished a replication run. If the replicator is stuck, the time will
keep increasing forever. The time a replicator normally takes depends on
disk sizes and how much data needs to be replicated. However, a value
over 24 hours is generally bad.
|
swiftlm.swift.swift_services |
hostname
service=object-storage
|
This metric reports of the process as named in the component dimension
and the msg value_meta is running or not.
Use the swift-start.yml playbook to attempt to
restart the stopped process (it will start any process that has stopped
– you do not need to specifically name the process).
|
swiftlm.swift.swift_services.check_ip_port |
hostname
service=object-storage
component
| Reports if a service is listening to the correct ip and port. |
swiftlm.systems.check_mounts |
hostname
service=object-storage
mount
device
label
|
This metric reports the mount state of each drive that should be mounted
on this node.
|
swiftlm.systems.connectivity.connect_check |
observer_host
url
target_port
service=object-storage
|
This metric reports if a server can connect to a VIPs. Currently the
following VIPs are checked:
|
swiftlm.systems.connectivity.memcache_check |
observer_host
hostname
target_port
service=object-storage
|
This metric reports if memcached on the host as specified by the
hostname dimension is accepting connections from the host running the
check. The following value_meta.msg are used:
We successfully connected to <hostname> on port
<target_port>
{
"dimensions": {
"hostname": "ardana-ccp-c1-m1-mgmt",
"observer_host": "ardana-ccp-c1-m1-mgmt",
"service": "object-storage",
"target_port": "11211"
},
"metric": "swiftlm.systems.connectivity.memcache_check",
"timestamp": 1449084058,
"value": 0,
"value_meta": {
"msg": "ardana-ccp-c1-m1-mgmt:11211 ok"
}
}
We failed to connect to <hostname> on port <target_port>
{
"dimensions": {
"fail_message": "[Errno 111] Connection refused",
"hostname": "ardana-ccp-c1-m1-mgmt",
"observer_host": "ardana-ccp-c1-m1-mgmt",
"service": "object-storage",
"target_port": "11211"
},
"metric": "swiftlm.systems.connectivity.memcache_check",
"timestamp": 1449084150,
"value": 2,
"value_meta": {
"msg": "ardana-ccp-c1-m1-mgmt:11211 [Errno 111] Connection refused"
}
}
|
swiftlm.systems.connectivity.rsync_check |
observer_host
hostname
target_port
service=object-storage
|
This metric reports if rsyncd on the host as specified by the hostname
dimension is accepting connections from the host running the check. The
following value_meta.msg are used:
We successfully connected to <hostname> on port
<target_port>:
{
"dimensions": {
"hostname": "ardana-ccp-c1-m1-mgmt",
"observer_host": "ardana-ccp-c1-m1-mgmt",
"service": "object-storage",
"target_port": "873"
},
"metric": "swiftlm.systems.connectivity.rsync_check",
"timestamp": 1449082663,
"value": 0,
"value_meta": {
"msg": "ardana-ccp-c1-m1-mgmt:873 ok"
}
}
We failed to connect to <hostname> on port <target_port>:
{
"dimensions": {
"fail_message": "[Errno 111] Connection refused",
"hostname": "ardana-ccp-c1-m1-mgmt",
"observer_host": "ardana-ccp-c1-m1-mgmt",
"service": "object-storage",
"target_port": "873"
},
"metric": "swiftlm.systems.connectivity.rsync_check",
"timestamp": 1449082860,
"value": 2,
"value_meta": {
"msg": "ardana-ccp-c1-m1-mgmt:873 [Errno 111] Connection refused"
}
}
|
swiftlm.umon.target.avg.latency_sec |
component
hostname
observer_host
service=object-storage
url
|
Reports the average value of N-iterations of the latency values recorded
for a component.
|
swiftlm.umon.target.check.state |
component
hostname
observer_host
service=object-storage
url
|
This metric reports the state of each component after N-iterations of
checks. If the initial check succeeds, the checks move onto the next
component until all components are queried, then the checks sleep for
‘main_loop_interval’ seconds. If a check fails, it is retried every
second for ‘retries’ number of times per component. If the check
fails ‘retries’ times, it is reported as a fail instance.
A successful state will be reported in JSON:
{
"dimensions": {
"component": "rest-api",
"hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt",
"observer_host": "ardana-ccp-c1-m1-mgmt",
"service": "object-storage",
"url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080"
},
"metric": "swiftlm.umon.target.check.state",
"timestamp": 1453111805,
"value": 0
},
A failed state will report a “fail” value and the value_meta will
provide the http response error.
{
"dimensions": {
"component": "rest-api",
"hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt",
"observer_host": "ardana-ccp-c1-m1-mgmt",
"service": "object-storage",
"url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080"
},
"metric": "swiftlm.umon.target.check.state",
"timestamp": 1453112841,
"value": 2,
"value_meta": {
"msg": "HTTPConnectionPool(host='192.168.245.9', port=8080): Max retries exceeded with url: /v1/AUTH_76538ce683654a35983b62e333001b47 (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fd857d7f550>: Failed to establish a new connection: [Errno 110] Connection timed out',))"
}
}
|
swiftlm.umon.target.max.latency_sec |
component
hostname
observer_host
service=object-storage
url
|
This metric reports the maximum response time in seconds of a REST call
from the observer to the component REST API listening on the reported
host
A response time query will be reported in JSON:
{
"dimensions": {
"component": "rest-api",
"hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt",
"observer_host": "ardana-ccp-c1-m1-mgmt",
"service": "object-storage",
"url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080"
},
"metric": "swiftlm.umon.target.max.latency_sec",
"timestamp": 1453111805,
"value": 0.2772650718688965
}
A failed query will have a much longer time value:
{
"dimensions": {
"component": "rest-api",
"hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt",
"observer_host": "ardana-ccp-c1-m1-mgmt",
"service": "object-storage",
"url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080"
},
"metric": "swiftlm.umon.target.max.latency_sec",
"timestamp": 1453112841,
"value": 127.288015127182
}
|
swiftlm.umon.target.min.latency_sec |
component
hostname
observer_host
service=object-storage
url
|
This metric reports the minimum response time in seconds of a REST call
from the observer to the component REST API listening on the reported
host
A response time query will be reported in JSON:
{
"dimensions": {
"component": "rest-api",
"hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt",
"observer_host": "ardana-ccp-c1-m1-mgmt",
"service": "object-storage",
"url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080"
},
"metric": "swiftlm.umon.target.min.latency_sec",
"timestamp": 1453111805,
"value": 0.10025882720947266
}
A failed query will have a much longer time value:
{
"dimensions": {
"component": "rest-api",
"hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt",
"observer_host": "ardana-ccp-c1-m1-mgmt",
"service": "object-storage",
"url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080"
},
"metric": "swiftlm.umon.target.min.latency_sec",
"timestamp": 1453112841,
"value": 127.25378203392029
}
|
swiftlm.umon.target.val.avail_day |
component
hostname
observer_host
service=object-storage
url
|
This metric reports the average of all the collected records in the
swiftlm.umon.target.val.avail_minute metric data. This is a walking
average data set of these approximately per-minute states of the Swift
Object Store. The most basic case is a whole day of successful
per-minute records, which will average to 100% availability. If there is
any downtime throughout the day resulting in gaps of data which are two
minutes or longer, the per-minute availability data will be “back
filled” with an assumption of a down state for all the per-minute
records which did not exist during the non-reported time. Because this
is a walking average of approximately 24 hours worth of data, any
outtage will take 24 hours to be purged from the dataset.
A 24-hour average availability report:
{
"dimensions": {
"component": "rest-api",
"hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt",
"observer_host": "ardana-ccp-c1-m1-mgmt",
"service": "object-storage",
"url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080"
},
"metric": "swiftlm.umon.target.val.avail_day",
"timestamp": 1453645405,
"value": 7.894736842105263
}
|
swiftlm.umon.target.val.avail_minute |
component
hostname
observer_host
service=object-storage
url
|
A value of 100 indicates that swift-uptime-monitor was able to get a
token from Keystone and was able to perform operations against the Swift
API during the reported minute. A value of zero indicates that either
Keystone or Swift failed to respond successfully. A metric is produced
every minute that swift-uptime-monitor is running.
An “up” minute report value will report 100 [percent]:
{
"dimensions": {
"component": "rest-api",
"hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt",
"observer_host": "ardana-ccp-c1-m1-mgmt",
"service": "object-storage",
"url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080"
},
"metric": "swiftlm.umon.target.val.avail_minute",
"timestamp": 1453645405,
"value": 100.0
}
A “down” minute report value will report 0 [percent]:
{
"dimensions": {
"component": "rest-api",
"hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt",
"observer_host": "ardana-ccp-c1-m1-mgmt",
"service": "object-storage",
"url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080"
},
"metric": "swiftlm.umon.target.val.avail_minute",
"timestamp": 1453649139,
"value": 0.0
}
|
swiftlm.hp_hardware.hpssacli.smart_array.firmware |
component
hostname
service=object-storage
component
model
controller_slot
|
This metric reports the firmware version of a component of a Smart Array
controller.
|
swiftlm.hp_hardware.hpssacli.smart_array |
component
hostname
service=object-storage
component
sub_component
model
controller_slot
|
This reports the status of various sub-components of a Smart Array
Controller.
A failure is considered to have occured if:
Controller is failed
Cache is not enabled or has failed
Battery or capacitor is not installed
Battery or capacitor has failed
|
swiftlm.hp_hardware.hpssacli.physical_drive |
component
hostname
service=object-storage
component
controller_slot
box
bay
|
This reports the status of a disk drive attached to a Smart Array
controller.
|
swiftlm.hp_hardware.hpssacli.logical_drive |
component
hostname
observer_host
service=object-storage
controller_slot
array
logical_drive
sub_component
|
This reports the status of a LUN presented by a Smart Array controller.
A LUN is considered failed if the LUN has failed or if the LUN cache is
not enabled and working.
|