Monitoring the ARC Information System
The main configuration section for these probes is arcinfosys, see
Configuration Files.
EGIIS Check
This probe will soon be deprecated. Do not use it for new deployments.
To monitor an EGIIS service, use
check_egiis -H <HOST> [-P <PORT>] --index=<INDEX-NAME>
This will do an LDAP query of the EGIIS service on <HOST>:<PORT>. The
default port is 2135. The base DN of the query is
Mds-Vo-name=<INDEX-NAME>, o=grid. The probe will also fetch the subschema at
cn=subschema and check the presence of attributes against MAY and MUST
specifications in the schema. In addition, some type conversions are
attempted to catch invalid data.
Any validation error will give a CRITICAL Nagios status. If the index is empty, a WARNING Nagios status is reported. Otherwise, the status is OK and counts for the different registration states are printed.
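To inspect the same data by hand, the query target described above can be assembled as follows. This is a minimal sketch: the helper names and the example host and index name are hypothetical, and the search scope and requested attributes are left at the ldapsearch defaults, which may differ from what the probe itself requests.

```python
import shlex

DEFAULT_PORT = 2135  # default EGIIS port stated above


def egiis_base_dn(index_name):
    """Return the base DN queried for the given index name."""
    return "Mds-Vo-name=%s, o=grid" % index_name


def ldapsearch_argv(host, index_name, port=DEFAULT_PORT):
    """Build an anonymous OpenLDAP ldapsearch call against the same base DN."""
    return [
        "ldapsearch", "-x",                    # simple (anonymous) bind
        "-H", "ldap://%s:%d" % (host, port),   # server URI
        "-b", egiis_base_dn(index_name),       # search base
    ]


# Hypothetical host and index name, for illustration only.
print(shlex.join(ldapsearch_argv("index.example.org", "ExampleIndex")))
```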
CE Health State using EMIES
The following probe contacts the EMIES service of the compute element and
checks the HealtStatus element in the reply.
check_arcservice -u <url> [-k <key-file> -c <cert-file>] [-t <timeout>]
arcinfo -c <host> shows whether a CE supports EMIES and which URL to use.
EMIES uses SSL client authentication. By default the host certificate is
used. To use a grid proxy, pass it as both key and certificate. Example:
check_arcservice -u https://arcce.example.org:443/arex \
-k /tmp/x509up_1000 -c /tmp/x509up_1000
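When the probe is launched from a wrapper script, the rule that a grid proxy is passed as both key and certificate can be captured in one place. The sketch below only assembles the argument list from the options shown above; the function name and its parameters are hypothetical.

```python
def check_arcservice_argv(url, proxy=None, timeout=None):
    """Build the check_arcservice command line for an EMIES endpoint."""
    argv = ["check_arcservice", "-u", url]
    if proxy is not None:
        # A grid proxy file holds both key and certificate, so it is
        # passed to both -k and -c.
        argv += ["-k", proxy, "-c", proxy]
    if timeout is not None:
        argv += ["-t", str(timeout)]
    return argv


print(check_arcservice_argv("https://arcce.example.org:443/arex",
                            proxy="/tmp/x509up_1000"))
```

Omitting proxy leaves out -k and -c, in which case the host certificate is used by default, as noted above.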
CE Infosys Validation for the NorduGrid and GLUE 1 Schemas
This probe will soon be deprecated. Do not use it for new deployments.
The ARIS probe is invoked with
check_aris -H <HOST> [-P <PORT>] [--cluster <CLUSTER>...] \
[--cluster-test <testname>...] [--queue-test <testname>...] \
[OTHER-OPTIONS...]
See check_aris --help for the full list of options.
It will query Mds-Vo-name=local, o=grid on <HOST>:<PORT>. The default
port is 2135. If one or more clusters are specified with the --cluster
option, only those will be checked (nordugrid-cluster-name=<CLUSTER>), and
it is considered an error for any of them to be missing. The probe validates
attributes of entries against MAY and MUST of the schema, and attempts some
type conversions. For each cluster found, the probe will query and validate
its queues.
If no clusters are found, or if no queues are found for a given cluster, it
will be reported as a warning. You can change this by passing a Nagios status
to the option --if-no-clusters or --if-no-queues, respectively.
Valid statuses are ok, warning, critical, and unknown, though
only the first three make sense here.
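For reference, the four status names correspond to the standard Nagios plugin exit codes. The sketch below shows that mapping and how an option value such as the one given to --if-no-clusters could be resolved; the function name is hypothetical and the probe's own option handling may differ.

```python
# Standard Nagios plugin exit codes.
NAGIOS_STATUS = {"ok": 0, "warning": 1, "critical": 2, "unknown": 3}


def status_if_missing(option_value="warning"):
    """Exit code to report when no clusters (or no queues) are found.

    The default mirrors the documented behaviour of reporting a warning.
    """
    if option_value not in NAGIOS_STATUS:
        raise ValueError("not a Nagios status: %r" % option_value)
    return NAGIOS_STATUS[option_value]


print(status_if_missing())            # default: warning
print(status_if_missing("critical"))  # e.g. --if-no-clusters critical
```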
This probe can also do custom checks on the LDAP data, either numeric limits
or regular-expression matches. A custom test defined in the configuration
file under a section arcinfosys.aris.<testname> can be enabled by passing
any number of --cluster-test <testname> and --queue-test <testname>
options to the probe. The tests are run on entries of the type
nordugrid-cluster and nordugrid-queue, respectively.
The ARIS infosystem contains an attribute nordugrid-cluster-contactstring
which provides the interface for job submission. You can check that this URL
is accessible by passing --check-contact. This will do a list operation
and, if the logging level is INFO or lower, will report the number of
entries. If the attribute is missing or the URL is inaccessible, the service
goes CRITICAL with an appropriate message.
Limit Checks
A limit check takes the form
[arcinfosys.aris.<testname>]
type = limit
value = <expr>
critical.min = <value>
critical.max = <value>
critical.message = <message>
warning.min = <value>
warning.max = <value>
warning.message = <message>
The type and value variables are required, and at least one of the
min or one of the max variables should be given for the test to be
useful. There are reasonable defaults for the messages, though if your
<expr> is complex, you may want to provide a more human-readable version.
The probe will

- Evaluate <expr> using Python’s eval function, in an environment that maps the LDAP attribute names to the corresponding converted values. The variable names are obtained from the attribute names by replacing “-” with “_” and stripping common prefixes including “nordugrid-cluster-”, “nordugrid-queue-”, and “Mds-”.
- If critical.min is given and the result is below this value, or if critical.max is given and the result is above this value, report it as a critical error.
- Similarly for warning.min and warning.max, reported as a warning.
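The evaluation steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the probe's actual code: the function names are hypothetical, only the three prefixes named above are stripped, the settings are passed as a plain dictionary of strings, and the example test and attribute values are made up.

```python
PREFIXES = ("nordugrid-cluster-", "nordugrid-queue-", "Mds-")


def variable_name(attribute):
    """Derive the variable name used in <expr> from an LDAP attribute name."""
    for prefix in PREFIXES:
        if attribute.startswith(prefix):
            attribute = attribute[len(prefix):]
            break
    return attribute.replace("-", "_")


def run_limit_check(test, entry):
    """Evaluate a limit test against one entry.

    test  -- settings of one [arcinfosys.aris.<testname>] section
    entry -- LDAP attribute names mapped to already converted values
    """
    env = {variable_name(name): value for name, value in entry.items()}
    result = eval(test["value"], {}, env)
    # Critical limits take precedence over warning limits.
    for level in ("critical", "warning"):
        low = test.get(level + ".min")
        high = test.get(level + ".max")
        if low is not None and result < float(low):
            return level, result
        if high is not None and result > float(high):
            return level, result
    return "ok", result


# Illustrative test: warn when a queue is more than 70 % full.
test = {"type": "limit", "value": "running / maxrunning",
        "warning.max": "0.7", "critical.max": "0.9"}
entry = {"nordugrid-queue-running": 40, "nordugrid-queue-maxrunning": 50}
print(run_limit_check(test, entry))
```

Because <expr> is passed to eval, it can contain arbitrary Python code, so the configuration file should be writable only by trusted users.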
Regular Expression Checks
A regular expression check takes the form:
[arcinfosys.aris.<testname>]
type = regex
variable = <varname>
critical.pattern = <python-regex>
critical.message = <message>
warning.pattern = <python-regex>
warning.message = <message>
The type and variable settings are required, and you should specify at
least one of critical.pattern and warning.pattern. The variable name
is obtained the same way as for the limit checks. The probe will consider all
values for the LDAP attribute corresponding to <varname>.
- If critical.pattern is specified and none of the values match it, a critical condition is reported.
- Otherwise, if warning.pattern is specified and none of the values match it, a warning is reported.
The following example will issue a critical state if a queue is not active:
[arcinfosys.aris.queue-active]
type = regex
variable = status
critical.pattern = ^active$
critical.message = Inactive queue
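The matching rule can be illustrated with the queue-active test above. The sketch below is an approximation under assumptions: the function name is hypothetical, the settings are passed as a plain dictionary, and re.search is used, whereas the probe may use a different matching function (for an anchored pattern such as ^active$ the outcome is the same).

```python
import re


def run_regex_check(test, values):
    """Apply a regex test to all values of one LDAP attribute.

    A level is reported only if its pattern is set and no value matches it.
    """
    for level in ("critical", "warning"):
        pattern = test.get(level + ".pattern")
        if pattern is None:
            continue
        if not any(re.search(pattern, value) for value in values):
            return level, test.get(level + ".message")
    return "ok", None


queue_active = {
    "type": "regex",
    "variable": "status",
    "critical.pattern": "^active$",
    "critical.message": "Inactive queue",
}

# The status values here are illustrative.
print(run_regex_check(queue_active, ["active"]))
print(run_regex_check(queue_active, ["inactive"]))
```

Note that a pattern describes the healthy state: the alarm is raised when nothing matches, not when something does.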
Glue Schema Checks
Some CEs publish cluster and queue information in the Glue schema in addition
to the NorduGrid schema. You can enable schema checks for these, if present,
by passing --enable-glue.
The information in the Glue entries should match information in the ARC
entries as described in [ARCIS2011]. You can enable a partial comparison of
GlueCE, GlueCluster, and GlueSubCluster records by passing --compare-glue.
- ARCIS2011
“The NorduGrid-ARC Information System”; Balázs Kónya and Daniel Johansson; NORDUGRID-TECH-4; http://www.nordugrid.org/documents/arc_infosys.pdf