Monitoring the ARC Information System
The main configuration section for these probes is arcinfosys, see Configuration Files.
EGIIS Check
To monitor an EGIIS service, use
check_egiis -H <HOST> [-P <PORT>] --index=<INDEX-NAME>
This will do an LDAP query of the EGIIS service on <HOST>:<PORT>. The default port is 2135. The base DN of the query is Mds-Vo-name=<INDEX-NAME>, o=grid. The probe will also fetch the subschema at cn=subschema and check the presence of attributes against the MAY and MUST specifications in the schema. In addition, some type conversions are attempted to catch invalid data.
Any validation error will give a CRITICAL Nagios status. If the index is empty, a WARNING Nagios status is reported. Otherwise, the status is OK and counts for the different registration states are printed.
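The MUST/MAY validation and the resulting status can be sketched as follows. This is a minimal illustration, not the probe's actual code: the function names, the representation of an entry as a dict of value lists, and the exact-case attribute matching are all assumptions made for the example.

```python
# Hypothetical sketch of the validation logic; names are illustrative only.
OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin return codes


def validate_entry(entry, must, may):
    """Return a list of validation errors for one LDAP entry.

    entry: dict mapping attribute names to lists of values
    must:  set of attribute names required by the schema
    may:   set of attribute names allowed but optional
    """
    present = set(entry)
    errors = []
    for attr in sorted(must - present):
        errors.append("missing required attribute %s" % attr)
    for attr in sorted(present - must - may):
        errors.append("attribute %s is not allowed by the schema" % attr)
    return errors


def index_status(entries, must, may):
    """Map the validation result for a whole index to a Nagios status."""
    errors = [err for entry in entries
              for err in validate_entry(entry, must, may)]
    if errors:
        return CRITICAL, errors
    if not entries:
        return WARNING, ["index is empty"]
    return OK, []
```

A real implementation would additionally compare attribute names case-insensitively and attempt the type conversions mentioned above.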
CE Health State using EMIES
The following probe contacts the EMIES service of the compute element and checks the HealtStatus element in the reply.
check_arcservice -u <url> [-k <key-file> -c <cert-file>] [-t <timeout>]
arcinfo -c <host> shows whether a CE supports EMIES and which URL to use.
EMIES uses SSL client authentication. By default the host certificate is
used. To use a grid proxy, pass it as both key and certificate. Example:
check_arcservice -u https://arcce.example.org:443/arex \
-k /tmp/x509up_1000 -c /tmp/x509up_1000
CE Infosys Validation for the GLUE 2 LDAP Schema
You can test the GLUE 2 LDAP records published by a CE with
check_arcglue2 -H <HOST> [-P <PORT>] \
[--glue2-schema PATH] [--if-dependent-schema STATUS] \
[--warn-if-missing OBJECTCLASS,...,OBJECTCLASS] \
[--critical-if-missing OBJECTCLASS,...,OBJECTCLASS] \
[--hierarchical-foreign-keys FOREIGN-KEY,...,FOREIGN-KEY] \
[--hierarchical-aggregates]
See check_arcglue2 --help for a full list of options.
This probe will do a full query under o=glue on the provided host and port and perform the following checks. The default port is 2135.
As a basic check that the information system contains data, --warn-if-missing and --critical-if-missing may be passed a comma-separated list of LDAP objectclasses for which there should be at least one entry in the information system. By default, a warning is raised if the system has no entries of type GLUE2AdminDomain, GLUE2Service, or GLUE2Endpoint.
The probe will verify each entry using the GLUE 2 LDAP schema. By default, the GLUE 2 schema is expected at /etc/ldap/schema/GLUE20.schema. An alternative path may be specified with the --glue2-schema option. If the schema is not found, a warning is raised and the schema is fetched from cn=subschema. The rationale behind this warning is that the content should be checked independently of what the remote end claims it should be. Another Nagios status can be specified with --if-dependent-schema, including OK to disable the warning.
As GLUE 2 is relational in nature, the probe does further checks on connections which cannot be specified in the LDAP schema. It checks uniqueness of the *ID attributes, and the outgoing and incoming multiplicities of *ForeignKey attributes as specified in the GLUE Specification v2.0 [GLUE2] and the LDAP schema reference implementation [GLUE2L].
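The relational checks can be illustrated with a simplified sketch. It is not the probe's implementation: the function name is hypothetical, the mapping from foreign-key attributes to the ID attributes they refer to is supplied by the caller as an assumption, and the multiplicity check is reduced to verifying that every foreign key points at a published ID.

```python
from collections import Counter


def check_ids_and_foreign_keys(entries, fk_targets):
    """Return a list of problems found in a set of GLUE 2 LDAP entries.

    entries:    list of dicts mapping attribute names to lists of values
    fk_targets: dict mapping each foreign-key attribute name to the ID
                attribute it refers to (an assumption of this sketch)
    """
    problems = []
    # Count how often each (ID attribute, value) pair occurs.
    id_counts = Counter(
        (attr, value)
        for entry in entries
        for attr, values in entry.items()
        if attr.endswith("ID")
        for value in values
    )
    for (attr, value), count in sorted(id_counts.items()):
        if count > 1:
            problems.append("%s=%s occurs %d times" % (attr, value, count))
    # Every foreign key must point at an ID which is actually published.
    for entry in entries:
        for attr, values in sorted(entry.items()):
            target = fk_targets.get(attr)
            if not attr.endswith("ForeignKey") or target is None:
                continue
            for value in values:
                if (target, value) not in id_counts:
                    problems.append(
                        "%s=%s has no matching %s" % (attr, value, target))
    return problems
```

For example, an endpoint whose GLUE2EndpointServiceForeignKey names a service that is not published would be reported, as would two services sharing the same GLUE2ServiceID.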
Further, the probe checks some of the constraints on the directory information tree (DIT) [GLUE2L]. A critical condition is raised if the following conditions are not met.
- All GLUE2Extension objects must appear immediately below the object they extend.
- Objects which are aggregates of a GLUE2Service must appear somewhere below that service.
- Services which link to a GLUE2AdminDomain cannot reside under a different domain.
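The two placement relations used in these constraints, "immediately below" and "somewhere below", can be expressed on distinguished names roughly as follows. This is a sketch with hypothetical helper names; it splits DNs naively on commas and therefore does not handle escaped commas, which a real LDAP library would.

```python
def split_dn(dn):
    """Split a DN into normalized RDN strings (naive: no escaped commas)."""
    return [rdn.strip().lower() for rdn in dn.split(",")]


def is_below(dn, ancestor_dn):
    """True if dn lies somewhere strictly below ancestor_dn in the tree."""
    rdns, ancestor = split_dn(dn), split_dn(ancestor_dn)
    return len(rdns) > len(ancestor) and rdns[-len(ancestor):] == ancestor


def is_immediately_below(dn, parent_dn):
    """True if dn is a direct child of parent_dn."""
    return split_dn(dn)[1:] == split_dn(parent_dn)
```

With these helpers, the first constraint corresponds to is_immediately_below(extension_dn, extended_dn) and the second to is_below(aggregate_dn, service_dn).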
Optionally, you can require the DIT to reflect additional foreign keys, either by passing an explicit list to --hierarchical-foreign-keys, or by passing --hierarchical-aggregates to include all keys which represent aggregation or composition. Note that the latter will fail unless services are structured under their administrative domain, if any.
CE Infosys Validation for the NorduGrid and GLUE 1 Schemas
The ARIS probe is invoked with
check_aris -H <HOST> [-P <PORT>] [--cluster <CLUSTER>...] \
[--cluster-test <testname>...] [--queue-test <testname>...] \
[OTHER-OPTIONS...]
See check_aris --help for the full list of options.
It will query Mds-Vo-name=local, o=grid on <HOST>:<PORT>. The default port is 2135. If one or more clusters are specified with the --cluster option, only those will be checked (nordugrid-cluster-name=<CLUSTER>), and it is considered an error for any of them to be missing. The probe validates attributes of entries against the MAY and MUST specifications of the schema, and attempts some type conversions. For each cluster found, the probe will query and validate its queues.
If no clusters are found, or if no queues are found for a given cluster, it will be reported as a warning. You can change this by passing a Nagios status to the option --if-no-clusters or --if-no-queues, respectively. Valid statuses are ok, warning, critical, and unknown, though only the first three make sense here.
This probe can also do custom checks on the LDAP data, either numeric limits or regular-expression matches. A custom test, defined in the configuration file under a section arcinfosys.aris.<testname>, can be enabled by passing any number of --cluster-test <testname> and --queue-test <testname> options to the probe. The tests are run on entries of the type nordugrid-cluster and nordugrid-queue, respectively.
The ARIS infosystem contains an attribute nordugrid-cluster-contactstring which provides the interface for job submission. You can check that this URL is accessible by passing --check-contact. This will do a list operation and, if the logging level is INFO or lower, will report the number of entries. If the attribute is missing or the URL is inaccessible, the service goes CRITICAL with an appropriate message.
Limit Checks
A limit check takes the form
[arcinfosys.aris.<testname>]
type = limit
value = <expr>
critical.min = <value>
critical.max = <value>
critical.message = <message>
warning.min = <value>
warning.max = <value>
warning.message = <message>
The type and value variables are required, and at least one of the min or one of the max variables should be given for the test to be useful. There are reasonable defaults for the messages, though if your <expr> is complex, you may want to provide a more human-readable version.
The probe will
- Evaluate <expr> using Python’s eval function, in an environment which maps variable names derived from the LDAP attribute names to the corresponding converted values. The variable names are obtained from the attribute names by replacing “-” with “_” and stripping common prefixes including “nordugrid-cluster-”, “nordugrid-queue-”, and “Mds-”.
- If critical.min is given and the result is below this value, or if critical.max is given and the result is above this value, report it as a critical error.
- Similarly for warning.min and warning.max, reported as a warning.
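The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the probe's code: the helper names are hypothetical, only the three prefixes named above are stripped, and each attribute is assumed to be single-valued and already converted to a number.

```python
# Prefixes named in the text; the real probe may strip others as well.
PREFIXES = ("nordugrid-cluster-", "nordugrid-queue-", "Mds-")


def variable_name(attribute):
    """Derive the expression variable name from an LDAP attribute name."""
    for prefix in PREFIXES:
        if attribute.startswith(prefix):
            attribute = attribute[len(prefix):]
            break
    return attribute.replace("-", "_")


def outside(result, minimum, maximum):
    """True if result is below minimum or above maximum, where given."""
    return ((minimum is not None and result < minimum)
            or (maximum is not None and result > maximum))


def limit_check(expr, entry, critical=(None, None), warning=(None, None)):
    """Evaluate expr against one entry and compare with (min, max) limits."""
    env = {variable_name(attr): value for attr, value in entry.items()}
    # The expression comes from the local configuration file, so it is
    # treated as trusted input; builtins are hidden only to keep the
    # namespace limited to the attribute variables.
    result = eval(expr, {"__builtins__": {}}, env)
    if outside(result, *critical):
        return "CRITICAL"
    if outside(result, *warning):
        return "WARNING"
    return "OK"
```

For a queue entry with nordugrid-queue-running = 90 and nordugrid-queue-maxrunning = 100, the expression maxrunning - running evaluates to 10, which with warning.min = 20 and critical.min = 5 would yield a warning.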
Regular Expression Checks
A regular expression check takes the form:
[arcinfosys.aris.<testname>]
type = regex
variable = <varname>
critical.pattern = <python-regex>
critical.message = <message>
warning.pattern = <python-regex>
warning.message = <message>
The type and variable settings are required, and you should specify at least one of critical.pattern and warning.pattern. The variable name is obtained the same way as for the limit checks. The probe will consider all values for the LDAP attribute corresponding to <varname>.
- If critical.pattern is specified and none of the values match it, then a critical condition is reported;
- otherwise, if warning.pattern is specified and none of the values match it, then a warning is reported.
The following example will issue a critical state if a queue is not active:
[arcinfosys.aris.queue-active]
type = regex
variable = status
critical.pattern = ^active$
critical.message = Inactive queue
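The decision logic of a regular expression check, applied to the example above, could look like the following sketch. The function name is hypothetical, and the use of re.search rather than re.match is an assumption; for an anchored pattern such as ^active$ the two behave the same on single-line values.

```python
import re


def regex_check(values, critical_pattern=None, warning_pattern=None):
    """Return a status for all values of one LDAP attribute."""
    def none_match(pattern):
        return not any(re.search(pattern, value) for value in values)

    if critical_pattern is not None and none_match(critical_pattern):
        return "CRITICAL"
    if warning_pattern is not None and none_match(warning_pattern):
        return "WARNING"
    return "OK"


# The queue-active test: critical unless some status value is "active".
print(regex_check(["active"], critical_pattern=r"^active$"))
print(regex_check(["inactive"], critical_pattern=r"^active$"))
```

Note that, with this reading of "none of the values match", a missing attribute (an empty list of values) also triggers the condition.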
Glue Schema Checks
Some CEs publish cluster and queue information in the Glue schema in addition to the NorduGrid schema. You can enable schema checks for these, if present, by passing --enable-glue.
The information in the Glue entries should match information in the ARC entries as described in [ARCIS2011]. You can enable a partial comparison of GlueCE, GlueCluster, and GlueSubCluster records by passing --compare-glue.
Checking Expiration of Host Certificates
A separate probe is provided for checking the host certificate as reported by the information system:
check_archostcert -H <HOST> [-p <PORT>] \
[-c <CRITDAYS>] [-w <WARNDAYS>] [-t <TIMEOUT>]
We suggest running this for each compute element at a low frequency, such as once or a few times a day. A command definition like
define command {
command_name check_archostcert
command_line $USER$/check_archostcert -H $HOSTNAME$ -c 7 -w 31
}
will warn about a certificate one month before it expires and report a critical status one week before. The port number defaults to 2135, but can be changed with -p <port>, and a timeout of <T> seconds is specified as -t <T>. See also check_archostcert --help.
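The effect of the -c and -w thresholds can be sketched as a comparison of the remaining certificate lifetime against two limits. The function below is hypothetical; its default values simply mirror the command definition above rather than the probe's own defaults, and whether the probe treats the boundary as inclusive is not specified here.

```python
from datetime import datetime, timedelta


def cert_status(not_after, now, crit_days=7, warn_days=31):
    """Classify a certificate by the time left until it expires."""
    remaining = not_after - now
    if remaining < timedelta(days=crit_days):
        return "CRITICAL"
    if remaining < timedelta(days=warn_days):
        return "WARNING"
    return "OK"


# A certificate expiring in 20 days falls in the warning window.
now = datetime(2024, 1, 1)
print(cert_status(now + timedelta(days=20), now))
```

An already expired certificate has a negative remaining lifetime and is therefore classified as critical as well.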
The lifetime of the host certificate can also be checked using a generic HTTPS probe against the EMIES service, as long as the probe supports client authentication and lifetime checks.
[GLUE2] “GLUE Specification v2.0”; Sergio Andreozzi (ed.), et al.; http://www.ogf.org/documents/GFD.147.pdf
[GLUE2L] “GLUE v. 2.0 – Reference Implementation of an LDAP Schema”; Sergio Andreozzi (ed.), et al.; https://forge.ogf.org/sf/docman/do/downloadDocument/projects.glue-wg/docman.root.drafts/doc15526
[ARCIS2011] “The NorduGrid-ARC Information System”; Balázs Kónya and Daniel Johansson; NORDUGRID-TECH-4; http://www.nordugrid.org/documents/arc_infosys.pdf