
Server Monitoring with Nagios

"Nagios is an Open Source infrastructure allowing a single server to monitor a bunch of disparate servers and computers remotely. For basic monitoring, like checking network connection, server services, etc, there is no need to install any software on the client PCs. If you need more data, there are special services that can be installed on each PC to collect information and forward it to the main Nagios Server.

It's a pretty flexible architecture: it is possible to create your own monitoring modules in C or Perl, there is a pretty complete web interface available, and Nagios can also warn you by email, pager or mobile (through third-party services) to let you know of problems. All network configurations are possible to monitor as you can set up intermediary servers to listen and collect data from a local network which the main server can collect securely."

http://etc.nkadesign.com/Linux/ServerMonitoring

-- Contributors: MartinCleaver

Discussion

-- MartinCleaver - 04 May 2006

The software continues to be actively developed and has matured since its initial development in early 1999. The main Nagios program is written in C; host, service, and other configuration items are stored in text-based files (there is optional database support for the configuration, though its use is deprecated). Nagios relies heavily on plugins for service checks; most of the core plugins are written in C, though plugins can be written in any language (C, shell, Perl, Python, etc.) as long as they adhere to a published API (see http://nagiosplug.sourceforge.net/). Native plugins exist for a number of services, making it trivial to quickly implement generic and/or specific service checks for well-known and arbitrary services. Existing plugins are actively developed, and many user-contributed plugins exist as well. There are also a number of optional addons for the program, including remote daemons for performing remote service checks, data collection, execution of "local" plugins on remote hosts, and providing remote event handlers (e.g. to restart services).

For general service monitoring, we should use an SNMP agent rather than one of the Nagios-specific agents for remote host service monitoring. SNMP already provides an interface to a broad range of system variables (e.g. disk usage, system load, uptime, and many others), can be extended to expose additional host- or application-specific variables, and provides a mechanism that can be reused to collect data for trending. This also keeps our monitoring infrastructure more consistent and platform-independent; we can run agents on Solaris / AIX / Linux / Windows / OS X and have the same or similar information available from each, using a common underlying mechanism.

Some features of Nagios:

- template-based object and extended-info configuration files; this change makes management of the configuration more flexible, and can make configuration changes easier. Configuration options may also be specified per host or service, for highly granular monitoring/alerting.
- the ability to define host dependencies; this can prevent notifications from being sent out for a particular host if one or more criteria fail, which can reduce the amount of false-positive alerts, as well as help pinpoint certain failures.
- Notification Escalations; this feature allows escalation of notification for selective hosts/services after some period of non-acknowledgement.
- State retention between restarts; this makes program restarts quick, and allows long-term state to be reflected accurately in the reporting tools.
- A web interface for viewing the status of hosts/services, acknowledging host/service problems, scheduling downtime, rescheduling checks, reviewing historical event/trend/availability reports.
- Flap detection; this feature will allow us to curb the excessive alerts generated by those currently monitored services that "flap" regularly.
- and a bunch more...
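To make the flap detection item above more concrete, here is a rough Python sketch of the general idea: look at a window of recent check results, compute how often the state changed, and compare that percentage against a low and a high threshold. This is a simplified, unweighted illustration and not Nagios's actual implementation; the function names are hypothetical, and the exact comparison rules (>= versus >) are an assumption.

```python
from typing import Sequence


def percent_state_change(states: Sequence[int]) -> float:
    """Return the share of adjacent check results that differ, in percent."""
    if len(states) < 2:
        return 0.0
    changes = sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)
    return 100.0 * changes / (len(states) - 1)


def is_flapping(
    states: Sequence[int],
    low_flap_threshold: float,
    high_flap_threshold: float,
    currently_flapping: bool,
) -> bool:
    """Decide the flapping state with hysteresis between two thresholds.

    Assumption: flapping starts once the change percentage reaches the high
    threshold and only stops once it drops below the low threshold.
    """
    pct = percent_state_change(states)
    if currently_flapping:
        return pct >= low_flap_threshold
    return pct >= high_flap_threshold
```

Using two thresholds instead of one keeps a service that hovers near a single cutoff from toggling in and out of the flapping state on every check.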

Software - Syslog-NG (insert syslog-ng blurb here)

Hardware

The monitoring host should have sufficient memory to run a number of simultaneous service checks and serve up the management interface; 1GB of physical memory is recommended. It should be able to accommodate archival logs of program data (relatively small) and have modern, fast disks (10k RPM are preferred); historical report generation can be disk intensive.

The new system should run under Linux; the program and plugins are most actively developed on this platform, and considering that other existing campus Nagios installations are running on Linux, it would be to everyone's advantage if program improvements/developments can be easily shared within the Duke IT community. The system should run Linux@Duke 9, the current production distribution as of this writing; we should plan to move to the next pending release of Linux@Duke, which is based on Fedora Core 2, sometime within the year.

A modem and modem line will be required for modem paging.

The monitoring host should be placed in a prominent location on the network backbone, if possible, so that the program's method of determining which hosts are alive/reachable works reliably.

Configuration

The existing Big Brother / Juvenal configurations cannot be migrated / converted directly to a format supported by Nagios, so regenerating new, consistent configuration definitions for each host is necessary. As much as possible of the existing configuration data will be retained, particularly abnormal threshold settings, in order to generate as little noise as possible during the migration. However, there will be a training period in which some alerts may be generated unnecessarily; we will attempt to keep these to a minimum.

Configuration updates are generally accomplished by hand-editing a text-based configuration. There are some tools for automating configuration generation; having SNMP agents installed on systems should facilitate generating roughly half of the configuration. The configuration files should at a minimum be under some revision control; long-term, configuration changes should be integrated into a yet-to-be-developed change management procedure for SCS.
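As an illustration of automated configuration generation, the following Python sketch renders host definitions from a small inventory. It is not one of the existing tools mentioned above; the function names, the template name generic-host, and the sample address are all hypothetical.

```python
from typing import Dict, Iterable


def render_host(host: Dict[str, str], template: str = "generic-host") -> str:
    """Render one 'define host' block; 'generic-host' is a hypothetical template."""
    lines = ["define host {", f"    {'use':<12} {template}"]
    for key in ("host_name", "alias", "address"):
        lines.append(f"    {key:<12} {host[key]}")
    lines.append("}")
    return "\n".join(lines)


def render_hosts(inventory: Iterable[Dict[str, str]], template: str = "generic-host") -> str:
    """Render every inventory record, separated by blank lines."""
    return "\n\n".join(render_host(h, template) for h in inventory)


# Hypothetical inventory record; 192.0.2.10 is a documentation-only address.
inventory = [{"host_name": "adams", "alias": "Adams", "address": "192.0.2.10"}]
config_text = render_hosts(inventory)
```

Writing config_text to a file that is kept under revision control would fit the recommendation above; the inventory itself could come from any source, such as an SNMP sweep or an exported table.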

Hosts will be checked for availability and network latency (ICMP RTT), while Services will be checked using plugins either natively using synthetic transactions (e.g. check_http), by custom SNMP queries (e.g. check_snmp_disk = check_snmp -o ), or by custom TCP/UDP transactions (e.g. check_ph2ldap = check_tcp -p 105).
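A custom TCP transaction of the kind described above can be sketched in Python as follows. This is only an illustration of the idea behind check_tcp, not its implementation: the function names, default thresholds, and message wording are assumptions. The status codes follow the plugin convention of 0 for OK, 1 for WARNING, and 2 for CRITICAL.

```python
import socket
import time
from typing import Tuple

OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify_latency(elapsed: float, warn: float, crit: float) -> int:
    """Map a connect time in seconds onto a status code."""
    if elapsed >= crit:
        return CRITICAL
    if elapsed >= warn:
        return WARNING
    return OK


def check_tcp(host: str, port: int, warn: float = 1.0, crit: float = 5.0,
              timeout: float = 10.0) -> Tuple[int, str]:
    """Open a TCP connection and report how long the connect took."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.monotonic() - start
    except OSError as exc:  # refused, unreachable, or timed out
        return CRITICAL, f"TCP CRITICAL - cannot connect to {host}:{port} ({exc})"
    code = classify_latency(elapsed, warn, crit)
    return code, f"TCP {LABELS[code]} - {elapsed:.3f}s response time on port {port}"
```

A wrapper in the spirit of check_ph2ldap would call check_tcp(host, 105) and exit with the returned code after printing the message.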

The configuration files are template-based, which provides a way of specifying reusable configuration settings; you define shared configuration directives once in a default template, then assign that template to real configuration entries. Configuration files can be broken down into several groups:

- Hosts: all internal and external production systems, virtual hosts, network devices.
- Hostgroups: collections of Hosts by type, e.g. Solaris, Windows, Webservers, CoreServices.
- Services: monitored services, typically disk/swap utilization, system load, and availability of provided services (e.g. DNS, SMTP, NFS, HTTP) and/or support services (e.g. SSH).
- Contacts: individual user definitions, including email address, pager number, netid.
- Contactgroups: e.g. unix-admins, win-admins, net-admins - each group contains contacts.
- Extended Info: this config allows you to assign icons that are used in the statusmap and statuswrl CGIs, and URLs to additional information about that particular host.
- Other Program and Object configs: including the nagios.cfg, cgi, checkcommands, misccommands, resource, timeperiods, dependencies, and escalations configurations.

Services - The current service definitions are unique to each host; at a minimum an entry for the check_host_alive plugin is included, followed by entries for the appropriate command definition for the services offered. See the section on checkcommands for a list of all available command definitions. Here is a sample service definition for the ssh service on adams:

define service {
    use                  unix-services
    host_name            adams
    service_description  SSH
    check_command        check_ssh
}

...though a service may include the following:

define service {
    host_name                     host_name
    service_description           service_description
    is_volatile                   [0/1]
    check_command                 command_name
    max_check_attempts            #
    normal_check_interval         #
    retry_check_interval          #
    active_checks_enabled         [0/1]
    passive_checks_enabled        [0/1]
    check_period                  timeperiod_name
    parallelize_check             [0/1]
    obsess_over_service           [0/1]
    check_freshness               [0/1]
    freshness_threshold           #
    event_handler                 command_name
    event_handler_enabled         [0/1]
    low_flap_threshold            #
    high_flap_threshold           #
    flap_detection_enabled        [0/1]
    process_perf_data             [0/1]
    retain_status_information     [0/1]
    retain_nonstatus_information  [0/1]
    notification_interval         #
    notification_period           timeperiod_name
    notification_options          [w,u,c,r]
    notifications_enabled         [0/1]
    contact_groups                contact_groups
    stalking_options              [o,w,u,c]
}

Services are scheduled for checking at predefined intervals using the normal_check_interval setting, though the program attempts to space and interleave service checks to improve overall efficiency; note that several variables of the program can be tuned to impact this efficiency.

Contacts - There is an entry for every user that needs to receive alerts and/or make changes via the web interface. Contact definitions take the form:

define contact {
    contact_name                   contact_name
    alias                          alias
    host_notification_period       timeperiod_name
    service_notification_period    timeperiod_name
    host_notification_options      [d,u,r,n]
    service_notification_options   [w,u,c,r,n]
    host_notification_commands     command_name
    service_notification_commands  command_name
    email                          email_address
    pager                          pager_number or pager_email_gateway
}

As with the other configuration files, each definition may specify each of these values explicitly or inherit them from a default configuration entry, though the contact_name, alias, host_notification_period, service_notification_period, host_notification_options, and service_notification_options are all mandatory and must be specified per entry.
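The interplay between inherited and mandatory values described above can be sketched in Python. This is an illustrative model only, not how Nagios parses its files: definitions are represented as plain dictionaries, the helper names are hypothetical, and the name directive used to label a template is an assumption (the use directive appears in the sample service definition).

```python
from typing import Dict, List, Optional, Set

# Fields the text above lists as mandatory for every contact entry.
MANDATORY_CONTACT_FIELDS = (
    "contact_name",
    "alias",
    "host_notification_period",
    "service_notification_period",
    "host_notification_options",
    "service_notification_options",
)
TEMPLATE_ONLY_KEYS = ("name", "use")  # bookkeeping keys, not real settings


def resolve(entry: Dict[str, str], templates: Dict[str, Dict[str, str]],
            _seen: Optional[Set[str]] = None) -> Dict[str, str]:
    """Merge values inherited through 'use'; explicit values win."""
    seen = set() if _seen is None else _seen
    merged: Dict[str, str] = {}
    parent = entry.get("use")
    if parent is not None:
        if parent in seen:
            raise ValueError(f"template loop at '{parent}'")
        seen.add(parent)
        merged.update(resolve(templates[parent], templates, seen))
    merged.update({k: v for k, v in entry.items() if k not in TEMPLATE_ONLY_KEYS})
    return merged


def missing_mandatory(entry: Dict[str, str]) -> List[str]:
    """List mandatory fields absent from the entry itself (before inheritance)."""
    return [field for field in MANDATORY_CONTACT_FIELDS if field not in entry]
```

Checking missing_mandatory against the raw entry rather than the resolved one mirrors the statement that these six values must be specified per entry and cannot simply be inherited.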

Contactgroups - Contactgroups contain one or more contacts (!); contactgroups are used to define who receives alert/recovery notifications. Definitions take the form:

define contactgroup {
    contactgroup_name  contactgroup_name
    alias              alias
    members            members
}

All values in this definition are mandatory.

Other Program and Object configs: With the exception of the host dependency definitions, these configuration files generally change infrequently:

- nagios.cfg: main program settings are specified here, including the names and paths of all configuration and support files.
- cgi.cfg: cgi configuration, including access permissions and optional data collection settings
- checkcommands.cfg: definitions for host and service checks, dependent on currently available plugins.
- misccommands.cfg: command definitions for contact notifications, performance data collection
- resource.cfg: configuration for external database usage (optional)
- timeperiods.cfg: definitions for periods applied to contacts for alert/recovery notification schedules.
- dependencies.cfg: host dependency definitions
- escalations.cfg: notification escalation definitions

Plugins

If configuration files are the heart of Nagios, then plugins are its brain. Each checkcommand definition references an available plugin, with optional parameters. The parameters can be plugin-specific, but usually include warning/critical thresholds, and can optionally include an expected result to compare against the output of the actual transaction. This allows the same plugin to back several checkcommand definitions simply by calling it with different parameters, e.g. the check_smtp plugin can also be used to check the version of the SMTP daemon on the remote system by comparing the text returned from the check with an expected value:

command[check_smtp_ver]=/usr/local/nagios/libexec/check_smtp $HOSTADDRESS$ -e $ARG1$

Since the plugin specification is well-known, it is also possible to write simple plugins in shell, perl, or any language available to the system. The current Nagios distribution includes 68 unique plugins, some of which are presented in both C and Perl forms, which include both core-developed and user-contributed variants.
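As a rough idea of what such a simple plugin involves, here is a Python sketch that compares a response against an expected string, in the spirit of the check_smtp -e example above. The function names and message wording are hypothetical, and returning WARNING on a mismatch is this sketch's choice rather than documented plugin behaviour. The part that follows the plugin convention is the single status line on standard output plus an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN).

```python
from typing import Tuple

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_expected(response: str, expected: str) -> Tuple[int, str]:
    """Compare the text returned by a service with an expected substring."""
    if not response:
        return UNKNOWN, "no response text to compare"
    if expected in response:
        return OK, f"response contains '{expected}'"
    # Assumption: a mismatch is treated as WARNING in this sketch.
    return WARNING, f"expected '{expected}', got '{response.strip()}'"


def report(service: str, code: int, detail: str) -> int:
    """Print the one-line status and return the code to exit with."""
    print(f"{service} {LABELS[code]} - {detail}")
    return code


# A real plugin script would obtain the banner from the network and finish with
# sys.exit(report("SMTP", *check_expected(banner, expected))).
```

Keeping the comparison separate from the reporting makes the logic easy to exercise without a live SMTP server.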

-- MartinCleaver - 09 May 2006

Am thinking about using Nagios to monitor TWiki in addition to the underlying o/s.

http://nagiosplug.sourceforge.net/developer-guidelines.html has guidelines. I would first seek to expose the TWiki logs so that particular events could be escalated appropriately.

Has anyone built a plugin for Nagios & TWiki?

If there is one that could be made available I will share any modifications I make to it.

-- MartinCleaver - 09 May 2006

We would really benefit from a NagiosContrib. I would like to contribute and will try, but I'm not a decent programmer. We have all servers and more registered in forms and also use RackPlannerPlugin (great). It would only need a plugin that could export a table from a SEARCH and write it out with the additional "define host {" or service, etcetera. I would also say that we installed Nagios 3, and even though we needed to plan and configure, it was much, much easier to set up and configure by config files than I expected. It immediately gave rewards.....

-- LarsEik - 18 Sep 2007

I treat TWiki as a web service and use webinject to login and execute a canned walk through it and verify its timely operation. My patches to webinject http://www.cs.umb.edu/~rouilj/webinject/webinject-1.41.patch may also be useful.

-- JohnRouillard - 23 Sep 2007

Topic revision: r4 - 2007-09-23 - JohnRouillard
 
Copyright © 1999-2017 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.