SevOne logo
You must be logged into the NMS to search.

Table of Contents (Start)

Self-monitoring Quick Start Guide - SevOne NMS 5.7

SevOne Documentation

All SevOne user documentation is available online from the SevOne Support customer portal.

Copyright © 2005-2020 SevOne Inc. All rights reserved worldwide.

All right, title, and interest in and to the software and documentation are and shall remain the exclusive property of SevOne and its respective licensors. No part of this document may be reproduced by any means nor modified, decompiled, disassembled, published or distributed, in whole or in part, or translated to any electronic medium or other means without the written consent of SevOne.

In no event shall SevOne, its suppliers, nor its licensors be liable for any damages, whether arising in tort, contract, or any other legal theory even if SevOne has been advised of the possibility of such damages, and SevOne disclaims all warranties, conditions, or other terms, express or implied, statutory or otherwise, on software and documentation furnished hereunder including without limitation the warranties of design, merchantability, or fitness for a particular purpose, and noninfringement.

All SevOne marks identified or used on the SevOne website, as updated by SevOne from time to time, may be, or are, registered with the U.S. Patent and Trademark Office and may be registered or pending registration in other countries. All other trademarks or registered trademarks contained and/or mentioned herein are used for identification purposes only and may be trademarks or registered trademarks of their respective companies.

Introduction

At SevOne, we're crazy about monitoring. If it's connected to a network, SevOne NMS will monitor it. Whatever it is–router, server, switch, telephone system–we say monitor that bad boy! We think monitoring is so important that we'd like to talk about monitoring the things that do the monitoring.

Self-monitoring involves monitoring SevOne appliances through SevOne NMS, just as you would any of your other devices. Monitoring your appliances helps you detect potential problems and address them immediately. It's also really useful for existing problems, especially when you don't know the source. Sometimes it's not immediately clear what kind of problem you're dealing with. Maybe it's a hardware issue. Maybe it's software. Or maybe it's at the network level. With self-monitoring, you can pinpoint the cause and resolve problems quickly, preventing downtime.

Self-monitoring is a series of SevOne created API scripts that allows monitoring of core SevOne functions such as:

Function

Description

MySQLMon

Monitors MySQL database performance. By default, both Config and Data databases are configured for this.

SevOneMon

Monitors SevOne internal data such as utilization, flow load, etc.

KRONMon

Monitors SevOne polling daemon performance.

RAIDMon

Monitors hard drive RAID health of physical appliances.

The metrics from these functions can be used to create reports and alerts when the state of the self-monitoring indicators changes and may indicate a problem with one of the SevOne peers.

In the next sections, we'll cover the following:

  • Things to monitor

  • Best practices

  • Setting up appliances for self-monitoring

  • Self-monitoring activities

  • Recommended policies

Prerequisites

As part of the process, you'll need to SSH into your SevOne appliance as the root user. Make sure to have the following ready:

  • Root access to the SevOne appliance

  • IP address of the SevOne appliance

  • An SSH client, such as PuTTY

  • Ensure that SevOneStats login is enabled prior to installing self-monitoring

  • The self-monitoring install script. Please contact SevOne Support to make sure that you have the most current version of the install script, install.sh. The install script, which ships with your appliance, is located in /usr/local/scripts/utilities/plugins/selfmon

What to Monitor

In this section, we'll talk about which components to pay attention to. We'll also be looking at ideal ranges, signs of trouble, and some of the available indicators.

CPU

When it comes to the CPU, we want to focus on the total aggregate CPU usage. Looking at the sum of the CPU cores rather than at individual cores provides an overview of the CPU activity. If there's a problem with the total aggregate usage, it may be a good idea to start looking at individual cores to determine whether it's just specific cores or all cores that are acting up. This can help us pinpoint whether the problem is related to a particular process or to the entire system.

Ideally, CPU usage will be in the following ranges:

  • Idle time >= 50% - most of the time, the CPU shouldn't be doing a whole lot.

  • Waiting time <= 10% - if the waiting time is consistently 10% or higher, the system is doing a little too much waiting, and we need to find out why.

Disk

When looking at disks, our two main areas of focus are free disk space and input/output (I/O). The typical SevOne appliance has three main disk components:

  • sda

    • / - contains the entire operating system, including libraries, executables, etc.

    • /index - is used by the database for indexing activities. This lets us distribute our read/write requests across another disk to improve performance.

  • sdb

    • /data - contains large amounts of data (covering long time spans), flow data, etc.

  • fioa

    • /ioDrive - optional Fusion-io SSD, included with higher-capacity appliances.

The following are the ideal amounts of free disk space:

  • /, /ioDrive >= 20% free

  • /data>= 10% free - if this is too full, the database can freeze up.

SNMP reports I/O statistics for entire disks as well as the individual partitions of each disk. For self-monitoring purposes, we're interested in the entire disk. The following are three I/O states and information about what they mean:

  • I/O is low - this means there's not a whole lot going on at the moment. Generally speaking, there's nothing to worry about here.

  • I/O is up and down - this means that there's some reading and writing going on. This isn't cause for concern–just work as usual.

  • I/O is high - if it's consistently high, this means that the disk is constantly reading and writing. This is a red flag and requires a closer look.

    One possible cause of high I/O is hot standby synchronization, which involves copying lots of data from one appliance to another. However, it's always good to check into the exact cause of high I/O.

RAID

It's important to keep an eye on RAID behavior. By monitoring RAID, you can get information about potential disk failure before it happens. You can then take appropriate action to ensure that the performance is minimally affected or not affected at all.

You'll want to keep an eye on the following things:

  • RAID array state. This applies to the array. The RAID array state should be 4. Anything other than 4 is cause for concern.

  • RAID disk predictive failure count. This applies to the individual disks that make up the array. Predictive failures result from trends of certain errors and act as warning signs, indicating that you may need to replace the disk.

  • RAID disk firmware state. This also applies to individual disks. The firmware state should be 7. If it's not 7 on any of the disks, then there may be a problem with that disk.

Memory

There are a couple of things to keep in mind about memory as it applies to SevOne appliances. First, you'll notice pretty high memory usage. Depending on how much RAM an appliance has, it might be close to 100%. Memory usage on appliances with a greater amount of RAM may not be quite so close to the 100% mark, but it'll still be high. This is because the Linux kernel caches all kinds of stuff in RAM. For example, if you write to a file, as long as there's free space in RAM, the kernel will take the whole file and dump it in RAM. So, when you read and write to the file, you're reading and writing to memory, which makes the process nice and fast! The kernel then caches the file to disk.

Next, there's the issue of swap. From some perspectives, swap is great. However, swap isn't such a good thing from our current perspective. SevOne appliances ship with a ton of memory–so much, in fact, that there shouldn't be any need for swap. Swapping should almost never occur. It should only occur when absolutely necessary, and that shouldn't be often. If there is anything in swap, it shouldn't be over 1 GB. Long story short, if we're using swap, it means that we've used up a whole lot of memory. Now we need to find out why.

Possible causes of swap include:

  • Memory leak

  • A poorly designed script

Used Memory

You can use SNMP stats to look at the amount of used memory, among other things. One of the indicators you'll use is Used Memory. But you'll also want to include the Cached indicator and the Buffers indicator. This will give you a more accurate picture of the memory being used, because some of that memory may be cached. Compare the two screenshots below. The first one shows only the results for used memory. In the second one, you can see that same amount of used memory, but you'll also see how much memory is cached.

images/download/attachments/33033174/p55x_usedMemory.png

images/download/attachments/33033174/p55x_usedMemoryAndBuffersAndCached.png

Processes

Make sure to monitor all SevOne processes–everything with SevOne as part of its name. Below is information on process monitoring using the Process Poller and SNMP.

Process Poller

The Process Poller provides stats for processes related to the following:

  • Apache

  • Bash

  • CRON scheduler

  • MySQL

  • PHP

  • xStats

  • etc.

Refer to the Process Poller monitoring information in SevOne NMS for a list of processes that are identified and monitored for the following information:

  • Availability

  • CPU Time

  • Instances of the process

  • System memory used - this refers to actual memory and doesn't include virtual.

SNMP

All major SevOne NMS processes connect to the database and issue queries. Every SevOne daemon exports SNMP information about its database use. You can look at the following daemons with SNMP:

  • SevOne-clusterd

  • SevOne–datad

  • SevOne-dispatchd

  • SevOne-polld

  • SevOne–requestd - provides percentage of requestd availability for each peer. An alert is raised when the availability is < 100%.

  • SevOne-trapd

  • etc.

The exact list of processes depends on the which appliance type you're monitoring, for example, PAS, HSA, or DNC.

SNMP indicators provide several database statistics for processes. The following are a few that you'll want to keep an eye on:

  • Query counts - any sudden changes in query counts may indicate a problem.

  • Query errors - query errors may be the result of a schema or code issue.

  • Number of connections

  • Number of reconnections

  • Number of traps received, processed, etc. (applies to SevOne-trapd) - you'll want to make sure that the number of traps received isn't much different from the number processed. The numbers should be close to each other (within 100, for example), if not the same. If there's a big difference in the number of traps received and processed, then you may be receiving more traps than you can handle.

Hot Standby Appliance

A hot standby appliance (HSA) consists of an active peer and a passive peer. The active peer pulls configuration data from the cluster master (in the case of a cluster) and polls data from objects and indicators on the devices that are assigned to it. The passive peer maintains a redundant copy of all configuration data and poll data. The following are important things to keep an eye on if you have an HSA:

  • Make sure MySQL replication is working. If Availability drops to 0%, then there may be an issue with the MySQL daemon (mysqld).

  • Make sure that the poller is running on the active appliance but not on the passive one.

  • Make sure that there's a lot of database activity on the active appliance but not on the passive one.

For HSA monitoring, we recommend using the Deferred Data plugin. You'll want to select the SevOne Statistics object (for example, when using Instant Graphs) and the following indicators:

  • Is master - indicates whether a peer is the master.

  • Is primary - indicates whether a peer is the primary.

If the passive peer of an HSA is reporting an Is master value of 1, this indicates that a switchover has occurred. If both the active peer and the passive peer are reporting an Is master value of 1, then a split-brain condition exists.

Interface

Because SevOne NMS monitors network devices, your SevOne appliance requires some bandwidth but not much. The normal range for a single appliance would fall somewhere between 1 and 3%. For an HSA, you'll see anywhere from two to four times as much bandwidth used. If the amount of bandwidth used is consistently high, this could indicate a problem, and you'll need to look into what's causing the high bandwidth.

With an HSA, you'll see peaks in traffic every two hours. This happens because data is being transferred from short-term storage to long-term storage, which results in additional MySQL replication.

The graph below displays bandwidth information using the indicators HC In Octets and HC Out Octets.

images/download/attachments/17469147/image2015-5-26-12_27_51.png

Best Practices

Before continuing, please read through the following tips to optimize the self-monitoring process:

  • Check the basic Linux stats before looking for specific problems. In short, check the easy stuff before the hard stuff. This can save you a lot of time.

  • Make sure every single SevOne appliance is being monitored in some way.

  • If you have a cluster, we recommend against monitoring all appliances from the master. The reason for this is that if you monitor all appliances from the master, and the master happens to go down, then you lose all self-monitoring. It's best to balance the load across the cluster.

  • Don't monitor a SevOne appliance from itself. That said, if you have only one SevOne appliance, this may be your only option.

  • When selecting a peer appliance to monitor a specific appliance from, go with the peer that is geographically closest to the appliance you're going to monitor. This helps reduce latency.

  • In a hot standby setup, the active appliance and passive appliance together count as a single peer. So, ideally, you'll have an entirely different peer (as understood in this sense) looking at the active and passive appliances making up the peer being monitored.

  • Make sure all your SevOne appliances are together in their own device group. We'll talk about this more in just a minute.

  • In a cluster environment, make sure all appliances are in time sync. To do this, SSH into any appliance in the cluster and enter the following command:

    SevOne-peer-do date

Getting Started

In this section, we'll go over the things you need to do to get your appliances ready for self-monitoring. We'll be looking at the following:

  • Changing the SevOneStats password

  • The SevOne device group

  • Adding an appliance to the Device Manager

  • Enabling self-monitoring on an appliance

  • Creating object rules

The SevOneStats User Account

In just a bit, we'll look at the steps for enabling self-monitoring on your appliance. In order to do this, you'll need to enter the password for the SevOneStats user account. There are just a couple of things to take care of before then. First, you'll need to enable the SevOneStats user account, as it is disabled in SevOne NMS by default. After that, you'll need to change the password for SevOneStats. This is important for two reasons:

  • If you're still using the default password, you should change the password for security reasons anyway.

  • If you change the SevOneStats password once self-monitoring is already in progress, it can interrupt the process. To be on the safe side, it's best to change it now rather than later.

It's important that you don't change the SevOneStats password once you have enabled self-monitoring. Changing the password after enabling self-monitoring can cause the self-monitoring process to stop.

Perform the following steps to enable the SevOneStats user account and change the default password for it.

  1. From the navigation bar, click Administration and select Access Configuration, then User Manager.
    images/download/attachments/33033174/image2016-8-30-14_34_5.png

  2. In the Search text field, type SevOneStats.

  3. Select the check box for SevOneStats and click images/download/attachments/33033174/icon_wrench_blueBackground.png to display the Edit User pop-up.
    images/download/attachments/33033174/image2016-8-31-15_28_22.png

  4. In the Email field, enter an email address. An email address is required here in order to save the changes you will be making in the next steps.

  5. Under Credentials, locate the Password text field and enter a new password.

  6. In the Confirm text field, reenter the password.

  7. At the bottom of the pop-up, select the User Enabled check box to enable the SevOneStats user account.

  8. If the Force password change on next login check box is selected, clear it. This is important because it will ensure that a password change isn't required on the next login.

  9. To prevent the password from expiring, select the check box for The password for this user will never expire.

  10. Click Save.

    Verify credentials / Check permissions

    Execute the following commands to verify the credentials.

    $ /usr/local/scripts/utilities/plugins/selfmon/testApiUser.php -u SevOneStats -p <password>

    If the credentials are correct, it will return 1.

    The Self-monitoring install process verifies that SevOneStats ID has the following permissions:

    - Can view reports
    - Can view alerts
    - Can create devices

    If SevOneStats has more or less permissions set, the installer will stop and warn about improper permission settings. To verify/modify the permissions, f rom the navigation bar, click Administration and select Access Configuration , then User Manager . Search for SevOneStats user and set the permissions.


The SevOne Device Group

In the Best Practices section, we recommended putting all of your appliances in their own device group. There's an existing device group called SevOne, and we strongly suggest adding any SevOne appliances that you want to monitor to this group.

To look at the SevOne device group, perform the following steps:

  1. From the navigation bar, click Devices and select Grouping, then Device Groups.

    images/download/attachments/33033174/image2016-1-8-14_29_40.png
  2. On the left, under All Device Groups, click images/download/attachments/33033174/image2015-12-16-14_32_43.png next to Manufacturer to expand the section.

    images/download/attachments/18554330/image2015-11-16-12_53_48.png
  3. Select the SevOne device group.

On the upper right pane, you'll notice that sysDescr is set to SEVONE. For more information on device groups, refer to our documentation on the topic.

There's no need to do anything at this point. We'll be talking about the SevOne device group again in the next section.

Add SevOne Appliance to Device Manager

In order to monitor your SevOne appliance, you'll need to add it to the Device Manager just as you would any other device that you want to monitor. The following steps will walk you through the process.

  1. First, select the appliance that you would like to monitor your appliance from. You'll be working from that appliance. For example, let's say you want to set up Appliance A to be monitored, and you've decided that Appliance B will do the monitoring. That means you'll need to work from Appliance B.

  2. Now that you've selected the appliance that will do the monitoring, log into SevOne NMS on that appliance.

  3. From the navigation bar, click Devices and select Device Manager.
    images/download/attachments/33033174/deviceMgr.png

  4. On the right, under Devices, click Add Device.
    images/download/attachments/33033174/image2016-3-30-13_41_24.png

  5. At the top of the New Device page, in the Name field, enter the name of the appliance that you want to monitor.

  6. In the Alternate Name field, enter an alternate name for the appliance. Users can search for the appliance by this name.

  7. In the Description field, enter a description of the appliance. You can use this to provide additional information about the appliance, such as location, etc.

  8. In the IP Address field, enter the IP address of the appliance.

  9. The Allow Deletion check box appears to the admin user only and is selected by default. When selected, it enables users to delete the device. If you would like to prevent users from deleting the appliance as a device, clear the check box.

  10. Below that, click the Device Groups drop-down. Click + next to Manufacturer and select the SevOne check box.

  11. Configure other settings as needed. For more information on the settings, please see our documentation on adding/editing devices.

  12. Click Save As New.

  13. You'll see a message at the top of the page letting you know that the device is being queued for discovery.

Enable Self-monitoring on Appliance

In this section, you'll be working with the appliance that you want to monitor. Perform the following steps to enable self-monitoring on that appliance.

It's important that you don't change the SevOneStats password once you have enabled self-monitoring. Changing the password after enabling self-monitoring can cause the self-monitoring process to stop.

  1. Open your SSH client and log into the appliance that you want to enable self-monitoring on.

  2. Run the install.sh script. This can be done by either following the prompts or executing the command in a single line.

    Prior to installing self-monitoring, ensure that appliance is in its final state, peered into the correct cluster, and has IP assigned to it.

    1. Run the install.sh script and follow the prompts.

      1. $ /usr/local/scripts/utilities/plugins/selfmon/install.sh

        images/download/attachments/33033174/install_part1.png

      2. You'll see a couple of warnings followed by the text Press Ctrl-C to abort or any key to continue... Hit Enter to proceed. images/download/attachments/33033174/install_part2.png

      3. You'll see one more message, Press Ctrl-C to abort or any key if you are sure... Hit Enter again.

        images/download/attachments/33033174/install_part3.png

      4. Next, enter the password for the API user SevOneStats.
        images/download/attachments/33033174/install_part4.png

    2. Run the install.sh script by using a single line.

      $ (echo -e '\n\n<password>') | /usr/local/scripts/utilities/plugins/selfmon/install.sh; history -d $((HISTCMD-1))

      Now you should see some information about your appliance–such as number of peers, whether the appliance is part of an HSA pair, and the IP address used for it. Soon after that, the installation process will start. You'll see a long list of information, especially about objects and indicator types. Once that's done, you'll receive confirmation that the installation is complete.


  3. Validate self-monitoring has been installed by checking the root user's crontab for the following lines.

    */5 * * * * php /usr/local/scripts/utilities/plugins/selfmon/RAIDMon/poll.megacli.objects.php -i "2" --api '127.0.0.1' >> /var/SevOne/RAIDMon.log 2>&1
    */5 * * * * /usr/local/scripts/utilities/plugins/selfmon/SevOneMon/sevone.deferreddata.poll.sevone --api 127.0.0.1 --device-id "2" --object 'SevOne Statistics' >> /var/SevOne/SevOneMon.log 2>&1
    */5 * * * * /usr/local/scripts/utilities/plugins/selfmon/MySQLMon/sevone.deferreddata.poll.mysql --api 127.0.0.1 --device-id "2" --database 127.0.0.1:3307 --object 'MySQL Config Database' --database-profile "config" >> /var/SevOne/MySQLMon.log 2>&1
    */5 * * * * /usr/local/scripts/utilities/plugins/selfmon/MySQLMon/sevone.deferreddata.poll.mysql --api 127.0.0.1 --device-id "2" --database 127.0.0.1:3306 --object 'MySQL Data Database' --database-profile "data" >> /var/SevOne/MySQLMon.log 2>&1
    */5 * * * * php /usr/local/scripts/utilities/plugins/selfmon/PolldMon/deferreddata.polld.php --api 127.0.0.1 --device-id "2" --polld-object 'SevOne-polld performance' --highpolld-object 'SevOne-highpolld performance' >> /var/SevOne/PolldMon.log 2>&1
    */5 * * * * php /usr/local/scripts/utilities/plugins/selfmon/RedisMon/sevone.deferreddata.poll.redis --api 127.0.0.1 --device-id "2" --object 'Redis Instance' >> /var/SevOne/RedisMon.log 2>&1

Create Object Rules

By creating object rules, you can make sure you're monitoring the objects you care about on your SevOne appliance. You can also use object rules to exclude the objects that you don't want to monitor. Perform the following steps to create object rules.

For this part, go to the appliance that will be doing the monitoring and log into SevOne NMS.

  1. From the navigation bar, click Administration and select Monitoring Configuration, then Object Rules.

    images/download/attachments/18554330/image2015-11-16-13_18_45.png

Click Add Rule to display the Add Rule pop-up.
images/download/attachments/33033174/image2016-1-8-15_8_46.png

Click the Device Group drop-down and select a device group. Select the device group that the appliance you wish to monitor belongs to (for example, the SevOne device group).

Click the Plugin drop-down and select a plugin.

Click the Object Type drop-down and select an object type.

Click the Subtype drop-down and select an object subtype.

The next drop-down is set to Match. This means that the object rule will be applied if the object name matches the expression that you specify (you'll specify this in the next step). If you want the object rule to apply when the object name doesn't match the expression, then click the drop-down and select Do not match.

You'll need to set at least one match expression. You can set an expression here, for the object name, or you can set one below, for the object description. You can also set expressions for both the name and the description if you like.

In the text field directly below, enter a Perl regular expression that you want the object name to match or not match. If you don't want to check the object name against an expression, just leave this blank.

The next Match/Do not Match drop-down applies to the object description, rather than the object name. Click the drop-down and select Match or Do not match if you plan on specifying an expression in the next step.

In the text field directly below, enter a Perl regular expression that you want the object description to match or not match. If you don't want to check the object description against an expression, just leave this blank.

Select the Case Sensitive check box to require expressions to match the case used in the object name/description. If you don't require the case to match, leave the check box clear.

Click the Status drop-down and select one of the following options.

    • Include - to discover objects that meet the rule criteria and to poll metrics from the object indicators.

    • Exclude - to discover objects that meet the rule criteria but not poll metrics from the object indicators.

    • Block - to prevent discovery of objects that meet the rule criteria and not poll object indicators.

  1. In the Notes field, enter any comments that you would like to include about the object rule. This is optional.

  2. Click Save.

Self-monitoring in Action

In this section, we'll walk through the following self-monitoring topics. We'll be incorporating a few specific examples in most of the subsections. Feel free to try these on your own as you read through the steps.

For this portion, make sure that you're working on the appliance that will be doing the monitoring.

Policies and Thresholds

Policies

By creating policies, you can receive alerts when the conditions you specify are triggered. Policies let you define thresholds for device groups or object groups. In this section, we're going to create a policy that will alert us if the average available swap memory is less than or equal to 20% for 15 minutes. This policy comes from the SevOne Recommended Policies section (see policy S1_SELFMON_memory_SwapRed).

Perform the following steps to create a policy.

General Settings tab

On the General Settings tab, you can define basic policy settings.

  1. From the navigation bar, click Events and select Configuration , then Policy Browser .
    images/download/attachments/33033174/policyBrowser.png

  2. Click Create Policy to display the Policy Editor.

    images/download/attachments/33033174/createandeditpolicies.png
  3. The Enable check box is selected by default. Go ahead and leave it that way. This indicates that the policy is active.

    Disabled policies appear in light text on the Policy Browser.

  4. Leave the Technology Type drop-down set to the default, Metric. This indicates that the policy is triggered by non-flow data.

    If you're creating a policy to be triggered by flow data, select Flow here. When you select Flow, additional, flow-related fields will appear (to the right). For more information about these fields, please see SevOne's documentation on creating and editing policies.

  5. In the Name field, enter the name of your policy. Let's call this one S1_SELFMON_memory_SwapRed.

  6. Leave the Device Group drop-down set to the default, Device Group.

  7. To the right of Device Group, use the drop-down to specify a device group to apply the policy to. Under All Device Groups, click + next to Manufacturer and select SevOne.

  8. Group Relationship field allows you to associate multiple device or object groups to a policy. Click the drop-down to choose one of the following options.

    • Member of Any - if the device or object is in any of the selected group(s) then the device or object will be used in the policy. It includes devices or objects as an OR operator. For example, devices or objects that are either in Group 1 OR in Group 2 OR in Group 3.

    • Member of All - if the device or object belongs to all the groups then the device or group will be used in the policy. It includes devices or objects as an AND operator. For example, devices or objects that are in Group 1 AND in Group 2 AND in Group 3.

  9. Perform the following actions to specify the object type.

    1. Click the Object Type drop-down.

    2. Scroll down to SNMP Poller and click + to expand the section.

    3. Scroll down to Memory and click + to expand the section.

    4. Scroll down until you see Memory (Linux (Net-SNMP)). Click on this to select it as your object type.

      You won't be able to edit this field once you've saved the policy. This also applies to the next field, Subtype.

  10. The Subtype drop-down is set to All Subtypes. In this example, there's no need to select an object subtype.

    If you select the check box next to Show Common Subtypes, only the common object subtypes will appear. You specify object subtypes as common using the Object Subtype Manager.

  11. Click the Severity drop-down and select Alert(action must be taken immediately) for our example. This is the severity level that displays on the Alerts page when the policy triggers an alert.

    For this policy (S1_SELFMON_memory_SwapRed), we've specified an alert to be triggered if the average available swap memory is less than or equal to 20% . We've designated our severity level as Alert(action must be taken immediately), which indicates urgency. In the Recommended Policies section, there's a similar policy (S1_SELFMON_memory_SwapYellow) that triggers an alert if available swap memory is less than or equal to 50%. For that policy, we recommend specifying the severity level as Warning (warning condition), since the situation is less urgent.

  12. Clicking on edit next to Schedule displays a pop-up where you can specify days/dates and times for the policy to run. When a policy is enabled, the alert engine retests it every three minutes. Use the Schedule settings if there are specific times that you do or don't want the policy to be tested. For our example, we're not going to create a schedule. This means that our policy will be retested every three minutes without interruption.

    For more information about configuring Schedule settings, please see SevOne's documentation on creating and editing policies.

  13. Next to Email, click edit to display the email configuration pop-up. Perform the following actions to specify who should receive email when the policy triggers an alert.

    images/download/attachments/33033174/image2015-12-16-14_41_10.png

    1. Provide input for one or more of the following sections:

    2. Addresses - manually enter one or more email addresses. Once you've entered an email address, click images/download/attachments/33033174/image2015-12-16-14_39_8.png to add it. Email addresses that you've added will appear in the column on the right.

    3. Users - select one or more users from the Users list and click images/download/attachments/33033174/image2015-12-16-14_39_16.png to add your selection(s).

    4. Roles - click the Roles drop-down and select the check box next to one or more user roles to add as recipients.

    5. Under Mail when the threshold is triggered, select one of the following options:

    6. Just once - to have the email sent only the first time that the policy triggers an alert. This means that no further email will be sent out for the alert. Of course, once the alert is cleared, if the policy triggers an alert again at a future time, an email will be sent out.

    7. One time every - to specify how frequently an email should be sent when the policy triggers an alert. Enter a number in the text field. Then select the drop down and select either minutes, hours, or days.

    8. Click Close.

  14. Next to Trap Destinations, click edit to display a pop-up that lets you specify where to send traps from the policy. For our example, let's go with the default settings. Now click Close.

    For more information about trap destinations, please see SevOne's documentation on the topic.

    images/download/attachments/33033174/image2015-12-16-14_43_26.png

  15. The Append Condition Message check box is selected by default. Go ahead and leave it this way. On the next two tabs, we'll be setting up trigger conditions and clear conditions. On the Trigger Conditions tab, we'll be able to create a trigger message. We'll also have the option of creating specific messages for each condition of the trigger. By selecting this check box, all of the condition-specific messages we create will be appended to the trigger message. The same concept applies to the Clear Conditions tab.

  16. In the Description field, enter a description of the policy. For our example, we'll say Triggers if available swap memory is low .
    images/download/attachments/33033174/selfMonPolicyEditorGeneralSettings.png

  17. Continue with the steps below to configure a trigger condition for this policy.

Trigger Conditions tab

This is where you can define the conditions to trigger the policy. You'll also be able to create a trigger message and condition messages.

If you define a trigger condition, and then define a clear condition (see Clear Conditions Tab) that contradicts it, the trigger condition will take precedence. For example, say you have a trigger condition to trigger an alert when packet loss is greater than 60% for fifteen minutes. You might define a clear condition that clears the alert once packet loss is less than 60% over fifteen minutes, and this would clear the trigger condition. However, if you instead define the clear condition to clear the alert when packet loss is greater than 65% for fifteen minutes, it won't clear the alert that was triggered. This is because packet loss is still over the number you specified (60%) as the limit in your trigger condition.

  1. Select the Trigger Conditions tab. In the upper section of the tab, you'll see the object type and device group that you selected on the General Settings tab.

    images/download/attachments/33033174/selfMonPolicyEditorTriggerConditionsBlank.png
  2. In the Trigger Message field, enter the message that you want to display for this policy on the Alerts page. For this one, let's enter Available swap memory on $deviceName is getting low. By including the $deviceName variable, we'll be able to see the name of the device that triggered the alert. (Custom message variables are not available for Flow policies.)

    Remember the Append Condition Message check box on the General Settings tab? It was selected by default, which means that when we create custom messages for specific conditions in just a minute, they'll be appended to the trigger message that we just created. (The same thing applies to condition messages for the clear message, which we'll discuss in just a bit.)

  3. Under Conditions, click images/download/attachments/33033174/image2015-12-16-14_46_18.png and select Create New. Perform the following actions to configure your new condition.

    images/download/attachments/33033174/image2016-3-29-17_23_16.png

    1. Click the Indicator drop-down and select an indicator to base the condition on. For this condition, our indicator is Available Swap Memory.

    2. The Type drop-down is set to Static by default. Leave this as it is. This will compare the current indicator value to the indicator value that you specify.

    3. Click the Comparison drop-down and select less than equal to.

    4. In the Threshold field, enter the value to trigger the condition. Enter 20 for our example.

    5. Click the drop-down to the right of the Threshold field and select Percent as the unit of measurement for this condition.

    6. In the Duration field, enter the number of minutes for the condition to be met before triggering an alert. Enter 15 for the example. This means that if the amount of available swap memory is 20% or lower for 15 minutes, the condition will trigger an alert. If the available swap memory is 20% or lower but only for 14 minutes, for example, an alert won't be triggered.

    7. The Aggregation drop-down is set to Average as the default data aggregation method. Just leave it as it is.

    8. In the Custom Message field, you can enter a message specific to this condition. For this one, let's sa y Available swap memory 20% or lower for 15 minutes . Since we've specified that we want custom messages appended to the trigger message, this condition message will append to our trigger message. (Custom message variables are not available for Flow policies.)

    9. Now click Save.

  4. Our new condition appears under Conditions. Since it's the first condition, it's listed as Condition A.

    images/download/attachments/33033174/selfMonPolicyEditorTriggerConditions.png

  5. In the Rules section, a new rule has been created as a result of the condition we just created. Since it's the first rule, it's listed as Rule 1. This new rule contains one condition: Condition A.

    If you create another condition, it'll be added to this rule using the AND Boolean operator. For example, Rule 1 would now contain Condition A AND Condition B.
    images/download/attachments/33033174/image2015-12-16-15_27_4.png
    If you'd prefer to apply an OR operator, you would first create a second rule (Rule 2). Then delete Condition B from Rule 1. After you've done that, select the check box for Condition B and add it to Rule 2 (click images/download/attachments/33033174/image2015-12-16-15_11_16.png and select Add to Rule 2).

    images/download/attachments/33033174/image2015-12-16-15_10_42.png

  6. For Webhooks, HTTP requests can be invoked to the URL location when an alert is triggered. Verb (GET, POST, PUT, PATCH, DELETE) and URL are required to be defined.

    At present, Webhooks can only be configured on Policies . When adding Thresholds , although Webhooks are visible, they are disabled and cannot be configured.

Clear Conditions tab

This is where you can define the conditions to clear the alert. If you don't define a clear condition, then the alerts triggered by the trigger condition will display on the Alerts page until you manually acknowledge them. In the following steps, we're going to define a clear condition to clear the triggered alert (defined on the Trigger Conditions tab) once the available swap memory is higher than 20% for 15 minutes.

  1. Select the Clear Conditions tab. You'll notice that it has an uncanny resemblance to the Trigger Conditions tab.

    images/download/attachments/33033174/selfMonPolicyEditorClearConditionsBlank.png
  2. In the Clear Message field, enter the message to display when the alert has been cleared. For example, you might say something along the lines of $deviceName is reporting an increase in available swap memory.

  3. Under Conditions, click images/download/attachments/33033174/image2015-12-16-15_29_52.png and select Create New. Perform the following actions to define your new condition.

    images/download/attachments/33033174/image2016-3-29-17_28_15.png

    1. Click the Indicator drop-down and select an indicator to base the condition on. For this condition, we're going to select Available Swap Memory as we did on the Trigger Conditions tab.

    2. For the Type drop-down, let's go with the default, Static, again.

    3. Click the Comparison drop-down and select greater than.

    4. In the Threshold text field, enter 20.

    5. Click the drop-down to the right of the Threshold field and select Percent.

    6. In the Duration field, enter the number of minutes for the condition to be met before the alert can be cleared. Enter 15 for our example. This means that if the amount of available swap memory is higher than 20% for 15 minutes, the alert that was triggered will be cleared. If the available swap memory is higher than 20% but only for 14 minutes , for example, the alert won't be cleared.

    7. The Aggregation drop-down is set to Average as the default data aggregation method. Just leave it as it is.

    8. In the Custom Message field, you can enter a message specific to this condition. For this one, let's say Available swap memory higher than 20% for 15 minutes . Earlier, we specified that we want custom messages appended to the clear message, so this message will append to our clear message. (Custom message variables are not available for Flow thresholds.)

    9. Now click Save.

      images/download/attachments/33033174/selfMonPolicyEditorClearConditions.png

    At present, Webhooks can only be configured on Policies . When adding Thresholds , although Webhooks are visible, they are disabled and cannot be configured.

  4. Once you've configured general settings and created trigger and clear conditions, click Save As New at the bottom of the page.

  5. Click the Policy Browser button to return to the Policy Browser, where you can see your new policy.

    images/download/attachments/33033174/policyCreated.png

Thresholds

While policies apply to entire device groups or object groups, you can use thresholds to receive alerts when conditions are triggered for a specific device. Creating a threshold is much like creating a policy, with just a couple of differences.

In the steps below, we're going to create a threshold that will alert us if the amount of average used disk space for a specified partition of a device is 80% or higher over 15 minutes . This threshold comes from the SevOne Recommended Policies section (see policy S1_SELFMON_Disk_UtilizationYellow).

Perform the following steps to create a threshold.

General Settings tab

On the General Settings tab, you can define basic threshold settings .

  1. From the navigation bar, click Events and select Configuration , then Threshold Browser .
    images/download/attachments/33033174/thresholdBrowser.png

  2. Click New Threshold to display the Threshold Editor.

    images/download/attachments/33033174/createandeditthresholdsMetric.png
  3. The Enable check box is selected by default. Go ahead and leave it that way. This indicates that the threshold is active.

    Disabled thresholds appear in light text on the Threshold Browser.

  4. Next to Created by, you'll see Stand Alone. This indicates that the threshold is created independently of any policy.

    For thresholds that were created from a policy, you would see the name of that policy after Created by. For example, if you were to look up a threshold created by our policy from the previous section, you'd see Created by: S1_SELFMON_memory_SwapRed. Clicking S1_SELFMON_memory_SwapRed would take you directly to the policy itself.

  5. Leave the Technology Type drop-down set to the default, Metric. This indicates that the threshold is triggered by non-flow data.

    If you're creating a threshold to be triggered by flow data, select Flow here. When you select Flow, additional, flow-related fields will appear (below and to the right). For more information about these fields, please see SevOne's documentation on creating and editing thresholds.

  6. In the Name field, enter the name of your threshold. For our example, we're going to call this one IPSLADevice _S1_SELFMON_Disk_UtilizationYellow. Replace IPSLADevice with the name of the device that your threshold applies to .

  7. Click the Device Group drop-down and select the device group that the device belongs to. Under All Device Groups, click + next to Manufacturer and select SevOne. If your device belongs to a different device group, select that device group instead.

  8. Click the Device drop-down and select the device that this threshold applies to. For our example, we've selected IPSLADevice 1.

  9. Click the Severity drop-down to select a severity level to display on the Alerts page when the threshold triggers an alert. Select Warning (warning condition).

  10. Leave the Schedule field as it is, so that the alert engine will run every three minutes to retest the thresholds.

    For more information about configuring Schedule settings, please see SevOne's documentation on creating and editing thresholds.

  11. Next to Email, click edit to display the email configuration pop-up. Perform the following actions to specify who should receive email when the threshold triggers an alert.
    images/download/attachments/33033174/image2015-12-16-14_41_100.png

    1. Provide input for one or more of the following sections:

    2. Addresses - manually enter one or more email addresses. Once you've entered an email address, click images/download/attachments/33033174/image2015-12-16-14_39_8.png to add it. Email addresses that you've added will appear in the column on the right.

    3. Users - select one or more users from the Users list and click images/download/attachments/33033174/image2015-12-16-14_39_16.png to add your selection(s).

    4. Roles - click the Roles drop-down and select the check box next to one or more user roles to add as recipients.

    5. Under Mail when the threshold is triggered, select one of the following options:

    6. Just once - to have the email sent only the first time that the threshold triggers an alert. This means that no further email will be sent out for the alert. Of course, once the alert is cleared, if the theshold triggers an alert again at a future time, an email will be sent out.

    7. One time every - to specify how frequently an email should be sent when the threshold triggers an alert. Enter a number in the text field. Then select the drop down and select either minutes, hours, or days.

    8. Click Close.

  12. Next to Trap Destinations, click edit to display a pop-up that lets you specify where to send traps from the threshold. By default, the first two check boxes are selected. Just leave this as it is. Now click Close.

    For more information about trap destinations, please see SevOne's documentation on the topic.

    images/download/attachments/33033174/image2015-12-16-15_42_30.png

  13. The Append Condition Message check box is selected by default. Go ahead and leave it selected.

  14. In the Description field, enter a description of the policy. For our example, let's say Triggers if IPSLADevice 1 used disk space is high.
    images/download/attachments/33033174/createandeditthresholdGeneral.png

Trigger Conditions tab

This is where you can define the conditions to trigger the threshold. You'll also be able to create a trigger message and condition messages.

  1. Select the Trigger Conditions tab. Next to Device, you'll see the device that you specified on the General Settings tab.

    images/download/attachments/33033174/selfMonThresholdEditorTriggerConditionsBlank.png
  2. In the Trigger Message field, enter the message that you want to display for this threshold on the Alerts page. For this one, we'll say Used disk space for IPSLADevice 1 is high . Feel free to modify the message to suit your needs.

  3. Under Conditions, click images/download/attachments/33033174/image2015-12-16-15_29_52.png and select Create New. Perform the following actions to define your new condition.

    images/download/attachments/33033174/image2016-3-29-17_49_20.png

    1. Click the Object drop-down and select an object to base the condition on. For this threshold, we need to specify the disk partition that we want to monitor. To find a list of partitions, scroll down to the SNMP Poller section. The partitions should appear near or at the top of the list. For our example, we'll select /boot - /dev/sda2. You may want to select a different partition, depending on your needs.

    2. Click the Indicator drop-down and select an indicator to base the condition on. Select Used Disk Space for the example.

    3. Leave the Type drop-down set to the default, Static. This will compare the current indicator value to the indicator value that you define.

    4. Click the Comparison drop-down and select greater than equal to.

    5. In the Threshold field, enter 80 as the value to trigger the condition.

    6. Click the drop-down to the right of the Threshold field and select Percent as the unit of measure for the condition.

    7. In the Duration field, enter 15 as the number of minutes for the condition to be met before triggering an alert. This means that if the used disk space is 80% or higher for 15 minutes , the condition will trigger an alert.

    8. Leave the Aggregation drop-down set to the default data aggregation method, Average.

    9. In the Custom Message field, you can enter a message specific to this condition. For this one, we'll sa y Used disk space for IPSLADevice 1 80% or higher for 15 minutes. Since we've specified that we want custom messages appended to the trigger message, this condition message will append to our trigger message. (Custom message variables are not available for Flow thresholds.)

    10. Now click Save.
      images/download/attachments/33033174/selfMonThresholdEditorTriggerConditions.png

      At present, Webhooks can only be configured on Policies . When adding Thresholds , although Webhooks are visible, they are disabled and cannot be configured.

Clear Conditions tab

This is where you can define the conditions to clear the alert. If you don't define a clear condition, then the alerts triggered by the trigger condition will display on the Alerts page until you manually acknowledge them. In the following steps, we're going to define a clear condition to clear the triggered alert (defined on the Trigger Conditions tab) onc e the average amount of used disk space is less than 80% for 15 minutes.

  1. Select the Clear Conditions tab.

    images/download/attachments/33033174/selfMonThresholdEditorClearConditionsBlank.png
  2. In the Clear Message field, enter a message to display when the alert has been cleared. For example, you might say something along the lines of IPSLADevice 1 is reporting a decrease in used disk space .

  3. Under Conditions, click images/download/attachments/33033174/image2015-12-16-15_29_52.png and select Create New. Perform the following actions to define your new condition.
    images/download/attachments/33033174/image2016-3-29-17_54_46.png

    1. Click the Object drop-down and select an object to base the condition on. Scroll down to the SNMP Poller section and select the same disk partition that you specified for the Trigger Conditions tab. For our example, this was /boot - /dev/sda2 .

    2. Click the Indicator drop-down and select Used Disk Space.

    3. Leave the Type drop-down set to the default, Static.

    4. Click the Comparison drop-down and select less than.

    5. In the Threshold field, enter 80.

    6. Click the drop-down to the right of the Threshold field and select Percent.

    7. In the Duration field, enter 15 as the number of minutes for the condition to be met before the alert can be cleared. This means that if the amount of used disk space is lower than 80% for 15 minutes , the alert that was triggered will be cleared.

    8. Leave the Aggregation drop-down set to the default data aggregation method, Average.

    9. In the Custom Message field, you can enter a message specific to this condition. For this one, let's say Used disk space for IPSLADevice 1 below 80% for 15 minutes. (Custom message variables are not available for Flow thresholds.)

    10. Now click Save.
      images/download/attachments/33033174/selfMonThresholdEditorClearConditions.png

      At present, Webhooks can only be configured on Policies . When adding Thresholds , although Webhooks are visible, they are disabled and cannot be configured.

Alerts

The Alerts page lets you look at the current, active alerts in the system. These include the threshold violations, trap notifications, and web site errors that you define on the Policy Browser or the Threshold Browser.

To access the Alerts page from the navigation bar, click Events and select Alerts.

images/download/attachments/33033174/alerts.png

Filters

The Alerts page features filters, which you can use to focus the display results. For example, if you only want to look at Critical alerts, you can easily specify a severity level of Critical on the General tab. If you wanted to narrow your focus a little more, you could also specify a device group–such as the SevOne device group we talked about earlier–on the Device Groups tab. Filters are cumulative. This means that the display results would include only devices that have Critical alerts and belong to the SevOne device group.

Filters are optional. You can maximize the display results by clearing the filters. Following is an overview of the filter tabs. After making any changes on a tab, click Apply Filter to apply your filter settings for that tab.

Click Show Filter at the top of the Alerts page to see the filter tabs.

General tab

images/download/attachments/33033174/selfMonAlertsGeneral.png

The General tab includes the following filters:

  • Severity - lets you specify the severity level of the display results. By default, all severity levels are selected.

  • Technology type - lets you specify whether the display results are based on flow data only, non-flow data only, or both flow data and non-flow data.

  • Message - lets you filter results based on the message text.

  • Assigned - lets you filter results based on the person or role that the alert is assigned to . Earlier, when we set up a policy and a threshold, we had the option of specifying someone to receive email when an alert triggers. This is what Assigned refers to.

  • Search ID - lets you filter results based on either the Alert ID, the Threshold ID, or the Policy ID.

  • Show - lets you choose whether to view active alerts, alerts that have been ignored, or both.

Devices tab

images/download/attachments/33033174/image2016-1-8-16_25_35.png

The Devices tab lets you specify one or more devices to filter the display results by. Click the drop-down and select the device(s) that you'd like to add to the filter.

Objects tab

images/download/attachments/33033174/image2016-1-8-16_27_26.png

The Objects tab lets you specify one or more objects for a specific device to filter the display results by. Perform the following steps.

  1. Click the Device drop-down and select a device.

  2. In the Object field, a list of available objects will appear for the selected device. Select the object(s) that you'd like to add to the filter. Then click images/download/attachments/33033174/image2016-1-11-13_54_6.png to add the object(s) to the field on the right. To display alerts for all objects, don't add any objects to the field on the right.

Device Groups tab

images/download/attachments/33033174/image2016-1-11-14_15_48.png

The Device Groups tab lets you specify one or more device group(s) to filter the display results by. Click the drop-down and select the check box next to each device group that you'd like to include.

Object Groups tab

images/download/attachments/33033174/image2016-1-11-14_17_20.png

The Object Groups tab lets you filter display results based on one or more specific object groups. Select the object group(s) that you'd like to add to the filter. Once you've made your selection, click images/download/attachments/33033174/image2015-12-16-14_39_8.png to add them. Object groups that you've added will appear in the field on the right.

Grouping

Just below the filters area, you'll see the Grouping drop-down. The Grouping option lets you display different levels of granularity. There are four possible Grouping settings, with Alerts being the most granular.

Below is a description of each Grouping setting. Click the Grouping drop-down to select any of the four settings.

Grouping: Alerts

images/download/attachments/33033174/selfMonGroupingAlerts.png

The Alerts grouping setting is set as the default for the Alerts page and displays the following columns:

  • ID - the ID number for the Alert. An ID preceded by images/download/attachments/3836460/exclamation.png is unassigned. images/download/attachments/33033174/image2015-12-16-15_55_34.png , on the other hand, indicates that the alert is assigned.

  • Device - the name of the device that triggered the alert.

  • First - the date and time that the alert was first reported to SevOne NMS.

  • Last - the date and time that the alert was last reported.

  • Severity - the severity level of the alert.

  • Message - the message generated by the threshold or trap.

Alerts - Controls

images/download/attachments/33033174/selfMonControlsAlerts.png

When you choose the Alerts Grouping, you'll see several controls at the bottom of the page. You can use these to view alert details and manage alerts. To see the controls for a specific alert, select the check box for that alert. The following is an overview of the available controls.

  • Device - the device that triggered the alert. Click the device name to display a link to the Device Summary as well as to the report templates for the device.

  • Object - the object that triggered the alert. Click the object name to display a link to the Object Summary as well as to the report templates for the object.

  • Threshold or Trap - the threshold or trap that triggered the alert. Click the threshold name or trap event name to go to the Threshold Editor or Trap Event Editor.

  • Severity - the severity level of the alert.

  • Message - the message generated by the threshold or trap.

  • First - the date and time that the alert was first reported.

  • Last - the date and time that the alert was last reported.

  • Occurrences - the total number of times that the alert triggered.

    The alert engine runs every three minutes. A policy-based threshold requires ten minutes to gather enough data to trigger an alert. This means that it can take up to thirteen minutes for the first alert for a new policy to appear.

  • Assigned To - the name of the person or role that the alert is assigned to. By default, alerts are unassigned.

  • Log Analytics - a field that appears if your cluster has a Performance Log Appliance (PLA). Click the Check for device logs link to display log data related to the alert.

Alerts - Actions

images/download/attachments/33033174/selfMonActionsAlerts.png

At the bottom of the page, just below the controls, you'll see four buttons. With the Alerts grouping setting, there are four actions that you can apply to an alert. The following is an overview of those actions. Select the check box next to an alert and do one (or more) of the following:

  • Click Acknowledge to acknowledge the alert. Once an alert has been acknowledged, it's permanently moved to the Alert Archive. When you acknowledge an alert, you'll see a pop-up where you can enter an explanation for the acknowledgement.

    images/download/attachments/33033174/image2016-2-16-15_44_37.png
  • Click Retest to see if the threshold trigger condition still exists for the alert. The alert engine retests all thresholds every three minutes.

    A couple notes about alerts triggered by thresholds without clear conditions:

    • They'll remain on the Alerts page until they're acknowledged.

    • When you retest them–even if the trigger condition is no longer met–they'll continue to display on the Alerts page, but the last date and time information won't change.

  • Click Ignore to ignore the alert. When you ignore an alert, you'll see a pop-up where you can specify how long you would like to ignore the alert. You can also enter a note.

    images/download/attachments/33033174/image2015-12-17-13_40_43.png
  • Click Assign to assign the alert to someone. Then click the drop-down next to Assign to select a person or a role.


Grouping: Devices

images/download/attachments/33033174/selfMonGroupingDevices.png

The Devices grouping setting displays the following columns:

  • Device - the name of the device that triggered the alert.

  • First - the date and time that the first alert for the device was first reported.

  • Last - the date and time that the last alert for the device was last reported.

  • Highest Severity - the highest severity level of the alerts for the device.

  • Message - information about the total number of alerts triggered by the device along with the highest severity level of the alerts for the device.

To view the alerts for a device, click either the device name or the device message. Clicking either one will apply the Alerts grouping setting to that specific device.

Grouping: Device Groups

images/download/attachments/33033174/selfMonGroupingDeviceGroups.png

The Device Groups grouping setting displays the following columns:

  • Device Group - the name of the device group or device type that triggered alerts.

  • First - the date and time that the first alert for the device group/device type was first reported.

  • Last - the date and time that the last alert for the device group/device type was last reported.

  • Highest Severity - the highest severity level of the alerts for the device group/device type.

  • Message - information about the total number of alerts triggered by the device group/device type along with the highest severity level of the alerts for the device group/device type.

Click any device group/device type name or message to view the devices belonging to that group/type along with the alerts for those devices. This will apply the Devices grouping setting to that specific device group/device type.

Grouping: Object Groups

images/download/attachments/33033174/selfMonGroupingObjectGroups.png

The Object Groups grouping setting displays the following columns:

  • Object Group - the name of the object group that triggered alerts.

  • First - the date and time that the first alert for the object group was first reported.

  • Last - the date and time that the last alert for the object group was last reported.

  • Highest Severity - the highest severity level of the alerts for the object group.

  • Message - information about the total number of alerts triggered by the object group along with the highest severity level of the alerts for the object group.

Click any object group name or message to view the devices associated with that object group along with the alerts for those devices. This will apply the Alerts grouping setting to that specific object group.

Instant Graphs

On the Instant Graphs page, you can create statistical graphs for the objects and indicators on devices. This is a great place to start because instant graphs are easy and fast to set up, allowing you to get an immediate look at potential problem areas.

In this section we're going to look at CPU activity for the past 24 hours.

To access the Instant Graphs page from the navigation bar, click Reports and select Instant Graphs.

images/download/attachments/33033174/image2015-12-17-13_44_31.png

  1. Click the Device Group drop-down to select a device group. We're going to select the SevOne device group. Under All Device Groups, click + next to Manufacturer and select SevOne.

Click the Device drop-down and select a device to base the instant graph on.

Click the Object drop-down to select an object. Let's go with CPU Total0 - CPU Total under SNMP Poller.

Click the Indicator drop-down to select an indicator. Select Idle CPU Time.

Click the Select button. This will move the indicator you selected to the Indicators to Graph field.

Let's throw in one more indicator. Click the Indicator drop-down and select Waiting CPU Time. Then click the Select button to add it.

Now you should see both indicators in the Indicators to Graph field. You'll notice that each is preceded by the device name and the object name.
images/download/attachments/33033174/image2016-2-16-16_14_21.png

We just added two indicators from the same device and object. You can also add indicators from other devices and objects to display everything in a single graph.

The Graph Type is set to Line Graph by default. Let's go with that.

Click the Time Span drop-down and select Past 24 hours.

In the Units section, the Scaled to Max check box is selected by default. Just leave it that way. This means that the graph will be scaled to the largest actual value for the specified time span. Clearing the check box scales the graph to the maximum potential value. The Scaled to Max setting is irrelevant for indicators that don't have a defined maximum value.

Leave the Percentage check box clear for our example.

You'll notice two radio buttons, Bits and Bytes. Select Bytes. Bits is used for network-oriented data, while Bytes applies to server-oriented data.

Click the Draw Graph button to display a graph of your selected indicators. The graph will appear below the button .
images/download/attachments/33033174/image2016-1-11-14_48_17.png

Your new graph comes with the following controls, which you can use to manipulate the graph presentation:

  • Auto-refresh every __ seconds check box - refreshes the graph every <n> seconds. Enter the number of seconds in the text field.

  • images/download/attachments/33033174/image2015-12-17-13_50_28.png - re-centers the graph. Click this icon and then click the point of the graph that you would like to be in the center. For example, if you were to click 8:00 on the graph in the screenshot above, the 8:00 point would move to the center.

  • images/download/attachments/33033174/image2015-12-17-13_51_1.png - zooms in. Click this icon and then click the point on the graph where you would like to zoom in for a closer look.

  • images/download/attachments/33033174/image2015-12-17-13_51_47.png - zooms out. Click this icon and then click the point on the graph from where you would like to zoom out.

  • images/download/attachments/33033174/image2015-12-17-13_52_21.png - focuses in on a selected range. First, click this icon. After that, click once on the starting point of your desired range, and then click on the end point of the range. Your selected range will appear highlighted. Click anywhere within the highlighted range to view a graph of it.

  • Detach - adds the instant graph as an attachment in a report on a new browser tab.

    You can modify reports to add other attachments and you can save reports to the Report Manager. Report workflows enable you to designate reports to be your favorite reports and to define one report to appear as your custom dashboard.

  • PDF - exports the report to a .pdf format.

  • Calendar - provides a view of the indicator performance over time. You can do some neat stuff with the calendar. Try the following, for example:

    1. Click the Calendar button. The calendar will appear in a new browser tab. When we set up our graph, we specified the past 24 hours as our time span. You'll notice that our new calendar covers an entire month.

      images/download/attachments/33033174/image2016-1-11-14_50_31.png
    2. Each day on the calendar has a graph, which summarizes the indicator's activity for that day. Select a day and click the date in the upper right corner of that day.

      images/download/attachments/33033174/image2016-1-11-14_52_21.png
    3. Now you should see a breakdown of the day's activity in the form of four six-hour graphs (from 0:00 to 24:00).

      images/download/attachments/33033174/image2016-1-11-14_53_52.png
  • NetFlow - displays associated flow data if an indicator in the graph has mapped flow dat a. This button only appears if there is mapped flow data. When you enable the SNMP plugin for a device, many SNMP objects are automatically mapped to their corresponding flow interface, and the Object Mapping page enables you to map additional objects for flow monitoring.

  • NBAR - displays associated NBAR data if an indicator in the graph has mapped NBAR data . This button only appears if there is mapped NBAR data. You can map objects for NBAR monitoring on the Object Mapping page.

Recommended Policies for Self-monitoring

This section includes the following four tables, which break down groups of policies by the component they apply to:

  • SYSTEM POLICIES

  • CORE PROCESS POLICIES

  • CONFIGURATION POLICIES

  • XSTATS POLICIES

Each table contains a list of SevOne recommended self-monitoring policies and includes the following information for each policy:

  • Policy - The name and description of the policy.

  • Applies to - Whether the policy applies to the PAS, HSA, DNC, or any combination of these.

  • Plugin - The plugin you'll use when creating the policy.

  • Alert Condition - Specific information for setting up the policy trigger condition(s). This includes object type, severity level and the condition specifications.

  • Clear Condition - Specific i nformation for setting up the policy clear condition(s). This includes the condition specifications.

  • Suggested Remediation Steps - Recommended steps to take in case a condition is violated.

  • Details - Any additional information.

Using the Tables

The Plugin, Alert Condition, and Trigger Condition columns provide the specific information that you'll need for creating each policy (see Policies and Thresholds for more information on creating policies). The following screenshot shows an example of a policy.

images/download/attachments/33033174/image2016-3-29-18_5_40.png

General Settings Tab

In the example above, the Plugin column indicates that you'll need to specify SNMP Poller as the plugin. At the top of the Alert Conditions column, you'll find the object type that you'll need, along with the object subtype if there is one. In this example, the object type is Memory (Linux (Net-SNMP)). Immediately below the object type information is the recommended severity level, which is Alert in our example. The following screenshot shows how this information appears on the General Settings tab when you create a policy.

images/download/attachments/33033174/selfMonPolicyEditorGeneralSettings0.png

Trigger Conditions Tab

The Alert Condition column also contains the information you'll need for setting up your policy's trigger condition(s). This information appears below the object type and severity level information. For example, in the screenshot above, the following trigger (or alert) condition is recommended:

Average Available Swap Memory <= 20% over 15 minutes

When setting up the trigger condition, this means the following:

  • Average - specifies Average as the recommended Aggregation setting for this condition.

  • Available Swap Memory - specifies Available Swap Memory as the indicator to use for this condition. The indicator will appear in bold font.

  • <= - specifies that the less than equal to operator should be used for this condition.

  • 20% - specifies a threshold of 20% for this condition.

  • over 15 minutes - indicates a duration of 15 minutes for the condition.

Below is a screenshot of the the pop-up where you specify the settings for your trigger condition. The screenshot includes the settings provided above.

images/download/attachments/33033174/popup_createTriggerCondition_v2.png

Clear Conditions Tab

The Clear Condition column contains information for setting up the clear condition. For example, in the screenshot above (of the policy S1_SELFMON_memory_SwapRed), the following clear condition is recommended:

Average Available Swap Memory > 20% over 15 minutes

When setting up the trigger condition, this means the following:

  • Average - specifies Average as the recommended Aggregation setting for this condition.

  • Available Swap Memory - specifies Available Swap Memory as the indicator to use for this condition. The indicator will appear in bold font.

  • > - specifies that the greater than operator should be used for this condition.

  • 20% - specifies a threshold of 20% for this condition.

  • over 15 minutes - indicates a duration of 15 minutes for the condition.

Below is a screenshot of the the pop-up where you specify the settings for your clear condition. The screenshot includes the settings provided above.

images/download/attachments/33033174/popup_createClearCondition_v2.png

Additional Information

Type

When you create a condition for a policy, the Type drop-down is set to Static by default. For most of the conditions in the table, this is what you'll need. A few conditions will require a different Type (for example, Baseline Delta, Baseline Percentage, etc.). These exceptions are noted in the Alert Condition column. Unless otherwise noted, the Type will be Static.

Multiple Conditions

Some policies include more than one condition. In cases where both (or all) conditions must be met, this will be indicated by the AND Boolean operator. An OR Boolean operator indicates that only one condition (or set of conditions) must be met in order to trigger a threshold. In the screenshot below, only one of the conditions (Rule 1 or Rule 2) listed under Alert Condition needs to be met. In this case, you would need a separate Rule for each condition when creating conditions on the Trigger Conditions tab. (See information on using Rules.)

images/download/attachments/33033174/image2016-3-29-18_21_30.png

Recommended Policies

The following tables contain recommended policies for SevOne self-monitoring.

Most of the entries in the tables are policies. However, a small number of the entries need to be configured as thresholds. These entries include a comment to this effect directly under the threshold name.

For example:

S1_SELFMON_Disk_UtilizationRed

Note: This is a Threshold rather than a Policy.


In the tables below, each Policy that begins with '_' contains the prefix S1_SELFMON. For example,

  • _memory_SwapRed must be read as S1_SELFMON_memory_SwapRed

  • _memory_SwapYellow must be read as S1_SELFMON_memory_SwapYellow

SYSTEM POLICIES

Policy
(prefix: S1_SELFMON)

Applies To

Plugin

Alert Condition

Clear Condition

Suggested Remediation Steps

Details

_memory_SwapRed

PAS

HSA

DNC

SNMP
Poller

Memory (Linux (Net-SNMP))
Severity: Alert

Average Available Swap Memory <= 20% over 15 minutes

Average Available Swap Memory > 20% over 15 minutes

See Details before executing the following command.

For CentOS,

killall -9 SevOne-requestd; php /usr/local/scripts/periodic.shortterm.backup.php --force-backup; supervisorctl restart mysqld ; php /usr/local/scripts/periodic.shortterm.backup.php --force-restore;

For Gentoo,
killall -9 SevOne-requestd; php /usr/local/scripts/periodic.shortterm.backup.php --force-backup; /etc/init.d/mysql restart; php /usr/local/scripts/periodic.shortterm.backup.php --force-restore;

WARNING: This command may cause the loss of one poll cycle, and the past two hours of data may be briefly unavailable in the GUI (it will come back once the command finishes). This command will flush out any processes hogging up memory and will relieve the swap memory.

_memory_SwapYellow

PAS

HSA

DNC

SNMP
Poller

Memory (Linux (Net-SNMP))
Severity: Warning

Average Available Swap Memory <= 50% over 15 minutes

Average Available Swap Memory > 50% over 15 minutes

Execute the following command:

killall -9 SevOne-requestd;

Consider running the command for SevOne_SwapRed (previous row). Otherwise, keep tabs on swap usage.

_Raid_Array

NMS (PAS, HSA)

Deferred
Data

Raid Array
Severity: Alert

Average State != 4 over 15 minutes

Average State = 4 over 15 minutes

Execute the following command:
SevOne-raid-status |grep State |grep -v Foreign

If degraded state is shown, replace failed disk. If rebuilding state is shown, array is healthy.

_Raid_DiskPredictiveFail

NMS (PAS, HSA)

Deferred
Data

Raid Array Disk
Severity: Warning

Average Predictive Failure Count > 0 over 15 minutes

Average Predictive Failure Count = 0 over 15 minutes

N/A

Contact SevOne Support. SevOne will ask to reseat drive before replacing it. Replace failed disk.

_Raid_DiskFailure

NMS (PAS, HSA)

Deferred
Data

Raid Array Disk
Severity: Alert

Rule 1
Average Firmware State = 4 over 15 minutes

OR

Rule 2
Average Firmware State <= 2 over 15 minutes

Average Firmware State != 4 over 15 minutes

AND

Average Firmware State > 2 over 15 minutes

N/A

Contact SevOne Support. SevOne will ask to reseat drive before replacing it. Replace failed disk.

_Disk_UtilizationYellow

Note: This is a Threshold rather than a Policy.

NMS (PAS, HSA)

SNMP
Poller

/ [specified partition]
Severity: Warning


Average Used Disk Space >= 80% over 15 minutes

Note: For [specified partition], select the partition to monitor. Create a separate threshold for each partition that you plan to monitor.

Average Used Disk Space < 80% over 15 minutes

N/A

Review settings in Administration > Cluster Manager > Cluster Settings > Storage. Specify lower number for Data Retention and Maximum Disk Utilization if possible.

_Disk_UtilizationRed

Note: This is a Threshold rather than a Policy.

NMS (PAS, HSA)

SNMP
Poller

/ [specified partition]

Severity: Alert

Average Used Disk Space >= 90% over 15 minutes

Note: For [specified partition], select the partition to monitor. Create a separate threshold for each partition that you plan to monitor.

Average Used Disk Space < 90% over 15 minutes

Execute the following command:

trim-longterm --emergency-purge

Use this policy for sd* objects , such as sda1, sdb, etc. This may indicate that the system is operating with a workload that is above or below its rated capacity. In case of system degradation, contact SevOne Support.

_Disk_Reads

NMS (PAS, HSA)

SNMP
Poller

Disk IO (Linux (Net-SNMP))
Severity: Warning

Note: For Type, specify Baseline Percentage.

Average Number of reads > 150% of baseline over 50 minutes

Average Number of reads < 150% of baseline over 50 minutes

N/A

Use this policy for sd* objects , such as sda1, sdb, etc. This may indicate that the system is operating with a workload that is above or below its rated capacity. In case of system degradation, contact SevOne Support.

_Disk_Writes

NMS (PAS, HSA)

SNMP
Poller

Disk IO (Linux (Net-SNMP))
Severity: Warning

Note: For Type, specify Baseline Percentage.

Average Number of writes > 150% of baseline over 50 minutes

Average Number of writes < 150% of baseline over 50 minutes

N/A

Use this policy for sd* objects , such as sda1, sdb, etc. This may indicate that the system is operating with a workload that is above or below its rated capacity. In case of system degradation, contact SevOne Support.

_HSA_FAILOVER

Note: For this policy, we recommend creating a device group for the HSAs and applying the policy to that device group.

NMS (PAS, HSA)

Deferred
Data

SevOne Appliance
Severity: Alert

Average Is master = 1 over 5 minutes

Average Is master = 0 over 5 minutes

_Memory_PhysicalFree

Note: This is a Threshold rather than a Policy.

NMS (PAS, HSA)

SNMP
Poller

Physical Memory - Memory
Severity: Alert

Average Total Free Memory < 1 GB over 15 minutes

Average Total Free Memory > 1 GB over 15 minutes

_Ethernet_Traffic

NMS (PAS, HSA)

SNMP
Poller

Interface
Severity: Warning

Note: For Type, specify Baseline Percentage.

Rule 1
Average HC In Octets > 150% o f baseline over 15 minutes

OR

Rule 2
Average HC Out Octets > 150% of baseline over 15 minutes

Average HC In Octets < 150 % of baseline over 15 minutes

AND

Average HC Out Octets < 150 % of baseline over 15 minutes

Consider moving devices to a different appliance. Otherwise, contact SevOne Support.

_Ethernet_Errors

NMS (PAS, HSA)

SNMP
Poller

Interface
Severity: Warning

Note: For Type, specify Baseline Percentage.

Rule 1
Average In Errors > 120% of baseline over 15 minutes

OR

Rule 2
Average Out Errors > 120% of baseline over 15 minutes

OR

Rule 3
Average In Discards > 120% of baseline over 15 minutes

OR

Rule 4
Average Out Discards > 120% of baseline over 15 minutes

Average In Errors < 120% of baseline over 15 minutes

AND

Average Out Errors < 120% of baseline over 15 minutes

AND

Average In Discards < 120% of baseline over 15 minutes

AND

Average Out Discards < 120% of baseline over 15 minutes

_ALL_iDRAC_ICMPReachability

Note: For this policy to work, you'll need to add iDRACs to the Device Manager as separate devices, for example:

Device 1 - SevOne (10.10.10.1)

Device 2 - SevOne-idrac (10.10.10.2)

NMS (PAS, HSA)

ICMP
Poller

Ping Data
Severity: Emergency

Average Availability < 95% over 15 minutes

Average Availability >= 95% over 15 minutes

Check network connectivity to the appliance. If the network is okay, contact SevOne Support.

_ALL_iDRAC_SNMPReachability

NMS (PAS, HSA)

SNMP
Poller

SNMP Availability
Severity: Alert

Average Availability < 95% over 15 minutes

Average Availability >= 95% over 15 minutes

If the iDRAC goes down, you'll need to investigate server health, iDRAC connectivity, etc.

If iDRAC health, etc., looks good, contact SevOne Support.

CORE PROCESS POLICIES

Policy
(prefix: S1_SELFMON)

Applies To

Plugin

Alert Condition

Clear Condition

Suggested Remediation Steps

Details

_requestd_Availability

PAS

HSA

DNC

Process Poller

Process
Subtype: SevOne-requestd
Severity: Alert

Average Availability < 100% over 15 minutes

Average Availability = 100% over 15 minutes

Execute the following command:

For CentOS,
supervisorctl start SevOne-requestd

For Gentoo,
initedit --enable SevOne-requestd

If the command fails, check firewall access on port 60006 (TCP).

_requestd_CPUTime

PAS

HSA

DNC

Process Poller

Process
Subtype: SevOne-requestd
Severity: Alert

Average CPU Time > 1000 Milliseconds over 15 minutes

Average CPU Time < 1000 Milliseconds over 15 minutes

_requestd_Memory

PAS

HSA

DNC

Process Poller

Process
Subtype: SevOne-requestd
Severity: Error

Average System memory used > 5 GB over 15 minutes

Average System memory used < 5 GB over 15 minutes

N/A

Check network connectivity to the appliance. If the network is okay, contact SevOne Support.

_apache2_Availability

PAS

HSA

DNC

Process Poller

Process
Subtype: apache2
Severity: Alert

Average Availability < 100% over 15 minutes

Average Availability = 100% over 15 minutes

Execute the following command:

For CentOS, N/A

For Gentoo,
/etc/init.d/apache2 start

If the command fails, check firewall access on ports 80 and 443 (TCP).

_apache2_CPUTime

PAS

HSA

DNC

Process Poller

Process
Subtype: apache2
Severity: Alert

Note: For Type, specify Baseline Percentage.


Average CPU Time > 150% of baseline over 15 minutes

Average CPU Time < 150% of baseline over 15 minutes

_apache2_Memory

PAS

HSA

DNC

Process Poller

Process
Subtype: apache2
Severity: Alert


Note: For Type, specify Baseline Percentage.

Average System memory used > 150% of baseline over 15 minutes

Average System memory used < 150% of baseline over 15 minutes

_apache2_ThreadCount

PAS

HSA

DNC

Process Poller

Process
Subtype: apache2
Severity: Alert
Note: For Type, specify Baseline Percentage.

Average Instances of the process > 150% of baseline over 15 minutes

Average System memory used < 150% of baseline over 15 minutes

_polld_Availability

PAS

HSA

DNC

Process Poller

Process
Subtype: SevOne-polld
Severity: Alert

Average Availability < 100% over 15 minutes

Average Availability = 100% over 15 minutes

Execute the following command:

For CentOS,
supervisorctl start SevOne-polld

For Gentoo,
initedit --enable SevOne-polld

If the command fails, contact SevOne Support.

_polld_CPUTime

PAS

HSA

DNC

Process Poller

Process
Subtype: SevOne-polld
Severity: Alert


Note: For Type, specify Baseline Percentage.


Average CPU Time > 150% of baseline over 15 minutes

Average CPU Time < 150% of baseline over 15 minutes

_polld_Memory

PAS

HSA

DNC

Process Poller

Process
Subtype: SevOne-polld
Severity: Alert


Note: For Type, specify Baseline Percentage.


Average System memory used > 150% of baseline over 15 minutes

Average System memory used < 150% of baseline over 15 minutes

_polld_ThreadCount

PAS

HSA

DNC

Process Poller

Process
Subtype: SevOne-polld
Severity: Alert

Note: For Type, specify Baseline Percentage.
Average Instances of the process > 150% of baseline over 15 minutes
OR
Average Instances of the process < 50% of baseline over 15 minutes

Average Instances of the process < 150% of baseline over 15 minutes

AND

Average Instances of the process > 50% of baseline over 15 minutes

_mysqlData_Availability

The object subtype for this policy, mysqld, applies to both the MySQL Config database and the MySQL Data database.

PAS

HSA

DNC

Process Poller

Process
Subtype: mysqld
Severity: Alert

Average Availability < 100% over 15 minutes

Average Availability = 100% over 15 minutes

Execute the following command:

For CentOS,
supervisorctl start mysqld; supervisorctl start mysqld2

For Gentoo,
/etc/init.d/mysql start; /etc/init.d/mysql2 start

If the command fails, check firewall access on ports 3306 and 3307 (TCP).

_mysqldData_CPUTime

The object subtype for this policy, mysqld, applies to both the MySQL Config database and the MySQL Data database.

PAS

HSA

DNC

Process Poller

Process
Subtype: mysqld
Severity: Alert

Note: For Type, specify Baseline Percentage.


Average CPU Time > 150% of baseline over 15 minutes

Average CPU Time < 150% of baseline over 15 minutes

_mysqlData_Memory

The object subtype for this policy, mysqld, applies to both the MySQL Config database and the MySQL Data database.

PAS

HSA

DNC

Process Poller

Process
Subtype: mysqld
Severity: Alert

Note: For Type, specify Baseline Percentage.


Average System memory used > 150% of baseline over 15 minutes

Average System memory used < 150% of baseline over 15 minutes

_mysqlData_ThreadCount

The object subtype for this policy, mysqld, applies to both the MySQL Config database and the MySQL Data database.

PAS

HSA

DNC

Process Poller

Process
Subtype: mysqld
Severity: Alert

Note: For Type, specify Baseline Percentage.
Average Instances of the process > 150% of baseline over 15 minutes
OR
Average Instances of the process < 50% of baseline over 15 minutes

Average Instances of the process < 150% of baseline over 15 minutes

AND

Average Instances of the process > 50% of baseline over 15 minutes

_sshd_Availability

PAS

HSA

DNC

Process Poller

Process
Subtype: sshd
Severity: Alert

Average Availability < 100% over 15 minutes

Average Availability = 100% over 15 minutes

Execute the following command:

For CentOS,
supervisorctl restart sshd

For Gentoo,
/etc/init.d/sshd restart

If the command fails, check firewall access on port 22 (TCP).

_sshd_CPUTime

PAS

HSA

DNC

Process Poller

Process
Subtype: sshd
Severity: Alert

Note: For Type, specify Baseline Percentage.


Average CPU Time > 150% of baseline over 15 minutes

Average CPU Time < 150% of baseline over 15 minutes

_sshd_Memory

PAS

HSA

DNC

Process Poller

Process
Subtype: sshd
Severity: Alert

Note: For Type, specify Baseline Percentage.


Average System memory used > 150% of baseline over 15 minutes

Average System memory used < 150% of baseline over 15 minutes

_crond_Availability

PAS

HSA

DNC

Process Poller

Process
Subtype: crond
Severity: Alert

Average Availability < 100% over 15 minutes

Average Availability = 100% over 15 minutes

_crond_CPUTime

PAS

HSA

DNC

Process Poller

Process
Subtype: crond
Severity: Alert

Note: For Type, specify Baseline Percentage.


Average CPU Time > 150% of baseline over 15 minutes

Average CPU Time < 150% of baseline over 15 minutes

_crond_Memory

PAS

HSA

DNC

Process Poller

Process
Subtype: crond
Severity: Alert

Note: For Type, specify Baseline Percentage.


Average System memory used > 150% of baseline over 15 minutes

Average System memory used < 150% of baseline over 15 minutes

_datad_Availability

PAS

HSA

DNC

Process Poller

Process
Subtype: SevOne-datad
Severity: Alert

Average Availability < 100% over 15 minutes

Average Availability = 100% over 15 minutes

Execute the following command:

For CentOS,
supervisorctl start SevOne-datad

For Gentoo,
initedit --enable SevOne-datad

If the command fails, contact SevOne Support.

_datad_CPUTime

PAS

HSA

DNC

Process Poller

Process
Subtype: SevOne-datad
Severity: Alert

Note: For Type, specify Baseline Percentage.


Average CPU Time > 150% of baseline over 15 minutes

Average CPU Time < 150% of baseline over 15 minutes

_datad_Memory

PAS

HSA

DNC

Process Poller

Process
Subtype: SevOne-datad
Severity: Alert

Note: For Type, specify Baseline Percentage.


Average System memory used > 150% of baseline over 15 minutes

Average System memory used < 150% of baseline over 15 minutes

_masterslaved_Availability

PAS

HSA

DNC

Process Poller

Process
Subtype: SevOne-masterslaved
Severity: Alert

Average Availability < 100% over 15 minutes

Average Availability = 100% over 15 minutes

Execute the following command:

For CentOS,
supervisorctl start SevOne-masterslaved

For Gentoo,
initedit --enable SevOne-masterslaved

If the command fails, check firewall access on port 5050 (TCP).

_masterslaved_CPUTime

PAS

HSA

DNC

Process Poller

Process
Subtype: SevOne-masterslaved
Severity: Alert

Note: For Type, specify Baseline Percentage.


Average CPU Time > 150% of baseline over 15 minutes

Average CPU Time < 150% of baseline over 15 minutes

_masterslaved_Memory

PAS

HSA

DNC

Process Poller

Process
Subtype: SevOne-masterslaved
Severity: Alert

Note: For Type, specify Baseline Percentage.


Average System memory used > 150% of baseline over 15 minutes

Average System memory used < 150% of baseline over 15 minutes

_ntpd_Availability

PAS

HSA

DNC

Process Poller

Process
Subtype: ntpd
Severity: Alert

Average Availability < 100% over 15 minutes

Average Availability = 100% over 15 minutes

Execute the following command:

For CentOS,
systemctl start ntpd

For Gentoo,
/etc/init.d/ntpd start

If the command fails, check firewall access on port 123 (UDP).

_ntpd_CPUTime

PAS

HSA

DNC

Process Poller

Process
Subtype: ntpd
Severity: Alert

Note: For Type, specify Baseline Percentage.


Average CPU Time > 150% of baseline over 15 minutes

Average CPU Time < 150% of baseline over 15 minutes

_ntpd_Memory

PAS

HSA

DNC

Process Poller

Process
Subtype: ntpd
Severity: Alert

Note: For Type, specify Baseline Percentage.


Average System memory used > 150% of baseline over 15 minutes

Average System memory used < 150% of baseline over 15 minutes

_netflowd_Availability

DNC

Process Poller

Process
Subtype: SevOne-netflowd
Severity: Alert

Average Availability < 100% over 15 minutes

Average Availability = 100% over 15 minutes

Execute the following command:

For CentOS,
supervisorctl start SevOne-netflowd

For Gentoo,
initedit --enable SevOne-netflowd

If the command fails, contact SevOne Support.

_netflowd_CPUTime

DNC

Process Poller

Process
Subtype: SevOne-netflowd
Severity: Alert

Note: For Type, specify Baseline Percentage.


Average CPU Time > 150% of baseline over 15 minutes

Average CPU Time < 150% of baseline over 15 minutes

_netflowd_Memory

DNC

Process Poller

Process
Subtype: SevOne-netflowd
Severity: Alert

Note: For Type, specify Baseline Percentage.


Average System memory used > 150% of baseline over 15 minutes

Average System memory used < 150% of baseline over 15 minutes

Selfmond: peer is not reachable via SSH

PAS

HSA

DNC

SNMP Poller

One or more peers are not reachable via SSH.

Object Type: SevOne Process (Linux (Net-SNMP))
Severity: Emergency

AllPeersReachable changed to 0 over 1 minute

No

Selfmond: updater has fallen behind

PAS

HSA

DNC

SNMP Poller

Updater has fallen behind.

Object Type: SevOne Process (Linux (Net-SNMP))
Severity: Emergency

Average SecondsSinceLastUpdate > 10800 over 1 minute

No

Selfmond: master-slave both master

PAS

HSA

DNC

SNMP Poller

Both appliances think that they are Master.

Object Type: SevOne Process (Linux (Net-SNMP))
Severity: Emergency

Average BothMaster >= 5 minutes

No

Selfmond: master-slave active appliance

PAS

HSA

DNC

SNMP Poller

Active appliance is changed.

Object Type: SevOne Process (Linux (Net-SNMP))
Severity: Emergency

ActiveAppliance changed over 5 minutes

No

Selfmond: Config Db replication is too much behind

PAS

HSA

DNC

SNMP Poller

Config Db replication is too much behind for normal operations.

Object Type: SevOne Process (Linux (Net-SNMP))
Severity: Emergency

Average SecondsBehindMaster > 300 over 5 minutes

No

Selfmond: Data Db replication is too much behind

PAS

HSA

DNC

SNMP Poller

Data Db replication is too much behind for normal operations.

Object Type: SevOne Process (Linux (Net-SNMP))
Severity: Emergency

Average SecondsBehindMaster > 300 over 5 minutes

No

Selfmond: / mount point become read-only

PAS

HSA

DNC

SNMP Poller

/data mount point become read-only.

Object Type: SevOne Process (Linux (Net-SNMP))
Severity: Emergency

Average dataMountPoint = 2 over 1 minute

No

Selfmond: /data mount point become read-only

PAS

HSA

DNC

SNMP Poller

/data mount point become read-only.

Object Type: SevOne Process (Linux (Net-SNMP))
Severity: Emergency

Average dataMountPoint = 2 over 1 minute

No

Selfmond: /iodrive mount point become read-only

PAS

HSA

DNC

SNMP Poller

/iodrive mount point become read-only.

Object Type: SevOne Process (Linux (Net-SNMP))
Severity: Emergency

Average iodriveMountPoint = 2 over 1 minute

No

Selfmond: no polled indicators

PAS

HSA

DNC

SNMP Poller

Object Type: SevOne Process (Linux (Net-SNMP))

Severity: Emergency
polled indicators per seconds are equal to 0 for over 5 minutes

No

Selfmond: report mailer has fallen behind

PAS

HSA

DNC

SNMP Poller

Object Type: SevOne Process (Linux (Net-SNMP))

Severity: Emergency
report mailer has fallen 300 seconds after it is supposed to be sent

No

Selfmond: keys expired

PAS

HSA

DNC

SNMP Poller

Object Type: SevOne Process (Linux (Net-SNMP))

Severity: Emergency
When a number of expired crypto keys is > 0 for more than 5 minutes

No

In NMS console on any peer execute the following command: SevOne-act activate-crypto-permissions --uid <USER ID>

where, <USER ID> is the user with expired crypto key

Crypto keys are used for decrypting sensitive information obtained by REST API and are generated in NMS console by SevOne-act activate-crypto-permissions

CONFIGURATION POLICIES

Policy
(prefix: S1_SELFMON)

Applies To

Plugin

Alert Condition

Clear Condition

Suggested Remediation Steps

Details

_Managed_Objects

PAS

HSA

DNC

Deferred Data

SevOne Appliance
Severity: Alert

Note: For Type, specify Baseline Percentage.


Average Total objects enabled > 20% of baseline over 60 minutes

OR

Average Total objects enabled < 20% of baseline over 60 minutes

No clear condition. If the alert condition triggers, there is likely a serious problem. The alert should be cleared manually once the issue has been addressed.

_Disabled_Objects

PAS

HSA

DNC

Deferred Data

SevOne Appliance
Severity: Alert

Note: For Type, specify Baseline Percentage.


Average Total objects disabled > 20% of baseline over 60 minutes

OR

Average Total objects disabled < 20% of baseline over 60 minutes

No clear condition. If the alert condition triggers, there is likely a serious problem. The alert should be cleared manually once the issue has been addressed.


XSTATS POLICIES

Policy
(prefix: S1_SELFMON)

Applies To

Plugin

Alert Condition

Clear Condition

Suggested Remediation Steps

Details

_bulkd_Availability (xStats Only)

PAS

Process Poller

Process
Subtype: SevOne-bulkd (process)
Severity: Alert

Average Availability < 100% over 15 minutes

Average Availability = 100% over 15 minutes

_bulkd_CPUTime (xStats Only)

PAS

Process Poller

Process
Subtype: SevOne-bulkd (process)
Severity: Alert

Note: For Type, specify Baseline Percentage.


Average CPU Time > 150% of baseline over 15 minutes

Average CPU Time < 150% of baseline over 15 minutes

_bulkd_Memory (xStats Only)

PAS

Process Poller

Process
Subtype: SevOne-bulkd (process)
Severity: Alert

Note: For Type, specify Baseline Percentage.


Average System memory used > 150% of baseline over 15 minutes

Average System memory used < 150% of baseline over 15 minutes

_bulkd_ThreadCount (xStats Only)

PAS

Process Poller

Process
Subtype: SevOne-bulkd (process)
Severity: Alert

Note: For Type, specify Baseline Percentage.

Average Instances of the process > 150% of baseline over 15 minutes

OR

Average Instances of the process < 50% of baseline over 15 minutes

Average Instances of the process < 150% of baseline over 15 minutes

AND

Average Instances of the process > 50% of baseline over 15 minutes

_fcad_Availability (xStats Only)

PAS

Process Poller

Process
Subtype: SevOne-fcad
Severity: Alert

Average Availability < 75% over 15 minutes

Average Availability > 75% over 15 minutes

_fcad_CPUTime (xStats Only)

PAS

Process Poller

Process
Subtype: SevOne-fcad
Severity: Alert

Note: For Type, specify Baseline Percentage.


Average CPU Time > 120% of baseline over 15 minutes

Average CPU Time < 120% of baseline over 15 minutes

_fcad_Memory (xStats Only)

PAS

Process Poller

Process
Subtype: SevOne-fcad
Severity: Alert

Note: For Type, specify Baseline Percentage.


Average System memory used > 150% of baseline over 15 minutes

Average System memory used < 150% of baseline over 15 minutes

_fcad_ThreadCount (xStats Only)

PAS

Process Poller

Process
Subtype: SevOne-fcad
Severity: Alert

Note: For Type, specify Baseline Percentage.

Average Instances of the process > 150% of baseline over 15 minutes

OR

Average Instances of the process < 50% of baseline over 15 minutes

Average Instances of the process < 150% of baseline over 15 minutes

AND

Average Instances of the process > 50% of baseline over 15 minutes

Uninstall Self-monitoring

There are a handful of instances when it may become necessary to uninstall the Self-monitoring components. Removal and re-installation is the most efficient way to update Self-monitoring in the event of an IP address change, SevOneStats user password change, etc. Users should ensure that their root user's crontab is backed up prior to executing the following commands as this script makes modifications to the crontab.

  1. Backup root user's crontab.

    $ crontab -l > /root/root_crontab_backup-`date +%s`.txt
  2. Go to /usr/local/scripts/utilities/plugins/selfmon/ directory.

    $ cd /usr/local/scripts/utilities/plugins/selfmon/
  3. Run ./uninstall.sh script and follow the prompts OR run the script in a single line as shown below.

    $ (echo -e '\n') | /usr/local/scripts/utilities/plugins/selfmon/uninstall.sh
  4. Validate that the root user's crontab no longer contains the following lines.

    */5 * * * * php /usr/local/scripts/utilities/plugins/selfmon/RAIDMon/poll.megacli.objects.php -i "2" --api '127.0.0.1' >> /var/SevOne/RAIDMon.log 2>&1
    */5 * * * * /usr/local/scripts/utilities/plugins/selfmon/SevOneMon/sevone.deferreddata.poll.sevone --api 127.0.0.1 --device-id "2" --object 'SevOne Statistics' >> /var/SevOne/SevOneMon.log 2>&1
    */5 * * * * /usr/local/scripts/utilities/plugins/selfmon/MySQLMon/sevone.deferreddata.poll.mysql --api 127.0.0.1 --device-id "2" --database 127.0.0.1:3307 --object 'MySQL Config Database' --database-profile "config" >> /var/SevOne/MySQLMon.log 2>&1
    */5 * * * * /usr/local/scripts/utilities/plugins/selfmon/MySQLMon/sevone.deferreddata.poll.mysql --api 127.0.0.1 --device-id "2" --database 127.0.0.1:3306 --object 'MySQL Data Database' --database-profile "data" >> /var/SevOne/MySQLMon.log 2>&1
    */5 * * * * php /usr/local/scripts/utilities/plugins/selfmon/PolldMon/deferreddata.polld.php --api 127.0.0.1 --device-id "2" --polld-object 'SevOne-polld performance' --highpolld-object 'SevOne-highpolld performance' >> /var/SevOne/PolldMon.log 2>&1
    */5 * * * * php /usr/local/scripts/utilities/plugins/selfmon/RedisMon/sevone.deferreddata.poll.redis --api 127.0.0.1 --device-id "2" --object 'Redis Instance' >> /var/SevOne/RedisMon.log 2>&1

Troubleshooting

I changed the SevOneStats password after enabling self-monitoring, and now it's not working.

We strongly advise against changing the password for the API user SevOneStats after you enable self-monitoring. However, if you have changed the password once self-monitoring was already in progress and you're experiencing problems, please contact the SevOne Support team for assistance.