Health Monitoring and Notification of Servers in a Cluster

".....This is bad. Why did the server hang? And on top of this, why did we come to know about this so late?" As Bob heard his boss say this, he knew what was about to come next. He would be told to open a case with WebLogic support, who would help them do a postmortem of why and how the server stopped responding in production. Probably he would need to go beyond it this time, and figure out a way to automatically check server status.

I've often come across situations in which customers want to monitor the health of the servers running in a cluster or, better still, to be notified when a problem arises. If you're running in a cluster, as most production systems do, failover takes place as soon as a server becomes unresponsive, and the other servers continue to serve your applications' requests. As an administrator, however, you should learn of the failure immediately and take corrective measures. And what would happen if you had a stand-alone server, say an Admin server with just one managed server? Or what if the Admin server itself becomes unresponsive? How would someone be notified when this happens?

The answers to these questions can be either external to WebLogic (meaning that you rely mainly on tools outside WebLogic) or internal to it. The external solution is probably the simpler one, as it makes use of a WebLogic ping utility that reports success if it finds the server running and returns a java.net.ConnectException if it doesn't. Other than that, you pretty much rely on an OS script (and the network, of course!) to do the job for you. The internal solution delves deeper into the WebLogic core, using SNMP and MBeans, specifically the ClusterMBeans.

Let's dive into the simple solution right away. Our test case: we have x running server instances (they need not be in a cluster), and we want to be notified if any of them becomes unresponsive or goes down. All you need to do is carry out the following simple steps:

  • Set the environment so that the WebLogic-specific classes are in the classpath.
  • Pass the host:port of every running WebLogic Server instance to the weblogic.Admin ping utility.
  • Loop indefinitely, pinging each server instance (host:port) passed as input.
  • Define a time interval in the script so that, after pinging every instance, the script sleeps for some time before the next round.
  • E-mail when a ConnectException is detected.

    A simplified version of this utility running on Bourne shell on Solaris 5.6 is given in Listing 1 (the code for this article is available online at www.sys-con.com/weblogic/sourcec.cfm).

    So all you might need to do is run the script in Listing 1 in the background, and it would keep you notified by e-mail should any server go down or become unresponsive.
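The polling loop described in the steps above can also be sketched in Java. This is only an illustration and not Listing 1: the class name ServerPoller, the 60-second interval, and the 5-second timeout are assumptions, and a plain TCP connect stands in for the weblogic.Admin ping, so it only shows that the port accepts connections. The sketch prints where the real script would send an e-mail.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Polls a list of host:port targets and reports the ones that do not answer. */
final class ServerPoller {

    /** Splits "host:port" into an unresolved socket address. */
    static InetSocketAddress parseTarget(String target) {
        int colon = target.lastIndexOf(':');
        if (colon <= 0 || colon == target.length() - 1) {
            throw new IllegalArgumentException("expected host:port, got " + target);
        }
        String host = target.substring(0, colon);
        int port = Integer.parseInt(target.substring(colon + 1));
        return InetSocketAddress.createUnresolved(host, port);
    }

    /** Stand-in probe: a plain TCP connect, NOT the weblogic.Admin ping. */
    static boolean tcpProbe(InetSocketAddress target, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(
                    new InetSocketAddress(target.getHostString(), target.getPort()),
                    timeoutMillis);
            return true;
        } catch (IOException e) {   // covers java.net.ConnectException and timeouts
            return false;
        }
    }

    /** Returns the targets for which the probe reports failure. */
    static List<String> findDown(List<String> targets,
                                 Predicate<InetSocketAddress> probe) {
        List<String> down = new ArrayList<>();
        for (String target : targets) {
            if (!probe.test(parseTarget(target))) {
                down.add(target);
            }
        }
        return down;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> targets = List.of(args);   // e.g. host1:7001 host2:7001
        long sleepMillis = 60_000;              // assumed ping interval
        while (true) {                          // loop forever, as in the steps
            List<String> down = findDown(targets, t -> tcpProbe(t, 5_000));
            if (!down.isEmpty()) {
                // The real script would send an e-mail here; this sketch prints.
                System.out.println("Not responding: " + down);
            }
            Thread.sleep(sleepMillis);
        }
    }
}
```

Passing the probe as a Predicate keeps the loop independent of how a server is actually checked, so the TCP connect could later be swapped for a call to the real ping utility.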

    Before we proceed with the MBean material, a brief outline of how WebLogic Server instances in a cluster detect failures of their peer server instances would be helpful. Some of the information given here is from the WebLogic documentation. For details, refer to the cluster-specific information at http://edocs.bea.com/wls/docs61/cluster/index.html.

    The instances monitor two things:

  • Socket connections to a peer server
  • Regular server "heartbeat" messages

    Failure Detection Using IP Sockets
    WebLogic Servers monitor the use of IP sockets between peer server instances as an immediate method of detecting failures. If a server connects to one of its peers in a cluster and begins transmitting data over a socket, an unexpected closure of that socket causes the peer server to be marked as "failed," and its associated services are removed from the JNDI naming tree.

    The WebLogic Server "Heartbeat"
    If clustered server instances don't have open sockets for peer-to-peer communication, failed servers may also be detected via the WebLogic Server "heartbeat." All server instances in a cluster use multicast to broadcast regular server "heartbeat" messages to other members of the cluster. Each server heartbeat contains data that uniquely identifies the server that sends the message. Servers broadcast their heartbeat messages at regular intervals of 10 seconds. In turn, each server in a cluster monitors the multicast address to ensure that all peer servers' heartbeat messages are being sent.

    If a server monitoring the multicast address misses three heartbeats from a peer server (i.e., if it doesn't receive a heartbeat from the server for 30 seconds or longer), the monitoring server marks the peer server as "failed." It then updates its local JNDI tree, if necessary, to retract the services that were hosted on the failed server.

    In this way, servers can detect failures even if they have no sockets open for peer-to-peer communication. In our case, the AliveServerCount value for each server would be the updated number of active servers that are still in that server's JNDI list of cluster members.
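The heartbeat rule (a heartbeat every 10 seconds, a peer marked "failed" after 30 seconds of silence) can be modeled in a few lines. This is a toy sketch of the timeout arithmetic only. It is not WebLogic's implementation and involves no multicast; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of heartbeat-based failure detection; not WebLogic's code. */
final class HeartbeatTracker {
    static final long HEARTBEAT_INTERVAL_MS = 10_000;  // from the text: every 10 s
    static final int MISSED_BEFORE_FAILED = 3;         // from the text: three misses
    static final long FAILURE_TIMEOUT_MS =
            HEARTBEAT_INTERVAL_MS * MISSED_BEFORE_FAILED;

    private final Map<String, Long> lastSeen = new HashMap<>();

    /** Records that a heartbeat from the named peer arrived at the given time. */
    void onHeartbeat(String server, long timeMillis) {
        lastSeen.put(server, timeMillis);
    }

    /** Peers whose last heartbeat is 30 seconds or more old at nowMillis. */
    Set<String> failedPeers(long nowMillis) {
        Set<String> failed = new TreeSet<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (nowMillis - e.getValue() >= FAILURE_TIMEOUT_MS) {
                failed.add(e.getKey());
            }
        }
        return failed;
    }

    /** Number of known peers still considered alive at nowMillis. */
    int aliveCount(long nowMillis) {
        return lastSeen.size() - failedPeers(nowMillis).size();
    }

    public static void main(String[] args) {
        HeartbeatTracker tracker = new HeartbeatTracker();
        tracker.onHeartbeat("managedserver1", 0);
        tracker.onHeartbeat("managedserver2", 0);
        tracker.onHeartbeat("managedserver1", 10_000);
        tracker.onHeartbeat("managedserver1", 20_000);
        tracker.onHeartbeat("managedserver1", 30_000);
        // managedserver2 has now been silent for 30 seconds.
        System.out.println(tracker.failedPeers(30_000));  // prints [managedserver2]
        System.out.println(tracker.aliveCount(30_000));   // prints 1
    }
}
```

The alive count here plays the role that AliveServerCount plays in the rest of the article: it drops as soon as a peer's silence reaches the timeout.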

    The next solution (if you don't want to use the shell script, or for non-Unix platforms) is to use the WebLogic APIs to detect the failure and the JavaMail API to generate the notification. In other words, do something like this:

  • Build the list of arguments that the Java program requires: the host:port, username, and password for the Admin server, the total number of servers participating in the cluster, and the acceptable delay interval for generating the e-mail.
  • Get the Admin MBeanHome by passing these specific properties.
  • If the Admin home is not found, generate an appropriate error message.
  • Use the Admin MBeanHome to get the ClusterRuntime MBean and iterate through it to get the server names.
  • Check whether the total number of alive servers is less than the total number of servers passed as an argument to the Java program.
  • If the count has dropped below the one passed as an argument, formulate the string that will be used as the e-mail content.
  • Send an e-mail to the specified address through the SMTP (not to be confused with SNMP) server.

    Listing 2 would be the meat of this solution.
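The comparison and message-building steps can be sketched on their own, without WebLogic on the classpath. This is not Listing 2: the class ClusterAlert, its method, and the message wording are hypothetical. In the real program the alive count and server names would be read from the ClusterRuntime MBean, and the body would be handed to the JavaMail API.

```java
import java.util.List;
import java.util.Optional;

/** Decides whether an alert e-mail is due and, if so, composes its body. */
final class ClusterAlert {

    /**
     * @param expectedServers number of servers passed on the command line
     * @param aliveCount      value of AliveServerCount (supplied by the caller)
     * @param aliveNames      server names still reported as alive
     * @return the e-mail body, or empty if every expected server is alive
     */
    static Optional<String> composeAlert(int expectedServers, int aliveCount,
                                         List<String> aliveNames) {
        if (aliveCount >= expectedServers) {
            return Optional.empty();            // nothing to report
        }
        int missing = expectedServers - aliveCount;
        String body = missing + " of " + expectedServers
                + " clustered servers not responding. Still alive: "
                + String.join(", ", aliveNames);
        return Optional.of(body);
    }

    public static void main(String[] args) {
        // Hard-coded sample values; the real program would read them from the MBean.
        Optional<String> alert =
                composeAlert(3, 2, List.of("managedserver1", "managedserver3"));
        // The real program would e-mail the body; this sketch prints it.
        alert.ifPresent(System.out::println);
    }
}
```

Keeping the decision separate from the MBean lookup and the mail call makes it straightforward to check without a running domain.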

    SNMP Model
    We now come to the last method that I'm proposing: using an SNMP model. WebLogic Server software includes the ability to communicate with enterprise-wide management systems using Simple Network Management Protocol (SNMP). The WebLogic Server SNMP capability enables you to integrate management of WebLogic Servers into an SNMP-compliant management system that gives you a single view of the various software and hardware resources of a complex, distributed system.

    The following definitions help us derive a practical scenario for cluster monitoring and have been partially derived from the WebLogic documentation.

    SNMP management is based on the agent/manager model described in the network management standards defined by the International Organization for Standardization (ISO). In this model, a network/systems manager exchanges monitoring and control information about system and network resources with distributed software processes called agents. In our case, the SNMP agent is the WebLogic Admin Server. As an example SNMP manager, I used a freely downloadable third-party tool called "Trap Receiver".

    Any system or network resource that is manageable through the exchange of information is a managed resource. This could be a software resource such as a Java Database Connectivity (JDBC) connection pool or a hardware resource such as a router. In our case, we are monitoring the ClusterRuntime.

    The underlying idea is that the WebLogic Admin server, acting as the SNMP agent, serves as a "collection device" that gathers and sends us information about the managed resource, i.e., the ClusterRuntime. This is achieved by setting thresholds (referred to as Monitors) on a specific attribute of the ClusterRuntime. In our example we monitor the AliveServerCount attribute. Say that three servers are running in a cluster; if any one of them becomes unresponsive, the AliveServerCount decreases to two and a trap notification is sent to the SNMP manager, which then generates an e-mail.

    The Trap Receiver relies upon a database of definitions and information about the properties of managed resources and the services the agents support - this makes up the Management Information Base (MIB). In our case, the MIB is available at WEBLOGIC_HOME/lib/BEA-WEBLOGIC-MIB.asn1 (see Figure 1). (For more information about SNMP management, visit http://e-docs.bea.com/wls/docs61/snmpman/index.html.)

    The following basic steps are required for setting up our Failure Notification Model:

  • Configuring the SNMP Agent
  • Configuring the SNMP Manager

    Configuring the SNMP Agent
    This would be the WebLogic Admin Server. We start by assuming that we are running a cluster of three managed servers - managedserver1, managedserver2, managedserver3 - all of them listening on different IP addresses and the same port. The steps in this process are:

  • Access the 6.1 WebLogic Admin browser console after starting your admin server.
  • Expand the SNMP node and click the Trap Destinations node on the left-hand pane.
  • Fill in the appropriate values (see Figure 2).
  • Click on the domain name on the left-hand pane and select the SNMP tab.
  • Make sure that the Enabled check box is checked.
  • Select the Trap Destination that you configured in the previous step as a target.
  • The default value for MIB Data Refresh Interval is 120 seconds, and the lowest possible value is 30 seconds. The MIB Data Refresh Interval is the interval, in seconds, at which the SNMP agent does a complete refresh of the cache. This value ultimately determines the freshness of the data; in our case, the time elapsed since the number of active servers was last checked. Decreasing this value significantly might impact performance.
  • If you want to use the default trap that WebLogic Server generates when it goes down (its OID is explained in the "Configuring the SNMP Manager" section), you need not follow any further steps for configuring the SNMP agent and can jump directly to "Configuring the SNMP Manager."
  • Expand the Monitors node on the left-hand pane.
  • Click on "Configure a new gauge monitor" in the right-hand pane.
  • Fill in the values shown in Figure 3.

    The definition of each value can be seen by clicking the '?' next to each parameter. In our case, we would create three such monitors: MyGaugeMonitor1, MyGaugeMonitor2, and MyGaugeMonitor3. For each one, replace the Monitored MBean Name with the name of the respective managed server. The idea is that the SNMP agent will generate a trap whenever the value of the AliveServerCount for any server drops to two or below. It will also generate a trap when the AliveServerCount reaches three or more, but in this specific case that information won't be useful. You will need to apply the changes and select the Servers tab to target the respective servers.
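The two-threshold behavior described above can be illustrated with a small model. This is a sketch under assumptions: SimpleGauge is a hypothetical class, not the WebLogic monitor or the JMX GaugeMonitor, and the thresholds of three (high) and two (low) are taken from the text rather than from Figure 3. It assumes the usual gauge semantics, in which an event fires once when a threshold is reached and is re-armed only after the opposite threshold has been reached.

```java
/** Toy gauge monitor with a high and a low threshold; not the WebLogic class. */
final class SimpleGauge {
    enum Event { NONE, HIGH_REACHED, LOW_REACHED }

    private final int high;
    private final int low;
    private boolean highArmed = true;   // may a HIGH_REACHED event still fire?
    private boolean lowArmed = true;    // may a LOW_REACHED event still fire?

    SimpleGauge(int high, int low) {
        if (low >= high) {
            throw new IllegalArgumentException("low must be below high");
        }
        this.high = high;
        this.low = low;
    }

    /** Feeds one observed value and returns the event it triggers, if any. */
    Event sample(int value) {
        if (value >= high && highArmed) {
            highArmed = false;          // suppress repeats until low is reached
            lowArmed = true;
            return Event.HIGH_REACHED;
        }
        if (value <= low && lowArmed) {
            lowArmed = false;           // suppress repeats until high is reached
            highArmed = true;
            return Event.LOW_REACHED;
        }
        return Event.NONE;
    }

    public static void main(String[] args) {
        SimpleGauge gauge = new SimpleGauge(3, 2);
        int[] aliveServerCounts = {3, 3, 2, 2, 3};  // one server stops, then restarts
        for (int count : aliveServerCounts) {
            System.out.println(count + " -> " + gauge.sample(count));
        }
    }
}
```

In this model the LOW_REACHED event corresponds to the trap we care about, and the HIGH_REACHED event corresponds to the trap that is generated but not useful in this scenario.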

    Configuring the SNMP Manager
    As I mentioned earlier, I used a third-party tool, freely downloadable online from www.ncomtech.com/download.htm. I selected "Trap Receiver for NT/2000", as I wanted this installed on my local Windows 2000 box. Detailed help for the various attributes is available at www.ncomtech.com/trmanual.html. The following steps are required to quickly configure the SNMP manager after installation:

  • Select the MIBs tab, hit Load, and select the location of WebLogic-specific MIBs, i.e., WEBLOGIC_HOME/lib/BEA-WEBLOGIC-MIB.asn1.
  • Select the Actions tab, hit Add, and select the Varbind OID from the Watch drop-down.
  • Fill in the actual MIB value for the trap in the Equals column. In our case this is the enterprise vendor identification (OID) for the WLS 6.1 instance we are using, in which the value 75 denotes the Monitor Trap we are interested in. We want to generate an e-mail when this value is reached, so select the checkbox for the e-mail option.
  • It is worth mentioning that if we had wanted notifications based on the default WebLogic traps, we would not need to create a separate monitor. Some predefined WebLogic SNMP traps are generated automatically: for example, the server startup trap is 65 and the shutdown trap is 70, and the full Varbind OID changes accordingly. Hence, if we don't want to go through the process of creating a separate monitor (MyGaugeMonitor1 in Figure 3), all we need to do is fill in the OID of the default trap instead of the monitor trap OID in the previous step. I found that the trap notification was almost instantaneous for a simulated hang when using the custom MyGaugeMonitor1, whereas it took a little while before the shutdown trap was generated. The disadvantage of the custom monitor, however, is that the number of e-mails generated when the trap condition is reached equals the number of servers for which a monitor has been created minus the server that hung (in our case, 3 - 1 = 2).
  • We could have used a counter monitor here as well, for which only one threshold is required. But a counter monitor fires only when the value equals or exceeds the threshold. The advantage of the gauge monitor is that it is reset once the stopped server is restarted and the high value is reached again.
  • The last thing to be done is to configure the e-mail option. Select the e-mail tab and fill in the appropriate values for your SMTP server. For the message box, you might want to give something like "Startup/Shutdown/Hang. A trap from %SENDERIP% of type %GENERICTYPE%/%SPECIFICTYPE% was received".

    Testing the SNMP Setup
    We're now all set. When all three servers are up and running, the high value of the gauge is reached. This generates an e-mail notification, which is not important to us. What we want is to be notified if a server becomes unresponsive or goes down. Bringing down one managed server instance makes the AliveServerCount drop to 2, which generates the trap so that the e-mail notification can be sent. If you had skipped the custom monitor and used the default OID, you would still be notified. The purpose of using the custom monitor is to illustrate the usage of MBeans and monitors.

    All of the options given here have different merits. The method using the WebLogic ping utility adds no overhead to WebLogic Server itself, whereas the SNMP traps could have a slight performance impact if the sampling period is reduced. On the other hand, the shell script of the first method (see Listing 1) runs as is only on Unix platforms, does not give you real-time information, and consumes its own share of CPU. The emailServersRunningInCluster.java program (see Listing 2) uses WebLogic-specific APIs and can be set up as a cron job on Unix platforms if the sleep is removed from the code. As it's pure Java, however, you can run it from anywhere, provided the WebLogic-specific classes are in the classpath. It, too, consumes its own share of CPU. The third option, using the SNMP manager, can catch other traps in addition to tracking the AliveServerCount, by tweaking the OIDs and creating separate monitors. The Java program, by contrast, could not be reused as is: monitoring other MBeans would require major code changes.

    A sequel to all this could be to find the Java process ID of the server that becomes unresponsive, as indicated by either a ConnectException or a drop in AliveServerCount, and automate the script to do a kill -3 on Unix-based platforms to get thread dumps and send them to support for analysis.
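As a rough sketch of that last idea, the kill -3 step could be wrapped as follows. The class ThreadDumpTrigger is hypothetical, and how the process ID of the hung server is found is left open here. Note that the thread dump is written to the target JVM's own standard output, not returned to the caller, so collecting it still means reading the server's output log.

```java
import java.io.IOException;
import java.util.List;

/** Sends SIGQUIT (kill -3) to a JVM so that it prints a thread dump. Unix only. */
final class ThreadDumpTrigger {

    /** Builds the command line; the pid must come from elsewhere. */
    static List<String> threadDumpCommand(long pid) {
        if (pid <= 0) {
            throw new IllegalArgumentException("pid must be positive");
        }
        return List.of("kill", "-3", Long.toString(pid));
    }

    /** Runs the command and returns its exit code (0 if the signal was sent). */
    static int requestThreadDump(long pid) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(threadDumpCommand(pid))
                .inheritIO()
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        long pid = Long.parseLong(args[0]);   // pid of the unresponsive server
        System.exit(requestThreadDump(pid));
    }
}
```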

    Apurb Kumar is a developer relations engineer in Backline WebLogic Support at BEA Systems. He has more than 10 years of experience, starting with real-time programming, moving on to databases, and finally Java development. Before moving to BEA Systems, Apurb consulted for companies such as Charles Schwab, AllAdvantage.com, and Holland Systems.
