Health Monitoring and Notification of Servers in a Cluster

".....This is bad. Why did the server hang? And on top of this, why did we come to know about this so late?" As Bob heard his boss say this, he knew what was about to come next. He would be told to open a case with WebLogic support, who would help them do a postmortem of why and how the server stopped responding in production. Probably he would need to go beyond it this time, and figure out a way to automatically check server status.

I've often come across situations in which customers want to monitor the health of the servers running in a cluster or, better still, to be notified should a problem arise. If you're running in a cluster, as most production systems do, failover takes place as soon as a server becomes unresponsive, and the other servers in the cluster continue to serve your applications' requests. But as an administrator you should be aware of the failure immediately so that you can take corrective measures. And what would happen if you had a stand-alone setup, say an Admin server with just one managed server? Or what if the Admin server itself became unresponsive? How would someone be notified when this happens?

The answers to these questions can be either external to WebLogic (relying mostly on an OS-level script) or internal to it (using WebLogic's own management facilities). The external solution is probably the simpler one: it uses the WebLogic ping utility, which reports success if it finds the server running and returns a java.net.ConnectException if it doesn't. Other than that, you rely on the OS script (and the network, of course!) to do the job for you. The internal solution delves deeper into the WebLogic core, using SNMP and MBeans, specifically the cluster runtime MBeans.

Let's dive into the simple solution right away. Our test case is this: we have x running server instances (they need not be in a cluster), and we want to be notified if any of them becomes unresponsive or goes down. All you need to do is carry out the following steps:

  • Set the environment so that the WebLogic-specific classes are in the classpath.
  • Pass the host:port of every running WebLogic Server instance to the weblogic.Admin ping utility.
  • Loop infinitely, pinging each instance (host:port) passed as input.
  • Define a time interval in the script so that, after pinging every instance passed to weblogic.Admin, the script sleeps for some time.
  • E-mail when a ConnectException is detected.

    A simplified version of this utility running on Bourne shell on Solaris 5.6 is given in Listing 1 (the code for this article is available online at www.sys-con.com/weblogic/sourcec.cfm).

    So all you might need to do is run the script in Listing 1 in the background, and it would keep you notified by e-mail should any server go down or become unresponsive.
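    The loop described above can be sketched in plain Java as well. This is only an illustration under stated assumptions, not Listing 1: it swaps the weblogic.Admin ping for a bare TCP connect (so it notices a refused connection, but not necessarily a hung server that still accepts sockets), and the class name PingSketch, the timeout, and the System.err alert that stands in for the e-mail step are all hypothetical.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a TCP connect stands in for the weblogic.Admin ping.
final class PingSketch {

    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers java.net.ConnectException (refused) as well as timeouts.
            return false;
        }
    }

    /** Makes one pass over all "host:port" targets and returns those that did not answer. */
    static List<String> findDown(List<String> targets, int timeoutMillis) {
        List<String> down = new ArrayList<>();
        for (String target : targets) {
            int colon = target.lastIndexOf(':');
            String host = target.substring(0, colon);
            int port = Integer.parseInt(target.substring(colon + 1));
            if (!isReachable(host, port, timeoutMillis)) {
                down.add(target);
            }
        }
        return down;
    }

    /** Repeats the pass 'rounds' times, sleeping between passes. */
    static void monitor(List<String> targets, long sleepMillis, int rounds)
            throws InterruptedException {
        for (int i = 0; i < rounds; i++) {
            for (String target : findDown(targets, 5_000)) {
                // Placeholder for the e-mail notification step.
                System.err.println("ALERT: no response from " + target);
            }
            Thread.sleep(sleepMillis);
        }
    }
}
```

    Calling PingSketch.monitor(targets, 60000, Integer.MAX_VALUE) approximates the "loop infinitely" step; the 60-second interval is an arbitrary choice.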

    Before we proceed with the MBean material, a brief outline of how WebLogic Server instances in a cluster detect failures of their peer server instances would be helpful. Some of the information given here is from the WebLogic documentation. For details, refer to the cluster-specific information at http://edocs.bea.com/wls/docs61/cluster/index.html.

    The instances monitor the:

  • Socket connections to a peer server
  • Regular server "heartbeat" messages

    Failure Detection Using IP Sockets
    WebLogic Servers monitor the use of IP sockets between peer server instances as an immediate method of detecting failures. If a server connects to one of its peers in a cluster and begins transmitting data over a socket, an unexpected closure of that socket causes the peer server to be marked as "failed," and its associated services are removed from the JNDI naming tree.

    The WebLogic Server "Heartbeat"
    If clustered server instances don't have open sockets for peer-to-peer communication, failed servers may also be detected via the WebLogic Server "heartbeat." All server instances in a cluster use multicast to broadcast regular server "heartbeat" messages to other members of the cluster. Each server heartbeat contains data that uniquely identifies the server that sends the message. Servers broadcast their heartbeat messages at regular intervals of 10 seconds. In turn, each server in a cluster monitors the multicast address to ensure that all peer servers' heartbeat messages are being sent.

    If a server monitoring the multicast address misses three heartbeats from a peer server (i.e., if it doesn't receive a heartbeat from the server for 30 seconds or longer), the monitoring server marks the peer server as "failed." It then updates its local JNDI tree, if necessary, to retract the services that were hosted on the failed server.

    In this way, servers can detect failures even if they have no sockets open for peer-to-peer communication. In our case, the AliveServerCount value reported by each server is the number of servers it currently considers active, that is, the servers still present in its cluster JNDI list.
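    The heartbeat arithmetic (a 10-second interval, three missed beats, hence a 30-second timeout) can be illustrated with a small sketch. It is not WebLogic's implementation: the HeartbeatTracker class is hypothetical, timestamps are passed in explicitly rather than read from multicast traffic, and it assumes that the local server's own heartbeat is recorded too, so that the count matches the three-server example used later.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of heartbeat-based failure detection.
final class HeartbeatTracker {
    static final long HEARTBEAT_INTERVAL_MS = 10_000; // interval stated in the article
    static final int MISSED_LIMIT = 3;                // missed beats before "failed"
    static final long FAILURE_TIMEOUT_MS = HEARTBEAT_INTERVAL_MS * MISSED_LIMIT;

    // Last time (in milliseconds) a heartbeat was seen from each server.
    private final Map<String, Long> lastSeen = new HashMap<>();

    /** Records a heartbeat received from the given server at the given time. */
    void recordHeartbeat(String server, long nowMillis) {
        lastSeen.put(server, nowMillis);
    }

    /** A server is "failed" if no heartbeat has arrived for 30 seconds or longer. */
    boolean isFailed(String server, long nowMillis) {
        Long seen = lastSeen.get(server);
        return seen == null || nowMillis - seen >= FAILURE_TIMEOUT_MS;
    }

    /** Counts the tracked servers that are not currently marked as failed. */
    int aliveServerCount(long nowMillis) {
        int alive = 0;
        for (String server : lastSeen.keySet()) {
            if (!isFailed(server, nowMillis)) {
                alive++;
            }
        }
        return alive;
    }
}
```

    Passing the clock in as a parameter keeps the sketch deterministic; a real monitor would read the system time instead.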

    The next solution (for non-Unix platforms, or if you don't want to use the shell script) is to use the WebLogic APIs to detect the condition and the JavaMail API to generate the notification. In other words, do something like this:

  • Generate a list of arguments passing the host:port, username, and password for the Admin server, the total number of servers participating in the cluster, and the delay interval acceptable for generating the e-mail. This list of arguments will be required by the Java program.
  • Get the Admin MBeanHome by passing these specific properties.
  • If the Admin home is not found, generate an appropriate error message.
  • Use the Admin MBeanHome to get the ClusterRunTime MBean and iterate through to get the server names.
  • Check to see if the total number of alive servers is less than the total number of servers passed as an argument to the Java program.
  • If the count goes below the one passed as an argument, formulate the appropriate string that would be passed as the e-mail content.
  • Generate an e-mail at the address specified for the SMTP (not to be confused with SNMP) Server.

    Listing 2 would be the meat of this solution.
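    The comparison and message-building steps from the list above can be sketched independently of the WebLogic and mail APIs. This is not Listing 2: the class name ClusterCheckSketch and the message wording are hypothetical, and the array of alive server names is assumed to have been read already from the cluster runtime MBean.

```java
// Hypothetical sketch: decides whether an alert is needed and builds its text.
final class ClusterCheckSketch {

    /**
     * Returns null when all expected servers are alive; otherwise returns the
     * text that would be used as the e-mail content.
     */
    static String buildAlert(int expectedServers, String[] aliveServerNames) {
        int alive = aliveServerNames.length;
        if (alive >= expectedServers) {
            return null; // nothing to report
        }
        return "Cluster alert: " + alive + " of " + expectedServers
                + " servers alive. Still running: "
                + String.join(", ", aliveServerNames);
    }
}
```

    Keeping this decision separate from the MBean lookup and the mail call makes it easy to check without a running server.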

    SNMP Model
    We now come to the last method that I'm proposing: using an SNMP model. WebLogic Server software includes the ability to communicate with enterprise-wide management systems using Simple Network Management Protocol (SNMP). The WebLogic Server SNMP capability enables you to integrate management of WebLogic Servers into an SNMP-compliant management system that gives you a single view of the various software and hardware resources of a complex, distributed system.

    The following definitions help us derive a practical scenario for cluster monitoring and have been partially derived from the WebLogic documentation.

    SNMP management is based on the agent/manager model described in the network management standards defined by the International Organization for Standardization (ISO). In this model, a network/systems manager exchanges monitoring and control information about system and network resources with distributed software processes called agents. In our case, the SNMP agent is the WebLogic Admin Server. As an example SNMP manager, I used a freely downloadable third-party tool called "Trap Receiver".

    Any system or network resource that is manageable through the exchange of information is a managed resource. This could be a software resource such as a Java Database Connectivity (JDBC) connection pool or a hardware resource such as a router. In our case, we are monitoring the ClusterRuntime.

    The underlying idea is that the WebLogic Admin server, which is acting as the SNMP agent, would act as a "collection device" that would gather and send us the information of the managed resource, i.e., the ClusterRuntime. This would be achieved by setting thresholds (referred to as Monitors) for any specific attribute for the ClusterRuntime. In our example we would monitor the AliveServerCount attribute for the ClusterRuntime. Say that the total servers running in a cluster is three; if any one of the servers becomes unresponsive, the AliveServerCount would decrease to two and a trap notification should be sent to the SNMP manager, which would then generate an e-mail.

    The Trap Receiver relies upon a database of definitions and information about the properties of managed resources and the services the agents support. This database makes up the Management Information Base (MIB). In our case, the MIB is available at WEBLOGIC_HOME/lib/BEA-WEBLOGIC-MIB.asn1 (see Figure 1). (For more information about SNMP management, visit http://e-docs.bea.com/wls/docs61/snmpman/index.html.)

    The following basic steps are required for setting up our Failure Notification Model:

  • Configuring the SNMP Agent
  • Configuring the SNMP Manager

    Configuring the SNMP Agent
    This would be the WebLogic Admin Server. We start by assuming that we are running a cluster of three managed servers (managedserver1, managedserver2, and managedserver3), all of them listening on different IP addresses and the same port. The steps in this process are:

  • Access the 6.1 WebLogic Admin browser console after starting your admin server.
  • Expand the SNMP node on the left-hand pane and click the Trap Destinations node.
  • Fill in the appropriate values (see Figure 2).
  • Click on the domain name on the left-hand pane and select the SNMP tab.
  • Make sure that the Enabled check box is checked.
  • Select the Trap Destination that you configured in the previous step as a target.
  • The default value for the MIB Data Refresh Interval is 120 seconds, and the lowest possible value is 30 seconds. The MIB Data Refresh Interval is the interval, in seconds, at which the SNMP agent does a complete refresh of its cache. This value ultimately determines the freshness of the data; in our case, the time since the number of active servers was last checked. Decreasing this value significantly might impact performance.
  • If you want to use the default trap that WebLogic Server generates when it goes down (OID 1.3.6.1.4.1.140.625.100.70, explained in more detail under "Configuring the SNMP Manager"), you need not follow any further steps for configuring the SNMP agent and can jump directly to "Configuring the SNMP Manager."
  • Expand the Monitors node on the left-hand pane.
  • Click on "Configure a new gauge monitor" in the right-hand pane.
  • Fill in the values shown in Figure 3.

    The definition of each value can be seen by clicking on the '?' next to each parameter. In our case, we would create three such monitors: MyGaugeMonitor1, MyGaugeMonitor2, and MyGaugeMonitor3. For each one, replace the Monitored MBean Name with the name of the corresponding managed server. The idea is that the SNMP agent will generate a trap whenever the value of AliveServerCount for any server drops to two or below. It will also generate a trap when AliveServerCount rises to three or more, but in this specific case that information won't be useful. You will need to apply the changes and select the Servers tab to target the respective servers.
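    The threshold behavior described above can be illustrated with a small sketch. It is not the WebLogic gauge monitor itself: the GaugeSketch class is hypothetical, and it assumes that one trap is raised per threshold crossing rather than on every sample.

```java
// Hypothetical sketch of a gauge with a high and a low threshold.
final class GaugeSketch {
    enum Event { NONE, LOW, HIGH }

    private final int high;
    private final int low;
    private Event last = Event.NONE; // last threshold event that was raised

    GaugeSketch(int high, int low) {
        if (low > high) {
            throw new IllegalArgumentException("low threshold exceeds high threshold");
        }
        this.high = high;
        this.low = low;
    }

    /** Feeds one sampled value; returns the event to raise, or NONE. */
    Event observe(int value) {
        Event current = value >= high ? Event.HIGH
                : value <= low ? Event.LOW
                : Event.NONE;
        if (current == Event.NONE || current == last) {
            return Event.NONE; // between thresholds, or already reported
        }
        last = current;
        return current;
    }
}
```

    With new GaugeSketch(3, 2), a drop of AliveServerCount from 3 to 2 yields LOW once, and the return to 3 yields HIGH, which re-arms the low threshold.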

    Configuring the SNMP Manager
    As I mentioned earlier, I used a third-party tool, freely downloadable online from www.ncomtech.com/download.htm. I selected "Trap Receiver for NT/2000", as I wanted this installed on my local Windows 2000 box. Detailed help for the various attributes is available at www.ncomtech.com/trmanual.html. The following steps are required to quickly configure the SNMP manager after installation:

  • Select the MIBs tab, hit Load, and select the location of WebLogic-specific MIBs, i.e., WEBLOGIC_HOME/lib/BEA-WEBLOGIC-MIB.asn1.
  • Select the Actions tab, hit Add, and select the Varbind OID from the Watch drop-down.
  • Fill in the actual MIB value for the trap in the Equals column. In our case it would be 1.3.6.1.4.1.140.625.100.75. Here 1.3.6.1.4.1.140.625 is the Enterprise vendor identification (OID) for the WLS6.1 instance we are using. The value of 75 is meant for the Monitor Trap we are interested in. We want to generate an e-mail when this value is reached so select the checkbox for the e-mail option.
  • It is worth mentioning that if we want notifications generated from the default WebLogic traps, we don't need to create a separate monitor. Some predefined WebLogic SNMP traps are generated automatically. For example, the server startup trap is 65 and the shutdown trap is 70, so the full Varbind OIDs would be 1.3.6.1.4.1.140.625.100.65 and 1.3.6.1.4.1.140.625.100.70, respectively. Hence, if we don't want to go through the process of creating a separate monitor (as in Figure 3 for MyGaugeMonitor1), all we need to do is enter 1.3.6.1.4.1.140.625.100.70 instead of 1.3.6.1.4.1.140.625.100.75 in the previous step. I found that the trap notification was almost instantaneous for a simulated hang when using the custom MyGaugeMonitor1, whereas it took a little while before the shutdown trap was generated. The disadvantage of the custom monitors, however, is that the number of e-mails generated when the trap condition is reached equals the number of servers for which a monitor has been created minus the server that hung (in our case, 3 - 1 = 2).
  • We could have used a counter monitor here as well, which requires only one threshold. But a counter monitor fires only when the value equals or exceeds the threshold. The advantage of the gauge monitor is that it is reset once the stopped server is restarted and the high value is reached.
  • The last thing to be done is to configure the e-mail option. Select the e-mail tab and fill in the appropriate values for your SMTP server. For the message box, you might want to give something like "Startup/Shutdown/Hang. A trap from %SENDERIP% of type %GENERICTYPE%/%SPECIFICTYPE% was received".
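    The way the Varbind OIDs above are composed can be sketched as follows. The enterprise prefix and the trap numbers 65, 70, and 75 come from the text; the class name TrapOidSketch and the description strings are hypothetical, and the Trap Receiver tool does this matching through its own configuration rather than through code like this.

```java
// Hypothetical sketch: builds and classifies the trap OIDs quoted in the text.
final class TrapOidSketch {
    static final String ENTERPRISE_OID = "1.3.6.1.4.1.140.625";
    static final String TRAP_PREFIX = ENTERPRISE_OID + ".100.";

    /** Builds the full Varbind OID for a trap number such as 65, 70, or 75. */
    static String trapOid(int trapNumber) {
        return TRAP_PREFIX + trapNumber;
    }

    /** Maps a full Varbind OID back to the kind of trap it stands for. */
    static String describe(String oid) {
        if (!oid.startsWith(TRAP_PREFIX)) {
            return "unknown";
        }
        switch (oid.substring(TRAP_PREFIX.length())) {
            case "65": return "server startup";
            case "70": return "server shutdown";
            case "75": return "monitor";
            default:   return "unknown";
        }
    }
}
```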

    Testing the SNMP Setup
    We're now all set. When all three servers are up and running, the high value of the gauge is reached. This generates an e-mail notification, but that one is not important to us. What we want is to be notified if a server becomes unresponsive or goes down. Bringing down one managed server instance makes the AliveServerCount drop to 2, which generates the trap so that the e-mail notification can be sent. If you had not created the custom monitor and had used the default OID instead, you would still be notified. The purpose of using the custom monitor is to illustrate the usage of MBeans and monitors.

    Summary
    All of the options given here have different merits. The method using the WebLogic ping utility adds no overhead to WebLogic Server itself, whereas using SNMP traps could have a slight performance impact if the sampling period is reduced. On the other hand, the shell script used as is (see Listing 1) suits only Unix platforms, does not give you real-time information, and consumes its own share of CPU. The emailServersRunningInCluster.java program (see Listing 2) uses WebLogic-specific APIs and can be set up as a cron job on Unix platforms if the sleep is removed from the code. Because it's pure Java, you can run it from anywhere, provided the WebLogic-specific classes are present in the classpath, but it again consumes its own share of CPU. The third option, using the SNMP manager, can be used to catch other traps in addition to tracking the AliveServerCount, by tweaking the OIDs and creating separate monitors. The Java program, by contrast, could not be reused as is; monitoring other MBeans would require major code changes.

    A sequel to all this could be to find the Java process ID of the server that becomes unresponsive, as indicated by either a ConnectException or a drop in AliveServerCount, and to automate the script to do a kill -3 on Unix-based platforms to get the thread dumps and send them to support for analysis.

  • More Stories By Apurb Kumar

    Apurb Kumar is a developer relations engineer in Backline WebLogic Support at BEA Systems. He has more than 10 years of experience, starting with real-time programming, moving on to databases, and finally Java development. Before moving to BEA Systems, Apurb consulted for companies such as Charles Schwab, AllAdvantage.com, and Holland Systems.
