Health Monitoring and Notification of Servers in a Cluster

"...This is bad. Why did the server hang? And on top of this, why did we come to know about this so late?" As Bob heard his boss say this, he knew what was coming next. He would be told to open a case with WebLogic support, who would help them do a postmortem of why and how the server stopped responding in production. This time he would probably need to go further, and figure out a way to check server status automatically.

I've often come across situations in which customers want to monitor the health of the servers running in a cluster or, better still, to be notified when a problem arises. If you're running in a cluster, as most production systems do, failover takes place right away, as soon as a server becomes unresponsive. As far as your applications go, other servers would serve the requests. But as an administrator you should be aware of this immediately and take corrective measures. And what would happen if you had a stand-alone server, say an Admin server with just one managed server? Or what if the Admin server itself becomes unresponsive? How would someone be notified when this happens?

The answers to these questions can be either external to WebLogic (meaning that you rely mostly on tools outside the server) or internal to it. The external solution is probably the simpler one: it uses a WebLogic ping utility that succeeds if it finds the server running and fails with a java.net.ConnectException if it doesn't. Other than that, you pretty much rely on an OS script (and the network, of course!) to do the job for you. The internal solution delves deeper into the WebLogic core, using SNMP and MBeans, specifically the ClusterMBeans.

Let's dive into the simple solution right away. Our test case is this: we have x running server instances (they need not be in a cluster), and we want to be notified if any of them becomes unresponsive or goes down. All you need to do is execute the following simple steps:

  • Set the environment so that the WebLogic-specific classes are in the classpath.
  • Pass the host:port of every running WebLogic Server instance to the weblogic.Admin ping utility.
  • Loop infinitely, pinging each host:port passed as input.
  • Define a time interval in the script so that, after pinging every instance, the script sleeps for some time before the next round.
  • Send an e-mail when a ConnectException is detected.

    A simplified version of this utility running on Bourne shell on Solaris 5.6 is given in Listing 1 (the code for this article is available online at www.sys-con.com/weblogic/sourcec.cfm).

    So all you might need to do is run the script in Listing 1 in the background, and it would keep you notified by e-mail should any server go down or become unresponsive.
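The steps above can be sketched as follows. This is not Listing 1: it is a minimal Python illustration that assumes a plain TCP connection attempt as a stand-in for the weblogic.Admin ping, and the names `parse_endpoints`, `is_reachable`, `poll_once`, and `monitor` are hypothetical. A TCP connect is a weaker check than the real ping, since a hung server may still accept connections.

```python
import socket
import time
from typing import Callable, List, Tuple

Endpoint = Tuple[str, int]


def parse_endpoints(spec: str) -> List[Endpoint]:
    """Turn 'hostA:7001, hostB:7002' into [('hostA', 7001), ('hostB', 7002)]."""
    endpoints = []
    for item in spec.split(","):
        host, _, port = item.strip().rpartition(":")
        endpoints.append((host, int(port)))
    return endpoints


def is_reachable(endpoint: Endpoint, timeout: float = 5.0) -> bool:
    """Assumed stand-in for the ping: True if a TCP connection can be opened."""
    try:
        with socket.create_connection(endpoint, timeout=timeout):
            return True
    except OSError:  # refused connections, timeouts, and DNS failures
        return False


def poll_once(
    endpoints: List[Endpoint],
    notify: Callable[[str], None],
    probe: Callable[[Endpoint], bool] = is_reachable,
) -> List[Endpoint]:
    """Probe every endpoint once and call notify() for each one that is down."""
    down = [ep for ep in endpoints if not probe(ep)]
    for host, port in down:
        notify(f"Server {host}:{port} is not responding")
    return down


def monitor(
    endpoints: List[Endpoint],
    notify: Callable[[str], None],
    interval_seconds: float = 60.0,
) -> None:
    """Loop forever, sleeping between rounds, as in the steps above."""
    while True:
        poll_once(endpoints, notify)
        time.sleep(interval_seconds)
```

Here `notify` can be any callable, for example a function that sends an e-mail; passing `print` is enough for a dry run.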

    Before we proceed with the MBean material, a brief outline of how WebLogic Server instances in a cluster detect failures of their peer server instances would be helpful. Some of the information given here is from the WebLogic documentation. For details, refer to the cluster-specific information at http://edocs.bea.com/wls/docs61/cluster/index.html.

    The instances monitor the:

  • Socket connections to a peer server
  • Regular server "heartbeat" messages

    Failure Detection Using IP Sockets
    WebLogic Servers monitor the use of IP sockets between peer server instances as an immediate method of detecting failures. If a server connects to one of its peers in a cluster and begins transmitting data over a socket, an unexpected closure of that socket causes the peer server to be marked as "failed," and its associated services are removed from the JNDI naming tree.

    The WebLogic Server "Heartbeat"
    If clustered server instances don't have opened sockets for peer-to-peer communication, failed servers may also be detected via the WebLogic Server "heartbeat." All server instances in a cluster use multicast to broadcast regular server "heartbeat" messages to other members of the cluster. Each server heartbeat contains data that uniquely identifies the server that sends the message. Servers broadcast their heartbeat messages at regular intervals of 10 seconds. In turn, each server in a cluster monitors the multicast address to ensure that all peer servers' heartbeat messages are being sent.

    If a server monitoring the multicast address misses three heartbeats from a peer server (i.e., if it doesn't receive a heartbeat from the server for 30 seconds or longer), the monitoring server marks the peer server as "failed." It then updates its local JNDI tree, if necessary, to retract the services that were hosted on the failed server.
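The heartbeat rule described above (a 10-second interval, with a peer marked as failed after three missed heartbeats) can be illustrated with a small sketch. This is not WebLogic's implementation; `HeartbeatTracker` and its methods are hypothetical names, and timestamps are passed in explicitly so the logic is easy to follow.

```python
from typing import Dict, List

HEARTBEAT_INTERVAL_SECONDS = 10  # interval quoted in the text
MISSED_HEARTBEATS_ALLOWED = 3    # missed heartbeats before a peer is "failed"


class HeartbeatTracker:
    """Tracks the last heartbeat time per peer (illustrative only)."""

    def __init__(
        self,
        interval: float = HEARTBEAT_INTERVAL_SECONDS,
        missed: int = MISSED_HEARTBEATS_ALLOWED,
    ) -> None:
        self.timeout = interval * missed  # 30 seconds with the defaults
        self.last_seen: Dict[str, float] = {}

    def record(self, server: str, now: float) -> None:
        """Note that a heartbeat from `server` arrived at time `now`."""
        self.last_seen[server] = now

    def failed_servers(self, now: float) -> List[str]:
        """Peers silent for the timeout or longer, sorted by name."""
        return sorted(
            name
            for name, seen in self.last_seen.items()
            if now - seen >= self.timeout
        )
```

With the defaults, a peer last heard from at time 0 is still considered alive at 29 seconds and is reported as failed from 30 seconds onward.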

    In this way, servers can detect failures even if they have no sockets open for peer-to-peer communication. In our case, the AliveServerCount value reported by each server would be the updated number of active servers, that is, the servers still present in its view of the cluster JNDI tree.

    The next solution (if you don't want to use the shell script, or for non-Unix platforms) is to use the WebLogic APIs to detect the condition and the JavaMail API to generate the notification. In other words, do something like this:

  • Build the list of arguments required by the Java program: the host:port, username, and password for the Admin server, the total number of servers participating in the cluster, and the acceptable delay interval for generating the e-mail.
  • Get the Admin MBeanHome by passing these properties.
  • If the Admin home is not found, generate an appropriate error message.
  • Use the Admin MBeanHome to get the ClusterRuntime MBean and iterate through it to get the server names.
  • Check whether the number of alive servers is less than the total number of servers passed as an argument to the Java program.
  • If the count has dropped below that number, compose the string that will be sent as the e-mail content.
  • Send the e-mail to the specified address through the SMTP (not to be confused with SNMP) server.

    Listing 2 would be the meat of this solution.
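The comparison and notification steps can be sketched independently of the WebLogic APIs. The following is not Listing 2: it is a minimal Python illustration that assumes the list of alive server names has already been obtained from the ClusterRuntime MBean, and `build_alert` and `send_alert` are hypothetical helper names.

```python
import smtplib
from email.message import EmailMessage
from typing import List, Optional


def build_alert(
    expected_count: int,
    alive_servers: List[str],
    sender: str,
    recipient: str,
) -> Optional[EmailMessage]:
    """Return an e-mail if fewer servers are alive than expected, else None."""
    alive_count = len(alive_servers)
    if alive_count >= expected_count:
        return None
    msg = EmailMessage()
    msg["Subject"] = f"Cluster alert: {alive_count} of {expected_count} servers alive"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("Servers still alive: " + (", ".join(alive_servers) or "none"))
    return msg


def send_alert(msg: EmailMessage, smtp_host: str) -> None:
    """Hand the message to an SMTP server (host name supplied by the caller)."""
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)
```

Keeping the message construction separate from the sending makes the threshold check easy to exercise without a mail server.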

    SNMP Model
    We now come to the last method that I'm proposing: using an SNMP model. WebLogic Server software includes the ability to communicate with enterprise-wide management systems using Simple Network Management Protocol (SNMP). The WebLogic Server SNMP capability enables you to integrate management of WebLogic Servers into an SNMP-compliant management system that gives you a single view of the various software and hardware resources of a complex, distributed system.

    The following definitions help us derive a practical scenario for cluster monitoring and have been partially derived from the WebLogic documentation.

    SNMP management is based on the agent/manager model described in the network management standards defined by the International Organization for Standardization (ISO). In this model, a network/systems manager exchanges monitoring and control information about system and network resources with distributed software processes called agents. In our case, the SNMP agent is the WebLogic Admin Server. For the SNMP manager, as an illustration, I used a freely downloadable third-party tool called Trap Receiver.

    Any system or network resource that is manageable through the exchange of information is a managed resource. This could be a software resource such as a Java Database Connectivity (JDBC) connection pool or a hardware resource such as a router. In our case, we are monitoring the ClusterRuntime.

    The underlying idea is that the WebLogic Admin server, which is acting as the SNMP agent, would act as a "collection device" that gathers and sends us information about the managed resource, i.e., the ClusterRuntime. This is achieved by setting thresholds (referred to as Monitors) on a specific attribute of the ClusterRuntime. In our example we monitor the AliveServerCount attribute. Say that three servers are running in a cluster; if any one of them becomes unresponsive, the AliveServerCount decreases to two and a trap notification is sent to the SNMP manager, which then generates an e-mail.

    The Trap Receiver relies upon a database of definitions and information about the properties of managed resources and the services the agents support - this makes up the Management Information Base (MIB). In our case, the MIB is available at WEBLOGIC_HOME/lib/BEA-WEBLOGIC-MIB.asn1 (see Figure 1). (For more information about SNMP management, visit http://e-docs.bea.com/wls/docs61/snmpman/index.html.)

    The following basic steps are required for setting up our Failure Notification Model:

  • Configuring the SNMP Agent
  • Configuring the SNMP Manager

    Configuring the SNMP Agent
    This would be the WebLogic Admin Server. We start by assuming that we are running a cluster of three managed servers - managedserver1, managedserver2, managedserver3 - all of them listening on different IP addresses and the same port. The steps in this process are:

  • Access the WebLogic 6.1 Admin browser console after starting your Admin server.
  • Expand the SNMP node on the left-hand pane and click the Trap Destinations node.
  • Fill in the appropriate values (see Figure 2).
  • Click on the domain name on the left-hand pane and select the SNMP tab.
  • Make sure that the Enabled check box is checked.
  • Select the Trap Destination that you configured in the previous step as a target.
  • The default value for MIB Data Refresh Interval is 120 seconds, and the lowest possible value is 30 seconds. The MIB Data Refresh Interval is the interval, in seconds, at which the SNMP agent does a complete refresh of its cache. This value determines the freshness of the data; in our case, the time since the number of active servers was last checked. Decreasing this value significantly might impact performance.
  • If you want to use the default trap that WebLogic Server generates when it goes down (OID 1.3.6.1.4.1.140.625.100.70, explained under "Configuring the SNMP Manager"), you need not follow any further steps for configuring the SNMP agent and can jump directly to "Configuring the SNMP Manager."
  • Expand the Monitors node on the left-hand pane.
  • Click on "Configure a new gauge monitor" in the right-hand pane.
  • Fill in the values shown in Figure 3.

    The definition of each value can be seen by clicking on the '?' next to the parameter. In our case, we create three such monitors: MyGaugeMonitor1, MyGaugeMonitor2, and MyGaugeMonitor3. For each one, replace the Monitored MBean Name with the name of the respective managed server. The idea is that the SNMP agent generates a trap whenever the AliveServerCount for any server drops to two or below. It also generates a trap when the AliveServerCount rises to three or more, but in this specific case that information isn't useful. You then need to apply the changes and select the Servers tab to target the respective servers.
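The two-threshold behavior described above can be illustrated with a simplified sketch. It is not WebLogic's monitor implementation; `GaugeThresholdMonitor` is a hypothetical class that assumes a notification fires once when the value reaches the high threshold and once when it falls to the low threshold, with no repeats until the opposite threshold has been crossed.

```python
from typing import Optional


class GaugeThresholdMonitor:
    """Simplified two-threshold gauge monitor (illustrative only)."""

    def __init__(self, low: int, high: int) -> None:
        if low > high:
            raise ValueError("low threshold must not exceed high threshold")
        self.low = low
        self.high = high
        self._last: Optional[str] = None  # last notification sent, if any

    def observe(self, value: int) -> Optional[str]:
        """Return 'high' or 'low' when a notification fires, else None."""
        if value >= self.high and self._last != "high":
            self._last = "high"
            return "high"
        if value <= self.low and self._last != "low":
            self._last = "low"
            return "low"
        return None


# Thresholds matching the three-server example: low = 2, high = 3.
monitor = GaugeThresholdMonitor(low=2, high=3)
events = [monitor.observe(count) for count in [3, 3, 2, 1, 3]]
```

In this run the monitor fires "high" once all three servers are up, "low" when the count drops to two, stays quiet while the count remains low, and fires "high" again once the stopped server is back.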

    Configuring the SNMP Manager
    As I mentioned earlier, I used a third-party tool, freely downloadable online from www.ncomtech.com/download.htm. I selected "Trap Receiver for NT/2000", as I wanted this installed on my local Windows 2000 box. Detailed help for the various attributes is available at www.ncomtech.com/trmanual.html. The following steps are required to quickly configure the SNMP manager after installation:

  • Select the MIBs tab, hit Load, and select the location of WebLogic-specific MIBs, i.e., WEBLOGIC_HOME/lib/BEA-WEBLOGIC-MIB.asn1.
  • Select the Actions tab, hit Add, and select the Varbind OID from the Watch drop-down.
  • Fill in the actual MIB value for the trap in the Equals column. In our case it would be 1.3.6.1.4.1.140.625.100.75. Here 1.3.6.1.4.1.140.625 is the Enterprise vendor identification (OID) for the WLS 6.1 instance we are using. The value of 75 identifies the Monitor Trap we are interested in. We want to generate an e-mail when this trap is received, so select the checkbox for the e-mail option.
  • It is worth mentioning that if we want notifications based on the default WebLogic traps, we don't need to create a separate monitor. Some predefined WebLogic SNMP traps are generated automatically. For example, the server startup trap is 65 and the shutdown trap is 70, so the full Varbind OIDs would be 1.3.6.1.4.1.140.625.100.65 and 1.3.6.1.4.1.140.625.100.70, respectively. Hence, if we don't want to go through the process of creating a separate monitor (MyGaugeMonitor1 in Figure 3), all we need to do is enter 1.3.6.1.4.1.140.625.100.70 instead of 1.3.6.1.4.1.140.625.100.75 in the previous step. I found that the trap notification was almost instantaneous for a simulated hang when using the custom MyGaugeMonitor1, whereas it took a little while before the shutdown trap was generated. The disadvantage of the custom monitor, however, is that the number of e-mails generated when the trap condition is reached equals the number of servers for which a monitor has been created minus the server that hung (in our case, 3 - 1 = 2).
  • We could have used a counter monitor here as well, for which only one threshold needs to be given. But a counter monitor fires only when the value equals or exceeds the threshold. The advantage of the gauge monitor is that it is reset once the stopped server is restarted and the high value is reached again.
  • The last thing to be done is to configure the e-mail option. Select the e-mail tab and fill in the appropriate values for your SMTP server. For the message box, you might want to give something like "Startup/Shutdown/Hang. A trap from %SENDERIP% of type %GENERICTYPE%/%SPECIFICTYPE% was received".
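The OID arithmetic in the steps above can be summarized in a short sketch. The enterprise prefix and the trap numbers 65, 70, and 75 are the ones quoted in the text; the `.100` segment is copied from those OIDs without further interpretation, and the function names are hypothetical.

```python
ENTERPRISE_OID = "1.3.6.1.4.1.140.625"   # vendor OID quoted in the text
TRAP_PREFIX = ENTERPRISE_OID + ".100."   # segment taken from the quoted OIDs

# Trap numbers mentioned in the text.
TRAP_KINDS = {65: "server startup", 70: "server shutdown", 75: "monitor"}


def trap_oid(trap_id: int) -> str:
    """Build the full Varbind OID for a trap number."""
    return f"{TRAP_PREFIX}{trap_id}"


def describe_trap(oid: str) -> str:
    """Map a Varbind OID back to a trap kind, or 'unknown'."""
    if not oid.startswith(TRAP_PREFIX):
        return "unknown"
    suffix = oid[len(TRAP_PREFIX):]
    if not suffix.isdigit():
        return "unknown"
    return TRAP_KINDS.get(int(suffix), "unknown")


def expected_emails(monitored_servers: int, hung_servers: int = 1) -> int:
    """E-mails expected from per-server monitors when some servers hang."""
    return max(monitored_servers - hung_servers, 0)
```

For the three-server example, `trap_oid(75)` gives the OID entered in the Equals column, and `expected_emails(3)` reproduces the 3 - 1 = 2 count mentioned above.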

    Testing the SNMP Setup
    We're now all set. When all three servers are up and running, the high value of the gauge is reached. This generates an e-mail notification, which is not important to us. What we want is to be notified if a server becomes unresponsive or goes down. Bringing down one of the managed servers makes the AliveServerCount drop to 2, which generates the trap so that the e-mail notification can be sent. If you had skipped the custom monitor and used the default OID, you would still be notified. The purpose of using the custom monitor is to illustrate the usage of MBeans and monitors.

    Summary
    All of the options given here have different merits. The method using the WebLogic ping utility adds no overhead to WebLogic Server itself, whereas with SNMP traps there could be a slight performance impact if the sampling period is reduced. On the other hand, the shell script used as is (see Listing 1) works only on Unix platforms, does not give you real-time information, and consumes its own share of CPU. The emailServersRunningInCluster.java program (see Listing 2) uses WebLogic-specific APIs and can be set up as a cron job on Unix platforms if the sleep is removed from the code. Because it's pure Java, you can run it from anywhere, provided the WebLogic-specific classes are in the classpath; it, too, has its own share of CPU utilization. The third option, using the SNMP manager, can catch other traps in addition to tracking the AliveServerCount, by tweaking the OIDs and creating separate monitors. The Java program, by contrast, could not be used as is for that; monitoring other MBeans would require major code changes.

    A sequel to all this could be to find the Java process ID of the server that becomes unresponsive, as indicated either by a ConnectException or by the AliveServerCount going down, and to automate the script to do a kill -3 on Unix-based platforms to get the thread dumps and send them to support for analysis.

    Apurb Kumar is a developer relations engineer in Backline WebLogic Support at BEA Systems. He has more than 10 years of experience, starting with real-time programming, moving on to databases, and finally Java development. Before moving to BEA Systems, Apurb consulted for companies such as Charles Schwab, AllAdvantage.com, and Holland Systems.
