FAQs about monitoring Mediation Servers

After making a UDP-based JGroups discovery request and receiving a response from an application server in the cluster, each mediation server makes an RMI (TCP) call to an application server every 30 seconds. This RMI call results in a “call on cluster” on the application server cluster, using JGroups (UDP by default) to call the agentHeartbeat method of the OWMedServerTrackerMBean on each application server in the cluster. The primary application server updates the timestamp for the mediation server in question, and the others ignore the call.

Every five seconds, the primary application server checks whether any mediation server has gone 52 seconds without making a call. If one has, the primary attempts to verify that it is down by pinging the suspected mediation server and then issuing an RMI call to it. It considers the mediation server down if the ping or the final RMI call fails. This verification avoids false mediation-server-down notifications when a network cable is pulled from an application server.
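The timing rules above can be sketched in Java. This is a minimal illustration under stated assumptions, not the product's implementation: the class name, the checkOnce method, and the two Predicate hooks standing in for the ping and the verifying RMI call are all hypothetical. Only the 5-second check interval, the 52-second silence threshold, and the agentHeartbeat name come from the description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

// Hypothetical sketch of the heartbeat bookkeeping; not the product's code.
class MedServerHeartbeatSketch {
    static final long CHECK_INTERVAL_MS = 5_000L; // monitor period from the text
    static final long SILENCE_LIMIT_MS = 52_000L; // silence threshold from the text

    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();
    private final Predicate<String> ping;    // assumed stand-in for the ping step
    private final Predicate<String> rmiCall; // assumed stand-in for the verifying RMI call

    MedServerHeartbeatSketch(Predicate<String> ping, Predicate<String> rmiCall) {
        this.ping = ping;
        this.rmiCall = rmiCall;
    }

    // Primary application server: record the time of the latest heartbeat.
    void agentHeartbeat(String medServer, long nowMs) {
        lastHeartbeat.put(medServer, nowMs);
    }

    // One monitoring pass; a scheduler would invoke this every CHECK_INTERVAL_MS.
    List<String> checkOnce(long nowMs) {
        List<String> down = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastHeartbeat.entrySet()) {
            if (nowMs - entry.getValue() < SILENCE_LIMIT_MS) {
                continue; // heard from recently, nothing to verify
            }
            String medServer = entry.getKey();
            // Down if the ping fails or the follow-up RMI call fails.
            if (!ping.test(medServer) || !rmiCall.test(medServer)) {
                down.add(medServer);
            }
        }
        return down;
    }
}
```

Passing the clock value in as a parameter keeps the sketch deterministic. A mediation server that has been silent for 52 seconds but still answers both the ping and the RMI call is not reported as down.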

• Does the application server wait 15 seconds after receiving the mediation server's response? Or does it monitor the mediation server every 15 seconds regardless of the mediation server's response?

The mediation server's RMI call is received on a different thread from the one that runs the monitoring code. The monitoring code should run every 5 seconds, regardless of the frequency of mediation server calls. However, our investigation of the scheduling mechanism used (the JBoss scheduler, http://community.jboss.org/wiki/scheduler) suggests that other tasks using this scheduler could affect the schedule, because the JDK timer implementation changed after JDK 1.4.

• What kind of functionality (JMS?) does the application server use to send and receive Redcell messages?

The application server does not actively monitor the mediation servers unless it fails to get a call from one for 52 seconds. If it does try to verify a downed mediation server, it uses an RMI call.

The RMI calls use TCP sockets. Several ports may be involved: 1103/1123 (UDP, JGroups discovery), 4445/4446 (TCP, RMI object), 1098/1099 (TCP, JNDI) or 3100/3200 (TCP, HAJNDI), and 8093 (UIL2).
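When checking whether a firewall blocks one of the TCP ports listed above, a small probe can help. The following is a minimal sketch, not a product utility: the class and method names are hypothetical, and it covers TCP only, so it cannot test the UDP JGroups discovery ports.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical diagnostic helper; not part of the product.
class TcpPortProbe {
    // Returns true if a TCP connection to host:port succeeds within timeoutMs.
    static boolean isTcpPortOpen(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or unreachable: treat all as "not open".
            return false;
        }
    }
}
```

For example, TcpPortProbe.isTcpPortOpen(host, 1099, 2000) would test the JNDI port from the list, where host is whichever server you want to reach. A successful connection shows only that the port accepts TCP connections, not that the service behind it is healthy.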

• What kind of problem or bug would cause the application server to falsely detect a mediation server as down? For example, would failing to allocate memory cause the application server to think a mediation server is down (dead)?

An out-of-memory error on an application server could result in a false detection of a downed mediation server.

• If such memory depletion occurs as described in the previous answer, would a record appear in the log? If it doesn't appear in the log, would it appear if the log level is changed?

An out-of-memory error usually appears in the log without any change to the logging configuration, since it is logged at ERROR level.

• The log shows that a mediation server was detached from the cluster configuration, but what kind of logic is used to decide the detachment from the cluster? For instance, would it detach application servers if they detect that the mediation server is down?

JBoss (JGroups) has a somewhat complex mechanism for detecting a slow server in a cluster, which can result in a server being “shunned.” This logic remains, even though we have never observed the shunning of a server resulting in a workable cluster. This is the only mechanism that removes servers from the cluster. The configuration for this service is located in $OWARE_USER_ROOT/oware/jboss-3.2.7/owareconf/cluster-service.xml. Shunning can be disabled by replacing all shun="true" instances with shun="false". A flow control option also exists, which regulates the rate of cluster communication to compensate for one server being slower than another in processing cluster requests. The detection of a mediation server being down with the heartbeat mechanism described here does not attempt to remove the mediation server from its cluster.
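The shun replacement described above could be scripted rather than done by hand. The sketch below is only an illustration: the class and method names are hypothetical, it works on the file's text as a string, and it accepts either single or double quotes around the value. Reading and writing cluster-service.xml, and keeping a backup of it, are left to the reader.

```java
// Hypothetical helper; not part of the product.
class ShunRewriteSketch {
    // Rewrites every shun='true' or shun="true" attribute to shun="false".
    static String disableShunning(String xml) {
        // (['"]) captures the opening quote; \1 requires the same closing quote.
        return xml.replaceAll("\\bshun\\s*=\\s*(['\"])true\\1", "shun=\"false\"");
    }
}
```

Attributes that already read shun="false", and attributes with other names, are left unchanged.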