Troubleshooting

Administrator

The Runtime State of applications is Lost Contact or Unknown
If the Runtime State column of applications is Lost Contact or Unknown, the connection to the Enterprise Message Service server acting as the notification server and Messaging Bus has been lost.
Action History is stuck at In Progress

An Action History column stuck at In Progress could indicate that:

  • One or more of the pending tasks in the dialog that displays when you click the Action History link have failed, most likely due to lost communication with the notification server. The tasks will not be re-queued even after the notification server starts up.
  • A node involved in that action is unavailable. When the node becomes available, the action will execute and complete.
Failure to reconnect to the notification server
Restart the server if you see the following message after you try to reconnect to the notification server:
Refresh Status Cache action failed , caused by:
com.tibco.tibems.qin.TibQinRecoveryException: Connection to the
server is failed, caused by: Connection to the server is failed,
caused by: Session is closed
Notification Server URL needs to be changed manually
When the configured notification server fails, add another available notification server manually to the notification.xml file in the TIBCO host configuration folder. This will enable the TIBCO host to restart. However, the Administration UI continues to display the old notification server URL. Use the following steps to correct it:
  1. Select Admin Configuration > Admin Server.
  2. Change the Notification Server URL to the one you added to the notification.xml file and click Save.
  3. Click Reconnect to EMS Server.
Action History shows Paused Offline
This means that actions in Administrator are queued while runtime objects are offline and are executed when the objects come back online.
Recover from network outages or IP address changes
The IP address of the machine on which the Administrator server is running could change due to DHCP reconfiguration if the machine is connected to a new network after being created. To recover from communication errors that can arise from the change in IP address:
  1. Stop all nodes managed by the SystemHost TIBCO host instance.
  2. Stop the SystemHost TIBCO host instance.
  3. If the machine on which the Administrator server is running also hosts the Enterprise Message Service server, restart the Enterprise Message Service server.
  4. Start the SystemHost TIBCO host instance.
Reconnect to EMS Server after Restarting the QIN EMS Server
Actions such as Deploy, Undeploy, Start, or Stop that are performed after the QIN EMS server crashes result in Error Queing Task. After the QIN EMS server is restarted, go to Admin Configuration > Admin Server > Transport Configuration and click Reconnect to EMS Server so that Administrator actions work again.
Improve the Administrator UI response time
Create an index on the TASK table to increase the Administrator UI response time.

For example, if you use Microsoft SQL Server, create the index with the statement CREATE INDEX index-name ON task (objectURI,queueURI).
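As a rough illustration of the statement above, the following Python sketch creates the same composite index on a stand-in table and checks that a lookup by objectURI and queueURI uses it. It runs against an in-memory SQLite database, not the Administrator database; the table layout, the index name idx_task_object_queue, and the example URIs are assumptions made for the demonstration.

```python
import sqlite3

INDEX_NAME = "idx_task_object_queue"  # hypothetical index name


def create_task_index(conn, index_name=INDEX_NAME):
    """Create a composite index on task(objectURI, queueURI)."""
    # Identifiers cannot be bound as SQL parameters, so only pass a
    # trusted, hard-coded index name here.
    conn.execute(f"CREATE INDEX {index_name} ON task (objectURI, queueURI)")


def index_used(conn, index_name=INDEX_NAME):
    """Return True if the query planner picks the index for a typical lookup."""
    plan = conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT id FROM task WHERE objectURI = ? AND queueURI = ?",
        ("urn:example:object", "urn:example:queue"),
    ).fetchall()
    # The last column of each plan row is the human-readable detail text.
    return any(index_name in row[-1] for row in plan)


# Simplified stand-in for the real TASK table (assumed columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task (id INTEGER PRIMARY KEY, objectURI TEXT, queueURI TEXT)"
)
create_task_index(conn)
uses_index = index_used(conn)
```

On a real database, run the CREATE INDEX statement with the database vendor's own tools and check the plan with that vendor's equivalent of EXPLAIN.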

Administrator Host instances

tibcohost.exe doesn't start
  • Ensure tibcohost.tra is in the same folder.
  • Ensure the Java classpath in the tra file is updated for your environment. tibcohost is automatically configured to use the JRE version that is installed with the product.
  • Ensure your Java version is at least JRE 1.6.0_14, which is required because of a bug in the Java IO implementation on Windows.
If you see an exception while starting a TIBCO Host instance that looks like this:
C:\amx\tibcohost\1.0\instances\TibcoHostInstance\HPAInstance\bin> tibcohost
[TibcoHost - START] [INFO ] com.tibco.amf.hpa.tibcohost.runtime.TibcoHost - No running TibcoHost instance found on localhost.
[TibcoHostInstance] [ERROR] com.tibco.amf.hpa.tibcohost.runtime.TibcoHost - TIBCO-AMX-TIBCOHOST-RUNTIME-103: TibcoHost: TIBCO ActiveMatrix host pingz-t400_TibcoHostInstance failed to start. Cause com.tibco.tibems.qin.TibQinException: Connection to the server is failed.
Check your Enterprise Message Service server configuration, especially if you installed Enterprise Message Service on Windows.
2009-12-17 15:09:49.954 Storage Location: 'datastore'.
2009-12-17 15:09:49.954 Routing is disabled.
2009-12-17 15:09:49.954 Authorization is disabled.
2009-12-17 15:09:49.972 Accepting connections on tcp://pingz-t400:7222.
2009-12-17 15:09:49.972 Recovering state, please wait.
2009-12-17 15:09:49.975 Server is active.
2009-12-17 15:26:01.026 WARNING: [admin@pingz-t400]: create subscriber failed: not allowed to create dynamic topic [EMSGMS.UnboundHost_amxadmin.132ba2cc_1259ef65268_-80000a699217].
2009-12-17 15:26:01.564 WARNING: [admin@pingz-t400]: create subscriber failed: not allowed to create dynamic topic [EMSGMS.UnboundHost_amxadmin.132ba2cc_1259ef65268_-80000a699217].
2009-12-17 15:26:16.355 WARNING: [admin@pingz-t400]: create subscriber failed: not allowed to create dynamic topic [EMSGMS.UnboundHost_amxadmin.7f68b7a6_1259ef68ea8_-80000a699217].
2009-12-17 15:26:16.905 WARNING: [admin@pingz-t400]: create subscriber failed: not allowed to create dynamic topic [EMSGMS.UnboundHost_amxadmin.7f68b7a6_1259ef68ea8_-80000a699217].
2009-12-17 15:26:52.138 WARNING: [admin@pingz-t400]: create subscriber failed: not allowed to create dynamic topic [EMSGMS.UnboundHost_amxadmin.-5e8ec58d_1259ef71a70_-80000a699217].
2009-12-17 15:26:52.732 WARNING: [admin@pingz-t400]: create subscriber failed: not allowed to create dynamic topic [EMSGMS.UnboundHost_amxadmin.-5e8ec58d_1259ef71a70_-80000a699217].

In this case you likely have an invalid Enterprise Message Service configuration, which was created automatically by the Enterprise Message Service installer on Windows. To fix this, run the Enterprise Message Service installer and replace the default ProgramData folder filled in by the installer with a valid folder. The installer does not create missing folders, so Enterprise Message Service does not work properly when the folder does not exist.

Disable notifications for the host and the nodes
To disable notifications for the host and the nodes, delete the CONFIG_HOME/tibcohost/Admin-enterpriseName-adminServerName/host/configuration/notification.xml file.
Memory guidelines for the SystemNode for enterprises with a large number of nodes
When many nodes restart at the same time, such as after a power failure, the SystemNode will be flooded with messages and will temporarily need increased heap memory to handle this load. The maximum heap size should be set to handle peak load. Giving a heap size of 3G (-Xmx3g) will accommodate simultaneous messages from around 400 nodes hosting user applications. If your enterprise has more nodes, then the maximum heap memory size should be appropriately increased.
TIBCO host shows erratic behavior after waking up from hibernation

Sometimes the tibcohost process has problems communicating with its nodes. This happens after the machine was hibernated or suspended and then woken up. The management connections do not always reinitialize properly, leaving the connection 'hanging'. Only a restart can solve this issue, but tibcohost may not be able to properly shut down the node processes.

Another effect is that the connection to the notification server may not initialize properly after waking up from hibernation. This is especially true when the machine wakes up in a different environment from the one in which it hibernated, for example, when it hibernates in the office and wakes up at home. In this case, the IP address changes upon wakeup, which causes communication problems for connections relying on the TCP/IP stack in Java. Avoid waking the machine in a different environment, or restart with the new IP address.

Is TIBCO Host instance connected to the right node process?

With the problem described in the preceding section, it can happen that a node process sticks around long after control is returned to the TIBCO Host instance. If the instance is either restarted or it is told to start the node again, it may immediately connect to the older node process that is in the process of shutting down.

So that you can verify that the TIBCO Host instance is connected to the correct node process, the instance prints the unique identifier (UUID) of the node process when it connects successfully. Compare this UUID to the UUID printed in the node process log file upon startup. Because the UUID is unique for every run, it is easy to verify that the connection is correct.

Node process log:
[DEBUG] control.internal.FrameworkImpl - framework is starting with UUID 116295c6-adea-472d-9655-1d6e305a1959
TIBCO Host instance log:
[DEBUG] ProxyImpl.AMXAdministratorNode - reached node AMXAdministratorNode_116295c6-adea-472d-9655-1d6e305a1959
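The comparison can be automated. The following Python sketch extracts the most recent UUID from each log and compares them. It assumes the message wording shown in the two sample lines above; the function names are hypothetical, and the patterns may need adjusting for other product versions.

```python
import re

UUID = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
# Patterns derived from the two sample log lines (assumed to be stable).
NODE_START = re.compile(r"framework is starting with UUID (" + UUID + ")")
HOST_REACHED = re.compile(r"reached node (\w+)_(" + UUID + ")")


def last_node_uuid(node_log_text):
    """UUID of the most recent node process start, or None."""
    matches = NODE_START.findall(node_log_text)
    return matches[-1] if matches else None


def last_reached_uuid(host_log_text, node_name):
    """UUID of the node process the host reached most recently, or None."""
    uuid = None
    for name, found in HOST_REACHED.findall(host_log_text):
        if name == node_name:
            uuid = found
    return uuid


def connected_to_current_process(node_log_text, host_log_text, node_name):
    """True if the host's latest connection matches the node's latest start."""
    node_uuid = last_node_uuid(node_log_text)
    return node_uuid is not None and node_uuid == last_reached_uuid(
        host_log_text, node_name
    )
```

Read both log files into strings and call connected_to_current_process with the node name; a False result suggests the host is attached to an older node process that is still shutting down.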

TIBCO Host instance or node does not come up on remote systems

When you install a TIBCO Host instance and nodes on remote systems, make sure that they are properly connected over the network. The instance and the nodes try to reach the Enterprise Message Service server on the configured port (7222 by default), so that port must be open on the firewall. On Windows systems in particular, this port may be blocked by default.

The same problem will occur when the node is trying to reach Administrator. Make sure that the connector is configured on an interface that is reachable over the network and the port is unblocked on the firewall.
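A quick way to test reachability from the remote system is to attempt a TCP connection to the port in question. The following Python sketch does only that; the function name is hypothetical, and a successful connection shows that the port is open, not that the Enterprise Message Service server or Administrator is healthy.

```python
import socket


def port_reachable(host, port=7222, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host name and applies the timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False
```

For example, port_reachable("ems-host.example.com") checks the default Enterprise Message Service port on a placeholder host name; pass the Administrator connector port to check the second case described above.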

Nodes

Node runs out of memory (Java heap space)
When this occurs, configure the node JVM to dump a snapshot of the heap when it runs out of memory by editing the .tra file of the node and adding the following arguments to java.extended.properties:
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=file
where file is the name of the file in which the binary heap dump will be written. The dump file can then be analyzed offline by profiling tools.

The .tra file of the node is located in the folder CONFIG_HOME/tibcohost/Admin-enterpriseName-adminServerName/nodes/nodeName/bin.
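The edit can be scripted. The following Python sketch appends an argument to java.extended.properties in the text of a .tra file and leaves the file unchanged if the argument is already present. It assumes the property is written on a single key=value line; the function name and the dump path in the usage note are hypothetical.

```python
def add_jvm_argument(tra_text, argument, key="java.extended.properties"):
    """Return tra_text with argument appended to the given property."""
    lines = tra_text.splitlines()
    for i, line in enumerate(lines):
        name, sep, value = line.partition("=")
        # Commented-out lines start with '#', so their name never equals key.
        if sep and name.strip() == key:
            if argument in value.split():
                return tra_text  # already configured, nothing to do
            new_value = (value.strip() + " " + argument).strip()
            lines[i] = f"{name}={new_value}"
            return "\n".join(lines) + "\n"
    # Property not present yet: add it as a new line.
    lines.append(f"{key}={argument}")
    return "\n".join(lines) + "\n"
```

For example, calling add_jvm_argument(text, "-XX:+HeapDumpOnOutOfMemoryError") and then add_jvm_argument(text, "-XX:HeapDumpPath=/tmp/node-heap.hprof") on the file contents adds both standard HotSpot options. Back up the .tra file before writing the result, and restart the node for the change to take effect.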

Node does not start

Look at the following places to analyze the problem:

  • Check the log file of the node for exceptions.
  • Check the node-stdout.log file of the instance for exceptions and unusual error messages, which may indicate a problem.
  • Check the Equinox log file, which is always written to <nodename>/configuration/123....log. Every start of the node process produces a new version of the file. Check for exceptions.
Bundles cannot be started. A likely cause is a missing class: a java.lang.ClassNotFoundException in the Equinox log file indicates a fatal condition in the node, which prevents it from starting up. For example:
!ENTRY com.tibco.trintiy.server.credentialserver.common 4 0 2009-05-21 11:06:05.186
!MESSAGE
!STACK 0
org.osgi.framework.BundleException: The activator com.tibco.trintiy.server.credentialserver.jmx.Activator for bundle com.tibco.trintiy.server.credentialserver.common is invalid
  at org.eclipse.osgi.framework.internal.core.AbstractBundle.loadBundleActivator(AbstractBundle.java:146)
  at org.eclipse.osgi.framework.internal.core.BundleContextImpl.start(BundleContextImpl.java:980)
  at org.eclipse.osgi.framework.internal.core.BundleHost.startWorker(BundleHost.java:346)
  at org.eclipse.osgi.framework.internal.core.AbstractBundle.resume(AbstractBundle.java:355)
  at org.eclipse.osgi.framework.internal.core.Framework.resumeBundle(Framework.java:1074)
  at org.eclipse.osgi.framework.internal.core.StartLevelManager.resumeBundles(StartLevelManager.java:616)
  at org.eclipse.osgi.framework.internal.core.StartLevelManager.incFWSL(StartLevelManager.java:508)
  at org.eclipse.osgi.framework.internal.core.StartLevelManager.doSetStartLevel(StartLevelManager.java:299)
  at org.eclipse.osgi.framework.internal.core.StartLevelManager.dispatchEvent(StartLevelManager.java:489)
  at org.eclipse.osgi.framework.eventmgr.EventManager.dispatchEvent(EventManager.java:211)
  at org.eclipse.osgi.framework.eventmgr.EventManager$EventThread.run(EventManager.java:321)
Caused by: java.lang.ClassNotFoundException: com.tibco.trintiy.server.credentialserver.jmx.Activator
  at org.eclipse.osgi.framework.internal.core.BundleLoader.findClassInternal(BundleLoader.java:483)
  at org.eclipse.osgi.framework.internal.core.BundleLoader.findClass(BundleLoader.java:399)
  at org.eclipse.osgi.framework.internal.core.BundleLoader.findClass(BundleLoader.java:387)
  at org.eclipse.osgi.internal.baseadaptor.DefaultClassLoader.loadClass(DefaultClassLoader.java:87)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
  at org.eclipse.osgi.framework.internal.core.BundleLoader.loadClass(BundleLoader.java:315)
  at org.eclipse.osgi.framework.internal.core.BundleHost.loadClass(BundleHost.java:227)
  at org.eclipse.osgi.framework.internal.core.AbstractBundle.loadBundleActivator(AbstractBundle.java:139)
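When the Equinox log is long, it can help to list only the bundles that failed with a missing class. The following Python sketch scans the log text for !ENTRY markers and ClassNotFoundException lines, as they appear in the example above. The function name is hypothetical, and the parsing is deliberately loose: it is a triage aid, not a full parser for the Equinox log format.

```python
import re

# An entry starts with "!ENTRY <bundle symbolic name> ...".
ENTRY = re.compile(r"^\s*!ENTRY\s+(\S+)", re.MULTILINE)
# The missing class follows the exception name, possibly after a line wrap.
MISSING = re.compile(r"ClassNotFoundException:\s+([\w.$]+)")


def missing_classes(equinox_log_text):
    """Return a list of (bundle, missing_class) pairs found in the log."""
    entries = list(ENTRY.finditer(equinox_log_text))
    found = []
    for i, entry in enumerate(entries):
        # The body of an entry runs until the next !ENTRY marker.
        if i + 1 < len(entries):
            end = entries[i + 1].start()
        else:
            end = len(equinox_log_text)
        body = equinox_log_text[entry.end():end]
        for class_name in MISSING.findall(body):
            found.append((entry.group(1), class_name))
    return found
```

Each pair names the bundle whose log entry contains the exception and the class that could not be loaded, which is usually enough to tell which bundle or dependency is missing or damaged.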
Node does not stop after the TIBCO Host instance stop -wait true has completed

Occasionally, you will find that it takes several minutes for the node processes to finally disappear. Unfortunately, this may or may not be a problem and requires a closer look almost every time. In most cases, it is a normal behavior and can be explained like this:

  • The node process runs an OSGi framework. There are many concurrent activities in separate threads that interact during the shutdown sequence. These include Springframework Timers, Framework Event Dispatcher, Startlevel Thread, custom extenders from TIBCO and from customers.
  • Each thread is competing for the same shared resources (CPU, IO). Depending on the overall load of the system (operating system), it may take some time for threads to be scheduled and proceed. Because of interdependencies, this may cause a delay of the overall shutdown sequence.
  • During shutdown, the Activator.stop() method is called for every bundle if present. Any long running or CPU/IO intensive operation performed in that implementation stalls the overall shutdown procedure. Therefore, it is essential to keep this implementation short and quick.
  • As a last item of work before ending the process, the OSGi framework (Equinox in our case) persists the current state of the runtime to the disk. This includes bundles and wiring information. Depending on the number of bundles in the runtime and the availability of IO cycles, this operation may take a long time (more than one minute) to complete. It is essential not to disrupt this procedure; otherwise the runtime state may get corrupted and the node may not come up and function as expected.
Even with all or most of the possible reasons for delays listed above, there is still the possibility of a problem with the node itself. Any process that hangs around for an excessively long time, that is, more than five minutes, should be examined carefully. To diagnose the issue, open the node log files and look at the end to see where the node may have gotten stuck. A typical run ends with statements similar to this:
11 Feb 2010 18:07:08,412 [Event Dispatcher] [DEBUG] control.internal.FrameworkImpl - com.tibco.commonlogging.cbe.model stopped 
11 Feb 2010 18:07:08,412 [Framework - sync] [INFO ] control.internal.FrameworkImpl - Sync thread ends. 
11 Feb 2010 18:07:08,413 [Bundle Shutdown] [DEBUG] control.internal.FrameworkImpl - removing node.lck 
11 Feb 2010 18:07:08,482 [Bundle Shutdown] [INFO ] stdout - Restoring STDOUT 
11 Feb 2010 18:07:08,482 [Bundle Shutdown] [INFO ] stdout - Restoring STDERR 
11 Feb 2010 18:07:10,968 [shutdown thread] [INFO ] control.internal.FrameworkImpl - exiting process! 
11 Feb 2010 18:07:10,971 [Shutdown] [INFO ] org.mortbay.log - Shutdown hook executing 
11 Feb 2010 18:07:10,971 [ Shutdown] [INFO ] org.mortbay.log - Shutdown hook complete 
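The check of the log tail can be scripted. The following Python sketch reports whether the end of a node log contains the final shutdown message and returns the last logged line, which is where a stuck node stopped. It assumes the "exiting process!" wording from the sample above; the function name and the 50-line window are arbitrary choices for the example.

```python
EXIT_MARKER = "exiting process!"  # taken from the sample log above


def diagnose_shutdown(node_log_text, tail_lines=50):
    """Return (finished, last_line) for the tail of a node log.

    finished is True if the exit marker appears in the last tail_lines
    non-empty lines; last_line is the final non-empty line, or None.
    """
    lines = [line for line in node_log_text.splitlines() if line.strip()]
    tail = lines[-tail_lines:]
    finished = any(EXIT_MARKER in line for line in tail)
    last_line = tail[-1].strip() if tail else None
    return finished, last_line
```

If finished is False while the node process is still running after several minutes, last_line shows the last step that was logged and is a reasonable starting point for the investigation.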
Node cannot be removed
This problem only exists on Windows systems and has to do with file locking. If you see a message like this in the tibcohost.log file:
AMXAdminHost 26 Feb 2010 14:35:22,458 [Job_Executor10] [ERROR]
com.tibco.amf.hpa.tibcohost.runtime.TibcoHostInstance - error removing node
"node2": error preparing for delete by renaming
C:\MatrixDevInstall\tibcohost\1.0\instances\TibcoHostInstance\Nodes\node2 to
C:\MatrixDevInstall\tibcohost\1.0\instances\TibcoHostInstance\Nodes\node2.tmp0

then Java code is trying to delete a folder on which another process, such as Windows Explorer, a text editor with a log file open, or even the node process itself, holds a lock. On Windows systems, those locks must be released before the node folder can be deleted.

A tool that lists open file handles is very helpful in finding the processes that keep holding the lock.

Note: The entire directory tree of the node folder must be unlocked.
TIBCO host takes a long time to start up on Linux platforms
This may happen intermittently and is not always reproducible. The pseudo-random number generator needs to be seeded with truly random bits. Reads from the /dev/random device block until data is available, and when entropy is insufficient the wait can last many minutes. To confirm that the problem is due to seeding of the pseudo-random number generator, run kill -QUIT pid or kill -3 pid to produce a thread dump. The stack trace should include com.sun.SeedGenerator. For truly random seed bits, run the rngd daemon, which reads from a hardware device and feeds verified entropy bits into /dev/random. If a fast start is more important, switch to /dev/urandom, which does not wait for random bits but reuses already returned bits. Alternatives include:
  • Add the line java.properties.java.security.egd=file:/dev/./urandom to tibcohost.tra.

    The .tra file of the host is located in the folder CONFIG_HOME/tibcohost/Admin-enterpriseName-adminServerName/host/bin.

  • Edit $JAVA_HOME/jre/lib/security/java.security and change the securerandom.source entry to securerandom.source=file:/dev/./urandom.
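To confirm the diagnosis before changing any configuration, the thread dump produced by kill -3 can be searched for threads whose stack mentions SeedGenerator. The following Python sketch assumes the usual HotSpot thread dump layout, in which each thread starts with its quoted name and threads are separated by blank lines; the function name is hypothetical.

```python
import re


def threads_in_seed_generator(thread_dump_text):
    """Return the names of threads whose stack mentions SeedGenerator."""
    names = []
    # Threads in a dump are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", thread_dump_text):
        lines = block.strip().splitlines()
        # A thread block starts with the quoted thread name.
        if not lines or not lines[0].startswith('"'):
            continue
        if any("SeedGenerator" in frame for frame in lines[1:]):
            names.append(lines[0].split('"')[1])
    return names
```

A non-empty result while the host appears to hang during startup is consistent with the process waiting for entropy. The thread dump is written to the standard output of the Java process, so capture that output into a file first.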
Errors when starting a node in a replicated environment if an external URL is used for load balancing
If an external port is used for load balancing during replication, use the Administrator UI to add a logging configuration named org.mortbay.log to the SystemNode and SystemNodeReplica, with the logging appender systemnode_root and the Level set to ERROR.
Threads block at java.security.SecureRandom under high concurrency
This is the behavior of SecureRandom when securerandom.source points to /dev/random and the entropy pool is empty. To work around it:
  1. Stop the node.
  2. Modify the files as mentioned below:

    Add the following property to the java.security file at TIBCO_HOME/tibcojre64/1.6.0/lib/security:

    securerandom.source=file:/dev/./urandom

    Add the following argument to the node .tra file (appended to java.extended.properties):

    -Djava.security.egd=file:/dev/./urandom
  3. Restart the node.
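After editing, it is worth verifying that the java.security file really points at the non-blocking device. The following Python sketch reads the effective value of securerandom.source from the file text, taking the last definition as Java property files do. The function names are hypothetical, and the parser handles only plain key=value lines, not continuation lines or escapes.

```python
def get_property(properties_text, key):
    """Return the last value assigned to key, or None if it is not set."""
    value = None
    for raw in properties_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments ('#' or '!').
        if not line or line.startswith(("#", "!")):
            continue
        name, sep, rest = line.partition("=")
        if sep and name.strip() == key:
            value = rest.strip()  # later definitions override earlier ones
    return value


def uses_urandom(java_security_text):
    """True if securerandom.source is set to a urandom device."""
    source = get_property(java_security_text, "securerandom.source")
    return source is not None and source.endswith("urandom")
```

Read the java.security file into a string and pass it to uses_urandom; a False result means the property is missing, commented out, or still points at /dev/random.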

Applications

Application deployment failures caused by resource instance failures
When deploying an application, ActiveMatrix Administrator automatically installs resource instances if there are resource templates scoped to the application. If the resource instance installation fails, application deployment also fails. For example, if the HTTP Connector has a port conflict, it fails to start. To avoid HTTP Connector port conflicts, use substitution variables to assign a different port number on each node. Then uninstall the application and redeploy it.
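Before assigning a port number to a node through a substitution variable, you can check on the node's machine whether the port is already taken. The following Python sketch tries to bind the port and reports the result; the function name is hypothetical, and the check reflects only the moment it runs, so another process can still claim the port later.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            # SO_REUSEADDR is deliberately not set, so an existing
            # listener on the same address makes the bind fail.
            probe.bind((host, port))
        except OSError:
            return False
        return True
```

Run the check for each candidate port on the machine that hosts the node, and use a port for which port_is_free returns True as the value of that node's substitution variable.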

Resource Templates

HTTP Connector Acceptor Thread Count changed from 1 to 20
When an HTTP Connector is changed from Blocking IO Sockets to Non-Blocking IO Sockets using the Advanced tab, the Acceptor Thread Count in the General tab automatically changes to 1. However, the HTTP Connector instance shows 20 threads when you check the threads in the node VM using jvisualvm or a similar tool.

Issue

  1. Select Shared Objects > Resource Templates.
  2. Create a new HTTP Connector resource template with Blocking IO Sockets with an instance.
  3. Set the Acceptor Thread Count to 20.
  4. Click the Advanced tab.
  5. Check the Use Non-Blocking IO Sockets box and click Save.
  6. Click Yes to reinstall the resource instance.
  7. Click the General tab.

    Now, the Acceptor Thread Count is changed to 1 and the Save button is enabled.

  8. Check the threads in the node VM.

    It shows 20 threads for the HTTP Connector instead of 1.

Workaround

  1. Click General and click Save.
  2. Click Yes to reinstall the resource instance.

    The Acceptor Thread Count now shows 1 in the node VM for the HTTP Connector instance.

Users of KeyStore provider fail to detect KeyStore refreshes
Users of KeyStore Provider such as Identity Provider, Trust Provider, and Mutual Identity Provider initialize at startup with credentials obtained from the KeyStore. However, they fail to detect future KeyStore refreshes. In order to avoid any service failures, perform the following procedure:
  1. Stop dependent services.
  2. Stop the Subject, Trust, and Mutual Identity providers that supply the credentials.
  3. Stop the KeyStore provider that supplies the KeyStore containing the credentials.
  4. Change the login credentials of the external system.
  5. Change the credentials in the ActiveMatrix Administrator's hosted KeyStore.
  6. Restart the KeyStore Credential and Subject, Trust, and Mutual Identity providers.
  7. Restart the dependent services.