While TIBCO EBX® is designed to support large volumes of data, several common factors can lead to poor performance. Addressing the key points discussed in this section will solve the usual performance bottlenecks.
For reference, the table below details the programmatic extensions that can be implemented.
| Use case | Programmatic extensions that can be involved |
|---|---|
| Validation | |
| Table access | |
| EBX® content display | |
| Data update | |
For large volumes of data, inefficient algorithms have a serious impact on performance. For example, suppose a constraint algorithm's complexity is O(n²). If the data size is 100, the resulting cost is proportional to 10 000 (this generally produces an immediate result). However, if the data size is 10 000, the resulting cost will be proportional to 100 000 000.
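To make the orders of magnitude concrete, here is a minimal, EBX®-independent Java sketch contrasting a quadratic duplicate check with a linear one backed by a hash set; the method names are illustrative only.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class UniquenessCheck {

    // O(n²): compares every pair of values. At 10 000 values, this is
    // on the order of 100 000 000 comparisons.
    static boolean hasDuplicatesQuadratic(List<String> values) {
        for (int i = 0; i < values.size(); i++) {
            for (int j = i + 1; j < values.size(); j++) {
                if (values.get(i).equals(values.get(j))) {
                    return true;
                }
            }
        }
        return false;
    }

    // O(n): a single pass; Set.add returns false when the value was
    // already present, revealing the duplicate immediately.
    static boolean hasDuplicatesLinear(List<String> values) {
        Set<String> seen = new HashSet<>();
        for (String value : values) {
            if (!seen.add(value)) {
                return true;
            }
        }
        return false;
    }
}
```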
Another reason for slow performance is calling external resources. Local caching usually solves this type of problem.
If one of the use cases above displays poor performance, it is recommended to track down the cause, either through code analysis or by using a Java profiling tool.
Authentication and permissions management involve the user and roles directory.
If a specific directory implementation is deployed and accesses an external directory, it can be useful to ensure that local caching is performed. In particular, one of the most frequently called methods is Directory.isUserInRole.
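As an illustration, here is a minimal caching sketch; the ExternalDirectoryClient abstraction and its checkMembership call are hypothetical placeholders for the remote lookup (e.g. LDAP), not part of the EBX® API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical wrapper that caches the results of an expensive external
// directory lookup, so that repeated isUserInRole-style calls do not hit
// the network every time.
public class CachingRoleLookup {

    /** Hypothetical abstraction over the remote directory (e.g. LDAP). */
    public interface ExternalDirectoryClient {
        boolean checkMembership(String userId, String roleName);
    }

    private final ExternalDirectoryClient client;
    private final Map<String, Boolean> cache = new ConcurrentHashMap<>();

    public CachingRoleLookup(ExternalDirectoryClient client) {
        this.client = client;
    }

    public boolean isUserInRole(String userId, String roleName) {
        String key = userId + '|' + roleName;
        // The remote call is performed only on a cache miss.
        return cache.computeIfAbsent(key, k -> client.checkMembership(userId, roleName));
    }

    // Clear periodically, or upon a directory-change notification, so that
    // permission changes are eventually picked up.
    public void invalidate() {
        cache.clear();
    }
}
```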
In a data model, when an element's cardinality constraint maxOccurs is greater than 1 and no osd:table is declared on this element, it is implemented as a Java List. This type of element is called an aggregated list, as opposed to a table.
It is important to keep in mind that access to aggregated lists is not specifically optimized, whether for iteration, user interface display, etc. Besides performance concerns, aggregated lists lack many of the features supported by tables. See the tables introduction for a list of these features.
For the reasons stated above, aggregated lists should be used only for small volumes of simple data (one or two dozen records), with no advanced requirements for their identification, lookups, permissions, etc. For larger volumes of data (or more advanced functionality), it is recommended to use osd:table declarations.
Dataspaces, available in semantic mode, are an invaluable tool for managing complex data life cycles. While this feature brings great flexibility, it also implies a certain overhead, which should be taken into account when optimizing usage patterns. This section reviews the most common performance issues that can appear in the case of intensive use of many dataspaces containing large tables, and how to avoid them.
Sometimes, the use of dataspaces is not strictly needed. As an extreme example, consider the case where every transaction triggers the following actions:
A dataspace is created.
The transaction modifies some data.
The dataspace is merged, closed, then deleted.
In this case, no future references to the dataspace are needed, so using one to make isolated data modifications is unnecessary: a Procedure already provides sufficient isolation to avoid conflicts from concurrent operations. It would then be more efficient to perform the modifications directly in the target dataspace, and get rid of the steps related to branching and merging.
For a developer-friendly analogy, consider a source-code management tool (CVS, SVN, etc.): when you need to perform a simple modification impacting only a few files, it is probably sufficient to do so directly on the main branch. It would be neither practical nor sustainable, with regard to file tagging/copying, if every file modification involved branching the whole project, modifying the files, then merging the dedicated branch.
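Concretely, a minimal sketch of running the updates directly in the target dataspace through a Procedure is shown below; the API names are paraphrased from the EBX® service API and should be verified against your version.

```java
import com.onwbp.adaptation.AdaptationHome;
import com.orchestranetworks.service.Procedure;
import com.orchestranetworks.service.ProcedureContext;
import com.orchestranetworks.service.ProcedureResult;
import com.orchestranetworks.service.ProgrammaticService;
import com.orchestranetworks.service.Session;

public class DirectUpdateExample {

    // Runs the updates directly in the target dataspace: the Procedure
    // itself is transactional, so no child dataspace, merge, or close is
    // needed for a simple modification.
    public static void runInTargetDataspace(Session session, AdaptationHome targetDataspace) {
        Procedure procedure = new Procedure() {
            @Override
            public void execute(ProcedureContext pContext) throws Exception {
                // ... perform creations/modifications via pContext here ...
            }
        };
        ProgrammaticService service = ProgrammaticService.createService(session, targetDataspace);
        ProcedureResult result = service.execute(procedure);
        if (result.hasFailed()) {
            throw new IllegalStateException("Update procedure failed", result.getException());
        }
    }
}
```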
When a table is in semantic mode (the default), the EBX® Java memory cache is used. It ensures much more efficient access to data when the data is already loaded in the cache. However, if there is not enough space for working data, swaps between the Java heap space and the underlying database can heavily degrade overall performance.
This memory swap overhead can only occur for tables in a dataspace with an on-demand loading strategy.
Such an issue can be detected by looking at the monitoring log file. If it occurs, various actions can be considered:
reducing the number of child dataspaces that contain large tables;
reducing the number of indexes specifically defined for large tables;
using relational mode instead of semantic mode;
or (obviously) allocating more memory, or optimizing the memory used by applications for non-EBX® objects.
In semantic mode, when a transaction has performed updates in the current dataspace and then aborts, the loaded indexes of the modified tables are reset. If updates on a large table are often cancelled while the table is intensively accessed, the work related to index rebuilds will slow down access to the table; moreover, the induced memory allocation and garbage collection can reduce overall performance.
As with any database, inserting and deleting large volumes of data may lead to fragmented data, which can deteriorate performance over time. To resolve the issue, reorganizing the impacted database tables is necessary. See Monitoring and cleanup of the relational database.
A specificity of EBX® is that creating dataspaces and snapshots adds new entries to the HTA and ATB tables. For large repositories in which many dataspaces are created and deleted, it may be necessary to schedule a reorganization of these tables when poor performance is experienced.
The administrator can specify the loading strategy of a dataspace or snapshot in its information. The default strategy is to load and unload the resources on demand. For resources that are heavily used, a forced load strategy is usually recommended.
The following table details the loading modes available in semantic mode. Note that the application server must be restarted for any loading strategy change to take effect.
| Loading strategy | Description |
|---|---|
| On-demand loading and unloading | In this default mode, each resource in a dataspace is loaded or built only when it is needed, and the resources of the dataspace are "soft"-referenced using the standard Java soft reference mechanism. The main advantage of this mode is the ability to free memory when needed. As a counterpart, this implies a load/build cost when an accessed resource has not yet been loaded since the server started up, or if it has been unloaded since. |
| Forced loading | If the forced loading strategy is enabled for a dataspace or snapshot, its resources are loaded asynchronously at server startup. Each resource of the dataspace is maintained in memory until the server is shut down or the dataspace is closed. This mode is particularly recommended for long-lived and/or heavily used dataspaces, namely any dataspace that serves as a reference. |
| Forced loading and prevalidation | This strategy is similar to forced loading, except that the content of the loaded dataspace or snapshot is also validated at server startup. |
Indications of EBX® load activity are provided by monitoring the underlying database, and also by the 'monitoring' logging category.
If the numbers for cleared and built objects remain high for a long time, this is an indication that EBX® is swapping.
To facilitate the analysis of logs generated by EBX®, you can use the provided OpenOffice spreadsheet.
The maximum size of the memory allocation pool is usually specified using the Java command-line option -Xmx. As is the case for any intensive process, it is important that the size specified by this option not exceed the available physical RAM, so that the Java process does not swap to disk at the operating-system level.
Tuning the garbage collector can also benefit overall performance. This tuning should be adapted to the use case and specific Java Runtime Environment used.
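For illustration, a typical set of options might look as follows; the values are placeholders to size against the host's physical RAM and the observed workload, and G1 is only one possible collector choice.

```sh
# Illustrative JVM settings for an EBX® server (placeholder values):
# fixed heap bounds avoid resizing pauses; G1 targets short GC pauses.
java -Xms8g -Xmx8g \
     -XX:+UseG1GC \
     -XX:MaxGCPauseMillis=200 \
     ...   # remaining application-server options
```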
The internal incremental validation framework optimizes the work required when updates occur. The incremental validation process behaves as follows:
The first call to a dataset validation report performs a full validation of the dataset. The loading strategy can also specify a dataspace to be prevalidated at server startup.
Data updates transparently and asynchronously maintain the validation report, insofar as the updated nodes specify explicit dependencies. For example, standard and static facets, foreign key constraints, dynamic facets, and selection nodes specify explicit dependencies.
If a mass update is executed or if there are too many validation messages, the incremental validation process is stopped. The next call to the validation report will then trigger a full validation.
If a transaction is cancelled, the validation state of the updated dataset is reset. The next call to the validation report will trigger a full validation as well.
Certain nodes are systematically revalidated, however, even if no updates have occurred since the last validation. These are the nodes with unknown dependencies. A node has unknown dependencies if:
It specifies a programmatic constraint in the default unknown dependencies mode;
It declares a computed value, or it declares a dynamic facet that depends on a node that is itself a computed value;
It is an inherited field, or it declares a dynamic facet that depends on a node that is itself an inherited field.
Consequently, on large tables (beyond the order of 10⁵ records), it is recommended to avoid nodes with unknown dependencies, or at least to minimize the number of such nodes. For programmatic constraints, the developer can specify two alternative modes that drastically reduce the incremental validation cost: local dependency mode and explicit dependencies. For more information, see Dependencies and validation.
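As a sketch, a programmatic constraint narrowing its dependency mode could look like the following; the dependency-declaration call in setup is an assumption to be verified against the Dependencies and validation documentation for your EBX® version.

```java
import java.util.Locale;

import com.orchestranetworks.instance.ValueContextForValidation;
import com.orchestranetworks.schema.Constraint;
import com.orchestranetworks.schema.ConstraintContext;
import com.orchestranetworks.schema.ValueContext;

// Sketch: a constraint that only depends on the value of its own node.
// Declaring a narrowed dependency mode (instead of the default "unknown
// dependencies" mode) lets incremental validation skip this node unless
// the node itself is modified.
public class PositiveAmountConstraint implements Constraint<Integer> {

    @Override
    public void setup(ConstraintContext aContext) {
        // Assumed call declaring a local dependency; verify the exact
        // method name against your EBX® version.
        aContext.setDependencyToLocalNode();
    }

    @Override
    public void checkOccurrence(Integer aValue, ValueContextForValidation aContext) {
        if (aValue != null && aValue.intValue() < 0) {
            aContext.addError("The amount must be positive.");
        }
    }

    @Override
    public String toUserDocumentation(Locale aLocale, ValueContext aContext) {
        return "The amount must be positive.";
    }
}
```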
It is possible for an administrator user to manually reset the validation report of a dataset. This option is available from the validation report section in EBX®.
Mass updates can involve several hundred thousand insertions, modifications and deletions. These updates are usually infrequent (typically initial data imports) or are performed non-interactively (nightly batches). Thus, performance for these updates is less critical than for frequent or interactive operations. However, as with classic batch processing, they raise certain specific issues.
For relational tables, the implementation of insertions, updates and deletions relies on the JDBC batch feature. For large procedures, this can dramatically improve performance by reducing the number of round-trips between the application server and the database engine.
In order to fully exploit this feature, batch mode can be activated on large procedures; see ProcedureContext.setBatch. This disables the explicit check for existence before record insertions, thus reducing the number of queries to the database and making the batch processing even more efficient.
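A minimal sketch of a mass-insertion procedure with batch mode enabled is shown below; the SourceRecord supplier is a hypothetical placeholder for the import source, and the API names are paraphrased from the EBX® API.

```java
import com.onwbp.adaptation.AdaptationTable;
import com.orchestranetworks.service.Procedure;
import com.orchestranetworks.service.ProcedureContext;
import com.orchestranetworks.service.ValueContextForUpdate;

// Sketch: mass insertion with JDBC batching enabled. Note that batch
// mode skips the explicit existence check before each insertion, so it
// should only be used when the records are known to be new.
public class MassInsertProcedure implements Procedure {

    /** Hypothetical source of values, e.g. rows parsed from an import file. */
    public interface SourceRecord {
        void copyTo(ValueContextForUpdate target); // calls target.setValue(...)
    }

    private final AdaptationTable table;
    private final Iterable<SourceRecord> sources;

    public MassInsertProcedure(AdaptationTable table, Iterable<SourceRecord> sources) {
        this.table = table;
        this.sources = sources;
    }

    @Override
    public void execute(ProcedureContext pContext) throws Exception {
        pContext.setBatch(true); // enable JDBC batching for this procedure
        for (SourceRecord source : sources) {
            ValueContextForUpdate vcu = pContext.getContextForNewOccurrence(table);
            source.copyTo(vcu);
            pContext.doCreateOccurrence(vcu, table);
        }
    }
}
```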
It is generally not advised to use a single transaction when the number of atomic updates in the transaction is beyond the order of 10⁴. Large transactions require a lot of resources, in particular memory, from EBX® and from the underlying database.
To reduce transaction size, it is possible to:
Specify the property ebx.manager.import.commit.threshold. However, this property only applies to interactive archive imports performed from the EBX® user interface.
Explicitly specify a commit threshold inside the batch procedure (see the sketch after this list).
Structurally limit the transaction scope by implementing Procedure for a part of the task, and executing it as many times as necessary.
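A minimal sketch of the second option follows, assuming ProcedureContext exposes the commit-threshold setter referenced above; the threshold value is a placeholder.

```java
import com.orchestranetworks.service.Procedure;
import com.orchestranetworks.service.ProcedureContext;

// Sketch: the engine commits automatically every 5000 atomic updates,
// keeping each transaction well below the ~10⁴ guideline above.
// Intermediate commits trade transactional atomicity for lower resource use.
public class ChunkedUpdateProcedure implements Procedure {

    @Override
    public void execute(ProcedureContext pContext) throws Exception {
        pContext.setCommitThreshold(5000); // placeholder value
        // ... perform the mass update here ...
    }
}
```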
On the other hand, specifying a very small transaction size can also hinder performance, due to the persistent tasks that need to be done for each commit.
If intermediate commits are a problem because transactional atomicity is no longer guaranteed, it is recommended to execute the mass update inside a dedicated dataspace. This dataspace will be created just before the mass update. If the update does not complete successfully, the dataspace must be closed, and the update reattempted after correcting the reason for the initial failure. If it succeeds, the dataspace can be safely merged into the original dataspace.
If required, triggers can be deactivated using the method ProcedureContext.setTriggerActivation.
Tables are commonly accessed through EBX®, as well as through the Request API and data services. This access involves a unique set of functions, including a dynamic resolution process. This process behaves as follows:
Inheritance: Inheritance in the dataset tree takes into account records and values that are defined in the parent dataset, using a recursive process. Also, in a root dataset, a record can inherit some of its values from the data model default values, defined by the xs:default attribute.
Value computation: A node declared as an osd:function is always computed on the fly when the value is accessed. See ValueFunction.getValue.
Filtering: An XPath predicate, a programmatic filter, or a record-level permission rule requires a selection of records.
Sort: A sort of the resulting records can be performed.
In order to improve the speed of operations on tables, indexes are managed by the EBX® engine.
EBX® advanced features, such as advanced life-cycle (snapshots and dataspaces), dataset inheritance, and flexible XML Schema modeling, have led to a specialized design for indexing mechanisms. This design can be summarized as follows:
Indexes maintain an in-memory data structure on a whole table.
An index is not persisted, and building it requires loading all table blocks from the database.
Faster access to tables is ensured if indexes are ready and maintained in memory cache. As mentioned above, it is important for the Java Virtual Machine to have enough space allocated, so that it does not release indexes too quickly.
The request optimizer favors the use of indexes when computing a request result.
Only XPath filters are taken into account for index optimization.
Non-primary-key indexes are not taken into account for child datasets.
Assuming the indexes are already built, the impacts on performance are as follows:
If the request does not involve filtering, programmatic rules, or sorting, accessing its first few rows (those fetched by a paged view) is almost instantaneous.
If the request can be resolved without an extra sort step (this is the case if it has no sort criteria, or if its sort criteria match those of the index used for computing the request), accessing the first few rows of a table should be fast. More precisely, the access time depends on the cost of the specific filtering algorithm that is executed when fetching at least 2000 records.
Both cases above guarantee an access time that is independent of the table size, and provide a view sorted by the index used. If an extra sort is required, the time taken by the first access grows with the table size as N·log(N), where N is the number of records in the resolved view.
Paginated requests automatically add the primary key to the end of the specified sort criteria, in order to ensure consistent ordering. Thus, the primary key fields should also be added to the end of any index intended to improve the performance of paginated requests. These include tabular and hierarchical views, as well as drop-down menus for table references.
If indexes are not yet built, or have been unloaded, additional time is required. The build time is O(N·log(N)).
Accessing the table data blocks is required when the request cannot be computed against a single index (whether for resolving a rule, filter or sort), as well as for building the index. If the table blocks are not present in memory, additional time is needed to fetch them from the database.
It is possible to get information through the monitoring and request logging categories.
New record creations and insertions depend on the primary key index. Thus, a creation becomes almost immediate if this index is already loaded.
When computing a request result, the EBX® engine delegates the following to the RDBMS:
Handling of all request sort criteria, by translating them to an ORDER BY clause.
Whenever possible, handling of the request filters, by translating them to a WHERE clause.
Only XPath filters are taken into account for index optimization. If the request includes non-optimizable filters, table rows will be fetched from the database, then filtered in Java memory by EBX®, until the requested page size is reached. This is not as efficient as filtering on the database side (especially regarding I/O).
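As an illustration, the sketch below issues a request whose XPath predicate can be handled on the database side (or against an index); the path and value are placeholders.

```java
import com.onwbp.adaptation.Adaptation;
import com.onwbp.adaptation.AdaptationTable;
import com.onwbp.adaptation.Request;
import com.onwbp.adaptation.RequestResult;

public class FilteringExample {

    // An XPath predicate can be translated to a WHERE clause (or resolved
    // against an index), so filtering happens on the database side instead
    // of in Java memory.
    static void optimizableFilter(AdaptationTable table) {
        Request request = table.createRequest();
        request.setXPathFilter("./lastName = 'Smith'"); // placeholder path and value
        RequestResult result = request.execute();
        try {
            for (Adaptation record; (record = result.nextAdaptation()) != null; ) {
                // ... process the record ...
            }
        } finally {
            result.close(); // always release the underlying resources
        }
    }
}
```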
Information on the transmitted SQL request is logged to the category persistence. See Configuring the EBX® logs.
In order to improve the speed of operations on tables, indexes may be declared on a table at the data model level. This triggers the creation of an index on the corresponding table in the database.
When designing an index aimed at improving the performance of a given request, the same rules apply as for traditional database index design.
In order to improve performance, a fetch size should be set according to the expected size of the request result on a table. If no fetch size is set, the default value is used.
In semantic mode, the default value is 2000.
In mapped mode, the default value is assigned by the JDBC driver: 10 for Oracle and 0 for PostgreSQL.
On PostgreSQL, the default value of 0 instructs the JDBC driver to fetch the whole result set at once, which could lead to an OutOfMemoryError when retrieving large amounts of data. On the other hand, using a fetch size on PostgreSQL invalidates server-side cursors at the end of the transaction. If, in the same thread, you first fetch a result set with a fetch size, then execute a procedure that commits the transaction, accessing the next result will raise an exception.
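A short sketch of setting an explicit fetch size follows, assuming the setter referenced above is exposed on Request; the value is a placeholder to size against the expected result.

```java
import com.onwbp.adaptation.AdaptationTable;
import com.onwbp.adaptation.Request;
import com.onwbp.adaptation.RequestResult;

public class FetchSizeExample {

    static RequestResult queryWithFetchSize(AdaptationTable table) {
        Request request = table.createRequest();
        // An explicit fetch size avoids PostgreSQL's default of materializing
        // the whole result set at once; keep the transaction open while
        // iterating, since a commit invalidates the server-side cursor.
        request.setFetchSize(500); // placeholder value
        return request.execute();
    }
}
```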