LiveView Data Recovery

Recovering Data to the Server

LiveView supports both peer-based and log-based persistence and recovery options, as explained below.

Peer-based Recovery

You can configure peer-based recovery for a LiveView server either to recover from a service interruption or to add a new LiveView server to an existing group of already-running servers. Peer recovery is configured per table, and each table can have the same or a different list of recovery partners. In order for a configured peer to participate in recovery:

  • It must be in the READY state.

  • It must have the exact same table configuration as the table in question.

  • Both the recovering server and the peer must be configured to receive the same published data from the same persistent message bus.

If you set * in the <persistence peer-uri-list>, and the table configuration has a table group defined, peer-based recovery will recover from all servers it finds in the table group. See Table Group Configuration Option for peer-based recovery configuration.
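
For example, a table's recovery partners can be listed explicitly, or the wildcard can be used. The following is a minimal sketch, assuming the lvconf-style <persistence> element referenced above; the host URIs are placeholders, and any other persistence attributes are omitted:

<persistence peer-uri-list="sb://lv-peer1:10000, sb://lv-peer2:10000"/>

To recover from all servers found in the table's table group instead:

<persistence peer-uri-list="*"/>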

LiveView iterates through the peer recovery list until it finds a peer that meets the requirements. A server with peer recovery configured on one or more tables fails to start if successful recovery cannot be completed from a peer.

There are some limitations on how and when peer-based recovery should be used. A recovering server makes one StreamBaseClient connection to the selected recovery partner for each configured snapshot-parallelism region. All the data for each snapshot-parallel region is transferred over the network, which requires very high network bandwidth between recovery partners.

Servers configured to be recovery partners must have extra heap memory allocated to accommodate the demand surge that may accompany a recovery request. An initial estimate is to add between one half and one times the size of each table that may be recovered.
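
For example, under this estimate a recovery partner serving a single 10 GB table would need roughly 5 to 10 GB of additional heap beyond its normal allocation.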

It is a best practice to increase the max-client-pages setting in the server's page-pool element. An initial estimate is to set it to (TableSize / snapshot-parallelism) / 4096.
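
For example, assuming TableSize is expressed in bytes, a server hosting an 8 GB table with snapshot-parallelism of 2 would work out to (8,589,934,592 / 2) / 4096 = 1,048,576 pages.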

It is essential that you stress-test your peer-based recovery configurations in conditions and sizes that exceed expected production conditions and sizes. Misconfigured systems can result in servers being unable to recover from a peer and can also cause the recovery partner to fail.

There is a balance to strike when setting the page size. The larger the page size, the more data can be lost if the server crashes. All lost data is recovered from your configured persistent message bus, but more data generally has to be retrieved with bigger page sizes. On the other hand, log file compression generally improves with larger page sizes, as does I/O efficiency.

Log-based Recovery

In addition to the peer-based option described above, LiveView tables can be configured for persistence such that their contents are saved to disk in log-based files and recovered from these files after a crash or server restart.

When tables are configured to use log-based persistence, the configured log file page size must be greater than the largest tuple ever published to that table.

Any arriving tuple larger than the page size is not logged; the tuple is still published to the LiveView table, but it is lost on recovery. A server error is logged if a tuple larger than the page size is detected and dropped.

Use the following system property to control the log file page size:

name= "engine"
version= "1.0.0"
type= "com.tibco.ep.ldm.configuration.ldmengine"

configuration = {
  LDMEngine = {
    systemProperties = {
      "liveview.persist.pagesizekb" = "64"
      }
   }
}

It is a best practice to keep the page size between 16 and 1024 KB. The default is 64, that is, a 64 KB page size.

Periodic Compression of Recovery Files

By default, LiveView persistence log files roll every 12 hours, or when they reach 100 MB in size. The following system properties configure the roll time and size, with their default values shown:

name= "engine"
version= "1.0.0"
type= "com.tibco.ep.ldm.configuration.ldmengine"

configuration = {
  LDMEngine = {
    systemProperties = {
      "liveview.store.rollinterval.s" = "43200"
      "liveview.store.file.max.mb" = "100"
      }
   }
}

The roll size limit is the compressed size of one of the snapshot-parallelism regions. This generally makes correlating the amount of data published to the table with the size of the log file imprecise. Use the following formula to obtain an approximate log size value:

(published data size) / (snapshot-parallelism * 2)
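
For example, assuming the formula yields a per-region size and roughly 2:1 compression, publishing 1 GB of data to a table with snapshot-parallelism of 4 gives approximately 1 GB / (4 * 2) = 128 MB of log per region.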

During recovery from a service interruption, the LiveView Server reads all log files and recovers data to the state it was in at the time of the interruption, as described in the previous section. Sites that cycle LiveView Server periodically (perhaps daily) and delete all log files before the server is restarted, so that the new day's LiveView tables start empty, are unlikely to need additional log file maintenance.

For sites where continuous LiveView Server uptime is required, and where published data is updated and/or deleted as a normal part of LiveView operations, log file disk space consumption may become an issue. For these use cases, there is a command named lv-store-vacuum that you can run on the directory that contains the table's log files. Running this command is an external administrative operation and should be done at an off-peak time. The LiveView Server can be running while the command proceeds, but does not have to be.

The lv-store-vacuum command traverses all rolled log files in the persistence log directory and preserves only current data in a temporary log file. That is, the command reads all .restore files in that directory (but not the currently in-use .db file), drops all deleted rows, and consolidates all updates to a given row into a single entry in the temporary log. This preserves only the most recent value of each row in the temporary log file. When the command finishes, the previously rolled log files are moved to a backup directory, and the just-created temporary log file is renamed so that it is used for the next recovery event.

LiveView administrators must periodically purge old rolled log files from the specified backup directory, following your site's backup and log storage policies.

Publisher Recovery Protocol

Regardless of the recovery option (peer- or log-based), there is a recovery protocol to help publishers identify and recover any data lost between the server's downtime and recovery time. This recovery protocol relies on each row of a given table having a publisher name (PublisherID) and that publisher's sequence number (PublisherSN). Any publisher participating in recovery must supply these two values. The PublisherID must not change for the lifetime of the publishing entity, because it is used across subsequent LiveView server restarts. The PublisherSN must be a monotonically increasing long over the lifetime of the PublisherID; incrementing by exactly one after each publish is highly recommended. A given table may have multiple PublisherIDs, each with its own PublisherSN sequence. A publisher should not update rows added by a different publisher.
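
The following is a minimal Java sketch of this contract (not actual LiveView API code; the SequencedPublisher and PublishedRow types are hypothetical, and the publish plumbing is omitted):

import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration of the PublisherID/PublisherSN contract:
// a stable publisher name plus a monotonically increasing sequence number.
public class SequencedPublisher {
    private final String publisherId;     // must never change for the life of this publisher
    private final AtomicLong publisherSN; // monotonically increasing long

    public SequencedPublisher(String publisherId, long lastPersistedSN) {
        this.publisherId = publisherId;
        // On restart, resume from the sequence number recovered from the server
        // so previously persisted rows are not overwritten.
        this.publisherSN = new AtomicLong(lastPersistedSN);
    }

    // Stamp each outgoing row with the publisher identity and the next SN,
    // incrementing by exactly one per publish, as recommended above.
    public PublishedRow stamp(Object rowData) {
        return new PublishedRow(publisherId, publisherSN.incrementAndGet(), rowData);
    }

    public record PublishedRow(String publisherID, long publisherSN, Object data) {}
}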

Details on the reliable EventFlow publisher streams are given below. Details for the reliable LiveViewClient publisher are given in the Javadoc.

Part of the recovered data is the table’s publisher sequence number. During initialization, an embedded publisher typically retrieves the sequence number for each persistent table and resumes publishing at that point, to avoid overwriting previously written, recovered records.

For publisher initialization, the publisher interface has an input stream named ServerStateIn that LiveView uses to inform the publisher of the server state during initialization. This stream has two fields:

  • Name (string) — can be ignored.

  • CurrentState (int):

    • 0 = Recovery in progress. The publisher may query the server on the QueryTheLastGoodRecordOut port and begin its recovery procedure.

    • 1 = Recovery complete for all publishers.

    • 2 = Server recovery has failed. By default, the server shuts down.

The publisher interface has an Output stream named PublisherStateOut that publishing applications must use to inform the server that they have completed recovery, whether successfully or unsuccessfully. The PublisherStateOut stream has the same fields as ServerStateIn, with these definitions:

  • Name — Must be the name of the publisher.

  • CurrentState:

    • 0 = Recovery in progress.

    • 1 = Recovery completed successfully and normal publishing has begun.

    • 2 = Recovery has failed.

Ideally, publishers report recovery complete when they finish catching up to the current real-time data from their data source. Reporting complete before this time (for example, as soon as an adapter has connected to the data source) means that clients might connect, issue queries to LiveView, and not see up-to-date data.

For publishing EventFlow applications that do not wish to implement recovery at all, use a Map operator between ServerStateIn and PublisherStateOut. In that operator, set the Name field to the name of the publisher, with the CurrentState field set to 1.
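
The following is a minimal Java sketch of the publisher side of this handshake (again, not actual LiveView API code; the stream-emitting and recovery methods are hypothetical stubs):

// Hypothetical sketch of a publisher reacting to ServerStateIn and
// reporting its own state on PublisherStateOut.
public class RecoveryHandshake {
    static final int RECOVERY_IN_PROGRESS = 0;
    static final int RECOVERY_COMPLETE = 1;
    static final int RECOVERY_FAILED = 2;

    private final String publisherName;

    public RecoveryHandshake(String publisherName) {
        this.publisherName = publisherName;
    }

    // Called for each tuple arriving on ServerStateIn.
    public void onServerStateIn(String name, int currentState) {
        switch (currentState) {
            case RECOVERY_IN_PROGRESS -> beginRecovery(); // safe to query QueryTheLastGoodRecordOut
            case RECOVERY_COMPLETE -> { /* all publishers have reported in */ }
            case RECOVERY_FAILED -> { /* by default, the server shuts down */ }
        }
    }

    // Called once this publisher has caught up to live data from its source.
    public void onCaughtUp() {
        emitPublisherStateOut(publisherName, RECOVERY_COMPLETE);
    }

    private void beginRecovery() { /* hypothetical: issue last-good-record queries */ }
    private void emitPublisherStateOut(String name, int state) { /* hypothetical stub */ }
}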

The embedded publisher interface provides two streams for retrieving the publisher sequence number:

  • QueryTheLastGoodRecordOut — An embedded publisher emits a tuple on this output stream to request sequence number information for a specific table. This stream has the following schema:

    • CQSConnectionID (string) — An embedded publisher-provided value that is echoed in the response, allowing the publisher to match requests with responses.

    • PublisherID (string) — The ID of the embedded publisher that was used to publish the existing records to the LiveView table prior to the crash or restart. A null requests sequence number information for all previous publishers of the table.

    • Tablename (string) — The name of the LiveView table for which sequence number information is being requested. A null requests sequence number information for all the tables this publisher is publishing to.

  • TheLastGoodRecordIn — Responses to “query the last good record” requests arrive on this input stream. If the PublisherID was null in the request, a response is received for each publisher of the LiveView table. This stream has the following schema:

    • CQSConnectionID (string) — The value echoed from the query of the last good record.

    • PublisherID (string) — The ID of the publisher for which sequence number information is provided. This value is normally echoed from the query of the last good record. If the PublisherID was null in the request, a response is received for each previous table publisher. A null in this field identifies this as a punctuation tuple, which indicates all recovery information has been returned for the specified table.

    • LowestPublishedSN (long) — The lowest published sequence number available across all the parallel regions comprising the LiveView table.

    • LowestPersistedSN (long) — The lowest persisted sequence number available across all the parallel regions comprising the LiveView table. In recovering after a server restart, an embedded publisher typically resumes publishing from this value.

    • HighestPublishedSN (long) — The highest published sequence number available across all the parallel regions comprising the LiveView table.

    • HighestPersistedSN (long) — The highest persisted sequence number available across all the parallel regions comprising the LiveView table.

    • Tablename (string) — The name of the LiveView table for which sequence number information is provided. This value is normally echoed from the query of the last good record. If the Tablename was null in the request, a response is received for each LiveView table being published to.

The publisher, in requesting last good record information, has the option of specifying:

  • A table name,

  • A publisher ID,

  • Both a table name and a Publisher ID, or

  • Neither a table name nor a publisher ID.

Specifying a null table name in the request retrieves information for all tables being published to, while specifying a null publisher ID requests the last good record for all previous publishers to the table(s).

In response to a “query the last good record” request, LiveView Server returns one or more tuples for each table specified in the request. If the table name is null in the request, the server returns responses for all tables being published to; otherwise it returns responses for just the specified table.

The last response tuple returned by LiveView Server for each table is a punctuation tuple, which does not carry last good record information (all sequence number fields are null) and is identified by a null in the PublisherID field. All tuples returned by the server, including punctuation tuples, have a non-null Tablename field.

Thus, in response to a “query the last good record” request, the publisher should expect either one punctuation tuple, if the Tablename field was non-null in the request, or one punctuation tuple per table the publisher is configured to publish to, if the Tablename field was null in the request. The number of tables the publisher is configured to publish to is equal to the number of top-level fields present in the PublishSchemasIn input stream's schema.
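
The following is a minimal Java sketch of consuming these responses, treating a null PublisherID as the punctuation tuple that ends a table's responses (the class and its plumbing are hypothetical, not LiveView API code):

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: gather last-good-record responses until one
// punctuation tuple (PublisherID == null) has arrived per expected table.
public class LastGoodRecordCollector {
    // Lowest persisted SN per table and publisher, from non-punctuation tuples.
    private final Map<String, Map<String, Long>> lowestPersistedSN = new HashMap<>();
    private final int expectedTables; // 1 if Tablename was non-null in the request
    private int punctuationsSeen = 0;

    public LastGoodRecordCollector(int expectedTables) {
        this.expectedTables = expectedTables;
    }

    // Called for each tuple arriving on TheLastGoodRecordIn.
    // Returns true once responses for all expected tables are complete.
    public boolean onResponse(String publisherID, String tablename, Long persistedSN) {
        if (publisherID == null) {
            punctuationsSeen++; // punctuation: this table's responses are done
        } else {
            lowestPersistedSN
                .computeIfAbsent(tablename, t -> new HashMap<>())
                .put(publisherID, persistedSN);
        }
        return punctuationsSeen == expectedTables;
    }
}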

The following example presents a publish scenario followed by the recovery activity for each of the four “query the last good record” request combinations.

Published tuples:

(Publisher A, 1) -> Table-X
(Publisher A, 2) -> Table-Y 
(Publisher B, 3) -> Table-X 
(Publisher B, 4) -> Table-Y 
(Publisher C, 5) -> Table-X 
(Publisher D, 6) -> Table-Y

If the publisher sends a request with PublisherID == null and Tablename == null, the server returns:

(Publisher A, Table-X, 1) 
(Publisher B, Table-X, 3) 
(Publisher C, Table-X, 5) 
(null, Table-X, null) <- punctuation tuple for Table-X

(Publisher A, Table-Y, 2) 
(Publisher B, Table-Y, 4) 
(Publisher D, Table-Y, 6) 
(null, Table-Y, null) <- punctuation tuple for Table-Y

If the publisher sends a request with PublisherID == Publisher A and Tablename == null, the server returns:

(Publisher A, Table-X, 1) 
(null, Table-X, null)

(Publisher A, Table-Y, 2) 
(null, Table-Y, null)

If the publisher sends a request with PublisherID == null and Tablename == Table-X, the server returns:

(Publisher A, Table-X, 1) 
(Publisher B, Table-X, 3) 
(Publisher C, Table-X, 5) 
(null, Table-X, null) <- punctuation tuple for Table-X

If the publisher sends a request with PublisherID == Publisher A and Tablename == Table-X, the server returns:

(Publisher A, Table-X, 1) 
(null, Table-X, null)

Note

When responding to query the last good record requests for multiple tables (Tablename is null), the response tuples for the tables are generated in parallel and can therefore be interspersed. However, the punctuation tuple is always the last tuple returned for a specific table.