Vertica Load Adapter Sample

Summary

This sample application demonstrates how to use a Vertica database in a StreamBase application. Trade and Quote data is loaded into Vertica through the TIBCO StreamBase® Adapter for Vertica, and then queries can be done on the loaded data.

This Sample's Files

The Chronicle for Vertica sample application consists of:

The application, vertica-table.ssql. This is used to create and delete tables used in this sample.
The application, vertica-load.sbapp. This application can read trade or quote CSV files and load them into a Vertica table. It also produces load performance metrics.
The application, vertica-read.sbapp. This application does a number of different queries to the Vertica database, and produces read performance metrics.
An feed simulation configuration file, simple-load.sbfs. This feed simulator generates TAQ-like data for just over seven thousand symbols. The TAQ data is mostly uniform distribution of values over some reasonable range for each field.
A sample Trade CSV file and a sample Quote CSV file. These contain just a few records of fictional data and are included to demo how to load Vertica from CSV files.
A StreamBase configuration file with a Chronicle for Vertica data source partially configured.
Several other modules that are used in the load and read applications.

A Vertica JDBC driver (vertica.jar) is required in order to connect to the Vertica DB. The JAR file can be found in your Vertica database installation at /opt/vertica/client/JDBC/jars/vertica.jar.

Preliminary Configuration

Before you can run this sample you must have a supported Vertica database installation available. You will need your database administrator to provide the following information:

- The JDBC connect string to access your Vertica database server. For example:
  
  jdbc:vertica://192.168.58.128:5433/vertica
- The username and password for logging into the Vertica server.
Verify that your StreamBase Chronicle system can establish a connection with the Vertica database. Issue a ping command, using the IP address embedded in the JDBC connect string provided to you.
Import the Chronicle for Vertica sample into StreamBase Studio, as described above.

Double-click sbd.sbconf to open the file for editing in Studio, and make the following changes:

Locate the <data-sources> element in the file. That section will look like the following example:

<data-source name="vertica"  type="chronicle">

    <uri value="jdbc:vertica://MyHost:5433/chron"/>

    <driver value="com.vertica.Driver"/>
    <param name="user" value="YourNameHere"/>
    <param name="password" value="YourPasswordHere"/>
    <param name="jdbc-share-connection" value="true"/>
    <param name="jdbc-quote-strings" value="true"/>
</data-source>

<operator-parameters>
    <operator-parameter name="VerticaHostName1" value="localhost"/>
    <operator-parameter name="VerticaHostPort1" value="5433"/>
    <operator-parameter name="VerticaUserName" value="dbadmin"/>
    <operator-parameter name="VerticaPassword" value=""/>
</operator-parameters>

Replace the uri value and the username and password keywords with values supplied by your database administrator. The <data-source> section is used by the Chronicle connection.
Update the host name, port number, username and password in the <operator-parameters> section. This section is used by the Vertica Load adapter.

Copy the Vertica JDBC driver (vertica.jar) to the sample directory. The JAR file can be found in your Vertica database installation at /opt/vertica/client/JDBC/jars/vertica.jar.

Vertica Load Adapter Overview

The Vertica load adapter provides the fastest way to load data into Vertica. There are two different basic load types, streaming load and simple buffered load.

Each bulk load to Vertica creates what is called a miniROS. These miniROSs are merged into larger and larger ROSs as defined by Vertica configuration parameters. Creating many small miniROSs is expensive for Vertica, since there is more merging required. Creating fewer, larger, miniROSs is much more efficient.

Streaming loads offer the highest overall load rates into Vertica because it allows you create very large miniROS. Simple buffered loads will create a miniROS no larger then the configured buffer size. Simple buffered loads should have two buffers configured so that, while one is busy being written to Vertica, the other can continue to be filled by the StreamBase application. This limits the size of the miniROS that a simple buffered load can create to less then half the available main memory — a few gigabytes at the most.

In contrast, streaming loads do not create a miniROS until a flush occurs. Flushes can be configured to occur at fixed intervals or forced to occur by sending a flush tuple, or both. So with streaming loads, miniROS size can be made large — many tens of gigabytes, or even larger.

One advantage of simple buffered loading is that StreamBase holds all the data until the flush has completed successfully. If the Vertica database becomes unreachable for some reason, the flush fails but enters a reconnect and retry loop. Assuming the database becomes reachable again, and tables of interest still exist, the flush subsequently succeeds and no data is lost. With streaming loads, StreamBase discards data as each streaming write completes. If the Vertica database becomes unreachable or terminates unexpectedly after some number of streaming writes, all the data since the last flush may be lost.

Streaming Load Adapter Configuration considerations

In Streaming Load mode, the buffer size can be modest, perhaps only large enough for hold a few seconds of data. A buffer of 10 to 50 megabytes is often enough. There should be enough buffers configured to accommodate data rate spikes, a few hundred megabytes total buffering (size of buffer times number of buffers) is often enough.

The maximum number of concurrent writers can be adjusted to match a combination the incoming arrival rate, and the Vertica database absorption rate. Often a single writer is enough. Vertica best practices suggests two writers for each database for maximum throughput. The principal downside to having more writers is that each writer fills a new and different miniROS. So for a fixed data arrival rate, and fixed flush interval, the size of the miniROSs created is divided by the number of concurrent writers configured. Thus, using multiple concurrent writers is recommended only for high data rates or long flush intervals.

Simple Buffering Load Adapter Configuration considerations

With simple buffer loading, the buffer size should be as large as possible, given the amount of main memory available. Configure two buffers, because one buffer is unavailable while the data buffer is being written to Vertica. So the buffer size should be half the amount of memory set aside for buffering. Generally, there need only be one concurrent writer configured.

Vertica Load Adapter Properties

The Vertica load adapter provides the fastest way to load data into Vertica. Refer to the Load Adapter overview for guidelines on the best way to configure the adapter.

Vertica HostName

The name or IP address of the computer running the Vertica database.

Vertica Port

The port number the Vertica database is configured to listen for JDBC on.

DB Name

The name of the Vertica database you wish to load.

User Name

The user name the Vertica JDBC connection excepts.

Password

If the Vertica database was created with a password, supply it here. If there is no password, just leave this blank.

Table Name

The name of the table within the database you wish to load.

Streaming Load

If true, the Vertica Streaming load API is used. See load adapter overview for more information.

Buffer Size

The size, in bytes, of each buffer used to accrue data before it is sent.

Maximum Number of Buffers

The maximum number of buffers that will be allocated. When no buffers are available, the enqueuers will block waiting for a free buffer.

Convert Double Fields

Allow a StreamBase double to be written to a Vertica long. The double value is rounded.

Flush Interval

The maximum number of seconds data will be leave buffered before being sent. If a buffer fills before this interval, it is sent and the interval is reset.

Direct Copy

This is normally true for bulk loads. If set, data is written to the Read Optimized Store (ROS). This is fastest way to load data.

Connection Timeout

The number of milliseconds to wait for a JDBC connection before returning an error.

Reconnect Interval

The number of milliseconds to wait before trying to reconnect to the database, when the connection was previously interrupted.

Enable Status Output Stream

If true, an additional output port is supplied that emits an informational tuple just before a buffer is sent to Vertica, just after a buffer is sent, and when a flush completes. This tuple has the following 3 Int fields:

Type - 0 = buffer write started, 1 = buffer write completed, 2 = flush completed

ID - A number identifying the committer for this message. The first committer ID is 0.

Count - set to the number of tuples in the buffer being sent for a type "0"; 0 for other types

Enable Flush Request Input Stream

If true, an additional input port is provided. When a tuple is received on this port, the currently filling buffer is sent immediately. Any Flush Interval time is restarted.

Enable Changing Table Name

If true, an additional input port is provided that allows you to change the name of the table name to be loaded. Sending a tuple with a string field called "table_name" to the port will cause a flush of all outstanding buffers, the next tuples will be sent to the new table name. This table must have the exact same schema as the original. This feature can be used to "roll" tables on a programmatic basis.

Max Concurrent Writes

The number of buffer flush threads to start. Each thread can take a buffer and write it to the database.

Verbose

Turn on informational notices about important state adapter state information. Only meant for debugging purposes.

Running This Sample in StreamBase Studio

First, perform the steps in Preliminary Configuration.
You will need to create the table to load. In the Package Explorer, double-click to open the vertica-table.sbapp application. Make sure the application is the currently active tab in the EventFlow Editor.
Click the Run button. It may take a few moments to connect to the Vertica database. When it does, Studio switches to the SB Test/Debug perspective and starts the application.
Select the input stream create_db, and set the field create_db to 1. Send data. The table TicTest now exists in the DB. You can delete the table by sending any int to the drop_db stream.
When done, press F9 or click the Stop Running Application button.
In the Package Explorer, double-click to open the vertica-load.sbapp application. Make sure the application is the currently active tab in the EventFlow Editor.
Click the Run button. It may take a few moments to connect to the Vertica database. When it does, Studio switches to the SB Test/Debug perspective and starts the application.
Select the quoteFile input stream. In the fileName field type quotes.csv and hit Send Data. This will read all data from the quotes.csv file and begin loading into Vertica. The data is buffered and will not be written immediately.
Select the tradeFile input stream. In the fileName field type trades.csv and hit Send Data. This will read all data from the trades.csv file and begin loading into Vertica. The data is buffered and will not be written immediately.
When done, press F9 or click the Stop Running Application button.
In the Package Explorer, double-click to open the vertica-read.sbapp application. Make sure the application is the currently active tab in the EventFlow Editor.
Click the Run button. It may take a few moments to connect to the Vertica database. When it does, Studio switches to the SB Test/Debug perspective and starts the application.
Select the readAll input stream, set the go field to any int value, then click Send Data. The data written from the trades.cvs and quotes.csv appears in the application output view.
When done, press F9 or click the Stop Running Application button.

Importing This Sample into StreamBase Studio

In StreamBase Studio, import this sample with the following steps:

From the top menu, click File → Load StreamBase Sample.
Select this sample from the Chronicle list.
Click OK.

StreamBase Studio creates a single project containing the sample files.

Sample Location

When you load the sample into StreamBase Studio, Studio copies the sample project's files to your Studio workspace, which is normally part of your home directory, with full access rights.

Important

Load this sample in StreamBase Studio, and thereafter use the Studio workspace copy of the sample to run and test it, even when running from the command prompt.

Using the workspace copy of the sample avoids the permission problems that can occur when trying to work with the initially installed location of the sample. The default workspace location for this sample is:

studio-workspace/sample_vertica

See Default Installation Directories for the location of studio-workspace on your system.

In the default TIBCO StreamBase installation, this sample's files are initially installed in:

streambase-install-dir/sample/vertica

See Default Installation Directories for the location of streambase-install-dir on your system. This location may require administrator privileges for write access, depending on your platform.