Web Reader Input Adapter

Introduction

The TIBCO StreamBase® Web Reader adapter reads web pages via HTTP GET or POST requests and emits the page contents in a string field of its Data output port.

The adapter can be configured to read a web page on demand when receiving a tuple on its control input port, or to periodically poll the web page configured in its HTTP URL property.

The adapter has multiple samples, described in Web Reader Input Adapter Samples. These samples demonstrate how to perform REST and SOAP requests by using the control port to send header information and SOAP or REST payloads.

Web Reader Properties

This section describes the properties you can set for this adapter, using the various tabs of the Properties view in StreamBase Studio.

General Tab

Name: Use this field to specify or change the component's name, which must be unique in the application. The name must contain only alphabetic characters, numbers, and underscores, and no hyphens or other special characters. The first character must be alphabetic or an underscore.

Adapter: A read-only field that shows the formal name of the adapter.

Class: A field that shows the fully qualified class name that implements the functionality of this adapter. Use this class name when loading the adapter in StreamSQL programs with the APPLY JAVA statement. You can right-click this field and select Copy from the context menu to place the full class name in the system clipboard.

Start with application: If this field is set to Yes or to a module parameter that evaluates to true, an instance of this adapter starts as part of the containing StreamBase Server. If this field is set to No or to a module parameter that evaluates to false, the adapter is loaded with the server, but does not start until you send an sbadmin resume command, or until you start the component with StreamBase Manager. With this option set to No or false, the adapter does not start even if the application as a whole is suspended and later resumed. The recommended setting is selected by default.

Enable Error Output Port: Select this check box to add an Error Port to this component. In the EventFlow canvas, the Error Port shows as a red output port, always the last port for the component. See Using Error Ports and Error Streams to learn about Error Ports.

Description: Optionally enter text to briefly describe the component's purpose and function. In the EventFlow canvas, you can see the description by pressing Ctrl while the component's tooltip is displayed.

Adapter Properties Tab

HTTP URL: The URL of the web page to read. When the control port is enabled, this is the default value used when the input tuple's URL field is null. When the control port is disabled, the URL is polled periodically based on the value of the Poll Frequency property.

HTTP request method: The type of request to send to the HTTP server. The available options are GET and POST. When the control port is enabled, this is the default value used when the input tuple's RequestType field is null.

Charset: The character set used for the connection and for encoding the POST data sent to the server.

Connect timeout: The timeout, in milliseconds, to use when opening a connection. A timeout of zero is interpreted as an unlimited timeout.

Read timeout: The timeout, in milliseconds, to use when reading. A timeout of zero is interpreted as an unlimited timeout.

Use Proxy: Use a proxy server when processing the HTTP request.

Proxy Host: The proxy server host name or IP address.

Proxy Port: The proxy server TCP port number.

Use Default Charset: If selected, the Java platform default character set is used. If cleared, a valid character set name must be specified for the Character Set property.

Character Set: The name of the character set encoding that the adapter uses to read input or write output.

Output a tuple for each line received: Outputs a tuple for each line of data received from the server. This option is used mainly for streaming applications.

Output blank lines: Sends a blank tuple when a blank line is received. Note: This option is available only when outputting a tuple for each line.

Output null tuple on completion: Sends a tuple with all fields set to null when reading is complete. Note: This option is available only when outputting a tuple for each line.

Maintain Line Separator: If enabled, the newline and carriage return characters produced by the server are kept in the output result.

Use basic auth: Enables basic authentication.

Username: When basic authentication is enabled, the username sent to the server.

Password: When basic authentication is enabled, the password sent to the server.

Enable Control Port: Enables a control input port used to request web pages on demand. Selecting this check box disables the Poll Frequency control.

Poll Frequency: The time, in milliseconds, to wait between HTTP GET requests. Ignored if the control port is enabled, in which case web requests are made on demand on receipt of an input tuple.

Enable Pass-Through Fields: Copies all fields of the incoming control tuple to the outgoing data tuple. When enabled, the outgoing data tuple contains a new field called PassThroughFields, which holds the entire contents of the incoming control tuple.

Ignore certificate errors: If enabled, any errors produced by invalid SSL certificates are ignored and the website is processed as normal. Warning: This can expose the connection to man-in-the-middle attacks.

Process As File Download: If enabled, the web page is processed as a binary file download and the Data output field is changed to a blob field.

Unescape HTML results: If enabled, the adapter converts strings containing entity escapes into strings containing the actual Unicode characters corresponding to the escapes.

For example, the string "&lt;Fran&ccedil;ais&gt;" becomes "<Français>". If an entity is unrecognized, it is left alone and inserted verbatim into the result string. For example, "&gt;&zzzz;x" becomes ">&zzzz;x".

Log Level: Controls the level of verbosity the adapter uses to send notifications to the console. This setting can be higher than the containing application's log level. If set lower, the system log level is used. Available values, in increasing order of verbosity, are: OFF, ERROR, WARN, INFO, DEBUG, TRACE, and ALL.
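
The effect of the Unescape HTML results option can be approximated outside the adapter. The following is a minimal Python sketch, not the adapter's implementation. The function name unescape_html is hypothetical, and Python's html.unescape follows HTML5 rules, so it may differ from the adapter in edge cases such as entities written without a trailing semicolon.

```python
import html


def unescape_html(text: str) -> str:
    """Replace recognized HTML entity escapes with their Unicode characters.

    Hypothetical helper for illustration only. Unrecognized named entities
    are left in the result unchanged.
    """
    return html.unescape(text)


if __name__ == "__main__":
    print(unescape_html("&lt;Fran&ccedil;ais&gt;"))  # prints <Français>
    print(unescape_html("&gt;&zzzz;x"))              # prints >&zzzz;x
```

Both inputs are the ones used in the property description above, so the sketch can be compared directly against the documented behavior.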

HTTP Headers Tab

Default HTTP Headers: HTTP headers that are always sent with the web request. If the HTTPHeaders field is supplied on the control port and one of its keys matches a default header, the control port's value replaces the default. Defaults with no matching key are appended to the control port's list.
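
The merge rule for default and control port headers can be sketched as follows. This is an illustration under stated assumptions, not the adapter's code: the function name merge_headers is hypothetical, headers are modeled as (key, value) pairs, and keys are compared with an exact, case-sensitive match, which the adapter may or may not do.

```python
from typing import List, Optional, Tuple

# A header is modeled as a (key, value) pair; this shape is an assumption.
Header = Tuple[str, Optional[str]]


def merge_headers(defaults: List[Header],
                  control: Optional[List[Header]]) -> List[Header]:
    """Combine control port headers with the configured defaults.

    A control port header replaces a default with the same key.
    Defaults whose key is not present are appended to the control list.
    """
    merged = list(control or [])
    control_keys = {key for key, _ in merged}
    for key, value in defaults:
        # Assumption: exact, case-sensitive key comparison.
        if key not in control_keys:
            merged.append((key, value))
    return merged


if __name__ == "__main__":
    defaults = [("Accept", "text/html"), ("User-Agent", "example-agent")]
    control = [("Accept", "application/json")]
    print(merge_headers(defaults, control))
```

In this example the control port's Accept value is kept and the default User-Agent is appended after it.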

Data Tab

URL Encode: If set to true, the Value portion of each URLParams entry and the entire PostData value are URL encoded. If false, no encoding is performed. The control port's URLEncode value, if present, overrides this value.

Default URL Params: A list of key-value parameter pairs to send to the server along with the request. If the request type is GET, this list is added to the end of the URL, with a "?" between the URL and the parameters. If the URLParams field is supplied on the control port and one of its keys matches a default, the control port's value replaces the default. Defaults with no matching key are appended to the control port's list.

Default Post Data: If this value is set, it is sent to the server directly and the URLParams value is ignored. Control port values override these values. If the control port contains URLParams, they are used and this value is ignored.
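
To make the GET case concrete, the following Python sketch builds a request URL from a list of parameters. It is a sketch under assumptions, not the adapter's implementation. The function names are hypothetical, pairs are joined as key=value with "&", encoding uses urllib.parse.quote_plus (form-style encoding, where a space becomes "+"), null values are skipped as described for the control port's URLParams field, and no "?" is added when there are no parameters. The adapter's exact encoding rules may differ.

```python
from typing import List, Optional, Tuple
from urllib.parse import quote_plus

# A parameter is modeled as a (key, value) pair; this shape is an assumption.
Param = Tuple[str, Optional[str]]


def encode_params(params: List[Param], url_encode: bool) -> str:
    """Join parameters as key=value pairs separated by '&'.

    Only the value portion is encoded, and only when url_encode is true.
    """
    parts = []
    for key, value in params:
        if value is None:
            # Null values are ignored; empty strings are allowed.
            continue
        parts.append(f"{key}={quote_plus(value) if url_encode else value}")
    return "&".join(parts)


def build_get_url(url: str, params: List[Param], url_encode: bool) -> str:
    """Append the parameters to the URL, separated by '?'."""
    query = encode_params(params, url_encode)
    return f"{url}?{query}" if query else url


if __name__ == "__main__":
    params = [("q", "a b&c"), ("skipped", None), ("empty", "")]
    print(build_get_url("http://example.com/search", params, True))
    # prints http://example.com/search?q=a+b%26c&empty=
```

With encoding disabled, the same call would place the raw value "a b&c" in the URL, which is why encoding matters for values that contain spaces or reserved characters.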

Concurrency Tab

Use the Concurrency tab to specify parallel regions for this instance of this component, or multiplicity options, or both. The Concurrency tab settings are described in Concurrency Options, and dispatch styles are described in Dispatch Styles.

Caution

Concurrency settings are not suitable for every application, and using these settings requires a thorough analysis of your application. For details, see Execution Order and Concurrency, which includes important guidelines for using the concurrency options.

Description of This Adapter's Ports

The Web Reader adapter's ports are used as follows:

  • Control (input): Tuples enqueued on this port cause the adapter to fetch web pages. The schema for this port has the following fields:

    • URL, string, the HTTP URL to read. If null, the URL is taken from the adapter's HTTP URL property.

    • (Optional) URLParams, List of Tuples, A list of key-value pairs to send to the server along with this request. If the request type is GET, this list is added to the end of the URL field, with a "?" between the URL and the parameters.

      • Key, string, The key of the HTTP parameter.

      • Value, string, The value to send to the server for the given key. Null values are ignored, but empty values are allowed.

    • (Optional) PostData, string, If this value is set, it is sent to the server directly and the URLParams value is ignored.

    • (Optional) URLEncode, boolean, If set to true, the Value portion of the URLParams and the entire PostData value are URL encoded. If null, false is assumed and no encoding is performed.

    • (Optional) HTTPHeaders, List of Tuples, A list of key-value pairs to send to the server as HTTP headers.

      • Key, string, The key of the HTTP header.

      • Value, string, The header value to send to the server for the given key. Null values are ignored, but empty values are allowed.

    • (Optional) RequestType, string, Sets the outgoing HTTP request type. Valid values are "POST" and "GET". Any other value is ignored and the default is used.

  • Status (output): The adapter emits tuples from this port when significant events occur, such as when an attempt to read a web page fails. The schema for this port has the following fields:

    • Type, string: returns one of the following values to convey the type of event:

      • Read

      • UserInput

    • Action, string: returns an action associated with the event Type:

      • Failed

      • Rejected

    • Object, string: returns an event type-specific value, such as the HTTP URL for which a read failed or the control input tuple that was rejected.

    • Message, string: returns a human-readable description of the event.

  • Data (output): Tuples are emitted on this port when web pages are successfully read. The schema for this port has the following fields:

    • Data, string, The contents of the web page.

    • Headers, List<Tuple>, The web page response headers. Each tuple contains a Header (string) field and a Values (List<String>) field holding the values for that header.

    • PassThroughFields, Tuple, This field appears when the Enable Pass-Through Fields option is selected, and contains the entire contents of the incoming control port request.
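
The fallback rules for the control port's URL and RequestType fields can be sketched as follows. This is an illustrative Python sketch, not the adapter's code. The function name resolve_request is hypothetical, the control tuple is modeled as a dictionary, RequestType is matched exactly (case-sensitive) as an assumption, and raising ValueError stands in for the adapter's own handling of a tuple that supplies no usable URL.

```python
from typing import Optional, Tuple

VALID_REQUEST_TYPES = ("GET", "POST")


def resolve_request(control: dict,
                    default_url: Optional[str],
                    default_method: str) -> Tuple[str, str]:
    """Pick the URL and request method for one control tuple.

    default_url and default_method model the adapter's HTTP URL and
    HTTP request method properties.
    """
    url = control.get("URL")
    if url is None:
        # A null URL falls back to the HTTP URL property.
        url = default_url
    if url is None:
        # Stand-in for the adapter's handling of an unusable tuple.
        raise ValueError("null URL and no HTTP URL property configured")

    method = control.get("RequestType")
    if method not in VALID_REQUEST_TYPES:
        # Null or unrecognized values are ignored; the default is used.
        method = default_method
    return url, method


if __name__ == "__main__":
    print(resolve_request({"URL": None, "RequestType": "PUT"},
                          "http://example.com", "GET"))
    # prints ('http://example.com', 'GET')
```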

Typechecking and Error Handling

The Web Reader adapter uses typecheck messages to help you configure the adapter within your StreamBase application. In particular, the adapter generates typecheck messages for the following reasons:

  • The Control Input Port is disabled and no HTTP URL value is provided.

  • The Control Input Port is disabled and an invalid (unspecified or negative) Poll Frequency is specified.

  • The Control Input Port is enabled but is not presented with the required schema.

  • The Use Proxy property is enabled but no Proxy Host or Proxy Port is specified.
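
The property-related checks in the list above can be expressed as a small validation routine. The following Python sketch is only an illustration of those rules. The WebReaderConfig class, its field names, and the message strings are hypothetical and do not correspond to the adapter's actual API or typecheck messages. The control port schema check is omitted because it depends on the connected stream.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WebReaderConfig:
    """Hypothetical model of the properties involved in typechecking."""
    control_port_enabled: bool = False
    http_url: Optional[str] = None
    poll_frequency_ms: Optional[int] = None
    use_proxy: bool = False
    proxy_host: Optional[str] = None
    proxy_port: Optional[int] = None


def typecheck(cfg: WebReaderConfig) -> List[str]:
    """Return one message per violated rule; an empty list means no problems."""
    problems = []
    if not cfg.control_port_enabled:
        if not cfg.http_url:
            problems.append("HTTP URL is required when the control port is disabled")
        if cfg.poll_frequency_ms is None or cfg.poll_frequency_ms < 0:
            problems.append("Poll Frequency must be specified and must not be negative")
    if cfg.use_proxy and (not cfg.proxy_host or cfg.proxy_port is None):
        problems.append("Proxy Host and Proxy Port are required when Use Proxy is enabled")
    return problems


if __name__ == "__main__":
    for message in typecheck(WebReaderConfig(use_proxy=True)):
        print(message)
```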

The adapter generates warning messages during runtime under various conditions, including:

  • A control tuple is received with a null value in its URL field and a value for the adapter's HTTP URL property has not been specified.

  • An error occurs attempting to read a web page.

Suspend and Resume Behavior

When suspended, the adapter stops processing web pages.

When resumed, the adapter once again starts processing web pages.
