Elasticsearch Overview

Elasticsearch is an open source, distributed, RESTful search engine used to index and search documents. It is fast, returning search results in near real time. All of its features are available through a REST API, which the Statistica Elasticsearch Text Analysis workspace node calls through the .NET clients NEST and Elasticsearch.Net.

The Elasticsearch Text Analysis workspace node in Statistica uses this indexing capability to process unstructured text and extract meaningful numeric indices from it. These indices can then be used as input for various machine learning algorithms in Statistica, much like the output of the Statistica Text Mining module (for more details, refer to Text Mining and Document Retrieval Introductory Overview).

The Elasticsearch Text Analysis workspace node can index and analyze documents in over 36 different languages using a local or remote (including cloud-hosted) instance of the Elasticsearch server. The node can analyze text that resides in a Statistica spreadsheet, in a collection of documents on the local file system, or in an existing Elasticsearch index on the server. To perform an analysis, the user chooses the documents, specifies the analysis, and requests the results to be returned, all in a simple workflow.

In specifying the analysis, the user has two options for the Elasticsearch analyzer used when indexing the documents: a Standard or a Custom analysis specification. An Elasticsearch analyzer is a pipeline with the following processing stages:

  • Character filters: An analyzer may have zero or more. A character filter receives the text as a stream of characters and transforms it by adding, removing, or changing characters.
  • Tokenizers: An analyzer has exactly one tokenizer. It splits the string into terms, for example wherever it encounters whitespace or punctuation.
  • Token filters: An analyzer may have zero or more. A token filter accepts the stream of tokens and modifies it, for example by lowercasing tokens or removing stop words.
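
The three stages above can be sketched in plain Python. This is only an illustration of the pipeline idea, not how Elasticsearch or the Statistica node implements it. The function names, the regular expressions, and the small stop-word list are all assumptions made for the example.

```python
import re

# Hypothetical stop-word list, chosen only for this illustration.
STOP_WORDS = {"the", "a", "an", "is", "of"}


def strip_html(text):
    """Character filter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Tokenizer: split the string into terms at whitespace and punctuation."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Token filter: convert every token to lower case."""
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    """Token filter: drop tokens found in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def analyze(text, char_filters, tokenizer, token_filters):
    """Run zero or more character filters, exactly one tokenizer,
    then zero or more token filters, in that order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "<p>The Quick, brown fox!</p>",
    char_filters=[strip_html],
    tokenizer=tokenize,
    token_filters=[lowercase, remove_stop_words],
)
print(tokens)  # ['quick', 'brown', 'fox']
```

Passing empty lists for the character filters and token filters leaves only the tokenizer, which mirrors the rule that an analyzer has exactly one tokenizer but may have zero filters of either kind.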

In the end, the specification of the analysis can be as flexible as the user wishes. This flexibility is exposed through the Custom analysis specification on the Analyzer tab of the node, where the user can add or remove processing stages as required using the JSON notation supported by Elasticsearch.
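
As a rough illustration of that JSON notation, the sketch below builds a custom analyzer definition as a Python dictionary and prints it as JSON. The char filter html_strip, the tokenizer standard, and the token filters lowercase and stop are built-in Elasticsearch components. The analyzer name my_text_analyzer is hypothetical, and the exact top-level structure that the Analyzer tab expects is not stated here, so treat the layout as an assumption.

```python
import json

# Assumed layout: an "analysis" block as used in Elasticsearch index settings.
# "my_text_analyzer" is a hypothetical name chosen for this example.
custom_analysis = {
    "analysis": {
        "analyzer": {
            "my_text_analyzer": {
                "type": "custom",
                "char_filter": ["html_strip"],    # zero or more character filters
                "tokenizer": "standard",          # exactly one tokenizer
                "filter": ["lowercase", "stop"],  # zero or more token filters
            }
        }
    }
}

spec_json = json.dumps(custom_analysis, indent=2)
print(spec_json)
```

Removing an entry from the char_filter or filter list removes that stage from the pipeline. The tokenizer entry is a single value because an analyzer has exactly one tokenizer.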

Alternatively, you can use the analyzers without modification by selecting the Standard analysis specification on the Analyzer tab of the node.

The Elasticsearch Text Analysis workspace node produces a term frequency matrix and other derived results (such as its singular value decomposition, SVD) as a Statistica report. Apart from Statistica spreadsheets and graphs, the node also produces a TextModel PMML specification of the analysis. The node can also consume this PMML specification for deployment or for scoring new documents.
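
To make the term frequency matrix concrete, here is a minimal sketch in plain Python, assuming the documents have already been reduced to lists of tokens. The sample documents and the function name term_frequency_matrix are invented for the example. The node itself obtains its counts from the Elasticsearch index, which this sketch does not attempt to reproduce.

```python
from collections import Counter

# Hypothetical, already-analyzed documents (one list of tokens per document).
documents = [
    ["quick", "brown", "fox"],
    ["lazy", "brown", "dog"],
    ["quick", "quick", "dog"],
]

# Columns of the matrix: every distinct term, in alphabetical order.
vocabulary = sorted({term for doc in documents for term in doc})


def term_frequency_matrix(docs, vocab):
    """Return one row per document and one column per term,
    where each cell counts how often the term occurs in the document."""
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocab])
    return rows


matrix = term_frequency_matrix(documents, vocabulary)

print(vocabulary)  # ['brown', 'dog', 'fox', 'lazy', 'quick']
for row in matrix:
    print(row)
# [1, 0, 1, 0, 1]
# [1, 1, 0, 1, 0]
# [0, 1, 0, 0, 2]
```

Each row is the numeric representation of one document. A matrix of this shape is the kind of input that a singular value decomposition or a machine learning algorithm would operate on.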