Reading Your Data Using Direct Connection in a Notebook
This option reads a table directly from the source, lets you perform operations on it, and then writes the output back to a table. It is more efficient than the Import Dataset into Notebook option because Python extracts the data directly from the source rather than going through the Chorus REST APIs. You can also use this option to read or write files and expose the notebook as a Python Execute operator that downstream operators can use in a workflow. There is no fixed limit on the size of the data set that can be read in this mode; because the entire data set is loaded in memory, the practical limit is the memory available to the notebook process or container.
Perform the following steps to read your data in a notebook using a direct connection.
- Procedure
- Associate the data set with your workspace.
- From the menu, select Data, and then select Read Dataset using Direct Connection.
- A SELECT DATASETS TO IMPORT INTO NOTEBOOK dialog appears. Select the data set to import, and then click Import.
- A snippet of code is inserted into your notebook; it handles the communication between the data source and your notebook. To read more data sets, repeat step 1 through step 3 (the maximum is 3 inputs to a Python Execute operator). To run this code, press Shift+Enter or click Run. The commands use the object `cc`, an instance of the class `ChorusCommander` created with the parameters required for its methods to work correctly.
- If you run `ds79_purchase` (the name of the data set imported into your notebook) in the Python notebook, it shows a preview of the imported data. You can now run other commands by referring to the comments in the inserted code.
- The generated code detects and sets the appropriate `jdbc_driver` argument in the `cc.read_input_table` method call, which enables the direct connection mode. For some data sources that are unknown to Chorus, the JDBC driver class name set in `jdbc_driver` may not be detected correctly; change it if it is not what the data source expects. You can also pass the `jdbc_url` argument to explicitly override the JDBC URL.
- To use the notebook as a Python Execute operator, change the `use_input_substitution` parameter from `False` to `True` and add the `execution_label` parameter for each data set to be read. The `execution_label` value should start with the string `'1'`, followed by `'2'` and `'3'` for subsequent data sets. For more information, see `help(cc.read_input_table)`.
- The generated `cc.read_input_table` method call returns a pandas DataFrame. You can modify, copy, or perform any other operations on the DataFrame as required.
- Once the required output DataFrame has been created, write it to a target table using `cc.write_output_table`. To enable the use of the output in downstream operators, set `use_output_substitution=True`. Use the same `jdbc_driver` argument as in `read_input_table` to use direct connection mode for the write. Set the `drop_if_exists` parameter to `True` to drop any existing table with the same name as the target. For more information, see `help(cc.write_output_table)`. Note: You can use a comma (`,`) as the delimiter argument in `write_output_file` for compatibility with other operators in a workflow, or omit the delimiter argument.
- Run the notebook manually so that the metadata can be determined and the notebook can be used as a Python Execute operator in legacy workflows.
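The read-transform-write pattern described in the steps above can be sketched as follows. This is a minimal, runnable illustration, not the actual generated code: `ChorusCommander` is stubbed here with an in-memory pandas store so the flow can execute outside Chorus, and the table names, sample data, and method signatures beyond the parameters named in this section (`jdbc_driver`, `jdbc_url`, `use_input_substitution`, `execution_label`, `use_output_substitution`, `drop_if_exists`) are illustrative assumptions. In a real notebook, the object `cc` is instantiated for you by the inserted code.

```python
import pandas as pd

# Hypothetical stand-in for ChorusCommander, so this sketch runs
# outside Chorus. A real `cc` reads from and writes to the data source
# over a direct (JDBC) connection instead of a dict of DataFrames.
class StubChorusCommander:
    def __init__(self):
        self._tables = {
            "purchase": pd.DataFrame(
                {"item": ["a", "b"], "price": [10.0, 25.0], "qty": [1, 2]}
            )
        }

    def read_input_table(self, table, jdbc_driver=None, jdbc_url=None,
                         use_input_substitution=False, execution_label=None):
        # Direct connection mode would use jdbc_driver/jdbc_url here;
        # the stub simply returns a copy of the stored DataFrame.
        return self._tables[table].copy()

    def write_output_table(self, df, table, jdbc_driver=None,
                           use_output_substitution=False, drop_if_exists=False):
        # drop_if_exists replaces any existing table with the same name.
        if drop_if_exists:
            self._tables.pop(table, None)
        self._tables[table] = df

cc = StubChorusCommander()  # in Chorus, created by the inserted code

# Read the data set as a pandas DataFrame.
ds79_purchase = cc.read_input_table(
    "purchase",
    use_input_substitution=True,  # expose as an input in a workflow
    execution_label="1",          # first input to the Python Execute operator
)

# Any DataFrame operation can be applied before writing the result back.
ds79_purchase["total"] = ds79_purchase["price"] * ds79_purchase["qty"]

# Write the output table and expose it to downstream operators.
cc.write_output_table(
    ds79_purchase,
    "purchase_out",
    use_output_substitution=True,
    drop_if_exists=True,
)
```

Running the sketch leaves a `purchase_out` table alongside the original `purchase` table in the stub store; in Chorus, the written table becomes available to downstream operators in the workflow.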