Reading Your Data Using Direct Connection in a Notebook
This option reads a table directly from the source, lets you perform operations on it, and then writes the output back to a table. It is more efficient than the Import Dataset into Notebook option because Python extracts the data directly from the source rather than going through the Chorus REST APIs. You can also use this option to read or write files and expose the notebook as a Python Execute operator that downstream operators can use in a workflow. There is no fixed limit on the size of the data set that can be read in this mode; because the entire data set is loaded in memory, the practical limit is the memory available to the notebook process or container.
Perform the following steps to read your data in a notebook using a direct connection.
- Procedure
- Associate the data set with your workspace.
- From the menu, select Data, and then select Read Dataset using Direct Connection.
- A SELECT DATASETS TO IMPORT INTO NOTEBOOK dialog appears. Select the data set to import, and then click Import.
- A snippet of code is inserted into your notebook; it handles the communication between the data source and your notebook. To read more data sets, repeat step 1 through step 3 (the maximum is 3 inputs to a Python Execute operator). To run this code, press Shift+Enter or click Run. The commands use the object `cc`, an instance of the class `ChorusCommander` created with the parameters required for its methods to work correctly.
- If you run `ds79_purchase` (the name of the data set imported into your notebook) in the Python notebook, it shows a preview of the imported data. You can now run other commands by referring to the comments in the inserted code.
- The generated code detects and sets the appropriate `jdbc_driver` argument in the `cc.read_input_table` method call, which enables the direct connection mode. For some data sources that are unknown to Chorus, the JDBC driver class name set in `jdbc_driver` may not be detected correctly; change it if it is not what the data source expects. You can also pass the `jdbc_url` argument to explicitly override the JDBC URL.
- To use the notebook as a Python Execute operator, change the `use_input_substitution` parameter from `False` to `True` and add the `execution_label` parameter for each data set to be read. The `execution_label` value should start with the string `'1'`, followed by `'2'` and `'3'` for subsequent data sets. For more information, see `help(cc.read_input_table)`.
- The generated `cc.read_input_table` method call returns a pandas DataFrame. You can modify, copy, or perform any other operations on the DataFrame as required.
- Once the required output DataFrame has been created, write it to a target table using `cc.write_output_table`. To enable the use of the output in downstream operators, set `use_output_substitution=True`. Use the same `jdbc_driver` argument as in `read_input_table` to use direct connection mode for the write. Set the `drop_if_exists` parameter to `True` to drop any existing table with the same name as the target. For more information, see `help(cc.write_output_table)`. Note: You can use a comma (`,`) as the delimiter argument in `write_output_file` for compatibility with other operators in a workflow, or omit the delimiter argument.
- Run the notebook manually so that the metadata can be determined and the notebook can be used as a Python Execute operator in legacy workflows.
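The read-transform-write pattern described in the steps above can be sketched as follows. This is a minimal, runnable illustration, not the actual generated code: `ChorusCommander` is stubbed here with an in-memory pandas store so the flow can execute outside Chorus, and the table names, sample data, and method signatures beyond the parameters named in this section (`jdbc_driver`, `jdbc_url`, `use_input_substitution`, `execution_label`, `use_output_substitution`, `drop_if_exists`) are illustrative assumptions. In a real notebook, the object `cc` is instantiated for you by the inserted code.

```python
import pandas as pd

# Hypothetical stand-in for ChorusCommander, so this sketch runs
# outside Chorus. A real `cc` reads from and writes to the data source
# over a direct (JDBC) connection instead of a dict of DataFrames.
class StubChorusCommander:
    def __init__(self):
        self._tables = {
            "purchase": pd.DataFrame(
                {"item": ["a", "b"], "price": [10.0, 25.0], "qty": [1, 2]}
            )
        }

    def read_input_table(self, table, jdbc_driver=None, jdbc_url=None,
                         use_input_substitution=False, execution_label=None):
        # Direct connection mode would use jdbc_driver/jdbc_url here;
        # the stub simply returns a copy of the stored DataFrame.
        return self._tables[table].copy()

    def write_output_table(self, df, table, jdbc_driver=None,
                           use_output_substitution=False, drop_if_exists=False):
        # drop_if_exists replaces any existing table with the same name.
        if drop_if_exists:
            self._tables.pop(table, None)
        self._tables[table] = df

cc = StubChorusCommander()  # in Chorus, created by the inserted code

# Read the data set as a pandas DataFrame.
ds79_purchase = cc.read_input_table(
    "purchase",
    use_input_substitution=True,  # expose as an input in a workflow
    execution_label="1",          # first input to the Python Execute operator
)

# Any DataFrame operation can be applied before writing the result back.
ds79_purchase["total"] = ds79_purchase["price"] * ds79_purchase["qty"]

# Write the output table and expose it to downstream operators.
cc.write_output_table(
    ds79_purchase,
    "purchase_out",
    use_output_substitution=True,
    drop_if_exists=True,
)
```

Running the sketch leaves a `purchase_out` table alongside the original `purchase` table in the stub store; in Chorus, the written table becomes available to downstream operators in the workflow.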