openlayer.InferencePipeline.upload_reference_dataset

InferencePipeline.upload_reference_dataset(*args, **kwargs)

Uploads a reference dataset saved as a CSV file to an inference pipeline.

The reference dataset is used to measure drift in the inference pipeline. The different types of drift are measured by comparing the production data published to the platform with the reference dataset.

Ideally, the reference dataset should be a representative sample of the training set used to train the deployed model.
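For instance, a minimal sketch of producing such a sample with pandas, assuming your training data lives in a dataframe named train_df (the name and sampling fraction are illustrative):

>>> # Write a 10% random sample of the (hypothetical) train_df
>>> # dataframe to a CSV file to serve as the reference dataset.
>>> train_df.sample(frac=0.1, random_state=42).to_csv(
...     'reference_dataset.csv', index=False,
... )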

Parameters:
file_path : str

Path to the CSV file containing the reference dataset.

dataset_config : Dict[str, any], optional

Dictionary containing the dataset configuration. This is not needed if dataset_config_file_path is provided.

What’s in the dataset config?

The dataset configuration depends on the TaskType. Refer to the How to write dataset configs guides for details.

dataset_config_file_path : str, optional

Path to the dataset configuration YAML file. This is not needed if dataset_config is provided.

What’s in the dataset config file?

The dataset configuration YAML depends on the TaskType. Refer to the How to write dataset configs guides for details.
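For example, with an InferencePipeline object (see the Examples section below), a call using a config file might look like the following sketch; the paths are illustrative:

>>> inference_pipeline.upload_reference_dataset(
...     file_path='/path/to/dataset.csv',
...     dataset_config_file_path='/path/to/dataset_config.yaml',
... )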

Notes

Is your dataset in a pandas dataframe? You can use the upload_reference_dataframe method instead.
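For instance, a minimal sketch, assuming upload_reference_dataframe accepts the dataframe via a dataset_df argument and the same config arguments described above (inference_pipeline and dataset_config are as in the Examples section below):

>>> import pandas as pd
>>>
>>> df = pd.read_csv('/path/to/dataset.csv')
>>> inference_pipeline.upload_reference_dataframe(
...     dataset_df=df,  # dataset_df is an assumed argument name
...     dataset_config=dataset_config,
... )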

Examples

Related guide: How to set up monitoring.

First, instantiate the client and retrieve an existing inference pipeline:

>>> import openlayer
>>>
>>> client = openlayer.OpenlayerClient('YOUR_API_KEY_HERE')
>>>
>>> project = client.load_project(name="Churn prediction")
>>>
>>> inference_pipeline = project.load_inference_pipeline(
...     name="XGBoost model inference pipeline",
... )

With the InferencePipeline object retrieved, you can upload a reference dataset.

For example, if your project’s task type is tabular classification and your dataset looks like the following:

CreditScore | Geography | Balance   | Churned
----------- | --------- | --------- | -------
618         | France    | 321.92    | 1
714         | Germany   | 102001.22 | 0
604         | Spain     | 12333.15  | 0

Important

The labels in your CSV must be integers that correctly index into the classNames array that you define (as shown below), e.g., 0 => 'Retained' and 1 => 'Churned'.
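If your labels are currently stored as strings, one possible way to convert them before uploading is with pandas; the mapping below mirrors the classNames order defined in the next step:

>>> import pandas as pd
>>>
>>> df = pd.read_csv('/path/to/dataset.csv')
>>> # Map string labels to the integer indices of classNames,
>>> # i.e., 'Retained' -> 0 and 'Churned' -> 1.
>>> df['Churned'] = df['Churned'].map({'Retained': 0, 'Churned': 1})
>>> df.to_csv('/path/to/dataset.csv', index=False)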

Prepare the dataset config:

>>> dataset_config = {
...     'classNames': ['Retained', 'Churned'],
...     'labelColumnName': 'Churned',
...     'featureNames': ['CreditScore', 'Geography', 'Balance'],
...     'categoricalFeatureNames': ['Geography'],
... }

You can now upload this reference dataset to your inference pipeline with:

>>> inference_pipeline.upload_reference_dataset(
...     file_path='/path/to/dataset.csv',
...     dataset_config=dataset_config,
... )