openlayer.InferencePipeline.upload_reference_dataframe

InferencePipeline.upload_reference_dataframe(*args, **kwargs)

Uploads a reference dataset (a pandas dataframe) to an inference pipeline.

The reference dataset is the baseline for measuring drift in the inference pipeline. Each type of drift is measured by comparing the production data published to the platform against the reference dataset.

Ideally, the reference dataset should be a representative sample of the training set used to train the deployed model.
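One way to draw such a sample is sketched below, assuming the training set fits in memory as a pandas dataframe. The helper name sample_reference and the toy train_df are illustrative and not part of the Openlayer API. A fixed seed keeps the reference set reproducible between runs.

```python
import pandas as pd


def sample_reference(train_df: pd.DataFrame, n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Return a reproducible random sample of the training set (hypothetical helper)."""
    # Never ask for more rows than the training set has.
    n_rows = min(n_rows, len(train_df))
    # A fixed random_state makes the sample repeatable across runs.
    return train_df.sample(n=n_rows, random_state=seed).reset_index(drop=True)


# Toy training set, invented for illustration.
train_df = pd.DataFrame(
    {
        "CreditScore": [618, 714, 604, 690, 655, 702],
        "Geography": ["France", "Germany", "Spain", "France", "Spain", "Germany"],
        "Balance": [321.92, 102001.22, 12333.15, 0.0, 5400.10, 88000.00],
        "Churned": [1, 0, 0, 1, 0, 0],
    }
)

reference_df = sample_reference(train_df, n_rows=4)
```

A plain random sample may under-represent rare classes. If the label distribution matters for your drift checks, a stratified sample is worth considering instead.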

Parameters:
dataset_df : pd.DataFrame

Dataframe containing the reference dataset.

dataset_config : Dict[str, any], optional

Dictionary containing the dataset configuration. This is not needed if dataset_config_file_path is provided.

What’s in the dataset config?

The dataset configuration depends on the TaskType. Refer to the How to write dataset configs guides for details.

dataset_config_file_path : str, optional

Path to the dataset configuration YAML file. This is not needed if dataset_config is provided.

What’s in the dataset config file?

The dataset configuration YAML depends on the TaskType. Refer to the How to write dataset configs guides for details.

Notes

Is your dataset in a CSV file? You can use the upload_reference_dataset method instead.

Examples

Related guide: How to set up monitoring.

First, instantiate the client and retrieve an existing inference pipeline:

>>> import openlayer
>>>
>>> client = openlayer.OpenlayerClient('YOUR_API_KEY_HERE')
>>>
>>> project = client.load_project(name="Churn prediction")
>>>
>>> inference_pipeline = project.load_inference_pipeline(
...     name="XGBoost model inference pipeline",
... )

With the InferencePipeline object retrieved, you can upload a reference dataset.

For example, if your project’s task type is tabular classification, your dataset might look like the following (stored in a pandas dataframe called df):

>>> df
   CreditScore  Geography    Balance  Churned
0          618     France     321.92        1
1          714    Germany  102001.22        0
2          604      Spain   12333.15        0

Important

The labels in your dataframe must be integers that correctly index into the classNames list that you define in the dataset config (as shown below). For example, 0 => ‘Retained’, 1 => ‘Churned’.
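This requirement can be checked locally before uploading. The sketch below is one possible check, not something the Openlayer client provides: check_labels is a hypothetical helper that verifies the label column holds integers and that every value is a valid index into the list of class names.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype


def check_labels(df: pd.DataFrame, label_column: str, class_names: list) -> None:
    """Raise ValueError if labels cannot index into class_names (hypothetical helper)."""
    labels = df[label_column]
    # Floats or strings such as 1.0 or "Churned" are rejected here.
    if not is_integer_dtype(labels):
        raise ValueError(f"Column {label_column!r} must hold integers, got {labels.dtype}.")
    # Valid indices run from 0 to len(class_names) - 1.
    out_of_range = labels[(labels < 0) | (labels >= len(class_names))]
    if not out_of_range.empty:
        bad_values = sorted(out_of_range.unique().tolist())
        raise ValueError(f"Labels {bad_values} do not index into {class_names}.")


# Toy label column matching the example dataframe.
labels_df = pd.DataFrame({"Churned": [1, 0, 0]})
check_labels(labels_df, "Churned", ["Retained", "Churned"])  # passes silently
```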

Prepare the dataset config:

>>> dataset_config = {
...     'classNames': ['Retained', 'Churned'],
...     'labelColumnName': 'Churned',
...     'featureNames': ['CreditScore', 'Geography', 'Balance'],
...     'categoricalFeatureNames': ['Geography'],
... }
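Before uploading, it can help to confirm that the config and the dataframe agree. The sketch below is an illustrative local check. find_config_problems is a hypothetical helper, and the rule that categorical features must also appear in featureNames is an assumption rather than something stated here.

```python
import pandas as pd


def find_config_problems(df: pd.DataFrame, config: dict) -> list:
    """Return human-readable problems; an empty list means all checks passed."""
    problems = []
    columns = set(df.columns)
    label = config["labelColumnName"]
    if label not in columns:
        problems.append(f"label column {label!r} not in dataframe")
    missing = [name for name in config["featureNames"] if name not in columns]
    if missing:
        problems.append(f"features missing from dataframe: {missing}")
    # Assumption: every categorical feature should also be listed as a feature.
    not_features = [
        name
        for name in config.get("categoricalFeatureNames", [])
        if name not in config["featureNames"]
    ]
    if not_features:
        problems.append(f"categorical features not in featureNames: {not_features}")
    return problems


example_df = pd.DataFrame(
    {
        "CreditScore": [618, 714, 604],
        "Geography": ["France", "Germany", "Spain"],
        "Balance": [321.92, 102001.22, 12333.15],
        "Churned": [1, 0, 0],
    }
)
example_config = {
    "classNames": ["Retained", "Churned"],
    "labelColumnName": "Churned",
    "featureNames": ["CreditScore", "Geography", "Balance"],
    "categoricalFeatureNames": ["Geography"],
}
problems = find_config_problems(example_df, example_config)
```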

You can now upload this reference dataset to your inference pipeline with:

>>> inference_pipeline.upload_reference_dataframe(
...     dataset_df=df,
...     dataset_config=dataset_config,
... )
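If you would rather keep the configuration in a YAML file, the same settings can be written to disk and passed through dataset_config_file_path. The sketch below assumes that PyYAML is installed and that the YAML file uses the same keys as the dictionary form. The temporary directory and file name are illustrative, and the upload call is left as a comment because it needs a live inference pipeline.

```python
import os
import tempfile

import yaml  # PyYAML

dataset_config = {
    "classNames": ["Retained", "Churned"],
    "labelColumnName": "Churned",
    "featureNames": ["CreditScore", "Geography", "Balance"],
    "categoricalFeatureNames": ["Geography"],
}

# Write the config to a YAML file (a temporary directory is used for illustration).
config_path = os.path.join(tempfile.mkdtemp(), "dataset_config.yaml")
with open(config_path, "w") as f:
    yaml.safe_dump(dataset_config, f, sort_keys=False)

# Read the file back to confirm it holds the same configuration.
with open(config_path) as f:
    loaded_config = yaml.safe_load(f)

# The upload would then take the path instead of the dictionary:
# inference_pipeline.upload_reference_dataframe(
#     dataset_df=df,
#     dataset_config_file_path=config_path,
# )
```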