openlayer.OpenlayerClient.add_dataframe

OpenlayerClient.add_dataframe(dataset_df, dataset_config_file_path, project_id=None, force=False)

Adds a dataset to a project’s staging area (from a pandas DataFrame).

Parameters
dataset_df : pd.DataFrame

Dataframe containing your dataset.

dataset_config_file_path : str

Path to the dataset configuration YAML file.

What’s in the dataset config file?

The YAML file with the dataset config must have the following fields:

columnNames : List[str]

List of the dataset’s column names.

classNames : List[str]

List of class names indexed by label integer in the dataset. E.g. ['negative', 'positive'] when [0, 1] are in your label column.

labelColumnName : str

Column header in the dataframe containing the labels.

Important

The labels in this column must be zero-indexed integer values.
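
For instance, if your labels are currently strings, you can map them to zero-indexed integers with pandas before uploading. This is a sketch with illustrative column and class names:

```python
import pandas as pd

# Hypothetical dataframe with string labels
df = pd.DataFrame({"Balance": [321.92, 102001.22], "Churned": ["yes", "no"]})

# Class names; index i corresponds to label integer i
class_names = ["no", "yes"]

# Map each string label to its zero-based index in class_names
df["Churned"] = df["Churned"].map({name: i for i, name in enumerate(class_names)})
# df["Churned"] is now [1, 0]
```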

label : str

Type of dataset. E.g. 'training' or 'validation'.

featureNames : List[str], default []

List of input feature names. Only applicable if your task_type is TaskType.TabularClassification or TaskType.TabularRegression.

textColumnName : str, default None

Column header in the dataframe containing the input text. Only applicable if your task_type is TaskType.TextClassification.

predictionsColumnName : str, default None

Column header in the dataframe containing the model’s predictions as zero-indexed integers. Only applicable if you are uploading a model as well with the add_model method.

This is optional if you provide a predictionScoresColumnName.

Important

The values in this column must be zero-indexed integer values.
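
For example, string predictions can be converted to zero-indexed integers with a mapping that matches the order of your classNames (illustrative names; not part of the Openlayer API):

```python
import pandas as pd

df = pd.DataFrame({"predictions": ["Churned", "Retained"]})

# Must be in the same order as classNames in the dataset config
class_names = ["Retained", "Churned"]

# Replace each predicted class with its index in class_names
df["predictions"] = df["predictions"].map(class_names.index)
# df["predictions"] is now [1, 0]
```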

predictionScoresColumnName : str, default None

Column header in the dataframe containing the model’s predictions as lists of class probabilities. Only applicable if you are uploading a model as well with the add_model method.

This is optional if you provide a predictionsColumnName.

Important

Each cell in this column must contain a list of class probabilities. For example, for a binary classification task, the column with the predictions should look like this:

prediction_scores

[0.1, 0.9]

[0.8, 0.2]

...
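
One way to build such a column is to assign a list of per-row probability lists, e.g. from your model's predicted class probabilities. A minimal sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({"Balance": [321.92, 102001.22]})

# Suppose these are the class probabilities from your model, one list per row
scores = [[0.1, 0.9], [0.8, 0.2]]

# Each cell in the column holds the full list of class probabilities
df["prediction_scores"] = scores
```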

categoricalFeatureNames : List[str], default []

A list containing the names of all categorical features in the dataset. E.g. ["Gender", "Geography"]. Only applicable if your task_type is TaskType.TabularClassification or TaskType.TabularRegression.

language : str, default 'en'

The language of the dataset in ISO 639-1 (alpha-2 code) format.

sep : str, default ','

Delimiter to use. E.g. ‘\t’.

force : bool

If a dataset of the same type is already in the staging area when add_dataframe is called, force=True overwrites the staged dataset with the new one, while force=False prompts the user to confirm the overwrite.

Notes

  • Please ensure your input features are strings, ints or floats.

  • Please ensure your label column name is not contained in featureNames.
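
A quick sanity check along these lines might look as follows (a sketch with illustrative names, not part of the Openlayer API):

```python
import pandas as pd

df = pd.DataFrame({"CreditScore": [618], "Geography": ["France"], "Churned": [1]})
feature_names = ["CreditScore", "Geography"]
label_column = "Churned"

# Features should be strings, ints, or floats:
# dtype kinds 'i' (int), 'f' (float), 'O' (object, e.g. strings)
for col in feature_names:
    assert df[col].dtype.kind in ("i", "f", "O"), f"{col} has an unsupported dtype"

# The label column must not appear among the feature names
assert label_column not in feature_names
```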

Examples

First, instantiate the client:

>>> import openlayer
>>> client = openlayer.OpenlayerClient('YOUR_API_KEY_HERE')

Create a project if you don’t have one:

>>> from openlayer.tasks import TaskType
>>> project = client.create_project(
...     name="Churn Prediction",
...     task_type=TaskType.TabularClassification,
...     description="My first project!",
... )

If you already have a project created on the platform:

>>> project = client.load_project(name="Your project name")

If your project’s task type is tabular classification…

Let’s say your dataframe looks like the following:

>>> df
    CreditScore  Geography    Balance  Churned
0           618     France     321.92        1
1           714    Germany  102001.22        0
2           604      Spain   12333.15        0

Important

The labels in your dataframe must be integers that correctly index into the classNames list you define (as shown below). E.g. 0 => 'Retained', 1 => 'Churned'.

Write the dataset config YAML file with the variables needed by Openlayer:

>>> import yaml
>>>
>>> dataset_config = {
...     'columnNames': ['CreditScore', 'Geography', 'Balance', 'Churned'],
...     'classNames': ['Retained', 'Churned'],
...     'labelColumnName': 'Churned',
...     'label': 'training',  # or 'validation'
...     'featureNames': ['CreditScore', 'Geography', 'Balance'],
...     'categoricalFeatureNames': ['Geography'],
... }
>>>
>>> with open('/path/to/dataset_config.yaml', 'w') as f:
...     yaml.dump(dataset_config, f)

You can now add this dataset to your project with:

>>> project.add_dataframe(
...     dataset_df=df,
...     dataset_config_file_path='/path/to/dataset_config.yaml',
... )

After adding the dataset to the project, it is staged, waiting to be committed and pushed to the platform. You can check what’s in your staging area with the status method. To push the dataset right away with a commit message, use the commit and push methods:

>>> project.commit("Initial dataset commit.")
>>> project.push()

If your task type is text classification…

Let’s say your dataset looks like the following:

>>> df
                            Text  Sentiment
0      I have had a long weekend          0
1  I'm in a fantastic mood today          1
2          Things are looking up          1

Write the dataset config YAML file with the variables needed by Openlayer:

>>> import yaml
>>>
>>> dataset_config = {
...     'columnNames': ['Text', 'Sentiment'],
...     'classNames': ['Negative', 'Positive'],
...     'labelColumnName': 'Sentiment',
...     'label': 'training',  # or 'validation'
...     'textColumnName': 'Text',
... }
>>>
>>> with open('/path/to/dataset_config.yaml', 'w') as f:
...     yaml.dump(dataset_config, f)

You can now add this dataset to your project with:

>>> project.add_dataframe(
...     dataset_df=df,
...     dataset_config_file_path='/path/to/dataset_config.yaml',
... )

After adding the dataset to the project, it is staged, waiting to be committed and pushed to the platform. You can check what’s in your staging area with the status method. To push the dataset right away with a commit message, use the commit and push methods:

>>> project.commit("Initial dataset commit.")
>>> project.push()