openlayer.OpenlayerClient.add_dataframe
- OpenlayerClient.add_dataframe(dataset_df, dataset_config_file_path, project_id=None, force=False)
Adds a dataset to a project’s staging area (from a pandas DataFrame).
- Parameters
- dataset_df : pd.DataFrame
Dataframe containing your dataset.
- dataset_config_file_path : str
Path to the dataset configuration YAML file.
What’s in the dataset config file?
The dataset config YAML file must have the following fields (an illustrative config is sketched at the end of this list):
- columnNames : List[str]
List of the dataset’s column names.
- classNames : List[str]
List of class names indexed by label integer in the dataset. E.g. [negative, positive] when [0, 1] are in your label column.
- labelColumnName : str
Column header in the dataframe containing the labels.
Important
The labels in this column must be zero-indexed integer values.
- label : str
Type of dataset. E.g. 'training' or 'validation'.
- featureNames : List[str], default []
List of input feature names. Only applicable if your task_type is TaskType.TabularClassification or TaskType.TabularRegression.
- textColumnName : str, default None
Column header in the dataframe containing the input text. Only applicable if your task_type is TaskType.TextClassification.
- predictionsColumnName : str, default None
Column header in the dataframe containing the model’s predictions as zero-indexed integers. Only applicable if you are uploading a model as well with the add_model method. This is optional if you provide a predictionScoresColumnName.
Important
The values in this column must be zero-indexed integer values.
- predictionScoresColumnName : str, default None
Column header in the dataframe containing the model’s predictions as lists of class probabilities. Only applicable if you are uploading a model as well with the add_model method. This is optional if you provide a predictionsColumnName.
Important
Each cell in this column must contain a list of class probabilities. For example, for a binary classification task, the column with the predictions should look like this:
prediction_scores
[0.1, 0.9]
[0.8, 0.2]
...
- categoricalFeatureNames : List[str], default []
A list containing the names of all categorical features in the dataset. E.g. ["Gender", "Geography"]. Only applicable if your task_type is TaskType.TabularClassification or TaskType.TabularRegression.
- language : str, default 'en'
The language of the dataset in ISO 639-1 (alpha-2 code) format.
- sep : str, default ','
Delimiter to use. E.g. '\t'.
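Putting these fields together, a config for a tabular classification validation set that also carries model predictions might look like the sketch below (the column names and file path are illustrative, not required by Openlayer):
>>> import yaml
>>>
>>> dataset_config = {
...     'columnNames': ['CreditScore', 'Geography', 'Balance', 'Churned',
...                     'predictions', 'prediction_scores'],
...     'classNames': ['Retained', 'Churned'],
...     'labelColumnName': 'Churned',
...     'label': 'validation',
...     'featureNames': ['CreditScore', 'Geography', 'Balance'],
...     'categoricalFeatureNames': ['Geography'],
...     'predictionsColumnName': 'predictions',  # zero-indexed integer predictions
...     'predictionScoresColumnName': 'prediction_scores',  # lists of class probabilities
... }
>>>
>>> with open('/path/to/dataset_config.yaml', 'w') as f:
...     yaml.dump(dataset_config, f)
The two prediction columns are only relevant if a model is uploaded alongside the dataset with add_model, and providing either one of them is enough.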
- force : bool
If add_dataframe is called when there is already a dataset of the same type in the staging area, the existing staged dataset will be overwritten by the new one when force=True. When force=False, the user will be prompted to confirm the overwrite.
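For example, to replace a training set that is already staged without being prompted, the call might look like this (a sketch reusing the df and config path from the examples below):
>>> project.add_dataframe(
...     dataset_df=df,
...     dataset_config_file_path='/path/to/dataset_config.yaml',
...     force=True,  # overwrite the dataset of the same label already in the staging area
... )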
Notes
Please ensure your input features are strings, ints or floats.
Please ensure your label column name is not contained in feature_names.
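A quick sanity check along these lines might look as follows (a sketch, using the tabular df and feature names from the examples below):
>>> feature_names = ['CreditScore', 'Geography', 'Balance']
>>> assert 'Churned' not in feature_names  # the label column must not be listed as a feature
>>> assert all(dtype.kind in 'Oif' for dtype in df[feature_names].dtypes)  # str (object), int, or float only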
Examples
First, instantiate the client:
>>> import openlayer
>>> client = openlayer.OpenlayerClient('YOUR_API_KEY_HERE')
Create a project if you don’t have one:
>>> from openlayer.tasks import TaskType
>>> project = client.create_project(
...     name="Churn Prediction",
...     task_type=TaskType.TabularClassification,
...     description="My first project!",
... )
If you already have a project created on the platform:
>>> project = client.load_project(name="Your project name")
If your project’s task type is tabular classification…
Let’s say your dataframe looks like the following:
>>> df
   CreditScore Geography    Balance  Churned
0          618    France     321.92        1
1          714   Germany  102001.22        0
2          604     Spain   12333.15        0
Important
The labels in your dataframe must be integers that correctly index into the class_names array that you define (as shown below). E.g. 0 => 'Retained', 1 => 'Churned'.
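If your labels start out as strings instead, one way to get such integer codes is to map them through the class names (a sketch that assumes a raw 'Churned' column holding the strings 'Retained' and 'Churned'):
>>> class_names = ['Retained', 'Churned']
>>> df['Churned'] = df['Churned'].map({name: i for i, name in enumerate(class_names)})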
Write the dataset config YAML file with the variables needed by Openlayer:
>>> import yaml
>>>
>>> dataset_config = {
...     'columnNames': ['CreditScore', 'Geography', 'Balance', 'Churned'],
...     'classNames': ['Retained', 'Churned'],
...     'labelColumnName': 'Churned',
...     'label': 'training',  # or 'validation'
...     'featureNames': ['CreditScore', 'Geography', 'Balance'],
...     'categoricalFeatureNames': ['Geography'],
... }
>>>
>>> with open('/path/to/dataset_config.yaml', 'w') as f:
...     yaml.dump(dataset_config, f)
You can now add this dataset to your project with:
>>> project.add_dataframe(
...     dataset_df=df,
...     dataset_config_file_path='/path/to/dataset_config.yaml',
... )
After adding the dataset to the project, it is staged, waiting to be committed and pushed to the platform. You can check what’s on your staging area with status.
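For instance, a quick check of the staging area might look like this (a sketch; it assumes status takes no arguments):
>>> project.status()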
If you want to push the dataset right away with a commit message, you can use the commit and push methods:
>>> project.commit("Initial dataset commit.")
>>> project.push()
If your task type is text classification…
Let’s say your dataset looks like the following:
>>> df
                            Text  Sentiment
0      I have had a long weekend          0
1  I'm in a fantastic mood today          1
2          Things are looking up          1
Write the dataset config YAML file with the variables needed by Openlayer:
>>> import yaml
>>>
>>> dataset_config = {
...     'columnNames': ['Text', 'Sentiment'],
...     'classNames': ['Negative', 'Positive'],
...     'labelColumnName': 'Sentiment',
...     'label': 'training',  # or 'validation'
...     'textColumnName': 'Text',
... }
>>>
>>> with open('/path/to/dataset_config.yaml', 'w') as f:
...     yaml.dump(dataset_config, f)
You can now add this dataset to your project with:
>>> project.add_dataframe(
...     dataset_df=df,
...     dataset_config_file_path='/path/to/dataset_config.yaml',
... )
After adding the dataset to the project, it is staged, waiting to be committed and pushed to the platform. You can check what’s on your staging area with status. If you want to push the dataset right away with a commit message, you can use the commit and push methods:
>>> project.commit("Initial dataset commit.")
>>> project.push()