Datasets and models#

This section describes the data entities API (datasets and models).

python-mlboardclient allows you to manage datasets and models; in this case mlboardclient acts as a thin layer over another CLI - kdataset.

Managing datasets and models#

Once the mlboardclient is initialized, it can be used to manage datasets (in the examples below, the mlboardclient instance is stored in the mlboard variable).

Note: All methods below are used almost identically for either a dataset or a model. To make a call for a model, pass the argument type='model' instead of type='dataset', which is the default.

List datasets (or models)#

def list(self, workspace, type='dataset')

Lists datasets for the given workspace.

Note: When running inside a project, the workspace name can be taken from the WORKSPACE_NAME environment variable.

Examples:

datasets = mlboard.datasets.list('my-workspace', type='dataset')
print(datasets)
import os

datasets = mlboard.datasets.list(os.environ['WORKSPACE_NAME'], type='dataset')
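
To list models instead, the same method is called with type='model', as described in the note above (a minimal sketch; the workspace name is illustrative):

models = mlboard.datasets.list('my-workspace', type='model')
print(models)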

List versions for a specific dataset (or model)#

def version_list(self, workspace, name, type='dataset')

Lists versions for the given catalog entity.

Note: When running inside a project, the workspace name can be taken from the WORKSPACE_NAME environment variable.

Example:

v_list = mlboard.datasets.version_list('my-workspace', 'my-dataset', type='dataset')
print(v_list)
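
The same call works for a model by passing type='model' (a minimal sketch; the model name is illustrative):

model_versions = mlboard.datasets.version_list('my-workspace', 'my-model', type='model')
print(model_versions)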

Pull (download) dataset (or model)#

def pull(self, workspace, name, version, to_dir, type='dataset', file_name=None):

Downloads the data entity tar archive to the specified location.

If file_name is None, the entity is downloaded to a file named <workspace>-<dataset>.<version>.tar. If to_dir is empty, the current working directory is used.

Example:

mlboard.datasets.pull('my-workspace', 'my-dataset', '1.0.0', '', type='dataset')
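
A variant that downloads the archive to a specific directory under a custom name, using the documented to_dir and file_name parameters (a sketch; the target path and file name are illustrative):

mlboard.datasets.pull(
    'my-workspace', 'my-dataset', '1.0.0',
    '/tmp/data', type='dataset', file_name='my-dataset-1.0.0.tar',
)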

Push (upload) dataset (or model)#

def push(self, workspace, name, version, from_dir, type='dataset', create=False, publish=False, force=False, chunk_size=None, concurrency=None, spec=None):

Pushes the data within the specified directory.

  • If from_dir is empty, then the current working directory is used.
  • If create is True, then the entity will be created if it does not exist.
  • If publish is True, then the entity will be public when created.
  • If force is True, then the entity will be created regardless of some warnings.
  • chunk_size specifies the chunk size for every file in the dataset (default: 1024000).
  • concurrency specifies the number of concurrent connections (defaults to <cores_num * 2>).
  • spec is used only when pushing a model. It is a dict of the model spec for serving (or a compatible JSON string). The client automatically picks up a spec from the ML project if one exists. See more details at Upload a model.

Example:

mlboard.datasets.push('my-workspace', 'my-dataset', '1.0.0', '/model/path', type='dataset')
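
A model push variant that creates the catalog entity on the fly (a minimal sketch; the model name and path are illustrative, and the serving spec, if any, is picked up from the ML project automatically as described above):

# create=True creates the model entity if it does not exist yet
mlboard.datasets.push(
    'my-workspace', 'my-model', '1.0.0', '/model/path',
    type='model', create=True,
)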

Delete dataset (or model)#

def delete(self, workspace, name, type='dataset')

Deletes the specified catalog entity.

Example:

mlboard.datasets.delete('my-workspace', 'my-dataset')
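
To delete a model instead, pass type='model' (a sketch; the model name is illustrative):

mlboard.datasets.delete('my-workspace', 'my-model', type='model')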

Delete version of dataset (or model)#

def version_delete(self, workspace, name, version, type='dataset')

Deletes the specified version of the catalog entity.

Example:

mlboard.datasets.version_delete('my-workspace', 'my-dataset', '1.0.0')
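
The model counterpart again differs only in the type argument (a sketch; the names are illustrative):

mlboard.datasets.version_delete('my-workspace', 'my-model', '1.0.0', type='model')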