How to register datasets

Learn how to register your datasets to Vectice.

Complete the prerequisites before registering datasets to Vectice. To learn more, view our Getting Started with Vectice API guide.

The Vectice API enables you to register all datasets used during development to the Vectice UI. This includes origin datasets, cleaned datasets, and modeling datasets.

Dataset Registration

Now that you know more about the different dataset types and available Resources, we will walk you through registering your wrapped datasets.

Each wrapped dataset is registered at the step level of an iteration. Use the static method for each dataset type (i.e., Dataset.origin(), Dataset.clean(), or Dataset.modeling()) to register your datasets to Vectice.

To register a dataset, access the step as an attribute of your iteration: use the prefix "step_" followed by the step name, as shown in the Vectice UI or returned by the API command my_iteration.list_steps().

For example, to register data to the "Collect initial data" step, you can access the iteration step by using my_iteration.step_collect_initial_data.
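The naming pattern above can be sketched as a small helper. This is only an illustration: step_attribute is a hypothetical function, not part of the Vectice API, and the normalization rule (lowercase, with spaces and punctuation turned into underscores) is an assumption inferred from the single example above. Check the result against the names returned by my_iteration.list_steps().

```python
import re


def step_attribute(step_name: str) -> str:
    """Return the attribute name for a step, e.g. 'step_collect_initial_data'.

    Assumption: runs of non-alphanumeric characters become one underscore
    and the name is lowercased. Vectice may apply different rules.
    """
    slug = re.sub(r"[^0-9a-zA-Z]+", "_", step_name).strip("_").lower()
    return f"step_{slug}"


print(step_attribute("Collect initial data"))
```

Under that assumption, the helper can be combined with Python's built-in setattr() to assign a dataset to a step whose name is only known at runtime.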

Origin Dataset Registration

To register the columnar data of your origin dataset, use any Resource to register it to your current iteration's step. For this example, we will use FileResource() to register a local origin dataset to our step named "Origin dataset".

from vectice import Dataset, FileResource

# This line registers your origin dataset's metadata
# for your iteration step named "Origin dataset"
my_iteration.step_origin_dataset = Dataset.origin(
    name="Iris Origin",
    resource=FileResource(paths="raw_iris.csv"))

Cleaned Dataset Registration

To register the columnar data of your cleaned dataset, use any Resource to register it to your current iteration's step. For this example, we will use FileResource() to register a local cleaned dataset to our step named "Clean dataset".

from vectice import Dataset, FileResource

# This line registers your cleaned dataset's metadata
# for your iteration step named "Clean dataset"
my_iteration.step_clean_dataset = Dataset.clean(
    name="Iris Cleaned",
    resource=FileResource(paths="iris_cleaned.csv"))

Modeling Dataset Registration

To register the columnar data of your modeling datasets used for training, testing, and validation, use any Resource to register it to your current iteration's step. We will use FileResource() to register this example dataset to our step named "Modeling dataset".

Training and testing resources are required to register a modeling dataset. Validation resources are optional.

from vectice import Dataset, FileResource

training_resource = FileResource(paths="iris_training.csv")
testing_resource = FileResource(paths="iris_testing.csv")
validation_resource = FileResource(paths="iris_validation.csv")  # optional

# This line registers your modeling dataset's metadata
# for your iteration step named "Modeling dataset"
my_iteration.step_modeling_dataset = Dataset.modeling(
    name="Modeling Dataset",
    training_resource=training_resource,
    testing_resource=testing_resource,
    validation_resource=validation_resource)
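The required/optional rule for modeling resources can be checked before anything is sent to Vectice. The sketch below is one possible way to do that: modeling_resource_paths is a hypothetical helper, not a Vectice function, and it only reuses the keyword names shown in the example above.

```python
from typing import Dict, Optional


def modeling_resource_paths(
    training: str,
    testing: str,
    validation: Optional[str] = None,
) -> Dict[str, str]:
    """Map Dataset.modeling() keyword names to file paths.

    Training and testing paths are required; validation is left out
    of the result when it is not provided.
    """
    if not training or not testing:
        raise ValueError("Training and testing paths are both required.")
    paths = {"training_resource": training, "testing_resource": testing}
    if validation:
        paths["validation_resource"] = validation
    return paths


print(modeling_resource_paths("iris_training.csv", "iris_testing.csv"))
```

Each path in the returned dictionary could then be wrapped in FileResource(paths=...) and the result unpacked as keyword arguments into Dataset.modeling().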

Full Code Example

This example demonstrates the full workflow for registering your modeling dataset's metadata to Vectice.

import vectice
from vectice import Dataset, FileResource

# Connect to Vectice
connection = vectice.connect(config="vectice_config.json")

# Select your project phase and start an iteration
modeling_phase = connection.phase("PHA-XXX")
my_iteration = modeling_phase.iteration()

# Wrap your datasets used for training, testing, and validation
training_resource = FileResource(paths="iris_training.csv")
testing_resource = FileResource(paths="iris_testing.csv")
validation_resource = FileResource(paths="iris_validation.csv")

# Register your modeling datasets
my_iteration.step_modeling_dataset = Dataset.modeling(
    name="Modeling Dataset",
    training_resource=training_resource,
    testing_resource=testing_resource,
    validation_resource=validation_resource)

# Complete the iteration
my_iteration.complete()

Once your wrapped dataset is registered with Vectice, you can view the dataset's metadata by clicking on the Datasets tab in the Vectice UI.

Best Practices

  • When registering datasets, append the dataset type (Origin, Cleaned, or Modeling) to the end of the dataset name so that it is easy to identify in the UI.

  • Before cleaning or transforming your datasets, register your origin dataset to Vectice. This way, you can always refer back to it if needed.

  • As you iterate through different modeling approaches, register multiple versions of your modeling dataset. This will help you to keep track of the changes you have made and to compare the results of different approaches.

  • Document your work thoroughly, including your data sources, cleaning and modeling processes, and any assumptions or decisions you make. This will help you to communicate your work to others and to ensure that your analysis is transparent and reproducible. Document key milestones via:

    • The Vectice UI in your Phase documentation

    • The Vectice API by adding a Step comment

  • Mark your most valuable iterations and assets in the UI by selecting the star next to the corresponding iteration and assets before beginning the phase review. This will make it easier for stakeholders and subject matter experts to identify the iterations and assets in review.
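The naming convention from the first best practice above can be applied consistently with a small helper. This is a minimal sketch: typed_name and DATASET_TYPES are hypothetical names introduced for illustration and are not part of the Vectice API.

```python
DATASET_TYPES = ("Origin", "Cleaned", "Modeling")


def typed_name(base: str, dataset_type: str) -> str:
    """Append the dataset type to a dataset name unless it is already there."""
    if dataset_type not in DATASET_TYPES:
        raise ValueError(f"Unknown dataset type: {dataset_type!r}")
    base = base.strip()
    if base.endswith(dataset_type):
        return base
    return f"{base} {dataset_type}"


print(typed_name("Iris", "Origin"))
```

The resulting string can be passed as the name argument when registering a dataset, for example name=typed_name("Iris", "Cleaned").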