How to register datasets

This guide will show Data Scientists how to register datasets to Vectice.

The Vectice API enables you to register all datasets used during development to the Vectice UI. This includes origin datasets, cleaned datasets, and modeling datasets.

| Dataset type | Description |
| --- | --- |
| Origin datasets | Origin datasets refer to your datasets containing raw data. |
| Cleaned datasets | Cleaned datasets refer to your datasets that have been cleaned and prepared for data modeling or data analysis. |
| Modeling datasets | Modeling datasets combine training, testing, and validation data in a single dataset. |

Resources

Use the resources listed below to wrap data from any data source. Wrapping your data enables you to register your dataset's columnar data and metadata to Vectice.

Vectice stores the metadata of your datasets, not your actual datasets.

| Resource | Description |
| --- | --- |
| Resource() | Wrap your dataset's columnar data and metadata from your storage location. It can be extended for any data source (for example, Redshift or RDS). |
| FileResource(...) | Wrap your dataset's columnar data and metadata from a local file. |
| GCSResource(...) | Wrap your dataset's columnar data and metadata from your Google Cloud Storage (GCS) source. |
| S3Resource(...) | Wrap your dataset's columnar data and metadata from your AWS S3 source. |

For more information on each Resource, visit our Vectice Python API Reference docs, where you will find the information under Resources.

Resource Usage Examples

The examples below show how to use the available Resources to wrap your dataset's columnar data and metadata so that you can later register your dataset to Vectice.

Any Data Source

To wrap data from any data source, create a custom resource class, inherit from Resource, and implement the _build_metadata() and _fetch_data() methods:

from vectice.models.resource import Resource
from vectice.models.resource.metadata import FilesMetadata


class MyResource(Resource):
    _origin = "Data source name"

    def _build_metadata(self) -> FilesMetadata:
        files = ...  # fetch file list from your custom storage
        total_size = ...  # compute total file size
        return FilesMetadata(
            size=total_size,
            origin=self._origin,
            files=files,
            usage=self.usage,
        )

    def _fetch_data(self) -> dict[str, bytes]:
        files_data = {}
        for file in self.metadata.files:
            file_contents = ...  # fetch file contents from your custom storage
            files_data[file.name] = file_contents
        return files_data
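
The two placeholders in _build_metadata() (the file list and the total size) depend entirely on your storage. As a minimal sketch, assuming the files sit in a local directory, both values could be gathered with the standard library alone. list_files_with_sizes is a hypothetical helper, not part of the Vectice API, and converting its output into the objects FilesMetadata expects is left to the API reference.

```python
from pathlib import Path


def list_files_with_sizes(directory: str) -> tuple[list[tuple[str, int]], int]:
    """Return ([(file name, size in bytes), ...], total size in bytes).

    Hypothetical helper: only regular files directly inside `directory`
    are considered, and they are returned sorted by name.
    """
    files = sorted(
        (path.name, path.stat().st_size)
        for path in Path(directory).iterdir()
        if path.is_file()
    )
    total_size = sum(size for _, size in files)
    return files, total_size
```

For a remote source, the same shape applies: replace the directory walk with a listing call from your storage client and keep the size aggregation.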

Local Data Source

Use FileResource() to wrap columnar data that you have stored in a local file.

from vectice import FileResource

my_resource = FileResource(path="my_resource_path")

Google Cloud Storage Data Source

Use GCSResource() to wrap columnar data that you have stored in Google Cloud Storage.

from vectice import GCSResource
from google.cloud.storage import Client

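# MY_GCP_CREDENTIALS is a placeholder for your service account info
# (a dict parsed from your service account key JSON file)
gcs_client = Client.from_service_account_info(info=MY_GCP_CREDENTIALS)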

my_resource = GCSResource(
    gcs_client,
    bucket_name="my_bucket_name",
    resource_paths="my_folder/my_filename"
)

AWS S3 Data Source

Use S3Resource() to wrap data that you have stored in AWS S3.

from vectice import S3Resource
from boto3.session import Session

s3_session = Session(  
    aws_access_key_id="...",
    aws_secret_access_key="...",
    region_name="us-east-1",
)

s3_client = s3_session.client(service_name="s3")   

my_resource = S3Resource(
    s3_client,
    bucket_name="my_bucket",
    resource_path="my_resource_path"
)

Prerequisites

You must complete the following prerequisites before registering data to Vectice.

  • Ensure you have installed the correct version of vectice.

  • Create your API key and download its JSON file (<API_key_config_name>.json). You pass this file to connect() through the config parameter to connect to the Vectice API.

  • You must have steps defined in the Vectice UI before beginning an iteration to register your dataset to a specific step.
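
To confirm the first prerequisite, you can check which version of vectice is installed in the current environment. The sketch below uses only the standard library; installed_version is a hypothetical helper name, and which version counts as "correct" depends on your Vectice deployment.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the installed vectice version, or None if vectice is not installed
print(installed_version("vectice"))
```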

Dataset Registration

Now that you know more about the different dataset types and available Resources, we will walk you through registering your wrapped datasets.

Each wrapped dataset is registered at the step level of an iteration. Use the static method of the Dataset class that matches your dataset type (i.e., origin(), clean(), or modeling()) to register your datasets to Vectice.

When registering a dataset, you must prefix the step name with step_. For example, to register data to the "Collect initial data" step, you can access the iteration step using step_collect_initial_data.
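
The naming rule can be sketched as a small function. This is an illustration only: it assumes, based on the single example above, that the step name is lowercased and its spaces are replaced with underscores. step_attribute_name is a hypothetical helper, and this guide does not describe how punctuation or other special characters are handled.

```python
def step_attribute_name(step_name: str) -> str:
    """Build the iteration attribute name for a step, e.g. for
    "Collect initial data" (assumes lowercase words joined by underscores)."""
    words = step_name.strip().lower().split()
    return "step_" + "_".join(words)


print(step_attribute_name("Collect initial data"))
```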

Origin Dataset Registration

To register the columnar data of your origin dataset, use any Resource to register it to your current iteration's step. For this example, we will use FileResource() to register a local origin dataset to our step named "Origin dataset".

from vectice import Dataset, FileResource

# `iteration` is an active iteration (see the Full Code Example).
# This line registers your origin dataset's metadata
# for your iteration step named "Origin dataset"
iteration.step_origin_dataset = Dataset.origin(
    name="Iris Origin",
    resource=FileResource(path="raw_iris.csv"),
)

Cleaned Dataset Registration

To register the columnar data of your cleaned dataset, use any Resource to register it to your current iteration's step. For this example, we will use FileResource() to register a local cleaned dataset to our step named "Clean dataset".

from vectice import Dataset, FileResource

# `iteration` is an active iteration (see the Full Code Example).
# This line registers your cleaned dataset's metadata
# for your iteration step named "Clean dataset"
iteration.step_clean_dataset = Dataset.clean(
    name="Iris Cleaned",
    resource=FileResource(path="iris_cleaned.csv"),
)

Modeling Dataset Registration

To register the columnar data of your modeling datasets used for training, testing, and validation, use any Resource to register it to your current iteration's step. We will use FileResource() to register this example dataset to our step named "Modeling dataset".

Training and testing resources are required to register a modeling dataset. Validation resources are optional.

from vectice import Dataset, FileResource

training_resource = FileResource(path="iris_training.csv")
testing_resource = FileResource(path="iris_testing.csv")
validation_resource = FileResource(path="iris_validation.csv")  # optional

# `iteration` is an active iteration (see the Full Code Example).
# This line registers your modeling dataset's metadata
# for your iteration step named "Modeling dataset"
iteration.step_modeling_dataset = Dataset.modeling(
    name="Modeling Dataset",
    training_resource=training_resource,
    testing_resource=testing_resource,
    validation_resource=validation_resource,
)

Full Code Example

This example demonstrates the full workflow for registering your modeling datasets' metadata to Vectice.

from vectice import Dataset, FileResource, connect

# Connect to your Vectice Project
project = connect(
    config="<API_key_config_name>.json",
    workspace="ws_name",
    project="project_name",
)

# Select your project phase and start an iteration
modeling_phase = project.phase("modeling")
iteration = modeling_phase.iteration()

# Wrap your datasets used for training, testing, and validation
training_resource = FileResource(path="iris_training.csv")
testing_resource = FileResource(path="iris_testing.csv")
validation_resource = FileResource(path="iris_validation.csv")

# Register your modeling datasets to the step named "Modeling data"
iteration.step_modeling_data = Dataset.modeling(
    name="Modeling Dataset",
    training_resource=training_resource,
    testing_resource=testing_resource,
    validation_resource=validation_resource)

# Complete the iteration
iteration.complete()

Once your wrapped dataset is registered with Vectice, you can view its metadata by clicking on the Datasets tab in the Vectice UI.

Capturing Dataset Structure and Statistics [BETA]

Data scientists can capture their dataset's structure and basic statistics when registering datasets to Vectice. The statistics listed below are captured automatically, based on the column types in your dataframe:

Statistics will be captured only for the first 100 columns of your dataframe.

| Stat | Dataframe column type | Description |
| --- | --- | --- |
| Null | String | The count of null values in the dataframe. |
| Series count (size) | String, Bool, Numeric | The total number of rows in the dataframe. |
| Unique | String | The count of unique values in the dataframe. |
| Top | String, Bool | The most frequent value in the dataframe. |
| Frequency | String, Bool | The number of occurrences of the most frequent value in the dataframe. |
| Mean | Numeric | The average value of the data points in the dataframe. |
| Median | Numeric | The middle value in the dataframe. |
| Variance | Numeric | A measure of how far the data points in the dataframe spread from the mean value. |
| Standard deviation (STD) | Numeric | The square root of the variance, a commonly used measure of how spread out the data is. |
| Min value | Numeric | The smallest data point in the dataframe. |
| Max value | Numeric | The largest data point in the dataframe. |
| 25% percentile | Numeric | The value below which 25% of the data points in the dataframe fall. |
| 50% percentile | Numeric | Also known as the median, this value separates the top 50% of data points from the bottom 50% of data points in the dataframe. |
| 75% percentile | Numeric | The value below which 75% of the data points in the dataframe fall. |
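
To make the definitions above concrete, here is a minimal sketch that computes the same kinds of statistics for a single column using only the Python standard library. It is not Vectice's implementation: numeric_stats and string_stats are hypothetical helpers, and the choice of sample variance (dividing by n - 1) and of linearly interpolated percentiles is an assumption, so captured values may differ slightly from these.

```python
import statistics
from collections import Counter


def numeric_stats(values: list) -> dict:
    """Summary statistics for a numeric column with at least two values."""
    q25, q50, q75 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "variance": statistics.variance(values),  # sample variance (n - 1), assumed
        "std": statistics.stdev(values),  # square root of the sample variance
        "min": min(values),
        "max": max(values),
        "25%": q25,
        "50%": q50,
        "75%": q75,
    }


def string_stats(values: list) -> dict:
    """Summary statistics for a string column; None marks a null value."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    top, frequency = counts.most_common(1)[0] if counts else (None, 0)
    return {
        "null": len(values) - len(present),
        "count": len(values),
        "unique": len(counts),  # nulls are not counted as a unique value here
        "top": top,
        "frequency": frequency,
    }


print(numeric_stats([1, 2, 3, 4, 5]))
print(string_stats(["a", "b", "a", None]))
```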

Origin and Cleaned Dataset Stats

To capture your origin and cleaned resource's structure and statistics on numeric and string columns, pass a dataframe to the Dataset class using the dataframe parameter:

from vectice import Dataset, GCSResource

bucket_name = "sdk-tests-data-source"
filename = "bike_dataset.csv"
folder_name = "data_structure_capture_folder"
resource_path = f"{folder_name}/{filename}"

# gcs_client is a google.cloud.storage Client, created as shown
# in the Google Cloud Storage Data Source example
gcs_resource = GCSResource(
    gcs_client,
    bucket_name,
    resource_path,
)
df = ...  # load your pandas dataframe

# Register your origin or cleaned dataset, with its structure and statistics, to Vectice
dataset = Dataset.clean(gcs_resource, name="my clean dataset", dataframe=df)

Modeling Dataset Stats

Capturing the structure and statistics of your modeling resources requires specifying which resources you want statistics for. You can collect statistics for all modeling resources (training, testing, and validation) or only for specific ones.

For example, if you only want your testing resource's statistics, pass only your testing dataframe to the Dataset class using the testing_dataframe parameter. Statistics are then captured for your testing_resource alone.

import pandas as pd
from vectice import Dataset, FileResource

training_resource = FileResource(path="train_clean.csv")
testing_resource = FileResource(path="train_reduced.csv")
validation_resource = FileResource(path="validation.csv")

# Load your pandas dataframes for your modeling datasets
training_dataframe = pd.read_csv("train_clean.csv")
testing_dataframe = pd.read_csv("train_reduced.csv")
validation_dataframe = pd.read_csv("validation.csv")

# Register all modeling datasets, with their structure and statistics, to Vectice
modeling_dataset = Dataset.modeling(
    name="Modeling",
    training_resource=training_resource,
    testing_resource=testing_resource,
    validation_resource=validation_resource,
    training_dataframe=training_dataframe,
    testing_dataframe=testing_dataframe,
    validation_dataframe=validation_dataframe,
)

# Alternatively, to register all modeling datasets to Vectice and
# only capture the testing dataset's statistics:
#
# modeling_dataset = Dataset.modeling(
#     name="Modeling",
#     training_resource=training_resource,
#     testing_resource=testing_resource,
#     validation_resource=validation_resource,
#     testing_dataframe=testing_dataframe,
# )

Dataset Lineage

To keep track of the lineage of your derived datasets, use the derived_from parameter to list the datasets (or dataset IDs) from which your dataset is derived.

from vectice import Dataset, FileResource

origin_dataset = Dataset.origin(
    name="my origin dataset",
    resource=FileResource(path="origin.csv"),
)

clean_dataset = Dataset.clean(
    name="my clean dataset",
    resource=FileResource(path="clean_dataset.csv"),
    derived_from=[origin_dataset],
)
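
Conceptually, derived_from links form a graph that can be walked back to the origin data. The sketch below models that idea with plain Python objects. It is an illustration of the concept only, not how Vectice stores or exposes lineage; DatasetNode and ancestors are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetNode:
    """A dataset and the datasets it was derived from (conceptual model)."""
    name: str
    derived_from: list = field(default_factory=list)


def ancestors(dataset: DatasetNode) -> list:
    """Names of every dataset this one derives from, nearest first."""
    seen = []
    queue = list(dataset.derived_from)
    while queue:
        parent = queue.pop(0)
        if parent.name not in seen:
            seen.append(parent.name)
            queue.extend(parent.derived_from)
    return seen


origin = DatasetNode("my origin dataset")
clean = DatasetNode("my clean dataset", derived_from=[origin])
modeling = DatasetNode("my modeling dataset", derived_from=[clean])
print(ancestors(modeling))
```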

Dataset Versions

Dataset versions are datasets with the same name as another dataset that you have already registered in Vectice. As you register datasets with the same name, the versions are automatically incremented, maintaining the dataset's history.
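
The versioning behavior can be pictured as a per-name counter. The sketch below is a conceptual model only: VersionRegistry is a hypothetical class, and numbering versions from 1 is an assumption; the version labels shown in the Vectice UI may differ.

```python
from collections import defaultdict


class VersionRegistry:
    """Tracks the latest version number registered for each dataset name."""

    def __init__(self) -> None:
        self._latest = defaultdict(int)

    def register(self, name: str) -> int:
        """Record a new registration of `name` and return its version number."""
        self._latest[name] += 1
        return self._latest[name]


registry = VersionRegistry()
registry.register("Iris Cleaned")  # first registration of this name
registry.register("Iris Cleaned")  # same name again, so the version increments
```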

Best Practices

  • When registering datasets, append the dataset type (Origin, Cleaned, or Modeling) to the end of the corresponding dataset name so it is easy to identify in the UI.

  • Before cleaning or transforming your datasets, register your origin dataset to Vectice. This way, you can always refer back to it if needed.

  • As you iterate through different modeling approaches, register multiple versions of your modeling dataset. This will help you to keep track of the changes you have made and to compare the results of different approaches.

  • Document your work thoroughly, including your data sources, cleaning and modeling processes, and any assumptions or decisions you make. This will help you to communicate your work to others and to ensure that your analysis is transparent and reproducible. Document key milestones via:

    • The Vectice UI in your Phase documentation

    • The Vectice API by adding a Step comment

  • Mark your most valuable iterations and assets in the UI by selecting the star next to the corresponding iteration and assets before beginning the phase review. This will make it easier for stakeholders and subject matter experts to identify the iterations and assets in review.
