How to register datasets
This guide shows data scientists how to register datasets to Vectice.
The Vectice API enables you to register all datasets used during development to the Vectice UI. This includes origin datasets, cleaned datasets, and modeling datasets.
Origin datasets
Origin datasets refer to your datasets containing raw data.
Cleaned datasets
Cleaned datasets refer to your datasets that have been cleaned and prepared for data modeling or data analysis.
Modeling datasets
Modeling datasets combine training, testing, and validation data in a single dataset.
Resources
Use the resources listed below to wrap data from any data source. This enables you to register your dataset's columnar data and metadata to Vectice.
Vectice stores the metadata of your datasets, not your actual datasets.
Resource()
Wrap your dataset's columnar data and metadata from your storage location. It can be extended for any data source (for example, Redshift or RDS).
FileResource(...)
Wrap your dataset's columnar data and metadata from a local file.
GCSResource(...)
Wrap your dataset's columnar data and metadata from your Google Cloud Storage (GCS) source.
S3Resource(...)
Wrap your dataset's columnar data and its metadata from your AWS S3 source.
For more information on each Resource, visit our Vectice Python API Reference docs, where you will find the information under Resources.
Resource Usage Examples
Below we highlight how you can use the available Resources to wrap your dataset's columnar data and metadata, so that you can later register your dataset to Vectice.
Any Data Source
To wrap data from any data source, create a custom resource class that inherits from Resource and implements the _build_metadata() and _fetch_data() methods.
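The subclassing pattern can be sketched without Vectice installed. In this minimal sketch, `Resource` is a local stand-in base class written only for illustration, not the Vectice class. `InMemoryResource`, its constructor arguments, and the shapes returned by both methods are assumptions. Only the method names `_build_metadata()` and `_fetch_data()` come from this guide, so check the Vectice Python API Reference for the real signatures and return types.

```python
from abc import ABC, abstractmethod


class Resource(ABC):
    """Local stand-in for a resource base class (NOT the Vectice class)."""

    def __init__(self):
        # Assumption: the wrapper loads the data first, then describes it.
        self.data = self._fetch_data()
        self.metadata = self._build_metadata()

    @abstractmethod
    def _fetch_data(self):
        """Return the columnar data from the storage location."""

    @abstractmethod
    def _build_metadata(self):
        """Return a description of the data, not the data itself."""


class InMemoryResource(Resource):
    """Hypothetical resource wrapping rows that are already in memory."""

    def __init__(self, name, rows):
        self.name = name
        self._rows = rows
        super().__init__()

    def _fetch_data(self):
        # A real resource would query Redshift, RDS, etc. here.
        return list(self._rows)

    def _build_metadata(self):
        columns = sorted({key for row in self.data for key in row})
        return {"name": self.name, "rows": len(self.data), "columns": columns}


resource = InMemoryResource("customers", [{"id": 1, "age": 34}, {"id": 2, "age": 51}])
print(resource.metadata)
```

The point of the split is that `_fetch_data()` knows how to reach the storage location, while `_build_metadata()` produces the description that is registered in place of the data itself.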
Local Data Source
Use FileResource() to wrap columnar data that you have stored in a local file.
Google Cloud Storage Data Source
Use GCSResource() to wrap columnar data that you have stored in Google Cloud Storage.
AWS S3 Data Source
Use S3Resource() to wrap data that you have stored in AWS S3.
Prerequisites
You must complete the following prerequisites before registering data to Vectice.
Ensure you have installed the correct version of vectice.
Configure your API and download your API Key JSON file to connect to the Vectice API (config = <API_key_config_name>.json).
You must have steps defined in the Vectice UI before beginning an iteration, so that you can register your dataset to a specific step.
Dataset Registration
Now that you know more about the different dataset types and available Resources, we will walk you through registering your wrapped datasets.
Each wrapped dataset is registered at the step level of an iteration. Use the static method for each dataset type (i.e., origin(), clean(), or modeling()) to register your datasets to Vectice.
When registering a dataset, you must prefix the step name with step_. For example, to register data to the "Collect initial data" step, you can access the iteration step using step_collect_initial_data.
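The naming rule can be illustrated with a small helper. This is a sketch only: `step_attribute_name` is a hypothetical function, not part of the Vectice API. The only mapping confirmed by this guide is "Collect initial data" to step_collect_initial_data; lowercasing and replacing every run of other characters with a single underscore is an assumption drawn from that one example.

```python
import re


def step_attribute_name(step_name: str) -> str:
    """Turn a step's display name into a step_-prefixed attribute name."""
    # Assumption: runs of non-alphanumeric characters become one underscore.
    slug = re.sub(r"[^0-9a-z]+", "_", step_name.lower()).strip("_")
    return f"step_{slug}"


print(step_attribute_name("Collect initial data"))  # step_collect_initial_data
```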
Origin Dataset Registration
To register the columnar data of your origin dataset, use any Resource to register it to your current iteration's step. For this example, we will use FileResource() to register a local origin dataset to our step named "Origin dataset".
Cleaned Dataset Registration
To register the columnar data of your cleaned dataset, use any Resource to register it to your current iteration's step. For this example, we will use FileResource() to register a local cleaned dataset to our step named "Clean dataset".
Modeling Dataset Registration
To register the columnar data of your modeling datasets used for training, testing, and validation, use any Resource to register it to your current iteration's step. We will use FileResource() to register this example dataset to our step named "Modeling dataset".
Training and testing resources are required to register a modeling dataset. Validation resources are optional.
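The required and optional resources can be checked before registration. The sketch below is a hypothetical pre-flight helper, not a Vectice function; the name `check_modeling_resources` and the use of file paths as stand-ins for wrapped resources are assumptions.

```python
def check_modeling_resources(training, testing, validation=None):
    """Return the names of the resources that would be registered.

    Raises ValueError when a required resource is missing.
    """
    required = (("training", training), ("testing", testing))
    missing = [name for name, value in required if value is None]
    if missing:
        raise ValueError(f"Missing required resource(s): {', '.join(missing)}")
    provided = ["training", "testing"]
    if validation is not None:
        # Validation is optional, so it is only listed when supplied.
        provided.append("validation")
    return provided


print(check_modeling_resources("train.csv", "test.csv"))
print(check_modeling_resources("train.csv", "test.csv", "validation.csv"))
```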
Full Code Example
This example will demonstrate the full workflow of how to register your modeling datasets' metadata to Vectice.
Once your wrapped dataset is registered with Vectice, you can view the dataset's metadata by clicking on the Datasets tab in the Vectice UI.
Capturing Dataset Structure and Statistics [BETA]
Data scientists can capture their dataset's structure and basic statistics when registering datasets to Vectice. Statistics are captured only for the first 100 columns of your dataframe. The following statistics are captured automatically, based on the column types in your dataframe:
| Statistic | Column types | Description |
| --- | --- | --- |
| Null | String | The count of null values in the dataframe. |
| Series count (size) | String, Bool, Numeric | The total number of rows in the dataframe. |
| Unique | String | The count of unique values in the dataframe. |
| Top | String, Bool | The most frequent value in the dataframe. |
| Frequency | String, Bool | The number of occurrences of the most frequent value in the dataframe. |
| Mean | Numeric | The average value of the data points in the dataframe. |
| Median | Numeric | The middle value in the dataframe. |
| Variance | Numeric | A measure of how far the data points in the dataframe spread from the mean value. |
| Standard deviation (STD) | Numeric | The square root of the variance; a commonly used measure of the spread of the data. |
| Min value | Numeric | The smallest data point in the dataframe. |
| Max value | Numeric | The largest data point in the dataframe. |
| 25% percentile | Numeric | The value below which 25% of the data points in the dataframe fall. |
| 50% percentile | Numeric | Also known as the median, this value separates the top 50% of data points from the bottom 50% of data points in the dataframe. |
| 75% percentile | Numeric | The value below which 75% of the data points in the dataframe fall. |
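To make the definitions above concrete, here is a minimal sketch that computes the same kinds of statistics with the Python standard library. It is not how Vectice computes them. The function names, the sample columns, the use of sample (rather than population) variance, and the "inclusive" percentile method are all assumptions chosen for illustration.

```python
import statistics
from collections import Counter


def numeric_stats(values):
    """Summarize a numeric column given as a list of numbers."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "variance": statistics.variance(values),  # sample variance (assumption)
        "std": statistics.stdev(values),  # square root of the sample variance
        "min": min(values),
        "max": max(values),
        "25%": q1,
        "50%": q2,
        "75%": q3,
    }


def string_stats(values):
    """Summarize a string column; None marks a missing value."""
    present = [v for v in values if v is not None]
    top, frequency = Counter(present).most_common(1)[0]
    return {
        "null": len(values) - len(present),
        "count": len(values),
        "unique": len(set(present)),
        "top": top,
        "frequency": frequency,
    }


ages = [22, 30, 30, 41, 57]
cities = ["Paris", "Lyon", None, "Paris", "Nice"]
print(numeric_stats(ages))
print(string_stats(cities))
```

Note that the 50% percentile and the median coincide, as the table states.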
Origin and Cleaned Dataset Stats
To capture the structure and statistics of your origin and cleaned resources on numeric and string columns, pass a dataframe to the Dataset class using the dataframe parameter.
Modeling Dataset Stats
Capturing the structure and statistics of your modeling resources requires specifying which resource statistics you need. You can collect statistics for all modeling resources (training, testing, and validation) or select the specific resource(s) for statistics capture.
For example, if you only want your testing resource's statistics, pass only your testing dataframe to the Dataset class using the testing_dataframe parameter. This captures the statistics of your testing_resource.
Dataset Lineage
To keep track of your derived datasets' lineage, use the derived_from parameter to list the datasets (or dataset IDs) from which your dataset is derived.
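What a derived_from list records can be pictured as a simple parent lookup. The sketch below is an illustration only and does not call Vectice; the `ancestors` helper, the dictionary layout, and the dataset names are hypothetical.

```python
def ancestors(dataset, derived_from):
    """Return every dataset that `dataset` is directly or indirectly derived from."""
    seen = []
    stack = list(derived_from.get(dataset, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.append(parent)
            # Follow the parent's own derived_from list as well.
            stack.extend(derived_from.get(parent, []))
    return seen


# Each key lists the datasets it was derived from.
derived_from = {
    "customers_cleaned": ["customers_origin"],
    "customers_modeling": ["customers_cleaned"],
}
print(ancestors("customers_modeling", derived_from))
```

Listing only the direct parents at registration time is enough to reconstruct the full chain back to the origin dataset.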
Dataset Versions
Dataset versions are datasets with the same name as another dataset that you have already registered in Vectice. As you register datasets with the same name, the versions are automatically incremented, maintaining the dataset's history.
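The versioning behavior can be modeled with a small in-memory registry. This is a sketch under stated assumptions, not the Vectice implementation: `VersionRegistry` is hypothetical, and numbering versions from 1 is an assumption.

```python
from collections import defaultdict


class VersionRegistry:
    """Hypothetical in-memory registry that versions datasets by name."""

    def __init__(self):
        self._versions = defaultdict(list)

    def register(self, name, metadata):
        """Store metadata under `name` and return the new version number."""
        self._versions[name].append(metadata)
        return len(self._versions[name])

    def history(self, name):
        """Return (version, metadata) pairs, oldest first."""
        return list(enumerate(self._versions[name], start=1))


registry = VersionRegistry()
first = registry.register("customers_modeling", {"rows": 1000})
second = registry.register("customers_modeling", {"rows": 1200})
other = registry.register("customers_origin", {"rows": 5000})
print(first, second, other)
print(registry.history("customers_modeling"))
```

Registering under an existing name adds a version instead of overwriting, which is what preserves the dataset's history.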
Best Practices
When registering datasets, append the dataset type (Origin, Cleaned, or Modeling) to the end of the corresponding dataset name so that it is easy to identify in the UI.
Before cleaning or transforming your datasets, register your origin dataset to Vectice. This way, you can always refer back to it if needed.
As you iterate through different modeling approaches, register multiple versions of your modeling dataset. This will help you to keep track of the changes you have made and to compare the results of different approaches.
Document your work thoroughly, including your data sources, cleaning and modeling processes, and any assumptions or decisions you make. This will help you to communicate your work to others and to ensure that your analysis is transparent and reproducible. Document key milestones via:
The Vectice UI in your Phase documentation
The Vectice API by adding a Step comment
Mark your most valuable iterations and assets in the UI by selecting the star next to the corresponding iteration and assets before beginning the phase review. This will make it easier for stakeholders and subject matter experts to identify the iterations and assets in review.