How to register datasets
Learn how to register your datasets to Vectice.
Complete the prerequisites before registering datasets to Vectice. To learn more, see our Getting Started with Vectice API guide.
The Vectice API enables you to register all datasets used during development to the Vectice UI. This includes origin datasets, cleaned datasets, and modeling datasets.
Dataset type | Description |
---|---|
Origin datasets | Origin datasets refer to your datasets containing raw data. |
Cleaned datasets | Cleaned datasets refer to your datasets that have been cleaned and prepared for data modeling or data analysis. |
Modeling datasets | Modeling datasets combine training, testing, and validation data in a single dataset. |
Dataset Registration
Now that you know more about the different dataset types and available Resources, we will walk you through registering your wrapped datasets.
Each wrapped dataset is registered at the step level of an iteration. Use the static methods for each dataset type (i.e., origin(), clean(), and modeling()) to register your datasets to Vectice.
To register a dataset, reference the step using the prefix "step_" followed by the step name. You can find the step name in the Vectice UI or retrieve it through the API with my_iteration.list_steps().
For example, to register data to the "Collect initial data" step, you can access the iteration step by using my_iteration.step_collect_initial_data.
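The naming pattern in the example above can be sketched as a small helper. This is not part of the Vectice API: step_attribute is a hypothetical function, and the exact normalization rule (lowercase, with runs of non-alphanumeric characters replaced by a single underscore) is an assumption inferred from the "Collect initial data" example. Always confirm the real names with my_iteration.list_steps().

```python
import re


def step_attribute(step_name: str) -> str:
    """Build the attribute name for a step from its display name.

    Assumed rule (inferred from the example above, not confirmed by the
    Vectice API): lowercase the name and replace each run of
    non-alphanumeric characters with one underscore.
    """
    slug = re.sub(r"[^0-9a-z]+", "_", step_name.lower()).strip("_")
    return f"step_{slug}"


print(step_attribute("Collect initial data"))  # step_collect_initial_data
```

A helper like this can be handy when step names come from a list and you want to look the attribute up dynamically, for example with getattr(my_iteration, step_attribute(name)).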
Origin Dataset Registration
To register the columnar data of your origin dataset, use any Resource to register it to your current iteration's step. For this example, we will use FileResource() to register a local origin dataset to our step named "Origin dataset".
Cleaned Dataset Registration
To register the columnar data of your cleaned dataset, use any Resource to register it to your current iteration's step. For this example, we will use FileResource() to register a local cleaned dataset to our step named "Clean dataset".
Modeling Dataset Registration
To register the columnar data of your modeling datasets used for training, testing, and validation, use any Resource to register it to your current iteration's step. We will use FileResource() to register this example dataset to our step named "Modeling dataset".
Training and testing resources are required to register a modeling dataset. Validation resources are optional.
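The required/optional rule above can be checked before you attempt a registration. The following is a minimal sketch in plain Python, not a Vectice call: check_modeling_resources is a hypothetical helper, and the file paths are placeholders.

```python
from typing import List, Optional


def check_modeling_resources(
    training: Optional[str],
    testing: Optional[str],
    validation: Optional[str] = None,
) -> List[str]:
    """Return the resource roles that would be registered.

    Raises ValueError if a required role (training or testing) is missing.
    Validation is optional and is included only when provided.
    """
    required = (("training", training), ("testing", testing))
    missing = [role for role, resource in required if not resource]
    if missing:
        raise ValueError(f"Missing required resource(s): {', '.join(missing)}")

    roles = ["training", "testing"]
    if validation:
        roles.append("validation")
    return roles


# Placeholder paths; validation is left out on purpose.
print(check_modeling_resources("train.csv", "test.csv"))
```

Running a check like this first gives a clear error message locally instead of a failed registration.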
Full Code Example
This example demonstrates the full workflow for registering your modeling datasets' metadata to Vectice.
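As a rough illustration of the flow, the sketch below mimics the workflow with local stand-in classes so that it runs without a Vectice connection. Everything in it is an assumption made for illustration: StandInFileResource, StandInDataset, and StandInIteration are not Vectice classes, the parameter names are hypothetical, the file paths are placeholders, and registering by assigning to a step_ attribute is assumed rather than confirmed here. Refer to the Vectice API reference for the real signatures.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class StandInFileResource:
    """Stand-in for a file-backed Resource; it only remembers a path."""
    path: str


@dataclass
class StandInDataset:
    """Stand-in for a wrapped dataset: name, type, and resources by role."""
    name: str
    kind: str
    resources: Dict[str, StandInFileResource]

    @staticmethod
    def origin(name: str, resource: StandInFileResource) -> "StandInDataset":
        return StandInDataset(name, "origin", {"data": resource})

    @staticmethod
    def clean(name: str, resource: StandInFileResource) -> "StandInDataset":
        return StandInDataset(name, "clean", {"data": resource})

    @staticmethod
    def modeling(
        name: str,
        training_resource: StandInFileResource,
        testing_resource: StandInFileResource,
        validation_resource: Optional[StandInFileResource] = None,
    ) -> "StandInDataset":
        # Training and testing are required; validation is optional.
        resources = {"training": training_resource, "testing": testing_resource}
        if validation_resource is not None:
            resources["validation"] = validation_resource
        return StandInDataset(name, "modeling", resources)


class StandInIteration:
    """Stand-in iteration: assigning to a known step_* attribute records it."""

    def __init__(self, step_names: Iterable[str]) -> None:
        # Bypass our own __setattr__ while setting up internal state.
        object.__setattr__(self, "_steps", set(step_names))
        object.__setattr__(self, "registered", {})

    def list_steps(self) -> List[str]:
        return sorted(self._steps)

    def __setattr__(self, attr: str, value: StandInDataset) -> None:
        if attr not in self._steps:
            raise AttributeError(f"Unknown step: {attr}")
        self.registered[attr] = value


my_iteration = StandInIteration(
    ["step_origin_dataset", "step_clean_dataset", "step_modeling_dataset"]
)

my_iteration.step_origin_dataset = StandInDataset.origin(
    "Customers Origin", StandInFileResource("customers_raw.csv")
)
my_iteration.step_clean_dataset = StandInDataset.clean(
    "Customers Cleaned", StandInFileResource("customers_clean.csv")
)
my_iteration.step_modeling_dataset = StandInDataset.modeling(
    "Customers Modeling",
    training_resource=StandInFileResource("train.csv"),
    testing_resource=StandInFileResource("test.csv"),
)

for step, dataset in my_iteration.registered.items():
    print(step, dataset.kind, sorted(dataset.resources))
```

The dataset names end with their type (Origin, Cleaned, Modeling) so that each one is easy to identify once registered.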
Once your wrapped dataset is registered with Vectice, you can view its metadata by clicking on the Datasets tab in the Vectice UI.
Best Practices
When registering datasets, append the dataset type (Origin, Cleaned, or Modeling) to the end of the corresponding dataset name so that it is easy to identify in the UI.
Before cleaning or transforming your datasets, register your origin dataset to Vectice. This way, you can always refer back to it if needed.
As you iterate through different modeling approaches, register multiple versions of your modeling dataset. This will help you to keep track of the changes you have made and to compare the results of different approaches.
Document your work thoroughly, including your data sources, cleaning and modeling processes, and any assumptions or decisions you make. This will help you to communicate your work to others and to ensure that your analysis is transparent and reproducible. Document key milestones via the Vectice UI (in your Phase documentation) or the Vectice API (by adding a Step comment).
Mark your most valuable iterations and assets in the UI by selecting the star next to the corresponding iteration and assets before beginning the phase review. This will make it easier for stakeholders and subject matter experts to identify the iterations and assets in review.