Introduction to Datasets

Learn how to use datasets in Vectice.

Datasets overview

Datasets in Vectice represent the datasets used for data analysis, cleaning, model training, testing, and validation during development. You can log datasets to the Vectice app from various source systems, and Vectice stores each dataset's origin, version, and resource artifacts. Vectice automatically versions dataset artifacts to enable lineage.

The Vectice API enables you to log all datasets used during development to the Vectice app, including your origin, cleaned, and modeling datasets.

Dataset type | Description
Origin datasets | Datasets containing raw data.
Cleaned datasets | Datasets that have been cleaned and prepared for data modeling or data analysis.
Modeling datasets | Datasets that combine training, testing, and validation data in a single dataset.
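
To make the idea of a modeling dataset concrete, here is a minimal sketch that keeps the training, testing, and validation splits together in one object. It uses only the Python standard library and is not the Vectice API: the `ModelingDataset` class, the `split_rows` helper, the split fractions, and the dataset name are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class ModelingDataset:
    """Hypothetical container: one dataset that holds all three splits."""
    name: str
    training: list
    testing: list
    validation: list


def split_rows(name, rows, test_frac=0.2, val_frac=0.1, seed=0):
    """Shuffle rows reproducibly, then partition them into three splits."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    testing = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_val]
    training = shuffled[n_test + n_val:]  # everything left over
    return ModelingDataset(name, training, testing, validation)


# Illustrative data only: 100 row identifiers.
modeling = split_rows("customer_churn Modeling", range(100))
```

Keeping the three splits under one name means the exact partition used for a model can be logged and versioned as a single asset.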

Things you should know

  • Data scientists can capture basic statistics when logging their datasets' artifacts to Vectice. These statistics include the mean, median, variance, quartiles, and more. Learn more by viewing our Capturing Dataset Statistics section.
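
As a rough illustration of the kind of statistics mentioned above, the sketch below computes the mean, median, variance, and quartiles for a single numeric column with Python's standard `statistics` module. It does not show how Vectice computes or captures these values; the `basic_statistics` helper and the sample values are assumptions for illustration.

```python
import statistics


def basic_statistics(values):
    """Return a few summary statistics for one numeric column."""
    q1, q2, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "variance": statistics.variance(values),  # sample variance (n - 1)
        "quartiles": (q1, q2, q3),
    }


# Illustrative column values only.
stats = basic_statistics([2, 4, 4, 5, 7, 8])
```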

Datasets best practices

  • Append the dataset type (Origin, Cleaned, or Modeling) to the end of each logged dataset's name so the dataset is easy to identify in the Vectice app.

  • Create a phase whose primary objective is registering origin datasets. This is usually part of the Data Understanding phase of a CRISP-DM project.

  • Project phases that aim to clean and process raw data can have a requirement that asks members to log the cleaned datasets. This is typically completed in the Data Preparation phase of a CRISP-DM project.

  • Mark your most valuable iterations and assets in the Vectice app by selecting the star next to the corresponding iteration and assets before beginning the phase review. This will make it easier for stakeholders and subject matter experts to identify the iterations and assets in review.

Select the star next to the valuable iterations before completing a phase, even without review. This allows you to identify the most valuable iterations for that phase.
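
The naming practice from the list above (appending Origin, Cleaned, or Modeling to the dataset name) can be sketched as a small helper. This is plain Python, not part of the Vectice API: the `with_type_suffix` function, the space separator, and the example dataset name are assumptions for illustration.

```python
DATASET_TYPES = ("Origin", "Cleaned", "Modeling")


def with_type_suffix(name, dataset_type):
    """Append the dataset type to a name unless it is already there."""
    if dataset_type not in DATASET_TYPES:
        raise ValueError(f"unknown dataset type: {dataset_type!r}")
    name = name.strip()
    if name.endswith(dataset_type):
        return name  # already follows the convention
    return f"{name} {dataset_type}"


# Hypothetical base name, one variant per dataset type.
names = [with_type_suffix("customer_churn", t) for t in DATASET_TYPES]
```

Applying the suffix in one place keeps names consistent across a team, so the three stages of the same data sort and search together in the app.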

How to log datasets to Vectice?

Logging datasets is your first step to knowledge capture in Vectice. You can log all datasets used during development, including your origin datasets (raw datasets), cleaned datasets, and modeling datasets.

See our How to log datasets guide for details on logging datasets during iterative development.
