Log dataset structure and statistics
Data scientists can document dataset structure and statistics in Vectice by using Pandas or Spark dataframes (version 3.0 and newer) when logging datasets.
Most common statistics are currently not captured with Spark dataframes.
Statistics are captured for the first 100 columns of your dataframe. To keep the data anonymous, statistics are not captured when the dataframe has fewer than 100 rows. The Org Admin can adjust this threshold in the organization settings.
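The two limits above can be illustrated with a short sketch. This is not Vectice's implementation; the function name and constants are hypothetical and only restate the documented rule (first 100 columns, default threshold of 100 rows).

```python
# Illustrative only: restates the documented limits, not Vectice's internal code.
MAX_STATS_COLUMNS = 100  # statistics cover only the first 100 columns
DEFAULT_MIN_ROWS = 100   # default anonymity threshold (adjustable by the Org Admin)


def columns_with_statistics(column_names, row_count, min_rows=DEFAULT_MIN_ROWS):
    """Return the columns that would get statistics under the documented rule."""
    if row_count < min_rows:
        # Too few rows: no statistics are captured, to keep the data anonymous.
        return []
    # Enough rows: only the first MAX_STATS_COLUMNS columns are considered.
    return list(column_names)[:MAX_STATS_COLUMNS]
```

For example, a dataframe with 150 columns and 1,000 rows would get statistics for its first 100 columns, while a dataframe with 99 rows would get none under the default threshold.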
Here are the automatically captured statistics based on column types in your dataframes.
If no dataframe is provided, Vectice retrieves the schema columns and rows from the resource (if available).
If dataframes are provided, Vectice infers the schema columns and rows from the dataframes.
Capture schema without statistics
By default, both schema and column statistics are captured. Setting capture_schema_only to True captures only schema information, excluding column stats.
Column stats computation can impact processing time, so it is recommended to set capture_schema_only to True if performance is a concern or detailed stats are not needed.
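A minimal sketch of a schema-only resource, assuming the FileResource class from the vectice package accepts paths, dataframes, and capture_schema_only as keyword arguments (check the signature in your installed version). The helper name and its arguments are hypothetical.

```python
def build_schema_only_resource(path, df):
    """Build a file resource that records the schema but skips column statistics."""
    # Imported inside the function so the helper can be defined
    # even where the vectice package is not installed.
    from vectice import FileResource

    return FileResource(
        paths=path,                # hypothetical path to the file behind the dataframe
        dataframes=df,             # schema columns and rows are inferred from this dataframe
        capture_schema_only=True,  # skip column statistics to save processing time
    )
```

Calling the helper with a file path and the dataframe loaded from that file would return a resource that can be wrapped in a dataset as usual.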
Log origin and cleaned dataset statistics
To log your origin and cleaned resource's structure and statistics for each column type, pass a dataframe to your Resource class using the dataframes parameter:
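The following sketch shows one way this could look, assuming the FileResource class and the Dataset.origin and Dataset.clean constructors from the vectice package. The helper name, dataset names, and arguments are hypothetical placeholders.

```python
def build_origin_and_clean_datasets(origin_path, origin_df, clean_path, clean_df):
    """Wrap origin and cleaned dataframes so their structure and statistics are logged."""
    # Imported inside the function so the helper can be defined
    # even where the vectice package is not installed.
    from vectice import Dataset, FileResource

    origin_dataset = Dataset.origin(
        name="origin_dataset",  # hypothetical dataset name
        resource=FileResource(paths=origin_path, dataframes=origin_df),
    )
    clean_dataset = Dataset.clean(
        name="cleaned_dataset",  # hypothetical dataset name
        resource=FileResource(paths=clean_path, dataframes=clean_df),
    )
    return origin_dataset, clean_dataset
```

Each dataframe should be the one loaded from the file named in the matching path, so the captured schema and statistics describe the resource they are attached to.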
Log modeling dataset statistics
To log the structure and statistics of your modeling resources, you need to specify which resource statistics you want. You can collect statistics for all modeling resources (training, testing, and validation) or choose specific resources for statistics.
For instance, if you only want statistics for your testing resource, pass the dataframes parameter only to the Resource class that you use as testing_resource.
Pandas dataframe statistics logging example
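A minimal sketch of the testing-only case, assuming the FileResource class and the Dataset.modeling constructor from the vectice package. The helper name, dataset name, and arguments are hypothetical; testing_df is expected to be a pandas dataframe, for example one loaded with pd.read_csv from the testing file.

```python
def build_modeling_dataset(train_path, test_path, validation_path, testing_df):
    """Build a modeling dataset where only the testing resource logs statistics."""
    # Imported inside the function so the helper can be defined
    # even where the vectice package is not installed.
    from vectice import Dataset, FileResource

    return Dataset.modeling(
        name="modeling_dataset",  # hypothetical dataset name
        # No dataframe passed: schema comes from the resource itself, if available.
        training_resource=FileResource(paths=train_path),
        # Dataframe passed: schema and statistics are inferred from testing_df.
        testing_resource=FileResource(paths=test_path, dataframes=testing_df),
        # No dataframe passed: same behavior as the training resource.
        validation_resource=FileResource(paths=validation_path),
    )
```

To collect statistics for all three modeling resources instead, the same dataframes argument would be passed to the training and validation resources as well.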