How to log dataset structure and statistics
Data scientists can capture their dataset structure and statistics when logging datasets to Vectice using Pandas or Spark (version 3.0 and above) dataframes.
The Most Common statistic is currently not captured with Spark dataframes.
Statistics are captured for the first 100 columns of your dataframe. To keep the data anonymous, statistics are not captured if the dataframe has fewer than 100 rows. The Org Admin can adjust this threshold in the organization settings.
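The limits above can be pictured with a short pandas sketch. It only illustrates the rule as described and is not Vectice's implementation; the function name and the two constants are hypothetical, and the row threshold is assumed to be the default of 100.

```python
import pandas as pd

MAX_STAT_COLUMNS = 100    # statistics cover the first 100 columns
MIN_ROWS_FOR_STATS = 100  # assumed default; the Org Admin can change it


def columns_with_statistics(
    df: pd.DataFrame,
    min_rows: int = MIN_ROWS_FOR_STATS,
    max_columns: int = MAX_STAT_COLUMNS,
) -> list:
    """Return the columns for which statistics would be captured (illustrative)."""
    if len(df) < min_rows:
        # Too few rows: no statistics, to keep the data anonymous.
        return []
    return list(df.columns[:max_columns])
```

With the default values, a dataframe of 99 rows yields no statistics, while a dataframe of 100 rows and 120 columns yields statistics for its first 100 columns.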
The statistics below are captured automatically, based on the column types in your dataframes. If no dataframe is given, the schema columns and rows are retrieved from the resource (if available). If dataframes are given, the schema columns and rows are inferred from the dataframes.
| Statistic | Column types | Description |
| --- | --- | --- |
| Unique | Text | The count of unique values in the column. |
| Most Common | Text | The most frequent value in the column and its percentage of occurrences. |
| Mean | Numeric, Date | The average value of the data points in the column. |
| Median | Numeric, Date | The middle value in the column. |
| Variance | Numeric | A measure of how far the data points in the column spread from the mean value. |
| St. Deviation | Numeric | The square root of the variance, a commonly used measure of the spread of the data. |
| Minimum | Numeric, Date | The smallest data point in the column. |
| Maximum | Numeric, Date | The largest data point in the column. |
| Quantiles | Numeric | The values at the 25th, 50th, and 75th percentiles, along with the minimum and maximum. |
| Missing | Text, Numeric, Date | The percentage of missing values in the column. |
| True | Boolean | The count of true values and their percentage of occurrence in the column. |
| False | Boolean | The count of false values and their percentage of occurrence in the column. |
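To make the statistics above concrete, here is a rough pandas equivalent for a single column. It is a sketch for intuition only, not how Vectice computes them: the function name is hypothetical, and details such as sample variance (ddof=1) and a Most Common percentage relative to non-missing values are assumptions.

```python
import pandas as pd
from pandas.api import types as ptypes


def _percent(count: int, total: int) -> float:
    """Percentage of `count` in `total`, or 0.0 for an empty column."""
    return 100 * count / total if total else 0.0


def column_statistics(series: pd.Series) -> dict:
    """Approximate the listed statistics for one column, keyed by statistic name."""
    # Check booleans first: pandas also treats bool as a numeric dtype.
    if ptypes.is_bool_dtype(series):
        total = len(series)
        true_count = int(series.sum())
        false_count = int((~series).sum())
        return {
            "True": (true_count, _percent(true_count, total)),
            "False": (false_count, _percent(false_count, total)),
        }

    # Text, numeric, and date columns all report the share of missing values.
    stats = {"Missing": 100 * float(series.isna().mean())}

    if ptypes.is_numeric_dtype(series):
        stats.update({
            "Mean": series.mean(),
            "Median": series.median(),
            "Variance": series.var(),       # sample variance; an assumption
            "St. Deviation": series.std(),
            "Minimum": series.min(),
            "Maximum": series.max(),
            "Quantiles": series.quantile([0.25, 0.5, 0.75]).tolist(),
        })
    elif ptypes.is_datetime64_any_dtype(series):
        stats.update({
            "Mean": series.mean(),
            "Median": series.median(),
            "Minimum": series.min(),
            "Maximum": series.max(),
        })
    else:
        # Text columns: unique count and the most frequent value.
        shares = series.value_counts(normalize=True)  # sorted, most frequent first
        stats["Unique"] = int(series.nunique())
        if not shares.empty:
            stats["Most Common"] = (shares.index[0], 100 * float(shares.iloc[0]))
    return stats
```

Calling `column_statistics(df[name])` for each column of a dataframe `df` gives one dictionary per column, with keys matching the statistic names in the table.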
Capture schema without statistics
By default, both schema and column statistics are captured. Setting capture_schema_only to True captures only schema information, excluding column stats.
Column stats computation can impact processing time, so it is recommended to set capture_schema_only to True if performance is a concern or detailed stats are not needed.
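A minimal sketch of a schema-only capture follows. It assumes that FileResource is the Resource class in use and that it takes the file location through a paths argument; neither is stated above. The helper name is hypothetical, and its optional resource_cls argument exists only so the sketch can be tried with a stand-in class when the Vectice package is not installed.

```python
import pandas as pd


def schema_only_resource(path: str, df: pd.DataFrame, resource_cls=None):
    """Build a resource that records the schema but skips column statistics."""
    if resource_cls is None:
        # Assumption: FileResource is the Resource class being used.
        from vectice import FileResource as resource_cls
    return resource_cls(
        paths=path,                # assumed argument name for the file location
        dataframes=df,             # schema is inferred from this dataframe
        capture_schema_only=True,  # skip column statistics
    )
```

For example, `schema_only_resource("customers.csv", df)` would return a resource for a hypothetical customers.csv file without computing statistics on `df`.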
Capture origin and cleaned dataset statistics
To capture the structure and statistics of your origin and cleaned resources for each column type, pass a dataframe to your Resource class using the dataframes parameter.
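The following sketch shows one way this could look. It assumes FileResource as the Resource class, a paths argument for the file location, and Dataset.origin and Dataset.clean as the dataset constructors; the dataset names and the helper itself are hypothetical. The optional api argument defaults to the vectice module and is there only so the sketch can be exercised with a stand-in.

```python
import pandas as pd


def build_origin_and_clean(
    origin_df: pd.DataFrame,
    clean_df: pd.DataFrame,
    origin_path: str,
    clean_path: str,
    api=None,
):
    """Build origin and cleaned datasets whose resources carry a dataframe."""
    if api is None:
        import vectice as api  # requires the Vectice package

    # Passing `dataframes` is what enables structure and statistics capture.
    origin = api.Dataset.origin(
        name="customers_origin",  # hypothetical dataset name
        resource=api.FileResource(paths=origin_path, dataframes=origin_df),
    )
    clean = api.Dataset.clean(
        name="customers_clean",   # hypothetical dataset name
        resource=api.FileResource(paths=clean_path, dataframes=clean_df),
    )
    return origin, clean
```

The returned datasets can then be logged to Vectice in the usual way.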
Capture modeling dataset statistics
Capturing your modeling resources' structure and statistics requires specifying which resource statistics you need. You can collect statistics for all modeling resources (training, testing, and validation) or select the specific resource(s) for statistics capture.
For example, if you only want statistics for your testing resource, pass a dataframe through the dataframes parameter of the Resource class you use as your testing_resource.
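As a sketch of the testing-only case, the helper below attaches a dataframe to the testing resource and leaves the training and validation resources without one. It assumes FileResource as the Resource class with a paths argument, and Dataset.modeling with training_resource and validation_resource arguments alongside testing_resource; the helper name is hypothetical, and the api argument is there only so a stand-in can replace the vectice module.

```python
import pandas as pd


def modeling_with_testing_stats(
    name: str,
    train_path: str,
    test_path: str,
    validation_path: str,
    test_df: pd.DataFrame,
    api=None,
):
    """Build a modeling dataset that captures statistics for testing data only."""
    if api is None:
        import vectice as api  # requires the Vectice package

    return api.Dataset.modeling(
        name=name,
        # No dataframe: no statistics are computed for the training resource.
        training_resource=api.FileResource(paths=train_path),
        # Dataframe given: structure and statistics are captured.
        testing_resource=api.FileResource(paths=test_path, dataframes=test_df),
        # No dataframe: no statistics are computed for the validation resource.
        validation_resource=api.FileResource(paths=validation_path),
    )
```

To collect statistics for every modeling resource instead, each resource would receive its own dataframe in the same way.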