How to capture dataset structure and statistics

Data scientists can capture their dataset structure and statistics when registering datasets to Vectice using Pandas or Spark (version 3.0 and above) dataframes.

The "Most Common" statistic is currently not captured for Spark dataframes.

Statistics are captured for the first 100 columns of your dataframe. To keep the data anonymous, statistics are not captured if the dataframe has fewer than 100 rows. The Org Admin can adjust this threshold in the organization settings.
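The capture rule above can be sketched in a few lines of Python. This only illustrates the documented behavior and is not Vectice's implementation: the function name columns_with_statistics and its parameters are hypothetical, and the 100-row default stands in for whatever threshold your Org Admin has configured.

```python
from typing import List, Sequence

MAX_STAT_COLUMNS = 100  # from the text: only the first 100 columns get statistics
DEFAULT_MIN_ROWS = 100  # from the text: default row threshold (adjustable by the Org Admin)


def columns_with_statistics(
    column_names: Sequence[str],
    row_count: int,
    min_rows: int = DEFAULT_MIN_ROWS,
) -> List[str]:
    """Return the columns for which statistics would be captured (hypothetical helper)."""
    # Below the row threshold, no statistics are captured at all.
    if row_count < min_rows:
        return []
    # Otherwise only the first MAX_STAT_COLUMNS columns are considered.
    return list(column_names[:MAX_STAT_COLUMNS])
```

With a pandas dataframe df, the call could look like columns_with_statistics(list(df.columns), len(df)).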

The table below lists the statistics that are captured automatically, based on the column types in your dataframe. If no dataframe is given, the schema's columns and rows are retrieved from the resource (if available). If a dataframe is given, the schema's columns and rows are inferred from the dataframe.

Statistic	Dataframe Column Type	Description
Unique	Text	The count of unique values in the column.
Most Common	Text	The most frequent value in the column and its percentage of occurrences.
Mean	Numeric, Date	The average value of the data points in the column.
Median	Numeric, Date	The middle value in the column.
Variance	Numeric	A measure of how far the data points in the column spread from the mean value.
St. Deviation	Numeric	The square root of the variance, a commonly used measure of the spread of the data.
Minimum	Numeric, Date	The smallest data point in the column.
Maximum	Numeric, Date	The largest data point in the column.
Quantiles	Numeric	The values at the 25th, 50th, and 75th percentiles of the column, along with its min and max.
Missing	Text, Numeric, Date	The percentage of missing values in the column.
True	Boolean	The count of true values and their percentage of occurrence in the column.
False	Boolean	The count of false values and their percentage of occurrence in the column.
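
To make the definitions in the table concrete, the sketch below computes comparable values for a single column using only Python's standard library. It is meant for intuition and is not Vectice's code: the helper names are hypothetical, None is assumed to mark a missing value, and the choices the table does not specify (sample variance, inclusive quartile interpolation, percentages taken relative to non-missing values) are assumptions. Date columns are left out for brevity.

```python
import statistics
from collections import Counter


def _split_missing(values):
    """Return (non-missing values, percentage of missing values)."""
    present = [v for v in values if v is not None]
    missing_pct = 100 * (len(values) - len(present)) / len(values)
    return present, missing_pct


def text_stats(values):
    """Unique, Most Common, and Missing for a text column (needs one non-missing value)."""
    present, missing_pct = _split_missing(values)
    value, count = Counter(present).most_common(1)[0]
    return {
        "unique": len(set(present)),
        # Assumption: the percentage is relative to non-missing values.
        "most_common": (value, 100 * count / len(present)),
        "missing_pct": missing_pct,
    }


def numeric_stats(values):
    """Numeric statistics from the table (needs at least two non-missing values)."""
    present, missing_pct = _split_missing(values)
    # Assumption: quartiles use linear interpolation that includes both endpoints.
    q25, q50, q75 = statistics.quantiles(present, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(present),
        "median": statistics.median(present),
        "variance": statistics.variance(present),  # assumption: sample variance
        "std_deviation": statistics.stdev(present),
        "minimum": min(present),
        "maximum": max(present),
        "quantiles": {
            "min": min(present), "25%": q25, "50%": q50, "75%": q75, "max": max(present),
        },
        "missing_pct": missing_pct,
    }


def boolean_stats(values):
    """True and False counts with their percentages (needs one non-missing value)."""
    present, _ = _split_missing(values)
    true_count = sum(1 for v in present if v is True)
    false_count = len(present) - true_count
    return {
        "true": (true_count, 100 * true_count / len(present)),
        "false": (false_count, 100 * false_count / len(present)),
    }
```

Each helper takes a plain list, so a pandas column could be passed as df["my_column"].tolist() after mapping missing entries to None; my_column is a placeholder name.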

Capture Schema without Statistics

By default, both schema and column statistics are captured. Setting capture_schema_only to True captures only schema information, excluding column stats.

my_resource = FileResource(paths, dataframes, capture_schema_only=True)

Computing column statistics can increase processing time, so set capture_schema_only to True if performance is a concern or detailed statistics are not needed.
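
One way to apply this advice is to decide on capture_schema_only from the size of the dataframe. The sketch below is a hedged illustration only: the helper prefer_schema_only and its 5,000,000-cell cutoff are hypothetical and not part of Vectice, so pick a limit that matches your own performance measurements.

```python
def prefer_schema_only(row_count: int, column_count: int, max_cells: int = 5_000_000) -> bool:
    """Return True when the dataframe is large enough that skipping statistics may help.

    The max_cells cutoff is an arbitrary illustrative value, not a Vectice setting.
    """
    return row_count * column_count > max_cells


# Hypothetical usage with a pandas dataframe `df` and the FileResource shown above:
#   rows, cols = df.shape
#   my_resource = FileResource(paths, df, capture_schema_only=prefer_schema_only(rows, cols))
```

Keeping the decision in a small function makes the cutoff easy to tune without touching the registration code.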

Capture Origin and Cleaned Dataset Statistics

To capture the structure and per-column statistics of an origin or cleaned resource, pass a dataframe to your Resource class using the dataframes parameter:

from vectice import Dataset, GCSResource

df = ...  # load your pandas or Spark dataframe

gcs_resource = GCSResource(
    uris="gs://my_bucket_name/my_folder/my_filename",
    dataframes=df
)

# Register the origin or cleaned dataset, with its structure and statistics, in Vectice
dataset = Dataset.clean(gcs_resource, name="Clean_Dataset")

Capture Modeling Dataset Statistics

Capturing your modeling resources' structure and statistics requires specifying which resource statistics you need. You can collect statistics for all modeling resources (training, testing, and validation) or select the specific resource(s) for statistics capture.

For example, if you only want statistics for your testing resource, pass a dataframe through the dataframes parameter of the Resource class for testing_resource only.

Pandas dataframe statistics capture example

import pandas as pd
from vectice import Dataset, FileResource

# Load your pandas (or Spark) dataframes for your modeling datasets
training_dataframe = pd.read_csv("train_clean.csv")
testing_dataframe = pd.read_csv("train_reduced.csv")
validation_dataframe = pd.read_csv("validation.csv")

# Create resources that wrap your datasets and collect statistics
training_resource = FileResource(paths="train_clean.csv", dataframes=training_dataframe)
testing_resource = FileResource(paths="train_reduced.csv", dataframes=testing_dataframe)

# Create a resource without collecting statistics (no dataframe is passed)
validation_resource = FileResource(paths="validation.csv")

# Register the modeling dataset, with its structure and statistics, in Vectice
modeling_dataset = Dataset.modeling(
    name="Modeling",
    training_resource=training_resource,
    testing_resource=testing_resource,
    validation_resource=validation_resource,
)
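
Selecting which modeling resources get statistics comes down to which ones receive a dataframe. The sketch below shows one possible way to build the keyword arguments per split so that only the chosen splits carry the dataframes parameter. The helper resource_kwargs_by_split and the split names are hypothetical conveniences; only the paths and dataframes parameter names come from the examples above.

```python
from typing import Any, Dict, Iterable


def resource_kwargs_by_split(
    paths_by_split: Dict[str, str],
    dataframes_by_split: Dict[str, Any],
    capture_stats_for: Iterable[str],
) -> Dict[str, Dict[str, Any]]:
    """Build resource keyword arguments, adding `dataframes` only for selected splits."""
    selected = set(capture_stats_for)
    kwargs_by_split: Dict[str, Dict[str, Any]] = {}
    for split, path in paths_by_split.items():
        kwargs: Dict[str, Any] = {"paths": path}
        # Only splits selected for statistics capture receive their dataframe.
        if split in selected:
            kwargs["dataframes"] = dataframes_by_split[split]
        kwargs_by_split[split] = kwargs
    return kwargs_by_split


# Hypothetical usage, capturing statistics for the testing resource only:
#   kwargs = resource_kwargs_by_split(paths, dataframes, capture_stats_for=["testing"])
#   testing_resource = FileResource(**kwargs["testing"])        # with statistics
#   training_resource = FileResource(**kwargs["training"])      # schema from the resource
```

Resources built without a dataframe behave like validation_resource in the example above: no statistics are collected for them.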
