Log dataset structure and statistics

Data scientists can document dataset structure and statistics in Vectice by passing Pandas or Spark (version 3.0 and newer) dataframes when logging datasets.

Most common statistics are currently not captured for Spark dataframes.

Statistics are captured for the first 100 columns of your dataframe. To keep the data anonymous, statistics are not captured if the dataframe has fewer than 100 rows. The Org Admin can adjust this threshold in the organization settings.

  • If no dataframe is provided, Vectice retrieves the schema columns and rows from the resource (if available).

  • If dataframes are provided, Vectice infers the schema columns and rows from the dataframe.

The following statistics are automatically captured based on the column types in your dataframes:

| Stats | Dataframe Column Type | Description |
| --- | --- | --- |
| Unique | Text | The count of unique values in the column. |
| Most Common | Text | The most recurrent value in the column and its percentage of occurrences. |
| Mean | Numeric, Date | The average value of the data points in the column. |
| Median | Numeric, Date | The middle value of the data points in the column. |
| Variance | Numeric | The measure of how far the data points in the column spread out from the mean value. |
| St. Deviation | Numeric | The square root of the variance, a commonly used measure of data spread. |
| Minimum | Numeric, Date | The smallest data point in the column. |
| Maximum | Numeric, Date | The largest data point in the column. |
| Quantiles | Numeric | The values at the 25%, 50%, and 75% percentiles, along with the minimum and maximum. |
| Missing | Text, Numeric, Date | The percentage of missing values in the column. |
| True | Boolean | The count of true values and their percentage of occurrence in the column. |
| False | Boolean | The count of false values and their percentage of occurrence in the column. |
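
For illustration, here is a small hypothetical dataframe mixing the column types above; which statistics are computed for a column depends on its type, and, per the threshold above, statistics are only captured once the dataframe has at least 100 rows. The file and column names are placeholders:

import pandas as pd
from vectice import FileResource

# Hypothetical dataframe mixing the column types described in the table above
df = pd.DataFrame({
    "segment": ["retail", "corporate", "retail", "sme"],       # Text: Unique, Most Common, Missing
    "balance": [1200.50, 87000.00, 430.00, 15200.75],          # Numeric: Mean, Median, Variance, ...
    "opened_on": pd.to_datetime(["2021-01-04", "2020-06-15",
                                 "2022-03-01", "2019-11-20"]),  # Date: Mean, Median, Minimum, Maximum
    "is_active": [True, False, True, True],                     # Boolean: True and False counts
})

# Passing the dataframe lets Vectice infer the schema and compute the statistics above
resource = FileResource(paths="accounts.csv", dataframes=df)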

Capture schema without statistics

By default, both schema and column statistics are captured. Setting capture_schema_only to True captures only schema information, excluding column stats.

resource = FileResource(paths=paths, dataframes=dataframes, capture_schema_only=True)

Computing column statistics can increase processing time, so it is recommended to set capture_schema_only to True if performance is a concern or detailed statistics are not needed.
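
For comparison, here is a minimal sketch contrasting the default behavior with schema-only capture; the file name is a placeholder:

import pandas as pd
from vectice import FileResource

df = pd.read_csv("customers.csv")  # placeholder file

# Default: both the schema and the column statistics are captured
full_resource = FileResource(paths="customers.csv", dataframes=df)

# Schema only: statistics computation is skipped, which is faster for wide dataframes
schema_only_resource = FileResource(paths="customers.csv", dataframes=df, capture_schema_only=True)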

Log origin and cleaned dataset statistics

To log the structure and statistics of your origin or cleaned dataset, pass a dataframe to your Resource class using the dataframes parameter:

from vectice import Dataset, GCSResource

df = ...  # load your pandas or Spark dataframe

gcs_resource = GCSResource(
    uris="gs://my_bucket_name/my_folder/my_filename",
    dataframes=df,
)

# Create the origin or cleaned dataset with its structure and statistics to log to Vectice
dataset = Dataset.clean(gcs_resource, name="Clean_Dataset")
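
Note that creating the Dataset object only wraps the resource and its statistics; to send it to Vectice, log the dataset to an iteration. Here is a minimal sketch, assuming the standard connect and iteration workflow described in Connect to API (the API token and phase ID are placeholders):

import vectice

# Connect and point to the phase you are working in (placeholder token and phase ID)
phase = vectice.connect(api_token="YOUR_API_TOKEN").phase("PHA-XXXX")
iteration = phase.create_iteration()

# Logging the dataset captures its structure and column statistics in Vectice
iteration.log(dataset)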

Log modeling dataset statistics

To log the structure and statistics of your modeling resources, specify which resources you want statistics for. You can collect statistics for all modeling resources (training, testing, and validation) or only for specific ones.

For instance, if you only want statistics for your testing resource, pass its dataframe to the dataframes parameter of the Resource class used for testing_resource.

Pandas dataframe statistics logging example

import pandas as pd
from vectice import Dataset, FileResource

# Create your pandas (or Spark) dataframes for your modeling datasets
training_dataframe = pd.read_csv("train_clean.csv")
testing_dataframe = pd.read_csv("train_reduced.csv")
validation_dataframe = pd.read_csv("validation.csv")

# Create your dataset resources to wrap your datasets and collect statistics
training_resource = FileResource(paths="train_clean.csv", dataframes=training_dataframe)
testing_resource = FileResource(paths="train_reduced.csv", dataframes=testing_dataframe)

# Data resource without collecting statistics
validation_resource = FileResource(paths="validation.csv")

# Create the modeling dataset with the structure and statistics of each resource
modeling_dataset = Dataset.modeling(
    name="Modeling",
    training_resource=training_resource,
    testing_resource=testing_resource,
    validation_resource=validation_resource,
)
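
As with origin and cleaned datasets, the modeling dataset reaches Vectice once it is logged to an iteration, as sketched in the previous section:

# Assumes iteration was obtained via vectice.connect, as sketched above
iteration.log(modeling_dataset)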