Introduction

BayesCART is a Python package for Bayesian Classification and Regression Trees (CART) with advanced Markov Chain Monte Carlo (MCMC) sampling. Designed for modularity, efficiency, and extensibility, it follows best practices of modern, high-performance machine learning software.

At a glance

  • Advanced ML Algorithms: Implements Bayesian MCMC samplers for flexible and scalable probabilistic tree modelling.
  • Efficient Software Design: Built on modular OOP principles, with caching strategies and optimized tree operations for high performance.
  • Robust Engineering Practices: Comprehensive documentation, type hints, and automated CI workflows enhance reliability and maintainability.
  • Scalability and Extensibility: Designed to handle large datasets efficiently, with subclassing mechanisms allowing seamless expansion of features.
  • Parallelism and Performance: Supports parallel execution to accelerate MCMC computations, leveraging multiprocessing for scalable inference.
  • Optimized Data Structures: Integrates efficiently with pandas and numpy to manage data and speed up operations, reducing processing overhead.
  • Innovative Tempering Strategies: Tackles multimodality in Bayesian CART models with custom MCMC tempering techniques, ensuring efficient posterior exploration and improved convergence.
  • Optimized Memory Usage: Lightweight copy mechanisms reduce overhead during iterative tree updates, minimizing redundant computations.

Design Philosophy and Modularity

The package follows an object-oriented design with a hierarchical class structure for seamless extension and optimization. Memory and speed optimized methods can be introduced without modifying core infrastructure, and new sampling strategies integrate easily via subclassing.

Computational Efficiency

Several design choices have been introduced to boost performance:

  • Optimized Copying: NodeFast and TreeFast reduce memory overhead by copying only those attributes that would corrupt the sampling process.
  • Caching: Likelihoods, priors, data, and other intermediate results are stored at the node level, minimizing redundant calculations.
  • Fast Sampling: Custom samplers for prior distributions replace slower library routines.
  • Efficient Data Handling: Data subsets are stored at each internal and leaf node to provide quick access. While this increases the memory footprint, it noticeably speeds up routine operations such as splitting and estimating parameters, because no pre-processing of the tree data is needed.
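
The caching and lightweight-copy ideas above can be illustrated with a minimal sketch. The class `CachedNode`, its attributes, and the toy Gaussian likelihood are hypothetical and chosen only for illustration; they are not the actual `NodeFast` or `TreeFast` API.

```python
import copy


class CachedNode:
    """Hypothetical node that stores its data subset and caches its log-likelihood."""

    def __init__(self, y):
        self.y = y              # data subset reaching this node (read-only, safe to share)
        self.split_rule = {}    # mutable state that a proposal may change
        self._loglik = None     # cached value; None means "not computed yet"
        self.n_evals = 0        # counts real evaluations, for illustration only

    def log_likelihood(self):
        # Recompute only when the cache is empty.
        if self._loglik is None:
            self.n_evals += 1
            mean = sum(self.y) / len(self.y)
            sse = sum((v - mean) ** 2 for v in self.y)
            # Toy Gaussian log-likelihood with unit variance, up to a constant.
            self._loglik = -0.5 * sse
        return self._loglik

    def invalidate(self):
        # Call after the node's data subset changes.
        self._loglik = None

    def light_copy(self):
        """Share the read-only data; duplicate only the state a proposal mutates."""
        new = copy.copy(self)                            # shallow copy: `y` is shared
        new.split_rule = copy.deepcopy(self.split_rule)  # independent mutable state
        return new


node = CachedNode([1.0, 2.0, 3.0])
first = node.log_likelihood()
second = node.log_likelihood()       # served from the cache
proposal = node.light_copy()
proposal.split_rule["feature"] = "x1"
```

Sharing the data while deep-copying only the mutable rule keeps a proposal cheap to build and still prevents it from altering the current state of the chain.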

Advanced MCMC and Tempering Strategies

BayesCART supports multiple tempering strategies for better posterior exploration:

  • Geometric Tempering (BCARTGeom): Designed for multimodal distributions, this method duplicates and flattens the posterior density function to facilitate exploration. It helps escape local modes but may require careful tuning of the temperature schedule to balance exploration and exploitation.
  • Likelihood-Based Tempering (BCARTGeomLik): Similar to Geometric Tempering, but flattening is applied only to the integrated likelihood. Because prior information on the tree characteristics (e.g. size) is retained, the flattened posterior is biased towards less overfitted trees, which are more likely a priori. This method enhances mixing but can potentially distort prior information.
  • Pseudo-Prior Tempering (BCARTPseudoPrior): Introduces an auxiliary prior to smooth transitions between modes, biasing sampling towards smaller trees. This aids exploration because smaller trees are more easily changed by the MCMC. It shows similar performance to Likelihood-Based Tempering, but with a more theoretically sound formulation.
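
As a rough guide, the first two strategies can be written in the standard textbook form below. The notation is an assumption made for illustration: T is a tree, y the data, p(y | T) the integrated likelihood, p(T) the tree prior, and β in (0, 1] an inverse temperature. The package's exact formulation may differ, and the pseudo-prior variant is left out because its form is not specified here.

```latex
\begin{align*}
\text{Geometric:} \quad
  & \pi_\beta(T \mid y) \propto \left[ p(y \mid T)\, p(T) \right]^{\beta}, \\
\text{Likelihood-based:} \quad
  & \pi_\beta(T \mid y) \propto p(y \mid T)^{\beta}\, p(T), \\
\text{Swap acceptance:} \quad
  & \alpha_{ij} = \min\left\{ 1,\;
    \frac{\pi_{\beta_i}(T_j \mid y)\, \pi_{\beta_j}(T_i \mid y)}
         {\pi_{\beta_i}(T_i \mid y)\, \pi_{\beta_j}(T_j \mid y)} \right\}.
\end{align*}
```

With β = 1 both targets reduce to the ordinary posterior. As β approaches 0 the geometric target becomes flat, whereas the likelihood-based target falls back to the prior p(T), which is why it keeps favouring trees the prior prefers.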

Moreover, a shared parallel tempering base class ensures modularity for new methods, allowing flexible implementation of alternative strategies.
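
A minimal sketch of how such a shared base class might look is given below, on a toy one-dimensional bimodal target rather than on trees. All names (`TemperedSamplerBase`, `tempered_log_target`, `GeometricTempering`, `LikelihoodTempering`) are hypothetical and do not reflect the package's actual class hierarchy. The point is only that a new strategy overrides a single method while the chain updates and swap moves are inherited.

```python
import math
import random
from abc import ABC, abstractmethod


def log_lik(x):
    """Toy bimodal log-likelihood with narrow modes at -3 and +3 (stable log-sum-exp)."""
    a, b = -2.0 * (x + 3.0) ** 2, -2.0 * (x - 3.0) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def log_prior(x):
    """Toy Gaussian log-prior with standard deviation 10, up to a constant."""
    return -0.5 * (x / 10.0) ** 2


class TemperedSamplerBase(ABC):
    """Hypothetical base class: one chain per inverse temperature, plus swap moves."""

    def __init__(self, betas, seed=0):
        self.betas = list(betas)               # betas[0] = 1.0 is the untempered chain
        self.states = [0.0] * len(self.betas)
        self.rng = random.Random(seed)

    @abstractmethod
    def tempered_log_target(self, x, beta):
        """Subclasses decide which part of the posterior is flattened."""

    def _accept(self, log_alpha):
        return self.rng.random() < math.exp(min(0.0, log_alpha))

    def step(self):
        f = self.tempered_log_target
        # Within-chain random-walk Metropolis update.
        for k, beta in enumerate(self.betas):
            x = self.states[k]
            prop = x + self.rng.gauss(0.0, 1.0)
            if self._accept(f(prop, beta) - f(x, beta)):
                self.states[k] = prop
        # Propose a state swap between one random pair of adjacent temperatures.
        if len(self.betas) > 1:
            i = self.rng.randrange(len(self.betas) - 1)
            j = i + 1
            xi, xj = self.states[i], self.states[j]
            bi, bj = self.betas[i], self.betas[j]
            if self._accept(f(xj, bi) + f(xi, bj) - f(xi, bi) - f(xj, bj)):
                self.states[i], self.states[j] = xj, xi

    def run(self, n_steps):
        trace = []
        for _ in range(n_steps):
            self.step()
            trace.append(self.states[0])       # record the untempered chain only
        return trace


class GeometricTempering(TemperedSamplerBase):
    def tempered_log_target(self, x, beta):
        return beta * (log_lik(x) + log_prior(x))   # flatten the whole posterior


class LikelihoodTempering(TemperedSamplerBase):
    def tempered_log_target(self, x, beta):
        return beta * log_lik(x) + log_prior(x)     # flatten the likelihood only


trace = GeometricTempering([1.0, 0.5, 0.1]).run(200)
```

Under this design, adding another strategy means writing one more subclass with its own `tempered_log_target`; the Metropolis updates and the swap logic stay untouched.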

Documentation and Testing

BayesCART adheres to best practices:

  • Sphinx Documentation: Clear, well-maintained API references.
  • Type Annotations: Improve readability and help prevent common runtime errors.
  • Automated Testing: CI workflows via GitHub Actions ensure reliability.

BayesCART extends the existing treelib Python library by embedding data, metadata and sampling logic within nodes. I am grateful to the treelib authors and maintainers for providing a simple yet robust implementation of trees.

Additional resources
