clean.boudt {PerformanceAnalytics}    R Documentation

clean extreme observations in a time series to provide more robust risk estimates


Description

Robustly clean a time series to reduce the magnitude, but not the number or direction, of observations that exceed the 1-α risk threshold.


Usage

clean.boudt(R, alpha = 0.01, trim = 0.001)


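Arguments

R       an xts, vector, matrix, data frame, timeSeries or zoo object of asset returns
alpha   probability to filter at 1-alpha, defaults to .01 (99%)
trim    tail probability that sets the extreme chi-squared quantile (1-trim) which a squared Mahalanobis distance must exceed to be treated as an outlier, defaults to .001 (99.9%)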


Details

Many risk measures are calculated by using the first two (four) moments of the asset or portfolio return distribution. Portfolio moments are extremely sensitive to data spikes, and this sensitivity is only exacerbated in a multivariate context. For this reason, it seems appropriate to consider estimates of the multivariate moments that are robust to return observations that deviate extremely from the Gaussian distribution.
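The sensitivity of sample moments to a single spike can be seen in a small simulation. The sketch below uses made-up data (normally distributed returns with one hypothetical crash day inserted) and a hand-written excess kurtosis helper; none of these names come from the package.

```r
# Illustrative only: simulated returns, not real data.
set.seed(42)
returns <- rnorm(250, mean = 0, sd = 0.01)  # roughly one year of daily returns
spiked <- returns
spiked[100] <- -0.15                        # a single hypothetical crash day

# Sample excess kurtosis from the second and fourth central moments
excess_kurtosis <- function(x) {
  z <- x - mean(x)
  mean(z^4) / mean(z^2)^2 - 3
}

moments <- rbind(
  original = c(sd = sd(returns), excess_kurtosis = excess_kurtosis(returns)),
  spiked   = c(sd = sd(spiked),  excess_kurtosis = excess_kurtosis(spiked))
)
print(round(moments, 4))
```

Changing one observation out of 250 is enough to inflate both the standard deviation and, far more strongly, the kurtosis.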

There are two main approaches to defining robust alternatives to estimating the multivariate moments by their sample means (see e.g. Maronna et al., 2006). One approach is to consider a more robust estimator than the sample means. Another is to first clean the data in a robust way and then take the sample means and moments of the cleaned data.

Our cleaning method follows the second approach. It is designed in such a way that, if we want to estimate downside risk with loss probability α, it will never clean observations that belong to the fraction 1-α of least extreme observations. Suppose we have an n-dimensional vector time series of length T: r_1,...,r_T. We clean this time series in three steps.

  1. Ranking the observations according to their extremeness. Denote by μ and Σ the mean and covariance matrix of the bulk of the data, and let \lfloor \cdot \rfloor be the operator that takes the integer part of its argument. As a measure of the extremeness of the return observation r_t, we use its squared Mahalanobis distance d^2_t = (r_t-μ)'Σ^{-1}(r_t-μ). We follow Rousseeuw (1985) by estimating μ and Σ as the mean vector and covariance matrix (corrected to ensure consistency) of the subset of size \lfloor (1-α)T \rfloor for which the determinant of the covariance matrix of the elements in that subset is the smallest. These estimates will be robust against the fraction α of most extreme returns. Let d^2_{(1)},...,d^2_{(T)} be the ordered sequence of the estimated squared Mahalanobis distances such that d^2_{(i)} \leq d^2_{(i+1)}.
  2. Outlier identification. Return observations are qualified as outliers if their estimated squared Mahalanobis distance d^2_t is greater than the empirical 1-α quantile d^2_{(\lfloor (1-α)T \rfloor)} and exceeds a very extreme quantile of the chi-squared distribution function with n degrees of freedom, which is the distribution function of d^2_t when the returns are normally distributed. In this application we take the 99.9% quantile, denoted \chi^2_{n,0.999}.
  3. Data cleaning. Similarly to Khan et al. (2007), we only clean the returns that are identified as outliers in step 2, replacing each such return r_t with

    r_t \sqrt{ \max( d^2_{(\lfloor (1-α)T \rfloor)}, \chi^2_{n,0.999} ) / d^2_t }

    The cleaned return vector has the same orientation as the original return vector, but its magnitude is smaller. Khan et al. (2007) call this procedure of limiting the value of d^2_t to a quantile of the \chi^2_n distribution "multivariate Winsorization".
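The three steps can be sketched in a few lines of R. This is an illustrative reading of the description above, not the package's source code: the function name boudt_clean_sketch, its return structure, and the simulated data are all assumptions, and MASS::cov.rob(method = "mcd") stands in for the minimum covariance determinant estimator. Whether cov.rob applies the same consistency correction as the one mentioned in step 1 is not verified here.

```r
# Illustrative sketch only; not the PerformanceAnalytics implementation.
library(MASS)  # provides cov.rob(); MASS is a recommended package shipped with R

boudt_clean_sketch <- function(R, alpha = 0.01, trim = 0.001) {
  R <- as.matrix(R)
  T_obs <- nrow(R)                  # length of the series (T in the text)
  n <- ncol(R)                      # number of assets
  h <- floor((1 - alpha) * T_obs)   # size of the "bulk" subset

  # Step 1: robust mean and covariance from an MCD fit on h observations,
  # then squared Mahalanobis distances for every observation.
  mcd <- MASS::cov.rob(R, quantile.used = h, method = "mcd")
  d2 <- mahalanobis(R, center = mcd$center, cov = mcd$cov)

  # Step 2: an outlier must exceed both the empirical 1-alpha quantile
  # and the extreme chi-squared quantile, i.e. the larger of the two.
  cutoff <- max(sort(d2)[h], qchisq(1 - trim, df = n))
  is_outlier <- d2 > cutoff

  # Step 3: shrink outlying rows by sqrt(cutoff / d2); other rows are kept.
  shrink <- ifelse(is_outlier, sqrt(cutoff / d2), 1)
  list(cleaned = R * shrink, outliers = which(is_outlier))
}

# Simulated two-asset example with one hypothetical spike in row 50
set.seed(1)
R_sim <- matrix(rnorm(400, sd = 0.01), ncol = 2)
R_sim[50, ] <- c(-0.20, 0.15)
res <- boudt_clean_sketch(R_sim)
print(res$outliers)
print(rbind(original = R_sim[50, ], cleaned = res$cleaned[50, ]))
```

Because each flagged row is multiplied by a single positive factor below one, the signs of its components are preserved and only its length shrinks, which matches the "same orientation, smaller magnitude" property described in step 3.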

Note that the primary value of data cleaning lies in creating a more robust and stable estimate of the distribution generating the large majority of the return data. The more robust and stable moment estimates obtained from cleaned data should be used for portfolio construction. If a portfolio manager wishes to have a more conservative risk estimate, cleaning may not be appropriate for risk monitoring. It is also important to note that the robust method proposed here does not remove data from the series, but only decreases the magnitude of the extreme events. It may also be appropriate in practice to use a cleaning threshold somewhat outside the VaR threshold that the manager wishes to consider. In actual practice, it is probably best to back-test the results of both cleaned and uncleaned series to see what works best with the particular combination of assets under consideration.


Value

cleaned data matrix


Note

This function and much of this text were originally written for Boudt et al. (2008).


Author(s)

Kris Boudt, Brian G. Peterson


References

Boudt, K., B. G. Peterson, and C. Croux (2008). Estimation and Decomposition of Downside Risk for Portfolios with Non-Normal Returns. Journal of Risk, forthcoming.

Khan, J. A., S. Van Aelst, and R. H. Zamar (2007). Robust linear model selection based on least angle regression. Journal of the American Statistical Association 102.

Maronna, R. A., D. R. Martin, and V. J. Yohai (2006). Robust Statistics: Theory and Methods. Wiley.

Rousseeuw, P. J. (1985). Multivariate estimation with high breakdown point. In W. Grossmann, G. Pflug, I. Vincze, and W. Wertz (Eds.), Mathematical Statistics and Its Applications, Volume B, pp. 283-297. Dordrecht: Reidel.

See Also


[Package PerformanceAnalytics version 0.9.9-5 Index]