Would you know an outlier if you saw one? They’re everywhere and easy to spot, or so one can argue. But casual observation is no substitute for a robust statistical definition.
Indeed, definitions matter a lot in this space. Alas, there’s no consensus on a single best way to identify “extreme” values in a data set for every analytical project. Regardless, the stakes are high: extreme numbers can reduce the reliability of modeling and analysis, so it’s often essential to filter these outliers.
“An outlier is an observation that lies an abnormal distance from other values in a random sample from a population,” advises the Engineering Statistics Handbook. Unfortunately, that invites debate, as the handbook itself acknowledges: “this definition leaves it up to the analyst (or a consensus process) to decide what will be considered abnormal.”
The good news: there are a number of techniques for identifying outliers. The catch is that each technique has its own pros and cons, so there’s no one-size-fits-all solution.
To understand what’s available and how to identify the best option for your data analytics, let’s briefly review the choices. That starts with recognizing that finding “abnormal” data points first requires defining “normal.”
One of the standard approaches is to use the interquartile range (IQR), which measures the statistical dispersion of a data set based on quartiles. The IQR is the distance between the 25th percentile (the first quartile) and the 75th percentile (the third quartile). In the simple application used here, data between those two percentiles is considered “normal,” and numbers outside that range are the outliers.
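The quartile-based filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `quartile_band_outliers` and the sample numbers are made up for this example, and the quartiles come from the standard library’s `statistics.quantiles` with linear interpolation (`method="inclusive"`), which may differ slightly from the quartile convention used in other tools.

```python
from statistics import quantiles


def quartile_band_outliers(values):
    """Flag values that fall outside the 25th-75th percentile band.

    Hypothetical helper for illustration; not taken from the article.
    """
    # quantiles(..., n=4) returns the three cut points: Q1, median, Q3.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    outliers = [v for v in values if v < q1 or v > q3]
    return q1, q3, outliers


# Made-up one-year returns in percent (NOT actual S&P 500 data).
sample_returns = [-12.0, -3.0, 0.0, 4.0, 8.0, 10.0, 12.0, 15.0, 19.0, 25.0, 31.0]

q1, q3, outliers = quartile_band_outliers(sample_returns)
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {q3 - q1}")
print("Flagged as outliers:", outliers)
```

Note that a filter of this kind flags roughly half of any sample by construction, since about a quarter of the observations sit below the 25th percentile and about a quarter sit above the 75th.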
As an example, let’s run the analytics using rolling one-year percentage changes for the US stock market (S&P 500) since 1959. For perspective, here’s how the raw data compares through time.
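For readers who want to reproduce this kind of series, here is a minimal sketch of a rolling one-year percentage change. It assumes month-end index levels and a 12-period lookback; the article does not state the data frequency, so both are assumptions, and the function name `rolling_pct_change` and the price levels are hypothetical.

```python
def rolling_pct_change(prices, window=12):
    """Percentage change of each price versus the price `window` periods earlier.

    With month-end prices, window=12 gives a rolling one-year change
    (an assumption for this sketch; daily data would need a longer window).
    """
    return [
        (prices[i] / prices[i - window] - 1.0) * 100.0
        for i in range(window, len(prices))
    ]


# Hypothetical month-end index levels (NOT actual S&P 500 data).
monthly_closes = [100.0 + 2.0 * i for i in range(15)]

one_year_returns = rolling_pct_change(monthly_closes, window=12)
print([round(r, 2) for r in one_year_returns])
```

The first 12 observations produce no output because there is no price from a year earlier to compare against, so the result is shorter than the input by the size of the window.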
It’s not obvious how to define outliers by looking at the chart above. That’s where IQR analysis can help, at least as an initial filtering step. The boxplot below shows the IQR for these returns in the grey box, which covers performances from roughly 0% to 19%. By this measure, returns that are negative or above 19% are considered outliers.
But that’s a bit harsh since one-year negative returns for the S&P 500 are common, or at least not unusual, through time. In other words, the standard approach for identifying outliers via IQR isn’t practical here. Fortunately, there are other techniques that are better suited to finding outliers in financial markets.
In upcoming installments of this series, we’ll take a closer look at the possibilities for improving on the IQR method for outlier detection.
Learn To Use R For Portfolio Analysis
Quantitative Investment Portfolio Analytics In R:
An Introduction To R For Modeling Portfolio Risk and Return
By James Picerno