
Monday, September 5, 2022

Compact Letter Display (CLD) to improve transparency of multiple hypothesis testing

Multiple hypothesis testing is most commonly undertaken using ANOVA.  But, ANOVA is an incomplete test because it only tells you whether, among several variables or factors, at least one mean differs.  It does not tell you which specific ones are truly different.  Maybe out of 5 variables (A, B, C, D, E) only E is truly different, and that sole variable drives the ANOVA F test to statistical significance while the other 4 variables have similar means.

 

The Tukey Honestly Significant Difference test (Tukey HSD) remedies the above situation.  This is a post-ANOVA test that checks whether each variable is different from each of the others.  Tukey HSD is conducted on a pairwise basis, much like an unpaired t test, so it tests the difference in means for A vs. B, A vs. C, A vs. D, etc. (while adjusting for the number of comparisons).  While the Tukey HSD test provides an abundance of information to supplement ANOVA, its output is overwhelming for non-statisticians.
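To make this concrete, here is a hedged sketch of the ANOVA-then-Tukey workflow in Python using SciPy (the post's analysis was not necessarily done this way, and the rainfall figures below are invented stand-ins, not the actual city data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Invented rainfall samples (inches/year) -- for illustration only
groups = {
    "Seattle": rng.normal(38, 4, 30),
    "Portland": rng.normal(36, 4, 30),
    "San Francisco": rng.normal(23, 4, 30),
    "Spokane": rng.normal(21, 4, 30),
    "Las Vegas": rng.normal(4, 2, 30),
}

# Omnibus ANOVA: tells you only that at least one mean differs
f_stat, p_val = stats.f_oneway(*groups.values())

# Post-hoc Tukey HSD: pairwise p-values for every city pair
res = stats.tukey_hsd(*groups.values())
```

`res.pvalue` is a 5x5 matrix of pairwise p-values; entries above 0.05 mark city pairs whose means cannot be distinguished, which is exactly the information CLD compresses into letters.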

 

Compact Letter Display (CLD) dramatically improves the clarity of the ANOVA & Tukey HSD test output. 

 

CLD can be used to improve tabular data presentation and visual data presentation.  For instance, if we want to compare the average rainfall data of five West Coast cities, we could first represent the tabular data as shown below. 

 


The above table sorts the cities' rainfall data in alphabetical order.  As is, the table is not very informative: you have no idea which city's average rainfall level is statistically different from any other city's.

 

If we replicate the above table but improve it using the CLD methodology, it becomes far more informative, as shown below.


Now the table using the CLD methodology ranks the cities by mean (or average) rainfall in descending order.  Additionally, it groups the cities into clusters whose means are not statistically different from each other.

 

For instance, Seattle and Portland are both classified as "b" because their means are not statistically different (using an alpha of 0.05).

 

Similarly, San Francisco and Spokane are classified as "c" because their respective means are also not statistically different. 
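The letter assignment itself can be sketched with a short greedy routine.  This is a simplified illustration, not the insert-and-absorb algorithm that real CLD implementations (e.g., R's multcomp::cld) use, and the pairwise matrix below is hypothetical, chosen to mirror the groupings described above:

```python
def compact_letter_display(names, different):
    """Assign shared letters to groups of means that are mutually NOT
    significantly different. different[i][j] is True when means i and j
    differ significantly. Greedy sketch for illustration only."""
    groups = []  # each group holds indices that are mutually not-different
    for i in range(len(names)):
        placed = False
        for g in groups:
            if all(not different[i][j] for j in g):
                g.add(i)
                placed = True
        if not placed:
            groups.append({i})
    letters = {n: "" for n in names}
    for k, g in enumerate(groups):
        for i in sorted(g):
            letters[names[i]] += chr(ord("a") + k)
    return letters

names = ["Seattle", "Portland", "San Francisco", "Spokane"]
# Hypothetical: Seattle~Portland and San Francisco~Spokane are the
# only pairs that are NOT significantly different.
different = [
    [False, False, True,  True],
    [False, False, True,  True],
    [True,  True,  False, False],
    [True,  True,  False, False],
]
letters = compact_letter_display(names, different)
```

With this input, Seattle and Portland share one letter and San Francisco and Spokane share another, reproducing the "b" and "c" clusters in the table.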

 

When it comes to visual data, the CLD enhancement is also quite striking.  See below a starting box plot describing the rainfall levels of the five mentioned cities.  The cities are again sorted alphabetically from left to right on the X-axis.  As structured, this box plot is not that informative: you can't readily identify which cities have similar vs. dissimilar mean rainfall levels.



 Now, if we restructure this same box plot using the CLD methodology, it immediately becomes far more informative. 

 


As shown above, we can now readily identify the cities with the higher rainfall levels.  They are sorted in descending order from left to right on the X-axis.  We can also identify the cities that do not have statistically different means with Seattle and Portland both classified as "b", and San Francisco and Spokane classified as "c". 

 

You can read a more detailed explanation of this CLD methodology at the following URLs:

 

CLD at Slideshare.net

CLD at SlidesFinder 

CLD at ResearchGate 

 

 

 

Tuesday, June 21, 2022

Are we already in a recession?

2022 Q1 GDP growth was already negative.  And, 2022 Q2 may very well be negative too when the data is released.

 

The majority of the financial media believes we are already in a recession because of stubbornly high inflation (due to supply chain bottlenecks) and the Federal Reserve's aggressive monetary policy to fight it.  That policy includes a rapid rise in short-term rates and a reversal of the Quantitative Easing bond purchase program (reducing the Fed's balance sheet and taking liquidity & credit out of the financial system).  The Bearish stock market also suggests we are currently in a recession.

 

On the other hand, Government authorities including the President, the Secretary of the Treasury (Janet Yellen), and the Federal Reserve all believe that the US economy can achieve a “soft landing” with a declining inflation rate, while maintaining positive economic growth.  

 

The linked presentations include two explanatory models that attempt to predict recessions.

 

Recessions at Slideshare.net 

Recessions at SlidesFinder  

 

The first one is a logistic regression.  The second one is a deep neural network (DNN).  Both use the same set of independent variables: the velocity of money, inflation, the yield curve, and the stock market.  

 

One of the slides, copied below, describes the Logistic Regression model.

 


A foundational equality (the equation of exchange): Price x Quantity = Money x Velocity of money

 

The logistic regression to predict recessions includes Price (cpi) and Velocity (velo).  As the CPI goes up, the probability of a recession increases, and vice versa.  As the velocity of money goes up, the probability of a recession decreases, and vice versa.

 

This model also includes the yield curve, a well-established variable for predicting recessions.  Notice that this variable is not quite statistically significant (p-value 0.14).  But, the sign of its coefficient is correct, it informs and improves the model, and it is well supported by economic theory.  When the yield curve widens, the probability of a recession goes down, and vice versa.

The model also includes the stock market (S&P 500), which is by nature forward looking in terms of economic outlook.  This makes it a most relevant variable to include in a regression model predicting recessions.  When the stock market goes up, the probability of a recession goes down, and vice versa.

The deep neural network (DNN) model is described below.

 The DNN model uses the same explanatory variable inputs.  

The DNN model has two hidden layers with 3 neurons in the first one, and 2 neurons in the second one. 

The number of neurons is nearly predetermined: as a rule of thumb, hidden layers should have fewer neurons than the input layer and more neurons than the output layer.

 

The activation function is Sigmoid, the same function that underlies a Logistic Regression.  And, the output function is also Sigmoid.  This makes the DNN consistent with the Logistic Regression model.
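Here is a hedged sketch of a comparable network in Python with scikit-learn.  The author used R's neuralnet; sklearn's MLPClassifier is a stand-in, and the data below are invented, so this only mirrors the (3, 2) sigmoid structure, not the study's actual model:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(5)
# Invented standardized inputs standing in for velocity of money,
# inflation, the yield curve, and the stock market (4 features)
X = rng.normal(size=(300, 4))
y = (X[:, 1] - X[:, 0] - X[:, 2] - X[:, 3] > 0.5).astype(int)  # toy recession flag

# Two hidden layers of 3 and 2 neurons with sigmoid ("logistic")
# activations, matching the structure described in the post; the
# probabilities from predict_proba come from a sigmoid-shaped output.
dnn = MLPClassifier(hidden_layer_sizes=(3, 2), activation="logistic",
                    solver="lbfgs", max_iter=5000, random_state=0).fit(X, y)
probs = dnn.predict_proba(X)[:, 1]
```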

 

I noticed that when using the entire data set (quarterly data from 1960 to the present), ROC curves and Kolmogorov–Smirnov plots did not differentiate between the two models.  I am showing just the KS plots below.  The two plots are very similar, not allowing you to clearly rank the models.

 


The next set of plots more clearly differentiate between the two models.  



On the plots above, the recessionary quarters are shown in green, and the others in red.  You can see that the DNN generates nearly ideal probabilities: very close to 1 during a recession, and very close to zero otherwise.  The Logistic Regression model generates a much more continuous spread of probabilities within the 0 to 1 bounds.  Notice that both models do make a few mistakes, with green dots (indicating recessions) where they should be red.

The graph above displays how much more certain the DNN model is.  


All of the above visuals were generated using the entire data set.  Next, we will briefly explore how the models fared when predicting several recessionary periods treated as Hold Out (out-of-sample) data.


Let's start with the Great Recession. 

 

 

As shown above, during the Great Recession period, the Logistic Regression was a lot better at capturing the actual recessionary quarters.  It captured 4 out of 5 of them vs. only 2 out of 5 for the DNN. 

Next, let's look at the COVID Recession period. 

 

The above shows a rather rare occurrence in econometric modeling: a perfect prediction.  Indeed, both models predicted all 6 quarters of this COVID Recession period correctly, and with much certainty.  And, as a reminder, these 6 quarters were indeed treated as out-of-sample.

Next, we will use a Bayesian-style representation (built from observed frequencies) of both models, combining all the recession periods we tested on an out-of-sample basis.

 

We can treat recession like a disease.  Given a disease's prevalence and a test's sensitivity and specificity, we can map out the actual accuracy of a positive or a negative test.  Below we do exactly that, treating recession as the disease.
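That mapping is a direct application of Bayes' rule.  A small sketch follows; only the 13-of-30 prevalence and 10-of-13 sensitivity come from figures reported in this post, while the 0.9 specificity is an assumed illustrative number:

```python
def predictive_values(prevalence, sensitivity, specificity):
    """Map prevalence + test characteristics to the probability that a
    positive (negative) call is actually correct, via Bayes' rule."""
    tp = prevalence * sensitivity              # true positives
    fp = (1 - prevalence) * (1 - specificity)  # false alarms
    fn = prevalence * (1 - sensitivity)        # missed cases
    tn = (1 - prevalence) * specificity        # true negatives
    return tp / (tp + fp), tn / (tn + fn)      # (PPV, NPV)

# 13 recessionary quarters out of 30; 10 of the 13 were caught.
# The specificity of 0.9 is an assumption for illustration.
ppv, npv = predictive_values(prevalence=13/30, sensitivity=10/13, specificity=0.9)
```

With these inputs, a positive "recession" call would be correct about 85% of the time, and a negative call about 84% of the time.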

 

Here is the mentioned representation for the Logistic Regression.

 

As shown above, during the cumulative combined periods there were 13 recessionary quarters out of a total of 30 quarters.  And, the Logistic Regression model correctly predicted 10 out of the 13 recessionary quarters. 


 And, now the same representation for the DNN.

 

   A table of these accuracy measures is shown below.

 


When you use the entire data set, the DNN is marginally more accurate.  When you focus on the recessionary periods on an out-of-sample basis, the two models are very much tied.

 

So, can these models predict the current prospective recession? 

 

No, they can’t.  That is for a couple of reasons: 

 

First, both models already missed 2022 Q1 as a recessionary quarter.  Even using the historical data (not true testing), the Logistic Regression model assigned a probability of recession of only 6% for 2022 Q1, and the DNN assigned a probability of 0%.  Remember, the DNN is always far more deterministic in its probability assessments.  So, when it is wrong, it is far more off than the Logistic Regression model.

 

Second, for the models to forecast accurately going forward, you would need a crystal ball to accurately forecast the 4 independent variables.  That is a general shortcoming of all econometric models.


Tuesday, May 24, 2022

Overfitting with Deep Neural Network (DNN) models

 I developed a set of models to explain, estimate, and predict home prices.  My second modeling objective was to benchmark the accuracy in testing (prediction) of simple OLS regression models vs. more complex DNN model structures.  

I won't spend any time describing in much detail the data, the explanatory variables, etc.  For that you can look at the complete study at the following links.  The study is pretty short (about 20 slides). 

Housing Price models at Slideshare

Housing Price models at Slidesfinder 

Just to cover the basics, the dependent variable is home prices in April 2022, defined as the median county zestimate from Zillow, which I simply call zillow within the models.  The models use 7 explanatory variables that capture income, education, innovation, commute time, etc.  All variables are standardized, but the final output is translated back into nominal dollars on a scale of $000.

The models use data for over 2,500 counties. 

I developed four models:

1. A streamlined OLS regression (OLS Short) that uses only three explanatory variables.  It worked as well as any of the other models in testing/predicting; 

2. An OLS regression with all 7 explanatory variables (OLS Long).  It tested & predicted with about the same level of accuracy as OLS Short.  But, as specified it was far more explanatory (due to using 7 explanatory variables, instead of just 3); 

3. A DNN model using the smooth rectified linear unit activation function, which I called DNN Soft Plus.  This model structure had real trouble converging toward a solution.  Its testing/predicting performance was not any better than the OLS regressions;

4.  A DNN model using the Sigmoid activation function (DNN Logit).  And, this model will be the main focus of our analysis regarding overfitting with DNNs.   

The DNN Logit was structured as shown below: 

I purposefully structured the above DNN to be fairly streamlined in order to facilitate convergence toward a solution.  Nevertheless, this structure was already too much for the DNN Soft Plus, where I had to prune the hidden layers down to (3, 2) to reach even mediocre convergence (I also had to raise the error level threshold).

When using the entire data set, the Goodness-of-fit measures indicate that the DNN Logit model is the clear winner. 

You can also observe the superiority of the DNN Logit visually on the scatter plots below. 

On the scatter plot matrix above, check out the panel for the DNN Logit at the bottom right, and focus on how well it fits all the home prices > $1 million (look at the rectangle defined by the dashed red and green lines).  As shown, the DNN Logit model fits those perfectly.  Meanwhile, the 3 other models struggle to fit any of the data points > $1 million.

However, when we move on to testing by creating new data (splitting the data between a train sample and a test sample), the DNN Logit performance is mediocre. 


As shown above, when we generate new data and focus on model prediction on such data, the DNN Logit's predicting performance is rather poor.  It is actually weaker than a simple OLS regression using just 3 independent variables.

Next, let's focus on what happened to the DNN Logit model by looking how it fit the "train 50%" data (using 50% of the data to train the model and fit zestimates) vs. how it predicted on the "test 50%" data (using the other half of the data to test the model's prediction). 

As shown, in training the DNN Logit model perfectly fit the home prices > $1 million.  At this stage, the model gives you the illusion that its DNN structure was able to leverage nonlinear relationships that OLS regressions can't.

However, these nonlinear relationships uncovered during training were entirely spurious.  We can see that because, in testing, the DNN Logit model was unable to predict other home prices > $1 million within the test 50% data.

The two scatter plots above represent a perfect image of model overfitting.  
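The train/test pattern above is easy to reproduce in miniature.  The sketch below uses an overly flexible polynomial instead of a DNN, and synthetic data instead of the housing data, but the mechanics of overfitting are the same: the flexible model always fits the training half at least as well, yet that advantage is spurious on the held-out half:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
# Synthetic stand-in: a plain linear signal plus noise
X = rng.uniform(-2, 2, size=(200, 1))
y = 3 * X[:, 0] + rng.normal(0, 1, 200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

simple = LinearRegression().fit(X_tr, y_tr)
flexible = make_pipeline(PolynomialFeatures(degree=10),
                         LinearRegression()).fit(X_tr, y_tr)

# R-squared in training (flexible model wins by construction) ...
r2_train = (simple.score(X_tr, y_tr), flexible.score(X_tr, y_tr))
# ... vs. R-squared on the held-out half, where the extra fit is spurious
r2_test = (simple.score(X_te, y_te), flexible.score(X_te, y_te))
```

Because the degree-10 features nest the linear ones, the flexible model's training R-squared can never be lower; whether that carries over to the test half is exactly what overfitting destroys.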






Monday, March 21, 2022

Can you Deep Learn the Stock Market?

You can read the complete study at the following links:  

 

DNN Stock Market Study at SlidesFinder 

DNN Stock Market Study at Slideshare 

 

 

Objectives:

We will test whether: 

 

a) Sequential Deep Neural Networks (DNNs) can predict the stock market better than OLS regression;

b) DNNs using smooth Rectified Linear Unit activation functions perform better than the ones using Sigmoid (Logit) activation functions. 
 

Data:

Quarterly data from 1959 Q2 to 2021 Q3.  All variables are fully detrended as quarterly % change, or first differenced in % for interest rate variables.  The models use standardized variables.  Predictions are converted back into quarterly % change.

 

Data sources are FRED for the economic variables, and the Federal Reserve H.15 release for interest rates.

 

Software used for DNNs.

The R neuralnet package, with a customized function inserted to implement the smooth ReLU (SoftPlus) activation function.

 

The variables within the underlying OLS Regression models are shown within the table below: 

 


Consumer Sentiment is by far the most predominant variable.  This is supported by the behavioral finance (Richard Thaler) literature.  

 

Housing Start is supported by the research of Edward E. Leamer, who argues that the housing sector is a leading indicator of overall economic activity, which in turn impacts the stock market.

 

Next, the Yield Curve (5 Year Treasury minus Fed Funds) and economic activity (RGDP growth) are well-established exogenous variables that influence the stock market.  Neither is quite statistically significant, and their influence is much smaller than that of the first two variables.  Nevertheless, they add explanatory logic to our OLS regression fitting the S&P 500.

 

The above were the best variables we could select from a wide pool including numerous other macroeconomic variables (CPI, PPI, unemployment rate, etc.), interest rates, interest rate spreads, and fiscal and monetary policy (including QE) variables.

 

Next, let's quickly discuss the activation functions of hidden layers within sequential Deep Neural Network (DNN) models.  Until 2017 or so, the preferred activation function was the Sigmoid (logistic) function, essentially the same function used in a Logit regression.

 


There is nothing wrong with the Sigmoid function per se.  The problem occurs when you take its first derivative, whose values are small (at most 0.25, reached at the midpoint of the curve).  In iterative DNN models, the output of one hidden layer becomes the input of the next.  During backpropagation, these small derivatives get multiplied together across layers, which can produce gradients that converge toward zero.  This is called the "vanishing gradient" problem.

 

Over the past few years, the Rectified Linear Unit, called ReLU, has become the most prevalent activation function for hidden layers.  We will argue that the smooth ReLU, also called SoftPlus, is actually superior to ReLU.

 

 

SoftPlus appears superior to ReLU because it retains the weights of many more neurons' features, as it does not zero out features with input values < 0.  Also, its derivative takes a continuous range of values between 0 and 1, whereas the ReLU derivative is limited to a binary outcome (0 or 1).
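The contrast between the two activations is easy to see numerically.  A small sketch in plain NumPy, matching the standard definitions of the two functions:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)          # zeroes out every negative input

def softplus(x):
    return np.log1p(np.exp(x))         # smooth version: log(1 + e^x)

def softplus_grad(x):
    return 1.0 / (1.0 + np.exp(-x))    # derivative is the sigmoid, in (0, 1)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
# ReLU's derivative is 0 for x < 0 and 1 for x > 0; SoftPlus's derivative
# varies continuously, so negative inputs still contribute small gradients.
```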

 

Here is a picture of our DNN structure. 

 

One input layer with 4 independent variables: Consumer Sentiment, Housing Start, Yield Curve, and RGDP. 
 
Two hidden layers.  The first one with 3 nodes, and the second one with 2 nodes.  The activation function for the two hidden layers is SoftPlus for the 1st DNN model, and Sigmoid for the second one.
 
One output variable, with one node, the dependent variable, the S&P 500 quarterly % change.  The output layer has a linear activation function. 
 
The DNN loss function is minimizing the sum of the square errors (SSE).  Same as for OLS.  
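For readers who want to poke at a similar structure, here is a hedged sketch in Python with scikit-learn rather than the author's R neuralnet setup.  sklearn's MLPRegressor has no SoftPlus option, so this reproduces only the Sigmoid variant, on invented data:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(7)
# Invented standardized quarterly inputs standing in for Consumer
# Sentiment, Housing Start, Yield Curve, and RGDP (4 features)
X = rng.normal(size=(250, 4))
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.5, 250)  # toy S&P target

# Two hidden layers (3, 2) with sigmoid activations, a linear output,
# and a squared-error loss -- mirroring the structure listed above.
dnn = MLPRegressor(hidden_layer_sizes=(3, 2), activation="logistic",
                   solver="lbfgs", max_iter=5000, random_state=0).fit(X, y)
pred = dnn.predict(X)
```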
 

The balance of the DNN structure is appropriate.  It is recommended that the hidden layers have fewer nodes than the input layer, and more nodes than the output layer.  Given that, the choice of nodes at each layer is just about predetermined.  More extensive DNNs would not have worked anyway, because the DNNs, as structured, already had trouble converging toward a solution given an acceptable error threshold.

 

As expected, the DNN models fit the complete historical data much better than the OLS Regression.

 

 

As seen above, despite the mentioned limitation of the Sigmoid function, the two DNN models' (SoftPlus vs. Sigmoid) relative performances are indistinguishable.  And, they are both better than the OLS Regression.

 

But, fitting historical data and predicting on an out-of-sample (Hold Out) test basis are two completely different hurdles.  Fitting historical data is a lot easier than forecasting.

 

We will use three different Test periods as shown in the table below:

 

 

Each testing period is 12 quarters long, and each is a true Hold Out or out-of-sample test.  The training data consists of all the earlier data, from 1959 Q2 up to the onset of the Hold Out period.  Thus, for the Dot.com period, the training data runs from 1959 Q2 to 2000 Q1.

 
The quarters highlighted in orange denote recessions.  We call the three periods the Dot.com, Great Recession, and COVID periods, as each covers the mentioned event.
 
To visualize the models' respective prediction performance, we will use "skylines."  The column graph below looks like a set of skylines, with vertical buildings for positive values and their reflection in water for negative values.  Within the complete linked study, we show several other ways to convey the forecasting performance that you may prefer.
 
 
As shown above, all the models' predictions are really pretty dismal.  None of the models predicted the protracted 3-year Bear market associated with the Dot.com bubble.  At the margin, the OLS model actually performed a bit better than the DNN models.

Now, let's look at the Great Recession period.  In this situation, the models did better.  However, their overall predicting performance was nothing to write home about.  All models completely missed the severe market correction in the third year of the Great Recession period.  And, again, the DNN models did not perform any better than the OLS Regression.

 
When focusing on the COVID period, the ongoing mediocrity (at best) of the models' prediction performance is readily apparent.  All models completely missed the robust Bull market in the third year of the COVID period (as defined).  Again, the DNN models did not fare any better than the simpler OLS Regression.
 
If we look at the average predictions of all three models over all three testing periods, we can get a quick snapshot of the competitiveness of the models.
 

Without getting bogged down in attempting to fine-tune model rankings between these three models, we can still derive two takeaways.

The first is that the Sigmoid "vanishing gradient" issue did not materialize.  As shown, the Sigmoid DNN model was actually associated with greater volatility in average S&P 500 quarterly % change than the SoftPlus DNN model.

The second is that the DNN models did not provide any incremental prediction benefit over the simpler OLS Regression.

So, why did all the models, regardless of their sophistication, pretty much fail in their respective predictions? 

It is for a very simple reason: the relationships between the X variables and Y are very unstable.  The table below shows the correlations between these variables during the Training and Testing periods.  As shown, many of the correlations are very different between the two.  At times, they even flip signs (check out the correlations with the Yield Curve (t5_ff)).
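This kind of instability is easy to illustrate: if the sign of the X-to-Y relationship flips between the training and testing windows, no model fit on the training window, however sophisticated, can predict the testing window.  A toy sketch with invented data:

```python
import numpy as np

rng = np.random.default_rng(3)
# Invented regimes where the X-to-Y relationship flips sign between
# the training window and the testing window
x_tr = rng.normal(size=100)
y_tr = 0.6 * x_tr + rng.normal(0, 0.5, 100)   # positive relationship in-sample
x_te = rng.normal(size=100)
y_te = -0.6 * x_te + rng.normal(0, 0.5, 100)  # sign flips out-of-sample

r_train = np.corrcoef(x_tr, y_tr)[0, 1]  # strongly positive
r_test = np.corrcoef(x_te, y_te)[0, 1]   # strongly negative
```

Any model trained on the first window will confidently predict the wrong direction on the second, no matter how well it fit historically.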


The models' failed predictions are especially humbling when you consider that the mentioned 3-year Hold Out tests still presumed you had perfect information about the four X variables over the next 3 years.  As we know, this is not a realistic assumption.


 
 
 

 

 
