An example of using shifts for linear regression in Python. How to find the optimal correlation shift?
What is this correlation shift?
Supervised machine learning has two main directions: classification and regression. Regression needs continuous values, so from time to time we are forced to transform discrete data into continuous values.
More importantly, we have to find a linear correlation between the independent variables and the dependent variable that represents the result.
How to find correlation?
In the natural environment, everything is correlated with everything else. Rain causes the level of a lake to rise. Hot sun causes the level of the lake to fall. These are obvious examples of linear correlation.
But a simple correlation can be insufficient to observe it.
The problem is the shift. Rain contributes to rising water in rivers, but the rise appears a couple of hours later. The sun makes the water level go down after a couple of days. Frankly speaking, most correlations in any environment have longer or shorter delays.
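To make the effect of a delay concrete, here is a minimal sketch on made-up data. The names `rain`, `river` and `delay`, and the 5-row delay itself, are my assumptions for illustration and are not part of the dataset used later.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)            # fixed seed so the run is repeatable
rain = pd.Series(rng.normal(size=200))    # hypothetical rainfall readings
delay = 5                                 # assumed delay, counted in rows
river = 2.0 * rain.shift(delay) + 1.0     # the river reacts `delay` rows later

plain = rain.corr(river)                  # correlation on the same rows
aligned = rain.shift(delay).corr(river)   # correlation after aligning the delay
print(round(plain, 3), round(aligned, 3))
```

The relationship is exactly linear, yet the same-row correlation is weak. It only becomes visible once the delay is taken into account.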
How to find correlation shift?
from scipy import signal, fftpack
import pandas as pd
import numpy
Let’s build this dataframe.
AAA = [295, 370, 310, 385, 325, 400, 340, 415, 355, 430, 370, 175, 250,
190, 265, 205, 280, 220, 295, 235, 310, 250, 325, 265, 340, 280,
355, 295, 370, 310, 385, 325, 400, 340, 415, 355, 430, 370, 445,
385, 460, 400, 475, 415, 490, 430, 175, 250, 190, 265, 205, 280,
220, 295, 235, 310, 250, 325, 265, 340, 280, 355, 295, 370, 310,
385, 325, 400, 340, 415, 355, 430, 370, 445, 385, 460, 400, 475,
415, 490, 430, 505, 445, 175, 250, 190, 265, 205, 280, 220, 295,
235, 310, 250, 325, 265, 340, 280, 355]
BBB = [123, 221, 113, 105, 150, 114, 159, 123, 168, 132, 177, 141, 186,
150, 195, 159, 204, 168, 213, 177, 222, 186, 231, 195, 240, 204,
249, 213, 258, 222, 267, 231, 276, 240, 285, 249, 294, 258, 105,
150, 114, 159, 123, 168, 132, 177, 141, 186, 150, 195, 159, 204,
168, 213, 177, 222, 186, 231, 195, 240, 204, 249, 213, 258, 222,
267, 231, 276, 240, 285, 249, 294, 258, 303, 267, 105, 150, 114,
159, 123, 168, 132, 177, 141, 186, 150, 195, 159, 204, 168, 213,
177, 222, 186, 231, 195, 240, 204, 249]
CCC = [124, 154, 130, 160, 136, 166, 142, 172, 148, 70, 100, 76, 106,
82, 112, 88, 118, 94, 124, 100, 130, 106, 136, 112, 142, 118,
148, 124, 154, 130, 160, 136, 166, 142, 172, 148, 178, 154, 184,
160, 190, 166, 196, 172, 70, 100, 76, 106, 82, 112, 88, 118,
94, 124, 100, 130, 106, 136, 112, 142, 118, 148, 124, 154, 130,
160, 136, 166, 142, 172, 148, 178, 154, 184, 160, 190, 166, 196,
172, 202, 178, 70, 100, 76, 106, 82, 112, 88, 118, 94, 124,
100, 130, 106, 136, 112, 142, 118, 148]
DDD = [ 437, 453, 764, 346, 239, 420, 600, 456, 636, 492, 672,
528, 708, 564, 744, 600, 780, 636, 816, 672, 852, 708,
888, 744, 924, 780, 960, 816, 996, 852, 1032, 888, 1068,
924, 1104, 960, 1140, 996, 1176, 1032, 420, 600, 456, 636,
492, 672, 528, 708, 564, 744, 600, 780, 636, 816, 672,
852, 708, 888, 744, 924, 780, 960, 816, 996, 852, 1032,
888, 1068, 924, 1104, 960, 1140, 996, 1176, 1032, 1212, 1068,
420, 600, 456, 636, 492, 672, 528, 708, 564, 744, 600,
780, 636, 816, 672, 852, 708, 888, 744, 924, 780, 960]
RESULT = [ 35, 50, 38, 53, 41, 56, 44, 59, 47, 62, 50, 65, 53,
68, 56, 71, 59, 74, 62, 77, 65, 80, 68, 83, 71, 86,
74, 89, 77, 92, 80, 95, 83, 98, 86, 35, 50, 38, 53,
41, 56, 44, 59, 47, 62, 50, 65, 53, 68, 56, 71, 59,
74, 62, 77, 65, 80, 68, 83, 71, 86, 74, 89, 77, 92,
80, 95, 83, 98, 86, 101, 89, 35, 50, 38, 53, 41, 56,
44, 59, 47, 62, 50, 65, 53, 68, 56, 71, 59, 74, 62,
77, 65, 80, 68, 83, 71, 86, 74]
df = pd.DataFrame({'AAA': AAA, 'BBB': BBB,'CCC':CCC,'DDD':DDD, 'RESULT':RESULT})
df.head()
The phenomena described in the DataFrame are perfectly correlated, but we don’t know that yet. First we use the ordinary method of looking for correlation.
corr = df.corr()
corr
corr['RESULT']
Is that all? Is this the entire correlation available for linear regression? How do we find the correlation delay?
Function to find optimal correlation shift
I wrote a special function that detects the shift giving the maximal linear correlation (in absolute value) between the dependent and independent variables.
def cross_corr(x, y, lag=0):
    return x.corr(y.shift(lag))
def shift_Factor(x, y, R):
    # R is the number of shifts (0 .. R-1) that the function checks
    x_corr = [cross_corr(x, y, lag=i) for i in range(R)]
    Kot = pd.DataFrame(list(x_corr)).reset_index()
    Kot.rename(columns={0: 'Corr', 'index': 'Shift_num'}, inplace=True)
    # Find the shift with the largest absolute correlation
    Kot['abs'] = Kot['Corr'].abs()
    SF = Kot.loc[Kot['abs'] == Kot['abs'].max(), 'Shift_num']
    p1 = SF.to_frame()
    SF = p1.Shift_num.max()
    return SF
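For comparison, the same search can be sketched more compactly. This is only an illustration: `best_shift` and the toy series are hypothetical names and data of mine. It also differs in one detail: `idxmax` returns the first lag when several lags tie, while the function above returns the largest one.

```python
import pandas as pd

def best_shift(x, y, max_shift):
    """Return (lag, correlation) with the largest |correlation| for lags 0..max_shift-1."""
    corrs = pd.Series({lag: x.corr(y.shift(lag)) for lag in range(max_shift)})
    lag = corrs.abs().idxmax()            # first lag with the largest absolute value
    return int(lag), float(corrs[lag])

# Toy data: x repeats y three rows later (scaled and offset)
y = pd.Series([1.0, 4.0, 2.0, 8.0, 5.0, 7.0, 3.0, 9.0, 6.0, 10.0, 0.0, 11.0])
x = 2.0 * y.shift(3) + 1.0

lag, corr = best_shift(x, y, 6)
print(lag, round(corr, 6))
```

Returning the correlation together with the lag saves a second call to `cross_corr` when we want to know how strong the best match is.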
We declare the variables for the function.
x = df.AAA # independent variable
y = df.RESULT # dependent variable
R = 20 # number of shifts that will be checked
The shift for variable AAA
We are looking for an optimal correlation shift in variable AAA.
SKO = shift_Factor(x,y,R)
print('Optimal shift for AAA: ',SKO)
The largest correlation (in absolute value) between the independent variable AAA and the RESULT variable occurs at a shift of 11 rows. What is the level of this correlation?
cross_corr(x, y, lag=SKO)
Out[8]:
0.9999999999999996
We create a new DataFrame with the optimal shift.
def df_shif(df, target=None, lag=0):
    # Keep `target` in place and shift every other column down by `lag` rows
    if not lag and not target:
        return df
    new = {}
    for h in df.columns:
        if h == target:
            new[h] = df[target]
        else:
            new[h] = df[h].shift(periods=lag)
    return pd.DataFrame(data=new)
df2 = df_shif(df, 'AAA', lag=SKO)
df2.rename(columns={'AAA':'SHIFTED AAA'}, inplace=True)
df2.head(13)
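The function above keeps the target column in place and moves all the other columns down. As far as I can tell, the same relative alignment can be reached by pulling only the target column up by the same number of rows. The sketch below shows this on a tiny made-up frame; `align_column` and `toy` are hypothetical names, not part of the original code.

```python
import pandas as pd

def align_column(df, column, lag):
    """Hypothetical helper: pull `column` up by `lag` rows, leave the rest untouched."""
    out = df.copy()                        # do not modify the caller's frame
    out[column] = out[column].shift(-lag)  # a negative shift moves values upward
    return out

# X follows RESULT with a delay of two rows (X is 10 * RESULT, two rows later)
toy = pd.DataFrame({'X': [0, 0, 10, 20, 30, 40],
                    'RESULT': [1, 2, 3, 4, 5, 6]})

aligned = align_column(toy, 'X', 2)
corr = aligned.dropna().corr().loc['X', 'RESULT']
print(aligned)
print(corr)
```

The difference is where the NaN values end up: at the bottom of the single shifted column here, and at the top of all the other columns in the function above.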
Now we repeat the same steps for the remaining independent variables.
The shift for variable BBB
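BBB = df.BBB # independent variable
SKS = shift_Factor(BBB,y,R)
print('Optimal shift for BBB: ',SKS)
df3 = df_shif(df2, 'BBB', lag=SKS)
df3.rename(columns={'BBB':'SHIFTED BBB'}, inplace=True)
The shift for variable CCC
CCC = df.CCC # independent variable
SKK = shift_Factor(CCC,y,R)
print('Optimal shift for CCC: ',SKK)
df4 = df_shif(df3, 'CCC', lag=SKK)
df4.rename(columns={'CCC':'SHIFTED CCC'}, inplace=True)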
The shift for variable DDD
DDD = df.DDD
PKP = shift_Factor(DDD,y,R)
print('Optimal shift for DDD: ',PKP)
df5 = df_shif(df4, 'DDD', lag=PKP)
df5.rename(columns={'DDD':'SHIFTED DDD'}, inplace=True)
Correlation after making the shifts
I drop the rows of the DataFrame that contain NaN values and calculate the correlation.
df5 = df5.dropna(how='any')
df5.head(3)
corr = df5.corr()
corr
corr['RESULT']
As we can see, the independent variables are perfectly correlated with the result variable. This was hidden because of the shifts. I hope I have convinced you that researchers should make a habit of checking for shifts when building a model.
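To close the loop with linear regression, here is a minimal sketch of fitting a model on aligned columns. It uses made-up data and plain NumPy least squares. The delays (2 and 1 rows) and the coefficients are my assumptions for illustration, not results from the dataset above.

```python
import numpy as np
import pandas as pd

# Toy data: y depends on `a` with a 2-row delay and on `b` with a 1-row delay
rng = np.random.default_rng(1)
a = pd.Series(rng.normal(size=60))
b = pd.Series(rng.normal(size=60))
y = 3.0 * a.shift(2) - 2.0 * b.shift(1) + 5.0

# Align each feature by its own delay, then drop the rows that contain NaN
frame = pd.DataFrame({'a': a.shift(2), 'b': b.shift(1), 'y': y}).dropna()

# Ordinary least squares with an intercept column
X = np.column_stack([frame['a'], frame['b'], np.ones(len(frame))])
coef, *_ = np.linalg.lstsq(X, frame['y'].to_numpy(), rcond=None)
print(np.round(coef, 3))   # slope for a, slope for b, intercept
```

Once every feature has its own shift applied, the regression recovers the coefficients that generated the data.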