Introduction
This article is an extension of Farmer, Rohit. 2023. “Linear Regression for Inferential and Predictive Modeling.” July 13, 2023. Please check out the parent article for the theoretical background.
Import Packages
import Pkg
Pkg.activate(".")

using IJulia
using PalmerPenguins
using DataFrames
using Statistics
using Plots
using Pipe
using LsqFit
using GLM
using StatsPlots
using StatsBase
using Distributions # explicit import for Normal(), used in the qqplot below
Activating new project at `~/sandbox/dataalltheway/posts/013-01-linear-regression-julia`
Getting the data
Let us load the data and drop any rows with missing values.
ENV["DATADEPS_ALWAYS_ACCEPT"] = "true" # needed for downloading the data without a confirmation prompt
data = dropmissing(DataFrame(PalmerPenguins.load()))
333×7 DataFrame (308 rows omitted)

 Row │ species    island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
─────┼──────────────────────────────────────────────────────────────────────────────────────────
   1 │ Adelie     Torgersen  39.1            18.7           181                3750         male
   2 │ Adelie     Torgersen  39.5            17.4           186                3800         female
   3 │ Adelie     Torgersen  40.3            18.0           195                3250         female
   4 │ Adelie     Torgersen  36.7            19.3           193                3450         female
   5 │ Adelie     Torgersen  39.3            20.6           190                3650         male
   6 │ Adelie     Torgersen  38.9            17.8           181                3625         female
   7 │ Adelie     Torgersen  39.2            19.6           195                4675         male
   8 │ Adelie     Torgersen  41.1            17.6           182                3200         female
   9 │ Adelie     Torgersen  38.6            21.2           191                3800         male
  10 │ Adelie     Torgersen  34.6            21.1           198                4400         male
  11 │ Adelie     Torgersen  36.6            17.8           185                3700         female
  12 │ Adelie     Torgersen  38.7            19.0           195                3450         female
  13 │ Adelie     Torgersen  42.5            20.7           197                4500         male
  ⋮  │     ⋮          ⋮            ⋮               ⋮               ⋮               ⋮          ⋮
 322 │ Chinstrap  Dream      45.2            16.6           191                3250         female
 323 │ Chinstrap  Dream      49.3            19.9           203                4050         male
 324 │ Chinstrap  Dream      50.2            18.8           202                3800         male
 325 │ Chinstrap  Dream      45.6            19.4           194                3525         female
 326 │ Chinstrap  Dream      51.9            19.5           206                3950         male
 327 │ Chinstrap  Dream      46.8            16.5           189                3650         female
 328 │ Chinstrap  Dream      45.7            17.0           195                3650         female
 329 │ Chinstrap  Dream      55.8            19.8           207                4000         male
 330 │ Chinstrap  Dream      43.5            18.1           202                3400         female
 331 │ Chinstrap  Dream      49.6            18.2           193                3775         male
 332 │ Chinstrap  Dream      50.8            19.0           210                4100         male
 333 │ Chinstrap  Dream      50.2            18.7           198                3775         female
The data as loaded already excludes the year column, so we don’t need to change anything. Here are the first 20 rows:
 Row │ species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
─────┼────────────────────────────────────────────────────────────────────────────────────────
   1 │ Adelie   Torgersen  39.1            18.7           181                3750         male
   2 │ Adelie   Torgersen  39.5            17.4           186                3800         female
   3 │ Adelie   Torgersen  40.3            18.0           195                3250         female
   4 │ Adelie   Torgersen  36.7            19.3           193                3450         female
   5 │ Adelie   Torgersen  39.3            20.6           190                3650         male
   6 │ Adelie   Torgersen  38.9            17.8           181                3625         female
   7 │ Adelie   Torgersen  39.2            19.6           195                4675         male
   8 │ Adelie   Torgersen  41.1            17.6           182                3200         female
   9 │ Adelie   Torgersen  38.6            21.2           191                3800         male
  10 │ Adelie   Torgersen  34.6            21.1           198                4400         male
  11 │ Adelie   Torgersen  36.6            17.8           185                3700         female
  12 │ Adelie   Torgersen  38.7            19.0           195                3450         female
  13 │ Adelie   Torgersen  42.5            20.7           197                4500         male
  14 │ Adelie   Torgersen  34.4            18.4           184                3325         female
  15 │ Adelie   Torgersen  46.0            21.5           194                4200         male
  16 │ Adelie   Biscoe     37.8            18.3           174                3400         female
  17 │ Adelie   Biscoe     37.7            18.7           180                3600         male
  18 │ Adelie   Biscoe     35.9            19.2           189                3800         female
  19 │ Adelie   Biscoe     38.2            18.1           185                3950         male
  20 │ Adelie   Biscoe     38.8            17.2           180                3800         male
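Had a year column been present, as in the parent article’s dataset, it could have been dropped with DataFrames’ select. A minimal sketch (the :year name is illustrative here, since this table has no such column):

select(data, Not(:year)) # Not() inverts a column selection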
Exploratory Plots
We groupby species and sex, then combine each group down to its number of rows. Finally, we keep the unique rows.
g_data = @pipe data |>
    groupby(_, [:species, :sex]) |>
    combine(_, :sex => length => :counts) |>
    unique

labels = map(zip(g_data[!, :species], g_data[!, :sex])) do (i, j)
    "$(i)-$(j)"
end

plot(labels, g_data[!, :counts], st=:bar, title="Counts", label=false)
@. linear_model(x, p) = p[1] * x + p[2]
linear_model (generic function with 1 method)
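The @. macro broadcasts the right-hand side, so the model evaluates elementwise over a whole vector of x values at once; this two-argument vectorized form is exactly what LsqFit’s curve_fit expects. For example, with illustrative parameters (slope 3, intercept 4):

linear_model([1.0, 2.0], [3.0, 4.0]) # returns [7.0, 10.0]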
function plot_corrs(data, variable)
    p = plot(title=variable)
    ds1 = @pipe data |>
        _[!, [:species, variable, :body_mass_g]] |>
        groupby(_, :species)
    for sdf in ds1
        # one color per species; each species adds two series (scatter + fit line)
        color = palette(:auto)[length(p.series_list) ÷ 2 + 1]
        scatter!(sdf[!, variable], sdf[!, :body_mass_g],
            label=sdf[!, :species][1],
            markersize=2,
            color=color,
        )
        # least-squares fit of the linear model for this species
        fit = curve_fit(
            linear_model,
            sdf[!, variable], sdf[!, :body_mass_g], [0.0, 0.0],
        )
        plot!(
            sdf[!, variable],
            linear_model(sdf[!, variable], fit.param),
            label=sdf[!, :species][1],
            lw=2,
            color=color,
        )
    end
    return p
end
plot_corrs (generic function with 1 method)
plot(
    plot_corrs(data, :bill_depth_mm),
    plot_corrs(data, :bill_length_mm),
    plot_corrs(data, :flipper_length_mm),
    layout=(1, 3),
    size=(1300, 400),
    ylabel="body_mass_g",
)
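To attach numbers to these visual trends, one can also compute per-species Pearson correlations with Statistics’ cor. A small sketch (flipper length shown here, but any of the three predictors works the same way):

for sdf in groupby(data, :species)
    # correlation between the predictor and body mass within one species
    r = cor(sdf[!, :flipper_length_mm], sdf[!, :body_mass_g])
    println(sdf[1, :species], ": r = ", round(r, digits=3))
end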
First Model
fit_model_1 = lm(
    @formula(body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm + sex),
    data,
)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
body_mass_g ~ 1 + bill_depth_mm + bill_length_mm + flipper_length_mm + sex
Coefficients:
────────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────────────────
(Intercept) -2288.46 631.58 -3.62 0.0003 -3530.92 -1046.01
bill_depth_mm -86.0882 15.5698 -5.53 <1e-07 -116.718 -55.4589
bill_length_mm -2.32868 4.6843 -0.50 0.6194 -11.5437 6.88638
flipper_length_mm 38.8258 2.44776 15.86 <1e-41 34.0105 43.6411
sex: male 541.029 51.7098 10.46 <1e-21 439.304 642.753
────────────────────────────────────────────────────────────────────────────────────
We can also run an F-test against the null model.
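In GLM.jl, this is ftest applied to the underlying fitted model, using the same .model access as in the two-model comparison later:

ftest(fit_model_1.model)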
F-test against the null model:
F-statistic: 381.30 on 333 observations and 4 degrees of freedom, p-value: <1e-99
Diagnostic Plots
x = predict(fit_model_1)
y = residuals(fit_model_1)
ystd = (y .- mean(y)) / std(y);

scatter(x, y, title="Residuals vs Prediction", label=false)

qqplot(Normal(), ystd, title="qqplot of the standardized residuals")

scatter(x, sqrt.(abs.(ystd)), ylabel="√(Std Residuals)", xlabel="Prediction", label=false)
cdist = cooksdistance(fit_model_1)

plot(cdist, st=:sticks, title="Cook's Distance", label="")
high_dist = findall(>(0.02), cdist)
high_dist_annots = map(high_dist) do i
    # annotate each flagged point with its row index, slightly above the stick
    (i, cdist[i] + 0.002, text(string(i), 8))
end
annotate!(high_dist_annots)
hline!([0.02], linestyle=:dash, label=false)
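The 0.02 cutoff above is eyeballed from the plot. A common rule of thumb instead flags observations with Cook’s distance above 4/n (a convention, not something the original analysis uses):

threshold = 4 / nrow(data) # ≈ 0.012 for 333 observations
findall(>(threshold), cdist)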
The leverage function is not implemented in the GLM.jl package yet, but there is a draft PR.
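In the meantime, leverage (the diagonal of the hat matrix) can be computed by hand from the model matrix; a minimal sketch using the standard identity H = X(XᵀX)⁻¹Xᵀ:

using LinearAlgebra

X = modelmatrix(fit_model_1)      # design matrix of the fitted model
lev = diag(X * inv(X' * X) * X')  # hat-matrix diagonal = leverage per observation
plot(lev, st=:sticks, title="Leverage", label="")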
Second Model
fit_model_2 = lm(
    @formula(body_mass_g ~ species + island + bill_depth_mm + bill_length_mm + flipper_length_mm + sex),
    data,
)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
body_mass_g ~ 1 + species + island + bill_depth_mm + bill_length_mm + flipper_length_mm + sex
Coefficients:
──────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────────────────
(Intercept) -1500.03 575.822 -2.61 0.0096 -2632.85 -367.207
species: Chinstrap -260.306 88.5506 -2.94 0.0035 -434.513 -86.0996
species: Gentoo 987.761 137.238 7.20 <1e-11 717.771 1257.75
island: Dream -13.1031 58.5407 -0.22 0.8230 -128.271 102.065
island: Torgersen -48.0636 60.9215 -0.79 0.4307 -167.915 71.7881
bill_depth_mm 67.5754 19.8213 3.41 0.0007 28.5808 106.57
bill_length_mm 18.1893 7.13639 2.55 0.0113 4.1498 32.2288
flipper_length_mm 16.2385 2.93946 5.52 <1e-07 10.4557 22.0213
sex: male 387.224 48.1382 8.04 <1e-13 292.521 481.927
──────────────────────────────────────────────────────────────────────────────────
Comparing the two models
GLM.jl provides an interface to run an F-test to check whether one model fits significantly better than the other.
ftest(fit_model_1.model, fit_model_2.model)
F-test: 2 models fitted on 333 observations
──────────────────────────────────────────────────────────────────────────────
DOF ΔDOF SSR ΔSSR R² ΔR² F* p(>F)
──────────────────────────────────────────────────────────────────────────────
[1] 6 38099290.7809 0.8230
[2] 10 4 26859432.2504 -11239858.5304 0.8752 0.0522 33.8960 <1e-22
──────────────────────────────────────────────────────────────────────────────
Interpreting this test is left as an exercise to the reader; refer to the previous blog post on Parametric tests.
Using the model for prediction
This is also left as an exercise to the reader. You can partition the data into training and test sets using the following function:
using Random

function partition_data(data, train_frac=0.7)
    n = nrow(data)
    idx = shuffle(1:n)
    train_idx = @view idx[1:floor(Int, train_frac * n)]
    test_idx = @view idx[(floor(Int, train_frac * n) + 1):n]
    data[train_idx, :], data[test_idx, :]
end
partition_data (generic function with 2 methods)
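For instance, one might refit the first model on the training split and score it on the held-out split; a sketch (the RMSE metric here is an illustration, not part of the original exercise):

train, test = partition_data(data)
m = lm(@formula(body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm + sex), train)
pred = predict(m, test)
# root-mean-squared error on the held-out rows
rmse = sqrt(mean(abs2, skipmissing(test.body_mass_g .- pred)))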
Citation
BibTeX citation:
@online{sambrani2023,
author = {Sambrani, Dhruva},
title = {Linear Regression for Inferential and Predictive Modeling
with Examples in {Julia}},
date = {2023-08-15},
url = {https://dataalltheway.com/posts/013-01-linear-regression-julia},
langid = {en}
}
For attribution, please cite this work as:
Sambrani, Dhruva. 2023. “Linear Regression for Inferential and Predictive Modeling with Examples in Julia.” August 15, 2023. https://dataalltheway.com/posts/013-01-linear-regression-julia.