
2023-01-09 First draft

This article is an extension of Rohit Farmer. 2022. “Parametric Hypothesis Tests with Examples in R.” November 10, 2022. Please check out the parent article for the theoretical background.

- Z-test (Section 2)
- T-test (Section 3)
- ANOVA (Section 4)

```
import numpy as np
from scipy import stats
import pandas as pd
dat = pd.read_csv("https://raw.githubusercontent.com/opencasestudies/ocs-bp-rural-and-urban-obesity/master/data/wrangled/BMI_long.csv")
```

```
from statsmodels.stats.weightstats import ztest as ztest
import random
mask1 = (dat['Sex'] == "Women") & (dat['Year'] == 1985)
x1 = dat[mask1]['BMI'].dropna()
x1 = random.sample(x1.tolist(), k = 300)  # random subsample, so results vary between runs
mask2 = (dat['Sex'] == "Women") & (dat['Year'] == 2017)
x2 = dat[mask2]['BMI'].dropna()
x2 = random.sample(x2.tolist(), k = 300)
z_statistics, p_value = ztest(x1, x2, value=0)
print("z-statistic:", z_statistics)
print("p-value:", p_value)
```

```
z-statistic: -9.201889936608346
p-value: 3.517084717411295e-20
```
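Under the hood, the two-sample z-statistic is simply the difference in sample means divided by its standard error. The sketch below computes one common (unpooled) form by hand on synthetic BMI-like data, not the dataset above; note that `statsmodels`' `ztest` pools the variances by default (`usevar="pooled"`), so values can differ slightly.

```
import numpy as np

def two_sample_z(x1, x2):
    """Unpooled two-sample z-statistic: difference in means over its standard error."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    se = np.sqrt(x1.var(ddof=1) / len(x1) + x2.var(ddof=1) / len(x2))
    return (x1.mean() - x2.mean()) / se

rng = np.random.default_rng(0)
a = rng.normal(24.0, 2.0, 300)  # illustrative stand-in for the 1985 sample
b = rng.normal(25.5, 2.0, 300)  # illustrative stand-in for the 2017 sample
print("z-statistic:", two_sample_z(a, b))
```

With 300 observations per group and a mean shift of 1.5 BMI units, the statistic comes out strongly negative, matching the sign of the result above.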

```
mask1 = (dat['Sex'] == "Women") & (dat['Region'] == "Rural") & (dat['Year'] == 1985)
x1 = dat[mask1]['BMI']
mask2 = (dat['Sex'] == "Women") & (dat['Region'] == "Urban") & (dat['Year'] == 1985)
x2 = dat[mask2]['BMI']
t_statistic, p_value = stats.ttest_ind(x1, x2, equal_var = True, nan_policy = "omit")
print("t-statistic:", t_statistic)
print("p-value:", p_value)
```

```
t-statistic: -3.8952336023562912
p-value: 0.00011523146459551333
```

```
t_statistic, p_value = stats.ttest_ind(x1, x2, equal_var = True, nan_policy = "omit", alternative = "greater")
print("t-statistic:", t_statistic)
print("p-value:", p_value)
```

```
t-statistic: -3.8952336023562912
p-value: 0.9999423842677022
```
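The one-sided p-value above is not an error: with a negative t-statistic, `alternative="greater"` puts nearly all of the probability mass in the opposite tail, and the two p-values are related exactly by p_greater = 1 − p_two-sided/2 whenever t < 0. A quick check on synthetic data:

```
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, 50)
b = rng.normal(0.5, 1.0, 50)  # b tends to be larger, so t is (almost surely) negative

t_two, p_two = stats.ttest_ind(a, b, equal_var=True)
t_gt, p_gt = stats.ttest_ind(a, b, equal_var=True, alternative="greater")

# For t < 0: p_gt == 1 - p_two / 2; for t > 0 it would be p_two / 2.
print("two-sided p:", p_two)
print("greater p:  ", p_gt)
```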

```
t_statistic, p_value = stats.ttest_rel(x1, x2, nan_policy = "omit")
print("t-statistic:", t_statistic)
print("p-value:", p_value)
```

```
t-statistic: -14.095486243034763
p-value: 1.426675846865914e-31
```
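A paired t-test is mathematically the same as a one-sample t-test on the per-pair differences, which is a handy identity when sanity-checking results. A sketch on synthetic paired data (names illustrative):

```
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
before = rng.normal(24.0, 2.0, 40)
after = before + rng.normal(1.0, 0.5, 40)   # second measurement on the same units

t_rel, p_rel = stats.ttest_rel(before, after)
t_one, p_one = stats.ttest_1samp(before - after, popmean=0.0)

print(t_rel, t_one)  # identical statistics
```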

```
mask1 = (dat['Sex'] == "Men") & (dat['Region'] == "Rural") & (dat['Year'] == 2017)
x1 = dat[mask1]['BMI']
mask2 = (dat['Sex'] == "Men") & (dat['Region'] == "Urban") & (dat['Year'] == 2017)
x2 = dat[mask2]['BMI']
mask3 = (dat['Sex'] == "Men") & (dat['Region'] == "National") & (dat['Year'] == 2017)
x3 = dat[mask3]['BMI']
f_value, p_value = stats.f_oneway(x1.dropna(), x2.dropna(), x3.dropna())
print("f-value statistic: ",f_value)
print("p-value: ", p_value)
```

```
f-value statistic: 3.4215235158825905
p-value: 0.033309935710150805
```
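`f_oneway` returns the ratio of the between-group mean square to the within-group mean square; computing it by hand on small illustrative groups makes the definition concrete:

```
import numpy as np
from scipy import stats

groups = [np.array([23.1, 24.0, 22.8, 23.5]),
          np.array([25.2, 26.1, 25.7, 25.0]),
          np.array([24.0, 24.5, 23.9, 24.8])]

grand_mean = np.concatenate(groups).mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
df_between = len(groups) - 1                            # k - 1
df_within = sum(len(g) for g in groups) - len(groups)   # N - k
f_manual = (ss_between / df_between) / (ss_within / df_within)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f_manual, f_scipy)  # the two agree
```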

BibTeX citation:

```
@online{farmer2023,
author = {Rohit Farmer},
title = {Parametric Hypothesis Tests with Examples in {Python}},
date = {2023-01-09},
url = {https://www.dataalltheway.com/posts/010-02-parametric-hypothesis-tests-python},
langid = {en}
}
```

For attribution, please cite this work as:

Rohit Farmer. 2023. “Parametric Hypothesis Tests with Examples in
Python.” January 9, 2023. https://www.dataalltheway.com/posts/010-02-parametric-hypothesis-tests-python.

2022-11-30 First draft

This article is an extension of Farmer. 2022. “Non-Parametric Hypothesis Tests with Examples in R.” November 18, 2022. Please check out the parent article for the theoretical background.

- Wilcoxon rank-sum (Mann-Whitney U test) (Section 3)
- Wilcoxon signed-rank test (Section 4)
- Kruskal-Wallis test (Section 5)

```
import Pkg
Pkg.activate(".")
using CSV
using Plots
using HypothesisTests
using DataFrames
```

```
Activating project at `~/sandbox/dataalltheway/posts/011-01-non-parametric-hypothesis-tests-julia`
```

I have subsetted the data from 1928 onward and dropped any columns with all NAs or zeros. To do so, for `eachcol` of `data` we first calculate whether `all` the elements are `!ismissing` `&&` `!=0` (`!` = `not`). We then pick all rows for those columns, while `disallowmissing`-ing the data.

```
temp_file = download("https://zenodo.org/record/7081360/files/1.%20Cement_emissions_data.csv")
data = CSV.read(temp_file, DataFrame)
dropmissing!(data, :Year)
filter!(:Year => >=(1928), data)
picked_cols_mask = eachcol(data) .|>
col -> all(x->(!ismissing(x) && x!=0), col)
data = disallowmissing(data[!, picked_cols_mask])
```

94×24 DataFrame (69 rows omitted)

Row | Year | Argentina | Australia | Belgium | Brazil | Canada | Chile | China | Democratic Republic of the Congo | Denmark | Egypt | Finland | Italy | Japan | Mozambique | Norway | Peru | Portugal | Romania | Spain | Sweden | Turkey | USA | Global |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
1 | 1928 | 116.3 | 378.0 | 1505.0 | 43.61 | 868.6 | 54.57 | 39.78 | 21.81 | 385.2 | 43.8 | 138.1 | 1519.0 | 1897.0 | 7.27 | 158.3 | 25.42 | 36.34 | 163.6 | 763.2 | 233.2 | 29.07 | 15420.0 | 35616.0 |
2 | 1929 | 174.4 | 356.2 | 1606.0 | 47.25 | 963.0 | 72.62 | 76.51 | 29.07 | 396.1 | 87.22 | 138.1 | 1730.0 | 2112.0 | 10.9 | 159.1 | 25.43 | 43.61 | 156.3 | 901.3 | 284.0 | 32.71 | 15000.0 | 36873.0 |
3 | 1930 | 189.0 | 348.9 | 1508.0 | 43.58 | 926.7 | 79.88 | 73.45 | 32.71 | 385.2 | 148.5 | 101.6 | 1723.0 | 1853.0 | 10.9 | 160.2 | 10.9 | 47.29 | 196.3 | 908.4 | 304.6 | 29.07 | 14290.0 | 35561.0 |
4 | 1931 | 265.3 | 196.3 | 1218.0 | 83.59 | 799.5 | 50.88 | 97.93 | 21.81 | 250.8 | 119.9 | 79.95 | 1519.0 | 1788.0 | 10.9 | 109.7 | 14.54 | 47.25 | 98.14 | 806.8 | 257.9 | 50.83 | 10810.0 | 30931.0 |
5 | 1932 | 247.1 | 123.6 | 1039.0 | 72.69 | 363.4 | 54.51 | 79.57 | 7.27 | 203.6 | 119.9 | 76.23 | 1545.0 | 1843.0 | 10.9 | 117.3 | 10.9 | 58.15 | 105.4 | 705.1 | 241.0 | 54.46 | 6656.0 | 24721.0 |
6 | 1933 | 254.5 | 159.9 | 963.1 | 109.0 | 189.0 | 69.05 | 113.2 | 3.63 | 272.6 | 141.7 | 80.05 | 1756.0 | 2366.0 | 10.18 | 110.8 | 14.53 | 79.95 | 109.0 | 694.1 | 200.7 | 58.15 | 5587.0 | 23866.0 |
7 | 1934 | 279.8 | 207.2 | 937.6 | 159.9 | 272.6 | 101.7 | 97.93 | 3.63 | 381.6 | 145.4 | 112.7 | 2013.0 | 2304.0 | 7.27 | 123.6 | 21.82 | 90.94 | 156.3 | 672.3 | 287.1 | 83.59 | 6900.0 | 28542.0 |
8 | 1935 | 356.3 | 276.2 | 1087.0 | 181.6 | 272.6 | 141.6 | 156.1 | 3.63 | 374.3 | 189.0 | 134.3 | 2086.0 | 2904.0 | 7.27 | 130.8 | 29.07 | 105.4 | 189.0 | 668.7 | 367.1 | 65.42 | 6684.0 | 32090.0 |
9 | 1936 | 410.7 | 323.5 | 1163.0 | 239.7 | 388.9 | 123.5 | 428.5 | 3.63 | 392.4 | 167.2 | 163.5 | 1890.0 | 3082.0 | 7.27 | 149.0 | 36.34 | 119.9 | 185.3 | 297.9 | 392.5 | 69.05 | 9950.0 | 38763.0 |
10 | 1937 | 512.3 | 363.4 | 1486.0 | 283.5 | 483.4 | 156.3 | 437.7 | 3.77 | 334.4 | 163.5 | 203.4 | 2155.0 | 2984.0 | 7.27 | 159.9 | 39.96 | 127.1 | 225.3 | 189.0 | 432.5 | 105.4 | 10370.0 | 40829.0 |
11 | 1938 | 614.3 | 178.1 | 1508.0 | 305.5 | 432.5 | 181.7 | 9.18 | 7.5 | 316.2 | 185.3 | 236.2 | 2279.0 | 2729.0 | 11.99 | 163.6 | 50.86 | 130.8 | 221.7 | 294.4 | 490.6 | 130.9 | 9239.0 | 38551.0 |
12 | 1939 | 556.0 | 338.0 | 1261.0 | 345.5 | 450.7 | 167.2 | 223.4 | 17.44 | 345.3 | 185.0 | 279.6 | 2526.0 | 2508.0 | 14.54 | 192.6 | 58.18 | 145.4 | 261.7 | 588.8 | 585.1 | 141.7 | 10790.0 | 35687.0 |
13 | 1940 | 534.4 | 348.8 | 105.4 | 367.1 | 592.4 | 189.0 | 272.4 | 11.45 | 218.1 | 178.1 | 149.0 | 2373.0 | 2101.0 | 14.54 | 167.2 | 61.78 | 134.5 | 196.3 | 770.5 | 345.3 | 130.9 | 11550.0 | 31431.0 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
83 | 2010 | 4178.0 | 3549.0 | 2582.0 | 21290.0 | 6005.0 | 1046.0 | 639600.0 | 189.0 | 672.2 | 20510.0 | 525.7 | 13280.0 | 24320.0 | 340.9 | 754.0 | 3338.0 | 3376.0 | 2778.0 | 11200.0 | 1324.0 | 29980.0 | 31450.0 | 1.2549e6 |
84 | 2011 | 4586.0 | 3496.0 | 2762.0 | 22840.0 | 6020.0 | 1080.0 | 708600.0 | 175.2 | 861.8 | 20100.0 | 557.8 | 12580.0 | 24980.0 | 373.3 | 749.0 | 3300.0 | 2813.0 | 3089.0 | 9523.0 | 1361.0 | 31450.0 | 32210.0 | 1.3498e6 |
85 | 2012 | 4184.0 | 3518.0 | 2643.0 | 25000.0 | 6532.0 | 1128.0 | 714800.0 | 157.9 | 871.1 | 20970.0 | 497.2 | 10070.0 | 25620.0 | 452.7 | 725.0 | 3731.0 | 2550.0 | 3150.0 | 8754.0 | 1479.0 | 31370.0 | 35270.0 | 1.3846e6 |
86 | 2013 | 4581.0 | 3294.0 | 2541.0 | 26650.0 | 5973.0 | 1086.0 | 748300.0 | 174.0 | 867.1 | 20210.0 | 481.8 | 8877.0 | 26810.0 | 505.6 | 731.0 | 4257.0 | 2814.0 | 2695.0 | 7642.0 | 1402.0 | 33910.0 | 36370.0 | 1.4441e6 |
87 | 2014 | 4336.0 | 3138.0 | 2643.0 | 26910.0 | 5912.0 | 1022.0 | 778600.0 | 127.9 | 887.3 | 20760.0 | 468.8 | 8339.0 | 26560.0 | 585.9 | 727.0 | 4590.0 | 3096.0 | 2944.0 | 8897.0 | 1399.0 | 34500.0 | 39440.0 | 1.4999e6 |
88 | 2015 | 4571.0 | 3076.0 | 2348.0 | 25080.0 | 6185.0 | 1033.0 | 722000.0 | 154.6 | 931.5 | 21650.0 | 462.1 | 8196.0 | 25940.0 | 614.2 | 672.0 | 4476.0 | 2921.0 | 3337.0 | 9216.0 | 1537.0 | 34440.0 | 39910.0 | 1.4444e6 |
89 | 2016 | 4029.0 | 2931.0 | 2436.0 | 22420.0 | 6114.0 | 1120.0 | 743000.0 | 98.04 | 1095.0 | 22820.0 | 553.2 | 7680.0 | 25970.0 | 947.8 | 684.0 | 4340.0 | 2297.0 | 3181.0 | 9414.0 | 1554.0 | 37530.0 | 39440.0 | 1.4876e6 |
90 | 2017 | 4362.0 | 3019.0 | 2291.0 | 19080.0 | 6827.0 | 865.9 | 758200.0 | 348.7 | 1194.0 | 21770.0 | 603.7 | 7711.0 | 26430.0 | 910.6 | 766.0 | 4291.0 | 2531.0 | 3310.0 | 9449.0 | 1484.0 | 39470.0 | 40320.0 | 1.5079e6 |
91 | 2018 | 4369.0 | 2942.0 | 2534.0 | 19340.0 | 6915.0 | 782.2 | 786700.0 | 406.1 | 1160.0 | 21000.0 | 601.7 | 7757.0 | 26180.0 | 930.0 | 730.0 | 4320.0 | 2251.0 | 3505.0 | 9667.0 | 1607.0 | 39410.0 | 38970.0 | 1.5692e6 |
92 | 2019 | 4141.0 | 3040.0 | 2819.0 | 19860.0 | 7125.0 | 825.3 | 826900.0 | 451.0 | 1129.0 | 19670.0 | 583.5 | 7912.0 | 25330.0 | 1011.0 | 722.0 | 4546.0 | 2225.0 | 3828.0 | 9064.0 | 1349.0 | 32350.0 | 40900.0 | 1.6175e6 |
93 | 2020 | 3508.0 | 2820.0 | 2634.0 | 22050.0 | 6625.0 | 825.3 | 858200.0 | 451.0 | 1227.0 | 18130.0 | 569.7 | 7059.0 | 24490.0 | 1011.0 | 725.0 | 4546.0 | 2310.0 | 3901.0 | 8192.0 | 1272.0 | 40810.0 | 40690.0 | 1.6375e6 |
94 | 2021 | 4671.0 | 2820.0 | 2634.0 | 23790.0 | 6625.0 | 825.3 | 853000.0 | 451.0 | 1227.0 | 16160.0 | 569.7 | 7059.0 | 23790.0 | 1011.0 | 701.3 | 4546.0 | 2310.0 | 3901.0 | 8609.0 | 1272.0 | 44390.0 | 41200.0 | 1.6729e6 |

```
plot(
    data[!, :Year],
    Array(log.(data[!, Not(:Year)])),
    label = reshape(string.(propertynames(data)[2:end]), 1, :),
    legend = :outerright,
    size = (900, 400)
)
```

`mwut_results = MannWhitneyUTest(data[!, :USA], data[!, :Canada])`

```
Approximate Mann-Whitney U test
-------------------------------
Population details:
parameter of interest: Location parameter (pseudomedian)
value under h_0: 0
point estimate: 28503.5
Test summary:
outcome with 95% confidence: reject h_0
two-sided p-value: <1e-30
Details:
number of observations in each group: [94, 94]
Mann-Whitney-U statistic: 8763.0
rank sums: [13228.0, 4538.0]
adjustment for ties: 24.0
normal approximation (μ, σ): (4345.0, 373.05)
```

`pvalue(mwut_results, tail=:right)`

`1.2040605143479147e-31`
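If you want to cross-check a result like this from another stack, SciPy exposes the same test as `scipy.stats.mannwhitneyu`. The sketch below uses synthetic stand-ins for the two emissions columns (the real data is not re-downloaded here):

```
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
usa_like = rng.normal(10.0, 1.0, 94)     # illustrative stand-in, clearly larger
canada_like = rng.normal(5.0, 1.0, 94)   # illustrative stand-in

u, p_two = stats.mannwhitneyu(usa_like, canada_like, alternative="two-sided")
_, p_right = stats.mannwhitneyu(usa_like, canada_like, alternative="greater")
print(u, p_two, p_right)
```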

```
dt = select(filter(:Year=> y-> y==2000 || y==2020, data), Not(:Year))
x = collect(dt[1, :])
y = collect(dt[2, :])
SignedRankTest(x, y)
```

```
Exact Wilcoxon signed rank test
-------------------------------
Population details:
parameter of interest: Location parameter (pseudomedian)
value under h_0: 0
point estimate: 16.0
95% confidence interval: (-3848.0, 621.5)
Test summary:
outcome with 95% confidence: fail to reject h_0
two-sided p-value: 0.5803
Details:
number of observations: 23
Wilcoxon rank-sum statistic: 119.0
rank sums: [119.0, 157.0]
adjustment for ties: 0.0
```
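The SciPy counterpart is `scipy.stats.wilcoxon`, which likewise takes the paired samples directly. A sketch on 23 synthetic pairs with no systematic shift (mirroring the fail-to-reject outcome above):

```
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
year_a = rng.normal(100.0, 10.0, 23)         # 23 paired observations
year_b = year_a + rng.normal(0.0, 5.0, 23)   # same units, no systematic shift

w, p = stats.wilcoxon(year_a, year_b)
print("W:", w, "p:", p)
```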

`KruskalWallisTest(collect(eachcol(data[:, Not(:Year)]))...)`

```
Kruskal-Wallis rank sum test (chi-square approximation)
-------------------------------------------------------
Population details:
parameter of interest: Location parameters
value under h_0: "all equal"
point estimate: NaN
Test summary:
outcome with 95% confidence: reject h_0
one-sided p-value: <1e-99
Details:
number of observation in each group: [94, 94, 94, 94, 94, 94, 94, 94, 94, 94 … 94, 94, 94, 94, 94, 94, 94, 94, 94, 94]
χ²-statistic: 1253.58
rank sums: [100736.0, 98787.5, 1.11532e5, 1.21338e5, 1.21336e5, 55862.0, 138744.0, 20383.0, 71558.5, 1.03662e5 … 19940.0, 59493.0, 68485.5, 84067.0, 1.0239e5, 132068.0, 83910.0, 1.10562e5, 1.7974e5, 195500.0]
degrees of freedom: 22
adjustment for ties: 0.999999
```
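And the Kruskal-Wallis test is `scipy.stats.kruskal`, again splatting one array per group. A sketch on three clearly separated synthetic groups:

```
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
groups = [rng.normal(mu, 1.0, 94) for mu in (0.0, 2.0, 4.0)]

h, p = stats.kruskal(*groups)
print("H:", h, "p:", p)  # well-separated groups give a very small p-value
```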

BibTeX citation:

```
@online{sambrani2022,
author = {Dhruva Sambrani},
title = {Non-Parametric Hypothesis Tests with Examples in {Julia}},
date = {2022-11-30},
url = {https://www.dataalltheway.com/posts/011-01-non-parametric-hypothesis-tests-julia},
langid = {en}
}
```

For attribution, please cite this work as:

Dhruva Sambrani. 2022. “Non-Parametric Hypothesis Tests with
Examples in Julia.” November 30, 2022. https://www.dataalltheway.com/posts/011-01-non-parametric-hypothesis-tests-julia.

2022-11-23 This article is cross-posted from https://thisisnic.github.io/2022/11/21/type-inference-in-readr-and-arrow/ with permission.

The CSV format is widely used in data science, and at its best it works well as a simple, human-readable format that is widely known and understood. The simplicity of CSV as a basic text format also has its drawbacks, though. One is that it contains no information about the data types of its columns, so if you’re working with CSVs in an application more complex than a text editor, those data types must be inferred by whatever is reading the data.

In this blog post, I’m going to discuss how CSV type inference works in the R packages readr and arrow, and highlight the differences between the two.

Before I get started though, I’d like to acknowledge that this post is an exercise in examining the underlying mechanics of the two packages. In practice, I’ve found that when working with datasets small enough to fit in-memory, it’s much more fruitful to either read in the data first and then manipulate it into the required shape, or just specify the column types up front. Still, the strategies for automatic guessing are interesting to explore.

readr version 2.0.0 (released in July 2021) brought a significant overhaul of the underlying code, which now depends on the vroom package.

The type inference is done by a C++ function in vroom called `guess_type__()`, which guesses types in the following order:

- Does the column contain 0 rows? If yes, return “character”.
- Are all values missing? If yes, return “logical”.
- Otherwise, try to parse the column as each of these formats, returning the first one that parses successfully:
  - Logical
  - Integer (though the default is to not look for these)
  - Double
  - Number (a special type which can remove characters from strings representing numbers and then convert them to doubles)
  - Time
  - Date
  - Datetime
  - Character

The ordering above in the parsing step goes from most to least strict in terms of the conditions which have to be met to successfully parse an input as that data type. For example, for a column to be of logical type, it can only contain a small subset of values representing true (`T`, `t`, `True`, `TRUE`, `true`), false (`F`, `f`, `False`, `FALSE`, `false`) or NA, which is why this is the most strict type, but all of the other types could be read in as character data, which is the least strict and why this is last in the order.
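This most-strict-first cascade is easy to mimic. The toy Python sketch below illustrates the idea only — it is not vroom's actual implementation, and it covers just a few of the types:

```
from datetime import datetime

TRUE_SET = {"T", "t", "True", "TRUE", "true"}
FALSE_SET = {"F", "f", "False", "FALSE", "false"}

def parse_logical(v):
    if v in TRUE_SET or v in FALSE_SET:
        return v in TRUE_SET
    raise ValueError(v)

def parse_date(v):
    return datetime.strptime(v, "%Y-%m-%d").date()

# Most strict first, character (identity) last.
PARSERS = [("logical", parse_logical), ("double", float),
           ("date", parse_date), ("character", str)]

def guess_type(column, na="NA"):
    if not column:
        return "character"   # 0 rows -> character
    values = [v for v in column if v != na]
    if not values:
        return "logical"     # all values missing -> logical
    for name, parser in PARSERS:
        try:
            for v in values:
                parser(v)
            return name       # first parser that accepts every value wins
        except ValueError:
            continue

print(guess_type(["TRUE", "FALSE", "NA"]))  # logical
print(guess_type(["1.5", "2"]))             # double
print(guess_type(["2022-11-21"]))           # date
print(guess_type(["abc", "1"]))             # character
```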

In arrow, `read_csv_arrow()` handles CSV reading, and much of its interface has been designed to closely follow the excellent APIs of `readr::read_csv()` and `vroom::vroom()`. The intention here is that users can use parameter names they’re familiar with from the aforementioned readers when using arrow, and get the same results. The underlying code is pretty different though.

In addition, Arrow has a different set of possible data types compared to R; see the Arrow docs for more information about the mapping between R data types and Arrow types.

In the Arrow docs, we can see that types are inferred in this order:

- Null
- Int64
- Boolean
- Date32
- Timestamp (with seconds unit)
- Float64
- Dictionary\<String\> (if `ConvertOptions::auto_dict_encode` is true)
- Dictionary\<Binary\> (if `ConvertOptions::auto_dict_encode` is true)
- String
- Binary

Note that if you use `arrow::read_csv_arrow()` with the parameter `as_data_frame = TRUE` (the default), the Arrow data types are then converted to R data types.

```
simple_data <- data.frame(x = c(1, 2, 3), y = c("a", "b", "c"), z = c(1.1, 2.2, 3.3))
readr::write_csv(simple_data, "simple_data.csv")
# columns are arrow's int64, string, and double (aka float64) types
arrow::read_csv_arrow("simple_data.csv", as_data_frame = FALSE)
```

```
Table
3 rows x 3 columns
$x <int64>
$y <string>
$z <double>
```

```
# columns converted to R's integer, character, and double types
arrow::read_csv_arrow("simple_data.csv", as_data_frame = TRUE)
```

```
# A tibble: 3 × 3
x y z
<int> <chr> <dbl>
1 1 a 1.1
2 2 b 2.2
3 3 c 3.3
```

Although there appear to be quite a few differences between the type-inference orders of arrow and readr, in practice this doesn’t have much effect. Type inference for logical/boolean and integer values is the opposite way round, but given that the underlying data that translates into these types looks very different, they’re not going to be mixed up. The biggest differences come from custom behaviours which are specific to readr and arrow; I’ve outlined them below.

In the code for readr, the default setting is for numeric values to always be read in as doubles but never integers. If you want readr to guess that a column may be an integer, you need to read it in as character data and then call `type_convert()`. This isn’t necessarily a great workflow though, and in most cases it would make sense to just manually specify the column type instead of having it inferred.

In arrow, if data can be represented as integers, it will be read in as integers rather than doubles.

```
int_or_dbl <- data.frame(
x = c(1L, 2L, 3L)
)
readr::write_csv(int_or_dbl, "int_or_dbl.csv")
readLines("int_or_dbl.csv")
```

`[1] "x" "1" "2" "3"`

```
# double
readr::read_csv("int_or_dbl.csv")
```

```
Rows: 3 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): x
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

```
# A tibble: 3 × 1
x
<dbl>
1 1
2 2
3 3
```

```
# integer via inference
readr::read_csv("int_or_dbl.csv", col_types = list(.default = col_character())) %>%
type_convert(guess_integer = TRUE)
```

```
── Column specification ────────────────────────────────────────────────────────
cols(
x = col_integer()
)
```

```
# A tibble: 3 × 1
x
<int>
1 1
2 2
3 3
```

```
# integer via specification
readr::read_csv("int_or_dbl.csv", col_types = list(col_integer()))
```

```
# A tibble: 3 × 1
x
<int>
1 1
2 2
3 3
```

```
# integer via inference
arrow::read_csv_arrow("int_or_dbl.csv")
```

```
# A tibble: 3 × 1
x
<int>
1 1
2 2
3 3
```

Another difference between readr and arrow is how integers larger than 32 bits are read in. Natively, R can only support 32-bit integers, though it can support 64-bit integers via the bit64 package. If we create a CSV with one column containing the largest integer that R can natively support, and another column containing that value plus 1, we get different behaviour when we import this data with readr and arrow. In readr, when we enable integer guessing, the smaller value is read in as an integer and the larger value as a double. However, once we move over to manually specifying column types, we can use `vroom::col_big_integer()` to use bit64 and get a large-integer column. The arrow package also uses bit64, and its guessing infers a 64-bit integer column automatically.

```
sixty_four <- data.frame(x = 2^31 - 1, y = 2^31)
readr::write_csv(sixty_four, "sixty_four.csv")
# doubles by default
readr::read_csv("sixty_four.csv")
```

```
Rows: 1 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): x, y
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

```
# A tibble: 1 × 2
x y
<dbl> <dbl>
1 2147483647 2147483648
```

```
# 32 bit integer or double depending on value size
readr::read_csv("sixty_four.csv", col_types = list(.default = col_character())) %>%
type_convert(guess_integer = TRUE)
```

```
── Column specification ────────────────────────────────────────────────────────
cols(
x = col_integer(),
y = col_double()
)
```

```
# A tibble: 1 × 2
x y
<int> <dbl>
1 2147483647 2147483648
```

```
# integers by specification
readr::read_csv(
"sixty_four.csv",
col_types = list(x = col_integer(), y = vroom::col_big_integer())
)
```

```
# A tibble: 1 × 2
x y
<int> <int64>
1 2147483647 2147483648
```

```
# integers by inference
arrow::read_csv_arrow("sixty_four.csv")
```

```
# A tibble: 1 × 2
x y
<int> <int64>
1 2147483647 2147483648
```

One really cool feature in readr is the “number” parsing strategy. This allows values which have been stored as character data with commas to separate the thousands to be read in as doubles. This is not supported in arrow.

```
number_type <- data.frame(
x = c("1,000", "1,250")
)
readr::write_csv(number_type, "number_type.csv")
# double type, but parsed in as number in column spec shown below
readr::read_csv("number_type.csv")
```

```
Rows: 2 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
num (1): x
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

```
# A tibble: 2 × 1
x
<dbl>
1 1000
2 1250
```

```
# read in as character data in Arrow
arrow::read_csv_arrow("number_type.csv")
```

```
# A tibble: 2 × 1
x
<chr>
1 1,000
2 1,250
```
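For comparison, pandas in Python supports the same idea through the `thousands` argument of `read_csv`; a minimal sketch:

```
import io
import pandas as pd

csv_text = 'x\n"1,000"\n"1,250"\n'
df = pd.read_csv(io.StringIO(csv_text), thousands=",")
print(df["x"].tolist())  # [1000, 1250], parsed as numbers
```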

Anyone who’s been around long enough might remember that R’s native CSV reading function `read.csv()` had a default setting of importing character columns as factors (I definitely have `read.csv(..., stringsAsFactors = FALSE)` carved into a groove in some dark corner of my memory). This default was changed in version 4.0.0, released in April 2020, reflecting the fact that in most cases users want their string data to be imported as characters unless otherwise specified. Still, some datasets contain character data which users do want to import as factors. In readr, this can be controlled by manually specifying the column as a factor.

In arrow, if you don’t want to individually specify column types, you can set up an option to import character columns as dictionaries (the Arrow equivalent of factors), which are converted into factors.

```
dict_type <- data.frame(
x = c("yes", "no", "yes", "no")
)
readr::write_csv(dict_type, "dict_type.csv")
# character data
readr::read_csv("dict_type.csv")
```

```
Rows: 4 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): x
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

```
# A tibble: 4 × 1
x
<chr>
1 yes
2 no
3 yes
4 no
```

```
# factor data
readr::read_csv("dict_type.csv", col_types = list(x = col_factor()))
```

```
# A tibble: 4 × 1
x
<fct>
1 yes
2 no
3 yes
4 no
```

```
# set up the option. there's an open ticket to make this code a bit nicer to read.
auto_dict_option <- arrow::CsvConvertOptions$create(auto_dict_encode = TRUE)
arrow::read_csv_arrow("dict_type.csv", convert_options = auto_dict_option)
```

```
# A tibble: 4 × 1
x
<fct>
1 yes
2 no
3 yes
4 no
```
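The pandas analogue of dictionary/factor encoding is the `category` dtype, which you can request per column at read time; a sketch:

```
import io
import pandas as pd

csv_text = "x\nyes\nno\nyes\nno\n"
df = pd.read_csv(io.StringIO(csv_text), dtype={"x": "category"})
print(df["x"].dtype)                   # category
print(sorted(df["x"].cat.categories))  # the two levels, 'no' and 'yes'
```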

Another slightly niche but potentially useful piece of functionality available in arrow is the ability to customise which values can be parsed as logical/boolean type and how they translate to `TRUE`/`FALSE`. This can be achieved by setting some custom conversion options.

```
alternative_true_false <- arrow::CsvConvertOptions$create(
false_values = "no", true_values = "yes"
)
arrow::read_csv_arrow("dict_type.csv", convert_options = alternative_true_false)
```

```
# A tibble: 4 × 1
x
<lgl>
1 TRUE
2 FALSE
3 TRUE
4 FALSE
```
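pandas’ `read_csv` offers equivalent knobs, `true_values` and `false_values`; when every value in a column is covered, the column comes back as booleans. A sketch:

```
import io
import pandas as pd

csv_text = "x\nyes\nno\nyes\nno\n"
df = pd.read_csv(io.StringIO(csv_text),
                 true_values=["yes"], false_values=["no"])
print(df["x"].tolist())  # [True, False, True, False]
```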

Although relying on the reader itself to guess your column types can work well, what if you want more precise control?

In readr, you can use the `col_types` parameter to specify column types. You can use the same parameter in arrow to use R type specifications.

```
given_types <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
readr::write_csv(given_types, "given_types.csv")
readr::read_csv("given_types.csv", col_types = list(col_integer(), col_double()))
```

```
# A tibble: 3 × 2
x y
<int> <dbl>
1 1 4
2 2 5
3 3 6
```

You can also use this shortcode specification. Here, “i” means integer and “d” means double.

`readr::read_csv("given_types.csv", col_types = "id")`

```
# A tibble: 3 × 2
x y
<int> <dbl>
1 1 4
2 2 5
3 3 6
```

In arrow you can use the shortcodes (though not the `col_*()` functions), but you must specify the column names.

We skip the first row as our data has a header row; this is the same behaviour as when we use both names and types in `readr::read_csv()`, which assumes that the header row is data if we don’t skip it.

`arrow::read_csv_arrow("given_types.csv", col_names = c("x", "y"), col_types = "id", skip = 1)`

```
# A tibble: 3 × 2
x y
<int> <dbl>
1 1 4
2 2 5
3 3 6
```

What if you want to use Arrow types instead of R types though? In this case, you need to use a schema. I won’t go into detail here, but in short, schemas are lists of fields, each of which contain a field name and a data type. You can specify a schema like this:

```
# this gives the same result as before - because our Arrow data has been converted to the relevant R type
arrow::read_csv_arrow("given_types.csv", schema = arrow::schema(x = arrow::int8(), y = arrow::float32()), skip = 1)
```

```
# A tibble: 3 × 2
x y
<int> <dbl>
1 1 4
2 2 5
3 3 6
```

```
# BUT, if you don't read it in as a data frame you'll see the Arrow type
arrow::read_csv_arrow("given_types.csv", schema = arrow::schema(x = arrow::int8(), y = arrow::float32()), skip = 1, as_data_frame = FALSE)
```

```
Table
3 rows x 2 columns
$x <int8>
$y <float>
```

An alternative approach is to use Parquet format, which stores the data types along with the data. This means that if you’re sharing your data with others, you don’t need to worry about it being read in as the wrong data types. In a follow-up post I’ll explore the Parquet format and compare management of data types in CSVs and Parquet.

If you want a much more detailed discussion of Arrow data types, see this excellent blog post by Danielle Navarro.

Huge thanks to everyone who helped review and tweak this blog post, and special thanks to Jenny Bryan who gave some really helpful feedback on the content on readr/vroom!

BibTeX citation:

```
@misc{crane2022,
author = {Nic Crane},
title = {Type Inference in Readr and Arrow},
date = {2022-11-21},
url = {https://thisisnic.github.io/2022/11/21/type-inference-in-readr-and-arrow/},
langid = {en}
}
```

For attribution, please cite this work as:

Nic Crane. 2022. “Type Inference in Readr and Arrow.”
November 21, 2022. https://thisisnic.github.io/2022/11/21/type-inference-in-readr-and-arrow/.

2022-11-21 Code on Kaggle

2022-11-18 First draft

A hypothesis test is a statistical test used to determine whether there is enough evidence to support a hypothesis, for example, that there is a difference between the average heights of males and females. Non-parametric hypothesis tests are tests that do not rely on the assumptions of normality or equal variance. They are traditional alternatives to parametric tests because they make few or no assumptions about the distribution of the data or population. Non-parametric tests are often based on ranks given to the original numerical data. They are usually regarded as relatively easy to perform, but some problems can occur. It can be cumbersome to carry out such tests when working with large amounts of data, and in many fields of study, such as psychology, the data have quite restricted ranges of scores, which can result in the same value appearing several times in a data set; tests based on ranks become more complicated with increased tied scores. And although non-parametric tests have fewer assumptions, they are not as powerful as parametric tests.

The different types of non-parametric hypothesis tests are:

- the Wilcoxon rank-sum (Mann-Whitney U test) (Section 2),
- the Wilcoxon signed-rank test (Section 3), and
- the Kruskal-Wallis test (Section 4).

Besides deciding which hypothesis test to use to answer the question at hand, we also need to decide on a few other parameters, for example, whether the test should be one-sample or two-sample, paired or unpaired, and one- or two-tailed. I have discussed those parameters in detail here.

For our example exercises, we will work with the “Global CO2 emissions from cement production” dataset (Andrew 2022). I have subsetted the data from 1928 onward and dropped any columns with all NAs or zeros. The table below shows all the data we will use in this tutorial.

Figure 1 shows per-country yearly emissions (x-axis: year; y-axis: emissions logged to base 10). The log is taken for visualization purposes only; all the statistical calculations will be done on the original values.

```
suppressMessages(library(DT))
suppressMessages(library(tidyverse))
suppressMessages(library(kableExtra))
```

```
# Download the data from Zenodo
dat <- readr::read_csv("https://zenodo.org/record/7081360/files/1.%20Cement_emissions_data.csv", show_col_types = FALSE)
# Filter the data and present it in a DT::datatable
dat <- dat %>% dplyr::filter(Year >= 1928) %>%
  select_if(function(x) all(!is.na(x))) %>%
  select_if(function(x) all(!x == 0))
DT::datatable(dat)
```

```
dat_gather <- dat %>% gather(key = "Country", value = "Emission", -Year)
ggplot(dat_gather, aes(x = Year, y = as.numeric(log10(Emission)), color = Country)) +
  geom_line(aes(group = Country)) +
  labs(x = "Year", y = "log10(Emission)", color = "") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
```

The Wilcoxon rank-sum or Mann-Whitney U test is perhaps the most common non-parametric test for unrelated samples. You would use it when the two groups are independent of each other, for example, in our dataset, testing differences in CO₂ emissions between two different countries (e.g., USA vs. Canada). It can be used even when the two groups are of different sizes.

- First, we rank all of the values (from both groups) from smallest to largest. Tied values are allocated the average of the ranks they would have received had there been tiny differences between them.
- Next, we sum the ranks for each group. We call the sum of the ranks for the larger group $R_1$ and that for the smaller group $R_2$. If both groups are equally sized, we can label them whichever way round we like.
- We then input $R_1$ and $R_2$, along with $N_1$ and $N_2$, the respective sizes of each group, into Equation 1:

$$U = N_1 N_2 + \frac{N_1 (N_1 + 1)}{2} - R_1 \tag{1}$$

- Then we compare the value of $U$ to significance tables: find the intersection of the column for $N_1$ and the row for $N_2$. At this intersection there are two ranges of values of $U$ that are significant at the 5% level. If our value falls within one of these ranges, we have a significant result and we reject the null hypothesis. If it does not, the result is not significant, the independent variable is unrelated to the dependent variable, and we accept $H_0$.
- As a check, we also need to examine the means of the two groups, to see which has the higher scores on the dependent variable.
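The rank-sum recipe above can be verified numerically against SciPy: `scipy.stats.rankdata` does the tied-rank averaging, and `scipy.stats.mannwhitneyu` reports one of the two equivalent U conventions (the two possible values sum to N1×N2). Illustrative data:

```
import numpy as np
from scipy import stats

x = np.array([12.0, 15.0, 11.0, 19.0, 17.0])  # group 1, N1 = 5
y = np.array([10.0, 14.0, 13.0, 16.0])        # group 2, N2 = 4

ranks = stats.rankdata(np.concatenate([x, y]))  # ties get averaged ranks
r1 = ranks[: len(x)].sum()                      # rank sum for group 1
n1, n2 = len(x), len(y)

u = n1 * n2 + n1 * (n1 + 1) / 2 - r1            # U = N1*N2 + N1*(N1+1)/2 - R1
u_scipy, p = stats.mannwhitneyu(x, y, alternative="two-sided")
print(u, u_scipy)  # equal up to the convention (U vs. N1*N2 - U)
```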

In this example, we will do a two-tailed test to measure whether there is a difference in emissions between the USA and Canada. Our null hypothesis is that there is no difference.

`(w_res <- wilcox.test(dat$USA, dat$Canada, conf.int = TRUE))`

```
Wilcoxon rank sum test with continuity correction
data: dat$USA and dat$Canada
W = 8763, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
26384 29797
sample estimates:
difference in location
28155.27
```

We can fetch individual results from the `w_res` object with accessors like `w_res$p.value`. However, it’s easier to fetch all the values and convert them into a data frame using the `broom::tidy()` function from the tidyverse suite. As we see in Table 1, the p-value is < 2.2e-16, which means we can reject our null hypothesis and accept our alternative hypothesis that there is a significant difference in CO_{2} emissions between the USA and Canada.

```
broom::tidy(w_res) %>%
kbl() %>%
kable_paper("hover", full_width = F)
```

estimate | statistic | p.value | conf.low | conf.high | method | alternative |
---|---|---|---|---|---|---|
28155.27 | 8763 | 0 | 26384 | 29797 | Wilcoxon rank sum test with continuity correction | two.sided |

In this example we will do a one-tailed test to measure if emissions from the USA is greater than Canada. Our null hypothesis is that the emissions from the USA is not greater than Canada.

`(w_res <- wilcox.test(dat$USA, dat$Canada, conf.int = TRUE, alternative = "greater"))`

```
Wilcoxon rank sum test with continuity correction
data: dat$USA and dat$Canada
W = 8763, p-value < 2.2e-16
alternative hypothesis: true location shift is greater than 0
95 percent confidence interval:
26696 Inf
sample estimates:
difference in location
28155.27
```

As we see in Table 2, the p-value is < 2.2e-16, which means we can reject our null hypothesis and accept our alternative hypothesis that the CO_{2} emissions are greater in the USA than in Canada.

```
broom::tidy(w_res) %>%
kbl() %>%
kable_paper("hover", full_width = F)
```

estimate | statistic | p.value | conf.low | conf.high | method | alternative |
---|---|---|---|---|---|---|
28155.27 | 8763 | 0 | 26696 | Inf | Wilcoxon rank sum test with continuity correction | greater |

The Wilcoxon signed-rank test, also known as the Wilcoxon matched-pairs test, is similar to the sign test. The only alteration is that we rank the differences ignoring their signs (though we do keep a note of them). As the name implies, we use the Wilcoxon matched-pairs test on related data, so the two samples will be equal in size.

- Calculate the difference between each pair of values in your two samples. We then remove difference values of zero.
- Rank the differences, ignoring their signs. If values are tied, use the same method as in the Mann-Whitney test: assign tied difference scores the average of the ranks they would have if it were possible to separate them.
- The ranks of the differences can now have the sign of the difference reattached.
- The sum of the positive ranks is calculated.
- The sum of the negative ranks is calculated.
- You then choose the smaller sum of ranks, and we call this our W value, which we compare with significance tables. Choose the row corresponding to the number of pairs of values in your sample.
- Report your findings and make your conclusion.
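The steps above can be mirrored in Python with scipy on a pair of made-up related samples (the values below are hypothetical, chosen so the absolute differences have no ties and one pair is identical):

```python
import numpy as np
from scipy import stats

# Hypothetical paired measurements, e.g. before and after
x = np.array([10.2, 12.5, 9.8, 14.1, 11.3, 13.0])
y = np.array([11.0, 12.1, 10.5, 15.0, 11.3, 14.2])

d = x - y
d = d[d != 0]                      # drop zero differences
ranks = stats.rankdata(np.abs(d))  # rank the differences, ignoring signs
w_pos = ranks[d > 0].sum()         # sum of positive ranks
w_neg = ranks[d < 0].sum()         # sum of negative ranks
w = min(w_pos, w_neg)              # the smaller sum is the test statistic

# scipy also drops zero differences and reports the smaller rank sum
res = stats.wilcoxon(x, y)
print(w, res.statistic, res.pvalue)
```

The two rank sums always total n(n + 1)/2 for the n non-zero differences, which is a handy sanity check on a hand calculation.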

Since this is a paired test, we will test whether there is a difference in emissions between two time periods, say 2000 and 2020, across all the countries in our dataset. Our null hypothesis is that there is no difference.

```
dat_m <- dat %>% dplyr::select(-Year) %>% as.matrix()
rownames(dat_m) <- dat$Year
dat_t <- t(dat_m)
x <- as.numeric(dat_t[,"2000"])
y <- as.numeric(dat_t[,"2020"])
(w_res <- wilcox.test(x, y, conf.int = TRUE, paired = TRUE))
```

```
Wilcoxon signed rank exact test
data: x and y
V = 119, p-value = 0.5803
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
-3848.5 621.5
sample estimates:
(pseudo)median
-297
```

As we can see in Table 3, the p-value is 0.58, which is above our alpha level of 0.05; therefore, we accept our null hypothesis that there is no significant difference in CO_{2} emissions between 2000 and 2020.

```
broom::tidy(w_res) %>%
kbl() %>%
kable_paper("hover", full_width = F)
```

estimate | statistic | p.value | conf.low | conf.high | method | alternative |
---|---|---|---|---|---|---|
-297 | 119 | 0.580338 | -3848.5 | 621.5 | Wilcoxon signed rank exact test | two.sided |

Kruskal-Wallis test by rank (Kruskal–Wallis H test) is a non-parametric alternative to one-way ANOVA test, which extends the two-samples Wilcoxon test in the situation where there are more than two groups. It’s recommended when the assumptions of one-way ANOVA test are not met.

A significant Kruskal–Wallis test indicates that at least one sample stochastically dominates one other sample. The test does not identify where this stochastic dominance occurs or for how many pairs of groups stochastic dominance obtains.

Since it is a nonparametric method, the Kruskal–Wallis test does not assume a normal distribution of the residuals, unlike the analogous one-way analysis of variance. If the researcher can assume an identically shaped and scaled distribution for all groups, except for any difference in medians, then the null hypothesis is that the medians of all groups are equal, and the alternative hypothesis is that at least one population median of one group differs from the population median of at least one other group. Otherwise, it is impossible to say whether the rejection of the null hypothesis comes from a shift in location or from differences in group dispersions.

Rank all data from all groups together; i.e., rank the data from 1 to N, ignoring group membership. Assign any tied values the average of the ranks they would have received had they not been tied.

The test statistic is given by Equation 2:

H = (12 / (N(N + 1))) × Σ_{i=1..g} (R_{i}² / n_{i}) − 3(N + 1)

Where N is the total sample size, g is the number of groups we are comparing, R_{i} is the sum of ranks for group i, and n_{i} is the sample size of group i.

The decision to reject or not reject the null hypothesis is made by comparing H to a critical value obtained from a table or from software for a given significance (alpha) level. If H is bigger than the critical value, the null hypothesis is rejected.

If H is not significant, then there is no evidence of stochastic dominance between the samples. However, if the test is significant, then at least one sample stochastically dominates another sample.
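A minimal sketch of the H computation in Python, cross-checked against `scipy.stats.kruskal` (the three groups are small hypothetical samples with no tied values, so no tie correction is needed):

```python
import numpy as np
from scipy import stats

# Three small hypothetical groups with no tied values
g1 = [2.9, 3.0, 2.5, 2.6, 3.2]
g2 = [3.8, 2.7, 4.0, 2.4]
g3 = [2.8, 3.4, 3.7, 2.2, 2.0]
groups = [g1, g2, g3]

pooled = np.concatenate(groups)
n = len(pooled)
ranks = stats.rankdata(pooled)    # rank all data ignoring group membership

# H = 12/(N(N+1)) * sum(R_i^2 / n_i) - 3(N+1)
h = 0.0
start = 0
for g in groups:
    r_i = ranks[start:start + len(g)].sum()
    h += r_i**2 / len(g)
    start += len(g)
h = 12 / (n * (n + 1)) * h - 3 * (n + 1)

res = stats.kruskal(g1, g2, g3)
print(h, res.statistic)           # identical when there are no ties
```

With tied values scipy additionally divides H by a tie-correction factor, which is why the manual formula above only matches exactly on tie-free data.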

We will use the same long form of the data that we used in Figure 1.

`(k_res <- kruskal.test(dat_gather$Emission, as.factor(dat_gather$Country)))`

```
Kruskal-Wallis rank sum test
data: dat_gather$Emission and as.factor(dat_gather$Country)
Kruskal-Wallis chi-squared = 1253.6, df = 22, p-value < 2.2e-16
```

```
broom::tidy(k_res) %>%
kbl() %>%
kable_paper("hover", full_width = F)
```

statistic | p.value | parameter | method |
---|---|---|---|
1253.576 | 0 | 22 | Kruskal-Wallis rank sum test |

Andrew, Robbie. 2022. “Global CO2 Emissions from Cement Production.” Zenodo. https://doi.org/10.5281/ZENODO.7081360.

BibTeX citation:

```
@online{farmer2022,
author = {Rohit Farmer},
title = {Non-Parametric Hypothesis Tests with Examples in {R}},
date = {2022-11-18},
url = {https://www.dataalltheway.com/posts/011-non-parametric-hypothesis-tests-r},
langid = {en}
}
```

For attribution, please cite this work as:

Rohit Farmer. 2022. “Non-Parametric Hypothesis Tests with Examples
in R.” November 18, 2022. https://www.dataalltheway.com/posts/011-non-parametric-hypothesis-tests-r.

2022-11-17 First draft

This article is an extension of Rohit Farmer. 2022. “Parametric Hypothesis Tests with Examples in R.” November 10, 2022. Please check out the parent article for the theoretical background.

- Z-test (Section 3)
- T-test (Section 4)
- F-test (Section 5)
- ANOVA (Section 6)

```
import Pkg
Pkg.activate(".")
using CSV
using DataFrames
using Statistics
using HypothesisTests
```

```
Activating project at `~/sandbox/dataalltheway/posts/010-01-parametric-hypothesis-tests-julia`
```

Some cleaning is necessary since the data is not of the correct types.

```
begin
data = CSV.read(download("https://raw.githubusercontent.com/opencasestudies/ocs-bp-rural-and-urban-obesity/master/data/wrangled/BMI_long.csv"), DataFrame) # download and load
allowmissing!(data, :BMI) # Allow BMI col to have missing values
replace!(data.BMI, "NA" => missing) # Convert "NA" to missing
data[!, :BMI] .= passmissing(parse).(Float64, (data[!, :BMI])) # Parse the BMI strings into Float64, propagating missing values
end;
```

`first(data, 20)`

20×5 DataFrame

Row | Country | Sex | Region | Year | BMI |
---|---|---|---|---|---|
| String | String7 | String15 | Int64 | Float64? |
1 | Afghanistan | Men | National | 1985 | 20.2 |
2 | Afghanistan | Men | Rural | 1985 | 19.7 |
3 | Afghanistan | Men | Urban | 1985 | 22.4 |
4 | Afghanistan | Men | National | 2017 | 22.8 |
5 | Afghanistan | Men | Rural | 2017 | 22.5 |
6 | Afghanistan | Men | Urban | 2017 | 23.6 |
7 | Afghanistan | Women | National | 1985 | 20.6 |
8 | Afghanistan | Women | Rural | 1985 | 20.1 |
9 | Afghanistan | Women | Urban | 1985 | 23.2 |
10 | Afghanistan | Women | National | 2017 | 24.4 |
11 | Afghanistan | Women | Rural | 2017 | 23.6 |
12 | Afghanistan | Women | Urban | 2017 | 26.3 |
13 | Albania | Men | National | 1985 | 25.2 |
14 | Albania | Men | Rural | 1985 | 25.0 |
15 | Albania | Men | Urban | 1985 | 25.4 |
16 | Albania | Men | National | 2017 | 27.0 |
17 | Albania | Men | Rural | 2017 | 26.9 |
18 | Albania | Men | Urban | 2017 | 27.0 |
19 | Albania | Women | National | 1985 | 26.0 |
20 | Albania | Women | Rural | 1985 | 26.1 |

```
uneqvarztest = let
# Fetch a random sample of BMI data for women in the year 1985 and 2017
x1 = filter([:Sex, :Year] => (s, y) -> s=="Women" && y==1985 , data) |>
x -> x[!, :BMI] |> skipmissing |> collect |> x->rand(x, 300)
x2 = filter([:Sex, :Year] => (s, y) -> s=="Women" && y==2017 , data) |>
x -> x[!, :BMI] |> skipmissing |> collect |> x->rand(x, 300)
UnequalVarianceZTest(x1, x2)
end
```

```
Two sample z-test (unequal variance)
------------------------------------
Population details:
parameter of interest: Mean difference
value under h_0: 0
point estimate: -2.26
95% confidence interval: (-2.679, -1.841)
Test summary:
outcome with 95% confidence: reject h_0
two-sided p-value: <1e-25
Details:
number of observations: [300,300]
z-statistic: -10.560590588866509
population standard error: 0.21400318296427412
```

```
eqvarztest = let
# Fetch a random sample of BMI data for women in the year 1985 and 2017
x1 = filter([:Sex, :Year] => (s, y) -> s=="Women" && y==1985 , data) |>
x -> x[!, :BMI] |> skipmissing |> collect |> x->rand(x, 300)
x2 = filter([:Sex, :Year] => (s, y) -> s=="Women" && y==2017 , data) |>
x -> x[!, :BMI] |> skipmissing |> collect |> x->rand(x, 300)
EqualVarianceZTest(x1, x2)
end
```

```
Two sample z-test (equal variance)
----------------------------------
Population details:
parameter of interest: Mean difference
value under h_0: 0
point estimate: -2.173
95% confidence interval: (-2.611, -1.735)
Test summary:
outcome with 95% confidence: reject h_0
two-sided p-value: <1e-21
Details:
number of observations: [300,300]
z-statistic: -9.724414586039652
population standard error: 0.22345818154642977
```

```
onesamplettest = let
x1 = filter(
[:Sex, :Region, :Year] =>
(s, r, y) -> s=="Men" && r=="Rural" && y == 2017,
data
) |>
x -> x[!, :BMI] |> skipmissing |> collect |> x->rand(x, 300)
OneSampleTTest(x1, 24.5)
end
```

```
One sample t-test
-----------------
Population details:
parameter of interest: Mean
value under h_0: 24.5
point estimate: 25.466
95% confidence interval: (25.16, 25.77)
Test summary:
outcome with 95% confidence: reject h_0
two-sided p-value: <1e-08
Details:
number of observations: 300
t-statistic: 6.280721563263261
degrees of freedom: 299
empirical standard error: 0.15380398418714467
```

```
unpairedtwosamplettest = let
x1 = filter([:Sex, :Region, :Year] =>
(s, r, y) -> s=="Women" && r=="Rural" && y == 1985,
data) |>
x -> x[!, :BMI] |> skipmissing |> collect |> x->rand(x, 300)
x2 = filter([:Sex, :Region, :Year] =>
(s, r, y) -> s=="Women" && r=="Urban" && y == 1985,
data) |>
x -> x[!, :BMI] |> skipmissing |> collect |> x->rand(x, 300)
UnequalVarianceTTest(x1, x2)
end
```

```
Two sample t-test (unequal variance)
------------------------------------
Population details:
parameter of interest: Mean difference
value under h_0: 0
point estimate: -1.05867
95% confidence interval: (-1.512, -0.6054)
Test summary:
outcome with 95% confidence: reject h_0
two-sided p-value: <1e-05
Details:
number of observations: [300,300]
t-statistic: -4.587387795167387
degrees of freedom: 575.968012373301
empirical standard error: 0.2307776699807073
```

```
unpairedtwosamplettest = let
x1 = filter([:Sex, :Region, :Year] =>
(s, r, y) -> s=="Women" && r=="Rural" && y == 1985,
data) |>
x -> x[!, :BMI] |> skipmissing |> collect |> x->rand(x, 300)
x2 = filter([:Sex, :Region, :Year] =>
(s, r, y) -> s=="Women" && r=="Urban" && y == 1985,
data) |>
x -> x[!, :BMI] |> skipmissing |> collect |> x->rand(x, 300)
UnequalVarianceTTest(x1, x2)
end
pvalue(unpairedtwosamplettest, tail=:right)
```

`0.9999999445762`

```
pairedtwosamplettest = let
x1 = filter([:Sex, :Region, :Year] =>
(s, r, y) -> s=="Women" && r=="Rural" && y == 1985,
data) |>
x -> x[!, :BMI] |> skipmissing |> collect |> x->rand(x, 300)
x2 = filter([:Sex, :Region, :Year] =>
(s, r, y) -> s=="Women" && r=="Urban" && y == 1985,
data) |>
x -> x[!, :BMI] |> skipmissing |> collect |> x->rand(x, 300)
EqualVarianceTTest(x1, x2)
end
```

```
Two sample t-test (equal variance)
----------------------------------
Population details:
parameter of interest: Mean difference
value under h_0: 0
point estimate: -1.01167
95% confidence interval: (-1.44, -0.5838)
Test summary:
outcome with 95% confidence: reject h_0
two-sided p-value: <1e-05
Details:
number of observations: [300,300]
t-statistic: -4.64337449574737
degrees of freedom: 598
empirical standard error: 0.2178731583233696
```

```
Ftest = let
x1 = filter([:Sex, :Region, :Year] =>
(s, r, y) -> s=="Women" && r=="Rural" && y == 1985,
data) |>
x -> x[!, :BMI] |> skipmissing |> collect |> x->rand(x, 300)
x2 = filter([:Sex, :Region, :Year] =>
(s, r, y) -> s=="Women" && r=="Urban" && y == 1985,
data) |>
x -> x[!, :BMI] |> skipmissing |> collect |> x->rand(x, 300)
VarianceFTest(x1, x2)
end
```

```
Variance F-test
---------------
Population details:
parameter of interest: variance ratio
value under h_0: 1.0
point estimate: 1.32245
Test summary:
outcome with 95% confidence: reject h_0
two-sided p-value: 0.0159
Details:
number of observations: [300, 300]
F statistic: 1.322447205183649
degrees of freedom: [299, 299]
```

```
Atest = let
x = filter([:Sex, :Year] => (s,y) -> (s=="Men" && y==2017), data)
groups = groupby(x, :Region)
bmis = map(keys(groups)) do key # for each group,
collect(skipmissing(groups[key][!, :BMI])) # collect BMI, skipping missing values
end
res = OneWayANOVATest(bmis...)
end
```

```
One-way analysis of variance (ANOVA) test
-----------------------------------------
Population details:
parameter of interest: Means
value under h_0: "all equal"
point estimate: NaN
Test summary:
outcome with 95% confidence: reject h_0
p-value: 0.0333
Details:
number of observations: [200, 196, 199]
F statistic: 3.42167
degrees of freedom: (2, 592)
```

BibTeX citation:

```
@online{sambrani2022,
author = {Dhruva Sambrani},
title = {Parametric Hypothesis Tests with Examples in {Julia}},
date = {2022-11-17},
url = {https://www.dataalltheway.com/posts/010-01-parametric-hypothesis-tests-julia},
langid = {en}
}
```

For attribution, please cite this work as:

Dhruva Sambrani. 2022. “Parametric Hypothesis Tests with Examples
in Julia.” November 17, 2022. https://www.dataalltheway.com/posts/010-01-parametric-hypothesis-tests-julia.

2022-12-22 Section for ANOVA.

2022-11-11 Added example code for one-tailed t-test; mentioned Welch’s t-test in a note callout.

2022-11-10 First draft with live code on Kaggle.

A hypothesis test is a statistical test used to determine whether there is enough evidence to support a hypothesis, for example, that there is a difference between the average height of males and females. A parametric hypothesis test assumes that the data you want to test is (or approximately is) normally distributed; height, for example, is normally distributed in a population. In addition, your data needs to be symmetrical, since normally distributed data is symmetrical. If your data does not have the appropriate properties, then you use a non-parametric test.

There are different parametric tests, but the most common is the t-test. Below is the list of the four parametric hypothesis tests we will explore here.

- Z-test (Section 2)
- T-test (Section 3)
- F-test (Section 4)
- ANOVA (Section 5)

Besides deciding which hypothesis test to use to answer the question at hand, we also need to decide a couple of other parameters, for example, whether the test would be one sample or two samples, paired or un-paired, and one or two-tailed. Below is a brief description of each parameter.

Paired tests are used to compare two related data groups, *such as before and after measurements.* Unpaired tests are used to compare two unrelated data groups, such as men and women.

There are key differences between one-sample and two-sample hypothesis testing. In one-sample hypothesis testing, we test the mean of a single sample **against a known population mean**. In contrast, two-sample hypothesis testing involves comparing the means of two independent samples against each other.

In a hypothesis test, the null and alternate hypotheses are stated in terms of population parameters. These hypotheses are:

- Null hypothesis (H_{0}): The value of the population parameter is equal to the hypothesized value.
- Alternate hypothesis (H_{1}): The value of the population parameter is not equal to the hypothesized value.

The hypothesis test is based on a sample from the population. This sample is used to test the hypotheses by deriving a test statistic. Finally, the value of the test statistic is compared to a critical value. The critical value depends on the alpha level, which is the likelihood of rejecting the null hypothesis when it is true.

The null and alternate hypotheses determine the direction of the test. There are two types of hypothesis tests: **one-tailed and two-tailed tests**.

A one-tailed test is conducted when the null and alternate hypotheses are stated in terms of *“greater than” or “less than”.*

For example, let’s say that a company wants to test a new advertising campaign. The null hypothesis (H_{0}) is that the new campaign will have no effect on sales. The alternate hypothesis (H_{1}) is that the new campaign will increase sales.

The null hypothesis is stated as:

H_{0}: The population mean is less than or equal to 10%.

The alternate hypothesis is stated as:

H_{1}: The population mean is greater than 10%.

The test is conducted by taking a sample of data and calculating the mean. Then, the mean is compared to the critical value. *The null hypothesis is rejected if the mean is greater than the critical value.*

A two-tailed test is conducted when the null and alternate hypotheses are stated in terms of *“not equal to”.*

Taking our advertising campaign example, the null hypothesis (H_{0}) is that the new campaign will have no effect on sales. The alternate hypothesis (H_{1}) is that the new campaign will increase or decrease sales.

The null hypothesis is stated as:

H_{0}: The population mean is equal to 10%.

The alternate hypothesis is stated as:

H_{1}: The population mean is not equal to 10%.

The test is conducted by taking a sample of data and calculating the mean. Then, the test statistic is compared to the critical values. *The null hypothesis is rejected if the test statistic falls beyond the critical value in either tail.*
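The relationship between the two kinds of test is easy to see numerically. In Python with scipy, on two simulated samples (the means, standard deviations, and seed below are arbitrary assumptions for illustration), the two one-sided p-values sum to one, and the two-sided p-value is twice the smaller of them:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(10.0, 2.0, size=50)    # hypothetical sample A
b = rng.normal(11.0, 2.0, size=50)    # hypothetical sample B

# Same test statistic, three choices of alternative hypothesis
two = stats.ttest_ind(a, b).pvalue
less = stats.ttest_ind(a, b, alternative="less").pvalue
greater = stats.ttest_ind(a, b, alternative="greater").pvalue
print(two, less, greater)
```

This is why a one-tailed test in the "right" direction halves the p-value, while one in the "wrong" direction pushes it toward 1, as some of the examples later in this article show.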

No matter which hypothesis testing method is selected, the following steps are executed:

- Firstly, identify the null hypothesis.
- Then identify the alternative hypothesis and decide whether it is of the form “not equal to” (a two-tailed test) or whether there is a specific direction for how the mean changes, “greater than” or “less than” (a one-tailed test).
- Next, calculate the test statistic. Compare the test statistic to the critical values and obtain a range for the p-value, which is the probability that the difference between the two groups is due to chance. The test is usually used with a significance level of 0.05, which means that there is a 5% chance that the difference between the two groups is due to chance. However, per recommendations from the American Statistical Association, we need to be careful when making statements like *“statistically significant”*.
- Form conclusions. If your test statistic is greater than the critical value in the table, it is significant: you can reject the null hypothesis at that level; otherwise you accept it.

For our example exercises, we will use the dataset from Open Case Studies, **“Exploring global patterns of obesity across rural and urban regions”** (Wright et al. 2020). Body mass index (BMI) is often used as a proxy for obesity, and the study observed:

“We noted a persistently higher rural BMI, especially for women.”

We will fetch the cleaned version of the data from the Open Case Studies GitHub repository, as data wrangling is out of the scope of this tutorial.

```
suppressMessages(library(tidyverse))
suppressMessages(library(DT))
suppressMessages(library(kableExtra))
dat <- readr::read_csv("https://raw.githubusercontent.com/opencasestudies/ocs-bp-rural-and-urban-obesity/master/data/wrangled/BMI_long.csv", show_col_types = FALSE)
DT::datatable(dat)
```

A z-test is used to compare two population means when the sample size is large and the population variance is known.

These are some conditions for using this type of test:

- The data must be **normally distributed**.
- All data points must be **independent**.
- For each sample, the **variances must be equal**.

Equation 1 shows the formula for a z-test. For a single sample it is:

z = (x̄ − μ) / (σ / √n)

Where x̄ is the sample mean, μ is the population mean, σ² is the population variance, and n is the number of samples. For two samples the statistic is:

z = (x̄_{1} − x̄_{2}) / √(σ_{1}²/n_{1} + σ_{2}²/n_{2})

Where x̄_{1} and x̄_{2} are the sample means, σ_{1}² and σ_{2}² are the population variances, and n_{1} and n_{2} are the number of samples.
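The two-sample z statistic can also be computed directly in Python. The data below are simulated stand-ins for the two BMI samples (the means, standard deviations, and seed are assumptions), with the standard deviations treated as known:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x1 = rng.normal(24.0, 2.0, size=300)   # stand-in for the 1985 sample
x2 = rng.normal(26.5, 2.5, size=300)   # stand-in for the 2017 sample
sig1, sig2 = 2.0, 2.5                  # assumed known population sd's

# z = (mean1 - mean2) / sqrt(sig1^2/n1 + sig2^2/n2)
se = np.sqrt(sig1**2 / len(x1) + sig2**2 / len(x2))
z = (x1.mean() - x2.mean()) / se
p = 2 * stats.norm.sf(abs(z))          # two-sided p from the normal
print(z, p)
```

With a mean shift this large relative to the standard error, the z statistic is strongly negative and the p-value effectively zero, mirroring the R output below.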

```
suppressMessages(library(PASWR2))
# Calculate population standard deviations
sig_x <- dplyr::filter(dat, Sex == "Women", Year == 1985) %>%
dplyr::pull(BMI) %>% na.omit() %>% sd()
sig_y <- dplyr::filter(dat, Sex == "Women", Year == 2017) %>%
dplyr::pull(BMI) %>% na.omit() %>% sd()
# Fetch a random sample of BMI data for women in the year 1985 and 2017
x1 <- dplyr::filter(dat, Sex == "Women", Year == 1985) %>%
dplyr::pull(BMI) %>% na.omit() %>%
sample(.,300)
x2 <- dplyr::filter(dat, Sex == "Women", Year == 2017) %>%
dplyr::pull(BMI) %>% na.omit() %>%
sample(.,300)
# Perform a two sample (unpaired) z-test between x1 and x2
(z_res <- PASWR2::z.test(x1, x2, mu = 0, sigma.x = sig_x, sigma.y = sig_y))
```

```
Two Sample z-test
data: x1 and x2
z = -11.071, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.885719 -2.017614
sample estimates:
mean of x mean of y
24.08067 26.53233
```

```
# Fetch z-test result metrics and present them in a tidy table
broom::tidy(z_res) %>%
kbl() %>%
kable_paper("hover", full_width = F)
```

estimate1 | estimate2 | statistic | p.value | conf.low | conf.high | method | alternative |
---|---|---|---|---|---|---|---|
24.08067 | 26.53233 | -11.0705 | 0 | -2.885719 | -2.017614 | Two Sample z-test | two.sided |

For the paired case the statistic is z = (d̄ − μ_{d}) / (σ_{d} / √n), where d̄ is the mean of the differences between the samples, μ_{d} is the hypothesised mean of the differences (usually zero), n is the sample size, and σ_{d}² is the population variance of the differences.
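A paired z-test sketch in Python on simulated before/after pairs (the shift of 2.4 and the σ_{d} of 0.5 are assumptions chosen for illustration, not values from the BMI data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
before = rng.normal(24.0, 2.0, size=300)          # hypothetical paired samples
after = before + rng.normal(2.4, 0.5, size=300)
sigma_d = 0.5                                     # assumed known sd of differences

# z = (mean difference - hypothesised difference) / (sigma_d / sqrt(n))
d = after - before
z = (d.mean() - 0) / (sigma_d / np.sqrt(len(d)))
p = 2 * stats.norm.sf(abs(z))
print(z, p)
```

Because pairing removes the between-subject variability, the standard error here depends only on the spread of the differences, which is why paired z statistics can be so much larger than unpaired ones.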

```
# Fetch the first 300 samples of BMI data for women in the years 1985 and 2017
x1 <- dplyr::filter(dat, Sex == "Women", Year == 1985) %>%
dplyr::pull(BMI) %>% na.omit()
x1 <- x1[1:300]
x2 <- dplyr::filter(dat, Sex == "Women", Year == 2017) %>%
dplyr::pull(BMI) %>% na.omit()
x2 <- x2[1:300]
# Perform a two sample (paired) z-test between x1 and x2
(z_res <- PASWR2::z.test(x1, x2, sigma.x = sig_x, sigma.y = sig_y, sigma.d = abs(sig_y - sig_x), paired = TRUE))
```

```
Paired z-test
data: x1 and x2
z = -393.33, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.36171 -2.33829
sample estimates:
mean of the differences
-2.35
```

```
# Fetch z-test result metrics and present them in a tidy table
broom::tidy(z_res) %>%
kbl() %>%
kable_paper("hover", full_width = F)
```

estimate | statistic | p.value | conf.low | conf.high | method | alternative |
---|---|---|---|---|---|---|
-2.35 | -393.3333 | 0 | -2.36171 | -2.33829 | Paired z-test | two.sided |

The t-test is used to determine whether two groups, or one group and a known mean, are significantly different from each other. It can be used to compare means or proportions, and it can compare two independent groups or two dependent groups. The t-test is based on the t-statistic, which for one group/sample compared against a known mean is calculated using the following formula:

t = (x̄ − μ) / (s / √n)

Where x̄ is the mean of the first group (the one sample), μ is the known mean, s is the standard deviation of the first group, and n is the number of observations in the first group.
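The one-sample t statistic can be computed by hand in Python and cross-checked against `scipy.stats.ttest_1samp`; the sample below is simulated (its mean, sd, and the hypothesised value 24.5 are assumptions echoing the R example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(25.2, 2.0, size=196)   # hypothetical sample
mu0 = 24.5                            # hypothesised mean

# t = (sample mean - mu0) / (s / sqrt(n)), on n - 1 degrees of freedom
t = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(len(x)))
p = 2 * stats.t.sf(abs(t), df=len(x) - 1)

res = stats.ttest_1samp(x, mu0)
print(t, res.statistic)               # the two computations agree
```

Note `ddof=1` in the standard deviation: the t-test uses the sample (not population) standard deviation.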

```
# Fetch the BMI data for men from rural areas in the year 2017
x <- dplyr::filter(dat, Sex == "Men", Region == "Rural", Year == 2017) %>%
dplyr::pull(BMI)
# Perform one sample t-test between x and a known mean = 24.5
(t_res <- t.test(x, mu = 24.5))
```

```
One Sample t-test
data: x
t = 3.955, df = 195, p-value = 0.000107
alternative hypothesis: true mean is not equal to 24.5
95 percent confidence interval:
24.87345 25.61635
sample estimates:
mean of x
25.2449
```

```
# Fetch t-test result metrics and present them in a tidy table
broom::tidy(t_res) %>%
kbl() %>%
kable_paper("hover", full_width = F)
```

estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
---|---|---|---|---|---|---|---|
25.2449 | 3.955031 | 0.000107 | 195 | 24.87345 | 25.61635 | One Sample t-test | two.sided |

For two independent samples with equal variances the statistic is t = (x̄_{1} − x̄_{2}) / (s_{p} √(1/n_{1} + 1/n_{2})), where the pooled standard deviation is:

s_{p} = √(((n_{1} − 1)s_{1}² + (n_{2} − 1)s_{2}²) / (n_{1} + n_{2} − 2))

Where x̄_{1} and x̄_{2} are the means from the two samples, n_{1} and n_{2} are the sample sizes, and s_{1}² and s_{2}² are the sample variances. You compare this test statistic to *t*-tables on n_{1} + n_{2} − 2 degrees of freedom.
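The pooled calculation can be verified against scipy's equal-variance t-test on simulated data (sample sizes and means below are assumptions loosely echoing the rural/urban example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x1 = rng.normal(23.6, 2.0, size=197)  # hypothetical "rural" sample
x2 = rng.normal(24.6, 2.3, size=199)  # hypothetical "urban" sample
n1, n2 = len(x1), len(x2)

# Pooled standard deviation, then the two-sample t statistic
sp = np.sqrt(((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1))
             / (n1 + n2 - 2))
t = (x1.mean() - x2.mean()) / (sp * np.sqrt(1 / n1 + 1 / n2))

# equal_var=True is the scipy analogue of R's var.equal = TRUE
res = stats.ttest_ind(x1, x2, equal_var=True)
print(t, res.statistic)               # agrees, on n1 + n2 - 2 df
```

Setting `equal_var=False` instead would give Welch's t-test, which does not pool the variances.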

```
# Fetch the BMI data for women from rural and urban areas in the year 1985
x1 <- dplyr::filter(dat, Sex == "Women", Region == "Rural", Year == 1985) %>%
dplyr::pull(BMI)
x2 <- dplyr::filter(dat, Sex == "Women", Region == "Urban", Year == 1985) %>%
dplyr::pull(BMI)
# Perform a two sample (unpaired) t-test between x1 and x2
(t_res <- t.test(x1, x2, var.equal = TRUE))
```

```
Two Sample t-test
data: x1 and x2
t = -3.8952, df = 394, p-value = 0.0001152
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.5744694 -0.5182378
sample estimates:
mean of x mean of y
23.58782 24.63417
```

We use the `var.equal = TRUE` option here to use the pooled standard deviation of Equation 6.

```
# Fetch t-test result metrics and present them in a tidy table
broom::tidy(t_res) %>%
kbl() %>%
kable_paper("hover", full_width = F)
```

estimate | estimate1 | estimate2 | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
---|---|---|---|---|---|---|---|---|---|
-1.046354 | 23.58782 | 24.63417 | -3.895234 | 0.0001152 | 394 | -1.574469 | -0.5182378 | Two Sample t-test | two.sided |

Here we will test if women in rural areas had a higher BMI than those in urban areas in 1985.

```
# Perform a one-tailed two sample (unpaired) t-test between x1 and x2
(t_res <- t.test(x1, x2, var.equal = TRUE, alternative = "greater"))
```

```
Two Sample t-test
data: x1 and x2
t = -3.8952, df = 394, p-value = 0.9999
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
-1.489242 Inf
sample estimates:
mean of x mean of y
23.58782 24.63417
```

```
# Fetch t-test result metrics and present them in a tidy table
broom::tidy(t_res) %>%
kbl() %>%
kable_paper("hover", full_width = F)
```

estimate | estimate1 | estimate2 | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
---|---|---|---|---|---|---|---|---|---|
-1.046354 | 23.58782 | 24.63417 | -3.895234 | 0.9999424 | 394 | -1.489242 | Inf | Two Sample t-test | greater |

For the paired test the statistic is t = d̄ / (s_{d} / √n), where d̄ is the mean of the differences between the samples and s_{d} is the standard deviation of the differences. You compare the *t*-statistic to the critical values in a *t*-table on n − 1 degrees of freedom.
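The paired t statistic reduces to a one-sample t-test on the differences, which is easy to confirm in Python against `scipy.stats.ttest_rel` (the paired samples below are simulated; the shift of 1.0 is an assumption):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x1 = rng.normal(23.6, 2.0, size=196)       # hypothetical paired samples
x2 = x1 + rng.normal(1.0, 0.7, size=196)

# t = mean(d) / (sd(d) / sqrt(n)), on n - 1 degrees of freedom
d = x1 - x2
t = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
p = 2 * stats.t.sf(abs(t), df=len(d) - 1)

res = stats.ttest_rel(x1, x2)
print(t, res.statistic)                    # identical results
```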

```
# Perform a two sample (paired) t-test between x1 and x2
(t_res <- t.test(x1, x2, paired = TRUE))
```

```
Paired t-test
data: x1 and x2
t = -14.095, df = 195, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-1.1870263 -0.8956268
sample estimates:
mean difference
-1.041327
```

```
# Fetch t-test result metrics and present them in a tidy table
broom::tidy(t_res) %>%
kbl() %>%
kable_paper("hover", full_width = F)
```

estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
---|---|---|---|---|---|---|---|
-1.041327 | -14.09549 | 0 | 195 | -1.187026 | -0.8956268 | Paired t-test | two.sided |

Conventionally, in a hypothesis test the sample means are compared. However, two different samples/groups can have identical means while their variances differ drastically. The statistical test used to compare variances is called the *F*-ratio test (or the variance ratio test), and it compares two variances in order to test whether they come from the same population.

Comparing two variances is useful in several cases, including:

- When you want to perform a two-sample t-test and need to check the equality of the variances of the two samples.
- When you want to compare the variability of a new measurement method to an old one. Does the new method reduce the variability of the measure?

The statistic is the ratio F = s_{1}² / s_{2}², where s_{1}² and s_{2}² are the two variances to compare. There is a table of the *F*-distribution; similarly to the *t* and Normal distributions, it is organised according to the degrees of freedom of the two variance estimates. The degrees of freedom are n_{1} − 1 (for the numerator) and n_{2} − 1 (for the denominator).
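scipy has no direct analogue of R's `var.test()`, but the F-ratio test is short to write by hand; the samples below are simulated, with the standard deviations (1.2 vs 1.0) assumed for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x1 = rng.normal(0.0, 1.2, size=197)   # hypothetical samples with
x2 = rng.normal(0.0, 1.0, size=199)   # different spreads

# F = s1^2 / s2^2 on (n1 - 1, n2 - 1) degrees of freedom
f = x1.var(ddof=1) / x2.var(ddof=1)
df1, df2 = len(x1) - 1, len(x2) - 1

# Two-sided p-value: double the smaller tail probability
p = 2 * min(stats.f.sf(f, df1, df2), stats.f.cdf(f, df1, df2))
print(f, p)
```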

`(f_res <- var.test(x1, x2))`

```
F test to compare two variances
data: x1 and x2
F = 1.3238, num df = 196, denom df = 198, p-value = 0.04957
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
1.000525 1.751931
sample estimates:
ratio of variances
1.323819
```

```
# Fetch f-test result metrics and present them in a tidy table
broom::tidy(f_res) %>%
kbl() %>%
kable_paper("hover", full_width = F)
```

`Multiple parameters; naming those columns num.df, den.df`

estimate | num.df | den.df | statistic | p.value | conf.low | conf.high | method | alternative |
---|---|---|---|---|---|---|---|---|
1.323819 | 196 | 198 | 1.323819 | 0.0495733 | 1.000525 | 1.751931 | F test to compare two variances | two.sided |

ANOVA (Analysis of Variance) is a parametric statistical method used to compare the means of two or more groups. It is often used to determine whether the differences between the means of several groups are statistically significant, and it therefore generalizes the t-test beyond two means.

At its core, ANOVA is a hypothesis testing procedure that uses an F-test (Section 4) to compare the means of two or more groups. In R, ANOVA can be performed using the `aov()` function. This function requires a formula, data, and an optional subset argument. The formula should include the response variable and the explanatory variable. The data argument should be a data frame containing the variables specified in the formula. Finally, the optional subset argument can specify a subset of cases to include in the analysis.
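For comparison, the same one-way test is available in Python as `scipy.stats.f_oneway`. The sketch below uses simulated region-like groups (the group means, spreads, and sizes are assumptions, not the BMI data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
national = rng.normal(25.2, 2.5, size=200)  # hypothetical "National" group
rural = rng.normal(24.9, 2.5, size=196)     # hypothetical "Rural" group
urban = rng.normal(25.5, 2.5, size=199)     # hypothetical "Urban" group

# One-way ANOVA: F statistic and p-value across the three groups
f_stat, p_value = stats.f_oneway(national, rural, urban)
print(f_stat, p_value)
```

As with R's `aov()`, a p-value below the chosen alpha indicates that at least one group mean differs, without saying which; a post-hoc procedure such as Tukey's HSD is then needed to locate the difference.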

In the example below, we test the difference in the average BMI for men in the year 2017 among national, rural, and urban regions.

```
# Filter the data for men and the year 2017.
a_dat <- dat %>% dplyr::filter(Sex == "Men" & Year == 2017)
DT::datatable(a_dat)
```

```
# Carry out ANOVA to measure differences in the BMI across the three regions.
a_res <- aov(BMI ~ Region, data = a_dat)
summary(a_res)
```

```
Df Sum Sq Mean Sq F value Pr(>F)
Region 2 42 21.171 3.422 0.0333 *
Residuals 592 3663 6.187
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
5 observations deleted due to missingness
```

The output of the `summary()` function displays the F-statistic and the corresponding p-value. If the p-value is less than the significance level, then we can conclude that there is a statistically significant difference between the means of the three groups.
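The same kind of one-way analysis can be sketched in Python with `scipy.stats.f_oneway`, which takes one array per group; the BMI-like numbers below are synthetic, made up purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
national = rng.normal(26, 2, 100)  # synthetic BMI-like values, one array per group
rural = rng.normal(26, 2, 100)
urban = rng.normal(28, 2, 100)     # shifted mean, so the ANOVA should reject H0

f_stat, p_value = stats.f_oneway(national, rural, urban)
print("F:", f_stat)
print("p:", p_value)
```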

Finally, we can use the `TukeyHSD()` function to perform Tukey’s Honest Significant Difference test, which compares the means of all pairs of groups and tells us which groups are significantly different from each other.

```
tshd_res <- TukeyHSD(a_res)
(tshd_res)
```

```
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = BMI ~ Region, data = a_dat)
$Region
diff lwr upr p adj
Rural-National -0.3701020 -0.95753700 0.2173329 0.3011038
Urban-National 0.2829899 -0.30220442 0.8681843 0.4921257
Urban-Rural 0.6530920 0.06492696 1.2412570 0.0251982
```
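SciPy (version 1.8 and later) offers an analogous post-hoc comparison through `scipy.stats.tukey_hsd`; a sketch on synthetic three-group data of the same shape:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
g1 = rng.normal(26, 2, 100)  # two groups with the same mean ...
g2 = rng.normal(26, 2, 100)
g3 = rng.normal(28, 2, 100)  # ... and one with a shifted mean

res = stats.tukey_hsd(g1, g2, g3)  # all pairwise comparisons, family-wise CIs
print(res)          # table of mean differences, CIs, and adjusted p-values
print(res.pvalue)   # 3x3 matrix of adjusted p-values
```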

NCD Risk Factor Collaboration (NCD-RisC). 2019. “Rising Rural Body-Mass Index Is the Main Driver of the Global Obesity Epidemic in Adults.” *Nature* 569 (7755): 260–64. https://doi.org/10.1038/s41586-019-1171-x.

Wright, Carrie, Qier Meng, Leah Jager, Margaret Taub, and Stephanie. Hicks. 2020. “Exploring Global Patterns of Obesity Across Rural and Urban Regions (Version V1.0.0).” 2020. https://github.com/opencasestudies/ocs-bp-rural-and-urban-obesity.

BibTeX citation:

```
@online{farmer2022,
author = {Rohit Farmer},
title = {Parametric Hypothesis Tests with Examples in {R}},
date = {2022-11-10},
url = {https://www.dataalltheway.com/posts/010-parametric-hypothesis-tests-r},
langid = {en}
}
```

For attribution, please cite this work as:

Rohit Farmer. 2022. “Parametric Hypothesis Tests with Examples in
R.” November 10, 2022. https://www.dataalltheway.com/posts/010-parametric-hypothesis-tests-r.

2022-11-03 Corrections made as pointed out by https://fosstodon.org/@rudolf; read this thread: https://fosstodon.org/@swatantra/109281773145979327

This post aims to lay down some rules that help name files and folders in a cross-platform way, meaning they are acceptable across Linux, macOS, and MS Windows, the three major operating systems. I have compiled these rules following international standards and recommendations from the US National Archives. Some rules mentioned here are exceptions to OS compatibility and are included from a usability point of view.

Suitable file and folder naming can be helpful in data science, especially while working on the command line or programmatically querying large sets of files. For example, not having spaces in file names makes it easier to refer to them on the command line, and having fixed delimiters in file names helps iterate over files and fetch critical information programmatically.

Use styles that are internationally recognized, easy to read by a human eye, and require fewer finger movements on the keyboard, for example, using lowercase letters that eliminate the shift key.

- Use all lowercase letters (this doesn’t apply to OS compatibility).
- Use hyphens (not underscores) to separate words. Avoid using spaces as they are difficult to work with on the command line.
- Use YYYY-MM-DD format for dates.
- When using a date at the beginning of a file name, keep the rest of the file name identical to the file created on a previous date (useful when there is more than one file for the same purpose, sectioned according to date).
- Use “.” only for file extensions.
- Use v1, v2, …, vn to denote file versions. Ideally, one should use a version control system (VCS) such as Git.
- For a stack of fewer than 100 files, number them as 00, 01, 02, …, 0n for easier sorting. Accordingly, for a stack of fewer than 1000 files, number them as 000, 001, 002, …, 00n.
- File names should NOT contain punctuation, symbols, or special characters: " / [ ] : ; | = _ , < ? > & $ # ! ' { } ( ).

```
2022-08-31-labnotebook-for-hdstim.docx
figure-01.png
figure-02.png
figure-03.png
/path/to/folder/exploring-flow
```
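As a sketch, the rules above can be encoded in a small validator; the regular expression and the `is_valid_name` helper are my own illustration, not part of any standard:

```python
import re

# Lowercase hyphen-separated words, an optional YYYY-MM-DD date prefix,
# and a single "." reserved for the file extension.
NAME_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}-)?[a-z0-9]+(-[a-z0-9]+)*(\.[a-z0-9]+)?$")

def is_valid_name(name: str) -> bool:
    """Return True if a file name follows the naming rules above."""
    return NAME_RE.fullmatch(name) is not None

print(is_valid_name("2022-08-31-labnotebook-for-hdstim.docx"))  # True
print(is_valid_name("figure-01.png"))                           # True
print(is_valid_name("My File (final).PNG"))                     # False
```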

BibTeX citation:

```
@online{farmer2022,
author = {Rohit Farmer},
title = {Rules for Naming Files and Folders That Are Cross Platform
and Helpful in Datascience},
date = {2022-11-03},
url = {https://www.dataalltheway.com/posts/009-rules-for-naming-files},
langid = {en}
}
```

For attribution, please cite this work as:

Rohit Farmer. 2022. “Rules for Naming Files and Folders That Are
Cross Platform and Helpful in Datascience.” November 3, 2022. https://www.dataalltheway.com/posts/009-rules-for-naming-files.

Singularity is a free and open-source container platform for operating-system-level virtualization. It allows you to create and run containers that package up pieces of software in a way that is portable and reproducible. You can build a container using Singularity on your laptop (preferably Linux) and then run it on your local computer or a High-Performance Computer (HPC). A Singularity container is a single file that is easy to ship to an HPC or a friend.

I heavily rely on Singularity for my work, as I write arbitrary code that runs on an HPC. I often require a specific set of libraries, compilers, or other supporting software that is hard to manage on an HPC. Also, from a reproducibility point of view, it’s easier to build a container with fixed library and software versions packaged in a single file than scattered across multiple modules on the HPC. And when the time comes for publication, I can easily share the container on Zenodo, etc.

The title of this post emphasizes data science, machine learning, and chemistry because all the software we will install is related to these disciplines. However, this procedure applies to building containers for any field of application.

- Build a Linux-based Singularity container.
- First build a writable sandbox with essential elements.
- Inspect the container.
- Install additional software.
- Convert the sandbox to a read-only SquashFS container image.

- Install software & packages from multiple sources:
  - Using the `apt-get` package management system.
  - Compiling from source code.
  - Using Python `pip`.
  - Using the `install.packages()` function in R.
- Software highlights:
  - Jupyter notebook.
  - TensorFlow GPU version.
  - OpenMPI.
  - Popular data science packages in Python and R.
  - Chemistry/chemoinformatics software: RDKit, OpenBabel, Pybel, & Mordred.
- Test the container:
  - Test the GPU version of TensorFlow.

First we will build a writable Singularity sandbox with the essential software, languages, and development libraries. To build a writable sandbox, copy the recipe below to a `container.def` text file and then execute:

`sudo singularity build --sandbox container/ container.def`

**Recipe/Definition File**

```
BootStrap: docker
From: ubuntu:bionic
%labels
APPLICATION_NAME Data Science and Chemistry
AUTHOR_NAME Rohit Farmer
AUTHOR_EMAIL rohit.farmer@gmail.com
YEAR 2021
%help
Container for data science and chemistry with packages from Python 3 & R 3.6.
It also includes CUDA and MPI for Tensorflow GPU and parallel processing respectively.
%environment
# Set system locale
PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin
RDBASE=/usr/local/share/rdkit
CUDA=/usr/local/cuda/lib64:/usr/local/cuda-10.1/lib64:/usr/local/cuda-10.2/lib64
LD_LIBRARY_PATH=/.singularity.d/libs:$RDBASE/lib:$CUDA
PYTHONPATH=modules:$RDBASE:/usr/local/share/rdkit/rdkit:/usr/local/lib/python3.6/dist-packages/
LANG=C.UTF-8 LC_ALL=C.UTF-8
%post
# Change to tmp directory to download temporary files.
cd /tmp
# Install essential software, languages and libraries.
apt-get -qq -y update
export DEBIAN_FRONTEND=noninteractive
apt-get -qq install -y --no-install-recommends tzdata apt-utils
ln -fs /usr/share/zoneinfo/America/New_York /etc/localtime
dpkg-reconfigure --frontend noninteractive tzdata
apt-get -qq -y update
apt-get -qq install -y --no-install-recommends \
autoconf \
automake \
build-essential \
bzip2 \
ca-certificates \
cmake \
gcc \
g++ \
gfortran \
git \
gnupg2 \
libtool \
libjpeg-dev \
libpng-dev \
libtiff-dev \
libatlas-base-dev \
libxml2-dev \
zlib1g-dev \
libcairo2-dev \
libeigen3-dev \
libcupti-dev \
libpcre3-dev \
libssl-dev \
libcurl4-openssl-dev \
libboost-all-dev \
libboost-dev \
libboost-system-dev \
libboost-thread-dev \
libboost-serialization-dev \
libboost-regex-dev \
libgtk2.0-dev \
libreadline-dev \
libbz2-dev \
liblzma-dev \
libpcre++-dev \
libpango1.0-dev \
libmariadb-client-lgpl-dev \
libopenblas-dev \
liblapack-dev \
libxt-dev \
neovim \
openjdk-8-jdk \
python \
python-pip \
python-dev \
python3-dev \
python3-pip \
python3-wheel \
swig \
texlive \
texlive-fonts-extra \
texinfo \
vim \
wget \
xvfb \
xauth \
xfonts-base \
zip
export LANG=C.UTF-8 LC_ALL=C.UTF-8
# Add NVIDIA package repositories.
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-repo-ubuntu1804_10.1.243-1_amd64.deb
apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
dpkg -i cuda-repo-ubuntu1804_10.1.243-1_amd64.deb
apt-get update
wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
apt-get -qq install -y --no-install-recommends ./nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
apt-get update
# Install NVIDIA driver (optional)
# apt-get install --no-install-recommends nvidia-driver-430
# Install development and runtime libraries.
apt-get install -y --no-install-recommends \
cuda-10-1 \
libcudnn7=7.6.4.38-1+cuda10.1 \
libcudnn7-dev=7.6.4.38-1+cuda10.1
# Install TensorRT. Requires that libcudnn7 is installed above.
apt-get install -y --no-install-recommends libnvinfer6=6.0.1-1+cuda10.1 \
libnvinfer-dev=6.0.1-1+cuda10.1 \
libnvinfer-plugin6=6.0.1-1+cuda10.1
# Update python pip.
python3 -m pip --no-cache-dir install --upgrade pip
python3 -m pip --no-cache-dir install setuptools --upgrade
python -m pip --no-cache-dir install setuptools --upgrade
# Install R 3.6.
echo "deb https://cloud.r-project.org/bin/linux/ubuntu bionic-cran35/" >> /etc/apt/sources.list
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
apt-get update
apt-get install -y --no-install-recommends r-base
apt-get install -y --no-install-recommends r-base-dev
# Install Jupyter notebook with Python and R support.
python3 -m pip --no-cache-dir install jupyter
R --quiet --slave -e 'install.packages(c("IRkernel"), repos="https://cloud.r-project.org/")'
# Install MPI (match the version with the cluster).
mkdir -p /tmp/mpi
cd /tmp/mpi
wget -O openmpi-2.1.0.tar.bz2 https://www.open-mpi.org/software/ompi/v2.1/downloads/openmpi-2.1.0.tar.bz2
tar -xjf openmpi-2.1.0.tar.bz2
cd openmpi-2.1.0
./configure --prefix=/usr/local --with-cuda
make -j $(nproc)
make install
ldconfig
# Cleanup
apt-get -qq clean
rm -rf /var/lib/apt/lists/*
rm -rf /tmp/mpi
```

To get a list of the labels defined for the container: `singularity inspect --labels container/`

To print the container’s help section: `singularity inspect --helpfile container/`

To show the container’s environment: `singularity inspect --environment container/`

To retrieve the definition file used to build the container: `singularity inspect --deffile container/`

Once the core writable sandbox is built, we will install the additional data science and chemistry packages. To do that, execute:

`sudo singularity shell --writable container/`

Then execute the following lines in the shell environment.

```
# Install Python packages.
python3 -m pip --no-cache-dir install numpy pandas h5py pyarrow sklearn statsmodels matplotlib seaborn plotly
# Install Tensorflow.
python3 -m pip --no-cache-dir install tensorflow==2.2.0
# Install R packages.
R --quiet --slave -e 'install.packages("tidyverse", version = "1.3.0", repos="https://cloud.r-project.org/")'
R --quiet --slave -e 'install.packages("tidymodels", version = "0.1.0", repos="https://cloud.r-project.org/")'
R --quiet --slave -e 'install.packages(c("lme4", "glmnet", "yaml", "jsonlite", "rlang"), repos="https://cloud.r-project.org/")'
# Install RDKit
export RDBASE=/usr/local/share/rdkit
export LD_LIBRARY_PATH="$RDBASE/lib:$LD_LIBRARY_PATH"
export PYTHONPATH="$RDBASE:$PYTHONPATH"
mkdir -p /tmp/rdkit
cd /tmp/rdkit
wget https://github.com/rdkit/rdkit/archive/2020_03_3.tar.gz
tar zxf 2020_03_3.tar.gz
mv rdkit-2020_03_3 $RDBASE
mkdir $RDBASE/build
cd $RDBASE/build
cmake -DPYTHON_EXECUTABLE=/usr/bin/python3 ..
make -j $(nproc)
make install
ln -s /usr/local/share/rdkit/rdkit /usr/local/lib/python3.6/dist-packages/
# Install OpenBabel.
apt-get -qq -y update
apt-get -qq install -y --no-install-recommends openbabel python-openbabel
# Install Mordred Molecular Descriptor Calculator.
python3 -m pip --no-cache-dir install mordred
# Cleanup
rm -rf /tmp/rdkit
```

Once you are satisfied that you have installed all the required packages, you can convert the writable sandbox to a read-only SquashFS file system. SquashFS is a compressed read-only file system for Linux.

`sudo singularity build container.sif container/`

Kernel specs are installed from outside the container in the host’s home environment.

`singularity exec container.sif R --quiet --slave -e 'IRkernel::installspec()'`

NOTE: You only have to do this once per host to install the `kernelspec`.

```
import tensorflow as tf
tf.debugging.set_log_device_placement(True)
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    with tf.device('/GPU:0'):
        tf.random.set_seed(123)
        a = tf.random.normal([10000, 20000], 0, 1, tf.float32, seed=1)
        b = tf.random.normal([20000, 10000], 0, 1, tf.float32, seed=1)
        c = tf.matmul(a, b)
        print(c)
else:
    print("No GPUs found.")
print("Num GPUs:", len(gpus))
```

To execute the script: `singularity exec --nv container.sif python3 tf_gpu.py`

To monitor NVIDIA GPU usage: `nvidia-smi`

BibTeX citation:

```
@online{farmer2022,
author = {Rohit Farmer},
title = {How to Build a {Singularity} Container for Machine Learning,
Data Science, and Chemistry},
date = {2022-10-31},
url = {https://www.dataalltheway.com/posts/008-build-a-singularity-container},
langid = {en}
}
```

For attribution, please cite this work as:

Rohit Farmer. 2022. “How to Build a Singularity Container for
Machine Learning, Data Science, and Chemistry.” October 31, 2022.
https://www.dataalltheway.com/posts/008-build-a-singularity-container.

2022-10-26 Added sections for Kaggle and other platforms

2022-10-25 Initial incomplete post and Kaggle notebook

R and Python, the two most popular data science and machine learning programming languages, come with default datasets for demonstration and educational purposes. Moreover, many popular data science libraries, such as tidyverse, lme4, nlme, MASS, survival, Bioconductor, and sklearn, among others, also contain example datasets for unit testing and demonstration. However, even though the included data are carefully selected, most of them are old (from the 1970s and ’80s) and hence limited in sample size (due to the limitation in the then-available computing power), and may not be more than a toy example in the current context. Therefore, in a learning path, after utilizing the default datasets for initial testing of a statistical function or a machine learning model, it is both desirable and recommended to practice on a real-life dataset. Working on a real-life dataset that is apt in the current context of data science and computing resources would not only teach the desired statistical/machine learning technique but also expose the learner to challenges that are usually not associated with toy datasets: for example, class imbalance, missing values, wrongly labeled datatypes, statistical noise, etc.

For a real-life dataset, I recommend using open data released by government agencies and other non-government organizations as part of their openness in operations. These datasets are not only of adequate size but also represent what is happening around us in a field of interest. Many countries around the world release data to inform citizens about their operations and fair practices. However, I do not think that, apart from established data analytics companies and research groups, ordinary citizens ever look into such resources. I bet many would not even know that their government agencies make ample amounts of data available for scrutiny. Therefore, in my opinion, budding data scientists should take it upon themselves to utilize these datasets in place of toy examples, not only to justify the existence of such a resource but also, through analysis, to gain first-hand insight and transfer it to their friends, family, and a broader audience. Such a practice by amateur data scientists may have implications far beyond what I can convey here.

In the sections below, for completeness, I will briefly discuss how to access default datasets available in R and then move on to other resources, including open government data mentioned earlier.

```
suppressMessages(library(DT))
suppressMessages(library(tidyverse))
suppressMessages(library(kableExtra))
```

In R (v4.1.3), there are 104 datasets for various statistical and machine-learning tasks. The commands in the cell below list all the datasets available by default (Table 1) and across all the installed packages, respectively. This article summarizes some of R’s popular datasets, namely mtcars, iris, etc.

```
# Default datasets
data()
# Datasets across all the installed packages
data(package = .packages(all.available = TRUE))
```

```
dat <- data()
dat <- as_tibble(dat$results) %>% dplyr::select(-LibPath) %>%
dplyr::filter(Package == "datasets")
knitr::kable(dat) %>%
kable_styling(bootstrap_options = c("striped", "hover")) %>%
scroll_box(width = "100%", height = "300px")
```

Package | Item | Title |
---|---|---|
datasets | AirPassengers | Monthly Airline Passenger Numbers 1949-1960 |
datasets | BJsales | Sales Data with Leading Indicator |
datasets | BJsales.lead (BJsales) | Sales Data with Leading Indicator |
datasets | BOD | Biochemical Oxygen Demand |
datasets | CO2 | Carbon Dioxide Uptake in Grass Plants |
datasets | ChickWeight | Weight versus age of chicks on different diets |
datasets | DNase | Elisa assay of DNase |
datasets | EuStockMarkets | Daily Closing Prices of Major European Stock Indices, 1991-1998 |
datasets | Formaldehyde | Determination of Formaldehyde |
datasets | HairEyeColor | Hair and Eye Color of Statistics Students |
datasets | Harman23.cor | Harman Example 2.3 |
datasets | Harman74.cor | Harman Example 7.4 |
datasets | Indometh | Pharmacokinetics of Indomethacin |
datasets | InsectSprays | Effectiveness of Insect Sprays |
datasets | JohnsonJohnson | Quarterly Earnings per Johnson & Johnson Share |
datasets | LakeHuron | Level of Lake Huron 1875-1972 |
datasets | LifeCycleSavings | Intercountry Life-Cycle Savings Data |
datasets | Loblolly | Growth of Loblolly pine trees |
datasets | Nile | Flow of the River Nile |
datasets | Orange | Growth of Orange Trees |
datasets | OrchardSprays | Potency of Orchard Sprays |
datasets | PlantGrowth | Results from an Experiment on Plant Growth |
datasets | Puromycin | Reaction Velocity of an Enzymatic Reaction |
datasets | Seatbelts | Road Casualties in Great Britain 1969-84 |
datasets | Theoph | Pharmacokinetics of Theophylline |
datasets | Titanic | Survival of passengers on the Titanic |
datasets | ToothGrowth | The Effect of Vitamin C on Tooth Growth in Guinea Pigs |
datasets | UCBAdmissions | Student Admissions at UC Berkeley |
datasets | UKDriverDeaths | Road Casualties in Great Britain 1969-84 |
datasets | UKgas | UK Quarterly Gas Consumption |
datasets | USAccDeaths | Accidental Deaths in the US 1973-1978 |
datasets | USArrests | Violent Crime Rates by US State |
datasets | USJudgeRatings | Lawyers' Ratings of State Judges in the US Superior Court |
datasets | USPersonalExpenditure | Personal Expenditure Data |
datasets | UScitiesD | Distances Between European Cities and Between US Cities |
datasets | VADeaths | Death Rates in Virginia (1940) |
datasets | WWWusage | Internet Usage per Minute |
datasets | WorldPhones | The World's Telephones |
datasets | ability.cov | Ability and Intelligence Tests |
datasets | airmiles | Passenger Miles on Commercial US Airlines, 1937-1960 |
datasets | airquality | New York Air Quality Measurements |
datasets | anscombe | Anscombe's Quartet of 'Identical' Simple Linear Regressions |
datasets | attenu | The Joyner-Boore Attenuation Data |
datasets | attitude | The Chatterjee-Price Attitude Data |
datasets | austres | Quarterly Time Series of the Number of Australian Residents |
datasets | beaver1 (beavers) | Body Temperature Series of Two Beavers |
datasets | beaver2 (beavers) | Body Temperature Series of Two Beavers |
datasets | cars | Speed and Stopping Distances of Cars |
datasets | chickwts | Chicken Weights by Feed Type |
datasets | co2 | Mauna Loa Atmospheric CO2 Concentration |
datasets | crimtab | Student's 3000 Criminals Data |
datasets | discoveries | Yearly Numbers of Important Discoveries |
datasets | esoph | Smoking, Alcohol and (O)esophageal Cancer |
datasets | euro | Conversion Rates of Euro Currencies |
datasets | euro.cross (euro) | Conversion Rates of Euro Currencies |
datasets | eurodist | Distances Between European Cities and Between US Cities |
datasets | faithful | Old Faithful Geyser Data |
datasets | fdeaths (UKLungDeaths) | Monthly Deaths from Lung Diseases in the UK |
datasets | freeny | Freeny's Revenue Data |
datasets | freeny.x (freeny) | Freeny's Revenue Data |
datasets | freeny.y (freeny) | Freeny's Revenue Data |
datasets | infert | Infertility after Spontaneous and Induced Abortion |
datasets | iris | Edgar Anderson's Iris Data |
datasets | iris3 | Edgar Anderson's Iris Data |
datasets | islands | Areas of the World's Major Landmasses |
datasets | ldeaths (UKLungDeaths) | Monthly Deaths from Lung Diseases in the UK |
datasets | lh | Luteinizing Hormone in Blood Samples |
datasets | longley | Longley's Economic Regression Data |
datasets | lynx | Annual Canadian Lynx trappings 1821-1934 |
datasets | mdeaths (UKLungDeaths) | Monthly Deaths from Lung Diseases in the UK |
datasets | morley | Michelson Speed of Light Data |
datasets | mtcars | Motor Trend Car Road Tests |
datasets | nhtemp | Average Yearly Temperatures in New Haven |
datasets | nottem | Average Monthly Temperatures at Nottingham, 1920-1939 |
datasets | npk | Classical N, P, K Factorial Experiment |
datasets | occupationalStatus | Occupational Status of Fathers and their Sons |
datasets | precip | Annual Precipitation in US Cities |
datasets | presidents | Quarterly Approval Ratings of US Presidents |
datasets | pressure | Vapor Pressure of Mercury as a Function of Temperature |
datasets | quakes | Locations of Earthquakes off Fiji |
datasets | randu | Random Numbers from Congruential Generator RANDU |
datasets | rivers | Lengths of Major North American Rivers |
datasets | rock | Measurements on Petroleum Rock Samples |
datasets | sleep | Student's Sleep Data |
datasets | stack.loss (stackloss) | Brownlee's Stack Loss Plant Data |
datasets | stack.x (stackloss) | Brownlee's Stack Loss Plant Data |
datasets | stackloss | Brownlee's Stack Loss Plant Data |
datasets | state.abb (state) | US State Facts and Figures |
datasets | state.area (state) | US State Facts and Figures |
datasets | state.center (state) | US State Facts and Figures |
datasets | state.division (state) | US State Facts and Figures |
datasets | state.name (state) | US State Facts and Figures |
datasets | state.region (state) | US State Facts and Figures |
datasets | state.x77 (state) | US State Facts and Figures |
datasets | sunspot.month | Monthly Sunspot Data, from 1749 to "Present" |
datasets | sunspot.year | Yearly Sunspot Data, 1700-1988 |
datasets | sunspots | Monthly Sunspot Numbers, 1749-1983 |
datasets | swiss | Swiss Fertility and Socioeconomic Indicators (1888) Data |
datasets | treering | Yearly Treering Data, -6000-1979 |
datasets | trees | Diameter, Height and Volume for Black Cherry Trees |
datasets | uspop | Populations Recorded by the US Census |
datasets | volcano | Topographic Information on Auckland's Maunga Whau Volcano |
datasets | warpbreaks | The Number of Breaks in Yarn during Weaving |
datasets | women | Average Heights and Weights for American Women |
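Python has no single built-in registry comparable to R’s `data()`, but, for example, scikit-learn bundles a handful of small datasets behind `load_*` functions; a quick sketch (assumes scikit-learn is installed):

```python
from sklearn import datasets

# List the loaders for the small datasets bundled with scikit-learn
loaders = [name for name in dir(datasets) if name.startswith("load_")]
print(loaders)

iris = datasets.load_iris()  # Edgar Anderson's iris data, as in R
print(iris.data.shape)       # (150, 4)
```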

As I mentioned in the introduction, many governments worldwide release data for transparency and accountability; for example, https://data.gov, the US federal government’s open data site. Data.gov also maintains a list of websites at https://data.gov/open-gov/ pointing to data repositories related to US cities and counties, US states, and international countries and regions. The primary aim of these repositories is to publish information online as open data using standardized, machine-readable data formats with their metadata.

Depending upon the type of data requested, most of the data can be downloaded in multiple machine-readable formats, either via the export option on the website or programmatically through APIs (see Section 2.4). For example, for tabular data, popular formats include CSV, XML, RDF, and JSON.

Interactive and exportable tables below show the list of websites at https://data.gov/open-gov/.

```
open_gov <- read.csv("https://data.gov/datagov/wordpress/2019/09/opendatasites91819.csv", header = FALSE)
colnames(open_gov) <- c("Item", "Website", "Type")
cat("Total number of entries: ", nrow(open_gov))
```

`Total number of entries: 314`

```
city_county <- dplyr::filter(open_gov, Type == "US City or County")
DT::datatable(city_county, options = list(pageLength = 5))
```

```
us_state <- dplyr::filter(open_gov, Type %in% c("US State", "Other State Related"))
DT::datatable(us_state, options = list(pageLength = 5))
```

```
int_count <- dplyr::filter(open_gov, Type %in% c("International Country", "International Regional"))
DT::datatable(int_count, options = list(pageLength = 5))
```

Since I live and work in Maryland, I want to see how wages in Maryland and its counties have changed over time. I also want to test whether Montgomery County (where I live) has different wages compared to Frederick, Howard, and Prince George’s counties, which border Montgomery on the north, east, and south sides. Therefore, in this example, I will fetch the Maryland Average Wage Per Job (Current Dollars): 2010-2020 data via the API using the RSocrata library in R and carry out some analysis.

In Table 2, each row holds the average wage for one year (2010-2020) for Maryland and each of its counties (columns), and Figure 1 shows the same data as a line graph depicting the change in wages (y-axis) over time (x-axis).

Table 3 lists the results of unpaired two-sample t-tests between wages from Montgomery and Frederick, Howard, and Prince George’s counties. As the t-test results show, wages differ between Montgomery and Frederick, Howard, and Prince George’s counties, with Montgomery County residents earning more than residents of all three bordering counties.

```
library(RSocrata)
# Fetch the data using the API endpoint
maw <- read.socrata("https://opendata.maryland.gov/resource/mk5a-nf44.json")
knitr::kable(dplyr::select(maw, -date_created)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
scroll_box(width = "100%", height = "400px")
```

year | maryland | allegany_county | anne_arundel_county | baltimore_city | baltimore_county | calvert_county | caroline_county | carroll_county | cecil_county | charles_county | dorchester_county | frederick_county | garrett_county | harford_county | howard_county | kent_county | montgomery_county | prince_george_s_county | queen_anne_s_county | somerset_county | st_mary_s_county | talbot_county | washington_county | wicomico_county | worcester_county |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2010 | 53096 | 35771 | 56745 | 55640 | 49986 | 42726 | 34616 | 38027 | 42027 | 41290 | 35489 | 48018 | 31591 | 46741 | 58130 | 36334 | 65178 | 51808 | 36018 | 38228 | 60032 | 37845 | 38228 | 38472 | 30799 |
2011 | 54517 | 36677 | 58011 | 57027 | 50914 | 43431 | 35981 | 39039 | 42465 | 42200 | 35718 | 48794 | 32484 | 48558 | 60448 | 36815 | 67247 | 52844 | 36437 | 39652 | 63057 | 38462 | 39420 | 38915 | 31438 |
2012 | 55466 | 36983 | 58706 | 58876 | 51722 | 44239 | 37506 | 39919 | 43260 | 42888 | 37172 | 49972 | 32506 | 49772 | 62371 | 36622 | 68159 | 53292 | 37258 | 39853 | 63698 | 39807 | 39564 | 39066 | 31641 |
2013 | 55555 | 37827 | 59384 | 59318 | 51778 | 44126 | 38404 | 40736 | 44214 | 42909 | 37773 | 49570 | 33477 | 49624 | 62271 | 37572 | 67437 | 53441 | 36848 | 40744 | 63501 | 39901 | 40032 | 39714 | 32384 |
2014 | 56924 | 38449 | 60551 | 61112 | 52961 | 45162 | 39383 | 41607 | 45051 | 44260 | 39094 | 50747 | 34195 | 50205 | 64784 | 38411 | 68731 | 54985 | 37932 | 41802 | 64691 | 40118 | 41018 | 40863 | 33635 |
2015 | 58729 | 39888 | 62195 | 63389 | 54248 | 48825 | 41043 | 43325 | 46776 | 44919 | 40022 | 51510 | 35067 | 52418 | 66677 | 38741 | 71480 | 56456 | 38970 | 43397 | 65497 | 41313 | 42270 | 42599 | 34524 |
2016 | 59710 | 40708 | 63147 | 64481 | 55159 | 53657 | 40832 | 43815 | 47300 | 46958 | 40431 | 51630 | 34925 | 52862 | 67621 | 39504 | 72904 | 57251 | 39941 | 43575 | 65937 | 41740 | 42725 | 43875 | 35260 |
2017 | 61298 | 42143 | 64629 | 66365 | 56887 | 55922 | 42034 | 45576 | 48662 | 47673 | 41711 | 52270 | 35971 | 53775 | 68958 | 40446 | 74709 | 58829 | 42099 | 45988 | 67622 | 43105 | 44039 | 45491 | 35802 |
2018 | 62836 | 43197 | 66458 | 67005 | 58793 | 53557 | 43190 | 45690 | 49981 | 48225 | 41987 | 53624 | 37575 | 54921 | 71300 | 42422 | 76867 | 60383 | 43582 | 45381 | 68887 | 44670 | 45846 | 45567 | 37231 |
2019 | 64690 | 44692 | 68586 | 69930 | 60116 | 51598 | 45190 | 47189 | 52177 | 49193 | 43271 | 55621 | 38290 | 57349 | 74136 | 42575 | 78386 | 62096 | 44011 | 49234 | 70807 | 45115 | 46965 | 46620 | 38234 |
2020 | 70446 | 48294 | 74533 | 74483 | 65743 | 55903 | 49336 | 51470 | 55854 | 53404 | 47182 | 60646 | 40690 | 62395 | 82780 | 45891 | 86138 | 66777 | 48385 | 53880 | 77490 | 48338 | 50743 | 50556 | 41605 |

```
maw_gather <- maw %>% dplyr::select(-date_created) %>%
gather(key = "county", value = "wage", -year ) %>% as_tibble()
ggplot(maw_gather, aes(x = year, y = as.numeric(wage), color = county)) +
geom_line(aes(group = county)) +
labs(x = "Year", y = "Wage", color = "") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
```

```
mft <- broom::tidy(t.test(as.numeric(maw$montgomery_county),
as.numeric(maw$frederick_county))) %>%
dplyr::mutate("test" = "Montgomery vs. Frederick")
mht <- broom::tidy(t.test(as.numeric(maw$montgomery_county),
as.numeric(maw$howard_county))) %>%
dplyr::mutate("test" = "Montgomery vs. Howard")
mpgt <- broom::tidy(t.test(as.numeric(maw$montgomery_county),
as.numeric(maw$prince_george_s_county))) %>%
dplyr::mutate("test" = "Montgomery vs. Prince George's")
all_t <- dplyr::bind_rows(mft, mht, mpgt) %>%
dplyr::select(all_of(c("test", "estimate", "estimate1",
"estimate2", "statistic", "p.value")))
knitr::kable(all_t) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```

test | estimate | estimate1 | estimate2 | statistic | p.value |
---|---|---|---|---|---|
Montgomery vs. Frederick | 20439.455 | 72476 | 52036.55 | 9.452393 | 0.0000001 |
Montgomery vs. Howard | 5250.909 | 72476 | 67225.09 | 1.858408 | 0.0781124 |
Montgomery vs. Prince George's | 15370.364 | 72476 | 57105.64 | 6.597893 | 0.0000030 |
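The same kind of pairwise comparison can be sketched in Python with SciPy. The wage values below are illustrative, not the real Maryland series; note that R's `t.test()` defaults to the Welch test, so `equal_var=False` mirrors it.

```python
import numpy as np
from scipy import stats

# Hypothetical annual wage series for two counties (illustrative values only,
# not the real Maryland data tabulated above).
county_a = np.array([60000.0, 62000.0, 64000.0, 70000.0])
county_b = np.array([42000.0, 43000.0, 44500.0, 48000.0])

# R's t.test() defaults to Welch's unequal-variance test,
# so equal_var=False matches its behavior.
t_stat, p_value = stats.ttest_ind(county_a, county_b, equal_var=False)
print("t-statistic:", t_stat)
print("p-value:", p_value)
```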

Kaggle is a website that hosts data science and machine learning competitions. Users can compete to win prizes, and the site also has a public dataset repository. In addition to hosting competitions and disseminating public datasets, Kaggle hosts tutorials and Jupyter-like notebook environments for Python and R runtimes. Users can clone an existing notebook on Kaggle, create one from scratch, or upload one from their local computer, and provision free hardware resources, including GPUs, to execute their notebooks.

While uploading public datasets, users can also generate a DOI to give their dataset a permanent identifier on the internet. For example, I am distributing the “Classify the bitter or sweet taste of compounds” (Rohit Farmer 2022a) and “Tweets from heads of governments and states” (Rohit Farmer 2022b) datasets on Kaggle for classification and natural language processing analysis, respectively.

Datasets on Kaggle can be downloaded using a web browser or via its API, which also interacts with its competition environment. Users can find datasets for a wide range of statistical and machine learning tasks, available in formats including CSV, MS Excel, images, SQLite databases, pandas pickles, etc.

In contrast to Kaggle (Section 3), which is mainly a competition-hosting and data science learning community, Zenodo, Figshare, and Dryad are data hosting platforms primarily used by scientists and researchers to serve data, manuscripts, and articles. The data hosted on these three platforms usually either do not qualify for a domain-specific data repository or are too large to attach as supplementary material to an article. They also aim to serve as a long-term archiving solution; therefore, even if the data or manuscript is published elsewhere under the correct licensing terms, it can be redistributed here.

Zenodo is a digital repository that allows users to upload and share digital content such as datasets, software, and research papers, and is designed to preserve and provide long-term access to that content. It is developed and operated by OpenAIRE, a European consortium that promotes open access to research. A unique feature of Zenodo is its ability to archive releases from a GitHub repository and assign them a DOI, making it easier to cite a GitHub repository, for example (Farmer 2022b). Another excellent feature is the ability to create communities to organize data and manuscripts of a similar kind. Anyone can create a community, and the owner can accept or reject requests for items to be indexed in it, for example, https://zenodo.org/communities/data-all-the-way/.

Like Zenodo, Figshare is a web-based platform for storing, sharing, and managing research data. Figshare also provides a unique identifier (DOI) for each data set, which can be used to cite the data set in publications. Data can be stored privately or publicly, and Figshare provides tools for managing data access and permissions. Data scientists and machine learning engineers often use Figshare to share data sets and models with collaborators or the public.

Dryad is also a digital repository primarily used for scientific and medical data associated with a publication in a scholarly journal. Dryad makes data discoverable, usable, and citable by integrating it into the scholarly publication process.

Zenodo, Figshare, and Dryad allow anyone to upload data (downloading is always free and open); however, limits may apply to the uploaded file size or to private vs. public data status. Additionally, many educational and research institutions partner with one or more of these platforms, providing enhanced options.

Farmer, Rohit. 2022a. “A Case for Using Google Colab Notebooks as an Alternative to Web Servers for Scientific Software,” October. https://doi.org/10.5281/ZENODO.7232108.

———. 2022b. *ColabHDStIM: A Google Colab Interface to HDStIM (High Dimensional Stimulation Immune Mapping)*. Zenodo. https://doi.org/10.5281/ZENODO.7231731.

Rohit Farmer. 2022a. “Classify the Bitter or Sweet Taste of Compounds.” Kaggle. https://doi.org/10.34740/KAGGLE/DSV/4234193.

———. 2022b. “Tweets from Heads of Governments and States.” Kaggle. https://doi.org/10.34740/KAGGLE/DSV/4208877.

BibTeX citation:

```
@online{farmer2022,
author = {Rohit Farmer},
title = {Sources of Open Data for Statistics, Data Science, and
Machine Learning},
date = {2022-10-25},
url = {https://www.dataalltheway.com/posts/007-open-data-for-datascience},
langid = {en}
}
```

For attribution, please cite this work as:

Rohit Farmer. 2022. “Sources of Open Data for Statistics, Data
Science, and Machine Learning.” October 25, 2022. https://www.dataalltheway.com/posts/007-open-data-for-datascience.

2022-10-18 Typo correction and included a list of links to learn more about Google Colab.

2022-10-20 The title and description changed. A PDF version of the article is uploaded to Zenodo at https://doi.org/10.5281/zenodo.7232109

I recently came across ColabFold (Mirdita et al. 2022), a slimmer and faster implementation of AlphaFold2 (Jumper et al. 2021) (the famous protein structure prediction software from DeepMind) implemented on Google Colab in the form of a Jupyter notebook, giving it an easy-to-use web server-like interface. I found this idea intriguing as it removes the overhead of maintaining a webserver while providing a web-based graphical user interface.

Google Colab is a free (with options for pro subscriptions) Jupyter notebook environment for Python (and, indirectly, R) provided by Google that runs on otherwise idle Google servers. This free resource also includes access to GPUs and TPUs, making it attractive for various machine learning and data science tasks. For the most part, Google Colab is used in machine learning and data science education. However, following the example of ColabFold and my implementation of ColabHDStIM, I want to make a case that it can also be used to provide an easy-to-use interface or live demo for scientific software without maintaining the complex infrastructure of a web server.

Coming from a bioinformatics/computational biology background, I know there is a craze for developing web servers worldwide. However, although many web servers are created yearly, many groups, especially in developing countries, lack the resources to build one. On the flip side, many of these initially well-funded web servers are of low quality, are not kept updated, or go offline soon after publication, squandering the resources invested in them (Veretnik, Fink, and Bourne 2008; Schultheiss et al. 2011; Kern, Fehlmann, and Keller 2020). Therefore, there is a need for an alternative through which scientists can distribute their software with an easy-to-use interface such as interactive notebooks. Even if notebook environments are limited in executing production-scale software, they can still provide a live demo on a minimal dataset, which in my opinion is better than the vignettes that accompany software.

Below are some pros and cons of using Google Colab.

**Pros**

- Easy to implement
- Free hardware resources from Google, including GPU and TPU
- Option to buy more resources from Google as per need
- While hosted on Google’s server, the same notebook can be executed using a local runtime to take advantage of local hardware resources.
- Forkable and hackable if the original maintainer stops the development.

**Cons**

- Free hardware resources can be limiting
- Uploading and downloading data to a Colab is slow and requires a workaround
- All the instances are transient; therefore, on every restart, all the required software is re-installed, which takes time.
- Colab notebooks are meant to run interactively; therefore, maintaining a long background session is hard or impossible.
- Colab primarily supports Python and requires workarounds to support other languages.

**Learn more about Google Colab**

- Google Colab frequently asked questions
- Welcome to Colab!
- Practical introduction to Google Colab for data science (YouTube video)

**References**

Jumper, John, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, et al. 2021. “Highly Accurate Protein Structure Prediction with AlphaFold.” *Nature* 596 (7873): 583–89. https://doi.org/10.1038/s41586-021-03819-2.

Kern, Fabian, Tobias Fehlmann, and Andreas Keller. 2020. “On the Lifetime of Bioinformatics Web Services.” *Nucleic Acids Research* 48 (22): 12523–33. https://doi.org/10.1093/nar/gkaa1125.

Mirdita, Milot, Konstantin Schütze, Yoshitaka Moriwaki, Lim Heo, Sergey Ovchinnikov, and Martin Steinegger. 2022. “ColabFold: Making Protein Folding Accessible to All.” *Nature Methods* 19 (6): 679–82. https://doi.org/10.1038/s41592-022-01488-1.

Schultheiss, Sebastian J., Marc-Christian Münch, Gergana D. Andreeva, and Gunnar Rätsch. 2011. “Persistence and Availability of Web Services in Computational Biology.” Edited by Dongxiao Zhu. *PLoS ONE* 6 (9): e24914. https://doi.org/10.1371/journal.pone.0024914.

Veretnik, Stella, J. Lynn Fink, and Philip E. Bourne. 2008. “Computational Biology Resources Lack Persistence and Usability.” Edited by Barbara Bryant. *PLoS Computational Biology* 4 (7): e1000136. https://doi.org/10.1371/journal.pcbi.1000136.

BibTeX citation:

```
@misc{farmer2022,
author = {Rohit Farmer},
publisher = {Zenodo},
title = {A Case for Using {Google} {Colab} Notebooks as an Alternative
to Web Servers for Scientific Software},
date = {2022-10-17},
url = {https://doi.org/10.5281/zenodo.7232109},
doi = {10.5281/zenodo.7232109},
langid = {en}
}
```

For attribution, please cite this work as:

Rohit Farmer. 2022. “A Case for Using Google Colab Notebooks as an
Alternative to Web Servers for Scientific Software.” Zenodo. https://doi.org/10.5281/zenodo.7232109.

This post is an identical copy of “About Dataset” at Kaggle: https://www.kaggle.com/dsv/4234193

Throughout human evolution, we have been drawn toward sweet-tasting foods and averse to bitter ones: sweet signals something good or desirable, while bitter signals something undesirable, such as ear wax or medicine. Therefore, a better understanding of the molecular features that determine the bitter-sweet taste of substances is crucial for identifying natural and synthetic compounds for various purposes.

This dataset is adapted from https://github.com/cosylabiiit/bittersweet, https://www.nature.com/articles/s41598-019-43664-y. In chemoinformatics, molecules are often represented as compact SMILES strings. In this dataset, SMILES structures, along with their names and targets (bitter, sweet, tasteless, and non-bitter), were obtained from the original study. The SMILES were then converted into canonical SMILES using RDKit, and the features (2D and 3D molecular descriptors) were calculated using Mordred. Next, the tasteless and non-bitter categories were merged into a single non-bitter-sweet category. Finally, since many compounds were missing names, IUPAC names were fetched using PubChemPy for all compounds, and a generic compound + incrementor name was assigned to those still missing one.

This is a classification dataset: the first three columns carry names, SMILES, and canonical SMILES, any of which can be used to refer to a molecule. The fourth column is the target (taste category), and the numeric features run from the fifth column to the end of the file. Many feature cells contain string annotations due to errors produced by Mordred. Therefore, the following data science techniques can be learned while working on this dataset:

- Data cleanup
- Features selection (since the number of features is quite large in proportion to the data points)
- Feature scaling/transformation/normalization
- Dimensionality reduction
- Binomial classification (bitter vs. sweet) - utilize non-bitter-sweet as a negative class.
- Multinomial classification (bitter vs. sweet vs. non-bitter-sweet)
- Since SMILES can be converted into molecular graphs, graph-based modeling should also be possible.
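As a minimal pandas sketch of the cleanup step implied above (the column names and values here are hypothetical miniatures, not the real file), the stray string annotations in the feature columns can be coerced to NaN:

```python
import io
import pandas as pd

# Hypothetical two-row miniature of the layout described above: three
# identifier columns, a target column, then numeric feature columns.
tsv = io.StringIO(
    "name\tsmiles\tcanonical_smiles\ttarget\tfeat1\tfeat2\n"
    "compound_1\tCCO\tCCO\tsweet\t1.2\terror\n"
    "compound_2\tCCN\tCCN\tbitter\t3.4\t5.6\n"
)
df = pd.read_csv(tsv, sep="\t")

# Coerce string annotations in the feature block to NaN for later imputation.
X = df.iloc[:, 4:].apply(pd.to_numeric, errors="coerce")
y = df["target"]
```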

A copy of the original dataset and the scripts and notebooks used to convert SMILES to canonical SMILES, generate features, fetch names, and export the final TSV file for Kaggle is loosely maintained at https://github.com/rohitfarmer/bittersweet.

BibTeX citation:

```
@dataset{farmer2022,
author = {Rohit Farmer},
publisher = {Kaggle},
title = {Classify the Bitter or Sweet Taste of Compounds},
date = {2022-10-15},
url = {https://www.kaggle.com/dsv/4234193},
doi = {10.34740/KAGGLE/DSV/4234193},
langid = {en}
}
```

For attribution, please cite this work as:

Rohit Farmer. 2022. “Classify the Bitter or Sweet Taste of
Compounds.” Kaggle. https://doi.org/10.34740/KAGGLE/DSV/4234193.

*Note: This tutorial is written for Linux based systems.*

To install the latest version of R please follow the download and install instructions at https://cloud.r-project.org/

Neovim (nvim) is a continuation and extension of the Vim editor that aims to keep the good parts of Vim and add more features. In this tutorial I will be using Neovim (nvim); however, most of the steps apply equally to Vim. Please follow the download and installation instructions on nvim’s GitHub wiki https://github.com/neovim/neovim/wiki/Installing-Neovim.

**OR**

Vim usually comes pre-installed on most Linux-based operating systems. However, it may not be the latest version. To install the latest version, download and build it from Vim’s GitHub repository as shown below, or use whichever method you find more comfortable.

```
git clone https://github.com/vim/vim.git
make -C vim/
sudo make install -C vim/
```

There is more than one plugin manager available for Vim that can be used to install the required plugins. In this tutorial I will be using the vim-plug plugin manager.

Below are the plugins we will need to turn the Vim editor into a fully functional IDE for R.

- Nvim-R: https://github.com/jalvesaq/Nvim-R
- Nvim-R is the main plugin that will add the functionality to execute R code from within the Vim editor.

- Ncm-R: https://github.com/gaalcaras/ncm-R
- Nerd Tree: https://github.com/preservim/nerdtree
- Nerd Tree will be used to toggle file explorer in the side panel.

- DelimitMate: https://github.com/Raimondi/delimitMate
- This plug-in provides automatic closing of quotes, parenthesis, brackets, etc.

- Vim-monokai-tasty: https://github.com/patstockwell/vim-monokai-tasty
- Monokai color scheme inspired by Sublime Text’s interpretation of monokai.

- Lightline.vim: https://github.com/itchyny/lightline.vim
- Lightline.vim adds aesthetic enhancements to Vim’s statusline/tabline.

- Make sure that you have `R >= 3.0.0` installed.
- Make sure that you have `Neovim >= 0.2.0` installed.
- Install the `vim-plug` plugin manager.

```
curl -fLo ~/.local/share/nvim/site/autoload/plug.vim --create-dirs \
https://raw.githubusercontent.com/junegunn/vim-plug/master/plug.vim
```

- Install the required plugins.

First, create an `init.vim` file in the `~/.config/nvim` folder (create the folder if it doesn’t exist). This file is equivalent to a `.vimrc` file in the traditional Vim environment. Start by adding the following to the `init.vim` file:

```
" Specify a directory for plugins
" - Avoid using standard Vim directory names like 'plugin'
call plug#begin('~/.vim/plugged')
" List of plugins.
" Make sure you use single quotes
" Shorthand notation
Plug 'jalvesaq/Nvim-R'
Plug 'ncm2/ncm2'
Plug 'roxma/nvim-yarp'
Plug 'gaalcaras/ncm-R'
Plug 'preservim/nerdtree'
Plug 'Raimondi/delimitMate'
Plug 'patstockwell/vim-monokai-tasty'
Plug 'itchyny/lightline.vim'
" Initialize plugin system
call plug#end()
```

- Update and add more features to the `init.vim` file.

```
" Set a Local Leader
" With a map leader it's possible to do extra key combinations
" like <leader>w saves the current file
let mapleader = ","
let g:mapleader = ","
" Plugin Related Settings
" NCM2
autocmd BufEnter * call ncm2#enable_for_buffer() " To enable ncm2 for all buffers.
set completeopt=noinsert,menuone,noselect " :help Ncm2PopupOpen for more
" information.
" NERD Tree
map <leader>nn :NERDTreeToggle<CR> " Toggle NERD tree.
" Monokai-tasty
let g:vim_monokai_tasty_italic = 1 " Allow italics.
colorscheme vim-monokai-tasty " Enable monokai theme.
" LightLine.vim
set laststatus=2 " To tell Vim we want to see the statusline.
let g:lightline = {
\ 'colorscheme':'monokai_tasty',
\ }
" General NVIM/VIM Settings
" Mouse Integration
set mouse=i " Enable mouse support in insert mode.
" Tabs & Navigation
map <leader>nt :tabnew<cr> " To create a new tab.
map <leader>to :tabonly<cr> " To close all other tabs (show only the current tab).
map <leader>tc :tabclose<cr> " To close the current tab.
map <leader>tm :tabmove<cr> " To move the current tab to next position.
map <leader>tn :tabn<cr> " To switch to next tab.
map <leader>tp :tabp<cr> " To switch to previous tab.
" Line Numbers & Indentation
set backspace=indent,eol,start " To make backspace work in all conditions.
set ma " To set mark a at current cursor location.
set number " To switch the line numbers on.
set expandtab " To enter spaces when tab is pressed.
set smarttab " To use smart tabs.
set autoindent " To copy indentation from current line
" when starting a new line.
set si " To switch on smart indentation.
" Search
set ignorecase " To ignore case when searching.
set smartcase " When searching try to be smart about cases.
set hlsearch " To highlight search results.
set incsearch " To make search act like search in modern browsers.
set magic " For regular expressions turn magic on.
" Brackets
set showmatch " To show matching brackets when text indicator
" is over them.
set mat=2 " How many tenths of a second to blink
" when matching brackets.
" Errors
set noerrorbells " No annoying sound on errors.
" Color & Fonts
syntax enable " Enable syntax highlighting.
set encoding=utf8 " Set utf8 as standard encoding and
" en_US as the standard language.
" Enable 256 colors palette in Gnome Terminal.
if $COLORTERM == 'gnome-terminal'
set t_Co=256
endif
try
colorscheme desert
catch
endtry
" Files & Backup
set nobackup " Turn off backup.
set nowb " Don't backup before overwriting a file.
set noswapfile " Don't create a swap file.
set ffs=unix,dos,mac " Use Unix as the standard file type.
" Return to last edit position when opening files
au BufReadPost * if line("'\"") > 1 && line("'\"") <= line("$") | exe "normal! g'\"" | endif
```

Note: The commands below are according to the `init.vim` settings mentioned in this Gist.

```
# Nvim-R
\rf " Connect to R console.
\rq " Quit R console.
\ro " Open object bowser.
\d " Execute current line of code and move to the next line.
\ss " Execute a block of selected code.
\aa " Execute the entire script. This is equivalent to source().
\xx " Toggle comment in an R script.
# NERDTree
,nn " Toggle NERDTree.
```

```
library(tidyverse)
# \rf " Connect to R console.
# \rq " Quit R console.
# \ro " Open object bowser.
# \d \ss \aa " Execution modes.
# ?help
# ,nn " NERDTree.
# ,nt, tp, tn " Tab navigation.
theme_set(theme_bw())
data("midwest", package = "ggplot2")
gg <- ggplot(midwest, aes(x=area, y = poptotal)) +
geom_point(aes(col = state, size = popdensity)) +
geom_smooth(method = "loess", se = F) +
xlim(c(0, 0.1)) +
ylim(c(0, 500000)) +
labs(subtitle = "Area Vs Population",
y = "Population",
x = "Area",
title = "Scatterplot",
caption = "Source: midwest")
plot(gg) # Opens an external window with the plot.
midwest$county # To show synchronous auto completion.
View(midwest) # Opens an external window to display a portion of the tibble.
```

Add these lines to the `~/.screenrc` file.

```
# Use 256 colors
attrcolor b ".I" # allow bold colors - necessary for some reason
termcapinfo xterm 'Co#256:AB=\E[48;5;%dm:AF=\E[38;5;%dm' # tell screen how to set colors. AB = background, AF=foreground
defbce on # use current bg color for erased chars
# Informative statusbar
hardstatus off
hardstatus alwayslastline
hardstatus string '%{= kG}[ %{G}%H %{g}][%= %{= kw}%?%-Lw%?%{r}(%{W}%n*%f%t%?(%u)%?%{r})%{w}%?%+Lw%?%?%= %{g}][%{B} %m-%d %{W} %c %{g}]'
# Use X scrolling mechanism
termcapinfo xterm* ti@:te@
# Fix for residual editor text
altscreen on
```

BibTeX citation:

```
@online{farmer2022,
author = {Rohit Farmer},
title = {How to Use {Neovim} or {VIM} Editor as an {IDE} for {R}},
date = {2022-10-15},
url = {https://www.dataalltheway.com/posts/004-how-to-use-neovim-or-vim-editor-as-an-ide-for-r},
langid = {en}
}
```

For attribution, please cite this work as:

Rohit Farmer. 2022. “How to Use Neovim or VIM Editor as an IDE for
R.” October 15, 2022. https://www.dataalltheway.com/posts/004-how-to-use-neovim-or-vim-editor-as-an-ide-for-r.

To download a shared file with “anyone with the link” access rights from Google Drive in R, we can use the `googledrive` package from the tidyverse family. The method described here uses the file ID copied from the shared link. Typically, the `googledrive` package is used to work with the Google Drive of an authenticated user; however, since we are downloading a publicly shared file in this tutorial, we will work without user authentication. Please follow the steps below.

```
if(!require(googledrive)) install.packages("googledrive")
library(googledrive)
drive_deauth()
drive_user()
public_file <- drive_get(as_id("1vj607etanUVYzVFj_HXkznHTd0Ltv_Y4"))
drive_download(public_file, overwrite = TRUE)
```

```
File downloaded:
• hdstim-example-data.rds <id: 1vj607etanUVYzVFj_HXkznHTd0Ltv_Y4>
Saved locally as:
• hdstim-example-data.rds
```

The downloaded data frame.

```
library(DT)
datatable(head(readRDS("hdstim-example-data.rds")))
```

BibTeX citation:

```
@online{farmer2022,
author = {Rohit Farmer},
title = {How to Download a Shared File from {Google} {Drive} in {R}},
date = {2022-10-14},
url = {https://www.dataalltheway.com/posts/003-how-to-download-a-shared-file-from-googledrive-in-r},
langid = {en}
}
```

For attribution, please cite this work as:

Rohit Farmer. 2022. “How to Download a Shared File from Google
Drive in R.” October 14, 2022. https://www.dataalltheway.com/posts/003-how-to-download-a-shared-file-from-googledrive-in-r.

The dataset contains an Excel workbook per year with data points on the rows and features on the columns. Features include the timestamp (UTC), language in which the tweet is written, user id, user name, tweet id, and tweet text. The first version includes the data from October 2018 until September 15, 2022. After that, future releases will be quarterly. It is a textual dataset and is primarily useful for analyses related to natural language processing.

In the Kaggle submission, I have also included a notebook (https://www.kaggle.com/code/rohitfarmer/dont-run-tweet-collection-and-preprocessing) with the Python code that collected the tweets and the additional code that I used to pre-process the data before submission. After releasing the first data set, I updated the code and moved the bot from Python to R, using the `rtweet` library instead of `tweepy`. I found `rtweet` to perform better, especially in filtering out duplicated tweets.

In the current setup (https://github.com/rohitfarmer/government-tweets), which is still running on my Raspberry Pi 3B+, the main bot script runs every fifteen minutes via `crontab` and fetches data more recent than the latest tweet collected in the previous run. The data is stored in an SQLite database, which is backed up to MEGA cloud storage via Rclone every midnight ET.

I enjoyed the process of creating the bot and being able to run it for a couple of years, and I hope I will soon find some time to look into the data and fetch some exciting insights. But, until then, the data is available to the data science community to utilize as they please. So, please open a discussion on the Kaggle page for questions, comments, or collaborations.

BibTeX citation:

```
@dataset{farmer2022,
author = {Rohit Farmer},
publisher = {Kaggle},
title = {Tweets from Heads of Governments and States},
date = {2022-10-05},
url = {https://www.kaggle.com/dsv/4208877},
doi = {10.34740/KAGGLE/DSV/4208877},
langid = {en}
}
```

For attribution, please cite this work as:

Rohit Farmer. 2022. “Tweets from Heads of Governments and
States.” Kaggle. https://doi.org/10.34740/KAGGLE/DSV/4208877.

Data transformation is the process of applying a mathematical function to each data point used in a statistical or machine learning analysis, either to satisfy the underlying assumptions of a statistical test (e.g., normal distribution for a t-test), to help a machine learning algorithm converge faster, or to make a visualization interpretable. In addition to statistical analyses and modeling, data transformation can also help in data visualization, for example, log-transforming a skewed data set to plot it in a relatively unskewed and visually appealing scatter plot. Most data transformation methods are invertible, and the original values of a data set can be recovered by applying the inverse function. In mathematical form it can be expressed as:

y = f(x), where x is the original data, y is the transformed data, and f is the mathematical function applied to x.
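As a minimal NumPy sketch of invertibility: log-transforming the data and then exponentiating recovers the original values.

```python
import numpy as np

x = np.array([1.0, 10.0, 100.0, 1000.0])  # original data
y = np.log(x)                              # forward transformation f
x_recovered = np.exp(y)                    # inverse transformation recovers x

print(np.allclose(x, x_recovered))         # True
```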

In data science, data transformation is also sometimes combined with the data cleaning step: in addition to applying a mathematical function to the data points, they are also checked for quality, for example for missing values. I will discuss data cleaning procedures elsewhere. Data transformation can be considered an umbrella term for both data scaling and data normalization. The two are frequently used interchangeably, sometimes referring to the same mathematical operation; although they are used to achieve similar results, it is better to understand them as two different operations happening under the hood.

Although every data transformation method performs a mathematical operation on every data point (i.e., element-wise), for some methods this operation is not influenced by data points being removed from or added to the data set. Consider a data set in the form of a two-dimensional table with samples on the rows and features on the columns, and compare two methods: 1) log transformation and 2) min-max scaling. In log transformation, the log is taken for every data point individually, and the result will not change if rows or columns are dropped from or added to the table. However, min-max scaling, x' = (x − min(x)) / (max(x) − min(x)), is performed feature-wise (per column); if the data point that was selected as the min or max in a previous transformation is removed, re-doing the transformation will change the result. A data point may be removed, for example, because the min or max value selected in the first iteration was an outlier, or because a particular sample had multiple missing values. Adding data points will also influence min-max scaling: a new point may become the min or max and hence change the scaling. Therefore, when selecting a data transformation method, note whether dropping data points later in the analysis will require redoing the transformation or leave it unaffected.
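This contrast can be sketched with NumPy, assuming a single feature vector with one outlier: dropping the outlier changes every min-max scaled value, while the element-wise log values of the remaining points are untouched.

```python
import numpy as np

def min_max(v):
    """Feature-wise min-max scaling to [0, 1]."""
    return (v - v.min()) / (v.max() - v.min())

x = np.array([1.0, 2.0, 3.0, 100.0])   # 100.0 acts as an outlier

scaled_all = min_max(x)
scaled_dropped = min_max(x[:-1])        # rescaling after removing the outlier

log_all = np.log(x)
log_dropped = np.log(x[:-1])            # element-wise log is unaffected

print(np.allclose(scaled_all[:3], scaled_dropped))  # False: scaling changed
print(np.allclose(log_all[:3], log_dropped))        # True: log unchanged
```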

As mentioned in the general introduction above, element-wise data transformation happens per element without using any information from the rest of the elements in a feature (column) or a sample (row). These methods are therefore immune to changes in the size of the data: removing features or samples after the transformation will not affect the subsequent analysis.

In a log transformation, the logarithm is calculated for every value in the data set. Traditionally, log transformation is carried out to reduce the skewness of data or to bring data closer to a normal distribution. Usually the base of the log doesn’t matter unless it is a domain-specific requirement; however, every feature of the data set should be transformed with the same base. Most programming languages have a core function to calculate the log of a number. In languages that support vector operations, such as R, the same log function can be applied to a single value or to all the values within a data frame, vector, or matrix.
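The same vectorized behavior holds in Python with NumPy: `np.log` accepts either a scalar or an array and operates element-wise.

```python
import numpy as np

print(np.log(np.e))                      # ≈ 1.0 for a single value

arr = np.array([1.0, np.e, np.e ** 2])
print(np.log(arr))                       # ≈ [0. 1. 2.], applied element-wise
```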

For example, let’s visualize the effect of log transformation on synthetically generated dummy data. To generate Figure 1 and Figure 2, I randomly sampled 10,000 positive real numbers from a skewed (positively and negatively) normal distribution and log-transformed every data point. The left sub-panel shows a histogram of the non-transformed data, and the right sub-panel shows a histogram of the log-transformed data. Although log transformation is known for reducing the skewness of data and making the distribution more symmetric around the mean, this holds only for positively skewed data; if the data are negatively skewed, a log transformation will skew them further. For negatively skewed data, a power transformation may help reduce the skewness (Figure 3). Usually raising the data to a power of 2 has only a slight effect on the skewness; a higher exponent may be required. In addition to visual inspection, we can also numerically quantify the skewness of the data, as mentioned in the figure captions.

*Log Transformation:* y = log_b(x)

*Power Transformation:* y = x^p

```
import numpy as np
import pandas as pd
from scipy import stats
from scipy.stats import skewnorm
from scipy.stats import skew
import math
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
# Generate random data points from a skewed normal distribution
data_pos = np.round(skewnorm.rvs(10, size=10000, loc=1, random_state = 101), decimals = 2)
#print('Skewness for the positively (right) skewed data before transformation : ', round(skew(data_pos), 2))
data_neg = np.round(skewnorm.rvs(-10, size=10000, loc=10, random_state = 101), decimals = 2)
#print('Skewness for the negatively (left) skewed data before transformation : ', round(skew(data_neg), 2))
# Log transform the data
log_data_pos = np.log(data_pos)
#print('Skewness for the positively skewed data after transformation : ', round(skew(log_data_pos), 2))
log_data_neg = np.log(data_neg)
#print('Skewness for the negatively skewed data after transformation : ', round(skew(log_data_neg), 2))
```

```
fig, axs = plt.subplots(1, 2, sharey=True, tight_layout=True)
# We can set the number of bins with the *bins* keyword argument.
axs[0].hist(data_pos, bins=20, edgecolor='black', linewidth=1.0)
axs[0].set_title("Non-Transformed Data")
axs[0].set_xlabel("Feature")
axs[0].set_ylabel("Frequency")
axs[1].hist(log_data_pos, bins=20, edgecolor='black', linewidth=1.0)
axs[1].set_title("Log-Transformed Data")
axs[1].set_xlabel("Feature")
plt.show()
```

```
fig, axs = plt.subplots(1, 2, sharey=True, tight_layout=True)
axs[0].hist(data_neg, bins = 20, edgecolor='black', linewidth=1.0)
axs[0].set_title("Non-Transformed Data")
axs[0].set_xlabel("Feature")
axs[0].set_ylabel("Frequency")
axs[1].hist(log_data_neg, bins = 20, edgecolor='black', linewidth=1.0)
axs[1].set_title("Log-Transformed Data")
axs[1].set_xlabel("Feature")
plt.show()
```

```
# Square data.
pow_data_neg = np.power(data_neg, 6)
#print('Skewness for the negatively skewed data after transformation : ', round(skew(pow_data_neg), 2))
fig, axs = plt.subplots(ncols=2, sharey = "all", tight_layout=True)
axs[0].hist(data_neg, bins = 20, edgecolor='black', linewidth=1.0)
axs[0].set_title("Non-Transformed Data")
axs[0].set_xlabel("Feature")
axs[0].set_ylabel("Frequency")
axs[1].hist(pow_data_neg, bins = 20, edgecolor='black', linewidth=1.0)
axs[1].set_title("Power-Transformed Data")
axs[1].set_xlabel("Feature")
plt.show()
```

*Note: Since the data used in these figures are sampled from a skewed normal distribution, the skewness values calculated here are below 2. For non-normally distributed skewed data, the skewness would typically be higher than 2. Log transformation is often used to bring a non-normal distribution closer to a normal distribution.*

Log transformation can only be performed on positive values; the logarithm is mathematically undefined for negative numbers. If the input data contain negative values and a log-like transformation is desired, the inverse hyperbolic sine (arcsinh) transformation can be used instead.
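As a quick check (a minimal sketch using NumPy), `np.log` returns `nan` for negative inputs, whereas `np.arcsinh` is finite for every real input:

```
import numpy as np

x = np.array([-5.0, -0.5, 0.0, 0.5, 5.0])

# log is undefined for negatives (nan) and diverges at zero (-inf)
with np.errstate(invalid="ignore", divide="ignore"):
    print(np.log(x))

# arcsinh is defined for all real numbers, negative or positive
print(np.arcsinh(x))
```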

The inverse hyperbolic sine transformation is a non-linear transformation often used where a log transformation can't be applied, such as in the presence of negative values. Flow and mass cytometry are popular examples where the arcsinh transformation is almost always the method of choice. The reason is historical: older flow cytometry machines produced only positive values, which were displayed on a log scale, whereas newer machines can produce both negative and positive values that can't be displayed on a log scale. Therefore, to keep the transformed data resembling a log transformation, the arcsinh transformation is used.

The arcsinh transformation can also be tweaked with a cofactor to behave differently around zero: values (negative or positive) between zero and the cofactor are transformed roughly linearly, staying close to the raw data values, while values beyond the cofactor are transformed in a log-like fashion. In flow and mass cytometry, cofactors of 150 and 5 are used, respectively.
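A cofactor-controlled arcsinh can be sketched as below (the `arcsinh_transform` helper name is my own; the cofactor values follow the flow/mass cytometry conventions mentioned above):

```
import numpy as np

def arcsinh_transform(x, cofactor=5):
    """Arcsinh with a cofactor: roughly linear for |x| < cofactor,
    log-like for |x| >> cofactor."""
    return np.arcsinh(np.asarray(x) / cofactor)

signal = np.array([-100.0, -10.0, 0.0, 10.0, 1000.0, 100000.0])
print(arcsinh_transform(signal, cofactor=5))    # mass cytometry convention
print(arcsinh_transform(signal, cofactor=150))  # flow cytometry convention
```

Note that the transform is symmetric around zero, so small negative values are mapped to small negative outputs rather than becoming undefined as they would under a log.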

For all real $x$:

$$\operatorname{arcsinh}(x) = \ln\left(x + \sqrt{x^2 + 1}\right)$$

Let’s use positively skewed data similar to the log-transformation example to visualize how an arcsinh transformation affects the shape of the distribution. The only change I want to make to this data set is to add a few negative values. Although, as mentioned earlier, we can’t take the log of negative numbers, the arcsinh transformation maps small negative values to values close to zero. Figure 4 and Figure 5 show histograms comparing the original and the arcsinh-transformed data for positively and negatively skewed data, respectively. From the figures it’s evident that, unlike log, the arcsinh transformation works equally well on both positively and negatively skewed data.

```
# Generate random data points from a skewed normal distribution
data_pos = np.round(skewnorm.rvs(10, size=10000, loc=0, random_state = 101), decimals = 2)
#print('Skewness for the positively (right) skewed data before transformation : ', round(skew(data_pos), 2))
data_neg = np.round(skewnorm.rvs(-10, size=10000, loc=0, random_state = 101), decimals = 2)
#print('Skewness for the negatively (left) skewed data before transformation : ', round(skew(data_neg), 2))
# Arcsinh transform the data
arcsinh_data_pos = np.arcsinh(data_pos)
#print('Skewness for the positively skewed data after transformation : ', round(skew(arcsinh_data_pos), 2))
arcsinh_data_neg = np.arcsinh(data_neg)
#print('Skewness for the negatively skewed data after transformation : ', round(skew(arcsinh_data_neg), 2))
```

```
fig, axs = plt.subplots(ncols=2, sharey = "all", tight_layout=True)
axs[0].hist(data_pos, bins = 20, edgecolor='black', linewidth=1.0)
axs[0].set_title("Non-Transformed Data")
axs[0].set_xlabel("Feature")
axs[0].set_ylabel("Frequency")
axs[1].hist(arcsinh_data_pos, bins = 20, edgecolor='black', linewidth=1.0)
axs[1].set_title("Arcsinh-Transformed Data")
axs[1].set_xlabel("Feature")
plt.show()
```

```
fig, axs = plt.subplots(ncols=2, sharey = "all", tight_layout=True)
axs[0].hist(data_neg, bins = 20, edgecolor='black', linewidth=1.0)
axs[0].set_title("Non-Transformed Data")
axs[0].set_xlabel("Feature")
axs[0].set_ylabel("Frequency")
axs[1].hist(arcsinh_data_neg, bins = 20, edgecolor='black', linewidth=1.0)
axs[1].set_title("Arcsinh-Transformed Data")
axs[1].set_xlabel("Feature")
plt.show()
```

Data scaling is a type of data transformation that usually doesn’t affect the shape of the distribution but changes the scale on which the numerical values are presented. For example, if a feature is normally distributed, it will stay normally distributed after the transformation; however, if its values range from, say, 10 to 100, they may be re-scaled to range from 0 to 1. The relative differences between the values remain the same. This type of transformation is useful when the features in a data set are measured on different scales. For example, in a data set that records height, weight, and the time taken to finish a 100-meter sprint for 20 high school boys, height would probably range from 4 to 6 ft, weight from 40 to 80 kg, and sprint time from 10 to 30 seconds. Although these are all positive real numbers, they have different units and are measured on different scales; in this particular example, none of the ranges even overlap. Such data can become very difficult for machine learning algorithms to work with, in particular for gradient descent algorithms to converge in a reasonable number of iterations. Therefore, having all the features on the same scale is desirable, if not essential.

There are two common ways to get all the features to have the same scale: min-max scaling and standardization.

In min-max scaling, for a given feature, we subtract the minimum value from each value and divide the result by the difference between the maximum and the minimum value. The resulting transformed data is scaled between 0 and 1:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$

Min-max scaling can also be modified to scale the values to the desired range, for example, between -1 and 1.

$$x' = a + \frac{\left(x - \min(x)\right)\left(b - a\right)}{\max(x) - \min(x)}$$

where $a$ and $b$ are the minimum and maximum of the desired range, respectively.

- Neural networks
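Both variants of min-max scaling described above can be sketched in a few lines of NumPy (the `min_max_scale` helper name is my own; the heights are taken from the sprint example):

```
import numpy as np

def min_max_scale(x, feature_range=(0.0, 1.0)):
    """Scale x into feature_range; (0, 1) reproduces the basic formula."""
    x = np.asarray(x, dtype=float)
    a, b = feature_range
    scaled01 = (x - x.min()) / (x.max() - x.min())
    return a + scaled01 * (b - a)

heights = np.array([4.5, 5.0, 5.5, 6.0])    # ft
print(min_max_scale(heights))               # scaled to [0, 1]
print(min_max_scale(heights, (-1.0, 1.0)))  # scaled to [-1, 1]
```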

Standardization is also known as z-scaling, mean removal, or variance scaling. In standardization, the goal is to scale the data so that it has a mean of zero and a standard deviation of one.

$$z = \frac{x - \mu}{\sigma}$$

where $\mu$ is the mean and $\sigma$ is the standard deviation of a given feature. The distribution of the transformed data is then called the z-distribution.

- Principal Component Analysis (PCA)
- In heatmaps to compare data among samples
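Standardization is a one-liner with NumPy (the `standardize` helper name is my own; the weights are taken from the sprint example):

```
import numpy as np

def standardize(x):
    """z-scale: subtract the mean, divide by the standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

weights = np.array([40.0, 55.0, 60.0, 80.0])  # kg
z = standardize(weights)
print(z.mean())  # ~0 (up to floating-point error)
print(z.std())   # ~1
```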

In data science, we casually use the term data normalization for any method that transforms the data across the samples or features so that the data’s elements (samples or features) are similar and comparable. For example, in the case of gene expression measurements for multiple samples, we want to detect actual biological differences between the samples rather than technical variations caused by human error in sample handling. Therefore, having normalized data ensures that the differentially expressed genes we detect are due to biological conditions and not technical noise.

However, I like to consider normalization methods as distinct from element-wise transformation or feature-wise scaling, because changing the dataset requires re-normalization. In the gene expression example, quantile normalization is frequently used, and it is sensitive to changes in both the samples and the features. This is unlike element-wise transformation, where no other sample or feature affects the transformation, or feature-wise scaling, where dropping a feature would not affect the scaling of the other features.

Therefore, this section will look into methods that are unlike element-wise transformation or feature-wise scaling.

Quantile normalization (QN) is a technique to make two or more distributions identical in their statistical properties. QN involves first ranking the features (e.g., genes) of each sample by magnitude, calculating the average value of the features occupying the same rank across samples, and then substituting the values of all features occupying that particular rank with this average value. The final step is to reorder the features of each sample in their original order.
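The steps above can be sketched with pandas (the `quantile_normalize` helper name is my own; this minimal sketch breaks ties with `method="first"` ranking, whereas real implementations may average ties):

```
import numpy as np
import pandas as pd

def quantile_normalize(df):
    """Quantile-normalize a features-by-samples DataFrame."""
    # Step 1: sort each sample's values and average across samples at each rank.
    rank_means = pd.DataFrame(np.sort(df.values, axis=0)).mean(axis=1)
    rank_means.index = np.arange(1, len(rank_means) + 1)
    # Step 2: rank each sample's values, then substitute the rank means.
    ranks = df.rank(method="first").astype(int)
    # Mapping via the original index restores the original feature order.
    return ranks.apply(lambda col: col.map(rank_means))

genes = pd.DataFrame({"s1": [5.0, 2.0, 3.0], "s2": [4.0, 1.0, 4.0]},
                     index=["g1", "g2", "g3"])
print(quantile_normalize(genes))
```

After normalization, both columns contain exactly the same set of values (the rank means), so the samples share an identical distribution.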

BibTeX citation:

```
@online{farmer2022,
author = {Rohit Farmer},
title = {Data {Transformation}},
date = {2022-10-05},
url = {https://www.dataalltheway.com/posts/001-data-transformation},
langid = {en}
}
```

For attribution, please cite this work as:

Rohit Farmer. 2022. “Data Transformation.” October 5, 2022.
https://www.dataalltheway.com/posts/001-data-transformation.