
Winsorizing

The winsorize function tries to emulate Stata's winsor command.

StatsBase.jl also has a winsor function, but I think it is a little less full-featured.

Basic usage

Start with a simple distribution to visualize the effect of winsorizing:

julia
using Random, Plots  # winsorize comes from this package, assumed already loaded
Random.seed!(3); x = randn(10_000);
p1 = histogram(x, bins=-4:0.1:4, color="blue", label="distribution",
    framestyle=:box, size=(1250,750))

Replace the outliers based on quantiles:

julia
x_win = winsorize(x, probs=(0.05, 0.95));
p2 = histogram(x, bins=-4:0.1:4, color="blue", label="distribution", framestyle=:box);
histogram!(x_win, bins=-4:0.1:4, color="red", opacity=0.5, label="winsorized")
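
To make the effect of probs concrete, here is a minimal sketch of quantile-based winsorizing using only the Statistics standard library. The helper clamp_to_quantiles is hypothetical and written for illustration; it is not the package's implementation, and winsorize may use a different quantile definition or treat ties differently.

```julia
using Random, Statistics

# Hypothetical helper (illustration only): pull every value outside the
# [lo_prob, hi_prob] quantile range back to the nearest quantile.
function clamp_to_quantiles(x::AbstractVector{<:Real}, lo_prob::Real, hi_prob::Real)
    lo, hi = quantile(x, [lo_prob, hi_prob])  # empirical cutpoints
    return clamp.(x, lo, hi)                  # values inside the range are unchanged
end

Random.seed!(3); x = randn(10_000)
x_sketch = clamp_to_quantiles(x, 0.05, 0.95)

# After clamping, the extremes of the data sit exactly at the two cutpoints.
extrema(x_sketch)
```

Unlike trimming, no observation is dropped: the vector keeps its length, and the mass from both tails piles up at the cutpoints, which is what produces the two spikes in the red histogram.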

One-sided trim:

julia
x_win = winsorize(x, probs=(0, 0.8));
p3 = histogram(x, bins=-4:0.1:4, color="blue", label="distribution", framestyle=:box);
histogram!(x_win, bins=-4:0.1:4, color="red", opacity=0.5, label="winsorized");

Bring your own cutpoints

Another type of winsorizing is to specify your own cutpoints (they do not have to be symmetric):

julia
x_win = winsorize(x, cutpoints=(-1.96, 2.575));
p4 = histogram(x, bins=-4:0.1:4, color="blue", label="distribution", framestyle=:box);
histogram!(x_win, bins=-4:0.1:4, color="red", opacity=0.5, label="winsorized");

Rely on the computer to select the right cutpoints

If you specify neither probs nor cutpoints, the cutpoints are inferred automatically. With verbose=true, the inferred values and the rule used are logged:

julia
x_win = winsorize(x; verbose=true);
p5 = histogram(x, bins=-4:0.1:4, color="blue", label="distribution", framestyle=:box);
histogram!(x_win, bins=-4:0.1:4, color="red", opacity=0.5, label="winsorized");
[ Info: Inferred cutpoints are ... (-4.073837032137298, 4.019734075131403) (using interquartile range x 3 from median)
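
The log message suggests the rule: the median plus or minus three times the interquartile range. The sketch below is one reading of that message, computed with the Statistics standard library. It is an assumption about what the message means, not the package's code, and the numbers may differ slightly from the log depending on the quantile definition and the Julia version's random stream.

```julia
using Random, Statistics

Random.seed!(3); x = randn(10_000)

m = median(x)
q25, q75 = quantile(x, [0.25, 0.75])
iqr_x = q75 - q25                      # interquartile range

# Assumed rule: median ± 3 × IQR
inferred = (m - 3 * iqr_x, m + 3 * iqr_x)
```

For a standard normal sample the interquartile range is about 1.35, so these cutpoints fall near ±4 and almost nothing is replaced, which is consistent with the logged values.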

How not to replace outliers

If you do not want to replace outliers with the cutpoints, specify replace_value=missing:

julia
x_win = winsorize(x, cutpoints=(-2.575, 1.96), replace_value=missing);
p6 = histogram(x, bins=-4:0.1:4, color="blue", label="distribution", framestyle=:box);
histogram!(x_win, bins=-4:0.1:4, color="red", opacity=0.5, label="winsorized");
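
As a rough illustration of what replacing by missing amounts to, the following sketch masks every value outside the cutpoints on a small hand-made vector. It is plain Julia written for this example, not the package's implementation, and whether winsorize treats values exactly equal to a cutpoint as outliers is an assumption here.

```julia
x_small = [-3.0, -1.0, 0.0, 1.5, 2.5]
lo, hi = -2.575, 1.96

# Keep values inside [lo, hi]; everything else becomes missing.
x_masked = [lo <= v <= hi ? v : missing for v in x_small]
# -3.0 and 2.5 are outside the cutpoints, so they are masked.
```

The result has element type Union{Missing, Float64}, so downstream reductions need skipmissing (for example, sum(skipmissing(x_masked))).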

How to choose your replacement

The replace_value argument lets you replace outliers with values of your choosing, which need not be the cutpoints themselves:

julia
x_win = winsorize(x, cutpoints=(-2.575, 1.96), replace_value=(-1.96, 1.28));
p7 = histogram(x, bins=-4:0.1:4, color="blue", label="distribution", framestyle=:box);
histogram!(x_win, bins=-4:0.1:4, color="red", opacity=0.5, label="winsorized");
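
The logic of separate cutpoints and replacement values can be sketched in plain Julia on a small vector. This is an illustration under the assumption that values below the lower cutpoint get the first replacement and values above the upper cutpoint get the second; it is not the package's code.

```julia
x_small = [-3.0, -1.0, 0.0, 1.5, 2.5]
lo, hi = -2.575, 1.96          # where outliers are detected
new_lo, new_hi = -1.96, 1.28   # what outliers are replaced with

x_custom = [v < lo ? new_lo : (v > hi ? new_hi : v) for v in x_small]
# -3.0 becomes -1.96 and 2.5 becomes 1.28; the other values are unchanged.
```

Note that the transformation is no longer monotone: 1.5 is left alone while the larger 2.5 is mapped to 1.28, so the ordering of observations can change.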

Within a DataFrame

I try to mimic the gtools winsor example.

Winsorize one variable

julia
using DataFrames, PalmerPenguins
df = DataFrame(PalmerPenguins.load())

# gstats winsor wage
transform!(df, :body_mass_g => (x -> winsorize(x, probs=(0.1, 0.9)) ) => :body_mass_g_w)

p8 = histogram(df.body_mass_g, bins=2700:100:6300, color="blue", label="distribution", framestyle=:box);
histogram!(df.body_mass_g_w, bins=2700:100:6300, color="red", opacity=0.5, label="winsorized");

Winsorize multiple variables

julia
# gstats winsor wage age hours, cuts(0.5 99.5) replace
var_to_winsorize = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm"]
transform!(df,
    var_to_winsorize .=> (x -> winsorize(x, probs=(0.1, 0.9)) ) .=> var_to_winsorize .* "_w")

Winsorize on one side only

julia
# left-winsorizing only; the Stata example cuts at the 1st percentile, the Julia call below at the 10th
# cap noi gstats winsor wage, cuts(1 100); gstats winsor wage, cuts(1 100) s(_w2)
transform!(df, :body_mass_g => (x -> winsorize(x, probs=(0.1, 1)) ) => :body_mass_g_w )

Winsorize by groups

julia
transform!(
    groupby(df, :sex),
    :body_mass_g => (x -> winsorize(x, probs=(0.2, 0.8)) ) => :body_mass_g_w)
p9 = histogram(df[ isequal.(df.sex, "male"), :body_mass_g], bins=3000:100:6300,
    color="blue", label="distribution", framestyle=:box);
histogram!(df[ isequal.(df.sex, "male"), :body_mass_g_w], bins=3000:100:6300,
    color="red", opacity=0.5, label="winsorized");