Winsorizing
The winsorize function tries to emulate Stata's winsor function.
StatsBase.jl also provides a winsor function, but I think it is a little less full-featured.
Basic usage
Start with a simple distribution to visualize the effect of winsorizing:
julia
using Random, Plots  # winsorize itself comes from this package
Random.seed!(3); x = randn(10_000);
p1 = histogram(x, bins=-4:0.1:4, color="blue", label="distribution",
    framestyle=:box, size=(1250,750))
Replace the outliers based on quantile
julia
x_win = winsorize(x, probs=(0.05, 0.95));
p2 = histogram(x, bins=-4:0.1:4, color="blue", label="distribution", framestyle=:box);
histogram!(x_win, bins=-4:0.1:4, color="red", opacity=0.5, label="winsorized")
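For intuition, quantile-based winsorizing can be sketched by hand with the Statistics standard library. This is a minimal sketch of the idea and not necessarily how winsorize is implemented; the helper name winsorize_sketch is hypothetical.

```julia
using Random, Statistics

# Hypothetical helper illustrating the idea; not the package's implementation.
function winsorize_sketch(x::AbstractVector{<:Real}; probs=(0.05, 0.95))
    lo, hi = quantile(x, [probs[1], probs[2]])  # empirical cutpoints
    return clamp.(x, lo, hi)                    # pull values outside [lo, hi] back to the bounds
end

Random.seed!(3)
x = randn(10_000)
x_sketch = winsorize_sketch(x)

# Values inside the cutpoints are untouched; the extremes now sit on the cutpoints.
lo, hi = quantile(x, [0.05, 0.95])
extrema(x_sketch) == (lo, hi)
```

Clamping keeps the sample size unchanged, which is the difference between winsorizing and trimming.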
One side trim
julia
x_win = winsorize(x, probs=(0, 0.8));
p3 = histogram(x, bins=-4:0.1:4, color="blue", label="distribution", framestyle=:box);
histogram!(x_win, bins=-4:0.1:4, color="red", opacity=0.5, label="winsorized");
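A lower probability of 0 leaves the left tail alone because the 0-quantile is the sample minimum, so no observation lies below it. A quick check with the Statistics standard library, written independently of winsorize:

```julia
using Random, Statistics

Random.seed!(3)
x = randn(10_000)

lo, hi = quantile(x, [0.0, 0.8])
lo == minimum(x)                 # the 0-quantile is the smallest observation

# Hand-rolled version: only values above hi are changed.
x_right = clamp.(x, lo, hi)
unchanged = x .<= hi
all(x_right[unchanged] .== x[unchanged])
```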
Bring your own cutpoints
Another type of winsorizing is to specify your own cutpoints (they do not have to be symmetric):
julia
x_win = winsorize(x, cutpoints=(-1.96, 2.575));
p4 = histogram(x, bins=-4:0.1:4, color="blue", label="distribution", framestyle=:box);
histogram!(x_win, bins=-4:0.1:4, color="red", opacity=0.5, label="winsorized");
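With explicit cutpoints no quantiles are needed, and the operation reduces to clamping. A sketch using only Base (an illustration, not the package's code):

```julia
using Random

Random.seed!(3)
x = randn(10_000)

lo, hi = -1.96, 2.575           # asymmetric cutpoints, as in the example above
x_clamped = clamp.(x, lo, hi)   # values below lo become lo, values above hi become hi

# Count how many observations were moved on each side.
n_low  = count(<(lo), x)
n_high = count(>(hi), x)
```

Because the cutpoints are asymmetric, n_low and n_high will generally differ even for a symmetric distribution.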
Rely on the computer to select the right cutpoints
If you specify neither probs nor cutpoints, the cutpoints are inferred automatically:
julia
x_win = winsorize(x; verbose=true);
p5 = histogram(x, bins=-4:0.1:4, color="blue", label="distribution", framestyle=:box);
histogram!(x_win, bins=-4:0.1:4, color="red", opacity=0.5, label="winsorized");
[ Info: Inferred cutpoints are ... (-4.073837032137298, 4.019734075131403) (using interquartile range x 3 from median)
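The log message suggests that the default rule places the cutpoints three interquartile ranges on either side of the median. Assuming that reading is right, the rule can be sketched with the Statistics standard library. The exact quantile convention used by winsorize is not stated, so the numbers may differ slightly from the log above.

```julia
using Random, Statistics

Random.seed!(3)
x = randn(10_000)

q1, q3 = quantile(x, [0.25, 0.75])
iqr_x = q3 - q1                              # interquartile range
m = median(x)
cutpoints = (m - 3 * iqr_x, m + 3 * iqr_x)   # assumed rule: median ± 3 × IQR

# For a standard normal sample this lands near ±4, so almost nothing is clipped.
x_auto = clamp.(x, cutpoints...)
n_clipped = count(x_auto .!= x)
```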
How not to replace outliers
If you do not want to replace outliers with the cutpoints, specify replace_value=missing:
julia
x_win = winsorize(x, cutpoints=(-2.575, 1.96), replace_value=missing);
p6 = histogram(x, bins=-4:0.1:4, color="blue", label="distribution", framestyle=:box);
histogram!(x_win, bins=-4:0.1:4, color="red", opacity=0.5, label="winsorized");
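Replacing outliers with missing means the result has to allow missing values in its element type. A hand-written sketch of that behavior (an assumption about the shape of the output, not taken from the package's source):

```julia
using Random

Random.seed!(3)
x = randn(10_000)
lo, hi = -2.575, 1.96

# Keep in-range values and turn everything else into missing.
x_miss = [lo <= v <= hi ? v : missing for v in x]

eltype(x_miss)                          # Union{Missing, Float64}
n_dropped = count(ismissing, x_miss)
x_kept = collect(skipmissing(x_miss))   # surviving observations, ready for mean, quantile, etc.
```

Downstream code then has to handle the missing entries explicitly, for example with skipmissing.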
How to choose your replacement
The replace_value keyword argument lets you substitute any values you choose for the outliers, one for each tail:
julia
x_win = winsorize(x, cutpoints=(-2.575, 1.96), replace_value=(-1.96, 1.28));
p7 = histogram(x, bins=-4:0.1:4, color="blue", label="distribution", framestyle=:box);
histogram!(x_win, bins=-4:0.1:4, color="red", opacity=0.5, label="winsorized");
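The logic separates two roles: the cutpoints decide which observations count as outliers, and the replacement values decide what they become. A minimal sketch in plain Julia, where replace_tail is a hypothetical helper and not part of the package:

```julia
using Random

Random.seed!(3)
x = randn(10_000)
cut_lo, cut_hi = -2.575, 1.96   # where a value counts as an outlier
rep_lo, rep_hi = -1.96, 1.28    # what outliers are replaced with

# Hypothetical helper: map each tail to its own replacement value.
replace_tail(v) = v < cut_lo ? rep_lo : (v > cut_hi ? rep_hi : v)
x_rep = replace_tail.(x)

# Nothing exceeds cut_hi afterwards, but in-range values above rep_hi are left as they are.
maximum(x_rep) <= cut_hi
```

Since the replacement values lie inside the cutpoints, the transformed data are no longer monotone in the original values.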
Within a DataFrame
I try to mimic the gtools winsor example.
Winsorize one variable
julia
using DataFrames, PalmerPenguins
df = DataFrame(PalmerPenguins.load())
# gstats winsor wage
transform!(df, :body_mass_g => (x -> winsorize(x, probs=(0.1, 0.9)) ) => :body_mass_g_w)
p8 = histogram(df.body_mass_g, bins=2700:100:6300, color="blue", label="distribution", framestyle=:box);
histogram!(df.body_mass_g_w, bins=2700:100:6300, color="red", opacity=0.5, label="winsorized");
Winsorize multiple variables
julia
# gstats winsor wage age hours, cuts(0.5 99.5) replace
var_to_winsorize = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm"]
transform!(df,
    var_to_winsorize .=> (x -> winsorize(x, probs=(0.1, 0.9)) ) .=> var_to_winsorize .* "_w")
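The .=> in the call above broadcasts over the column names, producing one source => function => destination specification per variable. This can be inspected in plain Julia without any data; the function f below is only a stand-in for the winsorizing closure.

```julia
vars = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm"]
f = identity  # stand-in for x -> winsorize(x, probs=(0.1, 0.9))

# Functions act as scalars under broadcasting, so only the names are iterated.
specs = vars .=> f .=> vars .* "_w"

length(specs)   # one transformation per variable
first(specs)    # a nested Pair: source => (function => destination)
```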
Winsorize on one side only
julia
# left-winsorizing only; the Stata example cuts at the 1st percentile, the call below at the 10th
# cap noi gstats winsor wage, cuts(1 100); gstats winsor wage, cuts(1 100) s(_w2)
transform!(df, :body_mass_g => (x -> winsorize(x, probs=(0.1, 1)) ) => :body_mass_g_w )
Winsorize by groups
julia
transform!(
    groupby(df, :sex),
    :body_mass_g => (x -> winsorize(x, probs=(0.2, 0.8)) ) => :body_mass_g_w)
p9 = histogram(df[ isequal.(df.sex, "male"), :body_mass_g], bins=3000:100:6300,
color="blue", label="distribution", framestyle=:box);
histogram!(df[ isequal.(df.sex, "male"), :body_mass_g_w], bins=3000:100:6300,
color="red", opacity=0.5, label="winsorized");