Thursday, January 1, 2015

Determining Outliers

Reference: http://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm

Boxplot Construction:

The box plot is a useful graphical display for describing the behavior of the data in the middle as well as at the ends of the distribution. The box plot uses the median and the lower and upper quartiles (defined as the 25th and 75th percentiles). If the lower quartile is Q1 and the upper quartile is Q3, then the difference (Q3 - Q1) is called the interquartile range, or IQ.

Box plot with fences:

A box plot is constructed by drawing a box between the upper and lower quartiles with a solid line drawn across the box to locate the median. The following quantities (called fences) are needed for identifying extreme values in the tails of the distribution:
  1. lower inner fence: Q1 - 1.5*IQ
  2. upper inner fence: Q3 + 1.5*IQ
  3. lower outer fence: Q1 - 3*IQ
  4. upper outer fence: Q3 + 3*IQ

Outlier Detection:

A point beyond an inner fence on either side is considered a mild outlier. A point beyond an outer fence is considered an extreme outlier.
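
The fences and the two outlier classes described above can be sketched in Perl roughly as follows. This is a minimal illustration and not part of the script further down: the helper names fences and classify are hypothetical, the quartiles are assumed to be known already, and "beyond" is read as strictly outside the fence.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the four fences for the given quartiles.
sub fences {
    my ($q1, $q3) = @_;
    my $iq = $q3 - $q1;                # interquartile range (IQ)
    return (
        $q1 - 3   * $iq,               # lower outer fence
        $q1 - 1.5 * $iq,               # lower inner fence
        $q3 + 1.5 * $iq,               # upper inner fence
        $q3 + 3   * $iq,               # upper outer fence
    );
}

# Hypothetical helper: classifies one value as 'extreme', 'mild', or 'none'.
sub classify {
    my ($x, $q1, $q3) = @_;
    my ($lo_outer, $lo_inner, $up_inner, $up_outer) = fences($q1, $q3);
    return 'extreme' if $x < $lo_outer || $x > $up_outer;
    return 'mild'    if $x < $lo_inner || $x > $up_inner;
    return 'none';
}

# Example with assumed quartiles Q1 = 10 and Q3 = 20 (IQ = 10):
# the inner fences are -5 and 35, the outer fences are -20 and 50.
for my $x (12, 36, 51, -6, -21) {
    printf "%4d is classified as: %s\n", $x, classify($x, 10, 20);
}
```

A value that lies exactly on a fence is reported as 'none' here; whether that is the desired convention depends on how "beyond" is interpreted.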

My script to compute the median, quartiles, and fences used to detect outliers:

#!/usr/bin/perl -w

# This script is to detect outliers in a dataset


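die "Usage: $0 <datafile>\n" unless @ARGV;

# Sort the input numerically into a temporary file.
`sort -g "$ARGV[0]" > tmpSorted`;

open FH , "tmpSorted" or die "Can't open tmpSorted for reading: $!\n";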

my $count=1;
my @arr;

while(<FH>){
        chomp;
        $arr[$count]=$_;
        $count++;
}
close(FH);

my ($median, $lowQ, $upQ, $low1B, $up1B, $low2B, $up2B);

# @arr is filled starting at index 1, so the number of values is its last index.
my $length = $#arr;

my $m = findMidpoint($length);

my @med = @$m;

if(scalar(@med) > 1){
        $median = ($arr[$med[0]] + $arr[$med[1]])/2;
        print "Inside outer if and $med[0] and $med[1] and median is $median\n";
        $m = findMidpoint($med[0]);
        my @tmp = @$m;
        $m = findMidpoint($length - $med[1] + 1);
        my @tmp1 = @$m;
                print "printing...", join("|", @tmp), " and ", join("=", @tmp1), "\n";

                if(scalar(@tmp) > 1){
                        $lowQ = ($arr[$tmp[0]] + $arr[$tmp[1]]) /2;
                }
                else{
                        $lowQ = $arr[$tmp[0]];
                }
                if(scalar(@tmp1) > 1){
                        # Offsets into the upper half come from @tmp1, not @tmp.
                        $upQ = ($arr[ $med[1] + $tmp1[0] - 1] + $arr[ $med[1] + $tmp1[1] - 1]) / 2;

                }
                else{
                        $upQ = $arr[$med[1] + $tmp1[0] - 1];
                }


}

else{
        $median = $arr[$med[0]];
        print "Inside outer else and median is $median\n";
        $m = findMidpoint($med[0]);
        my @tmp = @$m;
        $m = findMidpoint($length - $med[0] + 1);
        my @tmp1 = @$m;
                print "printing...", join("|", @tmp), " and ", join("=", @tmp1), "\n";

                if(scalar(@tmp) > 1){
                        $lowQ = ($arr[$tmp[0]] + $arr[$tmp[1]]) /2;
                }
                else{
                        $lowQ = $arr[$tmp[0]];
                }
                if(scalar(@tmp1) > 1){
                        # Offsets into the upper half come from @tmp1, not @tmp.
                        $upQ = ($arr[ $med[0] + $tmp1[0] - 1] + $arr[ $med[0] + $tmp1[1] - 1]) / 2;

                }
                else{
                        $upQ = $arr[$med[0] + $tmp1[0] - 1];
                }

}


my $innerRange = ($upQ - $lowQ) * 1.5;
my $outerRange = ($upQ - $lowQ) * 3;

$low1B = $lowQ - $innerRange;
$low2B = $lowQ - $outerRange;
$up1B  = $upQ  + $innerRange;
$up2B  = $upQ  + $outerRange;

print "Median,lowq,upq,innerRange, outerRange, up1b,up2b, low1b, low2b is $median, $lowQ, $upQ, $innerRange, $outerRange, $up1B, $up2B, $low1B, $low2B \n";

sub findMidpoint{

my $length = $_[0];
my @arr;

# An even count has two middle positions, so return both.
# Test the number itself; length() would give the number of digits in it.
if($length % 2 == 0){
        $arr[0] = $length / 2;
        $arr[1] = $arr[0] + 1;
}
else{
        $arr[0] = ($length + 1)/2;
}

return \@arr;
}
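
As a cross-check for the script above, the same median-of-halves idea can be written more compactly. This is a sketch under stated assumptions, not a replacement for the script: the names median_of and hinges are hypothetical, the input is assumed to be a non-empty list of numbers, and for an odd number of values the median is counted in both halves, which appears to be what the index arithmetic above is intended to do.

```perl
use strict;
use warnings;

# Hypothetical helper: median of an already sorted, non-empty list.
sub median_of {
    my @s = @_;
    my $n = scalar @s;
    return $n % 2 ? $s[ ($n - 1) / 2 ]
                  : ( $s[ $n / 2 - 1 ] + $s[ $n / 2 ] ) / 2;
}

# Hypothetical helper: returns (Q1, median, Q3) using median-of-halves.
sub hinges {
    my @s    = sort { $a <=> $b } @_;
    my $n    = scalar @s;
    my $half = int( ($n + 1) / 2 );    # size of each half; includes the median when $n is odd
    my @lower = @s[ 0 .. $half - 1 ];
    my @upper = @s[ $n - $half .. $n - 1 ];
    return ( median_of(@lower), median_of(@s), median_of(@upper) );
}

my ($q1, $med, $q3) = hinges(1, 2, 3, 4, 5, 6, 7);
print "Q1=$q1 median=$med Q3=$q3\n";    # prints "Q1=2.5 median=4 Q3=5.5"
```

Other quartile conventions (for example, excluding the median from both halves, or interpolating percentiles) can give slightly different Q1 and Q3 on small datasets, so the fences may differ between tools.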