Thursday, April 22, 2010

A note on statistical methods in the survey

There were 46 responses to the survey, but most of the analysis that will be posted here relies on a sample of 42 surveys.

Most statistical calculations that we post in the future will exclude two high outlier survey responses. We decided that these two outliers must be excluded, but we want to disclose their exclusion. The median is less sensitive to outliers than the mean, so median salary data could still yield a robust estimate even with the outliers included. We nevertheless exclude the two outliers at the high end of the hourly rate when calculating both the mean and the median hourly rate.
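To illustrate why the median holds up better than the mean when a couple of extreme values are present, here is a minimal Python sketch. The hourly rates below are made-up numbers for illustration only, not the actual survey responses, and the names `typical`, `outliers` and `summarize` are our own.

```python
from statistics import mean, median

# Hypothetical hourly rates in dollars. These are NOT the survey data.
typical = [10, 11, 12, 12, 13, 14, 15]
outliers = [50, 55]  # two invented extreme values

with_outliers = typical + outliers


def summarize(rates):
    """Return the mean (rounded to cents) and median of a list of rates."""
    return {"mean": round(mean(rates), 2), "median": median(rates)}


print("without outliers:", summarize(typical))
print("with outliers:   ", summarize(with_outliers))
```

In this toy data the median moves by only a dollar when the two extreme values are added, while the mean rises far more sharply.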

We exclude the high pair for three reasons. First and most importantly, the two high outliers were flagged as coming from identical computer DNS numbers. The DNS is a unique identifier that can track the source of the survey response. They were also submitted close together in time. Together, these facts support an inference that the two high outliers are not credible responses.
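The check described above (same source identifier, submitted close together in time) can be sketched in a few lines of Python. This is only an illustration of the idea, not how SurveyMonkey does its flagging. The record layout, the `source` labels, the third record and the 30-minute window are all assumptions; the two submission times are the ones reported later in this post, with the year taken from the post date.

```python
from datetime import datetime, timedelta
from itertools import combinations

# Hypothetical response records. Field names and "source" labels are invented.
responses = [
    {"id": 1, "source": "A", "submitted": datetime(2010, 4, 19, 16, 41)},
    {"id": 2, "source": "A", "submitted": datetime(2010, 4, 19, 16, 50)},
    {"id": 3, "source": "B", "submitted": datetime(2010, 4, 20, 9, 15)},
]


def suspicious_pairs(records, window=timedelta(minutes=30)):
    """Return id pairs that share a source and were submitted within `window`."""
    pairs = []
    for a, b in combinations(records, 2):
        same_source = a["source"] == b["source"]
        close_in_time = abs(a["submitted"] - b["submitted"]) <= window
        if same_source and close_in_time:
            pairs.append((a["id"], b["id"]))
    return pairs


print(suspicious_pairs(responses))
```

A flag like this is only a starting point. It says the pair deserves a closer look, not that the responses are necessarily in bad faith.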

Second, these two high responses were not identical, but they were similar in other ways. Had they been identical, they would have been more credible: for example, a single response could have been submitted twice by accident. Instead, both contained extremely high pay rates with rather low workloads. They bore no resemblance to any other survey response, only to each other. Combined with the fact that they came from the same DNS number, this erodes their credibility further.

Third, the two high outliers were outlandish. They would result in an annual salary in excess of $100,000. These responses were many standard deviations away from the mean, so characterizing them as outliers is statistically sound. Exclusion is one appropriate statistical method for managing outliers, especially given how many standard deviations from the mean they were. Simply put, no other survey response was even close to this pair. While we could alternatively manage these two high outliers through a frequency distribution to mitigate the extreme number of standard deviations from the mean, their origin from an identical DNS computer number calls into question their underlying good faith. SurveyMonkey is supposed to block repeated attempts to game the survey. It did not block this pair, but it was successful in flagging the identical DNS numbers.
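For readers who want to see what "many standard deviations from the mean" looks like in practice, here is a small Python sketch of a z-score calculation. The rates are again invented for illustration and are not the survey data. One assumption to note: the mean and standard deviation here are computed from the remaining responses only, so that the suspect values do not inflate the spread they are being measured against.

```python
from statistics import mean, stdev

# Hypothetical hourly rates in dollars. These are NOT the survey data.
typical = [10, 11, 12, 12, 13, 14, 15]
suspects = [50, 55]  # two invented extreme values

center = mean(typical)
spread = stdev(typical)  # sample standard deviation of the remaining responses


def z_score(rate):
    """Number of standard deviations that `rate` lies from the mean."""
    return (rate - center) / spread


for rate in suspects:
    print(f"${rate}/hour is {z_score(rate):.1f} standard deviations from the mean")
```

A common rule of thumb treats anything more than about three standard deviations from the mean as a candidate outlier, though that cutoff is a convention rather than a law.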

This transparent description of how the two high outliers were excluded of course allows any reader who wishes to disregard the evidence of bad faith in these two survey responses to simply consider the pair of high outliers part of the survey for their own purposes. But we will not taint the calculations with a pair of responses that are excluded on sound statistical principles.

On the other hand, if you are the person who submitted one (or both) of these outlier responses and would like to discuss including them, email us at rosemontnanny@yahoo.com. The two postings were made on Monday, April 19, at 4:41 and 4:50 pm.

There were also two live-in nanny responses. These responses will be excluded from some calculations, such as the mean and median and other figures, but they will still be included in other data points. Stay tuned for the mode. We will also be posting a wide variety of statistics, such as cross-tabs of mean and median salary broken down by experience level.
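As a preview of what a cross-tab by experience level involves, here is a minimal Python sketch. The experience bands and hourly rates are hypothetical placeholders, not results from the survey, and `crosstab` is simply our own helper name.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (experience band, hourly rate) pairs. NOT the survey data.
rows = [
    ("0-2 years", 10), ("0-2 years", 12),
    ("3-5 years", 12), ("3-5 years", 13), ("3-5 years", 14),
    ("6+ years", 15), ("6+ years", 17),
]


def crosstab(rows):
    """Group rates by experience band and report count, mean and median."""
    groups = defaultdict(list)
    for band, rate in rows:
        groups[band].append(rate)
    return {
        band: {"n": len(rates), "mean": round(mean(rates), 2), "median": median(rates)}
        for band, rates in groups.items()
    }


for band, stats in crosstab(rows).items():
    print(band, stats)
```

Reporting the count alongside each mean and median matters with a sample this small, since a band with only two or three responses says little on its own.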
