This paper analyzes how an experimenter can balance errors in subjective video quality tests between the statistical power of finding an effect if it is there and not claiming that an effect is there if it is not, i.e., balancing Type I and Type II errors. The risk of committing Type I errors increases with the number of comparisons performed in statistical tests. We show that when controlling for this while keeping the power of the experiment at a reasonably high level, the number of test subjects normally used and recommended by the International Telecommunication Union (ITU), i.e., 15, is unlikely to be sufficient, whereas the number used by the Video Quality Experts Group (VQEG), i.e., 24, is more likely to be sufficient. Examples are also given of the influence of Type I errors on the statistical significance of comparing objective metrics by correlation. We also present a comparison between parametric and nonparametric statistics, targeting the question of whether we would reach different conclusions on the statistical difference between the video quality ratings of different video clips in a subjective test, based on a comparison between the Student's t-test and the Mann–Whitney U-test. We found hardly any difference when only a few comparisons are compensated for, i.e., almost the same conclusions are reached. As the number of comparisons increases, larger and larger differences between the two methods are revealed. In these cases, the parametric t-test yields clearly more significant cases than the nonparametric test, which makes it more important to investigate whether the assumptions for performing a given test are met.
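The comparison described above can be sketched in a few lines of Python using SciPy. This is a minimal illustration, not the paper's actual analysis: the rating data are simulated from normal distributions (an assumption), and the number of pairwise comparisons and the effect size are chosen arbitrarily for demonstration. It shows how each clip pair is tested with both the Student's t-test and the Mann–Whitney U-test against a Bonferroni-corrected significance threshold.

```python
# Sketch: parametric vs. nonparametric significance testing of simulated
# quality ratings under a Bonferroni correction for multiple comparisons.
# All data and parameters here are illustrative assumptions.
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

rng = np.random.default_rng(0)
n_subjects = 24           # VQEG-recommended panel size
n_comparisons = 10        # hypothetical number of pairwise clip comparisons
alpha = 0.05
alpha_bonf = alpha / n_comparisons   # Bonferroni-corrected per-test threshold

t_significant = 0
u_significant = 0
for _ in range(n_comparisons):
    # Two clips whose simulated mean opinion scores differ by half a rating point
    ratings_a = rng.normal(3.0, 1.0, n_subjects)
    ratings_b = rng.normal(3.5, 1.0, n_subjects)
    _, p_t = ttest_ind(ratings_a, ratings_b)
    _, p_u = mannwhitneyu(ratings_a, ratings_b, alternative="two-sided")
    t_significant += p_t < alpha_bonf
    u_significant += p_u < alpha_bonf

print("corrected alpha:", alpha_bonf)
print("t-test significant pairs:", t_significant)
print("U-test significant pairs:", u_significant)
```

As the abstract notes, with many comparisons the corrected threshold becomes strict, and the two tests can start to disagree on which pairs remain significant; with only a few comparisons they tend to agree.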
Copyright (2018) Society of Photo-Optical Instrumentation Engineers. One print or electronic copy may be made for personal use only. Systematic reproduction and distribution, duplication of any material in this paper for a fee or for commercial purposes, or modification of the content of the paper are prohibited.