Improvements to the Turing Test

I am really fond of the Turing test, but it has some methodological problems.
First of all, the statistics are weak. One person mistaking one machine for a human does not make that machine intelligent.
This could be made statistically significant. The test could be run repeatedly with different judges (and different human contestants).
Guessing is allowed, so a machine should count as passing only if the judges, taken together, identify it no better than chance.
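
To make "near chance" concrete, here is a minimal sketch of how the repeated trials could be scored. It assumes each trial is an independent two-way choice, so a guessing judge is right half the time, and it uses an exact two-sided binomial test. The function names, the 0.05 threshold, and the tallies are my own illustrative assumptions, not part of any standard Turing test protocol.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float = 0.5) -> float:
    """Probability of exactly k correct identifications in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_p_value(correct: int, trials: int, p: float = 0.5) -> float:
    """Exact two-sided binomial test against the guessing rate p."""
    observed = binomial_pmf(correct, trials, p)
    # Add up every outcome that is no more likely than the observed one.
    # The small tolerance guards against floating-point ties.
    total = sum(
        pmf
        for pmf in (binomial_pmf(k, trials, p) for k in range(trials + 1))
        if pmf <= observed * (1 + 1e-9)
    )
    return min(1.0, total)


def judges_beat_chance(correct: int, trials: int, alpha: float = 0.05) -> bool:
    """True if the judges' hit rate differs significantly from guessing."""
    return two_sided_p_value(correct, trials) < alpha


if __name__ == "__main__":
    # Hypothetical tallies for illustration only, not real results.
    for correct, trials in [(1, 1), (52, 100), (70, 100)]:
        p = two_sided_p_value(correct, trials)
        verdict = judges_beat_chance(correct, trials)
        print(f"{correct}/{trials} correct: p = {p:.4f}, beat chance: {verdict}")
```

A single trial can never reach significance under this scoring, which is the weakness described above. Note also that failing to beat chance over a handful of trials says very little, so the number of trials would need to be fixed in advance and be reasonably large.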
Similarly, the judges need to be screened. Otherwise I could recruit three-year-olds, or people who cannot read, as judges.
Another problem would be language. All three participants (the judge and the two contestants) need to use the same language. Perhaps they need to be native speakers.
If I were a judge, and both contestants were typing in Chinese, I would just have to guess.
Of course, for now this is not a problem. No system comes close, so we are not currently concerned with these details.
The open-ended duration is not a weakness. A judge should be able to spend as much time conversing as he likes before making up his mind.