A group of Stanford researchers recently decided to put AI detectors to the test, and had it been a graded assignment, the detection tools would have received an F.
“Our main finding is that current AI detectors are not reliable in that they can be easily fooled by changing prompts,” says James Zou, a Stanford professor and co-author of the paper based on the research. More significantly, he adds, “They have a tendency to mistakenly flag text written by non-native English speakers as AI-generated.”
This is bad news for those educators who have embraced AI detection sites as a necessary evil in the AI era of teaching. Here’s everything you need to know about how this research into bias in AI detectors was conducted and its implications for teachers.
How was this AI detection research conducted?
Zou and his co-authors were aware of the interest in third-party tools that claim to detect whether text was written by ChatGPT or another AI tool, and wanted to evaluate the efficacy of those tools scientifically. To do that, the researchers tested seven unidentified but “widely used” AI detectors on 91 TOEFL (Test of English as a Foreign Language) essays from a Chinese forum and 88 U.S. eighth-grade essays from the Hewlett Foundation’s ASAP dataset.
What did the research find?
The performance of these detectors on students who spoke English as a second language was, to put it in terms no good teacher would ever use in their feedback to a student, atrocious.
The AI detectors incorrectly labeled more than half of the TOEFL essays as “AI-generated,” with an average false-positive rate of 61.3%. None of the detectors did a good job of correctly identifying the TOEFL essays as human-written, but there was a great deal of variation among them. The study notes: “All detectors unanimously identified 19.8% of the human-written TOEFL essays as AI-authored, and at least one detector flagged 97.8% of TOEFL essays as AI-generated.”
The detectors did much better with those who spoke English as their first language but were still far from perfect. “On 8th grade essays written by students in the U.S., the false positive rate of most detectors is less than 10%,” Zou says.
Why are AI detectors more likely to incorrectly label writing from non-native English speakers as AI-written?
Most AI detectors attempt to differentiate between human- and AI-written text by assessing a sentence’s perplexity, which Zou and his co-authors define as “a measure of how ‘surprised’ or ‘confused’ a generative language model is when trying to guess the next word in a sentence.”
The higher the perplexity, and the more surprising the text, the more likely it is to have been written by a human, at least in theory. That theory, the study authors conclude, seems to break down somewhat when evaluating writing from non-native English speakers, who generally “use a more limited range of linguistic expressions.”
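For readers who want to see what that perplexity signal looks like in practice, here is a minimal Python sketch. It is purely illustrative: the study’s seven detectors are not identified, so the open GPT-2 model (via the Hugging Face transformers library) stands in as the scoring model, and the threshold logic a real detector would apply is omitted.

```python
# Illustrative only: GPT-2 is a stand-in scoring model, not one of the
# detectors evaluated in the Stanford study.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Return the model's perplexity for `text` (higher = more 'surprising')."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the average cross-entropy
        # of its next-token predictions; exp(loss) is the perplexity.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

print(perplexity("The quick brown fox jumps over the lazy dog."))
```

Text the scoring model finds easy to predict ends up with low perplexity, which is why formulaic but entirely human writing, such as many TOEFL essays, can be misread as AI-generated.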
What are the implications for educators?
The research suggests AI detectors are not ready for prime time, especially given the way these platforms inequitably flag content as AI-written, which could exacerbate existing biases against non-native English-speaking students.
“I think educators should be very cautious about using current AI detectors given [their] limitations and biases,” Zou says. “There are ways to improve AI detectors. However, it's a challenging arms race because the large language models are also becoming more powerful and flexible to emulate different human writing styles.”
In the meantime, Zou advises educators to take other steps to prevent students from using AI to cheat. “One approach is to teach students how to use AI responsibly,” he says. “More in-person discussions and assessments could also help.”