Exams are an unavoidable part of admissions and graduation, but scoring the exams of many students is an enormous amount of work. Automatic scoring by algorithm is becoming popular in the United States because it reduces the burden on human graders, but Motherboard reports that "automatic scoring is still unreliable for scoring essays".

A system that scores essays automatically with algorithms is a natural language processing artificial intelligence system, and is commonly called an "automatic essay scoring system". The first problem Motherboard points out is that automatic essay scoring systems are biased. Such a system assigns scores by recognizing patterns that correlate with essays that human graders scored highly and with essays they scored poorly. However, human graders add or subtract points based on their own prejudice against "dialects" rooted in race, so the automatic essay scoring system learns that human prejudice as well.

Educational Testing Service (ETS) is a private non-profit organization that administers tests such as TOEIC, TOEFL, and the Graduate Record Examination (GRE), a common examination required for graduate school admission in the US and Canada. "e-rater", the automatic essay scoring system that ETS developed in-house, is used in several of its own tests, including the GRE and TOEFL, but studies in 1999, 2004, 2007, 2008, 2012 and 2018 have shown that e-rater's scores are biased with respect to specific races and nationalities.

The bias takes the form of higher scores for Chinese students and lower scores for African Americans. GRE essays are scored on a scale with a maximum of 6 points, and compared with human graders, e-rater tended to score Chinese students about 1.3 points higher on average and African Americans about 0.81 points lower on average.
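One way to quantify a bias of this kind is to compare machine and human scores for the same essays and average the difference within each group. The following is a minimal sketch, not the method used in the studies above; the function name `mean_score_gap`, the record layout, and the toy numbers are all hypothetical and serve only as illustration.

```python
from collections import defaultdict

def mean_score_gap(records):
    """Return the average (machine - human) score difference per group.

    `records` is an iterable of (group, machine_score, human_score) tuples.
    This layout is an assumption made for the example.
    """
    totals = defaultdict(float)  # sum of score differences per group
    counts = defaultdict(int)    # number of essays per group
    for group, machine, human in records:
        totals[group] += machine - human
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Toy data, invented for illustration only (not real test results).
sample = [
    ("A", 4.0, 3.0),
    ("A", 5.0, 4.0),
    ("B", 3.0, 4.0),
    ("B", 2.0, 2.5),
]
gaps = mean_score_gap(sample)
```

A positive value means the machine scored that group higher than the human graders did on average, and a negative value means it scored the group lower.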



Chinese students tended to score highly on "essay length" and "use of sophisticated words", and Motherboard suggests this may be because many Chinese students memorize sophisticated sentences by rote and write them down as they are. African Americans, on the other hand, tended to score poorly on items such as "grammar", "style" and "sentence composition", but Motherboard notes that in some cases the same essays would have received fairly good scores from human graders.

Asked whether the bias against dialects rooted in race and nationality can be corrected, Brent Bridgeman, a senior researcher at ETS, said that adjusting scores for the dialect of a particular race or nationality amounts to the same thing, because it introduces another bias against candidates of other races and nationalities.

Every essay submitted in an exam conducted by ETS is scored by both e-rater and a human, and if the two scores differ, the essay is scored again by a second person. For this reason, ETS says that e-rater's bias does not affect the results.
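The double-scoring procedure described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not ETS's actual procedure: the function name `adjudicate`, the `tolerance` parameter, and the rule for combining the two human scores are hypothetical, since the article does not say how the final score is determined.

```python
from typing import Callable

def adjudicate(machine_score: float,
               human_score: float,
               second_human: Callable[[], float],
               tolerance: float = 0.0) -> float:
    """Return a final essay score from a machine score and a human score.

    If the two scores agree (within `tolerance`, an assumed parameter),
    the human score stands. Otherwise a second human grader is consulted.
    """
    if abs(machine_score - human_score) <= tolerance:
        return human_score
    # Assumption: the final score is the mean of the two human scores.
    # The article only says that another person scores the essay.
    return (human_score + second_human()) / 2

agreed = adjudicate(4.0, 4.0, second_human=lambda: 6.0)
disputed = adjudicate(5.0, 3.0, second_human=lambda: 4.0)
```

In this sketch the machine score only decides whether a second human is called in, which is the basis of the claim that machine bias does not reach the final score.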

Another problem with automatic essay scoring systems is that an essay using sophisticated words receives a high score even if its content is meaningless. Motherboard ran an experiment in which essays produced by a piece of software called the "Basic Automatic BS Essay Language Generator (BABEL Generator)" were submitted to ScoreItNow!, the GRE's online essay scoring service. The essay produced by BABEL Generator scored 4 out of 6 points, a score that is described as indicating "a text that clearly conveys meaning and shows sufficient ability for discussion".

In the United States, federal law makes it illegal to disclose test scores without written consent, so it is difficult to conduct external audits of automatic essay scoring systems or to study bias in their algorithms. Nevertheless, Motherboard reports that 21 of the 50 US states have adopted an automatic essay scoring system, and that 18 of them leave scoring to the automated system.

