The Turkish Online Journal of Educational Technology - TOJET January 2006 ISSN: 1303-6521 Volume 5, Issue 1, Article 9
THE EFFECT OF COMPUTERS ON THE TEST AND INTER-RATER RELIABILITY OF WRITING TESTS OF ESL LEARNERS
Dr. Selami AYDIN
Atatürk Üniversitesi
Dil Eğitimi-Öğretimi,
Uygulama ve Araştırma Merkezi
25240
Erzurum
ABSTRACT
This research aimed to investigate the effect of computers on the test and inter-rater reliability of the writing test scores of ESL learners. Writing samples from 20 pen-paper and 20 computer group students were scored by two scorers using an analytic scoring method, and the scores were then analyzed with the Alpha (Cronbach) model. The results showed that the test and inter-rater reliability of the writing samples of the computer group students were significantly higher than those of the pen-paper group participants.
Key Words: English as a second language, computers, writing test, test reliability, inter-rater reliability
INTRODUCTION
Since the 1970s, computers have been present in schools and homes, and computer use has had a considerable influence on education (Zandvliet and Farragher, 1997). For three decades, educational theorists and researchers have proposed many ways in which computers influence education. As a result, in recent years there has been an explosion of interest in using computers in language teaching, learning and testing. Today, the role of computers in language instruction is a significant issue confronting large numbers of language teachers throughout the world (Warschauer and Healey, 1998).
The turning point for computer use in language testing was item response theory, which made individualized test taking possible. Advances in item response theory and computer technology played a greater role in the development of language testing in the 1990s, and an extensive literature has developed to examine the effectiveness of computer-assisted language learning (CALL) (Brown, 1997). The literature on computers and language testing has focused on four issues: item banking, computer-assisted language testing, computer-adaptive language testing, and the effectiveness of computers in language testing. However, computer use in language testing is still a specialized area (Brown, 1997).
Computers have also become an accepted tool in writing classes, and research on various aspects of the writing process on computer has mushroomed in the last decade (Phinney,
Research on the test and inter-rater reliability of writing tests of ESL students has also produced conflicting and inconclusive results (McNamara, 1996). Some studies showed that scorers assigned lower scores to computer versions of the tests than to the pen-paper ones (Bridgeman and Cooper,
Finally, this study was motivated by the following concerns:
1. Although many studies have been conducted on computer use in native language writing, little research has appeared on second language writing.
2. Studies have not established a consensus on the effects of computers in testing the writing skills of ESL writers.
3. There are no conclusive empirical data on the effect of computers on the test and inter-rater reliability of writing tests of ESL learners.
In sum, these concerns show that it is necessary to study the effects of computers on the test and inter-rater reliability of writing tests of ESL learners. Accordingly, the study has one research question: What is the effect of the computer on the test and inter-rater reliability of writing tests of ESL learners in analytic scoring?
METHOD
The sample groups consisted of 40 second-year students in the English Language Teaching Department at the Faculty of Education at Atatürk University in Erzurum, Turkey. Second-year students were chosen because they had writing and computer classes in the same term, spring
Since differences in writing ability between the participants in the pen-paper and computer groups seemed a significant variable that could affect reliability, the students were assigned to the groups so that writing ability was balanced. To this end, the final exam scores of the writing and computer classes of the previous term were used as criteria. Then, computer versions of the pre- and posttests were administered to the participants in the computer group. Similarly, pen-paper versions of the pre- and posttests were administered to the students in the pen-paper group.
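The paper does not spell out the exact assignment procedure. One common way to balance two groups on a prior score is matched-pair assignment, sketched below as an illustration only. The function name `assign_matched_groups`, the student identifiers, and the use of a single prior score per student are all assumptions; a composite of the writing and computer exam scores could be used in the same way.

```python
from statistics import mean


def assign_matched_groups(prior_scores):
    """Split students into two groups balanced on a prior exam score.

    prior_scores maps a student id to that student's prior score.
    Students are ranked, taken in adjacent pairs, and the stronger
    member of each pair alternates between the two groups.
    """
    ranked = sorted(prior_scores, key=prior_scores.get, reverse=True)
    group_a, group_b = [], []
    for i in range(0, len(ranked), 2):
        pair = ranked[i:i + 2]
        # On every second pair, send the stronger student to group B.
        if (i // 2) % 2 == 1:
            pair.reverse()
        group_a.append(pair[0])
        if len(pair) == 2:
            group_b.append(pair[1])
    return group_a, group_b


# Hypothetical prior scores (0-100 scale), NOT data from the study.
scores = {"s1": 90, "s2": 88, "s3": 75, "s4": 74,
          "s5": 60, "s6": 59, "s7": 50, "s8": 48}
pen_paper, computer = assign_matched_groups(scores)
difference = mean(scores[s] for s in pen_paper) - mean(scores[s] for s in computer)
print(pen_paper, computer, difference)
```

Checking the mean difference between the resulting groups, as the study does for the previous semester's exam scores, is a simple way to confirm that the assignment left the groups comparable.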
All the participants were Turkish students who were ESL learners at upper-intermediate level. Three topics for the pretest and three for the posttest, chosen from the TOEFL practice tests (see Appendix 1), were given to the participants in the pen-paper and computer groups. The participants were asked to respond to only one topic and to write in free writing style.
The computer lab consisted of
Since the study focused on the test and inter-rater reliability of writing samples, the interval between the administration of the pre- and posttests was one week, and the participants did not receive writing instruction during this time. In other words, students’ progress was not a variable in the research. The pen-paper versions and printouts of the computer versions of the tests were then delivered to the scorers.
The two scorers were teachers in the ELT department with PhD degrees in English language teaching. Each had taught writing and had administered and scored writing tests in the ELT department for at least fifteen years. Each scored the tests without seeing the scores given by the other. A scoring rubric for writing proficiency with a range of 0–100 points was developed (See Appendix
DATA ANALYSIS
Since the writing ability and computer familiarity of the participants could affect the reliability of the writing tests administered in the study, the means and standard deviations of the final examination scores of the writing and computer classes in the previous semester were analyzed and are presented in Table 1. The mean differences between the groups' previous semester scores were 1.3 in the writing class and 0.4 in the computer class on a scale of 0–100. The data showed that there were no significant mean differences between the groups in either writing ability or computer familiarity.
Table 1. The Mean and Standard Deviations of the Previous Writing and Computer Exam Scores

The means of the pre- and posttest scores given by the two scorers are presented in Table 2. When the values in Table 1 were compared with those in Table 2, it was seen that the participants had lower scores in the pre- and posttests than in the previous semester's exams. Consistent with Phinney's (1991) observation that computer use seemed to have positive effects on second language writers, the computer group participants had higher scores, with mean differences between the groups of 0.53 for the pretest and 3.57 for the posttest.
Table 2. The Mean of the Pre- and Posttests

The mean text lengths were 226.5 words for the pen-paper and 281.2 words for the computer group participants. Although text length is related to writing quality rather than to the reliability of the tests, the notable point was that the computer group students produced longer texts, as Phinney (
Table 3. The Word Length of the Texts

For each paper, the pre- and posttest scores given by the two scorers were averaged, and test reliability was computed with Alpha (Cronbach), a model of internal consistency based on the average inter-item correlation. From the means, standard deviations and pre- and posttest Alpha (Cronbach) values presented in Table 4, three results can be discussed. First, for both groups, the posttest means were lower than the pretest means. Since the analytic scoring procedure was applied to both groups, the scoring method was not the factor affecting the results. The different topics given for the pre- and posttests, the writing medium, and the scorers’ experience with the scoring table could have influenced the scores. However, since the research focused on test reliability, the mean differences mattered mainly as an indication of the consistency between the tests. Second, the mean difference between the pre- and posttest was larger in the pen-paper group (4.02 points) than in the computer group (0.98 points). When the data in Tables 1 and 4 are considered, it can be seen that the computer group participants had higher scores. Third, the reliability analysis showed that the computer group scores were more consistent when the Alpha (Cronbach) values and standard deviations were considered, and that the reliability coefficient of the computerized papers (0.9857) was significantly higher than that of the handwritten ones (0.6111). In sum, it seemed that the computer had a considerable effect on test reliability in analytic scoring.
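For reference, the standard variance form of Cronbach's Alpha is given below. The symbols are notation introduced here, not taken from the paper: $k$ is the number of items, $\sigma_i^2$ the variance of item $i$, $\sigma_T^2$ the variance of the summed scores, and $\sigma_{12}$ the covariance of two items. In this study each coefficient appears to rest on $k = 2$ items (the pretest and posttest scores for test reliability, or the two scorers' scores for inter-rater reliability), which gives the second expression.

```latex
\[
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^{2}}{\sigma_T^{2}}\right)
\]
\[
k = 2:\qquad
\sigma_T^{2} = \sigma_1^{2} + \sigma_2^{2} + 2\,\sigma_{12},
\qquad
\alpha = \frac{4\,\sigma_{12}}{\sigma_1^{2} + \sigma_2^{2} + 2\,\sigma_{12}}
\]
```

When the two items have equal variances and correlation $r$, the two-item expression reduces to $\alpha = 2r/(1+r)$, which shows why Alpha rises toward 1 as the two sets of scores become more strongly correlated.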
Table 4. Test Reliability Coefficients
| Groups | Tests | Mean | Standard Deviation | Alpha (Cronbach) |
|---|---|---|---|---|
| Pen-paper | Pretest | 57.00 | 10.35 | 0.6111 |
| | Posttest | 52.98 | 12.20 | |
| Computer | Pretest | 57.53 | 17.63 | 0.9857 |
| | Posttest | 56.55 | 16.93 | |
The inter-rater reliability coefficients were computed between the scores given by the two scorers in analytic scoring. In Table 5 and Figure 1, the means, standard deviations and inter-rater reliability coefficients in the Alpha model are compared for the pre- and posttest scores of the pen-paper and computer group participants. The scores given by the first and second scorers for each paper were used to compute the Alpha value. The findings presented in Table 5 and Figure 1 suggested that the inter-rater reliability coefficients of the computerized versions of the papers were considerably higher than those of the handwritten papers. In sum, it seemed that the computer had a significant effect on the inter-rater reliability of the writing tests of ESL learners in analytic scoring, contrary to the studies that showed scorers assigned lower scores to computer versions of the tests than to the pen-paper ones (Bridgeman and Cooper,
Table 5. Inter-rater Reliability Coefficients of the Tests
| Groups | Tests | Scoring | Mean | Standard Deviation | Alpha (Cronbach) |
|---|---|---|---|---|---|
| Pen-paper | Pretest | First | 57.95 | 11.03 | 0.6790 |
| | | Second | 56.05 | 12.71 | |
| | Posttest | First | 53.25 | 13.15 | 0.8752 |
| | | Second | 52.70 | 12.72 | |
| Computer | Pretest | First | 58.30 | 18.81 | 0.9892 |
| | | Second | 56.75 | 16.57 | |
| | Posttest | First | 56.80 | 17.37 | 0.9929 |
| | | Second | 56.30 | 16.52 | |
[Figure 1 panels. Pen-paper Group: Pretest (Alpha = 0.6790), Posttest (Alpha = 0.8752). Computer Group: Pretest (Alpha = 0.9892), Posttest (Alpha = 0.9929).]
Figure 1. The Consistency between the Scorers
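As a rough illustration of how an Alpha coefficient between two scorers can be computed, the sketch below applies the usual variance form of Cronbach's Alpha to two lists of scores for the same papers. The function name `cronbach_alpha` and the example scores are hypothetical, and the study's own coefficients were presumably produced with a statistics package whose exact options are not reported.

```python
from statistics import variance


def cronbach_alpha(items):
    """Return Cronbach's Alpha for k score lists of equal length.

    Each inner list holds one "item": here, one scorer's scores for
    the same papers, listed in the same paper order.
    """
    k = len(items)
    n = len(items[0]) if items else 0
    if k < 2 or n < 2 or any(len(col) != n for col in items):
        raise ValueError("need at least two equal-length lists of two or more scores")

    # Sum of the sample variances of the individual items.
    item_variance_sum = sum(variance(col) for col in items)
    # Sample variance of each paper's total score across items.
    totals = [sum(paper_scores) for paper_scores in zip(*items)]
    total_variance = variance(totals)
    if total_variance == 0:
        raise ValueError("total scores have zero variance; Alpha is undefined")

    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Hypothetical scores (0-100 scale) for four papers, NOT data from the study.
first_scorer = [50, 60, 70, 80]
second_scorer = [55, 58, 72, 79]
print(round(cronbach_alpha([first_scorer, second_scorer]), 4))
```

The same function can be applied to the averaged pretest and posttest scores of a group to obtain a test reliability coefficient. Note that Alpha is unaffected by a constant difference between the two scorers, so it measures the consistency of their rankings rather than agreement on absolute score levels.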
CONCLUSION AND DISCUSSION
One of the results was that the scores of the computer versions were higher than those of the pen-paper ones. However, in some studies (Bridgeman and Cooper,
The findings of the study showed that the test and inter-rater reliability of the writing test scores of the ESL learners in the computer group in analytic scoring were significantly higher than those of the pen-paper group participants. Indeed, the reliability of a test depends on several factors: scoring method, scale length, text length, writing approach or method, topic, writing abilities and progress level of writers, and raters (Penny, Johnson and Gordon,
Some limitations of the research can be noted. First, the study is limited to the ESL learners at the ELT Department of the Education Faculty of Atatürk University, Erzurum, Turkey. Second, the compositions were written using a free writing approach, and the tests were scored analytically. Third, the different topics presented in the pre- and posttests might be a factor that affected the scores. In sum, the results of the study are limited to ESL writers at upper-intermediate level, the free writing approach, the scale presented below, and analytic scoring.
Considering that the study is limited to the test and inter-rater reliability of writing tests of ESL writers, further research should focus on the factors that affect the attitudes of scorers and writers. The scoring scale, the comparison of holistic and analytic scoring, different writing approaches and methods, and the topics of writing exams are other areas to be investigated. Finally, the writing abilities and progress level of participants are further factors that should be researched.
REFERENCES
Breland, H. (
Bridgeman, B., & Cooper, P. (
Brown, J. S. (
Brown, J. D. (
Bunderson, C. V., Inouye, D. K. & Olsen, J. B. (
Daiute, C. (
Dalton, D. W., & Hannafin, M. J. (
Dunkel, P. (
Hee-Kyung, L. (
Krashen, S. (
McNamara, T. (
Neu, J. & Scarcella, R. (
Penny, J., Johnson, R. L., & Gordon, B. (
Phinney, M. (
APPENDIX 1: Writing Topics
1. Pretest items for the pen-paper and computer group participants:
a. When choosing a place to live, what do you consider most important: location, size, style, number of rooms, types of rooms, or other features? Use reasons and specific examples to support your answer.
b. Films can tell us a lot about the country in which they were made. What have you learned about a country from watching its movies? Use specific examples and details to support your response.
c. Because of developments in communication and transportation, countries are becoming more and more alike. How is your country becoming more similar to other places in the world? Use specific examples and details to support your answer.
2. Posttest items for the pen-paper and computer group participants:
a. People attend colleges or universities for many different reasons (for example, new experiences, career preparation, and increased knowledge). Why do you think people attend colleges? Use specific reasons and examples to support your answer.
b. If you could change one important thing about your hometown, what would you change? Use specific reasons and examples to support your answer.
c. If you had the time and money to invent something new, what product would you develop? Use specific details to explain why this product is needed.
APPENDIX
Writing Proficiency Scoring Table

| Student’s Name | | Scorer’s Name | |
|---|---|---|---|
| Student’s Number | | | |

| Criteria / Point | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Vocabulary | | | | | | | | | | | |
| Accuracy (Grammar and structure) | | | | | | | | | | | |
| Organization | | | | | | | | | | | |
| Originality and Creativity | | | | | | | | | | | |
| Unity and Coherence | | | | | | | | | | | |
| Relevance | | | | | | | | | | | |
| Mechanics | | | | | | | | | | | |