The Introduction of Large Scale Computer Adaptive Testing in Georgia

[Image captions: "we feel so lonely now that everyone has moved over to the computer room"; "broken?"; "is that a bug or a feature?"]

Testing: For better and/or for worse, many education systems are built around it (although some perhaps not as much as others). In numerous countries, a long-standing joke asks whether 'MOE' stands for 'Ministry of Education', or 'Ministry of Examination'? (This joke is not meant to be funny, of course.)

'Testing' is a source of and trigger for controversies of all different sorts, in different places around the world. The word 'standardized' is considered negative by many people when it is used to modify 'testing', but it is perhaps worth noting that a lack of 'standardized' tests can have important implications for equity. Within the U.S., the Obama administration recently declared that students should spend no more than 2% of classroom instruction time taking tests. One could debate the wisdom of setting such hard targets (one could argue, for instance, that it's not 'testing' per se that's the problem, to the extent that it is indeed a problem, but rather 'bad testing'). One could also predict the various types of time accounting shenanigans that might emerge so that the letter but not the spirit of such declarations is met (a school could be creative about what it feels constitutes a 'test' or 'instruction time', for example). However one feels about these things, there is no denying the centrality of testing to approaches to education in schools around the world.

'Testing' means different things to different people. There are important distinctions between assessments that are formative (i.e. low stakes means to provide feedback to teachers and students on how much students are learning, as a way to identify strengths and weaknesses and act accordingly) and those that are summative (e.g. high stakes final exams).

It's also potentially worth noting that tests can be utilized not only as means of assessment, but explicitly as tools to help learning as well (an approach sometimes called 'studying by testing'; here's an interesting related paper: When Does Testing Enhance Retention? A Distribution-Based Interpretation of Retrieval as a Memory Modifier [pdf]).

The point here is not to get into a debate about testing, as illuminating and energetic (or frustrating and political) as such a debate might be. Rather, it is to shine a light on some related things happening at the frontier of activities and experiences in this area that are comparatively little known in most of the world but which may be increasingly relevant to many education systems in the coming years.

The nature of tests and testing is changing, enabled in large part by new technologies. (Side note: One way to predict where there are going to be large upcoming public sector procurement activities to provide computing equipment and connectivity to schools is to identify places where big reforms around standardized testing are underway.) While there continues to be growing interest (and hype, and discussion, and confusion) surrounding the potential for technology to enable more 'personalized learning', less remarked on in many quarters is the potential rise in more personalized testing.

The science fiction author William Gibson has famously observed that 'The future is already here, it's just not evenly distributed.' When it comes to educational technology use around the world, there are lots of interesting 'innovations at the edges' that are happening far away from the spots where one might reflexively look (like Seoul, Silicon Valley or Shanghai, to cite just a few examples that begin with the letter 'S') to learn practical lessons about what might be coming next, and how this may come to pass.

When it comes to testing, one such place is ... Georgia. This is not the Georgia in the southern United States (capital = Atlanta, where people play football while wearing helmets), but rather the small, mountainous country that borders the Black Sea which emerged as a result of the breakup of the Soviet Union (capital = Tbilisi, where people play football with their feet).

Georgia is the first country in the world to utilize computer adaptive testing for all of its school leaving examinations.

What does this mean, what does it look like in practice, what is there to learn from this experience, and why should we care?

---

Perhaps not surprisingly, given that the institution employs literally thousands of economists, currently provides over US$14bn in financing for education systems around the world, and is an enthusiastic advocate for more data driven decisionmaking, the World Bank devotes a great deal of time and energy to documenting activities and supporting research on topics related to educational assessments [pdf].

As an input into discussions in Armenia, where they have been considering the potential adoption of computer-adaptive testing (CAT), the World Bank commissioned a background paper and case study of lessons from recent efforts in Georgia to introduce CAT-based school leaving examinations. The result -- The Introduction of Large-scale Computer Adaptive Testing in Georgia: Political context, capacity building, implementation and lessons learned [pdf] -- is a clear, lucid and step-by-step dissection and analysis of the Georgian experience. Its author, Dr. Steven Bakker, stopped by the World Bank recently to talk about the experiences in Georgia, together with senior officials from Georgia's Ministry of Education and its National Assessments and Examination Center (NAEC).

Bakker framed his discussion by asking (and then attempting to answer) the following question:

Why has it taken so long to introduce on-line assessments as part of large-scale, high-stakes exams in countries with long-standing examination traditions, such as the United Kingdom, the Netherlands and the United States, and why could it be introduced in Georgia almost overnight?

---

Sometimes, being first to something means that you will be late to whatever comes next. In international development circles, there is much talk of the potential for 'leapfrogging' as a result of the introduction of new technologies in some places, in comparison to other places which have to deal with legacies and inertial pulls of various sorts related to 'older' ones. While such 'leapfrogging' is often celebrated and seen as inevitably 'good', it is of course possible to leapfrog in the wrong direction.

To help make sure that the Armenians were orienting themselves as well as they could during their considerations of the potential applicability of computer-adaptive testing within their education system, Bakker explored lessons from various aspects of the Georgian experience, both in his paper and during the course of a fascinating presentation. These include: the political context; the human and material resources that were needed; the process of capacity building; how stakeholders were informed; the implementation of CAT in schools; costs; the impact it had on all involved and on the educational system in Georgia; and the limitations of CAT and its current implementation in Georgia, and possible ways forward. The German military leader Carl von Clausewitz is meant to have remarked that 'amateurs talk strategy, professionals talk logistics and implementation.' The context for von Clausewitz was war, of course, something slightly different from our context here, but in this regard Bakker's discussion was that of the 'professional'.

As with so many things, context is key, and there are many 'special' circumstances which may have led to this sort of innovation first being rolled out at scale in Georgia, and not some other country. First, the country itself is small, with a population of under four million people and a land size about that of Sri Lanka or Ireland (and half the size of the U.S. state with the same name). This small size meant that a 'big bang' approach to introducing CAT for school leaving exams was feasible: It was possible (although of course not always easy, especially in rural areas) to implement in every school, all at once. In 2008, the first cohort of students to complete 12 years of compulsory education graduated (before this time, only 11 years had been required), but the school leaving exams (which were administered by the schools themselves) were rather easy and, as a result, many grade 12 students were skipping school to work with private tutors to prepare for their university entrance examinations. Introducing standard national school-leaving exams, it was hoped, would help bring many of these students back into school for their final year.

In addition, Georgia has had a number of initiatives to introduce computers and the Internet in schools, and so basic levels of both technical infrastructure and comfort with technology were in place (both within the education system, and across broader Georgian society) which could serve as a foundation for the introduction of something like 'computer-adaptive testing'.

Georgia had another thing going for it that many countries (especially countries of its comparatively small size and level of economic development) do not: a local population that includes technical experts with strong psychometric, mathematical and technological skills who could plan for, implement and evaluate a technically complicated initiative such as CAT.

CAT, sometimes referred to as 'tailored' testing, is a type of computer-based testing that adapts to the ability level of the person taking the test. In basic terms: If you get a question correct, you get a more difficult question. If you get a question wrong, you get an easier question. Complex computer algorithms work to adapt the difficulty level of the test questions to better match the ability level of the test-taker, reducing e.g. the number of 'easy' questions that high ability test takers are shown, as well as the number of 'hard' questions that 'lower ability' test takers are shown (being shown more questions to which you don't know the answer increases the likelihood of guessing; getting a question 'correct' because you guessed doesn't actually demonstrate that you knew the answer). Potential benefits of CAT include a reduction in cheating (because test takers are shown different questions) and in time sitting for an exam (because test takers don't have to 'waste' time trying to answer too many questions which are either much too difficult or much too easy). This all sounds great in theory, but in practice doing something like this, and doing it well, is rather nontrivial.
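To make the 'harder after a correct answer, easier after a wrong one' idea concrete, here is a deliberately simplified sketch. It is not the algorithm used in Georgia (which, as noted later, has not been published); operational CAT systems typically rely on item response theory rather than the fixed 'staircase' steps assumed here. The function name, the difficulty scale, the shrinking step size and the simulated test taker are all assumptions made purely for illustration.

```python
def run_adaptive_test(item_difficulties, answer_fn, max_items=10, start_step=1.0):
    """Administer up to max_items, matching each item to the running estimate.

    item_difficulties: difficulties of the items in a (hypothetical) bank.
    answer_fn(difficulty) -> bool stands in for the test taker's response.
    Returns (final_ability_estimate, difficulties_administered_in_order).
    """
    remaining = list(item_difficulties)
    estimate = 0.0      # assumption: start in the middle of the scale
    step = start_step
    administered = []
    while remaining and len(administered) < max_items:
        # Pick the unused item whose difficulty is closest to the estimate.
        item = min(remaining, key=lambda d: abs(d - estimate))
        remaining.remove(item)
        administered.append(item)
        if answer_fn(item):
            estimate += step    # correct: the next item will be harder
        else:
            estimate -= step    # wrong: the next item will be easier
        # Make smaller corrections as evidence accumulates.
        step = max(step * 0.7, 0.1)
    return estimate, administered


# Hypothetical bank with difficulties from -3.0 to 3.0, and a simulated
# test taker who answers correctly whenever the difficulty is at most 1.2.
bank = [x * 0.25 for x in range(-12, 13)]
estimate, shown = run_adaptive_test(bank, lambda difficulty: difficulty <= 1.2)
print(round(estimate, 2), shown)
```

Even in this toy version, two test takers with different ability levels are quickly routed to different questions, which is the property behind both the claimed reduction in cheating and the shorter seat time.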

---

So: What exactly happened in Georgia?

As one account [pdf] of this initiative relates,

"In Georgia the Ministry of Education and Science decided in September 2010 to use computer adaptive testing (CAT) as the delivery mode for the re-introduced external school graduation exams and to conduct the first administration in May 2011. International experts were quite sceptical about the feasibility of a nation-wide rollout of such a logistically and technologically complex measurement instrument as a large scale CAT at such short notice. In May 2011, 44,000 students sat eight computer adaptive subject tests in one of the 1500 test centres established in Georgian schools."

This was the consequence of a short but technically rigorous consideration of a number of key questions, including:

  • Is it sufficiently known if CAT delivery in a computerized testing center will indeed be more secure than paper and pencil test in a local gym?
  • Will converting the [school leaving] test to CAT likely bring the expected reduction in test length?
  • Does the [anticipated] reduction in test length translate to enough saved examinee seat time – which can be costly – to translate into actual monetary savings?
  • Even if CAT costs more and does not substantially decrease seat time, are these disadvantages sufficiently offset by the increase in precision and security to make it worthwhile for the organization?
  • Does the organization have the psychometric expertise, or is it able to afford it if an external consultant is used?
  • Does the organization [NAEC] have the capacity to develop extensive [question] item banks?
  • Is an affordable CAT delivery engine available for use, or does the organization have the resources to develop its own?

For a discussion of how these (and many other related) questions were answered in Georgia, both conceptually and at a practical implementation level, have a look at Bakker's paper. It's quite good. You'll find clear details about lots of things, including: computer hardware and connectivity; some nitty-gritty related to the actual test itself; the institutional structures, people and skills necessary to pull something like this off; the information and advocacy campaigns that have been integral to the roll-out; and some mundane, practical details about how the test is actually administered. Countries who wish to learn even more about all of this may be interested to know that, during most years, Georgia hosts an international conference to share lessons from its efforts in this area, as well as to learn from related developments in other parts of the world.

---

For what it's worth, here are a few things that I took away from the presentation and follow on question-and-answer session with Bakker and representatives from the ministry of education and the National Assessments and Examination Center (NAEC), as well as with World Bank colleagues who know a *lot* about assessments in general, complementing what I learned from reading the paper (I took copious notes and hopefully I understood things correctly; if/where I did not, this was undoubtedly a consequence of my own cognitive and note-taking limitations, and not the ability of many experts and practitioners who patiently and kindly tried to explain things to me):

Costs
How much does it cost to do something like this? It is difficult to say. The paper makes an attempt to identify and quantify certain costs. That said, it is worth noting (as the Georgians do, based on their experience) that many costs are not immediately apparent and/or are absorbed in other ways. There are costs to create and maintain the question bank, but they would be doing much of this anyway. There are costs related to test administration. There are costs related to computers -- although, it should be noted, schools in Georgia already have lots of computers. There are new and substantial costs related to the surveillance cameras, and the storage of the related data. But by far the largest expense relates to training, certifying and paying the 2300 test proctors who are on-site in schools.

Designing for technology constraints
The CAT, and the infrastructure to support it, were explicitly designed to work within the existing technology constraints. This was an important design consideration (and, it is worth noting, usually a rather good practice). One result: Every single keystroke a student makes while taking a test is buffered (saved), so that, if there is a problem, nothing is lost, and students can re-start from exactly where they were before, should there be any glitches.
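One way to picture this kind of buffering is an append-only journal: every input event is written out immediately, and after a glitch the session is rebuilt by replaying the journal. The sketch below is an assumption-laden illustration, not a description of the Georgian system; the class name, the file format and the choice to journal to a local file (rather than, say, to a server) are all hypothetical.

```python
import json
import os


class AnswerJournal:
    """Append-only log of input events, replayed after a crash or restart."""

    def __init__(self, path):
        self.path = path

    def record(self, item_id, keystroke):
        # Append one event per line and force it to disk before returning,
        # so that a power cut or crash cannot lose an acknowledged keystroke.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"item": item_id, "key": keystroke}) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def restore(self):
        """Rebuild the typed text for each item from the saved events."""
        state = {}
        if not os.path.exists(self.path):
            return state
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                event = json.loads(line)
                state[event["item"]] = state.get(event["item"], "") + event["key"]
        return state
```

A restarted session would call `restore()` first and resume from the returned state. Writing on every keystroke costs some throughput, but for a single test taker typing answers that cost is negligible compared with the cost of losing work.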

Testing in advance of the test
Each year they hold a full scale mock online test for all 45,000 students in all 1600 testing centres about one month before the official testing commences. This has been important for a number of reasons. Perhaps most importantly, this test has meant that students, teachers, proctors, school officials and representatives from other stakeholder groups could actually see and experience the testing environments personally, which helped to dispel many related myths and worries. As part of this effort, students take a practice test on the computers. (That said, there are challenges related to student motivations in this regard: it can be difficult for students to take a practice test like this seriously if their scores don't count, which means that the scenario isn't a perfect proxy for what happens on actual test days). A large scale, synchronous test of this sort at school testing centres across the country is also a very good test for the robustness and reliability of the technology infrastructure which makes online testing of this sort possible.

Connectivity and bandwidth
Rolling out a national program for mandatory, synchronous online testing presented an interesting connectivity challenge in Georgia: How do you ensure adequate bandwidth at schools for the test takers? For a few weeks each year, the ministry of education needs absolutely reliable, very fast and secure broadband supporting lots of concurrent users for a few hours each day, available to *all* schools in all regions of the country. It is critically important not only that all schools have access to a common minimum level of bandwidth, but also that no school has access to bandwidth exceeding the defined minimum acceptable level to an extent that would give its test takers an advantage. ISPs in many countries, especially small ones, don't typically get clients with these sorts of demands, and MOEs aren't used to negotiating for (or even spec'ing out) what they need in this regard. This means that there are new actors on both the supply and demand sides of the connectivity equation, both of which are exploring new products and services, for which competitive local markets do not (yet) exist. This makes the process of tendering rather ... challenging.

In Georgia there are currently around 1600 testing centres in use in schools. 570 of these schools had fibre connections to the Internet; the rest did not, and so these less (and in some cases un-) connected schools needed to be connected wirelessly in order to meet the minimum connectivity requirements to function as test centres. (Local wi-fi networks have been benchmarked to ensure that they are able to handle 30 concurrent students, each of whom needs 30kbps continuously for the testing application to work.) The Internet provider in Georgia has been able to provide a wireless CDMA/EVDO connection (i.e. not the standard 3G connection normally available to wireless customers in Georgia) to each school test centre. This has meant that testing centres could be installed in schools in the mountains, resulting in less stress for students who otherwise would have had to travel from their home villages to other schools in order to take part in the tests. (In some rural communities, because wireless traffic to/from schools takes precedence on the related mobile networks, during the time of testing it can be more difficult for people in the village to get a mobile signal.)
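The benchmark figures above imply a simple back-of-the-envelope calculation, sketched here. The per-centre number follows directly from the figures in the text (30 students at 30kbps each). The national figure is only an upper bound under an assumption the source does not make: that all 45,000 students are online at the same moment. The function and variable names are illustrative.

```python
def required_kbps(concurrent_students, kbps_per_student):
    """Sustained bandwidth needed if every student is online at once."""
    return concurrent_students * kbps_per_student


# Per test centre: 30 concurrent students at 30 kbps each.
per_centre_kbps = required_kbps(30, 30)            # 900 kbps

# Assumed worst case: all 45,000 students tested simultaneously.
national_kbps = required_kbps(45_000, 30)          # 1,350,000 kbps
national_gbps = national_kbps / 1_000_000          # 1.35 Gbps

print(per_centre_kbps, national_gbps)
```

Seen this way, the challenge is less about raw speed (under 1 Mbps per centre) than about guaranteeing that modest rate continuously and simultaneously at every centre, including the remote ones.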

Cheating and security
A number of actions have been taken to try to prevent cheating, as well as to detect it as quickly as possible, where it might nonetheless occur. Most notably, and expensively, and intrusively, 1800 surveillance cameras were purchased and installed in schools. Within each school testing centre, empty seats were placed between each student, to make it difficult for them to whisper to each other or pass notes. (Computers were placed on the desktops in front of these empty seats, so that if a computer went down, a student could just slide over to an adjacent seat and continue with very little delay.) Monitoring software was installed on the computers, and certain basic functionality (e.g. screen shots, printing) was disabled. Internet access was also disabled except for access to the testing web site itself. A special shell or user interface ensured that test takers could only see the test on the screen; they couldn't access any other applications or windows. USB ports were also closed, disabled or removed. No mobile phones were allowed in the test centres.

(I asked if they had considered measures that have been implemented in some places in China, where drones hover above schools where tests are occurring in order to block attempts to communicate with test takers who have smuggled in phones; they had not considered this, but were intrigued that this was being done!)

These measures have meant that there has been very little apparent cheating or 'leakage' of test questions -- at least so far. It is possible for students to memorize the questions as they appear on the screen and then later share them on the Internet, but the testing authorities monitor for this in various ways, including mentions on social media (a practice that, it is worth noting, has generated controversy in other parts of the world). To date, this reportedly hasn't been a problem (or if it has been a problem, the authorities haven't discovered it yet).

The bank of test questions itself has apparently not been compromised either. (That said, there was no test in 2013, when the head of the assessments centre was dismissed. When he and his team were reinstated, there were concerns about whether the integrity of the test bank had been compromised, so an all new question bank had to be created.)

No attempts to take down the networks utilized for testing or the testing servers have yet been detected; I assume that such things are unfortunately not a matter of if, but when. (They don't -- yet -- do penetration testing, i.e. explicitly hire groups to try to hack the system in some way, sit at a testing centre and cheat, etc., but that might be something that will come with time.)

Transparency and openness
In the minds of many policymakers, there are important tradeoffs to consider related to security and transparency. Where stakeholders are not able to see the individual items after the test has been completed and scored, trust is key. For a number of reasons, and as a result of numerous related actions, gaining the trust of key stakeholder groups in Georgia has been achievable. Some related questions potentially worth asking when considering the potential relevance of the Georgian experience to other countries: Do citizens trust their government with this stuff, given that it is very difficult to have such things independently audited? If the test questions are kept secret, and the algorithms at the heart of CAT remain proprietary and/or unpublished -- as has been the case in Georgia -- what measures to promote transparency and trust might be useful to consider?

Some related idle speculation:
It would potentially be possible to have the surveillance cameras installed to help prevent cheating in school test centres monitored remotely in real time ... perhaps in a call centre in another country, or even in a 'crowd-sourced' way by the general public over the Internet. My point in speculating about this isn't to suggest that such things are good ideas, rather to note that such things are becoming technically possible. Inevitably, someone will eventually propose to do such types of things, and so policymakers may wish to anticipate and start thinking about how they might feel about such proposals.

User buy-in
Soliciting and listening to feedback from a number of key stakeholders groups has been very important to this whole effort in Georgia. According to the paper, "The need for a well-planned and coordinated information campaign was seriously underestimated by the Ministry of Education, which resulted in a variety of rumors and fears of massive percentages of students failing and punitive measures being taken against under-achieving schools. Eventually, NAEC, using its image as a reliable institution and its good relations with schools, managed to convince the educational community that the test results would not be used against them, but would support them in achieving their goals."

School principals have generally been supportive of the move to CAT. They especially like that they receive feedback as a result of CAT that they didn't get before (e.g. about how their school compares to others, to see how students did on individual test item categories, etc.). That said, principals haven't liked that they couldn't see individual test questions, which makes it difficult to appeal, should someone wish to contest their score. (Students see the test questions on screen, but once they go to the next screen, they can't go back.) Generally speaking, teachers remark that CAT appears to be fair, but note that this is just one of many means of possibly assessing student performance. They especially like that CAT hasn't been used for 'accountability' purposes (i.e. to punish teachers or principals if students don't perform well enough). There is regret, by both teachers and students, that students can't change their answers once they have been entered. (On this point: Psychometricians respond that, in CAT, it doesn't really matter psychometrically if you make a mistake and can't correct it, because you effectively 'correct' things as you answer subsequent questions and arrive in the end at your appropriate 'level'. That said, realizing that you have made a mistake but are unable to correct it can affect test takers in other ways.) Upon completing a test, students immediately learn whether they have passed or not. Some students (and their parents) have complained that, under CAT, some students can finish the test in only 15 minutes (because they were either very high or very low scorers and so their level could be determined relatively quickly) while it took other students 45 minutes to complete a test.
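Why might one student finish in 15 minutes and another need 45? The source does not describe the stopping rule used in Georgia, so the following is only one plausible illustration: stop as soon as the ability estimate is confidently above or below the pass mark. The sketch assumes a Rasch (one-parameter) model for item information and a 95% confidence band; the function names, the model and the threshold are all assumptions.

```python
import math


def rasch_information(ability, difficulty):
    """Information one item contributes under an assumed Rasch model."""
    p = 1.0 / (1.0 + math.exp(-(ability - difficulty)))
    return p * (1.0 - p)


def standard_error(ability, administered_difficulties):
    """Standard error of the ability estimate after the items seen so far."""
    total = sum(rasch_information(ability, d) for d in administered_difficulties)
    return math.inf if total == 0 else 1.0 / math.sqrt(total)


def can_stop(ability, administered_difficulties, cut_score, z=1.96):
    """True once the pass mark lies outside the confidence band."""
    se = standard_error(ability, administered_difficulties)
    return abs(ability - cut_score) > z * se
```

Under a rule like this, a test taker whose estimate sits far from the pass mark (very high or very low) satisfies the condition after only a few items, while someone hovering near the pass mark needs many more questions before a pass/fail decision can be made with the same confidence.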

The advantages and disadvantages of keeping your expertise in-house
At the NAEC, the Georgian Ministry of Education has 'in-house' expertise that can plan for, manage and evaluate the entire CAT implementation process. This arrangement creates an environment supportive of iteration (constant, regular improvement over time) and ensures continuity, things that might not be supported to the same extent if the entire process were to be contracted out to e.g. a private company. There are risks associated with concentrating all of this sort of expertise within a single institution, however. (Marveling at what the Georgians have done, I wondered about the potential for a company to be spun off to commercialize what has been done and sell the resulting products and services to other countries.)

Unintended consequences
The implementation of new technologies almost always brings with it unintended, and unforeseen, consequences. One example from Georgia: As in many other countries, the Ministry of Education is keen to see a decrease in private tutoring (which some might argue is, in certain ways, an indirect privatization of education). Bringing students back to school to prepare for the school leaving exam means that they don't skip school to spend that time with their private tutors for university entrance exams. Some students, however, respond that now they feel the need to go to tutors for *both* the school leaving and the university entrance exams, meaning that there is an overall increase in private tutoring (although whether this is actually the case or not is still to be determined).

---

Taken together, lessons from the experience of Georgia in introducing computer-adaptive testing for all of its school leaving exams underscore that, in the end, success or failure when it comes to introducing new technologies in education is not really a result of the technology itself. The technology is important, of course -- indeed it is critical, and it can be difficult to 'get things right', especially the first time around -- but making all of the right technology decisions does not necessarily mean that an initiative will be successful as a result. In the end it is about the ability and capacity of key actors within an education system to plan for, manage, absorb and implement change. It is through this lens of change management that the successes and challenges of CAT in Georgia should perhaps best be considered. There is certainly much that can be learned in this regard from Georgia's pioneering example.

Note: The image used at the top of this post of an empty examination room ("We feel so lonely now that everyone has moved over to the computer room") comes from Wikimedia Commons and is in the public domain. The picture of the broken pencil ("broken?") comes from the Wikipedian Pwlps via Wikimedia Commons and is used according to the terms of its Creative Commons Attribution-Share Alike 4.0 International license. The picture of a Luprops beetle on a pencil in Kerala ("is that a bug or a feature?") is from Mathews Sunny Kunnelpurayidom via Wikimedia Commons and is used according to the terms of its Creative Commons Attribution-Share Alike 3.0 Unported license. The final image, of a mountain peak near Ushguli, Georgia, was originally posted to Flickr by Ilan Molcho. It was discovered via Wikimedia Commons and is used according to the terms of its Creative Commons Attribution 2.0 Generic license.

cross posted at blogs.worldbank.org/edutech

Michael Trucano is the World Bank's Senior Education & Technology Policy Specialist and Global Lead for Innovation in Education, serving as the organization's focal point on issues at the intersection of technology use and education in middle- and low-income countries and emerging markets around the world. Read more at blogs.worldbank.org/edutech.