Assessing in the age of AI
This term, the highest and lowest marks I awarded were to AI-augmented submissions. How are we judging quality when student output is co-generated by bots?
Whether we like it or not, students are using LLMs to produce their written assessments now.
There is an utter din in response to this situation — schools and universities making rules about scaffolding it, discouraging it, imposing bans, requiring declarations of use. But in our hearts we know none of this actually matters. Because students are using LLMs to produce their written assessments now. We can’t systematically detect it, and if we suspect it, we can’t label it. It’s just happening.
There’s plenty of talk about the role of teachers now expanding to deliver “AI literacy”, which, of course, has no clear definition. We never had a push for “search engine literacy”, despite it being an extremely similar idea: both kinds of tool require technical skill in crafting queries or prompts, both return results whose validity must be treated with caution, and the ethical implications of both running and using them are considerable.
But I feel pretty doubtful about the idea of requiring teachers to build and maintain such knowledge and skills. Ethical objections aside, I think of prompting skills as a bit like SEO skills: they aren’t static. We can come up with cute mnemonics all we like, but different bots respond differently to the same prompts, and then differently again over time as their underlying models are updated. So keeping up is not impossible, but it isn’t a one-shot mastery thing either, and teachers have enough to manage already without adding extras to their curricula.
I also don’t think changing all our assignments is the answer. It’s a Chicken Little response — “our foundations are crumbling! Our fundamental beliefs about the universe are wrong!” — to the introduction of a technology that lacks, and always will lack, the only thing that actually matters about assessment:
Judgement.
Assessment is about evaluation, for both students and teachers. For students, it’s about making decisions and producing responses to a brief, based on specified criteria but also on tacit knowledge of what their teachers value, gained through their learning and their relationships with those teachers. For teachers, it’s about exercising judgement on how well each response meets that brief.
The reality, of course, is that some teachers only ever valued the “right” answers. Some teachers never used judgement to assess; they simply checked their students’ submissions, compared them to the “right” answers, and gave out marks for each match. It wasn’t great practice before, and AI won’t make it any worse.
So what does assessment look like in the age of Gen AI?
A tale of two papers
I’ve just finished marking assessments for a unit I teach. In our faculty, using AI is not restricted, though students are expected to declare what they’ve used and how. Most of them do, and this is incredibly helpful in understanding how they’ve approached the task. Some of them don’t, and we can tell, and it’s very frustrating — not because we want to be able to accuse them, but because this makes it impossible to give feedback on how their AI processes could improve.
This term, the highest and lowest marks I awarded were to AI-augmented submissions.
The worst one was a disappointing stack of LLM list outputs, complete with Title Case Subheadings. Although there was no declaration of AI use, it wasn’t a stretch to imagine a student using a basic prompt with keywords from the unit and assignment brief to produce something shaped like an assessment paper, but which, taken as a whole, made little more sense than a few pages of Samuel L. Ipsum. It was grammatically sound (Americanisms aside) and used many important terms in coherent sentences, but as a complete piece of work it showed a total lack of judgement.
The best paper (by my own assessment, which is subjective) was one that contained a detailed disclosure of AI use, including an outline of targeted prompting approaches and a brief description of the iteration process. The paper itself was a sparkling synthesis of personal experience and evaluative judgement, and I was able to see that those aspects which had been delegated to the LLM (for instance, producing a conclusion paragraph and reducing the word count) did not obscure the student’s own voice or intentions.
I feel, essentially, that both papers received about the same mark they might have done if AI had not been involved. Because in the end, what I saw was the ability of each student to comprehend the assignment, produce a response, and then evaluate their own response to determine how well it addressed what I had asked for. There might be a few superficial enhancements (tidying up grammar or speeding up brainstorming) but there is no shortcut to evaluative judgement, and that, in the end, is how we evidence real learning.
What skills make the difference?
There is no rubric criterion for “effective use of AI”. And there shouldn’t be, because not every student uses it (or is able to, or wants to). But that’s not what I actually care about anyway. I am looking for evidence that students were able to apply the unit’s concepts to their designs; to justify their choices; to produce compelling, cohesive, long-form arguments.
Reflecting on the “good” submission, I began to think of AI-augmented writing skills broadly as requiring a combination of delegation and synthesis. While there are, clearly, very specific technical aspects to this, delegation and synthesis are far more consequential than fiddly techniques of prompt crafting.
Effective delegation requires the ability to recognise one’s own capabilities and the capabilities of the delegate, to break down the requirements of a task, and to choose communication approaches that will be effective to instruct and request help. This is true whether the delegate is a junior staff member, a child with a messy room, Claude, or HAL 9000. It takes communication and leadership.
Effective synthesis requires the ability to identify potentially fruitful connections between disparate components, speculate on ways they might come together, and create an original whole that both honours and is more than the sum of its parts. Synthesis takes creativity, yes, but more than that, it requires the ability to take ingredients and transform them into something new.
In theory, we are already infusing all of our teaching with these capabilities. Transferable skills aren’t things we teach once in one context and then watch transfer across other domains; they are things we teach again and again, in different contexts, to different degrees, in different ways.
So (I think) it’s not the tasks we assign that are in question in a Gen AI world, but the criteria we assess. The assigned tasks (assuming they were meaningful before) are still meaningful now. But are our rubrics doing the work we need them to?
Are we looking for surface features like tidy formatting and “right” answers?
Or are we seeking the messy, hard-to-distinguish signature of students’ emerging judgement?
Finally, I have to spruik Joanna Tai, Margaret Bearman, Rola Ajjawi and others in the CRADLE team, whose work on evaluative judgement is absolutely essential to this conversation.