Soaring standards in the age of AI assessment
When all university students have access to AI, is competency-based assessment the answer?
As usual, something is bothering me, so I’ve uncapped my pen.
Phillip Dawson, co-director of CRADLE, predicts that the rise of AI use in unsupervised university assessments will lead to one of three undesirable consequences: grade inflation, marking to a curve, or standards creep.
In brief:
Grade inflation is where all students’ grades get better. In a competitive graduate job market, a high GPA means nothing, because everyone has one.
Marking to a curve is where, instead of scoring each assessment on its individual merits, we rank them against one another. So students don’t just do well or badly — they do better or worse than each other.
Standards creep is where the requirements for getting a good grade go up. We raise the bar, because the students were clearing it too easily.
Of course, this all assumes that student work is getting better as a consequence of AI.
Is it?
Where’s the evidence?
I honestly don’t know if this is happening or not. I haven’t seen any large-scale data. In aggregate, my own students (for whom AI use in assessment is unrestricted) certainly aren’t doing better. To be painfully honest, some good students are doing worse. Some of the things I’ve observed from real students whose work I’ve marked over more than one term of study:
A passionate, engaged student with an HD track record who failed a major project because their use of GenAI was so excessive that the resulting work showed a total lack of judgement. It wasn’t just a good student this happened to; it was a good person.
A student with developing English, whose writing was incoherent until they started using GenAI as a language translator, and whose marks have shifted from Ns and Ps to Cs and Ds. It’s helping me see their actual abilities, not masking them.
Multiple students whose GenAI tools have inserted fake references into their work, which is easily detectable because the DOIs don’t exist. Once that piques your suspicion, you search for the title of the paper and discover it doesn’t exist either. Not only have they not read these “sources”, they haven’t verified them either. This has been happening since long before ChatGPT, but it’s more obvious now that they’re not even using real DOIs.
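As an aside for anyone checking references at volume: whether a DOI resolves is easy to test programmatically. The sketch below is a minimal illustration, not part of my actual marking workflow; it assumes the public doi.org handle-lookup endpoint (https://doi.org/api/handles/<doi>), which, as I understand it, answers with HTTP 404 for DOIs it has never heard of, and the `doi_exists` helper and example DOI are mine.

```python
# Minimal sketch: flag references whose DOIs do not resolve at doi.org.
# Assumes the doi.org handle-lookup endpoint described in the lead-in.
import json
import urllib.error
import urllib.parse
import urllib.request


def doi_exists(doi: str, timeout: float = 10.0) -> bool:
    """Return True if doi.org can resolve this DOI, False if it cannot."""
    url = "https://doi.org/api/handles/" + urllib.parse.quote(doi)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            payload = json.load(response)
    except urllib.error.HTTPError as err:
        if err.code == 404:  # doi.org does not know this DOI
            return False
        raise  # anything else (rate limiting, outage) is not evidence of fakery
    return payload.get("responseCode") == 1


if __name__ == "__main__":
    # Example DOI from the DOI Handbook; swap in the DOIs you're checking.
    for doi in ["10.1000/182"]:
        print(doi, "resolves" if doi_exists(doi) else "does not resolve")
```

A DOI that doesn’t resolve isn’t proof of fabrication on its own (legitimate DOIs get mistyped), but it tells you exactly which “sources” to go and search for by title.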
I asked Professor Dawson (who, please don’t mistake me, is a genuinely excellent dude) if he had data on this and he referred to anecdotal evidence from some academics who said students were all getting 100% on their tests now.
Of course, the first thing I thought, as a HASS person (humanities and social sciences, for those of you playing at home), is that 100% is something one gets on a spelling or arithmetic test. In HASS, 100% is sort of an imaginary grade, one that means you facilitated a near-transcendental experience for your assessor. It doesn’t not happen, but it’s vanishingly rare. (Shoutout to Catherine Smith who gave me the only one I ever got in my master’s!)
What’s in a perfect score?
If most of your students are getting 100%, then I’d say one of two things is happening:
Your standards were already too low. Raising the bar is necessary, and not because of AI.
Your assessment is actually competency based: it’s not about levels of achievement, but about whether students can or can’t do the thing. For example, in a test of construction-site safety or bowel surgery, if my students are getting 13/20, they should not be viewed as competent or ready to progress. That test should require 20/20; anything less means more learning is needed.
We need to be much, much more honest with ourselves about what our standards actually mean. Students know that Ps get degrees — and so do employers. As I’ve said before, allowing students to pass at 50% sends a very dubious message about the capabilities of graduates.
There is an essential place for competency-based assessment from the lowest to the highest levels of education. As someone with VET in my blood, I constantly bristle at the assumption that vocational-style assessment is only for the “lower classes of learner”.
In praise of competency-based assessment
Competency standards are the purest form of assessment: the work is marked against explicit criteria, with as little room for assessor subjectivity as possible. There are no rubric levels allowing the assessor to quibble over whether the student’s argument was an HD-worthy “comprehensive” or merely a D-rated “robust”.
Those championing the ungrading movement should consider this approach with care. The recorded judgement is something like “satisfactory” or “not satisfactory”, not pass/fail. The language of failure is carved deep into our imaginary of schooling, but it’s not welcome here. That recorded judgement means only that a student is ready (or not) to progress. This cleanly separates feedback from marking, a pairing that was a match made in hell if ever there was one.
That doesn’t mean it’s a model that fits every discipline or every task within it. This kind of absolutism is one of the reasons that HE and VET appear so hopelessly incompatible in Australia. But it’s a model that far more educators need to acknowledge is relevant for some of the tasks they assign.
If we know a task can be completed “perfectly” with GenAI (20/20), then perhaps it does make sense to ask our students to do it perfectly, and to give them access to the tools to do so.
This isn’t an end in itself. It’s not the final exam. It’s a crucial diagnostic that will reveal potential inequities, learning needs and resource needs in each cohort. It will, I hope, help us develop better ways and means of teaching what we teach. Something we don’t acknowledge enough is that assessment processes are not just for giving feedback to students. They are a vital form of feedback to education itself.
Coda: the wisdom of letting go
I’m really struggling with the conversation around unsupervised assessment right now. Universities are scrambling to find ways of shoring up their assessment methods against the rising tide of AI-facilitated cheating. In fact, I was in a staff meeting just last night where new approaches to supervised online assessment were presented to my teaching team.
I’ll leave aside the cheating discussion for now, and simply ask this.
What, exactly, are universities trying so hard to protect — and is it really worth protecting?