Australian government assessment of AI vs. human performance
« previous post | next post »
"AI worse than humans in every way at summarising information, government trial finds:
A test of AI for Australia's corporate regulator found that the technology might actually make more work for people, not less." Cam Wilson, Crikey (Sep 03, 2024)
Artificial intelligence is worse than humans in every way at summarising documents and might actually create additional work for people, a government trial of the technology has found.
Amazon conducted the test earlier this year for Australia’s corporate regulator, the Australian Securities and Investments Commission (ASIC), using submissions made to an inquiry. The outcome of the trial was revealed in an answer to a question on notice at the Senate select committee on adopting artificial intelligence.
The trial involved evaluating several generative AI models before selecting one to ingest five submissions from a parliamentary inquiry into audit and consultancy firms. The most promising model, Meta’s open-source Llama2-70B, was prompted to summarise the submissions with a focus on ASIC mentions, recommendations, and references to more regulation, and to include page references and context.
For the full government report, click on the link embedded on "answer" in the second paragraph of the Crikey article above. The remainder of the Crikey summary article itself may be found in its sister publication, The Mandarin, here:
Ten ASIC staff, of varying levels of seniority, were also given the same task with similar prompts. Then, a group of reviewers blindly assessed the summaries produced by both humans and AI for coherency, length, ASIC references, regulation references and for identifying recommendations. They were unaware that this exercise involved AI at all.
These reviewers overwhelmingly found that the human summaries beat their AI competitors on every criterion and every submission, scoring 81% on an internal rubric compared with the machine’s 47%.
Human summaries ran up the score by significantly outperforming on identifying references to ASIC documents in the long document, which the report notes is “notoriously hard” for this type of AI. But humans still beat the technology across the board.
Reviewers told the report’s authors that AI summaries often missed emphasis, nuance and context; included incorrect information or missed relevant information; and sometimes focused on auxiliary points or introduced irrelevant information. Three of the five reviewers said they guessed that they were reviewing AI content.
The reviewers’ overall feedback was that AI summaries may be counterproductive and create further work because of the need to fact-check and refer back to the original submissions, which communicated the message better and more concisely.
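The blind rubric scoring described above can be sketched in a few lines. This is a hypothetical illustration only: the report does not publish its criteria weights or rating scale, so the five criteria names, the 0–5 scale, and the sample ratings below are assumptions, not the trial's actual data.

```python
# Hypothetical sketch of blind rubric scoring: each reviewer rates a summary
# per criterion, and the ratings are aggregated into a single percentage.
CRITERIA = ["coherency", "length", "asic_references",
            "regulation_references", "recommendations"]
MAX_SCORE = 5  # assumed per-criterion maximum

def rubric_percentage(ratings):
    """Aggregate a list of per-criterion rating dicts into a percentage."""
    total = sum(r[c] for r in ratings for c in CRITERIA)
    possible = MAX_SCORE * len(CRITERIA) * len(ratings)
    return round(100 * total / possible)

# Illustrative ratings only -- not the trial's actual data.
human_ratings = [{"coherency": 4, "length": 4, "asic_references": 4,
                  "regulation_references": 4, "recommendations": 4}]
ai_ratings = [{"coherency": 3, "length": 2, "asic_references": 2,
               "regulation_references": 2, "recommendations": 3}]

print(rubric_percentage(human_ratings))  # 80
print(rubric_percentage(ai_ratings))     # 48
```

Because reviewers score summaries without knowing their origin, a single aggregation like this lets human and machine output be compared on identical terms, which is what makes the 81% vs. 47% gap meaningful.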
The report mentions some limitations and context for this study: the model used has already been superseded by one with further capabilities that may improve its ability to summarise information, and Amazon increased the model’s performance by refining its prompts and inputs, suggesting that further improvements are possible. The report is optimistic that this task may one day be competently undertaken by machines.
But until then, the trial showed that a human’s ability to parse and critically analyse information is unparalleled by AI, the report said.
“This finding also supports the view that GenAI should be positioned as a tool to augment and not replace human tasks,” the report concluded.
AI is not going away, but through careful appraisals such as this one from the Australian government, we are gaining a better perspective on its pluses and minuses.
Selected readings
- "The AI threat: keep calm and carry on" (6/29/23)
- "ChatGPT has a sense of humor (sort of)" (6/10/23)
- "InternLM" (6/10/23) — with a long bibliography
[h.t. Kent McKeever]
loon said,
September 4, 2024 @ 2:59 am
"[…] GenAI should be positioned as a tool to augment and not replace human tasks […]" – the adage 'AI should take on the menial tasks, leaving us free to pursue art, but it took the art first' comes to mind. AI produces bullshit [the Frankfurtian term]. It has to; that is how it is currently implemented. There are scant, but some, applications where bullshit is used productively [not as in 'in production' but as in 'furthering the stated goals']. Bureaucracy is not one of them. I applaud the government entity for reviewing this (instead of blindly implementing it), while hoping it happened against the will of the technical staff (as in: 'we do not want or need this, it is bullshit, no need to evaluate').
Kenny Easwaran said,
September 5, 2024 @ 12:19 pm
Re: Loon –
Much regulatory work is bullshit in precisely the same sense. The person filling out the paperwork regarding flood control and discriminatory impacts and so on and so on often has literally zero concern with the things they are doing – they just know that they have to do it, and so they do. Obviously, it would usually be better to have a person who actually cares about the topic fill out each piece of paperwork. But we have these regulations anyway because there is still some value provided in ensuring that someone at least pretends to think about the issue – if there are obvious problems with a project, they will often come up in that paperwork, whether or not the person doing the paperwork cared about the issue.
With a lot of these things, the question isn't whether the AI will do as good a job as a person who is motivated to care – the question is whether it will do as good a job as the person who actually does the work otherwise, and whether it will introduce more or fewer delays into the project as a result.
But it strikes me as missing the point if you think a government agency should say "this is bullshit, we have no need to evaluate" – every government agency is almost always in the business of evaluating bullshit, because the bullshit is actually still valuable, independent of the intentions of the person doing it.
Kate said,
September 5, 2024 @ 2:44 pm
"AI is not going away, but through careful appraisals such as this one from the Australian government, we are gaining a better perspective on its pluses and minuses."
Hmmm, yes, those long-promised "pluses" to using AI. Still waiting on those, apart from potentially making certain wealthy techbros even wealthier. At least, so long as their options vest before the AI bubble bursts.
Andrew Usher said,
September 5, 2024 @ 8:18 pm
Kenny Easwaran:
You are quite correct that regulators may not have any personal interest in the matters they prepare paperwork on. However, being 'bullshit' requires further that they have no interest in the truth of what they write, which is more doubtful; I think they'd still fear getting in trouble if they made up things that weren't true.
As for AI use, it seems that this trial did employ such conditions: the humans believed they were just doing their normal jobs and were not aware they were part of an experiment. If not, I'd certainly not object to further trials in that manner. However, though of scientific value, the practical benefit of such is not likely to be realised in the regulatory field.
I must further mention that your post expressed major misconceptions that regulators are likely to have: first, that a regulation existing means it has a good reason behind it. No, it only means (at best) that someone at some time thought it did. Because, at every level, it is easier to create new rules than to remove existing ones, it is unavoidable that we will end up with too many, barring any deliberate effort to reverse that.
Secondly, that finding any problems with a project means it should be delayed. That might seem obvious, but since there are real costs to delaying any project that has merit, automatic delay is sub-optimal; not to mention the unavoidable marginal effect of projects not being proposed at all when regulatory delays would make them not worthwhile. Just about every large-scale action has both positive and negative effects, and seeing only one side (here the negative) is a clear bias.
As I've stated before, the functional purpose of today's levels of bureaucracy is to create jobs, no matter their societal value, and as such the replacement of any such work with AI will only lead to a further expansion of bureaucracy.
k_over_hbarc at yahoo dot com
Pamela said,
September 6, 2024 @ 9:07 am
The trials seem to confirm something that is already obvious from the design of LLM/AI: So far AI is not designed to test the validity of the phrases it mines from the internet. If a string of words or letters exists and AI can find dictionary meanings for it, it is legit to use; if a relational database suggests a relationship, that is legit. There is no attempt to test or validate information against good sources vs bad sources, original sources vs poor summaries or wrong quotes. AI just doesn't test itself to find errors. Which I guess is why when you start a conversation with a chatbot you are impressed, but as you reach beyond the most superficial, you find that the chatbot is confidently spewing information that has simply come from faulty sources or bad relational inferences. And it uses its own wrong inferences to build even more erroneous inferences, without any internal mechanisms to check and correct itself against a hierarchy from reliable to unreliable sources. "Hallucinations" at this point are not errors in the AI system; they are produced by attempts to fill information or relational holes with faulty junk. It is in the design. It can be designed better and I assume it will be, but at present, yes, AI is a gigantic BS and misinformation platform, with no internal controls (that is, no conscience and no sense of duty). It can't yet be allowed to do serious things–though, unfortunately, it is.
Benjamin E. Orsatti said,
September 6, 2024 @ 10:26 am
Oz having inherited more or less the same common law system as the U.S., I'm going to go out on a limb and say that AI isn't good for democracy in either case. But to explain why, you need to know what a "regulation" is.
A brief law school class: In the U.S., all government power derives from the Constitution — if the Constitution doesn't say the government can do it, the government can't do it. This means that any "legislation" (i.e. law passed by Congress) has to be permitted within the Constitution. So, for example, the government can regulate healthcare because healthcare providers and insurers are engaged in "interstate commerce", and the Constitution's "Commerce Clause" says that the feds can regulate businesses engaged in interstate commerce.
Now — regulations. At the federal level, regulations are promulgated by administrative agencies, which are run, ultimately, by Cabinet heads appointed by the President. So, the President appoints a Secretary of Labor, the Secretary of Labor appoints 2/3 seats on the National Labor Relations Board, and the NLRB drafts labor regulations which the Department of Labor then promulgates.
Here, the identical process is replicated — the DOL can _only_ pass regulations that are authorized by their "enabling legislation", e.g., the National Labor Relations Act, because administrative agencies are part of the Executive branch of government, and the Executive branch doesn't legislate, it enforces.
So, the agencies themselves cannot act outside the bounds of what they're authorized to do by their regulations, which cannot exceed the authority of their enabling legislation, which can't overstep the bounds established by the Constitution. If they do, the executive action may be nullified, regulation withdrawn, or law declared unconstitutional.
This is why you have to both DRAFT and READ administrative documents VERY CAREFULLY. If the U.S. version of Australia's ASIC, the SEC, started pumping out "bullshit" regulations, administrative decisions, agency memos, draft procedures, forms, etc., it would very quickly find itself without enough office space for all the lawyers it would have to hire to defend them in court, because if the bureaucrats themselves aren't paying attention to their own documents, the lawyers for the publicly-traded corporations regulated by the SEC certainly are!
Ergo, AI is a threat to democracy.