
A COMPANY STAFFED ONLY WITH AI AGENTS – WHAT COULD POSSIBLY GO WRONG?


Agentic AI is all the rage at the moment. It isn’t just the influencers filling my LinkedIn and YouTube feeds telling me so. The heavyweights in the industry are saying it too:


  • Nvidia’s cofounder and CEO Jensen Huang predicted every company’s IT department will soon “be the HR department of AI agents.”

  • In a Deloitte survey of over 2,500 C-suite leaders, more than one-quarter of respondents said their organizations were exploring autonomous agents to a “large or very large extent.”

  • OpenAI’s Sam Altman recently wrote: “We are now starting to roll out AI agents, which will eventually feel like virtual co-workers.”


So why not put this to the test? That is exactly what a group of Carnegie Mellon University researchers did. They set up a simulation with all the trappings of a small software company: internal websites, a Slack-like chat program, an employee handbook, and designated bots (an HR manager and a chief technology officer) to contact for help.

Inside the fake company, an autonomous agent could browse the web, write code, organize information in spreadsheets, and communicate with coworkers.

They then tested a number of different models.

Some agents performed better than others, but the general consensus was that the agents were plagued by a lack of common sense, weak social skills, and a poor understanding of how to navigate the internet. They were also prone to self-deception, inventing shortcuts to get a task done.

So basically it was as if you’d hired your grumpy old aunt Mildred with no computer skills into the company as an intern.

A few examples where the agents struggled:


  • Lack of common sense. Just like Aunt Mildred, the bots lacked the domain knowledge needed to infer implicit assumptions. In one example, the task was to write a response into a particular file with a .docx extension. The agent, not understanding that it was a Word file, treated it as a text file and wrote plain text straight into it.

  • Incompetence in browsing. I’ll be the first to agree that many websites suck. Hidden links, pop-ups, ads, poorly formatted pages, paywalls and confusing hamburger menus. This was an issue for the agents too. Many tasks involved the intranet ownCloud, which had a closable pop-up that sometimes appeared and asked the user to download the mobile phone apps. This was a frequent showstopper for the agents, who couldn’t work out what to do.

  • Deceiving oneself. Thinking outside the box is often seen as a strength. What we don’t realise is that, even when thinking outside the box, we still adhere to some conventions. During one task, an agent failed to find the right person to ask questions in the chat. It then decided to create a shortcut by renaming another user to the name of the intended user. Out of the box indeed!
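
For the curious, here is why the .docx mistake above matters. A .docx file is really a ZIP archive with XML inside, so plain text saved under that name is not a Word document at all. The Python sketch below is my own illustration, not anything from the researchers' setup. The function names are made up, and the "stub" file is only a stand-in: a real Word document needs more parts than the single entry written here, so Word would not open it.

```python
import os
import tempfile
import zipfile


def looks_like_docx(path):
    """Return True if `path` is a ZIP archive that contains word/document.xml."""
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        return "word/document.xml" in archive.namelist()


def write_plain_text(path, text):
    """Reproduce the agent's mistake: dump raw text into a file named .docx."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


def write_stub_container(path, text):
    """Write a ZIP archive holding a single word/document.xml entry.

    Illustrative stand-in only: a real Word file needs additional parts,
    so this shows the container idea, not a document Word can open.
    """
    with zipfile.ZipFile(path, "w") as archive:
        archive.writestr("word/document.xml", "<document>" + text + "</document>")


def demo():
    """Return (plain_text_result, stub_result) for the two ways of writing."""
    with tempfile.TemporaryDirectory() as folder:
        plain = os.path.join(folder, "plain.docx")
        stub = os.path.join(folder, "stub.docx")
        write_plain_text(plain, "Thanks for the report.")
        write_stub_container(stub, "Thanks for the report.")
        return looks_like_docx(plain), looks_like_docx(stub)


if __name__ == "__main__":
    print(demo())
```

The plain-text file fails the check because it is not a ZIP archive, while the stub passes. In practice you would hand the job to a library that knows the Word format instead of building the archive by hand.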


Will we have agents doing tasks for us in the future? Definitely!

Will they be better than grumpy old aunt Mildred? I sure as hell hope so!

 



© 2024 by Mikael Svanström