Automation Levels of AI Client Devices

Yu Huang
Apr 23, 2024 · 33 min read

Abstract: Large language models (LLMs) are regarded as potential sparks of Artificial General Intelligence (AGI), offering hope for building general AI agents. With the aid of AI, client devices have been evolving from Application (App)-based interaction to chatbots, copilots as AI assistants, and LLM-based agents, running on various client devices such as PCs, smartphones, smartwatches, smart homes, smart cockpits, and autonomous vehicles. On this basis, we define levels of automation for AI client devices, from L0 (instruction driven), L1 (App-based), L2 (AI assistant and copilot/butler), L3 (works side-by-side as part of the human), L4 (proactive interaction) to L5 (digital persona with social behavior).

1. Introduction

As we all know, autonomous driving of vehicles is classified into L0-L5 according to SAE's definition. Given the current state of AI development, how should we grade the levels of automation of AI client devices?

The evolution of Chatbot technology has led to distinct categories: rule-based bots, AI-driven bots, and LLM-based bots like Microsoft Copilot.

Microsoft has built AI-powered copilots into its most-used products: making coding more efficient with GitHub, transforming productivity at work with Microsoft 365, redefining search with Bing and Edge, and delivering contextual value that works across users' apps and PCs with Windows, helping users navigate any task.

OpenAI offers a way for users to create their own AI assistants by focusing on four key components: 1) an LLM (the core); 2) a code interpreter, essentially an improved calculator (tool); 3) a search engine (knowledge retrieval); 4) API calls to custom functions (actions).
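
As a concrete illustration, the sketch below wires these four components together using the OpenAI Python SDK's Assistants API (in beta at the time of writing); the assistant name, instructions, and the weather function schema are hypothetical, and tool names have shifted across API versions (e.g., retrieval vs. file_search), so treat this as a sketch rather than a definitive recipe.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # expects OPENAI_API_KEY in the environment

assistant = client.beta.assistants.create(
    name="Research Helper",                  # hypothetical assistant
    instructions="Answer questions, run calculations, and call tools as needed.",
    model="gpt-4-turbo",                     # 1) the LLM core
    tools=[
        {"type": "code_interpreter"},        # 2) code interpreter, the "improved calculator"
        {"type": "retrieval"},               # 3) knowledge retrieval over uploaded files
        {                                    # 4) an API call to a custom function (action)
            "type": "function",
            "function": {
                "name": "get_weather",       # hypothetical custom function
                "description": "Get the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        },
    ],
)
print(assistant.id)  # the assistant can now be attached to threads and runs
```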

Semantic Kernel is an open-source Microsoft SDK that lets users easily build agents that can call their existing code. As a highly extensible SDK, Semantic Kernel can be used with models from OpenAI, Azure OpenAI, Hugging Face, and others. By combining existing C#, Python, and Java code with these models, users can build agents that answer questions and automate processes. With Semantic Kernel, increasingly sophisticated agents can be built without requiring developers to be AI experts.

2. Chatbot for Interaction

The field of Chatbot design has undergone significant evolution since 2016 [1]. The transition from GUI to AI-driven conversational interfaces prompts a renewed emphasis on best practices and methodologies. Successful Chatbot design now revolves around creating human-like real-time conversations, resembling text messaging or voice interactions.

In the past, Chatbot design relied heavily on rule-based approaches, where predefined decision trees dictated the bot’s responses. However, the emergence of LLMs like GPT-4 has transformed the landscape. These advanced models leverage AI to comprehend user input and generate human-like text. This shift has revolutionized Chatbot design, with a focus on enhancing conversational abilities, domain-specific training, and delivering value to users. The result is a more engaging and effective user experience.

LLM-powered Chatbots [2] possess the superpower of generating personalized and contextually relevant responses in real time. A customer seeking service assistance receives not just answers but tailored solutions that make them feel like a VIP.

ChatEd is a Chatbot architecture[3] for education (Figure 1), which is retrieval-based and integrated with a large language model such as ChatGPT. The unique aspect of the ChatEd architecture is integrating an information retrieval system that stores and queries sources provided by the instructor with an LLM that provides the conversational support and general knowledge:

1) Context-Specific Database: The first step is for instructors to provide their sources as documents or URLs. Each document is retrieved and indexed. These instructor documents provide the source context for the Chatbot that is specific to the current course.

2) LLM Integration: When a user poses a question, instead of sending the question directly to the LLM, which would respond using its generalized knowledge, the question is first used as a query in the database to determine similar indexed documents. Then, the question, indexed documents, and prior chat history are provided as a prompt to the LLM.

Figure 1. ChatEd [3]
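
A minimal, self-contained sketch of this retrieve-then-prompt flow is shown below; the keyword scorer stands in for a real document index, call_llm is a stub for whatever chat model (e.g., ChatGPT) the deployment uses, and the course documents are made up for illustration.

```python
def call_llm(prompt: str) -> str:
    # Placeholder: in practice this wraps an LLM API call (e.g., ChatGPT).
    return "<answer grounded in the supplied course material>"

# 1) Context-specific database: instructor-provided sources, indexed up front.
course_docs = {
    "syllabus.txt": "Assignment 2 is due on March 15 and covers B+ tree indexing.",
    "lecture3.txt": "A B+ tree keeps all records in the leaves and supports range scans.",
}

def retrieve(question: str, k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the question."""
    terms = set(question.lower().split())
    ranked = sorted(course_docs.values(),
                    key=lambda text: len(terms & set(text.lower().split())),
                    reverse=True)
    return ranked[:k]

# 2) LLM integration: question + retrieved context + chat history form the prompt.
def chated_answer(question: str, history: list[str]) -> str:
    context = "\n".join(retrieve(question))
    past = "\n".join(history)
    prompt = ("Answer using the course material below; say so if it is not covered.\n"
              f"Course material:\n{context}\n\nChat history:\n{past}\n\n"
              f"Question: {question}")
    answer = call_llm(prompt)
    history += [f"Student: {question}", f"ChatEd: {answer}"]
    return answer

print(chated_answer("When is assignment 2 due?", history=[]))
```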

CataractBot[4] is an experts-in-the-loop Chatbot powered by LLMs. It answers cataract surgery related questions instantly by querying a curated knowledge base and provides expert-verified responses asynchronously. CataractBot features multimodal support and multilingual capabilities.

ChatDiet [5] is an LLM-powered framework designed specifically for personalized, nutrition-oriented food-recommendation Chatbots. ChatDiet, shown in Figure 2, integrates personal and population models, complemented by an orchestrator, to seamlessly retrieve and process pertinent information. The result is a dynamic delivery of personalized and explainable food recommendations, tailored to individual user preferences.

Figure 2. ChatDiet [5]

3. GUI for Interaction

Natural language interfaces and Graphical User Interfaces (GUIs) connect the human user to the abilities of the computer system. Natural language allows humans to communicate with each other about things outside of immediacy while pointing allows communication about concrete items in the world. Pointing requires less cognitive effort for one’s communicative counterpart than producing and processing natural language. It also leaves less room for confusion. Natural language, however, can convey information about the entire world: concrete, abstract, past, present, future, and the meta-world, offering random access to everything.

Many sorts of information are suitable for graphical representation. A common approach is to weave GUI elements into the chat conversation. The cost of this, however, is that the chat history becomes bulky, and the state management of GUI elements in a chat history is non-trivial. Also, by fully adopting the chat paradigm, we lose the option of offering menu-driven interaction paths to the users, so they are left more in the dark with respect to the abilities of the app.

The User Interface (UI) is pivotal for human interaction with the digital world, facilitating efficient control of machines, information navigation, and complex task completion. To achieve easy, efficient, and free interactions, researchers have been exploring the potential of encapsulating the traditional Programming Language Interfaces (PLIs) and GUIs into Natural Language Interfaces (NLIs). However, due to the limited capabilities of small models, traditional work mainly focuses on tasks for which only a single step is needed. This largely constrains the application of NLIs. Recently, LLMs have exhibited robust reasoning and planning abilities, yet their potential for multi-turn interactions in complex environments remains under-explored.

To assess LLMs as NLIs in real-world graphical environments, Mobile-Env [6], a GUI interaction platform for mobile apps shown in Figure 3, enhances interaction flexibility, task extensibility, and environment adaptability compared with previous environments. A GUI task set based on the WikiHow app is collected on Mobile-Env to form a benchmark covering a range of GUI interaction capabilities.

Figure 3. Hierarchical GUI interaction framework of Mobile-Env platform [6]

CogAgent[7] is an 18-billion-parameter visual language model (VLM) specializing in GUI understanding and navigation. By utilizing both low-resolution and high-resolution image encoders, CogAgent supports input at a resolution of 1120×1120, enabling it to recognize tiny page elements and text.

The Comprehensive Cognitive LLM Agent, CoCo-Agent [8], with comprehensive environment perception (CEP) and conditional action prediction (CAP), can systematically improve GUI automation performance. First, CEP facilitates GUI perception across different aspects and granularities, including screenshots and complementary detailed layouts for the visual channel and historical actions for the textual channel. Second, CAP decomposes action prediction into sub-problems: action type prediction and action target prediction conditioned on the action type.
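
As an illustration (not the authors' implementation), the sketch below mimics the CAP decomposition with two plain-text prompts, one for the action type and one for the target conditioned on that type; call_llm and the screen layout are stand-ins.

```python
def call_llm(prompt: str) -> str:
    # Placeholder: a fine-tuned agent model would answer here.
    return "CLICK" if "Choose the next action type" in prompt else "[2] Airplane mode toggle"

ACTION_TYPES = ["CLICK", "TYPE", "SCROLL", "NAVIGATE_BACK"]

def predict_action(goal: str, layout: str, history: list[str]) -> dict:
    # Stage 1: predict the action type from the full environment description.
    action_type = call_llm(
        f"Goal: {goal}\nScreen layout:\n{layout}\nPrevious actions: {history}\n"
        f"Choose the next action type from {ACTION_TYPES}."
    ).strip()
    # Stage 2: predict the target, conditioned on the chosen action type.
    target = call_llm(
        f"Goal: {goal}\nScreen layout:\n{layout}\n"
        f"The action to perform is {action_type}. "
        "Which UI element (and text, if typing) should it act on?"
    ).strip()
    return {"type": action_type, "target": target}

print(predict_action(
    goal="Turn on airplane mode",
    layout="[0] Settings icon  [1] Search bar  [2] Airplane mode toggle",
    history=[],
))
```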

4. Copilot for Assistance

AI assistants are becoming an integral part of society, used to ask for advice or help with personal and confidential issues. AI assistants are typically deployed in cloud-based environments. This setup allows for scalable and efficient access to the computational resources required to run these sophisticated models. A user session with an AI assistant generally follows a straightforward process:

1. Connection: The user connects to a server hosted in the cloud via a web app in a browser or via an API (e.g., using a 3rd-party app). The user starts or resumes a chat session (conversation) to set the context of the prompts.

2. Prompting: The user submits a prompt (a query or statement) and it is transmitted to the server as a single message. The server forwards the prompt to an instance of the LLM model for processing.

3. Response Generation: The LLM generates a response to the prompt and the response tokens are sent back to the user sequentially and in real time for visualizing the response as it’s created. This operational approach enhances user experience by allowing users to see the AI’s responses form in real-time, ensuring a dynamic and engaging conversation. This is especially important given that state-of-the-art LLMs are slow due to their complexity.
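
As an illustration of this flow, the sketch below maintains a chat session and streams the reply token by token using the OpenAI Python SDK (v1.x); the model name is only an example, and other providers expose equivalent streaming endpoints.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # expects OPENAI_API_KEY in the environment
messages = [{"role": "system", "content": "You are a helpful assistant."}]  # session context

def ask(prompt: str) -> str:
    """Send one prompt in the ongoing session and render the reply as it is generated."""
    messages.append({"role": "user", "content": prompt})
    stream = client.chat.completions.create(
        model="gpt-4-turbo",   # example model name
        messages=messages,
        stream=True,           # response tokens arrive sequentially, in real time
    )
    reply = ""
    for chunk in stream:
        delta = chunk.choices[0].delta.content or ""
        print(delta, end="", flush=True)   # visualize the response as it forms
        reply += delta
    print()
    messages.append({"role": "assistant", "content": reply})  # keep the conversation context
    return reply

ask("Summarize the benefits of streaming responses in two sentences.")
```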

LLMs have resulted in the creation of a variety of intelligent copilots that help users be much more effective and productive in their professional and personal lives (for example, GitHub Copilot helps developers significantly accelerate software development through natural language interaction).

The strong text-generation ability of LLMs has inspired the development of paper-writing copilots. To assist the user in writing academic analyses of scientific diagrams, the copilot should be equipped with three major abilities. First, the model should be able to understand multiple diagrams of various types (figures, tables, etc.) and in different formats (image or LaTeX). Second, the diagram analysis should remain consistent with the preceding text, which requires the model to correlate multimodal context and diagram information. Third, to better align with the user's intention, the copilot should be able to interact with the user, which requires the model to be controllable.

The Josh.AI company is evolving beyond the smart home to deliver its supercharged LLM-based JoshGPT assistant at home and on the go [9]. In addition to requesting any song, artist, album, or genre, users can ask JoshGPT to learn more about their favorite music. Beyond setting the perfect cooking mood and food timers when needed, JoshGPT offers the assistance of an expert sous chef. Learning about historical facts and figures is also accessible.

SheetCopilot [10] is an LLM-based assistant that takes a natural-language task and controls a spreadsheet to fulfill the requirements. A set of atomic actions is proposed as an abstraction of spreadsheet software functionality, and a state-machine-based task planning framework is designed so that LLMs can interact robustly with spreadsheets.
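
The sketch below illustrates the general idea under simplifying assumptions (it is not SheetCopilot's actual action set or planner): the LLM plans over a small vocabulary of atomic spreadsheet actions inside an observe-propose-validate-act loop, with call_llm standing in for the planning model.

```python
import json

def call_llm(prompt: str) -> str:
    # Placeholder planner; a real deployment calls an LLM here.
    return '{"action": "SetCellValue", "args": ["B1", "=SUM(A1:A10)"], "done": true}'

# Atomic actions: an abstraction over spreadsheet functionality.
ATOMIC_ACTIONS = {
    "SetCellValue": lambda sheet, cell, value: sheet.update({cell: value}),
    "DeleteCell":   lambda sheet, cell: sheet.pop(cell, None),
}

def run_task(task: str, sheet: dict, max_steps: int = 5) -> dict:
    """Observe -> propose -> validate -> act loop (a simple state machine)."""
    for _ in range(max_steps):
        prompt = (
            f"Task: {task}\nSheet state: {sheet}\n"
            f"Available atomic actions: {list(ATOMIC_ACTIONS)}\n"
            'Reply with JSON: {"action": ..., "args": [...], "done": true/false}'
        )
        step = json.loads(call_llm(prompt))
        if step["action"] in ATOMIC_ACTIONS:          # validate before acting
            ATOMIC_ACTIONS[step["action"]](sheet, *step["args"])
        if step.get("done"):                          # planner signals completion
            break
    return sheet

print(run_task("Total column A into B1", sheet={"A1": 1, "A2": 2}))
```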

Data-Copilot[11] is an LLM-based system that connects numerous data sources on one end and caters to diverse human demands on the other end. Acting like an experienced expert, Data-Copilot autonomously transforms raw data into visualization results that best match the user’s intent. Specifically, Data-Copilot autonomously designs versatile interfaces (tools) for data management, processing, prediction, and visualization.

Figure 4. Data-Copilot [11]

mPLUG-DocOwl [12] is a copilot for OCR-free document understanding based on the multimodal LLM mPLUG-Owl, which comprises a pre-trained visual foundation model, a visual abstractor, and a language foundation model. Specifically, an instruction-tuning dataset featuring a wide range of visual-text understanding tasks is first constructed.

mPLUG-PaperOwl [13] is a copilot for academic paper writing, extended from mPLUG-DocOwl. By parsing LaTeX source files of high-quality papers, a multimodal diagram understanding dataset, M-Paper, is built. By aligning diagrams in the papers with related paragraphs, professional diagram analysis samples are constructed for training and evaluation. Based on mPLUG-DocOwl, instruction tuning is performed on a combination of training data from three tasks (Multimodal Diagram Captioning, Multimodal Diagram Analysis, and Outline Recommendation).

Decision Optimization CoPilot (DOCP)[14] is an AI tool designed to assist any decision maker, interacting in natural language to grasp the business problem, subsequently formulating and solving the corresponding optimization model.

5. Web Navigation

The 1st version of the internet is sometimes called the “static web.” It was made of read-only webpages that, by and large, lacked much in the way of interactive features. Web 1.0 offered little beyond browsing static pages. Content generation was handled by a select few, and information was hard to find.

Video-sharing sites like YouTube were a big part of the Web 2.0 revolution, which marked the internet's departure to an era of dynamic content. Users could now interact with web pages, communicate with each other, and create content. For many, the greatest symbol of this era is the emergence of social media networks. Smartphones soon followed, starting with the original iPhone. Web 2.0 can be seen as the read/write upgrade, i.e., the internet today.

For cryptocurrency developers and enthusiasts, Web 3.0 incorporates the technologies and concepts that are at the heart of crypto: decentralization, token-based economies, and blockchain. This vision of Web 3.0 tends to be a more democratic version of today's online world.

The advancements in AI technology in recent years have provided new and powerful solutions to various obstacles in the development of Web 3.0. These solutions include utilizing AI for big data analysis, AI-generated content, and detecting and classifying various forms of content such as text and video.

Web navigation is a class of sequential decision-making problems where agents interact with web interfaces following user instructions. The progress of autonomous web navigation has been hindered by the dependence on billions of exploratory interactions via online reinforcement learning, and domain-specific model designs that make it difficult to leverage generalization from rich out-of-domain data.

An instruction-following multimodal agent, WebGUM[15], observes both webpage screenshots and HTML pages and outputs web navigation actions, such as click and type. Based on vision-language foundation models, WebGUM is trained by jointly fine-tuning an instruction-tuned language model and a vision encoder with temporal and local perception on a large corpus of demonstrations.

MIND2WEB [16] is a dataset for developing and evaluating generalist agents for the web that can follow language instructions to complete complex tasks on any website. MIND2WEB provides three necessary ingredients for building generalist web agents: 1) diverse domains, websites, and tasks, 2) use of real-world websites instead of simulated and simplified ones, and 3) a broad spectrum of user interaction patterns. Employing the data from MIND2WEB, an exploratory framework, MINDACT, is constructed, leveraging the power of LLMs.

WebArena[17] is a standalone, self-hostable web environment for building autonomous agents. WebArena creates websites from 4 popular categories with functionality and data mimicking their real-world equivalents. To emulate human-like problem-solving, WebArena also embeds tools and knowledge resources as independent websites. WebArena introduces a benchmark on interpreting high-level realistic natural language command to concrete web-based interactions.

WebAgent [18], an LLM-driven agent shown in Figure 5, learns from self-experience to complete tasks on real websites following natural language instructions. WebAgent plans ahead by decomposing instructions into sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites via Python programs generated from them. WebAgent is built with Flan-U-PaLM for grounded code generation and HTML-T5, a pre-trained LLM for long HTML documents.

Figure 5. WebAgent with HTML-T5 and Flan-U-PaLM [18]
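
A schematic sketch of this decompose-summarize-synthesize loop follows; it does not use HTML-T5 or Flan-U-PaLM themselves, the page object and canned model replies are stand-ins, and exec is applied only to the illustrative output.

```python
def call_llm(prompt: str) -> str:
    # Placeholder model with canned replies keyed on the request type.
    if prompt.startswith("Decompose"):
        return "open the search box; enter the query"
    if prompt.startswith("Extract"):
        return "<input id='search'> <button id='go'>"
    return "page.click('#search'); page.type('#search', 'hotels in Tokyo')"

class FakePage:
    """Stub for a browser page; a real agent would drive an actual browser."""
    def click(self, selector): print(f"click {selector}")
    def type(self, selector, text): print(f"type {text!r} into {selector}")

def web_agent(instruction: str, raw_html: str) -> None:
    # 1) Planning: decompose the instruction into sub-instructions.
    subs = call_llm(f"Decompose into sub-instructions: {instruction}").split(";")
    # 2) Summarization: keep only the task-relevant HTML snippets.
    snippet = call_llm(f"Extract the HTML elements relevant to: {instruction}\n{raw_html}")
    page = FakePage()
    for sub in subs:
        # 3) Program synthesis: generate Python for this step and act on the site.
        program = call_llm(f"Write Python for step {sub!r} given snippet:\n{snippet}")
        exec(program, {"page": page})

web_agent("Find hotels in Tokyo", "<html>... <input id='search'> ...</html>")
```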

6. Software Development

Work similar to GitHub Copilot offers a conversational, Chatbot-like interface to programming, enabling programmers to express their intent in a sequence of natural language utterances and initiating the generation or modification of code in response. This paradigm shift in interaction has fundamentally transformed the programming experience, bridging the gap between human intent and code implementation and ushering in a new era of natural language-based programming [19].

To support this kind of copilot-style programming, PwR (Programming with Representations, read as 'power') [20] uses representations as a shared understanding between the user and the AI system, which facilitates conversational programming. A representation denotes the AI system's understanding of a sequence of utterances.

The representation within the PwR tool consists of three components (Figure 6):

• A knowledge base (KB) consisting of a set of key-value pairs.

• The logic of the bot, encompassing a set of rules expressed in natural language.

• A set of variables, which stores the conversational state.

Figure 6. The Chatbot builder page of the PwR tool [20]
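
A minimal sketch of this three-part representation, with hypothetical field names and example content, might look as follows; the real tool derives and updates such a representation from the conversation itself.

```python
from dataclasses import dataclass, field

@dataclass
class PwRRepresentation:
    # Knowledge base: a set of key-value pairs the bot can look up.
    knowledge_base: dict[str, str] = field(default_factory=dict)
    # Bot logic: rules expressed in natural language, applied in order.
    rules: list[str] = field(default_factory=list)
    # Variables: the conversational state tracked across turns.
    variables: dict[str, str] = field(default_factory=dict)

rep = PwRRepresentation()
rep.knowledge_base["clinic_hours"] = "Mon-Fri, 9am-5pm"
rep.rules.append("If the user asks about opening hours, answer from clinic_hours.")
rep.rules.append("If the user wants an appointment, ask for a preferred date "
                 "and store it in the variable 'preferred_date'.")
rep.variables["preferred_date"] = ""   # filled in as the conversation proceeds

print(rep)
```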

TaskWeaver[21] is a code-first framework for building LLM-powered autonomous agents. It converts user requests into executable code and treats user-defined plugins as callable functions. TaskWeaver provides support for rich data structures, flexible plugin usage, and dynamic plugin selection, and leverages LLM coding capabilities for complex logic. It also incorporates domain-specific knowledge through examples and ensures the secure execution of generated code. TaskWeaver offers a powerful and flexible framework for creating intelligent conversational agents that can handle complex tasks and adapt to domain-specific scenarios.

The AI startup Cognition released Devin [22], billed as the world's first fully autonomous AI software engineer, setting a new state-of-the-art standard on the SWE-bench coding benchmark. With just a single prompt, Devin is capable of writing code or creating websites, much like a human software engineer.

Devin is an autonomous AI model that can plan, analyze, and execute complex code and software engineering tasks with a single prompt. It has its own command line, a code editor, and a separate web browser. The model's capabilities were shown off by testing Meta's Llama 2 on a couple of different API providers. Devin first set up a step-by-step "Plan" before tackling the problem. It then went on to build the whole project using the same tools a human software engineer would. Using its built-in browser, Devin was able to pull up the API documentation to read up on and learn how to plug in to each of these APIs. Finally, it built and deployed a website with full styling.

What sets Devin apart is its ability to learn from mistakes. It can make thousands of decisions and gets better over time. It outperformed other solutions when tested on a few standard sets of software engineering problems. Devin has also undergone practice interviews on AI tasks with top tech brands and met their expectations. It has also completed tasks from real jobs posted on Upwork, such as coding tasks, debugging computer vision models, and generating detailed reports.

AutoDev[23], shown in Figure 7, enables an AI Agent to achieve a given objective by performing several actions within the repository. The Eval Environment executes the suggested operations, providing the AI Agent with the resulting outcome. In the conversation, purple messages are from the AI agent, while blue messages are responses from the Eval Environment.

Figure 7. AutoDev as AI agent for software development [23]

Princeton University turns LMs (e.g., GPT-4) into software engineering agents, called SWE-agent [24], that can fix bugs and issues in real GitHub repositories. It does so by designing simple LM-centric commands and feedback formats, called an Agent-Computer Interface (ACI), that make it easier for the LM to browse the repository and to view, edit, and execute code files. Good ACI design leads to much better results when using agents.

7. AI Agent PC

To do any task on a PC, users have to tell the device which Apps to use. Users can use Microsoft Word and Google Docs to draft a business proposal, but those apps can't help them send an email, share a selfie, analyze data, schedule a party, or buy movie tickets. And even the best websites have an incomplete understanding of a user's work, personal life, interests, and relationships, and a limited ability to use this information to do things for them. That kind of thing has only been possible with another human being, like a close friend or personal assistant.

Agents are not only going to change how everyone interacts with PCs. They are also going to upend the software industry, bringing about the biggest revolution in computing: from typing commands and tapping on icons to speaking to the machine as one would to a human. People will be able to have nuanced conversations with agents. Agents will be much more personalized, and they won't be limited to relatively simple tasks like writing a letter; instead, they will become assistants, friends, or butlers.

OS-Copilot [25] is a framework for building generalist agents capable of interfacing with comprehensive elements in an operating system (OS), including the web, code terminals, files, multimedia, and various third-party applications, as shown in Figure 8. OS-Copilot is applied to create FRIDAY, a self-improving embodied agent for automating general computer tasks. On GAIA, a general AI assistants benchmark, FRIDAY exhibits strong generalization to unseen applications via skills accumulated from previous tasks. FRIDAY can learn to control and self-improve on Excel and PowerPoint with minimal supervision.

Figure 8. OS-Copilot framework[25]

UFO [26], a UI-Focused agent for Windows OS, fulfills user requests tailored to applications by harnessing the capabilities of GPT-Vision. UFO, shown in Figure 9, employs a dual-agent framework to meticulously observe and analyze the GUI and control information of Windows applications. This enables the agent to seamlessly navigate and operate within individual applications and across them to fulfill user requests, even when spanning multiple applications. The framework incorporates a control interaction module, facilitating action grounding without human intervention and enabling fully automated execution. Consequently, UFO transforms arduous and time-consuming processes into simple tasks achievable solely through natural language commands.

Figure 9. UFO (A UI-Focused Agent on Windows OS) [26]

8. AI Agent Mobile Device

AI on smartphones is not new; we've seen AI running on-device on ISPs and NPUs for nearly a decade. The emergence of LLMs has led us to rethink and redefine what an "AI-capable" smartphone powered by generative AI (GenAI) means. Everyone with a smartphone can already access the "Big Three" AI chatbots (OpenAI's ChatGPT, Microsoft's Bing Chat, and Google's Bard) through an app or browser. LLMs mark a giant step for mobile devices towards more intelligent and personalized assistive agents.

The AI startup Humane wants the smartphone in your pocket to disappear. Its first product, the Ai Pin [27], is a first step on the road to a diaphanous blend of the real and the digital: a personal AI device that users wear and communicate with through speech and gesture.

Users interact with the Ai Pin through speech and the touchpad that dominates its front surface. The device also incorporates an ultra-wide camera, depth and motion sensors, and a 'personic' speaker designed to create a 'bubble of sound'; Bluetooth headphone connectivity is also available. All of this is powered by a Snapdragon processor running Humane's new Cosmos operating system.

The Rabbit R1[28] stands apart from other devices due to its smart way of leveraging AI with LLMs to make them “trigger” results from human-machine interactions using voice, text, or images. It innovates the HMI by merging the functionality of voice assistants with the AI power of LLMs like ChatGPT, pooled into an Agent-based AI system that can perform tasks accurately and swiftly across various interfaces and platforms.

The resulting solution is a large action model (LAM) that helps bridge the gap between precisely understanding the user and completing the required task. The LAM is the cornerstone of Rabbit OS. LAM is a new type of foundation model that understands human intentions on computers. With LAM, Rabbit OS understands what you say and gets things done.

Given a natural language description of a desired task, DroidBot-GPT[29] can automatically generate and execute actions that navigate the App to complete the task, shown in Figure 10. It works by translating the App GUI state information and the available actions on the smartphone screen to natural language prompts and asking the LLM to make a choice of actions. Since the LLM is typically trained on a large amount of data including the how-to manuals of diverse software applications, it has the ability to make reasonable choices of actions based on the provided information.

Figure 10. DroidBot-GPT[29]
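
The core loop can be sketched roughly as below (this is not the released DroidBot-GPT code): the current GUI state and the available actions are rendered as a natural-language prompt, and the LLM replies with the index of the action to perform; call_llm and the screen description are stand-ins.

```python
def call_llm(prompt: str) -> str:
    return "2"  # placeholder; a real model returns the chosen action index

def choose_action(task: str, screen_description: str, actions: list[str]) -> str:
    numbered = "\n".join(f"{i}. {a}" for i, a in enumerate(actions))
    prompt = (
        f"Task: {task}\n"
        f"Current screen: {screen_description}\n"
        f"Available actions:\n{numbered}\n"
        "Reply with the number of the action to perform next."
    )
    choice = int(call_llm(prompt).strip())
    return actions[choice]

next_action = choose_action(
    task="Add a 7am alarm for weekdays",
    screen_description="Clock app, Alarm tab, list of existing alarms",
    actions=["scroll down", "tap back button", "tap '+' to add an alarm"],
)
print(next_action)   # the chosen action would then be executed on the device
```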

MM-Navigator [30] is a GPT-4V-based agent for the smartphone GUI navigation task. MM-Navigator can interact with a smartphone screen as human users do and determine subsequent actions to fulfill given instructions. Specifically, given a screen, it detects UI elements via an OCR tool and IconNet. Each element has a bounding box and contains either OCR-detected text or an icon class label (one of 96 possible icon types).

GptVoiceTasker[31] is a virtual assistant poised to enhance user experiences and task efficiency on mobile devices, leveraging LLMs to enhance voice control. GptVoiceTasker excels at intelligently deciphering user commands and executing relevant device interactions to streamline task completion. The system continually learns from historical user commands to automate subsequent usages, further enhancing execution efficiency.

MemoDroid [32] is an innovative LLM-based mobile task automator enhanced with a unique App memory. MemoDroid, shown in Figure 11, emulates the cognitive process of humans interacting with a mobile app: explore, select, derive, and recall. This approach allows for more precise and efficient learning of a task's procedure by breaking it down into smaller, modular components that can be re-used, re-arranged, and adapted for various objectives. MemoDroid is implemented using online LLM services (GPT-3.5 and GPT-4).

Figure 11. MemoDroid [32]

To minimize LLM context-switching overhead under a tight mobile device memory budget, LLMaaS (LLM as a system service on mobile devices) [33] decouples the memory management of App and LLM contexts with a key idea of fine-grained, chunk-wise, globally optimized KV cache compression and swapping. By fully leveraging the KV cache's unique characteristics, it proposes three techniques: (1) tolerance-aware compression; (2) IO-recompute pipelined loading; (3) chunk lifecycle management.

9. Multi-Agent Collaboration and Persona

In real-world scenarios, complex tasks such as software development, consulting, and game playing might require cooperation among individuals to achieve better effectiveness. Throughout history, numerous studies have delved into methods for strengthening collaboration among humans to enhance work efficiency and effectiveness. More recently, with the evolution of autonomous agents towards AGI, certain research studies have conceptualized assemblies of agents as a society or group and focused on exploring the potential of their cooperation. A multi-agent group enhances decision-making capabilities during collaborative problem-solving.

CAMEL[34] applied a communicative agent framework named role-playing. It involves using inception prompting to guide chat agents toward task completion while maintaining consistency with human intentions. It showcases how role-playing can be used to generate conversational data for studying the behaviors and capabilities of a society of agents, providing a valuable resource for investigating conversational language models.

MetaGPT[35] is a meta-programming framework incorporating human workflows into LLM-based multi-agent collaborations. MetaGPT encodes Standardized Operating Procedures (SOPs) into prompt sequences for more streamlined workflows, thus allowing agents with human-like domain expertise to verify intermediate results and reduce errors. MetaGPT utilizes an assembly line paradigm to assign diverse roles to various agents, efficiently breaking down complex tasks into subtasks involving many agents working together.

AutoGen[36] is a versatile framework that allows for the creation of applications using language models. It is distinctive for its high level of customization, enabling developers to program agents using both natural language and code to define how these agents interact. This versatility enables its use in diverse fields, from technical areas such as coding and mathematics to consumer-focused sectors like entertainment.

AGENTS [37] is an open-source library whose goal is to open up to a wider, non-specialist audience capabilities such as automatically solving various tasks and interacting with environments, humans, and other agents using natural language interfaces. AGENTS is carefully engineered to support important features including planning, memory, tool usage, multi-agent communication, and fine-grained symbolic control.

The AgentVerse[38] framework simulates the problem-solving procedures of human groups and allows for dynamic adjustment of group members based on current problem-solving progress.

Specifically, AgentVerse splits the group problem-solving process into four pivotal stages shown in Figure 12: (1) Expert Recruitment. (2) Collaborative Decision-Making. (3) Action Execution. (4) Evaluation.

Figure 12. AgentVerse [38]
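
A skeleton of the four-stage loop is sketched below; the stage bodies are placeholders for the framework's LLM-driven implementations, and the goal, roles, and threshold are made up for illustration.

```python
def recruit_experts(goal: str) -> list[str]:
    return ["planner", "coder", "reviewer"]          # placeholder expert roles

def collaborative_decision(goal: str, experts: list[str]) -> str:
    return f"plan drafted by {', '.join(experts)}"   # placeholder joint decision

def execute_actions(plan: str) -> str:
    return f"result of executing: {plan}"            # placeholder execution

def evaluate(goal: str, result: str) -> tuple[float, str]:
    return 0.6, "add a tester role"                  # placeholder score + feedback

def agentverse(goal: str, max_rounds: int = 3, threshold: float = 0.9) -> str:
    feedback = ""
    result = ""
    for _ in range(max_rounds):
        experts = recruit_experts(goal + feedback)    # (1) adjust group members
        plan = collaborative_decision(goal, experts)  # (2) decide together
        result = execute_actions(plan)                # (3) act on the decision
        score, feedback = evaluate(goal, result)      # (4) assess progress
        if score >= threshold:
            break
    return result

print(agentverse("Build a to-do list web app"))
```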

Multi-agent role-playing aims to enable or customize LLMs to simulate various characters or personas with distinct attributes and conversational styles, which provides a more nuanced interaction experience for users and renders LLMs more familiar, companionable, and immersive.

RoleLLM[39] is a role-playing framework of data construction, evaluation, and solutions for both closed-source and open-source models. RoleLLM includes four key stages shown in Figure 13: (1) Role Profile Construction; (2) Context-Based Instruction Generation (Context-Instruct); (3) Role Prompting using GPT (RoleGPT); and (4) Role-Conditioned Instruction Tuning (RoCIT) to achieve RoleLLaMA and RoleGLM.

Figure 13. RoleLLM [39]

LLM agent research has advanced from simple chain-of-thought prompting to more complex ReAct [40] and Reflection reasoning strategies; agent architectures have also evolved from single-agent generation to multi-agent conversation, as well as multi-LLM, multi-agent group chat. However, with the existing intricate frameworks and libraries, creating and evaluating new reasoning strategies and agent architectures has become a complex challenge, which hinders research into LLM agents.

Salesforce AI Research released a new AI agent library, AgentLite [41], which simplifies this building process by offering a lightweight, user-friendly platform for innovating LLM agent reasoning, architectures, and applications with ease. AgentLite is a task-oriented framework designed to enhance the ability of agents to break down tasks and to facilitate the development of multi-agent systems.

10. AI Games

With the enhanced capabilities of LLMs, open-world games have emerged as the frontier for language agent applications. This is due to the unique and challenging scenarios present in open-world games, which provide fertile ground for general-purpose language agents. Open-world games present a rich, dynamic, and engaging environment, encompassing complex missions and storylines. They require agents to equip non-player characters with diversified behaviors.

Minecraft has established itself as an unparalleled platform for researching autonomous and robust Generally Capable Agents (GCAs) in open-world environments brimming with long-horizon challenges, environmental disruptions, and uncertainties. Minecraft acts as a microcosm of the real world. Developing an automated agent that can master all technical challenges in Minecraft is akin to creating an AI agent capable of autonomously learning and mastering the full breadth of real-world technology.

StarCraft II, launched by Blizzard Entertainment in 2010, is a real-time strategy (RTS) game that has garnered substantial attention within the gaming community. Participants in standard gameplay competitions have the opportunity to engage in strategic contests while playing the roles of one of three distinct races: Terran, Zerg, and Protoss. StarCraft II has become an ideal testing platform for AI capabilities, as the next conquest for AI.

The Civilization game's profound alignment with human history and society necessitates sophisticated learning, while its ever-changing situations demand strong reasoning to generalize.

Compared to existing games, tactical battle games are better suited for benchmarking the game-playing ability of LLMs, as the win rate can be directly measured and consistent opponents such as AI or human players are always available. Pokémon battles, which serve as the mechanism for evaluating the battle abilities of trainers in the well-known Pokémon games, offer several unique advantages as a first attempt at having LLMs play tactical battle games:

(1) The state and action spaces are discrete and can be translated into text losslessly;

(2) The turn-based format eliminates the demands of intensive gameplay;

(3) Despite its seemingly simple mechanics, Pokémon battling is strategic and complex, demanding from players both Pokémon knowledge and reasoning ability.

Ghost in the Minecraft (GITM)[42] integrates LLMs with text-based knowledge and memory, aiming to create Generally Capable Agents (GCAs) in Minecraft. These agents, equipped with the logic and commonsense capabilities of LLMs, can skillfully navigate complex, sparse-reward environments with text-based interactions. A set of structured actions is constructed and LLMs are leveraged to generate action plans for the agents to execute.

VOYAGER [43] is an LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world, acquires diverse skills, and makes novel discoveries without human intervention. VOYAGER consists of three key components, shown in Figure 14: 1) an automatic curriculum that maximizes exploration, 2) an ever-growing skill library of executable code for storing and retrieving complex behaviors, and 3) a new iterative prompting mechanism that incorporates environment feedback, execution errors, and self-verification for program improvement. VOYAGER interacts with GPT-4 via black-box queries, which bypasses the need for model parameter fine-tuning.

Figure 14. VOYAGER [43]
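
A toy sketch of how these three components fit together is given below; the Minecraft environment is faked, call_llm stands in for GPT-4 queried as a black box, and the curriculum and skill are invented for illustration.

```python
def call_llm(prompt: str) -> str:
    return "def mine_wood(bot):\n    return 'mined 1 oak_log'"  # placeholder skill code

skill_library: dict[str, str] = {}        # skill name -> executable code

def curriculum(completed: list[str]) -> str:
    # Placeholder automatic curriculum: propose the next unexplored task.
    return "mine_wood" if "mine_wood" not in completed else "craft_table"

def try_skill(code: str) -> tuple[bool, str]:
    # Fake environment rollout: define and run the generated skill on a stub bot.
    scope: dict = {}
    exec(code, scope)
    fn = next(v for k, v in scope.items() if not k.startswith("__"))
    feedback = fn("bot")
    return "mined" in feedback, feedback

def voyager_step(completed: list[str], max_retries: int = 3) -> None:
    task = curriculum(completed)
    feedback = ""
    for _ in range(max_retries):          # iterative prompting with environment feedback
        code = call_llm(f"Write a skill for task {task}. Feedback: {feedback}")
        ok, feedback = try_skill(code)
        if ok:
            skill_library[task] = code    # store the verified skill for later reuse
            completed.append(task)
            break

voyager_step(completed=[])
print(skill_library.keys())
```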

Language Agent for Role Play (LARP)[44] is a framework toward open-world games. LARP focuses on blending the open-world games with language agents, utilizing a modular approach for memory processing, decision-making, and continuous learning from interactions. In the agent’s internal depiction, a complex cognitive architecture is designed based on cognitive psychology, equipping agents under the LARP framework with high playability. Aiming to yield more realistic role-playing experience, agents are regularized using the data and context of the open-world gaming environment, prior set personalities, knowledge, rules, memory, and post constraints, which can be seen as a specific case within the general-purpose language agents. This architecture incorporates a cluster of smaller language models, each fine-tuned for different domains, to handle various tasks separately.

CivRealm[45] is an environment inspired by the Civilization game. CivRealm sets up an imperfect-information general-sum game with a changing number of players; it presents a plethora of complex features, challenging the agent to deal with open-ended stochastic environments that require diplomacy and negotiation skills. Within CivRealm, interfaces are provided for two typical agent types: tensor-based agents that focus on learning, and language-based agents that emphasize reasoning.

To leverage the strategic interpretability of language models and the logical reasoning capabilities of CoT, LLM agents are engaged in long-term strategic planning and real-time strategy adjustments in complex real-time strategy games like StarCraft II. To conveniently take full advantage of LLMs' reasoning abilities, a textual StarCraft II environment called TextStarCraft II [46], shown in Figure 15, is first developed, with which the LLM agent can interact. Second, a Chain of Summarization method is proposed, including single-frame summarization for processing raw observations and multi-frame summarization for analyzing game information.

Figure 15. The Enhanced Chain of Summarization Method in TextStarCraft II [46]

SwarmBrain [47] is an embodied agent leveraging LLMs for real-time strategy implementation in the StarCraft II game environment. SwarmBrain comprises two key components: 1) an Overmind Intelligence Matrix, powered by LLMs, designed to orchestrate macro-level strategies from a high-level perspective; and 2) a Swarm ReflexNet, an agile counterpart to the calculated deliberation of the Overmind Intelligence Matrix.

POKÉLLMON [48] is an LLM-embodied agent that achieves human-parity performance in tactical battle games, as demonstrated in Pokémon battles. The design of POKÉLLMON incorporates three key strategies: (i) in-context reinforcement learning; (ii) knowledge-augmented generation; (iii) consistent action generation.

11. Smart Cockpit and Autonomous Driving

Where once only relatively primitive voice commands were possible, the driver now begins a conversation with a more intelligent assistant that understands context and can also respond to follow-up questions. The car is also beginning to proactively contact the driver when it needs maintenance.

Driving functions and on-board infotainment can be controlled by voice, gestures and with the help of facial recognition, and there is also “safety monitoring” for drivers and passengers (like children).

So LLMs not only change driver-vehicle communication, making it more interactive and personal, but also promise to improve autonomous driving functions by enabling faster and better decisions.

Based on LLMs, AI can benefit the mobility market through three services: proactive care, proactive journey, and proactive mobility. Proactive journey means AI can examine a driver's commute and schedule to ensure efficient time management, also examining traffic trends. Proactive mobility complements autonomous driving, as AI brings augmented reality and in-car infotainment, benefitting users while the car is in motion. Proactive care will see car owners offered a hassle-free experience when it comes to their vehicles. AI could take care of admin and logistics, such as insuring the vehicle, booking maintenance, and even predicting potential issues.

However, it may also offer proactive communication and recommendations, leaving the financial decisions up to the driver. This means AI could help maintain customer loyalty, especially when it comes to electric vehicles (EVs), which require less maintenance. The technology could help ensure drivers interact with original manufacturers instead of considering a third party.

· Smart Cockpit

In-vehicle conversational assistants (IVCAs) are an integral component in smart cockpits and play a vital role in facilitating human-agent interaction. They can deliver features including navigation, entertainment control, and hands-free phone operation.

Research demonstrates that the proactivity of IVCAs can help to reduce distractions and enhance driving safety, better meeting users’ cognitive needs. However, existing IVCAs struggle with user intent recognition and context awareness, which leads to suboptimal proactive interactions.

The Rewrite-ReAct-Reflect (R3) prompting framework [49], built on five proactivity levels across two dimensions (assumption and autonomy) for IVCAs, is constructed as shown in Figure 16. ReAct [40] is an LLM-based method that generates both reasoning traces and task-specific actions in an interleaved manner. The R3 strategy aims at empowering LLMs to fulfill the specific demands of each proactivity level when interacting with users.

Figure 16. R3 prompting for IVCAs [49]

· Autonomous Driving

Knowledge is the concretization and generalization of the human representation of scenes and events in the real world, representing a summary of experience and causal reasoning. Knowledge-driven methods aim to induce information from driving scenarios into a knowledge-augmented representation space and deduce from it a generalized driving semantic space. This enables the emulation of human understanding of the real world and the acquisition of learning and reasoning capabilities from experience.

Possessing rich human driving experience and common sense, LLMs are now commonly employed as foundation models for knowledge-driven autonomous driving, actively understanding, interacting with, acquiring knowledge from, and reasoning about driving scenarios. Analogous to embodied AI, the driving agent should possess the ability to interact with the driving environment, engaging in exploration, understanding, memory, and reflection.

A framework named DiLu [50] is designed for LLM-based, knowledge-driven driving. Specifically, the driver agent utilizes the Reasoning Module to query experiences from the Memory Module and leverage the common-sense knowledge of the LLM to generate decisions based on the current scenario. It then employs the Reflection Module to identify safe and unsafe decisions produced by the Reasoning Module and, based on the knowledge embedded in the LLM, refine them into correct decisions, which are then updated into the Memory Module.
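
A compact sketch of this reason-reflect-memorize loop is given below; everything is a stand-in rather than DiLu's code, with a toy similarity measure in place of the real memory retrieval and call_llm in place of the LLM.

```python
def call_llm(prompt: str) -> str:
    return "decelerate and keep lane"        # placeholder driving decision

memory: list[dict] = []                      # stored scenario -> decision pairs

def recall(scenario: str, k: int = 2) -> list[dict]:
    # Toy similarity: shared words between scenario descriptions.
    words = set(scenario.split())
    return sorted(memory, key=lambda m: -len(words & set(m["scenario"].split())))[:k]

def reason(scenario: str) -> str:
    examples = recall(scenario)
    prompt = (f"Similar past experiences: {examples}\n"
              f"Current scenario: {scenario}\nDecide the next driving action.")
    return call_llm(prompt)

def reflect(scenario: str, decision: str) -> str:
    verdict = call_llm(f"Was '{decision}' safe in: {scenario}? If not, correct it.")
    return verdict if verdict else decision  # keep the safe or corrected decision

scenario = "lead vehicle braking hard, wet road, 60 km/h"
decision = reason(scenario)
decision = reflect(scenario, decision)
memory.append({"scenario": scenario, "decision": decision})   # update memory
print(decision)
```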

"Drive Anywhere" [51] is a framework that harnesses multimodal foundation models to enhance the robustness and adaptability of autonomous driving systems, enabling out-of-distribution, end-to-end, multimodal, and more explainable autonomy. Specifically, it applies end-to-end, open-set (any environment/scene) autonomous driving that is capable of providing driving decisions from representations queryable by image and text.

Context-Aware Visual Grounding (CAVG) model[52] is an advanced system for autonomous driving that integrates five core encoders — Text, Image, Context, and Cross-Modal — with a Multimodal decoder. This integration enables the CAVG model to adeptly capture contextual semantics and to learn human emotional features, augmented by LLMs including GPT-4.

DriveGPT4 [53], shown in Figure 17, is an interpretable end-to-end autonomous driving system utilizing LLMs, with a comprehensive multimodal language model to process inputs comprising videos, text, and control signals. Video sequences undergo tokenization using a dedicated video tokenizer, while text and control signals share a common tokenizer. Following tokenization, the language model can concurrently generate responses to human inquiries and predict control signals for the next step.

Figure 17. DriveGPT4 [53]

VLP [54] is a Vision-Language-Planning framework that exploits language models to bridge the gap between linguistic understanding and autonomous driving. VLP enhances autonomous driving systems by strengthening both the source memory foundation and the self-driving car's contextual understanding, leveraging LLMs in both local and global contexts. The Agent-centric Learning Paradigm (ALP) concentrates on refining local details to enhance source memory reasoning, while the Self-Driving-Car-centric Learning Paradigm (SLP) focuses on guiding the planning process for the self-driving car (SDC).

DriveVLM [55] is an autonomous driving system leveraging Vision-Language Models (VLMs) for enhanced scene understanding and planning capabilities. DriveVLM integrates a unique combination of chain-of-thought (CoT) modules for scene description, scene analysis, and hierarchical planning. Furthermore, recognizing the limitations of VLMs in spatial reasoning and their heavy computational requirements, a hybrid system, DriveVLM-Dual, is proposed that synergizes the strengths of DriveVLM with the traditional autonomous driving pipeline. DriveVLM-Dual achieves robust spatial understanding and real-time inference speed.

A language agent called Agent-Driver[56] shown in Figure 18, introduces a tool library accessible via function calls, a cognitive memory of common sense and experiential knowledge for decision-making, and a reasoning engine capable of CoT reasoning, task planning, motion planning, and self-reflection. Powered by LLMs, Agent-Driver enables a more nuanced, human-like approach to autonomous driving.

Figure 18. Agent-Driver [56]

12. Levels of AI Automation in Client Devices

There are two related works defining the levels of AI agents.

The desired features of Personal LLM Agents require different kinds of capabilities. Inspired by the six levels of autonomous driving defined by SAE (Society of Automotive Engineers), the intelligence of Personal LLM Agents is categorized into five levels, from L1 to L5, by Tsinghua University (Beijing, China) [57]. The key characteristics and representative use cases of each level are listed in Table 1 below.

Table 1. Levels of Personal LLM Agents [57]

The proactivity scale for IVCAs is designed based on assumption and autonomy by Tongji University (Shanghai, China) [49]. Incorporating user control as a design principle or constraint, proactivity is divided into five levels, shown in Table 2.

Table 2. Levels of IVCAs [49]

Quite similarly, we classify the levels of AI automation in client devices as shown in Table 3.

Table 3. AI Levels of Automation in Client Devices

13. Conclusions

We addressed how client devices evolve with the progress of AI, especially LLMs: from App-based to LLM-based, from Chatbot to copilot assistant to autonomous agent, working on various devices such as PCs, smartphones, smartwatches, smart homes, smart cockpits, and autonomous vehicles. Further, levels of automation on client devices are defined from L0 to L5.

References

[1] G Caldarini, S Jaf, and K McGarry, A Literature Survey of Recent Advances in Chatbots, arXiv 2201.06657, 2022

[2] R Sutcliffe, A survey of personality, persona, and profile in conversational agents and Chatbots, arXiv 2401.00609, 2024

[3] K Wang, J Ramos, R Lawrence, ChatEd: A Chatbot Leveraging ChatGPT for an Enhanced Learning Experience in Higher Education, arXiv 2401.00052, 2024

[4] P Ramjee, B Sachdeva, S Golechha, et al., CataractBot: An LLM-Powered Expert-in-the-Loop Chatbot for Cataract Patients, arXiv 2402.04620, 2024

[5] Zhongqi Yang, Elahe Khatibi, Nitish Nagesh, et al., ChatDiet: Empowering Personalized Nutrition-Oriented Food Recommender Chatbots through an LLM-Augmented Framework, arXiv 2403.00781, 2024

[6] D Zhang, L Chen, Z Zhao, R Cao, and K Yu. Mobile-Env: An evaluation platform and benchmark for interactive agents in LLM era. arXiv 2305.08144, 2023.

[7] W Hong, W Wang, Q Lv, et al., CogAgent: A Visual Language Model for GUI Agents, arXiv 2312.08914, 2023

[8] X Ma, Z Zhang, H Zhao, Comprehensive Cognitive LLM Agent for Smartphone GUI Automation, arXiv 2402.11941, 2024

[9] Pioneering a New Era of Intelligent Living, JoshGPT, https://www.josh.ai/joshgpt/, 2023

[10] H Li, J Su, Y Chen, Q Li, and Z Zhang. SheetCopilot: Bringing software productivity to the next level through large language models. arXiv 2305.19308, 2023.

[11] W Zhang, Y Shen, W Lu, Y Zhuang, Data-Copilot: Bridging Billions of Data and Humans with Autonomous Workflow, arXiv 2306.07209, 2023

[12] J Ye, A Hu, H Xu, et al., mPLUG-DocOwl: Modularized multimodal large language model for document understanding. arXiv 2307.02499, 2023

[13] A Hu, Y Shi, H Xu, et al., mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model, arXiv 2311.18248, 2023

[14] S Wasserkrug, L Boussioux, Di Hertog, et al., From Large Language Models and Optimization to Decision Optimization CoPilot: A Research Manifesto, arXiv 2402.16269, 2024

[15] H Furuta, K-H Lee, O Nachum, et al., Multimodal web navigation with instruction-finetuned foundation models (WebGUM), arXiv 2305.11854, 2023

[16] X Deng, Y Gu, B Zheng, S Chen, S Stevens, B Wang, H Sun, and Y Su. Mind2Web: Towards a generalist agent for the web. arXiv 2306.06070, 2023.

[17] S Zhou, F Xu, H Zhu, et al. WebArena: A realistic web environment for building autonomous agents. arXiv 2307.13854, 2023.

[18] I Gur, H Furuta, A Huang, M Safdari, Y Matsuo, D Eck, and A Faust. A real-world WebAgent with planning, long context understanding, and program synthesis. arXiv 2307.12856, 2023.

[19] K Yang, J Liu, J Wu, et al., If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents, arXiv 2401.00812, 2024

[20] P Ym, V Ganesan, D K Arumugam, et al., PwR: Exploring the Role of Representations in Conversational Programming, arXiv 2309.09495, 2023

[21] B Qiao, L Li, X Zhang, et al., TaskWeaver: A Code-First Agent Framework, arXiv 2311.17541, 2023

[22] Introducing Devin, the first AI software engineer, https://www.cognition-labs.com/introducing-devin, Mar., 2024

[23] M Tufano, A Agarwal, J Jang, AutoDev: Automated AI-Driven Development, arXiv 2403.08299, 2024

[24] SWE-agent: Agent Computer Interfaces Enable Software Engineering Language Models, https://github.com/princeton-nlp/SWE-agent, Mar. 2024

[25] Z Wu, C Han, Z Ding, et al., OS-Copilot: Towards generalist computer agents with self-improvement, arXiv 2402.07456, 2024

[26] C Zhang, L Li, S He, et al., UFO: A UI-Focused Agent for Windows OS Interaction, arXiv 2402.07939, 2024

[27] Humane AI Pin, https://humane.com/aipin, 2023

[28] Rabbit R1, https://www.rabbit.tech/rabbit-r1, 2023

[29] H Wen, H Wang, J Liu, and Y Li. DroidBot-GPT: GPT-powered UI automation for android. arXiv 2304.07061, 2023.

[30] A Yan, Z Yang, W Zhu, et al. GPT-4V in Wonderland: Large multimodal models for zero-shot smartphone GUI navigation (MM-Navigator). arXiv 2311.07562, 2023.

[31] M D Vu, H Wang, Z Li, et al., GPTVoiceTasker: LLM-Powered Virtual Assistant for Smartphone, arXiv 2401.14268, 2024

[32] S Lee, J Choi, J Lee, et al. Explore, select, derive, and recall: Augmenting LLM with human-like memory for mobile task automation (MemoDroid). arXiv 2312.03003, 2023.

[33] W Yin, M Xu, Y Li, X Liu, LLM as a System Service on Mobile Devices, arXiv 2403.11805, 2024

[34] G Li, H A K Hammoud, H Itani, et al., CAMEL: Communicative Agents for “Mind” Exploration of Large Language Model Society, arXiv 2303.17760, 2023

[35] S Hong, X. Zheng, J. Chen, et al. MetaGPT: Meta programming for multi-agent collaborative framework. arXiv 2308.00352, 2023.

[36] Q Wu, G Bansal, J Zhang, et al., AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation, arXiv 2308.08155, 2023

[37] W Zhou, Y E Jiang, L Li, et al., AGENTS: An Open-source Framework for Autonomous Language Agents, arXiv 2309.07870, 2023

[38] Weize Chen, Yusheng Su, Jingwei Zuo, et al., AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors, arXiv 2308.10848, 2023

[39] Z M Wang, Z Peng, H Que, et al., RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities of Large Language Models, arXiv 2310.00746, 2023

[40] S Yao et al., ReAct: Synergizing Reasoning and Acting in Language Models, arXiv 2210.03629, 2022

[41] Z Liu, W Yao, J Zhang, et al., AgentLite: A Lightweight Library for Building and Advancing Task-Oriented LLM Agent System, arXiv 2402.15538, 2024

[42] X Zhu, Y Chen, H Tian, et al., Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via LLMs with Text-based Knowledge and Memory, arXiv 2305.17144, 2023

[43] G Wang, Y Xie, Y Jiang, et al., VOYAGER: An Open-Ended Embodied Agent with Large Language Models, arXiv 2305.16291, 2023

[44] M Yan, R Li, H Zhang, et al., LARP: Language-agent Role Play for Open-world Games, arXiv 2312.17653, 2023

[45] S Qi, S Chen, Y Li, et al., CivRealm: a learning and reasoning odyssey in civilization for decision-making agents, arXiv 2401.10568, 2024

[46] W Ma, Q Mi, X Yan, et al., Large Language Models Play StarCraft II: Benchmarks and A Chain of Summarization Approach, arXiv 2312.11865, 2023

[47] X Shao, W Jiang, F Zuo, M Liu, SwarmBrain: embodied agent for real-time strategy game starcraft ii via large language models, arXiv 2401.17749, 2024

[48] S Hu, T Huang, L Liu, PokéLLMon: A Human-Parity Agent for Pokémon Battles with Large Language Models, arXiv 2402.01118, 2024

[49] H Du, X Feng, J Ma, et al., Towards Proactive Interactions for In-Vehicle Conversational Assistants Utilizing Large Language Models, arXiv 2403.09135, 2024

[50] L Wen et al., DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models, arXiv 2309.16292, 2023

[51] T-H Wang et al., Drive Anywhere: Generalizable End-to-end Autonomous Driving with Multi-modal Foundation Models, arXiv 2310.17642, 2023

[52] H Liao, H Shen, Z Li, et al., GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging Cross-Modal Attention with Large Language Models(CAVG), arXiv 2312.03543, 2023

[53] Z Xu et al., DriveGPT4: Interpretable End-To-End Autonomous Driving Via Large Language Model, arXiv 2310.01412, 2023

[54] C Pan, B Yaman, T Nesti, et al., VLP: Vision Language Planning for Autonomous Driving, arXiv 2401.05577, 2024

[55] X Tian, J Gu, Bailin Li, et al., DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models, arXiv 2402.12289, 2024

[56] J Mao, J Ye, Y Qian, M Pavone, Y Wang, A Language Agent for Autonomous Driving(Agent-Driver), arXiv 2311.10813, 2023

[57] Y Li, H Wen, W Wang, et al., Personal LLM agents: insights and survey about the capability, efficiency and security, arXiv 2401.05459, 2024


Yu Huang

Working in Computer vision, deep learning, AR & VR, Autonomous driving, image & video processing, visualization and large scale foundation models.