SenseTime Shifts Toward Multimodal AI as It Aims to Reclaim Industry Leadership

SenseTime, one of China’s earliest artificial intelligence innovators, is refocusing its strategy as the global AI industry moves beyond pure language models and toward systems that can see, understand and act in the physical world. According to co-founder and chief scientist Lin Dahua, the company believes its strong background in computer vision gives it a natural advantage as the field heads into a new era defined by multimodal capabilities and embodied intelligence.
Betting on Vision as the Foundation of Future AI Systems
In a recent interview, Lin explained that SenseTime’s experience in visual recognition, 3D perception and image-based learning places it in a strong position as AI shifts toward real-world environments. While many firms have poured resources into large language models, Lin noted that these systems face limitations when dealing with tasks that require physical awareness or interaction.
SenseTime plans to bridge this gap by building AI that can integrate visual understanding with language, motion and decision making. This approach could enable robots, autonomous systems and AI agents to operate more naturally in everyday settings. From factory floors to retail stores to urban streets, the demand for AI that can perceive and act in complex environments is rising, and SenseTime sees this as a major opportunity.
A Growing Debate Over the Future of Large Language Models
The excitement surrounding large language models has dominated the past few years, but the industry is increasingly acknowledging their limitations. While they excel at generating text and analysing language patterns, they are less capable when tasks involve interpreting physical environments or performing actions that require spatial awareness.
Lin said the conversation within the AI community is shifting toward how multimodal systems can solve problems that language alone cannot address. These systems combine vision, sound, movement and decision making, creating a more complete form of intelligence. For SenseTime, the trend plays directly into its strengths.
Parallels With Google’s Multimodal Strategy
Lin also pointed out that SenseTime’s strategy resembles that of Google, which has been advancing multimodal systems such as the new Nano Banana Pro model. Google’s approach starts with strong vision capabilities and layers linguistic and reasoning skills on top to create more integrated intelligence. Lin believes this vision-centred methodology will shape the next generation of AI tools worldwide.
SenseTime aims to follow a similar path, blending its vision-based expertise with growing capabilities in language and general intelligence. By doing so, the company hopes to regain momentum in an industry where competition has intensified and where new breakthroughs are appearing rapidly.
Embodied Intelligence and Robotics as the Next Frontier
As AI moves into real-world applications, embodied intelligence has become one of the most promising areas of innovation. Robots capable of navigating complex spaces, interacting safely with humans and completing meaningful tasks require far more than textual comprehension. They need vision, motor control, situational awareness and the ability to learn from dynamic environments.
SenseTime expects this area to grow quickly. Its goal is to develop AI agents that can support manufacturing automation, logistics, hospitality services and personal assistance. Lin emphasised that these capabilities do not emerge from language models alone but from systems grounded in perception and action.
Building Toward More Practical AI Applications
Lin believes the industry is entering a phase where hype is giving way to practical engineering. Companies that can deliver real-world performance, not just impressive text outputs, will gain the advantage. For SenseTime, this means investing heavily in the hardware and software that bring multimodal AI to life, including sensors, robotics platforms and integrated learning models.
By combining vision, language and motion, SenseTime hopes to produce systems capable of meaningful assistance in daily life. Whether guiding autonomous vehicles, helping robots pick and sort objects or enabling interactive AI assistants, these multimodal models could reshape industries and set new standards for intelligence.
A Renewed Direction for SenseTime
SenseTime’s shift reflects a broader transformation in the AI field. As companies worldwide explore how to bring AI into physical environments, vision-based capabilities are becoming essential. By tapping into its deep experience and aligning its strategy with global leaders, SenseTime aims to regain its competitive edge and position itself at the forefront of the next wave of artificial intelligence.