Multi-modal AI models will undoubtedly transform everything

Looking back at the beginning of the year, it's remarkable to consider how much has changed since generative AI burst onto the mainstream scene. This month, Google introduced three notable developments: Gemini, AI Hypercomputer, and Duet AI for Developers, which is now generally available. These additions join numerous other gen AI products and hundreds of gen AI updates released in 2023, reflecting the astonishing pace of progress.

This rapid innovation is evident across the board. Within Google Cloud, the number of active gen AI projects on Vertex AI has grown by more than 7X. Already, Gemini is greatly enhancing the Vertex AI platform, empowering developers to create sophisticated AI agents. Moreover, it is set to become part of our Duet AI portfolio, ensuring customers have AI support accessible whenever and wherever needed. There has been a notable surge in activity within the open-source generative AI community, accompanied by the emergence of remarkable models from various organizations in the industry. This represents a truly thrilling period of growth. 


Additionally, at the outset of 2023 most models were restricted to their training data; we now have robust solutions for fine-tuning models and grounding them in external and proprietary data sources. This lets organizations apply the intelligence of AI models across their own data. From question-answering chatbots that can draw on an organization's entire corpus to synthesizing and evaluating a variety of information, these capabilities are driving remarkable use cases.
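As a rough illustration of that grounding pattern, here is a minimal retrieval sketch: the most relevant internal passages are found for a question and prepended to the prompt. The embedding function is a toy bag-of-words stand-in and the documents are invented; a real system would use an embedding model, a vector store, and a gen AI model to answer the assembled prompt.

```python
import math
from collections import Counter

# Toy stand-in for a real embedding model: bag-of-words term counts.
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

# Cosine similarity between two sparse term-count vectors.
def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Illustrative snippets standing in for an organization's proprietary documents.
documents = [
    "Q3 incident report: two ladder falls were recorded on the Riverside site in August.",
    "Dividend policy: the board targets a 40 percent cash payout ratio.",
    "Vendor contract: crane inspections are scheduled every 90 days.",
]

def build_grounded_prompt(question: str, top_k: int = 2) -> str:
    q_vec = embed(question)
    ranked = sorted(documents, key=lambda d: cosine(q_vec, embed(d)), reverse=True)
    context = "\n".join(ranked[:top_k])
    # The retrieved passages ground the model's answer in the organization's own data.
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

print(build_grounded_prompt("What payout ratio does the dividend policy target?"))
```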

Without exaggerating, my initial experiences with Gemini felt like a magical "Eureka" moment, and I look forward to others experiencing their own moments of revelation. This marks the point at which an increasing number of leaders will not only identify new applications for generative AI, but will personally integrate it into nearly every aspect of their operations. Gemini was designed from the start to be multimodal, allowing it to process, comprehend, and integrate diverse forms of information, such as text, code, audio, images, and video, concurrently. Consequently, Gemini can respond to inquiries like: "What was the cash dividend payout ratio for this bank or online retailer over the last five years?"


The payout ratio is the portion of a company's earnings distributed to shareholders as dividends. To answer this question, a model must have a thorough understanding of the various definitions of cash, cash equivalents, and dividends, and be able to apply them in the context of mathematical ratios. It must also accurately retrieve five years of financial data from external systems and leverage other AI models to compute the ratio. Multimodality is what separates models that can merely predict the next word in a sentence from sophisticated models that comprehend and act on information across diverse data types. To respond to the query above, a model must recognize mathematical ideas such as equations and locate the precise components required, two tasks that were unthinkable less than a year ago.
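To make the arithmetic behind that question concrete, here is a small worked sketch of the ratio itself; all of the financial figures are invented purely for illustration.

```python
# Cash dividend payout ratio = cash dividends paid / net income, per fiscal year.
# All figures are invented for illustration; values are in billions of dollars.
financials = {
    2019: {"dividends_paid": 1.8, "net_income": 5.2},
    2020: {"dividends_paid": 1.9, "net_income": 4.1},
    2021: {"dividends_paid": 2.0, "net_income": 6.3},
    2022: {"dividends_paid": 2.2, "net_income": 6.0},
    2023: {"dividends_paid": 2.3, "net_income": 6.8},
}

for year, f in sorted(financials.items()):
    ratio = f["dividends_paid"] / f["net_income"]
    print(f"{year}: cash dividend payout ratio = {ratio:.1%}")
```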

With models like Gemini, a completely new era of gen AI is about to begin, one that will get us closer to actual language understanding and enable systems to synthesize a wide range of data types and generate much more value for businesses across sectors. Because models like Gemini can handle so many more scenarios, their applicability across domains and real-world environments is that much stronger. Our on-device, mobile-sized Gemini Nano model opens up new possibilities for running AI at the edge, allowing for faster, more secure data analysis and response even with constrained connectivity. These mobile-first models can improve a wide range of tasks, from augmented gaming to mobile banking and emergency services.

Additionally, multimodal capabilities give enterprises fresh approaches to combining disparate data types to address real-world problems. Many sectors deal with unstructured, unpredictable issues that cannot be resolved with a single type of analysis or a small number of data sources.

For example, increasing safety on construction sites requires analyzing and integrating a wide range of data. An organization may have visual data, such as photos or video feeds, incident reports from its sites, and other kinds of data, such as schedule delays or financial costs. With the aid of multimodal gen AI models, it will be possible to combine all of this data, identify where, when, and how accidents occur, and develop safer, more effective practices.
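As a rough sketch of what such a combined query could look like, the snippet below sends a site photo and an incident report to a multimodal Gemini model through the Vertex AI Python SDK's generative_models interface; the project, bucket path, and prompt are illustrative, and the exact SDK surface may differ from this assumption.

```python
import vertexai
from vertexai.preview.generative_models import GenerativeModel, Part

# Illustrative project, location, and file paths; substitute your own.
vertexai.init(project="my-construction-co", location="us-central1")
model = GenerativeModel("gemini-pro-vision")

incident_report = (
    "2023-08-14, Riverside site: a worker slipped near the east scaffold during "
    "the concrete pour; the pour was running two days behind schedule."
)

response = model.generate_content([
    # A site photo stored in Cloud Storage, passed alongside the text report.
    Part.from_uri("gs://my-bucket/riverside-east-scaffold.jpg", mime_type="image/jpeg"),
    "Here is an incident report from the same area:\n" + incident_report,
    "Combining the photo and the report, what conditions likely contributed to the "
    "incident, and what preventative steps would you recommend?",
])
print(response.text)
```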

