Organizations rush to democratize data and AI
Generative AI is ushering in a new era of innovation, creativity and productivity. Just 18 months after it entered mainstream conversations, companies everywhere are investing in GenAI to transform their organizations. Enterprises are realizing that their data is central to delivering a high-quality GenAI experience for their users. The urgent question among business leaders now is: What’s the best and fastest way to do that?

With siloed data and AI platforms, it’s difficult for teams to accelerate their GenAI projects — whether they are using natural language to ask questions of their data or are building intelligent apps with their data. We believe that data intelligence platforms will result in radical democratization across organizations. This new category of data platforms uses GenAI to more easily secure and leverage data, and lower the technical bar to create value from it. Among our own customers, there’s already a clear acceleration of AI adoption.

The State of Data + AI report provides a snapshot of how organizations are prioritizing data and AI initiatives. The insights shared come from more than 10,000 global customers — now including over 300 of the Fortune 500 — using the Databricks Data Intelligence Platform. Discover how the most innovative organizations are succeeding with machine learning, adopting GenAI and responding to evolving governance needs. This report is designed to help companies develop effective data strategies in the evolving era of enterprise AI.
- 11x more AI models were put into production this year
After years of being stuck experimenting with AI, companies are now deploying substantially more models into the real world than a year ago. On average, organizations became over 3 times more efficient at putting models into production. Natural language processing is the most-used and fastest-growing machine learning application.
- 70% of companies leveraging GenAI use tools and vector databases to augment base models
- 76% of companies using LLMs choose open source, often alongside proprietary models
Machine Learning
ORGANIZATIONS RACE TO PUT ML MODELS INTO PRODUCTION This year, we’ve seen a shift from experimentation to production applications of AI. As machine learning (ML) takes off, companies are learning to navigate two distinct halves of the ML model lifecycle. Organizations first create their ML models through the process of experimental testing, trying out different algorithms and hyperparameters to get to the best models, before putting these models into production. In this process, teams have two competing goals: ensuring the experimentation phase is as time-efficient as possible, while only putting rigidly tested models into production. Deploying models in production has historically had many challenges: disparate data and AI platforms, complex deployment workflows, lack of access controls for governance, inability to monitor and more. Our data reveals how companies are overcoming these challenges with the introduction of data intelligence platforms.
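The two-phase lifecycle described above — many experimental runs, of which only the best-validated model is promoted — can be sketched in plain Python. This is an illustrative toy, not the report's methodology: the hyperparameter grid, the `evaluate` stand-in and the `churn_model` registry name are all hypothetical, standing in for what a team would do with an experiment tracker and model registry such as MLflow.

```python
import itertools

# Hypothetical stand-in for "train a model with these hyperparameters and
# score it on a validation set" — in practice each call would be one
# tracked experiment run.
def evaluate(lr, depth):
    return 1.0 - abs(lr - 0.1) - 0.01 * abs(depth - 6)

# Experimentation phase: every trial is logged.
logged_runs = []
for lr, depth in itertools.product([0.01, 0.1, 0.5], [3, 6, 9]):
    logged_runs.append({"lr": lr, "depth": depth, "score": evaluate(lr, depth)})

# Production phase: only the best-performing, tested model is registered.
best = max(logged_runs, key=lambda r: r["score"])
registry = {"churn_model": best}  # stand-in for a model registry entry

print(len(logged_runs), "experiments logged,", len(registry), "model registered")
```

The point of the sketch is the asymmetry the report measures: many logged runs per registered model, with the logged-to-registered ratio as the efficiency metric.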
Companies accelerate production ML

Data from MLflow (an open source MLOps platform developed by Databricks) shows how frequently our customers are logging models (representing experimentation) and registering models (representing production). The results? Not only do we see more experimentation, but companies are also becoming substantially more efficient at getting into production.
RATIO OF EXPERIMENTS LOGGED TO MODELS REGISTERED

| | OVERALL | NUMBER OF CUSTOMERS |
|---|---|---|
| EXPERIMENTS LOGGED | 134% growth in the number of experiments logged | 56% growth in the number of companies logging at least 1 experiment |
| MODELS REGISTERED | 1,018% growth in the number of models registered | 210% growth in the number of companies registering at least 1 model |

*The YoY growth of models registered has far outpaced the growth of experiments logged, indicating companies are moving from experimentation to production.
A giant leap: 11x more models went into production
The volume of models has grown substantially in measurable ways.

THE NUMBER OF COMPANIES INVESTING IN ML HAS SKYROCKETED
Our data shows that 56% more companies are logging experimental models compared to a year ago, but 210% more are registering models. This indicates many companies that were focused on experimenting last year have now moved into production.

THE NUMBER OF ML MODELS IS UP ACROSS COMPANIES
After years of intense focus on experimentation, organizations are now charging into production. 1,018% more models were registered this year, far outpacing experiments logged, which grew 134%. We see this trend at the company level as well. The average organization registered 261% more models and logged 50% more experiments this year.

THE TAKEAWAY
ML is core to how companies innovate and differentiate. As companies continue to build their confidence, we expect this trend to continue in the coming years. The newer field of GenAI is still in the testing phase, but companies are starting to gain traction.
Companies become 3x more efficient at putting models into production
ML efficiency has real value that can be measured in time, money and resources. While model development and experimentation are crucial, ultimately these models need to be deployed to real-world use cases to drive business value. We looked at the ratio of logged-to-registered models across all customers to assess progress. In February 2023, the ratio of logged-to-registered models was 16-to-1. This means that for every 16 experimental models, one model gets registered for production. By the end of the data range, the ratio of logged-to-registered models dropped sharply to 5-to-1, an improvement of 3x. The takeaway? Companies are becoming significantly more efficient at getting models into production, spending fewer resources on experimental models that never provide real-world value.
OVERALL RATIO OF LOGGED-TO-REGISTERED MODELS

| DATE | RATIO |
|---|---|
| February 2023 | 16:1 |
| March 2024 | 5:1 |
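The efficiency arithmetic behind those ratios is simple enough to verify directly. Here is a short calculation using the two figures from the chart above (the ratios are the report's; the variable names are ours):

```python
# Ratio of logged (experimental) models to registered (production) models
feb_2023 = 16 / 1
mar_2024 = 5 / 1

# Share of experimental models that reach production
feb_share = 1 / feb_2023   # 1 in 16 = 6.25%
mar_share = 1 / mar_2024   # 1 in 5  = 20%

improvement = feb_2023 / mar_2024
print(f"{improvement:.1f}x more efficient")
```

Strictly, 16:1 to 5:1 is a 3.2x gain, which the report rounds down to "3x more efficient."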
Efficiency at an industry level

Industries have different datasets, strategic goals and risk profiles. Therefore, we expect to see variations in their ML approach, including their mix of ML experimentation and production. We analyzed six key industries to better understand these trends. The ratio of logged-to-registered models steadily decreased between February 1, 2023, and March 31, 2024, indicating that companies moved a larger share of their experimental models into production.
NLP explodes

NLP IS THE TOP DATA SCIENCE AND ML APPLICATION FOR THE SECOND YEAR RUNNING
Unstructured data is ubiquitous across industries and regions, making natural language processing (NLP) techniques essential to derive meaning. GenAI is a key use case of NLP. The following charts focus on Python libraries because Python is at the forefront of ML and AI advancements and consistently ranks as one of the most popular programming languages. In our data, we aggregate the usage of specialized Python libraries to determine the top five data science and ML (DS/ML) applications used within organizations.
TOP DS/ML APPLICATIONS, BY INDUSTRY

| INDUSTRY | NLP | GEOSPATIAL | TIME SERIES | GRAPH |
|---|---|---|---|---|
| Financial Services | Yes | Yes | No | Yes |
| Retail & Consumer Goods | Yes | Yes | Yes | No |
| Manufacturing & Automotive | Yes | Yes | No | No |
| Communication, Media & Entertainment | Yes | Yes | No | No |
| Healthcare & Life Sciences | Yes | No | No | No |
| Public Sector & Education | Yes | Yes | No | No |
NLP is the most commonly used Python library application, leveraged heavily by all our featured industries.
For the second year in a row, our data shows NLP is the top DS/ML application; 50% of specialized Python libraries used are associated with NLP. Data teams are also eager to leverage Geospatial and Time Series applications. Geospatial libraries, which are often used for location-based analysis to customize user experiences, are the second most popular use case, accounting for 30% of Python library usage.

HEALTHCARE & LIFE SCIENCES HAS THE HIGHEST ADOPTION OF NLP
Among our featured industries, Healthcare & Life Sciences has the highest proportion of Python library usage devoted to NLP, at 69%. According to a survey conducted by Arcadia with the Healthcare Information and Management Systems Society, the healthcare industry generates 30% of the world’s data volume and is growing faster than any other industry. NLP can support the analysis of clinical research, accelerate the process of bringing novel drugs to market and improve the commercial effectiveness of sales and marketing.
NLP, the most widely used DS/ML application, isn’t slowing down

With the rise of AI-driven applications, there’s growing demand for NLP solutions across industries. NLP not only dominates Python library usage but also has the highest growth of all applications, at 75% YoY.
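To make the NLP use case concrete: a first step in deriving meaning from unstructured text is often term extraction over a document collection. The sketch below uses only the standard library; the toy clinical notes and the hypothetical `STOPWORDS` set are ours, standing in for the kind of text the specialized NLP libraries in our data process at scale.

```python
import re
from collections import Counter

# Toy corpus standing in for unstructured enterprise text (e.g., clinical notes)
docs = [
    "Patient reports mild headache; headache resolved after treatment.",
    "Follow-up visit: no headache, treatment well tolerated.",
]

STOPWORDS = {"reports", "after", "no", "well", "mild"}  # illustrative only

def top_terms(texts, n=3):
    """Return the n most frequent non-stopword terms across all texts."""
    tokens = re.findall(r"[a-z]+", " ".join(texts).lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return counts.most_common(n)

print(top_terms(docs))
```

Production NLP pipelines replace this word-counting with learned representations, but the workflow — normalize, tokenize, aggregate — is the same shape.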
ALL INDUSTRIES INVEST HEAVILY IN NLP
Among our featured industries, Manufacturing & Automotive had the largest gains in use of NLP, with a 148% YoY increase. NLP — which helps the industry do everything from analyzing customer feedback to monitoring quality control to powering chatbots — enables companies to improve operational efficiency. Public Sector & Education’s growth in NLP over the past year follows close behind, at 139% YoY.

FROM WILDFIRES TO BIRD FLU, CURRENT EVENTS CORRESPOND WITH ML GROWTH
Geospatial is the other application that grew significantly across all six industries. Companies are increasingly searching for patterns, trends and correlations in location-based data. The high rate of Geospatial growth in Public Sector & Education may relate to disaster management and emergency response planning. The third highest rate of growth across all applications and industries is the adoption of Time Series libraries in Healthcare & Life Sciences, at 115% YoY. Time Series supports patient risk predictions, supply forecasting and drug discovery. A 2023 review by the NIH determined that “time-series analysis allows us to do easily and, in less time, precise short-term forecasting in novel pandemics by estimating directly from data.”
Evolution to GenAI
TOP DATA AND AI PRODUCTS SHOW THE NEXT PHASE OF GENAI
Data leaders are always searching for the best tools to deliver their AI strategies. Our Top 10 Data and AI Products showcase the most widely adopted integrations on the Databricks Data Intelligence Platform. Our categories include DS/ML, data governance and security, orchestration, data integration and data source products. Among our top products, 9 out of 10 are open source. Organizations are choosing more flexibility while avoiding proprietary walls and restrictions. As we’ll discuss later in the report, we also see growing popularity of open LLMs.
TOP 10 DATA AND AI PRODUCTS
- Plotly Dash
- Hugging Face
- dbt
- LangChain
- Airflow
- GeoPandas
- Shapely
- Kafka
- Fivetran
- Great Expectations
PLOTLY DASH MAINTAINS TOP POSITION
Plotly Dash is a low-code platform that enables data scientists to easily build, scale and deploy data applications. Products like Dash help companies deliver applications faster and more easily to keep up with dynamic business needs. For more than 2 years, Dash has held its position as No. 1, which speaks to the growing pressure on data scientists to develop production-grade data and AI applications.
HUGGING FACE TRANSFORMERS JUMPS TO NO. 2
Hugging Face Transformers ranks as the second most popular product used among our customers, up from No. 4 a year ago. Many companies use the open source platform’s pretrained transformer models together with their enterprise data to build and fine-tune foundation models. This supports a growing trend we’re seeing with RAG applications.
LANGCHAIN BECOMES A TOP PRODUCT ONLY MONTHS AFTER INTEGRATION
LangChain — an open source toolchain for working with and building proprietary LLMs — jumped into the top ranks last spring and rose to No. 4 in less than one year of integration. When companies build their own modern LLM applications and work with specialized transformer-related Python libraries to train the models, LangChain enables them to develop prompt interfaces or integrations to other systems.
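The "prompt interface" idea mentioned above is easy to illustrate without any dependencies. The sketch below shows the template pattern that LangChain's prompt tooling provides (with added validation and chaining); the template text and `build_prompt` helper here are our own illustrative stand-ins, not LangChain's API.

```python
# Minimal prompt-template pattern, standard library only.
# LangChain's PromptTemplate offers the same idea with validation,
# composition and integrations to other systems.
TEMPLATE = (
    "Answer the question using only the context below.\n"
    "Context: {context}\n"
    "Question: {question}\n"
)

def build_prompt(context: str, question: str) -> str:
    """Fill the template's named slots to produce the final LLM prompt."""
    return TEMPLATE.format(context=context, question=question)

prompt = build_prompt(
    context="Q1 revenue grew 12% year over year.",
    question="How fast did revenue grow in Q1?",
)
print(prompt)
```

Keeping the template separate from the filling logic is what lets teams swap in retrieved context, user input or upstream system output without rewriting the prompt itself.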
COMPANIES INVEST IN PRODUCTS TO BUILD HIGH-QUALITY DATASETS The prominence of three data integration products in our top 10 indicates companies are focused on building trusted datasets. dbt (data transformation), Fivetran (automation of data pipelines) and Great Expectations (data quality) all have steady growth. Most notably, dbt jumped two spots in the past year.
Vector Databases

Enterprises rush to customize LLMs
LLMs support a variety of business use cases with their language understanding and generation capabilities. However, especially in enterprise settings, LLMs alone have limitations. They can be unreliable information sources and are prone to providing erroneous information, called hallucinations. At the root, stand-alone LLMs aren’t tailored to the domain knowledge and needs of a specific organization. Our data confirms that more companies are turning to RAG instead of relying on stand-alone LLMs. RAG enables organizations to use their own proprietary data to better customize LLMs and deliver high-quality GenAI apps. By providing LLMs with additional relevant information, the models can give more accurate answers and are less likely to hallucinate.
RAG leads the way for GenAI in the enterprise Last year, our LLM Python Libraries chart revealed the hot trajectory of SaaS LLMs, which grew 1,310% in just over 5 months. SaaS LLMs like GPT-4 are trained on massive text datasets and went mainstream less than 2 years ago. This year, vector database adoption is tearing up our chart. The entire vector database category has grown 377% YoY, and 186% just since the Public Preview of Databricks Vector Search.
WHAT IS RAG?
Retrieval augmented generation (RAG) is a GenAI application pattern that finds data and documents relevant to a question or task and provides them as context for the LLM to give more accurate responses.

HOW DO VECTOR DATABASES AND RAG WORK TOGETHER?
Vector databases generate representations of predominantly unstructured data. This is useful for information retrieval in RAG applications to find documents or records based on their similarity to keywords in a query. RAG applications have many advantages over off-the-shelf LLMs. RAG has quickly emerged as a popular way to incorporate proprietary, real-time data into LLMs without the costs and time requirements of fine-tuning or pretraining a model. The exponential growth of vector databases suggests that companies are building more RAG applications in order to integrate their enterprise data with their LLMs.
LLM DEFINITIONS Transformer training: Libraries for training transformer models (e.g., Hugging Face Transformers)
SaaS LLMs: Libraries for accessing API-based LLMs (e.g., OpenAI)
LLM tools: Toolchains for working with and building proprietary LLMs (e.g., LangChain)
Vector databases: Vector/KNN indexes (e.g., Pinecone and Databricks Vector Search)
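The retrieve-then-augment flow defined above can be shown end to end in a few lines. This is a deliberately tiny sketch: bag-of-words vectors stand in for learned embeddings, and the in-memory `index` stands in for a real vector database such as Databricks Vector Search or Pinecone; the example documents and query are invented for illustration.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words vector (real apps use an embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

documents = [
    "the refund policy allows returns within 30 days",
    "support hours are 9am to 5pm on weekdays",
]
index = [(doc, embed(doc)) for doc in documents]  # stand-in for a vector DB

def retrieve(query, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda dv: cosine(q, dv[1]), reverse=True)
    return [doc for doc, _ in ranked][:k]

# "Augmented generation": retrieved context is prepended to the LLM prompt.
question = "how long is the returns window"
context = retrieve(question)[0]
prompt = f"Context: {context}\nQuestion: {question}"
print(prompt)
```

Swapping `embed` for a real embedding model and `index` for a vector database turns this shape into a production RAG application; the retrieval-then-prompt structure stays the same.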
COMPANIES ARE BECOMING MORE SOPHISTICATED IN BUILDING LLMs
Last year, customers were jumping into LLMs with off-the-shelf models. We still see 178% YoY growth in the number of customers using SaaS LLMs. But companies are beginning to take more control over their LLMs and build tools specific to their needs. The continuing growth of vector databases, LLM tools and transformer-related libraries shows that many data teams are choosing to build vs. buy. Companies increasingly invest in LLM tools, such as LangChain, to work with and build proprietary LLMs. Transformer-related libraries like Hugging Face are used to train LLMs, and still claim the highest adoption by number of customers. Use of these libraries grew 36% YoY. Together, these trend lines indicate a more sophisticated adoption of open source LLMs.

377% YoY growth in the number of customers using vector databases
Companies prefer smaller open source models
One of the biggest benefits of open source LLMs is the ability to customize them for specific use cases — especially in enterprise settings. We often hear the question: What’s the most popular open source model? In practice, customers often try many models and model families. We analyzed the open source model usage of Meta Llama and Mistral, the two biggest players. Our data shows that the open LLM space is fluid, with new state-of-the-art models getting rapid adoption. With each model, there is a trade-off between cost, latency and performance. Together, usage of the two smallest Meta Llama 2 models (7B and 13B) is significantly higher than that of the largest, Meta Llama 2 70B. Across Meta Llama 2, Llama 3 and Mistral users, 77% choose models with 13B parameters or fewer. This suggests that companies care significantly about cost and latency.

COMPANIES ARE QUICK TO TRY NEW MODELS
Meta Llama 3 launched on April 18, 2024. Within its first week, organizations already started leveraging it over other models and providers. Just 4 weeks after its launch, Llama 3 accounted for 39% of all open source LLM usage.
- 76% of companies that use LLMs are choosing open source models, often alongside proprietary models.
- 70% of companies that leverage GenAI are using tools, retrieval and vector databases to customize models.
TOP GENAI PYTHON PACKAGES

| AREA | RANK 1 | RANK 2 | RANK 3 |
|---|---|---|---|
| Prompt engineering | LangChain | LlamaIndex | DSPy |
| Vector stores | Facebook AI Similarity Search | Pinecone | Weaviate |
| Evaluation | MLflow | Weights & Biases | Evaluate |
| Model building | Hugging Face Transformers | Tiktoken | Hugging Face Datasets |