Enterprising Insights: Episode 33 – Salesforce Announces an LLM Benchmark for CRM

In this episode of Enterprising Insights, host Keith Kirkpatrick discusses Salesforce’s announcement of what it says is the world’s first LLM benchmark for CRM. He discusses the elements of the benchmark, explains why it’s important, and assesses whether other benchmarks from SaaS companies or third parties will begin to come to market. Then, as always, Kirkpatrick will rant or rave about an item in the market, this time focusing on the lack of transparency around the collection of driving data by automakers Hyundai, Honda, and General Motors.

You can grab the video here and subscribe to our YouTube channel if you’ve not yet done so.

Listen to the audio below:

Disclaimer: The Enterprising Insights podcast is for information and entertainment purposes only. Over the course of this podcast, we may talk about companies that are publicly traded and we may even reference that fact and their equity share price, but please do not take anything that we say as a recommendation about what you should do with your investment dollars. We are not investment advisors and we do not ask that you treat us as such.

Transcript:

Keith Kirkpatrick: Hello everyone. I’m Keith Kirkpatrick, Research Director with The Futurum Group, and I’d like to welcome you to Enterprising Insights. It’s our weekly podcast that explores the latest developments in the enterprise software market and the technologies that underpin these platforms, applications, and tools. This week I’d like to talk about Salesforce’s recent announcement of what it claims is the world’s first LLM benchmark for CRM. We’re going to get into what that benchmark is, why it’s important, and assess whether we’re going to start seeing similar benchmarks crop up in the market, particularly as we enter this new phase of generative AI where we’re moving away from just utilizing a generic large language model and starting to see more purpose-built ones that are being tuned for specific applications, specific tasks, and so forth. Then, as always, I’m going to move to my rant or rave segment, where I pick one item in the market and I either champion it or criticize it.

Without further delay, let’s get into this week’s topic, which is Salesforce’s announcement of a benchmark for LLMs tailored toward the CRM application. Yes, Salesforce did announce this, I believe it was a couple of weeks ago now. I’m a little late to get to this in terms of my schedule, but I do think it’s an important topic. This new benchmark is essentially, according to Salesforce, a comprehensive evaluation framework that measures the performance of LLMs against four specific measures: accuracy, cost, speed, and trust and safety. What this benchmark is really trying to do is evaluate common sales and service use cases, including things like prospecting, lead nurturing, and sales opportunity and service case summaries. So, it is not a generic benchmark trying to assess LLM performance against generic factors such as overall speed or overall accuracy. That’s not what it’s trying to do. What it’s trying to do is look at an LLM in context, being used for these very specific use cases within a CRM application. Now, this is really designed to help people who are using Salesforce evaluate different LLMs and choose which ones are going to work best for their specific use cases.

Now, why is there even a need for this? Well, there are existing LLM benchmarks, but they’re really academic in nature or they’re focused on consumer use cases; they’re not really relevant for business users. I think one of the other things that they mentioned is that existing benchmarks are very much focused on pure measures of compute and how quickly a model returns a result. They don’t really have expert human evaluations going in to address some of these other issues and to make sure that there really is accuracy in context. And the other thing, of course, is trust, making sure that the LLM doesn’t go off the rails and hallucinate. This is really about looking at something in practice. To give you an analogy with automobiles: if you read any kind of car enthusiast magazine, whether we’re talking Car and Driver, Motor Trend, whatever, they’ll give you the specs that the manufacturer provides. Then they’ll run their road tests and give, let’s say, a zero-to-60 assessment of how quickly a car can accelerate. That’s great, but it doesn’t necessarily translate to the real world of driving a car. That’s where you need to actually conduct other tests. Things like, how quickly can the car accelerate from 55 to 75 to pass? How long does it take to do that? Because that’s really looking at the vehicle in context.

And then, of course, to really get an accurate benchmark, you’d have to incorporate other factors such as the driver’s confidence and skill level, all of that kind of stuff. And that’s what this LLM benchmark is trying to do. It’s trying to look at LLM performance within the context of very specific sales and service-based use cases as it pertains to being used with Salesforce CRM. Now, what’s really interesting, and I’m going to try to include a screenshot of the interface which actually has this benchmark, is that it will line up a number of different LLMs, go through each attribute across the four categories of accuracy, cost, speed, and trust, and assess each LLM against those specific use cases. Let’s just take a look at some of the attributes here. Accuracy. They’re all important, but accuracy is a big one. This metric actually contains four different subcategories: factuality, completeness, conciseness, and the ability to follow instructions. If you think about accuracy as it pertains to a model, it really is about combining all four of those subcategories, because even if an LLM were to provide a factual answer, it may not necessarily be following the instructions of the prompt as closely as we might want it to.
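To make the idea of combining those subcategories concrete, here is a minimal sketch of how four sub-metric scores might roll up into a single accuracy figure. The weighting scheme, scores, and function name are hypothetical illustrations, not Salesforce’s actual methodology, which relies on expert human evaluation.

```python
# Hypothetical sketch: roll up the four accuracy sub-metrics
# (factuality, completeness, conciseness, instruction following)
# into one score. All weights and scores here are illustrative.

SUBMETRICS = ("factuality", "completeness", "conciseness", "instruction_following")

def accuracy_score(subscores, weights=None):
    """Weighted average of the four sub-metric scores (each 0.0-1.0)."""
    if weights is None:
        weights = {m: 0.25 for m in SUBMETRICS}  # equal weighting by default
    total = sum(weights.values())
    return sum(subscores[m] * weights[m] for m in SUBMETRICS) / total

# A model can be highly factual yet still score poorly overall if it
# ignores the prompt's instructions or returns an incomplete answer.
scores = {"factuality": 0.9, "completeness": 0.6,
          "conciseness": 0.8, "instruction_following": 0.5}
print(round(accuracy_score(scores), 3))  # 0.7
```

The point of the sketch is simply that no single subcategory dominates: a weak instruction-following score drags the overall number down even when factuality is strong.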

Same thing with completeness. If you were to ask the LLM to return something and it returns a result that is partially correct but not fully complete, that may not be useful in the scope of an actual business use case. It’s interesting that the accuracy measure contains a number of different subcategories, and that’s why it’s important to consider all of them together to really assess the performance of a particular LLM and its value as a business tool. Cost is another one. This is interesting because it’s categorized as high, medium, and low on a percentile basis. This is an estimated operational cost that will vary by each use case, and really, it’s more of a comparative scale where you’re comparing different LLMs against each other based on a specific use case. It is a relative measure as opposed to a hard and fast cost of X amount to operate, because that’s very difficult to ascertain; it will vary based on exactly what goes into each prompt, what type of data is being queried, all of that kind of stuff. But you can get a relative idea of the cost of one model versus another. I think that’s interesting.
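As a rough sketch of what percentile-based, relative cost tiering could look like, the snippet below buckets hypothetical per-use-case cost estimates into low, medium, and high tiers. The model names, costs, and one-third/two-thirds cutoffs are all assumptions for illustration; Salesforce has not published its tiering method in this level of detail.

```python
# Hypothetical sketch: bucket per-use-case LLM cost estimates into
# relative "low" / "medium" / "high" tiers by percentile rank.
# Model names, costs, and cutoffs are made up for illustration.

def cost_tiers(costs):
    """Map each model to a relative cost tier based on its percentile
    rank among the models being compared for one use case."""
    ranked = sorted(costs, key=costs.get)  # cheapest first
    n = len(ranked)
    tiers = {}
    for i, model in enumerate(ranked):
        pct = i / (n - 1) if n > 1 else 0.0  # percentile rank, 0.0-1.0
        if pct < 1 / 3:
            tiers[model] = "low"
        elif pct < 2 / 3:
            tiers[model] = "medium"
        else:
            tiers[model] = "high"
    return tiers

# Illustrative estimated cost per 1,000 "service case summary" requests.
estimates = {"model_a": 0.40, "model_b": 1.10, "model_c": 2.75, "model_d": 0.55}
print(cost_tiers(estimates))
```

Note that the tiers only mean something within one comparison set for one use case; re-running the same models on a different use case could shuffle the buckets, which is exactly why the benchmark treats cost as relative rather than absolute.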

Now, the third one is speed. This is a metric that assesses the LLM’s responsiveness and efficiency in processing and delivering information. Obviously, faster response times will improve the user experience and reduce the wait time for customers when deployed in certain customer-facing applications. And of course, it enables service and support teams to address inquiries more quickly if the LLM is responsive. And again, this is interesting because certain things are going to take the LLM longer to respond to, and that’s where it’s important to look at which LLM works best for a specific use case, so you can determine which one might be most suitable. And then, of course, you’re going to have to balance that against some of the other factors as well, because it’s not necessarily going to be about, “Hey, let’s just pick the one that has the top numbers in a specific category.” It’s really about evaluating it in context.

And then, of course, the fourth category here, and this is something that Salesforce has been pounding the table on really since generative AI became a thing about a year and a half ago, and that is trust and safety. This metric is designed to measure the LLM’s capability to shield sensitive customer data, adhere to privacy regulations, secure that information, and refrain from incorporating bias and toxicity across various CRM use cases. Now, this one is interesting because it’s really looking at the model and framing it in the context of adhering to those specific criteria. That’s going to be really important for organizations as they start to evaluate different LLMs, and obviously, even as we move to more of an open model for generative AI, I expect that’s going to be increasingly important. We’re going to get into this a little more later on in terms of looking at these issues, particularly around the use of data and how that really will impact an organization and its relationship with its customers.

Now, there are a couple of other things here that I think are worth mentioning. I do think this is a good move for Salesforce in terms of being out in front of the issue of assessing the performance of LLMs in context, because if you don’t assess an LLM using very specific, standardized metrics, it’s very hard to compare which one is the best for a specific use case. I do think that we’re going to start to see other platforms also develop these types of benchmarks, and perhaps we’re going to see some third-party companies provide these solutions as well. I still think, though, that it’s going to have to be an evaluation that’s done in a very specific manner. What I mean by that is you can’t just have one generic “Okay, let’s compare these models” exercise, because you’re going to have to look at what platform you’re using, what data or type of data you’re using, and what use case. And that’s going to be very specific to the organization and the tools that they have in place.

I do think that looking at this within the context of the data available is not an easy process, but having a tool to start that process really gets you that much further down the road, particularly as end-user organizations start to incorporate things like small language models, those very specific models. It’s going to be important for them to really ascertain what the best approach with generative AI is. Is it taking a large open model with however many billion parameters from OpenAI? Or is it going to be a model that might be a little more expensive to use because it has been purpose-built? And if the latter performs better on some other metrics, perhaps that’s the best choice. So, it’s going to be very interesting to see how that pans out over time as organizations get more familiar with generative AI and really start to crystallize their vision for what they want to do and what business outcomes they’re looking to enable, obviously in conjunction with other technology. To wrap up on this, it will be interesting to see where we’re headed, but I think this is the start of a trend. I do think we’re going to see more of these tools come into the marketplace over the next several months and years.

Okay, so with that, I want to move to my rant or rave segment. That’s where I pick one item in the market and I will either champion it or criticize it. And this week I have a rant, piggybacking on the data security and data privacy issues that I was just speaking of. The New York Times just reported that Senators Ron Wyden of Oregon and Edward J. Markey of Massachusetts sent a letter to the FTC on July 26th, and in this letter, the two senators called out General Motors, Hyundai, and Honda for collecting driving data from customer vehicles. When we’re talking about driving data, that’s things like how fast the driver accelerated, how hard they braked, how often they went over the speed limit, whether they were braking hard a lot, all of that kind of stuff. Now, they’re saying that this data was sold to insurance companies so they could better gauge driver risk. This is really interesting because at the heart of it, the big issue, according to the senators, is that the companies were deceptive in how they got people to opt into this. And just as a little bit of background, we all have auto insurance, and many insurance companies have similar programs. One of them, Allstate, I believe, markets it by saying that it will collect data from you, the driver, which gives the company a better picture of your risk. What they don’t always say is that, obviously, if you engage in risky behavior, your rates are probably going to go up because you were a higher risk. What the letter says, though, is that these companies collected data from vehicles with an internet connection, and while GM and Honda did give drivers a choice to opt in, the way they did so was deceptive.

Now, I think when we’re talking about this, there are really two issues here. One: were these auto companies clear about the fact that the data would be collected? Two: were they clear about what this really meant in terms of the potential for that data to be sold to another party, essentially insurance companies, and what the impact, or potential impact, would be from those insurance companies then using that data? Regardless of whether or not you agree with that type of data being used to assess risk, I tend to think, “Hey, it’s kind of fair game,” in the sense of, what are insurance companies, really? They are basically risk managers. That’s all they are, and obviously the only way they’re really successful in the market is if they can accurately assess risk and then price their services based on that. I think the bigger issue here is really, again, disclosure issues and deceptive language when it comes to the collection, use, and potential sale of the data.

As we become a data-driven world, it is really important that organizations are transparent about this process. You’re going to get some people who are perfectly fine with this. I actually just spoke with my mother about this, my nearly 80-year-old mother, who participates in one of these programs, and she says, “I’m fine with it because I don’t drive very much and I don’t want to pay more, and I know that I’m not going to be driving aggressively or crazy or anything like that. So I’m happy with that.” So she’s okay with it, but the issue, again, is letting people know, being transparent about it, because the more data that’s being collected, the more is going into these algorithms, and it can have an adverse effect on people without them knowing. And the fact that these organizations are not being clear or transparent about it can have a major impact on consumer trust. That’s why you get people who are running around worried about their data, how it’s being used, whether it’s being resold, and so forth. So, I do think this is a negative signal to the market in terms of data, because if their intentions were completely altruistic, I don’t think they would have used what the senators call deceptive language.

Now, the other thing that was very interesting in the piece I read about this case is that these automakers actually didn’t make very much money from selling this data. According to the letter, Verisk paid Honda $25,920 over four years for information on about 97,000 cars. If you do the math, that’s about 26 cents per car. Hyundai was paid just about $1 million, or 61 cents per car, over six years. To me, that seems like a heck of a lot of potential negative publicity for not a heck of a lot of money. But again, it’s possible that they were looking at this as an incremental revenue stream, or they were looking to expand it at some point; it just seems strange to me that they would engage in this and potentially use deceptive tactics for what is honestly just pocket change. So, very interesting. Again, the big takeaway for me with all this, and why I’m ranting about it, is that we are living in a data-driven world. Everyone knows that companies are looking to collect data. At this point, it just doesn’t make any sense to me to violate the trust you have with consumers by being deceptive or less than transparent about the data that is being collected, what it could be used for, or the fact that it could be sold, because in the end, when we think about consumer choice, that could have a very negative effect if people feel that they were duped. That is my rant for the week.
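Those per-car figures are easy to sanity-check. The quick arithmetic below uses the totals reported from the letter; the implied Hyundai vehicle count is an inference from the reported numbers, not a figure stated in the letter itself.

```python
# Sanity-check the per-car revenue figures reported from the senators' letter.

honda_total = 25_920     # what Verisk reportedly paid Honda over four years
honda_cars = 97_000      # approximate number of cars covered
print(round(honda_total / honda_cars, 2))  # about $0.27, i.e. roughly 26-27 cents per car

# Hyundai reportedly received about $1 million at roughly 61 cents per car,
# which implies data on roughly 1.6 million vehicles (an inferred figure).
hyundai_total = 1_000_000
hyundai_per_car = 0.61
print(round(hyundai_total / hyundai_per_car))  # about 1.64 million cars
```

Either way you slice it, the per-vehicle revenue is well under a dollar, which is the point of the rant: the reputational downside dwarfs the income.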

I want to thank everyone here for joining me on Enterprising Insights. I will be back again next week with another episode focused in on the happenings within the enterprise application market. Be sure to subscribe, rate, and review this podcast on your preferred platform, and I will see you next time.

Author Information

Keith has over 25 years of experience in research, marketing, and consulting-based fields.

He has authored in-depth reports and market forecast studies covering artificial intelligence, biometrics, data analytics, robotics, high performance computing, and quantum computing, with a specific focus on the use of these technologies within large enterprise organizations and SMBs. He has also established strong working relationships with the international technology vendor community and is a frequent speaker at industry conferences and events.

In his career as a financial and technology journalist he has written for national and trade publications, including BusinessWeek, CNBC.com, Investment Dealers’ Digest, The Red Herring, The Communications of the ACM, and Mobile Computing & Communications, among others.

He is a member of the Association of Independent Information Professionals (AIIP).

Keith holds dual Bachelor of Arts degrees in Magazine Journalism and Sociology from Syracuse University.
