On this episode of the Futurum Tech Webcast – Interview Series, I am joined by Dell Technologies’ Bill Leslie, Director, Technical Marketing, for an inside look at the performance of Dell’s VxRail solution, including use cases and the findings of our recent performance testing on Dell’s VxRail hyperconverged infrastructure (HCI).
Our discussion covers:
- Dell’s relationship with VMware and the state of the HCI business
- An overview of Dell’s VxRail 8.0, vSAN 8.0 and new vSAN Express Storage Architecture
- The Futurum Group’s performance testing results and why those results matter in business terms
- What our performance test findings mean for use cases and the expansion to high-performance workloads
- Take a look at our findings, and download a copy of our Lab Insight Report here: Performance Review of Dell VxRail HCI with Broadcom 100 GbE Networking
Watch the video below, and be sure to subscribe to our YouTube channel, so you never miss an episode.
Or listen to the full audio here:
Disclaimer: The Futurum Tech Webcast is for information and entertainment purposes only. Over the course of this webcast, we may talk about companies that are publicly traded, and we may even reference that fact and their equity share price, but please do not take anything that we say as a recommendation about what you should do with your investment dollars. We are not investment advisors, and we ask that you do not treat us as such.
Transcript:
Russ Fellows: Hello, I’m Russ Fellows with The Futurum Group, and today I’m here with William Leslie, AKA Bill Leslie, and we’re going to talk a little bit about some VxRail performance results. Hello, Bill.
Bill Leslie: Hi, Russ. Thanks for having me.
Russ Fellows: Sure. So, I know you are the product manager among many other things for VxRail. You want to talk a little bit about the portfolio that you manage at Dell?
Bill Leslie: Sure, thank you. So, I’m a director here at Dell leading our VxRail and APEX Cloud Platforms technical marketing teams, so I’m responsible for some of the new announcements this year in the portfolio. I’ve been a part of the VxRail program since its inception in 2016; it was the first jointly engineered solution with VMware, and it continues to be the only jointly engineered solution with VMware in the market.
Russ Fellows: Yeah, and in fact, I want to talk to you a little bit about the HCI business just as an aside, because we’ve been tracking HCI since 2016, and we’ve tracked consumer sentiment and interest in the different platforms pretty consistently since then. I’m not saying this because it’s good for you; I’m bringing it up because it has been true throughout.
Every time we tracked it, VxRail was consistently ranked among the most interesting by people looking at hyper-converged platforms, and satisfaction was always very high as well. So, you’ve done a great job promoting it and people seem to really like it, so that’s great.
Bill Leslie: No, I appreciate that. Customer satisfaction is certainly one of our goals, and we’re nearing 20,000 customers today, with over 200,000, closer to 300,000, nodes shipped over the course of the last nearly eight years. The momentum in the market really continues with it, and we’re glad to get constant feedback from our customers so that we can improve on what we’re doing on their behalf.
Russ Fellows: Right. Another interesting trend that we’ve seen is that initially, when HCI came out, it was something new and people were a little resistant. “Well, can it handle real workloads?” Right? “I better use my traditional three-tier architecture.” But that’s pretty much gone by the wayside from what we’ve seen. Everybody’s really happy with performance, and people throw pretty much any type of application you can think of at it now from what we’re seeing. Is that what you see as well?
Bill Leslie: We certainly saw that very early on, with VxRail customers saying, “Hey, let me put my VDI environment on it, maybe a tier two workload, and really see what this can do. Is it as truthful as what you’re telling me in terms of the performance characterization? Can we really put these very disparate workloads into one cluster and not have contention for resources?”
We’ve also benefited over the course of time from newer technologies, right? When VxRail was first introduced, hybrid configurations were still very prevalent. All-SSD was starting to take over, and that helps tremendously with some of these workloads. Over the course of time, we’ve added all-NVMe configurations as well that offer more than enough performance for a lot of our customers.
Over time, we’ve seen them add primary database workloads, some of their tier one applications; we’ve seen them put things on it that historically one would say, “Wow, you really should have dedicated infrastructure for that.” Really, it’s been a game changer for them, because they’re able to get all the great benefits of the simplicity and automation of VxRail for their tier one infrastructure environments as they’ve added more and more use cases to VxRail.
Russ Fellows: Yeah, that mirrors what we’ve seen. Like I said, we’ve been looking at the market, but also, separately, formerly as Evaluator Group, which was recently acquired by The Futurum Group, we’ve been doing testing on these platforms for quite a while, and we saw some pretty massive performance bumps when you went from the hybrid vSAN configs to all flash. Then in addition to that, going to, like you said, all NVMe, that was another very significant bump in performance. So now, with the all-NVMe configs, performance is very significantly better than it was initially.
Bill Leslie: Yeah, the all NVMe is something that’s been going now for a few years, we’re seeing a lot more traction with that. One of the things that VMware’s recently done inside of vSAN architectures, they have what’s referred to as OSA or the Original Storage Architecture, and now ESA, the Express Storage Architecture.
Those all-NVMe configurations were starting to push that performance envelope on 25 gig networks, right? When vSAN first came out with what is now their OSA architecture, that’s the hybrid disk group setup where you’ve got the cache drives and then the storage capacity drives, one gig networks were still pretty prevalent inside of data centers.
There was a transition happening over to 10 gig, and we thought, “Wow, this is going to be great. We’re never going to be able to saturate what’s going on between the nodes.” But again, as more and more workloads started finding their way in and the use cases expanded and expanded, we did see those scenarios where it started to top out in some of those environments with the all-NVMe.
Russ Fellows: Yeah, you’re exactly right. The interesting thing is I was actually somewhat surprised by these test results, which I rarely am, ’cause I’ve done similar testing to this quite a bit, and I knew that there’d be a performance bump from going from 25 to 100 gig networking, I just didn’t expect it to be this big, so I was really surprised by that.
But with these next gen platforms, the all-NVMe configs, the new higher workload throughputs and the memory bandwidth, everything relies on a balanced system, so it’s really important, like you said, to have networking be up to par with the rest of the system.
Bill Leslie: That’s right, that’s right. So Russ, if you don’t mind, I’d like to explain a little bit on what changed within that transition from vSAN OSA to vSAN ESA, so that we can then dive a little bit into the specific performance testing that we did in those findings.
VMware made some really big enhancements with vSAN ESA. They actually built it on top of the existing OSA framework, and they moved some of the data service constructs higher up and closer to the compute, so they’re now able to do compression right next to the compute before they actually start to write it to disk. They’re able to do things like have a single tier of NVMe storage devices, where they’re doing both the caching as well as the long-term storage on those same devices. So, we’ve got to make sure that we’ve got the right type of write endurance on those drives.
They’ve also significantly improved the RAID algorithms. So we can now do RAID 5 and RAID 6 with actually more performance than we get out of RAID 1 in some scenarios, which, for anybody that’s been around vSAN for a while, you know was maybe something that you didn’t want to consider doing because of the performance hit that you’d get in that OSA architecture.
So, when we saw all of this coming, we said, “Hey, now that these network environments are very different, we should really take a look at how can we isolate certain things in the environment to really see what is the new bottleneck?” Right? Customers always want to know what’s going to be that thing that I’ve got to remove from my environment that’s going to hold back whatever it might be, performance, operational efficiency, whatever it is. So, we wanted to really look at networking inside of vSAN ESA to figure out really what is the top end potential that we could get out of these new configurations.
Russ Fellows: Yeah, so for those who don’t know, we reviewed the Dell testing. So Dell did some pretty extensive testing on a number of different configurations, and like all good test organizations, you tested more than you needed to, you wanted to explore all the possibilities. But it was pretty interesting going through all the different test scenarios, and I could see exactly what you’re talking about.
For the first time, RAID 5 and RAID 6 performance were pretty similar to RAID 1, which was also something new, right? I had done very little testing of ESA; almost all my vSAN testing has been on OSA, and as you mentioned, in the past, if you wanted better performance, you always used RAID 1. There was a little performance overhead with RAID 6 versus RAID 5, but that’s expected, and there are benefits with RAID 6 as well, right? Availability. So those numbers were much closer than I really anticipated, so that was another pleasant surprise, I would say.
Bill Leslie: I think they shocked even us when we saw how similar they were, right? You obviously get that one little extra IO that you’ve got to write to get the parity on RAID 6 to get that resiliency that you’re talking about, but marginal is maybe the way of saying it, because the difference was so close in the results and the findings.
Russ Fellows: Right. So that generates a little bit of extra network traffic as well, because there’s extra parity being written and distributed across the network. So, that really bodes well for moving to faster networking with these Broadcom 100 gig NICs?
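To put rough numbers on the RAID tradeoff being discussed, here is a small back-of-the-envelope sketch, not taken from the Dell paper, comparing the capacity overhead of vSAN’s RAID 1 mirroring with RAID 5 and RAID 6 erasure coding. The stripe widths shown are illustrative assumptions; vSAN ESA can choose different widths depending on cluster size.

```python
# Back-of-the-envelope capacity overhead for vSAN protection schemes (illustrative).
# The stripe widths are assumptions: RAID 5 is shown as 4+1 and RAID 6 as 4+2;
# vSAN ESA can pick different widths depending on cluster size.

schemes = {
    # name: (data_blocks, protection_blocks)
    "RAID 1 (FTT=1)": (1, 1),  # one full mirror copy
    "RAID 5 (4+1)": (4, 1),    # one parity block per four data blocks
    "RAID 6 (4+2)": (4, 2),    # two parity blocks per four data blocks
}

usable_tb = 100  # hypothetical usable capacity requirement

for name, (data, protection) in schemes.items():
    overhead = (data + protection) / data  # raw capacity consumed per usable TB
    print(f"{name}: {overhead:.2f}x raw -> {usable_tb * overhead:.0f} TB raw "
          f"for {usable_tb} TB usable")
```

The same ratios hint at why RAID 6 generates a bit more write and network traffic than RAID 5: there is one additional parity block per stripe to distribute across the cluster.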
Bill Leslie: Yep. Russ, can I explain a little bit about what we did with the test environment, so that the audience understands just how much we isolated things in our test?
Russ Fellows: Yes, definitely. Great.
Bill Leslie: So, what we did was we took the same exact cluster and we put the networking on two 25 gig ports, and then we didn’t change anything except for the networking and put it on two 100 gig ports to see what we could get in terms of the performance, and the findings were rather incredible to us. We pushed nearly double the top end IO on our configuration with 100 gig NICs. I think in the paper we said something in the neighborhood of 78%.
If you really look at it where we recommend to push the workloads, which is leaving some room and buffer in the overhead for things like rebuilds and host rebuild scenarios, usually in that 70% load level, it’s nearly double the performance that you get out of ESA running 100 gig versus running the 25 gig network.
The other thing that we found inside of that testing was that it was very evident that we hit the wall. That theoretical limit was actually the actual limit that we were pushing on the 25 gig networking. So, it was very clear that you’d be leaving performance on the table if you’re using 25 gig networks and trying to push the envelope of consolidating workloads or getting top end performance workloads into the cluster.
Russ Fellows: Right. In fact, we even saw that you maybe even slightly exceeded the theoretical max. It’s like, wait a minute, how’d that work out? Because the general rule of thumb is you give 20% overhead for TCP/IP. I’ve been doing this for a while, I’m an old guy, right? So, since the ’80s we just automatically discount it by 20% and say, “Ah, the TCP/IP overhead’s going to be about that,” right? But we saw that you were pushing 100% and in some cases it seemed like even 102% of theoretical max, so the-
Bill Leslie: There’s a little bit that’s still there in the memory, and the cache that’s even on the CPU these days helps get you past a little bit of those in some of those routes.
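For readers who want the arithmetic behind the “theoretical max” being discussed, here is a rough sketch of the line-rate math. The dual-port assumption matches the configuration described above; the 20% discount is simply the old rule of thumb Russ mentions, and the 32 KB block size is near the sweet spot discussed later in the conversation, so none of these are measured values.

```python
# Rough line-rate math for the network configurations discussed (illustrative).
# Assumes two ports per node; the 20% discount is the old TCP/IP rule of thumb,
# and 32 KB is near the block-size sweet spot discussed later.

def line_rate_gb_per_s(ports: int, gbits_per_port: int) -> float:
    """Aggregate line rate in gigabytes per second."""
    return ports * gbits_per_port / 8  # 8 bits per byte

for gbits in (25, 100):
    raw = line_rate_gb_per_s(2, gbits)
    rule_of_thumb = raw * 0.8               # discount 20% for protocol overhead
    iops_at_32k = raw * 1e9 / (32 * 1024)   # IOPS if every IO were 32 KB
    print(f"2 x {gbits} GbE: ~{raw:.2f} GB/s line rate, "
          f"~{rule_of_thumb:.2f} GB/s after the 20% rule, "
          f"~{iops_at_32k:,.0f} IOPS at 32 KB")
```

Seeing observed throughput land at or slightly above the raw line-rate figure, rather than at the 20%-discounted figure, is what Russ is reacting to above.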
Russ Fellows: Yeah, exactly. But the other thing is you mentioned the 2X higher performance, but another very interesting thing is if you’re looking at it the other way, for OLTP or latency-sensitive applications, the latency was 50% lower.
Bill Leslie: Oh, yeah.
Russ Fellows: So, for some applications that’s actually more important. It’s actually the harder metric to lower, because you can’t put a whole bunch of things together and get lower latency, right? You can put a whole bunch of things together and get higher total throughput or more IOPs. Add a whole bunch of storage devices. Great, we can get a million, five million, whatever you need, but if you need 100 microseconds of latency, some storage systems just can’t do that, right? Sorry, can’t do it no matter what.
Bill Leslie: Well, the latency that we were getting was significantly less. We were able to sustain latencies under a millisecond, kind of hovering more around that half-a-millisecond level over most of the loads that we were pushing in the environments. But to be able to sustain that when you’re consuming 70%, 80%+ of the resources in the environment, that was also a welcome finding in these results, and a huge benefit, like you said, to customers.
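As context for why sustaining roughly half-millisecond latency at 70 to 80% utilization stands out, here is a toy queueing illustration. It uses a textbook M/M/1 model with an assumed service time, not anything measured in the testing.

```python
# Toy queueing illustration (M/M/1), just to show why flat latency at high
# utilization is notable. This is a textbook model with an assumed service time,
# not the behavior measured in the paper.

service_time_us = 200  # hypothetical per-IO service time in microseconds

for utilization in (0.3, 0.5, 0.7, 0.8, 0.9):
    # In an M/M/1 queue, mean response time = service time / (1 - utilization).
    response_us = service_time_us / (1 - utilization)
    print(f"{utilization:.0%} busy -> ~{response_us:.0f} us mean response time")

# Real storage stacks are not M/M/1, but the shape is the point: response time
# normally climbs steeply near saturation, so staying around half a millisecond
# at 70-80% load suggests the network is no longer the queueing point.
```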
One of the things that we have the benefit of here at Dell is we have a tool called Live Optics that we have our sellers engage with. It helps collect the environmental information from customers, and because we have so many good sellers and partners out there, we’ve got millions and millions of these types of metadata findings.
That sweet spot that we find in VMware environments is actually around that 22 to 32K block size, which looks to be either a relational database or an OLTP style database. You can think of things like SAP, Oracle, et cetera, in their design point. So we use those tests as proxies, because rather than testing every single database on the market, it allows us to say, if your database is similar to this type, this is the type of performance you might be able to expect from it.
But that sweet spot and the findings certainly showcased this, and showed a truly tremendous benefit from utilizing the 100 gig networking with Broadcom in the VxRail environments.
Russ Fellows: Right. So just to review for people, you tested three different workload sizes or block sizes. Now, one of those was the mix, the OLTP, or the RDBMS I guess it was called, right? That was a mix with an average block size of 22 kilobytes. But it was interesting, because the tool you use is a tool I use constantly and love, Vdbench, so I knew how to read those config files.
So the mix of reads and writes and the different block sizes in there I thought was excellent, because it almost exactly mirrored an application that I think is probably one of the better all around workloads, which is VMmark. So, the RDBMS test that you created I thought was an excellent proxy for just generalized VM workloads. Then the other ones, the OLTP as you mentioned, mirrors what a lot of databases do, and then the third workload was that large block to really see what you could do for maximum throughput rates.
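To make the mixed block size idea concrete, here is an illustrative sketch of how a blend of block sizes averages out to roughly 22 KB. The sizes, weights, and read ratios below are invented for the example; they are not the actual Vdbench parameter file used in the testing.

```python
# Hypothetical mixed workload in the spirit of the RDBMS-style profile discussed.
# The block sizes, weights, and read ratios are invented for illustration; they
# are NOT the actual Vdbench parameter file used in the testing.

mix = [
    # (block_size_kb, fraction_of_ios, read_fraction)
    (8, 0.35, 0.80),
    (16, 0.30, 0.70),
    (32, 0.25, 0.60),
    (64, 0.10, 0.50),
]

avg_block_kb = sum(size * frac for size, frac, _ in mix)
read_pct = 100 * sum(frac * reads for _, frac, reads in mix)

print(f"Weighted average block size: {avg_block_kb:.1f} KB")  # ~22 KB
print(f"Overall read percentage: {read_pct:.0f}%")
```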
Bill Leslie: That’s right. One of the things inside of vSAN OSA, throughput has always seemed to be the one weak spot in OSA environments. Again, because of the use of a write cache drive and then the capacity drives behind it, you tend to get some bottlenecking when you’re doing large block transfers in the write caching mechanisms. So, we really wanted to also evaluate that with ESA to see what we could get.
Because they’ve changed around the software stack inside of vSAN, again, that being inside the kernel of VMware and ESX, they’re able to drive some efficiencies in the write algorithms and the way that they’re doing the destaging into the capacity layers. So, some great benefits for customers even in large block environments.
Russ Fellows: I think I’ll mention briefly, because I’ve tried this in the past, I’m sure a lot of IT admins have as well, is to multiplex or team or use different ways of aggregating together multiple ethernet connections, right?
Bill Leslie: Right.
Russ Fellows: It seems like a good idea until you actually do it and realize all the trouble it takes just to get it to work at all with LACP links and the multi-switch LACP. Just even getting it to work is a pain. Then the reality is it’s never nearly as good even as it appears to be on paper, so you’re much better and it’s much easier just to get 100 gig networking rather than try to multiplex multiple 25 gig NIC ports. Maybe you’ll get a tiny bit of it, maybe, but in my experience it’s not worth the effort.
Bill Leslie: Yeah, and that was one of the things as we were putting out our paper: our engineering teams do much more than what gets exposed in all of those test environments. Our test suite actually goes from 4K block sizes all the way up to the 512, with various different workload types within that, with the OLTP and the RDBMS, and various others within that mix.
What we really are trying to figure out is how we can help our sellers tune these environments to our customers’ needs, right? How can we get it so that they’re not overbuying? How can we get it so that they’re going to get the performance that they need to have on these? It’s something we build back into the tools that we’re giving our sellers, to make sure that they’re using those in everyday engagements, so that our customers are getting that benefit as well.
Russ Fellows: Right. I would say this isn’t just upselling, right? This isn’t Dell just saying, “Hey, we can upsell you on … get this better thing,” right? I mean, there’s some real benefit here, number one, and the other thing is the cost is incredibly reasonable, right?
If you look at the cost differential going from 25 to 100 gig networking, even considering the switch ports, it’s one of the least expensive upgrades you can make, probably less than going up a couple cores on your processor, and another couple gigs of RAM can cost you more, so this is a very cost-effective upgrade.
Bill Leslie: One of the things that we did in the Dell paper is exactly what you’re talking about: let’s take a look at this and break it down to a common denominator of the per-port costs. On a per-port basis, it was just a little more than twice the cost per port with 100 gig networking. We know that we’re going to have to switch out everything when we go to a 100 gig network, so let’s make sure that we’ve got a true understanding of that.
Well, if you’re leaving half the performance of a node on the table by not switching out the less expensive networking, you’re not actually gaining the benefit of the more costly components in the solution architecture, right? So, that’s an opportunity for those server virtualization teams to talk to their networking friends and maybe come up with a way to help them uplevel some of the networking environment as well, to get the full top-end, peak performance that VxRail and vSAN ESA can deliver.
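To spell out the cost argument in the same per-port terms, here is a hedged sketch. The dollar figures are placeholders rather than Dell list prices, chosen only to reflect the “little more than twice the cost per port” relationship described above.

```python
# Hypothetical per-port cost comparison. The prices are placeholders, NOT Dell
# list prices, chosen only to mirror the "little more than 2x per port" point.

options = {
    # name: (gbits_per_port, hypothetical_cost_per_port_usd)
    "25 GbE": (25, 400),
    "100 GbE": (100, 900),  # a little more than twice the per-port cost
}

for name, (gbits, cost) in options.items():
    print(f"{name}: ${cost}/port -> ${cost / gbits:.2f} per Gb/s")

# Even at ~2.25x the port cost, 100 GbE works out to a little over half the
# cost per delivered Gb/s in this example.
```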
Russ Fellows: Yeah, that’s exactly right. We haven’t even mentioned the hot topic of the day, right? AI.
Bill Leslie: Oh, yeah.
Russ Fellows: Nobody seems to be able to get through a paper or a webinar or anything today without talking about how they’re doing generative AI, so I’ll bring it down a little bit more to reality. Most people probably aren’t going to be running large generative AI training on HCI infrastructure today, but you could, and if you even wanted to contemplate it, obviously you need to have high networking. In fact, some of the configs that people are using in the public clouds now, I see they’re promoting the fact that they have 800 gig networking, right?
Bill Leslie: Wow.
Russ Fellows: So if you’re trying to do anything like that, networking and internode communication is paramount. So I would say if you are even contemplating doing some AI training, 100 gig is probably a good starting point for you.
Bill Leslie: Yeah. Russ, one of the things I found interesting as we were going through things is that FTG has done quite a bit to analyze the different networking demands that you’re seeing across different workload environments as well. It’s not just gen AI that needs these high bandwidth networks; it’s many others, including ones that might shock some of the folks that are watching us here today.
Russ Fellows: Yeah, you’re exactly right, thanks for bringing that up. Like I said, we do a lot of testing, and the vast majority of our testing is on hyperconverged infrastructure with vSAN, and we also do a lot of database testing, SQL Server. One of our tests was SQL Server backup performance, and what we found is that maximizing the backup of just one SQL Server VM pretty much maxed out a 25 gig network, right? That was just one VM. Now, that was backing it up to vSAN as quickly as we could, right?
So imagine you’ve got a cluster, even a small cluster of four or five, six, eight VMs running SQL Server, which is a pretty common thing to do because of licensing, right? You back up just one of them, and you might be impacting the performance of the others. So yeah, you’re exactly right. You don’t have to be running gen AI to max out these networks now.
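Here is a quick sanity check on the backup scenario described above, using illustrative numbers rather than the actual test data: an assumed 2 TB database and the usual 20% protocol-overhead discount on a single 25 GbE link.

```python
# Quick sanity check on the backup scenario (illustrative numbers, not test data).
# Assumes a single 25 GbE link discounted by the 20% rule of thumb, and a
# hypothetical 2 TB database streamed to vSAN as fast as the link allows.

link_gbps = 25
effective_gb_per_s = link_gbps / 8 * 0.8  # ~2.5 GB/s usable
db_size_gb = 2 * 1024                     # hypothetical 2 TB database

minutes = db_size_gb / effective_gb_per_s / 60
print(f"~{effective_gb_per_s:.1f} GB/s effective -> ~{minutes:.0f} minutes to stream 2 TB")

# One well-tuned backup stream can hold the link at that rate the whole time,
# which is why a single VM's backup can crowd out other traffic on 25 GbE.
```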
Bill Leslie: That’s right. I’d be remiss if I didn’t take the opportunity to say that we do partner with NVIDIA and other GPU vendors to make sure that you can dial in the right prescriptive type of offloading that you might need to have for those types of environments, whether it’s gen AI, VDI, or things like AutoCAD and other types of workstation setups. So, lots of choice inside of VxRail for how you can optimize those environments.
Russ Fellows: Right. All right, well, hopefully that gives people a good taste for some of the results that we were able to summarize. Just want to be clear. So Dell did a lot of the work, we reviewed it. We do this testing on an ongoing basis, so we’re very familiar with all the results and we’re able to really dig into all the details.
At first, I was struggling a little to find anything new and unique to talk about, because you guys have done such a thorough job, but what I realized is that people just need a good summarization of the impact that this has on them. This can benefit you almost regardless of the workloads that you’re running.
Then another point is the fact that this was done on previous third gen systems, because the fourth gen systems weren’t available. So with the higher CPU and memory bandwidth of the new fourth gen systems, moving to the Broadcom 100 gig NICs becomes even more necessary.
Bill Leslie: Yeah, that’s right. We’re really excited about what we’re getting out of the Intel 4th Gen Scalable procs and how that’s going to benefit both sets of customers inside of vSAN, those that are using OSA today, as well as those that are doing vSAN ESA in their environments.
Russ Fellows: All right, well, thank you Bill. So, I’d encourage people to take a look at the paper that we have available on The Futurum Group, and then there is also a link to the Dell VxRail performance analysis which has some more technical details as well. So, this is Russ Fellows from The Futurum Group, and this has been another Futurum Group Tech Webcast. Thanks. See you next time.
Author Information
Russ brings over 25 years of diverse experience in the IT industry to his role at The Futurum Group. As a partner at Evaluator Group, he built the highly successful lab practice, including IOmark benchmarking.
Prior to Evaluator Group he worked as a Technology Evangelist and Storage Marketing Manager at Sun Microsystems. He was previously a technologist at Solbourne Computers in their test department and later moved to Fujitsu Computer Products. He started his tenure at Fujitsu as an engineer and later transitioned into IT administration and management.
Russ possesses a unique perspective on the industry through his experience in both product marketing and as an IT consumer.
A Colorado native, Russ holds a Bachelor of Science in Applied Math and Computer Science from University of Colorado, Boulder, as well as a Master of Business Administration in International Business and Information Technology from University of Colorado, Denver.