I had to do some casual counting of sourcetypes today. In the process I was trying to decide if I needed to dedup before going to stats. It seemed to me a dedup would, in theory, pass less data to stats. Shouldn't the stats command already know that and dedup for me?
``` Search A```
(index=osnix OR index=osnixsec)
(sourcetype=linux_messages_syslog OR sourcetype=linux_secure OR sourcetype=linux:audit)
| fields host, sourcetype
| stats dc(sourcetype) as "Sourcetypes", values(sourcetype) as "List of Sourcetypes" by host
Or should I run:
``` Search B```
(index=osnix OR index=osnixsec)
(sourcetype=linux_messages_syslog OR sourcetype=linux_secure OR sourcetype=linux:audit)
| fields host, sourcetype
| dedup host, sourcetype
| stats dc(sourcetype) as "Sourcetypes", values(sourcetype) as "List of Sourcetypes" by host
Which would be more efficient?
I would expect to get about the same performance from either one, give or take a cached artifact or two. Though, as I was searching, it just didn't feel that way.
I was surprised to find that Search A took 14.138 seconds. It was quick, and the Splunk UI updated as you'd expect. Search B, the one that included the dedup, took 41.883 seconds on the same data set, and Splunk felt "clunky" as it ran.
I went ahead and ran the same comparison on a different data set, with the same results.
It seems stats already has logic built in that makes dedup just add extra work for no gain. I guess after today's little lab I have a lot of dedups to remove from my searches.
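To make the intuition concrete, here is a rough Python sketch of the kind of state a dc()/values()-style aggregation keeps per group. This is not Splunk's actual implementation, just an illustration: because the aggregation collects sourcetypes into a per-host set, duplicates are discarded as a side effect, so a prior dedup doesn't reduce the work stats does, it only adds another pass over the events.

from collections import defaultdict

# Toy model of "stats dc(sourcetype), values(sourcetype) by host".
# Illustrative only, not Splunk internals: one set of sourcetypes per
# host is all the state the aggregation needs, and a set already
# ignores repeated values.

events = [
    {"host": "web01", "sourcetype": "linux_secure"},
    {"host": "web01", "sourcetype": "linux_secure"},          # duplicate event
    {"host": "web01", "sourcetype": "linux_messages_syslog"},
    {"host": "db01",  "sourcetype": "linux:audit"},
]

per_host = defaultdict(set)
for event in events:
    per_host[event["host"]].add(event["sourcetype"])  # duplicates vanish here

for host, sourcetypes in per_host.items():
    # distinct count and list of sourcetypes per host
    print(host, len(sourcetypes), sorted(sourcetypes))

Running the sketch prints web01 with 2 distinct sourcetypes and db01 with 1, whether or not the duplicate event is removed first, which is the same reason the explicit dedup in Search B bought nothing but extra runtime.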