danielpwilson

Do you need to dedup when using stats?

Updated: Aug 12, 2021

I had to do some casual counting of sourcetypes today. In the process I was trying to decide if I needed to dedup before going to stats. It seemed to me a dedup would, in theory, pass less data to stats. Shouldn't the stats command already know that and dedup for me?




``` Search A``` 
(index=osnix OR index=osnixsec) 
(sourcetype=linux_messages_syslog OR sourcetype=linux_secure OR sourcetype=linux:audit)
| fields host, sourcetype
| stats dc(sourcetype) as "Sourcetypes",  values(sourcetype) as "List of Sourcetypes" by host


Or should I run:



``` Search B```
(index=osnix OR index=osnixsec) 
(sourcetype=linux_messages_syslog OR sourcetype=linux_secure OR sourcetype=linux:audit)
| fields host, sourcetype
| dedup host, sourcetype
| stats dc(sourcetype) as "Sourcetypes",  values(sourcetype) as "List of Sourcetypes" by host

Which would be more efficient?


I expected about the same performance from both, give or take a cached artifact or two. As I was searching, though, it just didn't feel that way.


I was surprised to find that Search A took 14.138 seconds to run. It was quick, and the Splunk UI updated as you'd expect. Search B, the one that included the "dedup", took 41.883 seconds on the same data set, and Splunk felt "clunky" as it ran.


I went ahead and performed the same operation again on a different data set, with the same results.


It seems stats already has logic built into it that makes dedup just add extra work for no gain. I guess after today's little lab I have a lot of dedups to remove from my searches.
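
To see why the dedup adds nothing to the result, here is a small Python sketch of the idea. It is only a conceptual model, not how Splunk implements stats or dedup: the function names `stats_by_host` and `dedup` and the sample events are made up for illustration. It assumes that dc() and values() only care about the distinct sourcetypes per host, so repeated events cannot change the output. It says nothing about timing.

```python
from collections import defaultdict


def stats_by_host(events):
    """Rough model of: stats dc(sourcetype), values(sourcetype) by host."""
    seen = defaultdict(set)
    for host, sourcetype in events:
        # A set silently ignores repeated sourcetypes for the same host.
        seen[host].add(sourcetype)
    # (distinct count, sorted list of distinct values) per host
    return {host: (len(types), sorted(types)) for host, types in seen.items()}


def dedup(events):
    """Rough model of: dedup host, sourcetype (keep the first event per pair)."""
    kept, seen = [], set()
    for event in events:
        if event not in seen:
            seen.add(event)
            kept.append(event)
    return kept


# Hypothetical sample events as (host, sourcetype) pairs, with duplicates.
events = [
    ("web01", "linux_secure"),
    ("web01", "linux_secure"),
    ("web01", "linux:audit"),
    ("db01", "linux_messages_syslog"),
    ("db01", "linux_messages_syslog"),
]

search_a = stats_by_host(events)          # like Search A
search_b = stats_by_host(dedup(events))   # like Search B

print(search_a == search_b)
```

In this model both paths produce the same per-host counts and lists, so the dedup step is a second pass over the events that changes nothing stats would report.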
