Hello Everyone!
Welcome back to Old Logs New Tricks.
Today we are going to be looking at indexes and storage. I recently had an issue where my warm volume was filling past the maxVolumeDataSizeMB limit, causing Splunk to pause indexing on that host. This was happening sporadically across my indexer cluster.
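As a quick reference, that cap lives in indexes.conf. If you want to confirm what an indexer is actually running with, btool will show every loaded value and the file it came from (the /opt/splunk install path matches my environment; adjust it if yours differs):

# Show every maxVolumeDataSizeMB value Splunk has loaded, and which .conf file set it
/opt/splunk/bin/splunk btool indexes list --debug | grep -i maxVolumeDataSizeMB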
When we ran df on the Linux host, it showed the volume as 100% full (rounded for human-readable output, of course).
[splunkuser@idxsr22 ~]$ df -h
Filesystem                          Size  Used Avail Use% Mounted on
devtmpfs                            252G     0  252G   0% /dev
/dev/mapper/system-root              15G  2.5G   13G  17% /
/dev/sda1                           239M  139M   84M  63% /boot
/dev/mapper/svg01-splunkdata         18T   18T    2G 100% /splunkdata
/dev/mapper/appvg-opt.splunk        350G   38G  312G  13% /opt/splunk
/dev/mapper/system-home             5.0G  997M  4.1G  20% /home
/dev/mapper/appvg-opt.splunk_cold   925T  475T  451T  52% /splunkdata_cold
Our warm data is stored on its own mount, and our cold data is as well.
BUT when we went to the Deployment Monitor, it only showed the primary volume at around 14,000GB+.
The volume's maximum is set at 17,000GB, or about 16.6TB. So where was the extra storage being used?
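One quick way to see the mismatch from the OS side is to compare what the filesystem holds against what each index's warm (db) path is using; these are the paths from my environment, so adjust them for yours:

# Total usage on the warm mount, then per-index warm usage, largest first
du -sh /splunkdata
du -sh /splunkdata/*/db | sort -rh | head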
After some thorough digging, I found buckets within each index whose names end in “duplicate-0”.
For example, in my network_firewall_internal index:
/splunkdata/network_firewall_internal/db/rb_1541330575_1541031508_567_12345678-D4A2-34ED-12AB-2FB085990ARG6-duplicate-0/
That one directory was 5.4GB, and there were many more like it in that index alone. When I totaled them across the entire warm index path, I was stunned:
find /splunkdata/*/db -type d -name '*duplicate-0*' -exec du -ch {} + | grep total
Results (find batched the directories into several du invocations, hence more than one total line):
1.5TB
430GB
1.3TB
Wow, over 3TB marked as “duplicate-0”!
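If you want to see which indexes are carrying the most of this dead weight, a per-bucket breakdown sorted by size is handy (same warm path as above):

# List every duplicate-0 bucket with its size, biggest first
find /splunkdata/*/db -type d -name '*duplicate-0*' -exec du -sh {} + | sort -rh | head -20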
What causes “duplicate-0”?
Here’s what I found out:
Splunk renames buckets this way because of a check it runs as buckets roll from warm to cold: if a bucket with the same name already exists in the cold path, Splunk renames the warm copy with a "duplicate-0" suffix and leaves it in the hot/warm directory. That way the data isn't deleted (just in case), but it unfortunately is no longer tracked or managed by Splunk, since this isn't supposed to happen during normal operations.
That means that if it sees the bucket already exists in cold, it leaves the data in the warm path and moves on. NEVER COMING BACK TO IT.
Splunk no longer even knows it exists. It's not used in searches, it's not counted toward storage totals... nothing.
So take a look at the data to ensure it's indeed duplicated, and then decide whether it's okay for you to remove it or move it out of the Splunk system.
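Here's a rough way to spot-check one index before touching anything: strip the "-duplicate-0" suffix from each bucket's directory name and look for a bucket with that name under the index's cold path. The /splunkdata_cold/<index>/colddb layout below is an assumption based on a typical coldPath setting, so check your own indexes.conf first.

# For each duplicate-0 bucket in this index, confirm a same-named bucket exists in cold
for dup in /splunkdata/network_firewall_internal/db/*duplicate-0*; do
  [ -d "$dup" ] || continue
  bucket=$(basename "$dup")
  ls -d "/splunkdata_cold/network_firewall_internal/colddb/${bucket%-duplicate-0}" >/dev/null 2>&1 \
    || echo "NO COLD COPY FOUND for $bucket"
done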
If it's okay to remove, as it was in my case, then write a command to find and remove it:
find /splunkdata/*/db -type d -name '*duplicate-0*' -exec rm -rf {} +
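If you'd rather not delete right away, do a dry run first, or park the buckets outside Splunk's paths instead of removing them. The /splunkdata_cold/duplicate_quarantine location below is just a hypothetical holding area; any filesystem with enough free space will do.

# Dry run: list the buckets the rm command above would remove
find /splunkdata/*/db -type d -name '*duplicate-0*' -print

# Or move them to a holding area instead of deleting, keeping one folder per index
for dup in /splunkdata/*/db/*duplicate-0*; do
  [ -d "$dup" ] || continue
  idx=$(basename "$(dirname "$(dirname "$dup")")")   # index name, e.g. network_firewall_internal
  mkdir -p "/splunkdata_cold/duplicate_quarantine/$idx"
  mv "$dup" "/splunkdata_cold/duplicate_quarantine/$idx/"
done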
After that I looked back at the OS storage levels with df:
[splunkuser@idxsr22 ~]$ df -h
Filesystem                          Size  Used Avail Use% Mounted on
devtmpfs                            252G     0  252G   0% /dev
/dev/mapper/system-root              15G  2.5G   13G  17% /
/dev/sda1                           239M  139M   84M  63% /boot
/dev/mapper/svg01-splunkdata         18T   15T  3.0T  83% /splunkdata
/dev/mapper/appvg-opt.splunk        350G   38G  312G  13% /opt/splunk
/dev/mapper/system-home             5.0G  997M  4.1G  20% /home
/dev/mapper/appvg-opt.splunk_cold   925T  475T  451T  52% /splunkdata_cold
The important line:
/dev/mapper/svg01-splunkdata 18T 15T 3.0T 83% /splunkdata
It took a while to find and to determine whether it could safely be removed, but we found the cause.
After monitoring for a few days, this seems to have alleviated all of our issues.
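If you want to keep an eye on it going forward, a simple count of duplicate-0 buckets (run by hand or from cron) is enough; anything above zero means the warm-to-cold roll is colliding again:

# Count duplicate-0 buckets across all warm paths; non-zero means it is happening again
find /splunkdata/*/db -type d -name '*duplicate-0*' | wc -l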
-Cheers
Todd