There are many parameters to tune in order to ensure you have a well-oiled Cacti installation.
The very first step in tuning any installation is to install the Spine poller in place of the default “cmd.php” poller that ships with Cacti. Spine reduces your overall polling time and releases system resources sooner to service web queries.
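For reference, here is roughly what that looks like on a RHEL-family box. This is a sketch, not gospel – package names and paths vary by distro, and the database credentials below are placeholders you would match to your own Cacti database:

```
# install the Spine poller (EPEL package name on RHEL/CentOS)
yum install cacti-spine

# /etc/spine.conf -- point Spine at the Cacti database
# (placeholder credentials; use your own)
DB_Host      localhost
DB_Database  cacti
DB_User      cactiuser
DB_Pass      cactipass
DB_Port      3306
```

After installing, switch the “Poller Type” from cmd.php to spine in the Cacti web UI (Console → Settings → Poller).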
No matter what – your polling time must NEVER exceed your polling interval. The standard polling interval is 5 minutes. In this case, I’m in the clear, as I’m only using about 50 of the 300 allotted seconds.
[root@cactiserver ~]# cat /var/log/cacti/cacti.log | grep "SYSTEM STATS" | tail -5
11/07/2013 06:45:51 AM – SYSTEM STATS: Time:49.8120 Method:spine Processes:12 Threads:8 Hosts:302 HostsPerProcess:26 DataSources:83118 RRDsProcessed:32880
11/07/2013 06:50:51 AM – SYSTEM STATS: Time:49.7650 Method:spine Processes:12 Threads:8 Hosts:302 HostsPerProcess:26 DataSources:83118 RRDsProcessed:32880
11/07/2013 06:55:47 AM – SYSTEM STATS: Time:46.1115 Method:spine Processes:12 Threads:8 Hosts:302 HostsPerProcess:26 DataSources:83118 RRDsProcessed:32880
11/07/2013 07:00:48 AM – SYSTEM STATS: Time:47.5557 Method:spine Processes:12 Threads:8 Hosts:302 HostsPerProcess:26 DataSources:83118 RRDsProcessed:32779
11/07/2013 07:05:49 AM – SYSTEM STATS: Time:47.2800 Method:spine Processes:12 Threads:8 Hosts:302 HostsPerProcess:26 DataSources:83118 RRDsProcessed:32880
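To keep an eye on that number without eyeballing the log, a small helper can average the last few SYSTEM STATS runtimes. This is just a sketch – it reads log lines on stdin, so point it at whatever log path your install uses:

```shell
# Average the poller runtime over the last 5 "SYSTEM STATS" lines.
# Reads log lines on stdin so it is easy to repoint or test.
avg_poll_time() {
  grep "SYSTEM STATS" | tail -5 |
    sed 's/.*Time:\([0-9.]*\).*/\1/' |
    awk '{ sum += $1; n++ } END { if (n) printf "%.1f\n", sum / n }'
}

# Usage: avg_poll_time < /var/log/cacti/cacti.log
```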
The polling interval is controlled by the cronjob. You shouldn’t really touch this, even if you “know what you’re doing”.
[root@cactiserver ~]# cat /etc/cron.d/cacti
*/5 * * * * apache php /usr/share/cacti/poller.php >/dev/null 2>/var/log/cacti/poller-error.log
Reducing polling time “frees” the Spine threads more quickly to service more polling and more hosts, since each polling task occupies a thread for less time.
Here’s a (very basic) visual representation of how a 4-thread Spine processes each SNMP bulk request for a device. You can also specify the number of threads at a per-device level, but this article focuses only on reducing how long a thread is occupied by a given “partial” or “bulk” request. Needless to say, it’s BAD FORM to give one device 8 threads if your system only has 8 threads total. What happens if that remote device stalls out for whatever reason? You’ve locked up every thread and are burning idle time waiting for each one to hit its configured “timeout”, while real work could have been done polling other servers.
And so, under the premise that you want to reduce your polling time, one of the most “controllable” levers at your command is the “Maximum OID’s Per Get Request”.
During your “Device” configuration, there is a field to fill out indicating “Maximum OID’s Per Get Request”. The field is only used by the Spine poller, but is often overlooked.
The “official” Spine documentation defines this field as
The maximum number of snmp get OID’s to issue per snmp request. Increasing this value speeds poller performance over slow links. The maximum value is 60 OID’s. Please bear in mind, that some type of devices do not accept huge OID numbers and may fail if set above 1. That’s why cacti 0.8.7 allows to define this value at device level
But there’s so much more to this field.
Essentially the goal of proper SNMP Maximum OIDs Per Get Request is to do the following:
- Try to request and retrieve as many OID values as possible inside a single packet
- Reduce the number of back-and-forth SNMP requests and responses to reduce overall polling time of this device
- Choose a size that removes or reduces fragmentation
No two device types (a server SNMPd, an ILOM/ALOM port, a router, an F5 BIGIP, a printer) will ever have the same optimum number of OIDs to get, and therefore, experimentation will be required to determine the optimum size.
There’s the sloppy way, the scientifically calculated way, and the intensely studied way (not covered in this article). Enjoy!
The Sloppy Way
Essentially, the quickest way to check on the optimum size is to test using the snmpbulkwalk command from the CLI. The idea here is to try different values of maximum number of OID’s per get request and hone in on the best case scenario.
Pros – quick, reasonable
Cons – inaccurate for your polling situation (doesn’t target the specific OIDs your script or cacti is after).
“-Cr1” means 1 OID per request (for snmpbulkwalk, -Cr sets the max-repetitions of the bulk request).
[root@cactiserver ~]# date ; snmpbulkwalk -v 2c -Cr1 -c "communitystring" 10.10.10.10 > garbage.txt ; date
Fri Feb 15 08:22:37 EST 2013
Fri Feb 15 08:25:01 EST 2013
= 2 minutes, 24 seconds
[root@cactiserver ~]# date ; snmpbulkwalk -v 2c -Cr10 -c "communitystring" 10.10.10.10 > garbage.txt ; date
Fri Feb 15 08:25:53 EST 2013
Fri Feb 15 08:26:08 EST 2013
= 15 seconds
[root@cactiserver ~]# date ; snmpbulkwalk -v 2c -Cr15 -c "communitystring" 10.10.10.10 > garbage.txt ; date
Fri Feb 15 08:26:21 EST 2013
Fri Feb 15 08:26:32 EST 2013
= 11 seconds
[root@cactiserver ~]# date ; snmpbulkwalk -v 2c -Cr20 -c "communitystring" 10.10.10.10 > garbage.txt ; date
Fri Feb 15 08:27:24 EST 2013
Fri Feb 15 08:27:32 EST 2013
= 8 seconds
[root@cactiserver ~]# date ; snmpbulkwalk -v 2c -Cr25 -c "communitystring" 10.10.10.10 > garbage.txt ; date
Fri Feb 15 08:28:01 EST 2013
Fri Feb 15 08:28:07 EST 2013
= 6 seconds
[root@cactiserver ~]# date ; snmpbulkwalk -v 2c -Cr30 -c "communitystring" 10.10.10.10 > garbage.txt ; date
Fri Feb 15 08:28:26 EST 2013
Fri Feb 15 08:28:31 EST 2013
= 5 seconds
[root@cactiserver ~]# date ; snmpbulkwalk -v 2c -Cr40 -c "communitystring" 10.10.10.10 > garbage.txt ; date
Fri Feb 15 08:28:54 EST 2013
Fri Feb 15 08:28:58 EST 2013
= 4 seconds
Bing bing bing bing!
Notice the diminishing returns… go any higher on the number of OIDs per request, and the time climbs back up. The reason: header overhead leaves only so much room for the SNMP payload, and we have begun to induce fragmentation.
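The manual runs above can be wrapped in a quick sweep. This is a sketch – the host and community string are placeholders, and it times whole walks with one-second resolution, which is plenty for spotting the knee in the curve:

```shell
# Sweep a range of -Cr (max-repetitions) values against one device
# and report how long each full snmpbulkwalk takes.
HOST="${HOST:-10.10.10.10}"               # placeholder device
COMMUNITY="${COMMUNITY:-communitystring}" # placeholder community

sweep() {
  for r in 1 5 10 15 20 25 30 40 60; do
    start=$(date +%s)
    snmpbulkwalk -v 2c -Cr"$r" -c "$COMMUNITY" "$HOST" >/dev/null 2>&1
    end=$(date +%s)
    echo "-Cr$r took $((end - start)) seconds"
  done
}

# Only run automatically when the Net-SNMP tools are present.
command -v snmpbulkwalk >/dev/null 2>&1 && sweep || true
```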
The Calculated Way
The whole purpose of the calculation is to estimate exactly how many OID responses can fit in a single packet. We are unconcerned with the request, as the responses are always larger and are the target of size reduction.
Consider that every time SNMP queries an OID, the response to that query contains the original query OID.
Query = OID
Response = OID+Value
Pros – most accurate, gets you to an ideal number faster
Cons – time consuming, involves math, and only a starting point. Must be done for every device type and then verified.
Step 1, Determine the amount of overhead in an SNMP query
Frame size: a full-size Ethernet frame is 1518 bytes; strip the 14-byte Ethernet header and 4-byte FCS and you’re at the familiar 1500-byte MTU.
After the IP (20 bytes) and UDP (8 bytes) headers remain: 1500 – 28 = 1472 bytes
1472 – (SNMP version code 1 byte, community string notation 2 bytes, community string repeated back VAR bytes… let’s say about 16 bytes in all for a long community string) ≈ 1456 bytes
1456 – SNMP response header (assuming no error, about 10 bytes) ≈ 1446 bytes
There are approximately 1450 bytes available in an SNMP response to accommodate OID identification and the value residing at that OID.
Step 2, Determine how many OID responses can be sent inside one packet
This is the step where you will have to examine the actual length in bytes of the OID.
Here’s some examples.
1) System interface errors
.1.3.6.1.2.1.2.2.1.14 (ifInErrors) – the OID string is about 22 bytes long
2) Juniper SRX Firewall accepts
an enterprise OID under Juniper’s .1.3.6.1.4.1.2636 tree – the OID string is about 30 bytes long
3) F5 BIGIP Virtual Server number of connections.
the OID string is 204 bytes long! Not even kidding!
Most work out to about 25 bytes, but a tcpdump will tell you the exact length if it’s unknown. And so I will continue on the assumption that the average OID length for my purpose is 16 bytes.
1450 bytes / [(average OID length ≈ 16 bytes) + (value in the response ≈ up to an 8-byte 64-bit counter)] ≈ 60.
So 60 is the maximum you are allowed to specify; any more and the response gets cut into multiple packets – a waste of time and resources.
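The arithmetic above condenses into a quick shell calculation. All the byte counts are this article’s rough estimates (real BER encodings vary), and the OID/value sizes are assumptions you would tune per device:

```shell
FRAME=1518          # full-size Ethernet frame
ETH_OVERHEAD=18     # 14-byte Ethernet header + 4-byte FCS
IP_UDP=28           # 20-byte IP header + 8-byte UDP header
SNMP_PREAMBLE=16    # version field + community string, roughly
SNMP_RESP_HDR=10    # response PDU header, assuming no error
OID_LEN=16          # assumed average OID length for this device
VALUE_LEN=8         # up to a 64-bit counter value

PAYLOAD=$((FRAME - ETH_OVERHEAD - IP_UDP - SNMP_PREAMBLE - SNMP_RESP_HDR))
MAX_OIDS=$((PAYLOAD / (OID_LEN + VALUE_LEN)))
echo "~$PAYLOAD bytes of SNMP payload -> about $MAX_OIDS OIDs per response"
```

Swap in your own measured OID length from tcpdump and the ceiling moves accordingly.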
Other Considerations for number of OIDs per Request
Sometimes a host device doesn’t like more than X requests on its plate at one time, regardless of whether or not you’re fragmenting (IBM AMM cards, for example, start to chug above 15). The “sloppy” method of testing will surface this sooner rather than later.
Sometimes the servers that you are polling every 5 minutes are busy doing their own thing every 5 minutes. To make your CLI testing more accurate, try polling at the 5 minute interval, and then try the same poll test 3 minutes later or so, just to see if the remote system is experiencing “chug”.
In our testing methodology, we failed to consider the “gathering time” the remote system needs in order to fulfill a volley of requests. For consecutive OID’s the wait should be negligible, but for “hunt and peck” OID’s the wait could be significant while the remote system seeks answers from all over its MIB tree, or derives different answers on-the-fly for different TYPES of requests. Spine has a mechanism to request in “bulk” where it can, but I’m unaware of the inner workings.
So in the end, you need to experiment, especially when deploying for multiple devices of the same type.
Other information: If you’re looking for information about Cacti Hashes, I wrote an article a while back about this.
There’s also a great website regarding the installation of CACTI on a Raspberry Pi.
I would also recommend keeping an eye on the polling cycle time with this handy little graph.