  Forum Index -> What's new? Announcements!
myff admin

Server monitoring

Just thought I'd share an image from the new monitoring system:

This is a snapshot of the jeltz server. The gray line is a rolling average from 24 hours in the past; here it shows that at one point there was an issue, which in this case was a datacenter network glitch.

But in general we see jeltz as performing well, with a database backup just a fraction of a day old and nothing under notable strain.

There are close to 600 metrics collected every minute and these will be subject to automatic tests that will spark the alert system if anything is not looking right.
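The rolling-average overlay and the automatic tests can be sketched together. This is a minimal illustration, not the actual alerting code: the window size, tolerance, and synthetic data are all assumptions.

```python
from statistics import mean

def rolling_average(samples, window):
    """Smooth a list of per-minute samples with a simple rolling mean."""
    return [mean(samples[max(0, i - window + 1):i + 1]) for i in range(len(samples))]

def flag_anomalies(today, baseline, tolerance=0.5):
    """Flag minutes where today's value deviates from the day-old
    rolling average by more than `tolerance` (as a fraction)."""
    flags = []
    for minute, (now, base) in enumerate(zip(today, baseline)):
        if base and abs(now - base) / base > tolerance:
            flags.append(minute)
    return flags

# Synthetic load metric: flat at ~10, with a spike at minute 7 today.
yesterday = [10.0] * 12
today = [10.0] * 12
today[7] = 30.0

baseline = rolling_average(yesterday, window=5)
print(flag_anomalies(today, baseline))  # [7]
```

The same comparison, done by eye against the gray line on a graph, is what makes the one-off glitches stand out.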

Looks good. It automatically notifies you if there are problems?

Even cooler!
myff admin

Obviously we have some notification systems and always have had.

The new system is not quite online for that yet, but soon will be. It now has over 700 database tables containing minute-by-minute metrics from all the servers; we're already up to 150MB in the database for it!

One of the next stages will be controlling that a bit.

The data is great though. For example, there is a clear 5am incident common to some of the servers, which is there and gone in 10 minutes but needs looking into. Without the graphs it would go unnoticed until...
myff admin

739 tables at current count.... and an emergency yesterday... it may be all behind the scenes stuff but it keeps things running.
myff admin

Thought I'd bump this: there are now over 3,200 tables. Obviously this covers a lot more than the forums, and is indicative as much as anything of the current workload stress.

The good news on that is that our partners have employed 3 more people, including a manager, "Mr T", to help us get on top of things; already we have 139 issues logged in our ticketing system!

One issue I closed this morning was adding monitoring to our DNS system, e.g. making sure that addresses for forums and other things actually end up propagating. This was spurred by some painful issues during the week, where one of the DNS servers was working but not updating.

I'm quite pleased with the method I have used. Basically, every minute I take the current time and encode it into an IP address. Time is a 4-byte value, as is an IPv4 address, so that is easy. That IP is then set for a test domain on the master name server.

The monitoring system also asks each nameserver every minute what the IP is for the test domain, decodes that back to a time, and adds a row to a table for that nameserver. So we have tables that contain the time the test took place and the time decoded from the domain's IP.
If the difference between those times is not within a certain limit, clearly something is up.
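The encode/decode step is simple enough to sketch. This is an illustrative version (the function names and the test timestamp are mine, not the production code): a 32-bit Unix timestamp packs exactly into the four bytes of an IPv4 address.

```python
import socket
import struct

def time_to_ip(ts: int) -> str:
    """Pack a 32-bit Unix timestamp into dotted-quad IPv4 form."""
    return socket.inet_ntoa(struct.pack("!I", ts & 0xFFFFFFFF))

def ip_to_time(ip: str) -> int:
    """Reverse: unpack a dotted-quad IP back into a Unix timestamp."""
    return struct.unpack("!I", socket.inet_aton(ip))[0]

def propagation_lag(check_ts: int, answered_ip: str) -> int:
    """Seconds between the time of the check and the time the
    nameserver's answer decodes to."""
    return check_ts - ip_to_time(answered_ip)

now = 1_700_000_000
ip = time_to_ip(now)        # "101.83.241.0"
assert ip_to_time(ip) == now

# A nameserver still serving the record set 5 minutes ago:
stale_ip = time_to_ip(now - 300)
print(propagation_lag(now, stale_ip))  # 300 seconds behind
```

Each minute the master gets a fresh IP for the test domain, each slave is queried, and the lag is just the difference of two timestamps.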

Here we see NS2 is behaving nicely having got its last update in a little under 6 minutes.

We also graph how fast the response is, and that implies that we could probably do with changing the system a bit on a couple of nameservers. It is notable how much clearer you can see things once you make the effort to monitor and graph.
myff admin

3889 tables and a conundrum.

I'm flipping the system to one that is on some levels more efficient and reliable.

But some of the results now show our servers as having a lot more load. The problem is that what is efficient on one level is momentarily inefficient on another, and the monitor measures its own contribution to server load.

I guess the real answer here is that we monitor in many different ways and the fact that a monitor can make a system look 25% busy rather than 3% means little when you know that it only represents a fraction of a second in every actual minute.
myff admin

Getting close to 6,000 tables now, which really is rather a lot.

It's not some kind of exercise of never mind the quality, feel the width, but an exercise led by a desire to see what is truly happening.

We have scored some major victories in the last few weeks, reducing some database accesses massively and spotting some so-called fail-safe operations that were going badly rogue.
myff admin

Way over the 8000 table mark now.

Our advertising manager actually reckons there's a business startup based on the methods we have developed for monitoring!

I think he's wrong but the tech is really sweet and clever it has to be said  

What I do note is that I had to up the power available to the dedicated monitoring server this morning, as it was dying under the load.

It is probably true to say that we now have as much server power monitoring the system as we had actual server power two years ago  

I just had to think for a second there to remind myself that it is two years and not three. Things have ramped up at a totally insane pace.
myff admin

Topped 10,000 just recently.

That's a lot of things being logged.

In rough terms 2 things every 1/100th of a second are recorded to the monitoring database.
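That rough figure follows from the table count, assuming roughly one row per table per minute (my assumption for the arithmetic, not a stated fact):

```python
tables = 10_000                  # one metric row per table per minute
rows_per_second = tables / 60
print(round(rows_per_second))           # ~167 rows/sec
print(round(rows_per_second / 100, 1))  # ~1.7 rows per 1/100th of a second
```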

I really doubt there is anyone monitoring quite so comprehensively.

It may not be a business as such, but it is developing into a hell of an interesting set of code.

Coding geeks will be familiar with memcache, which we use extensively, but code from the monitor is often better/faster and is beginning to make inroads. That won't be a total takeover, as memcache does some things we are never going to attempt; but equally, since we are not doing those things, we are a better fit for a lot of the things memcache is used for!
myff admin

11,990 items monitored per minute  
myff admin

We have had no reports whatsoever of any issues, but over the last couple of weeks, since we started transferring stuff to the latest server, the monitoring has shown many signs of misbehaviour relative to other servers, and essentially a bandwidth "cap" on the server.
I can't imagine how we would have seen or approached things, had problems been reported, without such extensive monitoring.
Mind you, for all the data, the actual reason for the "cap" has not been found. Even that is interesting, though: a server misbehaving a little when you have 2 or 3 servers is a nightmare, but when you are starting to lose count of server numbers it is simply a problem for analysis.
Ask Mr. Religion

Can you set alarm levels for all these reports? Or correlations between alarms?

Or is this a perl exercise?

myff admin

We have a lot of alarms. I guess I'm getting at least 5 an hour, and really more are needed, whilst at the same time the alarms I'm getting are of variable quality: e.g. quite a few alarms along the lines of "the monitor itself is responding slowly", which of course it is, as so much is being monitored.

Equally, there are some alarms that fire because not enough is happening, again for well-known reasons.

What is coming together slowly is a database describing what everything is, which can be combined with the raw mass of data to be more intelligent about what is reported.

I'd really like to show screenshots of some of the overlay graphs that show anomalies/nasties easily visible to the human eye on a graph, but imperceptible to the user using the system and very hard to spot programmatically.

Any big system will have these issues happening, there is always the weakest link in any system, but it is very rare to be able to monitor so closely that you can visually see issues.
Ask Mr. Religion

Your immense databases sound like prime candidates for scientific visualization tools.


myff admin

Looks good. Graphs are all about being able to visualise things: data from previous days, or from similar systems, overlay each other so as to show anomalies etc.

3D would help a lot, but is probably some way down the line.
myff admin

We already had three virtual servers dedicated to dealing with monitoring, but as we got close to 13,000 items being monitored, the monitoring system was dying under the strain.

MkII now has 5 servers with the monitoring database being distributed over 3 servers. The server power being used to monitor the system easily exceeds the total power of the servers we used to run myff on when we were in the USA  

Beefing up the system was essential as not only was it already suffering, but there is an insatiable demand for more things to be monitored.

Next on the list is monitoring server requests every second so as to more clearly see demand spikes.

Sometimes, for all the monitoring, it is hard to see exactly what goes "wrong" with a server. It will be sitting there working fine as far as the user can see, but when viewed on a graph, comparing like with like on other servers, it will clearly be out of whack; for all our experience we can't see a fault, and a reboot will sort things.
myff admin

The per second monitoring has bumped the table count up to over 14,000.

We don't actually permanently record at that resolution, as that would grow crazy fast; instead, each minute the min, max, average and current value seen over the previous minute is logged.

This makes a big difference, as if you just log one value a minute then that value will be a lot more random.
For example, when deciding whether comparable servers are truly in line, an unsmoothed graph of those servers was very hard to read; now a lot less detective work is needed to see effects, if not always causes.
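The per-minute summary step can be sketched like this. A minimal illustration under my own assumptions (the function name and the synthetic request counts are made up): ~60 per-second readings collapse into the four stored values.

```python
def summarise_minute(per_second_samples):
    """Collapse a minute of per-second readings into the four values
    actually stored: min, max, average, and the last (current) value."""
    return {
        "min": min(per_second_samples),
        "max": max(per_second_samples),
        "avg": sum(per_second_samples) / len(per_second_samples),
        "current": per_second_samples[-1],
    }

# A minute of request counts with a short spike that a single
# once-a-minute sample would most likely miss entirely.
samples = [5] * 57 + [40, 42, 6]
row = summarise_minute(samples)
print(row)  # max of 42 reveals the spike; avg stays near the baseline
```

The max catches a spike that lasted only seconds, while the average shows it barely moved the overall load, which is exactly the distinction a single sample per minute loses.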