Your business has a website because you have something to sell. That site needs visitors who turn into customers. Otherwise the site isn't worth the effort and expense of maintaining it. So more traffic is good, right?
Not so much. Back in 2014 I had a successful site that was bringing in good money. Unfortunately, the site started to experience page load issues: load times had climbed from approximately 2.5 sec to over 5 sec. If you'd asked me at the time, I wouldn't have said that was a big deal. After all, 5 sec is still pretty fast. But in fact I only noticed the load times had climbed because my conversions had started to drop. When that 5 sec climbed to 7 or 8 sec, my conversions really took a hit.
Obviously, it was time to start digging into why my page load times were so high after more than a year at somewhere under 3 sec.
As I began investigating the causes of the longer page load times the only thing I could really come up with was server load. I was noticing that the server's cores were consistently busier than I'd ever seen them. I also noticed that the server was consistently using almost twice as much RAM as it ever had in the past.
This server was isolated from my coding and website experiments and served this one website almost exclusively. Combined with seeing dozens of Apache tasks running at any given time, it looked like a lot of extra traffic. Yeah, more traffic! Except I was making less money. Why?
The answer turned out to be overly eager bots from a variety of sources.
Bots aren't all bad. Google has a bot that determines what pages are on your site and what those pages are about so that your pages can rank in its Search Engine Results Pages (SERPs). If you deny the Google bot access to your site, then you are essentially invisible to anyone using the Google search engine.
Bing also has a bot. Again, this bot is used to determine what is on your site so your pages can show up in Bing's SERPs.
While both Google and Bing have reasonably well-behaved bots, others do not. Notably, Yahoo's bot (Slurp) was known for being aggressive and ignoring the rules you lay out in robots.txt about where bots are allowed to go. Caveat: Slurp was so bad that I blocked it years ago. I've had it blocked so long I'm not even sure it's still around or how it behaves today.
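For reference, here's a minimal robots.txt of the sort polite crawlers honor. The directives below are illustrative: Disallow is standard, while Crawl-delay is a non-standard extension that some engines (Bing, Yandex) have honored and Google ignores. And, of course, an aggressive bot can simply disregard the file entirely.

```
# robots.txt -- lives at the root of your site
# Block Yahoo's crawler outright
User-agent: Slurp
Disallow: /

# Ask everyone else to pace themselves (non-standard directive)
User-agent: *
Crawl-delay: 10
```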
There are other misbehaving search engine bots. A Russian search engine named Yandex has a very aggressive bot. And Baidu, a Chinese search engine, has the worst-behaved of the search engine bunch. The Yahoo, Yandex, and Baidu bots alone are enough to consume significant server resources and slow down your site. But hey, you might actually get customers from them if your business operates in their markets.
Once you start digging around in your logs, you find that a lot of the entries filling them up come from the MJ12 and Ahrefs bots. Neither of these bots does YOU any good.
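A quick way to do that digging is to count hits per crawler name in your access log. This is a sketch: it builds a tiny sample log in a temp file so it runs anywhere, but in practice you'd point LOG at your real Apache access log (e.g. /var/log/apache2/access.log; the path varies by distro).

```shell
# Sketch: tally requests per crawler in an Apache combined-format log.
# Sample data stands in for your real access log here.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
1.2.3.4 - - [01/Jan/2014:00:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4)"
5.6.7.8 - - [01/Jan/2014:00:00:01 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4)"
9.8.7.6 - - [01/Jan/2014:00:00:02 +0000] "GET /b HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"
EOF

# One line per crawler: name and request count.
for bot in Googlebot bingbot Slurp YandexBot Baiduspider MJ12bot AhrefsBot; do
    printf '%s: %s\n' "$bot" "$(grep -c "$bot" "$LOG")"
done
```

If a name you've never heard of tops the list, look up its user agent string; odds are it's an SEO or scraper bot you can block with no loss.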
The MJ12 bot belongs to Majestic SEO. It crawls your site and tells anyone who pays Majestic all about your site's SEO and how it can be beat. It's an aggressive bot that ties up significant server resources, all in the name of selling your competition the information they need to outrank you on Google. Fantastic, right?
The Ahrefs bot is less aggressive and better regarded than the MJ12 bot. But I still block it. No one but me needs to know which links on my site point to some other site. I certainly don't need my real visitors leaving because the site has been slowed down by their bot. So BAM! Ahrefs is blocked.
Beyond SEO companies, there are reputation management companies out there with their own bots. Again, why are you paying for the resources they need to crawl your site? It is unlikely that anything at all will come of these bots crawling your site, but they too can slow it down, and absolutely no good for you can come of their visits. So block them, too.
Beyond the examples of bots to block that I've provided above (don't block Google or Bing), there are lots of other bots out there. Many are attackers running port scanners and probes looking for security weaknesses. Others are comment-spam bots or email-address scrapers. Clearly you don't want them anywhere near your site, so block as many as you can find.
This post has gotten fairly long at this point so I will address blocking bad bots in detail in another post.
The quick version is to use .htaccess, a firewall, and a CDN. The CDN will block most security threats. The .htaccess method involves adding a bunch of User-Agent strings to your site's .htaccess file (assuming you are using Apache). It's far from perfect, since those agent strings are easily spoofed, but it's easy to employ and will drop your server's load significantly. Finally, a proper firewall will drop requests to ports you don't want exposed and from servers you don't want talking to yours.
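As a taste of the .htaccess method, here's a minimal mod_rewrite sketch that returns 403 Forbidden to a few of the bots named above. The bot list is illustrative, not exhaustive; extend it with whatever agents show up in your own logs.

```apache
# Refuse known bad bots by User-Agent.
# Easily spoofed, but cheap to do and it cuts real load.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (MJ12bot|AhrefsBot|Baiduspider) [NC]
RewriteRule .* - [F,L]
```

The [NC] flag makes the match case-insensitive and [F] sends the 403; well-behaved bots will stop requesting pages once they see it.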
Gabe Spradlin
Gabe has a BS in Mechanical Engineering and an MS in Electrical Engineering with an emphasis in Control Systems. The Master's culminated in a very long and esoteric thesis about automatically identifying anomalies in International Space Station telemetry. Eventually he ended up using controls to point lasers very precisely at targets a long, long way away. Some lasers were used to talk; others to put holes in things. In 2009 it was time to start an aerospace consulting business. By 2010 it was time to shift from aerospace to online entrepreneurship. He has worked in his pajamas .... errrrr .... basement ever since.