Sometimes I wonder if anyone actually looks at any of the stuff that I post on this blog. It doesn’t make much of a difference - blogging for its own sake is personally beneficial and fun enough to continue doing it whether anyone reads me or not - but I can’t help being curious. So, in this post, I’ll dig into my blog’s visitor logs (stored in an SQL database) and analyze the traffic that has reached my website so far.
I only started logging around early July of this year (2020), so keep in mind that my logs only go back that far.
First of all, a huge portion of the web traffic to my blog comes from bots with names like DotBot, Googlebot, DuckDuckGo-Favicons-Bot, and PetalBot. Some of the requests, despite not being explicitly labelled as from bots, don’t seem like they came from humans, or at least not from humans who are actually interested in my website. Like this one:
GET /boaform/admin/formLogin?username=admin&psd=admin
This appears to be an attempt to log in to my website... but my blog doesn’t support user logins at all. Someone probably wrote a script to send this request like grapeshot to many different sites across the internet, hoping to exploit some common vulnerability that appears on many websites’ login forms. Weird.
There’s a script on my server that dumps requests like this into an SQL table, identifying the IP address, the date and time of the request, and the raw text of the request. This information is stored in columns labelled ip, timestamp, and request. My script also makes a guess as to whether the request came from a bot, storing this information in a column called isbot. My “bot detection strategy” isn’t very sophisticated - it basically consists of checking whether the request contains the substring “bot”, so it probably underestimates the number of bots visiting my site (since there are surely plenty of bots that don’t announce themselves as bots).
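For the curious, here’s roughly what that setup might look like in SQLite. The table and column names are the real ones (they appear in the queries below), but the column types and the standalone UPDATE are just an illustrative sketch - my actual script flags each request at the moment it’s inserted.
-- Hypothetical schema: only the table and column names are taken from my actual setup.
CREATE TABLE visitors (
    ip        TEXT,    -- visitor's IP address
    timestamp TEXT,    -- date and time of the request
    request   TEXT,    -- raw text of the request
    isbot     INTEGER  -- 1 if the request looks like it came from a bot, else 0
);
-- The "contains the substring bot" heuristic, written as SQL.
-- SQLite's LIKE is case-insensitive for ASCII, so this also catches "Googlebot" and "DotBot".
UPDATE visitors SET isbot = (request LIKE '%bot%');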
We can estimate the proportion of requests to my blog that come from bots using the following SQLite query:
SELECT AVG(isbot) FROM visitors;
and the result:
0.636117523011598
Again, this is probably an underestimate - so at least $63\%$ of the visits to my blog are perpetrated by bots.
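Incidentally, AVG gives a proportion here only because isbot contains nothing but 0s and 1s. Written out explicitly, the same calculation would be
SELECT CAST(SUM(isbot) AS REAL) / COUNT(*) FROM visitors;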
Now we know how much traffic comes from bots, but what if we want to know how many different bots have visited this site? We can estimate this by counting the number of distinct IP addresses of bot visitors (which may not be entirely accurate, because two or more bots may be coming from the same IP address, but it’ll do). This can be accomplished using the query
SELECT COUNT(DISTINCT ip) FROM visitors WHERE isbot=1;
which returns
2337
Compare that to the number of non-bot visitors, which we can find using the query
SELECT COUNT(DISTINCT ip) FROM visitors WHERE isbot=0;
to be approximately
1912
Yikes! I guess I’m much more popular with bots than I am with humans.
Let’s see what kind of bot traffic my blog has gotten over the course of the past week. The following query returns the date, the number of bot requests, and the number of distinct bot IP addresses for each of the past seven days:
SELECT date(timestamp), COUNT(*), COUNT(DISTINCT ip) FROM visitors WHERE isbot=1 GROUP BY date(timestamp) ORDER BY date(timestamp) DESC LIMIT 7;
And the result:
2020-10-07|52|27
2020-10-06|280|80
2020-10-05|246|77
2020-10-04|249|62
2020-10-03|482|67
2020-10-02|258|72
2020-10-01|176|67
Why is today’s bot traffic (first row) so much lower than all of the previous days? Probably because today is still “in progress”, and the rest of the traffic that will come in over the course of the day has yet to be logged.
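If I wanted to avoid that artifact, one option would be to leave the current (still incomplete) day out of the query entirely - something like the following, using SQLite’s date('now') (which is in UTC, so it might not line up exactly with my local day):
SELECT date(timestamp), COUNT(*), COUNT(DISTINCT ip) FROM visitors WHERE isbot=1 AND date(timestamp) < date('now') GROUP BY date(timestamp) ORDER BY date(timestamp) DESC LIMIT 7;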
Now let’s see who is really visiting my website. Of course, I can’t find out exactly who visits my website - the most I can see in my logs is their IP addresses (as well as some device/browser information). And, for the sake of privacy, I won’t even show my visitors’ IP addresses here, although I will divulge some information about where in the world (literally) they come from.
We’ve already seen that the number of distinct non-bot IP addresses is 1912, somewhat less than the number of distinct bot IPs. Now let’s take a closer look at these visitors. The following query yields a list of the top 15 IP addresses of non-bot visitors to my blog, sorted by the number of requests each has made:
SELECT ip, COUNT(*) FROM visitors WHERE isbot=0 GROUP BY ip ORDER BY COUNT(*) DESC LIMIT 15;
and here’s the result, with the IP addresses blocked out. However, I have looked up each of these top IP addresses using an online IP geolocation tool and listed their geographic locations to the right:
XXX.XX.XX.XX|1638 (Johnson City, TN)
XX.XXX.XXX.XXX|568 (Chicago, IL)
XX.XXX.XXX.XX|471 (Albuquerque, NM)
XX.XX.XXX.XXX|319 (Albuquerque, NM)
XX.XXX.XXX.XX|283 (Albuquerque, NM)
XX.XXX.XXX.XX|226 (Albuquerque, NM)
XX.X.XX.XXX|207 (Albuquerque, NM)
XX.XX.XXX.XXX|156 (Johnson City, TN)
XXX.XX.XX.XX|142 (Bronx, NY)
XX.XXX.XXX.XX|138 (Albuquerque, NM)
XX.XX.XX.XXX|135 (Scranton, PA)
XX.XXX.XXX.XXX|135 (Naperville, IL)
XX.XXX.XXX.XXX|127 (Paris, France)
The first of these is probably me, since I used to live in Johnson City and had to visit my own website often in order to test new features and check to make sure my posts were displaying properly. One of the Albuquerque IPs is probably also me, since I’ve moved to ABQ to attend the University of New Mexico. (So it appears that I’m my own number one biggest fan, as well as my number five or six biggest fan.) The rest of the Albuquerque addresses are probably other people affiliated with my university, but as for some of these other places, I have absolutely no idea. I don’t know anybody in Scranton, Naperville, the Bronx, or Paris... but I guess they know me.
There is, of course, the possibility that some of these people are using VPNs, causing the IP registered to be different from their actual IP address and preventing me from knowing where they’re actually located. It’s a lot cooler to think that someone in Paris is looking at my blog, though, so I’ll stick with that thought.
Suppose I want to know not only who has made the most requests to my blog, but who has visited it most often. The following query gives another list of IP addresses, this time sorted by the number of distinct days on which each IP address has made a request to my website:
SELECT ip, COUNT(DISTINCT date(timestamp)) FROM visitors WHERE isbot=0 GROUP BY ip ORDER BY COUNT(DISTINCT date(timestamp)) DESC LIMIT 15;
yielding the following result (again with geographical locations on the right):
XXX.XXX.XXX.XX|95 (Tokyo, Japan)
XX.XX.XX.XX|55 (Bucharest, Romania)
XX.XXX.XXX.XXX|49 (Chicago, IL)
XX.XX.XX.XXX|41 (Scranton, PA)
XXX.XX.XX.XX|39 (Johnson City, TN)
XXX.XX.XXX.XXX|36 (Los Angeles, CA)
XX.XXX.XXX.XX|35 (Ithaca, NY)
XXX.XX.XXX.XX|31 (Los Angeles, CA)
XXX.XX.XXX.XXX|31 (Los Angeles, CA)
XXX.XXX.XX.XXX|29 (Amsterdam, Netherlands)
XXX.XXX.XX.XXX|29 (Amsterdam, Netherlands)
XXX.XXX.XX.XXX|28 (Amsterdam, Netherlands)
XX.XXX.XXX.XXX|21 (Paris, France)
To be honest, this was a big shock to me, and I’d be very surprised if many of these weren’t being passed through VPNs. Another possibility is that some of these requests are actually from bots, but my (super sophisticated) bot filter failed to recognize them as bots.
Although I can’t determine whether any of these requests have come through a VPN, I can get a better idea of whether or not they’re bots by looking at the actual requests they’ve made. To see what a typical bot request looks like, let’s use the following query (the IP address XXX.XX.XXX.XX below is one of the addresses that my filter has labelled as a bot):
SELECT request FROM visitors WHERE ip="XXX.XX.XXX.XX" ORDER BY date(timestamp) DESC LIMIT 10;
/vendor/phpunit/phpunit/src/Util/PHP/eval-stdin.php HTTP/1.1
/vendor/phpunit/phpunit/src/Util/PHP/eval-stdin.php HTTP/1.1
/api/jsonws/invoke HTTP/1.1
/solr/admin/info/system?wt=json HTTP/1.1
/?XDEBUG_SESSION_START=phpstorm HTTP/1.1
/?a=fetch&content=
None of these routes actually exist on my blog, so it’s pretty obvious that this is a bot that is groping around blindly, probably looking for a security vulnerability. Here’s another example from a different bot:
/ HTTP/1.1
/favicon.ico HTTP/1.1
/favicon.ico HTTP/1.1
/ HTTP/1.1
/favicon.ico HTTP/1.1
/favicon.ico HTTP/1.1
/ HTTP/1.1
/favicon.ico HTTP/1.1
/favicon.ico HTTP/1.1
/ HTTP/1.1
Lots of repeated visits to my homepage - also probably not something a human would do. Now let’s look at the most frequent requests made by our friend (supposedly) in Tokyo, using a similar query:
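Something along these lines, anyway (the masked address is the Tokyo entry from the table above):
SELECT request FROM visitors WHERE ip="XXX.XXX.XXX.XX" ORDER BY date(timestamp) DESC LIMIT 10;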
/htmlpage/brainfuck.html HTTP/1.1
/htmlpage/brainfuck.html HTTP/1.1
/htmlpage/brainfuck.html HTTP/1.1
/htmlpage/brainfuck.html HTTP/1.1
/htmlpage/brainfuck.html HTTP/1.1
/htmlpage/brainfuck.html HTTP/1.1
/htmlpage/brainfuck.html HTTP/1.1
/htmlpage/brainfuck.html HTTP/1.1
/htmlpage/brainfuck.html HTTP/1.1
/htmlpage/brainfuck.html HTTP/1.1
These requests go to my Brainfuck interpreter, now located at a different link (I promise I’m not gratuitously dropping f-bombs, the programming language is actually called Brainfuck). Given that this page is an interpreter where you can write Brainfuck code and watch it run using a visual representation, it’s pretty believable that a real person would be using it frequently, perhaps to work on a coding puzzle or project written in Brainfuck.
Now let’s try our friend from Romania:
/ HTTP/1.1
/ HTTP/1.1
/ HTTP/1.1
/ HTTP/1.1
/ HTTP/1.1
/ HTTP/1.1
/ HTTP/1.1
/ HTTP/1.1
/ HTTP/1.1
/ HTTP/1.1
Hmmm, this visitor is just requesting my homepage over and over again. Maybe someone really likes to look at my homepage, but has no desire to click any of the links on it? No, I don’t buy it - this is probably a bot incorrectly classified as a regular visitor.
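In fact, if I wanted to hunt down more visitors like this one, a query along these lines might do it - just a sketch, and it assumes the homepage requests are stored exactly as “/ HTTP/1.1”, the way they appear in the logs above:
SELECT ip, COUNT(*) FROM visitors WHERE isbot=0 GROUP BY ip HAVING COUNT(DISTINCT request)=1 AND MIN(request)="/ HTTP/1.1" ORDER BY COUNT(*) DESC;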
To wrap up this post, let’s find out which pages on my blog are the most popular. We can use the following query to find out which resources on my website have been requested most frequently by non-bots:
SELECT COUNT(*), request FROM visitors WHERE isbot=0 GROUP BY request ORDER BY COUNT(*) DESC LIMIT 10;
which yields
3250|/ HTTP/1.1
1611|/favicon.ico HTTP/1.1
457|/htmlpage/brainfuck.html HTTP/1.1
388|/js/post_script.js HTTP/1.1
126|/posts HTTP/1.1
93|/htmlpage/mathsnack.html HTTP/1.1
91|/post/106 HTTP/1.1
80|/img/2016-12-7-Fig1.png HTTP/1.1
66|/post/169 HTTP/1.1
64|/post/164 HTTP/1.1
As we might expect, the most requested resource is my homepage, followed by the favicon for my website and my Brainfuck interpreter. (Don’t ask about mathsnack.html... that was a failed experiment which no longer exists.) Then there’s my old tilted parabola post from almost four years ago, as well as the image that goes along with it... I’m not sure why this one has been requested so often, considering that, in my opinion, it’s one of my least interesting posts. Perhaps it’s just because it’s one of my oldest.
If we want to ignore resources like images and static html pages and focus instead on actual blog posts, we can use the following modified query, which restricts requests to those containing the substring /post/:
SELECT COUNT(*), request FROM visitors WHERE isbot=0 AND request LIKE "%/post/%" GROUP BY request ORDER BY COUNT(*) DESC LIMIT 10;
which gives us the list
91|/post/106 HTTP/1.1
66|/post/169 HTTP/1.1
64|/post/164 HTTP/1.1
54|/post/146 HTTP/1.1
39|/post/158 HTTP/1.1
31|/post/167 HTTP/1.1
30|/post/159 HTTP/1.1
30|/post/162 HTTP/1.1
25|/post/163 HTTP/1.1
22|/post/114 HTTP/1.1
Respectively, these are the following posts:
Anyways, that concludes this post. Maybe in a couple months or a year or so, once I have more visitor data, I can do something a little bit more mathy with my visitor logs and write a post about it.