NOTE: If you do not know what a web server is, please see my What's a Web Server? post before continuing.
One of the Internet-related programming problems that I'm discussing in my tech blogs is the question of who is visiting my website, RSS feeds, and blog sites (hereafter referred to simply as "site"). Every site keeps an "access log" that records every web page requested by its visitors. The access log also keeps track of any incidental transactions such as requests for other files: images, documents, scripts. The date, time and time zone of each request are recorded chronologically, along with some other information (discussed below).
The web server access log is invaluable for data mining, geo-plotting, and general demographic analysis of your site's visitors. To get the most benefit out of your access log, it is worthwhile storing the data from this log to a database. My MySQL-Tips blog will present the database aspect of data mining the access log. In this blog, in this post in particular, let's discuss the format of the access log.
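The NCSA developed what is known as the Common Log Format for web servers. This was later modified, by adding the referring page and user agent fields, to become what this post calls the Extended Format (it is also widely known as the Combined Log Format).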
Microsoft chose not to follow the NCSA standards for their Microsoft Web Server. I do have some Perl and PHP code lurking somewhere on one of my old computers that parses Microsoft's web format. I'll post it in PDF format when I find it. But for the time being, I am focusing on the NCSA Extended Format.
Here is an example record from my consulting web site, which records page and file requests in Extended Format:
131.104.175.199 - - [08/Sep/2005:17:16:10 -0400] "GET /blog/closeup-verysm.jpg HTTP/1.1" 200 3282 "http://geoplotting.blogspot.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MathPlayer 2.0)"
Every page, script, image, or document file request is recorded as above. The difference between Extended and Common formats is that the latter is missing the last two fields (referring web page and the browser/agent used). The Extended Format has the following information, in this order, for each file of any sort, requested directly or indirectly, by a visitor to a site:
- client - This is the IP address or hostname of the visitor's computer, network, or Internet Service Provider.
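- idusr, authusr - These two fields identify the visitor: the first is the remote identity reported by the client's identd service, and the second is the user name supplied through HTTP authentication. I'll be honest with you. In ten years of webmastering, I've never used these fields. Most web servers just record the default value of "-" (hyphen) for these two fields.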
- date, time, tz - These are the date, time and timezone of your web server. For example, my consulting web site, http://www.chameleonintegration.com, runs on a server that is in timezone -0400, which is actually one zone east of where I live. So if I try to access my own website, from my point of view, the time of access will be one hour in the future. The date and time (converted to GMT) are very valuable for log analysis.
- method - This is the HTTP method used to request the file. The most common values for this field are GET and POST. Except for specific analyses, this field can be ignored.
- url - This is the URI of the requested file. In the above example, "/blog/closeup-verysm.jpg" translates to http://www.chameleonintegration.com/blog/closeup-verysm.jpg
- prot - This is the hypertext protocol version number given in the visitor's request. In this case, it is HTTP/1.1.
- status - This is the status of the particular request. In this case, 200 represents success for the requested file (an image).
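- bytes - This is the number of bytes sent back to the visitor, not counting the response headers. For a successful request, it is normally the file size of the requested resource (3,282 bytes).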
- ref - This is the page from which the requested resource was either linked or referred. (In the case of this image, the request is implicit and was the result of an <img> tag from the referring page. In the case of a page, script, or document, the request is explicit and is the result of clicking a hyperlink.) Note: in web server lingo, this field is known as the "referer", not the "referrer". The spelling mistake was introduced very early on in the HTTP specification and was never rectified.
- usragent - This is the user agent used to request the resource. A user agent can be a web browser, a search engine bot, or some other piece of software that uses HTTP to request a file from a web server.
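To make the field list above concrete, here is a minimal Python sketch that splits the example record into those fields and converts the timestamp to GMT. This is only an illustration under some assumptions: the regular expression, the `parse_record` function and the `time_gmt` key are my own hypothetical names, not part of any standard, and the pattern simply returns nothing for lines it does not recognize (such as malformed requests).

```python
import re
from datetime import datetime, timezone

# Hypothetical pattern: the group names follow the field list above.
LOG_PATTERN = re.compile(
    r'(?P<client>\S+) (?P<idusr>\S+) (?P<authusr>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<prot>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<ref>[^"]*)" "(?P<usragent>[^"]*)"'
)


def parse_record(line):
    """Return a dict of fields for one Extended Format line, or None."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # line does not look like an Extended Format record
    record = match.groupdict()
    # date, time, tz: parse the server-local timestamp, then convert to GMT.
    local = datetime.strptime(record.pop("timestamp"), "%d/%b/%Y:%H:%M:%S %z")
    record["time_gmt"] = local.astimezone(timezone.utc)
    record["status"] = int(record["status"])
    # Servers may log "-" when no bytes were sent; treat that as zero.
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record


EXAMPLE = (
    '131.104.175.199 - - [08/Sep/2005:17:16:10 -0400] '
    '"GET /blog/closeup-verysm.jpg HTTP/1.1" 200 3282 '
    '"http://geoplotting.blogspot.com/" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MathPlayer 2.0)"'
)

record = parse_record(EXAMPLE)
print(record["client"], record["time_gmt"].isoformat(), record["url"])
```

Converting to GMT at parse time means records from servers in different time zones can be compared directly, which matters once the data lands in a database.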
In the next post, I'll present a quick overview of Perl regular expressions as they relate to parsing web server logs. That post will then be followed by another post discussing a simple server log parser to produce a count of unique visitors in a given time period. In subsequent posts, I'll refine the code to provide other quick analyses. Eventually, we will write a parser that outputs XML-WSML (Web Server Markup Language). This XML output will then be used by a PHP web application that creates a database of all the server requests.
This database is our ultimate goal in this mini-series of posts. The analysis of the actual data will be carried on in my NetMetrics blog. Demographic/ geographic analysis will take place in my GeoPlotting blog. Remember to check out my XML-Tips blog post WSML Web Server Markup Language for some tips on storing access log information in XML format. As I've mentioned before, this distribution of posts across different blogs is done so that readers can choose to read whichever posts they like, as their experience and interest warrants.
(c) Copyright 2005-present, Raj Kumar Dash, http://perl-tips.blogspot.com