Perl Tips and Techniques: September 2005

Wednesday, September 21, 2005

Web Server Access Log Parsing Part I - Using Perl's Split Function To Extract Specific Fields In A Record


In the last Perl-Tips post, I discussed the NCSA Extended Log Format for web servers. Please read that post before continuing with this one. The discussion here assumes the Extended Log Format for the web server access log.

Let's review the problem at hand. We have a website that is getting visitors. We want to do some analysis (web metrics, web analytics) for the website: who is visiting, how often, and which pages? To do this, our secondary goal has to be to transfer the information from the website's web server access log into a database. We are a few posts away from this goal. We have devised a temporary XML format, which we'll use later to transfer the access log data into a database.

To create the WSML (Web Server Markup Language) XML output file, we need to parse the access log. To do that, we need to come up with the appropriate Perl regular expressions to properly extract the fields of each entry in the access.log. Regular expressions are a sort of wildcard/pattern rule that we can specify to extract all or some "fields" in a line of data. I can't give you a full discussion of regular expressions here. (This and this are two of the best books available on the topic.) You'll have to at least understand the basics of Perl pattern-matching before continuing. (Try checking some of the Perl perldoc documentation that should have come with your Perl distribution first.)

However, before we actually get into true regular expressions, let me show you a way to extract some of the information from the web server access log using the Perl split() function. We cannot accurately extract every field in every record this way, because some fields (the user agent, for example) contain embedded spaces, but we can extract some important ones. The rest of this post is in PDF format [176 Kb]. (Before you read the PDF file, please read the previous posts. There is also the assumption that you know enough Perl to follow along.) The post following this one will get into using regular expressions to extract all of the fields in each access log record.
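As a rough illustration of the split() idea, here is a minimal sketch. It is not the code from the PDF. The helper name `split_fields` and the hash keys are assumptions made for this example, and the sample record is the one shown in the Web Server Access-Log File Formats post. It splits on whitespace and keeps only the fields whose positions are predictable.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): extract a few fields
# from one Extended Log Format record using split() only.
sub split_fields {
    my ($line) = @_;
    my @parts = split ' ', $line;    # split on runs of whitespace

    return {
        client => $parts[0],
        date   => substr($parts[3], 1),        # drop the leading "["
        tz     => substr($parts[4], 0, -1),    # drop the trailing "]"
        method => substr($parts[5], 1),        # drop the leading quote
        url    => $parts[6],
        status => $parts[8],
        bytes  => $parts[9],
    };
}

my $line = q{131.104.175.199 - - [08/Sep/2005:17:16:10 -0400] "GET /blog/closeup-verysm.jpg HTTP/1.1" 200 3282 "http://geoplotting.blogspot.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MathPlayer 2.0)"};

my $f = split_fields($line);
print "$f->{client} requested $f->{url} with status $f->{status}\n";
```

The user agent is deliberately left out: it contains spaces, so a whitespace split scatters it across a varying number of list elements.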

(c) Copyright 2005-present, Raj Kumar Dash, http://perl-tips.blogspot.com

Technorati : access log, perl, programming, web analytics, web metrics, web script, web server

12:02 AM

Tuesday, September 13, 2005

Web Server Access-Log File Formats

NOTE: If you do not know what a web server is, please see my What's a Web Server? post before continuing.

One of the Internet-related programming problems that I'm discussing in my tech blogs is the question of who is visiting my website, RSS feeds, and blogsites (hereafter referred to simply as "site"). Every site keeps an "access log" that records every web page visited by all visitors. The access log also keeps track of any incidental transactions such as requests for other files: images, documents, scripts. The date, time and time zone of each request are recorded chronologically, along with some other information (discussed below).

The web server access log is invaluable for data mining, geo-plotting, and general demographic analysis of your site's visitors. To get the most benefit out of your access log, it is worthwhile storing the data from this log to a database. My MySQL-Tips blog will present the database aspect of data mining the access log. In this blog, in this post in particular, let's discuss the format of the access log.

The NCSA developed what is known as the Common Log Format for web servers. This was later modified to become the Extended Format. Microsoft chose not to follow the NCSA standards for their Microsoft Web Server. I do have some Perl and PHP code lurking somewhere on one of my old computers that parses Microsoft's web format. I'll post it in PDF format when I find it. But for the time being, I am focusing on the NCSA Extended Format.

Here is an example record from my consulting web site, which records page and file requests in Extended Format:

131.104.175.199 - - [08/Sep/2005:17:16:10 -0400] "GET /blog/closeup-verysm.jpg HTTP/1.1" 200 3282 "http://geoplotting.blogspot.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MathPlayer 2.0)"

Every page, script, image, or document file request is recorded as above. The difference between Extended and Common formats is that the latter is missing the last two fields (referring web page and the browser/agent used). The Extended Format has the following information, in this order, for each file of any sort, requested directly or indirectly, by a visitor to a site:

client - This is the IP address or hostname of the visitor's computer, network, or Internet Service Provider.
idusr, authusr - These two fields authenticate the visitor. I'll be honest with you. In ten years of webmastering, I've never used these fields. Most web servers just record the default value of "-" (hyphen) for these two fields.
date, time, tz - These are the date, time and timezone of your web server. For example, my consulting web site, http://www.chameleonintegration.com, runs on a server that is in timezone -0400, which is actually one zone east of where I live. So if I try to access my own website, from my point of view, the time of access will be one hour in the future. The date and time (converted to GMT) are very valuable for log analysis.
method - This is the HTTP method used to request the file. The most common values for this field are GET and POST. Except for specific analyses, this field can be ignored.
url - This is the URI of the requested file. In the above example, "/blog/closeup-verysm.jpg" translates to http://www.chameleonintegration.com/blog/closeup-verysm.jpg
prot - This is the HTTP protocol version used in the request. In this case, it is HTTP/1.1.
status - This is the status of the particular request. In this case, 200 represents success for the requested file (an image).
bytes - This is the number of bytes sent in the response, which for a successful request is the file size of the requested resource (3,282 bytes).
ref - This is the page from which the requested resource was either linked or referred to. (In the case of this image, the request is implicit and was the result of an <img> tag from the referring page. In the case of a page, script, or document, the request is explicit and is the result of clicking a hyperlink.) Note: in web server lingo, this field is known as the "referer", not the "referrer". The spelling mistake was introduced very early on, in the HTTP specification itself, was carried into the access-log file format, and was never rectified.
usragent - This is the user agent used to request the resource. A user agent can be a web browser, a search engine bot, or some other piece of software that uses the HTTP protocol to request a file from a web server.
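The field list above maps fairly directly onto a single pattern match. The following is a minimal sketch under stated assumptions, not the parser developed later in this series. The names `parse_record`, `epoch_gmt`, and `%MONTH` are invented for this example, and the pattern assumes well-formed Extended Format records with no embedded double quotes in the referer or user agent. It also shows one way to convert the logged date, time, and tz to GMT using the core Time::Local module.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Month abbreviations as they appear in the log, mapped to 0-based month numbers.
my %MONTH = (Jan => 0, Feb => 1, Mar => 2, Apr => 3, May => 4,  Jun => 5,
             Jul => 6, Aug => 7, Sep => 8, Oct => 9, Nov => 10, Dec => 11);

# Hypothetical helper: return a hash reference keyed by the field names
# used in the list above, or undef if the line does not match.
sub parse_record {
    my ($line) = @_;
    my @names = qw(client idusr authusr date time tz
                   method url prot status bytes ref usragent);
    my @values = $line =~ m{
        ^(\S+)\s+(\S+)\s+(\S+)\s+                        # client idusr authusr
        \[([^:]+):(\d\d:\d\d:\d\d)\s+([+-]\d{4})\]\s+    # date time tz
        "(\S+)\s+(\S+)\s+(\S+)"\s+                       # method url prot
        (\d{3})\s+(\d+|-)\s+                             # status bytes
        "([^"]*)"\s+"([^"]*)"                            # ref usragent
    }x or return;

    my %rec;
    @rec{@names} = @values;
    return \%rec;
}

# Hypothetical helper: convert a parsed record's date, time and tz to epoch
# seconds in GMT. Logged local time = GMT + offset, so GMT = local - offset.
sub epoch_gmt {
    my ($rec) = @_;
    my ($day, $mon, $year) = split m{/}, $rec->{date};
    my ($h, $m, $s)        = split /:/,  $rec->{time};
    my ($sign, $oh, $om)   = $rec->{tz} =~ /^([+-])(\d\d)(\d\d)$/;
    my $offset = ($oh * 3600 + $om * 60) * ($sign eq '-' ? -1 : 1);
    return timegm($s, $m, $h, $day, $MONTH{$mon}, $year) - $offset;
}

my $line = q{131.104.175.199 - - [08/Sep/2005:17:16:10 -0400] "GET /blog/closeup-verysm.jpg HTTP/1.1" 200 3282 "http://geoplotting.blogspot.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MathPlayer 2.0)"};

if (my $rec = parse_record($line)) {
    print "$rec->{client} $rec->{method} $rec->{url} at ",
          scalar gmtime(epoch_gmt($rec)), " GMT\n";
}
```

For the example record, the logged time of 17:16:10 at -0400 corresponds to 21:16:10 GMT on the same day.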

In the next post, I'll present a quick overview of Perl regular expressions as they relate to parsing web server logs. That post will then be followed by another post discussing a simple server log parser to produce a count of unique visitors in a given time period. In subsequent posts, I'll refine the code to provide other quick analyses. Eventually, we will write a parser that outputs XML-WSML (Web Server Markup Language). This XML output will then be used by a PHP web application that creates a database of all the server requests.
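To give a feel for the unique-visitor count mentioned above, here is a minimal sketch, not the parser from the upcoming posts. The function name `count_unique_clients` is an assumption for this example, the "time period" is simplified to a single calendar day written the way the log writes it, and a distinct client address is treated as a distinct visitor, which is only an approximation. Apart from the first, the sample records are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: count distinct client addresses among the log lines
# whose timestamp falls on the given day (for example "08/Sep/2005").
sub count_unique_clients {
    my ($day, @lines) = @_;
    my %seen;
    for my $line (@lines) {
        next unless index($line, "[${day}:") >= 0;   # keep only that day's records
        my ($client) = split ' ', $line;             # first whitespace-delimited field
        $seen{$client}++ if defined $client;
    }
    return scalar keys %seen;
}

# Sample records; the 192.0.2.x addresses are invented for illustration.
my @log = (
    q{131.104.175.199 - - [08/Sep/2005:17:16:10 -0400] "GET /blog/closeup-verysm.jpg HTTP/1.1" 200 3282 "-" "-"},
    q{131.104.175.199 - - [08/Sep/2005:17:16:12 -0400] "GET /index.html HTTP/1.1" 200 5120 "-" "-"},
    q{192.0.2.10 - - [08/Sep/2005:18:02:44 -0400] "GET /index.html HTTP/1.1" 200 5120 "-" "-"},
    q{192.0.2.11 - - [09/Sep/2005:09:30:00 -0400] "GET /index.html HTTP/1.1" 200 5120 "-" "-"},
);

print count_unique_clients('08/Sep/2005', @log), " unique clients on 08/Sep/2005\n";
```

The hash does the real work here: each client address becomes a key, so repeat requests from the same address collapse into one entry.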

This database is our ultimate goal in this mini-series of posts. The analysis of the actual data will be carried on in my NetMetrics blog. Demographic/ geographic analysis will take place in my GeoPlotting blog. Remember to check out my XML-Tips blog post WSML Web Server Markup Language for some tips on storing access log information in XML format. As I've mentioned before, this distribution of posts across different blogs is done so that readers can choose to read whichever posts they like, as their experience and interest warrants.


Technorati : access log, log analysis, log file format, perl, scripting, visitor tracking, web analytics, web metrics, web programming, web server

6:17 PM

Program In Perl or PHP? Perl As A Fast Prototyper, PHP For Web Application Development?

Some of you may have noticed that I mention both Perl and PHP quite often in my tech blogs. You might be wondering why I use two scripting languages instead of focusing on just one. The truth is, I like them each for their different strengths. I've always liked Perl for its incredibly powerful pattern-matching and its ability to create complex yet useful data structures rapidly. What used to take me as much as one thousand lines of C code takes me 15-100 lines in Perl. However, PHP in newer versions has some of the same power as Perl. As well, because you can embed HTML and PHP code chunks together, PHP allows you to rapidly prototype Internet applications. But even when I was co-authoring a book on PHP web development, I found myself using Perl to do rapid pattern matching, as well as writing small client/server applications. It might simply be a matter of longer familiarity with Perl for pattern matching.

That said, most of the Internet-related problem-solving that I'll be doing in my tech blogs will be done in Perl (which, yes, I've mentioned before). However, blog pages aren't the best place to post large or even small quantities of program code. I find myself spending more time making sure I've escaped any special characters in my code that might cause an XML/Atom feed to barf. (Blogger.com's blogspot.com domains provide content syndication in Atom instead of RSS.) The net result is that the several half-written posts that I have for this blog are just that: half-written. So what I'll be doing in future posts is showing only snippets of Perl code (or PHP over on that blog) in actual blog entries. More detailed code will be supplied to you in PDF form that you can download and even use if it helps you. So I am hoping to be posting some code here in the next couple of days, maybe even tomorrow.


Technorati : fast prototyping, internet applications, pattern matching, perl, php, web applications

12:37 AM

Monday, September 05, 2005

What's a Web Server?

NOTE: If you know what a web server is, skip this post.

Every web domain on the Internet is visible because of a brilliant yet relatively simple (in concept) piece of software called a web server. The Apache Server is one of the most commonly used web servers. Because Microsoft decided not to follow the NCSA Common Log Format, I will not be discussing their web servers, at least for the time being.

A web server runs as a daemon, a process that runs continually, "listening" for requests for various web files (HTML, images, CSS, JavaScript, etc.). The web server (aka server) determines if the file request is valid (i.e., the file exists), then "serves" it up to the "client" software (usually a web browser) across the Internet.
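The "is this request valid?" step can be pictured with a few lines of Perl. This is only a toy sketch of the decision described above, not how Apache or any real server is implemented. The function name `serve` and the in-memory `%docroot` table (standing in for files on disk) are assumptions made for this example, and no actual network listening is done.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the files a server could serve: URL path => contents.
my %docroot = (
    '/index.html' => '<html><body>Hello</body></html>',
    '/style.css'  => 'body { margin: 0; }',
);

# Hypothetical helper: given a request line such as "GET /index.html HTTP/1.1",
# return an HTTP status code and the content to send back.
sub serve {
    my ($request_line, $files) = @_;
    my ($method, $url, $prot) = split ' ', $request_line;

    return (400, '') unless defined $prot;           # malformed request line
    return (404, '') unless exists $files->{$url};   # the file does not exist
    return (200, $files->{$url});                    # valid: serve it up
}

for my $request ('GET /index.html HTTP/1.1', 'GET /missing.html HTTP/1.1') {
    my ($status, $body) = serve($request, \%docroot);
    print "$request -> $status (", length($body), " bytes)\n";
}
```

A real daemon would wrap this kind of check in a loop that accepts network connections and reads each request from a socket.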

Every web server on the Internet records each file requested of it, from every visitor, whether it be an HTML page, a web script, an image, or any other format. If you or a friend own a web domain and have access to the web server's access log file (usually called access.log), have a look at it. In the next post, we'll have a closer look at the access log.


11:18 PM

Thursday, September 01, 2005

Managing Non-Standard Modules

Before I get into solving some of the web-related programming problems that I mentioned in the last post, I wanted to talk about the CPAN. The CPAN is an incredible, time-saving source of modules to add to your Perl installation. At the CPAN, you'll find an extremely wide range of plugin modules covering many functionalities (too many to list here). This should be your first source of any code that is not a part of the standard Perl distribution. If you can't find it there, then you'll probably need to write it. (Code that's put there is tested by various people that are part of CPAN. Perl code that you find on other websites is another source, but possibly less reliable. What's more, it is usually not packaged in plug-and-play modules like the code at CPAN.)

Over the past 10 years of using Perl in small web scripts (CGI) and large programs (5000 lines or more), I've located software at the CPAN countless times. Because this repository contains code contributed by other Perl developers, there are occasionally modules with essentially duplicate functionality. Over time, duplication is somewhat weeded out. Some of the more popular modules over the years have ended up becoming part of the standard Perl distribution (the how and why is beyond the scope of this blog entry).

Unfortunately, the drawback of the CPAN is that the modules there ARE non-standard. Many companies that use Perl code have a policy of not allowing the installation of non-standard Perl code, especially if they have many thousands of computers to upgrade. Case in point, I contracted for IBM Canada a few years back. Part of my job was to write code to parse some "help" files that had markup that resembled HTML or XML, but wasn't really either. I tracked down a couple of suitable markup-parsing modules at CPAN, tested them, then checked with the support team. Sorry, they said, they couldn't allow these modules because of the headache of installing them on thousands of personal computers at many locations in Canada and the US and even in Europe. The code I was writing would get batch installed in the appropriate location, but non-standard Perl modules had to be installed in numerous directories, causing unnecessary work for the install team, as they would separately have to make sure each module didn't break anything else in the humungous software system. Understandable.

At this point, I'd spent a fair bit of time on the problem, and I was a contractor. I couldn't waste too much more time without upsetting the client. I had two choices: (1) write my own parser or (2) port the modules to be part of a set of "local" modules. Ultimately, I ended up doing both. Some of the modules I had to rewrite because the customization I needed was faster to do from scratch. But other modules simply needed a bit of modification.

I first moved each module from the directory it would have resided in, in the Perl distribution, to the same directory that my code was in. I.e., the modules were made local relative to my program. Then I edited each module by changing all path references to the original module location so that the reference was "local". Many times, that's all that was needed. If testing proved otherwise, I sometimes also found that I had to recursively do the same to other modules called by a given localized module. If after similar tweaking, I couldn't get a module to work, I scrapped it and wrote my own code, usually as a module.
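For comparison, here is a different technique for the same goal. It is not what I describe doing above, where the path references inside each module were edited by hand. The core FindBin and lib modules can tell Perl to search a directory relative to the running script before the standard locations, so modules kept next to the program are found without a system-wide install. The directory name `local_modules` is a hypothetical choice for this sketch.

```perl
use strict;
use warnings;
use FindBin qw($Bin);            # $Bin is the directory that holds this script
use lib "$Bin/local_modules";    # hypothetical directory of "local" modules

# From here on, a statement such as "use Some::Parser;" would look for
# local_modules/Some/Parser.pm before the standard Perl library directories.
print "Perl will search here first: $INC[0]\n";
```

Because the search path is computed from the script's own location, the program and its modules can be batch installed together into any directory.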


5:56 PM
