Web Server Access Log Parsing Part I - Using Perl's Split Function To Extract Specific Fields In A Record
In the last Perl-Tips post, I discussed the NCSA Extended Log Format for web servers. Please read that post before continuing with this one. The discussion here assumes the Extended Log Format for the web server access log.
Let's review the problem at hand. We have a website that is getting visitors, and we want to do some analysis (web metrics, web analytics) on it: who is visiting, how often, and which pages? To do this, our secondary goal has to be to transfer the information from the website's web server access log into a database. We are a few posts away from this goal. We have devised a temporary XML format, which we'll use later to transfer the access log data into a database.
To create the WSML (Web Server Markup Language) XML output file, we need to parse the access log. To do that, we need to come up with the appropriate Perl regular expressions to properly extract the fields of each entry in the access.log. Regular expressions are a sort of wildcard/pattern rule that we can specify to extract all or some "fields" in a line of data. I can't give you a full discussion of regular expressions here. (This and this are two of the best books available on the topic.) You'll have to understand at least the basics of Perl pattern-matching before continuing. (Try checking the perldoc documentation that should have come with your Perl distribution first.)
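As a small taste of what such a pattern rule looks like, here is a minimal sketch that pulls just two fields out of one record. The sample log line and the function name host_and_timestamp are made up for illustration, and the pattern assumes the host comes first on the line and the timestamp sits inside square brackets, as in the Extended Log Format.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (host, timestamp) for one log record,
# or an empty list if the line does not look like a log record.
sub host_and_timestamp {
    my ($entry) = @_;
    # ^(\S+)        captures the leading run of non-whitespace (the client host)
    # \[([^\]]+)\]  captures everything between the square brackets (the timestamp)
    return $entry =~ /^(\S+) .*?\[([^\]]+)\]/ ? ($1, $2) : ();
}

# Made-up sample record, not taken from a real log.
my $entry = '192.168.1.10 - - [10/Oct/2005:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326';

my ($host, $timestamp) = host_and_timestamp($entry);
print "host=$host timestamp=$timestamp\n" if defined $host;
```

The parentheses mark the parts of the line we want to keep; everything else in the pattern only describes what surrounds them.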
However, before we actually get into true regular expressions, let me show you a way to extract some of the information from the web server access log using the Perl split() function. We cannot accurately extract every field in every record this way, but we can extract some important ones. The rest of this post is in PDF format [176 KB]. (Before you read the PDF file, please read the previous posts. I also assume that you know enough Perl to follow along.) The post following this one will get into using regular expressions to extract all of the fields in each access log record.
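To give a rough idea of the split() approach without reproducing the PDF, here is a minimal sketch. It is not the code from the PDF: the sample record, the function name split_log_line, and the hash keys are all made up for illustration. It assumes the fields are separated by single spaces and that the request line has exactly three parts (method, URL, protocol).

```perl
use strict;
use warnings;

# Hypothetical helper: split one access log record on single spaces and
# keep only the fields whose positions are predictable.
sub split_log_line {
    my ($line) = @_;
    chomp $line;
    my @f = split / /, $line;

    my %rec = (
        host     => $f[0],
        date     => $f[3],
        tz       => $f[4],
        method   => $f[5],
        url      => $f[6],
        protocol => $f[7],
        status   => $f[8],
        bytes    => $f[9],
    );

    # split() leaves brackets and quotes attached to some tokens; strip them.
    $rec{date}     =~ s/^\[//;
    $rec{tz}       =~ s/\]$//;
    $rec{method}   =~ s/^"//;
    $rec{protocol} =~ s/"$//;

    return \%rec;
}

# Made-up sample record, not taken from a real log.
my $sample = '192.168.1.10 - - [10/Oct/2005:13:55:36 -0700] '
           . '"GET /index.html HTTP/1.1" 200 2326 '
           . '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I)"';

my $rec = split_log_line($sample);
print "$_ = $rec->{$_}\n" for sort keys %$rec;
```

Notice what this sketch leaves out. The user agent string contains spaces of its own, so split() breaks it into several pieces, and a request line with an unexpected number of parts would shift every field after it. That is why split() can only reliably give us some of the fields.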
(c) Copyright 2005-present, Raj Kumar Dash, http://perl-tips.blogspot.com