.	\" $Id: http-analyze.man,v 1.4 1996/06/27 11:37:59 stefan Exp $
.	\"
.	\" manpage for http-analyze
.	\" Copyright 1996 by Stefan Stapelberg, <stefan@rent-a-guru.de>
.	\"
.	\" $Log: http-analyze.man,v $
.	\" Revision 1.4  1996/06/27  11:37:59  stefan
.	\" Updated.
.	\"
.	\" Revision 1.3  1996/06/13  10:48:12  stefan
.	\" Changed "Daily Summary" into "Hits by day".
.	\"
.	\" Revision 1.2  1996/06/08  16:54:42  stefan
.	\" Final cleanup before publishing.
.	\"
.	\" Revision 1.1  1996/05/29  06:16:30  stefan
.	\" Initial revision
.	\"
.	\"
.if n \{\
.	nr LL 78n
.	nr )O 0
.	po \n()Ou
.\}
.de (P
.sp 1v
.RS
.nf
.ft C
.ps -2p
.vs -2p
..
.de )P
'br
'ps
'vs
.ft 1
.RE
.fi
.P
..
.TH http-analyze 8L "Local Commands"
.SH NAME
.B http-analyze
\- a real fast log analyzer for web servers
.SH SYNOPSIS
.B http-analyze
.RB [\| \-{d|m|h} \|]
.RB [\| \-nrstuvxz \|]
.RB [\| "\-c cfgfile" \|]
.RB [\| "\-i logfile" \|]
.RB [\| "\-o outdir" \|]
.br
.ti +4n
.RB [\| "\-p privdir" \|]
.RB [\| "\-N #{s|u|l}" \|]
.RB [\| "\-H homepage" \|]
.RB [\| "\-S srvname" \|]
.RB [\| "\-T title" \|]
.RB [\| file \|]
.SH DESCRIPTION
.I http-analyze
analyzes logfiles of web servers and creates detailed statistics of the
server's access load in graphical and tabular form.
.B http-analyze
expects logfile entries in
.IR "common logfile format" ,
which is used by web servers such as Netscape's, NCSA's, and CERN's httpd.
If your server uses another format,
.B http-analyze
can't read the logfile.
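.P
For illustration, an entry in common logfile format consists of the
remote host, the remote identity, the authenticated user, a timestamp,
the request line, the response code, and the number of bytes sent.
The host name, URL, and values in this sample are made up:
.(P
host.my.domain - - [27/Jun/1996:11:37:59 +0200] "GET / HTTP/1.0" 200 2326
.)P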
.P
.B http-analyze
has been highly optimized to process large logfiles
at the maximum possible speed.
This is achieved by using a history mechanism to skip logfile entries which
have been processed already in a previous run of the program, and by using
two modes of operation (named after their maximum useful update interval)
with a different detail level in the analysis of the logfile entries:
.TP
\f3daily mode\fP (option \f3\-d\fP):
.B http-analyze
generates a short summary showing the hits per day only.
By using a history to skip entries already processed and by avoiding
detailed analysis of each log entry,
.B http-analyze
requires only a fraction of the time needed for a full report.
.TP
\f3monthly mode\fP (option \f3\-m\fP):
In this mode,
a full report with much more detail is generated.
The history is used to produce a summary for the last 12 months.
.P
If your logfiles are rather large, you can use an update interval in the
range of one to 24 hours to generate short statistics more frequently and
an update interval of one to 30 days to generate a full report.
Since
.B http-analyze
maintains a history of the results from previous runs, you may rotate the
logfile on a daily basis when generating short (daily) reports.
However, to generate a full (monthly) report you have to feed all logfiles
of the appropriate summary period to 
.B http-analyze
at once, because the program needs to do further analysis on all
logfile entries.
After generating a detailed report for a month, you can save the corresponding
logfile(s) on tape and remove them from your system.
.SS "HTML OUTPUT FILES"
.P
In daily mode,
.B http-analyze
writes the short summary into the output file
.B stats.html
and updates the daily values in the history file.
The short summary includes the following information by day (these numbers
are explained later in this section):
.RS
- the total number of hits
.br
- the total number of 304's (Not Modified responses)
.br
- the total number of files transferred
.br
- the total number of unique sites
.br
- the amount of data sent by the server
.RE
.P
In monthly mode,
.B http-analyze
updates the short summary in
.B stats.html
and the monthly values in the history file.
Additionally, it creates the following files:
.TP
.I statsMMYY.html
contains the detailed summary for the period determined by
analyzing the logfile.
.IR MM " and " YY
are replaced by the month and the year respectively.
.TP
.I filesMMYY.html
lists the URLs of all documents sent by your server.
This file is created by default, but you can suppress its creation with
the option
.B \-u
if you want to exclude the URL list from the statistics.
.TP
.I sitesMMYY.html
lists the hostnames of all sites accessing your server if the server
could successfully resolve the IP address.
Again, this file is created by default unless you explicitly
suppress its creation (option
.BR \-s ).
.TP
.IR statsYYYY.html " or " index.html
contains a summary of the last 12 months.
Which name is chosen depends on the date of the last logfile entry processed:
If the last entry indicates that
.B http-analyze
is analyzing the current month's log, the name
.I index.html
is used for easy reference of the statistics pages.
In all other cases the name
.I statsYYYY.html
is used.
This naming convention allows you to create reports for previous summary
periods (e.g. for last year) without affecting the results for the current
period.
.TP
.I gr-icon.gif
a small icon for your link to the statistics page (59x41 pixels).
.P
All files are created in the current directory unless you explicitly specify
an output directory for the HTML files.
Furthermore, the files containing the detailed lists of sites and URLs may be
created in a private directory to protect them by authorization.
.P
The full summary (\f2statsMMYY.html\fP) contains the following information:
.RS
- the total number of hits/304's/files/KB for this month
.br
- the amount of data requested/transferred/saved by cache
.br
- the total number of unique URLs/sites for this month
.br
- the numbers of response codes other than 200 (OK) or 304 (NoMod)
.br
- the maximum/average hits per day/hour
.br
- the total number of hits/files/304's/sites/KB by day
.br
- the top 5 seconds, 5 minutes, and 24 hours of the summary period
.br
- the top 10 sites accessing your server most often
.br
- the top 30 most commonly accessed URLs
.br
- the 10 least frequently accessed URLs
.br
- the hits/304's/KB sent by Country
.RE
.P
.ne 10v
The following list describes the entries in the summary report
which are not self-explanatory:
.TP 10
.I Hits
(color key: green) The total number of hits processed by the
server, including requests which generated an invalid response.
.TP 10
.I Files
(color key: blue) The total number of files sent by the server
(\f2OK\fP responses).
Here "file" means any kind of file, thus including not only documents,
but also images, CGI scripts, audio and video clips, etc.
.TP 10
.I "304's"
(color key: yellow) A code 304 (\f2Not Modified\fP) response is sent by the
server if a document hasn't been updated since the last time it was requested.
This field therefore contains the total number of requests which didn't
cause the transmission of a file because of various caching mechanisms
used by proxies and browsers.
.TP 10
.I "Other responses"
The total number of all answers from the server which are not
.I OK
(200) or
.I "Not Modified"
(304) responses.
The full summary includes a list of all those other responses.
.TP 10
.I "Unique URLs"
This field contains the total number of unique URLs (not counting erroneous
requests).
.TP 10
.I "Unique sites"
(color key: red) In the
.I "Totals"
section, this is the total number of unique sites per month,
while in the
.I "Hits by day"
section it reflects the number of unique sites per day.
Therefore, the sum of all sites shown in the "Hits by day" section is not
equal to the total number of unique sites.
.TP 10
.I "KBytes requested"
The amount of data requested by the users of your server.
.B http-analyze
computes this number by adding the values of the next two fields (see below).
.TP 10
.I "KBytes transferred"
(color key: orange) The amount of data sent as reported by the server.
.TP 10
.I "KBytes saved by cache"
The amount of data saved by various caching mechanisms.
Its value is computed by multiplying the number of Not Modified
responses per page by the size of the document (if known).
Note: Because
.B http-analyze
can determine the size of a page only if the page has been requested
successfully at least once in the same summary period, the values for
"KB saved by cache" and "KB requested" are just approximations
of the real values.
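As a made-up example, suppose a page of 10 KB is sent in full 5 times
and answered with a Not Modified response 3 times during the summary
period.
Following the rules given above, the three fields would then work out as:
.(P
KBytes transferred    = 5 * 10 KB     = 50 KB
KBytes saved by cache = 3 * 10 KB     = 30 KB
KBytes requested      = 50 KB + 30 KB = 80 KB
.)P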
.SH OPTIONS
.TP
.B \-h
print a short help list explaining the usage of the options.
.TP
.B \-d
.I "(daily mode)"
generate short statistics for the current month only.
If a history file exists, the values for previous days are read from this
file and the corresponding logfile entries are skipped.
If the history file does not exist, the whole logfile will be processed
and a history will be created.
(This option is set by default.)
.TP
.B \-m
.I "(monthly mode)"
generate full statistics for a whole month.
Although the values from the history file are usually used to create
a summary for the last 12 months, the actual logfile entries always
have precedence over any records in the history file.
This means that you should rotate your logfile at least on a monthly basis.
The option
.BR \-m " includes " \-d .
.TP
.B \-n
.I "(no update)"
don't create or update the history file.
Useful if you want to generate statistics for previous summary periods
(before the last month) without overwriting the current state of the history.
.TP
.B \-r
don't create a list of all URLs for hidden items (if any) in the full statistics.
.TP
.B \-s
.I "(no sitelist)"
don't create a list of all sites in the full statistics.
.TP
.B \-t
.I "(no TOP lists)"
don't create the top seconds/minutes/hours lists.
Also suppresses the "Hits by hours" bar chart.
.TP
.B \-u
.I "(no URL list)"
don't create a list of all requested URLs in the full statistics.
.TP
.B \-v
.I "(verbose)"
report progress while processing the logfile.
.TP
.B \-x
Don't combine images into a single item.
Normally,
.B http-analyze
sums up the values of all images (\f2*.gif, *.jpg, *.ief, *.pcd, *.rgb,
*.xbm, *.xpm, *.xwd, *.tif\fP) and hides them under the item "All images"
to avoid getting the top lists filled up with lots of image URLs.
If
.B \-x
is given, images are accounted for as single items.
.TP
.B \-z
don't create graphical representations of the results.
.TP
.BI \-c " cfgfile"
Use
.I cfgfile
as the configuration file.
By using a config file,
.B http-analyze
allows you to define some options and to tailor the basic HTML page
layout somewhat.
See
.SM """CONFIGURATION FILE"""
below for a description of the config file format.
.TP
.BI \-i " logfile"
Use
.I logfile
as the server's logfile.
If `-' is given,
.I stdin
is processed.
See also the
.B HTTPLogFile
entry in the config file.
.TP
.BI \-o " outdir"
This is the name of the directory where the HTML output files
should be created.
If no directory is given, the files are created in the current directory.
See also the
.B HTMLDir
entry in the config file.
.TP
.BI \-p " privdir"
Use this directory for the list of all URLs/sites
.RI ( filesMMYY.html " and " sitesMMYY.html ).
This is useful if you want to grant public access to your web server's
statistics while restricting access to the detailed lists to staff
by means of server authentication.
See also the
.B PrivateDir
entry in the config file.
.TP
.BI \-H " homepage"
Use
.I homepage
as an alternate name for homepages.
If your index files are named
.BR index.html ,
there is no need to define this option.
However, if your server looks for more than one filename (e.g.,
.BR index.html , Welcome.html ,
and
.BR home.html ),
you must define the latter two explicitly.
.B http-analyze
truncates the URLs containing a homepage name so that they merge with `/'
or their "base URL", respectively.
(For example, the "base URL" for
.I /dir/index.html " is " /dir/ .)
You can define up to three alternate names in addition to
.BR index.html .
See also the
.B Homepage
entry in the config file.
.TP
.BI \-N " #{" sul "}"
This option defines the number of entries in the top site (\f3s\fP or \f3S\fP),
top URL (\f3u\fP or \f3U\fP), or last URL (\f3l\fP or \f3L\fP) list.
.I #
is either a positive number or the value 0 to suppress the corresponding list.
Note that the list of least frequently accessed URLs is generated only if the
number of all unique URLs is greater than the sum of the entries in the top
and last URL lists.
See also the entries
.BR TopSites ", " TopURLs ", and " LastURLs
in the config file.
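For example, assuming the syntax described above and the logfile name
used in the
.SM EXAMPLES
section, the following invocation would limit the top site list to 20
entries (the value 20 is arbitrary):
.(P
http-analyze -m -N 20s /usr/ns-home/httpd-80/logs/access
.)P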
.TP
.BI \-S " srvname"
Use
.I srvname
as the name of the server in the title of the HTML files.
If undefined,
.B http-analyze
tries to determine the server name itself.
Note:
.B http-analyze
uses either the
.IR uname (2)
or the
.IR gethostname (2)
function to determine the server name depending on what has been defined
at compilation time.
On most System V implementations,
.I uname
returns the nodename (e.g.,
.IR host ),
while
.I gethostname
often returns the fully qualified domain name (FQDN, e.g.,
.IR host.my.domain ).
See also the
.B ServerName
entry in the config file.
.TP
.BI \-T " title"
Use
.I title
as the document title and header for the HTML files.
.B http-analyze
appends the server name and the current summary period to this string.
If left undefined, a default phrase is used.
See also the
.B DocTitle
entry in the config file.
.SS "CONFIGURATION FILE"
When specified with the option
.BR \-c ", " http-analyze
reads some defaults from the named configuration file.
Parameters defined with options always take precedence over
the definitions in this configuration file.
The configuration file contains one entry per line.
Each entry has a name field and one or two value fields, which must
be separated by one or more tab characters (not blanks!).
All names are case-insensitive.
.TP 14
.B ServerName
The name of your server (same as option
.BR \-S ).
.TP 14
.B HTTPLogFile
The name of the server's logfile.
Note that if you define a default name of the logfile, this file gets
processed if no other file is explicitly defined at the invocation of
.BR http-analyze .
Without this definition,
.B http-analyze
processes
.I stdin
if no file is given.
To process
.I stdin
even if a default name has been defined, use `-' as the filename
for the logfile.
.TP 14
.B DefaultMode
Defines the default operation mode of
.BR http-analyze .
The value field contains either the keyword
.BR daily " or " monthly .
If left undefined, the default is the daily mode (\f3\-d\fP).
.TP 14
.B Homepage
Up to three alternate names for homepages in addition to
.B index.html
(same as option
.BR \-H ).
All URLs containing one of the homepage names will get truncated
so they merge with `/' or the base URL respectively.
.TP 14
.B HTMLDir
The name of the directory where the HTML output files should be created
(same as
.BR \-o ).
If left undefined, files are created in the current directory.
.TP 14
.B PrivateDir
The name of a private directory where the detailed site and URL lists
should be created (same as option
.BR \-p ).
Access to this private directory may be granted to staff only
by using server authentication.
Pathnames not beginning with a `/' are relative to 
.BR HTMLDir .
.TP 14
.BR TopSites ", " TopURLs ", and " LastURLs
The number of entries in the top sites, top URLs, and least frequently
accessed URLs lists (same as option
.BR \-N ).
If set to zero, the corresponding list will be suppressed.
.TP 14
.B DocTitle
The document title and header to use in the HTML output files (same as option
.BR \-T ).
.B http-analyze
appends the server's name and the current summary period to this string.
.TP 14
.B HeadPrefix
The prefix string to output before the document header
(after the HTML <TITLE> tag).
If
.B HeadPrefix
is defined, it must include the HTML <BODY> tag.
If left undefined,
.B HeadPrefix
defaults to:
.(P
HeadPrefix       <BODY BGCOLOR="#D6D6D6"><P><HR SIZE="8">
.)P
.TP 14
.B HeadSuffix
The suffix string to output after the document header (after
.BR DocTitle ).
Useful if you define left- or right-aligned images in
.B HeadPrefix
with the headline floating around.
.TP 14
.B DocTrailer
The trailer string to output at end of page.
Useful to define a link back to your homepage, as in
.(P
DocTrailer       <BR><FONT SIZE="-1"><A HREF="/">Back</A> to my homepage</FONT>
.)P
.TP 14
.BR HideSys " and " HideURL
These two entries let you define names of sites or URLs which
should be hidden under some arbitrary text.
Hidden items are accounted for separately, but in the summary
they appear combined under the description defined here.
Both entries have two value fields: the first field following
the name defines a site or a URL and the second field defines
the text under which this item is to be hidden.
The URL/site may begin or end with a `*' as a wildcard.
However, inside strings, a `*' is taken literally.
If the text an item is hidden under begins with a `[' character,
the item is not shown in the top sites/URLs lists, but it will
be always shown in the detailed sites/URLs lists.
Note that URLs are case-sensitive, while sitenames are not.
Note also that images are hidden automatically unless the option
.B \-x
is specified at invocation of
.BR http-analyze .
See the
.B sample.conf
file for examples on how to use
.BR HideSys " and " HideURL .
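.P
As a sketch only, a small configuration file using some of the entries
described above might look as follows.
All names, paths, and values here are hypothetical, and the fields are
separated by tab characters:
.(P
ServerName	www.myserver.com
HTTPLogFile	/usr/ns-home/httpd-80/logs/access
DefaultMode	monthly
HTMLDir	/usr/htdocs/stats
PrivateDir	private
TopSites	20
HideURL	/cgi-bin/*	CGI scripts
HideSys	*.my.domain	Local hosts
.)P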
.br
.ne 10v
.SH EXAMPLES
First of all, you must know the name of your server's logfile.
If, for example, the name is
.IR /usr/ns-home/httpd-80/logs/access ,
you can create full statistics for the current month with the following command:
.(P
http-analyze -vm -S www.myserver.com /usr/ns-home/httpd-80/logs/access
.)P
This command will create a yearly summary in the file
.I index.html
(or
.I statsYYYY.html
for previous years) and a monthly summary in file
.IR statsMMYY.html ,
where
.I MM
is replaced by the month and
.I YY
is replaced by the year.
If the period determined by analyzing the logfile is the current month,
.B http-analyze
also creates an up-to-date daily summary in the file
.IR stats.html .
All files are created in the current directory.
.P
Assuming that your old logfiles have been saved under the name
.I logYYYY/access.MM
in the server's log directory, use the commands
.(P
cd /usr/ns-home/httpd-80/logs
http-analyze -vmn -o /usr/htdocs/stats log1996/access.01
.)P
to create full statistics for January '96 in the directory
.I /usr/htdocs/stats
preserving the current history (option
.BR \-n ).
Note: Generating statistics for previous summary periods without the
.B \-n
option will overwrite newer values in the history file.
To reconstruct the history, you would have to run
.B http-analyze
for each following month until the very last one (this situation
may be avoided in a following version of the program).
Note also that immediately after generating the statistics for the
last month you should run
.B "http-analyze \-m"
on the current logfile to create an up-to-date index file (\f3index.html\fP).
Remember that this index file is created automatically only when creating a
monthly summary for the \f3current\fP month.
.P
The following command creates statistics for a whole year using
a customized configuration file and reading the log entries from a pipe:
.(P
gzcat log1996/access.??.gz |
http-analyze -vm -c /usr/local/bin/sample.conf -
.)P
.br
.ne 10v
.SS "REGULAR INVOCATION VIA CRON"
To have statistics generated on a regular basis, use the following scheme:
.IP 1)
Optionally install a cron job which calls
.B "http-analyze \-d"
frequently to create a daily summary.
The execution interval may range from once per day up to twice per hour
depending on the size of your logfile and the time needed to analyze it.
On my server, I run the daily statistics once per hour.
.IP 2)
Install a cron job which calls
.B "http-analyze\ \-m"
to create a monthly summary once per week or once per day (again
depending on the size of your logfile). Note that monthly summaries
.I "(statsMMYY.html)"
are created for the first time on the second day of a new month.
On my server, I create a monthly summary two times per day.
.IP 3)
Create a script which rotates the server's logfile, restarts
the http server, and creates the final summary for this period.
Have
.I cron
execute this script at 00:00 on the first day of a new month.
See the script
.B rotate-httpd
for an example on how to do this for several virtual web servers
running on the same machine.
.IP 4)
Because of
.IR cron 's
scheduling overhead and delays in execution of the script
which rotates the logfile, heavily used servers sometimes
write a few entries for the new month into the old logfile.
.B http-analyze
usually ignores this kind of "white noise" at the end of a month.
However, to get correct figures, in this last step you should run
.B "http-analyze \-m"
on the logfile for the current month immediately after
generating the statistics for the previous month.
.P
Note that the cron jobs must run with the uid of the owner of the
directory where the HTML output files are going to be created,
except for the rotate script, which usually must run with the uid
of the server.
You should also take care to avoid running more than one of the
cron jobs related to
.B http-analyze
at the same time.
.P
Here are some sample
.IR crontab (1)
entries for the scheme described above:
.(P
# Generate a full report twice per day at 01:17 and 13:17
17  1,13 * * *  /usr/local/bin/http-analyze -m -c /usr/httpd/analyze.conf
.sp .5v
# Generate a short summary each hour from 02:17 to 23:17, except at 13:17
17  2-12 * * *  /usr/local/bin/http-analyze -d -c /usr/httpd/analyze.conf
17 14-23 * * *  /usr/local/bin/http-analyze -d -c /usr/httpd/analyze.conf
.sp .5v
# Rotate the HTTPD logfiles at the first day, 00:00 of a new month
0 0 1 * *       /usr/local/bin/rotate-httpd
.)P
.	\".br
.	\".ne 10v
.	\".SH DIAGNOSTICS
.	\"The diagnostics fall into the categories informational messages,
.	\"fatal errors, and warnings.
.	\"Only the fatal errors and warnings are listed here:
.	\".SS "FATAL ERROR MESSAGES"
.	\".TP
.	\".I "Can't ..."
.	\"This error ...
.	\".TP
.	\".I "Can't ..."
.	\"This error ...
.	\".SS "WARNING MESSAGES"
.	\".TP
.	\".I "Couldn't ..."
.	\"This error ...
.	\"
.br
.ne 10v
.SH COPYRIGHT
Copyright \(co 1996 by Stefan Stapelberg, RENT-A-GURU\*R
.P
Permission to use, copy, modify, and distribute this software and its
documentation for any purpose and without fee is hereby granted, provided
that the above copyright notice appear in all copies and in all HTML output
files, that both that copyright notice and this permission notice appear in
the supporting documentation, and that the hypertext link to the homepage of
.B http-analyze
which the program produces is left intact.
This software is provided "as is" without express or implied warranty. 
.P
Credit for
.B http-analyze
must be given to RENT-A-GURU\*R in all derived works.
This does not affect your ownership of the derived work itself, and the
intent is to assure proper credit for RENT-A-GURU\*R, not to interfere
with your use of this software. If you have questions, ask.
.P
You may use this software at no cost on any installation,
even at commercial sites.
However,
.B "IT IS STRICTLY FORBIDDEN"
to sell or lease this software in whole or in part or to include it in
whole or in part in a commercial product.
If you plan to run
.B http-analyze
on a commercial installation
.B and
you need support, or if you would like to bundle the program with your
products, you must sign an appropriate license agreement available from
RENT-A-GURU\*R.
Please send an email to <office@rent-a-guru.de>.
.P
.ps -2p
.vs -2p
RENT-A-GURU\*R is a registered trademark of Martin Weitzel, Stefan Stapelberg,
and Walter Mecky.
.ps
.vs
.SH AUTHOR
Stefan Stapelberg, <stefan@rent-a-guru.de>
.SH CREDITS
.P
Thanks to the over 50 beta testers of
.B http-analyze
for their feedback.
.br
Special thanks to <Lars-Owe.Ivarsson@its.uu.se> for his suggestions to
optimize the parser algorithm and the code he provided as an example.
.br
Thanks also to Thomas Boutell (http://www.boutell.com) for his great
GD library for fast GIF creation, without which
.B http-analyze
couldn't produce such fancy graphics in the summary reports
(gd 1.2 is copyright 1994, 1995, Quest Protein Database Center,
Cold Spring Harbor Labs).
.br
.ne 10v
.SH FILES
.P
Note: output files are always created in the directory given with the
.B \-o
option, with the
.B HTMLDir
entry in the config file, or in the current directory (in this order).
See also
.SM "HTML OUTPUT FILES"
above.
.P
.nf
.ie n .ta 18n
.el .ta |1.4i
\f2index.html\fP	summary report for last 12 months
.br
\f2statsYYYY.html\fP	summary report for year \f2YYYY\fP
.br
\f2stats.html\fP	short summary (daily mode)
.br
\f2statsMMYY.html\fP	full summary for \f2MM/YY\fP (monthly mode)
.br
\f2filesMMYY.html\fP	list of all URLs requested in \f2MM/YY\fP
.br
\f2sitesMMYY.html\fP	list of all sites accessing the server in \f2MM/YY\fP
.br
\f2stats.hist\fP	the history file for the last 12 months and last \f2N\fP days
.br
\f2avloadMMYY.gif\fP	the \f2Hits by hours\fP bar chart image (492x190)
.br
\f2statsMMYY.gif\fP	the \f2Hits/Files/Sites/KB\fP by day bar chart image (492x317)
.br
\f2cntryMMYY.gif\fP	the \f2Total transfers by Country\fP pie chart image (492x320)
.br
\f2graphMMYY.gif\fP	the \f2Hits/Files/Sites/KB\fP graph image (490x317)
.br
\f2sq_*.gif\fP	icons for creating bars in the full summary (10x8)
.br
\f2gr-icon.gif\fP	an icon for making links to your statistics page (59x41)
.fi
.SH NOTES
.P
If you are going to analyze different logfiles in one invocation of
.BR http-analyze ,
you must sort them in ascending order by date; otherwise the logfiles
processed after the first logfile will be silently ignored.
.SH "SEE ALSO"
.nf
.ie n .ta 42n
.el .ta |3i
\f23Dstats(8L)\fP	A 3D Access Statistics Generator
.br
\f2http://www.netstore.de/Supply/http-analyze/\fP	The homepage of \f3http-analyze\fP
.fi
.SH BUGS
You tell me.
