Web Page Management Software

John Hurst

Version 1.6.7

20240430:151827

Abstract

This document defines and describes the suite of programs used to create my web page environment on the range of machines that I use.

Table of Contents

1 Introduction
2 Literate Data
3 Various IP address facilities
3.1 IP Tools
3.2 IP decoding program
4 The checkCountry Module
5 The Main Program index.py
5.1 Define Global Variables and Constants
5.2 Define Various String Patterns
5.3 Determine the Host and Server Environments
5.4 filter out bad ips to which we do not respond
5.5 Handle JPEGs
5.5.1 Get JPG File from Remote Server
5.6 Collect HTTP Request
5.6.1 Handle REDIRECT QUERY STRING
5.7 Get Filename from Redirect
5.8 Check for Abbreviated URL
5.9 Make File and Dir Absolute
5.10 Check for HTML Request
5.11 Get Default XSLT File
5.12 Scan for Locally Defined XSLT File
5.13 Determine XSLT File
5.14 Update Counter
5.15 Process File
5.15.1 Process an XML File
6 The Train Image Viewer viewtrains.py
7 The Rankings Display Program ranktrains.py
8 The Train Ranking Module rank.py
9 The Web Server Module webServer.py
10 File Caching
10.1 The File Cache Module
10.2 Clearing the Cache
11 Country Database
12 The Makefile
13 TODOs
14 Indices
14.1 Files
14.2 Chunks
14.3 Identifiers
15 Document History


1. Introduction

This document describes the files used to manage delivery of my personal web pages, and those that I manage for other organisations. The general form of web page delivery is a) a source file written in XML, b) a translation file written in XSLT, and c) the program described here, a python cgi script that calls the appropriate translator on the source file, and delivers the result. It also handles straight HTML, as well as providing some debug and other maintenance options.

The program is invoked by commands in the .htaccess file associated with each web directory. Different .htaccess files can be used for different directories. If none exist in a given directory, the directory path is searched towards the root until one is found.
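
The upward search described above can be sketched as follows. This illustrates the lookup as this document describes it and is not Apache's own implementation; the function name find_htaccess and the temporary directory layout are hypothetical.

```python
import os
import tempfile

def find_htaccess(start_dir, root):
    """Walk from start_dir towards root; return the first .htaccess found, else None."""
    current = os.path.abspath(start_dir)
    root = os.path.abspath(root)
    while True:
        candidate = os.path.join(current, ".htaccess")
        if os.path.isfile(candidate):
            return candidate
        # stop at the web root, or at the filesystem root as a safety net
        if current == root or os.path.dirname(current) == current:
            return None
        current = os.path.dirname(current)

# Demonstration in a throwaway directory tree (hypothetical layout)
with tempfile.TemporaryDirectory() as top:
    top = os.path.realpath(top)
    deep = os.path.join(top, "a", "b")
    os.makedirs(deep)
    open(os.path.join(top, "a", ".htaccess"), "w").close()
    found = find_htaccess(deep, top)
    found_rel = os.path.relpath(found, top)
    missing = find_htaccess(top, top)
    print(found_rel)
```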

The XSLT files used can be specified either in the .htaccess file (default), or in the source XML file, through an explicit xml-stylesheet command. If a stylesheet XSLT file is specified, it overrides the default .htaccess one.
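
A minimal sketch of this override rule, assuming the source file is scanned line by line up to its DOCTYPE declaration. The two patterns are the ones defined later for index.py; the function name choose_xslt and the sample file names are hypothetical.

```python
import re

# Patterns as defined for index.py in this document (written here as raw strings)
stylesheet = re.compile(r'<\?xml-stylesheet.*href="(.*)"')
doctype = re.compile(r'<!DOCTYPE')

def choose_xslt(xml_text, default_xsl):
    """Return the stylesheet named in the XML source, else the .htaccess default."""
    for line in xml_text.splitlines():
        m = stylesheet.search(line)
        if m:
            return m.group(1)      # explicit xml-stylesheet overrides the default
        if doctype.search(line):
            break                  # no stylesheet instruction before the DOCTYPE
    return default_xsl

with_sheet = ('<?xml version="1.0"?>\n'
              '<?xml-stylesheet type="text/xsl" href="local.xsl"?>\n'
              '<!DOCTYPE page>\n<page/>')
without_sheet = '<?xml version="1.0"?>\n<!DOCTYPE page>\n<page/>'

print(choose_xslt(with_sheet, "default.xsl"))
print(choose_xslt(without_sheet, "default.xsl"))
```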

Permission is given to reuse this document, provided that the source is acknowledged, and that any changes are noted in the documentation.

The document is in the form of a literate program, and generates all files necessary to maintain the working environment, including a Makefile.

As of 20240513:174010, there are two URLs now recognized: server/file and server/~ajh/file. The first is the original suite of web pages, and the second is the suite of private pages, protected by an .htpasswd gateway. All the original pages are accessible on the private URL, but only the public pages are accessible through the public (~ajh) path. At some stage, it is intended that these roles will be reversed, but only after a significant period of warning.

2. Literate Data

<edit warning 2.1> =
#
# DO NOT EDIT this file!
# see $HOME/Computers/Sources/Web/web.xlp instead
# this also gives further explanation of the program logic
#
Chunk referenced in 5.1

This message flags the fact that the source code is a derived document, and should not be directly edited.

3. Various IP address facilities

3.1 IP Tools

"IPtools.py" 3.1 =
import re

def str2IP(s):
    # converts decimal IP form x.x.x.x to binary
    res=re.match(r'(\d+)\.(\d+)\.(\d+)\.(\d+)',s)
    if res:
        b1=int(res.group(1))
        b2=int(res.group(2))
        b3=int(res.group(3))
        b4=int(res.group(4))
        ip=(b1<<24)+(b2<<16)+(b3<<8)+b4
        return ip
    else:
        return None

def IP2str(ip):
    # converts binary IP to decimal form x.x.x.x
    b1=ip % 256
    r=ip//256
    b2=r % 256
    r=r//256
    b3=r % 256
    r=r//256
    b4=r
    return f"{b4:0}.{b3:0}.{b2:0}.{b1:0}"

def ipMask(bits):
    allones=0xffffffff
    mask=allones & (allones << (32-bits))
    return mask

def ip2Net(ip,bits):
    # converts binary ip adr to binary network adr
    mask=ipMask(bits)
    return ip & mask

def thisIPrange(ip,bits):
    base=ip2Net(ip,bits)
    incr=pow(2,32-bits)-1
    return (base,base+incr)

IPtools provides a range of functions to facilitate handling of IP addresses and networks. This module is intended to be imported by programs in the rest of this suite. The functions are:

str2IP(s)
converts a given IP string to a 32 bit integer value
IP2str(ip)
the complement to str2IP: convert an integer IP address to a decimal string representation.
ipMask(bits)
generate an upper integer mask of bits length, useful for extracting a given size network base
ip2Net(ip,bits)
Convert the IP address ip to the base network address of length bits
thisIPrange(ip,bits)
return a tuple that defines the range of (integer) addresses for a given base address and network size
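
As a worked example of the arithmetic behind these functions, the sketch below restates three of them in compact form and cross-checks the results against Python's standard ipaddress module. The lower-case names are hypothetical stand-ins, not the module's own definitions.

```python
import ipaddress

def str2ip(s):
    # dotted decimal string -> 32 bit integer
    b1, b2, b3, b4 = (int(part) for part in s.split("."))
    return (b1 << 24) + (b2 << 16) + (b3 << 8) + b4

def ip_mask(bits):
    # upper mask of the given length, e.g. 24 -> 0xffffff00
    return 0xFFFFFFFF & (0xFFFFFFFF << (32 - bits))

def ip_range(ip, bits):
    # (lowest, highest) integer address of the network containing ip
    base = ip & ip_mask(bits)
    return (base, base + 2 ** (32 - bits) - 1)

ip = str2ip("192.168.1.77")
low, high = ip_range(ip, 24)
net = ipaddress.ip_network("192.168.1.0/24")

print(f"{ip:08x} {low:08x} {high:08x}")   # c0a8014d c0a80100 c0a801ff
```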

3.2 IP decoding program

Converting IP addresses between string forms, hexadecimal values, and network ranges can be challenging. This program reads a line, determines which format it is in, and prints the alternatives in an easy-to-read format, consisting of:

        stringform, hexadecimalform, upperstring, upperhex, networklength
      

where the string form is 'dec.dec.dec.dec'; hexadecimal is the 32 bit hex representation; upper string and upper hex are the maximum addresses for this network size, also printed as network length.

"IPrep.py" 3.2 =
from IPtools import *
import re
import sys

while True:
    l=sys.stdin.readline().strip()
    if not l: break
    OK=True
    other=''
    res=re.match(r'(\d+)\.(\d+)\.(\d+)\.(\d+)$',l)
    if res:
        IP=int(res.group(1))
        IP=256*IP+int(res.group(2))
        IP=256*IP+int(res.group(3))
        IP=256*IP+int(res.group(4))
        print(f"recognized string format '{l}' = {IP:x}")
    else:
        res=re.match(r'(\d+)\.(\d+)\.(\d+)\.(\d+)/(\d+)$',l)
        if res:
            IP=int(res.group(1))
            IP=256*IP+int(res.group(2))
            IP=256*IP+int(res.group(3))
            IP=256*IP+int(res.group(4))
            net=int(res.group(5))
            (IPb,IPt)=thisIPrange(IP,net)
            IPbs=IP2str(IPb)
            IPts=IP2str(IPt)
            other=f"{IPb:x} {IPt:x} {net} ({IPbs}-{IPts})"
            print(f"recognized network format '{l}' = {IP:x}/{net}")
        else:
            res=re.match(r'0x([0-9abcdef]+)(/(\d+))?',l)
            if res:
                IP=int(res.group(1),base=16)
                if res.group(2):
                    net=int(res.group(3))
                    (IPb,IPt)=thisIPrange(IP,net)
                    IPbs=IP2str(IPb)
                    IPts=IP2str(IPt)
                    other=f"{IPb:x} {IPt:x} {net} ({IPbs}-{IPts})"
                print(f"recognized hexadecimal format '{l}' = {IP:x}")
            else:
                OK=False
                print(f"Cannot recognize {l}")
    if OK:
        # avoid shadowing the builtin str
        ipstr=IP2str(IP)
        print(f"{ipstr:16} {IP:x} {other}")

4. The checkCountry Module

checkCountry.py is a module to provide mechanisms to check the origin of an http request. It relies upon a database blackListIPs which stores IPs of known requesters, along with a flag to indicate whether they are white, grey or black listed. Any requester not in the database is regarded as grey listed.

The module provides the class countries; instances of which have the methods load(), to load the database of known IP addresses, and checkIP(ips), to check if the IP string ips is known to the database.

It returns a 4-tuple (known, status, country, annotation), where:

known
is True if the IP is indeed known in the database,
status
has one of the values 'white', 'grey' or 'black',
country
is the country from which the request originated, and
annotation
is what is recorded in the WhoIs address lookup database.

Note that an IP may be known to the database (known is True), but still have status 'grey'. This happens when not enough of the behaviour of that IP has been observed to decide whether it is black or white.

Currently no attempt is made to update the database with unknown grey listings. That has to be done manually through an edit of the blackListIPs file (see later in this literate program).

"checkCountry.py" 4.1 =
from IPtools import str2IP, IP2str, ipMask, ip2Net, thisIPrange
import re
import sys

def ip2Net24(ip):
    return ip2Net(ip,24)
Chunk defined in 4.1,4.2,4.4

Define a few useful functions

"checkCountry.py" 4.2 =
class countries():

    def __init__(self):
        self.dbFileName="/home/ajh/local/blackListIPs"
        self.db=[]

    def IP2str(self,d):
        rtn='' ; sep='.'
        for i in range(4):
            if i==3: sep=''
            rtn+=f"{d[i]:0}{sep}"
        return rtn

    def load(self):
        statusKey={'*':'black',' ':'grey','.':'white'}
        entries=[]
        lno=1
        with open(self.dbFileName,'r') as f:
            for l in f.readlines():
                res=re.match(r'([0-9.]+),([0-9.]+) *(\*|\.)? *([^ ]+)?( +(.*)\n)?$',l)
                if res:
                    # double IP, status, country, annotation
                    bg=str2IP(res.group(1))
                    en=str2IP(res.group(2))
                    status=res.group(3)
                    if not status: status=' '
                    country=res.group(4)
                    annotation=res.group(6)
                else:
                    res=re.match(r'([0-9.]+)(/(\d+))? *(\*|\.)? *([^ ]+)?( +(.*)\n)?$',l)
                    if res:
                        # single IP, optional /<networkspec>, status, country, annotation
                        bg=str2IP(res.group(1))
                        if res.group(3): net=int(res.group(3))
                        else: net=32
                        # the range runs from the network base address to its top address
                        (bg,en)=thisIPrange(bg,net)
                        status=res.group(4)
                        # '.' is white, '*' is black, ' ' is grey
                        if not status: status=' '
                        country=res.group(5)
                        annotation=res.group(7)
                    else:
                        print(f"could not match >{l}<")
                        lno+=1
                        continue # skip lines that cannot be parsed
                try:
                    status=statusKey[status]
                except KeyError:
                    print(f'key error on status={status}')
                    status='black'
                #print(f"{lno:4}: {bg:08x}->{en:08x} {status=},{country=},{annotation=}")
                entry=(bg,en,status,country,annotation)
                entries.append(entry)
                lno+=1
        self.db=entries

    def rangeIP(self):
        for (b,e,st,c,a) in self.db:
            print(f"{b:08x} to {e:08x}")

<checkCountry: checkIP 4.3>
Chunk defined in 4.1,4.2,4.4

The class countries is a database of IP addresses derived from the file /home/ajh/local/blackListIPs, with entries of the form:

        IP adr in dot form/network size status country annotation
      
where status is a character ('*'|' '|'.'), country is a 2 letter country abbreviation, and annotation is an indication of the network owner. The status character is (resp.) black | grey | white.
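
To make the entry format concrete, here is a minimal sketch that parses single-address entries of this form. It uses a simplified pattern of its own rather than the ones in load(), and the sample addresses, country codes and owner names are invented for illustration.

```python
import re

# Simplified, hypothetical pattern: address, optional /size, optional status flag,
# optional country code, optional free-text annotation.
ENTRY = re.compile(r'(\d+\.\d+\.\d+\.\d+)(?:/(\d+))?\s*([*.])?\s*(\S+)?(?:\s+(.*))?$')
STATUS = {'*': 'black', '.': 'white', None: 'grey'}

def parse_entry(line):
    """Return (address, network size, status, country, annotation) or None."""
    m = ENTRY.match(line.strip())
    if not m:
        return None
    ip, bits, flag, country, note = m.groups()
    return (ip, int(bits) if bits else 32, STATUS[flag], country, note)

black = parse_entry("203.0.113.0/24 * ZZ Example Net\n")
white = parse_entry("198.51.100.7 . AU\n")
grey = parse_entry("192.0.2.0/24 ZZ someone\n")
print(black)
print(white)
print(grey)
```

An entry without a /size is treated as a single address (network size 32), matching the default used in load().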

<checkCountry: checkIP 4.3> =
    def checkIP(self,ips,debug=False):
        # here is where we check for white listed IPs
        white=False
        with open('/home/ajh/local/guestIP','r') as f:
            gips=f.read().strip()
        ip=str2IP(gips)
        white=(ips==gips)
        #sys.stderr.write(f"Guest testing gives {gips=}, {ips=}, {white=}\n")
        if white: return (True,'white',ip,'')
        # check for IPv4/IPv6 localhost
        if ips[0]==':' or ips=='127.0.0.1':
            # IPv6 and IPv4, assume local host.
            return (True,'white','AU','localhost')
        # end of white testing
        cb=str2IP(ips)
        if debug: print(f"ips={ips}, binary={cb:x}")
        st='grey'
        for (bg,en,st,cn,an) in self.db:
            if debug: print(f"checking {bg:x}<={cb:x}<={en:x} ({cn}-{an},st={st})")
            if cb<bg: break
            if bg<=cb and cb<=en:
                if debug: print(f"bb={bg:x}, cb={cb:x}, en={en:x}, cn={cn}, (an={an})")
                return (True,st,cn,an)
        return (True,'grey','unknown','')
Chunk referenced in 4.2

checkIP checks the given IP address against the database, and returns a tuple (isgood,status,country,annotation), where isgood is a boolean, status is one of ('white','grey','black'), country is the country of origin, and annotation is the owner of the IP.
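
The range search at the heart of checkIP can be sketched on an in-memory table, assuming (as the early exit in the chunk above does) that entries are sorted by their starting address. The function name check_ip and the sample table are hypothetical. Note one deliberate difference: this sketch reports False in the first slot for an address not found, following the description of known in this section, whereas the chunk above returns True in that case.

```python
def check_ip(db, ip):
    """db: list of (begin, end, status, country, annotation), sorted by begin."""
    for (begin, end, status, country, note) in db:
        if ip < begin:
            break                      # sorted table: no later range can contain ip
        if begin <= ip <= end:
            return (True, status, country, note)
    return (False, 'grey', 'unknown', '')

# Hypothetical table: one black-listed /24 and one white-listed single address
db = [
    (0xC0000200, 0xC00002FF, 'black', 'ZZ', 'Example Net'),   # 192.0.2.0/24
    (0xC6336407, 0xC6336407, 'white', 'AU', 'friend'),        # 198.51.100.7
]

print(check_ip(db, 0xC0000210))   # inside the black-listed network
print(check_ip(db, 0xC6336407))   # the white-listed address
print(check_ip(db, 0x0A000001))   # not in the table
```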

"checkCountry.py" 4.4 =
def main():
    cnt=countries()
    cnt.load()
    #cnt.rangeIP()
    while True:
        ip=sys.stdin.readline().strip()
        if not ip: break
        (isgood,st,cn,a)=cnt.checkIP(ip,debug=True)
        aa=''
        if a: aa=f" (note: {a})"
        print(f"{ip} returns {isgood}/{st} from country {cn}{aa}")

if __name__=="__main__":
    main()
Chunk defined in 4.1,4.2,4.4

5. The Main Program index.py

"index.py" 5.1 =
#!/home/ajh/binln/python3.10
# coding: utf-8
<edit warning 2.1>
version="<current version 15.1>"

# swap these to easily change value
showBegins=True
showBegins=False

# these are upfront, because they are the things that often need changing
<define global variables and constants 5.9>
Chunk defined in 5.1,5.2,5.3,5.4,5.5,5.6,5.7,5.8

This program is written in Python3, and relies upon Python v3.6 or later, as it uses f strings in many of its output statements. /home/ajh/binln is a common directory used to identify where required versions of interpreters are to be found.

It is not compatible with python3.11 or later, as it uses features now deprecated in 3.11 and later. Hence it is restricted to python 3.10. (This may change in future, once the deprecated features are isolated and corrected.)

We insert the usual edit warning to alert any code editor (person) to the dangers of directly editing constructed code. The version number is inserted into rendered HTML text.

The showBegins flag turns on tracing output on the stderr output to assist in identifying where program failures may occur, and linking these back to the literate code. Other global variables are also identified upfront, to further assist debugging the raw code.

Note the convention that where a code fragment is defined in several chunks, a begin n comment is inserted to assist debugging, and back referencing into this literate program.

This script processes all my web page XML files. It requires Apache to be configured:

  1. to allow python files (this file) to run as cgi in user directories
  2. to add a handler for XML files that call this program
  3. to pass as a cgi parm the XSLT file that translates the XML file

These are done in a .htaccess file for each directory (and its subdirectories) that requires XML processing with a particular XSLT stylesheet.

The script relies upon picking up the required file and its XSLT file from a) the REDIRECT environment variables, and b) the script parameter, respectively.

The interpreter required varies according to the target server. This detail is captured by the <Makefile 12.2> script, although not all systems have yet been encoded into the Makefile script. The python script also changes its behaviour depending upon the host on which it is running. This is done by an explicit call to os.environ (chunk <determine the host and server environments 5.11,5.12>).

"index.py" 5.2 =
#begin 1
import sys
if showBegins: sys.stderr.write(f"begin 1\n")

import cgi
import cgitb ; cgitb.enable()
import checkCountry
import datetime
import html
import io
import os, os.path
import re
from subprocess import PIPE,Popen,getstatusoutput
import time
from urllib.request import urlopen
import urllib.parse
import xml.dom.minidom
Chunk defined in 5.1,5.2,5.3,5.4,5.5,5.6,5.7,5.8

Gather together all the module and library imports needed for this program.

"index.py" 5.3 =
#begin 2
if showBegins: sys.stderr.write(f"begin 2\n")
<define various string patterns 5.10>
now=datetime.datetime.now()
tsstring=now.strftime("%Y%m%d:%H%M")
todayStr=now.strftime("%d %b %Y")
htmlmod=xmlmod=0
Chunk defined in 5.1,5.2,5.3,5.4,5.5,5.6,5.7,5.8

Start processing. Get a timestamp for recording key events in the log. Set the modification times to year dot.

"index.py" 5.4 =
#begin 3
if showBegins: sys.stderr.write(f"begin 3a\n")
<determine the host and server environments 5.11,5.12>
<filter out bad ips to which we do not respond 5.15>
Chunk defined in 5.1,5.2,5.3,5.4,5.5,5.6,5.7,5.8

This first code chunk here identifies who and what is being invoked in this instantiation of the server.

The second part identifies known IP addresses that have a track record of abusing the server. The algorithms used are somewhat heuristic.

"index.py" 5.5 =
# begin 4 - index.py
if showBegins: sys.stderr.write(f"begin 4\n")

def debug(loc,msg):
    if debugFlag: print(msg)
Chunk defined in 5.1,5.2,5.3,5.4,5.5,5.6,5.7,5.8

Define a useful debug routine.

"index.py" 5.6 =
# begin 5 - index.py
if showBegins: sys.stderr.write("begin 5\n")
<check for and handle jpegs 5.16>

# start the html output
print("Content-type: text/html\n")
#sys.stderr.write(f"NEW BOT TESTING SYSTEM!\n")
#sys.stderr.write(f"*** server={server} ***\n")
Chunk defined in 5.1,5.2,5.3,5.4,5.5,5.6,5.7,5.8

This fragment simply outputs the required header for flagging the generated content as HTML, and builds a number of string matching patterns for later use.

"index.py" 5.7 =
# begin 6 - index.py
if showBegins: sys.stderr.write(f"begin 6\n")
<collect HTTP request 5.20>
<get filename from redirect 5.22>
Chunk defined in 5.1,5.2,5.3,5.4,5.5,5.6,5.7,5.8

Collect relevant parameters from the original request.

"index.py" 5.8 =
# begin 7 - index.py
if showBegins: sys.stderr.write("begin 7\n")
<check for abbreviated URL 5.23>
<make file and dir absolute 5.24>
<check for HTML request 5.25,5.26>
<get default XSLT file 5.27>
<scan for locally defined XSLT file 5.28>
<determine xslt file 5.30>
<update counter 5.31,5.32,5.33,5.34>

if debugFlag:
    print("\n<p>\n")
    print("%s: server = %s<br/>" % (tsstring,server))
    print("%s: host = %s<br/>" % (tsstring,host))
    print("%s: dir = %s<br/>" % (tsstring,dir))
    print("%s: requestedFile = %s<br/>" % (tsstring,requestedFile))
    print("%s: relcwd = %s<br/>" % (tsstring,relcwd))
    print("%s: relfile = %s<br/>" % (tsstring,relfile))
    print("%s: counter = %s<br/>" % (tsstring,counterName))
    print("%s: alreadyHTML = %s<br/>" % (tsstring,alreadyHTML))
    print("%s: cachedHTML = %s<br/>" % (tsstring,cachedHTML))
    print("%s: os.environ = %s\n</p>\n" % (tsstring,repr(os.environ)))
    print("%s: docRoot = %s\n</p>\n" % (tsstring,docRoot))

<process file 5.35,5.36>

now=datetime.datetime.now()
tsstring=now.strftime("%Y%m%d:%H%M")
sys.stderr.write(f"{tsstring}: [{remoteAdr}] request satisfied\n\n")
Chunk defined in 5.1,5.2,5.3,5.4,5.5,5.6,5.7,5.8

The major work in rendering the required page is done by the <process file 5.35,5.36> code chunks. After this point, processing is complete, and the program falls through to exit.

5.1 Define Global Variables and Constants

<define global variables and constants 5.9> =
xslfile=""
debugFlag=False
returnXML=False
convertXML=False
alreadyHTML=False
cachedHTML=False
clientIP='0.0.0.0'
Chunk referenced in 5.1

returnXML is set True when the display of the raw untranslated XML is required.

convertXML is set True when a converted copy of the translated XML is required to be saved.

alreadyHTML is set True when the incoming file to be rendered is already in HTML and does not require conversion.

cachedHTML is set True when the incoming file to be rendered has been cached in the HTMLS directory, and does not require conversion.

5.2 Define Various String Patterns

<define various string patterns 5.10> =
# - to extract directory and filename from request
filepat=re.compile(r'/~ajh/?(.*)/([^/]*)$')
filename='index.xml'
# - to detect stylesheet request (optional)
stylesheet=re.compile(r'<\?xml-stylesheet.*href="(.*)"')
# - to terminate file scanning
doctype=re.compile(r'<!DOCTYPE')
# to check for missing htmls
htmlpat=re.compile(r'(.*)\.html$')
# to check for xslspecification in htaccess
xslspec=re.compile(r'.*?xslfile=(.*)&')
Chunk referenced in 5.3

Somewhat self-explanatory string matching patterns.
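
For readers less familiar with regular expressions, the sketch below shows what three of these patterns extract. The patterns are copied from the chunk above; the sample request paths and the query string are hypothetical.

```python
import re

filepat = re.compile(r'/~ajh/?(.*)/([^/]*)$')
htmlpat = re.compile(r'(.*)\.html$')
xslspec = re.compile(r'.*?xslfile=(.*)&')

# directory and filename from a request below ~ajh
nested = filepat.match('/~ajh/trains/2024/index.xml').groups()
# a file directly in ~ajh yields an empty directory part
toplevel = filepat.match('/~ajh/index.xml').groups()
# stem of a requested .html file, used to look for the matching .xml source
stem = htmlpat.match('/~ajh/notes.html').group(1)
# stylesheet name passed as a script parameter
xsl = xslspec.match('xslfile=ajh.xsl&debug=true').group(1)

print(nested, toplevel, stem, xsl)
```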

5.3 Determine the Host and Server Environments

<determine the host and server environments 5.11> =
# determine the host and server environments (exitcode,host)=getstatusoutput('hostname') host=re.split('\.',host)[0] # break off leading part before the '.' char #sys.stderr.write(f'{os.environ}\n') try: host=os.environ["HOSTNAME"] except KeyError: cmd='/bin/hostname' pid=Popen(cmd,shell=True,stdout=PIPE,stderr=PIPE,close_fds=True) host=pid.communicate()[0].strip().decode('UTF-8') try: server=os.environ["SERVER_NAME"] except KeyError: server='localhost' try: docRoot=os.environ["DOCUMENT_ROOT"] except KeyError: docRoot='/Users/ajh/www' try: requestUri=os.environ["REQUEST_URI"] except KeyError: sys.stderr.write('No REQUEST_URI in call environment, using default') requestUri='/~ajh/index.xml' pass if '~ajh' in requestUri: docRoot='/home/ajh/public_html' # else leave as '/var/www/html' #sys.stderr.write(f'requestUri={requestUri}, docRoot={docRoot}\n') if "SCRIPT_URL" in os.environ: URL=os.environ["SCRIPT_URL"] elif "REDIRECT_URL" in os.environ: URL=os.environ["REDIRECT_URL"] else: URL="We got a problem"
Chunk referenced in 5.4
Chunk defined in 5.11,5.12
<determine the host and server environments 5.12> =
MacOSX='MacOSX' ; Solaris='Solaris' ; Linux="Linux" ; Ubuntu='Ubuntu' if server in ["localhost"]: host='localhost' ostype=Linux system=Ubuntu elif server in ['121.200.25.188']: host='spencer' ostype=Linux system=Ubuntu elif server in ['burnley','burnley.local','10.0.0.8']: host='burnley' ostype=Linux system=Ubuntu elif server in ['ajh.co','www.ajh.co','ajh.id.au','www.ajh.id.au']: host='spencer' ostype=Linux system=Ubuntu elif server in ['ajhurst.org','www.ajhurst.org',\ 'albens.ajhurst.org','45.55.18.15',\ 'njhurst.com','www.njhurst.com']: server='ajhurst.org' host='albens' ostype=Linux system=Ubuntu elif server in ['newport','newport.local','newport.home.gateway','10.0.0.3']: server='newport' host='newport' ostype=Linux system=Ubuntu elif server in ['spencer','spencer-fast']: ostype=Linux system=Ubuntu elif server in ['cahurst.org','www.cahurst.org']: server='cahurst.org' host='albens' ostype=Linux system=Ubuntu elif server in ['glenwaverleychurches.org','www.glenwaverleychurches.org']: host='albens' ostype=Linux system=Ubuntu docRoot='/home/ajh/public_html/parish/GWICC' #elif host in ['newport']: # server=newport # ostype=Linux # system=Ubuntu else: sys.stderr.write("server/host values not recognized\n") sys.stderr.write("(supplied values are %s/%s)\n" % (server,host)) sys.stderr.write("aborting\n") sys.exit(1) ostype=Linux system=Ubuntu sys.stderr.write("(assuming (ostype,system)=(%s,%s)\n" % (ostype,system)) <define the working directories 5.13> <define the XSLTPROC program 5.14>
Chunk referenced in 5.4
Chunk defined in 5.11,5.12

The server is the address to which this request was directed, and is useful in making decisions about what to render to the client. Examples are "localhost", "www.ajh.id.au", "chairsabs.org.au".

The host is the machine upon which the server is running, and may be different from the server. This name is used to determine where to store local data, such as logging information. For example, the server may be "localhost", but this can run on a variety of hosts: "murtoa", "dimboola", "dyn-13-194-xx-xx", etc.

<define the working directories 5.13> =
# begin 3.2a - define the working directories if showBegins: sys.stderr.write("begin 3.2a\n") # BASE is the path to the web base directory - with no trailing slash! if system==MacOSX: #sys.stderr.write("MacOS\n") HOME="/home/ajh/" BASE="/home/ajh/www" PRIVATE="/home/ajh/local/"+server elif system==Ubuntu and server=='cahurst.org': #sys.stderr.write("Ubuntu,cahurst\n") HOME="/home/ajh" BASE="/home/ajh/www/personal/cahurst" PRIVATE="/home/ajh/local/"+server elif system==Ubuntu and docRoot=='/var/www/html': #sys.stderr.write("Ubuntu,/var/www/html\n") HOME="/home/ajh/" BASE="/var/www/html" PRIVATE="/home/ajh/local/"+server elif system==Ubuntu: #sys.stderr.write("Ubuntu\n") if debugFlag: print("docRoot=%s" % (docRoot)) if docRoot == '/home/ajh/public_html/parish/GWICC': #sys.stderr.write("Ubuntu,GWICC\n") HOME="/home/ajh/public_html/parish/GWICC" BASE=HOME PRIVATE="/home/ajh/local/"+server+"/parish/GWICC" elif docRoot == '/home/ajh/public_html/personal/cahurst': #sys.stderr.write("Ubuntu,cahurst\n") HOME="/home/ajh/public_html/personal/cahurst" BASE=HOME PRIVATE="/home/ajh/local/"+server+"/personal/cahurst" elif docRoot == '/var/www/html': #sys.stderr.write("Ubuntu,/var/www/html\n") HOME="/home/ajh/public_html" BASE=docRoot PRIVATE="/home/ajh/local/"+server else: # ajh public web pages HOME="/home/ajh/" BASE="/home/ajh/public_html" PRIVATE="/home/ajh/local/"+server COUNTERS=PRIVATE+"/counters/" HTMLS=PRIVATE+"/htmls/" WEBDIR="file://"+BASE if debugFlag: # done this way because only python 3.6 on some machines msg=f"docRoot={docRoot},BASE={BASE},HOME={HOME}\n" sys.stderr.write(msg) # end 3.2a - define the working directories
Chunk referenced in 5.12

Note: This section needs a bit more work to distinguish the new 001-default host.

BASE is set to the path to the web root directory on this server. It should not have a trailing slash!

HOME has its usual Unix meaning.

PRIVATE is set to the path to a working directory on this particular server that is used to store accounting and audit information about this particular access. The path includes a specific reference to the server hostname to uniquely distinguish it. This directory basename is rendered on the web page as the parameter server and is the first of the "server@host" pairs rendered at the top of the web page.

COUNTERS is the path to the directory containing all the web page access counts. Each counter is incremented on page access, whether to the cached HTML, or to the (re)rendered XML file.

HTMLS is the path to a local copy of html versions of the files. These are cached versions, and some mechanism to age and delete needs to be identified. If the corresponding XML file is older than the HTML file found in this subdirectory, the HTML version is used.
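
The freshness rule in the last sentence can be sketched as a comparison of modification times. The function name use_cached_html and the file names are hypothetical, and the actual test in index.py may differ in detail.

```python
import os
import tempfile

def use_cached_html(xml_path, html_path):
    """True if a cached HTML copy exists and is newer than its XML source."""
    if not os.path.exists(html_path):
        return False
    return os.path.getmtime(html_path) > os.path.getmtime(xml_path)

with tempfile.TemporaryDirectory() as d:
    xml_path = os.path.join(d, "page.xml")
    html_path = os.path.join(d, "page.html")
    absent = use_cached_html(xml_path, html_path)   # nothing cached yet
    for p in (xml_path, html_path):
        open(p, "w").close()
    os.utime(xml_path, (1000, 1000))    # XML older than the cached HTML
    os.utime(html_path, (2000, 2000))
    fresh = use_cached_html(xml_path, html_path)
    os.utime(xml_path, (3000, 3000))    # XML edited after the HTML was cached
    stale = use_cached_html(xml_path, html_path)

print(absent, fresh, stale)
```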

<define the XSLTPROC program 5.14> =
# begin 3.2b - define the XSLTPROC program
if showBegins: sys.stderr.write("begin 3.2b\n")
# define the XSLTPROC
if system in ['MacOSX','Linux','Ubuntu']:
    XSLTPROC="/usr/bin/xsltproc"
else: # no other option
    sys.stderr.write(f"No XSLTPROC defined for system={system}\n")
    sys.exit(1)
# end 3.2b - define the XSLTPROC program
Chunk referenced in 5.12

XSLTPROC is the path to the xsltproc processor. Without this processor, this entire script (as far as XML files are concerned) is meaningless!

5.4 filter out bad ips to which we do not respond

<filter out bad ips to which we do not respond 5.15> =
def filteredResponse(arg=None): print("Content-type: text/html\n") print("<H1>NOT AVAILABLE</H1>") print(''' <p> You have been filtered, and this page is blocked against you. Please write to the author if you think there is some mistake. </p> ''') if showBegins: sys.stderr.write(f"begin 3b\n") if "REMOTE_ADDR" in os.environ: clientIP=os.environ["REMOTE_ADDR"] else: sys.stderr.write(f"key error for REMOTE_ADDR or REDIRECT_URL\n") sys.exit(1) if 'REDIRECT_URL' in os.environ: requestedFile=os.environ['REDIRECT_URL'] else: sys.stderr.write(f"key error for REDIRECT_URL\n") sys.exit(1) # check my database. Only fail if ip is known and flagged as bad if showBegins: sys.stderr.write(f"begin 3b1\n") cnt=checkCountry.countries() if showBegins: sys.stderr.write(f"begin 3b2\n") cnt.load() (isgood,st,cn,a)=cnt.checkIP(clientIP) if showBegins: sys.stderr.write(f"begin 3b3\n") # isgood means we know about this ip, regard as OK for now # st=='black' means it is bad, abort now rqf=f"(requestedFile={requestedFile})" if st=='black': sys.stderr.write(f"{clientIP:>15} black, {cn}-{a} {rqf}\n") filteredResponse() sys.exit(1) elif st=='grey': sys.stderr.write(f"{clientIP:>15} grey, {cn}-{a} allowed for now, {rqf}\n") else: if isgood: sys.stderr.write(f"{clientIP:>15} OK, {cn}-{a} {rqf}\n") else: sys.stderr.write(f"{clientIP:>15} unknown-ignored, {rqf}, ignored\n") filteredResponse() sys.exit(1) # check if country OK goodIP=filterOnCountry(clientIP) if not goodIP: sys.stderr.write(f"ip={clientIP} but fails country {cn}\n") filteredResponse() sys.exit(1)
Chunk referenced in 5.4

We maintain a database of IP addresses, categorized three ways:

  1. 'white': this IP is allowed.
  2. 'grey': no information on this IP as yet. Allow for now.
  3. 'black': this IP is filtered out.

The variable st returned by checkIP is set to one of these values. checkIP also returns isgood, True if the access is allowed; cn, the country from which the request originated; and a, an annotation or other information about the IP address.

The database is described in section 11, Country Database.
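
The decision taken on the values returned by checkIP can be summarised in a small sketch. The function name decide and its return strings are hypothetical; the branching mirrors the chunk above, before the separate country check is applied.

```python
def decide(isgood, st):
    """Map the (isgood, status) pair returned by checkIP to an action."""
    if st == 'black':
        return 'block'        # known bad: send the filtered response
    if st == 'grey':
        return 'allow'        # no decision yet: allowed for now
    # remaining status is white: allow only if the IP is actually known
    return 'allow' if isgood else 'block'

for case in [(True, 'white'), (True, 'grey'), (True, 'black'), (False, 'white')]:
    print(case, decide(*case))
```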

5.5 Handle JPEGs

From version 1.2.0 onwards, this code implements a form of caching for jpg files. A local check for the request file is made, and if it is not found, an attempt to retrieve it from the dimboola server is made. If that is not successful, the file is reported not found. If it is successful, the file is saved locally. No attempt is made to age files out of the cache.

<check for and handle jpegs 5.16> =
#begin 5.1 if showBegins: sys.stderr.write("begin 5.1\n") #sys.stderr.write("Just a check version %s\n" % (version)) cachetime=60*60*24*7 # one week # check for jpgs if 'REQUEST_URI' in os.environ: uri=os.environ['REQUEST_URI'] (scheme,netloc,path,parms,query,fragment)=urllib.parse.urlparse(uri) #sys.stderr.write("path=%s\n" % path) filename=re.sub('/~ajh/','/home/ajh/www/',path) (root,ext)=os.path.splitext(filename) ext=ext.lower() if ext=='.jpg': basedir=os.path.dirname(filename) #sys.stderr.write(f"filename={path}\n") if os.path.exists(filename): sys.stderr.write(f"{tsstring}: Got file {filename} locally\n") f=open(filename,'r').read() else: sys.stderr.write(f"local file {path} not available. Checking ajh.co\n") sys.exit(0) <get jpg file from remote server 5.17,5.18> print("Content-Type: image/jpeg\n") #print(f) # display the image sys.exit(0) else: pass else: sys.stderr.write("No Request_URI\n")
Chunk referenced in 5.6

This code checks to see if the request is for a jpg image file. These are cached, and if not present, are retrieved from the master jpg server for my jpeg images. This is still a bit experimental. The server URL is dimboola.infotech.monash.edu.au/~ajh/Pictures.

It requires that the .htaccess file be modified to refer .jpg requests to this cgi script.

5.5.1 Get JPG File from Remote Server

<get jpg file from remote server 5.17> =
newurl="http://ajh.co%s" % path
sys.stderr.write("{0}: using url {1}\n".format(tsstring,newurl))
urlobj=urlopen(newurl)
f=urlobj.read()
modtimestr=urlobj.info()['Last-Modified']
modtime=time.strptime(modtimestr,"%a, %d %b %Y %H:%M:%S %Z")
Chunk referenced in 5.16
Chunk defined in 5.17,5.18

Generate the URL of the corresponding remote JPG file, and issue read request. By using the urllib.request library, we also get the modification time, which we parse in order to set the correct modification time on the locally cached copy.
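
As an illustration of the parsing step, the sketch below applies the same format string to a sample Last-Modified header value and converts the result to seconds since the epoch. The header value is a hypothetical example, and the use of calendar.timegm (which treats the parsed time as UTC) is a choice made for this sketch.

```python
import calendar
import time

# Hypothetical header value, in the usual HTTP date format
stamp = "Wed, 21 Oct 2015 07:28:00 GMT"

# Same format string as used in the chunk above
parsed = time.strptime(stamp, "%a, %d %b %Y %H:%M:%S %Z")

# HTTP dates are expressed in GMT, so convert without applying the local time zone
epoch = calendar.timegm(parsed)

print(parsed.tm_year, parsed.tm_mon, parsed.tm_mday, epoch)
```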

<get jpg file from remote server 5.18> =
sys.stderr.write(f"Would be caching {filename}, but currently disabled\n")
Chunk referenced in 5.16
Chunk defined in 5.17,5.18
<get jpg file from remote server removed 5.19> =
try:
    # the image data is bytes, so the cache file is opened in binary mode
    fc=open(filename,'wb')
    fc.write(f)
    fc.close()
    #touch filename -mt time.strftime("%Y%m%d%H%M.%S")
    mtime=time.mktime(modtime)
    imtime=int(mtime)
    nowtime=time.localtime()
    currtime=int(time.mktime(nowtime)) # local
    os.utime(filename,(currtime,imtime))
    #sys.stderr.write("%s: cached %s\n" % (tsstring,filename))
except (IOError,OSError):
    #errmsg=os.strerror(errcode)
    sys.stderr.write("%s: Cannot write cache file %s\n" % (tsstring,filename))

Now try to cache a local copy. This can fail for several reasons, the main one being that the permissions in the local directory are likely to be against (write) access by the www user. More work is required to make this a bit more robust.

Note that we set the pair (access time, modification time) on the local file to be the current time and remote file modification time respectively. This ensures that attempts to synchronize the two file systems will see this file as the same file as the remote file, and not attempt to update one or the other (thus leading to spurious modification times).
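
A minimal sketch of setting the (access time, modification time) pair with os.utime, using a temporary file in place of a cached image. The remote modification time is a made-up value.

```python
import os
import tempfile
import time

remote_mtime = 1445412480      # hypothetical modification time of the remote file

# Stand-in for the locally cached copy
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"image bytes would go here")
    cache_path = tmp.name

now = int(time.time())
# access time = now, modification time = that of the remote original
os.utime(cache_path, (now, remote_mtime))

recorded = int(os.stat(cache_path).st_mtime)
os.remove(cache_path)
print(recorded == remote_mtime)
```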

5.6 Collect HTTP Request

<collect HTTP request 5.20> =
# collect the original parameters from the redirect (if there is one!) if 'REDIRECT_QUERY_STRING' in os.environ: <handle redirect query string 5.21> else: form={} requestedFile="" {Note 5.20.1} remoteAdr='' if 'REMOTE_ADDR' in os.environ: remoteAdr=os.environ['REMOTE_ADDR'] if debugFlag: print("<p>%s: (server,host)=(%s,%s)<br/>\n" % (tsstring,server,host)) print("%s: (system,PRIVATE)=(%s,%s)</p>\n" % (tsstring,system,PRIVATE)) print("%s: (BASE,HOME,PRIVATE)=(%s,%s,%s)</p>\n" % (tsstring,BASE,HOME,PRIVATE))
Chunk referenced in 5.7
{Note 5.20.1}
initialize the filename of the file to be rendered. Most of the work in computing the value of this variable is done in <get filename from redirect 5.22>

When this script is called, it has gained control by virtue of an .htaccess directive to Apache to use this program to render the source file. The name of that source file has to be recovered somehow, and different systems seem to handle this parameter in different ways. The first parameter to explore is the REDIRECT_QUERY_STRING, which, if it is present in the form request, contains secondary parameters to the rendering operation. If this parameter is not present, initialize the variable form to an empty value.

5.6.1 Handle REDIRECT QUERY STRING

<handle redirect query string 5.21> =
query_string=os.environ['REDIRECT_QUERY_STRING']
form=urllib.parse.parse_qs(query_string)
if 'debug' in form and form['debug'][0]=='true':
    sys.stderr.write("%s: %s\n" % (tsstring,repr(form)))
    debugFlag=True
    print("<h1>%s: INDEX.PY version %s</h1>\n" % (tsstring,version))
    print("<p>%s: os.environ=%s</p>\n" % (tsstring,repr(os.environ)))
    print("<p>%s: form=%s</p>\n" % (tsstring,repr(form)))
    sys.stderr.write("%s: redirect_query string=%s\n" % (tsstring,query_string))
if 'xml' in form:
    if form['xml'][0]=='true':
        sys.stderr.write("%s: %s\n" % (tsstring,repr(form)))
        returnXML=True
        if debugFlag:
            print("<p>%s: os.environ=%s</p>\n" % (tsstring,repr(os.environ)))
            print("<p>%s: form=%s</p>\n" % (tsstring,repr(form)))
            sys.stderr.write("%s: redirect_query string=%s\n" % \
                             (tsstring,query_string))
    elif form['xml'][0]=='convert':
        convertXML=True
Chunk referenced in 5.20

There are several possibilities for secondary parameters. The primary one is the debug parameter, which can be set to true (this sets the variable debugFlag), indicating that debugging information is to be printed along with the rendering. This is intended for administrator access only, but as it is harmless, there is no authentication required.

The other parameter that can be offered at this point is the xml parameter, with values of true or convert. The first of these forces no conversion of the XML, but simply copies it to the browser, substituting escape sequences for any special XML character sequences so that it appears as verbatim XML.

The second choice, convert, allows the use of the rendering engine as an XML-to-HTML converter, in which case a copy of the converted HTML is saved to a temporary file. This file can be used subsequently as a statically converted file as necessary.
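
The decoding of these secondary parameters can be sketched as a small stand-alone function. The name parse_secondary_parameters is hypothetical; only urllib.parse.parse_qs and the parameter names debug and xml come from the chunk above.

```python
import urllib.parse

def parse_secondary_parameters(query_string):
    """Return (form, debug, return_xml, convert_xml) for a redirect query string."""
    form = urllib.parse.parse_qs(query_string)
    debug = form.get('debug', [''])[0] == 'true'
    xml_mode = form.get('xml', [''])[0]
    return_xml = xml_mode == 'true'        # copy the XML verbatim
    convert_xml = xml_mode == 'convert'    # save a converted HTML copy
    return form, debug, return_xml, convert_xml

form, debug, return_xml, convert_xml = parse_secondary_parameters('debug=true&xml=convert')
print(debug, return_xml, convert_xml)   # prints: True False True
```

Note that parse_qs maps each key to a list of values, which is why the chunk indexes with [0].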

5.7 Get Filename from Redirect

<get filename from redirect 5.22> =
# get the file name from the redirect environment if system==MacOSX: scriptURL='REQUEST_URI' elif system==Solaris: scriptURL='REDIRECT_URL' elif system==Linux: scriptURL='REDIRECT_URL' elif system==Ubuntu: scriptURL='REDIRECT_URL' if scriptURL in os.environ: requestedFile=os.environ[scriptURL] argpos=requestedFile.find('?') if argpos>=0: requestedFile=requestedFile[0:argpos] if debugFlag: sys.stderr.write("%s: [client %s] requesting %s\n" % \ (tsstring,remoteAdr,requestedFile)) orgfile=requestedFile # analyse file request. If a bare directory, add 'index.xml' if 'REDIRECT_STATUS' in os.environ and \ os.environ['REDIRECT_STATUS']=='404': res=htmlpat.match(requestedFile) if res: filename=res.group(1)+'.xml' requestedFile=filename dir=relcwd="" res=filepat.match(requestedFile) if res: dir=res.group(1) relcwd=dir # protocol for relcwd: # no subdir => relcwd = '' (empty) # exists subdir => relcwd = subdir (no leading or trailing slash) if dir!="": requestedFile=dir+'/'+res.group(2) else: requestedFile=res.group(2) filename=res.group(2) else: # not ajh (sub)directory, extract full directory path dir=os.path.dirname(requestedFile) relcwd=dir filename=os.path.basename(requestedFile) #sys.stderr.write("{}\n".format(requestedFile)) if 'personal/albums' in requestedFile: pass # sys.exit(0) if debugFlag: print("<p>%s: dir,requestedFile,relcwd to process = %s,%s,%s</p>" % \ (tsstring,dir,requestedFile,relcwd))
Chunk referenced in 5.7

5.8 Check for Abbreviated URL

<check for abbreviated URL 5.23> =
if requestedFile=='' or requestedFile[-1]=='/':
    requestedFile+='index.xml'
    filename='index.xml'
Chunk referenced in 5.8

5.9 Make File and Dir Absolute

<make file and dir absolute 5.24> =
requestedFile=re.sub('^/','',requestedFile) # remove any leading /
relfile=requestedFile
requestedFile=BASE+'/'+requestedFile
dir=BASE+'/'+dir
Chunk referenced in 5.8

5.10 Check for HTML Request

We now have a requestedFile name for the document to be rendered. We need to investigate this file to see how it is to be rendered. In particular, it may be an HTML file (indicated by a .html extension), or it may be an XML file previously rendered and cached. In these cases, we do not need to do any XML conversion, and the flag alreadyHTML is set true if it is an HTML file, or the flag cachedHTML is set true if it is a cached converted XML to HTML file.

<check for HTML request 5.25> =
res=htmlpat.match(requestedFile)
if res:
    # we have an HTML request, check if it exists
    if os.path.exists(requestedFile):
        # exists, use that
        alreadyHTML=True
        #sys.stderr.write(f"requested file {requestedFile} is already html\n")
        if debugFlag:
            print("requested file %s is already html<br/>" % (requestedFile))
    else:
        # doesn't exist, fall back to the XML source
        filename=res.group(1)+'.xml'
        requestedFile=filename
Chunk referenced in 5.8
Chunk defined in 5.25,5.26

This code now also checks for a cached version of the XML file, as per the following fragment.

<check for HTML request 5.26> =
if not alreadyHTML:
    patn="(%s/)(.*).xml" % (BASE)
    if debugFlag:
        print("<p>matching xml=%s with pattern=%s<br/>" % (requestedFile,patn))
    res=re.match(patn,requestedFile)
    if res:
        base=res.group(1); path=res.group(2)
        if debugFlag:
            print("matched BASE=%s,path=%s<br/>" % (base,path))
        htmlpath="%s%s.html" % (HTMLS,path)
        if os.path.exists(htmlpath):
            htmlstat=os.stat(htmlpath)
            xmlstat=os.stat(requestedFile)
            htmlmod=htmlstat.st_mtime
            xmlmod=xmlstat.st_mtime
            if xmlmod < htmlmod and not form:
                # cached version is newer use that
                if debugFlag:
                    print("using cached file %s<br/>" % (htmlpath))
                #requestedFile=htmlpath
                #cachedHTML=True
        else:
            if debugFlag:
                print("no cached version of %s<br/>" % (requestedFile))
    else:
        if debugFlag:
            print("requested file %s is not XML<br/>" % (requestedFile))
Chunk referenced in 5.8
Chunk defined in 5.25,5.26

Unless the file being retrieved is already an HTML file, check to see if we have a cached HTML version of this (XML) file. Note that any parameters to the HTTP request (indicated by a non-empty form value) will abort the caching process, and force a reload of the XML file.
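
The freshness rule can be stated as a single predicate. This is a sketch under assumptions: the function name cached_html_is_usable is hypothetical, and form is taken to be the dictionary produced by parsing the query string (empty when there are no parameters).

```python
import os

def cached_html_is_usable(xml_path, html_path, form):
    """True if html_path exists, is newer than xml_path, and no parameters were given."""
    if form:                       # any request parameter forces re-rendering
        return False
    if not os.path.exists(html_path):
        return False
    # the cache is only valid while the XML source is older than it
    return os.stat(xml_path).st_mtime < os.stat(html_path).st_mtime
```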

5.11 Get Default XSLT File

<get default XSLT file 5.27> =
# collect the XSLT file name from the .htaccess referent if debugFlag: print("BASE=%s<br/>" % (BASE)) if 'QUERY_STRING' in os.environ: query_string=os.environ['QUERY_STRING'] else: query_string='xslfile=%s/lib/xsl/ajhwebdoc.xsl&/~ajh/index.xml' % (BASE) if debugFlag: print("query_string=%s<br/>" % (query_string)) #sys.stderr.write("%s: query string=%s\n" % (tsstring,query_string)) form2=urllib.parse.parse_qs(query_string) if debugFlag: print("<p>%s: form2=%s</p>\n" % (tsstring,form2)) if 'xslfile' in form2: xslfile=form2['xslfile'][0] #sys.stderr.write(f"{tsstring}: got this xslfile={xslfile}\n") if debugFlag: print("<p>%s: got this xslfile=%s</p>\n" % (tsstring,xslfile))
Chunk referenced in 5.8

5.12 Scan for Locally Defined XSLT File

<scan for locally defined XSLT file 5.28> =
# Check the requested file for a local stylesheet. We also scan the # entire file, replacing any symbolic references to $WEBDIR with the # full path for the current machine. Note that the DOCTYPE statement # must start a line by itself. try: #sys.stderr.write(f"requested file={requestedFile}\n") filed=open(requestedFile,'r',encoding='utf-8') text='' ; linecount = 0 trackXML=debugFlag and not (alreadyHTML or cachedHTML) while 1: # keep scanning file until we find no more XML directives line=filed.readline() if line=='': # this is EOF, so quit if linecount==0: print("(empty file)") break linecount+=1 line=line.strip() # remove NL text+=' '+line if trackXML: print("<p>read line='%s'" % (html.escape(line))) # check if end of directives, indicated by normal element tag start res=re.match('<[^?!]',line) if res: break if trackXML: print("<p>text read='%s'" % (html.escape(text))) res=re.match('.*(<\?xml-stylesheet)(.*?)(\?>)',text) if res: parms=res.group(2) # now we have the stylesheet parameters res=re.match('.*href="(.*?)"',parms) if res: # extract filename xslfile=res.group(1) xslfile=re.sub('(\$WEBDIR)',WEBDIR,xslfile) if debugFlag: print("<p>%s: stylesheet in xml file, href=%s</p>" % (tsstring,xslfile)) <check if xslfile more recent than cached version 5.29> else: if trackXML: print("<p>Did not find stylesheet href in %s" % (parms)) else: if trackXML: print("<p>Did not find stylesheet reference in %s" % (html.escape(text))) filed.close() except IOError: print(""" <h1>Sorry!! (Error 404)</h1> <p>While processing your request for file %s,<br/> it was found that the corresponding XML file %s does not exist</p> <p>Please check that the URL is correct</p> """ % (orgfile,requestedFile)) sys.exit(0) #newfiled.close()
Chunk referenced in 5.8
<check if xslfile more recent than cached version 5.29> =
localXSLfile=re.sub('file://','',xslfile)
try:
    # compare modification times (st_mtime), not the whole stat result
    xslmod=os.stat(localXSLfile).st_mtime
    if htmlmod < xslmod:
        cachedHTML=False
        if debugFlag:
            print("<p>XSL newer than HTML, reloading</p>")
except:
    # ignore any errors from this
    pass
Chunk referenced in 5.28

Look at modification time of XSL file. If it is more recent than the cached HTML file, we must re-convert the XML file.

5.13 Determine XSLT File

<determine xslt file 5.30> =
# have we got an xslfile yet? htacc=None if xslfile=="": # no, so check all .htaccess # first grab directory while len(dir)>=len(BASE): if debugFlag: print("<p>directory=%s</p>\n" % (dir)) if os.path.isfile(dir+"/.htaccess"): htacc=open(dir+"/.htaccess") if debugFlag: print("<p>found .htaccess in directory %s</p>" % (dir)) break else: dir=os.path.dirname(dir) if htacc: for line in htacc.readlines(): res=xslspec.match(line) if res: xslfile=res.group(1) if xslfile[0] != '/': xslfile=BASE+'/'+xslfile break if debugFlag: print("<p>found xslfile %s in .htaccess</p>" % (xslfile)) if system==Solaris: xslfile=re.sub('/home/ajh'+'/www','/u/web/homes/ajh',xslfile) if xslfile[0]!='/' and not (xslfile[0:5]=='file:'): xslfile='/u/web/homes/ajh/'+xslfile
Chunk referenced in 5.8
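
The upward search for an .htaccess file that opens this chunk can be sketched on its own. The function name find_htaccess is hypothetical; the loop condition (stop once the directory name is shorter than BASE) follows the chunk, and the extra check for the filesystem root is an addition for safety.

```python
import os

def find_htaccess(start_dir, base):
    """Search start_dir and its ancestors, no higher than base, for .htaccess."""
    d = start_dir
    while len(d) >= len(base):
        candidate = os.path.join(d, '.htaccess')
        if os.path.isfile(candidate):
            return candidate
        parent = os.path.dirname(d)
        if parent == d:            # reached the filesystem root
            break
        d = parent
    return None
```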

5.14 Update Counter

Compute the name of an XML counter file which contains a counter element with subelements value and date. The value element contains the current count value, and the date element is the date on which this XML file was initialised. We read the current count from that file, increment it, and update the file. This file is used by most xslt translations to output an access count in the footer. It is also used by the site map program to compute the intensity of accesses to this web page.

It was fortuitous, but this counter also keeps track of HTML accesses, both where an HTML file is the initial request, and where it is a cached version of the corresponding XML file. Since the XML files have their own counters included by the XSLT translator, the count attached to the HTML rendering allows a comparison of how many accesses are to the cached copy (the difference between the two).

For example, suppose the XML rendering gives 986 references, and the HTML rendering cites 993 references. Then the cached HTML page has itself been referenced 7 times since it was first cached.

<update counter 5.31> =
counterName=re.sub("/~ajh/",'',relfile)
counterName=re.sub("^/",'',counterName)
# escape the dot and anchor at the end, so only a real extension is removed
extnPattern=re.compile(r"\.(xml|html)$")
counterName=re.sub(extnPattern,'',counterName)
counterName=COUNTERS+re.sub("/","-",counterName)
Chunk referenced in 5.8
Chunk defined in 5.31,5.32,5.33,5.34

First we process relfile to find the counter name. Remove any extension, and replace all slash path separators with minus signs.

(Strictly speaking, the first sub is not required, but I've left it in, as it does no harm.)
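
As a worked example of the naming rule, the following sketch derives a counter name from a relative file path. The function name counter_name and the directory '/counters/' are hypothetical stand-ins for the chunk's inline code and the COUNTERS constant; the extension pattern here is escaped and anchored, which is an assumption about the intent.

```python
import re

def counter_name(relfile, counters_dir):
    """Map a relative file path to the path of its counter file."""
    name = re.sub(r'^/', '', relfile)            # drop any leading slash
    name = re.sub(r'\.(xml|html)$', '', name)    # drop the extension
    return counters_dir + name.replace('/', '-') # slashes become minus signs

print(counter_name('research/literate/web.xml', '/counters/'))
# prints: /counters/research-literate-web
```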

<update counter 5.32> =
newCounterStr='<?xml version="1.0"?>\n'
newCounterStr+='<counter><value>0</value><date>%s</date></counter>' % todayStr
try:
    counterFile=open(counterName,'r')
    dom=xml.dom.minidom.parse(counterFile)
    counterFile.close()
except IOError:
    dom=xml.dom.minidom.parseString(newCounterStr)
except xml.parsers.expat.ExpatError:
    dom=xml.dom.minidom.parseString(newCounterStr)
except:
    print("Unexpected error:", sys.exc_info()[0])
    raise
Chunk referenced in 5.8
Chunk defined in 5.31,5.32,5.33,5.34

Now try to read the counter XML file. The file may not exist if this is the first time we have accessed this page since this mechanism was set up, so we must capture that error, and any error arising from attempting to parse the XML, and create a new counter file, with value initialised to zero, and date initialised to today's date.

<update counter 5.33> =
# now extract count field and update it
countNode=dom.getElementsByTagName('value')[0]
if countNode.nodeType == xml.dom.Node.ELEMENT_NODE:
    textNode=countNode.firstChild
    if textNode.nodeType == xml.dom.Node.TEXT_NODE:
        text=textNode.nodeValue.strip()
        countVal=int(text)
        countVal=countVal+1
        textNode.nodeValue="%d" % (countVal)
countDate='(unknown)'
countNode=dom.getElementsByTagName('date')[0]
if countNode.nodeType == xml.dom.Node.ELEMENT_NODE:
    textNode=countNode.firstChild
    if textNode.nodeType == xml.dom.Node.TEXT_NODE:
        countDate=textNode.nodeValue.strip()
Chunk referenced in 5.8
Chunk defined in 5.31,5.32,5.33,5.34

<update counter 5.34> =
# write updated counter document
if re.match('.*personal-albums',counterName):
    # ignore photographs
    #sys.stderr.write(" ignoring {0}\n".format(counterName))
    pass
else:
    try:
        counterFile=open(counterName,'w')
    except IOError:
        print("could not open %s" % (counterName))
        counterName='/home/ajh/local/localhost/counters/index'
        counterFile=open(counterName,'w')
    domString=dom.toxml()
    counterFile.write(domString)
    counterFile.close()
Chunk referenced in 5.8
Chunk defined in 5.31,5.32,5.33,5.34
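
The read, increment, and write cycle of the four chunks above can be condensed into one function for illustration. This is a sketch, not the program's code: the name bump_counter is hypothetical, and the date format used for a fresh counter is an assumption, since the format of todayStr is defined elsewhere.

```python
import datetime
import xml.dom.minidom
import xml.parsers.expat

def bump_counter(counter_path):
    """Increment the counter stored at counter_path; return (value, date)."""
    today = datetime.date.today().strftime('%Y%m%d')   # assumed date format
    fresh = ('<?xml version="1.0"?>\n'
             '<counter><value>0</value><date>%s</date></counter>' % today)
    try:
        with open(counter_path, 'r') as f:
            dom = xml.dom.minidom.parse(f)
    except (OSError, xml.parsers.expat.ExpatError):
        # missing or unparsable file: start again from zero
        dom = xml.dom.minidom.parseString(fresh)
    value_text = dom.getElementsByTagName('value')[0].firstChild
    count = int(value_text.nodeValue.strip()) + 1
    value_text.nodeValue = str(count)
    date = dom.getElementsByTagName('date')[0].firstChild.nodeValue.strip()
    with open(counter_path, 'w') as f:
        f.write(dom.toxml())
    return count, date
```

The first call on a new path returns a count of 1, and each later call returns one more than the stored value.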

5.15 Process File

<process file 5.35> =
filestat=os.stat(requestedFile) filemod=filestat.st_mtime dtfilemod=datetime.datetime.fromtimestamp(filemod) dtstring=dtfilemod.strftime("%Y%m%d:%H%M") # define the parameters to the translation filestat=os.stat(requestedFile) filemod=filestat.st_mtime dtfilemod=datetime.datetime.fromtimestamp(filemod) parms="" parms+="--param xmltime \"'%s'\" " % (dtstring) parms+="--param htmltime \"'%s'\" " % (tsstring) parms+="--param filename \"'%s'\" " % (filename) parms+="--param relcwd \"'%s'\" " % (relcwd) parms+="--param URL \"'%s'\" " % (URL) parms+="--param today \"'%s'\" " % (todayStr) parms+="--param host \"'%s'\" " % (host) parms+="--param server \"'%s'\" " % (server) parms+="--param base \"'%s'\" " % (BASE) parms+="--param version \"'%s'\" " % (version) for key in form: value=form[key][0] parms+="--param "+key+" \"'%s'\" " % (value) if debugFlag: sys.stderr.write("%s: xml file modified at %s\n" % (tsstring,dtstring))
Chunk referenced in 5.8
Chunk defined in 5.35,5.36
<process file 5.36> =
if returnXML:
    rawxmlf=open(requestedFile,'r',encoding='UTF-8')
    print("<PRE>\n")
    for line in rawxmlf.readlines():
        print(html.escape(line))
    print("</PRE>\n")
elif alreadyHTML or cachedHTML:
    #sys.stderr.write(f"requested file={requestedFile}\n")
    <render the HTML file 5.37>
else:
    <process an XML file 5.38,5.39,5.40>
Chunk referenced in 5.8
Chunk defined in 5.35,5.36

Decide what to do with the file. There are three choices:

  1. Return the raw XML. This means escaping all the active characters, and printing the file verbatim.
  2. The file is HTML, either because of an explicit HTML request, or because a cached HTML file previously translated has been found. Again, the file is rendered verbatim, this time without escaping the active characters.
  3. It is an XML file, and it needs translation. Call the XSLT processor to do that (chunk <process an XML file 5.38,5.39,5.40>).
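
The three-way decision, and the escaping used for the first choice, can be sketched as follows. The function names choose_action and escape_for_verbatim are hypothetical; html.escape is the standard library call used in the chunk.

```python
import html

def choose_action(return_xml, already_html, cached_html):
    """Mirror the order of tests in the chunk: raw XML first, then HTML, then XSLT."""
    if return_xml:
        return 'raw-xml'
    if already_html or cached_html:
        return 'html'
    return 'translate'

def escape_for_verbatim(line):
    """Replace the active characters so the browser shows the markup itself."""
    return html.escape(line)

print(choose_action(False, False, True))     # prints: html
print(escape_for_verbatim('<a href="x">'))   # prints: &lt;a href=&quot;x&quot;&gt;
```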

<render the HTML file 5.37> =
rawHTMLf=open(requestedFile,'r',encoding='utf-8') for line in rawHTMLf.readlines(): print(line,end='') sys.stderr.write(f"{tsstring}: [{remoteAdr}] request satisfied\n\n") sys.exit(0) print('<P><SPAN STYLE="font-size:80%%">') print('%d accesses since %s, ' % (countVal,countDate)) print('HTML cache rendered at %s</SPAN>' % (dtstring)) if cachedHTML: os.utime(requestedFile,None) # touch the file
Chunk referenced in 5.36

Note that each line from the HTML file is printed without additional line breaks.

5.15.1 Process an XML File

<process an XML file 5.38> =
# start a pipe to process the XSLT translation cmd=XSLTPROC+" --xinclude %s%s %s " % (parms,xslfile,requestedFile) #(pipein,pipeout,pipeerr)=os.popen3(cmd) pid=Popen(cmd,shell=True,stdout=PIPE,stderr=PIPE,close_fds=True) (pipeout,pipeerr)=(pid.stdout,pid.stderr) if debugFlag: cwd=os.getcwd() print("<p>%s: (cwd:%s) %s</p>" % (tsstring,cwd,cmd)) sys.stderr.write("(cwd:%s) %s: %s\n" % (cwd,tsstring,cmd)) # report the fact, and the context (debugging purposes) if debugFlag: print("%s: converting %s with %s\n" % (tsstring,requestedFile,xslfile))
Chunk referenced in 5.36
Chunk defined in 5.38,5.39,5.40

Run the pipe to perform the translation. Note that this step requires an inordinate amount of time on some servers (sequoia in particular), and was the prompt for including the caching mechanism.
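
An alternative way of starting the translation, sketched here and not what the chunk does, is to pass xsltproc an argument list instead of a shell command string. String parameters then go through xsltproc's --stringparam option, so no shell or XPath quoting of the values is needed, which matters because some values come from the request. The function names are hypothetical.

```python
import subprocess

def build_xsltproc_command(xsltproc, params, xslfile, xmlfile):
    """Build an argument list for xsltproc; params maps names to string values."""
    cmd = [xsltproc, '--xinclude']
    for name, value in params.items():
        cmd += ['--stringparam', name, str(value)]
    cmd += [xslfile, xmlfile]
    return cmd

def run_xsltproc(cmd):
    """Run the translation and return (html_text, error_text)."""
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return proc.stdout.decode('utf-8'), proc.stderr.decode('utf-8')
```

A call such as build_xsltproc_command('xsltproc', {'filename': 'web.xml'}, 'ajhwebdoc.xsl', 'web.xml') yields a list that can be handed directly to run_xsltproc.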

<process an XML file 5.39> =
# process the converted HTML convertfn="/home/ajh/www/tmp/convert.html" if convertXML: try: htmlfile=open(convertfn,'w') except: msg="couldn't open HTML conversion file %s" % convertfn sys.stderr.write("%s: %s\n" % (tsstring,msg)) convertXML=False
Chunk referenced in 5.36
Chunk defined in 5.38,5.39,5.40
<process an XML file 5.40> =
# check that directory exists dirpath=os.path.dirname(htmlpath) # only cache if not an album request gencache=not 'personal/albums' in dirpath if not os.path.isdir(dirpath): os.makedirs(dirpath,0o777) if gencache: htmlfile2=open(htmlpath,'w') for line in pid.stdout.readlines(): line=line.decode('UTF-8') line=line.rstrip() # don't remove trailing blanks! print(line) if gencache: htmlfile2.write(str(line)) if convertXML: htmlfile.write("%s\n" % line) if convertXML: htmlfile.close() if gencache: htmlfile2.close() #os.chmod(htmlpath,0o666) <deal with any conversion errors 5.41> pipeout.close(); pipeerr.close()
Chunk referenced in 5.36
Chunk defined in 5.38,5.39,5.40

Note that in copying the rendered HTML version, we retain the lines as is, and make sure that they are rendered without any additional (or deleted) new lines.

5.15.1.1 Deal with any Conversion Errors

<deal with any conversion errors 5.41> =
errs=[] for line in pipeerr.readlines(): errs.append(line) logfile=PRIVATE+'/xmlerror.log' logfiled=open(logfile,'a') if errs: logfiled.write("%s: %s: ERROR IN REQUEST %s\n" % (tsstring,clientIP,requestedFile)) print("<HR/>\n") print("<H3>%s: MESSAGES GENERATED BY: %s</H3>\n" % (tsstring,requestedFile)) print("<PRE>") for errline in errs: logfiled.write("%s: %s" % (tsstring,errline)) #errline=html.escape(errline) # this line needs UTF FIXING! ************************************ errline=errline.rstrip() print("%s: %s" % (tsstring,errline)) print("</PRE>") print("<p>Please forward these details to ") print("<a href='mailto:ajh@ajhurst.org'>John Hurst</a>") else: logfiled.write("%s: %s: NO ERRORS IN %s\n" % (tsstring,clientIP,requestedFile)) logfiled.close()
Chunk referenced in 5.40

6. The Train Image Viewer viewtrains.py

The literate code for this program has been moved to ViewTrains

7. The Rankings Display Program ranktrains.py

The literate code for this program has been moved to ViewTrains

8. The Train Ranking Module rank.py

The literate code for this program has been moved to ViewTrains

9. The Web Server Module webServer.py

"webServer.py" 9.1 =
#!/usr/bin/python # online.py # # This program looks at all machines known to the ajh network, and # examines them to see if they are on-line. # # version 1.0.0 20160622:104949 # # put imports here: those named are required by this template import datetime import subprocess import socket import re import getopt import os,os.path import sys import shutil # define globals here debug=0; verbose=0 machines=['spencer', #'spencer.ajh.co', 'dimboola', #'ajh.id.au', #'wolseley', 'hamilton.local', #'bittern', #'echuca', 'lilydale', #'albens.ajhurst.org' ] macosx=['dimboola','ajh.id.au','hamilton.local','MU00087507X','bittern','echuca'] linux=['spencer','spencer.ajh.co','wolseley','lilydale','lilydale.local','albens.ajhurst.org'] thismachine=socket.gethostname() ignoreOut=open('/dev/null','w') # define usage here def usage(): print """ This module is intended for use in web page delivery programs. It does not serve any useful purpose in its own right. <flags>= [-d|--debug] print debugging information [-v|--verbose] print more debugging information [-V|--version] print version information The debug flags give (verbose) output about what is happening. """ # define global procedures here # (there are none) # define the key class of this module class location(): def __init__(self,server=thismachine): # determine the server and host names # # the server is the address to which this request was directed, and is # useful in making decisions about what to render to the client. # Examples are "localhost", "www.ajh.id.au", "chairsabs.org.au". # # the host is the machine upon which the server is running, and may be # different from the server. This name is used to determine where to # store local data, such as logging information. For example, the # server may be "localhost", but this can run on a variety of hosts: # "murtoa", "dimboola", dyn-13-194-xx-xx", etc.. Incidentally, hosts # of the form "dyn-130-194-xx-xx" are mashed down to the generic "dyn". 
MacOSX='MacOSX' ; Solaris='Solaris' ; Linux="Linux" ; Ubuntu='Ubuntu' ostype=system=MacOSX # unless told otherwise host=server if server in ["localhost"]: pass elif server in ['ajh.co','www.ajh.co','spencer']: host='spencer' ostype=Linux system=Ubuntu elif server in ['albens','albens.ajhurst.org','45.55.18.15',\ 'ajhurst.org','www.ajhurst.org',\ 'njhurst.com','www.njhurst.com']: #server='ajhurst.org' host='albens' ostype=Linux system=Ubuntu elif server in ['dimboola','dimboola.local',\ 'ajh.id.au','dimboola.ajh.id.au']: host='dimboola' elif server in ['wolseley','wolseley.home.gateway']: server='wolseley' host='wolseley' ostype=Linux system=Ubuntu elif server in ['burnley','burnley.local']: host='burnley' ostype=Linux elif server in ['eregnans.ajhurst.org','regnans.njhurst.com']: host='eregnans' ostype=Linux system=Ubuntu elif server in ['cahurst.org']: host='albens' ostype=Linux system=Ubuntu elif server in ['glenwaverleychurches.org','www.glenwaverleychurches.org']: host='albens' ostype=Linux system=Ubuntu else: sys.stderr.write("server/host values not recognized\n") sys.stderr.write("(supplied values are %s/%s)\n" % (server,host)) sys.stderr.write("terminating\n") sys.exit(1) ostype=Linux system=Ubuntu sys.stderr.write("(assuming (ostype,system)=(%s,%s)\n" % (ostype,system)) self.ostype=ostype self.system=system self.server=server self.host=host pass # define the main program here def main(s): loc=location(s) print "I have gleaned that:" print " this machine = %s" % (thismachine) print " this server = %s" % (loc.server) print " this host = %s" % (loc.host) print " this ostype = %s" % (loc.ostype) print " this system = %s" % (loc.system) pass if __name__ == '__main__': (vals,path)=getopt.getopt(sys.argv[1:],'dvV', ['debug','verbose','version']) for (opt,val) in vals: if opt=='-d' or opt=='--debug': debug=1 if opt=='-v' or opt=='--verbose': verbose=1 if opt=='-V' or opt=='--version': print version sys.exit(0) server=thismachine if len(path)>0: server=path[0] 
main(server)

The various web server programs all make use of a pair of values, known as host and server. The host value defines the machine on which this program is running, while the server value defines the name by which the program was invoked (effectively the domain name of the server). This is necessary, as each program may be invoked by different domain name paths, and the service rendered may be different for each path.

While different domain names may invoke services on the same machine or host, it is never the case that different machines are invoked to handle a single domain name service.

When invoked as a stand-alone program, a test server parameter may be passed in to see what the program determines. This is for testing purposes only, and serves no useful purpose otherwise.
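
The many-to-one relation between server names and hosts can be pictured as a lookup table. The sketch uses only a few of the names from the module; the names KNOWN_SERVERS and resolve are hypothetical, and returning None for an unknown server is a simplification (the module terminates instead).

```python
# (host, ostype, system) for a few of the server names listed in the module
KNOWN_SERVERS = {
    'ajh.co':          ('spencer',  'Linux',  'Ubuntu'),
    'www.ajh.co':      ('spencer',  'Linux',  'Ubuntu'),
    'ajhurst.org':     ('albens',   'Linux',  'Ubuntu'),
    'www.ajhurst.org': ('albens',   'Linux',  'Ubuntu'),
    'ajh.id.au':       ('dimboola', 'MacOSX', 'MacOSX'),
}

def resolve(server):
    """Return (host, ostype, system) for a server name, or None if unknown."""
    if server == 'localhost':
        # localhost runs on whatever machine we are on; MacOSX is the default
        return ('localhost', 'MacOSX', 'MacOSX')
    return KNOWN_SERVERS.get(server)

print(resolve('www.ajhurst.org'))   # prints: ('albens', 'Linux', 'Ubuntu')
```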

10. File Caching

10.1 The File Cache Module

"filecache.py" 10.1 =
"""A module that writes a webpage to a file so it can be restored at a later time Interface: filecache.write(...) filecache.read(...) """ import time import os import md5 import urllib def key(url): k = md5.new() k.update(url) return k.hexdigest() def filename(basedir, url): return "%s/%s.txt"%(basedir, key(url)) def write(url, basedir, content): """ Write content to cache file in basedir for url""" cachefilen=filename(basedir, url) fh = file(cachefilen, mode="w") fh.write(content) fh.close() return cachefilen def read(url, basedir, timeout): """Read cached content for url in basedir if it is fresher than timeout (in seconds)""" cache=0 fname = filename(basedir, url) content = "" if os.path.exists(fname) and \ (os.stat(fname).st_mtime > time.time() - timeout): fh = open(fname, "r") content = fh.read() fh.close() cache=1 return (content,cache)

This code was adapted from an example given on a web page. Sorry, I have forgotten the reference.
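
The module as listed relies on the md5 module and the file() built-in, neither of which exists in Python 3. A possible Python 3 rendering of the same interface is sketched here; it is an adaptation for illustration, not the installed module, and it uses hashlib.md5 only to derive a file name.

```python
import hashlib
import os
import time

def key(url):
    """Hex digest used as the cache file's base name."""
    return hashlib.md5(url.encode('utf-8')).hexdigest()

def filename(basedir, url):
    return "%s/%s.txt" % (basedir, key(url))

def write(url, basedir, content):
    """Write content to the cache file in basedir for url."""
    cachefilen = filename(basedir, url)
    with open(cachefilen, mode="w") as fh:
        fh.write(content)
    return cachefilen

def read(url, basedir, timeout):
    """Read cached content for url if it is fresher than timeout (in seconds)."""
    fname = filename(basedir, url)
    if os.path.exists(fname) and \
       os.stat(fname).st_mtime > time.time() - timeout:
        with open(fname, "r") as fh:
            return (fh.read(), 1)
    return ("", 0)
```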

10.2 Clearing the Cache

This program clears the HTML caches created by the previous module. It is called independently, and can clear either the entire cache, or subdirectories of it.

The cache is maintained on a per-machine basis, and the machine being used is identified by a hostname call.

"clearWebCache.py" 10.2 =

11. Country Database

Addresses in this database come in two forms: CIDR (Classless Inter-Domain Routing) blocks, and pairs of IPv4 addresses. The CIDR form is the standard way of defining a range of addresses; the second form simply gives the first and last addresses of a range, which usually cannot be written as a single CIDR block. This latter form is used where a CIDR block contains an address (or addresses) that need separate handling from the rest of the block.

A third form, infrequently used, is a single address. This is used to indicate a single whitelisted (but could be black) address that is separated out from a range of other addresses. Currently there are only 2 such entries, one for my own server at ajhurst.org (Digital Ocean), and Nathan (part of the Newfold Digital network, otherwise grey listed)

Each line of the database is a network range in one of the three forms of address or address range, followed by a colour flag ('*', ' ', '.' for black, grey, and white respectively), followed by a two-character country indicator, followed by a network group name. One or more blanks separate these fields.
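
A reader for one line of this format might look like the following sketch. It is not the checkCountry code: the names parse_entry and contains are hypothetical, and the treatment of a missing flag as grey follows from the blank grey flag disappearing when the line is split on blanks.

```python
import ipaddress

COLOURS = {'*': 'black', '.': 'white'}   # a blank (absent) flag means grey

def parse_entry(line):
    """Return (first, last, colour, country, group) for one database line."""
    fields = line.split()
    spec, rest = fields[0], fields[1:]
    if '/' in spec:                       # CIDR form
        net = ipaddress.ip_network(spec)
        first, last = net.network_address, net.broadcast_address
    elif ',' in spec:                     # first,last pair
        lo, hi = spec.split(',')
        first, last = ipaddress.ip_address(lo), ipaddress.ip_address(hi)
    else:                                 # single address
        first = last = ipaddress.ip_address(spec)
    colour = 'grey'
    if rest and rest[0] in COLOURS:
        colour = COLOURS[rest[0]]
        rest = rest[1:]
    country = rest[0] if rest else ''
    group = ' '.join(rest[1:])
    return first, last, colour, country, group

def contains(entry, address):
    """True if address lies within the entry's range (inclusive)."""
    return entry[0] <= ipaddress.ip_address(address) <= entry[1]
```

For example, parse_entry('66.96.163.139 . US Nathan') gives a white entry whose first and last addresses coincide.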

"blackListIPs" 11.1 =
3.0.0.0/9 * US Amazon
3.128.0.0/9 * US Amazon
10.0.0.0/8 . AU local network
13.24.0.0,13.59.255.255 * US Amazon
17.0.0.0/8 * US Apple
18.32.0.0/11 * US Amazon
18.64.0.0/10 * US Amazon
18.128.0.0/9 * US Amazon
23.20.0.0/14 * US Amazon
27.0.0.0/21 * ZZ Asia Pacific Network Information Centre
27.0.232.0/24 CA ONEPROVIDER
27.0.233.0/24 AU Adam Berger
27.0.234.0/24 SG Adam Berger
27.0.235.0/24 KR Adam Berger
27.0.236.0/24 KR Kakao Corp
27.0.237.0/24 KR Kakao Corp
27.0.238.0/24 KR Kakao Corp
27.0.239.0/24 KR Kakao Corp
27.0.240.0/24 VN Vingroup Joint Stock Company
44.192.0.0/10 * US Amazon
45.55.0.0/16 . US Digital Ocean
47.128.0.0/16 * SG Amazon
49.0.200.0,49.0.207.255 * SG Huawei
49.185.0.0/17 . AU Optus Internet
50.112.160.3/32 US
51.222.0.0,51.222.253.0 * CA OVH Hosting Montreal
51.222.253.1,51.222.253.19 . CA OVH Hosting Montreal
51.222.253.20,51.222.255.255 * CA OVH Hosting Montreal
52.0.0.0/10 * US Amazon
52.64.0.0/12 * US Amazon
54.36.0.0/15 * NL RIPE Network
54.38.0.0/16 * NL RIPE Network
54.144.0.0/12 * US Amazon
54.160.0.0/11 * US Amazon
54.192.0.0/10 * US Amazon
59.167.194.123/32 . AU iiNet Limited
62.0.0.0/8 * NL RIPE Network
65.108.0.0/15 * NL RIPE Network
66.96.128.0,66.96.163.138 US Newfold Digital
66.96.163.139 . US Nathan
66.96.163.140,66.96.191.255 US Newfold Digital
66.249.64.0/19 * US Google
69.63.176.0/20 * US Facebook
69.171.224.0/19 * US Facebook
85.0.0.0/8 * NL RIPE Network
87.250.224.0/19 RU Yandex
94.0.0.0/13 * NL RIPE Network
94.23.0.0/16 FR OVH ISP Paris
94.24.0.0/13 * NL RIPE Network
94.32.0.0/11 * NL RIPE Network
94.64.0.0/10 * NL RIPE Network
94.128.0.0/9 * NL RIPE Network
100.21.24.205/32 * US Amazon
101.44.160.0/20 * SG HUAWEI
101.44.248.0/22 * SG HUAWEI
104.131.0.0/16 . US Digital Ocean
110.238.104.0/21 * SG HUAWEI
114.119.128.0/18 * SG HUAWEI
119.8.160.0,119.8.191.255 * SG HUAWEI
119.13.96.0,119.13.111.255 * SG HUAWEI
124.243.128.0/18 * SG HUAWEI
141.98.11.0/24 * LT LT-HOSTBALTIC-11
144.76.0.0/16 NL RIPE Network
144.91.64.0/18 NL RIPE Network
148.251.0.0/16 NL RIPE Network
148.252.0.0/15 NL RIPE Network
158.69.0.0/16 CA OVH Hosting Montreal
158.220.0.0/16 * NL RIPE Network
159.138.0.0/16 AU Asia Pacific Network Information Centre
167.160.64.0,167.160.71.255 * US Blazing SEO, LLC BLAZINGSEO-US-108
172.32.0.0/11 US T-Mobile USA
173.252.64.0/18 * US Facebook
176.0.0.0/8 * NL RIPE Network
183.81.169.0/24 * HK Amarutu Technology Ltd.
185.0.0.0/8 * NL RIPE Network
188.165.0.0/16 * FR OVH ISP
190.92.192.0/19 * SG HUAWEI
195.191.218.0/23 GB VeloxServ
213.0.0.0/8 * RU Yandex
216.244.64.0/19 * US Wowrack
217.76.56.0/20 DE Contabo GmbH
"greyListedIPs" 11.2 =
51.222.0.0/16 CA OVH Hosting Montreal

Notes:

RIPE Network
Réseaux IP Européens Network Coordination Centre

12. The Makefile

The Makefile handles the nitty-gritty of copying files to the right places, and setting permissions, etc.

<install python file 12.1> =
install-machine: /tmp/index-machine.py web.tangle chmod a+x /tmp/index-machine.py if [ ${HOST} = machine ] ; then \ cp /tmp/index-machine.py homedir/public_html/cgi-bin/index.py; \ cp location3.py homedir/public_html/cgi-bin/; \ cp checkCountry.py homedir/public_html/cgi-bin/; \ cp IPtools.py homedir/public_html/cgi-bin/; \ cp blackListIPs homedir/local/; \ else \ rsync -auv /tmp/index-machine.py address:homedir/public_html/cgi-bin/index.py; \ rsync -auv location3.py address:homedir/public_html/cgi-bin/; \ rsync -auv checkCountry.py address:homedir/public_html/cgi-bin/; \ rsync -auv IPtools.py address:homedir/public_html/cgi-bin/; \ rsync -auv blackListIPs address:homedir/local/; \ fi /tmp/index-machine.py: index.py sed -e 's#/sw/bin/python#interpreter#' <index.py >/tmp/index-machine.py
Chunk referenced in 12.2

install python file is an XLP macro that takes four formal parameters. These are:

machine
defines the machine for which this python script is to be built. (The target machine)
address
defines the domain name of the target machine.
interpreter
defines the python interpreter to be used in running this script.
homedir
defines the home directory on the target machine.

"Makefile" 12.2 =
RELCWD = /cgi-bin/ WEBPAGE = /home/ajh/www/ WEBPAGE = /home/ajh/public_html/research/literate FILES = $(EMPTY) XSLLIB = /home/ajh/lib/xsl XSLFILES = $(XSLLIB)/lit2html.xsl $(XSLLIB)/tables2html.xsl INSTALLFILES = index.py countries.py CGIS = $(INSTALLFILES) XMLS = $(EMPTY) DIRS = $(EMPTY) include $(HOME)/etc/MakeXLP include $(HOME)/etc/MakeWeb index.py: web.tangle chmod 755 index.py touch index.py web.tangle web.xml: web.xlp xsltproc --xinclude -o web.xml $(XSLLIB)/litprog.xsl web.xlp touch web.tangle web.html: web.xml $(XSLFILES) xsltproc --xinclude $(XSLLIB)/lit2html.xsl web.xml >web.html html: web.html install: web.tangle install-${HOST} web: $(WEBPAGE)/web.html $(WEBPAGE)/web.html: web.html cp -p web.html $(WEBPAGE)/web.html Makefile: web.tangle <install python file 12.1>(machine='albens', address='albens', interpreter='/usr/bin/python3', homedir='/home/ajh') <install python file 12.1>(machine='albury', address='', interpreter='/usr/bin/python3', homedir='/home/ajh') <install python file 12.1>(machine='burnley', address='burnley', interpreter='/home/ajh/binln/python3', homedir='/home/ajh') <install python file 12.1>(machine='everton', address='everton', interpreter='/home/ajh/binln/python3', homedir='/home/ajh') <install python file 12.1>(machine='jeparit', address='jeparit', interpreter='/home/ajh/binln/python3', homedir='/home/ajh') <install python file 12.1>(machine='newport', address='newport', interpreter='/usr/bin/python', homedir='/home/ajh') <install python file 12.1>(machine='reuilly', address='reuilly', interpreter='/home/ajh/binln/python', homedir='/home/ajh') <install python file 12.1>(machine='spencer', address='spencer', interpreter='/home/ajh/binln/python', homedir='/home/ajh') <install python file 12.1>(machine='wodonga', address='wodonga', interpreter='/usr/bin/python3', homedir='/Users/ajh')

The install-system targets cater for the variations in interpreter and home directory required by each of the servers installed by the Makefile. Most home directories are currently the same (/home/ajh), but this is not necessarily the case: wodonga, for example, uses /Users/ajh, as is usual on macOS systems.
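The per-machine variation can be pictured with a small Python sketch. The interpreter and home directory values are copied from the <install python file 12.1> calls above; the table name MACHINES and the functions shebang_line and with_interpreter are hypothetical, and rewriting the first line of the script is only an assumption about what an install step might do, not a description of chunk 12.1 itself.

```python
# Per-machine settings copied from the <install python file 12.1> calls.
# Each entry is (interpreter, homedir).
MACHINES = {
    "albens":  ("/usr/bin/python3", "/home/ajh"),
    "albury":  ("/usr/bin/python3", "/home/ajh"),
    "burnley": ("/home/ajh/binln/python3", "/home/ajh"),
    "everton": ("/home/ajh/binln/python3", "/home/ajh"),
    "jeparit": ("/home/ajh/binln/python3", "/home/ajh"),
    "newport": ("/usr/bin/python", "/home/ajh"),
    "reuilly": ("/home/ajh/binln/python", "/home/ajh"),
    "spencer": ("/home/ajh/binln/python", "/home/ajh"),
    "wodonga": ("/usr/bin/python3", "/Users/ajh"),
}


def shebang_line(machine):
    """Return the '#!' line naming the interpreter for this machine."""
    interpreter, _homedir = MACHINES[machine]
    return "#!" + interpreter


def with_interpreter(script_text, machine):
    """Return script_text with its first line set to the machine's shebang.

    Hypothetical helper: an existing '#!' line is replaced, otherwise
    the new line is prepended.
    """
    lines = script_text.splitlines()
    if lines and lines[0].startswith("#!"):
        lines = lines[1:]
    return "\n".join([shebang_line(machine)] + lines) + "\n"
```

For example, with_interpreter(text, "burnley") would give a copy of the script whose first line names /home/ajh/binln/python3, while MACHINES["wodonga"][1] gives the one home directory that differs from the rest.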

Note that the Makefile has not yet been updated to handle the filecache module.

The machines listed in the <install python file 12.1> calls are those on which we want to run a web server (Apache 2).

13. TODOs

20240513:173848
requestedFile traces now need extra information on private versus public.
20240504:135256
Need better handling of white listed entries. For example, Nathan's address (66.96.163.139) is in the middle of 66.96.128.0/18, which is otherwise grey listed. Want a white listing to override an otherwise grey listing.
20240211:150655
Check that all machines do indeed serve properly. Note that there are some gotchas, like whether the local files are present, correct, and with proper ownership and permissions.
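The white listing TODO above amounts to a precedence rule: test the white list first, and consult the grey list only if no white entry matches. A minimal sketch, using the standard ipaddress module and the two addresses quoted in that entry, might look as follows. The names WHITE_LIST, GREY_LIST and classify are hypothetical, and this is not how the checkCountry module currently stores or scans its lists.

```python
import ipaddress

# Hypothetical in-memory lists, using the addresses quoted in the TODO.
WHITE_LIST = ["66.96.163.139/32"]
GREY_LIST = ["66.96.128.0/18"]


def _matches(ip, networks):
    """True if ip falls inside any of the given CIDR blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(net) for net in networks)


def classify(ip, white=WHITE_LIST, grey=GREY_LIST):
    """Classify ip, letting a white listing override a grey listing."""
    if _matches(ip, white):      # white list is checked first, so it wins
        return "white"
    if _matches(ip, grey):
        return "grey"
    return "unlisted"
```

With these lists, 66.96.163.139 is classified as white even though it lies inside the grey listed /18 block, while its neighbour 66.96.163.140 remains grey.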

14. Indices

14.1 Files

File Name           Defined in
IPrep.py            3.2
IPtools.py          3.1
Makefile            12.2
blackListIPs        11.1
checkCountry.py     4.1, 4.2, 4.4
clearWebCache.py    10.2
filecache.py        10.1
greyListedIPs       11.2
index.py            5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7, 5.8
webServer.py        9.1

14.2 Chunks

Chunk Name                                        Defined in              Used in
check for HTML request                            5.25, 5.26              5.8
check for abbreviated URL                         5.23                    5.8
check for and handle jpegs                        5.16                    5.6
check if xslfile more recent than cached version  5.29                    5.28
checkCountry: checkIP                             4.3                     4.2
collect HTTP request                              5.20                    5.7
current date                                      15.2
current version                                   15.1                    5.1
deal with any conversion errors                   5.41                    5.40
define global variables and constants             5.9                     5.1
define the XSLTPROC program                       5.14                    5.12
define the working directories                    5.13                    5.12
define various string patterns                    5.10                    5.3
determine the host and server environments        5.11, 5.12              5.4
determine xslt file                               5.30                    5.8
filter out bad ips to which we do not respond     5.15                    5.4
get default XSLT file                             5.27                    5.8
get filename from redirect                        5.22                    5.7
get jpg file from remote server                   5.17, 5.18              5.16
get jpg file from remote server removed           5.19
handle redirect query string                      5.21                    5.20
install python file                               12.1                    12.2, 12.2, 12.2, 12.2, 12.2, 12.2, 12.2, 12.2, 12.2
make file and dir absolute                        5.24                    5.8
process an XML file                               5.38, 5.39, 5.40        5.36
process file                                      5.35, 5.36              5.8
render the HTML file                              5.37                    5.36
scan for locally defined XSLT file                5.28                    5.8
update counter                                    5.31, 5.32, 5.33, 5.34  5.8

14.3 Identifiers

Identifier      Defined in  Used in
BASE            5.13
COUNTERS        5.13
HOME            5.13
HTMLS           5.13        5.26
PRIVATE         5.13
alreadyHTML     5.9         5.25, 5.26, 5.28, 5.36
cachedHTML      5.9         5.26, 5.28, 5.36, 5.37
clientIP        5.9
convertXML      5.9         5.21, 5.39, 5.39, 5.40, 5.40
debugFlag       5.9         5.8, 5.20, 5.21, 5.21, 5.22, 5.27, 5.27, 5.28, 5.28, 5.30, 5.30, 5.30, 5.35, 5.38, 5.38
htmlpat         5.10        5.22, 5.25
requestedFile   5.20        5.8, 5.22, 5.22, 5.22, 5.22, 5.22, 5.22, 5.22, 5.22, 5.22, 5.22, 5.22, 5.22, 5.22, 5.22, 5.23, 5.23, 5.23, 5.24, 5.24, 5.24, 5.25, 5.25, 5.25, 5.26, 5.26, 5.26, 5.26, 5.26, 5.26, 5.28, 5.28, 5.35, 5.36, 5.37, 5.37, 5.38, 5.38, 5.41
returnXML       5.9         5.21, 5.36

15. Document History

20080816:144135 ajh 1.0.0 first version under literate programming
20080817:131040 ajh 1.0.1 general restructuring
20080822:162138 ajh 1.0.2 more restructuring
20081102:164507 ajh 1.1.0 added jpg handling and caching
20081106:134033 ajh 1.1.1 added exception handling
20090507:160328 ajh 1.2.0 bug relating to non-~ajh directories fixed; caching of files implemented.
20090701:175934 ajh 1.3.0 added code to cache the converted HTML file. Still to do: creation of subdirectories for cached files.
20090702:182341 ajh 1.3.1 Subdirectories now created. Cached file is touched on each access
20090703:105343 ajh 1.3.2 some literate tidy ups, and renamed variable file to requestedFile to disambiguate it.
20091203:093814 ajh 1.3.3 updated Makefile to install python interpreter dependent files
20120530:134427 ajh 1.3.4 fix bug in glenwaverleychurches handling of counters
ajh 1.3.5 source code only
20160221:165415 ajh 1.3.6 modified for albens
20160407:111509 ajh 1.3.7 further albens changes for cahurst domain
20180531:110300 ajh 1.3.8 add more documentation to <define global variables and constants 5.9>
20190117:153130 ajh 1.3.9 don't cache album requests
20210118:113916 ajh 1.4.0 convert to Python3
20210118:114112 ajh 1.4.1 add filterWebBots.py to literate program
20210124:124701 ajh 1.4.2 collect bad ip adrs from data file
20210814:125748 ajh 1.4.3 bring literate pgm up-to-date
20210814:150425 ajh 1.4.4 add filter on non-Australian IP addresses
20231030:124305 ajh 1.5.0 add non-local jpg retrieval
20231105:085217 ajh 1.6.0 cleaned up filterWebBots
20231105:085257 ajh 1.6.1 add own country database
20231107:173857 ajh 1.6.2 scan db for IP checks
20231108:101419 ajh 1.6.4 remove filter on country test, rely on db
20231124:131647 ajh 1.6.5 fix bug with glenwaverleychurches server
20240121:140805 ajh 1.6.6 minor clean ups to documentation and Makefile
20240430:151827 ajh 1.6.7 minor corrections to the checkCountry module
<current version 15.1> = 1.6.7
Chunk referenced in 5.1
<current date 15.2> = 20240430:151827