Web Page Management Software
John Hurst
Version 1.6.7
20240430:151827
Abstract
This document defines and describes the suite of programs
used to create my web page environment on the range of
machines that I use.
1. Introduction
This document describes the files used to manage delivery of my
personal web pages, and those that I manage for other
organisations. The general form of web page delivery is a) a
source file written in XML, b) a translation file written in
XSLT, and c) the program described here, a python cgi script
that calls the appropriate translator on the source file, and
delivers the result. It also handles straight HTML, as well as
providing some debug and other maintenance options.
The program is invoked by commands in the .htaccess file
associated with each web directory. Different .htaccess
files can be used for different directories. If none exist in a
given directory, the directory path is searched towards the root
until one is found.
The XSLT files used can be specified either in the
.htaccess file (default), or in the source XML file,
through an explicit xml-stylesheet command. If a stylesheet
XSLT file is specified, it overrides the default
.htaccess one.
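The override rule can be sketched as follows. This is a minimal illustration rather than the code used by index.py: the function name pick_xslt and the sample file names are assumptions made for the example, and the regular expression mirrors the stylesheet pattern defined later in this document.

```python
import re

# Mirrors the xml-stylesheet pattern defined later in this document.
STYLESHEET = re.compile(r'<\?xml-stylesheet.*href="(.*)"')

def pick_xslt(xml_text, default_xslt):
    """Return the XSLT named in the XML source, else the .htaccess default."""
    for line in xml_text.splitlines():
        found = STYLESHEET.search(line)
        if found:
            return found.group(1)      # explicit stylesheet overrides the default
        if '<!DOCTYPE' in line:
            break                      # stop scanning once the prolog is over
    return default_xslt

# Hypothetical documents and stylesheet names, for illustration only.
explicit = ('<?xml version="1.0"?>\n'
            '<?xml-stylesheet type="text/xsl" href="album.xsl"?>\n'
            '<album/>')
plain = '<?xml version="1.0"?>\n<page/>'
print(pick_xslt(explicit, "default.xsl"))
print(pick_xslt(plain, "default.xsl"))
```

The first call yields the stylesheet named in the source file; the second falls back to the default.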
Permission is given to reuse this document, provided that the
source is acknowledged, and that any changes are noted in the
documentation.
The document is in the form of a literate program, and
generates all files necessary to maintain the working
environment, including a Makefile.
As of 20240513:174010, there are two URLs now recognized:
server/file and server/~ajh/file. The first is
the original suite of web pages, and the second is the suite of
private pages, protected by an .htpasswd gateway. All the
original pages are accessible on the private URL, but only the
public pages are accessible through the public (~ajh)
path. At some stage, it is intended that these roles will be
reversed, but only after a significant period of warning.
2. Literate Data
<edit warning 2.1> =#
# DO NOT EDIT this file!
# see $HOME/Computers/Sources/Web/web.xlp instead
# this also gives further explanation of the program logic
#
This message flags the fact that the source code is a derived
document, and should not be directly edited.
3. Various IP address facilities
3.1 IP Tools
"IPtools.py" 3.1 =import re

def str2IP(s):
    # converts decimal IP form x.x.x.x to binary
    res=re.match(r'(\d+)\.(\d+)\.(\d+)\.(\d+)',s)
    if res:
        b1=int(res.group(1))
        b2=int(res.group(2))
        b3=int(res.group(3))
        b4=int(res.group(4))
        ip=(b1<<24)+(b2<<16)+(b3<<8)+b4
        return ip
    else:
        return None

def IP2str(ip):
    # converts binary IP to decimal form x.x.x.x (b1 is the low-order byte)
    b1=ip % 256
    r=ip//256
    b2=r % 256
    r=r//256
    b3=r % 256
    r=r//256
    b4=r
    return f"{b4}.{b3}.{b2}.{b1}"

def ipMask(bits):
    # 32 bit mask with the upper 'bits' bits set
    allones=0xffffffff
    mask=allones & (allones << (32-bits))
    return mask

def ip2Net(ip,bits):
    # converts binary ip adr to binary network adr
    mask=ipMask(bits)
    return ip & mask

def thisIPrange(ip,bits):
    # (lowest,highest) binary addresses in the network containing ip
    base=ip2Net(ip,bits)
    incr=pow(2,32-bits)-1
    return (base,base+incr)
IPtools provides a range of functions to facilitate
handling of IP addresses and networks. This module is intended
to be imported by programs in the rest of this suite. The
functions are:
- str2IP(s)
-
converts a given IP string to a 32 bit integer value
- IP2str(ip)
-
the complement to str2IP: convert an integer IP address
to a decimal string representation.
- ipMask(bits)
-
generate an upper integer mask of bits length, useful
for extracting a given size network base
- ip2Net(ip,bits)
-
Convert the IP address ip to the base network address
of length bits
- thisIPrange
-
return a tuple that defines the range of (integer) addresses
for a given base address and network size
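As a worked example of these conversions, the sketch below recomputes one address and its /24 range, and cross-checks the result against Python's standard ipaddress module. The helper names str2ip and ip_range are stand-ins written for this example (so that it runs without importing IPtools); they are assumed to behave like str2IP and thisIPrange.

```python
import ipaddress

def str2ip(s):
    """Dotted-decimal string to 32-bit integer (stand-in for str2IP)."""
    b1, b2, b3, b4 = (int(part) for part in s.split("."))
    return (b1 << 24) + (b2 << 16) + (b3 << 8) + b4

def ip_range(ip, bits):
    """(lowest, highest) address of the /bits network containing ip."""
    mask = 0xffffffff & (0xffffffff << (32 - bits))
    base = ip & mask
    return (base, base + (1 << (32 - bits)) - 1)

addr = str2ip("192.168.1.77")
low, high = ip_range(addr, 24)

# Cross-check against the standard library.
net = ipaddress.ip_network("192.168.1.77/24", strict=False)
assert addr == int(ipaddress.IPv4Address("192.168.1.77"))
assert (low, high) == (int(net.network_address), int(net.broadcast_address))

print(f"{addr:08x} lies in {low:08x}-{high:08x}")
# prints: c0a8014d lies in c0a80100-c0a801ff
```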
3.2 IP decoding program
Converting IP addresses between string forms, hexadecimal values,
and network ranges can be challenging. This program reads a
line, works out which format it is in, and prints the alternatives
in an easy to read format, consisting of:
stringform, hexadecimalform, upperstring, upperhex, networklength
where the string form is 'dec.dec.dec.dec'; hexadecimal is the
32 bit hex representation; upper string and upper hex are the
maximum addresses for this network size, also printed as
network length.
"IPrep.py" 3.2 =from IPtools import *
import re
import sys

while True:
    l=sys.stdin.readline().strip()
    if not l: break
    OK=True
    other=''
    # plain dotted form: d.d.d.d
    res=re.match(r'(\d+)\.(\d+)\.(\d+)\.(\d+)$',l)
    if res:
        IP=int(res.group(1))
        IP=256*IP+int(res.group(2))
        IP=256*IP+int(res.group(3))
        IP=256*IP+int(res.group(4))
        print(f"recognized string format '{l}' = {IP:x}")
    else:
        # network form: d.d.d.d/n
        res=re.match(r'(\d+)\.(\d+)\.(\d+)\.(\d+)/(\d+)$',l)
        if res:
            IP=int(res.group(1))
            IP=256*IP+int(res.group(2))
            IP=256*IP+int(res.group(3))
            IP=256*IP+int(res.group(4))
            net=int(res.group(5))
            (IPb,IPt)=thisIPrange(IP,net)
            IPbs=IP2str(IPb)
            IPts=IP2str(IPt)
            other=f"{IPb:x} {IPt:x} {net} ({IPbs}-{IPts})"
            print(f"recognized network format '{l}' = {IP:x}/{net}")
        else:
            # hexadecimal form: 0xhhhhhhhh with optional /n
            res=re.match(r'0x([0-9abcdef]+)(/(\d+))?',l)
            if res:
                IP=int(res.group(1),base=16)
                if res.group(2):
                    net=int(res.group(3))
                    (IPb,IPt)=thisIPrange(IP,net)
                    IPbs=IP2str(IPb)
                    IPts=IP2str(IPt)
                    other=f"{IPb:x} {IPt:x} {net} ({IPbs}-{IPts})"
                print(f"recognized hexadecimal format '{l}' = {IP:x}")
            else:
                OK=False
                print(f"Cannot recognize {l}")
    if OK:
        # ipstr rather than str, to avoid shadowing the builtin
        ipstr=IP2str(IP)
        print(f"{ipstr:16} {IP:x} {other}")
4. The checkCountry Module
checkCountry.py is a module to provide mechanisms to
check the origin of an http request. It relies upon a database
blackListIPs which stores IPs of known requesters, along
with a flag to indicate whether they are white, grey or black
listed. Any requester not in the database is regarded as grey
listed.
The module provides the class countries; instances of
which have the methods load(), to load the database of
known IP addresses, and checkIP(ips), to check if the IP
string ips is known to the database.
checkIP returns a 4-tuple (known, status,
country, annotation), where:
- known
- is True if the IP is indeed known in the database,
- status
- has a value
-
white (address is whitelisted),
-
black (address is blacklisted), and
-
grey (address has unknown status).
country has an obvious meaning, while annotation
is what is recorded in the WhoIs address
lookup database.
Note that an IP may be known to the database (known is
True), but still have status 'grey'. This happens when not
enough of the behaviour of that IP has been observed to
decide whether it should be black or white.
Currently no attempt is made to update the database with unknown
grey listings. That has to be done manually through an edit of
the blackListIPs file (see later in this literate program).
"checkCountry.py" 4.1 =from IPtools import str2IP, IP2str, ipMask, ip2Net, thisIPrange
import re
import sys

def ip2Net24(ip):
    return ip2Net(ip,24)
Define a few useful functions.
"checkCountry.py" 4.2 =class countries():
    def __init__(self):
        self.dbFileName="/home/ajh/local/blackListIPs"
        self.db=[]

    def IP2str(self,d):
        rtn='' ; sep='.'
        for i in range(4):
            if i==3: sep=''
            rtn+=f"{d[i]:0}{sep}"
        return rtn

    def load(self):
        statusKey={'*':'black',' ':'grey','.':'white'}
        entries=[]
        f=open(self.dbFileName,'r')
        lno=1
        for l in f.readlines():
            res=re.match(r'([0-9.]+),([0-9.]+) *(\*|\.)? *([^ ]+)?( +(.*)\n)?$',l)
            if res:
                # double IP, status, country, annotation
                bg=str2IP(res.group(1))
                en=str2IP(res.group(2))
                status=res.group(3)
                if not status: status=' '
                country=res.group(4)
                annotation=res.group(6)
            else:
                res=re.match(r'([0-9.]+)(/(\d+))? *(\*|\.)? *([^ ]+)?( +(.*)\n)?$',l)
                if res:
                    # single IP, optional /<networkspec>, status, country, annotation
                    bg=str2IP(res.group(1))
                    if res.group(3):net=int(res.group(3))
                    else: net=32
                    ip=ip2Net(bg,net)
                    incr=pow(2,32-net)-1
                    en=bg+incr
                    status=res.group(4) # '.' is white, '*' is black, ' ' is grey
                    if not status: status=' '
                    country=res.group(5)
                    annotation=res.group(7)
                else:
                    print(f"could not match >{l}<")
                    # skip this line, rather than re-use values from the previous entry
                    lno+=1
                    continue
            try:
                status=statusKey[status]
            except KeyError:
                print(f'key error on status={status}')
                status='black'
            #print(f"{lno:4}: {bg:08x}->{en:08x} {status=},{country=},{annotation=}")
            entry=(bg,en,status,country,annotation)
            entries.append(entry)
            lno+=1
        f.close()
        self.db=entries

    def rangeIP(self):
        for (b,e,st,c,a) in self.db:
            print(f"{b:08x} to {e:08x}")

    <checkCountry: checkIP 4.3>
The class countries is a database of IP addresses derived
from the file /home/ajh/local/blackListIPs, with entries of
the form:
IP adr in dot form/network size status country annotation
where status is a character ('*'|' '|'.'), country is a 2 letter
country abbreviation, and annotation is an indication of the
network owner. The status character is (resp.) black | grey |
white.
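A minimal sketch of how one such line could be taken apart is given below. It is not the load() method itself: the pattern is a simplified adaptation of the one used there, parse_entry is a name invented for this example, and the sample lines use reserved documentation addresses with made-up country codes and annotations.

```python
import re

# A missing status character means grey.
STATUS = {'*': 'black', '.': 'white', None: 'grey'}
LINE = re.compile(r'([0-9.]+)(?:/(\d+))? *(\*|\.)? *([^ ]+)?(?: +(.*))?$')

def parse_entry(line):
    """Split one line into (address, netsize, status, country, annotation)."""
    m = LINE.match(line.rstrip("\n"))
    if not m:
        return None
    adr, net, flag, country, note = m.groups()
    # no /size given means a single address, i.e. a 32 bit network
    return (adr, int(net) if net else 32, STATUS[flag], country, note)

# Hypothetical entries, for illustration only.
print(parse_entry("192.0.2.0/24 * ZZ Example Hosting"))
print(parse_entry("198.51.100.7 . AU"))
print(parse_entry("203.0.113.0/24 ZZ"))
```

The three lines show a blacklisted network with an annotation, a whitelisted single address, and a network with no status character, which is treated as grey.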
<checkCountry: checkIP 4.3> =def checkIP(self,ips,debug=False):
    # here is where we check for white listed IPs
    white=False
    f=open('/home/ajh/local/guestIP','r')
    gips=f.read().strip()
    ip=str2IP(gips)
    white=(ips==gips)
    #sys.stderr.write(f"Guest testing gives {gips=}, {ips=}, {white=}\n")
    if white: return (True,'white',ip,'')
    # check for IPv4/IPv6 localhost
    if ips[0]==':' or ips=='127.0.0.1':
        # IPv6 and IPv4, assume local host.
        return(True,'white','AU','localhost')
    # end of white testing
    cb=str2IP(ips)
    if debug: print(f"ips={ips}, binary={cb:x}")
    st='grey'
    for (bg,en,st,cn,an) in self.db:
        if debug: print(f"checking {bg:x}<={cb:x}<={en:x} ({cn}-{an},st={st})")
        if cb<bg: break
        if bg<=cb and cb<=en:
            if debug: print(f"bb={bg:x}, cb={cb:x}, en={en:x}, cn={cn}, (an={an})")
            return (True,st,cn,an)
        pass
    pass
    return (True,'grey','unknown','')
checkIP checks the given IP address against the database,
and returns a tuple (isgood,status,country,annotation)
where isgood is a boolean, status is one of
('white','grey','black'), country is the country of origin, and
annotation is the owner of the IP.
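The range lookup can be sketched in isolation as below. Two things differ from checkIP and should be read as assumptions: the sketch uses a binary search (the standard bisect module) over range start addresses instead of a linear scan, and it reports an address found in no range with the first tuple element False, following the description of known given earlier. The table entries are invented, using reserved documentation networks.

```python
import bisect

# Hypothetical entries (first, last, status, country, annotation), sorted by first.
DB = [
    (0xC0000200, 0xC00002FF, 'black', 'ZZ', 'example net 1'),   # 192.0.2.0/24
    (0xC6336400, 0xC63364FF, 'white', 'ZZ', 'example net 2'),   # 198.51.100.0/24
]
STARTS = [entry[0] for entry in DB]

def lookup(ip):
    """Return (known, status, country, annotation) for a 32 bit address."""
    i = bisect.bisect_right(STARTS, ip) - 1   # last range starting at or before ip
    if i >= 0 and ip <= DB[i][1]:
        _, _, status, country, note = DB[i]
        return (True, status, country, note)
    return (False, 'grey', 'unknown', '')     # not in any range: grey by default

print(lookup(0xC0000205))   # inside the first range
print(lookup(0xC0000300))   # just past the first range
```

Both approaches rely on the entries being sorted by start address; the linear scan in checkIP depends on that ordering for its early exit.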
"checkCountry.py" 4.4 =def main():
    cnt=countries()
    cnt.load()
    #cnt.rangeIP()
    while True:
        ip=sys.stdin.readline().strip()
        if not ip: break
        (isgood,st,cn,a)=cnt.checkIP(ip,debug=True)
        aa=''
        if a: aa=f" (note: {a})"
        print(f"{ip} returns {isgood}/{st} from country {cn}{aa}")

if __name__=="__main__":
    main()
5. The Main Program index.py
"index.py" 5.1 =
This program is written in Python3, and relies upon Python
v3.6 or later, as it uses f strings
in many of its output statements. /home/ajh/binln is a
common directory used to identify where required versions of
interpreters are to be found.
It is not compatible with python3.11 or later, as it uses
features now deprecated in 3.11 and later. Hence it is
restricted to python 3.10. (This may change in future, once the
deprecated features are isolated and corrected.)
We insert the usual edit warning to alert any code editor
(person) to the dangers of directly editing constructed code. The
version number is inserted into rendered HTML text.
The showBegins flag turns on tracing output on the stderr
output to assist in identifying where program failures may
occur, and linking these back to the literate code. Other
global variables are also identified up front, to further assist
debugging the raw code.
Note the convention that where a code fragment is defined in
several chunks, a begin n comment is inserted to assist
debugging, and back referencing into this literate program.
This script processes all my web page XML files.
It requires Apache to be configured:
-
to allow python files (this file) to run as cgi in user
directories
-
to add a handler for XML files that call this program
-
to pass as a cgi parm the XSLT file that translates the XML file
These are done in a .htaccess file for each directory (and its
subdirectories) that require XML processing with a particular
XSLT stylesheet.
The script relies upon picking up the required file and its XSLT
file from a) the REDIRECT environment variables, and b) the
script parameter, respectively.
The interpreter required varies according to the target server.
This detail is captured by the <Makefile 12.2> script,
although not all systems have yet been encoded into the Makefile
script. The python script also changes its behaviour depending
upon the host on which it is running. This is done by an
explicit call to os.environ (chunk
<determine the host and server environments 5.11,5.12>).
"index.py" 5.2 =#begin 1
import sys
if showBegins: sys.stderr.write(f"begin 1\n")
import cgi
import cgitb ; cgitb.enable()
import checkCountry
import datetime
import html
import io
import os, os.path
import re
from subprocess import PIPE,Popen,getstatusoutput
import time
from urllib.request import urlopen
import urllib.parse
import xml.dom.minidom
Gather together all the module and library imports needed for
this program.
"index.py" 5.3 =#begin 2
if showBegins: sys.stderr.write(f"begin 2\n")
<define various string patterns 5.10>
now=datetime.datetime.now()
tsstring=now.strftime("%Y%m%d:%H%M")
todayStr=now.strftime("%d %b %Y")
htmlmod=xmlmod=0
Start processing. Get a timestamp for recording key events in
the log. Set the modification times to year dot.
"index.py" 5.4 =
This first code chunk here identifies who and what is being
invoked in this instantiation of the server.
The second part identifies known IP addresses that have a track
record of abusing the server. The algorithms used are somewhat
heuristic.
"index.py" 5.5 =# begin 4 - index.py
if showBegins: sys.stderr.write(f"begin 4\n")
def debug(loc,msg):
    if debugFlag: print(msg)
Define a useful debug routine.
"index.py" 5.6 =# begin 5 - index.py
if showBegins: sys.stderr.write("begin 5\n")
<check for and handle jpegs 5.16>
# start the html output
print("Content-type: text/html\n")
#sys.stderr.write(f"NEW BOT TESTING SYSTEM!\n")
#sys.stderr.write(f"*** server={server} ***\n")
This fragment simply outputs the required header for flagging
the generated content as HTML, and builds a number of string
matching patterns for later use.
"index.py" 5.7 =
Collect relevant parameters from the original request.
"index.py" 5.8 =# begin 7 - index.py
if showBegins: sys.stderr.write("begin 7\n")
<check for abbreviated URL 5.23>
<make file and dir absolute 5.24>
<check for HTML request 5.25,5.26>
<get default XSLT file 5.27>
<scan for locally defined XSLT file 5.28>
<determine xslt file 5.30>
<update counter 5.31,5.32,5.33,5.34>
if debugFlag:
    print("\n<p>\n")
    print("%s: server = %s<br/>" % (tsstring,server))
    print("%s: host = %s<br/>" % (tsstring,host))
    print("%s: dir = %s<br/>" % (tsstring,dir))
    print("%s: requestedFile = %s<br/>" % (tsstring,requestedFile))
    print("%s: relcwd = %s<br/>" % (tsstring,relcwd))
    print("%s: relfile = %s<br/>" % (tsstring,relfile))
    print("%s: counter = %s<br/>" % (tsstring,counterName))
    print("%s: alreadyHTML = %s<br/>" % (tsstring,alreadyHTML))
    print("%s: cachedHTML = %s<br/>" % (tsstring,cachedHTML))
    print("%s: os.environ = %s\n</p>\n" % (tsstring,repr(os.environ)))
    print("%s: docRoot = %s\n</p>\n" % (tsstring,docRoot))
<process file 5.35,5.36>
now=datetime.datetime.now()
tsstring=now.strftime("%Y%m%d:%H%M")
sys.stderr.write(f"{tsstring}: [{remoteAdr}] request satisfied\n\n")
The major work in rendering the required page is done by the <process file 5.35,5.36> code chunks. After this point, processing
is complete, and the program falls through to exit.
5.1 Define Global Variables and Constants
<define global variables and constants 5.9> =
returnXML is set True when the display of the
raw untranslated XML is required.
convertXML is set True when a converted copy of
the translated XML is required to be saved.
alreadyHTML is set True when the incoming file
to be rendered is already in HTML and does not require
conversion.
cachedHTML is set True when the incoming file
to be rendered has been cached in the HTMLS directory,
and does not require conversion.
5.2 Define Various String Patterns
<define various string patterns 5.10> =# - to extract directory and filename from request
filepat=re.compile('/~ajh/?(.*)/([^/]*)$')
filename='index.xml'
# - to detect stylesheet request (optional)
stylesheet=re.compile('<\?xml-stylesheet.*href="(.*)"')
# - to terminate file scanning
doctype=re.compile('<!DOCTYPE')
# to check for missing htmls
htmlpat=re.compile('(.*)\.html$')
# to check for xslspecification in htaccess
xslspec=re.compile('.*?xslfile=(.*)&')
Somewhat self-explanatory string matching patterns.
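To make the patterns a little less self-explanatory, here is what three of them extract from some sample inputs. The request paths and the query string are invented for the example; the patterns are those defined above, written as raw strings.

```python
import re

filepat = re.compile(r'/~ajh/?(.*)/([^/]*)$')
htmlpat = re.compile(r'(.*)\.html$')
xslspec = re.compile(r'.*?xslfile=(.*)&')

# directory and file name from a request under ~ajh
print(filepat.match('/~ajh/personal/index.xml').groups())   # ('personal', 'index.xml')
# a file directly under ~ajh gives an empty directory part
print(filepat.match('/~ajh/index.xml').groups())            # ('', 'index.xml')
# the stem of an .html request, ready to have '.xml' appended
print(htmlpat.match('research/papers.html').group(1))       # research/papers
# the XSLT file named in an .htaccess style parameter string
print(xslspec.match('xslfile=album.xsl&debug=true').group(1))   # album.xsl
```

Note that xslspec requires a trailing '&' after the file name, so a parameter string ending in the file name itself does not match.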
5.3 Determine the Host and Server Environments
<determine the host and server environments 5.11> =# determine the host and server environments
(exitcode,host)=getstatusoutput('hostname')
host=re.split('\.',host)[0] # break off leading part before the '.' char
#sys.stderr.write(f'{os.environ}\n')
try:
    host=os.environ["HOSTNAME"]
except KeyError:
    cmd='/bin/hostname'
    pid=Popen(cmd,shell=True,stdout=PIPE,stderr=PIPE,close_fds=True)
    host=pid.communicate()[0].strip().decode('UTF-8')
try:
    server=os.environ["SERVER_NAME"]
except KeyError:
    server='localhost'
try:
    docRoot=os.environ["DOCUMENT_ROOT"]
except KeyError:
    docRoot='/Users/ajh/www'
try:
    requestUri=os.environ["REQUEST_URI"]
except KeyError:
    sys.stderr.write('No REQUEST_URI in call environment, using default')
    requestUri='/~ajh/index.xml'
    pass
if '~ajh' in requestUri:
    docRoot='/home/ajh/public_html'
# else leave as '/var/www/html'
#sys.stderr.write(f'requestUri={requestUri}, docRoot={docRoot}\n')
if "SCRIPT_URL" in os.environ:
    URL=os.environ["SCRIPT_URL"]
elif "REDIRECT_URL" in os.environ:
    URL=os.environ["REDIRECT_URL"]
else:
    URL="We got a problem"
<determine the host and server environments 5.12> =MacOSX='MacOSX' ; Solaris='Solaris' ; Linux="Linux" ; Ubuntu='Ubuntu'
if server in ["localhost"]:
    host='localhost'
    ostype=Linux
    system=Ubuntu
elif server in ['121.200.25.188']:
    host='spencer'
    ostype=Linux
    system=Ubuntu
elif server in ['burnley','burnley.local','10.0.0.8']:
    host='burnley'
    ostype=Linux
    system=Ubuntu
elif server in ['ajh.co','www.ajh.co','ajh.id.au','www.ajh.id.au']:
    host='spencer'
    ostype=Linux
    system=Ubuntu
elif server in ['ajhurst.org','www.ajhurst.org',\
                'albens.ajhurst.org','45.55.18.15',\
                'njhurst.com','www.njhurst.com']:
    server='ajhurst.org'
    host='albens'
    ostype=Linux
    system=Ubuntu
elif server in ['newport','newport.local','newport.home.gateway','10.0.0.3']:
    server='newport'
    host='newport'
    ostype=Linux
    system=Ubuntu
elif server in ['spencer','spencer-fast']:
    ostype=Linux
    system=Ubuntu
elif server in ['cahurst.org','www.cahurst.org']:
    server='cahurst.org'
    host='albens'
    ostype=Linux
    system=Ubuntu
elif server in ['glenwaverleychurches.org','www.glenwaverleychurches.org']:
    host='albens'
    ostype=Linux
    system=Ubuntu
    docRoot='/home/ajh/public_html/parish/GWICC'
#elif host in ['newport']:
#    server=newport
#    ostype=Linux
#    system=Ubuntu
else:
    sys.stderr.write("server/host values not recognized\n")
    sys.stderr.write("(supplied values are %s/%s)\n" % (server,host))
    sys.stderr.write("aborting\n")
    sys.exit(1)
    ostype=Linux
    system=Ubuntu
    sys.stderr.write("(assuming (ostype,system)=(%s,%s)\n" % (ostype,system))
<define the working directories 5.13>
<define the XSLTPROC program 5.14>
The server is the address to which this request was
directed, and is useful in making decisions about what to
render to the client. Examples are "localhost",
"www.ajh.id.au", "chairsabs.org.au".
The host is the machine upon which the server is
running, and may be different from the server. This name is
used to determine where to store local data, such as logging
information. For example, the server may be "localhost", but
this can run on a variety of hosts: "murtoa", "dimboola",
"dyn-13-194-xx-xx", etc.
<define the working directories 5.13> =# begin 3.2a - define the working directories
if showBegins: sys.stderr.write("begin 3.2a\n")
# BASE is the path to the web base directory - with no trailing slash!
if system==MacOSX:
    #sys.stderr.write("MacOS\n")
    HOME="/home/ajh/"
    BASE="/home/ajh/www"
    PRIVATE="/home/ajh/local/"+server
elif system==Ubuntu and server=='cahurst.org':
    #sys.stderr.write("Ubuntu,cahurst\n")
    HOME="/home/ajh"
    BASE="/home/ajh/www/personal/cahurst"
    PRIVATE="/home/ajh/local/"+server
elif system==Ubuntu and docRoot=='/var/www/html':
    #sys.stderr.write("Ubuntu,/var/www/html\n")
    HOME="/home/ajh/"
    BASE="/var/www/html"
    PRIVATE="/home/ajh/local/"+server
elif system==Ubuntu:
    #sys.stderr.write("Ubuntu\n")
    if debugFlag: print("docRoot=%s" % (docRoot))
    if docRoot == '/home/ajh/public_html/parish/GWICC':
        #sys.stderr.write("Ubuntu,GWICC\n")
        HOME="/home/ajh/public_html/parish/GWICC"
        BASE=HOME
        PRIVATE="/home/ajh/local/"+server+"/parish/GWICC"
    elif docRoot == '/home/ajh/public_html/personal/cahurst':
        #sys.stderr.write("Ubuntu,cahurst\n")
        HOME="/home/ajh/public_html/personal/cahurst"
        BASE=HOME
        PRIVATE="/home/ajh/local/"+server+"/personal/cahurst"
    elif docRoot == '/var/www/html':
        #sys.stderr.write("Ubuntu,/var/www/html\n")
        HOME="/home/ajh/public_html"
        BASE=docRoot
        PRIVATE="/home/ajh/local/"+server
    else:
        # ajh public web pages
        HOME="/home/ajh/"
        BASE="/home/ajh/public_html"
        PRIVATE="/home/ajh/local/"+server
COUNTERS=PRIVATE+"/counters/"
HTMLS=PRIVATE+"/htmls/"
WEBDIR="file://"+BASE
if debugFlag:
    # done this way because only python 3.6 on some machines
    msg=f"docRoot={docRoot},BASE={BASE},HOME={HOME}\n"
    sys.stderr.write(msg)
# end 3.2a - define the working directories
Note: This section needs a bit more work to distinguish
the new 001-default host.
BASE is set to the path to the web root directory on
this server. It should not have a trailing slash!
HOME has its usual Unix meaning.
PRIVATE is set to the path to a working directory on
this particular server that is used to store accounting and
audit information about this particular access. The path
includes a specific reference to the server hostname to
uniquely distinguish it. This directory basename is rendered
on the web page as the parameter server and is the
first of the "server@host" pairs rendered at the top of the
web page.
COUNTERS is the path to the directory containing all
the web page access counts. Each counter is incremented on
page access, whether to the cached HTML, or to the
(re)rendered XML file.
HTMLS is the path to a local copy of html versions of
the files. These are cached versions, and some mechanism to
age and delete needs to be identified. If the corresponding
XML file is older than the HTML file found in this
subdirectory, the HTML version is used.
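The cache test just described amounts to a comparison of modification times. The sketch below shows one way to express it; use_cached_html is a name invented here, the file names are placeholders, and the timestamps are set artificially so that the outcome does not depend on when the example is run.

```python
import os
import tempfile

def use_cached_html(xml_path, html_path):
    """True when a cached HTML copy exists and is at least as new as its XML source."""
    if not os.path.exists(html_path):
        return False
    return os.path.getmtime(html_path) >= os.path.getmtime(xml_path)

with tempfile.TemporaryDirectory() as work:
    xml_file = os.path.join(work, "index.xml")
    html_file = os.path.join(work, "index.html")
    for name in (xml_file, html_file):
        with open(name, "w") as out:
            out.write("placeholder\n")
    os.utime(xml_file, (1000, 1000))    # source last changed at t=1000
    os.utime(html_file, (2000, 2000))   # cache written later, at t=2000
    fresh = use_cached_html(xml_file, html_file)
    os.utime(xml_file, (3000, 3000))    # source edited after the cache was made
    stale = use_cached_html(xml_file, html_file)

print(fresh, stale)
```

In the first case the cached copy would be served; in the second the XML would be rendered again.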
<define the XSLTPROC program 5.14> =# begin 3.2b - define the XSLTPROC program
if showBegins: sys.stderr.write("begin 3.2b\n")
# define the XSLTPROC
if system in ['MacOSX','Linux','Ubuntu']:
    XSLTPROC="/usr/bin/xsltproc"
else:
    # no other option
    sys.stderr.write(f"No XSLTPROC defined for system={system}\n")
    sys.exit(1)
# end 3.2b - define the XSLTPROC program
XSLTPROC is the path to the xsltproc processor.
Without this processor, this entire script (as far as XML
files are concerned) is meaningless!
5.4 Filter Out Bad IPs to Which We Do Not Respond
<filter out bad ips to which we do not respond 5.15> =def filteredResponse(arg=None):
print("Content-type: text/html\n")
print("<H1>NOT AVAILABLE</H1>")
print('''
<p>
You have been filtered, and this page is blocked against you.
Please write to the author if you think there is some mistake.
</p>
''')
if showBegins: sys.stderr.write(f"begin 3b\n")
if "REMOTE_ADDR" in os.environ:
clientIP=os.environ["REMOTE_ADDR"]
else:
sys.stderr.write(f"key error for REMOTE_ADDR or REDIRECT_URL\n")
sys.exit(1)
if 'REDIRECT_URL' in os.environ:
requestedFile=os.environ['REDIRECT_URL']
else:
sys.stderr.write(f"key error for REDIRECT_URL\n")
sys.exit(1)
# check my database. Only fail if ip is known and flagged as bad
if showBegins: sys.stderr.write(f"begin 3b1\n")
cnt=checkCountry.countries()
if showBegins: sys.stderr.write(f"begin 3b2\n")
cnt.load()
(isgood,st,cn,a)=cnt.checkIP(clientIP)
if showBegins: sys.stderr.write(f"begin 3b3\n")
# isgood means we know about this ip, regard as OK for now
# st=='black' means it is bad, abort now
rqf=f"(requestedFile={requestedFile})"
if st=='black':
sys.stderr.write(f"{clientIP:>15} black, {cn}-{a} {rqf}\n")
filteredResponse()
sys.exit(1)
elif st=='grey':
sys.stderr.write(f"{clientIP:>15} grey, {cn}-{a} allowed for now, {rqf}\n")
else:
if isgood:
sys.stderr.write(f"{clientIP:>15} OK, {cn}-{a} {rqf}\n")
else:
sys.stderr.write(f"{clientIP:>15} unknown-ignored, {rqf}, ignored\n")
filteredResponse()
sys.exit(1)
# check if country OK
goodIP=filterOnCountry(clientIP)
if not goodIP:
sys.stderr.write(f"ip={clientIP} but fails country {cn}\n")
filteredResponse()
sys.exit(1)
We maintain a database of IP addresses, categorized three ways:
- 'white': this IP is allowed.
- 'grey': no information on this IP as yet. Allow for now.
- 'black': this IP is filtered out.
The variable st returned by checkIP is set to
one of these values. checkIP also returns
isgood, True if the access is allowed; cn, the
country from which the request originated; and a, an
annotation or other information about the IP address.
The database is described in section
<Country Database >.
5.5 Handle JPEGs
From version 1.2.0 onwards, this code implements a form of
caching for jpg files. A local check for the request file is
made, and if it is not found, an attempt to retrieve it from
the dimboola server is made. If that is not successful, the
file is reported not found. If it is successful, the file is
saved locally. No attempt is made to age files out of the
cache.
<check for and handle jpegs 5.16> =#begin 5.1
if showBegins: sys.stderr.write("begin 5.1\n")
#sys.stderr.write("Just a check version %s\n" % (version))
cachetime=60*60*24*7 # one week
# check for jpgs
if 'REQUEST_URI' in os.environ:
    uri=os.environ['REQUEST_URI']
    (scheme,netloc,path,parms,query,fragment)=urllib.parse.urlparse(uri)
    #sys.stderr.write("path=%s\n" % path)
    filename=re.sub('/~ajh/','/home/ajh/www/',path)
    (root,ext)=os.path.splitext(filename)
    ext=ext.lower()
    if ext=='.jpg':
        basedir=os.path.dirname(filename)
        #sys.stderr.write(f"filename={path}\n")
        if os.path.exists(filename):
            sys.stderr.write(f"{tsstring}: Got file {filename} locally\n")
            # binary mode: jpeg data is not text
            f=open(filename,'rb').read()
        else:
            sys.stderr.write(f"local file {path} not available. Checking ajh.co\n")
            sys.exit(0)
            <get jpg file from remote server 5.17,5.18>
        print("Content-Type: image/jpeg\n")
        #print(f) # display the image
        sys.exit(0)
    else:
        pass
else:
    sys.stderr.write("No Request_URI\n")
This code checks to see if the request is for a jpg image
file. These are cached, and if not present, are retrieved
from the master jpg server for my jpeg images. This is still
a bit experimental. The server URL is
dimboola.infotech.monash.edu.au/~ajh/Pictures.
It requires that the .htaccess file be modified to
refer .jpg requests to this cgi script.
5.5.1 Get JPG File from Remote Server
<get jpg file from remote server 5.17> =newurl="http://ajh.co%s" % path
sys.stderr.write("{0}: using url {1}\n".format(tsstring,newurl))
urlobj=urlopen(newurl)
f=urlobj.read()
modtimestr=urlobj.info()['Last-Modified']
modtime=time.strptime(modtimestr,"%a, %d %b %Y %H:%M:%S %Z")
Generate the URL of the corresponding remote JPG file, and
issue read request. By using the
urllib.request
library, we also get the modification time, which we parse
in order to set the correct modification time on the locally
cached copy.
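Parsing the header can be tried on its own, as in the sketch below. The function name last_modified_epoch and the sample header value are assumptions for the example. Since HTTP dates are always expressed in GMT, the sketch converts the parsed time with calendar.timegm, which treats it as UTC; this is a choice made for this example.

```python
import calendar
import time

def last_modified_epoch(header_value):
    """Convert an HTTP Last-Modified value (always GMT) to seconds since the epoch."""
    parsed = time.strptime(header_value, "%a, %d %b %Y %H:%M:%S %Z")
    return calendar.timegm(parsed)   # timegm interprets the struct_time as UTC

stamp = last_modified_epoch("Wed, 21 Oct 2015 07:28:00 GMT")
print(stamp, time.gmtime(stamp).tm_hour)
```

Converting with time.mktime instead would interpret the same fields as local time, shifting the result by the local offset from GMT.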
<get jpg file from remote server 5.18> =sys.stderr.write(f"Would be caching {filename}, but currently disabled\n")
<get jpg file from remote server removed 5.19> =try:
    # binary mode: f holds the bytes returned by urlobj.read()
    fc=open(filename,'wb')
    fc.write(f)
    fc.close()
    #touch filename -mt time.strftime("%Y%m%d%H%M.%S")
    mtime=time.mktime(modtime)
    imtime=int(mtime)
    nowtime=time.localtime()
    currtime=int(time.mktime(nowtime)) # local
    os.utime(filename,(currtime,imtime))
    #sys.stderr.write("%s: cached %s\n" % (tsstring,filename))
except (IOError,OSError):
    #errmsg=os.strerror(errcode)
    sys.stderr.write("%s: Cannot write cache file %s\n" % (tsstring,filename))
Now try to cache a local copy. This can fail for several
reasons, the main one being that the permissions in the
local directory are likely to be against (write) access by
the www user. More work is required to make this a
bit more robust.
Note that we set the pair (access time, modification time)
on the local file to be the current time and remote file
modification time respectively. This ensures that attempts
to synchronize the two file systems will see this file as
the same file as the remote file, and not attempt to update
one or the other (thus leading to spurious modification
times).
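The effect of the (access time, modification time) pair can be seen in a small self-contained sketch. The helper cache_copy, the file name and the byte string are invented for the example; the bytes are not a real image, and the modification time is an arbitrary value standing in for the remote file's timestamp.

```python
import os
import tempfile
import time

def cache_copy(path, data, remote_mtime):
    """Write data (bytes) to path, keeping the remote file's modification time."""
    with open(path, "wb") as out:      # binary mode: image data is bytes
        out.write(data)
    # access time is now; modification time is that of the remote file
    os.utime(path, (int(time.time()), int(remote_mtime)))

with tempfile.TemporaryDirectory() as work:
    target = os.path.join(work, "photo.jpg")
    cache_copy(target, b"\xff\xd8\xff\xe0 not a real image", 1445412480)
    recorded = int(os.path.getmtime(target))

print(recorded)
```

After the call, the local copy reports the remote modification time, so a later synchronization comparing the two files by timestamp sees them as the same version.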
5.6 Collect HTTP Request
<collect HTTP request 5.20> =# collect the original parameters from the redirect (if there is one!)
if 'REDIRECT_QUERY_STRING' in os.environ:
    <handle redirect query string 5.21>
else:
    form={}
requestedFile=""
{Note 5.20.1}
remoteAdr=''
if 'REMOTE_ADDR' in os.environ:
    remoteAdr=os.environ['REMOTE_ADDR']
if debugFlag:
    print("<p>%s: (server,host)=(%s,%s)<br/>\n" % (tsstring,server,host))
    print("%s: (system,PRIVATE)=(%s,%s)</p>\n" % (tsstring,system,PRIVATE))
    print("%s: (BASE,HOME,PRIVATE)=(%s,%s,%s)</p>\n" % (tsstring,BASE,HOME,PRIVATE))
- {Note 5.20.1}
- initialize the filename of the file to be rendered. Most of the work in computing the value of this variable is done in <get filename from redirect 5.22>
When this script is called, it has gained control by virtue
of an .htaccess directive to Apache to use this program to
render the source file. The name of that source file has to
be recovered somehow, and different systems seem to handle
this parameter in different ways. The first parameter to
explore is the REDIRECT_QUERY_STRING, which, if it is
present in the form request, contains secondary parameters to
the rendering operation. If this parameter is not present,
initialize the variable form to an empty value.
5.6.1 Handle REDIRECT QUERY STRING
<handle redirect query string 5.21> =query_string=os.environ['REDIRECT_QUERY_STRING']
form=urllib.parse.parse_qs(query_string)
if 'debug' in form and form['debug'][0]=='true':
    sys.stderr.write("%s: %s\n" % (tsstring,repr(form)))
    debugFlag=True
    print("<h1>%s: INDEX.PY version %s</h1>\n" % (tsstring,version))
    print("<p>%s: os.environ=%s</p>\n" % (tsstring,repr(os.environ)))
    print("<p>%s: form=%s</p>\n" % (tsstring,repr(form)))
    sys.stderr.write("%s: redirect_query string=%s\n" % (tsstring,query_string))
if 'xml' in form:
    if form['xml'][0]=='true':
        sys.stderr.write("%s: %s\n" % (tsstring,repr(form)))
        returnXML=True
        if debugFlag:
            print("<p>%s: os.environ=%s</p>\n" % (tsstring,repr(os.environ)))
            print("<p>%s: form=%s</p>\n" % (tsstring,repr(form)))
            sys.stderr.write("%s: redirect_query string=%s\n" % \
                             (tsstring,query_string))
    elif form['xml'][0]=='convert':
        convertXML=True
There are several possibilities for secondary parameters.
The primary one is the debug parameter, which can
be set to true. This sets debugFlag, indicating that
debugging information is to be printed along with the rendering.
This is intended for administrator access only, but as it is
harmless, there is no authentication required.
The other parameter that can be offered at this point is
the xml parameter, with values of true or
convert. The first of these forces no conversion
of the XML, but simply copies it to the browser,
substituting escape sequences for any special XML character
sequences so that it appears as verbatim XML.
The second choice, convert, allows the use of the
rendering engine as an XML-to-HTML converter, in which case
a copy of the converted HTML is saved to a temporary file.
This file can be used subsequently as a statically converted
file as necessary.
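The handling of these two parameters can be sketched separately from the rest of the script. The function read_options is invented for this example and returns the three flags as a tuple, where the code above sets global variables; the query strings and the XML fragment are likewise made up. The last line shows the kind of escaping that xml=true implies, using the standard html.escape.

```python
import html
import urllib.parse

def read_options(query_string):
    """Return (debugFlag, returnXML, convertXML) for a redirect query string."""
    form = urllib.parse.parse_qs(query_string)
    debug_flag = form.get('debug', [''])[0] == 'true'
    xml_mode = form.get('xml', [''])[0]
    return (debug_flag, xml_mode == 'true', xml_mode == 'convert')

print(read_options('debug=true&xml=convert'))   # (True, False, True)
print(read_options('xml=true'))                 # (False, True, False)
print(read_options(''))                         # (False, False, False)

# verbatim display of XML: special characters become escape sequences
print(html.escape('<page title="A & B"/>'))
```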
5.7 Get Filename from Redirect
<get filename from redirect 5.22> =# get the file name from the redirect environment
if system==MacOSX:
scriptURL='REQUEST_URI'
elif system==Solaris:
scriptURL='REDIRECT_URL'
elif system==Linux:
scriptURL='REDIRECT_URL'
elif system==Ubuntu:
scriptURL='REDIRECT_URL'
if scriptURL in os.environ:
requestedFile=os.environ[scriptURL]
argpos=requestedFile.find('?')
if argpos>=0:
requestedFile=requestedFile[0:argpos]
if debugFlag:
sys.stderr.write("%s: [client %s] requesting %s\n" % \
(tsstring,remoteAdr,requestedFile))
orgfile=requestedFile
# analyse file request. If a bare directory, add 'index.xml'
if 'REDIRECT_STATUS' in os.environ and \
os.environ['REDIRECT_STATUS']=='404':
res=htmlpat.match(requestedFile)
if res:
filename=res.group(1)+'.xml'
requestedFile=filename
dir=relcwd=""
res=filepat.match(requestedFile)
if res:
dir=res.group(1)
relcwd=dir
# protocol for relcwd:
# no subdir => relcwd = '' (empty)
# exists subdir => relcwd = subdir (no leading or trailing slash)
if dir!="":
requestedFile=dir+'/'+res.group(2)
else:
requestedFile=res.group(2)
filename=res.group(2)
else:
# not ajh (sub)directory, extract full directory path
dir=os.path.dirname(requestedFile)
relcwd=dir
filename=os.path.basename(requestedFile)
#sys.stderr.write("{}\n".format(requestedFile))
if 'personal/albums' in requestedFile:
pass # sys.exit(0)
if debugFlag:
print("<p>%s: dir,requestedFile,relcwd to process = %s,%s,%s</p>" % \
(tsstring,dir,requestedFile,relcwd))
5.8 Check for Abbreviated URL
<check for abbreviated URL 5.23> =
5.9 Make File and Dir Absolute
<make file and dir absolute 5.24> =
5.10 Check for HTML Request
We now have a requestedFile name for the document to
be rendered. We need to investigate this file to see how it is
to be rendered. In particular, it may be an HTML file
(indicated by a .html extension), or it may be an XML
file previously rendered and cached. In these cases, we do
not need to do any XML conversion, and the flag
alreadyHTML is set true if it is an HTML file, or the
flag cachedHTML is set true if it is a cached converted
XML to HTML file.
<check for HTML request 5.25> =
res=htmlpat.match(requestedFile)
if res:
    # we have an HTML request, check if it exists
    if os.path.exists(requestedFile):
        # exists, use that
        alreadyHTML=True
        #sys.stderr.write(f"requested file {requestedFile} is already html\n")
        if debugFlag:
            print("requested file %s is already html<br/>" % (requestedFile))
    else:
        # doesn't exist, so convert from the corresponding XML file
        filename=res.group(1)+'.xml'
        requestedFile=filename
This code now also checks for a cached version of the XML
file, as per the following fragment.
<check for HTML request 5.26> =
if not alreadyHTML:
    patn="(%s/)(.*).xml" % (BASE)
    if debugFlag:
        print("<p>matching xml=%s with pattern=%s<br/>" % (requestedFile,patn))
    res=re.match(patn,requestedFile)
    if res:
        base=res.group(1); path=res.group(2)
        if debugFlag: print("matched BASE=%s,path=%s<br/>" % (base,path))
        htmlpath="%s%s.html" % (HTMLS,path)
        if os.path.exists(htmlpath):
            htmlstat=os.stat(htmlpath)
            xmlstat=os.stat(requestedFile)
            htmlmod=htmlstat.st_mtime
            xmlmod=xmlstat.st_mtime
            if xmlmod < htmlmod and not form:
                # cached version is newer use that
                if debugFlag: print("using cached file %s<br/>" % (htmlpath))
                #
                requestedFile=htmlpath
                #
                cachedHTML=True
        else:
            if debugFlag: print("no cached version of %s<br/>" % (requestedFile))
    else:
        if debugFlag: print("requested file %s is not XML<br/>" % (requestedFile))
Unless the file being retrieved is already an HTML file,
check to see if we have a cached HTML version of this (XML)
file. Note that any parameters to the HTTP request (indicated
by a non-empty form value) will abort the caching
process, and force a reload of the XML file.
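The freshness rule described here can be stated compactly. The following is a minimal sketch, assuming the cached copy lives at a known path; the function name cached_html_is_fresh and its arguments are hypothetical and not part of index.py.

```python
import os

def cached_html_is_fresh(xml_path, html_path, has_params):
    """True if the cached HTML may be served in place of the XML source.

    The cache is used only when no request parameters were supplied, the
    cached file exists, and it was modified after the XML source.
    """
    if has_params or not os.path.exists(html_path):
        return False
    return os.stat(xml_path).st_mtime < os.stat(html_path).st_mtime
```

Editing the XML source makes its modification time the later of the two, so the next request falls through to a fresh translation.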
5.11 Get Default XSLT File
<get default XSLT file 5.27> =
# collect the XSLT file name from the .htaccess referent
if debugFlag: print("BASE=%s<br/>" % (BASE))
if 'QUERY_STRING' in os.environ:
    query_string=os.environ['QUERY_STRING']
else:
    query_string='xslfile=%s/lib/xsl/ajhwebdoc.xsl&/~ajh/index.xml' % (BASE)
if debugFlag: print("query_string=%s<br/>" % (query_string))
#sys.stderr.write("%s: query string=%s\n" % (tsstring,query_string))
form2=urllib.parse.parse_qs(query_string)
if debugFlag:
    print("<p>%s: form2=%s</p>\n" % (tsstring,form2))
if 'xslfile' in form2:
    xslfile=form2['xslfile'][0]
    #sys.stderr.write(f"{tsstring}: got this xslfile={xslfile}\n")
    if debugFlag:
        print("<p>%s: got this xslfile=%s</p>\n" % (tsstring,xslfile))
5.12 Scan for Locally Defined XSLT File
<scan for locally defined XSLT file 5.28> =
# Check the requested file for a local stylesheet. We also scan the
# entire file, replacing any symbolic references to $WEBDIR with the
# full path for the current machine. Note that the DOCTYPE statement
# must start a line by itself.
try:
    #sys.stderr.write(f"requested file={requestedFile}\n")
    filed=open(requestedFile,'r',encoding='utf-8')
    text='' ; linecount = 0
    trackXML=debugFlag and not (alreadyHTML or cachedHTML)
    while 1: # keep scanning file until we find no more XML directives
        line=filed.readline()
        if line=='': # this is EOF, so quit
            if linecount==0:
                print("(empty file)")
            break
        linecount+=1
        line=line.strip() # remove NL
        text+=' '+line
        if trackXML:
            print("<p>read line='%s'" % (html.escape(line)))
        # check if end of directives, indicated by normal element tag start
        res=re.match('<[^?!]',line)
        if res:
            break
    if trackXML:
        print("<p>text read='%s'" % (html.escape(text)))
    # raw strings keep the regex backslashes away from Python's own escapes
    res=re.match(r'.*(<\?xml-stylesheet)(.*?)(\?>)',text)
    if res:
        parms=res.group(2)
        # now we have the stylesheet parameters
        res=re.match('.*href="(.*?)"',parms)
        if res:
            # extract filename
            xslfile=res.group(1)
            xslfile=re.sub(r'(\$WEBDIR)',WEBDIR,xslfile)
            if debugFlag:
                print("<p>%s: stylesheet in xml file, href=%s</p>" % (tsstring,xslfile))
            <check if xslfile more recent than cached version 5.29>
        else:
            if trackXML:
                print("<p>Did not find stylesheet href in %s" % (parms))
    else:
        if trackXML:
            print("<p>Did not find stylesheet reference in %s" % (html.escape(text)))
    filed.close()
except IOError:
    print("""
<h1>Sorry!! (Error 404)</h1>
<p>While processing your request for file %s,<br/>
it was found that the corresponding XML file %s does not exist</p>
<p>Please check that the URL is correct</p>
""" % (orgfile,requestedFile))
    sys.exit(0)
#newfiled.close()
<check if xslfile more recent than cached version 5.29> =
localXSLfile=re.sub('file://','',xslfile)
try:
    # compare modification times (not the whole stat result)
    xslmod=os.stat(localXSLfile).st_mtime
    if htmlmod < xslmod:
        cachedHTML=False
        if debugFlag: print("<p>XSL newer than HTML, reloading</p>")
except Exception: # ignore any errors from this (missing XSL file, no cached HTML)
    pass
Look at modification time of XSL file. If it is more recent
than the cached HTML file, we must re-convert the XML file.
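The stylesheet scan earlier in this section can be tried in isolation. This is a simplified sketch rather than the code used by index.py: the function name local_stylesheet and the stylesheet name album.xsl are made up for the example, and it works on a string holding the XML prolog instead of reading the file line by line.

```python
import re

def local_stylesheet(prolog_text, webdir):
    """Return the href of an xml-stylesheet instruction, or None if absent.

    Any symbolic $WEBDIR in the href is replaced by webdir.
    """
    res = re.search(r'<\?xml-stylesheet(.*?)\?>', prolog_text, re.S)
    if not res:
        return None
    href = re.search(r'href="(.*?)"', res.group(1))
    if not href:
        return None
    return href.group(1).replace('$WEBDIR', webdir)
```

A document that names no stylesheet returns None, in which case the default from the .htaccess file applies.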
5.13 Determine XSLT File
<determine xslt file 5.30> =
# have we got an xslfile yet?
htacc=None
if xslfile=="":
    # no, so check all .htaccess
    # first grab directory
    while len(dir)>=len(BASE):
        if debugFlag:
            print("<p>directory=%s</p>\n" % (dir))
        if os.path.isfile(dir+"/.htaccess"):
            htacc=open(dir+"/.htaccess")
            if debugFlag:
                print("<p>found .htaccess in directory %s</p>" % (dir))
            break
        else:
            dir=os.path.dirname(dir)
    if htacc:
        for line in htacc.readlines():
            res=xslspec.match(line)
            if res:
                xslfile=res.group(1)
                if xslfile[0] != '/':
                    xslfile=BASE+'/'+xslfile
                break
        if debugFlag:
            print("<p>found xslfile %s in .htaccess</p>" % (xslfile))
if system==Solaris:
    xslfile=re.sub('/home/ajh'+'/www','/u/web/homes/ajh',xslfile)
    if xslfile[0]!='/' and not (xslfile[0:5]=='file:'):
        xslfile='/u/web/homes/ajh/'+xslfile
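The upward search for an .htaccess file can be exercised on its own. The sketch below is an illustration under stated assumptions: the pattern XSLSPEC is a guess at what the xslspec pattern matches (the real pattern is defined with the other string patterns), and default_xslfile is a hypothetical name.

```python
import os
import re

# assumed form of the directive; the real xslspec pattern is defined elsewhere
XSLSPEC = re.compile(r'.*xslfile=([^&\s]+)')

def default_xslfile(start_dir, base):
    """Search start_dir and its ancestors (down to base) for .htaccess.

    Returns the stylesheet named by the nearest .htaccess, made absolute
    against base, or '' if none is found.
    """
    d = start_dir
    while len(d) >= len(base):
        candidate = os.path.join(d, '.htaccess')
        if os.path.isfile(candidate):
            with open(candidate) as htacc:
                for line in htacc:
                    res = XSLSPEC.match(line)
                    if res:
                        xslfile = res.group(1)
                        if not xslfile.startswith('/'):
                            xslfile = base + '/' + xslfile
                        return xslfile
            return ''            # nearest .htaccess names no stylesheet
        parent = os.path.dirname(d)
        if parent == d:          # reached the filesystem root
            break
        d = parent
    return ''
```

As in the chunk above, the search stops at the first .htaccess found, even if that file does not name a stylesheet.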
5.14 Update Counter
Compute the name of an XML counter file which contains a
counter element with subelements value and date.
The value element contains the current count value, and the
date element is the date on which this XML file was
initialised. We read the current count from that file,
increment it, and update the file. This file is used by most
xslt translations to output an access count in the footer. It
is also used by the site map program to compute the intensity
of accesses to this web page.
It was fortuitous, but this counter also keeps track of HTML
accesses, both where an HTML file is the initial request, and
where it is a cached version of the corresponding XML file.
Since the XML files have their own counters included by the
XSLT translator, the count attached to the HTML rendering
allows a comparison of how many accesses are to the cached
copy (the difference between the two).
For example, suppose the XML rendering gives 986 references,
and the HTML rendering cites 993 references. Then the cached
HTML page has itself been referenced 7 times since it was
first cached.
<update counter 5.31> =
counterName=re.sub("/~ajh/",'',relfile)
counterName=re.sub("^/",'',counterName)
# escape the dots so that only literal .xml/.html extensions are removed
extnPattern=re.compile(r"(\.xml)|(\.html)")
counterName=re.sub(extnPattern,'',counterName)
counterName=COUNTERS+re.sub("/","-",counterName)
First we process relfile to find the counter name.
Remove any extension, and replace all slash path separators
with minus signs.
(Strictly speaking, the first sub is not required, but
I've left it in, as it does no harm.)
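As a worked example of the naming rule, the same steps can be wrapped in a function and applied to a sample path. The function name counter_name and the counters directory used in the usage note are assumptions for illustration only.

```python
import re

def counter_name(relfile, counters):
    """Map a relative file path to the name of its counter file."""
    name = re.sub("/~ajh/", "", relfile)          # drop the user prefix
    name = re.sub("^/", "", name)                 # drop any leading slash
    name = re.sub(r"(\.xml)|(\.html)", "", name)  # drop the extension
    return counters + re.sub("/", "-", name)      # slashes become minus signs
```

With counters set to /home/ajh/local/localhost/counters/ (an assumed value), the page /~ajh/research/literate/web.xml maps to the counter file /home/ajh/local/localhost/counters/research-literate-web.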
<update counter 5.32> =
newCounterStr='<?xml version="1.0"?>\n'
newCounterStr+='<counter><value>0</value><date>%s</date></counter>' % todayStr
try:
    counterFile=open(counterName,'r')
    dom=xml.dom.minidom.parse(counterFile)
    counterFile.close()
except IOError:
    dom=xml.dom.minidom.parseString(newCounterStr)
except xml.parsers.expat.ExpatError:
    dom=xml.dom.minidom.parseString(newCounterStr)
except:
    print("Unexpected error:", sys.exc_info()[0])
    raise
Now try to read the counter XML file. The file may not exist
if this is the first time we have accessed this page since
this mechanism was set up, so we must capture that error, and
any error arising from attempting to parse the XML, and create
a new counter file, with value initialised to zero, and
date initialised to today's date.
<update counter 5.33> =
# now extract count field and update it
countNode=dom.getElementsByTagName('value')[0]
if countNode.nodeType == xml.dom.Node.ELEMENT_NODE:
    textNode=countNode.firstChild
    if textNode.nodeType == xml.dom.Node.TEXT_NODE:
        text=textNode.nodeValue.strip()
        countVal=int(text)
        countVal=countVal+1
        textNode.nodeValue="%d" % (countVal)
countDate='(unknown)'
countNode=dom.getElementsByTagName('date')[0]
if countNode.nodeType == xml.dom.Node.ELEMENT_NODE:
    textNode=countNode.firstChild
    if textNode.nodeType == xml.dom.Node.TEXT_NODE:
        countDate=textNode.nodeValue.strip()
<update counter 5.34> =
# write updated counter document
if re.match('.*personal-albums',counterName):
    # ignore photographs
    #sys.stderr.write(" ignoring {0}\n".format(counterName))
    pass
else:
    try:
        counterFile=open(counterName,'w')
    except IOError:
        print("could not open %s" % (counterName))
        counterName='/home/ajh/local/localhost/counters/index'
        counterFile=open(counterName,'w')
    domString=dom.toxml()
    counterFile.write(domString)
    counterFile.close()
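Taken together, the counter fragments amount to a read, increment, write cycle on a small XML document. The following self-contained sketch shows that cycle in one function; the name bump_counter and the date format are assumptions, and the album exclusion and fallback counter file are left out for brevity.

```python
import datetime
import xml.dom.minidom
import xml.parsers.expat

def bump_counter(counter_path, today=None):
    """Increment the <value> held in counter_path, creating it if needed.

    Returns (new_count, start_date).
    """
    today = today or datetime.date.today().strftime("%Y%m%d")
    fresh = ('<?xml version="1.0"?>\n'
             '<counter><value>0</value><date>%s</date></counter>' % today)
    try:
        with open(counter_path, 'r') as f:
            dom = xml.dom.minidom.parse(f)
    except (IOError, xml.parsers.expat.ExpatError):
        # missing or unparsable file: start again from zero
        dom = xml.dom.minidom.parseString(fresh)
    value_text = dom.getElementsByTagName('value')[0].firstChild
    count = int(value_text.nodeValue.strip()) + 1
    value_text.nodeValue = "%d" % count
    start = dom.getElementsByTagName('date')[0].firstChild.nodeValue.strip()
    with open(counter_path, 'w') as f:
        f.write(dom.toxml())
    return count, start
```

The first call for a page creates the file with a count of 1; later calls keep the original date and only advance the count.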
5.15 Process File
<process file 5.35> =
filestat=os.stat(requestedFile)
filemod=filestat.st_mtime
dtfilemod=datetime.datetime.fromtimestamp(filemod)
dtstring=dtfilemod.strftime("%Y%m%d:%H%M")
# define the parameters to the translation
parms=""
parms+="--param xmltime \"'%s'\" " % (dtstring)
parms+="--param htmltime \"'%s'\" " % (tsstring)
parms+="--param filename \"'%s'\" " % (filename)
parms+="--param relcwd \"'%s'\" " % (relcwd)
parms+="--param URL \"'%s'\" " % (URL)
parms+="--param today \"'%s'\" " % (todayStr)
parms+="--param host \"'%s'\" " % (host)
parms+="--param server \"'%s'\" " % (server)
parms+="--param base \"'%s'\" " % (BASE)
parms+="--param version \"'%s'\" " % (version)
for key in form:
    value=form[key][0]
    parms+="--param "+key+" \"'%s'\" " % (value)
if debugFlag:
    sys.stderr.write("%s: xml file modified at %s\n" % (tsstring,dtstring))
<process file 5.36> =
Decide what to do with the file. There are 3 choices:
- Return the raw XML. This means escaping all the active
  characters, and printing the file verbatim.
- The file is HTML, either because of an explicit HTML
  request, or because a cached HTML file previously translated
  has been found. Again, the file is rendered verbatim, this
  time without escaping the active characters.
- It is an XML file, and it needs translation. Call the
  XSLT processor to do that (chunk
  <process an XML file 5.38,5.39,5.40>).
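The three-way choice can be pictured as a small dispatch on the flags set earlier. This is only a sketch: the function name choose_action and the returned labels are invented for the example, and the order in which the flags are tested is an assumption, since the dispatching chunk itself is not reproduced here.

```python
def choose_action(returnXML, alreadyHTML, cachedHTML):
    """Pick one of the three treatments described above."""
    if returnXML:
        return 'raw-xml'      # escape and print the XML verbatim
    if alreadyHTML or cachedHTML:
        return 'html'         # copy the HTML file through unchanged
    return 'translate'        # run the XSLT processor
```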
<render the HTML file 5.37> =
rawHTMLf=open(requestedFile,'r',encoding='utf-8')
for line in rawHTMLf.readlines():
    print(line,end='')
sys.stderr.write(f"{tsstring}: [{remoteAdr}] request satisfied\n\n")
sys.exit(0)
print('<P><SPAN STYLE="font-size:80%%">')
print('%d accesses since %s, ' % (countVal,countDate))
print('HTML cache rendered at %s</SPAN>' % (dtstring))
if cachedHTML:
    os.utime(requestedFile,None) # touch the file
Note that each line from the HTML file is printed without
additional line breaks.
5.15.1 Process an XML File
<process an XML file 5.38> =
# start a pipe to process the XSLT translation
cmd=XSLTPROC+" --xinclude %s%s %s " % (parms,xslfile,requestedFile)
#(pipein,pipeout,pipeerr)=os.popen3(cmd)
pid=Popen(cmd,shell=True,stdout=PIPE,stderr=PIPE,close_fds=True)
(pipeout,pipeerr)=(pid.stdout,pid.stderr)
if debugFlag:
    cwd=os.getcwd()
    print("<p>%s: (cwd:%s) %s</p>" % (tsstring,cwd,cmd))
    sys.stderr.write("(cwd:%s) %s: %s\n" % (cwd,tsstring,cmd))
# report the fact, and the context (debugging purposes)
if debugFlag:
    print("%s: converting %s with %s\n" % (tsstring,requestedFile,xslfile))
Run the pipe to perform the translation. Note that this
step requires an inordinate amount of time on some servers
(sequoia in particular), and was the prompt for
including the caching mechanism.
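Because the command line above is assembled as a single string that includes values taken from the request, one alternative worth considering is to pass the arguments to xsltproc as a list, so that no shell parses them. The sketch below is not the code used by index.py; the helper names are hypothetical, and it uses xsltproc's --stringparam option in place of the quoted --param form.

```python
import subprocess

def xsltproc_argv(xsltproc, params, xslfile, xmlfile):
    """Build an argument list for xsltproc without any shell quoting."""
    argv = [xsltproc, '--xinclude']
    for name, value in params.items():
        # --stringparam passes the value as a string, not an XPath expression
        argv += ['--stringparam', name, str(value)]
    argv += [xslfile, xmlfile]
    return argv

def run_translation(argv):
    """Run the translation and return (html_text, error_text)."""
    proc = subprocess.run(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return proc.stdout.decode('utf-8'), proc.stderr.decode('utf-8')
```

Parameter names that come from the request would still need checking before use, since they become stylesheet parameter names.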
<process an XML file 5.39> =
# process the converted HTML
convertfn="/home/ajh/www/tmp/convert.html"
if convertXML:
    try:
        htmlfile=open(convertfn,'w')
    except:
        msg="couldn't open HTML conversion file %s" % convertfn
        sys.stderr.write("%s: %s\n" % (tsstring,msg))
        convertXML=False
<process an XML file 5.40> =
# check that directory exists
dirpath=os.path.dirname(htmlpath)
# only cache if not an album request
gencache=not 'personal/albums' in dirpath
if not os.path.isdir(dirpath):
    os.makedirs(dirpath,0o777)
if gencache: htmlfile2=open(htmlpath,'w')
for line in pid.stdout.readlines():
    line=line.decode('UTF-8')
    line=line.rstrip('\n') # drop only the newline; don't remove trailing blanks!
    print(line)
    # write the newline back so the cached copy keeps its line structure
    if gencache: htmlfile2.write("%s\n" % line)
    if convertXML:
        htmlfile.write("%s\n" % line)
if convertXML:
    htmlfile.close()
if gencache:
    htmlfile2.close()
    #os.chmod(htmlpath,0o666)
<deal with any conversion errors 5.41>
pipeout.close(); pipeerr.close()
Note that in copying the rendered HTML version, we retain
the lines as is, and make sure that they are rendered
without any additional (or deleted) new lines.
5.15.1.1 Deal with any Conversion Errors
<deal with any conversion errors 5.41> =
errs=[]
for line in pipeerr.readlines():
    # the pipe delivers bytes, so decode before logging or printing
    errs.append(line.decode('UTF-8',errors='replace'))
logfile=PRIVATE+'/xmlerror.log'
logfiled=open(logfile,'a')
if errs:
    logfiled.write("%s: %s: ERROR IN REQUEST %s\n" % (tsstring,clientIP,requestedFile))
    print("<HR/>\n")
    print("<H3>%s: MESSAGES GENERATED BY: %s</H3>\n" % (tsstring,requestedFile))
    print("<PRE>")
    for errline in errs:
        logfiled.write("%s: %s" % (tsstring,errline))
        errline=html.escape(errline.rstrip())
        print("%s: %s" % (tsstring,errline))
    print("</PRE>")
    print("<p>Please forward these details to ")
    print("<a href='mailto:ajh@ajhurst.org'>John Hurst</a>")
else:
    logfiled.write("%s: %s: NO ERRORS IN %s\n" % (tsstring,clientIP,requestedFile))
logfiled.close()
6. The Train Image Viewer viewtrains.py
The literate code for this program has been moved to
ViewTrains
7. The Rankings Display Program ranktrains.py
The literate code for this program has been moved to
ViewTrains
8. The Train Ranking Module rank.py
The literate code for this program has been moved to
ViewTrains
9. The Web Server Module webServer.py
"webServer.py" 9.1 =#!/usr/bin/python
# online.py
#
# This program looks at all machines known to the ajh network, and
# examines them to see if they are on-line.
#
# version 1.0.0 20160622:104949
#
# put imports here: those named are required by this template
import datetime
import subprocess
import socket
import re
import getopt
import os,os.path
import sys
import shutil

# define globals here
debug=0; verbose=0
machines=['spencer',
          #'spencer.ajh.co',
          'dimboola',
          #'ajh.id.au',
          #'wolseley',
          'hamilton.local',
          #'bittern',
          #'echuca',
          'lilydale',
          #'albens.ajhurst.org'
         ]
macosx=['dimboola','ajh.id.au','hamilton.local','MU00087507X','bittern','echuca']
linux=['spencer','spencer.ajh.co','wolseley','lilydale','lilydale.local','albens.ajhurst.org']
thismachine=socket.gethostname()
ignoreOut=open('/dev/null','w')

# define usage here
def usage():
    print("""
This module is intended for use in web page delivery programs.
It does not serve any useful purpose in its own right.
<flags>= [-d|--debug] print debugging information
[-v|--verbose] print more debugging information
[-V|--version] print version information
The debug flags give (verbose) output about what is happening.
""")

# define global procedures here
# (there are none)

# define the key class of this module
class location():
    def __init__(self,server=thismachine):
        # determine the server and host names
        #
        # the server is the address to which this request was directed, and is
        # useful in making decisions about what to render to the client.
        # Examples are "localhost", "www.ajh.id.au", "chairsabs.org.au".
        #
        # the host is the machine upon which the server is running, and may be
        # different from the server. This name is used to determine where to
        # store local data, such as logging information. For example, the
        # server may be "localhost", but this can run on a variety of hosts:
        # "murtoa", "dimboola", "dyn-130-194-xx-xx", etc.. Incidentally, hosts
        # of the form "dyn-130-194-xx-xx" are mashed down to the generic "dyn".
        MacOSX='MacOSX' ; Solaris='Solaris' ; Linux="Linux" ; Ubuntu='Ubuntu'
        ostype=system=MacOSX # unless told otherwise
        host=server
        if server in ["localhost"]:
            pass
        elif server in ['ajh.co','www.ajh.co','spencer']:
            host='spencer'
            ostype=Linux
            system=Ubuntu
        elif server in ['albens','albens.ajhurst.org','45.55.18.15',\
                        'ajhurst.org','www.ajhurst.org',\
                        'njhurst.com','www.njhurst.com']:
            #server='ajhurst.org'
            host='albens'
            ostype=Linux
            system=Ubuntu
        elif server in ['dimboola','dimboola.local',\
                        'ajh.id.au','dimboola.ajh.id.au']:
            host='dimboola'
        elif server in ['wolseley','wolseley.home.gateway']:
            server='wolseley'
            host='wolseley'
            ostype=Linux
            system=Ubuntu
        elif server in ['burnley','burnley.local']:
            host='burnley'
            ostype=Linux
        elif server in ['eregnans.ajhurst.org','regnans.njhurst.com']:
            host='eregnans'
            ostype=Linux
            system=Ubuntu
        elif server in ['cahurst.org']:
            host='albens'
            ostype=Linux
            system=Ubuntu
        elif server in ['glenwaverleychurches.org','www.glenwaverleychurches.org']:
            host='albens'
            ostype=Linux
            system=Ubuntu
        else:
            sys.stderr.write("server/host values not recognized\n")
            sys.stderr.write("(supplied values are %s/%s)\n" % (server,host))
            sys.stderr.write("terminating\n")
            sys.exit(1)
            ostype=Linux
            system=Ubuntu
            sys.stderr.write("(assuming (ostype,system)=(%s,%s)\n" % (ostype,system))
        self.ostype=ostype
        self.system=system
        self.server=server
        self.host=host
        pass

# define the main program here
def main(s):
    loc=location(s)
    print("I have gleaned that:")
    print(" this machine = %s" % (thismachine))
    print(" this server = %s" % (loc.server))
    print(" this host = %s" % (loc.host))
    print(" this ostype = %s" % (loc.ostype))
    print(" this system = %s" % (loc.system))
    pass

if __name__ == '__main__':
    (vals,path)=getopt.getopt(sys.argv[1:],'dvV',
                              ['debug','verbose','version'])
    for (opt,val) in vals:
        if opt=='-d' or opt=='--debug':
            debug=1
        if opt=='-v' or opt=='--verbose':
            verbose=1
        if opt=='-V' or opt=='--version':
            print(version)
            sys.exit(0)
    server=thismachine
    if len(path)>0:
        server=path[0]
    main(server)
The various web server programs all make use of a pair of
values, known as host and server. The host
value defines the machine on which this program is
running, while the server value defines the name by which
the program was invoked (effectively the domain name of the
server). This is necessary, as each program may be invoked by
different domain name paths, and the service rendered may be
different for each path.
While different domain names may invoke services on the same
machine or host, it is never the case that different
machines are invoked to handle a single domain name service.
When invoked as a stand-alone program, a test server
parameter may be passed in to see what the program determines.
This is for testing purposes only, and serves no useful purpose
otherwise.
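The relationship between server names and hosts (many names may lead to one host, never one name to several hosts) lends itself to a table-driven lookup. The sketch below is an alternative formulation, not the location class itself; the table holds only a subset of the names listed above, and the names HOSTS and build_index are invented for the example.

```python
# host: (ostype, system, server names) -- a subset of the cases handled above
HOSTS = {
    'spencer':  ('Linux', 'Ubuntu', ['ajh.co', 'www.ajh.co', 'spencer']),
    'albens':   ('Linux', 'Ubuntu', ['albens', 'albens.ajhurst.org',
                                     'ajhurst.org', 'www.ajhurst.org',
                                     'cahurst.org']),
    'dimboola': ('MacOSX', 'MacOSX', ['dimboola', 'dimboola.local',
                                      'ajh.id.au']),
}

def build_index(hosts):
    """Invert the table into server name -> (host, ostype, system).

    Raises ValueError if one server name is claimed by two hosts, which
    enforces the rule that a domain name is served by a single machine.
    """
    index = {}
    for host, (ostype, system, names) in hosts.items():
        for name in names:
            if name in index:
                raise ValueError("server %s maps to two hosts" % name)
            index[name] = (host, ostype, system)
    return index
```

A lookup such as build_index(HOSTS)['cahurst.org'] then gives the host albens together with its operating system details.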
10. File Caching
10.1 The File Cache Module
"filecache.py" 10.1 ="""A module that writes a webpage to a file so it can be restored at a later time
Interface:
filecache.write(...)
filecache.read(...)
"""
import time
import os
import hashlib
import urllib

def key(url):
    # hashlib replaces the Python 2 md5 module
    k = hashlib.md5()
    k.update(url.encode('utf-8'))
    return k.hexdigest()

def filename(basedir, url):
    return "%s/%s.txt"%(basedir, key(url))

def write(url, basedir, content):
    """ Write content to cache file in basedir for url"""
    cachefilen=filename(basedir, url)
    # open() replaces the Python 2 file() builtin
    fh = open(cachefilen, mode="w")
    fh.write(content)
    fh.close()
    return cachefilen

def read(url, basedir, timeout):
    """Read cached content for url in basedir if it is fresher
    than timeout (in seconds)"""
    cache=0
    fname = filename(basedir, url)
    content = ""
    if os.path.exists(fname) and \
       (os.stat(fname).st_mtime > time.time() - timeout):
        fh = open(fname, "r")
        content = fh.read()
        fh.close()
        cache=1
    return (content,cache)
This code was adapted from an example given on a web page.
Sorry, I have forgotten the reference.
10.2 Clearing the Cache
This program clears the HTML caches created by the previous
module. It is called independently, and can clear either the
entire cache, or subdirectories of it.
The cache is maintained on a per-machine basis, and the
machine being used is identified by a hostname
call.
"clearWebCache.py" 10.2 =
11. Country Database
Addresses in this database come in two forms:
CIDR (Classless Inter-Domain Routing) blocks, and pairs of IPv4
addresses. The CIDR form is the standard way of defining a range
of addresses; the second form simply gives the first and last IP
addresses of a range, which usually cannot be written as a single
CIDR block. This latter form is used where a CIDR block contains
an address (or addresses) that needs separate handling from the
rest of the block.
A third form, infrequently used, is a single address. This is
used to indicate a single whitelisted (but could be black)
address that is separated out from a range of other addresses.
Currently there are only 2 such entries, one for my own server
at ajhurst.org (Digital Ocean), and Nathan (part of the Newfold
Digital network, otherwise grey listed).
Each line of the database is a network range in one of the three
forms of address or address range, followed by a colour flag
('*', ' ', '.'; black, grey, white respectively), followed by a
2-character country indicator, followed by a network group name.
One or more blanks separate these fields.
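To make the line format concrete, here is a minimal parsing sketch built on the standard ipaddress module. It is not the checkCountry code: the names parse_entry and classify are hypothetical, a missing flag is read as grey, and choosing the narrowest matching range is just one possible way of letting a single whitelisted address take precedence over an enclosing range.

```python
import ipaddress

COLOURS = {'*': 'black', '.': 'white'}   # a blank flag means grey

def parse_entry(line):
    """Return (first, last, colour, country, group) for one database line."""
    fields = line.split()
    spec, rest = fields[0], fields[1:]
    if '/' in spec:                       # CIDR block
        net = ipaddress.ip_network(spec)
        first, last = net[0], net[-1]
    elif ',' in spec:                     # explicit first,last pair
        lo, hi = spec.split(',')
        first, last = ipaddress.ip_address(lo), ipaddress.ip_address(hi)
    else:                                 # single address
        first = last = ipaddress.ip_address(spec)
    colour = 'grey'
    if rest and rest[0] in COLOURS:
        colour = COLOURS[rest.pop(0)]
    country = rest[0] if rest else ''
    group = ' '.join(rest[1:])
    return first, last, colour, country, group

def classify(address, entries):
    """Colour of the narrowest entry containing address, or None."""
    ip = ipaddress.ip_address(address)
    hits = [e for e in entries if e[0] <= ip <= e[1]]
    if not hits:
        return None
    return min(hits, key=lambda e: int(e[1]) - int(e[0]))[2]
```

Parsing the line "66.96.163.139 . US Nathan" this way gives a one-address white entry, which wins over any wider grey range that also contains it.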
"blackListIPs" 11.1 =3.0.0.0/9 * US Amazon
3.128.0.0/9 * US Amazon
10.0.0.0/8 . AU local network
13.24.0.0,13.59.255.255 * US Amazon
17.0.0.0/8 * US Apple
18.32.0.0/11 * US Amazon
18.64.0.0/10 * US Amazon
18.128.0.0/9 * US Amazon
23.20.0.0/14 * US Amazon
27.0.0.0/21 * ZZ Asia Pacific Network Information Centre
27.0.232.0/24 CA ONEPROVIDER
27.0.233.0/24 AU Adam Berger
27.0.234.0/24 SG Adam Berger
27.0.235.0/24 KR Adam Berger
27.0.236.0/24 KR Kakao Corp
27.0.237.0/24 KR Kakao Corp
27.0.238.0/24 KR Kakao Corp
27.0.239.0/24 KR Kakao Corp
27.0.240.0/24 VN Vingroup Joint Stock Company
44.192.0.0/10 * US Amazon
45.55.0.0/16 . US Digital Ocean
47.128.0.0/16 * SG Amazon
49.0.200.0,49.0.207.255 * SG Huawei
49.185.0.0/17 . AU Optus Internet
50.112.160.3/32 US
51.222.0.0,51.222.253.0 * CA OVH Hosting Montreal
51.222.253.1,51.222.253.19 . CA OVH Hosting Montreal
51.222.253.20,51.222.255.255 * CA OVH Hosting Montreal
52.0.0.0/10 * US Amazon
52.64.0.0/12 * US Amazon
54.36.0.0/15 * NL RIPE Network
54.38.0.0/16 * NL RIPE Network
54.144.0.0/12 * US Amazon
54.160.0.0/11 * US Amazon
54.192.0.0/10 * US Amazon
59.167.194.123/32 . AU iiNet Limited
62.0.0.0/8 * NL RIPE Network
65.108.0.0/15 * NL RIPE Network
66.96.128.0,66.96.163.138 US Newfold Digital
66.96.163.139 . US Nathan
66.96.163.140,66.96.191.255 US Newfold Digital
66.249.64.0/19 * US Google
69.63.176.0/20 * US Facebook
69.171.224.0/19 * US Facebook
85.0.0.0/8 * NL RIPE Network
87.250.224.0/19 RU Yandex
94.0.0.0/13 * NL RIPE Network
94.23.0.0/16 FR OVH ISP Paris
94.24.0.0/13 * NL RIPE Network
94.32.0.0/11 * NL RIPE Network
94.64.0.0/10 * NL RIPE Network
94.128.0.0/9 * NL RIPE Network
100.21.24.205/32 * US Amazon
101.44.160.0/20 * SG HUAWEI
101.44.248.0/22 * SG HUAWEI
104.131.0.0/16 . US Digital Ocean
110.238.104.0/21 * SG HUAWEI
114.119.128.0/18 * SG HUAWEI
119.8.160.0,119.8.191.255 * SG HUAWEI
119.13.96.0,119.13.111.255 * SG HUAWEI
124.243.128.0/18 * SG HUAWEI
141.98.11.0/24 * LT LT-HOSTBALTIC-11
144.76.0.0/16 NL RIPE Network
144.91.64.0/18 NL RIPE Network
148.251.0.0/16 NL RIPE Network
148.252.0.0/15 NL RIPE Network
158.69.0.0/16 CA OVH Hosting Montreal
158.220.0.0/16 * NL RIPE Network
159.138.0.0/16 AU Asia Pacific Network Information Centre
167.160.64.0,167.160.71.255 * US Blazing SEO, LLC BLAZINGSEO-US-108
172.32.0.0/11 US T-Mobile USA
173.252.64.0/18 * US Facebook
176.0.0.0/8 * NL RIPE Network
183.81.169.0/24 * HK Amarutu Technology Ltd.
185.0.0.0/8 * NL RIPE Network
188.165.0.0/16 * FR OVH ISP
190.92.192.0/19 * SG HUAWEI
195.191.218.0/23 GB VeloxServ
213.0.0.0/8 * RU Yandex
216.244.64.0/19 * US Wowrack
217.76.56.0/20 DE Contabo GmbH
"greyListedIPs" 11.2 =51.222.0.0/16 CA OVH Hosting Montreal
Notes:
- RIPE Network
  Réseaux IP Européens Network Coordination Centre
12. The Makefile
The Makefile handles the nitty-gritty of copying
files to the right places, and setting permissions, etc.
<install python file 12.1> =
install-machine: /tmp/index-machine.py web.tangle
	chmod a+x /tmp/index-machine.py
	if [ ${HOST} = machine ] ; then \
	  cp /tmp/index-machine.py homedir/public_html/cgi-bin/index.py; \
	  cp location3.py homedir/public_html/cgi-bin/; \
	  cp checkCountry.py homedir/public_html/cgi-bin/; \
	  cp IPtools.py homedir/public_html/cgi-bin/; \
	  cp blackListIPs homedir/local/; \
	else \
	  rsync -auv /tmp/index-machine.py address:homedir/public_html/cgi-bin/index.py; \
	  rsync -auv location3.py address:homedir/public_html/cgi-bin/; \
	  rsync -auv checkCountry.py address:homedir/public_html/cgi-bin/; \
	  rsync -auv IPtools.py address:homedir/public_html/cgi-bin/; \
	  rsync -auv blackListIPs address:homedir/local/; \
	fi
/tmp/index-machine.py: index.py
	sed -e 's#/sw/bin/python#interpreter#' <index.py >/tmp/index-machine.py
install python file is an XLP macro that takes four
formal parameters. These are:
- machine
  defines the machine for which this python script is to be
  built. (The target machine)
- address
  defines the domain name of the target machine.
- interpreter
  defines the python interpreter to be used in running this
  script.
- homedir
  defines the home directory on the target machine.
"Makefile" 12.2 =RELCWD = /cgi-bin/
WEBPAGE = /home/ajh/www/
WEBPAGE = /home/ajh/public_html/research/literate
FILES = $(EMPTY)
XSLLIB = /home/ajh/lib/xsl
XSLFILES = $(XSLLIB)/lit2html.xsl $(XSLLIB)/tables2html.xsl
INSTALLFILES = index.py countries.py
CGIS = $(INSTALLFILES)
XMLS = $(EMPTY)
DIRS = $(EMPTY)
include $(HOME)/etc/MakeXLP
include $(HOME)/etc/MakeWeb
index.py: web.tangle
	chmod 755 index.py
	touch index.py
web.tangle web.xml: web.xlp
	xsltproc --xinclude -o web.xml $(XSLLIB)/litprog.xsl web.xlp
	touch web.tangle
web.html: web.xml $(XSLFILES)
	xsltproc --xinclude $(XSLLIB)/lit2html.xsl web.xml >web.html
html: web.html
install: web.tangle install-${HOST}
web: $(WEBPAGE)/web.html
$(WEBPAGE)/web.html: web.html
	cp -p web.html $(WEBPAGE)/web.html
Makefile: web.tangle
<install python file 12.1>(machine='albens', address='albens', interpreter='/usr/bin/python3', homedir='/home/ajh')
<install python file 12.1>(machine='albury', address='', interpreter='/usr/bin/python3', homedir='/home/ajh')
<install python file 12.1>(machine='burnley', address='burnley', interpreter='/home/ajh/binln/python3', homedir='/home/ajh')
<install python file 12.1>(machine='everton', address='everton', interpreter='/home/ajh/binln/python3', homedir='/home/ajh')
<install python file 12.1>(machine='jeparit', address='jeparit', interpreter='/home/ajh/binln/python3', homedir='/home/ajh')
<install python file 12.1>(machine='newport', address='newport', interpreter='/usr/bin/python', homedir='/home/ajh')
<install python file 12.1>(machine='reuilly', address='reuilly', interpreter='/home/ajh/binln/python', homedir='/home/ajh')
<install python file 12.1>(machine='spencer', address='spencer', interpreter='/home/ajh/binln/python', homedir='/home/ajh')
<install python file 12.1>(machine='wodonga', address='wodonga', interpreter='/usr/bin/python3', homedir='/Users/ajh')
The install-system targets are designed to cater
for the variations in interpreters and home directories required
for each of the servers installed by the Makefile. Currently,
all home directories are the same, but this is not necessarily
the case (for example, in MacOS systems).
Note that this has not been updated for setting the filecache
module.
The machines listed in the <install python file 12.1> calls
are those machines for which we want to run a web serving page
(Apache 2).
13. TODOs
- 20240513:173848
  requestedFile traces now need extra information about private
  versus public.
- 20240504:135256
  Need better handling of white listed entries. For example,
  Nathan's address (66.96.163.139) is in the middle of
  66.96.128.0/18, which is otherwise grey listed. Want a white
  listing to override an otherwise grey listing.
- 20240211:150655
  Check that all machines do indeed serve properly. Note that
  there are some gotchas, like whether the local files
  are present, correct, and with proper ownership and
  permissions.
14. Indices
14.1 Files
File Name | Defined in
IPrep.py | 3.2
IPtools.py | 3.1
Makefile | 12.2
blackListIPs | 11.1
checkCountry.py | 4.1, 4.2, 4.4
clearWebCache.py | 10.2
filecache.py | 10.1
greyListedIPs | 11.2
index.py | 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7, 5.8
webServer.py | 9.1
14.2 Chunks
Chunk Name | Defined in | Used in
check for HTML request | 5.25, 5.26 | 5.8
check for abbreviated URL | 5.23 | 5.8
check for and handle jpegs | 5.16 | 5.6
check if xslfile more recent than cached version | 5.29 | 5.28
checkCountry: checkIP | 4.3 | 4.2
collect HTTP request | 5.20 | 5.7
current date | 15.2 |
current version | 15.1 | 5.1
deal with any conversion errors | 5.41 | 5.40
define global variables and constants | 5.9 | 5.1
define the XSLTPROC program | 5.14 | 5.12
define the working directories | 5.13 | 5.12
define various string patterns | 5.10 | 5.3
determine the host and server environments | 5.11, 5.12 | 5.4
determine xslt file | 5.30 | 5.8
filter out bad ips to which we do not respond | 5.15 | 5.4
get default XSLT file | 5.27 | 5.8
get filename from redirect | 5.22 | 5.7
get jpg file from remote server | 5.17, 5.18 | 5.16
get jpg file from remote server removed | 5.19 |
handle redirect query string | 5.21 | 5.20
install python file | 12.1 | 12.2
make file and dir absolute | 5.24 | 5.8
process an XML file | 5.38, 5.39, 5.40 | 5.36
process file | 5.35, 5.36 | 5.8
render the HTML file | 5.37 | 5.36
scan for locally defined XSLT file | 5.28 | 5.8
update counter | 5.31, 5.32, 5.33, 5.34 | 5.8
14.3 Identifiers
Identifier | Defined in | Used in
BASE | 5.13 |
COUNTERS | 5.13 |
HOME | 5.13 |
HTMLS | 5.13 | 5.26
PRIVATE | 5.13 |
alreadyHTML | 5.9 | 5.25, 5.26, 5.28, 5.36
cachedHTML | 5.9 | 5.26, 5.28, 5.36, 5.37
clientIP | 5.9 |
convertXML | 5.9 | 5.21, 5.39, 5.40
debugFlag | 5.9 | 5.8, 5.20, 5.21, 5.22, 5.27, 5.28, 5.30, 5.35, 5.38
htmlpat | 5.10 | 5.22, 5.25
requestedFile | 5.20 | 5.8, 5.22, 5.23, 5.24, 5.25, 5.26, 5.28, 5.35, 5.36, 5.37, 5.38, 5.41
returnXML | 5.9 | 5.21, 5.36
15. Document History
20080816:144135 | ajh | 1.0.0 | first version under literate programming
20080817:131040 | ajh | 1.0.1 | general restructuring
20080822:162138 | ajh | 1.0.2 | more restructuring
20081102:164507 | ajh | 1.1.0 | added jpg handling and caching
20081106:134033 | ajh | 1.1.1 | added exception handling
20090507:160328 | ajh | 1.2.0 | bug relating to non-~ajh directories fixed; caching of files implemented.
20090701:175934 | ajh | 1.3.0 | added code to cache the converted HTML file. Still to do: creation of subdirectories for cached files.
20090702:182341 | ajh | 1.3.1 | Subdirectories now created. Cached file is touched on each access
20090703:105343 | ajh | 1.3.2 | some literate tidy ups, and renamed variable file to requestedFile to disambiguate it.
20091203:093814 | ajh | 1.3.3 | updated Makefile to install python interpreter dependent files
20120530:134427 | ajh | 1.3.4 | fix bug in glenwaverleychurches handling of counters
 | ajh | 1.3.5 | source code only
20160221:165415 | ajh | 1.3.6 | modified for albens
20160407:111509 | ajh | 1.3.7 | further albens changes for cahurst domain
20180531:110300 | ajh | 1.3.8 | add more documentation to <define global variables and constants 5.9>
20190117:153130 | ajh | 1.3.9 | don't cache album requests
20210118:113916 | ajh | 1.4.0 | convert to Python3
20210118:114112 | ajh | 1.4.1 | add filterWebBots.py to literate program
20210124:124701 | ajh | 1.4.2 | collect bad ip adrs from data file
20210814:125748 | ajh | 1.4.3 | bring literate pgm up-to-date
20210814:150425 | ajh | 1.4.4 | add filter on non-Australian IP addresses
20231030:124305 | ajh | 1.5.0 | add non-local jpg retrieval
20231105:085217 | ajh | 1.6.0 | cleaned up filterWebBots
20231105:085257 | ajh | 1.6.1 | add own country database
20231107:173857 | ajh | 1.6.2 | scan db for IP checks
20231108:101419 | ajh | 1.6.4 | remove filter on country test, rely on db
20231124:131647 | ajh | 1.6.5 | fix bug with glenwaverleychurches server
20240121:140805 | ajh | 1.6.6 | minor clean ups to documentation and Makefile
20240430:151827 | ajh | 1.6.7 | minor corrections to the checkCountry module
<current version 15.1> = 1.6.7
<current date 15.2> = 20240430:151827