Project Computing - pageVault Reference

Responses can be removed from a pageVault archive based on the URL and timestamp of the response. The removed responses can either be deleted entirely or moved to a separate archive. Moving responses from the main archive to an older archive lets you keep a series of pageVault archives covering separate time periods, e.g. a 2004 archive, a 2005 archive, and so on.

The deletion of responses is handled by the Archiver as part of its normal operation. At a specified time each day, the Archiver initiates a background process which scans the archive for content to be removed. This process is controlled by parameters supplied in the archiveDeletionRules element of the archiver parameter file.

In the simplest case, a single deletion rule tells the archiver to check each day for all responses more than 366 days old.

The time at which the check starts is determined by the checkTime parameter, which should specify a time in hhmm format (e.g. 0000 for midnight, 0530 for 5:30am, 2100 for 9pm). If this parameter is omitted or invalid it defaults to 0230 (2:30am).
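The checkTime rule above can be illustrated with a short sketch. This is not pageVault's own code: the function name `parse_check_time` is hypothetical, and the reading of "invalid" (anything other than four digits naming a real time of day) is an assumption.

```python
import re

DEFAULT_CHECK_TIME = "0230"  # default stated in the text (2:30am)


def parse_check_time(value):
    """Return (hour, minute) for an hhmm string, falling back to the default.

    Assumption: a value counts as invalid unless it is exactly four digits
    with hour 00-23 and minute 00-59.
    """
    if value is None or not re.fullmatch(r"[0-9]{4}", value):
        value = DEFAULT_CHECK_TIME
    hour, minute = int(value[:2]), int(value[2:])
    if hour > 23 or minute > 59:
        # Four digits, but not a real time of day: use the default instead.
        hour, minute = int(DEFAULT_CHECK_TIME[:2]), int(DEFAULT_CHECK_TIME[2:])
    return hour, minute
```

With this reading, "0530" gives (5, 30), while a missing value or something like "2500" falls back to (2, 30).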

If the test parameter is set to y then the archiver will go through the process of determining which responses to delete but will not actually remove the responses from the archive. If it is set to n then eligible responses will be removed from the archive.

Multiple archiveDeletionRule elements can be provided, each with a separate deleteAfterDays setting. In addition, a log file can be specified to receive all messages produced by the deletion process (if no log file is nominated, deletion processing messages are written to standard output). In the following example, responses with URLs starting with "intranet" are deleted after 183 days, responses whose URLs contain "/sales/" are deleted after 1000 days, and all other responses are deleted after 366 days:

<archiveDeletionRules checkTime="0230" deletionLog="c:\pageVault\logs\deletion.log" test="n">
  <archiveDeletionRule deleteAfterDays="183">
    <urlPattern>^intranet.*</urlPattern>
  </archiveDeletionRule>
  <archiveDeletionRule deleteAfterDays="1000">
    <urlPattern>.*/sales/.*</urlPattern>
  </archiveDeletionRule>
  <archiveDeletionRule deleteAfterDays="366">
    <urlPattern>.*</urlPattern>
  </archiveDeletionRule>
</archiveDeletionRules>

This is how multiple archiveDeletionRules are processed:

The disposition of a response is determined by any matching rule. However, a "wildcard" rule consisting of just the ".*" regular expression will not match a URL which is also matched by any other rule. For example, if there were two archiveDeletionRule elements:

delete after 200 days URLs matching ^www.somename.com/fred.*
delete after 100 days URLs matching .*

then the responses at URL www.somename.com/fred/index.html will be deleted after 200 days, not after 100 days.
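The wildcard behaviour can be sketched as follows. This is an illustration only, not the archiver's implementation. It assumes that a pattern must match the whole URL, and that when several non-wildcard rules match, the first one declared wins (the text does not say which applies). The function name `delete_after_days` is hypothetical.

```python
import re

WILDCARD = ".*"


def delete_after_days(url, rules):
    """Pick the retention period for a URL.

    rules is a list of (days, [patterns]) in declaration order.
    Assumptions: whole-URL matching, first specific match wins.
    Returns None if no rule applies.
    """
    fallback = None
    for days, patterns in rules:
        for pattern in patterns:
            if pattern == WILDCARD:
                # Remember the wildcard rule, but let specific rules win.
                if fallback is None:
                    fallback = days
            elif re.fullmatch(pattern, url):
                return days
    return fallback


rules = [
    (200, ["^www.somename.com/fred.*"]),
    (100, [".*"]),
]
```

Under these assumptions, `delete_after_days("www.somename.com/fred/index.html", rules)` yields 200, and any URL not matched by the first rule yields 100, regardless of the order in which the two rules are listed.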

Deleted responses may optionally be moved to another pageVault archive. This allows "pruning" of the active archive by moving old responses to a different archive. The following parameters must be either both present or both absent:

moveToIndexDirectory defines the name of a directory containing the pageVault archive index used to store the deleted responses. This directory (and index) will be created if it does not exist.
moveToFileDirectory defines the name of a directory to contain the pageVault response files.

If these parameters are not present then responses deleted from this archive will not be moved anywhere: they will be completely and permanently deleted.
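A configuration can be checked for the "both present or both absent" rule before the archiver is started. The following sketch is not part of pageVault; the function name `check_move_parameters` is hypothetical, and it assumes the two parameters appear as attributes of each archiveDeletionRule element, as in the examples in this section.

```python
import xml.etree.ElementTree as ET


def check_move_parameters(xml_text):
    """Return a list of problems found in archiveDeletionRule elements.

    Assumption: moveToIndexDirectory and moveToFileDirectory are attributes
    of archiveDeletionRule.
    """
    root = ET.fromstring(xml_text)
    problems = []
    for number, rule in enumerate(root.iter("archiveDeletionRule"), start=1):
        has_index = "moveToIndexDirectory" in rule.attrib
        has_files = "moveToFileDirectory" in rule.attrib
        if has_index != has_files:
            problems.append(
                "rule %d: moveToIndexDirectory and moveToFileDirectory "
                "must be both present or both absent" % number
            )
    return problems
```

An empty list means every rule either moves its responses to a fully specified archive or deletes them permanently.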

Here's an extended example demonstrating several urlPatterns per rule and also saving some of the removed responses into separate archives:

<archiveDeletionRules checkTime="0230" deletionLog="/pageVault/logs/deletion.log" test="n">
  <archiveDeletionRule deleteAfterDays="479" moveToIndexDirectory="/pageVault/oldArchives/indexDeleted1" moveToFileDirectory="/pageVault/oldArchives/dataDeleted1">
    <urlPattern>^localhost.*privacy.*</urlPattern>
    <urlPattern>^127.0.0.1*</urlPattern>
    <urlPattern>^www.*</urlPattern>
  </archiveDeletionRule>
  <archiveDeletionRule deleteAfterDays="450" moveToIndexDirectory="/pageVault/oldArchives/indexDeleted2" moveToFileDirectory="/pageVault/oldArchives/dataDeleted2">
    <urlPattern>^www.*vtour.*</urlPattern>
  </archiveDeletionRule>
  <archiveDeletionRule deleteAfterDays="465">
    <urlPattern>^intranet/.*</urlPattern>
  </archiveDeletionRule>
  <archiveDeletionRule deleteAfterDays="730">
    <urlPattern>.*</urlPattern>
  </archiveDeletionRule>
</archiveDeletionRules>

The pageVault archiver maintains two B+ Tree indices to the response archive allowing rapid searching and retrieval. It is the nature of such indices that over time they can slowly become unbalanced due to non-random insertion order of entries. The following utilities allow the archive indices to be restructured to minimize space and response time. How often this process needs to be run will vary from archive to archive: most archives will never need reorganisation.

Note

These index maintenance utilities must be run from the command line.

The following examples assume that you can run the Java run time (that is, that a Java 1.4+ runtime is in your path) and that you know where the distributed pageVault.jar is located, which may be:

Windows:
c:\pageVault\WEB-INF\lib\pageVault.jar

Unix:
/usr/local/pageVault/WEB-INF/lib/pageVault.jar

com.pageVault.maint.ShowIndexStats

Displays basic statistics for the B+ Tree indices.

Parameter: the name of the directory containing the pageVault indices (not including the trailing slash)

Windows:
C:\> java -classpath c:\pageVault\WEB-INF\lib\pageVault.jar com.pageVault.maint.ShowIndexStats c:/pageVault/archive/index

Unix:
java -classpath /usr/local/pageVault/WEB-INF/lib/pageVault.jar com.pageVault.maint.ShowIndexStats /usr/local/pageVault/archive/index

INFO main 21:18:37 : Reopened index: /usr/local/pageVault/archive/index/pageVaultUrlIndex with 2673 records
INFO main 21:18:37 :   pageSize: 128
INFO main 21:18:38 :   height: 2, non leaf pages: 1, leaf pages: 35
INFO main 21:18:38 :  av entries per leaf page: 76
INFO main 21:18:38 : Reopened index: /usr/local/pageVault/archive/index/pageVaultTimestampIndex with 2673 records
INFO main 21:18:38 :   pageSize: 128
INFO main 21:18:38 :   height: 2, non leaf pages: 1, leaf pages: 38
INFO main 21:18:38 :  av entries per leaf page: 70

Pagesize The maximum number of index entries per index page.

Height The "height" of the B+ Tree: the maximum number of pages between the root of the index and a leaf page.

Non Leaf pages The number of intermediate pages in the index.

Leaf pages The number of leaf pages, which are those pages which contain data rather than pointers to lower level index pages.

Av entries per leaf page The average number of entries in use on the leaf pages.
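The statistics can also be collected by a script, for example to track them over time. The sketch below assumes the output has the same shape as the sample shown above; the function names are hypothetical. The fill ratio it computes (average entries per leaf page divided by page size) is a derived figure for illustration, not something ShowIndexStats reports.

```python
import re

FIELDS = r"(pageSize|height|non leaf pages|leaf pages|av entries per leaf page): (\d+)"


def parse_index_stats(output):
    """Parse ShowIndexStats-style log lines into {index name: {field: int}}."""
    stats = {}
    current = None
    for line in output.splitlines():
        opened = re.search(r"Reopened index: (\S+) with (\d+) records", line)
        if opened:
            # Use the last path component as the index name.
            name = re.split(r"[\\/]", opened.group(1))[-1]
            current = stats.setdefault(name, {})
            current["records"] = int(opened.group(2))
            continue
        if current is not None:
            for key, value in re.findall(FIELDS, line):
                current[key] = int(value)
    return stats


def leaf_fill_ratio(index_stats):
    """Average leaf entries as a fraction of the page size (derived figure)."""
    return index_stats["av entries per leaf page"] / index_stats["pageSize"]
```

For the sample output, the URL index has 2673 records across 35 leaf pages, which is consistent with the reported average of 76 entries per leaf page (2673 / 35 is roughly 76.4).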

com.pageVault.maint.ExportURLIndex

Exports the index entries for later import by the ImportURLIndex utility. Export/import can be run if the indices become very fragmented. Export/import must be run with the archiver instance stopped.

Parameters: the name of the directory containing the pageVault indices (not including the trailing slash) and the name of the exported index file to be written.

Windows:
C:\> java -classpath c:\pageVault\WEB-INF\lib\pageVault.jar com.pageVault.maint.ExportURLIndex c:/pageVault/archive/index c:/tmp/exportedIndex

Unix:
java -classpath /usr/local/pageVault/WEB-INF/lib/pageVault.jar com.pageVault.maint.ExportURLIndex /usr/local/pageVault/archive/index /work/exportedIndex-11June

The export utility executes quickly, dumping on the order of 1,000,000 records every 2 minutes.

com.pageVault.maint.ImportURLIndex

Imports the index entries produced by the ExportURLIndex or ReextractIndexFromData utility. This utility recreates both URL and Timestamp B+ Tree indices. Export/import must be run with the archiver instance stopped.

Parameters: the name of the directory containing the pageVault indices (not including the trailing slash), the name of the index file to be imported and the B+Tree index page size (recommended value: 128).

Warning

This utility will almost always be used for recreating an index, not merging two archives. However, this utility does not delete any existing index entries it finds. Hence, you should normally ensure that the directory represented by the first parameter is empty, thus forcing the creation of new index files. Also, the 3rd parameter (page size) is ignored unless new index files are created.

Windows:
C:\> java -classpath c:\pageVault\WEB-INF\lib\pageVault.jar com.pageVault.maint.ImportURLIndex c:/pageVault/archive/index c:/tmp/exportedIndex 128

Unix:
java -classpath /usr/local/pageVault/WEB-INF/lib/pageVault.jar com.pageVault.maint.ImportURLIndex /usr/local/pageVault/archive/index /work/exportedIndex-11June 256

The import utility executes quickly, processing on the order of 1,000,000 records every 10 minutes.
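When the export/import cycle is scripted, the command lines can be assembled programmatically. The sketch below only builds the argument lists for the two utilities described above and checks the "empty index directory" precondition from the warning. The helper names are hypothetical, and the jar location is the Unix example path from this document, which may differ on your system.

```python
import os

# Unix example location from this document; adjust for your installation.
JAR = "/usr/local/pageVault/WEB-INF/lib/pageVault.jar"


def maint_command(utility, *args, jar=JAR):
    """Build the argv list for one of the com.pageVault.maint utilities."""
    return ["java", "-classpath", jar, "com.pageVault.maint." + utility,
            *map(str, args)]


def rebuild_commands(index_dir, export_file, page_size=128):
    """Commands for an export followed by an import into index_dir."""
    return [
        maint_command("ExportURLIndex", index_dir, export_file),
        maint_command("ImportURLIndex", index_dir, export_file, page_size),
    ]


def index_dir_is_empty(index_dir):
    """True if index_dir exists and has no entries, so new index files
    will be created by the import."""
    return os.path.isdir(index_dir) and not os.listdir(index_dir)
```

Each list can be passed to `subprocess.run` once the archiver instance has been stopped. Emptying the index directory between the export and the import is left to the operator; `index_dir_is_empty` only verifies that it has been done.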

com.pageVault.maint.ReextractIndexFromData

Reads the data files in the archive to recreate all the index entries for the archive. The output file from this utility can be provided to the ImportURLIndex utility to recreate the index files. This utility must be run with the archiver instance stopped.

Parameters: the name of the directory containing the pageVault data (not including the trailing slash) and the name of the index file to be written.

Windows:
C:\> java -classpath c:\pageVault\WEB-INF\lib\pageVault.jar com.pageVault.maint.ReextractIndexFromData c:/pageVault/archive/data c:/tmp/recreatedIndex

Unix:
java -classpath /usr/local/pageVault/WEB-INF/lib/pageVault.jar com.pageVault.maint.ReextractIndexFromData /usr/local/pageVault/archive/data /work/recreatedIndex-11June

This utility executes relatively slowly, as it needs to open and read the header of every data file. Processing speed therefore depends heavily on the rate at which the host operating system and hardware can open and close files and perform random, dispersed reads across the disks. A speed of a few thousand files per minute is typical.

The pageVault distributor and archiver components have an in-built instrumentation facility which can be used for monitoring. The instrumentation facility is enabled by defining a TCP/IP instrumentation port in each component:

Distributor

Add the distributorInstrumentationListener parameter defining an available TCP/IP port to the DistributorParms.xml file.

<distributorInstrumentationListener port="8082"/>

After restarting the distributor, you can then issue an HTTP request to that port on the machine running the distributor, eg http://10.84.12.101:8082.

The response shows the following details and statistics for the current instantiation of the Distributor:

Distributor name	As defined by the distributorName element in the `DistributorParms.xml` file
Instance status	Current status of the distributor - normally polling (reading the directories written to by the web server filter) or sleeping (sleeping between polls)
Instance started	Date/time this distributor instance was started
Instance current time	Current date/time on the machine running this distributor
Last poll sleep time	Most recent date/time that this instance slept between polls
Current poll sleep interval	Length of the most recent sleep in milliseconds
Incomplete file processing deferred	Occasionally the distributor will attempt to process information being simultaneously written by the pageVault web server filter. When this is detected the distributor ceases to process the file, deferring processing until it has been completed by the filter.
Stale files deleted	If for some reason the web server filter aborts whilst writing a response file (eg, software or power failure) then this file will never be marked as completed by the filter. A file which has been in a "deferred" status for more than a predetermined time interval is deemed to be such a file, and is deleted by the distributor.
Duplicate responses detected	The web server filter does a 'best guess' attempt at detecting duplicate responses. However, it has a limited and process-local cache of previous responses, whilst the distributor has a bigger and more global cache. Hence, the distributor will detect some duplicate responses not identified by the web server filter.
Unmovable responses discarded	The distributor moves completed requests from the directory in which they have been written by the web server filter to a private queue. If for some reason a request file cannot be moved it is deleted. This indicates a severe error condition within the pageVault system and must be further investigated.
Novel responses queued	The number of responses considered 'novel' by the distributor and hence queued to be sent to the archiver.
Failed (requeued) transfers to Archiver	If the connection between the distributor and the archiver is lost during transfer of a response then the response is requeued for resending when the connection is reestablished.
Duplicate responses detected by Archiver	The Archiver has true global and persistent knowledge of all responses gathered and hence will detect duplicate responses not removed by the distributor (or before it, by the web server filter).
Novel response sent to Archiver	Responses sent from this distributor instance to the archiver.

This response can be used as a "heart-beat" to monitor the distributor. A valid HTTP response code '200' indicates that the distributor (thinks it) is operating normally.

Archiver

Add the archiverInstrumentationListener parameter defining an available TCP/IP port to the ArchiverParms.xml file.

<archiverInstrumentationListener port="8083"/>

After restarting the archiver, you can then issue an HTTP request to that port on the machine running the archiver, eg http://10.84.12.101:8083.

The response shows the following details and statistics for the current instantiation of the Archiver:

Archiver name	As defined by the archiverName element in the `ArchiverParms.xml` file
Archive size	The total number of responses in the archive
Earliest timestamp	The timestamp of the earliest response in the archive
Latest timestamp	The timestamp of the latest response in the archive
Instance started	Date/time this archiver instance was started
Instance current time	Current date/time on the machine running this archiver
Viewer requests	The number of requests to view or search the contents of the archive received through the pageVault Viewer interface
Total requests from distributors	The number of requests received from distributors to store responses which the distributors consider 'novel'
Repeated latest responses	The number of the total responses received from distributors which are not novel but match the most recent recorded response for a URL which has been already archived.
Repeated older responses	The number of the total responses received from distributors which are not novel but match an earlier response for a URL. This may happen if, for example, a response for a URL reverts to a previous version.
Novel responses	The number of the total responses received from distributors which are novel and are stored in the archive.

This response can be used as a "heart-beat" to monitor the archiver. A valid HTTP response code '200' indicates that the archiver (thinks it) is operating normally.
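A heart-beat check against either instrumentation port can be scripted. The following is a minimal sketch, not a pageVault tool: the function name `heartbeat_ok` and the timeout value are assumptions, and it treats anything other than an HTTP 200 (including a refused connection or a timeout) as a failed heart-beat.

```python
import http.client
import urllib.request


def heartbeat_ok(url, timeout=5.0):
    """Return True if an HTTP GET to url answers with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # OSError covers refused connections, timeouts and HTTP error
        # statuses (URLError and HTTPError are OSError subclasses).
        return False
```

Using the example addresses from this section, a monitor could call `heartbeat_ok("http://10.84.12.101:8082")` for the distributor and `heartbeat_ok("http://10.84.12.101:8083")` for the archiver at regular intervals and raise an alert when either returns False.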
