pageVault Reference Manual
Part 3 - Operation
Responses can be removed from a pageVault archive based on the URL and timestamp of the response. The removed responses can either be deleted entirely or moved to a separate archive. The ability to remove responses from the main archive to an old archive allows you to keep a series of pageVault archives covering responses for separate time periods, eg, a 2004 archive, a 2005 archive, etc.
The deletion of responses is handled by the Archiver as part of its normal operation. At a specified time each day, the Archiver will initiate a background process which scans the archive for content to be removed. This process is controlled by parameters supplied in the archiveDeletionRules element of the archiver parameter file.
Here's a simple example of deletion rules. The archiver will check each day for all responses more than 366 days old:
The time at which the check starts is determined by the checkTime parameter, which should specify a time in hhmm format (eg, 0000 for midnight, 0530 for 5:30am, 2100 for 9pm). If this parameter is omitted or invalid it defaults to 0230 (2:30am).
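The hhmm convention and its default can be illustrated with a short Python sketch. This is not pageVault code: the function name is hypothetical, and the exact conditions pageVault treats as "invalid" are an assumption (here, anything that is not four ASCII digits forming a valid 24-hour time).

```python
from datetime import time

# Documented default when checkTime is omitted or invalid: 0230.
DEFAULT_CHECK_TIME = time(2, 30)


def parse_check_time(value):
    """Turn an hhmm string into a datetime.time, falling back to 02:30.

    Assumption: "invalid" means not exactly four ASCII digits, or an
    hour above 23, or a minute above 59.
    """
    if value is None:
        return DEFAULT_CHECK_TIME
    text = value.strip()
    if len(text) != 4 or not (text.isascii() and text.isdigit()):
        return DEFAULT_CHECK_TIME
    hours, minutes = int(text[:2]), int(text[2:])
    if hours > 23 or minutes > 59:
        return DEFAULT_CHECK_TIME
    return time(hours, minutes)
```

For example, parse_check_time("2100") gives 21:00, while parse_check_time("2500") falls back to 02:30.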
If the test parameter is set to y then the archiver will go through the process of determining which responses to delete but will not actually remove the responses from the archive. If it is set to n then eligible responses will be removed from the archive.
Multiple archiveDeletionRule elements can be provided, each with a separate deleteAfterDays setting. As well, a log file can be specified which will receive all messages produced by the deletion process (if no log file is nominated, deletion processing messages are written to standard output). In the following example, responses with URLs starting with "intranet" are deleted after 183 days, those whose URLs contain "/sales/" are deleted after 1000 days and all other responses are deleted after 355 days:
This is how multiple archiveDeletionRules are processed:
The disposition of a response is determined by any matching rule. However, a "wildcard" rule consisting of just the ".*" regular expression will not match a URL which is also matched by any other rule. For example, if there were two archiveDeletionRules:
then the responses at URL www.somename.com/fred/index.html will be deleted after 200 days, not after 100 days.
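The wildcard precedence described above can be sketched in Python. This is an illustration only, not the Archiver's implementation. Several details are assumptions: the "/fred/" pattern is hypothetical (only the 100 and 200 day figures come from the text), patterns are assumed to be tested with a search anywhere in the URL, and when more than one non-wildcard rule matches, the first one listed is assumed to win.

```python
import re

WILDCARD = ".*"


def delete_after_days(url, rules):
    """Return the retention period (in days) that applies to url.

    rules is a list of (pattern, days) pairs. A rule whose pattern is
    exactly ".*" applies only when no other rule matches the URL.
    Returns None when nothing matches at all.
    """
    wildcard_days = None
    for pattern, days in rules:
        if pattern == WILDCARD:
            # Remember the wildcard, but keep looking for a specific rule.
            if wildcard_days is None:
                wildcard_days = days
            continue
        if re.search(pattern, url):  # assumption: match anywhere in the URL
            return days
    return wildcard_days


# Hypothetical rule set: a wildcard at 100 days and a specific rule at 200.
rules = [(".*", 100), ("/fred/", 200)]
print(delete_after_days("www.somename.com/fred/index.html", rules))  # 200
print(delete_after_days("www.somename.com/index.html", rules))       # 100
```

The order in which the wildcard appears does not matter in this sketch, since it is only consulted after every specific rule has failed to match.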
Deleted responses may be optionally moved to another pageVault archive. This allows "pruning" of the active archive by moving old responses to a different archive. The following parameters must be either both present or absent:
If these parameters are not present then responses deleted from this archive will not be moved anywhere: they will be completely and permanently deleted.
Here's an extended example demonstrating several urlPatterns per rule and also saving some of the removed responses into separate archives:
The pageVault archiver maintains two B+ Tree indices to the response archive allowing rapid searching and retrieval. It is the nature of such indices that over time they can slowly become unbalanced due to non-random insertion order of entries. The following utilities allow the archive indices to be restructured to minimize space and response time. How often this process needs to be run will vary from archive to archive: most archives will never need reorganisation.
These index maintenance utilities must be run from the command line.
The following examples assume that you can run the Java run time (that is, that a Java 1.4+ runtime is in your path) and that you know where the distributed pageVault.jar is located, which may be:
Windows:
c:\pageVault\WEB-INF\lib\pageVault.jar
Unix:
/usr/local/pageVault/WEB-INF/lib/pageVault.jar
com.pageVault.maint.ShowIndexStats
Displays basic statistics for the B+ Tree indices.
Parameter: the name of the directory containing the pageVault indices (not including the trailing slash)
Windows:
C:\> java -classpath c:\pageVault\WEB-INF\lib\pageVault.jar com.pageVault.maint.ShowIndexStats c:/pageVault/archive/index
Unix:
java -classpath /usr/local/pageVault/WEB-INF/lib/pageVault.jar com.pageVault.maint.ShowIndexStats /usr/local/pageVault/archive/index
INFO main 21:18:37 : Reopened index: /usr/local/pageVault/archive/index/pageVaultUrlIndex with 2673 records
INFO main 21:18:37 : pageSize: 128
INFO main 21:18:38 : height: 2, non leaf pages: 1, leaf pages: 35
INFO main 21:18:38 : av entries per leaf page: 76
INFO main 21:18:38 : Reopened index: /usr/local/pageVault/archive/index/pageVaultTimestampIndex with 2673 records
INFO main 21:18:38 : pageSize: 128
INFO main 21:18:38 : height: 2, non leaf pages: 1, leaf pages: 38
INFO main 21:18:38 : av entries per leaf page: 70
Pagesize | The maximum number of index entries per index page. |
Height | The "height" of the B+ Tree: the maximum number of pages between the root of the index and a leaf page. |
Non Leaf pages | The number of intermediate pages in the index. |
Leaf pages | The number of leaf pages, which are those pages which contain data rather than pointers to lower level index pages. |
Av entries per leaf page | The average number of entries in use on the leaf pages. |
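As a rough aid to reading these statistics, the figures in the sample output can be cross-checked with a few lines of Python. The sketch assumes the reported average is simply the record count divided by the number of leaf pages (the sample values are consistent with that, whether rounded or truncated), and it derives a "fill ratio" against pageSize, which is an illustrative measure rather than something ShowIndexStats reports. The function name is hypothetical.

```python
def leaf_stats(records, leaf_pages, page_size):
    """Return (average entries per leaf page, fill ratio of a leaf page).

    The average is truncated to a whole number, as an assumption about
    how the utility reports it; the fill ratio uses the exact average.
    """
    average = records / leaf_pages
    return int(average), average / page_size


# Figures taken from the sample output above.
print(leaf_stats(2673, 35, 128))  # URL index
print(leaf_stats(2673, 38, 128))  # timestamp index
```

With the sample figures, both indices come out at a little under 60% full per leaf page.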
com.pageVault.maint.ExportURLIndex
Exports the index entries for later import by the ImportURLIndex utility. Export/import can be run if the indices become very fragmented. Export/import must be run with the archiver instance stopped.
Parameters: the name of the directory containing the pageVault indices (not including the trailing slash) and the name of the exported index file to be written.
Windows:
C:\> java -classpath c:\pageVault\WEB-INF\lib\pageVault.jar com.pageVault.maint.ExportURLIndex c:/pageVault/archive/index c:/tmp/exportedIndex
Unix:
java -classpath /usr/local/pageVault/WEB-INF/lib/pageVault.jar com.pageVault.maint.ExportURLIndex /usr/local/pageVault/archive/index /work/exportedIndex-11June
The export utility executes quickly, dumping in the order of 1,000,000 records every 2 minutes.
com.pageVault.maint.ImportURLIndex
Imports the index entries produced by the ExportURLIndex or ReextractIndexFromData utility. This utility recreates both URL and Timestamp B+ Tree indices. Export/import must be run with the archiver instance stopped.
Parameters: the name of the directory containing the pageVault indices (not including the trailing slash), the name of the index file to be imported and the B+Tree index page size (recommended value: 128).
This utility will almost always be used for recreating an index, not merging two archives. However, this utility does not delete any existing index entries it finds. Hence, you should normally ensure that the directory represented by the first parameter is empty, thus forcing the creation of new index files. Also, the 3rd parameter (page size) is ignored unless new index files are created.
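Because ImportURLIndex does not delete existing entries, it may be worth confirming that the target directory really is empty before running it. A minimal Python sketch of such a pre-flight check follows; the helper name is hypothetical and it is not part of the pageVault distribution.

```python
import os


def index_dir_ready_for_import(index_dir):
    """Return True only if index_dir exists and contains no files.

    An empty directory forces ImportURLIndex to create new index files,
    which is also the only case in which the page size parameter is used.
    """
    return os.path.isdir(index_dir) and not os.listdir(index_dir)
```

A wrapper script could call this with the directory it is about to pass as the first parameter and refuse to continue if it returns False.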
Windows:
C:\> java -classpath c:\pageVault\WEB-INF\lib\pageVault.jar com.pageVault.maint.ImportURLIndex c:/pageVault/archive/index c:/tmp/exportedIndex 128
Unix:
java -classpath /usr/local/pageVault/WEB-INF/lib/pageVault.jar com.pageVault.maint.ImportURLIndex /usr/local/pageVault/archive/index /work/exportedIndex-11June 256
The import utility executes quickly, processing in the order of 1,000,000 records every 10 minutes.
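Since the archiver must be stopped for an export/import cycle, it can help to estimate the outage in advance. The following Python sketch uses only the approximate rates quoted for the two utilities (1,000,000 records per 2 minutes for export, per 10 minutes for import). Actual throughput will depend on hardware, so treat the result as an order-of-magnitude guide; the function name is hypothetical.

```python
# Approximate rates quoted in this manual, in records per minute.
EXPORT_RECORDS_PER_MINUTE = 1_000_000 / 2
IMPORT_RECORDS_PER_MINUTE = 1_000_000 / 10


def estimate_reorg_minutes(records):
    """Return (export, import, total) minutes for an export/import cycle."""
    export_minutes = records / EXPORT_RECORDS_PER_MINUTE
    import_minutes = records / IMPORT_RECORDS_PER_MINUTE
    return export_minutes, import_minutes, export_minutes + import_minutes


print(estimate_reorg_minutes(5_000_000))  # (10.0, 50.0, 60.0)
```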
com.pageVault.maint.ReextractIndexFromData
Reads the data files in the archive to recreate all the index entries for the archive. The output file from this utility can be provided to the ImportURLIndex utility to recreate the index files. This utility must be run with the archiver instance stopped.
Parameters: the name of the directory containing the pageVault data (not including the trailing slash) and the name of the index file to be written.
Windows:
C:\> java -classpath c:\pageVault\WEB-INF\lib\pageVault.jar com.pageVault.maint.ReextractIndexFromData c:/pageVault/archive/data c:/tmp/recreatedIndex
Unix:
java -classpath /usr/local/pageVault/WEB-INF/lib/pageVault.jar com.pageVault.maint.ReextractIndexFromData /usr/local/pageVault/archive/data /work/recreatedIndex-11June
This utility executes relatively slowly, as it needs to open and read the header of every data file. Hence processing speed depends heavily on the rate at which the hosting operating system and hardware can open and close files and perform random, dispersed reads across the disks. A speed of a few thousand files per minute is typical.
The pageVault distributor and archiver components have an in-built instrumentation facility which can be used for monitoring. The instrumentation facility is enabled by defining a TCP/IP instrumentation port in each component:
Distributor
Add the distributorInstrumentationListener parameter defining an available TCP/IP port to the DistributorParms.xml file.
After restarting the distributor, you can then issue an HTTP request to that port on the machine running the distributor, eg http://10.84.12.101:8082.
The response shows the following details and statistics for the current instantiation of the Distributor:
Distributor name | As defined by the distributorName element in the DistributorParms.xml file |
Instance status | Current status of the distributor - normally polling (reading the directories written to by the web server filter) or sleeping (sleeping between polls) |
Instance started | Date/time this distributor instance was started |
Instance current time | Current date/time on the machine running this distributor |
Last poll sleep time | Most recent date/time that this instance slept between polls |
Current poll sleep interval | Length of the most recent sleep in millisecs |
Incomplete file processing deferred | Occasionally the distributor will attempt to process information being simultaneously written by the pageVault web server filter. When this is detected the distributor ceases to process the file, deferring processing until it has been completed by the filter. |
Stale files deleted | If for some reason the web server filter aborts whilst writing a response file (eg, software or power failure) then this file will never be marked as completed by the filter. A file which has been in a "deferred" status for more than a predetermined time interval is deemed to be such a file, and is deleted by the distributor. |
Duplicate responses detected | The web server filter does a 'best guess' attempt at detecting duplicate responses. However, it has a limited and process-local cache of previous responses, whilst the distributor has a bigger and more global cache. Hence, the distributor will detect some duplicate responses not identified by the web server filter. |
Unmovable responses discarded | The distributor moves completed requests from the directory in which they have been written by the web server filter to a private queue. If for some reason a request file cannot be moved it is deleted. This indicates a severe error condition within the pageVault system and must be further investigated. |
Novel responses queued | The number of responses considered 'novel' by the distributor and hence queued to be sent to the archiver. |
Failed (requeued) transfers to Archiver | If the connection between the distributor and the archiver is lost during transfer of a response then the response is requeued for resending when the connection is reestablished. |
Duplicate responses detected by Archiver | The Archiver has true global and persistent knowledge of all responses gathered and hence will detect duplicate responses not removed by the distributor (or before it, by the web server filter). |
Novel response sent to Archiver | Responses sent from this distributor instance to the archiver. |
This response can be used as a "heart-beat" to monitor the distributor. A valid HTTP response code '200' indicates that the distributor (thinks it) is operating normally.
Archiver
Add the archiverInstrumentationListener parameter defining an available TCP/IP port to the ArchiverParms.xml file.
After restarting the archiver, you can then issue an HTTP request to that port on the machine running the archiver, eg http://10.84.12.101:8083.
The response shows the following details and statistics for the current instantiation of the Archiver:
Archiver name | As defined by the archiverName element in the ArchiverParms.xml file |
Archive size | The total number of responses in the archive |
Earliest timestamp | The timestamp of the earliest response in the archive |
Latest timestamp | The timestamp of the latest response in the archive |
Instance started | Date/time this archiver instance was started |
Instance current time | Current date/time on the machine running this archiver |
Viewer requests | The number of requests to view or search the contents of the archive received through the pageVault Viewer interface |
Total requests from distributors | The number of requests from distributors asking the archiver to receive responses which the distributors consider 'novel' |
Repeated latest responses | The number of the total responses received from distributors which are not novel but match the most recent recorded response for a URL which has been already archived. |
Repeated older responses | The number of the total responses received from distributors which are not novel but match an earlier response for a URL. This may happen if, for example, a response for a URL reverts to a previous version. |
Novel responses | The number of the total responses received from distributors which are novel and are stored in the archive. |
This response can be used as a "heart-beat" to monitor the archiver. A valid HTTP response code '200' indicates that the archiver (thinks it) is operating normally.
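A heart-beat check against either instrumentation port could look like the following Python sketch. It is an illustration under stated assumptions, not a pageVault tool: the function name is hypothetical, proxies are deliberately bypassed on the assumption that the monitor talks to the component directly, and any failure to obtain a '200' response is treated as "not alive".

```python
import http.client
import urllib.error
import urllib.request

# Bypass any configured proxy so the request goes straight to the component.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def is_alive(url, timeout=5.0):
    """Return True if an HTTP GET to the instrumentation URL returns 200."""
    try:
        with _opener.open(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, malformed reply or HTTP error status.
        return False
```

Using the example addresses from this section, a monitor might call is_alive("http://10.84.12.101:8082") for the distributor and is_alive("http://10.84.12.101:8083") for the archiver at a regular interval and raise an alert when either returns False.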
Project Computing Pty Ltd ACN: 008 590 967 | contact@projectComputing.com |