pageVault Reference Manual

Part 3 - Operation


  1. Deleting archived data

    Responses can be removed from a pageVault archive based on the URL and timestamp of the response. The removed responses can either be deleted entirely or moved to a separate archive. The ability to remove responses from the main archive to an old archive allows you to keep a series of pageVault archives covering responses for separate time periods, eg, a 2004 archive, a 2005 archive, etc.

    The deletion of responses is handled by the Archiver as part of its normal operation. At a specified time each day, the Archiver initiates a background process which scans the archive for content to be removed. This process is controlled by parameters supplied in the archiveDeletionRules element of the archiver parameter file.

    Here's a simple example of deletion rules. The archiver will check each day for all responses more than 366 days old:

    <archiveDeletionRules checkTime="0230" test="n">
      <archiveDeletionRule deleteAfterDays="366">
        <urlPattern>.*</urlPattern>
      </archiveDeletionRule>
    </archiveDeletionRules>

    The time at which the check starts is determined by the checkTime parameter, which should specify a time in hhmm format (eg, 0000 for midnight, 0530 for 5:30am, 2100 for 9pm). If this parameter is omitted or invalid it defaults to 0230 (2:30am).

    If the test parameter is set to y then the archiver will go through the process of determining which responses to delete but will not actually remove the responses from the archive. If it is set to n then eligible responses will be removed from the archive.
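
    For example, a dry-run version of the simple rules above (a sketch: only the test attribute differs) determines and logs which responses are eligible for deletion without removing anything:

    <archiveDeletionRules checkTime="0230" test="y">
      <archiveDeletionRule deleteAfterDays="366">
        <urlPattern>.*</urlPattern>
      </archiveDeletionRule>
    </archiveDeletionRules>

    Running with test="y" first is a convenient way to verify new deletion rules before allowing them to remove content.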

    Multiple archiveDeletionRule elements can be provided, each with a separate deleteAfterDays setting. As well, a log file can be specified via the deletionLog attribute which will receive all messages produced by the deletion process (if no log file is nominated, deletion processing messages are written to standard output). In the following example, responses with URLs starting with "intranet" are deleted after 183 days, those whose URLs contain "/sales/" are deleted after 1000 days, and all other responses are deleted after 366 days:

    <archiveDeletionRules checkTime="0230" deletionLog="c:\pageVault\logs\deletion.log" test="n">
      <archiveDeletionRule deleteAfterDays="183">
        <urlPattern>^intranet.*</urlPattern>
      </archiveDeletionRule>
      <archiveDeletionRule deleteAfterDays="1000">
        <urlPattern>.*/sales/.*</urlPattern>
      </archiveDeletionRule>
      <archiveDeletionRule deleteAfterDays="366">
        <urlPattern>.*</urlPattern>
      </archiveDeletionRule>
    </archiveDeletionRules>

    This is how multiple archiveDeletionRules are processed:

    The disposition of a response is determined by any matching rule. However, a "wildcard" rule consisting of just the ".*" regular expression will not match a URL which is also matched by any other rule. For example, if there were two archiveDeletionRules:

    1. delete after 200 days URLs matching ^www.somename.com/fred.*
    2. delete after 100 days URLs matching .*

    then responses at the URL www.somename.com/fred/index.html will be deleted after 200 days, not after 100 days.
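
    Expressed as configuration, those two rules would look like the following sketch (the checkTime and test values are illustrative):

    <archiveDeletionRules checkTime="0230" test="n">
      <archiveDeletionRule deleteAfterDays="200">
        <urlPattern>^www.somename.com/fred.*</urlPattern>
      </archiveDeletionRule>
      <archiveDeletionRule deleteAfterDays="100">
        <urlPattern>.*</urlPattern>
      </archiveDeletionRule>
    </archiveDeletionRules>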

    Deleted responses may optionally be moved to another pageVault archive. This allows "pruning" of the active archive by moving old responses to a different archive. The following parameters must be either both present or both absent:

    1. moveToIndexDirectory defines the name of a directory containing the pageVault archive index used to store the deleted responses. This directory (and index) will be created if it doesn't exist.
    2. moveToFileDirectory defines the name of a directory to contain the pageVault response files.

    If these parameters are not present then responses deleted from this archive will not be moved anywhere: they will be completely and permanently deleted.
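
    A minimal rule using both parameters might look like the following sketch (the directory paths are illustrative):

    <archiveDeletionRule deleteAfterDays="366"
        moveToIndexDirectory="/pageVault/oldArchives/index2004"
        moveToFileDirectory="/pageVault/oldArchives/data2004">
      <urlPattern>.*</urlPattern>
    </archiveDeletionRule>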

    Here's an extended example demonstrating several urlPatterns per rule and also saving some of the removed responses into separate archives:

    <archiveDeletionRules checkTime="0230" deletionLog="/pageVault/logs/deletion.log" test="n">
      <archiveDeletionRule deleteAfterDays="479"
          moveToIndexDirectory="/pageVault/oldArchives/indexDeleted1"
          moveToFileDirectory="/pageVault/oldArchives/dataDeleted1">
        <urlPattern>^localhost.*privacy.*</urlPattern>
        <urlPattern>^127.0.0.1*</urlPattern>
        <urlPattern>^www.*</urlPattern>
      </archiveDeletionRule>
      <archiveDeletionRule deleteAfterDays="450"
          moveToIndexDirectory="/pageVault/oldArchives/indexDeleted2"
          moveToFileDirectory="/pageVault/oldArchives/dataDeleted2">
        <urlPattern>^www.*vtour.*</urlPattern>
      </archiveDeletionRule>
      <archiveDeletionRule deleteAfterDays="465">
        <urlPattern>^intranet/.*</urlPattern>
      </archiveDeletionRule>
      <archiveDeletionRule deleteAfterDays="730">
        <urlPattern>.*</urlPattern>
      </archiveDeletionRule>
    </archiveDeletionRules>

  2. Index Maintenance

    The pageVault archiver maintains two B+ Tree indices to the response archive allowing rapid searching and retrieval. It is the nature of such indices that over time they can slowly become unbalanced due to non-random insertion order of entries. The following utilities allow the archive indices to be restructured to minimize space and response time. How often this process needs to be run will vary from archive to archive: most archives will never need reorganisation.

    Note

    These index maintenance utilities must be run from the command line.

    The following examples assume that you can run the Java run time (that is, that a Java 1.4+ runtime is in your path) and that you know where the distributed pageVault.jar is located, which may be:

    Windows:
    c:\pageVault\WEB-INF\lib\pageVault.jar

    Unix:
    /usr/local/pageVault/WEB-INF/lib/pageVault.jar
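
    To confirm that a suitable runtime is on your path, you can first run the standard version check:

    java -version

    If this reports version 1.4 or later, the utilities below should run as shown.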

    com.pageVault.maint.ShowIndexStats

    Displays basic statistics for the B+ Tree indices.

    Parameter: the name of the directory containing the pageVault indices (not including the trailing slash)

    Windows:
    C:\> java -classpath c:\pageVault\WEB-INF\lib\pageVault.jar com.pageVault.maint.ShowIndexStats c:/pageVault/archive/index

    Unix:
    java -classpath /usr/local/pageVault/WEB-INF/lib/pageVault.jar com.pageVault.maint.ShowIndexStats /usr/local/pageVault/archive/index

    INFO main 21:18:37 : Reopened index: /usr/local/pageVault/archive/index/pageVaultUrlIndex with 2673 records
    INFO main 21:18:37 :   pageSize: 128
    INFO main 21:18:38 :   height: 2, non leaf pages: 1, leaf pages: 35
    INFO main 21:18:38 :  av entries per leaf page: 76
    INFO main 21:18:38 : Reopened index: /usr/local/pageVault/archive/index/pageVaultTimestampIndex with 2673 records
    INFO main 21:18:38 :   pageSize: 128
    INFO main 21:18:38 :   height: 2, non leaf pages: 1, leaf pages: 38
    INFO main 21:18:38 :  av entries per leaf page: 70
    

    Pagesize: The maximum number of index entries per index page.
    Height: The "height" of the B+ Tree: the maximum number of pages between the root of the index and a leaf page.
    Non leaf pages: The number of intermediate pages in the index.
    Leaf pages: The number of leaf pages, which are those pages which contain data rather than pointers to lower level index pages.
    Av entries per leaf page: The average number of entries in use on the leaf pages. In the output above, for example, the URL index holds 2673 records across 35 leaf pages, giving an average of 76 entries per leaf page.

    com.pageVault.maint.ExportURLIndex

    Exports the index entries for later import by the ImportURLIndex utility. Export/import can be run if the indices become very fragmented. Export/import must be run with the archiver instance stopped.

    Parameters: the name of the directory containing the pageVault indices (not including the trailing slash) and the name of the exported index file to be written.

    Windows:
    C:\> java -classpath c:\pageVault\WEB-INF\lib\pageVault.jar com.pageVault.maint.ExportURLIndex c:/pageVault/archive/index c:/tmp/exportedIndex

    Unix:
    java -classpath /usr/local/pageVault/WEB-INF/lib/pageVault.jar com.pageVault.maint.ExportURLIndex /usr/local/pageVault/archive/index /work/exportedIndex-11June

    The export utility executes quickly, dumping in the order of 1,000,000 records every 2 minutes.

    com.pageVault.maint.ImportURLIndex

    Imports the index entries produced by the ExportURLIndex or ReextractIndexFromData utility. This utility recreates both URL and Timestamp B+ Tree indices. Export/import must be run with the archiver instance stopped.

    Parameters: the name of the directory containing the pageVault indices (not including the trailing slash), the name of the index file to be imported and the B+Tree index page size (recommended value: 128).

    Warning

    This utility will almost always be used for recreating an index, not merging two archives. However, this utility does not delete any existing index entries it finds. Hence, you should normally ensure that the directory represented by the first parameter is empty, thus forcing the creation of new index files. Also, the 3rd parameter (page size) is ignored unless new index files are created.

    Windows:
    C:\> java -classpath c:\pageVault\WEB-INF\lib\pageVault.jar com.pageVault.maint.ImportURLIndex c:/pageVault/archive/index c:/tmp/exportedIndex 128

    Unix:
    java -classpath /usr/local/pageVault/WEB-INF/lib/pageVault.jar com.pageVault.maint.ImportURLIndex /usr/local/pageVault/archive/index /work/exportedIndex-11June 256

    The import utility executes quickly, processing in the order of 1,000,000 records every 10 minutes.
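
    Putting the two utilities together, a complete index reorganisation on Unix might look like the following sketch (the exported file name and the renaming of the old index directory are illustrative; the archiver instance must be stopped first and restarted afterwards):

    java -classpath /usr/local/pageVault/WEB-INF/lib/pageVault.jar com.pageVault.maint.ExportURLIndex /usr/local/pageVault/archive/index /work/exportedIndex
    mv /usr/local/pageVault/archive/index /usr/local/pageVault/archive/index.old
    mkdir /usr/local/pageVault/archive/index
    java -classpath /usr/local/pageVault/WEB-INF/lib/pageVault.jar com.pageVault.maint.ImportURLIndex /usr/local/pageVault/archive/index /work/exportedIndex 128

    Starting with an empty index directory forces the creation of new index files, as recommended in the warning above.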

    com.pageVault.maint.ReextractIndexFromData

    Reads the data files in the archive to recreate all the index entries for the archive. The output file from this utility can be provided to the ImportURLIndex utility to recreate the index files. This utility must be run with the archiver instance stopped.

    Parameters: the name of the directory containing the pageVault data (not including the trailing slash) and the name of the index file to be written.

    Windows:
    C:\> java -classpath c:\pageVault\WEB-INF\lib\pageVault.jar com.pageVault.maint.ReextractIndexFromData c:/pageVault/archive/data c:/tmp/recreatedIndex

    Unix:
    java -classpath /usr/local/pageVault/WEB-INF/lib/pageVault.jar com.pageVault.maint.ReextractIndexFromData /usr/local/pageVault/archive/data /work/recreatedIndex-11June

    This utility executes relatively slowly, as it needs to open and read the header of every data file. Processing speed is therefore very dependent on the rate at which the hosting operating system and hardware can support file opening and closing and randomly dispersed reads across the disks. A speed of a few thousand files per minute is typical.

  3. Instrumentation

    The pageVault distributor and archiver components have an in-built instrumentation facility which can be used for monitoring. The instrumentation facility is enabled by defining a TCP/IP instrumentation port in each component:

    Distributor

    Add the distributorInstrumentationListener parameter defining an available TCP/IP port to the DistributorParms.xml file.

    <!-- Listen for connection requests to the distributor instrumentation
         server, which can be used for either "heartbeat" monitoring or to get
         statistics on the distributor's operation. Comment out the
         distributorInstrumentationListener element if you don't want the
         Distributor to accept instrumentation requests.
         Attribute port: the port number on which to listen for instrumentation
         request connections. -->
    <distributorInstrumentationListener port="8082"/>

    After restarting the distributor, you can then issue an HTTP request to that port on the machine running the distributor, eg http://10.84.12.101:8082.
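
    For example, with a command-line HTTP client such as curl (the address is illustrative; substitute your distributor's host and configured port):

    curl -i http://10.84.12.101:8082

    The -i flag includes the HTTP status line in the output, which is what a heartbeat monitor checks (see below).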

    The response shows the following details and statistics for the current instantiation of the Distributor:

    Distributor name: As defined by the distributorName element in the DistributorParms.xml file.
    Instance status: Current status of the distributor - normally polling (reading the directories written to by the web server filter) or sleeping (sleeping between polls).
    Instance started: Date/time this distributor instance was started.
    Instance current time: Current date/time on the machine running this distributor.
    Last poll sleep time: Most recent date/time that this instance slept between polls.
    Current poll sleep interval: Length of the most recent sleep in milliseconds.
    Incomplete file processing deferred: Occasionally the distributor will attempt to process information being simultaneously written by the pageVault web server filter. When this is detected the distributor ceases to process the file, deferring processing until it has been completed by the filter.
    Stale files deleted: If for some reason the web server filter aborts whilst writing a response file (eg, software or power failure) then this file will never be marked as completed by the filter. A file which has been in a "deferred" status for more than a predetermined time interval is deemed to be such a file, and is deleted by the distributor.
    Duplicate responses detected: The web server filter does a 'best guess' attempt at detecting duplicate responses. However, it has a limited and process-local cache of previous responses, whilst the distributor has a bigger and more global cache. Hence, the distributor will detect some duplicate responses not identified by the web server filter.
    Unmovable responses discarded: The distributor moves completed requests from the directory in which they have been written by the web server filter to a private queue. If for some reason a request file cannot be moved it is deleted. This indicates a severe error condition within the pageVault system and must be further investigated.
    Novel responses queued: The number of responses considered 'novel' by the distributor and hence queued to be sent to the archiver.
    Failed (requeued) transfers to Archiver: If the connection between the distributor and the archiver is lost during transfer of a response then the response is requeued for resending when the connection is reestablished.
    Duplicate responses detected by Archiver: The Archiver has true global and persistent knowledge of all responses gathered and hence will detect duplicate responses not removed by the distributor (or before it, by the web server filter).
    Novel responses sent to Archiver: Responses sent from this distributor instance to the archiver.

    This response can be used as a "heart-beat" to monitor the distributor. A valid HTTP response code '200' indicates that the distributor (thinks it) is operating normally.
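
    As a minimal sketch of such a monitor (the class name, host, and port are illustrative; any HTTP client will do), the following standalone Java program exits non-zero unless it receives a 200 response:

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class HeartbeatCheck {
        public static void main(String[] args) throws Exception {
            // Illustrative address: substitute the host running the
            // distributor (or archiver) and its instrumentation port.
            URL url = new URL("http://10.84.12.101:8082");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setConnectTimeout(5000); // fail quickly if the component is down
            conn.setReadTimeout(5000);
            int status = conn.getResponseCode(); // performs the GET request
            conn.disconnect();
            if (status == 200) {
                System.out.println("component is operating normally");
            } else {
                System.out.println("unexpected HTTP status: " + status);
                System.exit(1);
            }
        }
    }

    The same program, pointed at the archiver's instrumentation port, serves equally as the archiver heartbeat described below.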

    Archiver

    Add the archiverInstrumentationListener parameter defining an available TCP/IP port to the ArchiverParms.xml file.

    <!-- Listen for connection requests to the archiver instrumentation
         server, which can be used for either "heartbeat" monitoring or to get
         statistics on the archiver's operation. Comment out the
         archiverInstrumentationListener element if you don't want the
         Archiver to accept instrumentation requests.
         Attribute port: the port number on which to listen for instrumentation
         request connections. -->
    <archiverInstrumentationListener port="8083"/>

    After restarting the archiver, you can then issue an HTTP request to that port on the machine running the archiver, eg http://10.84.12.101:8083.

    The response shows the following details and statistics for the current instantiation of the Archiver:

    Archiver name: As defined by the archiverName element in the ArchiverParms.xml file.
    Archive size: The total number of responses in the archive.
    Earliest timestamp: The timestamp of the earliest response in the archive.
    Latest timestamp: The timestamp of the latest response in the archive.
    Instance started: Date/time this archiver instance was started.
    Instance current time: Current date/time on the machine running this archiver.
    Viewer requests: The number of requests to view or search the contents of the archive received through the pageVault Viewer interface.
    Total requests from distributors: The number of requests to receive responses from distributors which the distributors consider 'novel'.
    Repeated latest responses: The number of the total responses received from distributors which are not novel but match the most recent recorded response for a URL which has already been archived.
    Repeated older responses: The number of the total responses received from distributors which are not novel but match an earlier response for a URL. This may happen if, for example, a response for a URL reverts to a previous version.
    Novel responses: The number of the total responses received from distributors which are novel and are stored in the archive.

    This response can be used as a "heart-beat" to monitor the archiver. A valid HTTP response code '200' indicates that the archiver (thinks it) is operating normally.

  4. Configuring for Performance

 