Project Computing - Australian University research full text search

A Google Custom Search Engine searching the full text of research published in Australian university repositories.

What does this do? How does it work?

Most Australian universities manage repositories of their publications. Google may crawl those repositories and index the text. The Google Custom Search Engine lets anyone scope a search over just a specific part of the entire world wide web. This tool uses the Google Custom Search Engine to limit a search to just some Australian university repositories, more or less.

How does it know where to look?

I've given it a list of URL patterns (see below). This list is almost certainly incomplete, so if you know how I can improve it, please email me at kent.fitch@projectcomputing.com. The original list was seeded from links found here and then extended by running Google searches such as site:xxx.edu.au filetype:pdf and looking for interesting patterns.
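
The pattern-hunting searches mentioned above can be generated mechanically. This is only a minimal sketch: the function name and the sample host list are illustrative assumptions, not part of this tool. It builds query strings and does not contact Google.

```python
def discovery_queries(hosts, filetype="pdf"):
    """Build one 'site:<host> filetype:<type>' query string per host."""
    return [f"site:{host} filetype:{filetype}" for host in hosts]


# Sample hosts, taken from the pattern list on this page (illustrative only).
sample_hosts = ["eprints.utas.edu.au", "dspace.anu.edu.au"]

for query in discovery_queries(sample_hosts):
    print(query)
```

Each query can then be pasted into Google by hand to look for recurring path prefixes worth adding to the list.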

How is this related to ARROW?

It isn't, really. ARROW is a program aimed at helping Australian universities establish repositories. It also regularly collects the metadata describing the contents of those repositories (not the full text) and "pushes" what it finds to Google, which reads and indexes those contents and other pages it discovers as a result of that processing.

The ARROW Discovery Service lets you search the collected metadata. Its results are relevance ranked based on the occurrence of your search term in the collected metadata. But it doesn't "harvest" or index the full text. So unless your search term appears in the metadata of a resource, it won't find it.

This tool, however, searches the full text of resources Google has crawled and their link metadata (the text contents of hyperlinks on the web which point to these resources). Relevance ranking is based on Google-magic, which includes occurrence of your search term in the full text contents and incoming links as well as the number of incoming links and their PageRank. For result sets made up of resources with few incoming links, ranking is rather poor.
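
The difference between the two kinds of search can be shown with a toy example. The records below are invented for illustration, and the plain substring match is a stand-in assumption; neither ARROW nor Google works this simply.

```python
# Toy records: 'metadata' stands in for what a metadata harvester collects,
# 'full_text' for what a crawler reads from the document itself.
records = [
    {"id": "thesis-1",
     "metadata": "Crop science thesis, 2005",
     "full_text": "We measured yields of genetically engineered canola."},
    {"id": "paper-2",
     "metadata": "Canola markets in Australia",
     "full_text": "Prices rose steadily over the decade."},
]


def search(records, term, field):
    """Return ids of records whose given field contains the term (case-insensitive)."""
    term = term.lower()
    return [r["id"] for r in records if term in r[field].lower()]


print(search(records, "canola", "metadata"))   # metadata-only search
print(search(records, "canola", "full_text"))  # full-text search
```

Here the metadata search finds only paper-2 and the full-text search finds only thesis-1, which is why the two approaches return different sets of resources.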

The two approaches will find slightly different sets of resources and present them in a different order. For example, compare:

genetically engineered canola on Google subset and ARROW
stolen generation on Google subset and ARROW
toxicity of treated pine on Google subset and ARROW
zinc alloys on Google subset and ARROW
tampa on Google subset and ARROW
unsupervised learning neural nets on Google subset and ARROW
face recognition on Google subset and ARROW
deregulated milk market on Google subset and ARROW
rural general practice on Google subset and ARROW
Don Watson on Google subset and ARROW
"climate change" "national security" on Google subset and ARROW
Phytophthora cinnamomi avocados on Google subset and ARROW
henry handel richardson on Google subset and ARROW

Can the two approaches be combined?

Of course! If searching on a subset of research outputs from the university sector by national boundary is important, I guess they will be!

Why do people want to limit searches to Australian university outputs placed in specific Australian university repositories?

Apart from bean-counters, probably they don't. It is very important that research "outputs" are made public and can be discovered - this is an important goal of ARROW. But whether it is useful to restrict discovery to some fraction of materials produced within a national boundary, reduced further to those materials uploaded to an Australian university repository allied with the ARROW service, then discovered and indexed by Google, is another question...

URL patterns searched by this tool

Note: Google does not (yet) index the contents of all these URLs. Some may be excluded from Google's view with a robots.txt configuration. Others may not be crawlable, or may not be linked from the outside web; exposing them to crawlers is one of the tasks ARROW is hoping to perform.
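
Whether a robots.txt file hides a URL from a crawler can be checked with Python's standard urllib.robotparser module. This is a sketch under stated assumptions: the robots.txt rules and the repository.example.edu.au URLs are hypothetical, and a real repository's rules will differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real repository's rules will differ.
robots_lines = [
    "User-agent: *",
    "Disallow: /bitstream/",
]

parser = RobotFileParser()
parser.parse(robots_lines)


def crawlable(url, user_agent="Googlebot"):
    """Report whether the parsed rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


print(crawlable("http://repository.example.edu.au/bitstream/1234/thesis.pdf"))
print(crawlable("http://repository.example.edu.au/about.html"))
```

With these sample rules the first URL is blocked and the second is allowed. To test a live site, the parser's set_url() and read() methods can fetch the real robots.txt instead of parsing hard-coded lines.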


ACT

www.library.unsw.edu.au/~thesis/adt-ADFA/uploads/*	
erl.canberra.edu.au/*
thesis.anu.edu.au/uploads/	
dspace.anu.edu.au/bitstream/*
dspace.anu.edu.au/html/*	
dlibrary.acu.edu.au/digitaltheses/*	

NSW

epubs.scu.edu.au/cgi/*	
library.uws.edu.au/adt-NUWS/uploads/*	
arrow.uws.edu.au:8080/vital/access/manager/Repository/*	
*.une.edu.au/*article*.pdf	
*.une.edu.au/*publications*.pdf	
*.une.edu.au/*Preprints*.pdf	
*.une.edu.au/*Report*.pdf	
www.researchonline.mq.edu.au:9080/vital/access/manager/Repository/*	
www.library.uow.edu.au/adt-NWU/uploads/*	
ro.uow.edu.au/cgi*	
www.library.unsw.edu.au/~thesis/adt-NUN/uploads/*	
unsworks.unsw.edu.au/vital/access/manager/Repository/*	
unsworks.unsw.edu.au/vital/access/services/Download/*	
epress.lib.uts.edu.au/dspace/html/*	
epress.lib.uts.edu.au/dspace/bitstream/*	
ses.library.usyd.edu.au/bitstream/*	
www.newcastle.edu.au/services/library/adt/uploads/*	
ogma.newcastle.edu.au:8080/vital/access/manager/Repository*	
csu.edu.au/research/*.pdf	

QLD

eprints.usq.edu.au/*.pdf	
eprints.usq.edu.au/*	
adt.library.qut.edu.au/adt-qut/uploads/*	
eprints.qut.edu.au/archive/*	
adt.library.uq.edu.au/public/*	
eprint.uq.edu.au/archive/*	
eprints.jcu.edu.au/*.pdf	
eprints.jcu.edu.au/*	
www4.gu.edu.au:8080/adt-root/uploads/*	
www98.griffith.edu.au/dspace/html/*	
www98.griffith.edu.au/dspace/bitstream/*	
research.usc.edu.au/vital/access/manager/Repository/*	
library-resources.cqu.edu.au/thesis/*	
acquire.cqu.edu.au:8080/vital/access/manager/Repository/*	
epublications.bond.edu.au/context/*	
epublications.bond.edu.au/cgi/*	

VIC

eprints.infodiv.unimelb.edu.au/archive/*	
digthesis.ballarat.edu.au/adt/uploads*	
wallaby.vu.edu.au/adt-VVUT/uploads/*	
eprints.vu.edu.au/archive/*	
adt.lib.swin.edu.au/uploads/*	
researchbank.swinburne.edu.au/vital/access/manager/Repository/*	
researchbank.swinburne.edu.au/vital/access/services/*	
adt.lib.rmit.edu.au/adt/uploads/*	
arrowprod.lib.monash.edu.au/vital/access/services/Download/*	
eprint.monash.edu.au/*	
alpha3.latrobe.edu.au/thesis/uploads/*	
tux.lib.deakin.edu.au/adt-VDU/*	

WA

espace.lis.curtin.edu.au/archive/*	 
espace.lis.curtin.edu.au/archive/*.pdf	 
adt.curtin.edu.au/theses/available/*	 	
ro.ecu.edu.au/rqf_submissionsfedrt/*	 
portal.ecu.edu.au/adt-public*	 	
wwwlib.murdoch.edu.au/adt/pubfiles/* 	
*.uwa.edu.au/*article*.pdf	 
theses.library.uwa.edu.au/adt-*	 

SA

digital.library.adelaide.edu.au/dspace/bitstream/*	 	
digital.library.adelaide.edu.au/dspace/html/*	 	
dspace.flinders.edu.au/dspace/bitstream/*	 	
dspace.flinders.edu.au/dspace/html/*	 	
catalogue.flinders.edu.au/local/adt/uploads/*	 	


TAS

eprints.utas.edu.au/*	 	
eprints.utas.edu.au/*.pdf
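
The patterns above use * as a wildcard. As a rough illustration of how such patterns select URLs, the sketch below uses Python's fnmatch module as a stand-in. That is an assumption: Google's own matching rules may differ. The in_scope helper and the test URLs are hypothetical.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# A few patterns copied from the list above.
patterns = [
    "dspace.anu.edu.au/bitstream/*",
    "*.une.edu.au/*article*.pdf",
    "eprints.utas.edu.au/*",
]


def in_scope(url, patterns):
    """Return True if the URL, minus its scheme, matches any wildcard pattern."""
    parts = urlsplit(url)
    target = parts.netloc + parts.path
    return any(fnmatchcase(target, pattern) for pattern in patterns)


# Hypothetical URLs used only to exercise the matcher.
print(in_scope("http://dspace.anu.edu.au/bitstream/1885/1/thesis.pdf", patterns))
print(in_scope("http://www.anu.edu.au/about/", patterns))
```

The first URL falls under the dspace.anu.edu.au/bitstream/* pattern and is in scope; the second matches no pattern and is not.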

Thanks to the following people for updates:

Your name here

Kent Fitch, Project Computing

Project Computing Pty Ltd ACN: 008 590 967

contact@projectComputing.com