java.lang.Object
   |
   +----Acme.Spider
This is an Enumeration class that traverses the web starting at a given URL. It fetches HTML files and parses them for new URLs to look at. All files it encounters, HTML or otherwise, are returned by the nextElement() method as a URLConnection.
The traversal is breadth-first, and by default it is limited to files at or below the starting point - same protocol, hostname, and initial directory.
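The default limit (same protocol, hostname, and initial directory) can be pictured as a prefix test on the URL string. The following is a minimal sketch of that idea, not Acme.Spider's actual code; the class ScopeCheck and its methods baseOf and inScope are hypothetical names introduced only for illustration.

```java
import java.net.URI;

// Hypothetical helper, not part of the Acme package.
class ScopeCheck
{
    // Truncates a starting URL to its directory, keeping protocol, host, and port.
    static String baseOf( String startUrlStr )
    {
        URI uri = URI.create( startUrlStr );
        String path = ( uri.getPath() == null ) ? "" : uri.getPath();
        int slash = path.lastIndexOf( '/' );
        // An empty path is treated as the root directory.
        String dir = ( slash >= 0 ) ? path.substring( 0, slash + 1 ) : "/";
        String portStr = ( uri.getPort() == -1 ) ? "" : ":" + uri.getPort();
        return uri.getScheme() + "://" + uri.getHost() + portStr + dir;
    }

    // A candidate is "at or below the starting point" if it begins with the base.
    static boolean inScope( String candidateUrlStr, String baseUrlStr )
    {
        return candidateUrlStr.startsWith( baseUrlStr );
    }
}
```

With this sketch, a start URL of http://some.site.com/whatever/index.html gives the base http://some.site.com/whatever/, so files under /whatever/ are in scope while files on another host, protocol, or sibling directory are not.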
Because of the security restrictions on applets, this is currently only useful from applications.
Sample code:
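    import java.io.InputStream;
    import java.net.URL;
    import java.net.URLConnection;
    import java.util.Enumeration;

    // The constructor can throw MalformedURLException, and
    // getInputStream() can throw IOException; handle or declare both.
    Enumeration spider = new Acme.Spider( "http://some.site.com/whatever/" );
    while ( spider.hasMoreElements() )
        {
        URLConnection conn = (URLConnection) spider.nextElement();
        // Then do whatever you like with conn:
        URL thisUrl = conn.getURL();
        String thisUrlStr = thisUrl.toExternalForm();
        String mimeType = conn.getContentType();
        long changed = conn.getLastModified();
        InputStream s = conn.getInputStream();
        // Etc. etc. etc., your code here.
        }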
There are also a couple of methods you can override in a subclass to control things like the search limits and what gets done with broken links (see the protected methods doThisUrl() and brokenLink() below).
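As a rough illustration of overriding these hooks, the sketch below uses a stand-in base class rather than the real Acme.Spider, so that it compiles on its own. Only the two method signatures are taken from this page. The default behaviors shown in SpiderStandIn and the depth-based policy in DepthLimitedSpider are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for Acme.Spider: only the two hook signatures come from this page.
class SpiderStandIn
{
    protected boolean doThisUrl( String thisUrlStr, int depth, String baseUrlStr )
    {
        // Assumed default: stay at or below the starting point.
        return thisUrlStr.startsWith( baseUrlStr );
    }

    protected void brokenLink( String fromUrlStr, String toUrlStr, String errmsg )
    {
        // Assumed default: report the problem on the error stream.
        System.err.println( "Broken link in " + fromUrlStr + " to " + toUrlStr + ": " + errmsg );
    }
}

// Hypothetical subclass: limits the search by depth and records broken links.
class DepthLimitedSpider extends SpiderStandIn
{
    final List<String> broken = new ArrayList<>();

    @Override
    protected boolean doThisUrl( String thisUrlStr, int depth, String baseUrlStr )
    {
        // Follow any URL, on any host, as long as it is within two hops.
        return depth <= 2;
    }

    @Override
    protected void brokenLink( String fromUrlStr, String toUrlStr, String errmsg )
    {
        // Collect the target instead of printing it.
        broken.add( toUrlStr );
    }
}
```

With the real class, the subclass would extend Acme.Spider directly and the overrides would look the same.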
protected PrintStream err
protected Queue todo
protected Hashtable done
public Spider(PrintStream err)
public Spider()
public Spider(String urlStr, PrintStream err) throws MalformedURLException
public Spider(String urlStr) throws MalformedURLException
public Spider(int todoLimit, int doneLimit, PrintStream err)
Guesses at good values for an unlimited traversal: a todoLimit of 200000 and a doneLimit of 20000. You want the doneLimit fairly small because the hash table gets checked for every URL, so it will stay mostly in memory; the todo queue, on the other hand, is only accessed at the front and back, so most of it will be paged out.
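The access pattern described above can be sketched with a queue that is touched only at its ends and a set that is consulted for every link. This is an illustrative sketch over an in-memory link map, not the Spider implementation. The class BoundedCrawlSketch is hypothetical, and so is the policy of dropping new URLs once a limit is reached, since this page does not say what the real class does at that point.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch of a breadth-first traversal with bounded bookkeeping.
class BoundedCrawlSketch
{
    static List<String> crawl( String start, Map<String, List<String>> links,
                               int todoLimit, int doneLimit )
    {
        Queue<String> todo = new ArrayDeque<>();  // accessed only at front and back
        Set<String> done = new HashSet<>();       // checked for every URL encountered
        List<String> visited = new ArrayList<>();

        todo.add( start );
        done.add( start );
        while ( ! todo.isEmpty() )
        {
            String url = todo.remove();           // take from the front
            visited.add( url );
            for ( String next : links.getOrDefault( url, List.of() ) )
            {
                if ( done.contains( next ) )
                    continue;                     // already seen
                if ( done.size() >= doneLimit || todo.size() >= todoLimit )
                    continue;                     // assumed policy: drop when full
                done.add( next );
                todo.add( next );                 // append at the back
            }
        }
        return visited;
    }
}
```

Because every discovered link triggers a lookup in the set while the queue is only appended to and popped from, the set is the structure that benefits from staying small.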
public Spider(int todoLimit, int doneLimit)
public synchronized void addUrl(String urlStr) throws MalformedURLException
public synchronized void setAuth(String auth_cookie)
The syntax of auth_cookie is userid:password.
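The userid:password form is the same shape used by HTTP Basic authentication. Assuming the spider sends the cookie that way (this page does not confirm it), the header value would be built roughly as follows. The class BasicAuthSketch and its method are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper showing how a userid:password string maps to a Basic auth header.
class BasicAuthSketch
{
    static String authorizationValue( String authCookie )
    {
        if ( authCookie.indexOf( ':' ) < 0 )
            throw new IllegalArgumentException( "expected userid:password" );
        // Basic authentication is the base64 encoding of the userid:password string.
        String encoded = Base64.getEncoder().encodeToString(
            authCookie.getBytes( StandardCharsets.UTF_8 ) );
        return "Basic " + encoded;
    }
}
```

On a plain URLConnection, the result could be attached with conn.setRequestProperty( "Authorization", value ) before connecting.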
public synchronized void addObserver(HtmlObserver observer)
Alternatively, if you want to add a different observer to each scanner, you can cast the input stream to a scanner and call its add routine, like so:
    InputStream s = conn.getInputStream();
    // Guard the cast: streams for non-HTML files are presumably not HtmlScanners.
    if ( s instanceof Acme.HtmlScanner )
        ( (Acme.HtmlScanner) s ).addObserver( this );
protected boolean doThisUrl(String thisUrlStr, int depth, String baseUrlStr)
protected void brokenLink(String fromUrlStr, String toUrlStr, String errmsg)
protected void reportError(String fromUrlStr, String toUrlStr, String errmsg)
public synchronized boolean hasMoreElements()
public synchronized Object nextElement()
public void gotAHREF(String urlStr, URL contextUrl, Object clientData)
public void gotIMGSRC(String urlStr, URL contextUrl, Object clientData)
public void gotFRAMESRC(String urlStr, URL contextUrl, Object clientData)
public void gotBASEHREF(String urlStr, URL contextUrl, Object clientData)
public void gotAREAHREF(String urlStr, URL contextUrl, Object clientData)
public void gotLINKHREF(String urlStr, URL contextUrl, Object clientData)
public void gotBODYBACKGROUND(String urlStr, URL contextUrl, Object clientData)
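Each of these callbacks receives a link string together with a context URL. A typical thing an observer might do with that pair is turn the link into an absolute URL. The sketch below assumes that contextUrl is the page on which the link was found and that urlStr may be relative, neither of which is stated on this page. LinkCollector is a hypothetical class that borrows only the gotAHREF signature.

```java
import java.net.URI;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Hypothetical observer; only the gotAHREF signature is taken from this page.
class LinkCollector
{
    final List<String> links = new ArrayList<>();

    // Resolves a possibly relative link against the URL of its containing page.
    // Returns null if either string is not a syntactically valid URI.
    static String resolve( String contextUrlStr, String urlStr )
    {
        try
        {
            return URI.create( contextUrlStr ).resolve( urlStr ).toString();
        }
        catch ( IllegalArgumentException e )
        {
            return null;
        }
    }

    public void gotAHREF( String urlStr, URL contextUrl, Object clientData )
    {
        String absolute = resolve( contextUrl.toExternalForm(), urlStr );
        if ( absolute != null )
            links.add( absolute );
    }
}
```

The other got... callbacks share the same parameter list, so the same resolution step would apply to image, frame, and background URLs.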
public static void main(String args[])