com.norconex.collector.http.db
Interface ICrawlURLDatabase

All Known Implementing Classes:
DerbyCrawlURLDatabase

public interface ICrawlURLDatabase

Database holding the necessary information about all URL crawling activities, including which crawling stage each URL is in. Going by the methods of this interface, the stages a URL can have are queued, active, processed, and cached.

Author:
Pascal Essiembre

Method Summary
 int getActiveCount()
          Gets the number of active URLs (currently being processed).
 CrawlURL getCached(String cacheURL)
          Gets the cached URL from the previous time the crawler was run (e.g. for comparison purposes).
 int getProcessedCount()
          Gets the number of URLs processed.
 int getQueueSize()
          Gets the size of the URL queue (number of URLs left to process).
 boolean isActive(String url)
          Whether the given URL is currently being processed (i.e. active).
 boolean isCacheEmpty()
          Whether the cache from a previous crawler run is empty.
 boolean isProcessed(String url)
          Whether the given URL has been processed.
 boolean isQueued(String url)
          Whether the given URL is in the queue or not (waiting to be processed).
 boolean isQueueEmpty()
          Whether the queue is empty (no URLs are waiting to be processed).
 boolean isVanished(CrawlURL crawlURL)
          Whether a URL has been deleted.
 CrawlURL next()
          Returns the next URL to be processed and marks it as being "active" (i.e. currently being processed).
 void processed(CrawlURL crawlURL)
          Marks this URL as processed.
 void queue(String url, int depth)
          Queues a URL for future processing.
 void queueCache()
          Queues URLs cached from a previous run so they can be processed again.
 

Method Detail

queue

void queue(String url,
           int depth)
Queues a URL for future processing.

Parameters:
url - the URL to eventually be processed
depth - how many clicks away from starting URL(s)

isQueueEmpty

boolean isQueueEmpty()
Whether the queue is empty (no URLs are waiting to be processed).

Returns:
true if the queue is empty

getQueueSize

int getQueueSize()
Gets the size of the URL queue (number of URLs left to process).

Returns:
queue size

isQueued

boolean isQueued(String url)
Whether the given URL is in the queue or not (waiting to be processed).

Parameters:
url - the URL to look up in the queue
Returns:
true if the URL is in the queue

next

CrawlURL next()
Returns the next URL to be processed and marks it as being "active" (i.e. currently being processed).

Returns:
next URL
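
The transition from "queued" to "active" described above can be sketched with a small in-memory stand-in. This is only an illustration under stated assumptions, not the Norconex implementation: the class name QueueToActiveSketch is hypothetical, URLs are plain strings rather than CrawlURL objects, duplicates are ignored, and next() returns null on an empty queue. None of those choices are confirmed by this page.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical in-memory stand-in, not the Norconex implementation.
class QueueToActiveSketch {
    // URLs waiting to be processed, oldest first.
    private final Deque<String> queue = new ArrayDeque<>();
    // URLs returned by next() that have not been marked as processed yet.
    private final Set<String> active = new HashSet<>();

    // Mirrors queue(String url, int depth); depth is ignored in this sketch.
    void queue(String url, int depth) {
        // Assumption: a URL that is already queued or active is not added twice.
        if (!queue.contains(url) && !active.contains(url)) {
            queue.addLast(url);
        }
    }

    // Returns the next URL and marks it active; null on an empty queue (assumption).
    String next() {
        String url = queue.pollFirst();
        if (url != null) {
            active.add(url);
        }
        return url;
    }

    boolean isQueued(String url) { return queue.contains(url); }
    boolean isActive(String url) { return active.contains(url); }
    boolean isQueueEmpty() { return queue.isEmpty(); }
    int getQueueSize() { return queue.size(); }
    int getActiveCount() { return active.size(); }

    public static void main(String[] args) {
        QueueToActiveSketch db = new QueueToActiveSketch();
        db.queue("http://example.com/", 0);
        db.queue("http://example.com/about", 1);
        String url = db.next();
        System.out.println(url + " active=" + db.isActive(url)
                + " queued=" + db.isQueued(url));
        System.out.println("left in queue: " + db.getQueueSize());
    }
}
```

In this sketch a URL is in exactly one of the two collections at a time, so isQueued and isActive never both return true for the same URL.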

isActive

boolean isActive(String url)
Whether the given URL is currently being processed (i.e. active).

Parameters:
url - the url
Returns:
true if active

getActiveCount

int getActiveCount()
Gets the number of active URLs (currently being processed).

Returns:
number of active URLs.

getCached

CrawlURL getCached(String cacheURL)
Gets the cached URL from the previous time the crawler was run (e.g. for comparison purposes).

Parameters:
cacheURL - URL cached from previous run
Returns:
the cached URL

isCacheEmpty

boolean isCacheEmpty()
Whether the cache from a previous crawler run is empty.

Returns:
true if the cache is empty

processed

void processed(CrawlURL crawlURL)
Marks this URL as processed. Processed URLs will not be processed again in the same crawl run.

Parameters:
crawlURL - the URL to mark as processed
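
The rule that a processed URL is not processed again in the same crawl run can be sketched as a simple crawl loop. This is a minimal illustration under assumptions, not the library's code: CrawlLoopSketch, crawl, depthOf, and processedUrls are hypothetical names, URLs are plain strings, and the link graph is passed in as a map instead of being fetched.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in, not the Norconex implementation.
class CrawlLoopSketch {
    private final Deque<String> queue = new ArrayDeque<>();
    private final Set<String> active = new HashSet<>();
    private final Set<String> processed = new LinkedHashSet<>();
    private final Map<String, Integer> depths = new HashMap<>();

    void queue(String url, int depth) {
        // Assumption: URLs already queued, active, or processed are ignored.
        if (queue.contains(url) || active.contains(url) || processed.contains(url)) {
            return;
        }
        depths.put(url, depth);
        queue.addLast(url);
    }

    String next() {
        String url = queue.pollFirst();
        if (url != null) {
            active.add(url);
        }
        return url;
    }

    // Moves the URL from the active stage to the processed stage.
    void processed(String url) {
        active.remove(url);
        processed.add(url);
    }

    boolean isQueueEmpty() { return queue.isEmpty(); }
    boolean isProcessed(String url) { return processed.contains(url); }
    int getProcessedCount() { return processed.size(); }
    int getActiveCount() { return active.size(); }
    int depthOf(String url) { return depths.getOrDefault(url, -1); }
    List<String> processedUrls() { return new ArrayList<>(processed); }

    // Crawls a link graph given as "URL -> outgoing links" (hypothetical input).
    static CrawlLoopSketch crawl(String startUrl, Map<String, List<String>> links) {
        CrawlLoopSketch db = new CrawlLoopSketch();
        db.queue(startUrl, 0);
        while (!db.isQueueEmpty()) {
            String url = db.next();
            for (String link : links.getOrDefault(url, Collections.emptyList())) {
                db.queue(link, db.depthOf(url) + 1); // one click further away
            }
            db.processed(url);
        }
        return db;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("a", java.util.Arrays.asList("b", "c"));
        links.put("b", java.util.Arrays.asList("a", "c"));
        System.out.println(crawl("a", links).processedUrls());
    }
}
```

Because queue ignores URLs that are already known, the back-link from "b" to "a" does not cause "a" to be processed twice.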

isProcessed

boolean isProcessed(String url)
Whether the given URL has been processed.

Parameters:
url - the URL to check
Returns:
true if processed

getProcessedCount

int getProcessedCount()
Gets the number of URLs processed.

Returns:
number of URLs processed.

queueCache

void queueCache()
Queues URLs cached from a previous run so they can be processed again. This method is normally called when a job is done crawling and entries remain in the cache. Those entries are re-processed in case they changed or are no longer valid.
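
The end-of-run step described above can be sketched as follows. This is an illustration under assumptions rather than the actual implementation: QueueCacheSketch is a hypothetical class, URLs are plain strings, and the sketch assumes that marking a URL as processed removes it from the cache, so whatever remains in the cache was not reached during the current run. This page does not confirm that removal behavior.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collection;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical in-memory stand-in, not the Norconex implementation.
class QueueCacheSketch {
    private final Deque<String> queue = new ArrayDeque<>();
    // URLs remembered from the previous crawler run.
    private final Set<String> cache = new LinkedHashSet<>();

    QueueCacheSketch(Collection<String> previousRunUrls) {
        cache.addAll(previousRunUrls);
    }

    void queue(String url) {
        if (!queue.contains(url)) {
            queue.addLast(url);
        }
    }

    String next() { return queue.pollFirst(); }

    // Assumption: a URL handled in this run no longer needs its cache entry.
    void processed(String url) { cache.remove(url); }

    boolean isQueueEmpty() { return queue.isEmpty(); }
    boolean isCacheEmpty() { return cache.isEmpty(); }
    int getQueueSize() { return queue.size(); }

    // Queues every URL still in the cache so it gets processed again.
    void queueCache() {
        for (String url : cache) {
            queue(url);
        }
    }

    public static void main(String[] args) {
        QueueCacheSketch db = new QueueCacheSketch(Arrays.asList("a", "b", "c"));
        db.queue("a");
        db.queue("b"); // "c" is no longer linked from anywhere in this run
        while (!db.isQueueEmpty()) {
            db.processed(db.next());
        }
        if (!db.isCacheEmpty()) {
            db.queueCache(); // re-queues "c" for a second look
        }
        System.out.println("re-queued: " + db.getQueueSize());
    }
}
```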


isVanished

boolean isVanished(CrawlURL crawlURL)
Whether a URL has been deleted. To find this out, the URL has to be in an invalid state (e.g. NOT_FOUND) and must exist in the URL cache in a valid state.

Parameters:
crawlURL - the URL
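
The two conditions above (invalid now, valid in the cache) can be sketched as a small check. This is an illustration under assumptions: VanishedSketch and its Status enum are hypothetical, only NOT_FOUND is taken from the description above, and treating OK as the single valid state is a simplification made for this sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in, not the Norconex implementation.
class VanishedSketch {
    // NOT_FOUND comes from the description; OK and ERROR are assumed names.
    enum Status { OK, NOT_FOUND, ERROR }

    // URL -> status recorded during the previous crawler run.
    private final Map<String, Status> cache = new HashMap<>();

    void cache(String url, Status previousStatus) {
        cache.put(url, previousStatus);
    }

    // Assumption: OK is the only state considered valid.
    static boolean isValid(Status status) {
        return status == Status.OK;
    }

    // True when the URL is invalid now but was cached in a valid state.
    boolean isVanished(String url, Status currentStatus) {
        Status cached = cache.get(url);
        return !isValid(currentStatus) && cached != null && isValid(cached);
    }

    public static void main(String[] args) {
        VanishedSketch db = new VanishedSketch();
        db.cache("http://example.com/old", Status.OK);
        System.out.println(db.isVanished("http://example.com/old", Status.NOT_FOUND));
        System.out.println(db.isVanished("http://example.com/new", Status.NOT_FOUND));
    }
}
```

A URL that was never cached is not reported as vanished in this sketch, since there is no earlier valid state to compare against.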


Copyright © 2009-2013 Norconex Inc. All Rights Reserved.