com.norconex.collector.http.handler
Interface IURLNormalizer
- All Superinterfaces:
- Serializable
- All Known Implementing Classes:
- GenericURLNormalizer
public interface IURLNormalizer
- extends Serializable
Responsible for normalizing URLs. Normalization takes a raw URL and
converts it to its most basic or standard form. In other words, it makes
different URLs "equivalent". This makes it possible to eliminate URL variations
that point to the same content (e.g. URLs carrying temporary session
information). This action takes place right after URLs are extracted
from a document, before each of these URLs is even considered
for further processing. Returning null effectively tells the crawler
not to consider the URL for processing at all (it won't go through the regular
document processing flow). You may want to consider IURLFilter
to exclude URLs as part of the regular document processing flow
(it may create a trace in the logs and gives you more options).
Implementations that also implement IXMLConfigurable must name their XML tag
urlNormalizer
to ensure they are loaded properly.
- Author:
- Pascal Essiembre
normalizeURL
String normalizeURL(String url)
- Normalize the given URL.
- Parameters:
url
- the URL to normalize
- Returns:
- the normalized URL, or null to tell the crawler not to process the URL
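The contract described above can be illustrated with a minimal sketch of a custom implementation. Everything here other than the IURLNormalizer name and its normalizeURL(String) signature is an assumption made for illustration: the class name SessionStrippingNormalizer is hypothetical, the choice to strip a "jsessionid" query parameter and the fragment is just one possible normalization policy, and the interface is redeclared locally as a stand-in so the example compiles without the Norconex library on the classpath.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Local stand-in mirroring the documented interface (assumption: in a real
// project you would import com.norconex.collector.http.handler.IURLNormalizer).
interface IURLNormalizer extends Serializable {
    String normalizeURL(String url);
}

// Hypothetical normalizer: lowercases the scheme and host, drops the fragment,
// and removes a "jsessionid" query parameter. Non-HTTP(S) URLs yield null,
// which per the interface contract tells the crawler to skip them.
// Simplification: assumes the URL carries no user-info ("user:pass@") part.
class SessionStrippingNormalizer implements IURLNormalizer {
    private static final long serialVersionUID = 1L;

    @Override
    public String normalizeURL(String url) {
        if (url == null) {
            return null;
        }
        String u = url.trim();
        String lower = u.toLowerCase(Locale.ROOT);
        if (!lower.startsWith("http://") && !lower.startsWith("https://")) {
            return null; // not crawlable over HTTP: ask the crawler to ignore it
        }

        // Drop the fragment; it never reaches the server.
        int hash = u.indexOf('#');
        if (hash >= 0) {
            u = u.substring(0, hash);
        }

        // Separate the query string from the rest of the URL.
        String base = u;
        String query = null;
        int q = u.indexOf('?');
        if (q >= 0) {
            base = u.substring(0, q);
            query = u.substring(q + 1);
        }

        // Lowercase scheme and host only; paths may be case-sensitive.
        int hostStart = base.indexOf("://") + 3;
        int pathStart = base.indexOf('/', hostStart);
        if (pathStart < 0) {
            pathStart = base.length();
        }
        base = base.substring(0, pathStart).toLowerCase(Locale.ROOT)
                + base.substring(pathStart);

        if (query == null || query.isEmpty()) {
            return base;
        }

        // Keep every query parameter except the session identifier.
        List<String> kept = new ArrayList<>();
        for (String param : query.split("&")) {
            String name = param.split("=", 2)[0];
            if (!param.isEmpty() && !name.equalsIgnoreCase("jsessionid")) {
                kept.add(param);
            }
        }
        return kept.isEmpty() ? base : base + "?" + String.join("&", kept);
    }
}
```

With this sketch, calling normalizeURL("HTTP://Example.COM/Page?jsessionid=abc&id=1#top") returns "http://example.com/Page?id=1", while normalizeURL("mailto:someone@example.com") returns null. Note that a URL rejected this way leaves no trace in the regular processing flow; if you need rejected URLs to be logged, an IURLFilter may be the better fit.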
Copyright © 2009-2013 Norconex Inc. All Rights Reserved.