emergingbytes.com Technology blog for data warehousing, business intelligence, iPhone Apps Development, Android Apps development and facebook applications

22May/110

Extracting Domain From A URL In Teradata

When working with internet cloud traffic, domain name is a significant information for analysis. For instance, some telecom operators may offer free GPRS usage for specific websites like facebook, yahoo etc.  For operators offering ADSL and other broadband services, domain name becomes more important to variety of exploratory analysis that can be used for designing new offers or bundles, optimizing cache, adjusting capacity by analyzing streaming media, download trends, popular websites etc

Following techniques can be used to extract domain name from a URL

  • Creating a UDF that supports regular expressions and then using regular expressions to extract domain name
  • Using simple string manipulation functions that are already available in Teradata

Later technique is discussed here for general use. It can be applied on normalized URLs and/or can be modified as per the requirements.  Following SQL can be used to test domain name extraction from a URL

SELECT 
	SUBSTR
	(
		 SUBSTR(?url , 
		   (-COALESCE(NULLIFZERO(POSITION ('/' IN ?url)),-1)*-1)+2 )  , 
		 0, 
		 POSITION('/' IN   SUBSTR(?url , 
		   (-COALESCE(NULLIFZERO(POSITION ('/' IN ?url)),-1)*-1)+2 )||'/'  )  
	);

Use URLs in following style to check above script:
'http://www.emergingbytes.com'

Script is using simple function to locate position of first occurring '/' (forward slash) in a URL and then again search for ending '/'. If ending slash is not present in URL, it appends a '/' and extracts anything between these two '/' (forward slashes). This script can be used on most of the domains but to handle exceptions, it can be improved upon.