Extracting Domain From A URL In Teradata
When working with internet cloud traffic, domain name is a significant information for analysis. For instance, some telecom operators may offer free GPRS usage for specific websites like facebook, yahoo etc. For operators offering ADSL and other broadband services, domain name becomes more important to variety of exploratory analysis that can be used for designing new offers or bundles, optimizing cache, adjusting capacity by analyzing streaming media, download trends, popular websites etc
Following techniques can be used to extract domain name from a URL
- Creating a UDF that supports regular expressions and then using regular expressions to extract domain name
- Using simple string manipulation functions that are already available in Teradata
Later technique is discussed here for general use. It can be applied on normalized URLs and/or can be modified as per the requirements. Following SQL can be used to test domain name extraction from a URL
SELECT SUBSTR ( SUBSTR(?url , (-COALESCE(NULLIFZERO(POSITION ('/' IN ?url)),-1)*-1)+2 ) , 0, POSITION('/' IN SUBSTR(?url , (-COALESCE(NULLIFZERO(POSITION ('/' IN ?url)),-1)*-1)+2 )||'/' ) );
Use URLs in following style to check above script:
'http://www.emergingbytes.com'
Script is using simple function to locate position of first occurring '/' (forward slash) in a URL and then again search for ending '/'. If ending slash is not present in URL, it appends a '/' and extracts anything between these two '/' (forward slashes). This script can be used on most of the domains but to handle exceptions, it can be improved upon.
Teradata Date Addition/Subtraction
Today when I started designing BI layer for a session authentication system (RADIUS), I realized that there are two types of CDRs
- Start of a session CDR
- Stop of a session CDR
Interestingly, STOP CDRs have no Start date information. Therefore, I had to subtract session duration for the stop date to get start date of a session.
Problem: To subtract duration (in seconds) from a date (session stop date) to get actual start date of a session
Solution:
Although, it appears to be a simple riddle but in Teradata, this can be a little bit tricky, specially when you are not comfortable working with dates and intervals. Following code snippet explains that how it is done.
CAST(START_DATE_TIME - CAST((SESSION_DURATION/60/60/24/28/12) AS INTERVAL YEAR) - CAST((SESSION_DURATION/60/60/24/28 MOD 12) AS INTERVAL MONTH) - CAST((SESSION_DURATION/60/60/24 MOD 28) AS INTERVAL DAY) - CAST((SESSION_DURATION/60/60 MOD 24) AS INTERVAL HOUR) - CAST((SESSION_DURATION/60 MOD 60) AS INTERVAL MINUTE) - CAST((SESSION_DURATION MOD 60) AS INTERVAL SECOND) AS DATE FORMAT 'YYYY-MM-DD') AS START_DATE
Point 1:
When dealing with huge number of seconds, their is a limitation of casting it to seconds. Running the following query will result in 'Interval field overflow':
SELECT CAST(30000 AS INTERVAL SECOND(4) )
Therefore, to add/subtract huge number of seconds, conversion to hour, minute and seconds is required.
Point 2:
To successfully cast above, we may need to convert it into hours
SELECT CAST(30000/60/60 AS INTERVAL HOUR)
It will result in 8 hours.
Point 3:
Converting 8 hours back to seconds will not give 30000. Teradata INTERVAL does not maintain precision information. Hence, to keep track of exact seconds, MOD should be used and added up to get precise number of seconds. Below example explains how it can be done.
SELECT CAST(30000/60/60 AS INTERVAL HOUR) AS hrs ,CAST(((30000/60) MOD 60) AS INTERVAL MINUTE) AS mins ,CAST((30000 MOD 60) AS INTERVAL SECOND) AS secs , hrs+mins+secs