
URL Encoding

In my project I had to deal with URLs that needed some canonicalization. I don't know why these sites don't stick to the standards for forming URLs 😦
The URL specifications (RFC 1738, later RFC 3986) spell out which characters are unsafe and must not appear unencoded in a URL. But in the real world no one follows these simple standards.

http://www.business-standard.com/india/news/tree-house-educations-ipo-subscribed-184-times/445742/

This is a sample URL from Business Standard, where the ‘\’ character is unsafe. To make such a URL work with Java's URL class or Apache HttpClient, you need to replace ‘\’ with ‘%5C’.

The domain name is Business-Standard, but they cannot keep their URLs in standard form. How do they keep their “Business” “Standard”? 🙂

These URLs cause problems with Java's URL class and Apache HttpClient, so to get out of these situations you need to normalize them. Java has URLEncoder for encoding, but it is not advisable to run it over the URL as a whole. URLEncoder is meant for form values such as query-string parameters, so it also escapes the ‘:’, ‘/’ and ‘?’ that give the URL its structure. I found that the result is clumsier and does not work.

http%3A%2F%2Fwww.business-standard.com%2Findia%2Fnews%2Fspencer%255Cs-emerges-as-goenka%5Cs-lynchpin-for-growth%2F462834%2F is the encoded URL produced by URLEncoder, and it does not work even in a browser.
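
To make the effect concrete, here is a minimal sketch that runs a whole URL through URLEncoder. The class and method names (WholeUrlEncoding, encodeWhole) are made up for illustration, and the input is the Economist URL discussed further down in this post.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class WholeUrlEncoding {
    // Hypothetical helper: encodes the entire URL string, which is the approach to avoid.
    static String encodeWhole(String url) {
        try {
            return URLEncoder.encode(url, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available, so this should not happen in practice.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String url = "http://www.economist.com/node/21534742?fsrc=rss|ltr";
        // ':', '/', '?' and '=' get escaped as well, so the scheme and path separators are lost.
        System.out.println(encodeWhole(url));
        // prints http%3A%2F%2Fwww.economist.com%2Fnode%2F21534742%3Ffsrc%3Drss%7Cltr
    }
}
```

The output is no longer a URL at all, just one long escaped string, which is why neither a browser nor an HTTP library can follow it.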

So URL encoding is a long-standing problem. One solution is to encode only the unsafe characters instead of the whole URL.
This is an example of a malformed URL:
http://www.economist.com/node/21534742?fsrc=rss|ltr

The ‘|’ symbol is considered unsafe, yet it appears in the URL. A browser can encode it and handle it easily, but when you need to call the URL through an HTTP library it may cause problems.

To get around this, you can simply replace the ‘|’ character with ‘%7C’, its percent-encoded hex equivalent, and the problem is solved.
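
As a rough illustration, java.net.URI refuses the raw ‘|’ with a URISyntaxException and accepts the URL once the character is replaced. The class and helper names here (PipeReplaceDemo, parses, escapePipe) are my own and not part of any library.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class PipeReplaceDemo {
    // Hypothetical helper: true if java.net.URI can parse the string.
    static boolean parses(String url) {
        try {
            new URI(url);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // Hypothetical helper: percent-encodes only the '|' character.
    static String escapePipe(String url) {
        return url.replace("|", "%7C");
    }

    public static void main(String[] args) {
        String url = "http://www.economist.com/node/21534742?fsrc=rss|ltr";
        System.out.println(parses(url));             // false: '|' is illegal in a URI
        System.out.println(escapePipe(url));         // ...?fsrc=rss%7Cltr
        System.out.println(parses(escapePipe(url))); // true
    }
}
```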

This is not a neat workaround, though, because there are many unsafe characters, and we would need to check the URL for each of them and replace it with its hex equivalent.
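
To show what that hand-rolled approach would look like, here is a small sketch that walks the URL and escapes a fixed list of unsafe characters. The class name, the method name and the exact character list are my assumptions. The list is only a subset of what RFC 1738 calls unsafe, and non-ASCII characters are not handled at all.

```java
public class UnsafeCharEscaper {
    // Assumed subset of the RFC 1738 "unsafe" characters. '#' and '%' are left out
    // on purpose because they usually carry meaning (fragment marker, existing escapes).
    private static final String UNSAFE = " \"<>\\^`{|}";

    // Replaces each listed character with '%' followed by its two-digit hex code.
    static String escapeUnsafe(String url) {
        StringBuilder out = new StringBuilder(url.length());
        for (char c : url.toCharArray()) {
            if (UNSAFE.indexOf(c) >= 0) {
                out.append(String.format("%%%02X", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeUnsafe("http://www.economist.com/node/21534742?fsrc=rss|ltr"));
        // prints http://www.economist.com/node/21534742?fsrc=rss%7Cltr
    }
}
```

It works for the simple cases, but keeping the list complete and correct for every part of the URL is exactly the chore I wanted to avoid.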

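Then I found a Stack Overflow thread about the same problem, with a neat workaround. It decodes the URL first, so that existing escapes are not encoded twice, and then rebuilds it through the multi-argument URI constructor, which escapes the illegal characters in each component. I found it quite useful, and it is working fine with my present set of malformed URLs. 🙂 Thanks to scott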


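import java.net.URI;
import java.net.URL;
import java.net.URLDecoder;

public class CanonicalizeURL {
    // Escapes illegal characters in each URL component without double-encoding existing escapes.
    public static String escapeIllegalURLCharacters(String url) throws Exception {
        // Decode first so that sequences such as %7C are not re-encoded as %257C.
        String decodedUrl = URLDecoder.decode(url, "UTF-8");
        URL parsedUrl = new URL(decodedUrl);
        // The multi-argument URI constructor quotes illegal characters per component.
        URI uri = new URI(parsedUrl.getProtocol(), parsedUrl.getUserInfo(), parsedUrl.getHost(),
                parsedUrl.getPort(), parsedUrl.getPath(), parsedUrl.getQuery(), parsedUrl.getRef());
        return uri.toString();
    }
}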
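
Here is a quick usage sketch for the method above, run on the Economist URL from earlier. The demo class name is made up, and the method body is repeated (with the checked exception wrapped) only so that the sketch compiles on its own.

```java
import java.net.URI;
import java.net.URL;
import java.net.URLDecoder;

public class CanonicalizeURLDemo {
    // Same steps as the method above: decode, split with URL, rebuild with URI.
    static String escapeIllegalURLCharacters(String url) {
        try {
            String decoded = URLDecoder.decode(url, "UTF-8");
            URL u = new URL(decoded);
            return new URI(u.getProtocol(), u.getUserInfo(), u.getHost(), u.getPort(),
                    u.getPath(), u.getQuery(), u.getRef()).toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot canonicalize: " + url, e);
        }
    }

    public static void main(String[] args) {
        String raw = "http://www.economist.com/node/21534742?fsrc=rss|ltr";
        String fixed = escapeIllegalURLCharacters(raw);
        System.out.println(fixed);
        // prints http://www.economist.com/node/21534742?fsrc=rss%7Cltr

        // Running it again on the already escaped URL gives the same string back.
        System.out.println(fixed.equals(escapeIllegalURLCharacters(fixed))); // true
    }
}
```

One thing to watch: because the URL is decoded before it is rebuilt, an escape such as %26 inside a query value comes back as a literal ‘&’, which can change the meaning of the query. It has not bitten me with my URLs so far, but it is worth checking against your own.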