You know what URL shortening is, right? It’s what bitly does, and what Twitter does to a URL or image link in a tweet. Instead of a link like http://www.npr.org/2016/02/19/467318832/congress-should-decide-encryption-issue-sen-angus-king-says you can have a nice short link like http://n.pr/28WavP0 or http://tinyurl.com/jhp4c62; something easily typed with accuracy, even conveyed verbally without too much trouble.

Finding the actual URL

The problem, however, is that when presented with such a link you don’t know where it will lead. You might be redirected to some malware site, or somewhere NSFW. If you are a careful (read “paranoid”) Internet wanderer—and you should be careful—it would be useful to have a way of seeing the ultimate destination URL before clicking on that shortened URL.

There are browser plugins and there are websites devoted to offering such a service. I considered using these options but rejected them. Browser extensions can see everything that happens within the sandbox of your browser; to keep my personal attack surface small, I run only a minimal set of trusted browser extensions and would prefer not to add another to the mix. Alternatively, one could use any of the plethora of websites that offer to resolve shortened URLs. But a quick survey of such sites shows that most are laden with ads. To my paranoid mind they also seem like an obvious vehicle for delivering malware; I’m not saying that any actually do, just that my spidey sense is tingling.

cURL to the rescue! (again)

There is an alternative, at least on a general-purpose computer. If you have cURL installed, then you can easily resolve the URL yourself with a command like curl -I -L -s http://tinyurl.com/jhp4c62. The -s option suppresses the progress bar; the -L option tells cURL to follow redirects (up to 50 times); and the -I option tells it to use the HEAD method on requests rather than the default GET method.

On my system, as I write this, that command produces:

HTTP/1.1 301 Moved Permanently
Date: Thu, 23 Jun 2016 04:12:50 GMT
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
Set-Cookie: __cfduid=dc7643d4723ed0a2e1b8db198a5edf74a1466655170; expires=Fri, 23-Jun-17 04:12:50 GMT; path=/; domain=.tinyurl.com; HttpOnly
Set-Cookie: tinyUUID=76b61c6b89ae62c4fc690000; expires=Fri, 23-Jun-2017 04:12:49 GMT; Max-Age=31536000; path=/; domain=.tinyurl.com
Location: http://www.npr.org/2016/02/19/467318832/congress-should-decide-encryption-issue-sen-angus-king-says
X-tiny: cache 0.0096280574798584
Server: cloudflare-nginx
CF-RAY: 2b751a9e33691159-DFW

HTTP/1.1 404 Not Found
Server: Apache
X-Powered-By: PHP/5.5.28
Content-Type: text/html; charset=UTF-8
Content-Length: 0
Cache-Control: max-age=0
Expires: Thu, 23 Jun 2016 04:12:50 GMT
Date: Thu, 23 Jun 2016 04:12:50 GMT
Connection: keep-alive

The key thing to look for in the output is the Location: header, specifically the last Location: header in the output. In this case there was only one Location: header, and it resolves the shortened URL to http://www.npr.org/2016/02/19/467318832/congress-should-decide-encryption-issue-sen-angus-king-says.

But what’s the deal with that 404 status code, the “Not Found” error? It’s related to the -I option on the curl command. That option asks cURL to use the HEAD method rather than the GET method. A HEAD request returns just the HTTP headers you would receive from a GET request, without the response body. Using HEAD seems perfect if all you want to do is resolve the URL, not retrieve the actual web page (or PDF or zip archive).

Sounds great, right? It is, usually. But in this case the NPR web servers, at least some of them, don’t respond well to HEAD requests. Most of the time the cURL command I used (above) produces that 404; occasionally it produces the expected 200. If we change -I to -i, then cURL will use the GET method; we will see all the headers and we will get the 200 status that we wanted, but we’ll also get the whole web page dumped to the terminal window (and will have to scroll back up to find the last Location: header).
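As an aside, cURL can also report the final URL itself, via its --write-out option, which avoids reading through headers entirely. Here’s a minimal sketch of that approach (it uses GET, so response bodies are fetched, but they are discarded rather than dumped to the terminal):

```shell
# Follow redirects (-L), stay quiet (-s), throw away the body
# (-o /dev/null), and ask curl to print the URL it finally ended up at.
curl -s -L -o /dev/null -w '%{url_effective}\n' http://tinyurl.com/jhp4c62
```

Because this skips HEAD altogether, it sidesteps servers that mishandle HEAD requests, at the cost of downloading each page along the redirect chain.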

What does it look like if we resolve a shortened URL that doesn’t point at an NPR web server? Let’s try the same cURL command but with this shortened URL from Twitter: https://t.co/J3NSCAu5dr. This time we do end up with a nice 200 status code. We also get multiple redirects—remember that the last Location: header gives the ultimate URL that the shortened URL represents.

HTTP/1.1 301 Moved Permanently
cache-control: private,max-age=300
content-length: 0
date: Thu, 23 Jun 2016 04:53:22 GMT
expires: Thu, 23 Jun 2016 04:58:22 GMT
location: https://trib.it/28QhYlx
server: tsa_b
set-cookie: muc=6917dd24-7cb0-4e2b-9d83-8325f5515fbb; Expires=Tue, 05 Jun 2018 04:53:22 GMT; Domain=t.co
strict-transport-security: max-age=0
x-connection-hash: 19628dda4092469035b6fc96aaada929
x-response-time: 6

HTTP/1.1 301 Moved Permanently
Cache-Control: private, max-age=90
Content-Length: 177
Content-Type: text/html; charset=utf-8
Date: Thu, 23 Jun 2016 04:53:23 GMT
Location: http://www.texastribune.org/2016/06/22/democrats-including-texans-bring-congress-halt-ove/
Server: nginx

HTTP/1.1 301 Moved Permanently
Date: Thu, 23 Jun 2016 04:53:23 GMT
Content-Type: text/html
Connection: keep-alive
Set-Cookie: __cfduid=df8f255c9657d4c2b8b4e7f4c42ef7bfe1466657603; expires=Fri, 23-Jun-17 04:53:23 GMT; path=/; domain=.texastribune.org; HttpOnly
Content-Security-Policy: frame-ancestors 'self'
Location: https://www.texastribune.org/2016/06/22/democrats-including-texans-bring-congress-halt-ove/
X-Frame-Options: SAMEORIGIN
Server: cloudflare-nginx
CF-RAY: 2b75560725901165-DFW

HTTP/1.1 200 OK
Date: Thu, 23 Jun 2016 04:53:24 GMT
Content-Type: text/html; charset=utf-8
Connection: keep-alive
Age: 44
Cache-Control: max-age=300
Content-Security-Policy: frame-ancestors 'self'
Expires: Thu, 23 Jun 2016 04:57:40 GMT
Last-Modified: Thu, 23 Jun 2016 04:52:40 GMT
Vary: Authorization
Via: 1.1 varnish
X-Frame-Options: SAMEORIGIN
X-Request-URL: /2016/06/22/democrats-including-texans-bring-congress-halt-ove/
X-Server: ip-10-244-193-69
X-Varnish: 1286479133 1286479050
Server: cloudflare-nginx
CF-RAY: 2b7556089bb50944-DFW

Automating a cleaner solution

Clearly cURL offers a simple solution for resolving URLs. But there are some issues:

  • To keep the output small, we want to use the HEAD method; but some servers don’t handle HEAD well¹
  • Resolving a URL often goes through multiple 301 redirects, generating a lot of headers, including several Location: headers
  • Even with just one redirect, there are a lot of headers. We only want to see the Location: headers; really, just the last one

I wrote a Bash script that addresses all these issues. Here’s what it looks like resolving that Twitter URL shown above:

$ ./longurl.sh https://t.co/J3NSCAu5dr
https://www.texastribune.org/2016/06/22/democrats-including-texans-bring-congress-halt-ove/

Nice. How about that NPR shortened URL?

$ ./longurl.sh http://tinyurl.com/jhp4c62
http://www.npr.org/2016/02/19/467318832/congress-should-decide-encryption-issue-sen-angus-king-says

Sweet.
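The published script handles more edge cases, but its core idea can be sketched in a few lines. This sketch is not the actual longurl.sh; it simply issues a GET (to dodge servers that mishandle HEAD), captures only the headers, and prints the last Location: header it sees, falling back to the original URL if there were no redirects:

```shell
#!/usr/bin/env bash
# Sketch of resolving a shortened URL; not the published longurl.sh.
url="$1"

# -L follows redirects, -o /dev/null discards the bodies, and -D - dumps
# each response's headers to stdout. The awk program remembers the last
# Location: header (matched case-insensitively, since some servers send
# "location:"), and tr strips the trailing carriage return.
location=$(curl -s -L -o /dev/null -D - "$url" \
  | awk 'tolower($1) == "location:" { loc = $2 } END { print loc }' \
  | tr -d '\r')

if [ -n "$location" ]; then
  printf '%s\n' "$location"
else
  printf '%s\n' "$url"
fi
```

The case-insensitive match matters: as the Twitter example above shows, t.co emits a lowercase location: header while others capitalize it.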

How to get it

The script is available in my infrequently updated archive of useful Bash scripts, found here: https://github.com/JeNeSuisPasDave/useful-bash-scripts.

You can get the longurl.sh script from the urlhelpers/ folder.

There you’ll also find longurl.src, which you can source to add the longurl() function to your Bash environment. Then you can do something like longurl https://t.co/J3NSCAu5dr; echo -n "${longurl}" | pbcopy to get the resolved URL into the clipboard.
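The real longurl.src isn’t reproduced here, but a sourceable function with that behavior could look something like this sketch; note that, as the usage above implies, it stores the result in a longurl variable in addition to printing it (the actual script may work differently):

```shell
# Sketch of a sourceable longurl() function; not the real longurl.src.
longurl() {
  # -L follows redirects, -o /dev/null discards the bodies, and
  # --write-out '%{url_effective}' makes curl print the final URL it
  # reached. The result lands in $longurl for callers to reuse.
  longurl=$(curl -s -L -o /dev/null -w '%{url_effective}' "$1")
  printf '%s\n' "$longurl"
}
```

With that sourced, longurl resolves the URL on stdout while leaving the value in "${longurl}" for piping to pbcopy or anything else.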

If you use the Bash script, then you can pass the -I flag to get the script to dump the HTTP headers to stdout. That can be useful for troubleshooting (something I hope you won’t need to do, of course).

Enjoy.

  1. Don’t just blame NPR. How many automated tests do you have running to check the HEAD method on your company’s web sites and RESTful services?