How to get a website's title using the command line

command-line, http, web

I want a command-line program that prints the title of a website.
For example:

Alan:~ titlefetcher http://www.youtube.com/watch?v=Dd7dQh8u4Hc

should give:

Why Are Bad Words Bad? 

You give it the URL and it prints out the title.

Best Answer

  • wget -qO- 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' |
      perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)\s*<\/title/si'
    

    You can pipe it to GNU recode if there are HTML entities like &lt; in it:

    wget -qO- 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' |
      perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)\s*<\/title/si' |
      recode html..
    

    To remove the - youtube part:

    wget -qO- 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' |
     perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)(?: - youtube)?\s*<\/title/si'
    

    To point out some of the limitations:

    portability

    There is no standard/portable command to do HTTP queries. A few decades ago, I would have recommended lynx -source instead here. But nowadays, wget is more portable as it can be found by default on most GNU systems (including most Linux-based desktop/laptop operating systems). Other fairly portable ones include the GET command that comes with perl's libwww that is often installed, lynx -source, and to a lesser extent curl. Other common ones include links -source, elinks -source, w3m -dump_source, lftp -c cat...
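
    For instance, with some of those alternatives (assuming curl or lynx is installed on your system), the fetching part of the pipeline could be swapped out like this:

    # curl needs -L to follow HTTP redirections, which wget does by default
    curl -sL 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' |
      perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)\s*<\/title/si'

    # or with lynx, which dumps the raw page source with -source
    lynx -source 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' |
      perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)\s*<\/title/si'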

    HTTP protocol and redirection handling

    wget may not get the same page as the one that, for instance, firefox would display, because HTTP servers may choose to send a different page based on the information provided in the request sent by the client.

    The request sent by wget/w3m/GET... is going to be different from the one sent by firefox. If that's an issue, though, you can alter the way wget sends the request with command-line options.

    The most important headers in this regard are listed below (an example request follows the list):

    • Accept and Accept-language: these tell the server in which language and charset the client would like to get the response. wget doesn't send any by default, so the server will typically respond with its default settings. firefox, on the other hand, is likely configured to request your language.
    • User-Agent: that identifies the client application to the server. Some sites send different content based on the client (though that's mostly about differences in javascript interpretation) and may refuse to serve you if you're using a robot-type user agent like wget's.
    • Cookie: if you've visited this site before, your browser may have permanent cookies for it. wget will not.
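
    As an example, to make the request look a bit more like a browser's (all the values below are only illustrative, and cookies.txt is a hypothetical cookie file exported from your browser; drop that option if you don't have one):

    wget -qO- \
      --header='Accept-Language: en-GB,en;q=0.9' \
      --user-agent='Mozilla/5.0 (X11; Linux x86_64; rv:20.0) Gecko/20100101 Firefox/20.0' \
      --load-cookies=cookies.txt \
      'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' |
      perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)\s*<\/title/si'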

    wget will follow redirections when they are done at the HTTP protocol level, but since it doesn't look at the content of the page, it won't follow the ones done by javascript or by things like <meta http-equiv="refresh" content="0; url=http://example.com/">.
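
    If you need to handle such <meta> refreshes too, one rough heuristic (same regexp approach, same caveats as above) is to extract the target URL yourself, then run the title extraction again on whatever it prints:

    url='http://www.youtube.com/watch?v=Dd7dQh8u4Hc'
    # print the url= target of a <meta http-equiv="refresh"> tag, if there is one
    wget -qO- "$url" |
      perl -l -0777 -ne 'print $1 if /<meta[^>]*http-equiv\s*=\s*[\x27"]?refresh[^>]*?url\s*=\s*[\x27"]?([^\x27"\s>]+)/si'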

    Performance/Efficiency

    Here, out of laziness, we have perl read the whole content into memory before starting to look for the <title> tag. Given that the title is found in the <head> section, which is in the first few bytes of the file, that's not optimal. A better approach, if GNU awk is available on your system, could be:

    wget -qO- 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' |
      gawk -v IGNORECASE=1 -v RS='</title' 'RT{gsub(/.*<title[^>]*>/,"");print;exit}'
    

    That way, awk stops reading after the first </title, and by exiting, causes wget to stop downloading.

    Parsing of the HTML

    Here, wget writes the page as it downloads it. At the same time, perl slurps its output whole into memory (-0777 -n) and then prints the HTML code found between the first occurrence of <title...> and the following </title.

    That will work for most HTML pages that have a <title> tag, but there are cases where it won't work.

    By contrast, coffeeMug's solution will parse the HTML page as XML and return the corresponding value for title. That is more correct if the page is guaranteed to be valid XML. However, HTML is not required to be valid XML (older versions of the language were not), and because most browsers out there are lenient and will accept incorrect HTML code, there's a lot of incorrect HTML code out there.
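
    For illustration, a parser-based variant (not necessarily coffeeMug's exact command; this one assumes xmllint from libxml2 is installed) could look like:

    # --html uses libxml2's lenient HTML parser; 2>/dev/null hides its parse warnings
    wget -qO- 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' |
      xmllint --html --xpath '//head/title/text()' - 2>/dev/null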

    Both my solution and coffeeMug's will fail for a variety of corner cases, sometimes the same, sometimes not.

    For instance, mine will fail on:

    <html><head foo="<title>"><title>blah</title></head></html>
    

    or:

    <!-- <title>old</title> --><title>new</title>
    

    While his will fail on:

    <TITLE>foo</TITLE>
    

    (valid html, not xml) or:

    <title>...</title>
    ...
    <script>a='<title>'; b='</title>';</script>
    

    (again, valid html, missing <![CDATA[ parts to make it valid XML), or:

    <title>foo <<<bar>>> baz</title>
    

    (incorrect html, but still found out there and supported by most browsers)

    Interpretation of the code inside the tags

    That solution outputs the raw text between <title> and </title>. Normally, there should not be any HTML tags in there; there may possibly be comments (though some browsers like firefox don't handle them there, so that's very unlikely). There may still be some HTML encoding:

    $ wget -qO- 'http://www.youtube.com/watch?v=CJDhmlMQT60' |
      perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)\s*<\/title/si'
    Wallace &amp; Gromit - The Cheesesnatcher Part 1 (claymation) - YouTube
    

    Which is taken care of by GNU recode:

    $ wget -qO- 'http://www.youtube.com/watch?v=CJDhmlMQT60' |
      perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)\s*<\/title/si' |
       recode html..
    Wallace & Gromit - The Cheesesnatcher Part 1 (claymation) - YouTube
    

    But a web client is also meant to do more transformations on that text when displaying the title (like condensing some of the blanks and removing the leading and trailing ones). However, it's unlikely that there'd be much need for that here. So, as in the other cases, it's up to you to decide whether it's worth the effort.
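
    If you did want a rough approximation of that, you could squeeze the whitespace in the perl extraction itself, for instance:

    # same extraction, but also collapse internal runs of whitespace to single spaces
    wget -qO- 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc' |
      perl -l -0777 -ne 'if (/<title.*?>\s*(.*?)\s*<\/title/si) { ($t = $1) =~ s/\s+/ /g; print $t }' |
      recode html..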

    Character set

    Before UTF-8, iso8859-1 used to be the preferred charset on the web for non-ASCII characters, though strictly speaking such characters had to be written as entities like &eacute;. More recent versions of HTTP and the HTML language have added the possibility to specify the character set in the HTTP headers or in the HTML headers, and a client can specify the charsets it accepts. UTF-8 tends to be the default charset nowadays.

    So, out there, you'll find é written as &eacute;, as &#233;, as UTF-8 é (0xc3 0xa9), or as iso-8859-1 é (0xe9); for the last two, the information on the charset is sometimes given in the HTTP headers or the HTML headers (in different formats), and sometimes not.

    wget only gets the raw bytes; it doesn't care about their meaning as characters, and it doesn't tell the web server about the preferred charset.

    recode html.. will take care of converting the &eacute; or &#233; into the proper sequence of bytes for the character set used on your system, but the rest is trickier.

    If your system charset is utf-8, chances are it will be alright most of the time, as that tends to be the default charset used on the web nowadays.

    $ wget -qO- 'http://www.youtube.com/watch?v=if82MGPJEEQ' |
     perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)\s*<\/title/si'
    Noir Désir - L&#39;appartement - YouTube
    

    That é above was a UTF-8 é.

    But if you want to cover other charsets, once again, that would have to be taken care of.
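
    For instance, if you knew (from the HTTP or HTML headers, say) that a page is in iso-8859-1 and your terminal is UTF-8, you could convert the bytes first; the URL and charset here are only placeholders you'd have to determine yourself:

    # convert from the page's (assumed) charset to UTF-8 before extracting the title
    wget -qO- 'http://example.com/some-latin1-page' |
      iconv -f iso-8859-1 -t utf-8 |
      perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)\s*<\/title/si' |
      recode html..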

    It should also be noted that this solution won't work at all for UTF-16 or UTF-32 encoded pages.

    To sum up

    Ideally, what you need here is a real web browser to give you the information. That is, you need something that does the HTTP request with the proper parameters, interprets the HTTP response correctly, fully interprets the HTML code as a browser would, and returns the title.

    As I don't think that can be done on the command line with the browsers I know (though see now this trick with lynx), you have to resort to heuristics and approximations, and the one above is as good as any.

    You may also want to take performance and security into consideration. For instance, to cover all the cases (say, a web page that has some javascript pulled from a 3rd-party site that sets the title or redirects to another page in an onload hook), you may have to implement a real-life browser with its DOM and javascript engines, which may have to do hundreds of queries for a single HTML page, some of them trying to exploit vulnerabilities...

    While using regexps to parse HTML is often frowned upon, this is a typical case where it's good enough for the task (IMO).
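
    Finally, if you want the exact interface from the question, a minimal sketch (the name titlefetcher just mirrors the question; adapt as you like) is to wrap the pipeline from the top of this answer in a shell function:

    # titlefetcher URL -> print the decoded <title> of the page at URL
    titlefetcher() {
      wget -qO- "$1" |
        perl -l -0777 -ne 'print $1 if /<title.*?>\s*(.*?)\s*<\/title/si' |
        recode html..
    }
    titlefetcher 'http://www.youtube.com/watch?v=Dd7dQh8u4Hc'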
