Command-line CSS selector tool

command-line css grep html web

Question

What tool (preferably for Linux) can select the content of an HTML element based on its CSS path?

Example

For example, consider the following HTML document:

<html>
<body>
  <div class="header">
    <h1>Header</h1>
  </div>
  <div class="content">
    <table>
      <tbody>
        <tr><td class="data">Tabular Content 1</td></tr>
        <tr><td class="data">Tabular Content 2</td></tr>
      </tbody>
    </table>
  </div>
  <div class="footer">
    <p>Footer</p>
  </div>
</body>
</html>

What command-line program (e.g., a kind of "cssgrep") can extract values using a CSS selector? That is:

cssgrep page.html "body > div.content > table > tbody > tr > td.data"

The program would write the following to standard output:

Tabular Content 1
Tabular Content 2

Thank you!

Best Answer

Use the W3C's HTML-XML-utils, a set of command-line tools for parsing HTML/XML and extracting content with CSS selectors. For example:

hxnormalize -l 240 -x filename.html | hxselect -s '\n' -c "td.data"

This produces the desired output:

Tabular Content 1
Tabular Content 2

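A line length of 240 characters (-l 240) keeps hxnormalize from wrapping most elements across multiple lines, although content longer than that may still be wrapped. The -x option makes hxnormalize emit a well-formed XML document, which hxselect requires as input. For hxselect, -c prints only the content of each matched element, and -s '\n' writes a newline after each match.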
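
To get something closer to the "cssgrep" interface asked for in the question, the pipeline can be wrapped in a small shell function. This is only a sketch: the name cssgrep and its argument order are taken from the question and are hypothetical, not an existing program, and it assumes hxnormalize and hxselect are installed and on the PATH.

```shell
#!/bin/sh
# cssgrep: hypothetical wrapper (name borrowed from the question, not a real tool).
# Usage: cssgrep FILE SELECTOR
cssgrep() {
    # Require exactly two arguments: the HTML file and the CSS selector.
    if [ "$#" -ne 2 ]; then
        echo "usage: cssgrep FILE SELECTOR" >&2
        return 2
    fi
    # hxnormalize -x   : emit well-formed XML so hxselect can read it
    # hxnormalize -l N : use a line length of N characters when reformatting
    # hxselect -c      : print only the content of matched elements
    # hxselect -s '\n' : print a newline after each match
    hxnormalize -l 240 -x "$1" | hxselect -s '\n' -c "$2"
}
```

After sourcing this function, cssgrep page.html "td.data" runs the same pipeline as the command above, with the file and selector passed as arguments.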