Ubuntu – Getting text and links from a web page

w3m

I would like to have a script that downloads a web page with curl and pipes it to w3m, which strips out everything except the text and the links.

Is it possible to specify more than one content type for the -T option of w3m, and if so, how?

To clarify my question a bit more, here's an example:

curl --user-agent "Mozilla/4.0" https://askubuntu.com/questions -s | w3m -dump -T text/html

which returns only the text of Ask Ubuntu's questions page, but with no links. If w3m cannot do this, is there any other tool capable of scraping text and links simultaneously?

Best Answer

  • Well, after extensive research on my own, I guess there is no such tool...

    However, for what it's worth, I did discover hxnormalize, which made writing a particular script I needed a relatively simple matter. A sketch of such a pipeline is included below.

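    For illustration, a minimal sketch of such a pipeline might look like the following. hxwls (another tool from the same html-xml-utils package as hxnormalize) is used here purely as an example for the link part; it is not necessarily what the original script used:

        #!/bin/bash
        # Fetch the page once, reusing the user agent and URL from the question.
        url="https://askubuntu.com/questions"
        page=$(curl --user-agent "Mozilla/4.0" -s "$url")

        # Plain text, as in the question, via w3m.
        printf '%s' "$page" | w3m -dump -T text/html

        # Links: hxnormalize -x first repairs the HTML into well-formed XML,
        # then hxwls prints every link it finds, one per line.
        printf '%s' "$page" | hxnormalize -x | hxwls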