Ubuntu – Remove text that I don’t want

command line, text processing

I have a big HTML file on my desktop that looks like this:

src="http://images.alaablubnan.com/images/Balls/20.jpg"
alt="http://images.alaablubnan.com/images/Balls/20.jpg"/></a></td><td><a
href="http://images.alaablubnan.com/images/Balls/32.jpg"
target="_blank"><img
src="http://images.alaablubnan.com/images/Balls/32.jpg"
alt="http://images.alaablubnan.com/images/Balls/32.jpg"/></a></td><td><a
href="http://images.alaablubnan.com/images/Balls/30.jpg"
target="_blank"><img
src="http://images.alaablubnan.com/images/Balls/30.jpg"
alt="http://images.alaablubnan.com/images/Balls/30.jpg"/></a></td></tr><tr><td><table><tr><td>webpage/url</td><td>http://www.playlebanon.com/webservices/website/lotto/PopUps/HistoryDetail.aspx?t=1405536730503&FromDraw=1&ToDraw=1213&Draw=0</td></tr></table></td><td>2</td><td>complete
lotto results</td><td>complete lotto results</td><td>2</td><td><a
href="http://www.playlebanon.com/webservices/website/lotto/PopUps/HistoryDetail.

If possible, I want to:

  • extract all the .jpg file names (1.jpg, 2.jpg … up to 42.jpg) and remove all the HTML code
  • remove the .jpg extension
  • put only 7 numbers on each row, then start a new line

Best Answer

  • This is not actually a particularly good job for sed but here goes:

    sed -nr 's#.*/([^"]+).jpg.*#\1#p' file 
    

    The above will get you a list of numbers, one per line:

    20
    20
    32
    32
    32
    30
    30
    30
    

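    It is possible to make sed itself print 7 numbers per line, but it is really not worth the effort. Just use standard *nix tools instead. Note that the width of 21 passed to fold assumes two-digit numbers (7 × 3 characters, counting the space after each number); with single-digit numbers, more than 7 would fit on a line: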

    $ echo $(sed -nr 's#.*/([^"]+).jpg.*#\1#p' file | tr $'\n' ' ') | fold -sw 21
    20 20 32 32 32 30 30 
    30
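
    Since the question mentions 1.jpg through 42.jpg, a grouping step that does not depend on column width may be safer than fold. A minimal sketch, assuming awk is available; the function name group_by_7 is made up for this illustration and the input is sample data, not the real file:

    ```shell
    # Hypothetical helper: read one number per line on stdin and
    # print them seven per line, separated by single spaces.
    group_by_7() {
        awk '{
            # No separator before the first number of each output row.
            printf "%s%s", (NR % 7 == 1 ? "" : " "), $0
            # Close the row after every seventh input line.
            if (NR % 7 == 0) print ""
        }
        END {
            # Terminate a final, incomplete row.
            if (NR % 7 != 0) print ""
        }'
    }

    # Mixed one- and two-digit input, standing in for the sed output.
    printf '%s\n' 1 2 3 4 5 6 7 8 9 10 11 12 | group_by_7
    # prints:
    # 1 2 3 4 5 6 7
    # 8 9 10 11 12
    ```

    In the real pipeline this would replace the echo/tr/fold part, i.e. the sed output would be piped straight into group_by_7.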
    

    Or, if you want to remove duplicates:

    echo $(sed -nr 's#.*/([^"]+).jpg.*#\1#p' file | sort -u | tr $'\n' ' ')
    20 30 32
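
    One caveat: sort -u compares lines as text, so as soon as single-digit numbers show up, 9 ends up after 32. A small sketch of the numeric variant, using made-up sample numbers rather than the real file (nums is a hypothetical helper for this example only):

    ```shell
    # Sample numbers (illustrative, not taken from the real file).
    nums() { printf '%s\n' 20 20 32 9 32 30 9; }

    # Text comparison: "9" sorts last because the character 9 is greater than 2 or 3.
    nums | LC_ALL=C sort -u | paste -sd ' ' -
    # prints: 20 30 32 9

    # -n compares numerically; -u still drops the duplicates.
    nums | sort -nu | paste -sd ' ' -
    # prints: 9 20 30 32
    ```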
    

    Explanation

    The sed command uses a few tricks:

    • -n: don't print any lines by default.
    • -r: enable extended regular expressions. This lets us use ( ) to capture groups without escaping the parentheses, and + for "one or more".
    • s#from#to# : while the standard substitution operator in sed and other, similar tools is s/from/to/, you can use a non-standard delimiter so that you can include / in the pattern without escaping it. In this case I am using # but you could use something else, like s|from|to|, as well.
    • s#.*/([^"]+).jpg.*#\1#p : this matches everything from the beginning of the line up to a / and then captures the longest stretch of non-" characters before .jpg. That is the file name minus the extension. The file name is captured in the parentheses and the whole line (because of the .* on either side) is replaced with the captured pattern (\1). The p at the end means that only the lines where the substitution succeeded are printed. Strictly speaking, the unescaped . in .jpg matches any character; writing \.jpg would match only a literal dot.
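
    To see the pieces above in isolation, here is a small sketch that runs the same substitution on two lines from the question. The wrapper name jpg_basename is made up for this example; -E is equivalent to -r in GNU sed and is also accepted by BSD sed, and the dot is escaped here so that it only matches a literal ".":

    ```shell
    # Hypothetical wrapper around the substitution explained above.
    jpg_basename() {
        sed -nE 's#.*/([^"]+)\.jpg.*#\1#p'
    }

    # A line containing an image URL: the whole line is replaced by the capture.
    printf '%s\n' 'src="http://images.alaablubnan.com/images/Balls/20.jpg"' | jpg_basename
    # prints: 20

    # A line without ".jpg": the substitution fails, so with -n nothing is printed.
    printf '%s\n' 'target="_blank"><img' | jpg_basename
    ```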

    Personally though, I would have done all of this with perl in the first place:

    $ perl -e '@k=grep(s/.*\/([^"]+).jpg.*/$1/s,<>); print "@k[0..6]\n@k[7..$#k]\n"' file 
    20 20 32 32 32 30 30
    30
    

    Or, for a larger file:

    $ perl -e '@k=grep(s/.*\/([^"]+).jpg.*/$1/s,<>); for($i=0;$i<=$#k;$i+=7){$e=$i+6; $e=$#k if $e>$#k; print "@k[$i..$e]\n"}' file 
    20 20 32 32 32 30 30
    30
    

    Or even with grep:

    $ echo $(grep -oP '[^/]+(?=.jpg)' file | tr $'\n' ' ' ) | fold -w 21
    20 20 32 32 32 30 30 
    30
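
    The (?=.jpg) part is a lookahead: it requires .jpg to follow the match without making it part of the output. That needs grep -P (Perl-compatible regexes), which is a GNU extension and not available everywhere. A rough equivalent without lookahead, assuming a grep that supports -o (GNU and BSD grep both do), is to match the name including .jpg and strip the extension afterwards. The helper name jpg_names is only for illustration:

    ```shell
    # Hypothetical helper: print the base name of every .jpg reference on stdin.
    jpg_names() {
        # [^/"]+\.jpg : a run of characters with no slash or quote, then ".jpg".
        # The sed step removes the extension from each match.
        grep -oE '[^/"]+\.jpg' | sed 's/\.jpg$//'
    }

    printf '%s\n' \
        'href="http://images.alaablubnan.com/images/Balls/32.jpg"' \
        'target="_blank"><img' \
        'src="http://images.alaablubnan.com/images/Balls/32.jpg"' | jpg_names
    # prints:
    # 32
    # 32
    ```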
    

    Or, stealing @Olli's clever xargs idea:

    $ grep -oP '[^/]+(?=.jpg)' file |  xargs -n7 echo
    20 20 32 32 32 30 30
    30
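
    Unlike fold, xargs -n7 counts arguments rather than columns, so it keeps working when the numbers have different widths. A self-contained sketch to try this out: it writes nine synthetic lines in the same shape as the question's HTML to a temporary file (the file and the variable names are assumptions for this example, not the real data), then extracts and groups the numbers:

    ```shell
    # Build a small stand-in for the HTML file: one image URL per line.
    tmp=$(mktemp)
    for n in 1 2 3 4 5 6 7 8 9; do
        printf 'src="http://images.alaablubnan.com/images/Balls/%s.jpg"\n' "$n"
    done > "$tmp"

    # Extract the names with sed, then let xargs put at most seven on a line.
    result=$(sed -nE 's#.*/([^"]+)\.jpg.*#\1#p' "$tmp" | xargs -n7 echo)
    printf '%s\n' "$result"
    # prints:
    # 1 2 3 4 5 6 7
    # 8 9

    # Clean up the temporary file.
    rm -f "$tmp"
    ```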