# Ubuntu – Remove text that I don’t want

command-line text-processing

I have a big HTML file on my desktop that looks like this:

src="http://images.alaablubnan.com/images/Balls/20.jpg"
alt="http://images.alaablubnan.com/images/Balls/20.jpg"/></a></td><td><a
href="http://images.alaablubnan.com/images/Balls/32.jpg"
target="_blank"><img
src="http://images.alaablubnan.com/images/Balls/32.jpg"
alt="http://images.alaablubnan.com/images/Balls/32.jpg"/></a></td><td><a
href="http://images.alaablubnan.com/images/Balls/30.jpg"
target="_blank"><img
src="http://images.alaablubnan.com/images/Balls/30.jpg"
alt="http://images.alaablubnan.com/images/Balls/30.jpg"/></a></td></tr><tr><td><table><tr><td>webpage/url</td><td>http://www.playlebanon.com/webservices/website/lotto/PopUps/HistoryDetail.aspx?t=1405536730503&FromDraw=1&ToDraw=1213&Draw=0</td></tr></table></td><td>2</td><td>complete
lotto results</td><td>complete lotto results</td><td>2</td><td><a
href="http://www.playlebanon.com/webservices/website/lotto/PopUps/HistoryDetail.


If possible, I want to:

• Get all the .jpg file names and remove all the HTML code (the files are 1.jpg, 2.jpg, … up to 42.jpg).
• Remove the .jpg extension.
• Put only 7 numbers on each row, then insert a newline.

This is not actually a particularly good job for sed, but here goes:

sed -nr 's#.*/([^"]+).jpg.*#\1#p' file


The above will get you a list of numbers, one per line:

20
20
32
32
32
30
30
30
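
To try the substitution without the original file, a few lines shaped like the question's HTML can be written to a scratch file first. This is only a sketch: the temporary file and the variable names `sample` and `extracted` are made up for illustration, and `-E` is used as the portable spelling of GNU sed's `-r`.

```shell
# Hypothetical scratch file holding a few lines shaped like the question's HTML.
sample=$(mktemp)
cat > "$sample" <<'EOF'
src="http://images.alaablubnan.com/images/Balls/20.jpg"
alt="http://images.alaablubnan.com/images/Balls/20.jpg"/></a></td><td><a
href="http://images.alaablubnan.com/images/Balls/32.jpg"
target="_blank"><img
EOF

# Same substitution as above: keep only the text between the last "/" and ".jpg".
# Lines without a .jpg URL (the target= line) produce no output because of -n and p.
extracted=$(sed -nE 's#.*/([^"]+).jpg.*#\1#p' "$sample")
printf '%s\n' "$extracted"   # prints 20, 20 and 32, one per line

rm -f "$sample"
```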


Now, it is actually possible to get all these on the same line with 7 numbers per line using sed but it is really not worth the effort. Just use standard *nix tools instead:

$ echo $(sed -nr 's#.*/([^"]+).jpg.*#\1#p' file | tr $'\n' ' ') | fold -sw 21
20 20 32 32 32 30 30
30

Or, if you want to remove duplicates:

$ echo $(sed -nr 's#.*/([^"]+).jpg.*#\1#p' file | sort -u | tr $'\n' ' ')
20 30 32

### Explanation

The sed command uses a few tricks:

• -n: don't print any lines by default.
• -r: enable extended regular expressions. This lets us use ( ) to capture groups without needing to escape the parentheses, and + for "one or more".
• s#from#to#: while the standard substitution operator in sed and similar tools is s/from/to/, you can use a non-standard delimiter so that you can include / in the pattern. In this case I am using #, but you could use something else, like s|from|to|, as well.
• s#.*/([^"]+).jpg.*#\1#p: this matches everything from the beginning of the line up to a /, then captures the longest stretch of non-" characters before .jpg. This is the filename minus the extension. The filename is captured in the parentheses, and the whole line (because of the .* on either side) is replaced with the captured pattern (\1). The p at the end means that only the lines where the substitution succeeded are printed.

Personally though, I would have done all of this with perl in the first place:

$ perl -e '@k=grep(s/.*\/([^"]+).jpg.*/$1/s,<>); print "@k[0..6]\n@k[7..$#k]\n"' file
20 20 32 32 32 30 30
30


Or, for a larger file:

$ perl -e '@k=grep(s/.*\/([^"]+).jpg.*/$1/s,<>); for($i=0;$i<=$#k;$i+=7){$j=$i+6; $j=$#k if $j>$#k; print "@k[$i..$j]\n"}' file
20 20 32 32 32 30 30
30


Or grep even:

$ echo $(grep -oP '[^/]+(?=.jpg)' file | tr $'\n' ' ' ) | fold -w 21
20 20 32 32 32 30 30
30

Or, stealing @Olli's clever xargs idea:

$ grep -oP '[^/]+(?=.jpg)' file | xargs -n7 echo
20 20 32 32 32 30 30
30
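
One thing to watch with the fold variants: a width of 21 fits exactly seven two-digit numbers plus their separating spaces, so it only yields 7 per row when every number is two characters wide. The question mentions 1.jpg through 42.jpg, so single-digit numbers are likely, and counting items with xargs -n7 is probably the safer choice. A small sketch of the difference, using a made-up list of the numbers 1 to 9 in place of the extracted values (the variable names are hypothetical):

```shell
# Made-up sample: the numbers 1 to 9, one per line, standing in for the extracted list.
numbers=$(printf '%s\n' 1 2 3 4 5 6 7 8 9)

# Width-based wrapping: the joined line is only 17 characters long,
# so fold never splits it and all nine numbers stay on one row.
by_width=$(echo $(printf '%s\n' "$numbers" | tr '\n' ' ') | fold -sw 21)

# Count-based grouping: xargs passes at most seven arguments to each echo,
# regardless of how wide the numbers are.
by_count=$(printf '%s\n' "$numbers" | xargs -n7 echo)

printf '%s\n' "$by_width"   # 1 2 3 4 5 6 7 8 9
printf '%s\n' "$by_count"   # 1 2 3 4 5 6 7, then 8 9 on a second line
```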