Though we won't go into what they're for, I came across the need to collect a list of URLs that appear in bulk spam emails. Currently, a SpamAssassin install that I have running tags the spam, and Postfix redirects anything that's tagged to a spam address rather than its intended recipient. That leaves me with a mailbox that's just crawling with advertisements for just about every scam out there, and in turn tons and tons of links to bogus, dangerous, or defunct pages. The trouble is in actually harvesting those links...
Enter my criminally inefficient barrage of bash tools and regex:
#!/bin/bash
rm -f urllist urllist.tmp
touch urllist
echo "Getting Spam..."
for i in $(ls cur/); do cat cur/$i|sed -e :a -e '$!N; s/\n//; ta'|sed 's/http/\nhttp/g'|sed 's/>/ /'|sed 's/</ /'|awk '{print $1}'|sed 's/\.com[a-z A-Z]*/\.com/'|sed 's/---*/\n/'|sed 's/__*/\n/'|egrep 'http://|https://'|grep -v www.spamcop.net| sed 's/=2E/\./g'|sed 's/)//'|sed 's/(//'|sed 's/,//'|sed 's/]//'|grep -v '\.='|sed 's/\"$//'|egrep -v 'png|gif|jpg|jpeg'|sed 's/=[0-9]*$//'| awk '!/\?/{gsub(/=/, "")}; 1' |sed -e 's~\(http://[^?]*\)=\([^?]*\)~\1\2~' >> urllist.tmp;done
cat urllist.tmp|sort|uniq > urllist
rm urllist.tmp
echo "List of URLs generated"
It's not pretty or fast, and I'll probably rewrite it in Python when I have the time, but so far it performs far better than any existing solution I've tried. Spam emails often have malformed HTML, broken lines, substituted or translated characters, and a host of other quirks which, while leaving the links still clickable, make them difficult to isolate cleanly from the command line.
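To make those quirks concrete, here is a rough sketch that pushes a made-up two-line snippet through a trimmed-down subset of the stages above. The sample text and the function names sample and extract are invented for illustration, and the newline in the sed replacement assumes GNU sed.

```shell
#!/bin/bash
# Hypothetical sample: a link broken across two lines, with a
# quoted-printable "=2E" standing in for the dot.
sample() {
    printf '%s\n' '<a href="http://exam' 'ple=2Ecom/offer">Click</a>'
}

# A trimmed-down subset of the pipeline, not the full script.
extract() {
    sed -e :a -e '$!N; s/\n//; ta' |   # join every input line into one
    sed 's/http/\nhttp/g' |            # start a new line at each "http" (GNU sed)
    sed 's/>/ /' |                     # turn the first ">" into a space
    awk '{print $1}' |                 # keep only the first field of each line
    grep -E 'http://|https://' |       # drop lines that hold no link
    sed 's/=2E/./g' |                  # decode the quoted-printable dot
    sed 's/"$//'                       # strip a trailing double quote
}

sample | extract
# prints: http://example.com/offer
```

Joining the lines first is what rescues links that the mail client wrapped in the middle of a hostname; everything after that is just trimming junk off either end.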
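The last bit of sed regex in this one I did not figure out myself, but I was thrilled to learn how to do it. sed -e 's~\(http://[^?]*\)=\([^?]*\)~\1\2~' uses back-references to remove a stray '=' symbol if it appears in a URL before the '?'. Because there is no loop, each line loses at most one '=' (the last one before the '?'), and only http:// links are matched.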
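Here is a small sketch of that substitution on its own, next to a looping variant that keeps going until no '=' is left before the '?'. The helper names and sample URLs are made up for illustration, the looping variant is my own guess at an extension rather than part of the script above, and I have only assumed GNU sed.

```shell
#!/bin/bash
# The substitution from the script, wrapped in a hypothetical helper.
# \1 is everything up to the last '=' before the '?', \2 is what follows it.
strip_one_equals() {
    sed -e 's~\(http://[^?]*\)=\([^?]*\)~\1\2~'
}

# Variant (an assumption, not in the original script): label :a plus "ta"
# repeats the substitution for as long as it keeps succeeding.
strip_all_equals() {
    sed -e :a -e 's~\(http://[^?]*\)=\([^?]*\)~\1\2~' -e ta
}

echo 'http://exam=ple.com/page?id=42' | strip_one_equals
# prints: http://example.com/page?id=42

echo 'http://ex=am=ple.com/page?id=42' | strip_one_equals
# prints: http://ex=ample.com/page?id=42

echo 'http://ex=am=ple.com/page?id=42' | strip_all_equals
# prints: http://example.com/page?id=42
```

In all three cases the '=' in id=42 survives, because [^?]* can never cross the '?'.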