"Look out honey, 'cause I'm using technology..."

2008-02-23

The Musical Gardener's Tools #5: Yet Another Way to Harvest mp3blogs

Update 2008-03-11: There were a number of things wrong with this script that made the spidering *waaaay* slower than it needs to be. I fixed that below, and added threading for both the spidering and the downloading, thanks to this cool recipe by Wim Schut, which lets me run all the sqlite code in a separate thread. (Important, because you can only use sqlite connections in the thread in which they were created.) All of this results in a nice speed-up.
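The gist of the recipe, as I understand it, is roughly the sketch below: one thread owns the sqlite connection, and the spider and download threads talk to it through a Queue instead of touching the connection themselves. (My own hypothetical names and simplifications, not the recipe's actual code.)

import sqlite3, threading, Queue

class DbWorker(threading.Thread):
    """Owns the sqlite connection; other threads post requests on a Queue."""

    def __init__(self, dbfile):
        threading.Thread.__init__(self)
        self.dbfile = dbfile
        self.requests = Queue.Queue()
        self.setDaemon(True)

    def run(self):
        # the connection is created *and used* only in this thread
        conn = sqlite3.connect(self.dbfile)
        while True:
            sql, params, result = self.requests.get()
            if sql is None:          # sentinel: shut down cleanly
                break
            cursor = conn.execute(sql, params)
            conn.commit()
            result.put(cursor.fetchall())
        conn.close()

    def execute(self, sql, params=()):
        # safe to call from any thread; blocks until the db thread answers
        result = Queue.Queue()
        self.requests.put((sql, params, result))
        return result.get()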

OK, I said I wasn't going to, but I did end up writing a bit of code, although it didn't get too far out of hand. Yet :). It solves *all* of my problems: it does not download files over 30MB in size, and it never downloads the same link twice.
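The size limit is the easy part; one way to do it (just a sketch of the idea, not necessarily how spider.py handles it) is to ask the server for the headers first and look at Content-Length before fetching anything:

import httplib, urlparse

MAX_BYTES = 30 * 1024 * 1024

def small_enough(url):
    parts = urlparse.urlsplit(url)
    conn = httplib.HTTPConnection(parts.netloc)
    conn.request("HEAD", parts.path or "/")
    response = conn.getresponse()
    length = response.getheader("content-length")
    conn.close()
    # if the server doesn't say, be optimistic and try the download anyway
    return length is None or int(length) <= MAX_BYTES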

I found this message on the Python mailing list, which seemed like a very good start. It almost did what I needed, but not quite, and the parsing was overcomplicated and didn't catch all links, so I replaced it with a simple regular expression.
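Something along these lines, that is (the actual expression in spider.py may differ slightly):

import re

# crude but effective: grab whatever an href points at
link_re = re.compile(r'''href\s*=\s*["']?([^"'> ]+)''', re.IGNORECASE)

def find_links(html):
    return link_re.findall(html)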

I ended up changing most of the code and functionality (for instance, it now stores links in a database). There's a lot of hard-coding in there, which I could factor out if people want to use it, but for now it solves my problems beautifully ;).
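Storing the links in sqlite is also what makes the "never download the same link twice" part almost free; a hypothetical schema sketch (the real tables in spider.py may look different) could be as simple as:

import sqlite3

conn = sqlite3.connect("spider.db")
conn.execute("""CREATE TABLE IF NOT EXISTS links (
                    url TEXT UNIQUE,
                    downloaded INTEGER DEFAULT 0)""")

def remember(url):
    # inserting the same url a second time is silently ignored
    conn.execute("INSERT OR IGNORE INTO links (url) VALUES (?)", (url,))
    conn.commit()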

It's used with the following syntax:

# initial set up
python spider.py createdb
# add a new blog to be harvested
python spider.py add http://url.of.blog/
# (shallowly) spider all blogs for new links to files
python spider.py
# spider a url to a specific depth (5 for example should get 
# most everything, but will take a while)
python spider.py deepspider 5
# download all files
python spider.py download
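Internally the commands above presumably boil down to a simple sys.argv dispatch, roughly like this (a sketch with stub functions, not the actual spider.py code):

import sys

# stand-ins for the real functions in spider.py
def create_db():        print("creating database")
def add_blog(url):      print("adding blog %s" % url)
def spider_all(depth):  print("spidering to depth %d" % depth)
def download_all():     print("downloading new files")

def main():
    args = sys.argv[1:]
    if not args:
        spider_all(1)                    # shallow spider of all known blogs
    elif args[0] == "createdb":
        create_db()
    elif args[0] == "add":
        add_blog(args[1])
    elif args[0] == "deepspider":
        spider_all(int(args[1]))
    elif args[0] == "download":
        download_all()

if __name__ == "__main__":
    main()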

A minor problem is that curl doesn't do *minimum* file sizes, and with a lot of broken links it ends up downloading something small that isn't really an ogg or mp3 file, but an HTTP response. I can probably solve this better, but for now I call the download from an update script as follows:

python spider.py download
find . -iname "*.mp3" -size "-100k" -print0 | xargs -0 -r rm
find . -iname "*.ogg" -size "-100k" -print0 | xargs -0 -r rm
find . -iname "*.mp3" -print0 | xargs -0 mp3gain -k -r -f
find . -iname "*.ogg" -print0 | xargs -0 vorbisgain -fr

Translation: download files, throw away suspiciously small ones, mp3/vorbisgain what's left.
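A cleaner fix would probably be to not save those fake files in the first place, for instance by checking the Content-Type of the response before writing anything to disk. Something like this (just a sketch of the idea, not what the script currently does):

import urllib2

def looks_like_audio(url):
    # open the url, peek at the headers, and bail out on HTML error pages
    response = urllib2.urlopen(url)
    content_type = response.info().gettype()   # e.g. "audio/mpeg" or "text/html"
    response.close()
    return not content_type.startswith("text/")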

Here's the code:

Edit 2008-04-18: Moved the code to google code, so I don't have to update it here. Find the latest version here: spider.py

2008-02-12

Reminders via del.icio.us and Yahoo Pipes

Just thought I'd share this, while we still *have* del.icio.us and yahoo pipes... ;)

A while ago I stumbled across tagmindr, which seemed like a cool idea: put a custom tag on some url in your del.icio.us account and it will remind you on a certain date to look at that url again. I frequently see announcements on websites that say something like check back here on [some date] for [some interesting news]. Having automatic reminders for things like that is great, because the chances of me forgetting otherwise are near 100%. The thing is: the personal feed tagmindr promised me *NEVER WORKED*. That's how I completely forgot about a number of things, and about tagmindr itself for a while. No biggie, just not very smart if you wanna get all start-uppy and generate buzz ;)

So I thought: how hard can it be to get this right? Turns out not hard at all! I built a yahoo pipe in under an hour that does exactly the same thing. You just give it your del.icio.us username, and it gives you reminders for anything you tag with both 'reminders' and 'remind:yyyy-mm-dd', where you replace the ys, ms and ds with the date you want to be reminded on.
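In plain Python the logic of the pipe comes down to something like this (a sketch only; the feed url format and the feedparser library are my assumptions here, the pipe itself just wires the same steps together visually):

import datetime
import feedparser

def reminders_due(username, today=None):
    today = today or datetime.date.today()
    # del.icio.us rss feed for one user's 'reminders' tag (assumed url format)
    feed = feedparser.parse("http://del.icio.us/rss/%s/reminders" % username)
    due = []
    for entry in feed.entries:
        # collect all tags on the bookmark, however the feed spells them
        tags = " ".join(t.get("term", "") for t in entry.get("tags", [])).split()
        for tag in tags:
            if tag.startswith("remind:"):
                year, month, day = map(int, tag[len("remind:"):].split("-"))
                if datetime.date(year, month, day) <= today:
                    due.append(entry.link)
    return due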

The yahoo pipe is here. Enjoy! (As always, feedback and bug reports very welcome!)

Yahoo pipes rock, del.icio.us rocks. Let's hope Yahoo can hold out. I have nothing against Microsoft per se, but I don't think they ever got the web, and I fear they will screw up the nice and open things (like YUI, for instance) that Yahoo has been developing in the past few years.

2008-02-03

Exploratory programming, or my 2 ¢ on arc

A lot of people have blogged on Paul Graham's new language, arc, the (perceived lack of) new features it brings, and the intentionally non-PC announcement by Mr. Graham. I don't have much to add to that particular debate. It looks like lisp, with some new syntactic sugar, which is fine by me. I like lisp, but I wouldn't want to use it in my day job. Others no doubt do, and their taste is no worse or better than mine.

I think maybe the development time worked against it, in that some features seem less than revolutionary because other languages got there first. Now lisp has them too, and maybe even better implemented; I'm not one to judge.

What I take issue with is that Mr. Graham explains the lack of some other features by saying that arc is for exploratory programming only, and that those features are somehow a hindrance to that. I think this is just plain wrong: unicode support will hurt no one in their exploratory programming; it will actually help a lot of people a good deal. Mr. Graham quotes Guido van Rossum as saying he spent a year implementing unicode. I very much doubt that that quote is correct, but *even if it were*, so what? That means exactly nothing to the exploratory programmer, and only hurts the exploratory *language designer*, which I think may be a little closer to what is going on here.

As an exploratory programmer in any language I've ever used (I do think Mr. Graham is correct in saying everyone is one), I can safely say that features have never harmed me, as long as they did not get in the way when I wasn't using them. Unicode support in python doesn't. In fact, python (my favorite language, *and* the one I'm most fluent in by now, so yes, I'm biased) is absolutely fantastic for exploratory programming, exactly because of its huge standard library, which helps you get to the meat of the task at hand without having to build your own support library first.