Simon Fell > Its just code > March 2006
Between grep, awstats, perl and python I've been picking my logs apart to get a handle on the bandwidth usage, here's a summary of what I've found so far.
- Google for some reason was just pounding some 3 year old content which hasn't changed in forever, over and over again. A swift kick in the robots.txt seems to have quieted that down, but it still managed to chew through 1.86GB of bandwidth in March.
- The RSS feed for this blog is responsible for a good chunk of bandwidth too. I see people still insist on writing aggregators that don't do conditional GETs. WTF is wrong with you people? The shit list includes AlestiFeedBot, RssReader, Squeet, BlendBlogs, NewsAlloy and Thunderbird.
- Some comedian from 22.214.171.124 is running FeedForAll's rss2htm, which is crappy enough to download the entire feed for every request. It also sends the URL of my feed as the referrer, so other than the IP I can't find out where this is being used, but it's made more than 13k requests for the feed this month. (If you know who you are, get in touch)
Why don't the Google bot and the other search engines support conditional GETs?? The amount of change on the site is a fairly small %, so a crawler doesn't really need to pull the entire thing over just to pick up the 3 changes since last time. As this doesn't make things easier for them (although you'd think re-indexing changes rather than everything would be faster and therefore cheaper) I doubt we'll see it. This is definitely an area where I think one of the underdogs could step up to the plate and force some movement.
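For anyone writing an aggregator or crawler and wondering what's involved, here is a minimal sketch of a conditional GET in Python, using only the standard library. The function names and the way the cached validators are passed around are illustrative assumptions, not code from any of the tools named above. The idea is simply to send back the ETag and Last-Modified values from the previous fetch and treat a 304 response as "nothing changed".

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build the validator headers for a conditional GET (hypothetical helper)."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None,
                     opener=urllib.request.urlopen):
    """Return (body, etag, last_modified); body is None on a 304 response.

    `opener` is injectable only so the function can be exercised offline.
    """
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified))
    try:
        with opener(request) as response:
            # Full response: remember the new validators for next time.
            return (response.read(),
                    response.headers.get("ETag"),
                    response.headers.get("Last-Modified"))
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Not modified: keep using the cached copy and old validators.
            return None, etag, last_modified
        raise
```

The caller stores the returned validators alongside its cached copy of the feed and passes them in on the next poll, so an unchanged feed costs a few hundred bytes of headers rather than the whole document.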
Yeah, I know my feed doesn't support compression. I've been talking to the Orcsweb folks about it (who do a fantastic job hosting), but the IIS compression settings are all global, so they can't add .xml files to the compression list without affecting everyone on that server, and they're understandably reluctant to do it. Not sure what to do about that. In the meantime I've cut down the number of entries in the feed; I don't blog as much as I used to, so it shouldn't be a big issue.
update : Good news, both the NewsAlloy and Squeet folks have rolled out new versions with conditional GET support, and the BlendBlogs guys are working on it, thanks!
A new release of Axis 2 is available, so I did my traditional download and ran it against the current salesforce.com WSDL. I was hoping after logging issues against previous releases that someone on the Axis team might have tried this before me, but apparently not (yes, I'm tired of testing new SOAP stacks, particularly ones that are more tedious to use and more buggy than their predecessors), so I logged my now traditional new build bug. I realize I logged an earlier bug about the fact that xmlbeans generates about 2000 files from our WSDL, however I didn't expect them to swing entirely in the other direction and generate a single Java source file that's 5MB and 120K lines in size. See the bug for the other offensive things I spotted. I live in hope that someone will ship a Java WS stack that doesn't suck.
I just posted an updated version of YATT. This moves to WinPcap 3.1 and fixes a couple of bugs around the capture adapter selection. WinPcap now supports a remote capture API, which sounds interesting; I might hook it up to YATT if I get some time.
My bandwidth usage for pocketsoap.com has doubled in the last 6 months, and I've been trying to work out where it's all going. I figured it was largely from downloads, and have been thinking about moving the downloads to a cheaper hosting deal.
I ran this year's logs through awstats (a fairly painful process; it seems odd to me that anyone building a web server log mining and reporting tool would assume you don't have lots of existing logs to feed it). It generated some interesting stats, not least of which is that the googlebot has already soaked up over 1GB of bandwidth this month (more than 3x Yahoo, and 5x AskJeeves). WTF is it doing? The site is not that big, so why has it done 100k hits and 1.03GB of bandwidth just in March?
Also turns out that my RSS feed eats about as much bandwidth as the binary downloads, so back to the drawing board there I think.
awstats was pretty easy to get up and running on my Mac (easier than a previous attempt to run it on Windows). It got the httpd.conf changes wrong, but that was easy enough to fix. One nice trick I managed is that you can feed it logs from a pipe, so there's no need for me to download the logs to the local machine first, you can just feed it the logs directly from curl, e.g.
LogFile="curl -u user:pass ftp://myserver/serverlogs/%YY-24%MM-24%DD-25.log |"
awstats still has a few holes in it. I'd like to see a top 10 list of downloaded files (.exe, .zip, etc.), and it would be nice if the tables were sortable by their headings. The summaries are great, but you can't drill down, so either I'll be getting more friendly with grep, or I'll be trying out some other tools (any recommendations?)
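A top-downloads list is not much code to do by hand in the meantime. This is a rough sketch in Python, and it rests on assumptions: that the logs are in W3C extended format (the kind IIS writes) and that their #Fields line includes cs-uri-stem and sc-bytes. The extension list and the function name are made up for illustration.

```python
from collections import Counter

# Assumed set of "download" extensions; adjust to taste.
DOWNLOAD_EXTENSIONS = (".exe", ".zip", ".msi")


def top_downloads(lines, limit=10):
    """Return [(path, hits, bytes_sent)] for the most requested download files.

    Expects W3C extended format log lines, including the '#Fields:' directive
    that names the columns.
    """
    fields = []
    hits = Counter()
    bytes_sent = Counter()
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            # Column names for the data rows that follow.
            fields = line.split()[1:]
            continue
        if not line or line.startswith("#") or not fields:
            continue
        row = dict(zip(fields, line.split()))
        path = row.get("cs-uri-stem", "").lower()
        if path.endswith(DOWNLOAD_EXTENSIONS):
            hits[path] += 1
            size = row.get("sc-bytes", "0")
            # sc-bytes can be '-' when the server did not record it.
            bytes_sent[path] += int(size) if size.isdigit() else 0
    return [(path, count, bytes_sent[path])
            for path, count in hits.most_common(limit)]
```

Since it takes any iterable of lines, it can be fed an open file or sys.stdin, so the same curl pipe trick works here too.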
Scott is trying to raise $10k for the American Diabetes Association.
Drop by and make a (tax-deductible) donation.
Just remind me again why HTTP over SOAP over HTTP is a good idea? All in the name of the fabled transport independence. If the W3C does anything other than laugh and throw it back I'll be very disappointed. Mark Baker points out that it's not even all of HTTP. What, was it too much like hard work to transfer over the rest of the HTTP spec?
From the Sforce forums : To enhance the performance and availability of the service, Salesforce.com has changed the way that large queries are handled. Queries returning more than 5,000 rows will now be processed with a new architecture. Note that only a small subset of queries are large enough to be affected by this change. Existing customer integrations should not be affected in any way. If you notice any issues, please contact our support team.
We headed up to Petaluma Saturday morning to check out the western regional barista competition. I'd read about the competition format in the past but never seen it in person before. It always sounded hard, but seeing it in action really highlights how tough it is. The barista has 15 minutes to make a total of 12 drinks: 4 espressos, 4 cappuccinos and 4 signature drinks (of their own recipe). As if that's not hard enough, there's a 5 star service and presentation aspect to it, where the barista has to set the table, talk about what they're doing and why, and so on. The presentation aspects can easily eat up minutes of that precious 15. I noticed that some of the top competitors would take the first 3 minutes or so setting up the table, talking about the beans, what drinks they'd be making and so on, leaving them an astonishingly small 12 minutes or so to make the 12 drinks and try and do some cleanup.
Saturday saw 22 competitors go up, including a good turnout from San Francisco: Gabe and Ryan from Ritual, and Eton and Danielle from Organica. I got to see most of the competitors, but not all of them. It's not exactly a great spectator sport to start with: so many of the points are tied up in taste, and only the judges know how that turned out. In addition it was hard to see the details from the floor. They really need to have either overhead mirrors or some kind of video system, so you can get a better look at the drinks and the latte art being made.
The 4 stand out competitors for me on Saturday were Eton Tsuno from Café Organica, Heather Perry from Coffee Klatch, Emma Sanchez from Barefoot and Gabe Boscana from Ritual. These all did a great job on the presentation, looked poised and professional, and did a good job of talking about what they were doing. Saturday evening came around and the 6 finalists were announced.
- Eton Tsuno, Café Organica
- Eugenia Chien, Barefoot Coffee
- Ryan Brown, Ritual Coffee
- Pele Aveau, Flying Goat Coffee
- Gabe Boscana, Ritual Coffee
- Heather Perry, Coffee Klatch
Great result for Ritual, with both Ryan and Gabe making the finals, and with Eton from Organica making the finals as well, fully 50% of the finalists are San Francisco based. The SF coffee scene is rockin'. Sunday, the 6 finalists run again, same format. The Ritual folks seem to have brought half of the population of the Mission up to Petaluma to cheer on their guys. The 6 finalists all do outstanding runs under the pressure, and it's not only time pressure: there are a total of 7, yes 7, judges. That's 4 sensory judges, judging taste, presentation etc., 2 technical judges who watch closely (very closely) for technique, wastage, cleanliness, preparation etc., and a head judge to tie it all together. After the 6 do their stuff, the judges disappear to finalize the results, and everyone else mills around expectantly, taking advantage of "machine 4", one of the competition machines, staffed in rotation by the competitors and local roasters, a great way to try some new and different beans and drinks. The final results:
- 1st place - Heather Perry, Coffee Klatch
- 2nd place - Gabe Boscana, Ritual Coffee
- 3rd place - Eton Tsuno, Café Organica
Congrats to Heather, Gabe and Eton, outstanding job, and congrats to all involved. It seemed well organized, everything went smoothly, there was a good turnout of both competitors and spectators (despite the general spectator unfriendliness of it), and for me it was an interesting, educational and enjoyable weekend. The rate of growth in this top level of the industry means the Bay Area is going to keep getting better and better as a place to be for coffee. (and, hot damn, where can I snag one of those Ritual ties?)