Friday, September 4, 2009

Hackerspaces - a way for computers and the people who love them to get out more often

An unfortunate stereotype of members of the Geek Kingdom is that they are introverted and anti-social. There's an old joke:
Q: How can you tell if a programmer is extroverted?
A: He looks at your shoes instead of his when he talks to you.
(I didn't say it was funny). The truth is that while Geeks do appreciate being left alone occasionally to focus on a problem or a project they're working on, they are often social animals, not unlike people in general. They go to conferences, they collaborate on Open Source projects, they go to events like Bloomington's ongoing Geek Dinners or user group meetings. In a college town like Bloomington (yes students, we townies do appreciate you) there is an abundance of opportunities to get out and mix it up.

For all the 'we live in a new era' wonder of computers and the internet, they live in a somewhat insular world. Sure, data travels around the globe via TCP/IP, HTTP, etc, but for many of them, interaction with the world outside is limited to fingers tapping on their keyboards, or perhaps a temperature sensor to make sure they don't melt down. I believe MacBooks have light sensors, too. But the fact remains: for most of the computers we see and interact with day-to-day (I'm not talking about all the more-or-less hidden microcontrollers all around us), there's no meaningful direct interaction with the outside, sun shining, birds singing, the plants need watering world.

There's a nice illustration in the book Physical Computing that shows how your computer sees you:


Physical Computing is about giving computers (often really tiny inexpensive ones, aka microcontrollers) sensors to allow them to take in their environment, and sometimes motors or other physical/mechanical devices so they can move around or change their environment (as with the plant watering).
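For the computer-side half of that equation, even a few lines of Python are enough to give a laptop a window on the physical world. Here's a minimal sketch, assuming the pyserial module is installed and an Arduino at the other end of the USB cable is printing one soil-moisture reading per line (the port name and the 'dry' threshold are made up for illustration):

import serial  # pyserial: the usual way to talk to an Arduino from Python

# The port name is a placeholder; on a Mac it's usually something like
# /dev/tty.usbserial-XXXX, on Linux more like /dev/ttyUSB0.
port = serial.Serial('/dev/tty.usbserial-A9007UX1', 9600)

while True:
    # The Arduino sketch is assumed to print one numeric reading per line.
    reading = int(port.readline().strip())
    if reading < 300:  # arbitrary 'soil is dry' threshold
        print "Time to water the plants (moisture = %d)" % reading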

A new place in town for like-minded creative types with interests in Physical Computing, electronics, and creative activities in general is the Bloomington Hackerspace (really cool name like NYCResistor or more serviceable name like HackDC to follow). Hackerspaces have been around for years, and as the name suggests they offer a physical (as opposed to virtual) space for hackers. Here hacker is defined as a person with an ongoing interest and aptitude for finding creative and usually not obvious uses for the objects, devices, and technological tools around them. These are NOT the let's-break-into-somebody's-system hackers. The very simple distinction: these hackers are all about creating, not destroying (although they do break things down for parts).

While not every great idea with a lot of enthusiasm behind it coalesces into a real-life implementation, things have been going very well for the Bloomington hackerspace under the capable (if unofficial) leadership of Nathan H. (aka dosman). After an initial meeting a couple weeks ago featuring members of the IU Robotics Club (the future of which is uncertain at this point, as pretty much none of the members are students), a space was identified (donated by member Jennette T.).

For the second meeting, we gathered in this space, which featured a worktable big enough for all of us and many breadboards and electrical components. Some of us brought our Arduinos and a guy named Will brought a MeggyJr he had built. I had heard about these hand-held, Arduino-based game consoles, but this was the first time I'd seen one. It was running a version of the classic memory game Simon, and later Will uploaded other games that one of Jennette's sons enjoyed playing.

While the gathering had a somewhat informal air, we did have a defined goal, as we were all going to get a chance to wire up an ATMega8 (AVR microcontroller) and program it to do the 'Hello World' of microcontroller programming, blinking an LED. It sounds somewhat trivial, but it ended up being interesting as we had to debug things like malfunctioning components or minor wiring errors. As somebody who's played with the Arduino, I found it interesting to take a step down from the ease-of-use and accessibility of working with the Arduino.

Arduino programming works like this:
  1. Write your program (a 'sketch') in the Arduino IDE, whose language is based on Wiring and whose editor is built on the very accessible Processing environment used for visualization and graphics. (Actually, the Blink sketch is available as an Example.)
  2. (Optional) put the long pin from your LED in the designated output pin, and the short one in ground (usually there'd be a resistor in the mix, but the Arduino's got that covered. Also there's an LED built in, so even this is optional).
  3. Plug a USB cable into your computer, and the other end into the socket on the Arduino.
  4. Upload the sketch.
  5. Admire the blinky light.
Programming the ATMega8 is a bit more involved:
  1. Grab your breadboard, a 5 Volt Regulator, a power supply, a header for the programmer, the programmer, an LED, lots of wires, and of course, the microcontroller.
  2. Wire up the regulator to your power supply so you have a nice steady 5V.
  3. Look up and print out or draw the pin diagram for the ATMega8. Here it is:
  4. Determine which pins from the ATMega8 connect to which pins on the header, and wire it up.
  5. Don't forget the LED.
  6. On your computer, set up avr-gcc and avrdude.
  7. Get a program for the blinky thing and preferably a Makefile. You don't necessarily have to write this one either, but you do have to locate it.
  8. Make it.
  9. Connect the header to your computer.
  10. Try to upload the code.
  11. If it fails, go back and figure out where things went wrong. A second pair of eyes often helps here.
  12. Admire the blinky light.
This is not to say programming a microcontroller is hard or inaccessible - it's not, especially in the company of helpful and more experienced people. It did make me appreciate the relative simplicity of the Arduino, but not everybody can spring for an Arduino at ~$30 a shot. ATMega chips are cheap. You can potentially do a lot for a lot less money once you've jumped this hurdle (the programmers are a non-trivial initial cost, but with a group you can share resources). Also, the Arduino is more a prototyping tool, so often once you've figured something out with the Arduino, you can build a more permanent version using an ATMega chip programmed in this manner.

Having identified a space and had a successful hands-on meeting, things are really taking off for the Bloomington hackerspace. Perhaps some fantastic product or great robot will be born there, or a Jobs/Wozniak style partnership will form. Whatever happens, hackerspaces are a really great idea, Bloomington is lucky to have one, and I look forward to future meetings.

Some Resources:

Tutorial on AVR programming with pictures at the Physical Computing at ITP site.

Arduino Home Page

Sparkfun is a good source of electronics

Adafruit is a good source for kits (including the MeggyJr)

Sunday, July 26, 2009

Hay-O! NSFW comic strip re: Key-Value stores

How do I query the database? - Click thru for the comic - uses the F-word, so if that offends you, now's the time to f-off.

How do I query the database?

This is from the highscalability site, which has been doing a fine job of chronicling the revolution, such as it is.

Sunday, July 5, 2009

When you know what you're after and you have to scale: distributed key-value stores

Last time we talked about key/value databases like Berkeley DB and more recent variants like Tokyo Cabinet (and its Tokyo Tyrant server). On the one (very small) end, these can make for really nice in-process datastores. On the other end, for very very large databases with high performance and scaling needs, these key-value stores have found a niche as well.

Do you really need mad scaling capability?

It's probably not the case that your political blog needs to scale to serve millions of users, even if you have some readers who are not your relatives or friends, which would already place your blog in the top 10% of blogs. Additionally, although 'we're not building Amazon.com' is sometimes used as an excuse for neglecting performance considerations, it is really safe to say you are not building Amazon.com. Your needs are likely different. Ted Dziuba has covered this nicely in his blog post: 'I'm going to scale my foot up your ass'.

So why bother talking about it at all, then?

Academic interest. The pursuit of general well-roundedness. Boredom with writing for the more loosey-goosey blog mentioned in my intro post. Perhaps I'm turning into one of those clowns I knew 10 years ago who was always installing the latest version of Red Hat but never actually using it. I hope not. Also, who knows, you aren't building Amazon.com, but you may end up at a place using these things some day, or more likely, you may be using such technology via the much beloved 'cloud'.

Amazon's Dynamo


The people who actually were building Amazon.com came up with a data storage system called Dynamo. If you'd like to read a paper describing it, from the source, you can download the pdf.

According to the paper, many of Amazon's services work just fine with a key-value store and don't necessarily need a relational database schema or a flexible, declarative rather than imperative query language like SQL. This makes sense as Amazon's system is doing things like looking up products by key, storing shopping carts according to keys for customers, etc and so on. The really big goal, again, is scaling and efficiency, and according to the paper, systems with ACID properties tend to suffer where availability is concerned (thanks to locking needs and so forth - the thought of handling locks across hundreds or thousands of commodity servers in multiple data centers is headache-inducing). The CAP theorem says that out of consistency, availability, and partition tolerance, you get to pick at most two. Amazon trades consistency for availability and reliability (resilience in the face of node failure) where needed.

Eventual Consistency

Consistency is not completely thrown out the window. Dynamo has a property called 'Eventual Consistency'. All updates 'eventually' reach all nodes. High write availability is a goal of the system, which is easy enough to understand - forcing a user to wait until all nodes know that 'Infinite Jest' has been added to the shopping cart would be a good way to lose customers. There will be inconsistencies, but resolution of the inconsistencies is pushed off for reads to handle.

The next question to address is whether the data store or the application is supposed to handle resolution. The author of an application (ideally) has knowledge of the data and how it is used, and would be perhaps best qualified to determine a resolution strategy. On the other hand, this sort of micro-management is a recipe for inconsistency across applications and may not be the best use of developer time.

Dynamo stamps updates with something called a vector clock (while it'd be nice to stamp updates with a time stamp from a universal clock, unfortunately a universal clock doesn't exist in practice). The vector clocks can be used to determine which update is the latest. This is discussed in more detail in a post on Paper Trail. To cram the idea into a sentence, the dimensions of the vector are the servers, and the values of the dimensions are from an internal clock (or event counter) maintained by the server in question.

Usually this is sufficient to determine which version is the latest, and this is referred to as syntactic reconciliation. When this can't be done, and the application has to handle it (for example, the contents of 2 versions of a cart might be merged), it's called semantic reconciliation. Feel free to use those terms to show off, but whatever you do, don't spell it 'symantec reconciliation', or people will laugh at you behind your back, much like we laughed at the pompous consultant who did that in one of his masterful 'roadmaps'.
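To make the vector clock idea concrete, here's a minimal sketch in plain Python (nothing from Dynamo itself) of how two versions get compared: if one clock dominates the other, syntactic reconciliation picks the winner; if neither does, the conflict gets handed to the application:

def descends_from(a, b):
    # True if vector clock a has seen everything vector clock b has.
    return all(a.get(node, 0) >= counter for node, counter in b.items())

def reconcile(clock1, clock2):
    if descends_from(clock1, clock2):
        return "version 1 wins (syntactic reconciliation)"
    if descends_from(clock2, clock1):
        return "version 2 wins (syntactic reconciliation)"
    return "conflict: hand both versions to the app (semantic reconciliation)"

# Servers A and B each applied an update the other hasn't seen yet:
print reconcile({"A": 2, "B": 1}, {"A": 1, "B": 2})   # conflict
# Here the second version is just an older ancestor of the first:
print reconcile({"A": 3, "B": 2}, {"A": 1, "B": 2})   # version 1 wins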

Reliability

One of the goals was to develop a system where nodes can be added as needed and failures can be handled smoothly. Also, the possibility of having nodes of various power needed to be factored in. To that end, an Amazon-ized version of consistent hashing (a really good write-up on consistent hashing by somebody who's used it in the field may be found here) is used to partition the system, and instead of considering all nodes equal, virtual nodes are used - more powerful servers can be responsible for more virtual nodes.

This being a key/value data store, and there being a hash function that maps the key to a 128-bit value, when we say 'partition the system', we're partitioning the key space. Each key gets a 'preference list' of virtual nodes responsible for it, and its data is replicated across the first N of those nodes (N is a parameter configured per instance); the preference list actually contains more than N nodes, to account for failures. Puts and gets are handled by the first of the top N nodes in the preference list. To maintain consistency, at least R nodes have to participate in a read, and at least W nodes in a write. If R + W > N, congratulations, you have a quorum-like system (Dynamo's is a 'sloppy' quorum, since it uses the first N healthy nodes rather than a fixed set). R, W, and N are user configurable, with a typical configuration being N = 3, R = 2, W = 2.
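Here's a toy sketch of the partitioning idea: virtual nodes scattered around a hash ring, with the first N distinct servers clockwise from a key's position forming the top of its preference list. The server names, N = 3, and 8 virtual nodes per server are purely for illustration (MD5 as the ring hash is the one detail borrowed from the paper):

import hashlib
from bisect import bisect

def ring_position(s):
    # Map a string to a point on the ring using MD5.
    return int(hashlib.md5(s).hexdigest(), 16)

servers = ["node-a", "node-b", "node-c", "node-d"]
VNODES_PER_SERVER = 8   # a beefier server would simply get more virtual nodes
N = 3                   # replication factor

ring = sorted((ring_position("%s#%d" % (s, i)), s)
              for s in servers for i in range(VNODES_PER_SERVER))

def preference_list(key):
    # Walk clockwise from the key's position, collecting N distinct servers.
    positions = [pos for pos, _ in ring]
    start = bisect(positions, ring_position(key))
    result = []
    for i in range(len(ring)):
        server = ring[(start + i) % len(ring)][1]
        if server not in result:
            result.append(server)
        if len(result) == N:
            break
    return result

print preference_list("shopping-cart:customer-42")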

In Amazon's case, the data is replicated across datacenters, for big-time fault tolerance. So again, a Mom and Pop shop is probably not going to have multiple datacenters, but even a Mom and Pop shop will be able to enjoy this kind of availability as cloud computing becomes more available and mainstream. Unfortunately this deprives the IT guy of the opportunity to swing his dick around bragging about his datacenters. You win some, you lose some.

Where does Berkeley DB fit in, again?

According to the paper, different storage engines can be plugged in to this framework. The engines include Berkeley Database (BDB) Transactional Data Store, BDB Java Edition, MySQL, and an in-memory buffer with persistent backing store. Berkeley DB Transactional Data Store is the most common. Dynamo-esque systems like LinkedIn's Voldemort also provide this sort of plug-in framework.

What about security?

It's assumed all of this is happening within an already secure, trusted environment. Only systems inside of Amazon are accessing this directly.

So that's it?

Obviously that's not all there is to Dynamo, and even the paper is an overview. It is a nice jumping off point when considering related #nosql options, especially considering that there are a number of Dynamo-like systems out there.

A shortcoming of these systems is that they're not particularly good for reporting applications or ad-hoc querying. This is fine and dandy, since that's not primarily what they're designed to do, but in the future we'll talk about MapReduce and other tools used to that end.

That is all for now.

Sunday, June 7, 2009

When You Know Exactly What You're After: Key-Value Databases

In the beginning...

Back in 1979, Ken Thompson, a programming deity responsible for Unix (and, with Dennis Ritchie, one of the original Unix 'graybeards') and the B language (predecessor to the more familiar C), wrote a program named, in typical UNIX terseness, 'dbm'. This was short for 'database manager', which was a nice thing to have on a computer then and now. Essentially, it stored data identified by keys in a file. This sort of data structure is known in Python circles as a dictionary, to Perl programmers as (way back when) an associative array or (more recently) a hash, and in Java and its smarter little brother C# as a Map. It makes for a better way of storing or persisting data (and accessing it) than using a generic flat-file approach. What the user doesn't have is a nice declarative interface for pulling data out via queries specifying conditions of interest (date ranges, department, and yadda yadda yadda). If you don't know the key, tough luck.
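In Python 2, dbm's descendants are still just an import away, and the 'know the key or tough luck' flavor comes through even in a tiny example (the file name here is arbitrary):

import anydbm  # picks whichever dbm flavor is available on your system

db = anydbm.open("example.db", "c")   # "c" means create the file if needed
db["favorite-band"] = "The Gap Band"

# If you know the key, you're golden...
print db["favorite-band"]

# ...and if you don't, there's no query language; all you can do is walk the keys.
for key in db.keys():
    print key, "=>", db[key]

db.close()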

DBM is still here 30 years later, and by here I mean here on my Mac. If I type 'man dbm' at the prompt, I get the man page for it, and so will you, if you are a Mac or Linux user. Although some BSD code made its way into Microsoft code, dbm didn't, so Windows users won't have it. Not that that is a terrible loss in this day and age.

Let a Thousand dbm Jrs. Bloom

As is often the case, dbm spawned off a number of successors, a notable one being the Berkeley Database created at Berkeley as part of the 1986-1994 effort to move from BSD 4.3 to 4.4, and also to move away from AT&T code. In 1996 Netscape asked for some extensions for their needs (for an LDAP server and for their browser), which resulted in the formation of the Sleepycat software company, who maintained and sold the embedded database through 2006, when they were acquired by Oracle (recent acquirers of Sun, and, with it, another well-known and popular database, MySQL). A visit to the official site these days reveals almost nothing useful, in addition to being stripped of the colorful name (it's now 'Oracle Berkeley DB', and the site is the sort of scrubbed marketing enterprisese Oracle does really well). It is still a viable embedded (no server) database, used by, among other tools, OpenLDAP and the source control tool Subversion (the open-source successor to the Netscape browser, Firefox, now uses SQLite to store bookmarks, history, and so on).

Berkeley DB, in addition to being small (the library is about 300K), fast, ACID compliant, and scalable, gives the user 4 storage options: Hash (as with the old-style dbm, for 'I know what I'm looking for' storage), B-Tree (good for searching ranges), persistent Queue, and RecNo (record numbers are the keys, and these are generated automatically). Also interesting is that a developer has the option of turning features like locking or transactions off or on as needed. In general, the prevailing value seems to be giving a developer control over a datastore without a lot of administration overhead, allowing the developer to optimize data access or shoot himself in the foot, depending on the choices she makes (I mixed up the developer's gender there. It happens.)
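The B-Tree flavor is the one that buys you range searches: keys come back in sorted order, and you can position a cursor at a key (or the next one after it). A minimal sketch using Python 2's old bsddb module, with a made-up file name and made-up hire dates as keys:

import bsddb

db = bsddb.btopen("employees-by-hiredate.db", "c")
db["2009-01-15"] = "Jennette T."
db["2009-03-02"] = "Nathan H."
db["2009-08-20"] = "Will"

# set_location() puts the cursor at the given key, or at the next key in
# sorted order if it doesn't exist -- handy for "everyone hired after X".
key, value = db.set_location("2009-02-01")
while key < "2009-09-01":
    print key, value
    try:
        key, value = db.next()
    except (bsddb.error, KeyError):   # no more records
        break

db.close()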

One interesting thing I discovered in my investigations of Berkeley DB is the way joins are handled. Joins are very familiar to relational database users, but it's not obvious how one would join tables in a key-value store. First off, Berkeley DB has this thing called the 'secondary index', which gives one a way of accessing data using something other than the key (removing that limitation). For example, suppose you have an employee database, and you want to find employees in a given department. Rather than going through every record in the database, you'd now have the option of looking up employees according to department.

Using secondary indices, it's possible to join tables. In the best Berkeley DB reference I found, that's done like this (the example below is lifted from that source, so credit where credit is due):

Consider the following three databases:

personnel
  • key = SSN
  • data = record containing name, address, phone number, job title
lastname
  • key = lastname
  • data = SSN
jobs
  • key = job title
  • data = SSN

Consider the following query:

Return the personnel records of all people named smith with the job
title manager.

This query finds all the records in the primary database (personnel) for which the criteria lastname=smith and job title=manager are both true.

Assume that all databases have been properly opened and have the handles: pers_db, name_db, job_db. We also assume that we have an active transaction to which the handle txn refers.

DBC *name_curs, *job_curs, *join_curs;
DBC *carray[3];
DBT key, data;
int ret, tret;

name_curs = NULL;
job_curs = NULL;
memset(&key, 0, sizeof(key));
memset(&data, 0, sizeof(data));

if ((ret =
    name_db->cursor(name_db, txn, &name_curs, 0)) != 0)
        goto err;
key.data = "smith";
key.size = sizeof("smith");
if ((ret =
    name_curs->c_get(name_curs, &key, &data, DB_SET)) != 0)
        goto err;

if ((ret = job_db->cursor(job_db, txn, &job_curs, 0)) != 0)
        goto err;
key.data = "manager";
key.size = sizeof("manager");
if ((ret =
    job_curs->c_get(job_curs, &key, &data, DB_SET)) != 0)
        goto err;

carray[0] = name_curs;
carray[1] = job_curs;
carray[2] = NULL;

if ((ret =
    pers_db->join(pers_db, carray, &join_curs, 0)) != 0)
        goto err;
while ((ret =
    join_curs->c_get(join_curs, &key, &data, 0)) == 0) {
        /* Process record returned in key/data. */
}

/*
 * If we exited the loop because we ran out of records,
 * then it has completed successfully.
 */
if (ret == DB_NOTFOUND)
        ret = 0;

err:
if (join_curs != NULL &&
    (tret = join_curs->c_close(join_curs)) != 0 && ret == 0)
        ret = tret;
if (name_curs != NULL &&
    (tret = name_curs->c_close(name_curs)) != 0 && ret == 0)
        ret = tret;
if (job_curs != NULL &&
    (tret = job_curs->c_close(job_curs)) != 0 && ret == 0)
        ret = tret;

return (ret);


Simple, right? This brings up another point: all the major scripting languages have modules or libraries allowing a developer to accomplish this in a much briefer and more straightforward manner. Life is too short to program in C when you don't need to.
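For instance, here is roughly the same 'smith the manager' lookup sketched with Python 2's bsddb module. To keep it dead simple it maintains the two secondary indexes by hand and does the 'join' as a set intersection, rather than using BDB's own secondary-index and join machinery, so treat it as an illustration of the idea (with invented sample data), not a faithful translation of the C above:

import bsddb

personnel = bsddb.hashopen("personnel.db", "c")   # key = SSN
lastname  = bsddb.btopen("lastname.db", "c")      # key = "lastname:SSN"
jobs      = bsddb.btopen("jobs.db", "c")          # key = "jobtitle:SSN"

def add_employee(ssn, name, last, title):
    personnel[ssn] = "%s|%s" % (name, title)
    lastname["%s:%s" % (last, ssn)] = ssn
    jobs["%s:%s" % (title, ssn)] = ssn

add_employee("111-22-3333", "Alice Smith", "smith", "manager")
add_employee("444-55-6666", "Bob Smith", "smith", "peon")
add_employee("777-88-9999", "Carol Jones", "jones", "manager")

def ssns_matching(index, prefix):
    # Brute force for brevity; a real version would do a prefix scan
    # or use BDB's secondary indices.
    return set(index[k] for k in index.keys() if k.startswith(prefix + ":"))

# The "join": intersect the SSNs found in both secondary indexes.
for ssn in ssns_matching(lastname, "smith") & ssns_matching(jobs, "manager"):
    print ssn, personnel[ssn]

for db in (personnel, lastname, jobs):
    db.close()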

A Product Of Fine Japanese Engineering

Berkeley DB, like all successful software packages, has a number of offshoots, and a particularly interesting one is Tokyo Cabinet, which can be an embedded database, or one can use the Tokyo Tyrant server for remote access. Also available: Tokyo Dystopia for full text search. Watch out for future releases, Tokyo Epidemic, Tokyo Martial Law, and Tokyo Zombie Apocalypse.

Tokyo Cabinet was written by Mikio Hirabayashi whilst working for mixi, Inc. (apparently, it's like Facebook). Hirabayashi is also known for QDBM (yes, part of the DBM family). TC takes some of the ideas in Berkeley DB and pushes them further. Whereas your dbms and Berkeley DBs are primarily APIs for accessing local files, TC ups the ante with the aforementioned Tyrant for remote access (also available: replication, transaction logs, and hot backup). There are also a number of options for your underlying storage (as with Berkeley DB): Hash, B-Tree, Fixed-Length, and an option distinguishing it from Berkeley DB: the Table Engine, which mimics an RDBMS, but in a schema-less way. It is somewhat like CouchDB, which we'll eventually get to. The developer can declare indexes on columns, and query the Table in a SQLesque manner. While you wouldn't want to use it in a situation where a true RDBMS is what you need, this is definitely handy.
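One pleasant side effect of Tyrant speaking a memcached-compatible protocol (alongside its own binary protocol and HTTP) is that you can poke at it from Python with no Tokyo-specific bindings at all. A minimal sketch, assuming a Tyrant instance running locally on the default port 1978 and the python-memcached package installed (only basic get/set style operations are available this way):

import memcache  # the python-memcached package

# Tokyo Tyrant listens on 1978 by default and auto-detects the protocol,
# so a plain memcached client can get and set values.
tt = memcache.Client(["127.0.0.1:1978"])

tt.set("greeting", "Hello from the land of Tokyo Cabinet")
print tt.get("greeting")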

Tokyo Cabinet has developed a loyal following, who praise its speed, flexibility, and ease of use. It also has its detractors. There are bindings for numerous languages, but C, Perl, Ruby, Java and Lua appear to be best: in research I found people complaining that the pytc module for Python lacks support for some features.

As Tokyo Cabinet has developed some momentum, is fairly well established, and is interesting, I plan to dig into it further. Still, in the next installment, we will cover other databases.

Some fun Tokyo Cabinet links:

The Online Manual

Another Person's Notes on Distributed Key Stores

The Sourceforge Project Page

Wednesday, May 20, 2009

Up next: a survey of the 9 bazillion databases in the world

Something's Happening But You Don't Know What It Is, Do you, Mr. Jones?

High on the list of a technology professional's worst fears is becoming out of touch with technology and being left behind. Like death, it's going to happen one day no matter what you do, but as with death, one can 'rage against the dying of the light' and make every effort to stave it off as long as possible.

I have been a database guy for a number of years, either as a developer using the database, or one of the guys designing the database, or even sometimes setting up and acting as an administrator for database servers. Pretty much every database system I've ever dealt with was an RDBMS, in other words, it followed the relational model outlined by E.F. Codd in 1970, and misunderstood by the majority of computer type professionals ever since.

Monstrously large companies like Amazon and Google with monstrously large data sets and extreme scaling needs have in recent years run up against the 'ceiling' of what RDBMS's can do, and in some cases they are willing to trade off some of the letters in ACID to meet other needs. Their work is starting to make its way into the outside world: anybody can read the BigTable paper, Google has opened up Google App Engine (as mentioned in previous entries), and implementations like the open-source Hadoop have made BigTable-like storage available to anyone with the intellectual and budgetary resources to take it on.

At the other extreme are the very tiny and lightweight databases like SQLite, for cases where a full-blown database system is overkill, but the option of a more-or-less standard interface to your data is still great to have.
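SQLite is a nice illustration of just how little ceremony that standard interface can require; a quick sketch using the sqlite3 module that ships with Python 2.5 and later (the table and rows are made up, obviously):

import sqlite3

conn = sqlite3.connect(":memory:")   # or a file path for a real database
cur = conn.cursor()

cur.execute("CREATE TABLE posts (posted_on TEXT, title TEXT)")
cur.execute("INSERT INTO posts VALUES (?, ?)",
            ("2009-05-20", "Up next: a survey of the 9 bazillion databases"))
conn.commit()

cur.execute("SELECT title FROM posts WHERE posted_on >= ?", ("2009-01-01",))
for (title,) in cur.fetchall():
    print title

conn.close()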

So You Put 2 and 2 together, and got 5?

In future entries I intend to do more of a deep dive into these alternative options to boring old suit-and-tie, plastic-fantastic-wall-street-scene databases like SQL Server, Oracle, and DB2. But first a few cautionary words.

To predict the death of the relational model is likely premature and ill-advised as well. As I mentioned earlier, and have observed over the past 15 or so years, many (almost all?) IT Professionals manage to go from 'hello world' to the executive suite without knowing or understanding much of anything about the databases that, you know, store and access the data. The thing distinguishing your business from every other business in the world (although maybe you also put cool stickers on your servers). A programmer might learn to regurgitate definitions of first, second and third normal form, and know the difference between an inner and outer join, and 'Congratulations! You got the job, kid.' (it worked for me at least once in my more ignorant days). A lot of times, 'I'm running into limitations in the relational model' can be translated as 'Huh. I didn't know you could do it that way.'

Outside the walls of the IT ivory basement, the situation is even worse. The not-particularly-rigorous but quite prolific writer Robert X. Cringely recently predicted the death of SQL in this column.

There are some cringe-worthy howlers within:
  • It's SQL, not Sequel (but maybe he was just being clever)
  • SQL is the language, not the database (and hard-core relational theorists will talk yr ear off about where SQL diverges from the relational model, if you aren't careful)
  • Given the fumbling, Keystone Kops way a lot of shops handle the 'you install it, turn it on, and it runs itself' SQL Server, do you really want to turn something as game-changing as BigTable loose on them? That'd be like teaching a kindergarten class how to make Greek Fire.
All that said, going outside the playground perimeter of your day job and trying to learn completely new things is a good way to keep your brain from freezing, and this stuff is just interesting (to me) anyway. So next time, more about these new (or not so new) database systems that will preserve our way of life, eliminate the need for people to do tedious work of any kind, and bring about Ray Kurzweil's fabled Singularity.

As @iamdiddy would say: LET'S GOOOOOOOOO!!!1!

Monday, May 4, 2009

I guarantee you: no more music by the suckas.

As mentioned earlier, I am a fan of blip.fm. Right now I have 364 listeners, which is not shabby but not necessarily fantastic, either (I'm this guy). Blip makes it easy to add people you follow: specifically, if you blip The Gap Band, you're presented with a list of 5 or so people who also blipped The Gap Band, and with one click you can add them all.

This is not necessarily a bad idea, but these people add up fast, and horrendous musical whiplash can result as you find out somebody who blipped Sigur Ros once really really loves an abomination like 80s era Aerosmith. Even worse, a lot of these people won't be bothered to reciprocate: they'll have 5,000 listeners, but only 74 favorites. Perhaps they're discerning, but it just seems rude to me, and annoying.

Anyhow, I had accumulated 500+ favorite DJs, and many of them were deadbeats who weren't reciprocating. So with the handy blip.fm API, I wiped them all out like in that revenge montage from the first Godfather movie. Very, very minimal Python was required:
import time

# BlipConnection is the Python port of Blip's sample PHP API wrapper
# mentioned in an earlier post; import it from wherever you keep it.

def getNoRecip(bconn, username, length):
    lisses = bconn.user_getListeners(username, 0, length)
    faves = bconn.user_getFavoriteDJs(username, 0, length)
    lisNames = [l["urlName"] for l in lisses]
    favNames = [f["urlName"] for f in faves]
    # DJs I favorited who don't list me as a listener...
    noRecip = [f for f in favNames if f not in lisNames]
    noRecip.sort()
    # ...and listeners of mine I haven't favorited back.
    youNoRecip = [l for l in lisNames if l not in favNames]
    youNoRecip.sort()
    return (noRecip, youNoRecip)

if __name__ == "__main__":
    deaders = open("killed.log", "a")
    deaders.write("Getting ready to wax some chumps.\n")
    # need REAL password
    blipConn = BlipConnection(username='SoundSystemSDC',
                              password='********')
    (dudes, youDiss) = getNoRecip(blipConn,
                                  'SoundSystemSDC',
                                  1000)
    for dude in dudes:
        print "Killin' wack punk:\t%s\n" % dude
        blipConn.favorite_removeDJ(dude)
        time.sleep(5)
        deaders.write("Killed: %s\n" % dude)
    print "That's %i shifty dudes." % len(dudes)
    deaders.write("killed %s all told." % len(dudes))
    deaders.close()
I put the sleep(5) in there to be nice, although it really wasn't that many requests all told.

It was quick and dirty, and kind of handy. Now it'd be nice to have a Greasemonkey plugin to filter out Blips from Aerosmith (and other unfavorite artists).

Thursday, April 16, 2009

Ignite Bloomington Recap (4-16-2009)

Ignite is a series of events sponsored by O'Reilly, the people that bring us the wonderful tech books with the animals on the cover. The idea behind it is simple: you have 5 minutes, 15 seconds per slide. There have been events in cities like Seattle, Portland, Paris, NYC, and tonight, Bloomington hosted its Ignite event at the Convention Center (the same place where the Chocolate Festival is held). As a loyal Bloomingtonian, I was confident my town would represent itself well, and it did. Here's a quick recap.

1. Scotch Whisky by Jenn Hileman

Jenn gave the audience an outline on the history of Scotch Whisky, the varieties, the various regions where it's produced, and recommendations on how best to enjoy it.

2. My 12-Step Recovery from Corporate Communications by Christian Briggs

Christian recounted the sordid story of his slide into Corporate Communications, from gateway drugs like Jack Welch's books, to harder stuff by Goebbels and Frederick Winslow Taylor. Eventually he came to realize he had a problem and needed to let go and let God, or let Twitter, in this case.

3. Kshitiz Anand on Research Strategy: Design For Social Impact

Kshitiz shared the lessons he learned from research in rural India. The slides were pictures of people from rural India and were the most interesting slides of the show, as they were the least 'Powerpointy'.

4. Mark Krenz on Bloomingpedia

Mark's presentation was both an introduction to Bloomingpedia and a call to action. Mark talked about Bloomingpedia's growth and success so far, and invited all in attendance to register for accounts and contribute by adding or editing articles, and adding more photographs.

(5 minute break for snacks, beer, whatever)

5. Kevin Makice on Mashups

Kevin calls on local government, newspapers, and other sources of data to set that data free, that it may be mashed up, sliced and diced, visualized and analyzed. Also, June is 'Mashup Month', and again the audience was called upon to act: to go forth and create amazing mashups that reveal interesting facts about our town and community.

6. Tall Steve Volan - 'What have you got to say for yourself?'

Steve presented, in rhyme, an idea for a new athletic event: the Nonathlon. Competitors make 9 speeches, each 3 minutes in length, to an audience of at least 3. The speeches are speeches any human being leading a full life will be called upon to give at some point: a toast, a eulogy, a song, a poem, several others I can't remember at this time, and if I were more clever, this section would rhyme.

7. Graffiti, by a guy from Sproutbox

He is not a graffiti artist, but he loves the art form. Lots of people love graffiti, aka the fun crime. He gave us a brief history going back to hieroglyphics, and explained that the 'Kilroy' of 'Kilroy was here' fame worked in bomb QA during WWII (Kilroy was here = this town was bombed). In an entertaining audience participation bit, he invited us to decipher several tags - almost nobody could. Sproutbox had a really nice tag designed by a local artist.

8. Using technology for Social Justice by Geoff

Another presentation that both informed and called the interested to action: Geoff proposes using recycled computers and free software (Ubuntu Studio) to build a media lab in Detroit. This will happen over 3 days as part of the 'Allied Media Conference' in July. He also gave examples of effective uses of technology in the interest of Social Justice like elcilantro.org, a site focusing on immigration rights and cleaner air and space.

9. Trotzke of Sproutbox on 'Bursting the Bubble'

Trotzke has founded 2 startups at times coinciding with the economy going into the toilet. He draws inspiration from the words of the Wilson Phillips hit 'Hold On' - 'things'll go your way if you hold on for one more day'. He believes it's a great time to start a startup (there is a lot of great talent available) and 'called bullshit' on the concept of non-paying customers (it needed to be called bullshit on - Trotzke classifies the non-payers as 'prospects' - a more accurate label).

That was that for the official program. Several really great O'Reilly books were given away, including Twitter API: Up and Running by Bloomington's own Kevin Makice (also a presenter as alert readers will remember). Unfortunately, none of the books were given away to ME. An informal question and answer session followed.

All in all, the event was a success. 'Ignite' is a great concept - if it visits your town, check it out.

Friday, March 20, 2009

ircontroller - super simple module for using the MacBook remote from Python

In the event you have a MacBook, and would like to write a Python program that you can control via the nice little remote controllers that come with the MacBook, you might find ircontroller handy.

It also gave me a chance to put something up on Google code. I had worked with SourceForge way back in the day, and being a rather Google-centric guy of late, Google code seemed like the place to go.

You'd think there would already be a module like this about, but in my searching the best thing I could find was the code for the iremoted 'daemon'. Converting it to a Python extension was relatively straightforward. The only part that caused pain was determining how to get distutils to link in the IOKit and Carbon frameworks. I found a blog post by somebody who'd already dealt with that pain, so that was the end of that.

Anyhow the setup.py file is super simple, and (once you know the quirks) way preferable to mucking about with a makefile.
from distutils.core import setup, Extension

setup(name='ircontroller',
      version='1.0',
      ext_modules=[Extension('ircontroller', ['ircontroller.c'],
                             extra_link_args=['-framework', 'IOKit',
                                              '-framework', 'Carbon'])])

Needless to say, this won't build on Windows (or Linux either, or Plan 9, or Solaris, or whatever).

Here ya go.

Sunday, February 15, 2009

Google App Engine 2, Conclusion of Foray #1 into the cloud.

In the previous post I mentioned several limitations of Google App Engine. In the time that's passed since then (less than a week), Google has announced that several of these restrictions are no more with the release of 1.1.9. Specifically:
  • It is now permissible to use urllib, urllib2 or httplib to make HTTP requests. (Previously users were restricted to urlfetch. Python programmers will be familiar with urllib, urllib2, etc., and will welcome this: they won't have to revise modules that use urllib2, as I did with my Blip API wrapper.)
  • The dreaded 10-second deadline for a request has been expanded to 30 seconds. While it's still not good form (actually, it's horrible form) to keep a user waiting for 30 seconds, this prevents errors if a website or API you are querying behind the scenes is slow.
  • No more 'high CPU request' warnings. Note that just as George Carlin once observed that buying a 'safe car' doesn't excuse you from the responsibility of learning how to drive, it's also true that this is not Google's way of saying 'to hell with everything, write wasteful code now'.
  • The old 1MB limit on requests and responses was raised to 10MB
The take-away point here is that Google listens to user feedback (up to a point: Ruby/PHP/etc users can still suck it, as far as Google App Engine goes), which is encouraging to those investing time and effort in learning the platform.

Unfortunately I ran into some other issues with my application. While my restructuring (using naive, wrote-it-myself Javascript, because I wasn't yet familiar with jQuery or the like) led to something much more robust, and the addition of a simple progress bar made waiting for Blip to respond more tolerable, in testing I ran into an issue with the API failing on a call to pull back certain users' 'blips'. Further investigation revealed a 500 error was being returned from the Blip API due to certain characters being present in the string for the blip.

The good people of Blip were, as has always been my experience, quick to respond, and a fix is on the way, but it's not in place yet. As I mentioned in the previous post, the API is still private beta, so this is more a 'shame on me' matter, but as also mentioned in the previous post, this exercise is mostly an excuse to play around with Python and Google App Engine in order to learn more about it and generally 'keep brain from freezing', and as far as that goes, success was had. We'll re-visit it once Blip.fm has a fix in place.

For now, some good resources I've found for learning about the Google App Engine follow.

Web App Wednesday - Michael Bernstein puts out a new web app, plus the code, every week.

Giftag - BestBuy used Google App Engine to put together Giftag, a gift registry add-on for Firefox and Explorer. The blog is a good source of GAE info.

App Engine Fan - This guy has been experimenting with GAE since it was first released, recording the results of his efforts in this blog.

App Engine Samples - code samples from Google itself.

Monday, February 9, 2009

Google App Engine, or How I learned to Stop Worrying and Love Javascript (Part I)

It has been over a week since my last post, so things are not exactly getting off to a rip-roaring start here. Rather than dwell on the past, though, here's an entry on some experimenting I've done recently with Google App Engine.

There is much hype about 'the cloud', and like a lot of hype, most of it is not necessarily worth the paper it is or isn't printed on. However, as a programmer and not a computer guy (to some of you, this will make sense), the idea of abstracting away the server provisioning process is not without appeal. As a tightwad, the idea of 'only paying for what you use' is not without appeal. Finally, as an extreme tightwad, the idea of free (which Google App Engine is, up to 10 apps) sold me. I'm not starting a business here, I'm just getting my feet wet.

As far as what to do, I am a huge fan of the website blip.fm, (the twitter-length pitch:'It's like twitter, but for music') and not long ago they released an API. It is currently in private beta, where it has been for a while now. At any rate, I got my keys, re-wrote the sample PHP wrapper for the API in Python, and I was ready to go. At least I thought I was.

At this point I would like to praise the Google App Engine Launcher for MacOSX (and, indirectly, praise MacOSX). This was put together by John Grabowski at Google in his 20% time, so apparently 20% time is not a Google urban legend, unlike the one about how Sergey or Larry will sometimes give an underperformer a brand new Prius under the sole condition that they 'drive away, far away'. In this case using the Launcher simplified the process of getting something up and running using the development server quickly. It's very intuitive, and the interactive console is handy.

Not so fast, pal

After a bit of hacking I had something fairly simple up and running which would grab info for a user via the Blip API, then show the user via an intensity map what the global distribution of 'listeners' and 'favorites' is. For this I used the lovely Visualization API from Google. It also showed who you are following that is not following you, which is essential info not easily obtainable via Blip's website.

The problem I encountered off the bat was that the queries against the API take time, and there really isn't a good way to get just the subset of info you want at this time. On the development server, requests could take a while, especially for users with thousands of listeners (such people exist).

Curious as to how I'd fare on the real thing, I deployed (a one-click operation w/ the GAE Launcher) the app to appspot.com. At this point all hell broke loose.

Read the fine print

It turns out, amongst the other limitations of GAE (for more Beavis and Butthead immature laffs, check out the Google App Engine Backup and Restore tool, aka GAEBAR) you may have heard about (no background/batch processes, no mischief with sockets, etc), there are a couple of tight limitations I should have looked into before deploying:
  • If a call to urlfetch takes more than 5 seconds, you lose, it times out.
  • If your request takes more than 10 seconds, you lose: 'DeadlineExceededError'.
(It is worth noting here that according to this release from Google today, some of these limitations will vanish in the next 6 months).

In addition, there is a limit to the CPU that can be consumed by a single request. So even in the event you can fetch your data quickly, if you do too much crunching per request, you are going to violate a quota and start getting errors (the specifics of this limit: a 'high CPU request' would be one consuming 0.84 CPU seconds, with the CPU in this case being a 1.2 GHz Intel x86. You are allowed 2 of these per minute).

Thus, I needed to factor in that there'd be a lot of retries going on, but I would not be doing those retries within a request. The inescapable conclusion was that I'd have to use the back end as a simple data store, and rely on Javascript on the front end to handle the retries, putting the pieces together, boiling down the data, and pumping out the results. Also, I'd obviously need a progress indicator to keep the poor end-user updated as things proceeded, rather than leave them hanging.
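The back-end piece of that arrangement ends up being almost embarrassingly small. Here's a rough sketch of the idea using the 2009-era webapp framework; the URL being fetched and its parameters are placeholders (the real calls go through my Blip API wrapper), so this shows the shape of the thing rather than working Blip code:

from google.appengine.ext import webapp
from google.appengine.ext.webapp.util import run_wsgi_app
from google.appengine.api import urlfetch

class ListenerChunk(webapp.RequestHandler):
    def get(self):
        # Fetch one small page of results per request and hand it straight
        # back; the Javascript front end calls this repeatedly, handles
        # retries, and stitches the pieces together.
        username = self.request.get("user")
        offset = self.request.get("offset", "0")
        url = ("http://api.example.com/user/getListeners.json"
               "?username=%s&offset=%s" % (username, offset))
        result = urlfetch.fetch(url)
        self.response.headers["Content-Type"] = "application/json"
        self.response.out.write(result.content)

application = webapp.WSGIApplication([("/listeners", ListenerChunk)], debug=True)

def main():
    run_wsgi_app(application)

if __name__ == "__main__":
    main()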

At this point I will leave you, the reader, hanging, until next time when we get into the Javascript side of things in Part II.

Saturday, January 31, 2009

Hello, World

This is the introductory, statement-of-purpose post for 'Steve's House Of Logic'. I am your host, Steve. I have been working in and around software development for 14 years, and I have been blogging for 7 years or so. The problem with my old blog is that it was unfocused, ranging from posts about Python to CD reviews to commentary on amusing cat videos on YouTube and even occasional forays into the gutter that is political blogging. I realized I needed a blog that focused on the tech side of my existence, and this is it.

My tech interests include but are not limited to these topics, all of which I hope to get to in time. A goal of this blog is to post with relative frequency, but to try to avoid content free fluff posts and mindless link propagation.

  • All things Python
  • Google App Engine and 'Cloud' technologies
  • Databases, relational as well as new technologies (BigTable, Column-oriented databases)
  • Robotics
  • The Arduino Platform
  • The Processing Language/Platform
  • The 9-million APIs of the Internet and Mashups
  • The 'Semantic Web' (RDF, SPARQL, and so on)
  • Freebase (the product from Metaweb, not the drug) and their Acre environment
  • Functional Programming
  • Prolog
Upfront disclosure as to my biases, to save both myself and readers time:

Microsoft: Nay
Open Source: Yay
Mac: Yay
Videoblogging/The proliferation of instructional videos: Eh.
vi: Eh.
Emacs: Yay