Thursday, March 4, 2010
The Collaboration Room in Bloomington: You and Your Thoughts are Welcome
A month or so ago, we were alerted to the existence of The Collaboration Room, a fairly new place in town that makes the very cool town of Bloomington even cooler. It's a house at 214 N Rogers Street where artists and other collaborators can gather to work on projects. I visited a couple of weeks ago with my daughter during a Friday night session (these are held every Friday from 5-8pm). We met Matthew and Matisse, and in no time we were contributing to prints that had been started by previous visitors. After doing that for a bit, Matthew taught my daughter how to make prints with brayers, barens (new terminology to both of us), and colorful ink. Matthew is a good teacher who works well with kids, so it was not surprising to find out he works for the Boys and Girls Club of Bloomington.
Some other members of the hackerspace dropped by, so we talked about what the hackerspace was about and what we'd been up to, and we were graciously invited to hold our meetings at the Collaboration Room on Tuesday nights. It seems to be working out very well, as the two groups have the same goal: working on creative projects with people with a variety of skills and different levels of expertise. At the first meeting, the collaborative spirit overtook us, and we put aside individual projects to work on a robotic parrot (servo-powered flapping wings) we were determined to put together that night. It was not unlike a reality show challenge, only our opponents were the clock and boredom, both of whom we defeated.
A few days later, some of us went to the Collaboration Room's first benefit at the Bishop. This featured a performance art work involving a guy mowing a small indoor lawn and several local bands. People were invited to participate in the Collaboration Room's Red Circle Animation project (I drew a red unicycle). Prints, including two my daughter and I had worked on, were on the wall for sale. I bought the one my daughter had worked on, so now she has sold work.
Matisse runs an animation workshop on Saturday mornings. Spaces are very limited, so interested parents should contact Matisse via the website. At the first session, we watched some animated works made by kids in previous sessions for inspiration. One involved a pair of Gwar/Transformer hybrid characters in a guitar battle. After that, no time was wasted, and the kids dove right into making clay characters and choosing or making sets for their works. By the end of the session, they had made three cartoons.
While The Collaboration Room is very kid-friendly, it is a place for all ages. It's very much about inclusiveness and being accessible to anybody with a desire to participate in a constructive way. As they say, 'everybody brings something to the table'. Matthew has pointed out that Bloomington's Wonderlab (a science museum for kids) started out as a small operation run by a small but very enthusiastic and energetic core of volunteers. Bloomington's community radio station, WFHB, started out that way too. I would love to see The Collaboration Room reach the same level of success the Wonderlab and WFHB have enjoyed. If you live in Bloomington and are interested, go to a Friday night open session, volunteer your talents and teach a workshop, or donate some art supplies.
Monday, January 25, 2010
How I will lose 1,000 pounds in 2010
Junk has been accumulating in my home and my cube at work for years, so my goal for 2010 is to get rid of 1,000 pounds of it. Left untreated, this could lead to an appearance on the television show 'Hoarders', which is not desirable to me even in these fame-obsessed, less-shame-influenced-than-ever times.
In order to achieve this goal, I will need to drop 83 1/3 pounds a month. I will make an effort not to merely make the junk somebody else's problem by throwing it away, although I've gotten off to a bad start, throwing out two trashcans' worth of junk that had accumulated in my cube at work. This junk was a combination of out-of-date calendars, obsolete software I got from Microsoft back when I was an MSDN member (now: open source for life!), and the sort of write-only notebooks that are part of my 'method'. Even though most of the notebooks have dates, I can't say the June 2006 notebook is going to answer any questions I have. I have recently moved to logging things in electronic form (Google Docs), which is decidedly more searchable.
This weekend I went through my technical book collection and weeded out some hopelessly out-of-date tech books (Java 1.2, a book on Flash from 2000, that sort of thing). I would feel bad selling these on eBay, because they're worthless, and dumping them on somebody else for a couple bucks, while not as bad as selling crack, is still ethically questionable. I will recycle them instead.
This stack of books weighed 24.5 pounds, and given the dimensions of the stack (10 1/4 in x 8 1/8 in x 12 in), that gives us a paper density of about 42 pounds per cubic foot. Since I didn't weigh the trashcans at work, I instead, in Microsoft-interview fashion, measured the dimensions of a trashcan (16" x 6" x 12"). Using that density (although there was some air in the trashcans that will bring the effective density down some), that gives me 2 x 0.667 ft^3 x 42 lbs/ft^3 = 56 lbs, plus the 24.5 lbs for the books. So far, we have 80.5 lbs.
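If you'd like to check my work, here's the same back-of-the-envelope arithmetic as a few lines of Python (the measurements are the ones above; my 80.5 comes from rounding the density down to 42 before multiplying):

```python
# Back-of-the-envelope junk accounting, using the measurements above.
CUBIC_INCHES_PER_CUBIC_FOOT = 12 ** 3  # 1728

book_weight_lbs = 24.5
book_stack_ft3 = (10.25 * 8.125 * 12) / CUBIC_INCHES_PER_CUBIC_FOOT
paper_density = book_weight_lbs / book_stack_ft3  # ~42 lbs per cubic foot

trashcan_ft3 = (16 * 6 * 12) / CUBIC_INCHES_PER_CUBIC_FOOT
trashcan_weight_lbs = 2 * trashcan_ft3 * paper_density  # two cans of junk

# Roughly 81 lbs with the unrounded density; 80.5 if you round to 42 first.
total_lbs = book_weight_lbs + trashcan_weight_lbs
```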
So I'm off to a good start so far, as far as January goes. I just need to find one or two more books to let go of. I really have no fear of running out of junk before 2010 is over. I can do this!
Friday, September 4, 2009
Hackerspaces - a way for computers and the people who love them to get out more often
Q: How can you tell if a programmer is extroverted?
A: He looks at your shoes instead of his when he talks to you.
(I didn't say it was funny.) The truth is that while geeks do appreciate being left alone occasionally to focus on a problem or a project they're working on, they are often social animals, not unlike people in general. They go to conferences, they collaborate on Open Source projects, they go to events like Bloomington's ongoing Geek Dinners or user group meetings. In a college town like Bloomington (yes students, we townies do appreciate you), there is an abundance of opportunities to get out and mix it up.
For all the 'we live in a new era' wonder of computers and the internet, they live in a somewhat insular world. Sure, data travels around the globe via TCP/IP, HTTP, and the rest, but for many computers, interaction with the world outside is limited to fingers tapping on a keyboard, or perhaps a temperature sensor to make sure they don't melt down. I believe MacBooks have light sensors, too. But the fact remains: for most of the computers we see and interact with day to day (I'm not talking about all the more-or-less hidden microcontrollers all around us), there's no meaningful direct interaction with the outside world, the one where the sun is shining, the birds are singing, and the plants need watering.
There's a nice illustration in the book Physical Computing that shows how your computer sees you:

Physical Computing is about giving computers (often really tiny inexpensive ones, aka microcontrollers) sensors to allow them to take in their environment, and sometimes motors or other physical/mechanical devices so they can move around or change their environment (as with the plant watering).
A new place in town for like-minded creative types with interests in Physical Computing, electronics, and creative activities in general is the Bloomington Hackerspace (a really cool name like NYCResistor, or a more serviceable one like HackDC, presumably to follow). Hackerspaces have been around for years, and as the name suggests, they offer a physical (as opposed to virtual) space for hackers. Here, hacker is defined as a person with an ongoing interest in and aptitude for finding creative and usually not obvious uses for the objects, devices, and technological tools around them. These are NOT the let's-break-into-somebody's-system hackers. The very simple distinction: these hackers are all about creating, not destroying (although they do break things down for parts).
While not every great idea with a lot of enthusiasm behind it coalesces into a real-life implementation, things have been going very well for the Bloomington hackerspace under the capable leadership of non-official leader Nathan H. (aka dosman). After an initial meeting a couple weeks ago featuring members of the IU Robotics Club (the future of which is uncertain at this point, as pretty much none of the members are students), a space was identified (donated by member Jennette T.).
For the second meeting, we gathered in this space, which featured a worktable big enough for all of us and many breadboards and electrical components. Some of us brought our Arduinos and a guy named Will brought a MeggyJr he had built. I had heard about these hand-held, Arduino-based game consoles, but this was the first time I'd seen one. It was running a version of the classic memory game Simon, and later Will uploaded other games that one of Jennette's sons enjoyed playing.
While the gathering had a somewhat informal air, we did have a defined goal, as we were all going to get a chance to wire up an ATMega8 (AVR microcontroller) and program it to do the 'Hello World' of microcontroller programming, blinking an LED. It sounds somewhat trivial, but it ended up being interesting as we had to debug things like malfunctioning components or minor wiring errors. As somebody who's played with the Arduino, I found it interesting to take a step down from the ease-of-use and accessibility of working with the Arduino.
Arduino programming works like this:
- Write your program in the very nice Arduino IDE, which is based on Wiring and the very accessible Processing environment used for visualization and graphics. (Actually, you don't even have to write the blink sketch; it's available as an Example.)
- (Optional) put the long pin from your LED in the designated output pin, and the short one in ground (usually there'd be a resistor in the mix, but the Arduino's got that covered. Also there's an LED built in, so even this is optional).
- Plug a USB cable into your computer, and the other end into the socket on the Arduino.
- Upload the sketch.
- Admire the blinky light.
Programming a bare AVR, by contrast, works like this:
- Grab your breadboard, a 5 volt regulator, a power supply, a header for the programmer, the programmer, an LED, lots of wires, and of course, the microcontroller.
- Wire up the regulator to your power supply so you have a nice steady 5V.
- Look up and print out (or draw) the pin diagram for the ATMega8.
- Determine which pins from the ATMega8 connect to which pins on the header, and wire it up.
- Don't forget the LED.
- On your computer, set up avr-gcc and avrdude.
- Get a program for the blinky thing and preferably a Makefile. You don't necessarily have to write this one either, but you do have to locate it.
- Make it.
- Connect the header to your computer.
- Try to upload the code.
- If it fails, go back and figure out where things went wrong. A second pair of eyes often helps here.
- Admire the blinky light.
Having identified a space and had a successful hands-on meeting, things are really taking off for the Bloomington hackerspace. Perhaps some fantastic product or great robot will be born there, or a Jobs/Wozniak style partnership will form. Whatever happens, hackerspaces are a really great idea, Bloomington is lucky to have one, and I look forward to future meetings.
Some Resources:
Tutorial on AVR programming with pictures at the Physical Computing at ITP site.
Arduino Home Page
Sparkfun is a good source of electronics
Adafruit is a good source for kits (including the MeggyJr)
Sunday, July 26, 2009
Hay-O! NSFW comic strip re: Key-Value stores
How do I query the database?
This is from the highscalability site, which has been doing a fine job chronicling the revolution, such as it is.
Sunday, July 5, 2009
When you know what you're after and you have to scale: distributed key-value stores
Do you really need mad scaling capability?
It's probably not the case that your political blog needs to scale to serve millions of users, even if you have some readers who are not your relatives or friends (which would already place your blog in the top 10% of blogs). Additionally, although 'we're not building Amazon.com' is sometimes used as an excuse for neglecting performance considerations, it really is safe to say you are not building Amazon.com. Your needs are likely different. Ted Dziuba has covered this nicely in his blog post 'I'm going to scale my foot up your ass'.
So why bother talking about it at all, then?
Academic interest. The pursuit of general well-roundedness. Boredom with writing for the more loosey-goosey blog mentioned in my intro post. Perhaps I'm turning into one of the clowns I knew 10 years ago who was always installing the latest version of Red Hat, but never actually using Red Hat. I hope not. Also, who knows: you aren't building Amazon.com, but you may end up at a place using these things some day, or more likely, you may be using such technology via the much beloved 'cloud'.
Amazon's Dynamo
The people who actually were building Amazon.com came up with a data storage system called Dynamo. If you'd like to read a paper describing it, from the source, you can download the pdf.
According to the paper, many of Amazon's services work just fine with a key-value store and don't necessarily need a relational database schema, or a flexible, declarative (rather than imperative) query language like SQL. This makes sense, as Amazon's system is doing things like looking up products by key, storing shopping carts under keys for customers, and so on. The really big goal, again, is scaling and efficiency, and according to the paper, systems with ACID properties tend to suffer where availability is concerned (thanks to locking needs and so forth - the thought of handling locks across hundreds or thousands of commodity servers in multiple data centers is headache inducing). It's said that you have to pick at most two from consistency, availability, and partition tolerance. Amazon trades consistency for availability and reliability (resilience in the face of node failure) where needed.
Eventual Consistency
Consistency is not completely thrown out the window. Dynamo has a property called 'Eventual Consistency'. All updates 'eventually' reach all nodes. High write availability is a goal of the system, which is easy enough to understand - forcing a user to wait until all nodes know that 'Infinite Jest' has been added to the shopping cart would be a good way to lose customers. There will be inconsistencies, but resolution of the inconsistencies is pushed off for reads to handle.
The next question to address is whether the data store or the application is supposed to handle resolution. The author of an application (ideally) has knowledge of the data and how it is used, and would be perhaps best qualified to determine a resolution strategy. On the other hand, this sort of micro-management is a recipe for inconsistency across applications and may not be the best use of developer time.
Dynamo stamps updates with something called a vector clock (while it'd be nice to stamp updates with a time stamp from a universal clock, unfortunately a universal clock doesn't exist in practice). The vector clocks can be used to determine which update is the latest. This is discussed in more detail in a post on Paper Trail. To cram the idea into a sentence, the dimensions of the vector are the servers, and the values of the dimensions are from an internal clock (or event counter) maintained by the server in question.
Usually this is sufficient to determine which version is the latest, and this is referred to as syntactic reconciliation. When this can't be done, and the application has to handle it (for example, the contents of two versions of a cart might be merged), it's called semantic reconciliation. Feel free to use those terms to show off, but whatever you do, don't spell it 'symantec reconciliation', or people will laugh at you behind your back, much like we laughed at the pompous consultant who did that in one of his masterful 'roadmaps'.
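To make this concrete, here's a minimal vector clock comparison sketched in Python (this is an illustration of the standard rule, not Dynamo's actual code, and the server names are made up): version a descends from version b if every per-server counter in a is at least b's; if neither descends from the other, the versions are concurrent and you're in semantic reconciliation territory.

```python
def descends(a, b):
    # Clock a descends from clock b if a has seen everything b has:
    # a's counter for every server is at least b's counter.
    return all(a.get(s, 0) >= b.get(s, 0) for s in set(a) | set(b))

def reconcile(a, b):
    if descends(a, b):
        return "a wins"       # syntactic reconciliation
    if descends(b, a):
        return "b wins"       # syntactic reconciliation
    return "concurrent"       # semantic reconciliation: the app must merge

v1 = {"server_x": 2, "server_y": 1}
v2 = {"server_x": 1, "server_y": 1}   # an ancestor of v1
v3 = {"server_x": 1, "server_y": 2}   # updated independently on server_y

print(reconcile(v1, v2))  # a wins
print(reconcile(v1, v3))  # concurrent
```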
Reliability
One of the goals was to develop a system where nodes can be added as needed and failures can be handled smoothly. Also, the possibility of having nodes of various power needed to be factored in. To that end, an Amazon-ized version of consistent hashing (a really good write-up on consistent hashing by somebody who's used it in the field may be found here) is used to partition the system, and instead of considering all nodes equal, virtual nodes are used - more powerful servers can be responsible for more virtual nodes.
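A toy version of consistent hashing with virtual nodes might look like the Python below (a sketch under simplifying assumptions; Dynamo's real partitioning scheme has more moving parts, and the server names are hypothetical). Each server gets a number of points on the ring proportional to its power, and a key belongs to the first point clockwise from its hash.

```python
import hashlib
from bisect import bisect

def ring_hash(s):
    # Hash a string to a point on the ring (Dynamo also uses a 128-bit hash).
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, servers):
        # servers maps a server name to its weight: how many virtual
        # nodes (points on the ring) it gets. Beefier box, more points.
        self.points = sorted(
            (ring_hash("%s#%d" % (name, i)), name)
            for name, weight in servers.items()
            for i in range(weight)
        )
        self.hashes = [h for h, _ in self.points]

    def node_for(self, key):
        # A key belongs to the first virtual node clockwise from its hash.
        i = bisect(self.hashes, ring_hash(key)) % len(self.points)
        return self.points[i][1]

# Hypothetical servers; big_box is three times as powerful as small_box,
# so it takes three times the virtual nodes and roughly 3x the keys.
ring = Ring({"small_box": 4, "big_box": 12})
owner = ring.node_for("cart:customer42")
```

The payoff of the ring layout is that adding or removing a server only remaps the keys adjacent to its points, rather than reshuffling everything.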
This being a key/value data store, with a hash function mapping each key to a 128-bit value, when we say 'partition the system', we're partitioning the key space. Each key has a 'preference list' of virtual nodes responsible for it, and its data is replicated across the first N (a parameter that's configured per-instance) of those nodes; the preference list contains more than N nodes, to account for failures. Puts and gets are coordinated by the first of the top N nodes in the preference list. To maintain consistency, for a read, at least R nodes have to participate; for a write, W nodes. If R + W > N, congratulations, you have a quorum system (Dynamo's version is a 'sloppy' quorum, since the N nodes involved can shift when some are down). R, W, and N are user configurable, with typical choices being N = 3, R = 2, W = 2.
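The reason R + W > N buys you consistency is pigeonhole arithmetic: any R nodes you read from must overlap any W nodes a write went to, so some node in the read set has the latest version. A quick brute-force sanity check (illustrative only):

```python
from itertools import combinations

N, R, W = 3, 2, 2                 # the typical settings mentioned above
replicas = set(range(N))

# Every possible read quorum must share a node with every write quorum.
overlap_always = all(
    set(reads) & set(writes)
    for reads in combinations(replicas, R)
    for writes in combinations(replicas, W)
)

# With R = W = 1 (so R + W <= N) the guarantee disappears:
# a reader can hit a node the writer never touched.
overlap_weak = all(
    set(reads) & set(writes)
    for reads in combinations(replicas, 1)
    for writes in combinations(replicas, 1)
)
```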
In Amazon's case, the data is replicated across datacenters, for big-time fault tolerance. So again, a Mom and Pop shop is probably not going to have multiple datacenters, but even a Mom and Pop shop will be able to enjoy this kind of availability as cloud computing becomes more available and mainstream. Unfortunately this deprives the IT guy of the opportunity to swing his dick around bragging about his datacenters. You win some, you lose some.
Where does Berkeley DB fit in, again?
According to the paper, different storage engines can be plugged in to this framework. The engines include Berkeley Database (BDB) Transactional Data Store, BDB Java Edition, MySQL, and an in-memory buffer with persistent backing store. Berkeley DB Transactional Data Store is the most common. Dynamo-esque systems like LinkedIn's Voldemort also provide this sort of plug-in framework.
What about security?
It's assumed all of this is happening within an already secure, trusted environment. Only systems inside of Amazon are accessing this directly.
So that's it?
Obviously that's not all there is to Dynamo, and even the paper is an overview. It is a nice jumping off point when considering related #nosql options, especially considering that there are a number of Dynamo-like systems out there.
A shortcoming of these systems is that they're not particularly good for reporting applications or ad-hoc querying. This is fine and dandy, since that's not primarily what they're designed to do, but in the future we'll talk about MapReduce and other tools used to that end.
That is all for now.
Sunday, June 7, 2009
When You Know Exactly What You're After: Key-Value Databases
Back in 1979, Ken Thompson, a programming deity responsible for Unix (and, with Dennis Ritchie, one of the original Unix 'graybeards') and the B language (predecessor to the more familiar C), wrote a program named, in typical Unix terseness, 'dbm'. This was short for 'database manager', which was a nice thing to have on a computer then and now. Essentially, it stored data identified by keys in a file. This sort of data structure is known in Python circles as a dictionary, to Perl programmers as (way back when) an associative array or (more recently) a hash, and in Java and its smarter little brother C#, a Map. It makes for a better way of storing or persisting data (and accessing it) than using a generic flatfile approach. What the user doesn't get is a nice declarative interface for pulling data out via queries specifying conditions of interest (date ranges, department, and yadda yadda yadda). If you don't know the key, tough luck.
DBM is still here 30 years later, and by here I mean here on my Mac. If I type 'man dbm' at the prompt, I get the man page for it, and so will you, if you are a Mac or Linux user. Although some BSD code made its way into Microsoft code, dbm didn't, so Windows users won't have it. Not that that is a terrible loss in this day and age.
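The dbm idea also survives in language standard libraries. Python, for instance, ships a dbm module that behaves like a persistent dictionary backed by a file (this sketch uses the stdlib wrapper, which picks whichever dbm flavor your system has):

```python
import dbm
import os
import tempfile

# A dbm file is a dictionary that survives your process: keys and
# values are bytes, there's no server, and there are no queries.
path = os.path.join(tempfile.mkdtemp(), "scratch")
with dbm.open(path, "c") as db:      # 'c' = create the file if needed
    db[b"ken"] = b"thompson"
    value = db[b"ken"]               # fine, we know the key
    # If you don't know the key, tough luck: no WHERE clause here.
```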
Let a Thousand dbm, Jrs Bloom
As is often the case, dbm spawned a number of successors, a notable one being the Berkeley Database, created at Berkeley as part of the 1986-1994 effort to move from BSD 4.3 to 4.4, and also to move away from AT&T code. In 1996, Netscape asked for some extensions for their needs (for an LDAP server and for their browser), which resulted in the formation of the Sleepycat software company, who maintained and sold the embedded database through 2006, when they were acquired by Oracle (recent acquirers of Sun, and, with it, another well-known and popular database, MySQL). A visit to the official site these days reveals almost nothing useful, in addition to being stripped of the colorful name (it's now 'Oracle Berkeley DB', and the site is the sort of scrubbed marketing enterprisese Oracle does really well). It is still a viable embedded (no server) database, used by, among other tools, OpenLDAP and the source control tool Subversion (the open-source successor to the Netscape browser, Firefox, now uses SQLite to store bookmarks, history, and so on).
Berkeley DB, in addition to being small (the library is about 300K), fast, ACID compliant, and scalable, gives the user four storage options: Hash (as with the old-style dbm, for 'I know what I'm looking for' storage), B-Tree (good for searching ranges), persistent Queue, and RecNo (record numbers are the keys, and these are generated automatically). Also interesting is that a developer has the option of turning features like locking or transactions off or on as needed. In general, the prevailing value seems to be giving a developer control over a datastore without a lot of administration overhead, allowing the developer to optimize data access or shoot himself in the foot, depending on the choices she makes (I mixed up the developer's gender there. It happens.)
One interesting thing I discovered in my investigations of Berkeley DB is the way joins are handled. Joins are very familiar to relational database users, but it's not obvious how one would join tables in a key-value store. First off, Berkeley DB has this thing called the 'secondary index', which gives one a way of accessing data using something other than the key (removing that limitation). For example, suppose you have an employee database, and you want to find employees in a given department. Rather than going through every record in the database, you'd now have the option of looking up employees according to department.
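In dictionary terms, a secondary index is just a second map from the non-key attribute back to the primary keys. Here's a toy illustration in Python (not the Berkeley DB API; Berkeley DB maintains its secondary indices for you, while we build one by hand, and the records are made up):

```python
# Primary "database": SSN -> employee record.
personnel = {
    "111-11-1111": {"name": "smith", "dept": "shipping"},
    "222-22-2222": {"name": "jones", "dept": "shipping"},
    "333-33-3333": {"name": "smith", "dept": "accounting"},
}

# Secondary index: department -> set of SSNs pointing back at the
# primary. Berkeley DB keeps this in sync automatically.
by_dept = {}
for ssn, rec in personnel.items():
    by_dept.setdefault(rec["dept"], set()).add(ssn)

# Find shipping employees without scanning every record.
shipping_ssns = by_dept["shipping"]
```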
Using secondary indices, it's possible to join tables. In the best Berkeley DB reference I found, that's done like this (the example below is lifted from that source; I give attribution anyhow):
Consider the following three databases:
personnel
- key = SSN
- data = record containing name, address, phone number, job title
lastname
- key = lastname
- data = SSN
jobs
- key = job title
- data = SSN
Consider the following query:
Return the personnel records of all people named smith with the job title manager.
This query finds all the records in the primary database (personnel) for which the criteria lastname=smith and job title=manager are true.
Assume that all databases have been properly opened and have the handles: pers_db, name_db, job_db. We also assume that we have an active transaction to which the handle txn refers.
DBC *name_curs, *job_curs, *join_curs;
DBC *carray[3];
DBT key, data;
int ret, tret;

name_curs = NULL;
job_curs = NULL;
memset(&key, 0, sizeof(key));
memset(&data, 0, sizeof(data));

if ((ret =
    name_db->cursor(name_db, txn, &name_curs, 0)) != 0)
        goto err;
key.data = "smith";
key.size = sizeof("smith");
if ((ret =
    name_curs->c_get(name_curs, &key, &data, DB_SET)) != 0)
        goto err;

if ((ret = job_db->cursor(job_db, txn, &job_curs, 0)) != 0)
        goto err;
key.data = "manager";
key.size = sizeof("manager");
if ((ret =
    job_curs->c_get(job_curs, &key, &data, DB_SET)) != 0)
        goto err;

carray[0] = name_curs;
carray[1] = job_curs;
carray[2] = NULL;

if ((ret =
    pers_db->join(pers_db, carray, &join_curs, 0)) != 0)
        goto err;
while ((ret =
    join_curs->c_get(join_curs, &key, &data, 0)) == 0) {
        /* Process record returned in key/data. */
}

/*
 * If we exited the loop because we ran out of records,
 * then it has completed successfully.
 */
if (ret == DB_NOTFOUND)
        ret = 0;

err:
if (join_curs != NULL &&
    (tret = join_curs->c_close(join_curs)) != 0 && ret == 0)
        ret = tret;
if (name_curs != NULL &&
    (tret = name_curs->c_close(name_curs)) != 0 && ret == 0)
        ret = tret;
if (job_curs != NULL &&
    (tret = job_curs->c_close(job_curs)) != 0 && ret == 0)
        ret = tret;
return (ret);

Simple, right? This brings up another point: all the major scripting languages have modules or libraries allowing a developer to accomplish this in a much briefer and more straightforward manner. Life is too short to program in C when you don't need to.
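For comparison, the same 'smith AND manager' join in Python boils down to intersecting two secondary-index lookups and fetching from the primary (a sketch with plain dictionaries standing in for the databases, not a real Berkeley DB binding):

```python
# Primary "database": SSN -> record (toy data, not the real thing).
personnel = {
    "111": {"name": "smith", "title": "manager"},
    "222": {"name": "smith", "title": "programmer"},
    "333": {"name": "jones", "title": "manager"},
}

# The two "secondary databases": lastname -> SSNs, job title -> SSNs.
by_name, by_title = {}, {}
for ssn, rec in personnel.items():
    by_name.setdefault(rec["name"], set()).add(ssn)
    by_title.setdefault(rec["title"], set()).add(ssn)

# The join: intersect the candidate keys, then fetch from the primary.
matches = [personnel[ssn] for ssn in by_name["smith"] & by_title["manager"]]
```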
A Product Of Fine Japanese Engineering
Berkeley DB, like all successful software packages, has a number of offshoots, and a particularly interesting one is Tokyo Cabinet, which can be an embedded database, or one can use the Tokyo Tyrant server for remote access. Also available: Tokyo Dystopia for full text search. Watch out for future releases, Tokyo Epidemic, Tokyo Martial Law, and Tokyo Zombie Apocalypse.
Tokyo Cabinet was written by Mikio Hirabayashi whilst working for mixi, Inc. (apparently, it's like Facebook). Hirabayashi is also known for QDBM (yes, part of the DBM family). TC takes some of the ideas in Berkeley DB and pushes them further. Whereas your dbms and Berkeley DBs are primarily APIs for accessing local files, TC ups the ante with the aforementioned Tyrant for remote access (also available: replication, transaction logs, and hot backup). There are also a number of options for your underlying storage (as with Berkeley DB): Hash, B-Tree, Fixed-Length, and an option distinguishing it from Berkeley DB: the Table Engine, which mimics an RDBMS, but in a schema-less way. It is somewhat like CouchDB, which we'll eventually get to. The developer can declare indexes on columns and query the table in a SQL-esque manner. While you wouldn't want to use it in a situation where a true RDBMS is what you need, this is definitely handy.
Tokyo Cabinet has developed a loyal following, who praise its speed, flexibility, and ease of use. It also has its detractors. There are bindings for numerous languages, but C, Perl, Ruby, Java, and Lua appear to be best supported: in my research, I found people complaining that the pytc module for Python lacks support for some features.
As Tokyo Cabinet has developed some momentum, is fairly well established, and is interesting, I plan to dig into it further. Still, in the next installment, we will cover other databases.
Wednesday, May 20, 2009
Up next: a survey of the 9 bazillion databases in the world
High on the list of a technology professional's worst fears is becoming out of touch with technology and being left behind. Like death, it's going to happen one day no matter what you do, but as with death, one can 'rage against the dying of the light' and make every effort to stave it off as long as possible.
I have been a database guy for a number of years, either as a developer using the database, or one of the guys designing the database, or even sometimes setting up and acting as an administrator for database servers. Pretty much every database system I've ever dealt with was an RDBMS, in other words, it followed the relational model outlined by E.F. Codd in 1970, and misunderstood by the majority of computer type professionals ever since.
Monstrously large companies like Amazon and Google with monstrously large data sets and extreme scaling needs have in recent years run up against the 'ceiling' of what RDBMSs can do, and in some cases they are willing to trade off some of the letters in ACID to meet other needs. Their work is starting to make its way into the outside world: anybody can read the BigTable paper, Google has opened up Google App Engine (as mentioned in previous entries), and implementations like the open-source Hadoop have made BigTable-like storage available to anyone with the intellectual and budgetary resources to take it on.
At the other extreme are the very tiny and lightweight databases like SQLite, for cases where a full-blown database system is overkill, but a more-or-less standard interface to your data is still a nice thing to have.
So You Put 2 and 2 together, and got 5?
In future entries I intend to do more of a deep dive into these alternative options to boring old suit-and-tie, plastic-fantastic-wall-street-scene databases like SQL Server, Oracle, and DB2. But first a few cautionary words.
To predict the death of the relational model is likely premature, and ill-advised as well. As I mentioned earlier, and have observed over the past 15 or so years, many (almost all?) IT professionals manage to go from 'hello world' to the executive suite without knowing or understanding much of anything about the databases that, you know, store and access the data: the thing distinguishing your business from every other business in the world (although maybe you also put cool stickers on your servers). A programmer might learn to regurgitate the definitions of first, second, and third normal form, and know the difference between an inner and outer join, and 'Congratulations! You got the job, kid.' (It worked for me at least once in my more ignorant days.) A lot of times, 'I'm running into limitations in the relational model' can be translated as 'Huh. I didn't know you could do it that way.'
Outside the walls of the IT ivory basement, the situation is even worse. The not particularly rigorous, but quite prolific, writer Robert X. Cringely recently predicted the death of SQL in this column.
There are some cringe-worthy howlers within:
- It's SQL, not Sequel (but maybe he was just being clever)
- SQL is the language, not the database (and hard-core relational theorists will talk your ear off about where SQL diverges from the relational model, if you aren't careful)
- Given the fumbling, Keystone Kops way a lot of shops handle the 'you install it, turn it on, and it runs itself' SQL Server, do you really want to turn something as game-changing as BigTable loose on them? That'd be like teaching a kindergarten class how to make Greek Fire.
As @iamdiddy would say: LET'S GOOOOOOOOO!!!1!