Thursday, October 29, 2009

IRON/Cloud — the outline of what a modern OS should be

In a previous blog entry — while reviewing the Windows 7 Operating System — I outlined some of the reasons I felt disappointed with the state of operating systems in general in the year 2009.

I sat down and listed many points that I felt (at least some) operating systems should have by now, separating them forever from the Xerox UI and Bell Labs UNIX roots that they have done so little to grow beyond in the last 30-50 years.

One thing I ran into since that blog entry is an article on Slashdot discussing Microsoft and Danger losing a ton of data that was being stored on behalf of T-Mobile Sidekick telephone clients. The Sidekick stores all of its data remotely (yes, "in the cloud") and some of that data simply went "poof" earlier this month. The summary of the Slashdot article suggested this might never have happened had they been using a more failure-tolerant filesystem, such as ZFS.

So I looked into ZFS. It is a modern filesystem being developed by Sun for use on their commercial Solaris operating system. It meets points #4 and #5a/#5b from my initial blog entry of what Operating Systems should be doing these days. It favors appends over edits, it uses hash-based inodes so that duplicate data content never need be duplicated on-disk, and born of this it copies data via a lazy-copy model. It even implements data editing this way: it creates a new block on disk for the newly written data, so that the original data is not immediately overwritten but instead remains available to roll back to in case of a crisis. Given enough disk space, you can honestly travel backwards in time to previous data snapshots, right up until the old data is reclaimed. You can even bring old data snapshots to life as R/W volumes which overlap the live volumes of data!
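To make those three ideas concrete (content keyed by hash, lazy copies, and snapshots that survive later edits), here is a toy sketch in Python. It is emphatically not how ZFS lays things out on disk; `BlockStore`, its method names and the tiny 4-byte block size are all made up for illustration.

```python
import hashlib


class BlockStore:
    """Toy append-only store: blocks are keyed by the hash of their content."""

    def __init__(self):
        self.blocks = {}      # content hash -> bytes (never overwritten)
        self.live = {}        # file name -> list of block hashes
        self.snapshots = {}   # snapshot label -> frozen copy of `live`

    def write(self, name, data, block_size=4):
        hashes = []
        for i in range(0, len(data), block_size):
            chunk = data[i:i + block_size]
            digest = hashlib.sha256(chunk).hexdigest()
            # Identical content hashes to the same key, so it is stored once.
            self.blocks.setdefault(digest, chunk)
            hashes.append(digest)
        # An "edit" only repoints the file at new blocks; old blocks remain.
        self.live[name] = hashes

    def read(self, name, snapshot=None):
        table = self.snapshots[snapshot] if snapshot else self.live
        return b"".join(self.blocks[h] for h in table[name])

    def snapshot(self, label):
        # A snapshot copies only the block references, not the data itself.
        self.snapshots[label] = {n: list(h) for n, h in self.live.items()}

    def copy(self, src, dst):
        # Lazy copy: the new file shares every block with the original.
        self.live[dst] = list(self.live[src])
```

Write a file, take a snapshot, overwrite the file, and the snapshot still reads back the old bytes because nothing was ever edited in place. This toy never reclaims old blocks; a real filesystem would have to.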

This is the sort of innovation we should be seeing in modern operating systems, and it is the kind of change I put my weight behind.

Inspired by this and by conversations I've had with others regarding my blog post, I have learned much more about my expectations for operating systems. Enough so that I can describe what I see as the right direction to move in.

One of the most important transitions that the IT industry is presently tripping badly over itself to take advantage of is the ability to abstract software and data completely and forever away from Hardware, such that hardware within a given IT infrastructure can be truly commoditized.

This is breathtakingly important. If you have ever worked in a small to medium sized business, you know the horrors of vendor lock-in and hardware lock-in. Somewhere there is a 15-20 year old piece of iron that runs some software long ago abandoned by its creators that the fate of the entire enterprise rests on for some reason or another. This linchpin of the business can apparently never be upgraded or replaced. If it breaks, someone has to figure out how to duct-tape it back into working order. If the motherboard ever goes out, or it loses enough drives in the RAID (drives that they also no longer make replacements for), the powers that be fear it could spell doom for the company.

Being locked into the software is a problem all its own, but by extension being locked into the hardware just makes things worse. Additionally, there is the danger that hardware always fails eventually; you just don't want your software depending heavily upon it at the moment that occurs. Since you never really know when that might happen, the correct answer is that software should "never" rely heavily upon any specific piece of hardware.

Finally, most small businesses cannot afford to do the trendy thing and forever separate every application onto its own pristine server, then move forward with High Availability goals by trying to replicate services across different iron using application-specific replication models (DNS AXFR, MySQL master/slave, VMware VMotion, RAID for the disks, etc.). It simply costs too much to buy one or more instances of iron for every new process, to say nothing of the IT workload that managing all these new boxes takes.

I feel that the perfect operating system for small business (not to mention Large Enterprise, home power users, and not too shabby even for Gramma checking Email) would be one that allows you the greatest advantages of high availability with as simple a setup as 2 machines in a NOC (each with as few as one large disk) and one machine offsite for geographic replication (also with a single disk). I'm not delving into backup scenarios here; you'd likely still want one backup SAN in the NOC and another offsite. I won't jaw about that. But I envision an OS that allows you to run all of your small business's applications on these three machines, while offering maximal data replication, heightened availability and geographic leveraging for you.

I call this hypothetical Operating System "IRON/Cloud". "Cloud" in this case meaning just what it always should have meant: a platform designed to divorce software, data and services from hardware. A platform that lays on top of potentially geographically disparate hardware in order to provide highly available and geographically leveraged services to users. "Cloud" in this sense does not necessarily mean trusting your data to "someone else" or some data vendor. I am talking about IT staff and end-users engineering their own god damned clouds for a change. :P

"IRON" in this sense refers to my own pet name for hardware. Hardware really should be commoditized so heavily that adding a server into a rack should be like pouring more molten iron into a smelt. Obtain and rejoice at all of the added resources. CPU, RAM, Disk, attached peripherals, network card, geographic diversity, possibly GPUs and other hardware acceleration, keyboards and mice for user presence, etc etc. But forget about and do not concern yourself with the limitations of the hardware. All of these resources should be consumed by the cloud platform itself, running processes redundantly and in parallel over all this hardware and shrugging off nodes that go dark like an elephant shrugs off arrows.

I have noticed recently that both independent enterprises and would-be "cloud" vendors are trying valiantly to provide both high availability and many of the security aspects I mentioned in my last OS tirade by leveraging virtual machines. The thought is, if you tailor a virtual system on a virtual OS and image that, then you can rapidly rub off copies of that image to meet spikes of demand across a data center. If you can't properly jail your applications (point #3a of my tirade) and you can't afford separate iron for every app, then put each app in its own virtual machine. I see this as a sign of a strong market in need of the sorts of qualities in an OS I am espousing today.

IRON/Cloud would be an OS designed from the ground up to provide all of these features and niceties at the OS level without the added complexity or resource suck of virtualization. An OS that supports full-machine virtualization, but would not require it in order to meet potentially any of these modern needs. It's so twentieth century to run your service as though it has been installed on fake Iron, when you should instead divorce the service from Iron and simply allow it fluid access to resources. It is very twentieth century to write software under the assumption that you are a single thread of execution on a single CPU and responsible for the whole show. It is better to be a single thread of execution on a single CPU (with exclusive access to your memory) within a construct such that you know there are likely more threads just like you performing the same or similar computations, and that you are one of many threads using a shared API and RPC calls to accomplish a task as a team. Your thread will not run forever, it may even be cut tragically short by hardware trouble or even bad coding, but you work for the good of the team of threads that you trust to carry on your work after your execution has stopped; for good or for ill. Your computations will be checked and perhaps even compared with a pool of results from identical threads to weed out and log or report possibly hardware-induced errors in math.
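The "compare a pool of results from identical threads" idea can be as simple as a majority vote. Here is a rough Python sketch of what I have in mind; the function name `reconcile` and the node names are hypothetical, and a real sky would surely need something smarter than exact equality for floating point work.

```python
from collections import Counter


def reconcile(results):
    """Pick the majority answer from redundant workers and flag dissenters.

    `results` maps a (hypothetical) node name to the value it computed.
    Returns (agreed_value, list_of_suspect_nodes).
    """
    if not results:
        raise ValueError("no results to reconcile")
    tally = Counter(results.values())
    winner, votes = tally.most_common(1)[0]
    if votes <= len(results) / 2:
        # No strict majority: nobody can be trusted, so ask for a rerun.
        raise RuntimeError("no majority; rerun the computation")
    suspects = sorted(node for node, value in results.items() if value != winner)
    return winner, suspects
```

Feed it the answers from three nodes where one disagrees and you get back the agreed value plus the name of the odd node out, which is exactly the thing you would want logged as a possible hardware fault.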

Not every thread or process will be redundant; what is redundant and how redundant is subject to sysadmin configuration. Some of these threads would have application-specific failsafes in case they perish. Some threads can just be run anew if this happens. Worker threads of a webserver are a great example of this. Some threads just need data recovery in case of failure, such as a word processor. So the cloned thread (if one is configured) acts as little more than a place for live data to be replicated to, which can replicate back to the master to recover gracefully after an application bork.

Let's go back to the example with 2 machines in a NOC and one offsite. This example is rather arbitrary; IRON/Cloud should run perfectly well on a single machine with no net connection. Of course it would not offer any of its replicating power in such a setup, but I hope it would compete admirably against today's OSen nonetheless. I only mention the 2+1 scenario because of how modest it is for a small business to actually build/maintain and how much blinding power you could squeeze from it using IRON/Cloud.

If you could get "the software" to do anything you wanted with such a hardware beachhead, what is the most you could really expect from it? I for one imagine the ability to pump data down whatever pipe connects the two sites (most likely commodity internet, but hopefully as much bandwidth as one can manage ;D) in order to keep all live data for all services running across these three boxes perennially up to date. Not only would the offsite machine be there to save your ass if both machines in the NOC fail (fire, flood, site nuked from orbit) but it should be live and capable of handling customer connections from its geographically unique vantage point. Perhaps your NOC is in Oregon and your backup is hosted in New York or even Europe. Any clients hitting your web server or what have you from those far flung locations ought to (optionally) just be able to dial into that closest node and get low ping times. The data pertinent to their transactions would flow back to you at the NOC via lazy or delayed writes over the backhaul. Using bandwidth for client connections at both points of presence and only sharing the pertinent details between you should keep bandwidth costs as low as possible too. You could maximize peering relationships, and larger data center layouts could even schedule workload to favor cheaper bandwidth or power consumption.

Best of all, these benefits would be provided either by the OS, or by applications written to adhere to the design principles of the OS. Gone would be the times when you have to learn the hokey replication voodoo engendered by a specific style of database or lock yourself into a virtual machine vendor and another layer of guest OS. It goes without saying that all applications can benefit from replication and relocation, so it is as much folly to rely on the application vendors to build these features by themselves as it is for app vendors to handle their own installation procedures (See point #7 of my previous rant).

In order to provide such wonders, I envision that IRON/Cloud would be built in two parts (reflected by the dividing line in the name). The first part, IRON, would be a collection of disparate software products to either be installed as a primary OS on individual pieces of hardware, or booted from a liveCD (though doing so would not be encouraged long-term given the waste of RAM resources valuable to the cloud), booted as the guest OS in a VM (not sure why you would want that but hell, why not allow it?) or even run as a simple application within a host OS (which is preferable to VM in all cases I can imagine).

IRON's job would be to run on individual bits of hardware to begin the hardware abstraction process. IRON on a machine makes available whatever resources that machine has (in app or VM mode, you could ideally even export only selected hardware resources!) to the overlaying Cloud platform. IRON would handle things like what kind of processor this is (32 bit? 64 bit? x86, PowerPC, ARM?), how many CPUs are present, what network connectivity exists, and how to talk to the peripherals, and it would limit or outright eliminate the need for running processes to give a damn about such parochial concerns.

With IRON running in whatever capacity you choose across a variety of IRON nodes, the overarching Cloud platform then takes over. IRON nodes would be configured to provide services to one (or more!) specific Cloud instances ("skies", they might be called!) which provide fine grained user and task control and administration. So an organization in house would likely run their own single Sky, while a hosting provider may allow any number of clients to run their skies over shared IRON hardware. Multiple skies would provide little benefit over single skies aside from resolving contentions regarding administration. The paradigm is designed such that a single organization should always pair well with a single sky. Further granularity is achieved via smaller components within the sky, such as individual CloudVMs.

Each "sky" allows you to add or remove nodes from its access (nodes would also be removed if they fail or fall offline) while it is running. In fact, while a "sky" operates very much like a single mainframe server, it should be designed to never stop completely or reboot. Any major software updates would be accomplished via sort of a rolling process restart to minimize or in most cases completely eliminate applicable service downtime. IRON updates would be handled similarly; the sky would schedule IRON nodes to go offline and reboot the hardware as needed then accept the node back when it comes online (after sysadmin intervention in case things get bad of course). Hardware that needs updating would simply be taken offline and brought back at SysAdmin's whims. The beauty of relaxing our needs for reliability from underlying hardware is hard to miss. It is much more expensive to craft one piece of Iron which you can rely on 24/7 for years than to obtain several that may have an average failure lifespan of a year or two. And both software and hardware must ALWAYS be maintained. You need to take it offline and give it a tuneup; I don't care what "it" is. Thus, an OS that runs perpetually, capable of surviving all such maintenance with even modest levels of replication is just what the doctor ordered. For everyone. Period. :P
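The rolling update is easy enough to sketch. Below is a deliberately naive Python version, assuming a made-up `rolling_update` helper that the sky might run: it takes one node down at a time, refuses to dip below a minimum number of online nodes, and leaves any node that fails to come back healthy offline for a sysadmin to poke at. None of these names come from a real product.

```python
def rolling_update(nodes, update, min_online=1):
    """Update nodes one at a time, never dropping below `min_online` healthy nodes.

    `nodes` maps a node name to True (online) or False (offline).
    `update` is a callable taking a node name and returning True if the node
    came back healthy. Returns the names left offline for an admin to inspect.
    """
    needs_attention = []
    for name in list(nodes):
        if not nodes[name]:
            continue  # already offline; nothing to drain
        online = sum(1 for up in nodes.values() if up)
        if online - 1 < min_online:
            break  # taking this node down would leave too little capacity
        nodes[name] = False          # drain: the sky stops scheduling work here
        if update(name):
            nodes[name] = True       # accept the node back into the sky
        else:
            needs_attention.append(name)
    return needs_attention
```

With the 2+1 setup and `min_online=1`, all three boxes get their turn and the service never goes fully dark. Raise `min_online` and the loop simply stops early rather than risk an outage.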

Another advantage of this approach is that resources are utilized optimally. Traditional IT either shoehorns many apps dangerously onto the same iron, and thus the same single point of failure, or spreads apps thinly among iron that spends much of its time underutilized. When I think of the powerful desktops idling all night in offices across the globe I weep! It would be far preferable if workstations ran the IRON/Cloud OS right at their desktops; their very own workstations can then crunch numbers and handle replication for the enterprise.

The flipside being that "remote desktop" would become a quaint and antiquated notion. Instead your OS is the server OS; you run in the same sky as the "server" you hope to administer. Open windows to maintain and configure your service, and those windows really do open right on your desktop hardware. Walk downstairs to watch the blinky lights in the NOC, then open the same live window on this new workstation downstairs. The Cloud does its best to make you feel like the window is still local, performing RDP-like operations with your first workstation upstairs until it has fully replicated the GUI process to your local terminal, after which your local terminal becomes the master for the process and it *is* once more local. You could do the same thing after a flight from San Francisco to Paris: open an old window or desktop, and after a few minutes of slow RDP-like access to your process in Frisco it completes its transition to your new terminal in Paris and the activity is once more local. Availability and responsiveness provided in the greatest ways one can conceive given the computational and network limitations available. Today is tomorrow god damn it, I think we deserve this now.

Given different nodes providing the same services, it makes sense to leverage modern network protocols such as SCTP to provide strong multi-homing support. I would recommend IRON/Cloud applications and infrastructure favor SCTP and IPv6 completely, perhaps even forgoing the dual stack for IPv4 and leveraging a site-gateway to IPv4 and TCP/UDP resources instead of bothering to support these at the application level.

The Cloud platform, by way of your "sky", operates spanning over all of the IRON nodes you give it leave to. The sky maintains and tabulates all of the disk, CPU, RAM, network, and peripheral resources provided by the IRON nodes and schedules processes to run where they are needed and in such a way as to best honor the parameters for each task. Tasks are run as individual "Clouds", or "Cloud Virtual Machines" within a sky. Each Task, or CloudVM, is assigned to one or more IRON nodes initially and in many cases comes to span more or fewer IRON nodes throughout the lifespan of the task. Many tasks run indefinitely (web server, database.. really any "daemon") so they are designed to be capable of this, but most tasks do not require this.
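For the curious, here is roughly what the sky's bookkeeping might look like, boiled down to a greedy placement loop in Python. Everything here (the `Node` fields, the `place` function, the preference for spreading replicas across sites before chasing free RAM) is my own assumption for the sake of illustration; a real scheduler would weigh far more than this.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    site: str
    free_cpus: int
    free_ram_mb: int
    tasks: list = field(default_factory=list)


def place(task_name, cpus, ram_mb, replicas, nodes):
    """Assign `replicas` copies of a task to distinct nodes with room to spare.

    Greedy and deliberately naive: prefer sites not yet used by this task
    (for geographic spread), then the nodes with the most free RAM.
    Returns the chosen node names; may be fewer than requested.
    """
    chosen = []
    used_sites = set()
    for _ in range(replicas):
        candidates = [
            n for n in nodes
            if n.name not in chosen
            and n.free_cpus >= cpus
            and n.free_ram_mb >= ram_mb
        ]
        if not candidates:
            break  # the sky runs with fewer replicas rather than failing
        best = min(candidates, key=lambda n: (n.site in used_sites, -n.free_ram_mb))
        best.free_cpus -= cpus
        best.free_ram_mb -= ram_mb
        best.tasks.append(task_name)
        chosen.append(best.name)
        used_sites.add(best.site)
    return chosen
```

In the 2+1 example, asking for two replicas of a web server lands one in the NOC and one offsite, since the second pick favors a site the task is not already running in.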

Tasks are software application instances. They run as separate CloudVMs (with their own unique view of available filesystems, unique access to RAM, firewalled Network access, and all other hardware resources) in order to meet point #3 of my first list of criteria. Applications should by default remain jailed from one another. Skies offer rich opportunities for applications to work together via RPC, but most apps really don't care about other running apps on a modern OS. For security's sake they ought to be isolated from one another whenever possible anyway.

Tasks run software that has hopefully been engineered specifically for IRON/Cloud or else to meet a list of criteria that can allow software to remain compatible with this or competing OSen. Such software should be designed around the philosophy I cited above: that a thread is best served by not behaving monolithically, but instead as a single participant among what may be many threads, combined by a shared interface and framework that allows threads to better survive the decoupled nature of such an OS.

One prime example herein is Disk access. Today, processes simply "assume" that the disk is available and fast to access. If the disk is not easily accessed, or the underlying OS must spend a lot of time getting to the disk (network shares, etc) the client application normally blocks or freezes, becoming unresponsive waiting for the disk. Even though the app can still conceivably accept user input, it simply refuses to. Even though such an app is perfectly capable of keeping its GUI window refreshed, it almost never does and you get page tearing instead. What really grinds my gears about this problem is it normally happens when I NEVER MEANT TO OPEN THAT NETWORK SHARE TO BEGIN WITH!! I simply followed a symlink without thinking, or a file open dialog opened to the "most recent" folder which is no longer available in my laptop's new location. But I can't change my mind now, no! I have to wait for the App and/or OS to time out and realize the shit is gone before I can direct the app's attention to where I really intended to go. >:C

Contrast this with well written AJAX software, such as Google's Gmail. You write a message and Gmail autosaves as you write: both locally to Gears and remotely to the server. When you make a decisive action such as hitting send, the local Gmail javascript invokes this command. If the response from the server is instant, then you move right the heck along. In case it is not, Gmail lets you know what's going on with a status line saying "sending", leaving the missive open. If it takes still longer the status will update to "Still sending..". The user can still interact with the application. For example, rub this troublesome message off into a new window (which continues valiantly trying to send) and then poke through the inbox to read other messages (which, if the user really has fallen offline, still may have already been locally downloaded to Gears).
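Here is a small Python sketch of that attitude applied to a slow read: do the I/O off the caller's thread, wait only as long as you are willing to, and fall back to something sensible (a cached copy, a "still loading" state) if the deadline passes. The helper name `read_with_deadline` and the shared `_pool` are my own inventions; the only real API used is the standard `concurrent.futures` module.

```python
import concurrent.futures

# A small shared pool so slow reads never run on the caller's thread.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def read_with_deadline(read_fn, timeout_s, fallback=None):
    """Run a possibly slow read off the caller's thread and give up after a deadline.

    `read_fn` is any zero-argument callable that performs the I/O.
    Returns (value, True) on success or (fallback, False) if it timed out.
    The abandoned read keeps running in the background; Python threads
    cannot be killed, so this only frees the caller, not the worker.
    """
    future = _pool.submit(read_fn)
    try:
        return future.result(timeout=timeout_s), True
    except concurrent.futures.TimeoutError:
        future.cancel()  # only helps if the read had not started yet
        return fallback, False
```

The caller gets control back either way, so the app can keep repainting its window and let the user change their mind about that dead network share. Note the honest limitation in the docstring: the stuck read is abandoned, not cancelled.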

This is the kind of decoupled attitude towards resources that all software should embrace. You have what you have (CPU/RAM) and you know what, all else might be slow or unresponsive so deal with it gracefully. In the case of IRON/Cloud, you want to take that a little farther. Processes are spawned from software on one or more IRON nodes. The software should further be designed to maintain state in version controlled, easy to export and merge data sets which are regularly shared among nodes via the sky. Whatever software has not yet committed to its state hasn't really "happened". Whatever versions of a state have not yet been shared with other nodes remain in danger of being lost in case the specific hardware melts. Software designers should leave the mindset of writing code that masters all data from a single thread and begin thinking about spawning armies of threads running similar code to refine sharable, version controlled and conflict-management-friendly data sets. You don't need this paradigm shift just to take advantage of IRON/Cloud either; this sort of sea change has been a long time coming simply to support the growing popularity of cell based CPU architecture.
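As a very rough picture of what a "version controlled, easy to merge" data set could look like, here is a toy Python key/value state where every write carries a version stamp and replicas merge by keeping the higher stamp per key. `SharedState` is hypothetical, and the tie-break by node name is a crude stand-in for real conflict management, since it silently drops one of two truly concurrent writes.

```python
class SharedState:
    """Toy key/value state where every write carries a version stamp."""

    def __init__(self, node):
        self.node = node
        self.entries = {}  # key -> (version, node, value)

    def set(self, key, value):
        # Bump the version past whatever this replica has seen for the key.
        version = self.entries.get(key, (0, "", None))[0] + 1
        self.entries[key] = (version, self.node, value)

    def get(self, key):
        return self.entries[key][2]

    def merge(self, other):
        """Fold another replica's entries in; the higher stamp wins per key."""
        for key, theirs in other.entries.items():
            mine = self.entries.get(key)
            # Compare (version, node) only, so values need not be orderable.
            if mine is None or theirs[:2] > mine[:2]:
                self.entries[key] = theirs
```

Two replicas can each take writes while out of touch, then merge in either order and end up with identical contents. Anything a replica wrote but never merged elsewhere is exactly the state that is "in danger of being lost" if that box melts.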

Not all applications will be engineered to such standards, especially early on. IRON/Cloud ought to be capable of running applications only lightly ported to the system which continue to behave monolithically (a POSIX compliance layer and ELF-binary support would be a great start), and perhaps even applications over virtual machines running in foreign guest operating systems. Still, these are merely training wheels and the true holy grail is applications built from ground up to participate completely in the IRON/Cloud mantra.

To this end, applications will likely be marked with any number of different properties that skies will take into consideration when launching them and spawning new processes. Properties like "serial" for old-style, unimportant or single-shot tasks vs "parallel" for tasks smart enough to run concurrently on different hardware. Properties specifying how highly available a task ought to be, and how much redundancy to plan for early on. Properties indicating a preference for geographic diversity or local responsiveness. Properties clarifying if the Task is capable of running disparate parallel child processes: such as worker threads for a web server, or parallel compute threads for heavy number crunching vs many threads all crunching the same numbers for simple redundancy.
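If I had to guess at what such a property sheet might look like, it could be as plain as the Python sketch below. Every field name here is invented, and the rule that a serial task cannot fan out child processes is my own assumption rather than anything the idea strictly requires.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    SERIAL = "serial"      # old-style or single-shot task
    PARALLEL = "parallel"  # safe to run concurrently on different hardware


class Placement(Enum):
    LOCAL = "local"            # favor responsiveness near the user
    GEOGRAPHIC = "geographic"  # favor spreading across sites


@dataclass(frozen=True)
class TaskProperties:
    mode: Mode = Mode.SERIAL
    min_replicas: int = 1               # how much redundancy to plan for
    placement: Placement = Placement.LOCAL
    independent_children: bool = False  # e.g. web workers vs identical clones

    def __post_init__(self):
        if self.min_replicas < 1:
            raise ValueError("a task needs at least one process")
        if self.mode is Mode.SERIAL and self.independent_children:
            raise ValueError("a serial task cannot fan out child processes")
```

A web server might be declared as `TaskProperties(mode=Mode.PARALLEL, min_replicas=3, placement=Placement.GEOGRAPHIC, independent_children=True)`, while a one-off script would just take the defaults.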

The Task is not ultimately responsible for micromanaging the creation of its own army of processes; the sky does that. The Task may request but should not expect to demand more or fewer processes be spawned. Hardware resources are still precious and contentious to the sky (and the enterprise) and Tasks ought to be able to, in the worst of times, run as single threads on single nodes sharing data sets with past and future versions of themselves courtesy of the custodial sky.

This is the expectation I personally have for operating systems today. As an industry we are woefully behind schedule. I honestly and pointedly consider every drawback separating the current state of the art and what I have just described as a SPECIFIC FAILURE. The way operating systems are today is so far behind what is reasonable to expect from our hardware that I feel as though we have no sewer systems and people are still emptying chamberpots out the window into the street.

I cannot really say why the industry has milled around the same watering hole for so many decades now without so much as budging along the obvious evolutionary path. Perhaps it's just that nobody has sat down and thought about where to go next? Perhaps the Software industry just needed someone to shine a flashlight in the appropriate direction?

If that's all it is then that is what I am doing right now. The future is that-a-way, now please proceed at best possible speed, or else for the love of God explain to us, your users, why not. This is what we want. Give us what we want or we will leave you behind.

That is all.

PS: I googled the keywords "Iron Cloud" to make certain the moniker I had chosen for my hypothetical reference OS was not stepping on anyone's toes. The closest match I found was that Amazon is teaming up with a project called "Casst Iron" for their cloud services and calling this combined project "Casst Iron Cloud". Though similar, this initiative has quite different goals from my hypothetical initiative. I hope I don't have to rename my abstract idea, as it would be a pain. Also, I used too many S's in the word "Casst" on purpose to prevent my own article being indexed and showing up in searches for Amazon's service. I don't believe they will market heavily missing the first word, so saying "Iron Cloud" or "IRON/Cloud" should still be safe to protect folk from confusion for the time being and keep Amazon from flinging lawyers at me. :P


ckaminski said...

You've written what I was outlining and drafting last year but never committed to paper. IRON/Cloud should be what the market espouses - you see some of this in the JRockit JVM for ESX - WebLogic without the need for an OS at all. Take this to the next logical level - each app is locked into its cute little VM, and can be migrated anywhere/anytime. Build fault tolerance into network connections, and some smarts into file IO, and there's no reason I can't ship my processing from my laptop to another PC in the office when I go home for the evening.

I look at the waste in computing power in datacenters and on desktops, and I want to see a convergence of ESX and BOINC/Terracotta style engines. The cloud shouldn't have some arbitrary wall that starts at the ethernet interface in my laptop - my laptop should be able to be a full-fledged member of said cloud.

Jesse said...

Yay! I am unfamiliar with many of the acronyms you cite here, but I feel as though we are on the same page. :)

Businesses (even Google) aren't pushing for cloud presence on consumer hardware, probably because of their overriding interest in acting as a profitable gatekeeper to customer data. They make money by keeping tabs on how you access the data, so all the data has to be stored on their side of the demarc. If the data and processes migrate to your hardware for realtime access, they can't view the subtleties of how you access it. :/

Still, that explains why large vendors stick to their model; it doesn't explain why OS vendors in charge of helping you maintain your own iron fail to follow the yellow brick road. :(