Tom Van Vleck is a computer security pioneer who spent years developing computer operating system architectures before there were PCs. While earning his bachelor of science in mathematics at MIT, he worked as a programmer for Project MAC, an MIT initiative funded by the Advanced Research Projects Agency. Project MAC involved numerous efforts, including the creation of the Multics operating system for the university’s General Electric mainframe. Built around time sharing, Multics (short for Multiplexed Information and Computing Service) was conceptualized as a utility to provide an interactive environment for multiple users, initially MIT research groups. It was developed in collaboration with Bell Telephone Laboratories (now Nokia Bell Labs) and GE. The system was designed with security built in, according to proponents, and several of its concepts are relevant to today’s operating systems. Bell Labs exited the project in 1969, and two of its former Multics programmers went on to create the Unix time-sharing system.
In addition to his work as an undergraduate, Van Vleck spent seven years as a researcher at the MIT Information Processing Center. He eventually joined Honeywell when the company took over commercial development of the Multics operating system from GE in the 1970s. Over the years, he has worked in various management, research and engineering roles for Tandem Computers, Taligent, CyberCash and SPARTA, a defense contractor acquired by Parsons in 2011.
Today, Van Vleck is an independent consultant in Ocean City, N.J., who specializes in web applications, software engineering and security. Here, he recalls the evolution of system control interfaces and what Multics got right — and wrong — as the computer industry moves to the cloud.
I’ll just jump right into this with an easy question: What’s the relationship between security and system administration? I see the new ‘cloud system automation’ as a push toward simplifying and rationalizing operating systems’ control interfaces.
Tom Van Vleck: Let me start by saying that my understanding has evolved. One of the things that I worked on was the Multics system administration facilities — and they were quite elaborate. They had many commands and tech manuals describing how to use all those commands, and we built a special-purpose subsystem with different commands for the system administrators and another set of different commands for the system operators. Remember when systems had operators who were trained to operate the computer, instead of having everyone operate their own computers with no training at all?
It was a big mistake, in retrospect. It used to be that after the Multics operating system crashed (in large part due to the hardware, which was much less reliable than it is now), the operator would have to go through a very complex set of recovery steps to get the system back up and all the files happy again. Over time, we realized that every place where the operator had to make a choice and type the right thing was a chance for them to type the wrong thing. Eventually, we evolved to a design where, when the system crashed, you said start it up again, and if it turned out that you had to run some recovery step, the system would decide whether or not to do it. We designed the recovery steps so they could run twice in a row with no negative effect. We aimed toward a completely lights-out, ‘no chance for mistakes’ interface.
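[Editor’s sketch: the two ideas in this answer, letting the system decide whether recovery is needed and making each recovery step safe to run twice, can be illustrated with a small Python example. It is not Multics code. The names `SystemState`, `needs_recovery`, `recover` and `start_up` are hypothetical, and the "orphaned entries" are an invented stand-in for whatever a real recovery step would repair.]

```python
from dataclasses import dataclass, field


@dataclass
class SystemState:
    """Toy stand-in for persistent state found after a crash (hypothetical)."""
    consistent: bool = False
    orphaned_entries: set = field(default_factory=set)
    log: list = field(default_factory=list)


def needs_recovery(state: SystemState) -> bool:
    # The system, not the operator, decides whether recovery is required.
    return not state.consistent or bool(state.orphaned_entries)


def recover(state: SystemState) -> None:
    # Idempotent: the step drives the state toward a fixed end condition,
    # so a second run finds nothing left to do and changes nothing.
    if state.orphaned_entries:
        state.log.append(f"released {len(state.orphaned_entries)} orphaned entries")
        state.orphaned_entries.clear()
    state.consistent = True


def start_up(state: SystemState) -> str:
    # The only operator action is "start it up again"; there is no choice to get wrong.
    if needs_recovery(state):
        recover(state)
    return "running"
```

Because `recover` only ever moves the state toward the same end condition, a crash in the middle of recovery can be handled by simply starting up again.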
Wherever you have complexity in the interface (and Unix and Linux are great examples of that), there’s a chance to set it wrong. I think, in general, that what we have to do is not to ‘de-skill’ the interface so much as ‘de-error’ the interface. My view is that there should be less system administration interface and fewer chances for error.
Some of the current-generation cloud systems present a very simplified interface. Those limited options might be characterized as ‘shoot yourself in the foot’ or ‘shoot yourself in the head.’
Van Vleck: There are times a system gets to the place where it has just added 1 + 1 and gotten 3, and it does not know what to do. As we worked on Multics reliability, I proposed that, if ever that happened, instead of crashing all the processors on the system and all the processes (most of which had no error), that we de-configure one CPU at random, until you ran out of CPUs and then you had to crash. In those days, CPUs just occasionally gave the wrong answer; it made sense to make some change that would have a possible positive effect rather than do nothing at all or do something really drastic.
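[Editor’s sketch: the proposal described here, shedding one CPU at random on an impossible result and crashing only when none can be spared, might look roughly like the following in Python. The function and exception names are hypothetical, and treating "one CPU left" as the point where a crash is unavoidable is an assumption, since the interview does not say exactly where that threshold was.]

```python
import random


class SystemCrash(Exception):
    """Raised when no CPU remains that could be deconfigured."""


def handle_impossible_result(active_cpus: set, rng: random.Random) -> int:
    """Deconfigure one CPU chosen at random and return its id.

    Crashes only when the system is down to its last CPU (assumed threshold).
    """
    if len(active_cpus) <= 1:
        # Nothing left to shed: now a full crash is the only option.
        raise SystemCrash("no CPU left to deconfigure")
    victim = rng.choice(sorted(active_cpus))  # sorted() gives choice() a sequence
    active_cpus.remove(victim)
    return victim
```

The design choice is to take the smallest step that might help: processes on the remaining CPUs keep running, and a drastic restart is reserved for the case where no milder option exists.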
We did a lot of thinking in Multics about how to change the idea of a ‘system crash.’ It used to be that if anything was wrong, we’d just say ‘Everybody out of the pool; start over!’ It turned out that was a mistake. It worked fine if you had the system designers in the same room as the computer, but even then it meant that after every crash there was a lot of running around: ‘What do we do next?’ ‘Just let me take a peek at the Q register.’ That kind of thing, instead of saying, ‘How do we make the shortest transition to the best operating mode that we can get?’
There were a lot of times when there would be 200 users on the system and their memory images of their processes were all ticking along, and then some obscure table somewhere in the hard core got corrupted. What you wanted to do was to have all of those 200 processes back, or maybe if one of the processes had been lied to and was totally messed up, you wanted to get 199 of the processes back. We began a push toward a recovery process that would take the minimum positive step instead of basically shooting everything down.
Did you find that mostly the administrators did the default behavior all of the time? It’s like fsck [file system consistency check] should only have one option, and that’s -y [attempt to fix detected filesystem corruption automatically] because that’s all anyone ever uses.
Van Vleck: That’s certainly true, and the Multics operating system had the ancestor of fsck, which was a [batch utility] program called the salvager. Actually, the predecessor to the Multics operating system, which was CTSS [Compatible Time-Sharing System, developed at MIT], had the salvager also — it just did what fsck does: run through the entire file tree making sure everything looked correct. I can remember pretty clearly about 1970: I was working at MIT and I was administrating the Multics system, which was a big multi-CPU machine that filled about half a basketball court — we also had an IBM/360 running CP/CMS [a time-sharing operating system]. One day, the Multics machine crashed, and the boss came in and said, ‘Oh dear, it crashed. How long until I can get back into the thing I was doing?’ We said, ‘Oh, well, we’ll run the salvage routine, and it’ll take about an hour.’ About that time, the CP/CMS machine crashed, and it was back on the air in three minutes. He said, ‘How come? How come this much less sophisticated operating system comes back so much faster?’ Good question!
That led us to a long redesign effort that changed the way the salvager was run so that it could be run on a single directory, and we put in a lot of checks in the system that said, ‘If anything looks funny, stop and salvage this current directory and try the call once more.’ It worked beautifully. The system went from being ‘Well, it’s great when it works’ to feeling rock-solid. People said, ‘Well, you can do those kinds of integrity checks on things when you’re debugging, but when we put it in production, it’ll slow things down.’ We measured, and the slowdown was not measurable, and the change in reliability was night and day in terms of a feeling of safety and solidity.
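[Editor’s sketch: the "if anything looks funny, salvage this directory and try the call once more" pattern can be illustrated in Python as below. This is a toy model, not the Multics salvager. A directory is modeled as a dict mapping names to lengths, a negative length is an invented stand-in for corruption, and all function names are hypothetical.]

```python
class InconsistentDirectory(Exception):
    """Raised by an integrity check when a directory looks wrong."""


def check(directory: dict) -> None:
    # Cheap integrity check that runs in production, not just while debugging.
    if any(length < 0 for length in directory.values()):
        raise InconsistentDirectory()


def salvage_directory(directory: dict) -> None:
    # Hypothetical repair: drop entries whose recorded length is invalid.
    for name in [n for n, length in directory.items() if length < 0]:
        del directory[name]


def total_length(directory: dict) -> int:
    # An ordinary operation that verifies its input before using it.
    check(directory)
    return sum(directory.values())


def call_with_salvage(operation, directory: dict):
    """Run operation; on an integrity failure, salvage this one directory and retry once."""
    try:
        return operation(directory)
    except InconsistentDirectory:
        salvage_directory(directory)
        return operation(directory)  # a second failure propagates to the caller
```

For example, `call_with_salvage(total_length, {"a": 3, "b": -1, "c": 4})` repairs only the affected directory and then completes the original call, instead of stopping everything for a full-tree scan.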
It sounds like you put a lot of thought into running your systems as a production system, and it seems that the current state of affairs is ‘how to run production as a disposable system.’
Van Vleck: We tended to make things just the way a power programmer would like it. There were a lot of features that would have only benefitted the system programmers, basically. Nobody else was so deeply invested in the system and all its features that they would ever get there — they would never dig deep enough to use those features. That’s fine if what you’re building is a programmer’s tool, but there are a lot of other things in the world to build and a lot of other uses for computers, and they have very different requirements.
Is part of the problem that we build general-purpose operating systems? Why isn’t there an environment in which I code and an environment in which I run production code?
Van Vleck: That’s been a dream for a long time — to build a low-level substrate that concentrates on being simple and clean. When you talked to Roger Schell, he made the case for a layered system and a system with a proven correct security kernel. Security, if you like, is one set of requirements, and reliability is a similar set but has different measurements. The same reasoning that can prove security attributes can be deployed to show integrity. The integrity constraints are very like the multilevel security requirements but reversed; they’re self-dual.
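[Editor’s sketch: Van Vleck does not name specific models here. Assuming he has in mind the textbook pairing of Bell-LaPadula confidentiality rules with Biba integrity rules, the "same rules, reversed" relationship can be shown in a few lines of Python. Levels are plain integers, with higher meaning more secret or more trusted, and categories or compartments are ignored. This is a simplification for illustration.]

```python
def blp_can_read(subject: int, obj: int) -> bool:
    # Confidentiality: no read up.
    return subject >= obj


def blp_can_write(subject: int, obj: int) -> bool:
    # Confidentiality: no write down.
    return subject <= obj


def biba_can_read(subject: int, obj: int) -> bool:
    # Integrity: no read down.
    return subject <= obj


def biba_can_write(subject: int, obj: int) -> bool:
    # Integrity: no write up.
    return subject >= obj


def reversed_order(rule):
    # Flip the level ordering by negating both levels.
    return lambda subject, obj: rule(-subject, -obj)
```

Under these assumptions, `reversed_order(blp_can_read)` agrees with `biba_can_read` for every pair of levels, and likewise for the write rules, which is one way to see why reasoning developed for one property can be reused for the other.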
There’s some relationship between administration, reliability and security. It’s always seemed to me that the reason we have security as a field at all is that, since the ’80s, we’ve screwed up system administration so badly that our systems are neither reliable nor secure.
Van Vleck: It’s sort of true. The cause and effect that you paint is arguable. We’re building code that only works sometimes; this is the big mistake. Whether those failures produce insecurity or unreliability is a second-order issue. We’re not building code to a high enough standard of doing what it’s supposed to do.
Some of it is that we’re vague about what we want it to do; some of it is that we ship and install code that we never should have shipped and installed.
I’m a Unix guy, so I’m thinking about the environment I grew up on. When we gave programmers a virtual machine image with (mostly) protected memory, the unit of computing was the process. But since it turned out that the kernel’s attempts to isolate processes from each other failed, the unit of computing became the virtual machine image, and now the problem is maintaining the isolation between VMs running in a stack. It’s the same problem with just another level of indirection added. Today, we’re heading toward disposable networks of disposable nonexistent machines.
Van Vleck: You know Steve Lipner? [A fellow MIT alumnus, Lipner is the executive director of nonprofit SAFECode and formerly served as the director of software security at Microsoft.]
I worked with him at Digital Equipment Corp. and for him at Trusted Information Systems in the early ’90s.
Van Vleck: He taught me what he called the first law of sanitary engineering. And it is, ‘The !&#*$’s gotta go somewhere.’ You can move some of the observed bad effect of sloppy code and fuzzy thinking from one place to another, but if that’s what you’ve got, it’s going to go somewhere.