Posts filed under "Technology"

February 7, 2017

Open Source Lucasfilm’s Habitat Restoration Underway

[Images: Habitat Frontyard, taken 12/30/2017; Project Hub, taken 12/30/2017]

It’s all open source!

Yes – if you haven’t heard, we’ve got the core of the first graphical MMO/VW up and running and the project needs help with code, tools, doc, and world restoration.

I’m leading the effort, with Chip leading the underlying modern server: the Elko project – the Nth generation gaming server, still implementing the basic object model from the original game. [link] is the root of it all. [link] to join the project team Slack. [link] to fork the repo.

To contribute, you should be able to use a shell, fork a repo, build it, and run it. Current developers use: shell, Eclipse, Vagrant, or Docker.

To get access to the demo server (not at all bulletproofed), join the project.

We’ve had people from around the world in there already! (See the photos) #opensource #c64 #themade

[Images: Habitat Turf, taken 12/30/2017; Habitat Beach, taken 12/30/2017]

October 14, 2016

Software Crisis: The Next Generation

tl;dr: If you consider the current state of the art in software alongside current trends in the tech business, it’s hard not to conclude: Oh My God We’re All Gonna Die. I think we can fix things so we don’t die.

Marc Andreessen famously says software is eating the world.

His analysis is basically just a confirmation of a bunch of my long standing biases, so of course I think he is completely right about this. Also, I’m a software guy, so naturally I would think this is a natural way for the world to be.

And it’s observably true: an increasing fraction of everything that’s physically or socially or economically important in our world is turning into software.

The problem with this is the software itself. Most of this software is crap.

And I don’t mean mere Sturgeon’s Law (“90% of everything is crap”) levels of crap, either. I’d put the threshold much higher. How much? I don’t know, maybe 99.9%? But then, I’m an optimist.

This is one of the dirty little secrets of our industry, spoken about among developers with a kind of masochistic glee whenever they gather to talk shop, but little understood or appreciated by outsiders.

Anybody who’s seen the systems inside a major tech company knows this is true. Or a minor tech company. Or the insides of any product with a software component. It’s especially bad in the products of non-tech companies; they’re run by people who are even more removed from engineering reality than tech executives (who themselves tend to be pretty far removed, even if they came up through the technical ranks originally, or indeed even if what they do is oversee technical things on a daily basis). But I’m not here to talk about dysfunctional corporate cultures, as entertaining as that always is.

The reason this stuff is crap is far more basic. It’s because better-than-crap costs a lot more, and crap is usually sufficient. And I’m not even prepared to argue, from a purely Darwinian, return on investment basis, that we’ve actually made this tradeoff wrong, whether we’re talking about the ROI of a specific company or about human civilization as a whole. Every dollar put into making software less crappy can’t be spent on other things we might also want, the list of which is basically endless. From the perspective of evolutionary biology, good enough is good enough.

But… (and you knew there was a “but”, right?)

Our economy’s ferocious appetite for software has produced teeming masses of developers who only know how to produce crap. And tooling optimized for producing more crap faster. And methodologies and processes organized around these crap producing developers with their crap producing tools. Because we want all the cool new stuff, and the cool new stuff needs software, and crappy software is good enough. And like I said, that’s OK, at least for a lot of it. If Facebook loses your post every now and then, or your Netflix feed dies and your movie gets interrupted, or if your web browser gets jammed up by some clickbait website you got fooled into visiting, well, all of these things are irritating, but rarely of lasting consequence. Besides, it’s not like you paid very much (if you paid anything at all) for that thing that didn’t work quite as well as you might wish, so what’s your grounds for complaint?

But now, like Andreessen says, software is eating the world. And the software is crap. So the world is being eaten by crap.

And still, this wouldn’t be so bad, if the crap wasn’t starting to seep into things that actually matter.

A leading indicator of what’s to come is the state of computer security. We’re seeing an alarming rise in big security breaches, each more epic than the last, to the point where they’re hardly news any more. Target has 70 million customers’ credit card and identity information stolen. Oh no! Security clearance and personal data for 21.5 million federal employees is taken from the Office of Personnel Management. How unfortunate. Somebody breaks into Yahoo! and makes off with a billion or so account records with password hashes and answers to security questions. Ho hum. And we regularly see survey articles like “Top 10 Security Breaches of 2015”. I Googled “major security breaches” and the autocompletes it offered were “2014”, “2015”, and “2016”.

And then this past month we had the website of security pundit Brian Krebs taken down by a distributed denial of service attack originating in a botnet made of a million or so compromised IoT devices (many of them, ironically, security cameras), an attack so extreme it got him evicted by his hosting provider, Akamai, whose business is protecting its customers against DDOS attacks.

Here we’re starting to get bleedover between the world where crap is good enough, and the world where crap kills. Obviously, something serious, like an implanted medical device — a pacemaker or an insulin pump, say — has to have software that’s not crap. If your pacemaker glitches up, you can die. If somebody hacks into your insulin pump, they can fiddle with the settings and kill you. For these things, crap just won’t do. Except of course, the software in those devices is still mostly crap anyway, and we’ll come back to that in a moment. But you can at least make the argument that being crap-free is an actual requirement here and people will (or anyway should) take this argument seriously. A $60 web-enabled security camera, on the other hand, doesn’t seem to have these kinds of life-or-death entanglements. Almost certainly this was not something its developers gave much thought to. But consider Krebs’ DDOS — that was possible because the devices used to do it had software flaws that let them be taken over and repurposed as attack bots. In this case, they were mainly used to get some attention. It was noisy and expensive, but mostly just grabbed some headlines. But the same machinery could have as easily been used to clobber the IT systems of a hospital emergency room, or some other kind of critical infrastructure, and now we’re talking consequences that matter.

The potential for those kinds of much more serious impacts has not been lost on the people who think and talk about computer security. But while they’ve done a lot of hand wringing, very few of them are asking a much more fundamental question: Why is this kind of foolishness even possible in the first place? It’s just assumed that these kinds of security incidents will inevitably happen from time to time, part of the price of having a technological civilization. Certainly they say we need to try harder to secure our systems, but it’s largely accepted without question that this is just how things are. Psychologists have a term for this kind of thinking. They call it learned helplessness.

Another example: every few months, it seems, we’re greeted with yet another study by researchers shocked (shocked!) to discover how readily people will plug random USB sticks they find into their computers. Depending on the spin of the coverage, this is variously represented as “look at those stupid people, har, har, har” or “how can we train people not to do that?” There seems to be a pervasive idea in the computer security world that maybe we can fix our problems by getting better users. My question is: why blame the users? Why the hell shouldn’t it be perfectly OK to plug in a random USB stick you found? For that matter, why is the overwhelming majority of malware even possible at all? Why shouldn’t you be able to visit a random web site, click on a link in a strange email, or for that matter run any damn executable you happen to stumble across? Why should anything bad happen if you do those things? The solutions have been known since at least the mid ’70s, but it’s a struggle to get security and software professionals to pay attention. I feel like we’re trapped in the world of medicine prior to the germ theory of disease. It’s like it’s 1870 and a few lone voices are crying, “hey, doctors, wash your hands” and the doctors are all like, “wut?”. The very people who’ve made it their job to protect us from this evil have internalized the crap as normal and can’t even imagine things being any other way. Another telling, albeit horrifying, fact: a lot of malware isn’t even bothering to exploit bugs like buffer overflows and whatnot, a lot of it is just using the normal APIs in the normal ways they were intended to be used. It’s not just the implementations that are flawed, the very designs are crap.

But let’s turn our attention back to those medical devices. Here you’d expect people to be really careful, and indeed in many cases they have tried, but even so you still have headlines about terrible vulnerabilities in insulin pumps and pacemakers. And cars. And even mundane but important things like hotel door locks. And basically just think of some random item of technology that you’d imagine needs to be secure and Google “<fill in the blank> security vulnerability” and you’ll generally find something horrible.

In the current tech ecosystem, non-crap is treated as an edge case, dealt with by hand on a by-exception basis. Basically, when it’s really needed, quality is put in by brute force. This makes non-crap crazy expensive, so the cost is only justified in extreme use cases (say, avionics). Companies that produce a lot of software have software QA organizations within them who do make an effort, but current software QA practices are dysfunctional much like contemporary security practices are. There’s a big emphasis on testing, often with the idea you can use statistical metrics for quality, which works for assembling Toyotas but not so much for software because software is pathologically non-linear. The QA folks at companies where I’ve worked have been some of the most dedicated & capable people there, but generally speaking they’re ultimately unsuccessful. The issue is not a lack of diligence or competence; it’s that the underlying problem is just too big and complicated. And there’s no appetite for the kinds of heavyweight processes that can sometimes help, either from the people paying the bills or from developers themselves.

One of the reasons things are so bad is that the core infrastructure that we rely on — programming languages, operating systems, network protocols — predates our current need for software to actually not be crap.

In historical terms, today’s open ecosystem was an unanticipated development. Well, lots of people anticipated it, but few people in positions of responsibility took them very seriously. Arguably we took a wrong fork in the road sometime vaguely in the 1970s, but hindsight is not very helpful. Anyway, we now have a giant installed base and replacing it is a boil the ocean problem.

Back in the late ’60s there started to be a lot of talk about what came to be called “the software crisis”, which was basically the same problem but without malware or the Internet.

Back then, the big concern was that hardware had advanced faster than our ability to produce software for it. Bigger, faster computers and all that. People were worried about the problem of complexity and correctness, but also the problem of “who’s going to write all the software we’ll need?”. We sort of solved the first problem by settling for crap, and we sort of solved the second problem by making it easier to produce that crap, which meant more people could do it. But we never really figured out how to make it easy to produce non-crap, or to protect ourselves from the crap that was already there, and so the crisis didn’t so much go away as got swept under the rug.

Now we see the real problem is that the scope and complexity of what we’ve asked the machines to do has exceeded the bounds of our coping ability, while at the same time our dependence on these systems has grown to the point where we really, really, really can’t live without them. This is the software crisis coming back in a newer, more virulent form.

Basically, if we stop using all this software-entangled technology (as if we could do that — hey, there’s nobody in the driver’s seat here), civilization collapses and we all die. If we keep using it, we are increasingly at risk of a catastrophic failure that kills us by accident, or a catastrophic vulnerability where some asshole kills us on purpose.

I don’t want to die. I assume you don’t either. So what do we do?

We have to accept that we can’t really be 100% crap free, because we are fallible. But we can certainly arrange to radically limit the scope of damage available to any particular piece of crap, which should vastly reduce systemic crappiness.

I see a three pronged strategy:

1. Embrace a much more aggressive and fine-grained level of software compartmentalization.

2. Actively deprecate coding and architectural patterns that we know lead to crap, while developing tooling (frameworks, libraries, etc.) that makes better practices the path of least resistance for developers.

3. Work to move formal verification techniques out of ivory tower academia and into the center of the practical developer’s work flow.

Each of these corresponds to a bigger bundle of specific technical proposals that I won’t unpack here, as I suspect I’m already taxing a lot of readers’ attention spans. I do hope to go into these more deeply in future postings. I will say a few things now about overall strategy, though.

There have been, and continue to be, a number of interesting initiatives to try to build a reliable, secure computing infrastructure from the bottom up. A couple of my favorites are the Midori project at Microsoft, which has, alas, gone to the great source code repo in the sky (full disclosure: I did a little bit of work for the Midori group a few years back) and the CTSRD project at the University of Cambridge, still among the living. But while these have spun off useful, relevant technology, they haven’t really tried to take a run at the installed base problem. And that problem is significant and real.

A more plausible (to me) approach has been laid out by Mark Miller at Google with his efforts to secure the JavaScript environment, notably Caja and Secure EcmaScript (SES) (also full disclosure: I’m one of the co-champions, along with Mark, and Caridy Patiño of Salesforce, of a SES-related proposal, Frozen Realms, currently working its way through the JavaScript standardization process). Mark advocates an incremental, top down approach: deliver an environment that supports creating and running defensively consistent application code, one that we can ensure will be secure and reliable as long as the underlying computational substrate (initially, a web browser) hasn’t itself been compromised in some way. This gives application developers a sane place to stand, and begins delivering immediate benefit. Then use this to drive demand for securing the next layer down, and then the next, and so on. This approach doesn’t require us to replace everything at once, which I think means it has much higher odds of success.

You may have noticed this essay has tended to weave back and forth between software quality issues and computer security issues. This is not a coincidence, as these two things are joined at the hip. Crappy software is software that misbehaves in some way. The problem is the misbehavior itself and not so much whether this misbehavior is accidental or deliberate. Consequently, things that constrain misbehavior help with both quality and security. What we need to do is get to work adding such constraints to our world.

April 29, 2014

Troll Indulgences: Virtual Goods Patent Gutted [7,076,445]

[Image: Indulgence]

Another terrible virtual currency/goods patent has been rightfully destroyed – this time in an unusual (but worthy) way:

From Law360: EA, Zynga Beat Gametek Video Game Purchases Patent Suit, by Michael Lipkin

Law360, Los Angeles (April 25, 2014, 7:20 PM ET) — A California federal judge on Friday sided with Electronic Arts Inc., Zynga Inc. and two other video game companies, agreeing to toss a series of Gametek LLC suits accusing them of infringing its patent on in-game purchases because the patent covers an abstract idea. … “Despite the presumption that every issued patent is valid, this appears to be the rare case in which the defendants have met their burden at the pleadings stage to show by clear and convincing evidence that the ’445 patent claims an unpatentable abstract idea,” the opinion said.

The very first thing I thought when I saw this patent was: “Indulgences! They’re suing for Indulgences? The prior art goes back centuries!” It wasn’t much of a stretch, given the text of the patent contains this little fragment (which refers to the image at the head of this post):

Alternatively, in an illustrative non-computing application of the present invention, organizations or institutions may elect to offer and monetize non-computing environment features and/or elements (e.g. pay for the right to drive above the speed limit) by charging participating users fees for these environment features and/or elements.

WTF? Looks like reasoning along those lines was used to nuke this stinker out of existence. It is quite unusual for a patent to be tossed out in court. Usually the invalidation process has to take a separate track, as it has with other cases I’ve helped with, such as The Word Balloon Patent. I’m very glad to see this happen – not just for the defendants, but for the industry as a whole. Just adding “on a computer [network]” to existing abstract processes doesn’t make them intellectual property! Hopefully this precedent will help kill other bad cases in the pipeline already…

August 26, 2013

Randy’s Got a Podcast: Social Media Clarity


I’ve teamed up with Bryce Glass and Marc Smith to create a podcast – here’s the link and the blurb:

Social Media Clarity – 15 minutes of concentrated analysis and advice about social media in platform and product design.

First episode contents:

News: Rumor – Facebook is about to limit 3rd party app access to user data!

Topic: What is a social network, why should a product designer care, and where do you get one?

Tip: NodeXL – Instant Social Network Analysis

August 23, 2013

Patents and Software and Trials, Oh My! An Inventor’s View

What does almost 20 years of software patents yield? You’d be surprised!

I gave an Ignite talk (5 minutes: 20 slides advancing every 15 seconds) entitled

“Patents and Software and Trials, Oh My! An Inventor’s View”

Here’s some improved links…

I gave the talk twice, and the second version is also available (shows me giving the talk and static versions of my slides…) – watch that here:

April 14, 2011

We’re at it again and we’re hiring…

Chip has created the Nth generation of his massive-scale real-time server architecture (the spiritual descendant of Habitat) and we think the time is right for mobile/social games to go multiplayer! So we’ve gotten the band back together, and you can join us!

FUDCorp Job Openings

Real-Time Game Server Programmer, SF Bay Area

About us: a still-stealth start-up with a groundbreaking mobile/gaming platform that will reshape social games/apps. Get in on the ground floor with world-class founders and established technology. If you know us, you know what we’ve built since the earliest days of online play.

Your role:

  • Writing server-side Java code for an original massively multiplayer mobile online game
  • Writing/maintaining testing frameworks (mostly in JavaScript for Node.js) for rapid development and massive scale performance evaluation
  • This is a contract position, with potential to join our full-time team

Job Requirements:

  • Immediate Availability. Our recent successes (partners and funding) mean we need more help immediately!
  • San Francisco Bay Area. With live meetings at least weekly, increasing over time.
  • Minimum 3 years as a professional Java programmer working on client-server applications in a small, decentralized team.
  • Strong Linux/Unix skills: shell scripting, command line tools, server administration, etc.
  • Big plus: server-side JavaScript/ECMAScript skills, especially with Node.js
  • Big plus: experience with Amazon EC2, and optimizing server features for automatic deployment
  • Big plus: previous work with implementing social games, such as taxonomies, economies, abuse mitigation, and social issues
  • Big plus: experience with iPhone or Android app development

Please send resume and contact info to

September 7, 2009

Elko III: Scale Differently

Preface: This is the third of three posts on Elko, a server platform for sessionful, stateful web applications that I’m releasing this week as open source software. Earlier, Part I presented the business backstory for Elko. Part II, yesterday’s post, presented the technical backstory, laying out the key ideas that led to the thing. Today’s post presents a more detailed technical explication of the system itself, with particular emphasis on the scaling model that enables it all to work effectively.

In Part II I ranted at length about some of the unfortunate consequences of the doctrine of statelessness, the predominant paradigm for scaling web applications. Keeping the short-term state of a client-server session in the server’s memory is easy and therefore tempting, but, the story goes, you shouldn’t do that because it means you can’t scale your application: you just can’t handle the traffic from thousands or millions of users on the single machine whose memory it would be.

But this isn’t so much a server capacity problem as it is a traffic routing problem. In a traditional web server farm, load is distributed across multiple servers by arranging for successive HTTP requests to a particular named host to be delivered to different servers. Typically this is accomplished through provision of multiple IP addresses in the DNS resolution of the host name or through special load balancing routers in the server datacenter that virtualize the nominal host IP address, directing successive TCP sessions to different machines on the datacenter’s internal network.

This technique has a number of virtues, not least of which is that it is relatively simple. It takes advantage of the expectation that the loads that successive HTTP requests are going to place on the servers are likely to be uncorrelated, and thus delivering requests to servers on a simple round-robin schedule, or even randomly, will, through the statistical magic of large numbers, result in more or less even load distribution across the datacenter. This lack of correlation is usually a reasonable assumption, since the various browsers hitting a given site around the same time are, for most sites, uncoordinated (indeed, the deliberate coordination of such activity is the basis for a major class of denial of service attacks).

However, just as this scheme implies that a given browser has no control over (nor ability to predict) which server machine it’s actually going to be talking to when it sends an HTTP request, it similarly means that a given server has no say over which clients it will be servicing. Any service implementation that relies on local data coherence from one request to the next (other than of a statistical nature, as is exploited by caching) is thus doomed. Keeping session state in the server’s memory is right out.
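To make that concrete, here is a minimal sketch of round-robin dispatch in Python. It is purely illustrative and is not Elko code: the server names, the `simulate` function, and the per-client request counter kept in each server's memory are all assumptions made for the example. It shows both halves of the argument: load comes out even, yet state kept in any one server's memory gives the wrong answer.

```python
import itertools
from collections import Counter

SERVERS = ["web-1", "web-2", "web-3"]  # hypothetical server names


def simulate(num_requests, client="client-A"):
    """Send successive requests from one client through a round-robin balancer.

    Each server keeps a per-client request counter in its own memory,
    the way a naive stateful handler might.
    """
    rotation = itertools.cycle(SERVERS)          # round-robin schedule
    memory = {name: {} for name in SERVERS}      # each server's private memory
    load = Counter()                             # requests handled per server
    counts_seen = []                             # what the client is told each time
    for _ in range(num_requests):
        server = next(rotation)
        load[server] += 1
        count = memory[server].get(client, 0) + 1
        memory[server][client] = count
        counts_seen.append(count)
    return load, counts_seen


load, counts_seen = simulate(6)
print(dict(load))    # {'web-1': 2, 'web-2': 2, 'web-3': 2}
print(counts_seen)   # [1, 1, 1, 2, 2, 2]
```

The load is spread perfectly evenly, but the client's sixth request is reported as its second, because no single server ever saw the whole session.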

Elko approaches the scaling problem in a different way. First of all, we embrace the concept of a session: a series of interactions between the client and the server that has a beginning, a middle, and an end. This is by no means an exotic abstraction; indeed, the TCP protocol that HTTP is layered on top of is sessionful in exactly this way. However, HTTP then takes the session abstraction away from us, leaving it to the web application framework (of which, in this sense, Elko is just one of many) to pile on a bunch of additional mechanism to put it back in again.

Whereas, from the client’s perspective, a TCP session represents a communications connection to a particular host on the network, an Elko session represents a communications connection to a particular context. Like a web page, a context has a distinct, addressable identity. Unlike a web page, a context has its own computational existence independent of who is communicating with it at any given moment. In particular, multiple clients can interact with a given context at the same time, and the context itself can act independent of any of its individual clients, including when there are no clients at all. For example, in a multi-user chat application, the contexts would most likely be chat rooms. In a real-time auction application, contexts might represent the various auctions that are going on.

The Elko platform provides several different types of servers, all based on a common set of building blocks. However, for purposes of the present discussion, there are two that matter: the Context Server and the Director.

A Context Server provides an environment in which contexts run. Context Servers are generic and fungible in the same kinds of ways that web servers are: need more capacity? Just add more servers. The difference in the scaling story is that rather than handling load by farming out HTTP requests amongst multiple web servers, the Elko approach is to farm out contexts amongst multiple Context Servers.

In Elko, a context can be said to be active or inactive. An inactive context is saved in persistent storage, such as a file or a database. An active context exists in the process and memory space of some Context Server. The job of the Director is to keep track of which contexts are active and, when active, which Context Server each one is running on. When a client wishes to enter a particular context (that is, initiate a communications connection to it), the client sends a request to a Director asking where to go (these requests are routed to Directors using the kinds of standard web scaling techniques described above). If the context is active, the Director replies to the client with the address of the Context Server upon which the context is running (and notifies the Context Server to expect the client’s arrival), rather like this: [Figure: ActiveContext]

If the context is not active, the Director picks a Context Server to run the context, replies to the client with the address of this Context Server, and sends the chosen Context Server a message commanding it to activate the context, like this: [Figure: InactiveContext]

(Note that there is a race between the client arriving at the Context Server and the Context Server loading the context, but the implementation ensures that this is taken care of.)
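The Director's routing decision can be sketched in a few lines of Python. This is an illustration of the idea only, not Elko's actual (Java) implementation: the `Director` class, its `enter` method, the message tuples, and the rule of choosing the server with the fewest active contexts are all assumptions made for the example, and the race mentioned above is ignored.

```python
class Director:
    """Toy Director: tracks which Context Server runs each active context."""

    def __init__(self, context_servers):
        # address -> number of active contexts (a crude stand-in for load)
        self.load = {addr: 0 for addr in context_servers}
        self.active = {}   # context id -> address of the server running it
        self.outbox = []   # messages the Director would send to Context Servers

    def enter(self, client, context):
        """Return the address the client should connect to for this context."""
        addr = self.active.get(context)
        if addr is None:
            # Inactive context: pick a server and command it to activate.
            addr = min(self.load, key=self.load.get)
            self.active[context] = addr
            self.load[addr] += 1
            self.outbox.append((addr, "activate", context))
        else:
            # Active context: tell its server to expect the client's arrival.
            self.outbox.append((addr, "expect", context, client))
        return addr


director = Director(["ctx-1", "ctx-2"])     # hypothetical server addresses
print(director.enter("alice", "lobby"))     # ctx-1 (activates the context)
print(director.enter("bob", "lobby"))       # ctx-1 (already active there)
print(director.enter("carol", "games"))     # ctx-2 (least loaded server)
```

Everything the sketch consults is an in-memory dictionary, which is the property that keeps a real Director cheap to run.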

Unlike the members of a cluster of traditional web servers, the address of each Context Server is fixed. Thus, once the client connection to a particular Context Server is made, the client communicates with the same Context Server for all of its interaction needs in that context for as long as the session lasts. This means the Context Server can keep the context state in memory, only going to persistent storage as needed for checkpointing long-term application state. Once the last client exits a context, that context can be unloaded and the server capacity made available for other contexts.
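The context lifecycle just described (activate on first entry, serve from memory, checkpoint and unload when the last client leaves) might look roughly like the following Python sketch. It is not Elko's API: the `ContextServer` class, its method names, the chat-log state, and the plain dictionary standing in for persistent storage are all assumptions made for illustration.

```python
import copy


class ContextServer:
    """Toy Context Server: keeps active contexts in memory."""

    def __init__(self, storage):
        self.storage = storage   # dict standing in for a file or database
        self.contexts = {}       # context id -> in-memory state
        self.clients = {}        # context id -> set of connected clients

    def enter(self, client, context):
        if context not in self.contexts:
            # Activate: load the saved state, or start with an empty one.
            saved = self.storage.get(context, {"log": []})
            self.contexts[context] = copy.deepcopy(saved)
            self.clients[context] = set()
        self.clients[context].add(client)

    def say(self, client, context, text):
        # Served entirely from memory; persistent storage is not touched.
        self.contexts[context]["log"].append((client, text))

    def exit(self, client, context):
        self.clients[context].discard(client)
        if not self.clients[context]:
            # Last client left: checkpoint the state and free the memory.
            self.storage[context] = self.contexts.pop(context)
            del self.clients[context]


storage = {}
server = ContextServer(storage)
server.enter("alice", "lobby")
server.enter("bob", "lobby")
server.say("alice", "lobby", "hi")
server.exit("alice", "lobby")      # bob is still there, context stays loaded
server.exit("bob", "lobby")        # last one out, context is unloaded
print(storage["lobby"]["log"])     # [('alice', 'hi')]
```

A real server would also checkpoint periodically while a context is active; the sketch saves only on unload to keep the example short.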

The Context Servers keep the Directors apprised of the contexts they are handling, the clients that are in those contexts, and the server load they are currently experiencing. From this information, the Directors can route client traffic by context or by user (e.g., in a chat application, I may want to enter the chat room where my friends are, rather than a specific room whose identity I know a priori), and can identify the least heavily loaded servers for new context activation.

Directors can be replicated for scale and redundancy, but since they actually do very little work, one Director can handle the load for a large number of clients before capacity becomes an issue. Director scalability is also enhanced because servicing clients only makes reference to in-memory data structures, so everything the Director does is very fast and has quick turnaround.

This scheme scales very well. Because it has a very light footprint and services nearly everything from memory, even a single Context Server can manage a substantial load. We benchmarked the SAF Context Server, which had the identical architecture, in 2002 at Sun’s performance testing center in Menlo Park. On a Sun Enterprise 450 server (2 processor 400 MHz SPARC, a mid- to low-range machine even then), we ran a simulated chat environment, running 8000 concurrent connections spread over ~200 chat rooms, with an average fanout per room of ~40 users, with each client producing an utterance approximately every 30 seconds (in a 40 user chat room, that level of activity is positively frantic). This resulted in about 20% CPU load with no user detectable lag. Ironically, the biggest challenge in performing this test was generating enough load. We ended up having to use several of the biggest machines they had in the lab to run the client side of the test. Note also that this test was conducted three or four generations of server hardware ago. I expect that on modern machines, these numbers will be even more substantial.
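For a rough sense of what those benchmark figures imply in message traffic, here is a back-of-the-envelope calculation in Python. The inputs are the approximate numbers quoted above; the assumption that every utterance is delivered to all ~40 occupants of its room (speaker included) is mine, so treat the outbound figure as an estimate rather than a measured result.

```python
connections = 8000            # concurrent clients
rooms = 200                   # chat rooms (approximate)
seconds_per_utterance = 30    # each client speaks about this often

users_per_room = connections / rooms                          # 40.0
inbound_per_sec = connections / seconds_per_utterance         # messages arriving
per_room_per_sec = users_per_room / seconds_per_utterance     # chatter in one room
# Assumption: each utterance fans out to every occupant of its room.
outbound_per_sec = inbound_per_sec * users_per_room           # messages delivered

print(f"{inbound_per_sec:.0f} utterances/sec in")       # 267 utterances/sec in
print(f"{per_room_per_sec:.2f} utterances/sec per room")  # 1.33 utterances/sec per room
print(f"{outbound_per_sec:.0f} deliveries/sec out")     # 10667 deliveries/sec out
```

More than one new line of chat per second in every room is why that level of activity counts as frantic.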

One potential criticism of this scaling strategy is that it is more complicated than the way web servers usually do things. On the surface, I have to concede that that is true. However, by the time you take into consideration the extra work you need to do in an actual large-scale web setup, configuring routers and load balancers and memcache servers and database clusters and endless other complications, plus all the extra application engineering work to make use of these, I think Elko ends up being a simpler configuration. I know from experience that it’s a vastly simpler environment for the application coder.

So that’s the theoretical side of the scaling story. I invite anyone who has an interest in delving deeper to check things out for themselves. The code is here.

April 26, 2004

Announcing! Yahoo! Avatars

I’m working as Community Strategic Analyst for Yahoo!, where I’m helping to bring out next generation social software at a very large scale. I am proud to announce that the first new product from our group (this year) is Yahoo! Avatars support in Messenger 6.0 Beta, which was released today [Windows only]. Besides Avatar support, it now integrates LAUNCHcast Radio, Games, Addressbook, and adds sound effects called Audibles.

I think it is interesting that the original avatars walked, ‘talked’, and traded virtual objects in a virtual world with a (dis)functional virtual economy, but some of the latest incarnations include avatars that are more like paper-dolls and don’t interact with each other at all. Interesting market-driven optimization.