(This is a modified version of the keynote talk given by Rob Pike at the SPLASH 2012 conference in Tucson, Arizona, on October 25, 2012.)
The Go programming language was conceived in late 2007 as an answer to some of the problems we were seeing developing software infrastructure at Google. The computing landscape today is almost unrelated to the environment in which the languages being used, mostly C++, Java, and Python, had been created. The problems introduced by multicore processors, networked systems, massive computation clusters, and the web programming model were being worked around rather than addressed head-on. Moreover, the scale has changed: today's server programs comprise tens of millions of lines of code, are worked on by hundreds or even thousands of programmers, and are updated literally every day. To make matters worse, build times, even on large compilation clusters, have stretched to many minutes, even hours.
Go was designed and developed to make working in this environment more productive. Besides its better-known aspects such as built-in concurrency and garbage collection, Go's design considerations include rigorous dependency management, the adaptability of software architecture as systems grow, and robustness across the boundaries between components.
This article explains how these issues were addressed while building an efficient, compiled programming language that feels lightweight and pleasant. Examples and explanations will be taken from the real-world problems faced at Google.
Go is a compiled, concurrent, garbage-collected, statically typed language developed at Google. It is an open source project: Google imports the public repository rather than the other way around.
Go is efficient, scalable, and productive. Some programmers find it fun to work in; others find it unimaginative, even boring. In this article we will explain why those are not contradictory positions. Go was designed to address the problems faced in software development at Google, which led to a language that is not a breakthrough research language but is nonetheless an excellent tool for engineering large software projects.
Go is a programming language designed by Google to help solve Google's problems, and Google has big problems.
The hardware is big and the software is big. There are many millions of lines of software, with servers mostly in C++ and lots of Java and Python for the other pieces. Thousands of engineers work on the code, at the "head" of a single tree comprising all the software, so from day to day there are significant changes to all levels of the tree. A large custom-designed distributed build system makes development at this scale feasible, but it's still big.
And of course, all this software runs on zillions of machines, which are treated as a modest number of independent, networked compute clusters.
In short, development at Google is big, can be slow, and is often clumsy. But it is effective.
The goals of the Go project were to eliminate the slowness and clumsiness of software development at Google, and thereby to make the process more productive and scalable. The language was designed by and for people who write—and read and debug and maintain—large software systems.
Go's purpose is therefore not to do research into programming language design; it is to improve the working environment for its designers and their coworkers. Go is more about software engineering than programming language research. Or to rephrase, it is about language design in the service of software engineering.
But how can a language help software engineering? The rest of this article is an answer to that question.
When Go launched, some claimed it was missing particular features or methodologies that were regarded as de rigueur for a modern language. How could Go be worthwhile in the absence of these facilities? Our answer to that is that the properties Go does have address the issues that make large-scale software development difficult. These issues include:
Individual features of a language don't address these issues. A larger view of software engineering is required, and in the design of Go we tried to focus on solutions to these problems.
As a simple, self-contained example, consider the representation of program structure. Some observers objected to Go's C-like block structure with braces, preferring the use of spaces for indentation, in the style of Python or Haskell. However, we have had extensive experience tracking down build and test failures caused by cross-language builds where a Python snippet embedded in another language, for instance through a SWIG invocation, is subtly and invisibly broken by a change in the indentation of the surrounding code. Our position is therefore that, although spaces for indentation is nice for small programs, it doesn't scale well, and the bigger and more heterogeneous the code base, the more trouble it can cause. It is better to forgo convenience for safety and dependability, so Go has brace-bounded blocks.
A more substantial illustration of scaling and other issues arises in the handling of package dependencies. We begin the discussion with a review of how they work in C and C++.
ANSI C, first standardized in 1989, promoted the idea of#ifndef"guards" in the standard header files. The idea, which is ubiquitous now, is that each header file be bracketed with a conditional compilation clause so that the file may be included multiple times without error. For instance, the Unix header file<sys/stat.h>looks schematically like this:
/* Large copyright and licensing notice */ #ifndef _SYS_STAT_H_ #define _SYS_STAT_H_ /* Types and other definitions */ #endif
The intent is that the C preprocessor reads in the file but disregards the contents on the second and subsequent readings of the file. The symbol_SYS_STAT_H_, defined the first time the file is read, "guards" the invocations that follow.
This design has some nice properties, most important that each header file can safely#includeall its dependencies, even if other header files will also include them. If that rule is followed, it permits orderly code that, for instance, sorts the#includeclauses alphabetically.
But it scales very badly.
In 1984, a compilation ofps.c, the source to the Unixpscommand, was observed to#include<sys/stat.h>37 times by the time all the preprocessing had been done. Even though the contents are discarded 36 times while doing so, most C implementations would open the file, read it, and scan it all 37 times. Without great cleverness, in fact, that behavior is required by the potentially complex macro semantics of the C preprocessor.
The effect on software is the gradual accumulation of#includeclauses in C programs. It won't break a program to add them, and it's very hard to know when they are no longer needed. Deleting a#includeand compiling the program again isn't even sufficient to test that, since another#includemight itself contain a#includethat pulls it in anyway.
Technically speaking, it does not have to be like that. Realizing the long-term problems with the use of#ifndefguards, the designers of the Plan 9 libraries took a different, non-ANSI-standard approach. In Plan 9, header files were forbidden from containing further#includeclauses; all#includeswere required to be in the top-level C file. This required some discipline, of course—the programmer was required to list the necessary dependencies exactly once, in the correct order—but documentation helped and in practice it worked very well. The result was that, no matter how many dependencies a C source file had, each#includefile was read exactly once when compiling that file. And, of course, it was also easy to see if an#includewas necessary by taking it out: the edited program would compile if and only if the dependency was unnecessary.
The most important result of the Plan 9 approach was much faster compilation: the amount of I/O the compilation requires can be dramatically less than when compiling a program using libraries with#ifndefguards.
Outside of Plan 9, though, the "guarded" approach is accepted practice for C and C++. In fact, C++ exacerbates the problem by using the same approach at finer granularity. By convention, C++ programs are usually structured with one header file per class, or perhaps small set of related classes, a grouping much smaller than, say,<stdio.h>. The dependency tree is therefore much more intricate, reflecting not library dependencies but the full type hierarchy. Moreover, C++ header files usually contain real code—type, method, and template declarations—not just the simple constants and function signatures typical of a C header file. Thus not only does C++ push more to the compiler, what it pushes is harder to compile, and each invocation of the compiler must reprocess this information. When building a large C++ binary, the compiler might be taught thousands of times how to represent a string by processing the header file<string>. (For the record, around 1984 Tom Cargill observed that the use of the C preprocessor for dependency management would be a long-term liability for C++ and should be addressed.)
The construction of a single C++ binary at Google can open and read hundreds of individual header files tens of thousands of times. In 2007, build engineers at Google instrumented the compilation of a major Google binary. The file contained about two thousand files that, if simply concatenated together, totaled 4.2 megabytes. By the time the#includeshad been expanded, over 8 gigabytes were being delivered to the input of the compiler, a blow-up of 2000 bytes for every C++ source byte.
As another data point, in 2003 Google's build system was moved from a single Makefile to a per-directory design with better-managed, more explicit dependencies. A typical binary shrank about 40% in file size, just from having more accurate dependencies recorded. Even so, the properties of C++ (or C for that matter) make it impractical to verify those dependencies automatically, and today we still do not have an accurate understanding of the dependency requirements of large Google C++ binaries.
The consequence of these uncontrolled dependencies and massive scale is that it is impractical to build Google server binaries on a single computer, so a large distributed compilation system was created. With this system, involving many machines, much caching, and much complexity (the build system is a large program in its own right), builds at Google are practical, if still cumbersome.
Even with the distributed build system, a large Google build can still take many minutes. That 2007 binary took 45 minutes using a precursor distributed build system; today's version of the same program takes 27 minutes, but of course the program and its dependencies have grown in the interim. The engineering effort required to scale up the build system has barely been able to stay ahead of the growth of the software it is constructing.
When builds are slow, there is time to think. The origin myth for Go states that it was during one of those 45 minute builds that Go was conceived. It was believed to be worth trying to design a new language suitable for writing large Google programs such as web servers, with software engineering considerations that would improve the quality of life of Google programmers.
Although the discussion so far has focused on dependencies, there are many other issues that need attention. The primary considerations for any language to succeed in this context are:
With that background, then, let us look at the design of Go from a software engineering perspective.