Google is a multi-billion-dollar enterprise and a dominant force on the World Wide Web and beyond. The company depends on a distributed computing system to provide users with the tools they need to access, create, and modify data. Surely Google invests in the latest high-performance computers and servers to keep everything running smoothly, right?
Incorrect. The machines that power Google's operations aren't the latest, high-end computers filled with fancy features. In fact, they are relatively affordable machines running on Linux operating systems. How can one of the most influential companies on the Web rely on inexpensive hardware? The answer lies in the Google File System (GFS), which takes full advantage of the strengths of standard servers while mitigating any hardware limitations. It’s all about the design.
Google utilizes GFS to manage and process massive files while offering application developers the resources they need for research and development. GFS is proprietary to Google and isn't available for purchase, but it could serve as a model for file systems used by organizations with similar requirements.
Certain aspects of the GFS remain undisclosed to those outside of Google. For instance, Google doesn’t disclose the exact number of computers used to run the GFS. In official Google documents, the company simply mentions that there are 'thousands' of computers in the system (source: Google). Despite this lack of transparency, Google has made significant portions of the GFS’s architecture and functionality publicly accessible.
So, what exactly does the GFS do, and why is it so crucial? Find out in the following section.
The GFS team tailored the system for appending data rather than rewriting it. This design choice was made because Google’s clients rarely need to overwrite files – they typically add new data to the end of existing files. While overwriting data is still possible in GFS, the system isn't optimized for those operations.
Introduction to the Google File System
Google developers regularly handle large files that traditional file systems struggle to manage. The size of these files influenced many of the design decisions for GFS. Another major factor was scalability, which refers to the system’s ability to expand its capacity easily. A scalable system can grow without a drop in performance. Google’s massive network of computers needs to manage an enormous volume of files, making scalability a key priority in GFS’s design.
Because the network is so vast, monitoring and maintaining it is a difficult task. In developing the GFS, the team opted to automate as many of the administrative tasks required to keep the system running as possible. This approach aligns with the concept of autonomic computing, in which computers diagnose and solve problems in real time without human involvement. The challenge for the GFS team was not just to create an automatic monitoring system, but to ensure it could operate efficiently across such a large network of machines.
The core principle guiding the team’s designs was simplification. They realized that as systems grow more complex, issues arise more frequently. A simpler design is easier to manage, even on a large scale.
Following this philosophy, the GFS team opted to provide users with a basic set of file commands, such as open, create, read, write, and close. They also included a few specialized commands: append and snapshot. These commands were designed around Google's specific needs. Append allows clients to add new data to an existing file without erasing what is already there, while snapshot creates a near-instant copy of a file or directory tree.
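To make that command set concrete, here is a minimal sketch, in Python, of what a GFS-style client interface might look like. The class and method names are purely illustrative assumptions; Google's actual client library is internal and its exact API has not been published.

    # Hypothetical client interface mirroring the command set described above.
    # The names and signatures are illustrative, not Google's real API.
    class GFSClient:
        def create(self, path):
            """Create a new, empty file at the given path."""

        def open(self, path):
            """Return a handle used by the read/write/append calls below."""

        def read(self, handle, offset, length):
            """Read 'length' bytes starting at 'offset'."""

        def write(self, handle, offset, data):
            """Overwrite data at 'offset' (possible, but not the optimized path)."""

        def append(self, handle, data):
            """Add data to the end of the file without touching existing contents."""

        def close(self, handle):
            """Release the handle."""

        def snapshot(self, source_path, copy_path):
            """Create a near-instant copy of a file or directory tree."""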
Files on the GFS are typically very large, often reaching several gigabytes in size. Working with such large files can consume a lot of the network’s bandwidth, which is the capacity for transferring data across a system. The GFS tackles this by breaking files into 64-megabyte chunks, each assigned a unique 64-bit identification number known as a chunk handle. While the system is capable of processing smaller files, it was not optimized for those tasks.
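As a rough illustration of how fixed-size chunks simplify addressing, the sketch below shows how a byte offset within a file maps to one of its 64 MB chunks. The lookup list stands in for the master's file-to-chunk-handle mapping, and the handle values are made up.

    CHUNK_SIZE = 64 * 1024 * 1024            # 64 MB, as described above

    def locate(offset, chunk_handles):
        """Return the chunk handle holding 'offset' and the offset inside that chunk."""
        index = offset // CHUNK_SIZE         # which 64 MB chunk the byte falls in
        return chunk_handles[index], offset % CHUNK_SIZE

    handles = [101, 102, 103, 104]           # stand-ins for 64-bit chunk handles
    print(locate(200_000_000, handles))      # -> (103, 65782272): the third chunk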
By ensuring that all file chunks are the same size, the GFS simplifies resource allocation. It’s easier to monitor which computers in the system are nearing their capacity and which ones are underutilized. This uniformity also makes it straightforward to move chunks from one resource to another, balancing the load across the entire system.
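Because every chunk is the same size, judging how full a chunkserver is reduces to counting its chunks. The fragment below is a simplified, hypothetical balancer that picks the least-loaded servers for new chunks; the server names and counts are invented.

    server_chunk_counts = {"cs-1": 9500, "cs-2": 4200, "cs-3": 7300}

    def least_loaded(counts, n=1):
        """Return the n chunkservers currently holding the fewest chunks."""
        return sorted(counts, key=counts.get)[:n]

    print(least_loaded(server_chunk_counts, 2))   # -> ['cs-2', 'cs-3']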
What is the actual structure of the GFS? Keep reading to uncover the details.
Distributed computing involves connecting multiple computers together to pool their individual resources in a shared network. Each computer contributes resources like memory, processing power, and storage to the collective system. This transforms the entire network into a gigantic computer, with each machine functioning as both a processor and a storage unit.
Architecture of the Google File System
Google designed the GFS using clusters of computers. A cluster is essentially a network of computers, which may contain hundreds or even thousands of machines. Inside GFS clusters, there are three primary types of entities: clients, master servers, and chunkservers.
In the context of GFS, a 'client' refers to any entity that sends a file request. These requests can range from accessing and modifying existing files to creating new ones within the system. Clients can be other computers or software applications. You can think of clients as the users of the GFS system.
The master server serves as the coordinator for the cluster. It maintains an operation log, a record of the metadata changes made within its cluster. This log minimizes service disruptions – if the master server crashes, a replacement server that has been following the operation log can step in. The master also keeps track of metadata, the information describing chunks, such as which files they belong to and where they fit within the larger file. Upon starting up, the master polls all the chunkservers in the cluster, and they respond with an inventory of the chunks they store. From then on, the master server tracks chunk locations within the cluster.
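A simplified model of that bookkeeping might look like the following. The structures and names here are assumptions made for illustration; they only mirror the description above: which chunks make up each file, where the replicas of each chunk live, and an operation log that a replacement master could replay.

    files = {
        "/logs/crawl-00": [7001, 7002, 7003],     # file -> ordered chunk handles
    }
    chunk_locations = {                            # rebuilt by polling chunkservers
        7001: ["chunkserver-a", "chunkserver-d", "chunkserver-k"],
        7002: ["chunkserver-b", "chunkserver-e", "chunkserver-k"],
        7003: ["chunkserver-a", "chunkserver-f", "chunkserver-m"],
    }
    operation_log = []

    def record(entry):
        """Append a metadata change so a standby master can replay the history."""
        operation_log.append(entry)

    record(("create", "/logs/crawl-01"))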
There is only one active master server in a cluster at any given time (although multiple copies exist for redundancy in case of hardware failure). While this might seem like a potential bottleneck – after all, one server is coordinating thousands of machines – the GFS avoids traffic jams by keeping the messages exchanged with the master server very small. The master server does not handle file data itself; that task is left to the chunkservers.
Chunkservers are the backbone of the GFS. They are tasked with storing the 64-MB file chunks. Instead of sending chunks to the master server, chunkservers send requested chunks directly to the client. The GFS duplicates each chunk multiple times, storing copies on different chunkservers. Each copy is called a replica. By default, the GFS creates three replicas for each chunk, though users have the flexibility to adjust the number of replicas as needed.
How do these components interact during a typical operation? Read on to discover more in the next section.
The GFS distinguishes between two types of replicas: primary replicas and secondary replicas. A primary replica is the chunk sent from a chunkserver to a client, while secondary replicas serve as backup copies on other chunkservers. The master server decides which chunks are primary or secondary. If a client modifies the data in a chunk, the master notifies the chunkservers holding secondary replicas to update their copies to reflect the changes from the primary replica.
Exploring the Google File System
The process for handling file requests follows a systematic approach. When a read request is made, the client sends a query to the master server to locate a specific file on the system. In response, the server provides the location of the primary replica of the requested chunk. This primary replica is associated with a lease granted by the master server.
If no replica has a lease at the time, the master server assigns the chunk's primary replica by evaluating the client's IP address and comparing it with the IP addresses of chunkservers storing the replicas. The master selects the chunkserver nearest to the client, and that chunkserver's copy becomes the primary. The client then connects directly to the chosen chunkserver to retrieve the replica.
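The sketch below illustrates the idea of handing the client the nearest replica. Real placement decisions use network topology; here, purely as a stand-in assumption, closeness is judged by how long an address prefix two machines share.

    def common_prefix(a, b):
        """Count how many leading characters two IP address strings share."""
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n

    def pick_nearest(client_ip, replica_ips):
        return max(replica_ips, key=lambda ip: common_prefix(client_ip, ip))

    print(pick_nearest("10.1.4.20", ["10.1.4.77", "10.2.9.3", "10.1.8.5"]))
    # -> 10.1.4.77, the chunkserver sharing the longest prefix with the client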
Write requests are a bit more involved. The client begins by sending a request to the master server, which returns the locations of both the primary and secondary replicas. The client then stores these details in its memory cache, enabling it to skip contacting the master server for future requests to the same replica. However, if the primary replica becomes inaccessible or is replaced, the client must consult the master server again to obtain the new replica information before reaching out to a chunkserver.
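A toy version of that client-side cache, with invented names, might behave like this: locations are served from the cache while the primary is still reachable, and a fresh query goes to the master otherwise.

    location_cache = {}     # chunk handle -> (primary, [secondary, ...])

    def locations_for(handle, ask_master, primary_reachable):
        if handle in location_cache:
            primary, secondaries = location_cache[handle]
            if primary_reachable(primary):
                return primary, secondaries       # served from the cache
            del location_cache[handle]            # primary changed or died
        location_cache[handle] = ask_master(handle)
        return location_cache[handle]

    fresh = lambda handle: ("cs-primary", ["cs-2", "cs-3"])
    print(locations_for(7001, fresh, lambda p: True))   # first call asks the master
    print(locations_for(7001, fresh, lambda p: True))   # second call hits the cache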
Once the client has the necessary replica locations, it proceeds to send the write data to each replica, starting with the one closest to it and progressing to the one farthest away. The order of the replicas—whether primary or secondary—does not matter. Google likens this data delivery system to a pipeline model.
Once the replicas receive the data, the primary replica begins to assign sequential serial numbers to each modification made to the file. These modifications are referred to as mutations. The serial numbers guide the replicas in determining the order in which the mutations should be applied. The primary replica then applies these mutations in the proper sequence to its own copy of the data. Afterward, it sends a write request to the secondary replicas, which follow the same process. If all goes as planned, all replicas across the cluster will reflect the updated data. The secondary replicas then report back to the primary once their updates are complete.
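The serial-number scheme can be modeled in a few lines of Python. This is only a sketch of the ordering idea, not Google's implementation: the primary numbers each mutation, applies it to its own copy, and forwards the numbered mutation so the secondaries apply changes in exactly the same order.

    class Replica:
        def __init__(self):
            self.data = {}           # chunk contents, modeled as offset -> bytes
            self.applied = []        # serial numbers applied so far, in order

        def apply(self, serial, offset, payload):
            self.data[offset] = payload
            self.applied.append(serial)

    class Primary(Replica):
        def __init__(self, secondaries):
            super().__init__()
            self.secondaries = secondaries
            self.next_serial = 0

        def mutate(self, offset, payload):
            serial = self.next_serial            # assign the next serial number
            self.next_serial += 1
            self.apply(serial, offset, payload)  # apply to the primary's copy first
            for s in self.secondaries:           # then forward to the secondaries
                s.apply(serial, offset, payload)
            return all(s.applied == self.applied for s in self.secondaries)

    primary = Primary([Replica(), Replica()])
    print(primary.mutate(0, b"new record"))      # -> True when every replica matches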
Once the update process is complete, the primary replica notifies the client. If the update was successful, the process ends here. If not, the primary replica communicates the issue to the client. For example, if one of the secondary replicas failed to update with a specific mutation, the primary replica informs the client and retries the mutation a few more times. If the secondary replica still fails to update correctly, the primary replica instructs it to restart the update process from the beginning. If this doesn't resolve the issue, the master server will mark the affected replica as garbage.
What other tasks does the GFS handle, and how does the master server manage garbage? Continue reading to learn more.
If a client sends a write request that affects several chunks of a very large file, the GFS splits the overall write request into separate requests for each individual chunk. The rest of the process follows the same steps as a regular write request.
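A sketch of that splitting step, using the 64 MB chunk size from earlier, might look like this; the function name and the example figures are illustrative.

    CHUNK = 64 * 1024 * 1024

    def split_write(offset, length):
        """Yield (chunk_index, offset_in_chunk, piece_length) for one large write."""
        end = offset + length
        while offset < end:
            index = offset // CHUNK
            in_chunk = offset % CHUNK
            piece = min(CHUNK - in_chunk, end - offset)
            yield index, in_chunk, piece
            offset += piece

    # A 150 MB write starting at the 50 MB mark touches chunks 0, 1, 2, and 3.
    for part in split_write(50 * 1024 * 1024, 150 * 1024 * 1024):
        print(part)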
Other Key Functions of the Google File System
Beyond its fundamental services, the GFS includes additional specialized functions designed to keep the system running smoothly. When creating the system, the GFS developers knew that certain challenges would follow from its architecture. Choosing affordable hardware made it possible to build a large-scale system at a reasonable cost, but it also meant that the individual computers in the system would not always be reliable, because low-cost hardware is more prone to failure.
To address the inherent unreliability of individual components, the GFS developers embedded various functions within the system. These functions include master and chunk replication, an efficient recovery process, rebalancing, stale replica detection, garbage removal, and checksumming.
While only one master server is actively running per GFS cluster, copies of the master server are maintained on other machines. These copies, known as shadow masters, provide limited services even when the primary master server is up and running. Their services are restricted to read requests, as such requests do not modify data. The shadow master servers typically lag slightly behind the primary master server, but the delay is usually only a fraction of a second. The master server replicas remain connected to the primary master server, monitoring the operation log and polling chunkservers to track the data. If the primary master server fails and cannot restart, a secondary master server can step in to take over.
The GFS replicates chunks to ensure data remains accessible even in the event of hardware failure. It stores these replicas on separate machines located across different racks. This way, if an entire rack fails, the data will still be available from another machine. The system uses the unique chunk identifier to validate each replica. If a replica’s identifier does not match the chunk’s handle, the master server creates a new replica and assigns it to a chunkserver.
The master server also oversees the cluster as a whole, periodically rebalancing the load by redistributing chunks across chunkservers so that they run near capacity but never at full capacity. The master continually monitors the chunks and checks that each replica is current. If a replica has missed updates and fallen behind the current version of its chunk, the master marks it as stale and treats it as garbage. After three days, the master server can delete a garbage chunk; the waiting period gives users a chance to review flagged chunks before permanent deletion and avoid unintended loss of data.
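Deferred deletion of this kind is easy to picture with a small sketch. The three-day grace period matches the text above; the data structures and function names are assumptions made for illustration.

    import time

    GRACE_SECONDS = 3 * 24 * 60 * 60          # the three-day waiting period
    garbage = {}                              # chunk handle -> time it was marked stale

    def mark_stale(handle, now=None):
        garbage[handle] = time.time() if now is None else now

    def purge(now=None):
        """Delete only the chunks whose grace period has fully elapsed."""
        now = time.time() if now is None else now
        expired = [h for h, marked in garbage.items() if now - marked >= GRACE_SECONDS]
        for handle in expired:
            del garbage[handle]               # a real system would reclaim the storage
        return expired

    mark_stale(7001, now=0)
    print(purge(now=GRACE_SECONDS + 1))       # -> [7001] once three days have passed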
To safeguard against data corruption, the GFS uses a technique known as checksumming. It divides each 64 MB chunk into smaller 64 KB blocks, and each block carries its own 32-bit checksum, which acts as a fingerprint of the block's contents. The system verifies these checksums when chunks are read; if a block's checksum doesn't match the expected value, the faulty replica is discarded and the master server creates a replacement from a healthy copy.
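Here is a minimal sketch of block-level checksumming, assuming 64 KB blocks as described and using CRC-32 as a stand-in for whatever checksum Google actually computes. Corruption anywhere in a chunk is pinned to the specific block whose checksum no longer matches.

    import zlib

    BLOCK = 64 * 1024                                     # 64 KB blocks

    def block_checksums(chunk_bytes):
        """One 32-bit checksum per 64 KB block of the chunk."""
        return [zlib.crc32(chunk_bytes[i:i + BLOCK])
                for i in range(0, len(chunk_bytes), BLOCK)]

    def find_corruption(chunk_bytes, expected):
        """Return the indexes of blocks whose checksum no longer matches."""
        return [i for i, c in enumerate(block_checksums(chunk_bytes))
                if c != expected[i]]

    data = bytearray(b"x" * (3 * BLOCK))
    expected = block_checksums(data)
    data[BLOCK + 5] = 0                                   # corrupt one byte in block 1
    print(find_corruption(data, expected))                # -> [1]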
Interested in the hardware behind Google's GFS? The next section will reveal more details.
The GFS system uses brief electronic updates called heartbeats and handshakes. These messages enable the master server to keep track of the status of each chunkserver.
Google File System Hardware
Google has not disclosed much about the specific hardware used for the GFS, stating only that it relies on a mix of inexpensive, off-the-shelf Linux servers. However, in an official report on GFS performance, Google shared the specs of the hardware used in certain benchmark tests. While this hardware may not reflect the exact systems in use today, it provides insight into the types of computers Google employs to manage the vast amounts of data it processes.
The test setup featured one master server, two master replicas, 16 clients, and 16 chunkservers, all with identical hardware specifications and all running Linux. Each machine had dual 1.4 GHz Pentium III processors, 2 GB of RAM, and two 80 GB hard drives. By comparison, many current consumer PCs are more than twice as powerful as the servers Google used in these tests. Google's developers demonstrated that GFS can function effectively even on modest hardware.
The network used to connect the systems featured a 100 Mbps full-duplex Ethernet link and two Hewlett Packard 2524 network switches. One switch connected the 16 client machines, while the other handled the remaining 19 machines. The two switches were linked by a 1 Gbps connection.
By lagging behind the forefront of hardware technology, Google can secure components at significantly lower prices. The GFS architecture is designed for easy expansion, allowing more machines to be added when needed. If a cluster nears its full capacity, Google can supplement the system with additional inexpensive hardware and redistribute the workload. If a master server's memory becomes overloaded, upgrading with more memory is a simple solution. The system is inherently scalable.
Why did Google opt for this system? Some attribute it to Google's recruitment approach. Known for hiring freshly graduated computer science majors and giving them the freedom and resources to experiment with systems like GFS, Google fosters an environment of innovation. Others suggest the company's success stems from a pragmatic 'do what you can with what you have' attitude, a trait shared by many computer system developers (including Google's founders). Ultimately, Google likely chose GFS because it was designed to manage the types of processes essential for the company’s mission of organizing the world’s information.
Bandwidth measures a system's ability to transfer data between locations, while latency indicates the delay between issuing a system command and receiving a response. System administrators typically aim for high bandwidth and low latency. However, Google developers focus more on bandwidth due to the large files their applications handle.
