Big Data Solution How-To: Deploy a Raw Disk Cloud Server on GoGrid

Overview

GoGrid just launched Raw Disk Cloud Servers, the perfect choice for your Hadoop data node. These purpose-built Cloud Servers run on the latest Intel Ivy Bridge processors and a redundant 10-Gbps network fabric, so both private and public traffic can communicate at up to 10 Gbps over redundant network hardware. What sets these servers apart, however, is the massive amount of raw storage in a JBOD (Just a Bunch of Disks) configuration: you can deploy up to 45 x 4 TB SAS disks on 1 Cloud Server.

These servers are designed to serve as Hadoop data nodes, which are typically deployed in a JBOD configuration. This setup maximizes available storage space on the server and also aids performance. Roughly 2 cores are allocated per spindle, giving these servers additional MapReduce processing power. In addition, these disks aren’t a virtual allocation from a larger device. Each volume is a dedicated, physical 4 TB hard drive, so you get the full drive per volume with no initial write penalty.

You can use Raw Disk Cloud Servers for purposes other than Hadoop, but they should typically be deployed as part of a cluster. At a minimum, your application should be able to handle replication and failover conditions. Because the disks are a JBOD, any data on a failed disk is most likely lost if you don’t have some type of replication or backup.

High-Performance Infrastructure

Raw Disk Cloud Servers maintain the OS disks separately from the data disks. Only the data disks are configured as JBODs, so any failures on the data disks have no impact on the OS. The JBODs are also direct attached, so there is no difference between the Raw Disk Cloud Servers and similar Dedicated Servers with JBOD. The storage is not apportioned but rather is a dedicated, physical SAS disk for each volume.
For the X-Large Raw Disk Cloud Server, for example, there are 3 physical 4-TB disks attached to the server.

Raw Disk Cloud Servers currently support only Linux. The disks arrive attached, but you need to lay down a file system and mount each drive yourself.

Step 1: Select a Raw Disk Cloud Server

You can deploy a Raw Disk Cloud Server from GoGrid’s management console or through an API call. From the management console, use the Add button and select the “Cloud Server” option. Make sure that you’re in the US-West-1 data center because that’s the only location that currently supports Raw Disk. You’ll be presented with an image selector; select any Linux 64-bit OS.

You’re then presented with some options for your Cloud Server. A drop-down called “Server Flavor” offers the following options: “All,” “Standard,” “SSD,” and “Raw.” This is a filter for the “Server Size” drop-down. If you select “Raw,” you’ll see only Raw Disk Cloud Server options under “Server Size.” Select the Cloud Server size you’re interested in and hit the Next button to select your subscription term and deploy your Cloud Server.

© Copyright 2014 GoGrid. All rights reserved. Various trademarks held by their respective owners.

Step 2: Configure Your JBOD Disks

The Raw Disk Cloud Servers use a different disk for the OS files (including root and swap). They’re not on the JBODs, so you only need to use the raw disks to store data. To find your volumes, run fdisk -l. You should see them attached as devices. They will appear as 4-TB devices because all the attached raw disks are that size. In most cases, the first volume will be called “/dev/xvdfa” and each additional volume will be an iteration of that name.
    Disk /dev/xvdfa: 4000.8 GB, 4000787030016 bytes
    255 heads, 63 sectors/track, 486401 cylinders, total 7814037168 sectors
    Units = sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disk identifier: 0x00000000

    Disk /dev/xvdfa doesn't contain a valid partition table

You’ll see an entry similar to this for every disk attached to your Cloud Server. You have the option of creating a partition, but doing so isn’t required. If you want a partition larger than 2 TB, you’ll need to use GNU parted, because the MBR partition tables that fdisk traditionally creates top out at 2 TB. Otherwise, you can just format the disk directly. If you’re using these Cloud Servers for Hadoop, ext3 has been extensively tested (it’s been publicly tested on Yahoo’s cluster), but ext4 should also work (and should have better performance with large files).

    mkfs.ext4 /dev/xvdfa

Step 3: Mount the Drive

You’ll need to create a mount point for the new drive on the file system. For example, to use a directory called “mydisk1,” enter mkdir /mydisk1 at the prompt. Once you’ve created the directory, you can mount your disk:

    mount /dev/xvdfa /mydisk1

You should now be able to read and write files in your mydisk1 directory. If you run df -h, you’ll see your drive and the mydisk1 mount point.

Step 4: Make the Drive Permanent

The steps above are core to getting the new device up and running. If you want the drive to mount automatically following reboots, however, you’ll need to add a line to your “/etc/fstab” file:

    /dev/xvdfa /mydisk1 ext4 defaults,nobootwait,noatime 0 0

This is a slight change from the typical “fstab” entry: nobootwait prevents Linux from stalling the boot if the device doesn’t exist. “0 0” means no automatic backup with dump (if activated on your Cloud Server) and no automatic file system check at boot.
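As a minimal sketch of the Step 4 entry, the helper below builds the same “fstab” line for any device and mount point and appends it only if the device isn’t already listed. The function names fstab_entry and add_fstab_entry are hypothetical, and the file path is a parameter so you can try it against a scratch copy before touching “/etc/fstab.”

```shell
#!/bin/sh
# Sketch only: fstab_entry and add_fstab_entry are hypothetical helper names.

# Print the fstab line from Step 4 for a given device and mount point.
fstab_entry() {
    device=$1
    mount_point=$2
    printf '%s %s ext4 defaults,nobootwait,noatime 0 0\n' "$device" "$mount_point"
}

# Append the entry to an fstab file unless the device is already listed.
# The file path is an argument so the function can be tried on a copy first.
add_fstab_entry() {
    device=$1
    mount_point=$2
    fstab_file=$3
    if grep -q "^$device[[:space:]]" "$fstab_file" 2>/dev/null; then
        echo "$device is already listed in $fstab_file"
        return 0
    fi
    fstab_entry "$device" "$mount_point" >> "$fstab_file"
}
```

For example, add_fstab_entry /dev/xvdfa /mydisk1 /tmp/fstab.copy writes the line to a scratch file you can inspect. Note that nobootwait is an option of Ubuntu’s mountall, so other distributions may not recognize it; nofail is the usual counterpart there, and you would need to adjust the sketch accordingly.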
If you leave the backup and file system check options turned on, the Cloud Server will stall. noatime prevents reads from turning into unnecessary writes, which helps improve performance (this is an optional setting, typically recommended for Hadoop).

Reboot and verify that you still see the drive and mount point. The easiest way to do so is to run df -h. The output will look like this:

    Filesystem      Size  Used Avail Use% Mounted on
    /dev/xvda2       36G  1.3G   33G   4% /
    udev            7.8G   12K  7.8G   1% /dev
    tmpfs           3.2G  224K  3.2G   1% /run
    none            5.0M     0  5.0M   0% /run/lock
    none            7.9G     0  7.9G   0% /run/shm
    /dev/xvda1      184M   42M  133M  25% /boot
    /dev/xvdfa      3.6T  196M  3.4T   1% /mydisk1

Step 5: Start Storing Stuff!

Now that you’ve mounted your drive, you can start using it as a data node for your Hadoop cluster. You can deploy any distribution of Hadoop that you prefer, or you can wait until we release our 1-Button Deploy of HBase.

You can also use these nodes for a large disk array. You’ll want some sort of software that can manage replication, or you can configure software RAID on your server. Either way, you’ll want to have multiple servers to protect against failure.

GoGrid is committed to releasing infrastructure that is designed to support Big Data applications, and you can expect to see more applications and infrastructure options coming soon!

This article is based on a blog post by Rupert Tagnipes.

About GoGrid

GoGrid enables companies to evaluate and run multiple, on-demand big data solutions quickly, simply, reliably, securely, and cost-effectively. As the leader in Open Data Services (ODS), GoGrid is committed to delivering purpose-built Big Data solutions and services for the management and integration of open source, commercial, and proprietary technologies across multiple platforms. With over 15,000 customers and over 600,000 VMs deployed, GoGrid has pioneered cloud infrastructure for more than a decade for companies like Condé Nast, Merkle, and Preventice. For more information, please visit www.GoGrid.com.
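Appendix: Preparing Several Raw Disks

Steps 2 and 3 of this how-to format and mount a single disk. When a Cloud Server has several raw disks, the same commands can be generated in a loop. The sketch below only prints the commands so you can review them before running anything. The function names plan_raw_disk and plan_all_raw_disks, the numbered “/mydiskN” mount points, and the use of mkdir -p are assumptions made for illustration, not part of GoGrid’s tooling.

```shell
#!/bin/sh
# Sketch only: plan_raw_disk and plan_all_raw_disks are hypothetical helpers.
# They print the commands from Steps 2 and 3 instead of running them.

# Print the format-and-mount commands for one raw disk.
plan_raw_disk() {
    device=$1
    mount_point=$2
    echo "mkfs.ext4 $device"
    echo "mkdir -p $mount_point"
    echo "mount $device $mount_point"
}

# Print the commands for every device passed as an argument,
# numbering the mount points /mydisk1, /mydisk2, and so on.
plan_all_raw_disks() {
    index=1
    for disk in "$@"; do
        plan_raw_disk "$disk" "/mydisk$index"
        index=$((index + 1))
    done
}
```

Pass in the device names that fdisk -l reported, for example plan_all_raw_disks /dev/xvdfa, and read the output carefully: mkfs.ext4 destroys any data already on a disk. Once the list looks right, you can run the printed commands as root.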
HT_Deploy-Raw-Disk-CS_20140130