Migrating Adbrain’s Apache Spark Jobs from HDFS to Amazon S3: How We Did It
Adbrain ingests and processes billions of data points daily, and it’s our data infrastructure team’s job to build a probabilistic identity graph so our clients can reach their audiences across devices. This requires a lot of infrastructure, and adding a new country or data source can have a major impact on the amount of data we need to process. We therefore needed a solution that could scale in a much more cost-efficient manner. To achieve this, our data infrastructure team completed one of our largest data migrations ever - moving petabytes of data from HDFS to Amazon S3.
Over the last couple of years, the size of Adbrain’s device graph has grown exponentially, and the requirements for our data warehouse have grown with it. The majority of our data was stored on HDFS - the Hadoop Distributed File System. Adbrain was running HDFS across 20 nodes, and this was getting expensive.
Furthermore, because HDFS relies on local storage that scales horizontally, increasing storage capacity means adding new nodes to the cluster, or (as we did) running an Apache Spark cluster (Spark is an open-source big data processing framework) across a mix of different node types. This also meant that we were unable to scale storage and compute independently, or fully leverage data co-location.
At the same time, it is business critical for Adbrain that we can ingest logs from many different data sources; build a customer graph, using statistical and machine learning algorithms; and run mappings and relevant graph extraction jobs for our clients. We also needed the capability to continuously scale the graph as required.
We considered a number of different strategies. Massively scaling HDFS and running Spark jobs independently was an option, and it did have its advantages: for instance, some jobs run best on compute-optimized instances, while others are memory-heavy. However, it also meant that we would need to provision clusters on demand, complicated by the fact that Adbrain has its own custom provisioning implementation on EC2. Adbrain also didn’t have a dedicated infrastructure person to operate HDFS - if the cluster is down, so are we. We concluded that this was an unacceptable level of operational risk.
With this approach off the table, we chose to move the HDFS data to Amazon S3. Then, at switchover time, we could simply redirect everything to the new storage. We already had data in Parquet format (a columnar data format, which is a great option for long-term storage of big data for analytics purposes); in fact, almost all of our Spark jobs use Parquet. There are some well-documented issues with a Parquet, Spark and S3 stack on Spark 1.3, but as of Spark 1.5 the combination is viable!
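Part of what made the switchover simple is that the jobs themselves barely changed: with the same directory layout mirrored into S3, the migration mostly amounted to rewriting storage URIs. A minimal sketch of the idea (the `to_s3_uri` helper, hostnames and bucket name are our own illustration, not Adbrain’s actual code):

```python
from urllib.parse import urlparse

def to_s3_uri(hdfs_uri: str, bucket: str) -> str:
    """Map an hdfs:// path onto an s3:// path in the given bucket,
    keeping the directory layout identical so jobs only need their
    input/output URIs swapped at switchover time."""
    parsed = urlparse(hdfs_uri)
    if parsed.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URI, got {hdfs_uri!r}")
    # Drop the namenode host:port; keep the path as-is.
    return f"s3://{bucket}{parsed.path}"

# A job that used to read Parquet from HDFS ...
old_path = "hdfs://namenode:8020/graph/devices.parquet"
# ... now reads the same layout from S3:
new_path = to_s3_uri(old_path, "example-device-graph")
print(new_path)  # s3://example-device-graph/graph/devices.parquet
```

Because Spark resolves the filesystem from the URI scheme, the Parquet read/write calls themselves stay untouched.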
Using Amazon EMR (think ‘Hadoop-as-a-Service’) and Parquet files, it is possible to read from and write to S3. There are a few gotchas - the final write phase is terribly slow - but we were able to tune the jobs to accommodate this. Furthermore, if you don’t need to store intermediate job results permanently, you can still use HDFS on the EMR nodes. For raw performance, HDFS is still faster than S3, but for us the difference was trivial.
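To give a flavour of the tuning involved, here is a hypothetical spark-defaults.conf fragment of the sort that helps with slow final writes to S3 in Spark 1.5-era deployments (treat it as illustrative, not as our exact configuration - in particular, the direct output committer’s package path differs between Spark versions):

```
# Speculative execution can launch duplicate copies of a task, which is
# wasteful and risky when each task writes directly to S3.
spark.speculation                         false

# Avoid the slow rename-based commit on S3 by writing Parquet output
# directly to its final location.
spark.sql.parquet.output.committer.class  org.apache.spark.sql.parquet.DirectParquetOutputCommitter
```

Intermediate results that don’t need to outlive the cluster can still be written to hdfs:/// paths on the EMR core nodes, with only the final output going to s3://.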
We’re now live and run all of Adbrain’s Spark jobs against S3. We couldn’t be happier: S3 offers better scalability (it automatically scales with the stored data, without us needing to change anything), built-in persistence (data on EC2 local storage doesn’t persist if you stop an instance), and a better price than HDFS, as you don’t need to store three copies of each block of data.
If you’re interested in helping us solve similar hard problems with big data at scale, check out Careers at Adbrain for more information.
Find out more
Adbrain's Cross-Device: What Marketers Need To Know white paper provides an introduction to cross-device marketing, acquainting you with the basics and providing insight to help you set your cross-device marketing plans in motion.
Download the White Paper!