The term big data has become so common that it defies clear definition. A prevailing theme in any context is that big data is difficult data: difficult to store in traditional databases, difficult to process on standard servers, and difficult to analyze with typical applications. Even "smaller" data can exhibit a complexity that requires a new approach. As you explore more sources and types of data, you'll also need to identify tools and techniques for managing that data efficiently and extracting real value from it.
This guide illustrates two uses of Amazon Web Services to process big data. Getting Started: Sentiment Analysis shows you how to use Hadoop to evaluate Twitter data. Getting Started: Web Server Log Analysis shows you how to query Apache web server logs with Hive.
Key AWS Services for Big Data
With Amazon Web Services, you pay only for the resources you use. Instead of maintaining a cluster of physical servers and storage devices that stand by for possible use, you can create resources when you need them. AWS also supports popular tools like Hadoop and makes it easy to provision, configure, and monitor clusters for running those tools.
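To make the pay-for-what-you-use idea concrete, the following is a minimal sketch of the arithmetic only. The hourly rate, server count, and job schedule are hypothetical values chosen for illustration; they are not AWS prices, and the sketch ignores billing granularity, storage, and data transfer charges.

```python
def always_on_cost(hourly_rate: float, servers: int, hours_in_period: float) -> float:
    """Cost of keeping `servers` machines running for the entire period."""
    return hourly_rate * servers * hours_in_period


def on_demand_cost(hourly_rate: float, jobs: list[tuple[int, float]]) -> float:
    """Cost of paying only while jobs run; each job is (servers, hours)."""
    return sum(hourly_rate * servers * hours for servers, hours in jobs)


# Hypothetical figures for illustration only (not actual AWS pricing).
HOURLY_RATE = 0.10          # assumed cost per server-hour
HOURS_IN_MONTH = 30 * 24    # a 30-day month
jobs = [(10, 2.0)] * 4      # four jobs, each using 10 servers for 2 hours

standing = always_on_cost(HOURLY_RATE, 10, HOURS_IN_MONTH)
on_demand = on_demand_cost(HOURLY_RATE, jobs)

print(f"Standing cluster:   {standing:.2f}")
print(f"On-demand resources: {on_demand:.2f}")
```

Under these assumed numbers, the on-demand total covers 80 server-hours, while the standing cluster is billed for 7,200 server-hours whether or not any job is running. The gap narrows as utilization rises, so the comparison depends on how often your workload actually runs.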
The following table shows how Amazon Web Services can help you manage big data.
| Challenge | Amazon Web Services | Benefit |
|---|---|---|
| Data sets can be very large. Storage can become expensive, and data corruption and loss can have far-reaching implications. | Amazon Simple Storage Service (Amazon S3) | Amazon S3 can store large amounts of data, and its capacity can grow to meet your needs. It is highly redundant and secure, protecting against data loss and unauthorized use. Amazon S3 also has an intentionally small feature set to keep its costs low. |
| Maintaining a cluster of physical servers to process data is expensive and time-consuming. | Amazon Elastic Compute Cloud (Amazon EC2) | When you run an application on a virtual Amazon EC2 server, you pay for the server only while the application is running, and you can increase the number of servers (within minutes, not hours or days) to meet the processing needs of your application. |
| Hadoop and other open-source big-data tools can be challenging to configure, monitor, and operate. | Amazon EMR | Amazon EMR handles cluster configuration, monitoring, and management. Amazon EMR also integrates open-source tools with other AWS services to simplify large-scale data processing in the cloud, so you can focus on data analysis and extracting value. |