Big Data is a common buzzword in the world of IT nowadays, and it is important to understand what the term means: Big Data describes the realization of greater business intelligence by storing, processing, and analyzing data that was previously ignored or siloed due to the limitations of traditional data management technologies. Notice from this definition that there is more to Big Data than just “a lot of data”, and there is more to Big Data than just storing it:
- Processing: If you are just storing a lot of data, then you probably do not have a use case for Big Data. Big Data is data that you want to be able to process and use as part of a business application.
- Analyzing: In addition to making the data a part of your applications, Big Data is also data that you want to analyze (i.e. mine the data) to find information that was otherwise unknown.
Many articles have been published that add other “V” words to the classic three (volume, velocity, and variety), like veracity, viability, and value. These terms (and many others) can certainly be used to describe aspects of Big Data.
Big Data includes all types of data:
- Structured: the data has a schema, or a schema can be easily assigned to it.
- Semi-Structured: has some structure, but columns are often missing or rows carry their own unique columns.
- Unstructured: data that has no inherent structure, like JPGs, PDF files, audio and video files, etc.
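The semi-structured case is the easiest to misunderstand, so here is a small sketch using a few hypothetical JSON records. The field names and values are invented; the point is that no single fixed schema covers every row:

```python
import json

# Hypothetical semi-structured records: each row shares some fields,
# but columns are often missing or unique to a single row.
raw_rows = [
    '{"user": "ana",  "city": "SF", "device": "phone"}',
    '{"user": "ben",  "city": "NY"}',                      # no "device"
    '{"user": "cruz", "device": "tablet", "lang": "es"}',  # no "city"
]

records = [json.loads(row) for row in raw_rows]

# The union of keys shows why a fixed relational schema is a poor fit.
all_columns = sorted({key for rec in records for key in rec})
print(all_columns)  # every column seen across all rows
```

A relational table would need every one of these columns, mostly filled with NULLs; Hadoop simply stores the rows as-is.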
Big Data also has two inherent characteristics:
- Time-based: a piece of data is something known at a certain moment in time, and that time is an important element. For example, you might live in San Francisco and tweet about a restaurant that you enjoy. If you later move to New York, the fact that you once liked a restaurant in San Francisco does not change.
- Immutable: because of its connection to a point in time, the truthfulness of the data does not change. We look at changes in Big data as “new” entries, not “updates” of existing entries.
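A short sketch of what “new entries, not updates” means in practice. The users, cities, and timestamps below are invented; the pattern is an append-only fact log where a change of city is a new entry, never an overwrite:

```python
from datetime import datetime, timezone

# Append-only fact log: Big Data is immutable, so a change is a NEW
# entry, never an update of an existing one.
facts = []

def record_fact(user, city, when):
    facts.append({"user": user, "city": city, "at": when})

record_fact("sally", "San Francisco", datetime(2022, 1, 5, tzinfo=timezone.utc))
record_fact("sally", "New York",      datetime(2023, 6, 1, tzinfo=timezone.utc))

# The old fact still holds for its moment in time...
assert facts[0]["city"] == "San Francisco"

# ...while "current city" is simply the most recent entry.
current = max(facts, key=lambda f: f["at"])
print(current["city"])  # New York
```

Both facts remain true forever; only the question “where does sally live now?” picks the newer one.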
6 Key Hadoop Data Types
The type of Big Data that ends up in Hadoop typically fits into one of the following types:
- Sentiment: understand how your customers feel about your brand and products – right now!
- Clickstream: Capture and analyze website visitors’ data trails and optimize your website.
- Sensor/Machine: Discover patterns in data streaming automatically from remote sensors and machines.
- Geographic: Analyze location-based data to manage operations where they occur.
- Server Logs: Research log files to diagnose and process failures and prevent security breaches.
- Text: Understand patterns in text across millions of web pages, emails, and documents.
Sentiment Use Case
The goal was to use Twitter to determine how the public felt about the debut of the Iron Man movie, and how the movie company might better promote the movie based on that initial feedback. Here are the steps that were performed:
- Use Flume to get the Twitter feeds into HDFS.
- Use HCatalog to define a sharable schema for the data.
- Use Hive to determine sentiment.
- Use an Excel bar graph to visualize the volume of tweets.
- Use MS PowerView to view sentiment by country on a map.
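The actual sentiment scoring in this pipeline was done in Hive, but the core idea can be sketched in a few lines of Python. The word lists and tweets below are invented for illustration; real sentiment analysis uses far larger lexicons or trained models:

```python
import re

# Toy lexicon-based sentiment scorer (illustrative word lists).
POSITIVE = {"great", "awesome", "love", "amazing"}
NEGATIVE = {"boring", "terrible", "hate", "awful"}

def score_tweet(text):
    # Extract lowercase words, ignoring punctuation.
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

tweets = [
    "Iron Man was awesome, love the suit",
    "That was a boring movie",
    "Saw Iron Man last night",
]
scores = [score_tweet(t) for t in tweets]
print(scores)  # positive, negative, and neutral tweets
```

Aggregating such scores per country is then a simple GROUP BY, which is exactly the kind of query Hive expresses well.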
Geolocation Use Case
A trucking company collects sensor data from its trucks based on GPS coordinates and logs driving events like speeding, rapid acceleration, stopping too quickly, driving too close to other vehicles, and so on. These events are collected and put into Hadoop for analysis. The goal of the trucking company is to reduce fuel costs and improve driver safety by recognizing high-risk drivers.
- Flume is used to get the raw sensor data into Hadoop.
- Sqoop is used to get the data about each vehicle from an RDBMS into Hadoop.
- HCatalog contains all the schema definitions.
- Hive is used to analyze the gas mileage of trucks.
- Pig is used to compute a risk factor for each truck driver based on his/her events.
- Excel is used for creating bar graphs and maps showing where and how often events are occurring.
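The risk-factor step in this pipeline was done in Pig; the idea itself can be sketched in Python. The event names and weights below are invented assumptions, just to show the shape of the computation:

```python
from collections import Counter

# Hypothetical per-event severity weights (illustrative values).
EVENT_WEIGHTS = {"speeding": 2, "hard_stop": 3, "tailgating": 4}

def risk_factor(events):
    # Weighted count of a driver's logged events.
    counts = Counter(events)
    return sum(EVENT_WEIGHTS.get(e, 1) * n for e, n in counts.items())

driver_events = {
    "driver_17": ["speeding", "speeding", "hard_stop"],
    "driver_42": ["tailgating"],
}
risks = {d: risk_factor(ev) for d, ev in driver_events.items()}
print(risks)  # higher score = higher-risk driver
```

In Pig this is a GROUP ... BY driver followed by a FOREACH that sums the weighted events; the Python version just makes the arithmetic explicit.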
What is Hadoop?
Apache Hadoop is a system built for exactly these workloads. Hadoop ties together a cluster of commodity machines with local storage, using free and open source software to store and process vast amounts of data at a fraction of the cost of any other system.
- Framework for solving data-intensive problems: Of the potential bottlenecks in a computing system (CPU, RAM, network, and disk I/O), disk I/O is the one where data-intensive jobs spend most of their time waiting. Hadoop was designed to solve the problem of disk I/O.
- Designed to scale massively: To scale massively, it is important to keep things as simple as possible, provide redundancy, and avoid any dependence on shared state, such as locking files for operations. To meet these goals, the Hadoop file system is write-once and its files are immutable.
- Processes all the contents of a file: Although a file in HDFS can be opened at any byte offset, the typical assumption is that an application will read the whole file, or often a whole directory of files. The reasoning is that once the system has paid the expensive price of seeking to the start of the file, the most efficient way to achieve massive throughput is to have the disk spend most of its time transferring data, not seeking to find data.
- Hadoop is very fast for big jobs: Hadoop does scale. A 20-node cluster with 10 disks per machine running a large MapReduce job will have close to 200 disks reading and processing data all at once. The speedup of that much work done in parallel, compared to a non-parallel system, is significant.
- Hadoop is not fast for small jobs: If your data set is small and could fit on a single machine, you will find that Hadoop is slow compared to other systems. In general, if you think 5 minutes is “slow”, then perhaps the problem you have is not what is typically considered a Hadoop problem.
- No caching, no indexes: Core Hadoop does not provide these features. HBase and other applications built on top of Hadoop provide caching and a form of indexing.
- Designed for hardware and software failures: This is accomplished by “sharing nothing”. Core Hadoop components are designed to share as little state as possible. A DataNode does not know which file a block belongs to; it has a simpler job: read or write a block when requested. A map task writes to a temporary directory, and that data is thrown away on failure. A task either runs to success or fails completely, and subsequent attempts do not inherit state from the failed task.
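The “big jobs” claim above is easy to sanity-check with back-of-the-envelope arithmetic. The 100 MB/s figure is an assumed, typical sequential-read rate per disk, not a number from the post:

```python
# Aggregate read throughput for the 20-node cluster described above.
nodes = 20
disks_per_node = 10
mb_per_sec_per_disk = 100  # assumption: typical sequential-read rate

aggregate_mb_per_sec = nodes * disks_per_node * mb_per_sec_per_disk
print(aggregate_mb_per_sec / 1000, "GB/s")  # roughly 20 GB/s across the cluster
```

No single machine can read anywhere near that fast from local disks, which is the whole argument for spreading the data out.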
All of these features together create a powerful data processing framework that not only stores large amounts of data, but also processes it in a relatively short amount of time.
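The processing model behind those features is MapReduce. A minimal single-process sketch of its three phases (map, shuffle, reduce), applied to the classic word-count example, looks like this; Hadoop runs the same phases in parallel across the cluster:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (key, value) pair for each word.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big", "data big"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 3, 'data': 2}
```

Because map and reduce work on independent keys, each phase can be split across hundreds of disks with no shared state, which is exactly the "sharing nothing" design described above.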
Relational Databases vs. Hadoop
Understanding how schemas work in Hadoop might help you better understand how Hadoop is different from relational databases:
- With a relational database, a schema must exist before the data is written to the database, which forces the data to fit into a particular model.
- With Hadoop, data is loaded into HDFS in its raw format, without any schema. When the data is retrieved from HDFS, a schema can then be applied to fit the specific use case and needs of your application.
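This “schema-on-read” idea can be sketched in a few lines of Python. The file contents and column names here are made up; the point is that two applications can impose two different schemas on the same stored bytes:

```python
import csv
import io

# Raw stored data carries no schema of its own.
raw = "1,sally,San Francisco\n2,bob,New York\n"

def read_with_schema(raw_text, columns):
    # Apply a column interpretation at read time.
    reader = csv.reader(io.StringIO(raw_text))
    return [dict(zip(columns, row)) for row in reader]

# Two different "schemas" applied to the same raw bytes:
people = read_with_schema(raw, ["id", "name", "city"])
audit  = read_with_schema(raw, ["record_id", "subject", "location"])

print(people[0]["city"], audit[0]["location"])  # same value, different views
```

A relational database would have forced one of these schemas at write time; HDFS defers the decision to each reader.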
Important: Hadoop is not meant to replace your relational database. Hadoop is for storing Big Data, which is often the type of data that you would otherwise not store in a database due to size or cost constraints. You will still have your database for relational, transactional data.
The Hadoop Ecosystem
Hadoop is more than HDFS and MapReduce. There is a large group of technologies and frameworks that are associated with Hadoop, including:
- Pig: a scripting language that simplifies the creation of MapReduce jobs and excels at exploring and transforming data.
- Hive: provides SQL like access to your Big Data.
- HBase: a Hadoop database.
- HCatalog: for defining and sharing schemas.
- Ambari: for provisioning, managing, and monitoring Apache Hadoop clusters.
- ZooKeeper: an open source server which enables highly reliable distributed coordination.
- Sqoop: for efficiently transferring bulk data between Hadoop and relational databases.
- Oozie: a workflow scheduler system to manage Apache Hadoop jobs.
- Mahout: an Apache project whose goal is to build scalable machine learning libraries.
- Flume: for efficiently collecting, aggregating, and moving large amounts of log data.
There are many other products and tools in the Hadoop ecosystem, including:
- Hadoop as a Service: includes Microsoft HDInsight and Rackspace Private Cloud.
- Programming Frameworks: includes Cascading, Hama and Tez.
- Data Integration Tools: includes Talend Open Studio
We hope this gave you a useful overview of Hadoop and its ecosystem. Please share this blog post if you found it informative, and stay tuned for the upcoming posts.