Every company, from large enterprises to small startups, is investing in Big Data. Storage and hardware costs have dropped dramatically over the past few years, enabling businesses to store and analyze data that was previously discarded because of storage and processing challenges. We are seeing an explosion of data, with an entirely new scale and scope to the kinds of data we are trying to gain insights from. In this blog post, we will look at what Big Data is and how the Apache Hadoop framework comes into the picture when implementing Big Data solutions.
Having a platform that supports this data explosion is today's challenge, and we also need to make that platform easy for end users to access so they can gain insight and make better decisions. Think about the user experience of everything we can now do on the Web: how we discover, share, and collaborate in new ways. Users expect the same from their business and productivity applications, which is a challenge in itself.
Big Data includes all types of data:
- Structured: the data has a schema, or a schema can be easily assigned to it.
- Semi-Structured: has some structure, but columns are often missing or rows have their own unique columns.
- Unstructured: data that has no structure, like JPGs, PDF files, audio and video files.
Big Data also has two inherent characteristics:
- Time-based: a piece of data is something known at a certain moment in time, and that time is an important element.
- Immutable: because of its connection to a point in time, the truthfulness of the data does not change.
Big Data typically fits into one of the following types:
- Sentiment: Understand how your customers feel about your brand and products right now.
- Clickstream: Capture and analyze website visitors’ data trails and optimize your website.
- Sensor/Machine: Discover patterns in data streaming automatically from remote sensors and machines.
- Geographic: Analyze location-based data to manage operations where they occur.
- Server Logs: Research log files to diagnose and process failures and prevent security breaches.
- Text: Understand patterns in text across millions of web pages, emails, and documents.
Big Data vs. Data Warehouse
International Data Corporation characterizes Big Data by the 3Vs:
- Volume: Data volume is exploding. Over the last few decades, computing and storage capacity have grown exponentially, driving hardware and storage costs down to near-commodity levels. Huge amounts of data (petabytes, even zettabytes) now need to be processed within minutes, if not seconds.
- Variety: The variety of data is increasing, and nearly 90 percent of new data is unstructured. It can arrive as text, CSV, XML, or JSON with variable attributes and elements, and it is all getting stored.
- Velocity: The velocity of data is speeding up the pace of business. Data capture has become nearly instantaneous, and real-time analytics is more important than ever. Data continues to be generated far faster than it is consumed.
In a traditional data warehouse and business intelligence environment, the data that powers your reports usually comes from tables in a database. However, it is increasingly necessary to supplement this with data obtained from outside your organization. This may be commercially available datasets, such as those available from Windows Azure Marketplace and elsewhere; in most cases, you need to cleanse, validate, and transform this data before loading it into an existing data warehouse.
The data store in a Big Data implementation is usually referred to as a NoSQL store, although this is not technically accurate because some implementations do support a SQL-like query language. NoSQL storage is typically much cheaper than relational storage, and it usually supports a write-once model that allows data only to be appended.
What is Hadoop?
Apache Hadoop is one such system. Hadoop ties together a cluster of commodity machines with local storage, using free and open-source software to store and process vast amounts of data at a fraction of the cost of any other system. The following are the characteristics of Hadoop:
- Framework for solving data-intensive processes.
- Designed to scale massively.
- Processes all the contents of a file, instead of attempting to read portions of a file.
- Very fast for very large jobs.
- Not fast for small jobs.
- Does not provide caching or indexing natively; tools like HBase can provide these features if needed.
- Designed for hardware and software failures.
The core of Hadoop is its storage system and its distributed computing model:
- HDFS: Hadoop Distributed File System is a program level abstraction on top of the host OS file system. It is responsible for storing data on the cluster. Data is split into blocks and distributed across multiple nodes in the cluster.
- MapReduce: MapReduce is a programming model for processing large datasets using distributed computing on clusters of computers. MapReduce consists of two phases: dividing the data across a large number of separate processing units (called Map), and then combining the results produced by these individual processes into a unified result set (called Reduce). Between Map and Reduce, shuffle and sort occur.
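The classic word-count example used to teach MapReduce can be sketched in plain Python, with the Map, shuffle/sort, and Reduce phases as separate functions. This is a local simulation for illustration only, not the Hadoop API; the sample documents are hypothetical:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every input split.
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def shuffle_sort(pairs):
    # Shuffle and sort: group all intermediate values by key
    # before handing them to the reducers.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(grouped):
    # Reduce: combine the per-key values into a unified result set.
    return {key: sum(values) for key, values in grouped}

docs = ["big data big insight", "data at scale"]
counts = reduce_phase(shuffle_sort(map_phase(docs)))
# counts == {'at': 1, 'big': 2, 'data': 2, 'insight': 1, 'scale': 1}
```

In a real cluster, the Map and Reduce functions run on many nodes in parallel, and the framework performs the shuffle and sort over the network between them.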
A Hadoop cluster, once successfully configured, has the following basic components:
- NameNode: This is also called the Head Node/Master Node of the cluster. Primarily, it holds the metadata for HDFS during processing of data which is distributed across the nodes; it keeps track of each HDFS data block in the nodes.
- Secondary NameNode: This is an optional node that keeps periodic checkpoints of the NameNode metadata, which can serve as a backup when needed. However, there is no automated failover to the Secondary NameNode; if the primary NameNode goes down, manual intervention is needed. This essentially means there would be obvious downtime in your cluster if the NameNode goes down.
- DataNode: These are the systems across the cluster that store the actual HDFS data blocks. The data blocks are replicated on multiple nodes to provide fault tolerance and high availability.
- JobTracker: This is a service running on the NameNode, which manages MapReduce jobs and distributes individual tasks.
- TaskTracker: This is a service running on the DataNodes, which instantiates and monitors individual Map and Reduce tasks that are submitted.
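To make the NameNode/DataNode division of labor concrete, here is a toy Python sketch (not the real HDFS API) of splitting a file into fixed-size blocks and recording, as NameNode-style metadata, which DataNodes hold each replica. The block size, node names, and placement policy are simplified assumptions:

```python
BLOCK_SIZE = 4      # bytes, tiny for illustration (HDFS defaults are 64-128 MB)
REPLICATION = 3     # HDFS's default replication factor
datanodes = ["dn1", "dn2", "dn3", "dn4"]

def split_into_blocks(data, size):
    # HDFS splits a file into fixed-size blocks before distributing it.
    return [data[i:i + size] for i in range(0, len(data), size)]

def place_replicas(blocks, nodes, replication):
    # NameNode-style metadata: block index -> DataNodes holding a replica.
    # Real HDFS placement is rack-aware; round-robin is used here for brevity.
    return {
        i: [nodes[(i + r) % len(nodes)] for r in range(replication)]
        for i in range(len(blocks))
    }

blocks = split_into_blocks(b"hello hdfs!", BLOCK_SIZE)
metadata = place_replicas(blocks, datanodes, REPLICATION)
# 11 bytes -> 3 blocks, each replicated on 3 distinct DataNodes
```

If any single DataNode fails, every block it held still has two other replicas, which is what gives HDFS its fault tolerance.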
Relational Databases vs. Hadoop
Understanding how schemas work in Hadoop might help you better understand how Hadoop differs from relational databases:
- With a relational database, a schema must exist BEFORE the data is written to the database, which forces the data to fit into a particular model.
- With Hadoop, data is input into HDFS in its raw format without any schema. When data is RETRIEVED from HDFS, a schema can then be applied to fit the specific use case and needs of your application.
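This schema-on-read idea can be illustrated with a small Python sketch: raw delimited text stands in for a file stored as-is in HDFS, and a field list is applied only at read time. The sample data and field names are hypothetical:

```python
import csv
import io

# Raw data lands in storage exactly as produced, with no schema attached.
raw = "1,alice,2024-01-05\n2,bob,2024-02-11\n"

# The schema is chosen at read time to fit this particular use case.
schema = ("user_id", "name", "signup_date")

def read_with_schema(blob, fields):
    # Schema-on-read: parse raw text into named records only on retrieval.
    for row in csv.reader(io.StringIO(blob)):
        yield dict(zip(fields, row))

records = list(read_with_schema(raw, schema))
# records[0] == {'user_id': '1', 'name': 'alice', 'signup_date': '2024-01-05'}
```

A different application could read the same raw bytes with a different field list, which is exactly the flexibility a schema-on-write relational database does not offer.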
The Hadoop Ecosystem
Hadoop is more than HDFS and MapReduce. There is a large group of technologies and frameworks associated with Hadoop, including:
- Pig: a scripting language that simplifies the creation of MapReduce jobs and excels at exploring and transforming data.
- Hive: provides SQL-like access to your Big Data.
- HBase: a Hadoop database.
- HCatalog: for defining and sharing schemas.
- Ambari: for provisioning, managing and monitoring Apache Hadoop clusters.
- ZooKeeper: an open-source server which enables highly reliable distributed coordination.
- Sqoop: for efficiently transferring bulk data between Hadoop and relational databases.
- Oozie: a workflow scheduler system to manage Apache Hadoop jobs.
- Mahout: an Apache project whose goal is to build scalable machine learning libraries.
- Flume: for efficiently collecting, aggregating, and moving large amounts of streaming log data.
There are many other products and tools in the Hadoop ecosystem, including:
- Hadoop as a Service: Microsoft HDInsight
- Programming Frameworks: Cascading, Hama, and Tez.
- Data Integration Tools: Talend Open Studio.
We went through what Big Data is and why it is one of the compelling needs of the industry. The diversity of data that needs to be processed has taken information technology to heights never imagined before. Organizations that can take advantage of Big Data to parse any and every data source will differentiate themselves more effectively and derive new value for the business, whether in the form of revenue growth, cost savings, or entirely new business models. For example, financial firms use machine learning to build better fraud-detection algorithms that go beyond simple business rules about charge frequency and location to include an individual's customized buying patterns, ultimately leading to a better customer experience. These new requirements challenge traditional data management technologies and call for a new approach that enables organizations to effectively manage, enrich, and gain insights from any data. Apache Hadoop is one of the undoubted leaders in the Big Data industry. The entire ecosystem, along with its supporting projects, provides users a highly reliable, fault-tolerant framework for massively parallel distributed processing of unstructured and semi-structured data.