Big Data and Hadoop

1. Three Vs of Big Data

  • Volume: > dozens of terabytes
  • Variety: unstructured / semi-structured / structured data
  • Velocity: data arrives continuously and often has value for only a limited time, so it may need to be analyzed before it is loaded into a data warehouse.

2. Hadoop

  • Hadoop is a framework for storing data on large clusters and running applications against that data.
  • Hadoop consists of two main components:
    • MapReduce (running on YARN in Hadoop 2.x): a distributed processing framework for distributed data sets.
      • Data consists of key-value pairs.
      • Computation has only two phases:
        • Map: input data is split into a large number of fragments, each of which is assigned to a map task. Map tasks are distributed across the cluster; each one processes the key-value pairs in its fragment and produces intermediate key-value pairs.
        • Reduce: the intermediate data set is partitioned among the reduce tasks and sorted by key. Each reduce task produces output key-value pairs, which are written to HDFS.
    • Hadoop Distributed File System (HDFS).
      • Master service: NameNode (controls access to data files)
      • Slave services: DataNodes (manage storage and serve read/write requests)
  • An application that is running on Hadoop gets its work divided among the nodes in the cluster.
  • HDFS stores the data that will be processed. 
  • A Hadoop cluster can span thousands of machines. HDFS stores the data, and MapReduce processes it on or near the nodes where it is stored (to keep I/O costs low).
  • Hadoop clusters typically consist of a few master nodes and many slave nodes.
    • Master nodes: control the storage and processing systems in Hadoop.
    • Slave nodes: store all of the cluster's data and are also where the data gets processed.
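
The two MapReduce phases described above can be illustrated with a small word count. This is only a single-process sketch in plain Python, not Hadoop's actual API: the function names (split_input, map_task, partition, reduce_task, word_count) are made up for illustration, the round-robin split and the CRC32-based partitioning are assumptions, and a real job would read its input from HDFS and write its output back to HDFS.

```python
from collections import defaultdict
from zlib import crc32


def split_input(lines, num_fragments):
    """Split the input into fragments, one per map task (round-robin here)."""
    fragments = [[] for _ in range(num_fragments)]
    for i, line in enumerate(lines):
        # Input key-value pair: (line number, line text).
        fragments[i % num_fragments].append((i, line))
    return fragments


def map_task(fragment):
    """Map: turn each (line number, line) pair into intermediate (word, 1) pairs."""
    return [(word.lower(), 1) for _, line in fragment for word in line.split()]


def partition(intermediate, num_reducers):
    """Assign every intermediate pair to a reduce task by hashing its key.

    CRC32 is used (an assumption of this sketch) because it is deterministic,
    so the same key always lands on the same reduce task.
    """
    partitions = [[] for _ in range(num_reducers)]
    for key, value in intermediate:
        index = crc32(key.encode("utf-8")) % num_reducers
        partitions[index].append((key, value))
    return partitions


def reduce_task(pairs):
    """Reduce: sort the pairs by key, group the values per key, and sum them."""
    grouped = defaultdict(list)
    for key, value in sorted(pairs):
        grouped[key].append(value)
    return [(key, sum(values)) for key, values in grouped.items()]


def word_count(lines, num_fragments=2, num_reducers=2):
    """Run the map phase, then the reduce phase, and collect the output pairs."""
    intermediate = []
    for fragment in split_input(lines, num_fragments):
        intermediate.extend(map_task(fragment))

    output = []
    for pairs in partition(intermediate, num_reducers):
        output.extend(reduce_task(pairs))
    return dict(output)


if __name__ == "__main__":
    print(word_count(["big data and hadoop", "hadoop stores big data"]))
```

On a real cluster each map_task and reduce_task call would run as a separate task on a slave node, ideally the node that already holds the fragment's data.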

3. Apache Hadoop Ecosystem Components

4. Releases and Distributions

  • Hadoop releases: available directly from the Apache Software Foundation (1.x, 2.x).
  • Distributions: 
    • Cloudera: CDH, plus value-added components on top of Hadoop.
    • EMC: SQL processing for Hadoop.
    • Hortonworks: HDP, plus paid support.
    • IBM: InfoSphere BigInsights, PureData System for Hadoop, and others.
    • Intel: a distribution for analyzing big data, optimized for Intel processors, SSD storage, and networking.
    • MapR: An enterprise-grade platform that supports many well-known customers.
