What is BIG DATA?
BIG DATA is “the next frontier for innovation”.
Data that is beyond the storing and processing capacity of conventional database management systems is called “Big Data”. A huge amount of data is generated daily, on the order of petabytes, and the rate of data generation is increasing rapidly.
Characterization of BIG-DATA by the “4 V’s”
Volume: It is common for enterprises to store terabytes or even petabytes of data. (Volume simply means the size of the data: MB, GB, TB, PB, EB, ZB, YB…)
Velocity: The speed at which data is generated and travels through the network for processing.
Variety: Structured, semi-structured, and unstructured data.
Veracity: The uncertainty and trustworthiness of data.
Sources of BIG-DATA
The data comes from various sources: transactions, social media, sensors, digital images, CCTV cameras, online shopping, airline black boxes, videos, audio, search engines, and click-streams, across domains including healthcare, retail, energy, and utilities. About 90% of all the data available in the world today was generated in the last decade. Ex. New York Stock Exchange – 1 TB/day, Facebook – 1 PB/day, Internet Archive – 20 TB/month, Large Hadron Collider near Geneva – 15 PB/year.
Where is BIG-DATA used?
1. Understanding and Targeting Customers.
2. Understanding and Optimizing Business Processes.
3. Improving Healthcare and Public Health.
4. Improving Science and Research.
5. Optimizing Machine and Device Performance.
6. Financial Trading, and many other fields.
Different types of Data
1. Structured Data :
All data that can be stored in a database in row-and-column format, i.e. a relational database, and it is very simple to manage. Structured data makes up only 5-10% of all data.
2. Semi-structured Data :
Semi-structured data doesn’t reside in an RDBMS but has some organizational properties that make it easier to analyze. Ex. log files, CSV, XML.
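Because semi-structured data carries its own organizational markers (a header row, delimiters, tags), it can be parsed into structured records without a predefined database schema. A minimal sketch in Python, using a made-up CSV snippet:

```python
import csv
import io

# A semi-structured CSV snippet (hypothetical data): no database schema
# enforces its shape, but the header row and delimiters make it parseable.
raw = "user,action,timestamp\nalice,login,2016-01-01T10:00\nbob,purchase,2016-01-01T10:05\n"

# DictReader turns each line into a record keyed by the header fields,
# giving it the row-and-column structure an RDBMS would expect.
rows = list(csv.DictReader(io.StringIO(raw)))

print(rows[0]["user"])    # alice
print(rows[1]["action"])  # purchase
```

Log files and XML can be handled the same way: the format’s own structure, not a database, tells the parser where each field begins and ends.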
3. Unstructured Data :
All remaining data is considered unstructured data; it includes video, images, email, photos, audio, web pages, and much more. It doesn’t fit neatly into a database. Unstructured data makes up about 80% of all data, and it is growing exponentially faster than the other types. This data is either machine-generated or human-generated.
Machine-generated data: satellite images, scientific data, photographs, videos, radar or sensor data, and so on.
Human-generated unstructured data: mobile data, website data, social media data, text data, and so on.
Challenges with BIG-DATA
1) Capturing & Storing the data. (Collection and Storage)
2) Understanding and analysis of the data. (Data Analysis)
3) Synchronization across the Data Sources. (Data Transfer)
4) Getting and displaying meaningful Information out of that data. (Visualization)
Limitations of RDBMS
1) RDBMS cannot handle huge data volumes properly; the database management system has to be scaled up vertically.
2) The majority of the data comes in semi-structured or unstructured formats, while RDBMS can handle only structured data.
3) Big Data is generated at very high velocity. RDBMS falls short at high velocity because it is designed for steady data retention rather than rapid growth. Even if an RDBMS is used to handle and store Big Data, it turns out to be very expensive.
Tools for BIG-DATA
NoSQL: MongoDB, CouchDB, Cassandra, Redis, BigTable, HBase, Zookeeper
MapReduce: Hadoop, Hive, Pig, Cascading, Cascalog, Caffeine, S4, MapR, Flume, Kafka, Oozie
Storage: S3, Hadoop Distributed File System (HDFS)
Servers: EC2, Google App Engine, Elastic Beanstalk, Heroku
Processing: R, Yahoo! Pipes, Mechanical Turk, ElasticSearch, Datameer, BigSheets, Tinkerpop
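The MapReduce model behind tools like Hadoop, Hive, and Pig can be sketched in a few lines of plain Python. This is a single-machine toy (not how the frameworks are implemented): a map phase emits (word, 1) pairs, and a reduce phase groups them by key and sums the counts.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the document.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # Shuffle + Reduce: group pairs by key and sum the counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data big tools", "data grows fast"]
# In a real cluster the map tasks run in parallel on different machines;
# here they run one after another on a single machine.
pairs = [p for doc in docs for p in map_phase(doc)]
print(reduce_phase(pairs))  # {'big': 2, 'data': 2, 'tools': 1, 'grows': 1, 'fast': 1}
```

The key property is that each map call touches only one document and each reduce key is independent, which is what lets a framework spread the work across many machines.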
Applications of BIG-DATA
Stock exchange analysis
Social networking analysis
Telecommunication network monitoring and much more.
Current Storage = 300 PB
Process/Day = 600TB
Users/Month = 1 billion
Likes/Day = 2.7 billion
Photos uploaded/Day = 300 million
Current Storage = 5 EB
Process/Day = 30 PB
NSA touches 1.6% of internet traffic/day
(web searches, websites visited, phone calls, credit/debit card transactions, health and finance info)
Current Storage = 15 EB
Process/Day = 100 PB
Searches/Second = 2.3 million
Unique Searches/Month > 1 billion
Users: 37 million in 2009
Users: 450 million in 2016
Big Data System Requirements
To understand how the tools used to process large amounts of data work, you must first understand how a distributed computing framework works.
Storage: Store the massive amount of data
Process: Process the data in a timely manner
Scale: Scale easily as data grows
Traditional data technologies are not able to handle and process such huge amounts of data.
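One way the storage and scale requirements are met is by partitioning data across machines by key, so that capacity grows by adding nodes rather than by buying a bigger single server. A minimal sketch (the node count and records are made up for illustration):

```python
import zlib

def partition(records, num_nodes):
    # Assign each (key, value) record to a node by hashing its key.
    # crc32 is used because it is deterministic across runs, unlike
    # Python's built-in hash() for strings.
    nodes = [[] for _ in range(num_nodes)]
    for key, value in records:
        nodes[zlib.crc32(key.encode()) % num_nodes].append((key, value))
    return nodes

records = [("user1", "login"), ("user2", "click"),
           ("user3", "buy"), ("user1", "logout")]
nodes = partition(records, num_nodes=3)

# Every record lands on exactly one node, and records that share a key
# always land on the same node, so per-key processing stays local.
assert sum(len(n) for n in nodes) == len(records)
```

Adding a fourth node means re-hashing with `num_nodes=4`; real systems use consistent hashing to limit how much data moves, but the scale-out idea is the same.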
Approaches to solve Big Data problems
We can solve Big Data problems in two ways: by scaling up (adding more power to a single machine) or by scaling out (distributing the work across many machines).
Challenges with scaling up:
High cost of high-end hardware
Hard limits on the storage and processing capacity of a single machine
Challenges with scaling out:
Coordination between machines
Handling failure of machines
Hadoop is the solution for handling Big Data.
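Frameworks like Hadoop deal with the machine-failure challenge by detecting failed tasks and re-running them (in a real cluster, on a different healthy node). A minimal single-process sketch of that retry idea, using a hypothetical flaky task that stands in for a failing machine:

```python
def run_with_retries(task, attempts=3):
    # Re-run the task up to `attempts` times; a real framework would
    # reschedule the failed task on another machine instead.
    last_err = None
    for _ in range(attempts):
        try:
            return task()
        except Exception as err:
            last_err = err
    raise last_err

calls = {"n": 0}
def flaky_task():
    # Hypothetical task that fails twice before succeeding,
    # simulating two machine failures in a row.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("machine failed")
    return "done"

print(run_with_retries(flaky_task))  # done
```

The point is that failure is treated as normal: the system makes progress as long as retries (or other nodes) eventually succeed, rather than assuming every machine stays up.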