How do big MNCs like Facebook store, manage, and manipulate thousands of terabytes of data?

Preetika Thakur
4 min readSep 17, 2020

Today it is hard to find a person who does not use social media, because the world is seeing drastic, exponential digital growth in every corner. According to a report, the total number of social media users increased from 2.46 billion in 2017 to 2.77 billion in 2019.

People use Facebook, Instagram, WhatsApp, and other social/messaging platforms throughout their daily routines.

Refer: How much time do you spend on social media?

This drastic growth of social media directly drives data generation. Yes, everything we do on social media, including every like, share, retweet, and comment, is stored as a record.

WHAT IS BIG DATA?

Big data is a term used for large datasets that require complex processing and visualisation and cannot be handled efficiently by traditional data-processing software.

Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.

A well-known model (known as 3V’s model) of big data attributed to Gartner Inc. defines it as:

“Big data is high volume, high velocity, and/ or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization”

The term ‘volume’ here indicates the sheer size of the datasets. ‘Variety’ refers to the different types of structured and unstructured data, such as text, numeric data, video, audio, and log files. ‘Velocity’ refers to the speed with which data is generated and made available for analysis. Sometimes other V’s are highlighted as well, such as ‘Veracity’ (data integrity and the organisation’s ability to use the data confidently) or ‘Value’ (does the new data enable the organisation to get more value?).

Some Characteristics of Big Data

SOME INTERESTING STATS

  • More than 90% of all the data in the world was generated in the last two years.
  • Every minute, we send 204 million emails, tweet 456 thousand times, generate 1.8 million “Likes,” and post 200 thousand photos on Facebook.
  • More than 100 hours of video are uploaded to YouTube every minute.
  • If all the data the world produces in a single day were burned onto DVDs, the stack would reach from here to the moon twice.
  • Five hundred seventy (570) new websites are set up and published on the Internet every minute.

Big Data and Hadoop ecosystem

To overcome the problems posed by big data, we use a concept known as distributed storage. A distributed storage system is an infrastructure that splits data across multiple physical servers, and often across more than one data center. It typically takes the form of a cluster of storage units, with a mechanism for data synchronization and coordination between cluster nodes.
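The idea of splitting and replicating data across nodes can be sketched in a few lines of Python. This is purely an illustration, not real distributed-storage code; the block size, replication factor, and node names are hypothetical values chosen to keep the example readable.

```python
# Illustrative sketch (not a real storage system): split a byte stream
# into fixed-size blocks and assign each block to several nodes, the way
# a distributed storage system replicates data for fault tolerance.
BLOCK_SIZE = 4          # bytes per block (hypothetical; HDFS defaults to 128 MB)
REPLICATION = 2         # copies of each block (hypothetical)
NODES = ["node-1", "node-2", "node-3"]

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Chop the data into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes=NODES, replication: int = REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for idx, block in enumerate(blocks):
        targets = [nodes[(idx + r) % len(nodes)] for r in range(replication)]
        placement[idx] = {"data": block, "nodes": targets}
    return placement

blocks = split_into_blocks(b"hello world!")
layout = place_blocks(blocks)
for idx, info in layout.items():
    print(idx, info["data"], info["nodes"])
```

Because every block lives on more than one node, losing a single server does not lose any data, which is exactly the property the paragraph above describes.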

To implement this concept of distributed storage, various technologies are available in the market, and one of them is Apache Hadoop. In the Hadoop framework, a cluster generally follows a master-slave topology: it comprises a single NameNode (master node), and all the other nodes are DataNodes (slave nodes). These DataNodes contribute their free RAM, CPU, and disk storage to the cluster managed by the NameNode.

Master-Slave Topology
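One job of the master in this topology is tracking which slaves are alive. The following toy sketch, loosely inspired by how HDFS DataNodes send periodic heartbeats to the NameNode, shows the idea; the class, timeout, and node names are all hypothetical and not Hadoop's actual implementation.

```python
# Illustrative sketch (not Hadoop's real code): a master node tracking
# worker heartbeats. A worker that has not reported within the timeout
# is treated as dead, and its blocks would become candidates for
# re-replication on the surviving nodes.
HEARTBEAT_TIMEOUT = 30.0  # seconds (hypothetical; real HDFS is configurable)

class NameNode:
    def __init__(self):
        self.last_heartbeat = {}   # DataNode name -> time of last report

    def receive_heartbeat(self, datanode: str, now: float):
        self.last_heartbeat[datanode] = now

    def dead_nodes(self, now: float):
        """Return DataNodes whose last heartbeat is older than the timeout."""
        return [node for node, t in self.last_heartbeat.items()
                if now - t > HEARTBEAT_TIMEOUT]

nn = NameNode()
nn.receive_heartbeat("datanode-1", now=0.0)
nn.receive_heartbeat("datanode-2", now=0.0)
nn.receive_heartbeat("datanode-1", now=40.0)   # datanode-2 stays silent
print(nn.dead_nodes(now=45.0))  # ['datanode-2']
```

This is why the NameNode is the single coordination point of the cluster: it never stores the file data itself, only metadata such as which blocks live on which DataNodes and which nodes are healthy.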

The power of the Hadoop platform is based on two main sub-components: the Hadoop Distributed File System (HDFS) and the MapReduce framework.

In the Hadoop architecture, data is stored and processed across many distributed nodes in the cluster. The HDFS is the module responsible for reliably storing data across multiple nodes in the cluster and for replicating the data to provide fault tolerance. Raw data, intermediate results of processing, processed data and results are all stored in the Hadoop cluster.
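The MapReduce side of Hadoop can be illustrated with the classic word-count example. The sketch below runs the map, shuffle, and reduce phases locally as plain Python functions; on a real cluster (for example via Hadoop Streaming, which lets you supply the mapper and reducer as scripts) each phase would run in parallel across many DataNodes.

```python
# A toy word count in the MapReduce style: the map phase emits
# (word, 1) pairs, a shuffle groups the pairs by key, and the reduce
# phase sums the counts for each word. Everything here runs locally.
from collections import defaultdict

def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Group emitted pairs by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data big clusters", "data everywhere"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 2, 'clusters': 1, 'everywhere': 1}
```

Because the map and reduce steps only see their own slice of the data, the same program scales from two lines of text to the petabyte-scale logs stored in a Hadoop cluster.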

How is Big Data Managed at Facebook?

Facebook relies on a massive installation of Hadoop, a highly scalable open-source framework that uses clusters of low-cost servers to solve problems. Facebook even designs its own hardware for this purpose. Hadoop is just one of many Big Data technologies employed at Facebook.
