How big is BIGDATA?

No fuss, straight talk.

--

One of the most popular catchphrases emphasizing the importance of data is:

"Data is the new oil."

During the COVID-19 pandemic of the last two years, the importance of quality data skyrocketed. As a result, the demand for up-to-date, accurate data, as well as for methods of organizing it, increased dramatically.

Exactly how big is Big Data?

Very big.

And now, time to wrap up… Kidding, kidding.

Calculating the real amount of data out there is challenging given the size of the internet. But if we're talking about the volume that qualifies as big data, a practical rule of thumb is this: data becomes big data when it no longer fits in an Excel or Google Sheets spreadsheet.

Article Summary

  1. Small Data Vs Big Data
  2. Need for Big Data
  3. Data Storage
  4. Batch Processing Vs Stream Processing
  5. Scalability of Data
  6. Hadoop Ecosystem
  7. Extract Transform Load (ETL)
  8. What is Spark?

Small Data Vs Big Data

Let’s compare small data, sometimes referred to as traditional data, with big data in depth.

Small data:

  1. Small data is a fixed-size set of data that is usually structured in a tabular fashion.
  2. Small data ranges from megabytes to gigabytes, topping out at a few terabytes.
  3. Small data grows steadily, and the growth is often gradual enough to be negligible.
  4. Small data is centralized: all users access a single server, which stores the data on a single node.
  5. Small data can easily be stored in Excel files, SQL Server, and other database servers.

Big data:

  1. Big data consists of large swaths of structured and, mostly, unstructured data. It can come in any format, including video, WARC, XML, and more.
  2. Big data is measured in petabytes, exabytes, or even yottabytes.
  3. Big data grows exponentially.
  4. Big data is decentralized: it can be stored and accessed globally across many nodes.
  5. Big data requires specialized software and technologies, such as Hadoop, Spark, BigQuery, and Hive, to store and process the data.

Need for Big Data

Understanding the four V's is necessary to comprehend why we need big data and what problems it solves. Let's define each one in turn.

  1. Volume: Organizations saw their data volumes growing rapidly, with vast amounts of data being stored.
  2. Velocity: New data arrived at a lightning-fast, ever-increasing rate.
  3. Variety: Data came in an ever-growing variety of formats, including PDF, MP4, and images.
  4. Veracity: The available data was quite disorganized and required a lot of cleaning.

Overcoming all of this brought challenges at every stage. The three main difficulties were storage, processing, and scalability.

Data Storage

Data warehouses are designed to store large amounts of data while keeping storage complexity manageable. Most data warehouses fall into one of two types:

  1. Monolithic: A monolithic data warehouse means setting up one very large server and packing it with resources so that all of the data lives on that single server.
  2. Distributed: A distributed data warehouse is a collection of different servers linked together via a network. The idea is for these individual servers to appear as a single global data warehouse.

Data Processing

Data stored in warehouses is processed in two broad ways:

  1. Transactional processing:

Transactional processing uses the stored data to perform CRUD (create, read, update, delete) operations inside warehouses. It covers operations such as retrieving data from the warehouse, updating it to a different value, or deleting it. This is the pattern behind the vast majority of online apps and eCommerce websites.
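
As a concrete (and deliberately tiny) illustration, here is what those CRUD operations look like against a SQLite database. SQLite, the table name, and the columns are purely illustrative and not part of the article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# Create
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("INSERT INTO users (name) VALUES (?)", ("Alice",))

# Read
print(cur.execute("SELECT id, name FROM users").fetchall())

# Update
cur.execute("UPDATE users SET name = ? WHERE id = ?", ("Alicia", 1))

# Delete
cur.execute("DELETE FROM users WHERE id = ?", (1,))

conn.commit()
conn.close()
```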

2. Analytical processing:

Analytical processing is used to visualize data, generate statistics from it, and extract insightful information.

The majority of Machine Learning and Deep Learning processes rely on analytical processing to obtain useful data for training models.
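
A minimal sketch of analytical processing, using pandas to aggregate a hypothetical orders table (the library choice and column names are mine, not the article's):

```python
import pandas as pd

# hypothetical orders data
orders = pd.DataFrame({
    "region": ["EU", "EU", "US", "US", "US"],
    "amount": [120.0, 80.0, 200.0, 50.0, 75.0],
})

# aggregate: total and average revenue per region
summary = orders.groupby("region")["amount"].agg(["sum", "mean"])
print(summary)
```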

Now that we’ve covered two categories of data processing, let’s look at two data processing techniques.

  1. Batch Processing
  2. Stream Processing

Batch Processing

Batch processing is the processing of large amounts of data in batches over a set period of time. It processes a large amount of data at once. When the data size is known and finite, batch processing is performed.

Batch processing is utilized in food processing systems, billing systems, and other similar applications.
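
For intuition, a batch job might look something like the sketch below: read one finite, known-size chunk of data, process it all at once, and write the result at the end. The file names and columns are hypothetical.

```python
import pandas as pd

# one finite batch: say, all of yesterday's transactions exported to a CSV file
batch = pd.read_csv("transactions_2024-01-01.csv")

# process the whole batch in one go
daily_totals = batch.groupby("customer_id")["amount"].sum()

# write the result once, at the end of the run
daily_totals.to_csv("daily_totals_2024-01-01.csv")
```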

Stream Processing

Stream processing is the processing of a continuous stream of data as it is produced. It analyses real-time streaming data. When the data size is unknown, unlimited, and continuous, stream processing is used.

Stream processing is utilized in the stock market, e-commerce transactions, social networking, and other applications.

Yes, you got it right: Twitter's trending hashtags are also a product of stream processing. A framework called Kafka is widely used to store, read, and analyze streaming data, and it replicates the stream for redundancy and fault tolerance.
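
By contrast with the batch sketch above, a stream consumer handles records one at a time as they arrive, with no known end. Here is a hedged sketch using the kafka-python client; the topic name, broker address, and hashtag logic are all made up for illustration, and it assumes a Kafka broker is actually running.

```python
from kafka import KafkaConsumer  # requires the kafka-python package

consumer = KafkaConsumer(
    "tweets",                            # hypothetical topic
    bootstrap_servers="localhost:9092",  # hypothetical broker
    value_deserializer=lambda raw: raw.decode("utf-8"),
)

# records are processed one by one, as they are produced
for message in consumer:
    for word in message.value.split():
        if word.startswith("#"):
            print("saw hashtag:", word)
```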

Scalability of Data

As the amount of data grew, engineers and scientists began investigating the problem, and Google took the initiative, publishing two landmark research papers: one on the Google File System (GFS) and one on MapReduce. These ideas were later reimplemented as open source under Apache, where the GFS design became HDFS (the Hadoop Distributed File System).

Hadoop Components

Hadoop is a distributed, open-source system used to manage, store, and process large amounts of data. Let's examine each of its three main components.

  1. Storage: HDFS

HDFS is the storage layer, with a structure built to hold many kinds of data. It has two types of nodes: worker nodes (DataNodes) that store the actual data, and a manager node (the NameNode) that tracks metadata, such as the file and directory structure.
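
As a quick, hedged illustration of how an application addresses data sitting in HDFS (using PySpark, which this article only introduces later): the hdfs:// URI points at the NameNode, while the blocks themselves are read from the DataNodes behind the scenes. Host, port, and path below are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-read-demo").getOrCreate()

# The URI targets the NameNode (manager node); the actual data blocks
# are fetched from the DataNodes (worker nodes) transparently.
logs = spark.read.text("hdfs://namenode:8020/data/weblogs/*.log")
print(logs.count())

spark.stop()
```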

2. Compute Engine: Map Reduce

MapReduce is the compute engine that makes the processing happen. The Map step splits the work across the various servers where the data lives, so each worker node processes the pieces stored locally. Once the map step is complete, the Reduce step merges the intermediate results.
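
The classic way to picture this is a word count. The toy sketch below runs on a single machine, but its three phases (map, shuffle, reduce) mirror what real MapReduce does across a cluster; the input strings are made up.

```python
from collections import defaultdict
from functools import reduce

documents = ["big data is big", "spark replaces map reduce"]  # toy input

# Map phase: each "worker" turns its chunk into (key, value) pairs
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group values by key (in real MapReduce this happens over the network)
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: combine the values for each key
word_counts = {word: reduce(lambda a, b: a + b, counts) for word, counts in groups.items()}
print(word_counts)  # {'big': 2, 'data': 1, 'is': 1, ...}
```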

3. Resource Manager: YARN

YARN (Yet Another Resource Negotiator) manages the cluster's resources and decides how worker nodes receive the instructions to carry out the work.

Extract Transform Load (ETL)

As the amount of data, data sources, and data types in organizations grows, so does the importance of using that data in both transactional and analytical processing. ETL (extract, transform, and load) is the process that data engineers use to extract data from various sources, transform the data into a usable resource, and load that data into systems that end-users can access.

While ETL was critical, with the exponential growth of data sources and types, building and maintaining reliable data pipelines became one of the more difficult aspects of data engineering.
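
To make the three letters concrete, here is a deliberately small ETL sketch in Python with pandas: extract from a hypothetical CSV export, transform it, and load it into a SQLite database. Every file, table, and column name here is invented for illustration.

```python
import sqlite3
import pandas as pd

# Extract: pull raw data out of a source system (a hypothetical CSV export)
raw = pd.read_csv("raw_orders.csv")

# Transform: clean and reshape it into a usable resource
orders = (
    raw.dropna(subset=["order_id", "amount"])
       .assign(amount=lambda df: df["amount"].astype(float))
)

# Load: push the cleaned data into a system end-users can query
conn = sqlite3.connect("analytics.db")
orders.to_sql("orders", conn, if_exists="replace", index=False)
conn.close()
```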

Hadoop Ecosystem

After a few successful years, engineers ran into challenges writing their ETL pipelines on Hadoop, because most of its components were built in Java. Companies like Facebook responded by building their own tooling on top of Hadoop, such as Hive.

But what was the main motivation?

See, Facebook wanted a system that didn't force them to hire numerous engineers with specialized backgrounds, such as Java, just to get the job done. They also wanted an intermediary layer that would let them use SQL queries to access the big data store.

However, with only a handful of components available, engineers kept wanting a single solution that could cover all of their needs and, most importantly, support a variety of programming languages. Out of these limitations, Spark, one of the most powerful big data tools, was born.

What the heck is Spark?

Spark is merely a distributed, general-purpose, in-memory computing engine.

One of the primary reasons Spark ranks at the top in big data is that, although it is written in Scala, our ETL processes can be written in Python, Java, Scala, SQL, or R.

Spark replaces Hadoop's MapReduce compute engine, so Spark does not depend on Hadoop itself.

However, we often keep HDFS as the storage layer and YARN as the resource manager. What if we decide not to use them?
No problem: there are now plenty of big data storage options that can replace HDFS, such as AWS S3, and YARN can be swapped out for Kubernetes or Apache Mesos.
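
As a small teaser before the next post, here is a minimal PySpark sketch of the kind of thing Spark makes easy: spin up a local session, read a file, and aggregate it in memory. The file name and columns are hypothetical; on a real cluster, the master would point at YARN, Kubernetes, or Mesos instead of local[*].

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("spark-demo")
    .master("local[*]")  # local cores; on a cluster this would be YARN, Kubernetes, or Mesos
    .getOrCreate()
)

# hypothetical sales data
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

(df.groupBy("country")
   .agg(F.sum("amount").alias("total_amount"))
   .show())

spark.stop()
```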

Wrap Up

Nowadays, whether we use a mobile device to open an app, run a Google search, or simply move around, data is continuously created.
Millions of data points are generated each time one of us opens a device or makes an online transaction.
The total amount of data generated worldwide is predicted to exceed 570 zettabytes by 2030.
When the question “how big is big data” is brought up, perhaps in 2030, it will be fascinating to observe how we will be handling such a large amount of data.

The next post will take a deep dive into Spark and show how we can start writing data pipelines using PySpark (the Spark-based Python framework).

I hope you find these insights useful for your use case, and let me know if you have any suggestions for me too!

If you are passionate about this topic, or simply want to stay connected, do follow me on Medium.

Thank you for reading!

Follow me on LinkedIn for all the latest updates.🚀
