You must create an annotated bibliography of technical resources with various data collection methods for this assignment.
Begin by locating five research articles related to a technical topic of interest. Each of the resources must employ a different way of collecting data. In each annotation, describe the data collection method and collected data.
Length: 6-page academic paper, not including title and reference pages
The completed assignment should address all assignment requirements, exhibit evidence of concept knowledge, and demonstrate thoughtful consideration of the content presented in the course. The writing should integrate scholarly resources, reflect academic expectations and current APA standards, and be accompanied by a plagiarism report.
ATTACKING THE MESSENGER: EXPLORING THE SECURITY OF BIG DATA
MESSENGER APACHE KAFKA
A Thesis Submitted to the Graduate Faculty
of the North Dakota State University
of Agriculture and Applied Science
By
Tyler Losinski
In Partial Fulfillment of the Requirements for the Degree of
MASTER OF SCIENCE
Major Program: Software Engineering
Option: Cybersecurity
September 2021
Fargo, North Dakota
North Dakota State University Graduate School
Title
ATTACKING THE MESSENGER: EXPLORING THE SECURITY OF
BIG DATA MESSENGER APACHE KAFKA
By
Tyler Losinski
The Supervisory Committee certifies that this disquisition complies with North Dakota
State University’s regulations and meets the accepted standards for the degree of
MASTER OF SCIENCE
SUPERVISORY COMMITTEE:
Jeremy Straub
Chair
Kendall Nygard
Daniel Klenow
Approved: 12/23/2021
Simone Ludwig, Department Chair
ABSTRACT
As technology becomes faster, cheaper, and more compact, ever higher volumes of data must be processed, often in near real time. As
a result, technologies such as Kafka have been created as high throughput messaging bus
systems. Utilizing these new technologies could vastly improve the way we look at data
processing, especially when that data is coming from IoT or distributed systems.
Kafka provides multiple methods of encryption and authentication with its brokers; however, securing the producers and consumers is the responsibility of the application owner. This paper focuses on this key aspect and considers how an attacker could exploit and compromise the flow and integrity of the data. After access to the producers and consumers has been compromised, examples of data manipulation are performed in order to demonstrate the real-world consequences of such breaches.
ACKNOWLEDGMENTS
I would like to thank my advisor, Dr. Jeremy Straub, who guided me through the process of writing this paper. His help and guidance have helped me immensely in my graduate program, and I couldn’t have completed all that I have without him. I also appreciate all the support and help I received from Bobcat, who supported me financially throughout my thesis and graduate
program.
TABLE OF CONTENTS
ABSTRACT ………………………………………………………………………………………………………………….. iii
ACKNOWLEDGMENTS ………………………………………………………………………………………………. iv
LIST OF FIGURES ……………………………………………………………………………………………………….. vi
1. INTRODUCTION ………………………………………………………………………………………………………. 1
2. BACKGROUND AND RELATED LITERATURE ………………………………………………………… 3
2.1. What is Kafka? ……………………………………………………………………………………………………… 3
3. PROPOSED SYSTEMS ………………………………………………………………………………………………. 7
3.1. System Compared to Traditional Data Storage ………………………………………………………….. 7
3.2. System for Exploiting Producers & Consumers ………………………………………………………… 8
3.3. System for Target Malware Attacks …………………………………………………………………………. 9
4. EVALUATION …………………………………………………………………………………………………………. 11
4.1. Traditional Database Comparison ………………………………………………………………………….. 11
4.2. Attacking Weak Consumers ………………………………………………………………………………….. 13
4.3. Analyzing the Malicious Code ………………………………………………………………………………. 15
5. CONCLUSIONS AND FUTURE WORK ……………………………………………………………………. 20
REFERENCES …………………………………………………………………………………………………………….. 22
LIST OF FIGURES
Figure Page
1. Kafka Architecture Example …………………………………………………………………………………….. 4
2. Example of Error Logging to Kafka Written in JavaScript ………………………………………….. 13
3. Code Example of Scanning Network for Potential Kafka Server. Written in C# ……………. 15
4. Code Example to Scan Open Ports. Written in C# …………………………………………………….. 16
5. Code Example to Check for Available Topics on a Discovered Kafka Server. Written in C# ………………………………………………………………………………………………………. 17
1. INTRODUCTION
The size of the internet is estimated to be 16 million petabytes of information [1]. It is
estimated to be growing at a rate of 70 terabytes per second [1]. These numbers show an
explosive growth never before seen in the internet’s history. Engineers at LinkedIn saw this challenge and developed a high-throughput messaging bus system called Kafka to handle the processing of log files. After realizing the need for and importance of what they had created, they eventually released this technology under the Apache license to allow anyone to utilize it. Today
more than 12,000 companies use Kafka to stream and process their data in real time [2].
One area that could benefit greatly from this system is the capture of data from various
sensors. Manufacturing systems, for example, have thousands of sensors with functions ranging from collecting temperatures and checking fuel levels to tracking when people enter or exit a
building [3]. All this data can be collected and sent to one Kafka cluster where it can be
consumed by various applications, and either be processed as it comes in or stored to be
processed later for use in a report or used by another system. Having the ability to capture all this
data and the freedom to process it when and where the user wants makes Kafka a top contender
in this data-driven world.
Kafka excels in its high throughput capabilities and its distributed nature also makes it
perfect for scenarios where fault tolerance is key. While collecting and processing high volumes
of data is great, being able to recover if a computer fails can be even more important. That is
why Kafka runs in a clustered environment across one or more servers. This not only allows the cluster to continue to run in the case of a fault, but it also allows the distribution of consumers and
producers across many systems and instances. For example, a system might collect messages
from a thermostat every second then consume and enrich that data to be used by another process.
That consumer can be replicated across many servers to prevent the failure of one server
from bringing down the entire enrichment process. Furthermore, this process can be utilized to
scale the system if for some reason the thermostat started producing more data during a specific
time of day than normal. This scalability, coupled with its enormous throughput, makes Kafka ideal for modern data-driven problems.
This paper takes a deeper look into the data processing and distributed nature of Kafka
and compares it with more common and traditional data processing techniques. After this,
potential security vulnerabilities are explored. Traditionally, data is collected and stored in a
structured or unstructured database where a user must query the database to get the data that is
required. This can be a costly endeavor as the database usually can’t push data to the user unless
it is explicitly requested. Another issue is that every time the data is requested, the query needs to be reprocessed, which, depending on the efficiency of the query, can waste valuable time and
resources. However, if the data is processed as it is received, the need for a centralized data
processing system is reduced.
The proposed system utilizes Kafka to process data as it is received. This will be
achieved by consumer/producer services that perform simple operations such as counting, data enrichment, and filtering. After this, the data is pushed to Kafka topics to either be pulled
for reporting or reprocessed and pushed to another topic.
2. BACKGROUND AND RELATED LITERATURE
Although there has been an explosion of data growth over the last few decades [4],
securing that data has not been at the forefront of that explosion. According to an article
published in the International Journal of Emerging Trends in Social Sciences, in 2019 there were
roughly 1244 data breaches across multiple industries which resulted in the compromise of
446,515,334 records [4]. This has led to an estimated average cost of $3.9 billion in losses with
the highest single incident loss being $8.19 billion [5]. With stakes as high as these, it is
imperative that securing data is put at the forefront of research [6].
Much of this generated data has moved to IoT devices as technology becomes ever more
connected [7]. With this, there is a need for technology that can manage huge volumes of data with a high level of fault tolerance. This need is what drove the development of Kafka.
2.1. What is Kafka?
Kafka has been utilized by thousands of companies to bring real time data processing to
massive concurrent data streams [8]. This relatively new technology was developed by LinkedIn
as a new approach to handling their log files. As a result, Kafka leverages many new
methodologies and features to create a system with high throughput and fault tolerance. Through
its innovative producer-consumer architecture, Kafka can leverage multiple servers and services
to create a system that can manage large streams of data [9].
Figure 1. Kafka Architecture Example
Depending on the demand, Kafka’s brokers can be scaled to allow for more throughput
and data redundancy. This can be done in order to accommodate more data streams and/or
provide fault tolerance. This scaling can also be performed on the consumers and producers of
the system to achieve the same effect [10].
As can be seen in Figure 1, topics can have one or many partitions which directly
correlate to their throughput. They can also have a replication factor of one or more in order to
ensure there is redundancy between the brokers. Consumers and producers utilize these topics to
read and write data. These consumers and producers can be run in one or many locations, similar
to the topic’s architecture, to ensure redundancy and throughput. The consumers and producers
can be written in many different programming languages and can be run across many different
systems. This diversity can leave the system vulnerable to attacks.
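Before looking at how data flows through these topics, a rough illustration of how the partition and replication settings described above might be configured is shown below, using the kafkajs admin client. The broker address and topic name are assumptions for the sake of the sketch, not values taken from the thesis.

const { Kafka } = require('kafkajs')

// Connect to an assumed local broker; a production cluster would list several brokers.
const kafka = new Kafka({ clientId: 'topic-setup', brokers: ['localhost:9092'] })

async function createExampleTopic() {
  const admin = kafka.admin()
  await admin.connect()
  // Nine partitions spread load across brokers; a replication factor of one keeps a single copy.
  await admin.createTopics({
    topics: [{ topic: 'sensor_data', numPartitions: 9, replicationFactor: 1 }],
  })
  await admin.disconnect()
}

createExampleTopic().catch(console.error)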
A Kafka service starts the process by publishing raw data to a topic. Once that data is
available, Kafka will push the data to any consumers listening to that topic. At this point, another
Kafka service will consume the raw data and can perform the necessary data manipulation such
as counting specific records, enriching the data from a database, or filtering out data that is
unnecessary. After this data manipulation, the Kafka service will publish the new data to a
different topic where the process can begin once again. Another use a consumer might serve
would be to push the data in real time to a reporting application, which is where a system such as
this would excel [11].
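A minimal sketch of this consume-transform-produce pattern, written with the kafkajs package, is shown below. The topic names, message format, and filtering rule are assumptions for illustration only.

const { Kafka } = require('kafkajs')

const kafka = new Kafka({ clientId: 'enrichment-service', brokers: ['localhost:9092'] })
const consumer = kafka.consumer({ groupId: 'raw-data-processors' })
const producer = kafka.producer()

async function run() {
  await consumer.connect()
  await producer.connect()
  await consumer.subscribe({ topic: 'raw_data', fromBeginning: true })

  await consumer.run({
    eachMessage: async ({ message }) => {
      const record = JSON.parse(message.value.toString())
      // Filter out records that are not needed downstream.
      if (record.value === undefined) return
      // Enrich the record with a processing timestamp before republishing.
      record.processedAt = new Date().toISOString()
      await producer.send({
        topic: 'processed_data',
        messages: [{ value: JSON.stringify(record) }],
      })
    },
  })
}

run().catch(console.error)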
This process has the potential to process data before the user requests it, allowing for near
real time fetching of processed data. Comparing this to a system where data manipulation and
joining needs to occur every time a report is generated, there is the potential for huge time
savings [12]. Even using this process to make complex calculations then joining that data with a
dataset from a relational database could provide time savings. Not only does this system have the
potential to save time, but it also has the potential to offload processing to edge services.
Given the distributed nature of Kafka consumers and producers, it is possible to move
these services to be either client side or on lightweight edge services. The benefits of a system
like this include the ability to scale up or down processing nodes based on demand. This alone
could save an enterprise system thousands of dollars a year [13]. Along with the cost savings,
this system could utilize technology, such as IoT devices, that could process data while they are
not in use.
In contrast, traditional relational databases lack the ability to offload processing to
smaller services, giving traditional relational data storage a distinct disadvantage [14]. Managing
one single point of processing reduces the overall complexity of the system but does reduce its
ability to expand on demand. Also, database engines can quickly generate datasets that have been
filtered and aggregated on demand, whereas distributed systems such as Kafka would require
complex systems of consumers and producers just to generate one set of data. Overall, both
systems have their own specific pros and cons, and these should be carefully considered when
deciding which to implement.
Although Kafka is compared directly with relational databases in this paper, there is the
potential to utilize more recent technologies that use No-SQL architectures. Technologies such
as MongoDB are known for their speed at scale [15] which might work well with Kafka.
However, this topic will be left to future work as it adds a level of complexity and exceeds the
scope of this paper.
3. PROPOSED SYSTEMS
This paper explores three different areas of Kafka and considers its potential benefit over
traditional data storage and the potential security risks it brings. With security misconfiguration
often cited as one of the top security risks, the following example systems have a resemblance to
many current production systems [16]. OWASP cites these types of vulnerabilities as having the
highest likelihood of being prevalent and detectable. This unfortunate truth further underlines the
importance of securing these types of systems.
3.1. System Compared to Traditional Data Storage
The proposed system utilizes a Kafka cluster with three servers running CentOS 7. Each
runs a single instance of Zookeeper and three Kafka nodes. Each topic runs with a replication factor of one and contains nine partitions. This system is compared with a
Microsoft SQL Server 2016 Standard Edition server.
The relational database contains two tables, one with the hard drives’ information and
another that is used to store each day’s data and whether it failed or not. The Kafka cluster uses
two topics for storing data. The first is used to store the whole year of data and the second is used
to store the current count of unique hard drives that have not failed.
The hard drive statistics come from a 2018 hard drive statistics dataset [17]. This dataset contains roughly 100,000 hard drive records per day. Each record contains about 53 columns; however, these tests only use the first five columns, which are the date, serial number, model number, capacity (in bytes), and failure status of each hard drive. The failure
column is zero if it did not fail that day and one if a failure occurred. Once a failure has
happened, the hard drive is removed and will not be seen in subsequent days.
A C# .NET application is used to bulk insert the hard drive info into the SQL Server
database. This requires around ten to fifteen seconds of processing per 100,000 records. Another
bulk insert was used to import the hard drive status per day and whether a failure had occurred,
which was much faster due to not having to check for unique records. This tool takes around five
to ten seconds. After the data was in the SQL Server database, the C# application was once again
used to push each record to a Kafka topic at a rate of about 100,000 records per one to five
seconds.
To process the data in SQL Server, a simple count statement was written to get all the
hard drives that had not yet failed. This was then run through a simple C# application and the
time to retrieve the data was recorded.
To process the data in Kafka, a Node.js service was written to consume all the messages
that had been sent to the topic containing the raw data. The service then created an object of
hard drives, adding new hard drives to it as they came in and removing drives that had failed. For
each record that came in, another Kafka record was produced to a counts topic. This was a
simple record containing the date of the hard drive stats that were consumed and the current
count of failures. This produced a timeline of counts as each new record became available to process. Overall, the process used to transform and filter the data produces a simple count of active drives; however, it is a complex process. If the complexity of the system could be overcome, the time-saving benefits could be great.
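A rough sketch of what such a counting service might look like in Node.js with kafkajs follows. The thesis's actual implementation is not reproduced here; the topic names, message format, and field names are assumptions.

const { Kafka } = require('kafkajs')

const kafka = new Kafka({ clientId: 'drive-counter', brokers: ['localhost:9092'] })
const consumer = kafka.consumer({ groupId: 'drive-counters' })
const producer = kafka.producer()

// In-memory set of serial numbers for drives that are currently operational.
const activeDrives = new Set()

async function run() {
  await consumer.connect()
  await producer.connect()
  await consumer.subscribe({ topic: 'Hard_Drive_Data', fromBeginning: true })

  await consumer.run({
    eachMessage: async ({ message }) => {
      const record = JSON.parse(message.value.toString())
      if (record.failure === 1) {
        // A failed drive is removed and never appears again in later days.
        activeDrives.delete(record.serialNumber)
      } else {
        activeDrives.add(record.serialNumber)
      }
      // Publish the running count so downstream reports can read it in near real time.
      await producer.send({
        topic: 'HardDrive_Count',
        messages: [{ value: JSON.stringify({ date: record.date, activeCount: activeDrives.size }) }],
      })
    },
  })
}

run().catch(console.error)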
3.2. System for Exploiting Producers & Consumers
In order to demonstrate a denial of service (DoS) attack, a consumer was run on a
CentOS 7 server with minimum system requirements. The server had 512 MB of RAM, 10 GB of hard drive space, and one CPU core (an Intel Xeon E5-2407 at 2.20 GHz). This system represents
a consumer that may only perform a simple yet important task such as sending out an email
regarding critical errors or storing data in a persistent database. This paper presents a consumer written in JavaScript that continually writes its received data to a text file.
The server running the Kafka message bus also ran CentOS 7, but with 16 GB of RAM, 126 GB of hard drive space, and 4 CPU cores (Intel Xeon E5-2407 at 2.20 GHz). Both the message
bus and consumer were run through Hyper-V on the same Windows Server 2016 machine. The
producer of the messages was run on the same server as the message bus using the included
Kafka Producer script. The consumer used a Node.js Kafka package called kafkajs [18].
3.3. System for Target Malware Attacks
To demonstrate a potential system that can be exploited, a Kafka cluster was running on a
single virtualized server running CentOS 7. The cluster only ran one instance of Zookeeper with
one node. The resources available to the server were 16 GB of RAM and 4 CPU cores (Intel Xeon E5-2407 at 2.20 GHz). For the purpose of these experiments, several ‘dummy’ topics were
defined with various partition and replication sizes. These topics represented a typical system
that is consuming and managing real world data.
The software that represents potential malware was a custom-written C# application
utilizing .NET 4.5. This application represents what potential malware could do to attack an
unsecured Kafka cluster on a network. Such an application would have a means to spread to
other systems and automatically run these steps in order to discover and exploit any Kafka
systems on the network. Furthermore, this application design assumes that the message bus is
unsecured; however, leveraging a vulnerability to exploit the Kafka cluster and connect without authorization would greatly increase its effectiveness. This is left to
future research.
This type of malware attack is considered spyware, as its main task is to enumerate and
send information back to the attacker. Along with gathering reconnaissance information, it relies
heavily on a default configuration attack of the Kafka systems, which is cited in the OWASP top
10 vulnerabilities year after year [16]. The combination of these common cybersecurity issues
makes this type of malware plausible.
4. EVALUATION
This section covers three major points of evaluation. First, a comparison is made with
traditional means of data storage. Next, an evaluation of attacking weak consumers is explored.
Finally, examples of malicious code are examined to show how attacks can be automated.
4.1. Traditional Database Comparison
When evaluating the proposed system, it became evident quickly that complexity would
pose a challenge in testing. The two components of the tests are the Kafka cluster, with its
consumers and producers, and a SQL Server database. Using these two systems, a comparison
was made as to how quickly each could count the number of failures over the course of a year
and how many operating hard drives currently exist. Although this seems like a trivial test, the immutable nature of data within a Kafka topic resulted in complicated consumers and producers, while SQL Server handled these tests with ease.
With the SQL Server, the hard drive failure data was mass loaded into two tables. The
first table, named HardDrive, held all the unique hard drives the data set contained. The second
table, HardDriveStatus, held the status of each hard drive on a given day. The following query
was used to get a count of failures over the year and its elapsed time:
BEGIN
    DECLARE @t1 DATETIME;
    DECLARE @t2 DATETIME;
    SET @t1 = GETDATE();

    SELECT COUNT(*)
    FROM HardDriveStatus
    WHERE Failure = 1;

    SET @t2 = GETDATE();
    SELECT DATEDIFF(millisecond, @t1, @t2) AS elapsed_ms;
END;
The results from this query were 117,993 records and a time of 5517 milliseconds. These results
show just how simple it is to query this data from a traditional structured database.
The second query, shown below, counts the hard drives that had not yet failed and reports its elapsed time:

BEGIN
    DECLARE @t1 DATETIME;
    DECLARE @t2 DATETIME;
    SET @t1 = GETDATE();

    SELECT COUNT(*)
    FROM (
        SELECT HDS.SerialNumber
        FROM HardDriveStatus HDS
        JOIN HardDrive HD ON HDS.SerialNumber = HD.SerialNumber
        WHERE Failure != 1
        GROUP BY HDS.SerialNumber
    ) X;

    SET @t2 = GETDATE();
    SELECT DATEDIFF(millisecond, @t1, @t2) AS elapsed_ms;
END;
Kafka tests were not as straightforward or simple to create. First, all of the data was
pushed into a topic called Hard_Drive_Data via a simple producer written in Node.js. Then a
consumer was written that would pull in each record from the Hard_Drive_Data topic, count the
unique hard drives that were still operational and push that information into a topic called
HardDrive_Count. This was accomplished by holding a temporary object of hard drives that
would be added to or removed from each time a hard drive was added or failed. Due to its
complexity and limited resources, no conclusive results were recorded.
A similar process was followed when counting the number of failures over time by
consuming the raw data from the Hard_Drive_Data topic, counting the number of failures, and
pushing them back to another topic. However, no conclusive results were gathered from these
tests due to the complexity and limited-resource constraints seen in the previous results.
Unfortunately, due to the complex nature of data streams and the required resources to
manage such large datasets, no conclusive data can be drawn from our simple Kafka test. As
outlined in the future work section, a better solution might include a mixture of each experiment
where a SQL database generates the initial data, and a Kafka topic supplements the subsequent
data. However, this type of experiment helps outline the distinct architectural differences between Kafka and SQL databases. This further emphasizes the need to continue developing the research and understanding around how Kafka differs in vulnerabilities and exploits compared to more traditional architectures.
4.2. Attacking Weak Consumers
A simple consumer setup was constructed to read data from the topic “test_topic”, which included 21 partitions and a replication factor of one for high throughput. The consumer was written using the kafkajs package to listen in on this topic and perform a simple write to file if the messages contained the word “error”. This was used to represent a simple
service that logs critical system errors for future evaluation and could be improved to also send
out an alert via email to key actors. However, for the sake of this experiment it remained simple
and only wrote the errors to file. Figure 2 is the code snippet that was used to execute this task:
Figure 2. Example of Error Logging to Kafka, Written in JavaScript
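The figure's code is not reproduced in this excerpt; the following is a minimal sketch of what such an error-logging consumer might look like with kafkajs. The broker address, topic name, and log file path are assumptions.

const { Kafka } = require('kafkajs')
const fs = require('fs')

const kafka = new Kafka({ clientId: 'error-logger', brokers: ['localhost:9092'] })
const consumer = kafka.consumer({ groupId: 'error-loggers' })

async function run() {
  await consumer.connect()
  await consumer.subscribe({ topic: 'test_topic', fromBeginning: true })

  await consumer.run({
    eachMessage: async ({ message }) => {
      const text = message.value.toString()
      // Only messages containing the word "error" are considered critical and logged.
      if (text.includes('error')) {
        fs.appendFileSync('errors.log', text + '\n')
      }
    },
  })
}

run().catch(console.error)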
After the consumer was written it was placed onto the CentOS 7 system with minimal
system resources. A file containing 479,000 words [19] was then loaded into the topic; of those words, only 79 contained the word “error”, and thus only 79 words were written to the file. This demonstrates a typical system load where critical errors occur
infrequently.
This setup poses some vulnerabilities to DoS attacks. To demonstrate this, the word file
was replaced with a file that contains 500,000 entries of the word “error”. This file was loaded
into the same topic as before; however, the consumer could not handle the volume and inevitably crashed.
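A rough sketch of the kind of producer that could generate this flood, again using kafkajs, is shown below. The message count and topic name mirror the description above; the broker address and batching scheme are assumptions.

const { Kafka } = require('kafkajs')

const kafka = new Kafka({ clientId: 'flood-producer', brokers: ['localhost:9092'] })
const producer = kafka.producer()

async function flood() {
  await producer.connect()
  // Send 500,000 "error" messages in batches so every one of them matches the consumer's filter.
  const batchSize = 1000
  for (let sent = 0; sent < 500000; sent += batchSize) {
    const messages = Array.from({ length: batchSize }, () => ({ value: 'error' }))
    await producer.send({ topic: 'test_topic', messages })
  }
  await producer.disconnect()
}

flood().catch(console.error)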
Kafka can process huge volumes of data in real time allowing for faster response times
and an overall better user experience. However, this distinct advantage could turn into a critical
flaw if the code used to consume, manipulate, and produce the data isn’t effectively written [20].
As demonstrated, DoS attacks can pose a serious risk to smaller, more vulnerable
services in the system. This risk could then cascade into more detrimental problems such as data
not being adequately processed or errors spiraling out of control, bringing down more critical
systems.
These issues can easily be mitigated by implementing simple standards and/or policies.
For the given example, a more robust service could be written that can recognize an overloaded
topic and notify someone before the volume causes a crash. A consumer could be written to
simply count the throughput of topics and send out notifications when an abnormal amount of
traffic is detected. Another approach would be to monitor any errors that the individual services
produce, notifying a user when errors start to pile up. Finally, the best approach would be to write the consumers in a way that allows them to have a throughput that is much higher than their
anticipated usage. This might be in the form of synchronous file writes or dropping messages to
maintain the integrity of the consumer as a whole.
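As an illustration of the throughput-monitoring idea, a minimal kafkajs sketch is shown below; it counts messages per interval and logs a warning when the rate exceeds a threshold. The topic name, threshold, and interval are assumptions, and a real deployment would send an alert rather than log to the console.

const { Kafka } = require('kafkajs')

const kafka = new Kafka({ clientId: 'throughput-monitor', brokers: ['localhost:9092'] })
const consumer = kafka.consumer({ groupId: 'throughput-monitors' })

const THRESHOLD_PER_MINUTE = 10000  // Assumed limit for "normal" traffic on this topic.
let messagesThisMinute = 0

async function run() {
  await consumer.connect()
  await consumer.subscribe({ topic: 'test_topic' })

  // Every minute, check whether the observed rate looks abnormal, then reset the counter.
  setInterval(() => {
    if (messagesThisMinute > THRESHOLD_PER_MINUTE) {
      // In a real system this could send an email or page an operator instead of logging.
      console.warn(`Abnormal traffic: ${messagesThisMinute} messages in the last minute`)
    }
    messagesThisMinute = 0
  }, 60 * 1000)

  await consumer.run({
    eachMessage: async () => {
      messagesThisMinute += 1
    },
  })
}

run().catch(console.error)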
4.3. Analyzing the Malicious Code
As stated in the previous section, this code is merely a representation of techniques that
could be utilized to maximize the effectiveness of automated attacks. This automation can be
broken down into three functions. The first is a scan of the network’s local subnet. This could be
expanded to include several ranges of IP addresses or even a specific range if more is known about the target. Figure 3 is an example of such code:
Figure 3. Code Example of Scanning Network for Potential Kafka Server. Written in C#

string ipBase = "192.168.0.";
for (int i = 1; i < 255; i++)
{
    string ip = ipBase + i.ToString();
    Ping p = new Ping();
    // p_PingCompleted (defined elsewhere in the application) records which hosts respond.
    p.PingCompleted += new PingCompletedEventHandler(p_PingCompleted);
    // countdown is a CountdownEvent used to wait for all asynchronous pings to finish.
    countdown.AddCount();
    p.SendAsync(ip, 100, ip);
}
countdown.Signal();
countdown.Wait();

This sample of code takes in a base IP address range to scan and checks whether a device is connected at each IP address in the range. If a device is discovered, the code in Figure 4 then checks for any open ports:
Figure 4. Code Example to Scan Open Ports. Written in C#

/// <summary>
/// Check for open Kafka Port
/// </summary>
/// <param name="ip"></param>
private static void PortScan(string ip)
{
    TcpClient Scan = new TcpClient();
    // Only the default Kafka port is checked here, but this array could hold any set of ports.
    int[] Ports = new int[] { 9092 };
    foreach (int s in Ports)
    {
        try
        {
            Scan.Connect(ip, s);
            Console.WriteLine($"[{s}] | OPEN");
        }
        catch
        {
            // A failed connection simply means the port is closed or filtered.
        }
    }
}

The only port scanned was 9092, in this example, but this could be expanded to any array of ports. If this scan discovers any open ports, it then runs the code in Figure 5 to see if a connection to Kafka can be established:
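Figure 5 itself is not reproduced in this excerpt. As a rough illustration of what such a topic-enumeration step could look like, the sketch below uses the kafkajs admin client rather than the C# used in the thesis; the host value shown is an assumption.

const { Kafka } = require('kafkajs')

// Attempt to connect to a discovered host on the default Kafka port and list its topics.
async function enumerateTopics(host) {
  const kafka = new Kafka({ clientId: 'topic-enumerator', brokers: [`${host}:9092`] })
  const admin = kafka.admin()
  try {
    await admin.connect()
    const topics = await admin.listTopics()
    console.log(`Topics available on ${host}:`, topics)
  } catch (err) {
    console.log(`Could not connect to Kafka on ${host}: ${err.message}`)
  } finally {
    await admin.disconnect()
  }
}

enumerateTopics('192.168.0.10').catch(console.error)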