CSE Technical Reports Sorted by Technical Report Number
|Spam filtering has traditionally relied on extracting spam signatures via supervised learning, i.e., using emails manually labeled as spam or ham. Such supervised learning is labor-intensive and costly and, more importantly, cannot adapt to new spamming behavior quickly enough. The fundamental reason a labeled training corpus is needed is that the learning, e.g., the process of extracting signatures, is carried out by examining individual emails. In this paper, we study the feasibility of unsupervised-learning-based spam filtering that can more effectively identify new spamming behavior. Our study is motivated by three key observations of today's Internet spam: (1) the vast majority of emails are spam, (2) a spam email should always belong to some campaign, and (3) spam emails from the same campaign are generated from some template that obfuscates some parts of the spam, e.g., sensitive terms, while leaving other parts unchanged. We present the design of an online, unsupervised spam learning and detection scheme. The key component of our scheme is a novel text-mining-based campaign identification framework that clusters spam into campaigns and extracts the invariant textual fragments from spam as campaign signatures. While the individual terms in the invariant fragments can also appear in ham, the key insight behind our unsupervised scheme is that our learning algorithm is effective at extracting co-occurrences of terms that are generated by campaign templates and rarely appear in ham. Using large traces containing about 2 million emails from three sources, we show that our unsupervised scheme alone achieves a false negative ratio of 3.5% and a false positive ratio of at most 0.4%. These detection accuracies are comparable to those of de-facto supervised-learning-based filtering systems such as SpamAssassin (SA), suggesting that unsupervised spam filtering holds high promise in battling today's Internet spam.
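
A minimal sketch of the pipeline this abstract describes: cluster incoming mail into campaigns by textual similarity, keep the terms that co-occur in every member as the campaign signature, and flag new mail containing a signature. The helper names, the greedy Jaccard clustering, and the 0.7 threshold are illustrative assumptions, not the paper's actual algorithm.

from collections import Counter

def terms(text):
    return set(text.lower().split())

def cluster_campaigns(emails, min_overlap=0.7):
    """Greedy single-pass clustering by term-set Jaccard similarity
    (a crude stand-in for the paper's text-mining-based framework)."""
    campaigns = []  # list of (representative_terms, member_indices)
    for i, mail in enumerate(emails):
        t = terms(mail)
        for rep, members in campaigns:
            if len(t & rep) / len(t | rep) >= min_overlap:
                members.append(i)
                break
        else:
            campaigns.append((t, [i]))
    return campaigns

def signature(emails, members):
    """Invariant fragment: terms co-occurring in every campaign member."""
    sig = terms(emails[members[0]])
    for i in members[1:]:
        sig &= terms(emails[i])
    return sig

def is_spam(mail, signatures):
    """Flag mail whose term set contains any campaign's co-occurrence set."""
    t = terms(mail)
    return any(sig and sig <= t for sig in signatures)
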
|Botnets have become one of the most serious security threats to the Internet and personal computer (PC) users. Although botnets have not yet caused major outbreaks in mobile networks, with the rapidly growing popularity of smartphones such as Apple's iPhone and Android-based phones, which store more personal data and have more capabilities than earlier-generation handsets, botnets are expected to move into the mobile domain. Since SMS is ubiquitous to every phone and supports delayed message delivery for offline phones, it is a suitable medium for command and control (C&C). In this paper, we describe how a mobile botnet can be built by utilizing SMS messages for C&C, and how different P2P structures can be exploited for mobile botnets. Our simulation results demonstrate that a modified Kademlia-based structured architecture is a better choice for a mobile botnet's topology. In addition, we discuss potential countermeasures to defend against this mobile botnet threat.
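
A minimal sketch of the Kademlia-style routing that the modified structured architecture builds on: each node forwards a C&C message to the known peer whose ID is XOR-closest to the target, with SMS as the transport. The XOR metric is standard Kademlia; the peer table layout and the SMS stub are illustrative assumptions.

def xor_distance(a: int, b: int) -> int:
    """Kademlia's distance metric between two node IDs."""
    return a ^ b

def send_sms(number: str, payload: str):
    """Hypothetical stand-in for a real SMS transport."""
    print(f"SMS to {number}: {payload}")

def next_hop(target_id: int, known_peers: dict) -> str:
    """Return the phone number of the peer XOR-closest to the target.
    known_peers maps peer node IDs to phone numbers."""
    closest = min(known_peers, key=lambda pid: xor_distance(pid, target_id))
    return known_peers[closest]

def relay_command(target_id: int, payload: str, known_peers: dict):
    """Relay a C&C message one hop toward the target over SMS."""
    send_sms(next_hop(target_id, known_peers), payload)
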
|Fast-Flux (FF) service networks are botnet-based hosting or redirection/proxy services for hosting malicious and illegal content while affording botmasters a high level of misdirection and protection. With their use as service networks among criminals on the rise, researchers and security experts have designed fast and accurate detection systems based on their intrinsic behavior patterns. However, botmasters have responded, adopting a plethora of countermeasures to evade detection. In this paper, we explore the escalating “arms race” between FF botnet detectors and the botmasters’ efforts to subvert them, presenting several novel mimicry attack techniques that allow botmasters to avoid detection. We first analyze the state-of-the-art FF detectors and their effectiveness against the current botnet threat, demonstrating how botmasters can, with their current resources, thwart detection strategies. Based on realistic assumptions inferred from empirically observed trends, we create formal models for bot decay, online availability, DNS advertisement strategies, and performance, allowing us to compare how different mimicry attacks affect the overall online availability and capacity of botnets.
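
A toy version of the kind of formal model the abstract mentions: exponential bot decay (cleanup of infected hosts) combined with an average online-availability probability, giving the expected number of usable bots a flux domain can advertise over time. The functional forms and parameter values here are illustrative assumptions, not the paper's fitted models.

import math

def live_bots(n0: float, decay_rate: float, t_days: float) -> float:
    """Bots surviving after t days under exponential decay."""
    return n0 * math.exp(-decay_rate * t_days)

def expected_online(n0, decay_rate, t_days, p_online):
    """Expected bots that are both still alive and currently online."""
    return live_bots(n0, decay_rate, t_days) * p_online

# Example: 10,000 recruits, 5%/day decay, 40% average online probability.
for t in (0, 7, 30):
    print(t, round(expected_online(10_000, 0.05, t, 0.4)))
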
|The monitoring of packets destined for reachable, yet unused, Internet addresses has proven to be a useful technique for measuring a variety of specific Internet phenomena (e.g., worms, DDoS). In 2004, Pang et al. stepped beyond these targeted uses and provided one of the first generic characterizations of this non-productive traffic, demonstrating both its significant size and diversity. However, the six years that followed this study have seen tremendous changes in both the types of malicious activity on the Internet and the quantity and quality of unused address space. In this paper, we revisit the state of Internet "background radiation" through the lens of two unique datasets: a five-year collection from a single unused /8 network block, and week-long collections from three recently allocated /8 network blocks. Through a longitudinal study of the long-lived block, comparisons between blocks, and extensive case studies of traffic in these blocks, we characterize the current state of background radiation, specifically highlighting those features that remain invariant from previous measurements and those that exhibit significant differences. Of particular interest in this work is the exploration of address-space pollution, in which significant non-uniform behavior is observed. However, unlike previous observations of differences between unused blocks, we show that these differences are increasingly the result of environmental factors (e.g., misconfiguration, location) rather than algorithmic factors. Where feasible, we offer suggestions for the cleanup of these polluted blocks and identify those blocks whose allocations should be withheld.
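
A minimal sketch of the first-pass characterization such a measurement study performs: bucket the packets arriving at an unused block by protocol and destination port to see which services dominate the background radiation. The tuple format is an illustrative assumption; the actual study works from full packet captures.

from collections import Counter

def characterize(packets):
    """packets: iterable of (proto, dst_port) tuples from a darknet capture."""
    return Counter(f"{proto}/{port}" for proto, port in packets)

sample = [("tcp", 445), ("tcp", 445), ("udp", 53), ("tcp", 80)]
print(characterize(sample).most_common())
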
|Developing countries face significant challenges in network access, making even simple network tasks unpleasant. Transferring bulk data in these networks is often prohibitively difficult, rendering useful information out of reach for many people. This paper introduces a hybrid delay-tolerant networking (DTN) framework, an approach for orchestrating bulk data transfer through a series of hosts with spare storage and diverse network connectivity. By combining natural individual mobility with available network connectivity, the framework forms a hybrid overlay network for delivering bulk data while keeping the state required to do so scalable. Our implementation of a hybrid DTN network, Bati, outperforms earlier attempts at using mobility for data delivery and at using in-network storage for enhanced path selection. Our evaluation demonstrates substantial savings when using Bati to deliver bulk data: it transfers an order of magnitude more data than the network alone and improves the delivery rate by more than 40% compared to popular ad-hoc networks.
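
A minimal sketch of the hybrid DTN idea: at each hop, hand the data bundle to the neighbor that best combines spare storage with expected progress toward the destination, whether that progress comes from mobility or from a network link. The scoring fields and the greedy rule are illustrative assumptions, not Bati's actual path-selection algorithm.

def pick_custodian(neighbors, bundle_size):
    """neighbors: list of dicts with 'spare_storage' (bytes) and
    'delivery_prob' (estimated chance of reaching the destination)."""
    feasible = [n for n in neighbors if n["spare_storage"] >= bundle_size]
    if not feasible:
        return None  # keep the bundle and wait for a better contact
    return max(feasible, key=lambda n: n["delivery_prob"])

hops = [{"id": "bus-12", "spare_storage": 8e9, "delivery_prob": 0.6},
        {"id": "kiosk-3", "spare_storage": 2e9, "delivery_prob": 0.9}]
print(pick_custodian(hops, 1e9)["id"])  # -> kiosk-3
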
|Online networks occupy an increasingly large position in how we acquire information, how we communicate with one another, and how we disseminate information. Frequently, small sets of vertices dominate various graph and statistical properties of these networks and, because of this, they are relevant for structural analysis and for the engineering of efficient algorithms. For the web overall, and specifically for social linking in blogs and instant messaging, we provide a principled, rigorous study of the properties, the construction, and the utilization of subsets of special vertices in large online networks. We show that graph synopses defined by the importance of vertices provide small, relatively accurate portraits of the important vertices in the underlying larger graphs, independent of the importance measure, and that they can be computed relatively efficiently in real-world networks. In addition, we study the stability of these graph synopses over time and trace their development in several large dynamic datasets. We show that important vertices are more likely to have longer active life spans than unimportant ones, and that graph synopses consisting of important vertices remain stable over long periods of time after a short period of initial growth.
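
A minimal sketch of the graph synopsis described above: keep the top-k vertices under some importance measure and take the induced subgraph. Degree is used here purely as a default example measure; the abstract's point is that the resulting portrait is relatively insensitive to which measure is chosen (e.g., nx.pagerank(g) could be passed instead).

import networkx as nx

def synopsis(g: nx.Graph, k: int, importance=None) -> nx.Graph:
    """Induced subgraph on the k most important vertices."""
    importance = importance or dict(g.degree())  # default measure: degree
    top = sorted(importance, key=importance.get, reverse=True)[:k]
    return g.subgraph(top).copy()

g = nx.barabasi_albert_graph(1000, 3)
print(synopsis(g, 50))  # small portrait of the most important vertices
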
|Botnets are one of the most serious security threats to the Internet and its end users. In recent years, utilizing P2P as a Command and Control (C&C) protocol has gained popularity due to its decentralized nature, which can help hide the botmaster's identity. Most bot-detection approaches targeting P2P botnets rely on either behavior monitoring or traffic-flow and packet analysis, requiring fine-grained information collected locally. This requirement limits the scale of detection. In this paper, we consider detection of P2P botnets at a high level, the infrastructure level, by exploiting their structural properties from a graph-analysis perspective. Using three different P2P overlay structures, we measure the effectiveness of detecting each structure at various locations in the Internet infrastructure: the Autonomous System (AS), the Point of Presence (PoP), and the router rendezvous.
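
A minimal sketch of infrastructure-level visibility: project P2P overlay edges onto the networks hosting each endpoint and count how many overlay links each candidate monitoring point can observe. The AS-level granularity and the ip_to_as mapping are illustrative stand-ins for real BGP-derived data; the same counting applies at PoP or router granularity.

from collections import Counter

def as_visibility(overlay_edges, ip_to_as):
    """overlay_edges: iterable of (src_ip, dst_ip) bot-to-bot links."""
    seen = Counter()
    for src, dst in overlay_edges:
        for asn in {ip_to_as[src], ip_to_as[dst]}:  # dedupe intra-AS links
            seen[asn] += 1
    return seen  # overlay links observable from each AS

edges = [("10.0.0.1", "10.0.1.2"), ("10.0.1.2", "10.0.2.3")]
ip_to_as = {"10.0.0.1": 64500, "10.0.1.2": 64501, "10.0.2.3": 64500}
print(as_visibility(edges, ip_to_as).most_common())
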