
CSE Technical Reports Sorted by Technical Report Number


TR Number | Title | Authors | Date | Pages

CSE-TR-561-10 | A Case for Unsupervised-Learning-based Spam Filtering | Feng Qian, Abhinav Pathak, Y. Charlie Hu, Z. Morley Mao, and Yinglian Xie | February 2010 | 17
Spam filtering has traditionally relied on extracting spam signatures via supervised learning, i.e., using emails manually labeled as spam or ham. Such supervised learning is labor-intensive and costly and, more importantly, cannot adapt to new spamming behavior quickly enough. The fundamental reason for needing a labeled training corpus is that the learning, e.g., the process of extracting signatures, is carried out by examining individual emails. In this paper, we study the feasibility of unsupervised-learning-based spam filtering that can more effectively identify new spamming behavior. Our study is motivated by three key observations of today's Internet spam: (1) the vast majority of emails are spam, (2) a spam email should always belong to some campaign, and (3) spam emails from the same campaign are generated from some template that obfuscates some parts of the spam, e.g., sensitive terms, leaving other parts unchanged. We present the design of an online, unsupervised spam learning and detection scheme. The key component of our scheme is a novel text-mining-based campaign identification framework that clusters spam into campaigns and extracts the invariant textual fragments from spam as campaign signatures. While the individual terms in the invariant fragments can also appear in ham, the key insight behind our unsupervised scheme is that our learning algorithm is effective in extracting co-occurrences of terms that are generated by campaign templates and rarely appear in ham. Using large traces containing about 2 million emails from three sources, we show that our unsupervised scheme alone achieves a false negative ratio of 3.5% and a false positive ratio of at most 0.4%. These detection accuracies are comparable to those of de facto supervised-learning-based filtering systems such as SpamAssassin (SA), suggesting that unsupervised spam filtering holds high promise in battling today's Internet spam.
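
The clustering-and-invariant-extraction idea lends itself to a compact illustration. The sketch below is a deliberately simplified, hypothetical take on the approach: it groups emails that share a frequent term pair into a candidate campaign and treats the terms common to every member as a crude campaign signature. The tokenizer, threshold, and sample data are all illustrative; the paper's actual text-mining framework is far more elaborate.

```python
# A minimal sketch of unsupervised campaign-signature extraction
# (hypothetical; not the paper's actual framework).
from collections import defaultdict
from itertools import combinations

def tokenize(body):
    return set(body.lower().split())

def extract_campaigns(emails, min_support=3):
    # Index emails by the term pairs they contain.
    pair_members = defaultdict(set)
    for idx, body in enumerate(emails):
        for pair in combinations(sorted(tokenize(body)), 2):
            pair_members[pair].add(idx)
    campaigns, seen = [], set()
    for members in pair_members.values():
        group = frozenset(members)
        if len(group) >= min_support and group not in seen:
            seen.add(group)
            # "Invariant fragment": terms present in every member email.
            signature = set.intersection(*(tokenize(emails[i]) for i in group))
            campaigns.append((group, signature))
    return campaigns

spam = [
    "cheap meds order now from example pharmacy",
    "cheap meds order today from example pharmacy",
    "cheap meds order here via example pharmacy",
]
for group, sig in extract_campaigns(spam):
    print(sorted(group), sorted(sig))
```

Running the sketch clusters all three messages into one campaign whose signature is the template's unchanged terms, while the obfuscated slots ("now"/"today"/"here") drop out of the intersection.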

CSE-TR-562-10 | Design of SMS Commanded-and-Controlled and P2P-Structured Mobile Botnets | Yuanyuan Zeng, Kang G. Shin, and Xin Hu | March 2010 | 12
Botnets have become one of the most serious security threats to the Internet and personal computer (PC) users. Although botnets have not yet caused major outbreaks in mobile networks, with the rapidly growing popularity of smartphones such as Apple's iPhone and Android-based phones, which store more personal data and have more capabilities than earlier-generation handsets, botnets are expected to move into the mobile domain. Since SMS is ubiquitous on every phone and supports delayed message delivery for offline phones, it is a suitable medium for command and control (C&C). In this paper, we describe how a mobile botnet can be built using SMS messages for C&C, and how different P2P structures can be exploited for mobile botnets. Our simulation results demonstrate that a structured architecture based on a modified Kademlia protocol is the better choice for a mobile botnet's topology. In addition, we discuss potential countermeasures to defend against this mobile botnet threat.
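
To make the structured-overlay choice concrete, here is a toy illustration of Kademlia-style XOR routing, the lookup primitive such a topology relies on. It is a sketch only: real Kademlia uses 160-bit node IDs, k-buckets, and iterative parallel lookups, and the small IDs below are hypothetical.

```python
# Toy Kademlia-style lookup: nodes are "close" when the XOR of their IDs
# is small, so a peer can be found by repeatedly querying closer nodes.
def xor_distance(a, b):
    return a ^ b

def closest_nodes(target_id, known_ids, k=3):
    """Return the k known node IDs closest to target_id under the XOR metric."""
    return sorted(known_ids, key=lambda n: xor_distance(n, target_id))[:k]

peers = [0b0001, 0b0110, 0b1010, 0b1100, 0b1111]
print(closest_nodes(0b1011, peers))  # -> [10, 15, 12]
```

The appeal for a mobile botnet is that this distance-based routing needs no central index: any bot that knows a few peers can locate the rest, which is exactly what makes the resulting topology hard to decapitate.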

CSE-TR-563-10 | Good Guys vs. Bot Guise: Mimicry Attacks Against Fast-Flux Detection Systems | Matthew Knysz, Xin Hu, and Kang G. Shin | June 2010 | 0
Fast-Flux (FF) service networks are botnet-based hosting or redirection/proxy services for hosting malicious and illegal content while affording botmasters a high level of misdirection and protection. With their use as service networks among criminals on the rise, researchers and security experts have designed fast and accurate detection systems based on their intrinsic behavior patterns. However, botmasters have responded, adopting a plethora of countermeasures to evade detection. In this paper, we explore the escalating “arms race” between FF botnet detectors and the botmasters’ efforts to subvert them, presenting several novel mimicry attack techniques that allow botmasters to avoid detection. We first analyze state-of-the-art FF detectors and their effectiveness against the current botnet threat, demonstrating how botmasters can, with their current resources, thwart detection strategies. Based on realistic assumptions inferred from empirically observed trends, we create formal models for bot decay, online availability, DNS advertisement strategies, and performance, allowing us to compare how different mimicry attacks affect the overall online availability and capacity of botnets.
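
For a sense of what such detectors key on, and therefore what a mimicry attack must imitate, the sketch below scores a domain using two features commonly associated with fast flux: many distinct A records accumulated across repeated lookups and short TTLs. The thresholds are hypothetical and not taken from any detector analyzed in the report; a botmaster mimicking benign patterns (few IPs per response, long TTLs) would deliberately stay under them.

```python
# Hedged sketch of a fast-flux heuristic over repeated DNS lookups.
def looks_fast_flux(lookups, min_ips=10, max_ttl=300):
    """lookups: list of (set_of_ips, ttl_seconds) from successive queries."""
    all_ips = set().union(*(ips for ips, _ in lookups))  # distinct A records seen
    shortest_ttl = min(ttl for _, ttl in lookups)
    return len(all_ips) >= min_ips and shortest_ttl <= max_ttl

benign = [({"192.0.2.1", "192.0.2.2"}, 3600)] * 5
fluxy = [({f"198.51.100.{i}", f"203.0.113.{i}"}, 60) for i in range(10)]
print(looks_fast_flux(benign), looks_fast_flux(fluxy))  # False True
```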

CSE-TR-564-10 | Internet Background Radiation Revisited | Wustrow, Karir, Bailey, Jahanian, and Huston | June 2010 | 14
The monitoring of packets destined for reachable, yet unused, Internet addresses has proven to be a useful technique for measuring a variety of specific Internet phenomena (e.g., worms, DDoS). In 2004, Pang et al. stepped beyond these targeted uses and provided one of the first generic characterizations of this non-productive traffic, demonstrating both its significant size and diversity. However, the six years that followed this study have seen tremendous changes in both the types of malicious activity on the Internet and the quantity and quality of unused address space. In this paper, we revisit the state of Internet "background radiation" through the lens of two unique data sets: a five-year collection from a single unused /8 network block, and week-long collections from three recently allocated /8 network blocks. Through a longitudinal study of the long-lived block, comparisons between blocks, and extensive case studies of traffic in these blocks, we characterize the current state of background radiation, specifically highlighting those features that remain invariant from previous measurements and those that exhibit significant differences. Of particular interest in this work is the exploration of address-space pollution, in which significant non-uniform behavior is observed. However, unlike previous observations of differences between unused blocks, we show that these differences are increasingly the result of environmental factors (e.g., misconfiguration, location) rather than algorithmic factors. Where feasible, we offer suggestions for the cleanup of these polluted blocks and identify those blocks whose allocations should be withheld.
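
As a flavor of the first-pass analysis behind such characterizations, the sketch below buckets darknet packets by protocol and destination port, the kind of breakdown that surfaces dominant components such as 445/TCP scanning. The packet records are synthetic stand-ins; a real study would parse pcap captures from the monitored /8 blocks.

```python
# Bucket darknet traffic by (protocol, destination port) to find the
# dominant sources of background radiation. Records are synthetic.
from collections import Counter

packets = [
    ("tcp", 445), ("tcp", 445), ("tcp", 1433),
    ("udp", 137), ("tcp", 445), ("icmp", None),
]
by_service = Counter(packets)
for (proto, dport), count in by_service.most_common():
    print(f"{proto}/{dport}: {count} packets")
```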

CSE-TR-565-10 | A Hybrid Overlay Network for Bulk Data in Challenged Networks | Azarias Reda and Brian Noble | October 2010 | 11
Developing countries face significant challenges in network access, making even simple network tasks unpleasant. Transferring bulk data in these networks is often prohibitively difficult, rendering useful information out of reach for many people. This paper introduces a hybrid delay-tolerant networking (DTN) framework, an approach for orchestrating bulk data transfer through a series of hosts with spare storage and diverse network connectivity. By combining natural individual mobility with available network connectivity, hybrid DTN networking forms a hybrid overlay network for delivering bulk data while preserving scalability in the state required to do so. Our implementation of a hybrid DTN network, Bati, outperforms earlier attempts to use mobility for data delivery and in-network storage for enhanced path selection. Our evaluation demonstrates substantial savings when using Bati to deliver bulk data: it transfers an order of magnitude more data than the network alone and improves the delivery rate by more than 40% compared to popular ad hoc networks.
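
One way to picture a hybrid overlay is as a single weighted graph in which both network links and predicted mobility transfers appear as edges, so that path selection becomes a shortest-path problem over expected delay. The sketch below is a hypothetical illustration of that framing, not Bati's actual routing or storage management.

```python
# Toy hybrid-overlay path selection: network links and mobility-based
# transfers are edges weighted by expected delay (hours); Dijkstra picks
# the cheapest route. All names and weights are hypothetical.
import heapq

def cheapest_path(graph, src, dst):
    """Dijkstra over a dict-of-dicts adjacency map; returns (path, delay)."""
    dist, prev, pq = {src: 0.0}, {}, [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    path, node = [dst], dst
    while node != src:
        node = prev[node]
        path.append(node)
    return list(reversed(path)), dist[dst]

overlay = {
    "campus": {"kiosk": 2.0, "city_pop": 12.0},  # slow direct uplink
    "kiosk": {"city_pop": 6.0},                  # commuter carrying storage
    "city_pop": {},
}
print(cheapest_path(overlay, "campus", "city_pop"))  # via the kiosk, 8.0 hours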

CSE-TR-566-10 | The Very Small World of the Well-connected | Xiaolin Shi, Matthew Bonner, Lada A. Adamic, and Anna C. Gilbert | December 2010 | 18
Online networks occupy an increasingly large position in how we acquire information, how we communicate with one another, and how we disseminate information. Frequently, small sets of vertices dominate various graph and statistical properties of these networks and, because of this, are relevant for structural analysis and for efficient algorithms and engineering. For the web overall, and specifically for social linking in blogs and instant messaging, we provide a principled, rigorous study of the properties, construction, and utilization of subsets of special vertices in large online networks. We show that graph synopses defined by the importance of vertices provide small, relatively accurate portraits, independent of the importance measure, of the important vertices in the underlying larger graphs. Furthermore, they can be computed relatively efficiently in real-world networks. In addition, we study the stability of these graph synopses over time and trace their development in several large dynamic data sets. We show that important vertices are more likely to have longer active life spans than unimportant ones, and that graph synopses consisting of important vertices remain stable over long periods after a short period of initial growth.
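
The core construction is simple to sketch: rank vertices by some importance measure, keep the top k, and retain only the edges among them. The example below uses degree as a stand-in for any importance measure and a tiny hypothetical edge list; the paper's synopses, measures, and data sets are considerably richer.

```python
# A small "graph synopsis": the induced subgraph on the k most important
# vertices, with degree standing in for the importance measure.
from collections import Counter

def synopsis(edges, k):
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    top = {v for v, _ in degree.most_common(k)}
    # Keep only edges whose endpoints are both important.
    return top, [(u, v) for u, v in edges if u in top and v in top]

edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("d", "e")]
print(synopsis(edges, 3))  # top-3 vertices and the edges among them
```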

CSE-TR-567-10 | Measuring the Effectiveness of Infrastructure-Level Detection of Large-Scale Botnets | Yuanyuan Zeng, Guanhua Yan, Stephan Eidenbenz, and Kang G. Shin | December 2010 | 12
Botnets are one of the most serious security threats to the Internet and its end users. In recent years, using P2P as a Command and Control (C&C) protocol has gained popularity due to its decentralized nature, which helps hide the botmaster's identity. Most bot-detection approaches targeting P2P botnets rely either on behavior monitoring or on traffic-flow and packet analysis, requiring fine-grained information collected locally. This requirement limits the scale of detection. In this paper, we consider detection of P2P botnets at a high level, the infrastructure level, by exploiting their structural properties from a graph-analysis perspective. Using three different P2P overlay structures, we measure the effectiveness of detecting each structure at various locations in the Internet infrastructure: the Autonomous System (AS), the Point of Presence (PoP), and the router rendezvous.
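
The infrastructure-level view can be sketched as a visibility question: map each overlay edge onto the ASes its traffic traverses and count how much of the botnet's C&C graph each vantage point can observe. The AS-path table and hosts below are a toy stand-in; the paper derives such paths from real routing data and also considers PoP- and router-level vantage points.

```python
# Hedged sketch of per-AS visibility into a P2P botnet's overlay graph.
from collections import Counter

as_path = {  # hypothetical (src_as, dst_as) -> ASes on the path
    (1, 2): [1, 7, 2],
    (2, 3): [2, 7, 3],
    (1, 3): [1, 7, 3],
}

def edges_seen_per_as(overlay_edges, host_as):
    """Count how many overlay edges each AS can observe in transit."""
    seen = Counter()
    for u, v in overlay_edges:
        for asn in as_path[(host_as[u], host_as[v])]:
            seen[asn] += 1
    return seen

host_as = {"botA": 1, "botB": 2, "botC": 3}
print(edges_seen_per_as([("botA", "botB"), ("botB", "botC")], host_as))
# AS 7, the shared transit provider, observes every overlay edge.
```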
