Efficient and adaptive web replication using content clustering

Luan Nguyen

doi:10.1109/JSAC.2003.814608

2003, IEEE Journal on Selected Areas in Communications

https://doi.org/10.1109/JSAC.2003.814608

16 pages

Abstract

Content distribution networks (CDNs) that offer hosting services to Web content providers are increasingly being deployed. In this paper, we first compare the uncooperative pulling of Web content used by commercial CDNs with cooperative pushing. Our results show that the latter achieves comparable user-perceived performance with only 4-5% of the replication and update traffic of the former. We therefore explore how to push content to CDN nodes efficiently. Using trace-driven simulation, we show that replicating content in units of URLs can yield a 60-70% reduction in clients' latency compared to replicating in units of Web sites. However, such fine-grained replication is very expensive to perform. To address this issue, we propose to replicate content in units of clusters, each containing objects that are likely to be requested by clients that are topologically close. To this end, we describe three clustering techniques and use various topologies and several large Web server traces to evaluate their performance. Our results show that cluster-based replication achieves performance close to that of the URL-based scheme, but at only 1%-2% of the computation and management cost. In addition, by adjusting the number of clusters, we can smoothly trade off management and computation cost for better client performance. To adapt to changes in users' access patterns, we also explore incremental clustering that adaptively adds new documents to the existing content clusters. We examine both offline and online incremental clustering, where the former assumes that access history is available while the latter predicts the access pattern from the hyperlink structure. Our results show that offline clustering comes close to the performance of complete re-clustering at much lower overhead. Online incremental clustering and replication cut the retrieval cost by 4.6-8 times compared to no replication and random replication, so it is especially useful for improving document availability during flash crowds.
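The idea of grouping objects that are requested by similar sets of clients can be illustrated with a small sketch. The abstract does not specify the three clustering techniques, so everything below is an assumption made for illustration: the per-client-group access vectors, the cosine similarity measure, the greedy single-pass grouping, the 0.9 threshold, and the names `cosine` and `cluster_urls` are hypothetical and are not taken from the paper.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two access vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_urls(access, threshold=0.9):
    """Greedy single-pass grouping (illustrative only, not the paper's algorithm).

    Each URL joins the first cluster whose seed vector is at least
    `threshold`-similar to its own access vector; otherwise it seeds
    a new cluster.
    """
    clusters = []  # list of (seed_vector, member_urls)
    for url, vec in access.items():
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(url)
                break
        else:
            clusters.append((vec, [url]))
    return [members for _, members in clusters]


# Hypothetical request counts per client group (columns: groups A, B, C).
access = {
    "/news/a.html": [90, 5, 0],
    "/news/b.html": [80, 10, 0],
    "/sports/x.html": [0, 10, 70],
    "/sports/y.html": [5, 0, 95],
}

clusters = cluster_urls(access)
print(clusters)
```

With these made-up counts, the two news pages end up in one cluster and the two sports pages in another, so each cluster could be replicated as a unit near the client groups that request it. Raising or lowering the threshold changes the number of clusters, which loosely mirrors the trade-off between management cost and client performance described in the abstract.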
