• Article
  • | |
  • Metrics
  • |
  • Reference [27]
  • |
  • Related [20]
  • |
  • Cited by [35]
  • | |
  • Comments
    Abstract:

    For many KDD (knowledge discovery in databases) applications, such as fraud detection in E-commerce, it is more interesting to find the exceptional instances or the outliers than to find the common knowledge. Most existing work in outlier detection deals with data with numerical attributes. And these methods give no explanation to the outliers after finding them. In this paper, a hypergraph-based outlier definition is presented, which considers the locality of the data and can give good explanation to the outliers,and it also gives an algorithm called HOT(hypergraph-based outlier test) to find outliers by counting three measurements,the support,belongingness and deviation of size,for each vertex in the hypergraph.This algorithm can manage both numerical attributes and categorical attributes.Analysis shows that this approach can find the outliers in high-dimensionsal space effctively.

    Reference
    [1] Fayyad, U., Piatetsky-Shapiro, G., Smyth, P. Knowledge discovery and data mining: towards a unifying framework. In: Simoudis, E., Han, J., Fayyad, U.M., eds. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining. Portland, Oregon: AAAI Press, 1996. 82~88.
    [2] Ng, R. T., Han, J. Efficient and effective clustering methods for spatial data mining. In: Bocca, J.B., Jarke, M., Zaniolo, C., eds. Proceedings of the 20th International Conference on Very Large Data Bases. Santiago: Morgan Kaufmann, 1994. 144~155.
    [3] Ester, M., Kriegel, H.-p., Sander, J., et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Simoudis, E., Han, J., Fayyad, U.M., eds. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining. Portland, Oregon: AAAI Press, 1996. 226~231.
    [4] Zhang, T., Ramakrishnan, R., Linvy, M. BIRCH: an efficient eata clustering method for very large databases. In: Jagadish, H.V., Mumick, I.S., eds. Proceedings of the ACM SIGMOD International Conference on Management of Data. Montreal: ACM Press, 1996. 103~114.
    [5] Wang, W., Yang, J., Muntz, R. STING: a statistical information grid approach to spatial data mining. In: Jarke, M., Carey, M.J., Dittrich, K.R., et al., eds. Proceedings of the 23rd International Conference on Very Large Data Bases. Athens, Greece: Morgan Kaufmann, 1997. 186~195.
    [6] Sheikholeslami, G., Chatterjee, S., Zhang, A. WaveCluster: a multi-resolution clustering approach for very large spatial databases. In: Gupta, A., Shmueli, O., Widom, J., eds. Proceedings of the 24th International Conference on Very Large Data Bases. New York : Morgan Kaufmann, 1998. 428~439.
    [7] Hinneburg, A., Keim, D.A. An efficient approach to clustering in large multimedia databases with noise. In: Agrawal, R., Stolorz, P.E., Piatetsky-Shapiro, G. eds. Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining. New York: AAAI Press, 1998. 58~65.
    [8] Agrawal, R., Gehrke, J., Gunopulos, D., et al. Automatic subspace clustering of high dimensional data for data mining applications. In: Haas, L.M., Tiwary, A., eds. Proceedings of the ACM SIGMOD International Conference on Management of Data. Seattle, Washington, D C: ACM Press, 1998. 94~105.
    [9] Ruts, I., Rousseeuw, P. Computing depth contours of bivariate point clouds. Journal of Computational Statistics and Data Analysis, 1996,23:153~168.
    [10] Arning, A., Agrawal, R., Raghavan, P. A linear method for deviation detection in large databases. In: Simoudis, E., Han, J., Fayyad, U.M., eds. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining. Portland, Oregon: AAAI Press, 1996. 164~169.
    [11] Knorr, E.M., Ng, R.T. Algorithms for mining distance-based outliers in large datasets. In: Gupta, A., Shmueli, O., Widom, J., eds. Proceedings of the 24th International Conference on Very Large Data Bases. New York: Morgan Kaufmann, 1998. 392~403.
    [12] Knorr, E.M., Ng, R.T. Finding intensional knowledge of distance-based outliers. In: Atkinson, M.P., Orlowska, M.E., Valduriez, P., eds. Proceedings of the 25th International Conference on Very Large Data Bases. Edinburgh, Scotland: Morgan Kaufmann, 1999. 211~222.
    [13] Aggarwal, C.C., Yu, P. Outlier detection for high dimensional data. In: Aref, W.G., eds. Proceedings of the ACM SIGMOD International Conference on Management of Data. Santa Barbara, CA: ACM Press, 2001. 37~47.
    [14] Han, E. H., Karypis, G., Kumar, V., et al. Clustering in a high-dimensional space using hypergraph models. Technical Report, TR-97-063, Minneapolis: Department of Computer Science, University of Minnesota, 1997.
    [15] Hawkins, D. Identification of Outliers . London: Chapman and Hall, 1980.
    [16] Barnett V., Lewis T. Outliers in Statistical Data. New York: John Wiley and Sons, Inc., 1994.
    [17] Tukey, J.W. Exploratory Data Analysis. MA: Addison-Wesley, 1977.
    [18] Preparata, F., Shamos, M. Computational geometry: an introduction. New York: Springer-Verlag, 1988.
    [19] Ramaswamy, S., Rastogi, R., Kyuseok, S. Efficient algorithms for mining outliers from large data sets. In: Chen, W., Naughton, J.F., Bernstein, P.A., eds. Proceedings of the ACM SIGMOD International Conference on Management of Data. Dallas, Texas: ACM Press, 2000. 427~438.
    [20] Breunig, M.M., Kriegel, H.P., Ng, R.T., et al. OPTICS-OF: identifying local outliers. In: Zytkow, J.M., Rauch, J., eds. Proceedings of the 3rd European Conference on Principles and Practice of Knowledge Discovery in Databases. Lecture Notes in Computer Science 1704, Prague: Springer, 1999. 262~270.
    [21] Breunig, M. M., Kriegel, H. P., Ng, R. T., et al. LOF: identifying density-based local outliers. In: Chen, W., Naughton, J.F., Bernstein, P.A., eds. Proceedings of the ACM SIGMOD International Conference on Management of Data. Dallas, Texas: ACM Press, 2000. 93~104.
    [22] Beyer, K., Goldstein, J., Ramakrishnan, R., et al. When is nearest neighbors meaningful? In: Beeri, C., Buneman, P., eds. Proceedings of the 7th International Conference on Data Theory. Lecture Notes in Computer Science 1540, Jerusalem: Springer, 1999. 217~235.
    [23] Hinneburg, A., Aggarwal, C.C., Keim, D.A.. What is the nearest neighbor in high dimensional spaces? In: Abbadi, A.E., Brodie, M.L., Chakravarthy, S., et al. eds. Proceedings of the 26th International Conference on Very Large Data Bases. Cairo: Morgan Kaufmann, 2000: 506~515.
    [24] Aggarwal, C. C., Yu, P. Finding generalized projected clusters in high dimensional spaces. In: Chen, W., Naughton, J.F., Bernstein, P.A., eds. Proceedings of the ACM SIGMOD International Conference on Management of Data. Dallas, Texas: ACM Press, 2000. 70~81.
    [25] Dougherty, J., Kohavi, R., Sahami, M. Supervised and unsupervised discretization of continuous features. In: Prieditis, A., Russell, S.J., eds. Proceedings of the 12th International Conference on Machine Learning. Tahoe, CA: Morgan Kaufmann, 1995. 194~202.
    [26] Agrawal, R., Srikant, R. Fast algorithms for mining association rules in large databases. In: Bocca, J.B., Jarke, M., Zaniolo, C., eds. Proceedings of the 20th International Conference on Very Large Data Bases. Santiago: Morgan Kaufmann, 1994. 487~499.
    [27] Karypis, G., Aggarwal, R., Kumar, V., et al. Multilevel hypergraph partitioning: application in VLSI design. In: Proceedings of the ACM/IEEE Design Automation Conference. Anaheim, CA: ACM Press, 1997. 526~529.
    Comments
    Comments
    分享到微博
    Submit
Get Citation

魏藜,宫学庆,钱卫宁,周傲英.高维空间中的离群点发现.软件学报,2002,13(2):280-290

Copy
Share
Article Metrics
  • Abstract:4278
  • PDF: 6176
  • HTML: 0
  • Cited by: 0
History
  • Received:April 20,2001
  • Revised:September 20,2001
You are the first2034272Visitors
Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-4
Address:4# South Fourth Street, Zhong Guan Cun, Beijing 100190,Postal Code:100190
Phone:010-62562563 Fax:010-62562533 Email:jos@iscas.ac.cn
Technical Support:Beijing Qinyun Technology Development Co., Ltd.

Beijing Public Network Security No. 11040202500063