Document Type : Research Paper

Authors

1 Assistant Professor, Department of Industrial Management, Faculty of Management, Islamic Azad University, Tehran North Branch, Tehran

2 MA, Information Technology Management, Faculty of Management, Islamic Azad University, Tehran North Branch, Tehran.

Abstract

 
Given the high volume of information on the web, automatic data extraction systems have received growing attention. Clustering is one of the most important data extraction methods. Many clustering methods are available today, and most of them are based on vector models. In these models, each document is treated as a set of words, and the order of words within a sentence is ignored. Because meaning in natural language depends heavily on word order, these methods have significant shortcomings. To overcome them, this paper presents a new method for clustering HTML documents that builds on the suffix tree clustering (STC) algorithm, which is used for clustering snippets. The method, called key-sentence-based clustering (KS_STC), builds a weighted vector for each document and uses this vector to extract the key sentences of the text. Finally, these key sentences are passed to the STC algorithm for clustering.
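The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the weighted vector is computed, so TF-IDF is assumed here, and a plain index of shared word n-grams stands in for the suffix tree that STC uses to find phrases common to several documents. All function names and parameters (`tfidf_weights`, `key_sentences`, `shared_phrases`, `k`, `max_len`, `min_docs`) are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a string and return its alphanumeric tokens in order."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_sentences(text):
    """Naive sentence splitter on ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tfidf_weights(docs):
    """Return one weighted term vector (a dict) per document.

    TF-IDF is an assumption; the source only says "weighted vector".
    """
    tokenized = [tokenize(d) for d in docs]
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    n_docs = len(docs)
    vectors = []
    for toks in tokenized:
        term_freq = Counter(toks)
        total = max(len(toks), 1)
        vectors.append({
            term: (count / total) * (math.log(n_docs / doc_freq[term]) + 1.0)
            for term, count in term_freq.items()
        })
    return vectors


def key_sentences(doc, weights, k=2):
    """Pick the k sentences with the highest mean term weight, in document order."""
    sents = split_sentences(doc)

    def score(sentence):
        toks = tokenize(sentence)
        if not toks:
            return 0.0
        return sum(weights.get(t, 0.0) for t in toks) / len(toks)

    top = sorted(range(len(sents)), key=lambda i: score(sents[i]), reverse=True)[:k]
    return [sents[i] for i in sorted(top)]


def shared_phrases(key_sents_per_doc, min_docs=2, max_len=3):
    """Map each word n-gram to the documents whose key sentences contain it.

    Stand-in for STC base clusters: a real STC implementation would obtain
    these phrases from a suffix tree and then merge overlapping clusters.
    """
    index = defaultdict(set)
    for doc_id, sents in enumerate(key_sents_per_doc):
        for sent in sents:
            toks = tokenize(sent)
            for n in range(1, max_len + 1):
                for i in range(len(toks) - n + 1):
                    index[" ".join(toks[i:i + n])].add(doc_id)
    return {phrase: ids for phrase, ids in index.items() if len(ids) >= min_docs}
```

Because phrases are kept as ordered word sequences, "suffix tree clustering" and "clustering tree suffix" are treated as different phrases, which is the property the abstract says vector models lack. The merging of overlapping base clusters that STC performs afterwards is omitted from this sketch.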
 
