American Journal of Information Systems
ISSN (Print): 2374-1953
ISSN (Online): 2374-1988
Website: http://www.sciepub.com/journal/ajis
Editor-in-chief: Sergii Kavun
Open Access
American Journal of Information Systems. 2017, 5(1), 27-32
DOI: 10.12691/ajis-5-1-4

DACE: Extracting and Exploring Large Scale Chinese Web Collocations with Distributed Computing

Lan Huang1, Juan Zhou1, Jing Xue1, Yongxing Li1 and Youfu Du1

1College of Computer Science, Yangtze University, Jingzhou, Hubei, China

Pub. Date: August 13, 2017

Cite this paper:
Lan Huang, Juan Zhou, Jing Xue, Yongxing Li and Youfu Du. DACE: Extracting and Exploring Large Scale Chinese Web Collocations with Distributed Computing. American Journal of Information Systems. 2017; 5(1):27-32. doi: 10.12691/ajis-5-1-4

Abstract

Words that often occur together form collocations. Collocations are important components of language and have been used in many natural language processing tasks, including natural language generation, machine translation, information retrieval, sentiment analysis and language learning. However, collocations are difficult to master, especially for second-language learners, and new collocations now emerge quickly, driven by the abundance of user-generated content on the Web. In this paper we present DACE, an automatic collocation extraction and exploration system for the Chinese language. We identify collocations using three measures: frequency, mutual information and the χ² test. The system is built on distributed computing frameworks so that it can process large-scale corpora efficiently. Empirical evaluation and analysis of the system show the effectiveness of the collocation measures and the efficiency of the distributed computing processes.
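The three measures named in the abstract can be illustrated with a small single-machine sketch. It is not the DACE implementation: it assumes the text is already segmented into words, considers only adjacent word pairs, and uses the common textbook definitions of pointwise mutual information and the Pearson χ² statistic on a 2×2 contingency table. The function names (`pmi`, `chi_square`, `score_bigrams`) and the toy sentences are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def pmi(c12, c1, c2, n):
    """Pointwise mutual information: log2( P(w1, w2) / (P(w1) * P(w2)) )."""
    return math.log2(c12 * n / (c1 * c2))


def chi_square(c12, c1, c2, n):
    """Pearson chi-square for the 2x2 contingency table of the pair (w1, w2)."""
    o11 = c12                 # w1 followed by w2
    o12 = c1 - c12            # w1 followed by another word
    o21 = c2 - c12            # another word followed by w2
    o22 = n - c1 - c2 + c12   # neither w1 first nor w2 second
    num = n * (o11 * o22 - o12 * o21) ** 2
    den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return num / den if den else 0.0


def score_bigrams(sentences):
    """Score every adjacent word pair by frequency, PMI and chi-square.

    `sentences` is an iterable of token lists (assumed to be pre-segmented).
    """
    bigrams = Counter()
    for tokens in sentences:
        bigrams.update(zip(tokens, tokens[1:]))

    n = sum(bigrams.values())          # total number of adjacent pairs
    left, right = Counter(), Counter()
    for (w1, w2), c in bigrams.items():
        left[w1] += c                  # pairs with w1 in first position
        right[w2] += c                 # pairs with w2 in second position

    scores = {}
    for (w1, w2), c12 in bigrams.items():
        c1, c2 = left[w1], right[w2]
        scores[(w1, w2)] = {
            "freq": c12,
            "pmi": pmi(c12, c1, c2, n),
            "chi2": chi_square(c12, c1, c2, n),
        }
    return scores


if __name__ == "__main__":
    # Toy, hand-segmented sentences used only to exercise the functions.
    corpus = [
        ["提高", "水平"],
        ["提高", "水平"],
        ["提高", "效率"],
        ["生活", "水平"],
    ]
    ranked = sorted(score_bigrams(corpus).items(),
                    key=lambda kv: kv[1]["chi2"], reverse=True)
    for pair, s in ranked:
        print(pair, s)
```

In a distributed setting, the pair counting in `score_bigrams` is the part that would plausibly be spread across workers (each worker emits pair counts, which are then summed per pair); the scoring functions need only the aggregated counts. That split is an assumption about how such a system could be organized, not a description of DACE itself.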

Keywords:
information extraction, collocation, MapReduce, Chinese, natural language processing

This work is licensed under a Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

