ISSN (Print): 2374-1953 ISSN (Online): 2374-1988
American Journal of Information Systems. 2017, 5(1), 27-32
DOI: 10.12691/ajis-5-1-4
DACE: Extracting and Exploring Large Scale Chinese Web Collocations with Distributed Computing

Lan Huang1, Juan Zhou1, Jing Xue1, Yongxing Li1 and Youfu Du1

1College of Computer Science, Yangtze University, Jingzhou, Hubei, China

Pub. Date: August 13, 2017

Lan Huang, Juan Zhou, Jing Xue, Yongxing Li and Youfu Du. DACE: Extracting and Exploring Large Scale Chinese Web Collocations with Distributed Computing. American Journal of Information Systems. 2017; 5(1):27-32. doi: 10.12691/ajis-5-1-4


Words that often occur together form collocations. Collocations are important components of language and have been used in many natural language processing tasks, including natural language generation, machine translation, information retrieval, sentiment analysis and language learning. However, collocations are difficult to master, especially for second-language learners, and new collocations emerge quickly, driven in part by the abundant user-generated content on the Web. In this paper we present DACE, an automatic collocation extraction and exploration system for the Chinese language. We identify collocations using three measures: frequency, mutual information and the χ² test. The system is built on distributed computing frameworks so that it can process large-scale corpora efficiently. Empirical evaluation and analysis of the system show the effectiveness of the collocation measures and the efficiency of the distributed computing processes.
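
The three measures named above can be illustrated on a small scale. The following Python sketch is not the DACE implementation: it runs on a single machine, assumes the text has already been segmented into words, treats only adjacent word pairs as collocation candidates, and uses the textbook definitions of pointwise mutual information and Pearson's χ² statistic on a 2x2 contingency table. The function name score_bigrams is hypothetical.

```python
import math
from collections import Counter


def score_bigrams(tokens):
    """Return {(w1, w2): (frequency, pmi, chi2)} for adjacent word pairs.

    Assumption: `tokens` is a list of already-segmented words.
    """
    pairs = list(zip(tokens, tokens[1:]))
    n = len(pairs)                          # total number of bigrams
    pair_counts = Counter(pairs)
    left = Counter(w1 for w1, _ in pairs)   # w1 seen in first position
    right = Counter(w2 for _, w2 in pairs)  # w2 seen in second position

    scores = {}
    for (w1, w2), o11 in pair_counts.items():
        # 2x2 contingency table of observed counts
        o12 = left[w1] - o11          # w1 followed by something else
        o21 = right[w2] - o11         # something else followed by w2
        o22 = n - o11 - o12 - o21     # neither w1 first nor w2 second

        # Pointwise mutual information: log2 of P(w1, w2) / (P(w1) P(w2))
        pmi = math.log2(o11 * n / (left[w1] * right[w2]))

        # Pearson's chi-square statistic for a 2x2 table
        denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
        chi2 = n * (o11 * o22 - o12 * o21) ** 2 / denom if denom else 0.0

        scores[(w1, w2)] = (o11, pmi, chi2)
    return scores


# Toy input: the pair ("a", "b") recurs, the other pairs occur once.
example = score_bigrams(["a", "b", "a", "b", "c", "d"])
```

In a distributed setting, the counting step (pair_counts, left and right) is the part that would naturally be expressed as MapReduce jobs, since each count is a sum over independent portions of the corpus; the scoring step then needs only the aggregated counts.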

information extraction, collocation, MapReduce, Chinese natural language processing

