Monday, 15:30


Document Representation Methods for Clustering Bilingual Documents

Shutian Ma1, Chengzhi Zhang1,2, Daqing He3
1Nanjing University of Science and Technology, China, People’s Republic of; 2Jiangsu Key Laboratory of Data Engineering and Knowledge Service (Nanjing University), Nanjing, 210094, China; 3School of Information Sciences & Intelligent System Program, University of Pittsburgh


Globalization places people in a multilingual environment. A growing number of users access and share information in several languages for public or private purposes. To deliver relevant information in different languages, efficient management of multilingual documents is worth studying. Classification and clustering are the two typical methods for document management. However, classifying multilingual documents requires bridging language gaps, and the lack of training data and the high effort of corpus annotation raise its cost further. Clustering is therefore more suitable for such practical applications. Two main factors are involved in document clustering: the document representation method and the clustering algorithm. In this paper, we focus on the document representation method and demonstrate that the choice of representation affects the quality of clustering results. In our experiments, we use parallel corpora (English-Chinese documents on technology information) and comparable corpora (English and Chinese documents on mobile technology and wind energy) as datasets. We compare four document representation methods: Vector Space Model, Latent Semantic Indexing, Latent Dirichlet Allocation and Doc2Vec. Experimental results show that, in all clustering tasks, the accuracy of the Vector Space Model was not competitive with that of the other methods. Latent Semantic Indexing is overly sensitive to the corpus itself: it behaved differently when clustering the two topics of the comparable corpora. Latent Dirichlet Allocation performs best when clustering documents in small comparable corpora, while Doc2Vec performs best on the large document set of the parallel corpora. Accordingly, the characteristics of the corpora should be taken into account when choosing a document representation method in order to achieve better performance.
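As an illustration of the simplest of the four representations, the Vector Space Model, the following minimal sketch builds TF-IDF vectors and compares documents by cosine similarity. It is not the authors' pipeline: the toy documents, the particular TF-IDF weighting, and the function names `tfidf_vectors` and `cosine` are assumptions made for this example, and a real bilingual setup would also need tokenization and some way to bridge the two languages.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized document to a sparse TF-IDF vector (term -> weight)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy documents on the two comparable-corpus topics.
docs = [
    "mobile phone battery screen".split(),
    "mobile phone network signal".split(),
    "wind turbine energy blade".split(),
    "wind energy farm turbine".split(),
]
vectors = tfidf_vectors(docs)

same_topic = cosine(vectors[0], vectors[1])   # both about mobile technology
cross_topic = cosine(vectors[0], vectors[2])  # mobile technology vs. wind energy
```

A clustering algorithm would then group documents whose vectors are close. In this sketch, documents that share no terms have zero similarity, which is one reason a purely term-based representation can struggle across languages.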

Pixel Efficiency Analysis: A Quantitative Web Analytics Approach

Alex Brown1, Binky Lush1, Bernard J. Jansen2
1The Pennsylvania State University, University Park, PA, United States of America; 2Qatar Computing Research Institute, Doha, Qatar


We present a quantitative web analytics approach tailored to academic libraries. We introduce the construct of pixel efficiency analysis and the metrics of pixel efficiency value and conversion efficiency value for quantitatively evaluating website changes. Pixel efficiency analysis is the practice of relating screen real estate measured in pixels to the achievement of organizational goals and key performance indicators as indicated by quantifiable user behavioral interactions on a webpage. The concepts and measures are applied in a case study at a major academic library, focusing on four major webpages. The first level of analysis incorporates pixel efficiency analysis within an overarching web analytics investigation to identify key areas of improvement on the selected pages. The second level of analysis addresses the identified weaknesses through A/B testing and highlights the usefulness of pixel efficiency analysis. Lastly, the third level of analysis uses the pixel efficiency value to elicit the added worth of potential website changes.

Cardinal: Novel Software for Studying File Management Behavior

Jesse David Dinneen1, Fabian Odoni2, Ilja Frissen1, Charles-Antoine Julien1
1McGill University, Canada; 2University of Applied Sciences HTW Chur, Switzerland


In this paper we describe the design and trial use of Cardinal, novel software that overcomes the limitations of existing research tools used in personal information management (PIM) studies focusing on file management (FM) behavior. Cardinal facilitates large-scale collection of FM behavior data along an extensive list of file system properties and additional relevant dimensions (e.g., demographic, software, and hardware). It enables anonymous, remote, and asynchronous participation across the three major operating systems, uses a simple interface, and provides value to participants by presenting a summary of their file and folder collections. In a 15-day trial implementation, Cardinal examined over 2.3 million files across 46 unsupervised participants. To test its adaptability, we extended it to also collect psychological questionnaire responses and technological data from each participant. Participation sessions took an average of just over 10 minutes to complete, and participants reported positive impressions of their interactions. Following the pilot, we revised Cardinal to further decrease participation time and improve the user interface. Our tests suggest that Cardinal is a viable tool for FM research, and so we have made its source freely available to the PIM community.
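To give a sense of what collecting file system properties can look like, here is a minimal sketch that walks a folder tree and records only aggregate counts, never file names or contents. It is not Cardinal's implementation: the function name `summarize_tree`, the chosen properties, and the temporary demo folder are all assumptions made for this example.

```python
import os
import tempfile


def summarize_tree(root):
    """Walk root and return aggregate counts describing the collection."""
    n_files = 0
    n_folders = 0
    total_bytes = 0
    max_depth = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Depth of the current folder relative to root (root itself is 0).
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == os.curdir else rel.count(os.sep) + 1
        max_depth = max(max_depth, depth)
        n_folders += len(dirnames)
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                n_files += 1
                total_bytes += os.path.getsize(path)
    return {
        "files": n_files,
        "folders": n_folders,
        "total_bytes": total_bytes,
        "max_depth": max_depth,
    }


# Demo on a throwaway folder tree so the sketch does not touch real files.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "projects", "drafts"))
    with open(os.path.join(root, "notes.txt"), "w") as f:
        f.write("hello")
    with open(os.path.join(root, "projects", "drafts", "a.txt"), "w") as f:
        f.write("abc")
    summary = summarize_tree(root)
```

Keeping only counts and sizes is one simple way to stay compatible with anonymous participation, and the resulting dictionary could also serve as the kind of summary shown back to a participant.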