Contact person: Aldo Gangemi
Text classification is a core problem in many domains, with applications such as document categorization, sentiment analysis, and spam detection. Its goal is to assign each document to one or more categories. The common approach is to use machine learning classifiers, which learn from examples. Building such classifiers requires annotated datasets, that is, documents previously labeled by humans with their corresponding categories. This approach is known as supervised learning because the labeled training set supervises the learning process. The talk will cover the vector space model, bag of words, term frequency, tf-idf, n-gram models, performance evaluation, precision-recall analysis, and k-fold cross-validation. Finally, an example of binary classification will be shown with WEKA, a free machine learning software suite that includes visualization tools and algorithms for data analysis and predictive modeling.
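As a rough illustration of two of the concepts listed above, the sketch below computes tf-idf weights over bag-of-words counts and then precision and recall for a binary task. It is not taken from the talk and does not use WEKA. The function names, the whitespace tokenizer, and the weighting tf × log(N/df) are assumptions chosen for brevity; other tf-idf variants are in common use.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; real pipelines
    # usually also strip punctuation and stop words.
    return text.lower().split()


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Uses raw term frequency and idf = log(N / df), one of several
    common variants.
    """
    bags = [Counter(tokenize(d)) for d in docs]  # bag of words per document
    n_docs = len(bags)
    # Document frequency: number of documents containing each term.
    df = Counter(term for bag in bags for term in bag)
    return [
        {term: count * math.log(n_docs / df[term]) for term, count in bag.items()}
        for bag in bags
    ]


def precision_recall(y_true, y_pred, positive):
    """Precision and recall for the class named by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical toy documents, for illustration only.
    docs = ["cheap pills cheap", "meeting at noon", "cheap meeting"]
    for weights in tf_idf(docs):
        print(weights)
    print(precision_recall(["spam", "ham", "spam", "ham"],
                           ["spam", "spam", "ham", "ham"],
                           positive="spam"))
```

Terms that occur in every document receive weight zero under this variant, which is why tf-idf tends to favor terms that discriminate between documents.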