Blog
Recently I have gained a better understanding of text classification and text clustering. Nowadays it is often necessary to process large numbers of documents. Comparison, grouping, and classification are the three basic tasks involved. However, the key to all of them is finding "something", a set of features, that represents the documents before we compare them.
The first step in document comparison is measuring the similarity between documents. Jaccard's index is a similarity measure based on set operations: the size of the intersection of two sets divided by the size of their union. We can use this formula to calculate the similarity between documents from their representations, for example their sets of words. There are also several methods for term weighting, which is routine when dealing with documents. The most common is to count how often each word appears in a document, on the assumption that the most frequent words are the most important for that document. Another method, Term Frequency-Inverse Document Frequency (TF-IDF), is quite useful: we check not only how often a word appears in the document but also how rare it is across the whole corpus. Going further, we can use the vector space model, in which each document is a vector whose elements are the tf-idf weights of the terms in that document. We can then calculate the similarity between documents with cosine similarity. These similarity measures taught me how to calculate similarity instead of just feeling or guessing, which is very inaccurate.
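To make these measures concrete, here is a minimal sketch in plain Python. The function names, the whitespace tokenizer, and the three sample sentences are my own assumptions for illustration, and the tf-idf formula used (relative term frequency times the log of the inverse document frequency) is only one of several common variants.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; real text needs more care.
    return text.lower().split()


def jaccard(tokens_a, tokens_b):
    # Jaccard's index: |A intersect B| / |A union B| over the sets of words.
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def tf_idf_vectors(docs):
    # docs is a list of token lists; returns one {term: weight} dict per document.
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors


def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    tokenize("the cat sat on the mat"),
    tokenize("the dog sat on the log"),
    tokenize("stock markets fell today"),
]
vectors = tf_idf_vectors(docs)

print(jaccard(docs[0], docs[1]))        # shares 3 of 7 distinct words
print(cosine(vectors[0], vectors[1]))   # some overlap, so greater than 0
print(cosine(vectors[0], vectors[2]))   # no shared words, so 0.0
```

Note that a word appearing in every document gets a tf-idf weight of zero in this variant, which is exactly the "how rare is it across the corpus" idea.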
Then we had lessons on text classification. Actually, we classify things all the time in our daily lives; it is a basic skill for everyone. After these lessons we can apply some methods to classify different texts. There are two steps. First, we train a classifier using labeled data. Second, we classify new data using that classifier. After learning Naive Bayes text classification, we also have a better understanding of probability theory. The knowledge we learn in class can be applied to real life so well, which I find wonderful and interesting. For text clustering, we usually use K-means clustering. It is the most common hard clustering algorithm, and it is genuinely useful. Model-based clustering is another choice. During these classes, I imagined that we will be able to absorb more information when we watch the news in the future. And when we have jobs, we will be better at handling the various passages and documents that affect our work. How useful it is! I really enjoyed the lessons, although I usually have difficulty with Python. Learning Python also gives me one more excellent skill. These lessons left a strong impression on me.
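The two classification steps described above (train on labeled data, then classify new data) can be sketched with a small multinomial Naive Bayes classifier in plain Python. This is only an illustrative sketch: the function names, the add-one (Laplace) smoothing, the choice to skip unseen words, and the tiny "spam"/"ham" training set are my own assumptions, not something taken from the lessons.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    # Step 1: collect counts from (tokens, label) pairs.
    class_counts = Counter()            # number of documents per class
    word_counts = defaultdict(Counter)  # word counts per class
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def classify(tokens, model):
    # Step 2: pick the class with the highest log posterior score.
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocab:
                continue  # assumption: ignore words never seen in training
            # Add-one smoothing avoids log(0) for words unseen in this class.
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training data.
training_data = [
    ("cheap pills buy now".split(), "spam"),
    ("buy cheap watches".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("lunch meeting on monday".split(), "ham"),
]
model = train(training_data)

print(classify("cheap pills".split(), model))     # spam
print(classify("monday meeting".split(), model))  # ham
```

Working with log probabilities instead of multiplying raw probabilities keeps the scores from underflowing when documents are long.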





