Blog
Recently I have gained a better understanding of text classification and text clustering. Nowadays it is often necessary to process the documents we work with. The three basic operations are comparison, grouping and classification. However, the key to solving the problem is to find “something” that characterizes these documents, that is, a set of features, before we compare them.
The first step in document comparison is to measure how similar the documents are. Jaccard’s index is a similarity measure based on set operations: the size of the intersection of two sets divided by the size of their union. We can use this formula to calculate the similarity between documents that are represented as sets of terms. There are also several methods of term weighting, which is standard practice when we deal with documents. The most common method is to count how often each word appears in a document (term frequency), on the assumption that the most frequent words are the most important for that document. Another method, Term Frequency-Inverse Document Frequency (TF-IDF), is quite useful. It checks not only how often a word appears in the document but also how rare the word is across the whole corpus. Going further, we can use the vector space model, in which each document is a vector whose elements are the TF-IDF weights of the terms within that document. We can then calculate the similarity between documents with cosine similarity. Together, these similarity measures taught me how to calculate similarity instead of relying on feeling or guessing, which is very inaccurate.
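The measures described above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the course's reference code: the whitespace tokenizer, the function names and the particular TF-IDF variant (relative term frequency times log(N / document frequency)) are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split is enough for this sketch.
    return text.lower().split()


def jaccard(doc_a, doc_b):
    # Jaccard index: |A intersect B| / |A union B| over the sets of terms.
    a, b = set(tokenize(doc_a)), set(tokenize(doc_b))
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def tfidf_vectors(docs):
    # Returns one sparse vector (dict: term -> weight) per document.
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    # Cosine similarity between two sparse vectors.
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    docs = ["the cat sat on the mat", "the cat ran", "stock prices fell"]
    vecs = tfidf_vectors(docs)
    print(jaccard(docs[0], docs[1]))
    print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

With this weighting, a word that appears in every document gets weight zero, which matches the idea that words common to the whole corpus tell us little about any single document.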
Then we studied text classification. Actually, we classify things all the time in our daily life; it is a basic skill for everyone. After these lessons we can use some methods to classify different texts. There are two steps. First, we train a classifier using labeled data. Second, we classify new data using that classifier. After learning Naive Bayes text classification, we also have a better understanding of probability theory. The knowledge we learn in class can be applied to real life so well, which is wonderful and interesting to me. For text clustering, we usually use K-means clustering. It is the most common hard clustering algorithm, and it is useful in practice. Model-based clustering is another choice. After these classes, I can imagine that we will absorb more information when we watch the news in the future. And when we have jobs, we will be better able to deal with the passages and documents that affect our work. How useful it is! I really enjoyed the lessons, although I often have difficulty with Python. Learning Python also gives me one more valuable skill. These lessons left a strong impression on me.
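The two classification steps mentioned above (train on labeled data, then classify new data) can be sketched as a small multinomial Naive Bayes classifier. This is a hedged illustration rather than the exact method from the lessons: the function names, the add-one (Laplace) smoothing, the whitespace tokenizer and the tiny spam/ham training set are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    # Step 1: train on (text, label) pairs by counting documents and words.
    class_doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        class_doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_doc_counts, word_counts, vocab


def classify(text, model):
    # Step 2: pick the label with the highest log posterior score.
    class_doc_counts, word_counts, vocab = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in class_doc_counts.items():
        # Log prior: share of training documents with this label.
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing keeps unseen (word, label) pairs non-zero.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    # Hypothetical toy training data, for illustration only.
    training = [
        ("cheap pills buy now", "spam"),
        ("win money now", "spam"),
        ("meeting at noon", "ham"),
        ("lunch meeting tomorrow", "ham"),
    ]
    model = train_naive_bayes(training)
    print(classify("buy cheap money", model))
    print(classify("meeting tomorrow", model))
```

Working with sums of logarithms instead of products of probabilities avoids numeric underflow when documents are long.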
It is obvious that your blog is related to assignment 2, and after reading it I have a better understanding of the assignment.
Hi, Xi! This is a wonderful summary of what we have learned in the course, and it makes the knowledge of text classification clearer to me. Thank you very much!
Great post. I have used Jaccard similarity to classify a document before. It's simple, but the accuracy is not very good. If we have a good stemmer, the accuracy will be higher, because different words in different documents are sometimes just different forms of the same word, and a stemmer can remove this interference.
Reading your blog helps me understand text classification, but it's a pity that I didn't read it before I did assignment 2.
What an interesting thought. I think most people already use document classification in daily life; they just do not realize that it is so-called 'document classification'. Humans build up their own 'library' and 'training data set' from experience.
Thanks a lot for your interesting post.
Very good summary of text classification.
Thanks for sharing. It's a good summary. Text classification is useful in content analysis of social media!
Thank you for the informative explanation of text classification. It would be an effective method to analyze passages and determine the result automatically.