Wednesday, November 5, 2014

4th...


Blog 
  These days I have gained a better understanding of text classification and text clustering. Nowadays it is often necessary to process large numbers of documents. Three basic tasks are fundamental here: comparison, grouping, and classification. However, the key to solving the problem is to find features that represent the documents before we compare them.
  The first thing we need for document comparison is a way to measure similarity between documents. Jaccard's index is a similarity measure based on set operations: the size of the intersection of two sets divided by the size of their union. We can use this formula to calculate the similarity between documents from their representations as sets of words. There are also a few methods for term weighting, which is a routine step when we deal with documents. The most common method is term frequency: we count how often each word appears in a document, on the assumption that the most frequent words are the most important for that document. The other method, Term Frequency-Inverse Document Frequency (TF-IDF), is quite useful. Not only do we check how often a word appears in the document, but we also check how rare it is across the whole corpus. Going further, we can use the vector space model, in which each document is a vector whose elements are the TF-IDF weights of the terms in that document. We can then calculate the similarity between documents with cosine similarity. Together, these similarity measures taught me how to calculate similarity instead of just feeling or guessing, which is very inaccurate.
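  The measures above can be sketched in a few lines of Python. This is only a minimal illustration, not the course's reference code: the function names, the whitespace tokenizer, the three toy documents, and the particular TF-IDF variant (relative term frequency times log(N / document frequency)) are all my own assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; no stemming or stop words.
    return text.lower().split()


def jaccard(tokens_a, tokens_b):
    # Jaccard index: |A intersect B| / |A union B| on the sets of words.
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def tf_idf_vectors(docs):
    # One sparse vector (dict term -> weight) per document.
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors


def cosine(u, v):
    # Cosine similarity between two sparse vectors.
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus, invented for this example.
docs = [tokenize(d) for d in [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply today",
]]
vectors = tf_idf_vectors(docs)

print(jaccard(docs[0], docs[1]))        # two documents sharing some words
print(cosine(vectors[0], vectors[1]))   # TF-IDF cosine for the same pair
print(cosine(vectors[0], vectors[2]))   # no shared words
```

  With this weighting, a word that appears in every document gets an IDF of log(1) = 0 and so contributes nothing to the cosine, which is exactly the "how rare is it across the corpus" idea.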
  Then we studied text classification. Actually, we classify things all the time in our daily lives; it is a basic skill for everyone. After these lessons we can use several methods to classify texts. There are two steps. First, we train a classifier using labeled data. Second, we classify new data using that classifier. After learning Naive Bayes text classification, we also have a better understanding of probability theory. The knowledge we learn in class can be applied to real life so well, which I find wonderful and interesting. For text clustering, we usually use K-means clustering. It is the most common hard clustering algorithm, meaning each document is assigned to exactly one cluster, and it is genuinely useful. Model-based clustering is another option. After these classes, I can imagine that we will absorb more information when watching the news in the future. And when we have jobs, we will be better at dealing with the passages and documents that affect our work. How useful it is! I really enjoy the lessons, although I often have difficulty with Python. Learning Python also gives me one more excellent skill. These lessons have left a strong impression on me.
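  The two steps of classification (train on labeled data, then classify new data) can be sketched as a small multinomial Naive Bayes classifier. Again, this is only a hedged illustration: the function names, the add-one (Laplace) smoothing, the choice to skip words never seen in training, and the tiny spam/ham training set are my own assumptions, not material from the lessons.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    # Step 1: learn counts from (tokens, label) pairs.
    class_counts = Counter()            # number of documents per class
    word_counts = defaultdict(Counter)  # word occurrences per class
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def classify(tokens, model):
    # Step 2: pick the class with the highest log posterior.
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        # log P(class)
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocab:
                continue  # assumption: ignore words unseen in training
            # log P(word | class) with add-one smoothing
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for this example.
training = [
    ("cheap pills buy now".split(), "spam"),
    ("buy cheap watches".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("project meeting notes".split(), "ham"),
]
model = train_naive_bayes(training)

print(classify("cheap watches now".split(), model))
print(classify("monday meeting".split(), model))
```

  Working with log probabilities instead of multiplying raw probabilities avoids numeric underflow on long documents, and the smoothing keeps a single unseen word from driving a class probability to zero.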

8 comments:

  1. It is obvious that your blog is related to Assignment 2, and after reading it I have a better understanding of the assignment.

  2. Hi, Xi! This is a wonderful summary of what we have learned in the course, and it makes text classification clearer to me. Thank you very much!

  3. Great post. I have used Jaccard similarity to classify documents before. It's simple, but the accuracy is not very good. With a good stemmer the accuracy will be higher, because different words in different documents are sometimes just different forms of the same word, and a stemmer can remove this interference.

  4. Reading your blog helps me understand text classification, but it's a pity that I didn't read it before I did Assignment 2.

  5. What an interesting thought. I think most people already use document classification in daily life; they just don't realize that this is so-called "document classification". Humans build up their own "library" and "training data set" from experience.

    Thanks a lot for sharing such an interesting post.

  6. Very good summary about text classification.

  7. Thanks for sharing. It's a good summary. Text classification is useful in content analysis of social media!

  8. Thank you for the informative explanation of text classification. It would be an effective method for analyzing passages and determining the result automatically.
