cicy's kindergarden

Wednesday, November 5, 2014

4th...

Blog

These days I have gained a better understanding of text classification and text clustering. Nowadays it is very necessary to process the documents we need. The three basic steps, comparison, grouping and classification, are the fundamental things that we need to do. However, the key to solving the problem is to find “something” that characterizes these documents before we compare them.

The first thing we need to do in document comparison is to measure the similarity between documents. Jaccard’s index is a similarity measure based on set operations. We can use its formula to calculate the similarity between different documents from their representations. There are a few methods for term weighting, which is a routine step when we deal with documents. Counting the words that appear most frequently in a document is the most common method; these words are treated as the most important ones for the document. The other method, called Term Frequency & Inverse Document Frequency (TF-IDF), is quite useful. Not only do we check how often a word appears in the document, but we also check how rare it is across the whole corpus. Going further, we can use the vector space model, in which each document is a vector whose elements represent the tf-idf weight of a term within that document. We can then calculate the similarity between documents with the cosine similarity. Together, these similarity measures taught me how to calculate similarity instead of just feeling or guessing, which is very inaccurate.
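To make the TF-IDF and cosine similarity ideas above concrete, here is a minimal sketch in plain Python. It assumes whitespace tokenization, raw term counts for tf, and idf = log(N / df); these are common choices but not necessarily the exact variant from the lessons. The function names and the three sample documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw count of each term in this document
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_a * norm_b)

# Hypothetical sample documents.
docs = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "stocks fell on the market today",
]
vecs = tf_idf_vectors(docs)
sim_cats = cosine_similarity(vecs[0], vecs[1])
sim_mixed = cosine_similarity(vecs[0], vecs[2])
print(sim_cats, sim_mixed)
```

Because "the" appears in every document, its idf is log(1) = 0 and it contributes nothing to the similarity, which is exactly the point of checking how rare a word is across the corpus.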

Then we had lessons about text classification. Actually, we classify things all the time in our daily life; it is a basic skill for everyone. After these lessons we can use some methods to classify different texts. There are two steps. First, we need to train a classifier using labeled data. Second, we need to classify new data using the classifier. After learning Naive Bayes text classification, we have a better understanding of probability theory. The knowledge we learn in class can be applied to real life so well, which is wonderful and interesting for me. For text clustering, we usually use K-means clustering. It is the most common hard clustering algorithm, and I have to say it is useful. Model-based clustering is another choice for us. After these classes, I can imagine that we will absorb more information while watching the news in the future. And when we have jobs, we will be better at dealing with the passages and documents that affect our work. How useful it is! I really enjoy the lessons, although I usually have difficulty with Python. Learning Python also gives me one more excellent skill. These lessons have left a strong impression on me.
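The two steps above (train on labeled data, then classify new data) can be sketched as a small multinomial Naive Bayes classifier. This is only an illustration under some assumptions: whitespace tokenization, add-one (Laplace) smoothing, and ignoring words never seen in training. The function names and the tiny training set are hypothetical, not taken from the course material.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """Step 1: count documents and words per label from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, word_counts, vocabulary

def classify(text, model):
    """Step 2: return the label with the highest log posterior score."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # log P(label) + sum of log P(word | label), with add-one smoothing
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocabulary:
                continue  # assumption: skip words never seen in training
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled data.
training_data = [
    ("the team won the match", "sports"),
    ("a great goal in the final game", "sports"),
    ("the market fell and stocks dropped", "finance"),
    ("investors sold shares as prices fell", "finance"),
]
model = train_naive_bayes(training_data)
print(classify("stocks and shares fell", model))
print(classify("the final match", model))
```

Working with log probabilities instead of multiplying raw probabilities avoids numeric underflow when a document has many words.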

Thursday, October 16, 2014

Graph Theory and Social Network

It is no surprise to find that people are social by instinct. Although they always keep their own secrets and try to protect their privacy, they are still born to share. In the old days, people went around and gossiped, and that is where the first idea of sharing ideas came from. This kind of idea sharing can also sort people into small groups. The members of a group usually have something in common, so they are more likely to focus on the same topic next time.

That's where Douban came up with the idea of building a website to recommend the music you like. At first they let you tell them whether you like a piece of music or not, then they try to place you in a certain group, and then they can recommend the music you are very likely to appreciate.

Another way to show the connections between people is through a graph. A graph makes our relationships clearer, including the closeness of two people even if they don't know each other directly.
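One simple way to read "closeness" off such a graph is to count the friendship links on the shortest path between two people. The sketch below uses breadth-first search for that. The names, the friendship data, and the function `degrees_of_separation` are all made up for illustration, and shortest-path length is only one of several possible closeness measures.

```python
from collections import deque

def degrees_of_separation(graph, start, target):
    """Return the number of links on a shortest path, or None if unreachable."""
    if start == target:
        return 0
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        person, distance = queue.popleft()
        for friend in graph.get(person, ()):
            if friend == target:
                return distance + 1
            if friend not in visited:
                visited.add(friend)
                queue.append((friend, distance + 1))
    return None

# Hypothetical friendship graph: each person maps to a list of direct friends.
friends = {
    "Ann": ["Bob"],
    "Bob": ["Ann", "Cai"],
    "Cai": ["Bob", "Dee"],
    "Dee": ["Cai"],
    "Eve": [],
}
print(degrees_of_separation(friends, "Ann", "Dee"))
```

Here Ann and Dee do not know each other directly, but the graph shows they are connected through Bob and Cai.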

Wednesday, October 1, 2014

content analytics

We are now looking at social analysis from a new angle, one that is quite familiar to us: algorithms. There is no doubt that we will look into the so-called bag-of-words model, which considers the words themselves instead of their order. A lot of measures were shown in class, such as Jaccard's index, to calculate similarity. What remains most interesting is term weighting, because several approaches have turned out to be bad solutions to this problem.
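As a small illustration of the bag-of-words model and Jaccard's index mentioned above, here is a sketch in Python. It assumes simple whitespace tokenization and lowercasing, and it treats two empty texts as identical, which is just a convention chosen here. The function names are hypothetical.

```python
from collections import Counter

def bag_of_words(text):
    """Count each word, ignoring the order in which the words appear."""
    return Counter(text.lower().split())

def jaccard_index(text_a, text_b):
    """Size of the shared vocabulary divided by the size of the combined one."""
    set_a = set(bag_of_words(text_a))
    set_b = set(bag_of_words(text_b))
    if not set_a and not set_b:
        return 1.0  # assumed convention for two empty texts
    return len(set_a & set_b) / len(set_a | set_b)

# Word order is lost: these two sentences get the same bag of words.
print(bag_of_words("dog bites man") == bag_of_words("man bites dog"))
print(jaccard_index("a b c", "b c d"))
```

The first example also shows the weakness of the model: "dog bites man" and "man bites dog" mean different things but look identical once order is dropped.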

Classification can be widely used in daily life, but it is still not prevalent because of its lack of accuracy. Can Naive Bayes text classification be applied properly to characters such as Chinese words? The answer might not be quite satisfying.
Believe it or not, we can never be isolated from society, which makes it easier to infer that your opinion can always be worked out from the people around you.

In fact, sentiment analysis is never an easy task. It is more than complicated to analyse someone's sentiment from his or her own words. Think of the rhetoric, and especially the irony, that we use daily, and it is not hard to understand why. Moreover, we try to weigh the strength of one word against another, which is an arduous task.

Sunday, September 21, 2014

IEMS5723's :)

Before the first class began, I always thought of social media analytics as a course concerned only with technology. However, I was excited to find that the analytics itself always aims to serve people, so the human side also plays a great role in the process. But here is the problem: the SCT tells us that people, the online environment and behavior constantly influence each other, so how can we predict the online environment? It is quite possible that we can never set a fixed model to predict the behavior of people online. The model must be changed every day to accommodate their behavior. Besides, businesses use the data collected from social media analytics and interpret it to understand people's preferences and interests, but their ultimate goal is to sell their products. In the future we may only get a link, click it and find what we need, but will it be of high quality and fit us exactly? If we surf the web and find plenty of advertised products online, it will still be hard for us to find the ones that fit us best. So we still have to balance the aim of earning profits against the aim of bringing convenience to people. This is only a small part of social media analytics. Still, it is quite an interesting area, the wiki is a very successful experiment, and social media does bring us closer.

For example, my parents would never email or phone me to tell me about a piece of recent news, but now they share news with me through WeChat. And years ago we would never have known that we could change an encyclopedia in such an easy way. Through the class, I realized that although extracting the content is very complex, interpreting the collected data is more complicated, and choosing the algorithm that best fits the collected data, interpreting the results and getting what we need is really hard.