Ziyin's Business Intelligence Blog: February 2012

Saturday, February 25, 2012

Data Visualization: Network Analysis of Facebook

These days I am fascinated by network graphs and spend hours and hours enjoying the beauty of the relationships between nodes and edges.

Advantages

A network graph is a data visualization technique that reveals how nodes are connected by edges. Nodes are entities, such as people, and edges are the relationships between them, such as friendships.

The beauty of a network graph lies in how simply it represents complex relationships. In a traditional matrix representation, each row and column stands for an entity, and each cell records whether two entities are related. Extra rows or columns can be added to mark clusters of similar entities. A network graph instead draws the entities as nodes, physically connects them with edges, and uses different colors to mark clusters of similar entities.
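To make the comparison concrete, here is a minimal sketch in Python of the same friendships stored both ways. The four names and the matrix are made up for illustration; they are not from my Facebook data.

```python
# Hypothetical friendship matrix: 1 means the two people are friends.
people = ["Ann", "Bob", "Cat", "Dan"]
matrix = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]


def matrix_to_edges(names, adjacency):
    """Return one (a, b) pair per friendship, reading only the upper triangle."""
    edges = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if adjacency[i][j]:
                edges.append((names[i], names[j]))
    return edges


edges = matrix_to_edges(people, matrix)
print(edges)
```

The matrix needs 16 cells to describe 4 friendships, while the edge list needs only 4 pairs, and the edge list is what a network graph draws.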

The other advantage of a network graph is that its crystal clear image and bright colors invite people to explore further. Researchers will never drown in a flood of black and white tables again.

The following will cover:
1. Tutorial of Network Graph of Facebook
2. Network Analysis of Facebook Connections

1. Tutorial of Network Graph of Facebook
Example: Facebook network Analysis using Graph

Everyone has a Facebook account, or at least one, I assume. Have you ever thought about a way to find out the relationships among you and your pals? In fact, Facebook itself has a data visualization app called "Challenger Network Graph". It creates the dumb network graph below:

Don't use that unless you are too busy to read this blog.

We will use a free program called "Gephi" to create our own network graph, and it will look much better than the one from that app, like this:

Here is the tutorial:

1. Go to gephi.org and download the amazing software for free.

2. Go to your Facebook page and search for the "netvizz" app. It will generate a "gdf" file for Gephi.

3. Open the gdf file with Gephi and click OK.
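If you are curious about what is inside the gdf file, it is plain text: a "nodedef>" header followed by one line per node, then an "edgedef>" header followed by one line per edge. Below is a minimal Python sketch that reads such a file. The sample content and the function name read_gdf are my own assumptions for illustration; a real netvizz export has more columns.

```python
import csv
import io

# Hypothetical sample in GDF layout; a real export has more columns.
SAMPLE_GDF = """nodedef>name VARCHAR,label VARCHAR
1,Ann
2,Bob
3,Cat
edgedef>node1 VARCHAR,node2 VARCHAR
1,2
2,3
"""


def read_gdf(text):
    """Split GDF text into a list of node dicts and a list of edge dicts."""
    nodes, edges = [], []
    columns, target = [], None
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue
        first = row[0]
        if first.startswith("nodedef>") or first.startswith("edgedef>"):
            # A header line: remember which list to fill and the column names.
            target = nodes if first.startswith("nodedef>") else edges
            row[0] = first.split(">", 1)[1]
            columns = [cell.split()[0] for cell in row]
            continue
        if target is not None:
            target.append(dict(zip(columns, row)))
    return nodes, edges


nodes, edges = read_gdf(SAMPLE_GDF)
print(len(nodes), "nodes,", len(edges), "edges")
```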

You will see the raw data as follows, unordered and not yet meaningful:

The next step is to find the clusters that share common features, e.g. friends in the same location.

4. Choose the data layout

In the Layout tab on the left-hand side, a drop-down list lets you choose how the graph is laid out. I prefer Fruchterman Reingold, which arranges the data in a round shape.

5. Run Modularity to identify different clusters.
The Modularity button is on the right-hand side of the panel, under the Statistics tab. It makes Gephi generate a new variable for each node, its modularity class, which identifies the cluster the node belongs to.
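For the curious, modularity is a score that is high when most edges fall inside clusters rather than between them. The Python sketch below does not reproduce the search Gephi performs to find the clusters; it only computes the score for a grouping given by hand, using the standard formula (share of edges inside each cluster minus the share expected by chance). The six-person graph is made up for illustration.

```python
def modularity(edges, community):
    """Modularity of a partition of an undirected, unweighted graph.

    edges: list of (a, b) pairs; community: dict mapping node -> cluster id.
    """
    m = len(edges)
    inside = {}      # edges with both ends in the same cluster
    degree_sum = {}  # total degree of the nodes in each cluster
    for a, b in edges:
        ca, cb = community[a], community[b]
        degree_sum[ca] = degree_sum.get(ca, 0) + 1
        degree_sum[cb] = degree_sum.get(cb, 0) + 1
        if ca == cb:
            inside[ca] = inside.get(ca, 0) + 1
    return sum(
        inside.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )


# Hypothetical graph: two triangles joined by a single edge (c-d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {name: 0 for name in "abcdef"}

q_split = modularity(edges, two_groups)
q_single = modularity(edges, one_group)
print(q_split, q_single)
```

Splitting the graph into its two triangles scores higher than lumping everyone together, which is why a modularity-based method would pick the split.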

6. Click the HITS button to calculate the score that will set the size of each node.
The size of each node is determined by its influence in the network.
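HITS is the hubs-and-authorities algorithm: a node scores high when it is linked to other high-scoring nodes. Here is a minimal Python sketch of the idea, assuming each friendship counts as a link in both directions. It is a plain power iteration on a made-up four-person graph, not Gephi's own implementation, which may differ in details such as stopping rules.

```python
def hits(friendships, iterations=50):
    """Power-iteration HITS on an undirected friendship list."""
    # A friendship counts as a link in both directions.
    links = friendships + [(b, a) for a, b in friendships]
    nodes = sorted({name for pair in friendships for name in pair})
    hub = {name: 1.0 for name in nodes}
    auth = {}
    for _ in range(iterations):
        # Authority: sum of the hub scores of everyone linking to the node.
        auth = {name: 0.0 for name in nodes}
        for src, dst in links:
            auth[dst] += hub[src]
        # Hub: sum of the authority scores of everyone the node links to.
        hub = {name: 0.0 for name in nodes}
        for src, dst in links:
            hub[src] += auth[dst]
        # Rescale so the scores do not grow without bound.
        auth_norm = sum(v * v for v in auth.values()) ** 0.5
        hub_norm = sum(v * v for v in hub.values()) ** 0.5
        auth = {name: v / auth_norm for name, v in auth.items()}
        hub = {name: v / hub_norm for name, v in hub.items()}
    return hub, auth


# Hypothetical network: Ann, Bob and Cat all know each other; Dan only knows Cat.
friendships = [("Ann", "Bob"), ("Ann", "Cat"), ("Bob", "Cat"), ("Cat", "Dan")]
hub_scores, authority_scores = hits(friendships)
for name in sorted(authority_scores, key=authority_scores.get, reverse=True):
    print(name, round(authority_scores[name], 3))
```

Note that the score is computed purely from who is connected to whom. On a friendship graph with no directions, the hub and authority scores come out the same.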

7. Color the clusters by choosing Modularity Class from the Partition tab on the left.

If you are not happy with the colors, just right-click in the area and choose randomized colors. You may find the ideal combination eventually.

8. Adjust the node size based on the HITS calculation.
From the Ranking tab, select the diamond-shaped button in the node tab to apply the ranking to the node size.

9. Now we need to label each node to see who is really influential in the network.
Click the "T" button at the bottom.

Then click the "A" button to scale the labels based on influence.

10. We are almost done. Check the preview of the network graph by clicking the "Preview" button at the top of the screen.

11. Select "Label" in the left sidebar to show labels in the preview, and click the "Refresh" button to see the preview. Every time you make a change, you need to refresh the graph.

12. The last thing you need to do is export the graph to "PNG", "SVG", or "PDF".
Sometimes, when you export the file, some labels are missing from the graph. You can use the "Options" button to change the width and height of the output.

13. It's done! The full picture will look like this.
Isn't it much better than the default Facebook network graph? At least you did it by yourself! Congrats!

14. Sometimes the graph is too complicated, with thousands of nodes and millions of edges. In that case you can use zoom.it to embed your large graph in your web page, like this:

2. Facebook Network Analysis:

Parameters:
Size of Nodes: The size of each node shows how valuable that node is. In this example, it comes from the HITS score calculated in step 6 and indicates how influential the node is for the whole network; I read it as a sign of how much that person posts and comments in the network. The larger the node, the more influence it has on the network.

Color of Nodes/Edges: The color indicates a cluster of people with similar features. In this example, location is the feature that separates the groups of friends.

Number of Edges: Each node has at least one edge connecting it to another node. The more edges it has, the more people it connects to.
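The number of edges per node is easy to check outside Gephi as well. Here is a minimal Python sketch that counts it from an edge list; the names are made up for illustration.

```python
from collections import Counter

# Hypothetical friendships, one pair per edge.
friendships = [("Ann", "Bob"), ("Ann", "Cat"), ("Bob", "Cat"), ("Cat", "Dan")]


def count_edges_per_node(edges):
    """Count how many edges touch each node (its degree)."""
    counts = Counter()
    for a, b in edges:
        counts[a] += 1
        counts[b] += 1
    return counts


degrees = count_edges_per_node(friendships)
print(degrees.most_common())
```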

From the Facebook network graph, it is clear that I have two main groups of friends, the red group and the blue group. I know they are separated by location: the red ones are from Chicago and the blue ones are from Tucson, AZ.

There are other colors for friends who are in neither Chicago nor Tucson. They are covered by the flood of red and blue...

Everyone in the red and blue groups is well connected within his or her own group, as the many crossing edges show. A connection is made when both people agree to add each other as a friend, so the edges have no direction. The connections are based on mutual acknowledgement. The middle of the red and blue groups is so crowded that it is not easy to identify the individual nodes.

In the blue network, the most influential people are Joseph Yu, Anagela Cheng, and Qiao Meng. In the red network, Dongping Xie is the only one with great influence. Notice that the node labels are also scaled by influence.

The two groups even have one connection in common, which means one node links the red and blue groups. Qiao Meng is well connected within the blue group but also has one connection with Fay Peng from the red group. He is the key person connecting the two groups.
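In a crowded graph, a bridging connection like this is easier to find with a few lines of code than by eye. The Python sketch below lists everyone who has a friend in a different cluster. The names, clusters, and the function name find_bridges are placeholders I made up, not my real friends list.

```python
def find_bridges(edges, cluster):
    """Map each person with a friend in another cluster to those friends."""
    bridges = {}
    for a, b in edges:
        if cluster[a] != cluster[b]:
            bridges.setdefault(a, []).append(b)
            bridges.setdefault(b, []).append(a)
    return bridges


# Hypothetical data: three blue friends, two red friends, one link across.
cluster = {"blue_1": "blue", "blue_2": "blue", "blue_3": "blue",
           "red_1": "red", "red_2": "red"}
edges = [("blue_1", "blue_2"), ("blue_2", "blue_3"), ("blue_1", "blue_3"),
         ("red_1", "red_2"), ("blue_3", "red_1")]

bridges = find_bridges(edges, cluster)
print(bridges)
```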

Notice that some people have a large number of connections but only a small node size. This means that person does not post comments very often in the network, even though he or she is rich in friendships. Some people have a small node size and only one connection, which means they are not very involved in this network.

To conclude, the network analysis is very simple, but it reveals details that are easy to see just by looking at the graph. If the data were presented in matrices or tables, it would be time-consuming and cumbersome to reach the same conclusions as I did here. A network graph saves valuable time and boosts efficiency in the research process.

Saturday, February 18, 2012

Data Impact from Social Media - review of an in-class presentation from SocialFlow

It was really fortunate to have Gilad Lotan, VP of SocialFlow, give a Skype in-class presentation to my Business Intelligence class this week. He is an amazing young professional with a great passion for finding the relationships, causality, and facts behind raw data. His presentation was also informative: it explained some basic ideas for gathering raw data from the Twitter API, then moved on to the methodologies used to analyze the data. The entire presentation was vivid, thanks to his rich and varied graphics and illustrations.

Although I had come across some interesting articles covering the use of social media, I never actually understood the impact of social media. Gilad is really an expert in this field. He used many historical events to illustrate why the study of social media data is important. For example, he mentioned last year's New York earthquake. People in Richmond, VA felt it first and started to tweet. It is unbelievable that even during an earthquake, people are still tweeting at each other. The tweets spread so fast that everyone could see them, so it is reasonable that people in New York also noticed them. Approximately 2 minutes later, the earthquake reached New York and people there started to tweet. This was clearly reflected in Gilad's network analysis. If this approach could be carried out in real time, it might be possible to warn people in New York about 2 minutes before the earthquake arrives. That is amazing, because 2 minutes is long enough for people to find shelter and protect themselves.

There were other interesting topics covered in the presentation, but the earthquake prediction is the one that attracted me the most. The realm of social media is no longer limited to everyday usage. It has broadened to a higher level of impact, where it can actually discover, help, and change the world.

Sunday, February 12, 2012

Your AdRank in Google AdWords

Do you want to run a marketing campaign for your services or products? Normally, you would place your commercials in newspapers, TV programs, billboards, and radio. With the development of the Internet, web media is becoming a mainstream channel for companies to run their marketing campaigns. One example is the use of Google AdWords as a marketing strategy.

The major differences between Traditional Marketing and Online Marketing are the following:

Traditional Marketing:

Charged based on number of impressions
Charged based on number of circulations
Cost for the campaign can be very expensive
Ads are the same for everyone

Online Marketing:

Charged based on number of clicks
Ads can differ even when users type the same keywords
Every click can cost a different amount of money (the cost per click, or CPC), depending on time, location, and competitors

In effect, Google is auctioning off the real estate on its pages for your marketing campaign. That is why the CPC is unpredictable: it depends on competition, location, time, and other factors.

Even if you pay Google for your campaign, it is possible that your ads will never show up. This is because Google ranks your campaign based on AdRank. A high AdRank raises the probability of your campaign showing up. AdRank is affected by your ad quality, which is its relevance to the keywords and to your landing page.

Google looks for keywords in the landing page that are related to the ad's keywords. If they do not match, the page is judged irrelevant.

Google also uses click-through rate (CTR) as a metric for measuring the effectiveness of a campaign. CTR is the number of clicks divided by the number of impressions. Usually, a CTR of 2% is considered good for a campaign.
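As a quick worked example, here is the CTR calculation as a small Python sketch. The click and impression counts are made-up numbers chosen to land on the 2% mark.

```python
def click_through_rate(clicks, impressions):
    """CTR = clicks divided by impressions."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return clicks / impressions


# Hypothetical campaign: the ad was shown 2,000 times and clicked 40 times.
ctr = click_through_rate(clicks=40, impressions=2000)
print(f"{ctr:.1%}")  # prints 2.0%
```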

This post just gives an overview of Google AdRank. The next post will go into more detail.

Sunday, February 5, 2012

Data Visualization in Business Intelligence

Here is the data:

Which one is more appealing to you? You got the answer!

The visual representation of data is called "Data Visualization". This technique plays a very important role in Business Intelligence.

Normally, people get lost when there is a large amount of data in one place. This is because humans are not efficient at dealing with tables, especially complicated, related tables that are connected to one another. For research purposes, researchers usually have to work with large amounts of data, which makes it extremely difficult for them to find the inherent patterns.

By comparison, humans are far better at dealing with graphical information. Graphics are not only more interesting and colorful to look at, but also provide neat and concise representations of raw data.

I remember that when I was an undergraduate, I often got overwhelmed by raw data. Then I found a very good way to overcome this problem: I used MS Excel to plot all the data within one spreadsheet. This was a very good and smart solution. From the graphs, I could identify trends, convergence, and other relations.

However, Excel can only handle a relatively small amount of data. I am not talking about the limits on spreadsheet width and height; I am talking about Excel's computing speed. It turns into a zombie when you try to calculate on a large amount of data, whether that means a longer time period, many attributes, or simply many records. Clearly, it is not suitable for Big Data in BI.

Still, Excel offers a very convenient way to visualize data. It is widely used when people have a question about a small problem and want to answer it in seconds.

If you are interested, you can try using Excel to find hidden facts within raw data. You may worry that you do not have enough data to analyze, even in Excel. Do you have a LinkedIn account? You had better have one if you are an active job seeker, like me. LinkedIn has a lab that uses Gephi, an open source app, to draw your connections as a graph. The only requirement is that you have 50+ connections on LinkedIn.

Here is the link:
http://inmaps.linkedinlabs.com/

The video illustrates the whole thing. You will see how amazing it is to visualize your own data and watch it grow every day.

Isn't it better than a data table like this?