Ziyin's Business Intelligence Blog: 2012

Saturday, October 13, 2012

How to use JDBC to connect with SQL Server with Windows Authentication

Recently, I am working on a project that requires to data into database. Since I already have Microsoft SQL Server 2008R2 installed locally, I would like to use Java gather data from the web and store it in SQL Server.

The only one thing that are required is the JDBC driver. You can download it here. Download it in any format. You will have to unzip it eventually. Then you will get the following tree structure:

The sqljdbc4.jar is the Java library that you need to specify it inside your Java IDE (eg. Eclipse). One way of doing that is to right click the project name and navigate to Build Path --> Configure Build Path...

In the following window, click Add External JARs and select the sqljdbc4.jar in your downloaded file.

Once have that library in your Java project, you can import the library and write the Java code to connect to the database.

If you are using the SQL Server Authentication, the following code will work perfectly:

However, this article is to point out the important fact that you need to pay attention when you are using Windows Authentication for SQL Server.

First, you don't have to specify the username and password. Instead, you should use integratedSecurity=true parameter.
Second, you have to copy the sqljdbc_auth.dll file to your project root directory.

The code will look like this:

The sqljdbc_auth.dll is located inside the auth folder in your downloaded driver. Copy and paste that dll file into your project root directory.

Then everything works fine!

Monday, May 14, 2012

How to install SQL Server 2008R2 Sample Database

You want to learn Microsoft SQL Server. You are wondering which books to buy from Amazon.com. Well, you don't have to. It is always a good practice to follow the official tutorial from MSDN.

This article is to illustrate how to get started with MSDN SQL Server 2008R2 Tutorial. The URL is: http://msdn.microsoft.com/en-us/library/ms167593(v=sql.105).aspx

Basically, you need a whole package of MS SQL Server 2008R2 installed in your local machine, of course. The other thing you need is the sample database, so that you can follow step by step what Microsoft wants to teach you.

However, the installation of the sample database AdventureWorks2008R2 confused me a lot. I just figured it out and want to my experience with you so you don't have to spend so much time, like me.

The download page is here: http://msftdbprodsamples.codeplex.com/releases/view/55926 However, you can see there is no download link associate with it. This is because Microsoft found bugs in the previous sample database windows installer. However, Microsoft provides another way to install AdventureWorks2008R2 in here: http://msftdbprodsamples.codeplex.com/releases/view/59211

So, here are the steps:
1. download the AdventureWorks2008R2 Data File from http://msftdbprodsamples.codeplex.com/releases/view/59211 . Download it anywhere on your local machine, first. You will get the "AdventureWorks2008R2_Data.mdf" file.

2. Move the "AdventureWorks2008R2_Data.mdf" file into your local SQL Server in "C:\Program Files\Microsoft SQL Server\MSSQL10_50.SSZIYIN\MSSQL\DATA". My local instance is named "SSZIYIN", it may differ in your local machine.

3. Go to SQL Server Management Studio and connect to your local server, mine is SSZIYIN.

4. Unfold the database tab and right click database tab, select "attach"

5. Click "Add" and then navigate to the directory "C:\Program Files\Microsoft SQL Server\MSSQL10_50.SSZIYIN\MSSQL\DATA" and select "AdventureWorks2008R2_Data.mdf"

6. In the detail window below, remove log file. You done!

Now you have the sample database AdventureWorks2008R2. So, what are you waiting for? Start your adventure with SQL Server 2008R2!!

Note:
The sample database installation tutorial is here: http://social.technet.microsoft.com/wiki/contents/articles/3735.sql-server-samples-readme-en-us.aspx#Readme_for_Adventure_Works_Sample_Databases , which is not clear for me...so I created this article. Enjoy!

Thursday, March 15, 2012

Retweet Analysis for Two Situation Comedies

The Gephi Retweet Monitor Plugin provides a very convenient way to collect real-time data from Twitter.com. The Retweet Monitor Plugin uses Twitter Streaming API to collect large amount of data.

Approach and Definition of the Graphs
Running the program for consecutive two days, we are able to collect 373 retweets for TwoandAHalfMen and 724 retweets for HIMYM. In the two graphs, the definitions for nodes are Twitter users and the edges are the connections between tweets and retweets.
The two graphs are directed graphs, meaning the edges have directions. For example, if node A retweets node B, there will be an arrow pointing from A to B, making a directed edge. Therefore, node A has one out-degree and user B has one in-degree.
The sizes of the nodes are defined as the level of In-Degree. For example, if node A has more in-degree than node B, node A will have a larger node size.
We analyze the two graphs using “Force Atlas” layout. Both of them display round circle pattern, with the most important nodes lying in the center.

Basic Attributes Analysis
The two graphs have the similar features in attributes with only small differences, Ref Fig 3. This is because both of them are retweet graphs and the observations for the retweet graphs are:

Not-well-connected among nodes
Large amount of communities within the network

The following table compares the two graphs in varies attributes. It is easy to identify that two graphs have low Degree Centrality of 0.713 for HIMYM and 0.607 for TwoandAHalfMen, which means most of the nodes in two graphs do not have much links to other nodes. The same result can be obtained from Network Diameter for the two graphs, 3 for HIMYM and 2 for TwoandAHalfMen. This indicates within minimum of 3 steps, one node in HIMYM can reach any other nodes within the network, which also proves the network is not well connected.

Parameter	HIMYM	TwoandAHalfMen
Number of Nodes	1448	746
Number of Edges	1032	453
Average Degree	0.713	0.607
Network Diameter	3	2
Graph Density	0	0.001
Modularity	0.919	0.993
Weakly Connected Components	424	293
Avg. Clustering Coefficient	0.002	0
Avg. Path	1.053	1.002
Number of Communities	426	293
Number of shortest paths	1087	454

The two graphs have 424 and 293 connected communities for HIMYM and TwoandAHalfMen. Compared to the total of 1448 nodes in HIMYM, each community only has 3 nodes on average. Similarly, for TwoandAHalfMen, each community only has 2 nodes on average.

Key Findings of Two Graphs
The followings are the similar patterns of the two graphs:
• Large amount of isolated clusters
• The official sitcom account may not be popular one
• Celebrity account received more retweets
• Accounts with funny names received more retweets
Both the HIMYM and the TwoandAHalfMen graphs have more scattered nodes than the connected nodes. This is because, in most cases, users can only see the tweets what they follow. Therefore, users can only retweet what they the get from the following accounts.
Official sitcom account and celebrity accounts tend to have more retweets. This is because most nodes do not have much influence in the twitter network. From observations, nodes that had most retweets are from most influential accounts, e.g. @TrechosSerie in TwoandAHaflMen graph and @alecattelan in HIMYM graph.
Besides influential accounts, users in twitter tend to also follow interesting accounts. Interesting people will have funny username and also tend to have funny tweets. Users tend to follow humorous people and retweet a lot. The typical example is @waitforBarney in HIMYM.

Interesting Observations

WGN beats CBS on How I Met Your Mother

How I Met Your Mother is originally shown in CBS. WGN later bought the license to broadcast this sitcom. However, according to our findings in HIMYM graph, @MeetatMacLarens (WGN account) has more retweets than @HIMYM_CBS (CBS account). Ref Fig 4
People are addicted to sitcoms that they want to watch it every day, even though it is an old episode. WGN had a very effective strategy to attract audience from CBS to its own station by paying licensing fees. From a long term perspective, WGN will gain competitive advantage over CBS in terms of licensing expense.

Two and A Half Men is more popular in non-US audience

The top three accounts with retweets are from abroad. The tweets and retweets are written in a foreign language. The retweets from @LauraBozzoo was written in Spanish and @MySerieFrases retweets were written in Portuguese.
For CBS, it is a great opportunity to broaden the international market by launching more marketing campaigns on Two and a Half Men in areas in Europe and Asia.

Appendix
Graph for How I Met Your Mother retweets collected for 2 days

Graph for Two and A Half Men retweets collected for 2 days

Saturday, February 25, 2012

Data Visualization: Network Analysis of Facebook

These days I am so attracted by the Network Graph and spend hours of hours in enjoying the beauty of relationship of nodes and edges.

Advantages

Network Graph is a data visualization technique for revealing the connections between each node and edge. The nodes can be defined as entities like people and edges can be defined as friendship.

The beauty of Network Graph lies in the simplicity of representation of complex relationship using graph. With traditional matrix representation, each matrix has multiple rows as a relationship between each entity. Each matrix can also have additional columns or rows to present the clusters of similar entities. Compared with the traditional matrix, the Network Graph uses nodes and edges to physically connect each other and also uses different colors to represent different clusters of similar entities.

The other advantage of Graph Network is that it attract people to explore more with crystal clear image and bright color. Researchers will never be drown at flow of black and white tables again.

The following will cover:
1. Tutorial of Network Graph of Facebook
2. Network Analysis of Facebook Connections

1. Tutorial of Network Graph of Facebook
Example: Facebook network Analysis using Graph

Everyone has a facebook account, well, at least one. I assume. Have you ever thought a way to find out the relationship among you and your pals? In fact, Facebook itself has a data visualization app called "Challenger Network Graph". It will create the dumb network graph as below:

Don't use that unless you are too busy to read this blog.

We will use a free software called "Gephi" to create our own network graph and it will be much better than this one from this app, like this:

Here are the tutorial:

1. Go to gephi.org and download the amazing software for free.

2. Go to your facebook page and search for "netvizz" app. It will generate "gdf" file for gephi.

3. Open gdf file with gephi and click OK.

You will see the original data as follows, which is not in order and meaningless:

The next step you will do is to find the clusters that share the common features, eg. friends at the same location.

4. Choose the data layout

In the layout tab on the left hand side, you have a drop down list to select how you want to let the data visualization be. I prefer Fruchterman Reingold, which is a round shape of data.

5. Run Modularity to identify different clusters.
The Modularity button is located at the right hand side to the panel under statistics tab. This will let gephi generate a new variable "modularity" for different clusters.

6. Run HITS button to calculate size of each node.
The size of the nodes is determined by its influence in the network.

7. Select to color different clusters by choose Modularity Class from Partition Tab on the left.

If you are not happy with the colors, just right click in the area and choose randomized color. You may find the ideal combination eventually.

8. Select to adjust the node size based on HITS calculation.
From Ranking Tab, select the diamond shape button in the node tab to apply to different node size.

9. Now we need to label each node to see who is really influential to the network.
Click the "T" button on the bottom

Then click the "A" button to scale the lable based on influence.

10. We are almost done, check the preview of the network graph by click "Preview" button on the top of screen.

11. Select "Label" in the left side bar to show label in the preview and click "Refresh" button to see the preview. Every time you make a change, you need to refresh the graph.

12. The last thing you need to do is to export the graph to either "PNG", "SVG" or "PDF".
Sometimes, when you export the file, some labels are not shown in the graph. You can choose the "Options" button to change the width and length of the output.

13. It's done! The full picture will be like this.
Is it much better than the default Facebook network graph? At least, you did it by yourself! Congrats!

14. Sometimes they graph is too complicated with thousands of nodes and millions of edges, you can use zoom.it to embed your large graph in your web page, like this:

2. Facebook Network Analysis:

Parameters:
Size of Nodes: The size of each node means how valuable that node is. In this example, it indicates how influence it is for the whole network. In other words, how much posts and comments for the network. The large the node, the more influence it has for the network.

Color of Nodes/Edges: The color indicates a cluster of people with similar features. In this example, location is the feature to separate group of friends.

Number of Edges: Each node can have at least one edge to the other nodes for connection. The more edge it has the more people it connects.

From the Facebook Network Graph, it is clear that I have mainly two groups of friends, the red group and the blue group. I know they are separated by the location: red are from Chicago and blue ones are from Tucson, AZ.

There are other colors specifying friends neither in Chicago nor in Tucson. They are covered by the flow of red and blue...

For each one in read and blue, he or she is well connected with each group, because the edges or lines are crossed over each other. The connections are made by both people agree to add each other as a friend. Therefore, there are no directions in the edges. The connections are based on mutual acknowledgement. In the middle of red and blue, it is so crowded that it is not easy to identify the actual nodes.

For blue network, the most influential people are Joseph Yu, Anagela Cheng, and Qiao Meng. For the red network, Dongping Xie is the only one that has great influence. Notice the node labels are also adjusted by the influence.

The two groups even have one connection in common, which means this node relates the two groups of red and blue. Qiao Meng is well connected with the blue group but also has one connection with Fay Peng from the read group. He is the key person connecting the two groups.

Notice some people have only large amount of connections but with only small node size. This means that person does not post comments too often in the network, even he or she is rich in friendship. Some people has small node size and only one connection, which means he or she is not too involved in this network.

To conclude, the network analysis is very simple but it easily reveals some details by simply looking at the graph. If the data is presented from matrix or tables, it will be time consuming and cumbersome to get the similar conclusion as I did in here. Network Graph saves valuable time and boosts efficiency in research process.

Saturday, February 18, 2012

Data Impact from Social Media - review of an in-class presentation from SocialFlow

It is really fortunate to have Gilad Lotan, VP of SocalFlow doing a Skype in-class presentation in my Business Intelligence class this week. He is an amazing young professional that has a great passion in finding out relationship, causality, and facts behind raw data. His presentation was also informative by explaining some basic ideas for gathering raw data from twitter API then moving on to methodologies used in analyzing data. The entire presentation was vivid, thanks for his rich and various format of graphics and illustrations.

Although I came across some interesting articles covering the use of social media, I never actually understand the impact social media. Gilad is really an expert in this field. He used many historical events to illustrate why the study of these social media data is important. For example, he mentioned in the presentation that last year New York earthquake event. The people in Richmond, VA first discovered the event and started to tweet online. It is unbelievable even in earthquake, people are still tweeting each other. The tweets spread out very fast that everyone is able to see. It is reasonable that people in New York also noticed that tweet content. Approximately 2 minutes later, the earthquake hit New York and people in New York started to tweet. This fact was clearly reflected in Gilad's network analysis. If this approach could be carried out in real time, it is possible to help people in New York predict the earthquake for about 2 minutes earlier. This is so amazing that 2 minutes is long enough for people find shelters and protect themselves.

There are other interesting topics covered in the presentation, but the earthquake prediction is the one that attracts me the most. The realm of social media is not narrowed at usage level any more. It is right now broadened into a higher impact level that can actually discover, help, and change the world.

Sunday, February 12, 2012

Your AdRank in Google AdWords

You want to have a marketing campaign for your service or products? Normally, you will have your commercials on newspaper, TV programs, billboards, and radios. With the development of Internet, web media is becoming a main stream for companies to set up their marketing campaign. The example is the usage of Google AdWords as the marketing strategy.

The major differences between Traditional Marketing and Online Marketing are the following:

Traditional Marketing:

Charged based on number of impressions
Charged based on number of circulations
Cost for the campaign can be very expensive
Ads are the same for everyone

Online Marketing:

Charged based on number of clicks
Ads are different even users type the same keywords
Every click can cost different amount of money, based on different time, location, competitors, and cost per click (CPC)

Actually, Google is auctioning its realty of its space for your marketing campaign. That is the reason why the CPC is unpredictable. It is based on competition, location, time and other factors.

Even you paid Google for your campaign, it is possible that your ads will never show up. This is because Google rank your campaign based on AdRank. Your high AdRank determines the probability of your campaign showing up. AdRank is effected by your ad quality, which is the relevance to the keywords and your landing page.

Google will find keywords in landing page related with ads keywords. If they do not match, this page is determined irrelevant.

Google also uses Click Through Rate (CTR) as a metric for measuring effectiveness of the campaign. The CTR means number of clicks per impression. Usually, the CTR of 2% is considered good for a campaign.

This blog just gives the overview of Google AdRank. The next blog will talk more detail about Google AdRank.

Sunday, February 5, 2012

Data Visualization in Business Intelligence

Here is the data:

Which one is more appealing to you? You got the answer!

The form of visual representing of data is called "Data Visualization". This technique plays a very important role in Business Intelligence.

Normally, people will get lost because of large amount of data in one place. This is because humans are not efficient in dealing with tables. Especially, the complicated, related tables that connected with each other and one and another. Usually, for study purpose, researchers have to get involved in large amount of data, which makes them extremely difficult to find the inherent pattern.

Compared with that, humans are way better in dealing with graphical information. The graphics are not only more interesting and colorful to see, but also provide very neat and concise representations of raw data.

I remembered when I was in undergraduate, I usually get overwhelmed with raw data. Then I find a very good way to overcome this problem. I used MS Excel to plot all the data within one spreadsheet. This was a very good and smart solution. From graphs, I could identify trends, convergence, and other relations.

However, Excel can only handle relatively small amount of data. I am not talking about the spreadsheet width and height limitation. I am talking about the computing speed for Excel. It becomes a dead zombie when you try to calculate large amount of data for longer time period, for many attributes, or for simply large records. Clearly, it is not suitable for Big Data in BI.

However, Excel leads to a very convenient way to visualize data. It is widely used when people have doubts for small problem and want to solve it in seconds.

If you are interested, you can try using Excel to find hidden facts within raw data. You may wonder that you do not have sufficient data for analysis, even in Excel. Let's see if you have LinkedIn account, you better have, if you are a active job seeker, like me. LinkedIn has a lab that use Gephi, a open source app, to generate your connections in graphs. The only requirement is you must have 50+ connections in LinkedIn.

Here is the link:
http://inmaps.linkedinlabs.com/

The video illustrates the whole thing. You will find how amazing that you can visualize your own data and watch it growing everyday.

Isn't it better than the data table, like this?

Sunday, January 29, 2012

Business Intelligence Related Social and Ethical Issues

Google just modified its Privacy Policy and will be effective on Mar 1, 2012...

They basically "getting rid of over 60 different privacy policies across Google" and use a single version of Privacy Policy instead. Google also made some changes to the categories of information they collected from every user on each day. They divided the information into two categories: "Information you give us" and "Information we get from your use of our services". The first one may include your real name, address, phone number... you provided to them when you register for Google account. The second category may include your IP address, operating system, web browser information, geographical location, the words you typed in. Although Google says they are doing this to improve user experience.

The following video shows exactly how everything happens when you type something in Google search engine.

The questions are obvious:

"Hey! I didn't register for Google account. Why did you collect my information?"
"Who gave you the permission to collect data for my location, my operating system and my keystrokes?"

While Google is clear from the statement: "As you use our services, we want you to be clear how we’re using information and the ways in which you can protect your privacy."

OK. Right now, I think everyone is clear about this:
If you want to use service from Google, Google will collect your information. Period.

Disagree? You don't use Google, you go back to Jurassic, you go to North Korea.

OK. Here is another question:

"I just created my personal website and I never use Google service, why Google find my website address and make it visible to everyone?"

Remember Google is a search engine. It is born to search. Google has a program to find new websites or recently updated web pages and bring back to their database. You don't use their service, but they will find you. That's the destiny of a good search engine. You can't escape around it. So, if you are concerned about your privacy, make sure you don't put your sensitive personal information on your website. Google find the information for your website, not your personal information, which does not violate their privacy policy.

The third question pops up:

"Google Analtyics is always collecting visitors' data, from location, keywords to time spent on each page, the navigation path, etc."

However, Google can easily defend themselves by say: "We are just doing statistics gathering web traffic information. Such behavior will not identify that visitor is just you."

Yes, that is correct. But think about how many condition clauses in SQL could identify you?

You think 5 conditions may not identify you. That is because you are living in big city. Think about those who live in a small island in Pacific, say Johnston Atoll. The user could be easily identified by VISITOR_LOCATION AND VISITOR_TIMESTAMP. Because no too many people are using internet at a certain time at this certain website.

This is a big issue for users in small cities. However, think it in another way. No matter how large city you are living in, if Google want to find you, they can use 10 or even 20 conditions to narrow down the query results. They will find you.

Fortunately, privacy for Google is a big concern. They emphasized this point again and again. They want to protect your privacy for you. At least for normal situations...

Saturday, January 28, 2012

The Overwhelmed Data of Google Analytics, really?

Google Analytics is used by many Web Analysts for finding out the website traffic information. However, the data it generated have lots of dimensions, page view, unique page view, landing page, visit, keywords, source, browser info... First time users, like me, usually don't have any clue where to start.

Maybe you agree with me. However, I am here to point out the fact that many first time users got overwhelmed by data is because they don't know what they really want to do. They don't really know what they want to analyze. They don't know what questions to ask. "The main reason for project failure is unclear project requirements". As a Web Analyst, you should know the exact question to ask. The meaningful questions are like: "What are the keywords people use most to land in this page?", "How did visitors navigate through the entire website?", "Why did some average people leave this page in 10 seconds?"

Although Google Analytics cannot answer some of these questions. It will provide related data from different aspects to help you find the reasonable answer. After you find the answer, Google Analytics also provide data visualization tools to further assist you in generating professional reports.

To summarize, as a Web Analyst, you should have a clear logic in mind. The first step is to have a question. With that question, you login the Google Analtyics account to search for related data. When you get the answer, you could use Google Analytics data visualization tools to help with the report.

Friday, January 13, 2012

Google Analytics Overview

Google Analytics is a free web based BI tool for website traffic analysis. It could be used as a standalone tool or along with Google Adwords.

Google Analytics has the following functions:

Use site search data to improve keywords and campaigns
For E-Commerce website, a Google Analytics script could enable transaction tracking down to keyword level
Geographical market targeting
Find out poor performance campaign keywords and focus on profitable keywords
Identify high spenders and target that market segment
Import referring sites into a placement targeting campaign
Use conversion funnels to identify customers' actions towards each page
Monitor Google Adwords Traffic

These are the initial findings for Google Analytics from YouTube channel.

Tuesday, January 10, 2012

Created First Post for BI

Never know Blogger is a product of Google...ha

Tomorrow is the first day in Business Intelligence class. Don't know if I could get up that early...after a long and lazy winter holiday.