First quantitative notes on ModPo2 posts and comments

Here are a few simple summary numbers and plots, based on a couple of .csv files that Ritika created from the 11/7/2013 data dump, covering about two months of the discussions.

As of 11/7, there were 52,392 posts from 3,377 participants and 34,395 comments from 2,100 participants. In all, 3,559 participants contributed at least one post or one comment.

The rate of participation is of course highly skewed: The most prolific poster contributed 1,452 posts and 969 comments; the most prolific commenter contributed 597 posts and 1,530 comments. 777 participants contributed just one post each, and 618 contributed just one comment each.
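Counts like these can be tallied from one author ID per post or comment. The following is a minimal sketch on toy data, not the actual processing; the function name `participation_summary` and the idea that each row of the .csv files carries an author ID are assumptions.

```python
from collections import Counter

def participation_summary(author_ids):
    """Summarize how many items (posts or comments) each participant contributed."""
    per_author = Counter(author_ids)  # author id -> number of items
    top_author, top_count = per_author.most_common(1)[0]
    one_item_only = sum(1 for c in per_author.values() if c == 1)
    return {
        "items": len(author_ids),
        "participants": len(per_author),
        "top_author": top_author,
        "top_count": top_count,
        "one_item_only": one_item_only,
    }

# Toy author IDs (hypothetical), one entry per post or per comment.
post_authors = ["u1", "u2", "u1", "u3", "u1", "u2"]
comment_authors = ["u2", "u4"]

summary = participation_summary(post_authors)
# Participants with at least one post or one comment: union of the two sets.
either = len(set(post_authors) | set(comment_authors))
print(summary, either)
```

The union step matters because the poster and commenter populations overlap, so the combined count is smaller than the sum of the two.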

There were 13,262 posts with at least one comment -- 39,130 posts had no comments.

The distribution of posts over the eight weeks shows a spike at the end of September, with steady participation through October:

The comments show a gradual fall-off during September, with steady participation in October:

(As I understand it, the items labeled "posts" or "comments" are actually of several different types; we should look into how to tell them apart.)

If we plot the number of posts and comments for each participant who contributed at least one post or at least one comment, we can see that there are a few prolific outliers, with most participants down in the region of <100 posts and/or <100 comments:

So showing just that region:

Or plotting on a log scale (with a small trick to include the points with 0 posts or 0 comments):
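The notes do not say which trick was used. One common choice, shown here purely as an assumption, is to add a small offset before taking logs so that participants with 0 posts or 0 comments still get a finite coordinate. The name `log_coords` and the default offset of 1 are hypothetical.

```python
import math

def log_coords(posts, comments, offset=1.0):
    """Map (posts, comments) counts to log10 coordinates.

    Adding `offset` before taking the log keeps zero counts finite;
    with offset=1, a count of 0 lands at 0 on the log axis.
    """
    return [
        (math.log10(p + offset), math.log10(c + offset))
        for p, c in zip(posts, comments)
    ]

# Toy counts for three participants (not the ModPo data).
coords = log_coords([0, 9, 99], [0, 0, 999])
print(coords)
```

Replacing zeros with a value such as 0.5 is another option. Either way, the axis labels should say that the counts were shifted.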

Here's a histogram showing how many participants contributed between 1 and 50 posts:

And here's the same thing for participants with 1 to 20 comments:

If I've done the time-zone conversion correctly (and I did not try to deal with Daylight Saving Time issues), participants tend to submit posts in the wee hours of the morning, especially from about 2:00 a.m. to 8:00 a.m.:

The largest number of posts comes from the Pacific time zone (-8 hours relative to UTC), with the Eastern time zone (-5 hours relative to UTC) next:

One curious thing, maybe a Coursera bug: they report "Time Zone" as "Region/City", e.g. "Africa/Cairo", "Europe/Paris", or "Asia/Tokyo". In some cases, they seem to collapse all cities within a given country to a single representative. Thus there are 13 posts and 13 comments from "America/Vancouver", and there are 37,250 posts and 20,903 comments from "America/Los_Angeles", but there is nothing from any other (nominal) cities in the 8-hours-before-UTC time zone -- nothing from San Francisco or Seattle or Fresno or wherever. For the UTC-5 time zone, however, they give "America/Detroit", "America/Indiana/Indianapolis", "America/Indiana/Knox", "America/Indiana/Petersburg", "America/Kentucky/Louisville", and "America/New_York", as well as "America/Montreal" and "America/Toronto" in Canada, but nothing from e.g. "America/Philadelphia" or any other EST U.S. city.

N.B. It seems that this is indeed a Coursera flaw -- rather than registering time zones based on the IP address a post or comment is submitted from, they use the time zone specified in the user's profile. This is California time by default, and many (most?) people never bother to change the default, no matter where they live...
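A fixed-offset conversion of the kind described above, with no daylight-saving handling, can be sketched as follows. This is an illustration on made-up timestamps; it assumes the timestamps are available as UTC datetimes, which the notes do not confirm, and the function names are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

def hour_histogram(utc_times, utc_offset_hours):
    """Count items by local hour of day, using a fixed UTC offset (no DST)."""
    local = timezone(timedelta(hours=utc_offset_hours))
    return Counter(t.astimezone(local).hour for t in utc_times)

def weekday_histogram(utc_times, utc_offset_hours):
    """Count items by local weekday (0 = Monday ... 6 = Sunday)."""
    local = timezone(timedelta(hours=utc_offset_hours))
    return Counter(t.astimezone(local).weekday() for t in utc_times)

# Three made-up UTC timestamps (not from the data dump).
utc_times = [
    datetime(2013, 10, 1, 12, 0, tzinfo=timezone.utc),
    datetime(2013, 10, 1, 12, 30, tzinfo=timezone.utc),
    datetime(2013, 10, 2, 3, 15, tzinfo=timezone.utc),
]

by_hour = hour_histogram(utc_times, -5)    # UTC-5, i.e. Eastern standard time
by_day = weekday_histogram(utc_times, -5)
print(by_hour, by_day)
```

Note that the third timestamp falls on the previous calendar day once shifted to UTC-5, so hour-of-day and day-of-week tallies both need the shifted time rather than the raw UTC value.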

Anyhow, translated into Philly time, the peak posting time is around 7:00 a.m.:

The comments are maybe a bit later, but the wee hours still dominate:

Posts are most common on Mondays, as expected:

Comments are more evenly spread through the week:

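The number of comments in threads (posts) that have at least n comments, for 1 <= n <= 50 (the first entry is the total of 34,395 comments, and each step down from entry n to entry n+1 is n times the number of posts with exactly n comments):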

 [1] 34395 27474 22504 18616 15652 13062 11178  9386  8130  7068  6268  5487
[13]  4779  4246  3728  3353  3049  2777  2489  2280  2120  1868  1758  1643
[25]  1475  1325  1117  1090  1062  1004   884   822   790   757   689   619
[37]   583   546   508   430   350   350   308   308   264   264   218   218
[49]   170   170
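The first entry of this vector equals the total number of comments, which suggests that entry n counts the comments sitting in threads with at least n comments. Under that reading (an interpretation, not something stated in the notes), the number of threads with exactly n comments can be recovered from successive differences. The sketch below uses the first twelve printed entries; the function name is hypothetical.

```python
# First twelve entries of the vector printed above.
at_least = [34395, 27474, 22504, 18616, 15652, 13062,
            11178, 9386, 8130, 7068, 6268, 5487]

def exact_thread_counts(at_least):
    """Return {n: number of threads with exactly n comments}.

    Assumes at_least[n-1] is the number of comments in threads with
    at least n comments, so the drop from entry n to entry n+1 is
    n times the number of threads of length exactly n.
    """
    exact = {}
    for n in range(1, len(at_least)):
        dropped = at_least[n - 1] - at_least[n]
        if dropped % n != 0:
            raise ValueError(f"difference at n={n} is not a multiple of n")
        exact[n] = dropped // n
    return exact

exact = exact_thread_counts(at_least)
print(exact[1], exact[2], exact[3])    # 6921 2485 1296
print(round(exact[1] / 13262, 3))      # 0.522, share of commented posts with exactly one comment
```

Every difference in the printed vector divides evenly by its n, which is what this reading predicts and what a plain count of posts would not.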

And a plot showing the number of posts with n comments, for 1 <= n <= 20:

Another way to look at the same data is the empirical probability of stopping as a function of the current number of comments in a thread, which is about 52% after one comment (and 39,130/52,392, or about 75%, for a post with no comments!), but falls to around 10% after 20 comments:

This looks promising for modeling the factors that lead to threads ending or continuing, since there are plenty of threads of various lengths, and the first and simplest factor, namely the length of the thread so far, is an effective predictor.
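The stopping probability described above is the share of threads that end at exactly n comments among those that reached n. A minimal sketch, on toy thread lengths rather than the real data, with a hypothetical function name:

```python
from collections import Counter

def stopping_probability(thread_lengths, max_n):
    """Return {n: P(thread ends at n comments | it reached n comments)}."""
    exact = Counter(thread_lengths)  # n -> threads with exactly n comments
    reached = len(thread_lengths)    # every thread has reached 0 comments
    probs = {}
    for n in range(max_n + 1):
        if reached == 0:
            break
        probs[n] = exact[n] / reached
        reached -= exact[n]          # threads that go on past n comments
    return probs

# Toy data: number of comments on each of eight posts.
lengths = [0, 0, 0, 1, 1, 2, 3, 3]
probs = stopping_probability(lengths, 3)
print(probs)
```

With the figures in these notes, the n = 0 value would be 39,130/52,392, or about 0.75. Estimates for large n rest on few threads and will be noisy.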

To a first approximation, it looks like comments tend to be longer in the middle of a thread than at the start or the end:
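One way to check this kind of pattern is to bin each comment by its relative position in its thread and average the lengths per bin. The sketch below is an assumption about method, not a record of what was done here: it measures length in whitespace-separated words, uses three bins (start, middle, end), and skips threads too short to fill every bin.

```python
from collections import defaultdict

def mean_length_by_position(threads, bins=3):
    """Average comment length (in words) per relative-position bin.

    `threads` is a list of threads, each a list of comment texts in order.
    Threads with fewer than `bins` comments are skipped so that every
    thread contributes to every bin.
    """
    totals = defaultdict(lambda: [0, 0])   # bin -> [word total, comment count]
    for comments in threads:
        k = len(comments)
        if k < bins:
            continue
        for i, text in enumerate(comments):
            b = i * bins // k              # 0 = start ... bins-1 = end
            totals[b][0] += len(text.split())
            totals[b][1] += 1
    return {b: words / count for b, (words, count) in sorted(totals.items())}

# Toy threads (not ModPo text).
threads = [
    ["a b", "a b c d", "a"],
    ["x", "x y z", "x y"],
    ["too short"],
]
means = mean_length_by_position(threads)
print(means)
```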