An unabashedly narcissistic data analysis of my own tweets.
The unequivocally lovely Jeff Gentry (@geoffjentry) has contributed an
R package with easy-to-read documentation that works, which I’ll walk through here so that you, too, can gaze at your own face mirrored in the beauty of a woodland pond—er, sea of electrons.
Here’s the basic flow for grabbing stuff. You can do more with
ROAuth but that’s a bit of a pain.
require(twitteR)
RT.of.me <- searchTwitter("RT @isomorphisms", n=100)
news <- getTrends(n=50)
firehose <- publicTimeline(n=999)
my.tweets <- userTimeline('isomorphisms', n=3500)
head(my.tweets$text)
Consider: Donkey Kong is neither a donkey, nor a kong.
William Thurston, geometrizer of manifolds http://t.co/UPwuAnbP
When I invent a single-letter language, it's going to be called Ж.
@theLoneFuturist True. If $GOOG were only an ad network, with no search facility, how much would it be worth?
Do your arms hang down by your side in zero gravity? Because then I bet astronauts have less smelly armpits.
Can't log into Hacker News with #w3m! Unexpected.
Salt and sugar are opposites. Therefore if i eat too much salty food I must balance it with candy. #logic
@leighblue Do you know any behavioural econ studies on utility vs bite-size / package-size?
Those are some of the ways you can grab data —
RCurl and then, like, the info is just there. Run
twListToDF( tweets ) to split the raw info into 10 subfields—text-of-tweet, to-whom-was-the-reply-threaded, timestamp, and more.
To pull out just one of those fields—like “source of tweet”, for example, use
my.tweets <- userTimeline('isomorphisms', n=3000)
whence.i.tweet <- sapply( my.tweets, function(x) x$statusSource )
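If you want to see the shape of that sapply() trick without hitting the Twitter API, here is a minimal sketch in base R. The fake.tweets list and its contents are made up for illustration; they are plain lists standing in for the status objects that userTimeline() returns.

```r
# Toy stand-ins for status objects: plain lists with invented fields.
fake.tweets <- list(
  list(text = "first",  statusSource = "TTYtter"),
  list(text = "second", statusSource = "web"),
  list(text = "third",  statusSource = "TTYtter")
)

# Pull one field out of every element, giving a character vector.
whence <- sapply(fake.tweets, function(x) x$statusSource)

# Tally how often each source appears.
source.counts <- table(whence)
print(source.counts)
```

A table like source.counts is the sort of thing a bar chart of tweet sources would be drawn from.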
You can see from plots 1, 2, 3, and 4 that I use
TTYtter client (tweeting from the command line; no installation). In fact this is why I’ve started tweeting so much the last few months: I run
TTYtter in a virtual terminal,
mutt (command-line gmail) in another virtual terminal, and therefore it becomes quite easy to flick my virtual newsfeed/conversation stream on for a minute or two here and there whenever I’m at the computer. It feels like The Matrix or Neuromancer or something.
Here’s how I created the
ggplot radial chart #4 — this was the longest command I had to use to generate any of them. For some reason
qplot didn’t like
scale_y_log10() so I did:
ggplot( data = data.frame(whence.i.tweet), aes( x=factor(whence.i.tweet), fill=factor(whence.i.tweet) ) ) +
  scale_y_log10() +
  geom_bar() +
  coord_polar() +
  opts( title="whence @isomorphisms tweets", axis.title.x=theme_blank(), legend.title=theme_blank() )
In the words of
perl script that told me whether the tweet had an
@ in it, whether the
@ was a
RT @, an wem the tweet was
@, and so on. Dealing with people using
@ in another sense besides “Hey
@cmastication, what’s up?” or different numbers of spaces between
RT’s in the same message; and so on — was an icky mess. I probably spent half a week changing my regexes around to deal with more cases I hadn’t thought of. Like most statisticians, I hate data munging—swimming around in the data is the fun part, not patching up the kiddie pool. Besides that, my client wanted the results in an Excel file — and Excel can’t handle multidimensional arrays (whereas a tweet mentioning
@a @b @c should have just one “mentions” slot with three things in it).
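For what it's worth, the "one mentions slot with several things in it" idea fits naturally in an R list column. What follows is a rough sketch in base R, not the perl script described above: the example tweets, the mention.pattern regex, and the parsed data frame are all assumptions made up for illustration, and the regex surely misses cases.

```r
# Made-up example tweets covering a retweet, a multi-mention, and a stray @.
tweets <- c(
  "RT @a: hello @b",
  "@a @b @c lunch?",
  "no mentions here, email me at someone@example.com"
)

# Assumed definition of a handle: an @ that does not follow a word character,
# then one or more letters, digits, or underscores.
mention.pattern <- "(?<![A-Za-z0-9_])@[A-Za-z0-9_]+"

# One character vector of handles per tweet (possibly empty).
mentions <- regmatches(tweets, gregexpr(mention.pattern, tweets, perl = TRUE))

# Crude retweet flag: the tweet starts with "RT", whitespace, then "@".
is.retweet <- grepl("^RT\\s+@", tweets, perl = TRUE)

parsed <- data.frame(text = tweets, is.retweet = is.retweet,
                     stringsAsFactors = FALSE)
parsed$mentions <- mentions   # a list column: one slot, many handles
```

The list column is exactly what a flat Excel sheet can't hold, which is why the export was painful.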
But as much fun as it was to display my love of
TTYtter in four different plots, that’s not the only
R-based egotainment you can compute on a Friday night.
How wordy am I?
I know I am wordy. I often adopt a telegraphic SMS-like typing style (“
Sntrm wd b gr8 prez, like Ahmedinejad”) rather than hold back my trenchant remarks about astronauts’ armpits. Tumblr’s auto-tweets don’t help my average, either—the default is long, and I’m usually too lazy to change it.
With the magic of kernel density estimates—which are definitely not overkill for the analysis of my appropriately-florid and highly-important charstreams—and my usual
base::plot params, the length of my tweets is made art in the form of chart #5.
I got a vector of tweet-lengths using
my.tweets <- userTimeline('isomorphisms', n=3500)
my.tweets <- twListToDF( my.tweets )
iso <- my.tweets$text
require(stringr)
iso.len <- str_length(iso)   #vectorised! No for loops necessary
hist( iso.len, col="cyan" )   #hist() colours bars with col, not fill
Proving once again that all real-world distributions fit a bell curv—…um.
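The snippet above only draws a histogram; the kernel density estimate itself comes from base R's density(). Here is a minimal sketch on simulated lengths. The fake.len vector is invented (a lump of short quips plus a lump of near-140-character auto-posts), so the shape is only illustrative and is not my actual data.

```r
set.seed(1)

# Simulated tweet lengths: two lumps, clipped to the 1..140 character range.
fake.len <- round(c(rnorm(300, mean = 60,  sd = 20),
                    rnorm(200, mean = 135, sd = 5)))
fake.len <- pmax(pmin(fake.len, 140), 1)

# Gaussian kernel density estimate, evaluated on a grid from 0 to 140.
kde <- density(fake.len, from = 0, to = 140)

plot(kde, main = "tweet lengths (simulated)", xlab = "characters")
```

With real data you would pass iso.len (or nchar() of the tweet text, the base-R equivalent of str_length) instead of fake.len.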
You can of course use
subset( my.tweets ) to plot tweets that were made under certain conditions—I might look only at my tumblr auto-posts using
subset( my.tweets, statusSource=="tumblr"). Or only at short tweets using
subset( my.tweets, str_length(my.tweets$text)<100 ). And so on.
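Here is how those subset() calls behave on a toy data frame, as a sketch in base R. The toy rows are invented, and nchar() stands in for str_length. One assumption worth flagging: in real data the statusSource field may carry HTML link markup around the client name, in which case an exact == comparison would match nothing and a grepl() test like the last line is the safer bet.

```r
# Invented rows: two from a command-line client, one auto-post.
toy <- data.frame(
  text = c("short one", strrep("x", 120), "auto-post from my blog"),
  statusSource = c("TTYtter", "TTYtter", "tumblr"),
  stringsAsFactors = FALSE
)

# Exact match on the source field.
from.tumblr <- subset(toy, statusSource == "tumblr")

# Condition computed from another column.
short.ones <- subset(toy, nchar(text) < 100)

# Substring match, for sources wrapped in extra markup.
tumblr.ish <- subset(toy, grepl("tumblr", statusSource, fixed = TRUE))
```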
Lastly, I wanted to plot my tweeple—the people I talk to on twitter (most of whom I don’t actually know in real life … I like to keep friends and mathematical geekery separate). As you can see from the final chart, it was largely a sh_tshow. Or so I thought, until I considered attacking the problem with
ggplot. One of ggplot’s strengths, in my opinion its greatest, is the
facet_grid( attribute.1 ~ attribute.2 ) function. In combination with
base::cut — which assigns discrete “levels” to the data — facetting is especially powerful. I cut my data into four subsets, based on how many times I’ve tweeted @ someone:
my.tweets <- userTimeline( 'isomorphisms', n=3000 )
my.tweets <- twListToDF( my.tweets )   # subset() needs a data frame, not a list of statuses
# only tweets that are @ someone
talkback <- subset( my.tweets, is.na(replyToSN) == FALSE )
#the value would be NA iff I tweeted into the vast nothingness, apropos of no-one
# just the names, not the rest of the tweet's text or meta-information
tweeps <- talkback$replyToSN
#make a new data frame for ggplot to facet_wrap.
tweep.count <- table(tweeps)
tweep.levels <- cbind( tweep.count,
cut( tweep.count, c(0,1,2,5,100) ),
names(tweep.count)   # assumed: the third argument is missing in the original; the "name" column below suggests it
)
tweeps <- data.frame(tweep.levels)
names(tweeps) <- c("number", "category", "name")
class(tweeps$number) <- "numeric"
#all the above stuff only came clear after a few attempts
#and likewise the plot didn't work out perfect at first, either!
#but here's a decent plot that works:
ggplot( data = tweeps, aes(x=number) ) +
  facet_wrap(~ category, scale="free_x") +
  geom_text( aes(label=name, y=30-order(name), size=sqrt(log(number)), col=number+(as.numeric(category))^2 ), position="jitter" ) +
  opts( legend.title = theme_blank(), legend.text = theme_blank() )
This made for a much more readable image. Not perfect, but definitely displaying info now.
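To see the cut() step on its own, here is a small sketch in base R with invented reply counts. It builds the data frame directly rather than going through cbind(), which is a swapped-in simplification and not the route taken above; the names ann, bob, and so on are made up.

```r
# Invented reply targets: how often each made-up person was tweeted at.
tweep.count <- table(c("ann", "bob", "bob", "cy", "cy", "cy", "dee",
                       rep("eve", 7)))

tweeps <- data.frame(
  name   = names(tweep.count),
  number = as.vector(tweep.count),
  stringsAsFactors = FALSE
)

# Same break points as above: 1 reply, 2 replies, 3 to 5, more than 5.
tweeps$category <- cut(tweeps$number, breaks = c(0, 1, 2, 5, 100))

print(tweeps)
print(table(tweeps$category))
```

Each level of category becomes one panel when the data frame is handed to a faceted plot.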
OK, I do love talking about my twistory a little too much — but I’d like to see your histograms as well! If you run some stats on your own account, please post some pics below. I believe images can be directly embedded in the Disqus comments with
(To save your
R plots to a file rather than to the screen, do
png("a plot named Sue.png"); plot( laa dee daa ); dev.off() where
; could be replaced by a newline.)