# Analyzing Mobile Marketshare in Paraguay using Twitter's API

Due to the limited amount of information regarding mobile marketshare in my country (Paraguay), I decided to look for a way to get these stats. The data might be useful for developers when choosing which platform to target or prioritize. Cell phone carriers and local websites with high traffic probably have this kind of data, but I haven't found a public source.

I used Twitter's API as the data source for this experiment. Tweet metadata includes the source used to tweet ('Twitter for Android', 'Twitter for iPhone', etc.), and from that information the mobile platform can be inferred. The code I used and the IPython notebook on which this post is based are available in this github repo. Not everyone in the country uses Twitter, but it is popular enough for this analysis to show relevant results.
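For reference, the source field is a small HTML anchor whose text is the client name. Here is a minimal sketch of extracting it with the standard library; the example value is illustrative, since the actual href varies by client:

```python
import re

# Illustrative "source" value from tweet metadata; the exact href varies.
source = ('<a href="http://twitter.com/download/android" '
          'rel="nofollow">Twitter for Android</a>')

def clean_source(value):
    """Return the anchor's text if the value is an HTML link, else as-is."""
    match = re.search(r'>([^<]+)</a>', value)
    return match.group(1) if match else value

print(clean_source(source))  # Twitter for Android
print(clean_source('web'))   # web
```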

## Analysis using the Streaming API

The first approach I took was to use the streaming API with a geolocation filter to gather tweets and store them in a MongoDB database for later processing.

```python
from twython import TwythonStreamer

import pymongo
import sys

APP_KEY = 'app-key'
APP_SECRET = 'app-secret'
OAUTH_TOKEN = 'oauth-token'
OAUTH_TOKEN_SECRET = 'oauth-token-secret'


class MyStreamer(TwythonStreamer):

    def __init__(self, mongo_conn, *args):
        TwythonStreamer.__init__(self, *args)
        self.conn = mongo_conn
        self.alive = True

    def on_success(self, data):
        if not self.alive:
            self.disconnect()

        try:
            # 'place' may be missing or null, hence the broad except
            if data['place']['country_code'] == 'PY':
                self.conn.insert(data)
        except (KeyError, TypeError):
            pass

    def on_error(self, status_code, data):
        print 'HTTP Error: {}'.format(status_code)


def main():
    """
    Usage: python stream.py [DATABASE] [COLLECTION]

    DATABASE: MongoDb Database
    COLLECTION: MongoDb Collection
    """
    db, collection = sys.argv[1:]

    client = pymongo.MongoClient()
    conn = client[db][collection]

    stream = MyStreamer(conn, APP_KEY, APP_SECRET,
                        OAUTH_TOKEN, OAUTH_TOKEN_SECRET)

    # Rough bounding box around Paraguay (SW corner then NE corner, lon/lat)
    bounding_box = '-62.892728,-27.507697,-54.312378,-19.275389'

    try:
        stream.statuses.filter(locations=bounding_box)
    except KeyboardInterrupt:
        stream.alive = False
        print 'Shutting Down'


if __name__ == '__main__':
    main()
```


To analyze the tweets, I used IPython, pandas, and pymongo.

In [1]:
```python
%matplotlib inline

from pymongo import MongoClient
from pandas import DataFrame
from bs4 import BeautifulSoup

import matplotlib.pyplot as plt

client = MongoClient()
db = client.tweets
data = db.streaming.aggregate(
    {'$project': {'_id': '$user.id', 'source': '$source'}}, cursor={})

df_stream = DataFrame(list(data))
```

The script ran for about a month (between the end of July 2014 and the end of August 2014). From the 891,755 tweets gathered, I found 29,257 unique users. The number of unique users is rather low; this shows that a reduced number of people generate much of the content (as usually happens on sites with user-generated content). I also learned later, by observing tweets obtained from my crawler (see the Crawling section below), that many people turn geolocation off.

In [4]:
```python
len(df_stream)
```
Out[4]:
```
891755
```

In [5]:
```python
len(df_stream['_id'].drop_duplicates())
```
Out[5]:
```
29257
```

By cleaning the sources, doing some merges between equivalent entries, and counting, the list of the most popular sources is obtained. Instagram and Twitter are huge.

In [7]:
```python
source_counts = df_stream['source'].value_counts()

clean_source = lambda x: BeautifulSoup(x).a.string if '<a' in x else x
source_counts.index = [clean_source(i) for i in source_counts.index]

def merge(series, index_a, index_b):
    series[index_a] += series[index_b]
    return series.drop(index_b)

source_counts = merge(source_counts, 'Twitter for Android', 'Twitter for Android')
source_counts = merge(source_counts, 'Foursquare', 'foursquare')
source_counts = merge(source_counts, 'Twitter for BlackBerry',
                      u'Twitter for BlackBerry®')

source_counts[:10]
```
Out[7]:
```
Twitter for Android            692610
Foursquare                      96517
Twitter for iPhone              55238
Instagram                       22592
Twitter Web Client              11279
Twitter for Android Tablets      3832
Twitter for Windows Phone        3331
Tweetbot for iOS                  799
Twitter for Nokia S40             787
Twitter for iPad                  405
dtype: int64
```

To get the mobile platform stats, which is what I was really interested in, I used only the mobile sources ('Twitter for Android', 'Twitter for iPhone', etc.) and also eliminated duplicates, so every (user_id, source) pair is counted only once.
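As a toy illustration of that counting rule (with made-up user ids, not real data): each (user_id, source) pair contributes one vote, no matter how many tweets it covers.

```python
# Made-up (user_id, source) pairs: user 1 tweets from two platforms,
# and users 1 and 3 have repeated tweets from the same platform.
tweets = [
    (1, 'Android'), (1, 'Android'), (1, 'iOS'),
    (2, 'Android'), (3, 'BlackBerry'), (3, 'BlackBerry'),
]

platform_counts = {}
for user_id, source in set(tweets):  # dedupe (user_id, source) pairs
    platform_counts[source] = platform_counts.get(source, 0) + 1

# Android is counted twice (users 1 and 2); iOS and BlackBerry once each.
print(platform_counts)
```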
Here are the results:

In [4]:
```python
df_stream.loc[df_stream.source.str.contains('Android'), 'source'] = 'Android'
df_stream.loc[df_stream.source.str.contains('iPhone|iOS|iPad'), 'source'] = 'iOS'
df_stream.loc[df_stream.source.str.contains('BlackBerry'), 'source'] = 'BlackBerry'
df_stream.loc[df_stream.source.str.contains('Windows Phone'), 'source'] = 'Windows Phone'

df_stream = df_stream.drop_duplicates()

mobile = df_stream.loc[df_stream.source.str.contains('Windows Phone|iOS|Android|BlackBerry')]
mobile_counts = mobile['source'].value_counts()
mobile_counts
```
Out[4]:
```
Android          17904
iOS               1991
BlackBerry         217
Windows Phone      138
dtype: int64
```

In [10]:
```python
colors = ['yellowgreen', 'gold', 'lightskyblue', 'lightcoral']
sizes = 100 * (mobile_counts / sum(mobile_counts))

plt.pie(sizes, labels=mobile_counts.index, colors=colors,
        autopct='%1.1f%%', pctdistance=1.2, labeldistance=1.4, shadow=True)
plt.show()
```

## Crawling

The analysis using the streaming API left me with some doubts, especially the fact that from around 800k tweets there were only 29k unique users. Maybe a bigger sample could be obtained by following another approach. Twitter has the interesting property that some accounts (celebrities, athletes, news outlets, etc.) have a really large number of followers, so instead of a graph traversal algorithm, a simple crawler that fetches followers' tweets can get an interesting sample for a given demographic. The following analysis was made by crawling the followers of @abcdigital, the most popular newspaper in the country.

The crawler originally consisted of one thread that fetched followers and stored them in a queue, and another thread that popped users off the queue and fetched each user's tweets, all using a single API key. While writing the code I noticed that, due to Twitter's API limits, this would take a very long time, in particular the part that fetches users' tweets.
The statuses/user_timeline endpoint has a limit of 300 requests per 15-minute window. @abcdigital has around 240k followers, so with a single API key it would take 240k / 300 = 800 15-minute windows, for a total time of 800 * 15 min = 12,000 min (~8 days). So I ended up using multiple API keys with one thread per key; this was the simplest way to modify the code I had already written. Since this task is I/O bound, Python's GIL is not a concern and the threading module is enough.

```python
import time
import Queue
import datetime
import threading
import logging
import sys

import pymongo

from twython import (
    Twython, TwythonRateLimitError, TwythonError, TwythonAuthError
)
from credentials import keys
from collections import namedtuple

# DEBUG
# import pdb

logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s [%(levelname)s] \
(%(threadName)-8s) %(funcName)20s() %(message)s')


class FollowersCountdown():

    def __init__(self, num_followers):
        self._num_followers = num_followers
        self._lock = threading.Lock()

    def decrement_counter(self):
        with self._lock:
            if self._num_followers > 0:
                self._num_followers -= 1
                logging.debug("Followers left to crawl {}".format(
                    self._num_followers))

    def finished(self):
        return not bool(self._num_followers)


class ErrorHandler():

    def __init__(self):
        self.retry = False

    def __enter__(self):
        return self

    def __exit__(self, type, value, traceback):
        if isinstance(value, TwythonRateLimitError):
            self.retry = True
            logging.debug('Retry after {}'.format(value.retry_after))
            t = datetime.datetime.fromtimestamp(float(value.retry_after))
            now = datetime.datetime.now()
            if now < t:
                dt = t - now
                logging.error(
                    'Rate limit exceeded. Sleep {} seconds'.format(dt.seconds))
                time.sleep(dt.seconds)
        elif isinstance(value, TwythonAuthError):
            logging.debug('Authentication Error')
            self.retry = False
        elif isinstance(value, TwythonError):
            if value.error_code == 404:
                logging.debug('Http 404. No need to retry')
                self.retry = False
            else:
                # wait for a bit and retry
                self.retry = True
                logging.error('Unexpected Error. Sleep {} seconds'.format(10))
                time.sleep(10)
        return True


class FollowersFetcher(threading.Thread):

    def __init__(self, queue, tw, screen_name):
        threading.Thread.__init__(self)
        self._queue = queue
        self._tw = tw
        self._screen_name = screen_name

    def run(self):
        self._enqueue_followers()

    def _enqueue_followers(self):
        logging.info('Filling queue of followers')
        for followers_chunk in self._get_followers():
            for follower_id in followers_chunk:
                logging.debug("Adding user_id:{} to queue".format(follower_id))
                self._queue.put(follower_id)
        logging.info('All followers already in queue')

    def _get_followers(self):
        cursor = -1
        while cursor != 0:
            with ErrorHandler():
                response = self._tw.get_followers_ids(
                    screen_name=self._screen_name, cursor=cursor)
                cursor = response['next_cursor']
                yield response['ids']


class TweetsFetcher(threading.Thread):

    def __init__(self, queue, tw, followers_countdown, mongo_config):
        threading.Thread.__init__(self)
        self._queue = queue
        self._tw = tw
        self._countdown = followers_countdown
        self._mongo_conn = pymongo.MongoClient()
        self._db = mongo_config.db
        self._collection = mongo_config.collection

    def run(self):
        while True:
            try:
                user_id = self._queue.get(False)
                logging.info('Fetching tweets from user_id {}'.format(user_id))
                for tweet in self._get_user_tweets(user_id):
                    self._mongo_conn[self._db][self._collection].insert(tweet)
                self._countdown.decrement_counter()
            except Queue.Empty:
                logging.debug('Followers Queue is empty')
                if self._countdown.finished():
                    break
                else:
                    time.sleep(1)

    def _get_user_tweets(self, user_id):
        while True:
            timeline = []
            with ErrorHandler() as e:
                timeline = self._tw.get_user_timeline(id=user_id)
            if e.retry:
                continue
            break
        return (tweet for tweet in timeline)


def build_twitter_conn(app_key, app_secret):
    twitter = Twython(app_key, app_secret, oauth_version=2)
    ACCESS_TOKEN = twitter.obtain_access_token()
    twitter = Twython(app_key, access_token=ACCESS_TOKEN)
    return twitter


def main():
    """
    Usage: python crawl_followers.py [SCREEN_NAME] [DATABASE] [COLLECTION]

    SCREEN_NAME: Twitter account
    DATABASE: MongoDb Database
    COLLECTION: MongoDb Collection
    """
    screen_name, db, collection = sys.argv[1:]

    MongoConfig = namedtuple('MongoConfig', ['db', 'collection'])
    mongo_config = MongoConfig(db=db, collection=collection)

    # 1. Get the amount of followers
    app_key, app_secret = keys[0]
    twitter = build_twitter_conn(app_key, app_secret)
    user_info = twitter.show_user(screen_name=screen_name)
    followers_count = user_info['followers_count']
    logging.debug('Number of followers: %d' % followers_count)

    # 2. Start filling the Queue with followers to crawl
    q = Queue.Queue()
    ff = FollowersFetcher(q, twitter, screen_name)
    ff.start()

    countdown = FollowersCountdown(followers_count)

    # 3. Start worker threads (one per API key) to fetch followers' tweets
    for i in range(len(keys)):
        app_key, app_secret = keys[i]
        twitter = build_twitter_conn(app_key, app_secret)
        tf = TweetsFetcher(q, twitter, countdown, mongo_config)
        tf.start()


if __name__ == '__main__':
    main()
```

Here's a similar analysis to the one done before, but going straight to the mobile platform stats.

In [2]:
```python
data = db.crawling.aggregate(
    {'$project': {'_id': '$user.id', 'source': '$source',
                  'created_at': '$created_at'}}, cursor={})

df_crawling = DataFrame(list(data))
```


3,446,836 tweets were obtained from 240,225 users.

In [13]:
```python
len(df_crawling)
```
Out[13]:
```
3446836
```

In [14]:
```python
len(df_crawling['_id'].drop_duplicates())
```
Out[14]:
```
240225
```


By grouping by mobile platform and deleting duplicate entries, as done before, we get the following stats. I found it a bit odd that BlackBerry appeared second; they were once very popular, but nowadays I hardly see anyone using one.

In [5]:
```python
df_crawling.loc[df_crawling.source.str.contains('Android'), 'source'] = 'Android'
df_crawling.loc[df_crawling.source.str.contains('iPhone|iOS|iPad'), 'source'] = 'iOS'
df_crawling.loc[df_crawling.source.str.contains('BlackBerry'), 'source'] = 'BlackBerry'
df_crawling.loc[df_crawling.source.str.contains('Windows Phone'), 'source'] = 'Windows Phone'

df_crawling = df_crawling.drop_duplicates(subset=['_id', 'source'])

mobile = df_crawling.loc[df_crawling.source.str.contains('Windows Phone|iOS|Android|BlackBerry')]
mobile_counts = mobile['source'].value_counts()
```

In [6]:
```python
mobile_counts
```
Out[6]:
```
Android          132526
BlackBerry        31617
iOS               21380
Windows Phone      8587
dtype: int64
```

In [7]:
```python
colors = ['yellowgreen', 'gold', 'lightskyblue', 'lightcoral']
sizes = 100 * (mobile_counts / sum(mobile_counts))

plt.pie(sizes, labels=mobile_counts.index, colors=colors,
        autopct='%1.1f%%', pctdistance=1.2, labeldistance=1.4, shadow=True)
plt.show()
```
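For a side-by-side check of the BlackBerry anomaly, here is a quick sketch (not a cell from the original notebook) that computes the percentage shares from the unique-user counts in Out[4] and Out[6] above:

```python
# Unique-user platform counts from the two samples (Out[4] and Out[6] above)
streaming = {'Android': 17904, 'iOS': 1991,
             'BlackBerry': 217, 'Windows Phone': 138}
crawling = {'Android': 132526, 'iOS': 21380,
            'BlackBerry': 31617, 'Windows Phone': 8587}

def shares(counts):
    """Percentage share of each platform, rounded to one decimal."""
    total = float(sum(counts.values()))
    return {k: round(100 * v / total, 1) for k, v in counts.items()}

print(shares(streaming))  # BlackBerry at ~1.1% of the geolocated sample
print(shares(crawling))   # BlackBerry at ~16.3% of the crawled sample
```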


I repeated the same process, but now limiting the tweets to the year 2014, and indeed BlackBerry's marketshare dropped.

In [8]:
```python
df_crawling = df_crawling.loc[df_crawling.created_at.str.contains('2014')]

mobile = df_crawling.loc[df_crawling.source.str.contains('Windows Phone|iOS|Android|BlackBerry')]
mobile_counts = mobile['source'].value_counts()
mobile_counts
```
Out[8]:
```
Android          108436
iOS               18493
BlackBerry         9753
Windows Phone      7345
dtype: int64
```
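In percentage terms, the Out[8] counts above work out as follows (a quick sketch rather than a cell from the original notebook):

```python
# 2014-only unique-user counts, copied from Out[8] above
counts_2014 = {'Android': 108436, 'iOS': 18493,
               'BlackBerry': 9753, 'Windows Phone': 7345}

total = float(sum(counts_2014.values()))
for platform, count in sorted(counts_2014.items(), key=lambda x: -x[1]):
    print('{}: {:.1f}%'.format(platform, 100 * count / total))
# Android: 75.3%, iOS: 12.8%, BlackBerry: 6.8%, Windows Phone: 5.1%
```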


And here are the final results, with Android and iOS in the first two spots (as in most markets), BlackBerry in a distant third place, and Windows Phone growing.

Although these results come from a bigger sample than the first approach, keep in mind that these numbers are subject to a certain margin of error: we don't actually know each user's location (by observing the data I noticed most tweets did not have geolocation data), and I'm also assuming that @abcdigital's followers are a representative sample of the Paraguayan population. Having said that, this distribution seems like a fair representation of the current marketshare.

In [9]:
```python
colors = ['yellowgreen', 'gold', 'lightskyblue', 'lightcoral']
sizes = 100 * (mobile_counts / sum(mobile_counts))

plt.pie(sizes, labels=mobile_counts.index, colors=colors,
        autopct='%1.1f%%', pctdistance=1.2, labeldistance=1.4, shadow=True)
plt.show()
```