# Analyzing Mobile Marketshare in Paraguay using Twitter's API

Due to the limited amount of information regarding mobile marketshare in my country (Paraguay), I decided to look for a way to get these stats. The data might be useful for developers when choosing which platform to target or prioritize. Cell phone carriers and local websites with high traffic probably have this kind of data, but I haven't found a public source.

I used Twitter's API as the data source for this experiment. Tweet metadata includes the source used to tweet ('Twitter for Android', 'Twitter for iPhone', etc.), and from that information the mobile platform can be inferred. The code I used and the IPython notebook on which this post is based are available in this github repo. Not everyone in the country uses Twitter, but it is popular enough for this analysis to show relevant results.
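For reference, the source field is a small HTML anchor whose text is the client name. Here is a minimal sketch of extracting it with the standard library; the example value is illustrative, since the actual href varies by client:

```python
import re

# Illustrative "source" value from tweet metadata; the exact href varies.
source = ('<a href="http://twitter.com/download/android" '
          'rel="nofollow">Twitter for Android</a>')

def clean_source(value):
    """Return the anchor's text if the value is an HTML link, else as-is."""
    match = re.search(r'>([^<]+)</a>', value)
    return match.group(1) if match else value

print(clean_source(source))  # Twitter for Android
print(clean_source('web'))   # web
```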

## Analysis using the Streaming API

The first approach I took was to use the streaming API with a geolocation filter to gather tweets and store them in a MongoDB database for later processing.

```python
from twython import TwythonStreamer

import pymongo
import sys

APP_KEY = 'app-key'
APP_SECRET = 'app-secret'
OAUTH_TOKEN = 'oauth-token'
OAUTH_TOKEN_SECRET = 'oauth-token-secret'


class MyStreamer(TwythonStreamer):

    def __init__(self, mongo_conn, *args):
        TwythonStreamer.__init__(self, *args)
        self.conn = mongo_conn
        self.alive = True

    def on_success(self, data):
        if not self.alive:
            self.disconnect()

        try:
            # 'place' may be missing or null, hence the broad except
            if data['place']['country_code'] == 'PY':
                self.conn.insert(data)
        except (KeyError, TypeError):
            pass

    def on_error(self, status_code, data):
        print 'HTTP Error: {}'.format(status_code)


def main():
    """
    Usage: python stream.py [DATABASE] [COLLECTION]

    DATABASE: MongoDb Database
    COLLECTION: MongoDb Collection
    """
    db, collection = sys.argv[1:]

    client = pymongo.MongoClient()
    conn = client[db][collection]

    stream = MyStreamer(conn, APP_KEY, APP_SECRET,
                        OAUTH_TOKEN, OAUTH_TOKEN_SECRET)

    # Rough bounding box around Paraguay (SW corner then NE corner, lon/lat)
    bounding_box = '-62.892728,-27.507697,-54.312378,-19.275389'

    try:
        stream.statuses.filter(locations=bounding_box)
    except KeyboardInterrupt:
        stream.alive = False
        print 'Shutting Down'


if __name__ == '__main__':
    main()
```


To analyze the tweets, I used IPython, pandas, and pymongo.

In [1]:
```python
%matplotlib inline

from pymongo import MongoClient
from pandas import DataFrame
from bs4 import BeautifulSoup

import matplotlib.pyplot as plt

client = MongoClient()
db = client.tweets
data = db.streaming.aggregate(
    {'$project': {'_id': '$user.id', 'source': '$source'}}, cursor={})

df_stream = DataFrame(list(data))
```

The script ran for about a month (between the end of July 2014 and the end of August 2014). From the 891,755 tweets gathered, I found 29,257 unique users. The number of unique users is rather low; this shows that a reduced number of people generate much of the content (as usually happens on sites with user-generated content). I also learned later, by observing tweets obtained from my crawler (see the Crawling section below), that many people turn geolocation off.

In [4]:
```python
len(df_stream)
```
Out[4]:
```
891755
```

In [5]:
```python
len(df_stream['_id'].drop_duplicates())
```
Out[5]:
```
29257
```

By cleaning the sources, doing some merges between equivalent entries, and counting, the list of the most popular sources is obtained. Instagram and Twitter are huge.

In [7]:
```python
source_counts = df_stream['source'].value_counts()

clean_source = lambda x: BeautifulSoup(x).a.string if '<a' in x else x
source_counts.index = [clean_source(i) for i in source_counts.index]

def merge(series, index_a, index_b):
    series[index_a] += series[index_b]
    return series.drop(index_b)

source_counts = merge(source_counts, 'Twitter for Android', 'Twitter for Android')
source_counts = merge(source_counts, 'Foursquare', 'foursquare')
source_counts = merge(source_counts, 'Twitter for BlackBerry',
                      u'Twitter for BlackBerry®')

source_counts[:10]
```
Out[7]:
```
Twitter for Android            692610
Foursquare                      96517
Twitter for iPhone              55238
Instagram                       22592
Twitter Web Client              11279
Twitter for Android Tablets      3832
Twitter for Windows Phone        3331
Tweetbot for iOS                  799
Twitter for Nokia S40             787
Twitter for iPad                  405
dtype: int64
```

To get the mobile platform stats, which is what I was really interested in, I used only the mobile sources ('Twitter for Android', 'Twitter for iPhone', etc.) and also eliminated duplicates, so every (user_id, source) pair is counted only once.
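As a toy illustration of that counting rule (with made-up user ids, not real data): each (user_id, source) pair contributes one vote, no matter how many tweets it covers.

```python
# Made-up (user_id, source) pairs: user 1 tweets from two platforms,
# and users 1 and 3 have repeated tweets from the same platform.
tweets = [
    (1, 'Android'), (1, 'Android'), (1, 'iOS'),
    (2, 'Android'), (3, 'BlackBerry'), (3, 'BlackBerry'),
]

platform_counts = {}
for user_id, source in set(tweets):  # dedupe (user_id, source) pairs
    platform_counts[source] = platform_counts.get(source, 0) + 1

# Android is counted twice (users 1 and 2); iOS and BlackBerry once each.
print(platform_counts)
```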
Here are the results:

In [4]:
```python
df_stream.loc[df_stream.source.str.contains('Android'), 'source'] = 'Android'
df_stream.loc[df_stream.source.str.contains('iPhone|iOS|iPad'), 'source'] = 'iOS'
df_stream.loc[df_stream.source.str.contains('BlackBerry'), 'source'] = 'BlackBerry'
df_stream.loc[df_stream.source.str.contains('Windows Phone'), 'source'] = 'Windows Phone'

df_stream = df_stream.drop_duplicates()

mobile = df_stream.loc[df_stream.source.str.contains('Windows Phone|iOS|Android|BlackBerry')]
mobile_counts = mobile['source'].value_counts()
mobile_counts
```
Out[4]:
```
Android          17904
iOS               1991
BlackBerry         217
Windows Phone      138
dtype: int64
```

In [10]:
```python
colors = ['yellowgreen', 'gold', 'lightskyblue', 'lightcoral']
sizes = 100 * (mobile_counts / sum(mobile_counts))

plt.pie(sizes, labels=mobile_counts.index, colors=colors,
        autopct='%1.1f%%', pctdistance=1.2, labeldistance=1.4, shadow=True)
plt.show()
```

## Crawling

The analysis using the streaming API left me with some doubts, especially the fact that from around 800k tweets there were only 29k unique users. Maybe a bigger sample could be obtained by following another approach. Twitter has the interesting property that some accounts (celebrities, athletes, news outlets, etc.) have a really large number of followers, so instead of a graph traversal algorithm, a simple crawler that fetches followers' tweets can get an interesting sample for a given demographic. The following analysis was made by crawling the followers of @abcdigital, the most popular newspaper in the country.

The crawler originally consisted of one thread that fetched followers and stored them in a queue, and another thread that popped users off the queue and fetched each user's tweets, all using a single API key. While writing the code I noticed that, due to Twitter's API limits, this would take a very long time, in particular the part that fetches users' tweets.
The statuses/user_timeline endpoint has a limit of 300 requests per 15-minute window. @abcdigital has around 240k followers, so with a single API key it would take 240k / 300 = 800 15-minute windows, for a total time of 800 * 15 min = 12,000 min (~8 days). So I ended up using multiple API keys with one thread per key; this was the simplest way to modify the code I had already written. Since this task is I/O bound, Python's GIL is not a concern and the threading module is enough.

```python
import time
import Queue
import datetime
import threading
import logging
import sys

import pymongo

from twython import (
    Twython, TwythonRateLimitError, TwythonError, TwythonAuthError
)
from credentials import keys
from collections import namedtuple

# DEBUG
# import pdb

logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s [%(levelname)s] \
(%(threadName)-8s) %(funcName)20s() %(message)s')


class FollowersCountdown():

    def __init__(self, num_followers):
        self._num_followers = num_followers
        self._lock = threading.Lock()

    def decrement_counter(self):
        with self._lock:
            if self._num_followers > 0:
                self._num_followers -= 1
                logging.debug("Followers left to crawl {}".format(
                    self._num_followers))

    def finished(self):
        return not bool(self._num_followers)


class ErrorHandler():

    def __init__(self):
        self.retry = False

    def __enter__(self):
        return self

    def __exit__(self, type, value, traceback):
        if isinstance(value, TwythonRateLimitError):
            self.retry = True
            logging.debug('Retry after {}'.format(value.retry_after))
            t = datetime.datetime.fromtimestamp(float(value.retry_after))
            now = datetime.datetime.now()
            if now < t:
                dt = t - now
                logging.error(
                    'Rate limit exceeded. Sleep {} seconds'.format(dt.seconds))
                time.sleep(dt.seconds)
        elif isinstance(value, TwythonAuthError):
            logging.debug('Authentication Error')
            self.retry = False
        elif isinstance(value, TwythonError):
            if value.error_code == 404:
                logging.debug('Http 404. No need to retry')
                self.retry = False
            else:
                # wait for a bit and retry
                self.retry = True
                logging.error('Unexpected Error. Sleep {} seconds'.format(10))
                time.sleep(10)
        return True


class FollowersFetcher(threading.Thread):

    def __init__(self, queue, tw, screen_name):
        threading.Thread.__init__(self)
        self._queue = queue
        self._tw = tw
        self._screen_name = screen_name

    def run(self):
        self._enqueue_followers()

    def _enqueue_followers(self):
        logging.info('Filling queue of followers')
        for followers_chunk in self._get_followers():
            for follower_id in followers_chunk:
                logging.debug("Adding user_id:{} to queue".format(follower_id))
                self._queue.put(follower_id)
        logging.info('All followers already in queue')

    def _get_followers(self):
        cursor = -1
        while cursor != 0:
            with ErrorHandler():
                response = self._tw.get_followers_ids(
                    screen_name=self._screen_name, cursor=cursor)
                cursor = response['next_cursor']
                yield response['ids']


class TweetsFetcher(threading.Thread):

    def __init__(self, queue, tw, followers_countdown, mongo_config):
        threading.Thread.__init__(self)
        self._queue = queue
        self._tw = tw
        self._countdown = followers_countdown
        self._mongo_conn = pymongo.MongoClient()
        self._db = mongo_config.db
        self._collection = mongo_config.collection

    def run(self):
        while True:
            try:
                user_id = self._queue.get(False)
                logging.info('Fetching tweets from user_id {}'.format(user_id))
                for tweet in self._get_user_tweets(user_id):
                    self._mongo_conn[self._db][self._collection].insert(tweet)
                self._countdown.decrement_counter()
            except Queue.Empty:
                logging.debug('Followers Queue is empty')
                if self._countdown.finished():
                    break
                else:
                    time.sleep(1)

    def _get_user_tweets(self, user_id):
        while True:
            timeline = []
            with ErrorHandler() as e:
                timeline = self._tw.get_user_timeline(id=user_id)
            if e.retry:
                continue
            break
        return (tweet for tweet in timeline)


def build_twitter_conn(app_key, app_secret):
    twitter = Twython(app_key, app_secret, oauth_version=2)
    ACCESS_TOKEN = twitter.obtain_access_token()
    twitter = Twython(app_key, access_token=ACCESS_TOKEN)
    return twitter


def main():
    """
    Usage: python crawl_followers.py [SCREEN_NAME] [DATABASE] [COLLECTION]

    SCREEN_NAME: Twitter account
    DATABASE: MongoDb Database
    COLLECTION: MongoDb Collection
    """
    screen_name, db, collection = sys.argv[1:]

    MongoConfig = namedtuple('MongoConfig', ['db', 'collection'])
    mongo_config = MongoConfig(db=db, collection=collection)

    # 1. Get the amount of followers
    app_key, app_secret = keys[0]
    twitter = build_twitter_conn(app_key, app_secret)
    user_info = twitter.show_user(screen_name=screen_name)
    followers_count = user_info['followers_count']
    logging.debug('Number of followers: %d' % followers_count)

    # 2. Start filling the Queue with followers to crawl
    q = Queue.Queue()
    ff = FollowersFetcher(q, twitter, screen_name)
    ff.start()

    countdown = FollowersCountdown(followers_count)

    # 3. Start worker threads (one per API key) to fetch followers' tweets
    for i in range(len(keys)):
        app_key, app_secret = keys[i]
        twitter = build_twitter_conn(app_key, app_secret)
        tf = TweetsFetcher(q, twitter, countdown, mongo_config)
        tf.start()


if __name__ == '__main__':
    main()
```

Here's a similar analysis to the one done before, but going straight to the mobile platform stats.

In [2]:
```python
data = db.crawling.aggregate(
    {'$project': {'_id': '$user.id', 'source': '$source',
                  'created_at': '$created_at'}}, cursor={})

df_crawling = DataFrame(list(data))
```


3,446,836 tweets were obtained from 240,225 users.

In [13]:
```python
len(df_crawling)
```
Out[13]:
```
3446836
```

In [14]:
```python
len(df_crawling['_id'].drop_duplicates())
```
Out[14]:
```
240225
```


By grouping by mobile platform and deleting duplicate entries, as done before, we get the following stats. I found it a bit odd that BlackBerry appeared second; they were once very popular, but nowadays I hardly see anyone using one.

In [5]:
```python
df_crawling.loc[df_crawling.source.str.contains('Android'), 'source'] = 'Android'
df_crawling.loc[df_crawling.source.str.contains('iPhone|iOS|iPad'), 'source'] = 'iOS'
df_crawling.loc[df_crawling.source.str.contains('BlackBerry'), 'source'] = 'BlackBerry'
df_crawling.loc[df_crawling.source.str.contains('Windows Phone'), 'source'] = 'Windows Phone'

df_crawling = df_crawling.drop_duplicates(subset=['_id', 'source'])

mobile = df_crawling.loc[df_crawling.source.str.contains('Windows Phone|iOS|Android|BlackBerry')]
mobile_counts = mobile['source'].value_counts()
```

In [6]:
```python
mobile_counts
```
Out[6]:
```
Android          132526
BlackBerry        31617
iOS               21380
Windows Phone      8587
dtype: int64
```

In [7]:
```python
colors = ['yellowgreen', 'gold', 'lightskyblue', 'lightcoral']
sizes = 100 * (mobile_counts / sum(mobile_counts))

plt.pie(sizes, labels=mobile_counts.index, colors=colors,
        autopct='%1.1f%%', pctdistance=1.2, labeldistance=1.4, shadow=True)
plt.show()
```
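For a side-by-side check of the BlackBerry anomaly, here is a quick sketch (not a cell from the original notebook) that computes the percentage shares from the unique-user counts in Out[4] and Out[6] above:

```python
# Unique-user platform counts from the two samples (Out[4] and Out[6] above)
streaming = {'Android': 17904, 'iOS': 1991,
             'BlackBerry': 217, 'Windows Phone': 138}
crawling = {'Android': 132526, 'iOS': 21380,
            'BlackBerry': 31617, 'Windows Phone': 8587}

def shares(counts):
    """Percentage share of each platform, rounded to one decimal."""
    total = float(sum(counts.values()))
    return {k: round(100 * v / total, 1) for k, v in counts.items()}

print(shares(streaming))  # BlackBerry at ~1.1% of the geolocated sample
print(shares(crawling))   # BlackBerry at ~16.3% of the crawled sample
```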


I repeated the same process, but now limiting the tweets to the year 2014, and indeed BlackBerry's marketshare dropped.

In [8]:
```python
df_crawling = df_crawling.loc[df_crawling.created_at.str.contains('2014')]

mobile = df_crawling.loc[df_crawling.source.str.contains('Windows Phone|iOS|Android|BlackBerry')]
mobile_counts = mobile['source'].value_counts()
mobile_counts
```
Out[8]:
```
Android          108436
iOS               18493
BlackBerry         9753
Windows Phone      7345
dtype: int64
```
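In percentage terms, the Out[8] counts above work out as follows (a quick sketch rather than a cell from the original notebook):

```python
# 2014-only unique-user counts, copied from Out[8] above
counts_2014 = {'Android': 108436, 'iOS': 18493,
               'BlackBerry': 9753, 'Windows Phone': 7345}

total = float(sum(counts_2014.values()))
for platform, count in sorted(counts_2014.items(), key=lambda x: -x[1]):
    print('{}: {:.1f}%'.format(platform, 100 * count / total))
# Android: 75.3%, iOS: 12.8%, BlackBerry: 6.8%, Windows Phone: 5.1%
```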


And here are the final results, with Android and iOS in the first two spots (as in most markets), BlackBerry in a distant third place, and Windows Phone growing.

Although these results come from a bigger sample than the first approach, keep in mind that these numbers are subject to a certain margin of error: we don't actually know each user's location (by observing the data I noticed most tweets did not have geolocation data), and I'm also assuming that @abcdigital's followers are a representative sample of the Paraguayan population. Having said that, this distribution seems like a fair representation of the current marketshare.

In [9]:
```python
colors = ['yellowgreen', 'gold', 'lightskyblue', 'lightcoral']
sizes = 100 * (mobile_counts / sum(mobile_counts))

plt.pie(sizes, labels=mobile_counts.index, colors=colors,
        autopct='%1.1f%%', pctdistance=1.2, labeldistance=1.4, shadow=True)
plt.show()
```