Analyzing Mobile Marketshare in Paraguay using Twitter's API

Due to the limited amount of information about mobile marketshare in my country (Paraguay), I decided to look for a way to estimate these stats myself. The data might be useful for developers when choosing which platform to target or prioritize. Cell phone carriers and high-traffic local websites probably have this kind of data, but I haven't found a public source.

I used Twitter's API as the data source for this experiment: tweet metadata includes the source used to tweet ('Twitter for Android', 'Twitter for iPhone', etc.), and from that the mobile platform can be inferred. The code I used and the IPython notebook on which this post is based are available in this github repo. Not everyone in the country uses Twitter, but it is popular enough for this analysis to show relevant results.
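For context, the `source` field in tweet metadata is a small HTML anchor whose text is the client application name. A minimal sketch of pulling the app name out of it (the sample value below is illustrative, and `extract_app_name` is a hypothetical helper, not part of the code in this post):

```python
import re

# Illustrative raw `source` value as it appears in tweet metadata.
raw_source = ('<a href="http://twitter.com/download/android" '
              'rel="nofollow">Twitter for Android</a>')

def extract_app_name(source):
    """Return the anchor text if the source is an HTML link,
    otherwise return the source unchanged (e.g. plain 'web')."""
    match = re.search(r'>([^<]+)</a>', source)
    return match.group(1) if match else source

print(extract_app_name(raw_source))  # Twitter for Android
print(extract_app_name('web'))       # web
```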

Analysis using the Streaming API

The first approach I took was to use the Streaming API with a geolocation filter to gather tweets and store them in a MongoDB database for later processing.

from twython import TwythonStreamer

import pymongo
import sys

APP_KEY = 'app-key'
APP_SECRET = 'app-secret'
OAUTH_TOKEN = 'oauth-token'
OAUTH_TOKEN_SECRET = 'oauth-token-secret'


class MyStreamer(TwythonStreamer):

    def __init__(self, mongo_conn, *args):
        TwythonStreamer.__init__(self, *args)
        self.conn = mongo_conn
        self.alive = True

    def on_success(self, data):
        if not self.alive:
            self.disconnect()
            return

        try:
            if data['place']['country_code'] == 'PY':
                self.conn.insert(data)
        except KeyError:
            pass

    def on_error(self, status_code, data):
        print 'HTTP Error: {}'.format(status_code)


def main():
    """
    Usage: python stream.py [DATABASE] [COLLECTION]

    DATABASE: MongoDb Database
    COLLECTION: MongoDb Collection

    """

    db, collection = sys.argv[1:]

    client = pymongo.MongoClient()
    conn = client[db][collection]

    stream = MyStreamer(conn, APP_KEY, APP_SECRET,
                        OAUTH_TOKEN, OAUTH_TOKEN_SECRET)

    bounding_box = '-62.892728,-27.507697,-54.312378,-19.275389'

    try:
        stream.statuses.filter(locations=bounding_box)
    except KeyboardInterrupt:
        stream.alive = False
        print 'Shutting Down'


if __name__ == '__main__':
    main()

IPython, pandas and pymongo were used to analyze the tweets.

In [1]:
%matplotlib inline

from pymongo import MongoClient
from pandas import DataFrame
from bs4 import BeautifulSoup

import matplotlib.pyplot as plt

client = MongoClient()
db = client.tweets
data = db.streaming.aggregate(
    {'$project': {'_id': '$user.id', 'source': '$source'}}, cursor={})

df_stream = DataFrame(list(data))

The script ran for about a month (between the end of July 2014 and the end of August 2014). From the 891755 tweets gathered, I found 29257 unique users. The number of unique users is rather low, which shows that a small number of people generate most of the content (as usually happens on sites with user-generated content). I also learned later, by observing the tweets obtained with my crawler (see the Crawling section below), that many people turn geolocation off.

In [4]:
len(df_stream)
Out[4]:
891755
In [5]:
len(df_stream['_id'].drop_duplicates())
Out[5]:
29257

By cleaning the sources, merging equivalent entries and counting, the list of the most popular sources is obtained. Instagram and Twitter are huge.

In [7]:
source_counts = df_stream['source'].value_counts()
clean_source = lambda x: BeautifulSoup(x, 'html.parser').a.string if '<a' in x else x
source_counts.index = [clean_source(i) for i in source_counts.index]

def merge(series, index_a, index_b):
    series[index_a] += series[index_b]
    return series.drop(index_b)

source_counts = merge(source_counts, 'Twitter for Android', 'Twitter for  Android')
source_counts = merge(source_counts, 'Foursquare', 'foursquare')
source_counts = merge(source_counts, 'Twitter for BlackBerry', u'Twitter for BlackBerry®')

source_counts[:10]
Out[7]:
Twitter for Android            692610
Foursquare                      96517
Twitter for iPhone              55238
Instagram                       22592
Twitter Web Client              11279
Twitter for Android Tablets      3832
Twitter for Windows Phone        3331
Tweetbot for iΟS                  799
Twitter for Nokia S40             787
Twitter for iPad                  405
dtype: int64

To get mobile platform stats, which is what I was really interested in, I kept only mobile sources ('Twitter for Android', 'Twitter for iPhone', etc.) and also eliminated duplicates, so each (user_id, source) pair is counted only once. Here are the results:

In [4]:
df_stream.loc[df_stream.source.str.contains('Android'), 'source'] = 'Android'
df_stream.loc[df_stream.source.str.contains('iPhone|iOS|iPad'), 'source'] = 'iOS'
df_stream.loc[df_stream.source.str.contains('BlackBerry'), 'source'] = 'BlackBerry'
df_stream.loc[df_stream.source.str.contains('Windows Phone'), 'source'] = 'Windows Phone'
df_stream = df_stream.drop_duplicates()

mobile = df_stream.loc[df_stream.source.str.contains('Windows Phone|iOS|Android|BlackBerry')]
mobile_counts = mobile['source'].value_counts()
mobile_counts
Out[4]:
Android          17904
iOS               1991
BlackBerry         217
Windows Phone      138
dtype: int64
In [10]:
colors = ['yellowgreen', 'gold', 'lightskyblue', 'lightcoral']
sizes = 100 * (mobile_counts / sum(mobile_counts))
plt.pie(sizes, labels=mobile_counts.index, colors=colors,
        autopct='%1.1f%%', pctdistance=1.2, labeldistance=1.4, shadow=True)
plt.show()

Crawling

The analysis using the Streaming API left me with some doubts, especially the fact that from almost 900k tweets there were only 29k unique users. Maybe a bigger sample could be obtained with another approach. Twitter has the interesting property that some accounts (celebrities, athletes, news outlets, etc.) have a very large number of followers, so instead of a graph traversal algorithm, a simple crawler that fetches followers' tweets can produce an interesting sample for a given demographic. The following analysis was made by crawling the followers of @abcdigital, the most popular newspaper in the country.

The crawler originally consisted of one thread that fetched followers and stored them in a queue, and another thread that popped users off the queue and fetched their tweets, all using a single API key. While writing the code I noticed that, due to Twitter's API limits, this would take a very long time, in particular the part that fetches each user's tweets. The statuses/user_timeline endpoint has a limit of 300 requests per 15-minute window, and @abcdigital has around 240k followers, so with a single API key it would take 240k / 300 = 800 15-minute windows, for a total of 800 * 15 min = 12000 min (~8 days). I ended up using multiple API keys with one thread per key, which was the simplest way to modify the code I had already written. Since this task is I/O bound, Python's GIL is not a concern and the threading module is enough.
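The arithmetic above can be sanity-checked in a couple of lines (the figures are taken from the paragraph, not re-measured):

```python
# Sanity check of the rate-limit estimate above.
followers = 240000            # approximate follower count of @abcdigital
requests_per_window = 300     # statuses/user_timeline limit per 15-min window
window_minutes = 15

windows = followers // requests_per_window   # 800 windows
total_minutes = windows * window_minutes     # 12000 minutes
total_days = total_minutes / (60.0 * 24)     # ~8.3 days

print(windows, total_minutes, round(total_days, 1))  # 800 12000 8.3
```

With N API keys the windows run in parallel, so the wall-clock time divides roughly by N, which is what motivates the one-thread-per-key design below.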

import time
import Queue
import datetime
import threading
import logging
import sys
import pymongo


from twython import (
    Twython,
    TwythonRateLimitError,
    TwythonError,
    TwythonAuthError
)

from credentials import keys
from collections import namedtuple

logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s [%(levelname)s] \
            (%(threadName)-8s) %(funcName)20s() %(message)s')


class FollowersCountdown():
    def __init__(self, num_followers):
        self._num_followers = num_followers
        self._lock = threading.Lock()

    def decrement_counter(self):
        with self._lock:
            if self._num_followers > 0:
                self._num_followers -= 1
            logging.debug("Followers left to crawl {}".format(
                self._num_followers))

    def finished(self):
        return not bool(self._num_followers)


class ErrorHandler():
    def __init__(self):
        self.retry = False

    def __enter__(self):
        return self

    def __exit__(self, type, value, traceback):
        if isinstance(value, TwythonRateLimitError):
            self.retry = True

            logging.debug('Retry after {}'.format(value.retry_after))
            t = datetime.datetime.fromtimestamp(float(value.retry_after))
            now = datetime.datetime.now()

            if now < t:
                dt = t - now
                logging.error(
                    'Rate limit exceeded. Sleep {} seconds'.format(dt.seconds))
                time.sleep(dt.seconds)

        elif isinstance(value, TwythonAuthError):
            logging.debug('Authentication Error')
            self.retry = False
        elif isinstance(value, TwythonError):
            if value.error_code == 404:
                logging.debug('Http 404. No need to retry')
                self.retry = False
            else:
                # wait for a bit and retry
                self.retry = True
                logging.error('Unexpected Error. Sleep {} seconds'.format(10))
                time.sleep(10)

        return True


class FollowersFetcher(threading.Thread):
    def __init__(self, queue, tw, screen_name):
        threading.Thread.__init__(self)
        self._queue = queue
        self._tw = tw
        self._screen_name = screen_name

    def run(self):
        self._enqueue_followers()

    def _enqueue_followers(self):
        logging.info('Filling queue of followers')

        for followers_chunk in self._get_followers():
            for follower_id in followers_chunk:
                logging.debug("Adding user_id:{} to queue".format(follower_id))
                self._queue.put(follower_id)

        logging.info('All followers already in queue')

    def _get_followers(self):
        cursor = -1

        while cursor != 0:
            with ErrorHandler():
                response = self._tw.get_followers_ids(
                    screen_name=self._screen_name,
                    cursor=cursor)
                cursor = response['next_cursor']
                yield response['ids']


class TweetsFetcher(threading.Thread):
    def __init__(self, queue, tw, followers_countdown, mongo_config):
        threading.Thread.__init__(self)
        self._queue = queue
        self._tw = tw
        self._countdown = followers_countdown
        self._mongo_conn = pymongo.MongoClient()
        self._db = mongo_config.db
        self._collection = mongo_config.collection

    def run(self):
        while True:
            try:
                user_id = self._queue.get(False)
                logging.info('Fetching tweets from user_id {}'.format(user_id))

                for tweet in self._get_user_tweets(user_id):
                    self._mongo_conn[self._db][self._collection].insert(tweet)

                self._countdown.decrement_counter()
            except Queue.Empty:
                logging.debug('Followers Queue is empty')
                if self._countdown.finished():
                    break
                else:
                    time.sleep(1)

    def _get_user_tweets(self, user_id):
        while True:
            timeline = []

            with ErrorHandler() as e:
                timeline = self._tw.get_user_timeline(id=user_id)

            if e.retry:
                continue

            break

        return (tweet for tweet in timeline)


def build_twitter_conn(app_key, app_secret):
    twitter = Twython(app_key, app_secret, oauth_version=2)
    ACCESS_TOKEN = twitter.obtain_access_token()
    twitter = Twython(app_key, access_token=ACCESS_TOKEN)
    return twitter


def main():
    """
    Usage: python crawl_followers.py [SCREEN_NAME] [DATABASE] [COLLECTION]

    SCREEN_NAME: Twitter account
    DATABASE: MongoDb Database
    COLLECTION: MongoDb Collection

    """
    screen_name, db, collection = sys.argv[1:]
    MongoConfig = namedtuple('MongoConfig', ['db', 'collection'])
    mongo_config = MongoConfig(db=db, collection=collection)

    # 1. Get the amount of followers
    app_key, app_secret = keys[0]
    twitter = build_twitter_conn(app_key, app_secret)

    user_info = twitter.show_user(screen_name=screen_name)
    followers_count = user_info['followers_count']
    logging.debug('Number of followers: %d' % followers_count)

    # 2. Start filling the Queue with followers to crawl
    q = Queue.Queue()

    ff = FollowersFetcher(q, twitter, screen_name)
    ff.start()

    countdown = FollowersCountdown(followers_count)

    # 3. Start worker threads to fetch followers tweets
    for i in range(len(keys)):
        app_key, app_secret = keys[i]
        twitter = build_twitter_conn(app_key, app_secret)

        tf = TweetsFetcher(q, twitter, countdown, mongo_config)
        tf.start()


if __name__ == '__main__':
    main()

Here's an analysis similar to the one above, but going straight to the mobile platform stats.

In [2]:
data = db.crawling.aggregate(
    {'$project': {'_id': '$user.id', 'source': '$source', 'created_at': '$created_at'}}, cursor={})

df_crawling = DataFrame(list(data))

3446836 tweets were obtained from 240225 users.

In [13]:
len(df_crawling)
Out[13]:
3446836
In [14]:
len(df_crawling['_id'].drop_duplicates())
Out[14]:
240225

By grouping by mobile platform and deleting duplicate entries as before, we get the following stats. I found it a bit odd that BlackBerry appeared second; BlackBerrys were once very popular, but nowadays I hardly see anyone using them.

In [5]:
df_crawling.loc[df_crawling.source.str.contains('Android'), 'source'] = 'Android'
df_crawling.loc[df_crawling.source.str.contains('iPhone|iOS|iPad'), 'source'] = 'iOS'
df_crawling.loc[df_crawling.source.str.contains('BlackBerry'), 'source'] = 'BlackBerry'
df_crawling.loc[df_crawling.source.str.contains('Windows Phone'), 'source'] = 'Windows Phone'
df_crawling = df_crawling.drop_duplicates(subset=['_id', 'source'])

mobile = df_crawling.loc[df_crawling.source.str.contains('Windows Phone|iOS|Android|BlackBerry')]
mobile_counts = mobile['source'].value_counts()
In [6]:
mobile_counts
Out[6]:
Android          132526
BlackBerry        31617
iOS               21380
Windows Phone      8587
dtype: int64
In [7]:
colors = ['yellowgreen', 'gold', 'lightskyblue', 'lightcoral']
sizes = 100 * (mobile_counts / sum(mobile_counts))
plt.pie(sizes, labels=mobile_counts.index, colors=colors,
        autopct='%1.1f%%', pctdistance=1.2, labeldistance=1.4, shadow=True)
plt.show()

I repeated the same process, but limiting the tweets to the year 2014, and indeed BlackBerry's marketshare dropped.

In [8]:
df_crawling = df_crawling.loc[df_crawling.created_at.str.contains('2014')]
mobile = df_crawling.loc[df_crawling.source.str.contains('Windows Phone|iOS|Android|BlackBerry')]
mobile_counts = mobile['source'].value_counts()
mobile_counts
Out[8]:
Android          108436
iOS               18493
BlackBerry         9753
Windows Phone      7345
dtype: int64
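As a more robust alternative to the substring match on `created_at`, the dates can be parsed properly before filtering. A minimal sketch (the sample timestamp is illustrative, in Twitter's `created_at` format, and `tweet_year` is a hypothetical helper):

```python
from datetime import datetime

def tweet_year(created_at):
    """Parse Twitter's created_at format, e.g.
    'Wed Aug 27 13:08:45 +0000 2014', and return the year."""
    return datetime.strptime(created_at, '%a %b %d %H:%M:%S %z %Y').year

print(tweet_year('Wed Aug 27 13:08:45 +0000 2014'))  # 2014
```

In pandas this would translate to converting the column with `pd.to_datetime` and comparing `dt.year == 2014`, which cannot accidentally match '2014' appearing elsewhere in the string.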

And here are the final results, with Android and iOS in the first two spots, as in most markets, BlackBerry in a distant third place, and Windows Phone growing.

Although these results come from a bigger sample than the first approach, keep in mind that the numbers are subject to some margin of error: we don't actually know users' locations (by observing the data I noticed most tweets did not include geolocation), and I'm assuming that @abcdigital's followers are a representative sample of the Paraguayan population. That said, this distribution seems like a reasonable approximation of the current marketshare.

In [9]:
colors = ['yellowgreen', 'gold', 'lightskyblue', 'lightcoral']
sizes = 100 * (mobile_counts / sum(mobile_counts))
plt.pie(sizes, labels=mobile_counts.index, colors=colors,
        autopct='%1.1f%%', pctdistance=1.2, labeldistance=1.4, shadow=True)
plt.show()