{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "_uuid": "e7b2d3f9b44af9c3f64f91d183b8e9cb6b7adddd" }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "# Draw inline\n", "%matplotlib inline\n", "\n", "# Set figure aesthetics\n", "sns.set_style(\"white\", {'ytick.major.size': 10.0})\n", "sns.set_context(\"poster\", font_scale=1.1)" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "45f9941c0b93e026853c3ce875d158a63449f27d" }, "source": [ "I wanted to take a look at the user data we have for this competition, so I made this little notebook to share my findings and discuss them. So far I've only looked at the basic user data; I'll take a look at sessions and the other *csv* files later this month.\n", "\n", "Please feel free to comment on anything you think can be improved or fixed. I am not a professional in this field, so there will be mistakes and things that can be *improved*. This is the flow I followed; some of the plots are not really interesting, but I decided to keep them in case someone sees something interesting in them.\n", "\n", "Let's see the data!" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "478b3dee9a3c2f3ff279955d0844664288bda621" }, "source": [ "## Data Exploration" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "7af41496c24fc81b911d97142b40a731fdf4e164" }, "source": [ "Generally, when I start a Data Science project I'm looking to answer the following questions:\n", "\n", "- Are there any mistakes in the data?\n", "- Does the data show any peculiar behavior?\n", "- Do I need to fix or remove any of the data to make it more realistic?" 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "_uuid": "ec61fd9d5451c342f2d811989bfc5f29b1ffdfbf" }, "outputs": [], "source": [ "# Load the data into DataFrames\n", "train_users = pd.read_csv('./train_users_2.csv')\n", "test_users = pd.read_csv('./test_users.csv')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "_uuid": "af3bcef5cdd04f64bd1a859250c0429c69950ab7" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "We have 213451 users in the training set and 62096 in the test set.\n", "In total we have 275547 users.\n" ] } ], "source": [ "print(\"We have\", train_users.shape[0], \"users in the training set and\", \n", " test_users.shape[0], \"in the test set.\")\n", "print(\"In total we have\", train_users.shape[0] + test_users.shape[0], \"users.\")" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "4bce84d0f8b6a86daa318fd19c8fa145764eb63d" }, "source": [ "Let's combine the training and test sets so we can work with all the data at once." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "_uuid": "8033a81ad6fb5902eafa15b7760b0fdb32418ff8" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:2: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version\n", "of pandas will change to not sort by default.\n", "\n", "To accept the future behavior, pass 'sort=True'.\n", "\n", "To retain the current behavior and silence the warning, pass sort=False\n", "\n", " \n" ] }, { "data": { "text/html": [ "
\n", " | affiliate_channel | \n", "affiliate_provider | \n", "age | \n", "country_destination | \n", "date_account_created | \n", "date_first_booking | \n", "first_affiliate_tracked | \n", "first_browser | \n", "first_device_type | \n", "gender | \n", "language | \n", "signup_app | \n", "signup_flow | \n", "signup_method | \n", "timestamp_first_active | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "direct | \n", "direct | \n", "NaN | \n", "NDF | \n", "2010-06-28 | \n", "NaN | \n", "untracked | \n", "Chrome | \n", "Mac Desktop | \n", "-unknown- | \n", "en | \n", "Web | \n", "0 | \n", "20090319043255 | \n", "|
1 | \n", "seo | \n", "38.0 | \n", "NDF | \n", "2011-05-25 | \n", "NaN | \n", "untracked | \n", "Chrome | \n", "Mac Desktop | \n", "MALE | \n", "en | \n", "Web | \n", "0 | \n", "20090523174809 | \n", "||
2 | \n", "direct | \n", "direct | \n", "56.0 | \n", "US | \n", "2010-09-28 | \n", "2010-08-02 | \n", "untracked | \n", "IE | \n", "Windows Desktop | \n", "FEMALE | \n", "en | \n", "Web | \n", "3 | \n", "basic | \n", "20090609231247 | \n", "
3 | \n", "direct | \n", "direct | \n", "42.0 | \n", "other | \n", "2011-12-05 | \n", "2012-09-08 | \n", "untracked | \n", "Firefox | \n", "Mac Desktop | \n", "FEMALE | \n", "en | \n", "Web | \n", "0 | \n", "20091031060129 | \n", "|
4 | \n", "direct | \n", "direct | \n", "41.0 | \n", "US | \n", "2010-09-14 | \n", "2010-02-18 | \n", "untracked | \n", "Chrome | \n", "Mac Desktop | \n", "-unknown- | \n", "en | \n", "Web | \n", "0 | \n", "basic | \n", "20091208061105 | \n", "