Projects

September 28, 2018

Overview

Data Workshop - How to Mine the Interwebs for Data (Instructions)

Section 1.2: Downloading & Viewing Kaggle Data

  1. Some of the information in the CSV files is missing or not useful, and the data is spread across several CSV files. This uncompiled structure makes the data harder to feed into a machine learning model: we would have to filter and process the CSV data heavily before being able to use it.
  2. This dataset provides some relevant information for the task at hand, but it is definitely not the *best* option. For example, a lot of the data is either missing or unknown (e.g., gender in train_users_2.csv), and some of it is irrelevant, such as first_browser or first_device_type. Furthermore, the age bucket is given as a range of numbers rather than a single value, which might make training more difficult.
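
The two issues above (missing or placeholder values, and data split across several CSV files) can be checked with a short script. The following is a minimal sketch using only the Python standard library. The inline rows are made-up sample data, and the column names (`id`, `gender`, `age`, `user_id`, `action`) and the `-unknown-` placeholder are assumptions about the file layout, not a verified schema.

```python
import csv
import io
from collections import Counter, defaultdict

# Made-up sample rows standing in for train_users_2.csv (assumed columns).
USERS_CSV = """id,gender,age,first_browser
u1,-unknown-,,Chrome
u2,FEMALE,34,Safari
u3,MALE,,-unknown-
"""

# Made-up sample rows standing in for a second file keyed by user (assumed columns).
SESSIONS_CSV = """user_id,action
u1,search
u1,show
u2,search
"""

# Values treated as "missing"; the placeholder string is an assumption.
MISSING = {"", "-unknown-"}


def count_missing(rows):
    """Return a Counter mapping each column name to its number of missing values."""
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if value.strip() in MISSING:
                missing[column] += 1
    return missing


def join_sessions(users, sessions):
    """Attach each user's list of session actions, merging the two tables by id."""
    actions = defaultdict(list)
    for row in sessions:
        actions[row["user_id"]].append(row["action"])
    return [{**user, "actions": actions.get(user["id"], [])} for user in users]


users = list(csv.DictReader(io.StringIO(USERS_CSV)))
sessions = list(csv.DictReader(io.StringIO(SESSIONS_CSV)))

missing_counts = count_missing(users)
merged = join_sessions(users, sessions)

print(dict(missing_counts))
print(merged[0]["actions"])
```

Counting missing values per column first gives a rough idea of which fields are worth keeping before any merged table is built for training.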

Section 1.3: Data Processing and Kaggle Kernels

User data exploration/visualization: 6-s198-user-data-exploration.ipynb

I visualized Airbnb's top 6 affiliate providers. From the graph, we can see that most Airbnb users reach Airbnb directly or are routed through Google, whereas a smaller portion of users are directed to Airbnb via Facebook, Bing, and Craigslist.

visualization
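
A ranking like the one described above can be computed by counting how often each provider appears. This is a rough sketch rather than the notebook's actual code: it assumes the provider is stored in a column named `affiliate_provider`, the helper `top_providers` is hypothetical, and the rows are made-up sample data.

```python
import csv
import io
from collections import Counter

# Made-up sample rows; the column name affiliate_provider is an assumption.
SAMPLE_CSV = """id,affiliate_provider
u1,direct
u2,google
u3,direct
u4,facebook
u5,bing
u6,craigslist
u7,google
u8,direct
u9,other
u10,yahoo
"""


def top_providers(csv_text, n=6):
    """Return the n most frequent affiliate providers as (name, count) pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row["affiliate_provider"] for row in reader)
    return counts.most_common(n)


top6 = top_providers(SAMPLE_CSV)

# Print a simple text bar chart, one row per provider.
for name, count in top6:
    print(f"{name:<12}{'#' * count} ({count})")
```

The same (name, count) pairs could be passed to a plotting library to draw the bar chart; the text output here just keeps the sketch dependency-free.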

Section 2: Mining from websites using scripts

Processing a website using bs4: data_processing.py
Converting/storing raw data into an accessible data structure: convert_data.py
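
The two steps above (parsing a page with bs4, then storing the result in an accessible structure) could look roughly like the following. This is a minimal sketch, not the contents of data_processing.py or convert_data.py: the HTML snippet, the `listing` and `price` class names, the field names, and the helper `parse_listings` are all hypothetical. It parses an inline string, so no network access is needed.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page.
HTML = """
<html><body>
  <div class="listing"><h2>Cozy loft</h2><span class="price">$120</span></div>
  <div class="listing"><h2>Beach house</h2><span class="price">$310</span></div>
  <div class="ad">Sponsored</div>
</body></html>
"""


def parse_listings(html):
    """Extract one dict per listing element from the raw HTML."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for node in soup.find_all("div", class_="listing"):
        title = node.find("h2").get_text(strip=True)
        price_text = node.find("span", class_="price").get_text(strip=True)
        records.append({"title": title, "price_usd": int(price_text.lstrip("$"))})
    return records


records = parse_listings(HTML)

# Serialize to JSON so a later step can load the data without re-parsing HTML.
as_json = json.dumps(records, indent=2)
print(as_json)
```

Keeping the parsing and the storage step separate mirrors the two-script split: the parser can change with the page layout while the stored records keep a stable shape.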

Section 4: Brainstorm data collection strategy for your project

The project idea that my partner, Burhan Azeem, and I are considering is creating an AI to solve tutorial/beginner-level protein configurations in the game fold.it. In order to "win" the game, the player must fold proteins into various configurations. The more efficient the fold, the more points the player wins. For this project, Burhan and I plan to ask the creators of fold.it for anonymized recordings of played sessions. We expect that the data is stored in accessible data structures that can be used as input to our deep learning network. We plan to use this data to train the network with a CNN or with reinforcement learning.

Section 5: Data Review

Overview of the Datasets Acquired for our Final Project (pdf)