Video Tutorial: http://youtu.be/2b32-KVzBXQ
In many cases you don't have to go and collect the data because you have data in a local NoSQL DB that requires analysis.This could be a URL audit-trail for your website where you want to study visitors and trends. One common NoSQL Database is MongoDB which is the subject of this tutorial.
NoSQL was designed to deal with the problems that came up when developers started dealing with large amount of data using the existing Relational Databases (also referred to as SQL Databases). The three main issues with SQL Databases were:
SQL Databases had fixed schema of tables which limited the ability to store new fields in your table. You have to change the table schema when ever you need a new field.
SQL Databases can store virtually billions of records in a single table but the problem was the overhead cost to your CPU and memory to be able to access this data efficiently.
SQL databases are not designed to insert large amount of records in short time specially to large tables.
MongoDB (from "humongous") is a cross-platform document-oriented database. Classified as a NoSQL database, MongoDB eschews the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas (MongoDB calls the format BSON), making the integration of data in certain types of applications easier and faster. Released under a combination of the GNU Affero General Public License and the Apache License, MongoDB is free and open-source software.
####from [Wikipedia](http://en.wikipedia.org/wiki/MongoDB)
Apache Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple datacenters,[1] with asynchronous masterless replication allowing low latency operations for all clients.
####from [Wikipedia](http://en.wikipedia.org/wiki/Apache_Cassandra)
HBase is an open source, non-relational, distributed database modeled after Google's BigTable and written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed Filesystem), providing BigTable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data (small amounts of information caught within a large collection of empty or unimportant data, such as finding the 50 largest items in a group of 2 billion records, or finding the non-zero items representing less than 0.1% of a huge collection).
####from [Wikipedia](http://en.wikipedia.org/wiki/Apache_HBase)
Apache CouchDB, commonly referred to as CouchDB, is an open source database that focuses on ease of use and on being "a database that completely embraces the web".[1] It is a NoSQL database that uses JSON to store data, JavaScript as its query language using MapReduce, and HTTP for an API.[1] One of its distinguishing features is multi-master replication. CouchDB was first released in 2005 and later became an Apache project in 2008.
#### from [Wikipedia](http://en.wikipedia.org/wiki/CouchDB)
Documentation: http://api.mongodb.org/python/current/
import pymongo
client_con = pymongo.MongoClient()
client_con.database_names()
[u'mydb', u'local', u'roshan', u'admin']
roshan_db = client_con["roshan"]
roshan_db.collection_names()
[u'system.indexes', u'Twitter', u'SearchTweets', u'TwitterFollowers', u'TwitterUserSearch', u'TCelebrities', u'TwitterFriendsIds', u'TwitterFollowersIds', u'TwitterIds', u'TwiiterTemp', u'test youtube']
twitter_col = roshan_db["Twitter"]
twitter_col.count()
1045260
doc = twitter_col.find_one()
doc
{u'_id': ObjectId('543d819c105f1913c74ab20b'), u'contributors': None, u'coordinates': None, u'created_at': u'Tue Oct 14 20:04:05 +0000 2014', u'entities': {u'hashtags': [], u'symbols': [], u'trends': [], u'urls': [], u'user_mentions': []}, u'favorite_count': 0, u'favorited': False, u'filter_level': u'medium', u'geo': None, u'id': 522115647623692288L, u'id_str': u'522115647623692288', u'in_reply_to_screen_name': None, u'in_reply_to_status_id': None, u'in_reply_to_status_id_str': None, u'in_reply_to_user_id': None, u'in_reply_to_user_id_str': None, u'lang': u'ar', u'place': None, u'possibly_sensitive': False, u'retweet_count': 0, u'retweeted': False, u'source': u'<a href="http://w.drr-alkalam.com/" rel="nofollow">\u062f\u0631\u0631 \u0627\u0644\u0643\u0644\u0627\u0645 3</a>', u'text': u'\u0627\u0644\u0632\u0646\u0643 \u064a\u0648\u062c\u062f \u0628\u0648\u0641\u0631\u0629 \u0627\u0644\u0644\u062d\u0645 \u0648\u0627\u0644\u0633\u0645\u0643 \u0648\u0627\u0644\u062d\u0644\u064a\u0628 \u0627\u0644\u0637\u0627\u0632\u062c \u0648 \u0627\u0644\u062c\u0628\u0646 \u0627\u0644\u0637\u0627\u0632\u062c \u0648 \u0627\u0644\u0628\u0642\u0648\u0644\u064a\u0627\u062a \u0648 \u0627\u0644\u0644\u0628\u0646, \u0643\u0645\u0627 \u064a\u0645\u0643\u0646 \u062a\u0646\u0627\u0648\u0644\u0647 \u0643\u0645\u0643\u0645\u0644\u0627\u062a \u063a\u0630\u0627\u0626\u064a\u0629.', u'timestamp_ms': u'1413317045494', u'truncated': False, u'user': {u'contributors_enabled': False, u'created_at': u'Mon Sep 15 09:21:34 +0000 2014', u'default_profile': True, u'default_profile_image': False, u'description': u'\u062a\u0631\u0628\u064a\u0647\u06c1 \u0629 \u0634\u0627\u064a\u0628 \u0648\u0648\u0644\u0648 \ufe8e\u0646\u0646\u064a\u064a\u064e \u0641\u0642\u064a\u0631 \ufe8e\u0639\u0637\u064a\u064a\u0621 \ufe8e\u0644\u062d\u0627\u062c\u0647\u06c1 \u0648\ufe8e\u0646\u0627 \u0645\u062d\u062a\u0627\u062c\u0647\u0627\u0627 \u060c PIN:264400B1', u'favourites_count': 0, u'follow_request_sent': None, u'followers_count': 133, u'following': None, u'friends_count': 119, u'geo_enabled': False, u'id': 2811023737L, u'id_str': u'2811023737', u'is_translator': False, u'lang': u'en', u'listed_count': 0, u'location': u'', u'name': u'\u0639\u0640\u0627\u0645\u0640\u062f \u0627\u0644\u0640\u0639\u0640\u0637\u0640\u0627\u0648\u064a', u'notifications': None, u'profile_background_color': u'C0DEED', u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme1/bg.png', u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme1/bg.png', u'profile_background_tile': False, u'profile_image_url': u'http://pbs.twimg.com/profile_images/511445636961419264/9enIvHyS_normal.png', u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/511445636961419264/9enIvHyS_normal.png', u'profile_link_color': u'0084B4', u'profile_sidebar_border_color': u'C0DEED', u'profile_sidebar_fill_color': u'DDEEF6', u'profile_text_color': u'333333', u'profile_use_background_image': True, u'protected': False, u'screen_name': u'Aamdaladawe_', u'statuses_count': 209, u'time_zone': None, u'url': None, u'utc_offset': None, u'verified': False}}
Documentation: http://docs.mongodb.org/manual/reference/object-id/
document_id = doc["_id"]
print type(document_id)
document_id
<class 'bson.objectid.ObjectId'>
ObjectId('543d819c105f1913c74ab20b')
print document_id.generation_time.strftime("%Y-%m-%d %H:%M:%SZ%z")
2014-10-14 20:03:40Z+0000
doc["created_at"]
u'Tue Oct 14 20:04:05 +0000 2014'
doc["text"]
u'\u0627\u0644\u0632\u0646\u0643 \u064a\u0648\u062c\u062f \u0628\u0648\u0641\u0631\u0629 \u0627\u0644\u0644\u062d\u0645 \u0648\u0627\u0644\u0633\u0645\u0643 \u0648\u0627\u0644\u062d\u0644\u064a\u0628 \u0627\u0644\u0637\u0627\u0632\u062c \u0648 \u0627\u0644\u062c\u0628\u0646 \u0627\u0644\u0637\u0627\u0632\u062c \u0648 \u0627\u0644\u0628\u0642\u0648\u0644\u064a\u0627\u062a \u0648 \u0627\u0644\u0644\u0628\u0646, \u0643\u0645\u0627 \u064a\u0645\u0643\u0646 \u062a\u0646\u0627\u0648\u0644\u0647 \u0643\u0645\u0643\u0645\u0644\u0627\u062a \u063a\u0630\u0627\u0626\u064a\u0629.'
print doc["text"]
الزنك يوجد بوفرة اللحم والسمك والحليب الطازج و الجبن الطازج و البقوليات و اللبن, كما يمكن تناوله كمكملات غذائية.
doc["user"]
{u'contributors_enabled': False, u'created_at': u'Mon Sep 15 09:21:34 +0000 2014', u'default_profile': True, u'default_profile_image': False, u'description': u'\u062a\u0631\u0628\u064a\u0647\u06c1 \u0629 \u0634\u0627\u064a\u0628 \u0648\u0648\u0644\u0648 \ufe8e\u0646\u0646\u064a\u064a\u064e \u0641\u0642\u064a\u0631 \ufe8e\u0639\u0637\u064a\u064a\u0621 \ufe8e\u0644\u062d\u0627\u062c\u0647\u06c1 \u0648\ufe8e\u0646\u0627 \u0645\u062d\u062a\u0627\u062c\u0647\u0627\u0627 \u060c PIN:264400B1', u'favourites_count': 0, u'follow_request_sent': None, u'followers_count': 133, u'following': None, u'friends_count': 119, u'geo_enabled': False, u'id': 2811023737L, u'id_str': u'2811023737', u'is_translator': False, u'lang': u'en', u'listed_count': 0, u'location': u'', u'name': u'\u0639\u0640\u0627\u0645\u0640\u062f \u0627\u0644\u0640\u0639\u0640\u0637\u0640\u0627\u0648\u064a', u'notifications': None, u'profile_background_color': u'C0DEED', u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme1/bg.png', u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme1/bg.png', u'profile_background_tile': False, u'profile_image_url': u'http://pbs.twimg.com/profile_images/511445636961419264/9enIvHyS_normal.png', u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/511445636961419264/9enIvHyS_normal.png', u'profile_link_color': u'0084B4', u'profile_sidebar_border_color': u'C0DEED', u'profile_sidebar_fill_color': u'DDEEF6', u'profile_text_color': u'333333', u'profile_use_background_image': True, u'protected': False, u'screen_name': u'Aamdaladawe_', u'statuses_count': 209, u'time_zone': None, u'url': None, u'utc_offset': None, u'verified': False}
doc["user"]["verified"]
False
for k,v in doc["user"].iteritems():
print "%s: %s" % (k,v)
follow_request_sent: None profile_use_background_image: True id: 2811023737 verified: False profile_image_url_https: https://pbs.twimg.com/profile_images/511445636961419264/9enIvHyS_normal.png profile_sidebar_fill_color: DDEEF6 is_translator: False geo_enabled: False profile_text_color: 333333 followers_count: 133 protected: False location: default_profile_image: False id_str: 2811023737 utc_offset: None statuses_count: 209 description: تربيهہ ة شايب وولو ﺎننييَ فقير ﺎعطييء ﺎلحاجهہ وﺎنا محتاجهاا ، PIN:264400B1 friends_count: 119 profile_link_color: 0084B4 profile_image_url: http://pbs.twimg.com/profile_images/511445636961419264/9enIvHyS_normal.png notifications: None profile_background_image_url_https: https://abs.twimg.com/images/themes/theme1/bg.png profile_background_color: C0DEED profile_background_image_url: http://abs.twimg.com/images/themes/theme1/bg.png screen_name: Aamdaladawe_ lang: en profile_background_tile: False favourites_count: 0 name: عـامـد الـعـطـاوي url: None created_at: Mon Sep 15 09:21:34 +0000 2014 contributors_enabled: False time_zone: None profile_sidebar_border_color: C0DEED default_profile: True following: None listed_count: 0
verified_users = twitter_col.find({"user.verified":True})
verified_users.count()
672
import pandas
users = pandas.DataFrame([msg["user"] for msg in verified_users])
users
<class 'pandas.core.frame.DataFrame'> Int64Index: 672 entries, 0 to 671 Data columns (total 38 columns): contributors_enabled 672 non-null values created_at 672 non-null values default_profile 672 non-null values default_profile_image 672 non-null values description 672 non-null values favourites_count 672 non-null values follow_request_sent 0 non-null values followers_count 672 non-null values following 0 non-null values friends_count 672 non-null values geo_enabled 672 non-null values id 672 non-null values id_str 672 non-null values is_translator 672 non-null values lang 672 non-null values listed_count 672 non-null values location 672 non-null values name 672 non-null values notifications 0 non-null values profile_background_color 672 non-null values profile_background_image_url 672 non-null values profile_background_image_url_https 672 non-null values profile_background_tile 672 non-null values profile_banner_url 578 non-null values profile_image_url 672 non-null values profile_image_url_https 672 non-null values profile_link_color 672 non-null values profile_sidebar_border_color 672 non-null values profile_sidebar_fill_color 672 non-null values profile_text_color 672 non-null values profile_use_background_image 672 non-null values protected 672 non-null values screen_name 672 non-null values statuses_count 672 non-null values time_zone 656 non-null values url 631 non-null values utc_offset 656 non-null values verified 672 non-null values dtypes: bool(9), float64(1), int64(6), object(22)
users_unique = users.drop_duplicates(cols=["id"])
print len(users_unique)
222
x = users_unique["followers_count"]
y = users_unique["statuses_count"]
z = users_unique["friends_count"]
plt.scatter(x, y, c=z, alpha=0.4, s=200, cmap=plt.cm.Accent)
plt.yscale("symlog")
plt.xscale("symlog")
plt.ylim(y.min(),y.max())
plt.xlim(x.min(),x.max())
plt.grid()
plt.xlabel("followers_count")
plt.ylabel("statuses_count")
plt.colorbar()
plt.show()