
Developing scalable PHP applications using MongoDB - PHP Classes blog


Author: Cesar D. Rodas

Posted on:

Categories: PHP Tutorials, PHP Performance

A new kind of database has become very popular lately, especially for Web development and in the PHP world: the NoSQL database.

This article focuses specifically on MongoDB, although several other NoSQL database implementations exist.




Contents

What is NoSQL?

What is a document-oriented database?

MongoDB

Installation

Basic usage

Index support

Real world applications

Storing files in MongoDB

Map-Reduce

Auto-sharding

Conclusion and future work


What is NoSQL?

NoSQL is a kind of database that, unlike most relational databases, does not provide a SQL interface to manipulate data. NoSQL databases usually organize data in structures other than tables.

NoSQL databases are usually divided into three categories: column-oriented, key-value, and document-oriented databases. This article focuses on document-oriented databases, as they seem to be the best solution for many Web sites.

SQL-based relational databases do not scale well when they are distributed over multiple cluster nodes. Partitioning the data is not easy to implement when applications rely on join queries and transactions.

NoSQL databases are not new. In fact, key-value databases existed before relational databases became popular.

What is a document-oriented database?

For document-oriented databases, a document is a data structure that has a variable number of properties. Each property has a value that can be a scalar (number, string, etc.) or a compound value (an array or an object).

You can think of a document as an object or an associative array, as in PHP. To make this concept clearer, here is the definition of a "person" document:

$person = array(

    "name" => "Cesar Rodas",
    "country" => "Paraguay",
    "languages" => array("Spanish", "English", "Guarani"),
);

Documents do not have a predefined schema like relational database tables, so each document can have a different set of properties. Documents are grouped in collections. From now on, the term collection is used to distinguish them from the tables that relational databases use to store records with a fixed number of fields.

Another important characteristic of documents is that they can have sub-documents. Sub-documents are used in place of the parent-child tables of relational databases.
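
To make the idea of sub-documents more concrete, here is a minimal sketch in plain PHP, with no MongoDB connection involved. The helper name get_path is hypothetical and not part of any MongoDB API; it only walks a nested array using the same dot-separated paths that this article uses later in filters and updates.

```php
<?php
// Hypothetical helper: walk a nested array using a dot-separated path.
// Returns null when any part of the path is missing.
function get_path(array $doc, $path)
{
    $current = $doc;
    foreach (explode('.', $path) as $key) {
        if (!is_array($current) || !array_key_exists($key, $current)) {
            return null; // the path does not exist in this document
        }
        $current = $current[$key];
    }
    return $current;
}

// A document with two address sub-documents.
$person = array(
    'name' => 'Cesar Rodas',
    'address' => array(
        array('country' => 'PY', 'zip' => '2160'),
        array('country' => 'PY', 'zip' => '2161'),
    ),
);

// Read the "zip" property of the second address sub-document.
echo get_path($person, 'address.1.zip'), "\n";
```

In a relational schema the addresses would normally live in a separate child table; here they travel with the person document itself.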

MongoDB

MongoDB is a very interesting document-oriented database implementation for several reasons:
  • It uses JSON, instead of XML
  • It is fast, as it is written in C++
  • Supports index definitions
  • Provides an easy to use query interface, very similar to some database abstraction layers
  • Supports operations with sub-documents
  • Provides a native PHP extension
  • Supports auto-sharding
  • Supports map-reduce for data transformation

Installation

Installing the MongoDB server is very easy, but describing the process is outside the scope of this article. You can grab the latest version of the source code and build the MongoDB server following the instructions in the README file. You can also download and install prepackaged binaries for your platform.

The PHP MongoDB client extension can be installed using the pecl command:

pecl install mongo

Alternatively, you can grab the source code of the PHP MongoDB extension and build it yourself:

phpize
./configure --enable-mongo
make install

Basic usage

Connecting to a database

As explained above, MongoDB manages documents, which for PHP developers are like simple associative arrays. This means that all operations with MongoDB are defined using arrays, even for queries. 

The code to establish a database connection is very similar to the code for connecting to other types of databases:

Connecting to database server at localhost port 27017:

$connection = new Mongo();

Connecting to a remote host with optional custom port:

$connection = new Mongo( "192.168.2.1" );

$connection = new Mongo( "192.168.2.1:65432" );

Once the connection to the MongoDB server is established, it is necessary to select a database to work with. If the database does not exist yet, it is created. Currently there are two ways to do this:

$db = $connection->selectDB('dbname');


$db = $connection->dbname;

Then it is necessary to select a collection to work with, like we would pick a table to work with when using relational databases.

$collection = $db->selectCollection('people');

or simply

$collection = $db->people;

Inserting new documents

The collection object is used to perform the basic operations that manipulate its documents. For instance, if you want to store information about a person, you can use code like this:

$person = array(
    'name' => 'Cesar Rodas',
    'email' => 'crodas@php.net',
    'address' => array(
        array(
            'country' => 'PY',
            'zip' => '2160',
            'address1' => 'foo bar'
        ),
        array(
            'country' => 'PY',
            'zip' => '2161',
            'address1' => 'foo bar bar foo'
        ),
    ),
    'sessions' => 0,
);

$safe_insert = true;
$collection->insert($person, $safe_insert);
$person_identifier = $person['_id'];

As you may have noted, the $safe_insert parameter is passed to the insert function. It is meant to make the MongoDB client library wait for the request to finish, so it is possible to determine whether it succeeded or not.

If anything goes wrong, an exception is thrown. If the safe insert parameter is not used, it is the same as setting it to false. In this case, the insert call returns immediately, which is fast but you do not know immediately if the operation succeeded. Anyway, this possibility may be useful when a lot of records need to be inserted.


Also note that the $person array is passed by reference, so the MongoDB client can set its _id entry to the identifier of the newly created document.

Updating existing documents

Updating documents is a bit more complicated and tends to be confusing in the beginning. If you submit a regular document, it replaces the whole stored document. To change only some properties, MongoDB supports special properties that work as modifier operations.

Suppose you want to update a person just to change some properties: increment the sessions property, add the property address2 to the first address, and delete the second address. It can be done as shown below.

First, it is necessary to define a filter to tell MongoDB to update just a specific document.


$filter = array('email' => 'crodas@php.net');

$new_document = array(
    /* increment the sessions counter */
    '$inc' => array('sessions' => 1),
    /* add address2 to the first address */
    '$set' => array(
        'address.0.address2' => 'Some foobar street',
    ),
    /* delete the second address */
    '$unset' => array('address.1' => 1),
);

$options = array('multiple' => false);
$collection->update(
    $filter,
    $new_document,
    $options
);

MongoDB also supports multiple in-place updates, like relational databases do, which means that it can update all documents that match the given criteria. For that, it is necessary to set the option multiple to true.
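
To see what modifier operations do to a document, the following sketch emulates $inc and $set on a plain PHP array. It is an illustration only: the function names set_path, read_path and apply_modifiers are made up for this example, nothing here talks to MongoDB, and it does not claim to reproduce how the server implements these modifiers.

```php
<?php
// Illustration only: a tiny emulation of the $inc and $set modifiers.

// Write a value at a dot-separated path, creating missing levels.
function set_path(array &$doc, $path, $value)
{
    $ref = &$doc;
    foreach (explode('.', $path) as $key) {
        if (!is_array($ref)) {
            $ref = array(); // replace a missing or scalar level with an array
        }
        $ref = &$ref[$key];
    }
    $ref = $value;
}

// Read a value at a dot-separated path, or return $default when it is missing.
function read_path(array $doc, $path, $default = null)
{
    foreach (explode('.', $path) as $key) {
        if (!is_array($doc) || !array_key_exists($key, $doc)) {
            return $default;
        }
        $doc = $doc[$key];
    }
    return $doc;
}

// Return a copy of $doc with the $inc and $set modifiers applied.
function apply_modifiers(array $doc, array $update)
{
    $inc = isset($update['$inc']) ? $update['$inc'] : array();
    foreach ($inc as $path => $amount) {
        set_path($doc, $path, read_path($doc, $path, 0) + $amount);
    }
    $set = isset($update['$set']) ? $update['$set'] : array();
    foreach ($set as $path => $value) {
        set_path($doc, $path, $value);
    }
    return $doc;
}

$person = array(
    'sessions' => 0,
    'address' => array(array('country' => 'PY')),
);
$updated = apply_modifiers($person, array(
    '$inc' => array('sessions' => 1),
    '$set' => array('address.0.address2' => 'Some foobar street'),
));
var_dump($updated);
```

The point to notice is that only the named properties change; everything else in the document, such as the country of the first address, is left untouched.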

Retrieving documents

Retrieving one or more documents that match a given criteria requires defining a condition filter using query selectors, as you may see in the following examples:

  1. Fetch people by e-mail address:

    $filter = array('email' => 'crodas@php.net');
    $cursor = $collection->find($filter);
    foreach ($cursor as $user) {
        var_dump($user);
    }


  2. Fetch people who have more than ten sessions:

    $filter = array('sessions' => array('$gt' => 10));
    $cursor = $collection->find($filter);


  3. Fetch people that do not have the sessions property set:

    $filter = array(
        'sessions' => array('$exists' => false)
    );
    $cursor = $collection->find($filter);


  4. Fetch people that have an address in Paraguay and have more than 15 sessions:

    $filter = array(
        'address.country' => 'PY',
        'sessions' => array('$gt' => 15)
    );
    $cursor = $collection->find($filter);
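
As a rough mental model of how such filters are read, the sketch below evaluates a small subset of query selectors against plain PHP arrays. It is an assumption-laden illustration: the function matches is hypothetical, it handles only plain equality, $gt and $exists on top-level properties, and it says nothing about how MongoDB evaluates queries internally.

```php
<?php
// Illustration only: check whether a document satisfies a filter that uses
// plain equality, $gt or $exists on top-level properties.
function matches(array $doc, array $filter)
{
    foreach ($filter as $field => $condition) {
        $exists = array_key_exists($field, $doc);

        if (!is_array($condition)) {
            // Plain equality, e.g. array('email' => 'crodas@php.net')
            if (!$exists || $doc[$field] !== $condition) {
                return false;
            }
            continue;
        }

        foreach ($condition as $operator => $operand) {
            if ($operator === '$exists') {
                if ($exists !== (bool) $operand) {
                    return false;
                }
            } elseif ($operator === '$gt') {
                if (!$exists || !($doc[$field] > $operand)) {
                    return false;
                }
            } else {
                throw new InvalidArgumentException("Unsupported operator: $operator");
            }
        }
    }
    return true; // every condition of the filter was satisfied
}

$people = array(
    array('email' => 'a@example.com', 'sessions' => 12),
    array('email' => 'b@example.com', 'sessions' => 3),
    array('email' => 'c@example.com'),
);

// Keep only the people with more than ten sessions.
$busy = array_filter($people, function ($person) {
    return matches($person, array('sessions' => array('$gt' => 10)));
});
var_dump($busy);
```

All conditions in a filter must hold at the same time, which is why a filter with two properties behaves like an AND in SQL.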

One important detail worth mentioning is that queries are not executed until the result is actually requested. In the first example the query is executed when the foreach loop starts.

This is a nice feature because it allows adding options to the cursor object used to retrieve the results, right after defining the query but before executing it. For instance, you can retrieve the number of matching documents, or set options to paginate the result set.

$total = $cursor->count();

$cursor->limit(20)->skip(40);
foreach ($cursor as $user) {
    // process each document of the requested page
}
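
The skip and limit values used for pagination are usually derived from a page number. Here is a minimal sketch of that arithmetic; the helper name page_window and the default page size of 20 are assumptions made for this example.

```php
<?php
// Hypothetical helper: translate a 1-based page number into the
// skip and limit values to apply to a cursor.
function page_window($page, $per_page = 20)
{
    $page = max(1, (int) $page);         // treat invalid pages as page 1
    $per_page = max(1, (int) $per_page); // never request zero documents
    return array(
        'skip' => ($page - 1) * $per_page,
        'limit' => $per_page,
    );
}

$window = page_window(3); // third page, 20 documents per page
var_dump($window);

// With a cursor returned by find(), the values would be applied as:
// $cursor->skip($window['skip'])->limit($window['limit']);
```

Clamping the page number protects against negative or non-numeric values coming from a request parameter.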

Aggregation of retrieved documents

MongoDB also supports aggregation of results, like relational databases. You can use aggregation operators like count, distinct and group.

Aggregation queries return arrays, rather than whole document objects.

Grouping lets you define functions written in JavaScript that run on the MongoDB server and operate on the properties of each group. This is more flexible than SQL because you can perform many types of operations on the grouped values, but it makes simple grouping operations like SUM() or AVG() a bit harder to write.

Here is an example of how to retrieve the countries of a list of addresses and the number of times a country appears in the matching addresses.

$countries = $collection->distinct("address.country");

$result = $collection->group(
    /* keys to group by */
    array("address.country" => true),

    /* initial value */
    array("sum" => 0),

    /* JavaScript code to reduce */
    "function (obj, prev) { prev.sum += 1; }",

    /* filter condition; recent versions of the driver expect it under the "condition" option */
    array("condition" => array("sessions" => array('$gt' => 10)))
);

Deleting documents

Deleting documents is very similar to retrieving or updating them.

$filter = array('field' => 'foo');
$collection->remove($filter);

Be careful. By default, all documents that match the given criteria are deleted. If you just want to delete the first document that matches the criteria, pass true as the second parameter of the remove function.

Index support

A very important feature that might influence your decision to choose MongoDB over other document-oriented databases is its support for indexes, which are very similar to relational database table indexes. Not all document-oriented databases provide built-in index support.

With MongoDB you can create indexes to avoid scanning every document during searches, just as relational databases use indexes to avoid full table scans. This speeds up queries whose conditions involve indexed properties.

For instance, if you want to have a unique index on the e-mail address property, it can be defined like this:


$collection->ensureIndex(
 array('email' => 1),
 array('unique' => true, 'background' => true)
);

The first array parameter describes all the properties that should be part of the index. It may be just one property or more properties.    

By default, index creation is a synchronous operation, but it may be a good idea to create the indexes in the background if the collection holds a large number of documents. This is done as demonstrated in the example above.

Indexes with just one property may not be useful enough. Here is an example of how to accelerate the fourth query example above by defining an index on two properties.


$collection->ensureIndex( 
 array('address.country' => 1, 'sessions' => 1),
 array('background' => true)
);

The value assigned to each index property defines the index order: 1 is for ascending order and -1 for descending order.

The order in indexes is useful to optimize queries that have sorting options, like in the following example:


$filter = array(
    'address.country' => 'PY',
);
$cursor = $collection->find($filter)->sort(
    array('sessions' => -1)
);


$collection->ensureIndex(
    array('address.country' => 1, 'sessions' => -1),
    array('background' => true)
);
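
To picture what the order values of a compound index mean, the sketch below sorts a few plain PHP arrays by country ascending (1) and then by sessions descending (-1). It is only an analogy with made-up data and a hypothetical function name, compare_entries; a real index is maintained by the server and is not a sorted PHP array.

```php
<?php
// Illustration only: order entries by country ascending (1) and,
// within the same country, by sessions descending (-1).
function compare_entries(array $a, array $b)
{
    $by_country = strcmp($a['country'], $b['country']); // ascending
    if ($by_country !== 0) {
        return $by_country;
    }
    if ($a['sessions'] == $b['sessions']) {
        return 0;
    }
    return ($a['sessions'] > $b['sessions']) ? -1 : 1; // descending
}

$entries = array(
    array('country' => 'PY', 'sessions' => 3),
    array('country' => 'AR', 'sessions' => 7),
    array('country' => 'PY', 'sessions' => 12),
);
usort($entries, 'compare_entries');
var_dump($entries);
```

A query that filters by country and sorts by sessions in descending order can then read its results in the order in which the entries are already stored.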

Real world applications

Some developers may be afraid to try a new type of database because it works differently from the ones they have worked with before.

Learning new things in theory is different from learning how to use them in practice. Therefore, this section explains how to develop real world applications with MongoDB in comparison with a SQL-based relational database like MySQL, so you become familiar with the differences between the two approaches.

For instance, let's say you want to build a blog system with users, posts and comments. With a relational database, you would implement it with a table schema like this:

Schema of tables of a blog posting database

The equivalent document definitions that represent the same structures in MongoDB would be:

$users = array(
    'username' => 'crodas',
    'name' => 'Cesar Rodas',
);


$posts = array(
    'uri' => '/foo-bar-post',
    /* _id is set by the client when the user document is inserted */
    'author_id' => $users['_id'],
    'title' => 'Foo bar post',
    'summary' => 'This is a summary text',
    'body' => 'This is the body',
    'comments' => array(
        array(
            'name' => 'user',
            'email' => 'foo@bar.com',
            'comment' => 'nice post'
        )
    )
);

As you may notice, we only need one document to represent both the posts and comments. That is because comments are sub-documents of post documents.

This makes the implementation much simpler. It also saves database queries when you want to access a post together with its comments.

To make it even more compact, the details of the users, such as the author of the post, can also be copied into the post document, so you can retrieve the post, its comments and the user details with a single query.

$posts = array(
    'uri' => '/foo-bar-post',
    'author_id' => $users['_id'],
    'author_name' => 'Cesar Rodas',
    'author_username' => 'crodas',
    'title' => 'Foo bar post',
    'summary' => 'This is a summary text',
    'body' => 'This is the body',
    'comments' => array(
        array(
            'name' => 'user',
            'email' => 'foo@bar.com',
            'comment' => 'nice post'
        ),
    )
);

This means that some information is duplicated, but keep in mind that disk space is much cheaper than CPU and RAM, and that the time and patience of your site visitors matter even more.

If you are concerned with keeping the duplicated information in sync, you can solve that problem by executing this update query when the author updates their profile:


$filter = array(
    'author_id' => $author['_id'],
);

$data = array(
    '$set' => array(
        'author_name' => 'Cesar D. Rodas',
        'author_username' => 'cesar',
    )
);


$collection->update($filter, $data, array(
    'multiple' => true
));

Given these optimizations of our data model, let's rewrite some SQL queries as equivalent MongoDB queries.

SELECT * FROM posts
INNER JOIN users ON users.id = posts.user_id
WHERE posts.url = :url;

SELECT * FROM comments WHERE post_id = :post_id;


First, add this index just once:

$collection->ensureIndex(
    array('uri' => 1),
    array('unique' => true, 'background' => true)
);


$collection->find(array('uri' => '<uri>'));


INSERT INTO comments(post_id, name, email, contents)
VALUES(:post_id, :name, :email, :comment);


$comment = array(
    'name' => $_POST['name'],
    'email' => $_POST['email'],
    'comment' => $_POST['comment'],
);


$filter = array(
    'uri' => $_POST['uri'],
);


$collection->update($filter, array(
    '$push' => array('comments' => $comment)
));
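
Since the comment is built directly from request input, it may be worth validating the values before they reach the update call. The following is a minimal sketch under stated assumptions: the helper name build_comment_update and its return format are made up for this example, and the validation rules are deliberately simple. Requiring every value to be a string also rejects request parameters that arrive as arrays, which could otherwise be interpreted as query operators inside the filter.

```php
<?php
// Hypothetical helper: validate comment input and build the filter and the
// $push update document. Returns null when the input is not acceptable.
function build_comment_update(array $input)
{
    foreach (array('uri', 'name', 'email', 'comment') as $field) {
        if (!isset($input[$field])
            || !is_string($input[$field])
            || trim($input[$field]) === ''
        ) {
            return null; // missing, empty, or not a plain string
        }
    }
    if (filter_var($input['email'], FILTER_VALIDATE_EMAIL) === false) {
        return null; // not a well-formed e-mail address
    }

    $comment = array(
        'name' => trim($input['name']),
        'email' => $input['email'],
        'comment' => trim($input['comment']),
    );

    return array(
        'filter' => array('uri' => $input['uri']),
        'update' => array('$push' => array('comments' => $comment)),
    );
}

$request = array(
    'uri' => '/foo-bar-post',
    'name' => 'user',
    'email' => 'foo@bar.com',
    'comment' => 'nice post',
);
$parts = build_comment_update($request);
var_dump($parts);

// With a collection object, the result would be used as:
// $collection->update($parts['filter'], $parts['update']);
```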


SELECT * FROM posts WHERE id IN (
SELECT DISTINCT post_id FROM comments WHERE email = :email
);

First, add this index just once:

$collection->ensureIndex(
array('comments.email' => 1),
array('background' => 1)
);

$collection->find( array('comments.email' => $email) );

Storing files in MongoDB

MongoDB also provides features that go beyond basic database operations. For instance, it provides a good solution to store small and large files in the database.

Files are automatically split into chunks. If MongoDB runs in an auto-sharded environment, the file chunks are also distributed across multiple servers.


Storing files efficiently is a surprisingly difficult problem, especially when you need to manage a large number of files. Saving files in a local filesystem is often not a good solution.

One example of that difficulty is the problem YouTube had to efficiently serve thumbnails of millions of videos, or the hacks Facebook performed to efficiently serve billions of photos.

MongoDB solves this problem by creating two internal collections: the files collection keeps the metadata of the files, and the chunks collection keeps the file chunks.

If you want to store a large video file, you would use code like this:

$metadata = array(
    "filename" => "path.avi",
    "downloads" => 0,
    "comment" => "This file is foo bar",
    "permissions" => array(
        "crodas" => "write",
        "everybody" => "read",
    )
);
$grid = $db->getGridFS();
$grid->storeFile("/file/to/path.avi", $metadata);

As you see, it is very simple and easy to understand.
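
To picture how a file ends up as chunk documents, here is a small sketch in plain PHP. The function split_into_chunks and the identifier 'file-1' are hypothetical, and the 4-byte chunk size is chosen only to keep the example readable; real chunk sizes are far larger. The keys files_id, n and data follow the layout that GridFS uses for chunk documents.

```php
<?php
// Illustration only: split file contents into numbered chunk documents.
function split_into_chunks($file_id, $contents, $chunk_size)
{
    $chunks = array();
    foreach (str_split($contents, $chunk_size) as $n => $data) {
        $chunks[] = array(
            'files_id' => $file_id, // links the chunk to its file document
            'n' => $n,              // position of the chunk within the file
            'data' => $data,        // the bytes of this chunk
        );
    }
    return $chunks;
}

$chunks = split_into_chunks('file-1', 'abcdefghij', 4);
var_dump($chunks);

// Reading the file back means concatenating the chunks in order of "n".
$restored = '';
foreach ($chunks as $chunk) {
    $restored .= $chunk['data'];
}
```

Because every chunk is an ordinary document, chunks can be stored and fetched independently of each other.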

Map-Reduce

Map-Reduce is a kind of operation meant to process large sets of information. The map operation applies a function to every document and produces a new set of key-value pairs. The reduce operation takes the results of the map function and applies another function to return a single result per key.

MongoDB Map-Reduce functions can be applied to a collection for data transformation, in a way very similar to Hadoop.

When the map process is finished, the results are sorted and grouped by key. For every key, the reduce function is called with two parameters: the key and an array with all the values emitted for it.

To understand better how this works, let's suppose that the blog post documents defined above can each have a list of tags. If you need statistics about the tags, you just need to count them like this.

First define the code for the map and reduce functions.

$map = new MongoCode("function () {
    var i;

    for (i = 0; i < this.tags.length; i++) {
        /* Emit every tag with a document with {count: 1} */
        emit(this.tags[i], {count: 1});
    }
}");


$reduce = new MongoCode("function (key, values) {
    var i, total = 0;

    for (i = 0; i < values.length; i++) {
        /* Add up the partial counts emitted for this tag */
        total += values[i].count;
    }
    return {count: total};
}");

Then execute the map-reduce command:

$map_reduce = array(
    /* process the posts collection; the command name must be the first key */
    'mapreduce' => 'posts',
    'map' => $map,
    'reduce' => $reduce,

    /* create a new collection */
    'out' => 'tags_info',

    /* return the debug info */
    'verbose' => true,
);

$information = $db->command($map_reduce);
var_dump($information);
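
To see what the map and reduce functions compute, the same flow can be emulated in plain PHP on a handful of arrays. This is only a sketch for illustration: the function names map_post, reduce_counts and map_reduce are made up, the sample posts are invented, and nothing here runs on the MongoDB server.

```php
<?php
// Illustration only: emulate the map, group-by-key and reduce steps.

// Map step: emit one (tag, {count: 1}) pair per tag of a post.
function map_post(array $post)
{
    $emitted = array();
    $tags = isset($post['tags']) ? $post['tags'] : array();
    foreach ($tags as $tag) {
        $emitted[] = array($tag, array('count' => 1));
    }
    return $emitted;
}

// Reduce step: add up all the counts emitted for one key.
function reduce_counts($key, array $values)
{
    $total = 0;
    foreach ($values as $value) {
        $total += $value['count'];
    }
    return array('count' => $total);
}

function map_reduce(array $posts)
{
    // Group the emitted values by key.
    $grouped = array();
    foreach ($posts as $post) {
        foreach (map_post($post) as $pair) {
            list($key, $value) = $pair;
            $grouped[$key][] = $value;
        }
    }
    ksort($grouped); // results are sorted by key before reducing

    $result = array();
    foreach ($grouped as $key => $values) {
        $result[$key] = reduce_counts($key, $values);
    }
    return $result;
}

$posts = array(
    array('title' => 'A', 'tags' => array('php', 'mongodb')),
    array('title' => 'B', 'tags' => array('php')),
    array('title' => 'C'), // a post without tags emits nothing
);
$tags_info = map_reduce($posts);
var_dump($tags_info);
```

The reduce step returns a value with the same shape as the values it receives, which is what allows partial results to be reduced again.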

If MongoDB runs in a sharded environment, the data processing functions will run in parallel on all shards.

Keep in mind that map-reduce processing is often slow. Its purpose is to distribute the processing of large data sets among multiple servers. So, if you have many servers, you can split the processing and achieve the result in less time than if it were done by a single server.

In any case, it is recommended to run map-reduce processes in the background, as they often take a long time to finish. This is a perfect case for an asynchronous job, managed for instance by Gearman.

Auto-sharding

Sharding was mentioned several times above, but you may not be familiar with the concept.

Data sharding is a database technique meant to distribute the data across multiple servers.

MongoDB supports auto-sharding with minimal configuration. However, installing and configuring a shard is outside the scope of this article.

The following diagram represents how MongoDB works in a sharded environment, so you have an idea of what happens when you use sharding.

Diagram of the sharding architecture used by MongoDB
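
As a simplified picture of what distributing data across servers means, the sketch below routes a document to a shard by comparing a key with range boundaries. Everything in it is an assumption made for illustration: the function pick_shard, the shard names and the boundaries are invented, and the way MongoDB actually splits and routes data is more involved than this.

```php
<?php
// Illustration only: pick a shard by comparing a key with the upper
// bound of each range. A null upper bound means "everything else".
function pick_shard($key, array $ranges)
{
    foreach ($ranges as $shard => $upper_bound) {
        if ($upper_bound === null || strcmp($key, $upper_bound) < 0) {
            return $shard;
        }
    }
    return null; // no range covers this key
}

$ranges = array(
    'shard-a' => 'h',  // keys before "h"
    'shard-b' => 'q',  // keys from "h" up to, but not including, "q"
    'shard-c' => null, // all remaining keys
);

foreach (array('crodas@php.net', 'maria@example.com', 'zoe@example.com') as $email) {
    echo $email, ' => ', pick_shard($email, $ranges), "\n";
}
```

The application does not need to know which server holds a document; it only needs the key that the ranges are defined on.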

Conclusion and future work

This article introduced a new type of database that probably will change the way Web development is done.

Currently, I am working on an ActiveRecord framework for MongoDB that will make it easier to deal with MongoDB objects. It will be published on the PHPClasses site soon. I am also working on a stream wrapper to make it easier to store and retrieve files from MongoDB, as if they were regular files.

I also try to help the MongoDB project as much as I can in other ways in my free time. Currently I am helping to translate the documentation to Spanish.

If you would also like to help the MongoDB project, you can do so by going to the contributors page.

Feel free to share your opinion or ask any questions by posting a comment to this article.

