[MongoDB] MongoDB Aggregation

Big Data Tech 2014. 11. 24. 01:13

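In this post I will explain aggregation in MongoDB. As with the CRUD post, I will not list how to use every single function. I don't think that is necessary (I could hardly do better than the MongoDB Reference), and it shouldn't be done anyway: you need to build the habit of studying on your own and practice teaching yourself.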

First, what is aggregation? The word usually means something like "combining" or "collecting". In other words, it means producing a result computed over the data as a whole. sum() for totals, avg() for averages, and so on are all kinds of aggregation. In MongoDB terms, it is the process of producing a computed result from the values of documents.
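As a minimal sketch of what "a computed result from the values of documents" means, the Python below sums and averages one field across a few documents. The documents and the `item` and `price` field names are made up for illustration, and no MongoDB server is involved.

```python
# Hypothetical documents; the field names are assumptions for illustration.
docs = [
    {"item": "a", "price": 100},
    {"item": "b", "price": 200},
    {"item": "c", "price": 300},
]

prices = [d["price"] for d in docs]
total = sum(prices)              # aggregate: sum of one field over all documents
average = total / len(prices)    # aggregate: average of the same field

print(total, average)
```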

In MongoDB, aggregation is usually done in one of the following three ways.

1. Aggregation Pipeline

2. Mapreduce

3. Single aggregation operation

Let's start with the aggregation pipeline. First, what is a pipeline? Anyone familiar with Linux commands will find the concept easy: the output of one task is passed on as the input of the next. Let's look at a more familiar Linux command.

bash> history 100 | grep mongo

The command above consists of history 100 and grep mongo. The first command outputs the 100 most recently executed commands, and grep mongo then runs on that output. In other words, the command searches the output of history 100 for the string mongo.
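The same idea can be sketched in Python, where each step takes the previous step's output as its input. The command list and both function names here are hypothetical stand-ins for the shell's `history` and `grep`, not real APIs.

```python
# Hypothetical stand-in for the shell's command history.
commands = [
    "ls -al",
    "mongo --port 27017",
    "cd /tmp",
    "mongod --dbpath /data/db",
]

def history(cmds, n):
    """Return the most recent n commands (in the spirit of `history n`)."""
    return cmds[-n:]

def grep(lines, pattern):
    """Keep only the lines that contain pattern (in the spirit of `grep pattern`)."""
    return [line for line in lines if pattern in line]

# Roughly: history 100 | grep mongo
found = grep(history(commands, 100), "mongo")
print(found)
```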

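The aggregation pipeline in MongoDB works the same way. For example, look at the command below.

Don't think yet about what the commands on lines 3 and 4 mean. For now, just note that the output of line 3 is passed as the input to line 4, and the output of line 4 is returned (since there is no next command).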

Briefly, the command above does the following.

Line 3: selects only the documents in the orders collection whose status is A.

Line 4: groups the output of line 3 by cust_id, then sums the price for each cust_id.

The result of line 4 is returned as a cursor. (What a cursor is was explained in the earlier post on MongoDB CRUD.)
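The two steps described above can be sketched as follows. This is not the original command from the post; it is a small in-memory emulation under stated assumptions. The sample documents, the output field name `total`, and the helper functions `match_stage` and `group_sum_stage` are hypothetical. The `pipeline` list has the shape that pymongo's `collection.aggregate()` accepts, but nothing here talks to a server.

```python
# Hypothetical sample data standing in for the `orders` collection.
orders = [
    {"cust_id": "A123", "price": 500, "status": "A"},
    {"cust_id": "A123", "price": 250, "status": "A"},
    {"cust_id": "B212", "price": 200, "status": "A"},
    {"cust_id": "A123", "price": 300, "status": "D"},
]

# Pipeline in the shape pymongo's collection.aggregate() accepts.
# The output field name "total" is an assumption.
pipeline = [
    {"$match": {"status": "A"}},
    {"$group": {"_id": "$cust_id", "total": {"$sum": "$price"}}},
]

def match_stage(docs, query):
    """Emulate $match for simple equality conditions only."""
    return [d for d in docs if all(d.get(k) == v for k, v in query.items())]

def group_sum_stage(docs, key_field, sum_field, out_field):
    """Emulate $group with a single $sum accumulator."""
    totals = {}
    for d in docs:
        totals[d[key_field]] = totals.get(d[key_field], 0) + d[sum_field]
    return [{"_id": key, out_field: value} for key, value in totals.items()]

# The output of the first stage is the input of the second stage.
matched = match_stage(orders, pipeline[0]["$match"])
result = group_sum_stage(matched, "cust_id", "price", "total")
print(result)
```

Against a real server, the same `pipeline` list would be passed as `db.orders.aggregate(pipeline)`, which returns a cursor rather than a list.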

Operations such as $match and $group are called aggregation pipeline operators. The pipeline operators, including these two, are listed below.

Name Description

$group Groups input documents by a specified identifier expression and applies the accumulator expression(s), if specified, to each group. Consumes all input documents and outputs one document per each distinct group. The output documents only contain the identifier field and, if specified, accumulated fields.

$limit Passes the first n documents unmodified to the pipeline where n is the specified limit. For each input document, outputs either one document (for the first n documents) or zero documents (after the first n documents).

$match Filters the document stream to allow only matching documents to pass unmodified into the next pipeline stage. $match uses standard MongoDB queries. For each input document, outputs either one document (a match) or zero documents (no match).

$out Writes the resulting documents of the aggregation pipeline to a collection. To use the $out stage, it must be the last stage in the pipeline.

$skip Skips the first n documents where n is the specified skip number and passes the remaining documents unmodified to the pipeline. For each input document, outputs either zero documents (for the first n documents) or one document (if after the first n documents).

$sort Reorders the document stream by a specified sort key. Only the order changes; the documents remain unmodified. For each input document, outputs one document.

$redact Reshapes each document in the stream by restricting the content for each document based on information stored in the documents themselves. Incorporates the functionality of $project and $match. Can be used to implement field level redaction. For each input document, outputs either one or zero document.

$geoNear Returns an ordered stream of documents based on the proximity to a geospatial point. Incorporates the functionality of $match, $sort, and $limit for geospatial data. The output documents include an additional distance field and can include a location identifier field.

$project Reshapes each document in the stream, such as by adding new fields or removing existing fields. For each input document, outputs one document.

$unwind Deconstructs an array field from the input documents to output a document for each element. Each output document replaces the array with an element value. For each input document, outputs n documents where n is the number of array elements and can be zero for an empty array.

The list above is copied verbatim from the MongoDB Manual. Once again, I will not explain how to use each operator one by one. Let's build the habit of studying on our own.
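To make a few of these descriptions concrete, here is a rough in-memory emulation of $unwind, $sort, $skip, and $limit. The documents, field names, and helper functions are all hypothetical, and each helper covers only the simplest form of its stage.

```python
# Hypothetical documents; field names are assumptions for illustration.
docs = [
    {"_id": 1, "tags": ["x", "y"], "score": 30},
    {"_id": 2, "tags": [], "score": 10},
    {"_id": 3, "tags": ["z"], "score": 20},
]

def unwind(docs, field):
    """Emulate $unwind: one output document per array element (none for an empty array)."""
    return [{**d, field: element} for d in docs for element in d[field]]

def sort_by(docs, field, direction=1):
    """Emulate $sort on a single key (1 ascending, -1 descending)."""
    return sorted(docs, key=lambda d: d[field], reverse=(direction == -1))

def skip(docs, n):
    """Emulate $skip: drop the first n documents."""
    return docs[n:]

def limit(docs, n):
    """Emulate $limit: pass only the first n documents."""
    return docs[:n]

# Roughly: [{"$unwind": "$tags"}]
unwound = unwind(docs, "tags")

# Roughly: [{"$sort": {"score": -1}}, {"$skip": 1}, {"$limit": 1}]
second_highest = limit(skip(sort_by(docs, "score", -1), 1), 1)

print(unwound)
print(second_highest)
```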

After the aggregation pipeline, let's look at mapreduce. First we need to understand the concept. A mapreduce job generally has a mapper function and a reducer function. Let's understand both through the example below.

The figure below shows what happens when mapreduce is used to classify the dataset on the far left by status and compute the sum for each status.

In the figure above, the mapper function groups the data by status A, B, and C. Classifying by a particular value like this is the mapper function's job. After the mapper function, status A holds 100 and 400, B holds 200 and 500, and C holds only 300. From this state, the reducer function computes the sum for each status.

A: 100 + 400 = 500

B: 200 + 500 = 700

C: 300 = 300

Let's actually run mapreduce on the data above.

Running it through Python gives the result shown above.
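The mapper and reducer steps can also be sketched in plain Python, without a MongoDB server. This is an illustration of the concept, not the code that was run in this post. The field name `value`, the order of the documents, and the function names are assumptions; only the status and number pairs come from the example above.

```python
from collections import defaultdict

# Dataset from the example; the field name "value" and the document order are assumptions.
dataset = [
    {"status": "A", "value": 100},
    {"status": "B", "value": 200},
    {"status": "C", "value": 300},
    {"status": "A", "value": 400},
    {"status": "B", "value": 500},
]

def mapper(doc):
    """Emit a (key, value) pair that classifies the document by its status."""
    return doc["status"], doc["value"]

def reducer(key, values):
    """Combine all values emitted for one key into a single result."""
    return sum(values)

# Collect the emitted values under their keys: A -> [100, 400], B -> [200, 500], C -> [300]
grouped = defaultdict(list)
for doc in dataset:
    key, value = mapper(doc)
    grouped[key].append(value)

# Reduce each group to its sum.
result = {key: reducer(key, values) for key, values in grouped.items()}
print(result)
```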

If the explanation of mapreduce above is not enough, watch the videos below.

http://www.youtube.com/watch?v=bcjSe0xCHbE

http://www.youtube.com/watch?v=qPATE6BBIQs

The videos are in English, but they are visual enough to follow.

Note that mapreduce in MongoDB is said to perform somewhat slower than mapreduce in other NoSQL systems, so keep that in mind.

Single aggregation operations come down to three functions (the three introduced in the MongoDB Manual):

count(), distinct(), and group().

As their names suggest, count() counts documents and distinct() returns the distinct values of a field. See the example below.

count() returns a plain integer, and distinct() returns a single array.
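The behavior of the two functions can be sketched in plain Python as below. The data reuses the status example from the mapreduce section with an assumed field name `value`. The helpers `count` and `distinct` are hypothetical emulations, not pymongo calls, and `count` handles only simple equality conditions.

```python
# Data from the mapreduce example; the field name "value" is an assumption.
docs = [
    {"status": "A", "value": 100},
    {"status": "B", "value": 200},
    {"status": "C", "value": 300},
    {"status": "A", "value": 400},
    {"status": "B", "value": 500},
]

def count(docs, query):
    """Emulate count(): number of documents matching simple equality conditions."""
    return sum(1 for d in docs if all(d.get(k) == v for k, v in query.items()))

def distinct(docs, field):
    """Emulate distinct(): the unique values of one field, as a list."""
    seen = []
    for d in docs:
        if d[field] not in seen:
            seen.append(d[field])
    return seen

n_status_a = count(docs, {"status": "A"})   # an integer
statuses = distinct(docs, "status")         # a list of distinct values
print(n_status_a, statuses)
```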

Among the single aggregation operations, group() is a little more complicated to use. Still, anyone who has understood mapreduce will find it easy to follow.

It can produce the same result as counting status: "A" with count().
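As a rough sketch of how a group operation with a key, an initial value, and a reduce function fits together, the Python below counts documents per status. The helper `group` and its parameter names are a hypothetical emulation loosely modeled on the key / initial / reduce shape of the shell's group function, and the data and the field name `value` are assumed as in the earlier examples.

```python
import copy

# Data from the mapreduce example; the field name "value" is an assumption.
docs = [
    {"status": "A", "value": 100},
    {"status": "B", "value": 200},
    {"status": "C", "value": 300},
    {"status": "A", "value": 400},
    {"status": "B", "value": 500},
]

def group(docs, key, initial, reduce_fn):
    """Emulate a group operation: one result document per distinct key value."""
    results = {}
    for doc in docs:
        k = doc[key]
        if k not in results:
            # Each group starts from its own copy of the initial value.
            results[k] = {key: k, **copy.deepcopy(initial)}
        reduce_fn(doc, results[k])   # update the running result in place
    return list(results.values())

def count_reduce(current, result):
    """Reduce step: add one to the group's counter for every document seen."""
    result["count"] += 1

grouped = group(docs, "status", {"count": 0}, count_reduce)
print(grouped)
```

The entry for status A has a count of 2, which is the same number that counting status: "A" gives.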

I think this covers the conceptual side of MongoDB aggregation reasonably well. In the next post, let's go over the concepts once more through a few real examples.

Reference :

http://docs.mongodb.org/manual/reference/operator/aggregation/
