[python] How do I import data from MongoDB into pandas?

I have a large amount of data in a MongoDB collection that I need to analyze. How do I import that data into pandas?

I am new to pandas and numpy.

EDIT: The MongoDB collection contains sensor values tagged with date and time. The sensor values are of float datatype.

Sample data:

{
"_cls" : "SensorReport",
"_id" : ObjectId("515a963b78f6a035d9fa531b"),
"_types" : [
    "SensorReport"
],
"Readings" : [
    {
        "a" : 0.958069536790466,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:26:35.297Z"),
        "b" : 6.296118156595,
        "_cls" : "Reading"
    },
    {
        "a" : 0.95574014778624,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:27:09.963Z"),
        "b" : 6.29651468650064,
        "_cls" : "Reading"
    },
    {
        "a" : 0.953648289182713,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:27:37.545Z"),
        "b" : 7.29679823731148,
        "_cls" : "Reading"
    },
    {
        "a" : 0.955931884300997,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:28:21.369Z"),
        "b" : 6.29642922525632,
        "_cls" : "Reading"
    },
    {
        "a" : 0.95821381,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:41:20.801Z"),
        "b" : 7.28956613,
        "_cls" : "Reading"
    },
    {
        "a" : 4.95821335,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:41:36.931Z"),
        "b" : 6.28956574,
        "_cls" : "Reading"
    },
    {
        "a" : 9.95821341,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:42:09.971Z"),
        "b" : 0.28956488,
        "_cls" : "Reading"
    },
    {
        "a" : 1.95667927,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:43:55.463Z"),
        "b" : 0.29115237,
        "_cls" : "Reading"
    }
],
"latestReportTime" : ISODate("2013-04-02T08:43:55.463Z"),
"sensorName" : "56847890-0",
"reportCount" : 8
}

Answer

pymongo can help here. The following is the code I use:

import pandas as pd
from pymongo import MongoClient


def _connect_mongo(host, port, username, password, db):
    """ A util for making a connection to mongo """

    if username and password:
        mongo_uri = 'mongodb://%s:%s@%s:%s/%s' % (username, password, host, port, db)
        conn = MongoClient(mongo_uri)
    else:
        conn = MongoClient(host, port)


    return conn[db]


def read_mongo(db, collection, query=None, host='localhost', port=27017, username=None, password=None, no_id=True):
    """ Read from Mongo and Store into DataFrame """

    # Connect to MongoDB
    db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)

    # Make a query to the specific DB and Collection
    # (None means "match every document"; this avoids a mutable default argument)
    cursor = db[collection].find(query or {})

    # Expand the cursor and construct the DataFrame
    df = pd.DataFrame(list(cursor))

    # Delete the _id column if present (an empty result has no columns)
    if no_id and '_id' in df.columns:
        del df['_id']

    return df
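
One detail the connection helper above leaves open: if the username or password contains characters such as '@', ':' or '/', the interpolated URI will not parse as intended. A minimal sketch of escaping the credentials first, using only the standard library. The function name build_mongo_uri and the sample credentials are made up for illustration:

```python
from urllib.parse import quote_plus


def build_mongo_uri(username, password, host, port, db):
    """Return a mongodb:// URI with the credentials percent-escaped."""
    # quote_plus escapes characters such as '@', ':' and '/' that would
    # otherwise be read as URI delimiters.
    return "mongodb://%s:%s@%s:%s/%s" % (
        quote_plus(username), quote_plus(password), host, port, db)


# Hypothetical credentials, chosen only to show the escaping.
uri = build_mongo_uri("sensor@lab", "p:ss/word", "localhost", 27017, "sensors")
print(uri)
```

The resulting string can be passed to MongoClient in place of the hand-built mongo_uri.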

Answer

You can load your MongoDB data into a pandas DataFrame using this code. It works for me; hopefully it works for you too.

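import pandas as pd
from pymongo import MongoClient

# Connect to the default host and port (localhost:27017)
client = MongoClient()
db = client.database_name
collection = db.collection_name
# Materialize every document in the collection into a DataFrame
data = pd.DataFrame(list(collection.find()))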

Answer

Monary does exactly that, and it is very fast. (another link)

See this cool post, which includes a quick tutorial and some timings.

Answer

As PEP 20 (the Zen of Python) puts it, simple is better than complex:

import pandas as pd
df = pd.DataFrame.from_records(db.<database_name>.<collection_name>.find())

You can include conditions as you would when working with a regular MongoDB database, or even use find_one() to fetch only one element from the database, and so on.

And voilà!
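
To make "include conditions" concrete, here is a small sketch of a filter that restricts documents to a time window. The field name latestReportTime comes from the sample document in the question; the helper time_range_query and the variable name collection are assumptions for illustration, and the pandas call is left as a comment because it needs a live database:

```python
from datetime import datetime, timezone


def time_range_query(field, start, end):
    """Build a MongoDB filter matching start <= field < end."""
    return {field: {"$gte": start, "$lt": end}}


query = time_range_query(
    "latestReportTime",
    datetime(2013, 4, 2, tzinfo=timezone.utc),
    datetime(2013, 4, 3, tzinfo=timezone.utc),
)
# A projection of {"_id": 0} asks the server to leave out the _id field.
projection = {"_id": 0}
# With a live pymongo collection (hypothetical name `collection`):
#   df = pd.DataFrame.from_records(collection.find(query, projection))
```

Filtering and projecting on the server keeps the amount of data pulled into memory down, which matters for a large collection.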

Answer

import pandas as pd
from odo import odo

data = odo('mongodb://localhost/db::collection', pd.DataFrame)

Answer

To deal with out-of-core (not fitting into RAM) data efficiently (i.e. with parallel execution), you can try the Python Blaze ecosystem (Blaze / Dask / Odo).

Blaze (and Odo) has built-in functions to deal with MongoDB.
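
For a rough sense of what processing data that does not fit in RAM involves, here is a plain-Python sketch that consumes an iterable in fixed-size batches. This is not Blaze, Dask or Odo, just the underlying idea; the helper iter_batches and the stand-in cursor are assumptions for illustration:

```python
from itertools import islice


def iter_batches(iterable, batch_size):
    """Yield lists of at most batch_size items until iterable is exhausted."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Stand-in for a pymongo cursor: any iterable of documents works.
fake_cursor = ({"sensorName": "s-%d" % i, "reportCount": i} for i in range(5))
batches = list(iter_batches(fake_cursor, 2))
batch_sizes = [len(b) for b in batches]
```

In real use you would not collect the batches in a list; each batch could be turned into a DataFrame with pd.DataFrame(batch), reduced to a summary, and discarded before the next one is fetched.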

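A couple of useful articles to start off:

An article showing what amazing things are possible with the Blaze stack: Analyzing 1.7 Billion Reddit Comments with Blaze and Impala (essentially, querying 975 GB of Reddit comments in seconds).

P.S. I'm not affiliated with any of these technologies.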

Answer

Another option I found very useful is:

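import pandas as pd

cursor = my_collection.find()
# pd.json_normalize is the top-level name since pandas 1.0;
# materialize the cursor so it receives a list of dicts.
df = pd.json_normalize(list(cursor))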

This way you get the unfolding of nested MongoDB documents for free.
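
Note that the sample document in the question keeps its values in a nested Readings array, so the rows of interest are the array entries rather than the top-level document. A hand-rolled sketch of turning one report into flat rows follows. The helper iter_reading_rows is hypothetical, and the document is a trimmed copy of the question's sample, with ISODate values written as the datetime objects pymongo would return:

```python
from datetime import datetime


def iter_reading_rows(report):
    """Yield one flat dict per entry in report["Readings"]."""
    for reading in report.get("Readings", []):
        yield {
            "sensorName": report.get("sensorName"),
            "ReadingUpdatedDate": reading.get("ReadingUpdatedDate"),
            "a": reading.get("a"),
            "b": reading.get("b"),
        }


# Trimmed copy of the question's sample document (first two readings only).
report = {
    "sensorName": "56847890-0",
    "Readings": [
        {"a": 0.958069536790466, "b": 6.296118156595,
         "ReadingUpdatedDate": datetime(2013, 4, 2, 8, 26, 35, 297000)},
        {"a": 0.95574014778624, "b": 6.29651468650064,
         "ReadingUpdatedDate": datetime(2013, 4, 2, 8, 27, 9, 963000)},
    ],
}
rows = list(iter_reading_rows(report))
```

Passing rows to pd.DataFrame should give one row per reading with columns sensorName, ReadingUpdatedDate, a and b. pandas can do a similar expansion itself via the record_path and meta arguments of json_normalize, for example record_path="Readings" with meta=["sensorName"].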