[python] How do you *actually* read CSV data in TensorFlow?

Question 1

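I'm relatively new to the world of TensorFlow, and I'm quite perplexed by how you'd actually read CSV data into usable example/label tensors in TensorFlow. The example from the TensorFlow tutorial on reading CSV data is fairly fragmented and only gets you part of the way to being able to train on CSV data.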

Here is the code I've pieced together, based on that CSV tutorial:

from __future__ import print_function
import tensorflow as tf

def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

filename = "csv_test_data.csv"

# setup text reader
file_length = file_len(filename)
filename_queue = tf.train.string_input_producer([filename])
reader = tf.TextLineReader(skip_header_lines=1)
_, csv_row = reader.read(filename_queue)

# setup CSV decoding
record_defaults = [[0],[0],[0],[0],[0]]
col1,col2,col3,col4,col5 = tf.decode_csv(csv_row, record_defaults=record_defaults)

# turn features back into a tensor
features = tf.stack([col1,col2,col3,col4])

print("loading, " + str(file_length) + " line(s)\n")
with tf.Session() as sess:
  tf.initialize_all_variables().run()

  # start populating filename queue
  coord = tf.train.Coordinator()
  threads = tf.train.start_queue_runners(coord=coord)

  for i in range(file_length):
    # retrieve a single instance
    example, label = sess.run([features, col5])
    print(example, label)

  coord.request_stop()
  coord.join(threads)
  print("\ndone loading")

Here is a brief example from the CSV file I'm loading. It is pretty basic data, with 4 feature columns and 1 label column:

0,0,0,0,0
0,15,0,0,0
0,30,0,0,0
0,45,0,0,0

All the code above does is print each example from the CSV file, one by one. That is nice, but useless for training.

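What I'm struggling with is how you'd actually turn those individual examples, loaded one by one, into a training dataset. For example, here is a notebook I was working on in the Udacity Deep Learning course. I basically want to take the CSV data I'm loading and put it into something like train_dataset and train_labels: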

def reformat(dataset, labels):
  dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32)
  # Map 2 to [0.0, 1.0, 0.0 ...], 3 to [0.0, 0.0, 1.0 ...]
  labels = (np.arange(num_labels) == labels[:,None]).astype(np.float32)
  return dataset, labels
train_dataset, train_labels = reformat(train_dataset, train_labels)
valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
test_dataset, test_labels = reformat(test_dataset, test_labels)
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)

I've tried using tf.train.shuffle_batch like this, but it just inexplicably hangs:

  for i in range(file_length):
    # retrieve a single instance
    example, label = sess.run([features, colRelevant])
    example_batch, label_batch = tf.train.shuffle_batch([example, label], batch_size=file_length, capacity=file_length, min_after_dequeue=10000)
    print(example, label)

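So to sum up, here are my questions:

What am I missing about this process?
- It feels like there is some key intuition that I'm missing about how to properly build an input pipeline.
Is there a way to avoid having to know the length of the CSV file?
- It feels pretty inelegant to have to know the number of lines you want to process (the for i in range(file_length) line of code above).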

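Edit:
As soon as Yaroslav pointed out that I was likely mixing up imperative and graph-construction parts here, it began to become clearer. I was able to pull together the following code, which I think is closer to what would typically be done when training a model from CSV (excluding any model training code):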

from __future__ import print_function
import numpy as np
import tensorflow as tf
import math as math
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('dataset')
args = parser.parse_args()

def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

def read_from_csv(filename_queue):
  reader = tf.TextLineReader(skip_header_lines=1)
  _, csv_row = reader.read(filename_queue)
  record_defaults = [[0],[0],[0],[0],[0]]
  colHour,colQuarter,colAction,colUser,colLabel = tf.decode_csv(csv_row, record_defaults=record_defaults)
  features = tf.stack([colHour,colQuarter,colAction,colUser])
  label = tf.stack([colLabel])
  return features, label

def input_pipeline(batch_size, num_epochs=None):
  filename_queue = tf.train.string_input_producer([args.dataset], num_epochs=num_epochs, shuffle=True)
  example, label = read_from_csv(filename_queue)
  min_after_dequeue = 10000
  capacity = min_after_dequeue + 3 * batch_size
  example_batch, label_batch = tf.train.shuffle_batch(
      [example, label], batch_size=batch_size, capacity=capacity,
      min_after_dequeue=min_after_dequeue)
  return example_batch, label_batch

file_length = file_len(args.dataset) - 1
examples, labels = input_pipeline(file_length, 1)

with tf.Session() as sess:
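  # num_epochs adds a local epoch counter, so local variables need initializing too (TF 1.x)
  sess.run([tf.global_variables_initializer(), tf.local_variables_initializer()])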

  # start populating filename queue
  coord = tf.train.Coordinator()
  threads = tf.train.start_queue_runners(coord=coord)

  try:
    while not coord.should_stop():
      example_batch, label_batch = sess.run([examples, labels])
      print(example_batch)
  except tf.errors.OutOfRangeError:
    print('Done training, epoch reached')
  finally:
    coord.request_stop()

  coord.join(threads)

Question 2

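I think you are mixing up imperative and graph-construction parts here. The operation tf.train.shuffle_batch creates a new queue node, and a single node can be used to process the entire dataset. So I think you are hanging because you created a bunch of shuffle_batch queues in your for loop and never started queue runners for them.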

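Normal input pipeline usage looks like this:

1. Add nodes like shuffle_batch to the input pipeline
2. (Optional, to prevent unintentional graph modification) finalize the graph

--- end of graph construction, beginning of imperative programming ---

3. tf.train.start_queue_runners
4. while(True): session.run()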

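To be more scalable (to avoid the Python GIL), you could generate all of your data using the TensorFlow pipeline. However, if performance is not critical, you can hook up a numpy array to an input pipeline by using tf.train.slice_input_producer. Here is an example with some Print nodes for debugging (messages from a Print node are written to standard error when the node is run):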

import numpy as np
import tensorflow as tf

tf.reset_default_graph()

num_examples = 5
num_features = 2
data = np.reshape(np.arange(num_examples*num_features), (num_examples, num_features))
print(data)

# One row per dequeue; num_epochs=1 makes the queue close after a single pass
(data_node,) = tf.train.slice_input_producer([tf.constant(data)], num_epochs=1, shuffle=False)
data_node_debug = tf.Print(data_node, [data_node], "Dequeueing from data_node ")
data_batch = tf.train.batch([data_node_debug], batch_size=2)
data_batch_debug = tf.Print(data_batch, [data_batch], "Dequeueing from data_batch ")

sess = tf.InteractiveSession()
# The epoch counter created by num_epochs is a local variable
sess.run([tf.global_variables_initializer(), tf.local_variables_initializer()])
tf.get_default_graph().finalize()
tf.train.start_queue_runners()

try:
  while True:
    print(sess.run(data_batch_debug))
except tf.errors.OutOfRangeError as e:
  print("No more inputs.")

You should see something like this:

[[0 1]
 [2 3]
 [4 5]
 [6 7]
 [8 9]]
[[0 1]
 [2 3]]
[[4 5]
 [6 7]]
No more inputs.

The "8, 9" numbers were not produced because they did not fill up a full batch. Also, tf.Print writes to standard error, so its messages show up separately in the terminal.
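
The missing "8, 9" row follows from fixed-size batching: a batch is only emitted once it is full. A minimal pure-Python sketch of that behavior is below. It only illustrates the idea and is not how tf.train.batch is implemented; the name full_batches is made up for this example.

```python
def full_batches(rows, batch_size):
    """Yield consecutive batches of exactly batch_size rows; drop any remainder."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # A partially filled batch left over here is never emitted.


data = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
batches = list(full_batches(data, batch_size=2))
print(batches)
```

In TF 1.x, tf.train.batch accepts allow_smaller_final_batch=True if the final partial batch should be emitted as well.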

P.S.: a minimal example of connecting batch to a manually initialized queue is in GitHub issue 2193.

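Also, for debugging purposes you might want to set a timeout on your session so that your IPython notebook does not hang on dequeues from an empty queue. I use this helper function for my sessions: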

def create_session():
  config = tf.ConfigProto(log_device_placement=True)
  config.gpu_options.per_process_gpu_memory_fraction=0.3 # don't hog all vRAM
  config.operation_timeout_in_ms=60000   # terminate on long hangs
  # create interactive session to register a default session
  sess = tf.InteractiveSession("", config=config)
  return sess

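Scalability notes:

tf.constant inlines a copy of your data into the graph. There is a fundamental limit of 2GB on the size of the graph definition, so that is an upper limit on the size of the data.
You could get around that limit by using v=tf.Variable and saving the data into it by running v.assign_op with a tf.placeholder on the right-hand side and feeding the numpy array to the placeholder (feed_dict).
That still creates two copies of the data, so to save memory you could write your own version of slice_input_producer that operates on numpy arrays and uploads rows one at a time using feed_dict.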
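
The last point, feeding rows through feed_dict yourself, could look roughly like the generator below. This is a sketch under assumptions: the CSV has one header line, no blank lines, and integer columns with the label last (as in the question's sample file). The name iter_csv_batches is hypothetical. It uses only the standard library and never needs to know the file length.

```python
import csv
import io
from itertools import islice


def iter_csv_batches(lines, batch_size):
    """Yield (features, labels) batches from an iterable of CSV lines.

    Assumes one header line and integer columns, with the label last.
    The final batch may be smaller than batch_size.
    """
    reader = csv.reader(lines)
    next(reader, None)  # skip the header line
    while True:
        chunk = list(islice(reader, batch_size))
        if not chunk:
            return  # input exhausted; no line count needed
        rows = [[int(v) for v in row] for row in chunk]
        yield [r[:-1] for r in rows], [r[-1] for r in rows]


# In-memory stand-in for an open CSV file; any file object works the same way.
sample = io.StringIO("a,b,c,d,label\n0,0,0,0,0\n0,15,0,0,0\n0,30,0,0,0\n0,45,0,0,0\n")
batches = list(iter_csv_batches(sample, batch_size=3))
for features, labels in batches:
    print(features, labels)
```

Each batch could then be passed to a TF 1.x session through feed_dict, keyed by whatever placeholders your model defines.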

Question 3

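Or you could try this. The code loads the Iris dataset into TensorFlow using pandas and numpy, and the output of a simple one-neuron layer is printed in the session. I hope it helps with a basic understanding. [I have not added a way of one-hot encoding the labels.]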

import tensorflow as tf
import numpy
import pandas as pd

# Read the five numeric columns, skipping the header row
df = pd.read_csv('/home/nagarjun/Desktop/Iris.csv', usecols=[0,1,2,3,4], skiprows=[0], header=None)
d = df.values
# Read the label column, skipping the same header row
l = pd.read_csv('/home/nagarjun/Desktop/Iris.csv', usecols=[5], skiprows=[0], header=None)
data = numpy.float32(d)
labels = numpy.array(l.values, 'str')
#print(data, labels)

#tensorflow
x = tf.placeholder(tf.float32, shape=(150,5))
w = tf.random_normal([100,150], mean=0.0, stddev=1.0, dtype=tf.float32)
y = tf.nn.softmax(tf.matmul(w,x))

with tf.Session() as sess:
    # Feed the numpy array into the placeholder instead of overwriting it
    print(sess.run(y, feed_dict={x: data}))
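
The one-hot step left out above could be sketched as follows in plain Python. The label strings are sample Iris class names used purely for illustration, and one_hot is a hypothetical helper, not a TensorFlow or pandas function.

```python
def one_hot(labels):
    """Map each label to a one-hot row; classes are ordered alphabetically."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    encoded = [[1.0 if index[label] == i else 0.0 for i in range(len(classes))]
               for label in labels]
    return encoded, classes


labels = ["Iris-setosa", "Iris-virginica", "Iris-versicolor", "Iris-setosa"]
encoded, classes = one_hot(labels)
print(classes)
print(encoded)
```

The resulting rows have one column per class, which is the shape a softmax output layer is usually compared against.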

Question 4

You can use the latest tf.data API:

# batch_size is a required argument; 32 is an arbitrary choice here
dataset = tf.contrib.data.make_csv_dataset(filepath, batch_size=32)
iterator = dataset.make_initializable_iterator()
columns = iterator.get_next()
with tf.Session() as sess:
   sess.run([iterator.initializer])

Question 5

If anyone came here looking for a simple way to read very large, sharded CSV files with the tf.estimator API, please see my code below:

CSV_COLUMNS = ['ID','text','class']
LABEL_COLUMN = 'class'
DEFAULTS = [['x'],['no'],[0]]  #Default values

def read_dataset(filename, mode, batch_size = 512):
    def _input_fn(v_test=False):
#         def decode_csv(value_column):
#             columns = tf.decode_csv(value_column, record_defaults = DEFAULTS)
#             features = dict(zip(CSV_COLUMNS, columns))
#             label = features.pop(LABEL_COLUMN)
#             return add_engineered(features), label

        # Create list of files that match pattern
        file_list = tf.gfile.Glob(filename)

        # Create dataset from file list
        #dataset = tf.data.TextLineDataset(file_list).map(decode_csv)
        dataset = tf.contrib.data.make_csv_dataset(file_list,
                                                   batch_size=batch_size,
                                                   column_names=CSV_COLUMNS,
                                                   column_defaults=DEFAULTS,
                                                   label_name=LABEL_COLUMN)

        if mode == tf.estimator.ModeKeys.TRAIN:
            num_epochs = None # indefinitely
            dataset = dataset.shuffle(buffer_size = 10 * batch_size)
        else:
            num_epochs = 1 # end-of-input after this

        batch_features, batch_labels = dataset.make_one_shot_iterator().get_next()

        #Begins - Uncomment for testing only -----------------------------------------------------<
        if v_test == True:
            with tf.Session() as sess:
                print(sess.run(batch_features))
        #End - Uncomment for testing only -----------------------------------------------------<
        return add_engineered(batch_features), batch_labels
    return _input_fn

Example usage with tf.estimator:

train_spec = tf.estimator.TrainSpec(input_fn = read_dataset(
                                                filename = train_file,
                                                mode = tf.estimator.ModeKeys.TRAIN,
                                                batch_size = 128),
                                      max_steps = num_train_steps)
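
For intuition about what the file pattern and label_name arguments do, here is a plain-Python sketch of the same idea: expand a glob pattern, read every shard, and split the label column off from the features. It is an illustration under assumptions, not what make_csv_dataset does internally. The shard file names and read_sharded_csv are invented for the example, and the column names are borrowed from CSV_COLUMNS above.

```python
import csv
import glob
import os
import tempfile


def read_sharded_csv(pattern, label_name):
    """Return (features, labels) from all CSV files matching pattern."""
    features, labels = [], []
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                labels.append(row.pop(label_name))  # split the label off
                features.append(row)
    return features, labels


# Write two tiny shards so the example is self-contained.
tmpdir = tempfile.mkdtemp()
shards = {"part-0.csv": [("1", "hello", "0")], "part-1.csv": [("2", "world", "1")]}
for name, rows in shards.items():
    with open(os.path.join(tmpdir, name), "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["ID", "text", "class"])
        writer.writerows(rows)

features, labels = read_sharded_csv(os.path.join(tmpdir, "part-*.csv"), "class")
print(features, labels)
```

Unlike the tf.data version, this loads everything into memory and leaves all values as strings, so it is only suitable for small files or for checking what the columns contain.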

Question 6

2.0-compatible solution: this answer may already be covered by others in the thread above, but I will provide additional links that should help the community.

dataset = tf.data.experimental.make_csv_dataset(
      file_path,
      batch_size=5, # Artificially small to make examples easier to show.
      label_name=LABEL_COLUMN,
      na_value="?",
      num_epochs=1,
      ignore_errors=True,
      **kwargs)

For more information, please refer to this TensorFlow guide.