[ruby] Ruby 1.9: invalid byte sequence in UTF-8

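I'm writing a crawler in Ruby (1.9) that consumes lots of HTML from a lot of random sites.
When trying to extract links, I decided to just use .scan(/href="(.*?)"/i) instead of nokogiri/hpricot (major speedup). The problem is that I now get a lot of "invalid byte sequence in UTF-8" errors.
From what I understand, the net/http library doesn't have any encoding-specific options, and the data that comes in is basically not properly tagged.
What would be the best way to actually work with that incoming data? I tried .encode with the replace and invalid options set, but no success so far...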
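
The failure can be reproduced without any network access. A minimal sketch, assuming a response body that is tagged as UTF-8 but contains a stray Latin-1 byte (the `body` string is made up for illustration):

```ruby
# Hypothetical response body: "\xE9" is "é" in Latin-1 but is not valid UTF-8.
body = "caf\xE9 <a href=\"http://example.com/\">link</a>".dup.force_encoding('UTF-8')

error_message =
  begin
    body.scan(/href="(.*?)"/i)
    nil
  rescue ArgumentError => e
    # Regexp matching refuses to run on a string whose encoding is broken.
    e.message
  end

puts body.valid_encoding?   # false
puts error_message.inspect
```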

Answer

In Ruby 1.9.3 you can use String#encode to "ignore" invalid UTF-8 sequences. Here is a snippet that works in both 1.8 (iconv) and 1.9 (String#encode):

require 'iconv' unless String.method_defined?(:encode)
if String.method_defined?(:encode)
  file_contents.encode!('UTF-8', 'UTF-8', :invalid => :replace)
else
  ic = Iconv.new('UTF-8', 'UTF-8//IGNORE')
  file_contents = ic.iconv(file_contents)
end

Or, if you have really troublesome input, you can do a double conversion from UTF-8 to UTF-16 and back to UTF-8:

require 'iconv' unless String.method_defined?(:encode)
if String.method_defined?(:encode)
  file_contents.encode!('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')
  file_contents.encode!('UTF-8', 'UTF-16')
else
  ic = Iconv.new('UTF-8', 'UTF-8//IGNORE')
  file_contents = ic.iconv(file_contents)
end
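
On Ruby 2.1 and later, String#scrub does the same job in one call, so neither iconv nor the UTF-16 round trip is needed. A minimal sketch, assuming a made-up `raw` string standing in for `file_contents`:

```ruby
# Hypothetical input: a UTF-8 string containing one stray Latin-1 byte (\xE9).
raw = "caf\xE9 <a href=\"http://example.com/\">link</a>".dup.force_encoding('UTF-8')

marked  = raw.scrub      # invalid bytes become U+FFFD (the replacement character)
cleaned = raw.scrub('')  # invalid bytes are dropped

# The cleaned string is valid UTF-8, so the regex scan no longer raises.
links = cleaned.scan(/href="(.*?)"/i).flatten
puts links.inspect
```

Whether to drop the bad bytes or mark them with U+FFFD is a judgment call; for link extraction either should do, since the scan only needs a string with a valid encoding.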

Answer

Neither the accepted answer nor the other answer worked for me. I found this post, which suggested:

string.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')

This fixed the problem for me.
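
One caveat, sketched below with a made-up string: because the source encoding is declared as 'binary', every byte above 0x7F counts as undefined, so this call removes all non-ASCII characters, including perfectly valid ones.

```ruby
valid = "caf\u00E9"  # valid UTF-8; "é" is stored as the two bytes C3 A9

stripped = valid.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')

puts stripped  # prints "caf"
```

That is fine when only ASCII content such as URLs matters, but it is lossy for any text you intend to keep.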

Answer

My current solution is to run:

my_string.unpack("C*").pack("U*")

This at least gets rid of the exceptions, which were my main problem.
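
In effect, unpack("C*") yields the raw bytes as integers and pack("U*") turns each integer into the Unicode code point with that number, which amounts to reading the input as ISO-8859-1. The result is always valid UTF-8, but input that was already valid multibyte UTF-8 comes out garbled. A small sketch with made-up strings:

```ruby
latin1_bytes = "caf\xE9".b   # "é" as the single Latin-1 byte E9
utf8_bytes   = "caf\u00E9"   # "é" as the two UTF-8 bytes C3 A9

from_latin1 = latin1_bytes.unpack("C*").pack("U*")
from_utf8   = utf8_bytes.unpack("C*").pack("U*")

puts from_latin1  # prints "café"
puts from_utf8    # prints "cafÃ©"
```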

Answer

Try this:

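def to_utf8(str)
  # Work on a copy so the caller's string (possibly frozen) is left untouched
  str = str.dup.force_encoding('UTF-8')
  return str if str.valid_encoding?
  # Otherwise treat the data as raw bytes and drop everything that is not ASCII
  str.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
end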

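Answer

I recommend using an HTML parser. Just find the fastest one.

Parsing HTML is not as easy as it may seem.

Browsers handle invalid UTF-8 sequences in UTF-8 HTML documents by simply putting the "�" symbol in their place. So once the invalid UTF-8 sequence in the HTML has been parsed, the resulting text is a valid string.

Even inside attribute values you have to decode HTML entities such as &amp;.

Here is a great question that sums up why you cannot reliably parse HTML with a regular expression:
RegEx match open tags except XHTML self-contained tags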
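
The entity point is easy to see with the regex from the question. A small sketch, assuming a made-up HTML fragment and using CGI.unescapeHTML from Ruby's standard library for the decoding step:

```ruby
require 'cgi'

# Hypothetical fragment: the ampersand in the URL is written as &amp; in valid HTML.
html = '<a href="http://example.com/?a=1&amp;b=2">link</a>'

raw_href = html.scan(/href="(.*?)"/i).flatten.first
decoded  = CGI.unescapeHTML(raw_href)

puts raw_href  # still contains &amp;
puts decoded   # the URL a browser would actually request
```

A real parser performs this decoding for you; with a bare regex it is one more step to remember.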

Answer

This seems to work:

def sanitize_utf8(string)
  return nil if string.nil?
  return string if string.valid_encoding?
  string.chars.select { |c| c.valid_encoding? }.join
end

Answer

attachment = file.read

begin
  # Try it as UTF-8 directly
  cleaned = attachment.dup.force_encoding('UTF-8')
  unless cleaned.valid_encoding?
    # Some of it might be an old Windows code page
    cleaned = attachment.encode('UTF-8', 'Windows-1252')
  end
  attachment = cleaned
rescue EncodingError
  # Last resort: read the bytes as ISO-8859-1, which maps every byte to a character
  attachment = attachment.force_encoding('ISO-8859-1').encode('UTF-8')
end
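
The rescue branch matters because, in Ruby's Windows-1252 table, a few bytes (0x81, for example) have no character assigned, so the second attempt can raise too. A small sketch of the three outcomes, using a hypothetical helper `to_utf8_with_fallback` and made-up inputs:

```ruby
# Hypothetical helper wrapping the same three-step fallback.
def to_utf8_with_fallback(bytes)
  cleaned = bytes.dup.force_encoding('UTF-8')
  return cleaned if cleaned.valid_encoding?
  # Not valid UTF-8: reinterpret the bytes as Windows-1252.
  bytes.encode('UTF-8', 'Windows-1252')
rescue EncodingError
  # Windows-1252 rejected a byte: fall back to ISO-8859-1, which accepts any byte.
  bytes.dup.force_encoding('ISO-8859-1').encode('UTF-8')
end

already_utf8 = to_utf8_with_fallback("caf\u00E9".b)  # valid UTF-8 bytes, kept as is
from_cp1252  = to_utf8_with_fallback("caf\xE9".b)    # E9 is "é" in Windows-1252
last_resort  = to_utf8_with_fallback("caf\x81".b)    # 81 is unassigned in Windows-1252

puts [already_utf8, from_cp1252, last_resort].all?(&:valid_encoding?)
```

The order of the attempts is a guess about the data: it assumes that input which is not valid UTF-8 is most likely Windows-1252, which may not hold for every site.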