[python] How do I remove the b prefix from a string in Python?

Question 1

I have this problem with a bunch of tweets I am importing, which read like this:

b'I posted a new photo to Facebook'

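I gather that the b indicates a bytes object. However, this is proving problematic because the b does not disappear in the CSV files I end up writing, and it interferes with my later code.

Is there a simple way to remove this b prefix from my lines of text?

Note that I seem to need the text encoded as utf-8, otherwise tweepy has trouble pulling it from the web.

Here is the link content I am analyzing: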

https://www.dropbox.com/s/sjmsbuhrghj7abt/new_tweets.txt?dl=0

new_tweets = 'content in the link'

Code attempt

outtweets = [[tweet.text.encode("utf-8").decode("utf-8")] for tweet in new_tweets]
print(outtweets)

Error

UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-21-6019064596bf> in <module>()
      1 for screen_name in user_list:
----> 2     get_all_tweets(screen_name,"instance file")

<ipython-input-19-e473b4771186> in get_all_tweets(screen_name, mode)
     99             with open(os.path.join(save_location,'%s.instance' % screen_name), 'w') as f:
    100                 writer = csv.writer(f)
--> 101                 writer.writerows(outtweets)
    102         else:
    103             with open(os.path.join(save_location,'%s.csv' % screen_name), 'w') as f:

C:\Users\Stan Shunpike\Anaconda3\lib\encodings\cp1252.py in encode(self, input, final)
     17 class IncrementalEncoder(codecs.IncrementalEncoder):
     18     def encode(self, input, final=False):
---> 19         return codecs.charmap_encode(input,self.errors,encoding_table)[0]
     20
     21 class IncrementalDecoder(codecs.IncrementalDecoder):

UnicodeEncodeError: 'charmap' codec can't encode characters in position 64-65: character maps to <undefined>

Question 2

You need to decode the bytes if you want a string:

b = b'1234'
print(b.decode('utf-8'))  # prints 1234
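
As a slightly fuller sketch of the same idea, the helper below decodes a value only when it really is a bytes object, which can help when a list mixes bytes and str. The name to_text is hypothetical and not part of any library.

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is a bytes object."""
    if isinstance(value, (bytes, bytearray)):
        return value.decode(encoding)
    return value


# Sample data standing in for tweet texts; one item is bytes, one is already str.
tweets = [b"I posted a new photo to Facebook", "already a str"]
cleaned = [to_text(t) for t in tweets]
print(cleaned)
```

After this step every item in cleaned is a str, so no b prefix can end up in the output.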

Question 3

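It is telling you that the object you are printing is not a string but a bytes object, displayed as a bytes literal. People explain this in incomplete ways, so here is my take.

Consider creating a bytes object by typing a bytes literal (that is, defining the bytes object directly by typing b'' rather than calling the bytes constructor) and then converting it into a string object by decoding it as utf-8. (Converting here means decoding.)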

byte_object= b"test" # byte object by literally typing characters
print(byte_object) # Prints b'test'
print(byte_object.decode('utf8')) # Prints "test" without quotations

You can see that we simply apply the .decode('utf8') method.
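
To make the difference concrete, here is a small sketch contrasting str() on a bytes object, which only produces its printable representation, with actual decoding. The sample value is an assumption chosen to include a non-ASCII character.

```python
byte_object = b"caf\xc3\xa9"            # UTF-8 bytes for the text "café"

as_repr = str(byte_object)              # the repr as text, b prefix and quotes included
as_text = byte_object.decode("utf8")    # real text, decoded from UTF-8
also_text = str(byte_object, "utf8")    # str() with an encoding argument also decodes

print(as_repr)
print(type(byte_object).__name__, type(as_text).__name__)
```

Only the decoded forms give back the intended text; str() without an encoding keeps the b prefix because it is just the representation of the bytes object.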

Bytes in Python

https://docs.python.org/3.3/library/stdtypes.html#bytes

String and bytes literals are described by the following lexical definitions:

https://docs.python.org/3.3/reference/lexical_analysis.html#string-and-bytes-literals

stringliteral   ::=  [stringprefix](shortstring | longstring)
stringprefix    ::=  "r" | "u" | "R" | "U"
shortstring     ::=  "'" shortstringitem* "'" | '"' shortstringitem* '"'
longstring      ::=  "'''" longstringitem* "'''" | '"""' longstringitem* '"""'
shortstringitem ::=  shortstringchar | stringescapeseq
longstringitem  ::=  longstringchar | stringescapeseq
shortstringchar ::=  <any source character except "\" or newline or the quote>
longstringchar  ::=  <any source character except "\">
stringescapeseq ::=  "\" <any source character>

bytesliteral   ::=  bytesprefix(shortbytes | longbytes)
bytesprefix    ::=  "b" | "B" | "br" | "Br" | "bR" | "BR" | "rb" | "rB" | "Rb" | "RB"
shortbytes     ::=  "'" shortbytesitem* "'" | '"' shortbytesitem* '"'
longbytes      ::=  "'''" longbytesitem* "'''" | '"""' longbytesitem* '"""'
shortbytesitem ::=  shortbyteschar | bytesescapeseq
longbytesitem  ::=  longbyteschar | bytesescapeseq
shortbyteschar ::=  <any ASCII character except "\" or newline or the quote>
longbyteschar  ::=  <any ASCII character except "\">
bytesescapeseq ::=  "\" <any ASCII character>

Question 4

You need to decode it to convert it to a string. Check the answer on bytes literals in python3.

In [1]: b'I posted a new photo to Facebook'.decode('utf-8')
Out[1]: 'I posted a new photo to Facebook'

Question 5

**** How to remove the b'' characters from a decoded string in Python ****

import base64
a='cm9vdA=='
b=base64.b64decode(a).decode('utf-8')
print(b)

Question 6

Decoding a bytes literal does not work as expected in python 3.6 with django 2.0. Yes, I get the right result when I print it, but the b'value' is still there even though it prints correctly.

This is what I am encoding:

'uid': urlsafe_base64_encode(force_bytes(user.pk)),

This is what I am decoding:

uid = force_text(urlsafe_base64_decode(uidb64))

This is what the django 2.0 documentation says:

urlsafe_base64_encode(s)

Encodes a bytestring in base64 for use in URLs, stripping any trailing equal signs.

urlsafe_base64_decode(s)

Decodes a base64 encoded string, adding back any trailing equal signs that might have been stripped.

This is my account_activation_email_test.html file:

{% autoescape off %}
Hi {{ user.username }},

Please click on the link below to confirm your registration:

http://{{ domain }}{% url 'accounts:activate' uidb64=uid token=token %}
{% endautoescape %}

This is my console response:

Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Subject: Activate Your MySite Account
From: webmaster@localhost
To: testuser@yahoo.com
Date: Fri, 20 Apr 2018 06:26:46 -0000
Message-ID: <152420560682.16725.4597194169307598579@Dash-U>

Hi testuser,

Please click on the link below to confirm your registration:

http://127.0.0.1:8000/activate/b'MjU'/4vi-fasdtRf2db2989413ba/

As you can see, uid = b'MjU'

Expected: uid = MjU

Testing in the console:

$ python
Python 3.6.4 (default, Apr  7 2018, 00:45:33)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from django.utils.http import urlsafe_base64_encode, urlsafe_base64_decode
>>> from django.utils.encoding import force_bytes, force_text
>>> var1=urlsafe_base64_encode(force_bytes(3))
>>> print(var1)
b'Mw'
>>> print(var1.decode())
Mw
>>>

After investigating, it seems to be related to python 3. My workaround was quite simple:

'uid': user.pk,

In the activation function I receive it as uidb64:

user = User.objects.get(pk=uidb64)

And voilà:

Content-Transfer-Encoding: 7bit
Subject: Activate Your MySite Account
From: webmaster@localhost
To: testuser@yahoo.com
Date: Fri, 20 Apr 2018 20:44:46 -0000
Message-ID: <152425708646.11228.13738465662759110946@Dash-U>


Hi testuser,

Please click on the link below to confirm your registration:

http://127.0.0.1:8000/activate/45/4vi-3895fbb6b74016ad1882/

It works fine now. 🙂
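
An alternative to dropping the encoding altogether would be to decode the encoded bytes before they reach the template, as the var1.decode() call in the console test suggests. The sketch below uses the standard library base64 module as a stand-in for the Django helpers; encode_uid and decode_uid are hypothetical names, not Django functions.

```python
import base64


def encode_uid(pk):
    """Return a URL-safe base64 token for pk as str, without '=' padding."""
    raw = base64.urlsafe_b64encode(str(pk).encode("utf-8"))  # bytes, padded with '='
    return raw.rstrip(b"=").decode("ascii")                  # str, so no b'' in the URL


def decode_uid(token):
    """Invert encode_uid: restore the padding and decode back to text."""
    padded = token + "=" * (-len(token) % 4)
    return base64.urlsafe_b64decode(padded).decode("utf-8")


token = encode_uid(25)
print(token)
print(decode_uid(token))
```

Because the token is a str by the time it is interpolated into the URL, the link no longer carries the b'...' wrapper.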

Question 7

I got it done by encoding only the output with utf-8. Here is a code example:

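import csv

# api, user and file_name are assumed to be defined elsewhere
new_tweets = api.GetUserTimeline(screen_name=user, count=200)
result = new_tweets[0]
try:
    text = result.text
except AttributeError:
    text = ''

# newline='' lets the csv module control line endings
with open(file_name, 'a', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow([text])  # one row with a single column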

In other words: do not encode when collecting data from the api; encode only the output (print or write).
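
A self-contained sketch of that advice, without any API calls: the strings stay as str throughout, and the encoding is chosen only when the file is opened. The sample texts and the temporary file path are assumptions for illustration.

```python
import csv
import os
import tempfile

# Stand-in data: plain str objects, as an API client would normally return them.
texts = ["I posted a new photo to Facebook", "caf\u00e9 \U0001F642"]

path = os.path.join(tempfile.mkdtemp(), "tweets.csv")

# Encoding happens here, at the output boundary. Without encoding="utf-8",
# open() falls back to the platform default (cp1252 in the traceback from the question).
with open(path, "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerows([t] for t in texts)  # one single-column row per text

# Read the file back to confirm the text survived the round trip.
with open(path, encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f))
print(len(rows))
```

Since nothing is ever turned into bytes by hand, there is no b prefix to remove, and the non-ASCII characters no longer depend on the platform default codec.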

Question 8

Assuming you do not want to decode it again right away as others are suggesting here, you can convert it to a string and then strip the leading b' and the trailing '.

>>> x = "Hi there \ufffd"
>>> x = "Hi there \ufffd".encode("utf-8")
>>> x
b'Hi there \xef\xbf\xbd'
>>> str(x)[2:-1]
'Hi there \\xef\\xbf\\xbd'