[python] How to remove an element in lxml

Question 1

I need to completely remove elements based on the contents of an attribute, using Python's lxml. Example:

import lxml.etree as et

xml="""
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>
"""

tree=et.fromstring(xml)

for bad in tree.xpath("//fruit[@state='rotten']"):
  pass  # placeholder: remove this element from the tree

print(et.tostring(tree, pretty_print=True).decode())

I would like this to print:

<groceries>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>

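Is there a way to do this without storing a temporary variable and printing it out manually, like this?

newxml="<groceries>\n"
for elt in tree.xpath("//fruit[@state='fresh']"):
  newxml+=et.tostring(elt, encoding="unicode")  # encoding="unicode" returns str instead of bytes

newxml+="</groceries>"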

Question 2

Use the remove method of an xml Element:

tree=et.fromstring(xml)

for bad in tree.xpath("//fruit[@state='rotten']"):
  bad.getparent().remove(bad)     # here I grab the parent of the element to call the remove directly on it

print(et.tostring(tree, pretty_print=True, xml_declaration=True).decode())

If I had to compare with the @Acorn version, mine works even if the elements to remove are not directly under the root node of your xml.
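To make the difference concrete, here is a minimal sketch. It assumes the compared version calls remove directly on the root element, and the nested `<punnet>` document is made up for illustration. Calling remove on an element that is not the direct parent raises ValueError, while going through getparent() works at any depth.

```python
import lxml.etree as et

# Hypothetical input: the rotten fruit sits one level below the root.
tree = et.fromstring(
    '<groceries><punnet><fruit state="rotten">strawberry</fruit></punnet></groceries>'
)
bad = tree.xpath("//fruit[@state='rotten']")[0]

# remove() only accepts direct children, so calling it on the root fails here.
try:
    tree.remove(bad)
    direct_removal_failed = False
except ValueError:
    direct_removal_failed = True

# Asking the element for its own parent works regardless of nesting depth.
bad.getparent().remove(bad)

result = et.tostring(tree, encoding="unicode")
print(direct_removal_failed)
print(result)
```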

Question 3

You're looking for the remove function. Call the remove method of the element's parent and pass it the subelement you want to remove.

import lxml.etree as et

xml="""
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <punnet>
    <fruit state="rotten">strawberry</fruit>
    <fruit state="fresh">blueberry</fruit>
  </punnet>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>
"""

tree=et.fromstring(xml)

for bad in tree.xpath("//fruit[@state='rotten']"):
    bad.getparent().remove(bad)

print(et.tostring(tree, pretty_print=True).decode())

Result:

<groceries>
  <fruit state="fresh">pear</fruit>
  <punnet>
    <fruit state="fresh">blueberry</fruit>
  </punnet>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>

Question 4

I ran into one situation:

<div>
    <script>
        some code
    </script>
    text here
</div>

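div.remove(script) also removes the "text here" part, which I did not intend.

Following the answer here, I found that etree.strip_elements is a better solution for me, because the with_tail=(bool) param lets you control whether the tail text is removed.

But I still don't know whether this can use an xpath filter for the tag. I'm just putting this here for information.

Here is the documentation: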

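strip_elements(tree_or_element, *tag_names, with_tail=True)

Delete all elements with the provided tag names from a tree or subtree. This will remove the elements and their entire subtree, including all their attributes, text content and descendants. It will also remove the tail text of the element unless you explicitly set the with_tail keyword argument option to False.

Tag names can contain wildcards as in _Element.iter.

Note that this will not delete the element (or ElementTree root element) that you passed even if it matches. It will only treat its descendants. If you want to include the root element, check its tag name directly before even calling this function.

Example usage::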
   strip_elements(some_element,
       'simpletagname',             # non-namespaced tag
       '{http://some/ns}tagname',   # namespaced tag
       '{http://some/other/ns}*',   # any tag from a namespace
       lxml.etree.Comment           # comments
       )
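Applied to the `<div>` situation above, a minimal sketch might look like this. The markup is condensed onto one line for illustration, and the variable names are made up. It contrasts the default behaviour with with_tail=False.

```python
from lxml import etree

SOURCE = "<div><script>some code</script>text here</div>"

# Default (with_tail=True): the tail text after <script> is removed as well.
div_default = etree.fromstring(SOURCE)
etree.strip_elements(div_default, "script")
dropped = etree.tostring(div_default, encoding="unicode")

# with_tail=False: only the <script> element and its subtree are removed.
div_keep = etree.fromstring(SOURCE)
etree.strip_elements(div_keep, "script", with_tail=False)
kept = etree.tostring(div_keep, encoding="unicode")

print(dropped)
print(kept)
```

strip_elements selects by tag name only, so it cannot express a condition such as an attribute value.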

Question 5

As already mentioned, you can use the remove() method to delete (sub)elements from the tree:

for bad in tree.xpath("//fruit[@state='rotten']"):
  bad.getparent().remove(bad)

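However, it removes the element together with its tail, which is a problem if you are processing mixed-content documents like HTML:

<div><fruit state="rotten">avocado</fruit> Hello!</div>

becomes

<div></div>

Which is, I suppose, not always what you want 🙂 I have created a helper function that removes just the element and keeps its tail: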

def remove_element(el):
    parent = el.getparent()
    # el.tail is None when no text follows the element
    if el.tail and el.tail.strip():
        prev = el.getprevious()
        # compare with None: an lxml element without children is falsy
        if prev is not None:
            prev.tail = (prev.tail or '') + el.tail
        else:
            parent.text = (parent.text or '') + el.tail
    parent.remove(el)

for bad in tree.xpath("//fruit[@state='rotten']"):
    remove_element(bad)

This way, the tail text is kept:

<div> Hello!</div>
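As a self-contained check of the same idea, the sketch below uses a hypothetical helper named remove_keep_tail and a made-up document in which the removed element has a previous sibling. Testing the sibling against None matters because an lxml element with no children evaluates as false.

```python
from lxml import etree

def remove_keep_tail(el):
    """Remove el from its parent but keep the text that follows it."""
    parent = el.getparent()
    if el.tail:  # tail is None when nothing follows the element
        prev = el.getprevious()
        if prev is not None:
            # Append the tail to the previous sibling's tail.
            prev.tail = (prev.tail or "") + el.tail
        else:
            # No previous sibling: the tail becomes part of the parent's text.
            parent.text = (parent.text or "") + el.tail
    parent.remove(el)

# Illustrative input: <b/> is an empty previous sibling of the rotten fruit.
doc = etree.fromstring(
    '<div>A<b/>B<fruit state="rotten">avocado</fruit> Hello!</div>'
)
for bad in doc.xpath("//fruit[@state='rotten']"):
    remove_keep_tail(bad)

result = etree.tostring(doc, encoding="unicode")
print(result)
```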

Question 6

You can also solve this using lxml's html module:

from lxml import html

xml="""
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>
"""

tree = html.fromstring(xml)

print("//BEFORE")
print(html.tostring(tree, pretty_print=True).decode("utf-8"))

for i in tree.xpath("//fruit[@state='rotten']"):
    i.drop_tree()

print("//AFTER")
print(html.tostring(tree, pretty_print=True).decode("utf-8"))

It should output something like this:

//BEFORE
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>


//AFTER
<groceries>

  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>

  <fruit state="fresh">peach</fruit>
</groceries>