[python] Parsing XML with namespaces in Python via 'ElementTree'

I have the following XML, which I want to parse using Python's ElementTree:

<rdf:RDF xml:base="http://dbpedia.org/ontology/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns="http://dbpedia.org/ontology/">

    <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
        <rdfs:label xml:lang="en">basketball league</rdfs:label>
        <rdfs:comment xml:lang="en">
          a group of sports teams that compete against each other
          in Basketball
        </rdfs:comment>
    </owl:Class>

</rdf:RDF>

I want to find all owl:Class tags and then extract the value of every rdfs:label instance inside them. I am using the following code:

import xml.etree.ElementTree as ET

tree = ET.parse("filename")
root = tree.getroot()
root.findall('owl:Class')

Because of the namespaces, I get the following error:

SyntaxError: prefix 'owl' not found in prefix map

I tried reading the document at http://effbot.org/zone/element-namespaces.htm, but I still cannot get this working because the XML above declares several namespaces.

Please let me know how to change the code to find all the owl:Class tags.

Answer

ElementTree is not too smart about namespaces. You need to give the .find(), findall() and iterfind() methods an explicit namespace dictionary. This is not documented very well:

namespaces = {'owl': 'http://www.w3.org/2002/07/owl#'} # add more as needed

root.findall('owl:Class', namespaces)

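Prefixes are only looked up in the namespaces parameter you pass in. This means you can use any namespace prefix you like: the API splits off the owl: part, looks up the corresponding namespace URL in the namespaces dictionary, and then changes the search to look for the XPath expression {http://www.w3.org/2002/07/owl#}Class instead. You can of course use the same syntax yourself: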

root.findall('{http://www.w3.org/2002/07/owl#}Class')
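Putting the pieces together for the XML in the question, a minimal sketch might look like the following. It assumes the document is available as a string (the variable names XML, namespaces and labels are mine, not part of any API) and it also maps the rdfs prefix so the labels can be read:

```python
import xml.etree.ElementTree as ET

# The sample document from the question, inlined so the sketch is self-contained.
XML = '''<rdf:RDF xml:base="http://dbpedia.org/ontology/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns="http://dbpedia.org/ontology/">

    <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
        <rdfs:label xml:lang="en">basketball league</rdfs:label>
    </owl:Class>

</rdf:RDF>'''

# Prefixes here only need to match the ones used in the search paths below.
namespaces = {
    'owl': 'http://www.w3.org/2002/07/owl#',
    'rdfs': 'http://www.w3.org/2000/01/rdf-schema#',
}

root = ET.fromstring(XML)

labels = []
for owl_class in root.findall('owl:Class', namespaces):
    # Each owl:Class may carry several rdfs:label children.
    for label in owl_class.findall('rdfs:label', namespaces):
        labels.append(label.text)

print(labels)
```

The prefixes in the dictionary do not have to match the ones in the document; only the URIs matter.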

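If you can switch to the lxml library, things are better: that library supports the same ElementTree API, but collects namespaces for you in an .nsmap attribute on elements.

Answer

Here is how to do this with lxml without having to hard-code the namespaces or scan the text for them (as Martijn Pieters mentions):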

from lxml import etree
tree = etree.parse("filename")
root = tree.getroot()
root.findall('owl:Class', root.nsmap)

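UPDATE:

Five years later I'm still running into variations of this issue. lxml helps as shown above, but not in every case. The commenters may have a valid point about this technique when it comes to merging documents, but I think most people are having difficulty simply searching documents.

Here is another case and how I handled it: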

<?xml version="1.0" ?><Tag1 xmlns="http://www.mynamespace.com/prefix">
<Tag2>content</Tag2></Tag1>

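An xmlns declaration without a prefix means that unprefixed tags get this default namespace. So when you search for Tag2, you need to include the namespace to find it. However, lxml creates an nsmap entry with None as the key, and I couldn't find a way to search with it. So I created a new namespace dictionary like this: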

namespaces = {}
# response uses a default namespace, and tags don't mention it
# create a new ns map using an identifier of our choice
for k, v in root.nsmap.items():  # .iteritems() exists only in Python 2
    if not k:
        namespaces['myprefix'] = v
e = root.find('myprefix:Tag2', namespaces)
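The same remapping can be written as a small helper. This is only a sketch: searchable_nsmap is a name I made up, and to keep it runnable without lxml the nsmap below is a hand-written dict standing in for what root.nsmap would contain for the document above (that it looks like this is an assumption). The search itself uses the standard library ElementTree:

```python
import xml.etree.ElementTree as ET

def searchable_nsmap(nsmap, default_prefix='myprefix'):
    """Return a copy of nsmap where the None (or empty) key is renamed.

    default_prefix is an arbitrary label of our choosing; it only has to
    match the prefix used in the search paths.
    """
    return {(prefix or default_prefix): uri for prefix, uri in nsmap.items()}

# Stand-in for lxml's root.nsmap on a document with a default namespace.
nsmap = {None: 'http://www.mynamespace.com/prefix'}
namespaces = searchable_nsmap(nsmap)

root = ET.fromstring(
    '<Tag1 xmlns="http://www.mynamespace.com/prefix"><Tag2>content</Tag2></Tag1>'
)
e = root.find('myprefix:Tag2', namespaces)
print(e.text)
```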

Answer

Note: This answer is for Python's standard library ElementTree and avoids hard-coded namespaces.

To extract the namespace prefixes and URIs from XML data, you can use the ElementTree.iterparse function and parse only the namespace start events (start-ns):

>>> from io import StringIO
>>> from xml.etree import ElementTree
>>> my_schema = u'''<rdf:RDF xml:base="http://dbpedia.org/ontology/"
...     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
...     xmlns:owl="http://www.w3.org/2002/07/owl#"
...     xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
...     xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
...     xmlns="http://dbpedia.org/ontology/">
... 
...     <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
...         <rdfs:label xml:lang="en">basketball league</rdfs:label>
...         <rdfs:comment xml:lang="en">
...           a group of sports teams that compete against each other
...           in Basketball
...         </rdfs:comment>
...     </owl:Class>
... 
... </rdf:RDF>'''
>>> my_namespaces = dict([
...     node for _, node in ElementTree.iterparse(
...         StringIO(my_schema), events=['start-ns']
...     )
... ])
>>> from pprint import pprint
>>> pprint(my_namespaces)
{'': 'http://dbpedia.org/ontology/',
 'owl': 'http://www.w3.org/2002/07/owl#',
 'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
 'rdfs': 'http://www.w3.org/2000/01/rdf-schema#',
 'xsd': 'http://www.w3.org/2001/XMLSchema#'}

The dictionary can then be passed as an argument to the search functions:

root.findall('owl:Class', my_namespaces)
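As a rough end-to-end sketch of this approach, the namespace collection can be wrapped in a function and combined with the search. The helper name collect_namespaces and the shortened sample document are my own; note also that the collected dictionary contains an empty-string key for the default namespace, which, as far as I know, the find functions treat as the default namespace for unprefixed names from Python 3.8 on:

```python
from io import StringIO
from xml.etree import ElementTree

def collect_namespaces(xml_text):
    """Map each declared prefix to its URI using only start-ns events."""
    return dict(
        node for _, node in ElementTree.iterparse(
            StringIO(xml_text), events=['start-ns']
        )
    )

# Shortened version of the document from the question.
my_schema = '''<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns="http://dbpedia.org/ontology/">
    <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
        <rdfs:label xml:lang="en">basketball league</rdfs:label>
    </owl:Class>
</rdf:RDF>'''

my_namespaces = collect_namespaces(my_schema)
root = ElementTree.fromstring(my_schema)

# A path can step through several prefixed tags at once.
labels = [el.text for el in root.findall('owl:Class/rdfs:label', my_namespaces)]
print(labels)
```

This parses the text twice, which is fine for small documents; for large files it may be worth collecting the namespaces and building the tree in a single iterparse pass.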

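Answer

I've been using code similar to this and have found that, as usual, it's always worth reading the documentation!

findall() only finds elements that are direct children of the current tag. So, not really ALL.

It might be worth getting your code to work with the following instead, especially if you're dealing with big and complex XML files, so that sub-sub-elements (and so on) are also included. If you already know where the elements are in your XML, then findall() will be fine. I just thought this was worth remembering.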

root.iter()

Ref: https://docs.python.org/3/library/xml.etree.elementtree.html#finding-interesting-elements
"Element.findall() finds only elements with a tag which are direct children of the current element. Element.find() finds the first child with a particular tag, and Element.text accesses the element's text content. Element.get() accesses the element's attributes."

Answer

To get the namespace in its namespace format, e.g. {myNameSpace}, you can do the following:
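To make the difference concrete, here is a small sketch on a made-up document (the root and group tags and the id attributes are hypothetical) with one owl:Class at the top level and one nested deeper. Note that iter() takes no namespace dictionary, so it needs the fully qualified {uri}tag form:

```python
import xml.etree.ElementTree as ET

OWL = 'http://www.w3.org/2002/07/owl#'
namespaces = {'owl': OWL}

# Hypothetical document: one owl:Class is a direct child, one is nested.
xml_text = '''<root xmlns:owl="http://www.w3.org/2002/07/owl#">
    <owl:Class id="top"/>
    <group>
        <owl:Class id="nested"/>
    </group>
</root>'''
root = ET.fromstring(xml_text)

# Direct children only.
direct = [e.get('id') for e in root.findall('owl:Class', namespaces)]

# './/' searches all descendants, at any depth.
deep = [e.get('id') for e in root.findall('.//owl:Class', namespaces)]

# iter() walks the whole subtree and matches on the qualified tag name.
via_iter = [e.get('id') for e in root.iter('{%s}Class' % OWL)]

print(direct, deep, via_iter)
```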

import re
root = tree.getroot()
# Capture the leading "{uri}" part of the root tag
ns = re.match(r'\{.*\}', root.tag).group(0)

This way, you can use it later in your code to find nodes, e.g. using string interpolation (Python 3):

link = root.find(f"{ns}link")
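One thing to watch for: re.match returns None when the root tag has no namespace, so calling .group(0) on it would fail. A slightly more defensive sketch, where namespace_of is a helper name I made up and the sample document is borrowed from an earlier answer:

```python
import re
import xml.etree.ElementTree as ET

def namespace_of(element):
    """Return the '{uri}' prefix of element.tag, or '' if it has none."""
    match = re.match(r'\{.*?\}', element.tag)
    return match.group(0) if match else ''

root = ET.fromstring(
    '<Tag1 xmlns="http://www.mynamespace.com/prefix"><Tag2>content</Tag2></Tag1>'
)
ns = namespace_of(root)

# The '{uri}tag' form needs no namespace dictionary.
tag2 = root.find(f"{ns}Tag2")
print(ns, tag2.text)
```

This only picks up the namespace of the root element itself, so it helps mainly when the elements you are after live in that same namespace.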

Answer

My solution is based on @Martijn Pieters' comment:

register_namespace only influences serialization, not search.

So the trick here is to use different dictionaries for serialization and for searching.

namespaces = {
    '': 'http://www.example.com/default-schema',
    'spec': 'http://www.example.com/specialized-schema',
}

Now, register all namespaces for parsing and writing:

for name, value in namespaces.items():  # .iteritems() exists only in Python 2
    ET.register_namespace(name, value)

For searching (find(), findall(), iterfind()) we need a non-empty prefix. Pass these functions a modified dictionary (here I modify the original dictionary, but this must be done only after the namespaces have been registered):

namespaces['default'] = namespaces['']

Now the functions from the find() family can be used with the default prefix:

print(root.find('default:myelem', namespaces))

but

tree.write(destination)

does not use any prefixes for elements in the default namespace.
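Here is a compact sketch of the whole idea, under the assumption of a small made-up document (the root and myelem tags are hypothetical). It uses ET.tostring instead of tree.write so the result can be inspected directly. Keep in mind that register_namespace changes module-wide state:

```python
import xml.etree.ElementTree as ET

DEFAULT_URI = 'http://www.example.com/default-schema'

# Hypothetical document whose elements all live in the default namespace.
xml_text = (
    '<root xmlns="http://www.example.com/default-schema">'
    '<myelem>hi</myelem>'
    '</root>'
)
root = ET.fromstring(xml_text)

# Searching: use a non-empty prefix of our own choosing.
search_namespaces = {'default': DEFAULT_URI}
found = root.find('default:myelem', search_namespaces)

# Serializing: register the empty prefix so no generated prefix is written.
ET.register_namespace('', DEFAULT_URI)
serialized = ET.tostring(root, encoding='unicode')

print(found.text)
print(serialized)
```

Without the register_namespace call, ElementTree would make up a prefix such as ns0 for the default namespace when writing the tree back out.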