[c#] Remove HTML tags from a string, including &nbsp;, in C#

Question 1

How can I remove all HTML tags, including &nbsp;, using a regex in C#? My string looks like this:

  "<div>hello</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div>"

Question 2

If you can't use an HTML-parser-based solution to filter out the tags, here is a simple regex for it.

string noHTML = Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", "").Trim();

Ideally, you should make another pass with a regex filter that takes care of runs of multiple whitespace characters.

string noHTMLNormalised = Regex.Replace(noHTML, @"\s{2,}", " ");

Question 3

I took @Ravi Thapliyal's code and made a method out of it. It is simple and might not clean everything, but so far it is doing what I need it to do.

public static string ScrubHtml(string value) {
    var step1 = Regex.Replace(value, @"<[^>]+>|&nbsp;", "").Trim();
    var step2 = Regex.Replace(step1, @"\s{2,}", " ");
    return step2;
}
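
As a quick check of the method above, here is a minimal sketch that runs it on a shortened version of the string from the question. The class name `ScrubHtmlDemo` and the shortened input are made up for illustration; only `ScrubHtml` comes from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScrubHtmlDemo
{
    // Same two passes as the method above: strip tags and &nbsp;, then collapse whitespace.
    public static string ScrubHtml(string value)
    {
        var step1 = Regex.Replace(value, @"<[^>]+>|&nbsp;", "").Trim();
        var step2 = Regex.Replace(step1, @"\s{2,}", " ");
        return step2;
    }

    public static void Main()
    {
        // Shortened, hypothetical version of the question's input.
        var input = "<div>hello</div><div><br></div><div>&nbsp; &nbsp;&nbsp;</div><div><br></div>";
        Console.WriteLine("[" + ScrubHtml(input) + "]"); // [hello]
    }
}
```

The literal space left between the removed &nbsp; entities ends up at the end of the string, so the `Trim()` in the first pass takes care of it.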

Question 4

I've been using this function for a while. It removes pretty much any messy HTML you can throw at it and leaves the text intact.

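        // Requires: using System; using System.Text;
        // using System.Text.RegularExpressions; using System.Web;
        private static readonly Regex _tags_ = new Regex(@"<[^>]+?>", RegexOptions.Multiline | RegexOptions.Compiled);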

        //add characters that should not be removed to this regex
        private static readonly Regex _notOkCharacter_ = new Regex(@"[^\w;&#@.:/\\?=|%!() -]", RegexOptions.Compiled);

        public static String UnHtml(String html)
        {
            html = HttpUtility.UrlDecode(html);
            html = HttpUtility.HtmlDecode(html);

            html = RemoveTag(html, "<!--", "-->");
            html = RemoveTag(html, "<script", "</script>");
            html = RemoveTag(html, "<style", "</style>");

            //replace matches of these regexes with space
            html = _tags_.Replace(html, " ");
            html = _notOkCharacter_.Replace(html, " ");
            html = SingleSpacedTrim(html);

            return html;
        }

        private static String RemoveTag(String html, String startTag, String endTag)
        {
            Boolean bAgain;
            do
            {
                bAgain = false;
                Int32 startTagPos = html.IndexOf(startTag, 0, StringComparison.CurrentCultureIgnoreCase);
                if (startTagPos < 0)
                    continue;
                Int32 endTagPos = html.IndexOf(endTag, startTagPos + 1, StringComparison.CurrentCultureIgnoreCase);
                if (endTagPos <= startTagPos)
                    continue;
                html = html.Remove(startTagPos, endTagPos - startTagPos + endTag.Length);
                bAgain = true;
            } while (bAgain);
            return html;
        }

        private static String SingleSpacedTrim(String inString)
        {
            StringBuilder sb = new StringBuilder();
            Boolean inBlanks = false;
            foreach (Char c in inString)
            {
                switch (c)
                {
                    case '\r':
                    case '\n':
                    case '\t':
                    case ' ':
                        if (!inBlanks)
                        {
                            inBlanks = true;
                            sb.Append(' ');
                        }
                        continue;
                    default:
                        inBlanks = false;
                        sb.Append(c);
                        break;
                }
            }
            return sb.ToString().Trim();
        }

Question 5

var noHtml = Regex.Replace(inputHTML, @"<[^>]*(>|$)|&nbsp;|&zwnj;|&raquo;|&laquo;", string.Empty).Trim();
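
The `(>|$)` part of this pattern seems intended to also catch a tag that is cut off at the end of the input. Below is a rough sketch comparing it with the stricter `<[^>]+>` pattern from the earlier answers. The class and method names (`UnclosedTagDemo`, `StripStrict`, `StripLenient`) and the sample input are assumptions for the demo, not part of the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnclosedTagDemo
{
    // Pattern from the earlier answers: only matches tags that are closed with '>'.
    public static string StripStrict(string html) =>
        Regex.Replace(html, @"<[^>]+>|&nbsp;", string.Empty).Trim();

    // Pattern from this answer: also matches a '<...' run that reaches the end of the input.
    public static string StripLenient(string html) =>
        Regex.Replace(html, @"<[^>]*(>|$)|&nbsp;|&zwnj;|&raquo;|&laquo;", string.Empty).Trim();

    public static void Main()
    {
        // Hypothetical input whose last tag was truncated.
        var truncated = "<p>hello</p><div cla";
        Console.WriteLine(StripStrict(truncated));  // hello<div cla
        Console.WriteLine(StripLenient(truncated)); // hello
    }
}
```

One caveat: with the lenient pattern, a literal `<` in plain text that is never followed by `>` (for example `a < b`) causes everything from that `<` to the end of the string to be removed.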

Question 6

I used @RaviThapliyal's and @Don Rolling's code but made a small modification. They replace &nbsp with an empty string, whereas &nbsp should be replaced with a space, so I added an extra step. It worked like a charm for me.

public static string FormatString(string value) {
    var step1 = Regex.Replace(value, @"<[^>]+>", "").Trim();
    var step2 = Regex.Replace(step1, @"&nbsp;", " ");
    var step3 = Regex.Replace(step2, @"\s{2,}", " ");
    return step3;
}
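
To illustrate why the replacement matters, here is a small sketch comparing the two behaviours. The names `NbspDemo`, `DropNbsp`, and `NbspToSpace` are made up for this example. Note that `NbspToSpace` differs slightly from `FormatString` above: it trims last, because trimming before the &nbsp; replacement would leave a leading space for this input.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NbspDemo
{
    // Variant that drops &nbsp; entirely (same pattern as the earlier answers).
    public static string DropNbsp(string value)
    {
        var step1 = Regex.Replace(value, @"<[^>]+>|&nbsp;", "").Trim();
        return Regex.Replace(step1, @"\s{2,}", " ");
    }

    // Variant that turns &nbsp; into a space, collapses runs, and trims at the very end.
    public static string NbspToSpace(string value)
    {
        var step1 = Regex.Replace(value, @"<[^>]+>", "");
        var step2 = Regex.Replace(step1, @"&nbsp;", " ");
        return Regex.Replace(step2, @"\s{2,}", " ").Trim();
    }

    public static void Main()
    {
        // Hypothetical input where &nbsp; is the only separator between words.
        var input = "<p>&nbsp;hello&nbsp;world</p>";
        Console.WriteLine("[" + DropNbsp(input) + "]");    // [helloworld]
        Console.WriteLine("[" + NbspToSpace(input) + "]"); // [hello world]
    }
}
```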

I wrote &nbsp without the semicolon in the text because Stack Overflow would otherwise format it.

Question 7

This:

(<.+?>|&nbsp;)

will match any tag or &nbsp;:

string regex = @"(<.+?>|&nbsp;)";
var x = Regex.Replace(originalString, regex, "").Trim();

After that, x = hello
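
One thing to be aware of with `.+?`: by default `.` does not match a newline in .NET, so a tag that spans two lines is not matched unless `RegexOptions.Singleline` is passed. A minimal sketch of the difference follows; the class name, method names, and sample input are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotNewlineDemo
{
    // The pattern from this answer with default options: '.' stops at '\n'.
    public static string StripDot(string html) =>
        Regex.Replace(html, @"(<.+?>|&nbsp;)", "").Trim();

    // Same pattern, but Singleline lets '.' match '\n' as well.
    public static string StripDotSingleline(string html) =>
        Regex.Replace(html, @"(<.+?>|&nbsp;)", "", RegexOptions.Singleline).Trim();

    public static void Main()
    {
        // Hypothetical input with a line break inside the opening tag.
        var input = "<div\nclass=\"x\">hello</div>";
        Console.WriteLine(StripDot(input) == "<div\nclass=\"x\">hello"); // True
        Console.WriteLine(StripDotSingleline(input) == "hello");         // True
    }
}
```

The `<[^>]+>` pattern used in the other answers does not have this issue, since a negated character class matches newlines too.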

Question 8

Sanitizing an HTML document involves a lot of tricky things. This package may be of help:
https://github.com/mganss/HtmlSanitizer
