世洲企業社, Q&A

Converting a text string to strip HTML formatting

[Date]: 2018/06/04 [Views]: 381

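import re

def rm_tags(text):

    # Remove HTML tags.
    text = re.sub(r'<[^>]+>', '', text)

    # Replace runs of non-ASCII characters with a space
    # (ASCII ends at \x7F, not \x97).
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)

    # Remove URLs (up to the next whitespace, not the rest of the line).
    text = re.sub(r'https?://\S+', ' ', text)

    # Remove special characters.
    text = re.sub(r'[?!+%{}:;.,"\'()\[\]_]', '', text)

    # Collapse runs of whitespace into a single space.
    text = re.sub(r'\s+', ' ', text)

    return text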
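
A regular expression such as `<[^>]+>` removes tags but leaves character references like `&amp;` untouched. As an alternative for the tag-removal step only, here is a minimal sketch that uses the standard library's `html.parser` instead of a regex. The names `TextExtractor` and `strip_html` are hypothetical and not part of the original post, and the sketch assumes that keeping only the text nodes is what is wanted. Note that text inside `<script>` or `<style>` elements is kept as well.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True decodes references such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are never stored.
        self.parts.append(data)


def strip_html(text):
    """Return the text content of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    # str.split() with no arguments splits on any whitespace run.
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    print(strip_html("<p>Hello <b>world</b> &amp; more</p>"))
```

The result of `strip_html` could then be passed through the remaining steps (non-ASCII, URL, and special-character removal) if those are still needed.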