C# filter JS files from HttpWebRequest/WebResponse
I searched but could not find anything that worked for me.
A while ago I started with C#, and my first personal project is a simple web crawler. It should check the source code for special strings to identify whether, for example, Google Analytics or something similar is included.
That works fine, but of course I'm missing the JS and iframes, since HttpWebRequest does not render the website as far as I know.
So I wanted to check for "<script src="", for example, and get the URL through Split. That does not work as expected, and I don't think it is a clean or good way to do it.
Since I'm checking for strings, the check can be broken by changing "<script" to "< script", for example, but I have no idea how to get a specific string out of a big string.
I found regular expressions (regex) and Split, but I'm not sure whether I can use regex and Split here, since there are more variants of "src=" or of Split("\"", "\"", text).
I don't want a "here you go"; of course I want to understand it and do it myself, but I have no idea where to go from here.
Sorry for the long text and the lack of examples. At the moment I have no access to the code, and there is not much in it except the regex and Split calls.
Edit: I think I'll create a class that checks every char for a special sequence.
Best, Mike
Try the HTML Agility Pack.
I haven't used it personally, but this should work (I haven't tested it):
    using System.IO;
    using System.Linq;
    using System.Net;

    string url = "some/url";
    var request = (HttpWebRequest)HttpWebRequest.Create(url);
    var webResponse = (HttpWebResponse)request.GetResponse();
    var responseStream = webResponse.GetResponseStream();
    var streamReader = new StreamReader(responseStream);

    // Parse the downloaded markup instead of searching it as a raw string.
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(streamReader.ReadToEnd());

    // Collect every <script> element in the document.
    var scripts = doc.DocumentNode.Descendants()
        .Where(n => n.Name == "script");

This should give you all the script nodes, so you can do with them what you want =)
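Since the question also asks about the URLs behind "src=" and about iframes, here is a rough sketch of how the same idea might be extended. It assumes the HtmlAgilityPack NuGet package is installed; the class name SourceFinder, the method FindSources and the example.com URLs are made up for illustration and are not part of the original answer. It parses a hard-coded string rather than a live response, so it has not been checked against a real site.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

// Hypothetical helper, named here only for illustration.
public static class SourceFinder
{
    // Returns the src value of every <script> and <iframe> element that has one.
    public static List<string> FindSources(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode.Descendants()
            // The parser exposes element names in lower case, so <SCRIPT> matches too.
            .Where(n => n.Name == "script" || n.Name == "iframe")
            // GetAttributeValue returns the given default when the attribute is missing.
            .Select(n => n.GetAttributeValue("src", ""))
            // Inline scripts have no src, so drop the empty entries.
            .Where(src => src.Length > 0)
            .ToList();
    }
}

public static class Program
{
    public static void Main()
    {
        // Sample markup with mixed casing and quoting that a plain string search could miss.
        string html =
            "<html><head>" +
            "<SCRIPT type='text/javascript' SRC='https://example.com/analytics.js'></SCRIPT>" +
            "<script>var inline = 1;</script>" +
            "</head><body>" +
            "<iframe src=\"https://example.com/frame.html\"></iframe>" +
            "</body></html>";

        foreach (string src in SourceFinder.FindSources(html))
        {
            Console.WriteLine(src);
        }
    }
}
```

Because the parser works on elements and attributes rather than on raw text, differences in casing, quoting or attribute order should not matter the way they do with Split. Each returned URL could then be checked for the strings you are looking for, or downloaded with another HttpWebRequest if the script body itself needs to be inspected.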
c# javascript regex split