Extracting all href and src values from a web page with a regular expression

Date: 2016-11-15 13:03:15  Author: 棋牌资源网  Source: 棋牌资源网
Given a fetched page, the code below uses a regular expression to match the href and src attributes it contains.
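    // Namespaces used below: System, System.Net, System.Text, System.Text.RegularExpressions.
    string UserAgent = "Mozilla/5.0 (Windows NT 5.2; rv:29.0) Gecko/20100101 Firefox/29.0";
    string ContentType = "";

    // Base address of the site being fetched; request paths are resolved against it.
    Uri strReqUrl = new Uri("http://m.lhrb.ufstone.net/");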
    protected void Application_BeginRequest(object sender, EventArgs e)
    {
        // Map the incoming request path onto the remote site and download that page.
        Uri u = new Uri(strReqUrl, Request.RawUrl);
        byte[] b = getVerificationCode(u);

        // getVerificationCode returns null when the download fails.
        if (b == null)
        {
            Response.StatusCode = 502;
            Response.End();
            return;
        }

        //MemoryStream ms = new MemoryStream(b);
        //Response.ClearContent();
        //Response.ContentType = ContentType;
        //Response.BinaryWrite(b);

        // The remote page is assumed to be encoded as gb2312.
        StringBuilder strHtml = new StringBuilder(Encoding.GetEncoding("gb2312").GetString(b));
        GetHtmlUrl(ref strHtml);
        Response.Write(strHtml.ToString());
        Response.End();
    }
    // Downloads the raw bytes at url and records the response Content-Type.
    // Returns null if the request fails.
    public byte[] getVerificationCode(Uri url)
    {
        using (WebClient MyWebClient = new WebClient())
        {
            MyWebClient.Headers.Add("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
            MyWebClient.Headers.Add("Accept-Language", "zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3");
            MyWebClient.Headers.Add("User-Agent", this.UserAgent);
            MyWebClient.Credentials = CredentialCache.DefaultCredentials;
            try
            {
                byte[] pageData = MyWebClient.DownloadData(url.AbsoluteUri);
                ContentType = MyWebClient.ResponseHeaders["Content-Type"];
                return pageData;
            }
            catch (WebException)
            {
                // Network or HTTP error: signal failure to the caller.
                return null;
            }
        }
    }
 
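    // Scans the HTML for src and href attributes and collects each match.
    void GetHtmlUrl(ref StringBuilder strHtml)
    {
        //string headstr = "(src|href)=", endstr = "(\")";
        //string reg = @"(?<=" + headstr + ")(.*?)(?=" + endstr + ")";

        // "attr" captures the attribute name and "url" its value, which may be
        // double-quoted, single-quoted, or unquoted. Separate named groups avoid
        // the clash between an unnamed first group and an explicit group 1.
        string reg = "\\b(?<attr>src|href)\\s*=\\s*(?:\"(?<url>[^\"]*)\"|'(?<url>[^']*)'|(?<url>[^\\s>]+))";
        Regex r = new Regex(reg, RegexOptions.IgnoreCase);
        Match match = r.Match(strHtml.ToString());
        StringBuilder sb = new StringBuilder(); // collected matches; not used further here
        while (match.Success)
        {
            //sb.Append(match.Groups["url"].Value + "\n"); // the attribute value only
            //sb.Append(match.Groups["text"].Value + "\n"); // the content between <a> and </a>

            sb.Append(match + "\n"); // the whole match, e.g. href="..."

            // Disabled attempt to rewrite each URL relative to strReqUrl:
            //try
            //{
            //    Uri u = new Uri(strReqUrl, match.Value.Replace("\"", "").Replace("'", ""));
            //    strHtml.Replace(match.Value, @"/" + u.ToString().Replace(strReqUrl.ToString(), ""));
            //}
            //catch
            //{
            //}

            // Advance last, so code above always sees the current match.
            match = match.NextMatch();
        }
    }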
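
The same matching idea can be packaged as a standalone helper that returns the attribute values instead of discarding them. The following is only a minimal sketch: the names LinkExtractor and ExtractLinks are hypothetical and not part of the original code, and a regular expression will also pick up src= or href= text that appears inside scripts or comments, so an HTML parser may be a better fit for untrusted or irregular pages.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not from the original article.
public static class LinkExtractor
{
    // Matches src/href attributes whose value is double-quoted,
    // single-quoted, or unquoted; the value lands in the "url" group.
    private static readonly Regex AttrRegex = new Regex(
        "\\b(?<attr>src|href)\\s*=\\s*(?:\"(?<url>[^\"]*)\"|'(?<url>[^']*)'|(?<url>[^\\s>]+))",
        RegexOptions.IgnoreCase);

    // Returns every src/href value found in html, resolved against baseUri.
    // Values that cannot be turned into a URI are skipped.
    public static List<Uri> ExtractLinks(string html, Uri baseUri)
    {
        var links = new List<Uri>();
        foreach (Match m in AttrRegex.Matches(html))
        {
            Uri resolved;
            // TryCreate avoids an exception on malformed values.
            if (Uri.TryCreate(baseUri, m.Groups["url"].Value, out resolved))
            {
                links.Add(resolved);
            }
        }
        return links;
    }
}
```

For example, calling LinkExtractor.ExtractLinks(html, new Uri("http://example.com/dir/")) on a page containing href="/a.html" and src='b.png' would be expected to yield http://example.com/a.html and http://example.com/dir/b.png (example.com is a placeholder address).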
