Get all URLs in a web page with ASP

Shared by: Abcdef_44 | File type: PDF | Pages: 6



We will build a simple class that retrieves all the URLs in a web page. The class has one public method, RetrieveUrls, which in turn calls two private methods: RetrieveContent and GetAllUrls. RetrieveContent sends a request to the web page and returns the content of the page. GetAllUrls uses a simple regular expression to find every URL in the page, prints each one to the screen, and also writes it to a log file.

Here is the complete code of the class:

using System;
using System.Collections.Generic;
using System.Text;
using System.Net;
using System.IO;
using System.Text.RegularExpressions;
namespace FindAllUrls
{
    class GetUrls
    {
        // public method called from your application
        public void RetrieveUrls(string webPage)
        {
            GetAllUrls(RetrieveContent(webPage));
        }

        // get the content of the web page passed in
        private string RetrieveContent(string webPage)
        {
            HttpWebResponse response = null; // used to get response
            StreamReader respStream = null;  // used to read response into string
            try
            {
                // create a request object using the url passed in
                HttpWebRequest request = (HttpWebRequest)WebRequest.Create(webPage);
                request.Timeout = 10000; // in milliseconds
                // go get a response from the page
                response = (HttpWebResponse)request.GetResponse();
                // create a StreamReader object from the response
                respStream = new StreamReader(response.GetResponseStream());
                // get the contents of the page as a string and return it
                return respStream.ReadToEnd();
            }
            catch (Exception) // houston we have a problem!
            {
                throw; // rethrow as-is so the original stack trace is kept
            }
            finally
            {
                // close it down, we're going home!
                // either object may still be null if the request failed
                if (response != null) response.Close();
                if (respStream != null) respStream.Close();
            }
        }

        // using a regular expression, find all of the href or urls
        // in the content of the page
        private void GetAllUrls(string content)
        {
            // regular expression; the named group captures the url itself
            string pattern = @"(?:href\s*=)(?:[\s""']*)(?!#|mailto|location.|javascript|.*css|.*this\.)(?<url>.*?)(?:[\s>""'])";
            // set up regex object
            Regex RegExpr = new Regex(pattern, RegexOptions.IgnoreCase);
            // get the first match
            Match match = RegExpr.Match(content);
            // loop through matches
            while (match.Success)
            {
                // output the match info
                Console.WriteLine("href match: " + match.Groups[0].Value);
                WriteToLog(@"C:\matchlog.txt", "href match: " + match.Groups[0].Value + "\r\n");
                Console.WriteLine("Url match: " + match.Groups[1].Value);
                WriteToLog(@"C:\matchlog.txt", "Url | Location | mailto match: " + match.Groups[1].Value + "\r\n");
                // get next match
                match = match.NextMatch();
            }
        }

        // write to a log file
        private void WriteToLog(string file, string message)
        {
            // the using block disposes (and closes) the writer on exit
            using (StreamWriter w = File.AppendText(file))
            {
                w.WriteLine(DateTime.Now.ToString() + ": " + message);
            }
        }
    }
}

And the code that uses the class above:

GetUrls urls = new GetUrls();
urls.RetrieveUrls("http://www.microsoft.com");
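To try the matching step without a network request or a log file, the regular expression can be run against an in-memory HTML string. The following is only a minimal sketch under stated assumptions: the names FindAllUrlsSketch, UrlExtractor, Extract and Demo are hypothetical and not part of the class above, the sample HTML is made up, and the capture group is assumed to be a named group called url. With this pattern, a link excluded by the negative lookahead (such as #top or mailto:) can still match with an empty capture, because the optional quote prefix may backtrack to zero characters. The sketch skips those empty captures.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

namespace FindAllUrlsSketch
{
    // Hypothetical helper: returns the URLs instead of printing and logging them.
    static class UrlExtractor
    {
        // Same pattern as in GetAllUrls; the group name "url" is an assumption.
        private static readonly Regex HrefRegex = new Regex(
            @"(?:href\s*=)(?:[\s""']*)(?!#|mailto|location.|javascript|.*css|.*this\.)(?<url>.*?)(?:[\s>""'])",
            RegexOptions.IgnoreCase);

        public static List<string> Extract(string content)
        {
            var urls = new List<string>();
            foreach (Match match in HrefRegex.Matches(content))
            {
                string url = match.Groups["url"].Value;
                // Excluded links can still produce a match with an empty
                // capture (see the note above), so skip those.
                if (url.Length > 0)
                {
                    urls.Add(url);
                }
            }
            return urls;
        }
    }

    static class Demo
    {
        static void Main()
        {
            // Made-up sample page, one link per line.
            string html =
                "<a href=\"http://example.com/a\">A</a>\n" +
                "<a href='#top'>Top</a>\n" +
                "<a href=\"mailto:someone@example.com\">Mail</a>\n" +
                "<a href=\"/docs/index.html\">Docs</a>";

            // Prints:
            //   http://example.com/a
            //   /docs/index.html
            foreach (string url in UrlExtractor.Extract(html))
            {
                Console.WriteLine(url);
            }
        }
    }
}
```

Returning a list rather than writing to the console keeps the matching logic easy to test on its own. Relative links such as /docs/index.html come back exactly as written in the page and are not resolved against the page address.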