Regular expression for finding URLs

5blabla5 · 5 feb 2011

I'm trying to write a script that extracts all URLs from a page and then processes them. I've reached the part where the page content has to be searched with preg_match(), but I have no experience with regular expressions at all.

Script:

PHP:

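<?php

// File finder

$content = '<html><head></head><body><img src="http://computertotaal.nl/upload/1028119_610_1193841865355-photoshop30-2.png" /></body></html>';

// Search the content for a URL; preg_match() stores what it finds in $matches.
$find    = preg_match('|^http(s)?://[a-z0-9-]+(.[a-z0-9-]+)*(:[0-9]+)?(/.*)?$|i', $content, $matches);

// Initialise the result list so the second loop never sees an undefined variable.
$found   = array();

if($find) {

	// Keep only the entries that look like a complete URL on their own.
	foreach($matches as $match) {
		if(preg_match('|^http(s)?://[a-z0-9-]+(.[a-z0-9-]+)*(:[0-9]+)?(/.*)?$|i', $match)) {
			$found[] = $match;
		}
	}

	foreach($found as $current) {
		echo $current . '<br />';
	}

} else {
	echo 'Failed';
}

?>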

Output:
Failed

The problem is probably in the regular expression, because if I put only the URL in $content, without the surrounding HTML, it does return the URL. Does anyone have an idea what is wrong?

Supersnail · 5 feb 2011

Try the regular expression without the '^' (start of the line) and '$' (end of the line) characters.
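To illustrate the suggestion, here is a minimal sketch that runs the pattern from the first post twice on the same sample HTML, once with and once without the anchors. The variable names `$anchored`, `$unanchored`, `$anchoredMatches` and `$unanchoredMatches` are made up for this example. Without the `m` modifier, '^' and '$' tie the pattern to the start and end of the whole string, so an anchored pattern can only succeed when the string contains nothing but a URL.

```php
<?php
// Sample input taken from the first post: a URL embedded in HTML.
$content = '<html><head></head><body><img src="http://computertotaal.nl/upload/1028119_610_1193841865355-photoshop30-2.png" /></body></html>';

// Anchored: the whole string would have to be a URL, and it starts with "<html>".
$anchored = preg_match('|^http(s)?://[a-z0-9-]+(.[a-z0-9-]+)*(:[0-9]+)?(/.*)?$|i', $content, $anchoredMatches);

// Unanchored: the pattern may match anywhere inside the string.
$unanchored = preg_match('|http(s)?://[a-z0-9-]+(.[a-z0-9-]+)*(:[0-9]+)?(/.*)?|i', $content, $unanchoredMatches);

var_dump($anchored);   // int(0): no match
var_dump($unanchored); // int(1): one match

// $unanchoredMatches[0] holds the full match; the other entries are the
// contents of the capture groups, not additional URLs.
echo $unanchoredMatches[0], PHP_EOL;
```

Note that preg_match() stops after the first match. To collect every URL in a page, preg_match_all() is the function to use.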

5blabla5 · 5 feb 2011

Thanks, that works better already... But now I sometimes get strange results. For example, when I fetch http://www.simpletest.org/ and parse it with the following script:

PHP:

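<?php

// File finder

$content = file_get_contents("http://www.simpletest.org/");

// Collect every match in the page; $matches[0] holds the full matches.
$find    = preg_match_all('|http(s)?://[a-z0-9-]+(.[a-z0-9-]+)*(:[0-9]+)?(/.*)?|i', $content, $matches);

var_dump($matches[0]);

echo '<br /><br />';

// Initialise the result list so the second loop never sees an undefined variable.
$found   = array();

if($find) {

	// Keep only the matches that look like a complete URL on their own.
	foreach($matches[0] as $match) {
		if(preg_match('|^http(s)?://[a-z0-9-]+(.[a-z0-9-]+)*(:[0-9]+)?(/.*)?$|i', $match)) {
			$found[] = $match;
		}
	}

	foreach($found as $current) {
		echo $current . '<br />';
	}

} else {
	echo 'Failed';
}

?>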

I get this output:

Code:

array(29) { [0]=> string(55) "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [1]=> string(28) "http://www.w3.org/1999/xhtml" [2]=> string(32) "http://simpletest.org/index.html" [3]=> string(110) "http://sourceforge.net/projects/simpletest/files/simpletest/simpletest_1.1/simpletest_1.1alpha.tar.gz/download" [4]=> string(60) "https://sourceforge.net/project/showfiles.php?group_id=76550" [5]=> string(49) "http://simpletest.org/api/">the complete API." [6]=> string(82) "https://sourceforge.net/mail/?group_id=76550">SimpleTest support mailing-list." [7]=> string(102) "https://sourceforge.net/tracker/?atid=547458&group_id=76550&func=browse">features tracker." [8]=> string(92) "https://sourceforge.net/tracker/?atid=547455&group_id=76550&func=browse">bug and" [9]=> string(100) "https://sourceforge.net/tracker/?group_id=76550&atid=547457">patches trackers can be useful." [10]=> string(42) "http://sourceforge.net/projects/simpletest" [11]=> string(59) "http://sourceforge.net/sflogo.php?group_id=76550&type=1" [12]=> string(110) "http://sourceforge.net/projects/simpletest/files/simpletest/simpletest_1.1/simpletest_1.1alpha.tar.gz/download" [13]=> string(61) "http://sourceforge.net/projects/simpletest/">SourceForge." 
[14]=> string(40) "http://www.junit.org/">JUnit will be" [15]=> string(46) "http://jwebunit.sourceforge.net/">JWebUnit" [16]=> string(46) "http://simpletest.org/api/">documented API" [17]=> string(39) "http://en.wikipedia.org/wiki/SimpleTest" [18]=> string(77) "http://www.developerspot.com/tutorials/php/test-driven-development/page1.html" [19]=> string(61) "http://onpk.net/talks/fosdem2005/introduction_simpletest.html" [20]=> string(36) "http://www.devpapers.com/article/303" [21]=> string(28) "http://drupal.org/simpletest" [22]=> string(53) "http://blog.casey-sweat.us/?p=72">Live TDD demo :" [23]=> string(49) "http://www.phparch.com/shop_product.php?itemid=96" [24]=> string(63) "http://www.amazon.com/gp/product/0973589825/102-3523235-1803315" [25]=> string(57) "http://www.sitepoint.com/books/phpant1/">SitePoint | " [26]=> string(63) "http://www.amazon.com/gp/product/0957921845/102-3523235-1803315" [27]=> string(117) "http://developers.slashdot.org/article.pl?sid=04/08/04/1516258&tid=169&tid=192&tid=218&mode=nocomment" [28]=> string(63) "https://sourceforge.net/mail/?group_id=76550">mailing-list," } 

http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
http://www.w3.org/1999/xhtml
http://simpletest.org/index.html
http://sourceforge.net/projects/simpletest/files/simpletest/simpletest_1.1/simpletest_1.1alpha.tar.gz/download
https://sourceforge.net/project/showfiles.php?group_id=76550
http://simpletest.org/api/">the complete API.
https://sourceforge.net/mail/?group_id=76550">SimpleTest support mailing-list.
https://sourceforge.net/tracker/?atid=547458&group_id=76550&func=browse">features tracker.
https://sourceforge.net/tracker/?atid=547455&group_id=76550&func=browse">bug and
https://sourceforge.net/tracker/?group_id=76550&atid=547457">patches trackers can be useful.
http://sourceforge.net/projects/simpletest
http://sourceforge.net/sflogo.php?group_id=76550&type=1
http://sourceforge.net/projects/simpletest/files/simpletest/simpletest_1.1/simpletest_1.1alpha.tar.gz/download
http://sourceforge.net/projects/simpletest/">SourceForge.
http://www.junit.org/">JUnit will be
http://jwebunit.sourceforge.net/">JWebUnit
http://simpletest.org/api/">documented API
http://en.wikipedia.org/wiki/SimpleTest
http://www.developerspot.com/tutorials/php/test-driven-development/page1.html
http://onpk.net/talks/fosdem2005/introduction_simpletest.html
http://www.devpapers.com/article/303
http://drupal.org/simpletest
http://blog.casey-sweat.us/?p=72">Live TDD demo :
http://www.phparch.com/shop_product.php?itemid=96
http://www.amazon.com/gp/product/0973589825/102-3523235-1803315
http://www.sitepoint.com/books/phpant1/">SitePoint | 
http://www.amazon.com/gp/product/0957921845/102-3523235-1803315
http://developers.slashdot.org/article.pl?sid=04/08/04/1516258&tid=169&tid=192&tid=218&mode=nocomment
https://sourceforge.net/mail/?group_id=76550">mailing-list,

Some of these links are not correct... Does anyone know how I can get them right, and how I can retrieve all valid links in a web page (including relative URLs)?
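A likely cause of the trailing `">the complete API.` fragments is the `(/.*)?` part of the pattern: `.*` is greedy and keeps matching up to the end of the line, so it swallows the closing quote and the link text. The unescaped `.` in `(.[a-z0-9-]+)*` also matches any character, not only a literal dot. Below is a minimal sketch of two possible ways around this, not a definitive solution. The function names `findAbsoluteUrls` and `findLinkAttributes` are hypothetical, and the choice of tags to inspect is an assumption. The first function tightens the regex so the path stops at whitespace, quotes and angle brackets. The second parses the HTML with PHP's DOM extension and reads the `href` and `src` attributes, which also returns relative URLs.

```php
<?php
// 1) Tightened regex: the dot is escaped and the path stops at whitespace,
//    quotes and angle brackets instead of running to the end of the line.
function findAbsoluteUrls(string $html): array
{
    $pattern = '~https?://[a-z0-9-]+(?:\.[a-z0-9-]+)*(?::[0-9]+)?(?:/[^\s"\'<>]*)?~i';
    preg_match_all($pattern, $html, $matches);

    return $matches[0]; // full matches only
}

// 2) DOM parsing: read the attribute values as they are written in the page,
//    so relative URLs are included.
function findLinkAttributes(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is often not well-formed; collect parser warnings
    // internally instead of printing them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Assumption: these four tag/attribute pairs are the ones of interest.
    $sources = ['a' => 'href', 'link' => 'href', 'img' => 'src', 'script' => 'src'];

    $urls = [];
    foreach ($sources as $tag => $attribute) {
        foreach ($doc->getElementsByTagName($tag) as $element) {
            if ($element->hasAttribute($attribute)) {
                $urls[] = $element->getAttribute($attribute);
            }
        }
    }

    return $urls;
}
```

Either function can be called with the result of file_get_contents(). The DOM variant returns the values grouped by tag rather than in document order, and it returns relative URLs exactly as written; resolving them against the address of the page is a separate step that this sketch leaves out.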
