thinkphp自动采集怎么实现

互联网 19-8-22

thinkphp实现自动采集功能的三种方法：

方法一：QueryList

个人感觉比较好用，采集详情比较不错的选择，但是采集复杂一点的列表，不好用。具体使用：

控制器示例：

public function index(){     // 使用采集类     // 使用手册 ：http://www.php.cn/php/php-QueryList3-ThinkPHP.html     import('Org.QL.QueryList');     $url = "http://www.zyctd.com/gqqg/";     $reg = array();     $reg['title'] = array('.sulist_title','text');     $reg['shuliang'] = array('.su_li1','html');     $obj = new \QueryList($url,$reg);     $data = $obj->jsonArr;     // foreach($data as $v){     //     echo "<br>".$v['title'].'___'.$v['shuliang']."<br>";     // }     p($data); }

相关推荐：《ThinkPHP教程》

方法二：simple_html_dom

这个方法比较适合采集一点结构简单的页面，HTML标签的类名比较明确的页面，还不错。具体使用：

控制器示例：

public function index(){     // 参考文档：http://microphp.us/plugins/public/microphp_res/simple_html_dom/manual.htm#section_quickstart     // 下载地址：https://github.com/samacs/simple_html_dom/edit/master/simple_html_dom.php     // 使用方法：http://www.thinkphp.cn/topic/21635.html     import("Org.Util.simple_html_dom", '', '.php');     $html = file_get_html('http://www.zyctd.com/gqqg/');     $ret = $html->find('.supply_list_box ul',0)->first_child();     foreach($ret as $v){         echo $v;     }; }

方法三：获取页面HTMl，进行正则匹配采集

举例一个Demo：

采集一个页面：

http://www.zyctd.com/gqqg/

我要获取上面的四个信息：标题，数量，时间，跳转链接。

获取这些信息，通过上面两种方法都采集不到，最后才选用的正则来采集。具体方法：

public function index(){     $url = "http://www.zyctd.com/gqqg/";     // http://www.zyctd.com/gqqg-p1.html     $supplyDB = M('supply');         $urlList = array();     $array = array();     for($x=1; $x<=1; $x++) {         array_push($urlList,"http://www.zyctd.com/gqqg-p".$x.".html");     };             foreach($urlList as $v){         $curPageList = $this->getInfo($v);         array_push($array,$curPageList);     };     foreach($array as $v){         foreach($v as $vv){             //echo $vv['title']."__".$vv['weight']."__".$vv['time']."<br>";             $data = array();             $data['title'] = $vv['title'];             $data['weight'] = $vv['weight'];             $data['add_time'] = $vv['add_time'];             $data['url'] = $vv['url'];             //$res = $supplyDB->add($data);             //echo $res;             echo "<p><span style='display:inline-block; width:110px;'>".$vv['title']."</span>             <span style='display:inline-block; width:110px;'>".$vv['weight']."</span>             <span style='display:inline-block; width:110px;'>".$vv['add_time']."</span>             <span style='display:inline-block; width:110px;'>".$vv['url']."</span></p>";         }     }         // 获取信息         //$curPageList = $this->getInfo($html);         //p($curPageList); } private function getInfo($url){     $html = $this->getHtml($url);     $array = array();     // 匹配所有的标题     preg_match_all("#<divclass=\"sulist_title\"><i></i><span>(.*?)</span></div>#",$html,$matches);     $all_title = $matches[1];     preg_match_all("#<i>发布时间：</i><span>(.*?)</span>#",$html,$matches);     // 匹配所有的发布时间     $all_time = $matches[1];     // 匹配所有的求购数量     preg_match_all("#<i>求购数量：</i><span>(.*?)</span>#",$html,$matches);     $all_weight = $matches[1];     // 匹配跳转链接     preg_match_all("#<atarget=\"_blank\"href=\"(.*?)\">#",$html,$matches);     $all_url = $matches[1];     // 组合     foreach($all_title as $k => $v){         $arr = array();         $arr['title'] = $v;         $arr['weight'] = $all_weight[$k];         $arr['add_time'] = $all_time[$k];         $arr['url'] = $all_url[$k];         array_push($array,$arr);     }     return $array; } private function getHtml($url){     $html = file_get_contents($url);     $html = preg_replace("#\n#","",$html);     $html = preg_replace("#\r#","",$html);     $html = preg_replace("#\\s#","",$html);     return $html; }

以上就是thinkphp自动采集怎么实现的详细内容，更多内容请关注技术你好其它相关文章！

来源链接：

免责声明：
1.资讯内容不构成投资建议，投资者应独立决策并自行承担风险
2.本文版权归属原作所有，仅代表作者本人观点，不代表本站的观点或立场

标签：自动采集