Last modified: 2018-06-05 23:16:26

Strings and Regular Expressions

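  1. Introduction

    What is expected of a PHP developer:

    1) Mastery of PHP

    2) Familiarity with Linux and shell

    3) Proficiency in JavaScript

    4) Data structures and algorithm design

    5) Proficiency with MySQL and NoSQL technologies

    6) The MVC pattern and frameworks such as Yii

    7) Architecture of large-scale websites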

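  2. How PHP strings are written

    In a single-quoted string nothing is interpreted except the escapes \\ and \': variables and all other special sequences stay literal. For example, '\\\\' yields two backslashes. An odd number of trailing backslashes, as in '\\\', is a syntax error, because the last backslash escapes the closing quote and the string is never terminated.

    In a string defined with double quotes or heredoc syntax, variables are expanded and the following escape sequences are also interpreted:

    1) \n, \r, \t, \v, \e, \f, \\, \$, \"

    2) \[0-7]{1,3}: a sequence matching this regular expression is a character written in octal notation;

    3) \x[0-9A-Fa-f]{1,2}: a sequence matching this regular expression is a character written in hexadecimal notation.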
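
    The quoting rules above can be checked with a short script. This is a minimal sketch; the variable names are made up for the illustration.

```php
<?php
$name = 'PHP';

// Single quotes: $name and \n stay literal (10 bytes in total).
$single = 'Hi $name\n';
// Double quotes: $name is expanded and \n becomes one newline byte (7 bytes).
$double = "Hi $name\n";

var_dump(strlen($single));   // int(10)
var_dump(strlen($double));   // int(7)

// Four backslashes inside single quotes produce two backslash characters.
var_dump(strlen('\\\\'));    // int(2)

// The octal and hexadecimal escapes below both produce the byte for "A".
var_dump("\101" === 'A');    // bool(true)
var_dump("\x41" === 'A');    // bool(true)
```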

  3. How a string is defined in C inside the PHP engine

  4. Simplified struct definition

    struct {
        char *val; // pointer to the string's bytes
        int len;   // length of the string in bytes
    } str;
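  5. Notes on strings in C

    1) In C, a string held through a char pointer can be indexed like a character array with [], and PHP supports the same form of access, e.g. $str[0]. PHP also used to accept curly braces, $str{0}, but that form was deprecated in PHP 7.4 and removed in PHP 8.0.

    2) According to the manual, a string can be as large as 2 GB; since PHP 7.0 that limit applies only to 32-bit builds.

    3) Most common string functions work byte by byte. That is how C handles strings, and PHP did not change it, so a multibyte character counts as several units.

    4) PHP strings are binary safe. C marks the end of a string with a \0 byte, so data with a NUL byte in the middle, such as the bytes 1111, NUL, 333, is cut off at the NUL (not binary safe). PHP instead takes the end of the string from the len field of the struct, so the problem does not arise.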
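
    Points 3) and 4) can be observed directly. The sketch below assumes PHP 7.0 or later (for the \u{...} escape) and uses made-up variable names.

```php
<?php
// A NUL byte in the middle does not end a PHP string.
// The literal is split in two because "\0333" would be read as the
// octal escape \033 followed by the character 3.
$bin = "1111\0" . "333";
var_dump(strlen($bin));      // int(8)
var_dump(substr($bin, 5));   // string(3) "333"

// strlen() counts bytes, not characters: U+4E2D is a single
// character but occupies three bytes in UTF-8.
$han = "\u{4E2D}";
var_dump(strlen($han));      // int(3)
```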

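  6. Caveats when reading and writing strings by offset

    The characters of a string can be read and modified through a zero-based offset in square brackets, much like an array, e.g. $str = '1'; $str[100] = 'a';. Writing to an offset beyond the end of the string, as here, extends the string and pads the gap with spaces. Non-integer offsets are converted to integers, and an illegal offset type raises an error (E_NOTICE in the manual of the time; later PHP versions are stricter).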
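
    The padding behaviour is easy to see with a small offset. A minimal sketch, using a shorter string than the 100-byte example above:

```php
<?php
$str = 'abc';

// An in-range write replaces exactly one byte.
$str[1] = 'X';
var_dump($str);           // string(3) "aXc"

// Writing past the end extends the string; offsets 3 and 4 are
// filled with spaces before the new byte lands at offset 5.
$str[5] = 'Z';
var_dump($str);           // string(6) "aXc  Z"
var_dump(strlen($str));   // int(6)
```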

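  7. "Serializing" PHP values into strings

    Purpose of serialization: turning a PHP value (an array or an object, for example) into a string so that it can be stored persistently.

    Three ways to serialize:

    1) serialize(): after an object instance is serialized, unserialize() can restore that instance. Of the three, only serialize() can restore objects; arrays and other types can also be round-tripped with json_encode(). It is less portable than json_encode() (for exchanging data between PHP and Java, for example) and somewhat slower.

    2) json_encode(): highly portable (the encoded string can be used directly in other languages such as Java) and the main way to exchange data with other programming languages. An object instance that has been JSON-encoded cannot be restored to the same instance by json_decode(). json_encode() accepts only UTF-8 input and fails on GBK-encoded strings; serialize() has no such restriction.

    3) var_export($items, true): without the second argument the variable is printed directly; with it, the PHP code is returned as a string, which can serve as a file cache. Save an array to a file, include that file from another PHP script, and the array is available directly, as in the code that follows. An array can also be cached by writing its json_encode() output to a text file.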

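  8. Example code

    <?php
      // File one: write the array to cache.php as PHP source
      $arr = ['a' => 'a', 'b', 'c'];
      $str = var_export($arr, true);
      $str = '<?php return ' . $str . ';';
      file_put_contents('cache.php', $str);


    <?php
      // File two: include cache.php to get the array back
      $arr = include 'cache.php';
      var_dump($arr);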
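
    The difference between serialize() and json_encode() described in item 7 can be sketched as follows. The Point class is hypothetical and exists only for this illustration.

```php
<?php
// Hypothetical class used only to show the round trip.
class Point
{
    public $x = 1;
    public $y = 2;
}

$p = new Point();

// serialize() records the class name, so unserialize() rebuilds a Point.
$viaSerialize = unserialize(serialize($p));
var_dump($viaSerialize instanceof Point);   // bool(true)

// json_encode() keeps only the public data.
$json = json_encode($p);
var_dump($json);                            // {"x":1,"y":2}

// Decoding yields a generic stdClass object, not a Point.
$viaJson = json_decode($json);
var_dump($viaJson instanceof Point);        // bool(false)
```

    unserialize() should only be given data from a trusted source, since it can instantiate arbitrary classes.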
  9. Bits, bytes, and characters

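  10. Character sets and character encodings

    A character set is the collection of all abstract characters that a system supports.

    A character encoding is the mapping from the characters of a character set to binary numbers.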

  11. Common character encodings

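  12. How Unicode relates to UTF-8

    Unicode is designed to represent any character of any language. It assigns every letter, symbol, or ideograph a unique number, its code point, in the range U+0000 to U+10FFFF. Unicode itself is only a character set: it specifies the number for each symbol but not how that number is stored. UTF-32, UTF-16, and UTF-8 are three encoding schemes based on Unicode, each one a way of implementing it; only UTF-32 always uses 4 bytes per character.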
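
    One code point stored under the three schemes looks like this. The sketch assumes PHP 7.0 or later and an iconv build that knows the UTF-16BE and UTF-32BE names; the sample character U+4E2D is chosen for the illustration.

```php
<?php
// "\u{4E2D}" writes code point U+4E2D as a UTF-8 string.
$ch = "\u{4E2D}";

// Same character, three different byte sequences.
$utf8  = bin2hex($ch);
$utf16 = bin2hex(iconv('UTF-8', 'UTF-16BE', $ch));
$utf32 = bin2hex(iconv('UTF-8', 'UTF-32BE', $ch));

var_dump($utf8);    // string(6) "e4b8ad"
var_dump($utf16);   // string(4) "4e2d"
var_dump($utf32);   // string(8) "00004e2d"
```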

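  13. How GB2312 is encoded

    Summary of the zone-position code (区位码), the national standard code (国标码), and the internal code (机内码): the national standard code is the zone-position code with 0x20 added to each byte, and the internal code is the national standard code with 0x80 added to each byte.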
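
    As a worked example of the three codes, the sketch below starts from zone 16, position 1 (an assumed sample, the first level-1 character of GB2312) and derives the other two codes. The last step shows the signed 16-bit form that the pinyin lookup table later in these notes uses as its first entry.

```php
<?php
// Assumed sample: zone 16, position 1.
$zone = 16;
$position = 1;

// National standard code: add 0x20 to each byte.
$gbHigh = $zone + 0x20;
$gbLow  = $position + 0x20;

// Internal code: add 0x80 to each byte so the high bit is set.
$inHigh = $gbHigh + 0x80;
$inLow  = $gbLow + 0x80;

printf("GB code: %02X%02X\n", $gbHigh, $gbLow);         // GB code: 3021
printf("Internal code: %02X%02X\n", $inHigh, $inLow);   // Internal code: B0A1

// The same two bytes read as a signed 16-bit number.
$signed = ($inHigh << 8) + $inLow - 65536;
var_dump($signed);                                      // int(-20319)
```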

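  14. A class that returns the pinyin of Chinese characters

  15. Class source

    <?php
    /**
     * PHP: convert Chinese characters to pinyin
     *  The input is UTF-8. The conversion relies on GBK ordering its common (GB2312 level-1) characters by pinyin, so UTF-8 input is first converted to GBK.
     *  UTF-8 is one implementation of the Unicode character set; Chinese is only one part of Unicode and its code points are not ordered by pinyin, so this range lookup cannot be applied to Unicode values directly.
     * @example
     *    echo CUtf8_PY::encode('皮卡丘'); // encode as pinyin initials
     *    echo CUtf8_PY::encode('皮卡丘', 'all'); // encode as full pinyin
     */
    class CUtf8_PY {
        /**
         * Pinyin lookup table: each syllable maps to the first GBK code (as a signed 16-bit value) with that reading
         * @var array
         */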
      private static $_aMaps = array(
          'a'=>-20319,'ai'=>-20317,'an'=>-20304,'ang'=>-20295,'ao'=>-20292,
      
     'ba'=>-20283,'bai'=>-20265,'ban'=>-20257,'bang'=>-20242,'bao'=>-20230,'bei'=>-20051,'ben'=>-20036,'beng'=>-20032,'bi'=>-20026,'bian'=>-20002,'biao'=>-19990,'bie'=>-19986,'bin'=>-19982,'bing'=>-19976,'bo'=>-19805,'bu'=>-19784,
      
     'ca'=>-19775,'cai'=>-19774,'can'=>-19763,'cang'=>-19756,'cao'=>-19751,'ce'=>-19746,'ceng'=>-19741,'cha'=>-19739,'chai'=>-19728,'chan'=>-19725,'chang'=>-19715,'chao'=>-19540,'che'=>-19531,'chen'=>-19525,'cheng'=>-19515,'chi'=>-19500,'chong'=>-19484,'chou'=>-19479,'chu'=>-19467,'chuai'=>-19289,'chuan'=>-19288,'chuang'=>-19281,'chui'=>-19275,'chun'=>-19270,'chuo'=>-19263,'ci'=>-19261,'cong'=>-19249,'cou'=>-19243,'cu'=>-19242,'cuan'=>-19238,'cui'=>-19235,'cun'=>-19227,'cuo'=>-19224,
       
     'da'=>-19218,'dai'=>-19212,'dan'=>-19038,'dang'=>-19023,'dao'=>-19018,'de'=>-19006,'deng'=>-19003,'di'=>-18996,'dian'=>-18977,'diao'=>-18961,'die'=>-18952,'ding'=>-18783,'diu'=>-18774,'dong'=>-18773,'dou'=>-18763,'du'=>-18756,'duan'=>-18741,'dui'=>-18735,'dun'=>-18731,'duo'=>-18722,
            'e'=>-18710,'en'=>-18697,'er'=>-18696,
        
     'fa'=>-18526,'fan'=>-18518,'fang'=>-18501,'fei'=>-18490,'fen'=>-18478,'feng'=>-18463,'fo'=>-18448,'fou'=>-18447,'fu'=>-18446,
        
     'ga'=>-18239,'gai'=>-18237,'gan'=>-18231,'gang'=>-18220,'gao'=>-18211,'ge'=>-18201,'gei'=>-18184,'gen'=>-18183,'geng'=>-18181,'gong'=>-18012,'gou'=>-17997,'gu'=>-17988,'gua'=>-17970,'guai'=>-17964,'guan'=>-17961,'guang'=>-17950,'gui'=>-17947,'gun'=>-17931,'guo'=>-17928,
        
     'ha'=>-17922,'hai'=>-17759,'han'=>-17752,'hang'=>-17733,'hao'=>-17730,'he'=>-17721,'hei'=>-17703,'hen'=>-17701,'heng'=>-17697,'hong'=>-17692,'hou'=>-17683,'hu'=>-17676,'hua'=>-17496,'huai'=>-17487,'huan'=>-17482,'huang'=>-17468,'hui'=>-17454,'hun'=>-17433,'huo'=>-17427,
         
     'ji'=>-17417,'jia'=>-17202,'jian'=>-17185,'jiang'=>-16983,'jiao'=>-16970,'jie'=>-16942,'jin'=>-16915,'jing'=>-16733,'jiong'=>-16708,'jiu'=>-16706,'ju'=>-16689,'juan'=>-16664,'jue'=>-16657,'jun'=>-16647,
      
     'ka'=>-16474,'kai'=>-16470,'kan'=>-16465,'kang'=>-16459,'kao'=>-16452,'ke'=>-16448,'ken'=>-16433,'keng'=>-16429,'kong'=>-16427,'kou'=>-16423,'ku'=>-16419,'kua'=>-16412,'kuai'=>-16407,'kuan'=>-16403,'kuang'=>-16401,'kui'=>-16393,'kun'=>-16220,'kuo'=>-16216,
       
     'la'=>-16212,'lai'=>-16205,'lan'=>-16202,'lang'=>-16187,'lao'=>-16180,'le'=>-16171,'lei'=>-16169,'leng'=>-16158,'li'=>-16155,'lia'=>-15959,'lian'=>-15958,'liang'=>-15944,'liao'=>-15933,'lie'=>-15920,'lin'=>-15915,'ling'=>-15903,'liu'=>-15889,'long'=>-15878,'lou'=>-15707,'lu'=>-15701,'lv'=>-15681,'luan'=>-15667,'lue'=>-15661,'lun'=>-15659,'luo'=>-15652,
        
     'ma'=>-15640,'mai'=>-15631,'man'=>-15625,'mang'=>-15454,'mao'=>-15448,'me'=>-15436,'mei'=>-15435,'men'=>-15419,'meng'=>-15416,'mi'=>-15408,'mian'=>-15394,'miao'=>-15385,'mie'=>-15377,'min'=>-15375,'ming'=>-15369,'miu'=>-15363,'mo'=>-15362,'mou'=>-15183,'mu'=>-15180,
        
     'na'=>-15165,'nai'=>-15158,'nan'=>-15153,'nang'=>-15150,'nao'=>-15149,'ne'=>-15144,'nei'=>-15143,'nen'=>-15141,'neng'=>-15140,'ni'=>-15139,'nian'=>-15128,'niang'=>-15121,'niao'=>-15119,'nie'=>-15117,'nin'=>-15110,'ning'=>-15109,'niu'=>-14941,'nong'=>-14937,'nu'=>-14933,'nv'=>-14930,'nuan'=>-14929,'nue'=>-14928,'nuo'=>-14926,
            'o'=>-14922,'ou'=>-14921,
      
     'pa'=>-14914,'pai'=>-14908,'pan'=>-14902,'pang'=>-14894,'pao'=>-14889,'pei'=>-14882,'pen'=>-14873,'peng'=>-14871,'pi'=>-14857,'pian'=>-14678,'piao'=>-14674,'pie'=>-14670,'pin'=>-14668,'ping'=>-14663,'po'=>-14654,'pu'=>-14645,
        
     'qi'=>-14630,'qia'=>-14594,'qian'=>-14429,'qiang'=>-14407,'qiao'=>-14399,'qie'=>-14384,'qin'=>-14379,'qing'=>-14368,'qiong'=>-14355,'qiu'=>-14353,'qu'=>-14345,'quan'=>-14170,'que'=>-14159,'qun'=>-14151,
       
     'ran'=>-14149,'rang'=>-14145,'rao'=>-14140,'re'=>-14137,'ren'=>-14135,'reng'=>-14125,'ri'=>-14123,'rong'=>-14122,'rou'=>-14112,'ru'=>-14109,'ruan'=>-14099,'rui'=>-14097,'run'=>-14094,'ruo'=>-14092,
       
     'sa'=>-14090,'sai'=>-14087,'san'=>-14083,'sang'=>-13917,'sao'=>-13914,'se'=>-13910,'sen'=>-13907,'seng'=>-13906,'sha'=>-13905,'shai'=>-13896,'shan'=>-13894,'shang'=>-13878,'shao'=>-13870,'she'=>-13859,'shen'=>-13847,'sheng'=>-13831,'shi'=>-13658,'shou'=>-13611,'shu'=>-13601,'shua'=>-13406,'shuai'=>-13404,'shuan'=>-13400,'shuang'=>-13398,'shui'=>-13395,'shun'=>-13391,'shuo'=>-13387,'si'=>-13383,'song'=>-13367,'sou'=>-13359,'su'=>-13356,'suan'=>-13343,'sui'=>-13340,'sun'=>-13329,'suo'=>-13326,
        
     'ta'=>-13318,'tai'=>-13147,'tan'=>-13138,'tang'=>-13120,'tao'=>-13107,'te'=>-13096,'teng'=>-13095,'ti'=>-13091,'tian'=>-13076,'tiao'=>-13068,'tie'=>-13063,'ting'=>-13060,'tong'=>-12888,'tou'=>-12875,'tu'=>-12871,'tuan'=>-12860,'tui'=>-12858,'tun'=>-12852,'tuo'=>-12849,
       
     'wa'=>-12838,'wai'=>-12831,'wan'=>-12829,'wang'=>-12812,'wei'=>-12802,'wen'=>-12607,'weng'=>-12597,'wo'=>-12594,'wu'=>-12585,
        
     'xi'=>-12556,'xia'=>-12359,'xian'=>-12346,'xiang'=>-12320,'xiao'=>-12300,'xie'=>-12120,'xin'=>-12099,'xing'=>-12089,'xiong'=>-12074,'xiu'=>-12067,'xu'=>-12058,'xuan'=>-12039,'xue'=>-11867,'xun'=>-11861,
        
     'ya'=>-11847,'yan'=>-11831,'yang'=>-11798,'yao'=>-11781,'ye'=>-11604,'yi'=>-11589,'yin'=>-11536,'ying'=>-11358,'yo'=>-11340,'yong'=>-11339,'you'=>-11324,'yu'=>-11303,'yuan'=>-11097,'yue'=>-11077,'yun'=>-11067,
        
     'za'=>-11055,'zai'=>-11052,'zan'=>-11045,'zang'=>-11041,'zao'=>-11038,'ze'=>-11024,'zei'=>-11020,'zen'=>-11019,'zeng'=>-11018,'zha'=>-11014,'zhai'=>-10838,'zhan'=>-10832,'zhang'=>-10815,'zhao'=>-10800,'zhe'=>-10790,'zhen'=>-10780,'zheng'=>-10764,'zhi'=>-10587,'zhong'=>-10544,'zhou'=>-10533,'zhu'=>-10519,'zhua'=>-10331,'zhuai'=>-10329,'zhuan'=>-10328,'zhuang'=>-10322,'zhui'=>-10315,'zhun'=>-10309,'zhuo'=>-10307,'zi'=>-10296,'zong'=>-10281,'zou'=>-10274,'zu'=>-10270,'zuan'=>-10262,'zui'=>-10260,'zun'=>-10256,'zuo'=>-10254
        );
        
      /**
       * Encode Chinese text as pinyin
       * @param string $utf8Data UTF-8 encoded input
       * @param string $sRetFormat return format [head: initials | all: full pinyin]
       * @return string
       */
      public static function encode($utf8Data, $sRetFormat='head'){
          $sGBK = iconv('UTF-8', 'GBK', $utf8Data);
          $aBuf = array();
          for ($i=0, $iLoop=strlen($sGBK); $i<$iLoop; $i++) {
              // Use [] for byte access; the {} offset syntax was removed in PHP 8.0
              $iChr = ord($sGBK[$i]);
              // A byte above 0xA0 starts a two-byte character: combine both bytes into a signed 16-bit value
              if ($iChr>160)
                  $iChr = ($iChr<<8) + ord($sGBK[++$i]) - 65536;
              if ('head' === $sRetFormat)
                  $aBuf[] = substr(self::zh2py($iChr),0,1);
              else
                  $aBuf[] = self::zh2py($iChr);
          }
          if ('head' === $sRetFormat)
              return implode('', $aBuf);
          else
              return implode(' ', $aBuf);
      }
      
      /**
       * Convert one character to pinyin
       * @param number $iWORD the character to convert: an ASCII value or a signed two-byte GBK code
       * @return string pinyin
       */
      private static function zh2py($iWORD) {
          if($iWORD>0 && $iWORD<160 ) {
              return chr($iWORD);
          } elseif ($iWORD<-20319||$iWORD>-10247) {
              return '';
          } else {
              // The table is sorted by code, so the last entry not greater than $iWORD is the reading
              $result = '';
              foreach (self::$_aMaps as $py => $code) {
                  if($code > $iWORD) break;
                  $result = $py;
              }
              return $result;
          }
      }
    }
    ?>
  16. Unicode encoding

    Note: in the Unicode character set, Chinese characters occupy the range U+4E00 to U+9FFF, which lies within U+0800 to U+FFFF, so a Chinese character takes three bytes in UTF-8.
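
    The range rule in the note can be written out by hand. The two helper functions below are hypothetical names introduced for this sketch; the three-byte layout 1110xxxx 10xxxxxx 10xxxxxx is the standard UTF-8 form for U+0800 to U+FFFF.

```php
<?php
// Hypothetical helper: how many UTF-8 bytes a code point needs.
function utf8ByteCount(int $codePoint): int
{
    if ($codePoint <= 0x7F) {
        return 1;
    }
    if ($codePoint <= 0x7FF) {
        return 2;
    }
    if ($codePoint <= 0xFFFF) {
        return 3;
    }
    return 4;
}

// Hypothetical helper: encode a code point from U+0800..U+FFFF
// as 1110xxxx 10xxxxxx 10xxxxxx.
function encodeThreeByte(int $codePoint): string
{
    return chr(0xE0 | ($codePoint >> 12))
         . chr(0x80 | (($codePoint >> 6) & 0x3F))
         . chr(0x80 | ($codePoint & 0x3F));
}

var_dump(utf8ByteCount(0x4E00));              // int(3)
var_dump(utf8ByteCount(0x9FFF));              // int(3)
var_dump(bin2hex(encodeThreeByte(0x4E2D)));   // string(6) "e4b8ad"
```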

  17. UTF byte order and the BOM

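  18. The UTF BOM header

    BOM stands for "byte order mark". UTF-16 works in two-byte code units, and big-endian and little-endian machines (historically, Macintosh versus PC) store those two bytes in opposite orders. Before UTF-16 text can be interpreted, the byte order of its code units has to be known, and the method the Unicode standard recommends for marking it is the BOM.

    UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to signal the encoding.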
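
    In practice a UTF-8 BOM often has to be detected and dropped before the text is processed. A minimal sketch; the constant and function names are hypothetical.

```php
<?php
// The UTF-8 BOM is the three bytes EF BB BF.
// (For UTF-16 the BOM is FE FF in big-endian and FF FE in little-endian order.)
const UTF8_BOM = "\xEF\xBB\xBF";

// Hypothetical helper: remove a leading UTF-8 BOM if one is present.
function stripUtf8Bom(string $text): string
{
    if (strncmp($text, UTF8_BOM, 3) === 0) {
        return substr($text, 3);
    }
    return $text;
}

var_dump(stripUtf8Bom(UTF8_BOM . 'abc'));   // string(3) "abc"
var_dump(stripUtf8Bom('abc'));              // string(3) "abc"
```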

  19. Detecting the encoding of text

    Note: the byte ranges of UTF-8 and GBK overlap in a small number of cases, so the encoding of a piece of Chinese text cannot always be determined with certainty.
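
    One overlapping case can be shown with two bytes that are well-formed in both encodings. The byte pair below is a sample chosen for this sketch, and the check assumes the iconv extension with GBK support.

```php
<?php
// Sample bytes chosen for the illustration.
$bytes = "\xC2\xB7";

// preg_match() with the u modifier fails on malformed UTF-8,
// so a result of 1 means the bytes are well-formed UTF-8.
$validUtf8 = preg_match('//u', $bytes) === 1;

// iconv() returns false when the input is not valid in the source encoding.
$validGbk = @iconv('GBK', 'UTF-8', $bytes) !== false;

var_dump($validUtf8);   // bool(true)
var_dump($validGbk);    // bool(true)
```

    Both checks pass, so byte inspection alone cannot say which encoding was intended.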

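  20. Choosing between GBK and UTF-8

    Note: in PHP and other programming languages, many functions support only UTF-8. json_encode() is one example: given a GBK-encoded argument, it fails and returns false.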
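
    The json_encode() failure and the usual fix can be sketched as follows, assuming the iconv extension is available. The two bytes are the GBK form of U+4E2D.

```php
<?php
// U+4E2D encoded in GBK; these two bytes are not valid UTF-8.
$gbk = "\xD6\xD0";

// json_encode() refuses the input and returns false.
$failed = json_encode($gbk);
var_dump($failed);                                 // bool(false)
var_dump(json_last_error() === JSON_ERROR_UTF8);   // bool(true)

// Converting to UTF-8 first makes the call succeed.
$utf8 = iconv('GBK', 'UTF-8', $gbk);
$ok = json_encode($utf8);
var_dump($ok);                                     // string(8) "\u4e2d" in double quotes
```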

  21. Regular expressions start here ---------------------

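  22. Components of a regular expression

    Note: the delimiter / is not fixed; other non-alphanumeric characters such as #, !, +, or ~ can be used instead.

    Note: in the expression from the original figure, only [] are metacharacters; the other characters in it are not.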

  23. Character escaping and backreferences

    Note: in the example, \1 refers to the first group earlier in the pattern, namely (\w+).
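
    A typical use of \1 with a (\w+) group is finding a doubled word. The pattern and the sample text below are assumptions made for this sketch, not the original example.

```php
<?php
$text = 'this is is a test';

// (\w+) captures a word; \1 must then match the same text again.
preg_match('/\b(\w+)\s+\1\b/', $text, $m);

var_dump($m[0]);   // string(5) "is is"
var_dump($m[1]);   // string(2) "is"
```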

  24. Lookaround in regular expressions

    Note: the two expressions in the example are equivalent; both match the jeff in the word jeffrey.
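
    Two lookahead patterns with that behaviour are shown below. They are an assumption about what the example contained: the first asserts the whole word before matching, the second matches first and then asserts what follows.

```php
<?php
$text = 'hello jeffrey';

// Lookahead first: at a position where "jeffrey" follows, match "jeff".
preg_match('/(?=jeffrey)jeff/', $text, $a, PREG_OFFSET_CAPTURE);

// Match "jeff", then require "rey" to follow without consuming it.
preg_match('/jeff(?=rey)/', $text, $b, PREG_OFFSET_CAPTURE);

var_dump($a[0]);   // "jeff" at offset 6
var_dump($b[0]);   // "jeff" at offset 6
```

    In both cases the lookahead consumes no characters, so only jeff is part of the match.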

  25. Greedy matching in regular expressions

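  26. Regular expression engines

    Note: as the names suggest, an NFA engine is expression-directed: it walks through the expression and tries it against the text (in the example, it takes the p at the start of the expression and looks for it in the text "this is pencil"). A DFA engine is text-directed: it walks through the text and checks each character against the expression (starting with the t of the text).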
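
    PCRE, the engine behind PHP's preg_* functions, is a backtracking (NFA-style) engine, and one visible consequence is that alternatives are tried in the order they are written. The patterns below are assumptions made for this sketch.

```php
<?php
// The engine tries the alternatives left to right and stops at the
// first one that lets the overall match succeed.
preg_match('/to|tomorrow/', 'tomorrow', $m);
preg_match('/tomorrow|to/', 'tomorrow', $n);

var_dump($m[0]);   // string(2) "to"
var_dump($n[0]);   // string(8) "tomorrow"
```

    A text-directed engine that reports the longest match would return "tomorrow" for both patterns.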

  27. Backtracking

    Backtracking is what happens in a greedy match whose .* is followed by a ": the .* first consumes the whole rest of the line, and the engine then steps back from the end of the line, one character at a time, until it finds the ".
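
    The effect of greedy versus non-greedy quantifiers on the result can be sketched like this. The sample line and patterns are assumptions made for the illustration.

```php
<?php
$line = 'say "one" and "two" now';

// Greedy: .* runs to the end of the line, then backs up to the LAST quote.
preg_match('/".*"/', $line, $greedy);

// Non-greedy: .*? grows one character at a time and stops at the FIRST closing quote.
preg_match('/".*?"/', $line, $lazy);

var_dump($greedy[0]);   // string(15) ""one" and "two""
var_dump($lazy[0]);     // string(5) ""one""
```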

  28. Optimizing regular expressions

    Note: the manual recommends the PCRE functions (preg_*), which perform comparatively well.

    Characters such as ! and + can also serve as delimiters.
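
    Alternative delimiters are most useful when the pattern itself contains slashes. A minimal sketch with an assumed sample path:

```php
<?php
$path = '/usr/local/bin/php';

// With / as the delimiter, every slash in the pattern must be escaped.
$slash = preg_match('/^\/usr\/local\//', $path);

// With ! or + as the delimiter, the slashes need no escaping.
$bang = preg_match('!^/usr/local/!', $path);
$plus = preg_match('+^/usr/local/+', $path);

var_dump($slash, $bang, $plus);   // int(1) int(1) int(1)
```

    Whichever character is chosen as the delimiter has to be escaped if it appears inside the pattern, so it is best to pick one the pattern does not use.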