
【Cheap Trick】Notes on the Stanford Parsing Libraries + Downloading the Latest April Anime Season (with Kaguya and Anya as Examples)

Date: 2023-01-13 06:00:00

Preface

Up-front warning: this article is a bait-and-switch. The real content is in Part 2; Part 1 exists purely to pad the word count and get past review.


Contents

  • Preface
    • 1 Overview of Using the Stanford Parsing Libraries (Parse Trees, Dependency Graphs)
    • 2 The Cheap Trick (Possibly Useful for Anime Fans)


1 Overview of Using the Stanford Parsing Libraries (Parse Trees, Dependency Graphs)

NLTK has recently announced that its Stanford parsing modules are deprecated and, in the newest releases, replaced by the nltk.parse.corenlp module (i.e. CoreNLPParser). The JAR packages can still be downloaded from the Stanford software page. At the moment at least dependency parsing and constituency parse trees work, and those two are also the most useful; NER works as well. Tokenization and POS tagging raise errors, but you do not need Stanford for those anyway — plenty of other resources exist: jieba for Chinese, and NLTK's built-in tokenizer and POS tagger for English. I have not yet figured out exactly how to use CoreNLPParser; a detailed tutorial on using the Stanford JARs will follow in the near future.
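For example, the non-Stanford fallbacks just mentioned take only a couple of lines. A minimal sketch, assuming jieba and nltk are installed along with nltk's punkt and averaged_perceptron_tagger data:

# -*- coding: utf-8 -*-
import jieba	# Chinese word segmentation
import nltk		# English tokenization and POS tagging

# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')	# one-time data fetch
print(list(jieba.cut('我在博客园开了一个博客')))
tokens = nltk.word_tokenize('Good muffins cost $3.88 in New York.')
print(nltk.pos_tag(tokens))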

The JAR packages downloaded from the link above are shown in the figure below:

[Figure: the downloaded Stanford JAR packages]

Among them, stanford-parser-full-2020-11-17 is the most important package: it can produce both parse trees and dependency graphs. stanford-corenlp-4.4.0 is probably an integration of the other packages, but it seems to be missing many models — for example, its parser models are English-only, whereas the former also includes Chinese. Code for using these packages is shown below, partly adapted from https://www.cnblogs.com/baiboy/p/nltk1.html:

# -*- coding: utf-8 -*-
# @author: caoyang
# @email: caoyang@163.sufe.edu.cn

# 2022/06/10 13:16:34 currently on NLTK 3.3.0
def segmenter_demo():
	# 2022/06/10 13:16:51 fails to run, not sure why
	from nltk.tokenize.stanford_segmenter import StanfordSegmenter
	segmenter = StanfordSegmenter(
		path_to_jar=r'D:\data\stanford\software\stanford-segmenter-2020-11-17\stanford-segmenter-4.2.0.jar',
		# the slf4j jar cannot be found in stanford-segmenter-2020-11-17, but both stanford-parser-full-2020-11-17 and stanford-corenlp-4.4.0 contain it
		path_to_slf4j=r'D:\data\stanford\software\stanford-parser-full-2020-11-17\slf4j-api.jar',
		path_to_sihan_corpora_dict=r'D:\data\stanford\software\stanford-segmenter-2020-11-17\data',
		path_to_model=r'D:\data\stanford\software\stanford-segmenter-2020-11-17\data\pku.gz',
		path_to_dict=r'D:\data\stanford\software\stanford-segmenter-2020-11-17\data\dict-chris6.ser.gz',
	)
	string = u'我在博客园开了一个博客,我的博客叫伏草唯存,写了一些自然语言处理的文章。'
	result = segmenter.segment(string)
	print(result)
	return result

def tokenizer_demo():
	# 2022/06/10 13:15:03 fails to run: StanfordTokenizer in nltk.tokenize has been deprecated
	from nltk.tokenize import StanfordTokenizer
	tokenizer = StanfordTokenizer(path_to_jar=r'D:\data\stanford\software\stanford-parser-full-2020-11-17\stanford-parser.jar')
	sent = 'Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\nThanks.'
	result = tokenizer.tokenize(sent)
	return result


def ner_tagger_demo():
	# 2022/06/10 13:16:56 works for English, but the Chinese model jar package is missing
	from nltk.tag import StanfordNERTagger
	eng_tagger = StanfordNERTagger(model_filename=r'D:\data\stanford\software\stanford-ner-2020-11-17\classifiers\english.all.3class.distsim.crf.ser.gz',
								   path_to_jar=r'D:\data\stanford\software\stanford-ner-2020-11-17\stanford-ner.jar')

	result = eng_tagger.tag('Rami Eid is studying at Stony Brook University in NY'.split())
	print(result)
	# chi_tagger = StanfordNERTagger(model_filename=r'D:\data\stanford\software\stanford-ner-2020-11-17\classifiers\chinese.misc.distsim.crf.ser.gz',
								   # path_to_jar=r'D:\data\stanford\software\stanford-ner-2020-11-17\stanford-ner.jar')
	# for word, tag in chi_tagger.tag(result.split()):
		# print(word,tag)
	return result

def pos_tagger_demo():
	# 2022/06/10 13:17:35 passed testing
	from nltk.tag import StanfordPOSTagger
	eng_tagger = StanfordPOSTagger(model_filename=r'D:\data\stanford\software\stanford-postagger-full-2020-11-17\models\english-bidirectional-distsim.tagger',
								   path_to_jar=r'D:\data\stanford\software\stanford-postagger-full-2020-11-17\stanford-postagger.jar')
	print(eng_tagger.tag('What is the airspeed of an unladen swallow ?'.split()))
	
	chi_tagger = StanfordPOSTagger(model_filename=r'D:\data\stanford\software\stanford-postagger-full-2020-11-17\models\chinese-distsim.tagger',
								   path_to_jar=r'D:\data\stanford\software\stanford-postagger-full-2020-11-17\stanford-postagger.jar')
	result = '四川省 成都 信息 工程 大学 我 在 博客 园 开 了 一个 博客 , 我 的 博客 名叫 伏 草 惟 存 , 写 了 一些 自然语言 处理 的 文章 。\r\n'
	print(chi_tagger.tag(result.split()))
	
def dependency_demo():
	# 2022/06/10 13:21:17 passed testing
	from nltk.parse.stanford import StanfordDependencyParser
	eng_parser = StanfordDependencyParser(r'D:\data\stanford\software\stanford-parser-full-2020-11-17\stanford-parser.jar',
										  r'D:\data\stanford\software\stanford-parser-full-2020-11-17\stanford-parser-4.2.0-models.jar',
										  r'D:\data\stanford\software\stanford-parser-full-2020-11-17\englishPCFG.ser.gz')
	res = list(eng_parser.parse('the quick brown fox jumps over the lazy dog'.split()))
	for row in res[0].triples():
		print(row)

	chi_parser = StanfordDependencyParser(r'D:\data\stanford\software\stanford-parser-full-2020-11-17\stanford-parser.jar',
										  r'D:\data\stanford\software\stanford-parser-full-2020-11-17\stanford-parser-4.2.0-models.jar',
										  model_path=r'D:\data\stanford\software\stanford-parser-full-2020-11-17\chinesePCFG.ser.gz')		# this file has to be extracted from stanford-parser-4.2.0-models.jar
	res = list(chi_parser.parse('我 和 他 是 朋友'.split()))
	print(list(res[0].triples()))
	print('#' * 64)
	for row in res[0].triples():
		print(row)
	
def parse_tree_demo():	
	# 2022/06/10 13:21:17 passed testing
	from nltk.parse.stanford import StanfordParser	
	parser = StanfordParser(r'D:\data\stanford\software\stanford-parser-full-2020-11-17\stanford-parser.jar',
							r'D:\data\stanford\software\stanford-parser-full-2020-11-17\stanford-parser-4.2.0-models.jar',
							model_path=r'D:\data\stanford\software\stanford-parser-full-2020-11-17\chinesePCFG.ser.gz')		# this file has to be extracted from stanford-parser-4.2.0-models.jar
	parse_tree = list(parser.parse(['我', '和', '他', '是', '朋友']))
	print(parse_tree)
	return parse_tree

# segmenter_demo()
# tokenizer_demo()
# ner_tagger_demo()
# pos_tagger_demo()
# dependency_demo()
# parse_tree_demo()
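One practical note on the demos above: these NLTK wrappers all shell out to a local Java runtime, so a JDK/JRE has to be installed and discoverable. If java is not on the PATH, NLTK can be pointed at it explicitly. A minimal sketch (the Java path below is a placeholder, not a path from this post):

from nltk.internals import config_java

# hypothetical install location; substitute your own java executable
config_java(bin=r'C:\Program Files\Java\jdk-11\bin\java.exe', options='-mx4g')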

After updating nltk to the latest release (3.7.0), the corenlp module is usable, but it turns out to call a remote interface, so there is no need to download the jar packages locally — the catch is that the remote server is often unreachable. My impression is that Stanford no longer intends to ship their parsing packages openly and has wrapped them behind an API instead. Judging from the docstring, the output looks quite fancy:

class CoreNLPParser(GenericCoreNLPParser)
 |  CoreNLPParser(url='http://localhost:9000', encoding='utf8', tagtype=None)
 |
 |  >>> parser = CoreNLPParser(url='http://localhost:9000')
 |
 |  >>> next(
 |  ...     parser.raw_parse('The quick brown fox jumps over the lazy dog.')
 |  ... ).pretty_print()  # doctest: +NORMALIZE_WHITESPACE
 |                       ROOT
 |                        |
 |                        S
 |         _______________|__________________________
 |        |                         VP               |
 |        |                _________|___             |
 |        |               |             PP           |
 |        |               |     ________|___         |
 |        NP              |    |            NP       |
 |    ____|__________     |    |     _______|____    |
 |   DT   JJ    JJ   NN  VBZ   IN   DT      JJ   NN  .
 |   |    |     |    |    |    |    |       |    |   |
 |  The quick brown fox jumps over the     lazy dog  .
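Incidentally, the unreliable remote connection can be sidestepped by serving CoreNLP yourself from the stanford-corenlp-4.4.0 package downloaded earlier and pointing the parser at localhost. A minimal sketch, with the server class name and flags taken from the CoreNLP server documentation:

# First start the server in a shell from the directory containing the CoreNLP jars:
#   java -mx4g -cp "D:\data\stanford\software\stanford-corenlp-4.4.0\*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
from nltk.parse.corenlp import CoreNLPParser

parser = CoreNLPParser(url='http://localhost:9000')
tree = next(parser.raw_parse('The quick brown fox jumps over the lazy dog.'))
tree.pretty_print()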

The stanza package is the same story — it also has to call out over the network. Its API documentation is at https://stanfordnlp.github.io/stanza/index.html. Personally I think the parser above is more or less sufficient; stanza also fails frequently if you are not using a VPN.
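For completeness: according to the documentation linked above, stanza only needs the network to fetch its models; once downloaded, the pipeline runs locally. A minimal sketch:

import stanza

stanza.download('en')	# one-time model download; this is the step that needs network access
nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma,depparse')
doc = nlp('The quick brown fox jumps over the lazy dog.')
for sentence in doc.sentences:
	for word in sentence.words:
		print(word.text, word.upos, word.deprel)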


2 The Cheap Trick (Possibly Useful for Anime Fans)

Stealing a bit of time from work to share a cheap trick.

Lately I have been following Kaguya-sama: Love Is War season 3 (《辉夜大小姐想让人告白》) and SPY×FAMILY (《间谍过家家》) on Bilibili. Honestly, Bilibili used to release new anime in sync with the publishers, so as a non-paying viewer you were only ever a week — one episode — behind the premium members, which was tolerable. The Bilibili of today pulls all kinds of stunts: the glacial update schedule would be one thing, but on top of that come the censorship glows and shadows, the cuts everywhere, and some sensitive scenes they even redraw by hand. It is genuinely hard to accept. If it were not for the last remnants of Bilibili's danmaku atmosphere, who the hell would still follow anime there?

Then I found this: 蚂蚁Tube (the anime section).

At the moment essentially all of the April season is being updated there continuously, and the back catalog of older shows is fairly complete too. Beyond anime there are also movies, TV dramas, and variety shows — very nice, all told.

Anyone who frequents free sites like this knows their chronic ailment: videos load at a crawl and often die completely halfway through, which is maddening. So I wondered whether I could simply download the videos locally and watch them there.

This is actually not complicated — much simpler than scraping videos from Bilibili. While I am at it, I am attaching my Bilibili crawler script below (I adapted someone else's code with some modifications). Try running the few examples in the main part; the logic should be quite clear, and it still worked as of this article's publication. The comments are fairly detailed. You can download an entire series directly from its episode id; for premium-only shows you need a premium account — the Cookie used here is from my own account and has presumably expired by now, so if you need one, log in on the web and copy your own Cookie over:

# -*- coding: utf-8 -*-
# @author: caoyang
# @email: caoyang@163.sufe.edu.cn
# https://github.com/iawia002/annie

import os
import re
import json
import requests
from tqdm import tqdm

class BiliBiliCrawler(object):
	
	def __init__(self) -> None:				
		self.user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0'
		self.video_webpage_link = 'https://www.bilibili.com/video/{}'.format
		self.video_detail_api = 'https://api.bilibili.com/x/player/pagelist?bvid={}&jsonp=jsonp'.format						
		self.video_playurl_api = 'https://api.bilibili.com/x/player/playurl?cid={}&bvid={}&qn=64&type=&otype=json'.format	
		self.episode_playurl_api = 'https://api.bilibili.com/pgc/player/web/playurl?ep_id={}&jsonp=jsonp'.format			
		self.episode_webpage_link = 'https://www.bilibili.com/bangumi/play/ep{}'.format
		self.anime_webpage_link = 'https://www.bilibili.com/bangumi/play/ss{}'.format
		self.chunk_size = 1024
		self.regexs = {
			'host': r'https://(.*\.com)',
			'episode_name': r'meta name="keywords" content="(.*?)"',
			# NOTE: the 'playinfo' pattern (referenced by download()) was lost from the original
			# post and 'initial_state' came through empty; the 'playinfo' value below is an assumed
			# reconstruction of the commonly used pattern for Bilibili's inline playinfo script tag
			'playinfo': r'<script>window\.__playinfo__=(.*?)</script>',
			'initial_state': r'',
		}

	def easy_download_video(self, bvid, save_path=None) -> bool:
		"""Tricky method with available api"""
		
		# Request for detail information of video
		response = requests.get(self.video_detail_api(bvid), headers={'User-Agent': self.user_agent})
		json_response = response.json()

		cid = json_response['data'][0]['cid']
		video_title = json_response['data'][0]['part']
		if save_path is None:
			save_path = f'{video_title}.mp4'

		print(f'Video title: {video_title}')

		# Request for playurl and size of video
		response = requests.get(self.video_playurl_api(cid, bvid), headers={'User-Agent': self.user_agent})
		json_response = response.json()
		video_playurl = json_response['data']['durl'][0]['url']
		# video_playurl = json_response['data']['durl'][0]['backup_url'][0]
		video_size = json_response['data']['durl'][0]['size']
		total = video_size // self.chunk_size

		print(f'Video size: {video_size}')

		# Download video
		headers = {
			'User-Agent': self.user_agent,
			'Origin'	: 'https://www.bilibili.com',
			'Referer'	: 'https://www.bilibili.com',
		}
		headers['Host'] = re.findall(self.regexs['host'], video_playurl, re.I)[0]
		headers['Range'] = f'bytes=0-{video_size}'
		response = requests.get(video_playurl, headers=headers, stream=True, verify=False)
		tqdm_bar = tqdm(response.iter_content(self.chunk_size), desc='Download process', total=total)
		with open(save_path, 'wb') as f:
			for byte in tqdm_bar:
				f.write(byte)
		return True

	def easy_download_episode(self, epid, save_path=None) -> bool:
		"""Tricky method with available api"""
		
		# Request for playurl and size of episode
		
		# temp_headers = {
			# "Host": "api.bilibili.com",
			# "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:99.0) Gecko/20100101 Firefox/99.0",
			# "Accept": "application/json, text/plain, */*",
			# "Accept-Language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
			# "Accept-Encoding": "gzip, deflate, br",
			# "Referer": "https://www.bilibili.com/bangumi/play/ep234407?spm_id_from=333.337.0.0",
			# "Origin": "https://www.bilibili.com",
			# "Connection": "keep-alive",
			# "Cookie": "innersign=0; buvid3=3D8F234E-5DAF-B5BD-1A26-C7CDE57C21B155047infoc; i-wanna-go-back=-1; b_ut=7; b_lsid=1047C7449_1808035E0D6; _uuid=A4884E3F-BF68-310101-E5E6-10EBFDBCC10CA456283infoc; buvid_fp=82c49016c72d24614786e2a9e883f994; buvid4=247E3498-6553-51E8-EB96-C147A773B34357718-022050123-7//HOhRX5o4Xun7E1GZ2Vg%3D%3D; fingerprint=1b7ad7a26a4a90ff38c80c37007d4612; sid=jilve18q; buvid_fp_plain=undefined; SESSDATA=f1edfaf9%2C1666970475%2Cf281c%2A51; bili_jct=de9bcc8a41300ac37d770bca4de101a8; DedeUserID=130321232; DedeUserID__ckMd5=42d02c72aa29553d; nostalgia_conf=-1; CURRENT_BLACKGAP=1; CURRENT_FNVAL=4048; CURRENT_QUALITY=0; rpdid=|(u~||~uukl)0J'uYluRu)l|J",
			# "Sec-Fetch-Dest": "empty",
			# "Sec-Fetch-Mode": "cors",
			# "Sec-Fetch-Site": "same-site",
			# "TE": "trailers",
		# }
		# response = requests.get(self.episode_playurl_api(epid), headers=temp_headers)
		
		# 2022/05/01 23:31:08 the commented-out block above is the premium-member download method; it can fetch episodes that require a premium account
		response = requests.get(self.episode_playurl_api(epid))
		json_response = response.json()
		# episode_playurl = json_response['result']['durl'][0]['url']
		episode_playurl = json_response['result']['durl'][0]['backup_url'][0]
		episode_size = json_response['result']['durl'][0]['size']
		total = episode_size // self.chunk_size

		print(f'Episode size: {episode_size}')
		
		# Download episode
		# 2022/05/01 23:31:41 premium members had better include the cookie below, though I am not sure whether it still works without it
		headers = {
			'User-Agent': self.user_agent,
			'Origin'	: 'https://www.bilibili.com',
			'Referer'	: 'https://www.bilibili.com',	
			# 'Cookie' : "innersign=0; buvid3=3D8F234E-5DAF-B5BD-1A26-C7CDE57C21B155047infoc; i-wanna-go-back=-1; b_ut=7; b_lsid=1047C7449_1808035E0D6; _uuid=A4884E3F-BF68-310101-E5E6-10EBFDBCC10CA456283infoc; buvid_fp=82c49016c72d24614786e2a9e883f994; buvid4=247E3498-6553-51E8-EB96-C147A773B34357718-022050123-7//HOhRX5o4Xun7E1GZ2Vg%3D%3D; fingerprint=1b7ad7a26a4a90ff38c80c37007d4612; sid=jilve18q; buvid_fp_plain=undefined; SESSDATA=f1edfaf9%2C1666970475%2Cf281c%2A51; bili_jct=de9bcc8a41300ac37d770bca4de101a8; DedeUserID=130321232; DedeUserID__ckMd5=42d02c72aa29553d; nostalgia_conf=-1; CURRENT_BLACKGAP=1; CURRENT_FNVAL=4048; CURRENT_QUALITY=0; rpdid=|(u~||~uukl)0J'uYluRu)l|J",
		}
		headers['Host'] = re.findall(self.regexs['host'], episode_playurl, re.I)[0]
		headers['Range'] = f'bytes=0-{episode_size}'
		response = requests.get(episode_playurl, headers=headers, stream=True, verify=False)
		tqdm_bar = tqdm(response.iter_content(self.chunk_size), desc='Download process', total=total)
		if save_path is None:
			save_path = f'ep{epid}.mp4'
		with open(save_path, 'wb') as f:
			for byte in tqdm_bar:
				f.write(byte)
		return True

	def download(self, bvid, video_save_path=None, audio_save_path=None) -> dict:
		"""General method by parsing page source"""
		
		if video_save_path is None:
			video_save_path = f'{bvid}.m4s'
		if audio_save_path is None:
			audio_save_path = f'{bvid}.mp3'

		common_headers = {
			'Accept'			: '*/*',
			'Accept-encoding'	: 'gzip, deflate, br',
			'Accept-language'	: 'zh-CN,zh;q=0.9,en;q=0.8',
			'Cache-Control'		: 'no-cache',
			'Origin'			: 'https://www.bilibili.com',
			'Pragma'			: 'no-cache',
			'Host'				: 'www.bilibili.com',
			'User-Agent'		: self.user_agent,
		}

		# In fact we only need bvid
		# Each episode of an anime also has a bvid and a corresponding bvid-URL which is redirected to another episode link
		# e.g. https://www.bilibili.com/video/BV1rK4y1b7TZ is redirected to https://www.bilibili.com/bangumi/play/ep322903
		response = requests.get(self.video_webpage_link(bvid), headers=common_headers)
		html = response.text
		playinfos = re.findall(self.regexs['playinfo'], html, re.S)
		if not playinfos:
			raise Exception(f'No playinfo found in bvid {bvid}\nPerhaps VIP required')
		playinfo = json.loads(playinfos[0])
		
		# There exists four different URLs with observations as below
		# `baseUrl` is the same as `base_url` with string value
		# `backupUrl` is the same as `backup_url` with array value
		# Here hard code is employed to select playurl
		def _select_video_playurl(_videoinfo):
			if 'backupUrl' in _videoinfo:
				return _videoinfo['backupUrl'][-1]
			if 'backup_url' in _videoinfo:
				return _videoinfo['backup_url'][-1]
			if 'baseUrl' in _videoinfo:
				return _videoinfo['baseUrl']
			if 'base_url' in _videoinfo:
				return _videoinfo['base_url']	
			raise Exception(f'No video URL found\n{_videoinfo}')
			
		def _select_audio_playurl(_audioinfo):
			if 'backupUrl' in _audioinfo:
				return _audioinfo['backupUrl'][-1]
			if 'backup_url' in _audioinfo:
				return _audioinfo['backup_url'][-1]
			if 'baseUrl' in _audioinfo:
				return _audioinfo['baseUrl']
			if 'base_url' in _audioinfo:
				return _audioinfo['base_url']
			raise Exception(f'No audio URL found\n{_audioinfo}')
		
		# with open(f'playinfo-{bvid}.js', 'w') as f:
			# json.dump(playinfo, f)

		if 'durl' in playinfo['data']:
			video_playurl = playinfo['data']['durl'][0]['url']
			# video_playurl = playinfo['data']['durl'][0]['backup_url'][1]
			print(video_playurl)
			video_size = playinfo['data']['durl'][0]['size']
			total = video_size // self.chunk_size
			print(f'Video size: {video_size}')
			headers = {
				'User-Agent': self.user_agent,
				'Origin'	: 'https://www.bilibili.com',
				'Referer'	: 'https://www.bilibili.com',			
			}
			headers['Host'] = re.findall(self.regexs['host'], video_playurl, re.I)[0]
			headers['Range'] = f'bytes=0-{video_size}'
			# headers['Range'] = f'bytes={video_size + 1}-{video_size + video_size + 1}'
			response = requests.get(video_playurl, headers=headers, stream=True, verify=False)
			tqdm_bar = tqdm(response.iter_content(self.chunk_size), desc='Download process', total=total)
			with open(video_save_path, 'wb') as f:
				for byte in tqdm_bar:
					f.write(byte)
		# (the remainder of this method, which handled the separate 'dash' video/audio streams, was truncated in the original post)
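To close the loop, a hedged usage sketch of the three entry points — the bvid and epid below are simply the ones that appear in the script's own comments, not fresh examples:

if __name__ == '__main__':
	crawler = BiliBiliCrawler()
	crawler.easy_download_video('BV1rK4y1b7TZ')		# plain video, by bvid
	# crawler.easy_download_episode('234407')		# bangumi episode, by epid (premium cookie may be required)
	# crawler.download('BV1rK4y1b7TZ')				# page-source method (truncated above)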
