Building a Douban Movie Knowledge Graph with Neo4j
Source: Internet
Time: 2026-05-30 20:14:57
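Before large language models took off, knowledge graphs were once seen as one of the key paths to AGI. LLMs now hold the lead role, but a graph's logical reasoning and structuring abilities still hold up well in specific scenarios. In this article we take an open-source Douban movie dataset and, with the help of Neo4j, a veteran of the graph database field, build a movie knowledge graph and run a few fun analyses: which movie has the largest cast, which actors are the hardest workers, and how many hops separate Huang Bo (黄渤) from Leonardo DiCaprio.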
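Environment
- Neo4j database version: 5.20.0 Enterprise (Enterprise edition, free for personal use)
Data
The raw data comes from an open knowledge graph platform (开放知识图谱平台); each record holds the metadata of one movie. A version already processed for direct import into Neo4j can be found in a public repository.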
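Data preprocessing
Each raw record is just an isolated movie entry, with no abstract nodes and no explicit relationships. It has to be reorganized into a proper graph. Concretely, the following node types are extracted:
- Movies: Movie
- People: Person
- Languages: Language
- Release regions: District
- Movie genres: Category
Nodes need relationships to connect them. The main ones are:
- Acted in: ACTED_IN
- Directed: DIRECTED
- Wrote the screenplay: COMPOSED
- Belongs to genre: CATEGORIZED_TO
- Has main language: HAS_MAIN_LANGUAGE
- Released in: RELEASED_IN
The schema (or ontology) of the whole data model is designed as follows: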
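As a compact way to read the schema, here is a minimal sketch that writes it down as plain Python data. The relationship directions are taken from the CREATE clauses of the import statements later in this article; the names SCHEMA, node_labels and outgoing are illustrative only and not part of the original code.

```python
# Schema of the movie graph as (start label, relationship type, end label) triples.
# Directions follow the CREATE clauses used in the import statements of this article.
SCHEMA = [
    ("Person", "ACTED_IN", "Movie"),
    ("Person", "DIRECTED", "Movie"),
    ("Person", "COMPOSED", "Movie"),
    ("Movie", "CATEGORIZED_TO", "Category"),
    ("Movie", "HAS_MAIN_LANGUAGE", "Language"),
    ("Movie", "RELEASED_IN", "District"),
]


def node_labels(schema):
    """Return the sorted list of node labels that appear in the schema."""
    return sorted({label for start, _, end in schema for label in (start, end)})


def outgoing(schema, label):
    """Return the relationship types that start at the given label."""
    return [rel for start, rel, _ in schema if start == label]


if __name__ == "__main__":
    print(node_labels(SCHEMA))
    print(outgoing(SCHEMA, "Movie"))
```

Every relationship has a Movie node on one end, which is why all paths between two people in the later analysis pass through movies.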

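The following code extracts nodes and relationships from the raw JSON and writes them out as CSV files. The CSV column names match the import statements used later in this article; the raw JSON field names, the output file names and both paths are assumptions and need to be adjusted to the actual dataset:
import json
import os

import pandas as pd

# Both paths are elided in the source; fill them in before running.
RAW_JSON_PATH = ''   # raw JSON file with one record per movie
OUTPUT_DIR = ''      # directory that receives the generated CSV files

os.makedirs(OUTPUT_DIR, exist_ok=True)
with open(RAW_JSON_PATH, encoding='utf-8') as f:
    data = json.load(f)

persons = []
categories = []
languages = []
movies = []
districts = []
acted_in = []
categorized_to = []
directed = []
composed = []
released_in = []
has_main_language = []

for item in data:
    # Records may lack some list-valued fields; default them to empty lists.
    # (Raw field names here and below are assumed, not confirmed by the source.)
    for key in ('actor', 'director', 'composer', 'category', 'district', 'language'):
        if key not in item:
            item[key] = []
    persons.extend(item['actor'])
    persons.extend(item['director'])
    persons.extend(item['composer'])
    languages.extend(item['language'])
    districts.extend(item['district'])
    categories.extend(item['category'])
    movies.append({
        'id': item['id'],
        'title': item['title'],
        'url': item['url'],
        'cover': item['cover'],
        'rate': item['rate'],
        # Empty values become None (assumed default) instead of failing in int().
        'length': int(item['length']) if item['length'] else None,
        'showtime': int(item['showtime']) if item['showtime'] else None,
        'othername': item['othername'],
    })
    # One relationship row per (movie, value) pair.
    for director in item['director']:
        directed.append({'movie_id': item['id'], 'director': director})
    for composer in item['composer']:
        composed.append({'movie_id': item['id'], 'composer': composer})
    for actor in item['actor']:
        acted_in.append({'movie_id': item['id'], 'actor': actor})
    for category in item['category']:
        categorized_to.append({'movie_id': item['id'], 'category': category})
    for region in item['district']:
        released_in.append({'movie_id': item['id'], 'region': region})
    for language in item['language']:
        has_main_language.append({'movie_id': item['id'], 'language': language})

# Node files: Movie keeps all of its columns, the others hold deduplicated names.
for label, rows in [
    ('Person', persons),
    ('Category', categories),
    ('Language', languages),
    ('Movie', movies),
    ('District', districts),
]:
    path = os.path.join(OUTPUT_DIR, label + '.csv')
    if label == 'Movie':
        pd.DataFrame(rows).to_csv(path, index=False)
    else:
        # sorted(set(...)) deduplicates; pandas rejects a bare set as input.
        pd.DataFrame(sorted(set(rows)), columns=['name']).to_csv(path, index=False)

# Relationship files, named after the upper-cased relationship type.
for name, rows in [
    ('acted_in', acted_in),
    ('categorized_to', categorized_to),
    ('directed', directed),
    ('composed', composed),
    ('released_in', released_in),
    ('has_main_language', has_main_language),
]:
    pd.DataFrame(rows).to_csv(os.path.join(OUTPUT_DIR, name.upper() + '.csv'), index=False)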
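To make the reshaping concrete, here is a small sketch of how one movie record turns into relationship rows and node rows. The record and its field names (actor, director, category) are made up for illustration, and explode is a hypothetical helper, not a function from the original script.

```python
# One hypothetical raw record; the field names are assumptions for illustration.
record = {
    "id": "m1",
    "title": "Example Movie",
    "actor": ["Actor A", "Actor B"],
    "director": ["Director X"],
    "category": ["Drama"],
}


def explode(record, list_field, column):
    """Turn one list-valued field into relationship rows keyed by movie id."""
    return [{"movie_id": record["id"], column: value}
            for value in record.get(list_field, [])]


acted_in = explode(record, "actor", "actor")
directed = explode(record, "director", "director")
composed = explode(record, "composer", "composer")  # field absent, so no rows

# Node rows: every distinct person name becomes one Person node.
persons = sorted({row["actor"] for row in acted_in} |
                 {row["director"] for row in directed})

if __name__ == "__main__":
    print(acted_in)
    print(persons)
```

Note that people are identified by name only, so two different people with the same name would end up as a single Person node.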
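Importing the data
Preparation
Changing the configuration
First adjust the Neo4j configuration: in the Settings view, comment out the following line so that data can be imported from any path: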
server.directories.import=import

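Installing the plugin
The APOC plugin also needs to be installed; this can be done from the Neo4j UI. It provides the apoc.meta.graph procedure used later to inspect the schema.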

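Creating the database
This step is optional; a database can be created directly in the Neo4j UI.
Running the import
All of the following import operations are done in Neo4j Browser. Because CALL { ... } IN TRANSACTIONS only runs in an implicit (auto-commit) transaction, each of these statements has to be prefixed with :auto when executed in Browser.
Importing nodes
Importing Movie
Note: the path must not contain Chinese characters, otherwise a "Bad escape" error is raised.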
CALL {
  LOAD CSV WITH HEADERS FROM '' AS line
  CREATE (:Movie {
    id: line.id,
    title: line.title,
    cover: line.cover,
    length: toInteger(line.length),
    rate: toFloat(line.rate),
    showtime: toInteger(line.showtime),
    url: line.url,
    othername: line.othername
  })
} IN TRANSACTIONS
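LOAD CSV reads local files through file:/// URLs. The sketch below builds such a URL from an absolute POSIX-style path and rejects non-ASCII paths, following the advice above to keep Chinese characters out of the path. The function name csv_file_url and the example paths are hypothetical, and Windows drive-letter paths are not handled.

```python
from urllib.parse import quote


def csv_file_url(posix_path):
    """Build a file:/// URL for LOAD CSV from an absolute POSIX-style path.

    Raises ValueError for relative or non-ASCII paths.
    """
    if not posix_path.startswith("/"):
        raise ValueError("expected an absolute path: " + posix_path)
    if not posix_path.isascii():
        raise ValueError("path contains non-ASCII characters: " + posix_path)
    # quote() percent-encodes characters such as spaces; "/" is kept as-is.
    return "file://" + quote(posix_path)


if __name__ == "__main__":
    # Hypothetical locations, for illustration only.
    print(csv_file_url("/data/douban/Movie.csv"))
    print(csv_file_url("/data/my files/Movie.csv"))
```

Checking the path before pasting it into the statement is cheaper than finding out from a failed import.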
Importing Person
CALL {
  LOAD CSV WITH HEADERS FROM '' AS line
  CREATE (:Person {name: line.name})
} IN TRANSACTIONS
Importing District
CALL {
  LOAD CSV WITH HEADERS FROM '' AS line
  CREATE (:District {name: line.name})
} IN TRANSACTIONS
Importing Language
CALL {
  LOAD CSV WITH HEADERS FROM '' AS line
  CREATE (:Language {name: line.name})
} IN TRANSACTIONS
Importing Category
CALL {
  LOAD CSV WITH HEADERS FROM '' AS line
  CREATE (:Category {name: line.name})
} IN TRANSACTIONS
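Creating indexes
Indexing the nodes
In recent Neo4j versions the Cypher syntax is as follows. Note that the relationship imports below look up Movie nodes by id, which this set of indexes does not cover; adding an index on Movie.id as well would likely speed up those lookups.
CREATE INDEX movie_title_index FOR (m:Movie) ON (m.title);
CREATE INDEX person_name_index FOR (p:Person) ON (p.name);
CREATE INDEX category_name_index FOR (c:Category) ON (c.name);
CREATE INDEX language_name_index FOR (l:Language) ON (l.name);
CREATE INDEX district_name_index FOR (d:District) ON (d.name);
Checking index status
:schema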
Importing relationships
Importing ACTED_IN relationships
CALL {
  LOAD CSV WITH HEADERS FROM '' AS line
  MATCH (s:Person {name: line.actor})
  MATCH (e:Movie {id: line.movie_id})
  CREATE (s)-[:ACTED_IN]->(e)
} IN TRANSACTIONS
Importing CATEGORIZED_TO relationships
CALL {
  LOAD CSV WITH HEADERS FROM '' AS line
  MATCH (s:Movie {id: line.movie_id})
  MATCH (e:Category {name: line.category})
  CREATE (s)-[:CATEGORIZED_TO]->(e)
} IN TRANSACTIONS
Importing DIRECTED relationships
CALL {
  LOAD CSV WITH HEADERS FROM '' AS line
  MATCH (s:Person {name: line.director})
  MATCH (e:Movie {id: line.movie_id})
  CREATE (s)-[:DIRECTED]->(e)
} IN TRANSACTIONS
Importing COMPOSED relationships
CALL {
  LOAD CSV WITH HEADERS FROM '' AS line
  MATCH (s:Person {name: line.composer})
  MATCH (e:Movie {id: line.movie_id})
  CREATE (s)-[:COMPOSED]->(e)
} IN TRANSACTIONS
Importing RELEASED_IN relationships
CALL {
  LOAD CSV WITH HEADERS FROM '' AS line
  MATCH (s:Movie {id: line.movie_id})
  MATCH (e:District {name: line.region})
  CREATE (s)-[:RELEASED_IN]->(e)
} IN TRANSACTIONS
Importing HAS_MAIN_LANGUAGE relationships
CALL {
  LOAD CSV WITH HEADERS FROM '' AS line
  MATCH (s:Movie {id: line.movie_id})
  MATCH (e:Language {name: line.language})
  CREATE (s)-[:HAS_MAIN_LANGUAGE]->(e)
} IN TRANSACTIONS
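In these statements a row whose MATCH finds no node simply produces no relationship, without any error, so it can be worth checking the CSV files before importing. The sketch below does that on tiny in-memory CSV texts with made-up data; column_values and unmatched_rows are hypothetical helper names, and real files would be opened from disk instead.

```python
import csv
import io

# Tiny in-memory stand-ins for the node and relationship CSV files (made-up data).
MOVIE_CSV = "id,title\nm1,Movie One\nm2,Movie Two\n"
PERSON_CSV = "name\nActor A\nActor B\n"
ACTED_IN_CSV = "movie_id,actor\nm1,Actor A\nm2,Actor C\nm3,Actor B\n"


def column_values(csv_text, column):
    """Collect the distinct values of one column from CSV text."""
    return {row[column] for row in csv.DictReader(io.StringIO(csv_text))}


def unmatched_rows(rel_csv, checks):
    """Return relationship rows whose keys are missing from the node sets.

    checks maps a relationship column name to the set of valid node keys.
    """
    bad = []
    for row in csv.DictReader(io.StringIO(rel_csv)):
        if any(row[col] not in valid for col, valid in checks.items()):
            bad.append(row)
    return bad


movie_ids = column_values(MOVIE_CSV, "id")
person_names = column_values(PERSON_CSV, "name")
dangling = unmatched_rows(ACTED_IN_CSV, {"movie_id": movie_ids, "actor": person_names})

if __name__ == "__main__":
    for row in dangling:
        print(row)
```

An empty result means every relationship row has both endpoints among the node files.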
Viewing the schema
CALL db.schema.visualization
or
CALL apoc.meta.graph

Analysis
The 10 movies with the most actors
MATCH (:Person)-[:ACTED_IN]->(m:Movie)
WITH m, count(*) AS cnt
RETURN m.title, cnt
ORDER BY cnt DESC
LIMIT 10

The 10 actors who appeared in the most movies
MATCH (p:Person)-[:ACTED_IN]->(:Movie)
WITH p, count(*) AS cnt
RETURN p.name, cnt
ORDER BY cnt DESC
LIMIT 10
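Both queries above follow the same pattern: count relationships per node, sort descending, keep the top 10. As a rough illustration of that aggregation outside the database, here is a sketch over a handful of made-up (actor, movie) pairs; top_n is a hypothetical helper.

```python
from collections import Counter

# Made-up (actor, movie) pairs standing in for ACTED_IN relationships.
ACTED_IN = [
    ("Actor A", "Movie One"), ("Actor B", "Movie One"), ("Actor C", "Movie One"),
    ("Actor A", "Movie Two"), ("Actor B", "Movie Two"),
    ("Actor A", "Movie Three"),
]


def top_n(pairs, index, n=10):
    """Count pairs by the element at `index` and return the n most common."""
    return Counter(pair[index] for pair in pairs).most_common(n)


cast_sizes = top_n(ACTED_IN, index=1)  # movies ranked by number of actors
busiest = top_n(ACTED_IN, index=0)     # actors ranked by number of movies

if __name__ == "__main__":
    print(cast_sizes)
    print(busiest)
```

Grouping by the movie end of the relationship gives cast sizes; grouping by the person end gives the busiest actors.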

Actors who appeared in more than 10 movies, returned together with their movie lists
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WITH p, count(m) AS rels, collect(m) AS movies
WHERE rels > 10
RETURN p, movies, rels
ORDER BY rels DESC
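This query groups movies per actor, filters on the group size, and sorts by it. A minimal sketch of the same group, filter and sort steps in plain Python follows; the pairs are made up, the threshold is lowered to 1 so the tiny sample returns something, and prolific_actors is a hypothetical name.

```python
from collections import defaultdict

# Made-up (actor, movie) pairs standing in for ACTED_IN relationships.
PAIRS = [
    ("Actor A", "M1"), ("Actor A", "M2"), ("Actor A", "M3"),
    ("Actor B", "M1"), ("Actor B", "M2"),
    ("Actor C", "M1"),
]


def prolific_actors(pairs, min_movies):
    """Group movies by actor and keep actors with more than min_movies movies."""
    movies_by_actor = defaultdict(list)
    for actor, movie in pairs:
        movies_by_actor[actor].append(movie)          # collect(m)
    result = [(actor, movies, len(movies))            # count(m) AS rels
              for actor, movies in movies_by_actor.items()
              if len(movies) > min_movies]            # WHERE rels > threshold
    return sorted(result, key=lambda row: row[2], reverse=True)  # ORDER BY rels DESC


if __name__ == "__main__":
    for actor, movies, rels in prolific_actors(PAIRS, min_movies=1):
        print(actor, rels, movies)
```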

The 10 directors who directed the most movies
MATCH (p:Person)-[:DIRECTED]->(:Movie)
WITH p, count(*) AS cnt
RETURN p.name, cnt
ORDER BY cnt DESC
LIMIT 10

Shortest path between Leonardo DiCaprio (莱昂纳多·迪卡普里奥) and Huang Bo (黄渤), ACTED_IN relationships only, 1 to 8 hops
MATCH p=shortestPath(
  (:Person {name: '莱昂纳多·迪卡普里奥'})-[:ACTED_IN*1..8]-(:Person {name: '黄渤'})
)
RETURN p

Shortest path between Leonardo DiCaprio (莱昂纳多·迪卡普里奥) and Huang Bo (黄渤), ACTED_IN or DIRECTED relationships, 1 to 8 hops, returning the path length
MATCH p=shortestPath(
  (:Person {name: '莱昂纳多·迪卡普里奥'})-[:ACTED_IN|DIRECTED*1..8]-(:Person {name: '黄渤'})
)
RETURN length(p)
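Conceptually, a shortest-path search of this kind is a breadth-first search that ignores relationship direction and only follows the allowed relationship types. The sketch below shows that idea on a made-up graph; it is not how Neo4j implements shortestPath, and the names EDGES and shortest_path are illustrative. Each hop is one relationship, so two people who worked on the same movie are 2 hops apart.

```python
from collections import deque

# Made-up edges: (person, relationship type, movie).
# Person names and movie titles are assumed not to collide.
EDGES = [
    ("Actor A", "ACTED_IN", "Movie One"),
    ("Actor B", "ACTED_IN", "Movie One"),
    ("Actor B", "ACTED_IN", "Movie Two"),
    ("Director X", "DIRECTED", "Movie Two"),
    ("Director X", "ACTED_IN", "Movie Three"),
    ("Actor C", "ACTED_IN", "Movie Three"),
]


def shortest_path(edges, start, goal, rel_types, max_hops=8):
    """Breadth-first search over undirected edges of the allowed types.

    Returns the list of nodes on a shortest path, or None if the goal is
    not reachable within max_hops relationships.
    """
    neighbours = {}
    for person, rel, movie in edges:
        if rel in rel_types:
            neighbours.setdefault(person, []).append(movie)
            neighbours.setdefault(movie, []).append(person)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        if len(path) - 1 == max_hops:
            continue  # hop limit reached, do not expand further
        for nxt in neighbours.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


if __name__ == "__main__":
    print(shortest_path(EDGES, "Actor A", "Actor C", {"ACTED_IN"}))
    path = shortest_path(EDGES, "Actor A", "Actor C", {"ACTED_IN", "DIRECTED"})
    print(path, len(path) - 1)
```

In this sample the two actors are only connected once DIRECTED edges are allowed, which mirrors why the second query can find a shorter path, or a path at all, where the ACTED_IN-only query cannot.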
