初学SQL(一)

SQL是结构化查询语言(Structured Query Language)的简称,是一种数据库查询和程序设计语言,用来访问和操作数据库系统

数据分析中SQL是必不可少的一项技能,因为需要从数据库中通过SQL来获取需要的数据进行分析,现在数据分析软件如Python,R,SAS等都支持SQL命令

之前对数据表格处理工作中常用R的dplyr包,由最受欢迎的R程序员Hadley Wickham编写;当开始学习SQL后发现,dplyr中的很多函数跟SQL中的操作符极为相似,这是因为:

dplyr 融合了很多在 SQL 数据库中对数据操作的思想, 使得对于数据表 (tibble) 的操作有逻辑且一致. 这样的好处在于, 如果是取用数据库中的分析, 那么所采取的操作步骤并不会有太大的改变, 使用 dbplyr 包就可以方便的用同样的函数去处理数据库中的表格

结合网上教程以及datacamp实际操作教程Intro to SQL for Data Science整理下常用SQL语法

Selecting columns

SELECT 用于从表中选取数据,*代表所有列

SELECT * FROM people
SELECT name, birthdate FROM people

DISTINCT 用于返回唯一不同的值,可多列

SELECT DISTINCT language FROM films

COUNT 用于返回指定列或多列的值的数目

SELECT COUNT(*) FROM people
SELECT COUNT(DISTINCT language) FROM films

Filtering rows

WHERE 操作符常见的有:

  • = equal
  • <> not equal
  • < less than
  • > greater than
  • <= less than or equal to
  • >= greater than or equal to

作为过滤的方式:

select count(*) from films where release_year < 2000
select name,birthdate from people where birthdate = '1974-11-11'
SELECT title FROM films WHERE (release_year = 1994 OR release_year = 1995) AND (certification = 'PG' OR certification = 'R')

BETWEEN 操作符在 WHERE 子句中使用,作用是选取介于两个值之间的数据范围

select title,release_year from films where release_year between '1990' and '2000'

IN 操作符在WHERE中使用可以代替多个OR语句

select title,release_year from films where release_year in (1990,2000) and duration > 120

NULL 操作符可以判断是否为空,NOT NULL则反之

select name from people where deathdate is null

LIKE 操作符用于搜索列中的指定模式,NOT LIKE则反之

select name from people where name like 'B%'
select name from people where name like '_r%'
select name from people where name not like 'A%'

Aggregate functions

aggregate 函数,比如AVG 函数返回数值列的平均值,MAX 函数返回一列中的最大值,SUM 函数返回数值列的总和

select SUM(gross) from films where release_year >= 2000
select AVG(gross) from films where title like 'A%'
select MIN(gross) from films where release_year=1994
select MAX(gross) from films where release_year between 2000 and 2012

除了聚合函数外,还可以使用基本的算术函数,如:

select (10 / 3)
# 3

使用AS来进行重命名,即指定别名,可以对列名,也可对表名

select title, gross - budget as net_profit from films
select AVG(duration) / 60.0 as avg_duration_hours from films

对于除法等运算,如果保证结果是有小数点的话,则需先保证参与运算的数字也有小数点,可以看下面两个例子的区别:

select 45 / 10 * 100.0;
select 45 * 100.0 / 10

select count(deathdate) * 100.0 / count(*) as percentage_dead from people

Sorting grouping and joins

ORDER BY 主要用于一列或者多列数据进行排序,多列的话用,分割

select title from films where release_year in (2000, 2012) order by release_year
select title,gross from films where title like 'M%' order by title,gross

默认升序,降序则需要加DESC

select imdb_score,film_id from reviews order by imdb_score desc

GROUP BY 主要用于group一列或者多列数据以便后续的统计合计

select release_year,count(*) from films group by release_year
select release_year,country,MAX(budget) from films group by release_year,country order by release_year,country

由于WHERE语句中无法使用aggregate函数,所以需要用HAVING语句代替

select count(*) from (
    select release_year from films
    group by release_year
    having count(title) > 200
) as data

select release_year,AVG(budget) as avg_budget,AVG(gross) as avg_gross from films 
where release_year > 1990  
group by release_year 
having AVG(budget)>60000000 
order by avg_gross desc

LIMIT 可以限制返回的行数

select country, AVG(budget) as avg_budget, AVG(gross) as avg_gross
from films
group by country 
having count(title)>10
order by country
limit 5

如果从多个表中获得结果,那么需要用到join函数,不同表通过主键(Primary Key)联系在一起,连接方法有以下几种(join是inner join的缩写):

select title,imdb_score 
from films join reviews on films.id = reviews.film_id 
where title = 'To Kill a Mockingbird'
  • inner join(内连接): 返回key匹配上的行
  • left join: 除非返回key匹配的行外,没有左表中没有匹配的行也返回
  • right join: 类似于上述left join,就是把右表的数据全部返回,不管左表是否匹配上
  • full join: 把左右表的数据全部返回,不管是否匹配

上述只是简单罗列了初步接触SQL中常用的语句,详细的介绍可结合《SQL基础教程》书籍

本文出自于http://www.bioinfo-scrounger.com转载请注明出处