Biostatistics
Prof. Dr. Lê Hoàng Ninh
© 2006
Definitions of some terms in biostatistics
• Data:
– Measurements or observations of a variable
• Variable:
– A characteristic that is observed or measured
– Can take different values from one subject to another
Evidence-based Chiropractic
Definitions of statistical terms
• Independent variable
– Precedes the dependent variable; the presumed cause of some outcome
– Smoking -> lung cancer
– Drug A -> recovery
• Dependent variable:
– A measure of the effect or outcome
– Its value depends on the independent variable
Terms (cont.)
• Parameters
– Summary data from a population
• Statistics
– Summary data from a sample
Population
• A population is the set of individuals from which a sample is drawn
– e.g., headache patients in a chiropractic office; automobile crash victims in an emergency room
• In research, it is not possible to measure or observe the entire population
• Therefore a sample (a subset of the population) must be drawn
Random samples
• Subjects are selected from the population so that every individual has an equal chance of being chosen
• Random samples are representative of the population
• Non-random samples are not representative
– May be biased regarding age, severity of the condition, socioeconomic status, etc.
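The idea of giving every individual an equal chance of selection can be sketched in a few lines of Python. This is only an illustration: the population of 500 numbered patient IDs, the sample size of 25, and the helper name `simple_random_sample` are assumptions made for the example, not part of the lecture.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n distinct members so that each member is equally likely to be chosen."""
    rng = random.Random(seed)  # a fixed seed only makes the example repeatable
    return rng.sample(list(population), n)


# Hypothetical population: 500 patient IDs (assumed for illustration).
population = list(range(1, 501))
sample = simple_random_sample(population, 25, seed=1)
print(sorted(sample))
```

Sampling is done without replacement, so no patient can appear twice in the sample.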
Random samples (cont.)
• Random samples are rare in patient care research
• Instead, random allocation to two groups, treatment and control, is used
– Each person has an equal chance of being assigned to either of the groups
• Random allocation to groups = randomization
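Randomization as described above might be sketched as follows. The 20 patient labels and the function name `randomize` are hypothetical; shuffling and then splitting the list is one simple way to allocate, chosen here because it also keeps the two groups the same size.

```python
import random


def randomize(subjects, seed=None):
    """Randomly allocate subjects to a treatment group and a control group."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)          # every ordering is equally likely
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical enrolled patients (assumed for illustration).
patients = [f"patient_{i:02d}" for i in range(1, 21)]
treatment, control = randomize(patients, seed=2006)
print("Treatment:", sorted(treatment))
print("Control:  ", sorted(control))
```

Each patient ends up in exactly one group, and before the shuffle every patient has the same chance of landing in either one.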
Descriptive statistics (DSs)
• A way to summarize data
• Illustrate the shape, central tendency, and variability of a set of data
– The shape of data has to do with the frequencies of the values of observations
Descriptive statistics (cont.)
– Central tendency: the middle position of a data set
– Variability: how far values fall below and above the central value
• Dispersion
• Descriptive statistics are distinct from inferential statistics
– Descriptive statistics cannot test hypotheses
A data set
Case #   1  2  3  4  5  6  7  8  9  10  11  12  13  14
Visits   7  2  2  3  4  3  5  3  4   6   2   3   7   4
• Distribution provides a summary of:
– Frequencies of each of the values
• 2 – 3
• 3 – 4
• 4 – 3
• 5 – 1
• 6 – 1
• 7 – 2
– Ranges of values
• Lowest = 2
• Highest = 7
etc.
Frequency distribution table
Visits   Frequency   Percent   Cumulative %
2        3           21.4       21.4
3        4           28.6       50.0
4        3           21.4       71.4
5        1            7.1       78.6
6        1            7.1       85.7
7        2           14.3      100.0
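A frequency table like this one can be rebuilt from the raw data. The sketch below uses the 14 visit counts from the data set slide; the function name `frequency_table` is just an illustrative choice. Cumulative percentages are computed from the running counts, so they can differ in the last digit from a table built by adding already-rounded percentages.

```python
from collections import Counter

# Number of visits for the 14 cases in the data set.
visits = [7, 2, 2, 3, 4, 3, 5, 3, 4, 6, 2, 3, 7, 4]


def frequency_table(values):
    """Return rows of (value, frequency, percent, cumulative percent)."""
    counts = Counter(values)
    n = len(values)
    rows = []
    running = 0
    for value in sorted(counts):
        running += counts[value]
        rows.append((value, counts[value],
                     100 * counts[value] / n, 100 * running / n))
    return rows


table = frequency_table(visits)
print(f"{'Value':>5} {'Freq':>5} {'Percent':>8} {'Cum %':>7}")
for value, freq, pct, cum in table:
    print(f"{value:>5} {freq:>5} {pct:>8.1f} {cum:>7.1f}")
```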
Frequency distribution displayed as a histogram
Histograms (cont.)
• A histogram is a type of bar chart, but there are no spaces between the bars
• Histograms are used to visually depict frequency distributions of continuous data
• Bar charts are used to depict categorical information
– e.g., Male–Female, Mild–Moderate–Severe, etc.
Measures of central tendency
• Mean
– The most commonly used DS
• Calculating the mean
– Add all values of a series of numbers and then divide by the total number of elements
Formulas for the mean
• Sample mean
$$\bar{X} = \frac{\sum X}{n}$$
• Population mean
$$\mu = \frac{\sum X}{N}$$
$\bar{X}$ (X bar) refers to the mean of a sample and $\mu$ refers to the mean of a population
$\sum X$ is a command that adds all of the X values
n is the total number of values in the series of a sample and N is the same for a population
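As a small worked example, the sample mean can be computed for the 14 visit counts from the data set slide. The function name `mean` is only illustrative.

```python
# Number of visits for the 14 cases in the data set.
visits = [7, 2, 2, 3, 4, 3, 5, 3, 4, 6, 2, 3, 7, 4]


def mean(values):
    """Sum all values, then divide by how many there are."""
    return sum(values) / len(values)


sample_mean = mean(visits)
print(f"sum = {sum(visits)}, n = {len(visits)}, mean = {sample_mean:.2f}")
```

Here the sum is 55 and n is 14, so the mean is about 3.93 visits.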
Measures of central tendency (cont.)
• Mode
– The most frequently occurring value in a series
– The modal value is the highest bar in a histogram
Measures of central tendency (cont.)
• Median
– The value that divides a series of values in half when they are all listed in order
– When there is an odd number of values
• The median is the middle value
– When there is an even number of values
• Count from each end of the series toward the middle and then average the 2 middle values
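The median and mode rules can be written out directly. The sketch below applies them to the 14 visit counts from the data set slide; the function names are illustrative, and Python's standard `statistics` module offers equivalent `median` and `mode` functions.

```python
from collections import Counter

# Number of visits for the 14 cases in the data set.
visits = [7, 2, 2, 3, 4, 3, 5, 3, 4, 6, 2, 3, 7, 4]


def median(values):
    """Middle value of the ordered series (average of the 2 middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def mode(values):
    """Most frequently occurring value (if several tie, only one of them is returned)."""
    return Counter(values).most_common(1)[0][0]


print("median =", median(visits))
print("mode   =", mode(visits))
```

With 14 values the two middle values are 3 and 4, giving a median of 3.5; the value 3 occurs four times, so it is the mode.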
Measures of central tendency (cont.)
• Each of the three methods of measuring central tendency has certain advantages and disadvantages
• Which method should be used?
– It depends on the type of data that is being analyzed
– e.g., categorical, continuous, and the level of measurement that is involved
Levels of measurement
• There are 4 levels of measurement
– Nominal, ordinal, interval, and ratio
1. Nominal
– Data are coded by a number, name, or letter that is assigned to a category or group
– Examples
• Gender (e.g., male, female)
• Treatment preference (e.g., manipulation, mobilization, massage)
Levels of measurement (cont.)
2. Ordinal
– Is similar to nominal because the measurements involve categories
– However, the categories are ordered by rank
– Examples
• Pain level (e.g., mild, moderate, severe)
• Military rank (e.g., lieutenant, captain, major, colonel, general)
Levels of measurement (cont.)
• Ordinal values only describe order, not quantity
– Thus, severe pain is not the same as 2 times mild pain
• The only mathematical operations allowed for nominal and ordinal data are counting of categories
– e.g., 25 males and 30 females
Levels of measurement (cont.)
3. Interval
– Measurements are ordered (like ordinal data)
– Have equal intervals
– Does not have a true zero
– Examples
• The Fahrenheit scale, where 0° does not correspond to an absence of heat (no true zero)
• In contrast to Kelvin, which does have a true zero
Levels of measurement (cont.)
4. Ratio
– Measurements have equal intervals
– There is a true zero
– Ratio is the most advanced level of measurement, which can handle most types of mathematical operations
Levels of measurement (cont.)
• Ratio examples
– Range of motion
• No movement corresponds to zero degrees
• The interval between 10 and 20 degrees is the same as between 40 and 50 degrees
– Lifting capacity
• A person who is unable to lift scores zero
• A person who lifts 30 kg can lift twice as much as one who lifts 15 kg
Levels of measurement (cont.)
• NOIR is a mnemonic to help remember the names and order of the levels of measurement
– Nominal Ordinal Interval Ratio
Levels of measurement (cont.)
Measurement scale   Permissible mathematical operations                   Best measure of central tendency
Nominal             Counting                                              Mode
Ordinal             Greater or less than operations                       Median
Interval            Addition and subtraction                              Symmetrical – Mean; Skewed – Median
Ratio               Addition, subtraction, multiplication and division    Symmetrical – Mean; Skewed – Median
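The last column of the table can be read as a simple decision rule. The sketch below encodes it as a lookup; the function name `best_central_tendency` and the `skewed` flag are illustrative assumptions, not something defined in the lecture.

```python
def best_central_tendency(level, skewed=False):
    """Suggest a measure of central tendency for a given level of measurement."""
    level = level.lower()
    if level == "nominal":
        return "mode"
    if level == "ordinal":
        return "median"
    if level in ("interval", "ratio"):
        # Symmetrical data: mean; skewed data: median.
        return "median" if skewed else "mean"
    raise ValueError(f"unknown level of measurement: {level!r}")


for level, skewed in [("nominal", False), ("ordinal", False),
                      ("interval", False), ("ratio", True)]:
    print(level, "(skewed)" if skewed else "", "->",
          best_central_tendency(level, skewed))
```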
Shape of data
• Histograms of frequency distributions have shape
• Distributions are often symmetrical with most scores falling in the middle and fewer toward the extremes
• Most biological data are symmetrically distributed and form a normal curve (bell-shaped curve)
Shape of data (cont.)
[Figure: a line depicting the shape of the data]
The normal distribution
• Data that form a normal curve have a normal distribution (Gaussian distribution)
• Properties of a normal distribution
– It is symmetric about its mean
– The highest point is at its mean
The normal distribution (cont.)
[Figure: histogram with an overlying normal curve; the mean is marked]
• The highest point of the overlying normal curve is at the mean
• As one moves away from the mean in either direction, the height of the curve decreases, approaching but never reaching zero
• A normal distribution is symmetric about its mean
The normal distribution (cont.)
Mean = Median = Mode
Skewed distributions
• The data are not distributed symmetrically in skewed distributions
– Consequently, the mean, median, and mode are not equal and are in different positions
– Scores are clustered at one end of the distribution
– A small number of extreme values are located in the limits of the opposite end
Skewed distributions (cont.)
• Skew is always toward the direction of the longer tail
– Positive if skewed to the right
– Negative if to the left
The mean is shifted the most
Skewed distributions (cont.)
• Because the mean is shifted so much, it is not the best estimate of the average score for skewed distributions
• The median is a better estimate of the center of skewed distributions
– It will be the central point of any distribution
– 50% of the values are above and 50% below the median
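A tiny numeric example may make the point concrete. Both data sets below are invented for illustration: one is symmetrical, the other has a single extreme value on the right.

```python
import statistics

# Hypothetical scores (assumed for illustration).
symmetric = [1, 2, 3, 3, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 40]   # one extreme value in the right tail

for name, data in [("symmetric", symmetric), ("right-skewed", right_skewed)]:
    print(f"{name:>12}: mean = {statistics.mean(data):.2f}, "
          f"median = {statistics.median(data):.2f}")
```

In the symmetrical set the mean and median are both 3. In the right-skewed set the single value of 40 pulls the mean up to 7.25, while the median stays at 3.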
Properties of the normal curve
• About 68.3% of the area under a normal curve is within one standard deviation (SD) of the mean
• About 95.5% is within two SDs
• About 99.7% is within three SDs
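These proportions can be checked numerically. For a normal distribution, the area within k SDs of the mean equals erf(k / √2), where erf is the error function available in Python's `math` module. The function name `proportion_within` is an illustrative choice.

```python
import math


def proportion_within(k):
    """Area under a normal curve within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} SD: {100 * proportion_within(k):.2f}%")
```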
More properties of normal curves (cont.)
Standard deviation (SD)
• SD is a measure of the variability of a set of data
• The mean represents the average of a group of scores, with some of the scores being above the mean and some below
– This range of scores is referred to as variability or spread
• Variance (S²) is another measure of spread
SD (cont.)
• In effect, SD is the average amount of spread in a distribution of scores
• The next slide is a group of 10 patients whose mean age is 40 years
– Some are older than 40 and some younger
SD (cont.)
[Figure: ages of the 10 patients plotted along an X axis]
• Ages are spread out along an X axis
• The amount ages are spread out is known as dispersion or spread
Distances ages deviate above and below the mean
• Adding the deviations always equals zero
Calculating S²
• To find the average deviation, one would normally total the deviations above and below the mean, add them together, and then divide by the number of values
• However, the total always equals zero
– Values must first be squared, which cancels the negative signs
Calculating S² (cont.)
• S² is not in the same units (age), but SD is
• Symbol for SD of a sample: S; for a population: σ
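The calculation can be followed step by step in code. The ten ages below are hypothetical, chosen only so that their mean is 40 years as in the example. The sketch also assumes the usual sample convention of dividing the summed squared deviations by n − 1; for a whole population one would divide by N instead.

```python
import math
import statistics

# Hypothetical ages of 10 patients with a mean of 40 (assumed for illustration).
ages = [25, 30, 35, 38, 40, 40, 42, 45, 50, 55]

mean_age = sum(ages) / len(ages)
deviations = [age - mean_age for age in ages]      # distances above/below the mean
sum_of_squares = sum(d ** 2 for d in deviations)   # squaring removes the signs

sample_variance = sum_of_squares / (len(ages) - 1)  # S², in years squared
sample_sd = math.sqrt(sample_variance)              # SD, back in years

print("sum of deviations:", sum(deviations))
print(f"S^2 = {sample_variance:.2f}, SD = {sample_sd:.2f}")
print("cross-check with statistics.stdev:", round(statistics.stdev(ages), 2))
```

The deviations sum to zero, which is why they are squared before averaging; taking the square root at the end returns the result to the original units.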
Wide spread results in higher SDs; narrow spread results in lower SDs
Spread is important when comparing 2 or more group means
It is more difficult to see a clear distinction between groups in the upper example because the spread is wider, even though the means are the same
z-scores
• The number of SDs that a specific score is above or below the mean in a distribution
• Raw scores can be converted to z-scores by subtracting the mean from the raw score then dividing the difference by the SD
$$z = \frac{X - \mu}{\sigma}$$
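The conversion is a one-line calculation. In the sketch below, the mean of 40 years and SD of 10 years are hypothetical values chosen for easy arithmetic, and `z_score` is an illustrative function name.

```python
def z_score(x, mean, sd):
    """Number of SDs that the raw score x lies above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("SD must be positive")
    return (x - mean) / sd


# Hypothetical distribution of ages: mean 40 years, SD 10 years.
for age in (30, 40, 55):
    print(f"age {age}: z = {z_score(age, 40, 10):+.2f}")
```

An age of 55 is 15 years above the mean, which is 1.5 SDs, so its z-score is +1.5.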
z-scores (cont.)
• Standardization
– The process of converting raw scores to z-scores
– The resulting distribution of z-scores will always have a mean of zero, a SD of one, and an area under the curve equal to one
• The proportion of scores that are higher or lower than a specific z-score can be determined by referring to a z-table
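Standardization can be tried on a small data set. The ten scores below are hypothetical; the sketch treats them as a whole population (using μ and σ), which is an assumption made so that the standardized values have an SD of exactly one.

```python
import statistics

# Hypothetical raw scores, treated here as a complete population.
scores = [25, 30, 35, 38, 40, 40, 42, 45, 50, 55]

mu = statistics.fmean(scores)       # population mean
sigma = statistics.pstdev(scores)   # population SD

z_scores = [(x - mu) / sigma for x in scores]

print("z-scores:", [round(z, 2) for z in z_scores])
print(f"mean of z = {statistics.fmean(z_scores):.2f}, "
      f"SD of z = {statistics.pstdev(z_scores):.2f}")
```

Whatever the original units, the standardized scores have a mean of zero and an SD of one, which is what allows a single z-table to serve every normal distribution.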
z-scores (cont.)
Refer to a z-table to find the proportion under the curve
z-scores (cont.)
Partial z-table (to z = 1.5) showing proportions of the area under a normal curve for different values of z.
z     0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0   0.5000  0.5040  0.5080  0.5120  0.5160  0.5199  0.5239  0.5279  0.5319  0.5359
0.1   0.5398  0.5438  0.5478  0.5517  0.5557  0.5596  0.5636  0.5675  0.5714  0.5753
0.2   0.5793  0.5832  0.5871  0.5910  0.5948  0.5987  0.6026  0.6064  0.6103  0.6141
0.3   0.6179  0.6217  0.6255  0.6293  0.6331  0.6368  0.6406  0.6443  0.6480  0.6517
0.4   0.6554  0.6591  0.6628  0.6664  0.6700  0.6736  0.6772  0.6808  0.6844  0.6879
0.5   0.6915  0.6950  0.6985  0.7019  0.7054  0.7088  0.7123  0.7157  0.7190  0.7224
0.6   0.7257  0.7291  0.7324  0.7357  0.7389  0.7422  0.7454  0.7486  0.7517  0.7549
0.7   0.7580  0.7611  0.7642  0.7673  0.7704  0.7734  0.7764  0.7794  0.7823  0.7852
0.8   0.7881  0.7910  0.7939  0.7967  0.7995  0.8023  0.8051  0.8078  0.8106  0.8133
0.9   0.8159  0.8186  0.8212  0.8238  0.8264  0.8289  0.8315  0.8340  0.8365  0.8389
1.0   0.8413  0.8438  0.8461  0.8485  0.8508  0.8531  0.8554  0.8577  0.8599  0.8621
1.1   0.8643  0.8665  0.8686  0.8708  0.8729  0.8749  0.8770  0.8790  0.8810  0.8830
1.2   0.8849  0.8869  0.8888  0.8907  0.8925  0.8944  0.8962  0.8980  0.8997  0.9015
1.3   0.9032  0.9049  0.9066  0.9082  0.9099  0.9115  0.9131  0.9147  0.9162  0.9177
1.4   0.9192  0.9207  0.9222  0.9236  0.9251  0.9265  0.9279  0.9292  0.9306  0.9319
1.5   0.9332  0.9345  0.9357  0.9370  0.9382  0.9394  0.9406  0.9418  0.9429  0.9441
Each entry corresponds to the area under the curve in black.
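Entries of a z-table like this can be reproduced with the cumulative distribution function of the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper names `area_below` and `z_table_row` are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, SD 1


def area_below(z):
    """Proportion of the area under the standard normal curve to the left of z."""
    return standard_normal.cdf(z)


def z_table_row(row_z):
    """The ten entries of one table row: row_z + 0.00, 0.01, ..., 0.09."""
    return [round(area_below(row_z + col / 100), 4) for col in range(10)]


print("z = 1.5, area below:", round(area_below(1.5), 4))
print("z = 1.5, area above:", round(1 - area_below(1.5), 4))
print("row 1.0:", z_table_row(1.0))
```

Because the total area under the curve is one, the proportion of scores above a given z is one minus the tabulated value.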