Biostatistics

Prof. Dr. Lê Hoàng Ninh

© 2006

Definitions of some terms in biostatistics

• Data:
– Measurements or observations of a variable
• Variable:
– A characteristic that is observed or measured
– Can take different values from one subject to another

© 2006

Evidence-based Chiropractic


Definitions of terms used in statistics

• Independent variable
– Precedes the dependent variable; the presumed cause of some effect
– Smoking -> lung cancer
– Drug A -> recovery
• Dependent variable:
– A measure of the effect or outcome
– Its value depends on the independent variable


Terms …

• Parameters
– Summary data (measurements) from a population
• Statistics
– Summary data (measurements) from a sample


Population

• A population is the group of individuals from which a sample is drawn
– e.g., headache patients in a chiropractic office; automobile crash victims in an emergency room
• In research, it is generally not possible to measure the entire population
• Therefore a sample (a subset of the population) must be drawn


Random samples

• Subjects are drawn from the population so that every individual has an equal chance of being selected
• Random samples are representative of the population
• Non-random samples are not representative
– May be biased regarding age, severity of the condition, socioeconomic status, etc.
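The idea of giving every individual an equal chance of selection can be sketched in Python. This is only an illustration: the population of 200 numbered patient records, the sample size of 20, and the fixed seed are assumptions, not part of the slides.

```python
import random

# Hypothetical population: 200 patient record IDs (assumed for illustration).
population = list(range(1, 201))

# A fixed seed only makes the illustration repeatable.
rng = random.Random(2006)

# Simple random sample: every individual has the same chance of selection,
# and nobody is selected twice.
sample = rng.sample(population, k=20)

print(sorted(sample))
```

In a real study the "population" would be a sampling frame such as a clinic's patient list rather than a range of integers.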


Random samples

• Random samples are rare in patient care research
• Instead, subjects are randomly allocated to 2 groups: a treatment group and a control group
– Each person has an equal chance of being assigned to either of the groups
• Random allocation to groups = randomization
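A minimal sketch of randomization, assuming 20 hypothetical participant IDs and a shuffle-then-split scheme that keeps the two groups the same size. The names `participants`, `treatment_group`, and `control_group` are illustrative only.

```python
import random

# Hypothetical participant IDs (assumed for illustration).
participants = [f"P{i:02d}" for i in range(1, 21)]

rng = random.Random(7)  # fixed seed so the illustration is repeatable

# Shuffle a copy, then split it in half: each person is equally likely
# to end up in either group, and the groups have equal size.
shuffled = participants[:]
rng.shuffle(shuffled)
half = len(shuffled) // 2
treatment_group = shuffled[:half]
control_group = shuffled[half:]

print("Treatment:", treatment_group)
print("Control:  ", control_group)
```

Shuffling and splitting is one of several possible schemes; tossing a coin per person also gives equal chances but does not guarantee equal group sizes.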


Descriptive statistics (DSs)

• A way of summarizing data
• Illustrate the shape, central tendency, and variability of a set of data
– The shape of data has to do with the frequencies of the values of observations


Descriptive statistics

– Central tendency: the location of the middle of the data set
– Variability: how far the values spread below and above the central value
• Dispersion
• Descriptive statistics differ from inferential statistics
– Descriptive statistics cannot test hypotheses


A DATA SET

Case #  Visits
1       7
2       2
3       2
4       3
5       4
6       3
7       5
8       3
9       4
10      6
11      2
12      3
13      7
14      4

• Distribution provides a summary of:
– Frequencies of each of the values (value – frequency)
• 2 – 3
• 3 – 4
• 4 – 3
• 5 – 1
• 6 – 1
• 7 – 2
– Ranges of values
• Lowest = 2
• Highest = 7
– etc.


Frequency distribution table

Visits  Frequency  Percent  Cumulative %
2       3          21.4     21.4
3       4          28.6     50.0
4       3          21.4     71.4
5       1          7.1      78.6
6       1          7.1      85.7
7       2          14.3     100.0
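The frequency table can be rebuilt from the 14 visit counts in the data set. The sketch below is illustrative and its variable names are assumptions. It accumulates counts before rounding, which gives cumulative percentages of 78.6 and 85.7 for 5 and 6 visits; adding up already-rounded percentages would give 78.5 and 85.6 instead.

```python
from collections import Counter

# Visits per case, taken from the 14-case data set in these slides.
visits = [7, 2, 2, 3, 4, 3, 5, 3, 4, 6, 2, 3, 7, 4]

counts = Counter(visits)
n = len(visits)

rows = []
running = 0
for value in sorted(counts):
    running += counts[value]
    percent = 100 * counts[value] / n
    cumulative = 100 * running / n  # cumulate counts, not rounded percents
    rows.append((value, counts[value], round(percent, 1), round(cumulative, 1)))

print("Visits  Frequency  Percent  Cumulative %")
for value, freq, pct, cum in rows:
    print(f"{value:<7} {freq:<10} {pct:<8} {cum}")
```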


FREQUENCY DISTRIBUTION DISPLAYED AS A HISTOGRAM


Histograms (cont.)

• A histogram is a type of bar chart, but there are no spaces between the bars
• Histograms are used to visually depict frequency distributions of continuous data
• Bar charts are used to depict categorical information
– e.g., Male–Female, Mild–Moderate–Severe, etc.
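As a rough text-only stand-in for a histogram, the frequencies of the 14 visit counts can be drawn as rows of `#` marks. This is a sketch with assumed variable names; a plotting library would normally be used for a real figure.

```python
from collections import Counter

# Visits per case from the 14-case data set in these slides.
visits = [7, 2, 2, 3, 4, 3, 5, 3, 4, 6, 2, 3, 7, 4]
counts = Counter(visits)

# One bar per value across the whole range, so a value with no cases would
# still appear as an empty bar (Counter returns 0 for missing keys).
bars = {value: "#" * counts[value]
        for value in range(min(visits), max(visits) + 1)}

for value, bar in bars.items():
    print(f"{value} | {bar}")
```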


MEASURES OF CENTRAL TENDENCY

• Mean
– The most commonly used DS
• Calculating the mean
– Add all values of a series of numbers and then divide by the total number of elements


Formulas for the mean

• Sample mean

$\bar{X} = \frac{\sum X}{n}$

• Population mean

$\mu = \frac{\sum X}{N}$

– $\bar{X}$ (X bar) refers to the mean of a sample and $\mu$ refers to the mean of a population
– $\sum X$ is an instruction to add all of the X values
– n is the total number of values in the series of a sample and N is the same for a population
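Applied to the 14 visit counts from the data set in these slides, the sample mean formula works out as follows. The variable names are illustrative.

```python
# Visits per case from the 14-case data set in these slides.
visits = [7, 2, 2, 3, 4, 3, 5, 3, 4, 6, 2, 3, 7, 4]

total = sum(visits)       # the "sum of X" term in the formula
n = len(visits)           # number of values in the sample
mean_visits = total / n   # X bar

print(total, n, round(mean_visits, 2))
```

Here the sum of the values is 55 and n is 14, so the mean is 55 / 14, or about 3.93 visits.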


Measures of central tendency

• Mode

– The most frequently occurring value in a series

– The modal value is the highest bar in a histogram
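A short sketch of finding the mode of the 14 visit counts by counting how often each value occurs. Returning a list of modes is a design choice made here so that ties are handled; the slides do not discuss ties.

```python
from collections import Counter

# Visits per case from the 14-case data set in these slides.
visits = [7, 2, 2, 3, 4, 3, 5, 3, 4, 6, 2, 3, 7, 4]

counts = Counter(visits)
highest = max(counts.values())

# Every value that reaches the highest frequency is a mode.
modes = sorted(value for value, count in counts.items() if count == highest)

print(modes, highest)
```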


Measures of central tendency

• Median
– The value that divides a series of values in half when they are all listed in order
– When there is an odd number of values
• The median is the middle value
– When there is an even number of values
• Count from each end of the series toward the middle and then average the 2 middle values
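The odd and even cases can be sketched as a small function. The helper name `median` is hypothetical and written out only for illustration; Python's `statistics.median` does the same job.

```python
def median(values):
    """Middle value of an ordered series; average of the 2 middle values if even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


# Visits per case from the 14-case data set in these slides.
visits = [7, 2, 2, 3, 4, 3, 5, 3, 4, 6, 2, 3, 7, 4]

print(median(visits))       # even number of values (14)
print(median(visits[:-1]))  # odd number of values (13)
```

With all 14 cases the two middle values are 3 and 4, so the median is 3.5.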


Measures of central tendency

• Each of the three methods of measuring central tendency has certain advantages and disadvantages
• Which method should be used?
– It depends on the type of data being analyzed (e.g., categorical or continuous) and on the level of measurement involved


Levels of measurement

• There are 4 levels of measurement
– Nominal, ordinal, interval, and ratio

1. Nominal
– Data are coded by a number, name, or letter that is assigned to a category or group
– Examples
• Gender (e.g., male, female)
• Treatment preference (e.g., manipulation, mobilization, massage)


Levels of measurement

2. Ordinal
– Is similar to nominal because the measurements involve categories
– However, the categories are ordered by rank
– Examples
• Pain level (e.g., mild, moderate, severe)
• Military rank (e.g., lieutenant, captain, major, colonel, general)


Levels of measurement

• Ordinal values only describe order, not quantity
– Thus, severe pain is not the same as 2 times mild pain
• The only arithmetic allowed for nominal and ordinal data is counting of categories (ordinal categories can additionally be ranked)
– e.g., 25 males and 30 females


Levels of measurement

3. Interval
– Measurements are ordered (like ordinal data)
– Have equal intervals
– Do not have a true zero
– Examples
• The Fahrenheit scale, where 0° does not correspond to an absence of heat (no true zero), in contrast to Kelvin, which does have a true zero


Levels of measurement

4. Ratio
– Measurements have equal intervals
– There is a true zero
– Ratio is the most advanced level of measurement, which can handle most types of mathematical operations


Levels of measurement (cont.)

• Ratio examples

– Range of motion
• No movement corresponds to zero degrees
• The interval between 10 and 20 degrees is the same as between 40 and 50 degrees
– Lifting capacity
• A person who is unable to lift scores zero
• A person who lifts 30 kg can lift twice as much as one who lifts 15 kg


Levels of measurement (cont.)

• NOIR is a mnemonic to help remember the names and order of the levels of measurement
– Nominal, Ordinal, Interval, Ratio


Levels of measurement

Measurement scale   Permissible mathematical operations                   Best measure of central tendency
Nominal             Counting                                              Mode
Ordinal             Greater than or less than                             Median
Interval            Addition and subtraction                              Symmetrical: mean; skewed: median
Ratio               Addition, subtraction, multiplication and division    Symmetrical: mean; skewed: median
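For ordinal data, the median can be found by ranking the categories rather than doing arithmetic on them. The sketch below uses the pain levels named in these slides; the seven patient reports are hypothetical. `statistics.median_low` is used because it always returns an actual data value, so no averaging of categories is needed.

```python
import statistics

# Ordinal categories in rank order (pain levels named in the slides).
PAIN_ORDER = ["mild", "moderate", "severe"]

# Hypothetical patient reports (assumed for illustration).
reports = ["mild", "severe", "moderate", "moderate", "mild", "moderate", "severe"]

# Convert each report to its rank; ranks can be ordered but not averaged.
ranks = [PAIN_ORDER.index(report) for report in reports]

median_level = PAIN_ORDER[statistics.median_low(ranks)]
mode_level = statistics.mode(reports)  # counting also works for nominal data

print(median_level, mode_level)
```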


Shape of data

• Histograms of frequency distributions have shape
• Distributions are often symmetrical, with most scores falling in the middle and fewer toward the extremes
• Many biological measurements are approximately symmetrically distributed and form a normal curve (bell-shaped curve)


Shape of data

[Figure: line depicting the shape of the data]


The normal distribution

• A normal curve depicts a normal distribution (also called a Gaussian distribution)
• Properties of a normal distribution
– It is symmetric about its mean
– The highest point is at its mean


The normal distribution (cont.)

[Figure: normal curve with the mean marked]

• The highest point of the overlying normal curve is at the mean
• As one moves away from the mean in either direction, the height of the curve decreases, approaching but never reaching zero
• A normal distribution is symmetric about its mean


The normal distribution (cont.)

Mean = Median = Mode


Skewed distributions

• The data are not distributed symmetrically in skewed distributions
– Consequently, the mean, median, and mode are not equal and are in different positions
– Scores are clustered at one end of the distribution
– A small number of extreme values are located at the opposite end


Skewed distributions

• Skew is always toward the direction of the longer tail
– Positive if skewed to the right
– Negative if to the left

The mean is shifted the most


Skewed distributions

• Because the mean is shifted so much, it is not the best estimate of the average score for skewed distributions
• The median is a better estimate of the center of skewed distributions
– It will be the central point of any distribution
– 50% of the values are above and 50% below the median
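A small numeric illustration of why the median is preferred for skewed data. The visit counts are hypothetical, with one extreme value added to produce a positive skew.

```python
import statistics

# Hypothetical visit counts; the last list adds one extreme value (30).
symmetric_counts = [1, 2, 2, 3, 3, 3, 4, 4]
skewed_counts = symmetric_counts + [30]

mean_before = statistics.mean(symmetric_counts)
median_before = statistics.median(symmetric_counts)

mean_after = statistics.mean(skewed_counts)
median_after = statistics.median(skewed_counts)

print(f"without extreme value: mean={mean_before:.2f}, median={median_before}")
print(f"with extreme value:    mean={mean_after:.2f}, median={median_after}")
```

The single extreme value pulls the mean from 2.75 up to about 5.78, while the median stays at 3.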


Properties of normal curves

• About 68.3% of the area under a normal curve is within one standard deviation (SD) of the mean
• About 95.4% is within two SDs
• About 99.7% is within three SDs
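These proportions can be checked numerically with the standard normal distribution from Python's `statistics` module. The helper `area_within` is a hypothetical name introduced for this sketch.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def area_within(k):
    """Proportion of a normal distribution within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} SD: {100 * area_within(k):.2f}%")
```

The same proportions hold for any normal distribution, whatever its mean and SD, because the calculation depends only on the number of SDs from the mean.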


More properties of normal curves (cont.)


Standard deviation (SD)

• SD is a measure of the variability of a set of data
• The mean represents the average of a group of scores, with some of the scores being above the mean and some below
– This scatter of scores around the mean is referred to as variability or spread
• Variance ($S^2$) is another measure of spread


SD (cont.)

• In effect, SD is the average amount of spread in a distribution of scores
• The next slide is a group of 10 patients whose mean age is 40 years
– Some are older than 40 and some younger


SD (cont.)

Ages are spread out along an X axis

The amount ages are spread out is known as dispersion or spread


Distances ages deviate above and below the mean

Adding deviations always equals zero


Calculating $S^2$

• To find the average deviation, one would normally total the deviations above and below the mean and then divide by the number of values
• However, the total always equals zero
– The deviations must first be squared, which removes the negative signs


Calculating $S^2$ (cont.)

• $S^2$ is not in the same units (age), but SD is (SD is the square root of $S^2$)
• S is the symbol for the SD of a sample; $\sigma$ for a population
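The steps for variance and SD can be traced with a small worked example. The ten ages below are hypothetical, chosen only so that their mean is 40 as in the slides. Because the slides' formula image did not survive, both divisors are shown: dividing by n, as the text describes, and dividing by n − 1, the usual convention for a sample. Which one the original slide used is an assumption left open here.

```python
import math

# Hypothetical ages of 10 patients with a mean of 40 (assumed for illustration).
ages = [25, 30, 32, 38, 40, 41, 44, 46, 50, 54]

n = len(ages)
mean_age = sum(ages) / n

# Deviations above and below the mean always add up to zero...
deviations = [age - mean_age for age in ages]
sum_of_deviations = sum(deviations)

# ...so they are squared before being averaged.
sum_of_squares = sum(d ** 2 for d in deviations)

variance_population = sum_of_squares / n    # divide by the number of values
variance_sample = sum_of_squares / (n - 1)  # common sample convention (assumption)

# The square root returns the result to the original units (years).
sd_population = math.sqrt(variance_population)
sd_sample = math.sqrt(variance_sample)

print(mean_age, sum_of_deviations, sum_of_squares)
print(round(variance_population, 1), round(sd_population, 2))
print(round(variance_sample, 1), round(sd_sample, 2))
```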


Wide spread results in higher SDs; narrow spread results in lower SDs


Spread is important when comparing 2 or more group means

It is more difficult to see a clear distinction between groups in the upper example because the spread is wider, even though the means are the same


z-scores

• The number of SDs that a specific score is above or below the mean in a distribution
• Raw scores can be converted to z-scores by subtracting the mean from the raw score, then dividing the difference by the SD

$z = \frac{X - \text{mean}}{\text{SD}}$


z-scores (cont.)

• Standardization
– The process of converting raw scores to z-scores
– The resulting distribution of z-scores will always have a mean of zero, an SD of one, and an area under the curve equal to one
• The proportion of scores that are higher or lower than a specific z-score can be determined by referring to a z-table
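Standardization can be sketched as follows. The raw scores are hypothetical, and the SD is computed by dividing by n (`statistics.pstdev`), which is an assumption made so that the z-scores come out with an SD of exactly one. The final lookup uses the standard normal distribution in place of a printed z-table and is only meaningful if the raw scores are roughly normally distributed.

```python
import statistics

# Hypothetical raw scores (assumed for illustration).
scores = [25, 30, 32, 38, 40, 41, 44, 46, 50, 54]

mean = statistics.mean(scores)
sd = statistics.pstdev(scores)  # SD with divisor n (assumption)

# z = (raw score - mean) / SD
z_scores = [(x - mean) / sd for x in scores]

z_mean = statistics.mean(z_scores)
z_sd = statistics.pstdev(z_scores)

# Proportion of a normal distribution lying below z = 1.5.
proportion_below = statistics.NormalDist().cdf(1.5)

print([round(z, 2) for z in z_scores])
print(round(abs(z_mean), 6), round(z_sd, 6), round(proportion_below, 4))
```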


z-scores (cont.)

Refer to a z-table to find the proportion under the curve


z-scores (cont.)

Partial z-table (to z = 1.5) showing proportions of the area under a normal curve for different values of z.

Z    0.00   0.01   0.02   0.03   0.04   0.05   0.06   0.07   0.08   0.09
0.0  0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1  0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2  0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3  0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4  0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5  0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6  0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7  0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8  0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.9  0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0  0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
1.1  0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2  0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3  0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4  0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5  0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441

[Figure note: the table value corresponds to the area under the curve in black]
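The rows of a z-table like this one can be regenerated from the cumulative distribution function of the standard normal distribution. The helper `z_table_row` is a hypothetical name for this sketch; each entry is the proportion of the area under the curve to the left of z, rounded to 4 decimal places.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, SD 1


def z_table_row(z_tenths):
    """Cumulative proportions for z_tenths + 0.00, + 0.01, ..., + 0.09."""
    return [round(standard_normal.cdf(z_tenths + hundredths / 100), 4)
            for hundredths in range(10)]


print("Z    " + " ".join(f"{h / 100:.2f}  " for h in range(10)))
for tenths in range(0, 16):
    z = tenths / 10
    print(f"{z:.1f}  " + " ".join(f"{p:.4f}" for p in z_table_row(z)))
```

Reading the row for 1.5 under the 0.00 column gives the proportion of scores below z = 1.50.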