Biostatistics

Prof. Dr. Lê Hoàng Ninh

© 2006

Definitions of some terms in biostatistics

• Data:
– Measurements or observations of a variable
• Variable:
– A characteristic that is observed or measured
– Can take different values from one subject to another

Evidence-based Chiropractic


Definitions of terms used in statistics

• Independent variable
– Precedes the dependent variable; the cause of some outcome
– Smoking -> lung cancer
– Drug A -> recovery
• Dependent variable:
– A measure of the effect or outcome
– Its value depends on the independent variable


Terms (cont.)

• Parameters
– Summary data from a population
• Statistics
– Summary data from a sample


Population

• The population is the set of individuals from which the sample is drawn
– e.g., headache patients in a chiropractic office; automobile crash victims in an emergency room
• In research, it is usually not possible to measure the entire population
• Therefore a sample (a subset of the population) must be drawn


Random samples

• Subjects are drawn from the population so that every individual has an equal chance of being selected
• Random samples tend to be representative of the population
• Non-random samples may not be representative
– May be biased regarding age, severity of the condition, socioeconomic status, etc.
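
Simple random sampling can be sketched in a few lines of Python. This is only an illustration: the population of 200 patient ID numbers, the sample size of 20, and the seed are all hypothetical. `random.sample` draws without replacement, so every individual has the same chance of being selected.

```python
import random

# Hypothetical population: ID numbers of 200 headache patients.
population = list(range(1, 201))

rng = random.Random(2006)              # fixed seed so the draw is reproducible
sample = rng.sample(population, k=20)  # simple random sample, no replacement

print(sorted(sample))
```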


Random samples (cont.)

• Random samples are rare in patient care research
• Instead, subjects are randomly allocated to two groups: a treatment group and a control group
– Each person has an equal chance of being assigned to either of the groups
• Random allocation to groups = randomization
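
One simple way to randomize is to shuffle the list of subjects and split it in half. The sketch below assumes two equal-sized groups; the function name `randomize`, the patient labels, and the seed are hypothetical, and real trials often use more elaborate schemes such as block or stratified randomization.

```python
import random

def randomize(subjects, seed=None):
    """Randomly allocate subjects to two groups: (treatment, control)."""
    rng = random.Random(seed)
    shuffled = list(subjects)   # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

subjects = [f"patient_{i:02d}" for i in range(1, 21)]  # hypothetical IDs
treatment, control = randomize(subjects, seed=7)

print("Treatment:", treatment)
print("Control:  ", control)
```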


Descriptive statistics (DSs)

• A way of summarizing data
• Describe a data set: its shape, central tendency, and variability
– The shape of data has to do with the frequencies of the values of observations


Descriptive statistics (cont.)

– Central tendency: the location of the middle of the data set
– Variability: how far the values lie below and above the central value
• Dispersion
• Descriptive statistics differ from inferential statistics
– Descriptive statistics cannot test hypotheses


A DATA SET

Case #   1   2   3   4   5   6   7   8   9   10  11  12  13  14
Visits   7   2   2   3   4   3   5   3   4   6   2   3   7   4

• Distribution provides a summary of:
– Frequencies of each of the values
• 2 visits – 3 cases
• 3 visits – 4 cases
• 4 visits – 3 cases
• 5 visits – 1 case
• 6 visits – 1 case
• 7 visits – 2 cases
– Ranges of values
• Lowest = 2
• Highest = 7


Frequency distribution table

Visits   Frequency   Percent   Cumulative %
2        3           21.4      21.4
3        4           28.6      50.0
4        3           21.4      71.4
5        1           7.1       78.5
6        1           7.1       85.6
7        2           14.3      100.0
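
A frequency table like this can be rebuilt from the raw data. The sketch below assumes the 14 visit counts from the data-set slide read as 7, 2, 2, 3, 4, 3, 5, 3, 4, 6, 2, 3, 7, 4 (a reading of the extracted slide, not a confirmed listing). It computes the cumulative percent from cumulative counts, which gives 78.6 and 85.7 for the values 5 and 6; the 78.5 and 85.6 in the table appear to come from adding the already rounded percentages.

```python
from collections import Counter

# Visits for the 14 cases (assumed reading of the data-set slide).
visits = [7, 2, 2, 3, 4, 3, 5, 3, 4, 6, 2, 3, 7, 4]

counts = Counter(visits)
n = len(visits)

table = []       # rows of (value, frequency, percent, cumulative percent)
cumulative = 0
for value in sorted(counts):
    cumulative += counts[value]
    table.append((
        value,
        counts[value],
        round(100 * counts[value] / n, 1),
        round(100 * cumulative / n, 1),
    ))

print("Visits\tFreq\tPercent\tCumulative %")
for row in table:
    print(*row, sep="\t")
```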


FREQUENCY DISTRIBUTION DISPLAYED AS A HISTOGRAM


Histograms (cont.)

• A histogram is a type of bar chart, but there are no spaces between the bars
• Histograms are used to visually depict frequency distributions of continuous data
• Bar charts are used to depict categorical information
– e.g., Male–Female, Mild–Moderate–Severe, etc.


MEASURES OF CENTRAL TENDENCY

• Mean
– The most commonly used DS
• Calculating the mean
– Add all values of a series of numbers and then divide by the total number of elements


Formula for the mean

• Sample mean

$$\bar{X} = \frac{\sum X}{n}$$

• Population mean

$$\mu = \frac{\sum X}{N}$$

$\bar{X}$ (X bar) refers to the mean of a sample and $\mu$ refers to the mean of a population
$\sum X$ is a command that adds all of the X values
$n$ is the total number of values in the series of a sample and $N$ is the same for a population
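
As a quick numerical check of the sample mean formula, the sketch below applies it to the 14 visit counts from the earlier data-set slide (the list is an assumed reading of that slide).

```python
# Visits for the 14 cases (assumed reading of the data-set slide).
visits = [7, 2, 2, 3, 4, 3, 5, 3, 4, 6, 2, 3, 7, 4]

n = len(visits)          # number of values in the sample
total = sum(visits)      # the "sum of X" term
sample_mean = total / n  # X bar

print(f"n = {n}, sum = {total}, mean = {sample_mean:.2f}")
```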


Measures of central tendency

[Figure: histogram with the mode labeled]

• Mode

– The most frequently occurring value in a series

– The modal value is the highest bar in a histogram
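
The mode can be found by counting how often each value occurs. The sketch below again uses the assumed 14 visit counts from the data-set slide and returns a list, because a series can have more than one most frequent value.

```python
from collections import Counter

# Visits for the 14 cases (assumed reading of the data-set slide).
visits = [7, 2, 2, 3, 4, 3, 5, 3, 4, 6, 2, 3, 7, 4]

counts = Counter(visits)
highest = max(counts.values())  # height of the tallest histogram bar
modes = sorted(value for value, count in counts.items() if count == highest)

print("Mode(s):", modes, "occurring", highest, "times")
```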


Measures of central tendency (cont.)

• Median
– The value that divides a series of values in half when they are all listed in order
– When there is an odd number of values
• The median is the middle value
– When there is an even number of values
• Count from each end of the series toward the middle and then average the 2 middle values
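
The odd and even cases can be written out directly. The function below is a minimal sketch of the rule just described; the five-value list is made up for illustration, and the 14 visit counts are an assumed reading of the earlier data-set slide.

```python
def median(values):
    """Middle value of the ordered series (average of the 2 middle values if even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                                # odd count: one middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the two

odd_example = [5, 2, 7, 3, 4]                               # hypothetical data
visits = [7, 2, 2, 3, 4, 3, 5, 3, 4, 6, 2, 3, 7, 4]         # assumed reading

print(median(odd_example))
print(median(visits))
```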


Measures of central tendency (cont.)

• Each of the three methods of measuring central tendency has certain advantages and disadvantages
• Which method should be used?
– It depends on the type of data that is being analyzed
– e.g., categorical, continuous, and the level of measurement that is involved


Levels of measurement

• There are 4 levels of measurement
– Nominal, ordinal, interval, and ratio

1. Nominal
– Data are coded by a number, name, or letter that is assigned to a category or group
– Examples
• Gender (e.g., male, female)
• Treatment preference (e.g., manipulation, mobilization, massage)


Levels of measurement (cont.)

2. Ordinal
– Is similar to nominal because the measurements involve categories
– However, the categories are ordered by rank
– Examples
• Pain level (e.g., mild, moderate, severe)
• Military rank (e.g., lieutenant, captain, major, colonel, general)


Levels of measurement (cont.)

• Ordinal values only describe order, not quantity
– Thus, severe pain is not the same as 2 times mild pain
• The only arithmetic allowed for nominal and ordinal data is counting of categories (ordinal categories can also be compared by rank)
– e.g., 25 males and 30 females


Levels of measurement (cont.)

3. Interval
– Measurements are ordered (like ordinal data)
– Have equal intervals
– Do not have a true zero
– Examples
• The Fahrenheit scale, where 0° does not correspond to an absence of heat (no true zero), in contrast to Kelvin, which does have a true zero


Levels of measurement (cont.)

4. Ratio
– Measurements have equal intervals
– There is a true zero
– Ratio is the most advanced level of measurement, which can handle most types of mathematical operations


Levels of measurement (cont.)

• Ratio examples

– Range of motion

• No movement corresponds to zero degrees
• The interval between 10 and 20 degrees is the same as between 40 and 50 degrees

– Lifting capacity

• A person who is unable to lift scores zero
• A person who lifts 30 kg can lift twice as much as one who lifts 15 kg


Levels of measurement (cont.)

• NOIR is a mnemonic to help remember the names and order of the levels of measurement
– Nominal Ordinal Interval Ratio


Levels of measurement (cont.)

Measurement scale   Permissible mathematical operations                  Best measure of central tendency
Nominal             Counting                                             Mode
Ordinal             Greater or less than operations                      Median
Interval            Addition and subtraction                             Symmetrical – Mean; Skewed – Median
Ratio               Addition, subtraction, multiplication and division   Symmetrical – Mean; Skewed – Median


Shape of the data

• Histograms of frequency distributions have shape
• Distributions are often symmetrical with most scores falling in the middle and fewer toward the extremes
• Many biological data are approximately symmetrically distributed and form a normal curve (bell-shaped curve)


Shape of the data (cont.)

[Figure: line depicting the shape of the data]


The normal distribution

• Data that follow a normal curve have a normal distribution (also called a Gaussian distribution)
• Properties of a normal distribution
– It is symmetric about its mean
– The highest point is at its mean


The normal distribution (cont.)

[Figure: normal curve with the mean marked]

• The highest point of the overlying normal curve is at the mean
• As one moves away from the mean in either direction, the height of the curve decreases, approaching but never reaching zero
• A normal distribution is symmetric about its mean


The normal distribution (cont.)

Mean = Median = Mode


Skewed distributions

• The data are not distributed symmetrically in skewed distributions
– Consequently, the mean, median, and mode are not equal and are in different positions
– Scores are clustered at one end of the distribution
– A small number of extreme values are located in the limits of the opposite end


Skewed distributions (cont.)

• Skew is always toward the direction of the longer tail
– Positive if skewed to the right
– Negative if to the left
• The mean is shifted the most (toward the longer tail)


Skewed distributions (cont.)

• Because the mean is shifted so much, it is not the best estimate of the average score for skewed distributions
• The median is a better estimate of the center of skewed distributions
– It will be the central point of any distribution
– 50% of the values are above and 50% below the median
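
A small numerical example shows how a single extreme value pulls the mean but not the median. The visit counts below are hypothetical and chosen to be right-skewed; `statistics.mean` and `statistics.median` are from the Python standard library.

```python
import statistics

# Hypothetical right-skewed data: most patients have few visits, one has many.
skewed_visits = [2, 2, 3, 3, 3, 4, 4, 5, 6, 30]

mean_visits = statistics.mean(skewed_visits)      # pulled toward the long right tail
median_visits = statistics.median(skewed_visits)  # stays with the bulk of the values

print("Mean:  ", mean_visits)
print("Median:", median_visits)
```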


Properties of normal curves

• About 68.3% of the area under a normal curve is within one standard deviation (SD) of the mean
• About 95.4% is within two SDs
• About 99.7% is within three SDs
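
These proportions can be checked with the error function, since for a normal distribution the area within k SDs of the mean equals erf(k / √2). The helper name `within_k_sd` is hypothetical; `math.erf` is part of the Python standard library. The results are roughly 68.27%, 95.45% and 99.73%.

```python
import math

def within_k_sd(k):
    """Proportion of a normal distribution lying within k SDs of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"Within {k} SD: {100 * within_k_sd(k):.2f}%")
```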


More properties of normal curves (cont.)


Standard deviation (SD)

• SD is a measure of the variability of a set of data
• The mean represents the average of a group of scores, with some of the scores being above the mean and some below
– This range of scores is referred to as variability or spread
• Variance ($S^2$) is another measure of spread


SD (cont.)

• In effect, SD is the average amount of spread in a distribution of scores
• The next slide is a group of 10 patients whose mean age is 40 years
– Some are older than 40 and some younger


SD (cont.)

• Ages are spread out along an X axis
• The amount ages are spread out is known as dispersion or spread


Distances ages deviate above and below the mean

Adding deviations always equals zero


Calculating $S^2$

• To find the average deviation, one would normally total the deviations above and below the mean, add them together, and then divide by the number of values
• However, the total always equals zero
– Deviations must first be squared, which cancels the negative signs


Calculating $S^2$ (cont.)

• $S^2$ is not in the same units (age), but SD is
• $S$ is the symbol for the SD of a sample; $\sigma$ for a population
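
The steps for variance and SD can be traced numerically. The ten ages below are hypothetical, chosen only so that their mean is 40 years. The sketch also assumes the usual n − 1 denominator for a sample and N for a population; the slides do not show which denominator was used.

```python
import math

# Hypothetical ages of 10 patients whose mean is 40 years.
ages = [25, 30, 35, 38, 40, 40, 42, 45, 50, 55]
n = len(ages)
mean_age = sum(ages) / n

deviations = [age - mean_age for age in ages]
sum_of_deviations = sum(deviations)               # deviations cancel out to zero
sum_of_squares = sum(d ** 2 for d in deviations)  # squaring removes the signs

sample_variance = sum_of_squares / (n - 1)  # assumption: n - 1 for a sample
sample_sd = math.sqrt(sample_variance)      # back in the original units (years)

population_variance = sum_of_squares / n    # assumption: N for a population
population_sd = math.sqrt(population_variance)

print(f"Mean age = {mean_age}, sum of deviations = {sum_of_deviations}")
print(f"Sample: variance = {sample_variance:.2f}, SD = {sample_sd:.2f}")
print(f"Population: variance = {population_variance:.2f}, SD = {population_sd:.2f}")
```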


Wide spread results in higher SDs; narrow spread in lower SDs


Spread is important when comparing 2 or more group means

It is more difficult to see a clear distinction between groups in the upper example because the spread is wider, even though the means are the same


z-scores

• The number of SDs that a specific score is above or below the mean in a distribution
• Raw scores can be converted to z-scores by subtracting the mean from the raw score then dividing the difference by the SD

$$z = \frac{X - \mu}{\sigma}$$
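
The conversion is a one-line calculation. In the sketch below the function name `z_score` and the example numbers (a raw score of 52 in a distribution with mean 40 and SD 8) are hypothetical.

```python
def z_score(x, mean, sd):
    """Number of SDs that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Hypothetical example: a 52-year-old in a group with mean age 40 and SD 8.
z = z_score(52, mean=40, sd=8)
print(z)
```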


z-scores (cont.)

• Standardization
– The process of converting raw scores to z-scores
– The resulting distribution of z-scores will always have a mean of zero, a SD of one, and an area under the curve equal to one
• The proportion of scores that are higher or lower than a specific z-score can be determined by referring to a z-table
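
Standardizing a whole set of scores shows the mean-of-zero and SD-of-one property directly. The ages are hypothetical, and the sketch treats them as a complete population, so it uses `statistics.pstdev`; with a sample SD the standardized values would have a sample SD of one instead.

```python
import statistics

# Hypothetical ages, treated here as a whole population.
ages = [25, 30, 35, 38, 40, 40, 42, 45, 50, 55]

mu = statistics.mean(ages)
sigma = statistics.pstdev(ages)  # population SD (assumption, see above)

z_scores = [(age - mu) / sigma for age in ages]

print("Mean of z-scores:", round(statistics.mean(z_scores), 6))
print("SD of z-scores:  ", round(statistics.pstdev(z_scores), 6))
```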


z-scores (cont.)

Refer to a z-table to find the proportion under the curve


z-scores (cont.)

Partial z-table (to z = 1.5) showing proportions of the area under a normal curve for different values of z.

Z     0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0   0.5000  0.5040  0.5080  0.5120  0.5160  0.5199  0.5239  0.5279  0.5319  0.5359
0.1   0.5398  0.5438  0.5478  0.5517  0.5557  0.5596  0.5636  0.5675  0.5714  0.5753
0.2   0.5793  0.5832  0.5871  0.5910  0.5948  0.5987  0.6026  0.6064  0.6103  0.6141
0.3   0.6179  0.6217  0.6255  0.6293  0.6331  0.6368  0.6406  0.6443  0.6480  0.6517
0.4   0.6554  0.6591  0.6628  0.6664  0.6700  0.6736  0.6772  0.6808  0.6844  0.6879
0.5   0.6915  0.6950  0.6985  0.7019  0.7054  0.7088  0.7123  0.7157  0.7190  0.7224
0.6   0.7257  0.7291  0.7324  0.7357  0.7389  0.7422  0.7454  0.7486  0.7517  0.7549
0.7   0.7580  0.7611  0.7642  0.7673  0.7704  0.7734  0.7764  0.7794  0.7823  0.7852
0.8   0.7881  0.7910  0.7939  0.7967  0.7995  0.8023  0.8051  0.8078  0.8106  0.8133
0.9   0.8159  0.8186  0.8212  0.8238  0.8264  0.8289  0.8315  0.8340  0.8365  0.8389
1.0   0.8413  0.8438  0.8461  0.8485  0.8508  0.8531  0.8554  0.8577  0.8599  0.8621
1.1   0.8643  0.8665  0.8686  0.8708  0.8729  0.8749  0.8770  0.8790  0.8810  0.8830
1.2   0.8849  0.8869  0.8888  0.8907  0.8925  0.8944  0.8962  0.8980  0.8997  0.9015
1.3   0.9032  0.9049  0.9066  0.9082  0.9099  0.9115  0.9131  0.9147  0.9162  0.9177
1.4   0.9192  0.9207  0.9222  0.9236  0.9251  0.9265  0.9279  0.9292  0.9306  0.9319
1.5   0.9332  0.9345  0.9357  0.9370  0.9382  0.9394  0.9406  0.9418  0.9429  0.9441

Each table value corresponds to the area under the curve in black (the area to the left of z).
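
Values like those in a z-table can be computed rather than looked up. The sketch below uses the identity Φ(z) = ½(1 + erf(z / √2)) for the area to the left of z under the standard normal curve; the function name `phi` is hypothetical and `math.erf` is from the Python standard library.

```python
import math

def phi(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

below = phi(1.5)   # proportion of scores lower than z = 1.5
above = 1 - below  # proportion of scores higher than z = 1.5

print(f"Below z = 1.5: {below:.4f}")
print(f"Above z = 1.5: {above:.4f}")
```

Because the total area under the curve is one, the proportion above a z-score is one minus the tabulated value.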