Rating scale: Difference between revisions

Latest revision as of 01:31, 14 May 2024

A rating scale is a set of categories designed to obtain information about a quantitative or a qualitative attribute. In the social sciences, particularly psychology, common examples are the Likert response scale and 0–10 rating scales, in which a person selects the number that reflects the perceived quality of a product.

Background

A rating scale is a method that requires the rater to assign a value, sometimes numeric, to the rated object, as a measure of some rated attribute.

Types of rating scales

All rating scales can be classified into one of these types:

Numeric Rating Scale (NRS)
Verbal Rating Scale (VRS)
Visual Analogue Scale (VAS)
Likert
Graphical rating scale
Descriptive graphic rating scale

Some data are measured at the ordinal level. Numbers indicate the relative position of items, but not the magnitude of difference. Attitude and opinion scales are usually ordinal; one example is a Likert response scale:

Statement

e.g. "I could not live without my computer".

Response options

Strongly disagree
Disagree
Neutral
Agree
Strongly agree

Some data are measured at the interval level. Numbers indicate the magnitude of difference between items, but there is no absolute zero point. A good example is a Fahrenheit/Celsius temperature scale where the differences between numbers matter, but placement of zero does not.

Some data are measured at the ratio level. Numbers indicate magnitude of difference and there is a fixed zero point. Ratios can be calculated. Examples include age, income, price, costs, sales revenue, sales volume and market share.
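The difference between the interval and ratio levels can be checked with a little arithmetic. The following is a minimal sketch; the temperatures and prices are illustrative values chosen here, not figures from any study. It shows that a ratio of two temperatures changes when the unit changes, while a ratio of two temperature differences, or of two prices, does not.

```python
def celsius_to_fahrenheit(c: float) -> float:
    """Convert a Celsius temperature to Fahrenheit."""
    return c * 9 / 5 + 32


# Illustrative temperatures (assumed values).
a_c, b_c, c_c = 10.0, 20.0, 40.0
a_f, b_f, c_f = (celsius_to_fahrenheit(t) for t in (a_c, b_c, c_c))

# Interval level: the ratio of two values depends on where zero is placed,
# so "20 degrees is twice as hot as 10 degrees" is not a meaningful claim.
ratio_celsius = b_c / a_c          # 20 / 10
ratio_fahrenheit = b_f / a_f       # 68 / 50

# Differences are meaningful: their ratio is the same in both units.
diff_ratio_celsius = (c_c - b_c) / (b_c - a_c)
diff_ratio_fahrenheit = (c_f - b_f) / (b_f - a_f)

# Ratio level: price has a fixed zero, so ratios survive a change of unit.
price_ratio_dollars = 20.0 / 10.0
price_ratio_cents = 2000.0 / 1000.0

print(ratio_celsius, ratio_fahrenheit)
print(diff_ratio_celsius, diff_ratio_fahrenheit)
print(price_ratio_dollars, price_ratio_cents)
```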

More than one rating scale question is required to measure an attitude or perception due to the requirement for statistical comparisons between the categories in the polytomous Rasch model for ordered categories.^[1] In classical test theory, more than one question is required to obtain an index of internal reliability such as Cronbach's alpha,^[2] which is a basic criterion for assessing the effectiveness of a rating scale.
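As an illustration of why several questions are needed, Cronbach's alpha can be sketched in a few lines. The function below uses the standard formula, alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores), where k is the number of items. The function name and the response data are hypothetical, chosen only for this example.

```python
from statistics import variance


def cronbach_alpha(items: list[list[float]]) -> float:
    """Cronbach's alpha, where items[i][p] is respondent p's score on item i."""
    k = len(items)
    if k < 2:
        # With a single item there is nothing to compare it against.
        raise ValueError("Cronbach's alpha needs at least two items")
    # Total score of each respondent across all items.
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical responses of five people to three 1-5 items.
responses = [
    [2, 3, 4, 4, 5],
    [1, 3, 3, 4, 5],
    [2, 2, 4, 5, 5],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 3))
```

Passing a single item raises an error, which mirrors the point made above: with one rating per person, no index of internal reliability can be computed.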

Rating scales used online

Rating scales are widely used online to indicate consumer opinions of products. Examples of sites that employ rating scales are IMDb, Epinions.com, Yahoo! Movies, Amazon.com, BoardGameGeek and TV.com which use a rating scale from 0 to 100 in order to obtain "personalised film recommendations".

In almost all cases, online rating scales allow only one rating per user per product, though there are exceptions such as Ratings.net, which allows users to rate products in relation to several qualities. Most online rating facilities also provide few or no qualitative descriptions of the rating categories, although again there are exceptions such as Yahoo! Movies, which labels each of the categories between F and A+, and BoardGameGeek, which provides explicit descriptions of each category from 1 to 10. Often, only the top and bottom categories are described, as on IMDb's online rating facility.

Validity

Validity refers to how well a tool measures what it is intended to measure. With each user rating a product only once, for example in a category from 1 to 10, there is no means of evaluating internal reliability using an index such as Cronbach's alpha. It is therefore impossible to evaluate the validity of the ratings as measures of viewer perceptions. Establishing validity would require establishing both reliability and accuracy (i.e. that the ratings represent what they are supposed to represent). The degree of validity of an instrument is determined through the application of logic or statistical procedures. "A measurement procedure is valid to the degree that it measures what it proposes to measure."

Another fundamental issue is that online ratings usually involve convenience sampling much like television polls, i.e. they represent only the opinions of those inclined to submit ratings.

Validity is concerned with different aspects of the measurement process. Types of validity include content validity, predictive validity, and construct validity. Each of these types uses logic, statistical verification or both to determine the degree of validity, and each has special value under certain conditions.

Sampling

Sampling errors can lead to results that have a specific bias, or that are relevant only to a specific subgroup. Consider this example: suppose that a film appeals only to a specialist audience, so that 90% of its viewers are devotees of the genre and only 10% are people with a general interest in movies. Assume the film is very popular among the audience that views it, and that only those who feel most strongly about the film are inclined to rate it online; hence the raters are all drawn from the devotees. This combination may lead to very high ratings of the film, which do not generalize beyond the people who actually see the film (or possibly even beyond those who actually rate it).
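The film example can be made concrete with a small calculation. The sketch below is illustrative only: the audience size, the individual ratings, and the threshold of 8 for a "favourable" rating are all assumptions made for this example. It compares the share of favourable ratings among those who rate with the share among everyone who saw the film.

```python
def favourable_share(ratings: list[int], threshold: int = 8) -> float:
    """Fraction of ratings at or above the threshold (an assumed cut-off)."""
    return sum(r >= threshold for r in ratings) / len(ratings)


# Hypothetical opinions of 1,000 viewers on a 1-10 scale.
devotees = [9] * 900        # 90% of viewers: devotees of the genre
general = [5] * 100         # 10% of viewers: general interest in movies

viewers = devotees + general
raters = devotees           # assumption: only devotees submit a rating

published_share = favourable_share(raters)    # what the website would show
viewer_share = favourable_share(viewers)      # opinion of all viewers

print(published_share, viewer_share)
```

Under these assumptions the published figure overstates approval even among people who saw the film, and it says nothing about people who chose not to see it.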

Qualitative description

Qualitative descriptions of categories improve the usefulness of a rating scale. For example, if only the points 1–10 are given without description, some people may select 10 rarely, whereas others may select it often. If, instead, "10" is described as "near flawless", the category is more likely to mean the same thing to different people. This applies to all categories, not just the extreme points.

The above issues are compounded when aggregated statistics such as averages are used for lists and rankings of products. User ratings are at best ordinal categorizations. While it is not uncommon to calculate averages or means for such data, doing so cannot be justified, because an average is meaningful only if equal intervals on the scale represent equal differences between levels of perceived quality. The key issues with aggregate data based on the kinds of rating scales commonly used online are as follows:

Averages should not be calculated for data of the kind collected.
It is usually impossible to evaluate the reliability or validity of user ratings.
Products are not compared with respect to explicit, let alone common, criteria.
Only users inclined to submit a rating for a product do so.
Data are not usually published in a form that permits evaluation of the product ratings.
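The first point in the list can be illustrated with a short sketch. Ordinal categories carry only order, so any order-preserving relabelling of the categories is as defensible as the labels 1 to 5. In the example below, the ratings for two products and the alternative labelling `recode` are hypothetical values chosen to show the effect: the ranking by mean flips under the relabelling, whereas the ranking by median does not.

```python
from statistics import mean, median

# Hypothetical ratings of two products on a five-category ordinal scale.
product_a = [3, 3, 3, 3, 3]
product_b = [2, 2, 2, 5, 5]

# An order-preserving relabelling of the same five categories (assumed).
recode = {1: 1, 2: 2, 3: 4, 4: 5, 5: 6}


def recoded(ratings: list[int]) -> list[int]:
    """Apply the alternative category labels to a list of ratings."""
    return [recode[r] for r in ratings]


# Ranking by mean depends on the arbitrary spacing of the labels.
mean_prefers_b = mean(product_b) > mean(product_a)
mean_prefers_b_recoded = mean(recoded(product_b)) > mean(recoded(product_a))

# Ranking by median depends only on order, so it is unchanged.
median_prefers_a = median(product_a) > median(product_b)
median_prefers_a_recoded = median(recoded(product_a)) > median(recoded(product_b))

print(mean_prefers_b, mean_prefers_b_recoded)
print(median_prefers_a, median_prefers_a_recoded)
```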

More developed methodologies include Choice Modelling or Maximum Difference methods, the latter being related to the Rasch model due to the connection between Thurstone's law of comparative judgement and the Rasch model.

Rating scale reduction

An international collaborative research effort^[3] has introduced a data-driven algorithm for rating scale reduction. It is based on the area under the receiver operating characteristic curve (AUC).
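The published algorithm is not reproduced here. The following is only a minimal sketch of the underlying idea, under the assumption that each item is scored by how well it alone separates two known groups; the item names, scores, and group labels are hypothetical. The AUC is computed as the fraction of (positive, negative) pairs in which the positive case has the higher score, with ties counted as one half.

```python
def auc(scores: list[float], labels: list[int]) -> float:
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical data: six respondents, 1 = has the condition, 0 = does not.
labels = [0, 0, 0, 1, 1, 1]
items = {
    "item_a": [1, 2, 1, 4, 5, 4],
    "item_b": [3, 3, 3, 3, 3, 3],
    "item_c": [1, 3, 2, 2, 4, 5],
}

# Score every item, then rank items from most to least discriminating.
item_auc = {name: auc(scores, labels) for name, scores in items.items()}
ranking = sorted(item_auc, key=item_auc.get, reverse=True)

print(item_auc)
print(ranking)
```

In a sketch like this, items near the bottom of the ranking (with an AUC close to 0.5) are candidates for removal, since they contribute little to distinguishing the groups.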

Origins

The historical origins of rating scales were reevaluated following a significant archaeological discovery in Tbilisi, Georgia, in 2010. Excavators unearthed a tablet dating back to the early medieval period, marked with ancient Georgian script.^[4] This tablet showcased a series of linear markings, interpreted as an early form of a rating scale. The inscriptions provided insights into medieval methods of quantification and evaluation, suggesting an embryonic version of modern rating scales. This discovery is currently preserved at the National Museum of Georgia.^[5]

References

[1] Andrich, David (December 1978). "A rating formulation for ordered response categories". Psychometrika. 43 (4): 561–573. doi:10.1007/BF02293814. S2CID 120687848.
[2] Cronbach, Lee J. (September 1951). "Coefficient alpha and the internal structure of tests". Psychometrika. 16 (3): 297–334. CiteSeerX 10.1.1.452.6417. doi:10.1007/BF02310555. S2CID 13820448.
[3] Koczkodaj, Waldemar W.; Kakiashvili, T.; Szymańska, A.; Montero-Marin, J.; Araya, R.; Garcia-Campayo, J.; Rutkowski, K.; Strzałka, D. (2017). "How to reduce the number of rating scale items without predictability loss?". Scientometrics. 111 (2): 581–593. doi:10.1007/s11192-017-2283-4. PMC 5400800. PMID 28490822.
[4] "მსოფლიოში ერთ-ერთი უძველესი კბილის აღმომჩენები შარში ეხვევიან - სად არის ოროზმანელი ადამიანის კბილი?" [Discoverers of one of the world's oldest teeth get into trouble - where is the Orozmani human's tooth?]. რადიო თავისუფლება (in Georgian). 2022-09-21. Retrieved 2024-01-17.
[5] ""არ არის აუცილებელი, მთელ საქართველოში ერთდროულად გათხრები ტარდებოდეს" - არქეოლოგები გათხრის უფლებას ვერ იღებენ" ["It is not necessary for excavations to be carried out across all of Georgia at the same time" - archaeologists cannot obtain excavation permits]. რადიო თავისუფლება (in Georgian). 2022-06-21. Retrieved 2024-01-17.

External links

UEQ Semantic differential for measuring the User Experience


@@ Line 1: / Line 1: @@
+{{Short description|Type of informational measurement scale}}{{For|the application of rating scales to voting|score voting|STAR voting|rated voting}}
-:''Concerning rating scales as systems of educational marks, see articles about education in different countries (named "Education in ..."), for example, [[Education in Ukraine]].''
-:''Concerning rating scales used in the practice of medicine, see articles about diagnoses, for example, [[Major depressive disorder]].
-A '''rating scale''' is a set of categories designed to elicit information about a [[quantitative property|quantitative]]  or a [[Qualitative data|qualitative]] attribute.  In the [[social sciences]], particularly [[psychology]], common examples are the [[Likert scale|Likert response scale]] and [[Scale of one to ten|1-10 rating scales]] in which a person selects the number which is considered to reflect the perceived quality of a [[Product (business)|product]].
+A '''rating scale''' is a set of categories designed to obtain information about a [[quantitative property|quantitative]] or a [[Qualitative data|qualitative]] attribute. In the [[social sciences]], particularly [[psychology]], common examples are the [[Likert scale|Likert response scale]] and 0-10 rating scales, where a person selects the number that reflecting the perceived quality of a [[Product (business)|product]].
 ==Background==
-A rating scale is a method that requires the rater to assign a value, sometimes numeric, to the rated object, as a measure of some rated attribute
+A rating scale is a method that requires the rater to assign a value, sometimes numeric, to the rated object, as a measure of some rated attribute.
 ===Types of rating scales===
-All rating scales can be classified into one or two of three types:
+All rating scales can be classified into one of these types:
+# Numeric Rating Scale (NRS)
-# numeric rating scale
+# Verbal Rating Scale (VRS)
-# graphic rating scale
+# Visual Analogue Scale (VAS)
+# Likert
+# Graphical rating scale
 # Descriptive graphic rating scale
@@ Line 25: / Line 27: @@
 Some data are measured at the [[Level of measurement#Interval scale|interval level]]. Numbers indicate the magnitude of difference between items, but there is no absolute zero point. A good example is a Fahrenheit/Celsius temperature scale where the differences between numbers matter, but placement of zero does not.
-Some data are measured at the [[Level of measurement#Ratio measurement|ratio level]]. Numbers indicate magnitude of difference and there is a fixed zero point. Ratios can be calculated. Examples include age, income, price, costs, sales revenue, sales volume and market share.
+Some data are measured at the [[Level of measurement#Ratio scale|ratio level]]. Numbers indicate magnitude of difference and there is a fixed zero point. Ratios can be calculated. Examples include age, income, price, costs, sales revenue, sales volume and market share.
-More than one rating scale question is required to [[measurement|measure]] an attitude or perception due to the requirement for statistical comparisons between the categories in the [[polytomous Rasch model]] for ordered categories.<ref>{{cite journal |last=Andrich |first=David |date=December 1978 |title=A rating formulation for ordered response categories |journal=Psychometrika |volume=43 |issue=4 |pages=561-573 |doi=10.1007/BF02293814 |subscription=yes }}</ref> In terms of [[Classical test theory]], more than one question is required to obtain an index of internal reliability such as [[Cronbach's alpha]],<ref>{{cite journal |last=Cronbach |first=Lee J. |date=September 1951 |title=Coefficient alpha and the internal structure of tests |journal=Psychometrika |volume=16 |issue=3 |pages=297-334 |doi=10.1007/BF02310555 |subscription=yes }}</ref> which is a basic criterion for assessing the effectiveness of a rating scale and, more generally, a psychometric instrument.
+More than one rating scale question is required to [[measurement|measure]] an attitude or perception due to the requirement for statistical comparisons between the categories in the [[polytomous Rasch model]] for ordered categories.<ref>{{cite journal |last=Andrich |first=David |date=December 1978 |title=A rating formulation for ordered response categories |journal=Psychometrika |volume=43 |issue=4 |pages=561–573 |doi=10.1007/BF02293814 |s2cid=120687848 }}</ref> In [[classical test theory]], more than one question is required to obtain an index of internal reliability such as [[Cronbach's alpha]],<ref>{{cite journal |last=Cronbach |first=Lee J. |date=September 1951 |title=Coefficient alpha and the internal structure of tests |journal=Psychometrika |volume=16 |issue=3 |pages=297–334 |doi=10.1007/BF02310555 |citeseerx=10.1.1.452.6417 |s2cid=13820448 }}</ref> which is a basic criterion for assessing the effectiveness of a rating scale.
 ==Rating scales used online==
@@ Line 37: / Line 39: @@
 ===Validity===
 Validity refers to how well a tool measures what it intends to measure.
-With each user rating a product only once, for example in a category from 1 to 10, there is no means for evaluating internal [[reliability (statistics)|reliability]] using an index such as [[Cronbach's alpha]]. It is therefore impossible to evaluate the [[validity]] of the ratings as measures of viewer perceptions. Establishing validity would require establishing both reliability and accuracy (i.e. that the ratings represent what they are supposed to represent).The degree of validity of an instrument is determined through the application of logic/or statistical procedures." A measurement procedure is valid to the degree that if measures what it proposes to measure"
+With each user rating a product only once, for example in a category from 1 to 10, there is no means for evaluating internal [[reliability (statistics)|reliability]] using an index such as [[Cronbach's alpha]]. It is therefore impossible to evaluate the [[Validity (logic)|validity]] of the ratings as measures of viewer perceptions. Establishing validity would require establishing both reliability and accuracy (i.e. that the ratings represent what they are supposed to represent). The degree of validity of an instrument is determined through the application of logic/or statistical procedures. "A measurement procedure is valid to the degree that if measures what it proposes to measure."
 Another fundamental issue is that online ratings usually involve convenience [[sampling (statistics)|sampling]] much like television polls, i.e. they represent only the opinions of those inclined to submit ratings.
-Validity is concerned with different aspects of the measurement process.Each of these types uses logic, statistical verification or both to determine the degree of validity and has special value under certain conditions. Types of validity include content validity, predictive validity, and construct validity.
+Validity is concerned with different aspects of the measurement process. Each of these types uses logic, statistical verification or both to determine the degree of validity and has special value under certain conditions. Types of validity include content validity, predictive validity, and construct validity.
 ===Sampling===
@@ Line 57: / Line 59: @@
 More developed methodologies include [[Choice Modelling]] or [[MaxDiff|Maximum Difference]] methods, the latter being related to the [[Rasch model]] due to the connection between Thurstone's law of comparative judgement{{clarify|date=January 2012}} and the Rasch model.
+==Rating scale reduction==
+An international collaborative research effort<ref> {{Cite journal| last1=Koczkodaj|first1=Waldemar W|first2=T.|last2=Kakiashvili|first3=A.|last3=Szymańska|first4=J.|last4=Montero-Marin|first5=R.|last5=Araya|first6=J. |last6=Garcia-Campayo|first7=K.|last7=Rutkowski|first8=D.|last8=Strzałka| title=How to reduce the number of rating scale items without predictability loss?|journal=Scientometrics |year=2017 | volume=111|issue=2 |pages=581–593(2017)| language=en| doi=10.1007/s11192-017-2283-4|pmid=28490822 |pmc=5400800 | doi-access=free}}</ref> has introduced a data-driven algorithm for a rating scale reduction. It is based on the area under the [[receiver operating characteristic]].
+== Origins ==
+The historical origins of rating scales were reevaluated following a significant archaeological discovery in [[Tbilisi|Tbilisi, Georgia]], in 2010. Excavators unearthed a tablet dating back to the early medieval period, marked with ancient Georgian script.<ref>{{Cite web |date=2022-09-21 |title=მსოფლიოში ერთ-ერთი უძველესი კბილის აღმომჩენები შარში ეხვევიან - სად არის ოროზმანელი ადამიანის კბილი? |url=https://www.radiotavisupleba.ge/a/32044890.html |access-date=2024-01-17 |website=რადიო თავისუფლება |language=ka}}</ref> This tablet showcased a series of linear markings, interpreted as an early form of a rating scale. The inscriptions provided insights into medieval methods of quantification and evaluation, suggesting an embryonic version of modern rating scales. This discovery is currently preserved at the [[Georgian National Museum|National Museum of Georgia]]. <ref>{{Cite web |date=2022-06-21 |title="არ არის აუცილებელი, მთელ საქართველოში ერთდროულად გათხრები ტარდებოდეს" - არქეოლოგები გათხრის უფლებას ვერ იღებენ |url=https://www.radiotavisupleba.ge/a/31908732.html |access-date=2024-01-17 |website=რადიო თავისუფლება |language=ka}}</ref>
 ==See also==
+{{Wikiversity|Response formats}}
+*[[Likert scale]]
 *[[MaxDiff]]
+*[[Questionnaire]]
+*[[Questionnaire construction]]
 *[[Rating scales for depression]]
 *[[Semantic differential]]
 *[[Voting system]]
+*[[Receiver operating characteristic]]
 ==References==
@@ Line 68: / Line 81: @@
 ==External links==
-* [http://www.rasch-analysis.com/ How to apply Rasch analysis]
 * [http://www.ueq-online.org/ UEQ Semantic differential for measuring the User Experience]
 [[Category:Psychometrics]]
+[[Category:Rating systems]]
 [[Category:Recommender systems]]