Validity is one of the basic criteria that determines the quality of psychodiagnostic tests and techniques, closely related to the concept of reliability. It answers the question of how well a technique measures exactly what it is intended to measure; the more faithfully the quality under study is reflected, the greater the validity of the technique.
The question of validity arises first during development of the material, and again after a test or technique has been applied, when it must be established whether the measured degree of expression of a personality characteristic actually corresponds to the property the method was designed to measure.
Validity is expressed as the correlation between the results obtained with a test or technique and other measures of the same characteristics; it can also be argued comprehensively, using several techniques and criteria. Different types of validity are distinguished: conceptual, construct, criterion, and content validity, each with its own methods for establishing its degree. A reliability check is sometimes a mandatory requirement for psychodiagnostic methods whose quality is in doubt.
For psychological research to have real value, it must be both valid and reliable. Reliability allows the experimenter to be confident that the measured value is close to the true value, while validity indicates that what is being measured is exactly what the experimenter intends. Note that validity may imply reliability, but reliability does not imply validity: reliable scores may not be valid, whereas valid scores must be reliable. This is the essence of successful research and testing.
What is reliability
In testing reliability, the consistency of results obtained on repeated administrations of the test is assessed. Discrepancies in the data should be absent or insignificant; otherwise the test results cannot be trusted. Test reliability is a criterion that indicates the accuracy of measurement. The following test properties are considered essential:
- reproducibility of the results obtained from the study;
- the degree of accuracy of the measurement technique or related instruments;
- sustainability of results over a certain period of time.
In the interpretation of reliability, the following main components can be distinguished:
- the reliability of the measuring instrument itself (the soundness and objectivity of the test tasks), which can be assessed by calculating the corresponding coefficient;
- the stability of the characteristic being studied over a long period of time, as well as the predictability and smoothness of its fluctuations;
- objectivity of the result (that is, its independence from the personal preferences of the researcher).
Cronbach's alpha
This method, proposed by Lee Cronbach, compares the variance of each item with the variance of the entire scale. If the summed spread of scores on the individual questions is small relative to the spread of total test scores, that is, if the questions vary together, then each question is probing the same common property, and together they yield a value that can be considered true. If no such value emerges, that is, if the answers show random scatter, the test is unreliable and Cronbach's alpha coefficient will be equal to 0. If all questions measure the same attribute, the test is reliable and Cronbach's alpha will in that case be equal to 1.
Reliability factors
The degree of reliability can be affected by a number of negative factors, the most significant of which are the following:
- imperfection of the methodology (incorrect or inaccurate instructions, unclear wording of tasks);
- temporary instability or constant fluctuations in the values of the indicator that is being studied;
- inadequacy of the environment in which initial and follow-up studies are conducted;
- the changing behavior of the researcher, as well as the instability of the subject’s condition;
- subjective approach when assessing test results.
Calculating Cronbach's alpha
Cronbach's $\alpha$ is defined as

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_{Y_i}^{2}}{\sigma_{X}^{2}}\right),$$

where $k$ is the number of items in the scale, $\sigma_{X}^{2}$ is the variance of the total test score, and $\sigma_{Y_i}^{2}$ is the variance of item $i$.
An alternative way to calculate it is

$$\alpha = \frac{N\bar{c}}{\bar{v} + (N-1)\bar{c}},$$

where $N$ is the number of items in the scale, $\bar{v}$ is the average item variance in the sample, and $\bar{c}$ is the average of all covariances between items.
Currently, Cronbach's α is computed with SPSS, STATISTICA, and other modern statistical packages; it can also be calculated in Microsoft Excel.
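Cronbach's α can also be computed directly from the defining formula. A minimal sketch with hypothetical data (NumPy assumed available; the score matrix is invented for illustration):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, k_items) score matrix.

    alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))
    """
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)      # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of the sum score
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)

# Hypothetical data: 5 respondents answering 4 items on a 1-5 scale
scores = np.array([
    [4, 4, 3, 4],
    [2, 2, 2, 1],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [4, 5, 4, 4],
])
print(round(cronbach_alpha(scores), 3))  # 0.954
```

With these strongly covarying items the coefficient comes out close to 1, as the section above would predict.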
Methods for assessing test reliability
The following techniques can be used to determine test reliability.
The retesting method is one of the most common. It establishes the degree of correlation between the results of studies conducted at different times. The technique is simple and effective; however, repeated examinations tend to provoke irritation and negative reactions in subjects.
The internal consistency test method does not take into account the consistency of results obtained in repeated studies. It establishes the relationship between the answers that were given in one experiment. The test questions are divided into two lists (according to a certain principle), after which the correlation coefficient between the results is calculated.
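The split-half variant of the internal consistency method can be sketched as follows (hypothetical data; the odd/even split and the Spearman-Brown correction are one common choice of "certain principle"):

```python
import numpy as np

def split_half_reliability(items) -> float:
    """Split-half reliability: correlate odd-item and even-item half scores,
    then apply the Spearman-Brown correction for full test length."""
    items = np.asarray(items, dtype=float)
    odd = items[:, 0::2].sum(axis=1)   # score on odd-numbered items
    even = items[:, 1::2].sum(axis=1)  # score on even-numbered items
    r_half = np.corrcoef(odd, even)[0, 1]
    return 2 * r_half / (1 + r_half)   # Spearman-Brown step-up formula

# Hypothetical data: 5 respondents, 4 items on a 1-5 scale
scores = np.array([
    [4, 4, 3, 4],
    [2, 2, 2, 1],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [4, 5, 4, 4],
])
print(round(split_half_reliability(scores), 3))  # 0.949
```

The Spearman-Brown step corrects for the fact that each half is only half as long as the full test, so the raw half-correlation understates full-test reliability.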
The method of equivalent forms consists of using two or more tests with differently worded tasks that are nonetheless identical in essence, form, and difficulty. The reliability of the test is evidenced by identical or close results obtained from the equivalent forms. If the results differ greatly, they were most likely deliberately distorted, or the subject did not approach the procedure responsibly.
Reliability as stability
Stability of test results, or test-retest reliability, is the possibility of obtaining the same results from the same subjects on different occasions.
Stability is determined using repeated testing (retest):
This method proposes to carry out several measurements with a certain period of time (from a week to a year) with the same test. If the correlation between the results of various measurements is high, then the test is quite reliable. The lowest satisfactory value for test-retest reliability is 0.5. However, the reliability of not all tests can be checked by this method, since the quality, phenomenon or effect being assessed may itself be unstable (for example, our mood, which can change from one measurement to the next).
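A minimal sketch of such a test-retest check (hypothetical scores for six subjects tested twice, a month apart):

```python
import numpy as np

# Hypothetical scores of the same 6 subjects, tested twice a month apart
first  = np.array([12, 18, 9, 22, 15, 17])
second = np.array([13, 17, 10, 21, 14, 18])

r_tt = np.corrcoef(first, second)[0, 1]  # test-retest reliability
print(round(r_tt, 2))  # 0.98, well above the 0.5 satisfactory minimum
```

A value below 0.5 would, per the criterion above, mark the test as unsatisfactorily unstable.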
Another disadvantage of repeated testing is the habituation effect. Test takers are already familiar with the test and may even remember most of their answers from the previous test.
For these reasons, the reliability of psychodiagnostic techniques is also studied using parallel forms, for which equivalent (parallel) sets of tasks are constructed: the subjects perform a formally different test under similar conditions. There are, however, difficulties in proving that two forms are truly equivalent. Despite this, parallel forms have proven useful in practice for establishing test reliability.
What is validity
Test validity is a criterion that indicates how trustworthy a measurement is: the suitability of a particular instrument for assessing a certain psychological characteristic. Note that validity and reliability are complementary criteria; taken separately, each is of little value.
Validity can be viewed from a theoretical and a pragmatic perspective. The first concerns the assessment method or measuring instrument itself; the second concerns the purpose of the research being conducted. Note that this criterion may differ significantly for the same test depending on the group of subjects being tested; the highest estimates are around 80%.
The validity of a psychological test can be assessed according to quantitative or qualitative indicators. In the first case, we are talking about carrying out mathematical calculations. Qualitative assessment is made descriptively, based on logical conclusions.
Validity in psychology
In psychology, validity refers to the experimenter's confidence that a given technique measured exactly what he intended to measure; it shows the degree of agreement between the results and the technique itself relative to the tasks set. A valid measurement is one that measures what it was designed to measure: a technique aimed at determining temperament, for example, should measure precisely temperament and not something else.
Validity is a very important aspect of experimental psychology, an indicator that underwrites the trustworthiness of results, and it is often where the most problems arise. A perfect experiment would have impeccable validity: it would demonstrate that the experimental effect is caused by manipulation of the independent variable, would correspond fully to reality, and its results could be generalized without restriction. When we speak of the degree of this criterion, we mean how well the results correspond to the objectives set.
Validity testing is carried out in three ways.
Content validity assessment is carried out to find out how well the content of the technique corresponds to the reality in which the property under study manifests itself. A related component is obvious, or face, validity: the degree to which the test matches the expectations of those being assessed. For most methodologies it is considered very important that the assessment participant see an obvious connection between the content of the assessment procedure and the reality of the object being assessed.
Construct validity assessment is performed to establish the degree to which the test actually measures the constructs that are specified and scientifically grounded.
Construct validity has two dimensions. The first is convergent validation, which checks the expected relationship between the results of a technique and characteristics measured by other techniques targeting the same properties. If several methods exist for measuring some characteristic, a rational approach is to run experiments with at least two of them; a high positive correlation between the results supports a claim of validity.
Convergent validation thus determines whether a test score varies in line with expectations. The second dimension is discriminant validation: the technique should not measure any characteristics with which it theoretically should have no correlation.
Validity testing can also be criterion-based: guided by statistical methods, it determines the degree of agreement between the results and predetermined external criteria. Such criteria can be direct measures, methods independent of the results, or values of socially and organizationally significant performance indicators. Criterion validity also includes predictive validity, used when behavior needs to be forecast; if the forecast is borne out over time, the technique is predictively valid.
Types of test validity
The following main types of test validity are distinguished:
- construct validity is a criterion used when evaluating a test that has a hierarchical structure (used in studying complex psychological phenomena);
- criterion-based validity involves comparing test results with the subject's actual level of development of a given psychological characteristic;
- content validity determines the correspondence of the methodology to the phenomenon being studied, as well as the range of parameters it covers;
- predictive validity is a qualitative indicator that allows one to assess the future development of a parameter.
Content (logical) validity
Content validity means that the test is valid according to experts.
Content validity should be distinguished from obvious (face, external) validity: validity from the point of view of the subject, which plays an important role in the testing process, since it shapes the subject's attitude toward the examination. In some cases the two may coincide; in others, face validity is used to mask content validity. Example: the Domino test.
Content validity means that the test must represent, in the correct proportion, all key aspects of the behavior sample, that is, of the psychological area it is intended to diagnose. Example: aggressiveness (physical, verbal, indirect, etc.).
Suppose we are developing a test to assess success in a high school history curriculum. To ensure content validity, we must include questions on all periods, from prehistory to the modern era, and not just, say, on the history of the Middle Ages. In addition, questions should cover various aspects of people's lives, not just military battles or culture.
Work on creating a test begins with an analysis of the area being diagnosed and the drawing up of a so-called specification matrix, which records what type of questions, and how many, should be in the test; this is what ensures its content validity. In our example of a test on a history course, the specification matrix might list periods of history horizontally (prehistory, the slave-holding era, etc.) and various aspects of each era vertically (economic activity, political structure, military battles, culture, etc.), with the required number of test items at each intersection. Example: the Buss-Durkee questionnaire.
A specification matrix can only be created by an expert in the relevant field. In our example, such an expert is a highly qualified history teacher. It is he, and not the psychologist developing the history test, who determines how many and what specific tasks need to be included in the test, and the psychologist will then work on checking the reliability and validity of the test.
Content validity is not measured but built into the test development process; it is therefore not quantitative and cannot be represented as a correlation coefficient. The manual usually contains the specification matrix instead.
To determine content validity, expert methods are used: competent experts are selected and an examination procedure is organized in which the experts evaluate the content of the test items for compliance with the mental property that the test being validated is declared to measure. For this purpose, the experts are given the test specification (indicating the topics, the learning objectives, and the weight of each topic and task) and the list of tasks. If a task corresponds to the specification, the expert marks it as matching the content of the test; if not, he rejects it. To obtain a final assessment of content validity, the judgments of the individual experts on all tasks are aggregated. These content validity summaries are included in the test manual.
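One established convention for aggregating such per-item expert judgments numerically is Lawshe's content validity ratio (CVR); it is not mentioned in the text above and is included here only as an illustrative sketch:

```python
def content_validity_ratio(n_essential: int, n_experts: int) -> float:
    """Lawshe's CVR = (n_e - N/2) / (N/2), where n_e is the number of experts
    who rate the item as matching the specification and N is the total number
    of experts. Ranges from -1 (none agree) to +1 (all agree)."""
    half = n_experts / 2
    return (n_essential - half) / half

# Hypothetical panel: 8 of 10 experts accept an item as matching the test content
print(content_validity_ratio(8, 10))  # 0.6
```

An item rejected by most experts yields a negative CVR and would normally be dropped from the test.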
For criterion-referenced tests, the manual must contain information about the area of knowledge, skills, or learning objectives that the test measures, as well as the number of tasks for each learning objective. Typical errors and the working methods behind them are also analyzed. Since test performance in this case is assessed in terms of acquired material and skills, it is above all necessary that such tests be valid in content.
Criterion validity (empirical or criterion-related validity)
Criterion validity determines the ability of a test to serve as a predictor of a certain mental characteristic or form of human behavior, and involves taking into account independent indicators and signs by which the validity of the test can be judged. In practice, this means the correspondence of diagnostic results to real behavior, to the results of practical activity, and to the observed actions and reactions of the subject.
Criteria for assessing empirical validity may include:
- behavioral indicators: reactions, actions, and deeds in various situations;
- achievements in various types of activity (educational, labor, sports, etc.);
- data on the performance of control tests and tasks;
- data from other methods whose validity, or whose relationship to the given test, is considered firmly established.
A test will be empirically valid when it is established that the test taker behaves in life exactly as the test predicts.
“Criterion-based validity shows the extent to which the test results can be used to judge the aspect of an individual’s behavior that interests us in the present and future. To determine it, test performance is correlated with the criterion, i.e. a direct and independent measure of what the test should predict” (A. Anastasi).
Example: if we are interested in the extent to which a clinical test predicts a diagnosis, we must compare the test results with a medical opinion obtained on the basis of independent studies by medical means themselves, i.e. with data from the “Medical History”. If we are interested in the extent to which this test allows us to predict the success of a student’s further education in higher education, then we must compare the results on it with the results of subsequent studies at the institute, etc.
For most tests, criterion-related validity (more often simply called criterion validity) is the most important indicator, because it lets the psychologist and the "consumer" of psychodiagnostic information know clearly which aspects of behavior the test predicts, to what extent, and which external parameters it is associated with. For example, a psychologist has two intelligence tests: one has higher validity for mathematics subtests, the other for vocabulary subtests. If his task is to select the most capable applicants to a faculty of physics and mathematics, he should naturally prefer the first.
Empirically, criterion-related validity is manifested in the comparability of measurement results obtained by the method under study with results obtained by other methods, the validity of which is beyond doubt. If there are no methods whose validity is in doubt, then the connection between the measured characteristics and the quality being studied must be theoretically justified.
To prove this, it is tested whether the test results correlate with the results of other existing tests that predict the same sample of behavior and whose validity has already been proven. The presence of a relationship between the data of two tests is an indicator that the new test diagnoses approximately the same reality as the existing one. All tasks (items) of the test can be tested for criterion validity.
In general, the user should focus not on the name of the test, but on the criteria-based validity indicators: using them and only using them, he can determine what the test really measures and what problems it can be used to solve.
Since the criterion validity coefficient is nothing more than the correlation coefficient between the test results and the data on the parameter that we are going to evaluate or predict (i.e., the criterion), it is interpreted in the same way as any other correlation coefficient.
For example, a criterion validity coefficient equal to 1.00 indicates an absolutely direct relationship between test results and the criterion: the higher the test result, the higher the criterion result, and vice versa. The results of a test with such a validity coefficient fully reflect the subject's actual position among others on the measured parameter; errors of prediction would be associated only with the test's reliability. If the history test from our example had such an improbably high criterion validity coefficient, it would be an ideal tool for assessing graduates' knowledge of history: a more accurate assessment tool could not exist in principle.
A criterion validity coefficient equal to -1.00 indicates that there is an absolute inverse relationship between the results on the test and the criterion. The higher the test result, the lower the criterion result and vice versa. Such a test is also an ideal tool for assessment and prediction, but using the “by contradiction” method.
A criterion validity coefficient of 0.00 indicates that there is no relationship between test and criterion scores. A test with such criterion validity is absolutely meaningless. Its effectiveness is no greater than that of simple guessing.
Typically, the test validity coefficient ranges from 0.30 to 0.80, most often it is 0.40-0.60. For example, the criterion validity of the most authoritative test in the US education system, the DAT, is precisely in this range. Thus, a criterion validity of 0.40-0.60 can be considered a kind of standard.
A validity coefficient of, say, 0.47 means that the test and the criterion share 0.47² ≈ 22% of their variance: about 22% of individual differences in the criterion are accounted for by the factor the test measures, and the rest by all other factors. (Note that it is the squared coefficient, not the coefficient itself, that gives the proportion of shared variance.) By the same logic, a DAT validity of 0.40-0.60 means that roughly 16-36% of individual differences in student achievement are associated with the factor measured by the test.
There are several options for obtaining the criterion validity coefficient.
1. In the first case, the results of all subjects participating in the validation are compared with the data according to the selected criterion and the correlation coefficient between them is simply calculated.
To measure this property of the test, the correlation coefficient (r) between the test result and the external criterion is calculated. The criterion can be any independent indicator that measures the same psychological characteristic as the test being validated. The choice of criterion determines the qualitative and quantitative assessment of validity, so the choice of criterion is the central question for this type of validity.
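A minimal sketch of this computation (the test scores and criterion values are hypothetical; note that the resulting coefficient here is far higher than the 0.40-0.60 typical of real tests, as discussed above):

```python
import numpy as np

# Hypothetical validation sample: test scores and an external criterion
# (e.g. end-of-year grades) for the same 8 subjects
test_scores = np.array([34, 28, 41, 25, 38, 30, 45, 27])
criterion   = np.array([4.1, 3.5, 4.6, 3.2, 4.0, 3.8, 4.8, 3.0])

r = np.corrcoef(test_scores, criterion)[0, 1]  # criterion validity coefficient
print(round(r, 2))
print(round(r ** 2, 2))  # proportion of criterion variance the test accounts for
```

Both r and r² are worth reporting: r is the validity coefficient itself, while r² shows how much of the criterion's individual variation the test actually captures.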
There are three groups of criteria:
a) expert; b) experimental; c) “vital”.
a) Expert criteria involve the use of expert assessments. Although considerable attention is paid to this approach, expert validity criteria are rarely used in practice because of their low reliability and the difficulty of organizing expert evaluations. When validating tests intended for schoolchildren, teachers usually serve as experts, but their assessments are largely subject to distortion (likes and dislikes, transfer of attitudes from parents to students, from academic performance to personality traits, etc.).
b) Experimental criteria involve using the results of simultaneously testing the subjects with another test that supposedly measures the same mental property. The correlation coefficient between the results of the two independent measurements is called empirical validity. Its value depends on the degree to which the tests' content coincides, the comparability of the units of measurement, the nature of the standardization samples, and the reliability of the tests. Parallel tests therefore have the maximum coefficients of empirical validity: if the parallel forms were absolutely reliable, their empirical validity would equal 1, since in all other respects they are identical.
c) If at the time of validation no suitable experimental criterion is available, characteristics of real behavior associated with the measured psychological property are used instead; these are called "life" criteria. For example, indicators of educational success serve as life criteria for intelligence tests; success in administrative work, for extraversion; frequency of nervous illness, for anxiety; final results of vocational training, for technical abilities; and so on. However, success in learning, behavior, and activity rarely depends on a single property of the psyche; as a rule, it depends on a complex of mental properties. Life criteria are therefore used mainly to validate multidimensional test batteries such as the MMPI or 16PF. The validity of a test with respect to a life criterion is sometimes called practical validity.
Types of Validity Criteria
Test validity is one of the indicators that allows one to assess the adequacy and suitability of a technique for studying a particular phenomenon. Four main groups of criteria can affect it:
- performer criteria (the qualifications and experience of the researcher);
- subjective criteria (the subject's attitude toward a particular phenomenon, which is reflected in the final test result);
- physiological criteria (health status, fatigue, and other characteristics that can significantly affect the final test result);
- the criterion of chance (accounts for the probability of a particular event occurring).
The validity criterion is an independent source of data about a particular phenomenon (psychological property), the study of which is carried out through testing. Until the results obtained are checked for compliance with the criterion, validity cannot be judged.
Psychometric properties of psychodiagnostic methods
The psychometric basis of any technique is its scales. The concept of "scale" is interpreted in a broad and a narrow sense: in the first case, the scale is the specific technique itself; in the second, it is the measurement scale that records the characteristics being studied. Each element of the technique corresponds to a certain score or index, which expresses the degree of expression of a particular mental phenomenon.
Measuring scales are divided into:
- Metric: interval, ratio scales.
- Non-metric: nominative, ordinal.
| Scale | Explanation, examples |
|---|---|
| Nominative (naming scale) | Assigns an observed phenomenon to the appropriate class on the basis of a common property or attribute. The naming scale is the most common in psychodiagnostic methods; it is used, for example, in test questionnaires, where the subject's denial or affirmation is compared with the answers in the key. A nominative scale may also involve selecting one or more characteristics from those proposed. |
| Ordinal | Divides the set of characteristics by the "more-less" principle, arranging results in ascending or descending order. An ordinal scale is used in the color-choice test: the subject selects one of the squares on a white background, the chosen square is set aside, and the procedure is repeated; the result is the colors ranked by attractiveness, each assigned a rank number. |
| Interval | Elements are ordered not only by the degree of the measured characteristic but also by the size of the distances between them, expressed as intervals between the numbers assigned to degrees of expression. Interval scales are often used when standardizing primary test scores. |
| Ratio | Arranges elements by numerical value while preserving proportionality between them: the numbers assigned to classes of objects are proportional to the degree of expression of the properties being measured. Used, for example, to determine the sensitivity thresholds of analyzers; common in psychophysics. |
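As noted for the interval scale, standardization of primary scores rests on interval measurement. A minimal sketch converting hypothetical raw scores to z-scores and then to T-scores (the conventional T-scale has mean 50 and standard deviation 10):

```python
import numpy as np

raw = np.array([12, 15, 9, 20, 14])  # hypothetical primary (raw) test scores

z = (raw - raw.mean()) / raw.std(ddof=1)  # z-scores: mean 0, sd 1
t = 50 + 10 * z                           # T-scores: mean 50, sd 10

print(np.round(t, 1))
```

After this transformation, a subject's position can be compared across tests that used entirely different raw-score ranges.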
After determining the scale on which the test is built, it is necessary to determine the coefficients characterizing the psychometric properties of the technique.
These include:
- Representativeness.
- Standardization.
- Reliability.
- Validity.
Representativeness is a property of the sample of subjects: the degree to which the sample reflects the population to which the results are to be generalized. Representativeness has a qualitative and a quantitative parameter: the qualitative parameter characterizes the choice of subjects and the method of constructing the sample, while the quantitative parameter is the sample size expressed in numbers.
In psychological research, this property determines the extent to which results can be generalized. For example, suppose relationships between men and women are being studied. If subjects of widely different ages (schoolchildren, students, adults, pensioners) are lumped together in a small sample, it will represent none of these groups well, and its representativeness will be low. If, however, the subjects are homogeneous in age and field of activity (for example, only students, of both sexes), the representativeness with respect to that group will be high. In psychodiagnostics, representativeness indicates whether a technique can be applied to the entire population.
Standardization is the unification of a methodology: bringing its components and application procedures to uniform standards. A psychodiagnostic method (PDM) should be universal, applicable by different specialists in different situations. If the structure of a PDM deviates from the standards, its results will not be comparable with the results of other studies. Non-standardized methods are used mainly in scientific research, where new mental phenomena are studied; such techniques cannot be used for psychodiagnostic purposes.
Another important parameter of a PDM is reliability. It characterizes the accuracy, stability, and consistency of the results obtained with a specific technique. High reliability eliminates the influence of extraneous factors and brings the experiment significantly closer to a "pure" one. Reliability and validity are different concepts; moreover, reliability is interpreted more broadly than validity: reliability > validity.
For example, on a day off a person gets the opportunity to spend time either fishing or hunting. If he decides to go hunting, but takes a fishing rod with him, then his choice will not be valid. However, if a person went hunting with a gun and it misfired, then the chosen method is unreliable.
Basic criteria requirements
External criteria that influence the test validity indicator must meet the following basic requirements:
- compliance with the particular area in which the research is being conducted, relevance, as well as semantic connection with the diagnostic model;
- the absence of any interference or sharp breaks in the sample (the point is that all participants in the experiment must meet pre-established parameters and be in similar conditions);
- the parameter under study must be reliable, constant and not subject to sudden changes.
Discriminativity
Task discriminability is defined as the ability to separate subjects with a high overall test score from those who received a low score, or subjects with high educational productivity from subjects with low productivity.
In other words, discriminativeness is the ability of test items to differentiate students regarding the “maximum” or “minimum” test result. Determining the discriminativeness of a test task is necessary in order to put a barrier to low-quality tasks.
To calculate discriminativity, the method of extreme groups is used: the results of the most and the least successful students are compared. This is the simplest and most visual way of calculating discriminativity.
The proportion of subjects in the extreme groups can vary widely depending on sample size: the larger the sample, the smaller the proportion of subjects needed to form the high- and low-scoring groups. The lower limit of the group cutoff is 10% of the total sample, the upper limit 33%; 27% groups are typically used, since this percentage gives maximum accuracy in determining discriminativity. The discrimination index is calculated as the difference between the proportions of individuals who correctly solved the task in the high-productivity and low-productivity groups.
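The extreme-groups calculation described above can be sketched as follows (hypothetical data; the group fraction is a parameter, here the 27% mentioned above):

```python
import numpy as np

def discrimination_index(item_correct, total_scores, fraction=0.27):
    """Extreme-groups discrimination index for one test item.

    item_correct: 1/0 per subject (item solved or not)
    total_scores: total test score per subject
    Returns p(high group) - p(low group) using the top/bottom `fraction`.
    """
    item_correct = np.asarray(item_correct)
    total_scores = np.asarray(total_scores)
    k = max(1, int(round(len(total_scores) * fraction)))
    order = np.argsort(total_scores)       # subjects sorted by total score
    low, high = order[:k], order[-k:]      # bottom-k and top-k subjects
    return item_correct[high].mean() - item_correct[low].mean()

# Hypothetical data: whether each of 10 students solved one item (1/0)
# and their total test scores
item = [1, 0, 1, 0, 1, 1, 1, 0, 1, 0]
totals = [55, 40, 72, 38, 65, 50, 80, 45, 60, 35]
print(discrimination_index(item, totals))  # 1.0: item perfectly separates groups
```

An index near 0 (or negative) would flag the item as low-quality: it fails to separate high scorers from low scorers.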
The psychometric paradox is a phenomenon that arises when using personality questionnaires. Its essence is that questions (statements) with a high discriminativeness index are unstable with respect to repeatability of the result, while, conversely, stable answers are often observed for questions with low discriminativeness.
P. Eisenberg (1941) showed that questions that distinguish patients with neurosis from other patients or from healthy people are unreliable; in other words, there is little chance of obtaining the same answer on retesting. At the same time, questions classified as reliable differentiated the studied groups poorly or not at all. Later, the works of L. Goldberg (1963) and M. Novakovskaya (1975) were devoted to this phenomenon, which was named the psychometric paradox.
The psychometric paradox cannot be explained without a psychological analysis of how answers to questionnaire items are formed. According to M. Novakovskaya, questions, while remaining formally unchanged, undergo semantic (psychological) transformations both interindividually and intraindividually. Interindividual variability has two sources: differences in the severity of the measured trait (property) among subjects and differences in how the meaning of the questions is understood. Intraindividual variability is due to variability in meaning, difficulty in deciding on a response, and fluctuation in trait expression (the last source of variability may be ignored if the interval between repeated trials is short).
For the psychological interpretation of the psychometric paradox, M. Novakovskaya suggests distinguishing three determinants of responses: the severity of the trait in the subject, the meaning attributed to the question, and the ease of deciding on an answer. She also emphasizes the need to distinguish unambiguous questions from ambiguous ones, which in a certain sense can be likened to projective stimuli.
M. Novakovskaya distinguishes two types of psychometric paradox, type L and type B, and proceeds from the following hypotheses about their origin. A type L paradox arises with questions that can be interpreted in different ways (have multiple meanings), as well as when it is difficult to decide on an answer. Such questions have high discriminativeness combined with significant variability of the answer. A type B paradox arises with unambiguous questions for which it is easy to find an answer. This category also includes so-called one-sided diagnostic questions, i.e., questions for which only one type of answer is diagnostically significant. Such questions are characterized by weak discriminativeness and slight variability.
The psychometric paradox must be taken into account when designing (or adapting) personality questionnaires.
Ways to Establish Validity
Checking the validity of tests can be done in several ways.
Assessing face validity involves checking whether a test appears, on inspection, to be fit for its purpose.
Assessing content validity is checking a methodology for the presence of all the components necessary for a comprehensive study of a particular phenomenon or factor.
Construct validity is assessed when a series of experiments is conducted to study a specific complex measure. It includes:
- convergent validation: checking the relationship between assessments obtained using various related techniques;
- divergent validation: making sure the methodology does not assess extraneous indicators unrelated to the main study.
Assessing predictive validity involves establishing the possibility of predicting future fluctuations of the indicator being studied.
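Convergent and divergent validation both reduce to correlating scores across techniques. The following is a minimal sketch under invented data; the scale names, the numbers, and the expectations in the comments are all illustrative assumptions, not from the text.

```python
# Convergent vs. divergent validation (illustrative sketch).
# A new scale should correlate highly with an established measure of the
# same construct (convergent) and weakly with an unrelated one (divergent).
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

new_scale   = [12, 18, 9, 22, 15, 7, 20, 11]   # new technique
established = [14, 19, 10, 24, 13, 8, 21, 12]  # related, established technique
unrelated   = [20, 21, 19, 18, 23, 20, 19, 22] # measure of a different construct

r_conv = pearson_r(new_scale, established)  # expected: high
r_div  = pearson_r(new_scale, unrelated)    # expected: near zero
```

A high `r_conv` together with a near-zero `r_div` is the pattern that supports construct validity; a high correlation with the unrelated measure would suggest the new scale assesses extraneous indicators.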
Cronbach's alpha
Cronbach's alpha generally increases as the intercorrelations between items increase, and it is therefore considered a marker of the internal consistency of test scores. Since intercorrelations are maximal when all items measure the same thing, Cronbach's alpha indirectly indicates the extent to which the items measure a single characteristic. Alpha is therefore most appropriate when all items are aimed at measuring the same phenomenon or property. Note, however, that a high coefficient value indicates that the set of items has a common basis, but not that a single factor lies behind them: the unidimensionality of the scale should be confirmed by additional methods. When a heterogeneous construct is measured, Cronbach's alpha will often be low, so alpha is not suitable for assessing the reliability of intentionally heterogeneous instruments (for example, the original MMPI; in such cases it makes sense to compute it separately for each scale).
Professionally developed tests are expected to have an internal consistency of at least 0.90.
The alpha coefficient can also be used to solve other types of problems. Thus, it can be used to measure the degree of agreement between experts assessing a particular object, the stability of data during repeated measurements, etc.
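The coefficient itself is straightforward to compute from an items-by-examinees score matrix using the standard formula α = k/(k−1) · (1 − Σσ²ᵢ / σ²ₜₒₜₐₗ). A minimal sketch with invented item data:

```python
# Cronbach's alpha (illustrative sketch; the item data are invented).
# alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
from statistics import pvariance

def cronbach_alpha(items):
    """items: k lists, one per test item, each holding one score per examinee."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # total score per examinee
    return k / (k - 1) * (1 - sum(pvariance(i) for i in items) / pvariance(totals))

# Three items whose scores rise and fall together -> high internal consistency
items = [
    [2, 4, 4, 5, 1, 3],
    [3, 4, 5, 5, 2, 3],
    [2, 5, 4, 4, 1, 2],
]
print(round(cronbach_alpha(items), 2))  # -> 0.95
```

Because the three items move together across examinees, the item variances are small relative to the variance of the total scores, which is exactly what drives alpha toward 1.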
Conclusions
Test validity and reliability are complementary indicators that together provide the most complete assessment of the soundness and significance of research results. They are often determined simultaneously.
Reliability shows how far the test results can be trusted, meaning their constancy each time a similar test is repeated with the same participants. A low degree of reliability may indicate intentional distortion of the results or a careless approach.
The concept of test validity concerns the qualitative side of the experiment: whether the chosen tool actually corresponds to the psychological phenomenon being assessed. Both qualitative indicators (theoretical assessment) and quantitative indicators (calculation of the corresponding coefficients) can be used here.
The validity of a methodology
The validity of a technique is the correspondence between what the technique actually measures and what it is intended to measure.
For example, if a psychological technique based on informed self-report is used to study a personality quality that cannot be accurately assessed by the person himself, such a technique will not be valid.
In most cases, the answers a subject gives to questions about the presence or absence of some quality express how the subject perceives himself, or how he would like to appear in the eyes of other people.
Validity is also a basic requirement for psychological methods that study psychological constructs. There are many types of this criterion, there is as yet no consensus on how to name them, and it is not settled which specific types a technique must satisfy. If a technique proves externally or internally invalid, its use is not recommended. There are two approaches to validating a method.
The theoretical approach consists in showing that the methodology really measures the quality the researcher intends to measure. This is demonstrated by comparison with related indicators and with indicators to which no connection should exist. To confirm theoretical validity, one determines the degree of connection with a related technique (convergent validity) and the absence of such a connection with techniques that have a different theoretical basis (discriminant validity).
Assessment of the validity of a technique can be quantitative or qualitative. The pragmatic approach evaluates the effectiveness and practical significance of the technique; to implement it, an independent external criterion is used as an indicator of how the quality manifests in everyday life. Such a criterion can be, for example, academic performance (for achievement methods and intelligence tests), subjective assessments (for personality methods), or specific abilities such as drawing and modeling (for methods measuring special characteristics).
Four types of external criteria are distinguished for validation:
- performance criteria: for example, the number of tasks completed or the time spent on training;
- subjective criteria: obtained through questionnaires, interviews, or surveys;
- physiological criteria: heart rate, blood pressure, physical symptoms;
- criteria of chance: used when the goal is related to, or influenced by, a particular event or circumstances.
When choosing a research methodology, it is of theoretical and practical importance to determine the scope of the characteristics being studied, as an important component of validity. The information contained in the name of a technique is almost never sufficient to judge its scope of application: the name hides much more. A good example is the proofreading test. Its scope covers concentration, stability of attention, and psychomotor speed. The technique assesses the severity of these qualities, correlates well with values obtained by other methods, and has good validity. At the same time, the values obtained in a proofreading test are also influenced by other factors for which the technique is nonspecific; if the proofreading test is used to measure those factors, its validity will be low.
Thus, by delimiting the scope of application of a methodology, the validity criterion reflects the trustworthiness of the research results. The fewer the accompanying factors that influence the results, the higher the reliability of the estimates the methodology produces. The reliability of the results is also determined by the set of measured properties, their importance in diagnosing complex activities, and how fully the subject of measurement is represented in the test material. For example, to meet the requirements of validity and reliability, a methodology intended for professional selection must cover a wide range of indicators that matter most for success in the profession.
Psychological test and validity
A psychological test is a task formulated on the basis of certain standards, the result of which is to obtain data on the psychophysiological indicators of a person’s condition and the properties of his personality, skills, knowledge and abilities.
Validity determines the quality of the test, that is, the degree to which the properties of the psyche or behavior being studied correspond to the test by which they are determined. High-quality tests have a validity of about eighty percent. It is important to take into account the composition of the test material and its characteristics: these can make a test reliable or render it invalid.
Test validity is very important because it defines the test itself as a measuring instrument and makes it possible to consider it suitable for use in routine practice.
Threats
Validity in psychology is a property of a sound methodology, but factors may arise that distort even a theoretically correctly constructed psychodiagnostic technique. Side factors are more pronounced when working with poorly organized stimuli or with tasks that are new and unfamiliar to the subject.
The study of unbalanced and insecure individuals is particularly difficult. The main threats to high validity are the specific characteristics of the test taker and situational factors.
The reliability of the results is reduced by:
- test subject's errors;
- specialist errors;
- errors caused by the testing conditions or an incorrect diagnostic procedure.
Even if the diagnostic procedure does not strictly require a specialist to be present in the room, his presence may distort the results of the study. Comments on and interpretation of test tasks during testing also reduce the reliability of the data obtained.
A subject who is interested in deliberately distorting the test, or in presenting himself in a favorable light to management, will skew the diagnostic results. No less dangerous is the psychophysiological state of the person being tested: for example, the individual may be very hungry, tired, or suffering from a migraine.
Extraneous noise, voices, and the opportunity to discuss test tasks with other subjects reduce the accuracy of the results. These are errors of the diagnostic conditions and procedure.