Automation of Testing Machine Learning
October 18th, 2023 (GMT+1)
School of Engineering, Computing and Mathematics, Oxford Brookes University, Oxford, UK
Dr. Hong Zhu is a professor of computer science at Oxford Brookes University, UK, where he chairs the Cloud Computing and Cybersecurity Research Group. He obtained his BSc, MSc and PhD degrees in Computer Science from Nanjing University, China, in 1982, 1984 and 1987, respectively. He was a faculty member of Nanjing University from 1987 to 1998. He joined Oxford Brookes University in November 1998 as a senior lecturer in computing and became a professor in October 2004. His research interests are in the area of software development methodologies, including software engineering for machine learning applications, cloud computing, big data and data analytics, software engineering of intelligent systems, formal methods, software design, programming languages and automated tools, and software modelling and testing. He has published 2 books and more than 200 research papers in journals and international conferences. He is a senior member of IEEE and a member of the British Computer Society and ACM.
The workshop will be held on October 18th, 2023 (GMT+1) at Oxford Brookes University. The detailed schedule is as follows:
| Time | Session |
| --- | --- |
| 1:00 - 1:05 pm | Opening |
| 1:05 - 1:30 pm | Keynote Speech: An Introduction to Datamorphic Test Methodology. Speaker: Prof. Hong Zhu, Oxford Brookes University |
| 1:30 - 2:00 pm | Keynote Speech: Datamorphic Testing of ML Regression Models for Feature Selection. Speaker: Ms. Reebu Joy, Oxford Brookes University |
| 2:00 - 2:30 pm | Keynote Speech: Toward Predicting the Impact of Feature Selection via Testing. Speaker: Mr. Movin Fernandes, Oxford Brookes University |
| 2:30 - 3:00 pm | Keynote Speech: Exploratory Testing of the Robustness of Image Recognition. Speaker: Mr. Aiden Gourley, Oxford Brookes University |
| 3:00 - 3:30 pm | Break |
| 3:30 - 4:00 pm | Keynote Speech: Testing Natural Language Processing Applications: A Survey. Speaker: Ms. Debalina Ghosh Paul, Oxford Brookes University |
| 4:00 - 4:30 pm | Keynote Speech: Automating Meta-Analysis: Advancements and Perspectives. Speakers: Mr. Aamer Bassmaji, Dr. Eleni Elia and Dr. Sarah Howcutt, Oxford Brookes University |
| 4:30 - 5:00 pm | Keynote Speech: Testing ChatGPT's Capability of Generating R Program Code. Speaker: Ms. Tanha Miah, Oxford Brookes University |
| 5:00 - 5:30 pm | Keynote Speech: Data Complexity and Quality Issues in Machine Learning. Speaker: Dr. Daniel Rodriguez, Oxford Brookes University |
| 5:30 - 6:00 pm | Keynote Speech: Simulation Model for Testing. Speaker: Dr. Alexander Rast, Oxford Brookes University |
| 6:00 - 6:05 pm | Closing |
Speakers, Bionotes and Abstracts:
Prof. Hong Zhu
Dr. Hong Zhu is a professor of computer science at Oxford Brookes University, Oxford, UK, where he chairs the Cloud Computing and Cybersecurity Research Group. He obtained his BSc, MSc and PhD degrees in Computer Science from Nanjing University, China, in 1982, 1984 and 1987, respectively. He was a faculty member of Nanjing University from 1987 to 1998 and joined Oxford Brookes University in November 1998. His research interests are in the area of software development methodologies, including programming languages and integrated DevOps environments for various types of modern software and computer applications, such as machine learning and intelligent systems and cloud-native applications. He has published 2 books and more than 200 research papers in journals and international conferences on software testing, formal methods, software design, software specification, modelling and programming languages, and automated tools. He is a senior member of IEEE and a member of the British Computer Society and ACM.
Title of Speech: An Introduction to Datamorphic Testing
Datamorphic testing is a methodology that regards software testing as a problem of systems engineering. It aims to improve the effectiveness and efficiency of testing complicated large-scale systems through test automation and the employment of AI technology. In particular, it considers testing as a process in which a test system is developed, maintained and operated to achieve software testing purposes. It defines a software test system as consisting of a set of test entities and test morphisms, where the former are the objects, data, documents, etc. created, used and managed during the testing process, while the latter are the operators and transformers on the test entities. Typical examples of test morphisms include seed makers, which produce test cases; datamorphisms, which transform test data; metamorphisms, which check the correctness of test results; and test metrics, which, for example, measure test adequacy. Research has demonstrated that when such a test system is implemented and maintained effectively, test automation can be achieved at three different levels of abstraction: activity, strategy and process. An automated testing tool called Morphy has been developed to support datamorphic testing. It has been applied to a number of testing problems for machine learning (ML) applications, including (a) confirmatory testing of ML models, (b) exploratory testing of ML classifiers, and (c) scenario-based functional testing for improving ML model performance.
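To make the terminology concrete, the following minimal Python sketch illustrates a seed maker, a datamorphism, a metamorphism and a simple test metric for a toy classifier. It is illustrative only and does not reflect the actual API of the Morphy tool.

```python
import random

# A toy "model" under test: classifies a point as inside/outside the unit circle.
# Purely illustrative; in practice this would be a trained ML classifier.
def model(x: float, y: float) -> bool:
    return x * x + y * y <= 1.0

# Seed maker: a test morphism that produces initial (seed) test cases.
def seed_maker(n: int):
    return [(random.uniform(-2, 2), random.uniform(-2, 2)) for _ in range(n)]

# Datamorphism: transforms existing test data into new test data,
# here a small perturbation of the input point.
def jitter(case, eps=1e-3):
    x, y = case
    return (x + random.uniform(-eps, eps), y + random.uniform(-eps, eps))

# Metamorphism: checks a relation between the results on original and
# transformed test cases; violations tend to occur near the decision boundary.
def robust_metamorphism(case, morphed_case) -> bool:
    return model(*case) == model(*morphed_case)

if __name__ == "__main__":
    seeds = seed_maker(100)
    failures = [c for c in seeds if not robust_metamorphism(c, jitter(c))]
    # Test metric: proportion of seed cases whose metamorphism is violated.
    print(f"Violations: {len(failures)} / {len(seeds)}")
```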
Ms. Reebu Joy
Ms. Reebu Meera Joy is an MSc postgraduate student at Oxford Brookes University reading Computer Science. She obtained her Bachelor's degree in Computer Science and Technology from CUSAT University in India in 2011. She worked in the IT industry for 9 years as a Test Engineer with Amazon Development Centre, India. She has a strong passion for problem-solving, which translates into her enthusiasm for programming and machine learning.
Title of Speech: Datamorphic Testing of ML Regression Models for Feature Selection
In the development of machine learning applications, there may be a large number of features available in the data. Selecting the right set of features is crucial for the performance of the machine learning model as well as the cost of training it, and it is one of the most important aspects of feature engineering. In this presentation, we will propose a new approach to evaluating the importance of features in machine learning based on the principles of the datamorphic testing methodology. In particular, it employs datamorphisms as perturbations of the input data and observes the ML model's responses to the perturbations in terms of the changes in the outputs. Two metrics will be defined to measure the impact of the perturbations on a dataset. We will report experiments on the proposed metrics with ML models built from real datasets. The results show that the metrics provide a good indication of feature importance.
Mr. Movin Fernandes
Mr. Movin Fernandes is a master's degree student in Data Analytics at Oxford Brookes University, UK. He obtained his bachelor's degree in Electronics and Communications with honours from St. Francis Institute of Technology, India, in 2017. He is currently working on a research project on feature selection techniques using machine learning. Mr. Fernandes worked at Capgemini Technology Services Ltd as a consultant for 5 years. He has expertise in data migration tools and processes and has developed a fully fledged Excel-based tool for data reconciliation. Mr. Fernandes is interested in pursuing a career in research and development, applying cutting-edge machine learning techniques to solve real-world problems.
Title of Speech: Toward Predicting the Impact of Feature Selection via Testing
Feature selection is a critical step in machine learning model development. It aims to identify the most informative subset of features that contribute to model performance. Traditional feature selection methods often rely on statistical measures or heuristic algorithms, which may overlook valuable information among features. In this research, we propose a novel feature selection technique that leverages an evolutionary process to identify an optimal feature set for both regression and classification tasks. Our approach systematically applies exploratory testing to an existing ML model and evaluates its performance on various data subsets divided according to the features. Using the test results as input, we then employ a predictive ML model to predict the impact on performance of adding a feature to the model. This predictive model has been trained on data that we obtained via experiments with ML models of real datasets and on various features. When it is used to rank features of unseen datasets, the selected feature subsets outperform traditional feature selection methods. In the presentation, we will report our comparative experiments to demonstrate the advantages of our approach in terms of computational efficiency and predictive accuracy. Our findings reveal that the new feature selection technique consistently identifies feature subsets that lead to improved ML model performance.
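A highly simplified sketch of this two-step idea follows; the synthetic data and ad hoc meta-features stand in for the actual test results and features used in the study.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Step 1: for each candidate feature, record simple statistics ("test results")
# and the performance change observed when the feature is added to a base model.
def performance_gain(X, y, base_features, candidate):
    base = cross_val_score(LinearRegression(), X[:, base_features], y, cv=3).mean()
    extended = cross_val_score(LinearRegression(), X[:, base_features + [candidate]], y, cv=3).mean()
    return extended - base

meta_X, meta_y = [], []
for seed in range(5):                          # several synthetic "datasets"
    X, y = make_regression(n_samples=300, n_features=8, n_informative=4,
                           noise=0.5, random_state=seed)
    for f in range(1, X.shape[1]):             # feature 0 is the fixed base feature
        meta_X.append([abs(np.corrcoef(X[:, f], y)[0, 1]), X[:, f].var()])
        meta_y.append(performance_gain(X, y, [0], f))

# Step 2: train a predictive meta-model that estimates the impact of adding a feature.
meta_model = RandomForestRegressor(random_state=0).fit(meta_X, meta_y)
print("Predicted gains for one task:", meta_model.predict(meta_X[:7]).round(3))
```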
Mr. Aiden Gourley
Mr. Aiden Gourley BSc is a Software Engineer in the Aerospace and Defence industry at Leonardo UK Ltd. In 2022, he graduated from Oxford Brookes University with a first-class honours degree in Computer Science. He is a member of the British Computer Society (MBCS). With a keen research interest in datamorphic testing methodology, his dissertation on the application of datamorphic testing to the robustness of machine learning image classifiers won the School of Engineering, Computing and Mathematics' Innovation and Business Award. He also won the Short Term Undergraduate Fellowship after graduation. This work has been continued as a part-time personal research project.
Title of Speech: Exploratory testing of the robustness of image recognition
Machine learning techniques have excelled in the domain of image classification, with many model architectures exceeding 90% accuracy across complex datasets with thousands of classes. While the performance is impressive, the models can be vulnerable to small distortions that cause severe misclassifications. Exposing this vulnerability in current models has been limited by our ability to generate adversarial examples efficiently. In this presentation, we investigate how it has been possible to generate thousands of adversarial examples, some by modifying only a single pixel. We look at how the datamorphic testing method is applied to such robustness testing and reveal its efficacy and high efficiency on multiple models and datasets.
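The sketch below illustrates the kind of single-pixel robustness probe discussed here, using a small scikit-learn classifier on the 8x8 digits dataset as a stand-in for the deep models and large datasets covered in the talk.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

# Train a small image classifier on the 8x8 digits dataset (a toy stand-in).
digits = load_digits()
model = LogisticRegression(max_iter=2000).fit(digits.data, digits.target)

def single_pixel_attack(model, image, value=16.0, trials=64, seed=0):
    """Datamorphism: set one randomly chosen pixel to an extreme value and check
    whether the predicted class changes (a robustness violation)."""
    original = model.predict(image.reshape(1, -1))[0]
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        pixel = int(rng.integers(image.size))
        morphed = image.copy()
        morphed[pixel] = value
        if model.predict(morphed.reshape(1, -1))[0] != original:
            return pixel                       # adversarial example found
    return None

hits = sum(single_pixel_attack(model, img) is not None for img in digits.data[:100])
print(f"Images (out of 100) misclassified after a single-pixel change: {hits}")
```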
Ms. Debalina Ghosh Paul
Ms. Debalina Ghosh Paul is currently pursuing an MPhil/PhD at Oxford Brookes University, focusing on testing the capabilities of Large Language Models for code generation under the supervision of Prof. Hong Zhu and Dr. Ian Bayley. She obtained her BTech and MTech in Computer Science Engineering from West Bengal University of Technology, India, and Calcutta University, India, in 2008 and 2010 respectively. She was a faculty member of Greater Kolkata College of Engineering and Management, India, from August 2010 to January 2013, and of the Institute of Engineering and Management, India, from February 2013 to June 2022. She has authored 4 research papers on image processing and steganography.
Title of Speech: Testing Natural Language Processing Applications: A Survey
This presentation provides a brief introduction to the testing of Natural Language Processing (NLP) applications. It begins by highlighting the significance of testing in NLP and the associated challenges. It will then cover four key aspects of testing NLP machine learning models: quality attributes, test adequacy, test case generation and test oracles. We will first review the quality attributes that have been studied in the research on testing NLP models, which include robustness, fairness and consistency. We will also present the adequacy criteria proposed for testing NLP models, which include structural coverage criteria and fault-based test adequacy criteria. We will then elucidate how test cases can be generated and discuss the advantages of the datamorphic approach to test case generation and its suitability for testing NLP. Finally, we address the test oracle problem and discuss different metamorphic relations and their application to testing NLP.
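As a concrete, and deliberately toy, illustration of a metamorphic relation for NLP, the sketch below checks that a small sentiment classifier's prediction stays consistent under synonym substitution. The classifier and the synonym table are illustrative assumptions, not tools from the surveyed work.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A toy sentiment classifier (illustrative stand-in for a real NLP model).
train_texts = ["great film", "awful film", "great acting", "awful acting",
               "I loved it", "I hated it"]
train_labels = [1, 0, 1, 0, 1, 0]
model = make_pipeline(CountVectorizer(), LogisticRegression()).fit(train_texts, train_labels)

# Datamorphism: replace words with synonyms (hand-written table for the sketch).
SYNONYMS = {"film": "movie", "great": "excellent", "awful": "terrible"}
def synonym_substitution(text: str) -> str:
    return " ".join(SYNONYMS.get(w, w) for w in text.split())

# Metamorphism: a consistency relation; the predicted sentiment should not
# change under meaning-preserving substitutions.
def consistent(text: str) -> bool:
    return model.predict([text])[0] == model.predict([synonym_substitution(text)])[0]

for t in ["great film", "awful acting", "I loved the film"]:
    print(t, "->", "consistent" if consistent(t) else "VIOLATION")
```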
Mr. Aamer Bassmaji
Mr. Aamer Bassmaji is an MSc postgraduate student in Data Analytics at Oxford Brookes University. He obtained a Bachelor's degree in Informatics Engineering with a specialization in Artificial Intelligence from Aleppo University, Syria, in 2009. He has 15 years of experience in software engineering and data analytics, has contributed to international data ventures, and has collaborated with esteemed organizations such as UNICEF and UNOCHA. His recent research interests centre on meta-analysis, especially on leveraging machine learning and natural language processing to automate techniques in the domain.
Title of Speech: Automating Meta-Analysis: Advancements and Perspectives
The ever-increasing volume of research publications has rendered manual processes for identifying relevant studies and extracting crucial data both laborious and time-consuming. This challenge has become even more pressing in light of the recent pandemic, highlighting the need for swift access to up-to-date research to inform evidence-based decision-making. Efficiently synthesizing research findings is essential for many professionals, from clinicians to policymakers. At the core of this process lies meta-analysis, which combines relevant studies to draw comprehensive conclusions. In this discussion, we will explore cutting-edge developments in meta-analysis, focusing specifically on the potential of generative AI and Natural Language Processing (NLP) to automate the process, particularly in regard to extracting vital study details such as hazard ratios and confidence intervals.
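As a simplified illustration of the data-extraction step, the following rule-based sketch pulls hazard ratios and confidence intervals out of abstract text with a regular expression. A real pipeline would rely on NLP or generative models rather than hand-written patterns, and the example sentence below is invented.

```python
import re

# A simplified rule-based extractor for hazard ratios (HR) and 95% confidence
# intervals; pattern and example text are illustrative only.
HR_PATTERN = re.compile(
    r"\b(?:HR|hazard ratio)\b[\s=:]*([\d.]+)"
    r"(?:[,;\s]*\(?\s*95%\s*CI[,:]?\s*([\d.]+)\s*(?:-|to|–)\s*([\d.]+)\)?)?",
    re.IGNORECASE,
)

def extract_hazard_ratios(text: str):
    results = []
    for m in HR_PATTERN.finditer(text):
        hr, low, high = m.groups()
        results.append({"hr": float(hr),
                        "ci": (float(low), float(high)) if low and high else None})
    return results

abstract = ("Treatment was associated with improved survival "
            "(HR 0.72, 95% CI 0.61-0.85); no benefit was seen in the subgroup "
            "(hazard ratio 1.04, 95% CI 0.90 to 1.21).")
print(extract_hazard_ratios(abstract))
```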
Dr. Eleni Elia
Dr. Eleni Elia is a Senior Lecturer in Statistics at the School of Engineering, Computing and Mathematics at Oxford Brookes. Eleni holds a BSc in Mathematics, an MSc in Statistics and a PhD in Medical Statistics. She held research positions at the University of Leicester, Harvard University and Boston Children's Hospital prior to her academic appointment at Oxford Brookes. With a passion for data analysis and a deep understanding of statistical methods, she is making significant contributions to the academic community. Eleni's research interests lie primarily in the field of medical statistics, and her work has been published in respected statistics and medical journals. As a dedicated educator at Oxford Brookes, Eleni is committed to passing on her knowledge and enthusiasm for statistics and mathematics to the next generation of statisticians.
Dr. Sarah Howcutt
Dr. Sarah Howcutt is the Programme Lead for Health and Professional Development at Oxford Brookes University. Her research is on how to use digital technologies to collect better epidemiologic data about marginalized populations, particularly young women and ethnic minority communities in the UK. This work includes using text-mining with machine learning to screen papers for inclusion in systematic reviews.
Ms. Tanha Miah
Ms. Tanha Miah is currently pursuing a Master's degree in Data Analytics at Oxford Brookes University, UK. She obtained a Bachelor's degree in Mechanical Engineering, also from Oxford Brookes University, in 2021. She worked in the engineering industry with CVG before starting the Master's degree course in 2022.
Title of Speech: Testing ChatGPT's Capability of Generating R Program Code
The advent of machine learning models, particularly language models like ChatGPT, has ushered in a new era of natural language understanding and generation. In this talk, we present recent research on examining ChatGPT's capability of generating code in the R programming language. We will report a benchmark of test cases extracted from various textbooks on programming in R. A number of tests of ChatGPT were conducted on subsets of test cases drawn from the benchmark at random. Each test involved interacting with ChatGPT, running the provided R code, comparing the generated output with expected results, and evaluating the responses based on predefined criteria of correctness, accuracy, conciseness, completeness and structuredness. The findings shed light on ChatGPT's coding capabilities, its limitations, and its potential as a tool for assisting developers in code generation tasks.
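A minimal sketch of the run-and-compare part of such a test harness is shown below. It assumes Rscript is installed, takes the generated code as a given string rather than calling ChatGPT, and only checks the correctness criterion.

```python
import subprocess

def run_r_code(code: str, timeout: int = 30) -> str:
    """Run a snippet of R code with Rscript (assumed installed) and return stdout."""
    result = subprocess.run(["Rscript", "-e", code],
                            capture_output=True, text=True, timeout=timeout)
    return result.stdout.strip()

def check_test_case(generated_code: str, expected_output: str) -> bool:
    """Correctness only; the other criteria (accuracy, conciseness, completeness,
    structuredness) need human or rubric-based judgement."""
    try:
        return run_r_code(generated_code) == expected_output.strip()
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False

# Example benchmark entry: task "sum of the first 10 natural numbers".
generated = "cat(sum(1:10))"     # pretend this came back from ChatGPT
print(check_test_case(generated, "55"))
```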
Dr. Daniel Rodriguez
Dr. Daniel Rodriguez is currently an associate professor in the Computer Science Department of the University of Alcala, Madrid, Spain, and a regular visiting researcher at Oxford Brookes University, UK. Previously he was a lecturer at the University of Reading, UK (2001-2006), as well as a seasonal lecturer until 2010 on the MSc in Network Centred Computing. Daniel earned his degree in Computer Science at the University of the Basque Country (EHU) and his PhD degree at the University of Reading, UK. His research interests include data mining and software engineering in general, and the application of data mining (machine learning) techniques to software engineering problems in particular.
Title of Speech: Data Complexity and Quality Issues in Machine Learning
The research area of Software Defect Prediction (SDP) treats defect prediction as a classification problem. Improvements in classification techniques, pre-processing and tuning techniques, combined with a large variety of factors influencing model performance, have been extensively researched. However, in many domains, including SDP, no matter the effort spent in these areas, there seems to be a ceiling on the performance of classification models. Here we analyse this problem from the perspective of data complexity. Specifically, data complexity metrics are calculated using the Unified Bug Dataset, a collection of well-known SDP datasets, and then checked for correlation with the defect prediction performance of machine learning classifiers (in particular, C5.0, Naive Bayes and Artificial Neural Networks). In this work, different domains of competence and incompetence are identified for the classifiers. Similarities and differences between the classifiers and the performance metrics are found, and the Unified Bug Dataset is analysed from the perspective of data complexity.
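To illustrate the kind of analysis involved, the sketch below computes a classical data complexity measure (the maximum Fisher's discriminant ratio) on synthetic datasets of varying difficulty and correlates it with a classifier's cross-validated accuracy. The Unified Bug Dataset itself is not used here; the datasets and classifier are stand-ins.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def fisher_ratio(X, y):
    """A simple data complexity measure: maximum Fisher's discriminant ratio
    over features (higher means easier class separation)."""
    X0, X1 = X[y == 0], X[y == 1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    den = X0.var(axis=0) + X1.var(axis=0) + 1e-12
    return (num / den).max()

# Generate datasets of varying difficulty and record complexity vs. accuracy.
complexities, accuracies = [], []
for sep in np.linspace(0.3, 3.0, 10):
    X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                               class_sep=sep, random_state=0)
    complexities.append(fisher_ratio(X, y))
    accuracies.append(cross_val_score(GaussianNB(), X, y, cv=5).mean())

rho, p = spearmanr(complexities, accuracies)
print(f"Spearman correlation between complexity and accuracy: {rho:.2f} (p={p:.3f})")
```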
Dr. Alexander Rast
Dr. Alexander Rast is a Senior Lecturer in Computer Science at Oxford Brookes University. He obtained his PhD from the University of Manchester for work on methods of generating standard implementation models for spiking networks targeted at neuromorphic hardware. He subsequently investigated multiple application directions for the SpiNNaker neuromorphic chip, including the mapping of traditional multilayer perceptron (MLP) networks, cognitive robotics and hardware interfacing. He then participated in the development of the POETS generic event-driven computing platform at the University of Southampton, before joining Oxford Brookes in 2020, where he has been active in autonomous driving, machine vision and scene understanding, and spiking neural network research. Previously, he also worked at Inficom, Inc., a wireless start-up company, where he developed advanced neural-network-based signal and control processing for spectrally efficient high-data-rate long-range wireless communications. His current research interests revolve around efficient spiking and neuromorphic neural networks, few-shot and single-shot learning, and autonomous embedded AI, with a particular focus on autonomous driving.
Title of Speech: Simulation Model for Testing
Obtaining the (usually large) amounts of data required to train and validate a model is rapidly becoming acknowledged as one of the central challenges in modern machine learning. If getting enough labelled training data is already hard, obtaining enough data to test the system properly is arguably even harder, especially when test data is usually obtained by fractional splits of the available dataset. Meanwhile, modern AI systems based on deep learning have proven vulnerable to 'out-of-distribution' events: inputs during real operation that do not conform to the statistics of the data used for training and testing. Establishing robustness to out-of-distribution events is, again, a serious challenge for modern machine learning. A potential solution lies in the use of simulation to auto-generate data: such sources yield datasets that can be automatically labelled with known ground truth derived from the simulation itself, and can be set up to generate very large datasets with any desired distribution. For classes of problem amenable to simulation-based methods for generating synthetic data, then, this appears to be a very attractive alternative for automated testing. Interestingly, although significant effort has gone into generating hyper-realistic simulation outputs (i.e. outputs that very closely match the statistics of real data), less effort seems to have gone into investigating the downstream impact, i.e. how effective the simulated data is in generating output responses that match the 'optimal' output, and it remains an open question whether realism is necessary or indeed even useful. We introduce here some potential directions for simulation-based data generation based on output response rather than input realism. Using a case study in an important application, autonomous driving, we show how robust systems can be built using less-than-realistic simulation, and also show how simulation can be used as an aid to model design and selection. Results are further demonstrated in a live real-world event under unpredictable conditions and shown to meet performance targets. The method suggests a similar way forward in a wide variety of problems: using simple, easily generated simulations to create large amounts of task-relevant data without excessive computational cost or manual annotation.
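The sketch below illustrates the basic idea with a deliberately simple simulator: it auto-generates labelled data with known ground truth, trains a model on it, and then evaluates under a distribution shift. The "scene" features and the shift are invented for illustration and are far simpler than the autonomous-driving case study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def simulate_scene(n, rng, night=False):
    """Toy simulator: emits feature vectors (size, speed, brightness) for
    'pedestrian' vs 'vehicle' objects; ground-truth labels are known exactly
    because they are produced by the simulation itself."""
    labels = rng.integers(0, 2, n)                        # 0 = pedestrian, 1 = vehicle
    size = np.where(labels == 1, rng.normal(4.0, 0.8, n), rng.normal(1.7, 0.2, n))
    speed = np.where(labels == 1, rng.normal(13.0, 4.0, n), rng.normal(1.4, 0.5, n))
    brightness = rng.normal(0.2 if night else 0.7, 0.1, n)  # distribution-shift knob
    return np.column_stack([size, speed, brightness]), labels

rng = np.random.default_rng(0)
X_train, y_train = simulate_scene(5000, rng)               # auto-labelled training data
X_test, y_test = simulate_scene(1000, rng, night=True)     # out-of-distribution test set

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("Accuracy under distribution shift:", (model.predict(X_test) == y_test).mean())
```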
Background:
With the rapid growth of ML applications in a wide range of subject domains, including safety-critical areas such as autonomous vehicles, medication and healthcare, it is indispensable to test ML models to ensure their reliability as well as other quality attributes, such as robustness and fairness. In particular, ChatGPT offers a wide range of potential applications due to its capability of generating content as instructed by human users in natural language. However, testing ML applications is difficult and expensive. It is highly desirable to automate the testing of ML applications to improve its efficiency as well as its effectiveness. On the other hand, machine learning techniques offer new solutions to various testing problems, such as test case generation and test result checking. However, existing software test automation techniques cannot simply be applied to testing ML applications, and testers currently face many challenges in practice. For example, ML models, especially large language models, are trained on huge datasets and must be tested on test datasets of huge volumes, too. This requires big data technology to conduct testing activities and to analyse test results. Moreover, errors detected by testing cannot be used in the traditional way of debugging to improve an ML model's reliability, because ML models like neural networks are not explainable and interpretable, and errors cannot be corrected by editing the weights associated with the links between neurons.
Goal/Rationale:
The workshop aims to provide a forum for researchers to report their recent progress in automating the testing of ML applications. At the same time, practitioners will report the problems and challenges they face in practice. In addition to presentations, it will also include in-depth discussions about research directions for future development in this subject area and potential solutions to real problems.
Scope and Information for Participants:
In the workshop, the organizer will invite speakers to report their recent research results and their work in progress on a number of specific topics, and will organize discussions. The selected topics include, but are not limited to:
The First Oxford Brookes Workshop on Automation of Testing Machine Learning took place successfully on 18th October 2023 at the Wheatley campus of Oxford Brookes University in Oxford, United Kingdom. It was a hybrid event with 13 people attending in person and 4 joining online remotely. At the workshop, nine researchers reported their recent work, which inspired interesting discussions after each presentation. The five hours of talks and discussions covered a wide range of topics on testing machine learning and its automation, including the methodology of testing machine learning applications and automated tools, feature selection via testing, testing large language models, and more.
Access to workshop:
Part 1: Workshop on Automation of Testing Machine Learning (panopto.eu)
Part 2: Workshop on Automation of Testing Machine Learning (panopto.eu)
Room B213, Wheatley Campus, Oxford Brookes University, Oxford OX33 1HX, United Kingdom
While we try to ensure this information is correct and up to date, there may be changes that we are not aware of, and different countries have different rules for visa applications. It is always a good idea to check the latest regulations in your country. This page only gives some general information about applying for a visa.
What you need to do
You must have a passport or travel document to enter the UK. It should be valid for the whole of your stay.
You must be able to show that:
Depending on your nationality, you'll either:
You can check if you need a visa before you apply.
If you do not need a visa, you must still meet the Standard Visitor eligibility requirements to visit the UK. You may be asked questions at the UK border about your eligibility and the activities you plan to do.