Frequently asked questions summary

Q1: The op we compiled seems to be the CPU version, as shown in the screenshot below.

[Screenshot: the compiled op is the CPU variant]

But there is a corresponding GPU version in the code. How can we compile the GPU version?

Note: The GPU is busy during training, so we assume the TensorFlow op is using the GPU. Our compilation and installation followed all the steps in the installation instructions up to (but excluding) "Install horovod and mpi4py".

Answer: At around line 43 of setup.py, change "cpu" to "cuda" in dp_variant = os.environ.get("DP_VARIANT", "cpu").lower(); the GPU op will then be built.
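
For reference, a minimal sketch of that setting (assuming setup.py reads the variant from the DP_VARIANT environment variable, as the line above shows):

    import os

    # Around line 43 of DeePMD-kit's setup.py: the build variant is read from
    # the environment, with "cpu" as the stock default.
    dp_variant = os.environ.get("DP_VARIANT", "cuda").lower()  # default changed to "cuda"

    # Alternatively, leave setup.py untouched and export the variable before
    # reinstalling, e.g.: DP_VARIANT=cuda pip install .

Either approach selects the CUDA build of the custom op the next time the package is installed.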

 

Q2: In the large model task, does "1 billion tokens" refer to the size of the training set?

Answer: The training set is generated from the 100 GB of raw data we provide; 1 billion tokens is not the size of the training set.

 

Q3: Does the model to be built in the large model task only need the encoder stack, without adding a decoder stack?

Answer: Yes, the model does not need a decoder stack.

 

Q4: The horizontal-axis unit in our TensorBoard is GB, and we want to know what 1 billion tokens corresponds to. We found a relevant argument among Megatron's arguments; does setting it satisfy the requirements of the task, and do we need to set another argument at the same time? By our calculation, the train samples = 146484375 in the original script means about 300 GB of data will be consumed. If we want to train the model required by the task, can this parameter be changed accordingly?

Answer: Since one token is 4 bytes, about 4 GB is expected after preprocessing the data set. You can set train samples or train iterations in Yuan 1.0; if you modify the source code, you can set train tokens to 1B tokens directly. Please check whether your calculation of samples is correct. When setting train samples, the value can be greater than what 1B tokens requires, but in that case the training process must be interrupted once 1B tokens have been trained.
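
For reference, a minimal sketch of the sample and data-size arithmetic (assuming a sequence length of 2048 tokens per sample, as in the given hyperparameters, and 4 bytes per token as stated above):

    import math

    TOKENS_REQUIRED = 1_000_000_000   # the 1B tokens required by the task
    SEQ_LEN = 2048                    # tokens per training sample (S in Q10)
    BYTES_PER_TOKEN = 4

    # Smallest number of samples that covers at least 1B tokens.
    train_samples = math.ceil(TOKENS_REQUIRED / SEQ_LEN)        # 488,282
    # Expected size of 1B tokens after preprocessing.
    dataset_gb = TOKENS_REQUIRED * BYTES_PER_TOKEN / 1e9        # ~4 GB

    print(train_samples, dataset_gb)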

 

Q5: Should each epoch reach 1 billion tokens, or should all epochs add up to 1 billion tokens? Is data with a high repetition rate explicitly allowed?

Answer: If you sample from the data set we provide, there will be no repeated training; one epoch of the dataset is much larger than 1B tokens. We do not take the final accuracy of the model (model quality) as the scoring standard, as long as the loss value is below 7.0. Therefore, whether you adopt a dataset with a high repetition rate, and how many tokens each epoch contains, have no impact on our scoring standard.

Q6: Must the arguments in the training file be consistent with those in the notification?

Answer: Yes, they must be consistent with the argument requirements in the notification.

 

Q7: Can we modify the Megatron source code to achieve acceleration?

Answer: As long as a PyTorch-based framework is used and the restrictions in the notification are met, we place no other restrictions on how you modify the source code for acceleration.

 

Q8: After training on 10 million tokens, the loss is basically stable and below 7. Do we still need to train on 1 billion tokens according to the requirements in the notification? Are there any requirements for how the data set is sampled?

Answer: The training process of 1B tokens must be completed. How you sample the data set is not mandated.

 

Q9: In the third task of ASC22, Yuan Chinese language model training, due to hardware limitations we cannot find enough NVIDIA GPUs to complete the training, but we can find GPUs of other architectures (ROCm, etc.) that meet the computing requirements. Can we participate using GPUs with architectures other than NVIDIA?

Answer: Computing devices other than NVIDIA GPUs can be used for training, but the deep learning framework used must be PyTorch.

 

Q10: Models like BERT or GPT have only 2.3 billion parameters under the hyperparameters the committee offered. Are these models allowed in the competition?

Answer: Models like BERT are not allowed; the model structure must be the same as given in the notification. If you use a GPT model, please check whether the model structure is the same as the one we provide. The formula for calculating the number of parameters is as follows:

P = 12 * l * h^2 * (1 + 13 / (12 * h) + (V + S) / (12 * l * h))

where l = 40, h = 3072, V = 53228, and S = 2048.

You can refer to "Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM".
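
For reference, a minimal sketch of this calculation with the given hyperparameters (using the formula above, as given in the cited Megatron paper):

    # Parameter count for the required model configuration.
    l, h, V, S = 40, 3072, 53228, 2048

    P = 12 * l * h**2 * (1 + 13 / (12 * h) + (V + S) / (12 * l * h))
    print(f"{P / 1e9:.2f} B parameters")   # roughly 4.7 B, well above 2.3 B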

 

Q11: Are transformer models that have both an encoder and a decoder, like T5, allowed in the competition? Can we modify the hyperparameters to increase the number of parameters?

Answer: Models like T5 are not allowed; the model structure must be the same as given in the notification. Modifying the transformer block is not allowed.

 

Q12: Is it mandatory to implement all three parallel strategies noted in the Yuan paper? We do not have NVLink or enough GPUs to implement all three.

Answer: You do not need to implement all three parallel strategies given in Yuan. You can use any distributed training method suitable for your computing environment.
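
For example, a minimal data-parallel-only sketch with PyTorch DistributedDataParallel (the tiny stand-in model and the launch command are placeholders, not the required setup):

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # One process per GPU; launch e.g. with: torchrun --nproc_per_node=<num_gpus> train.py
        dist.init_process_group(backend="nccl")
        local_rank = dist.get_rank() % torch.cuda.device_count()
        torch.cuda.set_device(local_rank)

        model = torch.nn.Linear(2048, 2048).cuda()        # stand-in for the transformer
        model = DDP(model, device_ids=[local_rank])
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

        x = torch.randn(4, 2048, device="cuda")           # dummy batch
        loss = model(x).pow(2).mean()
        loss.backward()                                   # gradients are all-reduced across ranks
        optimizer.step()
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()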

 

Q13: If the only limit is that the training process uses no fewer than 1 billion tokens, a model trained on repeated data will converge quickly.

Answer: The complete pretraining dataset needs to be generated from the 100 GB of data we provide, and the 1B-token training process may not even complete one epoch.

 

Q14: Due to the limited computing power at our school, we only have four V100s, yet the number of tokens must reach 1 billion, and we are not sure whether the GPU memory is enough. Can we train the model on batches of data? If so, how do we make sure batch training is not a foul?

Answer: Training on 1 billion tokens has no particular effect on GPU memory; the complete training data does not need to be loaded into GPU memory at one time.

 

Q15: The committee does not provide baseline code, but it does provide the source code of Yuan 1.0. Does that mean the participating teams can freely choose whether or not to use Yuan 1.0 as the baseline? Or do teams have to write their own baselines based on open-source work?

Answer: Yuan 1.0 is not designated as the baseline code; you can freely choose whether to use the Yuan 1.0 code.

Q16: If we write the baseline code ourselves, the notification requires us to complete a large LM like Yuan 1.0. However, we found that Yuan 1.0 is very powerful and contains many models. Do we need to implement the same functions as Yuan 1.0, or only the required functions, and then train to convergence?

Answer: As long as the required model structure is realized and the training meets our basic requirements, no other functions are needed. Even if all the functions are implemented, it will not help the score.

Q17: If we directly use Yuan 1.0 as the baseline code, do we need to optimize and accelerate all the models inside it?

Answer: Participating teams only need to optimize and accelerate the model used in pretraining. Without any performance optimization, the score may be lower than that of other teams.

 

Q18: The DeePMD problem contains the description "Then train two models on the given systems, making their baselines optimized respectively". Is "two models" written incorrectly here? The data set covers three systems, and the final submission also requires three models.

Answer: There are indeed three models; contestants should submit the relevant results for all three models (copper, water, and the Mg-Al-Cu alloy).

 

Q19: The DeePMD problem contains the description "Dive into the implementation of DeePMD-kit code for training, and make improvement to speed up the training procedure on GPUs". Is the improvement here limited to code? Are changes to the hardware acceptable?

Answer: The "improvement" emphasizes the optimization of code or algorithm, and refers to the optimization conclusion of the contestants under the same hardware conditions and the same training script. Hardware configuration optimization is not acceptable to us.

 

Q20: The competition questions state that, except for the three parameters "model/descriptor/precision", "model/fitting_net/precision" and "model/descriptor/type_one_side", no other parameters can be modified. But an error occurred:

[Screenshot: error message reporting that 'scale_by_worker' is not allowed in strict mode]

In strict mode, 'scale_by_worker' is not allowed, so we had to delete it; training then runs normally. This happens in the 'water' folder.


What should I do?

Answer: Please make sure you installed the correct branch (asc-2022) of DeePMD-kit. We added "scale_by_worker" in a recent update, so you might have installed a previous version (branch) of DeePMD-kit. (Run 'git checkout asc-2022' in the deepmd-kit directory and pip install again.)

 

Q21: When the final results evaluation is performed, how is the accuracy judged? Is it by examining the lcurve.out file, or by referring to the output of the dp test command?

Answer: The accuracy check can be illustrated simply by visualizing the lcurve.out curves (preferably on a log-log scale) to show that accuracy is kept consistent; of course, it can also be done with dp freeze and dp test on the entire dataset, which is more accurate. There is no difference in score; it just needs to be properly described in the report.
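
For reference, a minimal plotting sketch (assuming lcurve.out is whitespace-separated with a '#' header line and that the first column is the training step; check the header of your own file for the exact column layout):

    import numpy as np
    import matplotlib.pyplot as plt

    # Lines starting with '#' are skipped by loadtxt automatically.
    data = np.loadtxt("lcurve.out")
    step = data[:, 0]

    plt.loglog(step, data[:, 1], label="column 1 (e.g. validation RMSE)")
    plt.loglog(step, data[:, 2], label="column 2 (e.g. training RMSE)")
    plt.xlabel("training step")
    plt.ylabel("loss / RMSE")
    plt.legend()
    plt.savefig("lcurve_loglog.png")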

 

Q22: Is it possible to modify the setting strategy of batch size for model training under the condition that the accuracy remains unchanged? The default value of this parameter in the configuration file is "Auto". The batch size is automatically selected by the program based on certain policies. Can we modify this policy?

Answer: If batch_size is 'auto', a fixed batch_size is selected automatically according to the system size; for example, because the number of atoms in the water system is large, 'auto' selects batch_size = 1. We do not recommend modifying this. If you do modify it, note that the evaluation baseline also changes (after all, within the specified number of training steps, a larger batch_size means more data is traversed), so the baseline you choose must use the same batch_size. (In short, the input.json of the baseline and of the final optimized run should be exactly the same.)
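
For reference, a minimal sketch of how the 'auto' rule can be understood (the 32-atoms-per-batch threshold is an assumption based on the DeePMD-kit documentation, not something stated in this FAQ):

    import math

    def auto_batch_size(natoms: int, min_atoms_per_batch: int = 32) -> int:
        # Smallest batch size such that batch_size * natoms >= min_atoms_per_batch.
        return max(1, math.ceil(min_atoms_per_batch / natoms))

    print(auto_batch_size(192))   # a large system (many atoms per frame) -> 1
    print(auto_batch_size(8))     # a small system -> 4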

 

Q23: The first question, about designing an HPC system, requires the design to be based on the Inspur NF5280M6. Our team does not have an NF5280M6, so can we design the cluster and do the power evaluation theoretically?

Answer: The cluster design should be based on the Inspur NF5280M6, but you are not required to actually build it. Your design should satisfy the requirements, be reasonable, have a correct theoretical analysis, and highlight its design strengths.

 

Q24: For HPC system design, can we change the components listed in the table?

Answer: You are not allowed to change the server or the CPU. You may change the memory, hard disks, and accelerator cards. The total power should be limited to 3000 W, and the HPC cluster design should be reasonable.

 

 

Contact Us
Technical Support: Yu Liu, techsupport@asc-events.org
Media: Jie He, media@asc-events.org
Collaboration: Vangel Bojaxhi, executive.director@asc-events.org
General Information: info@asc-events.org

 
